Question
BU.330.760.52.SP25 Final - Requires Respondus LockDown Browser
Single-choice question
Which of the following statements is correct about query, key, and value in transformer models?
Options
A. The query describes the task that each word or token can help with in the sentence.
B. The key describes the current task that each word or token needs to perform in the sentence.
C. The value indicates how important each word or token is in the sentence.
D. The higher the resonance between the query and key, the greater the contribution of the value to the current task.
Standard answer
D
Analysis
To assess how query, key, and value function in transformer attention, let's evaluate what each component represents in this context.
Option A: 'The query describes the task that each word or token can help with in the sentence.' This misframes the query as a descriptor of the task: in attention, the query is compared against keys to compute compatibility scores, so it expresses what the current token is looking for, not what it can offer. Option B swaps roles in the same way: the key advertises what a token can contribute, not the task it needs to perform. Option C confuses the value with the attention weight; the value carries a token's content, while importance is determined by the softmax over query-key scores. Option D matches the mechanism: the stronger the query-key match, the larger the softmax weight, and the more that token's value contributes to the output.
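For intuition, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is not from the course material; the toy dimensions and random inputs are assumptions chosen only for illustration. It shows that the query-key score determines, via a row-wise softmax, how much each value contributes to the output, which is the relationship option D describes.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: weights = softmax(QK^T / sqrt(d_k)), output = weights @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key compatibility ("resonance")
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability before exponentiation
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: each row sums to 1
    return weights @ V, weights                      # each output row = weighted sum of value rows

# Toy example (assumed sizes): 3 tokens, d_k = 4, d_v = 2
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 2))
out, w = scaled_dot_product_attention(Q, K, V)
print(w)    # larger query-key scores -> larger weight on that token's value
print(out)  # shape (3, 2): one output row per query, d_v columns like V
```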
Similar questions
On scaled dot-product attention and training stability of a transformer:
I. Without scaling by $\sqrt{D_k}$, the variance of the dot product $q_n^\top k_m$ grows with dimensionality, producing large logits that can saturate the softmax.
II. Scaling by $\sqrt{D_k}$ primarily solves exploding-gradient problems inside the value projection $V$.
III. The softmax-normalized matrix $\mathrm{Softmax}(QK^\top)$ is applied row-wise, making each row represent how strongly a query attends to all keys.
IV. Scaled dot-product attention computes $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{D_k}}\right)V$, and the resulting matrix always has the same dimension as $V$.
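A quick way to sanity-check statement I (and the role of the $\sqrt{D_k}$ factor in IV) is to measure the spread of raw dot products at different key dimensions. The sketch below is an illustration under its own assumptions (i.i.d. zero-mean, unit-variance random vectors), not part of the original question: the unscaled dot product's standard deviation grows like $\sqrt{D_k}$, while dividing by $\sqrt{D_k}$ keeps it roughly constant.

```python
import numpy as np

# Assumption: q and k have i.i.d. zero-mean, unit-variance entries, so Var(q.k) = D_k
# and the unscaled logits' standard deviation grows like sqrt(D_k).
rng = np.random.default_rng(42)
for d_k in (16, 64, 256, 1024):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = np.einsum("nd,nd->n", q, k)              # 10000 independent dot products
    scaled = dots / np.sqrt(d_k)                    # the scaling used in attention logits
    print(f"D_k={d_k:5d}  std(q.k)={dots.std():7.2f}  std(q.k/sqrt(D_k))={scaled.std():5.2f}")
```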
Which innovation is at the core of the transformer architecture and enables modeling long-range dependencies effectively?
As defined in Attention is All You Need, what is the size of the cross-attention matrix between the encoder and decoder given the following English-to-Spanish translation: I am very handsome -> Soy muy guapo
Please assume the following: d_k = d_q = 64, d_v = 32. Please ignore the <SOS> and <EOS> tokens. Cross-attention means Attention(Q, K, V).
NOTE: Please round to the nearest integer.
[Fill in the blank] rows, [Fill in the blank] columns
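To reason about the shapes involved, the sketch below (an illustration under my own assumptions, not the graded answer key) builds cross-attention with random matrices: it assumes whitespace tokenization, so queries come from the 3 decoder tokens while keys and values come from the 4 encoder tokens, making the score matrix $\mathrm{Softmax}(QK^\top/\sqrt{d_k})$ 3x4 and the output $\mathrm{Attention}(Q, K, V)$ 3 x d_v.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Assumed token counts from whitespace tokenization: 4 source tokens, 3 target tokens.
n_src, n_tgt = 4, 3
d_k, d_v = 64, 32

rng = np.random.default_rng(1)
Q = rng.normal(size=(n_tgt, d_k))   # queries come from the decoder
K = rng.normal(size=(n_src, d_k))   # keys come from the encoder
V = rng.normal(size=(n_src, d_v))   # values come from the encoder

scores = Q @ K.T / np.sqrt(d_k)     # (3, 4): one row per decoder query, one column per encoder key
output = softmax(scores) @ V        # Attention(Q, K, V): (3, 32), i.e. n_tgt rows and d_v columns
print(scores.shape, output.shape)   # (3, 4) (3, 32)
```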
Which of the following attention models uses a subset of the input to derive the output, and cannot be trained directly with gradient methods?