Question
BU.330.760.52.SP25 Final - Requires Respondus LockDown Browser
Single-choice question
Which of the following statements is correct about query, key, and value in transformer models?
Options
A. The query describes the task that each word or token can help with in the sentence.
B. The key describes the current task that each word or token needs to perform in the sentence.
C. The value indicates how important each word or token is in the sentence.
D. The higher the resonance between the query and key, the greater the contribution of the value to the current task.
Standard answer
D
Analysis
To assess how query, key, and value function in transformer attention, let's evaluate what each component represents in this context.
Option A: 'The query describes the task that each word or token can help with in the sentence.' This misframes the query as a descriptor of the task: in attention, the query is compared against keys to compute compatibility scores, so it expresses what the current token is looking for, not what it can offer. Option B swaps roles in the same way: the key advertises what a token can contribute, not the task it needs to perform. Option C confuses the value with the attention weight; the value carries a token's content, while importance is determined by the softmax over query-key scores. Option D matches the mechanism: the stronger the query-key match, the larger the softmax weight, and the more that token's value contributes to the output.
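For intuition, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is not from the course material; the toy dimensions and random inputs are assumptions chosen only for illustration. It shows that the query-key score determines, via a row-wise softmax, how much each value contributes to the output, which is the relationship option D describes.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: weights = softmax(QK^T / sqrt(d_k)), output = weights @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key compatibility ("resonance")
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability before exponentiation
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: each row sums to 1
    return weights @ V, weights                      # each output row = weighted sum of value rows

# Toy example (assumed sizes): 3 tokens, d_k = 4, d_v = 2
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 2))
out, w = scaled_dot_product_attention(Q, K, V)
print(w)    # larger query-key scores -> larger weight on that token's value
print(out)  # shape (3, 2): one output row per query, d_v columns like V
```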
Similar questions
On scaled dot-product attention and training stability of a transformer:
I. Without scaling by $\sqrt{D_k}$, the variance of the dot product $q_n^\top k_m$ grows with dimensionality, producing large logits that can saturate the softmax.
II. Scaling by $\sqrt{D_k}$ primarily solves exploding-gradient problems inside the value projection $V$.
III. The softmax-normalized matrix $\mathrm{Softmax}(QK^\top)$ is applied row-wise, making each row represent how strongly a query attends to all keys.
IV. Scaled dot-product attention computes $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{D_k}}\right)V$, and the resulting matrix always has the same dimension as $V$.
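A quick way to sanity-check statement I (and the role of the $\sqrt{D_k}$ factor in IV) is to measure the spread of raw dot products at different key dimensions. The sketch below is an illustration under its own assumptions (i.i.d. zero-mean, unit-variance random vectors), not part of the original question: the unscaled dot product's standard deviation grows like $\sqrt{D_k}$, while dividing by $\sqrt{D_k}$ keeps it roughly constant.

```python
import numpy as np

# Assumption: q and k have i.i.d. zero-mean, unit-variance entries, so Var(q.k) = D_k
# and the unscaled logits' standard deviation grows like sqrt(D_k).
rng = np.random.default_rng(42)
for d_k in (16, 64, 256, 1024):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = np.einsum("nd,nd->n", q, k)              # 10000 independent dot products
    scaled = dots / np.sqrt(d_k)                    # the scaling used in attention logits
    print(f"D_k={d_k:5d}  std(q.k)={dots.std():7.2f}  std(q.k/sqrt(D_k))={scaled.std():5.2f}")
```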
Which innovation is at the core of the transformer architecture and enables modeling long-range dependencies effectively?
As defined in Attention is All You Need, what is the size of the cross-attention matrix between the encoder and decoder given the following English-to-Spanish translation: I am very handsome -> Soy muy guapo
Please assume the following: d_k = d_q = 64, d_v = 32. Please ignore the <SOS> and <EOS> tokens. Cross-attention means Attention(Q, K, V).
NOTE: Please round to the nearest integer.
[Fill in the blank] rows, [Fill in the blank] columns
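To reason about the shapes involved, the sketch below (an illustration under my own assumptions, not the graded answer key) builds cross-attention with random matrices: it assumes whitespace tokenization, so queries come from the 3 decoder tokens while keys and values come from the 4 encoder tokens, making the score matrix $\mathrm{Softmax}(QK^\top/\sqrt{d_k})$ 3x4 and the output $\mathrm{Attention}(Q, K, V)$ 3 x d_v.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Assumed token counts from whitespace tokenization: 4 source tokens, 3 target tokens.
n_src, n_tgt = 4, 3
d_k, d_v = 64, 32

rng = np.random.default_rng(1)
Q = rng.normal(size=(n_tgt, d_k))   # queries come from the decoder
K = rng.normal(size=(n_src, d_k))   # keys come from the encoder
V = rng.normal(size=(n_src, d_v))   # values come from the encoder

scores = Q @ K.T / np.sqrt(d_k)     # (3, 4): one row per decoder query, one column per encoder key
output = softmax(scores) @ V        # Attention(Q, K, V): (3, 32), i.e. n_tgt rows and d_v columns
print(scores.shape, output.shape)   # (3, 4) (3, 32)
```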
Which of the following attention models uses a subset of the input to derive the output, and cannot be trained directly with gradient methods?