Which of the following statements is correct about query, key, and value in transformer models?

Question

BlackTom AI · Accepted Answer

To decide which statement is correct, consider what the query (Q), key (K), and value (V) each represent in transformer attention.
Option 1: 'The query describes the task that each word or token can help with in the sentence.' This misframes the query. A token's query represents what that token is looking for in the other tokens, and it is compared against their keys to compute compatibility scores. It does not describe a task the token can help with, so this statement is incorrect.
Option 2: 'The key describes the current task that each word or token needs to perform in the sentence.' The key does not describe a task. A token's key is a descriptor of its content that queries are matched against to determine attention weights. The statement misattributes the role of the key, so it is incorrect.
Option 3: 'The value indicates how important each word or token is in the sentence.' Values carry the actual information to be aggregated, not a measure of importance. Importance is expressed by the attention weights, which are derived from Q and K and then used to weight the values. This statement conflates value with importance, so it is incorrect.
Option 4: 'The higher the resonance between the query and key, the greater the contribution of the value to the current task.' This aligns with the core mechanism, reading 'resonance' as Q-K similarity. Attention scores are computed from the similarity between Q and K (typically a scaled dot product), and after softmax these scores weight the corresponding V vectors, so a value contributes more to the output when its key matches the query more closely. This statement is correct.
In summary, the first three options misrepresent the roles of query, key, or value, whereas the fourth option accurately reflects how attention uses Q–K similarity to weight V for the current computation.
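
The mechanism described under Option 4 can be illustrated with a minimal NumPy sketch of single-head scaled dot-product attention. The function name and the toy Q, K, and V matrices below are assumptions made up for illustration; they are not part of the original question, and real models use learned projections to produce Q, K, and V.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for single-head attention without masking."""
    d_k = K.shape[-1]
    # Q-K similarity: dot product, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys (shifted by the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the value vectors.
    return weights @ V, weights

# Hypothetical toy data: one query and three key/value pairs.
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0],    # aligned with the query
              [0.0, 1.0],    # orthogonal to the query
              [-1.0, 0.0]])  # opposite to the query
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights)  # the weight is largest for the key most similar to the query
print(output)   # so the first value vector contributes most to the output
```

Note that V never enters the weight computation: the weights depend only on Q and K, which is why Option 3 is wrong and Option 4 is right.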

BU.330.760.52.SP25 Final- Requires Respondus LockDown Browser
