Consider a single-headed attention layer. What happens to the dimensions of the value weight matrix Wv when we double the maximum input sequence length? Select all that apply.

Question

BlackTom AI · Accepted Answer

Start by recalling the role of the value weight matrix Wv in a single-headed attention layer: Wv is the projection that maps the input feature dimension (commonly denoted as d_model) to the value space (often denoted as d_v). Crucially, this matrix is defined by feature dimensions, not by the sequence length.
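The shapes can be written out explicitly. In the sketch below, T denotes the sequence length and X the matrix of token features; both symbols are introduced here for illustration and do not appear in the original question.

```latex
% Value projection for a sequence of length T (T and X are assumed notation)
X \in \mathbb{R}^{T \times d_{\text{model}}}, \qquad
W_v \in \mathbb{R}^{d_{\text{model}} \times d_v}, \qquad
V = X W_v \in \mathbb{R}^{T \times d_v}

% After doubling the sequence length, only X and V gain rows
X' \in \mathbb{R}^{2T \times d_{\text{model}}}, \qquad
V' = X' W_v \in \mathbb{R}^{2T \times d_v}
```

T appears only in the shapes of X and V, never in the shape of W_v.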

Option 1: None of the above. This is the correct answer. Doubling the maximum input sequence length changes how many tokens are processed, not the per-token feature dimensions, so the shape of Wv does not change. Wv remains of shape [d_model, d_v].

Option 2: Half the number of columns. The number of columns in Wv corresponds to the output feature dimension d_v, not the sequence length. Doubling sequence length does not imply halving the output dimension, so this is incorrect.

Option 3: Half the number of rows. The number of rows in Wv corresponds to the input feature dimension d_model. Sequence length changes do not change d_model, so this is incorrect.

Option 4: Double the number of rows. Again, the row count is tied to the input feature dimension d_model, not to sequence length. This option is incorrect for the same reason as Option 3.

Option 5: Double the number of columns. The column count relates to the value dimension d_v, independent of sequence length. Doubling sequence length does not mandate changing d_v, so this is incorrect.

In summary, none of the proposed changes to Wv’s dimensions arise from simply doubling the maximum input sequence length; Wv’s shape is governed by feature dimensions, which remain constant regardless of sequence length.
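The point can be checked numerically. The following is a minimal sketch in plain Python, assuming illustrative sizes (d_model = 8, d_v = 4, sequence length 5) that are not part of the original question; the helpers `matmul` and `shape` are hypothetical names written for this example.

```python
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shape(m):
    """Return (rows, columns) of a list-of-rows matrix."""
    return (len(m), len(m[0]))

# Illustrative sizes (assumptions, not taken from the quiz).
d_model, d_v = 8, 4
seq_len = 5

rng = random.Random(0)

# W_v is built from feature dimensions only; seq_len is never used here.
W_v = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(d_model)]

# Two inputs: one of length seq_len, one of doubled length.
X_short = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]
X_long = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(2 * seq_len)]

# The same W_v projects both inputs to values.
V_short = matmul(X_short, W_v)
V_long = matmul(X_long, W_v)

print(shape(W_v))      # (8, 4)
print(shape(V_short))  # (5, 4)
print(shape(V_long))   # (10, 4)
```

Only the number of rows in the value output changes with the sequence length; the same Wv is reused for both inputs.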

11785/11685/11485 Quiz-14
