Consider a single-headed attention layer. What happens to the dimensions of the value weight matrix Wv when we double the maximum input sequence length? Select all that apply.

Question

BlackTom AI · Accepted Answer

Start by recalling the role of the value weight matrix Wv in a single-headed attention layer: Wv is the projection that maps the input feature dimension (commonly denoted d_model) to the value space (often denoted d_v). Crucially, this matrix is defined by feature dimensions, not by the sequence length: the same Wv is applied to every token's feature vector, so the sequence length only determines how many vectors get projected.
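The shapes involved can be written out explicitly. This is a sketch that assumes the row-vector convention (one token per row of the input), which matches the [d_model, d_v] shape used in this answer; the symbols X for the input, V for the values, and T for the sequence length are introduced here for illustration only.

```latex
X \in \mathbb{R}^{T \times d_{\text{model}}}, \qquad
W_v \in \mathbb{R}^{d_{\text{model}} \times d_v}, \qquad
V = X W_v \in \mathbb{R}^{T \times d_v}

% Doubling the sequence length, T -> 2T, changes X and V but leaves W_v untouched:
X' \in \mathbb{R}^{2T \times d_{\text{model}}}, \qquad
V' = X' W_v \in \mathbb{R}^{2T \times d_v}
```

T appears only in the shapes of X and V, never in the shape of W_v.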

Option 1: None of the above. This is the correct answer. Doubling the maximum input sequence length does not alter the per-token feature dimensions, so the shape of Wv does not change: it remains [d_model, d_v].

Option 2: Half the number of columns. The number of columns in Wv corresponds to the output feature dimension d_v, not the sequence length. Doubling sequence length does not imply halving the output dimension, so this is incorrect.

Option 3: Half the number of rows. The number of rows in Wv corresponds to the input feature dimension d_model. Sequence length changes do not change d_model, so this is incorrect.

Option 4: Double the number of rows. Again, the row count is tied to the input feature dimension d_model, not to sequence length. This option is incorrect for the same reason as Option 3.

Option 5: Double the number of columns. The column count relates to the value dimension d_v, independent of sequence length. Doubling sequence length does not mandate changing d_v, so this is incorrect.

In summary, doubling the maximum input sequence length causes none of the proposed changes to Wv's dimensions; Wv's shape is governed by the feature dimensions d_model and d_v, which stay constant regardless of sequence length.
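The argument above can be checked with a small shape experiment. This is a minimal sketch in plain Python rather than any particular framework's attention implementation; the helper names (`matmul`, `shape`, `random_matrix`) and the sizes d_model = 8, d_v = 4, and sequence lengths 16 and 32 are illustrative assumptions, not values from the quiz.

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def shape(m):
    """Return (rows, cols) of a non-empty list-of-rows matrix."""
    return (len(m), len(m[0]))


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of Gaussian samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Illustrative (assumed) sizes: the feature dimensions are fixed once.
d_model, d_v = 8, 4
rng = random.Random(0)

# The value weight matrix depends only on d_model and d_v.
w_v = random_matrix(d_model, d_v, rng)

# Two inputs: one of length 16 and one of double that length.
x_short = random_matrix(16, d_model, rng)
x_long = random_matrix(32, d_model, rng)

# The same w_v projects both inputs; only the number of output rows changes.
v_short = matmul(x_short, w_v)
v_long = matmul(x_long, w_v)

print(shape(w_v))      # (8, 4)
print(shape(v_short))  # (16, 4)
print(shape(v_long))   # (32, 4)
```

The same `w_v` is reused for both inputs without any reshaping. Only the number of rows in the projected values follows the sequence length.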

11785/11685/11485 Quiz-14
