Questions
11785/11685/11485 Quiz-14
Multiple choice
Consider a single-headed attention layer. What happens to the dimensions of the value weight matrix Wv, when we double the maximum input sequence length? Select all that apply
Options
A.None of the above
B.Half the number of columns
C.Half the number of rows
D.Double the number of rows
E.Double the number of columns
View Explanation
Verified Answer
Please login to view
Step-by-Step Analysis
Start by recalling the role of the value weight matrix Wv in a single-headed attention layer: Wv is the projection that maps the input feature dimension (commonly denoted as d_model) to the value space (often denoted as d_v). Crucially, this matrix is defined by feature dimensions, not by the sequence length.
Option 1: None of th......Login to view full explanationLog in for full answers
We've collected over 50,000 authentic exam questions and detailed explanations from around the globe. Log in now and get instant access to the answers!
Similar Questions
On scaled dot-product attention and training stability of a transformer: I Without scaling by 𝐷 𝑘 , the variance of the dot product 𝑞 𝑛 ⊤ 𝑘 𝑚 grows with dimensionality, producing large logits that can saturate the softmax. II Scaling by 𝐷 𝑘 primarily solves exploding-gradient problems inside the value projection 𝑉 . III The softmax-normalized matrix S o f t m a x ( 𝑄 𝐾 ⊤ ) is applied row-wise, making each row represent how strongly a query attends to all keys. IV Scaled dot-product attention computes A t t e n t i o n ( 𝑄 , 𝐾 , 𝑉 ) = S o f t m a x ! ( 𝑄 𝐾 ⊤ 𝐷 𝑘 ) 𝑉 , and the resulting matrix always has the same dimension as 𝑉 .
Which innovation is at the core of the transformer architecture and enables modeling long-range dependencies effectively?
As defined in Attention is All You Need, what is the size of the cross-attention matrix between the encoder and decoder given the following English to Spanish translation: I am very handsome -> Soy muy guapo Please assume the following: d_k = d_q = 64 d_v = 32 Please ignore the <SOS> and <EOS> tokens. cross-attention means Attention(Q, K, V) NOTE: Please round to the nearest integer. [Fill in the blank] rows[Fill in the blank] columns
Which of the following attention models uses a subset of the input to derive the output, and can not be trained directly with gradient methods?
More Practical Tools for Students Powered by AI Study Helper
Making Your Study Simpler
Join us and instantly unlock extensive past papers & exclusive solutions to get a head start on your studies!