Question
CS-7643-O01, OAN, OSZ Quiz #4: Module 3
Fill-in-the-blank question (multiple blanks)
As defined in "Attention Is All You Need", what is the size of the cross-attention matrix between the encoder and decoder for the following English-to-Spanish translation?

I am very handsome -> Soy muy guapo

Please assume the following:
d_k = d_q = 64
d_v = 32
Please ignore the <SOS> and <EOS> tokens.
Cross-attention means Attention(Q, K, V).

NOTE: Please round to the nearest integer.

[Fill in the blank] rows, [Fill in the blank] columns
Answer
3 rows, 32 columns
Explanation
To determine the size of the cross-attention matrix, we need to recall how cross-attention is defined in Transformer architectures. Cross-attention is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q comes from the decoder (the current target sequence being generated) and K and V come from the encoder (the source sequence). Ignoring the <SOS> and <EOS> tokens, the Spanish target "Soy muy guapo" contributes 3 query vectors, so Q is 3 x 64; the English source "I am very handsome" contributes 4 key and value vectors, so K is 4 x 64 and V is 4 x 32. The product QK^T is therefore 3 x 4, the row-wise softmax preserves that shape, and multiplying by V gives a (3 x 4)(4 x 32) = 3 x 32 result. The cross-attention matrix has 3 rows and 32 columns.
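A minimal NumPy sketch (our own illustration, not part of the official solution) that verifies these shapes with random matrices:

import numpy as np

d_k, d_v = 64, 32
n_dec, n_enc = 3, 4   # "Soy muy guapo" -> 3 decoder tokens; "I am very handsome" -> 4 encoder tokens

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_dec, d_k))   # queries from the decoder
K = rng.standard_normal((n_enc, d_k))   # keys from the encoder
V = rng.standard_normal((n_enc, d_v))   # values from the encoder

scores = Q @ K.T / np.sqrt(d_k)               # (3, 4) scaled dot products
scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax numerically
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax, still (3, 4)
out = weights @ V                             # (3, 4) @ (4, 32) -> (3, 32)

print(out.shape)   # (3, 32): 3 rows, 32 columns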
Similar questions
On scaled dot-product attention and training stability of a transformer:
I. Without scaling by sqrt(D_k), the variance of the dot product q_n^T k_m grows with dimensionality, producing large logits that can saturate the softmax.
II. Scaling by sqrt(D_k) primarily solves exploding-gradient problems inside the value projection V.
III. The softmax-normalized matrix softmax(QK^T) is applied row-wise, making each row represent how strongly a query attends to all keys.
IV. Scaled dot-product attention computes Attention(Q, K, V) = softmax(QK^T / sqrt(D_k)) V, and the resulting matrix always has the same dimension as V.
(A numerical check of statement I appears after this list of similar questions.)
Which innovation is at the core of the transformer architecture and enables modeling long-range dependencies effectively?
Which of the following attention models uses a subset of the input to derive the output, and cannot be trained directly with gradient methods?
Why is the attention mechanism particularly suitable for modeling financial time series?
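As an aside on the first similar question above: statement I can be checked numerically. For queries and keys with i.i.d. zero-mean, unit-variance entries, Var(q^T k) = D_k, so unscaled logits grow with dimension, while dividing by sqrt(D_k) keeps their variance near 1. A minimal NumPy sketch (the dimensions and sample count are our own choices, not from the quiz):

import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    # 10,000 random query/key pairs with unit-variance entries
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)      # raw logits q^T k
    scaled = dots / np.sqrt(d_k)    # scaled logits
    print(f"d_k={d_k:5d}  var(raw)={dots.var():8.1f}  var(scaled)={scaled.var():.2f}")
    # var(raw) tracks d_k; var(scaled) stays near 1 for every d_k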