Questions
CS-7643-O01, OAN, OSZ Quiz #4: Module 3
Multiple fill-in-the-blank
As defined in Attention Is All You Need, what is the size of the cross-attention matrix between the encoder and decoder given the following English-to-Spanish translation: I am very handsome -> Soy muy guapo
Please assume the following: d_k = d_q = 64, d_v = 32
Please ignore the <SOS> and <EOS> tokens. Cross-attention means Attention(Q, K, V).
NOTE: Please round to the nearest integer.
[Fill in the blank] rows, [Fill in the blank] columns
Step-by-Step Analysis
To determine the size of the cross-attention matrix, recall how cross-attention is defined in Transformer architectures. The cross-attention matrix is formed by Attention(Q, K, V), where Q comes from the decoder (the current target sequence being generated) and K and V come from the encoder (the source sequence). Here the decoder sequence "Soy muy guapo" has 3 tokens and the encoder sequence "I am very handsome" has 4 tokens, so Q is 3 x 64, K is 4 x 64, and V is 4 x 32. The score matrix QK^T is therefore 3 x 4, and multiplying its row-wise softmax by V gives an output of size 3 x d_v = 3 x 32: 3 rows and 32 columns.
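A minimal NumPy sketch of this shape arithmetic (random matrices stand in for the learned projections; the token counts follow the sentences in the question) confirms the dimensions:

```python
import numpy as np

# Token counts for this question (ignoring <SOS> and <EOS>):
# encoder input "I am very handsome" -> 4 tokens (keys/values)
# decoder input "Soy muy guapo"      -> 3 tokens (queries)
n_dec, n_enc = 3, 4
d_k, d_v = 64, 32

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_dec, d_k))  # (3, 64), from the decoder
K = rng.standard_normal((n_enc, d_k))  # (4, 64), from the encoder
V = rng.standard_normal((n_enc, d_v))  # (4, 32), from the encoder

scores = Q @ K.T / np.sqrt(d_k)                  # (3, 4) score matrix
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
output = weights @ V                             # (3, 32)

print(output.shape)  # (3, 32): 3 rows, 32 columns
```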
Similar Questions
On scaled dot-product attention and training stability of a transformer (see the sketch after this list):
I. Without scaling by sqrt(D_k), the variance of the dot product q_n^T k_m grows with dimensionality, producing large logits that can saturate the softmax.
II. Scaling by sqrt(D_k) primarily solves exploding-gradient problems inside the value projection V.
III. The softmax-normalized matrix Softmax(QK^T) is applied row-wise, making each row represent how strongly a query attends to all keys.
IV. Scaled dot-product attention computes Attention(Q, K, V) = Softmax(QK^T / sqrt(D_k)) V, and the resulting matrix always has the same dimension as V.
Which innovation is at the core of the transformer architecture and enables modeling long-range dependencies effectively?
Which of the following attention models uses a subset of the input to derive the output and cannot be trained directly with gradient methods?
Why is the attention mechanism particularly suitable for modeling financial time series?
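For reference against the scaled dot-product attention question above, here is a minimal NumPy sketch (the sizes are illustrative assumptions, not taken from any question) that makes the row-wise softmax in statement III and the output shape in statement IV easy to check:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(QK^T / sqrt(D_k)) V, with the softmax taken row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) logits
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # (n_q, d_v)

# Illustrative sizes, assumed for this sketch
n_q, n_k, d_k, d_v = 3, 5, 64, 32
rng = np.random.default_rng(0)
Q = rng.standard_normal((n_q, d_k))
K = rng.standard_normal((n_k, d_k))
V = rng.standard_normal((n_k, d_v))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 32): n_q rows and d_v columns; matches V's shape only when n_q == n_k
```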