As explained in the lecture, in sequence-to-sequence learning, attention is a probability distribution over

Encoder’s input values
Encoder’s hidden state values
Decoder’s hidden state values
Decoder’s input values
Encoder’s last state value
None of the above

(Select all that apply)

Hint: Lec 18, Slide 20-62

BlackTom AI · Accepted Answer

In sequence-to-sequence learning with attention, the model learns to align each step of the decoder with specific parts of the input sequence processed by the encoder. This alignment is represented as a probability distribution over the encoder-side representations that the decoder can attend to during generation.

Because attention weights are computed to focus on encoder-side information, the natural target of the distribution is the Encoder’s hidden state values. These hidden states summarize the processed input sequence at each position and provide a compact representation for the decoder to weigh when producing the next token.
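
As a worked form of the paragraph above, the distribution can be written out explicitly. The notation is an assumption for illustration and may differ from the lecture slides: h_i is the encoder hidden state at input position i, s_{t-1} is the decoder state before producing output t, N is the input length, and score is whatever compatibility function the model uses (dot product, small MLP, and so on).

```latex
e_{t,i} = \mathrm{score}(s_{t-1}, h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{N} \exp(e_{t,j})}, \qquad
c_t = \sum_{i=1}^{N} \alpha_{t,i}\, h_i
```

For each decoder step t, the weights \alpha_{t,1}, ..., \alpha_{t,N} are non-negative and sum to 1, and there is one weight per encoder hidden state. That is the sense in which attention is a probability distribution over the encoder's hidden state values.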

The other options do not match the standard formulation. Attention is not defined over the encoder’s raw input values, because the weights are applied to the encoder’s hidden representations of those inputs. It is not defined over the decoder’s hidden states or the decoder’s inputs either: the decoder’s current state acts as the query used to score the encoder states, not as the set being weighted. Finally, attending to the encoder’s last state value alone would discard the per-position encoder representations, which is exactly the bottleneck that attention is meant to remove.

Of the listed options, only the second, "Encoder’s hidden state values," matches this formulation, so "None of the above" does not apply either.

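If you’re revisiting Lec 18, Slide 20-62, recall that at each decoding step the decoder’s current state is scored against every encoder hidden state. Normalizing those scores gives the attention distribution, and the weighted sum of encoder hidden states under that distribution is the context vector that informs the next output token.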
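
A minimal sketch of one decoder step may make this concrete. It assumes dot-product scoring and equal encoder and decoder dimensions, neither of which is stated in the question. The function names `softmax` and `attend` and the toy vectors are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for a single decoder step.

    Assumption: dot-product scoring; other scoring functions work the same way.
    """
    # One score per encoder hidden state.
    scores = [
        sum(d * h for d, h in zip(decoder_state, h_i))
        for h_i in encoder_states
    ]
    # The attention distribution: one weight per encoder hidden state.
    weights = softmax(scores)
    # Context vector: weighted sum of the encoder hidden states.
    dim = len(decoder_state)
    context = [
        sum(w * h_i[k] for w, h_i in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Toy example: three encoder positions, two-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

Here `weights` has exactly one entry per encoder hidden state and sums to 1, while the decoder state appears only as the query that produces the scores.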

11785/11685/11485 Quiz-09
