Question
11785/11685/11485 Quiz-09
Multiple-choice question
As explained in the lecture, in sequence-to-sequence learning, attention is a probability distribution over:
- Encoder's input values
- Encoder's hidden state values
- Decoder's hidden state values
- Decoder's input values
- Encoder's last state value
- None of the above

(Select all that apply.) Hint: Lec 18, Slides 20–62
Analysis
In sequence-to-sequence learning with attention, the model learns to align each step of the decoder with specific parts of the input sequence processed by the encoder. This alignment is represented as a probability distribution over the encoder-side representations that the decoder can attend to during generation.
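The idea above can be sketched numerically: score each encoder hidden state against the current decoder state, softmax-normalize the scores into a probability distribution, and take the weighted sum as the context vector. This is a minimal sketch assuming dot-product scoring; the lecture may use a different scoring function (e.g. additive/MLP scoring), and all array names and sizes here are illustrative.

```python
import numpy as np

def attention_weights(decoder_state, encoder_states):
    """Dot-product alignment scores, softmax-normalized into a distribution."""
    scores = encoder_states @ decoder_state          # one score per encoder step, shape (T,)
    scores -= scores.max()                           # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()                               # non-negative, sums to 1

# Hypothetical example: T = 4 encoder hidden states of size H = 8.
enc = np.random.default_rng(1).normal(size=(4, 8))
dec = np.random.default_rng(2).normal(size=(8,))

w = attention_weights(dec, enc)
assert np.isclose(w.sum(), 1.0) and (w >= 0).all()   # a valid probability distribution
context = w @ enc                                    # weighted sum of encoder hidden states
```

Note that the distribution is over the encoder's hidden states, not over the raw input tokens: the weighted sum (context vector) lives in the hidden-state space and is fed to the decoder at each generation step.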
Similar questions
On scaled dot-product attention and the training stability of a transformer:
I. Without scaling by √D_k, the variance of the dot product q_n⊤ k_m grows with dimensionality, producing large logits that can saturate the softmax.
II. Scaling by √D_k primarily solves exploding-gradient problems inside the value projection V.
III. The softmax-normalized matrix Softmax(QK⊤) is applied row-wise, making each row represent how strongly a query attends to all keys.
IV. Scaled dot-product attention computes Attention(Q, K, V) = Softmax(QK⊤ / √D_k) V, and the resulting matrix always has the same dimension as V.
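As a quick way to check the shapes and the row-wise softmax involved in scaled dot-product attention, here is a minimal NumPy sketch. The matrix sizes (3 queries, 5 keys/values, D_k = 64, D_v = 32) are illustrative assumptions, not taken from the question.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_k) scaled logits
    scores -= scores.max(axis=-1, keepdims=True)          # stability shift per row
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax: each row sums to 1
    return weights @ V                                    # (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))   # 3 queries, D_k = 64
K = rng.normal(size=(5, 64))   # 5 keys
V = rng.normal(size=(5, 32))   # 5 values, D_v = 32

out = scaled_dot_product_attention(Q, K, V)
assert out.shape == (3, 32)    # one output row per query, each of width D_v
```

The output has as many rows as Q and the same width as V, which is a useful sanity check when reasoning about attention-matrix dimensions in questions like the ones above.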
Which innovation is at the core of the transformer architecture and enables modeling long-range dependencies effectively?
As defined in "Attention Is All You Need", what is the size of the cross-attention matrix between the encoder and decoder for the following English-to-Spanish translation?

I am very handsome -> Soy muy guapo

Assume: d_k = d_q = 64, d_v = 32. Ignore the <SOS> and <EOS> tokens. Cross-attention means Attention(Q, K, V). NOTE: round to the nearest integer.

[Fill in the blank] rows, [Fill in the blank] columns
Which of the following attention models uses a subset of the input to derive the output, and cannot be trained directly with gradient methods?