Question

11785/11685/11485 Quiz-8

Multiple-select question

Please read the following paper to answer the question below: https://arxiv.org/pdf/1409.0473.pdf. Based on your reading of the paper, which of the following are true?

Analysis

This question asks us to evaluate several statements about the linked paper (Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate"), which combines a bidirectional RNN encoder with an attention (soft alignment) mechanism between encoder and decoder. Each option is assessed in turn, noting what holds conceptually and where common claims may be misleading.

Option 1: 'Due to the bi-directional RNN, the hidden states at time t=j in the encoder contains the summary of preceding and succeeding words. This helps the soft alignment model to make a better context vector.'

The core idea is that a bidirectional RNN (BiRNN) processes the sequence both forwards and backwards, so the hidden state at position j encodes information from both the left (preceding words) and the right (succeeding words). This enriched representation can improve alignment (e.g., the attention weights) because each position's state carries context about the entire sequence, which can lead to a more informative context vector when computing...
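The point made for Option 1 can be illustrated with a small sketch. It is not the paper's implementation: it assumes a plain tanh RNN in place of the paper's gated units, random untrained weights, and hypothetical names (`rnn_states`, `birnn_annotations`, `context_vector`) and sizes chosen only for illustration. It builds each annotation h_j by concatenating a forward and a backward hidden state, then forms a context vector as a softmax-weighted sum of annotations using an additive alignment score.

```python
import math
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_states(xs, w_x, w_h):
    """Plain tanh RNN (a simplification); returns one hidden state per position."""
    h = [0.0] * len(w_h)
    states = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(w_x, x), matvec(w_h, h))]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states


def birnn_annotations(xs, fwd, bwd):
    """Concatenate forward and backward states so h_j covers both sides of j."""
    f = rnn_states(xs, *fwd)
    b = rnn_states(xs[::-1], *bwd)[::-1]  # run right-to-left, then re-align
    return [fj + bj for fj, bj in zip(f, b)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(s_prev, annotations, w_s, u_h, v):
    """Additive alignment: e_j = v . tanh(W s_prev + U h_j); c = sum_j alpha_j h_j."""
    ws = matvec(w_s, s_prev)
    scores = []
    for h in annotations:
        uh = matvec(u_h, h)
        scores.append(sum(vk * math.tanh(a + b) for vk, a, b in zip(v, ws, uh)))
    alphas = softmax(scores)
    dim = len(annotations[0])
    context = [sum(al * h[k] for al, h in zip(alphas, annotations)) for k in range(dim)]
    return alphas, context


# Illustrative sizes only: embedding, hidden, alignment dimensions, sequence length.
rng = random.Random(0)
EMB, HID, ATT, T = 3, 4, 5, 6
xs = [[rng.uniform(-1, 1) for _ in range(EMB)] for _ in range(T)]
fwd = (rand_matrix(HID, EMB, rng), rand_matrix(HID, HID, rng))
bwd = (rand_matrix(HID, EMB, rng), rand_matrix(HID, HID, rng))
annotations = birnn_annotations(xs, fwd, bwd)

s_prev = [rng.uniform(-1, 1) for _ in range(HID)]  # previous decoder state
w_s = rand_matrix(ATT, HID, rng)
u_h = rand_matrix(ATT, 2 * HID, rng)
v = [rng.uniform(-0.5, 0.5) for _ in range(ATT)]
alphas, context = context_vector(s_prev, annotations, w_s, u_h, v)
```

In this sketch, replacing only the last input vector leaves the forward half of the first annotation unchanged but alters its backward half, which is the sense in which each annotation summarizes both preceding and succeeding words.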

Similar questions

On scaled dot-product attention and training stability of a transformer:
I. Without scaling by $\sqrt{D_k}$, the variance of the dot product $q_n^\top k_m$ grows with dimensionality, producing large logits that can saturate the softmax.
II. Scaling by $\sqrt{D_k}$ primarily solves exploding-gradient problems inside the value projection $V$.
III. The softmax-normalized matrix $\mathrm{Softmax}(QK^\top)$ is applied row-wise, making each row represent how strongly a query attends to all keys.
IV. Scaled dot-product attention computes $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{D_k}}\right)V$, and the resulting matrix always has the same dimension as $V$.

Which innovation is at the core of the transformer architecture and enables modeling long-range dependencies effectively?

As defined in Attention is All You Need, what is the size of the cross-attention matrix between the encoder and decoder given the following English to Spanish translation: I am very handsome -> Soy muy guapo. Please assume the following: d_k = d_q = 64, d_v = 32. Please ignore the <SOS> and <EOS> tokens. Cross-attention means Attention(Q, K, V). NOTE: Please round to the nearest integer. [Fill in the blank] rows, [Fill in the blank] columns

Which of the following attention models uses a subset of the input to derive the output, and cannot be trained directly with gradient methods?

Why is the attention mechanism particularly suitable for modeling financial time series?

Which of the following statements is correct about query, key, and value in transformer models?

Consider a single-headed attention layer. What happens to the dimensions of the value weight matrix Wv, when we double the maximum input sequence length? Select all that apply
