Attention Mechanism improves word embedding by:

Question

BlackTom AI · Accepted Answer

Answering this question requires comparing how each option describes the role of attention in building contextual representations.
Option 1 claims that attention replaces traditional fixed embeddings such as Word2Vec and GloVe with fixed vector representations. This is wrong on two counts: attention does not produce fixed vectors, and it does not replace the embedding step. It operates on top of the input embeddings and adjusts each token's representation dynamically according to its context. The option therefore misstates the relationship between attention and traditional embeddings.
Option 2 claims that attention increases the number of parameters in a neural network without improving contextual understanding. Attention can add parameters and compute cost, but its core purpose is precisely to improve contextual modeling by weighting surrounding words. Saying it does not improve contextual understanding ignores the empirical benefits of attention in capturing dependencies.
Option 3 states that attention ignores the sequential nature of text and treats words as independent entities. This is incorrect: attention explicitly models relationships between positions in a sequence, weighting both neighboring and distant tokens, so words are never treated as independent. (In Transformers, word order itself is supplied by positional encodings added to the input embeddings.)
Option 4 describes attention as dynamically adjusting word embeddings by weighting the relevance of surrounding words in context. This aligns with how attention works in models like transformers: each token's representation is refined by attending to other tokens, weighting their relevance to produce a contextualized embedding. This is the most accurate characterization among the choices, reflecting both dynamic weighting and contextual integration.
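To make the idea in Option 4 concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not how any particular library implements it: the queries, keys, and values are the raw embeddings themselves (no learned projection matrices), there is a single head, and the two-dimensional `STATIC` vectors and the `contextualize` helper are made up for the example.

```python
import math

# Hypothetical 2-d static embeddings (toy values, not from any trained model).
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Single-head self-attention with Q = K = V = embeddings (assumption)."""
    dim = len(embeddings[0])
    outputs, all_weights = [], []
    for query in embeddings:
        # Relevance of every token to the current one: scaled dot product.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights = softmax(scores)
        # New representation: relevance-weighted average of all token vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, embeddings))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


def contextualize(tokens):
    """Look up static vectors, then refine them with self-attention."""
    return self_attention([STATIC[t] for t in tokens])


river_ctx, river_w = contextualize(["river", "bank"])
money_ctx, money_w = contextualize(["money", "bank"])

# "bank" starts from the same static vector in both phrases,
# but ends up with a different vector depending on its neighbor.
print("bank next to 'river':", river_ctx[1])
print("bank next to 'money':", money_ctx[1])
```

The static vector for "bank" is identical in both phrases, yet its output vector is pulled toward "river" in one and toward "money" in the other. That is the dynamic, context-dependent adjustment Option 4 describes; real models additionally learn separate projection matrices for queries, keys, and values.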

BU.520.710.T1.SP25 Quiz 4


