Still buried in exam prep? You've come to the right place!

We know it's exam month and you're deep in revision. To make exam and study season easier for more international students, we've decided to make Gold membership free for a limited time, through December 31, 2025! Normally £29.99 per month, it's now yours just by logging in, with no strings attached.

Helping you sprint through exam prep efficiently!

Question

BUSML 4382 SP2025 (36004) Final Exam - Requires Respondus LockDown Browser

Multiple-choice question (single answer)

What is the primary role of the self-attention mechanism in Transformer-based language models?

Options
A. To assign weights to output layers based on model depth.
B. To evaluate the importance of each word in a sentence relative to every other word.
C. To randomly initialize new tokens during training.
D. To reduce the training time by compressing input sequences.

Standard answer
Please log in to view
Reasoning
First, let's lay out the core question and the given options clearly to ground the discussion. Question: What is the primary role of the self-attention mechanism in Transformer-based language models? Answer options: 1) To assign weights to output layers based on model depth. 2) To evaluate the importance of each word in a sentence relative to every other word. 3) To randomly initialize new tokens during training. 4) To reduce the training time by co… (Log in to view the full explanation.)
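To make the mechanism in this explanation concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The toy sequence length, model width, and random weights are illustrative assumptions, not values from the exam question; the point is only that the attention computation scores every token in the sequence against every other token and turns those scores into per-row weights.

```python
# Minimal single-head scaled dot-product self-attention (illustrative sketch).
# The toy embeddings, dimensions, and random weights below are assumptions for
# demonstration only; they are not taken from the exam question.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d_model) token embeddings; Wq/Wk/Wv: projection matrices.
    Returns the (n, n) attention weights and the (n, d_v) output."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every token scored against every other token
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights, weights @ V

# Toy example: 3 tokens, model width 4.
rng = np.random.default_rng(0)
n, d_model = 3, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

weights, output = self_attention(X, Wq, Wk, Wv)
print(weights)  # row i shows how much token i attends to each token in the sequence
```

Row i of `weights` sums to 1 and records how strongly token i attends to every token in the sequence, including itself, which is exactly the behaviour the multiple-choice options are probing.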

Log in to view the complete answer

We have collected more than 50,000 original exam questions from around the world, each with a detailed explanation. Log in now to get the answers right away.

Similar questions

As defined in Attention is All You Need, what is the size of the self-attention matrix in the encoder for the following English-to-Spanish translation: I am very handsome -> Soy muy guapo? Assume d_k = d_q = 64 and d_v = 32, ignore the <SOS> and <EOS> tokens, and take self-attention to mean Attention(Q, K, V). Please round to the nearest integer. [Fill in the blank] rows, [Fill in the blank] columns. (A shape-check sketch follows this list of questions.)

What key mechanism do transformers use to process sequential data effectively? 

We want to find the self-attention weights assigned to the tokens in the sequence “Attention is everything” using scaled dot-product attention. A single head is used. The sequence is of length 3, and the dimensionality of the transformer is 4. Below is the input embedding of shape (3, 4). Note that this embedding is the sum of the token embedding and the position embedding.

X = [1, 2, 3, 4]
    [5, 0, 7, 0]
    [9, 0, 1, 2]

The weights of Q, K, and V are:

Wq = [0.3, 0.2, 0.8, 0.9]
     [0.4, 0.1, 0.4, 0.5]
     [0.5, 0.7, 0.2, 0.8]
     [0.8, 0.8, 0.7, 0.4]

Wk = [0.3, 0.9, 0.2, 0.7]
     [0.5, 0.4, 0.2, 0.2]
     [0.1, 0.7, 0.3, 0.6]
     [0.8, 0.4, 0.5, 0.9]

Wv = [0.2, 0.2, 0.3, 0.9]
     [0.2, 0.3, 0.8, 0.6]
     [0.7, 0.5, 0.9, 0.9]
     [1.0, 0.4, 0.2, 0.5]

If a causal mask is applied, what attention weight does “is” assign to “everything” in the sequence “Attention is everything”? Give the answer to 2 dp. Hint: Lecture 19, slides 17-27. (A worked setup in code follows this list of questions.)

Consider the sentence “Mary went to the mall because she wanted a new pair of shoes.” This sentence is passed through an encoder-only transformer model. What model component enables it to learn that “she” refers to “Mary”? Hint: Lec 19. 
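For the first similar question (the encoder matrix size), the shape bookkeeping can be checked mechanically. The sketch below is deliberately assumption-heavy: it tokenises the English source sentence on whitespace, picks an arbitrary model width, and fills the projection matrices with random numbers; only d_q = d_k = 64 and d_v = 32 come from the question. It prints the shapes of the intermediate matrices rather than asserting the graded answer.

```python
# Shape check for the encoder self-attention question (illustrative sketch).
# Whitespace tokenisation, the model width, and the random weights are assumptions;
# only d_q = d_k = 64 and d_v = 32 come from the question itself.
import numpy as np

tokens = "I am very handsome".split()    # the encoder self-attends over the source sentence
n = len(tokens)                          # sequence length, with <SOS>/<EOS> ignored as instructed
d_model, d_k, d_v = 512, 64, 32          # d_model is an assumption; d_k and d_v are given

rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_model))
Q = X @ rng.normal(size=(d_model, d_k))
K = X @ rng.normal(size=(d_model, d_k))
V = X @ rng.normal(size=(d_model, d_v))

scores = Q @ K.T / np.sqrt(d_k)                                   # (n, n)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                    # row-wise softmax
attention = weights @ V                                           # Attention(Q, K, V), per the question's definition

print("attention weight matrix:", weights.shape)    # (n, n)
print("Attention(Q, K, V):", attention.shape)       # (n, d_v)
```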
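For the causal-mask question, the sketch below reproduces the matrices given in the prompt and applies a standard causal mask (scores above the diagonal set to -inf before the softmax). That masking convention and the token-to-row ordering are assumptions here; the code prints the full rounded weight matrix so the relevant entry can be read off.

```python
# Setup for the causal-mask question above (a sketch, not the graded solution).
# X, Wq, Wk, Wv are copied from the question; the additive mask convention
# (-inf above the diagonal before the softmax) is the standard one and is assumed here.
import numpy as np

X = np.array([[1., 2., 3., 4.],    # "Attention"
              [5., 0., 7., 0.],    # "is"
              [9., 0., 1., 2.]])   # "everything"

Wq = np.array([[0.3, 0.2, 0.8, 0.9],
               [0.4, 0.1, 0.4, 0.5],
               [0.5, 0.7, 0.2, 0.8],
               [0.8, 0.8, 0.7, 0.4]])
Wk = np.array([[0.3, 0.9, 0.2, 0.7],
               [0.5, 0.4, 0.2, 0.2],
               [0.1, 0.7, 0.3, 0.6],
               [0.8, 0.4, 0.5, 0.9]])
Wv = np.array([[0.2, 0.2, 0.3, 0.9],
               [0.2, 0.3, 0.8, 0.6],
               [0.7, 0.5, 0.9, 0.9],
               [1.0, 0.4, 0.2, 0.5]])

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(Q.shape[-1])

# Causal mask: position i may only attend to positions j <= i.
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# weights[1, 2] is the entry the question asks about: the weight "is" assigns to
# "everything". Under this masking convention that position is masked out, so the
# softmax drives it to zero.
print(np.round(weights, 2))
```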

More useful tools for international students

To make exam and study season easier for more international students, we've decided to make Gold membership free for a limited time, through December 31, 2025!