Question
11785/11685/11485 Quiz-14
Single-answer multiple-choice question
A team of data scientists is experiencing poor performance with their Transformer model. After inspecting their implementation, you find several suspicious design choices. Which one would you identify as the MOST problematic based on the architectural principles discussed in the lecture?
Options
A. They have implemented Pre-Norm rather than Post-Norm for layer normalization
B. They have removed all position encodings because “the attention weights already capture token relationships.”
C. They have increased the dimensionality of the Feed-Forward Network from 4d to 8d to improve expressivity
D. They are using Multi-Head Attention with 64 attention heads for a model with an embedding dimension of 768
Standard answer
B. They have removed all position encodings because “the attention weights already capture token relationships.”
Analysis
Question restated: a data science team's Transformer model is performing poorly, and the implementation contains several suspicious design choices. Which choice is MOST problematic based on the architectural principles discussed in the lecture?
Option 1: They have implemented Pre-Norm rather than Post-Norm for layer normalization
- Pre-Norm vs Post-Norm: some literature suggests Pre-Norm can improve gradient flow in very deep Transformers, reducing training instability in some cases; the original Transformer used Post-Norm, while Pre-Norm is the default in most modern large models. Calling this inherently 'most problematic' ignores context; it is a design preference with trade-offs and is not guaranteed to cause the worst issues across architectures (see the Pre-Norm/Post-Norm sketch after this analysis).
Option 2: They have removed all position encodings because “the attention weights already capture token relationships.”
- This is the fundamental flaw. Self-attention is permutation-equivariant: attention weights are computed from token content alone, so without position encodings the model sees its input as an unordered bag of tokens and cannot distinguish “the dog bit the man” from “the man bit the dog.” The stated justification is backwards; attention cannot capture an ordering that was never encoded (see the permutation sketch after this analysis).
Option 3: They have increased the dimensionality of the Feed-Forward Network from 4d to 8d
- A wider FFN raises parameter count and compute and may encourage overfitting, but 4d is a convention rather than a requirement; this choice does not break the architecture.
Option 4: They are using Multi-Head Attention with 64 heads for an embedding dimension of 768
- Each head would get only 768 / 64 = 12 dimensions, which is unusually small (BERT-base, for comparison, uses 12 heads of 64 dimensions each), so this is likely suboptimal, but the model remains functional.
Conclusion: Option 2 is the MOST problematic choice, so the answer is B; the other three are debatable hyperparameter decisions rather than violations of a core architectural principle.
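To make the Option 2 point concrete, below is a minimal PyTorch sketch of single-head scaled dot-product self-attention with no position encodings. Everything here (the dimensions, the weight names, the self_attention helper) is illustrative rather than the lecture's implementation; the point is that permuting the input tokens merely permutes the output rows, so the layer is blind to word order.

import torch

torch.manual_seed(0)

d_model, seq_len = 16, 5
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

def self_attention(X):
    # Single-head scaled dot-product self-attention, no position encodings.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ V

X = torch.randn(seq_len, d_model)   # token embeddings, in sentence order
perm = torch.randperm(seq_len)      # shuffle the "sentence"

out = self_attention(X)
out_perm = self_attention(X[perm])

# Permuted input gives exactly the permuted output: the layer carries no
# information about token order, which is precisely what position encodings add.
print(torch.allclose(out[perm], out_perm, atol=1e-5))  # True

If position encodings were added to X before the attention, the equality above would break, restoring sensitivity to order.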
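The Option 1 trade-off is also easy to see in code. Below is a hedged sketch of the two residual layouts, where norm and sublayer stand in for LayerNorm and any attention/FFN sublayer (again an illustration, not the team's code):

import torch
import torch.nn as nn

d_model = 768
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or FFN

def post_norm(x):
    # Original Transformer (Vaswani et al., 2017): normalize after the residual add.
    return norm(x + sublayer(x))

def pre_norm(x):
    # Common modern variant: normalize before the sublayer; the residual path
    # stays an identity, which tends to ease gradient flow in deep stacks.
    return x + sublayer(norm(x))

x = torch.randn(4, d_model)
print(post_norm(x).shape, pre_norm(x).shape)  # both torch.Size([4, 768])

Both layouts are valid designs with different training dynamics, which is why Option 1 is a trade-off rather than an error.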