A team of data scientists is experiencing poor performance with their Transformer model. After inspecting their implementation, you find several suspicious design choices. Which one would you identify as the MOST problematic based on the architectural principles discussed in the lecture?

BlackTom AI · Accepted Answer

Question restated: A data science team's Transformer model performs poorly, and an inspection of the implementation turns up several suspicious design choices. Which choice is MOST problematic based on the architectural principles discussed in the lecture?

Option 1: They have implemented Pre-Norm rather than Post-Norm for layer normalization
- Pre-Norm vs Post-Norm: Pre-Norm applies layer normalization to the input of each sublayer and leaves the residual path unnormalized, whereas Post-Norm normalizes after the residual addition. Some literature suggests Pre-Norm improves gradient flow in very deep Transformers and reduces training instability in some cases. It is a design choice with trade-offs, not a flaw, so on its own it is unlikely to explain the poor performance.
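To make the difference concrete, here is a minimal sketch of the two residual layouts in plain Python. It assumes a layer norm without learned scale or shift, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, not part of any real library.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    # Post-Norm: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # Pre-Norm: normalize only the sublayer input; the residual path is untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the FFN.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 4.0]
post_out = post_norm_block(x, toy_sublayer)
pre_out = pre_norm_block(x, toy_sublayer)
```

In the Pre-Norm layout the input `x` reaches the output through an unmodified identity path, which is the property usually credited for the easier gradient flow. Both layouts train working models, which is why this option is a trade-off and not an error.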

Option 2: They have removed all position encodings because “the attention weights already capture token relationships.”
- Position encodings provide explicit information about token order, which is essential for sequence modeling. Self-attention without positional information is permutation-equivariant: reordering the input tokens simply reorders the outputs in the same way, so the model cannot tell one token order from another. Removing positional encodings therefore undermines the model's fundamental ability to model sequences, making this choice highly problematic and a likely explanation for poor performance on any task that depends on order.
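The order-blindness of attention can be checked directly. The sketch below is an illustration under simplifying assumptions: a single head, identity query/key/value projections, no mask, and made-up token vectors. Shuffling the input rows only shuffles the output rows, so nothing in the result depends on the original order.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        # Weighted sum of the value vectors (here the tokens themselves).
        out.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # illustrative embeddings, no positions
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Row i of the shuffled output equals row perm[i] of the original output.
same_up_to_order = all(
    abs(a - b) < 1e-9
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
```

The same property holds with learned projections, because the same weights are applied to every token. Adding a position-dependent vector to each token before attention is what breaks the symmetry.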

Option 3: They have increased the dimensionality of the Feed-Forward Network from 4d to 8d to improve expressivity
- Expanding the inner dimension (the hidden size) of the FFN increases model capacity and expressivity. It roughly doubles the FFN's parameters and compute and may raise the risk of overfitting, but it is not inherently detrimental and can even help if regularization and training stability are managed. On its own, this is not a serious architectural flaw.
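The cost of the wider FFN is easy to estimate. The sketch below assumes the standard two-layer FFN with biases and borrows the embedding dimension of 768 from Option 4 purely for illustration; `ffn_param_count` is a hypothetical helper.

```python
def ffn_param_count(d_model, multiplier):
    """Parameters of a two-layer FFN: d_model -> multiplier*d_model -> d_model, with biases."""
    d_ff = multiplier * d_model
    weights = d_model * d_ff + d_ff * d_model
    biases = d_ff + d_model
    return weights + biases

d_model = 768  # assumed for illustration
params_4d = ffn_param_count(d_model, 4)
params_8d = ffn_param_count(d_model, 8)
ratio = params_8d / params_4d
```

The ratio comes out just under 2, so moving from 4d to 8d about doubles the FFN's parameters per layer. That makes the model more expensive, not broken.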

Option 4: They are using Multi-Head Attention with 64 attention heads for a model with an embedding dimension of 768
- With 64 heads and an embedding dimension of 768, each head has only 768 / 64 = 12 dimensions, far below the 64 dimensions per head of a typical 12-head configuration at this width. Such narrow heads reduce per-head expressivity and may be inefficient, but the configuration is structurally valid because 768 divides evenly by 64. The model can still function, so this is a questionable design choice, not the single most critical problem.
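The per-head width in Option 4 follows from dividing the embedding dimension by the number of heads. A small sketch, with `head_dim` as a hypothetical helper:

```python
def head_dim(d_model, num_heads):
    """Per-head dimension; the embedding must split evenly across the heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads

narrow = head_dim(768, 64)   # the team's configuration
typical = head_dim(768, 12)  # a common choice at this width
```

Here `narrow` is 12 and `typical` is 64. A count such as 100 heads would raise an error because 768 does not divide evenly, which is the kind of setting that would be structurally invalid.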

Why the other options are less severe: Options 1, 3, and 4 describe settings that may be suboptimal or inefficient in some scenarios, but none of them removes the model's ability to learn sequence structure or causes systemic architectural failure. Option 2 strips away the model's capacity to capture order information, which is a foundational requirement for language and other sequential tasks. Without positional information, attention cannot tell which token came first, so performance degrades severely across typical Transformer applications.

In summary, removing positional encodings eliminates the core mechanism for encoding sequence order, making Option 2 the most problematic architectural choice among the four. The remaining options are debatable or context-dependent choices, not fundamental flaws in the Transformer's ability to process sequences.

11785/11685/11485 Quiz-14
