Question
11785/11685/11485 Quiz-14
Single-answer multiple-choice question
A team of data scientists is experiencing poor performance with their Transformer model. After inspecting their implementation, you find several suspicious design choices. Which one would you identify as the MOST problematic based on the architectural principles discussed in the lecture?
Options
A. They have implemented Pre-Norm rather than Post-Norm for layer normalization
B. They have removed all position encodings because “the attention weights already capture token relationships.”
C. They have increased the dimensionality of the Feed-Forward Network from 4d to 8d to improve expressivity
D. They are using Multi-Head Attention with 64 attention heads for a model with an embedding dimension of 768
Standard answer
B. They have removed all position encodings because “the attention weights already capture token relationships.”
Analysis
Question restated: a data science team's Transformer model is performing poorly, and the implementation contains several suspicious design choices. Which choice is MOST problematic based on the architectural principles discussed in the lecture?
Option 1: They have implemented Pre-Norm rather than Post-Norm for layer normalization
- Pre-Norm vs Post-Norm: some literature suggests Pre-Norm can improve gradient flow in very deep Transformers, reducing training instability in some cases; the original Transformer used Post-Norm, while Pre-Norm is the default in most modern large models. Calling this inherently 'most problematic' ignores context; it is a design preference with trade-offs and is not guaranteed to cause the worst issues across architectures (see the Pre-Norm/Post-Norm sketch after this analysis).
Option 2: They have removed all position encodings because “the attention weights already capture token relationships.”
- This is the fundamental flaw. Self-attention is permutation-equivariant: attention weights are computed from token content alone, so without position encodings the model sees its input as an unordered bag of tokens and cannot distinguish “the dog bit the man” from “the man bit the dog.” The stated justification is backwards; attention cannot capture an ordering that was never encoded (see the permutation sketch after this analysis).
Option 3: They have increased the dimensionality of the Feed-Forward Network from 4d to 8d
- A wider FFN raises parameter count and compute and may encourage overfitting, but 4d is a convention rather than a requirement; this choice does not break the architecture.
Option 4: They are using Multi-Head Attention with 64 heads for an embedding dimension of 768
- Each head would get only 768 / 64 = 12 dimensions, which is unusually small (BERT-base, for comparison, uses 12 heads of 64 dimensions each), so this is likely suboptimal, but the model remains functional.
Conclusion: Option 2 is the MOST problematic choice, so the answer is B; the other three are debatable hyperparameter decisions rather than violations of a core architectural principle.
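To make the Option 2 point concrete, below is a minimal PyTorch sketch of single-head scaled dot-product self-attention with no position encodings. Everything here (the dimensions, the weight names, the self_attention helper) is illustrative rather than the lecture's implementation; the point is that permuting the input tokens merely permutes the output rows, so the layer is blind to word order.

import torch

torch.manual_seed(0)

d_model, seq_len = 16, 5
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

def self_attention(X):
    # Single-head scaled dot-product self-attention, no position encodings.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ V

X = torch.randn(seq_len, d_model)   # token embeddings, in sentence order
perm = torch.randperm(seq_len)      # shuffle the "sentence"

out = self_attention(X)
out_perm = self_attention(X[perm])

# Permuted input gives exactly the permuted output: the layer carries no
# information about token order, which is precisely what position encodings add.
print(torch.allclose(out[perm], out_perm, atol=1e-5))  # True

If position encodings were added to X before the attention, the equality above would break, restoring sensitivity to order.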
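The Option 1 trade-off is also easy to see in code. Below is a hedged sketch of the two residual layouts, where norm and sublayer stand in for LayerNorm and any attention/FFN sublayer (again an illustration, not the team's code):

import torch
import torch.nn as nn

d_model = 768
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or FFN

def post_norm(x):
    # Original Transformer (Vaswani et al., 2017): normalize after the residual add.
    return norm(x + sublayer(x))

def pre_norm(x):
    # Common modern variant: normalize before the sublayer; the residual path
    # stays an identity, which tends to ease gradient flow in deep stacks.
    return x + sublayer(norm(x))

x = torch.randn(4, d_model)
print(post_norm(x).shape, pre_norm(x).shape)  # both torch.Size([4, 768])

Both layouts are valid designs with different training dynamics, which is why Option 1 is a trade-off rather than an error.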