Questions

11785/11685/11485 Quiz-14

Single choice

A team of data scientists is experiencing poor performance with their Transformer model. After inspecting their implementation, you find several suspicious design choices. Which one would you identify as the MOST problematic based on the architectural principles discussed in the lecture?

Options
A. They have implemented Pre-Norm rather than Post-Norm for layer normalization
B. They have removed all position encodings because “the attention weights already capture token relationships.”
C. They have increased the dimensionality of the Feed-Forward Network from 4d to 8d to improve expressivity
D. They are using Multi-Head Attention with 64 attention heads for a model with an embedding dimension of 768

Verified Answer
B. They have removed all position encodings because “the attention weights already capture token relationships.”
Step-by-Step Analysis
Question restated: A data science team's Transformer model is performing poorly, and several design choices look suspicious. Which one is MOST problematic based on the architectural principles discussed in the lecture?

Option A: Pre-Norm rather than Post-Norm. The original Transformer used Post-Norm, but much of the literature suggests Pre-Norm improves gradient flow and training stability in deep Transformers, and many modern models adopt it. This is a defensible design choice with trade-offs, not a fundamental error.

Option B: Removing all position encodings. Self-attention is permutation-equivariant: it treats its input as an unordered set of tokens, so permuting the input simply permutes the output. Attention weights capture pairwise similarity between tokens, not their order, so the team's justification is wrong; without position encodings the model cannot distinguish "the dog bit the man" from "the man bit the dog." This breaks a core architectural requirement.

Option C: Growing the Feed-Forward Network from 4d to 8d. This adds parameters and compute and may raise the risk of overfitting, but it increases expressivity rather than removing any capability.

Option D: 64 attention heads with embedding dimension 768 gives 768 / 64 = 12 dimensions per head, well below the typical 64 (e.g., BERT-base uses 12 heads of 64 dimensions each). Such narrow heads may be individually weak, but the model still functions.

Conclusion: Option B is the most problematic, because position encodings are the Transformer's only source of word-order information.
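To make the Option B argument concrete, here is a minimal sketch in PyTorch (a toy single-head attention with random projection weights; the names self_attention, W_q, W_k, and W_v are illustrative, not from the lecture) showing that attention without position encodings is blind to token order:

import torch

torch.manual_seed(0)

d = 8                                   # toy embedding dimension
W_q = torch.randn(d, d)                 # random projection weights for the demo
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

def self_attention(X):
    # Scaled dot-product self-attention over a (seq_len, d) input.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

X = torch.randn(5, d)                   # a "sentence" of 5 token embeddings
perm = torch.randperm(5)                # shuffle the word order

out_original = self_attention(X)
out_shuffled = self_attention(X[perm])

# The shuffled output is just the original output shuffled the same way:
# attention alone carries no notion of which token came first.
print(torch.allclose(out_shuffled, out_original[perm], atol=1e-5))  # True

Adding position encodings (sinusoidal or learned) to X before attention breaks this symmetry, which is exactly the information the team discarded.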
