
FINTECH 540.01.Fa25 Final Exam

Single choice

On scaled dot-product attention and training stability of a transformer:

I. Without scaling by $\sqrt{D_k}$, the variance of the dot product $q_n^\top k_m$ grows with dimensionality, producing large logits that can saturate the softmax.
II. Scaling by $\sqrt{D_k}$ primarily solves exploding-gradient problems inside the value projection $V$.
III. The softmax-normalized matrix $\mathrm{Softmax}(QK^\top)$ is applied row-wise, making each row represent how strongly a query attends to all keys.
IV. Scaled dot-product attention computes $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{D_k}}\right)V$, and the resulting matrix always has the same dimension as $V$.
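Before the analysis, a minimal NumPy sketch of the computation these statements refer to; the function name, shapes, and random inputs are illustrative assumptions, not part of the exam:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(D_k)) V.

    Q: (N, D_k) queries, K: (M, D_k) keys, V: (M, D_v) values.
    Returns an (N, D_v) matrix, one output row per query.
    """
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)           # (N, M) scaled logits
    # Row-wise softmax: each row sums to 1 and describes how strongly
    # one query attends to all keys (statement III).
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # (N, D_v)

# Illustrative shapes only (not taken from the exam):
rng = np.random.default_rng(0)
N, D_k, D_v = 4, 64, 32
Q = rng.standard_normal((N, D_k))
K = rng.standard_normal((N, D_k))
V = rng.standard_normal((N, D_v))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 32) -- same shape as V in the self-attention case
```

The row-wise softmax and the division by $\sqrt{D_k}$ are the two details the statements above hinge on.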

Step-by-Step Analysis
Let's test each statement about scaled dot-product attention and training stability against the standard formulation.

Statement I is correct in spirit. The unscaled dot product $q_n^\top k_m$ has a variance that grows with the dimensionality $D_k$, so the logits fed to the softmax become large and the resulting distribution becomes very peaky as $D_k$ increases, which can saturate the softmax and shrink its gradients. Dividing by $\sqrt{D_k}$ is introduced precisely to counteract this, keeping the variance of the entries of $QK^\top/\sqrt{D_k}$ roughly constant regardless of $D_k$ (see the derivation below).

Statement II misattributes the purpose of the scaling: it acts on the logits $QK^\top$ before the softmax and has nothing to do with gradients inside the value projection $V$, so it is false. Statement III is correct, since the softmax is applied row-wise and each row of the attention matrix describes how strongly one query attends to all keys. Statement IV states the standard formula $\mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\!\left(QK^\top/\sqrt{D_k}\right)V$; in self-attention, where queries, keys, and values share the same sequence length, the output has the same shape as $V$.
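A short derivation of the variance claim behind Statement I, under the common modeling assumption that the query and key entries are i.i.d. with zero mean and unit variance (an illustrative assumption, not something stated in the exam):

```latex
% Variance of an unscaled dot product, assuming i.i.d. zero-mean,
% unit-variance query and key entries (modeling assumption).
\[
  \operatorname{Var}\!\left(q_n^\top k_m\right)
  = \operatorname{Var}\!\left(\sum_{i=1}^{D_k} q_{n,i}\, k_{m,i}\right)
  = \sum_{i=1}^{D_k} \operatorname{Var}(q_{n,i})\,\operatorname{Var}(k_{m,i})
  = D_k ,
\qquad
  \operatorname{Var}\!\left(\frac{q_n^\top k_m}{\sqrt{D_k}}\right)
  = \frac{D_k}{D_k} = 1 .
\]
```

Under these assumptions the unscaled logit variance grows linearly in $D_k$, while the scaled logits keep unit variance, which is exactly the stabilization Statement I describes.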
