Question

FINTECH 540.01.Fa25 Final Exam

Multiple choice (single answer)

On scaled dot-product attention and training stability of a transformer:

I. Without scaling by √Dk, the variance of the dot product q_n^⊤ k_m grows with dimensionality, producing large logits that can saturate the softmax.
II. Scaling by √Dk primarily solves exploding-gradient problems inside the value projection V.
III. The softmax-normalized matrix Softmax(QK^⊤) is applied row-wise, making each row represent how strongly a query attends to all keys.
IV. Scaled dot-product attention computes Attention(Q, K, V) = Softmax(QK^⊤ / √Dk) V, and the resulting matrix always has the same dimension as V.
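For reference, here is a minimal single-head, unbatched NumPy sketch of the formula in statement IV. The function name `scaled_dot_product_attention` and the concrete shapes are illustrative assumptions, not part of the exam question.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute Softmax(Q K^T / sqrt(Dk)) V with a row-wise softmax."""
    Dk = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(Dk)                    # (N, M) score matrix
    logits -= logits.max(axis=-1, keepdims=True)      # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V                                # (N, Dv) output

# Illustrative shapes: N queries, M key/value pairs.
N, M, Dk, Dv = 4, 6, 8, 5
rng = np.random.default_rng(0)
Q = rng.normal(size=(N, Dk))
K = rng.normal(size=(M, Dk))
V = rng.normal(size=(M, Dv))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # (4, 5): as many rows as Q, as many columns as V
```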

Analysis
Let’s parse the statements about scaled dot-product attention and training stability in transformers, and test each one against the standard formulation.

Statement I: 'Without scaling by √Dk, the variance of the dot product q_n^⊤ k_m grows with dimensionality, producing large logits that can saturate the softmax.' This is correct in spirit. If the query and key entries are roughly zero-mean with unit variance, q_n^⊤ k_m is a sum of Dk independent products, so its variance grows roughly linearly with Dk; the resulting large-magnitude logits make the distribution produced by the softmax very peaky as Dk increases, pushing it toward saturation (and its gradients toward zero). Scaling by sqrt(Dk) is introduced precisely to counteract this, keeping the variance of QK^⊤/√Dk roughly constant regardless of Dk.
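As a quick numerical check of this variance argument (an illustrative sketch, not part of the exam): with i.i.d. standard-normal query and key entries, the unscaled dot product has variance close to Dk, while the scaled version stays near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

for Dk in (4, 64, 512):
    q = rng.normal(size=(n_trials, Dk))
    k = rng.normal(size=(n_trials, Dk))
    dots = np.einsum("ij,ij->i", q, k)          # one q^T k dot product per trial
    print(f"Dk={Dk:4d}  var(q^T k) ≈ {dots.var():7.1f}  "
          f"var(q^T k / sqrt(Dk)) ≈ {(dots / np.sqrt(Dk)).var():.2f}")

# Typical output: unscaled variances near 4, 64, 512; scaled variances near 1,
# which is why the logits fed to the softmax stay well-behaved as Dk grows.
```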
