้ข˜็›ฎ
้ข˜็›ฎ

CS-7643-O01, OAN, OSZ Quiz #4: Module 3

ๆ•ฐๅ€ผ้ข˜

Context: Let's look at a simple example of why vanishing and exploding gradients occur in RNNs. Consider a univariate version of an RNN with the following update rules:

z(t) = u x(t) + w h(t−1)
h(t) = φ(z(t))

To keep things simple, assume φ is the identity function, i.e., φ(i) = i. Suppose we have a final loss L and have computed the derivative ∂L/∂h_T for some t = T. Using the update rules, the value of ∂h_T/∂h_1 comes out to be w^(aT + b).

Main Question: What is the value of a?
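A quick numerical sanity check of the setup (not part of the original quiz): with the identity activation, h(t) = u x(t) + w h(t−1) is linear in h(t−1), so ∂h_T/∂h_1 should equal w^(T−1). The values of u, w, T and the inputs x(t) below are assumed purely for illustration.

```python
# With the identity activation, h(t) = u*x(t) + w*h(t-1), so each step
# contributes dh(t)/dh(t-1) = w and the chain rule gives dh_T/dh_1 = w**(T-1).
# Sample values for u, w, T and the inputs x(t) are assumed for illustration.

def dh_T_dh_1(w, T):
    """Analytic derivative: the product of dh(t)/dh(t-1) = w over T-1 steps."""
    deriv = 1.0
    for _ in range(T - 1):
        deriv *= w  # each unrolled step multiplies in another factor of w
    return deriv

def finite_difference(u, w, x, eps=1e-6):
    """Perturb h(1) and roll the recurrence forward to estimate dh_T/dh_1."""
    def roll(h1):
        h = h1
        for xt in x[1:]:          # steps t = 2 .. T
            h = u * xt + w * h
        return h
    h1 = u * x[0]                 # h(1) from the first input
    return (roll(h1 + eps) - roll(h1 - eps)) / (2 * eps)

u, w, T = 0.5, 0.9, 6
x = [1.0, -0.3, 0.7, 0.2, -1.1, 0.4]   # T = 6 inputs, arbitrary values
print(dh_T_dh_1(w, T))                  # w**(T-1) = 0.9**5
print(finite_difference(u, w, x))       # agrees, since the map is linear in h(1)
```

Matching w^(T−1) against the given form w^(aT + b) identifies a = 1 and b = −1.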

ๆŸฅ็œ‹่งฃๆž

ๆŸฅ็œ‹่งฃๆž

ๆ ‡ๅ‡†็ญ”ๆกˆ
Please login to view
ๆ€่ทฏๅˆ†ๆž
We start by restating the problem setup. In a simple RNN with univariate x and identity activation, z(t) = u x(t) + w h(t−1) and h(t) = φ(z(t)). Since φ is the identity, h(t) = z(t) = u x(t) + w h(t−1) for all t. By the chain rule, ∂h(t)/∂h(t−1) = w at every step, so

∂h_T/∂h_1 = ∏_{t=2}^{T} ∂h(t)/∂h(t−1) = w^(T−1).

Matching w^(T−1) against the given form w^(aT + b) yields a = 1 (and b = −1). This same factorization explains vanishing and exploding gradients: the identical factor w is multiplied in at every time step, so the gradient shrinks toward zero when |w| < 1 and blows up when |w| > 1.

็™ปๅฝ•ๅณๅฏๆŸฅ็œ‹ๅฎŒๆ•ด็ญ”ๆกˆ

ๆˆ‘ไปฌๆ”ถๅฝ•ไบ†ๅ…จ็ƒ่ถ…50000้“่€ƒ่ฏ•ๅŽŸ้ข˜ไธŽ่ฏฆ็ป†่งฃๆž,็Žฐๅœจ็™ปๅฝ•,็ซ‹ๅณ่Žทๅพ—็ญ”ๆกˆใ€‚

็ฑปไผผ้—ฎ้ข˜

ย  What are the drawbacks of Recurrent Neural Networks (RNNs)? ย  I RNNs can only solve regression problems. II RNNs can only produce single-valued outputs. III RNNs suffer from vanishing gradients, which make it difficult to know which direction the parameters should move, and exploding gradients, which can make learning unstable. IV One can only use the sigmoid function as the activation function for its hidden layers. ย 

ย  Select all that apply to the figure below ย  I The h dots represents the intermediate output of the sequential operation. II It is the unrolling of a Recurrent Neural Network module. III It represents a feedforward layer where each module A is a neuron. IV It represents how a specific type of neural network can use sequential information. ย 

ๅ‡่ฎพไฝ ๅฐ่ฏ•ๅฐ†็ฅž็ป็ฝ‘็ปœๆ‹ŸๅˆๅˆฐไปŽๆญฃๅผฆๆ›ฒ็บฟๅ‡ฝๆ•ฐ้‡‡ๆ ท็š„ๆ•ฐๆฎไธญใ€‚ไฝ ็š„็ฝ‘็ปœๅชๆœ‰ไธ€ไธช่พ“ๅ…ฅ๏ผˆ็›ธ๏ผ‰ใ€‚ๅ“ช็ง็ฅž็ป็ฝ‘็ปœๆœ€้€‚ๅˆ๏ผŸ Suppose that you are trying to fit a neural network into data that were sampled from a sine-curve function. Your network has only one input (phase). Which neural network is best suited for this?

Given an n-character word, we want to predict which character would be the (n+1)th character in the sequence. For example, our input is “predictio” (which is a 9-character word) and we have to predict what would be the 10th character. Which of the following neural network architectures would be best suited to complete this task?
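The task above points at a recurrent architecture, since the input length n varies. A minimal, untrained sketch of the forward pass is below: the RNN consumes one character at a time and scores the (n+1)th character from its final hidden state. All names and weights here are assumed for illustration; a real model would learn the weights by backpropagation through time.

```python
# Minimal sketch (assumed, untrained) of an RNN scoring the next character.
# Weights are arbitrary deterministic values so the example is self-contained.
import math

VOCAB = "abcdefghijklmnopqrstuvwxyz"
H = 8  # hidden size, chosen arbitrarily

def one_hot(ch):
    v = [0.0] * len(VOCAB)
    v[VOCAB.index(ch)] = 1.0
    return v

def weight(i, j, seed):
    # Deterministic pseudo-random weight in place of learned parameters.
    return math.sin(seed + 7 * i + 13 * j) * 0.1

def step(h, x):
    # Standard Elman update: h(t) = tanh(W_hh h(t-1) + W_xh x(t)).
    return [math.tanh(
        sum(weight(i, j, 1) * h[j] for j in range(H)) +
        sum(weight(i, k, 2) * x[k] for k in range(len(VOCAB))))
        for i in range(H)]

def predict_next(word):
    h = [0.0] * H
    for ch in word:               # unroll the RNN over the input characters
        h = step(h, one_hot(ch))
    # Output layer: one score per vocabulary character from the final state.
    scores = [sum(weight(c, j, 3) * h[j] for j in range(H))
              for c in range(len(VOCAB))]
    return VOCAB[scores.index(max(scores))]

print(predict_next("predictio"))  # some character; untrained, so arbitrary
```

The key property is that the same `step` weights are reused at every position, which is what lets one network handle words of any length n.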

ๆ›ดๅคš็•™ๅญฆ็”Ÿๅฎž็”จๅทฅๅ…ท

ๅŠ ๅ…ฅๆˆ‘ไปฌ๏ผŒ็ซ‹ๅณ่งฃ้” ๆตท้‡็œŸ้ข˜ ไธŽ ็‹ฌๅฎถ่งฃๆž๏ผŒ่ฎฉๅคไน ๅฟซไบบไธ€ๆญฅ๏ผ