Question
COMP90054_2025_SM2 Supplementary or Special Exam: AI Planning for Autonomy (COMP90054_2025_SM2) - Requires Respondus LockDown Browser
Single-choice question
The value of an action $q_\pi(s,a)$ depends on the expected next reward and the expected value of the next state. We can think of this in terms of a small backup diagram, as follows: Let $P(s' \mid s,a)$ be the transition probability and $\bar{r}(s,a,s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$ the expected reward for the transition from state $s$ to state $s'$ via action $a$. Rearrange the definition of $q_\pi(s,a)$ in terms of these quantities, such that no expected-value notation appears in the equation.

A. $q_\pi(s,a) = \sum_{s'} P(s' \mid s,a)\,[\bar{r}(s,a,s') + \gamma\, q_\pi(s',a)]$
B. $q_\pi(s,a) = \sum_{s'} [\bar{r}(s,a,s') + \gamma]\, P(s' \mid s,a)\, v_\pi(s')$
C. $q_\pi(s,a) = \sum_{s'} P(s' \mid s,a)\,[\bar{r}(s,a,s') + \gamma\, v_\pi(s')]$
D. $q_\pi(s,a) = P[s' \mid s,a]\,[\bar{r}(s,a,s') + \gamma\, v_\pi(s')]$
Options
A. D
B. B
C. A
D. C

Standard answer
D (i.e., the Bellman equation in option C; see the analysis below).
Analysis
We begin by identifying the underlying Bellman equation for action values under policy $\pi$. The quantity $q_\pi(s,a)$ is the expected return when taking action $a$ in state $s$ and thereafter following policy $\pi$. Writing this as an expectation over next states $s'$ gives the immediate expected reward plus the discounted value of the next state. Note that the action is fixed to $a$ only at state $s$; from $s'$ onwards, actions are again drawn from $\pi$, so the continuation value is $v_\pi(s')$, not $q_\pi(s',a)$. This yields $q_\pi(s,a) = \sum_{s'} P(s' \mid s,a)\,[\bar{r}(s,a,s') + \gamma\, v_\pi(s')]$. Now we evaluate each option. Option A wrongly keeps the action fixed at $a$ in the next state, using $q_\pi(s',a)$. Option B misplaces the discount factor, adding $\gamma$ to the reward instead of multiplying it by $v_\pi(s')$. Option C matches the derivation above. Option D omits the sum over next states $s'$. Hence option C is the correct equation.
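As a sanity check, the equation in option C can be verified numerically. The sketch below is illustrative only: the transition matrix `P`, reward table `r_bar`, and policy `pi` are made-up numbers, not taken from the exam. It solves the linear Bellman system for $v_\pi$, computes $q_\pi$ via option C, and confirms the consistency identity $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a)$.

```python
import numpy as np

gamma = 0.9
n_s, n_a = 2, 2  # hypothetical 2-state, 2-action MDP

# P[s, a, s'] = transition probability; r_bar[s, a, s'] = expected reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r_bar = np.array([[[1.0, 0.0], [0.0, 2.0]],
                  [[0.5, 0.5], [1.0, -1.0]]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])  # pi[s, a] = pi(a | s)

# Solve v_pi from the linear Bellman system:
#   v_pi(s) = sum_a pi(a|s) sum_{s'} P(s'|s,a) [r_bar(s,a,s') + gamma v_pi(s')]
P_pi = np.einsum('sa,sat->st', pi, P)            # state-to-state transitions under pi
r_pi = np.einsum('sa,sat,sat->s', pi, P, r_bar)  # expected one-step reward under pi
v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

# Option C: q_pi(s,a) = sum_{s'} P(s'|s,a) [r_bar(s,a,s') + gamma v_pi(s')]
q_pi = np.einsum('sat,sat->sa', P, r_bar + gamma * v_pi[None, None, :])

# Consistency check: v_pi(s) must equal sum_a pi(a|s) q_pi(s,a)
assert np.allclose(v_pi, np.einsum('sa,sa->s', pi, q_pi))
print(q_pi)
```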
Similar questions
Shown is the Q Actor-Critic (QAC) algorithm, with line numbers (a hedged numeric sketch of the inner-loop updates appears after this list).

1. Initialise $s$, $\theta$
2. Sample $a \sim \pi_\theta$
3. for each step do
4.   Sample reward $r = R_s^a$; sample transition $s' \sim P_{s,\cdot}^a$
5.   Sample action $a' \sim \pi_\theta(s', a')$
6.   $\delta = r + \gamma\, Q_w(s',a') - Q_w(s,a)$
7.   $\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)$
8.   $w \leftarrow w + \beta\, \delta\, \phi(s,a)$
9.   $a \leftarrow a'$, $s \leftarrow s'$
10. end for

Which of the following statements is true (can be more than one)?
Which statement best describes the difference between SARSA and Q-learning?
Which of the following best describes a key difference between Monte Carlo and Temporal-Difference (TD) learning?
Select all of the following methods that use bootstrapping to estimate values
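For concreteness, here is a minimal sketch of the QAC updates on lines 6-8 above, assuming a linear critic $Q_w(s,a) = w^\top \phi(s,a)$ with one-hot features and a softmax actor. All concrete names and numbers (`phi`, `n_states`, the random stand-in dynamics) are illustrative assumptions, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
gamma, alpha, beta = 0.9, 0.01, 0.05  # discount, actor and critic step sizes

def phi(s, a):
    """One-hot feature vector for the (state, action) pair (an assumption)."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

theta = np.zeros(n_states * n_actions)  # actor parameters
w = np.zeros(n_states * n_actions)      # critic parameters

def pi_probs(s):
    """Softmax policy pi_theta(a | s) over action preferences theta . phi(s, a)."""
    prefs = np.array([theta @ phi(s, a) for a in range(n_actions)])
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def grad_log_pi(s, a):
    """grad_theta log pi_theta(a | s) for the softmax actor."""
    p = pi_probs(s)
    return phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))

def Q(s, a):
    """Linear critic Q_w(s, a) = w . phi(s, a)."""
    return w @ phi(s, a)

# Lines 1-2: initialise state and sample the first action
s = 0
a = rng.choice(n_actions, p=pi_probs(s))
for _ in range(100):
    r = rng.normal()                                    # line 4: sample reward (stand-in)
    s_next = rng.integers(n_states)                     # line 4: sample transition (stand-in)
    a_next = rng.choice(n_actions, p=pi_probs(s_next))  # line 5: sample next action
    delta = r + gamma * Q(s_next, a_next) - Q(s, a)     # line 6: TD error
    theta = theta + alpha * grad_log_pi(s, a) * Q(s, a) # line 7: actor update
    w = w + beta * delta * phi(s, a)                    # line 8: critic update
    s, a = s_next, a_next                               # line 9
```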