Questions
COMP90054_2025_SM2 Supplementary or Special Exam: AI Planning for Autonomy - Requires Respondus LockDown Browser
Single choice
The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected value of the next state. We can think of this in terms of a small backup diagram, as follows. Let $P(s' \mid s, a)$ be the transition probability and $\bar{r}(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$ the expected reward for the transition from state $s$ to state $s'$ via action $a$. Rearrange the definition of $q_\pi(s, a)$ in terms of these quantities, such that no expected-value notation appears in the equation.
A. $q_\pi(s, a) = \sum_{s'} P(s' \mid s, a)\,[\bar{r}(s, a, s') + \gamma\, q_\pi(s', a)]$
B. $q_\pi(s, a) = \sum_{s'} [\bar{r}(s, a, s') + \gamma]\, P(s' \mid s, a)\, v_\pi(s')$
C. $q_\pi(s, a) = \sum_{s'} P(s' \mid s, a)\,[\bar{r}(s, a, s') + \gamma\, v_\pi(s')]$
D. $q_\pi(s, a) = P(s' \mid s, a)\,[\bar{r}(s, a, s') + \gamma\, v_\pi(s')]$
Options
A. D
B. B
C. A
D. C

Verified Answer
Please login to view
Step-by-Step Analysis
We begin by identifying the underlying Bellman equation for action values under policy $\pi$. The quantity $q_\pi(s, a)$ is the expected return from taking action $a$ in state $s$ and thereafter following policy $\pi$. It can be written as an expectation over next states $s'$ of the immediate reward plus the discounted value of the next state: the action taken at $s$ is fixed to $a$, but every subsequent action is chosen by $\pi$, so the continuation value is the state value $v_\pi(s')$ rather than an action value tied to $a$. Replacing the expectation with an explicit sum over next states $s'$ weighted by $P(s' \mid s, a)$ removes all expected-value notation. Now we evaluate each option.
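As a concrete check of this backup, here is a minimal Python sketch (not part of the exam or its solution) that evaluates $\sum_{s'} P(s' \mid s, a)\,[\bar{r}(s, a, s') + \gamma\, v_\pi(s')]$ on a tiny, made-up two-state MDP; the tables P, r_bar and v_pi below are hypothetical values chosen only for illustration.

# Minimal sketch: the Bellman expectation backup for an action value,
# q_pi(s, a) = sum_{s'} P(s' | s, a) * [ r_bar(s, a, s') + gamma * v_pi(s') ],
# evaluated on a tiny, hypothetical two-state MDP (all tables are made up).

GAMMA = 0.9

# P[(s, a)] maps next states s' to transition probabilities P(s' | s, a).
P = {
    ("s0", "a0"): {"s0": 0.2, "s1": 0.8},
    ("s0", "a1"): {"s0": 0.9, "s1": 0.1},
}

# r_bar[(s, a, s')] is the expected reward for the transition s --a--> s'.
r_bar = {
    ("s0", "a0", "s0"): 0.0, ("s0", "a0", "s1"): 1.0,
    ("s0", "a1", "s0"): 0.5, ("s0", "a1", "s1"): 2.0,
}

# v_pi[s'] is the (assumed already computed) state value under policy pi.
v_pi = {"s0": 1.0, "s1": 3.0}

def q_pi(s, a):
    """Back up the action value from one-step rewards and next-state values."""
    return sum(p * (r_bar[(s, a, s_next)] + GAMMA * v_pi[s_next])
               for s_next, p in P[(s, a)].items())

print(q_pi("s0", "a0"))  # 0.2*(0.0 + 0.9*1.0) + 0.8*(1.0 + 0.9*3.0) = 3.14
print(q_pi("s0", "a1"))  # 0.9*(0.5 + 0.9*1.0) + 0.1*(2.0 + 0.9*3.0) = 1.73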
Similar Questions
Shown is the Q Actor-Critic (QAC) function, with line numbers (see the Python sketch after this list):
1. Initialise $s$, $\theta$
2. Sample $a \sim \pi_\theta$
3. for each step do
4.   Sample reward $r = R_s^a$; sample transition $s' \sim P_{s,\cdot}^a$
5.   Sample action $a' \sim \pi_\theta(s', a')$
6.   $\delta = r + \gamma\, Q_w(s', a') - Q_w(s, a)$
7.   $\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$
8.   $w \leftarrow w + \beta\, \delta\, \phi(s, a)$
9.   $a \leftarrow a'$, $s \leftarrow s'$
10. end for
Which of the following statements is true (can be more than one)?
Which statement best describes the difference between SARSA and Q-learning?
Which of the following best describes a key difference between Monte Carlo and Temporal-Difference (TD) learning?
Select all of the following methods that use bootstrapping to estimate values
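For reference, here is a minimal Python sketch of the QAC loop from the first similar question above. It is not the course's reference implementation: the tiny random MDP, the one-hot features phi(s, a), the linear critic Q_w(s, a) = w . phi(s, a), and the tabular softmax actor are illustrative assumptions; the comments point back to the numbered lines of the pseudocode.

# Hypothetical sketch of the Q Actor-Critic (QAC) loop, assuming a small
# random MDP, a tabular softmax actor, and a linear critic with one-hot features.
import numpy as np

rng = np.random.default_rng(0)
N_S, N_A = 3, 2
GAMMA, ALPHA, BETA = 0.95, 0.05, 0.1

# Made-up MDP tables: P[s, a] is a distribution over next states, R[s, a] a reward.
P = rng.dirichlet(np.ones(N_S), size=(N_S, N_A))
R = rng.normal(size=(N_S, N_A))

theta = np.zeros((N_S, N_A))   # actor parameters (softmax preferences)
w = np.zeros(N_S * N_A)        # critic weights

def phi(s, a):
    """One-hot feature vector for the (s, a) pair."""
    f = np.zeros(N_S * N_A)
    f[s * N_A + a] = 1.0
    return f

def Q(s, a):
    return w @ phi(s, a)

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(s, a):
    """Gradient of log softmax policy w.r.t. theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

s = 0                                            # line 1: initialise s, theta
a = rng.choice(N_A, p=policy(s))                 # line 2: sample a ~ pi_theta
for _ in range(5000):                            # line 3: for each step
    r = R[s, a]                                  # line 4: reward (deterministic table here)
    s_next = rng.choice(N_S, p=P[s, a])          # line 4: sample transition s'
    a_next = rng.choice(N_A, p=policy(s_next))   # line 5: sample a' ~ pi_theta
    delta = r + GAMMA * Q(s_next, a_next) - Q(s, a)   # line 6: TD error
    theta += ALPHA * grad_log_pi(s, a) * Q(s, a)      # line 7: actor update
    w += BETA * delta * phi(s, a)                     # line 8: critic update
    s, a = s_next, a_next                             # line 9: advance to s', a'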