Shown is the Q Actor-Critic (QAC) function, with line numbers.

1. Initialise s, θ
2. Sample a ∼ π_θ
3. for each step do
4.      Sample reward r = R_s^a; sample transition s' ∼ P_{s,·}^a
5.      Sample action a' ∼ π_θ(s', a')
6.      δ = r + γ Q_w(s', a') − Q_w(s, a)
7.      θ ← θ + α ∇_θ log π_θ(s, a) Q_w(s, a)
8.      w ← w + β δ φ(s, a)
9.      a ← a', s ← s'
10. end for

Which of the following statements is true (can be more than one)?


BlackTom AI · Accepted Answer

First, let’s restate the setup and walk through what each line does in the Q Actor-Critic (QAC) algorithm as given.
Line 6 defines δ (the TD error) as r + γ Q_w(s', a') − Q_w(s, a). It uses the critic's current estimate Q_w to evaluate the next state-action pair (s', a'), forms the one-step target r + γ Q_w(s', a'), and compares that target with the current estimate Q_w(s, a). Both value estimates come from the critic.
Line 7 updates θ by adding α ∇_θ log π_θ(s, a) Q_w(s, a). Here θ are the actor (policy) parameters, and the update is the score function ∇_θ log π_θ(s, a) scaled by the critic's estimate Q_w(s, a). This is the actor's learning step: it shifts the policy toward actions that the critic scores highly.
Line 8 updates w by adding β δ φ(s, a). The critic's parameters w move along the feature vector φ(s, a) in proportion to the TD error δ. This is the critic's learning step, refining the action-value approximation.
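To see why line 8 is a critic update, it can help to write out where the term δ φ(s, a) comes from. The derivation below assumes a linear critic, Q_w(s, a) = φ(s, a)ᵀ w. The question does not state the form of Q_w, so this is an assumption suggested only by the appearance of φ(s, a) in line 8.

```latex
% Assumption (not stated in the question): the critic is linear in its features.
Q_w(s, a) = \phi(s, a)^\top w
\quad\Longrightarrow\quad
\nabla_w Q_w(s, a) = \phi(s, a)

% Semi-gradient TD(0): treat the target r + \gamma Q_w(s', a') as a constant
% and step w along \nabla_w Q_w(s, a), scaled by the TD error from line 6.
\delta = r + \gamma Q_w(s', a') - Q_w(s, a)

w \leftarrow w + \beta \, \delta \, \nabla_w Q_w(s, a)
  = w + \beta \, \delta \, \phi(s, a)
```

Under this assumption the last expression is exactly line 8: only w, the critic's parameter vector, changes.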
Now evaluating each option:
- Option A: "The critic is used to estimate the value function on line 6". This is true. Line 6 evaluates Q_w(s', a') and Q_w(s, a), both of which are the critic's value estimates, and combines them into the TD error δ.
- Option B: "The actor is used to estimate the value function on line 6". This is false. Line 6 does not evaluate the policy π_θ at all; the actor's only role is on line 5, where a' is sampled. The value estimates on line 6 come from the critic's Q_w.
- Option C: "Actor parameters are updated on line 7 and critic parameters are updated on line 8". This is true. Line 7 updates θ (actor parameters) via the policy-gradient term weighted by Q_w(s, a), while line 8 updates w (critic parameters) using the TD error δ and φ(s, a).
- Option D: "Critic parameters are updated on line 7 and actor parameters are updated on line 8". This is false. It swaps the two roles: line 7 updates the actor's θ, and line 8 updates the critic's w.
In summary, the true statements are A and C, while B and D are incorrect due to misattributing the roles of the lines to the wrong components of the algorithm.
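Lines 6 to 8 of the algorithm can be sketched in code to make the actor/critic split concrete. This is a minimal sketch under stated assumptions, not the course's reference implementation: the one-hot feature map `phi`, the softmax policy over linear preferences, the linear critic, the toy sizes, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

N_STATES, N_ACTIONS = 2, 2  # hypothetical toy problem sizes


def phi(s, a):
    """One-hot feature vector phi(s, a) (an assumed feature map)."""
    vec = np.zeros(N_STATES * N_ACTIONS)
    vec[s * N_ACTIONS + a] = 1.0
    return vec


def policy_probs(theta, s):
    """pi_theta(s, .): softmax over linear preferences theta . phi(s, a) (assumed form)."""
    prefs = np.array([theta @ phi(s, b) for b in range(N_ACTIONS)])
    prefs = prefs - prefs.max()  # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()


def grad_log_pi(theta, s, a):
    """Score function of the softmax policy: phi(s, a) - E_pi[phi(s, .)]."""
    probs = policy_probs(theta, s)
    expected_phi = sum(probs[b] * phi(s, b) for b in range(N_ACTIONS))
    return phi(s, a) - expected_phi


def q_value(w, s, a):
    """Critic estimate Q_w(s, a), assumed linear in the features."""
    return w @ phi(s, a)


def qac_step(theta, w, s, a, r, s_next, a_next, alpha=0.1, beta=0.5, gamma=0.9):
    """Apply lines 6-8 once and return the new (theta, w) plus the TD error."""
    q_sa = q_value(w, s, a)
    # Line 6: TD error, computed from the critic only.
    delta = r + gamma * q_value(w, s_next, a_next) - q_sa
    # Line 7: actor update, score function weighted by the critic's Q_w(s, a).
    new_theta = theta + alpha * grad_log_pi(theta, s, a) * q_sa
    # Line 8: critic update, TD error times the feature vector.
    new_w = w + beta * delta * phi(s, a)
    return new_theta, new_w, delta


# Usage: one update from an arbitrary starting point.
theta = np.zeros(N_STATES * N_ACTIONS)
w = np.array([0.0, 2.0, 1.0, 0.0])
theta, w, delta = qac_step(theta, w, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

In this sketch `qac_step` changes `theta` only through the line 7 term and `w` only through the line 8 term, which mirrors why options A and C hold.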

COMP90054_2025_SM2 Supplementary or Special Exam: AI Planning for Autonomy (COMP90054_2025_SM2)- Requires Respondus LockDown Browser
