Questions

COMP90054_2025_SM2 Supplementary or Special Exam: AI Planning for Autonomy (COMP90054_2025_SM2) - Requires Respondus LockDown Browser

Multiple choice

Shown is the Q Actor-Critic (QAC) algorithm, with line numbers.

1. Initialise s, θ
2. Sample a ∼ π_θ
3. for each step do
4.     Sample reward r = R(s, a); sample transition s' ∼ P(· | s, a)
5.     Sample action a' ∼ π_θ(s', a')
6.     δ = r + γ Q_w(s', a') − Q_w(s, a)
7.     θ ← θ + α ∇_θ log π_θ(s, a) Q_w(s, a)
8.     w ← w + β δ φ(s, a)
9.     a ← a', s ← s'
10. end for

Which of the following statements are true (more than one may apply)?

Options
A. The critic is used to estimate the value function on line 6
B. The actor is used to estimate the value function on line 6
C. Actor parameters are updated on line 7 and critic parameters are updated on line 8
D. Critic parameters are updated on line 7 and actor parameters are updated on line 8

Verified Answer
A and C
Step-by-Step Analysis
First, let's restate the setup and walk through what each line does in the Q Actor-Critic (QAC) algorithm as given. Line 6 defines δ (the TD error) as r + γ Q_w(s', a') − Q_w(s, a). This uses the critic's current Q function to evaluate the next state-action pair and compare it to the current estimate, which is precisely how the critic helps update value estimates; since Q_w is the critic rather than the actor, A is true and B is false. Line 7 shows an update to θ with α ∇_θ log π_θ(s, a) Q_w(s, a); θ parameterises the policy π_θ, so these are the actor's parameters. Line 8 then updates w, the critic's parameters, along the TD error via w ← w + β δ φ(s, a). Hence actor parameters are updated on line 7 and critic parameters on line 8: C is true and D is false.
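To make the actor/critic split concrete, here is a minimal Python sketch of the QAC loop above for a tabular problem. The softmax policy, the linear critic Q_w(s, a) = w·φ(s, a) with one-hot features, and the env interface (env.reset() returning a state, env.step(a) returning (r, s')) are illustrative assumptions, not part of the exam question.

```python
# A minimal sketch of the QAC loop above for a discrete-state, discrete-action
# environment. Policy, features, and env interface are assumed for illustration.
import numpy as np

def featurise(s, a, n_states, n_actions):
    # One-hot feature vector phi(s, a) over state-action pairs (assumed encoding).
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def qac(env, n_states, n_actions, alpha=0.01, beta=0.1, gamma=0.99, steps=10000):
    theta = np.zeros(n_states * n_actions)  # actor (policy) parameters
    w = np.zeros(n_states * n_actions)      # critic (value) parameters

    def policy_probs(s):
        # Softmax over linear preferences theta . phi(s, a).
        prefs = np.array([theta @ featurise(s, a, n_states, n_actions)
                          for a in range(n_actions)])
        prefs -= prefs.max()                # numerical stability
        exps = np.exp(prefs)
        return exps / exps.sum()

    s = env.reset()                                          # line 1: initialise s
    a = np.random.choice(n_actions, p=policy_probs(s))       # line 2: a ~ pi_theta

    for _ in range(steps):                                   # line 3
        r, s_next = env.step(a)                              # line 4 (assumed API)
        a_next = np.random.choice(n_actions, p=policy_probs(s_next))  # line 5

        q_sa = w @ featurise(s, a, n_states, n_actions)
        q_next = w @ featurise(s_next, a_next, n_states, n_actions)
        delta = r + gamma * q_next - q_sa                    # line 6: TD error

        # Line 7: actor update. For a softmax-linear policy,
        # grad_theta log pi(s, a) = phi(s, a) - sum_b pi(b|s) phi(s, b).
        probs = policy_probs(s)
        expected_phi = sum(probs[b] * featurise(s, b, n_states, n_actions)
                           for b in range(n_actions))
        theta += alpha * (featurise(s, a, n_states, n_actions) - expected_phi) * q_sa

        w += beta * delta * featurise(s, a, n_states, n_actions)  # line 8
        s, a = s_next, a_next                                # line 9
    return theta, w
```

Note how the sketch mirrors the question: θ only ever changes on the line-7 update (the actor), while w only changes on the line-8 update (the critic), and the TD error on line 6 is computed entirely from the critic's Q_w.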


Similar Questions

The value of an action q_π(s, a) depends on the expected next reward and the expected value of the next state. We can think of this in terms of a small backup diagram, as follows. Let P(s' | s, a) be the transition probability and r̄(s, a, s') = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s'] the expected reward for the transition from state s to state s' via action a. Rearrange the definition of q_π(s, a) in terms of these quantities, such that no expected-value notation appears in the equation.
A. q_π(s, a) = ∑_{s'} P(s' | s, a) [ r̄(s, a, s') + γ q_π(s', a) ]
B. q_π(s, a) = ∑_{s'} [ r̄(s, a, s') + γ ] P(s' | s, a) v_π(s')
C. q_π(s, a) = ∑_{s'} P(s' | s, a) [ r̄(s, a, s') + γ v_π(s') ]
D. q_π(s, a) = P[s' | s, a] [ r̄(s, a, s') + γ v_π(s') ]
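As a quick numerical illustration of this kind of one-step backup, the sketch below computes ∑_{s'} P(s' | s, a) [ r̄(s, a, s') + γ v_π(s') ] for a single state-action pair with two possible successor states. The transition probabilities, expected rewards, and successor values are made-up numbers, not part of the question.

```python
# A toy numerical sketch of the one-step backup for q_pi(s, a).
# All numbers below are illustrative assumptions.
import numpy as np

gamma = 0.9
P = np.array([0.7, 0.3])        # P(s' | s, a) over two successor states
r_bar = np.array([1.0, -0.5])   # expected reward r_bar(s, a, s') per successor
v_pi = np.array([2.0, 4.0])     # v_pi(s') for each successor state

# q_pi(s, a) = sum over s' of P(s'|s,a) * (r_bar(s,a,s') + gamma * v_pi(s'))
q_sa = np.sum(P * (r_bar + gamma * v_pi))
print(q_sa)  # 0.7*(1.0 + 1.8) + 0.3*(-0.5 + 3.6) = 2.89
```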

Which statement best describes the difference between SARSA and Q-learning?

Which of the following best describes a key difference between Monte Carlo and Temporal-Difference (TD) learning?

Select all of the following methods that use bootstrapping to estimate values
