Let a rectangular grid to illustrate value functions for a simple finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of −1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A*. From state B, all actions yield a reward of +5 and take the agent to B*.

A B

B*
A*

Suppose the agent selects all four actions with equal probability in all
states. The figure below shows the value function, vπ, for this policy, for the discounted reward case with γ = 0.9. Fill in the blank parts of the value function, using Bellman's equation. Use one decimal place accuracy.

3.3 8.8 4.4 5.3 1.5
1.5 3.0 1.5 0.5
0.1 0.7 0.7 0.4 -0.4
-1.0 -0.4 -0.6 -1.2
-1.9 -1.3 -1.2 -1.4 -2.0

Question

Let a rectangular grid to illustrate value functions for a simple finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of −1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A*. From state B, all actions yield a reward of +5 and take the agent to B*.

A		B	
				
			B*	
	A*

Suppose the agent selects all four actions with equal probability in all
states. The figure below shows the value function, vπ, for this policy, for the discounted reward case with γ = 0.9. Fill in the blank parts of the value function, using Bellman's equation. Use one decimal place accuracy.

3.3	8.8	4.4	5.3	1.5
1.5	3.0		1.5	0.5
0.1	0.7	0.7	0.4	-0.4
-1.0		-0.4	-0.6	-1.2
-1.9	-1.3	-1.2	-1.4	-2.0

BlackTom AI · Accepted Answer

The question asks us to fill in the missing values of the value function vπ for a fixed policy under discount γ = 0.9, using Bellman’s equation for each state.
Option analysis starts with the given candidate answer: "2.3, -0.4". This implies that the two blanked entries in the provided grid (which correspond to certain states) take the values 2.3 and -0.4 respectively under the specified policy.
To evaluate whether these are plausible, recall Bellman’s equation for a given policy: V(s) = E[ R(s,a) + γ V(s') | π(s) ], where the expectation is over the action choices under the policy. Since the policy selects all four actions with equal probability in every state, the value of a state is the average over the four actions of the immediate reward plus the discounted value of the next state the action leads to.
In this MDP, moving off the grid yields a reward of −1, and keeps you in the same state, while other actions yield 0 except transitions from A and B which yield +10 or +5 and move you to A* or B*. Because A and B have special high rewards, those transitions heavily influence the surrounding states’ values through the Bellman backups. The discount factor γ = 0.9 scales the influence of successor states, so states with access to A or B (or paths leading to A* or B*) typically have higher values, while boundary moves that stay put with −1 reward depress the value.
Given the arrangement in the grid and the equal-probability action selection, the computed vπ values for the blanked cells would have to satisfy the averaged Bellman backups for those specific cells. The proposed values 2.3 and −0.4 must reflect the mix of favorable transitions (toward A*, B*, or other high-value areas) and unfavorable boundary transitions (receiving −1 and staying put) balanced by γ.
If we consider the negative entry −0.4, it indicates a state where, on average, the discounted future value plus the immediate rewards along the four possible moves yields a small negative expectation under the policy. This can occur in regions far from A and B, where boundary penalties dominate and future rewards are attenuated by γ. Conversely, the positive value 2.3 indicates a state with comparatively good prospects under the policy, likely benefiting from transitions toward A* or B*, or from paths that reduce the impact of −1 penalties due to the averaging over all four directions.
Because the problem requires using Bellman backups for each blank cell and the only provided candidate equals the expected result from those backups under the given policy, this option is consistent with the described dynamics of the grid and the effect of γ = 0.9 on future rewards.
Other possible misunderstandings could be:
- Assuming a greedy or single-action-optimal backup rather than averaging over the four actions, which would yield different values.
- Ignoring the special transitions from A and B where rewards are +10 or +5 and move to A* or B*, which would bias values upward in surrounding cells.
- Forgetting the −1 penalty for moves that would take the agent off the grid, which would depress values for boundary states.
In summary, the values 2.3 and −0.4 arise from applying the equal-probability four-action backup with γ = 0.9, accounting for the mix of −1 penalties on boundary moves and the high-reward transitions from A and B relative to neighboring states, yielding the observed positive and negative numbers in the blanked spots.

DD2380 HT24 (AIHT24_2) Q6: Reinforcement learning

View Explanation

Log in for full answers

Similar Questions

In a Markov Decision Process with state space 𝑆 , action space 𝐴 , a state-action value function (i.e. Q-function) is a mapping from

In a consumer society, many adults channel creativity into buying things

Economic stress and unpredictable times have resulted in a booming industry for self-help products

People born without creativity never can develop it

More Practical Tools for Students Powered by AI Study Helper

Homework AI Solver

Stylized AI Paper Writer

Plagiarism Checker Assistant

Citation AI Academic Writing Tool

In-Class Translation Assistant

AI Note Generator

AI Quiz Answers

Past Exam Questions from University Test Bank

Smart Practice Assistant

Adaptive Practice

Making Your Study Simpler