A 2-by-3 grid representing our MDP world. Moving left to right, the columns are labeled A, B, C. Moving bottom to top, the rows are labeled 1, 2. Row 1, column A contains the letter "S". Row 2, column A contains the number "-1". Row 2, column C contains the number "+1".

Table: Gridworld MDP.

Figure: Transition Function (Artificial Intelligence: A Modern Approach, Stuart J. Russell and Peter Norvig).

Review the table (Gridworld MDP) and the figure (Transition Function). The gridworld MDP operates like the one discussed in lecture. The states are grid squares, identified by their column (A, B, or C) and row (1 or 2), as presented in the table. The agent always starts in state (A,1), marked with the letter S. There are two terminal goal states: (C,2) with reward +1, and (A,2) with reward -1. Rewards are 0 in non-terminal states. (The reward for a state is received before the agent applies the next action.) Under the transition function (see the figure), the intended movement (Up, Down, Left, or Right) happens with probability 0.8, and the agent instead moves in each of the two directions perpendicular to the intended one with probability 0.1. If a collision with a wall happens, the agent stays in the same state, and the drift probability is added to the probability of remaining in the same state. The discount factor is 1.

The agent starts with the policy that always chooses to go Up, and it executes three trials: the first trial is (C,1)–(C,2), the second is (C,1)–(C,2), and the third is (C,1)–(B,1)–(B,2)–(C,2). Given these traces, what is the Monte Carlo (direct utility) estimate for state (C,1)?
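
For readers who want to experiment, the transition model described in the question can be sketched in Python. This is only an illustration: the names `step`, `transition`, `MOVES`, and `PERPENDICULAR` are hypothetical. It also assumes that any blocked move, whether intended or a drift, leaves the agent where it is, and it does not treat terminal states specially.

```python
from collections import defaultdict

COLS = "ABC"      # columns, left to right
ROWS = (1, 2)     # rows, bottom to top
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
# The two directions perpendicular to each intended move.
PERPENDICULAR = {
    "Up": ("Left", "Right"), "Down": ("Left", "Right"),
    "Left": ("Up", "Down"), "Right": ("Up", "Down"),
}

def step(state, direction):
    """Deterministic move; bumping into a wall leaves the state unchanged (assumption)."""
    col, row = state
    dc, dr = MOVES[direction]
    c = COLS.index(col) + dc
    r = row + dr
    if 0 <= c < len(COLS) and r in ROWS:
        return (COLS[c], r)
    return state

def transition(state, action):
    """Return {next_state: probability} for taking `action` in a non-terminal state."""
    probs = defaultdict(float)
    probs[step(state, action)] += 0.8      # intended direction
    for side in PERPENDICULAR[action]:     # drift to either side
        probs[step(state, side)] += 0.1
    return dict(probs)

print(transition(("C", 1), "Up"))
# {('C', 2): 0.8, ('B', 1): 0.1, ('C', 1): 0.1}
```

Under these assumptions every step in the third trial has non-zero probability under the always-Up policy: (C,1) to (B,1) is a left drift, and (B,2) to (C,2) is a right drift after the Up move is blocked by the wall.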

BlackTom AI · Accepted Answer

The question describes a Gridworld MDP and provides three complete traces starting from state (C,1). We are to compute the Monte Carlo (direct utility) estimate for (C,1) based on these traces, with a discount factor of 1.

First, restating the scenario: the agent nominally starts in (A,1), the square marked S, but the traces given begin at (C,1) and list the states visited until a terminal state, (C,2) or (A,2), is reached. Rewards are 0 in non-terminal states; the agent receives +1 on reaching (C,2) and -1 on reaching (A,2). The three traces are:
- Trace 1: (C,1) → (C,2)
- Trace 2: (C,1) → (C,2)
- Trace 3: (C,1) → (B,1) → (B,2) → (C,2)

The Monte Carlo return from a given time step is the discounted sum of rewards from that step onward. With gamma = 1, this reduces to adding up the rewards along the remainder of the episode.
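
In symbols, this return and the resulting estimate can be written as below. The notation is introduced here for illustration and is not taken from the question: $s_t$ is the state at step $t$, $T$ is the step at which the terminal state is reached, $R(s)$ is the reward for state $s$, and $N(s)$ is the number of recorded visits to $s$ across all trials.

$$
G_t = \sum_{k=0}^{T-t} \gamma^{k} R(s_{t+k}),
\qquad
\hat{U}(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s)
$$

Here $G^{(i)}(s)$ is the return observed from the $i$-th visit to $s$. With $\gamma = 1$, every factor $\gamma^{k}$ equals 1, so $G_t$ is the plain sum of the remaining rewards.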

Now evaluate each trace for the state (C,1):
- Trace 1: At (C,1) the immediate reward is 0 (non-terminal). The next state is (C,2), which is terminal and provides +1. Since gamma = 1, the return from (C,1) is 0 + 1 = +1.
- Trace 2: Identical to Trace 1, so the return from (C,1) is again 0 + 1 = +1.
- Trace 3: The rewards are 0 at (C,1), 0 at (B,1), and 0 at (B,2), all non-terminal, followed by +1 at the terminal state (C,2). With gamma = 1, the return from (C,1) is 0 + 0 + 0 + 1 = +1.

Thus, the three returns from state (C,1) across the traces are all +1. The Monte Carlo estimate is the average of these returns: (1 + 1 + 1) / 3 = 1.0.
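
The same calculation can be checked with a short Python sketch. It is a minimal illustration and not part of the original question: the names `REWARD`, `TRACES`, and `direct_utility_estimate` are hypothetical, and it averages the reward-to-go over every recorded visit to a state (each trace here visits (C,1) exactly once).

```python
from collections import defaultdict

# Terminal rewards from the question; every other state has reward 0.
REWARD = {("C", 2): 1.0, ("A", 2): -1.0}

TRACES = [
    [("C", 1), ("C", 2)],
    [("C", 1), ("C", 2)],
    [("C", 1), ("B", 1), ("B", 2), ("C", 2)],
]

def direct_utility_estimate(traces, gamma=1.0):
    """Average the observed reward-to-go for each state over all its visits."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trace in traces:
        g = 0.0
        # Walk backwards so that g is the return from each visited state onward.
        for state in reversed(trace):
            g = REWARD.get(state, 0.0) + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

estimates = direct_utility_estimate(TRACES)
print(estimates[("C", 1)])  # 1.0
```

Because every trial ends in (C,2) and gamma is 1, every visited state in these traces receives the same estimate of 1.0.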

Therefore, the Monte Carlo (direct) utility estimate for state (C,1) given these traces is 1.0.

2025FallB-X-CSE571-78760 Module 7: Sequential Decision-Making Knowledge Check
