
Chapter 10: On-policy Control with Approximation

2020-08-25 10:36

Notes on Chapter 10, "On-policy Control with Approximation", of Sutton and Barto's Reinforcement Learning: An Introduction (2nd edition).

  • 1 Introduction
  • 2 On-policy control with approximation of episodic tasks
    • 2.1 General gradient-descent update
    • 2.2 Semi-gradient n-step Sarsa
  • 3 On-policy control with approximation of continuing tasks
    • 3.1 Average reward
    • 3.2 Differential Semi-gradient n-step Sarsa

    1 Introduction

    In the control problem we approximate the action-value function, $\hat{q}(s,a,\mathbf{w})\approx q_*(s,a)$ with $\mathbf{w}\in\mathbb{R}^d$, because it is easy to act from an action-value function: just select the action with the largest value. If the estimate is not accurate enough, it can be refined at decision time with rollout algorithms or Monte Carlo Tree Search.

    • For episodic tasks it is easy to extend the evaluation algorithms of Chapter 9: just act with an $\epsilon$-greedy policy (a soft version of the greedy policy). An episodic semi-gradient n-step Sarsa algorithm is given. (A minimal action-selection sketch follows this list.)
    • For continuing tasks a new definition of the return, based on the average reward, is introduced, and a differential semi-gradient n-step Sarsa algorithm is given.
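    Since both algorithms in this chapter act $\epsilon$-greedily with respect to the current estimate $\hat{q}$, here is a minimal action-selection sketch in Python. The names `q_hat`, `actions`, and `w` are placeholders for whatever approximator and action set you use; only NumPy is assumed.

```python
import numpy as np

def epsilon_greedy(q_hat, state, actions, w, epsilon=0.1, rng=None):
    """Return an index into `actions`, chosen epsilon-greedily w.r.t. q_hat(s, a, w)."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(actions)))     # explore: uniform random action
    values = [q_hat(state, a, w) for a in actions]
    return int(np.argmax(values))                  # exploit: action with largest value
```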

    2 On-policy control with approximation of episodic tasks

    2.1 General gradient-descent update

    The general gradient-descent update for action-value prediction is:

    $$\mathbf{w}_{t+1}=\mathbf{w}_{t}+\alpha\left[U_t-\hat{q}(S_t,A_t,\mathbf{w}_{t})\right]\nabla \hat{q}(S_t,A_t,\mathbf{w}_{t}) \tag{10.1}$$
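    For a linear approximator $\hat{q}(s,a,\mathbf{w})=\mathbf{w}^\top\mathbf{x}(s,a)$, the gradient in (10.1) is just the feature vector, so the update is a one-liner. A hypothetical sketch (the feature vector `x_sa` and target `U` are assumptions, not from the text):

```python
import numpy as np

def semi_gradient_update(w, x_sa, U, alpha):
    """One application of (10.1) for a linear q_hat(s, a, w) = w @ x(s, a).

    w     : weight vector, shape (d,)
    x_sa  : feature vector x(S_t, A_t), shape (d,)
    U     : update target U_t (e.g. a Sarsa or n-step return)
    alpha : step size
    """
    q = w @ x_sa                        # current estimate q_hat(S_t, A_t, w)
    return w + alpha * (U - q) * x_sa   # gradient of a linear q_hat is x(s, a) itself
```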

    2.2 Semi-gradient n-step Sarsa

    By replacing the update target in (10.1) with the n-step return

    $$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1}R_{t+n}+\gamma^{n}\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}),\quad t+n<T \tag{10.4}$$

    we get the update equation for semi-gradient n-step Sarsa:

    $$\mathbf{w}_{t+n}=\mathbf{w}_{t+n-1}+\alpha\left[G_{t:t+n}-\hat{q}(S_t,A_t,\mathbf{w}_{t+n-1})\right]\nabla \hat{q}(S_t,A_t,\mathbf{w}_{t+n-1}),\quad 0\leq t<T \tag{10.5}$$

    Episodic semi-gradient n-step Sarsa for estimating $\hat{q}\approx q_*$ or $q_{\pi}$:
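    The book presents this algorithm as a pseudocode box. Below is a condensed Python sketch of one episode, under some assumptions not in the text: a Gym-style environment whose `step` returns `(state, reward, done, info)`, a feature function `x(s, a)` returning a NumPy vector, and a linear $\hat{q}$.

```python
import numpy as np

def nstep_sarsa_episode(env, x, w, n=4, alpha=0.1, gamma=1.0, epsilon=0.1, rng=None):
    """One episode of episodic semi-gradient n-step Sarsa, eqs. (10.4)-(10.5),
    for a linear q_hat(s, a, w) = w @ x(s, a).  Returns the updated weights."""
    if rng is None:
        rng = np.random.default_rng()
    num_actions = env.action_space.n

    def q(s, a):                         # linear action-value estimate
        return w @ x(s, a)

    def policy(s):                       # epsilon-greedy w.r.t. the current q_hat
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax([q(s, a) for a in range(num_actions)]))

    S, R = [env.reset()], [0.0]          # index t holds S_t and R_t (R_0 is a dummy)
    A = [policy(S[0])]
    T, t = np.inf, 0
    while True:
        if t < T:
            s_next, r, done, _ = env.step(A[t])
            S.append(s_next); R.append(r)
            if done:
                T = t + 1
            else:
                A.append(policy(s_next))
        tau = t - n + 1                  # time whose estimate is being updated
        if tau >= 0:
            G = sum(gamma ** (i - tau - 1) * R[i]
                    for i in range(tau + 1, int(min(tau + n, T)) + 1))
            if tau + n < T:              # bootstrap with q_hat, as in (10.4)
                G += gamma ** n * q(S[tau + n], A[tau + n])
            w = w + alpha * (G - q(S[tau], A[tau])) * x(S[tau], A[tau])  # (10.5)
        if tau == T - 1:
            return w
        t += 1
```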

    3 On-policy control with approximation of continuing tasks

    The average-reward setting is, alongside the episodic and discounted settings, a third way of formulating the goal in Markov decision problems (MDPs). It applies to continuing problems, in which the interaction goes on forever without termination or start states, and it involves no discounting.

    3.1 Average reward

    Discounted values are problematic with function approximation. The root cause of the difficulty in the discounted control setting is that with function approximation we lose the policy improvement theorem (Section 4.2). It is no longer true that improving the discounted value of one state guarantees an improvement of the overall policy in any useful sense (for example, generalisation could make the policy worse elsewhere).

    Average reward:

    $$r(\pi) \doteq \lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h}\mathbb{E}\left[R_t \mid S_0,A_{0:t-1}\sim\pi\right] = \lim_{t\to\infty}\mathbb{E}\left[R_t \mid S_0,A_{0:t-1}\sim\pi\right] \tag{10.6}$$
    $$= \sum_s \mu_{\pi}(s)\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)\,r \tag{10.7}$$

    This quantity $r(\pi)$ is essentially the average reward under $\pi$, as suggested by the last expression in (10.7). In particular, we consider all policies that attain the maximal value of $r(\pi)$ to be optimal.

    Ergodicity assumption

    $$\mu_{\pi}(s) \doteq \lim_{t\to\infty}\Pr\{S_t=s \mid A_{0:t-1}\sim\pi\}$$

    This assumption about the MDP is known as ergodicity. It means that where the MDP starts, and any early decision made by the agent, can have only a temporary effect; in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities. Ergodicity is sufficient to guarantee the existence of the limits in the equations above.

    Steady-state distribution

    $$\sum_s \mu_{\pi}(s)\sum_a \pi(a|s)\,p(s'|s,a)=\mu_{\pi}(s') \tag{10.8}$$

    This is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution.
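    A quick numeric sanity check of (10.8): for a hypothetical two-state, two-action MDP (the numbers below are made up for illustration), compute the state-transition matrix under $\pi$ and verify that its stationary distribution is preserved.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# p[s, a, s2] = p(s2 | s, a),   pi[s, a] = pi(a | s)
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.3, 0.7]]])
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])

# State-transition matrix under pi: P[s, s2] = sum_a pi(a|s) * p(s2|s, a)
P = np.einsum('sa,sax->sx', pi, p)

# Steady-state distribution mu_pi: left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu = mu / mu.sum()

# (10.8): selecting actions according to pi from mu keeps you in mu
assert np.allclose(mu @ P, mu)
print(mu)
```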

    Differential return:

    $$G_t = R_{t+1}-r(\pi)+R_{t+2}-r(\pi)+R_{t+3}-r(\pi)+\dots \tag{10.9}$$

    Bellman equations (differential form):

    $$v_{\pi}(s) = \sum_a \pi(a|s)\sum_{r,s'} p(s',r|s,a)\left[r - r(\pi) + v_{\pi}(s')\right]$$
    $$q_{\pi}(s,a) = \sum_{r,s'} p(s',r|s,a)\left[r - r(\pi) + \sum_{a'}\pi(a'|s')\,q_{\pi}(s',a')\right]$$

    These parallel the discounted Bellman equations, with all $\gamma$s removed and each reward replaced by the difference between the reward and the true average reward.

    Differential TD errors:

    $$\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1},\mathbf{w}_t) - \hat{v}(S_t,\mathbf{w}_t) \tag{10.10}$$
    $$\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) - \hat{q}(S_t,A_t,\mathbf{w}_t) \tag{10.11}$$

    where $\bar{R}_t$ is an estimate at time $t$ of the average reward $r(\pi)$.

    Gradient update with the differential return / differential TD error:

    $$\mathbf{w}_{t+1}=\mathbf{w}_{t}+\alpha\,\delta_t\,\nabla \hat{q}(S_t,A_t,\mathbf{w}_t) \tag{10.12}$$
    Many of the previous algorithms and theoretical results carry over to this new setting without change.
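    As one concrete instance, here is a sketch of one-step differential semi-gradient Sarsa built from (10.11) and (10.12), again under assumed interfaces not given in the text (a Gym-style continuing environment, a feature function `x(s, a)`, and a linear $\hat{q}$); `beta` is the step size for the average-reward estimate $\bar{R}$.

```python
import numpy as np

def differential_sarsa(env, x, num_actions, alpha=0.1, beta=0.01,
                       epsilon=0.1, num_steps=100_000, rng=None):
    """One-step differential semi-gradient Sarsa with a linear q_hat = w @ x(s, a).
    Uses the TD error of (10.11) and the weight update of (10.12); R_bar tracks r(pi)."""
    if rng is None:
        rng = np.random.default_rng()
    S = env.reset()
    w = np.zeros_like(x(S, 0), dtype=float)
    R_bar = 0.0

    def policy(s):                        # epsilon-greedy w.r.t. the current q_hat
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax([w @ x(s, a) for a in range(num_actions)]))

    A = policy(S)
    for _ in range(num_steps):
        S_next, R, _, _ = env.step(A)     # continuing task: ignore any 'done' flag
        A_next = policy(S_next)
        delta = R - R_bar + w @ x(S_next, A_next) - w @ x(S, A)   # (10.11)
        R_bar += beta * delta             # update the average-reward estimate
        w += alpha * delta * x(S, A)      # (10.12): grad of a linear q_hat is x(S, A)
        S, A = S_next, A_next
    return w, R_bar
```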

    Convergence

    For methods that learn action values, we currently seem to be without a local improvement guarantee in the average-reward setting.

    3.2 Differential Semi-gradient n-step Sarsa

    • Differential n-step return:
      $$G_{t:t+n}=R_{t+1}-\bar{R}_{t+n-1}+\dots+R_{t+n}-\bar{R}_{t+n-1}+\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}) \tag{10.14}$$
    • n-step TD error:
      $$\delta_t=G_{t:t+n}-\hat{q}(S_{t},A_t,\mathbf{w}_{t+n-1}) \tag{10.15}$$
    • Differential semi-gradient n-step Sarsa for estimating $\hat{q}\approx q_{\pi}$ or $q_*$:
      TODO upload figure at page 277
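      Until the figure is in place, here is a minimal Python sketch of the algorithm, under the same assumptions as the earlier sketches (a Gym-style continuing environment, a feature function `x(s, a)`, and a linear $\hat{q}$); plain lists are used for clarity rather than a circular buffer.

```python
import numpy as np

def differential_nstep_sarsa(env, x, num_actions, n=4, alpha=0.1, beta=0.01,
                             epsilon=0.1, num_steps=100_000, rng=None):
    """Differential semi-gradient n-step Sarsa with a linear q_hat = w @ x(s, a),
    using the n-step return (10.14) and TD error (10.15)."""
    if rng is None:
        rng = np.random.default_rng()

    def policy(s, w):                     # epsilon-greedy w.r.t. the current q_hat
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax([w @ x(s, a) for a in range(num_actions)]))

    S = [env.reset()]
    w = np.zeros_like(x(S[0], 0), dtype=float)
    A = [policy(S[0], w)]
    R = [0.0]                             # dummy R_0 keeps indices aligned with the book
    R_bar = 0.0

    for t in range(num_steps):
        s_next, r, _, _ = env.step(A[t])  # take A_t, observe R_{t+1}, S_{t+1}
        S.append(s_next); R.append(r)
        A.append(policy(s_next, w))
        tau = t - n + 1                   # time whose estimate is being updated
        if tau >= 0:
            delta = (sum(R[i] - R_bar for i in range(tau + 1, tau + n + 1))
                     + w @ x(S[tau + n], A[tau + n])
                     - w @ x(S[tau], A[tau]))        # (10.14) and (10.15)
            R_bar += beta * delta         # update the average-reward estimate
            w += alpha * delta * x(S[tau], A[tau])   # semi-gradient weight update
    return w, R_bar
```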