Chapter 10: On-policy Control with Approximation
Notes of Chapter 10: On-policy Control with Approximation
- 1 Introduction
- 2 On-policy control with approximation of episodic tasks
- 2.1 *General gradient-descent update* for action-value prediction is:
- 2.2 Semi-gradient n-step Sarsa
- 3 On-policy control with approximation of continuing tasks
- 3.1 Average reward
- ***Average reward***:
- ***Ergodicity assumption***
- ***Steady state distribution***
- ***Differential return:***
- ***Bellman equations***:
- ***Differential TD errors***:
- ***Gradient update with differential return/ differential TD errors***:
- Convergence
- 3.2 Differential Semi-gradient n-step Sarsa
1 Introduction
In the control problem, we focus on the action-value function $\hat{q}(s,a,\mathbf{w})\approx q_*(s,a)$, with weight vector $\mathbf{w}\in \mathbb{R}^d$, because it is easy to plan with an action-value function: just select the action with the largest value. If the action-value function is not accurate enough, it can be refined at decision time with rollout algorithms or Monte Carlo Tree Search.
- For episodic tasks, the prediction algorithms of Chapter 9 extend directly: act with an $\epsilon$-greedy policy (a soft version of the greedy policy). A semi-gradient n-step Sarsa algorithm is proposed (a minimal action-selection sketch follows this list).
- For continuing tasks, a new definition of return based on the average reward is introduced, and a differential semi-gradient Sarsa algorithm is proposed.
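A minimal sketch of the $\epsilon$-greedy selection mentioned above, assuming a user-supplied callable `q_hat(state, action, w)` (a hypothetical interface, not from the book) and NumPy:

```python
import numpy as np

def epsilon_greedy(q_hat, state, actions, w, epsilon=0.1, rng=None):
    """Soft-greedy action selection from an approximate action-value function.

    With probability epsilon an action is chosen uniformly at random;
    otherwise the greedy (highest-value) action is taken, ties broken at random.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    values = np.array([q_hat(state, a, w) for a in actions])
    best = np.flatnonzero(values == values.max())   # all maximizing actions
    return actions[best[rng.integers(len(best))]]
```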
2 On-policy control with approximation of episodic tasks
2.1 General gradient-descent update for action-value prediction is:
$$\mathbf{w}_{t+1}=\mathbf{w}_{t}+\alpha\left[U_t-\hat{q}(S_t,A_t,\mathbf{w}_{t})\right]\nabla \hat{q}(S_t,A_t,\mathbf{w}_{t}) \tag{10.1}$$
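A minimal sketch of update (10.1) with linear function approximation, assuming a hypothetical feature function `x(s, a)` so that $\hat{q}(s,a,\mathbf{w})=\mathbf{w}^\top\mathbf{x}(s,a)$ and hence $\nabla\hat{q}=\mathbf{x}(s,a)$. The target $U_t$ is treated as a constant while differentiating, which is what makes the update *semi*-gradient:

```python
import numpy as np

def semi_gradient_update(w, x, s, a, target, alpha):
    """One semi-gradient step of (10.1) for linear q_hat(s, a, w) = w . x(s, a).

    x(s, a) is a hypothetical feature function returning a vector the same
    length as w; `target` is U_t (e.g. a Sarsa or n-step return), held fixed
    while differentiating, so the gradient is just the feature vector.
    """
    features = x(s, a)
    q_estimate = w @ features
    w += alpha * (target - q_estimate) * features   # in-place update of w
    return w
```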
2.2 Semi-gradient n-step Sarsa
By replacing the update target of (10.1) with the n-step return
$$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1}R_{t+n}+\gamma^{n}\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}), \quad t+n<T, \tag{10.4}$$
we get the update equation for semi-gradient n-step Sarsa:
$$\mathbf{w}_{t+n}=\mathbf{w}_{t+n-1}+\alpha\left[G_{t:t+n}-\hat{q}(S_t,A_t,\mathbf{w}_{t+n-1})\right]\nabla \hat{q}(S_t,A_t,\mathbf{w}_{t+n-1}), \quad 0\leq t<T. \tag{10.5}$$
Episodic semi-gradient n-step Sarsa for estimating $\hat{q}\approx q_*$ or $q_{\pi}$ is sketched below.
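A sketch, in Python, of the episodic algorithm under assumed interfaces: a Gym-style `env` with `reset()` and `step(action) -> (state, reward, done)` (an assumption, not the book's pseudocode) and a hypothetical feature function `x(s, a)` giving a linear $\hat{q}(s,a,\mathbf{w})=\mathbf{w}^\top\mathbf{x}(s,a)$:

```python
import numpy as np

def episodic_semi_gradient_n_step_sarsa(env, x, d, actions, n=4, alpha=0.1,
                                        gamma=1.0, epsilon=0.1,
                                        num_episodes=100, seed=0):
    """Sketch of episodic semi-gradient n-step Sarsa with linear q_hat = w . x(s, a)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)

    def q(s, a):
        return w @ x(s, a)

    def policy(s):                       # epsilon-greedy with respect to q_hat
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]
        return actions[int(np.argmax([q(s, a) for a in actions]))]

    for _ in range(num_episodes):
        S, A, R = [env.reset()], [], [0.0]      # buffers indexed by time step
        A.append(policy(S[0]))
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(policy(s_next))
            tau = t - n + 1                     # time whose estimate is updated
            if tau >= 0:
                # n-step return (10.4), truncated at the end of the episode
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * q(S[tau + n], A[tau + n])
                # semi-gradient update (10.5); gradient of linear q_hat is x(s, a)
                w += alpha * (G - q(S[tau], A[tau])) * x(S[tau], A[tau])
            if tau == T - 1:
                break
            t += 1
    return w
```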
3 On-policy control with approximation of continuing tasks
The average-reward setting is a third way, alongside the episodic and discounted settings, of formulating the goal in Markov decision problems (MDPs). It applies to continuing problems, which have no start or terminal state, and it involves no discounting.
3.1 Average reward
Discounted value is problematic with function approximation. The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem (Section 4.2). It is no longer true that if we change the policy to improve the discounted value of one state then we are guaranteed to have improved the overall policy in any useful sense (e.g. generalisation could ruin the policy elsewhere).
Average reward:
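The book defines the average reward $r(\pi)$ as:
$$r(\pi) \doteq \lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h}\mathbb{E}\left[R_t \mid S_0, A_{0:t-1}\sim\pi\right] \tag{10.6}$$
$$= \lim_{t\to\infty}\mathbb{E}\left[R_t \mid S_0, A_{0:t-1}\sim\pi\right] \tag{10.7}$$
$$= \sum_s \mu_{\pi}(s)\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)\,r$$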
This quantity is essentially the average reward under $\pi$, as suggested by (10.7). In particular, we consider all policies that attain the maximal value of $r(\pi)$ to be optimal.
Ergodicity assumption
$$\mu_{\pi}(s) \doteq \lim_{t\to\infty}\Pr\{S_t=s \mid A_{0:t-1}\sim\pi\}$$
This assumption about the MDP is known as ergodicity. It means that where the MDP starts, or any early decision made by the agent, can have only a temporary effect; in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities. Ergodicity is sufficient to guarantee the existence of the limits in the equations above.
Steady state distribution
$$\sum_s \mu_{\pi}(s)\sum_a \pi(a|s)\,p(s'|s,a)=\mu_{\pi}(s') \tag{10.8}$$
This is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution.
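A small numerical sketch of (10.8): build the policy-induced state transition matrix $P_\pi(s'\mid s)$ for an assumed toy MDP (all numbers below are made up for illustration) and check that its stationary distribution is unchanged by one more step under $\pi$:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP: p[a, s, s_next] = p(s_next | s, a)
p = np.array([[[0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5]]])
pi = np.array([[0.5, 0.5],   # pi[s, a] = pi(a | s)
               [0.2, 0.8],
               [0.7, 0.3]])

# Policy-induced transition matrix: P_pi[s, s_next] = sum_a pi(a|s) p(s_next|s, a)
P_pi = np.einsum('sa,asn->sn', pi, p)

# Steady-state distribution: the left eigenvector of P_pi with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P_pi.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()

# Equation (10.8): acting according to pi for one more step leaves mu unchanged
print(np.allclose(mu @ P_pi, mu))   # True
```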
Differential return:
$$G_t=R_{t+1}-r(\pi)+R_{t+2}-r(\pi)+R_{t+3}-r(\pi)+\dots \tag{10.9}$$
Bellman equations:
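From the book, the differential forms of the Bellman equations drop all $\gamma$'s and replace each reward by the difference between the reward and the true average reward:
$$v_{\pi}(s)=\sum_a \pi(a|s)\sum_{r,s'}p(s',r|s,a)\left[r-r(\pi)+v_{\pi}(s')\right]$$
$$q_{\pi}(s,a)=\sum_{r,s'}p(s',r|s,a)\left[r-r(\pi)+\sum_{a'}\pi(a'|s')\,q_{\pi}(s',a')\right]$$
$$v_{*}(s)=\max_a \sum_{r,s'}p(s',r|s,a)\left[r-\max_{\pi}r(\pi)+v_{*}(s')\right]$$
$$q_{*}(s,a)=\sum_{r,s'}p(s',r|s,a)\left[r-\max_{\pi}r(\pi)+\max_{a'}q_{*}(s',a')\right]$$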
Differential TD errors:
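The differential forms of the TD errors, where $\bar{R}_t$ is an estimate at time $t$ of the average reward $r(\pi)$:
$$\delta_t = R_{t+1}-\bar{R}_t+\hat{v}(S_{t+1},\mathbf{w}_t)-\hat{v}(S_t,\mathbf{w}_t) \tag{10.10}$$
$$\delta_t = R_{t+1}-\bar{R}_t+\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t)-\hat{q}(S_t,A_t,\mathbf{w}_t) \tag{10.11}$$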
Gradient update with differential return/ differential TD errors:
$$\mathbf{w}_{t+1}=\mathbf{w}_{t}+\alpha\,\delta_t\,\nabla \hat{q}(S_t,A_t,\mathbf{w}_t) \tag{10.12}$$
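A minimal sketch of one step of differential semi-gradient Sarsa built from (10.11) and (10.12), assuming a linear $\hat{q}$ through a hypothetical feature function `x(s, a)`; following the book's algorithm, the average-reward estimate $\bar{R}$ is also nudged by the TD error with its own step size $\beta$:

```python
import numpy as np

def differential_sarsa_step(w, r_bar, x, s, a, r, s_next, a_next, alpha, beta):
    """One update of differential semi-gradient Sarsa for linear q_hat = w . x(s, a).

    x(s, a) is a hypothetical feature function; alpha and beta are the step
    sizes for the weights and the average-reward estimate respectively.
    """
    # Differential TD error (10.11)
    delta = r - r_bar + w @ x(s_next, a_next) - w @ x(s, a)
    r_bar += beta * delta                     # update average-reward estimate
    w += alpha * delta * x(s, a)              # semi-gradient weight update (10.12)
    return w, r_bar
```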
Many of the previous algorithms and theoretical results carry over to this new setting without change.
Convergence
For methods that learn action values, we currently seem to be without a local improvement guarantee.
3.2 Differential Semi-gradient n-step Sarsa
- Differential n-step return:
$$G_{t:t+n}=R_{t+1}-\bar{R}_{t+n-1}+\dots+R_{t+n}-\bar{R}_{t+n-1}+\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}) \tag{10.14}$$
- n-step TD error:
$$\delta_t=G_{t:t+n}-\hat{q}(S_{t},A_t,\mathbf{w}_{t+n-1}) \tag{10.15}$$
- Differential semi-gradient n-step Sarsa for estimating $\hat{q}\approx q_{\pi}$ or $q_*$
TODO upload figure at page 277
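Pending the figure, a Python sketch of the book's differential semi-gradient n-step Sarsa, under the same assumed interfaces as the episodic sketch above (a Gym-style continuing `env` whose `step(action)` returns `(state, reward)`, and a hypothetical feature function `x(s, a)` for a linear $\hat{q}$):

```python
import numpy as np

def differential_n_step_sarsa(env, x, d, actions, n=4, alpha=0.1, beta=0.01,
                              epsilon=0.1, num_steps=10_000, seed=0):
    """Sketch of differential semi-gradient n-step Sarsa for a continuing task."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    r_bar = 0.0                                   # average-reward estimate

    def q(s, a):
        return w @ x(s, a)

    def policy(s):                                # epsilon-greedy w.r.t. q_hat
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]
        return actions[int(np.argmax([q(s, a) for a in actions]))]

    s0 = env.reset()
    S, A, R = [s0], [policy(s0)], [0.0]           # buffers indexed by time step

    for t in range(num_steps):
        s_next, r = env.step(A[t])                # take action A_t
        S.append(s_next)
        R.append(r)
        A.append(policy(s_next))
        tau = t - n + 1                           # time whose estimate is updated
        if tau >= 0:
            # n-step differential TD error: (10.14) minus q_hat at (S_tau, A_tau)
            delta = (sum(R[i] - r_bar for i in range(tau + 1, tau + n + 1))
                     + q(S[tau + n], A[tau + n]) - q(S[tau], A[tau]))
            r_bar += beta * delta                 # update average-reward estimate
            w += alpha * delta * x(S[tau], A[tau])   # semi-gradient update
    return w, r_bar
```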