
Reinforcement Learning Reading Notes_0

2017-03-14 14:39

Reinforcement Learning Reading Notes - 10 - On-policy Control with Approximation

Study notes on:

Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016

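References

Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016

Reinforcement Learning Reading Notes - 00 - Terminology and Mathematical Notation
Reinforcement Learning Reading Notes - 01 - The Reinforcement Learning Problem
Reinforcement Learning Reading Notes - 02 - The Multi-armed Bandit Problem
Reinforcement Learning Reading Notes - 03 - Finite Markov Decision Processes
Reinforcement Learning Reading Notes - 04 - Dynamic Programming
Reinforcement Learning Reading Notes - 05 - Monte Carlo Methods
Reinforcement Learning Reading Notes - 06~07 - Temporal-Difference Learning
Reinforcement Learning Reading Notes - 08 - Planning Methods and Learning Methods
Reinforcement Learning Reading Notes - 09 - On-policy Prediction with Approximation
For the mathematical notation used in reinforcement learning, read this first:

Reinforcement Learning Reading Notes - 00 - Terminology and Mathematical Notation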

On-policy Control with Approximation

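Approximate control methods look for an approximation $\hat{q}(s, a, \theta)$ of a policy's action-value function $q_{\pi}(s, a)$, where $\theta$ is the weight vector being learned.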
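As a minimal sketch of what $\hat{q}(s, a, \theta)$ can look like, the snippet below uses a linear form $\hat{q}(s, a, \theta) = \theta^\top x(s, a)$. The feature map `features` and the numeric state are assumptions made for illustration only; the notes do not prescribe any particular parameterization.

```python
def features(state, action):
    """Hypothetical feature vector x(s, a) for a numeric state and a 0/1 action."""
    return [1.0, float(state), float(action), float(state) * float(action)]


def q_hat(state, action, theta):
    """Linear approximation: q_hat(s, a, theta) = theta . x(s, a)."""
    return sum(t * x for t, x in zip(theta, features(state, action)))


def grad_q_hat(state, action, theta):
    """For a linear q_hat, the gradient with respect to theta is x(s, a) itself."""
    return features(state, action)


theta = [1.0, 2.0, 3.0, 4.0]
print(q_hat(2.0, 1, theta))       # 1*1 + 2*2 + 3*1 + 4*2
print(grad_q_hat(2.0, 1, theta))
```

With a linear form the gradient $\nabla \hat{q}$ does not depend on $\theta$, which keeps the weight updates used later in these notes cheap to compute.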

Episodic Semi-gradient Sarsa for Control

Input: a differentiable function $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^n \to \mathbb{R}$

Initialize value-function weights $\theta \in \mathbb{R}^n$ arbitrarily (e.g., $\theta = 0$)

Repeat (for each episode):

$S, A \gets$ initial state and action of episode (e.g., $\epsilon$-greedy)

Repeat (for each step of episode):

Take action $A$, observe $R, S'$

If $S'$ is terminal:

$\theta \gets \theta + \alpha [R - \hat{q}(S, A, \theta)] \nabla \hat{q}(S, A, \theta)$

Go to next episode

Choose $A'$ as a function of $\hat{q}(S', \cdot, \theta)$ (e.g., $\epsilon$-greedy)

$\theta \gets \theta + \alpha [R + \gamma \hat{q}(S', A', \theta) - \hat{q}(S, A, \theta)] \nabla \hat{q}(S, A, \theta)$

$S \gets S'$

$A \gets A'$
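The steps above can be sketched in Python as follows. Everything about the environment is an assumption made for illustration: a hypothetical five-state corridor with a reward of -1 per step, one-hot features (so $\hat{q}$ is linear and its gradient is the feature vector), and the step size, $\epsilon$ and seed values. It is a sketch of the update rule, not a reference implementation.

```python
import random

N_STATES = 5       # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (0, 1)   # 0 = left, 1 = right


def step(state, action):
    """Toy dynamics (an assumption): reward -1 per step until state 4 is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return -1.0, next_state, next_state == N_STATES - 1


def features(state, action):
    """One-hot x(s, a); for a linear q_hat this is also the gradient."""
    x = [0.0] * (N_STATES * len(ACTIONS))
    x[state * len(ACTIONS) + action] = 1.0
    return x


def q_hat(state, action, theta):
    return sum(t * x for t, x in zip(theta, features(state, action)))


def epsilon_greedy(state, theta, epsilon, rng):
    """Random action with probability epsilon, otherwise greedy with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    values = [q_hat(state, a, theta) for a in ACTIONS]
    best = max(values)
    return rng.choice([a for a, v in zip(ACTIONS, values) if v == best])


def semi_gradient_sarsa(episodes=200, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * len(ACTIONS))
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(s, theta, epsilon, rng)
        while True:
            r, s_next, done = step(s, a)
            grad = features(s, a)
            if done:
                # Terminal update: the target is just R.
                delta = r - q_hat(s, a, theta)
                theta = [t + alpha * delta * g for t, g in zip(theta, grad)]
                break
            a_next = epsilon_greedy(s_next, theta, epsilon, rng)
            # Non-terminal update: the target bootstraps from q_hat(S', A').
            delta = r + gamma * q_hat(s_next, a_next, theta) - q_hat(s, a, theta)
            theta = [t + alpha * delta * g for t, g in zip(theta, grad)]
            s, a = s_next, a_next
    return theta


theta = semi_gradient_sarsa()
print([round(q_hat(s, 1, theta), 2) for s in range(N_STATES - 1)])
```

The method is called "semi-gradient" because the gradient is taken only through $\hat{q}(S, A, \theta)$ and not through the bootstrapped target $\hat{q}(S', A', \theta)$.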

n-step Semi-gradient Sarsa for Control

See the original book; the details are not repeated here.

Average Reward (for Continuing Tasks)

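The discounting rate $\gamma$ causes some problems under function approximation (according to the book, the next chapter explains what these problems are).

For continuing tasks, the average reward $\eta(\pi)$ is therefore introduced in its place: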
\[ \begin{align} \eta(\pi) & \doteq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E} [R_t | A_{0:t-1} \sim \pi] \\ & = \lim_{t \to \infty} \mathbb{E} [R_t | A_{0:t-1} \sim \pi] \\ & = \sum_s d_{\pi}(s) \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) r \end{align} \]
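The last line of the definition can be evaluated directly when the model is known. The sketch below does this for a hypothetical two-state, two-action MDP under a uniform random policy; the tables `p` and `pi` are invented for illustration. It assumes the chain induced by the policy is ergodic and aperiodic, so that repeatedly applying the transition probabilities converges to the stationary distribution $d_{\pi}$.

```python
# p[s][a] is a list of (probability, next_state, reward) triples (hypothetical model).
p = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
# pi[s][a] = pi(a|s); here a uniform random policy.
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}


def stationary_distribution(p, pi, sweeps=1000):
    """Approximate d_pi by repeatedly pushing a distribution through the chain."""
    states = sorted(p)
    d = {s: 1.0 / len(states) for s in states}
    for _ in range(sweeps):
        new_d = {s: 0.0 for s in states}
        for s in states:
            for a, outcomes in p[s].items():
                for prob, s_next, _reward in outcomes:
                    new_d[s_next] += d[s] * pi[s][a] * prob
        d = new_d
    return d


def average_reward(p, pi):
    """eta(pi) = sum_s d_pi(s) sum_a pi(a|s) sum_{s',r} p(s',r|s,a) r."""
    d = stationary_distribution(p, pi)
    return sum(
        d[s] * pi[s][a] * prob * reward
        for s in p
        for a, outcomes in p[s].items()
        for prob, _s_next, reward in outcomes
    )


print(average_reward(p, pi))
```

In this toy model both states are visited half of the time, and the expected one-step rewards are 0.5 and 1.0, so the average reward is 0.75.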

Differential return (each reward minus the average reward)
\[ G_t \doteq R_{t+1} - \eta(\pi) + R_{t+2} - \eta(\pi) + \cdots \]

Value functions of a policy
\[ v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{r,s'} p(s',r|s,a)[r - \eta(\pi) + v_{\pi}(s')] \\ q_{\pi}(s,a) = \sum_{r,s'} p(s',r|s,a)[r - \eta(\pi) + \sum_{a'} \pi(a'|s') q_{\pi}(s',a')] \\ \]
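To make the equation for $v_{\pi}$ concrete, the sketch below iterates it on a hypothetical two-state chain in which the policy has already been folded into the transition matrix `P` and the expected rewards `r` (both invented for illustration). Differential values are only determined up to an additive constant, so the example reports the difference $v_{\pi}(1) - v_{\pi}(0)$ rather than the individual values. It assumes an ergodic, aperiodic chain; other cases may not converge under this simple iteration.

```python
# Hypothetical 2-state chain with the policy already folded in.
P = [[0.9, 0.1], [0.5, 0.5]]   # P[s][s2] = probability of moving s -> s2 under pi
r = [0.0, 1.0]                 # expected one-step reward in each state


def stationary(P, sweeps=500):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(sweeps):
        d = [sum(d[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return d


def differential_values(P, r, sweeps=500):
    """Iterate v(s) <- r(s) - eta + sum_s2 P[s][s2] v(s2), starting from v = 0."""
    n = len(P)
    d = stationary(P)
    eta = sum(ds * rs for ds, rs in zip(d, r))
    v = [0.0] * n
    for _ in range(sweeps):
        v = [r[s] - eta + sum(P[s][s2] * v[s2] for s2 in range(n)) for s in range(n)]
    return eta, v


eta, v = differential_values(P, r)
print(eta, v[1] - v[0])
```

For this chain the stationary distribution is $(5/6, 1/6)$, so $\eta(\pi) = 1/6$, and solving the two equations by hand gives $v_{\pi}(1) - v_{\pi}(0) = 5/3$.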

Optimal value functions
\[ v_{*}(s) = \max_{a} \sum_{r,s'} p(s',r|s,a)[r - \max_{\pi} \eta(\pi) + v_{*}(s')] \\ q_{*}(s,a) = \sum_{r,s'} p(s',r|s,a)[r - \max_{\pi} \eta(\pi) + \max_{a'} q_{*}(s',a')] \\ \]

TD errors
\[ \delta_t \doteq R_{t+1} - \bar{R} + \hat{v}(S_{t+1},\theta) - \hat{v}(S_{t},\theta) \\ \delta_t \doteq R_{t+1} - \bar{R} + \hat{q}(S_{t+1},A_{t+1},\theta) - \hat{q}(S_{t},A_t,\theta) \\ \text{where } \bar{R} \text{ is an estimate of the average reward } \eta(\pi) \]

Update rule for the average-reward version of semi-gradient Sarsa
\[ \theta_{t+1} \doteq \theta_t + \alpha \delta_t \nabla \hat{q}(S_{t},A_t,\theta) \]

Average-reward Version of Semi-gradient Sarsa (for Continuing Tasks)

Input: a differentiable function $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^n \to \mathbb{R}$

Parameters: step sizes $\alpha, \beta > 0$

Initialize value-function weights $\theta \in \mathbb{R}^n$ arbitrarily (e.g., $\theta = 0$)

Initialize average reward estimate $\bar{R}$ arbitrarily (e.g., $\bar{R} = 0$)

Initialize state $S$ and action $A$

Repeat (for each step):

Take action $A$, observe $R, S'$

Choose $A'$ as a function of $\hat{q}(S', \cdot, \theta)$ (e.g., $\epsilon$-greedy)

$\delta \gets R - \bar{R} + \hat{q}(S', A', \theta) - \hat{q}(S, A, \theta)$

$\bar{R} \gets \bar{R} + \beta \delta$

$\theta \gets \theta + \alpha \delta \nabla \hat{q}(S, A, \theta)$

$S \gets S'$

$A \gets A'$
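A minimal Python sketch of this loop is shown below. The two-state continuing task (reward 1 whenever the agent lands in state 1), the one-hot features, and the values of $\alpha$, $\beta$, $\epsilon$ and the seed are all assumptions made for illustration. The function also returns the empirical mean reward so that it can be compared with the learned estimate $\bar{R}$.

```python
import random

N_STATES = 2
ACTIONS = (0, 1)   # 0 = stay, 1 = switch to the other state


def step(state, action):
    """Toy continuing task (an assumption): reward 1 for landing in state 1."""
    next_state = state if action == 0 else 1 - state
    return float(next_state == 1), next_state


def features(state, action):
    """One-hot x(s, a); for a linear q_hat this is also the gradient."""
    x = [0.0] * (N_STATES * len(ACTIONS))
    x[state * len(ACTIONS) + action] = 1.0
    return x


def q_hat(state, action, theta):
    return sum(t * x for t, x in zip(theta, features(state, action)))


def epsilon_greedy(state, theta, epsilon, rng):
    """Random action with probability epsilon, otherwise greedy with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    values = [q_hat(state, a, theta) for a in ACTIONS]
    best = max(values)
    return rng.choice([a for a, v in zip(ACTIONS, values) if v == best])


def differential_sarsa(steps=20000, alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * len(ACTIONS))
    avg_reward = 0.0     # R-bar, the running estimate of eta(pi)
    total_reward = 0.0
    s = 0
    a = epsilon_greedy(s, theta, epsilon, rng)
    for _ in range(steps):
        r, s_next = step(s, a)
        a_next = epsilon_greedy(s_next, theta, epsilon, rng)
        # Differential TD error: no discounting, the reward is offset by R-bar.
        delta = r - avg_reward + q_hat(s_next, a_next, theta) - q_hat(s, a, theta)
        avg_reward += beta * delta
        grad = features(s, a)
        theta = [t + alpha * delta * g for t, g in zip(theta, grad)]
        s, a = s_next, a_next
        total_reward += r
    return theta, avg_reward, total_reward / steps


theta, avg_reward, mean_reward = differential_sarsa()
print(avg_reward, mean_reward)
```

Note that $\bar{R}$ is updated with the TD error $\delta$ rather than with the raw difference $R - \bar{R}$, matching the update listed in the algorithm above.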

n-step Semi-gradient Sarsa for Control - Average-reward Version (for Continuing Tasks)

See the original book; the details are not repeated here.
