
[CS229 Lecture 20] Policy Search

Lecture 20

The final lecture on reinforcement learning.

Agenda

-POMDPs (partially observable MDPs)

-Policy search (the main topic for today is policy search; specifically, I'll talk about two algorithms, named REINFORCE and Pegasus)

-REINFORCE

-Pegasus

-Conclusion

Recap of the last lecture: I actually started talking about one specific example of a POMDP, which was a linear dynamical system, $s_{t+1} = A s_t + B a_t + w_t$. This is the LQR (linear quadratic regulation) problem, but I changed it and asked: what if we only have observations $y_t$…
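To complete the recap (the equations were on slides; this reconstruction follows the standard partially observed LQR formulation from the previous lecture):

$$s_{t+1} = A s_t + B a_t + w_t, \qquad y_t = C s_t + v_t,$$

where $w_t$ and $v_t$ are Gaussian noise terms. The solution was to run a Kalman filter to compute the state estimate $\hat{s}_t = \mathbb{E}[s_t \mid y_1, \ldots, y_t]$, and then apply the LQR controller to the estimate: $a_t = L_t \hat{s}_t$.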



The formal definition of a POMDP (in general, the POMDP problem is NP-hard), and policy search. I think policy search is among the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs. Today I'll first describe how to apply policy search algorithms to MDPs, i.e., the case with full observations, and then how to apply them to POMDPs. When you apply them to a POMDP it is hard to guarantee that you obtain a globally optimal policy, because POMDPs are NP-hard in general; even so, I consider policy search the most effective approach for both.
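The definition itself was on a slide; the standard formalization used in the lecture is that a POMDP is a tuple

$$(S, A, Y, \{P_{sa}\}, \{O_s\}, T, R),$$

where $Y$ is the set of possible observations and $O_s$ is the observation distribution: at each step you do not see the state $s_t$ itself, but instead observe $y_t \sim O_{s_t}$. Everything else is as in an ordinary MDP.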



So, our first policy search algorithm: the REINFORCE algorithm.
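The setup for policy search (reconstructed here, since the slide is missing): define a class of stochastic policies $\pi_\theta(s, a)$, each giving the probability of taking action $a$ in state $s$, and directly search for the parameters $\theta$ that maximize the expected payoff:

$$\max_\theta \; \mathbb{E}\big[R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T) \,\big|\, \pi_\theta\big].$$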



Let me give one specific example to present the algorithm: the inverted pendulum.
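For the inverted pendulum there are two actions, accelerate left or accelerate right, and the example policy class used in class is logistic. Writing $s$ for the state vector, a reconstruction of the slide is:

$$\pi_\theta(s, \text{right}) = \frac{1}{1 + e^{-\theta^T s}}, \qquad \pi_\theta(s, \text{left}) = 1 - \pi_\theta(s, \text{right}).$$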



(In the figure below, the part above the horizontal line answers a student's question: "When there are multiple actions…")



The concrete solution procedure:
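The procedure itself was on the board; what follows is a minimal Python sketch of the REINFORCE loop as described in class: sample one trajectory from the current stochastic policy, compute its total payoff, and step $\theta$ in the direction $\big(\sum_t \nabla_\theta \log \pi_\theta(s_t, a_t)\big) \times \text{payoff}$. The environment interface (`reset`/`step`) is a hypothetical stand-in, not something from the lecture.

```python
import numpy as np

# One REINFORCE iteration for a two-action logistic policy:
# pi_theta(s, right) = 1 / (1 + exp(-theta^T s)).

def action_prob(theta, s):
    """Probability of choosing action 1 ("right") in state s."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, s)))

def sample_trajectory(env, theta, horizon):
    """Roll out one trajectory under the current stochastic policy."""
    s = env.reset()
    states, actions, payoff = [], [], 0.0
    for _ in range(horizon):
        a = 1 if np.random.rand() < action_prob(theta, s) else 0
        states.append(np.asarray(s))
        actions.append(a)
        s, reward, done = env.step(a)  # assumed step() signature
        payoff += reward
        if done:
            break
    return states, actions, payoff

def reinforce_step(theta, env, horizon=200, alpha=0.01):
    """theta := theta + alpha * (sum_t grad log pi(s_t, a_t)) * payoff."""
    states, actions, payoff = sample_trajectory(env, theta, horizon)
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        p = action_prob(theta, s)
        grad += (a - p) * s  # grad of log pi for the logistic policy
    return theta + alpha * grad * payoff
```

Repeating `reinforce_step` is the whole algorithm: on average each step moves $\theta$ uphill on the expected payoff, which is what the proof below establishes.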



Proof:
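The proof was given on the board (images missing); here is a sketch of the standard argument that the expected update direction equals the gradient of the expected payoff. The probability of a trajectory factors as

$$P(s_0, a_0, \ldots, s_T, a_T) = P(s_0)\,\pi_\theta(s_0, a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1, a_1) \cdots$$

Only the $\pi_\theta$ factors depend on $\theta$, so

$$\nabla_\theta \mathbb{E}[\text{payoff}] = \sum_{\text{traj}} \nabla_\theta P(\text{traj}) \cdot \text{payoff} = \mathbb{E}\!\left[\left(\sum_t \nabla_\theta \log \pi_\theta(s_t, a_t)\right) \text{payoff}\right],$$

which is exactly the quantity the algorithm estimates from a single sampled trajectory. Hence the update is an unbiased estimate of the true gradient, i.e., this is stochastic gradient ascent on the expected payoff.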





Which is better for finding a policy: the value function approximation approach, or the policy search algorithms just described?

For instinctive, reflex-like, low-level control decisions, such as balancing the inverted pendulum, there very likely exists a simple mapping, such as a logistic function, from states to actions, so use the latter (policy search).

For high-level decisions that require looking several steps ahead, such as the game of Go, use the former (value function approximation).

The latter can also be applied to POMDPs: even though the states are only partially observed, running it on estimated states works just as well.



It turns out the REINFORCE algorithm is effective, but it is noisy: each update is driven by the payoff of a single randomly sampled trajectory, so the gradient estimates have high variance.

Another policy search algorithm: Pegasus (we have been using it for autonomous helicopter flight for many years).

Pegasus is an acronym for Policy Evaluation-of-Goodness And Search Using Scenarios.
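The figures for this part are missing, so here is a brief reconstruction of the key idea, with a minimal Python sketch. Pegasus assumes access to a simulator and fixes, in advance, all the random numbers ("scenarios") the simulator will consume. Re-evaluating a policy on the same fixed scenarios then makes the estimated value a deterministic function of the parameters $\theta$, which an ordinary optimization routine can climb. The `simulate` function below is a hypothetical stand-in for such a simulator, not something from the lecture.

```python
import numpy as np

# Pegasus trick: draw all of the simulator's randomness ("scenarios")
# once, up front. Evaluating a policy on these fixed scenarios is then
# a deterministic function of theta.

def make_scenarios(m, horizon, noise_dim, seed=0):
    """Fix m scenarios: all the noise for m trajectories, drawn once."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m, horizon, noise_dim))

def estimate_value(theta, scenarios, simulate):
    """Average payoff over the fixed scenarios; deterministic in theta."""
    # simulate(theta, noise) -> total payoff of one trajectory whose
    # randomness is supplied explicitly by `noise` (assumed interface).
    return np.mean([simulate(theta, noise) for noise in scenarios])
```

Because `estimate_value` always reuses the same scenarios, comparing two candidate policies is an apples-to-apples comparison, which removes the evaluation noise that plagued REINFORCE-style updates.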

That is the Pegasus policy search algorithm. We used it for the helicopter, and it also works well on large-scale problems.

In closing, let me just say this class has been really fun…

Thank you!

And with that, the course comes to an end…