
[CS229 Lecture 20] Policy Search

Lecture 20

The final lecture on reinforcement learning.

Agenda

-POMDPs (partially observable MDPs)

-Policy search (the main topic for today is policy search; specifically, I'll talk about two algorithms, named REINFORCE and Pegasus)

-REINFORCE

-Pegasus

-Conclusion

Recap of the last lecture: I actually started talking about one specific example of a POMDP, which was a linear dynamical system, $s_{t+1} = A s_t + B a_t + w_t$. This is the LQR (linear quadratic regulation) problem, but I changed it and asked: what if we only have observations $y_t$…
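To complete the recap (the equations were on slides; this reconstruction follows the standard partially observed LQR formulation from the previous lecture):

$$s_{t+1} = A s_t + B a_t + w_t, \qquad y_t = C s_t + v_t,$$

where $w_t$ and $v_t$ are Gaussian noise terms. The solution was to run a Kalman filter to compute the state estimate $\hat{s}_t = \mathbb{E}[s_t \mid y_1, \ldots, y_t]$, and then apply the LQR controller to the estimate: $a_t = L_t \hat{s}_t$.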



The formal definition of a POMDP (in general, the POMDP problem is NP-hard), and policy search. I think policy search is among the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs. Today I'll first describe how to apply policy search algorithms to MDPs, i.e., the case with full observations, and then how to apply them to POMDPs. When you apply them to a POMDP it is hard to guarantee that you obtain a globally optimal policy, because POMDPs are NP-hard in general; even so, I consider policy search the most effective approach for both.
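The definition itself was on a slide; the standard formalization used in the lecture is that a POMDP is a tuple

$$(S, A, Y, \{P_{sa}\}, \{O_s\}, T, R),$$

where $Y$ is the set of possible observations and $O_s$ is the observation distribution: at each step you do not see the state $s_t$ itself, but instead observe $y_t \sim O_{s_t}$. Everything else is as in an ordinary MDP.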



So, our first policy search algorithm: the REINFORCE algorithm.
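The setup for policy search (reconstructed here, since the slide is missing): define a class of stochastic policies $\pi_\theta(s, a)$, each giving the probability of taking action $a$ in state $s$, and directly search for the parameters $\theta$ that maximize the expected payoff:

$$\max_\theta \; \mathbb{E}\big[R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T) \,\big|\, \pi_\theta\big].$$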



Let me give one specific example to present the algorithm: the inverted pendulum.
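For the inverted pendulum there are two actions, accelerate left or accelerate right, and the example policy class used in class is logistic. Writing $s$ for the state vector, a reconstruction of the slide is:

$$\pi_\theta(s, \text{right}) = \frac{1}{1 + e^{-\theta^T s}}, \qquad \pi_\theta(s, \text{left}) = 1 - \pi_\theta(s, \text{right}).$$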



(In the figure below, the part above the horizontal line answers a student's question: "When there are multiple actions…")



The concrete solution procedure:
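The procedure itself was on the board; what follows is a minimal Python sketch of the REINFORCE loop as described in class: sample one trajectory from the current stochastic policy, compute its total payoff, and step $\theta$ in the direction $\big(\sum_t \nabla_\theta \log \pi_\theta(s_t, a_t)\big) \times \text{payoff}$. The environment interface (`reset`/`step`) is a hypothetical stand-in, not something from the lecture.

```python
import numpy as np

# One REINFORCE iteration for a two-action logistic policy:
# pi_theta(s, right) = 1 / (1 + exp(-theta^T s)).

def action_prob(theta, s):
    """Probability of choosing action 1 ("right") in state s."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, s)))

def sample_trajectory(env, theta, horizon):
    """Roll out one trajectory under the current stochastic policy."""
    s = env.reset()
    states, actions, payoff = [], [], 0.0
    for _ in range(horizon):
        a = 1 if np.random.rand() < action_prob(theta, s) else 0
        states.append(np.asarray(s))
        actions.append(a)
        s, reward, done = env.step(a)  # assumed step() signature
        payoff += reward
        if done:
            break
    return states, actions, payoff

def reinforce_step(theta, env, horizon=200, alpha=0.01):
    """theta := theta + alpha * (sum_t grad log pi(s_t, a_t)) * payoff."""
    states, actions, payoff = sample_trajectory(env, theta, horizon)
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        p = action_prob(theta, s)
        grad += (a - p) * s  # grad of log pi for the logistic policy
    return theta + alpha * grad * payoff
```

Repeating `reinforce_step` is the whole algorithm: on average each step moves $\theta$ uphill on the expected payoff, which is what the proof below establishes.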



Proof:
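The proof was given on the board (images missing); here is a sketch of the standard argument that the expected update direction equals the gradient of the expected payoff. The probability of a trajectory factors as

$$P(s_0, a_0, \ldots, s_T, a_T) = P(s_0)\,\pi_\theta(s_0, a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1, a_1) \cdots$$

Only the $\pi_\theta$ factors depend on $\theta$, so

$$\nabla_\theta \mathbb{E}[\text{payoff}] = \sum_{\text{traj}} \nabla_\theta P(\text{traj}) \cdot \text{payoff} = \mathbb{E}\!\left[\left(\sum_t \nabla_\theta \log \pi_\theta(s_t, a_t)\right) \text{payoff}\right],$$

which is exactly the quantity the algorithm estimates from a single sampled trajectory. Hence the update is an unbiased estimate of the true gradient, i.e., this is stochastic gradient ascent on the expected payoff.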





Which is better for finding a policy: the value function approximation approach, or the policy search algorithms just described?

For instinctive, reflex-like, low-level control decisions, such as balancing the inverted pendulum, there very likely exists a simple mapping, such as a logistic function, from states to actions, so use the latter (policy search).

For high-level decisions that require looking several steps ahead, such as the game of Go, use the former (value function approximation).

The latter can also be applied to POMDPs: even though the states are only partially observed, running it on estimated states works just as well.



It turns out the REINFORCE algorithm is effective, but it is noisy: each update is driven by the payoff of a single randomly sampled trajectory, so the gradient estimates have high variance.

Another policy search algorithm: Pegasus (we have been using it for autonomous helicopter flight for many years).

Pegasus is an acronym for Policy Evaluation-of-Goodness And Search Using Scenarios.
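The figures for this part are missing, so here is a brief reconstruction of the key idea, with a minimal Python sketch. Pegasus assumes access to a simulator and fixes, in advance, all the random numbers ("scenarios") the simulator will consume. Re-evaluating a policy on the same fixed scenarios then makes the estimated value a deterministic function of the parameters $\theta$, which an ordinary optimization routine can climb. The `simulate` function below is a hypothetical stand-in for such a simulator, not something from the lecture.

```python
import numpy as np

# Pegasus trick: draw all of the simulator's randomness ("scenarios")
# once, up front. Evaluating a policy on these fixed scenarios is then
# a deterministic function of theta.

def make_scenarios(m, horizon, noise_dim, seed=0):
    """Fix m scenarios: all the noise for m trajectories, drawn once."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m, horizon, noise_dim))

def estimate_value(theta, scenarios, simulate):
    """Average payoff over the fixed scenarios; deterministic in theta."""
    # simulate(theta, noise) -> total payoff of one trajectory whose
    # randomness is supplied explicitly by `noise` (assumed interface).
    return np.mean([simulate(theta, noise) for noise in scenarios])
```

Because `estimate_value` always reuses the same scenarios, comparing two candidate policies is an apples-to-apples comparison, which removes the evaluation noise that plagued REINFORCE-style updates.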

That is the Pegasus policy search algorithm. We used it for the helicopter, and it also works well on large-scale problems.

In closing, let me just say this class has been really fun…

Thank you!

And with that, the course comes to an end…