Reinforcement Learning by David Silver, Notes 2: Markov Decision Processes

2017-12-11 17:00
Markov Process
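A Markov process (Markov chain) is a tuple ⟨S, P⟩: a set of states and transition probabilities, where the next state depends only on the current one. The sketch below samples state sequences from such a chain. It is a minimal illustration under stated assumptions. The state names and probabilities describe a hypothetical chain loosely modeled on the lecture's "student" example, and `sample_episode` is a helper made up for this note.

```python
import random

# Hypothetical transition probabilities P[s][s'] for a small chain loosely
# modeled on the lecture's "student" example; the numbers are illustrative.
P = {
    "Class1":   {"Class2": 0.5, "Facebook": 0.5},
    "Class2":   {"Class3": 0.8, "Sleep": 0.2},
    "Class3":   {"Pass": 0.6, "Pub": 0.4},
    "Pass":     {"Sleep": 1.0},
    "Pub":      {"Class1": 0.2, "Class2": 0.4, "Class3": 0.4},
    "Facebook": {"Class1": 0.1, "Facebook": 0.9},
    "Sleep":    {"Sleep": 1.0},  # absorbing state: the episode ends here
}


def sample_episode(start, terminal="Sleep", rng=random, max_steps=1000):
    """Sample a state sequence S_1, S_2, ... from the chain.

    Each next state is drawn from P[s] for the current state s only,
    which is exactly the Markov property.
    """
    states = [start]
    s = start
    for _ in range(max_steps):
        if s == terminal:
            break
        successors = list(P[s])
        weights = [P[s][nxt] for nxt in successors]
        s = rng.choices(successors, weights=weights, k=1)[0]
        states.append(s)
    return states


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the samples are reproducible
    for _ in range(3):
        print(" -> ".join(sample_episode("Class1", rng=rng)))
```

Storing each row of P as a dict keeps the sketch short. A dense N×N matrix would be the natural choice once the direct linear-algebra solution is needed.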

Markov Reward Process

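The Bellman equation for an MRP, $v = R + \gamma P v$, is linear, so it can be solved directly as $v = (I - \gamma P)^{-1} R$. The direct solution costs $O(N^3)$ for $N$ states, so it is practical only for small MRPs. Large MRPs are solved with iterative methods instead: dynamic programming, Monte Carlo evaluation, and temporal-difference learning.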
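The contrast between the direct solution and an iterative method can be sketched in pure Python. The sketch solves $(I - \gamma P) v = R$ with Gaussian elimination, which is the $O(N^3)$ step. It then compares the result against dynamic programming, which repeats $v \leftarrow R + \gamma P v$. Treat this as an illustration only. The MRP below is hypothetical and loosely modeled on the lecture's "student" example, $\gamma = 0.9$ is an assumed discount, and `solve_linear`, `value_direct` and `value_iterative` are names made up for this note.

```python
# Hypothetical MRP loosely modeled on the lecture's "student" example; all
# numbers are illustrative. R[s] is the expected reward for leaving state s.
STATES = ["Class1", "Class2", "Class3", "Pass", "Pub", "Facebook", "Sleep"]
P = {
    "Class1":   {"Class2": 0.5, "Facebook": 0.5},
    "Class2":   {"Class3": 0.8, "Sleep": 0.2},
    "Class3":   {"Pass": 0.6, "Pub": 0.4},
    "Pass":     {"Sleep": 1.0},
    "Pub":      {"Class1": 0.2, "Class2": 0.4, "Class3": 0.4},
    "Facebook": {"Class1": 0.1, "Facebook": 0.9},
    "Sleep":    {"Sleep": 1.0},  # absorbing, zero reward, so v(Sleep) = 0
}
R = {"Class1": -2.0, "Class2": -2.0, "Class3": -2.0, "Pass": 10.0,
     "Pub": 1.0, "Facebook": -1.0, "Sleep": 0.0}
GAMMA = 0.9  # assumed discount factor


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting: O(N^3)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def value_direct(gamma=GAMMA):
    """Direct solution: rewrite v = R + gamma P v as (I - gamma P) v = R."""
    A = [[(1.0 if i == j else 0.0) - gamma * P[si].get(sj, 0.0)
          for j, sj in enumerate(STATES)] for i, si in enumerate(STATES)]
    b = [R[s] for s in STATES]
    return dict(zip(STATES, solve_linear(A, b)))


def value_iterative(gamma=GAMMA, tol=1e-10, max_iters=10_000):
    """Dynamic programming: repeat v <- R + gamma P v until it stops changing."""
    v = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_v = {s: R[s] + gamma * sum(p * v[s2] for s2, p in P[s].items())
                 for s in STATES}
        delta = max(abs(new_v[s] - v[s]) for s in STATES)
        v = new_v
        if delta < tol:
            break
    return v


if __name__ == "__main__":
    direct, iterative = value_direct(), value_iterative()
    for s in STATES:
        print(f"{s:9s} direct={direct[s]:8.3f} iterative={iterative[s]:8.3f}")
```

With these numbers, both methods should agree, giving roughly v(Class1) ≈ -5.0 and v(Class3) ≈ 4.1. Each sweep of the iterative version costs O(N²) on a dense P. Because $\gamma < 1$, the update is a contraction, which is why it scales better than elimination when N is large.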

Markov Decision Process (Markov reward process with decisions)

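A policy $\pi$ is a distribution over actions given states, $\pi(a \mid s) = P[A_t = a \mid S_t = s]$. Given an MDP and a policy $\pi$, the state sequence $S_1, S_2, \ldots$ is a Markov process, and the state and reward sequence $S_1, R_2, S_2, \ldots$ is a Markov reward process.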
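The claim that an MDP plus a fixed policy yields a Markov reward process can be made concrete by averaging the actions out. The averaged quantities are $P^\pi_{ss'} = \sum_a \pi(a \mid s) P^a_{ss'}$ and $R^\pi_s = \sum_a \pi(a \mid s) R^a_s$. The sketch below is a minimal illustration. Its MDP is hypothetical and loosely modeled on the lecture's "student MDP", with a uniform random policy. The names `MDP`, `uniform_policy` and `induced_mrp` are made up for this note.

```python
# Hypothetical MDP loosely modeled on the lecture's "student MDP"; the numbers
# are illustrative. MDP[s][a] = (reward R^a_s, {next_state: P^a_{ss'}}).
# "Sleep" is terminal: it has no actions and is treated as having value 0.
MDP = {
    "Class1":   {"Study": (-2.0, {"Class2": 1.0}),
                 "Facebook": (-1.0, {"Facebook": 1.0})},
    "Class2":   {"Study": (-2.0, {"Class3": 1.0}),
                 "Sleep": (0.0, {"Sleep": 1.0})},
    "Class3":   {"Study": (10.0, {"Sleep": 1.0}),
                 "Pub": (1.0, {"Class1": 0.2, "Class2": 0.4, "Class3": 0.4})},
    "Facebook": {"Facebook": (-1.0, {"Facebook": 1.0}),
                 "Quit": (0.0, {"Class1": 1.0})},
}


def uniform_policy(mdp):
    """pi(a|s): equal probability for every action available in state s."""
    return {s: {a: 1.0 / len(acts) for a in acts} for s, acts in mdp.items()}


def induced_mrp(mdp, pi):
    """Average the actions out under pi to get the MRP (P^pi, R^pi)."""
    P_pi, R_pi = {}, {}
    for s, actions in mdp.items():
        # R^pi_s = sum_a pi(a|s) R^a_s
        R_pi[s] = sum(pi[s][a] * r for a, (r, _) in actions.items())
        # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
        row = {}
        for a, (_, trans) in actions.items():
            for s2, p in trans.items():
                row[s2] = row.get(s2, 0.0) + pi[s][a] * p
        P_pi[s] = row
    return P_pi, R_pi


if __name__ == "__main__":
    P_pi, R_pi = induced_mrp(MDP, uniform_policy(MDP))
    for s in MDP:
        print(s, R_pi[s], P_pi[s])
```

Once $P^\pi$ and $R^\pi$ are built, the MRP machinery from the previous section applies unchanged to evaluate the policy.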

The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$: $v_\pi(s) = E_\pi[G_t \mid S_t = s]$.

The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$.
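The two value functions are tied together by one-step lookahead. The first relation is $q_\pi(s, a) = R^a_s + \gamma \sum_{s'} P^a_{ss'} v_\pi(s')$, and the second is $v_\pi(s) = \sum_a \pi(a \mid s) q_\pi(s, a)$. Alternating them gives iterative policy evaluation. The sketch below is a minimal illustration under several assumptions. It uses a hypothetical MDP loosely modeled on the lecture's "student MDP" and a uniform random policy. It also assumes $\gamma = 1$, which is safe here only because every episode eventually reaches the terminal state. The helper names are made up for this note.

```python
# Hypothetical MDP loosely modeled on the lecture's "student MDP"; numbers are
# illustrative. MDP[s][a] = (reward, {next_state: prob}); terminal "Sleep" has
# no entry and is treated as value 0.
MDP = {
    "Class1":   {"Study": (-2.0, {"Class2": 1.0}),
                 "Facebook": (-1.0, {"Facebook": 1.0})},
    "Class2":   {"Study": (-2.0, {"Class3": 1.0}),
                 "Sleep": (0.0, {"Sleep": 1.0})},
    "Class3":   {"Study": (10.0, {"Sleep": 1.0}),
                 "Pub": (1.0, {"Class1": 0.2, "Class2": 0.4, "Class3": 0.4})},
    "Facebook": {"Facebook": (-1.0, {"Facebook": 1.0}),
                 "Quit": (0.0, {"Class1": 1.0})},
}
GAMMA = 1.0  # assumed: undiscounted, fine because episodes always terminate


def uniform_policy(mdp):
    """pi(a|s): equal probability for every action available in state s."""
    return {s: {a: 1.0 / len(acts) for a in acts} for s, acts in mdp.items()}


def q_from_v(mdp, v, gamma):
    """q_pi(s,a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')."""
    return {s: {a: r + gamma * sum(p * v.get(s2, 0.0) for s2, p in trans.items())
                for a, (r, trans) in acts.items()}
            for s, acts in mdp.items()}


def v_from_q(pi, q):
    """v_pi(s) = sum_a pi(a|s) q_pi(s,a)."""
    return {s: sum(pi[s][a] * q[s][a] for a in q[s]) for s in q}


def evaluate_policy(mdp, pi, gamma, tol=1e-10, max_iters=100_000):
    """Iterative policy evaluation: repeat v <- v_from_q(pi, q_from_v(v))."""
    v = {s: 0.0 for s in mdp}  # terminal state absent, so v.get(...) gives 0
    for _ in range(max_iters):
        new_v = v_from_q(pi, q_from_v(mdp, v, gamma))
        delta = max(abs(new_v[s] - v[s]) for s in mdp)
        v = new_v
        if delta < tol:
            break
    return v


if __name__ == "__main__":
    pi = uniform_policy(MDP)
    v = evaluate_policy(MDP, pi, GAMMA)
    q = q_from_v(MDP, v, GAMMA)
    for s in MDP:
        print(s, round(v[s], 2), {a: round(x, 2) for a, x in q[s].items()})
```

For these numbers, the fixed point can be checked by hand. It gives v(Class3) = 4.8 / 0.65 ≈ 7.38, and q(Class3, Study) = 10 because that action leads straight to the terminal state.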
