
Chapter 16 Applications and 17 Frontiers

2020-08-25 10:38

Notes on Chapter 16 (Applications) and Chapter 17 (Frontiers) of Sutton and Barto's Reinforcement Learning: An Introduction (2nd edition)

  • Application
  • Personalized Web Services
  • Thermal Soaring
  • Frontiers
  • Questions

    Why not combine function approximation and policy approximation with Dyna? The policy update could be realized by minimizing the TD error, and the value table could be replaced by an ANN or a linear approximator.
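
    As a rough sketch of what such a combination might look like (value-function side only; an explicit policy, e.g. an actor driven by the same TD error, could be layered on top). The feature construction, model representation, and constants below are my own illustration, not from the book:

```python
import random
import numpy as np

# Dyna-style agent with a linear action-value approximator in place of a value
# table: (a) direct RL from real experience, (b) model learning, (c) planning
# with simulated transitions drawn from the learned model.

class LinearDynaQ:
    def __init__(self, n_features, alpha=0.1, gamma=0.95, n_planning=10):
        self.w = np.zeros(n_features)      # weights of the linear approximator
        self.model = {}                    # learned one-step model: (s, a) -> (r, s')
        self.alpha, self.gamma, self.n_planning = alpha, gamma, n_planning

    def q(self, x):
        """Action value for a feature vector x(s, a)."""
        return float(self.w @ x)

    def _update(self, x, r, x_next_best):
        """Semi-gradient Q-learning step on real or simulated experience."""
        td_error = r + self.gamma * self.q(x_next_best) - self.q(x)
        self.w += self.alpha * td_error * x

    def learn(self, s, a, r, s_next, features, actions):
        x = features(s, a)
        x_best = max((features(s_next, b) for b in actions), key=self.q)
        self._update(x, r, x_best)         # (a) direct RL
        self.model[(s, a)] = (r, s_next)   # (b) model learning
        for _ in range(self.n_planning):   # (c) planning from simulated experience
            (ps, pa), (pr, ps2) = random.choice(list(self.model.items()))
            px = features(ps, pa)
            px_best = max((features(ps2, b) for b in actions), key=self.q)
            self._update(px, pr, px_best)
```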

    Application

    Samuel’s Checkers Player 1959-1967

    Samuel was one of the first to make effective use of heuristic search methods and of what we would now call temporal-difference learning.

    Samuel’s programs played by performing a lookahead search from each current position. They used what we now call heuristic search methods to determine how to expand the search tree and when to stop searching. The terminal board positions of each search were evaluated, or “scored,” by a value function, or “scoring polynomial,” using linear function approximation.

    Samuel used two main learning methods, the simplest of which he called rote learning. It consisted simply of saving a description of each board position encountered during play together with its backed-up value determined by the minimax procedure. The essential idea of temporal-difference learning is that the value of a state should equal the value of likely following states; Samuel came closest to this idea in his second learning method, his “learning by generalization”.

    Samuel did not include explicit rewards. Instead, he fixed the weight of the most important feature, the piece advantage feature.

    After learning from many games against itself, the program played as a “better-than-average novice”; fairly good amateur opponents characterized it as “tricky but beatable”.

    TD-Gammon 1992-2002

    The learning algorithm in TD-Gammon was a straightforward combination of the TD($\lambda$) algorithm and nonlinear function approximation using a multilayer artificial neural network (ANN) trained by backpropagating TD errors.
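
    A minimal sketch of semi-gradient TD(λ) with a one-hidden-layer network and eligibility traces over all the weights, in the spirit of TD-Gammon; the layer sizes roughly follow TD-Gammon 0.0 (198 inputs, 40 hidden units), but the step size and λ below are illustrative rather than Tesauro's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 198, 40
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
w2 = rng.normal(scale=0.1, size=n_hidden)
alpha, lam, gamma = 0.1, 0.7, 1.0              # undiscounted episodic task

def value_and_grads(x):
    """Estimated win probability for feature vector x, plus gradients w.r.t. both layers."""
    h = np.tanh(W1 @ x)
    v = 1.0 / (1.0 + np.exp(-(w2 @ h)))        # sigmoid output in [0, 1]
    dv = v * (1.0 - v)                         # derivative of the sigmoid
    g_w2 = dv * h
    g_W1 = dv * np.outer(w2 * (1.0 - h ** 2), x)
    return v, g_W1, g_w2

def td_lambda_episode(states, outcome):
    """One self-play game: `states` is the sequence of feature vectors, `outcome` the final reward."""
    global W1, w2
    z_W1, z_w2 = np.zeros_like(W1), np.zeros_like(w2)      # eligibility traces
    for t, x in enumerate(states):
        v, g_W1, g_w2 = value_and_grads(x)
        z_W1 = gamma * lam * z_W1 + g_W1
        z_w2 = gamma * lam * z_w2 + g_w2
        if t + 1 < len(states):
            v_next, _, _ = value_and_grads(states[t + 1])
            delta = gamma * v_next - v                      # reward is zero until the game ends
        else:
            delta = outcome - v                             # terminal target is the game outcome
        W1 += alpha * delta * z_W1                          # TD error applied through the traces
        w2 += alpha * delta * z_w2
```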

    Tesauro obtained an unending sequence of games by playing his learning backgammon player against itself.

    After playing about 300,000 games against itself, TD-Gammon 0.0 as described above
    learned to play approximately as well as the best previous backgammon computer programs.

    The tournament success of TD-Gammon 0.0 with zero expert backgammon knowledge
    suggested an obvious modification: add the specialized backgammon features but keep
    the self-play TD learning method. This produced TD-Gammon 1.0. TD-Gammon 1.0 was
    clearly substantially better than all previous backgammon programs and found serious
    competition only among human experts.

    TD-Gammon illustrates the combination of learned value functions with decision-time search, as in heuristic search and MCTS methods; it also led to great improvements in the overall caliber of human tournament play.

    Watson’s Daily-Double Wagering 2011-2013

    Tesauro and colleagues at IBM adapted the TD-Gammon approach described above to create the strategy used by Watson for “Daily-Double” (DD) wagering in its celebrated winning performance against human champions.

    Action values were computed whenever a betting decision was needed by using two types of estimates that were learned before any live game play took place. The first were estimated values of the afterstates (Section 6.8) that would result from selecting each legal bet. These estimates were obtained from a state-value function, $\hat{v}(\cdot, w)$, defined by parameters $w$, that gave estimates of the probability of a win for Watson from any game state. The second estimates used to compute action values gave the “in-category DD confidence”.
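
    A hedged sketch of how a bet's action value could be assembled from these two estimators: the value function scores the afterstates reached by answering the DD clue right or wrong, weighted by the in-category confidence. The state interface and function names below are my own, not IBM's code:

```python
# v_hat(state)  -> estimated probability that Watson wins the game from `state`
# p_dd          -> estimated probability of answering the Daily-Double clue correctly

def bet_value(state, bet, v_hat, p_dd):
    after_right = state.with_score_change(+bet)   # afterstate if the clue is answered correctly
    after_wrong = state.with_score_change(-bet)   # afterstate if it is answered incorrectly
    return p_dd * v_hat(after_right) + (1.0 - p_dd) * v_hat(after_wrong)

def choose_bet(state, legal_bets, v_hat, p_dd):
    """Select the legal bet with the highest estimated action value."""
    return max(legal_bets, key=lambda b: bet_value(state, b, v_hat, p_dd))
```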

    Human-level Video Game Play

    A team of researchers at Google DeepMind developed an impressive demonstration that
    a deep multi-layer ANN can automate the feature design process (Mnih et al., 2013, 2015).

    Mnih et al. developed a reinforcement learning agent called deep Q-network (DQN) that combined Q-learning with a deep convolutional ANN, a many-layered, or deep, ANN specialized for processing spatial arrays of data such as images. Another motivation for using Q-learning was that DQN used the experience replay method, described below, which requires an off-policy algorithm. Being model-free and off-policy made Q-learning a natural choice.
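
    A minimal sketch of the two ingredients named above: an experience-replay buffer and off-policy Q-learning targets computed from a separate target network. The network itself and the gradient step are omitted; `target_net` is assumed to map a batch of states to per-action value estimates:

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are discarded automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # uniform random minibatch
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def q_learning_targets(batch, target_net, gamma=0.99):
    """Off-policy targets: bootstrap from the maximizing action of the target network."""
    s, a, r, s_next, done = batch
    next_q = target_net(s_next).max(axis=1)       # max_a' Q(s', a')
    return r + gamma * (1.0 - done) * next_q      # no bootstrapping past terminal states
```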

    DQN advanced the state-of-the-art in machine learning by impressively demonstrating the promise of combining reinforcement learning with modern methods of deep learning.

    Mastering the Game of Go

    AlphaGo

    It selected moves by a novel version of MCTS that was guided by both a policy and a value function learned by reinforcement learning with function approximation provided by deep convolutional ANNs.
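
    The selection step of AlphaGo-style MCTS is often written as a PUCT rule, in which the policy network's prior probabilities bias exploration and the value estimates drive exploitation. A sketch; the node/edge structure and the constant `c_puct` are illustrative:

```python
import math

def select_action(node, c_puct=1.0):
    """During tree descent, pick the edge maximizing Q + U."""
    total_visits = sum(edge.visit_count for edge in node.edges.values())

    def score(edge):
        # U term: prior from the policy network, decaying with the edge's visit count
        u = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visit_count)
        return edge.mean_value + u    # Q term: mean backed-up value of the edge

    return max(node.edges, key=lambda a: score(node.edges[a]))
```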

    It started from weights that were the result of previous supervised learning from a large collection of human expert moves

    AlphaGo Zero

    AlphaGo Zero’s MCTS was simpler than the version used by AlphaGo in that it did not include rollouts of complete games, and therefore did not need a rollout policy. AlphaGo Zero used only one deep convolutional ANN and used a simpler version of MCTS.

    Personalized Web Services

    It formulated personalized recommendation as a Markov decision problem (MDP) with
    the objective of maximizing the total number of clicks users make over repeated visits to
    a website, using life-time value (LTV) optimization.
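
    A hedged formalization of this objective (my notation; the book describes it only in prose): a unit reward per click, with the return summed over a user's repeated visits rather than over a single recommendation, so the policy is credited for long-term engagement instead of immediate click-through rate.

```latex
\text{LTV}(\pi) \;=\; \mathbb{E}_\pi\!\left[\sum_{t=0}^{T} R_{t+1}\right],
\qquad
R_{t+1} =
\begin{cases}
  1 & \text{if the user clicks following visit } t,\\
  0 & \text{otherwise.}
\end{cases}
```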

    Thermal Soaring

    By experimenting with various reward signals, they found that learning was best with a reward signal that at each time step linearly combined the vertical wind velocity and vertical wind acceleration observed on the previous time step.

    Learning was by one-step Sarsa, with actions selected according to a soft-max distribution
    based on normalized action values.
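
    A small sketch of this learning rule: one-step Sarsa with actions drawn from a soft-max distribution over normalized action values. The tabular Q, temperature, and step size are illustrative simplifications:

```python
import numpy as np

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(normalized value / temperature)."""
    q = np.asarray(q_values, dtype=float)
    q = (q - q.mean()) / (q.std() + 1e-8)          # normalize the action values
    prefs = q / temperature
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return np.random.choice(len(q), p=probs)

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step Sarsa: bootstrap from the action actually selected next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```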

    This computational study of thermal soaring illustrates how reinforcement learning can further progress toward different kinds of objectives.

    Frontiers

    General Value Functions and Auxiliary Tasks

    Rather than predicting the sum of future rewards, we might predict the sum of the future values of a sound or color sensation, or of an internal, highly processed signal such as another prediction. Whatever signal is added up in this way in a value-function-like prediction, we call it the cumulant of that prediction, formalized as a cumulant signal $C_t \in \mathbb{R}$. Using this, we can define a general value function, or GVF.
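
    Written out in the book's notation (my transcription, so treat the exact indexing as a sketch), a GVF predicts the expected sum of the cumulant under a policy π, with a state-dependent termination function γ applied multiplicatively along the way:

```latex
% The cumulant C replaces the reward; gamma(S_i) plays the role of discounting.
\hat{v}(s;\pi,\gamma,C) \;\approx\;
\mathbb{E}\!\left[\left.\sum_{k=t}^{\infty}
  \left(\prod_{i=t+1}^{k}\gamma(S_i)\right) C_{k+1}
\,\right|\, S_t = s,\; A_{t:\infty}\sim\pi\right]
```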

    One simple way in which auxiliary tasks can help on the main task is that they may require some of the same representations as are needed on the main task.

    Another simple way in which the learning of auxiliary tasks can improve performance is best explained by analogy to the psychological phenomena of classical conditioning

    Finally, perhaps the most important role for auxiliary tasks is in moving beyond the assumption we have made throughout this book that the state representation is fixed and given to the agent.

    Temporal Abstraction via Options

    Can the MDP framework be stretched to cover all the levels simultaneously?

    Perhaps it can. One popular idea is to formalize an MDP at a detailed level, with a small time step, yet enable planning at higher levels using extended courses of action that correspond to many base-level time steps. To do this we need a notion of course of action that extends over many time steps and includes a notion of termination. A general way to formulate these two ideas is as a policy, $\pi$, and a state-dependent termination function, $\gamma$, as in GVFs. We define a pair of these as a generalized notion of action termed an option.

    Options effectively extend the action space. The agent can either select a low-level action/option, terminating after one time step, or select an extended option that might execute for many time steps before terminating.
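
    A sketch of the option data structure as described above. Here `termination` returns the probability of stopping in a state (i.e., 1 − γ in the book's continuation notation), `can_start` marks the states where the option may be selected, and the environment interface (`env.step` returning state, reward, done) is illustrative:

```python
import random
from dataclasses import dataclass
from typing import Any, Callable

State, Action = Any, Any

@dataclass
class Option:
    policy: Callable[[State], Action]       # which low-level action to take in each state
    termination: Callable[[State], float]   # probability of terminating in a state
    can_start: Callable[[State], bool]      # initiation set

def run_option(env, state, option):
    """Execute an option until it terminates; return the resulting state and accumulated reward."""
    total_reward = 0.0
    while True:
        state, reward, done = env.step(option.policy(state))
        total_reward += reward
        if done or random.random() < option.termination(state):
            return state, total_reward
```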

    Observations and State

    In many cases of interest, and certainly in the lives of all natural intelligences, the sensory input gives only partial information about the state of the world.

    The framework of parametric function approximation that we developed in Part II is far less restrictive and, arguably, no limitation at all.

    First, we would change the problem. The environment would emit not its states, but only observations—signals that depend on its state but, like a robot’s sensors, provide only partial information about it.
    Second, we can recover the idea of state as used in this book from the sequence of observations and actions.
    The third step in extending reinforcement learning to partial observability is to deal with certain computational considerations.
    The fourth and final step in our brief outline of how to handle partial observability in reinforcement learning is to re-introduce approximation.
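
    For the second step, the book's construction can be summarized by a recursive state-update function, S_t = u(S_{t-1}, A_{t-1}, O_t). A sketch with a placeholder `u` that just keeps a bounded window of recent (action, observation) pairs; a learned update, such as the step of a recurrent network, would play this role in practice:

```python
def window_update(prev_state, prev_action, observation, k=10):
    """Placeholder state-update function: remember the last k (action, observation) pairs."""
    return (prev_state + [(prev_action, observation)])[-k:]

class RecursiveStateAgent:
    def __init__(self, update_fn=window_update):
        self.u = update_fn
        self.state = []            # agent-side state, not the environment's hidden state
        self.prev_action = None

    def observe(self, observation):
        self.state = self.u(self.state, self.prev_action, observation)
        return self.state

    def act(self, policy):
        self.prev_action = policy(self.state)
        return self.prev_action
```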

    Designing Reward Signals

    Success of a reinforcement learning application strongly depends on how well the reward signal frames the goal of the application’s designer and how well the signal assesses progress in reaching that goal.

    One challenge is to design a reward signal so that as an agent learns, its behavior approaches, and ideally eventually achieves, what the application’s designer actually desires.

    Even when there is a simple and easily identifiable goal, the problem of sparse reward often arises.

    It is tempting to address the sparse reward problem by rewarding the agent for achieving subgoals that the designer thinks are important way stations to the overall goal. But augmenting the reward signal with well-intentioned supplemental rewards may lead the agent to behave very differently from what is intended.
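
    One well-known way to add such supplemental rewards without changing which policies are optimal is potential-based shaping (Ng, Harada, and Russell, 1999), where the bonus is the change in a designer-chosen potential function Φ over the transition; a sketch:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Original reward plus the shaping bonus F(s, s') = gamma * phi(s') - phi(s)."""
    return r + gamma * phi(s_next) - phi(s)
```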

    Remaining Issues

    First, we still need powerful parametric function approximation methods that work well in fully incremental and online settings.

    Second (and perhaps closely related), we still need methods for learning features such that subsequent learning generalizes well.

    Third, we still need scalable methods for planning with learned environment models.

    A fourth issue that needs to be addressed in future research is that of automating the choice of tasks on which an agent works and uses to structure its developing competence.

    The fifth issue that we would like to highlight for future research is that of the interaction between behavior and learning via some computational analog of curiosity.

    A final issue that demands attention in future research is that of developing methods to make it acceptably safe to embed reinforcement learning agents into physical environments.
