Science.gov

Sample records for reinforcement learning based

  1. Reinforcement learning based artificial immune classifier.

    PubMed

    Karakose, Mehmet

    2013-01-01

    Artificial immune systems are among the widely used methods for classification, a decision-making process. Modeled on the natural immune system, they can be successfully applied to classification, optimization, recognition, and learning in real-world problems. In this study, a reinforcement learning based artificial immune classifier is proposed as a new approach. The approach uses reinforcement learning together with immune operators to find better antibodies. Compared with other methods in the literature, the proposed approach offers several advantages, such as effectiveness, fewer memory cells, high accuracy, speed, and data adaptability. Its performance is demonstrated by simulation and experimental results using real data in Matlab and on an FPGA, with benchmark data and remote image data used in the experiments. Comparative results against supervised/unsupervised artificial immune systems, a negative selection classifier, and a resource-limited artificial immune classifier demonstrate the effectiveness of the proposed method.

  2. Reinforcement Learning Based Artificial Immune Classifier

    PubMed Central

    Karakose, Mehmet

    2013-01-01

    Artificial immune systems are among the widely used methods for classification, a decision-making process. Modeled on the natural immune system, they can be successfully applied to classification, optimization, recognition, and learning in real-world problems. In this study, a reinforcement learning based artificial immune classifier is proposed as a new approach. The approach uses reinforcement learning together with immune operators to find better antibodies. Compared with other methods in the literature, the proposed approach offers several advantages, such as effectiveness, fewer memory cells, high accuracy, speed, and data adaptability. Its performance is demonstrated by simulation and experimental results using real data in Matlab and on an FPGA, with benchmark data and remote image data used in the experiments. Comparative results against supervised/unsupervised artificial immune systems, a negative selection classifier, and a resource-limited artificial immune classifier demonstrate the effectiveness of the proposed method. PMID:23935424

  3. Partial Planning Reinforcement Learning

    DTIC Science & Technology

    2012-08-31

    This project explored several problems in the areas of reinforcement learning, probabilistic planning, and transfer learning. In particular, it studied Bayesian Optimization for model-based and model-free reinforcement learning, and transfer in the context of model-free reinforcement learning based on …

  4. Model-Based Reinforcement Learning under Concurrent Schedules of Reinforcement in Rodents

    ERIC Educational Resources Information Center

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-01-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's…

  5. Model-based reinforcement learning with dimension reduction.

    PubMed

    Tangkaratt, Voot; Morimoto, Jun; Sugiyama, Masashi

    2016-12-01

    The goal of reinforcement learning is to learn an optimal policy which controls an agent to acquire the maximum cumulative reward. The model-based reinforcement learning approach learns a transition model of the environment from data, and then derives the optimal policy using the transition model. However, learning an accurate transition model in high-dimensional environments requires a large amount of data which is difficult to obtain. To overcome this difficulty, in this paper, we propose to combine model-based reinforcement learning with the recently developed least-squares conditional entropy (LSCE) method, which simultaneously performs transition model estimation and dimension reduction. We also further extend the proposed method to imitation learning scenarios. The experimental results show that policy search combined with LSCE performs well for high-dimensional control tasks including real humanoid robot control.
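
    To make the model-based pipeline sketched above concrete (collect data, estimate a transition model, then derive a policy from the learned model), here is a minimal tabular illustration in Python on a toy chain task. It does not implement the paper's LSCE estimator or handle high-dimensional states; the environment and all names are illustrative.

    ```python
    import numpy as np

    # Toy chain MDP: states 0..4, actions 0 (left) and 1 (right); reward 1 on reaching state 4.
    n_states, n_actions, gamma = 5, 2, 0.95
    rng = np.random.default_rng(0)

    def step(s, a):
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return s2, float(s2 == n_states - 1)

    # 1) Collect transition data with a random behaviour policy.
    counts = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(2000):
        a = int(rng.integers(n_actions))
        s2, r = step(s, a)
        counts[s, a, s2] += 1
        rewards[s, a] += r
        visits[s, a] += 1
        s = s2 if s2 != n_states - 1 else 0   # restart after reaching the goal

    # 2) Estimate the transition model and mean reward from the data.
    P = counts / np.maximum(visits[:, :, None], 1)
    R = rewards / np.maximum(visits, 1)

    # 3) Derive a policy from the learned model by value iteration.
    V = np.zeros(n_states)
    for _ in range(200):
        Q = R + gamma * P @ V          # shape (states, actions)
        V = Q.max(axis=1)
    print("greedy policy from learned model:", Q.argmax(axis=1))
    ```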

  6. Model-based reinforcement learning under concurrent schedules of reinforcement in rodents.

    PubMed

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-05-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's knowledge or model of the environment in model-based reinforcement learning algorithms. To investigate how animals update value functions, we trained rats under two different free-choice tasks. The reward probability of the unchosen target remained unchanged in one task, whereas it increased over time since the target was last chosen in the other task. The results show that goal choice probability increased as a function of the number of consecutive alternative choices in the latter, but not the former task, indicating that the animals were aware of time-dependent increases in arming probability and used this information in choosing goals. In addition, the choice behavior in the latter task was better accounted for by a model-based reinforcement learning algorithm. Our results show that rats adopt a decision-making process that cannot be accounted for by simple reinforcement learning models even in a relatively simple binary choice task, suggesting that rats can readily improve their decision-making strategy through the knowledge of their environments.
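
    The contrast drawn in this abstract between trial-and-error updating and model-based updating corresponds to two textbook update rules, sketched below. This is a generic illustration, not the specific model fitted to the rodent data; the toy transition and reward matrices are made up.

    ```python
    import numpy as np

    gamma, alpha = 0.9, 0.1

    # Model-free (trial-and-error): update only the value of the action just taken,
    # using the observed reward and next state.
    def model_free_update(Q, s, a, r, s2):
        td_error = r + gamma * Q[s2].max() - Q[s, a]
        Q[s, a] += alpha * td_error
        return Q

    # Model-based: update values from an internal model (transition probabilities P
    # and expected rewards R), so values can change even for actions not just taken.
    def model_based_update(Q, P, R):
        V = Q.max(axis=1)
        return R + gamma * P @ V   # one sweep of Bellman backups over all (s, a)

    # Tiny two-state, two-action example with an arbitrary known model.
    Q = np.zeros((2, 2))
    P = np.array([[[0.9, 0.1], [0.1, 0.9]],
                  [[0.8, 0.2], [0.2, 0.8]]])   # P[s, a, s']
    R = np.array([[0.0, 1.0], [0.5, 0.0]])     # R[s, a]
    print(model_based_update(Q, P, R))
    print(model_free_update(Q.copy(), s=0, a=1, r=1.0, s2=1))
    ```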

  7. A reinforcement learning-based architecture for fuzzy logic control

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1992-01-01

    This paper introduces a new method for learning to refine a rule-based fuzzy logic controller. A reinforcement learning technique is used in conjunction with a multilayer neural network model of a fuzzy controller. The approximate reasoning based intelligent control (ARIC) architecture proposed here learns by updating its prediction of the physical system's behavior and fine tunes a control knowledge base. Its theory is related to Sutton's temporal difference (TD) method. Because ARIC has the advantage of using the control knowledge of an experienced operator and fine tuning it through the process of learning, it learns faster than systems that train networks from scratch. The approach is applied to a cart-pole balancing system.
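
    ARIC's link to Sutton's temporal difference method amounts to updating a prediction of the plant's behavior from the TD error. A bare-bones TD(0) prediction update of that flavor is sketched below on a plain state-value table, not on ARIC's neural fuzzy architecture; the example transition is arbitrary.

    ```python
    def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.95):
        """One TD(0) step: move v[s] toward the bootstrapped target r + gamma * v[s_next]."""
        td_error = r + gamma * v[s_next] - v[s]
        v[s] += alpha * td_error
        return td_error   # in actor-critic schemes this error also reinforces the action taken

    # Example: value table for 3 states, one observed transition 0 -> 1 with reward 0.
    v = [0.0, 0.5, 1.0]
    print(td0_update(v, s=0, r=0.0, s_next=1), v)
    ```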

  8. Mobile robots exploration through cnn-based reinforcement learning.

    PubMed

    Tai, Lei; Liu, Ming

    2016-01-01

    Exploration in an unknown environment is an elemental application for mobile robots. In this paper, we outlined a reinforcement learning method aimed at solving the exploration problem in a corridor environment. The learning model took the depth image from an RGB-D sensor as its only input. The feature representation of the depth image was extracted through a pre-trained convolutional neural network model. Based on the recent success of the deep Q-network in artificial intelligence, the robot controller achieved exploration and obstacle-avoidance abilities in several different simulated environments. To our knowledge, this is the first time reinforcement learning has been used to build an exploration strategy for mobile robots from raw sensor information.
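
    The controller described above pairs a fixed, pre-trained convolutional feature extractor with deep-Q-style learning. The sketch below substitutes a stub feature extractor and a linear Q-function for the CNN and the deep network, only to show how the pieces fit together; nothing here reproduces the authors' architecture or training setup.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n_features, n_actions, gamma, alpha, eps = 16, 3, 0.99, 0.01, 0.1

    def extract_features(depth_image):
        """Stand-in for a pre-trained CNN: any fixed mapping from image to feature vector."""
        flat = depth_image.ravel()
        return np.tanh(flat[:n_features])        # crude fixed projection, for the sketch only

    W = np.zeros((n_actions, n_features))        # linear Q(s, a) = W[a] @ phi(s)

    def act(phi):
        if rng.random() < eps:                   # epsilon-greedy exploration
            return int(rng.integers(n_actions))
        return int(np.argmax(W @ phi))

    def q_update(phi, a, r, phi_next, done):
        target = r if done else r + gamma * np.max(W @ phi_next)
        td_error = target - W[a] @ phi
        W[a] += alpha * td_error * phi           # semi-gradient Q-learning step

    # One fake interaction step with random "depth images".
    phi = extract_features(rng.random((8, 8)))
    a = act(phi)
    phi_next = extract_features(rng.random((8, 8)))
    q_update(phi, a, r=-0.1, phi_next=phi_next, done=False)
    ```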

  9. Manifold-Based Reinforcement Learning via Locally Linear Reconstruction.

    PubMed

    Xu, Xin; Huang, Zhenhua; Zuo, Lei; He, Haibo

    2017-04-01

    Feature representation is critical not only for pattern recognition tasks but also for reinforcement learning (RL) methods to solve learning control problems under uncertainties. In this paper, a manifold-based RL approach using the principle of locally linear reconstruction (LLR) is proposed for Markov decision processes with large or continuous state spaces. In the proposed approach, an LLR-based feature learning scheme is developed for value function approximation in RL, where a set of smooth feature vectors is generated by preserving the local approximation properties of neighboring points in the original state space. By using the proposed feature learning scheme, an LLR-based approximate policy iteration (API) algorithm is designed for learning control problems with large or continuous state spaces. The relationship between the value approximation error of a new data point and the estimated values of its nearest neighbors is analyzed. In order to compare different feature representation and learning approaches for RL, a comprehensive simulation and experimental study was conducted on three benchmark learning control problems. It is illustrated that under a wide range of parameter settings, the LLR-based API algorithm can obtain better learning control performance than the previous API methods with different feature representation schemes.
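
    The core of the LLR-based feature scheme is reconstructing a query state from its nearest neighbors and reusing the reconstruction weights for value approximation. A minimal numpy sketch of computing such weights (standard locally linear reconstruction with a sum-to-one constraint, not the authors' full approximate policy iteration algorithm) follows; the sample states and values are made up.

    ```python
    import numpy as np

    def llr_weights(x, neighbors, reg=1e-3):
        """Weights w (summing to 1) that best reconstruct x from its neighbors."""
        Z = neighbors - x                     # shift neighbors so x sits at the origin
        C = Z @ Z.T                           # local Gram matrix
        C += reg * np.trace(C) * np.eye(len(neighbors))   # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(len(neighbors)))
        return w / w.sum()

    # Approximate the value of a new state from the values of its stored neighbors.
    states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    values = np.array([0.0, 1.0, 2.0])
    x_new = np.array([0.4, 0.4])
    w = llr_weights(x_new, states)
    print("weights:", w, "approx value:", w @ values)
    ```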

  10. The ubiquity of model-based reinforcement learning.

    PubMed

    Doll, Bradley B; Simon, Dylan A; Daw, Nathaniel D

    2012-12-01

    The reward prediction error (RPE) theory of dopamine (DA) function has enjoyed great success in the neuroscience of learning and decision-making. This theory is derived from model-free reinforcement learning (RL), in which choices are made simply on the basis of previously realized rewards. Recently, attention has turned to correlates of more flexible, albeit computationally complex, model-based methods in the brain. These methods are distinguished from model-free learning by their evaluation of candidate actions using expected future outcomes according to a world model. Puzzlingly, signatures from these computations seem to be pervasive in the very same regions previously thought to support model-free learning. Here, we review recent behavioral and neural evidence about these two systems, in an attempt to reconcile their enigmatic cohabitation in the brain.

  11. Reinforcement Learning Based Web Service Compositions for Mobile Business

    NASA Astrophysics Data System (ADS)

    Zhou, Juan; Chen, Shouming

    In this paper, we propose a new solution to Reactive Web Service Composition by modeling it with Reinforcement Learning and introducing modifiable (alterable) QoS variables into the model as elements of the Markov Decision Process tuple. We then give an example of Reactive-WSC-based mobile banking to demonstrate the solution's capability of obtaining an optimized service composition, characterized by (alterable) target QoS variable sets with optimized values. We conclude that the solution has considerable potential for improving customer experience and quality of service in Web Services, and in applications across the wider electronic commerce and business sector.

  12. Model-based hierarchical reinforcement learning and human action control.

    PubMed

    Botvinick, Matthew; Weinstein, Ari

    2014-11-05

    Recent work has reawakened interest in goal-directed or 'model-based' choice, where decisions are based on prospective evaluation of potential action outcomes. Concurrently, there has been growing attention to the role of hierarchy in decision-making and action control. We focus here on the intersection between these two areas of interest, considering the topic of hierarchical model-based control. To characterize this form of action control, we draw on the computational framework of hierarchical reinforcement learning, using this to interpret recent empirical findings. The resulting picture reveals how hierarchical model-based mechanisms might play a special and pivotal role in human decision-making, dramatically extending the scope and complexity of human behaviour.

  13. Model-based hierarchical reinforcement learning and human action control

    PubMed Central

    Botvinick, Matthew; Weinstein, Ari

    2014-01-01

    Recent work has reawakened interest in goal-directed or ‘model-based’ choice, where decisions are based on prospective evaluation of potential action outcomes. Concurrently, there has been growing attention to the role of hierarchy in decision-making and action control. We focus here on the intersection between these two areas of interest, considering the topic of hierarchical model-based control. To characterize this form of action control, we draw on the computational framework of hierarchical reinforcement learning, using this to interpret recent empirical findings. The resulting picture reveals how hierarchical model-based mechanisms might play a special and pivotal role in human decision-making, dramatically extending the scope and complexity of human behaviour. PMID:25267822

  14. Variable Resolution Reinforcement Learning.

    DTIC Science & Technology

    1995-04-01

    Can reinforcement learning ever become a practical method for real control problems? This paper begins by reviewing three reinforcement learning algorithms … reinforcement learning. In addition to exploring state space and developing a control policy to achieve a task, partigame also learns a kd-tree partitioning of …

  15. Robot Docking Based on Omnidirectional Vision and Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Muse, David; Weber, Cornelius; Wermter, Stefan

    We present a system for visual robotic docking using an omnidirectional camera coupled with the actor-critic reinforcement learning algorithm. The system enables a PeopleBot robot to locate and approach a table so that it can pick an object from it using the pan-tilt camera mounted on the robot. We use a staged approach to solve this problem, as there are distinct subtasks and different sensors involved. The robot starts by wandering randomly until the table is located via a landmark; a network trained via reinforcement then allows the robot to turn to and approach the table. Once at the table, the robot picks the object from it. We argue that our approach has considerable potential, allowing robot control for navigation to be learned while removing the need for internal maps of the environment. This is achieved by allowing the robot to learn couplings between motor actions and the position of a landmark.
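
    The docking behavior above is trained with an actor-critic reinforcement learning rule. A generic tabular actor-critic update of that kind, not the specific network from the paper, is sketched below; the states, actions, and reward are placeholders.

    ```python
    import numpy as np

    n_states, n_actions = 10, 4
    gamma, alpha_v, alpha_pi = 0.95, 0.1, 0.05
    V = np.zeros(n_states)                    # critic: state values
    prefs = np.zeros((n_states, n_actions))   # actor: action preferences

    def policy(s):
        p = np.exp(prefs[s] - prefs[s].max())
        return p / p.sum()

    def actor_critic_update(s, a, r, s_next):
        td_error = r + gamma * V[s_next] - V[s]   # critic's evaluation of the transition
        V[s] += alpha_v * td_error                # critic update
        prefs[s, a] += alpha_pi * td_error        # actor: reinforce the action in proportion to the TD error
        return td_error

    # Example transition: in state 3, the sampled action led to state 4 with reward 0.5.
    rng = np.random.default_rng(2)
    a = int(rng.choice(n_actions, p=policy(3)))
    actor_critic_update(3, a, 0.5, 4)
    ```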

  16. B-tree search reinforcement learning for model based intelligent agent

    NASA Astrophysics Data System (ADS)

    Bhuvaneswari, S.; Vignashwaran, R.

    2013-03-01

    Agents trained by learning techniques provide a powerful approximation of active solutions compared with naive approaches. In this study, B-tree search combined with reinforcement learning is used to moderate the data search for information retrieval, achieving accuracy with minimum search time. The impact of the variables and tactics applied in training is determined using reinforcement learning. Agents based on these techniques perform at a satisfactory baseline and act as finite agents, following the predetermined model, against competitors from the course.

  17. Reinforcement Learning: A Tutorial.

    DTIC Science & Technology

    1997-01-01

    The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in … provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section (2) the parts of a reinforcement learning problem … reinforcement learning algorithms. These include TD(lambda) and both the residual and direct forms of value iteration, Q-learning, and advantage learning.
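
    Of the algorithms listed in the tutorial, tabular Q-learning is the simplest to write down. The sketch below is the standard textbook version on a toy chain-walk task, included only as a concrete reference point; it is not taken from the tutorial itself.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    n_states, n_actions = 6, 2            # chain of states 0..5; reaching state 5 pays reward 1
    gamma, alpha, eps = 0.9, 0.2, 0.1
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(q_row):
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        best = np.flatnonzero(q_row == q_row.max())
        return int(rng.choice(best))      # break ties randomly

    for episode in range(300):
        s = 0
        while s != n_states - 1:
            a = eps_greedy(Q[s])
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Q-learning update: bootstrap from the best action in the next state (off-policy).
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = s2

    print(np.round(Q, 2))                 # the "move right" action should dominate in every state
    ```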

  18. Knowledge-Based Reinforcement Learning for Data Mining

    NASA Astrophysics Data System (ADS)

    Kudenko, Daniel; Grzes, Marek

    Data Mining is the process of extracting patterns from data. Two general avenues of research in the intersecting areas of agents and data mining can be distinguished. The first approach is concerned with mining an agent’s observation data in order to extract patterns, categorize environment states, and/or make predictions of future states. In this setting, data is normally available as a batch, and the agent’s actions and goals are often independent of the data mining task. The data collection is mainly considered as a side effect of the agent’s activities. Machine learning techniques applied in such situations fall into the class of supervised learning. In contrast, the second scenario occurs where an agent is actively performing the data mining, and is responsible for the data collection itself. For example, a mobile network agent is acquiring and processing data (where the acquisition may incur a certain cost), or a mobile sensor agent is moving in a (perhaps hostile) environment, collecting and processing sensor readings. In these settings, the tasks of the agent and the data mining are highly intertwined and interdependent (or even identical). Supervised learning is not a suitable technique for these cases. Reinforcement Learning (RL) enables an agent to learn from experience (in form of reward and punishment for explorative actions) and adapt to new situations, without a teacher. RL is an ideal learning technique for these data mining scenarios, because it fits the agent paradigm of continuous sensing and acting, and the RL agent is able to learn to make decisions on the sampling of the environment which provides the data. Nevertheless, RL still suffers from scalability problems, which have prevented its successful use in many complex real-world domains. The more complex the tasks, the longer it takes a reinforcement learning algorithm to converge to a good solution. For many real-world tasks, human expert knowledge is available. For example, human

  19. Reinforcement of Learning

    ERIC Educational Resources Information Center

    Jones, Peter

    1977-01-01

    A company trainer shows some ways of scheduling reinforcement of learning for trainees: continuous reinforcement, fixed ratio, variable ratio, fixed interval, and variable interval. As there are problems with all methods, he suggests trying combinations of various types of reinforcement. (MF)

  20. [A model of reward choice based on the theory of reinforcement learning].

    PubMed

    Smirnitskaia, I A; Frolov, A A; Merzhanova, G Kh

    2007-01-01

    We developed a model of the alimentary instrumental conditioned bar-pressing reflex for cats choosing between an immediate small reinforcement ("impulsive behavior") and a delayed, more valuable reinforcement ("self-control behavior"). Our model is based on reinforcement learning theory. We emulated the contribution of dopamine with the theory's discount coefficient (a subjective decrease in the value of a delayed reinforcement). Computer simulations showed that "cats" with a large discount coefficient demonstrated "self-control behavior", while a small discount coefficient was associated with "impulsive behavior". These data agree with experimental findings indicating that impulsive behavior is due to a decreased amount of dopamine in the striatum.
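
    The role the model gives to the discount coefficient can be shown with a two-line calculation: under exponential discounting, a large coefficient favors the delayed, larger reinforcement ("self-control"), while a small one favors the immediate, smaller reinforcement ("impulsivity"). The reward sizes and delay below are arbitrary.

    ```python
    def discounted_value(reward, delay_steps, gamma):
        return reward * gamma ** delay_steps

    small_now, large_later, delay = 1.0, 4.0, 10
    for gamma in (0.80, 0.95):
        v_impulsive = discounted_value(small_now, 0, gamma)
        v_selfcontrol = discounted_value(large_later, delay, gamma)
        choice = "delayed (self-control)" if v_selfcontrol > v_impulsive else "immediate (impulsive)"
        print(f"gamma={gamma}: immediate={v_impulsive:.2f}, delayed={v_selfcontrol:.2f} -> {choice}")
    ```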

  1. A Reward Optimization Method Based on Action Subrewards in Hierarchical Reinforcement Learning

    PubMed Central

    Liu, Quan; Ling, Xionghong; Cui, Zhiming

    2014-01-01

    Reinforcement learning (RL) is a kind of interactive learning method. Its main characteristics are “trial and error” and “related reward.” A hierarchical reinforcement learning method based on action subrewards is proposed to address the “curse of dimensionality,” in which the state space grows exponentially with the number of features and convergence is slow. The method can greatly reduce the state space and choose actions purposefully and efficiently, so as to optimize the reward function and speed up convergence. Applied to online learning in the Tetris game, the experimental results show that the convergence speed of the algorithm is clearly improved by the new method, which combines a hierarchical reinforcement learning algorithm with action subrewards. The “curse of dimensionality” problem is also alleviated to a certain extent by the hierarchical method. Performance under different parameters is compared and analyzed as well. PMID:24600318

  2. Hierarchical Multiagent Reinforcement Learning

    DTIC Science & Technology

    2004-01-25

    In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multiagent tasks. We introduce a hierarchical multiagent reinforcement learning (RL) framework and propose a hierarchical multiagent RL algorithm called Cooperative HRL. …

  3. Reinforcement Learning Trees.

    PubMed

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings.

  4. Reinforcement Learning Trees

    PubMed Central

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R.

    2015-01-01

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings. PMID:26903687

  5. Reinforcement learning in scheduling

    NASA Technical Reports Server (NTRS)

    Dietterich, Tom G.; Ok, Dokyeong; Zhang, Wei; Tadepalli, Prasad

    1994-01-01

    The goal of this research is to apply reinforcement learning methods to real-world problems like scheduling. In this preliminary paper, we show that learning to solve scheduling problems such as the Space Shuttle Payload Processing and the Automatic Guided Vehicle (AGV) scheduling can be usefully studied in the reinforcement learning framework. We discuss some of the special challenges posed by the scheduling domain to these methods and propose some possible solutions we plan to implement.

  6. Extraversion differentiates between model-based and model-free strategies in a reinforcement learning task.

    PubMed

    Skatova, Anya; Chan, Patricia A; Daw, Nathaniel D

    2013-01-01

    Prominent computational models describe a neural mechanism for learning from reward prediction errors, and it has been suggested that variations in this mechanism are reflected in personality factors such as trait extraversion. However, although trait extraversion has been linked to improved reward learning, it is not yet known whether this relationship is selective for the particular computational strategy associated with error-driven learning, known as model-free reinforcement learning, vs. another strategy, model-based learning, which the brain is also known to employ. In the present study we test this relationship by examining whether humans' scores on an extraversion scale predict individual differences in the balance between model-based and model-free learning strategies in a sequentially structured decision task designed to distinguish between them. In previous studies with this task, participants have shown a combination of both types of learning, but with substantial individual variation in the balance between them. In the current study, extraversion predicted worse behavior across both sorts of learning. However, the hypothesis that extraverts would be selectively better at model-free reinforcement learning held up among a subset of the more engaged participants, and overall, higher task engagement was associated with a more selective pattern by which extraversion predicted better model-free learning. The findings indicate a relationship between a broad personality orientation and detailed computational learning mechanisms. Results like those in the present study suggest an intriguing and rich relationship between core neuro-computational mechanisms and broader life orientations and outcomes.

  7. Reinforcement learning and Tourette syndrome.

    PubMed

    Palminteri, Stefano; Pessiglione, Mathias

    2013-01-01

    In this chapter, we report the first experimental explorations of reinforcement learning in Tourette syndrome, realized by our team in the last few years. This report will be preceded by an introduction aimed to provide the reader with the state of the art of the knowledge concerning the neural bases of reinforcement learning at the moment of these studies and the scientific rationale beyond them. In short, reinforcement learning is learning by trial and error to maximize rewards and minimize punishments. This decision-making and learning process implicates the dopaminergic system projecting to the frontal cortex-basal ganglia circuits. A large body of evidence suggests that the dysfunction of the same neural systems is implicated in the pathophysiology of Tourette syndrome. Our results show that Tourette condition, as well as the most common pharmacological treatments (dopamine antagonists), affects reinforcement learning performance in these patients. Specifically, the results suggest a deficit in negative reinforcement learning, possibly underpinned by a functional hyperdopaminergia, which could explain the persistence of tics, despite their evident inadaptive (negative) value. This idea, together with the implications of these results in Tourette therapy and the future perspectives, is discussed in Section 4 of this chapter.

  8. Reinforcement learning in supply chains.

    PubMed

    Valluri, Annapurna; North, Michael J; Macal, Charles M

    2009-10-01

    Effective management of supply chains creates value and can strategically position companies. In practice, human beings have been found to be both surprisingly successful and disappointingly inept at managing supply chains. The related fields of cognitive psychology and artificial intelligence have postulated a variety of potential mechanisms to explain this behavior. One of the leading candidates is reinforcement learning. This paper applies agent-based modeling to investigate the comparative behavioral consequences of three simple reinforcement learning algorithms in a multi-stage supply chain. For the first time, our findings show that the specific algorithm that is employed can have dramatic effects on the results obtained. Reinforcement learning is found to be valuable in multi-stage supply chains with several learning agents, as independent agents can learn to coordinate their behavior. However, learning in multi-stage supply chains using these postulated approaches from cognitive psychology and artificial intelligence takes extremely long time periods to achieve stability, which raises questions about their ability to explain behavior in real supply chains. The fact that it takes thousands of periods for agents to learn in this simple multi-agent setting provides new evidence that real-world decision makers are unlikely to be using strict reinforcement learning in practice.

  9. Manifold Regularized Reinforcement Learning.

    PubMed

    Li, Hongliang; Liu, Derong; Wang, Ding

    2017-01-27

    This paper introduces a novel manifold regularized reinforcement learning scheme for continuous Markov decision processes. Smooth feature representations for value function approximation can be automatically learned using the unsupervised manifold regularization method. The learned features are data-driven, and can be adapted to the geometry of the state space. Furthermore, the scheme provides a direct basis representation extension for novel samples during policy learning and control. The performance of the proposed scheme is evaluated on two benchmark control tasks, i.e., the inverted pendulum and the energy storage problem. Simulation results illustrate the concepts of the proposed scheme and show that it can obtain excellent performance.

  10. Somatic and Reinforcement-Based Plasticity in the Initial Stages of Human Motor Learning.

    PubMed

    Sidarta, Ananda; Vahdat, Shahabeddin; Bernardi, Nicolò F; Ostry, David J

    2016-11-16

    As one learns to dance or play tennis, the desired somatosensory state is typically unknown. Trial and error is important as motor behavior is shaped by successful and unsuccessful movements. As an experimental model, we designed a task in which human participants make reaching movements to a hidden target and receive positive reinforcement when successful. We identified somatic and reinforcement-based sources of plasticity on the basis of changes in functional connectivity using resting-state fMRI before and after learning. The neuroimaging data revealed reinforcement-related changes in both motor and somatosensory brain areas in which a strengthening of connectivity was related to the amount of positive reinforcement during learning. Areas of prefrontal cortex were similarly altered in relation to reinforcement, with connectivity between sensorimotor areas of putamen and the reward-related ventromedial prefrontal cortex strengthened in relation to the amount of successful feedback received. In other analyses, we assessed connectivity related to changes in movement direction between trials, a type of variability that presumably reflects exploratory strategies during learning. We found that connectivity in a network linking motor and somatosensory cortices increased with trial-to-trial changes in direction. Connectivity varied as well with the change in movement direction following incorrect movements. Here the changes were observed in a somatic memory and decision making network involving ventrolateral prefrontal cortex and second somatosensory cortex. Our results point to the idea that the initial stages of motor learning are not wholly motor but rather involve plasticity in somatic and prefrontal networks related both to reward and exploration.

  11. Model-Based Reinforcement Learning for Infinite-Horizon Approximate Optimal Tracking.

    PubMed

    Kamalapurkar, Rushikesh; Andrews, Lindsey; Walters, Patrick; Dixon, Warren E

    2017-03-01

    This brief paper provides an approximate online adaptive solution to the infinite-horizon optimal tracking problem for control-affine continuous-time nonlinear systems with unknown drift dynamics. To relax the persistence of excitation condition, model-based reinforcement learning is implemented using a concurrent-learning-based system identifier to simulate experience by evaluating the Bellman error over unexplored areas of the state space. Tracking of the desired trajectory and convergence of the developed policy to a neighborhood of the optimal policy are established via Lyapunov-based stability analysis. Simulation results demonstrate the effectiveness of the developed technique.

  12. Reward, motivation, and reinforcement learning.

    PubMed

    Dayan, Peter; Balleine, Bernard W

    2002-10-10

    There is substantial evidence that dopamine is involved in reward learning and appetitive conditioning. However, the major reinforcement learning-based theoretical models of classical conditioning (crudely, prediction learning) are actually based on rules designed to explain instrumental conditioning (action learning). Extensive anatomical, pharmacological, and psychological data, particularly concerning the impact of motivational manipulations, show that these models are unreasonable. We review the data and consider the involvement of a rich collection of different neural systems in various aspects of these forms of conditioning. Dopamine plays a pivotal, but complicated, role.

  13. Model-based reinforcement learning for partially observable games with sampling-based state estimation.

    PubMed

    Fujita, Hajime; Ishii, Shin

    2007-11-01

    Games constitute a challenging domain of reinforcement learning (RL) for acquiring strategies because many of them include multiple players and many unobservable variables in a large state space. The difficulty of solving such realistic multiagent problems with partial observability arises mainly from the fact that the computational cost for the estimation and prediction in the whole state space, including unobservable variables, is too heavy. To overcome this intractability and enable an agent to learn in an unknown environment, an effective approximation method is required with explicit learning of the environmental model. We present a model-based RL scheme for large-scale multiagent problems with partial observability and apply it to a card game, hearts. This game is a well-defined example of an imperfect information game and can be approximately formulated as a partially observable Markov decision process (POMDP) for a single learning agent. To reduce the computational cost, we use a sampling technique in which the heavy integration required for the estimation and prediction can be approximated by a plausible number of samples. Computer simulation results show that our method is effective in solving such a difficult, partially observable multiagent problem.
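
    The sampling technique described above replaces exact inference over the unobservable variables with a manageable number of sampled hidden states. A generic particle-style sketch of that idea (sample hidden states, weight them by how well they explain the current observation, and average action values over the weighted samples) is shown below; it is not the card-game-specific estimator from the paper, and the prior, likelihood, and value tables are placeholders.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)
    n_hidden, n_samples = 8, 200               # 8 possible hidden configurations

    # Assumed-known pieces for the sketch: a prior over hidden states, an observation
    # likelihood p(obs | hidden), and a value estimate for each (hidden state, action).
    prior = np.full(n_hidden, 1.0 / n_hidden)
    obs_likelihood = rng.random(n_hidden)       # likelihood of the current observation
    Q_given_hidden = rng.random((n_hidden, 3))  # value of each action if the hidden state were known

    # 1) Sample hidden states from the prior instead of enumerating all of them.
    samples = rng.choice(n_hidden, size=n_samples, p=prior)

    # 2) Weight each sample by the observation likelihood (importance weighting).
    weights = obs_likelihood[samples]
    weights /= weights.sum()

    # 3) Estimate action values by averaging over the weighted samples and act greedily.
    q_est = weights @ Q_given_hidden[samples]   # shape (3,)
    print("estimated action values:", q_est, "-> action", int(q_est.argmax()))
    ```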

  14. Online Pedagogical Tutorial Tactics Optimization Using Genetic-Based Reinforcement Learning

    PubMed Central

    Lin, Hsuan-Ta; Lee, Po-Ming; Hsiao, Tzu-Chien

    2015-01-01

    Tutorial tactics are policies for an Intelligent Tutoring System (ITS) to decide the next action when there are multiple actions available. Recent research has demonstrated that when the learning content is held the same, different tutorial tactics make a difference in students' learning gains. However, the Reinforcement Learning (RL) techniques used in previous studies to induce tutorial tactics are insufficient for large problems and hence were used offline. Therefore, we introduce a Genetic-Based Reinforcement Learning (GBML) approach to induce tutorial tactics in an online-learning manner without relying on any preexisting dataset. The introduced method can learn a set of rules from the environment in a manner similar to RL, and includes a genetic-based optimizer for the rule discovery task that generates new rules from old ones. This increases the scalability of an RL learner for larger problems. The results support our hypothesis about the capability of the GBML method to induce tutorial tactics, suggesting that the GBML method is favorable for developing real-world ITS applications in the domain of tutorial tactics induction. PMID:26065018

  15. Online Pedagogical Tutorial Tactics Optimization Using Genetic-Based Reinforcement Learning.

    PubMed

    Lin, Hsuan-Ta; Lee, Po-Ming; Hsiao, Tzu-Chien

    2015-01-01

    Tutorial tactics are policies for an Intelligent Tutoring System (ITS) to decide the next action when there are multiple actions available. Recent research has demonstrated that when the learning content is held the same, different tutorial tactics make a difference in students' learning gains. However, the Reinforcement Learning (RL) techniques used in previous studies to induce tutorial tactics are insufficient for large problems and hence were used offline. Therefore, we introduce a Genetic-Based Reinforcement Learning (GBML) approach to induce tutorial tactics in an online-learning manner without relying on any preexisting dataset. The introduced method can learn a set of rules from the environment in a manner similar to RL, and includes a genetic-based optimizer for the rule discovery task that generates new rules from old ones. This increases the scalability of an RL learner for larger problems. The results support our hypothesis about the capability of the GBML method to induce tutorial tactics, suggesting that the GBML method is favorable for developing real-world ITS applications in the domain of tutorial tactics induction.

  16. A Reinforcement Learning Approach to Control.

    DTIC Science & Technology

    1997-05-31

    acquisition is inherently a partially observable Markov decision problem. This report describes an efficient, scalable reinforcement learning approach to the … deployment of refined intelligent gaze control techniques. This report first lays a theoretical foundation for reinforcement learning. It then introduces … perform well in both high and low SNR ATR environments. Reinforcement learning coupled with history features appears to be both a sound foundation and a practical, scalable base for gaze control.

  17. Extraversion differentiates between model-based and model-free strategies in a reinforcement learning task

    PubMed Central

    Skatova, Anya; Chan, Patricia A.; Daw, Nathaniel D.

    2013-01-01

    Prominent computational models describe a neural mechanism for learning from reward prediction errors, and it has been suggested that variations in this mechanism are reflected in personality factors such as trait extraversion. However, although trait extraversion has been linked to improved reward learning, it is not yet known whether this relationship is selective for the particular computational strategy associated with error-driven learning, known as model-free reinforcement learning, vs. another strategy, model-based learning, which the brain is also known to employ. In the present study we test this relationship by examining whether humans' scores on an extraversion scale predict individual differences in the balance between model-based and model-free learning strategies in a sequentially structured decision task designed to distinguish between them. In previous studies with this task, participants have shown a combination of both types of learning, but with substantial individual variation in the balance between them. In the current study, extraversion predicted worse behavior across both sorts of learning. However, the hypothesis that extraverts would be selectively better at model-free reinforcement learning held up among a subset of the more engaged participants, and overall, higher task engagement was associated with a more selective pattern by which extraversion predicted better model-free learning. The findings indicate a relationship between a broad personality orientation and detailed computational learning mechanisms. Results like those in the present study suggest an intriguing and rich relationship between core neuro-computational mechanisms and broader life orientations and outcomes. PMID:24027514

  18. Reinforcement learning with Marr.

    PubMed

    Niv, Yael; Langdon, Angela

    2016-10-01

    To many, the poster child for David Marr's famous three levels of scientific inquiry is reinforcement learning-a computational theory of reward optimization, which readily prescribes algorithmic solutions that evidence striking resemblance to signals found in the brain, suggesting a straightforward neural implementation. Here we review questions that remain open at each level of analysis, concluding that the path forward to their resolution calls for inspiration across levels, rather than a focus on mutual constraints.

  19. Quantum reinforcement learning.

    PubMed

    Dong, Daoyi; Chen, Chunlin; Li, Hanxiong; Tarn, Tzyh-Jong

    2008-10-01

    The key approaches for machine learning, particularly learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of a value-updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of the eigen action is determined by the probability amplitude, which is updated in parallel according to rewards. Some related characteristics of QRL such as convergence, optimality, and balancing between exploration and exploitation are also analyzed, which shows that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given, and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. This paper is also an effective exploration of the application of quantum computation to artificial intelligence.
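
    One classical-simulation reading of the amplitude-update idea in this abstract: keep an amplitude per action, select an "eigen action" with probability given by the squared amplitude, grow the chosen amplitude in proportion to the reward, and renormalize. The sketch below follows that simplified reading rather than the authors' Grover-style updating, and the bandit rewards are made up.

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    n_actions, k = 4, 0.2
    amps = np.ones(n_actions) / np.sqrt(n_actions)   # uniform superposition over actions

    def observe_action():
        """'Collapse': sample an eigen action with probability |amplitude|^2."""
        probs = amps ** 2
        return int(rng.choice(n_actions, p=probs / probs.sum()))

    def reinforce(action, reward):
        """Grow the chosen action's amplitude in proportion to reward, then renormalize."""
        global amps
        amps[action] += k * reward
        amps = np.clip(amps, 1e-6, None)
        amps /= np.linalg.norm(amps)

    # Toy bandit where action 2 pays the highest reward on average.
    true_means = np.array([0.1, 0.3, 0.9, 0.2])
    for _ in range(300):
        a = observe_action()
        reinforce(a, rng.normal(true_means[a], 0.1))
    print("selection probabilities:", np.round(amps ** 2, 3))
    ```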

  20. Protein interaction network constructing based on text mining and reinforcement learning with application to prostate cancer.

    PubMed

    Zhu, Fei; Liu, Quan; Zhang, Xiaofang; Shen, Bairong

    2015-08-01

    Constructing interaction networks from biomedical texts is important and interesting work. The authors take advantage of text mining and reinforcement learning approaches to establish a protein interaction network. Considering the high computational efficiency of co-occurrence-based interaction extraction and the high precision of linguistic-pattern approaches, the authors propose an interaction extraction algorithm in which frequently used linguistic patterns are utilised to extract interactions from texts; interactions are then found in extended, unprocessed texts under the basic idea of the co-occurrence approach, with interactions extracted from the extended texts discounted. They put forward a reinforcement learning-based algorithm to establish a protein interaction network, where nodes represent proteins and edges denote interactions. During the evolutionary process, a node selects another node and the attained reward determines which predicted interaction should be reinforced. The topology of the network is updated by the agent until an optimal network is formed. They used texts downloaded from PubMed to construct a prostate cancer protein interaction network with the proposed methods. The results show that the method achieves a good matching rate. Network topology analysis also demonstrates that the node degree distribution, node degree probability, and probability distribution of the constructed network accord well with those of a scale-free network.

  1. Reinforcement Learning of Optimal Supervisor based on the Worst-Case Behavior

    NASA Astrophysics Data System (ADS)

    Kajiwara, Kouji; Yamasaki, Tatsushi

    Supervisory control, initiated by Ramadge and Wonham, is a framework for logical control of discrete event systems. The original supervisory control does not consider the costs of event occurrence and disabling, and optimal supervisory control based on quantitative measures has since been studied. This paper proposes a synthesis method for an optimal supervisor based on the worst-case behavior of discrete event systems. We introduce new value functions for the assigned control patterns, which are based not on expected total rewards but on the most undesirable event occurrence within the assigned control pattern. In the proposed method, the supervisor learns via reinforcement learning how to assign the control pattern so as to maximize these value functions. We show the efficiency of the proposed method by computer simulation.
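
    The distinguishing feature here is that a control pattern is scored by its worst admissible event rather than by an expected total reward. A toy evaluation of that kind (hypothetical states, events, and rewards; not the paper's learning algorithm) is sketched below.

    ```python
    # Toy discrete event system: in each state, a control pattern is a set of enabled
    # events; its worst-case value is the minimum over enabled events of
    # reward(event) + gamma * value(next state), instead of an expectation.
    gamma = 0.9
    transitions = {("s0", "a"): "s1", ("s0", "b"): "s2", ("s1", "a"): "s0", ("s2", "b"): "s0"}
    reward = {("s0", "a"): 1.0, ("s0", "b"): -2.0, ("s1", "a"): 0.5, ("s2", "b"): 0.0}
    V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}

    def pattern_value(state, pattern):
        """Worst-case (min over enabled events) value of enabling `pattern` in `state`."""
        return min(reward[(state, e)] + gamma * V[transitions[(state, e)]] for e in pattern)

    # The supervisor prefers the control pattern whose worst enabled event is best.
    candidate_patterns = [{"a"}, {"a", "b"}]
    best = max(candidate_patterns, key=lambda p: pattern_value("s0", p))
    print("chosen pattern in s0:", best)   # enabling "b" admits a bad worst case, so {"a"} wins
    ```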

  2. The "proactive" model of learning: Integrative framework for model-free and model-based reinforcement learning utilizing the associative learning-based proactive brain concept.

    PubMed

    Zsuga, Judit; Biro, Klara; Papp, Csaba; Tajti, Gabor; Gesztelyi, Rudolf

    2016-02-01

    Reinforcement learning (RL) is a powerful concept underlying forms of associative learning governed by a scalar reward signal, with learning taking place when expectations are violated. RL may be assessed using model-based and model-free approaches. Model-based reinforcement learning involves the amygdala, the hippocampus, and the orbitofrontal cortex (OFC). The model-free system involves the pedunculopontine-tegmental nucleus (PPTgN), the ventral tegmental area (VTA) and the ventral striatum (VS). Based on the functional connectivity of the VS, both model-free and model-based RL systems center on the VS, which computes value by integrating model-free signals (received as reward prediction errors) with model-based, reward-related input. Using the concept of a reinforcement learning agent, we propose that the VS serves as the value function component of the RL agent. Regarding the model used for model-based computations, we turn to the proactive brain concept, which proposes a ubiquitous function for the default network based on its great functional overlap with contextual associative areas. By means of the default network, the brain continuously organizes its environment into context frames, enabling the formulation of analogy-based associations that are turned into predictions of what to expect. The OFC integrates reward-related information into context frames when computing reward expectation, by compiling the stimulus-reward and context-reward information offered by the amygdala and hippocampus, respectively. Furthermore, we suggest that the integration of model-based reward expectations into the value signal is further supported by the efferents of the OFC that reach structures canonical for model-free learning (e.g., the PPTgN, VTA, and VS).

  3. Fuzzy OLAP association rules mining-based modular reinforcement learning approach for multiagent systems.

    PubMed

    Kaya, Mehmet; Alhajj, Reda

    2005-04-01

    Multiagent systems and data mining have recently attracted considerable attention in the field of computing. Reinforcement learning is the most commonly used learning process for multiagent systems. However, it still has some drawbacks: other learning agents present in the domain must be modeled as part of the state of the environment, some states are experienced much less than others, and some state-action pairs are never visited during the learning phase. Further, before completing the learning process, an agent cannot exhibit a certain behavior in some states even if they have been experienced sufficiently. In this study, we propose a novel multiagent learning approach to handle these problems. Our approach is based on utilizing the mining process for modular cooperative learning systems. It incorporates fuzziness and online analytical processing (OLAP) based mining to effectively process the information reported by agents. First, we describe a fuzzy data cube OLAP architecture which facilitates effective storage and processing of the state information reported by agents. This way, the action of another agent, even one not in the visual environment of the agent under consideration, can simply be predicted by extracting online association rules, a well-known data mining technique, from the constructed data cube. Second, we present a new action selection model, which is also based on association rules mining. Third, we generalize insufficiently experienced states by mining multilevel association rules from the proposed fuzzy data cube. Experimental results obtained on two different versions of a well-known pursuit domain show the robustness and effectiveness of the proposed fuzzy OLAP mining based modular learning approach. Finally, we tested the scalability of the approach presented in this paper and compared it with our previous work on modular-fuzzy Q-learning and ordinary Q-learning.

  4. [Reinforcement learning by striatum].

    PubMed

    Kunisato, Yoshihiko; Okada, Go; Okamoto, Yasumasa

    2009-04-01

    Recently, computational models of reinforcement learning have been applied to the analysis of neuroimaging data, and it has become clear that the striatum plays a key role in decision making. We review reinforcement learning theory and the biological structures (such as brain regions) and signals (such as neuromodulators) associated with reinforcement learning. We also investigated the function of the striatum and the neurotransmitter serotonin in reward prediction. We first studied the brain mechanisms for reward prediction at different time scales. Our experiment on the striatum showed that the ventroanterior regions are involved in predicting immediate rewards and the dorsoposterior regions are involved in predicting future rewards. Further, we investigated whether serotonin regulates reward selection and whether striatal function is specialized for reward prediction at different time scales. To this end, we regulated the dietary intake of tryptophan, a precursor of serotonin. Our experiment showed that the activity of the ventral part of the striatum was correlated with reward prediction at shorter time scales, and this activity was stronger at low serotonin levels. By contrast, the activity of the dorsal part of the striatum was correlated with reward prediction at longer time scales, and this activity was stronger at high serotonin levels. Further, a higher proportion of small reward choices, together with a higher rate of discounting of delayed rewards, was observed in the low-serotonin condition than in the control and high-serotonin conditions. Further examinations are required in the future to assess the relation between the disturbance of reward prediction caused by low serotonin and mental disorders related to serotonin, such as depression.

  5. Multi-Objective Reinforcement Learning for Cognitive Radio-Based Satellite Communications

    NASA Technical Reports Server (NTRS)

    Ferreira, Paulo Victor R.; Paffenroth, Randy; Wyglinski, Alexander M.; Hackett, Timothy M.; Bilen, Sven G.; Reinhart, Richard C.; Mortensen, Dale J.

    2016-01-01

    Previous research on cognitive radios has addressed the performance of various machine-learning and optimization techniques for decision making of terrestrial link properties. In this paper, we present our recent investigations with respect to reinforcement learning that potentially can be employed by future cognitive radios installed onboard satellite communications systems specifically tasked with radio resource management. This work analyzes the performance of learning, reasoning, and decision making while considering multiple objectives for time-varying communications channels, as well as different cross-layer requirements. Based on the urgent demand for increased bandwidth, which is being addressed by the next generation of high-throughput satellites, the performance of cognitive radio is assessed considering links between a geostationary satellite and a fixed ground station operating at Ka-band (26 GHz). Simulation results show multiple objective performance improvements of more than 3.5 times for clear sky conditions and 6.8 times for rain conditions.
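
    One common way to fold multiple, possibly conflicting link objectives into an RL loop is to scalarize them into a single reward with weights that reflect mission priorities. The sketch below shows only that scalarization step, with made-up objective names and weights; it is not the authors' algorithm or channel model.

    ```python
    # Hypothetical per-step link measurements (all normalised to [0, 1], higher is better).
    objectives = {"throughput": 0.72, "power_efficiency": 0.40, "ber_margin": 0.88}

    # Mission-dependent weights; these might change, e.g., between clear-sky and rain conditions.
    weights = {"throughput": 0.5, "power_efficiency": 0.2, "ber_margin": 0.3}

    def scalar_reward(obj, w):
        """Weighted-sum scalarization of multiple objectives into one RL reward."""
        return sum(w[k] * obj[k] for k in obj)

    print("reward fed to the learner:", round(scalar_reward(objectives, weights), 3))
    ```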

  6. Variability in Dopamine Genes Dissociates Model-Based and Model-Free Reinforcement Learning

    PubMed Central

    Bath, Kevin G.; Daw, Nathaniel D.; Frank, Michael J.

    2016-01-01

    Considerable evidence suggests that multiple learning systems can drive behavior. Choice can proceed reflexively from previous actions and their associated outcomes, as captured by “model-free” learning algorithms, or flexibly from prospective consideration of outcomes that might occur, as captured by “model-based” learning algorithms. However, differential contributions of dopamine to these systems are poorly understood. Dopamine is widely thought to support model-free learning by modulating plasticity in striatum. Model-based learning may also be affected by these striatal effects, or by other dopaminergic effects elsewhere, notably on prefrontal working memory function. Indeed, prominent demonstrations linking striatal dopamine to putatively model-free learning did not rule out model-based effects, whereas other studies have reported dopaminergic modulation of verifiably model-based learning, but without distinguishing a prefrontal versus striatal locus. To clarify the relationships between dopamine, neural systems, and learning strategies, we combine a genetic association approach in humans with two well-studied reinforcement learning tasks: one isolating model-based from model-free behavior and the other sensitive to key aspects of striatal plasticity. Prefrontal function was indexed by a polymorphism in the COMT gene, differences of which reflect dopamine levels in the prefrontal cortex. This polymorphism has been associated with differences in prefrontal activity and working memory. Striatal function was indexed by a gene coding for DARPP-32, which is densely expressed in the striatum where it is necessary for synaptic plasticity. We found evidence for our hypothesis that variations in prefrontal dopamine relate to model-based learning, whereas variations in striatal dopamine function relate to model-free learning. SIGNIFICANCE STATEMENT Decisions can stem reflexively from their previously associated outcomes or flexibly from deliberative

  7. Reinforcement Learning Through Gradient Descent

    DTIC Science & Technology

    1999-05-14

    Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for … practice of existing types of algorithms, the gradient descent approach makes it possible to create entirely new classes of reinforcement learning algorithms.

  8. A graph-based evolutionary algorithm: Genetic Network Programming (GNP) and its extension using reinforcement learning.

    PubMed

    Mabu, Shingo; Hirasawa, Kotaro; Hu, Jinglu

    2007-01-01

    This paper proposes a graph-based evolutionary algorithm called Genetic Network Programming (GNP). Our goal is to develop GNP, which can deal with dynamic environments efficiently and effectively, based on the distinguished expression ability of the graph (network) structure. The characteristics of GNP are as follows. 1) GNP programs are composed of a number of nodes which execute simple judgment/processing, and these nodes are connected by directed links to each other. 2) The graph structure enables GNP to re-use nodes, thus the structure can be very compact. 3) The node transition of GNP is executed according to its node connections without any terminal nodes, thus the past history of the node transition affects the current node to be used and this characteristic works as an implicit memory function. These structural characteristics are useful for dealing with dynamic environments. Furthermore, we propose an extended algorithm, "GNP with Reinforcement Learning (GNPRL)" which combines evolution and reinforcement learning in order to create effective graph structures and obtain better results in dynamic environments. In this paper, we applied GNP to the problem of determining agents' behavior to evaluate its effectiveness. Tileworld was used as the simulation environment. The results show some advantages for GNP over conventional methods.

  9. The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces.

    PubMed

    Huertas, Marco A; Schwettmann, Sarah E; Shouval, Harel Z

    2016-01-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different from those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for
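
    To make the competition-between-traces idea above concrete, the Python sketch below simulates one LTP and one LTD eligibility trace with different decay constants and amplitudes; their difference, read out at reward time, sets the sign and size of the weight change. All constants (tau_ltp, tau_ltd, amplitudes, learning rate) are illustrative placeholders rather than parameters of the published model.

```python
# Minimal sketch: two eligibility traces (LTP and LTD) with different time
# constants and magnitudes compete; a delayed reward converts the surviving
# difference into a synaptic weight change.
def run_trial(stim_time, reward_time, dt=0.01, t_max=5.0,
              tau_ltp=1.0, tau_ltd=2.0, a_ltp=1.0, a_ltd=0.6, lr=0.1):
    w_change = 0.0
    trace_ltp, trace_ltd = 0.0, 0.0
    for step in range(int(t_max / dt)):
        t = step * dt
        if abs(t - stim_time) < dt / 2:        # stimulus tags both traces
            trace_ltp += a_ltp
            trace_ltd += a_ltd
        trace_ltp -= dt * trace_ltp / tau_ltp  # traces decay at different rates
        trace_ltd -= dt * trace_ltd / tau_ltd
        if abs(t - reward_time) < dt / 2:      # neuromodulator reads out the competition
            w_change = lr * (trace_ltp - trace_ltd)
    return w_change

print(run_trial(stim_time=0.5, reward_time=1.5))  # net potentiation or depression
```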

  10. The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces

    PubMed Central

    Huertas, Marco A.; Schwettmann, Sarah E.; Shouval, Harel Z.

    2016-01-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different from those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for

  11. Dopamine-mediated reinforcement learning signals in the striatum and ventromedial prefrontal cortex underlie value-based choices.

    PubMed

    Jocham, Gerhard; Klein, Tilmann A; Ullsperger, Markus

    2011-02-02

    A large body of evidence exists on the role of dopamine in reinforcement learning. Less is known about how dopamine shapes the relative impact of positive and negative outcomes to guide value-based choices. We combined administration of the dopamine D(2) receptor antagonist amisulpride with functional magnetic resonance imaging in healthy human volunteers. Amisulpride did not affect initial reinforcement learning. However, in a later transfer phase that involved novel choice situations requiring decisions between two symbols based on their previously learned values, amisulpride improved participants' ability to select the better of two highly rewarding options, while it had no effect on choices between two very poor options. During the learning phase, activity in the striatum encoded a reward prediction error. In the transfer phase, in the absence of any outcome, ventromedial prefrontal cortex (vmPFC) continually tracked the learned value of the available options on each trial. Both striatal prediction error coding and tracking of learned value in the vmPFC were predictive of subjects' choice performance in the transfer phase, and both were enhanced under amisulpride. These findings show that dopamine-dependent mechanisms enhance reinforcement learning signals in the striatum and sharpen representations of associative values in prefrontal cortex that are used to guide reinforcement-based decisions.

  12. Approach for moving small target detection in infrared image sequence based on reinforcement learning

    NASA Astrophysics Data System (ADS)

    Wang, Chuanyun; Qin, Shiyin

    2016-09-01

    Addressing the problems of moving small target detection in infrared image sequences caused by background clutter and target size variation over time, an approach for moving small target detection is proposed under a pipeline framework with an optimization strategy based on reinforcement learning. The pipeline framework is composed of pipeline establishment, target-background image separation, and target confirmation: the pipeline is established by designating several successive images with a temporal sliding window, target-background image separation is handled by low-rank and sparse matrix decomposition via robust principal component analysis, and target confirmation is achieved by employing a voting mechanism over multiple separated target images of the same input image. For continual optimization of the target-background image separation, the weighting parameter of the low-rank and sparse matrix decomposition is dynamically regulated by reinforcement learning across consecutive detections, integrating a complexity evaluation of the sequential infrared images with an assessment of the moving small target detection results. The experimental results over four infrared small target image sequences with different cloudy sky backgrounds demonstrate the effectiveness and advantages of the proposed approach in both background clutter suppression and small target detection.

  13. Collaborating Fuzzy Reinforcement Learning Agents

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1997-01-01

    Earlier, we introduced GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning while, at the local level, each agent learns and operates based on GARIC, a technique for fuzzy reinforcement learning. In this paper, we show that it is possible for these agents to compete in order to affect the selected control policy while at the same time collaborating as they investigate the state space. In this model, the evaluator (or critic) learns by observing all the agents' behaviors, but the control policy changes only based on the behavior of the winning agent, also known as the super agent.

  14. Interactive Augmentation of Computer Generated Force Behavior Based on Cooperative and Reinforcement Learning. Phase 1.

    DTIC Science & Technology

    1995-09-01

    produced a design methodology for augmenting computer generated force behavior with the NeurRule Technology concepts of cooperative and reinforcement learning. The Phase I results indicate that (1) Intelligent CGFs can improve task performance through on-line learning, utilizing information from

  15. When, What, and How Much to Reward in Reinforcement Learning-Based Models of Cognition

    ERIC Educational Resources Information Center

    Janssen, Christian P.; Gray, Wayne D.

    2012-01-01

    Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing a reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward; that is, when (the moment: end of trial, subtask, or some other…

  16. Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories.

    PubMed

    Fonteneau, Raphael; Murphy, Susan A; Wehenkel, Louis; Ernst, Damien

    2013-09-01

    In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of "artificial trajectories" from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning.

  17. Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories

    PubMed Central

    Fonteneau, Raphael; Murphy, Susan A.; Wehenkel, Louis; Ernst, Damien

    2013-01-01

    In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of “artificial trajectories” from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning. PMID:24049244

  18. From free energy to expected energy: Improving energy-based value function approximation in reinforcement learning.

    PubMed

    Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji

    2016-12-01

    Free-energy based reinforcement learning (FERL) was proposed for learning in high-dimensional state and action spaces. However, the FERL method only really works well with binary, or close to binary, state input, where the number of active states is smaller than the number of non-active states. In the FERL method, the value function is approximated by the negative free energy of a restricted Boltzmann machine (RBM). In our earlier study, we demonstrated that the performance and robustness of the FERL method can be improved by scaling the free energy by a constant related to the size of the network. In this study, we propose that RBM function approximation can be further improved by approximating the value function by the negative expected energy (EERL) instead of the negative free energy, which also makes it possible to handle continuous state input. We validate our proposed method by demonstrating that EERL: (1) outperforms FERL, as well as standard neural network and linear function approximation, for three versions of a gridworld task with high-dimensional image state input; (2) achieves new state-of-the-art results in stochastic SZ-Tetris in both model-free and model-based learning settings; and (3) significantly outperforms FERL and standard neural network function approximation for a robot navigation task with raw and noisy RGB images as state input and a large number of actions.
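
    As a rough illustration of the two value estimates contrasted above, the sketch below computes, for a restricted Boltzmann machine over a binary state-action vector, both the negative free energy (FERL-style) and the negative expected energy (EERL-style). The network sizes and weights here are random, untrained placeholders chosen purely for demonstration.

```python
import numpy as np

# Compare FERL-style and EERL-style value estimates for an RBM with
# visible vector v (a concatenated state-action encoding).
rng = np.random.default_rng(0)
n_visible, n_hidden = 8, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = rng.normal(scale=0.1, size=n_visible)   # visible biases
c = rng.normal(scale=0.1, size=n_hidden)    # hidden biases

def value_estimates(v):
    x = c + v @ W                            # hidden pre-activations
    p = 1.0 / (1.0 + np.exp(-x))             # p(h_j = 1 | v)
    neg_free_energy = b @ v + np.sum(np.log1p(np.exp(x)))   # FERL-style value
    neg_expected_energy = b @ v + np.sum(p * x)             # EERL-style value
    return neg_free_energy, neg_expected_energy

v = rng.integers(0, 2, size=n_visible).astype(float)  # a binary state-action vector
print(value_estimates(v))
```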

  19. Gaze data reveal distinct choice processes underlying model-based and model-free reinforcement learning

    PubMed Central

    Konovalov, Arkady; Krajbich, Ian

    2016-01-01

    Organisms appear to learn and make decisions using different strategies known as model-free and model-based learning; the former is mere reinforcement of previously rewarded actions and the latter is a forward-looking strategy that involves evaluation of action-state transition probabilities. Prior work has used neural data to argue that both model-based and model-free learners implement a value comparison process at trial onset, but model-based learners assign more weight to forward-looking computations. Here using eye-tracking, we report evidence for a different interpretation of prior results: model-based subjects make their choices prior to trial onset. In contrast, model-free subjects tend to ignore model-based aspects of the task and instead seem to treat the decision problem as a simple comparison process between two differentially valued items, consistent with previous work on sequential-sampling models of decision making. These findings illustrate a problem with assuming that experimental subjects make their decisions at the same prescribed time. PMID:27511383

  20. Gaze data reveal distinct choice processes underlying model-based and model-free reinforcement learning.

    PubMed

    Konovalov, Arkady; Krajbich, Ian

    2016-08-11

    Organisms appear to learn and make decisions using different strategies known as model-free and model-based learning; the former is mere reinforcement of previously rewarded actions and the latter is a forward-looking strategy that involves evaluation of action-state transition probabilities. Prior work has used neural data to argue that both model-based and model-free learners implement a value comparison process at trial onset, but model-based learners assign more weight to forward-looking computations. Here using eye-tracking, we report evidence for a different interpretation of prior results: model-based subjects make their choices prior to trial onset. In contrast, model-free subjects tend to ignore model-based aspects of the task and instead seem to treat the decision problem as a simple comparison process between two differentially valued items, consistent with previous work on sequential-sampling models of decision making. These findings illustrate a problem with assuming that experimental subjects make their decisions at the same prescribed time.

  1. Meta-learning in reinforcement learning.

    PubMed

    Schweighofer, Nicolas; Doya, Kenji

    2003-01-01

    Meta-parameters in reinforcement learning should be tuned to the environmental dynamics and the animal performance. Here, we propose a biologically plausible meta-reinforcement learning algorithm for tuning these meta-parameters in a dynamic, adaptive manner. We tested our algorithm in both a simulation of a Markov decision task and in a non-linear control task. Our results show that the algorithm robustly finds appropriate meta-parameter values, and controls the meta-parameter time course, in both static and dynamic environments. We suggest that the phasic and tonic components of dopamine neuron firing can encode the signal required for meta-learning of reinforcement learning.
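
    The following sketch conveys the general flavor of meta-parameter tuning described above, though it is not the authors' algorithm: a meta-parameter (here an exploration temperature beta, chosen arbitrarily) is perturbed, and perturbations are reinforced when a fast running average of reward exceeds a slow one.

```python
import random

# Illustrative meta-parameter adaptation: reinforce perturbations of `beta`
# that raise a short-term reward average above a long-term one.
def adapt_beta(env_reward, n_steps=5000, beta=1.0, sigma=0.1,
               tau_short=0.05, tau_long=0.005, meta_lr=0.01):
    r_short = r_long = 0.0
    for _ in range(n_steps):
        noise = random.gauss(0.0, sigma)        # perturb the meta-parameter
        r = env_reward(beta + noise)            # reward obtained with the perturbed value
        r_short += tau_short * (r - r_short)    # fast running average
        r_long += tau_long * (r - r_long)       # slow running average
        beta += meta_lr * (r_short - r_long) * noise   # reinforce helpful perturbations
    return beta

# Toy "environment": reward peaks when beta is near 2.0.
print(adapt_beta(lambda b: -(b - 2.0) ** 2 + random.gauss(0.0, 0.1)))
```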

  2. A clustering-based graph Laplacian framework for value function approximation in reinforcement learning.

    PubMed

    Xu, Xin; Huang, Zhenhua; Graves, Daniel; Pedrycz, Witold

    2014-12-01

    In order to deal with the sequential decision problems with large or continuous state spaces, feature representation and function approximation have been a major research topic in reinforcement learning (RL). In this paper, a clustering-based graph Laplacian framework is presented for feature representation and value function approximation (VFA) in RL. By making use of clustering-based techniques, that is, K-means clustering or fuzzy C-means clustering, a graph Laplacian is constructed by subsampling in Markov decision processes (MDPs) with continuous state spaces. The basis functions for VFA can be automatically generated from spectral analysis of the graph Laplacian. The clustering-based graph Laplacian is integrated with a class of approximation policy iteration algorithms called representation policy iteration (RPI) for RL in MDPs with continuous state spaces. Simulation and experimental results show that, compared with previous RPI methods, the proposed approach needs fewer sample points to compute an efficient set of basis functions and the learning control performance can be improved for a variety of parameter settings.
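
    A minimal sketch of the pipeline described above, with illustrative parameters: subsampled states are clustered, nearby cluster centres are connected into a graph, and the smoothest eigenvectors of the graph Laplacian serve as basis functions for value-function approximation. The cluster count and neighbourhood radius are arbitrary choices, and a plain K-means routine stands in for either clustering variant mentioned in the abstract.

```python
import numpy as np

# Cluster subsampled states, build a graph Laplacian over the cluster
# centres, and take its smoothest eigenvectors as VFA basis functions.
rng = np.random.default_rng(1)
states = rng.uniform(0, 1, size=(500, 2))       # subsampled 2-D continuous states

def kmeans(points, k=20, iters=50):
    centres = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return centres

centres = kmeans(states)
dists = np.linalg.norm(centres[:, None] - centres[None], axis=-1)
A = (dists < 0.3).astype(float) - np.eye(len(centres))   # adjacency of nearby centres
L = np.diag(A.sum(axis=1)) - A                            # unnormalised graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
basis = eigvecs[:, :5]          # smoothest eigenvectors -> basis functions
print(basis.shape)              # one feature vector per cluster centre
```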

  3. The Evaluation of Interactive Learning Modules to Reinforce Helping Skills in a Web-Based Interview Simulation Training Environment

    ERIC Educational Resources Information Center

    Adcock, Amy B.; Duggan, Molly H.; Perry, Terrell

    2010-01-01

    The research presented in this paper shows the continued evaluation of a web-based interview simulation designed for human services and counseling students. The system allows students to practice empathetic helping skills in their own time. As a possible means to reinforce acquisition and transfer of these skills, interactive learning modules…

  4. Online learning of shaping rewards in reinforcement learning.

    PubMed

    Grześ, Marek; Kudenko, Daniel

    2010-05-01

    Potential-based reward shaping has been shown to be a powerful method to improve the convergence rate of reinforcement learning agents. It is a flexible technique to incorporate background knowledge into temporal-difference learning in a principled way. However, the question remains of how to compute the potential function which is used to shape the reward that is given to the learning agent. In this paper, we show how, in the absence of knowledge to define the potential function manually, this function can be learned online in parallel with the actual reinforcement learning process. Two cases are considered. The first solution, which is based on a multi-grid discretisation, is designed for model-free reinforcement learning. In the second case, an approach for the prototypical model-based R-max algorithm is proposed; it learns the potential function using the free-space assumption about the transitions in the environment. Two novel algorithms are presented and evaluated.
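
    The shaping identity itself fits in a few lines. The sketch below shows the shaped reward r + γΦ(s') − Φ(s) with a hand-written potential; in the paper the potential is learned online, so the toy potential, states, and discount used here are assumptions made only for illustration.

```python
# Potential-based shaping: the learner receives r + F(s, s') with
# F(s, s') = gamma * phi(s_next) - phi(s), which leaves optimal policies unchanged.
GAMMA = 0.95

def shaped_reward(r, s, s_next, phi):
    return r + GAMMA * phi(s_next) - phi(s)

# Toy potential: negative distance to a goal located at position 10.
phi = lambda s: -abs(10 - s)
print(shaped_reward(0.0, s=3, s_next=4, phi=phi))  # moving toward the goal is encouraged
```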

  5. A model of reward choice based on the theory of reinforcement learning.

    PubMed

    Smirnitskaya, I A; Frolov, A A; Merzhanova, G Kh

    2008-03-01

    A model explaining behavioral "impulsivity" and "self-control" is proposed on the basis of the theory of reinforcement learning. The discount coefficient gamma, which in this theory accounts for the subjective reduction in the value of a delayed reinforcement, is identified with the overall level of dopaminergic neuron activity which, according to published data, also determines the behavioral variant. Computer modeling showed that high values of gamma are characteristic of predominantly "self-controlled" subjects, while smaller values of gamma are characteristic of "impulsive" subjects.
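
    A small worked example of the role of gamma described above (reward magnitudes and delays are invented for illustration): with exponential discounting, V = gamma**delay * reward, a high gamma favours the larger delayed reward ("self-control") while a low gamma favours the smaller immediate one ("impulsivity").

```python
# Compare a small immediate reward with a larger delayed one under two
# discount coefficients; the values are illustrative only.
def discounted_value(reward, delay, gamma):
    return (gamma ** delay) * reward

for gamma in (0.95, 0.6):
    small_now = discounted_value(2.0, delay=0, gamma=gamma)
    large_later = discounted_value(5.0, delay=5, gamma=gamma)
    choice = "large delayed" if large_later > small_now else "small immediate"
    print(f"gamma={gamma}: chooses the {choice} reward")
```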

  6. Cognitive Control Predicts Use of Model-Based Reinforcement-Learning

    PubMed Central

    Otto, A. Ross; Skatova, Anya; Madlon-Kay, Seth; Daw, Nathaniel D.

    2015-01-01

    Accounts of decision-making and its neural substrates have long posited the operation of separate, competing valuation systems in the control of choice behavior. Recent theoretical and experimental work suggest that this classic distinction between behaviorally and neurally dissociable systems for habitual and goal-directed (or more generally, automatic and controlled) choice may arise from two computational strategies for reinforcement learning (RL), called model-free and model-based RL, but the cognitive or computational processes by which one system may dominate over the other in the control of behavior is a matter of ongoing investigation. To elucidate this question, we leverage the theoretical framework of cognitive control, demonstrating that individual differences in utilization of goal-related contextual information—in the service of overcoming habitual, stimulus-driven responses—in established cognitive control paradigms predict model-based behavior in a separate, sequential choice task. The behavioral correspondence between cognitive control and model-based RL compellingly suggests that a common set of processes may underpin the two behaviors. In particular, computational mechanisms originally proposed to underlie controlled behavior may be applicable to understanding the interactions between model-based and model-free choice behavior. PMID:25170791

  7. A Novel Dynamic Spectrum Access Framework Based on Reinforcement Learning for Cognitive Radio Sensor Networks.

    PubMed

    Lin, Yun; Wang, Chao; Wang, Jiaxing; Dou, Zheng

    2016-10-12

    Cognitive radio sensor networks are one kind of application in which cognitive techniques can be adopted, and they present many potential applications, challenges, and future research trends. According to research surveys, dynamic spectrum access is an important and necessary technology for future cognitive sensor networks. Traditional methods of dynamic spectrum access are based on spectrum holes and have some drawbacks, such as low accessibility and high interruptibility, which negatively affect the transmission performance of the sensor networks. To address this problem, in this paper a new initialization mechanism is proposed to establish a communication link and set up a sensor network without adopting spectrum holes to convey control information. Specifically, first a transmission channel model for analyzing the maximum accessible capacity under three different policies in a fading environment is discussed. Second, a hybrid spectrum access algorithm based on a reinforcement learning model is proposed for the power allocation problem of both the transmission channel and the control channel. Finally, extensive simulations have been conducted, and the simulation results show that this new algorithm provides a significant improvement in terms of the tradeoff between control channel reliability and transmission channel efficiency.

  8. A Novel Dynamic Spectrum Access Framework Based on Reinforcement Learning for Cognitive Radio Sensor Networks

    PubMed Central

    Lin, Yun; Wang, Chao; Wang, Jiaxing; Dou, Zheng

    2016-01-01

    Cognitive radio sensor networks are one kind of application in which cognitive techniques can be adopted, and they present many potential applications, challenges, and future research trends. According to research surveys, dynamic spectrum access is an important and necessary technology for future cognitive sensor networks. Traditional methods of dynamic spectrum access are based on spectrum holes and have some drawbacks, such as low accessibility and high interruptibility, which negatively affect the transmission performance of the sensor networks. To address this problem, in this paper a new initialization mechanism is proposed to establish a communication link and set up a sensor network without adopting spectrum holes to convey control information. Specifically, first a transmission channel model for analyzing the maximum accessible capacity under three different policies in a fading environment is discussed. Second, a hybrid spectrum access algorithm based on a reinforcement learning model is proposed for the power allocation problem of both the transmission channel and the control channel. Finally, extensive simulations have been conducted, and the simulation results show that this new algorithm provides a significant improvement in terms of the tradeoff between control channel reliability and transmission channel efficiency. PMID:27754316

  9. A Flexible Mechanism of Rule Selection Enables Rapid Feature-Based Reinforcement Learning

    PubMed Central

    Balcarras, Matthew; Womelsdorf, Thilo

    2016-01-01

    Learning in a new environment is influenced by prior learning and experience. Correctly applying a rule that maps a context to stimuli, actions, and outcomes enables faster learning and better outcomes compared to relying on strategies for learning that are ignorant of task structure. However, it is often difficult to know when and how to apply learned rules in new contexts. In our study we explored how subjects employ different strategies for learning the relationship between stimulus features and positive outcomes in a probabilistic task context. We test the hypothesis that task naive subjects will show enhanced learning of feature specific reward associations by switching to the use of an abstract rule that associates stimuli by feature type and restricts selections to that dimension. To test this hypothesis we designed a decision making task where subjects receive probabilistic feedback following choices between pairs of stimuli. In the task, trials are grouped in two contexts by blocks, where in one type of block there is no unique relationship between a specific feature dimension (stimulus shape or color) and positive outcomes, and following an un-cued transition, alternating blocks have outcomes that are linked to either stimulus shape or color. Two-thirds of subjects (n = 22/32) exhibited behavior that was best fit by a hierarchical feature-rule model. Supporting the prediction of the model mechanism these subjects showed significantly enhanced performance in feature-reward blocks, and rapidly switched their choice strategy to using abstract feature rules when reward contingencies changed. Choice behavior of other subjects (n = 10/32) was fit by a range of alternative reinforcement learning models representing strategies that do not benefit from applying previously learned rules. In summary, these results show that untrained subjects are capable of flexibly shifting between behavioral rules by leveraging simple model-free reinforcement learning and context

  10. A Flexible Mechanism of Rule Selection Enables Rapid Feature-Based Reinforcement Learning.

    PubMed

    Balcarras, Matthew; Womelsdorf, Thilo

    2016-01-01

    Learning in a new environment is influenced by prior learning and experience. Correctly applying a rule that maps a context to stimuli, actions, and outcomes enables faster learning and better outcomes compared to relying on strategies for learning that are ignorant of task structure. However, it is often difficult to know when and how to apply learned rules in new contexts. In our study we explored how subjects employ different strategies for learning the relationship between stimulus features and positive outcomes in a probabilistic task context. We test the hypothesis that task naive subjects will show enhanced learning of feature specific reward associations by switching to the use of an abstract rule that associates stimuli by feature type and restricts selections to that dimension. To test this hypothesis we designed a decision making task where subjects receive probabilistic feedback following choices between pairs of stimuli. In the task, trials are grouped in two contexts by blocks, where in one type of block there is no unique relationship between a specific feature dimension (stimulus shape or color) and positive outcomes, and following an un-cued transition, alternating blocks have outcomes that are linked to either stimulus shape or color. Two-thirds of subjects (n = 22/32) exhibited behavior that was best fit by a hierarchical feature-rule model. Supporting the prediction of the model mechanism these subjects showed significantly enhanced performance in feature-reward blocks, and rapidly switched their choice strategy to using abstract feature rules when reward contingencies changed. Choice behavior of other subjects (n = 10/32) was fit by a range of alternative reinforcement learning models representing strategies that do not benefit from applying previously learned rules. In summary, these results show that untrained subjects are capable of flexibly shifting between behavioral rules by leveraging simple model-free reinforcement learning and context

  11. A Comparative Analysis of Reinforcement Learning Methods

    DTIC Science & Technology

    1991-10-01

    reinforcement learning for both programming and adapting situated agents. In the first part of the paper we discuss two specific reinforcement learning algorithms: Q-learning and the Bucket Brigade. We introduce a special case of the Bucket Brigade, and analyze and compare its performance to Q-learning in a number of experiments. The second part of the paper discusses the key problems of reinforcement learning: time and space complexity, input generalization, sensitivity to parameter values, and selection of the reinforcement

  12. Application of fuzzy logic-neural network based reinforcement learning to proximity and docking operations

    NASA Technical Reports Server (NTRS)

    Jani, Yashvant

    1992-01-01

    As part of the Research Institute for Computing and Information Systems (RICIS) activity, the reinforcement learning techniques developed at Ames Research Center are being applied to proximity and docking operations using the Shuttle and Solar Max satellite simulation. This activity is carried out in the software technology laboratory utilizing the Orbital Operations Simulator (OOS). This interim report provides the status of the project and outlines the future plans.

  13. Corticostriatal circuit mechanisms of value-based action selection: Implementation of reinforcement learning algorithms and beyond.

    PubMed

    Morita, Kenji; Jitsev, Jenia; Morrison, Abigail

    2016-09-15

    Value-based action selection has been suggested to be realized in the corticostriatal local circuits through competition among neural populations. In this article, we review theoretical and experimental studies that have constructed and verified this notion, and provide new perspectives on how the local-circuit selection mechanisms implement reinforcement learning (RL) algorithms and computations beyond them. The striatal neurons are mostly inhibitory, and lateral inhibition among them has been classically proposed to realize "Winner-Take-All (WTA)" selection of the maximum-valued action (i.e., 'max' operation). Although this view has been challenged by the revealed weakness, sparseness, and asymmetry of lateral inhibition, which suggest more complex dynamics, WTA-like competition could still occur on short time scales. Unlike the striatal circuit, the cortical circuit contains recurrent excitation, which may enable retention or temporal integration of information and probabilistic "soft-max" selection. The striatal "max" circuit and the cortical "soft-max" circuit might co-implement an RL algorithm called Q-learning; the cortical circuit might also similarly serve for other algorithms such as SARSA. In these implementations, the cortical circuit presumably sustains activity representing the executed action, which negatively impacts dopamine neurons so that they can calculate reward-prediction-error. Regarding the suggested more complex dynamics of striatal, as well as cortical, circuits on long time scales, which could be viewed as a sequence of short WTA fragments, computational roles remain open: such a sequence might represent (1) sequential state-action-state transitions, constituting replay or simulation of the internal model, (2) a single state/action by the whole trajectory, or (3) probabilistic sampling of state/action.
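
    As a toy illustration of the "max" versus "soft-max" read-outs discussed above, the sketch below learns action values with a simple incremental update and selects actions either probabilistically (soft-max) or by winner-take-all (argmax). The task is a three-armed bandit and all parameters are illustrative; this is not a model of the corticostriatal circuit itself.

```python
import numpy as np

# Learn action values for a three-armed bandit; act via soft-max during
# learning, and read out the winner-take-all ("max") choice at the end.
rng = np.random.default_rng(0)
q = np.zeros(3)                      # action values for one state
ALPHA, BETA = 0.1, 3.0               # learning rate, inverse temperature

def softmax_choice(values):
    p = np.exp(BETA * values)
    p /= p.sum()
    return rng.choice(len(values), p=p)

true_reward = np.array([0.2, 0.5, 0.8])
for _ in range(1000):
    a = softmax_choice(q)                         # probabilistic ("soft-max") selection
    r = float(rng.random() < true_reward[a])      # Bernoulli reward
    q[a] += ALPHA * (r - q[a])                    # incremental value update; full Q-learning
                                                  # would bootstrap with gamma * max Q(s', .)
print(q, "greedy (WTA) choice:", int(np.argmax(q)))
```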

  14. Fuzzy Q-Learning for Generalization of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1996-01-01

    Fuzzy Q-Learning, introduced earlier by the author, is an extension of Q-Learning into fuzzy environments. GARIC is a methodology for fuzzy reinforcement learning. In this paper, we introduce GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning and at the local level, each agent learns and operates based on GARIC. GARIC-Q improves the speed and applicability of Fuzzy Q-Learning through generalization of input space by using fuzzy rules and bridges the gap between Q-Learning and rule based intelligent systems.

  15. Reinforcement Learning with Bounded Information Loss

    NASA Astrophysics Data System (ADS)

    Peters, Jan; Mülling, Katharina; Seldin, Yevgeny; Altun, Yasemin

    2011-03-01

    Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information, and hence policy search has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant or natural policy gradients, many of these problems may be addressed by constraining the information loss. In this paper, we continue this line of reasoning and suggest two reinforcement learning methods, a model-based and a model-free algorithm, that bound the loss in relative entropy while maximizing the return. The resulting methods differ significantly from previous policy gradient approaches and yield an exact update step. They work well on typical reinforcement learning benchmark problems as well as on novel evaluations in robotics. We also show a Bayesian bound motivation of this new approach [8].
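
    The sketch below illustrates the general idea of bounding information loss during a policy update, in a REPS-like discrete-action form that is an assumption of this illustration rather than the authors' algorithm: the new policy is an exponential re-weighting of the old one by advantage, with the temperature chosen by bisection so that the KL divergence from the old policy stays below a bound.

```python
import numpy as np

# Re-weight a discrete policy by advantages while keeping KL(new || old)
# below epsilon; eta is found by a simple geometric bisection.
def bounded_update(p_old, advantages, epsilon=0.1):
    def new_policy(eta):
        w = p_old * np.exp(advantages / eta)
        return w / w.sum()

    def kl(p, q):
        return np.sum(p * np.log(p / q))

    eta_low, eta_high = 1e-3, 1e3
    for _ in range(100):
        eta = np.sqrt(eta_low * eta_high)
        if kl(new_policy(eta), p_old) > epsilon:
            eta_low = eta                         # too greedy -> soften
        else:
            eta_high = eta                        # within the bound -> can be greedier
    return new_policy(eta_high)

p_old = np.array([0.25, 0.25, 0.25, 0.25])
advantages = np.array([0.1, -0.2, 0.4, 0.0])
print(bounded_update(p_old, advantages))
```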

  16. Reinforcing Constructivist Teaching in Advanced Level Biochemistry through the Introduction of Case-Based Learning Activities

    ERIC Educational Resources Information Center

    Hartfield, Perry J.

    2010-01-01

    In the process of curriculum development, I have integrated a constructivist teaching strategy into an advanced-level biochemistry teaching unit. Specifically, I have introduced case-based learning activities into the teaching/learning framework. These case-based learning activities were designed to develop problem-solving skills, consolidate…

  17. Rational and Mechanistic Perspectives on Reinforcement Learning

    ERIC Educational Resources Information Center

    Chater, Nick

    2009-01-01

    This special issue describes important recent developments in applying reinforcement learning models to capture neural and cognitive function. But reinforcement learning, as a theoretical framework, can apply at two very different levels of description: "mechanistic" and "rational." Reinforcement learning is often viewed in mechanistic terms--as…

  18. Ventral striatum and orbitofrontal cortex are both required for model-based, but not model-free, reinforcement learning.

    PubMed

    McDannald, Michael A; Lucantonio, Federica; Burke, Kathryn A; Niv, Yael; Schoenbaum, Geoffrey

    2011-02-16

    In many cases, learning is thought to be driven by differences between the value of rewards we expect and rewards we actually receive. Yet learning can also occur when the identity of the reward we receive is not as expected, even if its value remains unchanged. Learning from changes in reward identity implies access to an internal model of the environment, from which information about the identity of the expected reward can be derived. As a result, such learning is not easily accounted for by model-free reinforcement learning theories such as temporal difference reinforcement learning (TDRL), which predicate learning on changes in reward value, but not identity. Here, we used unblocking procedures to assess learning driven by value- versus identity-based prediction errors. Rats were trained to associate distinct visual cues with different food quantities and identities. These cues were subsequently presented in compound with novel auditory cues and the reward quantity or identity was selectively changed. Unblocking was assessed by presenting the auditory cues alone in a probe test. Consistent with neural implementations of TDRL models, we found that the ventral striatum was necessary for learning in response to changes in reward value. However, this area, along with orbitofrontal cortex, was also required for learning driven by changes in reward identity. This observation requires that existing models of TDRL in the ventral striatum be modified to include information about the specific features of expected outcomes derived from model-based representations, and that the role of orbitofrontal cortex in these models be clearly delineated.

  19. Cognitive navigation based on nonuniform Gabor space sampling, unsupervised growing networks, and reinforcement learning.

    PubMed

    Arleo, Angelo; Smeraldi, Fabrizio; Gerstner, Wulfram

    2004-05-01

    We study spatial learning and navigation for autonomous agents. A state space representation is constructed by unsupervised Hebbian learning during exploration. As a result of learning, a representation of the continuous two-dimensional (2-D) manifold in the high-dimensional input space is found. The representation consists of a population of localized overlapping place fields covering the 2-D space densely and uniformly. This space coding is comparable to the representation provided by hippocampal place cells in rats. Place fields are learned by extracting spatio-temporal properties of the environment from sensory inputs. The visual scene is modeled using the responses of modified Gabor filters placed at the nodes of a sparse Log-polar graph. Visual sensory aliasing is eliminated by taking into account self-motion signals via path integration. This solves the hidden state problem and provides a suitable representation for applying reinforcement learning in continuous space for action selection. A temporal-difference prediction scheme is used to learn sensorimotor mappings to perform goal-oriented navigation. Population vector coding is employed to interpret ensemble neural activity. The model is validated on a mobile Khepera miniature robot.

  20. Perception-based Co-evolutionary Reinforcement Learning for UAV Sensor Allocation

    DTIC Science & Technology

    2003-02-01

    reinforcement learning was developed for jointly addressing sensor allocation on each individual UAV and allocation of a team of UAVs in the geographical search space. An elaborate problem setup was simulated and experimented with for testing and analysis of this framework, using the Player-Stage multi-agent simulator. This simulator was developed jointly at the USC Robotics Research Lab and HRL Labs. The experimental results demonstrated very strong performance of our methodology on UAV sensor allocation problem domains. Our results indicate that not only is it feasible

  1. Application of fuzzy logic-neural network based reinforcement learning to proximity and docking operations: Translational controller results

    NASA Technical Reports Server (NTRS)

    Jani, Yashvant

    1992-01-01

    The reinforcement learning techniques developed at Ames Research Center are being applied to proximity and docking operations using the Shuttle and Solar Maximum Mission (SMM) satellite simulation. In utilizing these fuzzy learning techniques, we also use the Approximate Reasoning based Intelligent Control (ARIC) architecture, and so we use the two terms interchangeably. This activity is carried out in the Software Technology Laboratory utilizing the Orbital Operations Simulator (OOS). This report is the deliverable D3 in our project activity and provides the test results of the fuzzy learning translational controller. This report is organized in six sections. Based on our experience and analysis with the attitude controller, we have modified the basic configuration of the reinforcement learning algorithm in ARIC as described in section 2. The shuttle translational controller and its implementation in the fuzzy learning architecture are described in section 3. Two test cases that we have performed are described in section 4. Our results and conclusions are discussed in section 5, and section 6 provides future plans and a summary for the project.

  2. Reinforcement Learning for Robots Using Neural Networks

    DTIC Science & Technology

    1993-01-06

    Reinforcement learning agents are adaptive, reactive, and self-supervised. The aim of this dissertation is to extend the state of the art of... reinforcement learning and enable its applications to complex robot-learning problems. In particular, it focuses on two issues. First, learning from sparse... reinforcement learning methods assume that the world is a Markov decision process. This assumption is too strong for many robot tasks of interest, This

  3. Maximum correntropy based attention-gated reinforcement learning designed for brain machine interface.

    PubMed

    Li, Hongbao; Wang, Fang; Zhang, Qiaosheng; Zhang, Shaomin; Wang, Yiwen; Zheng, Xiaoxiang; Principe, Jose C

    2016-08-01

    Reinforcement learning is an effective algorithm for brain machine interfaces (BMIs), interpreting the mapping between plastic neural activities and kinematics. Exploring a large state-action space is difficult when a complicated BMI needs to assign credit over both time and space. For BMIs, attention-gated reinforcement learning (AGREL) has been developed to classify multiple actions for the spatial credit assignment task with better efficiency. However, outliers in the neural signals still make the neural-action mapping difficult to interpret. We propose an enhanced AGREL algorithm using correntropy as a criterion, which is less sensitive to noise. The algorithm is tested on neural data recorded while a monkey was trained to perform an obstacle avoidance task. The new method converges faster during the training period and improves the average success rate from 44.63% to 68.79% compared with the original AGREL. The results indicate that the combination of the correntropy criterion and AGREL can reduce the effect of outliers and give better performance when interpreting the mapping between neural signals and kinematics.
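
    A minimal sketch of how a correntropy criterion de-emphasizes outliers, which is the property exploited above: each prediction error is weighted by a Gaussian kernel of itself, so very large errors contribute little to the update. The error values and kernel width are illustrative, and this is not the full AGREL algorithm.

```python
import numpy as np

# Under a correntropy (Gaussian-kernel) criterion, the contribution of an
# error to the update is error * exp(-error^2 / (2 sigma^2)), so outliers
# are suppressed; under squared error the contribution grows with the error.
def correntropy_weight(error, sigma=1.0):
    return np.exp(-error ** 2 / (2 * sigma ** 2))

errors = np.array([0.2, -0.5, 0.1, 8.0])             # the last entry is an outlier
mse_contrib = errors                                  # contribution under squared error
mcc_contrib = correntropy_weight(errors) * errors     # contribution under correntropy
print(mcc_contrib)   # the outlier's influence is suppressed almost entirely
```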

  4. Autonomous reinforcement learning with experience replay.

    PubMed

    Wawrzyński, Paweł; Tanwani, Ajay Kumar

    2013-05-01

    This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy with the use of previously collected samples, and autonomously estimates the appropriate step-sizes for the learning updates. The algorithm is based on the actor-critic with experience replay whose step-sizes are determined on-line by an enhanced fixed point algorithm for on-line neural network training. An experimental study with simulated octopus arm and half-cheetah demonstrates the feasibility of the proposed algorithm to solve difficult learning control problems in an autonomous way within reasonably short time.

  5. From Creatures of Habit to Goal-Directed Learners: Tracking the Developmental Emergence of Model-Based Reinforcement Learning.

    PubMed

    Decker, Johannes H; Otto, A Ross; Daw, Nathaniel D; Hartley, Catherine A

    2016-06-01

    Theoretical models distinguish two decision-making strategies that have been formalized in reinforcement-learning theory. A model-based strategy leverages a cognitive model of potential actions and their consequences to make goal-directed choices, whereas a model-free strategy evaluates actions based solely on their reward history. Research in adults has begun to elucidate the psychological mechanisms and neural substrates underlying these learning processes and factors that influence their relative recruitment. However, the developmental trajectory of these evaluative strategies has not been well characterized. In this study, children, adolescents, and adults performed a sequential reinforcement-learning task that enabled estimation of model-based and model-free contributions to choice. Whereas a model-free strategy was apparent in choice behavior across all age groups, a model-based strategy was absent in children, became evident in adolescents, and strengthened in adults. These results suggest that recruitment of model-based valuation systems represents a critical cognitive component underlying the gradual maturation of goal-directed behavior.

  6. Design issues of a reinforcement-based self-learning fuzzy controller for petrochemical process control

    NASA Technical Reports Server (NTRS)

    Yen, John; Wang, Haojin; Daugherity, Walter C.

    1992-01-01

    Fuzzy logic controllers have some often-cited advantages over conventional techniques such as PID control, including easier implementation, accommodation to natural language, and the ability to cover a wider range of operating conditions. One major obstacle that hinders the broader application of fuzzy logic controllers is the lack of a systematic way to develop and modify their rules; as a result the creation and modification of fuzzy rules often depends on trial and error or pure experimentation. One of the proposed approaches to address this issue is a self-learning fuzzy logic controller (SFLC) that uses reinforcement learning techniques to learn the desirability of states and to adjust the consequent part of its fuzzy control rules accordingly. Due to the different dynamics of the controlled processes, the performance of a self-learning fuzzy controller is highly contingent on its design. The design issue has not received sufficient attention. The issues related to the design of a SFLC for application to a petrochemical process are discussed, and its performance is compared with that of a PID and a self-tuning fuzzy logic controller.

  7. Design issues for a reinforcement-based self-learning fuzzy controller

    NASA Technical Reports Server (NTRS)

    Yen, John; Wang, Haojin; Dauherity, Walter

    1993-01-01

    Fuzzy logic controllers have some often-cited advantages over conventional techniques such as PID control: easy implementation, accommodation to natural language, the ability to cover a wider range of operating conditions, and others. One major obstacle that hinders their broader application is the lack of a systematic way to develop and modify their rules; as a result, the creation and modification of fuzzy rules often depends on trial and error or pure experimentation. One of the proposed approaches to address this issue is the self-learning fuzzy logic controller (SFLC), which uses reinforcement learning techniques to learn the desirability of states and to adjust the consequent part of fuzzy control rules accordingly. Due to the different dynamics of the controlled processes, the performance of a self-learning fuzzy controller is highly contingent on its design. The design issue has not received sufficient attention. The issues related to the design of an SFLC for application to a chemical process are discussed, and its performance is compared with that of PID and self-tuning fuzzy logic controllers.

  8. Nearly data-based optimal control for linear discrete model-free systems with delays via reinforcement learning

    NASA Astrophysics Data System (ADS)

    Zhang, Jilie; Zhang, Huaguang; Wang, Binrui; Cai, Tiaoyang

    2016-05-01

    In this paper, a nearly data-based optimal control scheme is proposed for linear discrete model-free systems with delays. The nearly optimal control can be obtained using only measured input/output data from the systems, via a reinforcement learning technique that combines Q-learning with a value iteration algorithm. First, we construct a state estimator by using the measured input/output data. Second, a quadratic functional is used to approximate the value function at each point in the state space, and the data-based control is designed by the Q-learning method using the obtained state estimator. Then, the paper shows how to solve for the optimal inner kernel matrix ? in the least-squares sense by the value iteration algorithm. Finally, numerical examples are given to illustrate the effectiveness of our approach.

  9. Modular Inverse Reinforcement Learning for Visuomotor Behavior

    PubMed Central

    Rothkopf, Constantin A.; Ballard, Dana H.

    2013-01-01

    In a large variety of situations one would like to have an expressive and accurate model of observed animal or human behavior. While general purpose mathematical models may successfully capture properties of observed behavior, it is desirable to root models in biological facts. Because of ample empirical evidence for reward-based learning in visuomotor tasks, we use a computational model based on the assumption that the observed agent is balancing the costs and benefits of its behavior to meet its goals. This leads to using the framework of Reinforcement Learning, which additionally provides well-established algorithms for learning visuomotor task solutions. To quantify the agent’s goals as rewards implicit in the observed behavior, we propose to use inverse reinforcement learning. Based on the assumption of a modular cognitive architecture, we introduce a modular inverse reinforcement learning algorithm that estimates the relative reward contributions of the component tasks in navigation, consisting of following a path while avoiding obstacles and approaching targets. It is shown how to recover the component reward weights for individual tasks and that variability in observed trajectories can be explained succinctly through behavioral goals. It is demonstrated through simulations that good estimates can be obtained already with modest amounts of observation data, which in turn allows the prediction of behavior in novel configurations. PMID:23832417

  10. Reinforcement learning or active inference?

    PubMed

    Friston, Karl J; Daunizeau, Jean; Kiebel, Stefan J

    2009-07-29

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

  11. Reinforcement Learning or Active Inference?

    PubMed Central

    Friston, Karl J.; Daunizeau, Jean; Kiebel, Stefan J.

    2009-01-01

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain. PMID:19641614

  12. Global reinforcement learning in neural networks.

    PubMed

    Ma, Xiaolong; Likharev, Konstantin K

    2007-03-01

    In this letter, we have found a more general formulation of the REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility (REINFORCE) learning principle first suggested by Williams. The new formulation has enabled us to apply the principle to global reinforcement learning in networks with various sources of randomness, and to suggest several simple local rules for such networks. Numerical simulations have shown that for simple classification and reinforcement learning tasks, at least one family of the new learning rules gives results comparable to those provided by the famous Rules A(r-i) and A(r-p) for the Boltzmann machines.
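
    For reference, the sketch below shows a REINFORCE-style update of the form named above for a single Bernoulli unit on a two-armed bandit: a learning rate (nonnegative factor) times reward minus a baseline (offset reinforcement) times the characteristic eligibility d log p(action)/dw. The bandit payoffs and the running-average baseline are illustrative choices, not the letter's rules.

```python
import numpy as np

# REINFORCE for one Bernoulli unit: delta_w = lr * (reward - baseline) * eligibility,
# where eligibility = d log p(action | w) / dw = action - p for a sigmoid unit.
rng = np.random.default_rng(0)
w, baseline, lr = 0.0, 0.0, 0.1

for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-w))                 # probability of choosing action 1
    action = int(rng.random() < p)
    reward = 1.0 if action == 1 else 0.2         # action 1 pays more on average
    eligibility = action - p                     # characteristic eligibility
    w += lr * (reward - baseline) * eligibility  # REINFORCE update
    baseline += 0.01 * (reward - baseline)       # running-average reward baseline
print(p)  # probability of the better action approaches 1
```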

  13. Embedded Incremental Feature Selection for Reinforcement Learning

    DTIC Science & Technology

    2012-05-01

    policy by a problem-specific fitness function. The composition of the selected subset in terms of the fraction of relevant features among selected...features. In Figure 4b we see the composition of the selected subsets by the three algorithms. IFSE-NEAT clearly has the highest percentage of relevant...528. Kroon, M. and Whiteson, S. (2009). Automatic feature selection for model-based reinforcement learning in factored MDPs. In Proceedings of the

  14. Optimal chaos control through reinforcement learning.

    PubMed

    Gadaleta, Sabino; Dangelmayr, Gerhard

    1999-09-01

    A general purpose chaos control algorithm based on reinforcement learning is introduced and applied to the stabilization of unstable periodic orbits in various chaotic systems and to the targeting problem. The algorithm does not require any information about the dynamical system nor about the location of periodic orbits. Numerical tests demonstrate good and fast performance under noisy and nonstationary conditions. (c) 1999 American Institute of Physics.

  15. Classroom Reinforcement and Learning: A Quantitative Synthesis.

    ERIC Educational Resources Information Center

    Lysakowski, Richard S.; Walberg, Herbert J.

    1981-01-01

    A review of statistical data from previous studies determined the benefits of positive reinforcement on learning in students from kindergarten through college. Results indicate that differences between reinforced and control groups are greater for girls and for students from special schools and that reinforcement appears to have a strong effect…

  16. Local path planning method of the self-propelled model based on reinforcement learning in complex conditions

    NASA Astrophysics Data System (ADS)

    Yang, Yi; Pang, Yongjie; Li, Hongwei; Zhang, Rubo

    2014-09-01

    Conducting hydrodynamic and physical motion simulation tests using a large-scale self-propelled model under actual wave conditions is an important means for researching the environmental adaptability of ships. During the navigation test of the self-propelled model, the complex environment, including various port facilities, navigation facilities, and nearby ships, must be considered carefully, because in this dense environment the impact of sea waves and winds on the model is particularly significant. In order to improve the security of the self-propelled model, this paper introduces Q-learning, a reinforcement learning method, combined with chaotic ideas for the model's collision avoidance, in order to improve the reliability of local path planning. Simulation and sea test results show that this algorithm is a better solution for collision avoidance of the self-propelled model under the interference of sea winds and waves, with good adaptability.

  17. States versus Rewards: Dissociable neural prediction error signals underlying model-based and model-free reinforcement learning

    PubMed Central

    Gläscher, Jan; Daw, Nathaniel; Dayan, Peter; O’Doherty, John P.

    2010-01-01

    Reinforcement learning (RL) uses sequential experience with situations (“states”) and outcomes to assess actions. Whereas model-free RL uses this experience directly, in the form of a reward prediction error (RPE), model-based RL uses it indirectly, building a model of the state transition and outcome structure of the environment, and evaluating actions by searching this model. A state prediction error (SPE) plays a central role, reporting discrepancies between the current model and the observed state transitions. Using functional magnetic resonance imaging in humans solving a probabilistic Markov decision task we found the neural signature of an SPE in the intraparietal sulcus and lateral prefrontal cortex, in addition to the previously well-characterized RPE in the ventral striatum. This finding supports the existence of two unique forms of learning signal in humans, which may form the basis of distinct computational strategies for guiding behavior. PMID:20510862
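
    The two learning signals can be made concrete with a small sketch: a model-free reward prediction error that updates action values, and a model-based state prediction error that updates an estimated transition model. The state and action counts and the learning rates below are illustrative assumptions.

        import numpy as np

        n_states, n_actions = 3, 2
        Q = np.zeros((n_states, n_actions))                      # model-free values
        T = np.ones((n_states, n_actions, n_states)) / n_states  # estimated transitions
        alpha_rpe, alpha_spe, gamma = 0.1, 0.2, 0.9

        def learn(s, a, s_next, r):
            # Reward prediction error (RPE): drives model-free value learning.
            rpe = r + gamma * Q[s_next].max() - Q[s, a]
            Q[s, a] += alpha_rpe * rpe

            # State prediction error (SPE): surprise about the observed transition,
            # driving learning of the transition model used by model-based RL.
            spe = 1.0 - T[s, a, s_next]
            observed = np.zeros(n_states)
            observed[s_next] = 1.0
            T[s, a] += alpha_spe * (observed - T[s, a])          # rows stay normalized
            return rpe, spe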

  18. On the Possibility of a Reinforcement Theory of Cognitive Learning.

    ERIC Educational Resources Information Center

    Smith, Kendon

    This paper discusses cognitive learning in terms of reinforcement theory and presents arguments suggesting that a viable theory of cognition based on reinforcement principles is not out of the question. This position is supported by a discussion of the weaknesses of theories based entirely on contiguity and of considerations that are more positive…

  19. A new computational account of cognitive control over reinforcement-based decision-making: Modeling of a probabilistic learning task.

    PubMed

    Zendehrouh, Sareh

    2015-11-01

    Recent work in the decision-making field offers an account of dual-system theory for the decision-making process. This theory holds that the process is conducted by two main controllers: a goal-directed system and a habitual system. In the reinforcement learning (RL) domain, habitual behaviors are connected with model-free methods, in which appropriate actions are learned through trial-and-error experiences, whereas goal-directed behaviors are associated with model-based methods of RL, in which actions are selected using a model of the environment. Studies on cognitive control also suggest that during processes like decision-making, some cortical and subcortical structures work in concert to monitor the consequences of decisions and to adjust control according to current task demands. Here a computational model is presented based on dual-system theory and the cognitive control perspective of decision-making. The proposed model is used to simulate human performance on a variant of a probabilistic learning task. The basic proposal is that the brain implements a dual controller, while an accompanying monitoring system detects several kinds of conflict, including a hypothetical cost conflict. The simulation results address existing theories about two event-related potentials, namely error-related negativity (ERN) and feedback-related negativity (FRN), and explore the best account of them. Based on the results, some testable predictions are also presented.

  20. Reinforcement Learning for Scheduling of Maintenance

    NASA Astrophysics Data System (ADS)

    Knowles, Michael; Baglee, David; Wermter, Stefan

    Improving maintenance scheduling has become an area of crucial importance in recent years. Condition-based maintenance (CBM) has started to move away from scheduled maintenance by providing an indication of the likelihood of failure. Improving the timing of maintenance based on this information, so as to maintain high reliability without resorting to over-maintenance, remains a problem. In this paper we propose Reinforcement Learning (RL), which improves the long-term reward of a multistage decision based on feedback given either during or at the end of a sequence of actions, as a potential solution to this problem. Several indicative scenarios are presented, and simulated experiments illustrate the performance of RL in this application.
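
    A toy version of the kind of scenario the paper discusses is sketched below: a Q-learning agent deciding, at each degradation level reported by condition monitoring, whether to maintain now or wait. The degradation probabilities and costs are illustrative assumptions, not values from the paper.

        import numpy as np

        rng = np.random.default_rng(2)

        # States 0..4 are increasing degradation levels; state 5 means "failed".
        N_STATES, FAILED = 6, 5
        ACTIONS = {0: "wait", 1: "maintain"}
        Q = np.zeros((N_STATES, len(ACTIONS)))
        alpha, gamma, eps = 0.1, 0.95, 0.1

        def step(s, a):
            if a == 1:                          # planned maintenance restores the asset
                return 0, -20.0
            if s == FAILED:                     # unplanned failure is far more costly
                return 0, -200.0
            s_next = min(s + int(rng.random() < 0.3), FAILED)   # possible degradation
            return s_next, 0.0

        s = 0
        for t in range(50000):
            a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r = step(s, a)
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next

        # The greedy policy shows at which degradation level maintenance pays off.
        policy = {lvl: ACTIONS[int(np.argmax(Q[lvl]))] for lvl in range(N_STATES)}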

  1. Reinforcement learning design for cancer clinical trials

    PubMed Central

    Zhao, Yufan; Kosorok, Michael R.; Zeng, Donglin

    2009-01-01

    Summary We develop reinforcement learning trials for discovering individualized treatment regimens for life-threatening diseases such as cancer. A temporal-difference learning method called Q-learning is utilized which involves learning an optimal policy from a single training set of finite longitudinal patient trajectories. Approximating the Q-function with time-indexed parameters can be achieved by using support vector regression or extremely randomized trees. Within this framework, we demonstrate that the procedure can extract optimal strategies directly from clinical data without relying on the identification of any accurate mathematical models, unlike approaches based on adaptive design. We show that reinforcement learning has tremendous potential in clinical research because it can select actions that improve outcomes by taking into account delayed effects even when the relationship between actions and outcomes is not fully known. To support our claims, the methodology's practical utility is illustrated in a simulation analysis. In the immediate future, we will apply this general strategy to studying and identifying new treatments for advanced metastatic stage IIIB/IV non-small cell lung cancer, which usually includes multiple lines of chemotherapy treatment. Moreover, there is significant potential of the proposed methodology for developing personalized treatment strategies in other cancers, in cystic fibrosis, and in other life-threatening diseases. PMID:19750510
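
    A generic sketch of the backward, time-indexed Q-learning the summary describes, using scikit-learn's extremely randomized trees as the regression engine, is shown below. The simulated two-stage data, feature layout, and outcome model are illustrative assumptions, not the trial design of the cited work.

        import numpy as np
        from sklearn.ensemble import ExtraTreesRegressor

        rng = np.random.default_rng(3)

        # Illustrative two-stage batch: state features, binary treatment, and reward
        # recorded at each stage for n simulated patients.
        n, d = 500, 4
        S1 = rng.normal(size=(n, d))
        A1 = rng.integers(0, 2, size=n)
        R1 = rng.normal(size=n) + A1 * (S1[:, 0] > 0)            # toy outcome model
        S2 = S1 + rng.normal(scale=0.1, size=(n, d))
        A2 = rng.integers(0, 2, size=n)
        R2 = rng.normal(size=n) + A2 * (S2[:, 1] > 0)

        def fit_q(S, A, y):
            """Approximate Q(state, action) with extremely randomized trees."""
            model = ExtraTreesRegressor(n_estimators=100, random_state=0)
            model.fit(np.column_stack([S, A]), y)
            return model

        # Backward induction over the time-indexed Q-functions.
        Q2 = fit_q(S2, A2, R2)                                   # terminal stage
        v2 = np.maximum(Q2.predict(np.column_stack([S2, np.zeros(n)])),
                        Q2.predict(np.column_stack([S2, np.ones(n)])))
        Q1 = fit_q(S1, A1, R1 + v2)                              # earlier stage

        # Recommended first-line treatment for a new patient state.
        s_new = rng.normal(size=(1, d))
        best_a1 = int(Q1.predict(np.column_stack([s_new, [1]]))[0] >
                      Q1.predict(np.column_stack([s_new, [0]]))[0])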

  2. Novelty as a Reinforcer for Position Learning in Children

    ERIC Educational Resources Information Center

    Wilson, Marian Monyok

    1974-01-01

    The stimulus-familiarization-effect (SFE) paradigm, a reaction-time (RT) task based on a response-to-novelty procedure, was modified to assess response for novelty, i.e., a response-reinforcement sequence. The potential implications of attention for reinforcement theory and learning in general are discussed. (Author/CS)

  3. An Energy-Efficient Spectrum-Aware Reinforcement Learning-Based Clustering Algorithm for Cognitive Radio Sensor Networks.

    PubMed

    Mustapha, Ibrahim; Mohd Ali, Borhanuddin; Rasid, Mohd Fadlee A; Sali, Aduwati; Mohamad, Hafizal

    2015-08-13

    It is well-known that clustering partitions network into logical groups of nodes in order to achieve energy efficiency and to enhance dynamic channel access in cognitive radio through cooperative sensing. While the topic of energy efficiency has been well investigated in conventional wireless sensor networks, the latter has not been extensively explored. In this paper, we propose a reinforcement learning-based spectrum-aware clustering algorithm that allows a member node to learn the energy and cooperative sensing costs for neighboring clusters to achieve an optimal solution. Each member node selects an optimal cluster that satisfies pairwise constraints, minimizes network energy consumption and enhances channel sensing performance through an exploration technique. We first model the network energy consumption and then determine the optimal number of clusters for the network. The problem of selecting an optimal cluster is formulated as a Markov Decision Process (MDP) in the algorithm and the obtained simulation results show convergence, learning and adaptability of the algorithm to dynamic environment towards achieving an optimal solution. Performance comparisons of our algorithm with the Groupwise Spectrum Aware (GWSA)-based algorithm in terms of Sum of Square Error (SSE), complexity, network energy consumption and probability of detection indicate improved performance from the proposed approach. The results further reveal that an energy savings of 9% and a significant Primary User (PU) detection improvement can be achieved with the proposed approach.

  4. An Energy-Efficient Spectrum-Aware Reinforcement Learning-Based Clustering Algorithm for Cognitive Radio Sensor Networks

    PubMed Central

    Mustapha, Ibrahim; Ali, Borhanuddin Mohd; Rasid, Mohd Fadlee A.; Sali, Aduwati; Mohamad, Hafizal

    2015-01-01

    It is well-known that clustering partitions network into logical groups of nodes in order to achieve energy efficiency and to enhance dynamic channel access in cognitive radio through cooperative sensing. While the topic of energy efficiency has been well investigated in conventional wireless sensor networks, the latter has not been extensively explored. In this paper, we propose a reinforcement learning-based spectrum-aware clustering algorithm that allows a member node to learn the energy and cooperative sensing costs for neighboring clusters to achieve an optimal solution. Each member node selects an optimal cluster that satisfies pairwise constraints, minimizes network energy consumption and enhances channel sensing performance through an exploration technique. We first model the network energy consumption and then determine the optimal number of clusters for the network. The problem of selecting an optimal cluster is formulated as a Markov Decision Process (MDP) in the algorithm and the obtained simulation results show convergence, learning and adaptability of the algorithm to dynamic environment towards achieving an optimal solution. Performance comparisons of our algorithm with the Groupwise Spectrum Aware (GWSA)-based algorithm in terms of Sum of Square Error (SSE), complexity, network energy consumption and probability of detection indicate improved performance from the proposed approach. The results further reveal that an energy savings of 9% and a significant Primary User (PU) detection improvement can be achieved with the proposed approach. PMID:26287191
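    The core learning loop can be pictured with a simplified, single-state sketch in which a member node learns the expected energy-plus-sensing cost of joining each candidate cluster and gradually settles on the cheapest one. All costs and parameters below are illustrative assumptions, not values from the paper.

        import numpy as np

        rng = np.random.default_rng(4)

        n_clusters = 4
        true_energy_cost  = np.array([2.0, 1.2, 3.0, 1.5])    # unknown to the node
        true_sensing_cost = np.array([0.5, 1.0, 0.4, 0.9])
        Q = np.zeros(n_clusters)                               # estimated cost per cluster
        alpha, eps = 0.1, 0.2

        for t in range(5000):
            # epsilon-greedy exploration over the candidate clusters
            c = int(rng.integers(n_clusters)) if rng.random() < eps else int(np.argmin(Q))
            # observed cost = energy cost + cooperative-sensing cost, with noise
            cost = true_energy_cost[c] + true_sensing_cost[c] + rng.normal(scale=0.2)
            Q[c] += alpha * (cost - Q[c])                      # learn the expected cost

        best_cluster = int(np.argmin(Q))                       # join the cheapest cluster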

  5. Risk-sensitive reinforcement learning.

    PubMed

    Shen, Yun; Tobia, Michael J; Sommer, Tobias; Obermayer, Klaus

    2014-07-01

    We derive a family of risk-sensitive reinforcement learning methods for agents, who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979 ), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.
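
    The central mechanism, applying a utility function to the temporal difference error before the value update, can be sketched as follows. The piecewise-linear utility (losses weighted more than gains) is one illustrative choice in the spirit of prospect theory, not the authors' exact function.

        import numpy as np

        def utility(td_error, k_gain=1.0, k_loss=2.0):
            """Asymmetric utility: negative TD errors (losses) loom larger (illustrative)."""
            return k_gain * td_error if td_error >= 0 else k_loss * td_error

        def risk_sensitive_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
            td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
            # The transformation acts on the TD error itself, so both rewards and the
            # effective transition probabilities are distorted nonlinearly.
            Q[s, a] += alpha * utility(td_error)
            return Q

        Q = np.zeros((5, 2))                      # small example: 5 states, 2 actions
        Q = risk_sensitive_q_update(Q, s=0, a=1, r=-1.0, s_next=2)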

  6. A neural signature of hierarchical reinforcement learning.

    PubMed

    Ribas-Fernandes, José J F; Solway, Alec; Diuk, Carlos; McGuire, Joseph T; Barto, Andrew G; Niv, Yael; Botvinick, Matthew M

    2011-07-28

    Human behavior displays hierarchical structure: simple actions cohere into subtask sequences, which work together to accomplish overall task goals. Although the neural substrates of such hierarchy have been the target of increasing research, they remain poorly understood. We propose that the computations supporting hierarchical behavior may relate to those in hierarchical reinforcement learning (HRL), a machine-learning framework that extends reinforcement-learning mechanisms into hierarchical domains. To test this, we leveraged a distinctive prediction arising from HRL. In ordinary reinforcement learning, reward prediction errors are computed when there is an unanticipated change in the prospects for accomplishing overall task goals. HRL entails that prediction errors should also occur in relation to task subgoals. In three neuroimaging studies we observed neural responses consistent with such subgoal-related reward prediction errors, within structures previously implicated in reinforcement learning. The results reported support the relevance of HRL to the neural processes underlying hierarchical behavior.

  7. Effective reinforcement learning following cerebellar damage requires a balance between exploration and motor noise

    PubMed Central

    Therrien, Amanda S.; Wolpert, Daniel M.

    2016-01-01

    See Miall and Galea (doi: 10.1093/awv343) for a scientific commentary on this article. Reinforcement and error-based processes are essential for motor learning, with the cerebellum thought to be required only for the error-based mechanism. Here we examined learning and retention of a reaching skill under both processes. Control subjects learned similarly from reinforcement and error-based feedback, but showed much better retention under reinforcement. To apply reinforcement to cerebellar patients, we developed a closed-loop reinforcement schedule in which task difficulty was controlled based on recent performance. This schedule produced substantial learning in cerebellar patients and controls. Cerebellar patients varied in their learning under reinforcement but fully retained what was learned. In contrast, they showed complete lack of retention in error-based learning. We developed a mechanistic model of the reinforcement task and found that learning depended on a balance between exploration variability and motor noise. While the cerebellar and control groups had similar exploration variability, the patients had greater motor noise and hence learned less. Our results suggest that cerebellar damage indirectly impairs reinforcement learning by increasing motor noise, but does not interfere with the reinforcement mechanism itself. Therefore, reinforcement can be used to learn and retain novel skills, but optimal reinforcement learning requires a balance between exploration variability and motor noise. PMID:26626368

  8. Effective reinforcement learning following cerebellar damage requires a balance between exploration and motor noise.

    PubMed

    Therrien, Amanda S; Wolpert, Daniel M; Bastian, Amy J

    2016-01-01

    Reinforcement and error-based processes are essential for motor learning, with the cerebellum thought to be required only for the error-based mechanism. Here we examined learning and retention of a reaching skill under both processes. Control subjects learned similarly from reinforcement and error-based feedback, but showed much better retention under reinforcement. To apply reinforcement to cerebellar patients, we developed a closed-loop reinforcement schedule in which task difficulty was controlled based on recent performance. This schedule produced substantial learning in cerebellar patients and controls. Cerebellar patients varied in their learning under reinforcement but fully retained what was learned. In contrast, they showed complete lack of retention in error-based learning. We developed a mechanistic model of the reinforcement task and found that learning depended on a balance between exploration variability and motor noise. While the cerebellar and control groups had similar exploration variability, the patients had greater motor noise and hence learned less. Our results suggest that cerebellar damage indirectly impairs reinforcement learning by increasing motor noise, but does not interfere with the reinforcement mechanism itself. Therefore, reinforcement can be used to learn and retain novel skills, but optimal reinforcement learning requires a balance between exploration variability and motor noise.

  9. Drive-Reinforcement Learning System Applications

    DTIC Science & Technology

    1992-07-31

    evidence suggests that D-R would be effective in control system applications outside the robotics arena.... Drive-Reinforcement Learning, Neural Network Controllers, Robotics, Manipulator Kinematics, Dynamics and Control.

  10. Benchmarking for Bayesian Reinforcement Learning

    PubMed Central

    Ernst, Damien; Couëtoux, Adrien

    2016-01-01

    In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the collected rewards while interacting with their environment while using some prior knowledge that is accessed beforehand. Many BRL algorithms have already been proposed, but the benchmarks used to compare them are only relevant for specific cases. The paper addresses this problem, and provides a new BRL comparison methodology along with the corresponding open source library. In this methodology, a comparison criterion that measures the performance of algorithms on large sets of Markov Decision Processes (MDPs) drawn from some probability distributions is defined. In order to enable the comparison of non-anytime algorithms, our methodology also includes a detailed analysis of the computation time requirement of each algorithm. Our library is released with all source code and documentation: it includes three test problems, each of which has two different prior distributions, and seven state-of-the-art RL algorithms. Finally, our library is illustrated by comparing all the available algorithms and the results are discussed. PMID:27304891

  11. Benchmarking for Bayesian Reinforcement Learning.

    PubMed

    Castronovo, Michael; Ernst, Damien; Couëtoux, Adrien; Fonteneau, Raphael

    2016-01-01

    In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the collected rewards while interacting with their environment while using some prior knowledge that is accessed beforehand. Many BRL algorithms have already been proposed, but the benchmarks used to compare them are only relevant for specific cases. The paper addresses this problem, and provides a new BRL comparison methodology along with the corresponding open source library. In this methodology, a comparison criterion that measures the performance of algorithms on large sets of Markov Decision Processes (MDPs) drawn from some probability distributions is defined. In order to enable the comparison of non-anytime algorithms, our methodology also includes a detailed analysis of the computation time requirement of each algorithm. Our library is released with all source code and documentation: it includes three test problems, each of which has two different prior distributions, and seven state-of-the-art RL algorithms. Finally, our library is illustrated by comparing all the available algorithms and the results are discussed.

  12. Reinforcement learning, conditioning, and the brain: Successes and challenges.

    PubMed

    Maia, Tiago V

    2009-12-01

    The field of reinforcement learning has greatly influenced the neuroscientific study of conditioning. This article provides an introduction to reinforcement learning followed by an examination of the successes and challenges using reinforcement learning to understand the neural bases of conditioning. Successes reviewed include (1) the mapping of positive and negative prediction errors to the firing of dopamine neurons and neurons in the lateral habenula, respectively; (2) the mapping of model-based and model-free reinforcement learning to associative and sensorimotor cortico-basal ganglia-thalamo-cortical circuits, respectively; and (3) the mapping of actor and critic to the dorsal and ventral striatum, respectively. Challenges reviewed consist of several behavioral and neural findings that are at odds with standard reinforcement-learning models, including, among others, evidence for hyperbolic discounting and adaptive coding. The article suggests ways of reconciling reinforcement-learning models with many of the challenging findings, and highlights the need for further theoretical developments where necessary. Additional information related to this study may be downloaded from http://cabn.psychonomic-journals.org/content/supplemental.

  13. Reinforcement learning improves behaviour from evaluative feedback

    NASA Astrophysics Data System (ADS)

    Littman, Michael L.

    2015-05-01

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  14. Credit assignment during movement reinforcement learning.

    PubMed

    Dam, Gregory; Kording, Konrad; Wei, Kunlin

    2013-01-01

    We often need to learn how to move based on a single performance measure that reflects the overall success of our movements. However, movements have many properties, such as their trajectories, speeds and timing of end-points, thus the brain needs to decide which properties of movements should be improved; it needs to solve the credit assignment problem. Currently, little is known about how humans solve credit assignment problems in the context of reinforcement learning. Here we tested how human participants solve such problems during a trajectory-learning task. Without an explicitly-defined target movement, participants made hand reaches and received monetary rewards as feedback on a trial-by-trial basis. The curvature and direction of the attempted reach trajectories determined the monetary rewards received in a manner that can be manipulated experimentally. Based on the history of action-reward pairs, participants quickly solved the credit assignment problem and learned the implicit payoff function. A Bayesian credit-assignment model with built-in forgetting accurately predicts their trial-by-trial learning.
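
    One simple way to picture reward-based credit assignment with forgetting, not the authors' exact model, is recursive least squares over movement features with an exponential forgetting factor: the learner regresses reward on candidate properties (here direction and curvature), with recent trials dominating the estimate. The hidden payoff function and parameters are assumptions.

        import numpy as np

        rng = np.random.default_rng(5)

        def reward(direction, curvature):
            # Illustrative hidden payoff: only direction matters, curvature does not.
            return -(direction - 0.3) ** 2 + rng.normal(scale=0.05)

        lam = 0.95                          # forgetting factor (recent trials weigh more)
        P = np.eye(3) * 10.0                # inverse precision of the estimate
        w = np.zeros(3)                     # credit assigned to [bias, direction, curvature]

        for trial in range(200):
            direction, curvature = rng.uniform(-1, 1, size=2)
            x = np.array([1.0, direction, curvature])
            r = reward(direction, curvature)
            # standard recursive-least-squares update with forgetting
            Px = P @ x
            k = Px / (lam + x @ Px)
            w += k * (r - w @ x)
            P = (P - np.outer(k, Px)) / lam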

  15. A computational psychiatry approach identifies how alpha-2A noradrenergic agonist Guanfacine affects feature-based reinforcement learning in the macaque.

    PubMed

    Hassani, S A; Oemisch, M; Balcarras, M; Westendorff, S; Ardid, S; van der Meer, M A; Tiesinga, P; Womelsdorf, T

    2017-01-16

    Noradrenaline is believed to support cognitive flexibility through the alpha 2A noradrenergic receptor (a2A-NAR) acting in prefrontal cortex. Enhanced flexibility has been inferred from improved working memory with the a2A-NA agonist Guanfacine. But it has been unclear whether Guanfacine improves specific attention and learning mechanisms beyond working memory, and whether the drug effects can be formalized computationally to allow single subject predictions. We tested and confirmed these suggestions in a case study with a healthy nonhuman primate performing a feature-based reversal learning task evaluating performance using Bayesian and Reinforcement learning models. In an initial dose-testing phase we found a Guanfacine dose that increased performance accuracy, decreased distractibility and improved learning. In a second experimental phase using only that dose we examined the faster feature-based reversal learning with Guanfacine with single-subject computational modeling. Parameter estimation suggested that improved learning is not accounted for by varying a single reinforcement learning mechanism, but by changing the set of parameter values to higher learning rates and stronger suppression of non-chosen over chosen feature information. These findings provide an important starting point for developing nonhuman primate models to discern the synaptic mechanisms of attention and learning functions within the context of a computational neuropsychiatry framework.

  16. A computational psychiatry approach identifies how alpha-2A noradrenergic agonist Guanfacine affects feature-based reinforcement learning in the macaque

    PubMed Central

    Hassani, S. A.; Oemisch, M.; Balcarras, M.; Westendorff, S.; Ardid, S.; van der Meer, M. A.; Tiesinga, P.; Womelsdorf, T.

    2017-01-01

    Noradrenaline is believed to support cognitive flexibility through the alpha 2A noradrenergic receptor (a2A-NAR) acting in prefrontal cortex. Enhanced flexibility has been inferred from improved working memory with the a2A-NA agonist Guanfacine. But it has been unclear whether Guanfacine improves specific attention and learning mechanisms beyond working memory, and whether the drug effects can be formalized computationally to allow single subject predictions. We tested and confirmed these suggestions in a case study with a healthy nonhuman primate performing a feature-based reversal learning task evaluating performance using Bayesian and Reinforcement learning models. In an initial dose-testing phase we found a Guanfacine dose that increased performance accuracy, decreased distractibility and improved learning. In a second experimental phase using only that dose we examined the faster feature-based reversal learning with Guanfacine with single-subject computational modeling. Parameter estimation suggested that improved learning is not accounted for by varying a single reinforcement learning mechanism, but by changing the set of parameter values to higher learning rates and stronger suppression of non-chosen over chosen feature information. These findings provide an important starting point for developing nonhuman primate models to discern the synaptic mechanisms of attention and learning functions within the context of a computational neuropsychiatry framework. PMID:28091572

  17. Reinforcement learning: Computational theory and biological mechanisms.

    PubMed

    Doya, Kenji

    2007-05-01

    Reinforcement learning is a computational framework for an active agent to learn behaviors on the basis of a scalar reward signal. The agent can be an animal, a human, or an artificial system such as a robot or a computer program. The reward can be food, water, money, or whatever measure of the performance of the agent. The theory of reinforcement learning, which was developed in an artificial intelligence community with intuitions from animal learning theory, is now giving a coherent account on the function of the basal ganglia. It now serves as the "common language" in which biologists, engineers, and social scientists can exchange their problems and findings. This article reviews the basic theoretical framework of reinforcement learning and discusses its recent and future contributions toward the understanding of animal behaviors and human decision making.

  18. Behavioral and neural properties of social reinforcement learning.

    PubMed

    Jones, Rebecca M; Somerville, Leah H; Li, Jian; Ruberry, Erika J; Libby, Victoria; Glover, Gary; Voss, Henning U; Ballon, Douglas J; Casey, B J

    2011-09-14

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based on work in nonhuman primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging. Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis--social preferences, response latencies, and modeling neural responses--are consistent with reinforcement learning theory and nonhuman primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one's peers in altering subsequent behavior.

  19. Processing speed enhances model-based over model-free reinforcement learning in the presence of high working memory functioning.

    PubMed

    Schad, Daniel J; Jünger, Elisabeth; Sebold, Miriam; Garbusow, Maria; Bernhardt, Nadine; Javadi, Amir-Homayoun; Zimmermann, Ulrich S; Smolka, Michael N; Heinz, Andreas; Rapp, Michael A; Huys, Quentin J M

    2014-01-01

    Theories of decision-making and its neural substrates have long assumed the existence of two distinct and competing valuation systems, variously described as goal-directed vs. habitual, or, more recently and based on statistical arguments, as model-free vs. model-based reinforcement-learning. Though both have been shown to control choices, the cognitive abilities associated with these systems are under ongoing investigation. Here we examine the link to cognitive abilities, and find that individual differences in processing speed covary with a shift from model-free to model-based choice control in the presence of above-average working memory function. This suggests shared cognitive and neural processes; provides a bridge between literatures on intelligence and valuation; and may guide the development of process models of different valuation components. Furthermore, it provides a rationale for individual differences in the tendency to deploy valuation systems, which may be important for understanding the manifold neuropsychiatric diseases associated with malfunctions of valuation.

  20. Processing speed enhances model-based over model-free reinforcement learning in the presence of high working memory functioning

    PubMed Central

    Schad, Daniel J.; Jünger, Elisabeth; Sebold, Miriam; Garbusow, Maria; Bernhardt, Nadine; Javadi, Amir-Homayoun; Zimmermann, Ulrich S.; Smolka, Michael N.; Heinz, Andreas; Rapp, Michael A.; Huys, Quentin J. M.

    2014-01-01

    Theories of decision-making and its neural substrates have long assumed the existence of two distinct and competing valuation systems, variously described as goal-directed vs. habitual, or, more recently and based on statistical arguments, as model-free vs. model-based reinforcement-learning. Though both have been shown to control choices, the cognitive abilities associated with these systems are under ongoing investigation. Here we examine the link to cognitive abilities, and find that individual differences in processing speed covary with a shift from model-free to model-based choice control in the presence of above-average working memory function. This suggests shared cognitive and neural processes; provides a bridge between literatures on intelligence and valuation; and may guide the development of process models of different valuation components. Furthermore, it provides a rationale for individual differences in the tendency to deploy valuation systems, which may be important for understanding the manifold neuropsychiatric diseases associated with malfunctions of valuation. PMID:25566131

  1. Effect of reinforcement learning on coordination of multiagent systems

    NASA Astrophysics Data System (ADS)

    Bukkapatnam, Satish T. S.; Gao, Greg

    2000-12-01

    For effective coordination of distributed environments involving multiagent systems, the learning ability of each agent in the environment plays a crucial role. In this paper, we develop a simple group learning method based on reinforcement, and study its effect on coordination through application to a supply chain procurement scenario involving a computer manufacturer. Here, all parties are represented by self-interested, autonomous agents, each capable of performing specific simple tasks. They negotiate with each other to perform complex tasks and thus coordinate supply chain procurement. Reinforcement learning is intended to enable each agent to reach the best negotiable price within the shortest possible time. Our simulations of the application scenario under different learning strategies reveal the positive effects of reinforcement learning on the agents' as well as the system's performance.
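
    A stripped-down sketch of the idea that reinforcement lets an agent converge on a best negotiable price is given below: a buyer agent learns the expected profit of each discrete offer from repeated accept/reject feedback. The seller's acceptance model and all numbers are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(6)

        prices = np.linspace(80, 120, 9)          # candidate offers
        value_of_goods = 130.0

        def seller_accepts(price):
            # Illustrative seller: higher offers are more likely to be accepted.
            return rng.random() < 1.0 / (1.0 + np.exp(-(price - 100.0) / 5.0))

        Q = np.zeros(len(prices))                 # estimated profit of each offer
        counts = np.zeros(len(prices))
        eps = 0.1

        for episode in range(10000):
            i = int(rng.integers(len(prices))) if rng.random() < eps else int(np.argmax(Q))
            profit = (value_of_goods - prices[i]) if seller_accepts(prices[i]) else 0.0
            counts[i] += 1
            Q[i] += (profit - Q[i]) / counts[i]   # incremental sample average

        best_offer = float(prices[int(np.argmax(Q))])   # learned negotiable price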

  2. Vicarious reinforcement learning signals when instructing others.

    PubMed

    Apps, Matthew A J; Lesage, Elise; Ramnani, Narender

    2015-02-18

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action-outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors.

  3. Vicarious Reinforcement Learning Signals When Instructing Others

    PubMed Central

    Lesage, Elise; Ramnani, Narender

    2015-01-01

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action–outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors. PMID:25698730

  4. Reinforcement-Learning-Based Robust Controller Design for Continuous-Time Uncertain Nonlinear Systems Subject to Input Constraints.

    PubMed

    Liu, Derong; Yang, Xiong; Wang, Ding; Wei, Qinglai

    2015-07-01

    The design of stabilizing controller for uncertain nonlinear systems with control constraints is a challenging problem. The constrained-input coupled with the inability to identify accurately the uncertainties motivates the design of stabilizing controller based on reinforcement-learning (RL) methods. In this paper, a novel RL-based robust adaptive control algorithm is developed for a class of continuous-time uncertain nonlinear systems subject to input constraints. The robust control problem is converted to the constrained optimal control problem with appropriately selecting value functions for the nominal system. Distinct from typical action-critic dual networks employed in RL, only one critic neural network (NN) is constructed to derive the approximate optimal control. Meanwhile, unlike initial stabilizing control often indispensable in RL, there is no special requirement imposed on the initial control. By utilizing Lyapunov's direct method, the closed-loop optimal control system and the estimated weights of the critic NN are proved to be uniformly ultimately bounded. In addition, the derived approximate optimal control is verified to guarantee the uncertain nonlinear system to be stable in the sense of uniform ultimate boundedness. Two simulation examples are provided to illustrate the effectiveness and applicability of the present approach.

  5. Racial bias shapes social reinforcement learning.

    PubMed

    Lindström, Björn; Selbing, Ida; Molapour, Tanaz; Olsson, Andreas

    2014-03-01

    Both emotional facial expressions and markers of racial-group belonging are ubiquitous signals in social interaction, but little is known about how these signals together affect future behavior through learning. To address this issue, we investigated how emotional (threatening or friendly) in-group and out-group faces reinforced behavior in a reinforcement-learning task. We asked whether reinforcement learning would be modulated by intergroup attitudes (i.e., racial bias). The results showed that individual differences in racial bias critically modulated reinforcement learning. As predicted, racial bias was associated with more efficiently learned avoidance of threatening out-group individuals. We used computational modeling analysis to quantitatively delimit the underlying processes affected by social reinforcement. These analyses showed that racial bias modulates the rate at which exposure to threatening out-group individuals is transformed into future avoidance behavior. In concert, these results shed new light on the learning processes underlying social interaction with racial-in-group and out-group individuals.

  6. A REINFORCEMENT LEARNING MODEL OF PERSUASIVE COMMUNICATION.

    ERIC Educational Resources Information Center

    WEISS, ROBERT FRANK

    Theoretical and experimental analogies are drawn between learning theory and persuasive communication as an extension of liberalized stimulus-response theory. In the first experiment on instrumental conditioning of attitudes, the subjects read an opinion to be learned, followed by a supporting argument assumed to function as a reinforcer. The time…

  7. Adaptive Educational Software by Applying Reinforcement Learning

    ERIC Educational Resources Information Center

    Bennane, Abdellah

    2013-01-01

    The introduction of intelligence into teaching software is the object of this paper. In the software development process, learning techniques are used in order to adapt the teaching software to the characteristics of the student. Generally, artificial intelligence techniques such as reinforcement learning and Bayesian networks are used in order to adapt…

  8. Can model-free reinforcement learning explain deontological moral judgments?

    PubMed

    Ayars, Alisabeth

    2016-05-01

    Dual-systems frameworks propose that moral judgments are derived from both an immediate emotional response, and controlled/rational cognition. Recently Cushman (2013) proposed a new dual-system theory based on model-free and model-based reinforcement learning. Model-free learning attaches values to actions based on their history of reward and punishment, and explains some deontological, non-utilitarian judgments. Model-based learning involves the construction of a causal model of the world and allows for far-sighted planning; this form of learning fits well with utilitarian considerations that seek to maximize certain kinds of outcomes. I present three concerns regarding the use of model-free reinforcement learning to explain deontological moral judgment. First, many actions that humans find aversive from model-free learning are not judged to be morally wrong. Moral judgment must require something in addition to model-free learning. Second, there is a dearth of evidence for central predictions of the reinforcement account-e.g., that people with different reinforcement histories will, all else equal, make different moral judgments. Finally, to account for the effect of intention within the framework requires certain assumptions which lack support. These challenges are reasonable foci for future empirical/theoretical work on the model-free/model-based framework.

  9. Accelerating Multiagent Reinforcement Learning by Equilibrium Transfer.

    PubMed

    Hu, Yujing; Gao, Yang; An, Bo

    2015-07-01

    An important approach in multiagent reinforcement learning (MARL) is equilibrium-based MARL, which adopts equilibrium solution concepts in game theory and requires agents to play equilibrium strategies at each state. However, most existing equilibrium-based MARL algorithms cannot scale due to a large number of computationally expensive equilibrium computations (e.g., computing Nash equilibria is PPAD-hard) during learning. For the first time, this paper finds that during the learning process of equilibrium-based MARL, the one-shot games corresponding to each state's successive visits often have the same or similar equilibria (for some states more than 90% of games corresponding to successive visits have similar equilibria). Inspired by this observation, this paper proposes to use equilibrium transfer to accelerate equilibrium-based MARL. The key idea of equilibrium transfer is to reuse previously computed equilibria when each agent has a small incentive to deviate. By introducing transfer loss and transfer condition, a novel framework called equilibrium transfer-based MARL is proposed. We prove that although equilibrium transfer brings transfer loss, equilibrium-based MARL algorithms can still converge to an equilibrium policy under certain assumptions. Experimental results in widely used benchmarks (e.g., grid world game, soccer game, and wall game) show that the proposed framework: 1) not only significantly accelerates equilibrium-based MARL (up to 96.7% reduction in learning time), but also achieves higher average rewards than algorithms without equilibrium transfer and 2) scales significantly better than algorithms without equilibrium transfer when the state/action space grows and the number of agents increases.

  10. Ecological momentary assessment of negative symptoms in schizophrenia: Relationships to effort-based decision making and reinforcement learning.

    PubMed

    Moran, Erin K; Culbreth, Adam J; Barch, Deanna M

    2017-01-01

    Negative symptoms are a core clinical feature of schizophrenia, but conceptual and methodological problems with current instruments can make their assessment challenging. One hypothesis is that current symptom assessments may be influenced by impairments in memory and may not be fully reflective of actual functioning outside of the laboratory. The present study sought to investigate the validity of assessing negative symptoms using ecological momentary assessment (EMA). Participants with schizophrenia (N = 31) completed electronic questionnaires on smartphones 4 times a day for 1 week. Participants also completed effort-based decision making and reinforcement learning (RL) tasks to assess the relationship between EMA and laboratory measures, which tap into negative symptom relevant domains. Hierarchical linear modeling analyses revealed that clinician-rated and self-report measures of negative symptoms were significantly related to negative symptoms assessed via EMA. However, working memory moderated the relationship between EMA and retrospective measures of negative symptoms, such that there was a stronger relationship between EMA and retrospective negative symptom measures among individuals with better working memory. The authors also found that negative symptoms assessed via EMA were related to poor performance on the effort task, whereas clinician-rated symptoms and self-reports were not. Further, they found that negative symptoms were related to poorer performance on learning reward contingencies. The findings suggest that negative symptoms can be assessed through EMA and that working memory impairments frequently seen in schizophrenia may affect recall of symptoms. Moreover, these findings suggest the importance of examining the relationship between laboratory tasks and symptoms assessed during daily life.

  11. Evolution with reinforcement learning in negotiation.

    PubMed

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and makes more robust predictions in the long term as compared to in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions when the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stableness than their counterparts using classic evolutionary algorithm.

  12. Evolution with Reinforcement Learning in Negotiation

    PubMed Central

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and makes more robust predictions in the long term as compared to in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions when the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stableness than their counterparts using classic evolutionary algorithm. PMID:25048108

  13. A confidence metric for using neurobiological feedback in actor-critic reinforcement learning based brain-machine interfaces

    PubMed Central

    Prins, Noeline W.; Sanchez, Justin C.; Prasad, Abhishek

    2014-01-01

    Brain-Machine Interfaces (BMIs) can be used to restore function in people living with paralysis. Current BMIs require extensive calibration that increases the set-up times and external inputs for decoder training that may be difficult to produce in paralyzed individuals. Both these factors have presented challenges in transitioning the technology from research environments to activities of daily living (ADL). For BMIs to be seamlessly used in ADL, these issues should be handled with minimal external input, thus reducing the need for a technician/caregiver to calibrate the system. Reinforcement Learning (RL) based BMIs are a good tool to be used when there is no external training signal and can provide an adaptive modality to train BMI decoders. However, RL based BMIs are sensitive to the feedback provided to adapt the BMI. In actor-critic BMIs, this feedback is provided by the critic and the overall system performance is limited by the critic accuracy. In this work, we developed an adaptive BMI that could handle inaccuracies in the critic feedback in an effort to produce more accurate RL based BMIs. We developed a confidence measure, which indicated how appropriate the feedback is for updating the decoding parameters of the actor. The results show that with the new update formulation, the critic accuracy is no longer a limiting factor for the overall performance. We tested and validated the system on three different data sets: synthetic data generated by an Izhikevich neural spiking model, synthetic data with a Gaussian noise distribution, and data collected from a non-human primate engaged in a reaching task. All results indicated that the system with the critic confidence built in always outperformed the system without the critic confidence. Results of this study suggest the potential application of the technique in developing an autonomous BMI that does not need an external signal for training or extensive calibration. PMID:24904257

  14. A confidence metric for using neurobiological feedback in actor-critic reinforcement learning based brain-machine interfaces.

    PubMed

    Prins, Noeline W; Sanchez, Justin C; Prasad, Abhishek

    2014-01-01

    Brain-Machine Interfaces (BMIs) can be used to restore function in people living with paralysis. Current BMIs require extensive calibration that increases the set-up times and external inputs for decoder training that may be difficult to produce in paralyzed individuals. Both these factors have presented challenges in transitioning the technology from research environments to activities of daily living (ADL). For BMIs to be seamlessly used in ADL, these issues should be handled with minimal external input, thus reducing the need for a technician/caregiver to calibrate the system. Reinforcement Learning (RL) based BMIs are a good tool to be used when there is no external training signal and can provide an adaptive modality to train BMI decoders. However, RL based BMIs are sensitive to the feedback provided to adapt the BMI. In actor-critic BMIs, this feedback is provided by the critic and the overall system performance is limited by the critic accuracy. In this work, we developed an adaptive BMI that could handle inaccuracies in the critic feedback in an effort to produce more accurate RL based BMIs. We developed a confidence measure, which indicated how appropriate the feedback is for updating the decoding parameters of the actor. The results show that with the new update formulation, the critic accuracy is no longer a limiting factor for the overall performance. We tested and validated the system on three different data sets: synthetic data generated by an Izhikevich neural spiking model, synthetic data with a Gaussian noise distribution, and data collected from a non-human primate engaged in a reaching task. All results indicated that the system with the critic confidence built in always outperformed the system without the critic confidence. Results of this study suggest the potential application of the technique in developing an autonomous BMI that does not need an external signal for training or extensive calibration.
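
    The role of the confidence measure can be sketched as a scaling factor on the actor's policy-gradient step: the less the noisy binary critic is trusted, the smaller the decoder update. This is an illustrative formulation, not the update rule from the paper; the decoder shape and softmax policy are assumptions.

        import numpy as np

        def actor_update(weights, firing_rates, chosen_action, critic_feedback,
                         confidence, lr=0.05):
            """Confidence-weighted actor update (illustrative).

            weights:         (n_actions, n_neurons) linear decoder
            firing_rates:    (n_neurons,) neural features for this trial
            critic_feedback: +1 if the critic labels the action correct, -1 otherwise
            confidence:      value in [0, 1] expressing trust in the critic
            """
            scores = weights @ firing_rates
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                         # softmax action probabilities
            grad = -np.outer(probs, firing_rates)        # gradient of log-probability
            grad[chosen_action] += firing_rates
            # Unreliable critic feedback (low confidence) changes the decoder less.
            return weights + lr * confidence * critic_feedback * grad

        W = np.zeros((4, 16))
        rates = np.random.default_rng(7).poisson(5.0, size=16).astype(float)
        W = actor_update(W, rates, chosen_action=2, critic_feedback=+1, confidence=0.8)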

  15. Generalization of value in reinforcement learning by humans.

    PubMed

    Wimmer, G Elliott; Daw, Nathaniel D; Shohamy, Daphna

    2012-04-01

    Research in decision-making has focused on the role of dopamine and its striatal targets in guiding choices via learned stimulus-reward or stimulus-response associations, behavior that is well described by reinforcement learning theories. However, basic reinforcement learning is relatively limited in scope and does not explain how learning about stimulus regularities or relations may guide decision-making. A candidate mechanism for this type of learning comes from the domain of memory, which has highlighted a role for the hippocampus in learning of stimulus-stimulus relations, typically dissociated from the role of the striatum in stimulus-response learning. Here, we used functional magnetic resonance imaging and computational model-based analyses to examine the joint contributions of these mechanisms to reinforcement learning. Humans performed a reinforcement learning task with added relational structure, modeled after tasks used to isolate hippocampal contributions to memory. On each trial participants chose one of four options, but the reward probabilities for pairs of options were correlated across trials. This (uninstructed) relationship between pairs of options potentially enabled an observer to learn about option values based on experience with the other options and to generalize across them. We observed blood oxygen level-dependent (BOLD) activity related to learning in the striatum and also in the hippocampus. By comparing a basic reinforcement learning model to one augmented to allow feedback to generalize between correlated options, we tested whether choice behavior and BOLD activity were influenced by the opportunity to generalize across correlated options. Although such generalization goes beyond standard computational accounts of reinforcement learning and striatal BOLD, both choices and striatal BOLD activity were better explained by the augmented model. Consistent with the hypothesized role for the hippocampus in this generalization, functional
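
    The augmented model the abstract contrasts with basic reinforcement learning can be reduced to a single extra parameter: feedback for the chosen option also updates its anti-correlated partner in the opposite direction. The pairing scheme and weights below are illustrative assumptions.

        import numpy as np

        # Four options; (0, 1) and (2, 3) are pairs with anti-correlated payoffs.
        PAIR = {0: 1, 1: 0, 2: 3, 3: 2}
        Q = np.zeros(4)
        alpha = 0.2
        g = 0.5            # generalization weight; g = 0 recovers the basic model

        def update(chosen, reward):
            delta = reward - Q[chosen]
            Q[chosen] += alpha * delta
            # Feedback generalizes to the unchosen, anti-correlated partner option.
            Q[PAIR[chosen]] -= g * alpha * delta

        update(chosen=0, reward=1.0)     # example trial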

  16. Reinforcement learning: Solving two case studies

    NASA Astrophysics Data System (ADS)

    Duarte, Ana Filipa; Silva, Pedro; dos Santos, Cristina Peixoto

    2012-09-01

    Reinforcement Learning algorithms offer interesting features for the control of autonomous systems, such as the ability to learn from direct interaction with the environment, and the use of a simple reward signal, as opposed to the input-output pairs used in classic supervised learning. The reward signal indicates the success or failure of the actions executed by the agent in the environment. In this work, RL algorithms applied to two case studies are described: the Crawler robot and the widely known inverted pendulum. We explore RL capabilities to autonomously learn a basic locomotion pattern with the Crawler, and approach the balancing problem of biped locomotion using the inverted pendulum.

  17. Refining Linear Fuzzy Rules by Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.; Malkani, Anil

    1996-01-01

    Linear fuzzy rules are increasingly being used in the development of fuzzy logic systems. Radial basis functions have also been used in the antecedents of the rules for clustering in product space which can automatically generate a set of linear fuzzy rules from an input/output data set. Manual methods are usually used in refining these rules. This paper presents a method for refining the parameters of these rules using reinforcement learning which can be applied in domains where supervised input-output data is not available and reinforcements are received only after a long sequence of actions. This is shown for a generalization of radial basis functions. The formation of fuzzy rules from data and their automatic refinement is an important step in closing the gap between the application of reinforcement learning methods in the domains where only some limited input-output data is available.

  18. Geographical Inquiry and Learning Reinforcement Theory.

    ERIC Educational Resources Information Center

    Davies, Christopher S.

    1983-01-01

    Although instructors have been reluctant to utilize the Keller Plan (a personalized system of instruction), it lends itself to teaching introductory geography. College students found that the routine and frequent reinforcement led to progressive learning. However, it does not lend itself to the study of reflexive or polemical concepts. (IS)

  19. Classroom Reinforcement and Learning: A Quantitative Synthesis.

    ERIC Educational Resources Information Center

    Lysakowski, Richard S.; Walberg, Herbert J.

    To estimate the influence of positive reinforcement on classroom learning, the authors analyzed statistical data from 39 studies spanning the years 1958-1978 and containing a combined sample of 4,842 students in 202 classes. Twenty-nine characteristics of each study's sample, methodology, and reliability were coded to measure their effects on…

  20. Reinforcement Learning in Information Searching

    ERIC Educational Resources Information Center

    Cen, Yonghua; Gan, Liren; Bai, Chen

    2013-01-01

    Introduction: The study seeks to answer two questions: How do university students learn to use correct strategies to conduct scholarly information searches without instructions? and, What are the differences in learning mechanisms between users at different cognitive levels? Method: Two groups of users, thirteen first year undergraduate students…

  1. Assist-as-needed robotic trainer based on reinforcement learning and its application to dart-throwing.

    PubMed

    Obayashi, Chihiro; Tamei, Tomoya; Shibata, Tomohiro

    2014-05-01

    This paper proposes a novel robotic trainer for motor skill learning. It is user-adaptive, inspired by the assist-as-needed principle well known in the field of physical therapy. Most previous studies in the field of the robotic assistance of motor skill learning have used predetermined desired trajectories, and it has not been examined intensively whether these trajectories were optimal for each user. Furthermore, the guidance hypothesis states that humans tend to rely too much on external assistive feedback, resulting in interference with the internal feedback necessary for motor skill learning. A few studies have proposed a system that adjusts its assistive strength according to the user's performance in order to prevent the user from relying too much on the robotic assistance. There are, however, problems in these studies, in that a physical model of the user's motor system is required, which is inherently difficult to construct. In this paper, we propose a framework for a robotic trainer that is user-adaptive and that requires neither a specific desired trajectory nor a physical model of the user's motor system, and we achieve this using model-free reinforcement learning. We chose dart-throwing as an example motor-learning task as it is one of the simplest throwing tasks, and its performance can be easily and quantitatively measured. Training experiments with novices, aiming at maximizing the score with the darts and minimizing the physical robotic assistance, demonstrate the feasibility and plausibility of the proposed framework.

  2. Multiplexing signals in reinforcement learning with internal models and dopamine.

    PubMed

    Nakahara, Hiroyuki

    2014-04-01

    A fundamental challenge for computational and cognitive neuroscience is to understand how reward-based learning and decision-making are made and how accrued knowledge and internal models of the environment are incorporated. Remarkable progress has been made in the field, guided by the midbrain dopamine reward prediction error hypothesis and the underlying reinforcement learning framework, which does not involve internal models ('model-free'). Recent studies, however, have begun not only to address more complex decision-making processes that are integrated with model-free decision-making, but also to include internal models about environmental reward structures and the minds of other agents, including model-based reinforcement learning and using generalized prediction errors. Even dopamine, a classic model-free signal, may work as multiplexed signals using model-based information and contribute to representational learning of reward structure.

  3. Reinforcement learning for robot control

    NASA Astrophysics Data System (ADS)

    Smart, William D.; Pack Kaelbling, Leslie

    2002-02-01

    Writing control code for mobile robots can be a very time-consuming process. Even for apparently simple tasks, it is often difficult to specify in detail how the robot should accomplish them. Robot control code is typically full of magic numbers that must be painstakingly set for each environment that the robot must operate in. The idea of having a robot learn how to accomplish a task, rather than being told explicitly is an appealing one. It seems easier and much more intuitive for the programmer to specify what the robot should be doing, and let it learn the fine details of how to do it. In this paper, we describe JAQL, a framework for efficient learning on mobile robots, and present the results of using it to learn control policies for simple tasks.

  4. Tunnel Ventilation Control Using Reinforcement Learning Methodology

    NASA Astrophysics Data System (ADS)

    Chu, Baeksuk; Kim, Dongnam; Hong, Daehie; Park, Jooyoung; Chung, Jin Taek; Kim, Tae-Hyung

    The main purpose of a tunnel ventilation system is to maintain CO pollutant concentration and VI (visibility index) under an adequate level to provide drivers with a comfortable and safe driving environment. Moreover, it is necessary to minimize the power consumed to operate the ventilation system. To achieve these objectives, the control algorithm used in this research is the reinforcement learning (RL) method. RL is goal-directed learning of a mapping from situations to actions without relying on exemplary supervision or complete models of the environment. The goal of RL is to maximize a reward, which is an evaluative feedback from the environment. In constructing the reward of the tunnel ventilation system, the two objectives listed above are included, that is, maintaining an adequate level of pollutants and minimizing power consumption. An RL algorithm based on an actor-critic architecture and a gradient-following algorithm is applied to the tunnel ventilation system. Simulation results obtained with real data collected from an existing tunnel ventilation system, together with real experimental verification, are provided in this paper. It is confirmed that with the suggested controller, the pollutant level inside the tunnel was well maintained under the allowable limit and energy consumption was improved compared to the conventional control scheme.
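
    The record above folds both objectives (pollutant limits and power use) into a single reward. A hedged sketch of such a composite reward is given below; the limits, weights, and function names are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of a composite ventilation reward: penalize CO and visibility
# index (VI) violations and penalize power use. The limits and weights below
# are hypothetical, not the values used in the paper.
CO_LIMIT = 25.0      # ppm, assumed allowable CO level
VI_LIMIT = 0.5       # assumed minimum visibility index
W_CO, W_VI, W_POWER = 1.0, 1.0, 0.01

def ventilation_reward(co_ppm, vi, fan_power_kw):
    co_penalty = max(0.0, co_ppm - CO_LIMIT)     # only penalize violations
    vi_penalty = max(0.0, VI_LIMIT - vi)
    return -(W_CO * co_penalty + W_VI * vi_penalty + W_POWER * fan_power_kw)

# e.g. ventilation_reward(30.0, 0.6, 120.0) trades a CO violation off against fan power
print(ventilation_reward(30.0, 0.6, 120.0))
```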

  5. Habits, action sequences, and reinforcement learning

    PubMed Central

    Dezfouli, Amir; Balleine, Bernard W.

    2012-01-01

    It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions. PMID:22487034
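
    The distinction drawn above between model-based control and cached, model-free values can be made concrete with a small sketch. The code below (an illustration, not the authors' model) contrasts a cached action value, which changes only through experienced reward, with a model-based value recomputed from a learned action-outcome model, so that outcome devaluation affects the two controllers differently.

```python
# Minimal contrast between the two controllers discussed above, as a sketch:
# a model-free learner caches action values directly from reward history, while
# a model-based learner recomputes values from a learned model of transitions
# and outcomes (so devaluing an outcome immediately changes its choice).
ALPHA = 0.2

# model-free: cached value, updated only by experienced reward
q_cached = 0.0
def model_free_update(reward):
    global q_cached
    q_cached += ALPHA * (reward - q_cached)

# model-based: value derived on demand from outcome identity and current outcome value
transition = {"press_lever": "food_pellet"}       # learned action -> outcome model
outcome_value = {"food_pellet": 1.0}

def model_based_value(action):
    return outcome_value[transition[action]]

model_free_update(1.0)
outcome_value["food_pellet"] = 0.0                # outcome devaluation
print(q_cached, model_based_value("press_lever")) # cached value stays high; model-based drops
```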

  6. Habits, action sequences and reinforcement learning.

    PubMed

    Dezfouli, Amir; Balleine, Bernard W

    2012-04-01

    It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary, and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions.

  7. Stress modulates reinforcement learning in younger and older adults.

    PubMed

    Lighthall, Nichole R; Gorlick, Marissa A; Schoeke, Andrej; Frank, Michael J; Mather, Mara

    2013-03-01

    Animal research and human neuroimaging studies indicate that stress increases dopamine levels in brain regions involved in reward processing, and stress also appears to increase the attractiveness of addictive drugs. The current study tested the hypothesis that stress increases reward salience, leading to more effective learning about positive than negative outcomes in a probabilistic selection task. Changes to dopamine pathways with age raise the question of whether stress effects on incentive-based learning differ by age. Thus, the present study also examined whether effects of stress on reinforcement learning differed for younger (age 18-34) and older participants (age 65-85). Cold pressor stress was administered to half of the participants in each age group, and salivary cortisol levels were used to confirm biophysiological response to cold stress. After the manipulation, participants completed a probabilistic learning task involving positive and negative feedback. In both younger and older adults, stress enhanced learning about cues that predicted positive outcomes. In addition, during the initial learning phase, stress diminished sensitivity to recent feedback across age groups. These results indicate that stress affects reinforcement learning in both younger and older adults and suggest that stress exerts different effects on specific components of reinforcement learning depending on their neural underpinnings.

  8. Multi-Agent Reinforcement Learning and Adaptive Neural Networks.

    DTIC Science & Technology

    2007-11-02

    learning method. The objective was to study the utility of reinforcement learning as an approach to complex decentralized control problems. The major...accomplishment was a detailed study of multi-agent reinforcement learning applied to a large-scale decentralized stochastic control problem. This study...included a very successful demonstration that a multi-agent reinforcement learning system using neural networks could learn high-performance

  9. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning.

    PubMed

    Balasubramani, Pragathi P; Chakravarthy, V Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk based decision making in a modified Reinforcement Learning (RL)-framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation that accommodates both 5HT and DA reconciles some of the diverse roles of 5HT particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) Risk-sensitive decision making, where 5HT controls risk assessment, (2) Temporal reward prediction, where 5HT controls time-scale of reward prediction, and (3) Reward/Punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG.
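
    The abstract above maps dopamine onto a temporal-difference error and serotonin onto a risk-weighting parameter. The sketch below is one hedged way such a risk-sensitive value computation could look; the exact functional forms and parameter names in the paper may differ.

```python
# Hedged sketch of a risk-sensitive value computation: a temporal-difference
# error (delta, the DA-like signal) updates expected value, a risk prediction
# error updates a variance estimate, and a parameter alpha_5ht (the 5HT-like
# signal) weighs risk against value. Illustrative only.
import math

ALPHA_LR = 0.1        # learning rate (distinct from the 5HT parameter)
value, risk = 0.0, 0.0

def update(reward, alpha_5ht):
    global value, risk
    delta = reward - value                    # reward prediction error (DA-like)
    value += ALPHA_LR * delta
    xi = delta ** 2 - risk                    # risk prediction error
    risk += ALPHA_LR * xi
    utility = value - alpha_5ht * math.sqrt(max(risk, 0.0))   # risk-averse utility
    return delta, utility

for r in [1.0, 0.0, 1.0, 0.0]:                # a risky 50/50 option
    delta, utility = update(r, alpha_5ht=0.5)
print(value, risk, utility)
```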

  10. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning

    PubMed Central

    Balasubramani, Pragathi P.; Chakravarthy, V. Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A.

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk based decision making in a modified Reinforcement Learning (RL)-framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation that accommodates both 5HT and DA reconciles some of the diverse roles of 5HT particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) Risk-sensitive decision making, where 5HT controls risk assessment, (2) Temporal reward prediction, where 5HT controls time-scale of reward prediction, and (3) Reward/Punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG. PMID:24795614

  11. A reinforcement learning approach to gait training improves retention

    PubMed Central

    Hasson, Christopher J.; Manczurowsky, Julia; Yen, Sheng-Che

    2015-01-01

    Many gait training programs are based on supervised learning principles: an individual is guided towards a desired gait pattern with directional error feedback. While this results in rapid adaptation, improvements quickly disappear. This study tested the hypothesis that a reinforcement learning approach improves retention and transfer of a new gait pattern. The results of a pilot study and larger experiment are presented. Healthy subjects were randomly assigned to either a supervised group, who received explicit instructions and directional error feedback while they learned a new gait pattern on a treadmill, or a reinforcement group, who was only shown whether they were close to or far from the desired gait. Subjects practiced for 10 min, followed by immediate and overnight retention and over-ground transfer tests. The pilot study showed that subjects could learn a new gait pattern under a reinforcement learning paradigm. The larger experiment, which had twice as many subjects (16 in each group), showed that the reinforcement group had better overnight retention than the supervised group (a 32% vs. 120% error increase, respectively), but there were no differences for over-ground transfer. These results suggest that encouraging participants to find rewarding actions through self-guided exploration is beneficial for retention. PMID:26379524

  12. A reinforcement learning approach to gait training improves retention.

    PubMed

    Hasson, Christopher J; Manczurowsky, Julia; Yen, Sheng-Che

    2015-01-01

    Many gait training programs are based on supervised learning principles: an individual is guided towards a desired gait pattern with directional error feedback. While this results in rapid adaptation, improvements quickly disappear. This study tested the hypothesis that a reinforcement learning approach improves retention and transfer of a new gait pattern. The results of a pilot study and larger experiment are presented. Healthy subjects were randomly assigned to either a supervised group, who received explicit instructions and directional error feedback while they learned a new gait pattern on a treadmill, or a reinforcement group, who was only shown whether they were close to or far from the desired gait. Subjects practiced for 10 min, followed by immediate and overnight retention and over-ground transfer tests. The pilot study showed that subjects could learn a new gait pattern under a reinforcement learning paradigm. The larger experiment, which had twice as many subjects (16 in each group), showed that the reinforcement group had better overnight retention than the supervised group (a 32% vs. 120% error increase, respectively), but there were no differences for over-ground transfer. These results suggest that encouraging participants to find rewarding actions through self-guided exploration is beneficial for retention.

  13. Reinforcement and inference in cross-situational word learning

    PubMed Central

    Tilles, Paulo F. C.; Fontanari, José F.

    2013-01-01

    Cross-situational word learning is based on the notion that a learner can determine the referent of a word by finding something in common across many observed uses of that word. Here we propose an adaptive learning algorithm that contains a parameter that controls the strength of the reinforcement applied to associations between concurrent words and referents, and a parameter that regulates inference, which includes built-in biases, such as mutual exclusivity, and information of past learning events. By adjusting these parameters so that the model predictions agree with data from representative experiments on cross-situational word learning, we were able to explain the learning strategies adopted by the participants of those experiments in terms of a trade-off between reinforcement and inference. These strategies can vary wildly depending on the conditions of the experiments. For instance, for fast mapping experiments (i.e., the correct referent could, in principle, be inferred in a single observation) inference is prevalent, whereas for segregated contextual diversity experiments (i.e., the referents are separated in groups and are exhibited with members of their groups only) reinforcement is predominant. Other experiments are explained with more balanced doses of reinforcement and inference. PMID:24312030
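
    The model described above combines a reinforcement parameter with an inference parameter. The toy learner below illustrates that trade-off under stated assumptions (the actual update rules in the paper differ in detail): co-occurring word-referent pairs are reinforced, while an inference term discounts referents that compete with a word's established favourite.

```python
# Illustrative sketch of a cross-situational learner with separate reinforcement
# and inference ingredients; parameter names and the update rule are assumptions
# made for illustration, not the paper's model.
from collections import defaultdict

CHI = 0.3          # reinforcement applied to co-occurring word-referent pairs
BETA = 0.7         # inference strength: discount for referents competing with a favourite
assoc = defaultdict(float)                       # (word, referent) -> association strength

def observe(words, referents):
    for w in words:
        best = max(referents, key=lambda r: assoc[(w, r)])
        has_favourite = assoc[(w, best)] > 0.0
        for r in referents:
            # inference: once a favourite emerges, competing referents get less credit
            gain = CHI if (r == best or not has_favourite) else CHI * (1.0 - BETA)
            assoc[(w, r)] += gain                # reinforcement of co-occurring pairs

observe(["dax"], ["cup", "ball"])                # ambiguous first exposure
observe(["dax"], ["cup", "spoon"])               # "cup" recurs and gets the larger share
print(max(["cup", "ball", "spoon"], key=lambda r: assoc[("dax", r)]))   # -> "cup"
```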

  14. Reinforcement Learning in Young Adults with Developmental Language Impairment

    PubMed Central

    Lee, Joanna C.; Tomblin, J. Bruce

    2012-01-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank et al., 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic selection task was used to assess how participants implicitly extracted reinforcement history from the environment based on probabilistic positive/negative feedback. The findings showed impaired RL in individuals with DLI, indicating an altered gating function of the striatum in testing. However, they exploited similar learning strategies as comparison participants at the beginning of training, reflecting relatively intact functions of the prefrontal cortex to rapidly update reinforcement information. Within the context of Frank’s model, these results can be interpreted as evidence for alterations in the basal ganglia of individuals with DLI. PMID:22921956

  15. Reinforcement learning in young adults with developmental language impairment.

    PubMed

    Lee, Joanna C; Tomblin, J Bruce

    2012-12-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank, Seeberger, & O'Reilly, 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic selection task was used to assess how participants implicitly extracted reinforcement history from the environment based on probabilistic positive/negative feedback. The findings showed impaired RL in individuals with DLI, indicating an altered gating function of the striatum in testing. However, they exploited similar learning strategies as comparison participants at the beginning of training, reflecting relatively intact functions of the prefrontal cortex to rapidly update reinforcement information. Within the context of Frank's model, these results can be interpreted as evidence for alterations in the basal ganglia of individuals with DLI.

  16. Application of fuzzy logic-neural network based reinforcement learning to proximity and docking operations: Attitude control results

    NASA Technical Reports Server (NTRS)

    Jani, Yashvant

    1992-01-01

    As part of the RICIS activity, the reinforcement learning techniques developed at Ames Research Center are being applied to proximity and docking operations using the Shuttle and Solar Max satellite simulation. This activity is carried out in the software technology laboratory utilizing the Orbital Operations Simulator (OOS). This report is deliverable D2, Attitude Control Results; it provides the status of the project after four months of activities and outlines the future plans. In section 2 we describe the Fuzzy-Learner system for the attitude control functions. In section 3, we provide the description of test cases and results in chronological order. In section 4, we summarize our results and conclusions. Our future plans and recommendations are provided in section 5.

  17. Reinforcement Learning and Savings Behavior.

    PubMed

    Choi, James J; Laibson, David; Madrian, Brigitte C; Metrick, Andrew

    2009-12-01

    We show that individual investors over-extrapolate from their personal experience when making savings decisions. Investors who experience particularly rewarding outcomes from saving in their 401(k)-a high average and/or low variance return-increase their 401(k) savings rate more than investors who have less rewarding experiences with saving. This finding is not driven by aggregate time-series shocks, income effects, rational learning about investing skill, investor fixed effects, or time-varying investor-level heterogeneity that is correlated with portfolio allocations to stock, bond, and cash asset classes. We discuss implications for the equity premium puzzle and interventions aimed at improving household financial outcomes.

  18. Enhanced Experience Replay for Deep Reinforcement Learning

    DTIC Science & Technology

    2015-11-01

    Temporal-difference Q-learning is used to train the network, and a memory of state–action–reward transitions is kept and used in an experience-replay...al. 2015) uses a convolutional neural network to automatically extract relevant features from the video-game display, then uses reinforcement...situations. To counteract this problem, a memory of past experiences (the state–action–reward information) is stored during training and the network is

  19. The Function of Direct and Vicarious Reinforcement in Human Learning.

    ERIC Educational Resources Information Center

    Owens, Carl R.; And Others

    The role of reinforcement has long been an issue in learning theory. The effects of reinforcement in learning were investigated under circumstances which made the information necessary for correct performance equally available to reinforced and nonreinforced subjects. Fourth graders (N=36) were given a pre-test of 20 items from the Peabody Picture…

  20. Application of fuzzy logic-neural network based reinforcement learning to proximity and docking operations: Special approach/docking testcase results

    NASA Technical Reports Server (NTRS)

    Jani, Yashvant

    1993-01-01

    As part of the RICIS project, the reinforcement learning techniques developed at Ames Research Center are being applied to proximity and docking operations using the Shuttle and Solar Maximum Mission (SMM) satellite simulation. In utilizing these fuzzy learning techniques, we use the Approximate Reasoning based Intelligent Control (ARIC) architecture, and so we use these two terms interchangeably to mean the same thing. This activity is carried out in the Software Technology Laboratory utilizing the Orbital Operations Simulator (OOS) and programming/testing support from other contractor personnel. This report is the final deliverable D4 in our milestones and project activity. It provides the test results for the special testcase of the approach/docking scenario for the shuttle and SMM satellite. Based on our experience and analysis with the attitude and translational controllers, we have modified the basic configuration of the reinforcement learning algorithm in ARIC. The shuttle translational controller and its implementation in ARIC are described in our deliverable D3. In order to simulate the final approach and docking operations, we have set up this special testcase as described in section 2. The ARIC performance results for these operations are discussed in section 3, and conclusions are provided in section 4 along with the summary for the project.

  1. Motivational neural circuits underlying reinforcement learning.

    PubMed

    Averbeck, Bruno B; Costa, Vincent D

    2017-03-29

    Reinforcement learning (RL) is the behavioral process of learning the values of actions and objects. Most models of RL assume that the dopaminergic prediction error signal drives plasticity in frontal-striatal circuits. The striatum then encodes value representations that drive decision processes. However, the amygdala has also been shown to play an important role in forming Pavlovian stimulus-outcome associations. These Pavlovian associations can drive motivated behavior via the amygdala projections to the ventral striatum or the ventral tegmental area. The amygdala may, therefore, play a central role in RL. Here we compare the contributions of the amygdala and the striatum to RL and show that both the amygdala and striatum learn and represent expected values in RL tasks. Furthermore, value representations in the striatum may be inherited, to some extent, from the amygdala. The striatum may, therefore, play less of a primary role in learning stimulus-outcome associations in RL than previously suggested.

  2. Intrinsically Motivated Reinforcement Learning: A Promising Framework for Developmental Robot Learning

    DTIC Science & Technology

    2005-01-01

    for intrinsically motivated reinforcement learning that strives to achieve broad competence in an environment in a task-nonspecific manner by...hierarchical learning, intrinsically motivated reinforcement learning is an obvious choice for organizing behavior in developmental robotics. We present

  3. Structure identification in fuzzy inference using reinforcement learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1993-01-01

    In our previous work on the GARIC architecture, we have shown that the system can start with surface structure of the knowledge base (i.e., the linguistic expression of the rules) and learn the deep structure (i.e., the fuzzy membership functions of the labels used in the rules) by using reinforcement learning. Assuming the surface structure, GARIC refines the fuzzy membership functions used in the consequents of the rules using a gradient descent procedure. This hybrid fuzzy logic and reinforcement learning approach can learn to balance a cart-pole system and to backup a truck to its docking location after a few trials. In this paper, we discuss how to do structure identification using reinforcement learning in fuzzy inference systems. This involves identifying both surface as well as deep structure of the knowledge base. The term set of fuzzy linguistic labels used in describing the values of each control variable must be derived. In this process, splitting a label refers to creating new labels which are more granular than the original label and merging two labels creates a more general label. Splitting and merging of labels directly transform the structure of the action selection network used in GARIC by increasing or decreasing the number of hidden layer nodes.

  4. Positive impact of state similarity on reinforcement learning performance.

    PubMed

    Girgin, Sertan; Polat, Faruk; Alhajj, Reda

    2007-10-01

    In this paper, we propose a novel approach to identify states with similar subpolicies and show how they can be integrated into the reinforcement learning framework to improve learning performance. The method utilizes a specialized tree structure to identify common action sequences of states, which are derived from possible optimal policies, and defines a similarity function between two states based on the number of such sequences. Using this similarity function, updates on the action-value function of a state are reflected onto all similar states. This allows experience that is acquired during learning to be applied to a broader context. The effectiveness of the method is demonstrated empirically.
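
    To make the mechanism above concrete, the sketch below broadcasts each Q-learning update to states judged similar; the similarity function is a hypothetical placeholder, whereas the paper derives similarity from common action sequences discovered with a specialized tree structure.

```python
# Sketch of the core idea: when one state's action value is updated, the update
# is also reflected onto similar states. The similarity measure here is a toy
# placeholder, not the sequence-tree similarity defined in the paper.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95
Q = defaultdict(float)                       # (state, action) -> value

def similarity(s1, s2):
    # placeholder: overlap of the states' (toy) signatures
    set1, set2 = set(s1), set(s2)
    return len(set1 & set2) / max(len(set1 | set2), 1)

def update_with_similarity(states, s, a, reward, s_next, actions, threshold=0.3):
    target = reward + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    delta = target - Q[(s, a)]
    Q[(s, a)] += ALPHA * delta
    for other in states:                     # reflect the update onto similar states
        sim = similarity(s, other)
        if other != s and sim > threshold:
            Q[(other, a)] += ALPHA * sim * delta

states = ["AB", "AC", "XY"]                  # toy states named by their signatures
update_with_similarity(states, "AB", 0, 1.0, "AC", actions=[0, 1])
print(dict(Q))                               # the update on "AB" also nudges "AC", but not "XY"
```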

  5. Single Dose of a Dopamine Agonist Impairs Reinforcement Learning in Humans: Behavioral Evidence from a Laboratory-based Measure of Reward Responsiveness

    PubMed Central

    Pizzagalli, Diego A.; Evins, A. Eden; Schetter, Erika Cowman; Frank, Michael J.; Pajtas, Petra E.; Santesso, Diane L.; Culhane, Melissa

    2007-01-01

    Rationale The dopaminergic system, particularly D2-like dopamine receptors, has been strongly implicated in reward processing. Animal studies have emphasized the role of phasic dopamine (DA) signaling in reward-related learning, but these processes remain largely unexplored in humans. Objectives To evaluate the effect of a single, low dose of a D2/D3 agonist—pramipexole—on reinforcement learning in healthy adults. Based on prior evidence indicating that low doses of DA agonists decrease phasic DA release through autoreceptor stimulation, we hypothesized that 0.5 mg of pramipexole would impair reward learning due to presynaptic mechanisms. Methods Using a double-blind design, a single 0.5 mg dose of pramipexole or placebo was administered to 32 healthy volunteers, who performed a probabilistic reward task involving a differential reinforcement schedule as well as various control tasks. Results As hypothesized, response bias toward the more frequently rewarded stimulus was impaired in the pramipexole group, even after adjusting for transient adverse effects. In addition, the pramipexole group showed reaction time and motor speed slowing and increased negative affect; however, when adverse physical side effects were considered, group differences in motor speed and negative affect disappeared. Conclusions These findings show that a single low dose of pramipexole impaired the acquisition of reward-related behavior in healthy participants, and they are consistent with prior evidence suggesting that phasic DA signaling is required to reinforce actions leading to reward. The potential implications of the present findings to psychiatric conditions, including depression and impulse control disorders related to addiction, are discussed. PMID:17909750

  6. Inter-module credit assignment in modular reinforcement learning.

    PubMed

    Samejima, Kazuyuki; Doya, Kenji; Kawato, Mitsuo

    2003-09-01

    Critical issues in modular or hierarchical reinforcement learning (RL) are (i) how to decompose a task into sub-tasks, (ii) how to achieve independence of learning of sub-tasks, and (iii) how to assure optimality of the composite policy for the entire task. The second and third requirements often trade off against each other. We propose a method for propagating the reward for the entire task achievement between modules. This is done in the form of a 'modular reward', which is calculated from the temporal difference of the module gating signal and the value of the succeeding module. We implement modular reward for a multiple model-based reinforcement learning (MMRL) architecture and show its effectiveness in simulations of a pursuit task with hidden states and a continuous-time non-linear control task.

  7. Fast Reinforcement Learning for Energy-Efficient Wireless Communication

    NASA Astrophysics Data System (ADS)

    Mastronarde, Nicholas; van der Schaar, Mihaela

    2011-12-01

    We consider the problem of energy-efficient point-to-point transmission of delay-sensitive data (e.g. multimedia data) over a fading channel. Existing research on this topic utilizes either physical-layer centric solutions, namely power-control and adaptive modulation and coding (AMC), or system-level solutions based on dynamic power management (DPM); however, there is currently no rigorous and unified framework for simultaneously utilizing both physical-layer centric and system-level techniques to achieve the minimum possible energy consumption, under delay constraints, in the presence of stochastic and a priori unknown traffic and channel conditions. In this report, we propose such a framework. We formulate the stochastic optimization problem as a Markov decision process (MDP) and solve it online using reinforcement learning. The advantages of the proposed online method are that (i) it does not require a priori knowledge of the traffic arrival and channel statistics to determine the jointly optimal power-control, AMC, and DPM policies; (ii) it exploits partial information about the system so that less information needs to be learned than when using conventional reinforcement learning algorithms; and (iii) it obviates the need for action exploration, which severely limits the adaptation speed and run-time performance of conventional reinforcement learning algorithms. Our results show that the proposed learning algorithms can converge up to two orders of magnitude faster than a state-of-the-art learning algorithm for physical layer power-control and up to three orders of magnitude faster than conventional reinforcement learning algorithms.

  8. An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning

    DTIC Science & Technology

    2000-10-01

    Learning behaviors in a multiagent environment are crucial for developing and adapting multiagent systems. Reinforcement learning techniques have...presentation of the relevant techniques for solving stochastic games from both the game theory community and reinforcement learning communities. We examine the

  9. Learning strategies in table tennis using inverse reinforcement learning.

    PubMed

    Muelling, Katharina; Boularias, Abdeslam; Mohler, Betty; Schölkopf, Bernhard; Peters, Jan

    2014-10-01

    Learning a complex task such as table tennis is a challenging problem for both robots and humans. Even after acquiring the necessary motor skills, a strategy is needed to choose where and how to return the ball to the opponent's court in order to win the game. The data-driven identification of basic strategies in interactive tasks, such as table tennis, is a largely unexplored problem. In this paper, we suggest a computational model for representing and inferring strategies, based on a Markov decision problem, where the reward function models the goal of the task as well as the strategic information. We show how this reward function can be discovered from demonstrations of table tennis matches using model-free inverse reinforcement learning. The resulting framework allows us to identify basic elements on which the selection of striking movements is based. We tested our approach on data collected from players with different playing styles and under different playing conditions. The estimated reward function was able to capture expert-specific strategic information that sufficed to distinguish the expert among players with different skill levels as well as different playing styles.

  10. Mapping anhedonia onto reinforcement learning: a behavioural meta-analysis

    PubMed Central

    2013-01-01

    Background Depression is characterised partly by blunted reactions to reward. However, tasks probing this deficiency have not distinguished insensitivity to reward from insensitivity to the prediction errors for reward that determine learning and are putatively reported by the phasic activity of dopamine neurons. We attempted to disentangle these factors with respect to anhedonia in the context of stress, Major Depressive Disorder (MDD), Bipolar Disorder (BPD) and a dopaminergic challenge. Methods Six behavioural datasets involving 392 experimental sessions were subjected to a model-based, Bayesian meta-analysis. Participants across all six studies performed a probabilistic reward task that used an asymmetric reinforcement schedule to assess reward learning. Healthy controls were tested under baseline conditions, stress or after receiving the dopamine D2 agonist pramipexole. In addition, participants with current or past MDD or BPD were evaluated. Reinforcement learning models isolated the contributions of variation in reward sensitivity and learning rate. Results MDD and anhedonia reduced reward sensitivity more than they affected the learning rate, while a low dose of the dopamine D2 agonist pramipexole showed the opposite pattern. Stress led to a pattern consistent with a mixed effect on reward sensitivity and learning rate. Conclusion Reward-related learning reflected at least two partially separable contributions. The first related to phasic prediction error signalling, and was preferentially modulated by a low dose of the dopamine agonist pramipexole. The second related directly to reward sensitivity, and was preferentially reduced in MDD and anhedonia. Stress altered both components. Collectively, these findings highlight the contribution of model-based reinforcement learning meta-analysis for dissecting anhedonic behavior. PMID:23782813
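
    The meta-analysis above separates reward sensitivity from learning rate. The sketch below shows one simple way those two parameters can be disentangled in a probabilistic reward task; the softmax choice rule and parameter names are illustrative assumptions rather than the meta-analytic model itself.

```python
# Hedged sketch of an action-value model with separate reward sensitivity (rho)
# and learning rate (epsilon): rho scales how large a received reward "feels",
# epsilon controls how quickly that scaled reward is incorporated.
import math, random

def simulate(rho, epsilon, n_trials=200, p_rich=0.6, p_lean=0.3):
    q = [0.0, 0.0]                                    # rich and lean stimulus values
    choices = []
    for _ in range(n_trials):
        probs = [math.exp(q[a]) for a in (0, 1)]
        probs = [p / sum(probs) for p in probs]       # softmax choice rule
        a = 0 if random.random() < probs[0] else 1
        reward = 1.0 if random.random() < (p_rich if a == 0 else p_lean) else 0.0
        q[a] += epsilon * (rho * reward - q[a])       # sensitivity scales the reward
        choices.append(a)
    return sum(c == 0 for c in choices) / n_trials    # response bias toward the rich option

print(simulate(rho=1.0, epsilon=0.2), simulate(rho=0.2, epsilon=0.2))  # lower rho -> weaker bias
```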

  11. Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning.

    PubMed

    Piot, Bilal; Geist, Matthieu; Pietquin, Olivier

    2016-05-04

    Learning from demonstrations is a paradigm by which an apprentice agent learns a control policy for a dynamic environment by observing demonstrations delivered by an expert agent. It is usually implemented as either imitation learning (IL) or inverse reinforcement learning (IRL) in the literature. On the one hand, IRL is a paradigm relying on the Markov decision processes, where the goal of the apprentice agent is to find a reward function from the expert demonstrations that could explain the expert behavior. On the other hand, IL consists in directly generalizing the expert strategy, observed in the demonstrations, to unvisited states (and it is therefore close to classification, when there is a finite set of possible decisions). While these two visions are often considered as opposite to each other, the purpose of this paper is to exhibit a formal link between these approaches from which new algorithms can be derived. We show that IL and IRL can be redefined in a way that they are equivalent, in the sense that there exists an explicit bijective operator (namely, the inverse optimal Bellman operator) between their respective spaces of solutions. To do so, we introduce the set-policy framework that creates a clear link between the IL and the IRL. As a result, the IL and IRL solutions making the best of both worlds are obtained. In addition, it is a unifying framework from which existing IL and IRL algorithms can be derived and which opens the way for the IL methods able to deal with the environment's dynamics. Finally, the IRL algorithms derived from the set-policy framework are compared with the algorithms belonging to the more common trajectory-matching family. Experiments demonstrate that the set-policy-based algorithms outperform both the standard IRL and IL ones and result in more robust solutions.

  12. Stochastic Scheduling and Planning Using Reinforcement Learning

    DTIC Science & Technology

    2007-11-02

    reinforcement learning (RL) methods to large-scale optimization problems relevant to Air Force operations planning, scheduling, and maintenance. The objectives of this project were to: (1) investigate the utility of RL on large-scale logistics problems; (2) extend existing RL theory and practice to enhance the ease of application and the performance of RL on these problems; and (3) explore new problem formulations in order to take maximal advantage of RL methods. A method using RL to modify local search cost functions was developed and shown to yield significant

  13. Reinforcement active learning in the vibrissae system: optimal object localization.

    PubMed

    Gordon, Goren; Dorfman, Nimrod; Ahissar, Ehud

    2013-01-01

    Rats move their whiskers to acquire information about their environment. It has been observed that they palpate novel objects and objects they are required to localize in space. We analyze whisker-based object localization using two complementary paradigms, namely, active learning and intrinsic-reward reinforcement learning. Active learning algorithms select the next training samples according to the hypothesized solution in order to better discriminate between correct and incorrect labels. Intrinsic-reward reinforcement learning uses prediction errors as the reward to an actor-critic design, such that behavior converges to the one that optimizes the learning process. We show that in the context of object localization, the two paradigms result in palpation whisking as their respective optimal solution. These results suggest that rats may employ principles of active learning and/or intrinsic reward in tactile exploration and can guide future research to seek the underlying neuronal mechanisms that implement them. Furthermore, these paradigms are easily transferable to biomimetic whisker-based artificial sensors and can improve the active exploration of their environment.

  14. Reward and reinforcement activity in the nucleus accumbens during learning

    PubMed Central

    Gale, John T.; Shields, Donald C.; Ishizawa, Yumiko; Eskandar, Emad N.

    2014-01-01

    The nucleus accumbens core (NAcc) has been implicated in learning associations between sensory cues and profitable motor responses. However, the precise mechanisms that underlie these functions remain unclear. We recorded single-neuron activity from the NAcc of primates trained to perform a visual-motor associative learning task. During learning, we found two distinct classes of NAcc neurons. The first class demonstrated progressive increases in firing rates at the go-cue, feedback/tone and reward epochs of the task, as novel associations were learned. This suggests that these neurons may play a role in the exploitation of rewarding behaviors. In contrast, the second class exhibited attenuated firing rates, but only at the reward epoch of the task. These findings suggest that some NAcc neurons play a role in reward-based reinforcement during learning. PMID:24765069

  15. A parallel framework for Bayesian reinforcement learning

    NASA Astrophysics Data System (ADS)

    Barrett, Enda; Duggan, Jim; Howley, Enda

    2014-01-01

    Solving a finite Markov decision process using techniques from dynamic programming such as value or policy iteration requires a complete model of the environmental dynamics. The distribution of rewards, transition probabilities, states and actions all need to be fully observable, discrete and complete. For many problem domains, a complete model containing a full representation of the environmental dynamics may not be readily available. Bayesian reinforcement learning (RL) is a technique devised to make better use of the information observed through learning than simply computing Q-functions. However, this approach can often require extensive experience in order to build up an accurate representation of the true values. To address this issue, this paper proposes a method for parallelising a Bayesian RL technique aimed at reducing the time it takes to approximate the missing model. We demonstrate the technique on learning next state transition probabilities without prior knowledge. The approach is general enough for approximating any probabilistically driven component of the model. The solution involves multiple learning agents learning in parallel on the same task. Agents share probability density estimates amongst each other in an effort to speed up convergence to the true values.
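
    The core idea above, multiple agents sharing probability estimates to speed up model approximation, can be sketched as follows. The merge-by-summing-counts rule and the two-state example are simplifying assumptions for illustration; the paper's sharing scheme may differ.

```python
# Simplified sketch of parallel Bayesian model learning: several agents gather
# next-state transition counts independently and then pool them, speeding up
# convergence of the estimated transition probabilities. Dirichlet-style counts
# and a plain summation merge are assumptions made for illustration.
from collections import Counter
import random

def run_agent(n_steps, true_p=(0.7, 0.3)):
    counts = Counter()
    for _ in range(n_steps):
        s_next = 0 if random.random() < true_p[0] else 1
        counts[s_next] += 1                       # local experience of (s, a) -> s'
    return counts

agents = [run_agent(50) for _ in range(4)]        # four agents learning in parallel

merged = Counter()
for c in agents:
    merged.update(c)                              # share the (count-based) density estimates

total = sum(merged.values())
print({s: merged[s] / total for s in merged})     # pooled estimate of P(s' | s, a)
```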

  16. Connectionist reinforcement learning of robot control skills

    NASA Astrophysics Data System (ADS)

    Araújo, Rui; Nunes, Urbano; de Almeida, A. T.

    1998-07-01

    Many robot manipulator tasks are difficult to model explicitly and it is difficult to design and program automatic control algorithms for them. The development, improvement, and application of learning techniques taking advantage of sensory information would enable the acquisition of new robot skills and avoid some of the difficulties of explicit programming. In this paper we use a reinforcement learning approach for on-line generation of skills for control of robot manipulator systems. Instead of generating skills by explicit programming of a perception-to-action mapping, they are generated by trial-and-error learning, guided by a performance evaluation feedback function. The resulting system may be seen as an anticipatory system that constructs an internal representation model of itself and of its environment. This enables it to identify its current situation and to generate corresponding appropriate commands to the system in order to perform the required skill. The method was applied to the problem of learning a force control skill in which the tool-tip of a robot manipulator must be moved from a free-space situation to a contact state with a compliant surface while maintaining a constant interaction force.

  17. Learning and tuning fuzzy logic controllers through reinforcements

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1992-01-01

    This paper presents a new method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system. In particular, our generalized approximate reasoning-based intelligent control (GARIC) architecture (1) learns and tunes a fuzzy logic controller even when only weak reinforcement, such as a binary failure signal, is available; (2) introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; (3) introduces a new localized mean of maximum (LMOM) method in combining the conclusions of several firing control rules; and (4) learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward neural network, which can then adaptively improve performance by using gradient descent methods. We extend the AHC algorithm of Barto et al. (1983) to include the prior control knowledge of human operators. The GARIC architecture is applied to a cart-pole balancing system and demonstrates significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

  18. Learning and tuning fuzzy logic controllers through reinforcements

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1992-01-01

    A new method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system is presented. In particular, our Generalized Approximate Reasoning-based Intelligent Control (GARIC) architecture: (1) learns and tunes a fuzzy logic controller even when only weak reinforcement, such as a binary failure signal, is available; (2) introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; (3) introduces a new localized mean of maximum (LMOM) method in combining the conclusions of several firing control rules; and (4) learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward network, which can then adaptively improve performance by using gradient descent methods. We extend the AHC algorithm of Barto, Sutton, and Anderson to include the prior control knowledge of human operators. The GARIC architecture is applied to a cart-pole balancing system and has demonstrated significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

  19. Balancing Multiple Sources of Reward in Reinforcement Learning

    DTIC Science & Technology

    2006-01-01

    For many problems which would be natural for reinforcement learning , the reward signal is not a single scalar value but has multiple scalar...problems with applying traditional reinforcement learning . We then present an new algorithm for finding a solution and results on simulated environments.

  20. Reinforcement Learning for the Adaptive Control of Perception and Action

    DTIC Science & Technology

    1992-02-01

    This dissertation applies reinforcement learning to the adaptive control of active sensory-motor systems. Active sensory-motor systems, in addition...distinct states in the external world. This phenomenon, called perceptual aliasing, is shown to destabilize existing reinforcement learning algorithms

  1. Reinforcement of Science Learning through Local Culture: A Delphi Study

    ERIC Educational Resources Information Center

    Nuangchalerm, Prasart

    2008-01-01

    This study aims to explore the ways to reinforce science learning through local culture by using the Delphi technique. Twenty-four participants in various fields of study were selected. The result of the study provides a framework for reinforcement of science learning through local culture on the theme of life and environment. (Contains 1 table.)

  2. Punishment Insensitivity and Impaired Reinforcement Learning in Preschoolers

    ERIC Educational Resources Information Center

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2014-01-01

    Background: Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a…

  3. Experienced Gray Wolf Optimization Through Reinforcement Learning and Neural Networks.

    PubMed

    Emary, E; Zawbaa, Hossam M; Grosan, Crina

    2017-01-10

    In this paper, a variant of gray wolf optimization (GWO) that uses reinforcement learning principles combined with neural networks to enhance the performance is proposed. The aim is to overcome, through reinforcement learning, the common challenge of setting the right parameters for the algorithm. In GWO, a single parameter is used to control the exploration/exploitation rate, which influences the performance of the algorithm. Rather than using a global way to change this parameter for all the agents, we use reinforcement learning to set it on an individual basis. The adaptation of the exploration rate for each agent depends on the agent's own experience and the current terrain of the search space. In order to achieve this, an experience repository is built based on a neural network to map a set of agents' states to a set of corresponding actions that specifically influence the exploration rate. The experience repository is updated by all the search agents to reflect experience and to enhance the future actions continuously. The resulting algorithm is called experienced GWO (EGWO), and its performance is assessed on solving feature selection problems and on finding optimal weights for a neural network. We use a set of performance indicators to evaluate the efficiency of the method. Results over various data sets demonstrate an advantage of EGWO over the original GWO and over other metaheuristics, such as genetic algorithms and particle swarm optimization.
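
    The abstract above replaces a global schedule for GWO's exploration parameter with per-agent adaptation learned from experience. The sketch below is a heavily simplified stand-in: a bandit-style rule adjusts each agent's coefficient instead of the paper's neural-network experience repository, and the objective function is a toy.

```python
# Highly simplified sketch of per-agent exploration control: each agent adjusts
# its own GWO-style coefficient from its recent experience. The adaptation rule,
# bounds, and objective are illustrative stand-ins, not the EGWO algorithm.
import random

class Agent:
    def __init__(self):
        self.a = 2.0                          # exploration/exploitation coefficient
        self.best_fitness = float("inf")

    def adapt(self, fitness):
        # reinforcement-style rule: shrink a (exploit) after improvement,
        # grow it (explore) after stagnation
        if fitness < self.best_fitness:
            self.best_fitness = fitness
            self.a = max(0.1, self.a * 0.9)
        else:
            self.a = min(2.0, self.a * 1.1)

def sphere(x):                                # toy objective to minimize
    return sum(v * v for v in x)

agents = [Agent() for _ in range(5)]
positions = [[random.uniform(-5, 5) for _ in range(3)] for _ in agents]
for _ in range(100):
    for agent, pos in zip(agents, positions):
        step = [random.uniform(-agent.a, agent.a) for _ in pos]   # exploration scaled by a
        candidate = [p + s for p, s in zip(pos, step)]
        agent.adapt(sphere(candidate))
        if sphere(candidate) < sphere(pos):
            pos[:] = candidate
print(sorted(agent.a for agent in agents))    # agents end up with individual exploration rates
```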

  4. Choice as a function of reinforcer "hold": from probability learning to concurrent reinforcement.

    PubMed

    Jensen, Greg; Neuringer, Allen

    2008-10-01

    Two procedures commonly used to study choice are concurrent reinforcement and probability learning. Under concurrent-reinforcement procedures, once a reinforcer is scheduled, it remains available indefinitely until collected. Therefore reinforcement becomes increasingly likely with passage of time or responses on other operanda. Under probability learning, reinforcer probabilities are constant and independent of passage of time or responses. Therefore a particular reinforcer is gained or not, on the basis of a single response, and potential reinforcers are not retained, as when betting at a roulette wheel. In the "real" world, continued availability of reinforcers often lies between these two extremes, with potential reinforcers being lost owing to competition, maturation, decay, and random scatter. The authors parametrically manipulated the likelihood of continued reinforcer availability, defined as hold, and examined the effects on pigeons' choices. Choices varied as power functions of obtained reinforcers under all values of hold. Stochastic models provided generally good descriptions of choice emissions with deviations from stochasticity systematically related to hold. Thus, a single set of principles accounted for choices across hold values that represent a wide range of real-world conditions.
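
    The 'hold' manipulation described above can be simulated directly. In the sketch below (parameter names and values are illustrative), an armed reinforcer persists from trial to trial with probability hold, so hold = 1.0 approximates a concurrent-reinforcement schedule and hold = 0.0 approximates probability learning.

```python
# Sketch of the "hold" continuum: once a reinforcer is armed on an alternative,
# it persists with probability `hold` per trial until collected. hold = 1.0
# keeps it indefinitely (concurrent reinforcement); hold = 0.0 means use it or
# lose it (probability learning). Values are illustrative.
import random

def run_schedule(hold, p_arm=0.2, n_trials=1000, p_choose=0.5):
    armed, collected = False, 0
    for _ in range(n_trials):
        if not armed and random.random() < p_arm:
            armed = True                              # schedule (arm) a reinforcer
        chose_this_side = random.random() < p_choose
        if chose_this_side and armed:
            collected += 1
            armed = False                             # reinforcer is collected
        elif armed and random.random() > hold:
            armed = False                             # uncollected reinforcer is lost
    return collected

print(run_schedule(hold=1.0), run_schedule(hold=0.5), run_schedule(hold=0.0))
```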

  5. Fuzzy Lyapunov Reinforcement Learning for Non Linear Systems.

    PubMed

    Kumar, Abhishek; Sharma, Rajneesh

    2017-03-01

    We propose a fuzzy reinforcement learning (RL) based controller that generates a stable control action by Lyapunov-constraining the fuzzy linguistic rules. In particular, we attempt to Lyapunov-constrain the consequent part of the fuzzy rules in a fuzzy RL setup. Ours is a first attempt at designing a linguistic RL controller with Lyapunov-constrained fuzzy consequents to progressively learn a stable optimal policy. The proposed controller needs neither a system model nor a desired response and can effectively handle disturbances in continuous state-action space problems. The proposed controller has been employed on the benchmark Inverted Pendulum (IP) and Rotational/Translational Proof-Mass Actuator (RTAC) control problems (with and without disturbances). Simulation results and comparisons against (a) baseline fuzzy Q-learning, (b) a Lyapunov theory based actor-critic, and (c) a Lyapunov theory based Markov game controller elucidate the stability and viability of the proposed control scheme.

  6. Use of Inverse Reinforcement Learning for Identity Prediction

    NASA Technical Reports Server (NTRS)

    Hayes, Roy; Bao, Jonathan; Beling, Peter; Horowitz, Barry

    2011-01-01

    We adopt Markov Decision Processes (MDP) to model sequential decision problems, which have the characteristic that the current decision made by a human decision maker has an uncertain impact on future opportunity. We hypothesize that the individuality of decision makers can be modeled as differences in the reward function under a common MDP model. A machine learning technique, Inverse Reinforcement Learning (IRL), was used to learn an individual's reward function based on limited observation of his or her decision choices. This work serves as an initial investigation for using IRL to analyze decision making, conducted through a human experiment in a cyber shopping environment. Specifically, the ability to determine the demographic identity of users is conducted through prediction analysis and supervised learning. The results show that IRL can be used to correctly identify participants, at a rate of 68% for gender and 66% for one of three college major categories.

  7. Safe Exploration Algorithms for Reinforcement Learning Controllers.

    PubMed

    Mannucci, Tommaso; van Kampen, Erik-Jan; de Visser, Cornelis; Chu, Qiping

    2017-02-06

    Self-learning approaches, such as reinforcement learning, offer new possibilities for autonomous control of uncertain or time-varying systems. However, exploring an unknown environment under limited prediction capabilities is a challenge for a learning agent. If the environment is dangerous, free exploration can result in physical damage or in an otherwise unacceptable behavior. With respect to existing methods, the main contribution of this paper is the definition of a new approach that does not require global safety functions, nor specific formulations of the dynamics or of the environment, but relies on interval estimation of the dynamics of the agent during the exploration phase, assuming a limited capability of the agent to perceive the presence of incoming fatal states. Two algorithms are presented with this approach. The first is the Safety Handling Exploration with Risk Perception Algorithm (SHERPA), which provides safety by individuating temporary safety functions, called backups. SHERPA is shown in a simulated, simplified quadrotor task, for which dangerous states are avoided. The second algorithm, denominated OptiSHERPA, can safely handle more dynamically complex systems for which SHERPA is not sufficient through the use of safety metrics. An application of OptiSHERPA is simulated on an aircraft altitude control task.

  8. Optimal Reward Functions in Distributed Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Tumer, Kagan

    2000-01-01

    We consider the design of multi-agent systems so as to optimize an overall world utility function when (1) those systems lack centralized communication and control, and (2) each agent runs a distinct Reinforcement Learning (RL) algorithm. A crucial issue in such design problems is how to initialize/update each agent's private utility function so as to induce the best possible world utility. Traditional 'team game' solutions to this problem sidestep this issue and simply assign to each agent the world utility as its private utility function. In previous work we used the 'Collective Intelligence' framework to derive a better choice of private utility functions, one that results in world utility performance up to orders of magnitude superior to that ensuing from use of the team game utility. In this paper we extend these results. We derive the general class of private utility functions that both are easy for the individual agents to learn and, if learned well, result in high world utility. We demonstrate experimentally that using these new utility functions can result in significantly improved performance over that of our previously proposed utility, over and above that previous utility's superiority to the conventional team game utility.
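
    One standard construction from the Collective Intelligence literature that contrasts with the 'team game' assignment is the difference reward, D_i = G(z) - G(z with agent i removed). The sketch below compares the two on a hypothetical toy congestion world; it is an illustration of the general idea, not the paper's derivation, and the world utility function and all parameters are invented for the example.

    ```python
    import numpy as np

    def world_utility(counts, cap=5.0):
        # Toy congestion-style world utility: a resource contributes more when
        # lightly used and less when crowded (hypothetical function).
        return float(np.sum(counts * np.exp(-counts / cap)))

    def run(reward="difference", n_agents=30, n_resources=5,
            episodes=2000, eps=0.1, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        q = np.zeros((n_agents, n_resources))      # per-agent action values
        for _ in range(episodes):
            greedy = q.argmax(axis=1)
            explore = rng.random(n_agents) < eps
            acts = np.where(explore,
                            rng.integers(n_resources, size=n_agents), greedy)
            counts = np.bincount(acts, minlength=n_resources)
            g = world_utility(counts)
            for i in range(n_agents):
                if reward == "team":
                    r = g                              # every agent receives world utility
                else:
                    counts_wo_i = counts.copy()
                    counts_wo_i[acts[i]] -= 1          # remove agent i's contribution
                    r = g - world_utility(counts_wo_i)  # difference reward
                q[i, acts[i]] += lr * (r - q[i, acts[i]])
        final = np.bincount(q.argmax(axis=1), minlength=n_resources)
        return world_utility(final)

    if __name__ == "__main__":
        print("team game :", run(reward="team"))
        print("difference:", run(reward="difference"))
    ```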

  9. Reinforcement learning in complementarity game and population dynamics.

    PubMed

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005)] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This gives insight into aspects like the issue of quick adaptation as opposed to systematic exploration or the role of learning rates.
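
    For readers unfamiliar with the rules being compared, here is a minimal sketch of Roth-Erev propensity updating alongside SoftMax selection, for a single agent facing fixed stochastic payoffs rather than the two-population game studied in the paper. Raising the propensities to a power when forming choice probabilities is one simple way to introduce the exponent discussed in the abstract; the payoff vector and all parameters are illustrative.

    ```python
    import numpy as np

    def roth_erev_choice(propensities, power=1.0, rng=None):
        # Choice probabilities proportional to propensities raised to a power
        # (power = 1 is the standard rule; larger values sharpen choices).
        rng = rng or np.random.default_rng()
        w = np.maximum(propensities, 1e-12) ** power
        p = w / w.sum()
        return int(rng.choice(len(p), p=p))

    def softmax_choice(values, temperature=0.5, rng=None):
        rng = rng or np.random.default_rng()
        z = np.asarray(values, dtype=float) / temperature
        p = np.exp(z - z.max())
        p /= p.sum()
        return int(rng.choice(len(p), p=p))

    def simulate(power=1.5, trials=500, seed=0):
        rng = np.random.default_rng(seed)
        payoff = np.array([0.3, 0.7])             # hypothetical reward probabilities
        prop = np.ones(2)                          # initial propensities
        for _ in range(trials):
            a = roth_erev_choice(prop, power, rng)
            reward = float(rng.random() < payoff[a])
            prop[a] += reward                      # Roth-Erev: add the realized payoff
        return prop / prop.sum()

    if __name__ == "__main__":
        print("final propensity shares:", simulate())
    ```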

  10. The role of GABAB receptors in human reinforcement learning.

    PubMed

    Ort, Andres; Kometer, Michael; Rohde, Judith; Seifritz, Erich; Vollenweider, Franz X

    2014-10-01

    Behavioral evidence from human studies suggests that the γ-aminobutyric acid type B receptor (GABAB receptor) agonist baclofen modulates reinforcement learning and reduces craving in patients with addiction spectrum disorders. However, in contrast to the well-established role of dopamine in reinforcement learning, the mechanisms by which the GABAB receptor influences reinforcement learning in humans remain completely unknown. To further elucidate this issue, a cross-over, double-blind, placebo-controlled study was performed in healthy human subjects (N=15) to test the effects of baclofen (20 and 50 mg p.o.) on probabilistic reinforcement learning. Outcomes were the feedback-induced P2 component of the event-related potential, the feedback-related negativity, and the P300 component of the event-related potential. Baclofen produced a reduction of P2 amplitude over the course of the experiment, but did not modulate the feedback-related negativity. Furthermore, there was a trend towards increased learning after baclofen administration relative to placebo over the course of the experiment. The present results extend previous theories of reinforcement learning, which focus on the importance of mesolimbic dopamine signaling, and indicate that stimulation of cortical GABAB receptors in a fronto-parietal network leads to better attentional allocation in reinforcement learning. This observation is a first step in our understanding of how baclofen may improve reinforcement learning in healthy subjects. Further studies with larger sample sizes are needed to corroborate this conclusion and, furthermore, to test this effect in patients with addiction spectrum disorders.

  11. Using Fuzzy Logic for Performance Evaluation in Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.

    1992-01-01

    Current reinforcement learning algorithms require long training periods which generally limit their applicability to small size problems. A new architecture is described which uses fuzzy rules to initialize its two neural networks: a neural network for performance evaluation and another for action selection. This architecture is applied to control of dynamic systems and it is demonstrated that it is possible to start with an approximate prior knowledge and learn to refine it through experiments using reinforcement learning.

  12. Context transfer in reinforcement learning using action-value functions.

    PubMed

    Mousavi, Amin; Nadjar Araabi, Babak; Nili Ahmadabadi, Majid

    2014-01-01

    This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different states or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents' MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task.

  13. Context Transfer in Reinforcement Learning Using Action-Value Functions

    PubMed Central

    Mousavi, Amin; Nadjar Araabi, Babak; Nili Ahmadabadi, Majid

    2014-01-01

    This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different states or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents' MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task. PMID:25610457
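
    A compact sketch of the transfer mechanism described in these two records: Q-values learned in a source task are pushed through a partial state-action mapping and used to initialize the target task's Q-table before ordinary Q-learning resumes. The mappings, the example tasks, and the merge rule (simple averaging, in place of the paper's interval-based representation) are illustrative assumptions.

    ```python
    from collections import defaultdict

    def transfer_q(source_q, state_map, action_map, default=0.0):
        """Map source Q-values into a target Q-table through partial mappings.

        source_q   : dict[(s_src, a_src)] -> value
        state_map  : dict[s_src] -> s_tgt  (partial; unmapped entries are skipped)
        action_map : dict[a_src] -> a_tgt
        """
        merged = defaultdict(list)
        for (s, a), v in source_q.items():
            if s in state_map and a in action_map:
                merged[(state_map[s], action_map[a])].append(v)
        # Merge transferred values (here: simple average) into the initial table.
        target_q = defaultdict(lambda: default)
        for key, vals in merged.items():
            target_q[key] = sum(vals) / len(vals)
        return target_q

    if __name__ == "__main__":
        source_q = {("s0", "left"): 0.8, ("s0", "right"): 0.2, ("s1", "left"): 0.5}
        state_map = {"s0": "t0", "s1": "t1"}          # hypothetical correspondence
        action_map = {"left": "port", "right": "starboard"}
        q0 = transfer_q(source_q, state_map, action_map)
        print(dict(q0))   # initialization for target-task Q-learning
    ```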

  14. Reinforcement learning in multidimensional environments relies on attention mechanisms.

    PubMed

    Niv, Yael; Daniel, Reka; Geana, Andra; Gershman, Samuel J; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C

    2015-05-27

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this "representation learning" process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the "curse of dimensionality" in reinforcement learning.

  15. Changes in corticostriatal connectivity during reinforcement learning in humans.

    PubMed

    Horga, Guillermo; Maia, Tiago V; Marsh, Rachel; Hao, Xuejun; Xu, Dongrong; Duan, Yunsuo; Tau, Gregory Z; Graniello, Barbara; Wang, Zhishun; Kangarlu, Alayar; Martinez, Diana; Packard, Mark G; Peterson, Bradley S

    2015-02-01

    Many computational models assume that reinforcement learning relies on changes in synaptic efficacy between cortical regions representing stimuli and striatal regions involved in response selection, but this assumption has thus far lacked empirical support in humans. We recorded hemodynamic signals with fMRI while participants navigated a virtual maze to find hidden rewards. We fitted a reinforcement-learning algorithm to participants' choice behavior and evaluated the neural activity and the changes in functional connectivity related to trial-by-trial learning variables. Activity in the posterior putamen during choice periods increased progressively during learning. Furthermore, the functional connections between the sensorimotor cortex and the posterior putamen strengthened progressively as participants learned the task. These changes in corticostriatal connectivity differentiated participants who learned the task from those who did not. These findings provide a direct link between changes in corticostriatal connectivity and learning, thereby supporting a central assumption common to several computational models of reinforcement learning.

  16. Changes in corticostriatal connectivity during reinforcement learning in humans

    PubMed Central

    Horga, Guillermo; Maia, Tiago V.; Marsh, Rachel; Hao, Xuejun; Xu, Dongrong; Duan, Yunsuo; Tau, Gregory Z.; Graniello, Barbara; Wang, Zhishun; Kangarlu, Alayar; Martinez, Diana; Packard, Mark G.; Peterson, Bradley S.

    2015-01-01

    Many computational models assume that reinforcement learning relies on changes in synaptic efficacy between cortical regions representing stimuli and striatal regions involved in response selection, but this assumption has thus far lacked empirical support in humans. We recorded hemodynamic signals with fMRI while participants navigated a virtual maze to find hidden rewards. We fitted a reinforcement-learning algorithm to participants’ choice behavior and evaluated the neural activity and the changes in functional connectivity related to trial-by-trial learning variables. Activity in the posterior putamen during choice periods increased progressively during learning. Furthermore, the functional connections between the sensorimotor cortex and the posterior putamen strengthened progressively as participants learned the task. These changes in corticostriatal connectivity differentiated participants who learned the task from those who did not. These findings provide a direct link between changes in corticostriatal connectivity and learning, thereby supporting a central assumption common to several computational models of reinforcement learning. PMID:25393839

  17. Prespeech motor learning in a neural network using reinforcement.

    PubMed

    Warlaumont, Anne S; Westermann, Gert; Buder, Eugene H; Oller, D Kimbrough

    2013-02-01

    Vocal motor development in infancy provides a crucial foundation for language development. Some significant early accomplishments include learning to control the process of phonation (the production of sound at the larynx) and learning to produce the sounds of one's language. Previous work has shown that social reinforcement shapes the kinds of vocalizations infants produce. We present a neural network model that provides an account of how vocal learning may be guided by reinforcement. The model consists of a self-organizing map that outputs to muscles of a realistic vocalization synthesizer. Vocalizations are spontaneously produced by the network. If a vocalization meets certain acoustic criteria, it is reinforced, and the weights are updated to make similar muscle activations increasingly likely to recur. We ran simulations of the model under various reinforcement criteria and tested the types of vocalizations it produced after learning in the different conditions. When reinforcement was contingent on the production of phonated (i.e. voiced) sounds, the network's post-learning productions were almost always phonated, whereas when reinforcement was not contingent on phonation, the network's post-learning productions were almost always not phonated. When reinforcement was contingent on both phonation and proximity to English vowels as opposed to Korean vowels, the model's post-learning productions were more likely to resemble the English vowels and vice versa.
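
    The model couples a self-organizing map to a vocalization synthesizer and strengthens the winning motor pattern only when the resulting sound meets a criterion. The sketch below keeps that reinforcement-gated update but replaces the synthesizer and acoustic criteria with a stand-in scoring function, so it illustrates the learning rule rather than the authors' full model; all sizes and thresholds are invented for the example.

    ```python
    import numpy as np

    def acoustic_score(muscle_pattern):
        # Stand-in for synthesis plus evaluation: reward patterns whose mean
        # activation exceeds a hypothetical "phonation" threshold.
        return float(np.mean(muscle_pattern) > 0.6)

    def train(n_units=25, n_muscles=8, episodes=3000, lr=0.1, noise=0.3, seed=0):
        rng = np.random.default_rng(seed)
        weights = rng.random((n_units, n_muscles))    # map units -> muscle activations
        for _ in range(episodes):
            unit = rng.integers(n_units)              # spontaneous activation of a map unit
            pattern = np.clip(weights[unit] + noise * rng.standard_normal(n_muscles), 0, 1)
            if acoustic_score(pattern) > 0:
                # Reinforced: pull the winning unit's weights toward the motor
                # pattern that produced the rewarded vocalization.
                weights[unit] += lr * (pattern - weights[unit])
        return weights

    if __name__ == "__main__":
        w = train()
        print("mean learned activation:", round(float(w.mean()), 3))
    ```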

  18. Coevolutionary networks of reinforcement-learning agents

    NASA Astrophysics Data System (ADS)

    Kianercy, Ardeshir; Galstyan, Aram

    2013-07-01

    This paper presents a model of network formation in repeated games where the players adapt their strategies and network ties simultaneously using a simple reinforcement-learning scheme. It is demonstrated that the coevolutionary dynamics of such systems can be described via coupled replicator equations. We provide a comprehensive analysis for three-player two-action games, which is the minimum system size with nontrivial structural dynamics. In particular, we characterize the Nash equilibria (NE) in such games and examine the local stability of the rest points corresponding to those equilibria. We also study general n-player networks via both simulations and analytical methods and find that, in the absence of exploration, the stable equilibria consist of star motifs as the main building blocks of the network. Furthermore, in all stable equilibria the agents play pure strategies, even when the game allows mixed NE. Finally, we study the impact of exploration on learning outcomes and observe that there is a critical exploration rate above which the symmetric and uniformly connected network topology becomes stable.

  19. Coevolutionary networks of reinforcement-learning agents.

    PubMed

    Kianercy, Ardeshir; Galstyan, Aram

    2013-07-01

    This paper presents a model of network formation in repeated games where the players adapt their strategies and network ties simultaneously using a simple reinforcement-learning scheme. It is demonstrated that the coevolutionary dynamics of such systems can be described via coupled replicator equations. We provide a comprehensive analysis for three-player two-action games, which is the minimum system size with nontrivial structural dynamics. In particular, we characterize the Nash equilibria (NE) in such games and examine the local stability of the rest points corresponding to those equilibria. We also study general n-player networks via both simulations and analytical methods and find that, in the absence of exploration, the stable equilibria consist of star motifs as the main building blocks of the network. Furthermore, in all stable equilibria the agents play pure strategies, even when the game allows mixed NE. Finally, we study the impact of exploration on learning outcomes and observe that there is a critical exploration rate above which the symmetric and uniformly connected network topology becomes stable.

  20. Developing PFC representations using reinforcement learning.

    PubMed

    Reynolds, Jeremy R; O'Reilly, Randall C

    2009-12-01

    From both functional and biological considerations, it is widely believed that action production, planning, and goal-oriented behaviors supported by the frontal cortex are organized hierarchically [Fuster (1991); Koechlin, E., Ody, C., & Kouneiher, F. (2003). The architecture of cognitive control in the human prefrontal cortex. Science, 302, 1181-1185; Miller, G. A., Galanter, E., & Pribram, K. H. (1960). Plans and the structure of behavior. New York: Holt]. However, the nature of the different levels of the hierarchy remains unclear, and little attention has been paid to the origins of such a hierarchy. We address these issues through biologically-inspired computational models that develop representations through reinforcement learning. We explore several different factors in these models that might plausibly give rise to a hierarchical organization of representations within the PFC, including an initial connectivity hierarchy within PFC, a hierarchical set of connections between PFC and subcortical structures controlling it, and differential synaptic plasticity schedules. Simulation results indicate that architectural constraints contribute to the segregation of different types of representations, and that this segregation facilitates learning. These findings are consistent with the idea that there is a functional hierarchy in PFC, as captured in our earlier computational models of PFC function and a growing body of empirical data.

  1. Reinforcement learning of periodical gaits in locomotion robots

    NASA Astrophysics Data System (ADS)

    Svinin, Mikhail; Yamada, Kazuyaki; Ushio, S.; Ueda, Kanji

    1999-08-01

    Emergence of stable gaits in locomotion robots is studied in this paper. A classifier system, implementing an instance-based reinforcement learning scheme, is used for sensory-motor control of an eight-legged mobile robot. An important feature of the classifier system is its ability to work with a continuous sensor space. The robot has no prior knowledge of the environment, of its own internal model, or of the goal coordinates. It is only assumed that the robot can acquire stable gaits by learning how to reach a light source. During the learning process, the control system is self-organized by reinforcement signals. Reaching the light source yields a global reward. Forward motion gets a local reward, while stepping back and falling down get a local punishment. Feasibility of the proposed self-organized system is tested in simulation and experiment. The control actions are specified at the leg level. It is shown that, as learning progresses, the number of action rules in the classifier system stabilizes to a certain level, corresponding to the acquired gait patterns.

  2. Preliminary Work for Examining the Scalability of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Clouse, Jeff

    1998-01-01

    Researchers began studying automated agents that learn to perform multiple-step tasks early in the history of artificial intelligence (Samuel, 1963; Samuel, 1967; Waterman, 1970; Fikes, Hart & Nilsson, 1972). Multiple-step tasks are tasks that can only be solved via a sequence of decisions, such as control problems, robotics problems, classic problem-solving, and game-playing. The objective of agents attempting to learn such tasks is to use the resources they have available in order to become more proficient at the tasks. In particular, each agent attempts to develop a good policy, a mapping from states to actions, that allows it to select actions that optimize a measure of its performance on the task; for example, reducing the number of steps necessary to complete the task successfully. Our study focuses on reinforcement learning, a set of learning techniques where the learner performs trial-and-error experiments in the task and adapts its policy based on the outcome of those experiments. Much of the work in reinforcement learning has focused on a particular, simple representation, where every problem state is represented explicitly in a table, and associated with each state are the actions that can be chosen in that state. A major advantage of this table-lookup representation is that one can prove that certain reinforcement learning techniques will develop an optimal policy for the current task. The drawback is that the representation limits the application of reinforcement learning to multiple-step tasks with relatively small state spaces. There has been some theoretical work proving that convergence to optimal solutions can be obtained when using generalization structures, but the structures considered are quite simple. The theory says little about complex structures, such as multi-layer, feedforward artificial neural networks (Rumelhart & McClelland, 1986), but empirical results indicate that the use of reinforcement learning with such structures is promising.
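
    The table-lookup representation discussed here is easy to make concrete: a dictionary keyed by (state, action) and updated with the standard Q-learning rule. The toy chain environment below is an illustration and is not drawn from the report.

    ```python
    import random
    from collections import defaultdict

    def chain_step(state, action, n_states=6):
        """Toy chain: action 1 moves right, action 0 moves left; reward at the right end."""
        nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
        reward = 1.0 if nxt == n_states - 1 else 0.0
        return nxt, reward, nxt == n_states - 1

    def greedy(q, s):
        # Random tie-breaking so the untrained agent behaves like a random walk.
        if q[(s, 0)] == q[(s, 1)]:
            return random.randrange(2)
        return 0 if q[(s, 0)] > q[(s, 1)] else 1

    def q_learning(episodes=300, alpha=0.1, gamma=0.95, eps=0.1, n_states=6):
        q = defaultdict(float)                    # explicit table: (state, action) -> value
        for _ in range(episodes):
            s, done = 0, False
            while not done:
                a = random.randrange(2) if random.random() < eps else greedy(q, s)
                s2, r, done = chain_step(s, a, n_states)
                target = r + (0.0 if done else gamma * max(q[(s2, 0)], q[(s2, 1)]))
                q[(s, a)] += alpha * (target - q[(s, a)])
                s = s2
        return q

    if __name__ == "__main__":
        q = q_learning()
        print([round(q[(s, 1)], 2) for s in range(6)])   # learned values of moving right
    ```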

  3. Separation of time-based and trial-based accounts of the partial reinforcement extinction effect.

    PubMed

    Bouton, Mark E; Woods, Amanda M; Todd, Travis P

    2014-01-01

    Two appetitive conditioning experiments with rats examined time-based and trial-based accounts of the partial reinforcement extinction effect (PREE). In the PREE, the loss of responding that occurs in extinction is slower when the conditioned stimulus (CS) has been paired with a reinforcer on some of its presentations (partially reinforced) instead of every presentation (continuously reinforced). According to a time-based or "time-accumulation" view (e.g., Gallistel and Gibbon, 2000), the PREE occurs because the organism has learned in partial reinforcement to expect the reinforcer after a larger amount of time has accumulated in the CS over trials. In contrast, according to a trial-based view (e.g., Capaldi, 1967), the PREE occurs because the organism has learned in partial reinforcement to expect the reinforcer after a larger number of CS presentations. Experiment 1 used a procedure that equated partially and continuously reinforced groups on their expected times to reinforcement during conditioning. A PREE was still observed. Experiment 2 then used an extinction procedure that allowed time in the CS and the number of trials to accumulate differentially through extinction. The PREE was still evident when responding was examined as a function of expected time units to the reinforcer, but was eliminated when responding was examined as a function of expected trial units to the reinforcer. There was no evidence that the animal responded according to the ratio of time accumulated during the CS in extinction over the time in the CS expected before the reinforcer. The results thus favor a trial-based account over a time-based account of extinction and the PREE. This article is part of a Special Issue entitled: Associative and Temporal Learning.

  4. Covert Operant Reinforcement of Remedial Reading Learning Tasks.

    ERIC Educational Resources Information Center

    Schmickley, Verne G.

    The effects of covert operant reinforcement upon remedial reading learning tasks were investigated. Forty junior high school students were taught to imagine either neutral scenes (control) or positive scenes (treatment) upon cue while reading. It was hypothesized that positive covert reinforcement would enhance performance on several measures of…

  5. On the integration of reinforcement learning and approximate reasoning for control

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1991-01-01

    The author discusses the importance of strengthening the knowledge representation characteristic of reinforcement learning techniques using methods such as approximate reasoning. The ARIC (approximate reasoning-based intelligent control) architecture is an example of such a hybrid approach in which the fuzzy control rules are modified (fine-tuned) using reinforcement learning. ARIC also demonstrates that it is possible to start with an approximately correct control knowledge base and learn to refine this knowledge through further experience. On the other hand, techniques such as the TD (temporal difference) algorithm and Q-learning establish stronger theoretical foundations for their use in adaptive control and also in stability analysis of hybrid reinforcement learning and approximate reasoning-based controllers.
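
    The TD algorithm mentioned at the end has a compact core: the value estimate of the current state is nudged toward the observed reward plus the discounted estimate of the next state. A minimal TD(0) prediction sketch on a hypothetical random-walk task is given below; it illustrates the update rule only, not the ARIC architecture.

    ```python
    import random

    def td0_random_walk(episodes=1000, alpha=0.1, gamma=1.0, n_states=5):
        # States 1..n_states; terminating left (reward 0) or right (reward 1).
        V = [0.5] * (n_states + 2)               # value estimates, including terminals
        V[0], V[-1] = 0.0, 0.0
        for _ in range(episodes):
            s = (n_states + 1) // 2              # start in the middle
            while 0 < s < n_states + 1:
                s2 = s + random.choice((-1, 1))
                r = 1.0 if s2 == n_states + 1 else 0.0
                # TD(0) update: move V(s) toward r + gamma * V(s')
                V[s] += alpha * (r + gamma * V[s2] - V[s])
                s = s2
        return V[1:-1]

    if __name__ == "__main__":
        print([round(v, 2) for v in td0_random_walk()])  # approaches 1/6 .. 5/6
    ```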

  6. Neural basis of reinforcement learning and decision making.

    PubMed

    Lee, Daeyeol; Seo, Hyojung; Jung, Min Whan

    2012-01-01

    Reinforcement learning is an adaptive process in which an animal utilizes its previous experience to improve the outcomes of future choices. Computational theories of reinforcement learning play a central role in the newly emerging areas of neuroeconomics and decision neuroscience. In this framework, actions are chosen according to their value functions, which describe how much future reward is expected from each action. Value functions can be adjusted not only through reward and penalty, but also by the animal's knowledge of its current environment. Studies have revealed that a large proportion of the brain is involved in representing and updating value functions and using them to choose an action. However, how the nature of a behavioral task affects the neural mechanisms of reinforcement learning remains incompletely understood. Future studies should uncover the principles by which different computational elements of reinforcement learning are dynamically coordinated across the entire brain.

  7. Optimal control in microgrid using multi-agent reinforcement learning.

    PubMed

    Li, Fu-Dong; Wu, Min; He, Yong; Chen, Xin

    2012-11-01

    This paper presents an improved reinforcement learning method to minimize electricity costs while satisfying the power balance and generation limits of the units in a grid-connected microgrid. First, the microgrid control requirements are analyzed and the objective function of optimal microgrid control is proposed. Then, a state variable, "Average Electricity Price Trend," which captures the most likely transitions of the system, is introduced to reduce the complexity and randomness of the microgrid, and a multi-agent architecture including agents, state variables, action variables, and a reward function is formulated. Furthermore, dynamic hierarchical reinforcement learning, based on the rate of change of the key state variable, is established to carry out optimal policy exploration. The analysis shows that the proposed method helps to handle the "curse of dimensionality" and speeds up learning in an unknown large-scale world. Finally, simulation results under JADE (Java Agent Development Framework) demonstrate the validity of the presented method for optimal control of a grid-connected microgrid.

  8. Reinforcement learning of motor skills with policy gradients.

    PubMed

    Peters, Jan; Schaal, Stefan

    2008-05-01

    Autonomous learning is one of the hallmarks of human and animal behavior, and understanding the principles of learning will be crucial in order to achieve true autonomy in advanced machines like humanoid robots. In this paper, we examine learning of complex motor skills with human-like limbs. While supervised learning can offer useful tools for bootstrapping behavior, e.g., by learning from demonstration, it is only reinforcement learning that offers a general approach to the final trial-and-error improvement that is needed by each individual acquiring a skill. Neither neurobiological nor machine learning studies have, so far, offered compelling results on how reinforcement learning can be scaled to the high-dimensional continuous state and action spaces of humans or humanoids. Here, we combine two recent research developments on learning motor control in order to achieve this scaling. First, we interpret the idea of modular motor control by means of motor primitives as a suitable way to generate parameterized control policies for reinforcement learning. Second, we combine motor primitives with the theory of stochastic policy gradient learning, which currently seems to be the only feasible framework for reinforcement learning for humanoids. We evaluate different policy gradient methods with a focus on their applicability to parameterized motor primitives. We compare these algorithms in the context of motor primitive learning, and show that our most modern algorithm, the Episodic Natural Actor-Critic outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method in the application of learning to hit a baseball with an anthropomorphic robot arm.
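
    Policy-gradient learning over parameterized motor primitives can be sketched with the simplest member of the family, episodic REINFORCE with a Gaussian policy over the primitive's parameters. This is not the Episodic Natural Actor-Critic itself, and the quadratic "reach a target configuration" return below is a stand-in for the motor task.

    ```python
    import numpy as np

    def episode_return(params, target):
        # Hypothetical motor task: return is higher the closer the primitive's
        # parameters are to an (unknown to the learner) target configuration.
        return -np.sum((params - target) ** 2)

    def reinforce(dim=5, iters=2000, sigma=0.3, lr=0.02, seed=0):
        rng = np.random.default_rng(seed)
        target = rng.uniform(-1, 1, dim)          # unknown optimal primitive parameters
        mean = np.zeros(dim)                      # policy mean over primitive parameters
        baseline = 0.0
        for _ in range(iters):
            theta = mean + sigma * rng.standard_normal(dim)   # sample one rollout
            R = episode_return(theta, target)
            # REINFORCE with a moving-average baseline:
            # grad of log N(theta; mean, sigma^2) w.r.t. mean = (theta - mean) / sigma^2
            mean += lr * (R - baseline) * (theta - mean) / sigma ** 2
            baseline += 0.05 * (R - baseline)
        return mean, target

    if __name__ == "__main__":
        mean, target = reinforce()
        print("final parameter error:", np.linalg.norm(mean - target))
    ```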

  9. Reinforcement Learning in a Nonstationary Environment: The El Farol Problem

    NASA Technical Reports Server (NTRS)

    Bell, Ann Maria

    1999-01-01

    This paper examines the performance of simple learning rules in a complex adaptive system based on a coordination problem modeled on the El Farol problem. The key features of the El Farol problem are that it typically involves a medium number of agents and that agents' pay-off functions have a discontinuous response to increased congestion. First we consider a single adaptive agent facing a stationary environment. We demonstrate that the simple learning rules proposed by Roth and Erev can be extremely sensitive to small changes in the initial conditions and that events early in a simulation can affect the performance of the rule over a relatively long time horizon. In contrast, a reinforcement learning rule based on standard practice in the computer science (CS) literature converges rapidly and robustly. The situation is reversed when multiple adaptive agents interact: the Roth-Erev algorithms often converge rapidly to a stable average aggregate attendance despite the slow and erratic behavior of individual learners, while the CS-based learners frequently over-attend in the early and intermediate terms. The symmetric mixed-strategy equilibrium is unstable: all three learning rules ultimately tend towards pure strategies or stabilize in the medium term at non-equilibrium probabilities of attendance. The brittleness of the algorithms in different contexts emphasizes the importance of thorough and thoughtful examination of simulation-based results.

  10. Generalization of value in reinforcement learning by humans

    PubMed Central

    Wimmer, G. Elliott; Daw, Nathaniel D.; Shohamy, Daphna

    2012-01-01

    Research in decision making has focused on the role of dopamine and its striatal targets in guiding choices via learned stimulus-reward or stimulus-response associations, behavior that is well-described by reinforcement learning (RL) theories. However, basic RL is relatively limited in scope and does not explain how learning about stimulus regularities or relations may guide decision making. A candidate mechanism for this type of learning comes from the domain of memory, which has highlighted a role for the hippocampus in learning of stimulus-stimulus relations, typically dissociated from the role of the striatum in stimulus-response learning. Here, we used fMRI and computational model-based analyses to examine the joint contributions of these mechanisms to RL. Humans performed an RL task with added relational structure, modeled after tasks used to isolate hippocampal contributions to memory. On each trial participants chose one of four options, but the reward probabilities for pairs of options were correlated across trials. This (uninstructed) relationship between pairs of options potentially enabled an observer to learn about options’ values based on experience with the other options and to generalize across them. We observed BOLD activity related to learning in the striatum and also in the hippocampus. By comparing a basic RL model to one augmented to allow feedback to generalize between correlated options, we tested whether choice behavior and BOLD activity were influenced by the opportunity to generalize across correlated options. Although such generalization goes beyond standard computational accounts of RL and striatal BOLD, both choices and striatal BOLD were better explained by the augmented model. Consistent with the hypothesized role for the hippocampus in this generalization, functional connectivity between the ventral striatum and hippocampus was modulated, across participants, by the ability of the augmented model to capture participants’ choice
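
    The augmented model contrasted here with basic RL can be sketched as a delta-rule learner that, on each trial, also updates the value of the option paired with the chosen one, scaled by a coupling parameter. The pairing structure, reward probabilities, and parameter values below are illustrative assumptions, not the study's fitted model.

    ```python
    import numpy as np

    def augmented_rl(trials=400, alpha=0.2, coupling=0.5, seed=0):
        rng = np.random.default_rng(seed)
        p_reward = np.array([0.8, 0.8, 0.3, 0.3])   # hypothetical: options 0&1 and 2&3 paired
        pair = {0: 1, 1: 0, 2: 3, 3: 2}
        V = np.full(4, 0.5)
        for _ in range(trials):
            w = np.exp(3.0 * V)                      # softmax choice over value estimates
            a = int(rng.choice(4, p=w / w.sum()))
            r = float(rng.random() < p_reward[a])
            V[a] += alpha * (r - V[a])               # standard delta-rule update
            # Augmentation: the same feedback also updates the paired option,
            # with a reduced learning rate (coupling = 0 recovers basic RL).
            V[pair[a]] += coupling * alpha * (r - V[pair[a]])
        return V

    if __name__ == "__main__":
        print(augmented_rl().round(2))
    ```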

  11. Reinforcement learning and dopamine in schizophrenia: dimensions of symptoms or specific features of a disease group?

    PubMed

    Deserno, Lorenz; Boehme, Rebecca; Heinz, Andreas; Schlagenhauf, Florian

    2013-12-23

    Abnormalities in reinforcement learning are a key finding in schizophrenia and have been proposed to be linked to elevated levels of dopamine neurotransmission. Behavioral deficits in reinforcement learning and their neural correlates may contribute to the formation of clinical characteristics of schizophrenia. The ability to form predictions about future outcomes is fundamental for environmental interactions and depends on neuronal teaching signals, like reward prediction errors. While aberrant prediction errors, that encode non-salient events as surprising, have been proposed to contribute to the formation of positive symptoms, a failure to build neural representations of decision values may result in negative symptoms. Here, we review behavioral and neuroimaging research in schizophrenia and focus on studies that implemented reinforcement learning models. In addition, we discuss studies that combined reinforcement learning with measures of dopamine. Thereby, we suggest how reinforcement learning abnormalities in schizophrenia may contribute to the formation of psychotic symptoms and may interact with cognitive deficits. These ideas point toward an interplay of more rigid versus flexible control over reinforcement learning. Pronounced deficits in the flexible or model-based domain may allow for a detailed characterization of well-established cognitive deficits in schizophrenia patients based on computational models of learning. Finally, we propose a framework based on the potentially crucial contribution of dopamine to dysfunctional reinforcement learning on the level of neural networks. Future research may strongly benefit from computational modeling but also requires further methodological improvement for clinical group studies. These research tools may help to improve our understanding of disease-specific mechanisms and may help to identify clinically relevant subgroups of the heterogeneous entity schizophrenia.

  12. Reinforcement Learning and Dopamine in Schizophrenia: Dimensions of Symptoms or Specific Features of a Disease Group?

    PubMed Central

    Deserno, Lorenz; Boehme, Rebecca; Heinz, Andreas; Schlagenhauf, Florian

    2013-01-01

    Abnormalities in reinforcement learning are a key finding in schizophrenia and have been proposed to be linked to elevated levels of dopamine neurotransmission. Behavioral deficits in reinforcement learning and their neural correlates may contribute to the formation of clinical characteristics of schizophrenia. The ability to form predictions about future outcomes is fundamental for environmental interactions and depends on neuronal teaching signals, like reward prediction errors. While aberrant prediction errors, that encode non-salient events as surprising, have been proposed to contribute to the formation of positive symptoms, a failure to build neural representations of decision values may result in negative symptoms. Here, we review behavioral and neuroimaging research in schizophrenia and focus on studies that implemented reinforcement learning models. In addition, we discuss studies that combined reinforcement learning with measures of dopamine. Thereby, we suggest how reinforcement learning abnormalities in schizophrenia may contribute to the formation of psychotic symptoms and may interact with cognitive deficits. These ideas point toward an interplay of more rigid versus flexible control over reinforcement learning. Pronounced deficits in the flexible or model-based domain may allow for a detailed characterization of well-established cognitive deficits in schizophrenia patients based on computational models of learning. Finally, we propose a framework based on the potentially crucial contribution of dopamine to dysfunctional reinforcement learning on the level of neural networks. Future research may strongly benefit from computational modeling but also requires further methodological improvement for clinical group studies. These research tools may help to improve our understanding of disease-specific mechanisms and may help to identify clinically relevant subgroups of the heterogeneous entity schizophrenia. PMID:24391603

  13. Microstimulation of the Human Substantia Nigra Alters Reinforcement Learning

    PubMed Central

    Ramayya, Ashwin G.; Misra, Amrit

    2014-01-01

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action–reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action–reward associations rather than stimulus–reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action–reward associations during reinforcement learning. PMID:24828643

  14. Microstimulation of the human substantia nigra alters reinforcement learning.

    PubMed

    Ramayya, Ashwin G; Misra, Amrit; Baltuch, Gordon H; Kahana, Michael J

    2014-05-14

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action-reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action-reward associations rather than stimulus-reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action-reward associations during reinforcement learning.

  15. Role of Dopamine D2 Receptors in Human Reinforcement Learning

    PubMed Central

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-01-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well. PMID:24713613

  16. Role of dopamine D2 receptors in human reinforcement learning.

    PubMed

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-09-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well.

  17. Situational Reinforcement: The New Old Way to Learn Languages.

    ERIC Educational Resources Information Center

    Petrucelli, Gerald J.

    1977-01-01

    Situational Reinforcement, a teaching methodology developed out of the cognitive-field theory of learning, is described. It combines many techniques and methods developed over the years. This discussion of it considers common learning problems: (1) boredom, apathy and passivity on the part of the student, (2) the teacher's preoccupation with…

  18. Social Learning, Reinforcement and Crime: Evidence from Three European Cities

    ERIC Educational Resources Information Center

    Tittle, Charles R.; Antonaccio, Olena; Botchkovar, Ekaterina

    2012-01-01

    This study reports a cross-cultural test of Social Learning Theory using direct measures of social learning constructs and focusing on the causal structure implied by the theory. Overall, the results strongly confirm the main thrust of the theory. Prior criminal reinforcement and current crime-favorable definitions are highly related in all three…

  19. Mastery Learning through Individualized Instruction: A Reinforcement Strategy

    ERIC Educational Resources Information Center

    Sagy, John; Ravi, R.; Ananthasayanam, R.

    2009-01-01

    The present study attempts to gauge the effect of individualized instructional methods as a reinforcement strategy for mastery learning. Among various individualized instructional methods, the study focuses on PIM (Programmed Instructional Method) and CAIM (Computer Assisted Instruction Method). Mastery learning is a process where students achieve…

  20. Reinforcement learning in complementarity game and population dynamics

    NASA Astrophysics Data System (ADS)

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005), 10.1016/j.physa.2004.07.005] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This gives insight into aspects like the issue of quick adaptation as opposed to systematic exploration or the role of learning rates.

  1. Democratic reinforcement: learning via self-organization

    SciTech Connect

    Stassinopoulos, D.; Bak, P.

    1995-12-31

    The problem of learning in the absence of external intelligence is discussed in the context of a simple model. The model consists of a set of randomly connected, or layered, integrate-and-fire neurons. Inputs to and outputs from the environment are connected randomly to subsets of neurons. The connections between firing neurons are strengthened or weakened according to whether the action is successful or not. The model departs from the traditional gradient-descent based approaches to learning by operating at a highly susceptible "critical" state, with low activity and sparse connections between firing neurons. Quantitative studies on the performance of our model in a simple association task show that by tuning our system close to this critical state we can obtain dramatic gains in performance.

  2. Human-level control through deep reinforcement learning

    NASA Astrophysics Data System (ADS)

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-01

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

  3. Human-level control through deep reinforcement learning.

    PubMed

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A; Veness, Joel; Bellemare, Marc G; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-26

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
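
    A heavily condensed sketch of the deep Q-network recipe described in these two records (a Q-value network trained from experience replay against a periodically synchronized target network), written with PyTorch. The tiny one-dimensional stand-in environment, network size, and hyperparameters are illustrative choices, not those of the paper.

    ```python
    import random
    from collections import deque
    import torch
    import torch.nn as nn

    class ToyEnv:
        """Tiny stand-in environment (not Atari): reach +1 on a 1-D line."""
        def reset(self):
            self.x = 0.0
            return torch.tensor([self.x])
        def step(self, action):                    # action 0: left, 1: right
            self.x += 0.1 if action == 1 else -0.1
            done = abs(self.x) >= 1.0
            reward = 1.0 if self.x >= 1.0 else 0.0
            return torch.tensor([self.x]), reward, done

    def train_dqn(episodes=200, gamma=0.99, eps=0.1, lr=1e-3,
                  batch=32, buffer_size=5000, sync_every=100):
        env = ToyEnv()
        qnet = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))
        target = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))
        target.load_state_dict(qnet.state_dict())
        opt = torch.optim.Adam(qnet.parameters(), lr=lr)
        replay, steps = deque(maxlen=buffer_size), 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                with torch.no_grad():
                    a = random.randrange(2) if random.random() < eps \
                        else int(qnet(s).argmax())
                s2, r, done = env.step(a)
                replay.append((s, a, r, s2, done))
                s, steps = s2, steps + 1
                if len(replay) >= batch:
                    ss, aa, rr, ss2, dd = zip(*random.sample(replay, batch))
                    ss, ss2 = torch.stack(ss), torch.stack(ss2)
                    aa = torch.tensor(aa)
                    rr = torch.tensor(rr)
                    dd = torch.tensor(dd, dtype=torch.float32)
                    q = qnet(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
                    with torch.no_grad():
                        y = rr + gamma * (1 - dd) * target(ss2).max(1).values
                    loss = nn.functional.mse_loss(q, y)
                    opt.zero_grad()
                    loss.backward()
                    opt.step()
                if steps % sync_every == 0:        # periodic target-network sync
                    target.load_state_dict(qnet.state_dict())
        return qnet

    if __name__ == "__main__":
        net = train_dqn()
        print(net(torch.tensor([0.0])))            # Q-values near the start state
    ```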

  4. Novel reinforcement learning approach for difficult control problems

    NASA Astrophysics Data System (ADS)

    Becus, Georges A.; Thompson, Edward A.

    1997-09-01

    We review work conducted over the past several years aimed at developing reinforcement learning architectures for solving difficult control problems, based on and inspired by associative control process (ACP) networks. We briefly review ACP networks, which are able to reproduce many classical instrumental conditioning test results observed in animal research and to engage in real-time, closed-loop, goal-seeking interactions with their environment. Chronologically, our contributions include the ideally interfaced ACP network, which is endowed with hierarchical, attention, and failure-recognition interface mechanisms that greatly enhance the capabilities of the original ACP network. When solving the cart-pole problem, it achieves 100 percent reliability and a reduction in training time similar to that of Baird and Klopf's modified ACP network, and additionally an order-of-magnitude reduction in the number of failures experienced during successful training. Next we introduced the command and control center/internal drive (Cid) architecture for artificial neural learning systems. It consists of a hierarchy of command and control centers governing motor selection networks. Internal drives, similar to hunger, thirst, or reproduction in biological systems, are formed within the controller to facilitate learning. Efficiency, reliability, and adjustability of this architecture were demonstrated on the benchmark cart-pole control problem. A comparison with other artificial learning systems indicates that it learns over 100 times faster than Barto et al.'s adaptive search element/adaptive critic element, experiencing fewer failures by more than an order of magnitude, while capable of being fine-tuned by the user, on-line, for improved performance without additional training. Finally we present work in progress on a 'peaks and valleys' scheme which moves away from the one-dimensional learning mechanism currently found in Cid and shows promise in solving even more difficult learning control

  5. Reinforcement-learning-based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation.

    PubMed

    Shih, Peter; Kaul, Brian C; Jagannathan, S; Drallmeier, James A

    2008-08-01

    A novel reinforcement-learning-based dual-control adaptive neural network (NN) controller is developed to deliver a desired tracking performance for a class of complex nonlinear discrete-time feedback systems, consisting of a second-order nonlinear discrete-time system in nonstrict feedback form and an affine nonlinear discrete-time system, in the presence of bounded and unknown disturbances. For example, the exhaust gas recirculation (EGR) operation of a spark ignition (SI) engine is modeled by such a complex nonlinear discrete-time system. A dual-controller approach is undertaken in which a primary adaptive critic NN controller is designed for the nonstrict-feedback nonlinear discrete-time system and a secondary one for the affine nonlinear discrete-time system; together the controllers deliver the desired performance. The primary adaptive critic NN controller includes an NN observer for estimating the states and output, an NN critic, and two action NNs for generating the virtual and actual control inputs for the nonstrict-feedback nonlinear discrete-time system, whereas an additional critic NN and an action NN are included for the affine nonlinear discrete-time system by assuming state availability. All NN weights adapt online towards minimization of a certain performance index, utilizing a gradient-descent-based rule. Using Lyapunov theory, the uniform ultimate boundedness (UUB) of the closed-loop tracking error, weight estimates, and observer estimates is shown. The adaptive critic NN controller performance is evaluated on an SI engine operating with high EGR levels, where the controller objective is to reduce cyclic dispersion in heat release while minimizing fuel intake. Simulation and experimental results indicate that engine-out emissions drop significantly at 20% EGR due to the reduction in heat-release dispersion, thus verifying the dual-control approach.

  6. Social Cognition as Reinforcement Learning: Feedback Modulates Emotion Inference.

    PubMed

    Zaki, Jamil; Kallman, Seth; Wimmer, G Elliott; Ochsner, Kevin; Shohamy, Daphna

    2016-09-01

    Neuroscientific studies of social cognition typically employ paradigms in which perceivers draw single-shot inferences about the internal states of strangers. Real-world social inference features markedly different parameters: people often encounter and learn about particular social targets (e.g., friends) over time and receive feedback about whether their inferences are correct or incorrect. Here, we examined this process and, more broadly, the intersection between social cognition and reinforcement learning. Perceivers were scanned using fMRI while repeatedly encountering three social targets who produced conflicting visual and verbal emotional cues. Perceivers guessed how targets felt and received feedback about whether they had guessed correctly. Visual cues reliably predicted one target's emotion, verbal cues predicted a second target's emotion, and neither reliably predicted the third target's emotion. Perceivers successfully used this information to update their judgments over time. Furthermore, trial-by-trial learning signals, estimated using two reinforcement learning models, tracked activity in the ventral striatum and ventromedial pFC, structures associated with reinforcement learning, and in regions associated with updating social impressions, including the TPJ. These data suggest that learning about others' emotions, like other forms of feedback learning, relies on domain-general reinforcement mechanisms as well as domain-specific social information processing.
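
    As an illustration of the kind of trial-by-trial model referred to above, the sketch below implements a minimal Rescorla-Wagner-style update of a perceiver's belief that a given cue type predicts a target's emotion; the learning rate, cue labels, and update function are illustrative assumptions, not the authors' fitted model.

        # Minimal prediction-error sketch of feedback-driven emotion inference.
        # Parameter names and values are illustrative assumptions only.
        alpha = 0.2                                   # assumed learning rate
        value = {"visual": 0.5, "verbal": 0.5}        # belief that each cue type predicts this target's emotion

        def update(cue_used, correct):
            """Update the relied-upon cue's value from feedback; return the prediction error."""
            outcome = 1.0 if correct else 0.0
            pe = outcome - value[cue_used]            # trial-by-trial learning signal
            value[cue_used] += alpha * pe
            return pe

        # Example trial: the perceiver trusted the visual cue and was correct.
        print(update("visual", correct=True), value)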

  7. Reinforcement learning in depression: A review of computational research.

    PubMed

    Chen, Chong; Takahashi, Taiki; Nakagawa, Shin; Inoue, Takeshi; Kusumi, Ichiro

    2015-08-01

    Despite being considered primarily a mood disorder, major depressive disorder (MDD) is characterized by cognitive and decision-making deficits. Recent research has employed computational models of reinforcement learning (RL) to address these deficits. The computational approach has the advantage of making explicit predictions about learning and behavior, specifying the process parameters of RL, differentiating between model-free and model-based RL, and enabling computational model-based functional magnetic resonance imaging and electroencephalography analyses. With these merits, a field of computational psychiatry has emerged, and here we review specific studies that focused on MDD. Considerable evidence suggests that MDD is associated with impaired brain signals of reward prediction error and expected value ('wanting'), decreased reward sensitivity ('liking') and/or learning (be it model-free or model-based), etc., although the causality remains unclear. These parameters may serve as valuable intermediate phenotypes of MDD, linking general clinical symptoms to underlying molecular dysfunctions. We believe future computational research at clinical, systems, and cellular/molecular/genetic levels will propel us toward a better understanding of the disease.

  8. Reinforcement learning for resource allocation in LEO satellite networks.

    PubMed

    Usaha, Wipawee; Barria, Javier A

    2007-06-01

    In this paper, we develop and assess online decision-making algorithms for call admission and routing for low Earth orbit (LEO) satellite networks. It has been shown in a recent paper that, in a LEO satellite system, a semi-Markov decision process formulation of the call admission and routing problem can achieve better performance in terms of an average revenue function than existing routing methods. However, the conventional dynamic programming (DP) numerical solution becomes prohibitive as the problem size increases. In this paper, two solution methods based on reinforcement learning (RL) are proposed in order to circumvent the computational burden of DP. The first method is based on an actor-critic method with temporal-difference (TD) learning. The second method is based on a critic-only method, called optimistic TD learning. The algorithms improve on DP in terms of storage requirements, computational complexity, and computation time, as well as in terms of an overall long-term average revenue function that penalizes blocked calls. Numerical studies are carried out, and the results obtained show that the RL framework can achieve up to 56% higher average revenue over existing routing methods used in LEO satellite networks with reasonable storage and computational requirements.
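
    The sketch below shows, in outline, what an actor-critic temporal-difference learner for an admit/reject decision might look like; the state discretisation, reward, and all parameter values are assumptions for illustration, not the algorithms evaluated in the paper.

        import numpy as np

        # Illustrative actor-critic TD(0) sketch for an admit/reject decision.
        # Sizes, rewards, and parameters are assumed, not taken from the paper.
        n_states, n_actions = 10, 2                   # e.g. discretised load levels; admit or reject
        theta = np.zeros((n_states, n_actions))       # actor: action preferences
        V = np.zeros(n_states)                        # critic: state values
        alpha_actor, alpha_critic, gamma = 0.1, 0.1, 0.95
        rng = np.random.default_rng()

        def act(s):
            p = np.exp(theta[s] - theta[s].max())
            p /= p.sum()                              # softmax policy over admit/reject
            return int(rng.choice(n_actions, p=p))

        def learn(s, a, r, s_next):
            td_error = r + gamma * V[s_next] - V[s]   # critic's TD error
            V[s] += alpha_critic * td_error           # critic update
            theta[s, a] += alpha_actor * td_error     # actor update driven by the same error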

  9. [Multiple Dopamine Signals and Their Contributions to Reinforcement Learning].

    PubMed

    Matsumoto, Masayuki

    2016-10-01

    Midbrain dopamine neurons are activated by rewards and by sensory cues that predict reward. Their responses resemble the reward prediction error, which indicates the discrepancy between obtained and expected reward values and has been thought to play an important role as a teaching signal in reinforcement learning. Indeed, pharmacological blockade of dopamine transmission interferes with reinforcement learning. Recent studies reported, however, that not all dopamine neurons transmit the reward-related signal. They found that a subset of dopamine neurons transmits signals related to non-rewarding, salient experiences such as aversive stimulation and cognitively demanding events. How these signals contribute to animal behavior is not yet well understood. This article reviews recent findings on dopamine signals related to rewarding and non-rewarding experiences, and discusses their contributions to reinforcement learning.

  10. Literacy and Learning: Integrated Skills Reinforcement.

    ERIC Educational Resources Information Center

    Anderson, JoAnn Romero; And Others

    1991-01-01

    Describes the integrated skills reinforcement (ISR) approach to Language across the Curriculum used at La Guardia Community College (LCC) to teach basic skills within the context of the subject-content of various disciplines. Explains LCC's student-centered approach to faculty development, and the use of ISR as a basis for curricular,…

  11. The Computational Development of Reinforcement Learning during Adolescence

    PubMed Central

    Palminteri, Stefano; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-01-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents’ behaviour was better explained by a basic reinforcement learning algorithm, adults’ behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence. PMID:27322574

  12. The Computational Development of Reinforcement Learning during Adolescence.

    PubMed

    Palminteri, Stefano; Kilford, Emma J; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-06-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents' behaviour was better explained by a basic reinforcement learning algorithm, adults' behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence.

  13. Reinforcement learning algorithms for robotic navigation in dynamic environments.

    PubMed

    Yen, Gary G; Hickey, Travis W

    2004-04-01

    The purpose of this study was to examine improvements to reinforcement learning (RL) algorithms in order to successfully interact within dynamic environments. The scope of the research was that of RL algorithms as applied to robotic navigation. Proposed improvements include: addition of a forgetting mechanism, use of feature based state inputs, and hierarchical structuring of an RL agent. Simulations were performed to evaluate the individual merits and flaws of each proposal, to compare proposed methods to prior established methods, and to compare proposed methods to theoretically optimal solutions. Incorporation of a forgetting mechanism did considerably improve the learning times of RL agents in a dynamic environment. However, direct implementation of a feature-based RL agent did not result in any performance enhancements, as pure feature-based navigation results in a lack of positional awareness, and the inability of the agent to determine the location of the goal state. Inclusion of a hierarchical structure in an RL agent resulted in significantly improved performance, specifically when one layer of the hierarchy included a feature-based agent for obstacle avoidance, and a standard RL agent for global navigation. In summary, the inclusion of a forgetting mechanism, and the use of a hierarchically structured RL agent offer substantially increased performance when compared to traditional RL agents navigating in a dynamic environment.

  14. SOCIAL REINFORCEMENT, PERSONALITY AND LEARNING PERFORMANCE IN CROSS-CULTURAL PROGRAMMED INSTRUCTION.

    DTIC Science & Technology

    of learning program, containing four conditions of social reinforcement; positive reinforcement for correct choices, negative reinforcement for...incorrect choices, both positive and negative evaluation for either response, and no social evaluation. It was found that the presence of negative ... reinforcement as a factor significantly lowered the learning performance in one group. The opposite trend was evidenced in the other group. This discrepancy

  15. Stochastic optimization of multireservoir systems via reinforcement learning

    NASA Astrophysics Data System (ADS)

    Lee, Jin-Hee; Labadie, John W.

    2007-11-01

    Although several variants of stochastic dynamic programming have been applied to optimal operation of multireservoir systems, they have been plagued by a high-dimensional state space and the inability to accurately incorporate the stochastic environment as characterized by temporally and spatially correlated hydrologic inflows. Reinforcement learning has emerged as an effective approach to solving sequential decision problems by combining concepts from artificial intelligence, cognitive science, and operations research. A reinforcement learning system has a mathematical foundation similar to dynamic programming and Markov decision processes, with the goal of maximizing the long-term reward or returns as conditioned on the state of the system environment and the immediate reward obtained from operational decisions. Reinforcement learning can include Monte Carlo simulation where transition probabilities and rewards are not explicitly known a priori. The Q-Learning method in reinforcement learning is demonstrated on the two-reservoir Geum River system, South Korea, and is shown to outperform implicit stochastic dynamic programming and sampling stochastic dynamic programming methods.
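
    For readers unfamiliar with the method named above, the following is a generic tabular Q-learning sketch; the discretisation of storage levels and releases, the reward, and all parameter values are purely illustrative assumptions, not the study's implementation.

        import numpy as np

        # Generic tabular Q-learning sketch with an epsilon-greedy policy.
        n_states, n_actions = 50, 10     # e.g. discretised storage levels x discretised releases (assumed)
        Q = np.zeros((n_states, n_actions))
        alpha, gamma, eps = 0.1, 0.99, 0.1
        rng = np.random.default_rng()

        def choose(s):
            # explore with probability eps, otherwise act greedily on current estimates
            return int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())

        def q_update(s, a, r, s_next):
            # one-step Q-learning backup toward the sampled reward plus discounted best next value
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])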

  16. Punishment insensitivity and impaired reinforcement learning in preschoolers

    PubMed Central

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2013-01-01

    Background Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a developmental vulnerability to psychopathic traits. Methods 157 preschoolers (mean age 4.7 ±0.8 years) participated in a substudy that was embedded within a larger project. Children completed the “Stars-in-Jars” task, which involved learning to select rewarded jars and avoid punished jars. Maternal report of responsiveness to socialization was assessed with the Punishment Insensitivity and Low Concern for Others scales of the Multidimensional Assessment of Preschool Disruptive Behavior (MAP-DB). Results Punishment Insensitivity, but not Low Concern for Others, was significantly associated with reinforcement learning in multivariate models that accounted for age and sex. Specifically, higher Punishment Insensitivity was associated with significantly lower overall performance and more errors on punished trials (“passive avoidance”). Conclusions Impairments in reinforcement learning manifest in preschoolers who are high in maternal ratings of Punishment Insensitivity. If replicated, these findings may help to pinpoint the neurodevelopmental antecedents of psychopathic tendencies and suggest novel intervention targets beginning in early childhood. PMID:24033313

  17. Learning the specific quality of taste reinforcement in larval Drosophila.

    PubMed

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-27

    The only property of reinforcement insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing-in any brain.

  18. Learning the specific quality of taste reinforcement in larval Drosophila

    PubMed Central

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-01

    The only property of reinforcement insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing—in any brain. DOI: http://dx.doi.org/10.7554/eLife.04711.001 PMID:25622533

  19. A reinforcement learning approach to instrumental contingency degradation in rats.

    PubMed

    Dutech, Alain; Coutureau, Etienne; Marchand, Alain R

    2011-01-01

    Goal-directed action involves a representation of action consequences. Adapting to changes in action-outcome contingency requires the prefrontal region. Indeed, rats with lesions of the medial prefrontal cortex do not adapt their free operant response when food delivery becomes unrelated to lever-pressing. The present study explores the bases of this deficit through a combined behavioural and computational approach. We show that lesioned rats retain some behavioural flexibility and stop pressing if this action prevents food delivery. We attempt to model this phenomenon in a reinforcement learning framework. The model assumes that distinct action values are learned in an incremental manner in distinct states. The model represents states as n-uplets of events, emphasizing sequences rather than the continuous passage of time. Probabilities of lever-pressing and visits to the food magazine observed in the behavioural experiments are first analyzed as a function of these states, to identify sequences of events that influence action choice. Observed action probabilities appear to be essentially a function of the last event that occurred, with reward delivery and waiting significantly facilitating magazine visits and lever-pressing, respectively. Behavioural sequences of normal and lesioned rats are then fed into the model, action values are updated at each event transition according to the SARSA algorithm, and predicted action probabilities are derived through a softmax policy. The model captures the time course of learning, as well as the differential adaptation of normal and prefrontal-lesioned rats to contingency degradation, with the same parameters for both groups. The results suggest that simple temporal difference algorithms with low learning rates can largely account for instrumental learning and performance. Prefrontal-lesioned rats appear to mainly differ from control rats in their low rates of visits to the magazine after a lever press, and their inability to
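
    A minimal sketch of the SARSA-plus-softmax machinery described above is given below; the event-based states, the action set, and the parameter values are assumptions chosen for illustration rather than the fitted model.

        import numpy as np

        # Minimal SARSA update with a softmax policy over event-defined states.
        states  = ["reward", "waiting", "magazine_visit", "lever_press"]   # last event (assumed encoding)
        actions = ["press", "visit", "wait"]
        Q = {(s, a): 0.0 for s in states for a in actions}
        alpha, gamma, beta = 0.05, 0.9, 3.0        # low learning rate, as the abstract emphasises
        rng = np.random.default_rng()

        def policy(s):
            prefs = np.array([beta * Q[(s, a)] for a in actions])
            p = np.exp(prefs - prefs.max())
            p /= p.sum()                           # softmax action probabilities
            return actions[rng.choice(len(actions), p=p)]

        def sarsa_update(s, a, r, s_next, a_next):
            # on-policy backup toward the value of the action actually taken next
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])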

  20. Reinforcement Learning in Multidimensional Environments Relies on Attention Mechanisms

    PubMed Central

    Daniel, Reka; Geana, Andra; Gershman, Samuel J.; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C.

    2015-01-01

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this “representation learning” process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the “curse of dimensionality” in reinforcement learning. PMID:26019331

  1. The drift diffusion model as the choice rule in reinforcement learning.

    PubMed

    Pedersen, Mads Lund; Frank, Michael J; Biele, Guido

    2016-12-13

    Current reinforcement-learning models often assume simplified decision processes that do not fully reflect the dynamic complexities of choice processes. Conversely, sequential-sampling models of decision making account for both choice accuracy and response time, but assume that decisions are based on static decision values. To combine these two computational models of decision making and learning, we implemented reinforcement-learning models in which the drift diffusion model describes the choice process, thereby capturing both within- and across-trial dynamics. To exemplify the utility of this approach, we quantitatively fit data from a common reinforcement-learning paradigm using hierarchical Bayesian parameter estimation, and compared model variants to determine whether they could capture the effects of stimulant medication in adult patients with attention-deficit hyperactivity disorder (ADHD). The model with the best relative fit provided a good description of the learning process, choices, and response times. A parameter recovery experiment showed that the hierarchical Bayesian modeling approach enabled accurate estimation of the model parameters. The model approach described here, using simultaneous estimation of reinforcement-learning and drift diffusion model parameters, shows promise for revealing new insights into the cognitive and neural mechanisms of learning and decision making, as well as the alteration of such processes in clinical groups.
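
    The sketch below illustrates the general idea of using a drift diffusion process as the choice rule of a reinforcement-learning model: the drift rate is assumed proportional to the difference in learned values, and the learner updates the chosen option's value with a standard delta rule. All parameter values and the reward probabilities are illustrative assumptions, not the fitted hierarchical model.

        import numpy as np

        # RL model with a drift-diffusion choice rule (illustrative parameters).
        rng = np.random.default_rng(0)
        alpha, scale, threshold, dt, ndt = 0.1, 2.0, 1.0, 0.001, 0.3
        Q = np.array([0.5, 0.5])                     # learned option values

        def ddm_choice(q):
            v = scale * (q[0] - q[1])                # drift rate from the value difference
            x, t = 0.0, 0.0
            while abs(x) < threshold:                # accumulate noisy evidence to a bound
                x += v * dt + rng.normal(0.0, np.sqrt(dt))
                t += dt
            return (0 if x > 0 else 1), t + ndt      # choice and response time (with non-decision time)

        choice, rt = ddm_choice(Q)
        reward = float(rng.random() < (0.8 if choice == 0 else 0.2))   # assumed reward probabilities
        Q[choice] += alpha * (reward - Q[choice])    # delta-rule update of the chosen option's value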

  2. Short-term memory traces for action bias in human reinforcement learning.

    PubMed

    Bogacz, Rafal; McClure, Samuel M; Li, Jian; Cohen, Jonathan D; Montague, P Read

    2007-06-11

    Recent experimental and theoretical work on reinforcement learning has shed light on the neural bases of learning from rewards and punishments. One fundamental problem in reinforcement learning is the credit assignment problem, or how to properly assign credit to actions that lead to reward or punishment following a delay. Temporal difference learning solves this problem, but its efficiency can be significantly improved by the addition of eligibility traces (ET). In essence, ETs function as decaying memories of previous choices that are used to scale synaptic weight changes. It has been shown in theoretical studies that ETs spanning a number of actions may improve the performance of reinforcement learning. However, it remains an open question whether including ETs that persist over sequences of actions allows reinforcement learning models to better fit empirical data regarding the behaviors of humans and other animals. Here, we report an experiment in which human subjects performed a sequential economic decision game in which the long-term optimal strategy differed from the strategy that leads to the greatest short-term return. We demonstrate that human subjects' performance in the task is significantly affected by the time between choices in a surprising and seemingly counterintuitive way. However, this behavior is naturally explained by a temporal difference learning model which includes ETs persisting across actions. Furthermore, we review recent findings that suggest that short-term synaptic plasticity in dopamine neurons may provide a realistic biophysical mechanism for producing ETs that persist on a timescale consistent with behavioral observations.
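
    To make the mechanism concrete, the following is a minimal tabular TD(lambda) sketch in which eligibility traces act as decaying memories that spread credit over recently visited states; the state space and parameter values are assumptions for illustration.

        import numpy as np

        # Tabular TD(lambda): eligibility traces spread the TD error backwards in time.
        n_states = 6                                  # assumed size
        V = np.zeros(n_states)                        # state values
        e = np.zeros(n_states)                        # eligibility traces
        alpha, gamma, lam = 0.1, 0.95, 0.8

        def td_lambda_step(s, r, s_next):
            delta = r + gamma * V[s_next] - V[s]      # TD error at the current step
            e[s] += 1.0                               # mark the current state as eligible
            V[:] += alpha * delta * e                 # recently visited states share the credit
            e[:] *= gamma * lam                       # traces decay between steps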

  3. Efficient reinforcement learning: computational theories, neuroscience and robotics.

    PubMed

    Kawato, Mitsuo; Samejima, Kazuyuki

    2007-04-01

    Reinforcement learning algorithms have provided some of the most influential computational theories for behavioral learning that depends on reward and penalty. After briefly reviewing supporting experimental data, this paper tackles three difficult theoretical issues that remain to be explored. First, plain reinforcement learning is much too slow to be considered a plausible brain model. Second, although the temporal-difference error has an important role both in theory and in experiments, how to compute it remains an enigma. Third, accounting for the function of all brain areas, including the cerebral cortex, cerebellum, brainstem and basal ganglia, seems to necessitate a new computational framework. Computational studies that emphasize meta-parameters, hierarchy, modularity and supervised learning to resolve these issues are reviewed here, together with the related experimental data.

  4. Reinforcement learning accounts for moody conditional cooperation behavior: experimental results

    PubMed Central

    Horita, Yutaka; Takezawa, Masanori; Inukai, Keigo; Kita, Toshimasa; Masuda, Naoki

    2017-01-01

    In social dilemma games, human participants often show conditional cooperation (CC) behavior or its variant called moody conditional cooperation (MCC), with which they basically tend to cooperate when many other peers have previously cooperated. Recent computational studies showed that CC and MCC behavioral patterns could be explained by reinforcement learning. In the present study, we use a repeated multiplayer prisoner’s dilemma game and the repeated public goods game played by human participants to examine whether MCC is observed across different types of game and the possibility that reinforcement learning explains observed behavior. We observed MCC behavior in both games, but the MCC that we observed was different from that observed in the past experiments. In the present study, whether or not a focal participant cooperated previously affected the overall level of cooperation, instead of changing the tendency of cooperation in response to cooperation of other participants in the previous time step. We found that, across different conditions, reinforcement learning models were approximately as accurate as a MCC model in describing the experimental results. Consistent with the previous computational studies, the present results suggest that reinforcement learning may be a major proximate mechanism governing MCC behavior. PMID:28071646

  5. Reinforcement learning accounts for moody conditional cooperation behavior: experimental results.

    PubMed

    Horita, Yutaka; Takezawa, Masanori; Inukai, Keigo; Kita, Toshimasa; Masuda, Naoki

    2017-01-10

    In social dilemma games, human participants often show conditional cooperation (CC) behavior or its variant called moody conditional cooperation (MCC), with which they basically tend to cooperate when many other peers have previously cooperated. Recent computational studies showed that CC and MCC behavioral patterns could be explained by reinforcement learning. In the present study, we use a repeated multiplayer prisoner's dilemma game and the repeated public goods game played by human participants to examine whether MCC is observed across different types of game and the possibility that reinforcement learning explains observed behavior. We observed MCC behavior in both games, but the MCC that we observed was different from that observed in the past experiments. In the present study, whether or not a focal participant cooperated previously affected the overall level of cooperation, instead of changing the tendency of cooperation in response to cooperation of other participants in the previous time step. We found that, across different conditions, reinforcement learning models were approximately as accurate as a MCC model in describing the experimental results. Consistent with the previous computational studies, the present results suggest that reinforcement learning may be a major proximate mechanism governing MCC behavior.

  6. Kinesthetic Reinforcement-Is It a Boon to Learning?

    ERIC Educational Resources Information Center

    Bohrer, Roxilu K.

    1970-01-01

    Language instruction, particularly in the elementary school, should be reinforced through the use of visual aids and through associated physical activity. Kinesthetic experiences provide an opportunity to make use of non-verbal cues to meaning, enliven classroom activities, and maximize learning for pupils. The author discusses the educational…

  7. Reinforcement Learning in Young Adults with Developmental Language Impairment

    ERIC Educational Resources Information Center

    Lee, Joanna C.; Tomblin, J. Bruce

    2012-01-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank, Seeberger, & O'Reilly, 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic…

  8. Reinforcement learning: the good, the bad and the ugly.

    PubMed

    Dayan, Peter; Niv, Yael

    2008-04-01

    Reinforcement learning provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making in the face of rewards and punishments. Here we review the latest dispatches from the forefront of this field, and map out some of the territories where lie monsters.

  9. Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin.

    PubMed

    Ezaki, Takahiro; Horita, Yutaka; Takezawa, Masanori; Masuda, Naoki

    2016-07-01

    Direct reciprocity, or repeated interaction, is a main mechanism to sustain cooperation under social dilemmas involving two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. Mechanisms underlying these behaviors largely remain unclear. Here we provide a proximate account for this behavior by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff is larger than a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results obtained in the present study are general in that they explain extant experimental results obtained for both so-called moody and non-moody conditional cooperation, prisoner's dilemma and public goods games, and well-mixed groups and networks. Different from the previous theory, individuals are assumed to have no access to information about what other individuals are doing, such that they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity of cooperation is modulated in every discrete time step, explains conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from Pavlov, a reinforcement learning strategy promoting mutual cooperation in two-player situations.
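
    A minimal sketch of the aspiration-learning rule described above follows; the aspiration level, learning rate, and payoff values are assumptions used only to illustrate the reinforce/anti-reinforce logic.

        # Aspiration learning for a single cooperate/defect propensity.
        aspiration, alpha = 1.0, 0.1                  # assumed aspiration level and learning rate

        def update(p_cooperate, cooperated, payoff):
            satisfied = payoff > aspiration
            # reinforce the action actually taken if it was satisfactory, anti-reinforce it otherwise
            target = 1.0 if cooperated == satisfied else 0.0
            return p_cooperate + alpha * (target - p_cooperate)

        p = 0.5
        p = update(p, cooperated=True, payoff=1.5)    # satisfactory cooperation -> propensity rises
        p = update(p, cooperated=True, payoff=0.2)    # unsatisfactory cooperation -> propensity falls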

  10. Human Operant Learning under Concurrent Reinforcement of Response Variability

    ERIC Educational Resources Information Center

    Maes, J. H. R.; van der Goot, M.

    2006-01-01

    This study asked whether the concurrent reinforcement of behavioral variability facilitates learning to emit a difficult target response. Sixty students repeatedly pressed sequences of keys, with an originally infrequently occurring target sequence consistently being followed by positive feedback. Three conditions differed in the feedback given to…

  11. Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin

    PubMed Central

    Ezaki, Takahiro; Horita, Yutaka; Masuda, Naoki

    2016-01-01

    Direct reciprocity, or repeated interaction, is a main mechanism to sustain cooperation under social dilemmas involving two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. Mechanisms underlying these behaviors largely remain unclear. Here we provide a proximate account for this behavior by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff is larger than a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results obtained in the present study are general in that they explain extant experimental results obtained for both so-called moody and non-moody conditional cooperation, prisoner’s dilemma and public goods games, and well-mixed groups and networks. Different from the previous theory, individuals are assumed to have no access to information about what other individuals are doing, such that they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity of cooperation is modulated in every discrete time step, explains conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from Pavlov, a reinforcement learning strategy promoting mutual cooperation in two-player situations. PMID:27438888

  12. Neural correlates of reinforcement learning and social preferences in competitive bidding.

    PubMed

    van den Bos, Wouter; Talwar, Arjun; McClure, Samuel M

    2013-01-30

    In competitive social environments, people often deviate from what rational choice theory prescribes, resulting in losses or suboptimal monetary gains. We investigate how competition affects learning and decision-making in a common value auction task. During the experiment, groups of five human participants were simultaneously scanned using MRI while playing the auction task. We first demonstrate that bidding is well characterized by reinforcement learning with biased reward representations dependent on social preferences. Indicative of reinforcement learning, we found that estimated trial-by-trial prediction errors correlated with activity in the striatum and ventromedial prefrontal cortex. Additionally, we found that individual differences in social preferences were related to activity in the temporal-parietal junction and anterior insula. Connectivity analyses suggest that monetary and social value signals are integrated in the ventromedial prefrontal cortex and striatum. Based on these results, we argue for a novel mechanistic account for the integration of reinforcement history and social preferences in competitive decision-making.

  13. Reinforcement learning techniques for controlling resources in power networks

    NASA Astrophysics Data System (ADS)

    Kowli, Anupama Sunil

    As power grids transition towards increased reliance on renewable generation, energy storage and demand response resources, an effective control architecture is required to harness the full functionalities of these resources. There is a critical need for control techniques that recognize the unique characteristics of the different resources and exploit the flexibility afforded by them to provide ancillary services to the grid. The work presented in this dissertation addresses these needs. Specifically, new algorithms are proposed, which allow control synthesis in settings wherein the precise distribution of the uncertainty and its temporal statistics are not known. These algorithms are based on recent developments in Markov decision theory, approximate dynamic programming and reinforcement learning. They impose minimal assumptions on the system model and allow the control to be "learned" based on the actual dynamics of the system. Furthermore, they can accommodate complex constraints such as capacity and ramping limits on generation resources, state-of-charge constraints on storage resources, comfort-related limitations on demand response resources and power flow limits on transmission lines. Numerical studies demonstrating applications of these algorithms to practical control problems in power systems are discussed. Results demonstrate how the proposed control algorithms can be used to improve the performance and reduce the computational complexity of the economic dispatch mechanism in a power network. We argue that the proposed algorithms are eminently suitable to develop operational decision-making tools for large power grids with many resources and many sources of uncertainty.

  14. Parsing facades with shape grammars and reinforcement learning.

    PubMed

    Teboul, Olivier; Kokkinos, Iasonas; Simon, Loic; Koutsourakis, Panagiotis; Paragios, Nikos

    2013-07-01

    In this paper, we use shape grammars (SGs) for facade parsing, which amounts to segmenting 2D building facades into balconies, walls, windows, and doors in an architecturally meaningful manner. The main thrust of our work is the introduction of reinforcement learning (RL) techniques to deal with the computational complexity of the problem. RL provides us with techniques such as Q-learning and state aggregation which we exploit to efficiently solve facade parsing. We initially phrase the 1D parsing problem in terms of a Markov Decision Process, paving the way for the application of RL-based tools. We then develop novel techniques for the 2D shape parsing problem that take into account the specificities of the facade parsing problem. Specifically, we use state aggregation to enforce the symmetry of facade floors and demonstrate how to use RL to exploit bottom-up, image-based guidance during optimization. We provide systematic results on the Paris building dataset and obtain state-of-the-art results in a fraction of the time required by previous methods. We validate our method under diverse imaging conditions and make our software and results available online.

  15. A reinforcement learning model of joy, distress, hope and fear

    NASA Astrophysics Data System (ADS)

    Broekens, Joost; Jacobs, Elmer; Jonker, Catholijn M.

    2015-07-01

    In this paper we computationally study the relation between adaptive behaviour and emotion. Using the reinforcement learning framework, we propose that the learned state utility models fear (negative) and hope (positive), based on the fact that both signals are about anticipation of loss or gain. Further, we propose that joy/distress is a signal similar to the error signal. We present agent-based simulation experiments that show that this model replicates psychological and behavioural dynamics of emotion. This work distinguishes itself by assessing the dynamics of emotion in an adaptive agent framework, coupling it to the literature on habituation, development, extinction and hope theory. Our results support the idea that the function of emotion is to provide a complex feedback signal for an organism to adapt its behaviour. Our work is relevant for understanding the relation between emotion and adaptation in animals, as well as for human-robot interaction, in particular how emotional signals can be used to communicate between adaptive agents and humans.
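
    One way to read the proposal above is sketched below: hope/fear correspond to the sign of the learned state utility, and joy/distress to the sign of the temporal-difference error. This mapping and the parameter values are a hedged reading of the abstract, not the authors' exact formulation.

        # Emotion labels derived from TD-style learning signals (illustrative mapping).
        alpha, gamma = 0.1, 0.9
        V = {}                                        # learned state utilities

        def emotional_step(s, r, s_next):
            v, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
            delta = r + gamma * v_next - v            # TD error: joy if positive, distress if negative
            V[s] = v + alpha * delta
            anticipation = V[s]                       # hope if positive, fear if negative
            return {"joy_distress": delta, "hope_fear": anticipation}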

  16. Antipsychotic dose modulates behavioral and neural responses to feedback during reinforcement learning in schizophrenia.

    PubMed

    Insel, Catherine; Reinen, Jenna; Weber, Jochen; Wager, Tor D; Jarskog, L Fredrik; Shohamy, Daphna; Smith, Edward E

    2014-03-01

    Schizophrenia is characterized by an abnormal dopamine system, and dopamine blockade is the primary mechanism of antipsychotic treatment. Consistent with the known role of dopamine in reward processing, prior research has demonstrated that patients with schizophrenia exhibit impairments in reward-based learning. However, it remains unknown how treatment with antipsychotic medication impacts the behavioral and neural signatures of reinforcement learning in schizophrenia. The goal of this study was to examine whether antipsychotic medication modulates behavioral and neural responses to prediction error coding during reinforcement learning. Patients with schizophrenia completed a reinforcement learning task while undergoing functional magnetic resonance imaging. The task consisted of two separate conditions in which participants accumulated monetary gain or avoided monetary loss. Behavioral results indicated that antipsychotic medication dose was associated with altered behavioral approaches to learning, such that patients taking higher doses of medication showed increased sensitivity to negative reinforcement. Higher doses of antipsychotic medication were also associated with higher learning rates (LRs), suggesting that medication enhanced sensitivity to trial-by-trial feedback. Neuroimaging data demonstrated that antipsychotic dose was related to differences in neural signatures of feedback prediction error during the loss condition. Specifically, patients taking higher doses of medication showed attenuated prediction error responses in the striatum and the medial prefrontal cortex. These findings indicate that antipsychotic medication treatment may influence motivational processes in patients with schizophrenia.

  17. Neurofeedback in Learning Disabled Children: Visual versus Auditory Reinforcement.

    PubMed

    Fernández, Thalía; Bosch-Bayard, Jorge; Harmony, Thalía; Caballero, María I; Díaz-Comas, Lourdes; Galán, Lídice; Ricardo-Garcell, Josefina; Aubert, Eduardo; Otero-Ojeda, Gloria

    2016-03-01

    Children with learning disabilities (LD) frequently have an EEG characterized by an excess of theta and a deficit of alpha activities. Neurofeedback (NFB) using an auditory stimulus as reinforcer has proven to be a useful tool to treat LD children by positively reinforcing decreases of the theta/alpha ratio. The aim of the present study was to optimize the NFB procedure by comparing the efficacy of visual (with eyes open) versus auditory (with eyes closed) reinforcers. Twenty LD children with an abnormally high theta/alpha ratio were randomly assigned to the Auditory or the Visual group, where a 500 Hz tone or a visual stimulus (a white square), respectively, was used as a positive reinforcer when the value of the theta/alpha ratio was reduced. Both groups had signs consistent with EEG maturation, but only the Auditory Group showed behavioral/cognitive improvements. In conclusion, the auditory reinforcer was more efficacious in reducing the theta/alpha ratio, and it improved cognitive abilities more than the visual reinforcer.

  18. Investigation of a Reinforcement-Based Toilet Training Procedure for Children with Autism.

    ERIC Educational Resources Information Center

    Cicero, Frank R.; Pfadt, Al

    2002-01-01

    This study evaluated the effectiveness of a reinforcement-based toilet training intervention with three children with autism. Procedures included positive reinforcement, graduated guidance, scheduled practice trials, and forward prompting. All three children reduced urination accidents to zero and learned to request bathroom use spontaneously…

  19. Working memory contributions to reinforcement learning impairments in schizophrenia.

    PubMed

    Collins, Anne G E; Brown, Jaime K; Gold, James M; Waltz, James A; Frank, Michael J

    2014-10-08

    Previous research has shown that patients with schizophrenia are impaired in reinforcement learning tasks. However, behavioral learning curves in such tasks originate from the interaction of multiple neural processes, including the basal ganglia- and dopamine-dependent reinforcement learning (RL) system, but also prefrontal cortex-dependent cognitive strategies involving working memory (WM). Thus, it is unclear which specific system induces impairments in schizophrenia. We recently developed a task and computational model allowing us to separately assess the roles of RL (slow, cumulative learning) mechanisms versus WM (fast but capacity-limited) mechanisms in healthy adult human subjects. Here, we used this task to assess patients' specific sources of impairments in learning. In 15 separate blocks, subjects learned to pick one of three actions for stimuli. The number of stimuli to learn in each block varied from two to six, allowing us to separate influences of capacity-limited WM from the incremental RL system. As expected, both patients (n = 49) and healthy controls (n = 36) showed effects of set size and delay between stimulus repetitions, confirming the presence of working memory effects. Patients performed significantly worse than controls overall, but computational model fits and behavioral analyses indicate that these deficits could be entirely accounted for by changes in WM parameters (capacity and reliability), whereas RL processes were spared. These results suggest that the working memory system contributes strongly to learning impairments in schizophrenia.
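
    The sketch below illustrates the basic idea of mixing a fast, capacity-limited working-memory policy with a slow, incremental RL policy; the capacity term, mixture rule, and parameter values are assumptions for illustration and do not reproduce the authors' computational model.

        import numpy as np

        # Mixture of a capacity-limited WM policy and an incremental RL policy.
        n_actions, beta, capacity = 3, 5.0, 4         # assumed values

        def mixed_policy(Q_rl, wm_store, stimulus, set_size):
            # RL component: softmax over incrementally learned action values
            prefs = beta * np.asarray(Q_rl[stimulus])
            p_rl = np.exp(prefs - prefs.max())
            p_rl /= p_rl.sum()
            # WM component: recall the remembered action if the stimulus is held in memory
            if stimulus in wm_store:
                p_wm = np.eye(n_actions)[wm_store[stimulus]]
            else:
                p_wm = np.ones(n_actions) / n_actions
            w = min(1.0, capacity / set_size)         # WM contributes less as set size grows
            return w * p_wm + (1 - w) * p_rl

        probs = mixed_policy({"s1": [0.2, 0.5, 0.3]}, {"s1": 1}, "s1", set_size=6)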

  20. Working Memory Contributions to Reinforcement Learning Impairments in Schizophrenia

    PubMed Central

    Brown, Jaime K.; Gold, James M.; Waltz, James A.; Frank, Michael J.

    2014-01-01

    Previous research has shown that patients with schizophrenia are impaired in reinforcement learning tasks. However, behavioral learning curves in such tasks originate from the interaction of multiple neural processes, including the basal ganglia- and dopamine-dependent reinforcement learning (RL) system, but also prefrontal cortex-dependent cognitive strategies involving working memory (WM). Thus, it is unclear which specific system induces impairments in schizophrenia. We recently developed a task and computational model allowing us to separately assess the roles of RL (slow, cumulative learning) mechanisms versus WM (fast but capacity-limited) mechanisms in healthy adult human subjects. Here, we used this task to assess patients' specific sources of impairments in learning. In 15 separate blocks, subjects learned to pick one of three actions for stimuli. The number of stimuli to learn in each block varied from two to six, allowing us to separate influences of capacity-limited WM from the incremental RL system. As expected, both patients (n = 49) and healthy controls (n = 36) showed effects of set size and delay between stimulus repetitions, confirming the presence of working memory effects. Patients performed significantly worse than controls overall, but computational model fits and behavioral analyses indicate that these deficits could be entirely accounted for by changes in WM parameters (capacity and reliability), whereas RL processes were spared. These results suggest that the working memory system contributes strongly to learning impairments in schizophrenia. PMID:25297101

  1. Novelty and Inductive Generalization in Human Reinforcement Learning

    PubMed Central

    Gershman, Samuel J.; Niv, Yael

    2015-01-01

    In reinforcement learning, a decision maker searching for the most rewarding option is often faced with the question: what is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: how can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of reinforcement learning in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional reinforcement learning algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty. PMID:25808176

  2. The curse of planning: dissecting multiple reinforcement-learning systems by taxing the central executive.

    PubMed

    Otto, A Ross; Gershman, Samuel J; Markman, Arthur B; Daw, Nathaniel D

    2013-05-01

    A number of accounts of human and animal behavior posit the operation of parallel and competing valuation systems in the control of choice behavior. In these accounts, a flexible but computationally expensive model-based reinforcement-learning system has been contrasted with a less flexible but more efficient model-free reinforcement-learning system. The factors governing which system controls behavior, and under what circumstances, are still unclear. Following the hypothesis that model-based reinforcement learning requires cognitive resources, we demonstrated that having human decision makers perform a demanding secondary task engenders increased reliance on a model-free reinforcement-learning strategy. Further, we showed that, across trials, people negotiate the trade-off between the two systems dynamically as a function of concurrent executive-function demands, and people's choice latencies reflect the computational expenses of the strategy they employ. These results demonstrate that competition between multiple learning systems can be controlled on a trial-by-trial basis by modulating the availability of cognitive resources.

  3. Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats

    PubMed Central

    Lloyd, Kevin; Becker, Nadine; Jones, Matthew W.; Bogacz, Rafal

    2012-01-01

    Learning to form appropriate, task-relevant working memory representations is a complex process central to cognition. Gating models frame working memory as a collection of past observations and use reinforcement learning (RL) to solve the problem of when to update these observations. Investigation of how gating models relate to brain and behavior remains, however, at an early stage. The current study sought to explore the ability of simple RL gating models to replicate rule learning behavior in rats. Rats were trained in a maze-based spatial learning task that required animals to make trial-by-trial choices contingent upon their previous experience. Using an abstract version of this task, we tested the ability of two gating algorithms, one based on the Actor-Critic and the other on the State-Action-Reward-State-Action (SARSA) algorithm, to generate behavior consistent with the rats'. Both models produced rule-acquisition behavior consistent with the experimental data, though only the SARSA gating model mirrored faster learning following rule reversal. We also found that both gating models learned multiple strategies in solving the initial task, a property which highlights the multi-agent nature of such models and which is of importance in considering the neural basis of individual differences in behavior. PMID:23115551

  4. A reinforcement learning approach to model interactions between landmarks and geometric cues during spatial learning.

    PubMed

    Sheynikhovich, Denis; Arleo, Angelo

    2010-12-13

    In contrast to predictions derived from the associative learning theory, a number of behavioral studies suggested the absence of competition between geometric cues and landmarks in some experimental paradigms. In parallel to these studies, neurobiological experiments suggested the existence of separate independent memory systems which may not always interact according to classic associative principles. In this paper we attempt to combine these two lines of research by proposing a model of spatial learning that is based on the theory of multiple memory systems. In our model, a place-based locale strategy uses activities of modeled hippocampal place cells to drive navigation to a hidden goal, while a stimulus-response taxon strategy, presumably mediated by the dorso-lateral striatum, learns landmark-approaching behavior. A strategy selection network, proposed to reside in the prefrontal cortex, implements a simple reinforcement learning rule to switch behavioral strategies. The model is used to reproduce the results of a behavioral experiment in which an interaction between a landmark and geometric cues was studied. We show that this model, built on the basis of neurobiological data, can explain the lack of competition between the landmark and geometry, potentiation of geometry learning by the landmark, and blocking. Namely, we propose that the geometry potentiation is a consequence of cooperation between memory systems during learning, while blocking is due to competition between the memory systems during action selection.

  5. Amygdala and Ventral Striatum Make Distinct Contributions to Reinforcement Learning.

    PubMed

    Costa, Vincent D; Dal Monte, Olga; Lucas, Daniel R; Murray, Elisabeth A; Averbeck, Bruno B

    2016-10-19

    Reinforcement learning (RL) theories posit that dopaminergic signals are integrated within the striatum to associate choices with outcomes. Often overlooked is that the amygdala also receives dopaminergic input and is involved in Pavlovian processes that influence choice behavior. To determine the relative contributions of the ventral striatum (VS) and amygdala to appetitive RL, we tested rhesus macaques with VS or amygdala lesions on deterministic and stochastic versions of a two-arm bandit reversal learning task. When learning was characterized with an RL model relative to controls, amygdala lesions caused general decreases in learning from positive feedback and choice consistency. By comparison, VS lesions only affected learning in the stochastic task. Moreover, the VS lesions hastened the monkeys' choice reaction times, which emphasized a speed-accuracy trade-off that accounted for errors in deterministic learning. These results update standard accounts of RL by emphasizing distinct contributions of the amygdala and VS to RL.

  6. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain.

    PubMed

    Niv, Yael; Edlund, Jeffrey A; Dayan, Peter; O'Doherty, John P

    2012-01-11

    Humans and animals are exquisitely, though idiosyncratically, sensitive to risk or variance in the outcomes of their actions. Economic, psychological, and neural aspects of this are well studied when information about risk is provided explicitly. However, we must normally learn about outcomes from experience, through trial and error. Traditional models of such reinforcement learning focus on learning about the mean reward value of cues and ignore higher order moments such as variance. We used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk. Our analysis focused on anatomically delineated regions of a priori interest in the nucleus accumbens, where blood oxygenation level-dependent (BOLD) signals have been suggested as correlating with quantities derived from reinforcement learning. We first provide unbiased evidence that the raw BOLD signal in these regions corresponds closely to a reward prediction error. We then derive from this signal the learned values of cues that predict rewards of equal mean but different variance and show that these values are indeed modulated by experienced risk. Moreover, a close neurometric-psychometric coupling exists between the fluctuations of the experience-based evaluations of risky options that we measured neurally and the fluctuations in behavioral risk aversion. This suggests that risk sensitivity is integral to human learning, illuminating economic models of choice, neuroscientific models of affective learning, and the workings of the underlying neural mechanisms.
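
    One common way to build risk sensitivity into a simple learner, consistent in spirit with the account above, is to weight positive and negative prediction errors asymmetrically; the sketch below uses that device, with parameter values that are assumptions rather than the authors' fitted model.

        # Risk-sensitive value learning via asymmetric learning rates.
        alpha_plus, alpha_minus = 0.05, 0.15          # stronger learning from losses -> risk aversion (assumed)
        values = {}

        def risk_sensitive_update(cue, reward):
            v = values.get(cue, 0.0)
            pe = reward - v                           # prediction error
            alpha = alpha_plus if pe >= 0 else alpha_minus
            values[cue] = v + alpha * pe              # higher outcome variance lowers the learned value
            return pe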

  7. From Recurrent Choice to Skill Learning: A Reinforcement-Learning Model

    ERIC Educational Resources Information Center

    Fu, Wai-Tat; Anderson, John R.

    2006-01-01

    The authors propose a reinforcement-learning mechanism as a model for recurrent choice and extend it to account for skill learning. The model was inspired by recent research in neurophysiological studies of the basal ganglia and provides an integrated explanation of recurrent choice behavior and skill learning. The behavior includes effects of…

  8. Reinforcement learning agents providing advice in complex video games

    NASA Astrophysics Data System (ADS)

    Taylor, Matthew E.; Carboni, Nicholas; Fachantidis, Anestis; Vlahavas, Ioannis; Torrey, Lisa

    2014-01-01

    This article introduces a teacher-student framework for reinforcement learning, synthesising and extending material that appeared in conference proceedings [Torrey, L., & Taylor, M. E. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. Proceedings of the international conference on autonomous agents and multiagent systems] and in a non-archival workshop paper [Carboni, N., & Taylor, M. E. (2013, May). Preliminary results for 1 vs. 1 tactics in StarCraft. Proceedings of the adaptive and learning agents workshop (at AAMAS-13)]. In this framework, a teacher agent instructs a student agent by suggesting actions the student should take as it learns. However, the teacher may only give such advice a limited number of times. We present several novel algorithms that teachers can use to budget their advice effectively, and we evaluate them in two complex video games: StarCraft and Pac-Man. Our results show that the same amount of advice, given at different moments, can have different effects on student learning, and that teachers can significantly affect student learning even when students use different learning methods and state representations.
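
    A minimal sketch of budgeted advising in a teacher-student setting, assuming an importance-based heuristic in which the teacher spends its limited advice only in states where its own Q-values say the choice matters most; the class, names, and threshold are illustrative rather than the paper's exact algorithms.

        class Teacher:
            def __init__(self, q_table, budget, threshold):
                self.q = q_table          # teacher's own state -> {action: value} map
                self.budget = budget      # how many pieces of advice may still be given
                self.threshold = threshold

            def advise(self, state):
                """Return the teacher's preferred action, or None if no advice is given."""
                if self.budget <= 0 or state not in self.q:
                    return None
                values = self.q[state]
                importance = max(values.values()) - min(values.values())
                if importance < self.threshold:
                    return None           # save the budget for states where the choice matters
                self.budget -= 1
                return max(values, key=values.get)

        teacher = Teacher({"s0": {"left": 0.9, "right": 0.1}}, budget=10, threshold=0.5)
        print(teacher.advise("s0"))   # 'left'; one unit of budget is spent
        print(teacher.advise("s1"))   # None; unknown state, budget untouched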

  9. Maximization of Learning Speed Due to Neuronal Redundancy in Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Takiyama, Ken

    2016-11-01

    Adaptable neural activity contributes to the flexibility of human behavior, which is optimized in situations such as motor learning and decision making. Although learning signals in motor learning and decision making are low-dimensional, neural activity, which is very high dimensional, must be modified to achieve optimal performance based on the low-dimensional signal, resulting in a severe credit-assignment problem. Despite this problem, the human brain contains a vast number of neurons, leaving an open question: what is the functional significance of the huge number of neurons? Here, I address this question by analyzing a redundant neural network with a reinforcement-learning algorithm in which the numbers of neurons and output units are N and M, respectively. Because many combinations of neural activity can generate the same output under the condition of N ≫ M, I refer to the index N - M as neuronal redundancy. Although greater neuronal redundancy makes the credit-assignment problem more severe, I demonstrate that a greater degree of neuronal redundancy facilitates learning speed. Thus, in an apparent contradiction of the credit-assignment problem, I propose the hypothesis that a functional role of a huge number of neurons or a huge degree of neuronal redundancy is to facilitate learning speed.
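
    A toy rendering of the setup, assuming a linear network in which N "neurons" feed M outputs and learning uses a simple perturbation-based reinforcement rule; it only illustrates how redundancy (N - M) enters the problem and is not the paper's analytical model.

        import numpy as np

        def train(N, M=1, steps=300, sigma=0.05, seed=0):
            """Weight-perturbation reinforcement learning on a redundant linear readout."""
            rng = np.random.default_rng(seed)
            x = rng.normal(size=N)                 # fixed high-dimensional activity pattern
            target = np.ones(M)                    # desired low-dimensional output
            W = np.zeros((M, N))
            for _ in range(steps):
                noise = sigma * rng.normal(size=W.shape)        # exploratory perturbation
                err_base = np.sum((W @ x - target) ** 2)
                err_pert = np.sum(((W + noise) @ x - target) ** 2)
                if err_pert < err_base:                         # keep perturbations that help
                    W += noise
            return np.sum((W @ x - target) ** 2)

        # compare final errors for a less redundant vs. a more redundant network
        print(train(N=10), train(N=100))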

  10. Pleasurable music affects reinforcement learning according to the listener

    PubMed Central

    Gold, Benjamin P.; Frank, Michael J.; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  11. Pleasurable music affects reinforcement learning according to the listener.

    PubMed

    Gold, Benjamin P; Frank, Michael J; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy.

  12. Multiagent cooperation and competition with deep reinforcement learning.

    PubMed

    Tampuu, Ardi; Matiisen, Tambet; Kodelja, Dorian; Kuzovkin, Ilya; Korjus, Kristjan; Aru, Juhan; Aru, Jaan; Vicente, Raul

    2017-01-01

    Evolution of cooperation and competition can appear when multiple adaptive agents share a biological, social, or technological niche. In the present work we study how cooperation and competition emerge between autonomous agents that learn by reinforcement while using only their raw visual input as the state representation. In particular, we extend the Deep Q-Learning framework to multiagent environments to investigate the interaction between two learning agents in the well-known video game Pong. By manipulating the classical rewarding scheme of Pong we show how competitive and collaborative behaviors emerge. We also describe the progression from competitive to collaborative behavior when the incentive to cooperate is increased. Finally we show how learning by playing against another adaptive agent, instead of against a hard-wired algorithm, results in more robust strategies. The present work shows that Deep Q-Networks can become a useful tool for studying decentralized learning of multiagent systems coping with high-dimensional environments.
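
    The study itself trains Deep Q-Networks on raw pixels; as a much smaller sketch of the key manipulation, the snippet below sets up two independent tabular Q-learners in a toy rally game whose reward scheme can be tuned from fully competitive (the scorer gains) to fully cooperative (both agents lose when the rally ends). Everything here is illustrative.

        import random
        from collections import defaultdict

        rho = 1.0   # scoring agent's reward: +1 = competitive, -1 = fully cooperative

        class QAgent:
            def __init__(self, actions, alpha=0.1, gamma=0.9, eps=0.1):
                self.q = defaultdict(float)
                self.actions, self.alpha, self.gamma, self.eps = actions, alpha, gamma, eps

            def act(self, state):
                if random.random() < self.eps:
                    return random.choice(self.actions)
                return max(self.actions, key=lambda a: self.q[(state, a)])

            def learn(self, s, a, r, s2):
                best_next = max(self.q[(s2, a2)] for a2 in self.actions)
                self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

        def step(a1, a2):
            # "smash" risks ending the rally; "return" keeps the ball in play
            if a1 == "smash" and random.random() < 0.5:
                return (rho, -1.0), "point"      # agent 1 scored, agent 2 conceded
            if a2 == "smash" and random.random() < 0.5:
                return (-1.0, rho), "point"
            return (0.0, 0.0), "rally"

        agents = [QAgent(["return", "smash"]), QAgent(["return", "smash"])]
        state = "rally"
        for _ in range(20000):
            a1, a2 = agents[0].act(state), agents[1].act(state)
            (r1, r2), nxt = step(a1, a2)
            agents[0].learn(state, a1, r1, nxt)
            agents[1].learn(state, a2, r2, nxt)
            state = "rally"                       # the rally restarts after every point
        # with rho = -1 both agents learn to keep the ball in play (cooperation);
        # with rho = +1 smashing pays off and competitive play emerges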

  13. Emotional Multiagent Reinforcement Learning in Spatial Social Dilemmas.

    PubMed

    Yu, Chao; Zhang, Minjie; Ren, Fenghui; Tan, Guozhen

    2015-12-01

    Social dilemmas have attracted extensive interest in the research of multiagent systems in order to study the emergence of cooperative behaviors among selfish agents. Understanding how agents can achieve cooperation in social dilemmas through learning from local experience is a critical problem that has motivated researchers for decades. This paper investigates the possibility of exploiting emotions in agent learning in order to facilitate the emergence of cooperation in social dilemmas. In particular, the spatial version of social dilemmas is considered to study the impact of local interactions on the emergence of cooperation in the whole system. A double-layered emotional multiagent reinforcement learning framework is proposed to endow agents with internal cognitive and emotional capabilities that can drive these agents to learn cooperative behaviors. Experimental results reveal that various network topologies and agent heterogeneities have significant impacts on agent learning behaviors in the proposed framework, and under certain circumstances, high levels of cooperation can be achieved among the agents.

  14. Reinforcement learning for high-level fuzzy Petri nets.

    PubMed

    Shen, V L

    2003-01-01

    The author has developed a reinforcement learning algorithm for high-level fuzzy Petri net (HLFPN) models in order to perform structure and parameter learning simultaneously. In addition to the HLFPN itself, the differences and similarities among a variety of Petri net subclasses are also discussed. As compared with the fuzzy adaptive learning control network (FALCON), the HLFPN model preserves the following advantages: 1) it offers more flexible learning capability because it is able to model both IF-THEN and IF-THEN-ELSE rules; 2) it allows multiple heterogeneous outputs to be drawn if they exist; 3) it offers a more compact data structure for fuzzy production rules so as to save information storage; and 4) it is able to learn faster due to its structural reduction. Finally, the main results are presented in the form of seven propositions and supported by experiments.

  15. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading.

    PubMed

    Deng, Yue; Bao, Feng; Kong, Youyong; Ren, Zhiquan; Dai, Qionghai

    2017-03-01

    Can we train a computer to beat experienced traders at financial asset trading? In this paper, we try to address this challenge by introducing a recurrent deep neural network (NN) for real-time financial signal representation and trading. Our model is inspired by two biologically related learning concepts: deep learning (DL) and reinforcement learning (RL). In the framework, the DL part automatically senses the dynamic market condition for informative feature learning. Then, the RL module interacts with the deep representations and makes trading decisions to accumulate the ultimate rewards in an unknown environment. The learning system is implemented in a complex NN that exhibits both deep and recurrent structures. Hence, we propose a task-aware backpropagation-through-time method to cope with the vanishing-gradient issue in deep training. The robustness of the neural system is verified on both the stock and the commodity futures markets under broad testing conditions.

  16. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling.

    PubMed

    Redish, A David; Jensen, Steve; Johnson, Adam; Kurth-Nelson, Zeb

    2007-07-01

    Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated value and are thus unable to capture renewal. TDRL models are based on the hypothesis that dopamine carries a reward prediction error signal; these models predict reward by driving that reward error to zero. The authors construct a TDRL model that can accommodate extinction and renewal through two simple processes: (a) a TDRL process that learns the value of situation-action pairs and (b) a situation recognition process that categorizes the observed cues into situations. This model has implications for dysfunctional states, including relapse after addiction and problem gambling.
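
    A compact sketch of the two processes named above, assuming a nearest-prototype situation classifier placed in front of an ordinary tabular value learner: when a constellation of cues is too far from any known situation, a new situation (with fresh values) is created, which is how an "extinction situation" can form without overwriting the original association. Thresholds and cue vectors are illustrative.

        import numpy as np

        class SituationTDRL:
            def __init__(self, actions, new_situation_dist=1.0, alpha=0.1, gamma=0.9):
                self.prototypes = []      # one cue-vector prototype per recognized situation
                self.q = []               # per-situation action values
                self.actions = actions
                self.thresh = new_situation_dist
                self.alpha, self.gamma = alpha, gamma

            def situation(self, cues):
                """Categorize observed cues into a situation; create a new one if needed."""
                cues = np.asarray(cues, dtype=float)
                if self.prototypes:
                    dists = [np.linalg.norm(cues - p) for p in self.prototypes]
                    i = int(np.argmin(dists))
                    if dists[i] < self.thresh:
                        return i
                self.prototypes.append(cues)
                self.q.append({a: 0.0 for a in self.actions})
                return len(self.prototypes) - 1

            def update(self, cues, action, reward, next_cues):
                s, s2 = self.situation(cues), self.situation(next_cues)
                target = reward + self.gamma * max(self.q[s2].values())
                self.q[s][action] += self.alpha * (target - self.q[s][action])

        agent = SituationTDRL(actions=["press", "wait"])
        agent.update([1.0, 0.0], "press", 1.0, [1.0, 0.0])   # rewarded (acquisition) context
        agent.update([0.0, 1.0], "press", 0.0, [0.0, 1.0])   # extinction context
        print(len(agent.prototypes), agent.q)   # two situations; the old value survives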

  17. Neural control of dopamine neurotransmission: implications for reinforcement learning.

    PubMed

    Aggarwal, Mayank; Hyland, Brian I; Wickens, Jeffery R

    2012-04-01

    In the past few decades there has been remarkable convergence of machine learning with neurobiological understanding of reinforcement learning mechanisms, exemplified by temporal difference (TD) learning models. The anatomy of the basal ganglia provides a number of potential substrates for instantiation of the TD mechanism. In contrast to the traditional concept of direct and indirect pathway outputs from the striatum, we emphasize that projection neurons of the striatum are branched and individual striatofugal neurons innervate both globus pallidus externa and globus pallidus interna/substantia nigra (GPi/SNr). This suggests that the GPi/SNr has the necessary inputs to operate as the source of a TD signal. We also discuss the mechanism for the timing processes necessary for learning in the TD framework. The TD framework has been particularly successful in analysing electrophysiological recordings from dopamine (DA) neurons during learning, in terms of reward prediction error. However, present understanding of the neural control of DA release is limited, and hence the neural mechanisms involved are incompletely understood. Inhibition is very conspicuously present among the inputs to the DA neurons, with inhibitory synapses accounting for the majority of synapses on DA neurons. Furthermore, synchronous firing of the DA neuron population requires disinhibition and excitation to occur together in a coordinated manner. We conclude that the inhibitory circuits impinging directly or indirectly on the DA neurons play a central role in the control of DA neuron activity and further investigation of these circuits may provide important insight into the biological mechanisms of reinforcement learning.
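
    For reference, the reward prediction error at the core of the TD framework discussed above takes the standard temporal-difference form

        \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

    where V is the learned value estimate and \gamma is the temporal discount factor; on the dopamine account reviewed here, phasic DA activity is taken to report \delta_t.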

  18. Robot Cognitive Control with a Neurophysiologically Inspired Reinforcement Learning Model

    PubMed Central

    Khamassi, Mehdi; Lallée, Stéphane; Enel, Pierre; Procyk, Emmanuel; Dominey, Peter F.

    2011-01-01

    A major challenge in modern robotics is to liberate robots from controlled industrial settings, and allow them to interact with humans and changing environments in the real-world. The current research attempts to determine if a neurophysiologically motivated model of cortical function in the primate can help to address this challenge. Primates are endowed with cognitive systems that allow them to maximize the feedback from their environment by learning the values of actions in diverse situations and by adjusting their behavioral parameters (i.e., cognitive control) to accommodate unexpected events. In such contexts uncertainty can arise from at least two distinct sources – expected uncertainty resulting from noise during sensory-motor interaction in a known context, and unexpected uncertainty resulting from the changing probabilistic structure of the environment. However, it is not clear how neurophysiological mechanisms of reinforcement learning and cognitive control integrate in the brain to produce efficient behavior. Based on primate neuroanatomy and neurophysiology, we propose a novel computational model for the interaction between lateral prefrontal and anterior cingulate cortex reconciling previous models dedicated to these two functions. We deployed the model in two robots and demonstrate that, based on adaptive regulation of a meta-parameter β that controls the exploration rate, the model can robustly deal with the two kinds of uncertainties in the real-world. In addition the model could reproduce monkey behavioral performance and neurophysiological data in two problem-solving tasks. A last experiment extends this to human–robot interaction with the iCub humanoid, and novel sources of uncertainty corresponding to “cheating” by the human. The combined results provide concrete evidence for the ability of neurophysiologically inspired cognitive systems to control advanced robots in the real-world. PMID:21808619
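
    A minimal sketch of the meta-parameter regulation idea, assuming the softmax inverse temperature β is nudged upward while feedback matches expectations and downward when surprising outcomes accumulate (so that unexpected uncertainty reopens exploration); the specific update rule and constants below are illustrative, not the model's exact equations.

        import math
        import random

        beta, beta_min, beta_max = 1.0, 0.2, 10.0
        eta = 0.5                        # meta-learning rate for beta (illustrative)
        values = {"A": 0.6, "B": 0.4}    # current action values

        def softmax_choice(values, beta):
            actions = list(values)
            weights = [math.exp(beta * values[a]) for a in actions]
            r = random.random() * sum(weights)
            for a, w in zip(actions, weights):
                r -= w
                if r <= 0:
                    return a
            return actions[-1]

        def regulate_beta(beta, prediction_error):
            # surprising feedback (large |error|) lowers beta -> more exploration;
            # expected feedback raises beta -> more exploitation
            beta += eta * (0.5 - abs(prediction_error))
            return min(max(beta, beta_min), beta_max)

        action = softmax_choice(values, beta)
        reward = random.random()                      # stand-in for task feedback
        beta = regulate_beta(beta, reward - values[action])
        print(action, round(beta, 2))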

  19. Robot cognitive control with a neurophysiologically inspired reinforcement learning model.

    PubMed

    Khamassi, Mehdi; Lallée, Stéphane; Enel, Pierre; Procyk, Emmanuel; Dominey, Peter F

    2011-01-01

    A major challenge in modern robotics is to liberate robots from controlled industrial settings, and allow them to interact with humans and changing environments in the real-world. The current research attempts to determine if a neurophysiologically motivated model of cortical function in the primate can help to address this challenge. Primates are endowed with cognitive systems that allow them to maximize the feedback from their environment by learning the values of actions in diverse situations and by adjusting their behavioral parameters (i.e., cognitive control) to accommodate unexpected events. In such contexts uncertainty can arise from at least two distinct sources - expected uncertainty resulting from noise during sensory-motor interaction in a known context, and unexpected uncertainty resulting from the changing probabilistic structure of the environment. However, it is not clear how neurophysiological mechanisms of reinforcement learning and cognitive control integrate in the brain to produce efficient behavior. Based on primate neuroanatomy and neurophysiology, we propose a novel computational model for the interaction between lateral prefrontal and anterior cingulate cortex reconciling previous models dedicated to these two functions. We deployed the model in two robots and demonstrate that, based on adaptive regulation of a meta-parameter β that controls the exploration rate, the model can robustly deal with the two kinds of uncertainties in the real-world. In addition the model could reproduce monkey behavioral performance and neurophysiological data in two problem-solving tasks. A last experiment extends this to human-robot interaction with the iCub humanoid, and novel sources of uncertainty corresponding to "cheating" by the human. The combined results provide concrete evidence for the ability of neurophysiologically inspired cognitive systems to control advanced robots in the real-world.

  20. Individual differences in reinforcement learning: behavioral, electrophysiological, and neuroimaging correlates.

    PubMed

    Santesso, Diane L; Dillon, Daniel G; Birk, Jeffrey L; Holmes, Avram J; Goetz, Elena; Bogdan, Ryan; Pizzagalli, Diego A

    2008-08-15

    During reinforcement learning, phasic modulations of activity in midbrain dopamine neurons are conveyed to the dorsal anterior cingulate cortex (dACC) and basal ganglia (BG) and serve to guide adaptive responding. While the animal literature supports a role for the dACC in integrating reward history over time, most human electrophysiological studies of dACC function have focused on responses to single positive and negative outcomes. The present electrophysiological study investigated the role of the dACC in probabilistic reward learning in healthy subjects using a task that required integration of reinforcement history over time. We recorded the feedback-related negativity (FRN) to reward feedback in subjects who developed a response bias toward a more frequently rewarded ("rich") stimulus ("learners") versus subjects who did not ("non-learners"). Compared to non-learners, learners showed more positive (i.e., smaller) FRNs and greater dACC activation upon receiving reward for correct identification of the rich stimulus. In addition, dACC activation and a bias to select the rich stimulus were positively correlated. The same participants also completed a monetary incentive delay (MID) task administered during functional magnetic resonance imaging. Compared to non-learners, learners displayed stronger BG responses to reward in the MID task. These findings raise the possibility that learners in the probabilistic reinforcement task were characterized by stronger dACC and BG responses to rewarding outcomes. Furthermore, these results highlight the importance of the dACC to probabilistic reward learning in humans.

  1. Time-Extended Policies in Multi-Agent Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2004-01-01

    Reinforcement learning methods perform well in many domains where a single agent needs to take a sequence of actions to perform a task. These methods use sequences of single-time-step rewards to create a policy that tries to maximize a time-extended utility, which is a (possibly discounted) sum of these rewards. In this paper we build on our previous work showing how these methods can be extended to a multi-agent environment where each agent creates its own policy that works towards maximizing a time-extended global utility over all agents' actions. We show improved methods for creating time-extended utilities for the agents that are both "aligned" with the global utility and "learnable." We then show how to create single-time-step rewards while avoiding the pitfall of rewards that are aligned with the global reward yet lead to utilities not aligned with the global utility. Finally, we apply these reward functions to the multi-agent Gridworld problem. We explicitly quantify a utility's learnability and alignment, and show that reinforcement learning agents using the prescribed reward functions successfully trade off learnability and alignment. As a result they outperform both global (e.g., team games) and local (e.g., "perfectly learnable") reinforcement learning solutions by as much as an order of magnitude.
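
    The "aligned and learnable" rewards referred to here are typically difference rewards: the global utility minus the global utility that would have been obtained with the agent removed (or clamped to a default action). A schematic version with a made-up global utility is sketched below.

        def global_utility(joint_actions):
            """Hypothetical team objective: number of distinct gridworld cells covered."""
            return float(len(set(joint_actions.values())))

        def difference_reward(joint_actions, agent, default_action=None):
            """D_i = G(z) - G(z with agent i removed or clamped to a default action)."""
            counterfactual = dict(joint_actions)
            if default_action is None:
                counterfactual.pop(agent)           # remove the agent entirely
            else:
                counterfactual[agent] = default_action
            return global_utility(joint_actions) - global_utility(counterfactual)

        joint = {"agent1": (0, 0), "agent2": (0, 0), "agent3": (1, 1)}
        print(difference_reward(joint, "agent2"))   # 0.0: agent2 duplicates agent1's cell
        print(difference_reward(joint, "agent3"))   # 1.0: agent3 covers a unique cell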

  2. Cocaine addiction as a homeostatic reinforcement learning disorder.

    PubMed

    Keramati, Mehdi; Durand, Audrey; Girardeau, Paul; Gutkin, Boris; Ahmed, Serge H

    2017-03-01

    Drug addiction implicates both reward learning and homeostatic regulation mechanisms of the brain. This has stimulated 2 partially successful theoretical perspectives on addiction. Many important aspects of addiction, however, remain to be explained within a single, unified framework that integrates the 2 mechanisms. Building upon a recently developed homeostatic reinforcement learning theory, the authors focus on a key transition stage of addiction that is well modeled in animals, escalation of drug use, and propose a computational theory of cocaine addiction where cocaine reinforces behavior due to its rapid homeostatic corrective effect, whereas its chronic use induces slow and long-lasting changes in homeostatic setpoint. Simulations show that our new theory accounts for key behavioral and neurobiological features of addiction, most notably, escalation of cocaine use, drug-primed craving and relapse, individual differences underlying dose-response curves, and dopamine D2-receptor downregulation in addicts. The theory also generates unique predictions about cocaine self-administration behavior in rats that are confirmed by new experimental results. Viewing addiction as a homeostatic reinforcement learning disorder coherently explains many behavioral and neurobiological aspects of the transition to cocaine addiction, and suggests a new perspective toward understanding addiction.
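
    A bare-bones sketch of the homeostatic reinforcement learning idea the theory builds on, assuming reward is the reduction in drive (distance of an internal state from its setpoint), that a drug dose rapidly pushes the internal state toward the setpoint, and that chronic use slowly shifts the setpoint itself; all quantities are illustrative.

        def drive(internal_state, setpoint):
            # drive = squared distance of the internal state from the homeostatic setpoint
            return (setpoint - internal_state) ** 2

        def take_dose(internal_state, setpoint, dose, setpoint_creep=0.01):
            """One self-administration: returns (reward, new internal state, new setpoint)."""
            before = drive(internal_state, setpoint)
            new_state = internal_state + dose                # rapid corrective effect
            reward = before - drive(new_state, setpoint)     # drive reduction is reinforcing
            new_setpoint = setpoint + setpoint_creep * dose  # slow, long-lasting setpoint shift
            return reward, new_state, new_setpoint

        state, setpoint = 0.0, 1.0
        for trial in range(5):
            r, state, setpoint = take_dose(state, setpoint, dose=0.5)
            state *= 0.5                                     # drug effect decays between doses
            print(f"trial {trial}: reward={r:.3f}, setpoint={setpoint:.3f}")
        # in the theory, the slow upward drift of the setpoint is what drives escalation of use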

  3. Reinforcement learning output feedback NN control using deterministic learning technique.

    PubMed

    Xu, Bin; Yang, Chenguang; Shi, Zhongke

    2014-03-01

    In this brief, a novel adaptive-critic-based neural network (NN) controller is investigated for nonlinear pure-feedback systems. The controller design is based on the transformed predictor form, and the actor-critic NN control architecture includes two NNs: the critic NN is used to approximate the strategic utility function, and the action NN is employed to minimize both the strategic utility function and the tracking error. A deterministic learning technique has been employed to guarantee that the partial persistent excitation condition of internal states is satisfied during tracking control to a periodic reference orbit. The uniform ultimate boundedness of closed-loop signals is shown via Lyapunov stability analysis. Simulation results are presented to demonstrate the effectiveness of the proposed control.

  4. A general framework for context-specific image segmentation using reinforcement learning.

    PubMed

    Wang, Lichao; Lekadir, Karim; Lee, Su-Lin; Merrifield, Robert; Yang, Guang-Zhong

    2013-05-01

    This paper presents an online reinforcement learning framework for medical image segmentation. The concept of context-specific segmentation is introduced such that the model is adaptive not only to a defined objective function but also to the user's intention and prior knowledge. Based on this concept, a general segmentation framework using reinforcement learning is proposed, which can assimilate specific user intention and behavior seamlessly in the background. The method is able to establish an implicit model for a large state-action space and generalizable to different image contents or segmentation requirements based on learning in situ. In order to demonstrate the practical value of the method, example applications of the technique to four different segmentation problems are presented. Detailed validation results have shown that the proposed framework is able to significantly reduce user interaction, while maintaining both segmentation accuracy and consistency.

  5. Cerebellar and Prefrontal Cortex Contributions to Adaptation, Strategies, and Reinforcement Learning

    PubMed Central

    Taylor, Jordan A.; Ivry, Richard B.

    2014-01-01

    Traditionally, motor learning has been studied as an implicit learning process, one in which movement errors are used to improve performance in a continuous, gradual manner. The cerebellum figures prominently in this literature given well-established ideas about the role of this system in error-based learning and the production of automatized skills. Recent developments have brought into focus the relevance of multiple learning mechanisms for sensorimotor learning. These include processes involving repetition, reinforcement learning, and strategy utilization. We examine these developments, considering their implications for understanding cerebellar function and how this structure interacts with other neural systems to support motor learning. Converging lines of evidence from behavioral, computational, and neuropsychological studies suggest a fundamental distinction between processes that use error information to improve action execution or action selection. While the cerebellum is clearly linked to the former, its role in the latter remains an open question. PMID:24916295

  6. Cerebellar and prefrontal cortex contributions to adaptation, strategies, and reinforcement learning.

    PubMed

    Taylor, Jordan A; Ivry, Richard B

    2014-01-01

    Traditionally, motor learning has been studied as an implicit learning process, one in which movement errors are used to improve performance in a continuous, gradual manner. The cerebellum figures prominently in this literature given well-established ideas about the role of this system in error-based learning and the production of automatized skills. Recent developments have brought into focus the relevance of multiple learning mechanisms for sensorimotor learning. These include processes involving repetition, reinforcement learning, and strategy utilization. We examine these developments, considering their implications for understanding cerebellar function and how this structure interacts with other neural systems to support motor learning. Converging lines of evidence from behavioral, computational, and neuropsychological studies suggest a fundamental distinction between processes that use error information to improve action execution or action selection. While the cerebellum is clearly linked to the former, its role in the latter remains an open question.

  7. REINFORCEMENT IN CLASSROOM LEARNING. PART II, STUDIES OF REINFORCEMENT IN SIMULATED CLASSROOM SITUATIONS. PART III, IDENTIFICATION OF REINFORCERS OF HUMAN BEHAVIOR.

    ERIC Educational Resources Information Center

    TRAVERS, ROBERT M.W.; AND OTHERS

    REINFORCEMENT CONCEPTS DERIVED LARGELY FROM RESEARCH OF SUBHUMAN SUBJECTS WERE TESTED FOR APPLICABILITY TO HUMAN-LEARNING SITUATIONS SIMILAR TO THOSE THAT OCCUR IN SCHOOLS. A SERIES OF EXPLORATORY STUDIES CONDUCTED IS DESCRIBED IN PART II OF THIS REPORT. IN PART III, TWO EXPERIMENTS CONDUCTED TO DETERMINE THE REINFORCING VALUE OF DIFFERENT STIMULI…

  8. A Discussion of Possibility of Reinforcement Learning Using Event-Related Potential in BCI

    NASA Astrophysics Data System (ADS)

    Yamagishi, Yuya; Tsubone, Tadashi; Wada, Yasuhiro

    Recently, brain-computer interfaces (BCIs), which provide a direct pathway between an external device such as a computer or a robot and the human brain, have attracted a great deal of attention. Since a BCI can control machines such as robots using brain activity alone, without relying on voluntary muscle control, it may become a useful communication tool for handicapped persons, for instance, patients with amyotrophic lateral sclerosis. However, in order to realize a BCI system that can perform precise tasks in various environments, the control rules must be designed to adapt to dynamic environments. Reinforcement learning is one approach to designing such control rules. If this reinforcement learning could be driven by brain activity itself, it would lead to a BCI with general versatility. In this research, we focused on the P300 event-related potential as an alternative signal for the reward in reinforcement learning. We discriminated between success and failure trials from single-trial P300 EEG responses using the proposed discrimination algorithm based on a support vector machine. The possibility of reinforcement learning was examined from the viewpoint of the number of trials that could be discriminated. The results showed that learning would be possible for most subjects.

  9. Towards autonomous neuroprosthetic control using Hebbian reinforcement learning

    NASA Astrophysics Data System (ADS)

    Mahmoudi, Babak; Pohlmeyer, Eric A.; Prins, Noeline W.; Geng, Shijia; Sanchez, Justin C.

    2013-12-01

    Objective. Our goal was to design an adaptive neuroprosthetic controller that could learn the mapping from neural states to prosthetic actions and automatically adjust adaptation using only a binary evaluative feedback as a measure of desirability/undesirability of performance. Approach. Hebbian reinforcement learning (HRL) in a connectionist network was used for the design of the adaptive controller. The method combines the efficiency of supervised learning with the generality of reinforcement learning. The convergence properties of this approach were studied using both closed-loop control simulations and open-loop simulations that used primate neural data from robot-assisted reaching tasks. Main results. The HRL controller was able to perform classification and regression tasks using its episodic and sequential learning modes, respectively. In our experiments, the HRL controller quickly achieved convergence to an effective control policy, followed by robust performance. The controller also automatically stopped adapting the parameters after converging to a satisfactory control policy. Additionally, when the input neural vector was reorganized, the controller resumed adaptation to maintain performance. Significance. By estimating an evaluative feedback directly from the user, the HRL control algorithm may provide an efficient method for autonomous adaptation of neuroprosthetic systems. This method may enable the user to teach the controller the desired behavior using only a simple feedback signal.
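
    A heavily simplified sketch of a Hebbian reinforcement rule of the general kind described: a single linear readout whose weights are updated by the product of presynaptic activity, an exploratory output perturbation, and a binary (+1/-1) evaluation of whether the perturbed action improved performance. The HRL architecture in the paper is more elaborate; everything here is illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        n_inputs, eta, sigma = 8, 0.02, 0.2
        w = np.zeros(n_inputs)                  # decoder weights: neural state -> action
        target_w = rng.normal(size=n_inputs)    # unknown mapping the user actually wants

        for step in range(5000):
            x = rng.normal(size=n_inputs)       # current neural state vector
            noise = sigma * rng.normal()        # exploratory perturbation of the action
            action = w @ x + noise
            desired = target_w @ x              # never observed directly by the learner
            # binary evaluative feedback: did the perturbed action beat the unperturbed one?
            feedback = 1.0 if abs(action - desired) < abs(w @ x - desired) else -1.0
            # Hebbian reinforcement update: presynaptic activity x times the exploratory
            # term, gated by the +1/-1 evaluation
            w += eta * feedback * noise * x

        # residual per-weight error; it shrinks toward a floor set by the exploration noise
        print(np.round(np.abs(w - target_w).max(), 2))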

  10. Distributed Reinforcement Learning Approach for Vehicular Ad Hoc Networks

    NASA Astrophysics Data System (ADS)

    Wu, Celimuge; Kumekawa, Kazuya; Kato, Toshihiko

    In Vehicular Ad hoc Networks (VANETs), general purpose ad hoc routing protocols such as AODV cannot work efficiently due to the frequent changes in network topology caused by vehicle movement. This paper proposes a VANET routing protocol QLAODV (Q-Learning AODV) which suits unicast applications in high mobility scenarios. QLAODV is a distributed reinforcement learning routing protocol, which uses a Q-Learning algorithm to infer network state information and uses unicast control packets to check the path availability in a real time manner in order to allow Q-Learning to work efficiently in a highly dynamic network environment. QLAODV is favored by its dynamic route change mechanism, which makes it capable of reacting quickly to network topology changes. We present an analysis of the performance of QLAODV by simulation using different mobility models. The simulation results show that QLAODV can efficiently handle unicast applications in VANETs.
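
    The routing decision in a Q-Learning-based protocol of this kind can be pictured as each node keeping a Q-value per (destination, next hop) pair and updating it from feedback carried by control packets. The sketch below is a generic Q-routing style update with invented names, not the QLAODV specification.

        from collections import defaultdict

        ALPHA, GAMMA = 0.3, 0.9

        class QRouter:
            def __init__(self, node_id):
                self.node_id = node_id
                self.q = defaultdict(float)   # (destination, next_hop) -> route quality estimate

            def choose_next_hop(self, destination, neighbors):
                # pick the neighbor currently believed to offer the best route
                return max(neighbors, key=lambda n: self.q[(destination, n)])

            def update(self, destination, next_hop, link_quality, neighbor_best_q):
                """Feedback from a control packet: the neighbor reports its own best Q toward
                the destination, and link_quality scores the hop (e.g. link stability)."""
                old = self.q[(destination, next_hop)]
                target = link_quality + GAMMA * neighbor_best_q
                self.q[(destination, next_hop)] = old + ALPHA * (target - old)

        router = QRouter("vehicle_7")
        router.update("gateway", next_hop="vehicle_3", link_quality=0.8, neighbor_best_q=0.5)
        print(router.choose_next_hop("gateway", neighbors=["vehicle_3", "vehicle_9"]))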

  11. Utilising reinforcement learning to develop strategies for driving auditory neural implants

    NASA Astrophysics Data System (ADS)

    Lee, Geoffrey W.; Zambetta, Fabio; Li, Xiaodong; Paolini, Antonio G.

    2016-08-01

    Objective. In this paper we propose a novel application of reinforcement learning to the area of auditory neural stimulation. We aim to develop a simulation environment which is based on real neurological responses to auditory and electrical stimulation in the cochlear nucleus (CN) and inferior colliculus (IC) of an animal model. Using this simulator we implement closed loop reinforcement learning algorithms to determine which methods are most effective at learning effective acoustic neural stimulation strategies. Approach. By recording a comprehensive set of acoustic frequency presentations and neural responses from a set of animals we created a large database of neural responses to acoustic stimulation. Extensive electrical stimulation in the CN and the recording of neural responses in the IC provides a mapping of how the auditory system responds to electrical stimuli. The combined dataset is used as the foundation for the simulator, which is used to implement and test learning algorithms. Main results. Reinforcement learning, utilising a modified n-Armed Bandit solution, is implemented to demonstrate the model’s function. We show the ability to effectively learn stimulation patterns which mimic the cochlea’s ability to convert acoustic frequencies to neural activity. Learning effective replication using neural stimulation takes less than 20 min under continuous testing. Significance. These results show the utility of reinforcement learning in the field of neural stimulation. These results can be coupled with existing sound processing technologies to develop new auditory prosthetics that are adaptable to the recipient’s current auditory pathway. The same process can theoretically be abstracted to other sensory and motor systems to develop similar electrical replication of neural signals.
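
    The learning core described here is an n-armed bandit variant; a minimal epsilon-greedy version is sketched below, with each arm standing for a candidate stimulation pattern and the reward standing for how closely the evoked IC response matches the acoustic target. The reward values are made up for illustration.

        import random

        n_arms = 8                       # candidate stimulation patterns
        counts = [0] * n_arms
        values = [0.0] * n_arms          # running estimate of response similarity per pattern
        epsilon = 0.1

        def similarity_to_target(arm):
            # stand-in for comparing the evoked neural response to the acoustic target
            true_quality = [0.1, 0.3, 0.2, 0.9, 0.4, 0.5, 0.2, 0.6]
            return true_quality[arm] + random.gauss(0, 0.05)

        for trial in range(2000):
            if random.random() < epsilon:
                arm = random.randrange(n_arms)                     # explore
            else:
                arm = max(range(n_arms), key=lambda a: values[a])  # exploit
            reward = similarity_to_target(arm)
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]    # incremental sample average

        print(max(range(n_arms), key=lambda a: values[a]))   # settles on the best pattern (index 3)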

  12. Learning and altering behaviours by reinforcement: neurocognitive differences between children and adults.

    PubMed

    Shephard, E; Jackson, G M; Groom, M J

    2014-01-01

    This study examined neurocognitive differences between children and adults in the ability to learn and adapt simple stimulus-response associations through feedback. Fourteen typically developing children (mean age=10.2) and 15 healthy adults (mean age=25.5) completed a simple task in which they learned to associate visually presented stimuli with manual responses based on performance feedback (acquisition phase), and then reversed and re-learned those associations following an unexpected change in reinforcement contingencies (reversal phase). Electrophysiological activity was recorded throughout task performance. We found no group differences in learning-related changes in performance (reaction time, accuracy) or in the amplitude of event-related potentials (ERPs) associated with stimulus processing (P3 ERP) or feedback processing (feedback-related negativity; FRN) during the acquisition phase. However, children's performance was significantly more disrupted by the reversal than adults and FRN amplitudes were significantly modulated by the reversal phase in children but not adults. These findings indicate that children have specific difficulties with reinforcement learning when acquired behaviours must be altered. This may be caused by the added demands on immature executive functioning, specifically response monitoring, created by the requirement to reverse the associations, or a developmental difference in the way in which children and adults approach reinforcement learning.

  13. Reinforced AdaBoost learning for object detection with local pattern representations.

    PubMed

    Lee, Younghyun; Han, David K; Ko, Hanseok

    2013-01-01

    A reinforced AdaBoost learning algorithm is proposed for object detection with local pattern representations. In implementing AdaBoost learning, the proposed algorithm employs an exponential criterion as a cost function and Newton's method for its optimization. In particular, we introduce an optimal selection of weak classifiers minimizing the cost function and derive the reinforced predictions based on a judicious confidence estimate to determine the classification results. The weak classifier of the proposed method produces real-valued predictions, while that of the conventional AdaBoost method produces integer-valued predictions of +1 or -1. Hence, in conventional learning algorithms, all sample weights are updated at the same rate. By contrast, the proposed learning algorithm allows the sample weights to be updated individually depending on the confidence level of each weak classifier prediction, thereby reducing the number of weak classifier iterations needed for convergence. Experimental classification performance on human face and license plate images confirms that the proposed method requires a smaller number of weak classifiers than the conventional learning algorithm, resulting in faster learning and classification. An object detector implemented based on the proposed learning algorithm yields better performance in field tests, with a higher detection rate and fewer false positives than the conventional learning algorithm.
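
    The contrast drawn above (real-valued weak predictions with per-sample, confidence-dependent weight updates versus uniform updates driven by ±1 predictions) is essentially the contrast between confidence-rated and discrete AdaBoost. A compact confidence-rated variant with decision stumps is sketched below as one plausible reading of the idea, not the paper's exact algorithm.

        import numpy as np

        def fit_stump(X, y, w):
            """Confidence-rated decision stump chosen to minimize the exponential criterion."""
            best = None
            for f in range(X.shape[1]):
                for t in np.unique(X[:, f]):
                    left = X[:, f] <= t
                    c = np.zeros(2)
                    for j, mask in enumerate((left, ~left)):
                        wp = w[mask & (y > 0)].sum() + 1e-10
                        wm = w[mask & (y < 0)].sum() + 1e-10
                        c[j] = 0.5 * np.log(wp / wm)     # real-valued (confidence-rated) output
                    h = np.where(left, c[0], c[1])
                    loss = np.sum(w * np.exp(-y * h))    # exponential cost function
                    if best is None or loss < best[0]:
                        best = (loss, f, t, c, h)
            return best[1:]

        def fit_boost(X, y, n_rounds=5):
            w = np.full(len(y), 1.0 / len(y))
            model = []
            for _ in range(n_rounds):
                f, t, c, h = fit_stump(X, y, w)
                model.append((f, t, c))
                w = w * np.exp(-y * h)        # each sample is reweighted individually,
                w /= w.sum()                  # in proportion to the stump's confidence on it
            return model

        def predict(model, X):
            score = np.zeros(len(X))
            for f, t, c in model:
                score += np.where(X[:, f] <= t, c[0], c[1])
            return np.sign(score)

        X = np.array([[0.0], [1.0], [2.0], [3.0]])
        y = np.array([-1, -1, 1, 1])
        print(predict(fit_boost(X, y), X))    # [-1. -1.  1.  1.] on this toy data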

  14. SAN-RL: combining spreading activation networks and reinforcement learning to learn configurable behaviors

    NASA Astrophysics Data System (ADS)

    Gaines, Daniel M.; Wilkes, Don M.; Kusumalnukool, Kanok; Thongchai, Siripun; Kawamura, Kazuhiko; White, John H.

    2002-02-01

    Reinforcement learning techniques have been successful in allowing an agent to learn a policy for achieving tasks. The overall behavior of the agent can be controlled with an appropriate reward function. However, the policy that is learned is fixed to this reward function. If the user wishes to change his or her preference about how the task is achieved, the agent must be retrained with a new reward function. We address this challenge by combining Spreading Activation Networks and Reinforcement Learning in an approach we call SAN-RL. This approach provides the agent with a causal structure, the spreading activation network, relating goals to the actions that can achieve those goals. This enables the agent to select actions relative to the goal priorities. We combine this with reinforcement learning to enable the agent to learn a policy. Together, these approaches enable the learning of configurable behaviors: policies that can be adapted to meet the current preferences. We compare the approach with Q-learning on a robot navigation task. We demonstrate that SAN-RL exhibits goal-directed behavior before learning, exploits the causal structure of the network to focus its search during learning, and results in configurable behaviors after learning.

  15. Beyond simple reinforcement learning: the computational neurobiology of reward-learning and valuation.

    PubMed

    O'Doherty, John P

    2012-04-01

    Neural computational accounts of reward-learning have been dominated by the hypothesis that dopamine neurons behave like a reward-prediction error and thus facilitate reinforcement learning in striatal target neurons. While this framework is consistent with a lot of behavioral and neural evidence, this theory fails to account for a number of behavioral and neurobiological observations. In this special issue of EJN we feature a combination of theoretical and experimental papers highlighting some of the explanatory challenges faced by simple reinforcement-learning models and describing some of the ways in which the framework is being extended in order to address these challenges.

  16. Oculomotor learning revisited: a model of reinforcement learning in the basal ganglia incorporating an efference copy of motor actions

    PubMed Central

    Fee, Michale S.

    2012-01-01

    In its simplest formulation, reinforcement learning is based on the idea that if an action taken in a particular context is followed by a favorable outcome, then, in the same context, the tendency to produce that action should be strengthened, or reinforced. While reinforcement learning forms the basis of many current theories of basal ganglia (BG) function, these models do not incorporate distinct computational roles for signals that convey context, and those that convey what action an animal takes. Recent experiments in the songbird suggest that vocal-related BG circuitry receives two functionally distinct excitatory inputs. One input is from a cortical region that carries context information about the current “time” in the motor sequence. The other is an efference copy of motor commands from a separate cortical brain region that generates vocal variability during learning. Based on these findings, I propose here a general model of vertebrate BG function that combines context information with a distinct motor efference copy signal. The signals are integrated by a learning rule in which efference copy inputs gate the potentiation of context inputs (but not efference copy inputs) onto medium spiny neurons in response to a rewarded action. The hypothesis is described in terms of a circuit that implements the learning of visually guided saccades. The model makes testable predictions about the anatomical and functional properties of hypothesized context and efference copy inputs to the striatum from both thalamic and cortical sources. PMID:22754501
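
    The proposed rule can be sketched as a three-factor update in which a context input's weight onto a medium spiny neuron is potentiated only when an efference-copy input was active at the same time and the action was rewarded, while the efference-copy weights themselves stay fixed. The rendering below is schematic, with hypothetical variable names.

        import numpy as np

        rng = np.random.default_rng(1)
        n_context, n_efference = 20, 5

        w_context = np.zeros(n_context)              # plastic weights from context inputs
        w_efference = rng.normal(size=n_efference)   # efference-copy weights: not updated
        eta = 0.1

        def update(context, efference_copy, reward):
            """Potentiate context weights, gated by efference-copy activity and reward."""
            gate = float(np.any(efference_copy > 0))   # was a copy of the motor command present?
            global w_context
            w_context = w_context + eta * reward * gate * context   # only context synapses change
            # w_efference is deliberately untouched: the efference copy gates learning,
            # it is not itself learned

        context = rng.random(n_context)                      # e.g. "time in song" code
        efference_copy = (rng.random(n_efference) > 0.5).astype(float)
        update(context, efference_copy, reward=1.0)
        print(np.round(w_context[:5], 3))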

  17. Partial reinforcement effects on learning and extinction of place preferences in the water maze.

    PubMed

    Prados, José; Sansa, Joan; Artigas, Antonio A

    2008-11-01

    In two experiments, two groups of rats were trained in a navigation task according to either a continuous or a partial schedule of reinforcement. In Experiment 1, animals that were given continuous reinforcement extinguished the spatial response of approaching the goal location more readily than animals given partial reinforcement-a partial reinforcement extinction effect. In Experiment 2, after partially or continuously reinforced training, animals were trained in a new task that made use of the same reinforcer according to a continuous reinforcement schedule. Animals initially given partial reinforcement performed better in the novel task than did rats initially given continuous reinforcement. These results replicate, in the spatial domain, well-known partial reinforcement phenomena typically observed in the context of Pavlovian and instrumental conditioning, suggesting that similar principles govern spatial and associative learning. The results reported support the notion that salience modulation processes play a key role in determining partial reinforcement effects.

  18. Dynamics of learning in coupled oscillators tutored with delayed reinforcements.

    PubMed

    Trevisan, M A; Bouzat, S; Samengo, I; Mindlin, G B

    2005-07-01

    In this work we analyze the solutions of a simple system of coupled phase oscillators in which the connectivity is learned dynamically. The model is inspired by the process of learning of birdsongs by oscine birds. An oscillator acts as the generator of a basic rhythm and drives slave oscillators which are responsible for different motor actions. The driving signal arrives at each driven oscillator through two different pathways. One of them is a direct pathway. The other one is a reinforcement pathway, through which the signal arrives delayed. The coupling coefficients between the driving oscillator and the slave ones evolve in time following a Hebbian-like rule. We discuss the conditions under which a driven oscillator is capable of learning to lock to the driver. The resulting phase difference and connectivity are a function of the delay of the reinforcement. Around some specific delays, the system is capable of generating dramatic changes in the phase difference between the driver and the driven systems. We discuss the dynamical mechanism responsible for this effect and possible applications of this learning scheme.

  19. Dynamics of learning in coupled oscillators tutored with delayed reinforcements

    NASA Astrophysics Data System (ADS)

    Trevisan, M. A.; Bouzat, S.; Samengo, I.; Mindlin, G. B.

    2005-07-01

    In this work we analyze the solutions of a simple system of coupled phase oscillators in which the connectivity is learned dynamically. The model is inspired by the process of learning of birdsongs by oscine birds. An oscillator acts as the generator of a basic rhythm and drives slave oscillators which are responsible for different motor actions. The driving signal arrives at each driven oscillator through two different pathways. One of them is a direct pathway. The other one is a reinforcement pathway, through which the signal arrives delayed. The coupling coefficients between the driving oscillator and the slave ones evolve in time following a Hebbian-like rule. We discuss the conditions under which a driven oscillator is capable of learning to lock to the driver. The resulting phase difference and connectivity are a function of the delay of the reinforcement. Around some specific delays, the system is capable of generating dramatic changes in the phase difference between the driver and the driven systems. We discuss the dynamical mechanism responsible for this effect and possible applications of this learning scheme.

  20. Credit assignment in movement-dependent reinforcement learning

    PubMed Central

    Boggess, Matthew J.; Crossley, Matthew J.; Parvin, Darius; Ivry, Richard B.; Taylor, Jordan A.

    2016-01-01

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants’ explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem. PMID:27247404

  1. Credit assignment in movement-dependent reinforcement learning.

    PubMed

    McDougle, Samuel D; Boggess, Matthew J; Crossley, Matthew J; Parvin, Darius; Ivry, Richard B; Taylor, Jordan A

    2016-06-14

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants' explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem.

  2. Dissociable neural representations of reinforcement and belief prediction errors underlie strategic learning.

    PubMed

    Zhu, Lusha; Mathewson, Kyle E; Hsu, Ming

    2012-01-31

    Decision-making in the presence of other competitive intelligent agents is fundamental for social and economic behavior. Such decisions require agents to behave strategically, where in addition to learning about the rewards and punishments available in the environment, they also need to anticipate and respond to actions of others competing for the same rewards. However, whereas we know much about strategic learning at both theoretical and behavioral levels, we know relatively little about the underlying neural mechanisms. Here, we show using a multi-strategy competitive learning paradigm that strategic choices can be characterized by extending the reinforcement learning (RL) framework to incorporate agents' beliefs about the actions of their opponents. Furthermore, using this characterization to generate putative internal values, we used model-based functional magnetic resonance imaging to investigate neural computations underlying strategic learning. We found that the distinct notions of prediction errors derived from our computational model are processed in a partially overlapping but distinct set of brain regions. Specifically, we found that the RL prediction error was correlated with activity in the ventral striatum. In contrast, activity in the ventral striatum, as well as the rostral anterior cingulate (rACC), was correlated with a previously uncharacterized belief-based prediction error. Furthermore, activity in rACC reflected individual differences in degree of engagement in belief learning. These results suggest a model of strategic behavior where learning arises from interaction of dissociable reinforcement and belief-based inputs.
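
    The extension described here, combining a standard reinforcement update with a belief about the opponent's next action, can be sketched roughly as follows; the payoff matrix and mixing weight are illustrative. The belief prediction error is simply the difference between the opponent's observed action and the current belief.

        import numpy as np

        # row player's payoffs in a 2x2 game (illustrative matching-pennies-like structure)
        payoff = np.array([[1.0, -1.0],
                           [-1.0, 1.0]])

        n_actions = 2
        q = np.zeros(n_actions)            # reinforcement values of own actions
        belief = np.full(n_actions, 0.5)   # belief about the opponent's action probabilities
        alpha_rl, alpha_belief = 0.2, 0.2

        def update(my_action, opp_action, reward):
            # reinforcement prediction error on the chosen action
            rl_error = reward - q[my_action]
            q[my_action] += alpha_rl * rl_error
            # belief prediction error: observed opponent action vs. predicted probabilities
            belief_error = np.eye(n_actions)[opp_action] - belief
            belief[:] += alpha_belief * belief_error
            return rl_error, belief_error

        def action_values():
            # hybrid valuation: mix experienced returns with expected payoff under the belief
            return 0.5 * q + 0.5 * (payoff @ belief)

        print(update(my_action=0, opp_action=1, reward=-1.0))
        print(action_values())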

  3. Reinforcement learning on slow features of high-dimensional input streams.

    PubMed

    Legenstein, Robert; Wilbert, Niko; Wiskott, Laurenz

    2010-08-19

    Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
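
    A strongly simplified sketch of the two-stage idea: linear slow feature analysis on a toy high-dimensional stream (the paper uses a hierarchical, nonlinear SFA network on visual input), with the reward-trained stage left as a comment. The toy data and all constants are illustrative.

        import numpy as np

        def slow_feature_analysis(X, n_components=1):
            """Linear SFA: directions (after whitening) whose outputs vary most slowly in time."""
            X = X - X.mean(axis=0)
            cov = np.cov(X, rowvar=False)
            evals, evecs = np.linalg.eigh(cov)
            whiten = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-12)) @ evecs.T
            Z = X @ whiten
            dZ = np.diff(Z, axis=0)                    # temporal derivative of whitened signal
            d_evals, d_evecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
            return whiten @ d_evecs[:, :n_components]  # smallest eigenvalues = slowest features

        # toy "high-dimensional" stream: one slow latent variable mixed into noisy fast channels
        rng = np.random.default_rng(0)
        t = np.linspace(0, 20, 2000)
        slow_latent = np.sin(0.5 * t)
        X = np.outer(slow_latent, rng.normal(size=10)) + 0.5 * rng.normal(size=(2000, 10))

        W = slow_feature_analysis(X, n_components=1)
        slow_feature = X @ W          # compact state representation for the reward-trained stage
        # a simple reward-trained network (e.g. a linear readout updated from reward) would then
        # learn on slow_feature instead of the raw high-dimensional stream
        print(round(abs(np.corrcoef(slow_feature[:, 0], slow_latent)[0, 1]), 2))  # close to 1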

  4. Extinction learning, reconsolidation and the internal reinforcement hypothesis.

    PubMed

    Eisenhardt, Dorothea; Menzel, Randolf

    2007-02-01

    Retrieving a consolidated memory--by exposing an animal to the learned stimulus but not to the associated reinforcement--leads to two opposing processes: one that weakens the old memory as a result of extinction learning, and another that strengthens the old, already-consolidated memory as a result of some less well-understood form of learning. This latter process of memory strengthening is often referred to as "reconsolidation", since protein synthesis can inhibit this form of memory formation. Although the behavioral phenomena of the two antagonizing forms of learning are well documented, the mechanisms behind the corresponding processes of memory formation are still quite controversial. Referring to results of extinction/reconsolidation experiments in honeybees, we argue that two opposing learning processes--with their respective consolidation phases and memories--are initiated by retrieval trials: extinction learning and reminder learning, the latter leading to the phenomenon of spontaneous recovery from extinction, a process that can be blocked with protein synthesis inhibition.

  5. Decision theory, reinforcement learning, and the brain.

    PubMed

    Dayan, Peter; Daw, Nathaniel D

    2008-12-01

    Decision making is a core competence for animals and humans acting and surviving in environments they only partially comprehend, gaining rewards and punishments for their troubles. Decision-theoretic concepts permeate experiments and computational models in ethology, psychology, and neuroscience. Here, we review a well-known, coherent Bayesian approach to decision making, showing how it unifies issues in Markovian decision problems, signal detection psychophysics, sequential sampling, and optimal exploration and discuss paradigmatic psychological and neural examples of each problem. We discuss computational issues concerning what subjects know about their task and how ambitious they are in seeking optimal solutions; we address algorithmic topics concerning model-based and model-free methods for making choices; and we highlight key aspects of the neural implementation of decision making.

  6. An Upside to Reward Sensitivity: The Hippocampus Supports Enhanced Reinforcement Learning in Adolescence.

    PubMed

    Davidow, Juliet Y; Foerde, Karin; Galván, Adriana; Shohamy, Daphna

    2016-10-05

    Adolescents are notorious for engaging in reward-seeking behaviors, a tendency attributed to heightened activity in the brain's reward systems during adolescence. It has been suggested that reward sensitivity in adolescence might be adaptive, but evidence of an adaptive role has been scarce. Using a probabilistic reinforcement learning task combined with reinforcement learning models and fMRI, we found that adolescents showed better reinforcement learning and a stronger link between reinforcement learning and episodic memory for rewarding outcomes. This behavioral benefit was related to heightened prediction error-related BOLD activity in the hippocampus and to stronger functional connectivity between the hippocampus and the striatum at the time of reinforcement. These findings reveal an important role for the hippocampus in reinforcement learning in adolescence and suggest that reward sensitivity in adolescence is related to adaptive differences in how adolescents learn from experience.

  7. Reconciling Reinforcement Learning Models with Behavioral Extinction and Renewal: Implications for Addiction, Relapse, and Problem Gambling

    ERIC Educational Resources Information Center

    Redish, A. David; Jensen, Steve; Johnson, Adam; Kurth-Nelson, Zeb

    2007-01-01

    Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated value and are thus unable to capture renewal. TDRL…
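
    A minimal temporal-difference sketch of the point being made (illustrative only, not the specific TDRL variant analyzed in the paper): when reward is omitted, the negative prediction error drives the learned value back toward zero, so the association is simply unlearned and nothing remains that could support renewal.

    ```python
    # Minimal TD value update for a single conditioned stimulus (illustrative).
    def td_update(value, reward, alpha=0.2):
        delta = reward - value          # prediction error
        return value + alpha * delta

    value = 0.0
    for _ in range(30):                 # acquisition: stimulus followed by reward
        value = td_update(value, reward=1.0)
    for _ in range(30):                 # extinction: reward omitted
        value = td_update(value, reward=0.0)

    # After extinction the value has decayed back toward zero: the association is
    # simply unlearned, so this basic model cannot reproduce rapid renewal.
    print(round(value, 3))
    ```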

  8. Distributed Reinforcement Learning for Policy Synchronization in Infinite-Horizon Dec-POMDPs

    DTIC Science & Technology

    2012-01-01

    Report documentation page only; the abstract was not extracted. Recoverable details: title, Distributed Reinforcement Learning for Policy Synchronization in Infinite-Horizon Dec-POMDPs; sponsor, U.S. Army Research Office, P.O. Box 12211, Research Triangle Park, NC 27709-2211; subject terms, Dec-POMDPs, reinforcement learning, multi...

  9. Extending Hierarchical Reinforcement Learning to Continuous-Time, Average-Reward, and Multi-Agent Models

    DTIC Science & Technology

    2003-07-09

    Hierarchical reinforcement learning (HRL) is a general framework that studies how to exploit the structure of actions and tasks to accelerate policy...framework could suffice, we focus in this paper on the MAXQ framework. We describe three new hierarchical reinforcement learning algorithms: continuous-time... reinforcement learning to speed up the acquisition of cooperative multiagent tasks. We extend the MAXQ framework to the multiagent case which we term

  10. Distributed reinforcement learning for adaptive and robust network intrusion response

    NASA Astrophysics Data System (ADS)

    Malialis, Kleanthis; Devlin, Sam; Kudenko, Daniel

    2015-07-01

    Distributed denial of service (DDoS) attacks constitute a rapidly evolving threat in the current Internet. Multiagent Router Throttling is a novel approach to defend against DDoS attacks where multiple reinforcement learning agents are installed on a set of routers and learn to rate-limit or throttle traffic towards a victim server. The focus of this paper is on online learning and scalability. We propose an approach that incorporates task decomposition, team rewards and a form of reward shaping called difference rewards. One of the novel characteristics of the proposed system is that it provides a decentralised coordinated response to the DDoS problem, thus being resilient to DDoS attacks themselves. The proposed system learns remarkably fast, thus being suitable for online learning. Furthermore, its scalability is successfully demonstrated in experiments involving 1000 learning agents. We compare our approach against a baseline and a popular state-of-the-art throttling technique from the network security literature and show that the proposed approach is more effective, adaptive to sophisticated attack rate dynamics and robust to agent failures.
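
    The difference-rewards idea used here can be sketched generically as crediting each agent with the change in global utility caused by its own action, estimated by re-evaluating the system with that action replaced by a default. The code below is an illustrative sketch of that shaping scheme, not the authors' throttling formulation; the toy utility function and all names are assumptions.

    ```python
    def difference_reward(global_utility, joint_action, agent_index, default_action=None):
        """Generic difference reward D_i = G(z) - G(z_{-i} + c_i) (illustrative).

        global_utility: function mapping a joint action (list) to a scalar utility G.
        joint_action:   the actions actually taken by all agents.
        agent_index:    the agent being credited.
        default_action: the counterfactual action c_i (e.g. "do not throttle")."""
        counterfactual = list(joint_action)
        counterfactual[agent_index] = default_action
        return global_utility(joint_action) - global_utility(counterfactual)

    # Toy utility: legitimate traffic served after throttling, zero if the victim overloads.
    def served_traffic(throttle_rates):
        passed = sum(1.0 - (t if t is not None else 0.0) for t in throttle_rates)
        return passed if passed <= 2.0 else 0.0

    joint = [0.5, 0.2, 0.9]        # throttle rates chosen by three router agents
    print(difference_reward(served_traffic, joint, agent_index=1, default_action=0.0))
    ```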

  11. Longitudinal investigation on learned helplessness tested under negative and positive reinforcement involving stimulus control.

    PubMed

    Oliveira, Emileane C; Hunziker, Maria Helena

    2014-07-01

    In this study, we investigated whether (a) animals demonstrating the learned helplessness effect during an escape contingency also show learning deficits under positive reinforcement contingencies involving stimulus control and (b) the exposure to positive reinforcement contingencies eliminates the learned helplessness effect under an escape contingency. Rats were initially exposed to controllable (C), uncontrollable (U) or no (N) shocks. After 24 h, they were exposed to 60 escapable shocks delivered in a shuttlebox. In the following phase, we selected from each group the four subjects that presented the most typical group pattern: no escape learning (learned helplessness effect) in Group U and escape learning in Groups C and N. All subjects were then exposed to two phases: (1) positive reinforcement of lever pressing under a multiple FR/extinction schedule and (2) a re-test under negative reinforcement (escape). A fourth group (n=4) was exposed only to the positive reinforcement sessions. All subjects showed discrimination learning under the multiple schedule. In the escape re-test, the learned helplessness effect was maintained for three of the animals in Group U. These results suggest that the learned helplessness effect did not extend to discriminative behavior that is positively reinforced and that the learned helplessness effect did not revert for most subjects after exposure to positive reinforcement. We discuss some theoretical implications as related to learned helplessness as an effect restricted to aversive contingencies and to the absence of reversion after positive reinforcement.

  12. Finding intrinsic rewards by embodied evolution and constrained reinforcement learning.

    PubMed

    Uchibe, Eiji; Doya, Kenji

    2008-12-01

    Understanding the design principle of reward functions is a substantial challenge both in artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals, but also for intermediate states to promote effective exploration. This paper proposes a method for designing 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and 'mating' by software reproduction are three major 'extrinsic' rewards. We show in hardware experiments that the robots can find appropriate 'intrinsic' rewards for the vision of battery packs and other robots to promote approach behaviors.

  13. Optimized Assistive Human-Robot Interaction Using Reinforcement Learning.

    PubMed

    Modares, Hamidreza; Ranatunga, Isura; Lewis, Frank L; Popa, Dan O

    2016-03-01

    An intelligent human-robot interaction (HRI) system with adjustable robot behavior is presented. The proposed HRI system assists the human operator to perform a given task with minimum workload demands and optimizes the overall human-robot system performance. Motivated by human factor studies, the presented control structure consists of two control loops. First, a robot-specific neuro-adaptive controller is designed in the inner loop to make the unknown nonlinear robot behave like a prescribed robot impedance model as perceived by a human operator. In contrast to existing neural network and adaptive impedance-based control methods, no information of the task performance or the prescribed robot impedance model parameters is required in the inner loop. Then, a task-specific outer-loop controller is designed to find the optimal parameters of the prescribed robot impedance model to adjust the robot's dynamics to the operator skills and minimize the tracking error. The outer loop includes the human operator, the robot, and the task performance details. The problem of finding the optimal parameters of the prescribed robot impedance model is transformed into a linear quadratic regulator (LQR) problem which minimizes the human effort and optimizes the closed-loop behavior of the HRI system for a given task. To obviate the requirement of the knowledge of the human model, integral reinforcement learning is used to solve the given LQR problem. Simulation results on an x - y table and a robot arm, and experimental implementation results on a PR2 robot confirm the suitability of the proposed method.
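
    For orientation, the outer-loop problem described here is a standard discrete-time LQR problem; a conventional model-based Riccati iteration for such a problem is sketched below with illustrative matrices. Note that the paper's contribution is precisely that integral reinforcement learning solves the LQR problem from data, without requiring the system matrices assumed in this sketch.

    ```python
    import numpy as np

    def dlqr_gain(A, B, Q, R, iterations=500):
        """Discrete-time LQR gain via fixed-point iteration of the Riccati equation
        (model-based reference point; integral RL avoids needing A and B)."""
        P = Q.copy()
        for _ in range(iterations):
            BtP = B.T @ P
            K = np.linalg.solve(R + BtP @ B, BtP @ A)   # state-feedback gain
            P = Q + A.T @ P @ (A - B @ K)               # Riccati recursion
        return K

    # Toy double-integrator stand-in for a prescribed impedance model (illustrative values).
    dt = 0.01
    A = np.array([[1.0, dt], [0.0, 1.0]])
    B = np.array([[0.0], [dt]])
    Q = np.diag([10.0, 1.0])     # penalize tracking error and velocity
    R = np.array([[0.1]])        # penalize (human) effort
    print(dlqr_gain(A, B, Q, R)) # gains minimizing the quadratic cost
    ```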

  14. Simulation of rat behavior by a reinforcement learning algorithm in consideration of appearance probabilities of reinforcement signals.

    PubMed

    Murakoshi, Kazushi; Noguchi, Takuya

    2005-04-01

    Brown and Wanger [Brown, R.T., Wanger, A.R., 1964. Resistance to punishment and extinction following training with shock or nonreinforcement. J. Exp. Psychol. 68, 503-507] investigated rat behaviors with the following features: (1) rats were exposed to reward and punishment at the same time, (2) environment changed and rats relearned, and (3) rats were stochastically exposed to reward and punishment. The results are that exposure to nonreinforcement produces resistance to the decremental effects of behavior after stochastic reward schedule and that exposure to both punishment and reinforcement produces resistance to the decremental effects of behavior after stochastic punishment schedule. This paper aims to simulate the rat behaviors by a reinforcement learning algorithm in consideration of appearance probabilities of reinforcement signals. The former algorithms of reinforcement learning were unable to simulate the behavior of the feature (3). We improve the former reinforcement learning algorithms by controlling learning parameters in consideration of the acquisition probabilities of reinforcement signals. The proposed algorithm qualitatively simulates the result of the animal experiment of Brown and Wanger.

  15. Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback.

    PubMed

    Tan, A H; Lu, N; Xiao, D

    2008-02-01

    This paper presents a neural architecture for learning category nodes encoding mappings across multimodal patterns involving sensory inputs, actions, and rewards. By integrating adaptive resonance theory (ART) and temporal difference (TD) methods, the proposed neural model, called TD fusion architecture for learning, cognition, and navigation (TD-FALCON), enables an autonomous agent to adapt and function in a dynamic environment with immediate as well as delayed evaluative feedback (reinforcement) signals. TD-FALCON learns the value functions of the state-action space estimated through on-policy and off-policy TD learning methods, specifically state-action-reward-state-action (SARSA) and Q-learning. The learned value functions are then used to determine the optimal actions based on an action selection policy. We have developed TD-FALCON systems using various TD learning strategies and compared their performance in terms of task completion, learning speed, as well as time and space efficiency. Experiments based on a minefield navigation task have shown that TD-FALCON systems are able to learn effectively with both immediate and delayed reinforcement and achieve stable performance at a pace much faster than that of standard gradient-descent-based reinforcement learning systems.
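
    The TD rules referred to here are the standard SARSA and Q-learning updates, shown below in tabular form for orientation (the fusion-ART category machinery of TD-FALCON itself is not sketched).

    ```python
    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
        """On-policy TD update: bootstrap on the action actually taken next."""
        td_target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (td_target - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
        """Off-policy TD update: bootstrap on the greedy action in the next state."""
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])

    # Example usage on a tiny state-action table.
    Q = np.zeros((5, 2))
    sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
    q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
    ```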

  16. Opponent actor learning (OpAL): modeling interactive effects of striatal dopamine on reinforcement learning and choice incentive.

    PubMed

    Collins, Anne G E; Frank, Michael J

    2014-07-01

    The striatal dopaminergic system has been implicated in reinforcement learning (RL), motor performance, and incentive motivation. Various computational models have been proposed to account for each of these effects individually, but a formal analysis of their interactions is lacking. Here we present a novel algorithmic model expanding the classical actor-critic architecture to include fundamental interactive properties of neural circuit models, incorporating both incentive and learning effects into a single theoretical framework. The standard actor is replaced by a dual opponent actor system representing distinct striatal populations, which come to differentially specialize in discriminating positive and negative action values. Dopamine modulates the degree to which each actor component contributes to both learning and choice discriminations. In contrast to standard frameworks, this model simultaneously captures documented effects of dopamine on both learning and choice incentive, and their interactions, across a variety of studies, including probabilistic RL, effort-based choice, and motor skill learning.
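
    A strongly simplified, single-state rendering of the opponent-actor idea as described in the abstract (the published OpAL equations may differ in detail; all parameter values are illustrative): a critic computes prediction errors, "Go" and "NoGo" actor weights are updated multiplicatively with opposite signs, and dopamine-like gains weight their contributions at choice time.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n_actions = 2
    V = 0.0                                 # critic value
    G = np.ones(n_actions)                  # "Go" actor weights (D1-like)
    N = np.ones(n_actions)                  # "NoGo" actor weights (D2-like)
    alpha_c, alpha_g, alpha_n = 0.1, 0.1, 0.1
    beta_g, beta_n = 1.0, 1.0               # dopamine-like gains on each actor
    reward_probs = np.array([0.8, 0.2])     # hypothetical two-armed bandit

    for _ in range(500):
        act = beta_g * G - beta_n * N                       # net action propensity
        p = np.exp(act - act.max()); p /= p.sum()           # softmax choice
        a = rng.choice(n_actions, p=p)
        r = float(rng.random() < reward_probs[a])
        delta = r - V                                       # critic prediction error
        V += alpha_c * delta
        G[a] += alpha_g * G[a] * delta                      # Go learns from positive errors
        N[a] += alpha_n * N[a] * (-delta)                   # NoGo learns from negative errors

    print(G, N)   # the better action should tend toward larger G and smaller N
    ```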

  17. Dynamic Sensor Tasking for Space Situational Awareness via Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Linares, R.; Furfaro, R.

    2016-09-01

    This paper studies the Sensor Management (SM) problem for optical Space Object (SO) tracking. The tasking problem is formulated as a Markov Decision Process (MDP) and solved using Reinforcement Learning (RL). The RL problem is solved using the actor-critic policy gradient approach. The actor provides a policy which is random over actions and given by a parametric probability density function (pdf). The critic evaluates the policy by calculating the estimated total reward or the value function for the problem. The parameters of the policy action pdf are optimized using gradients with respect to the reward function. Both the critic and the actor are modeled using deep neural networks (multi-layer neural networks). The policy neural network takes the current state as input and outputs probabilities for each possible action. This policy is random, and can be evaluated by sampling random actions using the probabilities determined by the policy neural network's outputs. The critic approximates the total reward using a neural network. The estimated total reward is used to approximate the gradient of the policy network with respect to the network parameters. This approach is used to find the non-myopic optimal policy for tasking optical sensors to estimate SO orbits. The reward function is based on reducing the uncertainty for the overall catalog to below a user specified uncertainty threshold. This work uses a 30 km total position error for the uncertainty threshold. This work provides the RL method with a negative reward as long as any SO has a total position error above the uncertainty threshold. This penalizes policies that take longer to achieve the desired accuracy. A positive reward is provided when all SOs are below the catalog uncertainty threshold. An optimal policy is sought that takes actions to achieve the desired catalog uncertainty in minimum time. This work trains the policy in simulation by letting it task a single sensor to "learn" from its performance

  18. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation.

    PubMed

    Kato, Ayaka; Morita, Kenji

    2016-10-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of 'Go' or 'No-Go' selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of 'Go' values towards a goal, and (2) value-contrasts between 'Go' and 'No-Go' are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological systems for value-learning
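
    A minimal sketch of the value-decay mechanism described in the abstract (the task and parameters are illustrative, not the authors'): chosen values are updated by TD learning while all stored values decay slightly on every step, so prediction errors, and hence a DA-like signal, persist even for a fully predictable reward.

    ```python
    import numpy as np

    n_states = 8                  # a linear series of "Go" selections toward a goal
    alpha, gamma, decay = 0.5, 0.95, 0.02
    V = np.zeros(n_states + 1)    # V[n_states] is the terminal (post-goal) state

    for episode in range(200):
        for s in range(n_states):
            r = 1.0 if s == n_states - 1 else 0.0       # reward only at the goal
            delta = r + gamma * V[s + 1] - V[s]         # prediction error (DA-like)
            V[s] += alpha * delta
            V *= (1.0 - decay)                          # all learned values decay a little

    # With decay > 0, V never fully converges, so a positive prediction error
    # persists at every step toward the goal, i.e. a sustained motivational gradient.
    print(np.round(V[:n_states], 3))
    ```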

  19. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation

    PubMed Central

    Morita, Kenji

    2016-01-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of ‘Go’ or ‘No-Go’ selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of ‘Go’ values towards a goal, and (2) value-contrasts between ‘Go’ and ‘No-Go’ are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological

  20. Beamforming and power control in sensor arrays using reinforcement learning.

    PubMed

    Almeida, Náthalee C; Fernandes, Marcelo A C; Neto, Adrião D D

    2015-03-19

    The use of beamforming and power control, combined or separately, has advantages and disadvantages, depending on the application. The combined use of beamforming and power control has been shown to be highly effective in applications involving the suppression of interference signals from different sources. However, it is necessary to identify efficient methodologies for the combined operation of these two techniques. The most appropriate technique may be obtained by means of the implementation of an intelligent agent capable of making the best selection between beamforming and power control. The present paper proposes an algorithm using reinforcement learning (RL) to determine the optimal combination of beamforming and power control in sensor arrays. The RL algorithm used was Q-learning, employing an ε-greedy policy, and training was performed using the offline method. The simulations showed that RL was effective for implementation of a switching policy involving the different techniques, taking advantage of the positive characteristics of each technique in terms of signal reception.
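
    The abstract specifies tabular Q-learning with an ε-greedy policy, trained off-line, for selecting between the two techniques; a generic sketch of that mechanism is shown below. The states, rewards, and environment step are placeholders, not the authors' sensor-array model.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n_states, n_actions = 4, 2          # actions: 0 = beamforming, 1 = power control
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.9, 0.1

    def env_step(state, action):
        """Placeholder environment: returns (reward, next_state).
        In the paper the reward would reflect interference suppression quality."""
        reward = rng.normal(loc=0.5 if action == (state % 2) else 0.0, scale=0.1)
        return reward, rng.integers(n_states)

    state = 0
    for _ in range(5000):
        if rng.random() < epsilon:                      # explore
            action = int(rng.integers(n_actions))
        else:                                           # exploit
            action = int(np.argmax(Q[state]))
        reward, next_state = env_step(state, action)
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

    print(np.argmax(Q, axis=1))   # learned technique choice for each (toy) state
    ```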

  1. Reinforcement Learning in Distributed Domains: Beyond Team Games

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Sill, Joseph; Tumer, Kagan

    2000-01-01

    Distributed search algorithms are crucial in dealing with large optimization problems, particularly when a centralized approach is not only impractical but infeasible. Many machine learning concepts have been applied to search algorithms in order to improve their effectiveness. In this article we present an algorithm that blends Reinforcement Learning (RL) and hill climbing directly, by using the RL signal to guide the exploration step of a hill climbing algorithm. We apply this algorithm to the domain of a constellation of communication satellites where the goal is to minimize the loss of importance-weighted data. We introduce the concept of 'ghost' traffic, where correctly setting this traffic induces the satellites to act to optimize the world utility. Our results indicate that the bi-utility search introduced in this paper outperforms both traditional hill climbing algorithms and distributed RL approaches such as team games.

  2. Aggression as positive reinforcement in mice under various ratio- and time-based reinforcement schedules.

    PubMed

    May, Michael E; Kennedy, Craig H

    2009-03-01

    There is evidence suggesting aggression may be a positive reinforcer in many species. However, only a few studies have examined the characteristics of aggression as a positive reinforcer in mice. Four types of reinforcement schedules were examined in the current experiment using male Swiss CFW albino mice in a resident-intruder model of aggression as a positive reinforcer. A nose poke response on an operant conditioning panel was reinforced under fixed-ratio (FR 8), fixed-interval (FI 5-min), progressive ratio (PR 2), or differential reinforcement of low rate behavior reinforcement schedules (DRL 40-s and DRL 80-s). In the FR conditions, nose pokes were maintained by aggression and extinguished when the aggression contingency was removed. There were long postreinforcement pauses followed by bursts of responses with short interresponse times (IRTs). In the FI conditions, nose pokes were maintained by aggression, occurred more frequently as the interval elapsed, and extinguished when the contingency was removed. In the PR conditions, nose pokes were maintained by aggression, postreinforcement pauses increased as the ratio requirement increased, and responding was extinguished when the aggression contingency was removed. In the DRL conditions, the nose poke rate decreased, while the proportional distributions of IRTs and postreinforcement pauses shifted toward longer durations as the DRL interval increased. However, most responses occurred before the minimum IRT interval elapsed, suggesting weak temporal control of behavior. Overall, the findings suggest aggression can be a positive reinforcer for nose poke responses in mice on ratio- and time-based reinforcement schedules.

  3. AGGRESSION AS POSITIVE REINFORCEMENT IN MICE UNDER VARIOUS RATIO- AND TIME-BASED REINFORCEMENT SCHEDULES

    PubMed Central

    May, Michael E; Kennedy, Craig H

    2009-01-01

    There is evidence suggesting aggression may be a positive reinforcer in many species. However, only a few studies have examined the characteristics of aggression as a positive reinforcer in mice. Four types of reinforcement schedules were examined in the current experiment using male Swiss CFW albino mice in a resident–intruder model of aggression as a positive reinforcer. A nose poke response on an operant conditioning panel was reinforced under fixed-ratio (FR 8), fixed-interval (FI 5-min), progressive ratio (PR 2), or differential reinforcement of low rate behavior reinforcement schedules (DRL 40-s and DRL 80-s). In the FR conditions, nose pokes were maintained by aggression and extinguished when the aggression contingency was removed. There were long postreinforcement pauses followed by bursts of responses with short interresponse times (IRTs). In the FI conditions, nose pokes were maintained by aggression, occurred more frequently as the interval elapsed, and extinguished when the contingency was removed. In the PR conditions, nose pokes were maintained by aggression, postreinforcement pauses increased as the ratio requirement increased, and responding was extinguished when the aggression contingency was removed. In the DRL conditions, the nose poke rate decreased, while the proportional distributions of IRTs and postreinforcement pauses shifted toward longer durations as the DRL interval increased. However, most responses occurred before the minimum IRT interval elapsed, suggesting weak temporal control of behavior. Overall, the findings suggest aggression can be a positive reinforcer for nose poke responses in mice on ratio- and time-based reinforcement schedules. PMID:19794833

  4. Context conditional flavor preferences in the rat based on fructose and maltodextrin reinforcers.

    PubMed

    Dwyer, Dominic M; Quirk, Rachel H

    2008-04-01

    In three experiments, rats were exposed to a flavor preference procedure in which flavor A was paired with the reinforcer and flavor B presented alone in Context 1, while in Context 2 flavor A was presented alone and flavor B with the reinforcer. With fructose as the reinforcer both two- and one-bottle training procedures produced a context-dependent preference (Experiments 1 and 2). With maltodextrin as the reinforcer two-bottle training produced a context-dependent preference (Experiment 1). Following one-bottle training with maltodextrin reinforcement rats demonstrated a context-dependent preference when the conditioned stimulus (CS)- was presented with a dilute solution of the reinforcer during training (Experiment 3B) but not when the CS- was presented alone (Experiments 2 and 3A). The pattern of results with maltodextrin reinforcement suggests that there was competition between the cue flavors and the taste of the maltodextrin as predictors of the postingestive consequences of the maltodextrin reinforcer. The fact that rats were able to display context-dependent flavor preferences is consistent with the idea that learned flavor preferences rely on the sort of cue-consequence associations that underpin other forms of conditioning which produce accurate performance on biconditional tasks. The differences between fructose- and maltodextrin-based preferences are discussed in terms of configural and elemental learning processes.

  5. The combination of appetitive and aversive reinforcers and the nature of their interaction during auditory learning.

    PubMed

    Ilango, A; Wetzel, W; Scheich, H; Ohl, F W

    2010-03-31

    Learned changes in behavior can be elicited by either appetitive or aversive reinforcers. It is, however, not clear whether the two types of motivation, (approaching appetitive stimuli and avoiding aversive stimuli) drive learning in the same or different ways, nor is their interaction understood in situations where the two types are combined in a single experiment. To investigate this question we have developed a novel learning paradigm for Mongolian gerbils, which not only allows rewards and punishments to be presented in isolation or in combination with each other, but also can use these opposite reinforcers to drive the same learned behavior. Specifically, we studied learning of tone-conditioned hurdle crossing in a shuttle box driven by either an appetitive reinforcer (brain stimulation reward) or an aversive reinforcer (electrical footshock), or by a combination of both. Combination of the two reinforcers potentiated speed of acquisition, led to maximum possible performance, and delayed extinction as compared to either reinforcer alone. Additional experiments, using partial reinforcement protocols and experiments in which one of the reinforcers was omitted after the animals had been previously trained with the combination of both reinforcers, indicated that appetitive and aversive reinforcers operated together but acted in different ways: in this particular experimental context, punishment appeared to be more effective for initial acquisition and reward more effective to maintain a high level of conditioned responses (CRs). The results imply that learning mechanisms in problem solving were maximally effective when the initial punishment of mistakes was combined with the subsequent rewarding of correct performance.

  6. Memory Transformation Enhances Reinforcement Learning in Dynamic Environments.

    PubMed

    Santoro, Adam; Frankland, Paul W; Richards, Blake A

    2016-11-30

    Over the course of systems consolidation, there is a switch from a reliance on detailed episodic memories to generalized schematic memories. This switch is sometimes referred to as "memory transformation." Here we demonstrate a previously unappreciated benefit of memory transformation, namely, its ability to enhance reinforcement learning in a dynamic environment. We developed a neural network that is trained to find rewards in a foraging task where reward locations are continuously changing. The network can use memories for specific locations (episodic memories) and statistical patterns of locations (schematic memories) to guide its search. We find that switching from an episodic to a schematic strategy over time leads to enhanced performance due to the tendency for the reward location to be highly correlated with itself in the short-term, but regress to a stable distribution in the long-term. We also show that the statistics of the environment determine the optimal utilization of both types of memory. Our work recasts the theoretical question of why memory transformation occurs, shifting the focus from the avoidance of memory interference toward the enhancement of reinforcement learning across multiple timescales.

  7. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia

    PubMed Central

    Markou, Athina; Salamone, John D.; Bussey, Timothy; Mar, Adam; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-01-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. PMID:23994273

  8. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia.

    PubMed

    Markou, Athina; Salamone, John D; Bussey, Timothy J; Mar, Adam C; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-11-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu) meeting. A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits.

  9. "Notice of Violation of IEEE Publication Principles" Multiobjective Reinforcement Learning: A Comprehensive Overview.

    PubMed

    Liu, Chunming; Xu, Xin; Hu, Dewen

    2013-04-29

    Reinforcement learning is a powerful mechanism for enabling agents to learn in an unknown environment, and most reinforcement learning algorithms aim to maximize some numerical value, which represents only one long-term objective. However, multiple long-term objectives are exhibited in many real-world decision and control problems; therefore, recently, there has been growing interest in solving multiobjective reinforcement learning (MORL) problems with multiple conflicting objectives. The aim of this paper is to present a comprehensive overview of MORL. In this paper, the basic architecture, research topics, and naive solutions of MORL are introduced at first. Then, several representative MORL approaches and some important directions of recent research are reviewed. The relationships between MORL and other related research are also discussed, which include multiobjective optimization, hierarchical reinforcement learning, and multi-agent reinforcement learning. Finally, research challenges and open problems of MORL techniques are highlighted.

  10. Reward-weighted regression with sample reuse for direct policy search in reinforcement learning.

    PubMed

    Hachiya, Hirotaka; Peters, Jan; Sugiyama, Masashi

    2011-11-01

    Direct policy search is a promising reinforcement learning framework, in particular for controlling continuous, high-dimensional systems. Policy search often requires a large number of samples for obtaining a stable policy update estimator, and this is prohibitive when the sampling cost is expensive. In this letter, we extend an expectation-maximization-based policy search method so that previously collected samples can be efficiently reused. The usefulness of the proposed method, reward-weighted regression with sample reuse (R3), is demonstrated through robot learning experiments. (This letter is an extended version of our earlier conference paper: Hachiya, Peters, & Sugiyama, 2009.)

  11. Artist Agent: A Reinforcement Learning Approach to Automatic Stroke Generation in Oriental Ink Painting

    NASA Astrophysics Data System (ADS)

    Xie, Ning; Hachiya, Hirotaka; Sugiyama, Masashi

    Oriental ink painting, called Sumi-e, is one of the most appealing painting styles that has attracted artists around the world. Major challenges in computer-based Sumi-e simulation are to abstract complex scene information and draw smooth and natural brush strokes. To automatically find such strokes, we propose to model the brush as a reinforcement learning agent, and learn desired brush-trajectories by maximizing the sum of rewards in the policy search framework. We also provide elaborate design of actions, states, and rewards tailored for a Sumi-e agent. The effectiveness of our proposed approach is demonstrated through simulated Sumi-e experiments.

  12. A multiplicative reinforcement learning model capturing learning dynamics and interindividual variability in mice.

    PubMed

    Bathellier, Brice; Tee, Sui Poh; Hrovat, Christina; Rumpel, Simon

    2013-12-03

    Both in humans and in animals, different individuals may learn the same task with strikingly different speeds; however, the sources of this variability remain elusive. In standard learning models, interindividual variability is often explained by variations of the learning rate, a parameter indicating how much synapses are updated on each learning event. Here, we theoretically show that the initial connectivity between the neurons involved in learning a task is also a strong determinant of how quickly the task is learned, provided that connections are updated in a multiplicative manner. To experimentally test this idea, we trained mice to perform an auditory Go/NoGo discrimination task followed by a reversal to compare learning speed when starting from naive or already trained synaptic connections. All mice learned the initial task, but often displayed sigmoid-like learning curves, with a variable delay period followed by a steep increase in performance, as often observed in operant conditioning. For all mice, learning was much faster in the subsequent reversal training. An accurate fit of all learning curves could be obtained with a reinforcement learning model endowed with a multiplicative learning rule, but not with an additive rule. Surprisingly, the multiplicative model could explain a large fraction of the interindividual variability by variations in the initial synaptic weights. Altogether, these results demonstrate the power of multiplicative learning rules to account for the full dynamics of biological learning and suggest an important role of initial wiring in the brain for predispositions to different tasks.
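
    A toy comparison of the two update rules contrasted in the abstract (a sketch of the general idea, not the fitted model): with an additive rule the learning curve is largely insensitive to the initial weight, whereas with a multiplicative rule a small initial weight produces the long delay followed by a steep, sigmoid-like rise observed in the mice.

    ```python
    import numpy as np

    def learning_curve(w0, rule, alpha=0.3, target=1.0, trials=60):
        """Track a single associative weight driven toward a target by
        prediction errors, under an additive or a multiplicative update rule."""
        w, curve = w0, []
        for _ in range(trials):
            delta = target - w                 # prediction error on each trial
            if rule == "additive":
                w += alpha * delta
            else:                              # multiplicative: update scales with w
                w += alpha * w * delta
            curve.append(w)
        return np.array(curve)

    for w0 in (0.01, 0.1):                     # two "initial connectivity" levels
        add = learning_curve(w0, "additive")
        mult = learning_curve(w0, "multiplicative")
        # Trials needed to reach 90% of the target under each rule:
        print(w0, int(np.argmax(add > 0.9)) + 1, int(np.argmax(mult > 0.9)) + 1)
    ```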

  13. A multiplicative reinforcement learning model capturing learning dynamics and interindividual variability in mice

    PubMed Central

    Bathellier, Brice; Tee, Sui Poh; Hrovat, Christina; Rumpel, Simon

    2013-01-01

    Both in humans and in animals, different individuals may learn the same task with strikingly different speeds; however, the sources of this variability remain elusive. In standard learning models, interindividual variability is often explained by variations of the learning rate, a parameter indicating how much synapses are updated on each learning event. Here, we theoretically show that the initial connectivity between the neurons involved in learning a task is also a strong determinant of how quickly the task is learned, provided that connections are updated in a multiplicative manner. To experimentally test this idea, we trained mice to perform an auditory Go/NoGo discrimination task followed by a reversal to compare learning speed when starting from naive or already trained synaptic connections. All mice learned the initial task, but often displayed sigmoid-like learning curves, with a variable delay period followed by a steep increase in performance, as often observed in operant conditioning. For all mice, learning was much faster in the subsequent reversal training. An accurate fit of all learning curves could be obtained with a reinforcement learning model endowed with a multiplicative learning rule, but not with an additive rule. Surprisingly, the multiplicative model could explain a large fraction of the interindividual variability by variations in the initial synaptic weights. Altogether, these results demonstrate the power of multiplicative learning rules to account for the full dynamics of biological learning and suggest an important role of initial wiring in the brain for predispositions to different tasks. PMID:24255115

  14. How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis.

    PubMed

    Collins, Anne G E; Frank, Michael J

    2012-04-01

    Instrumental learning involves corticostriatal circuitry and the dopaminergic system. This system is typically modeled in the reinforcement learning (RL) framework by incrementally accumulating reward values of states and actions. However, human learning also implicates prefrontal cortical mechanisms involved in higher level cognitive functions. The interaction of these systems remains poorly understood, and models of human behavior often ignore working memory (WM) and therefore incorrectly assign behavioral variance to the RL system. Here we designed a task that highlights the profound entanglement of these two processes, even in simple learning problems. By systematically varying the size of the learning problem and delay between stimulus repetitions, we separately extracted WM-specific effects of load and delay on learning. We propose a new computational model that accounts for the dynamic integration of RL and WM processes observed in subjects' behavior. Incorporating capacity-limited WM into the model allowed us to capture behavioral variance that could not be captured in a pure RL framework even if we (implausibly) allowed separate RL systems for each set size. The WM component also allowed for a more reasonable estimation of a single RL process. Finally, we report effects of two genetic polymorphisms having relative specificity for prefrontal and basal ganglia functions. Whereas the COMT gene coding for catechol-O-methyl transferase selectively influenced model estimates of WM capacity, the GPR6 gene coding for G-protein-coupled receptor 6 influenced the RL learning rate. Thus, this study allowed us to specify distinct influences of the high-level and low-level cognitive functions on instrumental learning, beyond the possibilities offered by simple RL models.

  15. Reinforcement learning for routing in cognitive radio ad hoc networks.

    PubMed

    Al-Rawi, Hasan A A; Yau, Kok-Lim Alvin; Mohamad, Hafizal; Ramli, Nordin; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL in routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs.

  16. Reinforcement Learning for Routing in Cognitive Radio Ad Hoc Networks

    PubMed Central

    Al-Rawi, Hasan A. A.; Mohamad, Hafizal; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL in routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs. PMID:25140350

  17. The Effect of Tutoring on Children's Learning Under Two Conditions of Reinforcement.

    ERIC Educational Resources Information Center

    Zach, Lillian; And Others

    Studied were some problems of learning motivation and extrinsic reinforcement in a group of disadvantaged youngsters. Also tested was the hypothesis that learning would be facilitated for those children who received regular individual tutoring in addition to classroom instruction, regardless of conditions of reinforcement. Subjects were 60 Negro…

  18. Social Reinforcement, Personality and Learning Performance in Cross-Cultural Programmed Instruction.

    ERIC Educational Resources Information Center

    Symonds, John D.

    An Arab Culture Assimilator, basically a branching type of learning program, was administered to forty-seven subjects in two groups involved in a cross-cultural field study in order to determine the differing effects of varied social reinforcement on learning performance. The conditions of reinforcement were: correct, positive; incorrect,…

  19. Adventitious Reinforcement of Maladaptive Stimulus Control Interferes with Learning.

    PubMed

    Saunders, Kathryn J; Hine, Kathleen; Hayashi, Yusuke; Williams, Dean C

    2016-09-01

    Persistent error patterns sometimes develop when teaching new discriminations. These patterns can be adventitiously reinforced, especially during long periods of chance-level responding (including baseline). Such behaviors can interfere with learning a new discrimination. They can also disrupt already learned discriminations, if they re-emerge during teaching procedures that generate errors. We present an example of this process. Our goal was to teach a boy with intellectual disabilities to touch one of two shapes on a computer screen (in technical terms, a simple simultaneous discrimination). We used a size-fading procedure. The correct stimulus was at full size, and the incorrect-stimulus size increased in increments of 10 %. Performance was nearly error free up to and including 60 % of full size. In a probe session with the incorrect stimulus at full size, however, accuracy plummeted. Also, a pattern of switching between choices, which apparently had been established in classroom instruction, re-emerged. The switching pattern interfered with already-learned discriminations. Although the boy had previously mastered a fading step with the incorrect stimulus at 60 %, we were unable to maintain consistently high accuracy beyond 20 % of full size. We refined the teaching program such that fading was done in smaller steps (5 %), and decisions to "step back" to a smaller incorrect stimulus were made after every 5 trials instead of every 20. Errors were rare, switching behavior stopped, and he mastered the discrimination. This is a practical example of the importance of designing instruction that prevents adventitious reinforcement of maladaptive discriminated response patterns by reducing errors during acquisition.

  20. How we learn to make decisions: rapid propagation of reinforcement learning prediction errors in humans.

    PubMed

    Krigolson, Olav E; Hassall, Cameron D; Handy, Todd C

    2014-03-01

    Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors-discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833-1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129-141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769-776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679-709, 2002]. Here, we used the brain ERP technique to demonstrate that not only do rewards elicit a neural response akin to a prediction error but also that this signal rapidly diminished and propagated to the time of choice presentation with learning. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component that has a similar timing and topography as the feedback error-related negativity that increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward

  1. Don't Think, Just Feel the Music: Individuals with Strong Pavlovian-to-Instrumental Transfer Effects Rely Less on Model-based Reinforcement Learning.

    PubMed

    Sebold, Miriam; Schad, Daniel J; Nebe, Stephan; Garbusow, Maria; Jünger, Elisabeth; Kroemer, Nils B; Kathmann, Norbert; Zimmermann, Ulrich S; Smolka, Michael N; Rapp, Michael A; Heinz, Andreas; Huys, Quentin J M

    2016-07-01

    Behavioral choice can be characterized along two axes. One axis distinguishes reflexive, model-free systems that slowly accumulate values through experience and a model-based system that uses knowledge to reason prospectively. The second axis distinguishes Pavlovian valuation of stimuli from instrumental valuation of actions or stimulus-action pairs. This results in four values and many possible interactions between them, with important consequences for accounts of individual variation. We here explored whether individual variation along one axis was related to individual variation along the other. Specifically, we asked whether individuals' balance between model-based and model-free learning was related to their tendency to show Pavlovian interferences with instrumental decisions. In two independent samples with a total of 243 participants, Pavlovian-instrumental transfer effects were negatively correlated with the strength of model-based reasoning in a two-step task. This suggests a potential common underlying substrate predisposing individuals to both have strong Pavlovian interference and be less model-based and provides a framework within which to interpret the observation of both effects in addiction.

  2. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective.

    PubMed

    Botvinick, Matthew M; Niv, Yael; Barto, Andrew C

    2009-12-01

    Research on human and animal behavior has long emphasized its hierarchical structure-the divisibility of ongoing behavior into discrete tasks, which are comprised of subtask sequences, which in turn are built of simple actions. The hierarchical structure of behavior has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. In this paper, we reexamine behavioral hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning. Specifically, we consider a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills. A close look at the components of hierarchical reinforcement learning suggests how they might map onto neural structures, in particular regions within the dorsolateral and orbital prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement learning might provide a complement to existing psychological models of hierarchically structured behavior. A particularly important question that hierarchical reinforcement learning brings to the fore is that of how learning identifies new action routines that are likely to provide useful building blocks in solving a wide range of future problems. Here and at many other points, hierarchical reinforcement learning offers an appealing framework for investigating the computational and neural underpinnings of hierarchically structured behavior.

  3. Hierarchically organized behavior and its neural foundations: A reinforcement-learning perspective

    PubMed Central

    Botvinick, Matthew M.; Niv, Yael; Barto, Andrew C.

    2009-01-01

    Research on human and animal behavior has long emphasized its hierarchical structure — the divisibility of ongoing behavior into discrete tasks, which are comprised of subtask sequences, which in turn are built of simple actions. The hierarchical structure of behavior has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. In this paper, we reexamine behavioral hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning. Specifically, we consider a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills. A close look at the components of hierarchical reinforcement learning suggests how they might map onto neural structures, in particular regions within the dorsolateral and orbital prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement learning might provide a complement to existing psychological models of hierarchically structured behavior. A particularly important question that hierarchical reinforcement learning brings to the fore is that of how learning identifies new action routines that are likely to provide useful building blocks in solving a wide range of future problems. Here and at many other points, hierarchical reinforcement learning offers an appealing framework for investigating the computational and neural underpinnings of hierarchically structured behavior. PMID:18926527

  4. Attentional Selection Can Be Predicted by Reinforcement Learning of Task-relevant Stimulus Features Weighted by Value-independent Stickiness.

    PubMed

    Balcarras, Matthew; Ardid, Salva; Kaping, Daniel; Everling, Stefan; Womelsdorf, Thilo

    2016-02-01

    Attention includes processes that evaluate stimuli relevance, select the most relevant stimulus against less relevant stimuli, and bias choice behavior toward the selected information. It is not clear how these processes interact. Here, we captured these processes in a reinforcement learning framework applied to a feature-based attention task that required macaques to learn and update the value of stimulus features while ignoring nonrelevant sensory features, locations, and action plans. We found that value-based reinforcement learning mechanisms could account for feature-based attentional selection and choice behavior but required a value-independent stickiness selection process to explain selection errors while at asymptotic behavior. By comparing different reinforcement learning schemes, we found that trial-by-trial selections were best predicted by a model that only represents expected values for the task-relevant feature dimension, with nonrelevant stimulus features and action plans having only a marginal influence on covert selections. These findings show that attentional control subprocesses can be described by (1) the reinforcement learning of feature values within a restricted feature space that excludes irrelevant feature dimensions, (2) a stochastic selection process on feature-specific value representations, and (3) value-independent stickiness toward previous feature selections akin to perseveration in the motor domain. We speculate that these three mechanisms are implemented by distinct but interacting brain circuits and that the proposed formal account of feature-based stimulus selection will be important to understand how attentional subprocesses are implemented in primate brain networks.
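
    The three components listed in the abstract (value learning restricted to the relevant feature dimension, a stochastic selection process, and value-independent stickiness) can be illustrated with a minimal sketch. The functional forms and parameters below are illustrative assumptions, not the authors' fitted model.

    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    class FeatureRL:
        """Minimal sketch: feature-value learning plus value-independent stickiness."""
        def __init__(self, n_features, alpha=0.3, beta=5.0, kappa=0.5):
            self.v = np.zeros(n_features)   # expected value per task-relevant feature
            self.alpha = alpha              # learning rate
            self.beta = beta                # softmax inverse temperature
            self.kappa = kappa              # stickiness bonus for the previous selection
            self.prev = None

        def choose(self):
            bias = np.zeros_like(self.v)
            if self.prev is not None:
                bias[self.prev] = self.kappa          # perseveration, independent of value
            p = softmax(self.beta * self.v + bias)    # stochastic selection on feature values
            self.prev = int(np.random.choice(len(self.v), p=p))
            return self.prev

        def update(self, choice, reward):
            # delta-rule update of the chosen feature's value only
            self.v[choice] += self.alpha * (reward - self.v[choice])

    agent = FeatureRL(n_features=2)
    for t in range(200):
        c = agent.choose()
        r = float(np.random.rand() < (0.8 if c == 0 else 0.2))   # feature 0 more rewarding
        agent.update(c, r)
    print(np.round(agent.v, 2))
    ```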

  5. Reinforcement learning of targeted movement in a spiking neuronal model of motor cortex.

    PubMed

    Chadderdon, George L; Neymotin, Samuel A; Kerr, Cliff C; Lytton, William W

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint "forearm" to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (-1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior.
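
    The plasticity scheme described (spike-timing-dependent eligibility traces converted into weight changes by a global three-valued reinforcement signal) can be caricatured outside a spiking simulator. The sketch below is an abstract stand-in, not the published model: spike trains are replaced by binary activity vectors and all constants are illustrative.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    n_in, n_out = 20, 8
    w = rng.uniform(0.0, 0.5, size=(n_out, n_in))   # plastic feedforward weights
    elig = np.zeros_like(w)                         # per-synapse eligibility traces

    alpha = 0.01    # learning rate applied when reinforcement arrives
    tau_e = 0.9     # decay of eligibility per time step

    def step(pre, post, reinforcement):
        """pre/post: 0/1 activity vectors; reinforcement in {+1, 0, -1}."""
        global w, elig
        # recently co-active synapses become eligible (credit/blame assignment)
        elig = tau_e * elig + np.outer(post, pre)
        # the global signal converts eligibility into an actual weight change
        w += alpha * reinforcement * elig
        np.clip(w, 0.0, 1.0, out=w)

    # toy run: random activity, reward when output unit 0 is active, punishment otherwise
    for t in range(1000):
        pre = (rng.random(n_in) < 0.2).astype(float)
        post = (rng.random(n_out) < 0.1).astype(float)
        r = 1 if post[0] > 0 else (-1 if post.sum() > 2 else 0)
        step(pre, post, r)
    ```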

  6. Reinforcement Learning of Targeted Movement in a Spiking Neuronal Model of Motor Cortex

    PubMed Central

    Chadderdon, George L.; Neymotin, Samuel A.; Kerr, Cliff C.; Lytton, William W.

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint “forearm” to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (−1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior. PMID:23094042

  7. Curiosity driven reinforcement learning for motion planning on humanoids

    PubMed Central

    Frank, Mikhail; Leitner, Jürgen; Stollenga, Marijn; Förster, Alexander; Schmidhuber, Jürgen

    2014-01-01

    Most previous work on artificial curiosity (AC) and intrinsic motivation focuses on basic concepts and theory. Experimental results are generally limited to toy scenarios, such as navigation in a simulated maze, or control of a simple mechanical system with one or two degrees of freedom. To study AC in a more realistic setting, we embody a curious agent in the complex iCub humanoid robot. Our novel reinforcement learning (RL) framework consists of a state-of-the-art, low-level, reactive control layer, which controls the iCub while respecting constraints, and a high-level curious agent, which explores the iCub's state-action space through information gain maximization, learning a world model from experience, controlling the actual iCub hardware in real-time. To the best of our knowledge, this is the first ever embodied, curious agent for real-time motion planning on a humanoid. We demonstrate that it can learn compact Markov models to represent large regions of the iCub's configuration space, and that the iCub explores intelligently, showing interest in its physical constraints as well as in objects it finds in its environment. PMID:24432001

  8. Curiosity driven reinforcement learning for motion planning on humanoids.

    PubMed

    Frank, Mikhail; Leitner, Jürgen; Stollenga, Marijn; Förster, Alexander; Schmidhuber, Jürgen

    2014-01-06

    Most previous work on artificial curiosity (AC) and intrinsic motivation focuses on basic concepts and theory. Experimental results are generally limited to toy scenarios, such as navigation in a simulated maze, or control of a simple mechanical system with one or two degrees of freedom. To study AC in a more realistic setting, we embody a curious agent in the complex iCub humanoid robot. Our novel reinforcement learning (RL) framework consists of a state-of-the-art, low-level, reactive control layer, which controls the iCub while respecting constraints, and a high-level curious agent, which explores the iCub's state-action space through information gain maximization, learning a world model from experience, controlling the actual iCub hardware in real-time. To the best of our knowledge, this is the first ever embodied, curious agent for real-time motion planning on a humanoid. We demonstrate that it can learn compact Markov models to represent large regions of the iCub's configuration space, and that the iCub explores intelligently, showing interest in its physical constraints as well as in objects it finds in its environment.

  9. Novelty and Inductive Generalization in Human Reinforcement Learning.

    PubMed

    Gershman, Samuel J; Niv, Yael

    2015-07-01

    In reinforcement learning (RL), a decision maker searching for the most rewarding option is often faced with the question: What is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: How can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and we describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of RL in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional RL algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty.

  10. Inferring reward prediction errors in patients with schizophrenia: a dynamic reward task for reinforcement learning.

    PubMed

    Li, Chia-Tzu; Lai, Wen-Sung; Liu, Chih-Min; Hsu, Yung-Fong

    2014-01-01

    Abnormalities in the dopamine system have long been implicated in explanations of reinforcement learning and psychosis. The updated reward prediction error (RPE), a discrepancy between the predicted and actual rewards, is thought to be encoded by dopaminergic neurons. Dysregulation of dopamine systems could alter the appraisal of stimuli and eventually lead to schizophrenia. Accordingly, the measurement of RPE provides a potential behavioral index for the evaluation of brain dopamine activity and psychotic symptoms. Here, we assess two features potentially crucial to the RPE process, namely belief formation and belief perseveration, via a probability learning task and reinforcement-learning modeling. Forty-five patients with schizophrenia [26 high-psychosis and 19 low-psychosis, based on their p1 and p3 scores in the positive-symptom subscales of the Positive and Negative Syndrome Scale (PANSS)] and 24 controls were tested in a feedback-based dynamic reward task for their RPE-related decision making. While task scores across the three groups were similar, matching law analysis revealed that the reward sensitivities of both psychosis groups were lower than that of controls. Trial-by-trial data were further fit with a reinforcement learning model using the Bayesian estimation approach. Model fitting results indicated that both psychosis groups tend to update their reward values more rapidly than controls. Moreover, among the three groups, high-psychosis patients had the lowest degree of choice perseveration. Lumping patients' data together, we also found that patients' perseveration appears to be negatively correlated (p = 0.09, trending toward significance) with their PANSS p1 + p3 scores. Our method provides an alternative for investigating reward-related learning and decision making in basic and clinical settings.
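
    In the standard formulation that this kind of modeling builds on, the reward prediction error and its use in value updating are

    \[
    \delta_t = r_t - V_t, \qquad V_{t+1} = V_t + \alpha\, \delta_t,
    \]

    where \(r_t\) is the obtained reward, \(V_t\) the current reward prediction, and \(\alpha\) the learning rate. In this notation, the "more rapid" value updating reported for the psychosis groups corresponds to a larger estimated \(\alpha\), while choice perseveration enters as a separate parameter in the choice rule; the exact model fitted in the study may differ in detail.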

  11. Inferring reward prediction errors in patients with schizophrenia: a dynamic reward task for reinforcement learning

    PubMed Central

    Li, Chia-Tzu; Lai, Wen-Sung; Liu, Chih-Min; Hsu, Yung-Fong

    2014-01-01

    Abnormalities in the dopamine system have long been implicated in explanations of reinforcement learning and psychosis. The updated reward prediction error (RPE)—a discrepancy between the predicted and actual rewards—is thought to be encoded by dopaminergic neurons. Dysregulation of dopamine systems could alter the appraisal of stimuli and eventually lead to schizophrenia. Accordingly, the measurement of RPE provides a potential behavioral index for the evaluation of brain dopamine activity and psychotic symptoms. Here, we assess two features potentially crucial to the RPE process, namely belief formation and belief perseveration, via a probability learning task and reinforcement-learning modeling. Forty-five patients with schizophrenia [26 high-psychosis and 19 low-psychosis, based on their p1 and p3 scores in the positive-symptom subscales of the Positive and Negative Syndrome Scale (PANSS)] and 24 controls were tested in a feedback-based dynamic reward task for their RPE-related decision making. While task scores across the three groups were similar, matching law analysis revealed that the reward sensitivities of both psychosis groups were lower than that of controls. Trial-by-trial data were further fit with a reinforcement learning model using the Bayesian estimation approach. Model fitting results indicated that both psychosis groups tend to update their reward values more rapidly than controls. Moreover, among the three groups, high-psychosis patients had the lowest degree of choice perseveration. Lumping patients' data together, we also found that patients' perseveration appears to be negatively correlated (p = 0.09, trending toward significance) with their PANSS p1 + p3 scores. Our method provides an alternative for investigating reward-related learning and decision making in basic and clinical settings. PMID:25426091

  12. Reinforcement Learning Performance and Risk for Psychosis in Youth.

    PubMed

    Waltz, James A; Demro, Caroline; Schiffman, Jason; Thompson, Elizabeth; Kline, Emily; Reeves, Gloria; Xu, Ziye; Gold, James

    2015-12-01

    Early identification efforts for psychosis have thus far yielded many more individuals "at risk" than actually develop psychotic illness. Here, we test whether measures of reinforcement learning (RL), known to be impaired in chronic schizophrenia, are related to the severity of clinical risk symptoms. Because of the reliance of RL on dopamine-rich frontostriatal systems and evidence of dopamine system dysfunction in the psychosis prodrome, RL measures are of specific interest in this clinical population. The current study examines relationships between psychosis risk symptoms and RL task performance in a sample of adolescents and young adults (n = 70) receiving mental health services. We observed significant correlations between multiple measures of RL performance and measures of both positive and negative symptoms. These results suggest that RL measures may provide a psychosis risk signal in treatment-seeking youth. Further research is necessary to understand the potential predictive role of RL measures for conversion to psychosis.

  13. Tuning fuzzy PD and PI controllers using reinforcement learning.

    PubMed

    Boubertakh, Hamid; Tadjine, Mohamed; Glorennec, Pierre-Yves; Labiod, Salim

    2010-10-01

    In this paper, we propose a new method for auto-tuning fuzzy PD and PI controllers using the reinforcement Q-learning (QL) algorithm for SISO (single-input single-output) and TITO (two-input two-output) systems. First, we investigate the design parameters and settings of a typical class of Fuzzy PD (FPD) and Fuzzy PI (FPI) controllers: zero-order Takagi-Sugeno controllers with equidistant triangular membership functions for inputs, equidistant singleton membership functions for output, Larsen's implication method, and average sum defuzzification method. Second, the analytical structures of these typical fuzzy PD and PI controllers are compared to their classical PD and PI counterparts. Finally, the effectiveness of the proposed method is demonstrated through simulation examples.
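
    The underlying auto-tuning idea (an RL agent adjusts controller parameters from a scalar performance reward) can be sketched generically. The sketch below tunes a plain proportional gain for a toy first-order plant with tabular Q-learning; the fuzzy controller structure, state definition, and reward used in the paper are not reproduced, and all names and constants here are illustrative.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    gains = np.linspace(0.1, 2.0, 8)     # candidate gains (the "actions")
    n_states = 5                         # coarse bins of mean tracking error (the "states")
    Q = np.zeros((n_states, len(gains)))
    alpha, gamma, eps = 0.1, 0.9, 0.1

    def run_episode(kp):
        """Mean absolute tracking error of a discretized first-order plant under gain kp."""
        x, ref, err = 0.0, 1.0, 0.0
        for _ in range(50):
            u = kp * (ref - x)
            x += 0.1 * (-x + u)          # plant: dx/dt = -x + u, Euler step 0.1
            err += abs(ref - x)
        return err / 50

    state = n_states - 1
    for episode in range(500):
        a = rng.integers(len(gains)) if rng.random() < eps else int(np.argmax(Q[state]))
        e = run_episode(gains[a])
        reward = -e                      # smaller tracking error -> larger reward
        nxt = min(int(e * n_states), n_states - 1)
        Q[state, a] += alpha * (reward + gamma * Q[nxt].max() - Q[state, a])
        state = nxt

    visited = Q.any(axis=1)
    print({i: float(gains[int(np.argmax(Q[i]))]) for i in range(n_states) if visited[i]})
    ```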

  14. Incremental state aggregation for value function estimation in reinforcement learning.

    PubMed

    Mori, Takeshi; Ishii, Shin

    2011-10-01

    In reinforcement learning, large state and action spaces make the estimation of value functions impractical, so a value function is often represented as a linear combination of basis functions whose linear coefficients constitute parameters to be estimated. However, preparing basis functions requires a certain amount of prior knowledge and is, in general, a difficult task. To overcome this difficulty, an adaptive basis function construction technique has been proposed by Keller recently, but it requires excessive computational cost. We propose an efficient approach to this difficulty, in which the problem of approximating the value function is decomposed into a number of subproblems, each of which can be solved with small computational cost. Computer experiments show that the CPU time needed by our method is much smaller than that by the existing method.
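
    The representation this work starts from is a value function written as a linear combination of basis functions, V(s) = w·φ(s), with the coefficients w estimated by temporal-difference learning. The sketch below shows that baseline setup with a fixed Gaussian basis on a toy chain; the paper's actual contribution, constructing the basis incrementally by state aggregation, is not reproduced here.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    centers = np.linspace(0.0, 1.0, 10)    # fixed Gaussian basis functions (illustrative)
    width = 0.1

    def phi(s):
        return np.exp(-0.5 * ((s - centers) / width) ** 2)

    w = np.zeros_like(centers)             # linear coefficients to be estimated
    alpha, gamma = 0.05, 0.95

    for step in range(5000):
        s = rng.random()                   # sample a state in [0, 1]
        s_next = min(s + 0.05, 1.0)        # toy dynamics: drift toward the right edge
        r = 1.0 if s_next >= 1.0 else 0.0  # reward only at the terminal edge
        target = r if s_next >= 1.0 else r + gamma * (w @ phi(s_next))
        td_error = target - w @ phi(s)
        w += alpha * td_error * phi(s)     # semi-gradient TD(0) update of the coefficients

    print(round(float(w @ phi(0.2)), 3), round(float(w @ phi(0.9)), 3))
    ```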

  15. Conflict acts as an implicit cost in reinforcement learning.

    PubMed

    Cavanagh, James F; Masters, Sean E; Bath, Kevin; Frank, Michael J

    2014-11-04

    Conflict has been proposed to act as a cost in action selection, implying a general function of medio-frontal cortex in the adaptation to aversive events. Here we investigate if response conflict acts as a cost during reinforcement learning by modulating experienced reward values in cortical and striatal systems. Electroencephalography recordings show that conflict diminishes the relationship between reward-related frontal theta power and cue preference yet it enhances the relationship between punishment and cue avoidance. Individual differences in the cost of conflict on reward versus punishment sensitivity are also related to a genetic polymorphism associated with striatal D1 versus D2 pathway balance (DARPP-32). We manipulate these patterns with the D2 agent cabergoline, which induces a strong bias to amplify the aversive value of punishment outcomes following conflict. Collectively, these findings demonstrate that interactive cortico-striatal systems implicitly modulate experienced reward and punishment values as a function of conflict.

  16. Time representation in reinforcement learning models of the basal ganglia

    PubMed Central

    Gershman, Samuel J.; Moustafa, Ahmed A.; Ludvig, Elliot A.

    2014-01-01

    Reinforcement learning (RL) models have been influential in understanding many aspects of basal ganglia function, from reward prediction to action selection. Time plays an important role in these models, but there is still no theoretical consensus about what kind of time representation is used by the basal ganglia. We review several theoretical accounts and their supporting evidence. We then discuss the relationship between RL models and the timing mechanisms that have been attributed to the basal ganglia. We hypothesize that a single computational system may underlie both RL and interval timing—the perception of duration in the range of seconds to hours. This hypothesis, which extends earlier models by incorporating a time-sensitive action selection mechanism, may have important implications for understanding disorders like Parkinson's disease in which both decision making and timing are impaired. PMID:24409138

  17. A Robust Reinforcement Learning Control Design Method for Nonlinear System with Partially Unknown Structure

    NASA Astrophysics Data System (ADS)

    Nakano, Kazuhiro; Obayashi, Masanao; Kuremoto, Takashi; Kobayashi, Kunikazu

    We propose a robust control system that is robust to disturbances and can deal with a nonlinear system of partially unknown structure by fusing reinforcement learning and robust control theory. First, we solve an optimal control problem without using the unknown part of the system's functions, using a neural network and the iterative learning of a reinforcement learning algorithm. Second, we build a robust reinforcement learning control system that tolerates uncertainty and is robust to disturbances by fusing the idea of H-infinity control theory with the above system.

  18. Reinforcement learning for discounted values often loses the goal in the application to animal learning.

    PubMed

    Yamaguchi, Yoshiya; Sakai, Yutaka

    2012-11-01

    The impulsive preference of an animal for an immediate reward implies that it might subjectively discount the value of potential future outcomes. A theoretical framework for maximizing the discounted subjective value has been established in reinforcement learning theory, and the framework has been successfully applied in engineering. However, this study identifies a limitation when it is applied to animal behavior: in some cases, the discounted objective no longer defines a learning goal. Here, a possible learning framework is proposed that is well-posed in all cases and consistent with the impulsive preference.
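
    For reference, the discounted objective referred to here is, in the usual notation,

    \[
    V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s\right], \qquad 0 \le \gamma < 1.
    \]

    Maximizing this quantity is the standard goal in the engineering use of reinforcement learning; the authors' alternative, well-posed formulation is not reproduced here.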

  19. The left hemisphere learns what is right: Hemispatial reward learning depends on reinforcement learning processes in the contralateral hemisphere.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly Crystal; Schwartz, Sophie

    2016-08-01

    Orienting biases refer to consistent, trait-like direction of attention or locomotion toward one side of space. Recent studies suggest that such hemispatial biases may determine how well people memorize information presented in the left or right hemifield. Moreover, lesion studies indicate that learning rewarded stimuli in one hemispace depends on the integrity of the contralateral striatum. However, the exact neural and computational mechanisms underlying the influence of individual orienting biases on reward learning remain unclear. Because reward-based behavioural adaptation depends on the dopaminergic system and prediction error (PE) encoding in the ventral striatum, we hypothesized that hemispheric asymmetries in dopamine (DA) function may determine individual spatial biases in reward learning. To test this prediction, we acquired fMRI in 33 healthy human participants while they performed a lateralized reward task. Learning differences between hemispaces were assessed by presenting stimuli, assigned to different reward probabilities, to the left or right of central fixation, i.e. presented in the left or right visual hemifield. Hemispheric differences in DA function were estimated through differential fMRI responses to positive vs. negative feedback in the left vs. right ventral striatum, and a computational approach was used to identify the neural correlates of PEs. Our results show that spatial biases favoring reward learning in the right (vs. left) hemifield were associated with increased reward responses in the left hemisphere and relatively better neural encoding of PEs for stimuli presented in the right (vs. left) hemifield. These findings demonstrate that trait-like spatial biases implicate hemisphere-specific learning mechanisms, with individual differences between hemispheres contributing to reinforcing spatial biases.

  20. The Drive-Reinforcement Neuronal Model: A Real-Time Learning Mechanism For Unsupervised Learning

    NASA Astrophysics Data System (ADS)

    Klopf, A. H.

    1988-05-01

    The drive-reinforcement neuronal model is described as an example of a newly discovered class of real-time learning mechanisms that correlate earlier derivatives of inputs with later derivatives of outputs. The drive-reinforcement neuronal model has been demonstrated to predict a wide range of classical conditioning phenomena in animal learning. A variety of classes of connectionist and neural network models have been investigated in recent years (Hinton and Anderson, 1981; Levine, 1983; Barto, 1985; Feldman, 1985; Rumelhart and McClelland, 1986). After a brief review of these models, discussion will focus on the class of real-time models because they appear to be making the strongest contact with the experimental evidence of animal learning. Theoretical models in physics have inspired Boltzmann machines (Ackley, Hinton, and Sejnowski, 1985) and what are sometimes called Hopfield networks (Hopfield, 1982; Hopfield and Tank, 1986). These connectionist models utilize symmetric connections and adaptive equilibrium processes during which the networks settle into minimal energy states. Networks utilizing error-correction learning mechanisms go back to Rosenblatt's (1962) perceptron and Widrow's (1962) adaline and currently take the form of back propagation networks (Parker, 1985; Rumelhart, Hinton, and Williams, 1985, 1986). These networks require a "teacher" or "trainer" to provide error signals indicating the difference between desired and actual responses. Networks employing real-time learning mechanisms, in which the temporal association of signals is of fundamental importance, go back to Hebb (1949). Real-time learning mechanisms may require no teacher or trainer and thus may lend themselves to unsupervised learning. Such models have been extended by Klopf (1972, 1982), who introduced the notions of synaptic eligibility and generalized reinforcement. Sutton and Barto (1981) advanced this class of models by proposing that a derivative of the theoretical neuron's out

  1. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.

    PubMed

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-04-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
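
    The continuous-time temporal-difference error of Doya (2000) that this work extends to spiking networks is usually written as

    \[
    \delta(t) \;=\; r(t) \;-\; \frac{1}{\tau}\,V(t) \;+\; \dot{V}(t),
    \]

    where \(V(t)\) is the critic's value estimate, \(\tau\) the time constant of reward discounting, and \(\dot{V}(t)\) its time derivative; in the model described above, this scalar signal is delivered as a neuromodulatory factor to both the critic and the actor.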

  2. Beamforming and Power Control in Sensor Arrays Using Reinforcement Learning

    PubMed Central

    Almeida, Náthalee C.; Fernandes, Marcelo A.C.; Neto, Adrião D.D.

    2015-01-01

    The use of beamforming and power control, combined or separately, has advantages and disadvantages, depending on the application. The combined use of beamforming and power control has been shown to be highly effective in applications involving the suppression of interference signals from different sources. However, it is necessary to identify efficient methodologies for the combined operation of these two techniques. The most appropriate technique may be obtained by means of the implementation of an intelligent agent capable of making the best selection between beamforming and power control. The present paper proposes an algorithm using reinforcement learning (RL) to determine the optimal combination of beamforming and power control in sensor arrays. The RL algorithm used was Q-learning, employing an ε-greedy policy, and training was performed using the offline method. The simulations showed that RL was effective for implementation of a switching policy involving the different techniques, taking advantage of the positive characteristics of each technique in terms of signal reception. PMID:25808769
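
    The core of the described agent is tabular Q-learning with an ε-greedy policy over the two candidate techniques. The sketch below shows that loop with a placeholder environment; the state definition, reward, and training data of the actual sensor-array simulation are not reproduced.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    ACTIONS = ["beamforming", "power_control"]   # techniques the agent switches between
    n_states = 4                                 # e.g. coarse interference regimes (assumed)
    Q = np.zeros((n_states, len(ACTIONS)))
    alpha, gamma, eps = 0.2, 0.9, 0.1

    def environment(state, action):
        """Placeholder: each technique is assumed better in some regimes than others."""
        best = 0 if state < 2 else 1
        reward = 1.0 if action == best else 0.2
        return reward, rng.integers(n_states)

    state = 0
    for t in range(2000):                        # offline training phase
        a = rng.integers(len(ACTIONS)) if rng.random() < eps else int(np.argmax(Q[state]))
        r, nxt = environment(state, a)
        Q[state, a] += alpha * (r + gamma * Q[nxt].max() - Q[state, a])
        state = nxt

    # resulting switching policy: which technique to apply in each regime
    print([ACTIONS[int(np.argmax(Q[s]))] for s in range(n_states)])
    ```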

  3. Online Reinforcement Learning Using a Probability Density Estimation.

    PubMed

    Agostini, Alejandro; Celaya, Enric

    2017-01-01

    Function approximation in online, incremental, reinforcement learning needs to deal with two fundamental problems: biased sampling and nonstationarity. In this kind of task, biased sampling occurs because samples are obtained from specific trajectories dictated by the dynamics of the environment and are usually concentrated in particular convergence regions, which in the long term tend to dominate the approximation in the less sampled regions. The nonstationarity comes from the recursive nature of the estimations typical of temporal difference methods. This nonstationarity has a local profile, varying not only along the learning process but also along different regions of the state space. We propose to deal with these problems using an estimation of the probability density of samples represented with a gaussian mixture model. To deal with the nonstationarity problem, we use the common approach of introducing a forgetting factor in the updating formula. However, instead of using the same forgetting factor for the whole domain, we make it dependent on the local density of samples, which we use to estimate the nonstationarity of the function at any given input point. To address the biased sampling problem, the forgetting factor applied to each mixture component is modulated according to the new information provided in the updating, rather than forgetting depending only on time, thus avoiding undesired distortions of the approximation in less sampled regions.
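
    The distinguishing idea (forgetting that depends on how much new information a region of the input space actually receives, rather than on time alone) can be shown schematically. In the sketch below, each mixture component forgets in proportion to its responsibility for the current sample, so rarely sampled regions keep their estimates; the full Gaussian-mixture density machinery of the paper is not reproduced and the functional form is an assumption.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    centers = np.array([0.2, 0.5, 0.8])    # fixed 1-D component centers (illustrative)
    width = 0.1
    values = np.zeros(3)                   # per-component estimates of a target function
    base_forget = 0.05                     # maximal forgetting per update

    def responsibilities(x):
        w = np.exp(-0.5 * ((x - centers) / width) ** 2)
        return w / w.sum()

    def update(x, target):
        global values
        r = responsibilities(x)
        lam = base_forget * r              # forgetting scaled by the new information received
        values = (1 - lam) * values + lam * target   # distant components barely change

    for t in range(3000):
        x = rng.beta(2, 5)                 # biased sampling: most samples fall near 0.2
        update(x, np.sin(3 * x))
    print(dict(zip(centers.tolist(), np.round(values, 3).tolist())))
    ```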

  4. Electrophysiological correlates of reinforcement learning in young people with Tourette syndrome with and without co-occurring ADHD symptoms.

    PubMed

    Shephard, Elizabeth; Jackson, Georgina M; Groom, Madeleine J

    2016-06-01

    Altered reinforcement learning is implicated in the causes of Tourette syndrome (TS) and attention-deficit/hyperactivity disorder (ADHD). TS and ADHD frequently co-occur but how this affects reinforcement learning has not been investigated. We examined the ability of young people with TS (n=18), TS+ADHD (N=17), ADHD (n=13) and typically developing controls (n=20) to learn and reverse stimulus-response (S-R) associations based on positive and negative reinforcement feedback. We used a 2 (TS-yes, TS-no)×2 (ADHD-yes, ADHD-no) factorial design to assess the effects of TS, ADHD, and their interaction on behavioural (accuracy, RT) and event-related potential (stimulus-locked P3, feedback-locked P2, feedback-related negativity, FRN) indices of learning and reversing the S-R associations. TS was associated with intact learning and reversal performance and largely typical ERP amplitudes. ADHD was associated with lower accuracy during S-R learning and impaired reversal learning (significantly reduced accuracy and a trend for smaller P3 amplitude). The results indicate that co-occurring ADHD symptoms impair reversal learning in TS+ADHD. The implications of these findings for behavioural tic therapies are discussed.

  5. Numerical analysis of a reinforcement learning model with the dynamic aspiration level in the iterated Prisoner's dilemma.

    PubMed

    Masuda, Naoki; Nakamura, Mitsuhiro

    2011-06-07

    Humans and other animals can adapt their social behavior in response to environmental cues including the feedback obtained through experience. Nevertheless, the effects of the experience-based learning of players in evolution and maintenance of cooperation in social dilemma games remain relatively unclear. Some previous literature showed that mutual cooperation of learning players is difficult or requires a sophisticated learning model. In the context of the iterated Prisoner's dilemma, we numerically examine the performance of a reinforcement learning model. Our model modifies those of Karandikar et al. (1998), Posch et al. (1999), and Macy and Flache (2002) in which players satisfy if the obtained payoff is larger than a dynamic threshold. We show that players obeying the modified learning mutually cooperate with high probability if the dynamics of threshold is not too fast and the association between the reinforcement signal and the action in the next round is sufficiently strong. The learning players also perform efficiently against the reactive strategy. In evolutionary dynamics, they can invade a population of players adopting simpler but competitive strategies. Our version of the reinforcement learning model does not complicate the previous model and is sufficiently simple yet flexible. It may serve to explore the relationships between learning and evolution in social dilemma situations.
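
    A minimal sketch of this family of models (Bush-Mosteller-type reinforcement with a dynamic aspiration level) is given below for two learners playing the Prisoner's dilemma against each other. The saturating stimulus function and all parameter values are illustrative; the specific modifications introduced in the paper are not reproduced.

    ```python
    import numpy as np

    rng = np.random.default_rng(5)

    # Row player's payoffs; actions: 0 = cooperate, 1 = defect
    PAYOFF = np.array([[3.0, 0.0],
                       [5.0, 1.0]])

    class AspirationLearner:
        """Satisfied (positive stimulus) when the payoff exceeds a moving aspiration."""
        def __init__(self, p_coop=0.5, aspiration=2.0, lr=0.3, h=0.1):
            self.p = p_coop        # probability of cooperating
            self.A = aspiration    # dynamic aspiration level
            self.lr = lr           # reinforcement rate
            self.h = h             # habituation rate of the aspiration level

        def act(self):
            return 0 if rng.random() < self.p else 1

        def update(self, action, payoff):
            s = np.tanh(payoff - self.A)                 # signed, saturating stimulus
            if action == 0:                              # cooperation reinforced/punished
                self.p += self.lr * s * ((1 - self.p) if s > 0 else self.p)
            else:                                        # defection reinforced/punished
                self.p -= self.lr * s * (self.p if s > 0 else (1 - self.p))
            self.p = float(np.clip(self.p, 0.01, 0.99))
            self.A = (1 - self.h) * self.A + self.h * payoff   # aspiration tracks payoffs

    p1, p2 = AspirationLearner(), AspirationLearner()
    for t in range(5000):
        a1, a2 = p1.act(), p2.act()
        p1.update(a1, PAYOFF[a1, a2])
        p2.update(a2, PAYOFF[a2, a1])
    print(round(p1.p, 2), round(p2.p, 2))
    ```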

  6. Carbon Fiber Reinforced Glass Matrix Composites for Space Based Applications.

    DTIC Science & Technology

    1987-08-31

    Nardone , "Carbon Fiber Reinforced Glass Matrix Composites for Space Based Applications", Office of Naval Research Contract N00014-85-C-0332, Report R86... Nardone and K M. Prewo, "Tensile Performance of Carbon Fiber Reinforced Glass", J. Mater. Sci. accepted for publication, 1987. 27. R. F. Cooper and K

  7. Intelligence moderates reinforcement learning: a mini-review of the neural evidence.

    PubMed

    Chen, Chong

    2015-06-01

    Our understanding of the neural basis of reinforcement learning and intelligence, two key factors contributing to human strivings, has progressed significantly recently. However, the overlap of these two lines of research, namely, how intelligence affects neural responses during reinforcement learning, remains uninvestigated. A mini-review of three existing studies suggests that higher IQ (especially fluid IQ) may enhance the neural signal of positive prediction error in dorsolateral prefrontal cortex, dorsal anterior cingulate cortex, and striatum, several brain substrates of reinforcement learning or intelligence.

  8. Intelligence moderates reinforcement learning: a mini-review of the neural evidence

    PubMed Central

    2014-01-01

    Our understanding of the neural basis of reinforcement learning and intelligence, two key factors contributing to human strivings, has progressed significantly recently. However, the overlap of these two lines of research, namely, how intelligence affects neural responses during reinforcement learning, remains uninvestigated. A mini-review of three existing studies suggests that higher IQ (especially fluid IQ) may enhance the neural signal of positive prediction error in dorsolateral prefrontal cortex, dorsal anterior cingulate cortex, and striatum, several brain substrates of reinforcement learning or intelligence. PMID:25185818

  9. Identifying Cognitive Remediation Change Through Computational Modelling—Effects on Reinforcement Learning in Schizophrenia

    PubMed Central

    Cella, Matteo; Bishara, Anthony J.; Medin, Evelina; Swan, Sarah; Reeder, Clare; Wykes, Til

    2014-01-01

    Objective: Converging research suggests that individuals with schizophrenia show a marked impairment in reinforcement learning, particularly in tasks requiring flexibility and adaptation. The problem has been associated with dopamine reward systems. This study explores, for the first time, the characteristics of this impairment and how it is affected by a behavioral intervention—cognitive remediation. Method: Using computational modelling, 3 reinforcement learning parameters based on the Wisconsin Card Sorting Test (WCST) trial-by-trial performance were estimated: R (reward sensitivity), P (punishment sensitivity), and D (choice consistency). In Study 1 the parameters were compared between a group of individuals with schizophrenia (n = 100) and a healthy control group (n = 50). In Study 2 the effect of cognitive remediation therapy (CRT) on these parameters was assessed in 2 groups of individuals with schizophrenia, one receiving CRT (n = 37) and the other receiving treatment as usual (TAU, n = 34). Results: In Study 1 individuals with schizophrenia showed impairment in the R and P parameters compared with healthy controls. Study 2 demonstrated that sensitivity to negative feedback (P) and reward (R) improved in the CRT group after therapy compared with the TAU group. R and P parameter change correlated with WCST outputs. Improvements in R and P after CRT were associated with working memory gains and reduction of negative symptoms, respectively. Conclusion: Schizophrenia reinforcement learning difficulties negatively influence performance in shift learning tasks. CRT can improve sensitivity to reward and punishment. Identifying parameters that show change may be useful in experimental medicine studies to identify cognitive domains susceptible to improvement. PMID:24214932

  10. fMRI and EEG Predictors of Dynamic Decision Parameters during Human Reinforcement Learning

    PubMed Central

    Gagne, Chris; Nyhus, Erika; Masters, Sean; Wiecki, Thomas V.; Cavanagh, James F.; Badre, David

    2015-01-01

    What are the neural dynamics of choice processes during reinforcement learning? Two largely separate literatures have examined dynamics of reinforcement learning (RL) as a function of experience but assuming a static choice process, or conversely, the dynamics of choice processes in decision making but based on static decision values. Here we show that human choice processes during RL are well described by a drift diffusion model (DDM) of decision making in which the learned trial-by-trial reward values are sequentially sampled, with a choice made when the value signal crosses a decision threshold. Moreover, simultaneous fMRI and EEG recordings revealed that this decision threshold is not fixed across trials but varies as a function of activity in the subthalamic nucleus (STN) and is further modulated by trial-by-trial measures of decision conflict and activity in the dorsomedial frontal cortex (pre-SMA BOLD and mediofrontal theta in EEG). These findings provide converging multimodal evidence for a model in which decision threshold in reward-based tasks is adjusted as a function of communication from pre-SMA to STN when choices differ subtly in reward values, allowing more time to choose the statistically more rewarding option. PMID:25589744
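
    The combined model can be caricatured as follows: learned action values set the drift rate of a diffusion-to-bound choice process, and the decision threshold is raised when the values are close (high conflict), allowing more evidence to accumulate. The sketch below couples a simple Q-learner to such a diffusion process; the threshold-adjustment rule and all parameters are illustrative stand-ins for the fitted hierarchical model.

    ```python
    import numpy as np

    rng = np.random.default_rng(6)

    def ddm_choice(q_a, q_b, base_threshold=1.0, conflict_gain=1.0, dt=0.01, noise=1.0):
        """Sample a value-difference signal until it crosses a (conflict-dependent) bound."""
        drift = q_a - q_b
        conflict = max(1.0 - abs(q_a - q_b), 0.0)   # crude conflict measure in [0, 1]
        bound = base_threshold + conflict_gain * conflict
        x, t = 0.0, 0.0
        while abs(x) < bound:
            x += drift * dt + noise * np.sqrt(dt) * rng.normal()
            t += dt
        return (0 if x > 0 else 1), t               # (choice, decision time)

    q = np.array([0.5, 0.5])        # trial-by-trial learned reward values
    alpha = 0.1
    p_reward = np.array([0.8, 0.2])
    for trial in range(300):
        choice, rt = ddm_choice(q[0], q[1])
        reward = float(rng.random() < p_reward[choice])
        q[choice] += alpha * (reward - q[choice])   # values sampled on the next trial
    print(np.round(q, 2))
    ```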

  11. fMRI and EEG predictors of dynamic decision parameters during human reinforcement learning.

    PubMed

    Frank, Michael J; Gagne, Chris; Nyhus, Erika; Masters, Sean; Wiecki, Thomas V; Cavanagh, James F; Badre, David

    2015-01-14

    What are the neural dynamics of choice processes during reinforcement learning? Two largely separate literatures have examined dynamics of reinforcement learning (RL) as a function of experience but assuming a static choice process, or conversely, the dynamics of choice processes in decision making but based on static decision values. Here we show that human choice processes during RL are well described by a drift diffusion model (DDM) of decision making in which the learned trial-by-trial reward values are sequentially sampled, with a choice made when the value signal crosses a decision threshold. Moreover, simultaneous fMRI and EEG recordings revealed that this decision threshold is not fixed across trials but varies as a function of activity in the subthalamic nucleus (STN) and is further modulated by trial-by-trial measures of decision conflict and activity in the dorsomedial frontal cortex (pre-SMA BOLD and mediofrontal theta in EEG). These findings provide converging multimodal evidence for a model in which decision threshold in reward-based tasks is adjusted as a function of communication from pre-SMA to STN when choices differ subtly in reward values, allowing more time to choose the statistically more rewarding option.

  12. Efficient exploration through active learning for value function approximation in reinforcement learning.

    PubMed

    Akiyama, Takayuki; Hachiya, Hirotaka; Sugiyama, Masashi

    2010-06-01

    Appropriately designing sampling policies is highly important for obtaining better control policies in reinforcement learning. In this paper, we first show that the least-squares policy iteration (LSPI) framework allows us to employ statistical active learning methods for linear regression. Then we propose a design method of good sampling policies for efficient exploration, which is particularly useful when the sampling cost of immediate rewards is high. The effectiveness of the proposed method, which we call active policy iteration (API), is demonstrated through simulations with a batting robot.

  13. A parameter control method in reinforcement learning to rapidly follow unexpected environmental changes.

    PubMed

    Murakoshi, Kazushi; Mizuno, Junya

    2004-11-01

    In order to rapidly follow unexpected environmental changes, we propose a parameter control method for reinforcement learning that changes each learning parameter in an appropriate direction. We determine each appropriate direction on the basis of relationships between behaviors and neuromodulators, taking an emergency as the key notion. Computer experiments show that agents using our proposed method can respond rapidly to unexpected environmental changes, regardless of which of two reinforcement learning algorithms (Q-learning or the actor-critic (AC) architecture) and which of two learning problems (discontinuous or continuous state-action problems) is used.

  14. Self-organizing neural networks integrating domain knowledge and reinforcement learning.

    PubMed

    Teng, Teck-Hou; Tan, Ah-Hwee; Zurada, Jacek M

    2015-05-01

    The use of domain knowledge in learning systems is expected to improve learning efficiency and reduce model complexity. However, due to its incompatibility with the knowledge structure of the learning systems and the real-time exploratory nature of reinforcement learning (RL), domain knowledge cannot be inserted directly. In this paper, we show how self-organizing neural networks designed for online and incremental adaptation can integrate domain knowledge and RL. Specifically, symbol-based domain knowledge is translated into numeric patterns before being inserted into the self-organizing neural networks. To ensure effective use of domain knowledge, we present an analysis of how the inserted knowledge is used by the self-organizing neural networks during RL. To this end, we propose a vigilance adaptation and greedy exploitation strategy to maximize exploitation of the inserted domain knowledge while retaining the plasticity of learning and using new knowledge. Our experimental results based on the pursuit-evasion and minefield navigation problem domains show that such self-organizing neural networks can make effective use of domain knowledge to improve learning efficiency and reduce model complexity.

  15. Dopamine-Dependent Reinforcement of Motor Skill Learning: Evidence from Gilles de la Tourette Syndrome

    ERIC Educational Resources Information Center

    Palminteri, Stefano; Lebreton, Mael; Worbe, Yulia; Hartmann, Andreas; Lehericy, Stephane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-01-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only…

  16. Reinforcement learning solution for HJB equation arising in constrained optimal control problem.

    PubMed

    Luo, Biao; Wu, Huai-Ning; Huang, Tingwen; Liu, Derong

    2015-11-01

    The constrained optimal control problem depends on the solution of the complicated Hamilton-Jacobi-Bellman equation (HJBE). In this paper, a data-based off-policy reinforcement learning (RL) method is proposed, which learns the solution of the HJBE and the optimal control policy from real system data. One important feature of the off-policy RL is that its policy evaluation can be realized with data generated by other behavior policies, not necessarily the target policy, which solves the insufficient exploration problem. The convergence of the off-policy RL is proved by demonstrating its equivalence to the successive approximation approach. Its implementation procedure is based on the actor-critic neural networks structure, where the function approximation is conducted with linearly independent basis functions. Subsequently, the convergence of the implementation procedure with function approximation is also proved. Finally, its effectiveness is verified through computer simulations.

  17. Reinforcement Learning with Autonomous Small Unmanned Aerial Vehicles in Cluttered Environments

    NASA Technical Reports Server (NTRS)

    Tran, Loc; Cross, Charles; Montague, Gilbert; Motter, Mark; Neilan, James; Qualls, Garry; Rothhaar, Paul; Trujillo, Anna; Allen, B. Danette

    2015-01-01

    We present ongoing work in the Autonomy Incubator at NASA Langley Research Center (LaRC) exploring the efficacy of a data set aggregation approach to reinforcement learning for small unmanned aerial vehicle (sUAV) flight in dense and cluttered environments with reactive obstacle avoidance. The goal is to learn an autonomous flight model using training experiences from a human piloting a sUAV around static obstacles. The training approach uses video data from a forward-facing camera that records the human pilot's flight. Various computer-vision-based features relating to edge and gradient information are extracted from the video. The recorded human-controlled inputs are used to train an autonomous control model that maps the extracted feature vector to a yaw command. As part of the reinforcement learning approach, the autonomous control model is iteratively updated with feedback from a human agent who corrects undesired model output. This data-driven approach to autonomous obstacle avoidance is explored for simulated forest environments, furthering research on autonomous flight under the tree canopy. This enables flight in previously inaccessible environments which are of interest to NASA researchers in Earth and Atmospheric sciences.
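
    The training loop described above is, in outline, an iterative dataset-aggregation procedure: fit a feature-to-yaw model on human-piloted flight data, fly it, collect corrections, append them, and refit. The sketch below shows only that skeleton; the feature extractor, the model class, and the correction source are placeholders, not the system used at LaRC.

    ```python
    import numpy as np

    rng = np.random.default_rng(8)

    def extract_features(frame):
        """Placeholder for edge/gradient features computed from a camera frame."""
        gx, gy = np.gradient(frame)
        return np.array([np.abs(gx).mean(), np.abs(gy).mean(), frame.mean(), frame.std()])

    def fit_linear(X, y):
        """Least-squares map from feature vectors to yaw commands (stand-in model)."""
        X1 = np.c_[X, np.ones(len(X))]
        w, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return w

    def predict(w, x):
        return float(np.r_[x, 1.0] @ w)

    # 1) initial dataset: frames and logged yaw commands from human-piloted flight
    frames = [rng.random((32, 32)) for _ in range(200)]
    X = np.array([extract_features(f) for f in frames])
    y = rng.uniform(-1, 1, size=len(X))     # stand-in for the pilot's recorded yaw commands
    model = fit_linear(X, y)

    # 2) iterative aggregation: fly the learned model, gather corrected examples,
    #    append them to the dataset, and retrain on the aggregate
    for iteration in range(3):
        new_frames = [rng.random((32, 32)) for _ in range(50)]
        new_X = np.array([extract_features(f) for f in new_frames])
        # stand-in for human-corrected yaw commands (here: prediction plus noise)
        corrections = np.array([predict(model, x) for x in new_X]) + rng.normal(0, 0.1, 50)
        X = np.vstack([X, new_X])
        y = np.concatenate([y, corrections])
        model = fit_linear(X, y)
    ```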

  18. Neural Circuits Trained with Standard Reinforcement Learning Can Accumulate Probabilistic Information during Decision Making.

    PubMed

    Kurzawa, Nils; Summerfield, Christopher; Bogacz, Rafal

    2017-02-01

    Much experimental evidence suggests that during decision making, neural circuits accumulate evidence supporting alternative options. A computational model well describing this accumulation for choices between two options assumes that the brain integrates the log ratios of the likelihoods of the sensory inputs given the two options. Several models have been proposed for how neural circuits can learn these log-likelihood ratios from experience, but all of these models introduced novel and specially dedicated synaptic plasticity rules. Here we show that for a certain wide class of tasks, the log-likelihood ratios are approximately linearly proportional to the expected rewards for selecting actions. Therefore, a simple model based on standard reinforcement learning rules is able to estimate the log-likelihood ratios from experience and on each trial accumulate the log-likelihood ratios associated with presented stimuli while selecting an action. The simulations of the model replicate experimental data on both behavior and neural activity in tasks requiring accumulation of probabilistic cues. Our results suggest that there is no need for the brain to support dedicated plasticity rules, as the standard mechanisms proposed to describe reinforcement learning can enable the neural circuits to perform efficient probabilistic inference.
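
    The argument can be illustrated with a cue-combination task: each cue's learned value difference between the two actions plays the role of its log-likelihood ratio, and on each trial the agent simply sums the learned values of the cues it sees. The generative task below and all parameters are assumptions made for the sketch, not the simulations in the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(7)

    n_cues = 4
    p_a_given_cue = np.array([0.9, 0.7, 0.4, 0.2])   # each cue's validity for outcome A

    # Q[cue, action]: expected reward for an action given a cue, learned by a plain
    # delta rule -- no dedicated likelihood-learning machinery.
    Q = np.zeros((n_cues, 2))
    alpha = 0.05

    for trial in range(20000):
        shown = rng.random(n_cues) < 0.5                 # random subset of cues presented
        if not shown.any():
            continue
        outcome = 0 if rng.random() < p_a_given_cue[shown].mean() else 1
        evidence = Q[shown].sum(axis=0)                  # accumulate values of shown cues
        if evidence[0] == evidence[1]:
            action = int(rng.integers(2))
        else:
            action = int(np.argmax(evidence))
        reward = 1.0 if action == outcome else 0.0
        Q[shown, action] += alpha * (reward - Q[shown, action])

    # per-cue value differences end up ordered like the cues' predictive strengths
    print(np.round(Q[:, 0] - Q[:, 1], 2))
    ```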

  19. Universal effect of dynamical reinforcement learning mechanism in spatial evolutionary games

    NASA Astrophysics Data System (ADS)

    Zhang, Hai-Feng; Wu, Zhi-Xi; Wang, Bing-Hong

    2012-06-01

    One of the prototypical mechanisms in understanding the ubiquitous cooperation in social dilemma situations is the win-stay, lose-shift rule. In this work, a generalized win-stay, lose-shift learning model—a reinforcement learning model with dynamic aspiration level—is proposed to describe how humans adapt their social behaviors based on their social experiences. In the model, the players incorporate the information of the outcomes in previous rounds with time-dependent aspiration payoffs to regulate the probability of choosing cooperation. By investigating such a reinforcement learning rule in the spatial prisoner's dilemma game and public goods game, a most noteworthy viewpoint is that moderate greediness (i.e. moderate aspiration level) favors best the development and organization of collective cooperation. The generality of this observation is tested against different regulation strengths and different types of network of interaction as well. We also make comparisons with two recently proposed models to highlight the importance of the mechanism of adaptive aspiration level in supporting cooperation in structured populations.

  20. Complexity Analysis of Real-Time Reinforcement Learning Applied to Finding Shortest Paths in Deterministic Domains

    DTIC Science & Technology

    1992-12-01

    i.e. tabula rasa) reinforcement learning was exponential for such problems, or that it was tractable (i.e. of polynomial time-complexity) only if the... Figure 1: Navigating on a map studied by [2], [5], [23], [19], [24], and others. [35] showed that reaching a goal state with uninformed (i.e. tabula rasa) reinforcement learning methods can require a number of action executions that is exponential in the size of the state space. [33] has shown that

  1. Cooperation and Coordination Between Fuzzy Reinforcement Learning Agents in Continuous State Partially Observable Markov Decision Processes

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Vengerov, David

    1999-01-01

    Successful operations of future multi-agent intelligent systems require efficient cooperation schemes between agents sharing learning experiences. We consider a pseudo-realistic world in which one or more opportunities appear and disappear in random locations. Agents use fuzzy reinforcement learning to learn which opportunities are most worthy of pursuing based on their promised rewards, expected lifetimes, path lengths and expected path costs. We show that this world is partially observable because the history of an agent influences the distribution of its future states. We consider a cooperation mechanism in which agents share experience by using and updating one joint behavior policy. We also implement a coordination mechanism for allocating opportunities to different agents in the same world. Our results demonstrate that K cooperative agents each learning in a separate world over N time steps outperform K independent agents each learning in a separate world over K*N time steps, with this result becoming more pronounced as the degree of partial observability in the environment increases. We also show that cooperation between agents learning in the same world decreases performance with respect to independent agents. Since cooperation reduces diversity between agents, we conclude that diversity is a key parameter in the trade-off between maximizing utility from cooperation when diversity is low and maximizing utility from competitive coordination when diversity is high.

  2. Evolution of cooperation facilitated by reinforcement learning with adaptive aspiration levels.

    PubMed

    Tanabe, Shoma; Masuda, Naoki

    2012-01-21

    Repeated interaction between individuals is the main mechanism for maintaining cooperation in social dilemma situations. Variants of tit-for-tat (repeating the previous action of the opponent) and the win-stay lose-shift strategy are known as strong competitors in iterated social dilemma games. On the other hand, real repeated interaction generally allows plasticity (i.e., learning) of individuals based on the experience of the past. Although plasticity is relevant to various biological phenomena, its role in repeated social dilemma games is relatively unexplored. In particular, if experience-based learning plays a key role in promotion and maintenance of cooperation, learners should evolve in the contest with nonlearners under selection pressure. By modeling players using a simple reinforcement learning model, we numerically show that learning enables the evolution of cooperation. We also show that numerically estimated adaptive dynamics appositely predict the outcome of evolutionary simulations. The analysis of the adaptive dynamics enables us to capture the obtained results as an affirmative example of the Baldwin effect, where learning accelerates the evolution to optimality.

  3. The cerebellum: a neural system for the study of reinforcement learning.

    PubMed

    Swain, Rodney A; Kerr, Abigail L; Thompson, Richard F

    2011-01-01

    In its strictest application, the term "reinforcement learning" refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Given that Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed); it is the case that any neural system that supports reinforcement learning must also be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning.

  4. Reinforcement learning deficits in people with schizophrenia persist after extended trials.

    PubMed

    Cicero, David C; Martin, Elizabeth A; Becker, Theresa M; Kerns, John G

    2014-12-30

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited amount of acquisition trials to learn the reward contingencies and then tested about what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingences. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive feedback and negative feedback. These reward learning deficits persisted even if people with schizophrenia are given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning.

  5. Reinforcement Learning Deficits in People with Schizophrenia Persist after Extended Trials

    PubMed Central

    Cicero, David C.; Martin, Elizabeth A.; Becker, Theresa M.; Kerns, John G.

    2014-01-01

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning. PMID:25172610

  6. Variance-penalized Markov decision processes: dynamic programming and reinforcement learning techniques

    NASA Astrophysics Data System (ADS)

    Gosavi, Abhijit

    2014-08-01

    In control systems theory, the Markov decision process (MDP) is a widely used optimization model involving selection of the optimal action in each state visited by a discrete-event system driven by Markov chains. The classical MDP model is suitable for an agent/decision-maker interested in maximizing expected revenues, but does not account for minimizing variability in the revenues. An MDP model in which the agent can maximize the revenues while simultaneously controlling the variance in the revenues is proposed. This work is rooted in machine learning/neural network concepts, where updating is based on system feedback and step sizes. First, a Bellman equation for the problem is proposed. Thereafter, convergent dynamic programming and reinforcement learning techniques for solving the MDP are provided along with encouraging numerical results on a small MDP and a preventive maintenance problem.
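
    The core idea can be illustrated with a toy computation. The sketch below is not the paper's Bellman equation or solution algorithm; it simply runs value iteration on a tiny, made-up two-state MDP whose immediate rewards are penalized by their squared deviation from a crude mean-reward estimate, so that actions with steadier payoffs are favored (the transition matrix, rewards, and the penalty weight theta are all assumptions):

        import numpy as np

        # Tiny two-state, two-action MDP (all numbers are hypothetical).
        # P[s, a, s'] is the transition probability; R[s, a] the expected reward.
        P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                      [[0.5, 0.5], [0.1, 0.9]]])
        R = np.array([[1.0, 2.0],
                      [0.5, 3.0]])

        gamma = 0.95          # discount factor
        theta = 0.2           # weight of the variance penalty (assumed)
        rho_bar = R.mean()    # crude proxy for the long-run mean reward

        # Penalize rewards that deviate from the mean, favoring steadier actions.
        R_pen = R - theta * (R - rho_bar) ** 2

        # Ordinary value iteration, applied to the penalized reward.
        V = np.zeros(2)
        for _ in range(500):
            Q = R_pen + gamma * (P @ V)       # Q[s, a]
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < 1e-8:
                break
            V = V_new

        print("penalized values:", V, "greedy policy:", Q.argmax(axis=1))

    In the penalized problem the greedy policy can differ from the one that maximizes expected reward alone, which is exactly the trade-off between revenue and variability that the abstract describes.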

  7. Learned helplessness: effects of noncontingent reinforcement and response cost with emotionally disturbed children.

    PubMed

    Saylor, C F; Finch, A J; Cassel, S C; Saylor, C B; Penberthy, A R

    1984-07-01

    In order to investigate the effectiveness of noncontingent reinforcement and response cost in inducing learned helplessness and to determine whether depressed Ss respond differently than nondepressed Ss, 28 emotionally disturbed children (20 boys, 8 girls) were tested in a modified learned helplessness paradigm. Children's Depression Inventory score and diagnosis were each used to distinguish "depressed" and "nondepressed" children. Half of the depressed group and half of the nondepressed group received noncontingent response cost, while the other half of each group received noncontingent positive reinforcement. Results indicated that both noncontingent response cost and noncontingent reinforcement led to reduced persistence time relative to persistence under conditions of contingent reinforcement. There was only one significant difference between depressed and nondepressed Ss (differential persistence time over trials) and there were no significant interactions. Results were discussed in terms of Seligman's formulation of learned helplessness and the extension of this model to a clinical child population.

  8. Human dorsal striatal activity during choice discriminates reinforcement learning behavior from the gambler's fallacy.

    PubMed

    Jessup, Ryan K; O'Doherty, John P

    2011-04-27

    Reinforcement learning theory has generated substantial interest in neurobiology, particularly because of the resemblance between phasic dopamine and reward prediction errors. Actor-critic theories have been adapted to account for the functions of the striatum, with parts of the dorsal striatum equated to the actor. Here, we specifically test whether the human dorsal striatum--as predicted by an actor-critic instantiation--is used on a trial-to-trial basis at the time of choice to choose in accordance with reinforcement learning theory, as opposed to a competing strategy: the gambler's fallacy. Using a partial-brain functional magnetic resonance imaging scanning protocol focused on the striatum and other ventral brain areas, we found that the dorsal striatum is more active when choosing consistent with reinforcement learning compared with the competing strategy. Moreover, an overlapping area of dorsal striatum along with the ventral striatum was found to be correlated with reward prediction errors at the time of outcome, as predicted by the actor-critic framework. These findings suggest that the same region of dorsal striatum involved in learning stimulus-response associations may contribute to the control of behavior during choice, thereby using those learned associations. Intriguingly, neither reinforcement learning nor the gambler's fallacy conformed to the optimal choice strategy on the specific decision-making task we used. Thus, the dorsal striatum may contribute to the control of behavior according to reinforcement learning even when the prescriptions of such an algorithm are suboptimal in terms of maximizing future rewards.
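
    For readers unfamiliar with the two strategies being contrasted, the following toy simulation pits a simple prediction-error learner, which tends to repeat previously rewarded choices, against a gambler's-fallacy chooser, which favors the option that has gone longest without paying off. It is not the task or model from the study; the reward probabilities, learning rate, and softmax temperature are invented for illustration:

        import numpy as np

        rng = np.random.default_rng(0)
        p_reward = np.array([0.6, 0.4])   # hypothetical reward probabilities of two options

        def softmax(x, beta):
            z = beta * (x - x.max())      # subtract the max for numerical stability
            e = np.exp(z)
            return e / e.sum()

        def run(strategy, n_trials=500, alpha=0.1, beta=3.0):
            values = np.zeros(2)          # learned values (reinforcement learning)
            since_win = np.zeros(2)       # trials since each option last paid off
            earned = 0.0
            for _ in range(n_trials):
                if strategy == "reinforcement_learning":
                    p = softmax(values, beta)       # repeat what has been rewarded
                else:                               # gambler's fallacy
                    p = softmax(since_win, beta)    # pick the option that seems "due"
                choice = rng.choice(2, p=p)
                reward = float(rng.random() < p_reward[choice])
                values[choice] += alpha * (reward - values[choice])  # prediction-error update
                since_win += 1
                if reward:
                    since_win[choice] = 0
                earned += reward
            return earned

        for name in ("reinforcement_learning", "gamblers_fallacy"):
            print(name, run(name))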

  9. Learning processes affecting human decision making: An assessment of reinforcer-selective Pavlovian-to-instrumental transfer following reinforcer devaluation.

    PubMed

    Allman, Melissa J; DeLeon, Iser G; Cataldo, Michael F; Holland, Peter C; Johnson, Alexander W

    2010-07-01

    In reinforcer-selective transfer, Pavlovian stimuli that are predictive of specific outcomes bias performance toward responses associated with those outcomes. Although this phenomenon has been extensively examined in rodents, recent assessments have extended to humans. Using a stock market paradigm, adults were trained to associate particular symbols and responses with particular currencies. During the first test, individuals showed a preference for responding on actions associated with the same outcome as that predicted by the presented stimulus (i.e., a reinforcer-selective transfer effect). In the second test of the experiment, one of the currencies was devalued. We found it notable that this served to reduce responses to those stimuli associated with the devalued currency. This finding is in contrast to that typically observed in rodent studies, and suggests that participants in this task represented the sensory features that differentiate the reinforcers and their value during reinforcer-selective transfer. These results are discussed in terms of implications for understanding associative learning processes in humans and the ability of reward-paired cues to direct adaptive and maladaptive behavior.

  10. A Judgement-Based Model of Workplace Learning

    ERIC Educational Resources Information Center

    Athanasou, James A.

    2004-01-01

    The purpose of this paper is to outline a judgement-based model of adult learning. This approach is set out as a Perceptual-Judgemental-Reinforcement approach to social learning under conditions of complexity and where there is no single, clearly identified correct response. The model builds upon the Hager-Halliday thesis of workplace learning and…

  11. Adolescent-specific patterns of behavior and neural activity during social reinforcement learning.

    PubMed

    Jones, Rebecca M; Somerville, Leah H; Li, Jian; Ruberry, Erika J; Powers, Alisa; Mehta, Natasha; Dyke, Jonathan; Casey, B J

    2014-06-01

    Humans are sophisticated social beings. Social cues from others are exceptionally salient, particularly during adolescence. Understanding how adolescents interpret and learn from variable social signals can provide insight into the observed shift in social sensitivity during this period. The present study tested 120 participants between the ages of 8 and 25 years on a social reinforcement learning task where the probability of receiving positive social feedback was parametrically manipulated. Seventy-eight of these participants completed the task during fMRI scanning. Modeling trial-by-trial learning, children and adults showed higher positive learning rates than did adolescents, suggesting that adolescents demonstrated less differentiation in their reaction times for peers who provided more positive feedback. Forming expectations about receiving positive social reinforcement correlated with neural activity within the medial prefrontal cortex and ventral striatum across age. Adolescents, unlike children and adults, showed greater insular activity during positive prediction error learning and increased activity in the supplementary motor cortex and the putamen when receiving positive social feedback regardless of the expected outcome, suggesting that peer approval may motivate adolescents toward action. While different amounts of positive social reinforcement enhanced learning in children and adults, all positive social reinforcement equally motivated adolescents. Together, these findings indicate that sensitivity to peer approval during adolescence goes beyond simple reinforcement theory accounts and suggest possible explanations for how peers may motivate adolescent behavior.
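
    A positive learning rate in this kind of trial-by-trial model is simply the step size applied to prediction errors when updating an expectation about social feedback. The snippet below is a generic delta-rule illustration, not the model fit in the study; the learning-rate values and the 75% feedback rate are arbitrary:

        import random

        def update_expectation(expectation, feedback, alpha_pos=0.3, alpha_neg=0.3):
            """One trial of a delta-rule model of social feedback learning.

            expectation: current belief that this peer gives positive feedback (0-1)
            feedback:    1 if positive social feedback was received, else 0
            alpha_pos, alpha_neg: learning rates applied to positive / negative
                prediction errors (the values used here are arbitrary illustrations)
            """
            prediction_error = feedback - expectation
            alpha = alpha_pos if prediction_error > 0 else alpha_neg
            return expectation + alpha * prediction_error

        # Example: a hypothetical peer who gives positive feedback on 75% of trials.
        random.seed(1)
        belief = 0.5
        for _ in range(100):
            belief = update_expectation(belief, int(random.random() < 0.75))
        print(round(belief, 2))   # the belief drifts toward the true feedback rate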

  12. Adolescent-specific patterns of behavior and neural activity during social reinforcement learning

    PubMed Central

    Jones, Rebecca M.; Somerville, Leah H.; Li, Jian; Ruberry, Erika J.; Powers, Alisa; Mehta, Natasha; Dyke, Jonathan; Casey, BJ

    2014-01-01

    Humans are sophisticated social beings. Social cues from others are exceptionally salient, particularly during adolescence. Understanding how adolescents interpret and learn from variable social signals can provide insight into the observed shift in social sensitivity during this period. The current study tested 120 participants between the ages of 8 and 25 years on a social reinforcement learning task where the probability of receiving positive social feedback was parametrically manipulated. Seventy-eight of these participants completed the task during fMRI scanning. Modeling trial-by-trial learning, children and adults showed higher positive learning rates than adolescents, suggesting that adolescents demonstrated less differentiation in their reaction times for peers who provided more positive feedback. Forming expectations about receiving positive social reinforcement correlated with neural activity within the medial prefrontal cortex and ventral striatum across age. Adolescents, unlike children and adults, showed greater insular activity during positive prediction error learning and increased activity in the supplementary motor cortex and the putamen when receiving positive social feedback regardless of the expected outcome, suggesting that peer approval may motivate adolescents towards action. While different amounts of positive social reinforcement enhanced learning in children and adults, all positive social reinforcement equally motivated adolescents. Together, these findings indicate that sensitivity to peer approval during adolescence goes beyond simple reinforcement theory accounts and suggests possible explanations for how peers may motivate adolescent behavior. PMID:24550063

  13. Oxytocin enhances amygdala-dependent, socially reinforced learning and emotional empathy in humans.

    PubMed

    Hurlemann, René; Patin, Alexandra; Onur, Oezguer A; Cohen, Michael X; Baumgartner, Tobias; Metzler, Sarah; Dziobek, Isabel; Gallinat, Juergen; Wagner, Michael; Maier, Wolfgang; Kendrick, Keith M

    2010-04-07

    Oxytocin (OT) is becoming increasingly established as a prosocial neuropeptide in humans with therapeutic potential in treatment of social, cognitive, and mood disorders. However, the potential of OT as a general facilitator of human learning and empathy is unclear. The current double-blind experiments on healthy adult male volunteers investigated first whether treatment with intranasal OT enhanced learning performance on a feedback-guided item-category association task where either social (smiling and angry faces) or nonsocial (green and red lights) reinforcers were used, and second whether it increased either cognitive or emotional empathy measured by the Multifaceted Empathy Test. Further experiments investigated whether OT-sensitive behavioral components required a normal functional amygdala. Results in control groups showed that learning performance was improved when social rather than nonsocial reinforcement was used. Intranasal OT potentiated this social reinforcement advantage and greatly increased emotional, but not cognitive, empathy in response to both positive and negative valence stimuli. Interestingly, after OT treatment, emotional empathy responses in men were raised to levels similar to those found in untreated women. Two patients with selective bilateral damage to the amygdala (monozygotic twins with congenital Urbach-Wiethe disease) were impaired on both OT-sensitive aspects of these learning and empathy tasks, but performed normally on nonsocially reinforced learning and cognitive empathy. Overall these findings provide the first demonstration that OT can facilitate amygdala-dependent, socially reinforced learning and emotional empathy in men.

  14. Transcranial Direct Current Stimulation of Right Dorsolateral Prefrontal Cortex Does Not Affect Model-Based or Model-Free Reinforcement Learning in Humans

    PubMed Central

    Smittenaar, Peter; Prichard, George; FitzGerald, Thomas H. B.; Diedrichsen, Joern; Dolan, Raymond J.

    2014-01-01

    There is broad consensus that the prefrontal cortex supports goal-directed, model-based decision-making. Consistent with this, we have recently shown that model-based control can be impaired through transcranial magnetic stimulation of right dorsolateral prefrontal cortex in humans. We hypothesized that an enhancement of model-based control might be achieved by anodal transcranial direct current stimulation of the same region. We tested 22 healthy adult human participants in a within-subject, double-blind design in which participants were given Active or Sham stimulation over two sessions. We show Active stimulation had no effect on model-based control or on model-free (‘habitual’) control compared to Sham stimulation. These null effects are substantiated by a power analysis, which suggests that our study had at least 60% power to detect a true effect, and by a Bayesian model comparison, which favors a model of the data that assumes stimulation had no effect over models that assume stimulation had an effect on behavioral control. Although we cannot entirely exclude more trivial explanations for our null effect, for example related to (faults in) our experimental setup, these data suggest that anodal transcranial direct current stimulation over right dorsolateral prefrontal cortex does not improve model-based control, despite existing evidence that transcranial magnetic stimulation can disrupt such control in the same brain region. PMID:24475185

  15. Integral reinforcement learning for continuous-time input-affine nonlinear systems with simultaneous invariant explorations.

    PubMed

    Lee, Jae Young; Park, Jin Bae; Choi, Yoon Ho

    2015-05-01

    This paper focuses on a class of reinforcement learning (RL) algorithms, named integral RL (I-RL), that solve continuous-time (CT) nonlinear optimal control problems with input-affine system dynamics. First, we extend the concepts of exploration, integral temporal difference, and invariant admissibility to the target CT nonlinear system that is governed by a control policy plus a probing signal called an exploration. Then, we show input-to-state stability (ISS) and invariant admissibility of the closed-loop systems with the policies generated by the integral policy iteration (I-PI) or invariantly admissible PI (IA-PI) method. Based on these, three online I-RL algorithms named explorized I-PI and integral Q-learning I and II are proposed, all of which generate the same convergent sequences as I-PI and IA-PI under the required excitation condition on the exploration. All the proposed methods are partially or completely model free, and can simultaneously explore the state space in a stable manner during the online learning processes. ISS, invariant admissibility, and convergence properties of the proposed methods are also investigated, and in relation to these, we show the design principles of the exploration for safe learning. Neural-network-based implementation methods for the proposed schemes are also presented in this paper. Finally, several numerical simulations are carried out to verify the effectiveness of the proposed methods.

  16. Components of the Flight Response Can Reinforce Bar-Press Avoidance Learning

    ERIC Educational Resources Information Center

    Crawford, Mary; Masterson, Fred

    1978-01-01

    While responses permitting no change in location are learned very slowly, responses that allow unambiguous flight from a dangerous location are learned very rapidly. Two experiments examine the possible reinforcing properties of the flight response in avoidance acquisition. (Author/RK)

  17. The Effects of Reinforcement and Developmental Stage on Learning in Children. Final Report.

    ERIC Educational Resources Information Center

    Ratliff, Richard G.

    A series of investigations which simultaneously manipulated parameters of reinforcement and age and sex of children were conducted in order to further describe the learning process in children. In addition, an attempt was made to relate perceived parental discipline to performance in the discrimination learning tasks employed in this research. The…

  18. An Evaluation of Pedagogical Tutorial Tactics for a Natural Language Tutoring System: A Reinforcement Learning Approach

    ERIC Educational Resources Information Center

    Chi, Min; VanLehn, Kurt; Litman, Diane; Jordan, Pamela

    2011-01-01

    Pedagogical strategies are policies for a tutor to decide the next action when there are multiple actions available. When the content is controlled to be the same across experimental conditions, there has been little evidence that tutorial decisions have an impact on students' learning. In this paper, we applied Reinforcement Learning (RL) to…

  19. Adaptive Design of Role Differentiation by Division of Reward Function in Multi-Agent Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Taniguchi, Tadahiro; Tabuchi, Kazuma; Sawaragi, Tetsuo

    There are several problems which discourage an organization from achieving tasks, e.g., partial observation, credit assignment, and concurrent learning in multi-agent reinforcement learning. In many conventional approaches, each agent estimates hidden states, e.g., sensor inputs, positions, and policies of other agents, and reduces the uncertainty in the partially-observable Markov decision process (POMDP), which partially solves the multiagent reinforcement learning problem. In contrast, people reduce uncertainty in human organizations in the real world by autonomously dividing the roles played by individual agents. In a framework of reinforcement learning, roles are mainly represented by goals for individual agents. This paper presents a method for generating internal rewards from manager agents to worker agents. It also explicitly divides the roles, which enables a POMDP task for each agent to be transformed into a simple MDP task under certain conditions. Several situational experiments are also described, and the validity of the proposed method is evaluated.

  20. Reinforcement learning can account for associative and perceptual learning on a visual-decision task.

    PubMed

    Law, Chi-Tat; Gold, Joshua I

    2009-05-01

    We recently showed that improved perceptual performance on a visual motion direction-discrimination task corresponds to changes in how an unmodified sensory representation in the brain is interpreted to form a decision that guides behavior. Here we found that these changes can be accounted for using a reinforcement-learning rule to shape functional connectivity between the sensory and decision neurons. We modeled performance on the basis of the readout of simulated responses of direction-selective sensory neurons in the middle temporal area (MT) of monkey cortex. A reward prediction error guided changes in connections between these sensory neurons and the decision process, first establishing the association between motion direction and response direction, and then gradually improving perceptual sensitivity by selectively strengthening the connections from the most sensitive neurons in the sensory population. The results suggest a common, feedback-driven mechanism for some forms of associative and perceptual learning.
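
    The mechanism described, a reward prediction error gating changes in sensory-to-decision connections, can be caricatured in a few lines. The sketch below is a toy reward-modulated Hebbian readout of simulated direction-selective units, not the authors' MT-based model; the unit count, noise level, and learning rates are all assumptions:

        import numpy as np

        rng = np.random.default_rng(0)
        n_units = 50
        preferred = rng.choice([-1.0, 1.0], size=n_units)   # preferred direction of each unit
        gain = rng.uniform(0.1, 1.0, size=n_units)          # sensitivity of each unit
        w = np.zeros(n_units)                                # sensory-to-decision readout weights
        value = 0.5                                          # expected reward (critic)

        def responses(direction):
            """Noisy population response: stronger when motion matches a unit's preference."""
            return gain * (1.0 + 0.5 * direction * preferred) + 0.3 * rng.standard_normal(n_units)

        for trial in range(2000):
            direction = rng.choice([-1.0, 1.0])
            r = responses(direction)
            decision = 1.0 if w @ r > 0 else -1.0
            reward = 1.0 if decision == direction else 0.0
            rpe = reward - value                    # reward prediction error
            value += 0.01 * rpe
            w += 0.05 * rpe * decision * r          # reward-modulated Hebbian weight change

        correct = sum(
            (1.0 if w @ responses(d) > 0 else -1.0) == d
            for d in rng.choice([-1.0, 1.0], size=500)
        )
        print("accuracy after learning:", correct / 500)

    The readout first learns the gross direction-response association and then relies most heavily on the most sensitive units, mirroring the account given in the abstract.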

  1. Investigation of a reinforcement-based toilet training procedure for children with autism.

    PubMed

    Cicero, Frank R; Pfadt, Al

    2002-01-01

    Independent toileting is an important developmental skill which individuals with developmental disabilities often find a challenge to master. Effective toilet training interventions have been designed which rely on a combination of basic operant principles of positive reinforcement and punishment. In the present study, the effectiveness of a reinforcement-based toilet training intervention was investigated with three children with a diagnosis of autism. Procedures included a combination of positive reinforcement, graduated guidance, scheduled practice trials and forward prompting; these procedures were implemented in response to urination accidents. Results indicated that all three participants reduced urination accidents to zero and learned to spontaneously request use of the bathroom within 7-11 days of training. Gains were maintained over 6-month and 1-year follow-ups. Findings suggest that the proposed procedure is an effective and rapid method of toilet training, which can be implemented within a structured school setting with generalization to the home environment.

  2. Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning.

    PubMed

    Frank, Michael J; Moustafa, Ahmed A; Haughey, Heather M; Curran, Tim; Hutchison, Kent E

    2007-10-09

    What are the genetic and neural components that support adaptive learning from positive and negative outcomes? Here, we show with genetic analyses that three independent dopaminergic mechanisms contribute to reward and avoidance learning in humans. A polymorphism in the DARPP-32 gene, associated with striatal dopamine function, predicted relatively better probabilistic reward learning. Conversely, the C957T polymorphism of the DRD2 gene, associated with striatal D2 receptor function, predicted the degree to which participants learned to avoid choices that had been probabilistically associated with negative outcomes. The Val/Met polymorphism of the COMT gene, associated with prefrontal cortical dopamine function, predicted participants' ability to rapidly adapt behavior on a trial-to-trial basis. These findings support a neurocomputational dissociation between striatal and prefrontal dopaminergic mechanisms in reinforcement learning. Computational maximum likelihood analyses reveal independent gene effects on three reinforcement learning parameters that can explain the observed dissociations.
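
    Model parameters of this kind (separate learning rates for better-than-expected and worse-than-expected outcomes, plus a choice-temperature parameter) are typically estimated by maximizing the likelihood of observed choices. The sketch below shows a generic negative log-likelihood for such a model; it is not the exact model or fitting pipeline from the study, and the example data are fabricated for illustration:

        import numpy as np

        def negative_log_likelihood(params, choices, rewards, n_options=2):
            """Negative log-likelihood of a simple reinforcement learning model with
            separate learning rates for positive and negative prediction errors and a
            softmax temperature. A generic illustration, not the study's exact model."""
            alpha_gain, alpha_loss, beta = params
            q = np.full(n_options, 0.5)
            nll = 0.0
            for choice, reward in zip(choices, rewards):
                p = np.exp(beta * q) / np.exp(beta * q).sum()
                nll -= np.log(p[choice] + 1e-12)
                delta = reward - q[choice]
                q[choice] += (alpha_gain if delta > 0 else alpha_loss) * delta
            return nll

        # Fabricated example data: five trials of choices (option index) and rewards (0/1).
        choices = [0, 0, 1, 0, 1]
        rewards = [1, 1, 0, 1, 0]
        print(negative_log_likelihood((0.3, 0.1, 2.0), choices, rewards))

        # With SciPy available, the parameters could be fit by minimizing this function:
        # from scipy.optimize import minimize
        # fit = minimize(negative_log_likelihood, x0=[0.3, 0.3, 1.0],
        #                args=(choices, rewards), bounds=[(0, 1), (0, 1), (0.01, 10)])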

  3. Computational models of reinforcement learning: the role of dopamine as a reward signal

    PubMed Central

    Samson, R. D.; Frank, M. J.

    2010-01-01

    Reinforcement learning is ubiquitous. Unlike other forms of learning, it involves the processing of fast yet content-poor feedback information to correct assumptions about the nature of a task or of a set of stimuli. This feedback information is often delivered as generic rewards or punishments, and has little to do with the stimulus features to be learned. How can such low-content feedback lead to such an efficient learning paradigm? Through a review of existing neuro-computational models of reinforcement learning, we suggest that the efficiency of this type of learning resides in the dynamic and synergistic cooperation of brain systems that use different levels of computations. The implementation of reward signals at the synaptic, cellular, network and system levels gives the organism the necessary robustness, adaptability and processing speed required for evolutionary and behavioral success. PMID:21629583

  4. Can Traditions Emerge from the Interaction of Stimulus Enhancement and Reinforcement Learning? An Experimental Model.

    PubMed

    Matthews, Luke J; Paukner, Annika; Suomi, Stephen J

    2010-06-01

    The study of social learning in captivity and behavioral traditions in the wild are two burgeoning areas of research, but few empirical studies have tested how learning mechanisms produce emergent patterns of tradition. Studies have examined how social learning mechanisms that are cognitively complex and possessed by few species, such as imitation, result in traditional patterns, yet traditional patterns are also exhibited by species that may not possess such mechanisms. We propose an explicit model of how stimulus enhancement and reinforcement learning could interact to produce traditions. We tested the model experimentally with tufted capuchin monkeys (Cebus apella), which exhibit traditions in the wild but have rarely demonstrated imitative abilities in captive experiments. Monkeys showed both stimulus enhancement learning and a habitual bias to perform whichever behavior first obtained them a reward. These results support our model that simple social learning mechanisms combined with reinforcement can result in traditional patterns of behavior.

  5. Deficient reinforcement learning in medial frontal cortex as a model of dopamine-related motivational deficits in ADHD.

    PubMed

    Silvetti, Massimo; Wiersema, Jan R; Sonuga-Barke, Edmund; Verguts, Tom

    2013-10-01

    Attention Deficit/Hyperactivity Disorder (ADHD) is a pathophysiologically complex and heterogeneous condition with both cognitive and motivational components. We propose a novel computational hypothesis of motivational deficits in ADHD, drawing together recent evidence on the role of anterior cingulate cortex (ACC) and associated mesolimbic dopamine circuits in both reinforcement learning and ADHD. Based on findings of dopamine dysregulation and ACC involvement in ADHD we simulated a lesion in a previously validated computational model of ACC (Reward Value and Prediction Model, RVPM). We explored the effects of the lesion on the processing of reinforcement signals. We tested specific behavioral predictions about the profile of reinforcement-related deficits in ADHD in three experimental contexts; probability tracking task, partial and continuous reward schedules, and immediate versus delayed rewards. In addition, predictions were made at the neurophysiological level. Behavioral and neurophysiological predictions from the RVPM-based lesion-model of motivational dysfunction in ADHD were confirmed by data from previously published studies. RVPM represents a promising model of ADHD reinforcement learning suggesting that ACC dysregulation might play a role in the pathogenesis of motivational deficits in ADHD. However, more behavioral and neurophysiological studies are required to test core predictions of the model. In addition, the interaction with different brain networks underpinning other aspects of ADHD neuropathology (i.e., executive function) needs to be better understood.

  6. Neural Control of a Tracking Task via Attention-Gated Reinforcement Learning for Brain-Machine Interfaces.

    PubMed

    Wang, Yiwen; Wang, Fang; Xu, Kai; Zhang, Qiaosheng; Zhang, Shaomin; Zheng, Xiaoxiang

    2015-05-01

    Reinforcement learning (RL)-based brain machine interfaces (BMIs) enable the user to learn from the environment through interactions to complete the task without desired signals, which is promising for clinical applications. Previous studies exploited Q-learning techniques to discriminate neural states into simple directional actions providing the trial initial timing. However, the movements in BMI applications can be quite complicated, and the action timing explicitly shows the intention of when to move. The rich actions and the corresponding neural states form a large state-action space, imposing generalization difficulty on Q-learning. In this paper, we propose to adopt attention-gated reinforcement learning (AGREL) as a new learning scheme for BMIs to adaptively decode high-dimensional neural activities into seven distinct movements (directional moves, holdings and resting) due to its efficient weight-updating. We apply AGREL on neural data recorded from M1 of a monkey to directly predict a seven-action set in a time sequence to reconstruct the trajectory of a center-out task. Compared to Q-learning techniques, AGREL could improve the target acquisition rate to 90.16% on average, with faster convergence and more stability in following neural activity over multiple days, indicating the potential to achieve better online decoding performance for more complicated BMI tasks.

  7. Reinforcement learning for adaptive optimal control of unknown continuous-time nonlinear systems with input constraints

    NASA Astrophysics Data System (ADS)

    Yang, Xiong; Liu, Derong; Wang, Ding

    2014-03-01

    In this paper, an adaptive reinforcement learning-based solution is developed for the infinite-horizon optimal control problem of constrained-input continuous-time nonlinear systems in the presence of nonlinearities with unknown structures. Two different types of neural networks (NNs) are employed to approximate the Hamilton-Jacobi-Bellman equation. That is, a recurrent NN is constructed to identify the unknown dynamical system, and two feedforward NNs are used as the actor and the critic to approximate the optimal control and the optimal cost, respectively. Based on this framework, the action NN and the critic NN are tuned simultaneously, without requiring knowledge of the system drift dynamics. Moreover, by using Lyapunov's direct method, the weights of the action NN and the critic NN are guaranteed to be uniformly ultimately bounded, while keeping the closed-loop system stable. To demonstrate the effectiveness of the present approach, simulation results are illustrated.

  8. Activity Based Learning as Self-Accessing Strategy to Promote Learners' Autonomy

    ERIC Educational Resources Information Center

    Ravi, R.; Xavier, P.

    2007-01-01

    The Activity Based Learning (ABL) is unique and effective to attract out-of -school children to schools. It facilitates readiness for learning, instruction, reinforcement and evaluation. ABL has transformed the classrooms into hubs of activities and meaningful learning. Activity-based learning, naturally leads to cooperative learning. Since group…

  9. Deficits in reinforcement learning but no link to apathy in patients with schizophrenia.

    PubMed

    Hartmann-Riemer, Matthias N; Aschenbrenner, Steffen; Bossert, Magdalena; Westermann, Celina; Seifritz, Erich; Tobler, Philippe N; Weisbrod, Matthias; Kaiser, Stefan

    2017-01-10

    Negative symptoms in schizophrenia have been linked to selective reinforcement learning deficits in the context of gains combined with intact loss-avoidance learning. Fundamental mechanisms of reinforcement learning and choice are prediction error signaling and the precise representation of reward value for future decisions. It is unclear which of these mechanisms contribute to the impairments in learning from positive outcomes observed in schizophrenia. A recent study suggested that patients with severe apathy symptoms show deficits in the representation of expected value. Considering the fundamental relevance for the understanding of these symptoms, we aimed to assess the stability of these findings across studies. Sixty-four patients with schizophrenia and 19 healthy control participants performed a probabilistic reward learning task. They had to associate stimuli with gain or loss-avoidance. In a transfer phase participants indicated valuation of the previously learned stimuli by choosing among them. Patients demonstrated an overall impairment in learning compared to healthy controls. No effects of apathy symptoms on task indices were observed. However, patients with schizophrenia learned better in the context of loss-avoidance than in the context of gain. Earlier findings were thus partially replicated. Further studies are needed to clarify the mechanistic link between negative symptoms and reinforcement learning.

  10. Deficits in reinforcement learning but no link to apathy in patients with schizophrenia

    PubMed Central

    Hartmann-Riemer, Matthias N.; Aschenbrenner, Steffen; Bossert, Magdalena; Westermann, Celina; Seifritz, Erich; Tobler, Philippe N.; Weisbrod, Matthias; Kaiser, Stefan

    2017-01-01

    Negative symptoms in schizophrenia have been linked to selective reinforcement learning deficits in the context of gains combined with intact loss-avoidance learning. Fundamental mechanisms of reinforcement learning and choice are prediction error signaling and the precise representation of reward value for future decisions. It is unclear which of these mechanisms contribute to the impairments in learning from positive outcomes observed in schizophrenia. A recent study suggested that patients with severe apathy symptoms show deficits in the representation of expected value. Considering the fundamental relevance for the understanding of these symptoms, we aimed to assess the stability of these findings across studies. Sixty-four patients with schizophrenia and 19 healthy control participants performed a probabilistic reward learning task. They had to associate stimuli with gain or loss-avoidance. In a transfer phase participants indicated valuation of the previously learned stimuli by choosing among them. Patients demonstrated an overall impairment in learning compared to healthy controls. No effects of apathy symptoms on task indices were observed. However, patients with schizophrenia learned better in the context of loss-avoidance than in the context of gain. Earlier findings were thus partially replicated. Further studies are needed to clarify the mechanistic link between negative symptoms and reinforcement learning. PMID:28071747

  11. On the Evolutionary Bases of Consumer Reinforcement

    ERIC Educational Resources Information Center

    Nicholson, Michael; Xiao, Sarah Hong

    2010-01-01

    This article locates consumer behavior analysis within the modern neo-Darwinian synthesis, seeking to establish an interface between the ultimate-level theorizing of human evolutionary psychology and the proximate level of inquiry typically favored by operant learning theorists. Following an initial overview of the central tenets of neo-Darwinism,…

  12. A Robust Cooperated Control Method with Reinforcement Learning and Adaptive H∞ Control

    NASA Astrophysics Data System (ADS)

    Obayashi, Masanao; Uchiyama, Shogo; Kuremoto, Takashi; Kobayashi, Kunikazu

    This study proposes a robust cooperated control method combining reinforcement learning with robust control to control the system. A remarkable characteristic of reinforcement learning is that it does not require a model formula; however, it does not guarantee the stability of the system. On the other hand, a robust control system guarantees stability and robustness, but it requires a model formula. We employ both the actor-critic method, a kind of reinforcement learning that controls continuous-valued actions with a minimal amount of computation, and traditional robust control, that is, H∞ control. The proposed method was compared with the conventional control method (the actor-critic alone) through computer simulation of controlling the angle and position of a crane system, and the simulation results showed the effectiveness of the proposed method.

  13. A Real-time Reinforcement Learning Control System with H∞ Tracking Performance Compensator

    NASA Astrophysics Data System (ADS)

    Uchiyama, Shogo; Obayashi, Masanao; Kuremoto, Takashi; Kobayashi, Kunikazu

    Robust control theory generally guarantees robustness and stability of the closed-loop system. It however requires a mathematical model of the system to design the control system, and therefore often cannot deal with nonlinear systems because of the difficulty of modeling them. On the other hand, reinforcement learning methods can deal with nonlinear systems without any mathematical model; however, they usually do not guarantee the stability of the controlled system. In this paper, we propose a “Real-time Reinforcement Learning Control System (RRLCS)” by combining reinforcement learning, which can treat unknown nonlinear systems, with robust control theory, which guarantees the robustness and stability of the system. Moreover, we analyze the stability of the proposed system using H∞ tracking performance and a Lyapunov function. Finally, through computer simulation of controlling an inverted pendulum system, we show the effectiveness of the proposed method.

  14. Integration of reinforcement learning and optimal decision-making theories of the basal ganglia.

    PubMed

    Bogacz, Rafal; Larsen, Tobias

    2011-04-01

    This article seeks to integrate two sets of theories describing action selection in the basal ganglia: reinforcement learning theories describing learning which actions to select to maximize reward and decision-making theories proposing that the basal ganglia selects actions on the basis of sensory evidence accumulated in the cortex. In particular, we present a model that integrates the actor-critic model of reinforcement learning and a model assuming that the cortico-basal-ganglia circuit implements a statistically optimal decision-making procedure. The values of cortico-striatal weights required for optimal decision making in our model differ from those provided by standard reinforcement learning models. Nevertheless, we show that an actor-critic model converges to the weights required for optimal decision making when biologically realistic limits on synaptic weights are introduced. We also describe the model's predictions concerning reaction times and neural responses during learning, and we discuss directions required for further integration of reinforcement learning and optimal decision-making theories.
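
    As a point of reference for the actor-critic component discussed above, here is a minimal actor-critic update on a two-choice probabilistic task, with a crude cap standing in for the biologically realistic limits on synaptic weights mentioned in the abstract. It is an illustrative sketch only; the reward probabilities, learning rates, and weight bound are assumptions, and it does not implement the authors' cortico-basal-ganglia model:

        import numpy as np

        rng = np.random.default_rng(0)
        p_reward = np.array([0.7, 0.3])   # hypothetical reward probabilities of two actions

        v = 0.0                            # critic: value of the (single) state
        w = np.zeros(2)                    # actor: action propensities ("cortico-striatal weights")
        alpha_critic, alpha_actor = 0.05, 0.05
        w_max = 1.0                        # crude cap standing in for bounded synaptic weights

        for trial in range(2000):
            p = np.exp(w) / np.exp(w).sum()        # softmax action selection
            a = rng.choice(2, p=p)
            reward = float(rng.random() < p_reward[a])
            delta = reward - v                     # prediction error computed by the critic
            v += alpha_critic * delta
            w[a] += alpha_actor * delta            # actor learns from the critic's error
            w = np.clip(w, 0.0, w_max)             # keep weights within a plausible range

        print("action probabilities after learning:", np.round(np.exp(w) / np.exp(w).sum(), 2))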

  15. Team-Based Learning

    ERIC Educational Resources Information Center

    Michaelsen, Larry K.; Sweet, Michael

    2011-01-01

    Team-based learning (TBL), when properly implemented, includes many, if not all, of the common elements of evidence-based best practices. To explain this, a brief overview of TBL is presented. The authors examine the relationship between the best practices of evidence-based teaching and the principles that constitute team-based learning. (Contains…

  16. Acquisition of Flexible Image Recognition by Coupling of Reinforcement Learning and a Neural Network

    NASA Astrophysics Data System (ADS)

    Shibata, Katsunari; Kawano, Tomohiko

    The authors have proposed a very simple autonomous learning system consisting of one neural network (NN), whose inputs are raw sensor signals and whose outputs are directly passed to actuators as control signals, and which is trained by using reinforcement learning (RL). However, the prevailing opinion seems to be that such simple learning systems do not actually work on complicated tasks in the real world. In this paper, with a view to developing higher functions in robots, the authors argue for the necessity of introducing autonomous learning of a massively parallel and cohesively flexible system with massive inputs, based on considerations about the brain architecture and the sequential property of our consciousness. The authors also argue for placing more importance on “optimization” of the total system under a uniform criterion than on “understandability” for humans. Thus, the authors attempt to stress the importance of their proposed system for future research on robot intelligence. The experimental results in a real-world-like environment show that image recognition from as many as 6240 visual signals can be acquired through RL under various backgrounds and light conditions without providing any knowledge about image processing or the target object. It works even for camera image inputs that were not experienced during learning. In the hidden layer, template-like representations, division of roles between hidden neurons, and representations that detect the target uninfluenced by light condition or background were observed after learning. The autonomous acquisition of such useful representations and functions suggests the potential for avoiding the frame problem and developing higher functions.

  17. A nanostructured carbon-reinforced polyisobutylene-based thermoplastic elastomer.

    PubMed

    Puskas, Judit E; Foreman-Orlowski, Elizabeth A; Lim, Goy Teck; Porosky, Sara E; Evancho-Chapman, Michelle M; Schmidt, Steven P; El Fray, Mirosława; Piatek, Marta; Prowans, Piotr; Lovejoy, Krystal

    2010-03-01

    This paper presents the synthesis and characterization of a polyisobutylene (PIB)-based nanostructured carbon-reinforced thermoplastic elastomer. This thermoplastic elastomer is based on a self-assembling block copolymer having a branched PIB core carrying -OH functional groups at each branch point, flanked by blocks of poly(isobutylene-co-para-methylstyrene). The block copolymer has thermolabile physical crosslinks and can be processed as a plastic, yet retains its rubbery properties at room temperature. The carbon-reinforced thermoplastic elastomer had more than twice the tensile strength of the neat polymer, exceeding the strength of medical grade silicone rubber, while remaining significantly softer. The carbon-reinforced thermoplastic elastomer displayed a high T(g) of 126 degrees C, rendering the material steam-sterilizable. The carbon also acted as a free radical trap, increasing the onset temperature of thermal decomposition in the neat polymer from 256.6 degrees C to 327.7 degrees C. The carbon-reinforced thermoplastic elastomer had the lowest water contact angle, at 82 degrees, and exhibited surface nano-topography. After 180 days of implantation into rabbit soft tissues, the carbon-reinforced thermoplastic elastomer had the thinnest tissue capsule around the microdumbbell specimens, with no eosinophiles present. The material also showed excellent integration into bones.

  18. Multiagent Reinforcement Learning With Sparse Interactions by Negotiation and Knowledge Transfer.

    PubMed

    Zhou, Luowei; Yang, Pei; Chen, Chunlin; Gao, Yang

    2016-03-31

    Reinforcement learning has significant applications for multiagent systems, especially in unknown dynamic environments. However, most multiagent reinforcement learning (MARL) algorithms suffer from such problems as exponential computation complexity in the joint state-action space, which makes it difficult to scale up to realistic multiagent problems. In this paper, a novel algorithm named negotiation-based MARL with sparse interactions (NegoSI) is presented. In contrast to traditional sparse-interaction-based MARL algorithms, NegoSI adopts the equilibrium concept and makes it possible for agents to select the nonstrict equilibrium-dominating strategy profile (nonstrict EDSP) or meta equilibrium for their joint actions. The presented NegoSI algorithm consists of four parts: 1) the equilibrium-based framework for sparse interactions; 2) the negotiation for the equilibrium set; 3) the minimum variance method for selecting one joint action; and 4) the knowledge transfer of local Q-values. In this integrated algorithm, three techniques, i.e., unshared value functions, equilibrium solutions, and sparse interactions, are adopted to achieve privacy protection, better coordination and lower computational complexity, respectively. To evaluate the performance of the presented NegoSI algorithm, two groups of experiments are carried out regarding three criteria: 1) steps of each episode; 2) rewards of each episode; and 3) average runtime. The first group of experiments is conducted using six grid world games and shows fast convergence and high scalability of the presented algorithm. Then, in the second group of experiments, NegoSI is applied to an intelligent warehouse problem, and simulated results demonstrate the effectiveness of the presented NegoSI algorithm compared with other state-of-the-art MARL algorithms.

  19. Partial reinforcement and context switch effects in human predictive learning.

    PubMed

    Abad, María J F; Ramos-Alvarez, Manuel M; Rosas, Juan M

    2009-01-01

    Human participants were trained in a trial-by-trial contingency judgements task in which they had to predict the probability of an outcome (diarrhoea) following different cues (food names) in different contexts (restaurants). Cue P was paired with the outcome on half of the trials (partial reinforcement), while cue C was paired with the outcome on all the trials (continuous reinforcement), both cues in Context A. Test was conducted in both Context A and a different but equally familiar context (B). Context change decreased judgements to C, but not to P (Experiment 1). This effect was found only in the cue trained in the context where a different cue was partially reinforced (Experiment 2). Context switch effects disappeared when different cues received partial reinforcement in both contexts of training (Experiment 3). The implications of these results for an explanation of context switch effects in terms of ambiguity in the meaning of the cues prompting attention to the context (e.g., Bouton, 1997) are discussed.

  20. Reinforcement learning control with approximation of time-dependent agent dynamics

    NASA Astrophysics Data System (ADS)

    Kirkpatrick, Kenton Conrad

    Reinforcement Learning has received a lot of attention over the years for systems ranging from static game playing to dynamic system control. Using Reinforcement Learning for control of dynamical systems provides the benefit of learning a control policy without needing a model of the dynamics. This opens the possibility of controlling systems for which the dynamics are unknown, but Reinforcement Learning methods like Q-learning do not explicitly account for time. In dynamical systems, time-dependent characteristics can have a significant effect on the control of the system, so it is necessary to account for system time dynamics while not having to rely on a predetermined model for the system. In this dissertation, algorithms are investigated for expanding the Q-learning algorithm to account for the learning of sampling rates and dynamics approximations. For determining a proper sampling rate, it is desired to find the largest sample time that still allows the learning agent to control the system to goal achievement. An algorithm called Sampled-Data Q-learning is introduced for determining both this sample time and the control policy associated with that sampling rate. Results show that the algorithm is capable of achieving a desired sampling rate that allows for system control while not sampling "as fast as possible". Determining an approximation of an agent's dynamics can be beneficial for the control of hierarchical multiagent systems by allowing a high-level supervisor to use the dynamics approximations for task allocation decisions. To this end, algorithms are investigated for learning first- and second-order dynamics approximations. These algorithms are respectively called First-Order Dynamics Learning and Second-Order Dynamics Learning. The dynamics learning algorithms are evaluated on several examples that show their capability to learn accurate approximations of state dynamics. All of these algorithms are then evaluated on hierarchical multiagent systems
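
    The general idea of searching for the largest workable sample time on top of ordinary Q-learning can be sketched as follows. This is not the dissertation's Sampled-Data Q-learning algorithm; it is a plain tabular Q-learner on a made-up one-dimensional integrator (dx/dt = u) that is trained and tested at several candidate sample times, keeping the largest one whose greedy policy still reaches the goal reliably (the dynamics, thresholds, and hyperparameters are all assumptions):

        import numpy as np

        rng = np.random.default_rng(0)

        def discretize(x, n_bins=21):
            """Map a position in [-1, 1] onto a discrete state index."""
            return int(np.clip((x + 1.0) / 2.0 * (n_bins - 1), 0, n_bins - 1))

        def train_and_test(dt, episodes=300, n_bins=21, alpha=0.2, gamma=0.95, eps=0.1):
            """Tabular Q-learning on dx/dt = u (u in {-1, +1}) sampled every dt seconds.
            Returns the success rate of the greedy policy after training."""
            actions = np.array([-1.0, 1.0])
            Q = np.zeros((n_bins, len(actions)))

            def episode(learn):
                x = rng.uniform(-1.0, 1.0)
                for _ in range(int(20.0 / dt)):          # 20-second time limit
                    s = discretize(x, n_bins)
                    a = rng.integers(2) if (learn and rng.random() < eps) else int(Q[s].argmax())
                    x_next = float(np.clip(x + actions[a] * dt, -1.0, 1.0))
                    done = abs(x_next) < 0.1             # goal: a band around the origin
                    reward = 1.0 if done else -0.01
                    if learn:                            # standard Q-learning update
                        target = reward + (0.0 if done else gamma * Q[discretize(x_next, n_bins)].max())
                        Q[s, a] += alpha * (target - Q[s, a])
                    x = x_next
                    if done:
                        return True
                return False

            for _ in range(episodes):
                episode(learn=True)
            return np.mean([episode(learn=False) for _ in range(100)])

        # Try sample times from coarse to fine; keep the largest one that still works.
        for dt in (2.0, 1.0, 0.5, 0.25):
            rate = train_and_test(dt)
            print(f"dt = {dt:4.2f}   success rate = {rate:.2f}")
            if rate > 0.9:
                print("largest workable sample time found:", dt)
                break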

  1. Hippocampal lesions facilitate instrumental learning with delayed reinforcement but induce impulsive choice in rats

    PubMed Central

    Cheung, Timothy HC; Cardinal, Rudolf N

    2005-01-01

    Background Animals must frequently act to influence the world even when the reinforcing outcomes of their actions are delayed. Learning with action-outcome delays is a complex problem, and little is known of the neural mechanisms that bridge such delays. When outcomes are delayed, they may be attributed to (or associated with) the action that caused them, or mistakenly attributed to other stimuli, such as the environmental context. Consequently, animals that are poor at forming context-outcome associations might learn action-outcome associations better with delayed reinforcement than normal animals. The hippocampus contributes to the representation of environmental context, being required for aspects of contextual conditioning. We therefore hypothesized that animals with hippocampal lesions would be better than normal animals at learning to act on the basis of delayed reinforcement. We tested the ability of hippocampal-lesioned rats to learn a free-operant instrumental response using delayed reinforcement, and what is potentially a related ability – the ability to exhibit self-controlled choice, or to sacrifice an immediate, small reward in order to obtain a delayed but larger reward. Results Rats with sham or excitotoxic hippocampal lesions acquired an instrumental response with different delays (0, 10, or 20 s) between the response and reinforcer delivery. These delays retarded learning in normal rats. Hippocampal-lesioned rats responded slightly less than sham-operated controls in the absence of delays, but they became better at learning (relative to shams) as the delays increased; delays impaired learning less in hippocampal-lesioned rats than in shams. In contrast, lesioned rats exhibited impulsive choice, preferring an immediate, small reward to a delayed, larger reward, even though they preferred the large reward when it was not delayed. Conclusion These results support the view that the hippocampus hinders action-outcome learning with delayed outcomes

  2. Market Model for Resource Allocation in Emerging Sensor Networks with Reinforcement Learning.

    PubMed

    Zhang, Yue; Song, Bin; Zhang, Ying; Du, Xiaojiang; Guizani, Mohsen

    2016-11-29

    Emerging sensor networks (ESNs) are an inevitable trend with the development of the Internet of Things (IoT), and are intended to connect almost every intelligent device. Therefore, it is critical to study resource allocation in such an environment, due to efficiency concerns, especially when resources are limited. By viewing ESNs as multi-agent environments, we model them with an agent-based modelling (ABM) method and deal with resource allocation problems with market models, after describing users' patterns. Reinforcement learning methods are introduced to estimate users' patterns and verify the outcomes in our market models. Experimental results show the efficiency of our methods, which are also capable of guiding topology management.

  3. Multi Objective Dynamic Job Shop Scheduling using Composite Dispatching Rule and Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Chen, Xili; Hao, Xinchang; Lin, Hao Wen; Murata, Tomohiro

    The applications of composite dispatching rules for multi-objective dynamic scheduling have been widely studied in the literature. In general, a composite dispatching rule is a combination of several elementary dispatching rules, which is designed to optimize multiple objectives of interest under a certain scheduling environment. The relative importance of the elementary dispatching rules is modeled by weight factors. A critical issue for the implementation of a composite dispatching rule is that inappropriate weight values may result in poor performance. This paper presents an offline scheduling knowledge acquisition method based on reinforcement learning using simulation techniques. The scheduling knowledge is applied to adjust the weight values of the elementary dispatching rules in a composite manner with respect to the work-in-process fluctuation of machines during online scheduling. Implementation of the proposed method in a two-objective dynamic job shop scheduling problem is demonstrated, and the results are satisfactory.
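
    Concretely, a composite dispatching rule scores each queued job with a weighted combination of elementary rule terms and dispatches the best-scoring job; the weights are what the reinforcement learner would adjust online. The sketch below uses three common elementary rules (SPT, EDD, FIFO) purely for illustration; the paper's rule set, weighting scheme, and objectives may differ:

        from dataclasses import dataclass

        @dataclass
        class Job:
            processing_time: float   # remaining processing time on this machine
            due_date: float          # absolute due date
            arrival_time: float      # time the job joined the queue

        def composite_priority(job, now, weights):
            """Composite dispatching score as a weighted sum of elementary rules.
            Lower is better; the rule set and weights here are purely illustrative."""
            w_spt, w_edd, w_fifo = weights
            return (w_spt * job.processing_time              # shortest processing time (SPT)
                    + w_edd * (job.due_date - now)           # earliest due date (EDD)
                    - w_fifo * (now - job.arrival_time))     # first in, first out (FIFO)

        def select_next_job(queue, now, weights):
            return min(queue, key=lambda job: composite_priority(job, now, weights))

        queue = [Job(5.0, 20.0, 0.0), Job(2.0, 30.0, 1.0), Job(8.0, 12.0, 2.0)]
        # Weight vectors like this one are what the reinforcement learner would retune
        # online as work-in-process levels fluctuate.
        print(select_next_job(queue, now=4.0, weights=(0.5, 0.3, 0.2)))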

  4. Market Model for Resource Allocation in Emerging Sensor Networks with Reinforcement Learning

    PubMed Central

    Zhang, Yue; Song, Bin; Zhang, Ying; Du, Xiaojiang; Guizani, Mohsen

    2016-01-01

    Emerging sensor networks (ESNs) are an inevitable trend with the development of the Internet of Things (IoT), and are intended to connect almost every intelligent device. Therefore, it is critical to study resource allocation in such an environment, due to efficiency concerns, especially when resources are limited. By viewing ESNs as multi-agent environments, we model them with an agent-based modelling (ABM) method and deal with resource allocation problems with market models, after describing users’ patterns. Reinforcement learning methods are introduced to estimate users’ patterns and verify the outcomes in our market models. Experimental results show the efficiency of our methods, which are also capable of guiding topology management. PMID:27916841

  5. Automatic tuning of the reinforcement function

    SciTech Connect

    Touzet, C.; Santos, J.M.

    1997-12-31

    The aim of this work is to present a method that helps tune the reinforcement function parameters in a reinforcement learning approach. Since the proposal of neural-based implementations for the reinforcement learning paradigm (which reduced learning time and memory requirements to realistic values), reinforcement functions have become the critical components. Using a general definition for reinforcement functions, the authors solve, in a particular case, the so-called exploration versus exploitation dilemma through the careful computation of the RF parameter values. They propose an algorithm to compute, during the exploration part of the learning phase, an estimate for the parameter values. Experiments with the mobile robot Nomad 200 validate their proposals.

  6. Mechanisms of hierarchical reinforcement learning in cortico-striatal circuits 2: evidence from fMRI.

    PubMed

    Badre, David; Frank, Michael J

    2012-03-01

    The frontal lobes may be organized hierarchically such that more rostral frontal regions modulate cognitive control operations in caudal regions. In our companion paper (Frank MJ, Badre D. 2011. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits I: computational analysis. 22:509-526), we provide novel neural circuit and algorithmic models of hierarchical cognitive control in cortico-striatal circuits. Here, we test key model predictions using functional magnetic resonance imaging (fMRI). Our neural circuit model proposes that contextual representations in rostral frontal cortex influence the striatal gating of contextual representations in caudal frontal cortex. Reinforcement learning operates at each level, such that the system adaptively learns to gate higher order contextual information into rostral regions. Our algorithmic Bayesian "mixture of experts" model captures the key computations of this neural model and provides trial-by-trial estimates of the learner's latent hypothesis states. In the present paper, we used these quantitative estimates to reanalyze fMRI data from a hierarchical reinforcement learning task reported in Badre D, Kayser AS, D'Esposito M. 2010. Frontal cortex and the discovery of abstract action rules. Neuron. 66:315--326. Results validate key predictions of the models and provide evidence for an individual cortico-striatal circuit for reinforcement learning of hierarchical structure at a specific level of policy abstraction. These findings are initially consistent with the proposal that hierarchical control in frontal cortex may emerge from interactions among nested cortico-striatal circuits at different levels of abstraction.

  7. Reinforcing communication skills while registered nurses simultaneously learn course content: a response to learning needs.

    PubMed

    DeSimone, B B

    1994-01-01

    This article describes the implementation and evaluation of Integrated Skills Reinforcement (ISR) in a baccalaureate nursing course entitled "Principles of Health Assessment" for 15 registered nurse students. ISR is a comprehensive teaching-learning approach that reinforces student writing, reading, speaking, and listening skills while students simultaneously learn course content. The purpose of this study was to assess the influence of ISR on writing skills and student satisfaction. A learner's guide and a teacher's guide, created in advance by the teacher, described specific language activities and assignments that were implemented throughout the ISR course. During each class, the teacher promoted discussion, collaboration, and co-inquiry among students, using course content as the vehicle of exchange. Writing was assessed at the beginning and end of the course. The influence of ISR on the content, organization, sentence structure, tone, and strength of position of student writing was analyzed. Writing samples were scored by an independent evaluator trained in methods of holistic scoring. Ninety-three per cent of students (14 of 15) achieved writing growth of 0.5 to 1.5 points on a 6-point scale. Student response to both the ISR approach and specific ISR activities was assessed by teacher-created surveys administered at the middle and at the end of the course. At the end of the project, one hundred per cent of the students agreed that the ISR activities, specifically the writing and reading activities, helped them better understand the course content. These responses differed from evaluations written by the same students at the middle of the course. The ISR approach fostered analysis and communication through active collaboration, behaviors cited as critical for effective participation of nurses in today's complex health care environment.

  8. Off-policy integral reinforcement learning optimal tracking control for continuous-time chaotic systems

    NASA Astrophysics Data System (ADS)

    Wei, Qing-Lai; Song, Rui-Zhuo; Sun, Qiu-Ye; Xiao, Wen-Dong

    2015-09-01

    This paper presents an off-policy integral reinforcement learning (IRL) algorithm to obtain the optimal tracking control of unknown chaotic systems. Off-policy IRL can learn the solution of the Hamilton-Jacobi-Bellman (HJB) equation from system data generated by an arbitrary control. Moreover, off-policy IRL can be regarded as a direct learning method, which avoids the identification of system dynamics. In this paper, the performance index function is first given based on the system tracking error and control error. For solving the HJB equation, an off-policy IRL algorithm is proposed. It is proven that the iterative control makes the tracking error system asymptotically stable, and that the iterative performance index function is convergent. A simulation study demonstrates the effectiveness of the developed tracking control method. Project supported by the National Natural Science Foundation of China (Grant Nos. 61304079 and 61374105), the Beijing Natural Science Foundation, China (Grant Nos. 4132078 and 4143065), the China Postdoctoral Science Foundation (Grant No. 2013M530527), the Fundamental Research Funds for the Central Universities, China (Grant No. FRF-TP-14-119A2), and the Open Research Project from State Key Laboratory of Management and Control for Complex Systems, China (Grant No. 20150104).
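
    For readers unfamiliar with this class of methods, a minimal LaTeX sketch of the quantities involved is given below. The quadratic form and the weighting matrices Q and R are illustrative assumptions, since the abstract only states that the performance index is built from the tracking error e and the control error u_e.

      % Assumed quadratic performance index over tracking error e and control error u_e,
      % with Q >= 0 and R > 0 as illustrative weighting matrices (not quoted from the paper).
      J\bigl(e(t)\bigr) = \int_{t}^{\infty} \Bigl( e^{\top}(\tau)\, Q\, e(\tau) + u_e^{\top}(\tau)\, R\, u_e(\tau) \Bigr)\, d\tau

      % Associated HJB equation for the optimal value function V^* of the error dynamics \dot{e} = f(e, u_e):
      0 = \min_{u_e} \Bigl[ e^{\top} Q\, e + u_e^{\top} R\, u_e + \bigl(\nabla V^{*}(e)\bigr)^{\top} f(e, u_e) \Bigr]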

  9. Robust Retinal Blood Vessel Segmentation Based on Reinforcement Local Descriptions

    PubMed Central

    Li, Meng; Ma, Zhenshen; Liu, Chao; Han, Zhe

    2017-01-01

    Retinal blood vessel segmentation plays an important role in retinal image analysis. In this paper, we propose a robust retinal blood vessel segmentation method based on reinforcement local descriptions. A novel line set based feature is first developed to capture the local shape information of vessels by employing the length prior of vessels, and it is robust to intensity variation. A local intensity feature is then calculated for each pixel, and a morphological gradient feature is extracted to enhance the local edges of smaller vessels. Finally, the line set based feature, local intensity feature, and morphological gradient feature are combined to obtain the reinforcement local descriptions. Compared with existing local descriptions, the proposed reinforcement local description contains more local information about the shape, intensity, and edges of vessels, and is therefore more robust. After feature extraction, an SVM is trained for blood vessel segmentation. In addition, we also develop a postprocessing method based on morphological reconstruction to connect discontinuous vessels and obtain a more accurate segmentation result. Experimental results on two public databases (DRIVE and STARE) demonstrate that the proposed reinforcement local descriptions outperform the state of the art.
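
    As an illustration of the pipeline this abstract describes (combine per-pixel descriptors, then train an SVM), here is a minimal Python sketch. The three feature arrays are random placeholders standing in for the paper's line set, intensity, and morphological gradient features, and scikit-learn's SVC is used only as a generic SVM, not as the authors' implementation.

      # Minimal sketch: concatenate per-pixel descriptors into a "reinforcement
      # local description" and train an SVM. The feature extractors are replaced
      # by random placeholders; only the overall structure follows the abstract.
      import numpy as np
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      n_pixels = 1000

      line_set_feat = rng.normal(size=(n_pixels, 8))    # placeholder line set (shape) feature
      intensity_feat = rng.normal(size=(n_pixels, 1))   # placeholder local intensity feature
      morph_grad_feat = rng.normal(size=(n_pixels, 1))  # placeholder morphological gradient feature

      X = np.hstack([line_set_feat, intensity_feat, morph_grad_feat])  # combined description
      y = rng.integers(0, 2, size=n_pixels)             # dummy labels: 1 = vessel, 0 = background

      clf = SVC(kernel="rbf").fit(X, y)
      print("training accuracy:", clf.score(X, y))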

  10. Hydrogel-based reinforcement of 3D bioprinted constructs

    PubMed Central

    Levato, R; Peiffer, Q C; de Ruijter, M; Hennink, W E; Vermonden, T; Malda, J

    2016-01-01

    Progress within the field of biofabrication is hindered by a lack of suitable hydrogel formulations. Here, we present a novel approach based on a hybrid printing technique to create cellularized 3D printed constructs. The hybrid bioprinting strategy combines a reinforcing gel for mechanical support with a bioink to provide a cytocompatible environment. In comparison with thermoplastics such as є-polycaprolactone, the hydrogel-based reinforcing gel platform enables printing at cell-friendly temperatures, targets the bioprinting of softer tissues and allows for improved control over degradation kinetics. We prepared amphiphilic macromonomers based on poloxamer that form hydrolysable, covalently cross-linked polymer networks. Dissolved at a concentration of 28.6%w/w in water, it functions as reinforcing gel, while a 5%w/w gelatin-methacryloyl based gel is utilized as bioink. This strategy allows for the creation of complex structures, where the bioink provides a cytocompatible environment for encapsulated cells. Cell viability of equine chondrocytes encapsulated within printed constructs remained largely unaffected by the printing process. The versatility of the system is further demonstrated by the ability to tune the stiffness of printed constructs between 138 and 263 kPa, as well as to tailor the degradation kinetics of the reinforcing gel from several weeks up to more than a year. PMID:27431861

  11. Robust Retinal Blood Vessel Segmentation Based on Reinforcement Local Descriptions.

    PubMed

    Li, Meng; Ma, Zhenshen; Liu, Chao; Zhang, Guang; Han, Zhe

    2017-01-01

    Retinal blood vessel segmentation plays an important role in retinal image analysis. In this paper, we propose a robust retinal blood vessel segmentation method based on reinforcement local descriptions. A novel line set based feature is first developed to capture the local shape information of vessels by employing the length prior of vessels, and it is robust to intensity variation. A local intensity feature is then calculated for each pixel, and a morphological gradient feature is extracted to enhance the local edges of smaller vessels. Finally, the line set based feature, local intensity feature, and morphological gradient feature are combined to obtain the reinforcement local descriptions. Compared with existing local descriptions, the proposed reinforcement local description contains more local information about the shape, intensity, and edges of vessels, and is therefore more robust. After feature extraction, an SVM is trained for blood vessel segmentation. In addition, we also develop a postprocessing method based on morphological reconstruction to connect discontinuous vessels and obtain a more accurate segmentation result. Experimental results on two public databases (DRIVE and STARE) demonstrate that the proposed reinforcement local descriptions outperform the state of the art.

  12. Dynamic Interaction between Reinforcement Learning and Attention in Multidimensional Environments.

    PubMed

    Leong, Yuan Chang; Radulescu, Angela; Daniel, Reka; DeWoskin, Vivian; Niv, Yael

    2017-01-18

    Little is known about the relationship between attention and learning during decision making. Using eye tracking and multivariate pattern analysis of fMRI data, we measured participants' dimensional attention as they performed a trial-and-error learning task in which only one of three stimulus dimensions was relevant for reward at any given time. Analysis of participants' choices revealed that attention biased both value computation during choice and value update during learning. Value signals in the ventromedial prefrontal cortex and prediction errors in the striatum were similarly biased by attention. In turn, participants' focus of attention was dynamically modulated by ongoing learning. Attentional switches across dimensions correlated with activity in a frontoparietal attention network, which showed enhanced connectivity with the ventromedial prefrontal cortex between switches. Our results suggest a bidirectional interaction between attention and learning: attention constrains learning to relevant dimensions of the environment, while we learn what to attend to via trial and error.

  13. Scaling Ant Colony Optimization with Hierarchical Reinforcement Learning Partitioning

    DTIC Science & Technology

    2007-09-01

    ACO domain and traveling salesman problem (TSP). To apply HRL to ACO, a hierarchy must be created for the TSP. A data clustering algorithm creates...methodologies to the ACO domain and the TSP to demonstrate that a decomposition of the TSP increases the speed of the algorithm with little to no significant loss in...the use of HRL techniques in the ACO domain and the TSP. 3 II. Background This chapter introduces the foundations used in this thesis. Reinforcement

  14. Active-learning strategies: the use of a game to reinforce learning in nursing education. A case study.

    PubMed

    Boctor, Lisa

    2013-03-01

    The majority of nursing students are kinesthetic learners, preferring a hands-on, active approach to education. Research shows that active-learning strategies can increase student learning and satisfaction. This study looks at the use of one active-learning strategy, a Jeopardy-style game, 'Nursopardy', to reinforce Fundamentals of Nursing material, aiding in students' preparation for a standardized final exam. The game was created keeping students' varied learning styles and the NCLEX blueprint in mind. The blueprint was used to create 5 categories, with 26 total questions. Student survey results, using a five-point Likert scale, showed that students found this learning method enjoyable and beneficial to learning. More research is recommended regarding learning outcomes when using active-learning strategies such as games.

  15. Reinforcement learning of self-regulated β-oscillations for motor restoration in chronic stroke.

    PubMed

    Naros, Georgios; Gharabaghi, Alireza

    2015-01-01

    Neurofeedback training of motor imagery (MI)-related brain states with brain-computer/brain-machine interfaces (BCI/BMI) is currently being explored as an experimental intervention prior to standard physiotherapy to improve the motor outcome of stroke rehabilitation. The use of BCI/BMI technology increases the adherence to MI training more efficiently than interventions with sham or no feedback. Moreover, pilot studies suggest that such a priming intervention before physiotherapy might, like some brain stimulation techniques, increase the responsiveness of the brain to the subsequent physiotherapy, thereby improving the general clinical outcome. However, there is little evidence up to now that these BCI/BMI-based interventions have achieved operant conditioning of specific brain states that facilitate task-specific functional gains beyond the practice of primed physiotherapy. In this context, we argue that BCI/BMI technology provides a valuable neurofeedback tool for rehabilitation but needs to aim at physiological features relevant for the targeted behavioral gain. Moreover, this therapeutic intervention has to be informed by concepts of reinforcement learning to develop its full potential. Such a refined neurofeedback approach would need to address the following issues: (1) Defining a physiological feedback target specific to the intended behavioral gain, e.g., β-band oscillations for cortico-muscular communication. This targeted brain state could well be different from the brain state optimal for the neurofeedback task, e.g., α-band oscillations for differentiating MI from rest; (2) Selecting a BCI/BMI classification and thresholding approach on the basis of learning principles, i.e., balancing challenge and reward of the neurofeedback task instead of maximizing the classification accuracy of the difficulty level device; and (3) Adjusting the difficulty level in the course of the training period to account for the cognitive load and the learning experience of

  16. Reinforcement learning of self-regulated β-oscillations for motor restoration in chronic stroke

    PubMed Central

    Naros, Georgios; Gharabaghi, Alireza

    2015-01-01

    Neurofeedback training of motor imagery (MI)-related brain states with brain-computer/brain-machine interfaces (BCI/BMI) is currently being explored as an experimental intervention prior to standard physiotherapy to improve the motor outcome of stroke rehabilitation. The use of BCI/BMI technology increases the adherence to MI training more efficiently than interventions with sham or no feedback. Moreover, pilot studies suggest that such a priming intervention before physiotherapy might—like some brain stimulation techniques—increase the responsiveness of the brain to the subsequent physiotherapy, thereby improving the general clinical outcome. However, there is little evidence up to now that these BCI/BMI-based interventions have achieved operant conditioning of specific brain states that facilitate task-specific functional gains beyond the practice of primed physiotherapy. In this context, we argue that BCI/BMI technology provides a valuable neurofeedback tool for rehabilitation but needs to aim at physiological features relevant for the targeted behavioral gain. Moreover, this therapeutic intervention has to be informed by concepts of reinforcement learning to develop its full potential. Such a refined neurofeedback approach would need to address the following issues: (1) Defining a physiological feedback target specific to the intended behavioral gain, e.g., β-band oscillations for cortico-muscular communication. This targeted brain state could well be different from the brain state optimal for the neurofeedback task, e.g., α-band oscillations for differentiating MI from rest; (2) Selecting a BCI/BMI classification and thresholding approach on the basis of learning principles, i.e., balancing challenge and reward of the neurofeedback task instead of maximizing the classification accuracy of the difficulty level device; and (3) Adjusting the difficulty level in the course of the training period to account for the cognitive load and the learning experience

  17. The limits and robustness of reinforcement learning in Lewis signalling games

    NASA Astrophysics Data System (ADS)

    Catteeuw, David; Manderick, Bernard

    2014-04-01

    Lewis signalling games are a standard model to study the emergence of language. We introduce win-stay/lose-inaction, a random process that only updates behaviour on success and never deviates from what was once successful, prove that it always ends up in a state of optimal communication in all Lewis signalling games, and predict the number of interactions it needs to do so: N³ interactions for Lewis signalling games with N equiprobable types. We show three reinforcement learning algorithms (Roth-Erev learning, Q-learning, and Learning Automata) that can imitate win-stay/lose-inaction and can even cope with errors in Lewis signalling games.
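
    A minimal Python sketch of the win-stay/lose-inaction rule in an N-type Lewis signalling game is given below; the game setup (equiprobable types, N signals, N acts, uniform random behaviour before the first success) and the stopping criterion are assumptions based on the description above, not the authors' code.

      # Win-stay/lose-inaction in a Lewis signalling game with N equiprobable types:
      # behaviour is updated only on success (the successful mapping is kept forever)
      # and never changed on failure.
      import random

      def simulate(n_types, seed=0, max_steps=10**6):
          rng = random.Random(seed)
          sender = {}    # type -> signal, fixed after the first success with that type
          receiver = {}  # signal -> act, fixed after the first success with that signal
          for step in range(1, max_steps + 1):
              t = rng.randrange(n_types)                   # nature draws a type
              s = sender.get(t, rng.randrange(n_types))    # stay with a past success, else act at random
              a = receiver.get(s, rng.randrange(n_types))  # same rule for the receiver
              if a == t:                                   # success: lock in both mappings
                  sender[t], receiver[s] = s, a
              if len(sender) == n_types:                   # every type now has a working signal
                  return step
          return max_steps

      # The abstract predicts on the order of N**3 interactions for N equiprobable types.
      print([simulate(n, seed=s) for n, s in [(3, 0), (3, 1), (8, 0)]])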

  18. SAwSu: an integrated model of associative and reinforcement learning.

    PubMed

    Veksler, Vladislav D; Myers, Christopher W; Gluck, Kevin A

    2014-04-01

    Successfully explaining and replicating the complexity and generality of human and animal learning will require the integration of a variety of learning mechanisms. Here, we introduce a computational model which integrates associative learning (AL) and reinforcement learning (RL). We contrast the integrated model with standalone AL and RL models in three simulation studies. First, a synthetic grid-navigation task is employed to highlight performance advantages for the integrated model in an environment where the reward structure is both diverse and dynamic. The second and third simulations contrast the performances of the three models in behavioral experiments, demonstrating advantages for the integrated model in accounting for behavioral data.

  19. Reinforcement function design and bias for efficient learning in mobile robots

    SciTech Connect

    Touzet, C.; Santos, J.M.

    1998-06-01

    The main paradigm in the sub-symbolic learning robot domain is the reinforcement learning method. Various techniques have been developed to deal with the memorization/generalization problem, demonstrating the superior ability of artificial neural network implementations. In this paper, the authors address the issue of designing the reinforcement so as to optimize the exploration part of the learning. They also present and summarize work related to the use of bias intended to achieve the effective synthesis of the desired behavior. Demonstrative experiments involving a self-organizing map implementation of Q-learning and real mobile robots (Nomad 200 and Khepera) in a task of obstacle avoidance behavior synthesis are described. 3 figs., 5 tabs.

  20. Multiagent reinforcement learning: spiking and nonspiking agents in the iterated Prisoner's Dilemma.

    PubMed

    Vassiliades, Vassilis; Cleanthous, Aristodemos; Christodoulou, Chris

    2011-04-01

    This paper investigates multiagent reinforcement learning (MARL) in a general-sum game where the payoff structure is such that the agents are required to exploit each other in a way that benefits all agents. The contradictory nature of these games makes their study in multiagent systems quite challenging. In particular, we investigate MARL with spiking and nonspiking agents in the Iterated Prisoner's Dilemma by exploring the conditions required to enhance its cooperative outcome. The spiking agents are neural networks with leaky integrate-and-fire neurons trained with two different learning algorithms: 1) reinforcement of stochastic synaptic transmission, or 2) reward-modulated spike-timing-dependent plasticity with eligibility trace. The nonspiking agents use a tabular representation and are trained with Q- and SARSA learning algorithms, with a novel reward transformation process also being applied to the Q-learning agents. According to the results, the cooperative outcome is enhanced by: 1) transformed internal reinforcement signals and a combination of a high learning rate and a low discount factor with an appropriate exploration schedule in the case of nonspiking agents, and 2) a longer eligibility trace time constant in the case of spiking agents. Moreover, it is shown that spiking and nonspiking agents have similar behavior and therefore can equally well be used in a multiagent interaction setting. For training the spiking agents in the case where more than one output neuron competes for reinforcement, a novel and necessary modification that enhances competition is applied to the two learning algorithms utilized, in order to avoid possible synaptic saturation. This is done by administering to the networks additional global reinforcement signals for every spike of the output neurons that were not "responsible" for the preceding decision.
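
    To make the tabular, nonspiking side of this setup concrete, below is a minimal Python sketch of two independent Q-learning agents playing the Iterated Prisoner's Dilemma. The payoff matrix, the state encoding (previous joint action), and all hyperparameters are illustrative assumptions, and the paper's reward transformation and SARSA variants are not reproduced.

      # Two independent tabular Q-learning agents in the Iterated Prisoner's Dilemma.
      import random

      C, D = 0, 1
      PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

      def epsilon_greedy(q, state, eps, rng):
          if rng.random() < eps:
              return rng.choice([C, D])
          return max([C, D], key=lambda a: q[(state, a)])

      def run(rounds=5000, alpha=0.1, gamma=0.6, eps=0.1, seed=0):
          rng = random.Random(seed)
          q1 = {(s, a): 0.0 for s in range(4) for a in (C, D)}
          q2 = {(s, a): 0.0 for s in range(4) for a in (C, D)}
          state = 0      # previous joint action encoded as 2*a1 + a2 (start as mutual cooperation)
          coop = 0
          for _ in range(rounds):
              a1 = epsilon_greedy(q1, state, eps, rng)
              a2 = epsilon_greedy(q2, state, eps, rng)
              r1, r2 = PAYOFF[(a1, a2)]
              next_state = 2 * a1 + a2
              for q, a, r in ((q1, a1, r1), (q2, a2, r2)):
                  best_next = max(q[(next_state, b)] for b in (C, D))
                  q[(state, a)] += alpha * (r + gamma * best_next - q[(state, a)])
              state = next_state
              coop += (a1 == C and a2 == C)
          return coop / rounds

      print("mutual cooperation rate:", run())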

  1. Application of Reinforcement Learning Algorithms for the Adaptive Computation of the Smoothing Parameter for Probabilistic Neural Network.

    PubMed

    Kusy, Maciej; Zajdel, Roman

    2015-09-01

    In this paper, we propose new methods for the choice and adaptation of the smoothing parameter of the probabilistic neural network (PNN). These methods are based on three reinforcement learning algorithms: Q(0)-learning, Q(λ)-learning, and stateless Q-learning. We consider three types of PNN classifiers: a model that uses a single smoothing parameter for the whole network, a model that uses a single smoothing parameter for each data attribute, and a model that possesses a matrix of smoothing parameters, different for each data variable and data class. Reinforcement learning is applied to find the value of the smoothing parameter that maximizes prediction ability. PNN models with smoothing parameters computed according to the proposed algorithms are tested on eight databases by calculating the test error with the use of a cross-validation procedure. The results are compared with state-of-the-art methods for PNN training published in the literature to date and, additionally, with a PNN whose sigma is determined by means of the conjugate gradient approach. The results demonstrate that the proposed approaches can be used as alternative PNN training procedures.
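
    The stateless Q-learning variant is the easiest of the three to illustrate. The Python sketch below adapts a single smoothing parameter of a simple Parzen/PNN-style classifier; the toy data, the leave-one-out accuracy reward, the multiplicative action set, and all constants are illustrative assumptions rather than the authors' procedure.

      # Stateless Q-learning over actions that scale the smoothing parameter sigma;
      # the reward is the change in leave-one-out accuracy of a Gaussian-kernel classifier.
      import numpy as np

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])   # toy two-class data
      y = np.array([0] * 50 + [1] * 50)

      def pnn_accuracy(sigma, X, y):
          """Leave-one-out accuracy of a Gaussian-kernel (PNN-style) classifier."""
          d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
          k = np.exp(-d2 / (2 * sigma ** 2))
          np.fill_diagonal(k, 0.0)                              # leave-one-out
          scores = np.stack([k[:, y == c].sum(1) for c in (0, 1)], axis=1)
          return float((scores.argmax(1) == y).mean())

      actions = np.array([0.8, 1.0, 1.25])                      # multiply sigma by one of these
      q = np.zeros(len(actions))                                # stateless action values
      sigma, alpha, eps = 1.0, 0.2, 0.2
      acc = pnn_accuracy(sigma, X, y)

      for _ in range(200):
          a = rng.integers(len(actions)) if rng.random() < eps else int(q.argmax())
          new_sigma = float(np.clip(sigma * actions[a], 1e-3, 10.0))
          new_acc = pnn_accuracy(new_sigma, X, y)
          reward = new_acc - acc                                # reward = change in accuracy
          q[a] += alpha * (reward - q[a])                       # stateless Q-learning update
          sigma, acc = new_sigma, new_acc

      print("tuned sigma:", round(sigma, 3), "LOO accuracy:", acc)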

  2. Life Span Differences in Electrophysiological Correlates of Monitoring Gains and Losses during Probabilistic Reinforcement Learning

    ERIC Educational Resources Information Center

    Hammerer, Dorothea; Li, Shu-Chen; Muller, Viktor; Lindenberger, Ulman

    2011-01-01

    By recording the feedback-related negativity (FRN) in response to gains and losses, we investigated the contribution of outcome monitoring mechanisms to age-associated differences in probabilistic reinforcement learning. Specifically, we assessed the difference of the monitoring reactions to gains and losses to investigate the monitoring of…

  3. Your Classroom as an Experiment in Education: The Reinforcement Theory of Learning

    ERIC Educational Resources Information Center

    Fuller, Robert G.

    1976-01-01

    Presents the reinforcement theory of learning and explains how it relates to the Keller Plan of instruction. Advocates the use of the Keller Plan as an alternative to the lecture-demonstration system and provides an annotated bibliography of pertinent material. (GS)

  4. The Establishment of Learned Reinforcers in Mildly Retarded Children. IMRID Behavioral Science Monograph No. 24.

    ERIC Educational Resources Information Center

    Worley, John C., Jr.

    Research regarding the establishment of learned reinforcement with mildly retarded children is reviewed. Noted are findings which indicate that educable retarded students, possibly due to cultural differences, are less responsive to social rewards than either nonretarded or more severely retarded children. Characteristics of primary and secondary…

  5. Persistence Training: A Partial Reinforcement Procedure for Reversing Learned Helplessness and Depression.

    ERIC Educational Resources Information Center

    Massad, Phillip; Nation, Jack R.

    1978-01-01

    The results in this experiment identified both partial and continuous reinforcement as efficacious procedures for reversing learned helplessness in college students. The partial success therapy schedule produced greater response persistence in escape extinction following therapy. The framework for a new therapy technique, persistence training, is…

  6. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults

    PubMed Central

    Smith, Tim J.; Senju, Atsushi

    2017-01-01

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue–reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue–reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that social engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting. PMID:28250186

  7. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults.

    PubMed

    Vernetti, Angélina; Smith, Tim J; Senju, Atsushi

    2017-03-15

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue-reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue-reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that social engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting.

  8. Aggression as Positive Reinforcement in Mice under Various Ratio- and Time-Based Reinforcement Schedules

    ERIC Educational Resources Information Center

    May, Michael E.; Kennedy, Craig H.

    2009-01-01

    There is evidence suggesting aggression may be a positive reinforcer in many species. However, only a few studies have examined the characteristics of aggression as a positive reinforcer in mice. Four types of reinforcement schedules were examined in the current experiment using male Swiss CFW albino mice in a resident-intruder model of aggression…

  9. Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control.

    PubMed

    Uragami, Daisuke; Takahashi, Tatsuji; Matsuo, Yoshiki

    2014-02-01

    Many algorithms and methods in artificial intelligence or machine learning were inspired by human cognition. As a mechanism to handle the exploration-exploitation dilemma in reinforcement learning, the loosely symmetric (LS) value function, which models the causal intuition of humans, was proposed (Shinohara et al., 2007). While LS shows the highest correlation with causal induction by humans, it has also been reported to work effectively in multi-armed bandit problems, the simplest class of tasks representing the dilemma. However, the scope of application of LS was limited to reinforcement learning problems that have K actions with only one state (K-armed bandit problems). This study proposes an LS-Q learning architecture that can deal with general reinforcement learning tasks with multiple states and delayed reward. We tested the learning performance of the new architecture on giant-swing robot motion learning, where the uncertainty and unknownness of the environment are substantial. In the test, no ready-made internal models or function approximation of the state space were provided. The simulations showed that while the ordinary Q-learning agent does not reach the giant-swing motion because of stagnant loops (local optima with low rewards), LS-Q escapes such loops and acquires the giant-swing motion. It is confirmed that the smaller the number of states (in other words, the more coarse-grained the division of states and the more incomplete the state observation), the better LS-Q performs in comparison with Q-learning. We also showed that the high performance of LS-Q depends comparatively little on parameter tuning and learning time. This suggests that the proposed method, inspired by human cognition, works adaptively in real environments.

  10. Flexural strength, water sorption and solubility of a methylmethacrylate-free denture base polymer reinforced with glass fibre reinforcement.

    PubMed

    Mutluay, M M; Tezvergil-Mutluay, A; Vallittu, P; Lassila, L

    2013-12-01

    A methylmethacrylate-free denture base polymer (Eclipse) was evaluated in comparison with a conventional denture base polymer (Palapress vario) after water saturation and Stick glass fibre reinforcement. The data were analysed with ANOVA at α = 0.05. Water storage caused a decrease in the flexural strength and stiffness of the materials (p > 0.05). The conventional denture base material with fibre reinforcement gave the highest flexural strength (201.1 MPa) compared with fibre-reinforced Eclipse (79.1 MPa) (p < 0.05). Water sorption after 76 days was 2.08% (Palapress vario) and 1.55% (Eclipse). Fibre reinforcement of the methylmethacrylate-free material was not as successful as that of the conventional denture base and needs to be further optimized.

  11. Quantized Attention-Gated Kernel Reinforcement Learning for Brain-Machine Interface Decoding.

    PubMed

    Wang, Fang; Wang, Yiwen; Xu, Kai; Li, Hongbao; Liao, Yuxi; Zhang, Qiaosheng; Zhang, Shaomin; Zheng, Xiaoxiang; Principe, Jose C

    2017-04-01

    Reinforcement learning (RL)-based decoders in brain-machine interfaces (BMIs) interpret dynamic neural activity without patients' real limb movements. In conventional RL, the goal state is selected by the user or defined by the physics of the problem, and the decoder finds an optimal policy essentially by assigning credit over time, which is normally very time-consuming. However, BMI tasks require finding a good policy in very few trials, which imposes a limit on the complexity of the tasks that can be learned before the animal quits. Therefore, this paper explores the possibility of letting the agent infer potential goals through actions over space with multiple objects, using the instantaneous reward to assign credit spatially. A previous method, attention-gated RL, employs a multilayer perceptron trained with backpropagation, but it is prone to entrapment in local minima. We propose a quantized attention-gated kernel RL (QAGKRL) to avoid local minima in spatial credit assignment and to sparsify the network topology. The experimental results show that the QAGKRL achieves higher success rates and more stable performance, indicating its powerful decoding ability for the more sophisticated BMI tasks required in clinical applications.

  12. Cocaine dependent individuals with attenuated striatal activation during reinforcement learning are more susceptible to relapse.

    PubMed

    Stewart, Jennifer L; Connolly, Colm G; May, April C; Tapert, Susan F; Wittmann, Marc; Paulus, Martin P

    2014-08-30

    Cocaine-dependent individuals show altered brain activation during decision making. It is unclear, however, whether these activation differences are related to relapse vulnerability. This study tested the hypothesis that brain-activation patterns during reinforcement learning are linked to relapse 1 year later in individuals entering treatment for cocaine dependence. Subjects performed a Paper-Scissors-Rock task during functional magnetic resonance imaging (fMRI). A year later, we examined whether subjects had remained abstinent (n=15) or relapsed (n=15). Although the groups did not differ on demographic characteristics, behavioral performance, or lifetime substance use, abstinent patients reported greater motivation to win than relapsed patients. The fMRI results indicated that compared with abstinent individuals, relapsed users exhibited lower activation in (1) bilateral inferior frontal gyrus and striatum during decision making more generally; and (2) bilateral middle frontal gyrus and anterior insula during reward contingency learning in particular. Moreover, whereas abstinent patients exhibited greater left middle frontal and striatal activation to wins than losses, relapsed users did not demonstrate modulation in these regions as a function of outcome valence. Thus, individuals at high risk for relapse relative to those who are able to abstain allocate fewer neural resources to action-outcome contingency formation and decision making, as well as having less motivation to win on a laboratory-based task.

  13. Cocaine dependent individuals with attenuated striatal activation during reinforcement learning are more susceptible to relapse

    PubMed Central

    Stewart, Jennifer L.; Connolly, Colm G.; May, April C.; Tapert, Susan F.; Wittmann, Marc; Paulus, Martin P.

    2014-01-01

    Cocaine-dependent individuals show altered brain activation during decision making. It is unclear, however, whether these activation differences are related to relapse vulnerability. This study tested the hypothesis that brain-activation patterns during reinforcement learning are linked to relapse 1 year later in individuals entering treatment for cocaine dependence. Subjects performed a Paper-Scissors-Rock task during functional magnetic resonance imaging (fMRI). A year later, we examined whether subjects had remained abstinent (n=15) or relapsed (n=15). Although the groups did not differ on demographic characteristics, behavioral performance, or lifetime substance use, abstinent patients reported greater motivation to win than relapsed patients. The fMRI results indicated that compared with abstinent individuals, relapsed users exhibited lower activation in (1) bilateral inferior frontal gyrus and striatum during decision making more generally; and (2) bilateral middle frontal gyrus and anterior insula during reward contingency learning in particular. Moreover, whereas abstinent patients exhibited greater left middle frontal and striatal activation to wins than losses, relapsed users did not demonstrate modulation in these regions as a function of outcome valence. Thus, individuals at high risk for relapse relative to those who are able to abstain allocate fewer neural resources to action-outcome contingency formation and decision making, as well as having less motivation to win on a laboratory-based task. PMID:24862388

  14. Robot-assisted motor training: assistance decreases exploration during reinforcement learning.

    PubMed

    Sans-Muntadas, Albert; Duarte, Jaime E; Reinkensmeyer, David J

    2014-01-01

    Reinforcement learning (RL) is a form of motor learning that robotic therapy devices could potentially manipulate to promote neurorehabilitation. We developed a system that requires trainees to use RL to learn a predefined target movement. The system provides higher rewards for movements that are more similar to the target movement. We also developed a novel algorithm that rewards trainees of different abilities with comparable reward sizes. This algorithm measures a trainee's performance relative to their best performance, rather than relative to an absolute target performance, to determine reward. We hypothesized this algorithm would permit subjects who cannot normally achieve high reward levels to do so while still learning. In an experiment with 21 unimpaired human subjects, we found that all subjects quickly learned to make a first target movement with and without the reward equalization. However, artificially increasing reward decreased the subjects' tendency to engage in exploration and therefore slowed learning, particularly when we changed the target movement. An anti-slacking watchdog algorithm further slowed learning. These results suggest that robotic algorithms that assist trainees in achieving rewards or in preventing slacking might, over time, discourage the exploration needed for reinforcement learning.
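
    The reward-equalization idea (reward computed relative to a trainee's personal best rather than an absolute target) can be captured in a few lines of Python; the error metric, the linear scaling, and the clipping to [0, 1] below are illustrative assumptions, not the authors' algorithm.

      # Reward relative to a trainee's best performance so far: matching or improving on
      # the personal best yields the maximum reward, and reward falls off as the
      # current error exceeds that best.
      def relative_reward(error, best_error, scale=1.0):
          if error <= best_error:
              return 1.0
          return max(0.0, 1.0 - scale * (error - best_error) / best_error)

      best = float("inf")
      for trial_error in [5.0, 4.2, 4.5, 3.9, 4.0]:   # hypothetical movement errors per trial
          best = min(best, trial_error)
          print(trial_error, round(relative_reward(trial_error, best), 2))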

  15. Bi-Directional Effect of Increasing Doses of Baclofen on Reinforcement Learning

    PubMed Central

    Terrier, Jean; Ort, Andres; Yvon, Cédric; Saj, Arnaud; Vuilleumier, Patrik; Lüscher, Christian

    2011-01-01

    In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. It has been shown that in brain slices of mice, GABAB-receptor agonists at low concentrations increase the firing frequency of VTA–DA neurons, while high concentrations reduce the firing frequency. It remains however elusive whether baclofen can modulate reinforcement learning in humans. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high concentration of oral baclofen, a high affinity GABAB-receptor agonist, in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen on the other hand did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels and thus facilitates reinforcement learning. PMID:21811448

  16. Bi-directional effect of increasing doses of baclofen on reinforcement learning.

    PubMed

    Terrier, Jean; Ort, Andres; Yvon, Cédric; Saj, Arnaud; Vuilleumier, Patrik; Lüscher, Christian

    2011-01-01

    In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. It has been shown that in brain slices of mice, GABA(B)-receptor agonists at low concentrations increase the firing frequency of VTA-DA neurons, while high concentrations reduce the firing frequency. It remains however elusive whether baclofen can modulate reinforcement learning in humans. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high concentration of oral baclofen, a high affinity GABA(B)-receptor agonist, in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen on the other hand did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels and thus facilitates reinforcement learning.

  17. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning.

    PubMed

    Franklin, Nicholas T; Frank, Michael J

    2015-12-25

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning three Marr's levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments.
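
    At the most abstract of the levels of analysis, the key computational idea is a learning rate that is modulated by uncertainty. The Python sketch below is a deliberately non-biological caricature of that idea, with uncertainty estimated from the variance of recent prediction errors loosely standing in for the TAN-pause signal; the task, the uncertainty estimator, and all constants are assumptions for illustration only.

      # Uncertainty-modulated learning rate: the value update is scaled by the running
      # variance of recent prediction errors, so learning speeds up after change-points
      # and slows down once outcomes become predictable.
      from collections import deque
      import random
      import statistics

      def run(trials=2000, seed=1):
          rng = random.Random(seed)
          value = 0.5
          recent_errors = deque(maxlen=20)
          p_reward = 0.8                           # reward probability, switches mid-run
          for t in range(trials):
              if t == trials // 2:
                  p_reward = 0.2                   # change-point in outcome contingencies
              reward = 1.0 if rng.random() < p_reward else 0.0
              delta = reward - value               # reward prediction error
              recent_errors.append(delta)
              uncertainty = statistics.pvariance(recent_errors) if len(recent_errors) > 1 else 0.25
              alpha = min(1.0, 0.1 + uncertainty)  # higher uncertainty -> larger effective learning rate
              value += alpha * delta
          return value

      print(round(run(), 3))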

  18. A reinforcement learning mechanism responsible for the valuation of free choice.

    PubMed

    Cockburn, Jeffrey; Collins, Anne G E; Frank, Michael J

    2014-08-06

    Humans exhibit a preference for options they have freely chosen over equally valued options they have not; however, the neural mechanism that drives this bias and its functional significance have yet to be identified. Here, we propose a model in which choice biases arise due to amplified positive reward prediction errors associated with free choice. Using a novel variant of a probabilistic learning task, we show that choice biases are selective to options that are predominantly associated with positive outcomes. A polymorphism in DARPP-32, a gene linked to dopaminergic striatal plasticity and individual differences in reinforcement learning, was found to predict the effect of choice as a function of value. We propose that these choice biases are the behavioral byproduct of a credit assignment mechanism responsible for ensuring the effective delivery of dopaminergic reinforcement learning signals broadcast to the striatum.

  19. A reinforcement learning mechanism responsible for the valuation of free choice

    PubMed Central

    Cockburn, Jeffrey; Collins, Anne G.E.; Frank, Michael J.

    2014-01-01

    Summary Humans exhibit a preference for options they have freely chosen over equally valued options they have not; however, the neural mechanism that drives this bias and its functional significance have yet to be identified. Here, we propose a model in which choice biases arise due to amplified positive reward prediction errors associated with free choice. Using a novel variant of a probabilistic learning task, we show that choice biases are selective to options that are predominantly associated with positive outcomes. A polymorphism in DARPP-32, a gene linked to dopaminergic striatal plasticity and individual differences in reinforcement learning, was found to predict the effect of choice as a function of value. We propose that these choice biases are the behavioral byproduct of a credit assignment mechanism responsible for ensuring the effective delivery of dopaminergic reinforcement learning signals broadcast to the striatum. PMID:25066083

  20. The effects of aging on the interaction between reinforcement learning and attention.

    PubMed

    Radulescu, Angela; Daniel, Reka; Niv, Yael

    2016-11-01

    Reinforcement learning (RL) in complex environments relies on selective attention to uncover those aspects of the environment that are most predictive of reward. Whereas previous work has focused on age-related changes in RL, it is not known whether older adults learn differently from younger adults when selective attention is required. In 2 experiments, we examined how aging affects the interaction between RL and selective attention. Younger and older adults performed a learning task in which only 1 stimulus dimension was relevant to predicting reward, and within it, 1 "target" feature was the most rewarding. Participants had to discover this target feature through trial and error. In Experiment 1, stimuli varied on 1 or 3 dimensions and participants received hints that revealed the target feature, the relevant dimension, or gave no information. Group-related differences in accuracy and RTs differed systematically as a function of the number of dimensions and the type of hint available. In Experiment 2 we used trial-by-trial computational modeling of the learning process to test for age-related differences in learning strategies. Behavior of both young and older adults was explained well by a reinforcement-learning model that uses selective attention to constrain learning. However, the model suggested that older adults restricted their learning to fewer features, employing more focused attention than younger adults. Furthermore, this difference in strategy predicted age-related deficits in accuracy. We discuss these results suggesting that a narrower filter of attention may reflect an adaptation to the reduced capabilities of the reinforcement learning system.
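
    A bare-bones version of such an attention-constrained reinforcement-learning model is sketched in Python below. It is a simplified prediction-only variant (no choice among stimuli): a stimulus's value is an attention-weighted sum of its feature values, the same attention weights gate the learning update, and attention is re-allocated toward dimensions whose features appear most predictive. The task statistics, the softmax re-allocation rule, and all parameters are illustrative assumptions, not the model fitted in the paper.

      # Feature-based RL constrained by attention: values and updates are both weighted
      # by attention over stimulus dimensions, and attention shifts toward the dimension
      # whose learned feature values look most predictive of reward.
      import numpy as np

      rng = np.random.default_rng(0)
      n_dims, n_feats = 3, 3                  # 3 dimensions, 3 features per dimension
      w = np.zeros((n_dims, n_feats))         # learned feature values
      attention = np.ones(n_dims) / n_dims    # attention weights over dimensions
      relevant_dim, target_feat = 0, 1        # ground truth (unknown to the learner)
      alpha, beta = 0.3, 3.0                  # learning rate, attention softmax inverse temperature

      for trial in range(500):
          stim = rng.integers(n_feats, size=n_dims)                  # one feature per dimension
          value = float((attention * w[np.arange(n_dims), stim]).sum())
          reward = 1.0 if stim[relevant_dim] == target_feat else 0.0
          delta = reward - value                                     # prediction error
          w[np.arange(n_dims), stim] += alpha * attention * delta    # attention-gated update
          scores = np.exp(beta * w.max(axis=1))                      # re-allocate attention
          attention = scores / scores.sum()

      print(np.round(attention, 2))   # attention typically ends up concentrated on the relevant dimension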

  1. Problem-Based Learning

    ERIC Educational Resources Information Center

    Allen, Deborah E.; Donham, Richard S.; Bernhardt, Stephen A.

    2011-01-01

    In problem-based learning (PBL), students working in collaborative groups learn by resolving complex, realistic problems under the guidance of faculty. There is some evidence of PBL effectiveness in medical school settings where it began, and there are numerous accounts of PBL implementation in various undergraduate contexts, replete with…

  2. Spiking neural networks with different reinforcement learning (RL) schemes in a multiagent setting.

    PubMed

    Christodoulou, Chris; Cleanthous, Aristodemos

    2010-12-31

    This paper investigates the effectiveness of spiking agents when trained with reinforcement learning (RL) in a challenging multiagent task. In particular, it explores learning through reward-modulated spike-timing dependent plasticity (STDP) and compares it to reinforcement of stochastic synaptic transmission in the general-sum game of the Iterated Prisoner's Dilemma (IPD). More specifically, a computational model is developed where we implement two spiking neural networks as two "selfish" agents learning simultaneously but independently, competing in the IPD game. The purpose of our system (or collective) is to maximise its accumulated reward in the presence of reward-driven competing agents within the collective. This can only be achieved when the agents engage in a behaviour of mutual cooperation during the IPD. Previously, we successfully applied reinforcement of stochastic synaptic transmission to the IPD game. The current study utilises reward-modulated STDP with eligibility trace and results show that the system managed to exhibit the desired behaviour by establishing mutual cooperation between the agents. It is noted that the cooperative outcome was attained after a relatively short learning period which enhanced the accumulation of reward by the system. As in our previous implementation, the successful application of the learning algorithm to the IPD becomes possible only after we extended it with additional global reinforcement signals in order to enhance competition at the neuronal level. Moreover it is also shown that learning is enhanced (as indicated by an increased IPD cooperative outcome) through: (i) strong memory for each agent (regulated by a high eligibility trace time constant) and (ii) firing irregularity produced by equipping the agents' LIF neurons with a partial somatic reset mechanism.

  3. Dopamine-dependent reinforcement of motor skill learning: evidence from Gilles de la Tourette syndrome.

    PubMed

    Palminteri, Stefano; Lebreton, Maël; Worbe, Yulia; Hartmann, Andreas; Lehéricy, Stéphane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-08-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only affect choices but also motor skills such as typing. Here, we employed a novel paradigm to demonstrate that monetary rewards can improve motor skill learning in humans. Indeed, healthy participants progressively got faster in executing sequences of key presses that were repeatedly rewarded with 10 euro compared with 1 cent. Control tests revealed that the effect of reinforcement on motor skill learning was independent of subjects being aware of sequence-reward associations. To account for this implicit effect, we developed an actor-critic model, in which reward prediction errors are used by the critic to update state values and by the actor to facilitate action execution. To assess the role of dopamine in such computations, we applied the same paradigm in patients with Gilles de la Tourette syndrome, who were either unmedicated or treated with neuroleptics. We also included patients with focal dystonia, as an example of hyperkinetic motor disorder unrelated to dopamine. Model fit showed the following dissociation: while motor skills were affected in all patient groups, reinforcement learning was selectively enhanced in unmedicated patients with Gilles de la Tourette syndrome and impaired by neuroleptics. These results support the hypothesis that overactive dopamine transmission leads to excessive reinforcement of motor sequences, which might explain the formation of tics in Gilles de la Tourette syndrome.
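
    The actor-critic idea described here (one reward prediction error driving both a value update and a facilitation of the rewarded action) can be illustrated with a very small Python sketch. The single-state task, the two "sequences", the payoff values, and the learning rates are illustrative assumptions; this is not the model fitted to the patients' data.

      # Minimal actor-critic: the same reward prediction error updates the critic's
      # state value and strengthens the actor's preference for the chosen sequence.
      import math
      import random

      def softmax(prefs):
          m = max(prefs)
          exps = [math.exp(p - m) for p in prefs]
          z = sum(exps)
          return [e / z for e in exps]

      rng = random.Random(0)
      prefs = [0.0, 0.0]               # actor: preferences for two key-press sequences
      value = 0.0                      # critic: value of the (single) task state
      rewards = [10.0, 0.01]           # e.g., a 10-euro sequence vs. a 1-cent sequence
      alpha_actor, alpha_critic = 0.05, 0.1

      for _ in range(1000):
          probs = softmax(prefs)
          a = 0 if rng.random() < probs[0] else 1
          r = rewards[a]
          delta = r - value                       # reward prediction error
          value += alpha_critic * delta           # critic update
          prefs[a] += alpha_actor * delta         # actor update: facilitate the chosen action

      print([round(p, 2) for p in softmax(prefs)], round(value, 2))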

  4. Rheology of Carbon Fibre Reinforced Cement-Based Mortar

    SciTech Connect

    Banfill, Phillip F. G.; Starrs, Gerry; McCarter, W. John

    2008-07-07

    Carbon fibre reinforced cement based materials (CFRCs) offer the possibility of fabricating 'smart' electrically conductive materials. Rheology of the fresh mix is crucial to satisfactory moulding and fresh CFRC conforms to the Bingham model with slight structural breakdown. Both yield stress and plastic viscosity increase with increasing fibre length and volume concentration. Using a modified Viskomat NT, the concentration dependence of CFRC rheology up to 1.5% fibre volume is reported.
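
    For reference, the Bingham model mentioned above relates shear stress to shear rate through a yield stress and a plastic viscosity; the notation below is the standard one and is not quoted from the paper.

      % Bingham model: no flow below the yield stress \tau_0; above it, shear stress grows
      % linearly with shear rate \dot{\gamma}, with slope equal to the plastic viscosity \mu_p.
      \tau = \tau_0 + \mu_p \, \dot{\gamma} \quad (\tau > \tau_0), \qquad \dot{\gamma} = 0 \ \ \text{for } \tau \le \tau_0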

  5. Fabrication of tungsten wire reinforced nickel-base alloy composites

    NASA Technical Reports Server (NTRS)

    Brentnall, W. D.; Toth, I. J.

    1974-01-01

    Fabrication methods for tungsten fiber reinforced nickel-base superalloy composites were investigated. Three matrix alloys in pre-alloyed powder or rolled sheet form were evaluated in terms of fabricability into composite monotape and multi-ply forms. The utility of monotapes for fabricating more complex shapes was demonstrated. Preliminary 1093C (2000F) stress rupture tests indicated that efficient utilization of fiber strength was achieved in composites fabricated by diffusion bonding processes. The fabrication of thermal fatigue specimens is also described.

  6. Homeostatic reinforcement learning for integrating reward collection and physiological stability.

    PubMed

    Keramati, Mehdi; Gutkin, Boris

    2014-12-02

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system.
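
    One common way to formalize the core idea (rewards defined as reductions of deviation from a physiological setpoint) is sketched below; the specific drive function and its exponents are assumptions in the spirit of this framework rather than a quotation of the paper's equations.

      % Internal state H_t with setpoint H^*, drive function D, and an outcome K_t that shifts the state.
      D(H_t) = \Bigl( \sum_i \bigl| h_i^{*} - h_{i,t} \bigr|^{\,n} \Bigr)^{1/m}

      % Reward of outcome K_t = the reduction in drive that it produces.
      r(H_t, K_t) = D(H_t) - D(H_t + K_t)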

  7. Abolishing the effect of reinforcement delay on human causal learning.

    PubMed

    Buehner, Marc J; May, Jon

    2004-04-01

    Associative learning theory postulates two main determinants for human causal learning: contingency and contiguity. In line with such an account, participants in Shanks, Pearson, and Dickinson (1989) failed to discover causal relations involving delays of more than two seconds. More recent research has shown that the impact of contiguity and delay is mediated by prior knowledge about the timeframe of the causal relation in question. Buehner and May (2002, 2003) demonstrated that the detrimental effect of delay can be significantly reduced if reasoners are aware of potential delays. Here we demonstrate for the first time that the negative influence of delay can be abolished completely by a subtle change in the experimental instructions. Temporal contiguity is thus not essential for human causal learning.

  8. A hypothesis for basal ganglia-dependent reinforcement learning in the songbird

    PubMed Central

    Fee, Michale S.; Goldberg, Jesse H.

    2011-01-01

    Most of our motor skills are not innately programmed, but are learned by a combination of motor exploration and performance evaluation, suggesting that they proceed through a reinforcement learning (RL) mechanism. Songbirds have emerged as a model system to study how a complex behavioral sequence can be learned through an RL-like strategy. Interestingly, like motor sequence learning in mammals, song learning in birds requires a basal ganglia (BG)-thalamocortical loop, suggesting common neural mechanisms. Here we outline a specific working hypothesis for how BG-forebrain circuits could utilize an internally computed reinforcement signal to direct song learning. Our model includes a number of general concepts borrowed from the mammalian BG literature, including a dopaminergic reward prediction error and dopamine mediated plasticity at corticostriatal synapses. We also invoke a number of conceptual advances arising from recent observations in the songbird. Specifically, there is evidence for a specialized cortical circuit that adds trial-to-trial variability to stereotyped cortical motor programs, and a role for the BG in ‘biasing’ this variability to improve behavioral performance. This BG-dependent ‘premotor bias’ may in turn guide plasticity in downstream cortical synapses to consolidate recently-learned song changes. Given the similarity between mammalian and songbird BG-thalamocortical circuits, our model for the role of the BG in this process may have broader relevance to mammalian BG function. PMID:22015923

  9. CONCRETE POURS HAVE PRODUCED A REINFORCED SUPPORT BASE FOR MTR ...

    Library of Congress Historic Buildings Survey, Historic Engineering Record, Historic Landscapes Survey

    CONCRETE POURS HAVE PRODUCED A REINFORCED SUPPORT BASE FOR MTR REACTOR. PIPE TUNNEL IS UNDER CONSTRUCTION AT CENTER OF VIEW. PIPES WILL CARRY RADIOACTIVE WATER FROM REACTOR TO WATER PROCESS BUILDING. CAMERA LOOKS SOUTH INTO TUNNEL ALONG WEST SIDE OF REACTOR BASE. TWO CAISSONS ARE AT LEFT SIDE OF VIEW. NOTE "WINDOW" IN SOUTH FACE OF REACTOR BASE AND ALSO GROUP OF PENETRATIONS TO ITS LEFT. INL NEGATIVE NO. 733. Unknown Photographer, 10/6/1950 - Idaho National Engineering Laboratory, Test Reactor Area, Materials & Engineering Test Reactors, Scoville, Butte County, ID

  10. Basalt fiber reinforced porous aggregates-geopolymer based cellular material

    NASA Astrophysics Data System (ADS)

    Luo, Xin; Xu, Jin-Yu; Li, Weimin

    2015-09-01

    A basalt fiber reinforced porous aggregates-geopolymer based cellular material (BFRPGCM) was prepared. The stress-strain curve was obtained, the ideal energy-absorbing efficiency was analyzed, and the application prospects were explored. The results show the following: the fiber reinforced cellular material has successively sized pore structures; the stress-strain curve has two stages, an elastic stage and a yielding plateau stage; and the greatest value of the ideal energy-absorbing efficiency of BFRPGCM is 89.11%, which suggests that BFRPGCM has excellent energy-absorbing properties. BFRPGCM is thus simple to make, with high plasticity, low density, and excellent energy absorption, making it a promising energy-absorbing material, especially for civil defense engineering.

  11. Goal-Directed and Habit-Like Modulations of Stimulus Processing during Reinforcement Learning.

    PubMed

    Luque, David; Beesley, Tom; Morris, Richard W; Jack, Bradley N; Griffiths, Oren; Whitford, Thomas J; Le Pelley, Mike E

    2017-03-15

    Recent research has shown that perceptual processing of stimuli previously associated with high-value rewards is automatically prioritized even when rewards are no longer available. It has been hypothesized that such reward-related modulation of stimulus salience is conceptually similar to an "attentional habit." Recording event-related potentials in humans during a reinforcement learning task, we show strong evidence in favor of this hypothesis. Resistance to outcome devaluation (the defining feature of a habit) was shown by the stimulus-locked P1 component, reflecting activity in the extrastriate visual cortex. Analysis at longer latencies revealed a positive component (corresponding to the P3b, from 550-700 ms) sensitive to outcome devaluation. Therefore, distinct spatiotemporal patterns of brain activity were observed corresponding to habitual and goal-directed processes. These results demonstrate that reinforcement learning engages both attentional habits and goal-directed processes in parallel. Consequences for brain and computational models of reinforcement learning are discussed. SIGNIFICANCE STATEMENT: The human attentional network adapts to detect stimuli that predict important rewards. A recent hypothesis suggests that the visual cortex automatically prioritizes reward-related stimuli, driven by cached representations of reward value; that is, stimulus-response habits. Alternatively, the neural system may track the current value of the predicted outcome. Our results demonstrate for the first time that visual cortex activity is increased for reward-related stimuli even when the rewarding event is temporarily devalued. In contrast, longer-latency brain activity was specifically sensitive to transient changes in reward value. Therefore, we show that both habit-like attention and goal-directed processes occur in the same learning episode at different latencies. This result has important consequences for computational models of reinforcement learning.

  12. Predicting Pilot Behavior in Medium Scale Scenarios Using Game Theory and Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Yildiz, Yildiray; Agogino, Adrian; Brat, Guillaume

    2013-01-01

    Effective automation is critical in achieving the capacity and safety goals of the Next Generation Air Traffic System. Unfortunately, creating integration and validation tools for such automation is difficult because the interactions between automation and its human counterparts are complex and unpredictable. This validation becomes even more difficult as we integrate wide-reaching technologies that affect the behavior of different decision makers in the system, such as pilots, controllers, and airlines. While overt short-term behavior changes can be explicitly modeled with traditional agent modeling systems, subtle behavior changes caused by the integration of new technologies may snowball into larger problems and be very hard to detect. To overcome these obstacles, we show how integration of new technologies can be validated by learning behavior models based on goals. In this framework, human participants are not modeled explicitly. Instead, their goals are modeled, and their actions are predicted through reinforcement learning. The main advantage of this approach is that modeling is done within the context of the entire system, allowing for accurate modeling of all participants as they interact as a whole. In addition, such an approach allows for efficient trade studies and feasibility testing on a wide range of automation scenarios. The goal of this paper is to test whether such an approach is feasible. To do this, we implement the approach using a simple discrete-state learning system on a scenario where 50 aircraft need to self-navigate using Automatic Dependent Surveillance-Broadcast (ADS-B) information. In this scenario, we show how the approach can be used to predict the ability of pilots to adequately balance aircraft separation and fly efficient paths. We present results with several levels of complexity and airspace congestion.
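
    As an illustration of the goal-based modeling idea in this record, the toy sketch below (Python) shows the kind of reward a discrete-state learner could use to stand in for a pilot's goals. It is not the NASA implementation; the weights, thresholds, and function name are invented for illustration.

      def pilot_reward(separation_nm, extra_path_nm, min_sep=5.0, w_sep=10.0, w_path=1.0):
          # Losing separation is penalized heavily; flying extra distance is penalized mildly.
          separation_penalty = w_sep if separation_nm < min_sep else 0.0
          return -(separation_penalty + w_path * extra_path_nm)

      print(pilot_reward(separation_nm=4.0, extra_path_nm=2.0))   # separation lost: -12.0
      print(pilot_reward(separation_nm=8.0, extra_path_nm=2.0))   # safe but slightly longer path: -2.0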

  13. Homeostatic reinforcement learning for integrating reward collection and physiological stability

    PubMed Central

    Keramati, Mehdi; Gutkin, Boris

    2014-01-01

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system. DOI: http://dx.doi.org/10.7554/eLife.04811.001 PMID:25457346
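
    A minimal sketch of the drive-reduction idea in this framework, assuming a simple Euclidean drive function and invented setpoint and outcome values; it is not the authors' full model.

      import numpy as np

      def drive(h, setpoint):
          # Drive = distance of the internal state from its homeostatic setpoint.
          return float(np.sqrt(np.sum((setpoint - h) ** 2)))

      def homeostatic_reward(h, outcome, setpoint):
          # Reward of an outcome = the reduction in drive it produces.
          return drive(h, setpoint) - drive(h + outcome, setpoint)

      setpoint = np.array([50.0, 37.0])   # invented glucose / temperature setpoints
      state = np.array([40.0, 37.0])      # current internal state: a glucose deficit
      food = np.array([5.0, 0.0])         # outcome that raises glucose by 5 units
      print(homeostatic_reward(state, food, setpoint))   # positive: eating reduces the deficit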

  14. Prediction error in reinforcement learning: a meta-analysis of neuroimaging studies.

    PubMed

    Garrison, Jane; Erdeniz, Burak; Done, John

    2013-08-01

    Activation likelihood estimation (ALE) meta-analyses were used to examine the neural correlates of prediction error in reinforcement learning. The findings are interpreted in the light of current computational models of learning and action selection. In this context, particular consideration is given to the comparison of activation patterns from studies using instrumental and Pavlovian conditioning, and where reinforcement involved rewarding or punishing feedback. The striatum was the key brain area encoding for prediction error, with activity encompassing dorsal and ventral regions for instrumental and Pavlovian reinforcement alike, a finding which challenges the functional separation of the striatum into a dorsal 'actor' and a ventral 'critic'. Prediction error activity was further observed in diverse areas of predominantly anterior cerebral cortex including medial prefrontal cortex and anterior cingulate cortex. Distinct patterns of prediction error activity were found for studies using rewarding and aversive reinforcers; reward prediction errors were observed primarily in the striatum while aversive prediction errors were found more widely including insula and habenula.
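
    A minimal sketch of the prediction-error signal these studies model, assuming a single cue with an invented reward sequence and learning rate.

      alpha = 0.1                        # invented learning rate
      value = 0.0                        # expected value of a cue or action
      for outcome in [1, 0, 1, 1, 0, 1]:                 # invented reward sequence
          delta = outcome - value        # prediction error: positive = better than expected
          value += alpha * delta         # expected value is nudged toward the outcome
          print(f"outcome={outcome}  prediction_error={delta:+.3f}  value={value:.3f}")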

  15. Reinforcement of cement-based matrices with graphite nanomaterials

    NASA Astrophysics Data System (ADS)

    Sadiq, Muhammad Maqbool

    Cement-based materials offer a desirable balance of compressive strength, moisture resistance, durability, economy and energy-efficiency; their tensile strength, fracture energy and durability in aggressive environments, however, could benefit from further improvements. An option for realizing some of these improvements involves introduction of discrete fibers into concrete. When compared with today's micro-scale (steel, polypropylene, glass, etc.) fibers, graphite nanomaterials (carbon nanotube, nanofiber and graphite nanoplatelet) offer superior geometric, mechanical and physical characteristics. Graphite nanomaterials would realize their reinforcement potential as far as they are thoroughly dispersed within cement-based matrices, and effectively bond to cement hydrates. The research reported herein developed non-covalent and covalent surface modification techniques to improve the dispersion and interfacial interactions of graphite nanomaterials in cement-based matrices with a dense and well graded micro-structure. The most successful approach involved polymer wrapping of nanomaterials for increasing the density of hydrophilic groups on the nanomaterial surface without causing any damage to their structure. The nanomaterials were characterized using various spectrometry techniques, and SEM (Scanning Electron Microscopy). The graphite nanomaterials were dispersed via selected sonication procedures in the mixing water of the cement-based matrix; conventional mixing and sample preparation techniques were then employed to prepare the cement-based nanocomposite samples, which were subjected to steam curing. Comprehensive engineering and durability characteristics of cement-based nanocomposites were determined and their chemical composition, microstructure and failure mechanisms were also assessed through various spectrometry, thermogravimetry, electron microscopy and elemental analyses. Both functionalized and non-functionalized nanomaterials as well as different

  16. Reinforcement learning and counterfactual reasoning explain adaptive behavior in a changing environment.

    PubMed

    Zhang, Yunfeng; Paik, Jaehyon; Pirolli, Peter

    2015-04-01

    Animals routinely adapt to changes in the environment in order to survive. Though reinforcement learning may play a role in such adaptation, it is not clear that it is the only mechanism involved, as it is not well suited to producing rapid, relatively immediate changes in strategies in response to environmental changes. This research proposes that counterfactual reasoning might be an additional mechanism that facilitates change detection. An experiment was conducted in which a task state changed over time and participants had to detect the changes in order to perform well and gain monetary rewards. A cognitive model was constructed that combines reinforcement learning with counterfactual reasoning to quickly adjust the utility of task strategies in response to changes. The results show that the model can accurately explain human data and that counterfactual reasoning is key to reproducing the various effects observed in this change detection paradigm.
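
    The sketch below illustrates how counterfactual updating can speed adaptation to a task change. It is not the authors' cognitive model; the payoff structure, learning rate, and change point are invented.

      alpha = 0.2
      utilities = {"A": 0.0, "B": 0.0}

      def payoff(strategy, task_state):
          # Invented task: strategy "A" pays off in state 0, strategy "B" in state 1.
          return 1.0 if (strategy == "A") == (task_state == 0) else 0.0

      task_state = 0
      for trial in range(40):
          if trial == 20:
              task_state = 1                           # unsignalled change in the task
          chosen = max(utilities, key=utilities.get)
          unchosen = "B" if chosen == "A" else "A"
          utilities[chosen] += alpha * (payoff(chosen, task_state) - utilities[chosen])        # factual update
          utilities[unchosen] += alpha * (payoff(unchosen, task_state) - utilities[unchosen])  # counterfactual update
      print({k: round(v, 2) for k, v in utilities.items()})   # utilities track the new task state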

  17. An information-theoretic analysis of return maximization in reinforcement learning.

    PubMed

    Iwata, Kazunori

    2011-12-01

    We present a general analysis of return maximization in reinforcement learning. This analysis does not require assumptions of Markovianity, stationarity, and ergodicity for the stochastic sequential decision processes of reinforcement learning. Instead, our analysis assumes the asymptotic equipartition property fundamental to information theory, providing a substantially different view from that in the literature. As our main results, we show that return maximization is achieved by the overlap of typical and best sequence sets, and we present a class of stochastic sequential decision processes with the necessary condition for return maximization. We also describe several examples of best sequences in terms of return maximization in the class of stochastic sequential decision processes, which satisfy the necessary condition.

  18. Ventral tegmental area neurons in learned appetitive behavior and positive reinforcement.

    PubMed

    Fields, Howard L; Hjelmstad, Gregory O; Margolis, Elyssa B; Nicola, Saleem M

    2007-01-01

    Ventral tegmental area (VTA) neuron firing precedes behaviors elicited by reward-predictive sensory cues and scales with the magnitude and unpredictability of received rewards. These patterns are consistent with roles in the performance of learned appetitive behaviors and in positive reinforcement, respectively. The VTA includes subpopulations of neurons with different afferent connections, neurotransmitter content, and projection targets. Because the VTA and substantia nigra pars compacta are the sole sources of striatal and limbic forebrain dopamine, measurements of dopamine release and manipulations of dopamine function have provided critical evidence supporting a VTA contribution to these functions. However, the VTA also sends GABAergic and glutamatergic projections to the nucleus accumbens and prefrontal cortex. Furthermore, VTA-mediated but dopamine-independent positive reinforcement has been demonstrated. Consequently, identifying the neurotransmitter content and projection target of VTA neurons recorded in vivo will be critical for determining their contribution to learned appetitive behaviors.

  19. Caregiver preference for reinforcement-based interventions for problem behavior maintained by positive reinforcement.

    PubMed

    Gabor, Anne M; Fritz, Jennifer N; Roath, Christopher T; Rothe, Brittany R; Gourley, Denise A

    2016-06-01

    Social validity of behavioral interventions typically is assessed with indirect methods or by determining preferences of the individuals who receive treatment, and direct observation of caregiver preference rarely is described. In this study, preferences of 5 caregivers were determined via a concurrent-chains procedure. Caregivers were neurotypical, and children had been diagnosed with developmental disabilities and engaged in problem behavior maintained by positive reinforcement. Caregivers were taught to implement noncontingent reinforcement (NCR), differential reinforcement of alternative behavior (DRA), and differential reinforcement of other behavior (DRO), and the caregivers selected interventions to implement during sessions with the child after they had demonstrated proficiency in implementing the interventions. Three caregivers preferred DRA, 1 caregiver preferred differential reinforcement procedures, and 1 caregiver did not exhibit a preference. Direct observation of implementation in concurrent-chains procedures may allow the identification of interventions that are implemented with sufficient integrity and preferred by caregivers.

  20. Seizure Control in a Computational Model Using a Reinforcement Learning Stimulation Paradigm.

    PubMed

    Nagaraj, Vivek; Lamperski, Andrew; Netoff, Theoden I

    2016-11-02

    Neuromodulation technologies, such as vagus nerve stimulation and deep brain stimulation, have shown some efficacy in controlling seizures in medically intractable patients. However, inherent patient-to-patient variability of seizure disorders leads to a wide range of therapeutic efficacy. A patient-specific approach to determining stimulation parameters may lead to increased therapeutic efficacy while minimizing stimulation energy and side effects. This paper presents a reinforcement learning algorithm that optimizes stimulation frequency for controlling seizures with minimum stimulation energy. We apply our method to a computational model called the Epileptor. The Epileptor model simulates inter-ictal and ictal local field potential data. In order to apply reinforcement learning to the Epileptor, we introduce a specialized reward function and state-space discretization. With the reward function and discretization fixed, we test the effectiveness of the temporal difference reinforcement learning algorithm (TD(0)). For periodic pulsatile stimulation, we derive a relation that describes, for any stimulation frequency, the minimal pulse amplitude required to suppress seizures. The TD(0) algorithm is able to quickly identify parameters that control seizures. Additionally, our results show that the TD(0) algorithm refines the stimulation frequency to minimize stimulation energy, thereby converging to optimal parameters reliably. An advantage of the TD(0) algorithm is that it is adaptive, so that the parameters necessary to control the seizures can change over time. We show that the algorithm can converge on the optimal solution in simulation with slow and fast inter-seizure intervals.
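
    A toy sketch of the optimization target described here: each candidate stimulation frequency earns a reward that penalizes both seizure activity and stimulation energy, and a simple tabular value estimate is learned from experience. This is not the Epileptor model or the paper's full TD(0) formulation; the suppression curve, energy cost, and constants are invented.

      import random

      frequencies = [20, 50, 80, 110, 140]            # candidate stimulation frequencies (Hz)
      values = {f: 0.0 for f in frequencies}
      alpha, epsilon = 0.1, 0.1

      def reward(freq):
          seizure_prob = max(0.0, 1.0 - freq / 100.0)  # invented: higher frequency suppresses better
          energy_cost = 0.002 * freq                   # invented energy penalty
          seizure = 1.0 if random.random() < seizure_prob else 0.0
          return -seizure - energy_cost

      for _ in range(2000):
          f = random.choice(frequencies) if random.random() < epsilon else max(values, key=values.get)
          values[f] += alpha * (reward(f) - values[f])

      print(max(values, key=values.get))   # lowest frequency that suppresses seizures wins the trade-off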

  1. Reinforcement learning models and their neural correlates: An activation likelihood estimation meta-analysis.

    PubMed

    Chase, Henry W; Kumar, Poornima; Eickhoff, Simon B; Dombrovski, Alexandre Y

    2015-06-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments (prediction error) is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies have suggested that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that had employed algorithmic reinforcement learning models across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, whereas instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies.

  2. Reinforcement Learning Models and Their Neural Correlates: An Activation Likelihood Estimation Meta-Analysis

    PubMed Central

    Kumar, Poornima; Eickhoff, Simon B.; Dombrovski, Alexandre Y.

    2015-01-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments – prediction error – is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies suggest that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that employed algorithmic reinforcement learning models, across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, while instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually-estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies. PMID:25665667

  3. A Randomized Trial of Employment-Based Reinforcement of Cocaine Abstinence in Injection Drug Users

    ERIC Educational Resources Information Center

    Silverman, Kenneth; Wong, Conrad J.; Needham, Mick; Diemer, Karly N.; Knealing, Todd; Crone-Todd, Darlene; Fingerhood, Michael; Nuzzo, Paul; Kolodner, Kenneth

    2007-01-01

    High-magnitude and long-duration abstinence reinforcement can promote drug abstinence but can be difficult to finance. Employment may be a vehicle for arranging high-magnitude and long-duration abstinence reinforcement. This study determined if employment-based abstinence reinforcement could increase cocaine abstinence in adults who inject drugs…

  4. Neural regions that underlie reinforcement learning are also active for social expectancy violations.

    PubMed

    Harris, Lasana T; Fiske, Susan T

    2010-01-01

    Prediction error, the difference between an expected and an actual outcome, serves as a learning signal that interacts with reward and punishment value to direct future behavior during reinforcement learning. We hypothesized that similar learning and valuation signals may underlie social expectancy violations. Here, we explore the neural correlates of social expectancy violation signals along the universal person-perception dimensions trait warmth and competence. In this context, social learning may result from expectancy violations that occur when a target is inconsistent with an a priori schema. Expectancy violation may activate neural regions normally implicated in prediction error and valuation during appetitive and aversive conditioning. Using fMRI, we first gave perceivers high warmth or competence behavioral information that led to dispositional or situational attributions for the behavior. Participants then saw pictures of people responsible for the behavior; they represented social groups either inconsistent (rated low on either warmth or competence) or consistent (rated high on either warmth or competence) with the behavior information. Warmth and competence expectancy violations activate striatal regions that represent evaluative and prediction error signals. Social cognition regions underlie consistent expectations. These findings suggest that regions underlying reinforcement learning may work in concert with social cognition regions in warmth and competence social expectancy. This study illustrates the neural overlap between neuroeconomics and social neuroscience.

  5. Does reward frequency or magnitude drive reinforcement-learning in attention-deficit/hyperactivity disorder?

    PubMed

    Luman, Marjolein; Van Meel, Catharina S; Oosterlaan, Jaap; Sergeant, Joseph A; Geurts, Hilde M

    2009-08-15

    Children with attention-deficit/hyperactivity disorder (ADHD) show an impaired ability to use feedback in the context of learning. A stimulus-response learning task was used to investigate whether (1) children with ADHD displayed flatter learning curves, (2) reinforcement-learning in ADHD was sensitive to either reward frequency, magnitude, or both, and (3) altered sensitivity to reward was specific to ADHD or would co-occur in a group of children with autism spectrum disorder (ASD). Performance of 23 boys with ADHD was compared with that of 30 normal controls (NCs) and 21 boys with ASD, all aged 8-12. Rewards were delivered contingent on performance and varied both in frequency (low, high) and magnitude (small, large). The findings showed that, although learning rates were comparable across groups, both clinical groups committed more errors than NCs. In contrast to the NC boys, boys with ADHD were unaffected by frequency and magnitude of reward. The NC group and, to some extent, the ASD group showed improved performance when rewards were delivered infrequently versus frequently. Children with ADHD as well as children with ASD displayed difficulties in stimulus-response coupling that were independent of motivational modulations. Possibly, these deficits are related to abnormal reinforcement expectancy.

  6. Concept learning without differential reinforcement in pigeons by means of contextual cueing.

    PubMed

    Couto, Kalliu C; Navarro, Victor M; Smith, Tatiana R; Wasserman, Edward A

    2016-04-01

    How supervision is arranged can affect the way that humans learn concepts. Yet very little is known about the role that supervision plays in nonhuman concept learning. Prior research in pigeon concept learning has commonly used differential response-reinforcer procedures (involving high-level supervision) to support reliable discrimination and generalization involving from 4 to 16 concurrently presented photographic categories. In the present project, we used contextual cueing, a nondifferential reinforcement procedure (involving low-level supervision), to investigate concept learning in pigeons. We found that pigeons were faster to peck a target stimulus when 8 members from each of 4 categories of black-and-white photographs (dogs, trees, shoes, and keys) correctly cued its location than when they did not. This faster detection of the target also generalized to 4 untrained members from each of the 4 photographic categories. Our results thus pass the prime behavioral tests of conceptualization and suggest that high-level supervision is unnecessary to support concept learning in pigeons.

  7. Does temporal discounting explain unhealthy behavior? A systematic review and reinforcement learning perspective

    PubMed Central

    Story, Giles W.; Vlaev, Ivo; Seymour, Ben; Darzi, Ara; Dolan, Raymond J.

    2014-01-01

    The tendency to make unhealthy choices is hypothesized to be related to an individual's temporal discount rate, the theoretical rate at which they devalue delayed rewards. Furthermore, a particular form of temporal discounting, hyperbolic discounting, has been proposed to explain why unhealthy behavior can occur despite healthy intentions. We examine these two hypotheses in turn. We first systematically review studies which investigate whether discount rates can predict unhealthy behavior. These studies reveal that high discount rates for money (and in some instances food or drug rewards) are associated with several unhealthy behaviors and markers of health status, establishing discounting as a promising predictive measure. We secondly examine whether intention-incongruent unhealthy actions are consistent with hyperbolic discounting. We conclude that intention-incongruent actions are often triggered by environmental cues or changes in motivational state, whose effects are not parameterized by hyperbolic discounting. We propose a framework for understanding these state-based effects in terms of the interplay of two distinct reinforcement learning mechanisms: a “model-based” (or goal-directed) system and a “model-free” (or habitual) system. Under this framework, while discounting of delayed health may contribute to the initiation of unhealthy behavior, with repetition, many unhealthy behaviors become habitual; if health goals then change, habitual behavior can still arise in response to environmental cues. We propose that the burgeoning development of computational models of these processes will permit further identification of health decision-making phenotypes. PMID:24659960
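
    The sketch below contrasts the exponential and hyperbolic discount functions the review refers to: only the hyperbolic form produces a preference reversal as the choice point approaches. The discount rate k, reward amounts, and delays are illustrative values only.

      def exponential(amount, delay, k=0.1):
          return amount * (1 - k) ** delay

      def hyperbolic(amount, delay, k=0.1):
          return amount / (1 + k * delay)

      for discount in (exponential, hyperbolic):
          for delay_to_small in (30, 0):               # evaluated far in advance, then at the moment of choice
              ss = discount(10, delay_to_small)        # smaller-sooner reward
              ll = discount(15, delay_to_small + 10)   # larger-later reward
              choice = "larger-later" if ll > ss else "smaller-sooner"
              print(f"{discount.__name__:11s} delay={delay_to_small:2d}  prefers {choice}")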

  8. Trading Rules on Stock Markets Using Genetic Network Programming with Reinforcement Learning and Importance Index

    NASA Astrophysics Data System (ADS)

    Mabu, Shingo; Hirasawa, Kotaro; Furuzuki, Takayuki

    Genetic Network Programming (GNP) is an evolutionary computation method that represents its solutions as graph structures. Because GNP can create quite compact programs and has an implicit memory function, it has been shown to work well, especially in dynamic environments. A previous study created trading rules for stock markets using GNP with an Importance Index (GNP-IMX), a new element that serves as a criterion for decision making. In this paper, we combine GNP-IMX with Actor-Critic (GNP-IMX&AC) and use it to create trading rules for stock markets. Evolution-based methods can update their programs only after enough time has elapsed, because fitness values must first be calculated, whereas reinforcement learning can change the programs during the trading period, so trading rules can be created more efficiently. In the simulation, the proposed method is trained on the stock prices of 10 brands in 2002 and 2003, and its generalization ability is then tested on the stock prices in 2004. The simulation results show that the proposed method can obtain larger profits than GNP-IMX without AC and than Buy&Hold.
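
    The per-step updating that distinguishes reinforcement learning from whole-period fitness evaluation can be illustrated with a generic tabular Actor-Critic loop. This is a sketch only, not GNP-IMX&AC; the market model, state coding, and constants are invented.

      import numpy as np

      n_states, n_actions = 4, 2                 # e.g. coarse market states; actions buy/sell
      V = np.zeros(n_states)                     # critic: state values
      prefs = np.zeros((n_states, n_actions))    # actor: action preferences
      alpha_v, alpha_p, gamma = 0.1, 0.1, 0.95

      def step(state, action):
          # Invented stand-in for the market: the "right" action in a state pays off.
          reward = 0.1 if action == state % n_actions else -0.1
          return reward, np.random.randint(n_states)

      state = 0
      for t in range(1000):
          probs = np.exp(prefs[state]) / np.exp(prefs[state]).sum()
          action = np.random.choice(n_actions, p=probs)
          reward, next_state = step(state, action)
          td_error = reward + gamma * V[next_state] - V[state]   # critic's TD error, computed every step
          V[state] += alpha_v * td_error                          # critic update within the period
          prefs[state, action] += alpha_p * td_error              # actor update within the period
          state = next_state
      print(np.round(prefs, 2))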

  9. A State Space Filter for Reinforcement Learning in Partially Observable Markov Decision Processes

    NASA Astrophysics Data System (ADS)

    Nagayoshi, Masato; Murao, Hajime; Tamaki, Hisashi

    This paper presents a reinforcement learning technique that handles both discrete and continuous state-space systems in POMDPs while keeping the agent's state space compact. First, in our computational model for MDP environments, a concept of “state space filtering” is introduced to shrink the agent's state space appropriately by referring to an “entropy” measure calculated from the state-action mapping. The model is then extended to POMDP environments by introducing a mechanism for effectively utilizing history information, and the extended model can handle continuous as well as discrete state spaces. A mechanism for adjusting the amount of history information is also introduced so that the agent's state space remains compact. Computational experiments with a robot navigation problem in a continuous state space confirm the potential and effectiveness of the extended approach.
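
    A minimal sketch of the entropy criterion referred to above (the full filtering and history mechanisms are not reproduced; the action counts are invented): a state in which the agent already acts decisively has low entropy over its action distribution and is a candidate for coarser aggregation.

      import numpy as np

      def action_entropy(counts):
          # Shannon entropy (bits) of a state's empirical action distribution.
          p = np.asarray(counts, dtype=float)
          p = p / p.sum()
          p = p[p > 0]
          return float(-(p * np.log2(p)).sum())

      state_action_counts = {
          "s1": [30, 2, 1],      # one action dominates -> low entropy
          "s2": [10, 11, 12],    # no clear preference  -> high entropy
      }
      for s, counts in state_action_counts.items():
          print(s, round(action_entropy(counts), 3))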

  10. Reinforcement learning, spike-time-dependent plasticity, and the BCM rule.

    PubMed

    Baras, Dorit; Meir, Ron

    2007-08-01

    Learning agents, whether natural or artificial, must update their internal parameters in order to improve their behavior over time. In reinforcement learning, this plasticity is influenced by an environmental signal, termed a reward, that directs the changes in appropriate directions. We apply a recently introduced policy learning algorithm from machine learning to networks of spiking neurons and derive a spike-time-dependent plasticity rule that ensures convergence to a local optimum of the expected average reward. The approach is applicable to a broad class of neuronal models, including the Hodgkin-Huxley model. We demonstrate the effectiveness of the derived rule in several toy problems. Finally, through statistical analysis, we show that the synaptic plasticity rule established is closely related to the widely used BCM rule, for which good biological evidence exists.
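
    A schematic sketch of reward-modulated plasticity of the general kind discussed here: pre/post spike coincidences accumulate in an eligibility trace, and the weight change is the trace gated by a global reward signal. This is not the rule derived in the paper; the spike trains, time constant, and reward signal are invented.

      import numpy as np

      rng = np.random.default_rng(0)
      T = 1000                          # time steps
      w, trace = 0.5, 0.0               # synaptic weight and eligibility trace
      eta, tau_e = 0.01, 20.0           # learning rate and trace time constant

      pre = rng.random(T) < 0.05        # Poisson-like pre- and postsynaptic spike trains
      post = rng.random(T) < 0.05
      reward = rng.random(T) < 0.02     # sparse global reward signal

      for t in range(T):
          trace += -trace / tau_e + (1.0 if (pre[t] and post[t]) else 0.0)
          if reward[t]:
              w += eta * trace          # plasticity is expressed only when reward arrives
      print(round(w, 4))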

  11. Reinforcement Learning for Weakly-Coupled MDPs and an Application to Planetary Rover Control

    NASA Technical Reports Server (NTRS)

    Bernstein, Daniel S.; Zilberstein, Shlomo

    2003-01-01

    Weakly-coupled Markov decision processes can be decomposed into subprocesses that interact only through a small set of bottleneck states. We study a hierarchical reinforcement learning algorithm designed to take advantage of this particular type of decomposability. To test our algorithm, we use a decision-making problem faced by autonomous planetary rovers. In this problem, a Mars rover must decide which activities to perform and when to traverse between science sites in order to make the best use of its limited resources. In our experiments, the hierarchical algorithm performs better than Q-learning in the early stages of learning, but unlike Q-learning it converges to a suboptimal policy. This suggests that it may be advantageous to use the hierarchical algorithm when training time is limited.
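
    For reference, a minimal sketch of the flat Q-learning update used as the baseline comparison (the hierarchical algorithm itself is not reproduced; the state and action names are invented).

      Q = {}                                            # (state, action) -> value
      alpha, gamma = 0.1, 0.95
      actions = ["drive", "do_science", "recharge"]     # hypothetical rover activities

      def q(s, a):
          return Q.get((s, a), 0.0)

      def update(s, a, r, s_next):
          # One-step Q-learning: move Q(s,a) toward r + gamma * max_b Q(s',b).
          best_next = max(q(s_next, b) for b in actions)
          Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))

      update("site_A_low_battery", "recharge", r=0.0, s_next="site_A_full_battery")
      update("site_A_full_battery", "do_science", r=1.0, s_next="site_A_low_battery")
      print(Q)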

  12. Discrete-time online learning control for a class of unknown nonaffine nonlinear systems using reinforcement learning.

    PubMed

    Yang, Xiong; Liu, Derong; Wang, Ding; Wei, Qinglai

    2014-07-01

    In this paper, a reinforcement-learning-based direct adaptive control scheme is developed to deliver a desired tracking performance for a class of discrete-time (DT) nonlinear systems with unknown bounded disturbances. We investigate multi-input multi-output unknown nonaffine nonlinear DT systems and employ two neural networks (NNs). Using the implicit function theorem, an action NN generates the control signal and is also designed to cancel the nonlinearity of the unknown DT systems so that feedback linearization methods can be applied. A critic NN estimates the cost function, which satisfies the recursive equations derived from heuristic dynamic programming. The weights of both the action NN and the critic NN are updated directly online rather than through offline training. Using Lyapunov's direct method, the closed-loop tracking errors and the NN weight estimates are shown to be uniformly ultimately bounded. Two numerical examples demonstrate the effectiveness of the approach.

  13. Mild Reinforcement Learning Deficits in Patients With First-Episode Psychosis.

    PubMed

    Chang, Wing Chung; Waltz, James A; Gold, James M; Chan, Tracey Chi Wan; Chen, Eric Yu Hai

    2016-11-01

    Numerous studies have identified reinforcement learning (RL) deficits in schizophrenia. Most have focused on chronic patients with longstanding antipsychotic treatment, however, and studies of RL in early-illness patients have produced mixed results, particularly regarding gradual/procedural learning. No study has directly contrasted both rapid and gradual RL in first-episode psychosis (FEP) samples. We examined probabilistic RL in 34 FEP patients and 36 controls, using Go/NoGo (GNG) and Gain vs Loss-Avoidance (GLA) paradigms. Our results were mixed, with FEP patients exhibiting greater impairment in the ability to use positive, as opposed to negative, feedback to drive rapid RL on the GLA, but not the GNG. By contrast, patients and controls showed similar improvement across the acquisition. Finally, we found no significant between-group differences in the postacquisition expression of value-based preference in both tasks. Negative symptoms were modestly associated with RL measures, while the overall bias to engage in Go-responding correlated significantly with psychosis severity in FEP patients, consistent with striatal hyperdopaminergia. Taken together, FEP patients demonstrated more circumscribed RL impairments than previous studies have documented in chronic samples, possibly reflecting differential symptom profiles between first-episode and chronic samples. Our finding of relatively preserved gradual/procedural RL, in briefly medicated FEP patients, might suggest spared or restored basal ganglia function. Our findings of preserved abilities to use representations of expected value to guide decision making, and our mixed results regarding rapid RL, may reflect a lesser degree of prefrontal cortical functional impairment in FEP than in chronic samples. Further longitudinal research, in larger samples, is required.

  14. Does overall reinforcer rate affect discrimination of time-based contingencies?

    PubMed

    Cowie, Sarah; Davison, Michael; Blumhardt, Luca; Elliffe, Douglas

    2016-05-01

    Overall reinforcer rate appears to affect choice. The mechanism for such an effect is uncertain, but may relate to reinforcer rate changing the discrimination of the relation between stimuli and reinforcers. We assessed whether a quantitative model based on a stimulus-control approach could be used to account for the effects of overall reinforcer rate on choice under changing time-based contingencies. On a two-key concurrent schedule, the likely availability of a reinforcer reversed when a fixed time had elapsed since the last reinforcer, and the overall reinforcer rate was varied across conditions. Changes in the overall reinforcer rate produced a change in response bias, and some indication of a change in discrimination. These changes in bias and discrimination always occurred quickly, usually within the first session of a condition. The stimulus-control approach provided an excellent account of the data, suggesting that changes in overall reinforcer rate affect choice because they alter the frequency of reinforcers obtained at different times, or in different stimulus contexts, and thus change the discriminated relation between stimuli and reinforcers. These findings support the notion that temporal and spatial discriminations can be understood in terms of discrimination of reinforcers across time and space.

  15. Striatum-medial prefrontal cortex connectivity predicts developmental changes in reinforcement learning.

    PubMed

    van den Bos, Wouter; Cohen, Michael X; Kahnt, Thorsten; Crone, Eveline A

    2012-06-01

    During development, children improve in learning from feedback to adapt their behavior. However, it is still unclear which neural mechanisms might underlie these developmental changes. In the current study, we used a reinforcement learning model to investigate neurodevelopmental changes in the representation and processing of learning signals. Sixty-seven healthy volunteers between ages 8 and 22 (children: 8-11 years, adolescents: 13-16 years, and adults: 18-22 years) performed a probabilistic learning task while in a magnetic resonance imaging scanner. The behavioral data demonstrated age differences in learning parameters with a stronger impact of negative feedback on expected value in children. Imaging data revealed that the neural representation of prediction errors was similar across age groups, but functional connectivity between the ventral striatum and the medial prefrontal cortex changed as a function of age. Furthermore, the connectivity strength predicted the tendency to alter expectations after receiving negative feedback. These findings suggest that the underlying mechanisms of developmental changes in learning are not related to differences in the neural representation of learning signals per se but rather in how learning signals are used to guide behavior and expectations.
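
    A minimal sketch of an asymmetric-learning-rate update of the kind such models use: a larger learning rate for negative prediction errors reproduces a stronger impact of negative feedback on expected value. Parameter values and the feedback sequence are illustrative only.

      def update_value(value, reward, alpha_pos=0.2, alpha_neg=0.5):
          delta = reward - value                       # prediction error
          alpha = alpha_pos if delta >= 0 else alpha_neg
          return value + alpha * delta

      v = 0.5
      for outcome in [1, 0, 1, 0, 0]:                  # invented feedback sequence
          v = update_value(v, outcome)
          print(round(v, 3))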

  16. Voucher-based reinforcement of cocaine abstinence in treatment-resistant methadone patients: effects of reinforcement magnitude.

    PubMed

    Silverman, K; Chutuape, M A; Bigelow, G E; Stitzer, M L

    1999-09-01

    Voucher-based reinforcement of cocaine abstinence has been one of the most effective means of treating cocaine abuse in methadone patients, but it has not been effective in all patients. This study was designed to determine if we could promote cocaine abstinence in a population of treatment-resistant cocaine abusing methadone patients by increasing the magnitude of voucher-based abstinence reinforcement. Participants were 29 methadone patients who previously failed to achieve sustained cocaine abstinence when exposed to an intervention in which they could earn up to $1155 in vouchers (exchangeable for goods/services) for providing cocaine-free urines. Each patient was exposed in counterbalanced order to three 9-week voucher conditions that varied in magnitude of voucher reinforcement. Patients were exposed to a zero, low and high magnitude condition in which they could earn up to $0, $382, or $3480 in vouchers for providing cocaine-free urines. Analyses for 22 patients exposed to all three conditions showed that increasing voucher magnitude significantly increased patients' longest duration of sustained cocaine abstinence (P<0.001) and percent of cocaine-free urines (P<0.001), and significantly decreased patients' reports of cocaine injections (P=0.024). Almost half (45%) of the patients in the high magnitude condition achieved >/=4 weeks of sustained cocaine abstinence, whereas only one patient in the low and none in the zero magnitude condition achieved more than 2 weeks. Reinforcement magnitude was a critical determinant of the effectiveness of this abstinence reinforcement intervention.

  17. Increasing opiate abstinence through voucher-based reinforcement therapy.

    PubMed

    Silverman, K; Wong, C J; Higgins, S T; Brooner, R K; Montoya, I D; Contoreggi, C; Umbricht-Schneiter, A; Schuster, C R; Preston, K L

    1996-06-01

    Heroin dependence remains a serious and costly public health problem, even in patients receiving methadone maintenance treatment. This study used a within-subject reversal design to assess the effectiveness of voucher-based abstinence reinforcement in reducing opiate use in patients receiving methadone maintenance treatment in an inner-city program. Throughout the study, subjects received standard methadone maintenance treatment involving methadone, counseling, and urine monitoring (three times per week). Thirteen patients who continued to use opiates regularly during a 5-week baseline period were exposed to a 12-week program in which they received a voucher for each opiate-free urine sample provided: the vouchers had monetary values that increased as the number of consecutive opiate-free urines increased. Subjects continued receiving standard methadone maintenance for 8 weeks after discontinuation of the voucher program (return-to-baseline). Tukey's post hoc contrasts showed that the percentage of urine specimens positive for opiates decreased significantly when the voucher program was instituted (P ≤ 0.01) and then increased significantly when the voucher program was discontinued during the return-to-baseline condition (P ≤ 0.01). Rates of opiate-positive urines in the return-to-baseline condition remained significantly below the rates observed in the initial baseline period (P ≤ 0.01). Overall, the study shows that voucher-based reinforcement contingencies can decrease opiate use in heroin-dependent patients receiving methadone maintenance treatment.

  18. Simulating the Effect of Reinforcement Learning on Neuronal Synchrony and Periodicity in the Striatum

    PubMed Central

    Hélie, Sébastien; Fleischer, Pierson J.

    2016-01-01

    The study of rhythms and oscillations in the brain is gaining attention. While it is unclear exactly what the role of oscillation, synchrony, and rhythm is, it appears increasingly likely that synchrony is related to normal and abnormal brain states and possibly cognition. In this article, we explore the relationship between basal ganglia (BG) synchrony and reinforcement learning. We simulate a biologically realistic model of the striatum initially proposed by Ponzi and Wickens (2010) and enhance the model by adding plastic cortico-BG synapses that can be modified using reinforcement learning. The effect of reinforcement learning on striatal rhythmic activity is then explored, and disrupted using simulated deep brain stimulation (DBS). The stimulator injects current in the brain structure to which it is attached, which affects neuronal synchrony. The results show that training the model without DBS yields high accuracy in the learning task and reduces the number of active neurons in the striatum, along with increased firing periodicity and decreased firing synchrony between neurons in the same assembly. In addition, a spectral decomposition shows a stronger signal for correct trials than incorrect trials in high frequency bands. If the DBS is ON during the training phase, but not the test phase, the amount of learning in the model is reduced, along with firing periodicity. Similar to when the DBS is OFF, spectral decomposition shows a stronger signal for correct trials than for incorrect trials in high frequency domains, but this phenomenon happens in higher frequency bands than when the DBS is OFF. Synchrony between the neurons is not affected. Finally, the results show that turning the DBS ON at test increases both firing periodicity and striatal synchrony, and spectral decomposition of the signal shows that neural activity synchronizes with the DBS fundamental frequency (and its harmonics). Turning the DBS ON during the test phase results in chance

  19. Simulating the Effect of Reinforcement Learning on Neuronal Synchrony and Periodicity in the Striatum.

    PubMed

    Hélie, Sébastien; Fleischer, Pierson J

    2016-01-01

    The study of rhythms and oscillations in the brain is gaining attention. While it is unclear exactly what the role of oscillation, synchrony, and rhythm is, it appears increasingly likely that synchrony is related to normal and abnormal brain states and possibly cognition. In this article, we explore the relationship between basal ganglia (BG) synchrony and reinforcement learning. We simulate a biologically realistic model of the striatum initially proposed by Ponzi and Wickens (2010) and enhance the model by adding plastic cortico-BG synapses that can be modified using reinforcement learning. The effect of reinforcement learning on striatal rhythmic activity is then explored, and disrupted using simulated deep brain stimulation (DBS). The stimulator injects current in the brain structure to which it is attached, which affects neuronal synchrony. The results show that training the model without DBS yields high accuracy in the learning task and reduces the number of active neurons in the striatum, along with increased firing periodicity and decreased firing synchrony between neurons in the same assembly. In addition, a spectral decomposition shows a stronger signal for correct trials than incorrect trials in high frequency bands. If the DBS is ON during the training phase, but not the test phase, the amount of learning in the model is reduced, along with firing periodicity. Similar to when the DBS is OFF, spectral decomposition shows a stronger signal for correct trials than for incorrect trials in high frequency domains, but this phenomenon happens in higher frequency bands than when the DBS is OFF. Synchrony between the neurons is not affected. Finally, the results show that turning the DBS ON at test increases both firing periodicity and striatal synchrony, and spectral decomposition of the signal shows that neural activity synchronizes with the DBS fundamental frequency (and its harmonics). Turning the DBS ON during the test phase results in chance

  20. Learning, using natural reinforcements, in insect preparations that permit cellular neuronal analysis.

    PubMed

    Hoyle, G

    1980-07-01

    A general paradigm is described that permits testing the ability of an arthropod to learn (by operant conditioning) to alter the position of a single leg segment in order to relate to behaviorally appropriate reinforcement. The paradigm was designed so that intracellular recording from identified neurons involved would be possible during the training of a locust or grasshopper, for which extensive neuron maps are available. As a prelude to such studies, electromyograms were made from the antagonistic muscles that move the conditioned limb, which in the present experiments was the tibia of the metathoracic leg. Negative (aversive) reinforcement was provided by a loud sound/vibration and positive (reward) reinforcement by food in the form of sugar-water or fresh-growing grass. In the aversive reinforcement experiments, the sound, which reflexly caused flexion, was on continually except when the tibia of one hind leg was voluntarily placed in an electronically set position "window" displaced, in extension, away from the preferred position. In feeding experiments, food was brought automatically to the mouth by a motor-driven arm when the tibia was held within a position window set away from the preferred position in either extension or flexion. Whole or headless insects learned to turn off the sound permanently, except for sporadic brief interruptions, by tonic shifting of tibial position. Insects learned to bring food to the mouth by modifying the plateau phase of a position displacement lasting for a few minutes, which was also found to occur from time to time in controls. In aversive learning, minimum times to turn off the sound were 22 sec for the easiest position and 4 min for the most difficult. The longest time in the easiest position was 1 min 40 sec and in the most difficult 39 min, excluding measurements from individuals that did not learn. In reward learning, the minimum time in the easiest position was just under 1 min, and 12 min in the most difficult position