Sample records for reinforcement learning control

  1. Effective reinforcement learning following cerebellar damage requires a balance between exploration and motor noise.

    PubMed

    Therrien, Amanda S; Wolpert, Daniel M; Bastian, Amy J

    2016-01-01

    Reinforcement and error-based processes are essential for motor learning, with the cerebellum thought to be required only for the error-based mechanism. Here we examined learning and retention of a reaching skill under both processes. Control subjects learned similarly from reinforcement and error-based feedback, but showed much better retention under reinforcement. To apply reinforcement to cerebellar patients, we developed a closed-loop reinforcement schedule in which task difficulty was controlled based on recent performance. This schedule produced substantial learning in cerebellar patients and controls. Cerebellar patients varied in their learning under reinforcement but fully retained what was learned. In contrast, they showed complete lack of retention in error-based learning. We developed a mechanistic model of the reinforcement task and found that learning depended on a balance between exploration variability and motor noise. While the cerebellar and control groups had similar exploration variability, the patients had greater motor noise and hence learned less. Our results suggest that cerebellar damage indirectly impairs reinforcement learning by increasing motor noise, but does not interfere with the reinforcement mechanism itself. Therefore, reinforcement can be used to learn and retain novel skills, but optimal reinforcement learning requires a balance between exploration variability and motor noise. © The Author (2015). Published by Oxford University Press on behalf of the Guarantors of Brain.
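
    The mechanistic model summarized above attributes learning to the balance between creditable exploration variability and uncreditable motor noise. The sketch below is a purely illustrative toy (not the authors' model or code): a learner nudges its mean reach angle toward exploration that was rewarded, and raising the motor-noise parameter degrades learning in the way the abstract describes. All names and parameter values are hypothetical.

```python
import random

def simulate_reach_learning(explore_sd=2.0, motor_sd=1.0, target=4.0,
                            tolerance=2.0, trials=500, lr=0.5, seed=0):
    """Toy reinforcement-based reach learning (hypothetical parameters).

    The learner keeps a mean reach angle. Each trial adds exploration noise
    (which the learner can credit) and motor noise (which it cannot). A binary
    reward is given when the executed reach lands within `tolerance` of the
    hidden target; only rewarded exploration shifts the mean.
    """
    rng = random.Random(seed)
    mean_angle = 0.0
    rewarded = 0
    for _ in range(trials):
        exploration = rng.gauss(0.0, explore_sd)   # creditable variability
        motor_noise = rng.gauss(0.0, motor_sd)     # uncontrolled variability
        executed = mean_angle + exploration + motor_noise
        if abs(executed - target) < tolerance:
            # Credit only the exploration component; the learner cannot
            # observe how much of the success was due to motor noise.
            mean_angle += lr * exploration
            rewarded += 1
    return round(mean_angle, 2), rewarded / trials

if __name__ == "__main__":
    # Higher motor noise (as modeled for the cerebellar group) should give
    # slower, noisier convergence than the low-noise "control" setting.
    print("low motor noise :", simulate_reach_learning(motor_sd=1.0))
    print("high motor noise:", simulate_reach_learning(motor_sd=4.0))
```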

  2. Effective reinforcement learning following cerebellar damage requires a balance between exploration and motor noise

    PubMed Central

    Therrien, Amanda S.; Wolpert, Daniel M.

    2016-01-01

    Abstract See Miall and Galea (doi: 10.1093/awv343 ) for a scientific commentary on this article. Reinforcement and error-based processes are essential for motor learning, with the cerebellum thought to be required only for the error-based mechanism. Here we examined learning and retention of a reaching skill under both processes. Control subjects learned similarly from reinforcement and error-based feedback, but showed much better retention under reinforcement. To apply reinforcement to cerebellar patients, we developed a closed-loop reinforcement schedule in which task difficulty was controlled based on recent performance. This schedule produced substantial learning in cerebellar patients and controls. Cerebellar patients varied in their learning under reinforcement but fully retained what was learned. In contrast, they showed complete lack of retention in error-based learning. We developed a mechanistic model of the reinforcement task and found that learning depended on a balance between exploration variability and motor noise. While the cerebellar and control groups had similar exploration variability, the patients had greater motor noise and hence learned less. Our results suggest that cerebellar damage indirectly impairs reinforcement learning by increasing motor noise, but does not interfere with the reinforcement mechanism itself. Therefore, reinforcement can be used to learn and retain novel skills, but optimal reinforcement learning requires a balance between exploration variability and motor noise. PMID:26626368

  3. On the integration of reinforcement learning and approximate reasoning for control

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1991-01-01

    The author discusses the importance of strengthening the knowledge representation characteristic of reinforcement learning techniques using methods such as approximate reasoning. The ARIC (approximate reasoning-based intelligent control) architecture is an example of such a hybrid approach in which the fuzzy control rules are modified (fine-tuned) using reinforcement learning. ARIC also demonstrates that it is possible to start with an approximately correct control knowledge base and learn to refine this knowledge through further experience. On the other hand, techniques such as the TD (temporal difference) algorithm and Q-learning establish stronger theoretical foundations for their use in adaptive control and also in stability analysis of hybrid reinforcement learning and approximate reasoning-based controllers.

  4. GA-based fuzzy reinforcement learning for control of a magnetic bearing system.

    PubMed

    Lin, C T; Jou, C P

    2000-01-01

    This paper proposes a TD (temporal difference) and GA (genetic algorithm)-based reinforcement (TDGAR) learning method and applies it to the control of a real magnetic bearing system. The TDGAR learning scheme is a new hybrid GA, which integrates the TD prediction method and the GA to perform the reinforcement learning task. The TDGAR learning system is composed of two integrated feedforward networks. One neural network acts as a critic network to guide the learning of the other network (the action network) which determines the outputs (actions) of the TDGAR learning system. The action network can be a normal neural network or a neural fuzzy network. Using the TD prediction method, the critic network can predict the external reinforcement signal and provide a more informative internal reinforcement signal to the action network. The action network uses the GA to adapt itself according to the internal reinforcement signal. The key concept of the TDGAR learning scheme is to formulate the internal reinforcement signal as the fitness function for the GA such that the GA can evaluate the candidate solutions (chromosomes) regularly, even during periods without external feedback from the environment. This enables the GA to proceed to new generations regularly without waiting for the arrival of the external reinforcement signal. This can usually accelerate the GA learning since a reinforcement signal may only be available at a time long after a sequence of actions has occurred in the reinforcement learning problem. The proposed TDGAR learning system has been used to control an active magnetic bearing (AMB) system in practice. A systematic design procedure is developed to achieve successful integration of all the subsystems including magnetic suspension, mechanical structure, and controller training. The results show that the TDGAR learning scheme can successfully find a neural controller or a neural fuzzy controller for a self-designed magnetic bearing system.
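
    The distinctive idea in TDGAR is that the critic's internal reinforcement signal serves as the GA's fitness function, so candidate controllers can be ranked between deliveries of the (possibly delayed) external reward. The sketch below reduces that idea to a scalar toy: chromosomes encode a feedback gain for a first-order plant, a fixed hand-written critic predicts eventual success from the state, and a plain select-and-mutate GA evolves the gain. It is an assumption-laden illustration, not the paper's neural-network implementation or its magnetic bearing plant.

```python
import random

def critic_value(x):
    """Toy critic: predicted external reinforcement (chance of ending near
    the set point) as a function of the current state."""
    return max(0.0, 1.0 - abs(x))

def fitness(gain, steps=30, rng=None):
    """Internal-reinforcement fitness of a chromosome (a scalar feedback gain).

    The external reward would only arrive at the end of the rollout, but the
    critic's prediction lets the GA rank candidates at every generation.
    """
    rng = rng or random.Random(0)
    x, total = 1.0, 0.0
    for _ in range(steps):
        x = x - gain * x + rng.gauss(0.0, 0.01)  # toy first-order plant
        total += critic_value(x)                 # internal reinforcement
    return total / steps

def ga_generation(population, rng, keep=5, sigma=0.05):
    """One GA generation: rank by fitness, keep the best, refill by mutation."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:keep]
    children = [min(1.0, max(0.0, rng.choice(survivors) + rng.gauss(0.0, sigma)))
                for _ in range(len(population) - keep)]
    return survivors + children

if __name__ == "__main__":
    rng = random.Random(1)
    population = [rng.random() for _ in range(20)]
    for _ in range(25):
        population = ga_generation(population, rng)
    print("best gain:", round(max(population, key=fitness), 3))
```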

  5. Negative reinforcement learning is affected in substance dependence.

    PubMed

    Thompson, Laetitia L; Claus, Eric D; Mikulich-Gilbertson, Susan K; Banich, Marie T; Crowley, Thomas; Krmpotich, Theodore; Miller, David; Tanabe, Jody

    2012-06-01

    Negative reinforcement results in behavior to escape or avoid an aversive outcome. Withdrawal symptoms are purported to be negative reinforcers in perpetuating substance dependence, but little is known about negative reinforcement learning in this population. The purpose of this study was to examine reinforcement learning in substance dependent individuals (SDI), with an emphasis on assessing negative reinforcement learning. We modified the Iowa Gambling Task to separately assess positive and negative reinforcement. We hypothesized that SDI would show differences in negative reinforcement learning compared to controls and we investigated whether learning differed as a function of the relative magnitude or frequency of the reinforcer. Thirty subjects dependent on psychostimulants were compared with 28 community controls on a decision making task that manipulated outcome frequencies and magnitudes and required an action to avoid a negative outcome. SDI did not learn to avoid negative outcomes to the same degree as controls. This difference was driven by the magnitude, not the frequency, of negative feedback. In contrast, approach behaviors in response to positive reinforcement were similar in both groups. Our findings are consistent with a specific deficit in negative reinforcement learning in SDI. SDI were relatively insensitive to the magnitude, not frequency, of loss. If this generalizes to drug-related stimuli, it suggests that repeated episodes of withdrawal may drive relapse more than the severity of a single episode. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.

  6. Network congestion control algorithm based on Actor-Critic reinforcement learning model

    NASA Astrophysics Data System (ADS)

    Xu, Tao; Gong, Lina; Zhang, Wei; Li, Xuhong; Wang, Xia; Pan, Wenwen

    2018-04-01

    Aiming at the network congestion control problem, a congestion control algorithm based on the Actor-Critic reinforcement learning model is designed. By incorporating a genetic algorithm into the congestion control strategy, network congestion can be detected and prevented more effectively. A simulation experiment for the network congestion control algorithm is designed according to Actor-Critic reinforcement learning. The simulation experiments verify that the AQM controller can predict the dynamic characteristics of the network system. Moreover, the learning strategy is adopted to optimize network performance, and the packet dropping probability is adjusted adaptively so as to improve network performance and avoid congestion. Based on the above findings, it is concluded that the network congestion control algorithm based on the Actor-Critic reinforcement learning model can effectively avoid the occurrence of TCP network congestion.
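
    As a rough illustration of what an actor-critic AQM controller does (learn a queue-length-dependent packet-drop probability from a congestion-related reward), the toy below uses a tabular TD(0) critic and a CACLA-style actor update. The queue dynamics, reward shaping, and the omission of the genetic-algorithm component are simplifications of ours, not details taken from the paper.

```python
import math
import random

def actor_critic_aqm(episodes=500, steps=200, target=50, alpha=0.05,
                     beta=0.5, gamma=0.95, explore=0.1, seed=0):
    """Toy actor-critic AQM sketch (invented dynamics, not the paper's code).

    State  : queue length 0..100 (tabular)
    Action : packet-drop probability = sigmoid(theta[state]) + exploration
    Reward : negative distance of the next queue length from its target
    Critic : tabular TD(0); the actor mean is pulled toward explored actions
             that produced a positive TD error (a CACLA-style update).
    """
    rng = random.Random(seed)
    value = [0.0] * 101
    theta = [0.0] * 101
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    for _ in range(episodes):
        q = rng.randint(0, 100)
        for _ in range(steps):
            mean_a = sig(theta[q])
            a = min(1.0, max(0.0, mean_a + rng.gauss(0.0, explore)))
            offered = rng.randint(8, 20)                    # offered load
            arrivals = sum(rng.random() > a for _ in range(offered))
            q_next = max(0, min(100, q + arrivals - 10))    # service rate of 10
            reward = -abs(q_next - target) / 100.0
            td = reward + gamma * value[q_next] - value[q]
            value[q] += alpha * td
            if td > 0:
                theta[q] += beta * (a - mean_a)
            q = q_next
    return [sig(t) for t in theta]

if __name__ == "__main__":
    policy = actor_critic_aqm()
    print("drop probability at queue 20:", round(policy[20], 3))
    print("drop probability at queue 90:", round(policy[90], 3))
```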

  7. Longitudinal investigation on learned helplessness tested under negative and positive reinforcement involving stimulus control.

    PubMed

    Oliveira, Emileane C; Hunziker, Maria Helena

    2014-07-01

    In this study, we investigated whether (a) animals demonstrating the learned helplessness effect during an escape contingency also show learning deficits under positive reinforcement contingencies involving stimulus control and (b) exposure to positive reinforcement contingencies eliminates the learned helplessness effect under an escape contingency. Rats were initially exposed to controllable (C), uncontrollable (U) or no (N) shocks. After 24 h, they were exposed to 60 escapable shocks delivered in a shuttlebox. In the following phase, we selected from each group the four subjects that presented the most typical group pattern: no escape learning (learned helplessness effect) in Group U and escape learning in Groups C and N. All subjects were then exposed to two phases: (1) positive reinforcement for lever pressing under a multiple FR/Extinction schedule and (2) a re-test under negative reinforcement (escape). A fourth group (n=4) was exposed only to the positive reinforcement sessions. All subjects showed discrimination learning under the multiple schedule. In the escape re-test, the learned helplessness effect was maintained for three of the animals in Group U. These results suggest that the learned helplessness effect did not extend to discriminative behavior that is positively reinforced and that the learned helplessness effect did not revert for most subjects after exposure to positive reinforcement. We discuss some theoretical implications related to learned helplessness as an effect restricted to aversive contingencies and to the absence of reversion after positive reinforcement. Copyright © 2014. Published by Elsevier B.V.

  8. A Robust Cooperated Control Method with Reinforcement Learning and Adaptive H∞ Control

    NASA Astrophysics Data System (ADS)

    Obayashi, Masanao; Uchiyama, Shogo; Kuremoto, Takashi; Kobayashi, Kunikazu

    This study proposes a robust cooperated control method that combines reinforcement learning with robust control. A notable characteristic of reinforcement learning is that it does not require a model of the system; however, it does not guarantee system stability. Robust control, on the other hand, guarantees stability and robustness but requires a system model. We employ both the actor-critic method, a reinforcement learning approach that handles continuous-valued actions with a minimal amount of computation, and traditional robust control, namely H∞ control. The proposed method was compared with a conventional control method (the actor-critic method alone) through computer simulation of controlling the angle and position of a crane system, and the simulation results showed the effectiveness of the proposed method.

  9. Model-based reinforcement learning with dimension reduction.

    PubMed

    Tangkaratt, Voot; Morimoto, Jun; Sugiyama, Masashi

    2016-12-01

    The goal of reinforcement learning is to learn an optimal policy which controls an agent to acquire the maximum cumulative reward. The model-based reinforcement learning approach learns a transition model of the environment from data, and then derives the optimal policy using the transition model. However, learning an accurate transition model in high-dimensional environments requires a large amount of data which is difficult to obtain. To overcome this difficulty, in this paper, we propose to combine model-based reinforcement learning with the recently developed least-squares conditional entropy (LSCE) method, which simultaneously performs transition model estimation and dimension reduction. We also further extend the proposed method to imitation learning scenarios. The experimental results show that policy search combined with LSCE performs well for high-dimensional control tasks including real humanoid robot control. Copyright © 2016 Elsevier Ltd. All rights reserved.
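
    The abstract describes the standard model-based recipe: estimate a transition model from data, then derive a policy by planning with that model. The tabular sketch below illustrates only that recipe on a made-up five-state chain; the paper's least-squares conditional entropy (LSCE) dimension-reduction step and its continuous, high-dimensional setting are not reproduced.

```python
import random
from collections import defaultdict

def model_based_rl(n_states=5, n_actions=2, samples=3000, gamma=0.9, seed=0):
    """Minimal tabular model-based RL sketch (illustrative only).

    1) Collect random-policy transitions from the environment.
    2) Estimate a transition model P(s'|s,a) and mean rewards from the data.
    3) Derive a policy from the learned model by value iteration.
    """
    rng = random.Random(seed)

    def env_step(s, a):
        # Hidden ground-truth dynamics: action 0 drifts left, action 1 right;
        # reaching the last state pays 1.
        s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)
                            + (1 if rng.random() < 0.1 else 0)))
        return s_next, 1.0 if s_next == n_states - 1 else 0.0

    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    for _ in range(samples):                       # 1) data collection
        s, a = rng.randrange(n_states), rng.randrange(n_actions)
        s_next, r = env_step(s, a)
        counts[(s, a)][s_next] += 1
        rewards[(s, a)].append(r)

    def model(s, a):                               # 2) learned model
        total = sum(counts[(s, a)].values()) or 1
        p = {sn: c / total for sn, c in counts[(s, a)].items()}
        r = sum(rewards[(s, a)]) / max(1, len(rewards[(s, a)]))
        return p, r

    v = [0.0] * n_states                           # 3) plan with the model
    for _ in range(100):
        for s in range(n_states):
            q = []
            for a in range(n_actions):
                p, r = model(s, a)
                q.append(r + gamma * sum(pr * v[sn] for sn, pr in p.items()))
            v[s] = max(q)
    policy = []
    for s in range(n_states):
        qs = []
        for a in range(n_actions):
            p, r = model(s, a)
            qs.append(r + gamma * sum(pr * v[sn] for sn, pr in p.items()))
        policy.append(qs.index(max(qs)))
    return policy

if __name__ == "__main__":
    print("greedy policy per state:", model_based_rl())   # expect mostly action 1
```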

  10. A parameter control method in reinforcement learning to rapidly follow unexpected environmental changes.

    PubMed

    Murakoshi, Kazushi; Mizuno, Junya

    2004-11-01

    In order to rapidly follow unexpected environmental changes, we propose a parameter control method for reinforcement learning that changes each learning parameter in an appropriate direction. We determine each appropriate direction on the basis of relationships between behaviors and neuromodulators, taking emergency as the key concept. Computer experiments show that agents using the proposed method can respond rapidly to unexpected environmental changes, regardless of which of two reinforcement learning algorithms (Q-learning or the actor-critic (AC) architecture) or which of two learning problems (discontinuous or continuous state-action problems) is used.

  11. The curse of planning: dissecting multiple reinforcement-learning systems by taxing the central executive.

    PubMed

    Otto, A Ross; Gershman, Samuel J; Markman, Arthur B; Daw, Nathaniel D

    2013-05-01

    A number of accounts of human and animal behavior posit the operation of parallel and competing valuation systems in the control of choice behavior. In these accounts, a flexible but computationally expensive model-based reinforcement-learning system has been contrasted with a less flexible but more efficient model-free reinforcement-learning system. The factors governing which system controls behavior, and under what circumstances, are still unclear. Following the hypothesis that model-based reinforcement learning requires cognitive resources, we demonstrated that having human decision makers perform a demanding secondary task engenders increased reliance on a model-free reinforcement-learning strategy. Further, we showed that, across trials, people negotiate the trade-off between the two systems dynamically as a function of concurrent executive-function demands, and people's choice latencies reflect the computational expenses of the strategy they employ. These results demonstrate that competition between multiple learning systems can be controlled on a trial-by-trial basis by modulating the availability of cognitive resources.
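
    Studies of this kind typically model choice as a weighted mixture of model-based and model-free action values, with the weight on the model-based component shrinking under cognitive load. The snippet below illustrates that mixture for a two-step task in the style of Daw and colleagues; it is a schematic example with invented numbers, not the authors' fitted model.

```python
def hybrid_first_stage_values(q_stage2, q_mf_stage1, w, p_common=0.7):
    """Mixture of model-based and model-free values for a two-step task.

    q_stage2    : dict {'left': [...], 'right': [...]} of second-stage values
    q_mf_stage1 : [q_left, q_right] cached (model-free) first-stage values
    w           : weight on the model-based component (0 = pure habit);
                  a concurrent load manipulation is modeled as lowering w
    p_common    : probability that a first-stage action leads to its
                  "common" second-stage state
    """
    best = {s: max(vals) for s, vals in q_stage2.items()}
    # Model-based values: expected best second-stage value under the known
    # transition structure ('left' commonly leads to the left state, etc.).
    q_mb = [p_common * best['left'] + (1 - p_common) * best['right'],
            p_common * best['right'] + (1 - p_common) * best['left']]
    return [w * mb + (1 - w) * mf for mb, mf in zip(q_mb, q_mf_stage1)]

if __name__ == "__main__":
    q2 = {'left': [0.8, 0.3], 'right': [0.2, 0.4]}
    q_mf = [0.35, 0.5]   # cached values favour 'right' despite poorer outcomes
    print("no load    (w=0.8):", hybrid_first_stage_values(q2, q_mf, w=0.8))
    print("under load (w=0.2):", hybrid_first_stage_values(q2, q_mf, w=0.2))
```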

  12. The Curse of Planning: Dissecting multiple reinforcement learning systems by taxing the central executive

    PubMed Central

    Otto, A. Ross; Gershman, Samuel J.; Markman, Arthur B.; Daw, Nathaniel D.

    2013-01-01

    A number of accounts of human and animal behavior posit the operation of parallel and competing valuation systems in the control of choice behavior. Along these lines, a flexible but computationally expensive model-based reinforcement learning system has been contrasted with a less flexible but more efficient model-free reinforcement learning system. The factors governing which system controls behavior—and under what circumstances—are still unclear. Based on the hypothesis that model-based reinforcement learning requires cognitive resources, we demonstrate that having human decision-makers perform a demanding secondary task engenders increased reliance on a model-free reinforcement learning strategy. Further, we show that across trials, people negotiate this tradeoff dynamically as a function of concurrent executive function demands and their choice latencies reflect the computational expenses of the strategy employed. These results demonstrate that competition between multiple learning systems can be controlled on a trial-by-trial basis by modulating the availability of cognitive resources. PMID:23558545

  13. Learning and tuning fuzzy logic controllers through reinforcements.

    PubMed

    Berenji, H R; Khedkar, P

    1992-01-01

    A method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system is presented. It is shown that the generalized approximate-reasoning-based intelligent control (GARIC) architecture: learns and tunes a fuzzy logic controller even when only weak reinforcement, such as a binary failure signal, is available; introduces a new conjunction operator for computing the rule strengths of fuzzy control rules; introduces a new localized mean of maximum (LMOM) method for combining the conclusions of several firing control rules; and learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward network, which can then adaptively improve performance by using gradient descent methods. The GARIC architecture is applied to a cart-pole balancing system and demonstrates significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

  14. Machine Learning Control For Highly Reconfigurable High-Order Systems

    DTIC Science & Technology

    2015-01-02

    develop and flight test a Reinforcement Learning based approach for autonomous tracking of ground targets using a fixed wing Unmanned... Reinforcement Learning-based algorithms are developed for learning agents' time-dependent dynamics while also learning to control them. Three algorithms... to a wide range of engineering-based problems. Implementation of these solutions, however, is often complicated by the hysteretic, non-linear, ...

  15. Learning and tuning fuzzy logic controllers through reinforcements

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1992-01-01

    A new method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system is presented. In particular, our Generalized Approximate Reasoning-based Intelligent Control (GARIC) architecture: (1) learns and tunes a fuzzy logic controller even when only weak reinforcements, such as a binary failure signal, is available; (2) introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; (3) introduces a new localized mean of maximum (LMOM) method in combining the conclusions of several firing control rules; and (4) learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward network, which can then adaptively improve performance by using gradient descent methods. We extend the AHC algorithm of Barto, Sutton, and Anderson to include the prior control knowledge of human operators. The GARIC architecture is applied to a cart-pole balancing system and has demonstrated significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

  16. Autonomous reinforcement learning with experience replay.

    PubMed

    Wawrzyński, Paweł; Tanwani, Ajay Kumar

    2013-05-01

    This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy using previously collected samples and autonomously estimates the appropriate step sizes for the learning updates. The algorithm is based on the actor-critic with experience replay, whose step sizes are determined on-line by an enhanced fixed-point algorithm for on-line neural network training. An experimental study with a simulated octopus arm and half-cheetah demonstrates the feasibility of the proposed algorithm to solve difficult learning control problems in an autonomous way within a reasonably short time. Copyright © 2012 Elsevier Ltd. All rights reserved.
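
    The core mechanism, replaying stored transitions to update an actor-critic pair many times per environment step, can be sketched in tabular form. The toy below omits the paper's neural-network function approximation, automatic step-size estimation, and any importance-sampling correction for off-policy replay; it only shows the buffer-and-replay structure on a five-state chain.

```python
import math
import random

def replay_actor_critic(episodes=300, buffer_size=2000, batch=32,
                        gamma=0.95, alpha=0.1, beta=0.05, seed=0):
    """Actor-critic with experience replay on a toy 5-state chain
    (illustrative only; tabular values stand in for neural networks)."""
    rng = random.Random(seed)
    n_states, n_actions = 5, 2
    v = [0.0] * n_states                                   # critic
    pref = [[0.0] * n_actions for _ in range(n_states)]    # actor preferences
    buffer = []

    def pick_action(s):
        m = max(pref[s])
        weights = [math.exp(p - m) for p in pref[s]]       # softmax policy
        r, acc = rng.random() * sum(weights), 0.0
        for a, w in enumerate(weights):
            acc += w
            if r <= acc:
                return a
        return n_actions - 1

    def step(s, a):
        s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
        return s_next, (1.0 if s_next == n_states - 1 else 0.0)

    for _ in range(episodes):
        s = 0
        for _ in range(20):
            a = pick_action(s)
            s_next, r = step(s, a)
            buffer.append((s, a, r, s_next))
            if len(buffer) > buffer_size:
                buffer.pop(0)
            # Replay a minibatch of stored transitions on every real step.
            for bs, ba, br, bs2 in rng.sample(buffer, min(batch, len(buffer))):
                td = br + gamma * v[bs2] - v[bs]
                v[bs] += alpha * td
                pref[bs][ba] += beta * td   # reinforce actions with positive TD error
            s = s_next
    return pref

if __name__ == "__main__":
    prefs = replay_actor_critic()
    print([[round(p, 2) for p in row] for row in prefs])
```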

  17. Instructional control of reinforcement learning: A behavioral and neurocomputational investigation

    PubMed Central

    Doll, Bradley B.; Jacobs, W. Jake; Sanfey, Alan G.; Frank, Michael J.

    2011-01-01

    Humans learn how to behave directly through environmental experience and indirectly through rules and instructions. Behavior analytic research has shown that instructions can control behavior, even when such behavior leads to sub-optimal outcomes (Hayes, S. (Ed.). 1989. Rule-governed behavior: cognition, contingencies, and instructional control. Plenum Press.). Here we examine the control of behavior through instructions in a reinforcement learning task known to depend on striatal dopaminergic function. Participants selected between probabilistically reinforced stimuli, and were (incorrectly) told that a specific stimulus had the highest (or lowest) reinforcement probability. Despite experience to the contrary, instructions drove choice behavior. We present neural network simulations that capture the interactions between instruction-driven and reinforcement-driven behavior via two potential neural circuits: one in which the striatum is inaccurately trained by instruction representations coming from prefrontal cortex/hippocampus (PFC/HC), and another in which the striatum learns the environmentally based reinforcement contingencies, but is “overridden” at decision output. Both models capture the core behavioral phenomena but, because they differ fundamentally on what is learned, make distinct predictions for subsequent behavioral and neuroimaging experiments. Finally, we attempt to distinguish between the proposed computational mechanisms governing instructed behavior by fitting a series of abstract “Q-learning” and Bayesian models to subject data. The best-fitting model supports one of the neural models, suggesting the existence of a “confirmation bias” in which the PFC/HC system trains the reinforcement system by amplifying outcomes that are consistent with instructions while diminishing inconsistent outcomes. PMID:19595993
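
    One of the candidate mechanisms described above, a confirmation bias in which instruction-consistent outcomes are amplified and inconsistent ones diminished, can be expressed as an asymmetric learning rate in a simple Q-learner. The sketch below is a schematic of that idea with invented parameters, not the authors' fitted Q-learning or Bayesian models.

```python
import random

def instructed_q_learning(trials=400, p_reward=(0.3, 0.7), instructed=0,
                          alpha=0.2, bias=3.0, seed=0):
    """Toy 'confirmation bias' Q-learner.

    instructed : index of the option the subject was (incorrectly) told is best
    bias       : multiplier on the learning rate for instruction-consistent
                 outcomes; its inverse scales instruction-inconsistent ones
    """
    rng = random.Random(seed)
    q = [0.5, 0.5]
    choices = []
    for _ in range(trials):
        # Noisy greedy choice keeps the example short (no softmax needed).
        choice = max(range(2), key=lambda a: q[a] + rng.gauss(0.0, 0.1))
        reward = 1.0 if rng.random() < p_reward[choice] else 0.0
        lr = alpha
        if choice == instructed:
            consistent = (reward == 1.0)          # a win on the instructed option
            lr = alpha * (bias if consistent else 1.0 / bias)
        q[choice] += lr * (reward - q[choice])
        choices.append(choice)
    return q, sum(c == instructed for c in choices[-100:]) / 100.0

if __name__ == "__main__":
    q, late_rate = instructed_q_learning()
    print("final values:", [round(v, 2) for v in q])
    print("late-trial rate of choosing the instructed (worse) option:", late_rate)
```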

  18. "Notice of Violation of IEEE Publication Principles" Multiobjective Reinforcement Learning: A Comprehensive Overview.

    PubMed

    Liu, Chunming; Xu, Xin; Hu, Dewen

    2013-04-29

    Reinforcement learning is a powerful mechanism for enabling agents to learn in an unknown environment, and most reinforcement learning algorithms aim to maximize some numerical value, which represents only one long-term objective. However, multiple long-term objectives are exhibited in many real-world decision and control problems; therefore, recently, there has been growing interest in solving multiobjective reinforcement learning (MORL) problems with multiple conflicting objectives. The aim of this paper is to present a comprehensive overview of MORL. In this paper, the basic architecture, research topics, and naive solutions of MORL are introduced at first. Then, several representative MORL approaches and some important directions of recent research are reviewed. The relationships between MORL and other related research are also discussed, which include multiobjective optimization, hierarchical reinforcement learning, and multi-agent reinforcement learning. Finally, research challenges and open problems of MORL techniques are highlighted.

  19. A reinforcement learning-based architecture for fuzzy logic control

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1992-01-01

    This paper introduces a new method for learning to refine a rule-based fuzzy logic controller. A reinforcement learning technique is used in conjunction with a multilayer neural network model of a fuzzy controller. The approximate reasoning based intelligent control (ARIC) architecture proposed here learns by updating its prediction of the physical system's behavior and fine tunes a control knowledge base. Its theory is related to Sutton's temporal difference (TD) method. Because ARIC has the advantage of using the control knowledge of an experienced operator and fine tuning it through the process of learning, it learns faster than systems that train networks from scratch. The approach is applied to a cart-pole balancing system.

  20. Learning and tuning fuzzy logic controllers through reinforcements

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1992-01-01

    This paper presents a new method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system. In particular, our generalized approximate reasoning-based intelligent control (GARIC) architecture (1) learns and tunes a fuzzy logic controller even when only weak reinforcement, such as a binary failure signal, is available; (2) introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; (3) introduces a new localized mean of maximum (LMOM) method in combining the conclusions of several firing control rules; and (4) learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward neural network, which can then adaptively improve performance by using gradient descent methods. We extend the AHC algorithm of Barto et al. (1983) to include the prior control knowledge of human operators. The GARIC architecture is applied to a cart-pole balancing system and demonstrates significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

  1. Agent-based traffic management and reinforcement learning in congested intersection network.

    DOT National Transportation Integrated Search

    2012-08-01

    This study evaluates the performance of traffic control systems based on reinforcement learning (RL), also called approximate dynamic programming (ADP). Two algorithms have been selected for testing: 1) Q-learning and 2) approximate dynamic programmi...

  2. Towards autonomous neuroprosthetic control using Hebbian reinforcement learning.

    PubMed

    Mahmoudi, Babak; Pohlmeyer, Eric A; Prins, Noeline W; Geng, Shijia; Sanchez, Justin C

    2013-12-01

    Our goal was to design an adaptive neuroprosthetic controller that could learn the mapping from neural states to prosthetic actions and automatically adjust adaptation using only a binary evaluative feedback as a measure of desirability/undesirability of performance. Hebbian reinforcement learning (HRL) in a connectionist network was used for the design of the adaptive controller. The method combines the efficiency of supervised learning with the generality of reinforcement learning. The convergence properties of this approach were studied using both closed-loop control simulations and open-loop simulations that used primate neural data from robot-assisted reaching tasks. The HRL controller was able to perform classification and regression tasks using its episodic and sequential learning modes, respectively. In our experiments, the HRL controller quickly achieved convergence to an effective control policy, followed by robust performance. The controller also automatically stopped adapting the parameters after converging to a satisfactory control policy. Additionally, when the input neural vector was reorganized, the controller resumed adaptation to maintain performance. By estimating an evaluative feedback directly from the user, the HRL control algorithm may provide an efficient method for autonomous adaptation of neuroprosthetic systems. This method may enable the user to teach the controller the desired behavior using only a simple feedback signal.
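
    The essential ingredient of Hebbian reinforcement learning, a binary evaluative signal that gates and signs an otherwise Hebbian weight update between the active input pattern and the chosen action, can be shown in a few lines. The classifier toy below uses made-up "neural state" prototypes and omits the paper's connectionist network, episodic/sequential modes, and neuroprosthetic data; it is illustrative only.

```python
import random

def hebbian_rl_classifier(n_inputs=8, n_actions=3, trials=2000, eta=0.05, seed=0):
    """Sketch of Hebbian reinforcement learning: a binary evaluative signal
    (+1 desirable / -1 undesirable) gates a Hebbian update between the
    current input pattern and the chosen action."""
    rng = random.Random(seed)
    w = [[0.0] * n_inputs for _ in range(n_actions)]
    # Hypothetical prototypical "neural states", one per intended action.
    prototypes = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
                  for _ in range(n_actions)]
    correct = 0
    for t in range(trials):
        target = rng.randrange(n_actions)
        x = [p + rng.gauss(0.0, 0.2) for p in prototypes[target]]     # noisy state
        scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
        action = max(range(n_actions),
                     key=lambda a: scores[a] + rng.gauss(0.0, 0.05))  # tie-breaking noise
        feedback = 1.0 if action == target else -1.0                  # binary evaluation
        for i in range(n_inputs):                                     # gated Hebbian update
            w[action][i] += eta * feedback * x[i]
        if t >= trials - 200:
            correct += action == target
    return correct / 200.0

if __name__ == "__main__":
    print("accuracy over the last 200 trials:", hebbian_rl_classifier())
```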

  3. Effects of dopamine on reinforcement learning and consolidation in Parkinson's disease.

    PubMed

    Grogan, John P; Tsivos, Demitra; Smith, Laura; Knight, Brogan E; Bogacz, Rafal; Whone, Alan; Coulthard, Elizabeth J

    2017-07-10

    Emerging evidence suggests that dopamine may modulate learning and memory with important implications for understanding the neurobiology of memory and future therapeutic targeting. An influential hypothesis posits that dopamine biases reinforcement learning. More recent data also suggest an influence during both consolidation and retrieval. Eighteen Parkinson's disease patients learned through feedback ON or OFF medication, with memory tested 24 hr later ON or OFF medication (4 conditions, within-subjects design with matched healthy control group). Patients OFF medication during learning decreased in memory accuracy over the following 24 hr. In contrast to previous studies, however, dopaminergic medication during learning and testing did not affect expression of positive or negative reinforcement. Two further experiments were run without the 24 hr delay, but they too failed to reproduce effects of dopaminergic medication on reinforcement learning. While supportive of a dopaminergic role in consolidation, this study failed to replicate previous findings on reinforcement learning.

  4. A junction-tree based learning algorithm to optimize network wide traffic control: A coordinated multi-agent framework

    DOE PAGES

    Zhu, Feng; Aziz, H. M. Abdul; Qian, Xinwu; ...

    2015-01-31

    Our study develops a novel reinforcement learning algorithm for the challenging coordinated signal control problem. Traffic signals are modeled as intelligent agents interacting with the stochastic traffic environment. The model is built on the framework of coordinated reinforcement learning. The Junction Tree Algorithm (JTA) based reinforcement learning is proposed to obtain an exact inference of the best joint actions for all the coordinated intersections. Moreover, the algorithm is implemented and tested with a network containing 18 signalized intersections in VISSIM. Finally, our results show that the JTA based algorithm outperforms independent learning (Q-learning), real-time adaptive learning, and fixed timing plans in terms of average delay, number of stops, and vehicular emissions at the network level.

  5. Framework for robot skill learning using reinforcement learning

    NASA Astrophysics Data System (ADS)

    Wei, Yingzi; Zhao, Mingyang

    2003-09-01

    A robot acquiring a skill is a process similar to human skill learning. Reinforcement learning (RL) is an on-line actor-critic method by which a robot can develop its skill. The reinforcement function is the critical component because of its role in evaluating actions and guiding the learning process. We present an augmented reward function that provides a new way for the RL controller to incorporate prior knowledge and experience. The difference form of the augmented reward function is also considered carefully. The additional reward, beyond the conventional reward, provides more heuristic information for RL. In this paper, we present a strategy for the task of complex skill learning: an automatic robot shaping policy dissolves the complex skill into a hierarchical learning process. A new form of value function is introduced to attain smooth motion switching swiftly. We present a formal but practical framework for robot skill learning and illustrate, with an example, the utility of the method for learning skilled robot control on line.

  6. Using Fuzzy Logic for Performance Evaluation in Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.

    1992-01-01

    Current reinforcement learning algorithms require long training periods which generally limit their applicability to small size problems. A new architecture is described which uses fuzzy rules to initialize its two neural networks: a neural network for performance evaluation and another for action selection. This architecture is applied to control of dynamic systems and it is demonstrated that it is possible to start with an approximate prior knowledge and learn to refine it through experiments using reinforcement learning.

  7. Overcoming Learned Helplessness in Community College Students.

    ERIC Educational Resources Information Center

    Roueche, John E.; Mink, Oscar G.

    1982-01-01

    Reviews research on the effects of repeated experiences of helplessness and on locus of control. Identifies conditions necessary for overcoming learned helplessness; i.e., the potential for learning to occur; consistent reinforcement; relevant, valued reinforcers; and favorable psychological situation. Recommends eight ways for teachers to…

  8. Robust reinforcement learning.

    PubMed

    Morimoto, Jun; Doya, Kenji

    2005-02-01

    This letter proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both offline learning using simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, and often unwanted, results. Based on the theory of H(infinity) control, we consider a differential game in which a "disturbing" agent tries to make the worst possible disturbance while a "control" agent tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the amount of the reward and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call robust reinforcement learning (RRL), on the control task of an inverted pendulum. In the linear domain, the policy and the value function learned by online algorithms coincided with those derived analytically by the linear H(infinity) control theory. For a fully nonlinear swing-up task, RRL achieved robust performance with changes in the pendulum weight and friction, while a standard reinforcement learning algorithm could not deal with these changes. We also applied RRL to the cart-pole swing-up task, and a robust swing-up policy was acquired.
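
    The min-max formulation described above can be illustrated with a tabular minimax-Q toy: a controller maximizes the value, a disturber minimizes it, and the reward trades regulation error against the squared disturbance norm, in the spirit of the H-infinity criterion. The discretized one-dimensional plant, action sets, and constants below are all invented for illustration; the paper's continuous online algorithms and pendulum tasks are not reproduced.

```python
import random

def robust_rl_1d(episodes=2000, alpha=0.1, gamma=0.95, gamma_w=0.4, seed=0):
    """Minimax-Q sketch of robust RL on a discretized 1-D regulation task.

    Reward trades off regulation error against the disturbance norm:
        r = -(x^2 + u^2) + gamma_w^2 * w^2
    The controller maximizes the value, the disturber minimizes it, and the
    learned policy is the controller's best response to the worst disturbance.
    """
    rng = random.Random(seed)
    xs = list(range(-5, 6))                        # discretized states
    us = [-1.0, 0.0, 1.0]                          # control actions
    ws = [-0.5, 0.0, 0.5]                          # disturbances
    q = {(x, u, w): 0.0 for x in xs for u in us for w in ws}

    def clamp(x):
        return max(-5, min(5, x))

    for _ in range(episodes):
        x = rng.choice(xs)
        for _ in range(30):
            if rng.random() < 0.1:                 # exploration
                u, w = rng.choice(us), rng.choice(ws)
            else:
                u = max(us, key=lambda uu: min(q[(x, uu, ww)] for ww in ws))
                w = min(ws, key=lambda ww: q[(x, u, ww)])
            x_next = clamp(round(0.9 * x + u + w))
            r = -(x * x + u * u) + (gamma_w ** 2) * (w * w)
            # Minimax backup: next-state value is the max over controls of the
            # worst-case (min over disturbances) Q-value.
            v_next = max(min(q[(x_next, uu, ww)] for ww in ws) for uu in us)
            q[(x, u, w)] += alpha * (r + gamma * v_next - q[(x, u, w)])
            x = x_next
    return {x: max(us, key=lambda uu: min(q[(x, uu, ww)] for ww in ws)) for x in xs}

if __name__ == "__main__":
    # Expect positive control for negative states and vice versa.
    print(robust_rl_1d())
```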

  9. Application of fuzzy logic-neural network based reinforcement learning to proximity and docking operations: Translational controller results

    NASA Technical Reports Server (NTRS)

    Jani, Yashvant

    1992-01-01

    The reinforcement learning techniques developed at Ames Research Center are being applied to proximity and docking operations using the Shuttle and Solar Maximum Mission (SMM) satellite simulation. In utilizing these fuzzy learning techniques, we also use the Approximate Reasoning based Intelligent Control (ARIC) architecture, and so we use the two terms interchangeably to mean the same thing. This activity is carried out in the Software Technology Laboratory utilizing the Orbital Operations Simulator (OOS). This report is the deliverable D3 in our project activity and provides the test results of the fuzzy learning translational controller. This report is organized in six sections. Based on our experience and analysis with the attitude controller, we have modified the basic configuration of the reinforcement learning algorithm in ARIC as described in section 2. The shuttle translational controller and its implementation in the fuzzy learning architecture are described in section 3. Two test cases that we have performed are described in section 4. Our results and conclusions are discussed in section 5, and section 6 provides future plans and a summary for the project.

  10. Stress enhances model-free reinforcement learning only after negative outcome

    PubMed Central

    Lee, Daeyeol

    2017-01-01

    Previous studies found that stress shifts behavioral control by promoting habits while decreasing goal-directed behaviors during reward-based decision-making. It is, however, unclear how stress disrupts the relative contribution of the two systems controlling reward-seeking behavior, i.e. model-free (or habit) and model-based (or goal-directed). Here, we investigated whether stress biases the contribution of model-free and model-based reinforcement learning processes differently depending on the valence of outcome, and whether stress alters the learning rate, i.e., how quickly information from the new environment is incorporated into choices. Participants were randomly assigned to either a stress or a control condition, and performed a two-stage Markov decision-making task in which the reward probabilities underwent periodic reversals without notice. We found that stress increased the contribution of model-free reinforcement learning only after negative outcome. Furthermore, stress decreased the learning rate. The results suggest that stress diminishes one’s ability to make adaptive choices in multiple aspects of reinforcement learning. This finding has implications for understanding how stress facilitates maladaptive habits, such as addictive behavior, and other dysfunctional behaviors associated with stress in clinical and educational contexts. PMID:28723943

  11. Stress enhances model-free reinforcement learning only after negative outcome.

    PubMed

    Park, Heyeon; Lee, Daeyeol; Chey, Jeanyung

    2017-01-01

    Previous studies found that stress shifts behavioral control by promoting habits while decreasing goal-directed behaviors during reward-based decision-making. It is, however, unclear how stress disrupts the relative contribution of the two systems controlling reward-seeking behavior, i.e. model-free (or habit) and model-based (or goal-directed). Here, we investigated whether stress biases the contribution of model-free and model-based reinforcement learning processes differently depending on the valence of outcome, and whether stress alters the learning rate, i.e., how quickly information from the new environment is incorporated into choices. Participants were randomly assigned to either a stress or a control condition, and performed a two-stage Markov decision-making task in which the reward probabilities underwent periodic reversals without notice. We found that stress increased the contribution of model-free reinforcement learning only after negative outcome. Furthermore, stress decreased the learning rate. The results suggest that stress diminishes one's ability to make adaptive choices in multiple aspects of reinforcement learning. This finding has implications for understanding how stress facilitates maladaptive habits, such as addictive behavior, and other dysfunctional behaviors associated with stress in clinical and educational contexts.

  12. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning.

    PubMed

    Pilarski, Patrick M; Dawson, Michael R; Degris, Thomas; Fahimi, Farbod; Carey, Jason P; Sutton, Richard S

    2011-01-01

    As a contribution toward the goal of adaptable, intelligent artificial limbs, this work introduces a continuous actor-critic reinforcement learning method for optimizing the control of multi-function myoelectric devices. Using a simulated upper-arm robotic prosthesis, we demonstrate how it is possible to derive successful limb controllers from myoelectric data using only a sparse human-delivered training signal, without requiring detailed knowledge about the task domain. This reinforcement-based machine learning framework is well suited for use by both patients and clinical staff, and may be easily adapted to different application domains and the needs of individual amputees. To our knowledge, this is the first myoelectric control approach that facilitates the online learning of new amputee-specific motions based only on a one-dimensional (scalar) feedback signal provided by the user of the prosthesis. © 2011 IEEE

  13. Human-level control through deep reinforcement learning.

    PubMed

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A; Veness, Joel; Bellemare, Marc G; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-26

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
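
    The deep Q-network combines Q-learning with a deep convolutional network, an experience-replay buffer, and a periodically synchronized target network. The sketch below keeps only the loop structure (replay, target network, epsilon-greedy exploration, TD targets) and substitutes a tiny linear Q-function and a toy one-dimensional task for the network and the Atari environment, so it should be read as an outline of the training loop rather than an implementation of DQN.

```python
import random

def dqn_style_training(episodes=300, buffer_size=5000, batch=32, gamma=0.95,
                       lr=0.01, eps=0.1, target_sync=200, seed=0):
    """DQN-style training loop with a linear Q-function stand-in.

    Ingredients kept: replay buffer, separate target weights synced every
    `target_sync` steps, epsilon-greedy exploration, TD targets.
    """
    rng = random.Random(seed)
    n_actions = 2

    def features(s):
        return [1.0, s / 10.0, (s / 10.0) ** 2]

    def q_value(weights, s, a):
        return sum(w * f for w, f in zip(weights[a], features(s)))

    online = [[0.0] * 3 for _ in range(n_actions)]
    target = [row[:] for row in online]
    buffer, step_count = [], 0

    for _ in range(episodes):
        s = 5
        for _ in range(30):
            step_count += 1
            if rng.random() < eps:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda aa: q_value(online, s, aa))
            s_next = max(0, min(10, s + (1 if a == 1 else -1)))
            r, done = (1.0, True) if s_next == 10 else (0.0, False)
            buffer.append((s, a, r, s_next, done))
            if len(buffer) > buffer_size:
                buffer.pop(0)
            for bs, ba, br, bs2, bdone in rng.sample(buffer, min(batch, len(buffer))):
                # TD target computed with the *target* weights, as in DQN.
                best_next = 0.0 if bdone else max(q_value(target, bs2, aa)
                                                  for aa in range(n_actions))
                td_error = br + gamma * best_next - q_value(online, bs, ba)
                for i, f in enumerate(features(bs)):
                    online[ba][i] += lr * td_error * f    # SGD on online weights
            if step_count % target_sync == 0:
                target = [row[:] for row in online]       # periodic sync
            s = s_next
            if done:
                break
    return online

if __name__ == "__main__":
    w = dqn_style_training()
    # After training, moving right (action 1) should look better from mid-states.
    print("Q(s=5, left) vs Q(s=5, right):",
          round(sum(wi * f for wi, f in zip(w[0], [1.0, 0.5, 0.25])), 3),
          round(sum(wi * f for wi, f in zip(w[1], [1.0, 0.5, 0.25])), 3))
```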

  14. Human-level control through deep reinforcement learning

    NASA Astrophysics Data System (ADS)

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-01

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

  15. A Discussion of Possibility of Reinforcement Learning Using Event-Related Potential in BCI

    NASA Astrophysics Data System (ADS)

    Yamagishi, Yuya; Tsubone, Tadashi; Wada, Yasuhiro

    Recently, the brain-computer interface (BCI), a direct pathway connecting the human brain to an external device such as a computer or a robot, has received a great deal of attention. Because a BCI can control machines such as robots from brain activity alone, without using voluntary muscles, it may become a useful communication tool for handicapped persons, for instance amyotrophic lateral sclerosis patients. However, to realize a BCI system that can perform precise tasks in various environments, the control rules must be designed to adapt to dynamic environments. Reinforcement learning is one approach to designing such control rules. If this reinforcement learning could be driven by brain activity itself, it would lead to a BCI with general versatility. In this research, we focused on the P300 event-related potential as an alternative reward signal for reinforcement learning. We discriminated between success and failure trials from the P300 of single-trial EEG using a proposed discrimination algorithm based on a support vector machine. The possibility of reinforcement learning was examined in terms of the number of correctly discriminated trials, and it was shown that learning would be possible in most subjects.

  16. Effects of dopamine on reinforcement learning and consolidation in Parkinson’s disease

    PubMed Central

    Grogan, John P; Tsivos, Demitra; Smith, Laura; Knight, Brogan E; Bogacz, Rafal; Whone, Alan; Coulthard, Elizabeth J

    2017-01-01

    Emerging evidence suggests that dopamine may modulate learning and memory with important implications for understanding the neurobiology of memory and future therapeutic targeting. An influential hypothesis posits that dopamine biases reinforcement learning. More recent data also suggest an influence during both consolidation and retrieval. Eighteen Parkinson’s disease patients learned through feedback ON or OFF medication, with memory tested 24 hr later ON or OFF medication (4 conditions, within-subjects design with matched healthy control group). Patients OFF medication during learning decreased in memory accuracy over the following 24 hr. In contrast to previous studies, however, dopaminergic medication during learning and testing did not affect expression of positive or negative reinforcement. Two further experiments were run without the 24 hr delay, but they too failed to reproduce effects of dopaminergic medication on reinforcement learning. While supportive of a dopaminergic role in consolidation, this study failed to replicate previous findings on reinforcement learning. DOI: http://dx.doi.org/10.7554/eLife.26801.001 PMID:28691905

  17. Navigating complex decision spaces: Problems and paradigms in sequential choice

    PubMed Central

    Walsh, Matthew M.; Anderson, John R.

    2015-01-01

    To behave adaptively, we must learn from the consequences of our actions. Doing so is difficult when the consequences of an action follow a delay. This introduces the problem of temporal credit assignment. When feedback follows a sequence of decisions, how should the individual assign credit to the intermediate actions that comprise the sequence? Research in reinforcement learning provides two general solutions to this problem: model-free reinforcement learning and model-based reinforcement learning. In this review, we examine connections between stimulus-response and cognitive learning theories, habitual and goal-directed control, and model-free and model-based reinforcement learning. We then consider a range of problems related to temporal credit assignment. These include second-order conditioning and secondary reinforcers, latent learning and detour behavior, partially observable Markov decision processes, actions with distributed outcomes, and hierarchical learning. We ask whether humans and animals, when faced with these problems, behave in a manner consistent with reinforcement learning techniques. Throughout, we seek to identify neural substrates of model-free and model-based reinforcement learning. The former class of techniques is understood in terms of the neurotransmitter dopamine and its effects in the basal ganglia. The latter is understood in terms of a distributed network of regions including the prefrontal cortex, medial temporal lobes, cerebellum, and basal ganglia. Not only do reinforcement learning techniques have a natural interpretation in terms of human and animal behavior, but they also provide a useful framework for understanding neural reward valuation and action selection. PMID:23834192

  18. Optimal and Autonomous Control Using Reinforcement Learning: A Survey.

    PubMed

    Kiumarsi, Bahare; Vamvoudakis, Kyriakos G; Modares, Hamidreza; Lewis, Frank L

    2018-06-01

    This paper reviews the current state of the art on reinforcement learning (RL)-based feedback control solutions to optimal regulation and tracking of single and multiagent systems. Existing RL solutions to optimal control problems, as well as graphical games, will be reviewed. RL methods learn the solution to optimal control and game problems online, using data measured along the system trajectories. We discuss Q-learning and the integral RL algorithm as core algorithms for discrete-time (DT) and continuous-time (CT) systems, respectively. Moreover, we discuss a new direction of off-policy RL for both CT and DT systems. Finally, we review several applications.

  19. A look at Behaviourism and Perceptual Control Theory in Interface Design

    DTIC Science & Technology

    1998-02-01

    behaviours such as response variability, instinctive drift, autoshaping, etc. Perceptual Control Theory (PCT) postulates that behaviours result from the... internal variables. Behaviourism, on the other hand, cannot account for variability in responses, instinctive drift, autoshaping, etc. Researchers... Autoshaping. Animals appear to learn without reinforcement. However, conditioning theory speculates that learning results only when reinforcement...

  20. Refining fuzzy logic controllers with machine learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1994-01-01

    In this paper, we describe the GARIC (Generalized Approximate Reasoning-Based Intelligent Control) architecture, which learns from its past performance and modifies the labels in the fuzzy rules to improve performance. It uses fuzzy reinforcement learning which is a hybrid method of fuzzy logic and reinforcement learning. This technology can simplify and automate the application of fuzzy logic control to a variety of systems. GARIC has been applied in simulation studies of the Space Shuttle rendezvous and docking experiments. It has the potential of being applied in other aerospace systems as well as in consumer products such as appliances, cameras, and cars.

  1. Model-Free control performance improvement using virtual reference feedback tuning and reinforcement Q-learning

    NASA Astrophysics Data System (ADS)

    Radac, Mircea-Bogdan; Precup, Radu-Emil; Roman, Raul-Cristian

    2017-04-01

    This paper proposes the combination of two model-free controller tuning techniques, namely linear virtual reference feedback tuning (VRFT) and nonlinear state-feedback Q-learning, referred to as a new mixed VRFT-Q learning approach. VRFT is first used to find a stabilising feedback controller from input-output experimental data collected from the process in a model reference tracking setting. Reinforcement Q-learning is next applied in the same setting using input-state experimental data collected under perturbed VRFT to ensure good exploration. The Q-learning controller, learned with a batch fitted Q iteration algorithm, uses two neural networks: one for the Q-function estimator and one for the controller. The VRFT-Q learning approach is validated on position control of a two-degrees-of-motion, open-loop stable, multi-input multi-output (MIMO) aerodynamic system (AS). Extensive simulations for the two independent control channels of the MIMO AS show that the Q-learning controllers clearly improve performance over the VRFT controllers.

  2. Reinforcement Learning with Orthonormal Basis Adaptation Based on Activity-Oriented Index Allocation

    NASA Astrophysics Data System (ADS)

    Satoh, Hideki

    An orthonormal basis adaptation method for function approximation was developed and applied to reinforcement learning with multi-dimensional continuous state space. First, a basis used for linear function approximation of a control function is set to an orthonormal basis. Next, basis elements with small activities are replaced with other candidate elements as learning progresses. As this replacement is repeated, the number of basis elements with large activities increases. Example chaos control problems for multiple logistic maps were solved, demonstrating that the method for adapting an orthonormal basis can modify a basis while holding the orthonormality in accordance with changes in the environment to improve the performance of reinforcement learning and to eliminate the adverse effects of redundant noisy states.

  3. Prespeech motor learning in a neural network using reinforcement.

    PubMed

    Warlaumont, Anne S; Westermann, Gert; Buder, Eugene H; Oller, D Kimbrough

    2013-02-01

    Vocal motor development in infancy provides a crucial foundation for language development. Some significant early accomplishments include learning to control the process of phonation (the production of sound at the larynx) and learning to produce the sounds of one's language. Previous work has shown that social reinforcement shapes the kinds of vocalizations infants produce. We present a neural network model that provides an account of how vocal learning may be guided by reinforcement. The model consists of a self-organizing map that outputs to muscles of a realistic vocalization synthesizer. Vocalizations are spontaneously produced by the network. If a vocalization meets certain acoustic criteria, it is reinforced, and the weights are updated to make similar muscle activations increasingly likely to recur. We ran simulations of the model under various reinforcement criteria and tested the types of vocalizations it produced after learning in the different conditions. When reinforcement was contingent on the production of phonated (i.e. voiced) sounds, the network's post-learning productions were almost always phonated, whereas when reinforcement was not contingent on phonation, the network's post-learning productions were almost always not phonated. When reinforcement was contingent on both phonation and proximity to English vowels as opposed to Korean vowels, the model's post-learning productions were more likely to resemble the English vowels and vice versa. Copyright © 2012 Elsevier Ltd. All rights reserved.

  4. Flow Navigation by Smart Microswimmers via Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Colabrese, Simona; Biferale, Luca; Celani, Antonio; Gustavsson, Kristian

    2017-11-01

    We have numerically modeled active particles which are able to acquire some limited knowledge of the fluid environment from simple mechanical cues and exert a control on their preferred steering direction. We show that those swimmers can learn effective strategies just by experience, using a reinforcement learning algorithm. As an example, we focus on smart gravitactic swimmers. These are active particles whose task is to reach the highest altitude within some time horizon, exploiting the underlying flow whenever possible. The reinforcement learning algorithm allows particles to learn effective strategies even in difficult situations when, in the absence of control, they would end up being trapped by flow structures. These strategies are highly nontrivial and cannot be easily guessed in advance. This work paves the way towards the engineering of smart microswimmers that solve difficult navigation problems. ERC AdG NewTURB 339032.
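
    The navigation policy in such studies is typically learned with a simple tabular scheme; the toy loop below is a hedged illustration of one-step Q-learning over a discretized mechanical cue and a handful of preferred swimming directions, with vertical displacement as the reward. The cue coding, surrogate dynamics, and reward are assumptions, not the paper's simulation.

      import numpy as np

      rng = np.random.default_rng(0)
      n_states, n_actions = 8, 4           # cue bins x preferred-direction choices
      Q = np.zeros((n_states, n_actions))
      alpha, gamma, eps = 0.1, 0.99, 0.1

      def environment_step(state, action):
          # Toy surrogate: some (cue, direction) pairs yield upward drift, others trap the swimmer.
          dz = 1.0 if (state + action) % 4 == 0 else -0.2
          return rng.integers(n_states), dz

      state = rng.integers(n_states)
      for t in range(20000):
          action = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[state]))
          next_state, reward = environment_step(state, action)
          Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
          state = next_state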

  5. Reinforcement learning in multidimensional environments relies on attention mechanisms.

    PubMed

    Niv, Yael; Daniel, Reka; Geana, Andra; Gershman, Samuel J; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C

    2015-05-27

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this "representation learning" process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the "curse of dimensionality" in reinforcement learning. Copyright © 2015 the authors 0270-6474/15/358145-13$15.00/0.

  6. General functioning predicts reward and punishment learning in schizophrenia.

    PubMed

    Somlai, Zsuzsanna; Moustafa, Ahmed A; Kéri, Szabolcs; Myers, Catherine E; Gluck, Mark A

    2011-04-01

    Previous studies investigating feedback-driven reinforcement learning in patients with schizophrenia have provided mixed results. In this study, we explored the clinical predictors of reward and punishment learning using a probabilistic classification learning task. Patients with schizophrenia (n=40) performed similarly to healthy controls (n=30) on the classification learning task. However, more severe negative and general symptoms were associated with lower reward-learning performance, whereas poorer general psychosocial functioning was correlated with both lower reward- and punishment-learning performances. Multiple linear regression analyses indicated that general psychosocial functioning was the only significant predictor of reinforcement learning performance when education, antipsychotic dose, and positive, negative and general symptoms were included in the analysis. These results suggest a close relationship between reinforcement learning and general psychosocial functioning in schizophrenia. Published by Elsevier B.V.

  7. Prespeech motor learning in a neural network using reinforcement

    PubMed Central

    Warlaumont, Anne S.; Westermann, Gert; Buder, Eugene H.; Oller, D. Kimbrough

    2012-01-01

    Vocal motor development in infancy provides a crucial foundation for language development. Some significant early accomplishments include learning to control the process of phonation (the production of sound at the larynx) and learning to produce the sounds of one’s language. Previous work has shown that social reinforcement shapes the kinds of vocalizations infants produce. We present a neural network model that provides an account of how vocal learning may be guided by reinforcement. The model consists of a self-organizing map that outputs to muscles of a realistic vocalization synthesizer. Vocalizations are spontaneously produced by the network. If a vocalization meets certain acoustic criteria, it is reinforced, and the weights are updated to make similar muscle activations increasingly likely to recur. We ran simulations of the model under various reinforcement criteria and tested the types of vocalizations it produced after learning in the different conditions. When reinforcement was contingent on the production of phonated (i.e. voiced) sounds, the network’s post-learning productions were almost always phonated, whereas when reinforcement was not contingent on phonation, the network’s post-learning productions were almost always not phonated. When reinforcement was contingent on both phonation and proximity to English vowels as opposed to Korean vowels, the model’s post-learning productions were more likely to resemble the English vowels and vice versa. PMID:23275137

  8. The role of GABAB receptors in human reinforcement learning.

    PubMed

    Ort, Andres; Kometer, Michael; Rohde, Judith; Seifritz, Erich; Vollenweider, Franz X

    2014-10-01

    Behavioral evidence from human studies suggests that the γ-aminobutyric acid type B receptor (GABAB receptor) agonist baclofen modulates reinforcement learning and reduces craving in patients with addiction spectrum disorders. However, in contrast to the well-established role of dopamine in reinforcement learning, the mechanisms by which the GABAB receptor influences reinforcement learning in humans remain completely unknown. To further elucidate this issue, a cross-over, double-blind, placebo-controlled study was performed in healthy human subjects (N=15) to test the effects of baclofen (20 and 50 mg p.o.) on probabilistic reinforcement learning. Outcomes were the feedback-induced P2 component of the event-related potential, the feedback-related negativity, and the P300 component of the event-related potential. Baclofen produced a reduction of P2 amplitude over the course of the experiment, but did not modulate the feedback-related negativity. Furthermore, there was a trend towards increased learning after baclofen administration relative to placebo over the course of the experiment. The present results extend previous theories of reinforcement learning, which focus on the importance of mesolimbic dopamine signaling, and indicate that stimulation of cortical GABAB receptors in a fronto-parietal network leads to better attentional allocation in reinforcement learning. This observation is a first step in our understanding of how baclofen may improve reinforcement learning in healthy subjects. Further studies with larger sample sizes are needed to corroborate this conclusion and, furthermore, to test this effect in patients with addiction spectrum disorders. Copyright © 2014 Elsevier B.V. and ECNP. All rights reserved.

  9. Simultaneous vibration control and energy harvesting using actor-critic based reinforcement learning

    NASA Astrophysics Data System (ADS)

    Loong, Cheng Ning; Chang, C. C.; Dimitrakopoulos, Elias G.

    2018-03-01

    Mitigating excessive vibration of civil engineering structures using various types of devices has been a conspicuous research topic in the past few decades. Some devices, such as electromagnetic transducers, which have the capability of exerting control forces while simultaneously harvesting energy, have been proposed recently. These devices make possible a self-regenerative system that can semi-actively mitigate structural vibration without the need for external energy. Integrating mechanical and electrical components with control algorithms, these devices open up a new research domain that needs to be addressed. In this study, the feasibility of using an actor-critic based reinforcement learning control algorithm for simultaneous vibration control and energy harvesting for a civil engineering structure is investigated. The actor-critic based reinforcement learning control algorithm is a real-time, model-free adaptive technique that can adjust the controller parameters based on observations and reward signals without knowing the system characteristics. It is suitable for the control of a partially known nonlinear system with uncertain parameters. The feasibility of implementing this algorithm on a building structure equipped with an electromagnetic damper will be investigated in this study. Issues related to the modelling of the learning algorithm, initialization and convergence will be presented and discussed.
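
    As a hedged, minimal sketch of the actor-critic idea in this setting (not the authors' controller or structural model), the loop below runs a one-degree-of-freedom excited oscillator, a linear critic estimating the state value, and a Gaussian policy whose linear feedback gains are adjusted by the TD error; the plant parameters, reward weights, and step sizes are all assumptions.

      import numpy as np

      rng = np.random.default_rng(1)
      dt, gamma = 0.02, 0.99
      w_critic = np.zeros(2)               # linear value weights on [displacement, velocity]
      theta = np.zeros(2)                  # linear feedback gains (actor mean)
      alpha_c, alpha_a, sigma = 0.05, 0.0005, 0.2
      x, v = 1.0, 0.0

      for step in range(50000):
          s = np.array([x, v])
          u = float(np.clip(theta @ s + sigma * rng.standard_normal(), -3.0, 3.0))  # exploratory force
          a = -4.0 * x - 0.2 * v + u + 0.5 * np.sin(0.5 * step * dt)    # excited structure
          x, v = x + v * dt, v + a * dt
          s2 = np.array([x, v])
          reward = -(x ** 2 + 0.1 * u ** 2)             # vibration plus control-effort penalty
          td_error = reward + gamma * (w_critic @ s2) - (w_critic @ s)
          w_critic += alpha_c * td_error * s            # critic (value) update
          theta += alpha_a * td_error * (u - theta @ s) / sigma ** 2 * s  # policy-gradient step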

  10. Reinforcement learning in computer vision

    NASA Astrophysics Data System (ADS)

    Bernstein, A. V.; Burnaev, E. V.

    2018-04-01

    Nowadays, machine learning has become one of the basic technologies used in solving various computer vision tasks such as feature detection, image segmentation, object recognition and tracking. In many applications, various complex systems such as robots are equipped with visual sensors from which they learn the state of the surrounding environment by solving corresponding computer vision tasks. Solutions of these tasks are used for making decisions about possible future actions. It is not surprising that when solving computer vision tasks we should take into account special aspects of their subsequent application in model-based predictive control. Reinforcement learning is one of the modern machine learning technologies in which learning is carried out through interaction with the environment. In recent years, reinforcement learning has been used both for solving applied tasks such as processing and analysis of visual information, and for solving specific computer vision problems such as filtering, extracting image features, localizing objects in scenes, and many others. The paper briefly describes reinforcement learning technology and its use for solving computer vision problems.

  11. Reinforcement Learning in Multidimensional Environments Relies on Attention Mechanisms

    PubMed Central

    Daniel, Reka; Geana, Andra; Gershman, Samuel J.; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C.

    2015-01-01

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this “representation learning” process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the “curse of dimensionality” in reinforcement learning. PMID:26019331

  12. An Evaluation of Pedagogical Tutorial Tactics for a Natural Language Tutoring System: A Reinforcement Learning Approach

    ERIC Educational Resources Information Center

    Chi, Min; VanLehn, Kurt; Litman, Diane; Jordan, Pamela

    2011-01-01

    Pedagogical strategies are policies for a tutor to decide the next action when there are multiple actions available. When the content is controlled to be the same across experimental conditions, there has been little evidence that tutorial decisions have an impact on students' learning. In this paper, we applied Reinforcement Learning (RL) to…

  13. Collaborating Fuzzy Reinforcement Learning Agents

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1997-01-01

    Earlier, we introduced GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning and, at the local level, each agent learns and operates based on GARIC, a technique for fuzzy reinforcement learning. In this paper, we show that it is possible for these agents to compete in order to affect the selected control policy but, at the same time, to collaborate while investigating the state space. In this model, the evaluator or critic learns by observing all the agents' behaviors, but the control policy changes only based on the behavior of the winning agent, also known as the super agent.

  14. Flow Navigation by Smart Microswimmers via Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Colabrese, Simona; Gustavsson, Kristian; Celani, Antonio; Biferale, Luca

    2017-04-01

    Smart active particles can acquire some limited knowledge of the fluid environment from simple mechanical cues and exert a control on their preferred steering direction. Their goal is to learn the best way to navigate by exploiting the underlying flow whenever possible. As an example, we focus our attention on smart gravitactic swimmers. These are active particles whose task is to reach the highest altitude within some time horizon, given the constraints enforced by fluid mechanics. By means of numerical experiments, we show that swimmers indeed learn nearly optimal strategies just by experience. A reinforcement learning algorithm allows particles to learn effective strategies even in difficult situations when, in the absence of control, they would end up being trapped by flow structures. These strategies are highly nontrivial and cannot be easily guessed in advance. This Letter illustrates the potential of reinforcement learning algorithms to model adaptive behavior in complex flows and paves the way towards the engineering of smart microswimmers that solve difficult navigation problems.

  15. Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories

    PubMed Central

    Fonteneau, Raphael; Murphy, Susan A.; Wehenkel, Louis; Ernst, Damien

    2013-01-01

    In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of “artificial trajectories” from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning. PMID:24049244

  16. Altered neural encoding of prediction errors in assault-related posttraumatic stress disorder.

    PubMed

    Ross, Marisa C; Lenow, Jennifer K; Kilts, Clinton D; Cisler, Josh M

    2018-05-12

    Posttraumatic stress disorder (PTSD) is widely associated with deficits in extinguishing learned fear responses, which relies on mechanisms of reinforcement learning (e.g., updating expectations based on prediction errors). However, the degree to which PTSD is associated with impairments in general reinforcement learning (i.e., outside of the context of fear stimuli) remains poorly understood. Here, we investigate brain and behavioral differences in general reinforcement learning between adult women with and without a current diagnosis of PTSD. Twenty-nine adult females (15 PTSD with exposure to assaultive violence, 14 controls) underwent a neutral reinforcement-learning task (i.e., a two-armed bandit task) during fMRI. We modeled participant behavior using different adaptations of the Rescorla-Wagner (RW) model and used Independent Component Analysis to identify timecourses for large-scale a priori brain networks. We found that an anticorrelated and risk-sensitive RW model best fit participant behavior, with no differences in computational parameters between groups. Women in the PTSD group demonstrated significantly less neural encoding of prediction errors in both a ventral striatum/mPFC network and an anterior insula network compared to healthy controls. Weakened encoding of prediction errors in the ventral striatum/mPFC and anterior insula during a general reinforcement learning task, outside of the context of fear stimuli, suggests the possibility of a broader conceptualization of learning differences in PTSD than currently proposed in neurocircuitry models of PTSD. Copyright © 2018 Elsevier Ltd. All rights reserved.
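
    For readers unfamiliar with the modelling approach, the fragment below is a hedged illustration of a Rescorla-Wagner-style learner on a two-armed bandit, producing the trial-wise prediction errors that would typically be regressed against network timecourses. The "anticorrelated" update shown (the unchosen option's value moves in the opposite direction) is one plausible variant, not the fitted model from the study, and all parameter values are assumptions.

      import numpy as np

      rng = np.random.default_rng(2)
      alpha, beta = 0.3, 3.0               # learning rate, softmax inverse temperature
      V = np.zeros(2)                      # option values
      p_reward = np.array([0.7, 0.3])      # hypothetical reward probabilities
      prediction_errors = []

      for trial in range(200):
          p_choice = np.exp(beta * V) / np.exp(beta * V).sum()
          choice = rng.choice(2, p=p_choice)
          reward = float(rng.random() < p_reward[choice])
          delta = reward - V[choice]                   # prediction error
          V[choice] += alpha * delta
          V[1 - choice] -= alpha * delta               # anticorrelated update of the other arm
          prediction_errors.append(delta)              # regressor for the imaging analysis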

  17. Relationship between Reinforcement and Eye Movements during Ocular Motor Training with Learning Disabled Children.

    ERIC Educational Resources Information Center

    Punnett, Audrey F.; Steinhauer, Gene D.

    1984-01-01

    Four reading disabled children were given eight sessions of ocular motor training with reinforcement and eight sessions without reinforcement. Two reading disabled control Ss were treated similarly but received no ocular motor training. Results demonstrated that reinforcement can improve ocular motor skills, which in turn elevates reading…

  18. Fuzzy Q-Learning for Generalization of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1996-01-01

    Fuzzy Q-Learning, introduced earlier by the author, is an extension of Q-Learning into fuzzy environments. GARIC is a methodology for fuzzy reinforcement learning. In this paper, we introduce GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning and at the local level, each agent learns and operates based on GARIC. GARIC-Q improves the speed and applicability of Fuzzy Q-Learning through generalization of input space by using fuzzy rules and bridges the gap between Q-Learning and rule based intelligent systems.

  19. Reinforcement Learning and Dopamine in Schizophrenia: Dimensions of Symptoms or Specific Features of a Disease Group?

    PubMed Central

    Deserno, Lorenz; Boehme, Rebecca; Heinz, Andreas; Schlagenhauf, Florian

    2013-01-01

    Abnormalities in reinforcement learning are a key finding in schizophrenia and have been proposed to be linked to elevated levels of dopamine neurotransmission. Behavioral deficits in reinforcement learning and their neural correlates may contribute to the formation of clinical characteristics of schizophrenia. The ability to form predictions about future outcomes is fundamental for environmental interactions and depends on neuronal teaching signals, like reward prediction errors. While aberrant prediction errors, that encode non-salient events as surprising, have been proposed to contribute to the formation of positive symptoms, a failure to build neural representations of decision values may result in negative symptoms. Here, we review behavioral and neuroimaging research in schizophrenia and focus on studies that implemented reinforcement learning models. In addition, we discuss studies that combined reinforcement learning with measures of dopamine. Thereby, we suggest how reinforcement learning abnormalities in schizophrenia may contribute to the formation of psychotic symptoms and may interact with cognitive deficits. These ideas point toward an interplay of more rigid versus flexible control over reinforcement learning. Pronounced deficits in the flexible or model-based domain may allow for a detailed characterization of well-established cognitive deficits in schizophrenia patients based on computational models of learning. Finally, we propose a framework based on the potentially crucial contribution of dopamine to dysfunctional reinforcement learning on the level of neural networks. Future research may strongly benefit from computational modeling but also requires further methodological improvement for clinical group studies. These research tools may help to improve our understanding of disease-specific mechanisms and may help to identify clinically relevant subgroups of the heterogeneous entity schizophrenia. PMID:24391603

  20. Reinforcement learning of periodical gaits in locomotion robots

    NASA Astrophysics Data System (ADS)

    Svinin, Mikhail; Yamada, Kazuyaki; Ushio, S.; Ueda, Kanji

    1999-08-01

    Emergence of stable gaits in locomotion robots is studied in this paper. A classifier system, implementing an instance-based reinforcement learning scheme, is used for sensory-motor control of an eight-legged mobile robot. An important feature of the classifier system is its ability to work with a continuous sensor space. The robot does not have prior knowledge of the environment, its own internal model, or the goal coordinates. It is only assumed that the robot can acquire stable gaits by learning how to reach a light source. During the learning process the control system is self-organized by reinforcement signals. Reaching the light source defines a global reward. Forward motion gets a local reward, while stepping back and falling down get a local punishment. Feasibility of the proposed self-organized system is tested in simulation and experiment. The control actions are specified at the leg level. It is shown that, as learning progresses, the number of action rules in the classifier system stabilizes to a certain level, corresponding to the acquired gait patterns.

  1. Optimal control in microgrid using multi-agent reinforcement learning.

    PubMed

    Li, Fu-Dong; Wu, Min; He, Yong; Chen, Xin

    2012-11-01

    This paper presents an improved reinforcement learning method to minimize electricity costs on the premise of satisfying the power balance and generation limits of units in a microgrid in grid-connected mode. Firstly, the microgrid control requirements are analyzed and the objective function of optimal control for the microgrid is proposed. Then, a state variable, "Average Electricity Price Trend", which expresses the most probable transitions of the system, is developed to reduce the complexity and randomness of the microgrid, and a multi-agent architecture including agents, state variables, action variables and a reward function is formulated. Furthermore, dynamic hierarchical reinforcement learning, based on the change rate of a key state variable, is established to carry out optimal policy exploration. The analysis shows that the proposed method is beneficial for handling the "curse of dimensionality" and speeds up learning in an unknown large-scale world. Finally, the simulation results under JADE (Java Agent Development Framework) demonstrate the validity of the presented method in optimal control for a microgrid in grid-connected mode. Copyright © 2012 ISA. Published by Elsevier Ltd. All rights reserved.
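
    Stripped of the multi-agent and hierarchical structure, the core idea can be illustrated with a single tabular Q-learner whose state combines a discretized electricity-price trend with a load bin and whose reward is the negative electricity cost. The sketch below uses that simplification; all bins, prices, and costs are assumptions rather than the paper's formulation.

      import numpy as np

      rng = np.random.default_rng(3)
      n_trend, n_load, n_actions = 3, 5, 4       # falling/flat/rising price x load bins x unit output levels
      Q = np.zeros((n_trend, n_load, n_actions))
      alpha, gamma, eps = 0.1, 0.95, 0.1

      def step(trend, load, action):
          price = [0.8, 1.0, 1.3][trend]                        # toy price per trend bin
          generation = action / (n_actions - 1)
          grid_import = max(load / (n_load - 1) - generation, 0.0)
          cost = price * grid_import + 0.2 * generation         # purchase cost plus generation cost
          return rng.integers(n_trend), rng.integers(n_load), -cost

      trend, load = 0, 0
      for t in range(50000):
          a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[trend, load]))
          trend2, load2, r = step(trend, load, a)
          Q[trend, load, a] += alpha * (r + gamma * Q[trend2, load2].max() - Q[trend, load, a])
          trend, load = trend2, load2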

  2. Design issues for a reinforcement-based self-learning fuzzy controller

    NASA Technical Reports Server (NTRS)

    Yen, John; Wang, Haojin; Dauherity, Walter

    1993-01-01

    Fuzzy logic controllers have some often-cited advantages over conventional techniques such as PID control: easy implementation, accommodation to natural language, the ability to cover a wider range of operating conditions, and others. One major obstacle that hinders their broader application is the lack of a systematic way to develop and modify the rules; as a result, the creation and modification of fuzzy rules often depends on trial and error or pure experimentation. One of the proposed approaches to address this issue is the self-learning fuzzy logic controller (SFLC), which uses reinforcement learning techniques to learn the desirability of states and to adjust the consequent part of fuzzy control rules accordingly. Due to the different dynamics of the controlled processes, the performance of a self-learning fuzzy controller is highly contingent on its design. The design issue has not received sufficient attention. The issues related to the design of an SFLC for application to a chemical process are discussed, and its performance is compared with that of a PID and a self-tuning fuzzy logic controller.

  3. Mobile robots exploration through cnn-based reinforcement learning.

    PubMed

    Tai, Lei; Liu, Ming

    2016-01-01

    Exploration in an unknown environment is an elemental application for mobile robots. In this paper, we outline a reinforcement learning method aimed at solving the exploration problem in a corridor environment. The learning model takes the depth image from an RGB-D sensor as its only input. The feature representation of the depth image is extracted through a pre-trained convolutional neural network model. Based on the recent success of the deep Q-network in artificial intelligence, the robot controller achieves exploration and obstacle avoidance abilities in several different simulated environments. This is the first time that reinforcement learning has been used to build an exploration strategy for mobile robots from raw sensor information.
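
    The pipeline can be caricatured as "frozen CNN features plus a learned Q-head over a few steering commands"; the sketch below uses that caricature with a stub feature extractor and a stub corridor simulator, since the pre-trained network, simulator, and reward used in the paper are not reproduced here.

      import numpy as np

      rng = np.random.default_rng(4)

      def cnn_features(depth_image):
          # Stand-in for the penultimate layer of a pre-trained convolutional network.
          return np.tanh(depth_image.reshape(-1)[:64])

      def simulate(depth_image, action):
          # Stub corridor simulator: reward going straight when the way ahead looks clear.
          reward = 1.0 if action == 2 and depth_image.mean() > 0.5 else -0.1
          return rng.random((8, 8)), reward

      n_actions = 5                                  # hard-left ... hard-right steering commands
      W = np.zeros((n_actions, 64))                  # linear Q-head on the CNN features
      alpha, gamma, eps = 0.01, 0.95, 0.1

      obs = rng.random((8, 8))
      for t in range(10000):
          phi = cnn_features(obs)
          a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(W @ phi))
          obs2, r = simulate(obs, a)
          td = r + gamma * (W @ cnn_features(obs2)).max() - W[a] @ phi
          W[a] += alpha * td * phi
          obs = obs2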

  4. Fault-tolerant optimised tracking control for unknown discrete-time linear systems using a combined reinforcement learning and residual compensation methodology

    NASA Astrophysics Data System (ADS)

    Han, Ke-Zhen; Feng, Jian; Cui, Xiaohong

    2017-10-01

    This paper considers the fault-tolerant optimised tracking control (FTOTC) problem for unknown discrete-time linear systems. A research scheme is proposed on the basis of data-based parity space identification, reinforcement learning and residual compensation techniques. The main characteristic of this research scheme lies in the parity-space-identification-based simultaneous tracking control and residual compensation. The technical approach consists of four main components: a subspace-aided method is applied to design an observer-based residual generator; a reinforcement Q-learning approach is used to solve the optimised tracking control policy; robust H∞ theory is relied on to achieve noise attenuation; and fault estimation triggered by the residual generator is adopted to perform fault compensation. To clarify the design and implementation procedures, an integrated algorithm is further constructed to link these four functional units. Detailed analysis and proofs are subsequently given to explain the guaranteed FTOTC performance of the proposed scheme. Finally, a case simulation is provided to verify its effectiveness.

  5. Can model-free reinforcement learning explain deontological moral judgments?

    PubMed

    Ayars, Alisabeth

    2016-05-01

    Dual-systems frameworks propose that moral judgments are derived from both an immediate emotional response and controlled/rational cognition. Recently Cushman (2013) proposed a new dual-system theory based on model-free and model-based reinforcement learning. Model-free learning attaches values to actions based on their history of reward and punishment, and explains some deontological, non-utilitarian judgments. Model-based learning involves the construction of a causal model of the world and allows for far-sighted planning; this form of learning fits well with utilitarian considerations that seek to maximize certain kinds of outcomes. I present three concerns regarding the use of model-free reinforcement learning to explain deontological moral judgment. First, many actions that humans find aversive from model-free learning are not judged to be morally wrong. Moral judgment must require something in addition to model-free learning. Second, there is a dearth of evidence for central predictions of the reinforcement account, e.g., that people with different reinforcement histories will, all else equal, make different moral judgments. Finally, to account for the effect of intention within the framework requires certain assumptions which lack support. These challenges are reasonable foci for future empirical/theoretical work on the model-free/model-based framework. Copyright © 2016 Elsevier B.V. All rights reserved.

  6. Effects of a National Public Service Information Campaign on Crime Prevention: Perspectives from Social Learning and Social Control Theory.

    ERIC Educational Resources Information Center

    Lordan, Edward J.; Kwon, Joongrok

    This study examined the effects of public service advertising from two theoretical backgrounds: social learning theory and social control theory. Traditional social learning theory assumes that learning occurs by subjects performing responses and experiencing their effects, with reinforcement as the main determinant. Social control theory, as…

  7. Reinforcement and inference in cross-situational word learning.

    PubMed

    Tilles, Paulo F C; Fontanari, José F

    2013-01-01

    Cross-situational word learning is based on the notion that a learner can determine the referent of a word by finding something in common across many observed uses of that word. Here we propose an adaptive learning algorithm that contains a parameter that controls the strength of the reinforcement applied to associations between concurrent words and referents, and a parameter that regulates inference, which includes built-in biases, such as mutual exclusivity, and information of past learning events. By adjusting these parameters so that the model predictions agree with data from representative experiments on cross-situational word learning, we were able to explain the learning strategies adopted by the participants of those experiments in terms of a trade-off between reinforcement and inference. These strategies can vary wildly depending on the conditions of the experiments. For instance, for fast mapping experiments (i.e., the correct referent could, in principle, be inferred in a single observation) inference is prevalent, whereas for segregated contextual diversity experiments (i.e., the referents are separated in groups and are exhibited with members of their groups only) reinforcement is predominant. Other experiments are explained with more balanced doses of reinforcement and inference.

  8. Seizure Control in a Computational Model Using a Reinforcement Learning Stimulation Paradigm.

    PubMed

    Nagaraj, Vivek; Lamperski, Andrew; Netoff, Theoden I

    2017-11-01

    Neuromodulation technologies such as vagus nerve stimulation and deep brain stimulation have shown some efficacy in controlling seizures in medically intractable patients. However, inherent patient-to-patient variability of seizure disorders leads to a wide range of therapeutic efficacy. A patient-specific approach to determining stimulation parameters may lead to increased therapeutic efficacy while minimizing stimulation energy and side effects. This paper presents a reinforcement learning algorithm that optimizes stimulation frequency for controlling seizures with minimum stimulation energy. We apply our method to a computational model called the Epileptor. The Epileptor model simulates inter-ictal and ictal local field potential data. In order to apply reinforcement learning to the Epileptor, we introduce a specialized reward function and state-space discretization. With the reward function and discretization fixed, we test the effectiveness of the temporal difference reinforcement learning algorithm (TD(0)). For periodic pulsatile stimulation, we derive a relation that describes, for any stimulation frequency, the minimal pulse amplitude required to suppress seizures. The TD(0) algorithm is able to identify parameters that control seizures quickly. Additionally, our results show that the TD(0) algorithm refines the stimulation frequency to minimize stimulation energy, thereby converging to optimal parameters reliably. An advantage of the TD(0) algorithm is that it is adaptive, so that the parameters necessary to control the seizures can change over time. We show that the algorithm can converge on the optimal solution in simulation with slow and fast inter-seizure intervals.
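
    As a hedged illustration of the kind of temporal-difference update involved (not the paper's reward function, discretization, or Epileptor model), the loop below runs a SARSA-style TD learner over a binned brain state and a set of stimulation frequencies, with a reward that penalizes both seizures and stimulation energy; the surrogate dynamics and all constants are assumptions.

      import numpy as np

      rng = np.random.default_rng(5)
      n_states, n_freqs = 4, 6                 # LFP-derived state bins x frequency settings
      Q = np.zeros((n_states, n_freqs))
      alpha, gamma, eps = 0.1, 0.9, 0.1

      def surrogate(state, freq_idx):
          seizure_prob = max(0.0, 0.5 - 0.08 * freq_idx)        # higher frequency suppresses seizures
          seizure = rng.random() < seizure_prob
          reward = -5.0 * seizure - 0.2 * freq_idx              # seizure penalty plus energy penalty
          return rng.integers(n_states), reward

      state, f = 0, rng.integers(n_freqs)
      for t in range(30000):
          state2, r = surrogate(state, f)
          f2 = rng.integers(n_freqs) if rng.random() < eps else int(np.argmax(Q[state2]))
          Q[state, f] += alpha * (r + gamma * Q[state2, f2] - Q[state, f])
          state, f = state2, f2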

  9. Design issues of a reinforcement-based self-learning fuzzy controller for petrochemical process control

    NASA Technical Reports Server (NTRS)

    Yen, John; Wang, Haojin; Daugherity, Walter C.

    1992-01-01

    Fuzzy logic controllers have some often-cited advantages over conventional techniques such as PID control, including easier implementation, accommodation to natural language, and the ability to cover a wider range of operating conditions. One major obstacle that hinders the broader application of fuzzy logic controllers is the lack of a systematic way to develop and modify their rules; as a result the creation and modification of fuzzy rules often depends on trial and error or pure experimentation. One of the proposed approaches to address this issue is a self-learning fuzzy logic controller (SFLC) that uses reinforcement learning techniques to learn the desirability of states and to adjust the consequent part of its fuzzy control rules accordingly. Due to the different dynamics of the controlled processes, the performance of a self-learning fuzzy controller is highly contingent on its design. The design issue has not received sufficient attention. The issues related to the design of a SFLC for application to a petrochemical process are discussed, and its performance is compared with that of a PID and a self-tuning fuzzy logic controller.

  10. Learning to Obtain Reward, but Not Avoid Punishment, Is Affected by Presence of PTSD Symptoms in Male Veterans: Empirical Data and Computational Model

    PubMed Central

    Myers, Catherine E.; Moustafa, Ahmed A.; Sheynin, Jony; VanMeenen, Kirsten M.; Gilbertson, Mark W.; Orr, Scott P.; Beck, Kevin D.; Pang, Kevin C. H.; Servatius, Richard J.

    2013-01-01

    Post-traumatic stress disorder (PTSD) symptoms include behavioral avoidance which is acquired and tends to increase with time. This avoidance may represent a general learning bias; indeed, individuals with PTSD are often faster than controls on acquiring conditioned responses based on physiologically-aversive feedback. However, it is not clear whether this learning bias extends to cognitive feedback, or to learning from both reward and punishment. Here, male veterans with self-reported current, severe PTSD symptoms (PTSS group) or with few or no PTSD symptoms (control group) completed a probabilistic classification task that included both reward-based and punishment-based trials, where feedback could take the form of reward, punishment, or an ambiguous “no-feedback” outcome that could signal either successful avoidance of punishment or failure to obtain reward. The PTSS group outperformed the control group in total points obtained; the PTSS group specifically performed better than the control group on reward-based trials, with no difference on punishment-based trials. To better understand possible mechanisms underlying observed performance, we used a reinforcement learning model of the task, and applied maximum likelihood estimation techniques to derive estimated parameters describing individual participants’ behavior. Estimations of the reinforcement value of the no-feedback outcome were significantly greater in the control group than the PTSS group, suggesting that the control group was more likely to value this outcome as positively reinforcing (i.e., signaling successful avoidance of punishment). This is consistent with the control group’s generally poorer performance on reward trials, where reward feedback was to be obtained in preference to the no-feedback outcome. Differences in the interpretation of ambiguous feedback may contribute to the facilitated reinforcement learning often observed in PTSD patients, and may in turn provide new insight into how pathological behaviors are acquired and maintained in PTSD. PMID:24015254

  11. Concurrent Learning of Control in Multi-agent Sequential Decision Tasks

    DTIC Science & Technology

    2018-04-17

    The overall objective of this project was to develop multi-agent reinforcement learning (MARL) approaches for intelligent agents to autonomously learn distributed control policies in decentralized, partially observable...

  12. Preventing Learned Helplessness.

    ERIC Educational Resources Information Center

    Hoy, Cheri

    1986-01-01

    To prevent learned helplessness in learning disabled students, teachers can share responsibilities with the students, train students to reinforce themselves for effort and self control, and introduce opportunities for changing counterproductive attitudes. (CL)

  13. Asynchronous Gossip for Averaging and Spectral Ranking

    NASA Astrophysics Data System (ADS)

    Borkar, Vivek S.; Makhijani, Rahul; Sundaresan, Rajesh

    2014-08-01

    We consider two variants of the classical gossip algorithm. The first variant is a version of asynchronous stochastic approximation. We highlight a fundamental difficulty associated with the classical asynchronous gossip scheme, viz., that it may not converge to a desired average, and suggest an alternative scheme based on reinforcement learning that has guaranteed convergence to the desired average. We then discuss a potential application to a wireless network setting with simultaneous link activation constraints. The second variant is a gossip algorithm for distributed computation of the Perron-Frobenius eigenvector of a nonnegative matrix. While the first variant draws upon a reinforcement learning algorithm for an average cost controlled Markov decision problem, the second variant draws upon a reinforcement learning algorithm for risk-sensitive control. We then discuss potential applications of the second variant to ranking schemes, reputation networks, and principal component analysis.

  14. Reinforcement Learning with Autonomous Small Unmanned Aerial Vehicles in Cluttered Environments

    NASA Technical Reports Server (NTRS)

    Tran, Loc; Cross, Charles; Montague, Gilbert; Motter, Mark; Neilan, James; Qualls, Garry; Rothhaar, Paul; Trujillo, Anna; Allen, B. Danette

    2015-01-01

    We present ongoing work in the Autonomy Incubator at NASA Langley Research Center (LaRC) exploring the efficacy of a data set aggregation approach to reinforcement learning for small unmanned aerial vehicle (sUAV) flight in dense and cluttered environments with reactive obstacle avoidance. The goal is to learn an autonomous flight model using training experiences from a human piloting a sUAV around static obstacles. The training approach uses video data from a forward-facing camera that records the human pilot's flight. Various computer-vision-based features are extracted from the video relating to edge and gradient information. The recorded human-controlled inputs are used to train an autonomous control model that correlates the extracted feature vector to a yaw command. As part of the reinforcement learning approach, the autonomous control model is iteratively updated with feedback from a human agent who corrects undesired model output. This data-driven approach to autonomous obstacle avoidance is explored for simulated forest environments, furthering research on autonomous flight under the tree canopy. This enables flight in previously inaccessible environments that are of interest to NASA researchers in Earth and Atmospheric sciences.

  15. Tiger salamanders' (Ambystoma tigrinum) response learning and usage of visual cues.

    PubMed

    Kundey, Shannon M A; Millar, Roberto; McPherson, Justin; Gonzalez, Maya; Fitz, Aleyna; Allen, Chadbourne

    2016-05-01

    We explored tiger salamanders' (Ambystoma tigrinum) learning to execute a response within a maze as proximal visual cue conditions varied. In Experiment 1, salamanders learned to turn consistently in a T-maze for reinforcement before the maze was rotated. All learned the initial task and executed the trained turn during test, suggesting that they learned to demonstrate the reinforced response during training and continued to perform it during test. In a second experiment utilizing a similar procedure, two visual cues were placed consistently at the maze junction. Salamanders were reinforced for turning towards one cue. Cue placement was reversed during test. All learned the initial task, but executed the trained turn rather than turning towards the visual cue during test, evidencing response learning. In Experiment 3, we investigated whether a compound visual cue could control salamanders' behaviour when it was the only cue predictive of reinforcement in a cross-maze by varying start position and cue placement. All learned to turn in the direction indicated by the compound visual cue, indicating that visual cues can come to control their behaviour. Following training, testing revealed that salamanders attended to foreground stimulus features over background features. Overall, these results suggest that salamanders learn to execute responses in preference to learning to use visual cues, but can use visual cues if required. Our success with this paradigm offers the potential in future studies to explore salamanders' cognition further, as well as to shed light on how features of the tiger salamanders' life history (e.g. hibernation and metamorphosis) impact cognition.

  16. Reinforcement interval type-2 fuzzy controller design by online rule generation and q-value-aided ant colony optimization.

    PubMed

    Juang, Chia-Feng; Hsu, Chia-Hung

    2009-12-01

    This paper proposes a new reinforcement-learning method using online rule generation and Q-value-aided ant colony optimization (ORGQACO) for fuzzy controller design. The fuzzy controller is based on an interval type-2 fuzzy system (IT2FS). The antecedent part in the designed IT2FS uses interval type-2 fuzzy sets to improve controller robustness to noise. There are initially no fuzzy rules in the IT2FS. The ORGQACO concurrently designs both the structure and parameters of an IT2FS. We propose an online interval type-2 rule generation method for the evolution of system structure and flexible partitioning of the input space. Consequent part parameters in an IT2FS are designed using Q-values and the reinforcement local-global ant colony optimization algorithm. This algorithm selects the consequent part from a set of candidate actions according to ant pheromone trails and Q-values, both of which are updated using reinforcement signals. The ORGQACO design method is applied to the following three control problems: 1) truck-backing control; 2) magnetic-levitation control; and 3) chaotic-system control. The ORGQACO is compared with other reinforcement-learning methods to verify its efficiency and effectiveness. Comparisons with type-1 fuzzy systems verify the noise robustness property of using an IT2FS.

  17. Implementation of real-time energy management strategy based on reinforcement learning for hybrid electric vehicles and simulation validation

    PubMed Central

    Kong, Zehui; Liu, Teng

    2017-01-01

    To further improve the fuel economy of series hybrid electric tracked vehicles, a reinforcement learning (RL)-based real-time energy management strategy is developed in this paper. In order to utilize the statistical characteristics of online driving schedule effectively, a recursive algorithm for the transition probability matrix (TPM) of power-request is derived. The reinforcement learning (RL) is applied to calculate and update the control policy at regular time, adapting to the varying driving conditions. A facing-forward powertrain model is built in detail, including the engine-generator model, battery model and vehicle dynamical model. The robustness and adaptability of real-time energy management strategy are validated through the comparison with the stationary control strategy based on initial transition probability matrix (TPM) generated from a long naturalistic driving cycle in the simulation. Results indicate that proposed method has better fuel economy than stationary one and is more effective in real-time control. PMID:28671967
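
    The recursive TPM idea can be sketched in a few lines: keep a (slowly forgetting) count matrix of transitions between discretized power-request levels and renormalize it after every sample. The bin count and forgetting factor below are assumptions, not the paper's values.

      import numpy as np

      n_bins = 10
      counts = np.ones((n_bins, n_bins))                 # Laplace-style initialization
      tpm = counts / counts.sum(axis=1, keepdims=True)
      forgetting = 0.999                                 # slowly fade out stale driving statistics

      def update_tpm(prev_bin, curr_bin):
          global counts, tpm
          counts *= forgetting
          counts[prev_bin, curr_bin] += 1.0
          tpm = counts / counts.sum(axis=1, keepdims=True)

      # Example: stream of discretized power requests from the current drive cycle.
      rng = np.random.default_rng(6)
      demand = rng.integers(n_bins, size=1000)
      for k in range(1, len(demand)):
          update_tpm(demand[k - 1], demand[k])
      # `tpm` would then feed the periodic RL-based update of the energy-management policy.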

  18. Implementation of real-time energy management strategy based on reinforcement learning for hybrid electric vehicles and simulation validation.

    PubMed

    Kong, Zehui; Zou, Yuan; Liu, Teng

    2017-01-01

    To further improve the fuel economy of series hybrid electric tracked vehicles, a reinforcement learning (RL)-based real-time energy management strategy is developed in this paper. In order to utilize the statistical characteristics of online driving schedule effectively, a recursive algorithm for the transition probability matrix (TPM) of power-request is derived. The reinforcement learning (RL) is applied to calculate and update the control policy at regular time, adapting to the varying driving conditions. A facing-forward powertrain model is built in detail, including the engine-generator model, battery model and vehicle dynamical model. The robustness and adaptability of real-time energy management strategy are validated through the comparison with the stationary control strategy based on initial transition probability matrix (TPM) generated from a long naturalistic driving cycle in the simulation. Results indicate that proposed method has better fuel economy than stationary one and is more effective in real-time control.

  19. Discrete Serotonin Systems Mediate Memory Enhancement and Escape Latencies after Unpredicted Aversive Experience in Drosophila Place Memory

    PubMed Central

    Sitaraman, Divya; Kramer, Elizabeth F.; Kahsai, Lily; Ostrowski, Daniela; Zars, Troy

    2017-01-01

    Feedback mechanisms in operant learning are critical for animals to increase reward or reduce punishment. However, not all conditions have a behavior that can readily resolve an event. Animals must then try out different behaviors to better their situation through outcome learning. This form of learning allows for novel solutions and with positive experience can lead to unexpected behavioral routines. Learned helplessness, as a type of outcome learning, manifests in part as increases in escape latency in the face of repeated unpredicted shocks. Little is known about the mechanisms of outcome learning. When fruit fly Drosophila melanogaster are exposed to unpredicted high temperatures in a place learning paradigm, flies both increase escape latencies and have a higher memory when given control of a place/temperature contingency. Here we describe discrete serotonin neuronal circuits that mediate aversive reinforcement, escape latencies, and memory levels after place learning in the presence and absence of unexpected aversive events. The results show that two features of learned helplessness depend on the same modulatory system as aversive reinforcement. Moreover, changes in aversive reinforcement and escape latency depend on local neural circuit modulation, while memory enhancement requires larger modulation of multiple behavioral control circuits. PMID:29321732

  20. Reinforcement learning: Solving two case studies

    NASA Astrophysics Data System (ADS)

    Duarte, Ana Filipa; Silva, Pedro; dos Santos, Cristina Peixoto

    2012-09-01

    Reinforcement Learning algorithms offer interesting features for the control of autonomous systems, such as the ability to learn from direct interaction with the environment and the use of a simple reward signal, as opposed to the input-output pairs used in classic supervised learning. The reward signal indicates the success or failure of the actions executed by the agent in the environment. In this work, RL algorithms are described and applied to two case studies: the Crawler robot and the widely known inverted pendulum. We explore RL capabilities to autonomously learn a basic locomotion pattern in the Crawler, and approach the balancing problem of biped locomotion using the inverted pendulum.
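
    For the inverted-pendulum case study, a standard treatment is tabular Q-learning over a discretized angle and angular velocity; the sketch below follows that standard recipe with simplified dynamics, and the discretization, torques, and reward are illustrative assumptions rather than the setup used in the paper.

      import numpy as np

      rng = np.random.default_rng(7)
      n_theta, n_omega, n_actions = 12, 12, 3      # angle bins, angular-velocity bins, torque choices
      Q = np.zeros((n_theta, n_omega, n_actions))
      alpha, gamma, eps, dt = 0.2, 0.99, 0.1, 0.02
      torques = [-2.0, 0.0, 2.0]

      def discretize(theta, omega):
          i = int(np.clip((theta + 0.5) / 1.0 * n_theta, 0, n_theta - 1))
          j = int(np.clip((omega + 3.0) / 6.0 * n_omega, 0, n_omega - 1))
          return i, j

      for episode in range(2000):
          theta, omega = rng.uniform(-0.05, 0.05), 0.0
          for step in range(500):
              s = discretize(theta, omega)
              a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
              omega += (9.8 * np.sin(theta) + torques[a]) * dt     # simplified pendulum dynamics
              theta += omega * dt
              done = abs(theta) > 0.5                              # pole has fallen too far
              r = -1.0 if done else 1.0
              s2 = discretize(theta, omega)
              Q[s][a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s][a])
              if done:
                  break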

  1. Genetic reinforcement learning through symbiotic evolution for fuzzy controller design.

    PubMed

    Juang, C F; Lin, J Y; Lin, C T

    2000-01-01

    An efficient genetic reinforcement learning algorithm for designing fuzzy controllers is proposed in this paper. The genetic algorithm (GA) adopted in this paper is based upon symbiotic evolution which, when applied to fuzzy controller design, complements the local mapping property of a fuzzy rule. Using this Symbiotic-Evolution-based Fuzzy Controller (SEFC) design method, the number of control trials, as well as consumed CPU time, are considerably reduced when compared to traditional GA-based fuzzy controller design methods and other types of genetic reinforcement learning schemes. Moreover, unlike traditional fuzzy controllers, which partition the input space into a grid, SEFC partitions the input space in a flexible way, thus creating fewer fuzzy rules. In SEFC, different types of fuzzy rules whose consequent parts are singletons, fuzzy sets, or linear equations (TSK-type fuzzy rules) are allowed. Further, the free parameters (e.g., centers and widths of membership functions) and fuzzy rules are all tuned automatically. For TSK-type fuzzy rules in particular, the proposed learning algorithm selects only the significant input variables to participate in the consequent of a rule. The proposed SEFC design method has been applied to different simulated control problems, including the cart-pole balancing system, a magnetic levitation system, and a water bath temperature control system. The proposed SEFC has been verified to be efficient and superior on these control problems and in comparisons with some traditional GA-based fuzzy systems.

  2. Scheduled power tracking control of the wind-storage hybrid system based on the reinforcement learning theory

    NASA Astrophysics Data System (ADS)

    Li, Ze

    2017-09-01

    To address the intermittency and uncertainty of wind electricity, energy storage and a wind generator are combined into a hybrid system to improve the controllability of the output power. A scheduled power tracking control method is proposed based on reinforcement learning theory and the Q-learning algorithm. In this method, the state space of the environment is formed from two key factors, i.e. the state of charge of the energy storage and the difference between the actual wind power and the scheduled power; the feasible action is the output power of the energy storage, and a corresponding immediate reward function is designed to reflect the rationality of the control action. By interacting with the environment and learning from the immediate reward, the optimal control strategy is gradually formed. After that, it can be applied to the scheduled power tracking control of the hybrid system. Finally, the rationality and validity of the method are verified through simulation examples.
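
    A toy tabular version of the scheme described above could look like the following: the state pairs a binned state of charge (SOC) with a binned gap between actual and scheduled wind power, the action is the storage output level, and the immediate reward penalizes both the remaining tracking error and operation at the SOC limits. All numbers are illustrative assumptions.

      import numpy as np

      rng = np.random.default_rng(8)
      n_soc, n_gap, n_actions = 5, 7, 5
      Q = np.zeros((n_soc, n_gap, n_actions))
      alpha, gamma, eps = 0.1, 0.9, 0.1

      def step(soc, gap, action):
          storage_power = action - n_actions // 2                     # discharge (+) or charge (-)
          tracking_error = abs((gap - n_gap // 2) - storage_power)
          reward = -tracking_error - 0.5 * (soc in (0, n_soc - 1))    # penalize hitting SOC limits
          soc2 = int(np.clip(soc - np.sign(storage_power), 0, n_soc - 1))
          return soc2, rng.integers(n_gap), reward

      soc, gap = 2, 3
      for t in range(50000):
          a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[soc, gap]))
          soc2, gap2, r = step(soc, gap, a)
          Q[soc, gap, a] += alpha * (r + gamma * Q[soc2, gap2].max() - Q[soc, gap, a])
          soc, gap = soc2, gap2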

  3. Network Supervision of Adult Experience and Learning Dependent Sensory Cortical Plasticity.

    PubMed

    Blake, David T

    2017-06-18

    The brain is capable of remodeling throughout life. The sensory cortices provide a useful preparation for studying neuroplasticity both during development and thereafter. In adulthood, sensory cortices change in the cortical area activated by behaviorally relevant stimuli, by the strength of response within that activated area, and by the temporal profiles of those responses. Evidence supports forms of unsupervised, reinforcement, and fully supervised network learning rules. Studies on experience-dependent plasticity have mostly not controlled for learning, and they find support for unsupervised learning mechanisms. Changes occur with greatest ease in neurons containing α-CamKII, which are pyramidal neurons in layers II/III and layers V/VI. These changes use synaptic mechanisms including long term depression. Synaptic strengthening at NMDA-containing synapses does occur, but its weak association with activity suggests other factors also initiate changes. Studies that control learning find support of reinforcement learning rules and limited evidence of other forms of supervised learning. Behaviorally associating a stimulus with reinforcement leads to a strengthening of cortical response strength and enlarging of response area with poor selectivity. Associating a stimulus with omission of reinforcement leads to a selective weakening of responses. In some preparations in which these associations are not as clearly made, neurons with the most informative discharges are relatively stronger after training. Studies analyzing the temporal profile of responses associated with omission of reward, or of plasticity in studies with different discriminanda but statistically matched stimuli, support the existence of limited supervised network learning. © 2017 American Physiological Society. Compr Physiol 7:977-1008, 2017. Copyright © 2017 John Wiley & Sons, Inc.

  4. Electrophysiological correlates of reinforcement learning in young people with Tourette syndrome with and without co-occurring ADHD symptoms.

    PubMed

    Shephard, Elizabeth; Jackson, Georgina M; Groom, Madeleine J

    2016-06-01

    Altered reinforcement learning is implicated in the causes of Tourette syndrome (TS) and attention-deficit/hyperactivity disorder (ADHD). TS and ADHD frequently co-occur but how this affects reinforcement learning has not been investigated. We examined the ability of young people with TS (n=18), TS+ADHD (N=17), ADHD (n=13) and typically developing controls (n=20) to learn and reverse stimulus-response (S-R) associations based on positive and negative reinforcement feedback. We used a 2 (TS-yes, TS-no)×2 (ADHD-yes, ADHD-no) factorial design to assess the effects of TS, ADHD, and their interaction on behavioural (accuracy, RT) and event-related potential (stimulus-locked P3, feedback-locked P2, feedback-related negativity, FRN) indices of learning and reversing the S-R associations. TS was associated with intact learning and reversal performance and largely typical ERP amplitudes. ADHD was associated with lower accuracy during S-R learning and impaired reversal learning (significantly reduced accuracy and a trend for smaller P3 amplitude). The results indicate that co-occurring ADHD symptoms impair reversal learning in TS+ADHD. The implications of these findings for behavioural tic therapies are discussed. Copyright © 2016 ISDN. Published by Elsevier Ltd. All rights reserved.

  5. The Use of Reinforcement Procedures in Teaching Reading to Rural Culturally Deprived Children.

    ERIC Educational Resources Information Center

    Egeland, Byron

    A group of culturally deprived children with severe reading and behavior problems was systematically given tangible reinforcers while learning to read. Twelve second-grade and 12 third-grade boys from a rural and lower socioeconomic background were taught reading with the use of tangible reinforcers (E group). Four similar control groups (C group)…

  6. Dopamine D2 Receptor Signaling in the Nucleus Accumbens Comprises a Metabolic-Cognitive Brain Interface Regulating Metabolic Components of Glucose Reinforcement.

    PubMed

    Michaelides, Michael; Miller, Michael L; DiNieri, Jennifer A; Gomez, Juan L; Schwartz, Elizabeth; Egervari, Gabor; Wang, Gene Jack; Mobbs, Charles V; Volkow, Nora D; Hurd, Yasmin L

    2017-11-01

    Appetitive drive is influenced by coordinated interactions between brain circuits that regulate reinforcement and homeostatic signals that control metabolism. Glucose modulates striatal dopamine (DA) and regulates appetitive drive and reinforcement learning. Striatal DA D2 receptors (D2Rs) also regulate reinforcement learning and are implicated in glucose-related metabolic disorders. Nevertheless, interactions between striatal D2R and peripheral glucose have not been previously described. Here we show that manipulations involving striatal D2R signaling coincide with perseverative and impulsive-like responding for sucrose, a disaccharide consisting of fructose and glucose. Fructose conveys orosensory (ie, taste) reinforcement but does not convey metabolic (ie, nutrient-derived) reinforcement. Glucose however conveys orosensory reinforcement but unlike fructose, it is a major metabolic energy source, underlies sustained reinforcement, and activates striatal circuitry. We found that mice with deletion of dopamine- and cAMP-regulated neuronal phosphoprotein (DARPP-32) exclusively in D2R-expressing cells exhibited preferential D2R changes in the nucleus accumbens (NAc), a striatal region that critically regulates sucrose reinforcement. These changes coincided with perseverative and impulsive-like responding for sucrose pellets and sustained reinforcement learning of glucose-paired flavors. These mice were also characterized by significant glucose intolerance (ie, impaired glucose utilization). Systemic glucose administration significantly attenuated sucrose operant responding and D2R activation or blockade in the NAc bidirectionally modulated blood glucose levels and glucose tolerance. Collectively, these results implicate NAc D2R in regulating both peripheral glucose levels and glucose-dependent reinforcement learning behaviors and highlight the notion that glucose metabolic impairments arising from disrupted NAc D2R signaling are involved in compulsive and perseverative feeding behaviors.

  7. Frontal Theta Reflects Uncertainty and Unexpectedness during Exploration and Exploitation

    PubMed Central

    Figueroa, Christina M.; Cohen, Michael X; Frank, Michael J.

    2012-01-01

    In order to understand the exploitation/exploration trade-off in reinforcement learning, previous theoretical and empirical accounts have suggested that increased uncertainty may precede the decision to explore an alternative option. To date, the neural mechanisms that support the strategic application of uncertainty-driven exploration remain underspecified. In this study, electroencephalography (EEG) was used to assess trial-to-trial dynamics relevant to exploration and exploitation. Theta-band activities over middle and lateral frontal areas have previously been implicated in EEG studies of reinforcement learning and strategic control. It was hypothesized that these areas may interact during top-down strategic behavioral control involved in exploratory choices. Here, we used a dynamic reward–learning task and an associated mathematical model that predicted individual response times. This reinforcement-learning model generated value-based prediction errors and trial-by-trial estimates of exploration as a function of uncertainty. Mid-frontal theta power correlated with unsigned prediction error, although negative prediction errors had greater power overall. Trial-to-trial variations in response-locked frontal theta were linearly related to relative uncertainty and were larger in individuals who used uncertainty to guide exploration. This finding suggests that theta-band activities reflect prefrontal-directed strategic control during exploratory choices. PMID:22120491
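
    To make the idea of uncertainty-driven exploration concrete, the toy script below (not the authors' model; the two-option task, the Kalman-style update, and all parameter values are assumptions) tracks a mean and an uncertainty estimate for each option and biases choice toward the option with higher relative uncertainty.

```python
import numpy as np

# Toy sketch (not the paper's model): two-armed bandit where choices are
# biased toward the option with higher *relative uncertainty*, so exploration
# is uncertainty-driven rather than purely random.
rng = np.random.default_rng(0)

true_means = np.array([0.4, 0.6])     # hypothetical reward probabilities
mu = np.zeros(2)                      # estimated mean reward per option
var = np.ones(2)                      # uncertainty (variance) per option
obs_noise = 0.1                       # assumed observation noise
phi = 2.0                             # weight of the uncertainty bonus
beta = 5.0                            # softmax inverse temperature

for t in range(500):
    relative_uncertainty = np.sqrt(var) - np.sqrt(var).mean()
    util = mu + phi * relative_uncertainty
    p = np.exp(beta * util) / np.exp(beta * util).sum()
    choice = rng.choice(2, p=p)
    reward = float(rng.random() < true_means[choice])
    # Kalman-style update: the learning rate shrinks as uncertainty shrinks
    k = var[choice] / (var[choice] + obs_noise)
    mu[choice] += k * (reward - mu[choice])
    var[choice] *= (1.0 - k)

print("estimated means:", mu.round(2))
```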

  8. Tuning fuzzy PD and PI controllers using reinforcement learning.

    PubMed

    Boubertakh, Hamid; Tadjine, Mohamed; Glorennec, Pierre-Yves; Labiod, Salim

    2010-10-01

    In this paper, we propose new auto-tuning fuzzy PD and PI controllers based on the reinforcement Q-learning (QL) algorithm for SISO (single-input single-output) and TITO (two-input two-output) systems. We first investigate the design parameters and settings of a typical class of Fuzzy PD (FPD) and Fuzzy PI (FPI) controllers: zero-order Takagi-Sugeno controllers with equidistant triangular membership functions for the inputs, equidistant singleton membership functions for the output, Larsen's implication method, and the average sum defuzzification method. Second, the analytical structures of these typical fuzzy PD and PI controllers are compared to their classical PD and PI counterparts. Finally, the effectiveness of the proposed method is demonstrated through simulation examples. Copyright © 2010 ISA. Published by Elsevier Ltd. All rights reserved.
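
    As a loose illustration of tuning a conventional controller by reinforcement, the sketch below uses tabular Q-learning to choose an output scaling gain for a plain (non-fuzzy) PI loop on a first-order plant. The plant, the gain grid, the reward, and the learning constants are all invented for the example; the paper's fuzzy PD/PI structure itself is not reproduced here.

```python
import numpy as np

# Minimal sketch (all numbers hypothetical): Q-learning adjusts the output
# scaling gain of a PI controller for a first-order plant dx/dt = -x + g*u_pi.
# This illustrates reinforcement-based gain tuning only, not the paper's
# fuzzy PD/PI design.
rng = np.random.default_rng(1)
gains = np.linspace(0.2, 3.0, 8)          # candidate scaling gains (states)
Q = np.zeros((len(gains), 3))             # actions: 0=decrease, 1=keep, 2=increase
alpha, gamma, eps = 0.2, 0.9, 0.2
dt, horizon, setpoint = 0.02, 300, 1.0

def episode_cost(g):
    """Integrated absolute tracking error of the PI loop with scaling gain g."""
    x, integ, cost = 0.0, 0.0, 0.0
    for _ in range(horizon):
        e = setpoint - x
        integ += e * dt
        u = g * (1.0 * e + 0.5 * integ)   # PI law with Kp=1.0, Ki=0.5 (assumed)
        x += dt * (-x + u)                # first-order plant, Euler step
        cost += abs(e) * dt
    return cost

s = 0                                      # start at the smallest gain
for _ in range(400):
    a = int(rng.integers(3)) if rng.random() < eps else int(Q[s].argmax())
    s_next = int(np.clip(s + (a - 1), 0, len(gains) - 1))
    r = -episode_cost(gains[s_next])       # reward = negative tracking cost
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("preferred scaling gain:", round(float(gains[int(Q.max(axis=1).argmax())]), 2))
```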

  9. A self-learning rule base for command following in dynamical systems

    NASA Technical Reports Server (NTRS)

    Tsai, Wei K.; Lee, Hon-Mun; Parlos, Alexander

    1992-01-01

    In this paper, a self-learning Rule Base for command following in dynamical systems is presented. The learning is accomplished through reinforcement learning using an associative memory called SAM. The main advantage of SAM is that it is a function approximator with explicit storage of training samples. A learning algorithm patterned after dynamic programming is proposed. Two artificially created, unstable dynamical systems are used for testing, and the Rule Base is used to generate a feedback control that improves the command-following ability of the otherwise uncontrolled systems. The numerical results are very encouraging: the controlled systems exhibit more stable behavior and a better capability to follow reference commands. The rules resulting from the reinforcement learning are explicitly stored, and they can be modified or augmented by human experts. Due to the overlapping storage scheme of SAM, the stored rules are similar to fuzzy rules.

  10. Reinforcement learning for a biped robot based on a CPG-actor-critic method.

    PubMed

    Nakamura, Yutaka; Mori, Takeshi; Sato, Masa-aki; Ishii, Shin

    2007-08-01

    Animals' rhythmic movements, such as locomotion, are considered to be controlled by neural circuits called central pattern generators (CPGs), which generate oscillatory signals. Motivated by this biological mechanism, studies have been conducted on the rhythmic movements controlled by CPG. As an autonomous learning framework for a CPG controller, we propose in this article a reinforcement learning method we call the "CPG-actor-critic" method. This method introduces a new architecture to the actor, and its training is roughly based on a stochastic policy gradient algorithm presented recently. We apply this method to an automatic acquisition problem of control for a biped robot. Computer simulations show that training of the CPG can be successfully performed by our method, thus allowing the biped robot to not only walk stably but also adapt to environmental changes.

  11. Apprenticeship Learning: Learning to Schedule from Human Experts

    DTIC Science & Technology

    2016-06-09

    approaches to learning such models are based on Markov models, such as reinforcement learning or inverse reinforcement learning (Busoniu, Babuska, and De...via inverse reinforcement learning. In ICML. Barto, A. G., and Mahadevan, S. 2003. Recent advances in hierarchical reinforcement learning. Discrete...of tasks with temporal constraints. In Proc. AAAI, 2110–2116. Odom, P., and Natarajan, S. 2015. Active advice seeking for inverse reinforcement

  12. Working Memory Contributions to Reinforcement Learning Impairments in Schizophrenia

    PubMed Central

    Brown, Jaime K.; Gold, James M.; Waltz, James A.; Frank, Michael J.

    2014-01-01

    Previous research has shown that patients with schizophrenia are impaired in reinforcement learning tasks. However, behavioral learning curves in such tasks originate from the interaction of multiple neural processes, including the basal ganglia- and dopamine-dependent reinforcement learning (RL) system, but also prefrontal cortex-dependent cognitive strategies involving working memory (WM). Thus, it is unclear which specific system induces impairments in schizophrenia. We recently developed a task and computational model allowing us to separately assess the roles of RL (slow, cumulative learning) mechanisms versus WM (fast but capacity-limited) mechanisms in healthy adult human subjects. Here, we used this task to assess patients' specific sources of impairments in learning. In 15 separate blocks, subjects learned to pick one of three actions for stimuli. The number of stimuli to learn in each block varied from two to six, allowing us to separate influences of capacity-limited WM from the incremental RL system. As expected, both patients (n = 49) and healthy controls (n = 36) showed effects of set size and delay between stimulus repetitions, confirming the presence of working memory effects. Patients performed significantly worse than controls overall, but computational model fits and behavioral analyses indicate that these deficits could be entirely accounted for by changes in WM parameters (capacity and reliability), whereas RL processes were spared. These results suggest that the working memory system contributes strongly to learning impairments in schizophrenia. PMID:25297101
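
    The task-plus-model combination described above can be sketched as a mixture of a slow incremental RL learner and a fast but capacity-limited, decaying WM store. The sketch below is a simplified, hypothetical version (parameter names and values are illustrative, not the fitted model) that reproduces the qualitative set-size effect.

```python
import numpy as np

# Schematic RL+WM mixture of the kind described above (illustrative, not the
# fitted model): the RL module learns Q-values incrementally; the WM module
# stores the last outcome one-shot but is capacity-limited and decays; choices
# mix the two policies, with WM weighted less at larger set sizes.
rng = np.random.default_rng(2)

def simulate_block(set_size, n_trials=15, alpha=0.1, capacity=3,
                   decay=0.1, beta=8.0):
    n_actions = 3
    correct = rng.integers(n_actions, size=set_size)        # S-R mapping
    Q = np.ones((set_size, n_actions)) / n_actions           # slow RL values
    W = np.ones((set_size, n_actions)) / n_actions           # fast WM weights
    w_mix = min(1.0, capacity / set_size)                     # reliance on WM
    acc = []
    for _ in range(n_trials):
        for s in rng.permutation(set_size):
            p_rl = np.exp(beta * Q[s]); p_rl /= p_rl.sum()
            p_wm = np.exp(beta * W[s]); p_wm /= p_wm.sum()
            p = w_mix * p_wm + (1 - w_mix) * p_rl
            a = rng.choice(n_actions, p=p)
            r = float(a == correct[s])
            acc.append(r)
            Q[s, a] += alpha * (r - Q[s, a])                  # incremental RL update
            W[s] += decay * (1.0 / n_actions - W[s])          # WM forgetting
            W[s, a] = r                                       # one-shot WM storage
    return np.mean(acc)

for ns in (2, 6):
    print(f"set size {ns}: mean accuracy ~ {simulate_block(ns):.2f}")
```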

  13. Model-based hierarchical reinforcement learning and human action control

    PubMed Central

    Botvinick, Matthew; Weinstein, Ari

    2014-01-01

    Recent work has reawakened interest in goal-directed or ‘model-based’ choice, where decisions are based on prospective evaluation of potential action outcomes. Concurrently, there has been growing attention to the role of hierarchy in decision-making and action control. We focus here on the intersection between these two areas of interest, considering the topic of hierarchical model-based control. To characterize this form of action control, we draw on the computational framework of hierarchical reinforcement learning, using this to interpret recent empirical findings. The resulting picture reveals how hierarchical model-based mechanisms might play a special and pivotal role in human decision-making, dramatically extending the scope and complexity of human behaviour. PMID:25267822

  14. Is Teaching Simple Surgical Skills Using an Operant Learning Program More Effective Than Teaching by Demonstration?

    PubMed

    Levy, I Martin; Pryor, Karen W; McKeon, Theresa R

    2016-04-01

    A surgical procedure is a complex behavior that can be constructed from foundation or component behaviors. Both the component and the composite behaviors built from them are much more likely to recur if they are reinforced (operant learning). Behaviors in humans have been successfully reinforced using the acoustic stimulus from a mechanical clicker, where the clicker serves as a conditioned reinforcer that communicates in a way that is language- and judgment-free; however, to our knowledge, the use of operant-learning principles has not been formally evaluated for acquisition of surgical skills. Two surgical tasks were taught and compared using two teaching strategies: (1) an operant learning methodology using a conditioned, acoustic reinforcer (a clicker) for positive reinforcement; and (2) a more classical approach using demonstration alone. Our goal was to determine whether a group that is taught a surgical skill using an operant learning procedure would more precisely perform that skill than a group that is taught by demonstration alone. Two specific behaviors, "tying the locking, sliding knot" and "making a low-angle drill hole," were taught to the 2014 Postgraduate Year (PGY)-1 class and first- and second-year medical students, using an operant learning procedure incorporating precise scripts along with acoustic feedback. The control groups, composed of PGY-1 and -2 nonorthopaedic surgical residents and first- and second-year medical students, were taught using demonstration alone. The precision and speed of each behavior were recorded for each individual by a single experienced surgeon, skilled in operant learning. The groups were then compared. The operant learning group achieved better precision tying the locking, sliding knot than did the control group. Twelve of the 12 test group learners tied the knot and precisely performed all six component steps, whereas only four of the 12 control group learners tied the knot and correctly performed all six component steps (the test group median was 10 [range, 10-10], the control group median was 0 [range, 0-10], p = 0.004). However, the median "time to tie the first knot" for the test group was longer than for the control group (test group median 271 seconds [range, 184-626 seconds], control group median 163 seconds [range, 93-900 seconds], p = 0.017), whereas the "time to tie 10 of the locking, sliding knots" was the same for both groups (test group mean 95 seconds ± SD = 15 [range, 67-120 seconds], control group mean 95 seconds ± SD = 28 [range, 62-139 seconds], p = 0.996). For the low-angle drill hole test, the test group more consistently achieved the ideal six-step behavior for precisely drilling the low-angle hole compared with the control group (p = 0.006 for the median number of technique success comparison with an odds ratio [at the 95% confidence interval] of 82.3 [29.1-232.8]). The mean time to drill 10 low-angle holes was not different between the test group (mean 193 seconds ± SD = 26 [range, 153-222 seconds]) and the control group (mean 146 seconds ± SD = 63 [range, 114-294 seconds]) (p = 0.084). Operant learning occurs as the behavior is constructed and highly reinforced, with the result measured not in the time saved but in the ultimate outcome of an accurately built complex behavior. Level II, therapeutic study.

  15. Motivation of extended behaviors by anterior cingulate cortex.

    PubMed

    Holroyd, Clay B; Yeung, Nick

    2012-02-01

    Intense research interest over the past decade has yielded diverse and often discrepant theories about the function of anterior cingulate cortex (ACC). In particular, a dichotomy has emerged between neuropsychological theories suggesting a primary role for ACC in motivating or 'energizing' behavior, and neuroimaging-inspired theories emphasizing its contribution to cognitive control and reinforcement learning. To reconcile these views, we propose that ACC supports the selection and maintenance of 'options' - extended, context-specific sequences of behavior directed toward particular goals - that are learned through a process of hierarchical reinforcement learning. This theory accounts for ACC activity in relation to learning and control while simultaneously explaining the effects of ACC damage as disrupting the motivational context supporting the production of goal-directed action sequences. Copyright © 2011 Elsevier Ltd. All rights reserved.

  16. Reinforcement learning for stabilizing an inverted pendulum naturally leads to intermittent feedback control as in human quiet standing.

    PubMed

    Michimoto, Kenjiro; Suzuki, Yasuyuki; Kiyono, Ken; Kobayashi, Yasushi; Morasso, Pietro; Nomura, Taishin

    2016-08-01

    Intermittent feedback control for stabilizing human upright stance is a promising strategy, alternative to the standard time-continuous stiffness control. Here we show that such an intermittent controller can be established naturally through reinforcement learning. To this end, we used a single inverted pendulum model of the upright posture and a very simple reward function that gives a certain amount of punishment when the inverted pendulum falls or changes its position in the state space. We found that the acquired feedback controller exhibits hallmarks of the intermittent feedback control strategy, namely the action of the feedback controller is switched off intermittently when the state of the pendulum is located near the stable manifold of the unstable saddle-type upright equilibrium of the inverted pendulum with no active control: this action provides an opportunity to exploit transiently converging dynamics toward the unstable upright position with no help of the active feedback control. We then speculate about a possible physiological mechanism of such reinforcement learning, and suggest that it may be related to the neural activity in the pedunculopontine tegmental nucleus (PPN) of the brainstem. This hypothesis is supported by recent evidence indicating that PPN might play critical roles in the generation and regulation of postural tonus, reward prediction, as well as postural instability in patients with Parkinson's disease.
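
    A minimal tabular Q-learning setup of the kind described (sparse punishment for falling or for drifting between state-space cells) might look like the sketch below. The discretization, torque limits, and dynamics constants are assumptions, and this toy only illustrates the learning setup; it does not by itself reproduce the intermittent controller reported in the study.

```python
import numpy as np

# Minimal tabular Q-learning sketch (all constants are assumptions): a single
# inverted pendulum is punished (-1) for falling and slightly punished for
# moving between state-space cells, echoing the reward described above.
rng = np.random.default_rng(3)
g_over_l, dt = 9.81, 0.02                     # pendulum constant and time step
torques = np.array([-2.0, 0.0, 2.0])          # available control actions
theta_bins = np.linspace(-0.5, 0.5, 10)       # rad
omega_bins = np.linspace(-2.0, 2.0, 10)       # rad/s
Q = np.zeros((11, 11, len(torques)))
alpha, gamma, eps = 0.1, 0.98, 0.1

def discretize(theta, omega):
    return int(np.digitize(theta, theta_bins)), int(np.digitize(omega, omega_bins))

lengths = []
for episode in range(2000):
    theta, omega = rng.uniform(-0.05, 0.05), 0.0
    s = discretize(theta, omega)
    for step in range(500):
        a = int(rng.integers(len(torques))) if rng.random() < eps else int(Q[s].argmax())
        omega += dt * (g_over_l * np.sin(theta) + torques[a])   # Euler step
        theta += dt * omega
        s_next = discretize(theta, omega)
        fallen = abs(theta) > 0.5
        r = -1.0 if fallen else (-0.01 if s_next != s else 0.0)  # punishment only
        target = r if fallen else r + gamma * Q[s_next].max()
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
        if fallen:
            break
    lengths.append(step + 1)

print("mean survival time, last 100 episodes:", round(np.mean(lengths[-100:]) * dt, 2), "s")
```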

  17. Managing Learning Disabled Students' Academic Frustration through Self-Control.

    ERIC Educational Resources Information Center

    Ammer, Jerome J.

    1982-01-01

    Teachers can help students with learning and behavior disorders in middle and secondary grades develop self-control through a strategy in which students are taught to stop, look, listen, and think before carrying out a task. The final step is for students to reinforce themselves. (CL)

  18. Learning-based traffic signal control algorithms with neighborhood information sharing: An application for sustainable mobility

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aziz, H. M. Abdul; Zhu, Feng; Ukkusuri, Satish V.

    Here, this research applies an R-Markov Average Reward Technique based reinforcement learning (RL) algorithm, namely RMART, to the vehicular signal control problem, leveraging information sharing among signal controllers in a connected vehicle environment. We implemented the algorithm in a network of 18 signalized intersections and compared the performance of RMART with fixed, adaptive, and variant RL schemes. Results show significant improvement in system performance for the RMART algorithm with information sharing over both traditional fixed signal timing plans and real-time adaptive control schemes. Additionally, the comparison with reinforcement learning algorithms including Q-learning and SARSA indicates that RMART performs better at higher congestion levels. Further, a multi-reward structure is proposed that dynamically adjusts the reward function with varying congestion states at the intersection. Finally, the results from test networks show significant reductions in emissions (CO, CO2, NOx, VOC, PM10) when RL algorithms are implemented compared to fixed signal timings and adaptive schemes.
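
    RMART builds on the average-reward (R-learning) update rather than discounted Q-learning. The sketch below shows that update for a single, isolated intersection with a made-up queueing model and reward; the neighborhood information sharing that is central to the paper is omitted, and all constants are assumptions.

```python
import numpy as np

# Sketch of the average-reward (R-learning) update that RMART-style signal
# controllers build on, for one isolated intersection.  Queue model, state
# coding and reward are illustrative assumptions only.
rng = np.random.default_rng(4)
max_q = 10                                         # queues clipped to 0..max_q
n_actions = 2                                      # 0: serve N-S, 1: serve E-W
R = np.zeros((max_q + 1, max_q + 1, n_actions))    # relative action values
rho, alpha, beta, eps = 0.0, 0.1, 0.01, 0.1        # avg reward and step sizes

def step(q_ns, q_ew, a):
    """One signal cycle: random arrivals, the served approach discharges."""
    q_ns = min(max_q, q_ns + rng.poisson(1))
    q_ew = min(max_q, q_ew + rng.poisson(2))
    if a == 0:
        q_ns = max(0, q_ns - 4)
    else:
        q_ew = max(0, q_ew - 4)
    reward = -(q_ns + q_ew)                        # fewer queued vehicles is better
    return q_ns, q_ew, reward

q_ns, q_ew = 0, 0
for t in range(50000):
    greedy = int(R[q_ns, q_ew].argmax())
    a = int(rng.integers(n_actions)) if rng.random() < eps else greedy
    nq_ns, nq_ew, r = step(q_ns, q_ew, a)
    td = r - rho + R[nq_ns, nq_ew].max() - R[q_ns, q_ew, a]
    R[q_ns, q_ew, a] += alpha * td
    if a == greedy:                                # update avg reward on greedy steps
        rho += beta * (r - rho + R[nq_ns, nq_ew].max() - R[q_ns, q_ew].max())
    q_ns, q_ew = nq_ns, nq_ew

print("learned average reward per cycle:", round(rho, 2))
```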

  19. Conditioning 1-6 Month Old Infants by Means of Myoelectrically Controlled Reinforcement.

    ERIC Educational Resources Information Center

    Stack, Dale M.; McDonnell, Paul M.

    1995-01-01

    In order to evaluate the feasibility of fitting myoelectrically controlled prosthetic arms on infants, this study examined whether 32 infants (1-6 months) could learn to control environmental contingencies by means of contracting the forearm flexor muscle group. Results indicated that older subjects (age greater than 104 days) demonstrated learning,…

  20. Disrupted reinforcement learning and maladaptive behavior in women with a history of childhood sexual abuse: a high-density event-related potential study.

    PubMed

    Pechtel, Pia; Pizzagalli, Diego A

    2013-05-01

    Childhood sexual abuse (CSA) has been associated with psychopathology, particularly major depressive disorder (MDD), and high-risk behaviors. Despite the epidemiological data available, the mechanisms underlying these maladaptive outcomes remain poorly understood. We examined whether a history of CSA, particularly in conjunction with a past episode of MDD, is associated with behavioral and neural dysfunction in reinforcement learning, and whether such dysfunction is linked to maladaptive behavior. Participants completed a clinical evaluation and a probabilistic reinforcement task while 128-channel event-related potentials were recorded, in an academic setting with participants recruited from the community. Participants were 15 women with a history of CSA and remitted MDD (CSA + rMDD), 16 women with remitted MDD and no history of CSA (rMDD), and 18 healthy women (controls). CSA was defined as three or more episodes of coerced sexual contact (mean [SD] duration, 3.00 [2.20] years) between the ages of 7 and 12 years by at least 1 male perpetrator. Participants' preference for choosing the most rewarded stimulus and avoiding the most punished stimulus was evaluated. The feedback-related negativity and error-related negativity (hypothesized to reflect activation in the anterior cingulate cortex) were used as electrophysiological indices of reinforcement learning. No group differences emerged in the acquisition of reinforcement contingencies. In trials requiring participants to rely partially or exclusively on previously rewarded information, the CSA + rMDD group showed (1) lower accuracy (relative to both controls and the rMDD group), (2) blunted electrophysiological differentiation between correct and incorrect responses (relative to controls), and (3) increased activation in the subgenual anterior cingulate cortex (relative to the rMDD group). A history of CSA was not associated with impairments in avoiding the most punished stimulus. Self-harm and suicidal behaviors correlated with poorer performance on previously rewarded, but not previously punished, trials. Irrespective of past MDD episodes, women with a history of CSA showed neural and behavioral deficits in utilizing previous reinforcement to optimize decision making in the absence of feedback (blunted "Go learning"). Although our study provides initial evidence for reward-specific deficits associated with CSA, future research is warranted to determine if disrupted positive reinforcement learning predicts high-risk behavior following CSA.

  1. Interactions Among Working Memory, Reinforcement Learning, and Effort in Value-Based Choice: A New Paradigm and Selective Deficits in Schizophrenia.

    PubMed

    Collins, Anne G E; Albrecht, Matthew A; Waltz, James A; Gold, James M; Frank, Michael J

    2017-09-15

    When studying learning, researchers directly observe only the participants' choices, which are often assumed to arise from a unitary learning process. However, a number of separable systems, such as working memory (WM) and reinforcement learning (RL), contribute simultaneously to human learning. Identifying each system's contributions is essential for mapping the neural substrates contributing in parallel to behavior; computational modeling can help to design tasks that allow such a separable identification of processes and infer their contributions in individuals. We present a new experimental protocol that separately identifies the contributions of RL and WM to learning, is sensitive to parametric variations in both, and allows us to investigate whether the processes interact. In experiments 1 and 2, we tested this protocol with healthy young adults (n = 29 and n = 52, respectively). In experiment 3, we used it to investigate learning deficits in medicated individuals with schizophrenia (n = 49 patients, n = 32 control subjects). Experiments 1 and 2 established WM and RL contributions to learning, as evidenced by parametric modulations of choice by load and delay and reward history, respectively. They also showed interactions between WM and RL, where RL was enhanced under high WM load. Moreover, we observed a cost of mental effort when controlling for reinforcement history: participants preferred stimuli they encountered under low WM load. Experiment 3 revealed selective deficits in WM contributions and preserved RL value learning in individuals with schizophrenia compared with control subjects. Computational approaches allow us to disentangle contributions of multiple systems to learning and, consequently, to further our understanding of psychiatric diseases. Copyright © 2017 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.

  2. Investigation of Drive-Reinforcement Learning and Application of Learning to Flight Control

    DTIC Science & Technology

    1993-08-01

    Attachment 1 138 Reprint of: Baird, L. (1991). Learning and Adaptive Hybrid Systems for Nonlinear Control, CSDL Report T-1099, M.S. Thesis, Department of...Aircraft, CSDL Report T-1127, S.M. Thesis, Department of Aeronautics and Astronautics, M.I.T. Attachment 3 351. Reprint of: Atkins, S. (1993...Incremental Synthesis of Optimal Control Laws Using Learning Algorithms, CSDL Report T-1181, S.M. Thesis, Department of Aeronautics and Astronautics, M.I.T.

  3. Using Reinforcement Learning to Provide Stable Brain-Machine Interface Control Despite Neural Input Reorganization

    PubMed Central

    Pohlmeyer, Eric A.; Mahmoudi, Babak; Geng, Shijia; Prins, Noeline W.; Sanchez, Justin C.

    2014-01-01

    Brain-machine interface (BMI) systems give users direct neural control of robotic, communication, or functional electrical stimulation systems. As BMI systems begin transitioning from laboratory settings into activities of daily living, an important goal is to develop neural decoding algorithms that can be calibrated with a minimal burden on the user, provide stable control for long periods of time, and can be responsive to fluctuations in the decoder’s neural input space (e.g. neurons appearing or being lost amongst electrode recordings). These are significant challenges for static neural decoding algorithms that assume stationary input/output relationships. Here we use an actor-critic reinforcement learning architecture to provide an adaptive BMI controller that can successfully adapt to dramatic neural reorganizations, can maintain its performance over long time periods, and which does not require the user to produce specific kinetic or kinematic activities to calibrate the BMI. Two marmoset monkeys used the Reinforcement Learning BMI (RLBMI) to successfully control a robotic arm during a two-target reaching task. The RLBMI was initialized using random initial conditions, and it quickly learned to control the robot from brain states using only a binary evaluative feedback regarding whether previously chosen robot actions were good or bad. The RLBMI was able to maintain control over the system throughout sessions spanning multiple weeks. Furthermore, the RLBMI was able to quickly adapt and maintain control of the robot despite dramatic perturbations to the neural inputs, including a series of tests in which the neuron input space was deliberately halved or doubled. PMID:24498055
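
    The core loop of an actor-critic decoder trained only from binary good/bad feedback can be sketched as below. The "neural features", the two-target coding, and the linear actor and critic are stand-ins invented for the example; the actual RLBMI used neural-network function approximators on recorded spike data.

```python
import numpy as np

# Schematic actor-critic loop driven only by binary evaluative feedback
# (+1 good / -1 bad), in the spirit of the RLBMI described above.  The fake
# "neural features" and all sizes are invented for the sketch.
rng = np.random.default_rng(5)
n_features, n_actions = 20, 2                 # fake neural features, 2 targets
W_actor = np.zeros((n_actions, n_features))   # linear actor (softmax policy)
w_critic = np.zeros(n_features)               # linear state-value critic
alpha_a, alpha_c = 0.05, 0.1

# Hypothetical encoding: each target has a distinct mean activity pattern,
# plus noise, standing in for the recorded brain state on each trial.
prototypes = rng.normal(size=(n_actions, n_features))

hits = []
for trial in range(2000):
    target = rng.integers(n_actions)
    x = prototypes[target] + 0.5 * rng.normal(size=n_features)
    logits = W_actor @ x
    p = np.exp(logits - logits.max()); p /= p.sum()
    a = rng.choice(n_actions, p=p)
    r = 1.0 if a == target else -1.0          # binary evaluative feedback
    delta = r - w_critic @ x                   # TD error for a one-step trial
    w_critic += alpha_c * delta * x
    grad = -p[:, None] * x[None, :]           # d log pi / d W for a softmax actor
    grad[a] += x
    W_actor += alpha_a * delta * grad
    hits.append(r > 0)

print("success rate, last 200 trials:", np.mean(hits[-200:]))
```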

  4. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning.

    PubMed

    Franklin, Nicholas T; Frank, Michael J

    2015-12-25

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning three Marr's levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments.
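
    The functional idea, a learning rate that is dynamically scaled by an uncertainty signal, can be illustrated with a simple delta rule. This is a loose behavioral analogue, not the paper's spiking or Bayesian models: learning speeds up when outcomes become surprising, as after a change-point, and slows down in stable stretches.

```python
import numpy as np

# Toy analogue (not the paper's neural or Bayesian models): a delta rule whose
# learning rate is scaled by an uncertainty signal, so learning speeds up after
# change-points and slows down during stable, merely noisy stretches.
rng = np.random.default_rng(6)
n_trials = 400
p_reward = np.where(np.arange(n_trials) < 200, 0.8, 0.2)   # change-point at 200

v = 0.5                     # value estimate
uncertainty = 0.5           # running estimate of how surprising outcomes are
alpha_min, alpha_max, tau = 0.02, 0.5, 0.05
values = []
for t in range(n_trials):
    r = float(rng.random() < p_reward[t])
    pe = r - v
    # uncertainty tracks the magnitude of recent prediction errors
    uncertainty += tau * (abs(pe) - uncertainty)
    alpha = alpha_min + (alpha_max - alpha_min) * uncertainty
    v += alpha * pe
    values.append(v)

print("value just before / after the change-point:",
      round(values[199], 2), "/", round(values[249], 2))
```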

  5. Control of a simulated arm using a novel combination of Cerebellar learning mechanisms

    NASA Technical Reports Server (NTRS)

    Assad, C.; Hartmann, M.; Paulin, M. G.

    2001-01-01

    We present a model of cerebellar cortex that combines two types of learning: feedforward predictive association based on local Hebbian-type learning between granule cell ascending branch and parallel fiber inputs, and reinforcement learning with feedback error correction based on climbing fiber activity.

  6. Dopaminergic control of motivation and reinforcement learning: a closed-circuit account for reward-oriented behavior.

    PubMed

    Morita, Kenji; Morishima, Mieko; Sakai, Katsuyuki; Kawaguchi, Yasuo

    2013-05-15

    Humans and animals take actions quickly when they expect that the actions lead to reward, reflecting their motivation. Injection of dopamine receptor antagonists into the striatum has been shown to slow such reward-seeking behavior, suggesting that dopamine is involved in the control of motivational processes. Meanwhile, neurophysiological studies have revealed that phasic response of dopamine neurons appears to represent reward prediction error, indicating that dopamine plays central roles in reinforcement learning. However, previous attempts to elucidate the mechanisms of these dopaminergic controls have not fully explained how the motivational and learning aspects are related and whether they can be understood by the way the activity of dopamine neurons itself is controlled by their upstream circuitries. To address this issue, we constructed a closed-circuit model of the corticobasal ganglia system based on recent findings regarding intracortical and corticostriatal circuit architectures. Simulations show that the model could reproduce the observed distinct motivational effects of D1- and D2-type dopamine receptor antagonists. Simultaneously, our model successfully explains the dopaminergic representation of reward prediction error as observed in behaving animals during learning tasks and could also explain distinct choice biases induced by optogenetic stimulation of the D1 and D2 receptor-expressing striatal neurons. These results indicate that the suggested roles of dopamine in motivational control and reinforcement learning can be understood in a unified manner through a notion that the indirect pathway of the basal ganglia represents the value of states/actions at a previous time point, an empirically driven key assumption of our model.

  7. Reinforcement learning state estimator.

    PubMed

    Morimoto, Jun; Doya, Kenji

    2007-03-01

    In this study, we propose a novel use of reinforcement learning for estimating hidden variables and parameters of nonlinear dynamical systems. A critical issue in hidden-state estimation is that we cannot directly observe estimation errors. However, by defining errors of observable variables as a delayed penalty, we can apply a reinforcement learning framework to state estimation problems. Specifically, we derive a method to construct a nonlinear state estimator by finding an appropriate feedback input gain using the policy gradient method. We tested the proposed method on single-pendulum dynamics and showed that the joint angle variable could be successfully estimated by observing only the angular velocity, and vice versa. In addition, we show that we could acquire a state estimator for the pendulum swing-up task in which a swing-up controller is also acquired by reinforcement learning simultaneously. Furthermore, we demonstrate that it is possible to estimate the dynamics of the pendulum itself while the hidden variables are estimated in the pendulum swing-up task. Application of the proposed method to a two-link biped model is also presented.

  8. Disrupted Reinforcement Learning and Maladaptive Behavior in Women with a History of Childhood Sexual Abuse: A High-Density Event-Related Potential Study

    PubMed Central

    Pechtel, Pia; Pizzagalli, Diego A.

    2013-01-01

    Context Childhood sexual abuse (CSA) has been associated with psychopathology, particularly major depressive disorder (MDD), and high-risk behaviors. Despite grave epidemiological data, the mechanisms underlying these maladaptive outcomes remain poorly understood. Objective We examined whether CSA history, particularly in conjunction with past MDD, is associated with behavioral and neural dysfunction in reinforcement learning, and whether such dysfunction is linked to maladaptive behavior. Design Participants completed a clinical evaluation and a probabilistic reinforcement task while 128-channel event-related potentials were recorded. Setting Academic setting; participants recruited from the community. Participants Fifteen remitted depressed females with CSA history (CSA+rMDD), 16 remitted depressed females without CSA history (rMDD), and 18 healthy females. Main Outcome Measures Participants’ preference for choosing the most rewarded stimulus and avoiding the most punished stimulus was evaluated. The feedback-related negativity (FRN) and error-related negativity (ERN)–hypothesized to reflect activation in the anterior cingulate cortex–were used as electrophysiological indices of reinforcement learning. Results No group differences emerged in the acquisition of reinforcement contingencies. In trials requiring to rely partially or exclusively on previously rewarded information, the CSA+rMDD group showed (1) lower accuracy (relative to both controls and rMDD), (2) blunted electrophysiological differentiation between correct and incorrect responses (relative to controls), and (3) increased activation in the subgenual anterior cingulate cortex (relative to rMDD). CSA history was not associated with impairments in avoiding the most punished stimulus. Self-harm and suicidal behaviors correlated with poorer performance of previously rewarded–but not previously punished–trials. Conclusions Irrespective of past MDD, women with CSA histories showed neural and behavioral deficits in utilizing previous reinforcement to optimize decision-making in the absence of feedback (blunted “Go learning”). While the current study provides initial evidence for reward-specific deficits associated with CSA, future research is warranted to determine if disrupted positive reinforcement learning predicts high-risk behavior following CSA. PMID:23487253

  9. Application of fuzzy logic-neural network based reinforcement learning to proximity and docking operations: Special approach/docking testcase results

    NASA Technical Reports Server (NTRS)

    Jani, Yashvant

    1993-01-01

    As part of the RICIS project, the reinforcement learning techniques developed at Ames Research Center are being applied to proximity and docking operations using the Shuttle and Solar Maximum Mission (SMM) satellite simulation. In utilizing these fuzzy learning techniques, we use the Approximate Reasoning based Intelligent Control (ARIC) architecture, and so we use these two terms interchangeably. This activity is carried out in the Software Technology Laboratory utilizing the Orbital Operations Simulator (OOS) and programming/testing support from other contractor personnel. This report is the final deliverable D4 in our milestones and project activity. It provides the test results for the special test case of the approach/docking scenario for the shuttle and SMM satellite. Based on our experience and analysis with the attitude and translational controllers, we have modified the basic configuration of the reinforcement learning algorithm in ARIC. The shuttle translational controller and its implementation in ARIC are described in our deliverable D3. In order to simulate the final approach and docking operations, we have set up this special test case as described in section 2. The ARIC performance results for these operations are discussed in section 3, and conclusions are provided in section 4 along with a summary of the project.

  10. An intelligent agent for optimal river-reservoir system management

    NASA Astrophysics Data System (ADS)

    Rieker, Jeffrey D.; Labadie, John W.

    2012-09-01

    A generalized software package is presented for developing an intelligent agent for stochastic optimization of complex river-reservoir system management and operations. Reinforcement learning is an approach to artificial intelligence for developing a decision-making agent that learns the best operational policies without the need for explicit probabilistic models of hydrologic system behavior. The agent learns these strategies experientially in a Markov decision process through observational interaction with the environment and simulation of the river-reservoir system using well-calibrated models. The graphical user interface for the reinforcement learning process controller includes numerous learning method options and dynamic displays for visualizing the adaptive behavior of the agent. As a case study, the generalized reinforcement learning software is applied to developing an intelligent agent for optimal management of water stored in the Truckee river-reservoir system of California and Nevada for the purpose of streamflow augmentation for water quality enhancement. The intelligent agent successfully learns long-term reservoir operational policies that specifically focus on mitigating water temperature extremes during persistent drought periods that jeopardize the survival of threatened and endangered fish species.

  11. Online Pedagogical Tutorial Tactics Optimization Using Genetic-Based Reinforcement Learning

    PubMed Central

    Lin, Hsuan-Ta; Lee, Po-Ming; Hsiao, Tzu-Chien

    2015-01-01

    Tutorial tactics are policies for an Intelligent Tutoring System (ITS) to decide the next action when there are multiple actions available. Recent research has demonstrated that when the learning content is controlled so as to be the same, different tutorial tactics make a difference in students' learning gains. However, the Reinforcement Learning (RL) techniques that were used in previous studies to induce tutorial tactics are insufficient when encountering large problems and hence were used in an offline manner. Therefore, we introduced a Genetic-Based Reinforcement Learning (GBML) approach to induce tutorial tactics in an online-learning manner without relying on any preexisting dataset. The introduced method can learn a set of rules from the environment in a manner similar to RL. It includes a genetic-based optimizer for the rule-discovery task that generates new rules from the old ones. This increases the scalability of an RL learner for larger problems. The results support our hypothesis about the capability of the GBML method to induce tutorial tactics. This suggests that the GBML method should be favorable for developing real-world ITS applications in the domain of tutorial tactics induction. PMID:26065018

  12. Online Pedagogical Tutorial Tactics Optimization Using Genetic-Based Reinforcement Learning.

    PubMed

    Lin, Hsuan-Ta; Lee, Po-Ming; Hsiao, Tzu-Chien

    2015-01-01

    Tutorial tactics are policies for an Intelligent Tutoring System (ITS) to decide the next action when there are multiple actions available. Recent research has demonstrated that when the learning content is controlled so as to be the same, different tutorial tactics make a difference in students' learning gains. However, the Reinforcement Learning (RL) techniques that were used in previous studies to induce tutorial tactics are insufficient when encountering large problems and hence were used in an offline manner. Therefore, we introduced a Genetic-Based Reinforcement Learning (GBML) approach to induce tutorial tactics in an online-learning manner without relying on any preexisting dataset. The introduced method can learn a set of rules from the environment in a manner similar to RL. It includes a genetic-based optimizer for the rule-discovery task that generates new rules from the old ones. This increases the scalability of an RL learner for larger problems. The results support our hypothesis about the capability of the GBML method to induce tutorial tactics. This suggests that the GBML method should be favorable for developing real-world ITS applications in the domain of tutorial tactics induction.

  13. Working memory contributions to reinforcement learning impairments in schizophrenia.

    PubMed

    Collins, Anne G E; Brown, Jaime K; Gold, James M; Waltz, James A; Frank, Michael J

    2014-10-08

    Previous research has shown that patients with schizophrenia are impaired in reinforcement learning tasks. However, behavioral learning curves in such tasks originate from the interaction of multiple neural processes, including the basal ganglia- and dopamine-dependent reinforcement learning (RL) system, but also prefrontal cortex-dependent cognitive strategies involving working memory (WM). Thus, it is unclear which specific system induces impairments in schizophrenia. We recently developed a task and computational model allowing us to separately assess the roles of RL (slow, cumulative learning) mechanisms versus WM (fast but capacity-limited) mechanisms in healthy adult human subjects. Here, we used this task to assess patients' specific sources of impairments in learning. In 15 separate blocks, subjects learned to pick one of three actions for stimuli. The number of stimuli to learn in each block varied from two to six, allowing us to separate influences of capacity-limited WM from the incremental RL system. As expected, both patients (n = 49) and healthy controls (n = 36) showed effects of set size and delay between stimulus repetitions, confirming the presence of working memory effects. Patients performed significantly worse than controls overall, but computational model fits and behavioral analyses indicate that these deficits could be entirely accounted for by changes in WM parameters (capacity and reliability), whereas RL processes were spared. These results suggest that the working memory system contributes strongly to learning impairments in schizophrenia. Copyright © 2014 the authors 0270-6474/14/3413747-10$15.00/0.

  14. Reinforcement learning of targeted movement in a spiking neuronal model of motor cortex.

    PubMed

    Chadderdon, George L; Neymotin, Samuel A; Kerr, Cliff C; Lytton, William W

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint "forearm" to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (-1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior.
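
    A rate-based caricature of the plasticity rule described above is sketched below: pre/post coincidences charge an eligibility trace, and a global three-valued signal converts the trace into weight changes. The network size, the winner-take-all readout, and the toy "task" are invented for the illustration and are far simpler than the spiking model.

```python
import numpy as np

# Rate-based toy of reward-modulated plasticity with eligibility traces (not
# the spiking model): coincidences of input and output activity charge a
# trace, and a global +1/-1 signal converts the trace into weight changes.
rng = np.random.default_rng(8)
n_in, n_out = 10, 4
W = 0.1 * rng.random((n_out, n_in))
trace = np.zeros_like(W)
eta, trace_decay = 0.05, 0.5
target_unit = 2                              # hypothetical "correct" output unit

for step in range(3000):
    x = (rng.random(n_in) < 0.2).astype(float)          # sparse input activity
    drive = W @ x + 0.1 * rng.random(n_out)             # noisy drive ("babbling")
    y = (drive == drive.max()).astype(float)            # winner-take-all output
    trace = trace_decay * trace + np.outer(y, x)         # pre/post eligibility
    # global reinforcement: +1 if the desired unit won, -1 otherwise
    r = 1.0 if y[target_unit] == 1.0 else -1.0
    W += eta * r * trace
    W = np.clip(W, 0.0, 1.0)

print("mean weight onto the rewarded unit vs. the others:",
      round(W[target_unit].mean(), 2), "vs.",
      round(np.delete(W, target_unit, axis=0).mean(), 2))
```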

  15. The role of within-compound associations in learning about absent cues.

    PubMed

    Witnauer, James E; Miller, Ralph R

    2011-05-01

    When two cues are reinforced together (in compound), most associative models assume that animals learn an associative network that includes direct cue-outcome associations and a within-compound association. All models of associative learning subscribe to the importance of cue-outcome associations, but most models assume that within-compound associations are irrelevant to each cue's subsequent behavioral control. In the present article, we present an extension of Van Hamme and Wasserman's (Learning and Motivation 25:127-151, 1994) model of retrospective revaluation based on learning about absent cues that are retrieved through within-compound associations. The model was compared with a model lacking retrieval through within-compound associations. Simulations showed that within-compound associations are necessary for the model to explain higher-order retrospective revaluation and the observed greater retrospective revaluation after partial reinforcement than after continuous reinforcement alone. These simulations suggest that the associability of an absent stimulus is determined by the extent to which the stimulus is activated through the within-compound association.
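
    The mechanism can be illustrated with a compact Rescorla-Wagner-style simulation of retrospective revaluation: a cue that is absent on a trial but retrieved through its within-compound association is updated with a negative learning rate scaled by the retrieval strength. The update and parameters below are deliberate simplifications of the Van Hamme and Wasserman / Witnauer and Miller models, not the full model.

```python
# Compact Rescorla-Wagner-style sketch of learning about an absent cue that is
# retrieved through a within-compound association (illustrative simplification,
# not the full Van Hamme & Wasserman / Witnauer & Miller model).
beta, alpha_present, k_absent = 0.3, 1.0, 0.8
V = {"A": 0.0, "B": 0.0}      # cue-outcome associative strengths
within = 0.0                  # A-B within-compound association

# Phase 1: AB+ trials (compound reinforced); A and B also become associated.
for _ in range(20):
    error = 1.0 - (V["A"] + V["B"])
    V["A"] += alpha_present * beta * error
    V["B"] += alpha_present * beta * error
    within += beta * (1.0 - within)

v_b_after_phase1 = V["B"]

# Phase 2: A+ trials; B is absent but retrieved via the within-compound
# association, so it is updated with a negative, retrieval-scaled learning rate.
for _ in range(20):
    error = 1.0 - V["A"]               # only A is physically present
    V["A"] += alpha_present * beta * error
    alpha_b = -k_absent * within        # negative associability for the absent cue
    V["B"] += alpha_b * beta * error
    within *= 0.95                      # within-compound association fades slowly

print(f"V(B) after AB+: {v_b_after_phase1:.2f}, after A+ alone: {V['B']:.2f}")
```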

  16. The role of within-compound associations in learning about absent cues

    PubMed Central

    Witnauer, James E.

    2011-01-01

    When two cues are reinforced together (in compound), most associative models assume that animals learn an associative network that includes direct cue–outcome associations and a within-compound association. All models of associative learning subscribe to the importance of cue–outcome associations, but most models assume that within-compound associations are irrelevant to each cue's subsequent behavioral control. In the present article, we present an extension of Van Hamme and Wasserman's (Learning and Motivation 25:127–151, 1994) model of retrospective revaluation based on learning about absent cues that are retrieved through within-compound associations. The model was compared with a model lacking retrieval through within-compound associations. Simulations showed that within-compound associations are necessary for the model to explain higher-order retrospective revaluation and the observed greater retrospective revaluation after partial reinforcement than after continuous reinforcement alone. These simulations suggest that the associability of an absent stimulus is determined by the extent to which the stimulus is activated through the within-compound association. PMID:21264569

  17. Structure identification in fuzzy inference using reinforcement learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1993-01-01

    In our previous work on the GARIC architecture, we have shown that the system can start with the surface structure of the knowledge base (i.e., the linguistic expression of the rules) and learn the deep structure (i.e., the fuzzy membership functions of the labels used in the rules) by using reinforcement learning. Assuming the surface structure, GARIC refines the fuzzy membership functions used in the consequents of the rules using a gradient descent procedure. This hybrid fuzzy logic and reinforcement learning approach can learn to balance a cart-pole system and to back up a truck to its docking location after a few trials. In this paper, we discuss how to do structure identification using reinforcement learning in fuzzy inference systems. This involves identifying both the surface and the deep structure of the knowledge base. The term set of fuzzy linguistic labels used in describing the values of each control variable must be derived. In this process, splitting a label refers to creating new labels which are more granular than the original label, and merging two labels creates a more general label. Splitting and merging of labels directly transform the structure of the action selection network used in GARIC by increasing or decreasing the number of hidden layer nodes.

  18. Spared internal but impaired external reward prediction error signals in major depressive disorder during reinforcement learning.

    PubMed

    Bakic, Jasmina; Pourtois, Gilles; Jepma, Marieke; Duprat, Romain; De Raedt, Rudi; Baeken, Chris

    2017-01-01

    Major depressive disorder (MDD) creates debilitating effects on a wide range of cognitive functions, including reinforcement learning (RL). In this study, we sought to assess whether reward processing as such, or alternatively the complex interplay between motivation and reward, might potentially account for the abnormal reward-based learning in MDD. A total of 35 treatment-resistant MDD patients and 44 age-matched healthy controls (HCs) performed a standard probabilistic learning task. RL was assessed using behavioral measures, computational modeling, and event-related brain potentials (ERPs). MDD patients showed a learning rate comparable to that of HCs. However, they showed decreased lose-shift responses as well as blunted subjective evaluations of the reinforcers used during the task, relative to HCs. Moreover, MDD patients showed normal internal (at the level of error-related negativity, ERN) but abnormal external (at the level of feedback-related negativity, FRN) reward prediction error (RPE) signals during RL, selectively when additional efforts had to be made to establish learning. Collectively, these results lend support to the assumption that MDD does not impair reward processing per se during RL. Instead, it seems to alter the processing of the emotional value of (external) reinforcers during RL, when additional intrinsic motivational processes have to be engaged. © 2016 Wiley Periodicals, Inc.

  19. Attentional Selection Can Be Predicted by Reinforcement Learning of Task-relevant Stimulus Features Weighted by Value-independent Stickiness.

    PubMed

    Balcarras, Matthew; Ardid, Salva; Kaping, Daniel; Everling, Stefan; Womelsdorf, Thilo

    2016-02-01

    Attention includes processes that evaluate stimuli relevance, select the most relevant stimulus against less relevant stimuli, and bias choice behavior toward the selected information. It is not clear how these processes interact. Here, we captured these processes in a reinforcement learning framework applied to a feature-based attention task that required macaques to learn and update the value of stimulus features while ignoring nonrelevant sensory features, locations, and action plans. We found that value-based reinforcement learning mechanisms could account for feature-based attentional selection and choice behavior but required a value-independent stickiness selection process to explain selection errors while at asymptotic behavior. By comparing different reinforcement learning schemes, we found that trial-by-trial selections were best predicted by a model that only represents expected values for the task-relevant feature dimension, with nonrelevant stimulus features and action plans having only a marginal influence on covert selections. These findings show that attentional control subprocesses can be described by (1) the reinforcement learning of feature values within a restricted feature space that excludes irrelevant feature dimensions, (2) a stochastic selection process on feature-specific value representations, and (3) value-independent stickiness toward previous feature selections akin to perseveration in the motor domain. We speculate that these three mechanisms are implemented by distinct but interacting brain circuits and that the proposed formal account of feature-based stimulus selection will be important to understand how attentional subprocesses are implemented in primate brain networks.
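
    The winning model class, value learning restricted to the task-relevant feature dimension plus a value-independent stickiness toward the previous selection, can be sketched as below, with illustrative parameters (not the fitted values) and a single reversal of the rewarded feature.

```python
import numpy as np

# Sketch of feature-value reinforcement learning with value-independent
# stickiness (illustrative parameters, not the fitted model): values are
# learned only for the relevant feature dimension, here "colour", and choice
# combines those values with a bias toward the previously selected feature.
rng = np.random.default_rng(9)
n_colors = 3
V = np.zeros(n_colors)            # values on the relevant feature dimension
alpha, beta, kappa = 0.3, 6.0, 1.0
rewarded_color, prev_choice = 0, None
correct = []

for t in range(300):
    if t == 150:
        rewarded_color = 1        # reversal of the rewarded feature
    sticky = np.zeros(n_colors)
    if prev_choice is not None:
        sticky[prev_choice] = 1.0
    util = beta * V + kappa * sticky
    p = np.exp(util - util.max()); p /= p.sum()
    choice = rng.choice(n_colors, p=p)
    r = float(rng.random() < (0.8 if choice == rewarded_color else 0.2))
    V[choice] += alpha * (r - V[choice])
    prev_choice = choice
    correct.append(choice == rewarded_color)

print("accuracy before / after reversal:",
      round(np.mean(correct[:150]), 2), "/", round(np.mean(correct[150:]), 2))
```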

  20. Self-regulation of the anterior insula: Reinforcement learning using real-time fMRI neurofeedback.

    PubMed

    Lawrence, Emma J; Su, Li; Barker, Gareth J; Medford, Nick; Dalton, Jeffrey; Williams, Steve C R; Birbaumer, Niels; Veit, Ralf; Ranganatha, Sitaram; Bodurka, Jerzy; Brammer, Michael; Giampietro, Vincent; David, Anthony S

    2014-03-01

    The anterior insula (AI) plays a key role in affective processing, and insular dysfunction has been noted in several clinical conditions. Real-time functional MRI neurofeedback (rtfMRI-NF) provides a means of helping people learn to self-regulate activation in this brain region. Using the Blood Oxygenation Level Dependent (BOLD) signal from the right AI (RAI) as neurofeedback, we trained participants to increase RAI activation. In contrast, another group of participants was shown 'control' feedback from another brain area. Pre- and post-training affective probes were shown, with subjective ratings and skin conductance response (SCR) measured. We also investigated a reward-related reinforcement learning model of rtfMRI-NF. In contrast to the controls, we hypothesised a positive linear increase in RAI activation in participants shown feedback from this region, alongside increases in valence ratings and SCR to affective probes. Hypothesis-driven analyses showed a significant interaction between the RAI/control neurofeedback groups and the effect of self-regulation. Whole-brain analyses revealed a significant linear increase in RAI activation across four training runs in the group who received feedback from RAI. Increased activation was also observed in the caudate body and thalamus, likely representing feedback-related learning. No positive linear trend was observed in the RAI in the group receiving control feedback, suggesting that these data are not a general effect of cognitive strategy or control feedback. The control group did, however, show diffuse activation across the putamen, caudate and posterior insula, which may indicate the representation of false feedback. No significant training-related behavioural differences were observed for valence ratings or SCR. In addition, correlational analyses based on a reinforcement learning model showed that the dorsal anterior cingulate cortex underpinned learning in both groups. In summary, these data demonstrate that it is possible to regulate the RAI using rtfMRI-NF within one scanning session, and that such reward-related learning is mediated by the dorsal anterior cingulate. Copyright © 2013 Elsevier Inc. All rights reserved.

  1. Artificial neural networks and approximate reasoning for intelligent control in space

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1991-01-01

    A method is introduced for learning to refine the control rules of approximate reasoning-based controllers. A reinforcement-learning technique is used in conjunction with a multi-layer neural network model of an approximate reasoning-based controller. The model learns by updating its prediction of the physical system's behavior. The model can use the control knowledge of an experienced operator and fine-tune it through the process of learning. Some of the space domains suitable for applications of the model such as rendezvous and docking, camera tracking, and tethered systems control are discussed.

  2. Fuzzy and neural control

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1992-01-01

    Fuzzy logic and neural networks provide new methods for designing control systems. Fuzzy logic controllers do not require a complete analytical model of a dynamic system and can provide knowledge-based heuristic controllers for ill-defined and complex systems. Neural networks can be used for learning control. In this chapter, we discuss hybrid methods using fuzzy logic and neural networks which can start with an approximate control knowledge base and refine it through reinforcement learning.

  3. Rational and Mechanistic Perspectives on Reinforcement Learning

    ERIC Educational Resources Information Center

    Chater, Nick

    2009-01-01

    This special issue describes important recent developments in applying reinforcement learning models to capture neural and cognitive function. But reinforcement learning, as a theoretical framework, can apply at two very different levels of description: "mechanistic" and "rational." Reinforcement learning is often viewed in mechanistic terms--as…

  4. Learning Disability as Related to Infrequent Punishment and Limited Participation in Delay of Reinforcement Tasks

    ERIC Educational Resources Information Center

    Wilson, L. R.

    1975-01-01

    This study compared parental frequency ratings of physical punishment and of task completion for learning disabled and control children, to test the hypothesis that parental indulgence is associated with learning disorders. Referral children with learning problems (N=18) were rated significantly lower on both measures than students referred for…

  5. Synchronization of Chaotic Systems without Direct Connections Using Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Sato, Norihisa; Adachi, Masaharu

    In this paper, we propose a control method for the synchronization of chaotic systems that does not require the systems to be connected, unlike existing methods such as that proposed by Pecora and Carroll in 1990. The method is based on the reinforcement learning algorithm. We apply our method to two discrete-time chaotic systems with mismatched parameters and achieve M-step delay synchronization. Moreover, we extend the proposed method to the synchronization of continuous-time chaotic systems.

  6. A stimulus-control account of regulated drug intake in rats.

    PubMed

    Panlilio, Leigh V; Thorndike, Eric B; Schindler, Charles W

    2008-02-01

    Patterns of drug self-administration are often highly regular, with a consistent pause after each self-injection. This pausing might occur because the animal has learned that additional injections are not reinforcing once the drug effect has reached a certain level, possibly due to the reinforcement system reaching full capacity. Thus, interoceptive effects of the drug might function as a discriminative stimulus, signaling when additional drug will be reinforcing and when it will not. This hypothetical stimulus control aspect of drug self-administration was emulated using a schedule of food reinforcement. Rats' nose-poke responses produced food only when a cue light was present. No drug was administered at any time. However, the state of the light stimulus was determined by calculating what the whole-body drug level would have been if each response in the session had produced a drug injection. The light was only presented while this virtual drug level was below a specific threshold. A range of doses of cocaine and remifentanil were emulated using parameters based on previous self-administration experiments. Response patterns were highly regular, dose-dependent, and remarkably similar to actual drug self-administration. This similarity suggests that the emulation schedule may provide a reasonable model of the contingencies inherent in drug reinforcement. Thus, these results support a stimulus control account of regulated drug intake in which rats learn to discriminate when the level of drug effect has fallen to a point where another self-injection will be reinforcing.
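
    The contingency itself is easy to emulate in a few lines: each response adds a "dose" to a virtual drug level that decays exponentially, and the cue signalling reinforcement availability is on only while that level is below a threshold. The constants and the simple probabilistic "rat" below are invented for illustration; they are not the pharmacokinetic parameters used in the study.

```python
import numpy as np

# Sketch of the emulation schedule described above (constants are invented):
# each response adds a dose to a virtual drug level with first-order
# elimination, and the reinforcement-availability cue (the light) is on only
# while that virtual level is below a threshold.
rng = np.random.default_rng(10)
dt, half_life, dose, threshold = 1.0, 300.0, 1.0, 0.5   # seconds, arbitrary units
decay = 0.5 ** (dt / half_life)          # per-step elimination factor

level, light_on_time, responses = 0.0, 0, 0
for t in range(3600):                     # one simulated hour at 1-s resolution
    level *= decay                        # first-order elimination
    light_on = level < threshold
    light_on_time += light_on
    # crude behaving "rat": responds occasionally, but only when the light is on
    if light_on and rng.random() < 0.05:
        responses += 1
        level += dose                     # the response would have produced an injection

print(f"{responses} responses; light available {light_on_time / 3600:.0%} of the session")
```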

  7. Reinforcement learning solution for HJB equation arising in constrained optimal control problem.

    PubMed

    Luo, Biao; Wu, Huai-Ning; Huang, Tingwen; Liu, Derong

    2015-11-01

    The constrained optimal control problem depends on the solution of the complicated Hamilton-Jacobi-Bellman equation (HJBE). In this paper, a data-based off-policy reinforcement learning (RL) method is proposed, which learns the solution of the HJBE and the optimal control policy from real system data. One important feature of the off-policy RL is that its policy evaluation can be realized with data generated by other behavior policies, not necessarily the target policy, which solves the insufficient exploration problem. The convergence of the off-policy RL is proved by demonstrating its equivalence to the successive approximation approach. Its implementation procedure is based on the actor-critic neural networks structure, where the function approximation is conducted with linearly independent basis functions. Subsequently, the convergence of the implementation procedure with function approximation is also proved. Finally, its effectiveness is verified through computer simulations. Copyright © 2015 Elsevier Ltd. All rights reserved.
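
    As a rough illustration of the critic side of such a scheme, the sketch below fits a value function that is a linear combination of independent basis functions to trajectory data from a simple discrete-time plant, by regressing the one-step Bellman relation. The plant, policy, basis, cost, and discount factor are assumptions chosen for illustration; the paper's continuous-time constrained formulation, off-policy correction, and actor update are not reproduced here.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    A, B = 0.9, 0.5                    # assumed scalar linear plant: x' = A*x + B*u
    K = 0.4                            # policy being evaluated: u = -K*x
    gamma = 0.95

    def phi(x):
        """Linearly independent basis functions for the value approximation."""
        return np.array([x**2, x**4, abs(x)**3])

    def cost(x, u):
        return x**2 + 0.1 * u**2

    # Roll out short trajectories from random initial states and collect the
    # regression system implied by V(x) = cost(x, u) + gamma * V(x')
    rows, targets = [], []
    for _ in range(100):
        x = rng.uniform(-2.0, 2.0)
        for _ in range(5):
            u = -K * x
            x_next = A * x + B * u
            rows.append(phi(x) - gamma * phi(x_next))
            targets.append(cost(x, u))
            x = x_next

    w, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    print("critic weights:", np.round(w, 3))
    print("estimated V(1.0):", phi(1.0) @ w)
    ```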

  8. Specific effect of a dopamine partial agonist on counterfactual learning: evidence from Gilles de la Tourette syndrome.

    PubMed

    Salvador, Alexandre; Worbe, Yulia; Delorme, Cécile; Coricelli, Giorgio; Gaillard, Raphaël; Robbins, Trevor W; Hartmann, Andreas; Palminteri, Stefano

    2017-07-24

    The dopamine partial agonist aripiprazole is increasingly used to treat pathologies for which other antipsychotics are indicated because it displays fewer side effects, such as sedation and depression-like symptoms, than other dopamine receptor antagonists. Previously, we showed that aripiprazole may protect motivational function by preserving reinforcement-related signals used to sustain reward-maximization. However, the effect of aripiprazole on more cognitive facets of human reinforcement learning, such as learning from the forgone outcomes of alternative courses of action (i.e., counterfactual learning), is unknown. To test the influence of aripiprazole on counterfactual learning, we administered a reinforcement learning task that involves both direct learning from obtained outcomes and indirect learning from forgone outcomes to two groups of Gilles de la Tourette (GTS) patients, one consisting of patients who were completely unmedicated and the other consisting of patients who were receiving aripiprazole monotherapy, and to healthy subjects. We found that whereas learning performance improved in the presence of counterfactual feedback in both healthy controls and unmedicated GTS patients, this was not the case in aripiprazole-medicated GTS patients. Our results suggest that whereas aripiprazole preserves direct learning of action-outcome associations, it may impair more complex inferential processes, such as counterfactual learning from forgone outcomes, in GTS patients treated with this medication.

  9. Creating a Reinforcement Learning Controller for Functional Electrical Stimulation of a Human Arm*

    PubMed Central

    Thomas, Philip S.; Branicky, Michael; van den Bogert, Antonie; Jagodnik, Kathleen

    2010-01-01

    Clinical tests have shown that the dynamics of a human arm, controlled using Functional Electrical Stimulation (FES), can vary significantly between and during trials. In this paper, we study the application of Reinforcement Learning to create a controller that can adapt to these changing dynamics of a human arm. Development and tests were done in simulation using a two-dimensional arm model and Hill-based muscle dynamics. An actor-critic architecture is used with artificial neural networks for both the actor and the critic. We begin by training it using a Proportional Derivative (PD) controller as a supervisor. We then make clinically relevant changes to the dynamics of the arm and test the actor-critic’s ability to adapt without supervision in a reasonable number of episodes. PMID:22081795
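
    The "PD controller as supervisor" idea can be sketched as a supervised pretraining step: the actor is first regressed onto PD commands over sampled tracking errors, giving the actor-critic a sensible starting policy before unsupervised adaptation. The gains, state ranges, and feature map below are assumptions, and a linear least-squares fit stands in for the neural-network actor used in the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    Kp, Kd = 20.0, 2.0                      # assumed PD gains

    def pd_controller(err, err_dot):
        return Kp * err + Kd * err_dot

    def features(err, err_dot):
        # simple polynomial features; a neural network would play this role in the paper
        return np.array([err, err_dot, err * err_dot, err**3])

    # Sample tracking errors the arm might encounter and the PD "teacher" commands
    states = rng.uniform(-1.0, 1.0, size=(500, 2))
    targets = np.array([pd_controller(e, ed) for e, ed in states])
    Phi = np.array([features(e, ed) for e, ed in states])

    # Supervised pretraining of the actor (least squares here; backprop in general)
    w_actor, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

    # The pretrained actor now approximates the PD law and can be refined by the critic
    test_err, test_err_dot = 0.3, -0.1
    print("PD command:   ", pd_controller(test_err, test_err_dot))
    print("actor command:", features(test_err, test_err_dot) @ w_actor)
    ```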

  10. Use of Frontal Lobe Hemodynamics as Reinforcement Signals to an Adaptive Controller

    PubMed Central

    DiStasio, Marcello M.; Francis, Joseph T.

    2013-01-01

    Decision-making ability in the frontal lobe (among other brain structures) relies on the assignment of value to states of the animal and its environment. Then higher valued states can be pursued and lower (or negative) valued states avoided. The same principle forms the basis for computational reinforcement learning controllers, which have been fruitfully applied both as models of value estimation in the brain, and as artificial controllers in their own right. This work shows how state desirability signals decoded from frontal lobe hemodynamics, as measured with near-infrared spectroscopy (NIRS), can be applied as reinforcers to an adaptable artificial learning agent in order to guide its acquisition of skills. A set of experiments carried out on an alert macaque demonstrate that both oxy- and deoxyhemoglobin concentrations in the frontal lobe show differences in response to both primarily and secondarily desirable (versus undesirable) stimuli. This difference allows a NIRS signal classifier to serve successfully as a reinforcer for an adaptive controller performing a virtual tool-retrieval task. The agent's adaptability allows its performance to exceed the limits of the NIRS classifier decoding accuracy. We also show that decoding state desirabilities is more accurate when using relative concentrations of both oxyhemoglobin and deoxyhemoglobin, rather than either species alone. PMID:23894500

  11. New Knowledge Derived from Learned Knowledge: Functional-Anatomic Correlates of Stimulus Equivalence

    ERIC Educational Resources Information Center

    Schlund, Michael W.; Hoehn-Saric, Rudolf; Cataldo, Michael F.

    2007-01-01

    Forming new knowledge based on knowledge established through prior learning is a central feature of higher cognition that is captured in research on stimulus equivalence (SE). Numerous SE investigations show that reinforcing behavior under control of distinct sets of arbitrary conditional relations gives rise to stimulus control by new, "derived"…

  12. Identifying Cognitive Remediation Change Through Computational Modelling—Effects on Reinforcement Learning in Schizophrenia

    PubMed Central

    Cella, Matteo; Bishara, Anthony J.; Medin, Evelina; Swan, Sarah; Reeder, Clare; Wykes, Til

    2014-01-01

    Objective: Converging research suggests that individuals with schizophrenia show a marked impairment in reinforcement learning, particularly in tasks requiring flexibility and adaptation. The problem has been associated with dopamine reward systems. This study explores, for the first time, the characteristics of this impairment and how it is affected by a behavioral intervention—cognitive remediation. Method: Using computational modelling, 3 reinforcement learning parameters based on the Wisconsin Card Sorting Test (WCST) trial-by-trial performance were estimated: R (reward sensitivity), P (punishment sensitivity), and D (choice consistency). In Study 1 the parameters were compared between a group of individuals with schizophrenia (n = 100) and a healthy control group (n = 50). In Study 2 the effect of cognitive remediation therapy (CRT) on these parameters was assessed in 2 groups of individuals with schizophrenia, one receiving CRT (n = 37) and the other receiving treatment as usual (TAU, n = 34). Results: In Study 1 individuals with schizophrenia showed impairment in the R and P parameters compared with healthy controls. Study 2 demonstrated that sensitivity to negative feedback (P) and reward (R) improved in the CRT group after therapy compared with the TAU group. R and P parameter change correlated with WCST outputs. Improvements in R and P after CRT were associated with working memory gains and reduction of negative symptoms, respectively. Conclusion: Schizophrenia reinforcement learning difficulties negatively influence performance in shift learning tasks. CRT can improve sensitivity to reward and punishment. Identifying parameters that show change may be useful in experimental medicine studies to identify cognitive domains susceptible to improvement. PMID:24214932
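
    A hedged sketch of a trial-by-trial model with separate reward (R) and punishment (P) sensitivities and a choice-consistency parameter (D) is given below; the exact functional form estimated in the study is not reproduced, and the parameter values and toy task are illustrative only.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    R, P, D = 0.3, 0.1, 3.0            # reward sensitivity, punishment sensitivity, consistency
    values = np.zeros(4)               # one value per option / sorting rule

    def choose(values):
        """Softmax choice: higher D means more consistent (less random) choices."""
        p = np.exp(D * values)
        p /= p.sum()
        return rng.choice(len(values), p=p)

    def update(values, choice, correct):
        """Scale the update by R after positive feedback and by P after negative feedback."""
        if correct:
            values[choice] += R * (1.0 - values[choice])
        else:
            values[choice] += P * (0.0 - values[choice])
        return values

    # Toy run: option 2 is reinforced 80% of the time
    for trial in range(200):
        c = choose(values)
        correct = rng.random() < (0.8 if c == 2 else 0.2)
        values = update(values, c, correct)

    print("learned values:", np.round(values, 2))
    ```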

  13. Curiosity driven reinforcement learning for motion planning on humanoids

    PubMed Central

    Frank, Mikhail; Leitner, Jürgen; Stollenga, Marijn; Förster, Alexander; Schmidhuber, Jürgen

    2014-01-01

    Most previous work on artificial curiosity (AC) and intrinsic motivation focuses on basic concepts and theory. Experimental results are generally limited to toy scenarios, such as navigation in a simulated maze, or control of a simple mechanical system with one or two degrees of freedom. To study AC in a more realistic setting, we embody a curious agent in the complex iCub humanoid robot. Our novel reinforcement learning (RL) framework consists of a state-of-the-art, low-level, reactive control layer, which controls the iCub while respecting constraints, and a high-level curious agent, which explores the iCub's state-action space through information gain maximization, learning a world model from experience while controlling the actual iCub hardware in real time. To the best of our knowledge, this is the first ever embodied, curious agent for real-time motion planning on a humanoid. We demonstrate that it can learn compact Markov models to represent large regions of the iCub's configuration space, and that the iCub explores intelligently, showing interest in its physical constraints as well as in objects it finds in its environment. PMID:24432001

  14. The probability of reinforcement per trial affects posttrial responding and subsequent extinction but not within-trial responding.

    PubMed

    Harris, Justin A; Kwok, Dorothy W S

    2018-01-01

    During magazine approach conditioning, rats do not discriminate between a conditional stimulus (CS) that is consistently reinforced with food and a CS that is occasionally (partially) reinforced, as long as the CSs have the same overall reinforcement rate per second. This implies that rats are indifferent to the probability of reinforcement per trial. However, in the same rats, the per-trial reinforcement rate will affect subsequent extinction: responding extinguishes more rapidly for a CS that was consistently reinforced than for a partially reinforced CS. Here, we trained rats with consistently and partially reinforced CSs that were matched for overall reinforcement rate per second. We measured conditioned responding both during and immediately after the CSs. Differences in the per-trial probability of reinforcement did not affect the acquisition of responding during the CS but did affect subsequent extinction of that responding, and also affected the post-CS response rates during conditioning. Indeed, CSs with the same probability of reinforcement per trial evoked the same amount of post-CS responding even when they differed in overall reinforcement rate and thus evoked different amounts of responding during the CS. We conclude that reinforcement rate per second controls rats' acquisition of responding during the CS, but at the same time, rats also learn specifically about the probability of reinforcement per trial. The latter learning affects the rats' expectation of reinforcement as an outcome of the trial, which influences their ability to detect retrospectively that an opportunity for reinforcement was missed, and, in turn, drives extinction. (PsycINFO Database Record (c) 2018 APA, all rights reserved).

  15. Adjudicated Students' Perceptions of Ideal Teacher Characteristics.

    ERIC Educational Resources Information Center

    Bodine, Jaynell Cope; Olivarez, Arturo, Jr.; Ponticell, Judith A.

    Certain teacher behaviors have been found to be conducive to learning or reinforcing for learning, while other behaviors have been found detrimental to learning. Humanism in teacher/pupil control orientation has been found to promote more positive attitudes toward teachers, and custodialism has been found to produce more negative attitudes toward…

  16. Engagement in Classroom Learning: Creating Temporal Participation Incentives for Extrinsically Motivated Students through Bonus Credits

    ERIC Educational Resources Information Center

    Rassuli, Ali

    2012-01-01

    Extrinsic inducements to adjust students' learning motivations have evolved within 2 opposing paradigms. Cognitive evaluation theories claim that controlling factors embedded in extrinsic rewards dissipate intrinsic aspirations. Behavioral theorists contend that if engagement is voluntary, extrinsic reinforcements enhance learning without ill…

  17. An Upside to Reward Sensitivity: The Hippocampus Supports Enhanced Reinforcement Learning in Adolescence.

    PubMed

    Davidow, Juliet Y; Foerde, Karin; Galván, Adriana; Shohamy, Daphna

    2016-10-05

    Adolescents are notorious for engaging in reward-seeking behaviors, a tendency attributed to heightened activity in the brain's reward systems during adolescence. It has been suggested that reward sensitivity in adolescence might be adaptive, but evidence of an adaptive role has been scarce. Using a probabilistic reinforcement learning task combined with reinforcement learning models and fMRI, we found that adolescents showed better reinforcement learning and a stronger link between reinforcement learning and episodic memory for rewarding outcomes. This behavioral benefit was related to heightened prediction error-related BOLD activity in the hippocampus and to stronger functional connectivity between the hippocampus and the striatum at the time of reinforcement. These findings reveal an important role for the hippocampus in reinforcement learning in adolescence and suggest that reward sensitivity in adolescence is related to adaptive differences in how adolescents learn from experience. Copyright © 2016 Elsevier Inc. All rights reserved.

  18. Reinforcement Learning Deficits in People with Schizophrenia Persist after Extended Trials

    PubMed Central

    Cicero, David C.; Martin, Elizabeth A.; Becker, Theresa M.; Kerns, John G.

    2014-01-01

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning. PMID:25172610

  19. Reinforcement learning deficits in people with schizophrenia persist after extended trials.

    PubMed

    Cicero, David C; Martin, Elizabeth A; Becker, Theresa M; Kerns, John G

    2014-12-30

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive feedback and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.

  20. Beyond adaptive-critic creative learning for intelligent mobile robots

    NASA Astrophysics Data System (ADS)

    Liao, Xiaoqun; Cao, Ming; Hall, Ernest L.

    2001-10-01

    Intelligent industrial and mobile robots may be considered proven technology in structured environments. Teach programming and supervised learning methods permit solutions to a variety of applications. However, we believe that extending the operation of these machines to more unstructured environments requires a new learning method. Both unsupervised learning and reinforcement learning are potential candidates for these new tasks. The adaptive critic method has been shown to provide useful approximate, or even optimal, control policies for non-linear systems. The purpose of this paper is to explore the use of new learning methods that go beyond the adaptive critic method for unstructured environments. The adaptive critic is a form of reinforcement learning. A critic element provides only high level grading corrections to a cognition module that controls the action module. In the proposed system the critic's grades are modeled and forecasted, so that an anticipated set of sub-grades is available to the cognition module. The forecasted grades are interpolated and are available on the time scale needed by the action module. The success of the system is highly dependent on the accuracy of the forecasted grades and the adaptability of the action module. Examples from the guidance of a mobile robot are provided to illustrate the method for simple line following and for the more complex navigation and control in an unstructured environment. The theory presented here, which goes beyond the adaptive critic, may be called creative theory. Creative theory is a form of learning that models the highest level of human learning: imagination. Creative theory appears applicable not only to mobile robots but also to many other forms of human endeavor, such as educational learning and business forecasting. Reinforcement learning such as the adaptive critic may be applied to known problems to aid in the discovery of their solutions. The significance of creative theory is that it permits the discovery of unknown problems, ones that are not yet recognized but may be critical to survival or success.

  1. An architecture for designing fuzzy logic controllers using neural networks

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1991-01-01

    Described here is an architecture for designing fuzzy controllers through a hierarchical process of control rule acquisition and by using special classes of neural network learning techniques. A new method for learning to refine a fuzzy logic controller is introduced. A reinforcement learning technique is used in conjunction with a multi-layer neural network model of a fuzzy controller. The model learns by updating its prediction of the plant's behavior and is related to Sutton's Temporal Difference (TD) method. The method proposed here has the advantage of using the control knowledge of an experienced operator and fine-tuning it through the process of learning. The approach is applied to a cart-pole balancing system.
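
    The prediction-updating idea attributed above to Sutton's TD method can be illustrated with a minimal TD(0) example on a toy chain environment; the environment, reward, and step size are assumptions, and the fuzzy-controller and cart-pole specifics are omitted.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    n_states = 5
    V = np.zeros(n_states)             # predicted long-term outcome per state
    alpha, gamma = 0.1, 0.9

    for episode in range(500):
        s = 0
        while s < n_states - 1:
            s_next = s + 1 if rng.random() < 0.8 else s      # drift toward the terminal state
            r = 1.0 if s_next == n_states - 1 else 0.0       # reward only at the goal
            # TD error: discrepancy between prediction and (reward + discounted next prediction)
            td_error = r + gamma * V[s_next] * (s_next < n_states - 1) - V[s]
            V[s] += alpha * td_error
            s = s_next

    print("state-value predictions:", np.round(V, 2))
    ```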

  2. A fuzzy reinforcement learning approach to power control in wireless transmitters.

    PubMed

    Vengerov, David; Bambos, Nicholas; Berenji, Hamid R

    2005-08-01

    We address the issue of power-controlled shared channel access in wireless networks supporting packetized data traffic. We formulate this problem using the dynamic programming framework and present a new distributed fuzzy reinforcement learning algorithm (ACFRL-2) capable of adequately solving a class of problems to which the power control problem belongs. Our experimental results show that the algorithm converges almost deterministically to a neighborhood of optimal parameter values, as opposed to a very noisy stochastic convergence of earlier algorithms. The main tradeoff facing a transmitter is to balance its current power level with future backlog in the presence of stochastically changing interference. Simulation experiments demonstrate that the ACFRL-2 algorithm achieves significant performance gains over the standard power control approach used in CDMA2000. Such a large improvement is explained by the fact that ACFRL-2 allows transmitters to learn implicit coordination policies, which back off under stressful channel conditions as opposed to engaging in escalating "power wars."

  3. Joint Extraction of Entities and Relations Using Reinforcement Learning and Deep Learning.

    PubMed

    Feng, Yuntian; Zhang, Hongjun; Hao, Wenning; Chen, Gang

    2017-01-01

    We use both reinforcement learning and deep learning to simultaneously extract entities and relations from unstructured texts. For reinforcement learning, we model the task as a two-step decision process. Deep learning is used to automatically capture the most important information from unstructured texts, which represents the state in the decision process. By designing the reward function per step, our proposed method can pass the information of entity extraction to relation extraction and obtain feedback in order to extract entities and relations simultaneously. Firstly, we use a bidirectional LSTM to model the context information, which realizes preliminary entity extraction. On the basis of the extraction results, an attention-based method represents the sentences that include the target entity pair to generate the initial state in the decision process. Then we use a Tree-LSTM to represent relation mentions to generate the transition state in the decision process. Finally, we employ the Q-Learning algorithm to obtain the control policy π in the two-step decision process. Experiments on ACE2005 demonstrate that our method attains better performance than the state-of-the-art method and achieves a 2.4% increase in recall score.
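
    A toy sketch of Q-learning over a two-step decision process is shown below. The states and rewards are stand-ins; the paper builds its states from bidirectional LSTM and Tree-LSTM representations, which are not reproduced here.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    n_states, n_actions = 3, 2         # state 0: entity step, state 1: relation step, state 2: done
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, eps = 0.2, 0.9, 0.1

    def env_step(state, action):
        """Toy two-step MDP: the 'right' action at each step yields reward."""
        reward = 1.0 if action == state % 2 else 0.0
        return min(state + 1, 2), reward

    for episode in range(300):
        s = 0
        while s != 2:
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r = env_step(s, a)
            target = r + gamma * (0.0 if s_next == 2 else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next

    print("learned Q-table:\n", np.round(Q, 2))
    ```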

  4. Joint Extraction of Entities and Relations Using Reinforcement Learning and Deep Learning

    PubMed Central

    Zhang, Hongjun; Chen, Gang

    2017-01-01

    We use both reinforcement learning and deep learning to simultaneously extract entities and relations from unstructured texts. For reinforcement learning, we model the task as a two-step decision process. Deep learning is used to automatically capture the most important information from unstructured texts, which represents the state in the decision process. By designing the reward function per step, our proposed method can pass the information of entity extraction to relation extraction and obtain feedback in order to extract entities and relations simultaneously. Firstly, we use a bidirectional LSTM to model the context information, which realizes preliminary entity extraction. On the basis of the extraction results, an attention-based method represents the sentences that include the target entity pair to generate the initial state in the decision process. Then we use a Tree-LSTM to represent relation mentions to generate the transition state in the decision process. Finally, we employ the Q-Learning algorithm to obtain the control policy π in the two-step decision process. Experiments on ACE2005 demonstrate that our method attains better performance than the state-of-the-art method and achieves a 2.4% increase in recall score. PMID:28894463

  5. Effects of Ventral Striatum Lesions on Stimulus-Based versus Action-Based Reinforcement Learning.

    PubMed

    Rothenhoefer, Kathryn M; Costa, Vincent D; Bartolo, Ramón; Vicario-Feliciano, Raquel; Murray, Elisabeth A; Averbeck, Bruno B

    2017-07-19

    Learning the values of actions versus stimuli may depend on separable neural circuits. In the current study, we evaluated the performance of rhesus macaques with ventral striatum (VS) lesions on a two-arm bandit task that had randomly interleaved blocks of stimulus-based and action-based reinforcement learning (RL). Compared with controls, monkeys with VS lesions had deficits in learning to select rewarding images but not rewarding actions. We used a RL model to quantify learning and choice consistency and found that, in stimulus-based RL, the VS lesion monkeys were more influenced by negative feedback and had lower choice consistency than controls. Using a Bayesian model to parse the groups' learning strategies, we also found that VS lesion monkeys defaulted to an action-based choice strategy. Therefore, the VS is involved specifically in learning the value of stimuli, not actions. SIGNIFICANCE STATEMENT Reinforcement learning models of the ventral striatum (VS) often assume that it maintains an estimate of state value. This suggests that it plays a general role in learning whether rewards are assigned based on a chosen action or stimulus. In the present experiment, we examined the effects of VS lesions on monkeys' ability to learn that choosing a particular action or stimulus was more likely to lead to reward. We found that VS lesions caused a specific deficit in the monkeys' ability to discriminate between images with different values, whereas their ability to discriminate between actions with different values remained intact. Our results therefore suggest that the VS plays a specific role in learning to select rewarded stimuli. Copyright © 2017 the authors 0270-6474/17/376902-13$15.00/0.

  6. Reinforcement learning or active inference?

    PubMed

    Friston, Karl J; Daunizeau, Jean; Kiebel, Stefan J

    2009-07-29

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

  7. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning

    PubMed Central

    Franklin, Nicholas T; Frank, Michael J

    2015-01-01

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning Marr's three levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments. DOI: http://dx.doi.org/10.7554/eLife.12029.001 PMID:26705698
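
    At the algorithmic level, the self-tuning idea amounts to raising the effective learning rate when outcome uncertainty is high (for example, after a change-point) and lowering it when outcomes are stable. The sketch below uses recent unsigned prediction error as an uncertainty proxy and a simple gain mapping; both are assumptions, not the spiking TAN model itself.

    ```python
    import numpy as np

    rng = np.random.default_rng(5)

    value = 0.5
    avg_surprise = 0.5                  # running estimate of unsigned prediction error
    base_lr, tau = 0.05, 0.1            # baseline learning rate, surprise-tracking rate

    p_reward = 0.8
    for trial in range(400):
        if trial == 200:
            p_reward = 0.2              # change-point in the outcome contingency
        r = float(rng.random() < p_reward)
        pe = r - value
        avg_surprise += tau * (abs(pe) - avg_surprise)
        lr = base_lr * (1.0 + 4.0 * avg_surprise)     # higher uncertainty -> faster learning
        value += lr * pe
        if trial % 100 == 99:
            print(f"trial {trial + 1:3d}  value={value:.2f}  effective lr={lr:.3f}")
    ```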

  8. Distributed Economic Dispatch in Microgrids Based on Cooperative Reinforcement Learning.

    PubMed

    Liu, Weirong; Zhuang, Peng; Liang, Hao; Peng, Jun; Huang, Zhiwu

    2018-06-01

    Microgrids incorporated with distributed generation (DG) units and energy storage (ES) devices are expected to play increasingly important roles in future power systems. Yet, achieving efficient distributed economic dispatch in microgrids is a challenging issue due to the randomness and nonlinear characteristics of DG units and loads. This paper proposes a cooperative reinforcement learning algorithm for distributed economic dispatch in microgrids. The learning algorithm avoids the difficulty of stochastic modeling and its high computational complexity. In the cooperative reinforcement learning algorithm, function approximation is leveraged to deal with large and continuous state spaces, and a diffusion strategy is incorporated to coordinate the actions of DG units and ES devices. Based on the proposed algorithm, each node in the microgrid needs to communicate only with its local neighbors, without relying on any centralized controllers. Algorithm convergence is analyzed, and simulations based on real-world meteorological and load data are conducted to validate the performance of the proposed algorithm.
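
    The diffusion idea can be sketched as an adapt-then-combine loop: each node takes a local learning step and then averages its estimate with those of its neighbors, so the network reaches agreement without a central controller. The ring topology, local targets, and step size below are illustrative assumptions, not the dispatch model from the paper.

    ```python
    import numpy as np

    n_nodes = 5
    # ring topology: each node averages itself with its two neighbors
    C = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in (i - 1, i, i + 1):
            C[i, j % n_nodes] = 1.0 / 3.0

    local_targets = np.array([1.0, 3.0, 5.0, 7.0, 9.0])   # what each node would learn alone
    theta = np.zeros(n_nodes)
    step = 0.1

    for it in range(200):
        # adapt: local gradient step toward each node's own data
        psi = theta + step * (local_targets - theta)
        # combine: diffuse estimates across the network
        theta = C @ psi

    print("node estimates after diffusion:", np.round(theta, 2))
    print("network average of targets:    ", local_targets.mean())
    ```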

  9. Mapping anhedonia onto reinforcement learning: a behavioural meta-analysis

    PubMed Central

    2013-01-01

    Background Depression is characterised partly by blunted reactions to reward. However, tasks probing this deficiency have not distinguished insensitivity to reward from insensitivity to the prediction errors for reward that determine learning and are putatively reported by the phasic activity of dopamine neurons. We attempted to disentangle these factors with respect to anhedonia in the context of stress, Major Depressive Disorder (MDD), Bipolar Disorder (BPD) and a dopaminergic challenge. Methods Six behavioural datasets involving 392 experimental sessions were subjected to a model-based, Bayesian meta-analysis. Participants across all six studies performed a probabilistic reward task that used an asymmetric reinforcement schedule to assess reward learning. Healthy controls were tested under baseline conditions, under stress, or after receiving the dopamine D2 agonist pramipexole. In addition, participants with current or past MDD or BPD were evaluated. Reinforcement learning models isolated the contributions of variation in reward sensitivity and learning rate. Results MDD and anhedonia reduced reward sensitivity more than they affected the learning rate, while a low dose of the dopamine D2 agonist pramipexole showed the opposite pattern. Stress led to a pattern consistent with a mixed effect on reward sensitivity and learning rate. Conclusion Reward-related learning reflected at least two partially separable contributions. The first related to phasic prediction error signalling, and was preferentially modulated by a low dose of the dopamine agonist pramipexole. The second related directly to reward sensitivity, and was preferentially reduced in MDD and anhedonia. Stress altered both components. Collectively, these findings highlight the contribution of model-based reinforcement learning meta-analysis to dissecting anhedonic behavior. PMID:23782813

  10. Reinforcement Learning for Weakly-Coupled MDPs and an Application to Planetary Rover Control

    NASA Technical Reports Server (NTRS)

    Bernstein, Daniel S.; Zilberstein, Shlomo

    2003-01-01

    Weakly-coupled Markov decision processes can be decomposed into subprocesses that interact only through a small set of bottleneck states. We study a hierarchical reinforcement learning algorithm designed to take advantage of this particular type of decomposability. To test our algorithm, we use a decision-making problem faced by autonomous planetary rovers. In this problem, a Mars rover must decide which activities to perform and when to traverse between science sites in order to make the best use of its limited resources. In our experiments, the hierarchical algorithm performs better than Q-learning in the early stages of learning, but unlike Q-learning it converges to a suboptimal policy. This suggests that it may be advantageous to use the hierarchical algorithm when training time is limited.

  11. Effects of Social Reinforcement, Locus of Control, and Cognitive Style on Concept Learning among Retarded Children.

    ERIC Educational Resources Information Center

    Panda, Kailas C.

    To examine the effects of locus of control (the extent to which an individual feels he has control over his own behavior) and cognitive style variables on learning deficits among mentally handicapped children, 80 mentally retarded boys (IQ 50 to 83, age 160 to 196 months) were administered a battery of tests. Analyses of student performance…

  12. Biases in probabilistic category learning in relation to social anxiety

    PubMed Central

    Abraham, Anna; Hermann, Christiane

    2015-01-01

    Instrumental learning paradigms are rarely employed to investigate the mechanisms underlying acquired fear responses in social anxiety. Here, we adapted a probabilistic category learning paradigm to assess information processing biases as a function of the degree of social anxiety traits in a sample of healthy individuals without a diagnosis of social phobia. Participants were presented with three pairs of neutral faces with differing probabilistic accuracy contingencies (A/B: 80/20, C/D: 70/30, E/F: 60/40). After each choice, negative and positive feedback was conveyed using angry and happy faces, respectively. The highly socially anxious group showed a strong tendency to be more accurate at learning the probability contingency associated with the most ambiguous stimulus pair (E/F: 60/40). Moreover, when pairing the most positively reinforced stimulus or the most negatively reinforced stimulus with all the other stimuli in a test phase, the highly socially anxious group avoided the most negatively reinforced stimulus significantly more than the control group. The results are discussed with reference to avoidance learning and hypersensitivity to negative socially evaluative information associated with social anxiety. PMID:26347685

  13. Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data.

    PubMed

    Lewis, F L; Vamvoudakis, Kyriakos G

    2011-02-01

    Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have shown their importance in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the system internal states, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. Linear dynamical systems with deterministic behavior are considered herein, which are systems of great interest in the control system community. In control system theory, these types of methods are referred to as output feedback (OPFB). The stochastic equivalent of the systems dealt with in this paper is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller that requires only OPFB. It is shown that, similar to Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed for the implementation of these learning algorithms or for the OPFB control. Only the order of the system, as well as an upper bound on its "observability index," must be known. The learned OPFB controller is in the form of a polynomial autoregressive moving-average controller that has equivalent performance with the optimal state variable feedback gain.

  14. Roles of OA1 octopamine receptor and Dop1 dopamine receptor in mediating appetitive and aversive reinforcement revealed by RNAi studies

    PubMed Central

    Awata, Hiroko; Wakuda, Ryo; Ishimaru, Yoshiyasu; Matsuoka, Yuji; Terao, Kanta; Katata, Satomi; Matsumoto, Yukihisa; Hamanaka, Yoshitaka; Noji, Sumihare; Mito, Taro; Mizunami, Makoto

    2016-01-01

    Revealing reinforcing mechanisms in associative learning is important for elucidation of brain mechanisms of behavior. In mammals, dopamine neurons are thought to mediate both appetitive and aversive reinforcement signals. Studies using transgenic fruit-flies suggested that dopamine neurons mediate both appetitive and aversive reinforcements, through the Dop1 dopamine receptor, but our studies using octopamine and dopamine receptor antagonists and using Dop1 knockout crickets suggested that octopamine neurons mediate appetitive reinforcement and dopamine neurons mediate aversive reinforcement in associative learning in crickets. To fully resolve this issue, we examined the effects of silencing of expression of genes that code the OA1 octopamine receptor and Dop1 and Dop2 dopamine receptors by RNAi in crickets. OA1-silenced crickets exhibited impairment in appetitive learning with water but not in aversive learning with sodium chloride solution, while Dop1-silenced crickets exhibited impairment in aversive learning but not in appetitive learning. Dop2-silenced crickets showed normal scores in both appetitive learning and aversive learning. The results indicate that octopamine neurons mediate appetitive reinforcement via OA1 and that dopamine neurons mediate aversive reinforcement via Dop1 in crickets, providing decisive evidence that neurotransmitters and receptors that mediate appetitive reinforcement indeed differ among different species of insects. PMID:27412401

  15. Roles of OA1 octopamine receptor and Dop1 dopamine receptor in mediating appetitive and aversive reinforcement revealed by RNAi studies.

    PubMed

    Awata, Hiroko; Wakuda, Ryo; Ishimaru, Yoshiyasu; Matsuoka, Yuji; Terao, Kanta; Katata, Satomi; Matsumoto, Yukihisa; Hamanaka, Yoshitaka; Noji, Sumihare; Mito, Taro; Mizunami, Makoto

    2016-07-14

    Revealing reinforcing mechanisms in associative learning is important for elucidation of brain mechanisms of behavior. In mammals, dopamine neurons are thought to mediate both appetitive and aversive reinforcement signals. Studies using transgenic fruit-flies suggested that dopamine neurons mediate both appetitive and aversive reinforcements, through the Dop1 dopamine receptor, but our studies using octopamine and dopamine receptor antagonists and using Dop1 knockout crickets suggested that octopamine neurons mediate appetitive reinforcement and dopamine neurons mediate aversive reinforcement in associative learning in crickets. To fully resolve this issue, we examined the effects of silencing of expression of genes that code the OA1 octopamine receptor and Dop1 and Dop2 dopamine receptors by RNAi in crickets. OA1-silenced crickets exhibited impairment in appetitive learning with water but not in aversive learning with sodium chloride solution, while Dop1-silenced crickets exhibited impairment in aversive learning but not in appetitive learning. Dop2-silenced crickets showed normal scores in both appetitive learning and aversive learning. The results indicate that octopamine neurons mediate appetitive reinforcement via OA1 and that dopamine neurons mediate aversive reinforcement via Dop1 in crickets, providing decisive evidence that neurotransmitters and receptors that mediate appetitive reinforcement indeed differ among different species of insects.

  16. Evolving fuzzy rules in a learning classifier system

    NASA Technical Reports Server (NTRS)

    Valenzuela-Rendon, Manuel

    1993-01-01

    The fuzzy classifier system (FCS) combines the ideas of fuzzy logic controllers (FLC's) and learning classifier systems (LCS's). It brings together the expressive powers of fuzzy logic as it has been applied in fuzzy controllers to express relations between continuous variables, and the ability of LCS's to evolve co-adapted sets of rules. The goal of the FCS is to develop a rule-based system capable of learning in a reinforcement regime, and that can potentially be used for process control.

  17. Believer-Skeptic Meets Actor-Critic: Rethinking the Role of Basal Ganglia Pathways during Decision-Making and Reinforcement Learning

    PubMed Central

    Dunovan, Kyle; Verstynen, Timothy

    2016-01-01

    The flexibility of behavioral control is a testament to the brain's capacity for dynamically resolving uncertainty during goal-directed actions. This ability to select actions and learn from immediate feedback is driven by the dynamics of basal ganglia (BG) pathways. A growing body of empirical evidence conflicts with the traditional view that these pathways act as independent levers for facilitating (i.e., direct pathway) or suppressing (i.e., indirect pathway) motor output, suggesting instead that they engage in a dynamic competition during action decisions that computationally captures action uncertainty. Here we discuss the utility of encoding action uncertainty as a dynamic competition between opposing control pathways and provide evidence that this simple mechanism may have powerful implications for bridging neurocomputational theories of decision making and reinforcement learning. PMID:27047328

  18. Believer-Skeptic Meets Actor-Critic: Rethinking the Role of Basal Ganglia Pathways during Decision-Making and Reinforcement Learning.

    PubMed

    Dunovan, Kyle; Verstynen, Timothy

    2016-01-01

    The flexibility of behavioral control is a testament to the brain's capacity for dynamically resolving uncertainty during goal-directed actions. This ability to select actions and learn from immediate feedback is driven by the dynamics of basal ganglia (BG) pathways. A growing body of empirical evidence conflicts with the traditional view that these pathways act as independent levers for facilitating (i.e., direct pathway) or suppressing (i.e., indirect pathway) motor output, suggesting instead that they engage in a dynamic competition during action decisions that computationally captures action uncertainty. Here we discuss the utility of encoding action uncertainty as a dynamic competition between opposing control pathways and provide evidence that this simple mechanism may have powerful implications for bridging neurocomputational theories of decision making and reinforcement learning.

  19. BEHAVIORAL MECHANISMS UNDERLYING NICOTINE REINFORCEMENT

    PubMed Central

    Rupprecht, Laura E.; Smith, Tracy T.; Schassburger, Rachel L.; Buffalari, Deanne M.; Sved, Alan F.; Donny, Eric C.

    2015-01-01

    Cigarette smoking is the leading cause of preventable deaths worldwide and nicotine, the primary psychoactive constituent in tobacco, drives sustained use. The behavioral actions of nicotine are complex and extend well beyond the actions of the drug as a primary reinforcer. Stimuli that are consistently paired with nicotine can, through associative learning, take on reinforcing properties as conditioned stimuli. These conditioned stimuli can then impact the rate and probability of behavior and even function as conditioned reinforcers that maintain behavior in the absence of nicotine. Nicotine can also act as a conditioned stimulus, predicting the delivery of other reinforcers, which may allow nicotine to acquire value as a conditioned reinforcer. These associative effects, establishing non-nicotine stimuli as conditioned stimuli with discriminative stimulus and conditioned reinforcing properties as well as establishing nicotine as a conditioned stimulus, are predicted by basic conditioning principles. However, nicotine can also act non-associatively. Nicotine directly enhances the reinforcing efficacy of other reinforcing stimuli in the environment, an effect that does not require a temporal or predictive relationship between nicotine and either the stimulus or the behavior. Hence, the reinforcing actions of nicotine stem both from the primary reinforcing actions of the drug (and the subsequent associative learning effects) and from the reinforcement enhancement action of nicotine, which is non-associative in nature. Gaining a better understanding of how nicotine impacts behavior will allow for maximally effective tobacco control efforts aimed at reducing the harm associated with tobacco use by reducing and/or treating its addictiveness. PMID:25638333

  20. Amygdala and Ventral Striatum Make Distinct Contributions to Reinforcement Learning.

    PubMed

    Costa, Vincent D; Dal Monte, Olga; Lucas, Daniel R; Murray, Elisabeth A; Averbeck, Bruno B

    2016-10-19

    Reinforcement learning (RL) theories posit that dopaminergic signals are integrated within the striatum to associate choices with outcomes. Often overlooked is that the amygdala also receives dopaminergic input and is involved in Pavlovian processes that influence choice behavior. To determine the relative contributions of the ventral striatum (VS) and amygdala to appetitive RL, we tested rhesus macaques with VS or amygdala lesions on deterministic and stochastic versions of a two-arm bandit reversal learning task. When learning was characterized with an RL model relative to controls, amygdala lesions caused general decreases in learning from positive feedback and choice consistency. By comparison, VS lesions only affected learning in the stochastic task. Moreover, the VS lesions hastened the monkeys' choice reaction times, which emphasized a speed-accuracy trade-off that accounted for errors in deterministic learning. These results update standard accounts of RL by emphasizing distinct contributions of the amygdala and VS to RL. Published by Elsevier Inc.

  1. Amygdala and ventral striatum make distinct contributions to reinforcement learning

    PubMed Central

    Costa, Vincent D.; Monte, Olga Dal; Lucas, Daniel R.; Murray, Elisabeth A.; Averbeck, Bruno B.

    2016-01-01

    Summary Reinforcement learning (RL) theories posit that dopaminergic signals are integrated within the striatum to associate choices with outcomes. Often overlooked is that the amygdala also receives dopaminergic input and is involved in Pavlovian processes that influence choice behavior. To determine the relative contributions of the ventral striatum (VS) and amygdala to appetitive RL we tested rhesus macaques with VS or amygdala lesions on deterministic and stochastic versions of a two-arm bandit reversal learning task. When learning was characterized with a RL model relative to controls, amygdala lesions caused general decreases in learning from positive feedback and choice consistency. By comparison, VS lesions only affected learning in the stochastic task. Moreover, the VS lesions hastened the monkeys’ choice reaction times, which emphasized a speed-accuracy tradeoff that accounted for errors in deterministic learning. These results update standard accounts of RL by emphasizing distinct contributions of the amygdala and VS to RL. PMID:27720488

  2. Development and Evaluation of Mechatronics Learning System in a Web-Based Environment

    ERIC Educational Resources Information Center

    Shyr, Wen-Jye

    2011-01-01

    The development of a remote laboratory suitable for reinforcing undergraduate-level teaching of mechatronics is important. For this reason, a Web-based mechatronics learning system, called RECOLAB (REmote COntrol LABoratory), for remote learning in engineering education has been developed in this study. The web-based environment is an…

  3. Rational and mechanistic perspectives on reinforcement learning.

    PubMed

    Chater, Nick

    2009-12-01

    This special issue describes important recent developments in applying reinforcement learning models to capture neural and cognitive function. But reinforcement learning, as a theoretical framework, can apply at two very different levels of description: mechanistic and rational. Reinforcement learning is often viewed in mechanistic terms--as describing the operation of aspects of an agent's cognitive and neural machinery. Yet it can also be viewed as a rational level of description, specifically, as describing a class of methods for learning from experience, using minimal background knowledge. This paper considers how rational and mechanistic perspectives differ, and what types of evidence distinguish between them. Reinforcement learning research in the cognitive and brain sciences is often implicitly committed to the mechanistic interpretation. Here the opposite view is put forward: that accounts of reinforcement learning should apply at the rational level, unless there is strong evidence for a mechanistic interpretation. Implications of this viewpoint for reinforcement-based theories in the cognitive and brain sciences are discussed.

  4. Stress affects instrumental learning based on positive or negative reinforcement in interaction with personality in domestic horses

    PubMed Central

    Valenchon, Mathilde; Lévy, Frédéric; Moussu, Chantal; Lansade, Léa

    2017-01-01

    The present study investigated how stress affects instrumental learning performance in horses (Equus caballus) depending on the type of reinforcement. Horses were assigned to four groups (N = 15 per group); each group received training with negative or positive reinforcement in the presence or absence of stressors unrelated to the learning task. The instrumental learning task consisted of the horse entering one of two compartments at the appearance of a visual signal given by the experimenter. In the absence of stressors unrelated to the task, learning performance did not differ between negative and positive reinforcements. The presence of stressors unrelated to the task (exposure to novel and sudden stimuli) impaired learning performance. Interestingly, this learning deficit was smaller when the negative reinforcement was used. The negative reinforcement, considered as a stressor related to the task, could have counterbalanced the impact of the extrinsic stressor by focusing attention toward the learning task. In addition, learning performance appears to differ between certain dimensions of personality depending on the presence of stressors and the type of reinforcement. These results suggest that when negative reinforcement is used (i.e. stressor related to the task), the most fearful horses may be the best performers in the absence of stressors but the worst performers when stressors are present. On the contrary, when positive reinforcement is used, the most fearful horses appear to be consistently the worst performers, with and without exposure to stressors unrelated to the learning task. This study is the first to demonstrate in ungulates that stress affects learning performance differentially according to the type of reinforcement and in interaction with personality. It provides fundamental and applied perspectives in the understanding of the relationships between personality and training abilities. PMID:28475581

  5. Application of fuzzy logic-neural network based reinforcement learning to proximity and docking operations: Attitude control results

    NASA Technical Reports Server (NTRS)

    Jani, Yashvant

    1992-01-01

    As part of the RICIS activity, the reinforcement learning techniques developed at Ames Research Center are being applied to proximity and docking operations using the Shuttle and Solar Max satellite simulation. This activity is carried out in the software technology laboratory utilizing the Orbital Operations Simulator (OOS). This report is deliverable D2, Attitude Control Results; it provides the status of the project after four months of activity and outlines future plans. In section 2, we describe the Fuzzy-Learner system for the attitude control functions. In section 3, we describe the test cases and results in chronological order. In section 4, we summarize our results and conclusions. Our future plans and recommendations are provided in section 5.

  6. Neural Control of a Tracking Task via Attention-Gated Reinforcement Learning for Brain-Machine Interfaces.

    PubMed

    Wang, Yiwen; Wang, Fang; Xu, Kai; Zhang, Qiaosheng; Zhang, Shaomin; Zheng, Xiaoxiang

    2015-05-01

    Reinforcement learning (RL)-based brain-machine interfaces (BMIs) enable the user to learn from the environment through interaction and to complete the task without desired (target) signals, which is promising for clinical applications. Previous studies exploited Q-learning techniques to discriminate neural states into simple directional actions, with the trial's initial timing provided. However, the movements in BMI applications can be quite complicated, and the action timing explicitly conveys the intention of when to move. The rich actions and the corresponding neural states form a large state-action space, imposing generalization difficulty on Q-learning. In this paper, we propose to adopt attention-gated reinforcement learning (AGREL), with its efficient weight updating, as a new learning scheme for BMIs to adaptively decode high-dimensional neural activities into seven distinct movements (directional moves, holding, and resting). We apply AGREL to neural data recorded from M1 of a monkey to directly predict a seven-action set in a time sequence and to reconstruct the trajectory of a center-out task. Compared with Q-learning techniques, AGREL improved the target acquisition rate to 90.16% on average, with faster convergence and more stable tracking of neural activity over multiple days, indicating the potential to achieve better online decoding performance for more complicated BMI tasks.
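
    A generic reward-modulated decoding sketch in the spirit of the description above: neural features are mapped to a softmax over discrete actions, and only the selected action's weights are adjusted in proportion to the reward prediction error. The feature dimensions, action set, synthetic data, and learning rate are assumptions, and this is not the full AGREL rule from the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(6)

    n_features, n_actions = 20, 7       # e.g. binned firing rates -> 7 movement classes
    W = np.zeros((n_actions, n_features))
    lr = 0.05

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    for trial in range(2000):
        intended = rng.integers(n_actions)
        # toy neural features whose mean depends on the intended action
        x = rng.standard_normal(n_features) + np.eye(n_actions, n_features)[intended] * 2.0
        p = softmax(W @ x)
        a = rng.choice(n_actions, p=p)
        reward = 1.0 if a == intended else 0.0
        # update only the selected action, scaled by the reward prediction error
        W[a] += lr * (reward - p[a]) * x

    # quick check of decoding accuracy on fresh synthetic trials
    correct = 0
    for _ in range(500):
        intended = rng.integers(n_actions)
        x = rng.standard_normal(n_features) + np.eye(n_actions, n_features)[intended] * 2.0
        correct += int(np.argmax(W @ x) == intended)
    print("decoding accuracy:", correct / 500)
    ```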

  7. Social reinforcement can regulate localized brain activity.

    PubMed

    Mathiak, Krystyna A; Koush, Yury; Dyck, Miriam; Gaber, Tilman J; Alawi, Eliza; Zepf, Florian D; Zvyagintsev, Mikhail; Mathiak, Klaus

    2010-11-01

    Social learning is essential for adaptive behavior in humans. Neurofeedback based on functional magnetic resonance imaging (fMRI) trains control over localized brain activity. It can disentangle learning processes at the neural level and thus investigate the mechanisms of operant conditioning with explicit social reinforcers. In a pilot study, a computer-generated face provided positive feedback (smiling) when activity in the anterior cingulate cortex (ACC) increased and gradually returned to a neutral expression when the activity dropped. One female volunteer without previous experience in fMRI underwent training based on a social reinforcer. Directly before and after the neurofeedback runs, neural responses to a cognitive interference task (Simon task) were recorded. We observed a significant increase in activity within the ACC during the neurofeedback blocks, consistent with the a priori defined anatomical region of interest. In the course of the neurofeedback training, the subject learned to regulate ACC activity and could maintain this control even without direct feedback. Moreover, the ACC was activated significantly more strongly during the Simon task after the neurofeedback training than before. Localized brain activity can be controlled by social reward. The increased ACC activity transferred to a cognitive task with the potential to reduce cognitive interference. Systematic studies are required to explore long-term effects on social behavior and clinical applications.

  8. Identifying cognitive remediation change through computational modelling--effects on reinforcement learning in schizophrenia.

    PubMed

    Cella, Matteo; Bishara, Anthony J; Medin, Evelina; Swan, Sarah; Reeder, Clare; Wykes, Til

    2014-11-01

    Converging research suggests that individuals with schizophrenia show a marked impairment in reinforcement learning, particularly in tasks requiring flexibility and adaptation. The problem has been associated with dopamine reward systems. This study explores, for the first time, the characteristics of this impairment and how it is affected by a behavioral intervention--cognitive remediation. Using computational modelling, 3 reinforcement learning parameters based on the Wisconsin Card Sorting Test (WCST) trial-by-trial performance were estimated: R (reward sensitivity), P (punishment sensitivity), and D (choice consistency). In Study 1 the parameters were compared between a group of individuals with schizophrenia (n = 100) and a healthy control group (n = 50). In Study 2 the effect of cognitive remediation therapy (CRT) on these parameters was assessed in 2 groups of individuals with schizophrenia, one receiving CRT (n = 37) and the other receiving treatment as usual (TAU, n = 34). In Study 1 individuals with schizophrenia showed impairment in the R and P parameters compared with healthy controls. Study 2 demonstrated that sensitivity to negative feedback (P) and reward (R) improved in the CRT group after therapy compared with the TAU group. R and P parameter change correlated with WCST outputs. Improvements in R and P after CRT were associated with working memory gains and reduction of negative symptoms, respectively. Schizophrenia reinforcement learning difficulties negatively influence performance in shift learning tasks. CRT can improve sensitivity to reward and punishment. Identifying parameters that show change may be useful in experimental medicine studies to identify cognitive domains susceptible to improvement. © The Author 2013. Published by Oxford University Press on behalf of the Maryland Psychiatric Research Center. All rights reserved. For permissions, please email: journals.permissions@oup.com.

  9. Feedback from the heart: Emotional learning and memory is controlled by cardiac cycle, interoceptive accuracy and personality.

    PubMed

    Pfeifer, Gaby; Garfinkel, Sarah N; Gould van Praag, Cassandra D; Sahota, Kuljit; Betka, Sophie; Critchley, Hugo D

    2017-05-01

    Feedback processing is critical to trial-and-error learning. Here, we examined whether interoceptive signals concerning the state of cardiovascular arousal influence the processing of reinforcing feedback during the learning of 'emotional' face-name pairs, with subsequent effects on retrieval. Participants (N=29) engaged in a learning task of face-name pairs (fearful, neutral, happy faces). Correct and incorrect learning decisions were reinforced by auditory feedback, which was delivered either at cardiac systole (on the heartbeat, when baroreceptors signal the contraction of the heart to the brain), or at diastole (between heartbeats during baroreceptor quiescence). We discovered a cardiac influence on feedback processing that enhanced the learning of fearful faces in people with heightened interoceptive ability. Individuals with enhanced accuracy on a heartbeat counting task learned fearful face-name pairs better when feedback was given at systole than at diastole. This effect was not present for neutral and happy faces. At retrieval, we also observed related effects of personality: First, individuals scoring higher for extraversion showed poorer retrieval accuracy. These individuals additionally manifested lower resting heart rate and lower state anxiety, suggesting that attenuated levels of cardiovascular arousal in extraverts underlie poorer performance. Second, higher extraversion scores predicted higher emotional intensity ratings of fearful faces reinforced at systole. Third, individuals scoring higher for neuroticism showed higher retrieval confidence for fearful faces reinforced at diastole. Our results show that cardiac signals shape feedback processing to influence learning of fearful faces, an effect underpinned by personality differences linked to psychophysiological arousal. Copyright © 2017 Elsevier B.V. All rights reserved.

  10. Learning about Chemiosmosis and ATP Synthesis with Animations Outside of the Classroom †

    PubMed Central

    Goff, Eric E.; Reindl, Katie M.; Johnson, Christina; McClean, Phillip; Offerdahl, Erika G.; Schroeder, Noah L.; White, Alan R.

    2017-01-01

    Many undergraduate biology courses have begun to implement instructional strategies aimed at increasing student interaction with course material outside of the classroom. Two examples of such practices are introducing students to concepts as preparation prior to instruction, and revisiting them as conceptual reinforcement after the instructional period. Using a three-group design, we investigate the impact of an animation developed as part of the Virtual Cell Animation Collection, covering concentration gradients and their role in the actions of ATP synthase, used either as pre-class preparation or as post-class reinforcement, compared with a no-intervention control group. Results from seven sections of introductory biology (n = 732) randomized to treatments over two semesters show that students who viewed the animation as preparation (d = 0.44, p < 0.001) or as reinforcement (d = 0.53, p < 0.001) both outperformed students in the control group on a follow-up assessment. Direct comparison of the preparation and reinforcement treatments shows no significant difference in student outcomes between the two treatment groups (p = 0.87). Results suggest that while student interaction with animations on the topic of concentration gradients outside of the classroom may lead to greater learning outcomes than in the control group, in a traditional lecture-based course the timing of such interactions may be less important. PMID:28512512

  11. Racial bias shapes social reinforcement learning.

    PubMed

    Lindström, Björn; Selbing, Ida; Molapour, Tanaz; Olsson, Andreas

    2014-03-01

    Both emotional facial expressions and markers of racial-group belonging are ubiquitous signals in social interaction, but little is known about how these signals together affect future behavior through learning. To address this issue, we investigated how emotional (threatening or friendly) in-group and out-group faces reinforced behavior in a reinforcement-learning task. We asked whether reinforcement learning would be modulated by intergroup attitudes (i.e., racial bias). The results showed that individual differences in racial bias critically modulated reinforcement learning. As predicted, racial bias was associated with more efficiently learned avoidance of threatening out-group individuals. We used computational modeling analysis to quantitatively delimit the underlying processes affected by social reinforcement. These analyses showed that racial bias modulates the rate at which exposure to threatening out-group individuals is transformed into future avoidance behavior. In concert, these results shed new light on the learning processes underlying social interaction with racial in-group and out-group individuals.

  12. Hierarchically organized behavior and its neural foundations: A reinforcement-learning perspective

    PubMed Central

    Botvinick, Matthew M.; Niv, Yael; Barto, Andrew C.

    2009-01-01

    Research on human and animal behavior has long emphasized its hierarchical structure — the divisibility of ongoing behavior into discrete tasks, which are composed of subtask sequences, which in turn are built of simple actions. The hierarchical structure of behavior has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. In this paper, we reexamine behavioral hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning. Specifically, we consider a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills. A close look at the components of hierarchical reinforcement learning suggests how they might map onto neural structures, in particular regions within the dorsolateral and orbital prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement learning might provide a complement to existing psychological models of hierarchically structured behavior. A particularly important question that hierarchical reinforcement learning brings to the fore is that of how learning identifies new action routines that are likely to provide useful building blocks in solving a wide range of future problems. Here and at many other points, hierarchical reinforcement learning offers an appealing framework for investigating the computational and neural underpinnings of hierarchically structured behavior. PMID:18926527

  13. Neural mechanisms of reinforcement learning in unmedicated patients with major depressive disorder.

    PubMed

    Rothkirch, Marcus; Tonn, Jonas; Köhler, Stephan; Sterzer, Philipp

    2017-04-01

    According to current concepts, major depressive disorder is strongly related to dysfunctional neural processing of motivational information, entailing impairments in reinforcement learning. While computational modelling can reveal the precise nature of neural learning signals, it has not yet been used to study learning-related neural dysfunctions in unmedicated patients with major depressive disorder. We thus aimed at comparing the neural coding of reward and punishment prediction errors, representing indicators of neural learning-related processes, between unmedicated patients with major depressive disorder and healthy participants. To this end, a group of unmedicated patients with major depressive disorder (n = 28) and a group of age- and sex-matched healthy control participants (n = 30) completed an instrumental learning task involving monetary gains and losses during functional magnetic resonance imaging. The two groups did not differ in their learning performance. Patients and control participants showed the same level of prediction error-related activity in the ventral striatum and the anterior insula. In contrast, neural coding of reward prediction errors in the medial orbitofrontal cortex was reduced in patients. Moreover, neural reward prediction error signals in the medial orbitofrontal cortex and ventral striatum showed negative correlations with anhedonia severity. Using a standard instrumental learning paradigm, we found no evidence for an overall impairment of reinforcement learning in medication-free patients with major depressive disorder. Importantly, however, the attenuated neural coding of reward in the medial orbitofrontal cortex and the relation between anhedonia and reduced reward prediction error signalling in the medial orbitofrontal cortex and ventral striatum likely reflect an impairment in experiencing pleasure from rewarding events as a key mechanism of anhedonia in major depressive disorder. © The Author (2017). Published by Oxford University Press on behalf of the Guarantors of Brain. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
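
    The reward and punishment prediction errors whose neural coding is compared here are, computationally, the difference between the outcome received and the outcome expected. A minimal, generic sketch of that quantity and of how it updates an expected value is given below; it is illustrative only and is not the specific model used in the study.

```python
# Minimal sketch of the prediction-error signal referred to in the abstract:
# delta = received outcome minus expected value, used to update the expectation.
def update_expected_value(value, outcome, learning_rate=0.2):
    prediction_error = outcome - value      # reward (or punishment) prediction error
    return value + learning_rate * prediction_error, prediction_error

value = 0.0
for outcome in [1, 1, 0, 1, 0]:             # e.g., monetary gain = 1, loss/no gain = 0
    value, delta = update_expected_value(value, outcome)
    print(f"outcome={outcome}  prediction_error={delta:+.2f}  new_value={value:.2f}")
```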

  14. Decentralized reinforcement-learning control and emergence of motion patterns

    NASA Astrophysics Data System (ADS)

    Svinin, Mikhail; Yamada, Kazuyaki; Okhura, Kazuhiro; Ueda, Kanji

    1998-10-01

    In this paper we propose a system for studying the emergence of motion patterns in autonomous mobile robotic systems. The system implements an instance-based reinforcement learning control. Three spaces are of importance in the formulation of the control scheme: the work space, the sensor space, and the action space. An important feature of our system is that all these spaces are assumed to be continuous. The core part of the system is a classifier system. Based on the sensory state space analysis, the control is decentralized and is specified at the lowest level of the control system. However, the local controllers are implicitly connected through the perceived environment information. Therefore, they constitute a dynamic environment with respect to each other. The proposed control scheme is tested under simulation for a mobile robot in a navigation task. It is shown that some patterns of global behavior, such as collision avoidance, wall following, and light seeking, can emerge from the local controllers.

  15. Model-Based Reinforcement Learning under Concurrent Schedules of Reinforcement in Rodents

    ERIC Educational Resources Information Center

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-01-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's…
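
    The contrast drawn in this abstract, values updated only by trial and error versus values computed from the decision-maker's knowledge of the task, can be sketched generically as follows; both update rules are textbook illustrations rather than the models tested in the study.

```python
# Generic contrast: a model-free value updated only by trial and error (TD-style rule)
# versus a model-based value computed from an internal model of outcome probabilities.
def model_free_update(V, reward, alpha=0.1):
    return V + alpha * (reward - V)                 # trial-and-error update

def model_based_value(p_reward, reward_magnitude=1.0):
    return p_reward * reward_magnitude              # computed from a learned model

V = 0.0
for r in [1, 0, 1, 1, 0, 1]:
    V = model_free_update(V, r)
print(round(V, 3), model_based_value(p_reward=4 / 6))
```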

  16. The role of multisensor data fusion in neuromuscular control of a sagittal arm with a pair of muscles using actor-critic reinforcement learning method.

    PubMed

    Golkhou, V; Parnianpour, M; Lucas, C

    2004-01-01

    In this study, we consider the role of multisensor data fusion in neuromuscular control using an actor-critic reinforcement learning method. The model we use is a single-link system actuated by a pair of muscles that are excited with alpha and gamma signals. Various types of physiological sensor information, such as proprioception, spindle sensors, and Golgi tendon organs, have been integrated to achieve an oscillatory movement with variable amplitude and frequency, while achieving a stable movement with minimum metabolic cost and coactivation. The system is highly nonlinear in all its physical and physiological attributes. Transmission delays are included in the afferent and efferent neural paths to account for a more accurate representation of the reflex loops. This paper proposes a reinforcement learning method with an Actor-Critic architecture in place of the middle and low levels of the central nervous system (CNS). The Actor in this structure is a two-layer feedforward neural network and the Critic is a model of the cerebellum. The Critic is trained by the State-Action-Reward-State-Action (SARSA) method and in turn trains the Actor by supervised learning based on previous experiences. The reinforcement signal in SARSA is evaluated based on the available alternatives concerning the concept of multisensor data fusion. The effectiveness and the biological plausibility of the present model are demonstrated by several simulations. The system showed excellent tracking capability when we integrated the available sensor information. Addition of a penalty for activation of muscles resulted in much lower muscle coactivation while keeping the movement stable.
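
    For readers unfamiliar with the State-Action-Reward-State-Action rule named above, the sketch below shows a generic tabular SARSA update on a toy chain environment. The environment, parameters, and reward are placeholders chosen for illustration, not the musculoskeletal model or the cerebellar critic used in the study.

```python
import numpy as np

def sarsa(env_step, n_states, n_actions, episodes=200,
          alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Generic tabular SARSA: bootstrap on the action actually taken next."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def policy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, done = 0, False
        a = policy(s)
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = policy(s_next)
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next
    return Q

# toy 5-state chain: action 1 moves right, reaching state 4 gives reward 1
def chain_step(s, a):
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 4), s_next == 4

Q = sarsa(chain_step, n_states=5, n_actions=2, epsilon=0.5)
print(Q.round(2))
```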

  17. Learning Based Bidding Strategy for HVAC Systems in Double Auction Retail Energy Markets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sun, Yannan; Somani, Abhishek; Carroll, Thomas E.

    In this paper, a bidding strategy is proposed using reinforcement learning for HVAC systems in a double auction market. The bidding strategy does not require a specific model-based representation of behavior, i.e., a functional form to translate indoor house temperatures into bid prices. The results from the reinforcement learning based approach are compared with the HVAC bidding approach used in the AEP gridSMART® smart grid demonstration project, and it is shown that the model-free (learning based) approach closely tracks the results of the model-based behavior. Successful use of model-free approaches to represent device-level economic behavior may help develop similar approaches to represent the behavior of more complex devices or groups of diverse devices, such as in a building. Distributed control requires an understanding of the decision-making processes of intelligent agents so that appropriate mechanisms may be developed to control and coordinate their responses, and model-free approaches to represent behavior will be extremely useful in that quest.
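
    As a purely hypothetical illustration of what a model-free, learning-based bidder can look like, the sketch below uses tabular Q-learning over a discretized temperature-deviation state and a small set of bid-price levels, with a stand-in reward trading off energy cost against discomfort. The state, action, and reward definitions are assumptions made for this sketch, not the gridSMART formulation.

```python
import numpy as np

# Hypothetical model-free bidder: learn which bid-price level to submit in each
# indoor-temperature band from market feedback alone (no functional bid model).
rng = np.random.default_rng(1)
n_temp_bands, n_bid_levels = 5, 4
Q = np.zeros((n_temp_bands, n_bid_levels))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def market_feedback(temp_band, bid_level):
    """Stand-in for double-auction clearing: higher bids clear more often."""
    cleared = rng.random() < 0.25 * (bid_level + 1)
    cost = 0.2 * bid_level if cleared else 0.0
    discomfort = 0.0 if cleared else 0.3 * temp_band
    next_band = max(temp_band - 1, 0) if cleared else min(temp_band + 1, n_temp_bands - 1)
    return next_band, -(cost + discomfort)

band = 2
for t in range(5000):
    bid = int(rng.integers(n_bid_levels)) if rng.random() < epsilon else int(np.argmax(Q[band]))
    next_band, reward = market_feedback(band, bid)
    Q[band, bid] += alpha * (reward + gamma * np.max(Q[next_band]) - Q[band, bid])
    band = next_band

print(Q.round(2))   # learned bid preferences per temperature band
```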

  18. Variable Behavior and Repeated Learning in Two Mouse Strains: Developmental and Genetic Contributions.

    PubMed

    Arnold, Megan A; Newland, M Christopher

    2018-06-16

    Behavioral inflexibility is often assessed using reversal learning tasks, which require a relatively low degree of response variability. No studies have assessed sensitivity to reinforcement contingencies that specifically select highly variable response patterns in mice, let alone in models of neurodevelopmental disorders involving limited response variation. Operant variability and incremental repeated acquisition (IRA) were used to assess unique aspects of behavioral variability in two mouse strains: BALB/c, a model of some deficits found in autism spectrum disorder (ASD), and C57Bl/6. On the operant variability task, BALB/c mice responded more repetitively during adolescence than C57Bl/6 mice when reinforcement did not require variability but responded more variably when reinforcement required variability. During IRA testing in adulthood, both strains acquired an unchanging performance sequence equally well. Strain differences emerged, however, after novel learning sequences began alternating with the performance sequence: BALB/c mice substantially outperformed C57Bl/6 mice. Using litter-mate controls, it was found that adolescent experience with variability did not affect either learning or performance on the IRA task in adulthood. These findings constrain the use of BALB/c mice as a model of ASD, but once again reveal that this strain is highly sensitive to reinforcement contingencies and that these mice are fast, robust learners. Copyright © 2018. Published by Elsevier B.V.

  19. Off-Policy Integral Reinforcement Learning Method to Solve Nonlinear Continuous-Time Multiplayer Nonzero-Sum Games.

    PubMed

    Song, Ruizhuo; Lewis, Frank L; Wei, Qinglai

    2017-03-01

    This paper establishes an off-policy integral reinforcement learning (IRL) method to solve nonlinear continuous-time (CT) nonzero-sum (NZS) games with unknown system dynamics. The IRL algorithm is presented to obtain the iterative control law, and off-policy learning is used to allow the dynamics to be completely unknown. Off-policy IRL is designed to perform policy evaluation and policy improvement in the policy iteration algorithm. Critic and action networks are used to obtain the performance index and control for each player. A gradient descent algorithm updates the critic and action weights simultaneously. The convergence analysis of the weights is given. The asymptotic stability of the closed-loop system and the existence of the Nash equilibrium are proved. The simulation study demonstrates the effectiveness of the developed method for nonlinear CT NZS games with unknown system dynamics.

  20. Cost-Benefit Arbitration Between Multiple Reinforcement-Learning Systems.

    PubMed

    Kool, Wouter; Gershman, Samuel J; Cushman, Fiery A

    2017-09-01

    Human behavior is sometimes determined by habit and other times by goal-directed planning. Modern reinforcement-learning theories formalize this distinction as a competition between a computationally cheap but inaccurate model-free system that gives rise to habits and a computationally expensive but accurate model-based system that implements planning. It is unclear, however, how people choose to allocate control between these systems. Here, we propose that arbitration occurs by comparing each system's task-specific costs and benefits. To investigate this proposal, we conducted two experiments showing that people increase model-based control when it achieves greater accuracy than model-free control, and especially when the rewards of accurate performance are amplified. In contrast, they are insensitive to reward amplification when model-based and model-free control yield equivalent accuracy. This suggests that humans adaptively balance habitual and planned action through on-line cost-benefit analysis.
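
    The arbitration idea can be summarized in one schematic expression: the weight given to model-based control grows with its accuracy advantage, scaled by the reward at stake, and is discounted by its extra computational cost. The logistic form and parameters below are illustrative assumptions only, not the model reported in the paper.

```python
import numpy as np

# Schematic cost-benefit arbitration between model-based (MB) and model-free (MF) control.
def model_based_weight(acc_mb, acc_mf, reward_stake, mb_cost, slope=5.0):
    net_benefit = reward_stake * (acc_mb - acc_mf) - mb_cost
    return 1.0 / (1.0 + np.exp(-slope * net_benefit))   # logistic arbitration

# model-based control pays off when it is more accurate and the stakes are amplified ...
print(model_based_weight(acc_mb=0.9, acc_mf=0.6, reward_stake=2.0, mb_cost=0.3))
# ... but not when both systems are equally accurate
print(model_based_weight(acc_mb=0.7, acc_mf=0.7, reward_stake=2.0, mb_cost=0.3))
```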

  1. Pavlovian to instrumental transfer of control in a human learning task.

    PubMed

    Nadler, Natasha; Delgado, Mauricio R; Delamater, Andrew R

    2011-10-01

    Pavlovian learning tasks have been widely used as tools to understand basic cognitive and emotional processes in humans. The present studies investigated one particular task, Pavlovian-to-instrumental transfer (PIT), with human participants in an effort to examine potential cognitive and emotional effects of Pavlovian cues upon instrumentally trained performance. In two experiments, subjects first learned two separate instrumental response-outcome relationships (i.e., R1-O1 and R2-O2) and then were exposed to various stimulus-outcome relationships (i.e., S1-O1, S2-O2, S3-O3, and S4-) before the effects of the Pavlovian stimuli on instrumental responding were assessed during a non-reinforced test. In Experiment 1, instrumental responding was established using a positive-reinforcement procedure, whereas in Experiment 2, a quasi-avoidance learning task was used. In both cases, the Pavlovian stimuli exerted selective control over instrumental responding, whereby S1 and S2 each selectively elevated the instrumental response with which it shared an outcome. In addition, in Experiment 2, S3 exerted a nonselective transfer of control effect, whereby both responses were elevated over baseline levels. These data identify two ways, one specific and one general, in which Pavlovian processes can exert control over instrumental responding in human learning paradigms, suggesting that this method may serve as a useful tool in the study of basic cognitive and emotional processes in human learning.

  2. Predicting psychosis across diagnostic boundaries: Behavioral and computational modeling evidence for impaired reinforcement learning in schizophrenia and bipolar disorder with a history of psychosis.

    PubMed

    Strauss, Gregory P; Thaler, Nicholas S; Matveeva, Tatyana M; Vogel, Sally J; Sutton, Griffin P; Lee, Bern G; Allen, Daniel N

    2015-08-01

    There is increasing evidence that schizophrenia (SZ) and bipolar disorder (BD) share a number of cognitive, neurobiological, and genetic markers. Shared features may be most prevalent among SZ and BD with a history of psychosis. This study extended this literature by examining reinforcement learning (RL) performance in individuals with SZ (n = 29), BD with a history of psychosis (BD+; n = 24), BD without a history of psychosis (BD-; n = 23), and healthy controls (HC; n = 24). RL was assessed through a probabilistic stimulus selection task with acquisition and test phases. Computational modeling evaluated competing accounts of the data. Each participant's trial-by-trial decision-making behavior was fit to 3 computational models of RL: (a) a standard actor-critic model simulating pure basal ganglia-dependent learning, (b) a pure Q-learning model simulating action selection as a function of learned expected reward value, and (c) a hybrid model where an actor-critic is "augmented" by a Q-learning component, meant to capture the top-down influence of orbitofrontal cortex value representations on the striatum. The SZ group demonstrated greater reinforcement learning impairments at acquisition and test phases than the BD+, BD-, and HC groups. The BD+ and BD- groups displayed comparable performance at acquisition and test phases. Collapsing across diagnostic categories, greater severity of current psychosis was associated with poorer acquisition of the most rewarding stimuli as well as poor go/no-go learning at test. Model fits revealed that reinforcement learning in SZ was best characterized by a pure actor-critic model where learning is driven by prediction error signaling alone. In contrast, BD-, BD+, and HC were best fit by a hybrid model where prediction errors are influenced by top-down expected value representations that guide decision making. These findings suggest that abnormalities in the reward system are more prominent in SZ than BD; however, current psychotic symptoms may be associated with reinforcement learning deficits regardless of a Diagnostic and Statistical Manual of Mental Disorders (5th Edition; American Psychiatric Association, 2013) diagnosis. (c) 2015 APA, all rights reserved.
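
    The three model classes compared above can be sketched for a simple two-armed choice: an actor-critic driven purely by the critic's prediction error, a Q-learner driven by learned expected reward value, and a hybrid that mixes the two when selecting actions. The code below is a simplified rendering for illustration; the mixing rule and parameters are assumptions, not the exact equations fitted in the study.

```python
import numpy as np

def hybrid_choice_probs(actor_w, Q, mix=0.5, beta=3.0):
    """Softmax over a mixture of actor preferences and Q-values (the hybrid model)."""
    pref = (1 - mix) * actor_w + mix * Q
    e = np.exp(beta * (pref - pref.max()))
    return e / e.sum()

def hybrid_update(actor_w, V, Q, choice, reward, alpha=0.2):
    delta = reward - V                              # critic prediction error
    V += alpha * delta                              # critic (state value)
    actor_w[choice] += alpha * delta                # actor learns from the critic's error
    Q[choice] += alpha * (reward - Q[choice])       # Q-learning expected value
    return actor_w, V, Q

rng = np.random.default_rng(0)
actor_w, V, Q = np.zeros(2), 0.0, np.zeros(2)
for _ in range(200):                                # arm 0 rewarded 80%, arm 1 20%
    p = hybrid_choice_probs(actor_w, Q)
    choice = rng.choice(2, p=p)
    reward = float(rng.random() < (0.8 if choice == 0 else 0.2))
    actor_w, V, Q = hybrid_update(actor_w, V, Q, choice, reward)
print(actor_w.round(2), round(V, 2), Q.round(2))
```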

  3. Motor Learning Enhances Use-Dependent Plasticity

    PubMed Central

    2017-01-01

    Motor behaviors are shaped not only by current sensory signals but also by the history of recent experiences. For instance, repeated movements toward a particular target bias the subsequent movements toward that target direction. This process, called use-dependent plasticity (UDP), is considered a basic and goal-independent way of forming motor memories. Most studies consider movement history as the critical component that leads to UDP (Classen et al., 1998; Verstynen and Sabes, 2011). However, the effects of learning (i.e., improved performance) on UDP during movement repetition have not been investigated. Here, we used transcranial magnetic stimulation in two experiments to assess plasticity changes occurring in the primary motor cortex after individuals repeated reinforced and nonreinforced actions. The first experiment assessed whether learning a skill task modulates UDP. We found that a group that successfully learned the skill task showed greater UDP than a group that did not accumulate learning, but made comparable repeated actions. The second experiment aimed to understand the role of reinforcement learning in UDP while controlling for reward magnitude and action kinematics. We found that providing subjects with a binary reward without visual feedback of the cursor led to increased UDP effects. Subjects in the group that received comparable reward not associated with their actions maintained the previously induced UDP. Our findings illustrate how reinforcing consistent actions strengthens use-dependent memories and provide insight into operant mechanisms that modulate plastic changes in the motor cortex. SIGNIFICANCE STATEMENT Performing consistent motor actions induces use-dependent plastic changes in the motor cortex. This plasticity reflects one of the basic forms of human motor learning. Past studies assumed that this form of learning is exclusively affected by repetition of actions. However, here we showed that success-based reinforcement signals could affect the human use-dependent plasticity (UDP) process. Our results indicate that learning augments and interacts with UDP. This effect is important to the understanding of the interplay between the different forms of motor learning and suggests that reinforcement is not only important to learning new behaviors, but can shape our subsequent behavior via its interaction with UDP. PMID:28143961

  4. Control of Appetitive and Aversive Taste-Reactivity Responses by an Auditory Conditioned Stimulus in a Devaluation Task: A FOS and Behavioral Analysis

    ERIC Educational Resources Information Center

    Kerfoot, Erin C.; Agarwal, Isha; Lee, Hongjoo J.; Holland, Peter C.

    2007-01-01

    Through associative learning, cues for biologically significant reinforcers such as food may gain access to mental representations of those reinforcers. Here, we used devaluation procedures, behavioral assessment of hedonic taste-reactivity responses, and measurement of immediate-early gene (IEG) expression to show that a cue for food engages…

  5. 11.2 YIP Human In the Loop Statistical RelationalLearners

    DTIC Science & Technology

    2017-10-23

    learning formalisms including inverse reinforcement learning [4] and statistical relational learning [7, 5, 8]. We have also applied our algorithms in ... one introduced for label preferences. [Figure 2: Active Advice Seeking for Inverse Reinforcement Learning.] ... active advice seeking is in selecting the ... learning tasks. 1.2.1 Sequential Decision-Making: Our previous work on advice for inverse reinforcement learning (IRL) defined advice as action

  6. Roles of Approval Motivation and Generalized Expectancy for Reinforcement in Children's Conceptual Discrimination Learning

    ERIC Educational Resources Information Center

    Nyce, Peggy A.; And Others

    1977-01-01

    Forty-four third graders were given a two-choice conceptual discrimination learning task. The two major factors were (1) four treatment groups varying at the extremes on two personality measures, approval motivation and locus of control and (2) sex. (MS)

  7. Enhanced Experience Replay for Deep Reinforcement Learning

    DTIC Science & Technology

    2015-11-01

    ARL-TR-7538 ● NOV 2015 ● US Army Research Laboratory. Enhanced Experience Replay for Deep Reinforcement Learning, by David Doria, Bryan Dawson, and Manuel Vindiola. Computational and Information Sciences Directorate...

  8. Reconciling Reinforcement Learning Models with Behavioral Extinction and Renewal: Implications for Addiction, Relapse, and Problem Gambling

    ERIC Educational Resources Information Center

    Redish, A. David; Jensen, Steve; Johnson, Adam; Kurth-Nelson, Zeb

    2007-01-01

    Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated value and are thus unable to capture renewal. TDRL…

  9. Change in the relative contributions of habit and working memory facilitates serial reversal learning expertise in rhesus monkeys.

    PubMed

    Hassett, Thomas C; Hampton, Robert R

    2017-05-01

    Functionally distinct memory systems likely evolved in response to incompatible demands placed on learning by distinct environmental conditions. Working memory appears adapted, in part, for conditions that change frequently, making rapid acquisition and brief retention of information appropriate. In contrast, habits form gradually over many experiences, adapting organisms to contingencies of reinforcement that are stable over relatively long intervals. Serial reversal learning provides an opportunity to simultaneously examine the processes involved in adapting to rapidly changing and relatively stable contingencies. In serial reversal learning, selecting one of the two simultaneously presented stimuli is positively reinforced, while selection of the other is not. After a preference for the positive stimulus develops, the contingencies of reinforcement reverse. Naïve subjects adapt to such reversals gradually, perseverating in selection of the previously rewarded stimulus. Experts reverse rapidly according to a win-stay, lose-shift response pattern. We assessed whether a change in the relative control of choice by habit and working memory accounts for the development of serial reversal learning expertise. Across three experiments, we applied manipulations intended to attenuate the contribution of working memory but leave the contribution of habit intact. We contrasted performance following long and short intervals in Experiments 1 and 2, and we interposed a competing cognitive load between trials in Experiment 3. These manipulations slowed the acquisition of reversals in expert subjects, but not naïve subjects, indicating that serial reversal learning expertise is facilitated by a shift in the control of choice from passively acquired habit to actively maintained working memory.

  10. Behavioral and neural properties of social reinforcement learning

    PubMed Central

    Jones, Rebecca M.; Somerville, Leah H.; Li, Jian; Ruberry, Erika J.; Libby, Victoria; Glover, Gary; Voss, Henning U.; Ballon, Douglas J.; Casey, BJ

    2011-01-01

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based upon work in non-human primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging (fMRI). Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis (social preferences, response latencies, and modeled neural responses) are consistent with reinforcement learning theory and non-human primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one’s peers in altering subsequent behavior. PMID:21917787

  11. The combination of appetitive and aversive reinforcers and the nature of their interaction during auditory learning.

    PubMed

    Ilango, A; Wetzel, W; Scheich, H; Ohl, F W

    2010-03-31

    Learned changes in behavior can be elicited by either appetitive or aversive reinforcers. It is, however, not clear whether the two types of motivation (approaching appetitive stimuli and avoiding aversive stimuli) drive learning in the same or different ways, nor is their interaction understood in situations where the two types are combined in a single experiment. To investigate this question we have developed a novel learning paradigm for Mongolian gerbils, which not only allows rewards and punishments to be presented in isolation or in combination with each other, but also can use these opposite reinforcers to drive the same learned behavior. Specifically, we studied learning of tone-conditioned hurdle crossing in a shuttle box driven by either an appetitive reinforcer (brain stimulation reward) or an aversive reinforcer (electrical footshock), or by a combination of both. Combination of the two reinforcers potentiated speed of acquisition, led to maximum possible performance, and delayed extinction as compared to either reinforcer alone. Additional experiments, using partial reinforcement protocols and experiments in which one of the reinforcers was omitted after the animals had been previously trained with the combination of both reinforcers, indicated that appetitive and aversive reinforcers operated together but acted in different ways: in this particular experimental context, punishment appeared to be more effective for initial acquisition and reward more effective to maintain a high level of conditioned responses (CRs). The results imply that learning mechanisms in problem solving were maximally effective when the initial punishment of mistakes was combined with the subsequent rewarding of correct performance. Copyright 2010 IBRO. Published by Elsevier Ltd. All rights reserved.

  12. Punishment Insensitivity and Impaired Reinforcement Learning in Preschoolers

    ERIC Educational Resources Information Center

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2014-01-01

    Background: Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a…

  13. Stimulus discriminability may bias value-based probabilistic learning.

    PubMed

    Schutte, Iris; Slagter, Heleen A; Collins, Anne G E; Frank, Michael J; Kenemans, J Leon

    2017-01-01

    Reinforcement learning tasks are often used to assess participants' tendency to learn more from the positive or more from the negative consequences of their actions. However, this assessment often requires comparison of learning performance across different task conditions, which may differ in the relative salience or discriminability of the stimuli associated with more and less rewarding outcomes, respectively. To address this issue, in a first set of studies, participants were subjected to two versions of a common probabilistic learning task. The two versions differed with respect to the stimulus (Hiragana) characters associated with reward probability. The assignment of character to reward probability was fixed within version but reversed between versions. We found that performance was highly influenced by task version, which could be explained by the relative perceptual discriminability of characters assigned to high or low reward probabilities, as assessed by a separate discrimination experiment. Participants were more reliable in selecting rewarding characters that were more discriminable, leading to differences in learning curves and their sensitivity to reward probability. This difference in experienced reinforcement history was accompanied by performance biases in a test phase assessing ability to learn from positive vs. negative outcomes. In a subsequent large-scale web-based experiment, this impact of task version on learning and test measures was replicated and extended. Collectively, these findings imply a key role for perceptual factors in guiding reward learning and underscore the need to control stimulus discriminability when making inferences about individual differences in reinforcement learning.

  14. The cerebellum: a neural system for the study of reinforcement learning.

    PubMed

    Swain, Rodney A; Kerr, Abigail L; Thompson, Richard F

    2011-01-01

    In its strictest application, the term "reinforcement learning" refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Given that Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated, and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed), it follows that any neural system that supports reinforcement learning must also be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning.

  15. A new computational account of cognitive control over reinforcement-based decision-making: Modeling of a probabilistic learning task.

    PubMed

    Zendehrouh, Sareh

    2015-11-01

    Recent work in the decision-making field offers an account of dual-system theory for the decision-making process. This theory holds that this process is conducted by two main controllers: a goal-directed system and a habitual system. In the reinforcement learning (RL) domain, habitual behaviors are connected with model-free methods, in which appropriate actions are learned through trial-and-error experiences. In contrast, goal-directed behaviors are associated with model-based methods of RL, in which actions are selected using a model of the environment. Studies on cognitive control also suggest that during processes like decision-making, some cortical and subcortical structures work in concert to monitor the consequences of decisions and to adjust control according to current task demands. Here a computational model is presented based on dual-system theory and the cognitive control perspective of decision-making. The proposed model is used to simulate human performance on a variant of a probabilistic learning task. The basic proposal is that the brain implements a dual controller, while an accompanying monitoring system detects several kinds of conflict, including a hypothetical cost conflict. The simulation results address existing theories about two event-related potentials, namely error related negativity (ERN) and feedback related negativity (FRN), and explore the best account of them. Based on the results, some testable predictions are also presented. Copyright © 2015 Elsevier Ltd. All rights reserved.

  16. Safe Exploration Algorithms for Reinforcement Learning Controllers.

    PubMed

    Mannucci, Tommaso; van Kampen, Erik-Jan; de Visser, Cornelis; Chu, Qiping

    2018-04-01

    Self-learning approaches, such as reinforcement learning, offer new possibilities for autonomous control of uncertain or time-varying systems. However, exploring an unknown environment under limited prediction capabilities is a challenge for a learning agent. If the environment is dangerous, free exploration can result in physical damage or otherwise unacceptable behavior. With respect to existing methods, the main contribution of this paper is the definition of a new approach that does not require global safety functions, nor specific formulations of the dynamics or of the environment, but relies on interval estimation of the dynamics of the agent during the exploration phase, assuming a limited capability of the agent to perceive the presence of incoming fatal states. Two algorithms are presented with this approach. The first is the Safety Handling Exploration with Risk Perception Algorithm (SHERPA), which provides safety by identifying temporary safety functions, called backups. SHERPA is demonstrated in a simulated, simplified quadrotor task, in which dangerous states are avoided. The second algorithm, named OptiSHERPA, uses safety metrics to safely handle more dynamically complex systems for which SHERPA is not sufficient. An application of OptiSHERPA is simulated on an aircraft altitude control task.
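
    The core safety idea, screening exploratory actions through an interval estimate of the dynamics and falling back on a backup when no candidate can be certified, can be illustrated schematically. The sketch below is not the SHERPA algorithm itself; the dynamics, uncertainty bound, and fatal region are assumptions chosen for illustration.

```python
# Schematic interval-based safe-exploration filter (illustrative only):
# an exploratory action is allowed only if its worst-case successor state stays
# outside a known fatal region; otherwise a backup action is used instead.
def interval_model(x, u, uncertainty=0.2):
    """Assumed one-step interval prediction: nominal dynamics +/- uncertainty."""
    nominal = x + u                      # placeholder dynamics
    return nominal - uncertainty, nominal + uncertainty

def safe_action(x, candidates, backup, fatal_low=-1.0, fatal_high=1.0):
    for u in candidates:                 # prefer the exploratory candidates
        lo, hi = interval_model(x, u)
        if lo > fatal_low and hi < fatal_high:
            return u                     # worst case stays in the safe set
    return backup                        # no candidate is certifiably safe

x = 0.6
print(safe_action(x, candidates=[0.5, -0.2], backup=-0.5))   # 0.5 rejected, -0.2 kept
```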

  17. Prenatal exposure to ethanol during late gestation facilitates operant self-administration of the drug in 5-day-old rats

    PubMed Central

    Miranda-Morales, Roberto Sebastián; Nizhnikov, Michael E.; Spear, Norman E.

    2014-01-01

    Prenatal ethanol exposure modifies postnatal affinity to the drug, increasing the probability of ethanol use and abuse. The present study tested developing rats (5-day-old) in a novel operant technique to assess the degree of ethanol self-administration as a result of prenatal exposure to low ethanol doses during late gestation. On a single occasion during each of gestational days 17–20, pregnant rats were intragastrically administered ethanol 1 g/kg, or water (vehicle). On postnatal day 5, pups were tested on a novel operant conditioning procedure in which they learned to touch a sensor to obtain 0.1% saccharin, 3% ethanol, or 5% ethanol. Immediately after a 15-min training session, a 6-min extinction session was given in which operant behavior had no consequence. Pups were positioned on a smooth surface and had access to a touch-sensitive sensor. Physical contact with the sensor activated an infusion pump, which served to deliver an intraoral solution as reinforcement (Paired group). A Yoked control animal evaluated at the same time received the reinforcer when its corresponding Paired pup touched the sensor. Operant behavior to gain access to 3% ethanol was facilitated by prenatal exposure to ethanol during late gestation. In contrast, operant learning reflecting ethanol reinforcement did not occur in control animals prenatally exposed to water only. Similarly, saccharin reinforcement was not affected by prenatal ethanol exposure. These results suggest that in 5-day-old rats, prenatal exposure to a low ethanol dose facilitates operant learning reinforced by intraoral administration of a low-concentration ethanol solution. This emphasizes the importance of intrauterine experiences with ethanol in later susceptibility to drug reinforcement. The present operant conditioning technique represents an alternative tool to assess self-administration and seeking behavior during early stages of development. PMID:24355072

  18. Prenatal exposure to ethanol during late gestation facilitates operant self-administration of the drug in 5-day-old rats.

    PubMed

    Miranda-Morales, Roberto Sebastián; Nizhnikov, Michael E; Spear, Norman E

    2014-02-01

    Prenatal ethanol exposure modifies postnatal affinity to the drug, increasing the probability of ethanol use and abuse. The present study tested developing rats (5-day-old) in a novel operant technique to assess the degree of ethanol self-administration as a result of prenatal exposure to low ethanol doses during late gestation. On a single occasion during each of gestational days 17-20, pregnant rats were intragastrically administered ethanol 1 g/kg, or water (vehicle). On postnatal day 5, pups were tested on a novel operant conditioning procedure in which they learned to touch a sensor to obtain 0.1% saccharin, 3% ethanol, or 5% ethanol. Immediately after a 15-min training session, a 6-min extinction session was given in which operant behavior had no consequence. Pups were positioned on a smooth surface and had access to a touch-sensitive sensor. Physical contact with the sensor activated an infusion pump, which served to deliver an intraoral solution as reinforcement (Paired group). A Yoked control animal evaluated at the same time received the reinforcer when its corresponding Paired pup touched the sensor. Operant behavior to gain access to 3% ethanol was facilitated by prenatal exposure to ethanol during late gestation. In contrast, operant learning reflecting ethanol reinforcement did not occur in control animals prenatally exposed to water only. Similarly, saccharin reinforcement was not affected by prenatal ethanol exposure. These results suggest that in 5-day-old rats, prenatal exposure to a low ethanol dose facilitates operant learning reinforced by intraoral administration of a low-concentration ethanol solution. This emphasizes the importance of intrauterine experiences with ethanol in later susceptibility to drug reinforcement. The present operant conditioning technique represents an alternative tool to assess self-administration and seeking behavior during early stages of development. Published by Elsevier Inc.

  19. Reinforcement learning for adaptive optimal control of unknown continuous-time nonlinear systems with input constraints

    NASA Astrophysics Data System (ADS)

    Yang, Xiong; Liu, Derong; Wang, Ding

    2014-03-01

    In this paper, an adaptive reinforcement learning-based solution is developed for the infinite-horizon optimal control problem of constrained-input continuous-time nonlinear systems in the presence of nonlinearities with unknown structures. Two different types of neural networks (NNs) are employed to approximate the Hamilton-Jacobi-Bellman equation. That is, a recurrent NN is constructed to identify the unknown dynamical system, and two feedforward NNs are used as the actor and the critic to approximate the optimal control and the optimal cost, respectively. Based on this framework, the action NN and the critic NN are tuned simultaneously, without requiring knowledge of the system drift dynamics. Moreover, by using Lyapunov's direct method, the weights of the action NN and the critic NN are guaranteed to be uniformly ultimately bounded, while keeping the closed-loop system stable. To demonstrate the effectiveness of the present approach, simulation results are presented.

  20. Cognitive control predicts use of model-based reinforcement learning.

    PubMed

    Otto, A Ross; Skatova, Anya; Madlon-Kay, Seth; Daw, Nathaniel D

    2015-02-01

    Accounts of decision-making and its neural substrates have long posited the operation of separate, competing valuation systems in the control of choice behavior. Recent theoretical and experimental work suggests that this classic distinction between behaviorally and neurally dissociable systems for habitual and goal-directed (or more generally, automatic and controlled) choice may arise from two computational strategies for reinforcement learning (RL), called model-free and model-based RL, but the cognitive or computational processes by which one system comes to dominate the other in the control of behavior are a matter of ongoing investigation. To elucidate this question, we leverage the theoretical framework of cognitive control, demonstrating that individual differences in the utilization of goal-related contextual information, in the service of overcoming habitual, stimulus-driven responses, in established cognitive control paradigms predict model-based behavior in a separate, sequential choice task. The behavioral correspondence between cognitive control and model-based RL compellingly suggests that a common set of processes may underpin the two behaviors. In particular, computational mechanisms originally proposed to underlie controlled behavior may be applicable to understanding the interactions between model-based and model-free choice behavior.

  1. Fear of losing money? Aversive conditioning with secondary reinforcers.

    PubMed

    Delgado, M R; Labouliere, C D; Phelps, E A

    2006-12-01

    Money is a secondary reinforcer that acquires its value through social communication and interaction. In everyday human behavior and laboratory studies, money has been shown to influence appetitive or reward learning. It is unclear, however, if money has a similar impact on aversive learning. The goal of this study was to investigate the efficacy of money in aversive learning, comparing it with primary reinforcers that are traditionally used in fear conditioning paradigms. A series of experiments were conducted in which participants initially played a gambling game that led to a monetary gain. They were then presented with an aversive conditioning paradigm, with either shock (primary reinforcer) or loss of money (secondary reinforcer) as the unconditioned stimulus. Skin conductance responses and subjective ratings indicated that potential monetary loss modulated the conditioned response. Depending on the presentation context, the secondary reinforcer was as effective as the primary reinforcer during aversive conditioning. These results suggest that stimuli that acquire reinforcing properties through social communication and interaction, such as money, can effectively influence aversive learning.

  2. Reinforcement learning and Tourette syndrome.

    PubMed

    Palminteri, Stefano; Pessiglione, Mathias

    2013-01-01

    In this chapter, we report the first experimental explorations of reinforcement learning in Tourette syndrome, carried out by our team in the last few years. This report is preceded by an introduction aimed at providing the reader with the state of the art of knowledge concerning the neural bases of reinforcement learning at the time of these studies and the scientific rationale behind them. In short, reinforcement learning is learning by trial and error to maximize rewards and minimize punishments. This decision-making and learning process implicates the dopaminergic system projecting to the frontal cortex-basal ganglia circuits. A large body of evidence suggests that the dysfunction of the same neural systems is implicated in the pathophysiology of Tourette syndrome. Our results show that the Tourette condition, as well as the most common pharmacological treatments (dopamine antagonists), affects reinforcement learning performance in these patients. Specifically, the results suggest a deficit in negative reinforcement learning, possibly underpinned by a functional hyperdopaminergia, which could explain the persistence of tics despite their evidently maladaptive (negative) value. This idea, together with the implications of these results for Tourette therapy and future perspectives, is discussed in Section 4 of this chapter. © 2013 Elsevier Inc. All rights reserved.

  3. Effects of Dopamine Medication on Sequence Learning with Stochastic Feedback in Parkinson's Disease

    PubMed Central

    Seo, Moonsang; Beigi, Mazda; Jahanshahi, Marjan; Averbeck, Bruno B.

    2010-01-01

    A growing body of evidence suggests that the midbrain dopamine system plays a key role in reinforcement learning and disruption of the midbrain dopamine system in Parkinson's disease (PD) may lead to deficits on tasks that require learning from feedback. We examined how changes in dopamine levels (“ON” and “OFF” their dopamine medication) affect sequence learning from stochastic positive and negative feedback using Bayesian reinforcement learning models. We found deficits in sequence learning in patients with PD when they were “ON” and “OFF” medication relative to healthy controls, but smaller differences between patients “OFF” and “ON”. The deficits were mainly due to decreased learning from positive feedback, although across all participant groups learning was more strongly associated with positive than negative feedback in our task. The learning in our task is likely mediated by the relatively depleted dorsal striatum and not the relatively intact ventral striatum. Therefore, the changes we see in our task may be due to a strong loss of phasic dopamine signals in the dorsal striatum in PD. PMID:20740077

  4. Tonic or Phasic Stimulation of Dopaminergic Projections to Prefrontal Cortex Causes Mice to Maintain or Deviate from Previously Learned Behavioral Strategies

    PubMed Central

    Ellwood, Ian T.; Patel, Tosha; Wadia, Varun; Lee, Anthony T.; Liptak, Alayna T.

    2017-01-01

    Dopamine neurons in the ventral tegmental area (VTA) encode reward prediction errors and can drive reinforcement learning through their projections to striatum, but much less is known about their projections to prefrontal cortex (PFC). Here, we studied these projections and observed phasic VTA–PFC fiber photometry signals after the delivery of rewards. Next, we studied how optogenetic stimulation of these projections affects behavior using conditioned place preference and a task in which mice learn associations between cues and food rewards and then use those associations to make choices. Neither phasic nor tonic stimulation of dopaminergic VTA–PFC projections elicited place preference. Furthermore, substituting phasic VTA–PFC stimulation for food rewards was not sufficient to reinforce new cue–reward associations nor maintain previously learned ones. However, the same patterns of stimulation that failed to reinforce place preference or cue–reward associations were able to modify behavior in other ways. First, continuous tonic stimulation maintained previously learned cue–reward associations even after they ceased being valid. Second, delivering phasic stimulation either continuously or after choices not previously associated with reward induced mice to make choices that deviated from previously learned associations. In summary, despite the fact that dopaminergic VTA–PFC projections exhibit phasic increases in activity that are time locked to the delivery of rewards, phasic activation of these projections does not necessarily reinforce specific actions. Rather, dopaminergic VTA–PFC activity can control whether mice maintain or deviate from previously learned cue–reward associations. SIGNIFICANCE STATEMENT Dopaminergic inputs from ventral tegmental area (VTA) to striatum encode reward prediction errors and reinforce specific actions; however, it is currently unknown whether dopaminergic inputs to prefrontal cortex (PFC) play similar or distinct roles. Here, we used bulk Ca2+ imaging to show that unexpected rewards or reward-predicting cues elicit phasic increases in the activity of dopaminergic VTA–PFC fibers. However, in multiple behavioral paradigms, we failed to observe reinforcing effects after stimulation of these fibers. In these same experiments, we did find that tonic or phasic patterns of stimulation caused mice to maintain or deviate from previously learned cue–reward associations, respectively. Therefore, although they may exhibit similar patterns of activity, dopaminergic inputs to striatum and PFC can elicit divergent behavioral effects. PMID:28739583

  5. Intelligence moderates reinforcement learning: a mini-review of the neural evidence

    PubMed Central

    2014-01-01

    Our understanding of the neural basis of reinforcement learning and intelligence, two key factors contributing to human strivings, has progressed significantly recently. However, the overlap of these two lines of research, namely, how intelligence affects neural responses during reinforcement learning, remains uninvestigated. A mini-review of three existing studies suggests that higher IQ (especially fluid IQ) may enhance the neural signal of positive prediction error in dorsolateral prefrontal cortex, dorsal anterior cingulate cortex, and striatum, several brain substrates of reinforcement learning or intelligence. PMID:25185818

  6. Intelligence moderates reinforcement learning: a mini-review of the neural evidence.

    PubMed

    Chen, Chong

    2015-06-01

    Our understanding of the neural basis of reinforcement learning and intelligence, two key factors contributing to human strivings, has progressed significantly recently. However, the overlap of these two lines of research, namely, how intelligence affects neural responses during reinforcement learning, remains uninvestigated. A mini-review of three existing studies suggests that higher IQ (especially fluid IQ) may enhance the neural signal of positive prediction error in dorsolateral prefrontal cortex, dorsal anterior cingulate cortex, and striatum, several brain substrates of reinforcement learning or intelligence. Copyright © 2015 the American Physiological Society.

  7. Discrete-time online learning control for a class of unknown nonaffine nonlinear systems using reinforcement learning.

    PubMed

    Yang, Xiong; Liu, Derong; Wang, Ding; Wei, Qinglai

    2014-07-01

    In this paper, a reinforcement-learning-based direct adaptive control is developed to deliver a desired tracking performance for a class of discrete-time (DT) nonlinear systems with unknown bounded disturbances. We investigate multi-input multi-output unknown nonaffine nonlinear DT systems and employ two neural networks (NNs). Using the implicit function theorem, an action NN is used to generate the control signal; it is also designed to cancel the nonlinearity of the unknown DT systems so that feedback linearization methods can be utilized. A critic NN is applied to estimate the cost function, which satisfies the recursive equations derived from heuristic dynamic programming. The weights of both the action NN and the critic NN are updated directly online rather than trained offline. By utilizing Lyapunov's direct method, the closed-loop tracking errors and the NN weight estimates are demonstrated to be uniformly ultimately bounded. Two numerical examples are provided to show the effectiveness of the present approach. Copyright © 2014 Elsevier Ltd. All rights reserved.

  8. Using Google Drive to Facilitate a Blended Approach to Authentic Learning

    ERIC Educational Resources Information Center

    Rowe, Michael; Bozalek, Vivienne; Frantz, Jose

    2013-01-01

    While technology has the potential to create opportunities for transformative learning in higher education, it is often used to merely reinforce didactic teaching that aims to control access to expert knowledge. Instead, educators should consider using technology to enhance communication and provide richer, more meaningful platforms for the social…

  9. Adaptive critic autopilot design of bank-to-turn missiles using fuzzy basis function networks.

    PubMed

    Lin, Chuan-Kai

    2005-04-01

    A new adaptive critic autopilot design for bank-to-turn missiles is presented. In this paper, the architecture of the adaptive critic learning scheme contains a fuzzy-basis-function-network-based associative search element (ASE), which is employed to approximate the nonlinear and complex functions of bank-to-turn missiles, and an adaptive critic element (ACE) that generates the reinforcement signal to tune the associative search element. In the design of the adaptive critic autopilot, the control law receives signals from a fixed-gain controller, the ASE, and an adaptive robust element, which can eliminate approximation errors and disturbances. Traditional adaptive critic reinforcement learning addresses the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment; the proposed tuning algorithm, however, can significantly shorten the learning time by tuning all parameters of the fuzzy basis functions and the weights of the ASE and ACE online. Moreover, the weight updating law derived from Lyapunov stability theory is capable of guaranteeing both tracking performance and stability. Computer simulation results confirm the effectiveness of the proposed adaptive critic autopilot.

  10. Generating Adaptive Behaviour within a Memory-Prediction Framework

    PubMed Central

    Rawlinson, David; Kowadlo, Gideon

    2012-01-01

    The Memory-Prediction Framework (MPF) and its Hierarchical-Temporal Memory implementation (HTM) have been widely applied to unsupervised learning problems, for both classification and prediction. To date, there has been no attempt to incorporate MPF/HTM in reinforcement learning or other adaptive systems; that is, to use knowledge embodied within the hierarchy to control a system, or to generate behaviour for an agent. This problem is interesting because the human neocortex is believed to play a vital role in the generation of behaviour, and the MPF is a model of the human neocortex. We propose some simple and biologically-plausible enhancements to the Memory-Prediction Framework. These cause it to explore and interact with an external world, while trying to maximize a continuous, time-varying reward function. All behaviour is generated and controlled within the MPF hierarchy. The hierarchy develops from a random initial configuration by interaction with the world and reinforcement learning only. Among other demonstrations, we show that a 2-node hierarchy can learn to successfully play “rocks, paper, scissors” against a predictable opponent. PMID:22272231

  11. Space Objects Maneuvering Detection and Prediction via Inverse Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Linares, R.; Furfaro, R.

    This paper determines the behavior of Space Objects (SOs) using inverse Reinforcement Learning (RL) to estimate the reward function that each SO is using for control. The approach discussed in this work can be used to analyze maneuvering of SOs from observational data. The inverse RL problem is solved using the Feature Matching approach. This approach determines the optimal reward function that an SO is using while maneuvering by assuming that the observed trajectories are optimal with respect to the SO's own reward function. This paper uses estimated orbital elements data to determine the behavior of SOs in a data-driven fashion.
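
    To make the feature-matching idea concrete, the sketch below recovers a linear reward, r(s) = w . phi(s), on a toy discrete MDP by nudging the weights w toward the gap between the expert's and the learner's discounted feature expectations. Everything here (the MDP, the one-hot features, the subgradient-style update) is an illustrative simplification of Abbeel-and-Ng-style apprenticeship learning, not the orbital-dynamics formulation used in the paper.

      import numpy as np

      rng = np.random.default_rng(1)

      # Toy MDP standing in for a space-object control problem: 4 discrete states,
      # 2 actions, one-hot state features phi(s). All dynamics here are illustrative.
      n_states, n_actions, gamma = 4, 2, 0.9
      P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
      phi = np.eye(n_states)                                            # state features

      def value_iteration(w, iters=200):
          # Greedy policy for the reward r(s) = w . phi(s).
          r = phi @ w
          V = np.zeros(n_states)
          for _ in range(iters):
              Q = r[:, None] + gamma * (P @ V)        # Q[s, a]
              V = Q.max(axis=1)
          return Q.argmax(axis=1)

      def feature_expectations(policy, episodes=200, horizon=30):
          # Discounted empirical feature expectations of a deterministic policy.
          mu = np.zeros(n_states)
          for _ in range(episodes):
              s = rng.integers(n_states)
              for t in range(horizon):
                  mu += (gamma ** t) * phi[s]
                  s = rng.choice(n_states, p=P[s, policy[s]])
          return mu / episodes

      # "Expert" trajectories generated from a hidden true reward; the analogue in
      # the paper would be the observed, estimated orbital-element trajectories.
      w_true = np.array([1.0, 0.0, 0.0, 0.5])
      mu_expert = feature_expectations(value_iteration(w_true))

      # Feature matching: adjust w so the learner's feature expectations approach
      # the expert's (a simplified subgradient variant of Abbeel & Ng's method).
      w = np.zeros(n_states)
      for it in range(30):
          mu_pi = feature_expectations(value_iteration(w))
          w += 0.1 * (mu_expert - mu_pi)
      print("recovered reward weights:", np.round(w, 2))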

  12. Reinforcement learning in complementarity game and population dynamics

    NASA Astrophysics Data System (ADS)

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005), 10.1016/j.physa.2004.07.005] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This gives insight into aspects like the issue of quick adaptation as opposed to systematic exploration or the role of learning rates.
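
    As a concrete reference for the schemes compared above, the short Python sketch below implements the Roth-Erev propensity update with a tunable power exponent (lambda = 1 recovers the standard scheme, lambda = 1.5 the better-performing modified version) alongside SoftMax action selection over incrementally estimated values. The two-action game, payoff probabilities, and parameter values are illustrative only; the Bush-Mosteller rule is omitted for brevity.

      import numpy as np

      rng = np.random.default_rng(0)

      def roth_erev_choice(propensities, lam=1.5):
          # Choose an action with probability proportional to propensity**lam.
          # lam = 1 is the standard Roth-Erev scheme; lam = 1.5 the modified one.
          weights = propensities ** lam
          return rng.choice(len(propensities), p=weights / weights.sum())

      def softmax_choice(values, temperature=0.5):
          # SoftMax (Boltzmann) selection over estimated action values;
          # shown for comparison, not used in the loop below.
          z = np.exp((values - values.max()) / temperature)
          return rng.choice(len(values), p=z / z.sum())

      # Toy two-action game: action 0 pays off with probability 0.7, action 1 with 0.3.
      propensities = np.ones(2)   # Roth-Erev propensities
      values = np.zeros(2)        # value estimates that would drive the SoftMax rule
      alpha = 0.1

      for t in range(1000):
          a = roth_erev_choice(propensities)
          reward = float(rng.random() < (0.7 if a == 0 else 0.3))
          propensities[a] += reward                  # Roth-Erev: add payoff to the chosen propensity
          values[a] += alpha * (reward - values[a])  # incremental value update

      print("propensities:", propensities, "values:", values)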

  13. The prefrontal cortex and hybrid learning during iterative competitive games.

    PubMed

    Abe, Hiroshi; Seo, Hyojung; Lee, Daeyeol

    2011-12-01

    Behavioral changes driven by reinforcement and punishment are referred to as simple or model-free reinforcement learning. Animals can also change their behaviors by observing events that are neither appetitive nor aversive when these events provide new information about payoffs available from alternative actions. This is an example of model-based reinforcement learning and can be accomplished by incorporating hypothetical reward signals into the value functions for specific actions. Recent neuroimaging and single-neuron recording studies showed that the prefrontal cortex and the striatum are involved not only in reinforcement and punishment, but also in model-based reinforcement learning. We found evidence for both types of learning, and hence hybrid learning, in monkeys during simulated competitive games. In addition, in both the dorsolateral prefrontal cortex and orbitofrontal cortex, individual neurons heterogeneously encoded signals related to actual and hypothetical outcomes from specific actions, suggesting that both areas might contribute to hybrid learning. © 2011 New York Academy of Sciences.

  14. Multi-Objective Reinforcement Learning-based Deep Neural Networks for Cognitive Space Communications

    NASA Technical Reports Server (NTRS)

    Ferreria, Paulo; Paffenroth, Randy; Wyglinski, Alexander M.; Hackett, Timothy; Bilen, Sven; Reinhart, Richard; Mortensen, Dale

    2017-01-01

    Future communication subsystems of space exploration missions can potentially benefit from software-defined radios (SDRs) controlled by machine learning algorithms. In this paper, we propose a novel hybrid radio resource allocation management control algorithm that integrates multi-objective reinforcement learning and deep artificial neural networks. The objective is to efficiently manage communications system resources by monitoring performance functions with common dependent variables that result in conflicting goals. The uncertainty in the performance of thousands of different possible combinations of radio parameters makes the trade-off between exploration and exploitation in reinforcement learning (RL) much more challenging for future critical space-based missions. Thus, the system should spend as little time as possible on exploring actions, and whenever it explores an action, it should perform at acceptable levels most of the time. The proposed approach enables on-line learning by interactions with the environment and restricts poor resource allocation performance through virtual environment exploration. Improvements in the multiobjective performance can be achieved via transmitter parameter adaptation on a packet-basis, with poorly predicted performance promptly resulting in rejected decisions. Simulations presented in this work considered the DVB-S2 standard adaptive transmitter parameters and additional ones expected to be present in future adaptive radio systems. Performance results are provided by analysis of the proposed hybrid algorithm when operating across a satellite communication channel from Earth to GEO orbit during clear sky conditions. The proposed approach constitutes part of the core cognitive engine proof-of-concept to be delivered to the NASA Glenn Research Center SCaN Testbed located onboard the International Space Station.

  15. Explicit and implicit reinforcement learning across the psychosis spectrum.

    PubMed

    Barch, Deanna M; Carter, Cameron S; Gold, James M; Johnson, Sheri L; Kring, Ann M; MacDonald, Angus W; Pizzagalli, Diego A; Ragland, J Daniel; Silverstein, Steven M; Strauss, Milton E

    2017-07-01

    Motivational and hedonic impairments are core features of a variety of types of psychopathology. An important aspect of motivational function is reinforcement learning (RL), including implicit (i.e., outside of conscious awareness) and explicit (i.e., including explicit representations about potential reward associations) learning, as well as both positive reinforcement (learning about actions that lead to reward) and punishment (learning to avoid actions that lead to loss). Here we present data from paradigms designed to assess both positive and negative components of both implicit and explicit RL, examine performance on each of these tasks among individuals with schizophrenia, schizoaffective disorder, and bipolar disorder with psychosis, and examine their relative relationships to specific symptom domains transdiagnostically. None of the diagnostic groups differed significantly from controls on the implicit RL tasks in either bias toward a rewarded response or bias away from a punished response. However, on the explicit RL task, both the individuals with schizophrenia and those with schizoaffective disorder performed significantly worse than controls, but the individuals with bipolar disorder did not. Worse performance on the explicit RL task, but not the implicit RL task, was related to worse motivation and pleasure symptoms across all diagnostic categories. Performance on explicit RL, but not implicit RL, was related to working memory, which accounted for some of the diagnostic group differences. However, working memory did not account for the relationship of explicit RL to motivation and pleasure symptoms. These findings suggest transdiagnostic relationships across the spectrum of psychotic disorders between motivation and pleasure impairments and explicit RL. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  16. Deficits in Positive Reinforcement Learning and Uncertainty-Driven Exploration are Associated with Distinct Aspects of Negative Symptoms in Schizophrenia

    PubMed Central

    Strauss, Gregory P.; Frank, Michael J.; Waltz, James A.; Kasanova, Zuzana; Herbener, Ellen S.; Gold, James M.

    2011-01-01

    Background Negative symptoms are core features of schizophrenia; however, the cognitive and neural basis for individual negative symptom domains remains unclear. Converging evidence suggests a role for striatal and prefrontal dopamine in reward learning and the exploration of actions that might produce outcomes that are better than the status quo. The current study examines whether deficits in reinforcement learning and uncertainty-driven exploration predict specific negative symptom domains. Methods We administered a temporal decision making task, which required trial-by-trial adjustment of reaction time (RT) to maximize reward receipt, to 51 patients with schizophrenia and 39 age-matched healthy controls. Task conditions were designed such that expected value (probability * magnitude) increased (IEV), decreased (DEV), or remained constant (CEV) with increasing response times. Computational analyses were applied to estimate the degree to which trial-by-trial responses are influenced by reinforcement history. Results Individuals with schizophrenia showed impaired Go learning, but intact NoGo learning relative to controls. These effects were pronounced as a function of global measures of negative symptoms. Uncertainty-based exploration was substantially reduced in individuals with schizophrenia, and selectively correlated with clinical ratings of anhedonia. Conclusions Schizophrenia patients, particularly those with high negative symptoms, failed to speed RTs to increase positive outcomes and showed a reduced tendency to explore when alternative actions could lead to better outcomes than the status quo. Results are interpreted in the context of current computational, genetic, and pharmacological data supporting the roles of striatal and prefrontal dopamine in these processes. PMID:21168124

  17. Multi-Objective Reinforcement Learning-Based Deep Neural Networks for Cognitive Space Communications

    NASA Technical Reports Server (NTRS)

    Ferreria, Paulo Victor R.; Paffenroth, Randy; Wyglinski, Alexander M.; Hackett, Timothy M.; Bilen, Sven G.; Reinhart, Richard C.; Mortensen, Dale J.

    2017-01-01

    Future communication subsystems of space exploration missions can potentially benefit from software-defined radios (SDRs) controlled by machine learning algorithms. In this paper, we propose a novel hybrid radio resource allocation management control algorithm that integrates multi-objective reinforcement learning and deep artificial neural networks. The objective is to efficiently manage communications system resources by monitoring performance functions with common dependent variables that result in conflicting goals. The uncertainty in the performance of thousands of different possible combinations of radio parameters makes the trade-off between exploration and exploitation in reinforcement learning (RL) much more challenging for future critical space-based missions. Thus, the system should spend as little time as possible on exploring actions, and whenever it explores an action, it should perform at acceptable levels most of the time. The proposed approach enables on-line learning by interactions with the environment and restricts poor resource allocation performance through virtual environment exploration. Improvements in the multiobjective performance can be achieved via transmitter parameter adaptation on a packet-basis, with poorly predicted performance promptly resulting in rejected decisions. Simulations presented in this work considered the DVB-S2 standard adaptive transmitter parameters and additional ones expected to be present in future adaptive radio systems. Performance results are provided by analysis of the proposed hybrid algorithm when operating across a satellite communication channel from Earth to GEO orbit during clear sky conditions. The proposed approach constitutes part of the core cognitive engine proof-of-concept to be delivered to the NASA Glenn Research Center SCaN Testbed located onboard the International Space Station.

  18. Gr-GDHP: A New Architecture for Globalized Dual Heuristic Dynamic Programming.

    PubMed

    Zhong, Xiangnan; Ni, Zhen; He, Haibo

    2017-10-01

    A goal representation globalized dual heuristic dynamic programming (Gr-GDHP) method is proposed in this paper. A goal neural network is integrated into the traditional GDHP method, providing an internal reinforcement signal and its derivatives to help the control and learning process. With the proposed architecture, the obtained internal reinforcement signal and its derivatives can adjust themselves online over time, rather than being a fixed or predefined function as in the literature. Furthermore, the obtained derivatives contribute directly to the objective function of the critic network, whose learning process is thus simplified. Numerical simulation studies are applied to show the performance of the proposed Gr-GDHP method and to compare the results with other existing adaptive dynamic programming designs. We also investigate this method on a ball-and-beam balancing system. Statistical simulation results are presented for both the Gr-GDHP and GDHP methods to demonstrate the improved learning and control performance.

  19. Generalization of value in reinforcement learning by humans.

    PubMed

    Wimmer, G Elliott; Daw, Nathaniel D; Shohamy, Daphna

    2012-04-01

    Research in decision-making has focused on the role of dopamine and its striatal targets in guiding choices via learned stimulus-reward or stimulus-response associations, behavior that is well described by reinforcement learning theories. However, basic reinforcement learning is relatively limited in scope and does not explain how learning about stimulus regularities or relations may guide decision-making. A candidate mechanism for this type of learning comes from the domain of memory, which has highlighted a role for the hippocampus in learning of stimulus-stimulus relations, typically dissociated from the role of the striatum in stimulus-response learning. Here, we used functional magnetic resonance imaging and computational model-based analyses to examine the joint contributions of these mechanisms to reinforcement learning. Humans performed a reinforcement learning task with added relational structure, modeled after tasks used to isolate hippocampal contributions to memory. On each trial participants chose one of four options, but the reward probabilities for pairs of options were correlated across trials. This (uninstructed) relationship between pairs of options potentially enabled an observer to learn about option values based on experience with the other options and to generalize across them. We observed blood oxygen level-dependent (BOLD) activity related to learning in the striatum and also in the hippocampus. By comparing a basic reinforcement learning model to one augmented to allow feedback to generalize between correlated options, we tested whether choice behavior and BOLD activity were influenced by the opportunity to generalize across correlated options. Although such generalization goes beyond standard computational accounts of reinforcement learning and striatal BOLD, both choices and striatal BOLD activity were better explained by the augmented model. Consistent with the hypothesized role for the hippocampus in this generalization, functional connectivity between the ventral striatum and hippocampus was modulated, across participants, by the ability of the augmented model to capture participants' choice. Our results thus point toward an interactive model in which striatal reinforcement learning systems may employ relational representations typically associated with the hippocampus. © 2012 The Authors. European Journal of Neuroscience © 2012 Federation of European Neuroscience Societies and Blackwell Publishing Ltd.
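
    A minimal sketch of the augmented learner described above is given below: feedback for the chosen option also updates its pair-mate, scaled by a generalization weight, so that setting that weight to zero recovers the basic reinforcement learning model. The pairing of options, the sign of the coupling between pair-mates, and all parameter values are illustrative assumptions rather than the fitted model from the study.

      import numpy as np

      def q_update(Q, choice, reward, alpha=0.3, generalization=0.5, coupling=-1.0):
          # One trial of the augmented learner: the chosen option gets the usual
          # delta-rule update, and its pair-mate receives a scaled, coupled update.
          # generalization = 0 recovers the basic reinforcement learning model.
          # Pairing (0 with 1, 2 with 3) and the coupling sign are illustrative.
          pair = {0: 1, 1: 0, 2: 3, 3: 2}
          delta = reward - Q[choice]
          Q[choice] += alpha * delta
          Q[pair[choice]] += generalization * alpha * coupling * delta
          return Q

      Q = np.zeros(4)
      Q = q_update(Q, choice=0, reward=1.0)
      print(Q)   # option 0 moves toward the reward; its pair-mate moves per the coupling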

  20. Instrumental learning and relearning in individuals with psychopathy and in patients with lesions involving the amygdala or orbitofrontal cortex.

    PubMed

    Mitchell, D G V; Fine, C; Richell, R A; Newman, C; Lumsden, J; Blair, K S; Blair, R J R

    2006-05-01

    Previous work has shown that individuals with psychopathy are impaired on some forms of associative learning, particularly stimulus-reinforcement learning (Blair et al., 2004; Newman & Kosson, 1986). Animal work suggests that the acquisition of stimulus-reinforcement associations requires the amygdala (Baxter & Murray, 2002). Individuals with psychopathy also show impoverished reversal learning (Mitchell, Colledge, Leonard, & Blair, 2002). Reversal learning is supported by the ventrolateral and orbitofrontal cortex (Rolls, 2004). In this paper we present experiments investigating stimulus-reinforcement learning and relearning in patients with lesions of the orbitofrontal cortex or amygdala, and individuals with developmental psychopathy without known trauma. The results are interpreted with reference to current neurocognitive models of stimulus-reinforcement learning, relearning, and developmental psychopathy. Copyright (c) 2006 APA, all rights reserved.

  1. Learning Sequences of Actions in Collectives of Autonomous Agents

    NASA Technical Reports Server (NTRS)

    Turner, Kagan; Agogino, Adrian K.; Wolpert, David H.; Clancy, Daniel (Technical Monitor)

    2001-01-01

    In this paper we focus on the problem of designing a collective of autonomous agents that individually learn sequences of actions such that the resultant sequence of joint actions achieves a predetermined global objective. We are particularly interested in instances of this problem where centralized control is either impossible or impractical. For single-agent systems in similar domains, machine learning methods (e.g., reinforcement learners) have been successfully used. However, applying such solutions directly to multi-agent systems often proves problematic, as agents may work at cross-purposes, or have difficulty in evaluating their contribution to achievement of the global objective, or both. Accordingly, the crucial design step in multiagent systems centers on determining the private objectives of each agent so that as the agents strive for those objectives, the system reaches a good global solution. In this work we consider a version of this problem involving multiple autonomous agents in a grid world. We use concepts from collective intelligence to design goals for the agents that are 'aligned' with the global goal, and are 'learnable' in that agents can readily see how their behavior affects their utility. We show that reinforcement learning agents using those goals outperform both 'natural' extensions of single-agent algorithms and global reinforcement learning solutions based on 'team games'.

  2. Continuous theta-burst stimulation (cTBS) over the lateral prefrontal cortex alters reinforcement learning bias.

    PubMed

    Ott, Derek V M; Ullsperger, Markus; Jocham, Gerhard; Neumann, Jane; Klein, Tilmann A

    2011-07-15

    The prefrontal cortex is known to play a key role in higher-order cognitive functions. Recently, we showed that this brain region is active in reinforcement learning, during which subjects constantly have to integrate trial outcomes in order to optimize performance. To further elucidate the role of the dorsolateral prefrontal cortex (DLPFC) in reinforcement learning, we applied continuous theta-burst stimulation (cTBS) either to the left or right DLPFC, or to the vertex as a control region, respectively, prior to the performance of a probabilistic learning task in an fMRI environment. While there was no influence of cTBS on learning performance per se, we observed a stimulation-dependent modulation of reward vs. punishment sensitivity: Left-hemispherical DLPFC stimulation led to a more reward-guided performance, while right-hemispherical cTBS induced a more avoidance-guided behavior. FMRI results showed enhanced prediction error coding in the ventral striatum in subjects stimulated over the left as compared to the right DLPFC. Both behavioral and imaging results are in line with recent findings that left, but not right-hemispherical stimulation can trigger a release of dopamine in the ventral striatum, which has been suggested to increase the relative impact of rewards rather than punishment on behavior. Copyright © 2011 Elsevier Inc. All rights reserved.

  3. Tunnel Ventilation Control Using Reinforcement Learning Methodology

    NASA Astrophysics Data System (ADS)

    Chu, Baeksuk; Kim, Dongnam; Hong, Daehie; Park, Jooyoung; Chung, Jin Taek; Kim, Tae-Hyung

    The main purpose of a tunnel ventilation system is to maintain CO pollutant concentration and VI (visibility index) under adequate levels to provide drivers with a comfortable and safe driving environment. Moreover, it is necessary to minimize the power consumed to operate the ventilation system. To achieve these objectives, the control algorithm used in this research is the reinforcement learning (RL) method. RL is a goal-directed learning of a mapping from situations to actions without relying on exemplary supervision or complete models of the environment. The goal of RL is to maximize a reward, which is an evaluative feedback from the environment. In constructing the reward for the tunnel ventilation system, both objectives listed above are included, that is, maintaining an adequate level of pollutants and minimizing power consumption. An RL algorithm based on an actor-critic architecture and a gradient-following algorithm is applied to the tunnel ventilation system. Simulation results obtained with real data collected from an existing tunnel ventilation system, together with real experimental verification, are provided in this paper. It is confirmed that with the suggested controller, the pollutant level inside the tunnel was well maintained under the allowable limit and energy consumption was improved compared to the conventional control scheme.
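
    The reward described above has to trade off air quality against energy use; a minimal sketch of such a composite reward is shown below. The thresholds, weights, and units are hypothetical placeholders, not the values used in the paper, and the actor-critic machinery that would consume this reward is omitted.

      def ventilation_reward(co_ppm, visibility_index, fan_power_kw,
                             co_limit=25.0, vi_limit=0.005, power_weight=0.01):
          # Scalar reward combining air-quality constraints and energy cost.
          # Thresholds and weights are illustrative, not the paper's values.
          co_penalty = max(0.0, co_ppm - co_limit)             # CO above the allowable limit
          vi_penalty = max(0.0, visibility_index - vi_limit)   # visibility index above its limit
          energy_cost = power_weight * fan_power_kw            # cost of running the jet fans
          return -(co_penalty + 100.0 * vi_penalty + energy_cost)

      # Within the limits, the reward is driven only by the energy term.
      print(ventilation_reward(co_ppm=18.0, visibility_index=0.003, fan_power_kw=120.0))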

  4. Reinforcement of Science Learning through Local Culture: A Delphi Study

    ERIC Educational Resources Information Center

    Nuangchalerm, Prasart

    2008-01-01

    This study aims to explore the ways to reinforce science learning through local culture by using the Delphi technique. Twenty-four participants in various fields of study were selected. The results of the study provide a framework for the reinforcement of science learning through local culture on the theme of life and environment. (Contains 1 table.)

  5. A Novel Clustering Method Curbing the Number of States in Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Kotani, Naoki; Nunobiki, Masayuki; Taniguchi, Kenji

    We propose an efficient state-space construction method for reinforcement learning. Our method controls the number of categories by improving the clustering method of Fuzzy ART, an autonomous state-space construction method. The proposed method represents each weight vector as the mean value of the input vectors assigned to it in order to curb the number of new categories, and it eliminates categories whose state values are low to curb the total number of categories. As the state value is updated, the size of each category becomes smaller so that the policy can be learned more precisely. We verified the effectiveness of the proposed method with simulations of a reaching problem for a two-link robot arm. We confirmed that the number of categories was reduced and the agent achieved the complex task quickly.
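
    The two mechanisms described above (storing each category's weight vector as the running mean of its inputs, and pruning categories whose state values stay low) can be sketched as follows. The distance-threshold test stands in for the Fuzzy ART vigilance criterion, and the threshold values are illustrative assumptions.

      import numpy as np

      class MeanClusterStateSpace:
          # Simplified incremental clustering for RL state construction.
          # Categories store the running mean of their inputs; categories whose
          # learned state value stays low are pruned. This is a simplification
          # of the Fuzzy-ART-based scheme described in the abstract.

          def __init__(self, radius=0.15, prune_value=-0.5):
              self.centers, self.counts, self.values = [], [], []
              self.radius, self.prune_value = radius, prune_value

          def state(self, x):
              # Return the category index for input x, creating one if needed.
              x = np.asarray(x, dtype=float)
              if self.centers:
                  d = [np.linalg.norm(x - c) for c in self.centers]
                  k = int(np.argmin(d))
                  if d[k] < self.radius:
                      self.counts[k] += 1
                      # running mean keeps the category from drifting outward
                      self.centers[k] += (x - self.centers[k]) / self.counts[k]
                      return k
              self.centers.append(x.copy()); self.counts.append(1); self.values.append(0.0)
              return len(self.centers) - 1

          def prune(self):
              # Drop categories whose state value has fallen below threshold.
              keep = [i for i, v in enumerate(self.values) if v >= self.prune_value]
              self.centers = [self.centers[i] for i in keep]
              self.counts = [self.counts[i] for i in keep]
              self.values = [self.values[i] for i in keep]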

  6. Reinforcement-Learning-Based Robust Controller Design for Continuous-Time Uncertain Nonlinear Systems Subject to Input Constraints.

    PubMed

    Liu, Derong; Yang, Xiong; Wang, Ding; Wei, Qinglai

    2015-07-01

    The design of a stabilizing controller for uncertain nonlinear systems with control constraints is a challenging problem. The constrained input, coupled with the inability to accurately identify the uncertainties, motivates the design of stabilizing controllers based on reinforcement-learning (RL) methods. In this paper, a novel RL-based robust adaptive control algorithm is developed for a class of continuous-time uncertain nonlinear systems subject to input constraints. The robust control problem is converted to a constrained optimal control problem by appropriately selecting value functions for the nominal system. Distinct from the action-critic dual networks typically employed in RL, only one critic neural network (NN) is constructed to derive the approximate optimal control. Meanwhile, unlike the initial stabilizing control that is often indispensable in RL, no special requirement is imposed on the initial control. By utilizing Lyapunov's direct method, the closed-loop optimal control system and the estimated weights of the critic NN are proved to be uniformly ultimately bounded. In addition, the derived approximate optimal control is verified to keep the uncertain nonlinear system stable in the sense of uniform ultimate boundedness. Two simulation examples are provided to illustrate the effectiveness and applicability of the present approach.

  7. Application of a model of instrumental conditioning to mobile robot control

    NASA Astrophysics Data System (ADS)

    Saksida, Lisa M.; Touretzky, D. S.

    1997-09-01

    Instrumental conditioning is a psychological process whereby an animal learns to associate its actions with their consequences. This type of learning is exploited in animal training techniques such as 'shaping by successive approximations,' which enables trainers to gradually adjust the animal's behavior by giving strategically timed reinforcements. While this is similar in principle to reinforcement learning, the real phenomenon includes many subtle effects not considered in the machine learning literature. In addition, a good deal of domain information is utilized by an animal learning a new task; it does not start from scratch every time it learns a new behavior. For these reasons, it is not surprising that mobile robot learning algorithms have yet to approach the sophistication and robustness of animal learning. A serious attempt to model instrumental learning could prove fruitful for improving machine learning techniques. In the present paper, we develop a computational theory of shaping at a level appropriate for controlling mobile robots. The theory is based on a series of mechanisms for 'behavior editing,' in which pre-existing behaviors, either innate or previously learned, can be dramatically changed in magnitude, shifted in direction, or otherwise manipulated so as to produce new behavioral routines. We have implemented our theory on Amelia, an RWI B21 mobile robot equipped with a gripper and color video camera. We provide results from training Amelia on several tasks, all of which were constructed as variations of one innate behavior, object-pursuit.

  8. Effect of Flumethrin on Survival and Olfactory Learning in Honeybees

    PubMed Central

    Tan, Ken; Yang, Shuang; Wang, Zhengwei; Menzel, Randolf

    2013-01-01

    Flumethrin has been widely used as an acaricide for the control of Varroa mites in commercial honeybee keeping throughout the world for many years. Here we test the mortality of the Asian honeybee Apis cerana cerana after treatment with flumethrin. We also ask (1) how bees react to the odor of flumethrin, (2) whether its odor induces an innate avoidance response, (3) whether its taste transmits an aversive reinforcing component in olfactory learning, and (4) whether its odor or taste can be associated with reward in classical conditioning. Our results show that flumethrin has a negative effect on Apis cerana's lifespan, induces an innate avoidance response, acts as a punishing reinforcer in olfactory learning, and interferes with the association of an appetitive conditioned stimulus. Furthermore, flumethrin uptake within the colony reduces olfactory learning over an extended period of time. PMID:23785490

  9. CENTRAL REINFORCING EFFECTS OF ETHANOL ARE BLOCKED BY CATALASE INHIBITION

    PubMed Central

    Nizhnikov, Michael Edward; Molina, Juan Carlos; Spear, Norman

    2007-01-01

    Recent studies have systematically indicated that newborn rats are highly sensitive to ethanol’s positive reinforcing effects. Central administrations of ethanol (25–200 mg %) associated with an olfactory conditioned stimulus (CS) promote subsequent conditioned approach to the CS as evaluated through the newborn’s response to a surrogate nipple scented with the CS. It has been shown that ethanol’s first metabolite, acetaldehyde, exerts significant reinforcing effects in the central nervous system. A significant amount of acetaldehyde is derived from ethanol metabolism via the catalase system. In newborn rats catalase levels are particularly high in several brain structures. The present study tested the effect of catalase inhibition on central ethanol reinforcement. In the first experiment, pups experienced lemon odor either paired or unpaired with intracisternal (i.c.) administrations of 100 mg% ethanol. Half of the animals corresponding to each learning condition were pretreated with i.c. administrations of either physiological saline or a catalase inhibitor (sodium-azide). Catalase inhibition completely suppressed ethanol reinforcement in paired groups without affecting responsiveness to the CS during conditioning or responding by unpaired control groups. A second experiment tested whether these effects were specific to ethanol reinforcement or due instead to general impairment in learning and expression capabilities. Central administration of an endogenous kappa opioid receptor agonist (dynorphin A-13) was employed as an alternative source of reinforcement. Inhibition of the catalase system had no effect on the reinforcing properties of dynorphin. The present results support the hypothesis that ethanol metabolism regulated by the catalase system plays a critical role in determination of ethanol reinforcement in newborn rats. PMID:17980789

  10. Gaussian Processes for Data-Efficient Learning in Robotics and Control.

    PubMed

    Deisenroth, Marc Peter; Fox, Dieter; Rasmussen, Carl Edward

    2015-02-01

    Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven learning reduces the amount of engineering knowledge that is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning approaches typically require task-specific knowledge in the form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this paper, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning, our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL, our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.
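
    A rough sketch of the model-based recipe (fit a Gaussian process transition model from a handful of interactions, then plan through the learned model) is given below using scikit-learn's GaussianProcessRegressor. The toy one-dimensional system, the random-shooting planner, and all constants are illustrative; in particular, PILCO's defining step of propagating the model's predictive uncertainty through the rollout is omitted here.

      import numpy as np
      from sklearn.gaussian_process import GaussianProcessRegressor

      rng = np.random.default_rng(4)

      # Toy 1-D system standing in for a robot: x' = x + 0.1*u - 0.05*sin(x) + noise.
      def step(x, u):
          return x + 0.1 * u - 0.05 * np.sin(x) + 0.01 * rng.normal()

      # 1) Collect a few random interactions and fit a GP transition model p(x' | x, u).
      #    This is the data-efficiency step: the model is learned from tens of samples.
      X, Y = [], []
      x = 0.0
      for _ in range(40):
          u = rng.uniform(-1, 1)
          x_next = step(x, u)
          X.append([x, u]); Y.append(x_next)
          x = x_next
      gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(X), np.array(Y))

      # 2) Choose actions by rolling the GP mean forward and scoring candidate
      #    action sequences against a target state (a crude random-shooting planner).
      def plan(x0, target=1.0, horizon=5, candidates=200):
          best_u, best_cost = None, np.inf
          for _ in range(candidates):
              us = rng.uniform(-1, 1, size=horizon)
              xp = x0
              for u in us:
                  xp = float(gp.predict(np.array([[xp, u]]))[0])
              cost = (xp - target) ** 2
              if cost < best_cost:
                  best_u, best_cost = us[0], cost
          return best_u

      print("first planned action from x=0:", round(plan(0.0), 3))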

  11. Negative mood reverses devaluation of goal-directed drug-seeking favouring an incentive learning account of drug dependence.

    PubMed

    Hogarth, Lee; He, Zhimin; Chase, Henry W; Wills, Andy J; Troisi, Joseph; Leventhal, Adam M; Mathew, Amanda R; Hitsman, Brian

    2015-09-01

    Two theories explain how negative mood primes smoking behaviour. The stimulus-response (S-R) account argues that in the negative mood state, smoking is experienced as more reinforcing, establishing a direct (automatic) association between the negative mood state and smoking behaviour. By contrast, the incentive learning account argues that in the negative mood state smoking is expected to be more reinforcing, which integrates with instrumental knowledge of the response required to produce that outcome. One differential prediction is that whereas the incentive learning account anticipates that negative mood induction could augment a novel tobacco-seeking response in an extinction test, the S-R account could not explain this effect because the extinction test prevents S-R learning by omitting experience of the reinforcer. To test this, overnight-deprived daily smokers (n = 44) acquired two instrumental responses for tobacco and chocolate points, respectively, before smoking to satiety. Half then received negative mood induction to raise the expected value of tobacco, opposing satiety, whilst the remainder received positive mood induction. Finally, a choice between tobacco and chocolate was measured in extinction to test whether negative mood could augment tobacco choice, opposing satiety, in the absence of direct experience of tobacco reinforcement. Negative mood induction not only abolished the devaluation of tobacco choice, but participants with a significant increase in negative mood increased their tobacco choice in extinction, despite satiety. These findings suggest that negative mood augments drug-seeking by raising the expected value of the drug through incentive learning, rather than through automatic S-R control.

  12. 'Proactive' use of cue-context congruence for building reinforcement learning's reward function.

    PubMed

    Zsuga, Judit; Biro, Klara; Tajti, Gabor; Szilasi, Magdolna Emma; Papp, Csaba; Juhasz, Bela; Gesztelyi, Rudolf

    2016-10-28

    Reinforcement learning is a fundamental form of learning that may be formalized using the Bellman equation. Accordingly, an agent determines the state value as the sum of the immediate reward and the discounted value of future states. Thus the value of a state is determined by agent-related attributes (action set, policy, discount factor) and by the agent's knowledge of the environment, embodied by the reward function and by hidden environmental factors given by the transition probability. The central objective of reinforcement learning is to solve these two functions outside the agent's control, either with or without a model. In the present paper, using the proactive model of reinforcement learning, we offer insight on how the brain creates simplified representations of the environment, and how these representations are organized to support the identification of relevant stimuli and actions. Furthermore, we identify neurobiological correlates of our model by suggesting that the reward and policy functions, attributes of the Bellman equation, are built by the orbitofrontal cortex (OFC) and the anterior cingulate cortex (ACC), respectively. Based on this we propose that the OFC assesses cue-context congruence to activate the most relevant context frame. Furthermore, given the bidirectional neuroanatomical link between the OFC and model-free structures, we suggest that model-based input is incorporated into the reward prediction error (RPE) signal, and conversely that the RPE signal may be used to update the reward-related information of context frames and the policy underlying action selection in the OFC and ACC, respectively. Finally, clinical implications for cognitive behavioral interventions are discussed.
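
    Written out in standard notation (following Sutton and Barto rather than the paper's own symbols), the Bellman equation referred to above expresses the value of a state under a policy as the expected immediate reward plus the discounted value of successor states:

      \[
        V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{\pi}(s') \bigr]
      \]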

  13. Partial Planning Reinforcement Learning

    DTIC Science & Technology

    2012-08-31

    Subject terms: Reinforcement Learning, Bayesian Optimization, Active Learning, Action Model Learning, Decision Theoretic Assistance. Investigators: Prasad Tadepalli, Alan Fern (Oregon State University).

  14. Involvement of the Rat Anterior Cingulate Cortex in Control of Instrumental Responses Guided by Reward Expectancy

    ERIC Educational Resources Information Center

    Schweimer, Judith; Hauber, Wolfgang

    2005-01-01

    The anterior cingulate cortex (ACC) plays a critical role in stimulus-reinforcement learning and reward-guided selection of actions. Here we conducted a series of experiments to further elucidate the role of the ACC in instrumental behavior involving effort-based decision-making and instrumental learning guided by reward-predictive stimuli. In…

  15. The Memory Trace Supporting Lose-Shift Responding Decays Rapidly after Reward Omission and Is Distinct from Other Learning Mechanisms in Rats.

    PubMed

    Gruber, Aaron J; Thapa, Rajat

    2016-01-01

    The propensity of animals to shift choices immediately after unexpectedly poor reinforcement outcomes is a pervasive strategy across species and tasks. We report here that the memory supporting such lose-shift responding in rats rapidly decays during the intertrial interval and persists throughout training and testing on a binary choice task, despite being a suboptimal strategy. Lose-shift responding is not positively correlated with the prevalence and temporal dependence of win-stay responding, and it is inconsistent with predictions of reinforcement learning on the task. These data provide further evidence that win-stay and lose-shift are mediated by dissociated neural mechanisms and indicate that lose-shift responding presents a potential confound for the study of choice in the many operant choice tasks with short intertrial intervals. We propose that this immediate lose-shift responding is an intrinsic feature of the brain's choice mechanisms that is engaged as a choice reflex and works in parallel with reinforcement learning and other control mechanisms to guide action selection.

  16. Optimizing microstimulation using a reinforcement learning framework.

    PubMed

    Brockmeier, Austin J; Choi, John S; Distasio, Marcello M; Francis, Joseph T; Príncipe, José C

    2011-01-01

    The ability to provide sensory feedback is desired to enhance the functionality of neuroprosthetics. Somatosensory feedback provides closed-loop control to the motor system, which is lacking in feedforward neuroprosthetics. In the case of existing somatosensory function, a template of the natural response can be used as a template of the desired response elicited by electrical microstimulation. In the case of no initial training data, microstimulation parameters that produce responses close to the template must be selected in an online manner. We propose using reinforcement learning as a framework to balance the exploration of the parameter space and the continued selection of promising parameters for further stimulation. This approach avoids an explicit model of the neural response from stimulation. We explore a preliminary architecture, treating the task as a k-armed bandit, using offline data recorded for natural touch and thalamic microstimulation, and we examine the method's efficiency in exploring the parameter space while concentrating on promising parameter forms. The best-matching stimulation parameters, from k = 68 different forms, are consistently selected by the reinforcement learning algorithm after 334 realizations.
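
    The k-armed-bandit treatment mentioned above can be sketched as an epsilon-greedy rule in which each arm is one stimulation parameter form and the reward is the (negative) distance between the evoked response and the natural-touch template. The response model, noise levels, and the epsilon-greedy rule itself are illustrative stand-ins for the paper's offline data and algorithm.

      import numpy as np

      rng = np.random.default_rng(2)

      k = 68                          # number of candidate stimulation parameter forms
      template = rng.normal(size=20)  # stand-in for the natural-touch response template
      true_responses = template + rng.normal(scale=np.linspace(0.1, 2.0, k)[:, None],
                                             size=(k, 20))  # hypothetical evoked responses

      Q = np.zeros(k)                 # estimated reward (template similarity) per parameter form
      N = np.zeros(k)                 # selection counts
      epsilon = 0.1

      for trial in range(500):
          if rng.random() < epsilon:
              a = rng.integers(k)                 # explore a random parameter form
          else:
              a = int(np.argmax(Q))               # exploit the best form so far
          evoked = true_responses[a] + rng.normal(scale=0.2, size=20)   # noisy realization
          reward = -np.linalg.norm(evoked - template)                   # closer to template = better
          N[a] += 1
          Q[a] += (reward - Q[a]) / N[a]          # incremental sample-average update

      print("selected parameter form:", int(np.argmax(Q)))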

  17. Learning tactile skills through curious exploration

    PubMed Central

    Pape, Leo; Oddo, Calogero M.; Controzzi, Marco; Cipriani, Christian; Förster, Alexander; Carrozza, Maria C.; Schmidhuber, Jürgen

    2012-01-01

    We present curiosity-driven, autonomous acquisition of tactile exploratory skills on a biomimetic robot finger equipped with an array of microelectromechanical touch sensors. Instead of building tailored algorithms for solving a specific tactile task, we employ a more general curiosity-driven reinforcement learning approach that autonomously learns a set of motor skills in absence of an explicit teacher signal. In this approach, the acquisition of skills is driven by the information content of the sensory input signals relative to a learner that aims at representing sensory inputs using fewer and fewer computational resources. We show that, from initially random exploration of its environment, the robotic system autonomously develops a small set of basic motor skills that lead to different kinds of tactile input. Next, the system learns how to exploit the learned motor skills to solve supervised texture classification tasks. Our approach demonstrates the feasibility of autonomous acquisition of tactile skills on physical robotic platforms through curiosity-driven reinforcement learning, overcomes typical difficulties of engineered solutions for active tactile exploration and underactuated control, and provides a basis for studying developmental learning through intrinsic motivation in robots. PMID:22837748

  18. A statistical learning strategy for closed-loop control of fluid flows

    NASA Astrophysics Data System (ADS)

    Guéniat, Florimond; Mathelin, Lionel; Hussaini, M. Yousuff

    2016-12-01

    This work discusses a closed-loop control strategy for complex systems utilizing scarce and streaming data. A discrete embedding space is first built using hash functions applied to the sensor measurements, from which a Markov process model is derived, approximating the complex system's dynamics. A control strategy is then learned using reinforcement learning once rewards relevant with respect to the control objective are identified. This method is designed for experimental configurations, requiring neither computations nor prior knowledge of the system, and enjoys intrinsic robustness. It is illustrated on two systems: the control of the transitions of a Lorenz'63 dynamical system, and the control of the drag of a cylinder flow. The method is shown to perform well.
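
    The overall recipe (hash streaming sensor measurements into a discrete state, then learn a control policy over those states) can be sketched as below. The binning-based hash, the toy dynamics, and the tabular Q-learning update are illustrative simplifications; the paper derives a Markov process model from the embedding and identifies rewards specific to each flow-control objective.

      import numpy as np

      rng = np.random.default_rng(3)

      def hash_state(measurements, bins=8):
          # Map a continuous sensor vector to a discrete state via coarse binning,
          # a simple stand-in for the hash-function embedding in the abstract.
          digitized = tuple(np.clip((np.asarray(measurements) * bins).astype(int), 0, bins - 1))
          return hash(digitized) % 512            # 512 discrete states

      Q = {}                                       # tabular action values, filled lazily
      actions = [-1.0, 0.0, 1.0]                   # e.g., decrease / hold / increase actuation
      alpha, gamma, epsilon = 0.1, 0.95, 0.1

      x = rng.random(3)                            # toy "sensor measurements" in [0, 1)
      for step in range(5000):
          s = hash_state(x)
          q = Q.setdefault(s, np.zeros(len(actions)))
          a = rng.integers(len(actions)) if rng.random() < epsilon else int(np.argmax(q))
          # Toy dynamics and reward: drive the first sensor reading toward 0.5.
          x = np.clip(x + 0.05 * actions[a] * rng.random(3), 0.0, 0.999)
          reward = -abs(x[0] - 0.5)
          q_next = Q.setdefault(hash_state(x), np.zeros(len(actions)))
          q[a] += alpha * (reward + gamma * q_next.max() - q[a])   # Q-learning update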

  19. Reinforcement learning in supply chains.

    PubMed

    Valluri, Annapurna; North, Michael J; Macal, Charles M

    2009-10-01

    Effective management of supply chains creates value and can strategically position companies. In practice, human beings have been found to be both surprisingly successful and disappointingly inept at managing supply chains. The related fields of cognitive psychology and artificial intelligence have postulated a variety of potential mechanisms to explain this behavior. One of the leading candidates is reinforcement learning. This paper applies agent-based modeling to investigate the comparative behavioral consequences of three simple reinforcement learning algorithms in a multi-stage supply chain. For the first time, our findings show that the specific algorithm that is employed can have dramatic effects on the results obtained. Reinforcement learning is found to be valuable in multi-stage supply chains with several learning agents, as independent agents can learn to coordinate their behavior. However, learning in multi-stage supply chains using these postulated approaches from cognitive psychology and artificial intelligence takes extremely long time periods to achieve stability, which raises questions about their ability to explain behavior in real supply chains. The fact that it takes thousands of periods for agents to learn in this simple multi-agent setting provides new evidence that real-world decision makers are unlikely to be using strict reinforcement learning in practice.

  20. Reinforcement learning in scheduling

    NASA Technical Reports Server (NTRS)

    Dietterich, Tom G.; Ok, Dokyeong; Zhang, Wei; Tadepalli, Prasad

    1994-01-01

    The goal of this research is to apply reinforcement learning methods to real-world problems like scheduling. In this preliminary paper, we show that learning to solve scheduling problems such as the Space Shuttle Payload Processing and the Automatic Guided Vehicle (AGV) scheduling can be usefully studied in the reinforcement learning framework. We discuss some of the special challenges posed by the scheduling domain to these methods and propose some possible solutions we plan to implement.

  1. Action-Driven Visual Object Tracking With Deep Reinforcement Learning.

    PubMed

    Yun, Sangdoo; Choi, Jongwon; Yoo, Youngjoon; Yun, Kimin; Choi, Jin Young

    2018-06-01

    In this paper, we propose an efficient visual tracker, which directly captures a bounding box containing the target object in a video by means of sequential actions learned using deep neural networks. The proposed deep neural network to control tracking actions is pretrained using various training video sequences and fine-tuned during actual tracking for online adaptation to a change of target and background. The pretraining is done by utilizing deep reinforcement learning (RL) as well as supervised learning. The use of RL enables even partially labeled data to be successfully utilized for semisupervised learning. Through the evaluation of the object tracking benchmark data set, the proposed tracker is validated to achieve a competitive performance at three times the speed of existing deep network-based trackers. The fast version of the proposed method, which operates in real time on graphics processing unit, outperforms the state-of-the-art real-time trackers with an accuracy improvement of more than 8%.

  2. Neural Basis of Reinforcement Learning and Decision Making

    PubMed Central

    Lee, Daeyeol; Seo, Hyojung; Jung, Min Whan

    2012-01-01

    Reinforcement learning is an adaptive process in which an animal utilizes its previous experience to improve the outcomes of future choices. Computational theories of reinforcement learning play a central role in the newly emerging areas of neuroeconomics and decision neuroscience. In this framework, actions are chosen according to their value functions, which describe how much future reward is expected from each action. Value functions can be adjusted not only through reward and penalty, but also by the animal’s knowledge of its current environment. Studies have revealed that a large proportion of the brain is involved in representing and updating value functions and using them to choose an action. However, how the nature of a behavioral task affects the neural mechanisms of reinforcement learning remains incompletely understood. Future studies should uncover the principles by which different computational elements of reinforcement learning are dynamically coordinated across the entire brain. PMID:22462543

  3. Effect of reinforcement learning on coordination of multiagent systems

    NASA Astrophysics Data System (ADS)

    Bukkapatnam, Satish T. S.; Gao, Greg

    2000-12-01

    For effective coordination of distributed environments involving multiagent systems, the learning ability of each agent in the environment plays a crucial role. In this paper, we develop a simple group learning method based on reinforcement, and study its effect on coordination through application to a supply chain procurement scenario involving a computer manufacturer. Here, all parties are represented by self-interested, autonomous agents, each capable of performing specific simple tasks. They negotiate with each other to perform complex tasks and thus coordinate supply chain procurement. Reinforcement learning is intended to enable each agent to reach the best negotiable price within the shortest possible time. Our simulations of the application scenario under different learning strategies reveal the positive effects of reinforcement learning on an agent's as well as the system's performance.

  4. Reinforcement Learning of Two-Joint Virtual Arm Reaching in a Computer Model of Sensorimotor Cortex

    PubMed Central

    Neymotin, Samuel A.; Chadderdon, George L.; Kerr, Cliff C.; Francis, Joseph T.; Lytton, William W.

    2014-01-01

    Neocortical mechanisms of learning sensorimotor control involve a complex series of interactions at multiple levels, from synaptic mechanisms to cellular dynamics to network connectomics. We developed a model of sensory and motor neocortex consisting of 704 spiking model neurons. Sensory and motor populations included excitatory cells and two types of interneurons. Neurons were interconnected with AMPA/NMDA and GABAA synapses. We trained our model using spike-timing-dependent reinforcement learning to control a two-joint virtual arm to reach to a fixed target. For each of 125 trained networks, we used 200 training sessions, each involving 15 s reaches to the target from 16 starting positions. Learning altered network dynamics, with enhancements to neuronal synchrony and behaviorally relevant information flow between neurons. After learning, networks demonstrated retention of behaviorally relevant memories by using proprioceptive information to perform reach-to-target from multiple starting positions. Networks dynamically controlled which joint rotations to use to reach a target, depending on current arm position. Learning-dependent network reorganization was evident in both sensory and motor populations: learned synaptic weights showed target-specific patterning optimized for particular reach movements. Our model embodies an integrative hypothesis of sensorimotor cortical learning that could be used to interpret future electrophysiological data recorded in vivo from sensorimotor learning experiments. We used our model to make the following predictions: learning enhances synchrony in neuronal populations and behaviorally relevant information flow across neuronal populations; enhanced sensory processing aids task-relevant motor performance; and the relative ease of a particular movement in vivo depends on the amount of sensory information required to complete the movement. PMID:24047323

  5. Cueing, demand fading, and positive reinforcement to establish self-feeding and oral consumption in a child with chronic food refusal.

    PubMed

    Luiselli, J K

    2000-07-01

    A 3-year-old child with multiple medical disorders and chronic food refusal was treated successfully using a program that incorporated antecedent control procedures combined with positive reinforcement. The antecedent manipulations included visual cueing of a criterion number of self-feeding responses that were required during meals to receive reinforcement and a gradual increase in the imposed criterion (demand fading) that was based on improved frequency of oral consumption. As evaluated in a changing criterion design, the child learned to feed himself as an outcome of treatment. One year following intervention, he was consuming a variety of foods and had gained weight. Advantages of antecedent control methods for the treatment of chronic food refusal are discussed.

  6. Reinforcement Learning Models and Their Neural Correlates: An Activation Likelihood Estimation Meta-Analysis

    PubMed Central

    Kumar, Poornima; Eickhoff, Simon B.; Dombrovski, Alexandre Y.

    2015-01-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments – prediction error – is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies suggest that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that employed algorithmic reinforcement learning models, across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, while instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually-estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies. PMID:25665667
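
    The two signals discussed above, expected value and reward prediction error, are typically generated trial-by-trial from a simple delta-rule (Rescorla-Wagner) update of the form sketched below and then entered as parametric regressors in the fMRI analysis; the learning rate here is illustrative.

      def rescorla_wagner(rewards, alpha=0.2, v0=0.0):
          # Trial-by-trial expected value V and prediction error delta,
          # the two signals regressed against BOLD in model-based fMRI studies.
          V, values, errors = v0, [], []
          for r in rewards:
              delta = r - V            # reward prediction error
              values.append(V)
              errors.append(delta)
              V += alpha * delta       # update expected value toward the outcome
          return values, errors

      # Example: a cue rewarded on most trials; delta shrinks as V approaches 1.
      vals, errs = rescorla_wagner([1, 1, 0, 1, 1, 1, 0, 1])
      print([round(v, 2) for v in vals])
      print([round(e, 2) for e in errs])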

  7. Developing skills versus reinforcing concepts in physics labs: Insight from a survey of students' beliefs about experimental physics

    NASA Astrophysics Data System (ADS)

    Wilcox, Bethany R.; Lewandowski, H. J.

    2017-06-01

    Physics laboratory courses have been generally acknowledged as an important component of the undergraduate curriculum, particularly with respect to developing students' interest in, and understanding of, experimental physics. There are a number of possible learning goals for these courses including reinforcing physics concepts, developing laboratory skills, and promoting expertlike beliefs about the nature of experimental physics. However, there is little consensus among instructors and researchers interested in the laboratory learning environment as to the relative importance of these various learning goals. Here, we contribute data to this debate through the analysis of students' responses to the laboratory-focused assessment known as the Colorado Learning Attitudes about Science Survey for Experimental Physics (E-CLASS). Using a large, national data set of students' responses, we compare students' E-CLASS performance in classes in which the instructor self-reported focusing on developing skills, reinforcing concepts, or both. As the classification of courses was based on instructor self-report, we also provide additional description of these courses with respect to how often students engage in particular activities in the lab. We find that courses that focus specifically on developing lab skills have more expertlike postinstruction E-CLASS responses than courses that focus either on reinforcing physics concepts or on both goals. Within first-year courses, this effect is larger for women. Moreover, these findings hold when controlling for the variance in postinstruction scores that is associated with preinstruction E-CLASS scores, student major, and student gender.

  8. Preliminary Work for Examining the Scalability of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Clouse, Jeff

    1998-01-01

    Researchers began studying automated agents that learn to perform multiple-step tasks early in the history of artificial intelligence (Samuel, 1963; Samuel, 1967; Waterman, 1970; Fikes, Hart & Nilsson, 1972). Multiple-step tasks are tasks that can only be solved via a sequence of decisions, such as control problems, robotics problems, classic problem-solving, and game-playing. The objective of agents attempting to learn such tasks is to use the resources they have available in order to become more proficient at the tasks. In particular, each agent attempts to develop a good policy, a mapping from states to actions, that allows it to select actions that optimize a measure of its performance on the task; for example, reducing the number of steps necessary to complete the task successfully. Our study focuses on reinforcement learning, a set of learning techniques where the learner performs trial-and-error experiments in the task and adapts its policy based on the outcome of those experiments. Much of the work in reinforcement learning has focused on a particular, simple representation, where every problem state is represented explicitly in a table, and associated with each state are the actions that can be chosen in that state. A major advantage of this table lookup representation is that one can prove that certain reinforcement learning techniques will develop an optimal policy for the current task. The drawback is that the representation limits the application of reinforcement learning to multiple-step tasks with relatively small state spaces. There has been a little theoretical work proving that convergence to optimal solutions can be obtained when using generalization structures, but the structures considered are quite simple. The theory says little about complex structures, such as multi-layer, feedforward artificial neural networks (Rumelhart & McClelland, 1986), but empirical results indicate that the use of reinforcement learning with such structures is promising. These empirical results make no theoretical claims, nor do they compare the policies produced to optimal policies. A goal of our work is to be able to make the comparison between an optimal policy and one stored in an artificial neural network. A difficulty of performing such a study is finding a multiple-step task that is small enough that one can find an optimal policy using table lookup, yet large enough that, for practical purposes, an artificial neural network is really required. We have identified a limited form of the game OTHELLO as satisfying these requirements. The work we report here is in the very preliminary stages of research, but this paper provides background for the problem being studied and a description of our initial approach to examining the problem. In the remainder of this paper, we first describe reinforcement learning in more detail. Next, we present the game OTHELLO. Finally, we argue that a restricted form of the game meets the requirements of our study, and describe our preliminary approach to finding an optimal solution to the problem.
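
    The table lookup representation discussed above is the simplest case: every state gets its own row of action values, updated by trial and error. The sketch below is a generic tabular Q-learning illustration on a toy five-state chain with assumed parameters; it is not the OTHELLO study itself.

      import random

      # Toy 5-state chain: action 0 = left, action 1 = right; reward only at the right end.
      N_STATES, GOAL = 5, 4
      ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # assumed parameters

      Q = [[0.0, 0.0] for _ in range(N_STATES)]   # the lookup table: one row per state

      def step(state, action):
          nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
          return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

      def choose(state):
          # epsilon-greedy trial-and-error action selection with random tie-breaking
          if random.random() < EPSILON or Q[state][0] == Q[state][1]:
              return random.randrange(2)
          return 0 if Q[state][0] > Q[state][1] else 1

      for episode in range(200):
          s, done = 0, False
          while not done:
              a = choose(s)
              s2, r, done = step(s, a)
              # one-step Q-learning backup applied to the table entry for (s, a)
              Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) * (not done) - Q[s][a])
              s = s2

      print([[round(v, 2) for v in row] for row in Q])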

  9. Self-Paced Prioritized Curriculum Learning With Coverage Penalty in Deep Reinforcement Learning.

    PubMed

    Ren, Zhipeng; Dong, Daoyi; Li, Huaxiong; Chen, Chunlin; Zhipeng Ren; Daoyi Dong; Huaxiong Li; Chunlin Chen; Dong, Daoyi; Li, Huaxiong; Chen, Chunlin; Ren, Zhipeng

    2018-06-01

    In this paper, a new training paradigm is proposed for deep reinforcement learning using self-paced prioritized curriculum learning with coverage penalty. The proposed deep curriculum reinforcement learning (DCRL) takes the most advantage of experience replay by adaptively selecting appropriate transitions from replay memory based on the complexity of each transition. The criteria of complexity in DCRL consist of self-paced priority as well as coverage penalty. The self-paced priority reflects the relationship between the temporal-difference error and the difficulty of the current curriculum for sample efficiency. The coverage penalty is taken into account for sample diversity. With comparison to deep Q network (DQN) and prioritized experience replay (PER) methods, the DCRL algorithm is evaluated on Atari 2600 games, and the experimental results show that DCRL outperforms DQN and PER on most of these games. More results further show that the proposed curriculum training paradigm of DCRL is also applicable and effective for other memory-based deep reinforcement learning approaches, such as double DQN and dueling network. All the experimental results demonstrate that DCRL can achieve improved training efficiency and robustness for deep reinforcement learning.
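
    As a rough illustration of how the two criteria named above could be combined, the sketch below scores each stored transition with a self-paced priority derived from its temporal-difference error and a coverage penalty that discounts transitions that have already been replayed many times. The functional forms and constants are assumptions made for illustration; they are not taken from the DCRL paper.

      import numpy as np

      def dcrl_style_scores(td_errors, replay_counts, difficulty=1.0, penalty=0.1):
          """Turn TD errors and replay counts into sampling probabilities (illustrative)."""
          td_errors = np.asarray(td_errors, dtype=float)
          replay_counts = np.asarray(replay_counts, dtype=float)
          # self-paced priority: prefer transitions near the current difficulty level
          priority = np.exp(-(td_errors - difficulty) ** 2)
          # coverage penalty: frequently replayed transitions are down-weighted
          score = np.clip(priority - penalty * np.log1p(replay_counts), 1e-6, None)
          return score / score.sum()

      probs = dcrl_style_scores(td_errors=[0.2, 1.1, 3.0, 0.9],
                                replay_counts=[10, 2, 0, 25],
                                difficulty=1.0)
      print(probs.round(3))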

  10. Efficient model learning methods for actor-critic control.

    PubMed

    Grondman, Ivo; Vaandrager, Maarten; Buşoniu, Lucian; Babuska, Robert; Schuitema, Erik

    2012-06-01

    We propose two new actor-critic algorithms for reinforcement learning. Both algorithms use local linear regression (LLR) to learn approximations of the functions involved. A crucial feature of the algorithms is that they also learn a process model, and this, in combination with LLR, provides an efficient policy update for faster learning. The first algorithm uses a novel model-based update rule for the actor parameters. The second algorithm does not use an explicit actor but learns a reference model which represents a desired behavior, from which desired control actions can be calculated using the inverse of the learned process model. The two novel methods and a standard actor-critic algorithm are applied to the pendulum swing-up problem, in which the novel methods achieve faster learning than the standard algorithm.
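
    The model-based actor update described above can be sketched in a few lines if the local linear regression is replaced by a single global linear model and the critic is reduced to a one-step value. The dynamics, reward, and step sizes below are assumptions for illustration; the published algorithms learn local models and a full value function.

      import numpy as np

      rng = np.random.default_rng(1)

      # Unknown true dynamics (1-D state, 1-D action); the learner never reads these.
      def env_step(x, u):
          return 0.9 * x + 0.5 * u + 0.01 * rng.standard_normal()

      # Collect a batch of transitions with exploratory actions.
      X, U, Xn = [], [], []
      x = 1.0
      for _ in range(200):
          u = rng.uniform(-1, 1)
          xn = env_step(x, u)
          X.append(x); U.append(u); Xn.append(xn)
          x = xn
      X, U, Xn = map(np.array, (X, U, Xn))

      # Learn a process model x' ~ a*x + b*u by least squares
      # (a global stand-in for the paper's local linear regression).
      a, b = np.linalg.lstsq(np.column_stack([X, U]), Xn, rcond=None)[0]

      # With reward r = -x'^2, the one-step value of (x, u) is -(a*x + b*u)^2, so a
      # model-based actor update moves u along dV/du = -2*(a*x + b*u)*b.
      def actor_update(x, u, step_size=0.2):
          return u + step_size * (-2.0 * (a * x + b * u) * b)

      u = 0.0
      for _ in range(50):
          u = actor_update(1.0, u)
      print(f"learned model a={a:.2f}, b={b:.2f}, action chosen at x=1: {u:.2f}")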

  11. B-tree search reinforcement learning for model based intelligent agent

    NASA Astrophysics Data System (ADS)

    Bhuvaneswari, S.; Vignashwaran, R.

    2013-03-01

    Agents trained by learning techniques provide a powerful approximation of active solutions for naive approaches. In this study, B-tree search with reinforcement learning is used to moderate the data search for information retrieval, achieving accuracy with minimum search time. The impact of the variables and tactics applied in training is determined using reinforcement learning. Agents based on these techniques perform a satisfactory baseline and act as finite agents, based on the predetermined model, against competitors from the course.

  12. Changes in corticostriatal connectivity during reinforcement learning in humans.

    PubMed

    Horga, Guillermo; Maia, Tiago V; Marsh, Rachel; Hao, Xuejun; Xu, Dongrong; Duan, Yunsuo; Tau, Gregory Z; Graniello, Barbara; Wang, Zhishun; Kangarlu, Alayar; Martinez, Diana; Packard, Mark G; Peterson, Bradley S

    2015-02-01

    Many computational models assume that reinforcement learning relies on changes in synaptic efficacy between cortical regions representing stimuli and striatal regions involved in response selection, but this assumption has thus far lacked empirical support in humans. We recorded hemodynamic signals with fMRI while participants navigated a virtual maze to find hidden rewards. We fitted a reinforcement-learning algorithm to participants' choice behavior and evaluated the neural activity and the changes in functional connectivity related to trial-by-trial learning variables. Activity in the posterior putamen during choice periods increased progressively during learning. Furthermore, the functional connections between the sensorimotor cortex and the posterior putamen strengthened progressively as participants learned the task. These changes in corticostriatal connectivity differentiated participants who learned the task from those who did not. These findings provide a direct link between changes in corticostriatal connectivity and learning, thereby supporting a central assumption common to several computational models of reinforcement learning. © 2014 Wiley Periodicals, Inc.

  13. A common neural circuit mechanism for internally guided and externally reinforced forms of motor learning.

    PubMed

    Hisey, Erin; Kearney, Matthew Gene; Mooney, Richard

    2018-04-01

    The complex skills underlying verbal and musical expression can be learned without external punishment or reward, indicating their learning is internally guided. The neural mechanisms that mediate internally guided learning are poorly understood, but a circuit comprising dopamine-releasing neurons in the midbrain ventral tegmental area (VTA) and their targets in the basal ganglia are important to externally reinforced learning. Juvenile zebra finches copy a tutor song in a process that is internally guided and, in adulthood, can learn to modify the fundamental frequency (pitch) of a target syllable in response to external reinforcement with white noise. Here we combined intersectional genetic ablation of VTA neurons, reversible blockade of dopamine receptors in the basal ganglia, and singing-triggered optogenetic stimulation of VTA terminals to establish that a common VTA-basal ganglia circuit enables internally guided song copying and externally reinforced syllable pitch learning.

  14. Cognitive Control Predicts Use of Model-Based Reinforcement-Learning

    PubMed Central

    Otto, A. Ross; Skatova, Anya; Madlon-Kay, Seth; Daw, Nathaniel D.

    2015-01-01

    Accounts of decision-making and its neural substrates have long posited the operation of separate, competing valuation systems in the control of choice behavior. Recent theoretical and experimental work suggest that this classic distinction between behaviorally and neurally dissociable systems for habitual and goal-directed (or more generally, automatic and controlled) choice may arise from two computational strategies for reinforcement learning (RL), called model-free and model-based RL, but the cognitive or computational processes by which one system may dominate over the other in the control of behavior is a matter of ongoing investigation. To elucidate this question, we leverage the theoretical framework of cognitive control, demonstrating that individual differences in utilization of goal-related contextual information—in the service of overcoming habitual, stimulus-driven responses—in established cognitive control paradigms predict model-based behavior in a separate, sequential choice task. The behavioral correspondence between cognitive control and model-based RL compellingly suggests that a common set of processes may underpin the two behaviors. In particular, computational mechanisms originally proposed to underlie controlled behavior may be applicable to understanding the interactions between model-based and model-free choice behavior. PMID:25170791

  15. Knockout crickets for the study of learning and memory: Dopamine receptor Dop1 mediates aversive but not appetitive reinforcement in crickets.

    PubMed

    Awata, Hiroko; Watanabe, Takahito; Hamanaka, Yoshitaka; Mito, Taro; Noji, Sumihare; Mizunami, Makoto

    2015-11-02

    Elucidation of reinforcement mechanisms in associative learning is an important subject in neuroscience. In mammals, dopamine neurons are thought to play critical roles in mediating both appetitive and aversive reinforcement. Our pharmacological studies suggested that octopamine and dopamine neurons mediate reward and punishment, respectively, in crickets, but recent studies in fruit-flies concluded that dopamine neurons mediate both reward and punishment, via the type 1 dopamine receptor Dop1. To resolve the discrepancy between studies in different insect species, we produced Dop1 knockout crickets using the CRISPR/Cas9 system and found that they are defective in aversive learning with sodium chloride punishment but not appetitive learning with water or sucrose reward. The results suggest that dopamine and octopamine neurons mediate aversive and appetitive reinforcement, respectively, in crickets. We suggest unexpected diversity in neurotransmitters mediating appetitive reinforcement between crickets and fruit-flies, although the neurotransmitter mediating aversive reinforcement is conserved. This study demonstrates the usefulness of the CRISPR/Cas9 system for producing knockout animals for the study of learning and memory.

  16. Enhancing second-order conditioning with lesions of the basolateral amygdala.

    PubMed

    Holland, Peter C

    2016-04-01

    Because the occurrence of primary reinforcers in natural environments is relatively rare, conditioned reinforcement plays an important role in many accounts of behavior, including pathological behaviors such as the abuse of alcohol or drugs. As a result of pairing with natural or drug reinforcers, initially neutral cues acquire the ability to serve as reinforcers for subsequent learning. Accepting a major role for conditioned reinforcement in everyday learning is complicated by the often-evanescent nature of this phenomenon in the laboratory, especially when primary reinforcers are entirely absent from the test situation. Here, I found that under certain conditions, the impact of conditioned reinforcement could be extended by lesions of the basolateral amygdala (BLA). Rats received first-order Pavlovian conditioning pairings of 1 visual conditioned stimulus (CS) with food prior to receiving excitotoxic or sham lesions of the BLA, and first-order pairings of another visual CS with food after that surgery. Finally, each rat received second-order pairings of a different auditory cue with each visual first-order CS. As in prior studies, relative to sham-lesioned control rats, lesioned rats were impaired in their acquisition of second-order conditioning to the auditory cue paired with the first-order CS that was trained after surgery. However, lesioned rats showed enhanced and prolonged second-order conditioning to the auditory cue paired with the first-order CS that was trained before amygdala damage was made. Implications for an enhanced role for conditioned reinforcement by drug-related cues after drug-induced alterations in neural plasticity are discussed. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  17. Optimizing Chemical Reactions with Deep Reinforcement Learning.

    PubMed

    Zhou, Zhenpeng; Li, Xiaocheng; Zare, Richard N

    2017-12-27

    Deep reinforcement learning was employed to optimize chemical reactions. Our model iteratively records the results of a chemical reaction and chooses new experimental conditions to improve the reaction outcome. This model outperformed a state-of-the-art blackbox optimization algorithm by using 71% fewer steps on both simulations and real reactions. Furthermore, we introduced an efficient exploration strategy by drawing the reaction conditions from certain probability distributions, which resulted in an improvement on regret from 0.062 to 0.039 compared with a deterministic policy. Combining the efficient exploration policy with accelerated microdroplet reactions, optimal reaction conditions were determined in 30 min for the four reactions considered, and a better understanding of the factors that control microdroplet reactions was reached. Moreover, our model showed a better performance after training on reactions with similar or even dissimilar underlying mechanisms, which demonstrates its learning ability.

  18. Deletion of the δ opioid receptor gene impairs place conditioning but preserves morphine reinforcement.

    PubMed

    Le Merrer, Julie; Plaza-Zabala, Ainhoa; Del Boca, Carolina; Matifas, Audrey; Maldonado, Rafael; Kieffer, Brigitte L

    2011-04-01

    Converging experimental data indicate that δ opioid receptors contribute to mediate drug reinforcement processes. Whether their contribution reflects a role in the modulation of drug reward or an implication in conditioned learning, however, has not been explored. In the present study, we investigated the impact of δ receptor gene knockout on reinforced conditioned learning under several experimental paradigms. We assessed the ability of δ receptor knockout mice to form drug-context associations with either morphine (appetitive)- or lithium (aversive)-induced Pavlovian place conditioning. We also examined the efficiency of morphine to serve as a positive reinforcer in these mice and their motivation to gain drug injections, with operant intravenous self-administration under fixed and progressive ratio schedules and at two different doses. Mutant mice showed impaired place conditioning in both appetitive and aversive conditions, indicating disrupted context-drug association. In contrast, mutant animals displayed intact acquisition of morphine self-administration and reached breaking-points comparable to control subjects. Thus, reinforcing effects of morphine and motivation to obtain the drug were maintained. Collectively, the data suggest that δ receptor activity is not involved in morphine reinforcement but facilitates place conditioning. This study reveals a novel aspect of δ opioid receptor function in addiction-related behaviors. Copyright © 2011 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.

  19. Social Cognition as Reinforcement Learning: Feedback Modulates Emotion Inference.

    PubMed

    Zaki, Jamil; Kallman, Seth; Wimmer, G Elliott; Ochsner, Kevin; Shohamy, Daphna

    2016-09-01

    Neuroscientific studies of social cognition typically employ paradigms in which perceivers draw single-shot inferences about the internal states of strangers. Real-world social inference features markedly different parameters: People often encounter and learn about particular social targets (e.g., friends) over time and receive feedback about whether their inferences are correct or incorrect. Here, we examined this process and, more broadly, the intersection between social cognition and reinforcement learning. Perceivers were scanned using fMRI while repeatedly encountering three social targets who produced conflicting visual and verbal emotional cues. Perceivers guessed how targets felt and received feedback about whether they had guessed correctly. Visual cues reliably predicted one target's emotion, verbal cues predicted a second target's emotion, and neither reliably predicted the third target's emotion. Perceivers successfully used this information to update their judgments over time. Furthermore, trial-by-trial learning signals, estimated using two reinforcement learning models, tracked activity in ventral striatum and ventromedial pFC, structures associated with reinforcement learning, and regions associated with updating social impressions, including TPJ. These data suggest that learning about others' emotions, like other forms of feedback learning, relies on domain-general reinforcement mechanisms as well as domain-specific social information processing.

  20. Probabilistic Reinforcement Learning in Adults with Autism Spectrum Disorders

    PubMed Central

    Solomon, Marjorie; Smith, Anne C.; Frank, Michael J.; Ly, Stanford; Carter, Cameron S.

    2017-01-01

    Background: Autism spectrum disorders (ASDs) can be conceptualized as disorders of learning; however, there have been few experimental studies taking this perspective. Methods: We examined the probabilistic reinforcement learning performance of 28 adults with ASDs and 30 typically developing adults on a task requiring learning relationships between three stimulus pairs consisting of Japanese characters with feedback that was valid with different probabilities (80%, 70%, and 60%). Both univariate and Bayesian state–space data analytic methods were employed. Hypotheses were based on the extant literature as well as on neurobiological and computational models of reinforcement learning. Results: Both groups learned the task after training. However, there were group differences in early learning in the first task block, where individuals with ASDs acquired the most frequently accurately reinforced stimulus pair (80%) comparably to typically developing individuals; exhibited poorer acquisition of the less frequently reinforced 70% pair as assessed by state–space learning curves; and outperformed typically developing individuals on the near chance (60%) pair. Individuals with ASDs also demonstrated deficits in using positive feedback to exploit rewarded choices. Conclusions: Results support the contention that individuals with ASDs are slower learners. Based on neurobiology and on the results of computational modeling, one interpretation of this pattern of findings is that impairments are related to deficits in flexible updating of reinforcement history as mediated by the orbito-frontal cortex, with spared functioning of the basal ganglia. This hypothesis about the pathophysiology of learning in ASDs can be tested using functional magnetic resonance imaging. PMID:21425243

  1. Effectiveness of an educational video as an instrument to refresh and reinforce the learning of a nursing technique: a randomized controlled trial.

    PubMed

    Salina, Loris; Ruffinengo, Carlo; Garrino, Lorenza; Massariello, Patrizia; Charrier, Lorena; Martin, Barbara; Favale, Maria Santina; Dimonte, Valerio

    2012-05-01

    The Undergraduate Nursing Course has been using videos for the past year or so. Videos are used for many different purposes such as during lessons, nurse refresher courses, reinforcement, and sharing and comparison of knowledge with the professional and scientific community. The purpose of this study was to estimate the efficacy of the video (moving an uncooperative patient from the supine to the lateral position) as an instrument to refresh and reinforce nursing techniques. A two-arm randomized controlled trial (RCT) design was chosen: both groups attended lessons in the classroom as well as in the laboratory; a month later while one group received written information as a refresher, the other group watched the video. Both groups were evaluated in a blinded fashion. A total of 223 students agreed to take part in the study. The difference observed between those who had seen the video and those who had read up on the technique turned out to be an average of 6.19 points in favour of the first (P < 0.05). The results of the RCT demonstrated that students who had seen the video were better able to apply the technique, resulting in a better performance. The video, therefore, represents an important tool to refresh and reinforce previous learning.

  2. Does Feedback-Related Brain Response during Reinforcement Learning Predict Socio-motivational (In-)dependence in Adolescence?

    PubMed Central

    Raufelder, Diana; Boehme, Rebecca; Romund, Lydia; Golde, Sabrina; Lorenz, Robert C.; Gleich, Tobias; Beck, Anne

    2016-01-01

    This multi-methodological study applied functional magnetic resonance imaging to investigate neural activation in a group of adolescent students (N = 88) during a probabilistic reinforcement learning task. We related patterns of emerging brain activity and individual learning rates to socio-motivational (in-)dependence manifested in four different motivation types (MTs): (1) peer-dependent MT, (2) teacher-dependent MT, (3) peer-and-teacher-dependent MT, (4) peer-and-teacher-independent MT. A multinomial regression analysis revealed that the individual learning rate predicts students’ membership to the independent MT, or the peer-and-teacher-dependent MT. Additionally, the striatum, a brain region associated with behavioral adaptation and flexibility, showed increased learning-related activation in students with motivational independence. Moreover, the prefrontal cortex, which is involved in behavioral control, was more active in students of the peer-and-teacher-dependent MT. Overall, this study offers new insights into the interplay of motivation and learning with (1) a focus on inter-individual differences in the role of peers and teachers as source of students’ individual motivation and (2) its potential neurobiological basis. PMID:27199873

  3. Role of dopamine D2 receptors in human reinforcement learning.

    PubMed

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-09-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well.

  4. Role of Dopamine D2 Receptors in Human Reinforcement Learning

    PubMed Central

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-01-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well. PMID:24713613

  5. Intelligent control and cooperation for mobile robots

    NASA Astrophysics Data System (ADS)

    Stingu, Petru Emanuel

    The topic discussed in this work addresses the current research being conducted at the Automation & Robotics Research Institute in the areas of UAV quadrotor control and heterogeneous multi-vehicle cooperation. Autonomy can be successfully achieved by a robot under the following conditions: the robot has to be able to acquire knowledge about the environment and itself, and it also has to be able to reason under uncertainty. The control system must react quickly to immediate challenges, but also has to slowly adapt and improve based on accumulated knowledge. The major contribution of this work is the transfer of the ADP algorithms from the purely theoretical environment to the complex real-world robotic platforms that operate in real time and in uncontrolled environments. Many solutions are adopted from those present in nature because they have been proven to be close to optimal in very different settings. For the control of a single platform, reinforcement learning algorithms are used to design suboptimal controllers for a class of complex systems that can be conceptually split into local loops with simpler dynamics and relatively weak coupling to the rest of the system. Optimality is enforced by having a global critic, but the curse of dimensionality is avoided by using local actors and intelligent pre-processing of the information used for learning the optimal controllers. The system model is used for constructing the structure of the control system, but on top of that the adaptive neural networks that form the actors use the knowledge acquired during normal operation to get closer to optimal control. In real-world experiments, efficient learning is a strong requirement for success. This is accomplished by using an approximation of the system model to focus the learning on equivalent configurations of the state space. Due to the availability of only local data for training, neural networks with local activation functions are implemented. For the control of a formation of robots subjected to dynamic communication constraints, game theory is used in addition to reinforcement learning. The nodes maintain an extra set of state variables about all the other nodes with which they can communicate. The most important of these are trust and predictability. They are a way to incorporate knowledge acquired in the past into the control decisions taken by each node. The trust variable provides a simple mechanism for the implementation of reinforcement learning. For robot formations, potential field based control algorithms are used to generate the control commands. The formation structure changes due to the environment and due to the decisions of the nodes. It is a problem of building a graph and coalitions through distributed decisions while still reaching globally optimal behavior.

  6. [Functional neuroanatomy of implicit learning: associative, motor and habit].

    PubMed

    Correa, M

    The present review focuses on the neuroanatomy of aspects of implicit learning that involve stimulus-response associations, such as classical and instrumental conditioning, motor learning and habit formation. These types of learning all require a progression in the acquisition of procedural information about 'how to do things' instead of 'what things are'. These forms of implicit learning share the neural substrate formed mainly by brain circuits involving the basal ganglia, prefrontal cortex and amygdala. The relationship between Pavlovian and instrumental learning is shown in transfer and autoshaping studies. There has been a resurgence of interest in habit learning because of the suggestion that addiction is a process that progresses from a reinforced response to a habit in which the stimulus-response association is supraselected and becomes independent of voluntary cognitive control. Dopamine has been shown to be involved in the acquisition of these procedures. The different forms of procedural learning studied here are all characterized by stimulus-response-reinforcement associations, but there are differences between them in terms of the degree to which some of these associations or components are strengthened. These different patterns of association are partially regulated by the degree of involvement of the frontal-striatal-amygdala circuits.

  7. Policy improvement by a model-free Dyna architecture.

    PubMed

    Hwang, Kao-Shing; Lo, Chia-Yue

    2013-05-01

    The objective of this paper is to accelerate the process of policy improvement in reinforcement learning. The proposed Dyna-style system combines two learning schemes, one of which utilizes a temporal difference method for direct learning; the other uses relative values for indirect learning in planning between two successive direct learning cycles. Instead of establishing a complicated world model, the approach introduces a simple predictor of average rewards to an actor-critic architecture in the simulation (planning) mode. The relative value of a state, defined as the accumulated differences between immediate reward and average reward, is used to steer the improvement process in the right direction. The proposed learning scheme is applied to control a pendulum system for tracking a desired trajectory to demonstrate its adaptability and robustness. Through reinforcement signals from the environment, the system takes the appropriate action to drive an unknown dynamic system to track desired outputs in a few learning cycles. Comparisons are made between the proposed model-free method, a connectionist adaptive heuristic critic, and an advanced method of Dyna-Q learning in labyrinth-exploration experiments. The proposed method outperforms its counterparts in terms of elapsed time and convergence rate.
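
    The relative value mentioned above, defined as the accumulated difference between immediate reward and average reward, amounts to a few lines of bookkeeping. The sketch below shows only that bookkeeping, with an assumed exponential estimate of the average reward; it is not the full actor-critic Dyna system of the paper.

      class RelativeValuePredictor:
          """Track the average reward and per-state relative values (illustrative)."""

          def __init__(self, beta=0.05):
              self.avg_reward = 0.0
              self.beta = beta            # assumed step size for the average-reward estimate
              self.relative_value = {}    # state -> accumulated (reward - average reward)

          def update(self, state, reward):
              # relative value: accumulated difference between immediate and average
              # reward, used to steer planning between direct-learning cycles
              delta = reward - self.avg_reward
              self.relative_value[state] = self.relative_value.get(state, 0.0) + delta
              self.avg_reward += self.beta * delta    # running average-reward estimate
              return self.relative_value[state]

      predictor = RelativeValuePredictor()
      for state, r in [("near_goal", 1.0), ("far", 0.0), ("near_goal", 1.0), ("far", 0.0)]:
          predictor.update(state, r)
      print(predictor.avg_reward, predictor.relative_value)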

  8. From Recurrent Choice to Skill Learning: A Reinforcement-Learning Model

    ERIC Educational Resources Information Center

    Fu, Wai-Tat; Anderson, John R.

    2006-01-01

    The authors propose a reinforcement-learning mechanism as a model for recurrent choice and extend it to account for skill learning. The model was inspired by recent research in neurophysiological studies of the basal ganglia and provides an integrated explanation of recurrent choice behavior and skill learning. The behavior includes effects of…

  9. A fundamental role for context in instrumental learning and extinction.

    PubMed

    Bouton, Mark E; Todd, Travis P

    2014-05-01

    The purpose of this article is to review recent research that has investigated the effects of context change on instrumental (operant) learning. The first part of the article discusses instrumental extinction, in which the strength of a reinforced instrumental behavior declines when reinforcers are withdrawn. The results suggest that extinction of either simple or discriminated operant behavior is relatively specific to the context in which it is learned: As in prior studies of Pavlovian extinction, ABA, ABC, and AAB renewal effects can all be observed. Further analysis supports the idea that the organism learns to refrain from making a specific response in a specific context, or in more formal terms, an inhibitory context-response association. The second part of the article then discusses research suggesting that the context also controls instrumental behavior before it is extinguished. Several experiments demonstrate that a context switch after either simple or discriminated operant training causes a decrement in the strength of the response. Over a range of conditions, the animal appears to learn a direct association between the context and the response. Under some conditions, it can also learn a hierarchical representation of context and the response-reinforcer relation. Extinction is still more context-specific than conditioning, as indicated by ABC and AAB renewal. Overall, the results establish that the context can play a significant role in both the acquisition and extinction of operant behavior. Copyright © 2014 Elsevier B.V. All rights reserved.

  10. Kernel-based least squares policy iteration for reinforcement learning.

    PubMed

    Xu, Xin; Hu, Dewen; Lu, Xicheng

    2007-07-01

    In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge of the dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared with previous work on approximate RL methods, KLSPI makes two advances that address the main difficulties of existing approaches. One is the better convergence and (near) optimality guarantee obtained by using the KLSTD-Q algorithm for policy evaluation with high precision. The other is the automatic feature selection using the ALD-based kernel sparsification. Therefore, the KLSPI algorithm provides a general RL method with good generalization performance and a convergence guarantee for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task for a stochastic chain problem demonstrate that KLSPI can consistently achieve better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems, including a ship heading control problem and the swing-up control of a double-link underactuated pendulum called the acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information about uncertain dynamic systems. It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure online performance.
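
    The core of KLSTD-Q with ALD-based sparsification can be sketched compactly: a greedy ALD pass builds a small kernel dictionary, and the least squares temporal-difference system is then solved over features defined by that dictionary. The kernel, thresholds, regularizers, and toy chain data below are all assumptions for illustration and are far simpler than the benchmarks reported in the paper.

      import numpy as np

      def rbf(x, y, width=0.5):
          return np.exp(-np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2) / (2 * width ** 2))

      def ald_dictionary(samples, nu=0.1, width=0.5):
          """Greedy approximate-linear-dependency sparsification (simplified)."""
          dictionary = []
          for z in samples:
              if not dictionary:
                  dictionary.append(z)
                  continue
              K = np.array([[rbf(p, q, width) for q in dictionary] for p in dictionary])
              k = np.array([rbf(z, d, width) for d in dictionary])
              c = np.linalg.solve(K + 1e-8 * np.eye(len(K)), k)
              if rbf(z, z, width) - k @ c > nu:   # not yet well represented
                  dictionary.append(z)
          return dictionary

      def klstd_q(transitions, dictionary, gamma=0.95, width=0.5):
          """Solve for kernel Q-function weights from (s, a, r, s', a') samples."""
          phi = lambda s, a: np.array([rbf((s, a), d, width) for d in dictionary])
          A = np.zeros((len(dictionary), len(dictionary)))
          b = np.zeros(len(dictionary))
          for s, a, r, s2, a2 in transitions:
              f, f2 = phi(s, a), phi(s2, a2)
              A += np.outer(f, f - gamma * f2)
              b += r * f
          w = np.linalg.solve(A + 1e-6 * np.eye(len(A)), b)
          return lambda s, a: phi(s, a) @ w

      # Toy usage: evaluate a random policy on a 1-D chain with reward near s = 1.
      rng = np.random.default_rng(0)
      trans, s = [], 0.0
      for _ in range(300):
          a = rng.choice([-1.0, 1.0])
          s2 = float(np.clip(s + 0.1 * a + 0.01 * rng.standard_normal(), -1, 1))
          trans.append((s, a, 1.0 if s2 > 0.9 else 0.0, s2, rng.choice([-1.0, 1.0])))
          s = s2
      D = ald_dictionary([(s, a) for s, a, *_ in trans])
      Q = klstd_q(trans, D)
      print(len(D), Q(0.8, 1.0), Q(0.8, -1.0))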

  11. Adolescent-specific patterns of behavior and neural activity during social reinforcement learning

    PubMed Central

    Jones, Rebecca M.; Somerville, Leah H.; Li, Jian; Ruberry, Erika J.; Powers, Alisa; Mehta, Natasha; Dyke, Jonathan; Casey, BJ

    2014-01-01

    Humans are sophisticated social beings. Social cues from others are exceptionally salient, particularly during adolescence. Understanding how adolescents interpret and learn from variable social signals can provide insight into the observed shift in social sensitivity during this period. The current study tested 120 participants between the ages of 8 and 25 years on a social reinforcement learning task where the probability of receiving positive social feedback was parametrically manipulated. Seventy-eight of these participants completed the task during fMRI scanning. Modeling trial-by-trial learning, children and adults showed higher positive learning rates than adolescents, suggesting that adolescents demonstrated less differentiation in their reaction times for peers who provided more positive feedback. Forming expectations about receiving positive social reinforcement correlated with neural activity within the medial prefrontal cortex and ventral striatum across age. Adolescents, unlike children and adults, showed greater insular activity during positive prediction error learning and increased activity in the supplementary motor cortex and the putamen when receiving positive social feedback regardless of the expected outcome, suggesting that peer approval may motivate adolescents towards action. While different amounts of positive social reinforcement enhanced learning in children and adults, all positive social reinforcement equally motivated adolescents. Together, these findings indicate that sensitivity to peer approval during adolescence goes beyond simple reinforcement theory accounts and suggests possible explanations for how peers may motivate adolescent behavior. PMID:24550063

  12. Adolescent-specific patterns of behavior and neural activity during social reinforcement learning.

    PubMed

    Jones, Rebecca M; Somerville, Leah H; Li, Jian; Ruberry, Erika J; Powers, Alisa; Mehta, Natasha; Dyke, Jonathan; Casey, B J

    2014-06-01

    Humans are sophisticated social beings. Social cues from others are exceptionally salient, particularly during adolescence. Understanding how adolescents interpret and learn from variable social signals can provide insight into the observed shift in social sensitivity during this period. The present study tested 120 participants between the ages of 8 and 25 years on a social reinforcement learning task where the probability of receiving positive social feedback was parametrically manipulated. Seventy-eight of these participants completed the task during fMRI scanning. Modeling trial-by-trial learning, children and adults showed higher positive learning rates than did adolescents, suggesting that adolescents demonstrated less differentiation in their reaction times for peers who provided more positive feedback. Forming expectations about receiving positive social reinforcement correlated with neural activity within the medial prefrontal cortex and ventral striatum across age. Adolescents, unlike children and adults, showed greater insular activity during positive prediction error learning and increased activity in the supplementary motor cortex and the putamen when receiving positive social feedback regardless of the expected outcome, suggesting that peer approval may motivate adolescents toward action. While different amounts of positive social reinforcement enhanced learning in children and adults, all positive social reinforcement equally motivated adolescents. Together, these findings indicate that sensitivity to peer approval during adolescence goes beyond simple reinforcement theory accounts and suggest possible explanations for how peers may motivate adolescent behavior.

  13. Parallel Online Temporal Difference Learning for Motor Control.

    PubMed

    Caarls, Wouter; Schuitema, Erik

    2016-07-01

    Temporal difference (TD) learning, a key concept in reinforcement learning, is a popular method for solving simulated control problems. However, in real systems, this method is often avoided in favor of policy search methods because of its long learning time. But policy search suffers from its own drawbacks, such as the necessity of informed policy parameterization and initialization. In this paper, we show that TD learning can work effectively in real robotic systems as well, using parallel model learning and planning. Using locally weighted linear regression and trajectory-sampled planning with 14 concurrent threads, we can achieve a speedup of almost two orders of magnitude over regular TD control on simulated control benchmarks. For a real-world pendulum swing-up task and a two-link manipulator movement task, we report a speedup of 20× to 60×, with real-time learning completed in less than half a minute. The results are competitive with state-of-the-art policy search.
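
    The core of the approach above is ordinary TD learning interleaved with planning backups drawn from a learned model; the paper runs model learning and planning concurrently on many threads, whereas the sketch below shows only the sequential logic, on a toy deterministic chain with assumed parameters.

      import random

      ALPHA, GAMMA = 0.2, 0.95
      V = [0.0] * 6                  # state values for a 6-state toy chain
      model = {}                     # learned model: state -> (reward, next state)

      def real_step(s):
          s2 = min(s + 1, 5)
          return (1.0 if s2 == 5 else 0.0), s2

      for episode in range(30):
          s = 0
          while s != 5:
              r, s2 = real_step(s)
              # direct TD(0) update from real experience
              V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])
              model[s] = (r, s2)     # deterministic toy model
              # planning: extra TD backups generated from the learned model
              # (the paper performs these on parallel threads)
              for _ in range(10):
                  sp = random.choice(list(model))
                  rp, sp2 = model[sp]
                  V[sp] += ALPHA * (rp + GAMMA * V[sp2] - V[sp])
              s = s2

      print([round(v, 2) for v in V])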

  14. Stochastic Reinforcement Benefits Skill Acquisition

    ERIC Educational Resources Information Center

    Dayan, Eran; Averbeck, Bruno B.; Richmond, Barry J.; Cohen, Leonardo G.

    2014-01-01

    Learning complex skills is driven by reinforcement, which facilitates both online within-session gains and retention of the acquired skills. Yet, in ecologically relevant situations, skills are often acquired when mapping between actions and rewarding outcomes is unknown to the learning agent, resulting in reinforcement schedules of a stochastic…

  15. A reward optimization method based on action subrewards in hierarchical reinforcement learning.

    PubMed

    Fu, Yuchen; Liu, Quan; Ling, Xionghong; Cui, Zhiming

    2014-01-01

    Reinforcement learning (RL) is a kind of interactive learning method. Its main characteristics are "trial and error" and "related reward." A hierarchical reinforcement learning method based on action subrewards is proposed to address the "curse of dimensionality," in which the state space grows exponentially with the number of features and convergence is slow. The method can greatly reduce the state space and choose actions purposefully and efficiently so as to optimize the reward function and speed up convergence. Applied to online learning in the game of Tetris, the experimental results show that the convergence speed of the algorithm is clearly enhanced by the new method, which combines a hierarchical reinforcement learning algorithm with action subrewards. The "curse of dimensionality" problem is also alleviated to a certain extent by the hierarchical method. Performance with different parameters is compared and analyzed as well.
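
    The decomposition above can be illustrated by giving each subtask its own Q-table trained on an action subreward (progress toward its subgoal), while the overall task reward is only delivered at the final goal. The sketch below uses a toy corridor, a fixed subtask sequence, and assumed shaping constants; it is a generic illustration of the idea, not the Tetris system of the paper.

      import random

      # Toy corridor 0..8 with a subgoal at 4 and the true goal at 8.
      SUBGOAL, GOAL = 4, 8
      ALPHA, GAMMA, EPS = 0.3, 0.9, 0.1

      # One low-level Q-table per subtask, each trained on its own subreward.
      q_low = {"reach_subgoal": [[0.0, 0.0] for _ in range(9)],
               "reach_goal":    [[0.0, 0.0] for _ in range(9)]}
      target_of = {"reach_subgoal": SUBGOAL, "reach_goal": GOAL}

      def move(s, a):
          return max(0, min(8, s + (1 if a == 1 else -1)))

      def run_subtask(name, s):
          """Execute one subtask, learning only from its action subreward."""
          q, target = q_low[name], target_of[name]
          while s != target:
              a = random.randrange(2) if random.random() < EPS else max((0, 1), key=lambda x: q[s][x])
              s2 = move(s, a)
              sub_r = 1.0 if s2 == target else -0.01   # action subreward (assumed shaping)
              q[s][a] += ALPHA * (sub_r + GAMMA * max(q[s2]) * (s2 != target) - q[s][a])
              s = s2
          return s

      for episode in range(100):
          s = run_subtask("reach_subgoal", 0)   # higher level: a fixed subtask sequence here
          s = run_subtask("reach_goal", s)

      print(round(q_low["reach_subgoal"][0][1], 2), round(q_low["reach_goal"][4][1], 2))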

  16. Multi-agent Reinforcement Learning Model for Effective Action Selection

    NASA Astrophysics Data System (ADS)

    Youk, Sang Jo; Lee, Bong Keun

    Reinforcement learning is a subarea of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. In the multi-agent case especially, the state and action spaces become enormous compared with the single-agent case, so the most effective available action-selection strategy must be adopted for effective reinforcement learning. This paper proposes a multi-agent reinforcement learning model based on a fuzzy inference system in order to improve learning speed and select effective actions in multi-agent settings. An effective action-selection strategy is verified through evaluation tests based on RoboCup Keepaway, a widely used multi-agent testbed. The proposed model can be applied to evaluate the efficiency of various intelligent multi-agent systems, and also to the strategy and tactics of robot soccer.

  17. Pragmatically Framed Cross-Situational Noun Learning Using Computational Reinforcement Models

    PubMed Central

    Najnin, Shamima; Banerjee, Bonny

    2018-01-01

    Cross-situational learning and social pragmatic theories are prominent mechanisms for learning word meanings (i.e., word-object pairs). In this paper, the role of reinforcement is investigated for early word-learning by an artificial agent. When exposed to a group of speakers, the agent comes to understand an initial set of vocabulary items belonging to the language used by the group. Both cross-situational learning and social pragmatic theory are taken into account. As social cues, joint attention and prosodic cues in caregiver's speech are considered. During agent-caregiver interaction, the agent selects a word from the caregiver's utterance and learns the relations between that word and the objects in its visual environment. The “novel words to novel objects” language-specific constraint is assumed for computing rewards. The models are learned by maximizing the expected reward using reinforcement learning algorithms [i.e., table-based algorithms: Q-learning, SARSA, SARSA-λ, and neural network-based algorithms: Q-learning for neural network (Q-NN), neural-fitted Q-network (NFQ), and deep Q-network (DQN)]. Neural network-based reinforcement learning models are chosen over table-based models for better generalization and quicker convergence. Simulations are carried out using mother-infant interaction CHILDES dataset for learning word-object pairings. Reinforcement is modeled in two cross-situational learning cases: (1) with joint attention (Attentional models), and (2) with joint attention and prosodic cues (Attentional-prosodic models). Attentional-prosodic models manifest superior performance to Attentional ones for the task of word-learning. The Attentional-prosodic DQN outperforms existing word-learning models for the same task. PMID:29441027
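
    A minimal table-based version of the reward-driven word-object pairing described above might look like the sketch below: the agent keeps a value for each word-object pair, picks a referent for the heard word, and is rewarded when the pick matches the object under joint attention. The toy interactions, reward scheme, and parameters are illustrative assumptions and are far simpler than the CHILDES-based simulations in the paper.

      import random
      from collections import defaultdict

      random.seed(0)
      ALPHA, EPS = 0.2, 0.2
      Q = defaultdict(float)        # Q[(word, object)] -> learned association strength

      # toy interactions: (heard word, objects in view, object under joint attention)
      interactions = [("ball", ["cup", "ball"], "ball"),
                      ("cup",  ["dog", "cup"],  "cup"),
                      ("dog",  ["ball", "dog"], "dog")] * 50

      for word, scene, attended in interactions:
          # epsilon-greedy guess of the referent for the heard word
          if random.random() < EPS:
              guess = random.choice(scene)
          else:
              guess = max(scene, key=lambda o: Q[(word, o)])
          reward = 1.0 if guess == attended else 0.0   # joint attention as the social cue
          Q[(word, guess)] += ALPHA * (reward - Q[(word, guess)])

      print({k: round(v, 2) for k, v in Q.items() if v > 0.5})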

  18. The Relationship between Software Design and Children's Engagement

    ERIC Educational Resources Information Center

    Buckleitner, Warren

    2006-01-01

    This study was an attempt to measure the effects of praise and reinforcement on children in a computer learning setting. A sorting game was designed to simulate 2 interaction styles. One style, called high computer control, provided frequent praise and coaching. The other, called high child control, had narration and praise toggled off. A…

  19. Reinforcement learning improves behaviour from evaluative feedback

    NASA Astrophysics Data System (ADS)

    Littman, Michael L.

    2015-05-01

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  20. Reinforcement learning improves behaviour from evaluative feedback.

    PubMed

    Littman, Michael L

    2015-05-28

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  1. The dorsal tegmental noradrenergic projection: an analysis of its role in maze learning.

    PubMed

    Roberts, D C; Price, M T; Fibiger, H C

    1976-04-01

    The hypothesis that the noradrenergic projection from the locus coeruleus (LC) to the cerebral cortex and hippocampus is an important neural substrate for learning was evaluated. Maze performance was studied in rats receiving either electrolytic lesions of LC or 6-hydroxydopamine (6-OHDA) lesions of the dorsal tegmental noradrenergic projection. The LC lesions did not disrupt the acquisition of a running response for food reinforcement in an L-shaped runway, even though hippocampal-cortical norepinephrine (NE) was reduced to 29%. Greater telencephalic NE depletions (to 6% of control levels) produced by 6-OHDA also failed to disrupt the acquisition of this behavior or to impair the acquisition of a food-reinforced position habit in a T-maze. Neither locomotor activity nor habituation to a novel environment was affected by the 6-OHDA lesions. Rats with such lesions were, however, found to be significantly more distractible than were controls during the performance of a previously trained response. The hypothesis that telencephalic NE is of fundamental importance in learning was not supported. The data suggest that this system may participate in attentional mechanisms.

  2. Approximate reasoning-based learning and control for proximity operations and docking in space

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Jani, Yashvant; Lea, Robert N.

    1991-01-01

    A recently proposed hybrid neural-network and fuzzy-logic control architecture is applied to a fuzzy logic controller developed for attitude control of the Space Shuttle. A model using reinforcement learning and learning from past experience to fine-tune its knowledge base is proposed. The two main components of this approximate reasoning-based intelligent control (ARIC) model, an action-state evaluation network and an action selection network, are described, as well as the Space Shuttle attitude controller. An ARIC model for the controller is presented, and it is noted that the input layer of each network includes three nodes representing the angle error, the angle error rate, and a bias node. Preliminary results indicate that the controller can hold the pitch rate within its desired deadband and begins to use the jets at about 500 s into the run.

  3. Resurgence of Integrated Behavioral Units

    PubMed Central

    Bachá-Méndez, Gustavo; Reid, Alliston K; Mendoza-Soylovna, Adela

    2007-01-01

    Two experiments with rats examined the dynamics of well-learned response sequences when reinforcement contingencies were changed. Both experiments contained four phases, each of which reinforced a 2-response sequence of lever presses until responding was stable. The contingencies then were shifted to a new reinforced sequence until responding was again stable. Extinction-induced resurgence of previously reinforced, and then extinguished, heterogeneous response sequences was observed in all subjects in both experiments. These sequences were demonstrated to be integrated behavioral units, controlled by processes acting at the level of the entire sequence. Response-level processes were also simultaneously operative. Errors in sequence production were strongly influenced by the terminal, not the initial, response in the currently reinforced sequence, but not by the previously reinforced sequence. These studies demonstrate that sequence-level and response-level processes can operate simultaneously in integrated behavioral units. Resurgence and the development of integrated behavioral units may be dissociated; thus the observation of one does not necessarily imply the other. PMID:17345948

  4. The power of associative learning and the ontogeny of optimal behaviour.

    PubMed

    Enquist, Magnus; Lind, Johan; Ghirlanda, Stefano

    2016-11-01

    Behaving efficiently (optimally or near-optimally) is central to animals' adaptation to their environment. Much evolutionary biology assumes, implicitly or explicitly, that optimal behavioural strategies are genetically inherited, yet the behaviour of many animals depends crucially on learning. The question of how learning contributes to optimal behaviour is largely open. Here we propose an associative learning model that can learn optimal behaviour in a wide variety of ecologically relevant circumstances. The model learns through chaining, a term introduced by Skinner to indicate learning of behaviour sequences by linking together shorter sequences or single behaviours. Our model formalizes the concept of conditioned reinforcement (the learning process that underlies chaining) and is closely related to optimization algorithms from machine learning. Our analysis dispels the common belief that associative learning is too limited to produce 'intelligent' behaviour such as tool use, social learning, self-control or expectations of the future. Furthermore, the model readily accounts for both instinctual and learned aspects of behaviour, clarifying how genetic evolution and individual learning complement each other, and bridging a long-standing divide between ethology and psychology. We conclude that associative learning, supported by genetic predispositions and including the oft-neglected phenomenon of conditioned reinforcement, may suffice to explain the ontogeny of optimal behaviour in most, if not all, non-human animals. Our results establish associative learning as a more powerful optimizing mechanism than acknowledged by current opinion.
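
    The conditioned-reinforcement mechanism that underlies chaining can be illustrated with a two-link sequence: the stimulus that follows a correct first response acquires value from the food it predicts, and that acquired value in turn reinforces the earlier response. The sketch below is a generic temporal-difference-style illustration with assumed parameters, not the authors' model.

      ALPHA = 0.2   # assumed learning rate

      # Chain: S1 --respond--> S2 --respond--> food (primary value 1.0)
      v = {"S1": 0.0, "S2": 0.0, "food": 1.0}     # stimulus values
      w = {"S1": 0.0, "S2": 0.0}                  # strength of responding in each state

      for trial in range(100):
          # Second link: responding in S2 is followed by food, a primary reinforcer.
          v["S2"] += ALPHA * (v["food"] - v["S2"])
          w["S2"] += ALPHA * (v["food"] - w["S2"])
          # First link: responding in S1 is followed only by S2, so it is reinforced
          # by S2's acquired (conditioned) value, which itself grows across trials.
          v["S1"] += ALPHA * (v["S2"] - v["S1"])
          w["S1"] += ALPHA * (v["S2"] - w["S1"])

      print({k: round(x, 2) for k, x in v.items()}, {k: round(x, 2) for k, x in w.items()})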

  5. The power of associative learning and the ontogeny of optimal behaviour

    PubMed Central

    Enquist, Magnus; Lind, Johan

    2016-01-01

    Behaving efficiently (optimally or near-optimally) is central to animals' adaptation to their environment. Much evolutionary biology assumes, implicitly or explicitly, that optimal behavioural strategies are genetically inherited, yet the behaviour of many animals depends crucially on learning. The question of how learning contributes to optimal behaviour is largely open. Here we propose an associative learning model that can learn optimal behaviour in a wide variety of ecologically relevant circumstances. The model learns through chaining, a term introduced by Skinner to indicate learning of behaviour sequences by linking together shorter sequences or single behaviours. Our model formalizes the concept of conditioned reinforcement (the learning process that underlies chaining) and is closely related to optimization algorithms from machine learning. Our analysis dispels the common belief that associative learning is too limited to produce ‘intelligent’ behaviour such as tool use, social learning, self-control or expectations of the future. Furthermore, the model readily accounts for both instinctual and learned aspects of behaviour, clarifying how genetic evolution and individual learning complement each other, and bridging a long-standing divide between ethology and psychology. We conclude that associative learning, supported by genetic predispositions and including the oft-neglected phenomenon of conditioned reinforcement, may suffice to explain the ontogeny of optimal behaviour in most, if not all, non-human animals. Our results establish associative learning as a more powerful optimizing mechanism than acknowledged by current opinion. PMID:28018662

  6. Quantum-Enhanced Machine Learning

    NASA Astrophysics Data System (ADS)

    Dunjko, Vedran; Taylor, Jacob M.; Briegel, Hans J.

    2016-09-01

    The emerging field of quantum machine learning has the potential to substantially aid in the problems and scope of artificial intelligence. This is only enhanced by recent successes in the field of classical machine learning. In this work we propose an approach for the systematic treatment of machine learning, from the perspective of quantum information. Our approach is general and covers all three main branches of machine learning: supervised, unsupervised, and reinforcement learning. While quantum improvements in supervised and unsupervised learning have been reported, reinforcement learning has received much less attention. Within our approach, we tackle the problem of quantum enhancements in reinforcement learning as well, and propose a systematic scheme for providing improvements. As an example, we show that quadratic improvements in learning efficiency, and exponential improvements in performance over limited time periods, can be obtained for a broad class of learning problems.

  7. Factors Impacting Emergence of Behavioral Control by Underselected Stimuli in Humans after Reduction of Control by Overselected Stimuli

    ERIC Educational Resources Information Center

    Broomfield, Laura; McHugh, Louise; Reed, Phil

    2010-01-01

    Stimulus overselectivity occurs when only one of potentially many aspects of the environment controls behavior. Adult participants were trained and tested on a trial-and-error discrimination learning task while engaging in a concurrent load task, and overselectivity emerged. When responding to the overselected stimulus was reduced by reinforcing a…

  8. The control of tonic pain by active relief learning

    PubMed Central

    Mano, Hiroaki; Lee, Michael; Yoshida, Wako; Kawato, Mitsuo; Robbins, Trevor W

    2018-01-01

    Tonic pain after injury characterises a behavioural state that prioritises recovery. Although generally suppressing cognition and attention, tonic pain needs to allow effective relief learning to reduce the cause of the pain. Here, we describe a central learning circuit that supports learning of relief and concurrently suppresses the level of ongoing pain. We used computational modelling of behavioural, physiological and neuroimaging data in two experiments in which subjects learned to terminate tonic pain in static and dynamic escape-learning paradigms. In both studies, we show that active relief-seeking involves a reinforcement learning process manifest by error signals observed in the dorsal putamen. Critically, this system uses an uncertainty (‘associability’) signal detected in pregenual anterior cingulate cortex that both controls the relief learning rate, and endogenously and parametrically modulates the level of tonic pain. The results define a self-organising learning circuit that reduces ongoing pain when learning about potential relief. PMID:29482716
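
    The abstract's key computational claim is that an uncertainty ('associability') signal sets the relief learning rate. A minimal, Pearce-Hall-style sketch of that idea is given below; it illustrates the general mechanism rather than the authors' fitted model, and the constants kappa and eta are assumptions.

```python
def relief_learning_step(value, outcome, associability, kappa=0.3, eta=0.2):
    """One associability-gated update; kappa and eta are illustrative constants."""
    prediction_error = outcome - value
    value += kappa * associability * prediction_error                 # value update
    associability = (1 - eta) * associability + eta * abs(prediction_error)
    return value, associability

v, alpha = 0.0, 1.0                      # initial value and associability
for relief in [1, 1, 0, 1, 0, 0]:        # 1 = relief obtained, 0 = no relief
    v, alpha = relief_learning_step(v, relief, alpha)
```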

  9. Impedance learning for robotic contact tasks using natural actor-critic algorithm.

    PubMed

    Kim, Byungchan; Park, Jooyoung; Park, Shinsuk; Kang, Sungchul

    2010-04-01

    Compared with their robotic counterparts, humans excel at various tasks by using their ability to adaptively modulate arm impedance parameters. This ability allows us to successfully perform contact tasks even in uncertain environments. This paper considers a learning strategy of motor skill for robotic contact tasks based on a human motor control theory and machine learning schemes. Our robot learning method employs impedance control based on the equilibrium point control theory and reinforcement learning to determine the impedance parameters for contact tasks. A recursive least-square filter-based episodic natural actor-critic algorithm is used to find the optimal impedance parameters. The effectiveness of the proposed method was tested through dynamic simulations of various contact tasks. The simulation results demonstrated that the proposed method optimizes the performance of the contact tasks in uncertain conditions of the environment.

  10. Using paper presentation breaks during didactic lectures improves learning of physiology in undergraduate students.

    PubMed

    Ghorbani, Ahmad; Ghazvini, Kiarash

    2016-03-01

    Many studies have emphasized the incorporation of active learning into classrooms to reinforce didactic lectures for physiology courses. This work aimed to determine if presenting classic papers during didactic lectures improves the learning of physiology among undergraduate students. Twenty-two students of health information technology were randomly divided into the following two groups: 1) didactic lecture only (control group) and 2) didactic lecture plus paper presentation breaks (DLPP group). In the control group, main topics of gastrointestinal and endocrine physiology were taught using only the didactic lecture technique. In the DLPP group, some topics were presented by the didactic lecture method (similar to the control group) and some topics were taught by the DLPP technique (first, concepts were covered briefly in a didactic format and then reinforced with presentation of a related classic paper). The combination of didactic lecture and paper breaks significantly improved learning so that students in the DLPP group showed higher scores on related topics compared with those in the control group (P < 0.001). Comparison of the scores of topics taught by only the didactic lecture and those using both the didactic lecture and paper breaks showed significant improvement only in the DLPP group (P < 0.001). Data obtained from the final exam showed that in the DLPP group, the mean score of the topics taught by the combination of didactic lecture and paper breaks was significantly higher than those taught by only didactic lecture (P < 0.05). In conclusion, the combination of paper presentation breaks and didactic lectures improves the learning of physiology. Copyright © 2016 The American Physiological Society.

  11. The drift diffusion model as the choice rule in reinforcement learning.

    PubMed

    Pedersen, Mads Lund; Frank, Michael J; Biele, Guido

    2017-08-01

    Current reinforcement-learning models often assume simplified decision processes that do not fully reflect the dynamic complexities of choice processes. Conversely, sequential-sampling models of decision making account for both choice accuracy and response time, but assume that decisions are based on static decision values. To combine these two computational models of decision making and learning, we implemented reinforcement-learning models in which the drift diffusion model describes the choice process, thereby capturing both within- and across-trial dynamics. To exemplify the utility of this approach, we quantitatively fit data from a common reinforcement-learning paradigm using hierarchical Bayesian parameter estimation, and compared model variants to determine whether they could capture the effects of stimulant medication in adult patients with attention-deficit hyperactivity disorder (ADHD). The model with the best relative fit provided a good description of the learning process, choices, and response times. A parameter recovery experiment showed that the hierarchical Bayesian modeling approach enabled accurate estimation of the model parameters. The model approach described here, using simultaneous estimation of reinforcement-learning and drift diffusion model parameters, shows promise for revealing new insights into the cognitive and neural mechanisms of learning and decision making, as well as the alteration of such processes in clinical groups.
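
    A minimal sketch of the combination described here, assuming a delta-rule value update and a drift-diffusion choice rule whose drift rate scales with the value difference between options; the simulation scheme, reward probabilities, and parameter values are illustrative assumptions, not the hierarchical Bayesian model fit in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def ddm_choice(drift, threshold=1.0, dt=0.001, noise=1.0):
    """Simulate one diffusion-to-bound decision; returns (choice, response time)."""
    evidence, t = 0.0, 0.0
    while abs(evidence) < threshold:
        evidence += drift * dt + noise * np.sqrt(dt) * rng.normal()
        t += dt
    return (0 if evidence > 0 else 1), t

q = np.zeros(2)                          # learned option values
learning_rate, drift_scale = 0.1, 2.0
reward_prob = [0.7, 0.3]                 # illustrative two-armed bandit

for _ in range(100):
    choice, rt = ddm_choice(drift_scale * (q[0] - q[1]))
    reward = float(rng.random() < reward_prob[choice])
    q[choice] += learning_rate * (reward - q[choice])   # delta-rule value update
```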

  12. The drift diffusion model as the choice rule in reinforcement learning

    PubMed Central

    Frank, Michael J.

    2017-01-01

    Current reinforcement-learning models often assume simplified decision processes that do not fully reflect the dynamic complexities of choice processes. Conversely, sequential-sampling models of decision making account for both choice accuracy and response time, but assume that decisions are based on static decision values. To combine these two computational models of decision making and learning, we implemented reinforcement-learning models in which the drift diffusion model describes the choice process, thereby capturing both within- and across-trial dynamics. To exemplify the utility of this approach, we quantitatively fit data from a common reinforcement-learning paradigm using hierarchical Bayesian parameter estimation, and compared model variants to determine whether they could capture the effects of stimulant medication in adult patients with attention-deficit hyperactivity disorder (ADHD). The model with the best relative fit provided a good description of the learning process, choices, and response times. A parameter recovery experiment showed that the hierarchical Bayesian modeling approach enabled accurate estimation of the model parameters. The model approach described here, using simultaneous estimation of reinforcement-learning and drift diffusion model parameters, shows promise for revealing new insights into the cognitive and neural mechanisms of learning and decision making, as well as the alteration of such processes in clinical groups. PMID:27966103

  13. Optimizing Chemical Reactions with Deep Reinforcement Learning

    PubMed Central

    2017-01-01

    Deep reinforcement learning was employed to optimize chemical reactions. Our model iteratively records the results of a chemical reaction and chooses new experimental conditions to improve the reaction outcome. This model outperformed a state-of-the-art blackbox optimization algorithm by using 71% fewer steps on both simulations and real reactions. Furthermore, we introduced an efficient exploration strategy by drawing the reaction conditions from certain probability distributions, which resulted in an improvement on regret from 0.062 to 0.039 compared with a deterministic policy. Combining the efficient exploration policy with accelerated microdroplet reactions, optimal reaction conditions were determined in 30 min for the four reactions considered, and a better understanding of the factors that control microdroplet reactions was reached. Moreover, our model showed a better performance after training on reactions with similar or even dissimilar underlying mechanisms, which demonstrates its learning ability. PMID:29296675

  14. Operant conditioning of enhanced pain sensitivity by heat-pain titration.

    PubMed

    Becker, Susanne; Kleinböhl, Dieter; Klossika, Iris; Hölzl, Rupert

    2008-11-15

    Operant conditioning mechanisms have been demonstrated to be important in the development of chronic pain. Most experimental studies have investigated the operant modulation of verbal pain reports with extrinsic reinforcement, such as verbal reinforcement. Whether this reflects actual changes in the subjective experience of the nociceptive stimulus remained unclear. This study replicates and extends our previous demonstration that enhanced pain sensitivity to prolonged heat-pain stimulation could be learned in healthy participants through intrinsic reinforcement (contingent changes in nociceptive input) independent of verbal pain reports. In addition, we examine whether different magnitudes of reinforcement differentially enhance pain sensitivity using an operant heat-pain titration paradigm. It is based on the previously developed non-verbal behavioral discrimination task for the assessment of sensitization, which uses discriminative down- or up-regulation of stimulus temperatures in response to changes in subjective intensity. In operant heat-pain titration, this discriminative behavior and not verbal pain report was contingently reinforced or punished by acute decreases or increases in heat-pain intensity. The magnitude of reinforcement was varied between three groups: low (N1=13), medium (N2=11) and high reinforcement (N3=12). Continuous reinforcement was applied to acquire and train the operant behavior, followed by partial reinforcement to analyze the underlying learning mechanisms. Results demonstrated that sensitization to prolonged heat-pain stimulation was enhanced by operant learning within 1h. The extent of sensitization was directly dependent on the received magnitude of reinforcement. Thus, operant learning mechanisms based on intrinsic reinforcement may provide an explanation for the gradual development of sustained hypersensitivity during pain that is becoming chronic.

  15. A model for discriminating reinforcers in time and space.

    PubMed

    Cowie, Sarah; Davison, Michael; Elliffe, Douglas

    2016-06-01

    Both the response-reinforcer and stimulus-reinforcer relation are important in discrimination learning; differential responding requires a minimum of two discriminably-different stimuli and two discriminably-different associated contingencies of reinforcement. When elapsed time is a discriminative stimulus for the likely availability of a reinforcer, choice over time may be modeled by an extension of the Davison and Nevin (1999) model that assumes that local choice strictly matches the effective local reinforcer ratio. The effective local reinforcer ratio may differ from the obtained local reinforcer ratio for two reasons: Because the animal inaccurately estimates times associated with obtained reinforcers, and thus incorrectly discriminates the stimulus-reinforcer relation across time; and because of error in discriminating the response-reinforcer relation. In choice-based timing tasks, the two responses are usually highly discriminable, and so the larger contributor to differences between the effective and obtained reinforcer ratio is error in discriminating the stimulus-reinforcer relation. Such error may be modeled either by redistributing the numbers of reinforcers obtained at each time across surrounding times, or by redistributing the ratio of reinforcers obtained at each time in the same way. We assessed the extent to which these two approaches to modeling discrimination of the stimulus-reinforcer relation could account for choice in a range of temporal-discrimination procedures. The version of the model that redistributed numbers of reinforcers accounted for more variance in the data. Further, this version provides an explanation for shifts in the point of subjective equality that occur as a result of changes in the local reinforcer rate. The inclusion of a parameter reflecting error in discriminating the response-reinforcer relation enhanced the ability of each version of the model to describe data. The ability of this class of model to account for a range of data suggests that timing, like other conditional discriminations, is choice under the joint discriminative control of elapsed time and differential reinforcement. Understanding the role of differential reinforcement is therefore critical to understanding control by elapsed time. Copyright © 2016 Elsevier B.V. All rights reserved.
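
    A hedged sketch of the "redistribute reinforcer numbers" variant is given below: counts of reinforcers obtained in each elapsed-time bin are smeared across neighbouring bins by a Gaussian kernel standing in for imprecision in the stimulus-reinforcer (temporal) discrimination. The kernel shape and widths are assumptions made for illustration only, not the published model's parameterization.

```python
import numpy as np

def redistribute_reinforcers(counts, bin_times, sigma):
    """Smear per-bin reinforcer counts over neighbouring time bins (Gaussian kernel)."""
    effective = np.zeros(len(counts))
    for t_obtained, n in zip(bin_times, counts):
        kernel = np.exp(-0.5 * ((bin_times - t_obtained) / sigma) ** 2)
        effective += n * kernel / kernel.sum()
    return effective

counts = np.array([0, 0, 5, 20, 5, 0, 0])          # reinforcers obtained in each elapsed-time bin
times = np.arange(len(counts), dtype=float)
effective_counts = redistribute_reinforcers(counts, times, sigma=1.0)
```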

  16. Improving Robot Motor Learning with Negatively Valenced Reinforcement Signals

    PubMed Central

    Navarro-Guerrero, Nicolás; Lowe, Robert J.; Wermter, Stefan

    2017-01-01

    Both nociception and punishment signals have been used in robotics. However, the potential for using these negatively valenced types of reinforcement learning signals for robot learning has not been exploited in detail yet. Nociceptive signals are primarily used as triggers of preprogrammed action sequences. Punishment signals are typically disembodied, i.e., with no or little relation to the agent-intrinsic limitations, and they are often used to impose behavioral constraints. Here, we provide an alternative approach for nociceptive signals as drivers of learning rather than simple triggers of preprogrammed behavior. Explicitly, we use nociception to expand the state space while we use punishment as a negative reinforcement learning signal. We compare the performance—in terms of task error, the amount of perceived nociception, and length of learned action sequences—of different neural networks imbued with punishment-based reinforcement signals for inverse kinematic learning. We contrast the performance of a version of the neural network that receives nociceptive inputs to that without such a process. Furthermore, we provide evidence that nociception can improve learning—making the algorithm more robust against network initializations—as well as behavioral performance by reducing the task error, perceived nociception, and length of learned action sequences. Moreover, we provide evidence that punishment, at least as typically used within reinforcement learning applications, may be detrimental in all relevant metrics. PMID:28420976

  17. Reinforcement learning techniques for controlling resources in power networks

    NASA Astrophysics Data System (ADS)

    Kowli, Anupama Sunil

    As power grids transition towards increased reliance on renewable generation, energy storage and demand response resources, an effective control architecture is required to harness the full functionalities of these resources. There is a critical need for control techniques that recognize the unique characteristics of the different resources and exploit the flexibility afforded by them to provide ancillary services to the grid. The work presented in this dissertation addresses these needs. Specifically, new algorithms are proposed, which allow control synthesis in settings wherein the precise distribution of the uncertainty and its temporal statistics are not known. These algorithms are based on recent developments in Markov decision theory, approximate dynamic programming and reinforcement learning. They impose minimal assumptions on the system model and allow the control to be "learned" based on the actual dynamics of the system. Furthermore, they can accommodate complex constraints such as capacity and ramping limits on generation resources, state-of-charge constraints on storage resources, comfort-related limitations on demand response resources and power flow limits on transmission lines. Numerical studies demonstrating applications of these algorithms to practical control problems in power systems are discussed. Results demonstrate how the proposed control algorithms can be used to improve the performance and reduce the computational complexity of the economic dispatch mechanism in a power network. We argue that the proposed algorithms are eminently suitable to develop operational decision-making tools for large power grids with many resources and many sources of uncertainty.

  18. Agent-based real-time signal coordination in congested networks.

    DOT National Transportation Integrated Search

    2014-01-01

    This study is the continuation of a previous NEXTRANS study on agent-based reinforcement learning methods for signal coordination in congested networks. In the previous study, the formulation of a real-time agent-based traffic signal control in o...

  19. Learning to soar in turbulent environments

    NASA Astrophysics Data System (ADS)

    Reddy, Gautam; Celani, Antonio; Sejnowski, Terrence; Vergassola, Massimo

    Birds and gliders exploit warm, rising atmospheric currents (thermals) to reach heights comparable to low-lying clouds with a reduced expenditure of energy. Soaring provides a remarkable instance of complex decision-making in biology and requires a long-term strategy to effectively use the ascending thermals. Furthermore, the problem is technologically relevant to extend the flying range of autonomous gliders. The formation of thermals unavoidably generates strong turbulent fluctuations, which make deriving an efficient policy harder and thus constitute an essential element of soaring. Here, we approach soaring flight as a problem of learning to navigate highly fluctuating turbulent environments. We simulate the atmospheric boundary layer by numerical models of turbulent convective flow and combine them with model-free, experience-based, reinforcement learning algorithms to train the virtual gliders. For the learned policies in the regimes of moderate and strong turbulence levels, the virtual glider adopts an increasingly conservative policy as turbulence levels increase, quantifying the degree of risk affordable in turbulent environments. Reinforcement learning uncovers those sensorimotor cues that permit effective control over soaring in turbulent environments.

  20. Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons

    PubMed Central

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-01-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity. PMID:23592970
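
    The core quantity in this line of work is a continuous-time temporal-difference error. A discretized sketch in the spirit of Doya's formulation is shown below; the finite-difference derivative and the linear-critic comment are illustrative assumptions and do not reproduce the spiking actor-critic network of the paper.

```python
def continuous_td_error(v_now, v_prev, reward, dt=0.01, tau_r=1.0):
    """Discretized continuous-time TD error: delta = r(t) - V(t)/tau_r + dV/dt."""
    dv_dt = (v_now - v_prev) / dt
    return reward - v_now / tau_r + dv_dt

# A linear critic w would then be nudged along its feature vector by this error,
# e.g. w += learning_rate * delta * features(state); the actor receives the same
# neuromodulator-like signal to bias action choice.
```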

  1. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.

    PubMed

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-04-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.

  2. Place preference and vocal learning rely on distinct reinforcers in songbirds.

    PubMed

    Murdoch, Don; Chen, Ruidong; Goldberg, Jesse H

    2018-04-30

    In reinforcement learning (RL) agents are typically tasked with maximizing a single objective function such as reward. But it remains poorly understood how agents might pursue distinct objectives at once. In machines, multiobjective RL can be achieved by dividing a single agent into multiple sub-agents, each of which is shaped by agent-specific reinforcement, but it remains unknown if animals adopt this strategy. Here we use songbirds to test if navigation and singing, two behaviors with distinct objectives, can be differentially reinforced. We demonstrate that strobe flashes aversively condition place preference but not song syllables. Brief noise bursts aversively condition song syllables but positively reinforce place preference. Thus distinct behavior-generating systems, or agencies, within a single animal can be shaped by correspondingly distinct reinforcement signals. Our findings suggest that spatially segregated vocal circuits can solve a credit assignment problem associated with multiobjective learning.

  3. Dopamine-Dependent Reinforcement of Motor Skill Learning: Evidence from Gilles de la Tourette Syndrome

    ERIC Educational Resources Information Center

    Palminteri, Stefano; Lebreton, Mael; Worbe, Yulia; Hartmann, Andreas; Lehericy, Stephane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-01-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only…

  4. A knowledge-base generating hierarchical fuzzy-neural controller.

    PubMed

    Kandadai, R M; Tien, J M

    1997-01-01

    We present an innovative fuzzy-neural architecture that is able to automatically generate a knowledge base, in an extractable form, for use in hierarchical knowledge-based controllers. The knowledge base is in the form of a linguistic rule base appropriate for a fuzzy inference system. First, we modify Berenji and Khedkar's (1992) GARIC architecture to enable it to automatically generate a knowledge base; a pseudosupervised learning scheme using reinforcement learning and error backpropagation is employed. Next, we further extend this architecture to a hierarchical controller that is able to generate its own knowledge base. Example applications are provided to underscore its viability.

  5. Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators.

    PubMed

    Yang, Qinmin; Jagannathan, Sarangapani

    2012-04-01

    In this paper, reinforcement learning state- and output-feedback-based adaptive critic controller designs are proposed by using online approximators (OLAs) for general multi-input and multi-output affine unknown nonlinear discrete-time systems in the presence of bounded disturbances. The proposed controller design has two entities: an action network that is designed to produce an optimal control signal, and a critic network that evaluates the performance of the action network. The critic estimates the cost-to-go function, which is tuned online using recursive equations derived from heuristic dynamic programming. Here, neural networks (NNs) are used for both the action and critic networks, although any OLAs, such as radial basis functions, splines, or fuzzy logic, can be utilized. For the output-feedback counterpart, an additional NN is designated as the observer to estimate the unavailable system states, and thus a separation principle is not required. The NN weight tuning laws for the controller schemes are also derived while ensuring uniform ultimate boundedness of the closed-loop system using Lyapunov theory. Finally, the effectiveness of the two controllers is tested in simulation on a pendulum balancing system and a two-link robotic arm system.
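
    As a rough illustration of the action/critic arrangement described above, the sketch below runs a heuristic-dynamic-programming-style loop with two linear networks on a toy plant; the plant, cost function, network sizes, and the finite-difference actor adjustment are assumptions, not the paper's NN designs or weight tuning laws.

```python
import numpy as np

rng = np.random.default_rng(2)
Wc = rng.normal(scale=0.1, size=3)      # linear critic over [x1, x2, bias]
Wa = rng.normal(scale=0.1, size=3)      # linear action network
gamma, lr_c, lr_a, eps = 0.9, 0.05, 0.01, 1e-3

def plant(x, u):
    # Toy affine discrete-time system standing in for the unknown plant.
    return np.array([x[1], 0.8 * x[0] + 0.5 * u, 1.0])

x = np.array([0.5, -0.2, 1.0])
for _ in range(500):
    u = float(np.tanh(Wa @ x))
    x_next = plant(x, u)
    cost = x[0] ** 2 + x[1] ** 2 + 0.1 * u ** 2
    td = cost + gamma * (Wc @ x_next) - (Wc @ x)       # Bellman residual
    Wc += lr_c * td * x                                # critic update
    # Crude actor adjustment: finite-difference surrogate for the model-based
    # gradient through the critic used in heuristic dynamic programming.
    u_alt = float(np.tanh((Wa + eps * x) @ x))
    j_alt = Wc @ plant(x, u_alt)
    Wa -= lr_a * (j_alt - Wc @ x_next) / eps * x
    x = x_next
```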

  6. Evolution with Reinforcement Learning in Negotiation

    PubMed Central

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and makes more robust predictions in the long term than in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions when the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in terms of payoff, fairness, and stability than their counterparts using a classic evolutionary algorithm. PMID:25048108

  7. Evolution with reinforcement learning in negotiation.

    PubMed

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and makes more robust predictions in the long term than in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions when the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in terms of payoff, fairness, and stability than their counterparts using a classic evolutionary algorithm.

  8. Stress attenuates the flexible updating of aversive value

    PubMed Central

    Raio, Candace M.; Hartley, Catherine A.; Orederu, Temidayo A.; Li, Jian; Phelps, Elizabeth A.

    2017-01-01

    In a dynamic environment, sources of threat or safety can unexpectedly change, requiring the flexible updating of stimulus−outcome associations that promote adaptive behavior. However, aversive contexts in which we are required to update predictions of threat are often marked by stress. Acute stress is thought to reduce behavioral flexibility, yet its influence on the modulation of aversive value has not been well characterized. Given that stress exposure is a prominent risk factor for anxiety and trauma-related disorders marked by persistent, inflexible responses to threat, here we examined how acute stress affects the flexible updating of threat responses. Participants completed an aversive learning task, in which one stimulus was probabilistically associated with an electric shock, while the other stimulus signaled safety. A day later, participants underwent an acute stress or control manipulation before completing a reversal learning task during which the original stimulus−outcome contingencies switched. Skin conductance and neuroendocrine responses provided indices of sympathetic arousal and stress responses, respectively. Despite equivalent initial learning, stressed participants showed marked impairments in reversal learning relative to controls. Additionally, reversal learning deficits across participants were related to heightened levels of alpha-amylase, a marker of noradrenergic activity. Finally, fitting arousal data to a computational reinforcement learning model revealed that stress-induced reversal learning deficits emerged from stress-specific changes in the weight assigned to prediction error signals, disrupting the adaptive adjustment of learning rates. Our findings provide insight into how stress renders individuals less sensitive to changes in aversive reinforcement and have implications for understanding clinical conditions marked by stress-related psychopathology. PMID:28973957

  9. Integral reinforcement learning for continuous-time input-affine nonlinear systems with simultaneous invariant explorations.

    PubMed

    Lee, Jae Young; Park, Jin Bae; Choi, Yoon Ho

    2015-05-01

    This paper focuses on a class of reinforcement learning (RL) algorithms, named integral RL (I-RL), that solve continuous-time (CT) nonlinear optimal control problems with input-affine system dynamics. First, we extend the concepts of exploration, integral temporal difference, and invariant admissibility to the target CT nonlinear system that is governed by a control policy plus a probing signal called an exploration. Then, we show input-to-state stability (ISS) and invariant admissibility of the closed-loop systems with the policies generated by the integral policy iteration (I-PI) or invariantly admissible PI (IA-PI) methods. Based on these, three online I-RL algorithms named explorized I-PI and integral Q-learning I, II are proposed, all of which generate the same convergent sequences as I-PI and IA-PI under the required excitation condition on the exploration. All the proposed methods are partially or completely model free, and can simultaneously explore the state space in a stable manner during the online learning processes. ISS, invariant admissibility, and convergence properties of the proposed methods are also investigated, and, related to these, we show the design principles of the exploration for safe learning. Neural-network-based implementation methods for the proposed schemes are also presented in this paper. Finally, several numerical simulations are carried out to verify the effectiveness of the proposed methods.
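
    The integral temporal difference at the heart of I-RL compares the value at the start of an interval with the running cost accumulated over that interval plus the value at its end. A discretized sketch is given below; the trapezoidal quadrature and the placeholder value approximator are assumptions for illustration, not the paper's algorithms.

```python
import numpy as np

def integral_td_error(value_fn, states, costs, dt):
    """Accumulated running cost over an interval plus end value, minus start value."""
    accumulated_cost = np.trapz(costs, dx=dt)
    return accumulated_cost + value_fn(states[-1]) - value_fn(states[0])

V = lambda x: float(np.array([1.0, 0.5]) @ x)      # placeholder linear value approximator
states = [np.array([1.0, 0.0]), np.array([0.8, 0.1]), np.array([0.6, 0.1])]
costs = [float(s @ s) for s in states]             # quadratic running cost samples
error = integral_td_error(V, states, costs, dt=0.05)
```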

  10. A learning theory account of depression.

    PubMed

    Ramnerö, Jonas; Folke, Fredrik; Kanter, Jonathan W

    2015-06-11

    Learning theory provides a foundation for understanding and deriving treatment principles for impacting a spectrum of functional processes relevant to the construct of depression. While behavioral interventions have been commonplace in the cognitive behavioral tradition, most often conceptualized within a cognitive theoretical framework, recent years have seen renewed interest in more purely behavioral models. These modern learning theory accounts of depression focus on the interchange between behavior and the environment, mainly in terms of lack of reinforcement, extinction of instrumental behavior, and excesses of aversive control, and include a conceptualization of relevant cognitive and emotional variables. These positions, drawn from extensive basic and applied research, cohere with biological theories on reduced reward learning and reward responsiveness and views of depression as a heterogeneous, complex set of disorders. Treatment techniques based on learning theory, often labeled Behavioral Activation (BA) focus on activating the individual in directions that increase contact with potential reinforcers, as defined ideographically with the client. BA is considered an empirically well-established treatment that generalizes well across diverse contexts and populations. The learning theory account is discussed in terms of being a parsimonious model and ground for treatments highly suitable for large scale dissemination. © 2015 Scandinavian Psychological Associations and John Wiley & Sons Ltd.

  11. The Effects of Partial Reinforcement in the Acquisition and Extinction of Recurrent Serial Patterns.

    ERIC Educational Resources Information Center

    Dockstader, Steven L.

    The purpose of these 2 experiments was to determine whether sequential response pattern behavior is affected by partial reinforcement in the same way as other behavior systems. The first experiment investigated the partial reinforcement extinction effects (PREE) in a sequential concept learning task where subjects were required to learn a…

  12. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning

    PubMed Central

    Balasubramani, Pragathi P.; Chakravarthy, V. Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A.

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk-based decision making in a modified reinforcement learning (RL) framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in the basal ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation, which accommodates both 5HT and DA, reconciles some of the diverse roles of 5HT, particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) risk-sensitive decision making, where 5HT controls risk assessment, (2) temporal reward prediction, where 5HT controls the time-scale of reward prediction, and (3) reward/punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG. PMID:24795614
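
    A hedged sketch of the kind of update the abstract describes is shown below: a temporal-difference (dopamine-like) error updates expected value, a risk prediction error updates a running risk estimate, and a serotonin-like parameter alpha weights risk in the utility used for choice. The exact utility form and constants are illustrative assumptions, not the published model equations.

```python
def risk_sensitive_update(q, risk, reward, alpha=0.5, lr_q=0.1, lr_risk=0.1):
    """One value/risk update; alpha plays the serotonin-like risk-weighting role."""
    delta = reward - q                      # value (dopamine-like) prediction error
    xi = delta ** 2 - risk                  # risk prediction error
    q += lr_q * delta
    risk += lr_risk * xi
    utility = q - alpha * (risk ** 0.5)     # risk-averse utility used for choice
    return q, risk, utility

q, risk = 0.0, 0.0
for r in [1.0, 0.0, 1.0, 1.0, 0.0]:         # illustrative reward sequence
    q, risk, utility = risk_sensitive_update(q, risk, r)
```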

  13. Microstimulation of the Human Substantia Nigra Alters Reinforcement Learning

    PubMed Central

    Ramayya, Ashwin G.; Misra, Amrit

    2014-01-01

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action–reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action–reward associations rather than stimulus–reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action–reward associations during reinforcement learning. PMID:24828643

  14. Learning Multirobot Hose Transportation and Deployment by Distributed Round-Robin Q-Learning.

    PubMed

    Fernandez-Gauna, Borja; Etxeberria-Agiriano, Ismael; Graña, Manuel

    2015-01-01

    Multi-Agent Reinforcement Learning (MARL) algorithms face two main difficulties: the curse of dimensionality, and environment non-stationarity due to the independent learning processes carried out by the agents concurrently. In this paper we formalize and prove the convergence of a Distributed Round Robin Q-learning (D-RR-QL) algorithm for cooperative systems. The computational complexity of this algorithm increases linearly with the number of agents. Moreover, it eliminates environment non-stationarity by carrying out round-robin scheduling of action selection and execution. This learning scheme allows the implementation of Modular State-Action Vetoes (MSAV) in cooperative multi-agent systems, which speeds up learning convergence in over-constrained systems by vetoing state-action pairs that lead to undesired termination states (UTS) in the relevant state-action subspace. Each agent's local state-action value function learning is an independent process, including the MSAV policies. Coordination of locally optimal policies to obtain the globally optimal joint policy is achieved by a greedy selection procedure using message passing. We show that D-RR-QL improves over state-of-the-art approaches, such as Distributed Q-Learning, Team Q-Learning and Coordinated Reinforcement Learning, in a paradigmatic Linked Multi-Component Robotic System (L-MCRS) control problem: the hose transportation task. L-MCRS are over-constrained systems with many UTS induced by the interaction of the passive linking element and the active mobile robots.
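
    A simplified sketch of the round-robin scheme with state-action vetoes is given below; the environment interface (env.reset, env.step), the veto bookkeeping, and all constants are assumptions made for illustration, not the formal D-RR-QL algorithm whose convergence the paper proves.

```python
import random
from collections import defaultdict

def d_rr_ql(env, n_agents, actions, episodes=100, lr=0.1, gamma=0.95, eps=0.1):
    """Round-robin multi-agent Q-learning with simple state-action vetoes."""
    Q = [defaultdict(float) for _ in range(n_agents)]   # one Q-table per agent
    vetoed = set()     # (state, action) pairs observed to cause undesired terminations
    for _ in range(episodes):
        state, done, turn = env.reset(), False, 0
        while not done:
            agent = turn % n_agents                      # round-robin schedule
            candidates = [a for a in actions if (state, a) not in vetoed] or list(actions)
            if random.random() < eps:
                action = random.choice(candidates)
            else:
                action = max(candidates, key=lambda a: Q[agent][(state, a)])
            next_state, reward, done, undesired = env.step(agent, action)
            if undesired:
                vetoed.add((state, action))              # modular state-action veto
            best_next = max(Q[agent][(next_state, a)] for a in actions)
            td = reward + gamma * best_next - Q[agent][(state, action)]
            Q[agent][(state, action)] += lr * td
            state, turn = next_state, turn + 1
    return Q
```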

  15. An Investigation of Ways to Reduce the Failure Rate of Student Pilots during Flying Training in the Royal Australian Air Force.

    DTIC Science & Technology

    1987-09-01

    Luthans (28) expanded the concept of learning as follows: 1. Learning involves a change, though not necessarily an improvement, in behaviour. Learning... that results in an unpleasant outcome is not likely to be repeated (36:244). Luthans and Kreitner (27) described the various forms of reinforcement as... four alternatives (defined previously on page 24 and taken from Luthans) of positive reinforcement, negative reinforcement, extinction and punishment

  16. Mastery Learning through Individualized Instruction: A Reinforcement Strategy

    ERIC Educational Resources Information Center

    Sagy, John; Ravi, R.; Ananthasayanam, R.

    2009-01-01

    The present study attempts to gauge the effect of individualized instructional methods as a reinforcement strategy for mastery learning. Among various individualized instructional methods, the study focuses on PIM (Programmed Instructional Method) and CAIM (Computer Assisted Instruction Method). Mastery learning is a process where students achieve…

  17. Working Memory and Reinforcement Schedule Jointly Determine Reinforcement Learning in Children: Potential Implications for Behavioral Parent Training

    PubMed Central

    Segers, Elien; Beckers, Tom; Geurts, Hilde; Claes, Laurence; Danckaerts, Marina; van der Oord, Saskia

    2018-01-01

    Introduction: Behavioral Parent Training (BPT) is often provided for childhood psychiatric disorders. These disorders have been shown to be associated with working memory impairments. BPT is based on operant learning principles, yet how operant principles shape behavior (through the partial reinforcement (PRF) extinction effect, i.e., greater resistance to extinction that is created when behavior is reinforced partially rather than continuously) and the potential role of working memory therein is scarcely studied in children. This study explored the PRF extinction effect and the role of working memory therein using experimental tasks in typically developing children. Methods: Ninety-seven children (age 6–10) completed a working memory task and an operant learning task, in which children acquired a response-sequence rule under either continuous or PRF (120 trials), followed by an extinction phase (80 trials). Data of 88 children were used for analysis. Results: The PRF extinction effect was confirmed: We observed slower acquisition and extinction in the PRF condition as compared to the continuous reinforcement (CRF) condition. Working memory was negatively related to acquisition but not extinction performance. Conclusion: Both reinforcement contingencies and working memory relate to acquisition performance. Potential implications for BPT are that decreasing working memory load may enhance the chance of optimally learning through reinforcement. PMID:29643822

  18. Credit assignment in movement-dependent reinforcement learning

    PubMed Central

    Boggess, Matthew J.; Crossley, Matthew J.; Parvin, Darius; Ivry, Richard B.; Taylor, Jordan A.

    2016-01-01

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants’ explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem. PMID:27247404
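
    The gating hypothesis can be summarized in a few lines: when a movement (execution) error is detected, the reward prediction error is attenuated before it updates the value of the chosen option. The sketch below is an illustration of that idea only; the gating factor and learning rate are assumptions, not fitted parameters from the study.

```python
def gated_value_update(value, reward, execution_error, lr=0.2, gate=0.1):
    """Attenuate the reward prediction error when the movement itself missed."""
    prediction_error = reward - value
    if execution_error:
        prediction_error *= gate   # attribute the missing reward to the movement, not the option
    return value + lr * prediction_error

v = 0.5
v = gated_value_update(v, reward=0.0, execution_error=True)    # small decrease in value
v = gated_value_update(v, reward=0.0, execution_error=False)   # larger decrease in value
```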

  19. Credit assignment in movement-dependent reinforcement learning.

    PubMed

    McDougle, Samuel D; Boggess, Matthew J; Crossley, Matthew J; Parvin, Darius; Ivry, Richard B; Taylor, Jordan A

    2016-06-14

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants' explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem.

  20. Reinforcement learning design-based adaptive tracking control with less learning parameters for nonlinear discrete-time MIMO systems.

    PubMed

    Liu, Yan-Jun; Tang, Li; Tong, Shaocheng; Chen, C L Philip; Li, Dong-Juan

    2015-01-01

    Based on the neural network (NN) approximator, an online reinforcement learning algorithm is proposed for a class of affine multiple-input and multiple-output (MIMO) nonlinear discrete-time systems with unknown functions and disturbances. In the design procedure, two networks are provided: one is an action network to generate an optimal control signal, and the other is a critic network to approximate the cost function. An optimal control signal and adaptation laws can be generated based on the two NNs. In previous approaches, the weights of the critic and action networks are updated based on the gradient descent rule and the estimates of the optimal weight vectors are directly adjusted in the design. Consequently, compared with the existing results, the main contributions of this paper are: 1) only two parameters need to be adjusted, and thus the number of adaptation laws is smaller than in previous results, and 2) the updating parameters do not depend on the number of subsystems for MIMO systems and the tuning rules are replaced by adjusting the norms on optimal weight vectors in both the action and critic networks. It is proven that the tracking errors, the adaptation laws, and the control inputs are uniformly bounded using the Lyapunov analysis method. Simulation examples are employed to illustrate the effectiveness of the proposed algorithm.

  1. Oxytocin attenuates trust as a subset of more general reinforcement learning, with altered reward circuit functional connectivity in males.

    PubMed

    Ide, Jaime S; Nedic, Sanja; Wong, Kin F; Strey, Shmuel L; Lawson, Elizabeth A; Dickerson, Bradford C; Wald, Lawrence L; La Camera, Giancarlo; Mujica-Parodi, Lilianne R

    2018-07-01

    Oxytocin (OT) is an endogenous neuropeptide that, while originally thought to promote trust, has more recently been found to be context-dependent. Here we extend experimental paradigms previously restricted to de novo decision-to-trust, to a more realistic environment in which social relationships evolve in response to iterative feedback over twenty interactions. In a randomized, double blind, placebo-controlled within-subject/crossover experiment of human adult males, we investigated the effects of a single dose of intranasal OT (40 IU) on Bayesian expectation updating and reinforcement learning within a social context, with associated brain circuit dynamics. Subjects participated in a neuroeconomic task (Iterative Trust Game) designed to probe iterative social learning while their brains were scanned using ultra-high field (7T) fMRI. We modeled each subject's behavior using Bayesian updating of belief-states ("willingness to trust") as well as canonical measures of reinforcement learning (learning rate, inverse temperature). Behavioral trajectories were then used as regressors within fMRI activation and connectivity analyses to identify corresponding brain network functionality affected by OT. Behaviorally, OT reduced feedback learning, without bias with respect to positive versus negative reward. Neurobiologically, reduced learning under OT was associated with muted communication between three key nodes within the reward circuit: the orbitofrontal cortex, amygdala, and lateral (limbic) habenula. Our data suggest that OT, rather than inspiring feelings of generosity, instead attenuates the brain's encoding of prediction error and therefore its ability to modulate pre-existing beliefs. This effect may underlie OT's putative role in promoting what has typically been reported as 'unjustified trust' in the face of information that suggests likely betrayal, while also resolving apparent contradictions with regard to OT's context-dependent behavioral effects. Copyright © 2018 Elsevier Inc. All rights reserved.
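
    The canonical reinforcement-learning measures named here (learning rate, inverse temperature) correspond to a delta-rule value update combined with a softmax choice rule. The sketch below illustrates those two parameters in isolation; the payoff probabilities and task structure are assumptions, and the authors' Bayesian belief-updating model is not reproduced.

```python
import numpy as np

def softmax_choice(values, beta, rng):
    """Softmax over option values; beta is the inverse temperature."""
    p = np.exp(beta * values - np.max(beta * values))
    p /= p.sum()
    return int(rng.choice(len(values), p=p))

def simulate(alpha=0.3, beta=3.0, trials=20, seed=0):
    """Delta-rule learner with softmax choice; payoffs are illustrative only."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                                   # e.g. share vs. keep
    for _ in range(trials):
        a = softmax_choice(q, beta, rng)
        reward = float(rng.random() < (0.7 if a == 0 else 0.4))
        q[a] += alpha * (reward - q[a])               # learning-rate-weighted update
    return q

final_values = simulate()
```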

  2. How partial reinforcement of food cues affects the extinction and reacquisition of appetitive responses. A new model for dieting success?

    PubMed

    van den Akker, Karolien; Havermans, Remco C; Bouton, Mark E; Jansen, Anita

    2014-10-01

    Animals and humans can easily learn to associate an initially neutral cue with food intake through classical conditioning, but extinction of learned appetitive responses can be more difficult. Intermittent or partial reinforcement of food cues causes especially persistent behaviour in animals: after exposure to such learning schedules, the decline in responding that occurs during extinction is slow. After extinction, increases in responding with renewed reinforcement of food cues (reacquisition) might be less rapid after acquisition with partial reinforcement. In humans, it may be that the eating behaviour of some individuals resembles partial reinforcement schedules to a greater extent, possibly affecting dieting success by interacting with extinction and reacquisition. Furthermore, impulsivity has been associated with less successful dieting, and this association might be explained by impulsivity affecting the learning and extinction of appetitive responses. In the present two studies, the effects of different reinforcement schedules and impulsivity on the acquisition, extinction, and reacquisition of appetitive responses were investigated in a conditioning paradigm involving food rewards in healthy humans. Overall, the results indicate both partial reinforcement schedules and, possibly, impulsivity to be associated with worse extinction performance. A new model of dieting success is proposed: learning histories and, perhaps, certain personality traits (impulsivity) can interfere with the extinction and reacquisition of appetitive responses to food cues and they may be causally related to unsuccessful dieting. Copyright © 2014 Elsevier Ltd. All rights reserved.

  3. Does arousal interfere with operant conditioning of spike-wave discharges in genetic epileptic rats?

    PubMed

    Osterhagen, Lasse; Breteler, Marinus; van Luijtelaar, Gilles

    2010-06-01

    One of the ways in which brain-computer interfaces can be used is neurofeedback (NF). Subjects use their brain activation to control an external device, and with this technique it is also possible to learn to control aspects of brain activity by operant conditioning. Beneficial effects of NF training on seizure occurrence have been described in epileptic patients. Little research has been done on differentiating NF effectiveness by type of epilepsy, particularly whether idiopathic generalized seizures are susceptible to NF. In this experiment, seizures that manifest themselves as spike-wave discharges (SWDs) in the EEG were reinforced during 10 sessions in 6 rats of the WAG/Rij strain, an animal model for absence epilepsy. EEGs were recorded before and after the training sessions. Reinforcing SWDs led to decreased SWD occurrence during training; however, the changes during training did not persist in the post-training sessions. Because behavioural states are known to have an influence on the occurrence of SWDs, it is proposed that the reinforcement situation increased arousal, which resulted in fewer SWDs. Additional tests supported this hypothesis. The outcomes have implications for the possibility of training SWDs with operant learning techniques. Copyright (c) 2010 Elsevier B.V. All rights reserved.

  4. Autonomous learning based on cost assumptions: theoretical studies and experiments in robot control.

    PubMed

    Ribeiro, C H; Hemerly, E M

    2000-02-01

    Autonomous learning techniques are based on experience acquisition. In most realistic applications, experience is time-consuming: it implies sensor reading, actuator control and algorithmic update, constrained by the learning system dynamics. The crudeness of the information upon which classical learning algorithms operate makes such problems too difficult and unrealistic. Nonetheless, additional information for facilitating the learning process ideally should be embedded in such a way that the structural, well-studied characteristics of these fundamental algorithms are maintained. We investigate in this article a more general formulation of the Q-learning method that allows for a spreading of information derived from single updates towards a neighbourhood of the instantly visited state and converges to optimality. We show how this new formulation can be used as a mechanism to safely embed prior knowledge about the structure of the state space, and demonstrate it in a modified implementation of a reinforcement learning algorithm in a real robot navigation task.
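
    A minimal sketch of the idea described above, not the authors' implementation: a tabular Q-learning update whose temporal-difference error is also applied, with reduced weight, to a neighbourhood of the visited state. The corridor task, the Gaussian spreading kernel, and all parameter values are illustrative assumptions.

        import numpy as np

        n_states, n_actions = 10, 2          # a 1-D corridor; actions: 0 = left, 1 = right
        alpha, gamma, epsilon, sigma = 0.2, 0.95, 0.1, 1.0
        Q = np.zeros((n_states, n_actions))
        rng = np.random.default_rng(1)

        def step(s, a):
            s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
            return s2, (1.0 if s2 == n_states - 1 else 0.0)   # reward at the far end

        for episode in range(200):
            s = 0
            while s != n_states - 1:
                greedy = np.flatnonzero(Q[s] == Q[s].max())   # random tie-breaking
                a = rng.integers(n_actions) if rng.random() < epsilon else int(rng.choice(greedy))
                s2, r = step(s, a)
                td_error = r + gamma * Q[s2].max() - Q[s, a]
                # spread the single temporal-difference update towards neighbouring
                # states, weighted by an assumed similarity (distance along the corridor)
                for n in range(n_states):
                    Q[n, a] += alpha * np.exp(-(n - s) ** 2 / (2 * sigma ** 2)) * td_error
                s = s2

        print(np.round(Q, 2))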

  5. Developing PFC Representations Using Reinforcement Learning

    ERIC Educational Resources Information Center

    Reynolds, Jeremy R.; O'Reilly, Randall C.

    2009-01-01

    From both functional and biological considerations, it is widely believed that action production, planning, and goal-oriented behaviors supported by the frontal cortex are organized hierarchically [Fuster (1991); Koechlin, E., Ody, C., & Kouneiher, F. (2003). "Neuroscience: The architecture of cognitive control in the human prefrontal cortex."…

  6. Regulating recognition decisions through incremental reinforcement learning.

    PubMed

    Han, Sanghoon; Dobbins, Ian G

    2009-06-01

    Does incremental reinforcement learning influence recognition memory judgments? We examined this question by subtly altering the relative validity or availability of feedback in order to differentially reinforce old or new recognition judgments. Experiment 1 probabilistically and incorrectly indicated that either misses or false alarms were correct in the context of feedback that was otherwise accurate. Experiment 2 selectively withheld feedback for either misses or false alarms in the context of feedback that was otherwise present. Both manipulations caused prominent shifts of recognition memory decision criteria that remained for considerable periods even after feedback had been altogether removed. Overall, these data demonstrate that incremental reinforcement-learning mechanisms influence the degree of caution subjects exercise when evaluating explicit memories.

  7. Infant Contingency Learning in Different Cultural Contexts

    ERIC Educational Resources Information Center

    Graf, Frauke; Lamm, Bettina; Goertz, Claudia; Kolling, Thorsten; Freitag, Claudia; Spangler, Sibylle; Fassbender, Ina; Teubert, Manuel; Vierhaus, Marc; Keller, Heidi; Lohaus, Arnold; Schwarzer, Gudrun; Knopf, Monika

    2012-01-01

    Three-month-old Cameroonian Nso farmer and German middle-class infants were compared regarding learning and retention in a computerized mobile task. Infants achieving a preset learning criterion during reinforcement were tested for immediate and long-term retention measured in terms of an increased response rate after reinforcement and after a…

  8. Adaptive Educational Software by Applying Reinforcement Learning

    ERIC Educational Resources Information Center

    Bennane, Abdellah

    2013-01-01

    This paper addresses the introduction of intelligence into teaching software. During software development, learning techniques are used to adapt the teaching software to the characteristics of the student. Generally, artificial intelligence techniques such as reinforcement learning and Bayesian networks are used in order to adapt…

  9. Processing speed enhances model-based over model-free reinforcement learning in the presence of high working memory functioning

    PubMed Central

    Schad, Daniel J.; Jünger, Elisabeth; Sebold, Miriam; Garbusow, Maria; Bernhardt, Nadine; Javadi, Amir-Homayoun; Zimmermann, Ulrich S.; Smolka, Michael N.; Heinz, Andreas; Rapp, Michael A.; Huys, Quentin J. M.

    2014-01-01

    Theories of decision-making and its neural substrates have long assumed the existence of two distinct and competing valuation systems, variously described as goal-directed vs. habitual, or, more recently and based on statistical arguments, as model-free vs. model-based reinforcement-learning. Though both have been shown to control choices, the cognitive abilities associated with these systems are under ongoing investigation. Here we examine the link to cognitive abilities, and find that individual differences in processing speed covary with a shift from model-free to model-based choice control in the presence of above-average working memory function. This suggests shared cognitive and neural processes; provides a bridge between literatures on intelligence and valuation; and may guide the development of process models of different valuation components. Furthermore, it provides a rationale for individual differences in the tendency to deploy valuation systems, which may be important for understanding the manifold neuropsychiatric diseases associated with malfunctions of valuation. PMID:25566131

  10. Basolateral amygdala lesions and sensitivity to reinforcer magnitude in concurrent chains schedules.

    PubMed

    Helms, Christa M; Mitchell, Suzanne H

    2008-08-22

    Previous studies show that the basolateral amygdala (BLA) is required for behavior to adjust when the value of a reinforcer decreases after satiation or pairing with gastric distress. This study evaluated the effect of pre- or post-training excitotoxic lesions of the BLA on changes in preference with another type of contingency change, reinforcer magnitude reversal. Rats were trained to press left and right levers during a variable-interval choice phase for 50 microl or 150 microl sucrose delivered to consistent locations after a 16-s delay. Tones were presented during the first and last 2s of the delay to reinforcement. The tone frequency predicted the magnitude of sucrose reinforcement in baseline conditions. All groups acquired stable preference for the lever on the large (150 microl) reinforcer side. However, nose poking during the delay to large reinforcement was highly accurate (i.e., to the reinforced side) for all groups except the rats with BLA lesions induced before training, suggesting impaired control of behavior by the tone. After the acquisition of stable preference, the locations of the reinforcer magnitudes were unpredictably reversed for a single session. Pre-training lesions blunted changes in preference when the reinforcer magnitudes were reversed. Lesions induced after stable preference was acquired, but prior to reversal, did not disrupt changes in preference. The data suggest that the BLA contributes to the adaptation of choice behavior following changes in reinforcer magnitude. Impaired learning about the tone-reinforcer magnitude relationships may have disrupted discrimination of the reinforcer magnitude reversal.

  11. A spiking neural network model of model-free reinforcement learning with high-dimensional sensory input and perceptual ambiguity.

    PubMed

    Nakano, Takashi; Otsuka, Makoto; Yoshimoto, Junichiro; Doya, Kenji

    2015-01-01

    A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations which are noisy, or occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problem is formally known as partially observable reinforcement learning (PORL) problems. It provides a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing which can only be discovered through such a top-down approach.

  12. Punishment insensitivity and impaired reinforcement learning in preschoolers.

    PubMed

    Briggs-Gowan, Margaret J; Nichols, Sara R; Voss, Joel; Zobel, Elvira; Carter, Alice S; McCarthy, Kimberly J; Pine, Daniel S; Blair, James; Wakschlag, Lauren S

    2014-01-01

    Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a developmental vulnerability to psychopathic traits. One hundred and fifty-seven preschoolers (mean age 4.7 ± 0.8 years) participated in a substudy that was embedded within a larger project. Children completed the 'Stars-in-Jars' task, which involved learning to select rewarded jars and avoid punished jars. Maternal report of responsiveness to socialization was assessed with the Punishment Insensitivity and Low Concern for Others scales of the Multidimensional Assessment of Preschool Disruptive Behavior (MAP-DB). Punishment Insensitivity, but not Low Concern for Others, was significantly associated with reinforcement learning in multivariate models that accounted for age and sex. Specifically, higher Punishment Insensitivity was associated with significantly lower overall performance and more errors on punished trials ('passive avoidance'). Impairments in reinforcement learning manifest in preschoolers who are high in maternal ratings of Punishment Insensitivity. If replicated, these findings may help to pinpoint the neurodevelopmental antecedents of psychopathic tendencies and suggest novel intervention targets beginning in early childhood. © 2013 The Authors. Journal of Child Psychology and Psychiatry © 2013 Association for Child and Adolescent Mental Health.

  13. A Spiking Neural Network Model of Model-Free Reinforcement Learning with High-Dimensional Sensory Input and Perceptual Ambiguity

    PubMed Central

    Nakano, Takashi; Otsuka, Makoto; Yoshimoto, Junichiro; Doya, Kenji

    2015-01-01

    A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations which are noisy, or occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problem is formally known as partially observable reinforcement learning (PORL) problems. It provides a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing which can only be discovered through such a top-down approach. PMID:25734662

  14. Rats bred for helplessness exhibit positive reinforcement learning deficits which are not alleviated by an antidepressant dose of the MAO-B inhibitor deprenyl.

    PubMed

    Schulz, Daniela; Henn, Fritz A; Petri, David; Huston, Joseph P

    2016-08-04

    Principles of negative reinforcement learning may play a critical role in the etiology and treatment of depression. We examined the integrity of positive reinforcement learning in congenitally helpless (cH) rats, an animal model of depression, using a random ratio schedule and a devaluation-extinction procedure. Furthermore, we tested whether an antidepressant dose of the monoamine oxidase (MAO)-B inhibitor deprenyl would reverse any deficits in positive reinforcement learning. We found that cH rats (n=9) were impaired in the acquisition of even simple operant contingencies, such as a fixed interval (FI) 20 schedule. cH rats exhibited no apparent deficits in appetite or reward sensitivity. They reacted to the devaluation of food in a manner consistent with a dose-response relationship. Reinforcer motivation as assessed by lever pressing across sessions with progressively decreasing reward probabilities was highest in congenitally non-helpless (cNH, n=10) rats as long as the reward probabilities remained relatively high. cNH compared to wild-type (n=10) rats were also more resistant to extinction across sessions. Compared to saline (n=5), deprenyl (n=5) reduced the duration of immobility of cH rats in the forced swimming test, indicative of antidepressant effects, but did not restore any deficits in the acquisition of a FI 20 schedule. We conclude that positive reinforcement learning was impaired in rats bred for helplessness, possibly due to motivational impairments but not deficits in reward sensitivity, and that deprenyl exerted antidepressant effects but did not reverse the deficits in positive reinforcement learning. Copyright © 2016 IBRO. Published by Elsevier Ltd. All rights reserved.

  15. Antipsychotic dose modulates behavioral and neural responses to feedback during reinforcement learning in schizophrenia.

    PubMed

    Insel, Catherine; Reinen, Jenna; Weber, Jochen; Wager, Tor D; Jarskog, L Fredrik; Shohamy, Daphna; Smith, Edward E

    2014-03-01

    Schizophrenia is characterized by an abnormal dopamine system, and dopamine blockade is the primary mechanism of antipsychotic treatment. Consistent with the known role of dopamine in reward processing, prior research has demonstrated that patients with schizophrenia exhibit impairments in reward-based learning. However, it remains unknown how treatment with antipsychotic medication impacts the behavioral and neural signatures of reinforcement learning in schizophrenia. The goal of this study was to examine whether antipsychotic medication modulates behavioral and neural responses to prediction error coding during reinforcement learning. Patients with schizophrenia completed a reinforcement learning task while undergoing functional magnetic resonance imaging. The task consisted of two separate conditions in which participants accumulated monetary gain or avoided monetary loss. Behavioral results indicated that antipsychotic medication dose was associated with altered behavioral approaches to learning, such that patients taking higher doses of medication showed increased sensitivity to negative reinforcement. Higher doses of antipsychotic medication were also associated with higher learning rates (LRs), suggesting that medication enhanced sensitivity to trial-by-trial feedback. Neuroimaging data demonstrated that antipsychotic dose was related to differences in neural signatures of feedback prediction error during the loss condition. Specifically, patients taking higher doses of medication showed attenuated prediction error responses in the striatum and the medial prefrontal cortex. These findings indicate that antipsychotic medication treatment may influence motivational processes in patients with schizophrenia.

  16. Beta-adrenergic receptors support attention to extinction learning that occurs in the absence, but not the presence, of a context change

    PubMed Central

    André, Marion Agnès Emma; Wolf, Oliver T.; Manahan-Vaughan, Denise

    2015-01-01

    The noradrenergic (NA)-system is an important regulator of cognitive function. It contributes to extinction learning (EL), and in disorders where EL is impaired NA-dysfunction has been postulated. We explored whether NA acting on beta-adrenergic-receptors (β-AR), regulates EL that depends on context, but is not fear-associated. We assessed behavior in an “AAA” or “ABA” paradigm: rats were trained for 3 days in a T-maze (context-A) to learn that a reward is consistently found in the goal arm, despite low reward probability. This was followed on day 4 by EL (unrewarded), whereby in the ABA-paradigm, EL was reinforced by a context change (B), and in the AAA-paradigm, no context change occurred. On day 5, re-exposure to the A-context (unrewarded) occurred. Typically, in control “AAA” animals EL occurred on day 4 that progressed further on day 5. In control “ABA” animals, EL also occurred on day 4, followed by renewal of the previously learned (A) behavior on day 5, that was succeeded (on day 5) by extinction of this behavior, as the animals realised that no food reward would be given. Treatment with the β-AR-antagonist, propranolol, prior to EL on day 4, impaired EL in the AAA-paradigm. In the “ABA” paradigm, antagonist treatment on day 4, had no effect on extinction that was reinforced by a context change (B). Furthermore, β-AR-antagonism prior to renewal testing (on day 5) in the ABA-paradigm, resulted in normal renewal behavior, although subsequent extinction of responses during day 5 was prevented by the antagonist. Thus, under both treatment conditions, β-AR-antagonism prevented extinction of the behavior learned in the “A” context. β-AR-blockade during an overt context change did not prevent EL, whereas β-AR were required for EL in an unchanging context. These data suggest that β-AR may support EL by reinforcing attention towards relevant changes in the previously learned experience, and that this process supports extinction learning in constant-context conditions. PMID:26074793

  17. How we learn to make decisions: rapid propagation of reinforcement learning prediction errors in humans.

    PubMed

    Krigolson, Olav E; Hassall, Cameron D; Handy, Todd C

    2014-03-01

    Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors-discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833-1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129-141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769-776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679-709, 2002]. Here, we used the brain ERP technique to demonstrate that not only do rewards elicit a neural response akin to a prediction error but also that this signal rapidly diminished and propagated to the time of choice presentation with learning. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component that has a similar timing and topography as the feedback error-related negativity that increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward prediction errors and the changes in amplitude of these prediction errors at the time of choice presentation and reward delivery. Our results provide further support that the computations that underlie human learning and decision-making follow reinforcement learning principles.
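
    The temporal shift described above can be illustrated with a toy delta-rule simulation (illustrative only; not the authors' model or parameters): as the value of a rewarded option is learned, the prediction error at feedback shrinks while the reward expectation available at the time of choice presentation grows.

        import numpy as np

        alpha, n_trials = 0.2, 40
        value = 0.0                        # learned value of the better option
        reward = 1.0                       # deterministic reward, for simplicity

        for t in range(n_trials):
            feedback_pe = reward - value   # prediction error at reward delivery
            value += alpha * feedback_pe   # value now expresses expectation at choice time
            if t % 10 == 0:
                print(f"trial {t:2d}: PE at feedback = {feedback_pe:.3f}, "
                      f"expectation at choice = {value:.3f}")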

  18. Developing PFC representations using reinforcement learning.

    PubMed

    Reynolds, Jeremy R; O'Reilly, Randall C

    2009-12-01

    From both functional and biological considerations, it is widely believed that action production, planning, and goal-oriented behaviors supported by the frontal cortex are organized hierarchically [Fuster (1991); Koechlin, E., Ody, C., & Kouneiher, F. (2003). Neuroscience: The architecture of cognitive control in the human prefrontal cortex. Science, 424, 1181-1184; Miller, G. A., Galanter, E., & Pribram, K. H. (1960). Plans and the structure of behavior. New York: Holt]. However, the nature of the different levels of the hierarchy remains unclear, and little attention has been paid to the origins of such a hierarchy. We address these issues through biologically-inspired computational models that develop representations through reinforcement learning. We explore several different factors in these models that might plausibly give rise to a hierarchical organization of representations within the PFC, including an initial connectivity hierarchy within PFC, a hierarchical set of connections between PFC and subcortical structures controlling it, and differential synaptic plasticity schedules. Simulation results indicate that architectural constraints contribute to the segregation of different types of representations, and that this segregation facilitates learning. These findings are consistent with the idea that there is a functional hierarchy in PFC, as captured in our earlier computational models of PFC function and a growing body of empirical data.

  19. Microstimulation of the human substantia nigra alters reinforcement learning.

    PubMed

    Ramayya, Ashwin G; Misra, Amrit; Baltuch, Gordon H; Kahana, Michael J

    2014-05-14

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action-reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action-reward associations rather than stimulus-reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action-reward associations during reinforcement learning. Copyright © 2014 the authors 0270-6474/14/346887-09$15.00/0.

  20. Human reinforcement learning subdivides structured action spaces by learning effector-specific values

    PubMed Central

    Gershman, Samuel J.; Pesaran, Bijan; Daw, Nathaniel D.

    2009-01-01

    Humans and animals are endowed with a large number of effectors. Although this enables great behavioral flexibility, it presents an equally formidable reinforcement learning problem of discovering which actions are most valuable, due to the high dimensionality of the action space. An unresolved question is how neural systems for reinforcement learning – such as prediction error signals for action valuation associated with dopamine and the striatum – can cope with this “curse of dimensionality.” We propose a reinforcement learning framework that allows for learned action valuations to be decomposed into effector-specific components when appropriate to a task, and test it by studying to what extent human behavior and BOLD activity can exploit such a decomposition in a multieffector choice task. Subjects made simultaneous decisions with their left and right hands and received separate reward feedback for each hand movement. We found that choice behavior was better described by a learning model that decomposed the values of bimanual movements into separate values for each effector, rather than a traditional model that treated the bimanual actions as unitary with a single value. A decomposition of value into effector-specific components was also observed in value-related BOLD signaling, in the form of lateralized biases in striatal correlates of prediction error and anticipatory value correlates in the intraparietal sulcus. These results suggest that the human brain can use decomposed value representations to “divide and conquer” reinforcement learning over high-dimensional action spaces. PMID:19864565

  1. Human reinforcement learning subdivides structured action spaces by learning effector-specific values.

    PubMed

    Gershman, Samuel J; Pesaran, Bijan; Daw, Nathaniel D

    2009-10-28

    Humans and animals are endowed with a large number of effectors. Although this enables great behavioral flexibility, it presents an equally formidable reinforcement learning problem of discovering which actions are most valuable because of the high dimensionality of the action space. An unresolved question is how neural systems for reinforcement learning-such as prediction error signals for action valuation associated with dopamine and the striatum-can cope with this "curse of dimensionality." We propose a reinforcement learning framework that allows for learned action valuations to be decomposed into effector-specific components when appropriate to a task, and test it by studying to what extent human behavior and blood oxygen level-dependent (BOLD) activity can exploit such a decomposition in a multieffector choice task. Subjects made simultaneous decisions with their left and right hands and received separate reward feedback for each hand movement. We found that choice behavior was better described by a learning model that decomposed the values of bimanual movements into separate values for each effector, rather than a traditional model that treated the bimanual actions as unitary with a single value. A decomposition of value into effector-specific components was also observed in value-related BOLD signaling, in the form of lateralized biases in striatal correlates of prediction error and anticipatory value correlates in the intraparietal sulcus. These results suggest that the human brain can use decomposed value representations to "divide and conquer" reinforcement learning over high-dimensional action spaces.
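
    The model comparison described in this record can be sketched as follows (a toy illustration, not the study's analysis; the reward probabilities, number of trials, and learning rate are assumptions): a "decomposed" learner maintains one value per hand and updates each from its own feedback, while a "unitary" learner assigns a single value to each bimanual combination.

        import numpy as np

        rng = np.random.default_rng(2)
        alpha = 0.3
        # hypothetical reward probabilities for two targets per hand
        p_reward = {"L": [0.8, 0.2], "R": [0.3, 0.7]}

        q_effector = {"L": np.zeros(2), "R": np.zeros(2)}   # decomposed: value per hand
        q_unitary = np.zeros((2, 2))                        # unitary: value per (L, R) pair

        for t in range(500):
            aL, aR = rng.integers(2), rng.integers(2)       # random bimanual action
            rL = float(rng.random() < p_reward["L"][aL])    # separate feedback per hand
            rR = float(rng.random() < p_reward["R"][aR])
            # decomposed learner: each effector's value learns from its own feedback
            q_effector["L"][aL] += alpha * (rL - q_effector["L"][aL])
            q_effector["R"][aR] += alpha * (rR - q_effector["R"][aR])
            # unitary learner: the combination is one action with the summed reward
            q_unitary[aL, aR] += alpha * ((rL + rR) - q_unitary[aL, aR])

        print("decomposed:", {k: np.round(v, 2) for k, v in q_effector.items()})
        print("unitary:")
        print(np.round(q_unitary, 2))

    Because the decomposed learner shares experience across all combinations involving the same hand, it needs fewer samples per combination, which is the sense in which decomposition helps "divide and conquer" a high-dimensional action space.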

  2. Neurocomputational mechanisms of prosocial learning and links to empathy.

    PubMed

    Lockwood, Patricia L; Apps, Matthew A J; Valton, Vincent; Viding, Essi; Roiser, Jonathan P

    2016-08-30

    Reinforcement learning theory powerfully characterizes how we learn to benefit ourselves. In this theory, prediction errors-the difference between a predicted and actual outcome of a choice-drive learning. However, we do not operate in a social vacuum. To behave prosocially we must learn the consequences of our actions for other people. Empathy, the ability to vicariously experience and understand the affect of others, is hypothesized to be a critical facilitator of prosocial behaviors, but the link between empathy and prosocial behavior is still unclear. During functional magnetic resonance imaging (fMRI) participants chose between different stimuli that were probabilistically associated with rewards for themselves (self), another person (prosocial), or no one (control). Using computational modeling, we show that people can learn to obtain rewards for others but do so more slowly than when learning to obtain rewards for themselves. fMRI revealed that activity in a posterior portion of the subgenual anterior cingulate cortex/basal forebrain (sgACC) drives learning only when we are acting in a prosocial context and signals a prosocial prediction error conforming to classical principles of reinforcement learning theory. However, there is also substantial variability in the neural and behavioral efficiency of prosocial learning, which is predicted by trait empathy. More empathic people learn more quickly when benefitting others, and their sgACC response is the most selective for prosocial learning. We thus reveal a computational mechanism driving prosocial learning in humans. This framework could provide insights into atypical prosocial behavior in those with disorders of social cognition.

  3. Impaired implicit learning and feedback processing after stroke.

    PubMed

    Lam, J M; Globas, C; Hosp, J A; Karnath, H-O; Wächter, T; Luft, A R

    2016-02-09

    The ability to learn is assumed to support successful recovery and rehabilitation therapy after stroke. Hence, learning impairments may reduce the recovery potential. Here, the hypothesis is tested that stroke survivors have deficits in feedback-driven implicit learning. Stroke survivors (n=30) and healthy age-matched control subjects (n=21) learned a probabilistic classification task with brain activation measured using functional magnetic resonance imaging in a subset of these individuals (17 stroke and 10 controls). Stroke subjects learned slower than controls to classify cues. After being rewarded with a smiley face, they were less likely to give the same response when the cue was repeated. Stroke subjects showed reduced brain activation in putamen, pallidum, thalamus, frontal and prefrontal cortices and cerebellum when compared with controls. Lesion analysis identified those stroke survivors as learning-impaired who had lesions in frontal areas, putamen, thalamus, caudate and insula. Lesion laterality had no effect on learning efficacy or brain activation. These findings suggest that stroke survivors have deficits in reinforcement learning that may be related to dysfunctional processing of feedback-based decision-making, reward signals and working memory. Copyright © 2015 IBRO. Published by Elsevier Ltd. All rights reserved.

  4. Pointwise probability reinforcements for robust statistical inference.

    PubMed

    Frénay, Benoît; Verleysen, Michel

    2014-02-01

    Statistical inference using machine learning techniques may be difficult with small datasets because of abnormally frequent data (AFDs). AFDs are observations that are much more frequent in the training sample than they should be, with respect to their theoretical probability, and include, e.g., outliers. Estimates of parameters tend to be biased towards models which support such data. This paper proposes to introduce pointwise probability reinforcements (PPRs): the probability of each observation is reinforced by a PPR, and a regularisation term controls the amount of reinforcement that compensates for AFDs. The proposed solution is very generic, since it can be used to robustify any statistical inference method that can be formulated as a likelihood maximisation. Experiments show that PPRs can be easily used to tackle regression, classification and projection: models are freed from the influence of outliers. Moreover, outliers can be filtered manually since an abnormality degree is obtained for each observation. Copyright © 2013 Elsevier Ltd. All rights reserved.

  5. Attention control learning in the decision space using state estimation

    NASA Astrophysics Data System (ADS)

    Gharaee, Zahra; Fatehi, Alireza; Mirian, Maryam S.; Nili Ahmadabadi, Majid

    2016-05-01

    The main goal of this paper is to model attention while using it for efficient path planning of mobile robots. The key challenge in pursuing these two goals concurrently is how to make an optimal, or near-optimal, decision despite the time and processing-power limitations that inherently exist in a typical multi-sensor, real-world robotic application. To recognise the environment efficiently under these two limitations, the attention of an intelligent agent is controlled within the reinforcement learning framework. We propose an estimation method that combines mixture-of-experts task and attention learning in an estimated perceptual space. An agent learns how to employ its sensory resources, and when to stop observing, by estimating its perceptual space. In this paper, static estimation of the state space is performed for a learning task examined in the Webots™ simulator. Simulation results show that a robot learns how to achieve an optimal policy at a controlled cost by estimating the state space instead of continually updating sensory information.

  6. Separation of time-based and trial-based accounts of the partial reinforcement extinction effect.

    PubMed

    Bouton, Mark E; Woods, Amanda M; Todd, Travis P

    2014-01-01

    Two appetitive conditioning experiments with rats examined time-based and trial-based accounts of the partial reinforcement extinction effect (PREE). In the PREE, the loss of responding that occurs in extinction is slower when the conditioned stimulus (CS) has been paired with a reinforcer on some of its presentations (partially reinforced) instead of every presentation (continuously reinforced). According to a time-based or "time-accumulation" view (e.g., Gallistel and Gibbon, 2000), the PREE occurs because the organism has learned in partial reinforcement to expect the reinforcer after a larger amount of time has accumulated in the CS over trials. In contrast, according to a trial-based view (e.g., Capaldi, 1967), the PREE occurs because the organism has learned in partial reinforcement to expect the reinforcer after a larger number of CS presentations. Experiment 1 used a procedure that equated partially and continuously reinforced groups on their expected times to reinforcement during conditioning. A PREE was still observed. Experiment 2 then used an extinction procedure that allowed time in the CS and the number of trials to accumulate differentially through extinction. The PREE was still evident when responding was examined as a function of expected time units to the reinforcer, but was eliminated when responding was examined as a function of expected trial units to the reinforcer. There was no evidence that the animal responded according to the ratio of time accumulated during the CS in extinction over the time in the CS expected before the reinforcer. The results thus favor a trial-based account over a time-based account of extinction and the PREE. This article is part of a Special Issue entitled: Associative and Temporal Learning. Copyright © 2013 Elsevier B.V. All rights reserved.

  7. Chronic Exposure to Methamphetamine Disrupts Reinforcement-Based Decision Making in Rats.

    PubMed

    Groman, Stephanie M; Rich, Katherine M; Smith, Nathaniel J; Lee, Daeyeol; Taylor, Jane R

    2018-03-01

    The persistent use of psychostimulant drugs, despite the detrimental outcomes associated with continued drug use, may be because of disruptions in reinforcement-learning processes that enable behavior to remain flexible and goal directed in dynamic environments. To identify the reinforcement-learning processes that are affected by chronic exposure to the psychostimulant methamphetamine (MA), the current study sought to use computational and biochemical analyses to characterize decision-making processes, assessed by probabilistic reversal learning, in rats before and after they were exposed to an escalating dose regimen of MA (or saline control). The ability of rats to use flexible and adaptive decision-making strategies following changes in stimulus-reward contingencies was significantly impaired following exposure to MA. Computational analyses of parameters that track choice and outcome behavior indicated that exposure to MA significantly impaired the ability of rats to use negative outcomes effectively. These MA-induced changes in decision making were similar to those observed in rats following administration of a dopamine D2/3 receptor antagonist. These data use computational models to provide insight into drug-induced maladaptive decision making that may ultimately identify novel targets for the treatment of psychostimulant addiction. We suggest that the disruption in utilization of negative outcomes to adaptively guide dynamic decision making is a new behavioral mechanism by which MA rigidly biases choice behavior.

  8. Reversal learning and resurgence of operant behavior in zebrafish (Danio rerio).

    PubMed

    Kuroda, Toshikazu; Mizutani, Yuto; Cançado, Carlos R X; Podlesnik, Christopher A

    2017-09-01

    Zebrafish are used extensively as vertebrate animal models in biomedical research for having such features as a fully sequenced genome and transparent embryo. Yet, operant-conditioning studies with this species are scarce. The present study investigated reversal learning and resurgence of operant behavior in zebrafish. A target response (approaching a sensor) was reinforced in Phase 1. In Phase 2, the target response was extinguished while reinforcing an alternative response (approaching a different sensor). In Phase 3, extinction was in effect for the target and alternative responses. Reversal learning was demonstrated when responding tracked contingency changes between Phases 1 and 2. Moreover, resurgence occurred in 10 of 13 fish in Phase 3: Target response rates increased transiently and exceeded rates of an unreinforced control response. The present study provides the first evidence with zebrafish supporting reversal learning between discrete operant responses and a laboratory model of relapse. These findings open the possibility to assessing genetic influences of operant behavior generally and in models of relapse (e.g., resurgence, renewal, reinstatement). Copyright © 2017 Elsevier B.V. All rights reserved.

  9. Medical Asepsis, Research, and Continuing Education

    ERIC Educational Resources Information Center

    Trussell, Patricia M.; Crow, Sue

    1977-01-01

    Emphasizes the need for hospital continuing education programs to orient newly employed graduate nurses specifically to infection control measures as carried out in that institution, and then to reinforce this learning through regular planned programs. Points out ways that those responsible for in-service nursing education can facilitate…

  10. Utilising reinforcement learning to develop strategies for driving auditory neural implants.

    PubMed

    Lee, Geoffrey W; Zambetta, Fabio; Li, Xiaodong; Paolini, Antonio G

    2016-08-01

    In this paper we propose a novel application of reinforcement learning to the area of auditory neural stimulation. We aim to develop a simulation environment which is based on real neurological responses to auditory and electrical stimulation in the cochlear nucleus (CN) and inferior colliculus (IC) of an animal model. Using this simulator we implement closed-loop reinforcement learning algorithms to determine which methods are most effective at learning acoustic neural stimulation strategies. By recording a comprehensive set of acoustic frequency presentations and neural responses from a set of animals we created a large database of neural responses to acoustic stimulation. Extensive electrical stimulation in the CN and the recording of neural responses in the IC provides a mapping of how the auditory system responds to electrical stimuli. The combined dataset is used as the foundation for the simulator, which is used to implement and test learning algorithms. Reinforcement learning, utilising a modified n-Armed Bandit solution, is implemented to demonstrate the model's function. We show the ability to effectively learn stimulation patterns which mimic the cochlea's ability to convert acoustic frequencies to neural activity. Learning effective replication using neural stimulation takes less than 20 min under continuous testing. These results show the utility of reinforcement learning in the field of neural stimulation. These results can be coupled with existing sound processing technologies to develop new auditory prostheses that are adaptable to the recipient's current auditory pathway. The same process can theoretically be abstracted to other sensory and motor systems to develop similar electrical replication of neural signals.
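
    The "modified n-Armed Bandit solution" is not specified in this record, but the flavour of the approach can be conveyed with a standard epsilon-greedy bandit sketch. All names, the simulated response model, and the parameters below are illustrative assumptions: each "arm" is a candidate electrical stimulation pattern, and the reward is how closely the evoked neural response matches the response to the target acoustic frequency.

        import numpy as np

        rng = np.random.default_rng(3)
        n_patterns = 8                    # candidate stimulation patterns (arms)
        epsilon, n_trials = 0.1, 2000

        # hypothetical similarity between each pattern's evoked response and the
        # response evoked by the target acoustic tone (unknown to the learner)
        true_similarity = rng.random(n_patterns)

        q = np.zeros(n_patterns)          # estimated value of each pattern
        counts = np.zeros(n_patterns)

        for t in range(n_trials):
            arm = rng.integers(n_patterns) if rng.random() < epsilon else int(np.argmax(q))
            reward = true_similarity[arm] + 0.1 * rng.standard_normal()   # noisy match score
            counts[arm] += 1
            q[arm] += (reward - q[arm]) / counts[arm]   # incremental sample-average update

        print("best true pattern:", int(np.argmax(true_similarity)),
              "| best learned pattern:", int(np.argmax(q)))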

  11. Embedded Incremental Feature Selection for Reinforcement Learning

    DTIC Science & Technology

    2012-05-01

    Prior to this work, feature selection for reinforcement learning has focused on linear value function approximation (Kolter and Ng, 2009; Parr et al. ...). ...In Proceedings of the 23rd International Conference on Machine Learning, pages 449–456. Kolter, J. Z. and Ng, A. Y. (2009). Regularization and feature...

  12. Social Learning, Reinforcement and Crime: Evidence from Three European Cities

    ERIC Educational Resources Information Center

    Tittle, Charles R.; Antonaccio, Olena; Botchkovar, Ekaterina

    2012-01-01

    This study reports a cross-cultural test of Social Learning Theory using direct measures of social learning constructs and focusing on the causal structure implied by the theory. Overall, the results strongly confirm the main thrust of the theory. Prior criminal reinforcement and current crime-favorable definitions are highly related in all three…

  13. Novelty and Inductive Generalization in Human Reinforcement Learning

    PubMed Central

    Gershman, Samuel J.; Niv, Yael

    2015-01-01

    In reinforcement learning, a decision maker searching for the most rewarding option is often faced with the question: what is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: how can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of reinforcement learning in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional reinforcement learning algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty. PMID:25808176
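
    One way to make the inductive idea concrete (a toy sketch, not the authors' model; the Gaussian hierarchy, prior values, and observed payoffs are assumptions): under a hierarchical prior, the expected value of a never-tried option shrinks toward the mean of the options already experienced in the same environment, so a novel option inherits structured prior knowledge rather than a fixed default value.

        import numpy as np

        # observed mean payoffs of previously sampled options in this environment
        observed_values = np.array([0.8, 0.7, 0.9, 0.75])

        mu0, tau0_sq = 0.5, 1.0    # prior on the environment-level mean value
        sigma_sq = 0.05            # assumed variance of option values around that mean

        # posterior over the environment-level mean, given the observed options
        n = len(observed_values)
        post_var = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
        post_mean = post_var * (mu0 / tau0_sq + observed_values.sum() / sigma_sq)

        # predictive expectation for a brand-new, never-tried option in this environment
        novel_option_expectation = post_mean
        print(f"expected value of a novel option: {novel_option_expectation:.3f}")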

  14. Learning with incomplete information and the mathematical structure behind it.

    PubMed

    Kühn, Reimer; Stamatescu, Ion-Olimpiu

    2007-07-01

    We investigate the problem of learning with incomplete information as exemplified by learning with delayed reinforcement. We study a two-phase learning scenario in which a phase of Hebbian associative learning based on momentary internal representations is supplemented by an 'unlearning' phase depending on a graded reinforcement signal. The reinforcement signal quantifies the success rate globally for a number of learning steps in phase one, and 'unlearning' is indiscriminate with respect to associations learnt in that phase. Learning according to this model is studied via simulations and analytically within a student-teacher scenario, for both single-layer networks and for a committee machine. Success and speed of learning depend on the ratio lambda of the learning rates used for the associative Hebbian learning phase and for the unlearning-correction in response to the reinforcement signal, respectively. Asymptotically perfect generalization is possible only if this ratio exceeds a critical value lambda_c, in which case the generalization error exhibits a power-law decay with the number of examples seen by the student, with an exponent that depends in a non-universal manner on the parameter lambda. We find these features to be robust against a wide spectrum of modifications of microscopic modelling details. Two illustrative applications, one of a robot learning to navigate a field containing obstacles and one of identifying a specific component in a collection of stimuli, are also provided.

  15. Decentralized learning in Markov games.

    PubMed

    Vrancx, Peter; Verbeeck, Katja; Nowé, Ann

    2008-08-01

    Learning automata (LA) were recently shown to be valuable tools for designing multiagent reinforcement learning algorithms. One of the principal contributions of the LA theory is that a set of decentralized independent LA is able to control a finite Markov chain with unknown transition probabilities and rewards. In this paper, we propose to extend this algorithm to Markov games--a straightforward extension of single-agent Markov decision problems to distributed multiagent decision problems. We show that under the same ergodic assumptions of the original theorem, the extended algorithm will converge to a pure equilibrium point between agent policies.
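
    A minimal sketch of a single learning automaton of the kind referred to above, using the classical linear reward-inaction update. The two-action environment and the parameters are illustrative assumptions; a full Markov-game setting would place one such automaton per state for every agent.

        import numpy as np

        rng = np.random.default_rng(5)
        learning_rate = 0.05
        p = np.array([0.5, 0.5])          # action probabilities of the automaton
        success_prob = [0.3, 0.8]         # unknown reward probability of each action

        for t in range(3000):
            a = rng.choice(2, p=p)
            reward = rng.random() < success_prob[a]
            if reward:                    # linear reward-inaction: update only on success
                p[a] += learning_rate * (1.0 - p[a])
                p[1 - a] *= (1.0 - learning_rate)
            # on failure, the probabilities are left unchanged ('inaction')

        print("final action probabilities:", np.round(p, 3))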

  16. Framing Reinforcement Learning from Human Reward: Reward Positivity, Temporal Discounting, Episodicity, and Performance

    DTIC Science & Technology

    2014-09-29

    Framing Reinforcement Learning from Human Reward: Reward Positivity, Temporal Discounting, Episodicity, and Performance. W. Bradley Knox ... positive a trainer's reward values are; temporal discounting, the extent to which future reward is discounted in value; episodicity, whether task ... learning occurs in discrete learning episodes instead of one continuing session; and task performance, the agent's performance on the task the trainer...

  17. Closed-loop and robust control of quantum systems.

    PubMed

    Chen, Chunlin; Wang, Lin-Cheng; Wang, Yuanlong

    2013-01-01

    For most practical quantum control systems, it is important and difficult to attain robustness and reliability due to unavoidable uncertainties in the system dynamics or models. Three typical kinds of approaches (e.g., closed-loop learning control, feedback control, and robust control) have proved effective in solving these problems. This work presents a self-contained survey on the closed-loop and robust control of quantum systems, as well as a brief introduction to a selection of basic theories and methods in this research area, to provide interested readers with a general idea for further studies. In the area of closed-loop learning control of quantum systems, we survey and introduce such learning control methods as gradient-based methods, genetic algorithms (GA), and reinforcement learning (RL) methods from a unified point of view of exploring the quantum control landscapes. For the feedback control approach, the paper surveys three control strategies including Lyapunov control, measurement-based control, and coherent-feedback control. Then such topics in the field of quantum robust control as H∞ control, sliding mode control, quantum risk-sensitive control, and quantum ensemble control are reviewed. The paper concludes with a perspective on future research directions that are likely to attract more attention.

  18. The basal ganglia is necessary for learning spectral, but not temporal features of birdsong

    PubMed Central

    Ali, Farhan; Fantana, Antoniu L.; Burak, Yoram; Ölveczky, Bence P.

    2013-01-01

    Executing a motor skill requires the brain to control which muscles to activate at what times. How these aspects of control - motor implementation and timing - are acquired, and whether the learning processes underlying them differ, is not well understood. To address this we used a reinforcement learning paradigm to independently manipulate both spectral and temporal features of birdsong, a complex learned motor sequence, while recording and perturbing activity in underlying circuits. Our results uncovered a striking dissociation in how neural circuits underlie learning in the two domains. The basal ganglia was required for modifying spectral, but not temporal structure. This functional dissociation extended to the descending motor pathway, where recordings from a premotor cortex analogue nucleus reflected changes to temporal, but not spectral structure. Our results reveal a strategy in which the nervous system employs different and largely independent circuits to learn distinct aspects of a motor skill. PMID:24075977

  19. Learning to soar in turbulent environments

    PubMed Central

    Reddy, Gautam; Celani, Antonio; Sejnowski, Terrence J.; Vergassola, Massimo

    2016-01-01

    Birds and gliders exploit warm, rising atmospheric currents (thermals) to reach heights comparable to low-lying clouds with a reduced expenditure of energy. This strategy of flight (thermal soaring) is frequently used by migratory birds. Soaring provides a remarkable instance of complex decision making in biology and requires a long-term strategy to effectively use the ascending thermals. Furthermore, the problem is technologically relevant to extend the flying range of autonomous gliders. Thermal soaring is commonly observed in the atmospheric convective boundary layer on warm, sunny days. The formation of thermals unavoidably generates strong turbulent fluctuations, which constitute an essential element of soaring. Here, we approach soaring flight as a problem of learning to navigate complex, highly fluctuating turbulent environments. We simulate the atmospheric boundary layer by numerical models of turbulent convective flow and combine them with model-free, experience-based, reinforcement learning algorithms to train the gliders. For the learned policies in the regimes of moderate and strong turbulence levels, the glider adopts an increasingly conservative policy as turbulence levels increase, quantifying the degree of risk affordable in turbulent environments. Reinforcement learning uncovers those sensorimotor cues that permit effective control over soaring in turbulent environments. PMID:27482099

  20. Learning to soar in turbulent environments.

    PubMed

    Reddy, Gautam; Celani, Antonio; Sejnowski, Terrence J; Vergassola, Massimo

    2016-08-16

    Birds and gliders exploit warm, rising atmospheric currents (thermals) to reach heights comparable to low-lying clouds with a reduced expenditure of energy. This strategy of flight (thermal soaring) is frequently used by migratory birds. Soaring provides a remarkable instance of complex decision making in biology and requires a long-term strategy to effectively use the ascending thermals. Furthermore, the problem is technologically relevant to extend the flying range of autonomous gliders. Thermal soaring is commonly observed in the atmospheric convective boundary layer on warm, sunny days. The formation of thermals unavoidably generates strong turbulent fluctuations, which constitute an essential element of soaring. Here, we approach soaring flight as a problem of learning to navigate complex, highly fluctuating turbulent environments. We simulate the atmospheric boundary layer by numerical models of turbulent convective flow and combine them with model-free, experience-based, reinforcement learning algorithms to train the gliders. For the learned policies in the regimes of moderate and strong turbulence levels, the glider adopts an increasingly conservative policy as turbulence levels increase, quantifying the degree of risk affordable in turbulent environments. Reinforcement learning uncovers those sensorimotor cues that permit effective control over soaring in turbulent environments.

  1. Learning from demonstration: Teaching a myoelectric prosthesis with an intact limb via reinforcement learning.

    PubMed

    Vasan, Gautham; Pilarski, Patrick M

    2017-07-01

    Prosthetic arms should restore and extend the capabilities of someone with an amputation. They should move naturally and be able to perform elegant, coordinated movements that approximate those of a biological arm. Despite these objectives, the control of modern-day prostheses is often nonintuitive and taxing. Existing devices and control approaches do not yet give users the ability to effect highly synergistic movements during their daily-life control of a prosthetic device. As a step towards improving the control of prosthetic arms and hands, we introduce an intuitive approach to training a prosthetic control system that helps a user achieve hard-to-engineer control behaviours. Specifically, we present an actor-critic reinforcement learning method that for the first time promises to allow someone with an amputation to use their non-amputated arm to teach their prosthetic arm how to move through a wide range of coordinated motions and grasp patterns. We evaluate our method during the myoelectric control of a multi-joint robot arm by non-amputee users, and demonstrate that by using our approach a user can train their arm to perform simultaneous gestures and movements in all three degrees of freedom in the robot's hand and wrist based only on information sampled from the robot and the user's above-elbow myoelectric signals. Our results indicate that this learning-from-demonstration paradigm may be well suited to use by both patients and clinicians with minimal technical knowledge, as it allows a user to personalize the control of his or her prosthesis without having to know the underlying mechanics of the prosthetic limb. These preliminary results also suggest that our approach may extend in a straightforward way to next-generation prostheses with precise finger and wrist control, such that these devices may someday allow users to perform fluid and intuitive movements like playing the piano, catching a ball, and comfortably shaking hands.
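
    The actor-critic scheme mentioned in this record can be illustrated with a stripped-down, one-dimensional sketch. This is not the authors' method or signals: the "demonstrated" target angle, the reward, and all parameters are assumptions. The critic learns a baseline for the reward, and the Gaussian policy's mean is nudged toward actions that did better than that baseline, which is how a prosthetic joint command could be pulled toward the angle demonstrated by the intact limb.

        import numpy as np

        rng = np.random.default_rng(6)
        target_angle = 0.7        # hypothetical angle demonstrated by the intact limb
        mu, sigma = 0.0, 0.2      # Gaussian policy over the prosthetic joint command
        baseline = 0.0            # critic: running estimate of the expected reward
        alpha_actor, alpha_critic = 0.02, 0.1

        for t in range(5000):
            action = mu + sigma * rng.standard_normal()
            reward = -(action - target_angle) ** 2         # closer to the demonstration = better
            advantage = reward - baseline                   # critic's prediction error
            baseline += alpha_critic * advantage            # critic update
            # actor update: likelihood-ratio (REINFORCE-style) step on the policy mean
            mu += alpha_actor * advantage * (action - mu) / sigma ** 2

        print(f"learned policy mean: {mu:.3f} (target {target_angle})")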

  2. Role of the medial prefrontal cortex in impaired decision making in juvenile attention-deficit/hyperactivity disorder.

    PubMed

    Hauser, Tobias U; Iannaccone, Reto; Ball, Juliane; Mathys, Christoph; Brandeis, Daniel; Walitza, Susanne; Brem, Silvia

    2014-10-01

    Attention-deficit/hyperactivity disorder (ADHD) has been associated with deficient decision making and learning. Models of ADHD have suggested that these deficits could be caused by impaired reward prediction errors (RPEs). Reward prediction errors are signals that indicate violations of expectations and are known to be encoded by the dopaminergic system. However, the precise learning and decision-making deficits and their neurobiological correlates in ADHD are not well known. To determine the impaired decision-making and learning mechanisms in juvenile ADHD using advanced computational models, as well as the related neural RPE processes using multimodal neuroimaging. Twenty adolescents with ADHD and 20 healthy adolescents serving as controls (aged 12-16 years) were examined using a probabilistic reversal learning task while simultaneous functional magnetic resonance imaging and electroencephalogram were recorded. Learning and decision making were investigated by contrasting a hierarchical Bayesian model with an advanced reinforcement learning model and by comparing the model parameters. The neural correlates of RPEs were studied in functional magnetic resonance imaging and electroencephalogram. Adolescents with ADHD showed more simplistic learning as reflected by the reinforcement learning model (exceedance probability, Px = .92) and had increased exploratory behavior compared with healthy controls (mean [SD] decision steepness parameter β: ADHD, 4.83 [2.97]; controls, 6.04 [2.53]; P = .02). The functional magnetic resonance imaging analysis revealed impaired RPE processing in the medial prefrontal cortex during cue as well as during outcome presentation (P < .05, family-wise error correction). The outcome-related impairment in the medial prefrontal cortex could be attributed to deficient processing at 200 to 400 milliseconds after feedback presentation as reflected by reduced feedback-related negativity (ADHD, 0.61 [3.90] μV; controls, -1.68 [2.52] μV; P = .04). The combination of computational modeling of behavior and multimodal neuroimaging revealed that impaired decision making and learning mechanisms in adolescents with ADHD are driven by impaired RPE processing in the medial prefrontal cortex. This novel, combined approach furthers the understanding of the pathomechanisms in ADHD and may advance treatment strategies.
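
    To make the "decision steepness" (inverse temperature) parameter β concrete, the short sketch below shows how a lower β flattens choice probabilities and therefore yields more exploratory behaviour. The option values are hypothetical; only the two group means for β are taken from the record above.

        import numpy as np

        def softmax_choice(values, beta):
            """Probability of choosing each option under a softmax with steepness beta."""
            z = beta * np.asarray(values, dtype=float)
            z -= z.max()                      # numerical stability
            p = np.exp(z)
            return p / p.sum()

        values = [0.6, 0.4]                   # hypothetical expected values of two options
        print("ADHD-like beta 4.83:   ", np.round(softmax_choice(values, 4.83), 3))
        print("control-like beta 6.04:", np.round(softmax_choice(values, 6.04), 3))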

  3. A clustering-based graph Laplacian framework for value function approximation in reinforcement learning.

    PubMed

    Xu, Xin; Huang, Zhenhua; Graves, Daniel; Pedrycz, Witold

    2014-12-01

    In order to deal with the sequential decision problems with large or continuous state spaces, feature representation and function approximation have been a major research topic in reinforcement learning (RL). In this paper, a clustering-based graph Laplacian framework is presented for feature representation and value function approximation (VFA) in RL. By making use of clustering-based techniques, that is, K-means clustering or fuzzy C-means clustering, a graph Laplacian is constructed by subsampling in Markov decision processes (MDPs) with continuous state spaces. The basis functions for VFA can be automatically generated from spectral analysis of the graph Laplacian. The clustering-based graph Laplacian is integrated with a class of approximation policy iteration algorithms called representation policy iteration (RPI) for RL in MDPs with continuous state spaces. Simulation and experimental results show that, compared with previous RPI methods, the proposed approach needs fewer sample points to compute an efficient set of basis functions and the learning control performance can be improved for a variety of parameter settings.
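
    A simplified sketch of the clustering-based graph Laplacian construction described above, under the assumptions that K-means centres serve as graph nodes, edges carry Gaussian-kernel weights, and the smoothest Laplacian eigenvectors are used as basis functions; the paper's RPI integration and sampling details are omitted.

        import numpy as np

        def kmeans(samples, k, iters=50, seed=0):
            rng = np.random.default_rng(seed)
            centres = samples[rng.choice(len(samples), k, replace=False)]
            for _ in range(iters):
                labels = np.argmin(((samples[:, None] - centres[None]) ** 2).sum(-1), axis=1)
                for j in range(k):
                    if np.any(labels == j):
                        centres[j] = samples[labels == j].mean(axis=0)
            return centres

        def laplacian_basis(centres, n_basis, kernel_width=1.0):
            d2 = ((centres[:, None] - centres[None]) ** 2).sum(-1)
            W = np.exp(-d2 / (2 * kernel_width ** 2))      # weighted adjacency over centres
            np.fill_diagonal(W, 0.0)
            L = np.diag(W.sum(axis=1)) - W                  # combinatorial graph Laplacian
            eigvals, eigvecs = np.linalg.eigh(L)
            return eigvecs[:, :n_basis]                     # smoothest eigenvectors as basis functions

        states = np.random.default_rng(2).uniform(-1, 1, size=(500, 2))  # sampled MDP states
        centres = kmeans(states, k=40)
        phi = laplacian_basis(centres, n_basis=10)          # one feature column per basis function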

  4. Developing PFC representations using reinforcement learning

    PubMed Central

    Reynolds, Jeremy R.; O'Reilly, Randall C.

    2009-01-01

    From both functional and biological considerations, it is widely believed that action production, planning, and goal-oriented behaviors supported by the frontal cortex are organized hierarchically (Fuster, 1990; Koechlin, Ody, & Kouneiher, 2003; Miller, Galanter, & Pribram, 1960). However, the nature of the different levels of the hierarchy remains unclear, and little attention has been paid to the origins of such a hierarchy. We address these issues through biologically-inspired computational models that develop representations through reinforcement learning. We explore several different factors in these models that might plausibly give rise to a hierarchical organization of representations within the PFC, including an initial connectivity hierarchy within PFC, a hierarchical set of connections between PFC and subcortical structures controlling it, and differential synaptic plasticity schedules. Simulation results indicate that architectural constraints contribute to the segregation of different types of representations, and that this segregation facilitates learning. These findings are consistent with the idea that there is a functional hierarchy in PFC, as captured in our earlier computational models of PFC function and a growing body of empirical data. PMID:19591977

  5. Switching Reinforcement Learning for Continuous Action Space

    NASA Astrophysics Data System (ADS)

    Nagayoshi, Masato; Murao, Hajime; Tamaki, Hisashi

    Reinforcement Learning (RL) attracts much attention as a technique for realizing computational intelligence such as adaptive and autonomous decentralized systems. In general, however, it is not easy to put RL into practical use. One difficulty is the problem of designing a suitable action space for an agent, i.e., satisfying two requirements that trade off against each other: (i) to keep the characteristics (or structure) of the original search space as much as possible in order to seek strategies that lie close to the optimum, and (ii) to reduce the search space as much as possible in order to expedite the learning process. In order to design a suitable action space adaptively, we propose a switching RL model that mimics the process of an infant's motor development, in which gross motor skills develop before fine motor skills. A method for switching controllers is then constructed by introducing and referring to an “entropy” measure. Further, through computational experiments using robot navigation problems with one- and two-dimensional continuous action spaces, the validity of the proposed method is confirmed.
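
    A hedged sketch of the entropy-guided switching idea: it assumes the entropy of a softmax policy over a coarse action discretization signals convergence and triggers a hand-over to a finer discretization; the threshold and the interpolation of values are illustrative assumptions, not the authors' exact criterion.

        import numpy as np

        def policy_entropy(q_values, beta=2.0):
            p = np.exp(beta * q_values)
            p /= p.sum()
            return -(p * np.log(p + 1e-12)).sum()

        coarse_actions = np.linspace(-1.0, 1.0, 3)    # gross-motor stage
        fine_actions = np.linspace(-1.0, 1.0, 11)     # fine-motor stage
        q_coarse = np.array([0.05, 0.9, 0.1])          # example learned values

        # Low entropy -> the coarse controller has converged -> hand over to the
        # finer action space, seeding its values from the coarse ones.
        if policy_entropy(q_coarse) < 0.5:
            q_fine = np.interp(fine_actions, coarse_actions, q_coarse)
            active_actions, active_q = fine_actions, q_fine
        else:
            active_actions, active_q = coarse_actions, q_coarse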

  6. An insula-frontostriatal network mediates flexible cognitive control by adaptively predicting changing control demands

    PubMed Central

    Jiang, Jiefeng; Beck, Jeffrey; Heller, Katherine; Egner, Tobias

    2015-01-01

    The anterior cingulate and lateral prefrontal cortices have been implicated in implementing context-appropriate attentional control, but the learning mechanisms underlying our ability to flexibly adapt the control settings to changing environments remain poorly understood. Here we show that human adjustments to varying control demands are captured by a reinforcement learner with a flexible, volatility-driven learning rate. Using model-based functional magnetic resonance imaging, we demonstrate that volatility of control demand is estimated by the anterior insula, which in turn optimizes the prediction of forthcoming demand in the caudate nucleus. The caudate's prediction of control demand subsequently guides the implementation of proactive and reactive attentional control in dorsal anterior cingulate and dorsolateral prefrontal cortices. These data enhance our understanding of the neuro-computational mechanisms of adaptive behaviour by connecting the classic cingulate-prefrontal cognitive control network to a subcortical control-learning mechanism that infers future demands by flexibly integrating remote and recent past experiences. PMID:26391305
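
    A rough sketch of a reinforcement learner with a volatility-driven learning rate of the kind described above; the running-average volatility tracker used here is a simplifying stand-in for the Bayesian estimator attributed to the anterior insula, and all constants are assumptions.

        import numpy as np

        rng = np.random.default_rng(3)
        p_conflict_hat = 0.5      # predicted probability that the next trial is high-conflict
        volatility = 0.1          # running estimate of environmental volatility

        for t in range(400):
            # True conflict probability alternates between two regimes every 100 trials.
            p_true = 0.8 if (t // 100) % 2 == 0 else 0.2
            outcome = float(rng.random() < p_true)              # 1 = high-conflict trial
            pe = outcome - p_conflict_hat                        # prediction error
            volatility += 0.05 * (abs(pe) - volatility)          # crude volatility tracker
            lr = np.clip(volatility, 0.05, 0.8)                  # more volatile -> learn faster
            p_conflict_hat += lr * pe                            # update the demand prediction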

  7. Brain Circuits of Methamphetamine Place Reinforcement Learning: The Role of the Hippocampus-VTA Loop.

    PubMed

    Keleta, Yonas B; Martinez, Joe L

    2012-03-01

    The reinforcing effects of addictive drugs including methamphetamine (METH) involve the midbrain ventral tegmental area (VTA). The VTA is the primary source of dopamine (DA) to the nucleus accumbens (NAc) and the ventral hippocampus (VHC). These three brain regions are functionally connected through the hippocampal-VTA loop that includes two main neural pathways: the bottom-up pathway and the top-down pathway. In this paper, we take the view that addiction is a learning process. Therefore, we tested the involvement of the hippocampus in reinforcement learning by studying conditioned place preference (CPP) learning by sequentially conditioning each of the three nuclei in either the bottom-up order of conditioning (VTA, then VHC, finally NAc) or the top-down order (VHC, then VTA, finally NAc). Following habituation, the rats underwent experimental modules consisting of two conditioning trials each followed by immediate testing (test 1 and test 2) and two additional tests 24 h (test 3) and/or 1 week following conditioning (test 4). The module was repeated three times for each nucleus. The results showed that METH, but not Ringer's, produced positive CPP following conditioning each brain area in the bottom-up order. In the top-down order, METH, but not Ringer's, produced either an aversive CPP or no learning effect following conditioning each nucleus of interest. In addition, METH place aversion was antagonized by coadministration of the N-methyl-d-aspartate (NMDA) receptor antagonist MK801, suggesting that the aversion learning was an NMDA receptor activation-dependent process. We conclude that the hippocampus is a critical structure in the reward circuit and hence suggest that the development of target-specific therapeutics for the control of addiction should emphasize the hippocampus-VTA top-down connection.

  8. Rational metareasoning and the plasticity of cognitive control.

    PubMed

    Lieder, Falk; Shenhav, Amitai; Musslick, Sebastian; Griffiths, Thomas L

    2018-04-01

    The human brain has the impressive capacity to adapt how it processes information to high-level goals. While it is known that these cognitive control skills are malleable and can be improved through training, the underlying plasticity mechanisms are not well understood. Here, we develop and evaluate a model of how people learn when to exert cognitive control, which controlled process to use, and how much effort to exert. We derive this model from a general theory according to which the function of cognitive control is to select and configure neural pathways so as to make optimal use of finite time and limited computational resources. The central idea of our Learned Value of Control model is that people use reinforcement learning to predict the value of candidate control signals of different types and intensities based on stimulus features. This model correctly predicts the learning and transfer effects underlying the adaptive control-demanding behavior observed in an experiment on visual attention and four experiments on interference control in Stroop and Flanker paradigms. Moreover, our model explained these findings significantly better than an associative learning model and a Win-Stay Lose-Shift model. Our findings elucidate how learning and experience might shape people's ability and propensity to adaptively control their minds and behavior. We conclude by predicting under which circumstances these learning mechanisms might lead to self-control failure.
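
    The following sketch illustrates the core Learned Value of Control idea as summarized above: predict the value of candidate control signals of different types and intensities from stimulus features, subtract an effort cost, select the best, and update from the outcome. The feature coding, linear effort cost, and function names are illustrative assumptions, not the authors' model specification.

        import numpy as np

        rng = np.random.default_rng(4)
        signal_types = ["task_focus", "distractor_suppression"]
        intensities = np.array([0.25, 0.5, 1.0])
        n_features = 3
        # One weight vector per (signal type, intensity) pair.
        W = np.zeros((len(signal_types), len(intensities), n_features))
        alpha, effort_cost = 0.2, 0.3

        def choose_and_learn(stimulus, true_payoff):
            values = W @ stimulus - effort_cost * intensities[None, :]   # expected value of control
            ti, ii = np.unravel_index(np.argmax(values), values.shape)
            outcome = true_payoff(signal_types[ti], intensities[ii])
            W[ti, ii] += alpha * (outcome - W[ti, ii] @ stimulus) * stimulus   # RL update
            return signal_types[ti], intensities[ii]

        # Example usage with a hypothetical payoff function.
        stim = np.array([1.0, 0.0, 1.0])
        payoff = lambda kind, inten: inten if kind == "task_focus" else 0.2 * inten
        for _ in range(100):
            choose_and_learn(stim, payoff)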

  9. Rational metareasoning and the plasticity of cognitive control

    PubMed Central

    Shenhav, Amitai; Musslick, Sebastian; Griffiths, Thomas L.

    2018-01-01

    The human brain has the impressive capacity to adapt how it processes information to high-level goals. While it is known that these cognitive control skills are malleable and can be improved through training, the underlying plasticity mechanisms are not well understood. Here, we develop and evaluate a model of how people learn when to exert cognitive control, which controlled process to use, and how much effort to exert. We derive this model from a general theory according to which the function of cognitive control is to select and configure neural pathways so as to make optimal use of finite time and limited computational resources. The central idea of our Learned Value of Control model is that people use reinforcement learning to predict the value of candidate control signals of different types and intensities based on stimulus features. This model correctly predicts the learning and transfer effects underlying the adaptive control-demanding behavior observed in an experiment on visual attention and four experiments on interference control in Stroop and Flanker paradigms. Moreover, our model explained these findings significantly better than an associative learning model and a Win-Stay Lose-Shift model. Our findings elucidate how learning and experience might shape people’s ability and propensity to adaptively control their minds and behavior. We conclude by predicting under which circumstances these learning mechanisms might lead to self-control failure. PMID:29694347

  10. Annelids. A Multimedia CD-ROM. [CD-ROM].

    ERIC Educational Resources Information Center

    2001

    This CD-ROM is designed for classroom and individual use to teach and learn about annelids. Integrated animations, custom graphics, three-dimensional representations, photographs, and sound are featured for use in user-controlled activities. Interactive lessons are available to reinforce the subject material. Pre- and post-testing sections are…

  11. Social Behaviorism, Human Motivation, and the Conditioning Therapies.

    ERIC Educational Resources Information Center

    Staats, Arthur W.

    The author conceives of the human emotional system as being composed of three functions of motivational stimuli: (a) the attitudinal or emotional, (b) the reinforcing, and (c) the discriminative controlling function which the stimuli acquire. He defines and describes each of these functions and their effect on integrated learning principles. He…

  12. Cost and Effectiveness of an Educational Program for Autistic Children Using a Systems Approach.

    ERIC Educational Resources Information Center

    Hung, David W.; And Others

    1983-01-01

    A systems approach, which features behavioral assessments, a functional curriculum, behavior management, precision teaching, systematic use of reinforcement, and a structured teaching schedule, resulted in greater learning of functional skills and increased structured teaching time per day compared to two control treatments for 12 autistic…

  13. Unpacking a Liturgical Framing of Desire for the Purposes of Educational Research

    ERIC Educational Resources Information Center

    Renga, Ian Parker

    2017-01-01

    Much of the public discourse on education arguably reinforces the assumption that most stakeholders share the same desires for teaching and learning--desires reflecting a liberal paradigm that stresses individualism, control, and efficiency. But there are other desires, and additional empirical research informed by a Vygotskian sociocultural…

  14. Change Detection, Multiple Controllers, and Dynamic Environments: Insights from the Brain

    ERIC Educational Resources Information Center

    Pearson, John M.; Platt, Michael L.

    2013-01-01

    Foundational studies in decision making focused on behavior as the most accessible and reliable data on which to build theories of choice. More recent work, however, has incorporated neural data to provide insights unavailable from behavior alone. Among other contributions, these studies have validated reinforcement learning models by…

  15. Proactivity and Reinforcement: The Contingency of Social Behavior

    ERIC Educational Resources Information Center

    Williams, J. Sherwood; And Others

    1976-01-01

    This paper analyzes the development of group structure in terms of the stimulus-sampling perspective. Learning is the continual sampling of possibilities, with those reinforced possibilities increasing in probability of occurrence. This contingency learning approach is tested experimentally. (NG)

  16. Automated Inattention and Fatigue Detection System in Distance Education for Elementary School Students

    ERIC Educational Resources Information Center

    Hwang, Kuo-An; Yang, Chia-Hao

    2009-01-01

    Most courses based on distance learning focus on the cognitive domain of learning. Because students are sometimes inattentive or tired, they may neglect the attention goal of learning. This study proposes an auto-detection and reinforcement mechanism for the distance-education system based on the reinforcement teaching strategy. If a student is…

  17. When, What, and How Much to Reward in Reinforcement Learning-Based Models of Cognition

    ERIC Educational Resources Information Center

    Janssen, Christian P.; Gray, Wayne D.

    2012-01-01

    Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing a reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward; that is, when (the moment: end of trial, subtask, or some other…

  18. Construction of multi-agent mobile robots control system in the problem of persecution with using a modified reinforcement learning method based on neural networks

    NASA Astrophysics Data System (ADS)

    Patkin, M. L.; Rogachev, G. N.

    2018-02-01

    A method for constructing a multi-agent control system for mobile robots based on reinforcement learning with deep neural networks is considered. The control system is synthesized using reinforcement learning and a modified Actor-Critic method, in which the Actor module is divided into an Action Actor and a Communication Actor in order to simultaneously control the mobile robots and communicate with partners. Communication is carried out by sending partners, at each step, a vector of real numbers that is added to their observation vectors and affects their behaviour. The functions of the Actors and the Critic are approximated by deep neural networks. The Critic's value function is trained using the TD-error method and the Actor's function using DDPG. The Communication Actor's neural network is trained through gradients received from partner agents. An environment with cooperative multi-agent interaction was developed, and the method was evaluated in a computer simulation of a control problem in which two robots pursue two goals.
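
    A structural sketch of the split Actor described above, with an Action Actor, a Communication Actor whose message is appended to the partner's observation, and a Critic updated from the TD error; linear maps stand in for the paper's deep networks, and the single illustrative update below is not a full DDPG training loop.

        import numpy as np

        rng = np.random.default_rng(5)
        obs_dim, msg_dim, act_dim = 6, 2, 2

        class Agent:
            def __init__(self):
                self.W_act = rng.standard_normal((act_dim, obs_dim + msg_dim)) * 0.1
                self.W_msg = rng.standard_normal((msg_dim, obs_dim)) * 0.1
                self.w_critic = np.zeros(obs_dim + msg_dim)

            def communicate(self, obs):
                return np.tanh(self.W_msg @ obs)            # message sent to the partner

            def act(self, obs, partner_msg):
                x = np.concatenate([obs, partner_msg])      # message joins the observation
                return np.tanh(self.W_act @ x), x

            def critic_update(self, x, r, x_next, alpha=0.05, gamma=0.95):
                delta = r + gamma * self.w_critic @ x_next - self.w_critic @ x
                self.w_critic += alpha * delta * x          # TD-error update of the Critic
                return delta

        a1, a2 = Agent(), Agent()
        obs1, obs2 = rng.standard_normal(obs_dim), rng.standard_normal(obs_dim)
        act1, x1 = a1.act(obs1, a2.communicate(obs2))
        act2, x2 = a2.act(obs2, a1.communicate(obs1))
        a1.critic_update(x1, r=1.0, x_next=x1)               # one illustrative critic step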

  19. BASOLATERAL AMYGDALA LESIONS AND SENSITIVITY TO REINFORCER MAGNITUDE IN CONCURRENT CHAINS SCHEDULES

    PubMed Central

    Helms, Christa M.; Mitchell, Suzanne H.

    2008-01-01

    Previous studies show that the basolateral amygdala (BLA) is required for behavior to adjust when the value of a reinforcer decreases after satiation or pairing with gastric distress. This study evaluated the effect of pre- or post-training excitotoxic lesions of the BLA on changes in preference with another type of contingency change, reinforcer magnitude reversal. Rats were trained to press left and right levers during a variable-interval choice phase for 50 µl or 150 µl sucrose delivered to consistent locations after a 16-s delay. Tones were presented during the first and last 2 s of the delay to reinforcement. The tone frequency predicted the magnitude of sucrose reinforcement in baseline conditions. All groups acquired stable preference for the lever on the large (150-µl) reinforcer side. However, nose poking during the delay to large reinforcement was highly accurate (i.e., to the reinforced side) for all groups except the rats with BLA lesions induced before training, suggesting impaired control of behavior by the tone. After the acquisition of stable preference, the locations of the reinforcer magnitudes were unpredictably reversed for a single session. Pre-training lesions blunted changes in preference when the reinforcer magnitudes were reversed. Lesions induced after stable preference was acquired, but prior to reversal, did not disrupt changes in preference. The data suggest that the BLA contributes to the adaptation of choice behavior following changes in reinforcer magnitude. Impaired learning about the tone-reinforcer magnitude relationships may have disrupted discrimination of the reinforcer magnitude reversal. PMID:18455812

  20. Dissociating error-based and reinforcement-based loss functions during sensorimotor learning

    PubMed Central

    McGregor, Heather R.; Mohatarem, Ayman

    2017-01-01

    It has been proposed that the sensorimotor system uses a loss (cost) function to evaluate potential movements in the presence of random noise. Here we test this idea in the context of both error-based and reinforcement-based learning. In a reaching task, we laterally shifted a cursor relative to true hand position using a skewed probability distribution. This skewed probability distribution had its mean and mode separated, allowing us to dissociate the optimal predictions of an error-based loss function (corresponding to the mean of the lateral shifts) and a reinforcement-based loss function (corresponding to the mode). We then examined how the sensorimotor system uses error feedback and reinforcement feedback, in isolation and combination, when deciding where to aim the hand during a reach. We found that participants compensated differently to the same skewed lateral shift distribution depending on the form of feedback they received. When provided with error feedback, participants compensated based on the mean of the skewed noise. When provided with reinforcement feedback, participants compensated based on the mode. Participants receiving both error and reinforcement feedback continued to compensate based on the mean while repeatedly missing the target, despite receiving auditory, visual and monetary reinforcement feedback that rewarded hitting the target. Our work shows that reinforcement-based and error-based learning are separable and can occur independently. Further, when error and reinforcement feedback are in conflict, the sensorimotor system heavily weights error feedback over reinforcement feedback. PMID:28753634

  1. Dissociating error-based and reinforcement-based loss functions during sensorimotor learning.

    PubMed

    Cashaback, Joshua G A; McGregor, Heather R; Mohatarem, Ayman; Gribble, Paul L

    2017-07-01

    It has been proposed that the sensorimotor system uses a loss (cost) function to evaluate potential movements in the presence of random noise. Here we test this idea in the context of both error-based and reinforcement-based learning. In a reaching task, we laterally shifted a cursor relative to true hand position using a skewed probability distribution. This skewed probability distribution had its mean and mode separated, allowing us to dissociate the optimal predictions of an error-based loss function (corresponding to the mean of the lateral shifts) and a reinforcement-based loss function (corresponding to the mode). We then examined how the sensorimotor system uses error feedback and reinforcement feedback, in isolation and combination, when deciding where to aim the hand during a reach. We found that participants compensated differently to the same skewed lateral shift distribution depending on the form of feedback they received. When provided with error feedback, participants compensated based on the mean of the skewed noise. When provided with reinforcement feedback, participants compensated based on the mode. Participants receiving both error and reinforcement feedback continued to compensate based on the mean while repeatedly missing the target, despite receiving auditory, visual and monetary reinforcement feedback that rewarded hitting the target. Our work shows that reinforcement-based and error-based learning are separable and can occur independently. Further, when error and reinforcement feedback are in conflict, the sensorimotor system heavily weights error feedback over reinforcement feedback.

  2. Role of amygdala central nucleus in feature negative discriminations

    PubMed Central

    Holland, Peter C.

    2012-01-01

    Consistent with a popular theory of associative learning, the Pearce-Hall (1980) model, the surprising omission of expected events enhances cue associability (the ease with which a cue may enter into new associations), across a wide variety of behavioral training procedures. Furthermore, previous experiments from this laboratory showed that these enhancements are absent in rats with impaired function of the amygdala central nucleus (CeA). A notable exception to these assertions is found in feature negative (FN) discrimination learning, in which a “target” stimulus is reinforced when it is presented alone but nonreinforced when it is presented in compound with another, “feature” stimulus. According to the Pearce-Hall model, reinforcer omission on compound trials should enhance the associability of the feature relative to control training conditions. However, prior experiments have shown no evidence that CeA lesions affect FN discrimination learning. Here we explored this apparent contradiction by evaluating the hypothesis that the surprising omission of an event confers enhanced associability on a cue only if that cue itself generates the disconfirmed prediction. Thus, in a FN discrimination, the surprising omission of the reinforcer on compound trials would enhance the associability of the target stimulus but not that of the feature. Our data confirmed this hypothesis, and showed this enhancement to depend on intact CeA function, as in other procedures. The results are consistent with modern reformulations of both cue and reward processing theories that assign roles for both individual and aggregate error terms in associative learning. PMID:22889308

  3. Altered Risk-Based Decision Making following Adolescent Alcohol Use Results from an Imbalance in Reinforcement Learning in Rats

    PubMed Central

    Hart, Andrew S.; Collins, Anne L.; Bernstein, Ilene L.; Phillips, Paul E. M.

    2012-01-01

    Alcohol use during adolescence has profound and enduring consequences on decision-making under risk. However, the fundamental psychological processes underlying these changes are unknown. Here, we show that alcohol use produces over-fast learning for better-than-expected, but not worse-than-expected, outcomes without altering subjective reward valuation. We constructed a simple reinforcement learning model to simulate altered decision making using behavioral parameters extracted from rats with a history of adolescent alcohol use. Remarkably, the learning imbalance alone was sufficient to simulate the divergence in choice behavior observed between these groups of animals. These findings identify a selective alteration in reinforcement learning following adolescent alcohol use that can account for a robust change in risk-based decision making persisting into later life. PMID:22615989

  4. Intrinsic interactive reinforcement learning - Using error-related potentials for real world human-robot interaction.

    PubMed

    Kim, Su Kyoung; Kirchner, Elsa Andrea; Stefes, Arne; Kirchner, Frank

    2017-12-14

    Reinforcement learning (RL) enables robots to learn their optimal behavioral strategies in dynamic environments based on feedback. Explicit human feedback during robot RL is advantageous, since an explicit reward function can be easily adapted. However, it is very demanding and tiresome for a human to continuously and explicitly generate feedback. Therefore, the development of implicit approaches is of high relevance. In this paper, we used an error-related potential (ErrP), an event-related activity in the human electroencephalogram (EEG), as an intrinsically generated implicit feedback (reward) for RL. Initially we validated our approach with seven subjects in a simulated robot learning scenario. ErrPs were detected online in single trials with a balanced accuracy (bACC) of 91%, which was sufficient to learn to recognize gestures and the correct mapping between human gestures and robot actions in parallel. Finally, we validated our approach in a real robot scenario, in which seven subjects freely chose gestures and the real robot correctly learned the mapping between gestures and actions (ErrP detection: 90% bACC). In this paper, we demonstrated that intrinsically generated EEG-based human feedback in RL can successfully be used to implicitly improve gesture-based robot control during human-robot interaction. We call our approach intrinsic interactive RL.

  5. Comparative learning theory and its application in the training of horses.

    PubMed

    Cooper, J J

    1998-11-01

    Training can best be explained as a process that occurs through stimulus-response-reinforcement chains, whereby animals are conditioned to associate cues in their environment with specific behavioural responses and their rewarding consequences. Research into learning in horses has concentrated on their powers of discrimination and on primary positive reinforcement schedules, where the correct response is paired with a desirable consequence such as food. In contrast, a number of other learning processes that are used in training have been widely studied in other species, but have received little scientific investigation in the horse. These include: negative reinforcement, where performance of the correct response is followed by removal of, or decrease in, the intensity of an unpleasant stimulus; punishment, where an incorrect response is paired with an undesirable consequence, but without consistent prior warning; secondary conditioning, where a natural primary reinforcer such as food is closely associated with an arbitrary secondary reinforcer such as vocal praise; and variable or partial conditioning, where once the correct response has been learnt, reinforcement is presented according to an intermittent schedule to increase resistance to extinction outside of training.

  6. The nature of sexual reinforcement.

    PubMed Central

    Crawford, L L; Holloway, K S; Domjan, M

    1993-01-01

    Sexual reinforcers are not part of a regulatory system involved in the maintenance of critical metabolic processes, they differ for males and females, they differ as a function of species and mating system, and they show ontogenetic and seasonal changes related to endocrine conditions. Exposure to a member of the opposite sex without copulation can be sufficient for sexual reinforcement. However, copulatory access is a stronger reinforcer, and copulatory opportunity can serve to enhance the reinforcing efficacy of stimulus features of a sexual partner. Conversely, under certain conditions, noncopulatory exposure serves to decrease reinforcer efficacy. Many common learning phenomena such as acquisition, extinction, discrimination learning, second-order conditioning, and latent inhibition have been demonstrated in sexual conditioning. These observations extend the generality of findings obtained with more conventional reinforcers, but the mechanisms of these effects and their gender and species specificity remain to be explored. PMID:8354970

  7. Negative symptoms in schizophrenia result from a failure to represent the expected value of rewards: Behavioral and computational modeling evidence

    PubMed Central

    Gold, James M.; Waltz, James A.; Matveeva, Tatyana M.; Kasanova, Zuzana; Strauss, Gregory P.; Herbener, Ellen S.; Collins, Anne G.E.; Frank, Michael J.

    2015-01-01

    Context Negative symptoms are a core feature of schizophrenia, but their pathophysiology remains unclear. Objective Negative symptoms are defined by the absence of normal function. However, there must be a productive mechanism that leads to this absence. Here, we test a reinforcement learning account suggesting that negative symptoms result from a failure to represent the expected value of rewards coupled with preserved loss avoidance learning. Design Subjects performed a probabilistic reinforcement learning paradigm involving stimulus pairs in which choices resulted in either reward or avoidance of loss. Following training, subjects indicated their valuation of the stimuli in a transfer task. Computational modeling was used to distinguish between alternative accounts of the data. Setting A tertiary care research outpatient clinic. Patients A total of 47 clinically stable patients with a diagnosis of schizophrenia or schizoaffective disorder and 28 healthy volunteers participated. Patients were divided into high and low negative symptom groups. Main Outcome Measures 1) The number of choices leading to reward or loss avoidance and 2) performance in the transfer phase. Quantitative fits from three different models were examined. Results High negative symptom patients demonstrated impaired learning from rewards but intact loss avoidance learning, and failed to distinguish rewarding stimuli from loss-avoiding stimuli in the transfer phase. Model fits revealed that high negative symptom patients were better characterized by an “actor-critic” model, learning stimulus-response associations, whereas controls and low negative symptom patients incorporated expected value of their actions (“Q-learning”) into the selection process. Conclusions Negative symptoms are associated with a specific reinforcement learning abnormality: High negative symptom patients do not represent the expected value of rewards when making decisions but learn to avoid punishments through the use of prediction errors. This computational framework offers the potential to understand negative symptoms at a mechanistic level. PMID:22310503
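
    The contrast between the two model classes compared above can be made concrete with the following sketch of a two-armed task: a Q-learning agent carries expected action values into choice, while an actor-critic agent learns stimulus-response propensities from a critic's prediction error. Parameters and task structure are simplified assumptions, not the authors' fitted models.

        import numpy as np

        rng = np.random.default_rng(6)
        p_reward = (0.8, 0.2)

        def softmax_choice(prefs, beta=3.0):
            p = np.exp(beta * prefs)
            p /= p.sum()
            return rng.choice(2, p=p)

        def q_learning(n=300, alpha=0.2):
            q = np.zeros(2)
            for _ in range(n):
                a = softmax_choice(q)
                r = float(rng.random() < p_reward[a])
                q[a] += alpha * (r - q[a])           # choice is driven by expected value
            return q

        def actor_critic(n=300, alpha_v=0.2, alpha_p=0.2):
            v, prefs = 0.0, np.zeros(2)              # state value and action propensities
            for _ in range(n):
                a = softmax_choice(prefs)
                r = float(rng.random() < p_reward[a])
                delta = r - v                        # critic's prediction error
                v += alpha_v * delta
                prefs[a] += alpha_p * delta          # actor learns S-R propensities only
            return prefs

        print(q_learning(), actor_critic())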

  8. Mesolimbic confidence signals guide perceptual learning in the absence of external feedback

    PubMed Central

    Guggenmos, Matthias; Wilbertz, Gregor; Hebart, Martin N; Sterzer, Philipp

    2016-01-01

    It is well established that learning can occur without external feedback, yet normative reinforcement learning theories have difficulties explaining such instances of learning. Here, we propose that human observers are capable of generating their own feedback signals by monitoring internal decision variables. We investigated this hypothesis in a visual perceptual learning task using fMRI and confidence reports as a measure for this monitoring process. Employing a novel computational model in which learning is guided by confidence-based reinforcement signals, we found that mesolimbic brain areas encoded both anticipation and prediction error of confidence—in remarkable similarity to previous findings for external reward-based feedback. We demonstrate that the model accounts for choice and confidence reports and show that the mesolimbic confidence prediction error modulation derived through the model predicts individual learning success. These results provide a mechanistic neurobiological explanation for learning without external feedback by augmenting reinforcement models with confidence-based feedback. DOI: http://dx.doi.org/10.7554/eLife.13388.001 PMID:27021283
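
    A highly simplified sketch of learning guided by confidence-based reinforcement signals, as proposed above: confidence is approximated here by the absolute decision variable, and perceptual weights are updated by a confidence prediction error. This stand-in omits the paper's full computational model and its fMRI linkage.

        import numpy as np

        rng = np.random.default_rng(7)
        w = np.zeros(4)              # perceptual read-out weights
        conf_expect = 0.5            # anticipated confidence
        alpha_w, alpha_c = 0.05, 0.1

        for _ in range(1000):
            signal = rng.choice([-1.0, 1.0])                       # true stimulus category
            x = signal * np.ones(4) + rng.standard_normal(4)       # noisy evidence
            dv = w @ x                                             # decision variable
            choice = np.sign(dv) if dv != 0 else 1.0
            confidence = abs(dv)                                   # internal confidence read-out
            conf_pe = confidence - conf_expect                     # confidence prediction error
            conf_expect += alpha_c * conf_pe                       # confidence anticipation
            w += alpha_w * conf_pe * choice * x                    # reinforce the chosen mapping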

  9. Autonomous Motion Learning for Intra-Vehicular Activity Space Robot

    NASA Astrophysics Data System (ADS)

    Watanabe, Yutaka; Yairi, Takehisa; Machida, Kazuo

    Space robots will be needed in future space missions. So far, many types of space robots have been developed; in particular, Intra-Vehicular Activity (IVA) space robots that support human activities should be developed to reduce human risks in space. In this paper, we study a motion learning method for an IVA space robot with a multi-link mechanism. The advantage is that this space robot moves using the reaction forces of the multi-link mechanism and contact forces from the wall, as in an astronaut's space walk, rather than using propulsion. The control approach is based on reinforcement learning with the actor-critic algorithm. We demonstrate the effectiveness of this approach by simulation using a 5-link space robot model. First, we simulate a space robot learning motion control, including a contact phase, in the two-dimensional case. Next, we simulate a space robot learning motion control with a changing base attitude in the three-dimensional case.

  10. Increasing instructional efficiency by presenting additional stimuli in learning trials for children with autism spectrum disorders.

    PubMed

    Vladescu, Jason C; Kodak, Tiffany M

    2013-12-01

    The current study examined the effectiveness and efficiency of presenting secondary targets within learning trials for 4 children with an autism spectrum disorder. Specifically, we compared 4 instructional conditions using a progressive prompt delay. In 3 conditions, we presented secondary targets in the antecedent or consequence portion of learning trials, or in the absence of prompts and reinforcement. In the fourth condition (control), we did not include secondary targets in learning trials. Results replicate and extend previous research by demonstrating that the majority of participants acquired secondary targets presented in the antecedent and consequent events of learning trials. © Society for the Experimental Analysis of Behavior.

  11. Phasic dopamine as a prediction error of intrinsic and extrinsic reinforcements driving both action acquisition and reward maximization: a simulated robotic study.

    PubMed

    Mirolli, Marco; Santucci, Vieri G; Baldassarre, Gianluca

    2013-03-01

    An important issue of recent neuroscientific research is to understand the functional role of the phasic release of dopamine in the striatum, and in particular its relation to reinforcement learning. The literature is split between two alternative hypotheses: one considers phasic dopamine as a reward prediction error similar to the computational TD-error, whose function is to guide an animal to maximize future rewards; the other holds that phasic dopamine is a sensory prediction error signal that lets the animal discover and acquire novel actions. In this paper we propose an original hypothesis that integrates these two contrasting positions: according to our view phasic dopamine represents a TD-like reinforcement prediction error learning signal determined by both unexpected changes in the environment (temporary, intrinsic reinforcements) and biological rewards (permanent, extrinsic reinforcements). Accordingly, dopamine plays the functional role of driving both the discovery and acquisition of novel actions and the maximization of future rewards. To validate our hypothesis we perform a series of experiments with a simulated robotic system that has to learn different skills in order to get rewards. We compare different versions of the system in which we vary the composition of the learning signal. The results show that only the system reinforced by both extrinsic and intrinsic reinforcements is able to reach high performance in sufficiently complex conditions. Copyright © 2013 Elsevier Ltd. All rights reserved.
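
    The integrated hypothesis above can be illustrated with a TD-like learning signal whose reward combines a permanent extrinsic reward with a transient intrinsic reinforcement that decays as the triggering event becomes predictable; the chain task and the decay rule below are assumptions for illustration only.

        import numpy as np

        gamma, alpha = 0.95, 0.1
        v = np.zeros(10)                  # state values along a simple chain
        predictability = 0.0              # how well the salient event is predicted

        def combined_reward(state, extrinsic):
            global predictability
            intrinsic = 0.0
            if state == 5:                                # state where the unexpected event occurs
                intrinsic = 1.0 - predictability          # surprise shrinks with learning
                predictability = min(1.0, predictability + 0.05)
            return extrinsic + intrinsic

        for episode in range(200):
            for s in range(9):
                extrinsic = 1.0 if s == 8 else 0.0        # biological reward at the end of the chain
                r = combined_reward(s, extrinsic)
                delta = r + gamma * v[s + 1] - v[s]       # TD-like dopamine signal
                v[s] += alpha * delta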

  12. Within- and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory.

    PubMed

    Collins, Anne G E; Frank, Michael J

    2018-03-06

    Learning from rewards and punishments is essential to survival and facilitates flexible human behavior. It is widely appreciated that multiple cognitive and reinforcement learning systems contribute to decision-making, but the nature of their interactions is elusive. Here, we leverage methods for extracting trial-by-trial indices of reinforcement learning (RL) and working memory (WM) in human electro-encephalography to reveal single-trial computations beyond that afforded by behavior alone. Neural dynamics confirmed that increases in neural expectation were predictive of reduced neural surprise in the following feedback period, supporting central tenets of RL models. Within- and cross-trial dynamics revealed a cooperative interplay between systems for learning, in which WM contributes expectations to guide RL, despite competition between systems during choice. Together, these results provide a deeper understanding of how multiple neural systems interact for learning and decision-making and facilitate analysis of their disruption in clinical populations.

  13. Impairments in action-outcome learning in schizophrenia.

    PubMed

    Morris, Richard W; Cyrzon, Chad; Green, Melissa J; Le Pelley, Mike E; Balleine, Bernard W

    2018-03-03

    Learning the causal relation between actions and their outcomes (AO learning) is critical for goal-directed behavior when actions are guided by desire for the outcome. This can be contrasted with habits that are acquired by reinforcement and primed by prevailing stimuli, in which causal learning plays no part. Recently, we demonstrated that goal-directed actions are impaired in schizophrenia; however, whether this deficit exists alongside impairments in habit or reinforcement learning is unknown. The present study distinguished deficits in causal learning from reinforcement learning in schizophrenia. We tested people with schizophrenia (SZ, n = 25) and healthy adults (HA, n = 25) in a vending machine task. Participants learned two action-outcome contingencies (e.g., push left to get a chocolate M&M, push right to get a cracker), and they also learned one contingency was degraded by delivery of noncontingent outcomes (e.g., free M&Ms), as well as changes in value by outcome devaluation. Both groups learned the best action to obtain rewards; however, SZ did not distinguish the more causal action when one AO contingency was degraded. Moreover, action selection in SZ was insensitive to changes in outcome value unless feedback was provided, and this was related to the deficit in AO learning. The failure to encode the causal relation between action and outcome in schizophrenia occurred without any apparent deficit in reinforcement learning. This implies that poor goal-directed behavior in schizophrenia cannot be explained by a more primary deficit in reward learning such as insensitivity to reward value or reward prediction errors.

  14. Altered cingulo-striatal function underlies reward drive deficits in schizophrenia.

    PubMed

    Park, Il Ho; Chun, Ji Won; Park, Hae-Jeong; Koo, Min-Seong; Park, Sunyoung; Kim, Seok-Hyeong; Kim, Jae-Jin

    2015-02-01

    Amotivation in schizophrenia is assumed to involve dysfunctional dopaminergic signaling of reward prediction or anticipation. It is unclear, however, whether the translation of neural representation of reward value to behavioral drive is affected in schizophrenia. In order to examine how abnormal neural processing of response valuation and initiation affects incentive motivation in schizophrenia, we conducted functional MRI using a deterministic reinforcement learning task with variable intervals of contingency reversals in 20 clinically stable patients with schizophrenia and 20 healthy controls. Behaviorally, the advantage of positive over negative reinforcer in reinforcement-related responsiveness was not observed in patients. Patients showed altered response valuation and initiation-related striatal activity and deficient rostro-ventral anterior cingulate cortex activation during reward approach initiation. Among these neural abnormalities, rostro-ventral anterior cingulate cortex activation was correlated with positive reinforcement-related responsiveness in controls and social anhedonia and social amotivation subdomain scores in patients. Our findings indicate that the central role of the anterior cingulate cortex is in translating action value into driving force of action, and underscore the role of the cingulo-striatal network in amotivation in schizophrenia. Copyright © 2014 Elsevier B.V. All rights reserved.

  15. Neurocomputational mechanisms of prosocial learning and links to empathy

    PubMed Central

    Apps, Matthew A. J.; Valton, Vincent; Viding, Essi; Roiser, Jonathan P.

    2016-01-01

    Reinforcement learning theory powerfully characterizes how we learn to benefit ourselves. In this theory, prediction errors—the difference between a predicted and actual outcome of a choice—drive learning. However, we do not operate in a social vacuum. To behave prosocially we must learn the consequences of our actions for other people. Empathy, the ability to vicariously experience and understand the affect of others, is hypothesized to be a critical facilitator of prosocial behaviors, but the link between empathy and prosocial behavior is still unclear. During functional magnetic resonance imaging (fMRI) participants chose between different stimuli that were probabilistically associated with rewards for themselves (self), another person (prosocial), or no one (control). Using computational modeling, we show that people can learn to obtain rewards for others but do so more slowly than when learning to obtain rewards for themselves. fMRI revealed that activity in a posterior portion of the subgenual anterior cingulate cortex/basal forebrain (sgACC) drives learning only when we are acting in a prosocial context and signals a prosocial prediction error conforming to classical principles of reinforcement learning theory. However, there is also substantial variability in the neural and behavioral efficiency of prosocial learning, which is predicted by trait empathy. More empathic people learn more quickly when benefitting others, and their sgACC response is the most selective for prosocial learning. We thus reveal a computational mechanism driving prosocial learning in humans. This framework could provide insights into atypical prosocial behavior in those with disorders of social cognition. PMID:27528669

  16. The Effects of a Token Reinforcement System on the Reading and Arithmetic Skills Learnings of Migrant Primary School Pupils.

    ERIC Educational Resources Information Center

    Heitzman, Andrew J.

    The New York State Center for Migrant Studies conducted this 1968 study which investigated effects of token reinforcers on reading and arithmetic skills learnings of migrant primary school students during a 6-week summer school session. Students (Negro and Caucasian) received plastic tokens to reward skills learning responses. Tokens were traded…

  17. The Effects of Observation of Learn Units during Reinforcement and Correction Conditions on the Rate of Learning Math Algorithms by Fifth Grade Students

    ERIC Educational Resources Information Center

    Neu, Jessica Adele

    2013-01-01

    I conducted two studies on the comparative effects of the observation of learn units during (a) reinforcement or (b) correction conditions on the acquisition of math objectives. The dependent variables were the within-session cumulative numbers of correct responses emitted during observational sessions. The independent variables were the…

  18. The Identification and Establishment of Reinforcement for Collaboration in Elementary Students

    ERIC Educational Resources Information Center

    Darcy, Laura

    2017-01-01

    In Experiment 1, I conducted a functional analysis of student rate of learning with and without a peer-yoked contingency for 12 students in Kindergarten through 2nd grade in order to determine if they had conditioned reinforcement for collaboration. Using an ABAB reversal design, I compared rate of learning as measured by learn units to criterion…

  19. Implicit chaining in cotton-top tamarins (Saguinus oedipus) with elements equated for probability of reinforcement

    PubMed Central

    Dillon, Laura; Collins, Meaghan; Conway, Maura; Cunningham, Kate

    2013-01-01

    Three experiments examined the implicit learning of sequences under conditions in which the elements comprising a sequence were equated in terms of reinforcement probability. In Experiment 1 cotton-top tamarins (Saguinus oedipus) experienced a five-element sequence displayed serially on a touch screen in which reinforcement probability was equated across elements at .16 per element. Tamarins demonstrated learning of this sequence with higher latencies during a random test as compared to baseline sequence training. In Experiments 2 and 3, manipulations of the procedure used in the first experiment were undertaken to rule out a confound owing to the fact that the elements in Experiment 1 bore different temporal relations to the intertrial interval (ITI), an inhibitory period. The results of Experiments 2 and 3 indicated that the implicit learning observed in Experiment 1 was not due to temporal proximity between some elements and the inhibitory ITI. The results taken together support two conclusions: first, that tamarins engaged in sequence learning whether or not there was contingent reinforcement for learning the sequence, and second, that this learning was not due to subtle differences in associative strength between the elements of the sequence. PMID:23344718

  20. Improving the Science Excursion: An Educational Technologist's View

    ERIC Educational Resources Information Center

    Balson, M.

    1973-01-01

    Analyzes the nature of the learning process and attempts to show how the three components of a reinforcement contingency, the stimulus, the response and the reinforcement can be utilized to increase the efficiency of a typical science learning experience, the excursion. (JR)

  1. A Novel Dynamic Spectrum Access Framework Based on Reinforcement Learning for Cognitive Radio Sensor Networks.

    PubMed

    Lin, Yun; Wang, Chao; Wang, Jiaxing; Dou, Zheng

    2016-10-12

    Cognitive radio sensor networks are one kind of application in which cognitive techniques can be adopted, and they present many potential applications, challenges and future research trends. According to research surveys, dynamic spectrum access is an important and necessary technology for future cognitive sensor networks. Traditional methods of dynamic spectrum access are based on spectrum holes and have some drawbacks, such as low accessibility and high interruptibility, which negatively affect the transmission performance of the sensor networks. To address this problem, in this paper a new initialization mechanism is proposed to establish a communication link and set up a sensor network without adopting spectrum holes to convey control information. Specifically, first a transmission channel model for analyzing the maximum accessible capacity under three different policies in a fading environment is discussed. Second, a hybrid spectrum access algorithm based on a reinforcement learning model is proposed for the power allocation problem of both the transmission channel and the control channel. Finally, extensive simulations have been conducted, and the results show that this new algorithm provides a significant improvement in terms of the tradeoff between control channel reliability and transmission channel efficiency.

  2. A Novel Dynamic Spectrum Access Framework Based on Reinforcement Learning for Cognitive Radio Sensor Networks

    PubMed Central

    Lin, Yun; Wang, Chao; Wang, Jiaxing; Dou, Zheng

    2016-01-01

    Cognitive radio sensor networks are one kind of application in which cognitive techniques can be adopted, and they present many potential applications, challenges and future research trends. According to research surveys, dynamic spectrum access is an important and necessary technology for future cognitive sensor networks. Traditional methods of dynamic spectrum access are based on spectrum holes and have some drawbacks, such as low accessibility and high interruptibility, which negatively affect the transmission performance of the sensor networks. To address this problem, in this paper a new initialization mechanism is proposed to establish a communication link and set up a sensor network without adopting spectrum holes to convey control information. Specifically, first a transmission channel model for analyzing the maximum accessible capacity under three different policies in a fading environment is discussed. Second, a hybrid spectrum access algorithm based on a reinforcement learning model is proposed for the power allocation problem of both the transmission channel and the control channel. Finally, extensive simulations have been conducted, and the results show that this new algorithm provides a significant improvement in terms of the tradeoff between control channel reliability and transmission channel efficiency. PMID:27754316

  3. Off-policy reinforcement learning for H∞ control design.

    PubMed

    Luo, Biao; Wu, Huai-Ning; Huang, Tingwen

    2015-01-01

    The H∞ control design problem is considered for nonlinear systems with an unknown internal system model. It is known that the nonlinear H∞ control problem can be transformed into solving the so-called Hamilton-Jacobi-Isaacs (HJI) equation, which is a nonlinear partial differential equation that is generally impossible to solve analytically. Even worse, model-based approaches cannot be used for approximately solving the HJI equation when the accurate system model is unavailable or costly to obtain in practice. To overcome these difficulties, an off-policy reinforcement learning (RL) method is introduced to learn the solution of the HJI equation from real system data instead of a mathematical system model, and its convergence is proved. In the off-policy RL method, the system data can be generated with arbitrary policies rather than the evaluating policy, which is extremely important and promising for practical systems. For implementation purposes, a neural network (NN)-based actor-critic structure is employed and a least-squares NN weight update algorithm is derived based on the method of weighted residuals. Finally, the developed NN-based off-policy RL method is tested on a linear F16 aircraft plant, and further applied to a rotational/translational actuator system.

  4. Learning alternative movement coordination patterns using reinforcement feedback.

    PubMed

    Lin, Tzu-Hsiang; Denomme, Amber; Ranganathan, Rajiv

    2018-05-01

    One of the characteristic features of the human motor system is redundancy, i.e., the ability to achieve a given task outcome using multiple coordination patterns. However, once participants settle on using a specific coordination pattern, the process of learning to use a new alternative coordination pattern to perform the same task is still poorly understood. Here, using two experiments, we examined this process of how participants shift from one coordination pattern to another using different reinforcement schedules. Participants performed a virtual reaching task, where they moved a cursor to different targets positioned on the screen. Our goal was to make participants use a coordination pattern with greater trunk motion, and to this end, we provided reinforcement by making the cursor disappear if the trunk motion during the reach did not cross a specified threshold value. In Experiment 1, we compared two reinforcement schedules in two groups of participants: an abrupt group, where the threshold was introduced immediately at the beginning of practice; and a gradual group, where the threshold was introduced gradually with practice. Results showed that both abrupt and gradual groups were effective in shifting their coordination patterns to involve greater trunk motion, but the abrupt group showed greater retention when the reinforcement was removed. In Experiment 2, we examined the basis of this advantage in the abrupt group using two additional control groups. Results showed that the advantage of the abrupt group was because of a greater number of practice trials with the desired coordination pattern. Overall, these results show that reinforcement can be successfully used to shift coordination patterns, which has potential in the rehabilitation of movement disorders.
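
    A small sketch of the binary reinforcement rule described above, in which the cursor is shown only when trunk motion crosses a threshold introduced either abruptly or gradually; the threshold value, ramp length, and function names are illustrative assumptions.

        TARGET_TRUNK_MOTION = 5.0      # required trunk excursion (arbitrary units)
        N_RAMP_TRIALS = 100

        def trunk_threshold(trial, schedule):
            if schedule == "abrupt":
                return TARGET_TRUNK_MOTION
            # Gradual schedule: the threshold grows linearly until it reaches the target.
            return TARGET_TRUNK_MOTION * min(1.0, trial / N_RAMP_TRIALS)

        def cursor_visible(trunk_motion, trial, schedule):
            """Reinforcement: show the cursor only if the threshold was crossed."""
            return trunk_motion >= trunk_threshold(trial, schedule)

        # Example: the same movement is reinforced early under the gradual schedule
        # but not under the abrupt one.
        print(cursor_visible(2.0, trial=10, schedule="abrupt"))    # False
        print(cursor_visible(2.0, trial=10, schedule="gradual"))   # True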

  5. Modification of saccadic gain by reinforcement

    PubMed Central

    Paeye, Céline; Wallman, Josh

    2011-01-01

    Control of saccadic gain is often viewed as a simple compensatory process in which gain is adjusted over many trials by the postsaccadic retinal error, thereby maintaining saccadic accuracy. Here, we propose that gain might also be changed by a reinforcement process not requiring a visual error. To test this hypothesis, we used experimental paradigms in which retinal error was removed by extinguishing the target at the start of each saccade and either an auditory tone or the vision of the target on the fovea was provided as reinforcement after those saccades that met an amplitude criterion. These reinforcement procedures caused a progressive change in saccade amplitude in nearly all subjects, although the rate of adaptation differed greatly among subjects. When we reversed the contingencies and reinforced those saccades landing closer to the original target location, saccade gain changed back toward normal gain in most subjects. When subjects had saccades adapted first by reinforcement and a week later by conventional intrasaccadic step adaptation, both paradigms yielded similar degrees of gain changes and similar transfer to new amplitudes and to new starting positions of the target step as well as comparable rates of recovery. We interpret these changes in saccadic gain in the absence of postsaccadic retinal error as showing that saccade adaptation is not controlled by a single error signal. More generally, our findings suggest that normal saccade adaptation might involve general learning mechanisms rather than only specialized mechanisms for motor calibration. PMID:21525366

  6. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults

    PubMed Central

    Smith, Tim J.; Senju, Atsushi

    2017-01-01

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue–reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue–reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that social engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting. PMID:28250186

  7. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults.

    PubMed

    Vernetti, Angélina; Smith, Tim J; Senju, Atsushi

    2017-03-15

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue-reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue-reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that social engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting. © 2017 The Authors.

  8. Learning and altering behaviours by reinforcement: neurocognitive differences between children and adults.

    PubMed

    Shephard, E; Jackson, G M; Groom, M J

    2014-01-01

    This study examined neurocognitive differences between children and adults in the ability to learn and adapt simple stimulus-response associations through feedback. Fourteen typically developing children (mean age=10.2) and 15 healthy adults (mean age=25.5) completed a simple task in which they learned to associate visually presented stimuli with manual responses based on performance feedback (acquisition phase), and then reversed and re-learned those associations following an unexpected change in reinforcement contingencies (reversal phase). Electrophysiological activity was recorded throughout task performance. We found no group differences in learning-related changes in performance (reaction time, accuracy) or in the amplitude of event-related potentials (ERPs) associated with stimulus processing (P3 ERP) or feedback processing (feedback-related negativity; FRN) during the acquisition phase. However, children's performance was significantly more disrupted by the reversal than adults', and FRN amplitudes were significantly modulated by the reversal phase in children but not in adults. These findings indicate that children have specific difficulties with reinforcement learning when acquired behaviours must be altered. This may be caused by the added demands on immature executive functioning, specifically response monitoring, created by the requirement to reverse the associations, or by a developmental difference in the way in which children and adults approach reinforcement learning. Copyright © 2013 The Authors. Published by Elsevier Ltd. All rights reserved.

  9. Medial prefrontal cortex involvement in the expression of extinction and ABA renewal of instrumental behavior for a food reinforcer.

    PubMed

    Eddy, Meghan C; Todd, Travis P; Bouton, Mark E; Green, John T

    2016-02-01

    Instrumental renewal, the return of extinguished instrumental responding after removal from the extinction context, is an important model of behavioral relapse that is poorly understood at the neural level. In two experiments, we examined the role of the dorsomedial prefrontal cortex (dmPFC) and the ventromedial prefrontal cortex (vmPFC) in extinction and ABA renewal of instrumental responding for a sucrose reinforcer. Previous work, exclusively using drug reinforcers, has suggested that the roles of the dmPFC and vmPFC in expression of extinction and ABA renewal may depend at least in part on the type of drug reinforcer used. The current experiments used a food reinforcer because the behavioral mechanisms underlying the extinction and renewal of instrumental responding are especially well worked out in this paradigm. After instrumental conditioning in context A and extinction in context B, we inactivated dmPFC, vmPFC, or a more ventral medial prefrontal cortex region by infusing baclofen/muscimol (B/M) just prior to testing in both contexts. In rats with inactivated dmPFC, ABA renewal was still present (i.e., responding increased when returned to context A); however responding was lower (less renewal) than controls. Inactivation of vmPFC increased responding in context B (the extinction context) and decreased responding in context A, indicating no renewal in these animals. There was no effect of B/M infusion on rats with cannula placements ventral to the vmPFC. Fluorophore-conjugated muscimol was infused in a subset of rats following test to visualize infusion spread. Imaging suggested that the infusion spread was minimal and mainly constrained to the targeted area. Together, these experiments suggest that there is a region of medial prefrontal cortex encompassing both dmPFC and vmPFC that is important for ABA renewal of extinguished instrumental responding for a food reinforcer. In addition, vmPFC, but not dmPFC, is important for expression of extinction of responding for a food reinforcer. The role of the medial prefrontal cortex in renewal in the original conditioning context may depend in part on control over excitatory context-response or context-(response-outcome) relations that might be learned in acquisition. The role of the vmPFC in expression of extinction may depend on its control over inhibitory context-response or context-(response-outcome) relations that are learned in extinction. Copyright © 2015 Elsevier Inc. All rights reserved.

  10. Nature vs Nurture: Effects of Learning on Evolution

    NASA Astrophysics Data System (ADS)

    Nagrani, Nagina

    In the field of Evolutionary Robotics, the design, development, and application of artificial neural networks as controllers have derived their inspiration from biology. Biologists and artificial intelligence researchers are trying to understand, through qualitative and quantitative analyses, the effects of neural network learning during the lifetime of individuals on the evolution of those individuals. The conclusions of these analyses can help in developing optimized artificial neural networks to perform any given task. The purpose of this thesis is to study the effects of learning on evolution. This has been done by applying Temporal Difference Reinforcement Learning methods to the evolution of an Artificial Neural Tissue controller. The controller has been assigned the task of collecting resources in a designated area in a simulated environment. The performance of the individuals is measured by the amount of resources collected. A comparison has been made between the results obtained by incorporating learning into evolution and by evolution alone. The effects of the learning parameters (learning rate, training period, discount rate, and policy) on evolution have also been studied. It was observed that learning delays the performance improvement of the evolving individuals over the generations. However, the non-zero learning rate maintained throughout the evolution process signifies that natural selection prefers individuals possessing plasticity.

  11. Touchscreen assays of learning, response inhibition, and motivation in the marmoset (Callithrix jacchus).

    PubMed

    Kangas, Brian D; Bergman, Jack; Coyle, Joseph T

    2016-05-01

    Recent developments in precision gene editing have led to the emergence of the marmoset as an experimental subject of considerable interest and translational value. A better understanding of behavioral phenotypes of the common marmoset will inform the extent to which forthcoming transgenic mutants are cognitively intact. Therefore, additional information regarding their learning, inhibitory control, and motivational abilities is needed. The present studies used touchscreen-based repeated acquisition and discrimination reversal tasks to examine basic dimensions of learning and response inhibition. Marmosets were trained daily to respond to one of the two simultaneously presented novel stimuli. Subjects learned to discriminate the two stimuli (acquisition) and, subsequently, with the contingencies switched (reversal). In addition, progressive ratio performance was used to measure the effort expended to obtain a highly palatable reinforcer varying in magnitude and, thereby, provide an index of relative motivational value. Results indicate that rates of both acquisition and reversal of novel discriminations increased across successive sessions, but that rate of reversal learning remained slower than acquisition learning, i.e., more trials were needed for mastery. A positive correlation was observed between progressive ratio break point and reinforcement magnitude. These results closely replicate previous findings with squirrel monkeys, thus providing evidence of similarity in learning processes across nonhuman primate species. Moreover, these data provide key information about the normative phenotype of wild-type marmosets using three relevant behavioral endpoints.

  12. Intelligent Control Systems Research

    NASA Technical Reports Server (NTRS)

    Loparo, Kenneth A.

    1994-01-01

    Results of a three phase research program into intelligent control systems are presented. The first phase looked at implementing the lowest or direct level of a hierarchical control scheme using a reinforcement learning approach assuming no a priori information about the system under control. The second phase involved the design of an adaptive/optimizing level of the hierarchy and its interaction with the direct control level. The third and final phase of the research was aimed at combining the results of the previous phases with some a priori information about the controlled system.

  13. A neural model of hierarchical reinforcement learning.

    PubMed

    Rasmussen, Daniel; Voelker, Aaron; Eliasmith, Chris

    2017-01-01

    We develop a novel, biologically detailed neural model of reinforcement learning (RL) processes in the brain. This model incorporates a broad range of biological features that pose challenges to neural RL, such as temporally extended action sequences, continuous environments involving unknown time delays, and noisy/imprecise computations. Most significantly, we expand the model into the realm of hierarchical reinforcement learning (HRL), which divides the RL process into a hierarchy of actions at different levels of abstraction. Here we implement all the major components of HRL in a neural model that captures a variety of known anatomical and physiological properties of the brain. We demonstrate the performance of the model in a range of different environments, in order to emphasize the aim of understanding the brain's general reinforcement learning ability. These results show that the model compares well to previous modelling work and demonstrates improved performance as a result of its hierarchical ability. We also show that the model's behaviour is consistent with available data on human hierarchical RL, and generate several novel predictions.

  14. Neural correlates of reinforcement learning and social preferences in competitive bidding.

    PubMed

    van den Bos, Wouter; Talwar, Arjun; McClure, Samuel M

    2013-01-30

    In competitive social environments, people often deviate from what rational choice theory prescribes, resulting in losses or suboptimal monetary gains. We investigate how competition affects learning and decision-making in a common value auction task. During the experiment, groups of five human participants were simultaneously scanned using MRI while playing the auction task. We first demonstrate that bidding is well characterized by reinforcement learning with biased reward representations dependent on social preferences. Indicative of reinforcement learning, we found that estimated trial-by-trial prediction errors correlated with activity in the striatum and ventromedial prefrontal cortex. Additionally, we found that individual differences in social preferences were related to activity in the temporal-parietal junction and anterior insula. Connectivity analyses suggest that monetary and social value signals are integrated in the ventromedial prefrontal cortex and striatum. Based on these results, we argue for a novel mechanistic account for the integration of reinforcement history and social preferences in competitive decision-making.
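
    The abstract's central computational claim, that bidding follows reinforcement learning on a reward representation biased by social preferences, can be written as a short update rule. The Python fragment below is a generic Rescorla-Wagner style sketch; the social-signal term, the weighting w_social, and the learning rate are illustrative assumptions, not the authors' fitted model.

      # Hypothetical sketch: value update for one bidding strategy.
      def update_bid_value(Q, monetary_payoff, social_signal, alpha=0.1, w_social=0.5):
          """One trial of learning. The subjective reward mixes the monetary
          payoff with a social term (e.g. winning or losing relative to the
          other bidders). The prediction error is the quantity reported to
          correlate with striatal and vmPFC activity."""
          subjective_reward = monetary_payoff + w_social * social_signal
          prediction_error = subjective_reward - Q
          return Q + alpha * prediction_error

      Q = 0.0
      Q = update_bid_value(Q, monetary_payoff=2.0, social_signal=-1.0)  # won, but overpaid relative to peers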

  15. Leader-Follower Output Synchronization of Linear Heterogeneous Systems With Active Leader Using Reinforcement Learning.

    PubMed

    Yang, Yongliang; Modares, Hamidreza; Wunsch, Donald C; Yin, Yixin

    2018-06-01

    This paper develops optimal control protocols for the distributed output synchronization problem of leader-follower multiagent systems with an active leader. Agents are assumed to be heterogeneous with different dynamics and dimensions. The desired trajectory is assumed to be preplanned and is generated by the leader. Other follower agents autonomously synchronize to the leader by interacting with each other using a communication network. The leader is assumed to be active in the sense that it has a nonzero control input so that it can act independently and update its control to keep the followers away from possible danger. A distributed observer is first designed to estimate the leader's state and generate the reference signal for each follower. Then, the output synchronization of leader-follower systems with an active leader is formulated as a distributed optimal tracking problem, and inhomogeneous algebraic Riccati equations (AREs) are derived to solve it. The resulting distributed optimal control protocols not only minimize the steady-state error but also optimize the transient response of the agents. An off-policy reinforcement learning algorithm is developed to solve the inhomogeneous AREs online in real time and without requiring any knowledge of the agents' dynamics. Finally, two simulation examples are conducted to illustrate the effectiveness of the proposed algorithm.

  16. If Kids Could Vote: Children, Democracy, and the Media

    ERIC Educational Resources Information Center

    Sugarman, Sally

    2006-01-01

    Preparing children to become citizens of a democracy requires recognition of the different ways in which children learn about politics. Kids in the United States currently spend most of their lives in controlled situations such as schools where the dependency they experience in their homes is reinforced. Besides teachers--books, films, television,…

  17. Food Chains & Webs. A Multimedia CD-ROM. [CD-ROM].

    ERIC Educational Resources Information Center

    2001

    This CD-ROM is designed for classroom and individual use to teach and learn about food chains and food webs. Integrated animations, custom graphics, three-dimensional representations, photographs, and sound are featured for use in user-controlled activities. Interactive lessons are available to reinforce the subject material. Pre- and post-testing…

  18. DNA: The Molecule of Life. A Multimedia CD-ROM. [CD-ROM].

    ERIC Educational Resources Information Center

    2001

    This CD-ROM is designed for classroom and individual use to teach and learn about DNA. Integrated animations, custom graphics, three-dimensional representations, photographs, and sound are featured for use in user-controlled activities. Interactive lessons are available to reinforce the subject material. Pre- and post-testing sections are also…

  19. Design and Control of Large Collections of Learning Agents

    NASA Technical Reports Server (NTRS)

    Agogino, Adrian

    2001-01-01

    The intelligent control of multiple autonomous agents is an important yet difficult task. Previous methods used to address this problem have proved to be either too brittle, too hard to use, or not scalable to large systems. The 'Collective Intelligence' project at NASA/Ames provides an elegant, machine-learning approach to address these problems. This approach mathematically defines some essential properties that a reward system should have to promote coordinated behavior among reinforcement learners. This work has focused on creating additional key properties and algorithms within the mathematics of the Collective Intelligence framework. One of the additions will allow agents to learn more quickly, in a more coordinated manner. The other will let agents learn with less knowledge of their environment. These additions will allow the framework to be applied more easily, to a much larger domain of multi-agent problems.

  20. Extinction of Pavlovian conditioning: The influence of trial number and reinforcement history.

    PubMed

    Chan, C K J; Harris, Justin A

    2017-08-01

    Pavlovian conditioning is sensitive to the temporal relationship between the conditioned stimulus (CS) and the unconditioned stimulus (US). This has motivated models that describe learning as a process that continuously updates associative strength during the trial or specifically encodes the CS-US interval. These models predict that extinction of responding is also continuous, such that response loss is proportional to the cumulative duration of exposure to the CS without the US. We review evidence showing that this prediction is incorrect, and that extinction is trial-based rather than time-based. We also present two experiments that test the importance of trials versus time on the Partial Reinforcement Extinction Effect (PREE), in which responding extinguishes more slowly for a CS that was inconsistently reinforced with the US than for a consistently reinforced one. We show that increasing the number of extinction trials of the partially reinforced CS, relative to the consistently reinforced CS, overcomes the PREE. However, increasing the duration of extinction trials by the same amount does not overcome the PREE. We conclude that animals learn about the likelihood of the US per trial during conditioning, and learn trial-by-trial about the absence of the US during extinction. Moreover, what they learn about the likelihood of the US during conditioning affects how sensitive they are to the absence of the US during extinction. Copyright © 2017 Elsevier B.V. All rights reserved.

  1. Fuzzy self-learning control for magnetic servo system

    NASA Technical Reports Server (NTRS)

    Tarn, J. H.; Kuo, L. T.; Juang, K. Y.; Lin, C. E.

    1994-01-01

    It is known that an effective control system is the key condition for successful implementation of high-performance magnetic servo systems. Major issues in designing such control systems are nonlinearity; unmodeled dynamics, such as secondary effects of copper resistance, stray fields, and saturation; and disturbance rejection, since the load effect acts directly on the servo system without transmission elements. One typical approach to designing control systems under these conditions is a special type of nonlinear feedback called gain scheduling. It accommodates linear regulators whose parameters are changed as a function of operating conditions in a preprogrammed way. In this paper, an on-line learning fuzzy control strategy is proposed. To inherit the wealth of linear control design, the relations between linear feedback and fuzzy logic controllers have been established. The exercise of engineering axioms of linear control design is thus transformed into the tuning of appropriate fuzzy parameters. Furthermore, fuzzy logic control brings the domain of candidate control laws from linear into nonlinear, and brings new prospects into the design of the local controllers. On the other hand, a self-learning scheme is utilized to automatically tune the fuzzy rule base. It is based on a network learning infrastructure; a statistical approximation to assign credit; an animal-learning method to update the reinforcement map with a fast learning rate; and a temporal-difference predictive scheme to optimize the control laws. Different from supervised and statistical unsupervised learning schemes, the proposed method learns on-line from past experience and information from the process, and forms a rule base for a fuzzy logic controller (FLC) from randomly assigned initial control rules.

  2. Off-Policy Actor-Critic Structure for Optimal Control of Unknown Systems With Disturbances.

    PubMed

    Song, Ruizhuo; Lewis, Frank L; Wei, Qinglai; Zhang, Huaguang

    2016-05-01

    In this paper, an optimal control method is developed for unknown continuous-time systems with unknown disturbances. The integral reinforcement learning (IRL) algorithm is presented to obtain the iterative control law. Off-policy learning is used to allow the dynamics to be completely unknown. Neural networks are used to construct the critic and action networks. It is shown that if there are unknown disturbances, off-policy IRL may not converge or may be biased. To reduce the influence of unknown disturbances, a disturbance compensation controller is added. It is proven that the weight errors are uniformly ultimately bounded based on Lyapunov techniques. Convergence of the Hamiltonian function is also proven. The simulation study demonstrates the effectiveness of the proposed optimal control method for unknown systems with disturbances.
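
    For readers unfamiliar with IRL, the relation it iterates on is the integral Bellman equation, shown here in its standard quadratic-cost form (without the disturbance-compensation terms specific to this paper); it lets the critic be updated from measured state trajectories without knowledge of the drift dynamics:

      V\big(x(t)\big) = \int_{t}^{t+T} \left( x^{\top} Q x + u^{\top} R u \right) d\tau + V\big(x(t+T)\big)

    Repeatedly solving this equation for the value function and then improving the control from the updated value estimate is what the critic and action networks described above approximate.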

  3. Deep reinforcement learning for automated radiation adaptation in lung cancer.

    PubMed

    Tseng, Huan-Hsin; Luo, Yi; Cui, Sunan; Chien, Jen-Tzung; Ten Haken, Randall K; Naqa, Issam El

    2017-12-01

    This study investigated deep reinforcement learning (DRL), based on historical treatment plans, for developing automated radiation adaptation protocols for non-small cell lung cancer (NSCLC) patients that aim to maximize tumor local control at reduced rates of radiation pneumonitis grade 2 (RP2). In a retrospective population of 114 NSCLC patients who received radiotherapy, a three-component neural network framework was developed for DRL of dose fractionation adaptation. Large-scale patient characteristics included clinical, genetic, and imaging radiomics features in addition to tumor and lung dosimetric variables. First, a generative adversarial network (GAN) was employed to learn patient population characteristics necessary for DRL training from a relatively limited sample size. Second, a radiotherapy artificial environment (RAE) was reconstructed by a deep neural network (DNN) utilizing both original and synthetic data (by GAN) to estimate the transition probabilities for adaptation of personalized radiotherapy patients' treatment courses. Third, a deep Q-network (DQN) was applied to the RAE for choosing the optimal dose in a response-adapted treatment setting. This multicomponent reinforcement learning approach was benchmarked against real clinical decisions that were applied in an adaptive dose escalation clinical protocol, in which 34 patients were treated based on avid PET signal in the tumor and constrained by a 17.2% normal tissue complication probability (NTCP) limit for RP2. The uncomplicated cure probability (P+) was used as a baseline reward function in the DRL. Taking our adaptive dose escalation protocol as a blueprint for the proposed DRL (GAN + RAE + DQN) architecture, we obtained an automated dose adaptation estimate for use at ∼2/3 of the way into the radiotherapy treatment course. By letting the DQN component freely control the estimated adaptive dose per fraction (ranging from 1-5 Gy), the DRL automatically favored dose escalation/de-escalation between 1.5 and 3.8 Gy, a range similar to that used in the clinical protocol. The same DQN yielded two patterns of dose escalation for the 34 test patients, but with different reward variants. First, using the baseline P+ reward function, individual adaptive fraction doses of the DQN had similar tendencies to the clinical data with an RMSE = 0.76 Gy; but adaptations suggested by the DQN were generally lower in magnitude (less aggressive). Second, by adjusting the P+ reward function with higher emphasis on mitigating local failure, better matching of doses between the DQN and the clinical protocol was achieved with an RMSE = 0.5 Gy. Moreover, the decisions selected by the DQN seemed to have better concordance with patients' eventual outcomes. In comparison, the traditional temporal difference (TD) algorithm for reinforcement learning yielded an RMSE = 3.3 Gy due to numerical instabilities and lack of sufficient learning. We demonstrated that automated dose adaptation by DRL is a feasible and promising approach for achieving results similar to those chosen by clinicians. The process may require customization of the reward function if individual cases were to be considered. However, development of this framework into a fully credible autonomous system for clinical decision support would require further validation on larger multi-institutional datasets. © 2017 American Association of Physicists in Medicine.
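
    A highly simplified sketch of the third component (the DQN choosing a per-fraction dose against the learned environment) is shown below. The feature vector, the linear stand-in for the network, the dose-grid resolution, and epsilon are placeholders, not the study's configuration.

      import numpy as np

      DOSE_GRID = np.arange(1.0, 5.25, 0.25)   # candidate dose per fraction in Gy (abstract: 1-5 Gy)

      def q_values(state_features, weights):
          """Linear stand-in for the deep Q-network: one score per candidate dose."""
          return weights @ state_features

      def select_dose(state_features, weights, epsilon=0.1, rng=np.random.default_rng(0)):
          """Epsilon-greedy dose selection against the simulated environment (RAE)."""
          if rng.random() < epsilon:
              return float(rng.choice(DOSE_GRID))
          return float(DOSE_GRID[int(np.argmax(q_values(state_features, weights)))])

      state = np.array([0.4, 1.2, -0.3])        # placeholder clinical/dosimetric/radiomics features
      weights = np.zeros((len(DOSE_GRID), state.size))
      print(select_dose(state, weights))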

  4. Closed-Loop and Robust Control of Quantum Systems

    PubMed Central

    Wang, Lin-Cheng

    2013-01-01

    For most practical quantum control systems, it is important and difficult to attain robustness and reliability due to unavoidable uncertainties in the system dynamics or models. Three kinds of typical approaches (e.g., closed-loop learning control, feedback control, and robust control) have proved effective in addressing these problems. This work presents a self-contained survey on the closed-loop and robust control of quantum systems, as well as a brief introduction to a selection of basic theories and methods in this research area, to provide interested readers with a general idea for further studies. In the area of closed-loop learning control of quantum systems, we survey and introduce such learning control methods as gradient-based methods, genetic algorithms (GA), and reinforcement learning (RL) methods from a unified point of view of exploring the quantum control landscapes. For the feedback control approach, the paper surveys three control strategies including Lyapunov control, measurement-based control, and coherent-feedback control. Then such topics in the field of quantum robust control as H∞ control, sliding mode control, quantum risk-sensitive control, and quantum ensemble control are reviewed. The paper concludes with a perspective on future research directions that are likely to attract more attention. PMID:23997680

  5. Can Service Learning Reinforce Social and Cultural Bias? Exploring a Popular Model of Family Involvement for Early Childhood Teacher Candidates

    ERIC Educational Resources Information Center

    Dunn-Kenney, Maylan

    2010-01-01

    Service learning is often used in teacher education as a way to challenge social bias and provide teacher candidates with skills needed to work in partnership with diverse families. Although some literature suggests that service learning could reinforce cultural bias, there is little documentation. In a study of 21 early childhood teacher…

  6. Deep Gate Recurrent Neural Network

    DTIC Science & Technology

    2016-11-22

    [Search-result excerpts only; no full abstract is available for this record. The recoverable fragments indicate that the report concerns gated recurrent neural networks and their applications to tasks such as machine translation (Bahdanau et al., 2015) and robot reinforcement learning (Bakker, 2001), and that it cites "Reinforcement learning in robotics: A survey" (J. Peters and co-authors, The International Journal of Robotics Research, 32:1238-1274, 2013).]

  7. Research rocket test RR-1 (Black Brant VC) and RR-2 (Aerobee 170A): Investigations of the stability of bubbles in plain and fiber-reinforced metal and solidified in a near-zero-g environment

    NASA Technical Reports Server (NTRS)

    Yates, I. C.; Yost, V. H.

    1973-01-01

    The results of the first two of a series of research rocket flights are presented. The objectives of these flights were (1) to learn about the capabilities of these rockets, (2) to learn how to interface the payloads and rockets, and (3) to process some of the composite casting demonstration capsules intended originally for Apollo 15. The capsules contained experiments for investigating the stability of gas bubbles in plain and fiber-reinforced metal melted and solidified in a near-zero-g (0.0119g) environment. The characteristics of the two research rockets, an Aerobee 170A and a Black Brant VC, used to obtain the periods of near-zero-g and the temperature control unit used for processing the contents of the two experiment capsules are discussed.

  8. Optimal Reward Functions in Distributed Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Tumer, Kagan

    2000-01-01

    We consider the design of multi-agent systems so as to optimize an overall world utility function when (1) those systems lack centralized communication and control, and (2) each agent runs a distinct Reinforcement Learning (RL) algorithm. A crucial issue in such design problems is to initialize/update each agent's private utility function so as to induce the best possible world utility. Traditional 'team game' solutions to this problem sidestep this issue and simply assign to each agent the world utility as its private utility function. In previous work we used the 'Collective Intelligence' framework to derive a better choice of private utility functions, one that results in world utility performance up to orders of magnitude superior to that ensuing from use of the team game utility. In this paper we extend these results. We derive the general class of private utility functions that are both easy for the individual agents to learn and, if learned well, result in high world utility. We demonstrate experimentally that using these new utility functions can result in significantly improved performance over that of our previously proposed utility, over and above that previous utility's superiority to the conventional team game utility.
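
    The private utilities referred to here are built from counterfactuals of the world utility; the best-known example in the Collective Intelligence literature is the difference utility sketched below. The world-utility function in the sketch is a made-up toy objective, and the construction is shown as a representative example rather than as the exact class derived in this paper.

      def world_utility(joint_action):
          """Toy world utility G(z): placeholder for the task-specific objective."""
          return -abs(sum(joint_action) - 10)   # e.g. agents should jointly sum to 10

      def difference_utility(joint_action, agent_index, default_action=0):
          """D_i(z) = G(z) - G(z with agent i's action clamped to a default).
          Subtracting the clamped term removes the noise that other agents'
          actions contribute to agent i's reward, which is what makes D_i
          easier to learn from while staying aligned with G."""
          counterfactual = list(joint_action)
          counterfactual[agent_index] = default_action
          return world_utility(joint_action) - world_utility(counterfactual)

      print(difference_utility([3, 4, 2], agent_index=0))   # 3: agent 0's action improved G by 3 units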

  9. Conceptualizing withdrawal-induced escalation of alcohol self-administration as a learned, plasticity-dependent process

    PubMed Central

    Walker, Brendan M.

    2013-01-01

    This article represents one of five contributions focusing on the topic “Plasticity and neuroadaptive responses within the extended amygdala in response to chronic or excessive alcohol exposure” that were developed by awardees participating in the Young Investigator Award Symposium at the “Alcoholism and Stress: A Framework for Future Treatment Strategies” conference in Volterra, Italy on May 3–6, 2011 that was organized/chaired by Drs. Antonio Noronha and Fulton Crews and sponsored by the National Institute on Alcohol Abuse and Alcoholism. This review discusses the dependence-induced neuroadaptations in affective systems that provide a basis for negative reinforcement learning and presents evidence demonstrating that escalated alcohol consumption during withdrawal is a learned, plasticity-dependent process. The review concludes by identifying changes within extended amygdala dynorphin/kappa-opioid receptor systems that could serve as the foundation for the occurrence of negative reinforcement processes. While some evidence contained herein may be specific to alcohol dependence-related learning and plasticity, much of the information will be of relevance to any addictive disorder involving negative reinforcement mechanisms. Collectively, the information presented within this review provides a framework to assess the negative reinforcing effects of alcohol in a manner that distinguishes neuroadaptations produced by chronic alcohol exposure from the actual plasticity that is associated with negative reinforcement learning in dependent organisms. PMID:22459874

  10. Reinforcement Learning Strategies for Clinical Trials in Non-small Cell Lung Cancer

    PubMed Central

    Zhao, Yufan; Zeng, Donglin; Socinski, Mark A.; Kosorok, Michael R.

    2010-01-01

    Typical regimens for advanced metastatic stage IIIB/IV non-small cell lung cancer (NSCLC) consist of multiple lines of treatment. We present an adaptive reinforcement learning approach to discover optimal individualized treatment regimens from a specially designed clinical trial (a “clinical reinforcement trial”) of an experimental treatment for patients with advanced NSCLC who have not been treated previously with systemic therapy. In addition to the complexity of the problem of selecting optimal compounds for first- and second-line treatments based on prognostic factors, another primary goal is to determine the optimal time to initiate second-line therapy, either immediately or delayed after induction therapy, yielding the longest overall survival time. A reinforcement learning method called Q-learning is utilized which involves learning an optimal regimen from patient data generated from the clinical reinforcement trial. Approximating the Q-function with time-indexed parameters can be achieved by using a modification of support vector regression which can utilize censored data. Within this framework, a simulation study shows that the procedure can extract optimal regimens for two lines of treatment directly from clinical data without prior knowledge of the treatment effect mechanism. In addition, we demonstrate that the design reliably selects the best initial time for second-line therapy while taking into account the heterogeneity of NSCLC across patients. PMID:21385164
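
    A hedged sketch of the backward-induction structure described above: learn the second-line Q-function first, then use its optimal value as the regression target for the first-line Q-function. The real analysis used a censored-data modification of support vector regression; the fragment below uses plain scikit-learn SVR on uncensored toy data purely to show the two-stage structure, and every variable in it is synthetic.

      import numpy as np
      from sklearn.svm import SVR

      rng = np.random.default_rng(0)
      n = 200
      X1 = rng.normal(size=(n, 3)); A1 = rng.integers(0, 2, n)   # baseline prognostic factors, 1st-line choice
      X2 = rng.normal(size=(n, 3)); A2 = rng.integers(0, 2, n)   # post-induction state, 2nd-line choice/timing
      survival = rng.exponential(12.0, n)                        # toy overall-survival outcome (months)

      # Stage 2: regress outcome on second-line state and action.
      q2 = SVR().fit(np.column_stack([X2, A2]), survival)

      # Stage 1 target: best achievable stage-2 value from each patient's post-induction state.
      v2 = np.maximum(q2.predict(np.column_stack([X2, np.zeros(n)])),
                      q2.predict(np.column_stack([X2, np.ones(n)])))
      q1 = SVR().fit(np.column_stack([X1, A1]), v2)

      # Recommended first-line action for a new patient (toy baseline state).
      x_new = [0.0, 0.0, 0.0]
      best_a1 = int(q1.predict([x_new + [1]])[0] > q1.predict([x_new + [0]])[0])
      print(best_a1)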

  11. Hierarchical extreme learning machine based reinforcement learning for goal localization

    NASA Astrophysics Data System (ADS)

    AlDahoul, Nouar; Zaw Htike, Zaw; Akmeliawati, Rini

    2017-03-01

    The objective of goal localization is to find the location of goals in noisy environments. Simple actions are performed to move the agent towards the goal. The goal detector should be capable of minimizing the error between the predicted locations and the true ones. Only a few regions need to be processed by the agent, which reduces the computational effort and increases the speed of convergence. In this paper, a reinforcement learning (RL) method was utilized to find an optimal series of actions to localize the goal region. The visual data, a set of images, is high-dimensional unstructured data and needs to be represented efficiently to obtain a robust detector. Different deep reinforcement learning models have already been used to localize a goal, but most of them take a long time to learn the model. This long learning time results from the weight fine-tuning stage that is applied iteratively to find an accurate model. The Hierarchical Extreme Learning Machine (H-ELM) was used as a fast deep model that does not fine-tune the weights. In other words, hidden weights are generated randomly and output weights are calculated analytically. The H-ELM algorithm was used in this work to find good features for effective representation. This paper proposes a combination of the Hierarchical Extreme Learning Machine and reinforcement learning to find an optimal policy directly from visual input. This combination outperforms other methods in terms of accuracy and learning speed. The simulations and results were analysed using MATLAB.
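
    The speed argument above rests on the defining property of extreme learning machines: hidden weights are drawn at random and only the output weights are solved for, analytically, with no backpropagation. The fragment below shows that property for a single generic ELM layer; it is not the authors' H-ELM implementation, and the toy regression task is invented.

      import numpy as np

      rng = np.random.default_rng(0)

      def elm_fit(X, Y, n_hidden=50):
          """Single-layer extreme learning machine: random input weights,
          output weights from a least-squares solve (no backpropagation)."""
          W = rng.normal(size=(X.shape[1], n_hidden))
          b = rng.normal(size=n_hidden)
          H = np.tanh(X @ W + b)                        # random nonlinear features
          beta, *_ = np.linalg.lstsq(H, Y, rcond=None)  # analytic output weights
          return W, b, beta

      def elm_predict(X, W, b, beta):
          return np.tanh(X @ W + b) @ beta

      # Toy regression: learn y = sin(x) from noisy samples.
      X = np.linspace(-3, 3, 200).reshape(-1, 1)
      Y = np.sin(X).ravel() + 0.05 * rng.normal(size=200)
      params = elm_fit(X, Y)
      print(np.mean((elm_predict(X, *params) - Y) ** 2))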

  12. Dissociation between active and observational learning from positive and negative feedback in Parkinsonism.

    PubMed

    Kobza, Stefan; Ferrea, Stefano; Schnitzler, Alfons; Pollok, Bettina; Südmeyer, Martin; Bellebaum, Christian

    2012-01-01

    Feedback to both actively performed and observed behaviour allows adaptation of future actions. Positive feedback leads to increased activity of dopamine neurons in the substantia nigra, whereas dopamine neuron activity is decreased following negative feedback. Dopamine level reduction in unmedicated Parkinson's Disease patients has been shown to lead to a negative learning bias, i.e. enhanced learning from negative feedback. Recent findings suggest that the neural mechanisms of active and observational learning from feedback might differ, with the striatum playing a less prominent role in observational learning. Therefore, it was hypothesized that unmedicated Parkinson's Disease patients would show a negative learning bias only in active but not in observational learning. In a between-group design, 19 Parkinson's Disease patients and 40 healthy controls engaged in either an active or an observational probabilistic feedback-learning task. For both tasks, transfer phases aimed to assess the bias to learn better from positive or negative feedback. As expected, actively learning patients showed a negative learning bias, whereas controls learned better from positive feedback. In contrast, no difference between patients and controls emerged for observational learning, with both groups showing better learning from positive feedback. These findings add to neural models of reinforcement-learning by suggesting that dopamine-modulated input to the striatum plays a minor role in observational learning from feedback. Future research will have to elucidate the specific neural underpinnings of observational learning.

  13. A sequential learning analysis of decisions in organizations to escalate investments despite continuing costs or losses

    PubMed Central

    Goltz, Sonia M.

    1992-01-01

    Reinforcement processes may underlie decisions frequently found in organizations to escalate investments of time, money, and other resources in strategies (e.g., product development, capital investment, plant expansion) that do not result in immediate reinforcers. Whereas cognitive biases have been proffered in previous explanations, the present analysis suggested that this persistence is a form of resistance to extinction arising from experiences with past investments that were variably reinforced. This explanation was examined in two experiments by varying the pattern of returns and losses subjects experienced for investment decisions prior to experiencing a series of losses. Consistent with the proposed explanation, two conditions resulted in higher levels of recommitment during continuous losses: (a) training using a variable schedule of partial reinforcement, and (b) no training on the task. Results indicate that behavior analysis can be used to understand and control situations in organizations that are prone to escalation, such as investments in the research and development of new product lines and extensions of further loans to customers. PMID:16795785

  14. Reusable Reinforcement Learning via Shallow Trails.

    PubMed

    Yu, Yang; Chen, Shi-Yong; Da, Qing; Zhou, Zhi-Hua

    2018-06-01

    Reinforcement learning has shown great success in helping learning agents accomplish tasks autonomously from environment interactions. Meanwhile in many real-world applications, an agent needs to accomplish not only a fixed task but also a range of tasks. For this goal, an agent can learn a metapolicy over a set of training tasks that are drawn from an underlying distribution. By maximizing the total reward summed over all the training tasks, the metapolicy can then be reused in accomplishing test tasks from the same distribution. However, in practice, we face two major obstacles to train and reuse metapolicies well. First, how to identify tasks that are unrelated or even opposite with each other, in order to avoid their mutual interference in the training. Second, how to characterize task features, according to which a metapolicy can be reused. In this paper, we propose the MetA-Policy LEarning (MAPLE) approach that overcomes the two difficulties by introducing the shallow trail. It probes a task by running a roughly trained policy. Using the rewards of the shallow trail, MAPLE automatically groups similar tasks. Moreover, when the task parameters are unknown, the rewards of the shallow trail also serve as task features. Empirical studies on several controlling tasks verify that MAPLE can train metapolicies well and receives high reward on test tasks.

  15. Ventral striatum lesions do not affect reinforcement learning with deterministic outcomes on slow time scales.

    PubMed

    Vicario-Feliciano, Raquel; Murray, Elisabeth A; Averbeck, Bruno B

    2017-10-01

    A large body of work has implicated the ventral striatum (VS) in aspects of reinforcement learning (RL). However, less work has directly examined the effects of lesions in the VS, or other forms of inactivation, on 2-armed bandit RL tasks. We have recently found that lesions in the VS in macaque monkeys affect learning with stochastic schedules but have minimal effects with deterministic schedules. The reasons for this are not currently clear. Because our previous work used short intertrial intervals, one possibility is that the animals were using working memory to bridge stimulus-reward associations from 1 trial to the next. In the present study, we examined learning of 60 pairs of objects, in which the animals received only 1 trial per day with each pair. The large number of object pairs and the long interval (approximately 24 hr) between trials with a given pair minimized the chances that the animals could use working memory to bridge trials. We found that monkeys with VS lesions were unimpaired relative to controls, which suggests that animals with VS lesions can still learn to select rewarded objects even when they cannot make use of working memory. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  16. Adaptive Fuzzy Systems in Computational Intelligence

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1996-01-01

    In recent years, the interest in computational intelligence techniques, which currently include neural networks, fuzzy systems, and evolutionary programming, has grown significantly, and a number of their applications have been developed in government and industry. In the future, an essential element in these systems will be fuzzy systems that can learn from experience by using neural networks to refine their performance. The GARIC architecture, introduced earlier, is an example of a fuzzy reinforcement learning system which has been applied in several control domains such as cart-pole balancing, simulation of Space Shuttle orbital operations, and tether control. A number of examples from GARIC's applications in these domains will be demonstrated.

  17. Effects of strategic versus tactical instructions on adaptation to changing contingencies in children with adhd.

    PubMed

    Bicard, David E; Neef, Nancy A

    2002-01-01

    This study examined the effects of two types of instructions on the academic responding of 4 children with attention deficit hyperactivity disorder. Tactical instructions specified how to distribute responding between two concurrently available sets of math problems associated with different variable-interval schedules of reinforcement. Strategic instructions provided a strategy to determine the best way to distribute responding. Instruction conditions were counterbalanced in an ABAB/BABA reversal design nested within a multiple baseline across participants design. Experimental sessions consisted of a learning session in which participants were provided with one type of instruction, followed by a test session in which no instruction was provided. The schedules of reinforcement were subsequently reversed during test sessions. When learning and test schedules were identical, the responding of all 4 participants closely matched the reinforcement schedules. When tactical instructions were provided and schedules were subsequently changed, responding often remained under the control of the instructions. When strategic instructions were provided, responding more quickly adapted to the changed contingencies. Analysis of postsession verbal reports indicated correspondence between the participants' verbal descriptions (whether accurate or inaccurate) and their nonverbal patterns of responding.
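
    The "matching" referred to in the results is conventionally quantified with the matching law, stated here in its standard form for reference rather than as the authors' own analysis. With response rates B_1 and B_2 on the two sets of problems and obtained reinforcement rates R_1 and R_2:

      \frac{B_1}{B_1 + B_2} = \frac{R_1}{R_1 + R_2}

    or, in its generalized logarithmic form, \log(B_1 / B_2) = a \log(R_1 / R_2) + \log b, where a and b capture sensitivity and bias.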

  18. Depression, Activity, and Evaluation of Reinforcement

    ERIC Educational Resources Information Center

    Hammen, Constance L.; Glass, David R., Jr.

    1975-01-01

    This research attempted to find the causal relation between mood and level of reinforcement. An effort was made to learn what mood change might occur if depressed subjects increased their levels of participation in reinforcing activities. (Author/RK)

  19. What Can Reinforcement Learning Teach Us About Non-Equilibrium Quantum Dynamics

    NASA Astrophysics Data System (ADS)

    Bukov, Marin; Day, Alexandre; Sels, Dries; Weinberg, Phillip; Polkovnikov, Anatoli; Mehta, Pankaj

    Equilibrium thermodynamics and statistical physics are the building blocks of modern science and technology. Yet, our understanding of thermodynamic processes away from equilibrium is largely missing. In this talk, I will reveal the potential of what artificial intelligence can teach us about the complex behaviour of non-equilibrium systems. Specifically, I will discuss the problem of finding optimal drive protocols to prepare a desired target state in quantum mechanical systems by applying ideas from Reinforcement Learning [one can think of Reinforcement Learning as the study of how an agent (e.g. a robot) can learn and perfect a given policy through interactions with an environment.]. The driving protocols learnt by our agent suggest that the non-equilibrium world features possibilities easily defying intuition based on equilibrium physics.

  20. Kinesthetic Reinforcement-Is It a Boon to Learning?

    ERIC Educational Resources Information Center

    Bohrer, Roxilu K.

    1970-01-01

    Language instruction, particularly in the elementary school, should be reinforced through the use of visual aids and through associated physical activity. Kinesthetic experiences provide an opportunity to make use of non-verbal cues to meaning, enliven classroom activities, and maximize learning for pupils. The author discusses the educational…

  1. Reinforcing Basic Skills Through Social Studies. Grades 4-7.

    ERIC Educational Resources Information Center

    Lewis, Teresa Marie

    Arranged into seven parts, this document provides a variety of games and activities, bulletin board ideas, overhead transparencies, student handouts, and learning station ideas to help reinforce basic social studies skills in the intermediate grades. In part 1, students learn about timelines, first constructing their own life timeline, then a…

  2. Effects of Reinforcement on Peer Imitation in a Small Group Play Context

    ERIC Educational Resources Information Center

    Barton, Erin E.; Ledford, Jennifer R.

    2018-01-01

    Children with disabilities often have deficits in imitation skills, particularly in imitating peers. Imitation is considered a behavioral cusp--which, once learned, allows a child to access additional and previously unavailable learning opportunities. In the current study, researchers examined the efficacy of contingent reinforcement delivered…

  3. The role of punishment in the in-patient treatment of psychiatrically disturbed children.

    PubMed

    Alderton, H R

    1967-02-01

    The role of punishment in the psychiatric in-patient treatment of nonpsychotic latency-age children with behaviour disorders is discussed. Punishment is defined as the removal of previously existing positive reinforcers or the administration of aversive stimuli. Ways in which appropriate social behaviour may be acquired are briefly considered. These include reinforcement of desirable responses, non-reinforcement of undesirable responses, reinforcement of incompatible responses, and imitative learning. The reported effects of punishment on behaviour are reviewed, and the psychological functions necessary before punishment can have the intended effects are considered. For seriously disturbed children, punishment is ineffective as a treatment technique. It reinforces pathological perceptions of self and adults even if it successfully suppresses behaviour. The frame of reference of the seriously disturbed child contraindicates the removal of positive reinforcers and verbal as well as physical aversive stimuli. Controls and punishments must be clearly distinguished. Controls continue only as long as the behaviour towards which they are directed. They do not include the deliberate establishment of an unpleasant state by the adult as a result of particular behaviour. Control techniques such as removal from a group may be necessary but when possible should be avoided in favour of techniques less likely to be misinterpreted. Avoidance of punishment in treatment makes explicit expectations and the provision of realistic controls even more important. Natural laws may result in unpleasant experiences as an unavoidable result of certain behaviour. By definition such results can never be imposed by the adult. Treatment considerations may necessitate that the child be protected from the results of his actions. Avoidance of punishment requires a higher staff/child ratio and more mature, better-trained staff. Sometimes children have previously been deterred from serious community acting out only by punishment. Should the therapeutic endeavours outlined not prevent such behaviour, treatment in a closed setting without punishment is indicated, not the use of punishment in an open centre.

  4. Neurofeedback in Learning Disabled Children: Visual versus Auditory Reinforcement.

    PubMed

    Fernández, Thalía; Bosch-Bayard, Jorge; Harmony, Thalía; Caballero, María I; Díaz-Comas, Lourdes; Galán, Lídice; Ricardo-Garcell, Josefina; Aubert, Eduardo; Otero-Ojeda, Gloria

    2016-03-01

    Children with learning disabilities (LD) frequently have an EEG characterized by an excess of theta and a deficit of alpha activities. Neurofeedback (NFB) using an auditory stimulus as a reinforcer has proven to be a useful tool for treating LD children by positively reinforcing decreases in the theta/alpha ratio. The aim of the present study was to optimize the NFB procedure by comparing the efficacy of visual (with eyes open) versus auditory (with eyes closed) reinforcers. Twenty LD children with an abnormally high theta/alpha ratio were randomly assigned to the Auditory or the Visual group, where a 500 Hz tone or a visual stimulus (a white square), respectively, was used as a positive reinforcer when the value of the theta/alpha ratio was reduced. Both groups had signs consistent with EEG maturation, but only the Auditory Group showed behavioral/cognitive improvements. In conclusion, the auditory reinforcer was more efficacious in reducing the theta/alpha ratio, and it improved cognitive abilities more than the visual reinforcer did.

  5. Reinforcement learning agents providing advice in complex video games

    NASA Astrophysics Data System (ADS)

    Taylor, Matthew E.; Carboni, Nicholas; Fachantidis, Anestis; Vlahavas, Ioannis; Torrey, Lisa

    2014-01-01

    This article introduces a teacher-student framework for reinforcement learning, synthesising and extending material that appeared in conference proceedings [Torrey, L., & Taylor, M. E. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. Proceedings of the international conference on autonomous agents and multiagent systems] and in a non-archival workshop paper [Carboni, N., & Taylor, M. E. (2013, May). Preliminary results for 1 vs. 1 tactics in StarCraft. Proceedings of the adaptive and learning agents workshop (at AAMAS-13)]. In this framework, a teacher agent instructs a student agent by suggesting actions the student should take as it learns. However, the teacher may only give such advice a limited number of times. We present several novel algorithms that teachers can use to budget their advice effectively, and we evaluate them in two complex video games: StarCraft and Pac-Man. Our results show that the same amount of advice, given at different moments, can have different effects on student learning, and that teachers can significantly affect student learning even when students use different learning methods and state representations.
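
    One common way to budget advice in this teacher-student setting is importance advising: the teacher spends advice only in states where, by its own Q-values, choosing badly is costly. The sketch below is a generic illustration of that heuristic; the threshold, data structures, and names are assumptions, not the specific algorithms evaluated in the article.

      def maybe_advise(teacher_q, state, budget, importance_threshold=1.0):
          """Return (advised_action or None, remaining_budget). Advice is given
          only while budget remains and the state is 'important', i.e. the gap
          between the teacher's best and worst action values is large."""
          if budget <= 0:
              return None, budget
          q = teacher_q[state]                      # list of action values in this state
          if max(q) - min(q) >= importance_threshold:
              best_action = max(range(len(q)), key=lambda a: q[a])
              return best_action, budget - 1
          return None, budget

      teacher_q = {"s0": [0.1, 0.2], "s1": [0.0, 5.0]}
      print(maybe_advise(teacher_q, "s0", budget=3))   # (None, 3): low-stakes state, save the budget
      print(maybe_advise(teacher_q, "s1", budget=3))   # (1, 2): high-stakes state, advise action 1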

  6. Establishment and Maintenance of Socially Learned Conditioned Reinforcement in Young Children: Elimination of the Role of Adults and View of Peers' Faces

    ERIC Educational Resources Information Center

    Zrinzo, Michelle; Greer, R. Douglas

    2013-01-01

    Prior research has demonstrated the establishment of reinforcers for learning and maintenance with young children as a function of social learning where a peer and an adult experimenter were present. The presence of an adult experimenter was eliminated in the present study to test if the effect produced in the prior studies would occur with only…

  7. Learning the specific quality of taste reinforcement in larval Drosophila.

    PubMed

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-27

    The only property of reinforcement that insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing, in any brain.

  8. The evolution of continuous learning of the structure of the environment

    PubMed Central

    Kolodny, Oren; Edelman, Shimon; Lotem, Arnon

    2014-01-01

    Continuous, ‘always on’, learning of structure from a stream of data is studied mainly in the fields of machine learning or language acquisition, but its evolutionary roots may go back to the first organisms that were internally motivated to learn and represent their environment. Here, we study under what conditions such continuous learning (CL) may be more adaptive than simple reinforcement learning and examine how it could have evolved from the same basic associative elements. We use agent-based computer simulations to compare three learning strategies: simple reinforcement learning; reinforcement learning with chaining (RL-chain) and CL that applies the same associative mechanisms used by the other strategies, but also seeks statistical regularities in the relations among all items in the environment, regardless of the initial association with food. We show that a sufficiently structured environment favours the evolution of both RL-chain and CL and that CL outperforms the other strategies when food is relatively rare and the time for learning is limited. This advantage of internally motivated CL stems from its ability to capture statistical patterns in the environment even before they are associated with food, at which point they immediately become useful for planning. PMID:24402920

  9. The partial-reinforcement extinction effect and the contingent-sampling hypothesis.

    PubMed

    Hochman, Guy; Erev, Ido

    2013-12-01

    The partial-reinforcement extinction effect (PREE) implies that learning under partial reinforcements is more robust than learning under full reinforcements. While the advantages of partial reinforcements have been well-documented in laboratory studies, field research has failed to support this prediction. In the present study, we aimed to clarify this pattern. Experiment 1 showed that partial reinforcements increase the tendency to select the promoted option during extinction; however, this effect is much smaller than the negative effect of partial reinforcements on the tendency to select the promoted option during the training phase. Experiment 2 demonstrated that the overall effect of partial reinforcements varies inversely with the attractiveness of the alternative to the promoted behavior: The overall effect is negative when the alternative is relatively attractive, and positive when the alternative is relatively unattractive. These results can be captured with a contingent-sampling model assuming that people select options that provided the best payoff in similar past experiences. The best fit was obtained under the assumption that similarity is defined by the sequence of the last four outcomes.
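
    The contingent-sampling account in the closing sentences is concrete enough to sketch: choose the option with the best average payoff over past trials whose recent-outcome history matches the current one. The four-outcome window comes from the abstract; everything else below (data structures, tie-breaking, the default choice) is an illustrative implementation assumption.

      from collections import defaultdict

      class ContingentSampler:
          """Pick the option that paid best, on average, in past trials preceded
          by the same sequence of the last four outcomes."""
          def __init__(self, options, window=4):
              self.options = options
              self.window = window
              self.outcomes = []                                       # e.g. 1 = reinforced, 0 = not
              self.payoffs = defaultdict(lambda: defaultdict(list))    # context -> option -> payoffs

          def _context(self):
              return tuple(self.outcomes[-self.window:])

          def choose(self):
              seen = self.payoffs.get(self._context())
              if not seen:
                  return self.options[0]                               # arbitrary default for a new context
              return max(seen, key=lambda o: sum(seen[o]) / len(seen[o]))

          def observe(self, option, payoff, outcome):
              self.payoffs[self._context()][option].append(payoff)
              self.outcomes.append(outcome)

      cs = ContingentSampler(options=["promoted", "alternative"])
      cs.observe("promoted", payoff=1.0, outcome=1)
      print(cs.choose())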

  10. The effects of aging on the interaction between reinforcement learning and attention.

    PubMed

    Radulescu, Angela; Daniel, Reka; Niv, Yael

    2016-11-01

    Reinforcement learning (RL) in complex environments relies on selective attention to uncover those aspects of the environment that are most predictive of reward. Whereas previous work has focused on age-related changes in RL, it is not known whether older adults learn differently from younger adults when selective attention is required. In 2 experiments, we examined how aging affects the interaction between RL and selective attention. Younger and older adults performed a learning task in which only 1 stimulus dimension was relevant to predicting reward, and within it, 1 "target" feature was the most rewarding. Participants had to discover this target feature through trial and error. In Experiment 1, stimuli varied on 1 or 3 dimensions and participants received hints that revealed the target feature or the relevant dimension, or that gave no information. Group-related differences in accuracy and RTs differed systematically as a function of the number of dimensions and the type of hint available. In Experiment 2, we used trial-by-trial computational modeling of the learning process to test for age-related differences in learning strategies. The behavior of both younger and older adults was explained well by a reinforcement-learning model that uses selective attention to constrain learning. However, the model suggested that older adults restricted their learning to fewer features, employing more focused attention than younger adults. Furthermore, this difference in strategy predicted age-related deficits in accuracy. We discuss these results, suggesting that a narrower filter of attention may reflect an adaptation to the reduced capabilities of the reinforcement learning system. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
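
    The attention-constrained learning described above can be illustrated with a generic feature-based reinforcement-learning sketch: each feature carries a value, the value of a stimulus is an attention-weighted sum of its features' values, and a prediction error updates features in proportion to the attention on their dimension, so narrow attention restricts learning to fewer features. All names and parameters below are illustrative; this is not the model fitted in the study.

      def feature_rl_update(feature_values, stimulus_features, attention, reward, alpha=0.2):
          """Update feature values for a chosen stimulus.

          feature_values    : dict feature -> learned value
          stimulus_features : dict dimension -> feature of the chosen stimulus
          attention         : dict dimension -> weight (sums to 1); focused
                              attention concentrates weight on few dimensions
          """
          # Stimulus value: attention-weighted sum of its features' values.
          v = sum(attention[dim] * feature_values[feat]
                  for dim, feat in stimulus_features.items())
          prediction_error = reward - v
          # Credit each feature in proportion to attention on its dimension.
          for dim, feat in stimulus_features.items():
              feature_values[feat] += alpha * attention[dim] * prediction_error
          return feature_values, prediction_error

      feature_values = {"red": 0.0, "square": 0.0, "dots": 0.0}
      stimulus = {"color": "red", "shape": "square", "texture": "dots"}
      attention = {"color": 0.8, "shape": 0.1, "texture": 0.1}  # narrow filter
      feature_values, pe = feature_rl_update(feature_values, stimulus, attention, reward=1.0)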

  11. Optimal and Scalable Caching for 5G Using Reinforcement Learning of Space-Time Popularities

    NASA Astrophysics Data System (ADS)

    Sadeghi, Alireza; Sheikholeslami, Fatemeh; Giannakis, Georgios B.

    2018-02-01

    Small basestations (SBs) equipped with caching units have potential to handle the unprecedented demand growth in heterogeneous networks. Through low-rate, backhaul connections with the backbone, SBs can prefetch popular files during off-peak traffic hours, and service them to the edge at peak periods. To intelligently prefetch, each SB must learn what and when to cache, while taking into account SB memory limitations, the massive number of available contents, the unknown popularity profiles, as well as the space-time popularity dynamics of user file requests. In this work, local and global Markov processes model user requests, and a reinforcement learning (RL) framework is put forth for finding the optimal caching policy when the transition probabilities involved are unknown. Joint consideration of global and local popularity demands along with cache-refreshing costs allow for a simple, yet practical asynchronous caching approach. The novel RL-based caching relies on a Q-learning algorithm to implement the optimal policy in an online fashion, thus enabling the cache control unit at the SB to learn, track, and possibly adapt to the underlying dynamics. To endow the algorithm with scalability, a linear function approximation of the proposed Q-learning scheme is introduced, offering faster convergence as well as reduced complexity and memory requirements. Numerical tests corroborate the merits of the proposed approach in various realistic settings.
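
    A generic sketch of the scalable ingredient mentioned above, Q-learning with a linear function approximator, where Q(s, a) is the dot product of per-action weights with a state feature vector (for a cache, features might summarize recent local and global request counts). The dimensions, feature choice, and parameters below are illustrative assumptions, not the paper's formulation.

      import numpy as np

      def linear_q_update(theta, phi_s, action, reward, phi_next, alpha=0.05, gamma=0.9):
          """One Q-learning step with Q(s, a) = theta[a] . phi(s).

          theta    : (n_actions, n_features) weight matrix
          phi_s    : feature vector of the current state
          action   : index of the caching action taken
          reward   : observed reward (e.g. cache hits minus refresh cost)
          phi_next : feature vector of the next state
          """
          target = reward + gamma * np.max(theta @ phi_next)
          td_error = target - theta[action] @ phi_s
          theta[action] += alpha * td_error * phi_s
          return theta

      # Illustrative use: 4 caching actions, 8 popularity features.
      theta = np.zeros((4, 8))
      phi_s, phi_next = np.random.rand(8), np.random.rand(8)
      theta = linear_q_update(theta, phi_s, action=2, reward=1.0, phi_next=phi_next)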

  12. Intelligent multiagent coordination based on reinforcement hierarchical neuro-fuzzy models.

    PubMed

    Mendoza, Leonardo Forero; Vellasco, Marley; Figueiredo, Karla

    2014-12-01

    This paper presents the research and development of two hybrid neuro-fuzzy models for the hierarchical coordination of multiple intelligent agents. The main objective of the models is to have multiple agents interact intelligently with each other in complex systems. We developed two new coordination models for intelligent multiagent systems, which integrate the Reinforcement Learning Hierarchical Neuro-Fuzzy model with two proposed coordination mechanisms: the MultiAgent Reinforcement Learning Hierarchical Neuro-Fuzzy model with a market-driven coordination mechanism (MA-RL-HNFP-MD) and the MultiAgent Reinforcement Learning Hierarchical Neuro-Fuzzy model with graph coordination (MA-RL-HNFP-CG). In order to evaluate the proposed models and verify the contribution of the proposed coordination mechanisms, two multiagent benchmark applications were developed: the pursuit game and the robot soccer simulation. The results obtained demonstrated that the proposed coordination mechanisms greatly improve the performance of the multiagent system when compared with other strategies.

  13. Refining Linear Fuzzy Rules by Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.; Malkani, Anil

    1996-01-01

    Linear fuzzy rules are increasingly being used in the development of fuzzy logic systems. Radial basis functions have also been used in the antecedents of the rules for clustering in product space, which can automatically generate a set of linear fuzzy rules from an input/output data set. Manual methods are usually used to refine these rules. This paper presents a method for refining the parameters of these rules using reinforcement learning, which can be applied in domains where supervised input-output data are not available and reinforcements are received only after a long sequence of actions. This is shown for a generalization of radial basis functions. The formation of fuzzy rules from data and their automatic refinement is an important step toward applying reinforcement learning methods in domains where only limited input-output data are available.

  14. Instructed knowledge shapes feedback-driven aversive learning in striatum and orbitofrontal cortex, but not the amygdala

    PubMed Central

    Atlas, Lauren Y; Doll, Bradley B; Li, Jian; Daw, Nathaniel D; Phelps, Elizabeth A

    2016-01-01

    Socially-conveyed rules and instructions strongly shape expectations and emotions. Yet most neuroscientific studies of learning consider reinforcement history alone, irrespective of knowledge acquired through other means. We examined fear conditioning and reversal in humans to test whether instructed knowledge modulates the neural mechanisms of feedback-driven learning. One group was informed about contingencies and reversals. A second group learned only from reinforcement. We combined quantitative models with functional magnetic resonance imaging and found that instructions induced dissociations in the neural systems of aversive learning. Responses in striatum and orbitofrontal cortex updated with instructions and correlated with prefrontal responses to instructions. Amygdala responses were influenced by reinforcement similarly in both groups and did not update with instructions. Results extend work on instructed reward learning and reveal novel dissociations that have not been observed with punishments or rewards. Findings support theories of specialized threat-detection and may have implications for fear maintenance in anxiety. DOI: http://dx.doi.org/10.7554/eLife.15192.001 PMID:27171199

  15. Optimisation of cognitive performance in rodent operant (touchscreen) testing: Evaluation and effects of reinforcer strength.

    PubMed

    Phillips, Benjamin U; Heath, Christopher J; Ossowska, Zofia; Bussey, Timothy J; Saksida, Lisa M

    2017-09-01

    Operant testing is a widely used and highly effective method of studying cognition in rodents. Performance on such tasks is sensitive to reinforcer strength. It is therefore advantageous to select effective reinforcers to minimize training times and maximize experimental throughput. To quantitatively investigate the control of behavior by different reinforcers, performance of mice was tested with either strawberry milkshake or a known powerful reinforcer, super saccharin (1.5% or 2% (w/v) saccharin/1.5% (w/v) glucose/water mixture). Mice were tested on fixed-ratio (FR) and progressive-ratio (PR) schedules in the touchscreen-operant testing system. Under an FR schedule, both the rate of responding and the number of trials completed were higher in animals responding for strawberry milkshake versus super saccharin. Under a PR schedule, mice were willing to emit similar numbers of responses for strawberry milkshake and super saccharin; however, analysis of the rate of responding revealed a significantly higher rate of responding by animals reinforced with milkshake versus super saccharin. To determine the impact of reinforcer strength on cognitive performance, strawberry milkshake- and super saccharin-reinforced animals were compared on a touchscreen visual discrimination task. Animals reinforced by strawberry milkshake were significantly faster to acquire the discrimination than animals reinforced by super saccharin. Taken together, these results suggest that strawberry milkshake is superior to super saccharin for operant behavioral testing and further confirm that the application of response rate analysis to multiple ratio tasks is a highly sensitive method for the detection of behavioral differences relevant to learning and motivation.

  16. Autoshaping in micrencephalic rats.

    PubMed

    Goldstein, L H; Oakley, D A

    1989-06-01

    An autoshaping procedure in which the illumination of a lever was predictive of food reinforcement was used to compare learning in rats with micrencephaly induced by irradiation on the 16th day of gestation and in sham-irradiated controls. Both groups showed equivalent levels of lever-directed activity, and the micrencephalic animals differentiated as well as the control animals between the predictive lever and a nonpredictive lever. The micrencephalic animals were able to redistribute their lever-directed activity when the significance of the levers was reversed and did so more readily than the control animals. Results support the claim that association learning survives either traumatic or developmental neocortical damage and have implications for remedial procedures following both head injury and developmental cerebral pathology in humans.

  17. Online Learning of Genetic Network Programming and its Application to Prisoner’s Dilemma Game

    NASA Astrophysics Data System (ADS)

    Mabu, Shingo; Hirasawa, Kotaro; Hu, Jinglu; Murata, Junichi

    A new evolutionary model with a network structure, named Genetic Network Programming (GNP), has been proposed recently. GNP, an extension of GA and GP, represents solutions as a network structure and evolves them by “offline learning” (selection, mutation, crossover). GNP can memorize past action sequences in the network flow, so it can deal well with Partially Observable Markov Decision Processes (POMDPs). In this paper, in order to improve the ability of GNP, Q learning (an off-policy TD control algorithm), one of the well-known online methods, is introduced for online learning of GNP. Q learning is suitable for GNP because (1) in reinforcement learning, the rewards an agent will get in the future can be estimated, (2) TD control does not need much memory and can learn quickly, and (3) an off-policy method is suitable for searching for an optimal solution independently of the policy. Finally, in the simulations, online learning of GNP is applied to a player for the “Prisoner’s dilemma game” and its ability for online adaptation is confirmed.

  18. Fuzzy Sarsa with Focussed Replacing Eligibility Traces for Robust and Accurate Control

    NASA Astrophysics Data System (ADS)

    Kamdem, Sylvain; Ohki, Hidehiro; Sueda, Naomichi

    Several methods of reinforcement learning in continuous state and action spaces that utilize fuzzy logic have been proposed in recent years. This paper introduces Fuzzy Sarsa(λ), an on-policy algorithm for fuzzy learning that relies on a novel way of computing replacing eligibility traces to accelerate the policy evaluation. It is tested against several temporal difference learning algorithms: Sarsa(λ), Fuzzy Q(λ), an earlier fuzzy version of Sarsa, and an actor-critic algorithm. We perform detailed evaluations on two benchmark problems: a maze domain and the cart pole. Results of various tests highlight the strengths and weaknesses of these algorithms and show that Fuzzy Sarsa(λ) outperforms all other algorithms tested for a larger granularity of design and under noisy conditions. It is a highly competitive method of learning in realistic noisy domains where a denser fuzzy design over the state space is needed for more precise control.
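
    For reference, the replacing-eligibility-trace mechanism the record builds on, in its plain tabular Sarsa(λ) form (a textbook-style sketch, not the fuzzified algorithm proposed in the paper; table sizes and parameters are illustrative):

      import numpy as np

      def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, lam=0.9):
          """One on-policy Sarsa(lambda) update with replacing traces.

          Q : (n_states, n_actions) action-value table
          E : eligibility-trace table of the same shape
          """
          td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
          E *= gamma * lam           # decay all traces
          E[s, a] = 1.0              # replacing trace: reset rather than accumulate
          Q += alpha * td_error * E  # credit all recently visited state-action pairs
          return Q, E

      Q, E = np.zeros((25, 4)), np.zeros((25, 4))   # e.g. a small maze
      Q, E = sarsa_lambda_step(Q, E, s=0, a=1, r=0.0, s_next=5, a_next=2)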

  19. Life Span Differences in Electrophysiological Correlates of Monitoring Gains and Losses during Probabilistic Reinforcement Learning

    ERIC Educational Resources Information Center

    Hammerer, Dorothea; Li, Shu-Chen; Muller, Viktor; Lindenberger, Ulman

    2011-01-01

    By recording the feedback-related negativity (FRN) in response to gains and losses, we investigated the contribution of outcome monitoring mechanisms to age-associated differences in probabilistic reinforcement learning. Specifically, we assessed the difference of the monitoring reactions to gains and losses to investigate the monitoring of…

  20. Reinforcement Learning in Young Adults with Developmental Language Impairment

    ERIC Educational Resources Information Center

    Lee, Joanna C.; Tomblin, J. Bruce

    2012-01-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank, Seeberger, & O'Reilly, 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic…

  1. Effective Reinforcement Techniques in Elementary Physical Education: The Key to Behavior Management

    ERIC Educational Resources Information Center

    Downing, John; Keating, Tedd; Bennett, Carl

    2005-01-01

    The ability to shape appropriate behavior while extinguishing misbehavior is critical to teaching and learning in physical education. The scientific principles that affect student learning in the gymnasium also apply to the methods teachers use to influence social behaviors. Research indicates that reinforcement strategies are more effective than…

  2. A neural model of hierarchical reinforcement learning

    PubMed Central

    Rasmussen, Daniel; Eliasmith, Chris

    2017-01-01

    We develop a novel, biologically detailed neural model of reinforcement learning (RL) processes in the brain. This model incorporates a broad range of biological features that pose challenges to neural RL, such as temporally extended action sequences, continuous environments involving unknown time delays, and noisy/imprecise computations. Most significantly, we expand the model into the realm of hierarchical reinforcement learning (HRL), which divides the RL process into a hierarchy of actions at different levels of abstraction. Here we implement all the major components of HRL in a neural model that captures a variety of known anatomical and physiological properties of the brain. We demonstrate the performance of the model in a range of different environments, in order to emphasize the aim of understanding the brain’s general reinforcement learning ability. These results show that the model compares well to previous modelling work and demonstrates improved performance as a result of its hierarchical ability. We also show that the model’s behaviour is consistent with available data on human hierarchical RL, and generate several novel predictions. PMID:28683111

  3. Reinforcement Learning Trees

    PubMed Central

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R.

    2015-01-01

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings. PMID:26903687

  4. Reciprocity Family Counseling: A Multi-Ethnic Model.

    ERIC Educational Resources Information Center

    Penrose, David M.

    The Reciprocity Family Counseling Method involves learning principles of behavior modification including selective reinforcement, behavioral contracting, self-correction, and over-correction. Selective reinforcement refers to the recognition and modification of parent/child responses and reinforcers. Parents and children are asked to identify…

  5. Reinforcement active learning in the vibrissae system: optimal object localization.

    PubMed

    Gordon, Goren; Dorfman, Nimrod; Ahissar, Ehud

    2013-01-01

    Rats move their whiskers to acquire information about their environment. It has been observed that they palpate novel objects and objects they are required to localize in space. We analyze whisker-based object localization using two complementary paradigms, namely, active learning and intrinsic-reward reinforcement learning. Active learning algorithms select the next training samples according to the hypothesized solution in order to better discriminate between correct and incorrect labels. Intrinsic-reward reinforcement learning uses prediction errors as the reward to an actor-critic design, such that behavior converges to the one that optimizes the learning process. We show that in the context of object localization, the two paradigms result in palpation whisking as their respective optimal solution. These results suggest that rats may employ principles of active learning and/or intrinsic reward in tactile exploration and can guide future research to seek the underlying neuronal mechanisms that implement them. Furthermore, these paradigms are easily transferable to biomimetic whisker-based artificial sensors and can improve the active exploration of their environment. Copyright © 2012 Elsevier Ltd. All rights reserved.

  6. A comparison of differential reinforcement procedures with children with autism.

    PubMed

    Boudreau, Brittany A; Vladescu, Jason C; Kodak, Tiffany M; Argott, Paul J; Kisamore, April N

    2015-12-01

    The current evaluation compared the effects of 2 differential reinforcement arrangements and a nondifferential reinforcement arrangement on the acquisition of tacts for 3 children with autism. Participants learned in all reinforcement-based conditions, and we discuss areas for future research in light of these findings and potential limitations. © Society for the Experimental Analysis of Behavior.

  7. Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making.

    PubMed

    Schönberg, Tom; Daw, Nathaniel D; Joel, Daphna; O'Doherty, John P

    2007-11-21

    The computational framework of reinforcement learning has been used to forward our understanding of the neural mechanisms underlying reward learning and decision-making behavior. It is known that humans vary widely in their performance in decision-making tasks. Here, we used a simple four-armed bandit task in which subjects are almost evenly split into two groups on the basis of their performance: those who do learn to favor choice of the optimal action and those who do not. Using models of reinforcement learning we sought to determine the neural basis of these intrinsic differences in performance by scanning both groups with functional magnetic resonance imaging. We scanned 29 subjects while they performed the reward-based decision-making task. Our results suggest that these two groups differ markedly in the degree to which reinforcement learning signals in the striatum are engaged during task performance. While the learners showed robust prediction error signals in both the ventral and dorsal striatum during learning, the nonlearner group showed a marked absence of such signals. Moreover, the magnitude of prediction error signals in a region of dorsal striatum correlated significantly with a measure of behavioral performance across all subjects. These findings support a crucial role of prediction error signals, likely originating from dopaminergic midbrain neurons, in enabling learning of action selection preferences on the basis of obtained rewards. Thus, spontaneously observed individual differences in decision making performance demonstrate the suggested dependence of this type of learning on the functional integrity of the dopaminergic striatal system in humans.
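
    The model-based analysis referred to above typically rests on a simple prediction-error learner combined with a softmax choice rule, with the trial-by-trial prediction errors then regressed against striatal activity. A minimal illustrative sketch follows; the parameter values are arbitrary, not those fitted to the subjects.

      import numpy as np

      def softmax_choice(values, beta=3.0):
          """Sample an arm with probability proportional to exp(beta * value)."""
          p = np.exp(beta * (values - values.max()))
          p /= p.sum()
          return np.random.choice(len(values), p=p)

      def bandit_update(values, arm, reward, alpha=0.2):
          """Rescorla-Wagner update: move the chosen arm's value toward the
          obtained reward in proportion to the prediction error."""
          prediction_error = reward - values[arm]
          values[arm] += alpha * prediction_error
          return values, prediction_error

      values = np.zeros(4)                      # four-armed bandit
      arm = softmax_choice(values)
      values, pe = bandit_update(values, arm, reward=1.0)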

  8. Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin.

    PubMed

    Ezaki, Takahiro; Horita, Yutaka; Takezawa, Masanori; Masuda, Naoki

    2016-07-01

    Direct reciprocity, or repeated interaction, is a main mechanism to sustain cooperation under social dilemmas involving two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. Mechanisms underlying these behaviors largely remain unclear. Here we provide a proximate account for this behavior by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff is larger than a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results obtained in the present study are general in that they explain extant experimental results obtained for both so-called moody and non-moody conditional cooperation, prisoner's dilemma and public goods games, and well-mixed groups and networks. In contrast to previous theory, individuals are assumed to have no access to information about what other individuals are doing, such that they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity to cooperate is modulated in every discrete time step, explains the conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from Pavlov, a reinforcement learning strategy promoting mutual cooperation in two-player situations.
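
    A minimal sketch of aspiration learning for a binary cooperate/defect choice of the kind described above: the propensity to cooperate is reinforced or anti-reinforced according to whether the obtained payoff beats a fixed aspiration level. The function name, learning rate, and payoff values are illustrative assumptions, not the study's parameters.

      import random

      def aspiration_update(p_cooperate, action, payoff, aspiration=1.0, lr=0.3):
          """Update the unconditional probability of cooperating.

          If the payoff exceeds the aspiration level, the action just taken is
          reinforced (its probability moves toward 1); otherwise it is
          anti-reinforced (its probability moves toward 0).
          """
          satisfied = payoff > aspiration
          if action == "C":
              target = 1.0 if satisfied else 0.0
          else:  # "D": reinforcing defection means lowering p_cooperate
              target = 0.0 if satisfied else 1.0
          return p_cooperate + lr * (target - p_cooperate)

      # One illustrative round of a prisoner's dilemma (mutual cooperation pays 3).
      p = 0.5
      action = "C" if random.random() < p else "D"
      p = aspiration_update(p, action, payoff=3.0)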

  9. Social stress reactivity alters reward and punishment learning

    PubMed Central

    Frank, Michael J.; Allen, John J. B.

    2011-01-01

    To examine how stress affects cognitive functioning, individual differences in trait vulnerability (punishment sensitivity) and state reactivity (negative affect) to social evaluative threat were examined during concurrent reinforcement learning. Lower trait-level punishment sensitivity predicted better reward learning and poorer punishment learning; the opposite pattern was found in more punishment sensitive individuals. Increasing state-level negative affect was directly related to punishment learning accuracy in highly punishment sensitive individuals, but these measures were inversely related in less sensitive individuals. Combined electrophysiological measurement, performance accuracy and computational estimations of learning parameters suggest that trait and state vulnerability to stress alter cortico-striatal functioning during reinforcement learning, possibly mediated via medio-frontal cortical systems. PMID:20453038

  10. Social stress reactivity alters reward and punishment learning.

    PubMed

    Cavanagh, James F; Frank, Michael J; Allen, John J B

    2011-06-01

    To examine how stress affects cognitive functioning, individual differences in trait vulnerability (punishment sensitivity) and state reactivity (negative affect) to social evaluative threat were examined during concurrent reinforcement learning. Lower trait-level punishment sensitivity predicted better reward learning and poorer punishment learning; the opposite pattern was found in more punishment sensitive individuals. Increasing state-level negative affect was directly related to punishment learning accuracy in highly punishment sensitive individuals, but these measures were inversely related in less sensitive individuals. Combined electrophysiological measurement, performance accuracy and computational estimations of learning parameters suggest that trait and state vulnerability to stress alter cortico-striatal functioning during reinforcement learning, possibly mediated via medio-frontal cortical systems.

  11. Context change explains resurgence after the extinction of operant behavior

    PubMed Central

    Trask, Sydney; Schepers, Scott T.; Bouton, Mark E.

    2016-01-01

    Extinguished operant behavior can return or “resurge” when a response that has replaced it is also extinguished. Typically studied in nonhuman animals, the resurgence effect may provide insight into relapse that is seen when reinforcement is discontinued following human contingency management (CM) and functional communication training (FCT) treatments, which both involve reinforcing alternative behaviors to reduce behavioral excess. Although the variables that affect resurgence have been studied for some time, the mechanisms through which they promote relapse are still debated. We discuss three explanations of resurgence (response prevention, an extension of behavioral momentum theory, and an account emphasizing context change) as well as studies that evaluate them. Several new findings from our laboratory concerning the effects of different temporal distributions of the reinforcer during response elimination and the effects of manipulating qualitative features of the reinforcer pose a particular challenge to the momentum-based model. Overall, the results are consistent with a contextual account of resurgence, which emphasizes that reinforcers presented during response elimination have a discriminative role controlling behavioral inhibition. Changing the “reinforcer context” at the start of testing produces relapse if the organism has not learned to suppress its responding under conditions similar to the ones that prevail during testing. PMID:27429503

  12. Learned helplessness and learned prevalence: exploring the causal relations among perceived controllability, reward prevalence, and exploration.

    PubMed

    Teodorescu, Kinneret; Erev, Ido

    2014-10-01

    Exposure to uncontrollable outcomes has been found to trigger learned helplessness, a state in which the agent, because of lack of exploration, fails to take advantage of regained control. Although the implications of this phenomenon have been widely studied, its underlying cause remains undetermined. One can learn not to explore because the environment is uncontrollable, because the average reinforcement for exploring is low, or because rewards for exploring are rare. In the current research, we tested a simple experimental paradigm that contrasts the predictions of these three contributors and offers a unified psychological mechanism that underlies the observed phenomena. Our results demonstrate that learned helplessness is not correlated with either the perceived controllability of one's environment or the average reward, which suggests that reward prevalence is a better predictor of exploratory behavior than the other two factors. A simple computational model in which exploration decisions were based on small samples of past experiences captured the empirical phenomena while also providing a cognitive basis for feelings of uncontrollability. © The Author(s) 2014.
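
    The small-sample account above can be illustrated with a toy decision rule: before each trial the agent recalls a few randomly drawn past exploration outcomes and explores only if their average was rewarding. When rewards for exploring are rare, most small samples contain no success and exploration stops, regardless of controllability or the long-run average payoff. This is a hypothetical sketch, not the authors' fitted model.

      import random

      def decide_to_explore(past_exploration_payoffs, sample_size=3, threshold=0.0):
          """Explore if a small random sample of remembered exploration outcomes
          was, on average, better than the threshold; otherwise stay put."""
          if not past_exploration_payoffs:
              return True  # nothing learned yet, so explore
          sample = random.choices(past_exploration_payoffs, k=sample_size)
          return sum(sample) / sample_size > threshold

      print(decide_to_explore([0, 0, 0, 5, 0]))  # rare reward: usually False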

  13. Somatosensory Contribution to the Initial Stages of Human Motor Learning

    PubMed Central

    Bernardi, Nicolò F.; Darainy, Mohammad

    2015-01-01

    The early stages of motor skill acquisition are often marked by uncertainty about the sensory and motor goals of the task, as is the case in learning to speak or learning the feel of a good tennis serve. Here we present an experimental model of this early learning process, in which targets are acquired by exploration and reinforcement rather than sensory error. We use this model to investigate the relative contribution of motor and sensory factors to human motor learning. Participants make active reaching movements or matched passive movements to an unseen target using a robot arm. We find that learning through passive movements paired with reinforcement is comparable with learning associated with active movement, both in terms of magnitude and durability, with improvements due to training still observable at a 1 week retest. Motor learning is also accompanied by changes in somatosensory perceptual acuity. No stable changes in motor performance are observed for participants that train, actively or passively, in the absence of reinforcement, or for participants who are given explicit information about target position in the absence of somatosensory experience. These findings indicate that the somatosensory system dominates learning in the early stages of motor skill acquisition. SIGNIFICANCE STATEMENT The research focuses on the initial stages of human motor learning, introducing a new experimental model that closely approximates the key features of motor learning outside of the laboratory. The finding indicates that it is the somatosensory system rather than the motor system that dominates learning in the early stages of motor skill acquisition. This is important given that most of our computational models of motor learning are based on the idea that learning is motoric in origin. This is also a valuable finding for rehabilitation of patients with limited mobility as it shows that reinforcement in conjunction with passive movement results in benefits to motor learning that are as great as those observed for active movement training. PMID:26490869

  14. Substantia nigra activity level predicts trial-to-trial adjustments in cognitive control

    PubMed Central

    Boehler, C.N.; Bunzeck, N.; Krebs, R.M.; Noesselt, T.; Schoenfeld, M.A.; Heinze, H.-J.; Münte, T.F.; Woldorff, M.G.; Hopf, J.-M.

    2011-01-01

    Effective adaptation to the demands of a changing environment requires flexible cognitive control. The medial and lateral frontal cortices are involved in such control processes, putatively in close interplay with the basal ganglia. In particular, dopaminergic projections from the midbrain (i.e., from the substantia nigra (SN) and the ventral tegmental area (VTA)) have been proposed to play a pivotal role in modulating the activity in these areas for cognitive control purposes. In that dopaminergic involvement has been strongly implicated in reinforcement learning, these ideas suggest functional links between reinforcement learning, where the outcome of actions shapes behavior over time, and cognitive control in a more general context, where no direct reward is involved. Here, we provide evidence from functional MRI in humans that activity in the SN predicts systematic subsequent trial-to-trial response time (RT) prolongations that are thought to reflect cognitive control in a Stop-signal paradigm. In particular, variations in the activity level of the SN in one trial predicted the degree of RT prolongation on the subsequent trial, consistent with a modulating output signal from the SN being involved in enhancing cognitive control. This link between SN activity and subsequent behavioral adjustments lends support to theoretical accounts that propose dopaminergic control signals that shape behavior both in the presence and absence of direct reward. This SN-based modulatory mechanism is presumably mediated via a wider network that determines response speed in this task, including frontal and parietal control regions, along with the basal ganglia and the associated subthalamic nucleus. PMID:20465358

  15. Reinforcement-learning-based output-feedback control of nonstrict nonlinear discrete-time systems with application to engine emission control.

    PubMed

    Shih, Peter; Kaul, Brian C; Jagannathan, Sarangapani; Drallmeier, James A

    2009-10-01

    A novel reinforcement-learning-based output adaptive neural network (NN) controller, which is also referred to as the adaptive-critic NN controller, is developed to deliver the desired tracking performance for a class of nonlinear discrete-time systems expressed in nonstrict feedback form in the presence of bounded and unknown disturbances. The adaptive-critic NN controller consists of an observer, a critic, and two action NNs. The observer estimates the states and output, and the two action NNs provide virtual and actual control inputs to the nonlinear discrete-time system. The critic approximates a certain strategic utility function, and the action NNs minimize the strategic utility function and control inputs. All NN weights adapt online toward minimization of a performance index, utilizing the gradient-descent-based rule, in contrast with iteration-based adaptive-critic schemes. Lyapunov functions are used to show the stability of the closed-loop tracking error, weights, and observer estimates. Separation and certainty equivalence principles, persistency of excitation condition, and linearity in the unknown parameter assumption are not needed. Experimental results on a spark ignition (SI) engine operating lean at an equivalence ratio of 0.75 show a significant (25%) reduction in cyclic dispersion in heat release with control, while the average fuel input changes by less than 1% compared with the uncontrolled case. Consequently, oxides of nitrogen (NO(x)) drop by 30%, and unburned hydrocarbons drop by 16% with control. Overall, NO(x)'s are reduced by over 80% compared with stoichiometric levels.

  16. Somatic and Reinforcement-Based Plasticity in the Initial Stages of Human Motor Learning.

    PubMed

    Sidarta, Ananda; Vahdat, Shahabeddin; Bernardi, Nicolò F; Ostry, David J

    2016-11-16

    As one learns to dance or play tennis, the desired somatosensory state is typically unknown. Trial and error is important as motor behavior is shaped by successful and unsuccessful movements. As an experimental model, we designed a task in which human participants make reaching movements to a hidden target and receive positive reinforcement when successful. We identified somatic and reinforcement-based sources of plasticity on the basis of changes in functional connectivity using resting-state fMRI before and after learning. The neuroimaging data revealed reinforcement-related changes in both motor and somatosensory brain areas in which a strengthening of connectivity was related to the amount of positive reinforcement during learning. Areas of prefrontal cortex were similarly altered in relation to reinforcement, with connectivity between sensorimotor areas of putamen and the reward-related ventromedial prefrontal cortex strengthened in relation to the amount of successful feedback received. In other analyses, we assessed connectivity related to changes in movement direction between trials, a type of variability that presumably reflects exploratory strategies during learning. We found that connectivity in a network linking motor and somatosensory cortices increased with trial-to-trial changes in direction. Connectivity varied as well with the change in movement direction following incorrect movements. Here the changes were observed in a somatic memory and decision making network involving ventrolateral prefrontal cortex and second somatosensory cortex. Our results point to the idea that the initial stages of motor learning are not wholly motor but rather involve plasticity in somatic and prefrontal networks related both to reward and exploration. In the initial stages of motor learning, the placement of the limbs is learned primarily through trial and error. In an experimental analog, participants make reaching movements to a hidden target and receive positive feedback when successful. We identified sources of plasticity based on changes in functional connectivity using resting-state fMRI. The main finding is that there is a strengthening of connectivity between reward-related prefrontal areas and sensorimotor areas in the basal ganglia and frontal cortex. There is also a strengthening of connectivity related to movement exploration in sensorimotor circuits involved in somatic memory and decision making. The results indicate that initial stages of motor learning depend on plasticity in somatic and prefrontal networks related to reward and exploration. Copyright © 2016 the authors 0270-6474/16/3611682-11$15.00/0.

  17. Somatic and Reinforcement-Based Plasticity in the Initial Stages of Human Motor Learning

    PubMed Central

    Sidarta, Ananda; Vahdat, Shahabeddin; Bernardi, Nicolò F.

    2016-01-01

    As one learns to dance or play tennis, the desired somatosensory state is typically unknown. Trial and error is important as motor behavior is shaped by successful and unsuccessful movements. As an experimental model, we designed a task in which human participants make reaching movements to a hidden target and receive positive reinforcement when successful. We identified somatic and reinforcement-based sources of plasticity on the basis of changes in functional connectivity using resting-state fMRI before and after learning. The neuroimaging data revealed reinforcement-related changes in both motor and somatosensory brain areas in which a strengthening of connectivity was related to the amount of positive reinforcement during learning. Areas of prefrontal cortex were similarly altered in relation to reinforcement, with connectivity between sensorimotor areas of putamen and the reward-related ventromedial prefrontal cortex strengthened in relation to the amount of successful feedback received. In other analyses, we assessed connectivity related to changes in movement direction between trials, a type of variability that presumably reflects exploratory strategies during learning. We found that connectivity in a network linking motor and somatosensory cortices increased with trial-to-trial changes in direction. Connectivity varied as well with the change in movement direction following incorrect movements. Here the changes were observed in a somatic memory and decision making network involving ventrolateral prefrontal cortex and second somatosensory cortex. Our results point to the idea that the initial stages of motor learning are not wholly motor but rather involve plasticity in somatic and prefrontal networks related both to reward and exploration. SIGNIFICANCE STATEMENT In the initial stages of motor learning, the placement of the limbs is learned primarily through trial and error. In an experimental analog, participants make reaching movements to a hidden target and receive positive feedback when successful. We identified sources of plasticity based on changes in functional connectivity using resting-state fMRI. The main finding is that there is a strengthening of connectivity between reward-related prefrontal areas and sensorimotor areas in the basal ganglia and frontal cortex. There is also a strengthening of connectivity related to movement exploration in sensorimotor circuits involved in somatic memory and decision making. The results indicate that initial stages of motor learning depend on plasticity in somatic and prefrontal networks related to reward and exploration. PMID:27852776

  18. A learning-based semi-autonomous controller for robotic exploration of unknown disaster scenes while searching for victims.

    PubMed

    Doroodgar, Barzin; Liu, Yugang; Nejat, Goldie

    2014-12-01

    Semi-autonomous control schemes can address the limitations of both teleoperation and fully autonomous robotic control of rescue robots in disaster environments by allowing a human operator to cooperate and share such tasks with a rescue robot as navigation, exploration, and victim identification. In this paper, we present a unique hierarchical reinforcement learning-based semi-autonomous control architecture for rescue robots operating in cluttered and unknown urban search and rescue (USAR) environments. The aim of the controller is to enable a rescue robot to continuously learn from its own experiences in an environment in order to improve its overall performance in exploration of unknown disaster scenes. A direction-based exploration technique is integrated in the controller to expand the search area of the robot via the classification of regions and the rubble piles within these regions. Both simulations and physical experiments in USAR-like environments verify the robustness of the proposed HRL-based semi-autonomous controller to unknown cluttered scenes with different sizes and varying types of configurations.

  19. Army Training Study: Training Effectiveness Analysis (TEA) Summary. Volume 4. Ordnance, Signal, CAMMS (Computer Assisted Map Maneuver System).

    DTIC Science & Technology

    1978-08-08

    learning can be reinforced on the job because individuals at all aptitude and experience levels investigated have the ability to be successful... both game board and brigade controllers provided realistic feedback and guidance to the command group players. A second major function of the brigade... and experimental measures. The brigade level data collectors and game board players were controllers provided by the participating units’ parent

  20. Dissociation between Active and Observational Learning from Positive and Negative Feedback in Parkinsonism

    PubMed Central

    Kobza, Stefan; Ferrea, Stefano; Schnitzler, Alfons; Pollok, Bettina

    2012-01-01

    Feedback to both actively performed and observed behaviour allows adaptation of future actions. Positive feedback leads to increased activity of dopamine neurons in the substantia nigra, whereas dopamine neuron activity is decreased following negative feedback. Dopamine level reduction in unmedicated Parkinson’s Disease patients has been shown to lead to a negative learning bias, i.e. enhanced learning from negative feedback. Recent findings suggest that the neural mechanisms of active and observational learning from feedback might differ, with the striatum playing a less prominent role in observational learning. Therefore, it was hypothesized that unmedicated Parkinson’s Disease patients would show a negative learning bias only in active but not in observational learning. In a between-group design, 19 Parkinson’s Disease patients and 40 healthy controls engaged in either an active or an observational probabilistic feedback-learning task. For both tasks, transfer phases aimed to assess the bias to learn better from positive or negative feedback. As expected, actively learning patients showed a negative learning bias, whereas controls learned better from positive feedback. In contrast, no difference between patients and controls emerged for observational learning, with both groups showing better learning from positive feedback. These findings add to neural models of reinforcement-learning by suggesting that dopamine-modulated input to the striatum plays a minor role in observational learning from feedback. Future research will have to elucidate the specific neural underpinnings of observational learning. PMID:23185586

  1. Discriminative stimuli that control instrumental tobacco-seeking by human smokers also command selective attention.

    PubMed

    Hogarth, Lee; Dickinson, Anthony; Duka, Theodora

    2003-08-01

    Incentive salience theory states that acquired bias in selective attention for stimuli associated with tobacco-smoke reinforcement controls the selective performance of tobacco-seeking and tobacco-taking behaviour. To support this theory, we assessed whether a stimulus that had acquired control of a tobacco-seeking response in a discrimination procedure would command the focus of visual attention in a subsequent test phase. Smokers received discrimination training in which an instrumental key-press response was followed by tobacco-smoke reinforcement when one visual discriminative stimulus (S+) was present, but not when another stimulus (S-) was present. The skin conductance response to the S+ and S- assessed whether Pavlovian conditioning to the S+ had taken place. In a subsequent test phase, the S+ and S- were presented in the dot-probe task and the allocation of the focus of visual attention to these stimuli was measured. Participants learned to perform the instrumental tobacco-seeking response selectively in the presence of the S+ relative to the S-, and showed a greater skin conductance response to the S+ than the S-. In the subsequent test phase, participants allocated the focus of visual attention to the S+ in preference to the S-. Correlation analysis revealed that the visual attentional bias for the S+ was positively associated with the number of times the S+ had been paired with tobacco-smoke in training, with the skin conductance response to the S+, and with subjective craving to smoke. Furthermore, increased exposure to tobacco-smoke in the natural environment was associated with reduced discrimination learning. These data demonstrate that discriminative stimuli that signal that tobacco-smoke reinforcement is available acquire the capacity to command selective attention and to elicit instrumental tobacco-seeking behaviour.

  2. Neuromuscular control of the point to point and oscillatory movements of a sagittal arm with the actor-critic reinforcement learning method.

    PubMed

    Golkhou, Vahid; Parnianpour, Mohamad; Lucas, Caro

    2005-04-01

    In this study, we used a single-link system with a pair of muscles excited by alpha and gamma signals to achieve both point-to-point and oscillatory movements with variable amplitude and frequency. The system is highly nonlinear in all its physical and physiological attributes. The major physiological characteristics of this system are the simultaneous activation of a pair of nonlinear muscle-like actuators for control purposes, the presence of nonlinear spindle-like sensors and a Golgi tendon organ-like sensor, and the actions of gravity and external loading. Transmission delays are included in the afferent and efferent neural paths for a more accurate representation of the reflex loops. A reinforcement learning method with an actor-critic (AC) architecture, standing in for the middle and low levels of the central nervous system (CNS), is used to track a desired trajectory. The actor in this structure is a two-layer feedforward neural network, and the critic is a model of the cerebellum trained by the state-action-reward-state-action (SARSA) method. The critic then trains the actor by supervised learning based on prior experiences. Simulation studies of oscillatory movements based on the proposed algorithm demonstrate excellent tracking capability: after 280 epochs, the RMS errors for the position and velocity profiles were 0.02 rad and 0.04 rad/s, respectively.
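
    The actor-critic arrangement described above can be reduced to a generic tabular skeleton: the critic learns state values from a temporal-difference error, and the same error nudges the actor's action preferences. The study's actual actor was a feedforward network and its critic a cerebellar model, so everything below (tables, sizes, learning rates) is an illustrative simplification.

      import numpy as np

      def actor_critic_step(V, prefs, s, a, r, s_next, alpha_critic=0.1, alpha_actor=0.05, gamma=0.99):
          """One tabular actor-critic update.

          V     : state-value table maintained by the critic
          prefs : (n_states, n_actions) action-preference table for the actor
          """
          td_error = r + gamma * V[s_next] - V[s]   # critic's evaluation of the move
          V[s] += alpha_critic * td_error           # critic learns state values
          prefs[s, a] += alpha_actor * td_error     # actor reinforced by the critic
          return V, prefs

      V, prefs = np.zeros(10), np.zeros((10, 2))
      V, prefs = actor_critic_step(V, prefs, s=0, a=1, r=0.5, s_next=1)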

  3. What is the optimal task difficulty for reinforcement learning of brain self-regulation?

    PubMed

    Bauer, Robert; Vukelić, Mathias; Gharabaghi, Alireza

    2016-09-01

    The balance between action and reward during neurofeedback may influence reinforcement learning of brain self-regulation. Eleven healthy volunteers participated in three runs of motor imagery-based brain-machine interface feedback where a robot passively opened the hand contingent to β-band modulation. For each run, the β-desynchronization threshold to initiate the hand robot movement increased in difficulty (low, moderate, and demanding). In this context, the incentive to learn was estimated by the change of reward per action, operationalized as the change in reward duration per movement onset. Variance analysis revealed a significant interaction between threshold difficulty and the relationship between reward duration and number of movement onsets (p<0.001), indicating a negative learning incentive for low difficulty, but a positive learning incentive for moderate and demanding runs. Exploration of different thresholds in the same data set indicated that the learning incentive peaked at higher thresholds than the threshold which resulted in maximum classification accuracy. Specificity is more important than sensitivity of neurofeedback for reinforcement learning of brain self-regulation. Learning efficiency requires adequate challenge by neurofeedback interventions. Copyright © 2016 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.

  4. The Computational Development of Reinforcement Learning during Adolescence

    PubMed Central

    Palminteri, Stefano; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-01-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents’ behaviour was better explained by a basic reinforcement learning algorithm, adults’ behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence. PMID:27322574
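
    The counterfactual-learning module contrasted above with basic reinforcement learning can be sketched as a second value update applied to the unchosen option whenever complete feedback is available. This is an illustrative fragment only; the authors' full model also includes a value-contextualisation component not shown here.

      def counterfactual_update(values, chosen, unchosen, outcome_chosen,
                                outcome_unchosen=None, alpha=0.3):
          """Update option values from factual and, when present, counterfactual feedback.

          values           : dict mapping option -> learned value
          outcome_unchosen : outcome of the unchosen option (None for partial feedback)
          """
          values[chosen] += alpha * (outcome_chosen - values[chosen])
          if outcome_unchosen is not None:  # complete feedback engages the module
              values[unchosen] += alpha * (outcome_unchosen - values[unchosen])
          return values

      values = {"left": 0.0, "right": 0.0}
      values = counterfactual_update(values, "left", "right",
                                     outcome_chosen=1.0, outcome_unchosen=-1.0)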

  5. Tracked robot controllers for climbing obstacles autonomously

    NASA Astrophysics Data System (ADS)

    Vincent, Isabelle

    2009-05-01

    Research in mobile robot navigation has demonstrated some success in navigating flat indoor environments while avoiding obstacles. However, the challenge of analyzing complex environments to climb obstacles autonomously has seen very little success due to the complexity of the task. Unmanned ground vehicles currently exhibit simple autonomous behaviours compared to the human ability to move in the world. This paper presents the control algorithms designed for a tracked mobile robot to autonomously climb obstacles by varying its track configuration. Two control algorithms are proposed to solve the autonomous locomotion problem for climbing obstacles. First, a reactive controller evaluates the appropriate geometric configuration based on terrain and vehicle geometric considerations. Then, a reinforcement learning algorithm finds alternative solutions when the reactive controller gets stuck while climbing an obstacle. The methodology combines reactivity with learning. The controllers have been demonstrated in box- and stair-climbing simulations. The experiments illustrate the effectiveness of the proposed approach for crossing obstacles.

  6. Coexistence of Reward and Unsupervised Learning During the Operant Conditioning of Neural Firing Rates

    PubMed Central

    Kerr, Robert R.; Grayden, David B.; Thomas, Doreen A.; Gilson, Matthieu; Burkitt, Anthony N.

    2014-01-01

    A fundamental goal of neuroscience is to understand how cognitive processes, such as operant conditioning, are performed by the brain. Typical and well studied examples of operant conditioning, in which the firing rates of individual cortical neurons in monkeys are increased using rewards, provide an opportunity for insight into this. Studies of reward-modulated spike-timing-dependent plasticity (RSTDP), and of other models such as R-max, have reproduced this learning behavior, but they have assumed that no unsupervised learning is present (i.e., no learning occurs without, or independent of, rewards). We show that these models cannot elicit firing rate reinforcement while exhibiting both reward learning and ongoing, stable unsupervised learning. To fix this issue, we propose a new RSTDP model of synaptic plasticity based upon the observed effects that dopamine has on long-term potentiation and depression (LTP and LTD). We show, both analytically and through simulations, that our new model can exhibit unsupervised learning and lead to firing rate reinforcement. This requires that the strengthening of LTP by the reward signal is greater than the strengthening of LTD and that the reinforced neuron exhibits irregular firing. We show the robustness of our findings to spike-timing correlations, to the synaptic weight dependence that is assumed, and to changes in the mean reward. We also consider our model in the differential reinforcement of two nearby neurons. Our model aligns more strongly with experimental studies than previous models and makes testable predictions for future experiments. PMID:24475240
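
    The key asymmetry proposed above, reward strengthening LTP more than LTD, can be written as a toy pair-based weight-change rule. The time constants, amplitudes, and modulation factors below are invented for illustration and are not the paper's fitted values.

      import numpy as np

      def rstdp_weight_change(delta_t, reward, a_ltp=0.010, a_ltd=0.012,
                              tau=20.0, k_ltp=2.0, k_ltd=1.0):
          """Weight change for one pre/post spike pair under reward-modulated STDP
          in which reward amplifies potentiation (k_ltp) more than depression (k_ltd).

          delta_t : t_post - t_pre in ms (positive pairings potentiate)
          reward  : scalar reward signal (0 means no reward)
          """
          if delta_t >= 0:
              return a_ltp * (1.0 + k_ltp * reward) * np.exp(-delta_t / tau)
          return -a_ltd * (1.0 + k_ltd * reward) * np.exp(delta_t / tau)

      print(rstdp_weight_change(delta_t=10.0, reward=1.0))   # rewarded potentiation
      print(rstdp_weight_change(delta_t=-10.0, reward=1.0))  # rewarded depression, smaller boost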

  7. Oculomotor learning revisited: a model of reinforcement learning in the basal ganglia incorporating an efference copy of motor actions

    PubMed Central

    Fee, Michale S.

    2012-01-01

    In its simplest formulation, reinforcement learning is based on the idea that if an action taken in a particular context is followed by a favorable outcome, then, in the same context, the tendency to produce that action should be strengthened, or reinforced. While reinforcement learning forms the basis of many current theories of basal ganglia (BG) function, these models do not incorporate distinct computational roles for signals that convey context, and those that convey what action an animal takes. Recent experiments in the songbird suggest that vocal-related BG circuitry receives two functionally distinct excitatory inputs. One input is from a cortical region that carries context information about the current “time” in the motor sequence. The other is an efference copy of motor commands from a separate cortical brain region that generates vocal variability during learning. Based on these findings, I propose here a general model of vertebrate BG function that combines context information with a distinct motor efference copy signal. The signals are integrated by a learning rule in which efference copy inputs gate the potentiation of context inputs (but not efference copy inputs) onto medium spiny neurons in response to a rewarded action. The hypothesis is described in terms of a circuit that implements the learning of visually guided saccades. The model makes testable predictions about the anatomical and functional properties of hypothesized context and efference copy inputs to the striatum from both thalamic and cortical sources. PMID:22754501

  8. Oculomotor learning revisited: a model of reinforcement learning in the basal ganglia incorporating an efference copy of motor actions.

    PubMed

    Fee, Michale S

    2012-01-01

    In its simplest formulation, reinforcement learning is based on the idea that if an action taken in a particular context is followed by a favorable outcome, then, in the same context, the tendency to produce that action should be strengthened, or reinforced. While reinforcement learning forms the basis of many current theories of basal ganglia (BG) function, these models do not incorporate distinct computational roles for signals that convey context, and those that convey what action an animal takes. Recent experiments in the songbird suggest that vocal-related BG circuitry receives two functionally distinct excitatory inputs. One input is from a cortical region that carries context information about the current "time" in the motor sequence. The other is an efference copy of motor commands from a separate cortical brain region that generates vocal variability during learning. Based on these findings, I propose here a general model of vertebrate BG function that combines context information with a distinct motor efference copy signal. The signals are integrated by a learning rule in which efference copy inputs gate the potentiation of context inputs (but not efference copy inputs) onto medium spiny neurons in response to a rewarded action. The hypothesis is described in terms of a circuit that implements the learning of visually guided saccades. The model makes testable predictions about the anatomical and functional properties of hypothesized context and efference copy inputs to the striatum from both thalamic and cortical sources.

  9. Histidine-decarboxylase knockout mice show deficient nonreinforced episodic object memory, improved negatively reinforced water-maze performance, and increased neo- and ventro-striatal dopamine turnover.

    PubMed

    Dere, Ekrem; De Souza-Silva, Maria A; Topic, Bianca; Spieler, Richard E; Haas, Helmut L; Huston, Joseph P

    2003-01-01

    The brain's histaminergic system has been implicated in hippocampal synaptic plasticity, learning, and memory, as well as brain reward and reinforcement. Our past pharmacological and lesion studies indicated that the brain's histamine system exerts inhibitory effects on the brain's reinforcement and reward systems, reciprocal to mesolimbic dopamine systems, thereby modulating learning and memory performance. Given the close functional relationship between brain reinforcement and memory processes, the total disruption of brain histamine synthesis via genetic disruption of its synthesizing enzyme, histidine decarboxylase (HDC), in the mouse might have differential effects on learning depending on the task-inherent reinforcement contingencies. Here, we investigated the effects of an HDC gene disruption in the mouse in a nonreinforced object exploration task and a negatively reinforced water-maze task as well as on neo- and ventro-striatal dopamine systems known to be involved in brain reward and reinforcement. Histidine decarboxylase knockout (HDC-KO) mice had higher dihydroxyphenylacetic acid concentrations and a higher dihydroxyphenylacetic acid/dopamine ratio in the neostriatum. In the ventral striatum, dihydroxyphenylacetic acid/dopamine and 3-methoxytyramine/dopamine ratios were higher in HDC-KO mice. Furthermore, the HDC-KO mice showed improved water-maze performance during both hidden and cued platform tasks, but deficient object discrimination based on temporal relationships. Our data imply that disruption of brain histamine synthesis can have both memory-promoting and memory-suppressive effects via distinct and independent mechanisms and further indicate that these opposed effects are related to the task-inherent reinforcement contingencies.

  10. Dorsal Striatal-Midbrain Connectivity in Humans Predicts How Reinforcements Are Used to Guide Decisions

    ERIC Educational Resources Information Center

    Kahnt, Thorsten; Park, Soyoung Q.; Cohen, Michael X.; Beck, Anne; Heinz, Andreas; Wrase, Jana

    2009-01-01

    It has been suggested that the target areas of dopaminergic midbrain neurons, the dorsal (DS) and ventral striatum (VS), are differentially involved in reinforcement learning, notably as actor and critic. Whereas the critic learns to predict rewards, the actor maintains action values to guide future decisions. The different midbrain connections to…
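
    The actor/critic division of labor sketched in this record can be illustrated with a minimal tabular example: the critic learns state values (reward predictions) while the actor accumulates action preferences, both driven by the same prediction error. The code below is a generic sketch under these assumptions, not the specific model tested in the study.

      import numpy as np

      n_states, n_actions = 4, 2
      V = np.zeros(n_states)               # critic: predicted reward per state
      H = np.zeros((n_states, n_actions))  # actor: action preferences
      alpha_critic, alpha_actor, gamma = 0.1, 0.1, 0.9

      def td_step(s, a, r, s_next):
          """One learning step after taking action a in state s and landing in s_next."""
          delta = r + gamma * V[s_next] - V[s]   # dopamine-like prediction error
          V[s] += alpha_critic * delta           # critic update (reward prediction)
          H[s, a] += alpha_actor * delta         # actor update (action preference)
          return delta

      def choose_action(s, temperature=1.0):
          """Softmax over the actor's preferences for state s."""
          p = np.exp(H[s] / temperature)
          return np.random.choice(n_actions, p=p / p.sum())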

  11. Autonomous Inter-Task Transfer in Reinforcement Learning Domains

    DTIC Science & Technology

    2008-08-01

    Excerpted fragments only (table-of-contents and reference entries): the report addresses autonomous inter-task transfer in reinforcement learning, covering value functions, artificial neural networks, and instance-based methods, and cites multitask reinforcement learning work (Tanaka and Yamamura, IJCAI 2007); it notes that transfer learning (TL) for RL tasks has only recently been gaining attention in artificial intelligence.

  12. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia.

    PubMed

    Markou, Athina; Salamone, John D; Bussey, Timothy J; Mar, Adam C; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-11-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu) initiative. A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. Copyright © 2013 Elsevier Ltd. All rights reserved.

  13. Goal-Directed and Habit-Like Modulations of Stimulus Processing during Reinforcement Learning.

    PubMed

    Luque, David; Beesley, Tom; Morris, Richard W; Jack, Bradley N; Griffiths, Oren; Whitford, Thomas J; Le Pelley, Mike E

    2017-03-15

    Recent research has shown that perceptual processing of stimuli previously associated with high-value rewards is automatically prioritized even when rewards are no longer available. It has been hypothesized that such reward-related modulation of stimulus salience is conceptually similar to an "attentional habit." Recording event-related potentials in humans during a reinforcement learning task, we show strong evidence in favor of this hypothesis. Resistance to outcome devaluation (the defining feature of a habit) was shown by the stimulus-locked P1 component, reflecting activity in the extrastriate visual cortex. Analysis at longer latencies revealed a positive component (corresponding to the P3b, from 550-700 ms) sensitive to outcome devaluation. Therefore, distinct spatiotemporal patterns of brain activity were observed corresponding to habitual and goal-directed processes. These results demonstrate that reinforcement learning engages both attentional habits and goal-directed processes in parallel. Consequences for brain and computational models of reinforcement learning are discussed. SIGNIFICANCE STATEMENT The human attentional network adapts to detect stimuli that predict important rewards. A recent hypothesis suggests that the visual cortex automatically prioritizes reward-related stimuli, driven by cached representations of reward value; that is, stimulus-response habits. Alternatively, the neural system may track the current value of the predicted outcome. Our results demonstrate for the first time that visual cortex activity is increased for reward-related stimuli even when the rewarding event is temporarily devalued. In contrast, longer-latency brain activity was specifically sensitive to transient changes in reward value. Therefore, we show that both habit-like attention and goal-directed processes occur in the same learning episode at different latencies. This result has important consequences for computational models of reinforcement learning. Copyright © 2017 the authors 0270-6474/17/373009-09$15.00/0.

  14. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia

    PubMed Central

    Markou, Athina; Salamone, John D.; Bussey, Timothy; Mar, Adam; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-01-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. PMID:23994273

  15. Feature Reinforcement Learning: Part I. Unstructured MDPs

    NASA Astrophysics Data System (ADS)

    Hutter, Marcus

    2009-12-01

    General-purpose, intelligent, learning agents cycle through sequences of observations, actions, and rewards that are complex, uncertain, unknown, and non-Markovian. On the other hand, reinforcement learning is well-developed for small finite state Markov decision processes (MDPs). Up to now, extracting the right state representations out of bare observations, that is, reducing the general agent setup to the MDP framework, is an art that involves significant effort by designers. The primary goal of this work is to automate the reduction process and thereby significantly expand the scope of many existing reinforcement learning algorithms and the agents that employ them. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in Part II (Hutter, 2009c). The role of POMDPs is also considered there.
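
    One way to picture such a formal criterion is as a score over candidate maps from observation histories to MDP states. The sketch below is a simple stand-in, not Hutter's actual cost function: it scores a candidate map by how well the induced states predict the observed rewards, penalized by the number of states; all names are illustrative.

      from collections import defaultdict
      import math

      def score_state_map(episodes, phi, penalty_per_state=1.0):
          """episodes -- iterable of (history, reward) pairs
             phi      -- candidate map from an observation history to a discrete state
             Returns a score (lower is better): negative log-likelihood of the rewards
             under the induced states, plus a penalty for the number of states used."""
          counts = defaultdict(lambda: defaultdict(int))
          for history, reward in episodes:
              counts[phi(history)][reward] += 1
          nll = 0.0
          for reward_counts in counts.values():
              total = sum(reward_counts.values())
              for c in reward_counts.values():
                  nll -= c * math.log(c / total)
          return nll + penalty_per_state * len(counts)

    A candidate map that explains the rewards well with few states receives a low score, which is the flavor of trade-off an automated reduction from observations to an MDP has to resolve.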

  16. Impairments in learning by monetary rewards and alcohol-associated rewards in detoxified alcoholic patients.

    PubMed

    Jokisch, Daniel; Roser, Patrik; Juckel, Georg; Daum, Irene; Bellebaum, Christian

    2014-07-01

    Excessive alcohol consumption has been linked to structural and functional brain changes associated with cognitive, emotional, and behavioral impairments. It has been suggested that neural processing in the reward system is also affected by alcoholism. The present study aimed at further investigating reward-based associative learning and reversal learning in detoxified alcohol-dependent patients. Twenty-one detoxified alcohol-dependent patients and 26 healthy control subjects participated in a probabilistic learning task using monetary and alcohol-associated rewards as feedback stimuli indicating correct responses. Performance during acquisition and reversal learning in the different feedback conditions was analyzed. Alcohol-dependent patients and healthy control subjects showed an increase in learning performance over learning blocks during acquisition, with learning performance being significantly lower in alcohol-dependent patients. After changing the contingencies, alcohol-dependent patients exhibited impaired reversal learning and showed, in contrast to healthy controls, different learning curves for different types of rewards with no increase in performance for high monetary and alcohol-associated feedback. The present findings provide evidence that dysfunctional processing in the reward system in alcohol-dependent patients leads to alterations in reward-based learning, resulting in generally reduced performance. In addition, the results suggest that alcohol-dependent patients are particularly impaired in changing an established behavior originally reinforced by high rewards. Copyright © 2014 by the Research Society on Alcoholism.

  17. Pleasurable music affects reinforcement learning according to the listener

    PubMed Central

    Gold, Benjamin P.; Frank, Michael J.; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  18. Using Collective Intelligence to Route Internet Traffic

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Tumer, Kagan; Frank, Jeremy

    1998-01-01

    A Collective Intelligence (COIN) is a community of interacting reinforcement learning (RL) algorithms designed so that their collective behavior maximizes a global utility function. We introduce the theory of COINs, then present experiments using that theory to design COINs to control internet traffic routing. These experiments indicate that COINs outperform the RL-based routing systems investigated previously.

  19. Representation of aversive prediction errors in the human periaqueductal gray

    PubMed Central

    Roy, Mathieu; Shohamy, Daphna; Daw, Nathaniel; Jepma, Marieke; Wimmer, Elliott; Wager, Tor D.

    2014-01-01

    Pain is a primary driver of learning and motivated action. It is also a target of learning, as nociceptive brain responses are shaped by learning processes. We combined an instrumental pain avoidance task with an axiomatic approach to assessing fMRI signals related to prediction errors (PEs), which drive reinforcement-based learning. We found that pain PEs were encoded in the periaqueductal gray (PAG), an important structure for pain control and learning in animal models. Axiomatic tests combined with dynamic causal modeling suggested that ventromedial prefrontal cortex, supported by putamen, provides an expected value-related input to the PAG, which then conveys PE signals to prefrontal regions important for behavioral regulation, including orbitofrontal, anterior mid-cingulate, and dorsomedial prefrontal cortices. Thus, pain-related learning involves distinct neural circuitry, with implications for behavior and pain dynamics. PMID:25282614

  20. Reinforcement-learning-based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation.

    PubMed

    Shih, Peter; Kaul, Brian C; Jagannathan, S; Drallmeier, James A

    2008-08-01

    A novel adaptive neural network (NN) controller based on a reinforcement-learning dual-control methodology is developed to deliver a desired tracking performance for a class of complex feedback nonlinear discrete-time systems, which consists of a second-order nonlinear discrete-time system in nonstrict-feedback form and an affine nonlinear discrete-time system, in the presence of bounded and unknown disturbances. For example, the exhaust gas recirculation (EGR) operation of a spark ignition (SI) engine is modeled by using such a complex nonlinear discrete-time system. A dual-controller approach is undertaken: the primary adaptive critic NN controller is designed for the nonstrict-feedback nonlinear discrete-time system and the secondary one for the affine nonlinear discrete-time system, and together the two controllers deliver the desired performance. The primary adaptive critic NN controller includes an NN observer for estimating the states and output, an NN critic, and two action NNs for generating virtual control and actual control inputs for the nonstrict-feedback nonlinear discrete-time system, whereas an additional critic NN and an action NN are included for the affine nonlinear discrete-time system under the assumption that its states are available. All NN weights adapt online towards minimization of a certain performance index, utilizing a gradient-descent-based rule. Using Lyapunov theory, the uniform ultimate boundedness (UUB) of the closed-loop tracking error, weight estimates, and observer estimates is shown. The adaptive critic NN controller performance is evaluated on an SI engine operating with high EGR levels, where the controller objective is to reduce cyclic dispersion in heat release while minimizing fuel intake. Simulation and experimental results indicate that engine-out emissions drop significantly at 20% EGR due to the reduction in heat-release dispersion, thus verifying the dual-control approach.
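
    The online adaptive-critic idea can be illustrated with a stripped-down sketch: a critic network estimates the cost-to-go and an action network generates the control, with both weight vectors adapted by gradient-descent-style rules. The code below is a generic scalar-plant illustration under an assumed, known input gain; it is not the paper's observer-based dual-controller design, and all names and constants are assumptions made for illustration.

      import numpy as np

      rng = np.random.default_rng(0)
      Wc = rng.normal(scale=0.1, size=3)   # critic weights (cost-to-go approximator)
      Wa = rng.normal(scale=0.1, size=3)   # action-network weights
      alpha_c, alpha_a, gamma = 0.05, 0.01, 0.95

      def features(x):
          return np.array([x, x ** 2, 1.0])

      def control(x):
          return float(Wa @ features(x))   # action-network output u(x)

      def adapt(x, u, cost, x_next, g=1.0):
          """One online adaptation step; g is the assumed input gain d(x_next)/d(u)."""
          global Wc, Wa
          phi, phi_next = features(x), features(x_next)
          # Critic: TD-style (semi-gradient) descent on a one-step Bellman residual.
          e_c = cost + gamma * (Wc @ phi_next) - (Wc @ phi)
          Wc = Wc + alpha_c * e_c * phi
          # Actor: move the control toward lower estimated future cost, using
          # dJ/du = dJ/dx_next * dx_next/du and du/dWa = features(x).
          dJ_du = (Wc @ np.array([1.0, 2.0 * x_next, 0.0])) * g
          Wa = Wa - alpha_a * dJ_du * phi

      # One step of a toy scalar plant x_next = 0.8*x + u with a quadratic cost:
      x = 1.0
      u = control(x)
      x_next = 0.8 * x + u
      adapt(x, u, cost=x ** 2 + 0.1 * u ** 2, x_next=x_next)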
