Science.gov

Sample records for reward reinforcement learning

  1. Auto-exploratory average reward reinforcement learning

    SciTech Connect

    Ok, DoKyeong; Tadepalli, P.

    1996-12-31

    We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the original H-learning under previously studied exploration methods such as random, recency-based, or counter-based exploration.
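
    The average-reward criterion mentioned above replaces discounting with an estimate of the reward obtained per step. Below is a minimal tabular sketch of that general framework; it is not the paper's H-learning algorithm (which is model-based), and the class name, update details, and constants are illustrative assumptions.

```python
from collections import defaultdict

class AverageRewardAgent:
    """Minimal tabular sketch of average-reward (undiscounted) value learning."""

    def __init__(self, alpha=0.05):
        self.h = defaultdict(float)   # relative ("bias") value of each state
        self.rho = 0.0                # running estimate of the average reward per step
        self.alpha = alpha            # step size for the rho estimate

    def update(self, state, reward, next_state, greedy_step=True):
        if greedy_step:
            # The average-reward estimate is adjusted only on greedy steps,
            # as in H-learning and related average-reward algorithms.
            self.rho += self.alpha * (reward - self.rho
                                      + self.h[next_state] - self.h[state])
        # Undiscounted relative-value update: h(s) <- r - rho + h(s')
        self.h[state] = reward - self.rho + self.h[next_state]
```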

  2. Reward and reinforcement activity in the nucleus accumbens during learning

    PubMed Central

    Gale, John T.; Shields, Donald C.; Ishizawa, Yumiko; Eskandar, Emad N.

    2014-01-01

    The nucleus accumbens core (NAcc) has been implicated in learning associations between sensory cues and profitable motor responses. However, the precise mechanisms that underlie these functions remain unclear. We recorded single-neuron activity from the NAcc of primates trained to perform a visual-motor associative learning task. During learning, we found two distinct classes of NAcc neurons. The first class demonstrated progressive increases in firing rates at the go-cue, feedback/tone and reward epochs of the task, as novel associations were learned. This suggests that these neurons may play a role in the exploitation of rewarding behaviors. In contrast, the second class exhibited attenuated firing rates, but only at the reward epoch of the task. These findings suggest that some NAcc neurons play a role in reward-based reinforcement during learning. PMID:24765069

  3. Finding intrinsic rewards by embodied evolution and constrained reinforcement learning.

    PubMed

    Uchibe, Eiji; Doya, Kenji

    2008-12-01

    Understanding the design principle of reward functions is a substantial challenge both in artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals, but also for intermediate states to promote effective exploration. This paper proposes a method for designing 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and 'mating' by software reproduction are three major 'extrinsic' rewards. We show in hardware experiments that the robots can find appropriate 'intrinsic' rewards for the vision of battery packs and other robots to promote approach behaviors. PMID:19013054

  4. Optimal Reward Functions in Distributed Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Tumer, Kagan

    2000-01-01

    We consider the design of multi-agent systems so as to optimize an overall world utility function when (1) those systems lack centralized communication and control, and (2) each agent runs a distinct Reinforcement Learning (RL) algorithm. A crucial issue in such design problems is to initialize/update each agent's private utility function, so as to induce the best possible world utility. Traditional 'team game' solutions to this problem sidestep this issue and simply assign to each agent the world utility as its private utility function. In previous work we used the 'Collective Intelligence' framework to derive a better choice of private utility functions, one that results in world utility performance up to orders of magnitude superior to that ensuing from use of the team game utility. In this paper we extend these results. We derive the general class of private utility functions that both are easy for the individual agents to learn and that, if learned well, result in high world utility. We demonstrate experimentally that using these new utility functions can result in significantly improved performance over that of our previously proposed utility, over and above that previous utility's superiority to the conventional team game utility.
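
    Private utilities of the kind studied in the Collective Intelligence framework are often built as a difference between the world utility and the world utility with one agent's action clamped to a default. The snippet below is a toy illustration of that general idea under an assumed world utility; the function names and the clamping convention are assumptions, not the paper's exact construction.

```python
def world_utility(joint_action):
    # Toy global objective: reward the group for covering distinct choices.
    return len(set(joint_action))

def difference_utility(joint_action, agent_index, default_action=0):
    # Private utility: world utility minus the world utility obtained when this
    # agent's action is replaced by a fixed default ("clamped").
    counterfactual = list(joint_action)
    counterfactual[agent_index] = default_action
    return world_utility(joint_action) - world_utility(counterfactual)

# Agent 2 duplicated agent 1's choice, so its difference utility is negative,
# signalling that its action hurt the global objective.
print(difference_utility((1, 2, 2), agent_index=2))
```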

  5. When, What, and How Much to Reward in Reinforcement Learning-Based Models of Cognition

    ERIC Educational Resources Information Center

    Janssen, Christian P.; Gray, Wayne D.

    2012-01-01

    Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing a reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward; that is, when (the moment: end of trial, subtask, or some other…

  6. Homeostatic reinforcement learning for integrating reward collection and physiological stability

    PubMed Central

    Keramati, Mehdi; Gutkin, Boris

    2014-01-01

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system. DOI: http://dx.doi.org/10.7554/eLife.04811.001 PMID:25457346
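
    A minimal sketch of the definition of primary reward used in this kind of framework, namely reward as a reduction of "drive" (distance of the internal state from its setpoint). The quadratic drive function and the numbers are illustrative assumptions.

```python
import numpy as np

def drive(internal_state, setpoint):
    # Drive: squared distance of the internal state from its homeostatic setpoint.
    return float(np.sum((np.asarray(setpoint) - np.asarray(internal_state)) ** 2))

def homeostatic_reward(state_before, state_after, setpoint):
    # Reward of an outcome = how much it reduced the drive.
    return drive(state_before, setpoint) - drive(state_after, setpoint)

# Eating when glucose is below the setpoint is rewarding; overshooting is not.
print(homeostatic_reward([40.0], [60.0], setpoint=[70.0]))   # positive
print(homeostatic_reward([80.0], [100.0], setpoint=[70.0]))  # negative
```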

  7. Computational models of reinforcement learning: the role of dopamine as a reward signal

    PubMed Central

    Samson, R. D.; Frank, M. J.

    2010-01-01

    Reinforcement learning is ubiquitous. Unlike other forms of learning, it involves the processing of fast yet content-poor feedback information to correct assumptions about the nature of a task or of a set of stimuli. This feedback information is often delivered as generic rewards or punishments, and has little to do with the stimulus features to be learned. How can such low-content feedback lead to such an efficient learning paradigm? Through a review of existing neuro-computational models of reinforcement learning, we suggest that the efficiency of this type of learning resides in the dynamic and synergistic cooperation of brain systems that use different levels of computations. The implementation of reward signals at the synaptic, cellular, network and system levels gives the organism the necessary robustness, adaptability and processing speed required for evolutionary and behavioral success. PMID:21629583

  8. Toward an autonomous brain machine interface: integrating sensorimotor reward modulation and reinforcement learning.

    PubMed

    Marsh, Brandi T; Tarigoppula, Venkata S Aditya; Chen, Chen; Francis, Joseph T

    2015-05-13

    For decades, neurophysiologists have worked on elucidating the function of the cortical sensorimotor control system from the standpoint of kinematics or dynamics. Recently, computational neuroscientists have developed models that can emulate changes seen in the primary motor cortex during learning. However, these simulations rely on the existence of a reward-like signal in the primary sensorimotor cortex. Reward modulation of the primary sensorimotor cortex has yet to be characterized at the level of neural units. Here we demonstrate that single units/multiunits and local field potentials in the primary motor (M1) cortex of nonhuman primates (Macaca radiata) are modulated by reward expectation during reaching movements and that this modulation is present even while subjects passively view cursor motions that are predictive of either reward or nonreward. After establishing this reward modulation, we set out to determine whether we could correctly classify rewarding versus nonrewarding trials, on a moment-to-moment basis. This reward information could then be used in collaboration with reinforcement learning principles toward an autonomous brain-machine interface. The autonomous brain-machine interface would use M1 for both decoding movement intention and extraction of reward expectation information as evaluative feedback, which would then update the decoding algorithm as necessary. In the work presented here, we show that this, in theory, is possible. PMID:25972167

  9. The left hemisphere learns what is right: Hemispatial reward learning depends on reinforcement learning processes in the contralateral hemisphere.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly Crystal; Schwartz, Sophie

    2016-08-01

    Orienting biases refer to consistent, trait-like direction of attention or locomotion toward one side of space. Recent studies suggest that such hemispatial biases may determine how well people memorize information presented in the left or right hemifield. Moreover, lesion studies indicate that learning rewarded stimuli in one hemispace depends on the integrity of the contralateral striatum. However, the exact neural and computational mechanisms underlying the influence of individual orienting biases on reward learning remain unclear. Because reward-based behavioural adaptation depends on the dopaminergic system and prediction error (PE) encoding in the ventral striatum, we hypothesized that hemispheric asymmetries in dopamine (DA) function may determine individual spatial biases in reward learning. To test this prediction, we acquired fMRI in 33 healthy human participants while they performed a lateralized reward task. Learning differences between hemispaces were assessed by presenting stimuli, assigned to different reward probabilities, to the left or right of central fixation, i.e. presented in the left or right visual hemifield. Hemispheric differences in DA function were estimated through differential fMRI responses to positive vs. negative feedback in the left vs. right ventral striatum, and a computational approach was used to identify the neural correlates of PEs. Our results show that spatial biases favoring reward learning in the right (vs. left) hemifield were associated with increased reward responses in the left hemisphere and relatively better neural encoding of PEs for stimuli presented in the right (vs. left) hemifield. These findings demonstrate that trait-like spatial biases implicate hemisphere-specific learning mechanisms, with individual differences between hemispheres contributing to reinforcing spatial biases. PMID:27221149

  10. Principal components analysis of reward prediction errors in a reinforcement learning task.

    PubMed

    Sambrook, Thomas D; Goslin, Jeremy

    2016-01-01

    Models of reinforcement learning represent reward and punishment in terms of reward prediction errors (RPEs), quantitative signed terms describing the degree to which outcomes are better than expected (positive RPEs) or worse (negative RPEs). An electrophysiological component known as feedback related negativity (FRN) occurs at frontocentral sites 240-340ms after feedback on whether a reward or punishment is obtained, and has been claimed to neurally encode an RPE. An outstanding question however, is whether the FRN is sensitive to the size of both positive RPEs and negative RPEs. Previous attempts to answer this question have examined the simple effects of RPE size for positive RPEs and negative RPEs separately. However, this methodology can be compromised by overlap from components coding for unsigned prediction error size, or "salience", which are sensitive to the absolute size of a prediction error but not its valence. In our study, positive and negative RPEs were parametrically modulated using both reward likelihood and magnitude, with principal components analysis used to separate out overlying components. This revealed a single RPE encoding component responsive to the size of positive RPEs, peaking at ~330ms, and occupying the delta frequency band. Other components responsive to unsigned prediction error size were shown, but no component sensitive to negative RPE size was found. PMID:26196667
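
    As a quick illustration of the quantities defined in this abstract, the snippet below computes a signed reward prediction error and its unsigned ("salience") counterpart for an arbitrary expected value; the numbers are made up.

```python
expected_value = 0.5                      # e.g. a 50% chance of a 1-point reward
outcomes = {"reward delivered": 1.0, "reward omitted": 0.0}

for label, outcome in outcomes.items():
    signed_rpe = outcome - expected_value   # positive if better than expected
    salience = abs(signed_rpe)              # unsigned size, blind to valence
    print(f"{label}: signed RPE = {signed_rpe:+.2f}, unsigned size = {salience:.2f}")
```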

  11. An average-reward reinforcement learning algorithm for computing bias-optimal policies

    SciTech Connect

    Mahadevan, S.

    1996-12-31

    Average-reward reinforcement learning (ARL) is an undiscounted optimality framework that is generally applicable to a broad range of control tasks. ARL computes gain-optimal control policies that maximize the expected payoff per step. However, gain-optimality has some intrinsic limitations as an optimality criterion, since for example, it cannot distinguish between different policies that all reach an absorbing goal state, but incur varying costs. A more selective criterion is bias optimality, which can filter gain-optimal policies to select those that reach absorbing goals with the minimum cost. While several ARL algorithms for computing gain-optimal policies have been proposed, none of these algorithms can guarantee bias optimality, since this requires solving at least two nested optimality equations. In this paper, we describe a novel model-based ARL algorithm for computing bias-optimal policies. We test the proposed algorithm using an admission control queuing system, and show that it is able to utilize the queue much more efficiently than a gain-optimal method by learning bias-optimal policies.
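
    For context, gain and bias optimality can be written as nested conditions. The equation below is the standard average-reward (gain) optimality equation for a unichain MDP, in generic notation rather than the paper's: ρ* is the maximal average reward per step and h the relative values. Bias optimality then re-solves the same kind of equation with the maximization restricted to actions that already achieve ρ*, which is why at least two nested optimality equations must be solved.

```latex
\rho^{*} + h(s) \;=\; \max_{a}\Big[\, r(s,a) + \sum_{s'} p(s' \mid s, a)\, h(s') \,\Big]
```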

  12. The Rewards of Learning.

    ERIC Educational Resources Information Center

    Chance, Paul

    1992-01-01

    Although intrinsic rewards are important, they (along with punishment and encouragement) are insufficient for efficient learning. Teachers must supplement intrinsic rewards with extrinsic rewards, such as praising, complimenting, applauding, and providing other forms of recognition for good work. Teachers should use the weakest reward required to…

  13. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.

    PubMed

    Glimcher, Paul W

    2011-09-13

    A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn. PMID:21389268

  14. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning

    PubMed Central

    Balasubramani, Pragathi P.; Chakravarthy, V. Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A.

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk-based decision making in a modified Reinforcement Learning (RL) framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in the Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation, which accommodates both 5HT and DA, reconciles some of the diverse roles of 5HT, particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) Risk-sensitive decision making, where 5HT controls risk assessment, (2) Temporal reward prediction, where 5HT controls time-scale of reward prediction, and (3) Reward/Punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG. PMID:24795614
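
    A hedged sketch of the kind of risk-sensitive value described above: a temporal-difference error (the abstract's DA signal, δ) updates the expected reward, a risk prediction error updates the outcome variability, and a single weight (the abstract's 5HT-linked parameter, called alpha here) trades risk against value at choice time. The exact combination rule and constants are assumptions for illustration, not the paper's model.

```python
class RiskSensitiveValue:
    def __init__(self, lr=0.1, alpha=0.5):
        self.q = 0.0        # expected reward
        self.risk = 0.0     # running estimate of squared deviation (variance)
        self.lr = lr        # learning rate
        self.alpha = alpha  # risk-aversion weight (cast as the 5HT level)

    def update(self, reward):
        delta = reward - self.q            # TD-like error (cast as the DA signal)
        self.q += self.lr * delta
        xi = delta ** 2 - self.risk        # risk prediction error
        self.risk += self.lr * xi

    def utility(self):
        # Risk-discounted value used for action selection.
        return self.q - self.alpha * self.risk ** 0.5
```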

  15. Reward feedback accelerates motor learning.

    PubMed

    Nikooyan, Ali A; Ahmed, Alaa A

    2015-01-15

    Recent findings have demonstrated that reward feedback alone can drive motor learning. However, it is not yet clear whether reward feedback alone can lead to learning when a perturbation is introduced abruptly, or how a reward gradient can modulate learning. In this study, we provide reward feedback that decays continuously with increasing error. We asked whether it is possible to learn an abrupt visuomotor rotation by reward alone, and if the learning process could be modulated by combining reward and sensory feedback and/or by using different reward landscapes. We designed a novel visuomotor learning protocol during which subjects experienced an abruptly introduced rotational perturbation. Subjects received either visual feedback or reward feedback, or a combination of the two. Two different reward landscapes, where the reward decayed either linearly or cubically with distance from the target, were tested. Results demonstrate that it is possible to learn from reward feedback alone and that the combination of reward and sensory feedback accelerates learning. An analysis of the underlying mechanisms reveals that although reward feedback alone does not allow for sensorimotor remapping, it can nonetheless lead to broad generalization, highlighting a dissociation between remapping and generalization. Also, the combination of reward and sensory feedback accelerates learning without compromising sensorimotor remapping. These findings suggest that the use of reward feedback is a promising approach to either supplement or substitute sensory feedback in the development of improved neurorehabilitation techniques. More generally, they point to an important role played by reward in the motor learning process. PMID:25355957

  16. Olfactory preference conditioning changes the reward value of reinforced and non-reinforced odors

    PubMed Central

    Torquet, Nicolas; Aimé, Pascaline; Messaoudi, Belkacem; Garcia, Samuel; Ey, Elodie; Gervais, Rémi; Julliard, A. Karyn; Ravel, Nadine

    2014-01-01

    Olfaction is a key determinant of the organization of rodent behavior. In a feeding context, rodents must quickly discriminate whether a nutrient can be ingested or whether it represents a potential danger to them. To understand the learning processes that support food choice, aversive olfactory learning and flavor appetitive learning have been extensively studied. In contrast, little is currently known about olfactory appetitive learning and its mechanisms. We designed a new paradigm to study conditioned olfactory preference in rats. After 8 days of exposure to a pair of odors (one paired with sucrose and the other with water), rats developed a strong and stable preference for the odor associated with the sucrose solution. A series of experiments were conducted to further analyze changes in reward value induced by this paradigm for both stimuli. As expected, the reward value of the reinforced odor changed positively. Interestingly, the reward value of the alternative odor decreased. This devaluation had an impact on further odor comparisons that the animal had to make. This result suggests that appetitive conditioning involving a comparison between two odors not only leads to a change in the reward value of the reinforced odor, but also induces a stable devaluation of the non-reinforced stimulus. PMID:25071486

  17. Reinforcement, Reward, and Intrinsic Motivation: A Meta-Analysis.

    ERIC Educational Resources Information Center

    Cameron, Judy; Pierce, W. David

    1994-01-01

    A meta-analysis including 96 experimental studies considers the effects of reinforcement/reward on intrinsic motivation. Results indicate that reward does not decrease intrinsic motivation, although interaction effects must be examined. An analysis with five studies also indicates that reinforcement does not harm intrinsic motivation. (SLD)

  18. Tonic Dopamine Modulates Exploitation of Reward Learning

    PubMed Central

    Beeler, Jeff A.; Daw, Nathaniel; Frazier, Cristianne R. M.; Zhuang, Xiaoxi

    2010-01-01

    The impact of dopamine on adaptive behavior in a naturalistic environment is largely unexamined. Experimental work suggests that phasic dopamine is central to reinforcement learning whereas tonic dopamine may modulate performance without altering learning per se; however, this idea has not been developed formally or integrated with computational models of dopamine function. We quantitatively evaluate the role of tonic dopamine in these functions by studying the behavior of hyperdopaminergic DAT knockdown mice in an instrumental task in a semi-naturalistic homecage environment. In this “closed economy” paradigm, subjects earn all of their food by pressing either of two levers, but the relative cost for food on each lever shifts frequently. Compared to wild-type mice, hyperdopaminergic mice allocate more lever presses on high-cost levers, thus working harder to earn a given amount of food and maintain their body weight. However, both groups show a similarly quick reaction to shifts in lever cost, suggesting that the hyperdopaminergic mice are not slower at detecting changes, as would be expected with a learning deficit. We fit the lever choice data using reinforcement learning models to assess the distinction between acquisition and expression that the models formalize. In these analyses, hyperdopaminergic mice displayed normal learning from recent reward history but diminished capacity to exploit this learning: a reduced coupling between choice and reward history. These data suggest that dopamine modulates the degree to which prior learning biases action selection and consequently alters the expression of learned, motivated behavior. PMID:21120145
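
    In many reinforcement-learning model fits, the "coupling between choice and reward history" described above corresponds to the inverse-temperature parameter of a softmax choice rule. The toy function below shows that reading; the paper's actual model and parameter values may differ.

```python
import math

def p_choose_left(q_left, q_right, beta=3.0):
    # beta scales how strongly learned values bias choice ("exploitation");
    # beta -> 0 means choices ignore reward history entirely.
    return 1.0 / (1.0 + math.exp(-beta * (q_left - q_right)))

print(p_choose_left(0.8, 0.2, beta=3.0))   # strong coupling to learned values
print(p_choose_left(0.8, 0.2, beta=0.5))   # weak coupling, closer to indifference
```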

  19. Mind matters: placebo enhances reward learning in Parkinson's disease.

    PubMed

    Schmidt, Liane; Braun, Erin Kendall; Wager, Tor D; Shohamy, Daphna

    2014-12-01

    Expectations have a powerful influence on how we experience the world. Neurobiological and computational models of learning suggest that dopamine is crucial for shaping expectations of reward and that expectations alone may influence dopamine levels. However, because expectations and reinforcers are typically manipulated together, the role of expectations per se has remained unclear. We separated these two factors using a placebo dopaminergic manipulation in individuals with Parkinson's disease. We combined a reward learning task with functional magnetic resonance imaging to test how expectations of dopamine release modulate learning-related activity in the brain. We found that the mere expectation of dopamine release enhanced reward learning and modulated learning-related signals in the striatum and the ventromedial prefrontal cortex. These effects were selective to learning from reward: neither medication nor placebo had an effect on learning to avoid monetary loss. These findings suggest a neurobiological mechanism by which expectations shape learning and affect. PMID:25326691

  20. Microstimulation of the human substantia nigra alters reinforcement learning.

    PubMed

    Ramayya, Ashwin G; Misra, Amrit; Baltuch, Gordon H; Kahana, Michael J

    2014-05-14

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action-reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action-reward associations rather than stimulus-reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action-reward associations during reinforcement learning. PMID:24828643

  1. Hierarchical Bayesian inverse reinforcement learning.

    PubMed

    Choi, Jaedeug; Kim, Kee-Eung

    2015-04-01

    Inverse reinforcement learning (IRL) is the problem of inferring the underlying reward function from the expert's behavior data. The difficulty in IRL mainly arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behavior data as optimal. Another difficulty comes from the noisy behavior data due to sub-optimal experts. We propose a hierarchical Bayesian framework, which subsumes most of the previous IRL algorithms as well as models the sub-optimality of the expert's behavior. Using a number of experiments on a synthetic problem, we demonstrate the effectiveness of our approach including the robustness of our hierarchical Bayesian framework to the sub-optimal expert behavior data. Using a real dataset from taxi GPS traces, we additionally show that our approach predicts the driving behavior with a high accuracy. PMID:25291805
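
    A much-simplified sketch of the Bayesian view of IRL described above: candidate reward functions are scored by how likely the expert's choices are under a softmax ("noisily optimal") policy derived from that reward, and combined with a prior. The bandit-style setting with no transitions, the flat prior, and all numbers are illustrative assumptions.

```python
import numpy as np

def expert_log_likelihood(action_rewards, expert_actions, beta=5.0):
    # Softmax policy over actions induced by a candidate reward function.
    logits = beta * np.asarray(action_rewards, dtype=float)
    log_policy = logits - np.logaddexp.reduce(logits)
    return float(sum(log_policy[a] for a in expert_actions))

candidates = {"sparse reward": [1.0, 0.0, 0.0], "flat reward": [0.4, 0.3, 0.3]}
expert_actions = [0, 0, 1, 0]            # the expert mostly picks action 0

# With a flat prior, comparing posteriors reduces to comparing log-likelihoods.
for name, r in candidates.items():
    print(name, round(expert_log_likelihood(r, expert_actions), 2))
```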

  2. What is the role of dopamine in reward: hedonic impact, reward learning, or incentive salience?

    PubMed

    Berridge, K C; Robinson, T E

    1998-12-01

    What roles do mesolimbic and neostriatal dopamine systems play in reward? Do they mediate the hedonic impact of rewarding stimuli? Do they mediate hedonic reward learning and associative prediction? Our review of the literature, together with results of a new study of residual reward capacity after dopamine depletion, indicates the answer to both questions is 'no'. Rather, dopamine systems may mediate the incentive salience of rewards, modulating their motivational value in a manner separable from hedonia and reward learning. In a study of the consequences of dopamine loss, rats were depleted of dopamine in the nucleus accumbens and neostriatum by up to 99% using 6-hydroxydopamine. In a series of experiments, we applied the 'taste reactivity' measure of affective reactions (gapes, etc.) to assess the capacity of dopamine-depleted rats for: 1) normal affect (hedonic and aversive reactions), 2) modulation of hedonic affect by associative learning (taste aversion conditioning), and 3) hedonic enhancement of affect by non-dopaminergic pharmacological manipulation of palatability (benzodiazepine administration). We found normal hedonic reaction patterns to sucrose vs. quinine, normal learning of new hedonic stimulus values (a change in palatability based on predictive relations), and normal pharmacological hedonic enhancement of palatability. We discuss these results in the context of hypotheses and data concerning the role of dopamine in reward. We review neurochemical, electrophysiological, and other behavioral evidence. We conclude that dopamine systems are not needed either to mediate the hedonic pleasure of reinforcers or to mediate predictive associations involved in hedonic reward learning. We conclude instead that dopamine may be more important to incentive salience attributions to the neural representations of reward-related stimuli. Incentive salience, we suggest, is a distinct component of motivation and reward. In other words, dopamine systems are necessary

  3. Reinforcement learning: Solving two case studies

    NASA Astrophysics Data System (ADS)

    Duarte, Ana Filipa; Silva, Pedro; dos Santos, Cristina Peixoto

    2012-09-01

    Reinforcement Learning algorithms offer interesting features for the control of autonomous systems, such as the ability to learn from direct interaction with the environment, and the use of a simple reward signal as opposed to the input-output pairs used in classic supervised learning. The reward signal indicates the success or failure of the actions executed by the agent in the environment. In this work, we describe RL algorithms applied to two case studies: the Crawler robot and the widely known inverted pendulum. We explore RL capabilities to autonomously learn a basic locomotion pattern in the Crawler, and approach the balancing problem of biped locomotion using the inverted pendulum.

  4. Modular Inverse Reinforcement Learning for Visuomotor Behavior

    PubMed Central

    Rothkopf, Constantin A.; Ballard, Dana H.

    2013-01-01

    In a large variety of situations one would like to have an expressive and accurate model of observed animal or human behavior. While general purpose mathematical models may capture successfully properties of observed behavior, it is desirable to root models in biological facts. Because of ample empirical evidence for reward-based learning in visuomotor tasks we use a computational model based on the assumption that the observed agent is balancing the costs and benefits of its behavior to meet its goals. This leads to using the framework of Reinforcement Learning, which additionally provides well-established algorithms for learning of visuomotor task solutions. To quantify the agent’s goals as rewards implicit in the observed behavior, we propose to use inverse reinforcement learning. Based on the assumption of a modular cognitive architecture, we introduce a modular inverse reinforcement learning algorithm that estimates the relative reward contributions of the component tasks in navigation, consisting of following a path while avoiding obstacles and approaching targets. It is shown how to recover the component reward weights for individual tasks and that variability in observed trajectories can be explained succinctly through behavioral goals. It is demonstrated through simulations that good estimates can be obtained already with modest amounts of observation data, which in turn allows the prediction of behavior in novel configurations. PMID:23832417

  5. Reinforcement learning in scheduling

    NASA Technical Reports Server (NTRS)

    Dietterich, Tom G.; Ok, Dokyeong; Zhang, Wei; Tadepalli, Prasad

    1994-01-01

    The goal of this research is to apply reinforcement learning methods to real-world problems like scheduling. In this preliminary paper, we show that learning to solve scheduling problems such as the Space Shuttle Payload Processing and the Automatic Guided Vehicle (AGV) scheduling can be usefully studied in the reinforcement learning framework. We discuss some of the special challenges posed by the scheduling domain to these methods and propose some possible solutions we plan to implement.

  6. Model-Based Reinforcement Learning under Concurrent Schedules of Reinforcement in Rodents

    ERIC Educational Resources Information Center

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-01-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's…

  7. Dose Dependent Dopaminergic Modulation of Reward-Based Learning in Parkinson's Disease

    ERIC Educational Resources Information Center

    van Wouwe, N. C.; Ridderinkhof, K. R.; Band, G. P. H.; van den Wildenberg, W. P. M.; Wylie, S. A.

    2012-01-01

    Learning to select optimal behavior in new and uncertain situations is a crucial aspect of living and requires the ability to quickly associate stimuli with actions that lead to rewarding outcomes. Mathematical models of reinforcement-based learning to select rewarding actions distinguish between (1) the formation of stimulus-action-reward…

  8. Mate call as reward: Acoustic communication signals can acquire positive reinforcing values during adulthood in female zebra finches (Taeniopygia guttata).

    PubMed

    Hernandez, Alexandra M; Perez, Emilie C; Mulard, Hervé; Mathevon, Nicolas; Vignal, Clémentine

    2016-02-01

    Social stimuli can have rewarding properties and promote learning. In birds, conspecific vocalizations like song can act as a reinforcer, and specific song variants can acquire particular rewarding values during early life exposure. Here we ask if, during adulthood, an acoustic signal simpler and shorter than song can become a reward for a female songbird because of its particular social value. Using an operant choice apparatus, we showed that female zebra finches display a preferential response toward their mate's calls. This reinforcing value of mate's calls could be involved in the maintenance of the monogamous pair-bond of the zebra finch. PMID:26881942

  9. Learning Reward Uncertainty in the Basal Ganglia.

    PubMed

    Mikhael, John G; Bogacz, Rafal

    2016-09-01

    Learning the reliability of different sources of rewards is critical for making optimal choices. However, despite the existence of detailed theory describing how the expected reward is learned in the basal ganglia, it is not known how reward uncertainty is estimated in these circuits. This paper presents a class of models that encode both the mean reward and the spread of the rewards, the former in the difference between the synaptic weights of D1 and D2 neurons, and the latter in their sum. In the models, the tendency to seek (or avoid) options with variable reward can be controlled by increasing (or decreasing) the tonic level of dopamine. The models are consistent with the physiology of and synaptic plasticity in the basal ganglia, they explain the effects of dopaminergic manipulations on choices involving risks, and they make multiple experimental predictions. PMID:27589489
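
    A toy, hedged implementation of the coding scheme the abstract describes: one "D1-like" weight strengthened by positive prediction errors and one "D2-like" weight strengthened by negative ones, each with a small decay, so that their difference tracks the mean reward while their sum grows with the spread of rewards. The specific update rule and constants are assumptions made for illustration; they are not the paper's model.

```python
import random

g, n = 0.0, 0.0              # D1-like and D2-like synaptic weights
lr, decay = 0.05, 0.1
random.seed(0)
rewards = [random.choice([0.0, 1.0]) for _ in range(20000)]   # risky 50/50 option

for r in rewards:
    delta = r - (g - n)                          # prediction error vs. the mean estimate
    g += lr * (max(delta, 0.0) - decay * g)      # positive errors potentiate the D1-like weight
    n += lr * (max(-delta, 0.0) - decay * n)     # negative errors potentiate the D2-like weight

print("difference g - n:", round(g - n, 2))  # near the mean reward (slightly shrunk by the decay)
print("sum        g + n:", round(g + n, 2))  # scales with how variable the rewards are
```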

  10. Reinforcement learning or active inference?

    PubMed

    Friston, Karl J; Daunizeau, Jean; Kiebel, Stefan J

    2009-01-01

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain. PMID:19641614

  11. Reinforcement Learning or Active Inference?

    PubMed Central

    Friston, Karl J.; Daunizeau, Jean; Kiebel, Stefan J.

    2009-01-01

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain. PMID:19641614

  12. Framework for robot skill learning using reinforcement learning

    NASA Astrophysics Data System (ADS)

    Wei, Yingzi; Zhao, Mingyang

    2003-09-01

    Robot skill acquisition is a process similar to human skill learning. Reinforcement learning (RL) is an on-line actor-critic method by which a robot can develop its skill. The reinforcement function is the critical component, since it evaluates actions and guides the learning process. We present an augmented reward function that provides a new way for the RL controller to incorporate prior knowledge and experience. The difference form of the augmented reward function is also considered carefully. The additional reward, beyond the conventional reward, provides more heuristic information for RL. We then present a strategy for learning complex skills: an automatic robot-shaping policy that decomposes the complex skill into a hierarchical learning process. A new form of value function is introduced to attain smooth and swift motion switching. We present a formal but practical framework for robot skill learning and illustrate, with an example, the utility of the method for learning skilled robot control on line.
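
    A widely used way to add heuristic reward information of this kind without changing which policies are optimal is potential-based reward shaping (Ng, Harada and Russell, 1999). The sketch below shows that general technique as an illustration; it is not the augmented reward function proposed in this paper.

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.99):
    # potential(s): heuristic "goodness" of a state, supplied by prior knowledge.
    return reward + gamma * potential(next_state) - potential(state)

# Example: in a grid world, using minus-the-distance-to-goal as the potential
# adds a bonus to every step that moves closer, on top of the task's own reward.
```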

  13. The Computational Development of Reinforcement Learning during Adolescence.

    PubMed

    Palminteri, Stefano; Kilford, Emma J; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-06-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents' behaviour was better explained by a basic reinforcement learning algorithm, adults' behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence. PMID:27322574

  14. The Computational Development of Reinforcement Learning during Adolescence

    PubMed Central

    Palminteri, Stefano; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-01-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents’ behaviour was better explained by a basic reinforcement learning algorithm, adults’ behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence. PMID:27322574
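
    As a rough sketch of the two learning modes contrasted above: a basic reinforcement-learning update for the chosen option, plus, when complete feedback is available, a counterfactual update that applies the forgone outcome to the unchosen option. The function name, learning rates, and symmetric treatment are illustrative assumptions, not the paper's fitted model.

```python
def update_values(q_chosen, q_unchosen, outcome_chosen, outcome_unchosen=None,
                  lr_factual=0.3, lr_counterfactual=0.3):
    # Factual update from the outcome of the chosen option.
    q_chosen += lr_factual * (outcome_chosen - q_chosen)
    # Counterfactual update, only when the forgone outcome is shown (complete feedback).
    if outcome_unchosen is not None:
        q_unchosen += lr_counterfactual * (outcome_unchosen - q_unchosen)
    return q_chosen, q_unchosen
```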

  15. Mapping anhedonia onto reinforcement learning: a behavioural meta-analysis

    PubMed Central

    2013-01-01

    Background Depression is characterised partly by blunted reactions to reward. However, tasks probing this deficiency have not distinguished insensitivity to reward from insensitivity to the prediction errors for reward that determine learning and are putatively reported by the phasic activity of dopamine neurons. We attempted to disentangle these factors with respect to anhedonia in the context of stress, Major Depressive Disorder (MDD), Bipolar Disorder (BPD) and a dopaminergic challenge. Methods Six behavioural datasets involving 392 experimental sessions were subjected to a model-based, Bayesian meta-analysis. Participants across all six studies performed a probabilistic reward task that used an asymmetric reinforcement schedule to assess reward learning. Healthy controls were tested under baseline conditions, stress or after receiving the dopamine D2 agonist pramipexole. In addition, participants with current or past MDD or BPD were evaluated. Reinforcement learning models isolated the contributions of variation in reward sensitivity and learning rate. Results MDD and anhedonia reduced reward sensitivity more than they affected the learning rate, while a low dose of the dopamine D2 agonist pramipexole showed the opposite pattern. Stress led to a pattern consistent with a mixed effect on reward sensitivity and learning rate. Conclusion Reward-related learning reflected at least two partially separable contributions. The first related to phasic prediction error signalling, and was preferentially modulated by a low dose of the dopamine agonist pramipexole. The second related directly to reward sensitivity, and was preferentially reduced in MDD and anhedonia. Stress altered both components. Collectively, these findings highlight the contribution of model-based reinforcement learning meta-analysis for dissecting anhedonic behavior. PMID:23782813
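
    The two parameters the meta-analysis dissociates can be written in one line: a reward-sensitivity term that scales the subjective impact of each reward, and a learning rate that controls how quickly values are updated. The sketch below is that generic formulation with illustrative parameter values, not the exact model used in the meta-analysis.

```python
def q_update(q, reward, reward_sensitivity=1.0, learning_rate=0.2):
    # Lower reward_sensitivity (as in MDD/anhedonia) shrinks the subjective reward;
    # the learning rate (modulated by pramipexole in the abstract) sets update speed.
    subjective_reward = reward_sensitivity * reward
    return q + learning_rate * (subjective_reward - q)
```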

  16. Dopamine, reward learning, and active inference

    PubMed Central

    FitzGerald, Thomas H. B.; Dolan, Raymond J.; Friston, Karl

    2015-01-01

    Temporal difference learning models propose phasic dopamine signaling encodes reward prediction errors that drive learning. This is supported by studies where optogenetic stimulation of dopamine neurons can stand in lieu of actual reward. Nevertheless, a large body of data also shows that dopamine is not necessary for learning, and that dopamine depletion primarily affects task performance. We offer a resolution to this paradox based on an hypothesis that dopamine encodes the precision of beliefs about alternative actions, and thus controls the outcome-sensitivity of behavior. We extend an active inference scheme for solving Markov decision processes to include learning, and show that simulated dopamine dynamics strongly resemble those actually observed during instrumental conditioning. Furthermore, simulated dopamine depletion impairs performance but spares learning, while simulated excitation of dopamine neurons drives reward learning, through aberrant inference about outcome states. Our formal approach provides a novel and parsimonious reconciliation of apparently divergent experimental findings. PMID:26581305

  17. Reward and non-reward learning of flower colours in the butterfly Byasa alcinous (Lepidoptera: Papilionidae)

    NASA Astrophysics Data System (ADS)

    Kandori, Ikuo; Yamaki, Takafumi

    2012-09-01

    Learning plays an important role in food acquisition for a wide range of insects. To increase their foraging efficiency, flower-visiting insects may learn to associate floral cues with the presence (so-called reward learning) or the absence (so-called non-reward learning) of a reward. Reward learning whilst foraging for flowers has been demonstrated in many insect taxa, whilst non-reward learning in flower-visiting insects has been demonstrated only in honeybees, bumblebees and hawkmoths. This study examined both reward and non-reward learning abilities in the butterfly Byasa alcinous whilst foraging among artificial flowers of different colours. This butterfly showed both types of learning, although butterflies of both sexes learned faster via reward learning. In addition, females learned via reward learning faster than males. To the best of our knowledge, these are the first empirical data on the learning speed of both reward and non-reward learning in insects. We discuss the adaptive significance of a lower learning speed for non-reward learning when foraging on flowers.

  18. Nucleus Accumbens Core and Shell Differentially Encode Reward-Associated Cues after Reinforcer Devaluation

    PubMed Central

    West, Elizabeth A.

    2016-01-01

    Nucleus accumbens (NAc) neurons encode features of stimulus learning and action selection associated with rewards. The NAc is necessary for using information about expected outcome values to guide behavior after reinforcer devaluation. Evidence suggests that core and shell subregions may play dissociable roles in guiding motivated behavior. Here, we recorded neural activity in the NAc core and shell during training and performance of a reinforcer devaluation task. Long–Evans male rats were trained that presses on a lever under an illuminated cue light delivered a flavored sucrose reward. On subsequent test days, each rat was given free access to one of two distinctly flavored foods to consume to satiation and were then immediately tested on the lever pressing task under extinction conditions. Rats decreased pressing on the test day when the reinforcer earned during training was the sated flavor (devalued) compared with the test day when the reinforcer was not the sated flavor (nondevalued), demonstrating evidence of outcome-selective devaluation. Cue-selective encoding during training by NAc core (but not shell) neurons reliably predicted subsequent behavioral performance; that is, the greater the percentage of neurons that responded to the cue, the better the rats suppressed responding after devaluation. In contrast, NAc shell (but not core) neurons significantly decreased cue-selective encoding in the devalued condition compared with the nondevalued condition. These data reveal that NAc core and shell neurons encode information differentially about outcome-specific cues after reinforcer devaluation that are related to behavioral performance and outcome value, respectively. SIGNIFICANCE STATEMENT Many neuropsychiatric disorders are marked by impairments in behavioral flexibility. Although the nucleus accumbens (NAc) is required for behavioral flexibility, it is not known how NAc neurons encode this information. Here, we recorded NAc neurons during a training

  19. Reinforcement Learning Trees

    PubMed Central

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R.

    2015-01-01

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings. PMID:26903687

  20. Mind matters: Placebo enhances reward learning in Parkinson’s disease

    PubMed Central

    Schmidt, Liane; Braun, Erin Kendall; Wager, Tor D.; Shohamy, Daphna

    2015-01-01

    Expectations have a powerful influence on how we experience the world. Neurobiological and computational models of learning suggest that dopamine is crucial for shaping expectations of reward and that expectations alone may influence dopamine levels. However, because expectations and reinforcers are typically manipulated together, the role of expectations per se has remained unclear. Here, we separated these two factors using a placebo dopaminergic manipulation in Parkinson’s patients. We combined a reward learning task with fMRI to test how expectations of dopamine release modulate learning-related activity in the brain. We found that the mere expectation of dopamine release enhances reward learning and modulates learning-related signals in the striatum and the ventromedial prefrontal cortex. These effects were selective to learning from reward: neither medication nor placebo had an effect on learning to avoid monetary loss. These findings suggest a neurobiological mechanism by which expectations shape learning and affect. PMID:25326691

  1. Time-Extended Policies in Multi-Agent Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2004-01-01

    Reinforcement learning methods perform well in many domains where a single agent needs to take a sequence of actions to perform a task. These methods use sequences of single-time-step rewards to create a policy that tries to maximize a time-extended utility, which is a (possibly discounted) sum of these rewards. In this paper we build on our previous work showing how these methods can be extended to a multi-agent environment where each agent creates its own policy that works towards maximizing a time-extended global utility over all agents' actions. We show improved methods for creating time-extended utilities for the agents that are both "aligned" with the global utility and "learnable." We then show how to create single-time-step rewards while avoiding the pitfall in which rewards aligned with the global reward lead to utilities that are not aligned with the global utility. Finally, we apply these reward functions to the multi-agent Gridworld problem. We explicitly quantify a utility's learnability and alignment, and show that reinforcement learning agents using the prescribed reward functions successfully trade off learnability and alignment. As a result they outperform both global (e.g., team game) and local (e.g., "perfectly learnable") reinforcement learning solutions by as much as an order of magnitude.

  2. Contextual modulation of value signals in reward and punishment learning

    PubMed Central

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-01-01

    Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values in a relative—context-dependent—scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of options values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system. PMID:26302782

  3. Contextual modulation of value signals in reward and punishment learning.

    PubMed

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-01-01

    Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values in a relative--context-dependent--scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of options values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system. PMID:26302782
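
    A small sketch of the relative, context-dependent value update described above: outcomes are compared with a learned context (state) value before updating the option value, so that a successfully avoided punishment in a bad context produces a positive teaching signal. The update order and learning rates are illustrative assumptions, not the paper's exact model.

```python
def contextual_update(option_value, context_value, outcome,
                      lr_option=0.3, lr_context=0.3):
    context_value += lr_context * (outcome - context_value)   # running reference point
    relative_outcome = outcome - context_value                # outcome re-expressed vs. context
    option_value += lr_option * (relative_outcome - option_value)
    return option_value, context_value

# In a punishment context (outcomes of -1 or 0), the context value drifts negative,
# so an avoided loss (outcome 0) becomes a positive relative outcome that reinforces
# the avoidance response.
```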

  4. Statistical Mechanics of the Delayed Reward-Based Learning with Node Perturbation

    NASA Astrophysics Data System (ADS)

    Saito, Hiroshi; Katahira, Kentaro; Okanoya, Kazuo; Okada, Masato

    2010-06-01

    In reward-based learning, reward is typically given with some delay after the behavior that causes it. In the machine learning literature, the framework of the eligibility trace has been used as one of the solutions for handling delayed reward in reinforcement learning. Recent studies suggest that the eligibility trace is important for a difficult neuroscience problem known as the “distal reward problem”. Node perturbation is a stochastic gradient method, one of many kinds of reinforcement learning implementations, which searches for an approximate gradient by introducing perturbations into a network. Since the stochastic gradient method does not require the derivative of an objective function, it is expected to be able to account for the learning mechanisms of complex systems such as the brain. We study node perturbation with the eligibility trace as a specific example of delayed reward-based learning, and analyze it using a statistical mechanics approach. As a result, we show the optimal time constant of the eligibility trace with respect to the reward delay, and the existence of unlearnable parameter configurations.
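
    A rough, single-neuron sketch of node perturbation with an eligibility trace under a one-step reward delay: the unit's output is perturbed, recent perturbation-input products are kept in a decaying trace, and the delayed reward is correlated with that trace to form a noisy gradient estimate. The linear unit, the one-step delay, and all constants are assumptions for illustration, not the paper's analyzed system; because the gradient estimate is noisy, the learned weights only approximately match the target.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                            # weights of a single linear unit
w_target = np.array([0.6, -0.4, 0.2])      # implicitly defines the task
lr, sigma, trace_decay = 0.01, 0.5, 0.7
trace = np.zeros(3)
pending_cost = 0.0                         # squared error, paid out one step late

for step in range(5000):
    x = rng.normal(size=3)
    xi = rng.normal(scale=sigma)                  # node perturbation added to the output
    y = w @ x + xi
    reward = -pending_cost                        # delayed reward from the previous step
    pending_cost = (y - w_target @ x) ** 2
    trace = trace_decay * trace + xi * x          # eligibility trace retains old perturbations
    w += lr * reward * trace / sigma ** 2         # correlate the delayed reward with the trace

print("learned:", np.round(w, 2), " target:", w_target)
```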

  5. Benchmarking for Bayesian Reinforcement Learning

    PubMed Central

    Ernst, Damien; Couëtoux, Adrien

    2016-01-01

    In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the rewards collected while interacting with their environment, using some prior knowledge accessed beforehand. Many BRL algorithms have already been proposed, but the benchmarks used to compare them are only relevant for specific cases. The paper addresses this problem, and provides a new BRL comparison methodology along with the corresponding open source library. In this methodology, a comparison criterion that measures the performance of algorithms on large sets of Markov Decision Processes (MDPs) drawn from some probability distributions is defined. In order to enable the comparison of non-anytime algorithms, our methodology also includes a detailed analysis of the computation time requirement of each algorithm. Our library is released with all source code and documentation: it includes three test problems, each of which has two different prior distributions, and seven state-of-the-art RL algorithms. Finally, our library is illustrated by comparing all the available algorithms, and the results are discussed. PMID:27304891

  6. Learning Analytics: Readiness and Rewards

    ERIC Educational Resources Information Center

    Friesen, Norm

    2013-01-01

    This position paper introduces the relatively new field of learning analytics, first by considering the relevant meanings of both "learning" and "analytics," and then by looking at two main levels at which learning analytics can be or has been implemented in educational organizations. Although integrated turnkey systems or…

  7. Synthetic cathinones and their rewarding and reinforcing effects in rodents.

    PubMed

    Watterson, Lucas R; Olive, M Foster

    2014-06-01

    Synthetic cathinones, colloquially referred to as "bath salts", are derivatives of the psychoactive alkaloid cathinone found in Catha edulis (Khat). Since the mid-to-late 2000's, these amphetamine-like psychostimulants have gained popularity amongst drug users due to their potency, low cost, ease of procurement, and constantly evolving chemical structures. Concomitant with their increased use is the emergence of a growing collection of case reports of bizarre and dangerous behaviors, toxicity to numerous organ systems, and death. However, scientific information regarding the abuse liability of these drugs has been relatively slower to materialize. Recently we have published several studies demonstrating that laboratory rodents will readily self-administer the "first generation" synthetic cathinones methylenedioxypyrovalerone (MDPV) and methylone via the intravenous route, in patterns similar to those of methamphetamine. Under progressive ratio schedules of reinforcement, the rank order of reinforcing efficacy of these compounds are MDPV ≥ methamphetamine > methylone. MDPV and methylone, as well as the "second generation" synthetic cathinones α-pyrrolidinovalerophenone (α-PVP) and 4-methylethcathinone (4-MEC), also dose-dependently increase brain reward function. Collectively, these findings indicate that synthetic cathinones have a high abuse and addiction potential and underscore the need for future assessment of the extent and duration of neurotoxicity induced by these emerging drugs of abuse. PMID:25328910

  8. Synthetic cathinones and their rewarding and reinforcing effects in rodents

    PubMed Central

    Watterson, Lucas R.; Olive, M. Foster

    2014-01-01

    Synthetic cathinones, colloquially referred to as “bath salts”, are derivatives of the psychoactive alkaloid cathinone found in Catha edulis (Khat). Since the mid-to-late 2000’s, these amphetamine-like psychostimulants have gained popularity amongst drug users due to their potency, low cost, ease of procurement, and constantly evolving chemical structures. Concomitant with their increased use is the emergence of a growing collection of case reports of bizarre and dangerous behaviors, toxicity to numerous organ systems, and death. However, scientific information regarding the abuse liability of these drugs has been relatively slower to materialize. Recently we have published several studies demonstrating that laboratory rodents will readily self-administer the “first generation” synthetic cathinones methylenedioxypyrovalerone (MDPV) and methylone via the intravenous route, in patterns similar to those of methamphetamine. Under progressive ratio schedules of reinforcement, the rank order of reinforcing efficacy of these compounds are MDPV ≥ methamphetamine > methylone. MDPV and methylone, as well as the “second generation” synthetic cathinones α-pyrrolidinovalerophenone (α-PVP) and 4-methylethcathinone (4-MEC), also dose-dependently increase brain reward function. Collectively, these findings indicate that synthetic cathinones have a high abuse and addiction potential and underscore the need for future assessment of the extent and duration of neurotoxicity induced by these emerging drugs of abuse. PMID:25328910

  9. Phasic dopamine as a prediction error of intrinsic and extrinsic reinforcements driving both action acquisition and reward maximization: a simulated robotic study.

    PubMed

    Mirolli, Marco; Santucci, Vieri G; Baldassarre, Gianluca

    2013-03-01

    An important issue of recent neuroscientific research is to understand the functional role of the phasic release of dopamine in the striatum, and in particular its relation to reinforcement learning. The literature is split between two alternative hypotheses: one considers phasic dopamine as a reward prediction error similar to the computational TD-error, whose function is to guide an animal to maximize future rewards; the other holds that phasic dopamine is a sensory prediction error signal that lets the animal discover and acquire novel actions. In this paper we propose an original hypothesis that integrates these two contrasting positions: according to our view phasic dopamine represents a TD-like reinforcement prediction error learning signal determined by both unexpected changes in the environment (temporary, intrinsic reinforcements) and biological rewards (permanent, extrinsic reinforcements). Accordingly, dopamine plays the functional role of driving both the discovery and acquisition of novel actions and the maximization of future rewards. To validate our hypothesis we perform a series of experiments with a simulated robotic system that has to learn different skills in order to get rewards. We compare different versions of the system in which we vary the composition of the learning signal. The results show that only the system reinforced by both extrinsic and intrinsic reinforcements is able to reach high performance in sufficiently complex conditions. PMID:23353115
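
    The combined learning signal hypothesized above, a TD-like prediction error driven by permanent extrinsic rewards plus temporary intrinsic reinforcements for unexpected events, can be sketched with a tabular learner. The chain environment, the visit-count decay of the intrinsic term, and all parameter values are illustrative assumptions, not the simulated robotic setup of the paper.

    ```python
    def td_with_intrinsic(n_episodes=500, alpha=0.1, gamma=0.9):
        """Tabular TD(0) where the reinforcement is the sum of an extrinsic reward and an
        intrinsic 'surprise' bonus that fades as the triggering event becomes predictable."""
        n_states = 5                      # simple chain: state 0 -> 1 -> ... -> 4 (terminal)
        V = [0.0] * n_states
        visits = [0] * n_states           # used to decay the intrinsic reinforcement

        for _ in range(n_episodes):
            s = 0
            while s < n_states - 1:
                s_next = s + 1
                visits[s_next] += 1
                extrinsic = 1.0 if s_next == n_states - 1 else 0.0   # permanent reward at the goal
                intrinsic = 1.0 / visits[s_next]                     # temporary reinforcement, fades with experience
                reward = extrinsic + intrinsic
                bootstrap = 0.0 if s_next == n_states - 1 else gamma * V[s_next]
                delta = reward + bootstrap - V[s]                    # TD-like reinforcement prediction error
                V[s] += alpha * delta
                s = s_next

        return [round(v, 2) for v in V]

    print("state values along the chain:", td_with_intrinsic())
    ```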

  10. Fuzzy reinforcement learning control for compliance tasks of robotic manipulators.

    PubMed

    Tzafestas, S G; Rigatos, G G

    2002-01-01

    A fuzzy reinforcement learning (FRL) scheme which is based on the principles of sliding-mode control and fuzzy logic is proposed. The FRL uses only immediate reward. Sufficient conditions for the convergence of the FRL to the optimal task performance are studied. The validity of the method is tested through simulation examples of a robot which deburrs a metal surface. PMID:18238109

  11. Changes in corticostriatal connectivity during reinforcement learning in humans

    PubMed Central

    Horga, Guillermo; Maia, Tiago V.; Marsh, Rachel; Hao, Xuejun; Xu, Dongrong; Duan, Yunsuo; Tau, Gregory Z.; Graniello, Barbara; Wang, Zhishun; Kangarlu, Alayar; Martinez, Diana; Packard, Mark G.; Peterson, Bradley S.

    2015-01-01

    Many computational models assume that reinforcement learning relies on changes in synaptic efficacy between cortical regions representing stimuli and striatal regions involved in response selection, but this assumption has thus far lacked empirical support in humans. We recorded hemodynamic signals with fMRI while participants navigated a virtual maze to find hidden rewards. We fitted a reinforcement-learning algorithm to participants’ choice behavior and evaluated the neural activity and the changes in functional connectivity related to trial-by-trial learning variables. Activity in the posterior putamen during choice periods increased progressively during learning. Furthermore, the functional connections between the sensorimotor cortex and the posterior putamen strengthened progressively as participants learned the task. These changes in corticostriatal connectivity differentiated participants who learned the task from those who did not. These findings provide a direct link between changes in corticostriatal connectivity and learning, thereby supporting a central assumption common to several computational models of reinforcement learning. PMID:25393839
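
    The model-fitting step mentioned above, fitting a reinforcement-learning algorithm to trial-by-trial choices, is typically done by maximizing the likelihood of the observed choices under a Q-learning rule with a softmax (here logistic) choice policy. The sketch below uses synthetic two-option data and a crude grid search; it is a generic illustration of the procedure, not the authors' model, task, or fitting routine.

    ```python
    import math
    import random

    def simulate_choices(n_trials=300, alpha=0.4, beta=3.0, seed=1):
        """Generate synthetic two-option choice data from a softmax Q-learner."""
        rng = random.Random(seed)
        p_reward = [0.7, 0.3]                 # hidden reward probabilities of the two options
        q = [0.0, 0.0]
        data = []
        for _ in range(n_trials):
            p1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
            choice = 1 if rng.random() < p1 else 0
            reward = 1.0 if rng.random() < p_reward[choice] else 0.0
            data.append((choice, reward))
            q[choice] += alpha * (reward - q[choice])
        return data

    def neg_log_likelihood(alpha, beta, data):
        """Negative log-likelihood of observed choices under the same Q-learning model."""
        q = [0.0, 0.0]
        nll = 0.0
        for choice, reward in data:
            p1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
            p_choice = p1 if choice == 1 else 1.0 - p1
            nll -= math.log(max(p_choice, 1e-12))
            q[choice] += alpha * (reward - q[choice])
        return nll

    data = simulate_choices()
    # crude grid search over the learning rate and the inverse temperature
    candidates = [(a / 10.0, b) for a in range(1, 10) for b in (1.0, 2.0, 3.0, 5.0)]
    best = min(candidates, key=lambda p: neg_log_likelihood(p[0], p[1], data))
    print("recovered (learning rate, inverse temperature):", best)
    ```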

  12. Learning when reward is delayed: a marking hypothesis.

    PubMed

    Lieberman, D A; McIntosh, D C; Thomas, G V

    1979-07-01

    Rats were trained on spatial discriminations in which reward was delayed for 1 min. Experiment 1 tested Lett's hypothesis that responses made in the home cage during the delay interval are less likely to interfere with learning than responses made in the maze. Experimental subjects were transferred to their home cages during the delay interval, and control subjects were picked up but then immediately replaced in the maze. Contrary to Lett's hypothesis, both groups learned. Further experiments suggested that handling following a choice response was the crucial variable in producing learning: No learning occurred when handling was delayed (Experiment 2) or omitted (Experiment 3). One possible explanation for the fact that handling facilitated learning is that it served to mark the preceding choice response in memory so that subjects were then more likely to recall it when subsequently reinforced. In accordance with this interpretation, learning was found to be just as strong when the choice response was followed by an intense light or noise as by handling (Experiment 4). The implication of marking for other phenomena such as avoidance, quasi-reinforcement, and the paradoxical effects of punishment is also discussed. PMID:528888

  13. Differential Reward Learning for Self and Others Predicts Self-Reported Altruism

    PubMed Central

    Kwak, Youngbin; Pearson, John; Huettel, Scott A.

    2014-01-01

    In social environments, decisions not only determine rewards for oneself but also for others. However, individual differences in pro-social behaviors have been typically studied through self-report. We developed a decision-making paradigm in which participants chose from card decks with differing rewards for themselves and charity; some decks gave similar rewards to both, while others gave higher rewards for one or the other. We used a reinforcement-learning model that estimated each participant's relative weighting of self versus charity reward. As shown both in choices and model parameters, individuals who showed relatively better learning of rewards for charity – compared to themselves – were more likely to engage in pro-social behavior outside of a laboratory setting indicated by self-report. Overall rates of reward learning, however, did not predict individual differences in pro-social tendencies. These results support the idea that biases toward learning about social rewards are associated with one's altruistic tendencies. PMID:25215883
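
    One simple way to express the "relative weighting of self versus charity reward" idea above inside a reinforcement-learning model is to learn from a weighted combination of the two outcomes. The deck payoffs, the single learning rate, and the weighting parameterization below are illustrative assumptions, not the published model or task.

    ```python
    import random

    def weighted_reward_learner(w_self, n_trials=400, alpha=0.2, epsilon=0.1, seed=2):
        """Q-learner whose teaching signal is w_self * own reward + (1 - w_self) * charity reward.

        A small w_self corresponds to relatively better learning about outcomes for charity."""
        rng = random.Random(seed)
        # (mean reward for self, mean reward for charity) of three illustrative decks
        decks = [(1.0, 0.2), (0.2, 1.0), (0.6, 0.6)]
        q = [0.0, 0.0, 0.0]

        for _ in range(n_trials):
            if rng.random() < epsilon:
                choice = rng.randrange(len(decks))
            else:
                choice = max(range(len(decks)), key=lambda i: q[i])
            mean_self, mean_charity = decks[choice]
            r_self = rng.gauss(mean_self, 0.2)
            r_charity = rng.gauss(mean_charity, 0.2)
            combined = w_self * r_self + (1.0 - w_self) * r_charity
            q[choice] += alpha * (combined - q[choice])

        return [round(v, 2) for v in q]

    print("deck values, altruistic weighting  :", weighted_reward_learner(w_self=0.2))
    print("deck values, self-focused weighting:", weighted_reward_learner(w_self=0.8))
    ```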

  14. Hybrid Approach to Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Boulebtateche, Brahim; Fezari, Mourad; Boughazi, Mohamed

    2008-06-01

    Reinforcement Learning (RL) is a general framework in which an autonomous agent tries to learn an optimal policy of actions from direct interaction with the surrounding environment. However, one difficulty for the application of RL control is its slow convergence, especially in environments with a continuous state space. In this paper, a modified structure of RL is proposed to speed up reinforcement learning control. In this approach, a supervision technique is combined with standard Q-learning, a model-free algorithm of reinforcement learning. The a priori information is provided to the RL agent by an optimal LQ-controller, which is used to indicate preferred actions at intermittent times. It is shown that the convergence speed of the supervised RL agent is greatly improved compared to the conventional Q-learning algorithm. Simulation work and results on the cart-pole balancing problem are given to illustrate the efficiency of the proposed method.
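
    The general idea above, seeding Q-learning with preferred actions suggested intermittently by an external controller, can be shown on a toy problem. A hand-coded supervisor on a short one-dimensional chain stands in for the LQ-controller and the cart-pole system; the episode counts, rates, and supervision period are illustrative assumptions.

    ```python
    import random

    def supervised_q_learning(use_supervisor, n_episodes=200, alpha=0.5, gamma=0.95,
                              epsilon=0.2, supervisor_period=5, seed=3):
        """Q-learning on a 1-D chain where, every few steps, a supervisor's preferred
        action replaces the epsilon-greedy one, seeding Q with useful experience."""
        rng = random.Random(seed)
        n_states, goal = 10, 9
        q = [[0.0, 0.0] for _ in range(n_states)]          # actions: 0 = left, 1 = right

        steps_per_episode = []
        for _ in range(n_episodes):
            s, t = 0, 0
            while s != goal and t < 200:
                t += 1
                if use_supervisor and t % supervisor_period == 0:
                    a = 1                                    # supervisor always points toward the goal
                elif rng.random() < epsilon:
                    a = rng.randrange(2)
                else:
                    a = max((0, 1), key=lambda x: q[s][x])
                s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
                r = 1.0 if s_next == goal else -0.01
                target = r + (0.0 if s_next == goal else gamma * max(q[s_next]))
                q[s][a] += alpha * (target - q[s][a])
                s = s_next
            steps_per_episode.append(t)
        return sum(steps_per_episode[:20]) / 20.0            # average early-episode length

    print("average early-episode steps, with supervisor   :", supervised_q_learning(True))
    print("average early-episode steps, without supervisor:", supervised_q_learning(False))
    ```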

  15. Stochastic optimization of multireservoir systems via reinforcement learning

    NASA Astrophysics Data System (ADS)

    Lee, Jin-Hee; Labadie, John W.

    2007-11-01

    Although several variants of stochastic dynamic programming have been applied to optimal operation of multireservoir systems, they have been plagued by a high-dimensional state space and the inability to accurately incorporate the stochastic environment as characterized by temporally and spatially correlated hydrologic inflows. Reinforcement learning has emerged as an effective approach to solving sequential decision problems by combining concepts from artificial intelligence, cognitive science, and operations research. A reinforcement learning system has a mathematical foundation similar to dynamic programming and Markov decision processes, with the goal of maximizing the long-term reward or returns as conditioned on the state of the system environment and the immediate reward obtained from operational decisions. Reinforcement learning can include Monte Carlo simulation where transition probabilities and rewards are not explicitly known a priori. The Q-Learning method in reinforcement learning is demonstrated on the two-reservoir Geum River system, South Korea, and is shown to outperform implicit stochastic dynamic programming and sampling stochastic dynamic programming methods.
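
    The appeal described above, that Q-learning needs only sampled experience rather than explicit transition probabilities over inflows, can be illustrated with a heavily simplified single-reservoir toy. The discretization, inflow distribution, reward terms, and parameters below are invented for the example and bear no relation to the Geum River study.

    ```python
    import random

    def reservoir_q_learning(n_episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1, seed=4):
        """Tabular Q-learning for a toy single-reservoir operation problem.

        Storage and releases are coarsely discretized; inflows are sampled rather than
        modeled, so transition probabilities never need to be written down explicitly."""
        rng = random.Random(seed)
        max_storage, actions = 10, [0, 1, 2, 3]            # release volumes per period
        q = {(s, a): 0.0 for s in range(max_storage + 1) for a in actions}

        for _ in range(n_episodes):
            s = rng.randint(0, max_storage)
            for _period in range(12):                      # one simulated year
                a = (rng.choice(actions) if rng.random() < epsilon
                     else max(actions, key=lambda x: q[(s, x)]))
                release = min(a, s)
                inflow = rng.choice([0, 1, 1, 2, 3])       # sampled hydrologic inflow
                spill = max(s - release + inflow - max_storage, 0)
                s_next = min(s - release + inflow, max_storage)
                # reward: benefit from water released, penalties for spill and empty storage
                reward = 2.0 * release - 3.0 * spill - (1.0 if s_next == 0 else 0.0)
                q[(s, a)] += alpha * (reward + gamma * max(q[(s_next, x)] for x in actions)
                                      - q[(s, a)])
                s = s_next

        # greedy release decision for each storage level
        return {s: max(actions, key=lambda x: q[(s, x)]) for s in range(max_storage + 1)}

    print("greedy release by storage level:", reservoir_q_learning())
    ```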

  16. Differential Influence of Levodopa on Reward-Based Learning in Parkinson's Disease

    PubMed Central

    Graef, Susanne; Biele, Guido; Krugel, Lea K.; Marzinzik, Frank; Wahl, Michael; Wotka, Johann; Klostermann, Fabian; Heekeren, Hauke R.

    2010-01-01

    The mesocorticolimbic dopamine (DA) system linking the dopaminergic midbrain to the prefrontal cortex and subcortical striatum has been shown to be sensitive to reinforcement in animals and humans. Within this system, coexistent segregated striato-frontal circuits have been linked to different functions. In the present study, we tested patients with Parkinson's disease (PD), a neurodegenerative disorder characterized by dopaminergic cell loss, on two reward-based learning tasks assumed to differentially involve dorsal and ventral striato-frontal circuits. 15 non-depressed and non-demented PD patients on levodopa monotherapy were tested both on and off medication. Levodopa had beneficial effects on the performance on an instrumental learning task with constant stimulus-reward associations, hypothesized to rely on dorsal striato-frontal circuits. In contrast, performance on a reversal learning task with changing reward contingencies, relying on ventral striato-frontal structures, was better in the unmedicated state. These results are in line with the “overdose hypothesis” which assumes detrimental effects of dopaminergic medication on functions relying upon less affected regions in PD. This study demonstrates, in a within-subject design, a double dissociation of dopaminergic medication and performance on two reward-based learning tasks differing in regard to whether reward contingencies are constant or dynamic. There was no evidence for a dose effect of levodopa on reward-based behavior with the patients’ actual levodopa dose being uncorrelated to their performance on the reward-based learning tasks. PMID:21048900

  17. Reinforcement learning in supply chains.

    PubMed

    Valluri, Annapurna; North, Michael J; Macal, Charles M

    2009-10-01

    Effective management of supply chains creates value and can strategically position companies. In practice, human beings have been found to be both surprisingly successful and disappointingly inept at managing supply chains. The related fields of cognitive psychology and artificial intelligence have postulated a variety of potential mechanisms to explain this behavior. One of the leading candidates is reinforcement learning. This paper applies agent-based modeling to investigate the comparative behavioral consequences of three simple reinforcement learning algorithms in a multi-stage supply chain. For the first time, our findings show that the specific algorithm that is employed can have dramatic effects on the results obtained. Reinforcement learning is found to be valuable in multi-stage supply chains with several learning agents, as independent agents can learn to coordinate their behavior. However, learning in multi-stage supply chains using these postulated approaches from cognitive psychology and artificial intelligence takes extremely long time periods to achieve stability, which raises questions about their ability to explain behavior in real supply chains. The fact that it takes thousands of periods for agents to learn in this simple multi-agent setting provides new evidence that real-world decision makers are unlikely to be using strict reinforcement learning in practice. PMID:19885962

  18. Role of Dopamine D2 Receptors in Human Reinforcement Learning

    PubMed Central

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-01-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well. PMID:24713613

  19. Reinforcement active learning in the vibrissae system: optimal object localization.

    PubMed

    Gordon, Goren; Dorfman, Nimrod; Ahissar, Ehud

    2013-01-01

    Rats move their whiskers to acquire information about their environment. It has been observed that they palpate novel objects and objects they are required to localize in space. We analyze whisker-based object localization using two complementary paradigms, namely, active learning and intrinsic-reward reinforcement learning. Active learning algorithms select the next training samples according to the hypothesized solution in order to better discriminate between correct and incorrect labels. Intrinsic-reward reinforcement learning uses prediction errors as the reward to an actor-critic design, such that behavior converges to the one that optimizes the learning process. We show that in the context of object localization, the two paradigms result in palpation whisking as their respective optimal solution. These results suggest that rats may employ principles of active learning and/or intrinsic reward in tactile exploration and can guide future research to seek the underlying neuronal mechanisms that implement them. Furthermore, these paradigms are easily transferable to biomimetic whisker-based artificial sensors and can improve the active exploration of their environment. PMID:22789551

  20. Monetary reward modulates task-irrelevant perceptual learning for invisible stimuli.

    PubMed

    Pascucci, David; Mastropasqua, Tommaso; Turatto, Massimo

    2015-01-01

    Task Irrelevant Perceptual Learning (TIPL) shows that the brain's discriminative capacity can improve also for invisible and unattended visual stimuli. It has been hypothesized that this form of "unconscious" neural plasticity is mediated by an endogenous reward mechanism triggered by the correct task performance. Although this result has challenged the mandatory role of attention in perceptual learning, no direct evidence exists of the hypothesized link between target recognition, reward and TIPL. Here, we manipulated the reward value associated with a target to demonstrate the involvement of reinforcement mechanisms in sensory plasticity for invisible inputs. Participants were trained in a central task associated with either high or low monetary incentives, provided only at the end of the experiment, while subliminal stimuli were presented peripherally. Our results showed that high incentive-value targets induced a greater degree of perceptual improvement for the subliminal stimuli, supporting the role of reinforcement mechanisms in TIPL. PMID:25942318

  1. Prior fear conditioning and reward learning interact in fear and reward networks

    PubMed Central

    Bulganin, Lisa; Bach, Dominik R.; Wittmann, Bianca C.

    2014-01-01

    The ability to flexibly adapt responses to changes in the environment is important for survival. Previous research in humans separately examined the mechanisms underlying acquisition and extinction of aversive and appetitive conditioned responses. It is yet unclear how aversive and appetitive learning interact on a neural level during counterconditioning in humans. This functional magnetic resonance imaging (fMRI) study investigated the interaction of fear conditioning and subsequent reward learning. In the first phase (fear acquisition), images predicted aversive electric shocks or no aversive outcome. In the second phase (counterconditioning), half of the CS+ and CS− were associated with monetary reward in the absence of electric stimulation. The third phase initiated reinstatement of fear through presentation of electric shocks, followed by CS presentation in the absence of shock or reward. Results indicate that participants were impaired at learning the reward contingencies for stimuli previously associated with shock. In the counterconditioning phase, prior fear association interacted with reward representation in the amygdala, where activation was decreased for rewarded compared to unrewarded CS− trials, while there was no reward-related difference in CS+ trials. In the reinstatement phase, an interaction of previous fear association and previous reward status was observed in a reward network consisting of substantia nigra/ventral tegmental area (SN/VTA), striatum and orbitofrontal cortex (OFC), where activation was increased by previous reward association only for CS− but not for CS+ trials. These findings suggest that during counterconditioning, prior fear conditioning interferes with reward learning, subsequently leading to lower activation of the reward network. PMID:24624068

  2. Scaling prediction errors to reward variability benefits error-driven learning in humans

    PubMed Central

    Schultz, Wolfram

    2015-01-01

    Effective error-driven learning requires individuals to adapt learning to environmental reward variability. The adaptive mechanism may involve decays in learning rate across subsequent trials, as shown previously, and rescaling of reward prediction errors. The present study investigated the influence of prediction error scaling and, in particular, the consequences for learning performance. Participants explicitly predicted reward magnitudes that were drawn from different probability distributions with specific standard deviations. By fitting the data with reinforcement learning models, we found scaling of prediction errors, in addition to the learning rate decay shown previously. Importantly, the prediction error scaling was closely related to learning performance, defined as accuracy in predicting the mean of reward distributions, across individual participants. In addition, participants who scaled prediction errors relative to standard deviation also presented with more similar performance for different standard deviations, indicating that increases in standard deviation did not substantially decrease “adapters'” accuracy in predicting the means of reward distributions. However, exaggerated scaling beyond the standard deviation resulted in impaired performance. Thus efficient adaptation makes learning more robust to changing variability. PMID:26180123
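
    The adaptive mechanism described above, rescaling prediction errors by an estimate of reward variability, amounts to a one-line change to a delta-rule learner. The running estimate of the standard deviation and the parameter values below are illustrative assumptions, not the fitted model from the study.

    ```python
    import random

    def scaled_delta_rule(reward_sd, n_trials=400, alpha=0.1, seed=5):
        """Predict the mean of a reward distribution while scaling each prediction error
        by a running estimate of reward variability before applying the update."""
        rng = random.Random(seed)
        true_mean = 10.0
        prediction, scale = 0.0, 1.0

        for _ in range(n_trials):
            reward = rng.gauss(true_mean, reward_sd)
            delta = reward - prediction
            scale += alpha * (abs(delta) - scale)              # crude running estimate of variability
            prediction += alpha * delta / max(scale, 1e-6)     # scaled prediction error drives learning
        return prediction

    for sd in (2.0, 10.0):
        print(f"reward sd {sd:>4}: final prediction {scaled_delta_rule(sd):.1f} (true mean 10.0)")
    ```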

  3. The emergence of saliency and novelty responses from Reinforcement Learning principles.

    PubMed

    Laurent, Patryk A

    2008-12-01

    Recent attempts to map reward-based learning models, like Reinforcement Learning [Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An introduction. Cambridge, MA: MIT Press], to the brain are based on the observation that phasic increases and decreases in the spiking of dopamine-releasing neurons signal differences between predicted and received reward [Gillies, A., & Arbuthnott, G. (2000). Computational models of the basal ganglia. Movement Disorders, 15(5), 762-770; Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80(1), 1-27]. However, this reward-prediction error is only one of several signals communicated by that phasic activity; another involves an increase in dopaminergic spiking, reflecting the appearance of salient but unpredicted non-reward stimuli [Doya, K. (2002). Metalearning and neuromodulation. Neural Networks, 15(4-6), 495-506; Horvitz, J. C. (2000). Mesolimbocortical and nigrostriatal dopamine responses to salient non-reward events. Neuroscience, 96(4), 651-656; Redgrave, P., & Gurney, K. (2006). The short-latency dopamine signal: A role in discovering novel actions? Nature Reviews Neuroscience, 7(12), 967-975], especially when an organism subsequently orients towards the stimulus [Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80(1), 1-27]. To explain these findings, Kakade and Dayan [Kakade, S., & Dayan, P. (2002). Dopamine: Generalization and bonuses. Neural Networks, 15(4-6), 549-559.] and others have posited that novel, unexpected stimuli are intrinsically rewarding. The simulation reported in this article demonstrates that this assumption is not necessary because the effect it is intended to capture emerges from the reward-prediction learning mechanisms of Reinforcement Learning. Thus, Reinforcement Learning principles can be used to understand not just reward-related activity of the dopaminergic neurons of the basal ganglia, but also some

  4. Tunnel Ventilation Control Using Reinforcement Learning Methodology

    NASA Astrophysics Data System (ADS)

    Chu, Baeksuk; Kim, Dongnam; Hong, Daehie; Park, Jooyoung; Chung, Jin Taek; Kim, Tae-Hyung

    The main purpose of a tunnel ventilation system is to keep the CO pollutant concentration and VI (visibility index) at an adequate level to provide drivers with a comfortable and safe driving environment. Moreover, it is necessary to minimize the power consumed to operate the ventilation system. To achieve these objectives, the control algorithm used in this research is the reinforcement learning (RL) method. RL is goal-directed learning of a mapping from situations to actions that does not rely on exemplary supervision or complete models of the environment. The goal of RL is to maximize a reward, which is an evaluative feedback signal from the environment. The reward constructed for the tunnel ventilation system includes the two objectives listed above, that is, maintaining an adequate pollutant level and minimizing power consumption. An RL algorithm based on an actor-critic architecture and a gradient-following algorithm is applied to the tunnel ventilation system. Simulation results obtained with real data collected from an existing tunnel ventilation system, along with real experimental verification, are provided in this paper. It is confirmed that with the suggested controller, the pollutant level inside the tunnel was kept under the allowable limit and energy consumption was improved compared to the conventional control scheme.

  5. Robot-assisted motor training: assistance decreases exploration during reinforcement learning.

    PubMed

    Sans-Muntadas, Albert; Duarte, Jaime E; Reinkensmeyer, David J

    2014-01-01

    Reinforcement learning (RL) is a form of motor learning that robotic therapy devices could potentially manipulate to promote neurorehabilitation. We developed a system that requires trainees to use RL to learn a predefined target movement. The system provides higher rewards for movements that are more similar to the target movement. We also developed a novel algorithm that rewards trainees of different abilities with comparable reward sizes. This algorithm measures a trainee's performance relative to their best performance, rather than relative to an absolute target performance, to determine reward. We hypothesized this algorithm would permit subjects who cannot normally achieve high reward levels to do so while still learning. In an experiment with 21 unimpaired human subjects, we found that all subjects quickly learned to make a first target movement with and without the reward equalization. However, artificially increasing reward decreased the subjects' tendency to engage in exploration and therefore slowed learning, particularly when we changed the target movement. An anti-slacking watchdog algorithm further slowed learning. These results suggest that robotic algorithms that assist trainees in achieving rewards or in preventing slacking might, over time, discourage the exploration needed for reinforcement learning. PMID:25570749
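
    The reward-equalization idea above, grading each attempt against the trainee's own best rather than an absolute target, can be written in a few lines. The scalar movement error, the improvement schedule of the simulated trainee, and the normalization are illustrative assumptions, not the published algorithm.

    ```python
    import random

    def equalized_reward(error, best_error):
        """Reward in [0, 1] based on how close this attempt is to the trainee's own best so far,
        rather than to an absolute performance criterion."""
        if error <= best_error:
            return 1.0                                   # matching or beating the personal best earns full reward
        return max(0.0, 1.0 - (error - best_error) / max(best_error, 1e-6))

    rng = random.Random(6)
    best = float("inf")
    for trial in range(1, 11):
        skill = 2.0 / trial                              # toy trainee who slowly improves
        error = abs(rng.gauss(skill, 0.3))
        best = min(best, error)
        print(f"trial {trial:2d}: error {error:.2f}  reward {equalized_reward(error, best):.2f}")
    ```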

  6. Reward-Guided Learning with and without Causal Attribution.

    PubMed

    Jocham, Gerhard; Brodersen, Kay H; Constantinescu, Alexandra O; Kahn, Martin C; Ianni, Angela M; Walton, Mark E; Rushworth, Matthew F S; Behrens, Timothy E J

    2016-04-01

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task. PMID:26971947

  7. Reward-Guided Learning with and without Causal Attribution

    PubMed Central

    Jocham, Gerhard; Brodersen, Kay H.; Constantinescu, Alexandra O.; Kahn, Martin C.; Ianni, Angela M.; Walton, Mark E.; Rushworth, Matthew F.S.; Behrens, Timothy E.J.

    2016-01-01

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task. PMID:26971947

  8. Learning strategies in table tennis using inverse reinforcement learning.

    PubMed

    Muelling, Katharina; Boularias, Abdeslam; Mohler, Betty; Schölkopf, Bernhard; Peters, Jan

    2014-10-01

    Learning a complex task such as table tennis is a challenging problem for both robots and humans. Even after acquiring the necessary motor skills, a strategy is needed to choose where and how to return the ball to the opponent's court in order to win the game. The data-driven identification of basic strategies in interactive tasks, such as table tennis, is a largely unexplored problem. In this paper, we suggest a computational model for representing and inferring strategies, based on a Markov decision problem, where the reward function models the goal of the task as well as the strategic information. We show how this reward function can be discovered from demonstrations of table tennis matches using model-free inverse reinforcement learning. The resulting framework allows us to identify the basic elements on which the selection of striking movements is based. We tested our approach on data collected from players with different playing styles and under different playing conditions. The estimated reward function was able to capture expert-specific strategic information that sufficed to distinguish the expert among players with different skill levels as well as different playing styles. PMID:24756167

  9. Stress Modulates Reinforcement Learning in Younger and Older Adults

    PubMed Central

    Lighthall, Nichole R.; Gorlick, Marissa A.; Schoeke, Andrej; Frank, Michael J.; Mather, Mara

    2012-01-01

    Animal research and human neuroimaging studies indicate that stress increases dopamine levels in brain regions involved in reward processing and stress also appears to increase the attractiveness of addictive drugs. The current study tested the hypothesis that stress increases reward salience, leading to more effective learning about positive than negative outcomes in a probabilistic selection task. Changes to dopamine pathways with age raise the question of whether stress effects on incentive-based learning differ by age. Thus, the present study also examined whether effects of stress on reinforcement learning differed for younger (age 18–34) and older participants (age 65–85). Cold pressor stress was administered to half of the participants in each age group and salivary cortisol levels were used to confirm biophysiological response to cold stress. Following the manipulation, participants completed a probabilistic learning task involving positive and negative feedback. In both younger and older adults, stress enhanced learning about cues that predicted positive outcomes. In addition, during the initial learning phase, stress diminished sensitivity to recent feedback across age groups. These results indicate that stress affects reinforcement learning in both younger and older adults and suggests that stress exerts different effects on specific components of reinforcement learning depending on their neural underpinnings. PMID:22946523

  10. Rational and Mechanistic Perspectives on Reinforcement Learning

    ERIC Educational Resources Information Center

    Chater, Nick

    2009-01-01

    This special issue describes important recent developments in applying reinforcement learning models to capture neural and cognitive function. But reinforcement learning, as a theoretical framework, can apply at two very different levels of description: "mechanistic" and "rational." Reinforcement learning is often viewed in mechanistic terms--as…

  11. Rule Learning in Autism: The Role of Reward Type and Social Context

    PubMed Central

    Jones, E. J. H.; Webb, S. J.; Estes, A.; Dawson, G.

    2013-01-01

    Learning abstract rules is central to social and cognitive development. Across two experiments, we used Delayed Non-Matching to Sample tasks to characterize the longitudinal development and nature of rule-learning impairments in children with Autism Spectrum Disorder (ASD). Results showed that children with ASD consistently experienced more difficulty learning an abstract rule from a discrete physical reward than children with DD. Rule learning was facilitated by the provision of more concrete reinforcement, suggesting an underlying difficulty in forming conceptual connections. Learning abstract rules about social stimuli remained challenging through late childhood, indicating the importance of testing executive functions in both social and non-social contexts. PMID:23311315

  12. Rule learning in autism: the role of reward type and social context.

    PubMed

    Jones, E J H; Webb, S J; Estes, A; Dawson, G

    2013-01-01

    Learning abstract rules is central to social and cognitive development. Across two experiments, we used Delayed Non-Matching to Sample tasks to characterize the longitudinal development and nature of rule-learning impairments in children with Autism Spectrum Disorder (ASD). Results showed that children with ASD consistently experienced more difficulty learning an abstract rule from a discrete physical reward than children with DD. Rule learning was facilitated by the provision of more concrete reinforcement, suggesting an underlying difficulty in forming conceptual connections. Learning abstract rules about social stimuli remained challenging through late childhood, indicating the importance of testing executive functions in both social and non-social contexts. PMID:23311315

  13. Dopamine-Dependent Reinforcement of Motor Skill Learning: Evidence from Gilles de la Tourette Syndrome

    ERIC Educational Resources Information Center

    Palminteri, Stefano; Lebreton, Mael; Worbe, Yulia; Hartmann, Andreas; Lehericy, Stephane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-01-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only…

  14. Working memory and reward association learning impairments in obesity

    PubMed Central

    Coppin, Géraldine; Nolan-Poupart, Sarah; Jones-Gotman, Marilyn; Small, Dana M.

    2014-01-01

    Obesity has been associated with impaired executive functions including working memory. Less explored is the influence of obesity on learning and memory. In the current study we assessed stimulus reward association learning, explicit learning and memory and working memory in healthy weight, overweight and obese individuals. Explicit learning and memory did not differ as a function of group. In contrast, working memory was significantly and similarly impaired in both overweight and obese individuals compared to the healthy weight group. In the first reward association learning task the obese, but not healthy weight or overweight participants consistently formed paradoxical preferences for a pattern associated with a negative outcome (fewer food rewards). To determine if the deficit was specific to food reward a second experiment was conducted using money. Consistent with experiment 1, obese individuals selected the pattern associated with a negative outcome (fewer monetary rewards) more frequently than healthy weight individuals and thus failed to develop a significant preference for the most rewarded patterns as was observed in the healthy weight group. Finally, on a probabilistic learning task, obese compared to healthy weight individuals showed deficits in negative, but not positive outcome learning. Taken together, our results demonstrate deficits in working memory and stimulus reward learning in obesity and suggest that obese individuals are impaired in learning to avoid negative outcomes. PMID:25447070

  15. Learning arm's posture control using reinforcement learning and feedback-error-learning.

    PubMed

    Kambara, H; Kim, J; Sato, M; Koike, Y

    2004-01-01

    In this paper, we propose a learning model using the Actor-Critic method and the feedback-error-learning scheme. The Actor-Critic method, which is one of the major frameworks in reinforcement learning, has attracted attention as a computational learning model in the basal ganglia. Meanwhile, feedback-error-learning is a learning architecture proposed as a computationally coherent model of cerebellar motor learning. The purpose of this architecture is to acquire a feed-forward controller by using a feedback controller's output as an error signal. In past research, a predetermined constant-gain feedback controller was used for feedback-error-learning. We use the Actor-Critic method to obtain the feedback controller in feedback-error-learning. By applying the proposed learning model to an arm's posture control, we show that high-performance feedback and feed-forward controllers can be acquired using only a scalar reward value. PMID:17271719
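
    The architecture described above trains a feed-forward controller using a feedback controller's output as its error signal. The sketch below shows that core feedback-error-learning loop on a trivial one-dimensional static plant; for brevity the feedback controller is a fixed proportional gain, whereas in the paper it is itself acquired by an Actor-Critic learner from a scalar reward. All names and values are illustrative.

    ```python
    import random

    def feedback_error_learning(n_trials=300, lr=0.05, k_fb=0.8, seed=7):
        """Feedback-error-learning on a toy static plant: position = plant_gain * command.

        A feed-forward gain w is trained with the feedback command as its error signal,
        so w converges toward the inverse of the (unknown) plant gain."""
        rng = random.Random(seed)
        plant_gain = 2.0                        # unknown to both controllers
        w = 0.0                                 # feed-forward gain; its ideal value is 1 / plant_gain

        for _ in range(n_trials):
            target = rng.uniform(-1.0, 1.0)
            u_ff = w * target                   # feed-forward command computed from the target
            error = target - plant_gain * u_ff  # tracking error remaining after the feed-forward command
            u_fb = k_fb * error                 # feedback correction (fixed gain here; Actor-Critic in the paper)
            w += lr * u_fb * target             # feedback-error-learning update of the feed-forward gain

        # after training, check tracking with the feed-forward controller alone
        test_target = 0.7
        residual = test_target - plant_gain * (w * test_target)
        return w, residual

    w, residual = feedback_error_learning()
    print(f"learned feed-forward gain: {w:.3f} (ideal 0.5), residual tracking error: {residual:.3f}")
    ```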

  16. Pedunculopontine tegmental nucleus lesions impair probabilistic reversal learning by reducing sensitivity to positive reward feedback.

    PubMed

    Syed, Anam; Baker, Phillip M; Ragozzino, Michael E

    2016-05-01

    Recent findings indicate that pedunculopontine tegmental nucleus (PPTg) neurons encode reward-related information that is context-dependent. This information is critical for behavioral flexibility when reward outcomes change, signaling that a shift in response patterns should occur. The present experiment investigated whether NMDA lesions of the PPTg affect the acquisition and/or reversal learning of a spatial discrimination using probabilistic reinforcement. Male Long-Evans rats received a bilateral infusion of NMDA (30 nmoles/side) or saline into the PPTg. Subsequently, rats were tested in a spatial discrimination test using a probabilistic learning procedure. One spatial location was rewarded with an 80% probability and the other spatial location rewarded with a 20% probability. After reaching the acquisition criterion of 10 consecutive correct trials, the spatial location - reward contingencies were reversed in the following test session. Bilateral and unilateral PPTg-lesioned rats acquired the spatial discrimination comparably to sham controls. In contrast, bilateral PPTg lesions, but not unilateral PPTg lesions, impaired reversal learning. The reversal learning deficit occurred because of increased regressions to the previously 'correct' spatial location after initially selecting the new, 'correct' choice. PPTg lesions also reduced the frequency of win-stay behavior early in the reversal learning session, but did not modify the frequency of lose-shift behavior during reversal learning. The present results suggest that the PPTg contributes to behavioral flexibility under conditions in which outcomes are uncertain, e.g. probabilistic reinforcement, by facilitating sensitivity to positive reward outcomes that allows the reliable execution of a new choice pattern. PMID:26976089

  17. COMT Val(158) Met genotype is associated with reward learning: a replication study and meta-analysis.

    PubMed

    Corral-Frías, N S; Pizzagalli, D A; Carré, J M; Michalski, L J; Nikolova, Y S; Perlis, R H; Fagerness, J; Lee, M R; Conley, E Drabant; Lancaster, T M; Haddad, S; Wolf, A; Smoller, J W; Hariri, A R; Bogdan, R

    2016-06-01

    Identifying mechanisms through which individual differences in reward learning emerge offers an opportunity to understand both a fundamental form of adaptive responding as well as etiological pathways through which aberrant reward learning may contribute to maladaptive behaviors and psychopathology. One candidate mechanism through which individual differences in reward learning may emerge is variability in dopaminergic reinforcement signaling. A common functional polymorphism within the catechol-O-methyl transferase gene (COMT; rs4680, Val(158) Met) has been linked to reward learning, where homozygosity for the Met allele (linked to heightened prefrontal dopamine function and decreased dopamine synthesis in the midbrain) has been associated with relatively increased reward learning. Here, we used a probabilistic reward learning task to assess response bias, a behavioral form of reward learning, across three separate samples that were combined for analyses (age: 21.80 ± 3.95; n = 392; 268 female; European-American: n = 208). We replicate prior reports that COMT rs4680 Met allele homozygosity is associated with increased reward learning in European-American participants (β = 0.20, t = 2.75, P < 0.01; ΔR(2) = 0.04). Moreover, a meta-analysis of 4 studies, including the current one, confirmed the association between COMT rs4680 genotype and reward learning (95% CI -0.11 to -0.03; z = 3.2; P < 0.01). These results suggest that variability in dopamine signaling associated with COMT rs4680 influences individual differences in reward learning, which may potentially contribute to psychopathology characterized by reward dysfunction. PMID:27138112

  18. Use of Inverse Reinforcement Learning for Identity Prediction

    NASA Technical Reports Server (NTRS)

    Hayes, Roy; Bao, Jonathan; Beling, Peter; Horowitz, Barry

    2011-01-01

    We adopt Markov Decision Processes (MDP) to model sequential decision problems, which have the characteristic that the current decision made by a human decision maker has an uncertain impact on future opportunity. We hypothesize that the individuality of decision makers can be modeled as differences in the reward function under a common MDP model. A machine learning technique, Inverse Reinforcement Learning (IRL), was used to learn an individual's reward function based on limited observation of his or her decision choices. This work serves as an initial investigation for using IRL to analyze decision making, conducted through a human experiment in a cyber shopping environment. Specifically, the ability to determine the demographic identity of users is conducted through prediction analysis and supervised learning. The results show that IRL can be used to correctly identify participants, at a rate of 68% for gender and 66% for one of three college major categories.

  19. The role of reward in word learning and its implications for language acquisition.

    PubMed

    Ripollés, Pablo; Marco-Pallarés, Josep; Hielscher, Ulrike; Mestres-Missé, Anna; Tempelmann, Claus; Heinze, Hans-Jochen; Rodríguez-Fornells, Antoni; Noesselt, Toemme

    2014-11-01

    The exact neural processes behind humans' drive to acquire a new language--first as infants and later as second-language learners--are yet to be established. Recent theoretical models have proposed that during human evolution, emerging language-learning mechanisms might have been glued to phylogenetically older subcortical reward systems, reinforcing human motivation to learn a new language. Supporting this hypothesis, our results showed that adult participants exhibited robust fMRI activation in the ventral striatum (VS)--a core region of reward processing--when successfully learning the meaning of new words. This activation was similar to the VS recruitment elicited using an independent reward task. Moreover, the VS showed enhanced functional and structural connectivity with neocortical language areas during successful word learning. Together, our results provide evidence for the neural substrate of reward and motivation during word learning. We suggest that this strong functional and anatomical coupling between neocortical language regions and the subcortical reward system provided a crucial advantage in humans that eventually enabled our lineage to successfully acquire linguistic skills. PMID:25447993

  20. Multi Agent Reward Analysis for Learning in Noisy Domains

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2005-01-01

    In many multi-agent learning problems, it is difficult to determine, a priori, the agent reward structure that will lead to good performance. This problem is particularly pronounced in continuous, noisy domains ill-suited to simple table backup schemes commonly used in TD(lambda)/Q-learning. In this paper, we present a new reward evaluation method that allows the tradeoff between coordination among the agents and the difficulty of the learning problem each agent faces to be visualized. This method is independent of the learning algorithm and is only a function of the problem domain and the agents' reward structure. We then use this reward efficiency visualization method to determine an effective reward without performing extensive simulations. We test this method in both a static and a dynamic multi-rover learning domain where the agents have continuous state spaces and where their actions are noisy (e.g., the agents' movement decisions are not always carried out properly). Our results show that in the more difficult dynamic domain, the reward efficiency visualization method provides a two-order-of-magnitude speedup in selecting a good reward. Most importantly, it allows one to quickly create and verify rewards tailored to the observational limitations of the domain.

  1. Incidental learning of rewarded associations bolsters learning on an associative task.

    PubMed

    Freedberg, Michael; Schacherer, Jonathan; Hazeltine, Eliot

    2016-05-01

    Reward has been shown to change behavior as a result of incentive learning (by motivating the individual to increase their effort) and instrumental learning (by increasing the frequency of a particular behavior). However, Palminteri et al. (2011) demonstrated that reward can also improve the incidental learning of a motor skill even when participants are unaware of the relationship between the reward and the motor act. Nonetheless, it remains unknown whether these effects of reward are the indirect results of manipulations of top-down factors. To identify the locus of the benefit associated with rewarded incidental learning, we used a chord-learning task (Seibel, 1963) in which the correct performance of some chords was consistently rewarded with points necessary to complete the block whereas the correct performance of other chords was not rewarded. Following training, participants performed a transfer phase without reward and then answered a questionnaire to assess explicit awareness about the rewards. Experiment 1 revealed that rewarded chords were performed more quickly than unrewarded chords, and there was little awareness about the relationship between chords and reward. Experiment 2 obtained similar findings with simplified responses to show that the advantage for rewarded stimulus combinations reflected more efficient binding of stimulus-response (S-R) associations, rather than a response bias for rewarded associations or improved motor learning. These results indicate that rewards can be used to significantly improve the learning of S-R associations without directly manipulating top-down factors. PMID:26569435

  2. Learned helplessness and learned prevalence: exploring the causal relations among perceived controllability, reward prevalence, and exploration.

    PubMed

    Teodorescu, Kinneret; Erev, Ido

    2014-10-01

    Exposure to uncontrollable outcomes has been found to trigger learned helplessness, a state in which the agent, because of lack of exploration, fails to take advantage of regained control. Although the implications of this phenomenon have been widely studied, its underlying cause remains undetermined. One can learn not to explore because the environment is uncontrollable, because the average reinforcement for exploring is low, or because rewards for exploring are rare. In the current research, we tested a simple experimental paradigm that contrasts the predictions of these three contributors and offers a unified psychological mechanism that underlies the observed phenomena. Our results demonstrate that learned helplessness is not correlated with either the perceived controllability of one's environment or the average reward, which suggests that reward prevalence is a better predictor of exploratory behavior than the other two factors. A simple computational model in which exploration decisions were based on small samples of past experiences captured the empirical phenomena while also providing a cognitive basis for feelings of uncontrollability. PMID:25193942
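
    The "small samples of past experiences" account mentioned above can be sketched directly: each choice compares the means of a few randomly drawn past outcomes per option, so options whose rewards are rare are easily judged worthless even when their average payoff is high. The two-option environment, payoffs, and sample size are illustrative assumptions, not the published model.

    ```python
    import random

    def small_sample_agent(p_reward, reward_size, n_trials=300, k=3, seed=8):
        """Choose by comparing small samples of past outcomes for each option.

        Exploring pays off on average in both configurations below, but when its rewards
        are rare they are easily missing from a small sample, so exploration dies out."""
        rng = random.Random(seed)
        history = {"stay": [1.0], "explore": [1.0]}      # one optimistic initial experience each
        n_explore = 0

        for _ in range(n_trials):
            estimates = {opt: sum(rng.choices(outcomes, k=k)) / k
                         for opt, outcomes in history.items()}
            choice = max(estimates, key=estimates.get)
            if choice == "explore":
                n_explore += 1
                outcome = reward_size if rng.random() < p_reward else 0.0
            else:
                outcome = 0.6                            # safe, certain payoff
            history[choice].append(outcome)

        return n_explore / n_trials

    # the expected value of exploring is 1.0 in both cases; only reward prevalence differs
    print("frequent small rewards, exploration rate:", small_sample_agent(p_reward=0.5, reward_size=2.0))
    print("rare large rewards,     exploration rate:", small_sample_agent(p_reward=0.1, reward_size=10.0))
    ```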

  3. Neuropsychology of reward learning and negative symptoms in schizophrenia.

    PubMed

    Nestor, Paul G; Choate, Victoria; Niznikiewicz, Margaret; Levitt, James J; Shenton, Martha E; McCarley, Robert W

    2014-11-01

    We used the Iowa Gambling Test (IGT) to examine the relationship of reward learning to both neuropsychological functioning and symptom formation in 65 individuals with schizophrenia. Results indicated that compared to controls, participants with schizophrenia showed significantly reduced reward learning, which in turn correlated with reduced intelligence, memory and executive function, and negative symptoms. The current findings suggested that a disease-related disturbance in reward learning may underlie both cognitive and motivation deficits, as expressed by neuropsychological impairment and negative symptoms in schizophrenia. PMID:25261881

  4. A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task.

    PubMed

    Suri, R E; Schultz, W

    1999-01-01

    This study investigated how the simulated response of dopamine neurons to reward-related stimuli could be used as reinforcement signal for learning a spatial delayed response task. Spatial delayed response tasks assess the functions of frontal cortex and basal ganglia in short-term memory, movement preparation and expectation of environmental events. In these tasks, a stimulus appears for a short period at a particular location, and after a delay the subject moves to the location indicated. Dopamine neurons are activated by unpredicted rewards and reward-predicting stimuli, are not influenced by fully predicted rewards, and are depressed by omitted rewards. Thus, they appear to report an error in the prediction of reward, which is the crucial reinforcement term in formal learning theories. Theoretical studies on reinforcement learning have shown that signals similar to dopamine responses can be used as effective teaching signals for learning. A neural network model implementing the temporal difference algorithm was trained to perform a simulated spatial delayed response task. The reinforcement signal was modeled according to the basic characteristics of dopamine responses to novel stimuli, primary rewards and reward-predicting stimuli. A Critic component analogous to dopamine neurons computed a temporal error in the prediction of reinforcement and emitted this signal to an Actor component which mediated the behavioral output. The spatial delayed response task was learned via two subtasks introducing spatial choices and temporal delays, in the same manner as monkeys in the laboratory. In all three tasks, the reinforcement signal of the Critic developed in a similar manner to the responses of natural dopamine neurons in comparable learning situations, and the learning curves of the Actor replicated the progress of learning observed in the animals. Several manipulations demonstrated further the efficacy of the particular characteristics of the dopamine

  5. Learning the specific quality of taste reinforcement in larval Drosophila

    PubMed Central

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-01

    The only property of reinforcement insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing—in any brain. DOI: http://dx.doi.org/10.7554/eLife.04711.001 PMID:25622533

  6. Online learning control by association and reinforcement.

    PubMed

    Si, J; Wang, Y T

    2001-01-01

    This paper focuses on a systematic treatment for developing a generic online learning control system based on the fundamental principle of reinforcement learning or, more specifically, neural dynamic programming. This online learning system improves its performance over time in two aspects: 1) it learns from its own mistakes through the reinforcement signal from the external environment and tries to reinforce its action to improve future performance; and 2) system states associated with positive reinforcement are memorized through a network learning process so that, in the future, similar states will be more strongly associated with a control action leading to positive reinforcement. A successful candidate for online learning control design is introduced. Real-time learning algorithms are derived for the individual components in the learning system. Some analytical insight is provided to give guidelines on the learning process taking place in each module of the online learning control system. PMID:18244383

  7. Credit assignment in movement-dependent reinforcement learning.

    PubMed

    McDougle, Samuel D; Boggess, Matthew J; Crossley, Matthew J; Parvin, Darius; Ivry, Richard B; Taylor, Jordan A

    2016-06-14

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants' explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem. PMID:27247404
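
    The gating hypothesis can be stated compactly in code. A minimal sketch (my paraphrase with hypothetical parameter values, not the authors' model): when the movement itself fails, the reward prediction error is withheld, so the missing reward is not blamed on the chosen option.

        def update_value(value, reward, executed_ok, alpha=0.2):
            """Delta-rule update on the chosen option, gated by whether the reach was executed correctly."""
            if not executed_ok:
                return value                        # execution error: prediction error is gated out
            return value + alpha * (reward - value)

        v = 0.5
        v = update_value(v, reward=0.0, executed_ok=False)   # missed reach: value is left unchanged
        v = update_value(v, reward=0.0, executed_ok=True)    # true reward omission: value decreases
        print(round(v, 2))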

  8. Pleasurable music affects reinforcement learning according to the listener.

    PubMed

    Gold, Benjamin P; Frank, Michael J; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  9. Pleasurable music affects reinforcement learning according to the listener

    PubMed Central

    Gold, Benjamin P.; Frank, Michael J.; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  10. Can model-free reinforcement learning explain deontological moral judgments?

    PubMed

    Ayars, Alisabeth

    2016-05-01

    Dual-systems frameworks propose that moral judgments are derived from both an immediate emotional response, and controlled/rational cognition. Recently Cushman (2013) proposed a new dual-system theory based on model-free and model-based reinforcement learning. Model-free learning attaches values to actions based on their history of reward and punishment, and explains some deontological, non-utilitarian judgments. Model-based learning involves the construction of a causal model of the world and allows for far-sighted planning; this form of learning fits well with utilitarian considerations that seek to maximize certain kinds of outcomes. I present three concerns regarding the use of model-free reinforcement learning to explain deontological moral judgment. First, many actions that humans find aversive from model-free learning are not judged to be morally wrong. Moral judgment must require something in addition to model-free learning. Second, there is a dearth of evidence for central predictions of the reinforcement account (e.g., that people with different reinforcement histories will, all else equal, make different moral judgments). Finally, to account for the effect of intention within the framework requires certain assumptions which lack support. These challenges are reasonable foci for future empirical/theoretical work on the model-free/model-based framework. PMID:26918742
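
    To make the model-free/model-based contrast concrete, here is a minimal sketch (illustrative only; the toy dilemma, values, and names are my assumptions): the model-free learner caches an action value from its reward history, while the model-based learner evaluates an action by planning through an explicit causal model.

        def model_free_update(cached_value, reward, alpha=0.1):
            """Cache a value for the action directly from experienced reward and punishment."""
            return cached_value + alpha * (reward - cached_value)

        def model_based_value(action, transition_model, outcome_value):
            """Evaluate an action by looking ahead through a causal model of its consequences."""
            return sum(p * outcome_value[outcome] for outcome, p in transition_model[action].items())

        transition_model = {"push": {"one_dies": 1.0}, "refrain": {"five_die": 1.0}}   # toy dilemma
        outcome_value = {"one_dies": -1.0, "five_die": -5.0}
        print(model_based_value("push", transition_model, outcome_value))   # -1.0: far-sighted, utilitarian
        print(model_free_update(cached_value=-3.0, reward=-1.0))            # aversion cached from history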

  11. Collaborating Fuzzy Reinforcement Learning Agents

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1997-01-01

    Earlier, we introduced GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Relearning and, at the local level, each agent learns and operates based on ANTARCTIC, a technique for fuzzy reinforcement learning. In this paper, we show that it is possible for these agents to compete in order to affect the selected control policy but, at the same time, they can collaborate while investigating the state space. In this model, the evaluator or critic learns by observing all the agents' behaviors, but the control policy changes only based on the behavior of the winning agent, also known as the super agent.

  12. Anhedonia and the relative reward value of drug and nondrug reinforcers in cigarette smokers.

    PubMed

    Leventhal, Adam M; Trujillo, Michael; Ameringer, Katherine J; Tidey, Jennifer W; Sussman, Steve; Kahler, Christopher W

    2014-05-01

    Anhedonia-a psychopathologic trait indicative of diminished interest, pleasure, and enjoyment-has been linked to use of and addiction to several substances, including tobacco. We hypothesized that anhedonic drug users develop an imbalance in the relative reward value of drug versus nondrug reinforcers, which could maintain drug use behavior. To test this hypothesis, we examined whether anhedonia predicted the tendency to choose an immediate drug reward (i.e., smoking) over a less immediate nondrug reward (i.e., money) in a laboratory study of non-treatment-seeking adult cigarette smokers. Participants (N = 275, ≥10 cigarettes/day) attended a baseline visit that involved anhedonia assessment followed by 2 counterbalanced experimental visits: (a) after 16-hr smoking abstinence and (b) nonabstinent. At both experimental visits, participants completed self-report measures of mood state followed by a behavioral smoking task, which measured 2 aspects of the relative reward value of smoking versus money: (1) latency to initiate smoking when delaying smoking was monetarily rewarded and (2) willingness to purchase individual cigarettes. Results indicated that higher anhedonia predicted quicker smoking initiation and more cigarettes purchased. These relations were partially mediated by low positive and high negative mood states assessed immediately prior to the smoking task. Abstinence amplified the extent to which anhedonia predicted cigarette consumption among those who responded to the abstinence manipulation, but not the entire sample. Anhedonia may bias motivation toward smoking over alternative reinforcers, perhaps by giving rise to poor acute mood states. An imbalance in the reward value assigned to drug versus nondrug reinforcers may link anhedonia-related psychopathology to drug use. PMID:24886011

  13. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.

    PubMed

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-04-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity. PMID:23592970

  14. Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons

    PubMed Central

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-01-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity. PMID:23592970

  15. Distributed reinforcement learning for adaptive and robust network intrusion response

    NASA Astrophysics Data System (ADS)

    Malialis, Kleanthis; Devlin, Sam; Kudenko, Daniel

    2015-07-01

    Distributed denial of service (DDoS) attacks constitute a rapidly evolving threat in the current Internet. Multiagent Router Throttling is a novel approach to defend against DDoS attacks where multiple reinforcement learning agents are installed on a set of routers and learn to rate-limit or throttle traffic towards a victim server. The focus of this paper is on online learning and scalability. We propose an approach that incorporates task decomposition, team rewards and a form of reward shaping called difference rewards. One of the novel characteristics of the proposed system is that it provides a decentralised coordinated response to the DDoS problem, thus being resilient to DDoS attacks themselves. The proposed system learns remarkably fast, thus being suitable for online learning. Furthermore, its scalability is successfully demonstrated in experiments involving 1000 learning agents. We compare our approach against a baseline and a popular state-of-the-art throttling technique from the network security literature and show that the proposed approach is more effective, adaptive to sophisticated attack rate dynamics and robust to agent failures.
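
    The difference-rewards shaping mentioned above gives each agent the change in the team objective attributable to its own action, D_i = G(z) - G(z with agent i's action replaced by a default). A hedged sketch (the toy objective and interface are my assumptions, not the authors' system):

        def global_reward(throttle_rates, attack_load, server_capacity=1.0):
            """Toy team objective (assumption): penalize traffic exceeding the victim's capacity."""
            served = sum(max(0.0, attack_load[i] - throttle_rates[i]) for i in range(len(throttle_rates)))
            return -max(0.0, served - server_capacity)

        def difference_reward(i, throttle_rates, attack_load, default=0.0):
            """D_i = G(z) - G(z_-i): the part of the global reward agent i is responsible for."""
            counterfactual = list(throttle_rates)
            counterfactual[i] = default
            return global_reward(throttle_rates, attack_load) - global_reward(counterfactual, attack_load)

        rates, load = [0.4, 0.1, 0.0], [0.5, 0.5, 0.5]
        print([round(difference_reward(i, rates, load), 2) for i in range(3)])   # [0.4, 0.1, 0.0]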

  16. A Selective Role for Lmo4 in Cue–Reward Learning

    PubMed Central

    Mangieri, Regina A.; Morrisett, Richard A.; Heberlein, Ulrike; Messing, Robert O.

    2015-01-01

    The ability to use environmental cues to predict rewarding events is essential to survival. The basolateral amygdala (BLA) plays a central role in such forms of associative learning. Aberrant cue–reward learning is thought to underlie many psychopathologies, including addiction, so understanding the underlying molecular mechanisms can inform strategies for intervention. The transcriptional regulator LIM-only 4 (LMO4) is highly expressed in pyramidal neurons of the BLA, where it plays an important role in fear learning. Because the BLA also contributes to cue–reward learning, we investigated the role of BLA LMO4 in this process using Lmo4-deficient mice and RNA interference. Lmo4-deficient mice showed a selective deficit in conditioned reinforcement. Knockdown of LMO4 in the BLA, but not in the nucleus accumbens, recapitulated this deficit in wild-type mice. Molecular and electrophysiological studies identified a deficit in dopamine D2 receptor signaling in the BLA of Lmo4-deficient mice. These results reveal a novel, LMO4-dependent transcriptional program within the BLA that is essential to cue–reward learning. PMID:26134647

  17. A Selective Role for Lmo4 in Cue-Reward Learning.

    PubMed

    Maiya, Rajani; Mangieri, Regina A; Morrisett, Richard A; Heberlein, Ulrike; Messing, Robert O

    2015-07-01

    The ability to use environmental cues to predict rewarding events is essential to survival. The basolateral amygdala (BLA) plays a central role in such forms of associative learning. Aberrant cue-reward learning is thought to underlie many psychopathologies, including addiction, so understanding the underlying molecular mechanisms can inform strategies for intervention. The transcriptional regulator LIM-only 4 (LMO4) is highly expressed in pyramidal neurons of the BLA, where it plays an important role in fear learning. Because the BLA also contributes to cue-reward learning, we investigated the role of BLA LMO4 in this process using Lmo4-deficient mice and RNA interference. Lmo4-deficient mice showed a selective deficit in conditioned reinforcement. Knockdown of LMO4 in the BLA, but not in the nucleus accumbens, recapitulated this deficit in wild-type mice. Molecular and electrophysiological studies identified a deficit in dopamine D2 receptor signaling in the BLA of Lmo4-deficient mice. These results reveal a novel, LMO4-dependent transcriptional program within the BLA that is essential to cue-reward learning. PMID:26134647

  18. Ensemble algorithms in reinforcement learning.

    PubMed

    Wiering, Marco A; van Hasselt, Hado

    2008-08-01

    This paper describes several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent. The aim is to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms. We designed and implemented four different ensemble methods combining the following five different RL algorithms: Q-learning, Sarsa, actor-critic (AC), QV-learning, and AC learning automaton. The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms, in contrast to previous work where ensemble methods have been used in RL for representing and learning a single value function. We show experiments on five maze problems of varying complexity; the first problem is simple, but the other four maze tasks are of a dynamic or partially observable nature. The results indicate that the BM and MV ensembles significantly outperform the single RL algorithms. PMID:18632380
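
    A minimal sketch of two of the combination rules named above, majority voting and Boltzmann multiplication, applied to action preferences produced by several RL algorithms (the numbers and interface are illustrative assumptions):

        import numpy as np

        def majority_vote(action_values):
            """Each algorithm votes for its greedy action; the most-voted action wins (ties: lowest index)."""
            votes = np.zeros(action_values.shape[1])
            for q in action_values:
                votes[np.argmax(q)] += 1
            return int(np.argmax(votes))

        def boltzmann_multiplication(action_values, tau=1.0):
            """Multiply the algorithms' Boltzmann action probabilities, then renormalize."""
            probs = np.exp(action_values / tau)
            probs /= probs.sum(axis=1, keepdims=True)
            combined = probs.prod(axis=0)
            return combined / combined.sum()

        estimates = np.array([[1.0, 0.5, 0.2],     # e.g. Q-learning value estimates
                              [0.4, 0.9, 0.1],     # e.g. Sarsa
                              [0.8, 0.7, 0.0]])    # e.g. actor-critic preferences
        print(majority_vote(estimates), boltzmann_multiplication(estimates).round(2))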

  19. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems.

    PubMed

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider. PMID:24137146
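
    Two ingredients of the model lend themselves to a compact sketch (my simplification with hypothetical quantities, not the authors' implementation): arbitration between the goal-directed and habitual systems based on the value of information versus reasoning cost, and an intrinsic reward delivered when the goal-directed system's world model improves.

        def select_system(value_of_information, reasoning_cost):
            """Deliberate (goal-directed) only when the expected information gain is worth its cost."""
            return "goal-directed" if value_of_information > reasoning_cost else "habitual"

        def intrinsic_reward(model_error_before, model_error_after):
            """Internal reward proportional to the improvement of the environment model."""
            return max(0.0, model_error_before - model_error_after)

        print(select_system(value_of_information=0.4, reasoning_cost=0.1))    # early learning: deliberate
        print(select_system(value_of_information=0.05, reasoning_cost=0.1))   # well-practiced task: habit
        print(intrinsic_reward(model_error_before=0.8, model_error_after=0.5))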

  20. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems

    PubMed Central

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider. PMID:24137146

  1. Learning That a Cocaine Reward is Smaller Than Expected: A Test of Redish's Computational Model of Addiction

    PubMed Central

    Marks, Katherine R.; Kearns, David N.; Christensen, Chesley J.; Silberberg, Alan; Weiss, Stanley J.

    2010-01-01

    The present experiment tested the prediction of Redish's [7] computational model of addiction that drug reward expectation continues to grow even when the received drug reward is smaller than expected. Initially, rats were trained to press two levers, each associated with a large dose of cocaine. Then, the dose associated with one of the levers was substantially reduced. Thus, when rats first pressed the reduced-dose lever, they expected a large cocaine reward, but received a small one. On subsequent choice tests, preference for the reduced-dose lever was reduced, demonstrating that rats learned to devalue the reduced-dose lever. The finding that rats learned to lower reward expectation when they received a smaller-than-expected cocaine reward is in opposition to the hypothesis that drug reinforcers produce a perpetual and non-correctable positive prediction error that causes the learned value of drug rewards to continually grow. Instead, the present results suggest that standard error-correction learning rules apply even to drug reinforcers. PMID:20381539
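
    The hypothesis under test can be written down directly. In a standard error-correction rule the prediction error goes negative when the received reward is smaller than expected, whereas the drug rule in Redish's model clamps the error at a positive floor D, so the cached value can only grow. A minimal sketch with illustrative constants (not the original simulation):

        def standard_update(value, reward, alpha=0.2):
            delta = reward - value                         # may be negative: value corrected downward
            return value + alpha * delta

        def drug_update(value, reward, alpha=0.2, dopamine_surge=0.5):
            delta = max(reward - value + dopamine_surge, dopamine_surge)   # never below D: value only grows
            return value + alpha * delta

        v_standard = v_drug = 4.0                          # expectation built on the large dose
        for _ in range(10):                                # repeated experience of the reduced dose
            v_standard = standard_update(v_standard, reward=1.0)
            v_drug = drug_update(v_drug, reward=1.0)
        print(round(v_standard, 2), round(v_drug, 2))      # standard value falls toward 1; drug value keeps rising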

  2. Punishment insensitivity and impaired reinforcement learning in preschoolers

    PubMed Central

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2013-01-01

    Background Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a developmental vulnerability to psychopathic traits. Methods 157 preschoolers (mean age 4.7 ±0.8 years) participated in a substudy that was embedded within a larger project. Children completed the “Stars-in-Jars” task, which involved learning to select rewarded jars and avoid punished jars. Maternal report of responsiveness to socialization was assessed with the Punishment Insensitivity and Low Concern for Others scales of the Multidimensional Assessment of Preschool Disruptive Behavior (MAP-DB). Results Punishment Insensitivity, but not Low Concern for Others, was significantly associated with reinforcement learning in multivariate models that accounted for age and sex. Specifically, higher Punishment Insensitivity was associated with significantly lower overall performance and more errors on punished trials (“passive avoidance”). Conclusions Impairments in reinforcement learning manifest in preschoolers who are high in maternal ratings of Punishment Insensitivity. If replicated, these findings may help to pinpoint the neurodevelopmental antecedents of psychopathic tendencies and suggest novel intervention targets beginning in early childhood. PMID:24033313

  3. A reinforcement learning approach to gait training improves retention

    PubMed Central

    Hasson, Christopher J.; Manczurowsky, Julia; Yen, Sheng-Che

    2015-01-01

    Many gait training programs are based on supervised learning principles: an individual is guided towards a desired gait pattern with directional error feedback. While this results in rapid adaptation, improvements quickly disappear. This study tested the hypothesis that a reinforcement learning approach improves retention and transfer of a new gait pattern. The results of a pilot study and a larger experiment are presented. Healthy subjects were randomly assigned to either a supervised group, who received explicit instructions and directional error feedback while they learned a new gait pattern on a treadmill, or a reinforcement group, who were only shown whether they were close to or far from the desired gait. Subjects practiced for 10 min, followed by immediate and overnight retention and over-ground transfer tests. The pilot study showed that subjects could learn a new gait pattern under a reinforcement learning paradigm. The larger experiment, which had twice as many subjects (16 in each group), showed that the reinforcement group had better overnight retention than the supervised group (a 32% vs. 120% error increase, respectively), but there were no differences for over-ground transfer. These results suggest that encouraging participants to find rewarding actions through self-guided exploration is beneficial for retention. PMID:26379524

  4. Two spatiotemporally distinct value systems shape reward-based learning in the human brain.

    PubMed

    Fouragnan, Elsa; Retzler, Chris; Mullinger, Karen; Philiastides, Marios G

    2015-01-01

    Avoiding repeated mistakes and learning to reinforce rewarding decisions is critical for human survival and adaptive actions. Yet, the neural underpinnings of the value systems that encode different decision-outcomes remain elusive. Here coupling single-trial electroencephalography with simultaneously acquired functional magnetic resonance imaging, we uncover the spatiotemporal dynamics of two separate but interacting value systems encoding decision-outcomes. Consistent with a role in regulating alertness and switching behaviours, an early system is activated only by negative outcomes and engages arousal-related and motor-preparatory brain structures. Consistent with a role in reward-based learning, a later system differentially suppresses or activates regions of the human reward network in response to negative and positive outcomes, respectively. Following negative outcomes, the early system interacts and downregulates the late system, through a thalamic interaction with the ventral striatum. Critically, the strength of this coupling predicts participants' switching behaviour and avoidance learning, directly implicating the thalamostriatal pathway in reward-based learning. PMID:26348160

  5. Two spatiotemporally distinct value systems shape reward-based learning in the human brain

    PubMed Central

    Fouragnan, Elsa; Retzler, Chris; Mullinger, Karen; Philiastides, Marios G.

    2015-01-01

    Avoiding repeated mistakes and learning to reinforce rewarding decisions is critical for human survival and adaptive actions. Yet, the neural underpinnings of the value systems that encode different decision-outcomes remain elusive. Here coupling single-trial electroencephalography with simultaneously acquired functional magnetic resonance imaging, we uncover the spatiotemporal dynamics of two separate but interacting value systems encoding decision-outcomes. Consistent with a role in regulating alertness and switching behaviours, an early system is activated only by negative outcomes and engages arousal-related and motor-preparatory brain structures. Consistent with a role in reward-based learning, a later system differentially suppresses or activates regions of the human reward network in response to negative and positive outcomes, respectively. Following negative outcomes, the early system interacts and downregulates the late system, through a thalamic interaction with the ventral striatum. Critically, the strength of this coupling predicts participants' switching behaviour and avoidance learning, directly implicating the thalamostriatal pathway in reward-based learning. PMID:26348160

  6. Reward-based learning for virtual neurorobotics through emotional speech processing

    PubMed Central

    Jayet Bray, Laurence C.; Ferneyhough, Gareth B.; Barker, Emily R.; Thibeault, Corey M.; Harris, Frederick C.

    2013-01-01

    Reward-based learning can easily be applied to real life with a prevalence in children teaching methods. It also allows machines and software agents to automatically determine the ideal behavior from a simple reward feedback (e.g., encouragement) to maximize their performance. Advancements in affective computing, especially emotional speech processing (ESP) have allowed for more natural interaction between humans and robots. Our research focuses on integrating a novel ESP system in a relevant virtual neurorobotic (VNR) application. We created an emotional speech classifier that successfully distinguished happy and utterances. The accuracy of the system was 95.3 and 98.7% during the offline mode (using an emotional speech database) and the live mode (using live recordings), respectively. It was then integrated in a neurorobotic scenario, where a virtual neurorobot had to learn a simple exercise through reward-based learning. If the correct decision was made the robot received a spoken reward, which in turn stimulated synapses (in our simulated model) undergoing spike-timing dependent plasticity (STDP) and reinforced the corresponding neural pathways. Both our ESP and neurorobotic systems allowed our neurorobot to successfully and consistently learn the exercise. The integration of ESP in real-time computational neuroscience architecture is a first step toward the combination of human emotions and virtual neurorobotics. PMID:23641213

  7. Reinforcement Learning in Multidimensional Environments Relies on Attention Mechanisms

    PubMed Central

    Daniel, Reka; Geana, Andra; Gershman, Samuel J.; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C.

    2015-01-01

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this “representation learning” process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the “curse of dimensionality” in reinforcement learning. PMID:26019331

  8. Reinforcement learning in continuous time and space.

    PubMed

    Doya, K

    2000-01-01

    This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(lambda) algorithms are shown. For policy improvement, two methods-a continuous actor-critic method and a value-gradient-based greedy policy-are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. The advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model. PMID:10636940
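
    For reference, the continuous-time TD error at the core of this framework is delta(t) = r(t) - V(t)/tau + dV(t)/dt; a minimal backward-Euler sketch (time step and constants are illustrative assumptions):

        def continuous_td_error(r, v, v_prev, dt=0.02, tau=1.0):
            """delta(t) = r(t) - V(t)/tau + dV/dt, with dV/dt approximated by backward Euler."""
            dv_dt = (v - v_prev) / dt
            return r - v / tau + dv_dt

        # The value estimate rose slightly over the last step and a reward arrived.
        print(round(continuous_td_error(r=1.0, v=0.5, v_prev=0.45), 3))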

  9. Nucleus accumbens core lesions retard instrumental learning and performance with delayed reinforcement in the rat

    PubMed Central

    Cardinal, Rudolf N; Cheung, Timothy HC

    2005-01-01

    Background Delays between actions and their outcomes severely hinder reinforcement learning systems, but little is known of the neural mechanism by which animals overcome this problem and bridge such delays. The nucleus accumbens core (AcbC), part of the ventral striatum, is required for normal preference for a large, delayed reward over a small, immediate reward (self-controlled choice) in rats, but the reason for this is unclear. We investigated the role of the AcbC in learning a free-operant instrumental response using delayed reinforcement, performance of a previously-learned response for delayed reinforcement, and assessment of the relative magnitudes of two different rewards. Results Groups of rats with excitotoxic or sham lesions of the AcbC acquired an instrumental response with different delays (0, 10, or 20 s) between the lever-press response and reinforcer delivery. A second (inactive) lever was also present, but responding on it was never reinforced. As expected, the delays retarded learning in normal rats. AcbC lesions did not hinder learning in the absence of delays, but AcbC-lesioned rats were impaired in learning when there was a delay, relative to sham-operated controls. All groups eventually acquired the response and discriminated the active lever from the inactive lever to some degree. Rats were subsequently trained to discriminate reinforcers of different magnitudes. AcbC-lesioned rats were more sensitive to differences in reinforcer magnitude than sham-operated controls, suggesting that the deficit in self-controlled choice previously observed in such rats was a consequence of reduced preference for delayed rewards relative to immediate rewards, not of reduced preference for large rewards relative to small rewards. AcbC lesions also impaired the performance of a previously-learned instrumental response in a delay-dependent fashion. Conclusions These results demonstrate that the AcbC contributes to instrumental learning and performance by

  10. Incidental Learning of Rewarded Associations Bolsters Learning on an Associative Task

    ERIC Educational Resources Information Center

    Freedberg, Michael; Schacherer, Jonathan; Hazeltine, Eliot

    2016-01-01

    Reward has been shown to change behavior as a result of incentive learning (by motivating the individual to increase their effort) and instrumental learning (by increasing the frequency of a particular behavior). However, Palminteri et al. (2011) demonstrated that reward can also improve the incidental learning of a motor skill even when…

  11. Assessing Evidence for a Common Function of Delay in Causal Learning and Reward Discounting

    PubMed Central

    Greville, W. James; Buehner, Marc J.

    2012-01-01

    Time occupies a central role in both the induction of causal relationships and determining the subjective value of rewards. Delays devalue rewards and also impair learning of relationships between events. The mathematical relation between the time until a delayed reward and its present value has been characterized as a hyperbola-like function, and increasing delays of reinforcement tend to elicit judgments or response rates that similarly show a negatively accelerated decay pattern. Furthermore, neurological research implicates both the hippocampus and prefrontal cortex in both these processes. Since both processes are broadly concerned with the concepts of reward, value, and time, involve a similar functional form, and have been identified as involving the same specific brain regions, it seems tempting to assume that the two processes are underpinned by the same cognitive or neural mechanisms. We set out to determine experimentally whether a common cognitive mechanism underlies these processes, by contrasting individual performances on causal judgment and delay discounting tasks. Results from each task corresponded with previous findings in the literature, but no relation was found between the two tasks. The task was replicated and extended by including two further measures, the Barratt Impulsiveness Scale (BIS) and a causal attribution task. Performance on this latter task was correlated with results on the causal judgment task, and also with the non-planning component of the BIS, but results from the delay discounting task were not correlated with either causal learning task or the BIS. Implications for current theories of learning are considered. PMID:23162508
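
    The hyperbola-like devaluation referred to above is commonly written V = A / (1 + kD), where A is the reward amount, D the delay, and k an individual discounting parameter; a minimal sketch with an assumed k:

        def hyperbolic_value(amount, delay, k=0.1):
            """Present value of a reward of size `amount` delivered after `delay` time units."""
            return amount / (1.0 + k * delay)

        for delay in (0, 5, 20, 60):
            print(delay, round(hyperbolic_value(100.0, delay), 1))   # 100.0, 66.7, 33.3, 14.3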

  12. Anticipated Reward Enhances Offline Learning during Sleep

    ERIC Educational Resources Information Center

    Fischer, Stefan; Born, Jan

    2009-01-01

    Sleep is known to promote the consolidation of motor memories. In everyday life, typically more than 1 isolated motor skill is acquired at a time, and this possibly gives rise to interference during consolidation. Here, it is shown that reward expectancy determines the amount of sleep-dependent memory consolidation. Subjects were trained on 2…

  13. Emotion and reward are dissociable from error during motor learning.

    PubMed

    Festini, Sara B; Preston, Stephanie D; Reuter-Lorenz, Patricia A; Seidler, Rachael D

    2016-06-01

    Although emotion is known to reciprocally interact with cognitive and motor performance, contemporary theories of motor learning do not specifically consider how dynamic variations in a learner's affective state may influence motor performance during motor learning. Using a prism adaptation paradigm, we assessed emotion during motor learning on a trial-by-trial basis. We designed two dart-throwing experiments to dissociate motor performance and reward outcomes by giving participants maximum points for accurate throws and reduced points for throws that hit zones away from the target (i.e., "accidental points"). Experiment 1 dissociated motor performance from emotional responses and found that affective ratings tracked points earned more closely than error magnitude. Further, both reward and error uniquely contributed to motor learning, as indexed by the change in error from one trial to the next. Experiment 2 manipulated accidental point locations vertically, whereas prism displacement remained horizontal. Results demonstrated that reward could bias motor performance even when concurrent sensorimotor adaptation was taking place in a perpendicular direction. Thus, these experiments demonstrate that affective states were dissociable from error magnitude during motor learning and that affect more closely tracked points earned. Our findings further implicate reward as another factor, other than error, that contributes to motor learning, suggesting the importance of incorporating affective states into models of motor learning. PMID:26746312

  14. The Distracting Effect of Material Reward: An Alternative Explanation for the Superior Performance of Reward Groups in Probability Learning

    ERIC Educational Resources Information Center

    McGraw, Kenneth O.; McCullers, John C.

    1974-01-01

    To determine whether the distraction effect associated with material rewards in discrimination learning can account for the superior performance of reward groups in probability learning, the performance of 144 school children (preschool, second, and fifth grades) on a two-choice successive discrimination task was compared under three reinforcement…

  15. Rewards.

    PubMed

    Gunderman, Richard B; Kamer, Aaron P

    2011-05-01

    For much of the 20th century, psychologists and economists operated on the assumption that work is devoid of intrinsic rewards, and the only way to get people to work harder is through the use of rewards and punishments. This so-called carrot-and-stick model of workplace motivation, when applied to medical practice, emphasizes the use of financial incentives and disincentives to manipulate behavior. More recently, however, it has become apparent that, particularly when applied to certain kinds of work, such approaches can be ineffective or even frankly counterproductive. Instead of focusing on extrinsic rewards such as compensation, organizations and their leaders need to devote more attention to the intrinsic rewards of work itself. This article reviews this new understanding of rewards and traces out its practical implications for radiology today. PMID:21531311

  16. Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning.

    PubMed

    Frank, Michael J; Moustafa, Ahmed A; Haughey, Heather M; Curran, Tim; Hutchison, Kent E

    2007-10-01

    What are the genetic and neural components that support adaptive learning from positive and negative outcomes? Here, we show with genetic analyses that three independent dopaminergic mechanisms contribute to reward and avoidance learning in humans. A polymorphism in the DARPP-32 gene, associated with striatal dopamine function, predicted relatively better probabilistic reward learning. Conversely, the C957T polymorphism of the DRD2 gene, associated with striatal D2 receptor function, predicted the degree to which participants learned to avoid choices that had been probabilistically associated with negative outcomes. The Val/Met polymorphism of the COMT gene, associated with prefrontal cortical dopamine function, predicted participants' ability to rapidly adapt behavior on a trial-to-trial basis. These findings support a neurocomputational dissociation between striatal and prefrontal dopaminergic mechanisms in reinforcement learning. Computational maximum likelihood analyses reveal independent gene effects on three reinforcement learning parameters that can explain the observed dissociations. PMID:17913879
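
    The three reinforcement learning parameters alluded to here are often implemented as separate learning rates for positive and negative prediction errors plus a softmax choice parameter; the sketch below is that generic parameterization (an assumption on my part, not necessarily the authors' exact model):

        import math

        def update_value(value, reward, alpha_gain=0.3, alpha_loss=0.1):
            """Asymmetric delta rule: different learning rates for better- and worse-than-expected outcomes."""
            delta = reward - value
            alpha = alpha_gain if delta > 0 else alpha_loss
            return value + alpha * delta

        def p_choose_a(value_a, value_b, beta=3.0):
            """Softmax choice rule; beta governs trial-to-trial choice consistency."""
            return 1.0 / (1.0 + math.exp(-beta * (value_a - value_b)))

        v = 0.0
        for outcome in (1, 0, 1, 1, 0):
            v = update_value(v, outcome)
        print(round(v, 2), round(p_choose_a(v, 0.5), 2))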

  17. Cortical mechanisms for reinforcement learning in competitive games.

    PubMed

    Seo, Hyojung; Lee, Daeyeol

    2008-12-12

    Game theory analyses optimal strategies for multiple decision makers interacting in a social group. However, the behaviours of individual humans and animals often deviate systematically from the optimal strategies described by game theory. The behaviours of rhesus monkeys (Macaca mulatta) in simple zero-sum games showed similar patterns, but their departures from the optimal strategies were well accounted for by a simple reinforcement-learning algorithm. During a computer-simulated zero-sum game, neurons in the dorsolateral prefrontal cortex often encoded the previous choices of the animal and its opponent as well as the animal's reward history. By contrast, the neurons in the anterior cingulate cortex predominantly encoded the animal's reward history. Using simple competitive games, therefore, we have demonstrated functional specialization between different areas of the primate frontal cortex involved in outcome monitoring and action selection. Temporally extended signals related to the animal's previous choices might facilitate the association between choices and their delayed outcomes, whereas information about the choices of the opponent might be used to estimate the reward expected from a particular action. Finally, signals related to the reward history might be used to monitor the overall success of the animal's current decision-making strategy. PMID:18829430

  18. Measurement of food reinforcement in preschool children: Associations with food intake, BMI, and reward sensitivity

    PubMed Central

    Rollins, Brandi Y.; Loken, Eric; Savage, Jennifer S.; Birch, Leann L.

    2014-01-01

    Progressive ratio (PR) schedules of reinforcement have been used to measure the relative reinforcing value (RRV) of food in humans as young as 8 years old; however, developmentally appropriate measures are needed to measure the RRV of food earlier in life. Study objectives were to demonstrate the validity of the RRV of food task adapted for use among preschool children (3 to 5y), and to examine individual differences in performance. Thirty-three children completed the RRV of food task in which they worked to access graham crackers. They also completed a snack task where they had free access to these foods, liking and hunger assessments, and their heights and weights were measured. Parents reported on their child’s reward sensitivity. Overall, children were willing to work for palatable snack foods. Boys and older children made more responses in the task, while children with higher BMI z-scores and reward sensitivity responded at a faster rate. Children who worked harder in terms of total responses and response rates consumed more calories in the snack session. This study demonstrates that, with slight modifications, the RRV of food task is a valid and developmentally appropriate measure for assessing individual differences in food reinforcement among very young children. PMID:24090537

  19. Early Years Education: Are Young Students Intrinsically or Extrinsically Motivated Towards School Activities? A Discussion about the Effects of Rewards on Young Children's Learning

    ERIC Educational Resources Information Center

    Theodotou, Evgenia

    2014-01-01

    Rewards can reinforce and at the same time forestall young children's willingness to learn. However, they are broadly used in the field of education, especially in early years settings, to stimulate children towards learning activities. This paper reviews the theoretical and research literature related to intrinsic and extrinsic motivational…

  20. DAT isn’t all that: cocaine reward and reinforcement requires Toll Like Receptor 4 signaling

    PubMed Central

    Northcutt, A.L.; Hutchinson, M.R.; Wang, X.; Baratta, M.V.; Hiranita, T.; Cochran, T.A.; Pomrenze, M.B.; Galer, E.L.; Kopajtic, T.A.; Li, C.M.; Amat, J.; Larson, G.; Cooper, D.C.; Huang, Y.; O’Neill, C.E.; Yin, H.; Zahniser, N.R.; Katz, J.L.; Rice, K.C.; Maier, S.F.; Bachtell, R.K.; Watkins, L.R.

    2014-01-01

    The initial reinforcing properties of drugs of abuse, such as cocaine, are largely attributed to their ability to activate the mesolimbic dopamine system. Resulting increases in extracellular dopamine in the nucleus accumbens (NAc) are traditionally thought to result from cocaine’s ability to block dopamine transporters (DATs). Here we demonstrate that cocaine also interacts with the immunosurveillance receptor complex, Toll-Like Receptor 4 (TLR4), on microglial cells to initiate central innate immune signaling. Disruption of cocaine signaling at TLR4 suppresses cocaine-induced extracellular dopamine in the NAc, as well as cocaine conditioned place preference and cocaine self-administration. These results provide a novel understanding of the neurobiological mechanisms underlying cocaine reward/reinforcement that includes a critical role for central immune signaling, and offer a new target for medication development for cocaine abuse treatment. PMID:25644383

  1. Reinforcement learning improves behaviour from evaluative feedback

    NASA Astrophysics Data System (ADS)

    Littman, Michael L.

    2015-05-01

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  2. Spiking neural networks with different reinforcement learning (RL) schemes in a multiagent setting.

    PubMed

    Christodoulou, Chris; Cleanthous, Aristodemos

    2010-12-31

    This paper investigates the effectiveness of spiking agents when trained with reinforcement learning (RL) in a challenging multiagent task. In particular, it explores learning through reward-modulated spike-timing dependent plasticity (STDP) and compares it to reinforcement of stochastic synaptic transmission in the general-sum game of the Iterated Prisoner's Dilemma (IPD). More specifically, a computational model is developed where we implement two spiking neural networks as two "selfish" agents learning simultaneously but independently, competing in the IPD game. The purpose of our system (or collective) is to maximise its accumulated reward in the presence of reward-driven competing agents within the collective. This can only be achieved when the agents engage in a behaviour of mutual cooperation during the IPD. Previously, we successfully applied reinforcement of stochastic synaptic transmission to the IPD game. The current study utilises reward-modulated STDP with eligibility trace and results show that the system managed to exhibit the desired behaviour by establishing mutual cooperation between the agents. It is noted that the cooperative outcome was attained after a relatively short learning period which enhanced the accumulation of reward by the system. As in our previous implementation, the successful application of the learning algorithm to the IPD becomes possible only after we extended it with additional global reinforcement signals in order to enhance competition at the neuronal level. Moreover it is also shown that learning is enhanced (as indicated by an increased IPD cooperative outcome) through: (i) strong memory for each agent (regulated by a high eligibility trace time constant) and (ii) firing irregularity produced by equipping the agents' LIF neurons with a partial somatic reset mechanism. PMID:21793357
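
    A scalar sketch of reward-modulated STDP with an eligibility trace, the class of learning rule used above (one synapse, hypothetical constants; the actual model uses networks of LIF neurons playing the IPD):

        import math

        def stdp(delta_t, a_plus=1.0, a_minus=1.0, tau=20.0):
            """Pairwise STDP kernel: potentiate when pre precedes post (delta_t > 0), depress otherwise."""
            return a_plus * math.exp(-delta_t / tau) if delta_t > 0 else -a_minus * math.exp(delta_t / tau)

        w, trace, tau_e, lr = 0.5, 0.0, 200.0, 0.01
        events = [(5.0, 0.0), (8.0, 0.0), (-3.0, 0.0), (None, 1.0)]   # (pre-post timing in ms, reward)
        for delta_t, reward in events:
            trace *= math.exp(-1.0 / tau_e)        # eligibility trace decays between events
            if delta_t is not None:
                trace += stdp(delta_t)             # spike pairings mark the synapse as eligible
            w += lr * reward * trace               # weights change only when a reward signal arrives
        print(round(w, 3), round(trace, 3))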

  3. Dopamine-dependent reinforcement of motor skill learning: evidence from Gilles de la Tourette syndrome.

    PubMed

    Palminteri, Stefano; Lebreton, Maël; Worbe, Yulia; Hartmann, Andreas; Lehéricy, Stéphane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-08-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only affect choices but also motor skills such as typing. Here, we employed a novel paradigm to demonstrate that monetary rewards can improve motor skill learning in humans. Indeed, healthy participants progressively got faster in executing sequences of key presses that were repeatedly rewarded with 10 euro compared with 1 cent. Control tests revealed that the effect of reinforcement on motor skill learning was independent of subjects being aware of sequence-reward associations. To account for this implicit effect, we developed an actor-critic model, in which reward prediction errors are used by the critic to update state values and by the actor to facilitate action execution. To assess the role of dopamine in such computations, we applied the same paradigm in patients with Gilles de la Tourette syndrome, who were either unmedicated or treated with neuroleptics. We also included patients with focal dystonia, as an example of hyperkinetic motor disorder unrelated to dopamine. Model fit showed the following dissociation: while motor skills were affected in all patient groups, reinforcement learning was selectively enhanced in unmedicated patients with Gilles de la Tourette syndrome and impaired by neuroleptics. These results support the hypothesis that overactive dopamine transmission leads to excessive reinforcement of motor sequences, which might explain the formation of tics in Gilles de la Tourette syndrome. PMID:21727098
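
    The actor-critic account described here can be written compactly: a single reward prediction error both updates the critic's state value and adjusts the actor's execution preference. The sketch below is a hypothetical single-state illustration of that idea, not the authors' fitted model:

      # Hypothetical actor-critic sketch: the same prediction error updates the critic's
      # value estimate and the actor's preference for executing the motor sequence.
      def actor_critic_trial(value, preference, reward, alpha_critic=0.1, alpha_actor=0.05):
          delta = reward - value             # reward prediction error (single-state case)
          value += alpha_critic * delta      # critic: update expected value
          preference += alpha_actor * delta  # actor: facilitate (or suppress) execution
          return value, preference

      # Illustration: repeatedly rewarding one key-press sequence with 10 euros and
      # another with 1 cent gradually separates their execution preferences.
      v_hi = v_lo = p_hi = p_lo = 0.0
      for _ in range(50):
          v_hi, p_hi = actor_critic_trial(v_hi, p_hi, reward=10.0)
          v_lo, p_lo = actor_critic_trial(v_lo, p_lo, reward=0.01)
      print(p_hi, p_lo)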

  4. Reinforcement learning for routing in cognitive radio ad hoc networks.

    PubMed

    Al-Rawi, Hasan A A; Yau, Kok-Lim Alvin; Mohamad, Hafizal; Ramli, Nordin; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL in routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs. PMID:25140350
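
    The RL features examined in the paper (reward function, exploration versus exploitation, learning rate) map naturally onto a per-node value table over candidate next hops. The sketch below is a generic Q-routing-style illustration with hypothetical parameters, not the paper's specific scheme:

      import random

      # Generic Q-routing-style update for a secondary user node (illustrative only).
      # q[(dest, next_hop)] estimates the long-run quality of forwarding via next_hop.
      def select_next_hop(q, dest, neighbours, epsilon=0.1):
          # Exploration vs. exploitation: occasionally probe a non-greedy route.
          if random.random() < epsilon:
              return random.choice(neighbours)
          return max(neighbours, key=lambda n: q.get((dest, n), 0.0))

      def update_route(q, dest, next_hop, reward, downstream_estimate, alpha=0.5, gamma=0.9):
          # The reward design (e.g., penalising delay and interference to primary users)
          # is one of the tunables the paper investigates.
          key = (dest, next_hop)
          q[key] = q.get(key, 0.0) + alpha * (reward + gamma * downstream_estimate - q.get(key, 0.0))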

  5. Common Neural Mechanisms Underlying Reversal Learning by Reward and Punishment

    PubMed Central

    Xue, Gui; Xue, Feng; Droutman, Vita; Lu, Zhong-Lin; Bechara, Antoine; Read, Stephen

    2013-01-01

    Impairments in flexible goal-directed decisions, often examined by reversal learning, are associated with behavioral abnormalities characterized by impulsiveness and disinhibition. Although the lateral orbital frontal cortex (OFC) has been consistently implicated in reversal learning, it is still unclear whether this region is involved in negative feedback processing, behavioral control, or both, and whether reward and punishment might have different effects on lateral OFC involvement. Using a relatively large sample (N = 47), and a categorical learning task with either monetary reward or moderate electric shock as feedback, we found overlapping activations in the right lateral OFC (and adjacent insula) for reward and punishment reversal learning when comparing correct reversal trials with correct acquisition trials, whereas we found overlapping activations in the right dorsolateral prefrontal cortex (DLPFC) when negative feedback signaled contingency change. The right lateral OFC and DLPFC also showed greater sensitivity to punishment than did their left homologues, indicating an asymmetry in how punishment is processed. We propose that the right lateral OFC and anterior insula are important for transforming affective feedback to behavioral adjustment, whereas the right DLPFC is involved in higher level attention control. These results provide insight into the neural mechanisms of reversal learning and behavioral flexibility, which can be leveraged to understand risky behaviors among vulnerable populations. PMID:24349211

  6. Learning to trade via direct reinforcement.

    PubMed

    Moody, J; Saffell, M

    2001-01-01

    We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision-making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills. PMID:18249919
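
    A common textbook rendering of recurrent reinforcement learning for trading has the position depend on recent returns and the previous position, with performance tracked by a differential Sharpe ratio built from exponential moving averages. The sketch below is an illustrative reconstruction under those assumptions (the weights w and u and all constants are placeholders), not the authors' implementation:

      import numpy as np

      # Illustrative recurrent trader: position F_t in [-1, 1] depends on recent price
      # changes and the previous position; the trading return includes a transaction cost.
      def trading_returns(prices, w, u, cost=0.001):
          r = np.diff(prices)
          F_prev, rewards = 0.0, []
          for t in range(len(r)):
              window = r[max(0, t - len(w) + 1):t + 1]
              x = np.pad(window, (len(w) - len(window), 0))   # left-pad early steps with zeros
              F = np.tanh(np.dot(w, x) + u * F_prev)          # recurrent position
              rewards.append(F_prev * r[t] - cost * abs(F - F_prev))
              F_prev = F
          return np.array(rewards)

      # Differential Sharpe ratio maintained from moving averages of returns (A) and
      # squared returns (B); eta is the adaptation rate.
      def differential_sharpe(rewards, eta=0.01):
          A = B = D = 0.0
          for R in rewards:
              dA, dB = R - A, R * R - B
              var = B - A * A
              D = (B * dA - 0.5 * A * dB) / var ** 1.5 if var > 1e-12 else 0.0
              A += eta * dA
              B += eta * dB
          return D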

  7. Using a board game to reinforce learning.

    PubMed

    Yoon, Bona; Rodriguez, Leslie; Faselis, Charles J; Liappis, Angelike P

    2014-03-01

    Experiential gaming strategies offer a variation on traditional learning. A board game was used to present synthesized content of fundamental catheter care concepts and reinforce evidence-based practices relevant to nursing. Board games are innovative educational tools that can enhance active learning. PMID:24588236

  8. Adaptive Educational Software by Applying Reinforcement Learning

    ERIC Educational Resources Information Center

    Bennane, Abdellah

    2013-01-01

    The introduction of intelligence into teaching software is the object of this paper. In the software elaboration process, learning techniques are used to adapt the teaching software to the characteristics of the student. Generally, artificial intelligence techniques such as reinforcement learning and Bayesian networks are used in order to adapt…

  9. Context transfer in reinforcement learning using action-value functions.

    PubMed

    Mousavi, Amin; Nadjar Araabi, Babak; Nili Ahmadabadi, Majid

    2014-01-01

    This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different states or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents' MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task. PMID:25610457
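
    The transfer step described (pushing learned source Q-values through a partial state-action correspondence to initialise the target task) can be sketched as follows; the mapping dictionaries and the simple averaging merge are placeholders for the paper's interval-based representation:

      from collections import defaultdict

      # Sketch of context transfer: Q-values from source tasks are mapped through partial
      # (state, action) correspondences and merged to initialise the target task's table.
      def transfer_q_values(source_q_tables, mappings):
          """source_q_tables: list of dicts {(s, a): value}, one per source task.
          mappings: list of dicts {(s_src, a_src): (s_tgt, a_tgt)}; unmapped pairs are skipped."""
          sums, counts = defaultdict(float), defaultdict(int)
          for q_src, mapping in zip(source_q_tables, mappings):
              for src_key, tgt_key in mapping.items():
                  if src_key in q_src:
                      sums[tgt_key] += q_src[src_key]
                      counts[tgt_key] += 1
          # The merged estimates initialise Q-learning in the target task.
          return {k: sums[k] / counts[k] for k in sums}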

  10. Effort-Reward Imbalance for Learning Is Associated with Fatigue in School Children

    ERIC Educational Resources Information Center

    Fukuda, Sanae; Yamano, Emi; Joudoi, Takako; Mizuno, Kei; Tanaka, Masaaki; Kawatani, Junko; Takano, Miyuki; Tomoda, Akemi; Imai-Matsumura, Kyoko; Miike, Teruhisa; Watanabe, Yasuyoshi

    2010-01-01

    We examined relationships among fatigue, sleep quality, and effort-reward imbalance for learning in school children. We developed an effort-reward for learning scale in school students and examined its reliability and validity. Self-administered surveys, including the effort-reward for learning scale and the fatigue scale, were completed by 1,023…

  11. Learning to Obtain Reward, but Not Avoid Punishment, Is Affected by Presence of PTSD Symptoms in Male Veterans: Empirical Data and Computational Model

    PubMed Central

    Myers, Catherine E.; Moustafa, Ahmed A.; Sheynin, Jony; VanMeenen, Kirsten M.; Gilbertson, Mark W.; Orr, Scott P.; Beck, Kevin D.; Pang, Kevin C. H.; Servatius, Richard J.

    2013-01-01

    Post-traumatic stress disorder (PTSD) symptoms include behavioral avoidance which is acquired and tends to increase with time. This avoidance may represent a general learning bias; indeed, individuals with PTSD are often faster than controls on acquiring conditioned responses based on physiologically-aversive feedback. However, it is not clear whether this learning bias extends to cognitive feedback, or to learning from both reward and punishment. Here, male veterans with self-reported current, severe PTSD symptoms (PTSS group) or with few or no PTSD symptoms (control group) completed a probabilistic classification task that included both reward-based and punishment-based trials, where feedback could take the form of reward, punishment, or an ambiguous “no-feedback” outcome that could signal either successful avoidance of punishment or failure to obtain reward. The PTSS group outperformed the control group in total points obtained; the PTSS group specifically performed better than the control group on reward-based trials, with no difference on punishment-based trials. To better understand possible mechanisms underlying observed performance, we used a reinforcement learning model of the task, and applied maximum likelihood estimation techniques to derive estimated parameters describing individual participants’ behavior. Estimations of the reinforcement value of the no-feedback outcome were significantly greater in the control group than the PTSS group, suggesting that the control group was more likely to value this outcome as positively reinforcing (i.e., signaling successful avoidance of punishment). This is consistent with the control group’s generally poorer performance on reward trials, where reward feedback was to be obtained in preference to the no-feedback outcome. Differences in the interpretation of ambiguous feedback may contribute to the facilitated reinforcement learning often observed in PTSD patients, and may in turn provide new insight into
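
    The modelling step described (per-participant maximum likelihood estimation, with the reinforcement value of the ambiguous no-feedback outcome as a free parameter) can be sketched as below. The task structure, data format, and parameterisation here are simplified assumptions for illustration, not the authors' exact model:

      import numpy as np
      from scipy.optimize import minimize

      # Two-option RL model in which the subjective value of the "no feedback" outcome
      # (v_none) is estimated alongside the learning rate and inverse temperature.
      def neg_log_likelihood(params, choices, outcomes):
          """choices[t] in {0, 1}; outcomes[t] in {'reward', 'punish', 'none'} (assumed format)."""
          alpha, beta, v_none = params
          values = np.zeros(2)
          outcome_value = {'reward': 1.0, 'punish': -1.0, 'none': v_none}
          nll = 0.0
          for c, o in zip(choices, outcomes):
              p = np.exp(beta * values) / np.sum(np.exp(beta * values))   # softmax choice
              nll -= np.log(p[c] + 1e-12)
              values[c] += alpha * (outcome_value[o] - values[c])          # delta-rule update
          return nll

      def fit_participant(choices, outcomes):
          res = minimize(neg_log_likelihood, x0=[0.3, 2.0, 0.0], args=(choices, outcomes),
                         bounds=[(0.01, 1.0), (0.1, 20.0), (-1.0, 1.0)])
          return res.x   # learning rate, inverse temperature, value of the no-feedback outcome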

  12. Evolution with Reinforcement Learning in Negotiation

    PubMed Central

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and makes more robust predictions in the long term as compared to in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions when the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stableness than their counterparts using classic evolutionary algorithm. PMID:25048108

  13. The basolateral amygdala in reward learning and addiction.

    PubMed

    Wassum, Kate M; Izquierdo, Alicia

    2015-10-01

    Sophisticated behavioral paradigms partnered with the emergence of increasingly selective techniques to target the basolateral amygdala (BLA) have resulted in an enhanced understanding of the role of this nucleus in learning and using reward information. Due to the wide variety of behavioral approaches many questions remain on the circumscribed role of BLA in appetitive behavior. In this review, we integrate conclusions of BLA function in reward-related behavior using traditional interference techniques (lesion, pharmacological inactivation) with those using newer methodological approaches in experimental animals that allow in vivo manipulation of cell type-specific populations and neural recordings. Secondly, from a review of appetitive behavioral tasks in rodents and monkeys and recent computational models of reward procurement, we derive evidence for BLA as a neural integrator of reward value, history, and cost parameters. Taken together, BLA codes specific and temporally dynamic outcome representations in a distributed network to orchestrate adaptive responses. We provide evidence that experiences with opiates and psychostimulants alter these outcome representations in BLA, resulting in long-term modified action. PMID:26341938

  14. Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioural ecology.

    PubMed

    Schultz, Wolfram

    2004-04-01

    Neurons in a small number of brain structures detect rewards and reward-predicting stimuli and are active during the expectation of predictable food and liquid rewards. These neurons code the reward information according to basic terms of various behavioural theories that seek to explain reward-directed learning, approach behaviour and decision-making. The involved brain structures include groups of dopamine neurons, the striatum including the nucleus accumbens, the orbitofrontal cortex and the amygdala. The reward information is fed to brain structures involved in decision-making and organisation of behaviour, such as the dorsolateral prefrontal cortex and possibly the parietal cortex. The neural coding of basic reward terms derived from formal theories puts the neurophysiological investigation of reward mechanisms on firm conceptual grounds and provides neural correlates for the function of rewards in learning, approach behaviour and decision-making. PMID:15082317

  15. Human dorsal striatal activity during choice discriminates reinforcement learning behavior from the gambler's fallacy.

    PubMed

    Jessup, Ryan K; O'Doherty, John P

    2011-04-27

    Reinforcement learning theory has generated substantial interest in neurobiology, particularly because of the resemblance between phasic dopamine and reward prediction errors. Actor-critic theories have been adapted to account for the functions of the striatum, with parts of the dorsal striatum equated to the actor. Here, we specifically test whether the human dorsal striatum, as predicted by an actor-critic instantiation, is used on a trial-to-trial basis at the time of choice to choose in accordance with reinforcement learning theory, as opposed to a competing strategy: the gambler's fallacy. Using a partial-brain functional magnetic resonance imaging scanning protocol focused on the striatum and other ventral brain areas, we found that the dorsal striatum is more active when choosing consistent with reinforcement learning compared with the competing strategy. Moreover, an overlapping area of dorsal striatum along with the ventral striatum was found to be correlated with reward prediction errors at the time of outcome, as predicted by the actor-critic framework. These findings suggest that the same region of dorsal striatum involved in learning stimulus-response associations may contribute to the control of behavior during choice, thereby using those learned associations. Intriguingly, neither reinforcement learning nor the gambler's fallacy conformed to the optimal choice strategy on the specific decision-making task we used. Thus, the dorsal striatum may contribute to the control of behavior according to reinforcement learning even when the prescriptions of such an algorithm are suboptimal in terms of maximizing future rewards. PMID:21525269

  16. Refining Linear Fuzzy Rules by Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.; Malkani, Anil

    1996-01-01

    Linear fuzzy rules are increasingly being used in the development of fuzzy logic systems. Radial basis functions have also been used in the antecedents of the rules for clustering in product space which can automatically generate a set of linear fuzzy rules from an input/output data set. Manual methods are usually used in refining these rules. This paper presents a method for refining the parameters of these rules using reinforcement learning which can be applied in domains where supervised input-output data is not available and reinforcements are received only after a long sequence of actions. This is shown for a generalization of radial basis functions. The formation of fuzzy rules from data and their automatic refinement is an important step in closing the gap between the application of reinforcement learning methods in the domains where only some limited input-output data is available.

  17. Autonomous reinforcement learning with experience replay.

    PubMed

    Wawrzyński, Paweł; Tanwani, Ajay Kumar

    2013-05-01

    This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy with the use of previously collected samples, and autonomously estimates the appropriate step-sizes for the learning updates. The algorithm is based on the actor-critic with experience replay whose step-sizes are determined on-line by an enhanced fixed point algorithm for on-line neural network training. An experimental study with simulated octopus arm and half-cheetah demonstrates the feasibility of the proposed algorithm to solve difficult learning control problems in an autonomous way within reasonably short time. PMID:23237972
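
    The core loop of actor-critic with experience replay is: store transitions, then repeatedly re-estimate the critic and adjust the actor from sampled past transitions. The sketch below uses linear features and fixed step-sizes in place of the paper's on-line step-size estimation; it is illustrative only:

      import random
      import numpy as np

      # Illustrative actor-critic with experience replay over linear features.
      class ReplayActorCritic:
          def __init__(self, n_features, n_actions, gamma=0.99, alpha_v=0.01, alpha_pi=0.005):
              self.v = np.zeros(n_features)                   # critic weights
              self.theta = np.zeros((n_actions, n_features))  # actor preference weights
              self.buffer, self.gamma = [], gamma
              self.alpha_v, self.alpha_pi = alpha_v, alpha_pi

          def policy(self, phi):
              prefs = self.theta @ phi
              e = np.exp(prefs - prefs.max())
              return e / e.sum()                              # softmax over actions

          def store(self, phi, action, reward, phi_next):
              self.buffer.append((phi, action, reward, phi_next))

          def replay(self, batch_size=32):
              # Previously collected samples are re-used for many policy adjustments.
              batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
              for phi, a, r, phi_next in batch:
                  delta = r + self.gamma * (self.v @ phi_next) - (self.v @ phi)
                  self.v += self.alpha_v * delta * phi                  # critic update
                  grad = -self.policy(phi)[:, None] * phi[None, :]      # d log pi / d theta
                  grad[a] += phi
                  self.theta += self.alpha_pi * delta * grad            # actor update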

  18. Time representation in reinforcement learning models of the basal ganglia

    PubMed Central

    Gershman, Samuel J.; Moustafa, Ahmed A.; Ludvig, Elliot A.

    2014-01-01

    Reinforcement learning (RL) models have been influential in understanding many aspects of basal ganglia function, from reward prediction to action selection. Time plays an important role in these models, but there is still no theoretical consensus about what kind of time representation is used by the basal ganglia. We review several theoretical accounts and their supporting evidence. We then discuss the relationship between RL models and the timing mechanisms that have been attributed to the basal ganglia. We hypothesize that a single computational system may underlie both RL and interval timing—the perception of duration in the range of seconds to hours. This hypothesis, which extends earlier models by incorporating a time-sensitive action selection mechanism, may have important implications for understanding disorders like Parkinson's disease in which both decision making and timing are impaired. PMID:24409138

  19. Reinforcement Learning in Information Searching

    ERIC Educational Resources Information Center

    Cen, Yonghua; Gan, Liren; Bai, Chen

    2013-01-01

    Introduction: The study seeks to answer two questions: How do university students learn to use correct strategies to conduct scholarly information searches without instructions? and, What are the differences in learning mechanisms between users at different cognitive levels? Method: Two groups of users, thirteen first year undergraduate students…

  20. Can Traditions Emerge from the Interaction of Stimulus Enhancement and Reinforcement Learning? An Experimental Model

    PubMed Central

    MATTHEWS, LUKE J; PAUKNER, ANNIKA; SUOMI, STEPHEN J

    2010-01-01

    The study of social learning in captivity and behavioral traditions in the wild are two burgeoning areas of research, but few empirical studies have tested how learning mechanisms produce emergent patterns of tradition. Studies have examined how social learning mechanisms that are cognitively complex and possessed by few species, such as imitation, result in traditional patterns, yet traditional patterns are also exhibited by species that may not possess such mechanisms. We propose an explicit model of how stimulus enhancement and reinforcement learning could interact to produce traditions. We tested the model experimentally with tufted capuchin monkeys (Cebus apella), which exhibit traditions in the wild but have rarely demonstrated imitative abilities in captive experiments. Monkeys showed both stimulus enhancement learning and a habitual bias to perform whichever behavior first obtained them a reward. These results support our model that simple social learning mechanisms combined with reinforcement can result in traditional patterns of behavior. PMID:21135912

  1. Prediction error in reinforcement learning: a meta-analysis of neuroimaging studies.

    PubMed

    Garrison, Jane; Erdeniz, Burak; Done, John

    2013-08-01

    Activation likelihood estimation (ALE) meta-analyses were used to examine the neural correlates of prediction error in reinforcement learning. The findings are interpreted in the light of current computational models of learning and action selection. In this context, particular consideration is given to the comparison of activation patterns from studies using instrumental and Pavlovian conditioning, and where reinforcement involved rewarding or punishing feedback. The striatum was the key brain area encoding for prediction error, with activity encompassing dorsal and ventral regions for instrumental and Pavlovian reinforcement alike, a finding which challenges the functional separation of the striatum into a dorsal 'actor' and a ventral 'critic'. Prediction error activity was further observed in diverse areas of predominantly anterior cerebral cortex including medial prefrontal cortex and anterior cingulate cortex. Distinct patterns of prediction error activity were found for studies using rewarding and aversive reinforcers; reward prediction errors were observed primarily in the striatum while aversive prediction errors were found more widely including insula and habenula. PMID:23567522
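
    The prediction error these studies model is the temporal-difference error, computed from the reward and successive value estimates; a brief generic reminder (not a study-specific variant):

      # Temporal-difference reward prediction error; gamma is the discount factor.
      def td_error(reward, value_current, value_next, gamma=0.95):
          return reward + gamma * value_next - value_current

      # Positive delta: outcome better than expected; negative delta: worse than expected.
      print(td_error(reward=1.0, value_current=0.4, value_next=0.0))   # 0.6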

  2. The Reinforcing and Rewarding Effects of Methylone, a Synthetic Cathinone Commonly Found in "Bath Salts"

    PubMed

    Watterson, Lucas R; Hood, Lauren; Sewalia, Kaveish; Tomek, Seven E; Yahn, Stephanie; Johnson, Craig Trevor; Wegner, Scott; Blough, Bruce E; Marusich, Julie A; Olive, M Foster

    2012-12-01

    Methylone is a member of the designer drug class known as synthetic cathinones which have become increasingly popular drugs of abuse in recent years. Commonly referred to as "bath salts", these amphetamine-like compounds are sold as "legal" alternatives to illicit drugs such as cocaine, methamphetamine, and 3,4-methylenedioxymethamphetamine (MDMA, ecstasy). Following their dramatic rise in popularity along with numerous reports of toxicity and death, several of these drugs were classified as Schedule I drugs in the United States in 2012. Despite these bans, these drugs and other new structurally similar analogues continue to be abused. Currently, however, it is unknown whether these compounds possess the potential for compulsive use and addiction. The present study sought to determine the relative abuse liability of methylone by employing intravenous self-administration (IVSA) and intracranial self-stimulation (ICSS) paradigms in rats. We demonstrate that methylone (0.05, 0.1, 0.2, and 0.5 mg/kg/infusion) dose-dependently functions as a reinforcer, and that there is a significant positive relationship between methylone dose and reinforcer efficacy. Furthermore, responding during short access sessions (ShA, 2 hr/day) appeared more robust than previous IVSA studies with MDMA. However, unlike previous findings with abused stimulants such as cocaine or methamphetamine, long access sessions (LgA, 6 hr/day) did not lead to escalated drug intake or increased reinforcer efficacy. Finally, methylone produced a dose-dependent, but statistically non-significant, trend towards reductions in ICSS thresholds. Together these results reveal that methylone may possess an addiction potential similar to or greater than MDMA, yet patterns of self-administration and effects on brain reward function suggest that this drug may have a lower potential for abuse and compulsive use than prototypical psychostimulants. PMID:24244886

  3. Providing a food reward reduces inhibitory avoidance learning in zebrafish.

    PubMed

    Manuel, Remy; Zethof, Jan; Flik, Gert; van den Bos, Ruud

    2015-11-01

    As shown in male rats, prior history of subjects changes behavioural and stress-responses to challenges: a two-week history of exposure to rewards at fixed intervals led to slightly, but consistently, lower physiological stress-responses and anxiety-like behaviour. Here, we tested whether similar effects are present in zebrafish (Danio rerio). After two weeks of providing Artemia (brine shrimp; Artemia salina) as food reward or flake food (Tetramin) as control at fixed intervals, zebrafish were exposed to a fear-avoidance learning task using an inhibitory avoidance protocol. Half the number of fish received a 3V shock on day 1 and were tested and sacrificed on day 2; the other half received a second 3V shock on day 2 and were tested and sacrificed on day 3. The latter was done to assess whether effects are robust, as effects in rats have been shown to be modest. Zebrafish that were given Artemia showed less inhibitory avoidance after one shock, but not after two shocks, than zebrafish that were given flake food. Reduced avoidance behaviour was associated with lower telencephalic gene expression levels of cannabinoid receptor 1 (cnr1) and higher gene expression levels of corticotropin releasing factor (crf). These results suggest that providing rewards at fixed intervals alters fear avoidance behaviour, albeit modestly, in zebrafish. We discuss the data in the context of similar underlying brain structures in mammals and fish. PMID:26342856

  4. Fuzzy Q-Learning for Generalization of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1996-01-01

    Fuzzy Q-Learning, introduced earlier by the author, is an extension of Q-Learning into fuzzy environments. GARIC is a methodology for fuzzy reinforcement learning. In this paper, we introduce GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning and at the local level, each agent learns and operates based on GARIC. GARIC-Q improves the speed and applicability of Fuzzy Q-Learning through generalization of input space by using fuzzy rules and bridges the gap between Q-Learning and rule based intelligent systems.

  5. Reinforcement Learning Based Artificial Immune Classifier

    PubMed Central

    Karakose, Mehmet

    2013-01-01

    One of the widely used methods for classification that is a decision-making process is artificial immune systems. Artificial immune systems based on natural immunity system can be successfully applied for classification, optimization, recognition, and learning in real-world problems. In this study, a reinforcement learning based artificial immune classifier is proposed as a new approach. This approach uses reinforcement learning to find better antibody with immune operators. The proposed new approach has many contributions according to other methods in the literature such as effectiveness, less memory cell, high accuracy, speed, and data adaptability. The performance of the proposed approach is demonstrated by simulation and experimental results using real data in Matlab and FPGA. Some benchmark data and remote image data are used for experimental results. The comparative results with supervised/unsupervised based artificial immune system, negative selection classifier, and resource limited artificial immune classifier are given to demonstrate the effectiveness of the proposed new method. PMID:23935424

  6. Online reinforcement learning for dynamic multimedia systems.

    PubMed

    Mastronarde, Nicholas; van der Schaar, Mihaela

    2010-02-01

    In our previous work, we proposed a systematic cross-layer framework for dynamic multimedia systems, which allows each layer to make autonomous and foresighted decisions that maximize the system's long-term performance, while meeting the application's real-time delay constraints. The proposed solution solved the cross-layer optimization offline, under the assumption that the multimedia system's probabilistic dynamics were known a priori, by modeling the system as a layered Markov decision process. In practice, however, these dynamics are unknown a priori and, therefore, must be learned online. In this paper, we address this problem by allowing the multimedia system layers to learn, through repeated interactions with each other, to autonomously optimize the system's long-term performance at run-time. The two key challenges in this layered learning setting are: (i) each layer's learning performance is directly impacted by not only its own dynamics, but also by the learning processes of the other layers with which it interacts; and (ii) selecting a learning model that appropriately balances time-complexity (i.e., learning speed) with the multimedia system's limited memory and the multimedia application's real-time delay constraints. We propose two reinforcement learning algorithms for optimizing the system under different design constraints: the first algorithm solves the cross-layer optimization in a centralized manner and the second solves it in a decentralized manner. We analyze both algorithms in terms of their required computation, memory, and interlayer communication overheads. After noting that the proposed reinforcement learning algorithms learn too slowly, we introduce a complementary accelerated learning algorithm that exploits partial knowledge about the system's dynamics in order to dramatically improve the system's performance. In our experiments, we demonstrate that decentralized learning can perform equally as well as centralized learning, while

  7. Drive-reinforcement learning system applications

    NASA Astrophysics Data System (ADS)

    Johnson, Daniel W.

    1992-07-01

    The application of Drive-Reinforcement (D-R) to the unsupervised learning of manipulator control functions was investigated. In particular, the ability of a D-R neuronal system to learn servo-level and trajectory-level controls for a robotic mechanism was assessed. Results indicate that D-R based systems can be successful at learning these functions in real-time with actual hardware. Moreover, since the control architectures are generic, the evidence suggests that D-R would be effective in control system applications outside the robotics arena.

  8. Reward functions of the basal ganglia.

    PubMed

    Schultz, Wolfram

    2016-07-01

    Besides their fundamental movement function evidenced by Parkinsonian deficits, the basal ganglia are involved in processing closely linked non-motor, cognitive and reward information. This review describes the reward functions of three brain structures that are major components of the basal ganglia or are closely associated with the basal ganglia, namely midbrain dopamine neurons, pedunculopontine nucleus, and striatum (caudate nucleus, putamen, nucleus accumbens). Rewards are involved in learning (positive reinforcement), approach behavior, economic choices and positive emotions. The response of dopamine neurons to rewards consists of an early detection component and a subsequent reward component that reflects a prediction error in economic utility, but is unrelated to movement. Dopamine activations to non-rewarded or aversive stimuli reflect physical impact, but not punishment. Neurons in pedunculopontine nucleus project their axons to dopamine neurons and process sensory stimuli, movements and rewards and reward-predicting stimuli without coding outright reward prediction errors. Neurons in striatum, besides their pronounced movement relationships, process rewards irrespective of sensory and motor aspects, integrate reward information into movement activity, code the reward value of individual actions, change their reward-related activity during learning, and code own reward in social situations depending on whose action produces the reward. These data demonstrate a variety of well-characterized reward processes in specific basal ganglia nuclei consistent with an important function in non-motor aspects of motivated behavior. PMID:26838982

  9. Reinforcement Learning Models and Their Neural Correlates: An Activation Likelihood Estimation Meta-Analysis

    PubMed Central

    Kumar, Poornima; Eickhoff, Simon B.; Dombrovski, Alexandre Y.

    2015-01-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments – prediction error – is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies suggest that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that employed algorithmic reinforcement learning models, across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, while instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually-estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies. PMID:25665667

  10. Reinforcement learning models and their neural correlates: An activation likelihood estimation meta-analysis.

    PubMed

    Chase, Henry W; Kumar, Poornima; Eickhoff, Simon B; Dombrovski, Alexandre Y

    2015-06-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments-prediction error-is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies have suggested that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that had employed algorithmic reinforcement learning models across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, whereas instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies. PMID:25665667

  11. Autistic Traits Moderate the Impact of Reward Learning on Social Behaviour

    PubMed Central

    Panasiti, Maria Serena; Puzzo, Ignazio

    2015-01-01

    A deficit in empathy has been suggested to underlie social behavioural atypicalities in autism. A parallel theoretical account proposes that reduced social motivation (i.e., low responsivity to social rewards) can account for the said atypicalities. Recent evidence suggests that autistic traits modulate the link between reward and proxy metrics related to empathy. Using an evaluative conditioning paradigm to associate high and low rewards with faces, a previous study has shown that individuals high in autistic traits show reduced spontaneous facial mimicry of faces associated with high vs. low reward. This observation raises the possibility that autistic traits modulate the magnitude of evaluative conditioning. To test this, we investigated (a) if autistic traits could modulate the ability to implicitly associate a reward value to a social stimulus (reward learning/conditioning, using the Implicit Association Task, IAT); (b) if the learned association could modulate participants’ prosocial behaviour (i.e., social reciprocity, measured using the cyberball task); (c) if the strength of this modulation was influenced by autistic traits. In 43 neurotypical participants, we found that autistic traits moderated the relationship of social reward learning on prosocial behaviour but not reward learning itself. This evidence suggests that while autistic traits do not directly influence social reward learning, they modulate the relationship of social rewards with prosocial behaviour. Autism Res 2016, 9: 471–479. © 2015 The Authors Autism Research published by Wiley Periodicals, Inc. on behalf of International Society for Autism Research PMID:26280134

  12. Autistic Traits Moderate the Impact of Reward Learning on Social Behaviour.

    PubMed

    Panasiti, Maria Serena; Puzzo, Ignazio; Chakrabarti, Bhismadev

    2016-04-01

    A deficit in empathy has been suggested to underlie social behavioural atypicalities in autism. A parallel theoretical account proposes that reduced social motivation (i.e., low responsivity to social rewards) can account for the said atypicalities. Recent evidence suggests that autistic traits modulate the link between reward and proxy metrics related to empathy. Using an evaluative conditioning paradigm to associate high and low rewards with faces, a previous study has shown that individuals high in autistic traits show reduced spontaneous facial mimicry of faces associated with high vs. low reward. This observation raises the possibility that autistic traits modulate the magnitude of evaluative conditioning. To test this, we investigated (a) if autistic traits could modulate the ability to implicitly associate a reward value to a social stimulus (reward learning/conditioning, using the Implicit Association Task, IAT); (b) if the learned association could modulate participants' prosocial behaviour (i.e., social reciprocity, measured using the cyberball task); (c) if the strength of this modulation was influenced by autistic traits. In 43 neurotypical participants, we found that autistic traits moderated the relationship of social reward learning on prosocial behaviour but not reward learning itself. This evidence suggests that while autistic traits do not directly influence social reward learning, they modulate the relationship of social rewards with prosocial behaviour. Autism Res 2016, 9: 471-479. © 2015 The Authors Autism Research published by Wiley Periodicals, Inc. on behalf of International Society for Autism Research. PMID:26280134

  13. Neurocomputational mechanisms of reinforcement-guided learning in humans: a review.

    PubMed

    Cohen, Michael X

    2008-06-01

    Adapting decision making according to dynamic and probabilistic changes in action-reward contingencies is critical for survival in a competitive and resource-limited world. Much research has focused on elucidating the neural systems and computations that underlie how the brain identifies whether the consequences of actions are relatively good or bad. In contrast, less empirical research has focused on the mechanisms by which reinforcements might be used to guide decision making. Here, I review recent studies in which an attempt to bridge this gap has been made by characterizing how humans use reward information to guide and optimize decision making. Regions that have been implicated in reinforcement processing, including the striatum, orbitofrontal cortex, and anterior cingulate, also seem to mediate how reinforcements are used to adjust subsequent decision making. This research provides insights into why the brain devotes resources to evaluating reinforcements and suggests a direction for future research, from studying the mechanisms of reinforcement processing to studying the mechanisms of reinforcement learning. PMID:18589502

  14. Reinforcement learning in professional basketball players.

    PubMed

    Neiman, Tal; Loewenstein, Yonatan

    2011-01-01

    Reinforcement learning in complex natural environments is a challenging task because the agent should generalize from the outcomes of actions taken in one state of the world to future actions in different states of the world. The extent to which human experts find the proper level of generalization is unclear. Here we show, using the sequences of field goal attempts made by professional basketball players, that the outcome of even a single field goal attempt has a considerable effect on the rate of subsequent 3 point shot attempts, in line with standard models of reinforcement learning. However, this change in behaviour is associated with negative correlations between the outcomes of successive field goal attempts. These results indicate that despite years of experience and high motivation, professional players overgeneralize from the outcomes of their most recent actions, which leads to decreased performance. PMID:22146388

  15. Implication of dopaminergic modulation in operant reward learning and the induction of compulsive-like feeding behavior in Aplysia.

    PubMed

    Bédécarrats, Alexis; Cornet, Charles; Simmers, John; Nargeot, Romuald

    2013-06-01

    Feeding in Aplysia provides an amenable model system for analyzing the neuronal substrates of motivated behavior and its adaptability by associative reward learning and neuromodulation. Among such learning processes, appetitive operant conditioning that leads to a compulsive-like expression of feeding actions is known to be associated with changes in the membrane properties and electrical coupling of essential action-initiating B63 neurons in the buccal central pattern generator (CPG). Moreover, the food-reward signal for this learning is conveyed in the esophageal nerve (En), an input nerve rich in dopamine-containing fibers. Here, to investigate whether dopamine (DA) is involved in this learning-induced plasticity, we used an in vitro analog of operant conditioning in which electrical stimulation of En substituted the contingent reinforcement of biting movements in vivo. Our data indicate that contingent En stimulation does, indeed, replicate the operant learning-induced changes in CPG output and the underlying membrane and synaptic properties of B63. Significantly, moreover, this network and cellular plasticity was blocked when the input nerve was stimulated in the presence of the DA receptor antagonist cis-flupenthixol. These results therefore suggest that En-derived dopaminergic modulation of CPG circuitry contributes to the operant reward-dependent emergence of a compulsive-like expression of Aplysia's feeding behavior. PMID:23685764

  16. The Function of Direct and Vicarious Reinforcement in Human Learning.

    ERIC Educational Resources Information Center

    Owens, Carl R.; And Others

    The role of reinforcement has long been an issue in learning theory. The effects of reinforcement in learning were investigated under circumstances which made the information necessary for correct performance equally available to reinforced and nonreinforced subjects. Fourth graders (N=36) were given a pre-test of 20 items from the Peabody Picture…

  17. Effects of Cooperative versus Individual Study on Learning and Motivation after Reward-Removal

    ERIC Educational Resources Information Center

    Sears, David A.; Pai, Hui-Hua

    2012-01-01

    Rewards are frequently used in classrooms and recommended as a key component of well-researched methods of cooperative learning (e.g., Slavin, 1995). While many studies of cooperative learning find beneficial effects of rewards, many studies of individuals find negative effects (e.g., Deci, Koestner, & Ryan, 1999; Lepper, 1988). This may be…

  18. The Influence of Personality on Neural Mechanisms of Observational Fear and Reward Learning

    ERIC Educational Resources Information Center

    Hooker, Christine I.; Verosky, Sara C.; Miyakawa, Asako; Knight, Robert T.; D'Esposito, Mark

    2008-01-01

    Fear and reward learning can occur through direct experience or observation. Both channels can enhance survival or create maladaptive behavior. We used fMRI to isolate neural mechanisms of observational fear and reward learning and investigate whether neural response varied according to individual differences in neuroticism and extraversion.…

  19. Reinforcement Learning and Dopamine in Schizophrenia: Dimensions of Symptoms or Specific Features of a Disease Group?

    PubMed Central

    Deserno, Lorenz; Boehme, Rebecca; Heinz, Andreas; Schlagenhauf, Florian

    2013-01-01

    Abnormalities in reinforcement learning are a key finding in schizophrenia and have been proposed to be linked to elevated levels of dopamine neurotransmission. Behavioral deficits in reinforcement learning and their neural correlates may contribute to the formation of clinical characteristics of schizophrenia. The ability to form predictions about future outcomes is fundamental for environmental interactions and depends on neuronal teaching signals, like reward prediction errors. While aberrant prediction errors, that encode non-salient events as surprising, have been proposed to contribute to the formation of positive symptoms, a failure to build neural representations of decision values may result in negative symptoms. Here, we review behavioral and neuroimaging research in schizophrenia and focus on studies that implemented reinforcement learning models. In addition, we discuss studies that combined reinforcement learning with measures of dopamine. Thereby, we suggest how reinforcement learning abnormalities in schizophrenia may contribute to the formation of psychotic symptoms and may interact with cognitive deficits. These ideas point toward an interplay of more rigid versus flexible control over reinforcement learning. Pronounced deficits in the flexible or model-based domain may allow for a detailed characterization of well-established cognitive deficits in schizophrenia patients based on computational models of learning. Finally, we propose a framework based on the potentially crucial contribution of dopamine to dysfunctional reinforcement learning on the level of neural networks. Future research may strongly benefit from computational modeling but also requires further methodological improvement for clinical group studies. These research tools may help to improve our understanding of disease-specific mechanisms and may help to identify clinically relevant subgroups of the heterogeneous entity schizophrenia. PMID:24391603

  20. The Cerebellum: A Neural System for the Study of Reinforcement Learning

    PubMed Central

    Swain, Rodney A.; Kerr, Abigail L.; Thompson, Richard F.

    2011-01-01

    In its strictest application, the term “reinforcement learning” refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Given that Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed); it is the case that any neural system that supports reinforcement learning must also be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning. PMID:21427778

  1. Identifying Cognitive Remediation Change Through Computational Modelling—Effects on Reinforcement Learning in Schizophrenia

    PubMed Central

    Cella, Matteo; Bishara, Anthony J.; Medin, Evelina; Swan, Sarah; Reeder, Clare; Wykes, Til

    2014-01-01

    Objective: Converging research suggests that individuals with schizophrenia show a marked impairment in reinforcement learning, particularly in tasks requiring flexibility and adaptation. The problem has been associated with dopamine reward systems. This study explores, for the first time, the characteristics of this impairment and how it is affected by a behavioral intervention—cognitive remediation. Method: Using computational modelling, 3 reinforcement learning parameters based on the Wisconsin Card Sorting Test (WCST) trial-by-trial performance were estimated: R (reward sensitivity), P (punishment sensitivity), and D (choice consistency). In Study 1 the parameters were compared between a group of individuals with schizophrenia (n = 100) and a healthy control group (n = 50). In Study 2 the effect of cognitive remediation therapy (CRT) on these parameters was assessed in 2 groups of individuals with schizophrenia, one receiving CRT (n = 37) and the other receiving treatment as usual (TAU, n = 34). Results: In Study 1 individuals with schizophrenia showed impairment in the R and P parameters compared with healthy controls. Study 2 demonstrated that sensitivity to negative feedback (P) and reward (R) improved in the CRT group after therapy compared with the TAU group. R and P parameter change correlated with WCST outputs. Improvements in R and P after CRT were associated with working memory gains and reduction of negative symptoms, respectively. Conclusion: Schizophrenia reinforcement learning difficulties negatively influence performance in shift learning tasks. CRT can improve sensitivity to reward and punishment. Identifying parameters that show change may be useful in experimental medicine studies to identify cognitive domains susceptible to improvement. PMID:24214932
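
    The three fitted parameters (reward sensitivity R, punishment sensitivity P, choice consistency D) can be illustrated with a simplified rule-learning model of a card-sorting trial. The update and choice rules below are a generic approximation for illustration, not the specific WCST model estimated in the study:

      import numpy as np

      # Simplified sorting-rule learner with separate reward (r) and punishment (p)
      # sensitivities and a choice-consistency exponent (d); all values are illustrative.
      def choose_rule(strengths, d):
          s = np.power(strengths, d)   # higher d concentrates choice on the strongest rule
          return np.random.choice(len(strengths), p=s / s.sum())

      def update_strengths(strengths, chosen, correct, r, p):
          if correct:
              strengths[chosen] += r * (1.0 - strengths[chosen])   # reward sensitivity
          else:
              strengths[chosen] -= p * strengths[chosen]           # punishment sensitivity
          return np.clip(strengths, 1e-3, None)

      rules = np.ones(3) / 3   # e.g. sort by colour, shape, or number
      for trial in range(10):
          pick = choose_rule(rules, d=2.0)
          rules = update_strengths(rules, pick, correct=(pick == 0), r=0.3, p=0.2)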

  2. Excitotoxic lesions of the medial striatum delay extinction of a reinforcement color discrimination operant task in domestic chicks; a functional role of reward anticipation.

    PubMed

    Ichikawa, Yoko; Izawa, Ei-Ichi; Matsushima, Toshiya

    2004-12-01

    To reveal the functional roles of the striatum, we examined the effects of excitotoxic lesions to the bilateral medial striatum (mSt) and nucleus accumbens (Ac) in a food reinforcement color discrimination operant task. With a food reward as reinforcement, 1-week-old domestic chicks were trained to peck selectively at red and yellow beads (S+) and not to peck at a blue bead (S-). Those chicks then received either lesions or sham operations and were tested in extinction training sessions, during which yellow turned out to be nonrewarding (S-), whereas red and blue remained unchanged. To further examine the effects on postoperant noninstrumental aspects of behavior, we also measured the "waiting time", during which chicks stayed at the empty feeder after pecking at yellow. Although the lesioned chicks showed significantly higher error rates in the nonrewarding yellow trials, their postoperant waiting time gradually decreased similarly to the sham controls. Furthermore, the lesioned chicks waited significantly longer than the controls, even from the first extinction block. In the blue trials, both lesioned and sham chicks consistently refrained from pecking, indicating that the delayed extinction was not due to a general disinhibition of pecking. Similarly, no effects were found in the novel training sessions, suggesting that the lesions had selective effects on the extinction of a learned operant. These results suggest that a neural representation of memory-based reward anticipation in the mSt/Ac could contribute to the anticipation error required for extinction. PMID:15561503

  3. Multiagent reinforcement learning: spiking and nonspiking agents in the iterated Prisoner's Dilemma.

    PubMed

    Vassiliades, Vassilis; Cleanthous, Aristodemos; Christodoulou, Chris

    2011-04-01

    This paper investigates multiagent reinforcement learning (MARL) in a general-sum game where the payoffs' structure is such that the agents are required to exploit each other in a way that benefits all agents. The contradictory nature of these games makes their study in multiagent systems quite challenging. In particular, we investigate MARL with spiking and nonspiking agents in the Iterated Prisoner's Dilemma by exploring the conditions required to enhance its cooperative outcome. The spiking agents are neural networks with leaky integrate-and-fire neurons trained with two different learning algorithms: 1) reinforcement of stochastic synaptic transmission, or 2) reward-modulated spike-timing-dependent plasticity with eligibility trace. The nonspiking agents use a tabular representation and are trained with Q- and SARSA learning algorithms, with a novel reward transformation process also being applied to the Q-learning agents. According to the results, the cooperative outcome is enhanced by: 1) transformed internal reinforcement signals and a combination of a high learning rate and a low discount factor with an appropriate exploration schedule in the case of non-spiking agents, and 2) having longer eligibility trace time constant in the case of spiking agents. Moreover, it is shown that spiking and nonspiking agents have similar behavior and therefore they can equally well be used in a multiagent interaction setting. For training the spiking agents in the case where more than one output neuron competes for reinforcement, a novel and necessary modification that enhances competition is applied to the two learning algorithms utilized, in order to avoid a possible synaptic saturation. This is done by administering to the networks additional global reinforcement signals for every spike of the output neurons that were not "responsible" for the preceding decision. PMID:21421435
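
    The nonspiking agents described use tabular Q-learning and SARSA over the previous joint move; the essential difference between the two updates in this setting is shown below (payoffs are the usual Prisoner's Dilemma values; parameters are illustrative):

      import random

      # Tabular agents for the Iterated Prisoner's Dilemma. State = the pair of moves
      # played on the previous round; PAYOFF gives (row player, column player) rewards.
      ACTIONS = ['C', 'D']
      PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5), ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

      def eps_greedy(q, state, epsilon=0.1):
          if random.random() < epsilon:
              return random.choice(ACTIONS)
          return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

      def q_learning_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9):
          # Q-learning: bootstrap from the best next action (off-policy).
          target = r + gamma * max(q.get((s_next, b), 0.0) for b in ACTIONS)
          q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))

      def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
          # SARSA: bootstrap from the action actually selected next (on-policy).
          target = r + gamma * q.get((s_next, a_next), 0.0)
          q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))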

  4. Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity

    PubMed Central

    Warlaumont, Anne S.; Finnegan, Megan K.

    2016-01-01

    At around 7 months of age, human infants begin to reliably produce well-formed syllables containing both consonants and vowels, a behavior called canonical babbling. Over subsequent months, the frequency of canonical babbling continues to increase. How the infant’s nervous system supports the acquisition of this ability is unknown. Here we present a computational model that combines a spiking neural network, reinforcement-modulated spike-timing-dependent plasticity, and a human-like vocal tract to simulate the acquisition of canonical babbling. Like human infants, the model’s frequency of canonical babbling gradually increases. The model is rewarded when it produces a sound that is more auditorily salient than sounds it has previously produced. This is consistent with data from human infants indicating that contingent adult responses shape infant behavior and with data from deaf and tracheostomized infants indicating that hearing, including hearing one’s own vocalizations, is critical for canonical babbling development. Reward receipt increases the level of dopamine in the neural network. The neural network contains a reservoir with recurrent connections and two motor neuron groups, one agonist and one antagonist, which control the masseter and orbicularis oris muscles, promoting or inhibiting mouth closure. The model learns to increase the number of salient, syllabic sounds it produces by adjusting the base level of muscle activation and increasing their range of activity. Our results support the possibility that through dopamine-modulated spike-timing-dependent plasticity, the motor cortex learns to harness its natural oscillations in activity in order to produce syllabic sounds. It thus suggests that learning to produce rhythmic mouth movements for speech production may be supported by general cortical learning mechanisms. The model makes several testable predictions and has implications for our understanding not only of how syllabic vocalizations develop
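
    The core learning mechanism described (reward-modulated spike-timing-dependent plasticity, with a dopamine signal gated by auditory salience) can be sketched schematically as below; the network sizes, time constants, and the random stand-ins for spike times and salience are assumptions for illustration only, not the published model.

        import numpy as np

        rng = np.random.default_rng(0)
        n_pre, n_post = 20, 2              # e.g. reservoir neurons -> agonist/antagonist motor units
        w = rng.uniform(0.0, 0.5, (n_post, n_pre))
        elig = np.zeros_like(w)            # per-synapse eligibility trace
        TAU_ELIG, TAU_STDP = 50.0, 20.0    # ms time constants (assumed)
        A_PLUS, A_MINUS, LR = 0.01, 0.012, 0.1
        best_salience = -1.0               # running record of the most salient sound so far

        def stdp(dt):
            """Exponential STDP window: pre-before-post (dt >= 0) potentiates, otherwise depresses."""
            return A_PLUS * np.exp(-dt / TAU_STDP) if dt >= 0 else -A_MINUS * np.exp(dt / TAU_STDP)

        for step in range(1000):
            # Placeholder spike times; a real simulation would take these from the spiking network.
            pre_t = rng.uniform(0, 20, n_pre)
            post_t = rng.uniform(0, 20, n_post)
            # STDP-shaped changes accumulate in the eligibility trace, not directly in the weights.
            for i in range(n_post):
                for j in range(n_pre):
                    elig[i, j] += stdp(post_t[i] - pre_t[j])
            elig *= np.exp(-1.0 / TAU_ELIG)
            # Reward (a dopamine pulse) only when the produced sound is more salient than any before;
            # salience here is a random stand-in for the model's auditory salience measure.
            salience = rng.random()
            dopamine = 1.0 if salience > best_salience else 0.0
            best_salience = max(best_salience, salience)
            w += LR * dopamine * elig      # dopamine gates the eligibility trace into weight change
            np.clip(w, 0.0, 1.0, out=w)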

  5. Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity.

    PubMed

    Warlaumont, Anne S; Finnegan, Megan K

    2016-01-01

    At around 7 months of age, human infants begin to reliably produce well-formed syllables containing both consonants and vowels, a behavior called canonical babbling. Over subsequent months, the frequency of canonical babbling continues to increase. How the infant's nervous system supports the acquisition of this ability is unknown. Here we present a computational model that combines a spiking neural network, reinforcement-modulated spike-timing-dependent plasticity, and a human-like vocal tract to simulate the acquisition of canonical babbling. Like human infants, the model's frequency of canonical babbling gradually increases. The model is rewarded when it produces a sound that is more auditorily salient than sounds it has previously produced. This is consistent with data from human infants indicating that contingent adult responses shape infant behavior and with data from deaf and tracheostomized infants indicating that hearing, including hearing one's own vocalizations, is critical for canonical babbling development. Reward receipt increases the level of dopamine in the neural network. The neural network contains a reservoir with recurrent connections and two motor neuron groups, one agonist and one antagonist, which control the masseter and orbicularis oris muscles, promoting or inhibiting mouth closure. The model learns to increase the number of salient, syllabic sounds it produces by adjusting the base level of muscle activation and increasing their range of activity. Our results support the possibility that through dopamine-modulated spike-timing-dependent plasticity, the motor cortex learns to harness its natural oscillations in activity in order to produce syllabic sounds. It thus suggests that learning to produce rhythmic mouth movements for speech production may be supported by general cortical learning mechanisms. The model makes several testable predictions and has implications for our understanding not only of how syllabic vocalizations develop in

  6. Post-learning Hippocampal Dynamics Promote Preferential Retention of Rewarding Events.

    PubMed

    Gruber, Matthias J; Ritchey, Maureen; Wang, Shao-Fang; Doss, Manoj K; Ranganath, Charan

    2016-03-01

    Reward motivation is known to modulate memory encoding, and this effect depends on interactions between the substantia nigra/ventral tegmental area complex (SN/VTA) and the hippocampus. It is unknown, however, whether these interactions influence offline neural activity in the human brain that is thought to promote memory consolidation. Here we used fMRI to test the effect of reward motivation on post-learning neural dynamics and subsequent memory for objects that were learned in high- and low-reward motivation contexts. We found that post-learning increases in resting-state functional connectivity between the SN/VTA and hippocampus predicted preferential retention of objects that were learned in high-reward contexts. In addition, multivariate pattern classification revealed that hippocampal representations of high-reward contexts were preferentially reactivated during post-learning rest, and the number of hippocampal reactivations was predictive of preferential retention of items learned in high-reward contexts. These findings indicate that reward motivation alters offline post-learning dynamics between the SN/VTA and hippocampus, providing novel evidence for a potential mechanism by which reward could influence memory consolidation. PMID:26875624

  7. A Flexible Mechanism of Rule Selection Enables Rapid Feature-Based Reinforcement Learning

    PubMed Central

    Balcarras, Matthew; Womelsdorf, Thilo

    2016-01-01

    Learning in a new environment is influenced by prior learning and experience. Correctly applying a rule that maps a context to stimuli, actions, and outcomes enables faster learning and better outcomes compared to relying on strategies for learning that are ignorant of task structure. However, it is often difficult to know when and how to apply learned rules in new contexts. In our study we explored how subjects employ different strategies for learning the relationship between stimulus features and positive outcomes in a probabilistic task context. We test the hypothesis that task naive subjects will show enhanced learning of feature specific reward associations by switching to the use of an abstract rule that associates stimuli by feature type and restricts selections to that dimension. To test this hypothesis we designed a decision making task where subjects receive probabilistic feedback following choices between pairs of stimuli. In the task, trials are grouped in two contexts by blocks, where in one type of block there is no unique relationship between a specific feature dimension (stimulus shape or color) and positive outcomes, and following an un-cued transition, alternating blocks have outcomes that are linked to either stimulus shape or color. Two-thirds of subjects (n = 22/32) exhibited behavior that was best fit by a hierarchical feature-rule model. Supporting the prediction of the model mechanism these subjects showed significantly enhanced performance in feature-reward blocks, and rapidly switched their choice strategy to using abstract feature rules when reward contingencies changed. Choice behavior of other subjects (n = 10/32) was fit by a range of alternative reinforcement learning models representing strategies that do not benefit from applying previously learned rules. In summary, these results show that untrained subjects are capable of flexibly shifting between behavioral rules by leveraging simple model-free reinforcement learning and context

  8. A Flexible Mechanism of Rule Selection Enables Rapid Feature-Based Reinforcement Learning.

    PubMed

    Balcarras, Matthew; Womelsdorf, Thilo

    2016-01-01

    Learning in a new environment is influenced by prior learning and experience. Correctly applying a rule that maps a context to stimuli, actions, and outcomes enables faster learning and better outcomes compared to relying on strategies for learning that are ignorant of task structure. However, it is often difficult to know when and how to apply learned rules in new contexts. In our study we explored how subjects employ different strategies for learning the relationship between stimulus features and positive outcomes in a probabilistic task context. We test the hypothesis that task naive subjects will show enhanced learning of feature specific reward associations by switching to the use of an abstract rule that associates stimuli by feature type and restricts selections to that dimension. To test this hypothesis we designed a decision making task where subjects receive probabilistic feedback following choices between pairs of stimuli. In the task, trials are grouped in two contexts by blocks, where in one type of block there is no unique relationship between a specific feature dimension (stimulus shape or color) and positive outcomes, and following an un-cued transition, alternating blocks have outcomes that are linked to either stimulus shape or color. Two-thirds of subjects (n = 22/32) exhibited behavior that was best fit by a hierarchical feature-rule model. Supporting the prediction of the model mechanism these subjects showed significantly enhanced performance in feature-reward blocks, and rapidly switched their choice strategy to using abstract feature rules when reward contingencies changed. Choice behavior of other subjects (n = 10/32) was fit by a range of alternative reinforcement learning models representing strategies that do not benefit from applying previously learned rules. In summary, these results show that untrained subjects are capable of flexibly shifting between behavioral rules by leveraging simple model-free reinforcement learning and context

  9. Experiential reward learning outweighs instruction prior to adulthood

    PubMed Central

    Decker, Johannes H.; Lourenco, Frederico S.; Doll, Bradley B.; Hartley, Catherine A.

    2015-01-01

    Throughout our lives, we face the important task of distinguishing rewarding actions from those that are best avoided. Importantly, there are multiple means by which we acquire this information. Through trial and error, we use experiential feedback to evaluate our actions. We also learn which actions are advantageous through explicit instruction from others. Here, we examined whether the influence of these two forms of learning on choice changes across development by placing instruction and experience in competition in a probabilistic-learning task. Whereas inaccurate instruction markedly biased adults’ estimations of a stimulus’s value, children and adolescents were better able to objectively estimate stimulus values through experience. Instructional control of learning is thought to recruit prefrontal–striatal brain circuitry, which continues to mature into adulthood. Our behavioral data suggest that this protracted neurocognitive maturation may cause the motivated actions of children and adolescents to be less influenced by explicit instruction than are those of adults. This absence of a confirmation bias in children and adolescents represents a paradoxical developmental advantage of youth over adults in the unbiased evaluation of actions through positive and negative experience. PMID:25582607

  10. Instrumental learning of traits versus rewards: dissociable neural correlates and effects on choice.

    PubMed

    Hackel, Leor M; Doll, Bradley B; Amodio, David M

    2015-09-01

    Humans learn about people and objects through positive and negative experiences, yet they can also look beyond the immediate reward of an interaction to encode trait-level attributes. We found that perceivers encoded both reward and trait-level information through feedback in an instrumental learning task, but relied more heavily on trait representations in cross-context decisions. Both learning types implicated ventral striatum, but trait learning also recruited a network associated with social impression formation. PMID:26237363

  11. Reward learning in pediatric depression and anxiety: Preliminary findings in a high-risk sample

    PubMed Central

    Morris, Bethany H.; Bylsma, Lauren M.; Yaroslavsky, Ilya; Kovacs, Maria; Rottenberg, Jonathan

    2015-01-01

    Background Reward learning has been postulated as a critical component of hedonic functioning that predicts depression risk. Reward learning deficits have been established in adults with current depressive disorders, but no prior studies have examined the relationship of reward learning and depression in children. The present study investigated reward learning as a function of familial depression risk and current diagnostic status in a pediatric sample. Method The sample included 204 children of parents with a history of depression (n=86 high-risk offspring) or parents with no history of major mental disorder (n=118 low-risk offspring). Semi-structured clinical interviews were used to establish current mental diagnoses in the children. A modified signal detection task was used to assess reward learning. We tested whether reward learning was impaired in high-risk offspring relative to low-risk offspring. We also tested whether reward learning was impaired in children with current disorders known to blunt hedonic function (depression, social phobia, PTSD, GAD, n=13) compared to children with no disorders and to a psychiatric comparison group with ADHD. Results High- and low-risk youth did not differ in reward learning. However, youth with current anhedonic disorders (depression, social phobia, PTSD, GAD) exhibited blunted reward learning relative to nondisordered youth and those with ADHD. Conclusions Our results are a first demonstration that reward learning deficits are present among youth with disorders known to blunt anhedonic function and that these deficits have some degree of diagnostic specificity. We advocate for future studies to replicate and extend these preliminary findings. PMID:25826304

  12. Knowledge-Based Reinforcement Learning for Data Mining

    NASA Astrophysics Data System (ADS)

    Kudenko, Daniel; Grzes, Marek

    Data Mining is the process of extracting patterns from data. Two general avenues of research in the intersecting areas of agents and data mining can be distinguished. The first approach is concerned with mining an agent’s observation data in order to extract patterns, categorize environment states, and/or make predictions of future states. In this setting, data is normally available as a batch, and the agent’s actions and goals are often independent of the data mining task. The data collection is mainly considered as a side effect of the agent’s activities. Machine learning techniques applied in such situations fall into the class of supervised learning. In contrast, the second scenario occurs where an agent is actively performing the data mining, and is responsible for the data collection itself. For example, a mobile network agent is acquiring and processing data (where the acquisition may incur a certain cost), or a mobile sensor agent is moving in a (perhaps hostile) environment, collecting and processing sensor readings. In these settings, the tasks of the agent and the data mining are highly intertwined and interdependent (or even identical). Supervised learning is not a suitable technique for these cases. Reinforcement Learning (RL) enables an agent to learn from experience (in form of reward and punishment for explorative actions) and adapt to new situations, without a teacher. RL is an ideal learning technique for these data mining scenarios, because it fits the agent paradigm of continuous sensing and acting, and the RL agent is able to learn to make decisions on the sampling of the environment which provides the data. Nevertheless, RL still suffers from scalability problems, which have prevented its successful use in many complex real-world domains. The more complex the tasks, the longer it takes a reinforcement learning algorithm to converge to a good solution. For many real-world tasks, human expert knowledge is available. For example, human
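
    As a concrete illustration of the teacher-free, reward-and-punishment learning described here, the following minimal tabular Q-learning loop on an invented one-dimensional corridor shows the basic mechanics; the environment and parameters are hypothetical and not taken from the chapter.

        import random

        # Toy 1-D corridor: states 0..5, goal at 5 (+1 reward), pit at 0 (-1 punishment).
        N_STATES, GOAL, PIT = 6, 5, 0
        ACTIONS = [-1, +1]                      # move left or right
        Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
        alpha, gamma, epsilon = 0.1, 0.9, 0.1

        def step(s, a):
            s2 = max(0, min(N_STATES - 1, s + a))
            if s2 == GOAL:
                return s2, 1.0, True
            if s2 == PIT:
                return s2, -1.0, True
            return s2, 0.0, False

        for episode in range(500):
            s, done = 2, False                  # start in the middle
            while not done:
                a = random.choice(ACTIONS) if random.random() < epsilon \
                    else max(ACTIONS, key=lambda x: Q[(s, x)])
                s2, r, done = step(s, a)
                target = r if done else r + gamma * max(Q[(s2, x)] for x in ACTIONS)
                Q[(s, a)] += alpha * (target - Q[(s, a)])   # learn from reward and punishment
                s = s2
        print({s: max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(1, N_STATES - 1)})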

  13. Attenuating GABAA Receptor Signaling in Dopamine Neurons Selectively Enhances Reward Learning and Alters Risk Preference in Mice

    PubMed Central

    Parker, Jones G.; Wanat, Matthew J.; Soden, Marta E.; Ahmad, Kinza; Zweifel, Larry S.; Bamford, Nigel S.; Palmiter, Richard D.

    2011-01-01

    Phasic dopamine transmission encodes the value of reward-predictive stimuli and influences both learning and decision-making. Altered dopamine signaling is associated with psychiatric conditions characterized by risky choices such as pathological gambling. These observations highlight the importance of understanding how dopamine neuron activity is modulated. While excitatory drive onto dopamine neurons is critical for generating phasic dopamine responses, emerging evidence suggests that inhibitory signaling also modulates these responses. To address the functional importance of inhibitory signaling in dopamine neurons, we generated mice lacking the β3 subunit of the GABAA receptor specifically in dopamine neurons (β3-KO mice) and examined their behavior in tasks that assessed appetitive learning, aversive learning, and risk preference. Dopamine neurons in midbrain slices from β3-KO mice exhibited attenuated GABA-evoked inhibitory post-synaptic currents. Furthermore, electrical stimulation of excitatory afferents to dopamine neurons elicited more dopamine release in the nucleus accumbens of β3-KO mice as measured by fast-scan cyclic voltammetry. β3-KO mice were more active than controls when given morphine, which correlated with potential compensatory upregulation of GABAergic tone onto dopamine neurons. β3-KO mice learned faster in two food-reinforced learning paradigms, but extinguished their learned behavior normally. Enhanced learning was specific for appetitive tasks, as aversive learning was unaffected in β3-KO mice. Finally, we found that β3-KO mice had enhanced risk preference in a probabilistic selection task that required mice to choose between a small certain reward and a larger uncertain reward. Collectively, these findings identify a selective role for GABAA signaling in dopamine neurons in appetitive learning and decision-making. PMID:22114279

  14. Single amino acids in sucrose rewards modulate feeding and associative learning in the honeybee

    PubMed Central

    Simcock, Nicola K.; Gray, Helen E.; Wright, Geraldine A.

    2014-01-01

    Obtaining the correct balance of nutrients requires that the brain integrates information about the body’s nutritional state with sensory information from food to guide feeding behaviour. Learning is a mechanism that allows animals to identify cues associated with nutrients so that they can be located quickly when required. Feedback about nutritional state is essential for nutrient balancing and could influence learning. How specific this feedback is to individual nutrients has not often been examined. Here, we tested how the honeybee’s nutritional state influenced the likelihood it would feed on and learn sucrose solutions containing single amino acids. Nutritional state was manipulated by pre-feeding bees with either 1 M sucrose or 1 M sucrose containing 100 mM of isoleucine, proline, phenylalanine, or methionine 24 h prior to olfactory conditioning of the proboscis extension response. We found that bees pre-fed sucrose solution consumed less of solutions containing amino acids and were also less likely to learn to associate amino acid solutions with odours. Unexpectedly, bees pre-fed solutions containing an amino acid were also less likely to learn to associate odours with sucrose the next day. Furthermore, they consumed more of and were more likely to learn when rewarded with an amino acid solution if they were pre-fed isoleucine and proline. Our data indicate that single amino acids at relatively high concentrations inhibit feeding on sucrose solutions containing them, and they can act as appetitive reinforcers during learning. Our data also suggest that select amino acids interact with mechanisms that signal nutritional sufficiency to reduce hunger. Based on these experiments, we predict that nutrient balancing for essential amino acids during learning requires integration of information about several amino acids experienced simultaneously. PMID:24819203

  15. Convergence of reinforcement learning algorithms and acceleration of learning

    NASA Astrophysics Data System (ADS)

    Potapov, A.; Ali, M. K.

    2003-02-01

    The techniques of reinforcement learning have been gaining increasing popularity recently. However, the question of their convergence rate is still open. We consider the problem of choosing the learning steps α_n and their relation to the discount factor γ and the exploration degree ε. Appropriate choices of these parameters may drastically influence the convergence rate of the techniques. From analytical examples, we conjecture optimal values of α_n and then use numerical examples to verify our conjectures.
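
    A hedged illustration of why the choice of learning steps α_n matters: the snippet below estimates a fixed expected reward from noisy samples with a decaying schedule α_n = 1/n and with a constant step size; the target value and noise level are invented for the example.

        import random

        random.seed(1)
        TRUE_VALUE = 2.0        # hypothetical expected reward to be estimated
        N = 10000

        def run(schedule):
            q = 0.0
            for n in range(1, N + 1):
                sample = TRUE_VALUE + random.gauss(0.0, 1.0)   # noisy reward sample
                q += schedule(n) * (sample - q)                # incremental update
            return q

        print("alpha_n = 1/n       ->", run(lambda n: 1.0 / n))   # converges to ~2.0
        print("alpha_n = 0.5 const ->", run(lambda n: 0.5))       # keeps fluctuating around 2.0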

  16. Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control.

    PubMed

    Uragami, Daisuke; Takahashi, Tatsuji; Matsuo, Yoshiki

    2014-02-01

    Many algorithms and methods in artificial intelligence and machine learning were inspired by human cognition. As a mechanism to handle the exploration-exploitation dilemma in reinforcement learning, the loosely symmetric (LS) value function, which models human causal intuition, was proposed (Shinohara et al., 2007). While LS shows the highest correlation with human causal induction, it has also been reported to work effectively in multi-armed bandit problems, the simplest class of tasks embodying the dilemma. However, the scope of application of LS was limited to reinforcement learning problems with K actions and only one state (K-armed bandit problems). This study proposes an LS-Q learning architecture that can deal with general reinforcement learning tasks with multiple states and delayed reward. We tested the learning performance of the new architecture on giant-swing robot motion learning, where the uncertainty and unknownness of the environment are large. In the test, no ready-made internal models or function approximation of the state space were provided. The simulations showed that while the ordinary Q-learning agent does not reach the giant-swing motion because of stagnant loops (local optima with low rewards), LS-Q escapes such loops and acquires the giant swing. It is confirmed that the smaller the number of states (in other words, the more coarse-grained the division of states and the more incomplete the state observation), the better LS-Q performs in comparison with Q-learning. We also showed that the high performance of LS-Q depends comparatively little on parameter tuning and learning time. This suggests that the proposed method, inspired by human cognition, works adaptively in real environments. PMID:24296286

  17. Integration of reinforcement learning and optimal decision-making theories of the basal ganglia.

    PubMed

    Bogacz, Rafal; Larsen, Tobias

    2011-04-01

    This article seeks to integrate two sets of theories describing action selection in the basal ganglia: reinforcement learning theories describing learning which actions to select to maximize reward and decision-making theories proposing that the basal ganglia selects actions on the basis of sensory evidence accumulated in the cortex. In particular, we present a model that integrates the actor-critic model of reinforcement learning and a model assuming that the cortico-basal-ganglia circuit implements a statistically optimal decision-making procedure. The values of cortico-striatal weights required for optimal decision making in our model differ from those provided by standard reinforcement learning models. Nevertheless, we show that an actor-critic model converges to the weights required for optimal decision making when biologically realistic limits on synaptic weights are introduced. We also describe the model's predictions concerning reaction times and neural responses during learning, and we discuss directions required for further integration of reinforcement learning and optimal decision-making theories. PMID:21222528
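
    The actor-critic component referred to above can be sketched as follows for a single-state, two-choice task; the bound on the "synaptic" weights stands in for the biologically realistic limits mentioned in the abstract, and all parameters are illustrative assumptions rather than the authors' model.

        import math
        import random

        random.seed(0)
        P_REWARD = [0.8, 0.2]                 # hypothetical reward probabilities of two actions
        v = 0.0                               # critic: value of the (single) task state
        w = [0.0, 0.0]                        # actor: one cortico-striatal weight per action
        ALPHA_V, ALPHA_W, W_MAX, BETA = 0.05, 0.05, 1.0, 5.0

        def softmax_choice(weights):
            exps = [math.exp(BETA * x) for x in weights]
            r = random.random() * sum(exps)
            acc = 0.0
            for i, e in enumerate(exps):
                acc += e
                if r <= acc:
                    return i
            return len(weights) - 1

        for trial in range(2000):
            a = softmax_choice(w)
            reward = 1.0 if random.random() < P_REWARD[a] else 0.0
            delta = reward - v                        # prediction error (single state, no bootstrap term)
            v += ALPHA_V * delta                      # critic update
            w[a] += ALPHA_W * delta                   # actor update for the chosen action only
            w[a] = max(0.0, min(W_MAX, w[a]))         # realistic bound on the synaptic weight
        print("action weights:", [round(x, 2) for x in w], "state value:", round(v, 2))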

  18. Beyond Rewards

    ERIC Educational Resources Information Center

    Hall, Philip S.

    2009-01-01

    Using rewards to impact students' behavior has long been common practice. However, using reward systems to enhance student learning conveniently masks the larger and admittedly more difficult task of finding and implementing the structure and techniques that children with special needs require to learn. More important, rewarding the child for good…

  19. Forward shift of feeding buzz components of dolphins and belugas during associative learning reveals a likely connection to reward expectation, pleasure and brain dopamine activation.

    PubMed

    Ridgway, S H; Moore, P W; Carder, D A; Romano, T A

    2014-08-15

    For many years, we heard sounds associated with reward from dolphins and belugas. We named these pulsed sounds victory squeals (VS), as they remind us of a child's squeal of delight. Here we put these sounds in context with natural and learned behavior. Like bats, echolocating cetaceans produce feeding buzzes as they approach and catch prey. Unlike bats, cetaceans continue their feeding buzzes after prey capture and the after portion is what we call the VS. Prior to training (or conditioning), the VS comes after the fish reward; with repeated trials it moves to before the reward. During training, we use a whistle or other sound to signal a correct response by the animal. This sound signal, named a secondary reinforcer (SR), leads to the primary reinforcer, fish. Trainers usually name their whistle or other SR a bridge, as it bridges the time gap between the correct response and reward delivery. During learning, the SR becomes associated with reward and the VS comes after the SR rather than after the fish. By following the SR, the VS confirms that the animal expects a reward. Results of early brain stimulation work suggest to us that SR stimulates brain dopamine release, which leads to the VS. Although there are no direct studies of dopamine release in cetaceans, we found that the timing of our VS is consistent with a response after dopamine release. We compared trained vocal responses to auditory stimuli with VS responses to SR sounds. Auditory stimuli that did not signal reward resulted in faster responses by a mean of 151 ms for dolphins and 250 ms for belugas. In laboratory animals, there is a 100 to 200 ms delay for dopamine release. VS delay in our animals is similar and consistent with vocalization after dopamine release. Our novel observation suggests that the dopamine reward system is active in cetacean brains. PMID:25122919

  20. A reinforcement learning mechanism responsible for the valuation of free choice

    PubMed Central

    Cockburn, Jeffrey; Collins, Anne G.E.; Frank, Michael J.

    2014-01-01

    Humans exhibit a preference for options they have freely chosen over equally valued options they have not; however, the neural mechanism that drives this bias and its functional significance have yet to be identified. Here, we propose a model in which choice biases arise due to amplified positive reward prediction errors associated with free choice. Using a novel variant of a probabilistic learning task, we show that choice biases are selective to options that are predominantly associated with positive outcomes. A polymorphism in DARPP-32, a gene linked to dopaminergic striatal plasticity and individual differences in reinforcement learning, was found to predict the effect of choice as a function of value. We propose that these choice biases are the behavioral byproduct of a credit assignment mechanism responsible for ensuring the effective delivery of dopaminergic reinforcement learning signals broadcast to the striatum. PMID:25066083
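
    The proposed mechanism (amplified positive reward prediction errors for freely chosen options) can be sketched as a simple asymmetric delta-rule learner; the amplification factor and reward probabilities below are invented for illustration, not fitted values from the study.

        import random

        random.seed(2)
        ALPHA, BOOST = 0.1, 1.5       # BOOST > 1 amplifies positive PEs on free-choice trials (assumed)
        q = {"free": 0.0, "forced": 0.0}
        P_REWARD = 0.6                # both options are rewarded at the same rate

        for trial in range(5000):
            for option, free_choice in (("free", True), ("forced", False)):
                r = 1.0 if random.random() < P_REWARD else 0.0
                pe = r - q[option]
                gain = ALPHA * (BOOST if (free_choice and pe > 0) else 1.0)
                q[option] += gain * pe

        # Despite identical reward rates, the freely chosen option ends up valued higher.
        print(q)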

  1. Neuroelectric signatures of reward learning and decision-making in the human nucleus accumbens.

    PubMed

    Cohen, Michael X; Axmacher, Nikolai; Lenartz, Doris; Elger, Christian E; Sturm, Volker; Schlaepfer, Thomas E

    2009-06-01

    Learning that certain actions lead to risky rewards is critical for biological, social, and economic survival, but the precise neural mechanisms of such reward-guided learning remain unclear. Here, we show that the human nucleus accumbens plays a key role in learning about risks by representing reward value. We recorded electrophysiological activity directly from the nucleus accumbens of five patients undergoing deep brain stimulation for treatment of refractory major depression. Patients engaged in a simple reward-learning task in which they first learned stimulus-outcome associations (learning task), and then were able to choose from among the learned stimuli (choosing task). During the learning task, nucleus accumbens activity reflected potential and received reward values both during the cue stimulus and during the feedback. During the choosing task, there was no nucleus accumbens activity during the cue stimulus, but feedback-related activity was pronounced and similar to that during the learning task. This pattern of results is inconsistent with a prediction error response. Finally, analyses of cross-correlations between the accumbens and simultaneous recordings of medial frontal cortex suggest a dynamic interaction between these structures. The high spatial and temporal resolution of these recordings provides novel insights into the timing of activity in the human nucleus accumbens, its functions during reward-guided learning and decision-making, and its interactions with medial frontal cortex. PMID:19092783

  2. Reinforcement Learning of Targeted Movement in a Spiking Neuronal Model of Motor Cortex

    PubMed Central

    Chadderdon, George L.; Neymotin, Samuel A.; Kerr, Cliff C.; Lytton, William W.

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint “forearm” to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (−1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior. PMID:23094042
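
    A non-spiking caricature of the learning signal described above, combining per-synapse eligibility traces with a global +1/0/-1 reinforcement derived from whether the "hand" moved toward or away from the target; the dimensions, constants, and rate-based dynamics are assumptions made for illustration.

        import numpy as np

        rng = np.random.default_rng(3)
        n_in, n_out = 16, 2                   # proprioceptive inputs -> flexor / extensor drive
        w = rng.uniform(0.0, 0.1, (n_out, n_in))
        elig = np.zeros_like(w)
        TARGET, angle = 0.8, 0.0
        LR, TAU = 0.05, 0.9

        for t in range(3000):
            x = rng.random(n_in)                             # proprioceptive input (placeholder)
            noise = rng.normal(0.0, 0.2, n_out)              # exploratory "motor babbling"
            drive = w @ x + noise
            new_angle = angle + 0.01 * (drive[0] - drive[1])     # flexor minus extensor
            # Global 3-valued signal: +1 if the hand moved toward the target, -1 if away, 0 otherwise,
            # standing in for phasic increases or decreases of dopaminergic firing.
            before, after = abs(TARGET - angle), abs(TARGET - new_angle)
            reward = 1.0 if after < before else (-1.0 if after > before else 0.0)
            # Eligibility trace tags the exploratory component of recent activity for credit or blame.
            elig = TAU * elig + np.outer(noise, x)
            w += LR * reward * elig
            angle = new_angle
        print("final angle:", round(float(angle), 3), "(target", TARGET, ")")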

  3. Evidence for the negative impact of reward on self-regulated learning.

    PubMed

    Wehe, Hillary S; Rhodes, Matthew G; Seger, Carol A

    2015-01-01

    The undermining effect refers to the detrimental impact rewards can have on intrinsic motivation to engage in a behaviour. The current study tested the hypothesis that participants' self-regulated learning behaviours are susceptible to the undermining effect. Participants were assigned to learn a set of Swahili-English word pairs. Half of the participants were offered a reward for performance, and half were not offered a reward. After the initial study phase, participants were permitted to continue studying the words during a free period. The results were consistent with an undermining effect: Participants who were not offered a reward spent more time studying the words during the free period. The results suggest that rewards may negatively impact self-regulated learning behaviours and provide support for the encouragement of intrinsic motivation. PMID:26106977

  4. Vicarious reinforcement learning signals when instructing others.

    PubMed

    Apps, Matthew A J; Lesage, Elise; Ramnani, Narender

    2015-02-18

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action-outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors. PMID:25698730

  5. Vicarious Reinforcement Learning Signals When Instructing Others

    PubMed Central

    Lesage, Elise; Ramnani, Narender

    2015-01-01

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action–outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors. PMID:25698730

  6. Stimulus-Reward Association and Reversal Learning in Individuals with Asperger Syndrome

    ERIC Educational Resources Information Center

    Zalla, Tiziana; Sav, Anca-Maria; Leboyer, Marion

    2009-01-01

    In the present study, performance of a group of adults with Asperger Syndrome (AS) on two series of object reversal and extinction was compared with that of a group of adults with typical development. Participants were requested to learn a stimulus-reward association rule and monitor changes in reward value of stimuli in order to gain as many…

  7. Aging Affects Acquisition and Reversal of Reward-Based Associative Learning

    ERIC Educational Resources Information Center

    Weiler, Julia A.; Bellebaum, Christian; Daum, Irene

    2008-01-01

    Reward-based associative learning is mediated by a distributed network of brain regions that are dependent on the dopaminergic system. Age-related changes in key regions of this system, the striatum and the prefrontal cortex, may adversely affect the ability to use reward information for the guidance of behavior. The present study investigated the…

  8. Episodic Memory Encoding Interferes with Reward Learning and Decreases Striatal Prediction Errors

    PubMed Central

    Braun, Erin Kendall; Daw, Nathaniel D.

    2014-01-01

    Learning is essential for adaptive decision making. The striatum and its dopaminergic inputs are known to support incremental reward-based learning, while the hippocampus is known to support encoding of single events (episodic memory). Although traditionally studied separately, in even simple experiences, these two types of learning are likely to co-occur and may interact. Here we sought to understand the nature of this interaction by examining how incremental reward learning is related to concurrent episodic memory encoding. During the experiment, human participants made choices between two options (colored squares), each associated with a drifting probability of reward, with the goal of earning as much money as possible. Incidental, trial-unique object pictures, unrelated to the choice, were overlaid on each option. The next day, participants were given a surprise memory test for these pictures. We found that better episodic memory was related to a decreased influence of recent reward experience on choice, both within and across participants. fMRI analyses further revealed that during learning the canonical striatal reward prediction error signal was significantly weaker when episodic memory was stronger. This decrease in reward prediction error signals in the striatum was associated with enhanced functional connectivity between the hippocampus and striatum at the time of choice. Our results suggest a mechanism by which memory encoding may compete for striatal processing and provide insight into how interactions between different forms of learning guide reward-based decision making. PMID:25378157
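
    The incremental learning component of the task can be sketched as a delta-rule learner on a two-option bandit with drifting reward probabilities, recording the trial-by-trial prediction errors analogous to the striatal regressor; the parameters are illustrative, not the fitted values.

        import math
        import random

        random.seed(4)
        p = [0.5, 0.5]                       # drifting reward probabilities of the two options
        q = [0.0, 0.0]                       # learned option values
        ALPHA, BETA = 0.2, 4.0               # assumed learning rate and softmax inverse temperature

        def choose(values):
            exps = [math.exp(BETA * v) for v in values]
            r = random.random() * sum(exps)
            acc = 0.0
            for i, e in enumerate(exps):
                acc += e
                if r <= acc:
                    return i
            return len(values) - 1

        abs_pes = []
        for trial in range(1000):
            a = choose(q)
            reward = 1.0 if random.random() < p[a] else 0.0
            pe = reward - q[a]               # trial-by-trial prediction error
            abs_pes.append(abs(pe))
            q[a] += ALPHA * pe
            # Random-walk drift of the underlying reward probabilities, clipped to [0.2, 0.8].
            p = [min(0.8, max(0.2, pi + random.gauss(0.0, 0.02))) for pi in p]
        print("mean |PE| over the session:", round(sum(abs_pes) / len(abs_pes), 3))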

  9. CLEANing the Reward: Counterfactual Actions to Remove Exploratory Action Noise in Multiagent Learning

    NASA Technical Reports Server (NTRS)

    HolmesParker, Chris; Taylor, Mathew E.; Tumer, Kagan; Agogino, Adrian

    2014-01-01

    Learning in multiagent systems can be slow because agents must learn both how to behave in a complex environment and how to account for the actions of other agents. The inability of an agent to distinguish between the true environmental dynamics and those caused by the stochastic exploratory actions of other agents creates noise in each agent's reward signal. This learning noise can have unforeseen and often undesirable effects on the resultant system performance. We define such noise as exploratory action noise, demonstrate the critical impact it can have on the learning process in multiagent settings, and introduce a reward structure to effectively remove such noise from each agent's reward signal. In particular, we introduce Coordinated Learning without Exploratory Action Noise (CLEAN) rewards and empirically demonstrate their benefits
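
    The published CLEAN formulation is not reproduced here, but the following generic counterfactual (difference-reward) computation conveys the same spirit: an agent's shaped reward compares the global reward actually obtained with the global reward under a counterfactual replacement of its own action, removing noise contributed by other agents' exploration. The toy objective and names are invented.

        from typing import Callable, List

        def counterfactual_reward(global_reward: Callable[[List[int]], float],
                                  joint_action: List[int],
                                  agent_index: int,
                                  counterfactual_action: int) -> float:
            """Shaped reward: how much better the system did with the agent's actual action
            than it would have done had this agent taken the counterfactual action instead."""
            actual = global_reward(joint_action)
            cf = list(joint_action)
            cf[agent_index] = counterfactual_action
            return actual - global_reward(cf)

        # Toy global objective: agents should occupy distinct slots.
        def G(actions: List[int]) -> float:
            return float(len(set(actions)))

        joint = [0, 0, 1]
        # Negative value: agent 1's current (possibly exploratory) choice is worse than the counterfactual.
        print(counterfactual_reward(G, joint, agent_index=1, counterfactual_action=2))   # -1.0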

  10. Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats.

    PubMed

    Lloyd, Kevin; Becker, Nadine; Jones, Matthew W; Bogacz, Rafal

    2012-01-01

    Learning to form appropriate, task-relevant working memory representations is a complex process central to cognition. Gating models frame working memory as a collection of past observations and use reinforcement learning (RL) to solve the problem of when to update these observations. Investigation of how gating models relate to brain and behavior remains, however, at an early stage. The current study sought to explore the ability of simple RL gating models to replicate rule learning behavior in rats. Rats were trained in a maze-based spatial learning task that required animals to make trial-by-trial choices contingent upon their previous experience. Using an abstract version of this task, we tested the ability of two gating algorithms, one based on the Actor-Critic and the other on the State-Action-Reward-State-Action (SARSA) algorithm, to generate behavior consistent with the rats'. Both models produced rule-acquisition behavior consistent with the experimental data, though only the SARSA gating model mirrored faster learning following rule reversal. We also found that both gating models learned multiple strategies in solving the initial task, a property which highlights the multi-agent nature of such models and which is of importance in considering the neural basis of individual differences in behavior. PMID:23115551
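
    An abstract toy version of a gating model (not the maze task used with the rats): working memory is a single slot, one action decides whether to gate the visible cue into it, and an on-policy SARSA-style backup propagates the delayed reward to that gating decision. All details are assumptions for illustration.

        import random

        random.seed(5)
        ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
        Q = {}

        def q(s, a):
            return Q.get((s, a), 0.0)

        def policy(s, actions):
            if random.random() < EPS:
                return random.choice(actions)
            return max(actions, key=lambda a: q(s, a))

        for trial in range(20000):
            cue = random.choice([0, 1])
            memory = None
            # Cue phase: the gating action decides whether the observation enters working memory.
            s1 = ("cue", cue, memory)
            a1 = policy(s1, ["store", "ignore"])
            if a1 == "store":
                memory = cue
            # Response phase: the cue is gone, so the answer must be read out from memory.
            s2 = ("respond", None, memory)
            a2 = policy(s2, [0, 1])
            reward = 1.0 if a2 == cue else 0.0
            # On-policy (SARSA-style) backups over the two-step trial.
            Q[(s2, a2)] = q(s2, a2) + ALPHA * (reward - q(s2, a2))
            Q[(s1, a1)] = q(s1, a1) + ALPHA * (GAMMA * q(s2, a2) - q(s1, a1))

        # After training, "store" should be valued above "ignore" in the cue phase.
        print({a: round(q(("cue", 0, None), a), 2) for a in ["store", "ignore"]})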

  11. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia

    PubMed Central

    Markou, Athina; Salamone, John D.; Bussey, Timothy; Mar, Adam; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-01-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. PMID:23994273

  12. Connectionist reinforcement learning of robot control skills

    NASA Astrophysics Data System (ADS)

    Araújo, Rui; Nunes, Urbano; de Almeida, A. T.

    1998-07-01

    Many robot manipulator tasks are difficult to model explicitly, and it is difficult to design and program automatic control algorithms for them. The development, improvement, and application of learning techniques taking advantage of sensory information would enable the acquisition of new robot skills and avoid some of the difficulties of explicit programming. In this paper we use a reinforcement learning approach for on-line generation of skills for control of robot manipulator systems. Instead of generating skills by explicitly programming a perception-to-action mapping, they are generated by trial-and-error learning, guided by a performance evaluation feedback function. The resulting system may be seen as an anticipatory system that constructs an internal representation model of itself and of its environment. This enables it to identify its current situation and to generate corresponding appropriate commands to the system in order to perform the required skill. The method was applied to the problem of learning a force-control skill in which the tool tip of a robot manipulator must be moved from a free-space situation to a contact state with a compliant surface while maintaining a constant interaction force.

  13. Value learning and arousal in the extinction of probabilistic rewards: the role of dopamine in a modified temporal difference model.

    PubMed

    Song, Minryung R; Fellous, Jean-Marc

    2014-01-01

    Because most rewarding events are probabilistic and changing, the extinction of probabilistic rewards is important for survival. It has been proposed that the extinction of probabilistic rewards depends on arousal and the amount of learning of reward values. Midbrain dopamine neurons were suggested to play a role in both arousal and learning reward values. Despite extensive research on modeling dopaminergic activity in reward learning (e.g. temporal difference models), few studies have been done on modeling its role in arousal. Although temporal difference models capture key characteristics of dopaminergic activity during the extinction of deterministic rewards, they have been less successful at simulating the extinction of probabilistic rewards. By adding an arousal signal to a temporal difference model, we were able to simulate the extinction of probabilistic rewards and its dependence on the amount of learning. Our simulations propose that arousal allows the probability of reward to have lasting effects on the updating of reward value, which slows the extinction of low probability rewards. Using this model, we predicted that, by signaling the prediction error, dopamine determines the learned reward value that has to be extinguished during extinction and participates in regulating the size of the arousal signal that controls the learning rate. These predictions were supported by pharmacological experiments in rats. PMID:24586823
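
    A toy rendering of the modification described (an arousal term that scales the effective learning rate of a TD-like value update); the specific arousal dynamics below, which track recent reward receipt, are an illustrative assumption rather than the published model.

        import random

        random.seed(6)

        def acquisition_and_extinction(p_reward, use_arousal, n_acq=300, n_ext=300):
            """Value learning for a probabilistically rewarded cue, followed by extinction.
            When use_arousal is True, an arousal term tracking recent reward receipt
            scales the effective learning rate (an assumed form of the mechanism)."""
            v, arousal, base_alpha = 0.0, 0.0, 0.1
            for t in range(n_acq + n_ext):
                p = p_reward if t < n_acq else 0.0        # reward is withheld during extinction
                r = 1.0 if random.random() < p else 0.0
                arousal = 0.95 * arousal + 0.05 * r       # slowly tracks how often reward arrives
                alpha = base_alpha * arousal if use_arousal else base_alpha
                v += alpha * (r - v)                      # prediction-error update
            return v

        for p in (1.0, 0.5):
            print(f"p={p}: residual value after extinction "
                  f"(plain TD {acquisition_and_extinction(p, False):.3f}, "
                  f"arousal-gated {acquisition_and_extinction(p, True):.3f})")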

  14. Stochastic Reinforcement Benefits Skill Acquisition

    ERIC Educational Resources Information Center

    Dayan, Eran; Averbeck, Bruno B.; Richmond, Barry J.; Cohen, Leonardo G.

    2014-01-01

    Learning complex skills is driven by reinforcement, which facilitates both online within-session gains and retention of the acquired skills. Yet, in ecologically relevant situations, skills are often acquired when mapping between actions and rewarding outcomes is unknown to the learning agent, resulting in reinforcement schedules of a stochastic…

  15. Punishment Insensitivity and Impaired Reinforcement Learning in Preschoolers

    ERIC Educational Resources Information Center

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2014-01-01

    Background: Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a…

  16. Reinforcement of Science Learning through Local Culture: A Delphi Study

    ERIC Educational Resources Information Center

    Nuangchalerm, Prasart

    2008-01-01

    This study aims to explore the ways to reinforce science learning through local culture by using Delphi technique. Twenty four participants in various fields of study were selected. The result of study provides a framework for reinforcement of science learning through local culture on the theme life and environment. (Contains 1 table.)

  17. A Discussion of Possibility of Reinforcement Learning Using Event-Related Potential in BCI

    NASA Astrophysics Data System (ADS)

    Yamagishi, Yuya; Tsubone, Tadashi; Wada, Yasuhiro

    Recently, brain-computer interfaces (BCIs), which provide a direct pathway between the human brain and an external device such as a computer or a robot, have attracted a lot of attention. Because a BCI can control machines such as robots from brain activity alone, without voluntary muscle movement, it may become a useful communication tool for handicapped persons, for instance patients with amyotrophic lateral sclerosis. However, in order to realize a BCI system that can perform precise tasks in various environments, it is necessary to design control rules that adapt to dynamic environments. Reinforcement learning is one approach to designing such control rules. If this reinforcement learning could be driven by brain activity itself, it would lead to a BCI with general versatility. In this research, we focused on the P300 event-related potential as an alternative reward signal for reinforcement learning. We discriminated between success and failure trials from the single-trial P300 of the EEG using a proposed discrimination algorithm based on a support vector machine. The possibility of reinforcement learning was examined in terms of the number of correctly discriminated trials. The results showed that learning would be possible for most subjects.
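
    The general idea (decoding single-trial success or failure from EEG with a support vector machine and feeding the decision back as a reward signal) can be sketched as follows using scikit-learn and synthetic features; this is not the authors' algorithm or data, and the P300-like offset is a synthetic stand-in.

        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(7)

        # Synthetic "single-trial EEG features": success trials carry a P300-like offset.
        n_train, n_feat = 200, 8
        labels = rng.integers(0, 2, n_train)                 # 1 = success (P300 present), 0 = failure
        X = rng.normal(0, 1, (n_train, n_feat)) + labels[:, None] * 0.8
        clf = SVC(kernel="linear").fit(X, labels)

        # Use the classifier output as a surrogate reward for a two-armed bandit update.
        q = np.zeros(2)
        ALPHA = 0.1
        for trial in range(300):
            action = int(rng.random() < 0.5) if rng.random() < 0.1 else int(np.argmax(q))
            true_success = 1 if (action == 0 and rng.random() < 0.8) or \
                                (action == 1 and rng.random() < 0.3) else 0
            epoch = rng.normal(0, 1, (1, n_feat)) + true_success * 0.8   # simulated post-response EEG
            reward = float(clf.predict(epoch)[0])                        # decoded P300 stands in for reward
            q[action] += ALPHA * (reward - q[action])
        print("learned action values:", np.round(q, 2))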

  18. How Food as a Reward Is Detrimental to Children's Health, Learning, and Behavior

    ERIC Educational Resources Information Center

    Fedewa, Alicia L.; Davis, Matthew Cody

    2015-01-01

    Background: Despite small- and wide-scale prevention efforts to curb obesity, the percentage of children classified as overweight and obese has remained relatively consistent in the last decade. As school personnel are increasingly pressured to enhance student performance, many educators use food as a reward to motivate and reinforce positive…

  19. Differential Modulation of Reinforcement Learning by D2 Dopamine and NMDA Glutamate Receptor Antagonism

    PubMed Central

    Klein, Tilmann A.; Ullsperger, Markus

    2014-01-01

    The firing pattern of midbrain dopamine (DA) neurons is well known to reflect reward prediction errors (PEs), the difference between obtained and expected rewards. The PE is thought to be a crucial signal for instrumental learning, and interference with DA transmission impairs learning. Phasic increases of DA neuron firing during positive PEs are driven by activation of NMDA receptors, whereas phasic suppression of firing during negative PEs is likely mediated by inputs from the lateral habenula. We aimed to determine the contribution of DA D2-class and NMDA receptors to appetitively and aversively motivated reinforcement learning. Healthy human volunteers were scanned with functional magnetic resonance imaging while they performed an instrumental learning task under the influence of either the DA D2 receptor antagonist amisulpride (400 mg), the NMDA receptor antagonist memantine (20 mg), or placebo. Participants quickly learned to select (“approach”) rewarding and to reject (“avoid”) punishing options. Amisulpride impaired both approach and avoidance learning, while memantine mildly attenuated approach learning but had no effect on avoidance learning. These behavioral effects of the antagonists were paralleled by their modulation of striatal PEs. Amisulpride reduced both appetitive and aversive PEs, while memantine diminished appetitive, but not aversive PEs. These data suggest that striatal D2-class receptors contribute to both approach and avoidance learning by detecting both the phasic DA increases and decreases during appetitive and aversive PEs. NMDA receptors on the contrary appear to be required only for approach learning because phasic DA increases during positive PEs are NMDA dependent, whereas phasic decreases during negative PEs are not. PMID:25253860

  20. Differential modulation of reinforcement learning by D2 dopamine and NMDA glutamate receptor antagonism.

    PubMed

    Jocham, Gerhard; Klein, Tilmann A; Ullsperger, Markus

    2014-09-24

    The firing pattern of midbrain dopamine (DA) neurons is well known to reflect reward prediction errors (PEs), the difference between obtained and expected rewards. The PE is thought to be a crucial signal for instrumental learning, and interference with DA transmission impairs learning. Phasic increases of DA neuron firing during positive PEs are driven by activation of NMDA receptors, whereas phasic suppression of firing during negative PEs is likely mediated by inputs from the lateral habenula. We aimed to determine the contribution of DA D2-class and NMDA receptors to appetitively and aversively motivated reinforcement learning. Healthy human volunteers were scanned with functional magnetic resonance imaging while they performed an instrumental learning task under the influence of either the DA D2 receptor antagonist amisulpride (400 mg), the NMDA receptor antagonist memantine (20 mg), or placebo. Participants quickly learned to select ("approach") rewarding and to reject ("avoid") punishing options. Amisulpride impaired both approach and avoidance learning, while memantine mildly attenuated approach learning but had no effect on avoidance learning. These behavioral effects of the antagonists were paralleled by their modulation of striatal PEs. Amisulpride reduced both appetitive and aversive PEs, while memantine diminished appetitive, but not aversive PEs. These data suggest that striatal D2-class receptors contribute to both approach and avoidance learning by detecting both the phasic DA increases and decreases during appetitive and aversive PEs. NMDA receptors on the contrary appear to be required only for approach learning because phasic DA increases during positive PEs are NMDA dependent, whereas phasic decreases during negative PEs are not. PMID:25253860

  1. Reinforcement learning for port-hamiltonian systems.

    PubMed

    Sprangers, Olivier; Babuška, Robert; Nageshrao, Subramanya P; Lopes, Gabriel A D

    2015-05-01

    Passivity-based control (PBC) for port-Hamiltonian systems provides an intuitive way of achieving stabilization by rendering a system passive with respect to a desired storage function. However, in most instances the control law is obtained without any performance considerations and it has to be calculated by solving a complex partial differential equation (PDE). In order to address these issues we introduce a reinforcement learning (RL) approach into the energy-balancing passivity-based control (EB-PBC) method, which is a form of PBC in which the closed-loop energy is equal to the difference between the stored and supplied energies. We propose a technique to parameterize EB-PBC that preserves the system's PDE matching conditions, does not require the specification of a global desired Hamiltonian, includes performance criteria, and is robust. The parameters of the control law are found by using actor-critic (AC) RL, enabling the search for near-optimal control policies satisfying a desired closed-loop energy landscape. The advantage is that the solutions learned can be interpreted in terms of energy shaping and damping injection, which makes it possible to numerically assess stability using passivity theory. From the RL perspective, our proposal allows for the class of port-Hamiltonian systems to be incorporated in the AC framework, speeding up the learning thanks to the resulting parameterization of the policy. The method has been successfully applied to the pendulum swing-up problem in simulations and real-life experiments. PMID:25167564

  2. Repeated electrical stimulation of reward-related brain regions affects cocaine but not "natural" reinforcement.

    PubMed

    Levy, Dino; Shabat-Simon, Maytal; Shalev, Uri; Barnea-Ygael, Noam; Cooper, Ayelet; Zangen, Abraham

    2007-12-19

    Drug addiction is associated with long-lasting neuronal adaptations including alterations in dopamine and glutamate receptors in the brain reward system. Treatment strategies for cocaine addiction and especially the prevention of craving and relapse are limited, and their effectiveness is still questionable. We hypothesized that repeated stimulation of the brain reward system can induce localized neuronal adaptations that may either potentiate or reduce addictive behaviors. The present study was designed to test how repeated interference with the brain reward system using localized electrical stimulation of the medial forebrain bundle at the lateral hypothalamus (LH) or the prefrontal cortex (PFC) affects cocaine addiction-associated behaviors and some of the neuronal adaptations induced by repeated exposure to cocaine. Repeated high-frequency stimulation in either site influenced cocaine, but not sucrose reward-related behaviors. Stimulation of the LH reduced cue-induced seeking behavior, whereas stimulation of the PFC reduced both cocaine-seeking behavior and the motivation for its consumption. The behavioral findings were accompanied by glutamate receptor subtype alterations in the nucleus accumbens and the ventral tegmental area, both key structures of the reward system. It is therefore suggested that repeated electrical stimulation of the PFC can become a novel strategy for treating addiction. PMID:18094257

  3. Using Fuzzy Logic for Performance Evaluation in Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.

    1992-01-01

    Current reinforcement learning algorithms require long training periods which generally limit their applicability to small size problems. A new architecture is described which uses fuzzy rules to initialize its two neural networks: a neural network for performance evaluation and another for action selection. This architecture is applied to control of dynamic systems and it is demonstrated that it is possible to start with an approximate prior knowledge and learn to refine it through experiments using reinforcement learning.

  4. Oculomotor learning revisited: a model of reinforcement learning in the basal ganglia incorporating an efference copy of motor actions

    PubMed Central

    Fee, Michale S.

    2012-01-01

    In its simplest formulation, reinforcement learning is based on the idea that if an action taken in a particular context is followed by a favorable outcome, then, in the same context, the tendency to produce that action should be strengthened, or reinforced. While reinforcement learning forms the basis of many current theories of basal ganglia (BG) function, these models do not incorporate distinct computational roles for signals that convey context, and those that convey what action an animal takes. Recent experiments in the songbird suggest that vocal-related BG circuitry receives two functionally distinct excitatory inputs. One input is from a cortical region that carries context information about the current “time” in the motor sequence. The other is an efference copy of motor commands from a separate cortical brain region that generates vocal variability during learning. Based on these findings, I propose here a general model of vertebrate BG function that combines context information with a distinct motor efference copy signal. The signals are integrated by a learning rule in which efference copy inputs gate the potentiation of context inputs (but not efference copy inputs) onto medium spiny neurons in response to a rewarded action. The hypothesis is described in terms of a circuit that implements the learning of visually guided saccades. The model makes testable predictions about the anatomical and functional properties of hypothesized context and efference copy inputs to the striatum from both thalamic and cortical sources. PMID:22754501

  5. Gambling against neglect: unconscious spatial biases induced by reward reinforcement in healthy people and brain-damaged patients.

    PubMed

    Lucas, Nadia; Schwartz, Sophie; Leroy, Rosario; Pavin, Sandra; Diserens, Karin; Vuilleumier, Patrik

    2013-01-01

    Orienting attention in space recruits fronto-parietal networks whose damage results in unilateral spatial neglect. However, attention orienting may also be governed by emotional and motivational factors; but it remains unknown whether these factors act through a modulation of the fronto-parietal attentional systems or distinct neural pathways. Here we asked whether attentional orienting is affected by learning about the reward value of targets in a visual search task, in a spatially specific manner, and whether these effects are preserved in right-brain damaged patients with left spatial neglect. We found that associating rewards with left-sided (but not right-sided) targets during search led to progressive exploration biases towards left space, in both healthy people and neglect patients. Such spatially specific biases occurred even without any conscious awareness of the asymmetric reward contingencies. These results show that reward-induced modulations of space representation are preserved despite a dysfunction of fronto-parietal networks associated with neglect, and therefore suggest that they may arise through spared subcortical networks directly acting on sensory processing and/or oculomotor circuits. These effects could be usefully exploited for potentiating rehabilitation strategies in neglect patients. PMID:23969194

  6. Switching Reinforcement Learning for Continuous Action Space

    NASA Astrophysics Data System (ADS)

    Nagayoshi, Masato; Murao, Hajime; Tamaki, Hisashi

    Reinforcement Learning (RL) attracts much attention as a technique of realizing computational intelligence such as adaptive and autonomous decentralized systems. In general, however, it is not easy to put RL into practical use. This difficulty includes a problem of designing a suitable action space of an agent, i.e., satisfying two requirements in trade-off: (i) to keep the characteristics (or structure) of an original search space as much as possible in order to seek strategies that lie close to the optimal, and (ii) to reduce the search space as much as possible in order to expedite the learning process. In order to design a suitable action space adaptively, we propose a switching RL model that mimics the process of an infant's motor development, in which gross motor skills develop before fine motor skills. A method for switching controllers is then constructed by introducing and referring to an “entropy” measure. Further, through computational experiments using robot navigation problems with one- and two-dimensional continuous action spaces, the validity of the proposed method has been confirmed.
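
    A minimal sketch of the gross-to-fine switching idea, assuming a bandit-style learner that moves from a coarse to a fine discretization of a continuous action once the entropy of its action-selection distribution falls below a threshold; the discretizations, threshold, and task are illustrative, not the authors' construction.

```python
import numpy as np

# Sketch of gross-to-fine controller switching (illustrative, not the
# authors' construction): a bandit-style learner starts with a coarse
# discretization of a continuous action and switches to a fine one once
# the entropy of its action-selection distribution drops below a threshold.
rng = np.random.default_rng(1)
coarse = np.linspace(-1.0, 1.0, 3)         # "gross motor" action set
fine = np.linspace(-1.0, 1.0, 15)          # "fine motor" action set

def softmax(q, tau=0.2):
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def reward(a):
    # Hypothetical task: the best continuous action is 0.37.
    return -abs(a - 0.37) + 0.05 * rng.standard_normal()

actions, switched = coarse, False
Q = np.zeros(len(actions))
alpha, switch_threshold = 0.1, 0.4

for episode in range(2000):
    p = softmax(Q)
    i = rng.choice(len(actions), p=p)
    Q[i] += alpha * (reward(actions[i]) - Q[i])
    if not switched and entropy(p) < switch_threshold:
        # The coarse policy has become decisive: refine the action space and
        # carry the coarse value estimates over by interpolation.
        Q, actions, switched = np.interp(fine, coarse, Q), fine, True
```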

  7. Coevolutionary networks of reinforcement-learning agents

    NASA Astrophysics Data System (ADS)

    Kianercy, Ardeshir; Galstyan, Aram

    2013-07-01

    This paper presents a model of network formation in repeated games where the players adapt their strategies and network ties simultaneously using a simple reinforcement-learning scheme. It is demonstrated that the coevolutionary dynamics of such systems can be described via coupled replicator equations. We provide a comprehensive analysis for three-player two-action games, which is the minimum system size with nontrivial structural dynamics. In particular, we characterize the Nash equilibria (NE) in such games and examine the local stability of the rest points corresponding to those equilibria. We also study general n-player networks via both simulations and analytical methods and find that, in the absence of exploration, the stable equilibria consist of star motifs as the main building blocks of the network. Furthermore, in all stable equilibria the agents play pure strategies, even when the game allows mixed NE. Finally, we study the impact of exploration on learning outcomes and observe that there is a critical exploration rate above which the symmetric and uniformly connected network topology becomes stable.
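
    For reference, learning dynamics of this kind are commonly written as exploration-augmented replicator equations; the generic two-player form is shown below, while the paper's network-coupled multi-agent version differs in detail:

```latex
\dot{x}_i = x_i\Big[(A\mathbf{y})_i - \mathbf{x}^{\top}A\mathbf{y}\Big] + T\, x_i \sum_j x_j \ln\frac{x_j}{x_i}
```

    Here x and y are the two players' mixed strategies, A is the first player's payoff matrix, and T is the exploration rate; setting T = 0 recovers the standard replicator equation, which is the regime in which the star-motif equilibria described above are found.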

  8. Neural Correlates of Reward-Based Spatial Learning in Persons with Cocaine Dependence

    PubMed Central

    Tau, Gregory Z; Marsh, Rachel; Wang, Zhishun; Torres-Sanchez, Tania; Graniello, Barbara; Hao, Xuejun; Xu, Dongrong; Packard, Mark G; Duan, Yunsuo; Kangarlu, Alayar; Martinez, Diana; Peterson, Bradley S

    2014-01-01

    Dysfunctional learning systems are thought to be central to the pathogenesis of and impair recovery from addictions. The functioning of the brain circuits for episodic memory or learning that support goal-directed behavior has not been studied previously in persons with cocaine dependence (CD). Thirteen abstinent CD and 13 healthy participants underwent MRI scanning while performing a task that requires the use of spatial cues to navigate a virtual-reality environment and find monetary rewards, allowing the functional assessment of the brain systems for spatial learning, a form of episodic memory. Whereas both groups performed similarly on the reward-based spatial learning task, we identified disturbances in brain regions involved in learning and reward in CD participants. In particular, CD was associated with impaired functioning of medial temporal lobe (MTL), a brain region that is crucial for spatial learning (and episodic memory) with concomitant recruitment of striatum (which normally participates in stimulus-response, or habit, learning), and prefrontal cortex. CD was also associated with enhanced sensitivity of the ventral striatum to unexpected rewards but not to expected rewards earned during spatial learning. We provide evidence that spatial learning in CD is characterized by disturbances in functioning of an MTL-based system for episodic memory and a striatum-based system for stimulus-response learning and reward. We have found additional abnormalities in distributed cortical regions. Consistent with findings from animal studies, we provide the first evidence in humans describing the disruptive effects of cocaine on the coordinated functioning of multiple neural systems for learning and memory. PMID:23917430

  9. Single Dose of a Dopamine Agonist Impairs Reinforcement Learning in Humans: Evidence from Event-related Potentials and Computational Modeling of Striatal-Cortical Function

    PubMed Central

    Santesso, Diane L.; Evins, A. Eden; Frank, Michael J.; Cowman Schetter, Erika M.; Bogdan, Ryan; Pizzagalli, Diego A.

    2011-01-01

    Animal findings have highlighted the modulatory role of phasic dopamine (DA) signaling in incentive learning, particularly in the acquisition of reward-related behavior. In humans, these processes remain largely unknown. In a recent study we demonstrated that a single low dose of a D2/D3 agonist (pramipexole) – assumed to activate DA autoreceptors and thus reduce phasic DA bursts – impaired reward learning in healthy subjects performing a probabilistic reward task. The purpose of the present study was to extend these behavioral findings using event-related potentials and computational modeling. Compared to the placebo group, participants receiving pramipexole showed increased feedback-related negativity to probabilistic rewards and decreased activation in dorsal anterior cingulate regions previously implicated in integrating reinforcement history over time. Additionally, findings of blunted reward learning in participants receiving pramipexole were simulated by reduced presynaptic DA signaling in response to reward in a neural network model of striatal-cortical function. These preliminary findings offer important insights on the role of phasic DA signals on reinforcement learning in humans, and provide initial evidence regarding the spatio-temporal dynamics of brain mechanisms underlying these processes. PMID:18726908

  10. Rewarded by Punishment: Reflections on the Disuse of Positive Reinforcement in Education.

    ERIC Educational Resources Information Center

    Maag, John W.

    2001-01-01

    This article delineates the reasons why educators find punishment a more acceptable approach for managing students' challenging behaviors than positive reinforcement. The article argues that educators should plan the occurrence of positive reinforcement to increase appropriate behaviors rather than running the risk of it haphazardly promoting…

  11. A hypothesis for basal ganglia-dependent reinforcement learning in the songbird

    PubMed Central

    Fee, Michale S.; Goldberg, Jesse H.

    2011-01-01

    Most of our motor skills are not innately programmed, but are learned by a combination of motor exploration and performance evaluation, suggesting that they proceed through a reinforcement learning (RL) mechanism. Songbirds have emerged as a model system to study how a complex behavioral sequence can be learned through an RL-like strategy. Interestingly, like motor sequence learning in mammals, song learning in birds requires a basal ganglia (BG)-thalamocortical loop, suggesting common neural mechanisms. Here we outline a specific working hypothesis for how BG-forebrain circuits could utilize an internally computed reinforcement signal to direct song learning. Our model includes a number of general concepts borrowed from the mammalian BG literature, including a dopaminergic reward prediction error and dopamine mediated plasticity at corticostriatal synapses. We also invoke a number of conceptual advances arising from recent observations in the songbird. Specifically, there is evidence for a specialized cortical circuit that adds trial-to-trial variability to stereotyped cortical motor programs, and a role for the BG in ‘biasing’ this variability to improve behavioral performance. This BG-dependent ‘premotor bias’ may in turn guide plasticity in downstream cortical synapses to consolidate recently-learned song changes. Given the similarity between mammalian and songbird BG-thalamocortical circuits, our model for the role of the BG in this process may have broader relevance to mammalian BG function. PMID:22015923

  12. The Use of Rewards in Instructional Digital Games: An Application of Positive Reinforcement

    ERIC Educational Resources Information Center

    Malala, John; Major, Anthony; Maunez-Cuadra, Jose; McCauley-Bell, Pamela

    2007-01-01

    The main argument being presented in this paper is that instructional designers and educational researchers need to shift their attention from performance to interest. Educational digital games have to aim at building lasting interest in real world applications. The main hypothesis advocated in this paper is that the use of rewards in educational…

  13. Neuropeptide F neurons modulate sugar reward during associative olfactory learning of Drosophila larvae.

    PubMed

    Rohwedder, Astrid; Selcho, Mareike; Chassot, Bérénice; Thum, Andreas S

    2015-12-15

    All organisms continuously have to adapt their behavior according to changes in the environment in order to survive. Experience-driven changes in behavior are usually mediated and maintained by modifications in signaling within defined brain circuits. Given the simplicity of the larval brain of Drosophila and its experimental accessibility on the genetic and behavioral level, we analyzed if Drosophila neuropeptide F (dNPF) neurons are involved in classical olfactory conditioning. dNPF is an ortholog of the mammalian neuropeptide Y, a highly conserved neuromodulator that stimulates food-seeking behavior. We provide a comprehensive anatomical analysis of the dNPF neurons on the single-cell level. We demonstrate that artificial activation of dNPF neurons inhibits appetitive olfactory learning by modulating the sugar reward signal during acquisition. No effect is detectable for the retrieval of an established appetitive olfactory memory. The modulatory effect is based on the joint action of three distinct cell types that, if tested on the single-cell level, inhibit and invert the conditioned behavior. Taken together, our work describes anatomically and functionally a new part of the sugar reinforcement signaling pathway for classical olfactory conditioning in Drosophila larvae. PMID:26234537

  14. Modulation of spatial attention by goals, statistical learning, and monetary reward.

    PubMed

    Jiang, Yuhong V; Sha, Li Z; Remington, Roger W

    2015-10-01

    This study documented the relative strength of task goals, visual statistical learning, and monetary reward in guiding spatial attention. Using a difficult T-among-L search task, we cued spatial attention to one visual quadrant by (i) instructing people to prioritize it (goal-driven attention), (ii) placing the target frequently there (location probability learning), or (iii) associating that quadrant with greater monetary gain (reward-based attention). Results showed that successful goal-driven attention exerted the strongest influence on search RT. Incidental location probability learning yielded a smaller though still robust effect. Incidental reward learning produced negligible guidance for spatial attention. The 95 % confidence intervals of the three effects were largely nonoverlapping. To understand these results, we simulated the role of location repetition priming in probability cuing and reward learning. Repetition priming underestimated the strength of location probability cuing, suggesting that probability cuing involved long-term statistical learning of how to shift attention. Repetition priming provided a reasonable account for the negligible effect of reward on spatial attention. We propose a multiple-systems view of spatial attention that includes task goals, search habit, and priming as primary drivers of top-down attention. PMID:26105657

  15. Short-term plasticity as cause-effect hypothesis testing in distal reward learning.

    PubMed

    Soltoggio, Andrea

    2015-02-01

    Asynchrony, overlaps, and delays in sensory-motor signals introduce ambiguity as to which stimuli, actions, and rewards are causally related. Only the repetition of reward episodes helps distinguish true cause-effect relationships from coincidental occurrences. In the model proposed here, a novel plasticity rule employs short- and long-term changes to evaluate hypotheses on cause-effect relationships. Transient weights represent hypotheses that are consolidated in long-term memory only when they consistently predict or cause future rewards. The main objective of the model is to preserve existing network topologies when learning with ambiguous information flows. Learning is also improved by biasing the exploration of the stimulus-response space toward actions that in the past occurred before rewards. The model indicates under which conditions beliefs can be consolidated in long-term memory, it suggests a solution to the plasticity-stability dilemma, and proposes an interpretation of the role of short-term plasticity. PMID:25189158
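
    A schematic sketch of the transient-weight idea, assuming a separate short-term component that records cause-effect hypotheses and is folded into the long-term weights only after it has repeatedly preceded reward; the stimulus statistics, thresholds, and decay constants are illustrative, not the paper's values.

```python
import numpy as np

# Schematic sketch of the transient-weight idea (constants illustrative):
# short-term weights record hypotheses about which recently active inputs
# caused a delayed reward; only hypotheses that keep preceding reward are
# consolidated into long-term weights, leaving the rest of the network intact.
rng = np.random.default_rng(2)
n_inputs = 5
w_long = np.zeros(n_inputs)     # consolidated long-term memory
w_short = np.zeros(n_inputs)    # transient weights = current hypotheses
trace = np.zeros(n_inputs)      # eligibility of recently active inputs
decay, consolidation_rate, threshold = 0.9, 0.05, 0.5

for t in range(10000):
    x = (rng.random(n_inputs) < 0.2).astype(float)     # random stimuli
    trace = decay * trace + x
    # Hypothetical ground truth: only input 0 actually causes the reward.
    rewarded = x[0] > 0 and rng.random() < 0.8
    if rewarded:
        w_short += 0.1 * trace               # strengthen co-occurring hypotheses
        strong = w_short > threshold
        w_long[strong] += consolidation_rate * w_short[strong]   # consolidate
    else:
        w_short *= 0.99                      # unconfirmed hypotheses fade away
```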

  16. Knockout crickets for the study of learning and memory: Dopamine receptor Dop1 mediates aversive but not appetitive reinforcement in crickets.

    PubMed

    Awata, Hiroko; Watanabe, Takahito; Hamanaka, Yoshitaka; Mito, Taro; Noji, Sumihare; Mizunami, Makoto

    2015-01-01

    Elucidation of reinforcement mechanisms in associative learning is an important subject in neuroscience. In mammals, dopamine neurons are thought to play critical roles in mediating both appetitive and aversive reinforcement. Our pharmacological studies suggested that octopamine and dopamine neurons mediate reward and punishment, respectively, in crickets, but recent studies in fruit-flies concluded that dopamine neurons mediate both reward and punishment, via the type 1 dopamine receptor Dop1. To resolve the discrepancy between studies in different insect species, we produced Dop1 knockout crickets using the CRISPR/Cas9 system and found that they are defective in aversive learning with sodium chloride punishment but not appetitive learning with water or sucrose reward. The results suggest that dopamine and octopamine neurons mediate aversive and appetitive reinforcement, respectively, in crickets. We suggest unexpected diversity in neurotransmitters mediating appetitive reinforcement between crickets and fruit-flies, although the neurotransmitter mediating aversive reinforcement is conserved. This study demonstrates the usefulness of the CRISPR/Cas9 system for producing knockout animals for the study of learning and memory. PMID:26521965

  17. Knockout crickets for the study of learning and memory: Dopamine receptor Dop1 mediates aversive but not appetitive reinforcement in crickets

    PubMed Central

    Awata, Hiroko; Watanabe, Takahito; Hamanaka, Yoshitaka; Mito, Taro; Noji, Sumihare; Mizunami, Makoto

    2015-01-01

    Elucidation of reinforcement mechanisms in associative learning is an important subject in neuroscience. In mammals, dopamine neurons are thought to play critical roles in mediating both appetitive and aversive reinforcement. Our pharmacological studies suggested that octopamine and dopamine neurons mediate reward and punishment, respectively, in crickets, but recent studies in fruit-flies concluded that dopamine neurons mediate both reward and punishment, via the type 1 dopamine receptor Dop1. To resolve the discrepancy between studies in different insect species, we produced Dop1 knockout crickets using the CRISPR/Cas9 system and found that they are defective in aversive learning with sodium chloride punishment but not appetitive learning with water or sucrose reward. The results suggest that dopamine and octopamine neurons mediate aversive and appetitive reinforcement, respectively, in crickets. We suggest unexpected diversity in neurotransmitters mediating appetitive reinforcement between crickets and fruit-flies, although the neurotransmitter mediating aversive reinforcement is conserved. This study demonstrates the usefulness of the CRISPR/Cas9 system for producing knockout animals for the study of learning and memory. PMID:26521965

  18. Curiosity and reward: Valence predicts choice and information prediction errors enhance learning.

    PubMed

    Marvin, Caroline B; Shohamy, Daphna

    2016-03-01

    Curiosity drives many of our daily pursuits and interactions; yet, we know surprisingly little about how it works. Here, we harness an idea implied in many conceptualizations of curiosity: that information has value in and of itself. Reframing curiosity as the motivation to obtain reward (where the reward is information) allows one to leverage major advances in theoretical and computational mechanisms of reward-motivated learning. We provide new evidence supporting 2 predictions that emerge from this framework. First, we find an asymmetric effect of positive versus negative information, with positive information enhancing both curiosity and long-term memory for information. Second, we find that it is not the absolute value of information that drives learning but, rather, the gap between the reward expected and reward received, an "information prediction error." These results support the idea that information functions as a reward, much like money or food, guiding choices and driving learning in systematic ways. PMID:26783880
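
    The "information prediction error" has the same arithmetic as a reward prediction error. A minimal sketch, assuming a running average as the expectation and purely illustrative per-trial information values:

```python
# Minimal sketch: treat the gap between expected and obtained information
# value as an "information prediction error", mirroring a reward PE.
# expected_info is a hypothetical running estimate; the per-trial values
# are purely illustrative.
alpha = 0.1
expected_info = 0.0
for info_value in [0.2, 0.9, 0.5, 1.0]:
    info_pe = info_value - expected_info      # information prediction error
    expected_info += alpha * info_pe          # update the expectation
    # In the framework above, larger positive info_pe predicts better
    # subsequent memory for the answer.
```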

  19. A Computer-Assisted Learning Model Based on the Digital Game Exponential Reward System

    ERIC Educational Resources Information Center

    Moon, Man-Ki; Jahng, Surng-Gahb; Kim, Tae-Yong

    2011-01-01

    The aim of this research was to construct a motivational model which would stimulate voluntary and proactive learning using digital game methods offering players more freedom and control. The theoretical framework of this research lays the foundation for a pedagogical learning model based on digital games. We analyzed the game reward system, which…

  20. The influence of personality on neural mechanisms of observational fear and reward learning

    PubMed Central

    Hooker, Christine I.; Verosky, Sara C.; Miyakawa, Asako; Knight, Robert T.; D’Esposito, Mark

    2012-01-01

    Fear and reward learning can occur through direct experience or observation. Both channels can enhance survival or create maladaptive behavior. We used fMRI to isolate neural mechanisms of observational fear and reward learning and investigate whether neural response varied according to individual differences in neuroticism and extraversion. Participants learned object-emotion associations by observing a woman respond with fearful (or neutral) and happy (or neutral) facial expressions to novel objects. The amygdala-hippocampal complex was active when learning the object-fear association, and the hippocampus was active when learning the object-happy association. After learning, objects were presented alone; amygdala activity was greater for the fear (vs. neutral) and happy (vs. neutral) associated object. Importantly, greater amygdala-hippocampal activity during fear (vs. neutral) learning predicted better recognition of learned objects on a subsequent memory test. Furthermore, personality modulated neural mechanisms of learning. Neuroticism positively correlated with neural activity in the amygdala and hippocampus during fear (vs. neutral) learning. Low extraversion/high introversion was related to faster behavioral predictions of the fearful and neutral expressions during fear learning. In addition, low extraversion/high introversion was related to greater amygdala activity during happy (vs. neutral) learning, happy (vs. neutral) object recognition, and faster reaction times for predicting happy and neutral expressions during reward learning. These findings suggest that neuroticism is associated with an increased sensitivity in the neural mechanism for fear learning which leads to enhanced encoding of fear associations, and that low extraversion/high introversion is related to enhanced conditionability for both fear and reward learning. PMID:18573512

  1. Assessment of rewarding and reinforcing properties of biperiden in conditioned place preference in rats.

    PubMed

    Allahverdiyev, Oruc; Nurten, Asiye; Enginar, Nurhan

    2011-12-01

    Biperiden is one of the most commonly abused anticholinergic drugs. This study assessed its motivational effects in the acquisition of conditioned place preference in rats. Biperiden neither produced place conditioning itself nor enhanced the rewarding effect of morphine. Furthermore, biperiden in combination with haloperidol also did not affect place preference. These findings suggest that biperiden seems devoid of abuse potential, at least at the doses used. PMID:21855580

  2. Reinforcement learning of motor skills with policy gradients.

    PubMed

    Peters, Jan; Schaal, Stefan

    2008-05-01

    Autonomous learning is one of the hallmarks of human and animal behavior, and understanding the principles of learning will be crucial in order to achieve true autonomy in advanced machines like humanoid robots. In this paper, we examine learning of complex motor skills with human-like limbs. While supervised learning can offer useful tools for bootstrapping behavior, e.g., by learning from demonstration, it is only reinforcement learning that offers a general approach to the final trial-and-error improvement that is needed by each individual acquiring a skill. Neither neurobiological nor machine learning studies have, so far, offered compelling results on how reinforcement learning can be scaled to the high-dimensional continuous state and action spaces of humans or humanoids. Here, we combine two recent research developments on learning motor control in order to achieve this scaling. First, we interpret the idea of modular motor control by means of motor primitives as a suitable way to generate parameterized control policies for reinforcement learning. Second, we combine motor primitives with the theory of stochastic policy gradient learning, which currently seems to be the only feasible framework for reinforcement learning for humanoids. We evaluate different policy gradient methods with a focus on their applicability to parameterized motor primitives. We compare these algorithms in the context of motor primitive learning, and show that our most modern algorithm, the Episodic Natural Actor-Critic, outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method in the application of learning to hit a baseball with an anthropomorphic robot arm. PMID:18482830
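
    The Episodic Natural Actor-Critic itself is more involved; the sketch below is a plain episodic policy-gradient (parameter-perturbation) update over a motor-primitive-like parameter vector, with a hypothetical episodic return and illustrative constants, to show the structure such methods share.

```python
import numpy as np

# Plain episodic policy-gradient sketch over motor-primitive-like parameters
# (a parameter-perturbation / REINFORCE-style update with a baseline), NOT
# the Episodic Natural Actor-Critic evaluated in the paper. The return
# function and constants are placeholders.
rng = np.random.default_rng(3)
theta = np.zeros(3)                       # primitive parameters to learn
sigma, alpha, n_rollouts = 0.2, 0.05, 20

def rollout_return(params):
    # Hypothetical episodic return; the optimum is placed at (0.5, -0.3, 0.1).
    target = np.array([0.5, -0.3, 0.1])
    return -np.sum((params - target) ** 2) + 0.01 * rng.standard_normal()

for iteration in range(200):
    eps = sigma * rng.standard_normal((n_rollouts, 3))    # parameter exploration
    returns = np.array([rollout_return(theta + e) for e in eps])
    baseline = returns.mean()                             # variance reduction
    grad = ((returns - baseline)[:, None] * eps).mean(axis=0) / sigma ** 2
    theta += alpha * grad                                 # gradient ascent on return
```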

  3. Reinforcement learning, conditioning, and the brain: Successes and challenges.

    PubMed

    Maia, Tiago V

    2009-12-01

    The field of reinforcement learning has greatly influenced the neuroscientific study of conditioning. This article provides an introduction to reinforcement learning followed by an examination of the successes and challenges using reinforcement learning to understand the neural bases of conditioning. Successes reviewed include (1) the mapping of positive and negative prediction errors to the firing of dopamine neurons and neurons in the lateral habenula, respectively; (2) the mapping of model-based and model-free reinforcement learning to associative and sensorimotor cortico-basal ganglia-thalamo-cortical circuits, respectively; and (3) the mapping of actor and critic to the dorsal and ventral striatum, respectively. Challenges reviewed consist of several behavioral and neural findings that are at odds with standard reinforcement-learning models, including, among others, evidence for hyperbolic discounting and adaptive coding. The article suggests ways of reconciling reinforcement-learning models with many of the challenging findings, and highlights the need for further theoretical developments where necessary. Additional information related to this study may be downloaded from http://cabn.psychonomic-journals.org/content/supplemental. PMID:19897789
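
    The prediction error referred to in point (1) is the temporal-difference error of these models. A minimal tabular TD(0) sketch on an illustrative chain task shows how it is computed and used:

```python
import numpy as np

# Minimal tabular TD(0) sketch: delta is the prediction error whose positive
# and negative excursions are mapped onto dopamine and lateral habenula
# responses in point (1); the chain task is illustrative.
n_states = 5
V = np.zeros(n_states)
alpha, gamma = 0.1, 0.95

for _ in range(200):
    for s in range(n_states - 1):                    # walk the chain 0 -> 4
        s_next = s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the end
        delta = r + gamma * V[s_next] - V[s]         # TD prediction error
        V[s] += alpha * delta                        # value update
```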

  4. Oxytocin selectively facilitates learning with social feedback and increases activity and functional connectivity in emotional memory and reward processing regions.

    PubMed

    Hu, Jiehui; Qi, Song; Becker, Benjamin; Luo, Lizhu; Gao, Shan; Gong, Qiyong; Hurlemann, René; Kendrick, Keith M

    2015-06-01

    In male Caucasian subjects, learning is facilitated by receipt of social compared with non-social feedback, and the neuropeptide oxytocin (OXT) facilitates this effect. In this study, we have first shown a cultural difference in that male Chinese subjects actually perform significantly worse in the same reinforcement associated learning task with social (emotional faces) compared with non-social feedback. Nevertheless, in two independent double-blind placebo (PLC) controlled between-subject design experiments we found OXT still selectively facilitated learning with social feedback. Similar to Caucasian subjects this OXT effect was strongest with feedback using female rather than male faces. One experiment performed in conjunction with functional magnetic resonance imaging showed that during the response, but not feedback phase of the task, OXT selectively increased activity in the amygdala, hippocampus, parahippocampal gyrus and putamen during the social feedback condition, and functional connectivity between the amygdala and insula and caudate. Therefore, OXT may be increasing the salience and reward value of anticipated social feedback. In the PLC group, response times and state anxiety scores during social feedback were associated with signal changes in these same regions but not in the OXT group. OXT may therefore have also facilitated learning by reducing anxiety in the social feedback condition. Overall our results provide the first evidence for cultural differences in social facilitation of learning per se, but a similar selective enhancement of learning with social feedback under OXT. This effect of OXT may be associated with enhanced responses and functional connectivity in emotional memory and reward processing regions. PMID:25664702

  5. SOVEREIGN: An autonomous neural system for incrementally learning planned action sequences to navigate towards a rewarded goal.

    PubMed

    Gnadt, William; Grossberg, Stephen

    2008-06-01

    How do reactive and planned behaviors interact in real time? How are sequences of such behaviors released at appropriate times during autonomous navigation to realize valued goals? Controllers for both animals and mobile robots, or animats, need reactive mechanisms for exploration, and learned plans to reach goal objects once an environment becomes familiar. The SOVEREIGN (Self-Organizing, Vision, Expectation, Recognition, Emotion, Intelligent, Goal-oriented Navigation) animat model embodies these capabilities, and is tested in a 3D virtual reality environment. SOVEREIGN includes several interacting subsystems which model complementary properties of cortical What and Where processing streams and which clarify similarities between mechanisms for navigation and arm movement control. As the animat explores an environment, visual inputs are processed by networks that are sensitive to visual form and motion in the What and Where streams, respectively. Position-invariant and size-invariant recognition categories are learned by real-time incremental learning in the What stream. Estimates of target position relative to the animat are computed in the Where stream, and can activate approach movements toward the target. Motion cues from animat locomotion can elicit head-orienting movements to bring a new target into view. Approach and orienting movements are alternately performed during animat navigation. Cumulative estimates of each movement are derived from interacting proprioceptive and visual cues. Movement sequences are stored within a motor working memory. Sequences of visual categories are stored in a sensory working memory. These working memories trigger learning of sensory and motor sequence categories, or plans, which together control planned movements. Predictively effective chunk combinations are selectively enhanced via reinforcement learning when the animat is rewarded. Selected planning chunks effect a gradual transition from variable reactive exploratory

  6. Rats bred for helplessness exhibit positive reinforcement learning deficits which are not alleviated by an antidepressant dose of the MAO-B inhibitor deprenyl.

    PubMed

    Schulz, Daniela; Henn, Fritz A; Petri, David; Huston, Joseph P

    2016-08-01

    Principles of negative reinforcement learning may play a critical role in the etiology and treatment of depression. We examined the integrity of positive reinforcement learning in congenitally helpless (cH) rats, an animal model of depression, using a random ratio schedule and a devaluation-extinction procedure. Furthermore, we tested whether an antidepressant dose of the monoamine oxidase (MAO)-B inhibitor deprenyl would reverse any deficits in positive reinforcement learning. We found that cH rats (n=9) were impaired in the acquisition of even simple operant contingencies, such as a fixed interval (FI) 20 schedule. cH rats exhibited no apparent deficits in appetite or reward sensitivity. They reacted to the devaluation of food in a manner consistent with a dose-response relationship. Reinforcer motivation as assessed by lever pressing across sessions with progressively decreasing reward probabilities was highest in congenitally non-helpless (cNH, n=10) rats as long as the reward probabilities remained relatively high. cNH compared to wild-type (n=10) rats were also more resistant to extinction across sessions. Compared to saline (n=5), deprenyl (n=5) reduced the duration of immobility of cH rats in the forced swimming test, indicative of antidepressant effects, but did not restore any deficits in the acquisition of a FI 20 schedule. We conclude that positive reinforcement learning was impaired in rats bred for helplessness, possibly due to motivational impairments but not deficits in reward sensitivity, and that deprenyl exerted antidepressant effects but did not reverse the deficits in positive reinforcement learning. PMID:27163379

  7. Hedging Your Bets by Learning Reward Correlations in the Human Brain

    PubMed Central

    Wunderlich, Klaus; Symmonds, Mkael; Bossaerts, Peter; Dolan, Raymond J.

    2011-01-01

    Summary Human subjects are proficient at tracking the mean and variance of rewards and updating these via prediction errors. Here, we addressed whether humans can also learn about higher-order relationships between distinct environmental outcomes, a defining ecological feature of contexts where multiple sources of rewards are available. By manipulating the degree to which distinct outcomes are correlated, we show that subjects implemented an explicit model-based strategy to learn the associated outcome correlations and were adept in using that information to dynamically adjust their choices in a task that required a minimization of outcome variance. Importantly, the experimentally generated outcome correlations were explicitly represented neuronally in right midinsula with a learning prediction error signal expressed in rostral anterior cingulate cortex. Thus, our data show that the human brain represents higher-order correlation structures between rewards, a core adaptive ability whose immediate benefit is optimized sampling. PMID:21943609

  8. Individual differences and the neural representations of reward expectation and reward prediction error

    PubMed Central

    2007-01-01

    Reward expectation and reward prediction errors are thought to be critical for dynamic adjustments in decision-making and reward-seeking behavior, but little is known about their representation in the brain during uncertainty and risk-taking. Furthermore, little is known about what role individual differences might play in such reinforcement processes. In this study, it is shown that behavioral and neural responses during a decision-making task can be characterized by a computational reinforcement learning model and that individual differences in learning parameters in the model are critical for elucidating these processes. In the fMRI experiment, subjects chose between high- and low-risk rewards. A computational reinforcement learning model computed expected values and prediction errors that each subject might experience on each trial. These outputs predicted subjects’ trial-to-trial choice strategies and neural activity in several limbic and prefrontal regions during the task. Individual differences in estimated reinforcement learning parameters proved critical for characterizing these processes, because models that incorporated individual learning parameters explained significantly more variance in the fMRI data than did a model using fixed learning parameters. These findings suggest that the brain engages a reinforcement learning process during risk-taking and that individual differences play a crucial role in modeling this process. PMID:17710118
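
    The study's exact model specification is not reproduced here; the sketch below shows the usual per-trial structure of such a model, with a subject-specific learning rate and softmax inverse temperature (the quantities whose individual fitting mattered for the fMRI analysis) producing expected-value and prediction-error regressors.

```python
import numpy as np

def simulate_subject(choices, outcomes, alpha, beta):
    """Per-trial expected values and prediction errors for one subject.

    choices  : sequence of option indices (0 = low-risk, 1 = high-risk) per trial
    outcomes : sequence of rewards obtained on each trial
    alpha    : subject-specific learning rate
    beta     : subject-specific softmax inverse temperature
    The model form is a generic sketch, not the study's exact specification.
    """
    Q = np.zeros(2)
    EV, PE, choice_prob = [], [], []
    for c, r in zip(choices, outcomes):
        p = np.exp(beta * Q) / np.exp(beta * Q).sum()   # softmax choice rule
        choice_prob.append(p[c])
        EV.append(Q[c])              # expected-value regressor for this trial
        delta = r - Q[c]             # prediction-error regressor for this trial
        PE.append(delta)
        Q[c] += alpha * delta        # value update
    return np.array(EV), np.array(PE), np.array(choice_prob)

# Illustrative use: regressors for a three-trial subject with alpha=0.3, beta=2.
ev, pe, lik = simulate_subject([0, 1, 1], [1.0, 0.0, 1.0], alpha=0.3, beta=2.0)
```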

  9. A Brain-like Learning System with Supervised, Unsupervised and Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Sasakawa, Takafumi; Hu, Jinglu; Hirasawa, Kotaro

    Our brain has three different learning paradigms: supervised, unsupervised and reinforcement learning. It has been suggested that these learning paradigms relate deeply to the cerebellum, cerebral cortex and basal ganglia in the brain, respectively. Inspired by this knowledge of the brain, we present a brain-like learning system with those three different learning algorithms. The proposed system consists of three parts: the supervised learning (SL) part, the unsupervised learning (UL) part and the reinforcement learning (RL) part. The SL part, corresponding to the cerebellum of the brain, learns an input-output mapping by supervised learning. The UL part, corresponding to the cerebral cortex of the brain, is a competitive learning network, and divides the input space into subspaces by unsupervised learning. The RL part, corresponding to the basal ganglia of the brain, optimizes the model performance by reinforcement learning. Numerical simulations show that the proposed brain-like learning system optimizes its performance automatically and has superior performance to an ordinary neural network.

  10. Qualitative adaptive reward learning with success failure maps: applied to humanoid robot walking.

    PubMed

    Nassour, John; Hugel, Vincent; Ben Ouezdou, Fethi; Cheng, Gordon

    2013-01-01

    In the human brain, rewards are encoded in a flexible and adaptive way after each novel stimulus. Neurons of the orbitofrontal cortex are the key reward structure of the brain. Neurobiological studies show that the anterior cingulate cortex of the brain is primarily responsible for avoiding repeated mistakes. According to a vigilance threshold, which denotes the tolerance to risk, we can differentiate between a learning mechanism that takes risks and one that avoids risks. The tolerance to risk plays an important role in such a learning mechanism. Results have shown differences in learning capacity between risk-taking and risk-averse behaviors. These neurological properties provide promising inspiration for reward-based robot learning. In this paper, we propose a learning mechanism that is able to learn from negative and positive feedback with adaptive reward coding. It is composed of two phases: evaluation and decision making. In the evaluation phase, we use a Kohonen self-organizing map technique to represent success and failure. Decision making is based on an early warning mechanism that helps avoid repeating past mistakes. The attitude to risk is modulated in order to gain experience of both success and failure. The success map is learned with an adaptive reward that qualifies the learned task in order to optimize efficiency. Our approach is presented with an implementation on the NAO humanoid robot, controlled by a bioinspired neural controller based on a central pattern generator. The learning system adapts the oscillation frequency and the motor neuron gain in pitch and roll in order to walk on flat and sloped terrain, and to switch between them. PMID:24808209

  11. The role of GABAB receptors in human reinforcement learning.

    PubMed

    Ort, Andres; Kometer, Michael; Rohde, Judith; Seifritz, Erich; Vollenweider, Franz X

    2014-10-01

    Behavioral evidence from human studies suggests that the γ-aminobutyric acid type B receptor (GABAB receptor) agonist baclofen modulates reinforcement learning and reduces craving in patients with addiction spectrum disorders. However, in contrast to the well established role of dopamine in reinforcement learning, the mechanisms by which the GABAB receptor influences reinforcement learning in humans remain completely unknown. To further elucidate this issue, a cross-over, double-blind, placebo-controlled study was performed in healthy human subjects (N=15) to test the effects of baclofen (20 and 50mg p.o.) on probabilistic reinforcement learning. Outcomes were the feedback-induced P2 component of the event-related potential, the feedback-related negativity, and the P300 component of the event-related potential. Baclofen produced a reduction of P2 amplitude over the course of the experiment, but did not modulate the feedback-related negativity. Furthermore, there was a trend towards increased learning after baclofen administration relative to placebo over the course of the experiment. The present results extend previous theories of reinforcement learning, which focus on the importance of mesolimbic dopamine signaling, and indicate that stimulation of cortical GABAB receptors in a fronto-parietal network leads to better attentional allocation in reinforcement learning. This observation is a first step in our understanding of how baclofen may improve reinforcement learning in healthy subjects. Further studies with bigger sample sizes are needed to corroborate this conclusion and furthermore, test this effect in patients with addiction spectrum disorder. PMID:25194227

  12. Neural Regions that Underlie Reinforcement Learning Also Engage in Social Expectancy Violations

    PubMed Central

    Harris, Lasana T.; Fiske, Susan T.

    2013-01-01

    Prediction error, the difference between an expected and actual outcome, serves as a learning signal that interacts with reward and punishment value to direct future behavior during reinforcement learning. We hypothesized that similar learning and valuation signals may underlie social expectancy violations. Here, we explore the neural correlates of social expectancy violation signals along the universal person-perception dimensions of trait warmth and competence. In this context, social learning may result from expectancy violations that occur when a target is inconsistent with an a priori schema. Expectancy violation may activate neural regions normally implicated in prediction error and valuation during appetitive and aversive conditioning. Using fMRI, we first gave perceivers warmth or competence behavioral information. Participants then saw pictures of people responsible for the behavior; they represented social groups either inconsistent (rated low on either warmth or competence) or consistent (rated high on either warmth or competence) with the behavior information. Warmth and competence expectancy violations activate striatal regions and frontal cortex respectively, areas that represent evaluative and prediction-error signals. These findings suggest that regions underlying reinforcement learning may be engaged in warmth and competence social expectancy violation, and illustrate the neural overlap between neuroeconomics and social neuroscience. PMID:20119878

  13. Sound Sequence Discrimination Learning Motivated by Reward Requires Dopaminergic D2 Receptor Activation in the Rat Auditory Cortex

    ERIC Educational Resources Information Center

    Kudoh, Masaharu; Shibuki, Katsuei

    2006-01-01

    We have previously reported that sound sequence discrimination learning requires cholinergic inputs to the auditory cortex (AC) in rats. In that study, reward was used for motivating discrimination behavior in rats. Therefore, dopaminergic inputs mediating reward signals may have an important role in the learning. We tested the possibility in the…

  14. WWC Quick Review of the Article "Culture and the Interaction of Student Ethnicity with Reward Structure in Group Learning" Revised

    ERIC Educational Resources Information Center

    What Works Clearinghouse, 2010

    2010-01-01

    This paper presents an updated WWC (What Works Clearinghouse) Review of the Article "Culture and the Interaction of Student Ethnicity with Reward Structure in Group Learning". The study examined the effects of different reward systems used in group learning situations on the math skills of African-American and White students. The research…

  15. WWC Review of the Article "Culture and the Interaction of Student Ethnicity with Reward Structure in Group Learning"

    ERIC Educational Resources Information Center

    What Works Clearinghouse, 2010

    2010-01-01

    "Culture and the Interaction of Student Ethnicity with Reward Structure in Group Learning" examined the effects of different reward systems used in group learning situations on the math skills of African-American and white students. The study analyzed data on 75 African-American and 57 white fourth- and fifth-grade students from urban schools in…

  16. Correlates of reward-predictive value in learning-related hippocampal neural activity

    PubMed Central

    Okatan, Murat

    2009-01-01

    Temporal difference learning (TD) is a popular algorithm in machine learning. Two learning signals that are derived from this algorithm, the predictive value and the prediction error, have been shown to explain changes in neural activity and behavior during learning across species. Here, the predictive value signal is used to explain the time course of learning-related changes in the activity of hippocampal neurons in monkeys performing an associative learning task. The TD algorithm serves as the centerpiece of a joint probability model for the learning-related neural activity and the behavioral responses recorded during the task. The neural component of the model consists of spiking neurons that compete and learn the reward-predictive value of task-relevant input signals. The predictive-value signaled by these neurons influences the behavioral response generated by a stochastic decision stage, which constitutes the behavioral component of the model. It is shown that the time course of the changes in neural activity and behavioral performance generated by the model exhibits key features of the experimental data. The results suggest that information about correct associations may be expressed in the hippocampus before it is detected in the behavior of a subject. In this way, the hippocampus may be among the earliest brain areas to express learning and drive the behavioral changes associated with learning. Correlates of reward-predictive value may be expressed in the hippocampus through rate remapping within spatial memory representations, they may represent reward-related aspects of a declarative or explicit relational memory representation of task contingencies, or they may correspond to reward-related components of episodic memory representations. These potential functions are discussed in connection with hippocampal cell assembly sequences and their reverse reactivation during the awake state. The results provide further support for the proposal that neural

  17. Social Learning, Reinforcement and Crime: Evidence from Three European Cities

    ERIC Educational Resources Information Center

    Tittle, Charles R.; Antonaccio, Olena; Botchkovar, Ekaterina

    2012-01-01

    This study reports a cross-cultural test of Social Learning Theory using direct measures of social learning constructs and focusing on the causal structure implied by the theory. Overall, the results strongly confirm the main thrust of the theory. Prior criminal reinforcement and current crime-favorable definitions are highly related in all three…

  18. Reinforcement Learning of Optimal Supervisor based on the Worst-Case Behavior

    NASA Astrophysics Data System (ADS)

    Kajiwara, Kouji; Yamasaki, Tatsushi

    The supervisory control framework initiated by Ramadge and Wonham provides logical control of discrete event systems. In the original formulation, the costs of event occurrence and of disabling events are not considered, so optimal supervisory control based on quantitative measures has also been studied. This paper proposes a synthesis method for an optimal supervisor based on the worst-case behavior of discrete event systems. We introduce new value functions for the assigned control patterns that are based not on expected total rewards but on the most undesirable event occurrence within the assigned control pattern. In the proposed method, the supervisor learns how to assign the control pattern by reinforcement learning so as to maximize these value functions. We show the efficiency of the proposed method by computer simulation.
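
    The contrast between expectation-based and worst-case values can be sketched as follows. The control patterns, event rewards, and update rule below are illustrative stand-ins for the paper's definitions; the point is only that a pattern is scored by the worst event it enables rather than by an expected total reward.

```python
import numpy as np

# Illustrative contrast with expectation-based RL (not the paper's exact
# formulation): each control pattern (set of enabled events) is scored by
# the worst event outcome observed under it, and the supervisor prefers the
# pattern whose worst case is best.
rng = np.random.default_rng(4)
event_reward = {"a": 1.0, "b": 0.5, "c": -2.0}        # illustrative event rewards
patterns = [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"a", "b", "c"})]
value = {p: None for p in patterns}                   # worst-case value per pattern
epsilon = 0.1

def plant_response(pattern):
    # The plant nondeterministically executes one of the enabled events.
    e = rng.choice(sorted(pattern))
    return e, event_reward[e]

for t in range(3000):
    untried = [p for p in patterns if value[p] is None]
    if untried:
        p = untried[0]
    elif rng.random() < epsilon:
        p = patterns[rng.integers(len(patterns))]
    else:
        p = max(patterns, key=lambda q: value[q])     # maximize the worst case
    _, r = plant_response(p)
    value[p] = r if value[p] is None else min(value[p], r)   # worst-case update

# The preferred pattern is {"a"}: enabling "b" or "c" admits worse outcomes.
```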

  19. The impact of effort-reward imbalance and learning motivation on teachers' sickness absence.

    PubMed

    Derycke, Hanne; Vlerick, Peter; Van de Ven, Bart; Rots, Isabel; Clays, Els

    2013-02-01

    The aim of this study was to analyse the impact of the effort-reward imbalance and learning motivation on sickness absence duration and sickness absence frequency among beginning teachers in Flanders (Belgium). A total of 603 teachers, who recently graduated, participated in this study. Effort-reward imbalance and learning motivation were assessed by means of self-administered questionnaires. Prospective data of registered sickness absence during 12 months follow-up were collected. Multivariate logistic regression analyses were performed. An imbalance between high efforts and low rewards (extrinsic hypothesis) was associated with longer sickness absence duration and more frequent absences. A low level of learning motivation (intrinsic hypothesis) was not associated with longer sickness absence duration but was significantly positively associated with sickness absence frequency. No significant results were obtained for the interaction hypothesis between imbalance and learning motivation. Further research is needed to deepen our understanding of the impact of psychosocial work conditions and personal resources on both sickness absence duration and frequency. Specifically, attention could be given to optimizing or reducing efforts spent at work, increasing rewards and stimulating learning motivation to influence sickness absence. PMID:22337584

  20. RM-SORN: a reward-modulated self-organizing recurrent neural network.

    PubMed

    Aswolinskiy, Witali; Pipa, Gordon

    2015-01-01

    Neural plasticity plays an important role in learning and memory. Reward-modulation of plasticity offers an explanation for the ability of the brain to adapt its neural activity to achieve a rewarded goal. Here, we define a neural network model that learns through the interaction of Intrinsic Plasticity (IP) and reward-modulated Spike-Timing-Dependent Plasticity (STDP). IP enables the network to explore possible output sequences and STDP, modulated by reward, reinforces the creation of the rewarded output sequences. The model is tested on tasks for prediction, recall, non-linear computation, pattern recognition, and sequence generation. It achieves performance comparable to networks trained with supervised learning, while using simple, biologically motivated plasticity rules, and rewarding strategies. The results confirm the importance of investigating the interaction of several plasticity rules in the context of reward-modulated learning and whether reward-modulated self-organization can explain the amazing capabilities of the brain. PMID:25852533
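
    A minimal sketch of reward-modulated STDP, the central plasticity rule named above, using a generic eligibility-trace formulation; the network size, traces, and reward criterion are illustrative and not RM-SORN's exact implementation (which also includes intrinsic plasticity).

```python
import numpy as np

# Generic reward-modulated STDP sketch (not RM-SORN's exact rule, which also
# uses intrinsic plasticity): pre/post spike pairings build an eligibility
# trace, and the trace is converted into lasting weight change only when a
# reward arrives. Sizes, traces, and the reward criterion are illustrative.
rng = np.random.default_rng(5)
n_pre, n_post = 20, 10
W = 0.1 * rng.random((n_post, n_pre))
elig = np.zeros_like(W)
pre_trace = np.zeros(n_pre)        # recent presynaptic activity
post_trace = np.zeros(n_post)      # recent postsynaptic activity
tau, lr = 0.8, 0.01

for t in range(1000):
    pre = (rng.random(n_pre) < 0.1).astype(float)
    post = (W @ pre + 0.05 * rng.standard_normal(n_post) > 0.4).astype(float)
    # STDP-like pairing: pre-before-post potentiates, post-before-pre depresses.
    elig = tau * elig + np.outer(post, pre_trace) - np.outer(post_trace, pre)
    pre_trace = tau * pre_trace + pre
    post_trace = tau * post_trace + post
    reward = 1.0 if post.sum() >= 3 else 0.0              # illustrative reward
    W = np.clip(W + lr * reward * elig, 0.0, 1.0)         # reward gates plasticity
```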

  1. RM-SORN: a reward-modulated self-organizing recurrent neural network

    PubMed Central

    Aswolinskiy, Witali; Pipa, Gordon

    2015-01-01

    Neural plasticity plays an important role in learning and memory. Reward-modulation of plasticity offers an explanation for the ability of the brain to adapt its neural activity to achieve a rewarded goal. Here, we define a neural network model that learns through the interaction of Intrinsic Plasticity (IP) and reward-modulated Spike-Timing-Dependent Plasticity (STDP). IP enables the network to explore possible output sequences and STDP, modulated by reward, reinforces the creation of the rewarded output sequences. The model is tested on tasks for prediction, recall, non-linear computation, pattern recognition, and sequence generation. It achieves performance comparable to networks trained with supervised learning, while using simple, biologically motivated plasticity rules, and rewarding strategies. The results confirm the importance of investigating the interaction of several plasticity rules in the context of reward-modulated learning and whether reward-modulated self-organization can explain the amazing capabilities of the brain. PMID:25852533

  2. Components and characteristics of the dopamine reward utility signal.

    PubMed

    Stauffer, William R; Lak, Armin; Kobayashi, Shunsuke; Schultz, Wolfram

    2016-06-01

    Rewards are defined by their behavioral functions in learning (positive reinforcement), approach behavior, economic choices, and emotions. Dopamine neurons respond to rewards with two components, similar to higher order sensory and cognitive neurons. The initial, rapid, unselective dopamine detection component reports all salient environmental events irrespective of their reward association. It is highly sensitive to factors related to reward and thus detects a maximal number of potential rewards. It also senses aversive stimuli but reports their physical impact rather than their aversiveness. The second response component processes reward value accurately and starts early enough to prevent confusion with unrewarded stimuli and objects. It codes reward value as a numeric, quantitative utility prediction error, consistent with formal concepts of economic decision theory. Thus, the dopamine reward signal is fast, highly sensitive and appropriate for driving and updating economic decisions. PMID:26272220

  3. Human-level control through deep reinforcement learning

    NASA Astrophysics Data System (ADS)

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-01

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
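
    The core update behind the agent described above can be written compactly. The sketch below uses a linear Q-function on a toy random environment in place of the convolutional network and Atari emulator, but keeps the main ingredients of the deep Q-network approach: epsilon-greedy exploration, experience replay, and a bootstrapped target from a periodically refreshed target network. All sizes and hyperparameters are placeholders.

        # DQN-style update with replay and a target network; environment and numbers assumed.
        import random
        from collections import deque
        import numpy as np

        rng = np.random.default_rng(1)
        n_state, n_action, gamma, alpha = 8, 4, 0.99, 0.01
        W = rng.normal(0, 0.1, (n_action, n_state))   # online Q weights: Q(s, a) = W[a] @ s
        W_target = W.copy()                            # target-network weights
        replay = deque(maxlen=10_000)

        s = rng.random(n_state)
        for step in range(5000):
            # epsilon-greedy action selection
            a = int(rng.integers(n_action)) if rng.random() < 0.1 else int(np.argmax(W @ s))
            s_next = rng.random(n_state)               # toy transition (assumed dynamics)
            r = float(s_next[a % n_state])             # toy reward (assumed)
            replay.append((s, a, r, s_next))
            s = s_next

            # sample a minibatch and step each sample toward the bootstrapped target
            for sb, ab, rb, sb1 in random.sample(replay, min(32, len(replay))):
                y = rb + gamma * np.max(W_target @ sb1)
                W[ab] += alpha * (y - (W @ sb)[ab]) * sb      # semi-gradient update
            if step % 500 == 0:
                W_target = W.copy()                           # refresh target network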

  4. Human-level control through deep reinforcement learning.

    PubMed

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A; Veness, Joel; Bellemare, Marc G; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-26

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks

  5. Involvement of the Rat Anterior Cingulate Cortex in Control of Instrumental Responses Guided by Reward Expectancy

    ERIC Educational Resources Information Center

    Schweimer, Judith; Hauber, Wolfgang

    2005-01-01

    The anterior cingulate cortex (ACC) plays a critical role in stimulus-reinforcement learning and reward-guided selection of actions. Here we conducted a series of experiments to further elucidate the role of the ACC in instrumental behavior involving effort-based decision-making and instrumental learning guided by reward-predictive stimuli. In…

  6. Protein interaction network constructing based on text mining and reinforcement learning with application to prostate cancer.

    PubMed

    Zhu, Fei; Liu, Quan; Zhang, Xiaofang; Shen, Bairong

    2015-08-01

    Constructing interaction networks from biomedical texts is an important and challenging task. The authors combine text mining and reinforcement learning approaches to establish a protein interaction network. Given the high computational efficiency of co-occurrence-based interaction extraction and the high precision of linguistic-pattern approaches, the authors propose an extraction algorithm that uses frequently occurring linguistic patterns to extract interactions from texts and then identifies further interactions in extended, unprocessed texts following the co-occurrence approach, discounting the interactions extracted from the extended texts. They put forward a reinforcement learning-based algorithm to establish a protein interaction network, where nodes represent proteins and edges denote interactions. During the evolutionary process, a node selects another node and the reward obtained determines which predicted interaction should be reinforced. The topology of the network is updated by the agent until an optimal network is formed. They used texts downloaded from PubMed to construct a prostate cancer protein interaction network with the proposed methods. The results show that the method achieved a good matching rate. Network topology analysis also demonstrates that the node degree distribution, node degree probability and probability distribution of the constructed network accord well with those of a scale-free network. PMID:26243825
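
    A rough Python sketch of the two-stage idea summarised above is given below: candidate protein pairs are collected by sentence co-occurrence, pairs that also match a simple linguistic pattern keep full weight while co-occurrence-only pairs are discounted, and a reward signal then strengthens or weakens edges. The example sentences, the single pattern, the reference set and all numeric values are invented for illustration and do not reproduce the authors' pipeline.

        # Toy co-occurrence + pattern extraction with reward-reinforced edges; all data assumed.
        import itertools
        import re

        sentences = [
            "AR interacts with FOXA1 in prostate cells.",
            "PTEN and TP53 were both mentioned in the cohort.",
            "AR binds NCOA2 to drive transcription.",
        ]
        proteins = {"AR", "FOXA1", "PTEN", "TP53", "NCOA2"}
        pattern = re.compile(r"\b(interacts with|binds)\b")

        edges = {}
        for s in sentences:
            found = [p for p in proteins if p in s]
            weight = 1.0 if pattern.search(s) else 0.5    # discount co-occurrence-only evidence
            for a, b in itertools.combinations(sorted(found), 2):
                edges[(a, b)] = edges.get((a, b), 0.0) + weight

        # Toy reinforcement step: edges confirmed by a hypothetical reference set are rewarded.
        reference = {("AR", "FOXA1"), ("AR", "NCOA2")}
        for edge in list(edges):
            reward = 1.0 if edge in reference else -0.5
            edges[edge] = max(0.0, edges[edge] + 0.2 * reward)

        print(edges)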

  7. A Spiking Network Model of Decision Making Employing Rewarded STDP

    PubMed Central

    Bazhenov, Maxim

    2014-01-01

    Reward-modulated spike timing dependent plasticity (STDP) combines unsupervised STDP with a reinforcement signal that modulates synaptic changes. It was proposed as a learning rule capable of solving the distal reward problem in reinforcement learning. Nonetheless, performance and limitations of this learning mechanism have yet to be tested for its ability to solve biological problems. In our work, rewarded STDP was implemented to model foraging behavior in a simulated environment. Over the course of training the network of spiking neurons developed the capability of producing highly successful decision-making. The network performance remained stable even after significant perturbations of synaptic structure. Rewarded STDP alone was insufficient to learn effective decision making due to the difficulty maintaining homeostatic equilibrium of synaptic weights and the development of local performance maxima. Our study predicts that successful learning requires stabilizing mechanisms that allow neurons to balance their input and output synapses as well as synaptic noise. PMID:24632858
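
    The usual computational device behind the rewarded STDP mentioned above is an eligibility trace: coincident pre/post activity is written into a slowly decaying trace, and only a later reward converts that trace into a weight change, which is how the delay of the distal reward problem is bridged. The single synapse, Poisson spiking and time constants below are assumptions chosen for brevity, not the network model of the study.

        # Rewarded STDP on one synapse via an eligibility trace; all parameters assumed.
        import numpy as np

        rng = np.random.default_rng(2)
        dt, T = 1.0, 5000                    # ms
        tau_e, tau_pre, tau_post = 500.0, 20.0, 20.0
        w, lr = 0.5, 0.001
        pre_trace = post_trace = elig = 0.0

        for t in range(T):
            pre = rng.random() < 0.02        # presynaptic Poisson spike (~20 Hz)
            post = rng.random() < 0.02       # postsynaptic Poisson spike (toy stand-in)
            pre_trace += -dt / tau_pre * pre_trace + pre
            post_trace += -dt / tau_post * post_trace + post
            # STDP-like coincidences are stored in the eligibility trace, not the weight.
            elig += -dt / tau_e * elig + (pre_trace * post - post_trace * pre)
            # Reward arrives rarely and late; only then does the weight change.
            reward = 1.0 if t % 1000 == 999 else 0.0
            w += lr * reward * elig

        print("final weight:", round(w, 4))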

  8. The Role of BDNF, Leptin, and Catecholamines in Reward Learning in Bulimia Nervosa

    PubMed Central

    Grob, Simona; Milos, Gabriella; Schnyder, Ulrich; Eckert, Anne; Lang, Undine; Hasler, Gregor

    2015-01-01

    Background: A relationship between bulimia nervosa and reward-related behavior is supported by several lines of evidence. The dopaminergic dysfunctions in the processing of reward-related stimuli have been shown to be modulated by the neurotrophin brain derived neurotrophic factor (BDNF) and the hormone leptin. Methods: Using a randomized, double-blind, placebo-controlled, crossover design, a reward learning task was applied to study the behavior of 20 female subjects with remitted bulimia nervosa and 27 female healthy controls under placebo and catecholamine depletion with alpha-methyl-para-tyrosine (AMPT). The plasma levels of BDNF and leptin were measured twice during the placebo and the AMPT condition, immediately before and 1 hour after a standardized breakfast. Results: AMPT–induced differences in plasma BDNF levels were positively correlated with the AMPT–induced differences in reward learning in the whole sample (P=.05). Across conditions, plasma brain derived neurotrophic factor levels were higher in remitted bulimia nervosa subjects compared with controls (diagnosis effect; P=.001). Plasma BDNF and leptin levels were higher in the morning before compared with after a standardized breakfast across groups and conditions (time effect; P<.0001). The plasma leptin levels were higher under catecholamine depletion compared with placebo in the whole sample (treatment effect; P=.0004). Conclusions: This study reports on preliminary findings that suggest a catecholamine-dependent association of plasma BDNF and reward learning in subjects with remitted bulimia nervosa and controls. A role of leptin in reward learning is not supported by this study. However, leptin levels were sensitive to a depletion of catecholamine stores in both remitted bulimia nervosa and controls. PMID:25522424

  9. Negative reinforcement impairs overnight memory consolidation.

    PubMed

    Stamm, Andrew W; Nguyen, Nam D; Seicol, Benjamin J; Fagan, Abigail; Oh, Angela; Drumm, Michael; Lundt, Maureen; Stickgold, Robert; Wamsley, Erin J

    2014-11-01

    Post-learning sleep is beneficial for human memory. However, it may be that not all memories benefit equally from sleep. Here, we manipulated a spatial learning task using monetary reward and performance feedback, asking whether enhancing the salience of the task would augment overnight memory consolidation and alter its incorporation into dreaming. Contrary to our hypothesis, we found that the addition of reward impaired overnight consolidation of spatial memory. Our findings seemingly contradict prior reports that enhancing the reward value of learned information augments sleep-dependent memory processing. Given that the reward followed a negative reinforcement paradigm, consolidation may have been impaired via a stress-related mechanism. PMID:25320351

  10. Social Cognition as Reinforcement Learning: Feedback Modulates Emotion Inference.

    PubMed

    Zaki, Jamil; Kallman, Seth; Wimmer, G Elliott; Ochsner, Kevin; Shohamy, Daphna

    2016-09-01

    Neuroscientific studies of social cognition typically employ paradigms in which perceivers draw single-shot inferences about the internal states of strangers. Real-world social inference features much different parameters: People often encounter and learn about particular social targets (e.g., friends) over time and receive feedback about whether their inferences are correct or incorrect. Here, we examined this process and, more broadly, the intersection between social cognition and reinforcement learning. Perceivers were scanned using fMRI while repeatedly encountering three social targets who produced conflicting visual and verbal emotional cues. Perceivers guessed how targets felt and received feedback about whether they had guessed correctly. Visual cues reliably predicted one target's emotion, verbal cues predicted a second target's emotion, and neither reliably predicted the third target's emotion. Perceivers successfully used this information to update their judgments over time. Furthermore, trial-by-trial learning signals-estimated using two reinforcement learning models-tracked activity in ventral striatum and ventromedial pFC, structures associated with reinforcement learning, and regions associated with updating social impressions, including TPJ. These data suggest that learning about others' emotions, like other forms of feedback learning, relies on domain-general reinforcement mechanisms as well as domain-specific social information processing. PMID:27167401
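
    The trial-by-trial learning signals referred to above are typically derived from a delta-rule update of how reliable each cue type is for a given target. The sketch below shows that generic Rescorla-Wagner form; the cue structure, learning rate and reliabilities are illustrative assumptions rather than the two fitted models from the study.

        # Delta-rule update of cue reliability from correct/incorrect feedback; values assumed.
        import numpy as np

        rng = np.random.default_rng(3)
        alpha = 0.2
        value = {"visual": 0.5, "verbal": 0.5}              # learned reliability per cue type
        true_reliability = {"visual": 0.9, "verbal": 0.3}   # assumed ground truth for one target

        for trial in range(100):
            cue = "visual" if trial % 2 == 0 else "verbal"
            outcome = 1.0 if rng.random() < true_reliability[cue] else 0.0
            prediction_error = outcome - value[cue]         # the trial-by-trial learning signal
            value[cue] += alpha * prediction_error

        print(value)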

  11. Dopamine Replacement Therapy, Learning and Reward Prediction in Parkinson's Disease: Implications for Rehabilitation.

    PubMed

    Ferrazzoli, Davide; Carter, Adrian; Ustun, Fatma S; Palamara, Grazia; Ortelli, Paola; Maestri, Roberto; Yücel, Murat; Frazzitta, Giuseppe

    2016-01-01

    The principal feature of Parkinson's disease (PD) is the impaired ability to acquire and express habitual-automatic actions due to the loss of dopamine in the dorsolateral striatum, the region of the basal ganglia associated with the control of habitual behavior. Dopamine replacement therapy (DRT) compensates for the lack of dopamine, representing the standard treatment for different motor symptoms of PD (such as rigidity, bradykinesia and resting tremor). On the other hand, rehabilitation treatments, which exploit cognitive strategies, feedback and external cues, make it possible to "learn to bypass" the defective basal ganglia (using the dorsolateral area of the prefrontal cortex), allowing patients to perform correct movements under executive-volitional control. Therefore, DRT and rehabilitation seem to be two complementary and synergistic approaches. Learning and reward are central in rehabilitation: both of these mechanisms are the basis for the success of any rehabilitative treatment. However, it is known that "learning resources" and reward can be negatively influenced by dopaminergic drugs. Furthermore, DRT causes several well-known complications: among these, dyskinesias, motor fluctuations, and dopamine dysregulation syndrome (DDS) are intimately linked with alterations in learning and reward mechanisms and can seriously impact rehabilitation outcomes. These considerations highlight the need for careful titration of DRT to produce the desired improvement in motor symptoms while minimizing the associated detrimental effects. This is important in order to maximize motor re-learning based on repetition, reward and practice during rehabilitation. In this scenario, we review the knowledge concerning the interactions between DRT, learning and reward, examine the most impactful DRT side effects and provide suggestions for optimizing rehabilitation in PD. PMID:27378872

  12. Dopamine Replacement Therapy, Learning and Reward Prediction in Parkinson’s Disease: Implications for Rehabilitation

    PubMed Central

    Ferrazzoli, Davide; Carter, Adrian; Ustun, Fatma S.; Palamara, Grazia; Ortelli, Paola; Maestri, Roberto; Yücel, Murat; Frazzitta, Giuseppe

    2016-01-01

    The principal feature of Parkinson’s disease (PD) is the impaired ability to acquire and express habitual-automatic actions due to the loss of dopamine in the dorsolateral striatum, the region of the basal ganglia associated with the control of habitual behavior. Dopamine replacement therapy (DRT) compensates for the lack of dopamine, representing the standard treatment for different motor symptoms of PD (such as rigidity, bradykinesia and resting tremor). On the other hand, rehabilitation treatments, which exploit cognitive strategies, feedback and external cues, make it possible to “learn to bypass” the defective basal ganglia (using the dorsolateral area of the prefrontal cortex), allowing patients to perform correct movements under executive-volitional control. Therefore, DRT and rehabilitation seem to be two complementary and synergistic approaches. Learning and reward are central in rehabilitation: both of these mechanisms are the basis for the success of any rehabilitative treatment. However, it is known that “learning resources” and reward can be negatively influenced by dopaminergic drugs. Furthermore, DRT causes several well-known complications: among these, dyskinesias, motor fluctuations, and dopamine dysregulation syndrome (DDS) are intimately linked with alterations in learning and reward mechanisms and can seriously impact rehabilitation outcomes. These considerations highlight the need for careful titration of DRT to produce the desired improvement in motor symptoms while minimizing the associated detrimental effects. This is important in order to maximize motor re-learning based on repetition, reward and practice during rehabilitation. In this scenario, we review the knowledge concerning the interactions between DRT, learning and reward, examine the most impactful DRT side effects and provide suggestions for optimizing rehabilitation in PD. PMID:27378872

  13. Fuel not fun: Reinterpreting attenuated brain responses to reward in obesity.

    PubMed

    Kroemer, Nils B; Small, Dana M

    2016-08-01

    There is a well-established literature linking obesity to altered dopamine signaling and brain response to food-related stimuli. Neuroimaging studies frequently report enhanced responses in dopaminergic regions during food anticipation and decreased responses during reward receipt. This has been interpreted as reflecting anticipatory "reward surfeit", and consummatory "reward deficiency". In particular, attenuated response in the dorsal striatum to primary food rewards is proposed to reflect anhedonia, which leads to overeating in an attempt to compensate for the reward deficit. In this paper, we propose an alternative view. We consider brain response to food-related stimuli in a reinforcement-learning framework, which can be employed to separate the contributions of reward sensitivity and reward-related learning that are typically entangled in the brain response to reward. Consequently, we posit that decreased striatal responses to milkshake receipt reflect reduced reward-related learning rather than reward deficiency or anhedonia because reduced reward sensitivity would translate uniformly into reduced anticipatory and consummatory responses to reward. By re-conceptualizing reward deficiency as a shift in learning about subjective value of rewards, we attempt to reconcile neuroimaging findings with the putative role of dopamine in effort, energy expenditure and exploration and suggest that attenuated brain responses to energy dense foods reflect the "fuel", not the fun entailed by the reward. PMID:27085908
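
    The distinction the authors draw can be made explicit in a delta-rule model: a reward-sensitivity parameter scales every experienced reward, whereas the learning rate only controls how quickly value estimates catch up. The parameter values below are assumptions chosen to illustrate that reduced sensitivity lowers both anticipatory and consummatory responses, while a reduced learning rate leaves the asymptote intact and only slows learning.

        # Separating reward sensitivity (rho) from learning rate (alpha); values assumed.
        def learn(alpha, rho, reward=1.0, trials=20):
            v = 0.0
            for _ in range(trials):
                v += alpha * (rho * reward - v)     # delta-rule update toward scaled reward
            return round(v, 3)

        print("low sensitivity  :", learn(alpha=0.3, rho=0.5))
        print("low learning rate:", learn(alpha=0.05, rho=1.0))
        print("baseline         :", learn(alpha=0.3, rho=1.0))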

  14. Sensory Responsiveness and the Effects of Equal Subjective Rewards on Tactile Learning and Memory of Honeybees

    ERIC Educational Resources Information Center

    Scheiner, Ricarda; Kuritz-Kaiser, Anthea; Menzel, Randolf; Erber, Joachim

    2005-01-01

    In tactile learning, sucrose is the unconditioned stimulus and reward, which is usually applied to the antenna to elicit proboscis extension and which the bee can drink when it is subsequently applied to the extended proboscis. The conditioned stimulus is a tactile object that the bee can scan with its antennae. In this paper we describe the…

  15. Drift diffusion model of reward and punishment learning in schizophrenia: Modeling and experimental data.

    PubMed

    Moustafa, Ahmed A; Kéri, Szabolcs; Somlai, Zsuzsanna; Balsdon, Tarryn; Frydecka, Dorota; Misiak, Blazej; White, Corey

    2015-09-15

    In this study, we tested reward and punishment learning performance using a probabilistic classification learning task in patients with schizophrenia (n=37) and healthy controls (n=48). We also fit subjects' data using a Drift Diffusion Model (DDM) of simple decisions to investigate which components of the decision process differ between patients and controls. Modeling results show between-group differences in multiple components of the decision process. Specifically, patients had slower motor/encoding time, higher response caution (favoring accuracy over speed), and a deficit in classification learning for punishment, but not reward, trials. The results suggest that patients with schizophrenia adopt a compensatory strategy of favoring accuracy over speed to improve performance, yet still show signs of a deficit in learning based on negative feedback. Our data highlight the importance of fitting formal models (particularly drift diffusion models) to behavioral data. The implications of these findings are discussed relative to theories of schizophrenia and cognitive processing. PMID:26005124
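
    For reference, the decision-process components compared above (drift rate, boundary separation/response caution, and non-decision/motor-encoding time) can be simulated with a standard Euler scheme. The parameter values in the sketch are placeholders, not fitted values from the patient or control groups.

        # Euler simulation of the drift diffusion model; parameter values are placeholders.
        import numpy as np

        def simulate_ddm(v=0.3, a=1.2, z=0.6, t0=0.35, dt=0.001, sigma=1.0, rng=None):
            rng = rng or np.random.default_rng()
            x, t = z, 0.0                      # start between the boundaries
            while 0.0 < x < a:
                x += v * dt + sigma * np.sqrt(dt) * rng.normal()
                t += dt
            choice = "upper" if x >= a else "lower"
            return choice, t + t0              # decision time plus non-decision time

        rng = np.random.default_rng(4)
        trials = [simulate_ddm(rng=rng) for _ in range(200)]
        p_upper = sum(c == "upper" for c, _ in trials) / len(trials)
        mean_rt = sum(rt for _, rt in trials) / len(trials)
        print(f"upper-boundary proportion {p_upper:.2f}, mean RT {mean_rt:.3f} s")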

  16. Learning in neural networks by reinforcement of irregular spiking

    NASA Astrophysics Data System (ADS)

    Xie, Xiaohui; Seung, H. Sebastian

    2004-04-01

    Artificial neural networks are often trained by using the back propagation algorithm to compute the gradient of an objective function with respect to the synaptic strengths. For a biological neural network, such a gradient computation would be difficult to implement, because of the complex dynamics of intrinsic and synaptic conductances in neurons. Here we show that irregular spiking similar to that observed in biological neurons could be used as the basis for a learning rule that calculates a stochastic approximation to the gradient. The learning rule is derived based on a special class of model networks in which neurons fire spike trains with Poisson statistics. The learning is compatible with forms of synaptic dynamics such as short-term facilitation and depression. By correlating the fluctuations in irregular spiking with a reward signal, the learning rule performs stochastic gradient ascent on the expected reward. It is applied to two examples, learning the XOR computation and learning direction selectivity using depressing synapses. We also show in simulation that the learning rule is applicable to a network of noisy integrate-and-fire neurons.
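
    The heart of the rule described above is that the fluctuation of a Poisson neuron's spike count around its expected value, correlated with reward, gives a stochastic estimate of the reward gradient. The one-unit sketch below shows that update; the toy task stands in for the paper's XOR and direction-selectivity examples and is an assumption.

        # REINFORCE-style rule for a single Poisson unit; task and constants assumed.
        import numpy as np

        rng = np.random.default_rng(5)
        w = rng.normal(0, 0.1, 2)
        lr = 0.05

        for episode in range(2000):
            x = rng.integers(0, 2, size=2).astype(float)        # binary input pattern
            rate = np.exp(w @ x)                                 # Poisson firing rate
            spikes = rng.poisson(rate)
            # Toy task (assumed): fire more than one spike only when both inputs are on.
            reward = 1.0 if (spikes > 1) == (x.sum() == 2) else 0.0
            # Eligibility = (observed - expected spikes) * input; reward gates the update.
            w += lr * reward * (spikes - rate) * x

        print("learned weights:", np.round(w, 2))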

  17. Observing others stay or switch - How social prediction errors are integrated into reward reversal learning.

    PubMed

    Ihssen, Niklas; Mussweiler, Thomas; Linden, David E J

    2016-08-01

    Reward properties of stimuli can undergo sudden changes, and the detection of these 'reversals' is often made difficult by the probabilistic nature of rewards/punishments. Here we tested whether and how humans use social information (someone else's choices) to overcome uncertainty during reversal learning. We show a substantial social influence during reversal learning, which was modulated by the type of observed behavior. Participants frequently followed observed conservative choices (no switches after punishment) made by the (fictitious) other player but ignored impulsive choices (switches), even though the experiment was set up so that both types of response behavior would be similarly beneficial/detrimental (Study 1). Computational modeling showed that participants integrated the observed choices as a 'social prediction error' instead of ignoring or blindly following the other player. Modeling also confirmed higher learning rates for 'conservative' versus 'impulsive' social prediction errors. Importantly, this 'conservative bias' was boosted by interpersonal similarity, which in conjunction with the lack of effects observed in a non-social control experiment (Study 2) confirmed its social nature. A third study suggested that relative weighting of observed impulsive responses increased with increased volatility (frequency of reversals). Finally, simulations showed that in the present paradigm integrating social and reward information was not necessarily more adaptive to maximize earnings than learning from reward alone. Moreover, integrating social information increased accuracy only when conservative and impulsive choices were weighted similarly during learning. These findings suggest that to guide decisions in choice contexts that involve reward reversals humans utilize social cues conforming with their preconceptions more strongly than cues conflicting with them, especially when the other is similar. PMID:27128170

  18. Tactile learning and the individual evaluation of the reward in honey bees (Apis mellifera L.).

    PubMed

    Scheiner, R; Erber, J; Page, R E

    1999-07-01

    Using the proboscis extension response we conditioned pollen and nectar foragers of the honey bee (Apis mellifera L.) to tactile patterns under laboratory conditions. Pollen foragers demonstrated better acquisition, extinction, and reversal learning than nectar foragers. We tested whether the known differences in response thresholds to sucrose between pollen and nectar foragers could explain the observed differences in learning and found that nectar foragers with low response thresholds performed better during acquisition and extinction than ones with higher thresholds. Conditioning pollen and nectar foragers with similar response thresholds did not yield differences in their learning performance. These results suggest that differences in the learning performance of pollen and nectar foragers are a consequence of differences in their perception of sucrose. Furthermore, we analysed the effect which the perception of sucrose reward has on associative learning. Nectar foragers with uniform low response thresholds were conditioned using varying concentrations of sucrose. We found significant positive correlations between the concentrations of the sucrose rewards and the performance during acquisition and extinction. The results are summarised in a model which describes the relationships between learning performance, response threshold to sucrose, concentration of sucrose and the number of rewards. PMID:10450609

  19. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning

    PubMed Central

    Franklin, Nicholas T; Frank, Michael J

    2015-01-01

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning Marr's three levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments. DOI: http://dx.doi.org/10.7554/eLife.12029.001 PMID:26705698
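
    At the algorithmic level, the mechanism sketched above amounts to letting an uncertainty estimate set the effective learning rate. The snippet below uses a Pearce-Hall-style associability term as a simple stand-in for that uncertainty signal; it is not the spiking TAN model or the feedback controller from the paper, and all constants are assumptions.

        # Uncertainty-scaled learning rate (Pearce-Hall-style proxy); constants assumed.
        import numpy as np

        rng = np.random.default_rng(6)
        v, assoc = 0.5, 0.5            # value estimate and associability (uncertainty proxy)
        kappa, eta = 0.5, 0.1          # base learning rate and associability update rate
        true_p = 0.8

        for trial in range(400):
            if trial == 200:
                true_p = 0.2           # change-point in outcome contingencies
            reward = float(rng.random() < true_p)
            pe = reward - v
            v += kappa * assoc * pe                      # learning rate scaled by uncertainty
            assoc = (1 - eta) * assoc + eta * abs(pe)    # uncertainty tracks recent surprise
            if trial in (190, 210, 390):
                print(f"trial {trial}: value {v:.2f}, associability {assoc:.2f}")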

  20. Instructional control of reinforcement learning: a behavioral and neurocomputational investigation.

    PubMed

    Doll, Bradley B; Jacobs, W Jake; Sanfey, Alan G; Frank, Michael J

    2009-11-24

    Humans learn how to behave directly through environmental experience and indirectly through rules and instructions. Behavior analytic research has shown that instructions can control behavior, even when such behavior leads to sub-optimal outcomes (Hayes, S. (Ed.). 1989. Rule-governed behavior: cognition, contingencies, and instructional control. Plenum Press.). Here we examine the control of behavior through instructions in a reinforcement learning task known to depend on striatal dopaminergic function. Participants selected between probabilistically reinforced stimuli, and were (incorrectly) told that a specific stimulus had the highest (or lowest) reinforcement probability. Despite experience to the contrary, instructions drove choice behavior. We present neural network simulations that capture the interactions between instruction-driven and reinforcement-driven behavior via two potential neural circuits: one in which the striatum is inaccurately trained by instruction representations coming from prefrontal cortex/hippocampus (PFC/HC), and another in which the striatum learns the environmentally based reinforcement contingencies, but is "overridden" at decision output. Both models capture the core behavioral phenomena but, because they differ fundamentally on what is learned, make distinct predictions for subsequent behavioral and neuroimaging experiments. Finally, we attempt to distinguish between the proposed computational mechanisms governing instructed behavior by fitting a series of abstract "Q-learning" and Bayesian models to subject data. The best-fitting model supports one of the neural models, suggesting the existence of a "confirmation bias" in which the PFC/HC system trains the reinforcement system by amplifying outcomes that are consistent with instructions while diminishing inconsistent outcomes. PMID:19595993
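
    The confirmation-bias account favored above can be sketched as a standard Q-learning update in which outcomes consistent with the (incorrect) instruction are amplified and inconsistent outcomes diminished for the instructed stimulus. The amplification factors, reward probabilities and trial counts below are illustrative assumptions, not the fitted parameters from the study.

        # Instructed Q-learning with asymmetric (confirmation-biased) updates; values assumed.
        import numpy as np

        rng = np.random.default_rng(7)
        alpha, beta = 0.2, 3.0
        q = np.zeros(2)                     # action 0 = instructed stimulus, 1 = alternative
        p_reward = np.array([0.3, 0.7])     # instruction is wrong: the alternative is better
        amplify, diminish = 1.5, 0.5        # gains on instruction-(in)consistent outcomes

        for trial in range(300):
            probs = np.exp(beta * q) / np.exp(beta * q).sum()
            a = int(rng.choice(2, p=probs))
            r = float(rng.random() < p_reward[a])
            pe = r - q[a]
            if a == 0:                      # instructed choice: bias the learning signal
                pe *= amplify if r == 1.0 else diminish
            q[a] += alpha * pe

        print("final Q-values (instructed, alternative):", q.round(2))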

  1. Cooperation and Coordination Between Fuzzy Reinforcement Learning Agents in Continuous State Partially Observable Markov Decision Processes

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Vengerov, David

    1999-01-01

    Successful operations of future multi-agent intelligent systems require efficient cooperation schemes between agents sharing learning experiences. We consider a pseudo-realistic world in which one or more opportunities appear and disappear in random locations. Agents use fuzzy reinforcement learning to learn which opportunities are most worthy of pursuing based on their promised rewards, expected lifetimes, path lengths and expected path costs. We show that this world is partially observable because the history of an agent influences the distribution of its future states. We consider a cooperation mechanism in which agents share experience by using and updating one joint behavior policy. We also implement a coordination mechanism for allocating opportunities to different agents in the same world. Our results demonstrate that K cooperative agents each learning in a separate world over N time steps outperform K independent agents each learning in a separate world over K*N time steps, with this result becoming more pronounced as the degree of partial observability in the environment increases. We also show that cooperation between agents learning in the same world decreases performance with respect to independent agents. Since cooperation reduces diversity between agents, we conclude that diversity is a key parameter in the trade-off between maximizing utility from cooperation when diversity is low and maximizing utility from competitive coordination when diversity is high.

  2. Hidden state and reinforcement learning with instance-based state identification.

    PubMed

    McCallum, R A

    1996-01-01

    Real robots with real sensors are not omniscient. When a robot's next course of action depends on information that is hidden from the sensors because of problems such as occlusion, restricted range, bounded field of view and limited attention, we say the robot suffers from the hidden state problem. State identification techniques use history information to uncover hidden state. Some previous approaches to encoding history include: finite state machines, recurrent neural networks and genetic programming with indexed memory. A chief disadvantage of all these techniques is their long training time. This paper presents instance-based state identification, a new approach to reinforcement learning with state identification that learns with much fewer training steps. Noting that learning with history and learning in continuous spaces both share the property that they begin without knowing the granularity of the state space, the approach applies instance-based (or "memory-based") learning to history sequences-instead of recording instances in a continuous geometrical space, we record instances in action-percept-reward sequence space. The first implementation of this approach, called Nearest Sequence Memory, learns with an order of magnitude fewer steps than several previous approaches. PMID:18263047
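
    A heavily simplified sketch of the instance-based idea is shown below: experience is kept as one action-percept-reward sequence, and the value of acting now is estimated from stored moments whose recent history best matches the current one (longest matching suffix). The toy history, the suffix metric and the k-nearest voting are stand-ins for Nearest Sequence Memory, not a faithful reimplementation.

        # Toy nearest-sequence value estimate over an action-percept-reward history.
        def suffix_match(history, i, j):
            """Length of the common suffix of history[:i] and history[:j]."""
            k = 0
            while k < i and k < j and history[i - 1 - k] == history[j - 1 - k]:
                k += 1
            return k

        def estimate_value(history, rewards, action, k_neighbors=3):
            """Average reward that followed the stored moments most similar to now."""
            now = len(history)
            candidates = [t for t in range(len(history)) if history[t][0] == action]
            candidates.sort(key=lambda t: suffix_match(history, now, t), reverse=True)
            best = candidates[:k_neighbors]
            return sum(rewards[t] for t in best) / len(best) if best else 0.0

        # history holds (action, percept) pairs; rewards[t] is the reward received at step t.
        history = [("left", "wall"), ("right", "open"), ("right", "goal"), ("left", "wall")]
        rewards = [0.0, 0.0, 1.0, 0.0]
        print("value of 'right' now:", estimate_value(history, rewards, "right"))
        print("value of 'left' now :", estimate_value(history, rewards, "left"))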

  3. Decision theory, reinforcement learning, and the brain.

    PubMed

    Dayan, Peter; Daw, Nathaniel D

    2008-12-01

    Decision making is a core competence for animals and humans acting and surviving in environments they only partially comprehend, gaining rewards and punishments for their troubles. Decision-theoretic concepts permeate experiments and computational models in ethology, psychology, and neuroscience. Here, we review a well-known, coherent Bayesian approach to decision making, showing how it unifies issues in Markovian decision problems, signal detection psychophysics, sequential sampling, and optimal exploration and discuss paradigmatic psychological and neural examples of each problem. We discuss computational issues concerning what subjects know about their task and how ambitious they are in seeking optimal solutions; we address algorithmic topics concerning model-based and model-free methods for making choices; and we highlight key aspects of the neural implementation of decision making. PMID:19033240

  4. Predictive value and reward in implicit classification learning.

    PubMed

    Lam, Judith M; Wächter, Tobias; Globas, Christoph; Karnath, Hans-Otto; Luft, Andreas R

    2013-01-01

    Learning efficacy depends on its emotional context. The contents learned and the feedback received during training tinges this context. The objective here was to investigate the influence of content and feedback on the efficacy of implicit learning and to explore using functional imaging how these factors are processed in the brain. Twenty-one participants completed 150 trials of a probabilistic classification task (predicting sun or rain based on combinations of playing cards). Smileys or frowneys were presented as feedback. In 10 of these subjects, the task was performed during functional magnetic resonance imaging. Card combinations predicting sun were remembered better than those predicting rain. Similarly, positive feedback fortified learning more than negative feedback. The presentation of smileys recruited bilateral nucleus accumbens, sensorimotor cortex, and posterior cingulum more than negative feedback did. The higher the predictive value of a card combination, the more activation was found in the lateral cerebellum. Both context and feedback influence implicit classification learning. Similar to motor skill acquisition, positive feedback during classification learning is processed in part within the sensorimotor cortex, potentially reflecting the activation of a dopaminergic projection to motor cortex (Hosp et al., 2011). Activation of the lateral cerebellum during learning of combinations with high predictive value may reflect the formation of an internal model. PMID:22419392

  5. Human Operant Learning under Concurrent Reinforcement of Response Variability

    ERIC Educational Resources Information Center

    Maes, J. H. R.; van der Goot, M.

    2006-01-01

    This study asked whether the concurrent reinforcement of behavioral variability facilitates learning to emit a difficult target response. Sixty students repeatedly pressed sequences of keys, with an originally infrequently occurring target sequence consistently being followed by positive feedback. Three conditions differed in the feedback given to…

  6. Kinesthetic Reinforcement-Is It a Boon to Learning?

    ERIC Educational Resources Information Center

    Bohrer, Roxilu K.

    1970-01-01

    Language instruction, particularly in the elementary school, should be reinforced through the use of visual aids and through associated physical activity. Kinesthetic experiences provide an opportunity to make use of non-verbal cues to meaning, enliven classroom activities, and maximize learning for pupils. The author discusses the educational…

  7. Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin.

    PubMed

    Ezaki, Takahiro; Horita, Yutaka; Takezawa, Masanori; Masuda, Naoki

    2016-07-01

    Direct reciprocity, or repeated interaction, is a main mechanism to sustain cooperation under social dilemmas involving two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. Mechanisms underlying these behaviors largely remain unclear. Here we provide a proximate account for this behavior by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff is larger than a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results obtained in the present study are general in that they explain extant experimental results obtained for both so-called moody and non-moody conditional cooperation, prisoner's dilemma and public goods games, and well-mixed groups and networks. In contrast to previous theory, individuals are assumed to have no access to information about what other individuals are doing, such that they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity of cooperation is modulated in every discrete time step, explains the conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from Pavlov, a reinforcement learning strategy promoting mutual cooperation in two-player situations. PMID:27438888
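
    The aspiration rule described above is simple enough to state directly in code: an action is reinforced when its payoff exceeds a fixed aspiration level and anti-reinforced otherwise. In the sketch, the payoff matrix, aspiration level, learning rate and the noisy cooperative partner are all illustrative assumptions, and the stochastic details of the published model are omitted.

        # Minimal aspiration learner in a repeated prisoner's dilemma; all values assumed.
        import numpy as np

        rng = np.random.default_rng(8)
        payoff = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}
        aspiration, lr = 2.0, 0.1
        p_cooperate = 0.5                    # propensity to cooperate, updated every round

        for _ in range(500):
            my_action = "C" if rng.random() < p_cooperate else "D"
            other = "C" if rng.random() < 0.8 else "D"    # noisy cooperative partner (assumed)
            satisfied = payoff[(my_action, other)] > aspiration
            # Reinforce the chosen action if satisfied, anti-reinforce it otherwise.
            target = 1.0 if (my_action == "C") == satisfied else 0.0
            p_cooperate += lr * (target - p_cooperate)

        print("final propensity to cooperate:", round(p_cooperate, 2))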

  8. Reinforcement Learning in Young Adults with Developmental Language Impairment

    ERIC Educational Resources Information Center

    Lee, Joanna C.; Tomblin, J. Bruce

    2012-01-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank, Seeberger, & O'Reilly, 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic…

  9. Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin

    PubMed Central

    Ezaki, Takahiro; Horita, Yutaka; Masuda, Naoki

    2016-01-01

    Direct reciprocity, or repeated interaction, is a main mechanism to sustain cooperation under social dilemmas involving two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. Mechanisms underlying these behaviors largely remain unclear. Here we provide a proximate account for this behavior by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff is larger than a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results obtained in the present study are general in that they explain extant experimental results obtained for both so-called moody and non-moody conditional cooperation, prisoner’s dilemma and public goods games, and well-mixed groups and networks. In contrast to previous theory, individuals are assumed to have no access to information about what other individuals are doing, such that they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity of cooperation is modulated in every discrete time step, explains the conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from Pavlov, a reinforcement learning strategy promoting mutual cooperation in two-player situations. PMID:27438888

  10. A Tribute to Charlie Chaplin: Induced Positive Affect Improves Reward-Based Decision-Learning in Parkinson’s Disease

    PubMed Central

    Ridderinkhof, K. Richard; van Wouwe, Nelleke C.; Band, Guido P. H.; Wylie, Scott A.; Van der Stigchel, Stefan; van Hees, Pieter; Buitenweg, Jessika; van de Vijver, Irene; van den Wildenberg, Wery P. M.

    2012-01-01

    Reward-based decision-learning refers to the process of learning to select those actions that lead to rewards while avoiding actions that lead to punishments. This process, known to rely on dopaminergic activity in striatal brain regions, is compromised in Parkinson’s disease (PD). We hypothesized that such decision-learning deficits are alleviated by induced positive affect, which is thought to incur transient boosts in midbrain and striatal dopaminergic activity. Computational measures of probabilistic reward-based decision-learning were determined for 51 patients diagnosed with PD. Previous work has shown these measures to rely on the nucleus caudatus (outcome evaluation during the early phases of learning) and the putamen (reward prediction during later phases of learning). We observed that induced positive affect facilitated learning, through its effects on reward prediction rather than outcome evaluation. Viewing a few minutes of comedy clips served to remedy dopamine-related problems associated with frontostriatal circuitry and, consequently, learning to predict which actions will yield reward. PMID:22707944

  11. A tribute to charlie chaplin: induced positive affect improves reward-based decision-learning in Parkinson's disease.

    PubMed

    Ridderinkhof, K Richard; van Wouwe, Nelleke C; Band, Guido P H; Wylie, Scott A; Van der Stigchel, Stefan; van Hees, Pieter; Buitenweg, Jessika; van de Vijver, Irene; van den Wildenberg, Wery P M

    2012-01-01

    Reward-based decision-learning refers to the process of learning to select those actions that lead to rewards while avoiding actions that lead to punishments. This process, known to rely on dopaminergic activity in striatal brain regions, is compromised in Parkinson's disease (PD). We hypothesized that such decision-learning deficits are alleviated by induced positive affect, which is thought to incur transient boosts in midbrain and striatal dopaminergic activity. Computational measures of probabilistic reward-based decision-learning were determined for 51 patients diagnosed with PD. Previous work has shown these measures to rely on the nucleus caudatus (outcome evaluation during the early phases of learning) and the putamen (reward prediction during later phases of learning). We observed that induced positive affect facilitated learning, through its effects on reward prediction rather than outcome evaluation. Viewing a few minutes of comedy clips served to remedy dopamine-related problems associated with frontostriatal circuitry and, consequently, learning to predict which actions will yield reward. PMID:22707944

  12. Effects of subconscious and conscious emotions on human cue-reward association learning.

    PubMed

    Watanabe, Noriya; Haruno, Masahiko

    2015-01-01

    Life demands that we adapt our behaviour continuously in situations in which much of our incoming information is emotional and unrelated to our immediate behavioural goals. Such information is often processed without conscious awareness. This poses the intriguing question of whether subconscious exposure to irrelevant emotional information (e.g. the surrounding social atmosphere) affects the way we learn. Here, we addressed this issue by examining whether the learning of cue-reward associations changes when an emotional facial expression is shown subconsciously or consciously prior to the presentation of a reward-predicting cue. We found that both subconscious (0.027 s and 0.033 s) and conscious (0.047 s) emotional signals increased the rate of learning, and this increase was smallest at the border of conscious duration (0.040 s). These data suggest not only that the subconscious and conscious processing of emotional signals enhances value-updating in cue-reward association learning, but also that the computational processes underlying the subconscious enhancement are at least partially dissociable from their conscious counterpart. PMID:25684237

  13. Neural mechanisms of the nucleus accumbens circuit in reward and aversive learning.

    PubMed

    Hikida, Takatoshi; Morita, Makiko; Macpherson, Tom

    2016-07-01

    The basal ganglia are key neural substrates not only for motor function, but also cognitive functions including reward and aversive learning. Critical for these processes are the functional role played by two projection neurons within nucleus accumbens (NAc); the D1- and D2-expressing neurons. Recently, we have developed a novel reversible neurotransmission blocking technique that specifically blocks neurotransmission from NAc D1- and D2-expressing neurons, allowing for in vivo analysis. In this review, we outline the functional dissociation of NAc D1- and D2-expressing neurons of the basal ganglia in reward and aversive learning, as well as drug addiction. These studies have revealed the importance of activation of NAc D1 receptors for reward learning and drug addiction, and inactivation of NAc D2 receptors for aversive learning and flexibility. Based on these findings, we propose a neural mechanism, in which dopamine neurons in the ventral tegmental area that send inputs to the NAc work as a switch between D1- and D2-expressing neurons. These basal ganglia neural mechanisms will give us new insights into the pathophysiology of neuropsychiatric diseases. PMID:26827817

  14. Neurofeedback in Learning Disabled Children: Visual versus Auditory Reinforcement.

    PubMed

    Fernández, Thalía; Bosch-Bayard, Jorge; Harmony, Thalía; Caballero, María I; Díaz-Comas, Lourdes; Galán, Lídice; Ricardo-Garcell, Josefina; Aubert, Eduardo; Otero-Ojeda, Gloria

    2016-03-01

    Children with learning disabilities (LD) frequently have an EEG characterized by an excess of theta and a deficit of alpha activities. NFB using an auditory stimulus as reinforcer has proven to be a useful tool to treat LD children by positively reinforcing decreases of the theta/alpha ratio. The aim of the present study was to optimize the NFB procedure by comparing the efficacy of visual (with eyes open) versus auditory (with eyes closed) reinforcers. Twenty LD children with an abnormally high theta/alpha ratio were randomly assigned to the Auditory or the Visual group, where a 500 Hz tone or a visual stimulus (a white square), respectively, was used as a positive reinforcer when the value of the theta/alpha ratio was reduced. Both groups had signs consistent with EEG maturation, but only the Auditory Group showed behavioral/cognitive improvements. In conclusion, the auditory reinforcer was more efficacious in reducing the theta/alpha ratio, and it improved the cognitive abilities more than the visual reinforcer. PMID:26294269

  15. Reward Networks in the Brain as Captured by Connectivity Measures

    PubMed Central

    Camara, Estela; Rodriguez-Fornells, Antoni; Ye, Zheng; Münte, Thomas F.

    2009-01-01

    An assortment of human behaviors is thought to be driven by rewards including reinforcement learning, novelty processing, learning, decision making, economic choice, incentive motivation, and addiction. In each case the ventral tegmental area/ventral striatum (nucleus accumbens) (VTA–VS) system has been implicated as a key structure by functional imaging studies, mostly on the basis of standard, univariate analyses. Here we propose that standard functional magnetic resonance imaging analysis needs to be complemented by methods that take into account the differential connectivity of the VTA–VS system in the different behavioral contexts in order to describe reward based processes more appropriately. We first consider the wider network for reward processing as it emerged from animal experimentation. Subsequently, an example for a method to assess functional connectivity is given. Finally, we illustrate the usefulness of such analyses by examples regarding reward valuation, reward expectation and the role of reward in addiction. PMID:20198152

  16. Working memory contributions to reinforcement learning impairments in schizophrenia.

    PubMed

    Collins, Anne G E; Brown, Jaime K; Gold, James M; Waltz, James A; Frank, Michael J

    2014-10-01

    Previous research has shown that patients with schizophrenia are impaired in reinforcement learning tasks. However, behavioral learning curves in such tasks originate from the interaction of multiple neural processes, including the basal ganglia- and dopamine-dependent reinforcement learning (RL) system, but also prefrontal cortex-dependent cognitive strategies involving working memory (WM). Thus, it is unclear which specific system induces impairments in schizophrenia. We recently developed a task and computational model allowing us to separately assess the roles of RL (slow, cumulative learning) mechanisms versus WM (fast but capacity-limited) mechanisms in healthy adult human subjects. Here, we used this task to assess patients' specific sources of impairments in learning. In 15 separate blocks, subjects learned to pick one of three actions for stimuli. The number of stimuli to learn in each block varied from two to six, allowing us to separate influences of capacity-limited WM from the incremental RL system. As expected, both patients (n = 49) and healthy controls (n = 36) showed effects of set size and delay between stimulus repetitions, confirming the presence of working memory effects. Patients performed significantly worse than controls overall, but computational model fits and behavioral analyses indicate that these deficits could be entirely accounted for by changes in WM parameters (capacity and reliability), whereas RL processes were spared. These results suggest that the working memory system contributes strongly to learning impairments in schizophrenia. PMID:25297101
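
    The separation the authors rely on can be illustrated with a mixture of two modules: a slow incremental RL learner and a fast, one-shot, capacity-limited working-memory store whose weight in the choice shrinks as set size exceeds capacity. The capacity value, learning rate and block layout below are assumptions for illustration, not the fitted model from the study.

        # Mixture of incremental RL and capacity-limited working memory; values assumed.
        import numpy as np

        rng = np.random.default_rng(9)
        n_actions, alpha_rl, capacity = 3, 0.1, 3.0

        def run_block(set_size, trials=90):
            q = np.ones((set_size, n_actions)) / n_actions   # slow incremental RL values
            wm = np.ones((set_size, n_actions)) / n_actions  # fast one-shot WM store
            w_wm = min(1.0, capacity / set_size)             # WM influence limited by capacity
            correct = rng.integers(n_actions, size=set_size)
            n_correct = 0.0
            for _ in range(trials):
                s = int(rng.integers(set_size))
                policy = w_wm * wm[s] + (1 - w_wm) * q[s]    # mixture of the two modules
                a = int(rng.choice(n_actions, p=policy / policy.sum()))
                r = float(a == correct[s])
                n_correct += r
                q[s, a] += alpha_rl * (r - q[s, a])          # slow cumulative update
                wm[s] = np.full(n_actions, 0.05)             # WM forgets, then stores one-shot
                if r:
                    wm[s, a] = 0.9
                wm[s] /= wm[s].sum()
            return round(n_correct / trials, 2)

        print("set size 2:", run_block(2), " set size 6:", run_block(6))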

  17. Structure identification in fuzzy inference using reinforcement learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1993-01-01

    In our previous work on the GARIC architecture, we have shown that the system can start with surface structure of the knowledge base (i.e., the linguistic expression of the rules) and learn the deep structure (i.e., the fuzzy membership functions of the labels used in the rules) by using reinforcement learning. Assuming the surface structure, GARIC refines the fuzzy membership functions used in the consequents of the rules using a gradient descent procedure. This hybrid fuzzy logic and reinforcement learning approach can learn to balance a cart-pole system and to backup a truck to its docking location after a few trials. In this paper, we discuss how to do structure identification using reinforcement learning in fuzzy inference systems. This involves identifying both surface as well as deep structure of the knowledge base. The term set of fuzzy linguistic labels used in describing the values of each control variable must be derived. In this process, splitting a label refers to creating new labels which are more granular than the original label and merging two labels creates a more general label. Splitting and merging of labels directly transform the structure of the action selection network used in GARIC by increasing or decreasing the number of hidden layer nodes.
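
    The splitting and merging of linguistic labels described above can be pictured with triangular membership functions: splitting replaces a label with two narrower, more granular ones, and merging two labels yields a single more general one. The triangle representation and the particular split/merge geometry below are assumptions for illustration, not the GARIC procedure itself.

        # Toy split/merge of triangular fuzzy labels; the geometry is an assumption.
        def triangle(center, width):
            return {"center": center, "width": width}

        def membership(label, x):
            return max(0.0, 1.0 - abs(x - label["center"]) / label["width"])

        def split(label):
            half = label["width"] / 2.0       # two narrower labels covering the original
            return [triangle(label["center"] - half / 2.0, half),
                    triangle(label["center"] + half / 2.0, half)]

        def merge(a, b):
            center = (a["center"] + b["center"]) / 2.0
            width = abs(a["center"] - b["center"]) + max(a["width"], b["width"])
            return triangle(center, width)    # one more general label

        medium = triangle(center=0.0, width=2.0)
        low_medium, high_medium = split(medium)
        print(membership(low_medium, -0.5), membership(high_medium, 0.5))
        print(merge(low_medium, high_medium))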

  18. Robot Docking Based on Omnidirectional Vision and Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Muse, David; Weber, Cornelius; Wermter, Stefan

    We present a system for visual robotic docking using an omnidirectional camera coupled with the actor-critic reinforcement learning algorithm. The system enables a PeopleBot robot to locate and approach a table so that it can pick an object from it using the pan-tilt camera mounted on the robot. We use a staged approach to solve this problem, as there are distinct sub-tasks and different sensors used. The robot starts with random wandering until the table is located via a landmark; a network trained via reinforcement then allows the robot to run to and approach the table. Once at the table, the robot picks the object from it. We argue that our approach has considerable potential, as it allows robot control for navigation to be learned and removes the need for internal maps of the environment. This is achieved by allowing the robot to learn couplings between motor actions and the position of a landmark.
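
    The actor-critic update underlying this kind of approach can be sketched on a toy one-dimensional task in which the state is a discretized distance to the landmark. The task, discretization, learning rates and reward are illustrative assumptions; the robot, omnidirectional camera and staged control scheme are not reproduced.

        # Tabular actor-critic on a toy "approach the landmark" task; all values assumed.
        import numpy as np

        rng = np.random.default_rng(10)
        n_states, gamma, alpha_v, alpha_p = 10, 0.95, 0.1, 0.1
        V = np.zeros(n_states)                   # critic: state values
        prefs = np.zeros((n_states, 2))          # actor: preferences (0 = toward, 1 = away)

        for episode in range(300):
            s = n_states - 1                     # start far from the landmark
            while s > 0:
                probs = np.exp(prefs[s]) / np.exp(prefs[s]).sum()
                a = int(rng.choice(2, p=probs))
                s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
                r = 1.0 if s_next == 0 else 0.0  # reward only on reaching the landmark
                td_error = r + gamma * V[s_next] * (s_next != 0) - V[s]
                V[s] += alpha_v * td_error                 # critic update
                prefs[s, a] += alpha_p * td_error          # actor update
                s = s_next

        p_toward = np.exp(prefs[-1])[0] / np.exp(prefs[-1]).sum()
        print("P(move toward landmark | far state):", round(float(p_toward), 2))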

  19. Probabilistic reward- and punishment-based learning in opioid addiction: Experimental and computational data.

    PubMed

    Myers, Catherine E; Sheynin, Jony; Balsdon, Tarryn; Luzardo, Andre; Beck, Kevin D; Hogarth, Lee; Haber, Paul; Moustafa, Ahmed A

    2016-01-01

    Addiction is the continuation of a habit in spite of negative consequences. A vast literature gives evidence that this poor decision-making behavior in individuals addicted to drugs also generalizes to laboratory decision making tasks, suggesting that the impairment in decision-making is not limited to decisions about taking drugs. In the current experiment, opioid-addicted individuals and matched controls with no history of illicit drug use were administered a probabilistic classification task that embeds both reward-based and punishment-based learning trials, and a computational model of decision making was applied to understand the mechanisms describing individuals' performance on the task. Although behavioral results showed that opioid-addicted individuals performed as well as controls on both reward- and punishment-based learning, the modeling results suggested subtle differences in how decisions were made between the two groups. Specifically, the opioid-addicted group showed decreased tendency to repeat prior responses, meaning that they were more likely to "chase reward" when expectancies were violated, whereas controls were more likely to stick with a previously-successful response rule, despite occasional expectancy violations. This tendency to chase short-term reward, potentially at the expense of developing rules that maximize reward over the long term, may be a contributing factor to opioid addiction. Further work is indicated to better understand whether this tendency arises as a result of brain changes in the wake of continued opioid use/abuse, or might be a pre-existing factor that may contribute to risk for addiction. PMID:26381438
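
    The reduced tendency to repeat prior responses reported above is usually captured by a perseveration ("stickiness") term added to a softmax choice rule, which is the kind of parameter a fitted model can compare between groups. The sketch below shows only that term; the Q-values, inverse temperature and stickiness weights are illustrative assumptions.

        # Softmax choice with a perseveration (stickiness) bonus; parameter values assumed.
        import numpy as np

        def choice_probs(q, prev_action, stickiness, beta=3.0):
            bonus = np.zeros_like(q)
            if prev_action is not None:
                bonus[prev_action] = stickiness    # bias toward repeating the last response
            logits = beta * (q + bonus)
            e = np.exp(logits - logits.max())      # numerically stable softmax
            return e / e.sum()

        q = np.array([0.55, 0.45])                 # nearly equivalent learned values
        print("high stickiness:", choice_probs(q, prev_action=1, stickiness=0.3).round(2))
        print("low stickiness :", choice_probs(q, prev_action=1, stickiness=0.0).round(2))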

  20. Effective reinforcement learning following cerebellar damage requires a balance between exploration and motor noise

    PubMed Central

    Therrien, Amanda S.; Wolpert, Daniel M.

    2016-01-01

    See Miall and Galea (doi: 10.1093/awv343) for a scientific commentary on this article. Reinforcement and error-based processes are essential for motor learning, with the cerebellum thought to be required only for the error-based mechanism. Here we examined learning and retention of a reaching skill under both processes. Control subjects learned similarly from reinforcement and error-based feedback, but showed much better retention under reinforcement. To apply reinforcement to cerebellar patients, we developed a closed-loop reinforcement schedule in which task difficulty was controlled based on recent performance. This schedule produced substantial learning in cerebellar patients and controls. Cerebellar patients varied in their learning under reinforcement but fully retained what was learned. In contrast, they showed complete lack of retention in error-based learning. We developed a mechanistic model of the reinforcement task and found that learning depended on a balance between exploration variability and motor noise. While the cerebellar and control groups had similar exploration variability, the patients had greater motor noise and hence learned less. Our results suggest that cerebellar damage indirectly impairs reinforcement learning by increasing motor noise, but does not interfere with the reinforcement mechanism itself. Therefore, reinforcement can be used to learn and retain novel skills, but optimal reinforcement learning requires a balance between exploration variability and motor noise. PMID:26626368
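
    The mechanistic point of the model described above can be shown with a scalar reaching task: a learner can only credit the exploratory perturbation it generated itself, so added motor noise corrupts the credit assignment and slows learning even when total movement variability looks similar. The Gaussian noise terms, reward rule and learning rate are assumptions for illustration.

        # Reward-based aiming with creditable exploration vs uncreditable motor noise.
        import numpy as np

        def run(exploration_sd, motor_sd, target=10.0, trials=500, lr=0.5, seed=0):
            rng = np.random.default_rng(seed)
            aim = 0.0
            for _ in range(trials):
                exploration = rng.normal(0, exploration_sd)   # self-generated, creditable
                noise = rng.normal(0, motor_sd)               # motor noise, not creditable
                reach = aim + exploration + noise
                reward = 1.0 if abs(reach - target) < abs(aim - target) else 0.0
                aim += lr * reward * exploration              # credit only the exploration term
            return round(aim, 2)

        print("low motor noise :", run(exploration_sd=1.0, motor_sd=0.5))
        print("high motor noise:", run(exploration_sd=1.0, motor_sd=3.0))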

  1. Effective reinforcement learning following cerebellar damage requires a balance between exploration and motor noise.

    PubMed

    Therrien, Amanda S; Wolpert, Daniel M; Bastian, Amy J

    2016-01-01

    Reinforcement and error-based processes are essential for motor learning, with the cerebellum thought to be required only for the error-based mechanism. Here we examined learning and retention of a reaching skill under both processes. Control subjects learned similarly from reinforcement and error-based feedback, but showed much better retention under reinforcement. To apply reinforcement to cerebellar patients, we developed a closed-loop reinforcement schedule in which task difficulty was controlled based on recent performance. This schedule produced substantial learning in cerebellar patients and controls. Cerebellar patients varied in their learning under reinforcement but fully retained what was learned. In contrast, they showed complete lack of retention in error-based learning. We developed a mechanistic model of the reinforcement task and found that learning depended on a balance between exploration variability and motor noise. While the cerebellar and control groups had similar exploration variability, the patients had greater motor noise and hence learned less. Our results suggest that cerebellar damage indirectly impairs reinforcement learning by increasing motor noise, but does not interfere with the reinforcement mechanism itself. Therefore, reinforcement can be used to learn and retain novel skills, but optimal reinforcement learning requires a balance between exploration variability and motor noise. PMID:26626368

  2. Does temporal discounting explain unhealthy behavior? A systematic review and reinforcement learning perspective.

    PubMed

    Story, Giles W; Vlaev, Ivo; Seymour, Ben; Darzi, Ara; Dolan, Raymond J

    2014-01-01

    The tendency to make unhealthy choices is hypothesized to be related to an individual's temporal discount rate, the theoretical rate at which they devalue delayed rewards. Furthermore, a particular form of temporal discounting, hyperbolic discounting, has been proposed to explain why unhealthy behavior can occur despite healthy intentions. We examine these two hypotheses in turn. We first systematically review studies which investigate whether discount rates can predict unhealthy behavior. These studies reveal that high discount rates for money (and in some instances food or drug rewards) are associated with several unhealthy behaviors and markers of health status, establishing discounting as a promising predictive measure. We secondly examine whether intention-incongruent unhealthy actions are consistent with hyperbolic discounting. We conclude that intention-incongruent actions are often triggered by environmental cues or changes in motivational state, whose effects are not parameterized by hyperbolic discounting. We propose a framework for understanding these state-based effects in terms of the interplay of two distinct reinforcement learning mechanisms: a "model-based" (or goal-directed) system and a "model-free" (or habitual) system. Under this framework, while discounting of delayed health may contribute to the initiation of unhealthy behavior, with repetition, many unhealthy behaviors become habitual; if health goals then change, habitual behavior can still arise in response to environmental cues. We propose that the burgeoning development of computational models of these processes will permit further identification of health decision-making phenotypes. PMID:24659960

  3. Does temporal discounting explain unhealthy behavior? A systematic review and reinforcement learning perspective

    PubMed Central

    Story, Giles W.; Vlaev, Ivo; Seymour, Ben; Darzi, Ara; Dolan, Raymond J.

    2014-01-01

    The tendency to make unhealthy choices is hypothesized to be related to an individual's temporal discount rate, the theoretical rate at which they devalue delayed rewards. Furthermore, a particular form of temporal discounting, hyperbolic discounting, has been proposed to explain why unhealthy behavior can occur despite healthy intentions. We examine these two hypotheses in turn. We first systematically review studies which investigate whether discount rates can predict unhealthy behavior. These studies reveal that high discount rates for money (and in some instances food or drug rewards) are associated with several unhealthy behaviors and markers of health status, establishing discounting as a promising predictive measure. We secondly examine whether intention-incongruent unhealthy actions are consistent with hyperbolic discounting. We conclude that intention-incongruent actions are often triggered by environmental cues or changes in motivational state, whose effects are not parameterized by hyperbolic discounting. We propose a framework for understanding these state-based effects in terms of the interplay of two distinct reinforcement learning mechanisms: a “model-based” (or goal-directed) system and a “model-free” (or habitual) system. Under this framework, while discounting of delayed health may contribute to the initiation of unhealthy behavior, with repetition, many unhealthy behaviors become habitual; if health goals then change, habitual behavior can still arise in response to environmental cues. We propose that the burgeoning development of computational models of these processes will permit further identification of health decision-making phenotypes. PMID:24659960

  4. Reinforcement learning agents providing advice in complex video games

    NASA Astrophysics Data System (ADS)

    Taylor, Matthew E.; Carboni, Nicholas; Fachantidis, Anestis; Vlahavas, Ioannis; Torrey, Lisa

    2014-01-01

    This article introduces a teacher-student framework for reinforcement learning, synthesising and extending material that appeared in conference proceedings [Torrey, L., & Taylor, M. E. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. Proceedings of the international conference on autonomous agents and multiagent systems] and in a non-archival workshop paper [Carboni, N., & Taylor, M. E. (2013, May). Preliminary results for 1 vs. 1 tactics in StarCraft. Proceedings of the adaptive and learning agents workshop (at AAMAS-13)]. In this framework, a teacher agent instructs a student agent by suggesting actions the student should take as it learns. However, the teacher may only give such advice a limited number of times. We present several novel algorithms that teachers can use to budget their advice effectively, and we evaluate them in two complex video games: StarCraft and Pac-Man. Our results show that the same amount of advice, given at different moments, can have different effects on student learning, and that teachers can significantly affect student learning even when students use different learning methods and state representations.
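
    One way to make the budget idea concrete: the sketch below implements an importance-based advising heuristic in which the teacher spends its limited advice budget only when its own Q-values say the state matters and the student is about to act suboptimally. This is an illustrative heuristic in the spirit of the article, not a reproduction of its algorithms; the Q-table and threshold are made up.

        import numpy as np

        def advise(teacher_q, state, student_action, budget, threshold=0.5):
            """Importance-based advising: spend one unit of a limited advice budget only
            when the state matters (large gap between best and worst action under the
            teacher's Q-values) and the student is about to deviate from the teacher's
            preferred action. Returns (action_to_take, remaining_budget)."""
            q = teacher_q[state]
            importance = q.max() - q.min()
            best = int(np.argmax(q))
            if budget > 0 and importance > threshold and student_action != best:
                return best, budget - 1          # give advice, consume budget
            return student_action, budget        # stay silent

        # toy usage: teacher Q-table over 3 states x 2 actions, budget of 2 pieces of advice
        teacher_q = np.array([[1.0, 0.0], [0.1, 0.12], [0.0, 2.0]])
        budget = 2
        for state, proposed in [(0, 1), (1, 0), (2, 0), (0, 1)]:
            action, budget = advise(teacher_q, state, proposed, budget)
            print("state", state, "student proposed", proposed, "->", action, "| budget left:", budget)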

  5. Deep Brain Stimulation of the Subthalamic Nucleus Improves Reward-Based Decision-Learning in Parkinson's Disease

    PubMed Central

    van Wouwe, Nelleke C.; Ridderinkhof, K. R.; van den Wildenberg, W. P. M.; Band, G. P. H.; Abisogun, A.; Elias, W. J.; Frysinger, R.; Wylie, S. A.

    2011-01-01

    Recently, the subthalamic nucleus (STN) has been shown to be critically involved in decision-making, action selection, and motor control. Here we investigate the effect of deep brain stimulation (DBS) of the STN on reward-based decision-learning in patients diagnosed with Parkinson's disease (PD). We determined computational measures of outcome evaluation and reward prediction from PD patients who performed a probabilistic reward-based decision-learning task. In previous work, these measures covaried with activation in the nucleus caudatus (outcome evaluation during the early phases of learning) and the putamen (reward prediction during later phases of learning). We observed that stimulation of the STN motor regions in PD patients served to improve reward-based decision-learning, probably through its effect on activity in frontostriatal motor loops (prominently involving the putamen and, hence, reward prediction). In a subset of relatively younger patients with relatively shorter disease duration, the effects of DBS appeared to spread to more cognitive regions of the STN, benefiting loops that connect the caudate to various prefrontal areas important for outcome evaluation. These results highlight positive effects of STN stimulation on cognitive functions that may benefit PD patients in daily-life association-learning situations. PMID:21519377

  6. Measuring anhedonia: impaired ability to pursue, experience, and learn about reward.

    PubMed

    Thomsen, Kristine Rømer

    2015-01-01

    Ribot's (1896) long standing definition of anhedonia as "the inability to experience pleasure" has been challenged recently following progress in affective neuroscience. In particular, accumulating evidence suggests that reward consists of multiple subcomponents of wanting, liking and learning, as initially outlined by Berridge and Robinson (2003), and these processes have been proposed to relate to appetitive, consummatory and satiety phases of a pleasure cycle. Building on this work, we recently proposed to reconceptualize anhedonia as "impairments in the ability to pursue, experience, and/or learn about pleasure, which is often, but not always accessible to conscious awareness." (Rømer Thomsen et al., 2015). This framework is in line with Treadway and Zald's (2011) proposal to differentiate between motivational and consummatory types of anhedonia, and stresses the need to combine traditional self-report measures with behavioral measures or procedures. In time, this approach may lead to improved clinical assessment and treatment. In line with our reconceptualization, increasing evidence suggests that reward processing deficits are not restricted to impaired hedonic impact in major psychiatric disorders. Successful translations of animal models have led to strong evidence of impairments in the ability to pursue and learn about reward in psychiatric disorders such as major depressive disorder, schizophrenia, and addiction. It is of high importance that we continue to systematically target impairments in all phases of reward processing across disorders using behavioral testing in combination with neuroimaging techniques. This in turn has implications for diagnosis and treatment, and is essential for the purposes of identifying the underlying neurobiological mechanisms. Here I review recent progress in the development and application of behavioral procedures that measure subcomponents of anhedonia across relevant patient groups, and discuss methodological caveats as

  7. Measuring anhedonia: impaired ability to pursue, experience, and learn about reward

    PubMed Central

    Thomsen, Kristine Rømer

    2015-01-01

    Ribot’s (1896) long standing definition of anhedonia as “the inability to experience pleasure” has been challenged recently following progress in affective neuroscience. In particular, accumulating evidence suggests that reward consists of multiple subcomponents of wanting, liking and learning, as initially outlined by Berridge and Robinson (2003), and these processes have been proposed to relate to appetitive, consummatory and satiety phases of a pleasure cycle. Building on this work, we recently proposed to reconceptualize anhedonia as “impairments in the ability to pursue, experience, and/or learn about pleasure, which is often, but not always accessible to conscious awareness.” (Rømer Thomsen et al., 2015). This framework is in line with Treadway and Zald’s (2011) proposal to differentiate between motivational and consummatory types of anhedonia, and stresses the need to combine traditional self-report measures with behavioral measures or procedures. In time, this approach may lead to improved clinical assessment and treatment. In line with our reconceptualization, increasing evidence suggests that reward processing deficits are not restricted to impaired hedonic impact in major psychiatric disorders. Successful translations of animal models have led to strong evidence of impairments in the ability to pursue and learn about reward in psychiatric disorders such as major depressive disorder, schizophrenia, and addiction. It is of high importance that we continue to systematically target impairments in all phases of reward processing across disorders using behavioral testing in combination with neuroimaging techniques. This in turn has implications for diagnosis and treatment, and is essential for the purposes of identifying the underlying neurobiological mechanisms. Here I review recent progress in the development and application of behavioral procedures that measure subcomponents of anhedonia across relevant patient groups, and discuss

  8. Democratic reinforcement: learning via self-organization

    SciTech Connect

    Stassinopoulos, D.; Bak, P.

    1995-12-31

    The problem of learning in the absence of external intelligence is discussed in the context of a simple model. The model consists of a set of randomly connected or layered integrate-and-fire neurons. Inputs to and outputs from the environment are connected randomly to subsets of neurons. The connections between firing neurons are strengthened or weakened according to whether the action is successful or not. The model departs from the traditional gradient-descent-based approaches to learning by operating at a highly susceptible "critical" state, with low activity and sparse connections between firing neurons. Quantitative studies on the performance of our model in a simple association task show that by tuning our system close to this critical state we can obtain dramatic gains in performance.

  9. A reinforcement learning-based architecture for fuzzy logic control

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1992-01-01

    This paper introduces a new method for learning to refine a rule-based fuzzy logic controller. A reinforcement learning technique is used in conjunction with a multilayer neural network model of a fuzzy controller. The approximate reasoning based intelligent control (ARIC) architecture proposed here learns by updating its prediction of the physical system's behavior and fine tunes a control knowledge base. Its theory is related to Sutton's temporal difference (TD) method. Because ARIC has the advantage of using the control knowledge of an experienced operator and fine tuning it through the process of learning, it learns faster than systems that train networks from scratch. The approach is applied to a cart-pole balancing system.

  10. Emotional Multiagent Reinforcement Learning in Spatial Social Dilemmas.

    PubMed

    Yu, Chao; Zhang, Minjie; Ren, Fenghui; Tan, Guozhen

    2015-12-01

    Social dilemmas have attracted extensive interest in the research of multiagent systems in order to study the emergence of cooperative behaviors among selfish agents. Understanding how agents can achieve cooperation in social dilemmas through learning from local experience is a critical problem that has motivated researchers for decades. This paper investigates the possibility of exploiting emotions in agent learning in order to facilitate the emergence of cooperation in social dilemmas. In particular, the spatial version of social dilemmas is considered to study the impact of local interactions on the emergence of cooperation in the whole system. A double-layered emotional multiagent reinforcement learning framework is proposed to endow agents with internal cognitive and emotional capabilities that can drive these agents to learn cooperative behaviors. Experimental results reveal that various network topologies and agent heterogeneities have significant impacts on agent learning behaviors in the proposed framework, and under certain circumstances, high levels of cooperation can be achieved among the agents. PMID:25769173

  11. Learning to maximize reward rate: a model based on semi-Markov decision processes

    PubMed Central

    Khodadadi, Arash; Fakhari, Pegah; Busemeyer, Jerome R.

    2014-01-01

    When animals have to make a number of decisions during a limited time interval, they face a fundamental problem: how much time they should spend on each decision in order to achieve the maximum possible total outcome. Deliberating more on one decision usually leads to a better outcome, but less time will remain for other decisions. In the framework of sequential sampling models, the question is how animals learn to set their decision threshold such that the total expected outcome achieved during a limited time is maximized. The aim of this paper is to provide a theoretical framework for answering this question. To this end, we consider an experimental design in which each trial can come from one of the several possible “conditions.” A condition specifies the difficulty of the trial, the reward, the penalty and so on. We show that to maximize the expected reward during a limited time, the subject should set a separate value of decision threshold for each condition. We propose a model of learning the optimal value of decision thresholds based on the theory of semi-Markov decision processes (SMDP). In our model, the experimental environment is modeled as an SMDP with each “condition” being a “state” and the value of decision thresholds being the “actions” taken in those states. The problem of finding the optimal decision thresholds then is cast as the stochastic optimal control problem of taking actions in each state in the corresponding SMDP such that the average reward rate is maximized. Our model utilizes a biologically plausible learning algorithm to solve this problem. The simulation results show that at the beginning of learning the model chooses high values of decision threshold, which lead to sub-optimal performance. With experience, however, the model learns to lower the value of decision thresholds until finally it finds the optimal values. PMID:24904252
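
    A hedged sketch of the general recipe (conditions as states, candidate thresholds as actions, an average-reward update that charges each decision for the time it consumes) is given below. The toy trial model, step sizes, and update rule are illustrative simplifications, not the paper's biologically plausible algorithm.

        import numpy as np

        rng = np.random.default_rng(2)

        def trial_outcome(condition, threshold):
            """Toy trial: higher thresholds are slower but more often correct;
            'condition' scales difficulty. Returns (reward, duration in seconds)."""
            p_correct = 1.0 - 0.5 * np.exp(-threshold / condition)
            reward = float(rng.random() < p_correct)
            duration = 0.3 + 0.1 * threshold
            return reward, duration

        thresholds = np.array([0.5, 1.0, 2.0, 4.0])   # candidate actions
        conditions = [1.0, 3.0]                       # easy and hard trial types (states)
        Q = np.zeros((len(conditions), len(thresholds)))
        rho, alpha, beta = 0.0, 0.05, 0.01            # average reward rate and step sizes

        for t in range(30000):
            s = int(rng.integers(len(conditions)))
            a = int(rng.integers(len(thresholds))) if rng.random() < 0.1 else int(np.argmax(Q[s]))
            r, dur = trial_outcome(conditions[s], thresholds[a])
            # average-reward (SMDP-style) update: charge the decision for the time it took
            Q[s, a] += alpha * (r - rho * dur - Q[s, a])
            rho += beta * (r / dur - rho)

        print("estimated reward rate:", round(rho, 2))
        print("preferred threshold per condition:", thresholds[np.argmax(Q, axis=1)])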

  12. Learning from experience: Event-related potential correlates of reward processing, neural adaptation, and behavioral choice

    PubMed Central

    Walsh, Matthew M.; Anderson, John R.

    2012-01-01

    To behave adaptively, we must learn from the consequences of our actions. Studies using event-related potentials (ERPs) have been informative with respect to the question of how such learning occurs. These studies have revealed a frontocentral negativity termed the feedback-related negativity (FRN) that appears after negative feedback. According to one prominent theory, the FRN tracks the difference between the values of actual and expected outcomes, or reward prediction errors. As such, the FRN provides a tool for studying reward valuation and decision making. We begin this review by examining the neural significance of the FRN. We then examine its functional significance. To understand the cognitive processes that occur when the FRN is generated, we explore variables that influence its appearance and amplitude. Specifically, we evaluate four hypotheses: (1) the FRN encodes a quantitative reward prediction error; (2) the FRN is evoked by outcomes and by stimuli that predict outcomes; (3) the FRN and behavior change with experience; and (4) the system that produces the FRN is maximally engaged by volitional actions. PMID:22683741

  13. Preliminary investigation of flexibility in learning color-reward associations in gibbons (Hylobatidae).

    PubMed

    D'Agostino, Justin; Cunningham, Clare

    2015-08-01

    Previous studies in learning set formation have shown that most animal species can learn to learn with subsequent novel presentations being solved in fewer presentations than when they first encounter a task. Gibbons (Hylobatidae) have generally struggled with these tasks and do not show the learning to learn pattern found in other species. This is surprising given their phylogenetic position and level of cortical development. However, there have been conflicting results with some studies demonstrating higher level learning abilities in these small apes. This study attempts to clarify whether gibbons can in fact use knowledge gained during one learning task to facilitate performance on a similar but novel problem that would be a precursor to development of a learning set. We tested 16 captive gibbons' ability to associate color cues with provisioned food items in two experiments where they experienced a period of learning followed by experimental trials during which they could potentially use knowledge gained in their first learning experience to facilitate the solution of subsequent novel tasks. Our results are similar to most previous studies in that there was no evidence of gibbons being able to use previously acquired knowledge to solve a novel task. However, once the learning association was made, the gibbons performed well above chance. We found no differences across color associations, indicating learning was not affected by the particular color/reward association. However, there were variations in learning performance with regard to genera. The hoolock (Hoolock leuconedys) and siamang (Symphalangus syndactylus) learned the fastest and the lar group (Hylobates sp.) learned the slowest. We caution that these results could be due to the small sample size and to the captive environment in which these gibbons were raised. However, it is likely that environmental variability in the native habitats of the subjects tested could facilitate the evolution of flexible

  14. The Good, the Bad, and the Irrelevant: Neural Mechanisms of Learning Real and Hypothetical Rewards and Effort

    PubMed Central

    Kolling, Nils; Nelissen, Natalie; Wittmann, Marco K.; Harmer, Catherine J.; Rushworth, Matthew F. S.

    2015-01-01

    Natural environments are complex, and a single choice can lead to multiple outcomes. Agents should learn which outcomes are due to their choices and therefore relevant for future decisions and which are stochastic in ways common to all choices and therefore irrelevant for future decisions between options. We designed an experiment in which human participants learned the varying reward and effort magnitudes of two options and repeatedly chose between them. The reward associated with a choice was randomly real or hypothetical (i.e., participants only sometimes received the reward magnitude associated with the chosen option). The real/hypothetical nature of the reward on any one trial was, however, irrelevant for learning the longer-term values of the choices, and participants ought to have only focused on the informational content of the outcome and disregarded whether it was a real or hypothetical reward. However, we found that participants showed an irrational choice bias, preferring choices that had previously led, by chance, to a real reward in the last trial. Amygdala and ventromedial prefrontal activity was related to the way in which participants' choices were biased by real reward receipt. By contrast, activity in dorsal anterior cingulate cortex, frontal operculum/anterior insula, and especially lateral anterior prefrontal cortex was related to the degree to which participants resisted this bias and chose effectively in a manner guided by aspects of outcomes that had real and more sustained relationships with particular choices, suppressing irrelevant reward information for more optimal learning and decision making. SIGNIFICANCE STATEMENT In complex natural environments, a single choice can lead to multiple outcomes. Human agents should only learn from outcomes that are due to their choices, not from outcomes without such a relationship. We designed an experiment to measure learning about reward and effort magnitudes in an environment in which other features

  15. Learning and tuning fuzzy logic controllers through reinforcements

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1992-01-01

    This paper presents a new method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system. In particular, our generalized approximate reasoning-based intelligent control (GARIC) architecture (1) learns and tunes a fuzzy logic controller even when only weak reinforcement, such as a binary failure signal, is available; (2) introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; (3) introduces a new localized mean of maximum (LMOM) method in combining the conclusions of several firing control rules; and (4) learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward neural network, which can then adaptively improve performance by using gradient descent methods. We extend the AHC algorithm of Barto et al. (1983) to include the prior control knowledge of human operators. The GARIC architecture is applied to a cart-pole balancing system and demonstrates significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

  16. Learning and tuning fuzzy logic controllers through reinforcements

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1992-01-01

    A new method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system is presented. In particular, our Generalized Approximate Reasoning-based Intelligent Control (GARIC) architecture: (1) learns and tunes a fuzzy logic controller even when only weak reinforcement, such as a binary failure signal, is available; (2) introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; (3) introduces a new localized mean of maximum (LMOM) method in combining the conclusions of several firing control rules; and (4) learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward network, which can then adaptively improve performance by using gradient descent methods. We extend the AHC algorithm of Barto, Sutton, and Anderson to include the prior control knowledge of human operators. The GARIC architecture is applied to a cart-pole balancing system and has demonstrated significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

  17. Towards autonomous neuroprosthetic control using Hebbian reinforcement learning

    NASA Astrophysics Data System (ADS)

    Mahmoudi, Babak; Pohlmeyer, Eric A.; Prins, Noeline W.; Geng, Shijia; Sanchez, Justin C.

    2013-12-01

    Objective. Our goal was to design an adaptive neuroprosthetic controller that could learn the mapping from neural states to prosthetic actions and automatically adjust adaptation using only a binary evaluative feedback as a measure of desirability/undesirability of performance. Approach. Hebbian reinforcement learning (HRL) in a connectionist network was used for the design of the adaptive controller. The method combines the efficiency of supervised learning with the generality of reinforcement learning. The convergence properties of this approach were studied using both closed-loop control simulations and open-loop simulations that used primate neural data from robot-assisted reaching tasks. Main results. The HRL controller was able to perform classification and regression tasks using its episodic and sequential learning modes, respectively. In our experiments, the HRL controller quickly achieved convergence to an effective control policy, followed by robust performance. The controller also automatically stopped adapting the parameters after converging to a satisfactory control policy. Additionally, when the input neural vector was reorganized, the controller resumed adaptation to maintain performance. Significance. By estimating an evaluative feedback directly from the user, the HRL control algorithm may provide an efficient method for autonomous adaptation of neuroprosthetic systems. This method may enable the user to teach the controller the desired behavior using only a simple feedback signal.
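
    As a hedged sketch of reward-gated Hebbian learning with a binary evaluative signal, the code below maps a neural-state vector to discrete actions, strengthens or weakens the taken input-action pairing according to the good/bad feedback, and freezes adaptation once recent performance is high. The network, synthetic data, and stopping rule are illustrative assumptions, not the HRL controller itself.

        import numpy as np

        rng = np.random.default_rng(3)

        n_inputs, n_actions = 8, 4
        W = rng.normal(0.0, 0.1, size=(n_actions, n_inputs))
        eta = 0.05
        recent_success = []

        def step(neural_state, correct_action):
            """One interaction: choose an action from the current weights, receive a
            binary good/bad evaluation, and apply a reward-gated Hebbian update
            (strengthen the taken input-action pairing on success, weaken it on failure)."""
            scores = W @ neural_state
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()
            action = int(rng.choice(n_actions, p=probs))
            feedback = 1.0 if action == correct_action else -1.0      # binary evaluative signal
            if np.mean(recent_success[-50:] or [0.0]) < 0.9:          # stop adapting when good enough
                W[action] += eta * feedback * neural_state            # Hebbian, gated by feedback
            recent_success.append(feedback > 0)
            return action

        # toy usage: each intended action has a noisy prototype neural state
        prototypes = rng.normal(size=(n_actions, n_inputs))
        for _ in range(2000):
            target = int(rng.integers(n_actions))
            step(prototypes[target] + 0.3 * rng.normal(size=n_inputs), target)
        print("recent accuracy:", np.mean(recent_success[-200:]))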

  18. Distributed Reinforcement Learning Approach for Vehicular Ad Hoc Networks

    NASA Astrophysics Data System (ADS)

    Wu, Celimuge; Kumekawa, Kazuya; Kato, Toshihiko

    In Vehicular Ad hoc Networks (VANETs), general-purpose ad hoc routing protocols such as AODV cannot work efficiently due to the frequent changes in network topology caused by vehicle movement. This paper proposes a VANET routing protocol QLAODV (Q-Learning AODV) which suits unicast applications in high mobility scenarios. QLAODV is a distributed reinforcement learning routing protocol, which uses a Q-Learning algorithm to infer network state information and uses unicast control packets to check path availability in real time, allowing Q-Learning to work efficiently in a highly dynamic network environment. QLAODV benefits from a dynamic route change mechanism that allows it to react quickly to network topology changes. We present an analysis of the performance of QLAODV by simulation using different mobility models. The simulation results show that QLAODV can efficiently handle unicast applications in VANETs.
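
    A minimal sketch of Q-learning applied to next-hop selection, in the spirit of (but not reproducing) QLAODV: per-destination Q-values over neighbors are refreshed by control-packet probes that carry the neighbor's own best value toward the destination. The update rule, reward definition, and node names are assumptions for illustration.

        from collections import defaultdict

        # Q[(node, destination)][next_hop] scores how good it is for 'node' to forward
        # packets bound for 'destination' via 'next_hop'.
        Q = defaultdict(lambda: defaultdict(float))
        ALPHA, GAMMA = 0.5, 0.9

        def update(node, dest, next_hop, link_quality, neighbor_best):
            """Refresh one entry after a control packet probes the link to next_hop.
            link_quality is the current availability of that link in [0, 1];
            neighbor_best is the neighbor's own best Q-value toward the destination."""
            reward = 1.0 if (link_quality > 0 and next_hop == dest) else 0.0
            target = reward + GAMMA * link_quality * neighbor_best
            Q[(node, dest)][next_hop] += ALPHA * (target - Q[(node, dest)][next_hop])

        def best_next_hop(node, dest):
            table = Q[(node, dest)]
            return max(table, key=table.get) if table else None

        # toy usage: node A learns that B is a better relay toward D than C
        for _ in range(10):
            update("A", "D", "B", link_quality=1.0, neighbor_best=0.8)
            update("A", "D", "C", link_quality=0.4, neighbor_best=0.8)
        print("A forwards packets for D via:", best_next_hop("A", "D"))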

  19. Abnormal Brain Activity in Social Reward Learning in Children with Autism Spectrum Disorder: An fMRI Study

    PubMed Central

    Choi, Uk-Su; Kim, Sun-Young; Sim, Hyeon Jeong; Lee, Seo-Young; Park, Sung-Yeon; Jeong, Joon-Sup; Seol, Kyeong In; Yoon, Hyo-Woon; Jhung, Kyungun; Park, Jee-In

    2015-01-01

    Purpose: We aimed to determine whether Autism Spectrum Disorder (ASD) would show neural abnormality of the social reward system using functional MRI (fMRI). Materials and Methods: 27 ASDs and 12 typically developing controls (TDCs) participated in this study. The social reward task was developed, and all participants performed the task during fMRI scanning. Results: ASDs and TDCs with a social reward learning effect were selected on the basis of behavior data. We found significant differences in brain activation between the ASDs and TDCs showing a social reward learning effect. Compared with the TDCs, the ASDs showed reduced activity in the right dorsolateral prefrontal cortex, right orbitofrontal cortex, right parietal lobe, and occipital lobe; however, they showed increased activity in the right parahippocampal gyrus and superior temporal gyrus. Conclusion: These findings suggest that there might be neural abnormality of the social reward learning system of ASDs. Although this study has several potential limitations, it presents novel findings in the different neural mechanisms of social reward learning in children with ASD and a possible useful biomarker of high-functioning ASDs. PMID:25837176

  20. An Obesity-Predisposing Variant of the FTO Gene Regulates D2R-Dependent Reward Learning.

    PubMed

    Sevgi, Meltem; Rigoux, Lionel; Kühn, Anne B; Mauer, Jan; Schilbach, Leonhard; Hess, Martin E; Gruendler, Theo O J; Ullsperger, Markus; Stephan, Klaas Enno; Brüning, Jens C; Tittgemeyer, Marc

    2015-09-01

    Variations in the fat mass and obesity-associated (FTO) gene are linked to obesity. However, the underlying neurobiological mechanisms by which these genetic variants influence obesity, behavior, and brain are unknown. Given that Fto regulates D2/3R signaling in mice, we tested in humans whether variants in FTO would interact with a variant in the ANKK1 gene, which alters D2R signaling and is also associated with obesity. In a behavioral and fMRI study, we demonstrate that gene variants of FTO affect dopamine (D2)-dependent midbrain responses to reward learning and behavioral responses associated with learning from negative outcome in humans. Furthermore, dynamic causal modeling confirmed that FTO variants modulate the connectivity in a basic reward circuit of meso-striato-prefrontal regions, suggesting a mechanism by which genetic predisposition alters reward processing not only in obesity, but also in other disorders with altered D2R-dependent impulse control, such as addiction. Significance statement: Variations in the fat mass and obesity-associated (FTO) gene are associated with obesity. Here we demonstrate that variants of FTO affect dopamine-dependent midbrain responses and learning from negative outcomes in humans during a reward learning task. Furthermore, FTO variants modulate the connectivity in a basic reward circuit of meso-striato-prefrontal regions, suggesting a mechanism by which genetic vulnerability in reward processing can increase predisposition to obesity. PMID:26354923

  1. Multiagent reinforcement learning in the Iterated Prisoner's Dilemma.

    PubMed

    Sandholm, T W; Crites, R H

    1996-01-01

    Reinforcement learning (RL) is based on the idea that the tendency to produce an action should be strengthened (reinforced) if it produces favorable results, and weakened if it produces unfavorable results. Q-learning is a recent RL algorithm that does not need a model of its environment and can be used on-line. Therefore, it is well suited for use in repeated games against an unknown opponent. Most RL research has been confined to single-agent settings or to multiagent settings where the agents have totally positively correlated payoffs (team problems) or totally negatively correlated payoffs (zero-sum games). This paper is an empirical study of reinforcement learning in the Iterated Prisoner's Dilemma (IPD), where the agents' payoffs are neither totally positively nor totally negatively correlated. RL is considerably more difficult in such a domain. This paper investigates the ability of a variety of Q-learning agents to play the IPD game against an unknown opponent. In some experiments, the opponent is the fixed strategy Tit-For-Tat, while in others it is another Q-learner. All the Q-learners learned to play optimally against Tit-For-Tat. Playing against another learner was more difficult because the adaptation of the other learner created a non-stationary environment, and because the other learner was not endowed with any a priori knowledge about the IPD game such as a policy designed to encourage cooperation. The learners that were studied varied along three dimensions: the length of history they received as context, the type of memory they employed (lookup tables based on restricted history windows or recurrent neural networks that can theoretically store features from arbitrarily deep in the past), and the exploration schedule they followed. Although all the learners faced difficulties when playing against other learners, agents with longer history windows, lookup table memories, and longer exploration schedules fared best in the IPD games. PMID:8924633
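
    A hedged sketch of the simplest kind of agent studied in such experiments, a lookup-table Q-learner whose state is the previous joint move, playing against Tit-for-Tat, is shown below; the payoff matrix, learning rate, and exploration schedule are illustrative choices, not the paper's exact settings.

        import numpy as np

        rng = np.random.default_rng(4)

        PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
        ACTIONS = ["C", "D"]

        def play_vs_tit_for_tat(n_rounds=20000, alpha=0.2, gamma=0.95):
            """Q-learner whose state is the previous joint move (history window of 1),
            playing the Iterated Prisoner's Dilemma against Tit-for-Tat."""
            Q = {}                              # lookup-table memory
            state = ("C", "C")                  # assume mutual cooperation before round 1
            opp_next = "C"                      # Tit-for-Tat starts by cooperating
            eps = 1.0
            for t in range(n_rounds):
                q = Q.setdefault(state, np.zeros(2))
                a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q))
                my, opp = ACTIONS[a], opp_next
                r = PAYOFF[(my, opp)]
                next_state = (my, opp)
                nq = Q.setdefault(next_state, np.zeros(2))
                q[a] += alpha * (r + gamma * nq.max() - q[a])
                state, opp_next = next_state, my          # Tit-for-Tat copies our last move
                eps = max(0.01, eps * 0.9995)             # exploration schedule
            return Q

        Q = play_vs_tit_for_tat()
        for s, q in sorted(Q.items()):
            print(s, "-> best reply:", ACTIONS[int(np.argmax(q))])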

  2. Preliminary Work for Examining the Scalability of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Clouse, Jeff

    1998-01-01

    Researchers began studying automated agents that learn to perform multiple-step tasks early in the history of artificial intelligence (Samuel, 1963; Samuel, 1967; Waterman, 1970; Fikes, Hart & Nilsson, 1972). Multiple-step tasks are tasks that can only be solved via a sequence of decisions, such as control problems, robotics problems, classic problem-solving, and game-playing. The objective of agents attempting to learn such tasks is to use the resources they have available in order to become more proficient at the tasks. In particular, each agent attempts to develop a good policy, a mapping from states to actions, that allows it to select actions that optimize a measure of its performance on the task; for example, reducing the number of steps necessary to complete the task successfully. Our study focuses on reinforcement learning, a set of learning techniques where the learner performs trial-and-error experiments in the task and adapts its policy based on the outcome of those experiments. Much of the work in reinforcement learning has focused on a particular, simple representation, where every problem state is represented explicitly in a table, and associated with each state are the actions that can be chosen in that state. A major advantage of this table lookup representation is that one can prove that certain reinforcement learning techniques will develop an optimal policy for the current task. The drawback is that the representation limits the application of reinforcement learning to multiple-step tasks with relatively small state-spaces. There has been a little theoretical work that proves that convergence to optimal solutions can be obtained when using generalization structures, but the structures are quite simple. The theory says little about complex structures, such as multi-layer, feedforward artificial neural networks (Rumelhart & McClelland, 1986), but empirical results indicate that the use of reinforcement learning with such structures is promising.

  3. MOSAIC for multiple-reward environments.

    PubMed

    Sugimoto, Norikazu; Haruno, Masahiko; Doya, Kenji; Kawato, Mitsuo

    2012-03-01

    Reinforcement learning (RL) can provide a basic framework for autonomous robots to learn to control and maximize future cumulative rewards in complex environments. To achieve high performance, RL controllers must consider the complex external dynamics for movements and task (reward function) and optimize control commands. For example, a robot playing tennis and squash needs to cope with the different dynamics of a tennis or squash racket and such dynamic environmental factors as the wind. In addition, this robot has to tailor its tactics simultaneously under the rules of either game. This double complexity of the external dynamics and reward function sometimes becomes more complex when both the multiple dynamics and multiple reward functions switch implicitly, as in the situation of a real (multi-agent) game of tennis where one player cannot observe the intention of her opponents or her partner. The robot must consider its opponent's and its partner's unobservable behavioral goals (reward function). In this article, we address how an RL agent should be designed to handle such double complexity of dynamics and reward. We have previously proposed modular selection and identification for control (MOSAIC) to cope with nonstationary dynamics where appropriate controllers are selected and learned among many candidates based on the error of its paired dynamics predictor: the forward model. Here we extend this framework for RL and propose MOSAIC-MR architecture. It resembles MOSAIC in spirit and selects and learns an appropriate RL controller based on the RL controller's TD error using the errors of the dynamics (the forward model) and the reward predictors. Furthermore, unlike other MOSAIC variants for RL, RL controllers are not a priori paired with the fixed predictors of dynamics and rewards. The simulation results demonstrate that MOSAIC-MR outperforms other counterparts because of this flexible association ability among RL controllers, forward models, and reward predictors.
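
    The selection principle can be sketched roughly as follows: each module is scored by how well its forward model and reward predictor explain the current observation, and a softmax over those (negative squared) errors yields responsibilities that gate both control and learning. This is a generic MOSAIC-style sketch with made-up numbers, not the MOSAIC-MR architecture itself.

        import numpy as np

        def responsibilities(dyn_errors, reward_errors, sigma=1.0):
            """Soft module selection: modules whose forward-model and reward-predictor
            errors are small get high responsibility (softmax over negative squared error)."""
            err = np.asarray(dyn_errors) ** 2 + np.asarray(reward_errors) ** 2
            w = np.exp(-err / (2.0 * sigma ** 2))
            return w / w.sum()

        # toy usage: module 0 explains the current dynamics and reward function best,
        # so it should dominate both the blended action and the learning-rate gating.
        lam = responsibilities(dyn_errors=[0.1, 1.5, 0.9], reward_errors=[0.2, 0.3, 1.1])
        module_actions = np.array([0.4, -1.0, 2.0])       # each module's proposed control
        print("responsibilities:", np.round(lam, 3))
        print("blended action  :", float(lam @ module_actions))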

  4. Dynamics of learning in coupled oscillators tutored with delayed reinforcements

    NASA Astrophysics Data System (ADS)

    Trevisan, M. A.; Bouzat, S.; Samengo, I.; Mindlin, G. B.

    2005-07-01

    In this work we analyze the solutions of a simple system of coupled phase oscillators in which the connectivity is learned dynamically. The model is inspired by the process of learning of birdsongs by oscine birds. An oscillator acts as the generator of a basic rhythm and drives slave oscillators which are responsible for different motor actions. The driving signal arrives at each driven oscillator through two different pathways. One of them is a direct pathway. The other one is a reinforcement pathway, through which the signal arrives delayed. The coupling coefficients between the driving oscillator and the slave ones evolve in time following a Hebbian-like rule. We discuss the conditions under which a driven oscillator is capable of learning to lock to the driver. The resulting phase difference and connectivity are a function of the delay of the reinforcement. Around some specific delays, the system is capable of generating dramatic changes in the phase difference between the driver and the driven systems. We discuss the dynamical mechanism responsible for this effect and possible applications of this learning scheme.
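
    A hedged, heavily simplified sketch of the setup, a driver phase oscillator coupled to a slave through a direct and a delayed (reinforcement) pathway, with the coupling coefficient following a Hebbian-like rule, is given below using Euler integration; the specific equations and constants are illustrative assumptions, not the model analyzed in the paper.

        import numpy as np

        def simulate(delay=0.3, T=200.0, dt=0.01, w_drive=2.0, w_slave=1.8):
            """Driver phase p drives a slave phase q through a direct pathway and a
            pathway delayed by 'delay'; the coupling c grows with a Hebbian-like
            product of the two pathway signals and slowly decays."""
            steps = int(T / dt)
            lag = int(delay / dt)
            p_hist = np.zeros(steps)
            p, q, c = 0.0, 0.0, 0.1
            for i in range(steps):
                p_hist[i] = p
                delayed = np.sin(p_hist[max(0, i - lag)])     # reinforcement pathway
                direct = np.sin(p)                            # direct pathway
                p += dt * w_drive
                q += dt * (w_slave + c * np.sin(p - q))       # slave pulled toward driver
                c += dt * (0.5 * direct * delayed - 0.05 * c) # Hebbian-like learning rule
            return c, (p - q) % (2 * np.pi)

        c, phase_diff = simulate()
        print("learned coupling:", round(c, 3), "| final phase difference:", round(phase_diff, 3))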

  5. Functional Specialization within the Striatum along Both the Dorsal/Ventral and Anterior/Posterior Axes during Associative Learning via Reward and Punishment

    ERIC Educational Resources Information Center

    Mattfeld, Aaron T.; Gluck, Mark A.; Stark, Craig E. L.

    2011-01-01

    The goal of the present study was to elucidate the role of the human striatum in learning via reward and punishment during an associative learning task. Previous studies have identified the striatum as a critical component in the neural circuitry of reward-related learning. It remains unclear, however, under what task conditions, and to what…

  6. Impairment of reward-related learning by cholinergic cell ablation in the striatum.

    PubMed

    Kitabatake, Yasuji; Hikida, Takatoshi; Watanabe, Dai; Pastan, Ira; Nakanishi, Shigetada

    2003-06-24

    The striatum in the basal ganglia-thalamocortical circuitry is a key neural substrate that is implicated in motor balance and procedural learning. The projection neurons in the striatum are dynamically modulated by nigrostriatal dopaminergic input and intrastriatal cholinergic input. The role of intrastriatal acetylcholine (ACh) in learning behaviors, however, remains to be fully clarified. In this investigation, we examine the involvement of intrastriatal ACh in different categories of learning by selectively ablating the striatal cholinergic neurons with use of immunotoxin-mediated cell targeting. We show that selective ablation of cholinergic neurons in the striatum impairs procedural learning in the tone-cued T-maze memory task. Spatial delayed alternation in the T-maze learning test is also impaired by cholinergic cell elimination. In contrast, the deficit in striatal ACh transmission has no effect on motor learning in the rota-rod test or spatial learning in the Morris water-maze test or on contextual- and tone-cued conditioning fear responses. We also report that cholinergic cell elimination adaptively up-regulates nicotinic ACh receptors not only within the striatum but also in the cerebral cortex and substantia nigra. The present investigation indicates that cholinergic modulation in the local striatal circuit plays a pivotal role in regulation of neural circuitry involving reward-related procedural learning and working memory. PMID:12802017

  7. Awards and Incentives Can Help Speed Learning

    ERIC Educational Resources Information Center

    Blomgren, George W.; Thiss, Thomas N.

    1976-01-01

    Describes efforts in the banking industry to combine the reward elements of incentive programs with training activities. Concludes that incentive programs can be combined effectively with learning activities so that training is reinforced and learned behavior is also practiced. (WL)

  8. Temporally Dissociable Contributions of Human Medial Prefrontal Subregions to Reward-Guided Learning

    PubMed Central

    Iannaccone, Reto; Walitza, Susanne; Brandeis, Daniel; Brem, Silvia; Dolan, Raymond J.

    2015-01-01

    In decision making, dorsal and ventral medial prefrontal cortex show a sensitivity to key decision variables, such as reward prediction errors. It is unclear whether these signals reflect parallel processing of a common synchronous input to both regions, for example from mesocortical dopamine, or separate and consecutive stages in reward processing. These two perspectives make distinct predictions about the relative timing of feedback-related activity in each of these regions, a question we address here. To reconstruct the unique temporal contribution of dorsomedial (dmPFC) and ventromedial prefrontal cortex (vmPFC) to simultaneously measured EEG activity in human subjects, we developed a novel trialwise fMRI-informed EEG analysis that allows dissociating correlated and overlapping sources. We show that vmPFC uniquely contributes a sustained activation profile shortly after outcome presentation, whereas dmPFC contributes a later and more peaked activation pattern. This temporal dissociation is expressed mainly in the alpha band for a vmPFC signal, which contrasts with a theta-based dmPFC signal. Thus, our data show reward-related vmPFC and dmPFC responses have distinct time courses and unique spectral profiles, findings that support distinct functional roles in a reward-processing network. SIGNIFICANCE STATEMENT: Multiple subregions of the medial prefrontal cortex are known to be involved in decision making and learning, and show similar response patterns in fMRI. Here, we used a novel approach to analyzing simultaneous EEG-fMRI that allows us to dissociate the individual time courses of brain regions. We find that vmPFC and dmPFC have distinguishable time courses and time-frequency patterns. PMID:26269631

  9. Implication of Dopaminergic Modulation in Operant Reward Learning and the Induction of Compulsive-Like Feeding Behavior in "Aplysia"

    ERIC Educational Resources Information Center

    Bedecarrats, Alexis; Cornet, Charles; Simmers, John; Nargeot, Romuald

    2013-01-01

    Feeding in "Aplysia" provides an amenable model system for analyzing the neuronal substrates of motivated behavior and its adaptability by associative reward learning and neuromodulation. Among such learning processes, appetitive operant conditioning that leads to a compulsive-like expression of feeding actions is known to be associated…

  10. Expectancies in decision making, reinforcement learning, and ventral striatum.

    PubMed

    van der Meer, Matthijs A A; Redish, A David

    2010-01-01

    Decisions can arise in different ways, such as from a gut feeling, doing what worked last time, or planful deliberation. Different decision-making systems are dissociable behaviorally, map onto distinct brain systems, and have different computational demands. For instance, "model-free" decision strategies use prediction errors to estimate scalar action values from previous experience, while "model-based" strategies leverage internal forward models to generate and evaluate potentially rich outcome expectancies. Animal learning studies indicate that expectancies may arise from different sources, including not only forward models but also Pavlovian associations, and the flexibility with which such representations impact behavior may depend on how they are generated. In the light of these considerations, we review the results of van der Meer and Redish (2009a), who found that ventral striatal neurons that respond to reward delivery can also be activated at other points, notably at a decision point where hippocampal forward representations were also observed. These data suggest the possibility that ventral striatal reward representations contribute to model-based expectancies used in deliberative decision making. PMID:21221409

  11. Dorsal Raphe Neurons Signal Reward through 5-HT and Glutamate

    PubMed Central

    Liu, Zhixiang; Zhou, Jingfeng; Li, Yi; Hu, Fei; Lu, Yao; Ma, Ming; Feng, Qiru; Zhang, Ju-en; Wang, Daqing; Zeng, Jiawei; Bao, Junhong; Kim, Ji-Young; Chen, Zhou-Feng; Mestikawy, Salah El; Luo, Minmin

    2015-01-01

    Summary: The dorsal raphe nucleus (DRN) in the midbrain is a key center for serotonin (5-hydroxytryptamine; 5-HT) expressing neurons. Serotonergic neurons in the DRN have been theorized to encode punishment by opposing the reward signaling of dopamine neurons. Here, we show that DRN neurons encode reward, but not punishment, through 5-HT and glutamate. Optogenetic stimulation of DRN Pet-1 neurons reinforces mice to explore the stimulation-coupled spatial region, shifts sucrose preference, drives optical self-stimulation, and directs sensory discrimination learning. DRN Pet-1 neurons increase their firing activity during reward tasks and this activation can be used to rapidly change neuronal activity patterns in the cortex. Although DRN Pet-1 neurons are associated with 5-HT, they also release glutamate, and both neurotransmitters contribute to reward signaling. These experiments demonstrate the ability of DRN neurons to organize reward behaviors and might provide insights into the underlying mechanisms of learning facilitation and anhedonia treatment. PMID:24656254

  12. Reconciling Reinforcement Learning Models with Behavioral Extinction and Renewal: Implications for Addiction, Relapse, and Problem Gambling

    ERIC Educational Resources Information Center

    Redish, A. David; Jensen, Steve; Johnson, Adam; Kurth-Nelson, Zeb

    2007-01-01

    Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated value and are thus unable to capture renewal. TDRL…

  13. Application of Reinforcement Learning in Cognitive Radio Networks: Models and Algorithms

    PubMed Central

    Yau, Kok-Lim Alvin; Poh, Geong-Sen; Chien, Su Fong; Al-Rawi, Hasan A. A.

    2014-01-01

    Cognitive radio (CR) enables unlicensed users to exploit the underutilized spectrum in licensed spectrum whilst minimizing interference to licensed users. Reinforcement learning (RL), which is an artificial intelligence approach, has been applied to enable each unlicensed user to observe and carry out optimal actions for performance enhancement in a wide range of schemes in CR, such as dynamic channel selection and channel sensing. This paper presents new discussions of RL in the context of CR networks. It provides an extensive review on how most schemes have been approached using the traditional and enhanced RL algorithms through state, action, and reward representations. Examples of the enhancements on RL, which do not appear in the traditional RL approach, are rules and cooperative learning. This paper also reviews performance enhancements brought about by the RL algorithms and open issues. This paper aims to establish a foundation in order to spark new research interests in this area. Our discussion has been presented in a tutorial manner so that it is comprehensive to readers outside the specialty of RL and CR. PMID:24995352
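
    As a hedged illustration of one of the schemes mentioned (dynamic channel selection), the sketch below treats channel choice as a single-state RL problem in which the reward is an interference-free transmission; the channel idle probabilities and parameters are made up for illustration.

        import numpy as np

        rng = np.random.default_rng(5)

        def channel_selection(idle_probs, n_slots=5000, alpha=0.1, eps=0.1):
            """Q-learning-style channel selection for an unlicensed user: each time slot,
            pick a channel; reward = 1 if the licensed user was idle (transmission
            succeeds without interference), else 0."""
            q = np.zeros(len(idle_probs))
            for _ in range(n_slots):
                a = int(rng.integers(len(q))) if rng.random() < eps else int(np.argmax(q))
                reward = float(rng.random() < idle_probs[a])     # 1 if the channel was free
                q[a] += alpha * (reward - q[a])
            return q

        q = channel_selection(idle_probs=[0.2, 0.7, 0.5])
        print("learned channel values:", np.round(q, 2), "| preferred channel:", int(np.argmax(q)))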

  14. Application of reinforcement learning in cognitive radio networks: models and algorithms.

    PubMed

    Yau, Kok-Lim Alvin; Poh, Geong-Sen; Chien, Su Fong; Al-Rawi, Hasan A A

    2014-01-01

    Cognitive radio (CR) enables unlicensed users to exploit the underutilized spectrum in licensed spectrum whilst minimizing interference to licensed users. Reinforcement learning (RL), which is an artificial intelligence approach, has been applied to enable each unlicensed user to observe and carry out optimal actions for performance enhancement in a wide range of schemes in CR, such as dynamic channel selection and channel sensing. This paper presents new discussions of RL in the context of CR networks. It provides an extensive review on how most schemes have been approached using the traditional and enhanced RL algorithms through state, action, and reward representations. Examples of the enhancements on RL, which do not appear in the traditional RL approach, are rules and cooperative learning. This paper also reviews performance enhancements brought about by the RL algorithms and open issues. This paper aims to establish a foundation in order to spark new research interests in this area. Our discussion has been presented in a tutorial manner so that it is comprehensive to readers outside the specialty of RL and CR. PMID:24995352

  15. Gaze data reveal distinct choice processes underlying model-based and model-free reinforcement learning

    PubMed Central

    Konovalov, Arkady; Krajbich, Ian

    2016-01-01

    Organisms appear to learn and make decisions using different strategies known as model-free and model-based learning; the former is mere reinforcement of previously rewarded actions and the latter is a forward-looking strategy that involves evaluation of action-state transition probabilities. Prior work has used neural data to argue that both model-based and model-free learners implement a value comparison process at trial onset, but model-based learners assign more weight to forward-looking computations. Here using eye-tracking, we report evidence for a different interpretation of prior results: model-based subjects make their choices prior to trial onset. In contrast, model-free subjects tend to ignore model-based aspects of the task and instead seem to treat the decision problem as a simple comparison process between two differentially valued items, consistent with previous work on sequential-sampling models of decision making. These findings illustrate a problem with assuming that experimental subjects make their decisions at the same prescribed time. PMID:27511383

  16. Gaze data reveal distinct choice processes underlying model-based and model-free reinforcement learning.

    PubMed

    Konovalov, Arkady; Krajbich, Ian

    2016-01-01

    Organisms appear to learn and make decisions using different strategies known as model-free and model-based learning; the former is mere reinforcement of previously rewarded actions and the latter is a forward-looking strategy that involves evaluation of action-state transition probabilities. Prior work has used neural data to argue that both model-based and model-free learners implement a value comparison process at trial onset, but model-based learners assign more weight to forward-looking computations. Here using eye-tracking, we report evidence for a different interpretation of prior results: model-based subjects make their choices prior to trial onset. In contrast, model-free subjects tend to ignore model-based aspects of the task and instead seem to treat the decision problem as a simple comparison process between two differentially valued items, consistent with previous work on sequential-sampling models of decision making. These findings illustrate a problem with assuming that experimental subjects make their decisions at the same prescribed time. PMID:27511383

  17. Grid Cells, Place Cells, and Geodesic Generalization for Spatial Reinforcement Learning

    PubMed Central

    Gustafson, Nicholas J.; Daw, Nathaniel D.

    2011-01-01

    Reinforcement learning (RL) provides an influential characterization of the brain's mechanisms for learning to make advantageous choices. An important problem, though, is how complex tasks can be represented in a way that enables efficient learning. We consider this problem through the lens of spatial navigation, examining how two of the brain's location representations—hippocampal place cells and entorhinal grid cells—are adapted to serve as basis functions for approximating value over space for RL. Although much previous work has focused on these systems' roles in combining upstream sensory cues to track location, revisiting these representations with a focus on how they support this downstream decision function offers complementary insights into their characteristics. Rather than localization, the key problem in learning is generalization between past and present situations, which may not match perfectly. Accordingly, although neural populations collectively offer a precise representation of position, our simulations of navigational tasks verify the suggestion that RL gains efficiency from the more diffuse tuning of individual neurons, which allows learning about rewards to generalize over longer distances given fewer training experiences. However, work on generalization in RL suggests the underlying representation should respect the environment's layout. In particular, although it is often assumed that neurons track location in Euclidean coordinates (that a place cell's activity declines “as the crow flies” away from its peak), the relevant metric for value is geodesic: the distance along a path, around any obstacles. We formalize this intuition and present simulations showing how Euclidean, but not geodesic, representations can interfere with RL by generalizing inappropriately across barriers. Our proposal that place and grid responses should be modulated by geodesic distances suggests novel predictions about how obstacles should affect spatial firing
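
    A hedged sketch of the contrast drawn here, place-cell-like Gaussian basis functions built over Euclidean versus geodesic (shortest path around obstacles) distance, is shown below on a toy gridworld with a barrier; the arena, tuning width, and variable names are illustrative assumptions, not the paper's simulations.

        import numpy as np
        from collections import deque

        def geodesic_distances(free, start):
            """BFS shortest-path (geodesic) distances on a grid with obstacles."""
            dist = np.full(free.shape, np.inf)
            dist[start] = 0
            queue = deque([start])
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < free.shape[0] and 0 <= nc < free.shape[1] \
                            and free[nr, nc] and dist[nr, nc] == np.inf:
                        dist[nr, nc] = dist[r, c] + 1
                        queue.append((nr, nc))
            return dist

        # 5x7 arena with a vertical barrier; a place-cell-like basis function centred
        # just left of the barrier should NOT spill value to the other side.
        free = np.ones((5, 7), dtype=bool)
        free[0:4, 3] = False                      # the barrier (gap at the bottom row)
        centre = (1, 2)
        sigma = 1.5

        geo = geodesic_distances(free, centre)
        rows, cols = np.indices(free.shape)
        euc = np.sqrt((rows - centre[0]) ** 2 + (cols - centre[1]) ** 2)

        euclidean_field = np.where(free, np.exp(-euc ** 2 / (2 * sigma ** 2)), 0.0)
        geodesic_field = np.where(free, np.exp(-geo ** 2 / (2 * sigma ** 2)), 0.0)

        print("activity just across the barrier (Euclidean):", round(euclidean_field[1, 4], 3))
        print("activity just across the barrier (geodesic) :", round(geodesic_field[1, 4], 3))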

  18. Neuronal Reward and Decision Signals: From Theories to Data.

    PubMed

    Schultz, Wolfram

    2015-07-01

    Rewards are crucial objects that induce learning, approach behavior, choices, and emotions. Whereas emotions are difficult to investigate in animals, the learning function is mediated by neuronal reward prediction error signals which implement basic constructs of reinforcement learning theory. These signals are found in dopamine neurons, which emit a global reward signal to striatum and frontal cortex, and in specific neurons in striatum, amygdala, and frontal cortex projecting to select neuronal populations. The approach and choice functions involve subjective value, which is objectively assessed by behavioral choices eliciting internal, subjective reward preferences. Utility is the formal mathematical characterization of subjective value and a prime decision variable in economic choice theory. It is coded as utility prediction error by phasic dopamine responses. Utility can incorporate various influences, including risk, delay, effort, and social interaction. Appropriate for formal decision mechanisms, rewards are coded as object value, action value, difference value, and chosen value by specific neurons. Although all reward, reinforcement, and decision variables are theoretical constructs, their neuronal signals constitute measurable physical implementations and as such confirm the validity of these concepts. The neuronal reward signals provide guidance for behavior while constraining the free will to act. PMID:26109341

  19. Neuronal Reward and Decision Signals: From Theories to Data

    PubMed Central

    Schultz, Wolfram

    2015-01-01

    Rewards are crucial objects that induce learning, approach behavior, choices, and emotions. Whereas emotions are difficult to investigate in animals, the learning function is mediated by neuronal reward prediction error signals which implement basic constructs of reinforcement learning theory. These signals are found in dopamine neurons, which emit a global reward signal to striatum and frontal cortex, and in specific neurons in striatum, amygdala, and frontal cortex projecting to select neuronal populations. The approach and choice functions involve subjective value, which is objectively assessed by behavioral choices eliciting internal, subjective reward preferences. Utility is the formal mathematical characterization of subjective value and a prime decision variable in economic choice theory. It is coded as utility prediction error by phasic dopamine responses. Utility can incorporate various influences, including risk, delay, effort, and social interaction. Appropriate for formal decision mechanisms, rewards are coded as object value, action value, difference value, and chosen value by specific neurons. Although all reward, reinforcement, and decision variables are theoretical constructs, their neuronal signals constitute measurable physical implementations and as such confirm the validity of these concepts. The neuronal reward signals provide guidance for behavior while constraining the free will to act. PMID:26109341

  20. Reinforcement Learning Based Web Service Compositions for Mobile Business

    NASA Astrophysics Data System (ADS)

    Zhou, Juan; Chen, Shouming

    In this paper, we propose a new solution to Reactive Web Service Composition that models the problem with Reinforcement Learning and introduces modifiable (alterable) QoS variables into the model as elements of the Markov Decision Process tuple. We then give an example of Reactive-WSC-based mobile banking to demonstrate the solution's ability to obtain an optimized service composition, characterized by alterable target QoS variable sets with optimized values. We conclude that the solution has clear potential to improve customer experience and service quality in Web Services, and more broadly in applications across the electronic commerce and business sector.
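
    A minimal Python sketch of the underlying idea follows: QoS attributes of candidate services enter the reward of a small Markov decision process, and Q-learning selects the composition that optimizes the weighted QoS. The workflow, candidate services, QoS weights, and parameter values are assumptions made for illustration, not the authors' model.

      import numpy as np

      # Assumed workflow: each step offers candidate services with
      # (response time, availability) QoS attributes.
      candidates = [
          [(0.8, 0.99), (0.3, 0.90)],     # step 0
          [(0.5, 0.95), (0.6, 0.99)],     # step 1
      ]
      w_time, w_avail = -1.0, 2.0          # alterable QoS weights (assumed)

      def reward(step, action):
          t, a = candidates[step][action]
          return w_time * t + w_avail * a

      n_steps, n_actions = len(candidates), 2
      Q = np.zeros((n_steps + 1, n_actions))   # terminal row stays zero
      alpha, gamma, eps = 0.2, 0.95, 0.1
      rng = np.random.default_rng(1)

      for _ in range(2000):                    # Q-learning over repeated compositions
          for step in range(n_steps):
              a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[step].argmax())
              Q[step, a] += alpha * (reward(step, a) + gamma * Q[step + 1].max() - Q[step, a])

      print("selected composition:", [int(Q[s].argmax()) for s in range(n_steps)])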

  1. Statistical mechanics approach to a reinforcement learning model with memory

    NASA Astrophysics Data System (ADS)

    Lipowski, Adam; Gontarek, Krzysztof; Ausloos, Marcel

    2009-05-01

    We introduce a two-player model of reinforcement learning with memory. Past actions of an iterated game are stored in a memory and used to determine a player's next action. To examine the behaviour of the model, some approximate methods are used and compared against numerical simulations and the exact master equation. When the players' memory length increases to infinity, the model undergoes an absorbing-state phase transition. The performance of the examined strategies is evaluated in the prisoner's dilemma game. It turns out that it is advantageous to have a large memory in symmetric games, but better to have a short memory in asymmetric ones.
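
    The flavour of such a model can be conveyed with a short Python sketch of two reinforcement-learning players in the iterated prisoner's dilemma, each conditioning its cooperation probability on a memory of the last joint action. The payoff matrix, update rule, and parameters are illustrative assumptions and omit the approximate methods and master-equation analysis used in the paper.

      import numpy as np

      # Payoff to a player for (own action, opponent action); 0 = cooperate, 1 = defect.
      PAYOFF = np.array([[3.0, 0.0],
                         [5.0, 1.0]])
      alpha = 0.05                          # reinforcement step size (assumed)
      rng = np.random.default_rng(2)

      # Each player stores P(cooperate | last joint action): a memory of length one.
      p_coop = [np.full((2, 2), 0.5), np.full((2, 2), 0.5)]
      last = (0, 0)                         # arbitrary initial memory content

      for _ in range(50_000):
          acts = [int(rng.random() > p_coop[i][last]) for i in range(2)]   # 1 = defect
          for i in range(2):
              r = PAYOFF[acts[i], acts[1 - i]]
              target = 1.0 - acts[i]        # push toward cooperate if cooperation was chosen
              # Reinforce the action just taken, in proportion to its (normalized) payoff.
              p_coop[i][last] += alpha * (r / PAYOFF.max()) * (target - p_coop[i][last])
          last = tuple(acts)

      print("P(cooperate | last joint action), player 0:")
      print(p_coop[0].round(2))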

  2. A new animal model of placebo analgesia: involvement of the dopaminergic system in reward learning.

    PubMed

    Lee, In-Seon; Lee, Bombi; Park, Hi-Joon; Olausson, Håkan; Enck, Paul; Chae, Younbyoung

    2015-01-01

    We propose a new placebo analgesia animal model and investigate the role of the dopamine and opioid systems in placebo analgesia. Before and after the conditioning, we conducted a conditioned place preference (CPP) test to measure preferences for the cues (Rooms 1 and 2), and a hot plate test (HPT) to measure the pain responses to high-level pain after the cues. In addition, we quantified the expression of tyrosine hydroxylase (TH) in the ventral tegmental area (VTA) and c-Fos in the anterior cingulate cortex (ACC) as measures of reward learning and the pain response. We found an enhanced preference for the cue paired with low-level pain and enhanced TH expression in the VTA of the Placebo and Placebo + Naloxone groups. Haloperidol, a dopamine antagonist, blocked these effects in the Placebo + Haloperidol group. An increased pain threshold to high-heat pain and reduced c-Fos expression in the ACC were observed in the Placebo group only. Haloperidol blocked the place preference effect, and naloxone and haloperidol blocked the placebo analgesia. Cue preference is mediated by reward learning via the dopamine system, whereas the expression of placebo analgesia is mediated by the dopamine and opioid systems. PMID:26602173

  3. A new animal model of placebo analgesia: involvement of the dopaminergic system in reward learning

    PubMed Central

    Lee, In-Seon; Lee, Bombi; Park, Hi-Joon; Olausson, Håkan; Enck, Paul; Chae, Younbyoung

    2015-01-01

    We propose a new placebo analgesia animal model and investigate the role of the dopamine and opioid systems in placebo analgesia. Before and after the conditioning, we conducted a conditioned place preference (CPP) test to measure preferences for the cues (Rooms 1 and 2), and a hot plate test (HPT) to measure the pain responses to high-level pain after the cues. In addition, we quantified the expression of tyrosine hydroxylase (TH) in the ventral tegmental area (VTA) and c-Fos in the anterior cingulate cortex (ACC) as measures of reward learning and the pain response. We found an enhanced preference for the cue paired with low-level pain and enhanced TH expression in the VTA of the Placebo and Placebo + Naloxone groups. Haloperidol, a dopamine antagonist, blocked these effects in the Placebo + Haloperidol group. An increased pain threshold to high-heat pain and reduced c-Fos expression in the ACC were observed in the Placebo group only. Haloperidol blocked the place preference effect, and naloxone and haloperidol blocked the placebo analgesia. Cue preference is mediated by reward learning via the dopamine system, whereas the expression of placebo analgesia is mediated by the dopamine and opioid systems. PMID:26602173

  4. A neural model of the frontal eye fields with reward-based learning.

    PubMed

    Ye, Weijie; Liu, Shenquan; Liu, Xuanliang; Yu, Yuguo

    2016-09-01

    Decision-making is a flexible process dependent on the accumulation of various kinds of information; however, the corresponding neural mechanisms are far from clear. We extended a layered model of the frontal eye field to a learning-based model, using computational simulations to explain the cognitive process of choice tasks. The core of this extended model has three aspects: direction-preferred populations that cluster together the neurons with the same orientation preference, rule modules that control different rule-dependent activities, and reward-based synaptic plasticity that modulates connections to flexibly change the decision according to task demands. After repeated attempts in a number of trials, the network successfully simulated three decision choice tasks: an anti-saccade task, a no-go task, and an associative task. We found that synaptic plasticity could modulate the competition of choices by suppressing erroneous choices while enhancing the correct (rewarding) choice. In addition, the trained model captured some properties exhibited in animal and human experiments, such as the latency of the reaction time distribution of anti-saccades, the stop signal mechanism for canceling a reflexive saccade, and the variation of latency to half-max selectivity. Furthermore, the trained model was capable of reproducing the re-learning procedures when switching tasks and reversing the cue-saccade association. PMID:27284696

  5. A fuzzy controller with supervised learning assisted reinforcement learning algorithm for obstacle avoidance.

    PubMed

    Ye, Cang; Yung, N C; Wang, Danwei

    2003-01-01

    Fuzzy logic systems are promising for efficient obstacle avoidance. However, it is difficult to maintain the correctness, consistency, and completeness of a fuzzy rule base constructed and tuned by a human expert. A reinforcement learning method is capable of learning the fuzzy rules automatically. However, it incurs a heavy learning phase and may result in an insufficiently learned rule base due to the curse of dimensionality. In this paper, we propose a neural fuzzy system with mixed coarse learning and fine learning phases. In the first phase, a supervised learning method is used to determine the membership functions for input and output variables simultaneously. After sufficient training, fine learning is applied, which employs a reinforcement learning algorithm to fine-tune the membership functions for output variables. For sufficient learning, a new learning method using a modification of Sutton and Barto's model is proposed to strengthen the exploration. Through this two-step tuning approach, the mobile robot is able to perform collision-free navigation. To deal with the difficulty of acquiring a large amount of training data with high consistency for supervised learning, we develop a virtual environment (VE) simulator, which is able to provide desktop virtual environment (DVE) and immersive virtual environment (IVE) visualization. By having a skilled human operator operate a mobile robot in the virtual environment (DVE/IVE), training data are readily obtained and used to train the neural fuzzy system. PMID:18238153

  6. Selective activation of the trace amine-associated receptor 1 decreases cocaine's reinforcing efficacy and prevents cocaine-induced changes in brain reward thresholds.

    PubMed

    Pei, Yue; Mortas, Patrick; Hoener, Marius C; Canales, Juan J

    2015-12-01

    The newly discovered trace amine-associated receptor 1 (TAAR1) has emerged as a promising target for medication development in stimulant addiction due to its ability to regulate dopamine (DA) function and modulate stimulants' effects. Recent findings indicate that TAAR1 activation blocks some of the abuse-related physiological and behavioral effects of cocaine. However, findings from existing self-administration studies are inconclusive due to the very limited range of cocaine unit doses tested. Here, in order to shed light on the influence of TAAR1 on cocaine's reward and reinforcement, we studied the effects of partial and full activation of TAAR1 on (1) the dose-response curve for cocaine self-administration and (2) cocaine-induced changes in intracranial self-stimulation (ICSS). In the first experiment, we examined the effects of the selective full and partial TAAR1 agonists, RO5256390 and RO5203648, on self-administration of five unit-injection doses of cocaine (0.03, 0.1, 0.2, 0.45, and 1 mg/kg/infusion). Both agonists induced dose-dependent downward shifts in the cocaine dose-response curve, indicating that both partial and full TAAR1 activation decrease cocaine's reinforcing efficacy. In the second experiment, RO5256390 and the partial agonist, RO5263397, dose-dependently prevented cocaine-induced lowering of ICSS thresholds. Taken together, these data demonstrate that TAAR1 stimulation effectively suppresses the rewarding and reinforcing effects of cocaine in self-administration and ICSS models, supporting the candidacy of TAAR1 as a drug discovery target for cocaine addiction. PMID:26048337

  7. Incentive salience attribution under reward uncertainty: A Pavlovian model.

    PubMed

    Anselme, Patrick

    2015-02-01

    There is a vast literature on the behavioural effects of partial reinforcement in Pavlovian conditioning. Compared with animals receiving continuous reinforcement, partially rewarded animals typically show (a) a slower development of the conditioned response (CR) early in training and (b) a higher asymptotic level of the CR later in training. This phenomenon is known as the partial reinforcement acquisition effect (PRAE). Learning models of Pavlovian conditioning fail to account for it. In accordance with the incentive salience hypothesis, it is here argued that incentive motivation (or 'wanting') plays a more direct role in controlling behaviour than does learning, and reward uncertainty is shown to have an excitatory effect on incentive motivation. The psychological origin of that effect is discussed and a computational model integrating this new interpretation is developed. Many features of CRs under partial reinforcement emerge from this model. PMID:25444780

  8. Reward and punishment act as distinct factors in guiding behavior

    PubMed Central

    Kubanek, Jan; Snyder, Lawrence H; Abrams, Richard A

    2015-01-01

    Behavior rests on the experience of reinforcement and punishment. It has been unclear whether reinforcement and punishment act as oppositely valenced components of a single behavioral factor, or whether these two kinds of outcomes play fundamentally distinct behavioral roles. To this end, we varied the magnitude of a reward or a penalty experienced following a choice using monetary tokens. The outcome of each trial was independent of the outcome of the previous trial, which enabled us to isolate and study the effect on behavior of each outcome magnitude in single trials. As expected, we found that a reward led to a repetition of the previous choice, whereas a penalty led to an avoidance of the previous choice. However, the effects of the reward magnitude and the penalty magnitude revealed a striking asymmetry. The choice repetition effect of a reward strongly scaled with the magnitude of the reward. In marked contrast, the avoidance effect of a penalty was flat, not influenced by the magnitude of the penalty. These effects were mechanistically described using the Reinforcement Learning model after the model was updated to account for the penalty-based asymmetry. The asymmetry in the effects of the reward magnitude and the punishment magnitude was so striking that it is difficult to conceive that one factor is just a weighted or transformed form of the other factor. Instead, the data suggest that rewards and penalties are fundamentally distinct factors in governing behavior. PMID:25824862

  9. Reinforcement Learning in Distributed Domains: Beyond Team Games

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Sill, Joseph; Turner, Kagan

    2000-01-01

    Distributed search algorithms are crucial in dealing with large optimization problems, particularly when a centralized approach is not only impractical but infeasible. Many machine learning concepts have been applied to search algorithms in order to improve their effectiveness. In this article we present an algorithm that blends Reinforcement Learning (RL) and hill climbing directly, by using the RL signal to guide the exploration step of a hill climbing algorithm. We apply this algorithm to the domain of a constellation of communication satellites, where the goal is to minimize the loss of importance-weighted data. We introduce the concept of 'ghost' traffic, where correctly setting this traffic induces the satellites to act to optimize the world utility. Our results indicate that the bi-utility search introduced in this paper outperforms both traditional hill climbing algorithms and distributed RL approaches such as team games.

  10. Theta-band oscillatory activity differs between gamblers and nongamblers comorbid with attention-deficit hyperactivity disorder in a probabilistic reward-learning task.

    PubMed

    Abouzari, Mehdi; Oberg, Scott; Tata, Matthew

    2016-10-01

    Problem gambling is thought to be comorbid with attention-deficit hyperactivity disorder (ADHD). We tested whether gamblers and ADHD patients exhibit similar reward-related brain activity in response to feedback in a gambling task. A series of brain electrical responses can be observed in the electroencephalogram (EEG) and the stimulus-locked event-related potentials (ERP) when participants in a gambling task are given feedback, regardless of whether they won or lost the previous bet. Here, we used a simplified computerized version of the Iowa Gambling Task (IGT) to assess differences in reinforcement-driven choice adaptation between unmedicated ADHD patients with or without problem gambling traits, contrasted with a sex- and age-matched control group. EEG was recorded from the participants while they were engaged in the task, which contained two choice options with different net payouts and win/loss probabilities. A learning trend, reflecting the ability to acquire and use knowledge of the reward outcomes to obtain a positive financial outcome, was not observed in ADHD gamblers compared with nongamblers. Induced theta-band (4-8 Hz) power over frontal cortex was significantly higher in gamblers than in nongamblers in all high-risk/low-risk win/lose conditions. In contrast, induced low-alpha (9-11 Hz) power at frontal electrodes differentiated gamblers from nongamblers only in the high-risk lose condition, not in the other three conditions. The results indicate that ADHD nongamblers do not share with problem gamblers underlying deficits in reward learning. These pilot data highlight the need for studies of ADHD in gambling to elucidate how motivational states are represented during feedback processing. PMID:27318102

  11. Reinforcement learning for congestion-avoidance in packet flow

    NASA Astrophysics Data System (ADS)

    Horiguchi, Tsuyoshi; Hayashi, Keisuke; Tretiakov, Alexei

    2005-04-01

    Congestion of packet flow in computer networks is a serious problem in packet communication, and hence its avoidance should be investigated. We use the neural network model for packet routing control in a computer network proposed in a previous paper by Horiguchi and Ishioka (Physica A 297 (2001) 521). If we assume that packets are not sent to nodes whose buffers are already full, then we find that traffic congestion occurs when the number of packets in the network exceeds some critical value. In order to avoid the congestion, we introduce reinforcement learning for a control parameter in the neural network model. We find that congestion is avoided by the reinforcement learning while good throughput performance is maintained. We investigate packet flow on computer networks of various topologies, such as a regular network, a network with fractal structure, a small-world network, and a scale-free network.

  12. Predicting psychosis across diagnostic boundaries: Behavioral and computational modeling evidence for impaired reinforcement learning in schizophrenia and bipolar disorder with a history of psychosis.

    PubMed

    Strauss, Gregory P; Thaler, Nicholas S; Matveeva, Tatyana M; Vogel, Sally J; Sutton, Griffin P; Lee, Bern G; Allen, Daniel N

    2015-08-01

    There is increasing evidence that schizophrenia (SZ) and bipolar disorder (BD) share a number of cognitive, neurobiological, and genetic markers. Shared features may be most prevalent among SZ and BD with a history of psychosis. This study extended this literature by examining reinforcement learning (RL) performance in individuals with SZ (n = 29), BD with a history of psychosis (BD+; n = 24), BD without a history of psychosis (BD-; n = 23), and healthy controls (HC; n = 24). RL was assessed through a probabilistic stimulus selection task with acquisition and test phases. Computational modeling evaluated competing accounts of the data. Each participant's trial-by-trial decision-making behavior was fit to 3 computational models of RL: (a) a standard actor-critic model simulating pure basal ganglia-dependent learning, (b) a pure Q-learning model simulating action selection as a function of learned expected reward value, and (c) a hybrid model where an actor-critic is "augmented" by a Q-learning component, meant to capture the top-down influence of orbitofrontal cortex value representations on the striatum. The SZ group demonstrated greater reinforcement learning impairments at acquisition and test phases than the BD+, BD-, and HC groups. The BD+ and BD- groups displayed comparable performance at acquisition and test phases. Collapsing across diagnostic categories, greater severity of current psychosis was associated with poorer acquisition of the most rewarding stimuli as well as poor go/no-go learning at test. Model fits revealed that reinforcement learning in SZ was best characterized by a pure actor-critic model where learning is driven by prediction error signaling alone. In contrast, BD-, BD+, and HC were best fit by a hybrid model where prediction errors are influenced by top-down expected value representations that guide decision making. These findings suggest that abnormalities in the reward system are more prominent in SZ than BD; however, current psychotic
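
    For readers who want the shape of the three candidate models, a compact Python sketch follows. It is a schematic single-state illustration (parameter values, the mixing weight, and the task contingencies are assumptions, not the authors' fitting procedure): the actor-critic component learns only from the critic's prediction error, the Q-learning component tracks expected reward value directly, and the hybrid simply mixes the two when generating choice probabilities.

      import numpy as np

      def softmax(x, beta=3.0):
          e = np.exp(beta * (x - x.max()))
          return e / e.sum()

      n_actions, alpha = 2, 0.1
      V = 0.0                            # critic's state value
      actor_w = np.zeros(n_actions)      # actor: propensities learned from prediction error
      Q = np.zeros(n_actions)            # Q-learning: expected reward values
      mix = 0.5                          # hybrid weight on the Q component (assumed)

      def hybrid_choice_probs():
          # Actor-critic propensities "augmented" by Q values, as in the hybrid account.
          return softmax((1 - mix) * actor_w + mix * Q)

      def update(action, reward):
          global V
          delta = reward - V                          # critic's prediction error
          V += alpha * delta
          actor_w[action] += alpha * delta            # pure actor-critic learning
          Q[action] += alpha * (reward - Q[action])   # pure Q-learning

      rng = np.random.default_rng(3)
      p_reward = np.array([0.8, 0.2])                 # assumed probabilistic contingency
      for _ in range(500):
          a = int(rng.choice(n_actions, p=hybrid_choice_probs()))
          update(a, float(rng.random() < p_reward[a]))

      print("actor weights:", actor_w.round(2), " Q values:", Q.round(2))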

  13. "Notice of Violation of IEEE Publication Principles" Multiobjective Reinforcement Learning: A Comprehensive Overview.

    PubMed

    Liu, Chunming; Xu, Xin; Hu, Dewen

    2013-04-29

    Reinforcement learning is a powerful mechanism for enabling agents to learn in an unknown environment, and most reinforcement learning algorithms aim to maximize some numerical value, which represents only one long-term objective. However, multiple long-term objectives are exhibited in many real-world decision and control problems; therefore, recently, there has been growing interest in solving multiobjective reinforcement learning (MORL) problems with multiple conflicting objectives. The aim of this paper is to present a comprehensive overview of MORL. In this paper, the basic architecture, research topics, and naive solutions of MORL are introduced first. Then, several representative MORL approaches and some important directions of recent research are reviewed. The relationships between MORL and other related research are also discussed, which include multiobjective optimization, hierarchical reinforcement learning, and multi-agent reinforcement learning. Finally, research challenges and open problems of MORL techniques are highlighted. PMID:24240065
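
    One common single-policy approach in this literature is linear scalarization: each action's vector-valued return is collapsed onto a single number with a preference weight vector. The Python sketch below illustrates that idea on a two-objective bandit; the reward vectors, weights, and parameters are assumptions for illustration only, not an algorithm from the overview.

      import numpy as np

      # Two conflicting objectives per action, e.g. throughput vs. energy cost (assumed).
      reward_vectors = {0: np.array([1.0, -0.8]),
                        1: np.array([0.4, -0.1])}
      weights = np.array([0.5, 0.5])         # preference over objectives
      alpha, eps = 0.1, 0.1
      Q = np.zeros((2, 2))                   # per-action value vector, one row per action
      rng = np.random.default_rng(4)

      for _ in range(1000):
          a = int(rng.integers(2)) if rng.random() < eps else int((Q @ weights).argmax())
          # Vector-valued update; the scalarization is applied only at action selection.
          Q[a] += alpha * (reward_vectors[a] - Q[a])

      print("vector Q:", Q.round(2), " scalarized:", (Q @ weights).round(2))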

  14. Novel reinforcement learning approach for difficult control problems

    NASA Astrophysics Data System (ADS)

    Becus, Georges A.; Thompson, Edward A.

    1997-09-01

    We review work conducted over the past several years aimed at developing reinforcement learning architectures for solving difficult control problems, based on and inspired by associative control process (ACP) networks. We briefly review ACP networks, which are able to reproduce many classical instrumental conditioning test results observed in animal research and to engage in real-time, closed-loop, goal-seeking interactions with their environment. Chronologically, our contributions include the ideally interfaced ACP network, which is endowed with hierarchical, attention, and failure-recognition interface mechanisms that greatly enhance the capabilities of the original ACP network. When solving the cart-pole problem, it achieves 100 percent reliability and a reduction in training time similar to that of Baird and Klopf's modified ACP network, and additionally an order-of-magnitude reduction in the number of failures experienced during successful training. Next we introduced the command and control center/internal drive (Cid) architecture for artificial neural learning systems. It consists of a hierarchy of command and control centers governing motor selection networks. Internal drives, similar to hunger, thirst, or reproduction in biological systems, are formed within the controller to facilitate learning. Efficiency, reliability, and adjustability of this architecture were demonstrated on the benchmark cart-pole control problem. A comparison with other artificial learning systems indicates that it learns over 100 times faster than Barto et al.'s adaptive search element/adaptive critic element, experiencing more than an order of magnitude fewer failures, while being capable of being fine-tuned by the user, on-line, for improved performance without additional training. Finally, we present work in progress on a 'peaks and valleys' scheme which moves away from the one-dimensional learning mechanism currently found in Cid and shows promise in solving even more difficult learning control

  15. Grounding the Meanings in Sensorimotor Behavior using Reinforcement Learning.

    PubMed

    Farkaš, Igor; Malík, Tomáš; Rebrová, Kristína

    2012-01-01

    The recent outburst of interest in cognitive developmental robotics is fueled by the ambition to propose ecologically plausible mechanisms of how, among other things, a learning agent/robot could ground linguistic meanings in its sensorimotor behavior. Along this stream, we propose a model that allows the simulated iCub robot to learn the meanings of actions (point, touch, and push) oriented toward objects in the robot's peripersonal space. In our experiments, the iCub learns to execute motor actions and comment on them. Architecturally, the model is composed of three neural-network-based modules that are trained in different ways. The first module, a two-layer perceptron, is trained by back-propagation to attend to the target position in the visual scene, given the low-level visual information and the feature-based target information. The second module, having the form of an actor-critic architecture, is the most distinguishing part of our model, and is trained by a continuous version of reinforcement learning to execute actions as sequences, based on a linguistic command. The third module, an echo-state network, is trained to provide the linguistic description of the executed actions. The trained model generalizes well in the case of novel action-target combinations with randomized initial arm positions. It can also promptly adapt its behavior if the action/target suddenly changes during motor execution. PMID:22393319

  16. Reinforcement Learning in a Nonstationary Environment: The El Farol Problem

    NASA Technical Reports Server (NTRS)

    Bell, Ann Maria

    1999-01-01

    This paper examines the performance of simple learning rules in a complex adaptive system based on a coordination problem modeled on the El Farol problem. The key features of the El Farol problem are that it typically involves a medium number of agents and that agents' pay-off functions have a discontinuous response to increased congestion. First we consider a single adaptive agent facing a stationary environment. We demonstrate that the simple learning rules proposed by Roth and Erev can be extremely sensitive to small changes in the initial conditions and that events early in a simulation can affect the performance of the rule over a relatively long time horizon. In contrast, a reinforcement learning rule based on standard practice in the computer science literature converges rapidly and robustly. The situation is reversed when multiple adaptive agents interact: the RE algorithms often converge rapidly to a stable average aggregate attendance despite the slow and erratic behavior of individual learners, while the CS-based learners frequently over-attend in the early and intermediate terms. The symmetric mixed-strategy equilibrium is unstable: all three learning rules ultimately tend towards pure strategies or stabilize in the medium term at non-equilibrium probabilities of attendance. The brittleness of the algorithms in different contexts emphasizes the importance of thorough and thoughtful examination of simulation-based results.
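
    For concreteness, the Roth-Erev style rule referred to above can be sketched in a few lines of Python: each agent accumulates choice propensities in proportion to realized payoffs, and attendance self-organizes around the congestion threshold. The payoff values, capacity, and the omission of forgetting and experimentation parameters are simplifying assumptions, not the paper's exact specification.

      import numpy as np

      N_AGENTS, CAPACITY, ROUNDS = 100, 60, 500     # assumed problem size
      rng = np.random.default_rng(5)
      propensity = np.ones((N_AGENTS, 2))           # Roth-Erev propensities: {stay home, attend}

      for _ in range(ROUNDS):
          probs = propensity / propensity.sum(axis=1, keepdims=True)
          attend = rng.random(N_AGENTS) < probs[:, 1]
          crowded = attend.sum() > CAPACITY
          # Discontinuous payoff: attending pays 1 only if the bar is not crowded;
          # staying home pays a safe 0.5 (values assumed for illustration).
          payoff = np.where(attend, 0.0 if crowded else 1.0, 0.5)
          propensity[np.arange(N_AGENTS), attend.astype(int)] += payoff

      print("final-round attendance:", int(attend.sum()), "of", N_AGENTS)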

  17. From Creatures of Habit to Goal-Directed Learners: Tracking the Developmental Emergence of Model-Based Reinforcement Learning.

    PubMed

    Decker, Johannes H; Otto, A Ross; Daw, Nathaniel D; Hartley, Catherine A

    2016-06-01

    Theoretical models distinguish two decision-making strategies that have been formalized in reinforcement-learning theory. A model-based strategy leverages a cognitive model of potential actions and their consequences to make goal-directed choices, whereas a model-free strategy evaluates actions based solely on their reward history. Research in adults has begun to elucidate the psychological mechanisms and neural substrates underlying these learning processes and factors that influence their relative recruitment. However, the developmental trajectory of these evaluative strategies has not been well characterized. In this study, children, adolescents, and adults performed a sequential reinforcement-learning task that enabled estimation of model-based and model-free contributions to choice. Whereas a model-free strategy was apparent in choice behavior across all age groups, a model-based strategy was absent in children, became evident in adolescents, and strengthened in adults. These results suggest that recruitment of model-based valuation systems represents a critical cognitive component underlying the gradual maturation of goal-directed behavior. PMID:27084852
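
    In analyses of this kind of sequential task, the relative recruitment of the two systems is commonly summarized by a single weight applied to the first-stage values, and that weight is the quantity whose developmental change is at issue here. The Python sketch below shows only the weighting scheme; the value numbers and inverse temperature are illustrative assumptions, not the study's estimates.

      import numpy as np

      def choice_probabilities(q_mb, q_mf, w, beta=3.0):
          # Mix model-based and model-free first-stage values with weight w, then softmax.
          # w = 0 is a purely model-free chooser, w = 1 a purely model-based one.
          q = w * q_mb + (1 - w) * q_mf
          e = np.exp(beta * (q - q.max()))
          return e / e.sum()

      q_mb = np.array([0.7, 0.4])      # planning-derived first-stage values (assumed)
      q_mf = np.array([0.3, 0.6])      # cached reward-history values (assumed)
      for w in (0.0, 0.5, 1.0):
          print(f"w = {w}:", choice_probabilities(q_mb, q_mf, w).round(2))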

  18. A multiplicative reinforcement learning model capturing learning dynamics and interindividual variability in mice.

    PubMed

    Bathellier, Brice; Tee, Sui Poh; Hrovat, Christina; Rumpel, Simon

    2013-12-01

    Both in humans and in animals, different individuals may learn the same task with strikingly different speeds; however, the sources of this variability remain elusive. In standard learning models, interindividual variability is often explained by variations of the learning rate, a parameter indicating how much synapses are updated on each learning event. Here, we theoretically show that the initial connectivity between the neurons involved in learning a task is also a strong determinant of how quickly the task is learned, provided that connections are updated in a multiplicative manner. To experimentally test this idea, we trained mice to perform an auditory Go/NoGo discrimination task followed by a reversal to compare learning speed when starting from naive or already trained synaptic connections. All mice learned the initial task, but often displayed sigmoid-like learning curves, with a variable delay period followed by a steep increase in performance, as often observed in operant conditioning. For all mice, learning was much faster in the subsequent reversal training. An accurate fit of all learning curves could be obtained with a reinforcement learning model endowed with a multiplicative learning rule, but not with an additive rule. Surprisingly, the multiplicative model could explain a large fraction of the interindividual variability by variations in the initial synaptic weights. Altogether, these results demonstrate the power of multiplicative learning rules to account for the full dynamics of biological learning and suggest an important role of initial wiring in the brain for predispositions to different tasks. PMID:24255115
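
    The additive-versus-multiplicative contrast can be illustrated with a toy Python simulation in which a single weight drives the probability of a correct response. Under a multiplicative rule the update scales with the current weight, producing a delay followed by a steep rise whose length depends strongly on the initial weight; under an additive rule it does not. The readout function, learning rate, and initial weights are assumptions for illustration, not the fitted model.

      import numpy as np

      def learning_curve(w0, rule, lr=0.2, trials=120):
          # Performance across trials when a single weight drives correct responding.
          w, curve = w0, []
          for _ in range(trials):
              p_correct = w / (1.0 + w)               # saturating readout (assumed)
              error = 1.0 - p_correct
              w += lr * error * (w if rule == "multiplicative" else 1.0)
              curve.append(p_correct)
          return np.array(curve)

      for w0 in (0.01, 0.05, 0.2):                    # different initial connectivity
          mult = learning_curve(w0, "multiplicative")
          add = learning_curve(w0, "additive")
          print(f"w0 = {w0}: trials to 80% correct -"
                f" multiplicative {int(np.argmax(mult > 0.8))},"
                f" additive {int(np.argmax(add > 0.8))}")
      # Time to criterion under the multiplicative rule depends strongly on the initial
      # weight and the curve is delay-then-steep (sigmoid-like); the additive rule is
      # far less sensitive to the initial weight.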

  19. Adventitious Reinforcement of Maladaptive Stimulus Control Interferes with Learning.

    PubMed

    Saunders, Kathryn J; Hine, Kathleen; Hayashi, Yusuke; Williams, Dean C

    2016-09-01

    Persistent error patterns sometimes develop when teaching new discriminations. These patterns can be adventitiously reinforced, especially during long periods of chance-level responding (including baseline). Such behaviors can interfere with learning a new discrimination. They can also disrupt already learned discriminations, if they re-emerge during teaching procedures that generate errors. We present an example of this process. Our goal was to teach a boy with intellectual disabilities to touch one of two shapes on a computer screen (in technical terms, a simple simultaneous discrimination). We used a size-fading procedure. The correct stimulus was at full size, and the incorrect-stimulus size increased in increments of 10 %. Performance was nearly error-free up to and including 60 % of full size. In a probe session with the incorrect stimulus at full size, however, accuracy plummeted. Also, a pattern of switching between choices, which apparently had been established in classroom instruction, re-emerged. The switching pattern interfered with already-learned discriminations. Despite having previously mastered a fading step with the incorrect stimulus up to 60 %, we were unable to maintain consistently high accuracy beyond 20 % of full size. We refined the teaching program such that fading was done in smaller steps (5 %), and decisions to "step back" to a smaller incorrect stimulus were made after every 5 trials instead of every 20. Errors were rare, switching behavior stopped, and he mastered the discrimination. This is a practical example of the importance of designing instruction that prevents adventitious reinforcement of maladaptive discriminated response patterns by reducing errors during acquisition. PMID:27622128

  20. The Effect of Tutoring on Children's Learning Under Two Conditions of Reinforcement.

    ERIC Educational Resources Information Center

    Zach, Lillian; And Others

    Studied were some problems of learning motivation and extrinsic reinforcement in a group of disadvantaged youngsters. Also tested was the hypothesis that learning would be facilitated for those children who received regular individual tutoring in addition to classroom instruction, regardless of conditions of reinforcement. Subjects were 60 Negro…

  1. Model-Based and Model-Free Pavlovian Reward Learning: Revaluation, Revision and Revelation

    PubMed Central

    Dayan, Peter; Berridge, Kent C.

    2014-01-01

    Evidence supports at least two methods for learning about reward and punishment and making predictions for guiding actions. One method, called model-free, progressively acquires cached estimates of the long-run values of circumstances and actions from retrospective experience. The other method, called model-based, uses representations of the environment, expectations and prospective calculations to make cognitive predictions of future value. Extensive attention has been paid to both methods in computational analyses of instrumental learning. By contrast, although a full computational analysis has been lacking, Pavlovian learning and prediction has typically been presumed to be solely model-free. Here, we revise that presumption and review compelling evidence from Pavlovian revaluation experiments showing that Pavlovian predictions can involve their own form of model-based evaluation. In model-based Pavlovian evaluation, prevailing states of the body and brain influence value computations, and thereby produce powerful incentive motivations that can sometimes be quite new. We consider the consequences of this revised Pavlovian view for the computational landscape of prediction, response and choice. We also revisit differences between Pavlovian and instrumental learning in the control of incentive motivation. PMID:24647659

  2. FNDC5/irisin, a molecular target for boosting reward-related learning and motivation.

    PubMed

    Zsuga, Judit; Tajti, Gabor; Papp, Csaba; Juhasz, Bela; Gesztelyi, Rudolf

    2016-05-01

    neurotrophic factor that increases neuronal dopamine content, modulates dopamine release relevant for neuronal plasticity and increased neuronal survival as well as learning and memory. Further linking BDNF to dopaminergic function is BDNF's ability to activate the tropomyosin-related kinase B receptor, which shares signaling with presynaptic dopamine-3 receptors in the ventral tegmental area. Summarizing, we propose that skeletal muscle-derived irisin may be the link between physical activity and reward-related processes and motivation. Moreover, alteration of this axis may contribute to a sedentary lifestyle and subsequent non-communicable diseases. Preclinical and clinical experimental models to test this hypothesis are also proposed. PMID:27063080

  3. Reinforcement learning of self-regulated β-oscillations for motor restoration in chronic stroke.

    PubMed

    Naros, Georgios; Gharabaghi, Alireza

    2015-01-01

    Neurofeedback training of motor imagery (MI)-related brain-states with brain-computer/brain-machine interfaces (BCI/BMI) is currently being explored as an experimental intervention prior to standard physiotherapy to improve the motor outcome of stroke rehabilitation. The use of BCI/BMI technology increases the adherence to MI training more efficiently than interventions with sham or no feedback. Moreover, pilot studies suggest that such a priming intervention before physiotherapy might-like some brain stimulation techniques-increase the responsiveness of the brain to the subsequent physiotherapy, thereby improving the general clinical outcome. However, there is little evidence up to now that these BCI/BMI-based interventions have achieved operant conditioning of specific brain states that facilitate task-specific functional gains beyond the practice of primed physiotherapy. In this context, we argue that BCI/BMI technology provides a valuable neurofeedback tool for rehabilitation but needs to aim at physiological features relevant for the targeted behavioral gain. Moreover, this therapeutic intervention has to be informed by concepts of reinforcement learning to develop its full potential. Such a refined neurofeedback approach would need to address the following issues: (1) Defining a physiological feedback target specific to the intended behavioral gain, e.g., β-band oscillations for cortico-muscular communication. This targeted brain state could well be different from the brain state optimal for the neurofeedback task, e.g., α-band oscillations for differentiating MI from rest; (2) Selecting a BCI/BMI classification and thresholding approach on the basis of learning principles, i.e., balancing challenge and reward of the neurofeedback task instead of maximizing the classification accuracy of the difficulty level device; and (3) Adjusting the difficulty level in the course of the training period to account for the cognitive load and the learning experience of

  4. Reinforcement learning of self-regulated β-oscillations for motor restoration in chronic stroke

    PubMed Central

    Naros, Georgios; Gharabaghi, Alireza

    2015-01-01

    Neurofeedback training of motor imagery (MI)-related brain-states with brain-computer/brain-machine interfaces (BCI/BMI) is currently being explored as an experimental intervention prior to standard physiotherapy to improve the motor outcome of stroke rehabilitation. The use of BCI/BMI technology increases the adherence to MI training more efficiently than interventions with sham or no feedback. Moreover, pilot studies suggest that such a priming intervention before physiotherapy might—like some brain stimulation techniques—increase the responsiveness of the brain to the subsequent physiotherapy, thereby improving the general clinical outcome. However, there is little evidence up to now that these BCI/BMI-based interventions have achieved operant conditioning of specific brain states that facilitate task-specific functional gains beyond the practice of primed physiotherapy. In this context, we argue that BCI/BMI technology provides a valuable neurofeedback tool for rehabilitation but needs to aim at physiological features relevant for the targeted behavioral gain. Moreover, this therapeutic intervention has to be informed by concepts of reinforcement learning to develop its full potential. Such a refined neurofeedback approach would need to address the following issues: (1) Defining a physiological feedback target specific to the intended behavioral gain, e.g., β-band oscillations for cortico-muscular communication. This targeted brain state could well be different from the brain state optimal for the neurofeedback task, e.g., α-band oscillations for differentiating MI from rest; (2) Selecting a BCI/BMI classification and thresholding approach on the basis of learning principles, i.e., balancing challenge and reward of the neurofeedback task instead of maximizing the classification accuracy of the difficulty level device; and (3) Adjusting the difficulty level in the course of the training period to account for the cognitive load and the learning experience

  5. Brain Regions Involved in the Learning and Application of Reward Rules in a Two-Deck Gambling Task

    ERIC Educational Resources Information Center

    Hartstra, E.; Oldenburg, J. F. E.; Van Leijenhorst, L.; Rombouts, S. A. R. B.; Crone, E. A.

    2010-01-01

    Decision-making involves the ability to choose between competing actions that are associated with uncertain benefits and penalties. The Iowa Gambling Task (IGT), which mimics real-life decision-making, involves learning a reward-punishment rule over multiple trials. Patients with damage to ventromedial prefrontal cortex (VMPFC) show deficits…

  6. The wick in the candle of learning: epistemic curiosity activates reward circuitry and enhances memory.

    PubMed

    Kang, Min Jeong; Hsu, Ming; Krajbich, Ian M; Loewenstein, George; McClure, Samuel M; Wang, Joseph Tao-yi; Camerer, Colin F

    2009-08-01

    Curiosity has been described as a desire for learning and knowledge, but its underlying mechanisms are not well understood. We scanned subjects with functional magnetic resonance imaging while they read trivia questions. The level of curiosity when reading questions was correlated with activity in caudate regions previously suggested to be involved in anticipated reward. This finding led to a behavioral study, which showed that subjects spent more scarce resources (either limited tokens or waiting time) to find out answers when they were more curious. The functional imaging also showed that curiosity increased activity in memory areas when subjects guessed incorrectly, which suggests that curiosity may enhance memory for surprising new information. This prediction about memory enhancement was confirmed in a behavioral study: Higher curiosity in an initial session was correlated with better recall of surprising answers 1 to 2 weeks later. PMID:19619181

  7. The GABAergic septohippocampal pathway is directly involved in internal processes related to operant reward learning.

    PubMed

    Vega-Flores, Germán; Rubio, Sara E; Jurado-Parras, M Teresa; Gómez-Climent, María Ángeles; Hampe, Christiane S; Manto, Mario; Soriano, Eduardo; Pascual, Marta; Gruart, Agnès; Delgado-García, José M

    2014-08-01

    We studied the role of γ-aminobutyric acid (GABA)ergic septohippocampal projections in medial septum (MS) self-stimulation of behaving mice. Self-stimulation was evoked in wild-type (WT) mice using instrumental conditioning procedures and in J20 mutant mice, a type of mouse with a significant deficit in GABAergic septohippocampal projections. J20 mice showed a significant modification in hippocampal activities, including a different response for input/output curves and the paired-pulse test, a larger long-term potentiation (LTP), and a delayed acquisition and lower performance in the MS self-stimulation task. LTP evoked at the CA3-CA1 synapse further decreased self-stimulation performance in J20, but not in WT, mice. MS self-stimulation evoked a decrease in the amplitude of field excitatory postsynaptic potentials (fEPSPs) at the CA3-CA1 synapse in WT, but not in J20, mice. This self-stimulation-dependent decrease in the amplitude of fEPSPs was also observed in the presence of another positive reinforcer (food collected during an operant task) and was canceled by the local administration of an antibody inhibiting glutamate decarboxylase 65 (GAD65). LTP evoked in the GAD65Ab-treated group was also larger than in controls. The hippocampus has a different susceptibility to septal GABAergic inputs depending on ongoing cognitive processes, and the GABAergic septohippocampal pathway is involved in consummatory processes related to operant rewards. PMID:23479403

  8. The GABAergic Septohippocampal Pathway Is Directly Involved in Internal Processes Related to Operant Reward Learning

    PubMed Central

    Vega-Flores, Germán; Rubio, Sara E.; Jurado-Parras, M. Teresa; Gómez-Climent, María Ángeles; Hampe, Christiane S.; Manto, Mario; Soriano, Eduardo; Pascual, Marta; Gruart, Agnès; Delgado-García, José M.

    2014-01-01

    We studied the role of γ-aminobutyric acid (GABA)ergic septohippocampal projections in medial septum (MS) self-stimulation of behaving mice. Self-stimulation was evoked in wild-type (WT) mice using instrumental conditioning procedures and in J20 mutant mice, a type of mouse with a significant deficit in GABAergic septohippocampal projections. J20 mice showed a significant modification in hippocampal activities, including a different response for input/output curves and the paired-pulse test, a larger long-term potentiation (LTP), and a delayed acquisition and lower performance in the MS self-stimulation task. LTP evoked at the CA3–CA1 synapse further decreased self-stimulation performance in J20, but not in WT, mice. MS self-stimulation evoked a decrease in the amplitude of field excitatory postsynaptic potentials (fEPSPs) at the CA3–CA1 synapse in WT, but not in J20, mice. This self-stimulation-dependent decrease in the amplitude of fEPSPs was also observed in the presence of another positive reinforcer (food collected during an operant task) and was canceled by the local administration of an antibody inhibiting glutamate decarboxylase 65 (GAD65). LTP evoked in the GAD65Ab-treated group was also larger than in controls. The hippocampus has a different susceptibility to septal GABAergic inputs depending on ongoing cognitive processes, and the GABAergic septohippocampal pathway is involved in consummatory processes related to operant rewards. PMID:23479403

  9. A neural-network reinforcement-learning model of domestic chicks that learn to localize the centre of closed arenas.

    PubMed

    Mannella, Francesco; Baldassarre, Gianluca

    2007-03-29

    Previous experiments have shown that when domestic chicks (Gallus gallus) are first trained to locate food elements hidden at the centre of a closed square arena and then are tested in a square arena of double the size, they search for food both at its centre and at a distance from walls similar to the distance of the centre from the walls experienced during training. This paper presents a computational model that successfully reproduces these behaviours. The model is based on a neural-network implementation of the reinforcement-learning actor-critic architecture (in this architecture the 'critic' learns to evaluate perceived states in terms of predicted future rewards, while the 'actor' learns to increase the probability of selecting the actions that lead to higher evaluations). The analysis of the model suggests which type of information and cognitive mechanisms might underlie chicks' behaviours: (i) the tendency to explore the area at a specific distance from walls might be based on the processing of the height of walls' horizontal edges, (ii) the capacity to generalize the search at the centre of square arenas independently of their size might be based on the processing of the relative position of walls' vertical edges on the horizontal plane (equalization of walls' width), and (iii) the whole behaviour exhibited in the large square arena can be reproduced by assuming the existence of an attention process that, at each time, focuses chicks' internal processing on either one of the two previously discussed information sources. The model also produces testable predictions regarding the generalization capabilities that real chicks should exhibit if trained in circular arenas of varying size. The paper also highlights the potential of the model to address other experiments on animals' navigation and analyses its strengths and weaknesses in comparison to other models. PMID:17255019
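
    The actor-critic scheme described in parentheses above can be conveyed with a generic Python sketch on a one-dimensional corridor whose centre is rewarded. This is an illustration of the architecture only (tabular rather than neural-network-based, with an assumed arena, reward placement, and parameters), not the authors' model of the chick experiments: the critic learns to evaluate states in terms of predicted future reward, and the actor raises the probability of actions that lead to higher evaluations.

      import numpy as np

      N, CENTRE = 7, 3                     # assumed 1-D arena; food hidden at the centre
      alpha, gamma = 0.1, 0.9
      V = np.zeros(N)                      # critic: predicted future reward per state
      prefs = np.zeros((N, 2))             # actor: preferences for {move left, move right}
      rng = np.random.default_rng(6)

      def policy(s):
          e = np.exp(prefs[s] - prefs[s].max())
          return e / e.sum()

      for _ in range(3000):
          s = int(rng.integers(N))
          for _ in range(20):
              a = int(rng.choice(2, p=policy(s)))
              s_next = int(np.clip(s + (1 if a else -1), 0, N - 1))
              r = 1.0 if s_next == CENTRE else 0.0
              target = r if r else gamma * V[s_next]       # the centre ends the trial
              delta = target - V[s]                        # critic's prediction error
              V[s] += alpha * delta                        # critic: evaluate states
              prefs[s, a] += alpha * delta                 # actor: reinforce good actions
              s = s_next
              if r:
                  break

      print("state values:", V.round(2))
      print("P(move right):", np.array([policy(s)[1] for s in range(N)]).round(2))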

  10. Revisiting the Role of Rewards in Motivation and Learning: Implications of Neuroscientific Research

    ERIC Educational Resources Information Center

    Hidi, Suzanne

    2016-01-01

    Rewards have been examined extensively by both psychologists and neuroscientists and have become one of the most contentious issues in social and educational psychology. In psychological research, reward processing has typically been studied in relation to behavioral outcomes. In contrast, neuroscientists have been examining how rewards are…

  11. Reinforcement learning output feedback NN control using deterministic learning technique.

    PubMed

    Xu, Bin; Yang, Chenguang; Shi, Zhongke

    2014-03-01

    In this brief, a novel adaptive-critic-based neural network (NN) controller is investigated for nonlinear pure-feedback systems. The controller design is based on the transformed predictor form, and the actor-critic NN control architecture includes two NNs: the critic NN is used to approximate the strategic utility function, and the action NN is employed to minimize both the strategic utility function and the tracking error. A deterministic learning technique has been employed to guarantee that the partial persistent excitation condition of internal states is satisfied during tracking control to a periodic reference orbit. The uniform ultimate boundedness of closed-loop signals is shown via Lyapunov stability analysis. Simulation results are presented to demonstrate the effectiveness of the proposed control. PMID:24807456

  12. Reinforcement learning for resource allocation in LEO satellite networks.

    PubMed

    Usaha, Wipawee; Barria, Javier A

    2007-06-01

    In this paper, we develop and assess online decision-making algorithms for call admission and routing for low Earth orbit (LEO) satellite networks. It has been shown in a recent paper that, in a LEO satellite system, a semi-Markov decision process formulation of the call admission and routing problem can achieve better performance in terms of an average revenue function than existing routing methods. However, the conventional dynamic programming (DP) numerical solution becomes prohibitive as the problem size increases. In this paper, two solution methods based on reinforcement learning (RL) are proposed in order to circumvent the computational burden of DP. The first method is based on an actor-critic method with temporal-difference (TD) learning. The second method is based on a critic-only method, called optimistic TD learning. The algorithms enhance performance in terms of storage, computational complexity, and computation time requirements, and in terms of an overall long-term average revenue function that penalizes blocked calls. Numerical studies are carried out, and the results obtained show that the RL framework can achieve up to 56% higher average revenue over existing routing methods used in LEO satellite networks with reasonable storage and computational requirements. PMID:17550108
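
    As a rough illustration of how RL replaces the DP solution in this kind of admission problem, the Python sketch below learns accept/reject values for calls of two revenue classes competing for a handful of channels on a single link. The traffic parameters, revenues, and the simplified bootstrapping are assumptions for illustration and do not reproduce the paper's semi-Markov formulation or its actor-critic and optimistic TD algorithms.

      import numpy as np

      C = 5                                           # channels on the link (assumed)
      REVENUE = {0: 1.0, 1: 5.0}                      # class 0 = cheap, class 1 = premium
      P_ARRIVAL, P_PREMIUM, P_DEPART = 0.5, 0.3, 0.4
      alpha, gamma, eps = 0.05, 0.95, 0.1
      Q = np.zeros((C + 1, 2, 2))                     # Q[busy, call class, accept?]
      rng = np.random.default_rng(7)

      busy = 0
      for _ in range(300_000):
          if busy > 0 and rng.random() < P_DEPART:    # a call finishes
              busy -= 1
          if rng.random() < P_ARRIVAL:                # a call of some class arrives
              k = int(rng.random() < P_PREMIUM)
              if busy == C:
                  a = 0                               # forced rejection at full capacity
              else:
                  a = int(rng.integers(2)) if rng.random() < eps else int(Q[busy, k].argmax())
              r = REVENUE[k] if a else 0.0
              nxt = busy + a
              # Bootstrapping with the best value over future call classes is a
              # simplification of the true (semi-Markov) next-decision value.
              Q[busy, k, a] += alpha * (r + gamma * Q[nxt].max() - Q[busy, k, a])
              busy = nxt

      print("advantage of accepting a cheap call, by occupancy:",
            (Q[:, 0, 1] - Q[:, 0, 0]).round(2))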

  13. Curiosity driven reinforcement learning for motion planning on humanoids

    PubMed Central

    Frank, Mikhail; Leitner, Jürgen; Stollenga, Marijn; Förster, Alexander; Schmidhuber, Jürgen

    2014-01-01

    Most previous work on artificial curiosity (AC) and intrinsic motivation focuses on basic concepts and theory. Experimental results are generally limited to toy scenarios, such as navigation in a simulated maze, or control of a simple mechanical system with one or two degrees of freedom. To study AC in a more realistic setting, we embody a curious agent in the complex iCub humanoid robot. Our novel reinforcement learning (RL) framework consists of a state-of-the-art, low-level, reactive control layer, which controls the iCub while respecting constraints, and a high-level curious agent, which explores the iCub's state-action space through information gain maximization, learning a world model from experience, controlling the actual iCub hardware in real-time. To the best of our knowledge, this is the first ever embodied, curious agent for real-time motion planning on a humanoid. We demonstrate that it can learn compact Markov models to represent large regions of the iCub's configuration space, and that the iCub explores intelligently, showing interest in its physical constraints as well as in objects it finds in its environment. PMID:24432001

  14. Curiosity driven reinforcement learning for motion planning on humanoids.

    PubMed

    Frank, Mikhail; Leitner, Jürgen; Stollenga, Marijn; Förster, Alexander; Schmidhuber, Jürgen

    2014-01-01

    Most previous work on artificial curiosity (AC) and intrinsic motivation focuses on basic concepts and theory. Experimental results are generally limited to toy scenarios, such as navigation in a simulated maze, or control of a simple mechanical system with one or two degrees of freedom. To study AC in a more realistic setting, we embody a curious agent in the complex iCub humanoid robot. Our novel reinforcement learning (RL) framework consists of a state-of-the-art, low-level, reactive control layer, which controls the iCub while respecting constraints, and a high-level curious agent, which explores the iCub's state-action space through information gain maximization, learning a world model from experience, controlling the actual iCub hardware in real-time. To the best of our knowledge, this is the first ever embodied, curious agent for real-time motion planning on a humanoid. We demonstrate that it can learn compact Markov models to represent large regions of the iCub's configuration space, and that the iCub explores intelligently, showing interest in its physical constraints as well as in objects it finds in its environment. PMID:24432001

  15. Decentralized reinforcement-learning control and emergence of motion patterns

    NASA Astrophysics Data System (ADS)

    Svinin, Mikhail; Yamada, Kazuyaki; Okhura, Kazuhiro; Ueda, Kanji

    1998-10-01

    In this paper we propose a system for studying the emergence of motion patterns in autonomous mobile robotic systems. The system implements instance-based reinforcement learning control. Three spaces are of importance in the formulation of the control scheme: the work space, the sensor space, and the action space. An important feature of our system is that all these spaces are assumed to be continuous. The core part of the system is a classifier system. Based on the sensory state-space analysis, the control is decentralized and is specified at the lowest level of the control system. However, the local controllers are implicitly connected through the perceived environment information and therefore constitute a dynamic environment with respect to each other. The proposed control scheme is tested in simulation for a mobile robot in a navigation task. It is shown that some patterns of global behavior--such as collision avoidance, wall-following, and light-seeking--can emerge from the local controllers.

  16. Model-based hierarchical reinforcement learning and human action control

    PubMed Central

    Botvinick, Matthew; Weinstein, Ari

    2014-01-01

    Recent work has reawakened interest in goal-directed or ‘model-based’ choice, where decisions are based on prospective evaluation of potential action outcomes. Concurrently, there has been growing attention to the role of hierarchy in decision-making and action control. We focus here on the intersection between these two areas of interest, considering the topic of hierarchical model-based control. To characterize this form of action control, we draw on the computational framework of hierarchical reinforcement learning, using this to interpret recent empirical findings. The resulting picture reveals how hierarchical model-based mechanisms might play a special and pivotal role in human decision-making, dramatically extending the scope and complexity of human behaviour. PMID:25267822

  17. A Robust Reinforcement Learning Control Design Method for Nonlinear System with Partially Unknown Structure

    NASA Astrophysics Data System (ADS)

    Nakano, Kazuhiro; Obayashi, Masanao; Kuremoto, Takashi; Kobayashi, Kunikazu

    We propose a control system that is robust to disturbances and can handle a nonlinear system with a partially unknown structure by fusing reinforcement learning with robust control theory. First, we solve an optimal control problem without using the unknown part of the system's functions, by means of a neural network and the iterative learning of a reinforcement learning algorithm. Second, we build a robust reinforcement learning control system that tolerates uncertainty and is robust to disturbances by fusing ideas from H infinity control theory with the above system.

  18. Effects of D1 receptor knockout on fear and reward learning.

    PubMed

    Abraham, Antony D; Neve, Kim A; Lattal, K Matthew

    2016-09-01

    Dopamine signaling is involved in a variety of neurobiological processes that contribute to learning and memory. D1-like dopamine receptors (including D1 and D5 receptors) are thought to be involved in memory and reward processes, but pharmacological approaches have been limited in their ability to distinguish between D1 and D5 receptors. Here, we examine the effects of a specific knockout of D1 receptors in associative learning tasks involving aversive (shock) or appetitive (cocaine) unconditioned stimuli. We find that D1 knockout mice show similar levels of cued and contextual fear conditioning to WT controls following conditioning protocols involving one, two, or four shocks. D1 knockout mice show increased generalization of fear conditioning and extinction across contexts, revealed as increased freezing to a novel context following conditioning and decreased freezing to an extinguished cue during a contextual renewal test. Further, D1 knockout mice show mild enhancements in extinction following an injection of SKF81297, a D1/D5 receptor agonist, suggesting a role for D5 receptors in extinction enhancements induced by nonspecific pharmacological agonists. Finally, although D1 knockout mice show decreased locomotion induced by cocaine, they are able to form a cocaine-induced conditioned place preference. We discuss these findings in terms of the role of dopamine D1 receptors in general learning and memory processes. PMID:27423521

  19. Lack of effect of Pitressin on the learning ability of Brattleboro rats with diabetes insipidus using positively reinforced operant conditioning.

    PubMed

    Laycock, J F; Gartside, I B

    1985-08-01

    Brattleboro rats with hereditary hypothalamic diabetes insipidus (BDI) received daily subcutaneous injections of vasopressin in the form of Pitressin tannate (0.5 IU/24 hr). They were initially deprived of food and then trained to work for food reward in a Skinner box to a fixed ratio of ten presses for each pellet received. Once this schedule had been learned the rats were given a discrimination task daily for seven days. The performances of these BDI rats were compared with those of rats of the parent Long Evans (LE) strain receiving daily subcutaneous injections of vehicle (arachis oil). Comparisons were also made between these two groups of treated animals and untreated BDI and LE rats studied under similar conditions. In the initial learning trial, both control and Pitressin-treated BDI rats performed significantly better, and manifested less fear initially, than the control or vehicle-injected LE rats when first placed in the Skinner box. Once the initial task had been learned there was no marked difference in the discrimination learning between control or treated BDI and LE animals. These results support the view that vasopressin is not directly involved in all types of learning behaviour, particularly those involving positively reinforced operant conditioning. PMID:4070391

  20. Neuropharmacology of New Psychoactive Substances (NPS): Focus on the Rewarding and Reinforcing Properties of Cannabimimetics and Amphetamine-Like Stimulants.

    PubMed

    Miliano, Cristina; Serpelloni, Giovanni; Rimondo, Claudia; Mereu, Maddalena; Marti, Matteo; De Luca, Maria Antonietta

    2016-01-01

    New psychoactive substances (NPS) are a heterogeneous and rapidly evolving class of molecules available on the global illicit drug market (e.g., smart shops, internet, "dark net") as a substitute for controlled substances. The use of NPS, mainly consumed along with other drugs of abuse and/or alcohol, has resulted in a significant increase in mortality and emergency admissions for overdoses, as reported by several poison centers from all over the world. The fact that the number of NPS has more than doubled over the last 10 years poses a critical challenge to governments, the scientific community, and civil society [EMCDDA (European Drug Report), 2014; UNODC, 2014b; Trends and developments]. The chemical structure (phenethylamines, piperazines, cathinones, tryptamines, synthetic cannabinoids) of NPS and their pharmacological and clinical effects (hallucinogenic, anesthetic, dissociative, depressant) help classify them into different categories. In the recent past, 50% of newly identified NPS have been classified as synthetic cannabinoids, followed by new phenethylamines (17%) (UNODC, 2014b). Besides peripheral toxicological effects, many NPS seem to have addictive properties. Behavioral, neurochemical, and electrophysiological evidence can help in detecting them. This manuscript will review existing literature about the addictive and rewarding properties of the most popular NPS classes: cannabimimetics (JWH, HU, CP series) and amphetamine-like stimulants (amphetamine, methamphetamine, methcathinone, and MDMA analogs). Moreover, the review will include recent data from our lab which links JWH-018, a CB1 and CB2 agonist more potent than Δ(9)-THC, to other cannabinoids with known abuse potential, and to other classes of abused drugs that increase dopamine signaling in the Nucleus Accumbens (NAc) shell. Thus the neurochemical mechanisms that produce the rewarding properties of JWH-018, which most likely contributes to the greater incidence of dependence associated

  1. Neuropharmacology of New Psychoactive Substances (NPS): Focus on the Rewarding and Reinforcing Properties of Cannabimimetics and Amphetamine-Like Stimulants

    PubMed Central

    Miliano, Cristina; Serpelloni, Giovanni; Rimondo, Claudia; Mereu, Maddalena; Marti, Matteo; De Luca, Maria Antonietta

    2016-01-01

    New psychoactive substances (NPS) are a heterogeneous and rapidly evolving class of molecules available on the global illicit drug market (e.g., smart shops, internet, “dark net”) as a substitute for controlled substances. The use of NPS, mainly consumed along with other drugs of abuse and/or alcohol, has resulted in a significant increase in mortality and emergency admissions for overdoses, as reported by several poison centers from all over the world. The fact that the number of NPS has more than doubled over the last 10 years poses a critical challenge to governments, the scientific community, and civil society [EMCDDA (European Drug Report), 2014; UNODC, 2014b; Trends and developments]. The chemical structure (phenethylamines, piperazines, cathinones, tryptamines, synthetic cannabinoids) of NPS and their pharmacological and clinical effects (hallucinogenic, anesthetic, dissociative, depressant) help classify them into different categories. In the recent past, 50% of newly identified NPS have been classified as synthetic cannabinoids, followed by new phenethylamines (17%) (UNODC, 2014b). Besides peripheral toxicological effects, many NPS seem to have addictive properties. Behavioral, neurochemical, and electrophysiological evidence can help in detecting them. This manuscript will review existing literature about the addictive and rewarding properties of the most popular NPS classes: cannabimimetics (JWH, HU, CP series) and amphetamine-like stimulants (amphetamine, methamphetamine, methcathinone, and MDMA analogs). Moreover, the review will include recent data from our lab which links JWH-018, a CB1 and CB2 agonist more potent than Δ9-THC, to other cannabinoids with known abuse potential, and to other classes of abused drugs that increase dopamine signaling in the Nucleus Accumbens (NAc) shell. Thus the neurochemical mechanisms that produce the rewarding properties of JWH-018, which most likely contributes to the greater incidence of dependence

  2. B-tree search reinforcement learning for model based intelligent agent

    NASA Astrophysics Data System (ADS)

    Bhuvaneswari, S.; Vignashwaran, R.

    2013-03-01

    Agents trained by learning techniques provide a powerful approximation of active solutions for naive approaches. In this study, B-trees combined with reinforcement learning are used to moderate the data search for information retrieval, achieving accuracy with minimum search time. The impact of the variables and tactics applied in training is determined using reinforcement learning. Agents based on these techniques achieve a satisfactory baseline and act as finite agents, based on the predetermined model, against competitors from the course.

  3. The Drive-Reinforcement Neuronal Model: A Real-Time Learning Mechanism For Unsupervised Learning

    NASA Astrophysics Data System (ADS)

    Klopf, A. H.

    1988-05-01

    The drive-reinforcement neuronal model is described as an example of a newly discovered class of real-time learning mechanisms that correlate earlier derivatives of inputs with later derivatives of outputs. The drive-reinforcement neuronal model has been demonstrated to predict a wide range of classical conditioning phenomena in animal learning. A variety of classes of connectionist and neural network models have been investigated in recent years (Hinton and Anderson, 1981; Levine, 1983; Barto, 1985; Feldman, 1985; Rumelhart and McClelland, 1986). After a brief review of these models, discussion will focus on the class of real-time models because they appear to be making the strongest contact with the experimental evidence of animal learning. Theoretical models in physics have inspired Boltzmann machines (Ackley, Hinton, and Sejnowski, 1985) and what are sometimes called Hopfield networks (Hopfield, 1982; Hopfield and Tank, 1986). These connectionist models utilize symmetric connections and adaptive equilibrium processes during which the networks settle into minimal energy states. Networks utilizing error-correction learning mechanisms go back to Rosenblatt's (1962) perceptron and Widrow's (1962) adaline and currently take the form of back propagation networks (Parker, 1985; Rumelhart, Hinton, and Williams, 1985, 1986). These networks require a "teacher" or "trainer" to provide error signals indicating the difference between desired and actual responses. Networks employing real-time learning mechanisms, in which the temporal association of signals is of fundamental importance, go back to Hebb (1949). Real-time learning mechanisms may require no teacher or trainer and thus may lend themselves to unsupervised learning. Such models have been extended by Klopf (1972, 1982), who introduced the notions of synaptic eligibility and generalized reinforcement. Sutton and Barto (1981) advanced this class of models by proposing that a derivative of the theoretical neuron's out
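
    A hedged sketch of the drive-reinforcement idea described above: the change in a neuron's output at time t is correlated with earlier positive changes in each input, scaled by the magnitude of the corresponding weight. The window length and decay coefficients are illustrative assumptions; the exact formulation of the original model is not reproduced here:

      import numpy as np

      def drive_reinforcement_update(w, dx_hist, dy, c):
          """One weight update for a single neuron (sketch).
          w       : current synaptic weights, shape (n_inputs,)
          dx_hist : dx_hist[j-1] holds the input changes j time steps in the past
          dy      : change in the neuron's output at the current time step
          c       : decaying coefficients, one per past time step"""
          dw = np.zeros_like(w)
          for j, dx in enumerate(dx_hist, start=1):
              # only earlier positive input changes (signal onsets) contribute
              dw += dy * c[j - 1] * np.abs(w) * np.maximum(dx, 0.0)
          return w + dw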

  4. Functional specialization within the striatum along both the dorsal/ventral and anterior/posterior axes during associative learning via reward and punishment

    PubMed Central

    Mattfeld, Aaron T.; Gluck, Mark A.; Stark, Craig E.L.

    2011-01-01

    The goal of the present study was to elucidate the role of the human striatum in learning via reward and punishment during an associative learning task. Previous studies have identified the striatum as a critical component in the neural circuitry of reward-related learning. It remains unclear, however, under what task conditions, and to what extent, the striatum is modulated by punishment during an instrumental learning task. Using high-resolution functional magnetic resonance imaging (fMRI) during a reward- and punishment-based probabilistic associative learning task, we observed activity in the ventral putamen for stimuli learned via reward regardless of whether participants were correct or incorrect (i.e., outcome). In contrast, activity in the dorsal caudate was modulated by trials that received feedback—either correct reward or incorrect punishment trials. We also identified an anterior/posterior dissociation reflecting reward and punishment prediction error estimates. Additionally, differences in patterns of activity that correlated with the amount of training were identified along the anterior/posterior axis of the striatum. We suggest that unique subregions of the striatum—separated along both a dorsal/ventral and anterior/posterior axis— differentially participate in the learning of associations through reward and punishment. PMID:22021252

  5. Reward-based learning under hardware constraints—using a RISC processor embedded in a neuromorphic substrate

    PubMed Central

    Friedmann, Simon; Frémaux, Nicolas; Schemmel, Johannes; Gerstner, Wulfram; Meier, Karlheinz

    2013-01-01

    In this study, we propose and analyze in simulations a new, highly flexible method of implementing synaptic plasticity in a wafer-scale, accelerated neuromorphic hardware system. The study focuses on globally modulated STDP, as a special use-case of this method. Flexibility is achieved by embedding a general-purpose processor dedicated to plasticity into the wafer. To evaluate the suitability of the proposed system, we use a reward modulated STDP rule in a spike train learning task. A single layer of neurons is trained to fire at specific points in time with only the reward as feedback. This model is simulated to measure its performance, i.e., the increase in received reward after learning. Using this performance as baseline, we then simulate the model with various constraints imposed by the proposed implementation and compare the performance. The simulated constraints include discretized synaptic weights, a restricted interface between analog synapses and embedded processor, and mismatch of analog circuits. We find that probabilistic updates can increase the performance of low-resolution weights, a simple interface between analog synapses and processor is sufficient for learning, and performance is insensitive to mismatch. Further, we consider communication latency between wafer and the conventional control computer system that is simulating the environment. This latency increases the delay, with which the reward is sent to the embedded processor. Because of the time continuous operation of the analog synapses, delay can cause a deviation of the updates as compared to the not delayed situation. We find that for highly accelerated systems latency has to be kept to a minimum. This study demonstrates the suitability of the proposed implementation to emulate the selected reward modulated STDP learning rule. It is therefore an ideal candidate for implementation in an upgraded version of the wafer-scale system developed within the BrainScaleS project. PMID:24065877
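
    A hedged sketch of the kind of reward-modulated STDP rule evaluated in this study: pre/post spike pairings accumulate in an eligibility trace, and a delayed, global reward signal converts the trace into a weight change; rounding to a small number of levels mimics the low-resolution hardware weights. All parameter values and names are illustrative, not the BrainScaleS implementation:

      import numpy as np

      def r_stdp_step(w, elig, pre_trace, post_trace, pre_spikes, post_spikes, reward,
                      a_plus=0.01, a_minus=0.012, tau_trace=0.9, tau_elig=0.95,
                      lr=0.1, levels=16):
          """One simulation step of reward-modulated STDP with discretized weights (sketch)."""
          # low-pass traces of recent pre- and postsynaptic spikes
          pre_trace = tau_trace * pre_trace + pre_spikes
          post_trace = tau_trace * post_trace + post_spikes
          # pre-before-post pairings potentiate, post-before-pre pairings depress
          stdp = a_plus * np.outer(post_spikes, pre_trace) - a_minus * np.outer(post_trace, pre_spikes)
          # the pairing is stored in an eligibility trace rather than applied immediately
          elig = tau_elig * elig + stdp
          # the delayed, global reward signal gates the actual weight change
          w = np.clip(w + lr * reward * elig, 0.0, 1.0)
          w = np.round(w * (levels - 1)) / (levels - 1)   # low-resolution weights
          return w, elig, pre_trace, post_trace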

  6. Corticostriatal circuit mechanisms of value-based action selection: Implementation of reinforcement learning algorithms and beyond.

    PubMed

    Morita, Kenji; Jitsev, Jenia; Morrison, Abigail

    2016-09-15

    Value-based action selection has been suggested to be realized in the corticostriatal local circuits through competition among neural populations. In this article, we review theoretical and experimental studies that have constructed and verified this notion, and provide new perspectives on how the local-circuit selection mechanisms implement reinforcement learning (RL) algorithms and computations beyond them. The striatal neurons are mostly inhibitory, and lateral inhibition among them has been classically proposed to realize "Winner-Take-All (WTA)" selection of the maximum-valued action (i.e., 'max' operation). Although this view has been challenged by the revealed weakness, sparseness, and asymmetry of lateral inhibition, which suggest more complex dynamics, WTA-like competition could still occur on short time scales. Unlike the striatal circuit, the cortical circuit contains recurrent excitation, which may enable retention or temporal integration of information and probabilistic "soft-max" selection. The striatal "max" circuit and the cortical "soft-max" circuit might co-implement an RL algorithm called Q-learning; the cortical circuit might also similarly serve for other algorithms such as SARSA. In these implementations, the cortical circuit presumably sustains activity representing the executed action, which negatively impacts dopamine neurons so that they can calculate reward-prediction-error. Regarding the suggested more complex dynamics of striatal, as well as cortical, circuits on long time scales, which could be viewed as a sequence of short WTA fragments, computational roles remain open: such a sequence might represent (1) sequential state-action-state transitions, constituting replay or simulation of the internal model, (2) a single state/action by the whole trajectory, or (3) probabilistic sampling of state/action. PMID:27173430
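
    The "max" versus "soft-max" distinction drawn above maps onto the standard Q-learning and SARSA updates. A minimal tabular sketch, with parameters that are illustrative rather than taken from the article:

      import numpy as np

      def softmax_action(q_values, beta=3.0, rng=np.random):
          """Probabilistic ("soft-max") action selection, as attributed above to cortical circuits."""
          p = np.exp(beta * (q_values - q_values.max()))
          p /= p.sum()
          return rng.choice(len(q_values), p=p)

      def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
          """Q-learning bootstraps on the maximum-valued next action (the "max" operation)."""
          target = r + gamma * np.max(Q[s_next])
          Q[s, a] += alpha * (target - Q[s, a])

      def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
          """SARSA bootstraps on the action actually selected, e.g. by soft-max."""
          target = r + gamma * Q[s_next, a_next]
          Q[s, a] += alpha * (target - Q[s, a])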

  7. Beamforming and Power Control in Sensor Arrays Using Reinforcement Learning

    PubMed Central

    Almeida, Náthalee C.; Fernandes, Marcelo A.C.; Neto, Adrião D.D.

    2015-01-01

    The use of beamforming and power control, combined or separately, has advantages and disadvantages, depending on the application. The combined use of beamforming and power control has been shown to be highly effective in applications involving the suppression of interference signals from different sources. However, it is necessary to identify efficient methodologies for the combined operation of these two techniques. The most appropriate technique may be obtained by means of the implementation of an intelligent agent capable of making the best selection between beamforming and power control. The present paper proposes an algorithm using reinforcement learning (RL) to determine the optimal combination of beamforming and power control in sensor arrays. The RL algorithm used was Q-learning, employing an ε-greedy policy, and training was performed using the offline method. The simulations showed that RL was effective for implementation of a switching policy involving the different techniques, taking advantage of the positive characteristics of each technique in terms of signal reception. PMID:25808769
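
    A hedged sketch of the ε-greedy Q-learning switching policy described above, where the two discrete actions are "apply beamforming" and "apply power control"; the state encoding, reward definition, and parameter values are illustrative assumptions rather than the authors' exact setup:

      import numpy as np

      N_STATES, N_ACTIONS = 10, 2   # e.g. a quantized interference state; 0 = beamforming, 1 = power control
      Q = np.zeros((N_STATES, N_ACTIONS))
      ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
      rng = np.random.default_rng(0)

      def select_action(state):
          """ε-greedy choice between the two interference-mitigation techniques."""
          if rng.random() < EPSILON:
              return int(rng.integers(N_ACTIONS))
          return int(np.argmax(Q[state]))

      def update(state, action, reward, next_state):
          """Tabular Q-learning update; offline training would replay logged (s, a, r, s') tuples."""
          best_next = np.max(Q[next_state])
          Q[state, action] += ALPHA * (reward + GAMMA * best_next - Q[state, action])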

  8. Reinforcement learning techniques for controlling resources in power networks

    NASA Astrophysics Data System (ADS)

    Kowli, Anupama Sunil

    As power grids transition towards increased reliance on renewable generation, energy storage and demand response resources, an effective control architecture is required to harness the full functionalities of these resources. There is a critical need for control techniques that recognize the unique characteristics of the different resources and exploit the flexibility afforded by them to provide ancillary services to the grid. The work presented in this dissertation addresses these needs. Specifically, new algorithms are proposed, which allow control synthesis in settings wherein the precise distribution of the uncertainty and its temporal statistics are not known. These algorithms are based on recent developments in Markov decision theory, approximate dynamic programming and reinforcement learning. They impose minimal assumptions on the system model and allow the control to be "learned" based on the actual dynamics of the system. Furthermore, they can accommodate complex constraints such as capacity and ramping limits on generation resources, state-of-charge constraints on storage resources, comfort-related limitations on demand response resources and power flow limits on transmission lines. Numerical studies demonstrating applications of these algorithms to practical control problems in power systems are discussed. Results demonstrate how the proposed control algorithms can be used to improve the performance and reduce the computational complexity of the economic dispatch mechanism in a power network. We argue that the proposed algorithms are eminently suitable to develop operational decision-making tools for large power grids with many resources and many sources of uncertainty.

  9. A reinforcement learning model of joy, distress, hope and fear

    NASA Astrophysics Data System (ADS)

    Broekens, Joost; Jacobs, Elmer; Jonker, Catholijn M.

    2015-07-01

    In this paper we computationally study the relation between adaptive behaviour and emotion. Using the reinforcement learning framework, we propose that the learned state utility models fear (negative) and hope (positive), based on the fact that both signals are about anticipation of loss or gain. Further, we propose that joy/distress is a signal similar to the error signal. We present agent-based simulation experiments that show that this model replicates psychological and behavioural dynamics of emotion. This work distinguishes itself by assessing the dynamics of emotion in an adaptive agent framework - coupling it to the literature on habituation, development, extinction and hope theory. Our results support the idea that the function of emotion is to provide a complex feedback signal for an organism to adapt its behaviour. Our work is relevant for understanding the relation between emotion and adaptation in animals, as well as for human-robot interaction, in particular how emotional signals can be used to communicate between adaptive agents and humans.
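
    A minimal sketch of the mapping proposed above, assuming a standard temporal-difference learner: the signed state value is read out as hope (positive) or fear (negative), and the signed TD error as joy (positive) or distress (negative). Variable names are illustrative:

      def emotion_signals(V, s, r, s_next, gamma=0.95):
          """Read emotion labels off value and TD-error signals (sketch)."""
          td_error = r + gamma * V[s_next] - V[s]
          hope = max(V[s], 0.0)            # anticipated gain
          fear = max(-V[s], 0.0)           # anticipated loss
          joy = max(td_error, 0.0)         # better than expected
          distress = max(-td_error, 0.0)   # worse than expected
          return hope, fear, joy, distress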

  10. Rewarding brain stimulation reverses the disruptive effect of amygdala damage on emotional learning.

    PubMed

    Kádár, Elisabet; Ramoneda, Marc; Aldavert-Vera, Laura; Huguet, Gemma; Morgado-Bernal, Ignacio; Segura-Torres, Pilar

    2014-11-01

    Intracranial self-stimulation (SS) in the lateral hypothalamus, a rewarding deep-brain stimulation, is able to improve acquisition and retention of implicit and explicit memory tasks in rats. SS treatment is also able to reverse cognitive deficits associated with aging or with experimental brain injuries and evaluated in a two-way active avoidance (2wAA) task. The main objective of the present study was to explore the potential of the SS treatment to reverse the complete learning and memory impairment caused by bilateral lesion in the lateral amygdala (LA). The effects of post-training SS, administered after each acquisition session, were evaluated on distributed 2wAA acquisition and 10-day retention in rats with electrolytic bilateral LA lesions. The effect of SS on acetylcholinesterase (AChE) activity was evaluated by immunohistochemistry in the LA-preserved area and the central nucleus (Ce) of the amygdala of LA-damaged rats. Results showed that LA lesion over 40% completely impeded 2wAA acquisition and retention. Post-training SS in the LA-lesioned rats improved conditioning and retention compared with both the lesioned but non-SS treated and the non-lesioned control rats. SS treatment also seemed to induce a decrease in AChE activity in the LA-preserved area of the lesioned rats, but no effects were observed in the Ce. This empirical evidence supports the idea that self-administered rewarding stimulation is able to completely counteract the 2wAA acquisition and retention deficits induced by LA lesion. Cholinergic mechanisms in preserved LA and the contribution of other brain memory-related areas activated by SS could mediate the compensatory effect observed. PMID:25106737

  11. Aversive Counterconditioning Attenuates Reward Signaling in the Ventral Striatum

    PubMed Central

    Kaag, Anne Marije; Schluter, Renée S.; Karel, Peter; Homberg, Judith; van den Brink, Wim; Reneman, Liesbeth; van Wingen, Guido A.

    2016-01-01

    Appetitive conditioning refers to the process of learning cue-reward associations and is mediated by the mesocorticolimbic system. Appetitive conditioned responses are difficult to extinguish, especially for highly salient reward such as food and drugs. We investigate whether aversive counterconditioning can alter reward reinstatement in the ventral striatum in healthy volunteers using functional magnetic resonance imaging (fMRI). In the initial conditioning phase, two different stimuli were reinforced with a monetary reward. In the subsequent counterconditioning phase, one of these stimuli was paired with an aversive shock to the wrist. In the following extinction phase, none of the stimuli were reinforced. In the final reinstatement phase, reward was reinstated by informing the participants that the monetary gain could be doubled. Our fMRI data revealed that reward signaling in the ventral striatum and ventral tegmental area following reinstatement was smaller for the stimulus that was counterconditioned with an electrical shock, compared to the non-counterconditioned stimulus. A functional connectivity analysis showed that aversive counterconditioning strengthened striatal connectivity with the hippocampus and insula. These results suggest that reward signaling in the ventral striatum can be attenuated through aversive counterconditioning, possibly by concurrent retrieval of the aversive association through enhanced connectivity with hippocampus and insula. PMID:27594829

  12. Aversive Counterconditioning Attenuates Reward Signaling in the Ventral Striatum.

    PubMed

    Kaag, Anne Marije; Schluter, Renée S; Karel, Peter; Homberg, Judith; van den Brink, Wim; Reneman, Liesbeth; van Wingen, Guido A

    2016-01-01

    Appetitive conditioning refers to the process of learning cue-reward associations and is mediated by the mesocorticolimbic system. Appetitive conditioned responses are difficult to extinguish, especially for highly salient reward such as food and drugs. We investigate whether aversive counterconditioning can alter reward reinstatement in the ventral striatum in healthy volunteers using functional magnetic resonance imaging (fMRI). In the initial conditioning phase, two different stimuli were reinforced with a monetary reward. In the subsequent counterconditioning phase, one of these stimuli was paired with an aversive shock to the wrist. In the following extinction phase, none of the stimuli were reinforced. In the final reinstatement phase, reward was reinstated by informing the participants that the monetary gain could be doubled. Our fMRI data revealed that reward signaling in the ventral striatum and ventral tegmental area following reinstatement was smaller for the stimulus that was counterconditioned with an electrical shock, compared to the non-counterconditioned stimulus. A functional connectivity analysis showed that aversive counterconditioning strengthened striatal connectivity with the hippocampus and insula. These results suggest that reward signaling in the ventral striatum can be attenuated through aversive counterconditioning, possibly by concurrent retrieval of the aversive association through enhanced connectivity with hippocampus and insula. PMID:27594829

  13. Reward Processing in Autism

    PubMed Central

    Scott-Van Zeeland, Ashley A.; Dapretto, Mirella; Ghahremani, Dara G.; Poldrack, Russell A.; Bookheimer, Susan Y.

    2011-01-01

    The social motivation hypothesis of autism posits that infants with autism do not experience social stimuli as rewarding, thereby leading to a cascade of potentially negative consequences for later development. While possible downstream effects of this hypothesis such as altered face and voice processing have been examined, there has not been a direct investigation of social reward processing in autism. Here we use functional magnetic resonance imaging to examine social and monetary rewarded implicit learning in children with and without autism spectrum disorders (ASD). Sixteen males with ASD and sixteen age- and IQ-matched typically developing (TD) males were scanned while performing two versions of a rewarded implicit learning task. In addition to examining responses to reward, we investigated the neural circuitry supporting rewarded learning and the relationship between these factors and social development. We found diminished neural responses to both social and monetary rewards in ASD, with a pronounced reduction in response to social rewards (SR). Children with ASD also demonstrated a further deficit in frontostriatal response during social, but not monetary, rewarded learning. Moreover, we show a relationship between ventral striatum activity and social reciprocity in TD children. Together, these data support the hypothesis that children with ASD have diminished neural responses to SR, and that this deficit relates to social learning impairments. PMID:20437601

  14. Dopamine reward prediction error coding

    PubMed Central

    Schultz, Wolfram

    2016-01-01

    Reward prediction errors consist of the differences between received and predicted rewards. They are crucial for basic forms of learning about rewards and make us strive for more rewards—an evolutionary beneficial trait. Most dopamine neurons in the midbrain of humans, monkeys, and rodents signal a reward prediction error; they are activated by more reward than predicted (positive prediction error), remain at baseline activity for fully predicted rewards, and show depressed activity with less reward than predicted (negative prediction error). The dopamine signal increases nonlinearly with reward value and codes formal economic utility. Drugs of addiction generate, hijack, and amplify the dopamine reward signal and induce exaggerated, uncontrolled dopamine effects on neuronal plasticity. The striatum, amygdala, and frontal cortex also show reward prediction error coding, but only in subpopulations of neurons. Thus, the important concept of reward prediction errors is implemented in neuronal hardware. PMID:27069377
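
    The three response patterns described above follow the sign of a simple prediction-error term; a minimal worked sketch (values are illustrative):

      def reward_prediction_error(received, predicted):
          """Positive when reward exceeds the prediction, zero when fully predicted,
          negative when reward falls short (cf. phasic dopamine responses)."""
          return received - predicted

      print(reward_prediction_error(1.0, 0.2))   #  0.8 -> activation above baseline
      print(reward_prediction_error(1.0, 1.0))   #  0.0 -> activity stays at baseline
      print(reward_prediction_error(0.0, 0.7))   # -0.7 -> depression below baseline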

  15. The better, the bigger: The effect of graded positive performance feedback on the reward positivity.

    PubMed

    Frömer, Romy; Stürmer, Birgit; Sommer, Werner

    2016-02-01

    In this study on skill acquisition in a computerized throwing task, we examined the effect of graded correct-related performance feedback on the reward positivity of the event-related brain potential (ERP). Theories of reinforcement learning predict effects of reward magnitude and expectancy on the reward prediction error. The latter is thought to be reflected in the reward positivity, a fronto-central ERP component. A sample of 68 participants learned to throw at a beamer-projected target disk while performance accuracy, displayed as the place of impact of the projectile on the target, served as graded feedback. Effects of performance accuracy in successful trials, hit frequency, and preceding trial performance on reward positivity were analyzed simultaneously on a trial-by-trial basis by means of linear mixed models. In accord with previous findings, reward positivity increased with feedback about more accurate performance. This relationship was not linear, but cubic, with a larger impact of feedback towards the end of the accuracy distribution. In line with being a measure of expectancy, the reward positivity decreased with increasing hit frequency and was larger after unsuccessful trials. The effect of hit frequency was more pronounced following successful trials. These results indicate a fast trial-by-trial adaptation of expectation. The results confirm predictions of reinforcement learning theory and extend previous findings on reward magnitude to the area of complex, goal-directed skill acquisition. PMID:26756995

  16. Intelligence moderates reinforcement learning: a mini-review of the neural evidence.

    PubMed

    Chen, Chong

    2015-06-01

    Our understanding of the neural basis of reinforcement learning and intelligence, two key factors contributing to human strivings, has progressed significantly recently. However, the overlap of these two lines of research, namely, how intelligence affects neural responses during reinforcement learning, remains uninvestigated. A mini-review of three existing studies suggests that higher IQ (especially fluid IQ) may enhance the neural signal of positive prediction error in dorsolateral prefrontal cortex, dorsal anterior cingulate cortex, and striatum, several brain substrates of reinforcement learning or intelligence. PMID:25185818

  17. Kernel-based least squares policy iteration for reinforcement learning.

    PubMed

    Xu, Xin; Hu, Dewen; Lu, Xicheng

    2007-07-01

    In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge of the dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared to previous work on approximate RL methods, KLSPI makes two advances that address the main difficulties of existing results. One is the better convergence and (near) optimality guarantee obtained by using the KLSTD-Q algorithm for policy evaluation with high precision. The other is the automatic feature selection using the ALD-based kernel sparsification. Therefore, the KLSPI algorithm provides a general RL method with generalization performance and convergence guarantees for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task for a stochastic chain problem demonstrate that KLSPI can consistently achieve better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems, including a ship heading control problem and the swing-up control of a double-link underactuated pendulum called acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information about uncertain dynamic systems. It is also demonstrated that KLSPI can be applied to online learning control by incorporating
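
    A hedged sketch of the least-squares temporal-difference policy-evaluation step that KLSPI builds on, shown with generic feature vectors rather than the paper's Mercer-kernel features and ALD sparsification; names and parameters are illustrative:

      import numpy as np

      def lstd_q(samples, phi, policy, n_features, gamma=0.95, reg=1e-6):
          """Estimate Q(s, a) ~ phi(s, a) . w for a fixed policy from (s, a, r, s') samples."""
          A = reg * np.eye(n_features)
          b = np.zeros(n_features)
          for s, a, r, s_next in samples:
              f = phi(s, a)
              f_next = phi(s_next, policy(s_next))
              A += np.outer(f, f - gamma * f_next)
              b += r * f
          return np.linalg.solve(A, b)   # weight vector of the evaluated policy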

  18. Agent Reward Shaping for Alleviating Traffic Congestion

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian

    2006-01-01

    Traffic congestion problems provide a unique environment to study how multi-agent systems promote desired system-level behavior. What is particularly interesting in this class of problems is that no individual action is intrinsically "bad" for the system but that combinations of actions among agents lead to undesirable outcomes. As a consequence, agents need to learn how to coordinate their actions with those of other agents, rather than learn a particular set of "good" actions. This problem is ubiquitous in various traffic problems, including selecting departure times for commuters, routes for airlines, and paths for data routers. In this paper we present a multi-agent approach to two traffic problems, where, for each driver, an agent selects the most suitable action using reinforcement learning. The agent rewards are based on concepts from collectives and aim to provide the agents with rewards that are both easy to learn and that, if learned, lead to good system-level behavior. In the first problem, we study how agents learn the best departure times of drivers in a daily commuting environment and how following those departure times alleviates congestion. In the second problem, we study how agents learn to select desirable routes to improve traffic flow and minimize delays for all drivers. In both sets of experiments, agents using collective-based rewards produced near-optimal performance (93-96% of optimal) whereas agents using system rewards (63-68%) barely outperformed random action selection (62-64%) and agents using local rewards (48-72%) performed worse than random in some instances.
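
    The "collective-based" rewards referred to above are difference rewards: each agent is credited with the change in the system utility G that its own action caused, relative to a counterfactual in which it acted differently (or not at all). A hedged sketch, with all function names illustrative:

      def difference_reward(agent, joint_action, system_utility, default_action=None):
          """Reward agent i by G(z) minus G(z with agent i's action replaced by a default)."""
          counterfactual = dict(joint_action)
          counterfactual[agent] = default_action   # or remove the agent's contribution entirely
          return system_utility(joint_action) - system_utility(counterfactual)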

  19. Scale-free memory model for multiagent reinforcement learning. Mean field approximation and rock-paper-scissors dynamics

    NASA Astrophysics Data System (ADS)

    Lubashevsky, I.; Kanemoto, S.

    2010-07-01

    A continuous-time model for multiagent systems governed by reinforcement learning with scale-free memory is developed. The agents are assumed to act independently of one another in optimizing their choice of possible actions via trial-and-error search. To estimate the action value, the agents accumulate in their memory the rewards obtained from taking a specific action at each moment of time. The contribution of past rewards to the agent's current perception of action value is described by an integral operator with a power-law kernel. Finally, a fractional differential equation governing the system dynamics is obtained. The agents are considered to interact with one another implicitly via the reward of one agent depending on the choice of the other agents. The pairwise interaction model is adopted to describe this effect. As a specific example of systems with non-transitive interactions, two-agent and three-agent systems of the rock-paper-scissors type are analyzed in detail, including the stability analysis and numerical simulation. Scale-free memory is demonstrated to cause complex dynamics of the systems at hand. In particular, it is shown that there can be simultaneously two modes of the system instability undergoing subcritical and supercritical bifurcation, with the latter one exhibiting anomalous oscillations with the amplitude and period growing with time. Moreover, the instability onset via this supercritical mode may be regarded as “altruism self-organization”. For the three-agent system the instability dynamics is found to be rather irregular and can be composed of alternate fragments of oscillations different in their properties.
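
    A hedged sketch of the scale-free memory described above: the perceived value of an action is a sum of the rewards previously obtained for it, weighted by a kernel that decays as a power of the elapsed time. The exponent and the short-time cutoff are illustrative assumptions:

      import numpy as np

      def power_law_value(reward_times, rewards, now, exponent=0.8, t_min=1.0):
          """Perceived action value under a power-law memory kernel ~ elapsed**(-exponent)."""
          elapsed = np.maximum(now - np.asarray(reward_times, dtype=float), t_min)
          weights = elapsed ** (-exponent)
          return float(np.sum(weights * np.asarray(rewards, dtype=float)))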

  20. Effect of Reinforcement on Modality of Stimulus Control in Learning Disabled Students.

    ERIC Educational Resources Information Center

    Koorland, Mark A.; Wolking, William D.

    1982-01-01

    The effects of reinforcement contingencies on task performance of bisensory missing words were studied with two students (about nine years old): one learning disabled (LD) male with an auditory preference and one LD female with a visual preference. Reinforcement contingencies were found to control both students' performances. (Author/SEW)

  1. Developmental changes in the reward positivity: an electrophysiological trajectory of reward processing.

    PubMed

    Lukie, Carmen N; Montazer-Hojat, Somayyeh; Holroyd, Clay B

    2014-07-01

    Children and adolescents learn to regulate their behavior by utilizing feedback from the environment but exactly how this ability develops remains unclear. To investigate this question, we recorded the event-related brain potential (ERP) from children (8-13 years), adolescents (14-17 years) and young adults (18-23 years) while they navigated a "virtual maze" in pursuit of monetary rewards. The amplitude of the reward positivity, an ERP component elicited by feedback stimuli, was evaluated for each age group. A current theory suggests the reward positivity is produced by the impact of reinforcement learning signals carried by the midbrain dopamine system on anterior cingulate cortex, which utilizes the signals to learn and execute extended behaviors. We found that the three groups produced a reward positivity of comparable size despite relatively longer ERP component latencies for the children, suggesting that the reward processing system reaches maturity early in development. We propose that early development of the midbrain dopamine system facilitates the development of extended goal-directed behaviors in anterior cingulate cortex. PMID:24879113

  2. Robot cognitive control with a neurophysiologically inspired reinforcement learning model.

    PubMed

    Khamassi, Mehdi; Lallée, Stéphane; Enel, Pierre; Procyk, Emmanuel; Dominey, Peter F

    2011-01-01

    A major challenge in modern robotics is to liberate robots from controlled industrial settings, and allow them to interact with humans and changing environments in the real-world. The current research attempts to determine if a neurophysiologically motivated model of cortical function in the primate can help to address this challenge. Primates are endowed with cognitive systems that allow them to maximize the feedback from their environment by learning the values of actions in diverse situations and by adjusting their behavioral parameters (i.e., cognitive control) to accommodate unexpected events. In such contexts uncertainty can arise from at least two distinct sources - expected uncertainty resulting from noise during sensory-motor interaction in a known context, and unexpected uncertainty resulting from the changing probabilistic structure of the environment. However, it is not clear how neurophysiological mechanisms of reinforcement learning and cognitive control integrate in the brain to produce efficient behavior. Based on primate neuroanatomy and neurophysiology, we propose a novel computational model for the interaction between lateral prefrontal and anterior cingulate cortex reconciling previous models dedicated to these two functions. We deployed the model in two robots and demonstrate that, based on adaptive regulation of a meta-parameter β that controls the exploration rate, the model can robustly deal with the two kinds of uncertainties in the real-world. In addition the model could reproduce monkey behavioral performance and neurophysiological data in two problem-solving tasks. A last experiment extends this to human-robot interaction with the iCub humanoid, and novel sources of uncertainty corresponding to "cheating" by the human. The combined results provide concrete evidence for the ability of neurophysiologically inspired cognitive systems to control advanced robots in the real-world. PMID:21808619

  3. Robot Cognitive Control with a Neurophysiologically Inspired Reinforcement Learning Model

    PubMed Central

    Khamassi, Mehdi; Lallée, Stéphane; Enel, Pierre; Procyk, Emmanuel; Dominey, Peter F.

    2011-01-01

    A major challenge in modern robotics is to liberate robots from controlled industrial settings, and allow them to interact with humans and changing environments in the real-world. The current research attempts to determine if a neurophysiologically motivated model of cortical function in the primate can help to address this challenge. Primates are endowed with cognitive systems that allow them to maximize the feedback from their environment by learning the values of actions in diverse situations and by adjusting their behavioral parameters (i.e., cognitive control) to accommodate unexpected events. In such contexts uncertainty can arise from at least two distinct sources – expected uncertainty resulting from noise during sensory-motor interaction in a known context, and unexpected uncertainty resulting from the changing probabilistic structure of the environment. However, it is not clear how neurophysiological mechanisms of reinforcement learning and cognitive control integrate in the brain to produce efficient behavior. Based on primate neuroanatomy and neurophysiology, we propose a novel computational model for the interaction between lateral prefrontal and anterior cingulate cortex reconciling previous models dedicated to these two functions. We deployed the model in two robots and demonstrate that, based on adaptive regulation of a meta-parameter β that controls the exploration rate, the model can robustly deal with the two kinds of uncertainties in the real-world. In addition the model could reproduce monkey behavioral performance and neurophysiological data in two problem-solving tasks. A last experiment extends this to human–robot interaction with the iCub humanoid, and novel sources of uncertainty corresponding to “cheating” by the human. The combined results provide concrete evidence for the ability of neurophysiologically inspired cognitive systems to control advanced robots in the real-world. PMID:21808619

  4. Cognitive control predicts use of model-based reinforcement learning.

    PubMed

    Otto, A Ross; Skatova, Anya; Madlon-Kay, Seth; Daw, Nathaniel D

    2015-02-01

    Accounts of decision-making and its neural substrates have long posited the operation of separate, competing valuation systems in the control of choice behavior. Recent theoretical and experimental work suggests that this classic distinction between behaviorally and neurally dissociable systems for habitual and goal-directed (or more generally, automatic and controlled) choice may arise from two computational strategies for reinforcement learning (RL), called model-free and model-based RL, but the cognitive or computational processes by which one system may dominate over the other in the control of behavior are a matter of ongoing investigation. To elucidate this question, we leverage the theoretical framework of cognitive control, demonstrating that individual differences in utilization of goal-related contextual information--in the service of overcoming habitual, stimulus-driven responses--in established cognitive control paradigms predict model-based behavior in a separate, sequential choice task. The behavioral correspondence between cognitive control and model-based RL compellingly suggests that a common set of processes may underpin the two behaviors. In particular, computational mechanisms originally proposed to underlie controlled behavior may be applicable to understanding the interactions between model-based and model-free choice behavior. PMID:25170791

  5. Off-policy reinforcement learning for H∞ control design.

    PubMed

    Luo, Biao; Wu, Huai-Ning; Huang, Tingwen

    2015-01-01

    The H∞ control design problem is considered for nonlinear systems with an unknown internal system model. It is known that the nonlinear H∞ control problem can be transformed into solving the so-called Hamilton-Jacobi-Isaacs (HJI) equation, which is a nonlinear partial differential equation that is generally impossible to solve analytically. Even worse, model-based approaches cannot be used to approximately solve the HJI equation when an accurate system model is unavailable or costly to obtain in practice. To overcome these difficulties, an off-policy reinforcement learning (RL) method is introduced to learn the solution of the HJI equation from real system data instead of a mathematical system model, and its convergence is proved. In the off-policy RL method, the system data can be generated with arbitrary policies rather than the evaluating policy, which is extremely important and promising for practical systems. For implementation purposes, a neural network (NN)-based actor-critic structure is employed and a least-squares NN weight-update algorithm is derived based on the method of weighted residuals. Finally, the developed NN-based off-policy RL method is tested on a linear F16 aircraft plant, and further applied to a rotational/translational actuator system. PMID:25532162

  6. Effects of Extrinsic Rewards on Intrinsic Motivation: Improving Learning in the Elementary Classroom.

    ERIC Educational Resources Information Center

    Zbrzezny, Ruth A.

    A literature review focused on ways for teachers to increase students' motivation. A total of 37 annotations were organized in terms of positive and negative effects of rewards on students' motivation, the issue of whether negative effects of rewards can be manipulated to have a positive effect, and suggestions for the classroom teacher on the use…

  7. Representation of Reward Feedback in Primate Auditory Cortex

    PubMed Central

    Brosch, Michael; Selezneva, Elena; Scheich, Henning

    2011-01-01

    It is well established that auditory cortex is plastic on different time scales and that this plasticity is driven by the reinforcement that is used to motivate subjects to learn or to perform an auditory task. Motivated by these findings, we study in detail properties of neuronal firing in auditory cortex that are related to reward feedback. We recorded from the auditory cortex of two monkeys while they were performing an auditory categorization task. Monkeys listened to a sequence of tones and had to signal when the frequency of adjacent tones stepped in a downward direction, irrespective of the tone frequency and step size. Correct identifications were rewarded with either a large or a small amount of water. The size of reward depended on the monkeys’ performance in the previous trial: it was large after a correct trial and small after an incorrect trial. The rewards served to maintain task performance. During task performance we found three successive periods of neuronal firing in auditory cortex that reflected (1) the reward expectancy for each trial, (2) the reward size received, and (3) the mismatch between the expected and delivered reward. These results, together with control experiments, suggest that auditory cortex receives reward feedback that could be used to adapt auditory cortex to task requirements. Additionally, the results presented here extend previous observations of non-auditory roles of auditory cortex and show that auditory cortex is even more cognitively influenced than previously recognized. PMID:21369350

  8. Value and probability coding in a feedback-based learning task utilizing food rewards

    PubMed Central

    Lempert, Karolina M.

    2014-01-01

    For the consequences of our actions to guide behavior, the brain must represent different types of outcome-related information. For example, an outcome can be construed as negative because an expected reward was not delivered or because an outcome of low value was delivered. Thus behavioral consequences can differ in terms of the information they provide about outcome probability and value. We investigated the role of the striatum in processing probability-based and value-based negative feedback by training participants to associate cues with food rewards and then employing a selective satiety procedure to devalue one food outcome. Using functional magnetic resonance imaging, we examined brain activity related to receipt of expected rewards, receipt of devalued outcomes, omission of expected rewards, omission of devalued outcomes, and expected omissions of an outcome. Nucleus accumbens activation was greater for rewarding outcomes than devalued outcomes, but activity in this region did not correlate with the probability of reward receipt. Activation of the right caudate and putamen, however, was largest in response to rewarding outcomes relative to expected omissions of reward. The dorsal striatum (caudate and putamen) at the time of feedback also showed a parametric increase correlating with the trialwise probability of reward receipt. Our results suggest that the ventral striatum is sensitive to the motivational relevance, or subjective value, of the outcome, while the dorsal striatum codes for a more complex signal that incorporates reward probability. Value and probability information may be integrated in the dorsal striatum, to facilitate action planning and allocation of effort. PMID:25339705

  9. Incorporation of perception-based information in robot learning using fuzzy reinforcement learning agents

    NASA Astrophysics Data System (ADS)

    Changjiu, Zhou; Qingchun, Meng; Zhongwen, Guo; Wiefen, Qu; Bo, Yin

    2002-04-01

    Robot learning in unstructured environments has proved to be an extremely challenging problem, mainly because of the many uncertainties always present in the real world. Human beings, on the other hand, seem to cope very well with uncertain and unpredictable environments, often relying on perception-based information. Furthermore, human beings can also utilize perceptions to guide their learning on those parts of the perception-action space that are actually relevant to the task. Therefore, we conduct research aimed at improving robot learning through the incorporation of both perception-based and measurement-based information. For this reason, a fuzzy reinforcement learning (FRL) agent is proposed in this paper. Based on a neural-fuzzy architecture, different kinds of information can be incorporated into the FRL agent to initialise its action network, critic network and evaluation feedback module so as to accelerate its learning. By making use of the global optimisation capability of GAs (genetic algorithms), a GA-based FRL (GAFRL) agent is presented to solve the local minima problem in traditional actor-critic reinforcement learning. On the other hand, with the prediction capability of the critic network, GAs can perform a more effective global search. Different GAFRL agents are constructed and verified by using the simulation model of a physical biped robot. The simulation analysis shows that the biped learning rate for dynamic balance can be improved by incorporating perception-based information on biped balancing and walking evaluation. The biped robot can find its application in ocean exploration, detection or sea rescue activity, as well as military maritime activity.

  10. Reinforcement Learning of Linking and Tracing Contours in Recurrent Neural Networks

    PubMed Central

    Brosch, Tobias; Neumann, Heiko; Roelfsema, Pieter R.

    2015-01-01

    The processing of a visual stimulus can be subdivided into a number of stages. Upon stimulus presentation there is an early phase of feedforward processing where the visual information is propagated from lower to higher visual areas for the extraction of basic and complex stimulus features. This is followed by a later phase where horizontal connections within areas and feedback connections from higher areas back to lower areas come into play. In this later phase, image elements that are behaviorally relevant are grouped by Gestalt grouping rules and are labeled in the cortex with enhanced neuronal activity (object-based attention in psychology). Recent neurophysiological studies revealed that reward-based learning influences these recurrent grouping processes, but it is not well understood how rewards train recurrent circuits for perceptual organization. This paper examines the mechanisms for reward-based learning of new grouping rules. We derive a learning rule that can explain how rewards influence the information flow through feedforward, horizontal and feedback connections. We illustrate the efficiency with two tasks that have been used to study the neuronal correlates of perceptual organization in early visual cortex. The first task is called contour-integration and demands the integration of collinear contour elements into an elongated curve. We show how reward-based learning causes an enhancement of the representation of the to-be-grouped elements at early levels of a recurrent neural network, just as is observed in the visual cortex of monkeys. The second task is curve-tracing where the aim is to determine the endpoint of an elongated curve composed of connected image elements. If trained with the new learning rule, neural networks learn to propagate enhanced activity over the curve, in accordance with neurophysiological data. We close the paper with a number of model predictions that can be tested in future neurophysiological and computational studies

  11. Reinforcement Learning of Linking and Tracing Contours in Recurrent Neural Networks.

    PubMed

    Brosch, Tobias; Neumann, Heiko; Roelfsema, Pieter R

    2015-10-01

    The processing of a visual stimulus can be subdivided into a number of stages. Upon stimulus presentation there is an early phase of feedforward processing where the visual information is propagated from lower to higher visual areas for the extraction of basic and complex stimulus features. This is followed by a later phase where horizontal connections within areas and feedback connections from higher areas back to lower areas come into play. In this later phase, image elements that are behaviorally relevant are grouped by Gestalt grouping rules and are labeled in the cortex with enhanced neuronal activity (object-based attention in psychology). Recent neurophysiological studies revealed that reward-based learning influences these recurrent grouping processes, but it is not well understood how rewards train recurrent circuits for perceptual organization. This paper examines the mechanisms for reward-based learning of new grouping rules. We derive a learning rule that can explain how rewards influence the information flow through feedforward, horizontal and feedback connections. We illustrate the efficiency with two tasks that have been used to study the neuronal correlates of perceptual organization in early visual cortex. The first task is called contour-integration and demands the integration of collinear contour elements into an elongated curve. We show how reward-based learning causes an enhancement of the representation of the to-be-grouped elements at early levels of a recurrent neural network, just as is observed in the visual cortex of monkeys. The second task is curve-tracing where the aim is to determine the endpoint of an elongated curve composed of connected image elements. If trained with the new learning rule, neural networks learn to propagate enhanced activity over the curve, in accordance with neurophysiological data. We close the paper with a number of model predictions that can be tested in future neurophysiological and computational studies

  12. Neuromodulatory adaptive combination of correlation-based learning in cerebellum and reward-based learning in basal ganglia for goal-directed behavior control.

    PubMed

    Dasgupta, Sakyasingha; Wörgötter, Florentin; Manoonpong, Poramate

    2014-01-01

    Goal-directed decision making in biological systems is broadly based on associations between conditional and unconditional stimuli. This can be further classified as classical conditioning (correlation-based learning) and operant conditioning (reward-based learning). A number of computational and experimental studies have well established the role of the basal ganglia in reward-based learning, whereas the cerebellum plays an important role in developing specific conditioned responses. Although viewed as distinct learning systems, recent animal experiments point toward their complementary role in behavioral learning, and also show the existence of substantial two-way communication between these two brain structures. Based on this notion of co-operative learning, in this paper we hypothesize that the basal ganglia and cerebellar learning systems work in parallel and interact with each other. We envision that such an interaction is influenced by a reward-modulated heterosynaptic plasticity (RMHP) rule at the thalamus, guiding the overall goal-directed behavior. Using a recurrent neural network actor-critic model of the basal ganglia and a feed-forward correlation-based learning model of the cerebellum, we demonstrate that the RMHP rule can effectively balance the outcomes of the two learning systems. This is tested using simulated environments of increasing complexity with a four-wheeled robot in a foraging task in both static and dynamic configurations. Although modeled with a simplified level of biological abstraction, we clearly demonstrate that such an RMHP-induced combinatorial learning mechanism leads to more stable and faster learning of goal-directed behaviors in comparison to the individual systems. Thus, in this paper we provide a computational model for adaptive combination of the basal ganglia and cerebellum learning systems by way of neuromodulated plasticity for goal-directed decision making in biological and bio-mimetic organisms. PMID:25389391

  13. Neuromodulatory adaptive combination of correlation-based learning in cerebellum and reward-based learning in basal ganglia for goal-directed behavior control

    PubMed Central

    Dasgupta, Sakyasingha; Wörgötter, Florentin; Manoonpong, Poramate

    2014-01-01

    Goal-directed decision making in biological systems is broadly based on associations between conditional and unconditional stimuli. This can be further classified as classical conditioning (correlation-based learning) and operant conditioning (reward-based learning). A number of computational and experimental studies have well established the role of the basal ganglia in reward-based learning, whereas the cerebellum plays an important role in developing specific conditioned responses. Although viewed as distinct learning systems, recent animal experiments point toward their complementary role in behavioral learning, and also show the existence of substantial two-way communication between these two brain structures. Based on this notion of co-operative learning, in this paper we hypothesize that the basal ganglia and cerebellar learning systems work in parallel and interact with each other. We envision that such an interaction is influenced by a reward-modulated heterosynaptic plasticity (RMHP) rule at the thalamus, guiding the overall goal-directed behavior. Using a recurrent neural network actor-critic model of the basal ganglia and a feed-forward correlation-based learning model of the cerebellum, we demonstrate that the RMHP rule can effectively balance the outcomes of the two learning systems. This is tested using simulated environments of increasing complexity with a four-wheeled robot in a foraging task in both static and dynamic configurations. Although modeled with a simplified level of biological abstraction, we clearly demonstrate that such an RMHP-induced combinatorial learning mechanism leads to more stable and faster learning of goal-directed behaviors in comparison to the individual systems. Thus, in this paper we provide a computational model for adaptive combination of the basal ganglia and cerebellum learning systems by way of neuromodulated plasticity for goal-directed decision making in biological and bio-mimetic organisms. PMID:25389391
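
    As a rough illustration of the idea of weighting two parallel learning systems at a single convergence point, the toy sketch below mixes a basal-ganglia-like and a cerebellum-like control signal with weights nudged by a reward-gated, activity-dependent term. The class name, the update, and the normalisation step are assumptions for illustration, not the authors' RMHP formulation.

    ```python
    import numpy as np

    class RewardModulatedCombiner:
        """Toy combiner of two parallel control signals (illustrative sketch only)."""

        def __init__(self, lr=0.05):
            self.w = np.array([0.5, 0.5])   # weights for the [basal-ganglia, cerebellum] pathways
            self.lr = lr

        def combine(self, bg_out, cb_out):
            return self.w[0] * bg_out + self.w[1] * cb_out

        def update(self, bg_out, cb_out, reward):
            # Reward-gated, activity-dependent nudge toward the pathway that
            # contributed to the rewarded output; weights stay positive and normalised.
            self.w += self.lr * reward * np.array([bg_out, cb_out])
            self.w = np.clip(self.w, 1e-3, None)
            self.w /= self.w.sum()
            return self.w
    ```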

  14. Anatomy of a Decision: Striato-Orbitofrontal Interactions in Reinforcement Learning, Decision Making, and Reversal

    ERIC Educational Resources Information Center

    Frank, Michael J.; Claus, Eric D.

    2006-01-01

    The authors explore the division of labor between the basal ganglia-dopamine (BG-DA) system and the orbitofrontal cortex (OFC) in decision making. They show that a primitive neural network model of the BG-DA system slowly learns to make decisions on the basis of the relative probability of rewards but is not as sensitive to (a) recency or (b) the…

  15. Introducing a reward system in assessment in histology: A comment on the learning strategies it might engender

    PubMed Central

    McLean, Michelle

    2001-01-01

    Background: Assessment, as an inextricable component of the curriculum, is an important factor influencing student approaches to learning. If assessment is to drive learning, then it must assess the desired outcomes. In an effort to alleviate some of the anxiety associated with a traditional discipline-based second year of medical studies, a bonus system was introduced into the Histology assessment. Students obtaining a year mark of 70% were rewarded with full marks for some tests, resulting in many requiring only a few percentage points in the final examination to pass Histology. Methods: In order to ascertain whether this bonus system might be impacting positively on student learning, thirty-two second-year medical students (non-randomly selected, representing four academic groups based on their mid-year results) were interviewed in 1997 and, in 1999, the entire second-year class completed a questionnaire (n = 189). Both groups were asked their opinions of the bonus system. Results: Both groups overwhelmingly voted in favour of the bonus system, despite less than 45% of students failing to achieve it. Students commented that it relieved some of the stress of the year-end examinations, and was generally motivating with regard to their work commitment. Conclusions: Being satisfied with how and what we assess in Histology, we are of the opinion that this reward system may contribute to engendering appropriate learning approaches (i.e. for understanding) in students. As a result of its apparent positive influence on learning and attitudes towards learning, this bonus system will continue to operate until the traditional programme is phased out. It is hoped that other educators, believing that their assessment is a reflection of the intended outcomes, might recognise merit in rewarding students for consistent achievement. PMID:11741511

  16. The role of multisensor data fusion in neuromuscular control of a sagittal arm with a pair of muscles using actor-critic reinforcement learning method.

    PubMed

    Golkhou, V; Parnianpour, M; Lucas, C

    2004-01-01

    In this study, we consider the role of multisensor data fusion in neuromuscular control using an actor-critic reinforcement learning method. The model we use is a single-link system actuated by a pair of muscles that are excited with alpha and gamma signals. Various physiological sensor signals, such as proprioception, spindle sensors, and Golgi tendon organs, have been integrated to achieve an oscillatory movement with variable amplitude and frequency, while achieving a stable movement with minimum metabolic cost and coactivation. The system is highly nonlinear in all its physical and physiological attributes. Transmission delays are included in the afferent and efferent neural paths to account for a more accurate representation of the reflex loops. This paper proposes a reinforcement learning method with an Actor-Critic architecture in place of the middle and low levels of the central nervous system (CNS). The Actor in this structure is a two-layer feedforward neural network and the Critic is a model of the cerebellum. The Critic is trained by the State-Action-Reward-State-Action (SARSA) method. The Critic then trains the Actor by supervised learning based on previous experiences. The reinforcement signal in SARSA is evaluated based on available alternatives concerning the concept of multisensor data fusion. The effectiveness and the biological plausibility of the present model are demonstrated by several simulations. The system showed excellent tracking capability when we integrated the available sensor information. Addition of a penalty for activation of muscles resulted in much lower muscle coactivation while keeping the movement stable. PMID:15671597
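
    The abstract states that the Critic is trained with SARSA; a minimal tabular SARSA backup of the kind referred to is sketched below, written for discrete states and actions (an assumption, since the paper works with continuous neuromuscular signals).

    ```python
    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
        """One State-Action-Reward-State-Action (SARSA) backup on a Q-table (dict)."""
        td_error = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
        return Q
    ```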

  17. A relative reward-strength algorithm for the hierarchical structure learning automata operating in the general nonstationary multiteacher environment.

    PubMed

    Baba, Norio; Mogami, Yoshio

    2006-08-01

    A new learning algorithm for the hierarchical structure learning automata (HSLA) operating in the nonstationary multiteacher environment (NME) is proposed. The proposed algorithm is derived by extending the original relative reward-strength algorithm to be utilized in the HSLA operating in the general NME. It is shown that the proposed algorithm ensures convergence with probability 1 to the optimal path under a certain type of the NME. Several computer-simulation results, which have been carried out in order to compare the relative performance of the proposed algorithm in some NMEs against those of the two of the fastest algorithms today, confirm the effectiveness of the proposed algorithm. PMID:16903364
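
    The relative reward-strength algorithm is defined in the paper itself; for orientation only, the sketch below shows the classical linear reward-inaction update for a single learning automaton, the simpler scheme that hierarchical algorithms of this kind build on (the parameter names are assumptions).

    ```python
    import numpy as np

    def l_ri_update(p, chosen, beta, lam=0.1):
        """Linear reward-inaction (L_R-I) update for one learning automaton.

        p      : current action-probability vector
        chosen : index of the action that was selected
        beta   : normalised environment response in [0, 1] (1 = full reward)
        """
        p = np.asarray(p, dtype=float)
        p += lam * beta * (np.eye(len(p))[chosen] - p)   # no change when beta == 0
        return p
    ```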

  18. On the integration of reinforcement learning and approximate reasoning for control

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1991-01-01

    The author discusses the importance of strengthening the knowledge representation characteristic of reinforcement learning techniques using methods such as approximate reasoning. The ARIC (approximate reasoning-based intelligent control) architecture is an example of such a hybrid approach in which the fuzzy control rules are modified (fine-tuned) using reinforcement learning. ARIC also demonstrates that it is possible to start with an approximately correct control knowledge base and learn to refine this knowledge through further experience. On the other hand, techniques such as the TD (temporal difference) algorithm and Q-learning establish stronger theoretical foundations for their use in adaptive control and also in stability analysis of hybrid reinforcement learning and approximate reasoning-based controllers.

  19. Reinforcement Learning in Large Scale Systems Using State Generalization and Multi-Agent Techniques

    NASA Astrophysics Data System (ADS)

    Kimura, Hajime; Aoki, Kei; Kobayashi, Shigenobu

    This paper introduces several problems in reinforcement learning for industrial applications, and shows some techniques to overcome them. Reinforcement learning is on-line learning of an input-output mapping through trial-and-error interactions with an uncertain environment; however, trial and error can cause fatal damage in real applications. We introduce a planning method based on reinforcement learning in a simulator. It can be seen as a stochastic approximation of dynamic programming in Markov decision processes. In large problems, however, simple grid-tiling to quantize the state space for tabular Q-learning is infeasible. We introduce a generalization technique to approximate value functions in continuous state space, and a multiagent architecture to solve large-scale problems. The efficiency of these techniques is shown through experiments in a sewage water-flow control system.
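
    Since the abstract contrasts tabular grid-tiling with value-function generalisation over continuous states, the sketch below shows one step of Q-learning with a linear function approximator over state features; it is a generic stand-in under assumed names, not the paper's particular generalisation technique.

    ```python
    import numpy as np

    def q_learning_step(theta, phi_s, a, r, phi_s_next, alpha=0.05, gamma=0.99):
        """Q-learning with a linear approximator: Q(s, a) = theta[a] . phi(s).

        theta      : weight matrix, shape (n_actions, n_features)
        phi_s      : feature vector of the current state (e.g. tile-coded)
        phi_s_next : feature vector of the next state
        """
        target = r + gamma * np.max(theta @ phi_s_next)   # greedy bootstrap target
        td_error = target - theta[a] @ phi_s
        theta[a] += alpha * td_error * phi_s              # update only the taken action's weights
        return theta
    ```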

  20. Adolescent development of context-dependent stimulus-reward association memory and its neural correlates

    PubMed Central

    Voss, Joel L.; O’Neil, Jonathan T.; Kharitonova, Maria; Briggs-Gowan, Margaret J.; Wakschlag, Lauren S.

    2015-01-01

    Expression of learned stimulus-reward associations based on context is essential for regulation of behavior to meet situational demands. Contextual regulation improves during development, although the developmental progression of relevant neural and cognitive processes is not fully specified. We therefore measured neural correlates of flexible, contextual expression of stimulus-reward associations in pre/early-adolescent children (ages 9–13 years) and young adults (ages 19–22 years). After reinforcement learning using standard parameters, a contextual reversal manipulation was used whereby contextual cues indicated that stimulus-reward associations were the same as previously reinforced for some trials (consistent trials) or were reversed on other trials (inconsistent trials). Subjects were thus required to respond according to original stimulus-reward associations vs. reversed associations based on trial-specific contextual cues. Children and young adults did not differ in reinforcement learning or in relevant functional magnetic resonance imaging (fMRI) correlates. In contrast, adults outperformed children during contextual reversal, with better performance specifically for inconsistent trials. fMRI signals corresponding to this selective advantage included greater activity in lateral prefrontal cortex (LPFC), hippocampus, and dorsal striatum for young adults relative to children. Flexible expression of stimulus-reward associations based on context thus improves via adolescent development, as does recruitment of brain regions involved in reward learning and contextual expression of memory. Highlights: Early-adolescent children and young adults were equivalent in reinforcement learning. Adults outperformed children in contextual expression of stimulus-reward associations. Adult advantages correlated with increased activity of relevant brain regions. Specific neurocognitive developmental changes support better contextual regulation. PMID:26578926

  1. Vascular Risk Factors and Diseases Modulate Deficits of Reward-Based Reversal Learning in Acute Basal Ganglia Stroke

    PubMed Central

    Wicking, Manon; Bellebaum, Christian; Hermann, Dirk M.

    2016-01-01

    Background: Besides motor function, the basal ganglia have been implicated in feedback learning. In patients with chronic basal ganglia infarcts, deficits in reward-based reversal learning have previously been described. Methods: We re-examined the acquisition and reversal of stimulus-stimulus-reward associations and acquired equivalence in eleven patients with acute basal ganglia stroke (8 men, 3 women; 57.8±13.3 years), whose performance was compared with that of eleven healthy subjects of comparable age, sex distribution and education, who were recruited outside the hospital. Eleven hospitalized patients with a vascular risk profile similar to that of the stroke patients but without stroke history served as a clinical control group. Results: In a neuropsychological assessment 7±3 days post-stroke, verbal and spatial short-term and working memory and inhibition control did not differ between groups. Compared with healthy subjects, control patients with vascular risk factors exhibited significantly reduced performance in the reversal phase (F[2,30] = 3.47; p = 0.044; post-hoc comparison between risk factor controls and healthy controls: p = 0.030), but not in the acquisition phase (F[2,30] = 1.01; p = 0.376) or the acquired equivalence (F[2,30] = 1.04; p = 0.367) tasks. In all tasks, the performance of vascular risk factor patients closely resembled that of basal ganglia stroke patients. Correlation studies revealed a significant association of the number of vascular risk factors with reversal learning (r = -0.33, p = 0.012), but not acquisition learning (r = -0.20, p = 0.121) or acquired equivalence (r = -0.22, p = 0.096). Conclusions: The previously reported impairment of reward-based learning may be attributed to vascular risk factors and associated diseases, which are enriched in stroke patients. This study emphasizes the necessity of appropriate control subjects in cognition studies. PMID:27163585

  2. The involvement of model-based but not model-free learning signals during observational reward learning in the absence of choice.

    PubMed

    Dunne, Simon; D'Souza, Arun; O'Doherty, John P

    2016-06-01

    A major open question is whether computational strategies thought to be used during experiential learning, specifically model-based and model-free reinforcement learning, also support observational learning. Furthermore, the question of how observational learning occurs when observers must learn about the value of options from observing outcomes in the absence of choice has not been addressed. In the present study we used a multi-armed bandit task that encouraged human participants to employ both experiential and observational learning while they underwent functional magnetic resonance imaging (fMRI). We found evidence for the presence of model-based learning signals during both observational and experiential learning in the intraparietal sulcus. However, unlike during experiential learning, model-free learning signals in the ventral striatum were not detectable during this form of observational learning. These results provide insight into the flexibility of the model-based learning system, implicating this system in learning during observation as well as from direct experience, and further suggest that the model-free reinforcement learning system may be less flexible with regard to its involvement in observational learning. PMID:27052578

  3. The involvement of model-based but not model-free learning signals during observational reward learning in the absence of choice

    PubMed Central

    Dunne, Simon; D’Souza, Arun; O’Doherty, John P.

    2016-01-01

    A major open question is whether computational strategies thought to be used during experiential learning, specifically model-based and model-free reinforcement-learning, also support observational learning. Furthermore, the question of how observational learning occurs when observers must learn about the value of options from observing outcomes in the absence of choice has not been addressed. In the present study we used a multi-armed bandit task that encouraged human participants to employ both experiential and observational learning while they underwent functional magnetic resonance imaging (fMRI). We found evidence for the presence of model-based learning signals during both observational and experiential learning in the intraparietal sulcus. However, unlike in experiential learning, model-free learning signals in the ventral striatum were not detectable during this form of observational learning. These results provide insight into the flexibility of the model-based learning system, implicating this system in learning during observation as well as from direct experience, and further suggest that the model-free reinforcement-learning system may be less flexible with regard to its involvement in observational learning. PMID:27052578

  4. An Attempt at Blocking of Position Learning by Training with Reward-Memory Associations

    ERIC Educational Resources Information Center

    Burns, Richard A.; Johnson, Kendra S.

    2006-01-01

    Rats were runway trained with sequences of rewards that changed in 3 phases. In Phase 1 (24 days), the sequences were NP', SNP', and P'SNP' (n = 3), or NS', PNS', and S'PNS', where P and P' refer to 4 and 8 plain Noyes pellets, and S and S' are 4 and 8 sucrose pellets. N was a 30-s confinement in the goal without reward. In Phase 2 (14 days) the…

  5. Connectivity reveals relationship of brain areas for reward-guided learning and decision making in human and monkey frontal cortex

    PubMed Central

    Neubert, Franz-Xaver; Mars, Rogier B.; Sallet, Jérôme; Rushworth, Matthew F. S.

    2015-01-01

    Reward-guided decision-making depends on a network of brain regions. Among these are the orbitofrontal and the anterior cingulate cortex. However, it is difficult to ascertain if these areas constitute anatomical and functional unities, and how these areas correspond between monkeys and humans. To address these questions we looked at connectivity profiles of these areas using resting-state functional MRI in 38 humans and 25 macaque monkeys. We sought brain regions in the macaque that resembled 10 human areas identified with decision making and brain regions in the human that resembled six macaque areas identified with decision making. We also used diffusion-weighted MRI to delineate key human orbital and medial frontal brain regions. We identified 21 different regions, many of which could be linked to particular aspects of reward-guided learning, valuation, and decision making, and in many cases we identified areas in the macaque with similar coupling profiles. PMID:25947150

  6. The plasticity of the mirror system: how reward learning modulates cortical motor simulation of others.

    PubMed

    Trilla Gros, Irene; Panasiti, Maria Serena; Chakrabarti, Bhismadev

    2015-04-01

    Cortical motor simulation supports the understanding of others' actions and intentions. This mechanism is thought to rely on the mirror neuron system (MNS), a brain network that is active both during action execution and observation. Indirect evidence suggests that (alpha/beta) mu suppression, an electroencephalographic (EEG) index of MNS activity, is modulated by reward. In this study we aimed to test the plasticity of the MNS by directly investigating the link between (alpha/beta) mu suppression and reward. 40 individuals from a general population sample took part in an evaluative conditioning experiment, where different neutral faces were associated with high or low reward values. In the test phase, EEG was recorded while participants viewed videoclips of happy expressions made by the conditioned faces. Alpha/beta mu suppression (identified using event-related desynchronisation of specific independent components) in response to rewarding faces was found to be greater than for non-rewarding faces. This result provides a mechanistic insight into the plasticity of the MNS and, more generally, into the role of reward in modulating physiological responses linked to empathy. PMID:25744871

  7. The plasticity of the mirror system: How reward learning modulates cortical motor simulation of others

    PubMed Central

    Trilla Gros, Irene; Panasiti, Maria Serena; Chakrabarti, Bhismadev

    2015-01-01

    Cortical motor simulation supports the understanding of others' actions and intentions. This mechanism is thought to rely on the mirror neuron system (MNS), a brain network that is active both during action execution and observation. Indirect evidence suggests that (alpha/beta) mu suppression, an electroencephalographic (EEG) index of MNS activity, is modulated by reward. In this study we aimed to test the plasticity of the MNS by directly investigating the link between (alpha/beta) mu suppression and reward. 40 individuals from a general population sample took part in an evaluative conditioning experiment, where different neutral faces were associated with high or low reward values. In the test phase, EEG was recorded while participants viewed videoclips of happy expressions made by the conditioned faces. Alpha/beta mu suppression (identified using event-related desynchronisation of specific independent components) in response to rewarding faces was found to be greater than for non-rewarding faces. This result provides a mechanistic insight into the plasticity of the MNS and, more generally, into the role of reward in modulating physiological responses linked to empathy. PMID:25744871

  8. Adolescent-specific patterns of behavior and neural activity during social reinforcement learning

    PubMed Central

    Jones, Rebecca M.; Somerville, Leah H.; Li, Jian; Ruberry, Erika J.; Powers, Alisa; Mehta, Natasha; Dyke, Jonathan; Casey, BJ

    2014-01-01

    Humans are sophisticated social beings. Social cues from others are exceptionally salient, particularly during adolescence. Understanding how adolescents interpret and learn from variable social signals can provide insight into the observed shift in social sensitivity during this period. The current study tested 120 participants between the ages of 8 and 25 years on a social reinforcement learning task where the probability of receiving positive social feedback was parametrically manipulated. Seventy-eight of these participants completed the task during fMRI scanning. Modeling trial-by-trial learning, children and adults showed higher positive learning rates than adolescents, suggesting that adolescents demonstrated less differentiation in their reaction times for peers who provided more positive feedback. Forming expectations about receiving positive social reinforcement correlated with neural activity within the medial prefrontal cortex and ventral striatum across age. Adolescents, unlike children and adults, showed greater insular activity during positive prediction error learning and increased activity in the supplementary motor cortex and the putamen when receiving positive social feedback regardless of the expected outcome, suggesting that peer approval may motivate adolescents towards action. While different amounts of positive social reinforcement enhanced learning in children and adults, all positive social reinforcement equally motivated adolescents. Together, these findings indicate that sensitivity to peer approval during adolescence goes beyond simple reinforcement theory accounts and suggests possible explanations for how peers may motivate adolescent behavior. PMID:24550063

  9. Improved Adaptive-Reinforcement Learning Control for morphing unmanned air vehicles.

    PubMed

    Valasek, John; Doebbler, James; Tandale, Monish D; Meade, Andrew J

    2008-08-01

    This paper presents an improved Adaptive-Reinforcement Learning Control methodology for the problem of unmanned air vehicle morphing control. The reinforcement learning morphing control function that learns the optimal shape change policy is integrated with an adaptive dynamic inversion control trajectory tracking function. An episodic unsupervised learning simulation using the Q-learning method is developed to replace an earlier and less accurate Actor-Critic algorithm. Sequential Function Approximation, a Galerkin-based scattered data approximation scheme, replaces a K-Nearest Neighbors (KNN) method and is used to generalize the learning from previously experienced quantized states and actions to the continuous state-action space, all of which may not have been experienced before. The improved method showed smaller errors and improved learning of the optimal shape compared to the KNN. PMID:18632393
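
    A minimal episodic, ε-greedy tabular Q-learning loop of the general kind the record refers to is sketched below; the environment interface (reset/step) and parameter values are assumptions for illustration, not the authors' implementation.

    ```python
    import random

    def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        """One episode of epsilon-greedy tabular Q-learning.

        `env` is assumed to expose reset() -> state and step(a) -> (state, reward, done).
        """
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q.get((s, act), 0.0))
            s_next, r, done = env.step(a)
            best_next = max(Q.get((s_next, act), 0.0) for act in actions)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
            s = s_next
        return Q
    ```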

  10. An Evaluation of Pedagogical Tutorial Tactics for a Natural Language Tutoring System: A Reinforcement Learning Approach

    ERIC Educational Resources Information Center

    Chi, Min; VanLehn, Kurt; Litman, Diane; Jordan, Pamela

    2011-01-01

    Pedagogical strategies are policies for a tutor to decide the next action when there are multiple actions available. When the content is controlled to be the same across experimental conditions, there has been little evidence that tutorial decisions have an impact on students' learning. In this paper, we applied Reinforcement Learning (RL) to…

  11. Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories.

    PubMed

    Fonteneau, Raphael; Murphy, Susan A; Wehenkel, Louis; Ernst, Damien

    2013-09-01

    In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of "artificial trajectories" from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning. PMID:24049244

  12. Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories

    PubMed Central

    Fonteneau, Raphael; Murphy, Susan A.; Wehenkel, Louis; Ernst, Damien

    2013-01-01

    In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of “artificial trajectories” from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning. PMID:24049244

  13. Aberrant Salience Is Related to Reduced Reinforcement Learning Signals and Elevated Dopamine Synthesis Capacity in Healthy Adults.

    PubMed

    Boehme, Rebecca; Deserno, Lorenz; Gleich, Tobias; Katthagen, Teresa; Pankow, Anne; Behr, Joachim; Buchert, Ralph; Roiser, Jonathan P; Heinz, Andreas; Schlagenhauf, Florian

    2015-07-15

    The striatum is known to play a key role in reinforcement learning, specifically in the encoding of teaching signals such as reward prediction errors (RPEs). It has been proposed that aberrant salience attribution is associated with impaired coding of RPE and heightened dopamine turnover in the striatum, and might be linked to the development of psychotic symptoms. However, the relationship of aberrant salience attribution, RPE coding, and dopamine synthesis capacity has not been directly investigated. Here we assessed the association between a behavioral measure of aberrant salience attribution, the salience attribution test, to neural correlates of RPEs measured via functional magnetic resonance imaging while healthy participants (n = 58) performed an instrumental learning task. A subset of participants (n = 27) also underwent positron emission tomography with the radiotracer [(18)F]fluoro-l-DOPA to quantify striatal presynaptic dopamine synthesis capacity. Individual variability in aberrant salience measures related negatively to ventral striatal and prefrontal RPE signals and in an exploratory analysis was found to be positively associated with ventral striatal presynaptic dopamine levels. These data provide the first evidence for a specific link between the constructs of aberrant salience attribution, reduced RPE processing, and potentially increased presynaptic dopamine function. PMID:26180188
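
    Reward prediction errors of the kind measured here are typically generated by a simple delta-rule learner; a minimal Rescorla-Wagner-style sketch, assumed here as a stand-in for the study's actual computational model, is:

    ```python
    def rpe_update(value, reward, alpha=0.2):
        """Return the reward prediction error and the updated value estimate."""
        rpe = reward - value          # delta: obtained minus expected reward
        value += alpha * rpe          # expectation moves toward experienced outcomes
        return rpe, value
    ```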

  14. Morphological elucidation of basal ganglia circuits contributing reward prediction

    PubMed Central

    Fujiyama, Fumino; Takahashi, Susumu; Karube, Fuyuki

    2015-01-01

    Electrophysiological studies in monkeys have shown that dopaminergic neurons respond to the reward prediction error. In addition, striatal neurons alter their responsiveness to cortical or thalamic inputs in response to the dopamine signal, via the mechanism of dopamine-regulated synaptic plasticity. These findings have led to the hypothesis that the striatum exhibits synaptic plasticity under the influence of the reward prediction error and conducts reinforcement learning throughout the basal ganglia circuits. The reinforcement learning model is useful; however, the mechanism by which such a process emerges in the basal ganglia needs to be anatomically explained. The actor–critic model has been previously proposed and extended by the existence of role sharing within the striatum, focusing on the striosome/matrix compartments. However, this hypothesis has been difficult to confirm morphologically, partly because of the complex structure of the striosome/matrix compartments. Here, we review recent morphological studies that elucidate the input/output organization of the striatal compartments. PMID:25698913

  15. Morphological elucidation of basal ganglia circuits contributing reward prediction.

    PubMed

    Fujiyama, Fumino; Takahashi, Susumu; Karube, Fuyuki

    2015-01-01

    Electrophysiological studies in monkeys have shown that dopaminergic neurons respond to the reward prediction error. In addition, striatal neurons alter their responsiveness to cortical or thalamic inputs in response to the dopamine signal, via the mechanism of dopamine-regulated synaptic plasticity. These findings have led to the hypothesis that the striatum exhibits synaptic plasticity under the influence of the reward prediction error and conducts reinforcement learning throughout the basal ganglia circuits. The reinforcement learning model is useful; however, the mechanism by which such a process emerges in the basal ganglia needs to be anatomically explained. The actor-critic model has been previously proposed and extended by the existence of role sharing within the striatum, focusing on the striosome/matrix compartments. However, this hypothesis has been difficult to confirm morphologically, partly because of the complex structure of the striosome/matrix compartments. Here, we review recent morphological studies that elucidate the input/output organization of the striatal compartments. PMID:25698913

  16. A Real-time Reinforcement Learning Control System with H∞ Tracking Performance Compensator

    NASA Astrophysics Data System (ADS)

    Uchiyama, Shogo; Obayashi, Masanao; Kuremoto, Takashi; Kobayashi, Kunikazu

    Robust control theory generally guarantees robustness and stability of the closed-loop system. However, it requires a mathematical model of the system to design the controller, and therefore often cannot deal with nonlinear systems because of the difficulty of modeling them. On the other hand, reinforcement learning methods can deal with nonlinear systems without any mathematical model, but they usually do not guarantee the stability of the controlled system. In this paper, we propose a “Real-time Reinforcement Learning Control System (RRLCS)” that combines reinforcement learning, to treat unknown nonlinear systems, with robust control theory, to guarantee the robustness and stability of the system. Moreover, we analyze the stability of the proposed system using the H∞ tracking performance and a Lyapunov function. Finally, through computer simulation of controlling an inverted pendulum, we show the effectiveness of the proposed method.

  17. A Robust Cooperated Control Method with Reinforcement Learning and Adaptive H∞ Control

    NASA Astrophysics Data System (ADS)

    Obayashi, Masanao; Uchiyama, Shogo; Kuremoto, Takashi; Kobayashi, Kunikazu

    This study proposes a robust cooperated control method that combines reinforcement learning with robust control. A remarkable characteristic of reinforcement learning is that it does not require a model of the system; however, it does not guarantee the stability of the system. A robust control system, on the other hand, guarantees stability and robustness but requires a model of the system. We employ both the actor-critic method, a kind of reinforcement learning that handles continuous-valued actions with a minimal amount of computation, and traditional robust control, that is, H∞ control. The proposed method was compared with the conventional control method (the actor-critic alone) through computer simulation of controlling the angle and position of a crane system, and the simulation results showed the effectiveness of the proposed method.

  18. Reinforcement learning control with approximation of time-dependent agent dynamics

    NASA Astrophysics Data System (ADS)

    Kirkpatrick, Kenton Conrad

    Reinforcement Learning has received a lot of attention over the years for systems ranging from static game playing to dynamic system control. Using Reinforcement Learning for control of dynamical systems provides the benefit of learning a control policy without needing a model of the dynamics. This opens the possibility of controlling systems for which the dynamics are unknown, but Reinforcement Learning methods like Q-learning do not explicitly account for time. In dynamical systems, time-dependent characteristics can have a significant effect on the control of the system, so it is necessary to account for system time dynamics while not having to rely on a predetermined model for the system. In this dissertation, algorithms are investigated for expanding the Q-learning algorithm to account for the learning of sampling rates and dynamics approximations. For determining a proper sampling rate, it is desired to find the largest sample time that still allows the learning agent to control the system to goal achievement. An algorithm called Sampled-Data Q-learning is introduced for determining both this sample time and the control policy associated with that sampling rate. Results show that the algorithm is capable of achieving a desired sampling rate that allows for system control while not sampling "as fast as possible". Determining an approximation of an agent's dynamics can be beneficial for the control of hierarchical multiagent systems by allowing a high-level supervisor to use the dynamics approximations for task allocation decisions. To this end, algorithms are investigated for learning first- and second-order dynamics approximations. These algorithms are respectively called First-Order Dynamics Learning and Second-Order Dynamics Learning. The dynamics learning algorithms are evaluated on several examples that show their capability to learn accurate approximations of state dynamics. All of these algorithms are then evaluated on hierarchical multiagent systems
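
    One straightforward way to obtain a first-order dynamics approximation of the sort the dissertation learns is a least-squares fit of a discrete-time model x[t+1] ≈ a·x[t] + b·u[t] from logged state and input samples; the sketch below shows that generic fit and is an assumed stand-in, not the dissertation's First-Order Dynamics Learning algorithm.

    ```python
    import numpy as np

    def fit_first_order(x, u):
        """Least-squares fit of x[t+1] = a * x[t] + b * u[t] from logged scalar samples."""
        x = np.asarray(x, dtype=float)
        u = np.asarray(u, dtype=float)
        X = np.column_stack([x[:-1], u[:-1]])            # regressors: current state and input
        a, b = np.linalg.lstsq(X, x[1:], rcond=None)[0]  # solve for the two coefficients
        return a, b
    ```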

  19. Reinforcement Sensitivity and Responsiveness to Performance Feedback: A Preliminary Investigation

    ERIC Educational Resources Information Center

    Lovett, Benjamin J.; Eckert, Tanya L.

    2009-01-01

    Variability in responsiveness to academic interventions is a common phenomenon in school psychology practice, but the variables associated with this responsiveness are not well understood. Reinforcement sensitivity, a generalized tendency to learn quickly in reward contingency situations, is one variable for increased understanding. In the present…

  20. The Disposition to Learn.

    ERIC Educational Resources Information Center

    Katz, Lilian

    1988-01-01

    Lectures and workbooks cannot instill curiosity and continuous interest or the disposition to respond to experiences in certain ways. This article examines sabateurs of the learning disposition (such as reinforcing learned stupidity and using rewards that suppress interest) and suggests curriculum strategies to engage young minds, such as using…

  1. A reward-modulated Hebbian learning rule can explain experimentally observed network reorganization in a brain control task

    PubMed Central

    Legenstein, Robert; Chase, Steven M.; Schwartz, Andrew B.; Maass, Wolfgang

    2010-01-01

    It has recently been shown in a brain-computer interface experiment that motor cortical neurons change their tuning properties selectively to compensate for errors induced by displaced decoding parameters. In particular, it was shown that the 3D tuning curves of neurons whose decoding parameters were re-assigned changed more than those of neurons whose decoding parameters had not been re-assigned. In this article, we propose a simple learning rule that can reproduce this effect. Our learning rule uses Hebbian weight updates driven by a global reward signal and neuronal noise. In contrast to most previously proposed learning rules, this approach does not require extrinsic information to separate noise from signal. The learning rule is able to optimize the performance of a model system within biologically realistic periods of time under high noise levels. Furthermore, when the model parameters are matched to data recorded during the brain-computer interface learning experiments described above, the model produces learning effects strikingly similar to those found in the experiments. PMID:20573887
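
    For orientation, a minimal noise-driven, reward-modulated Hebbian update in the spirit the abstract describes is sketched below: the exploratory perturbation applied to a neuron's output is credited or blamed according to whether the global reward exceeds its recent average. The exact form and variable names are assumptions, not the authors' derived rule.

    ```python
    import numpy as np

    def noisy_reward_hebb(w, pre, noise, reward, avg_reward, lr=0.01):
        """Noise-driven, reward-modulated Hebbian update (illustrative sketch only).

        w          : weight vector onto one postsynaptic neuron
        pre        : presynaptic activity vector for the current trial
        noise      : exploratory perturbation that was added to this neuron's output
        reward     : scalar reward for the trial (e.g. cursor-to-target performance)
        avg_reward : running average of recent rewards (baseline)
        """
        w += lr * (reward - avg_reward) * noise * pre    # credit or blame the exploration
        return w
    ```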

  2. Dorsal Striatal-Midbrain Connectivity in Humans Predicts How Reinforcements Are Used to Guide Decisions

    ERIC Educational Resources Information Center

    Kahnt, Thorsten; Park, Soyoung Q.; Cohen, Michael X.; Beck, Anne; Heinz, Andreas; Wrase, Jana

    2009-01-01

    It has been suggested that the target areas of dopaminergic midbrain neurons, the dorsal (DS) and ventral striatum (VS), are differently involved in reinforcement learning especially as actor and critic. Whereas the critic learns to predict rewards, the actor maintains action values to guide future decisions. The different midbrain connections to…

  3. Active-learning strategies: the use of a game to reinforce learning in nursing education. A case study.

    PubMed

    Boctor, Lisa

    2013-03-01

    The majority of nursing students are kinesthetic learners, preferring a hands-on, active approach to education. Research shows that active-learning strategies can increase student learning and satisfaction. This study looks at the use of one active-learning strategy, a Jeopardy-style game, 'Nursopardy', to reinforce Fundamentals of Nursing material, aiding in students' preparation for a standardized final exam. The game was created keeping students' varied learning styles and the NCLEX blueprint in mind. The blueprint was used to create 5 categories, with 26 total questions. Student survey results, using a five-point Likert scale, showed that students found this learning method enjoyable and beneficial to learning. More research is recommended regarding learning outcomes when using active-learning strategies such as games. PMID:22910398

  4. Reinforcement function design and bias for efficient learning in mobile robots

    SciTech Connect

    Touzet, C.; Santos, J.M.

    1998-06-01

    The main paradigm in the sub-symbolic learning robot domain is reinforcement learning. Various techniques have been developed to deal with the memorization/generalization problem, demonstrating the superior ability of artificial neural network implementations. In this paper, the authors address the issue of designing the reinforcement so as to optimize the exploration part of the learning. They also present and summarize work on the use of bias intended to achieve effective synthesis of the desired behavior. Demonstrative experiments involving a self-organizing map implementation of Q-learning and real mobile robots (Nomad 200 and Khepera) in an obstacle-avoidance behavior synthesis task are described. 3 figs., 5 tabs.

  5. Selective learning impairment of delayed reinforcement autoshaped behavior caused by low doses of trimethyltin.

    PubMed

    Cohen, C A; Messing, R B; Sparber, S B

    1987-01-01

    The organometal neurotoxin trimethyltin (TMT), induces impaired learning and memory for various tasks. However, administration is also associated with other "non-specific" behavioral changes which may be responsible for effects on conditioned behaviors. To determine if TMT treatment causes a specific learning impairment, three experiments were done using variations of a delay of reinforcement autoshaping task in which rats learn to associate the presentation and retraction of a lever with the delivery of a food pellet reinforcer. No significant effects of TMT treatment were found with a short (4 s) delay of reinforcement, indicating that rats were motivated and had the sensorimotor capacity for learning. When the delay was increased to 6 s, 3.0 or 6.0 mg TMT/kg produced dose-related reductions in behaviors directed towards the lever. Performance of a group given 7.5 mg TMT/kg, while still impaired relative to controls, appeared to be better than the performance of groups given lower doses. This paradoxical effect was investigated with a latent inhibition paradigm, in which rats were pre-exposed to the Skinner boxes for several sessions without delivery of food reinforcement. Control rats showed retardation of autoshaping when food reinforcement was subsequently introduced. Rats given 7.5 mg TMT/kg exhibited elevated levels of lever responding during pre-exposure and autoshaping sessions. The results indicate that 7.5 mg TMT/kg produces learning impairments which are confounded by hyperreactivity to the environment and an inability to suppress behavior toward irrelevant stimuli. In contrast, low doses of TMT cause learning impairments which are not confounded by hyperreactivity, and may prove to be useful models for studying specific associational dysfunctions. PMID:3124161

  6. Utilising reinforcement learning to develop strategies for driving auditory neural implants

    NASA Astrophysics Data System (ADS)

    Lee, Geoffrey W.; Zambetta, Fabio; Li, Xiaodong; Paolini, Antonio G.

    2016-08-01

    Objective. In this paper we propose a novel application of reinforcement learning to the area of auditory neural stimulation. We aim to develop a simulation environment that is based on real neurological responses to auditory and electrical stimulation in the cochlear nucleus (CN) and inferior colliculus (IC) of an animal model. Using this simulator we implement closed-loop reinforcement learning algorithms to determine which methods are most effective at learning effective acoustic neural stimulation strategies. Approach. By recording a comprehensive set of acoustic frequency presentations and neural responses from a set of animals we created a large database of neural responses to acoustic stimulation. Extensive electrical stimulation in the CN and the recording of neural responses in the IC provides a mapping of how the auditory system responds to electrical stimuli. The combined dataset is used as the foundation for the simulator, which is used to implement and test learning algorithms. Main results. Reinforcement learning, utilising a modified n-Armed Bandit solution, is implemented to demonstrate the model’s function. We show the ability to effectively learn stimulation patterns which mimic the cochlea’s ability to convert acoustic frequencies to neural activity. Learning effective replication using neural stimulation takes less than 20 min under continuous testing. Significance. These results show the utility of reinforcement learning in the field of neural stimulation. These results can be coupled with existing sound processing technologies to develop new auditory prosthetics that are adaptable to the recipient’s current auditory pathway. The same process can theoretically be abstracted to other sensory and motor systems to develop similar electrical replication of neural signals.
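
    The record mentions a modified n-armed-bandit learner; shown below for orientation is the standard, unmodified ε-greedy bandit with incremental value estimates, the baseline on which such modifications are usually built.

    ```python
    import random

    class EpsilonGreedyBandit:
        """Standard epsilon-greedy n-armed bandit with incremental mean estimates."""

        def __init__(self, n_arms, epsilon=0.1):
            self.values = [0.0] * n_arms
            self.counts = [0] * n_arms
            self.epsilon = epsilon

        def select(self):
            if random.random() < self.epsilon:
                return random.randrange(len(self.values))                       # explore
            return max(range(len(self.values)), key=lambda i: self.values[i])   # exploit

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
    ```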

  7. Language Learning of Children as a Function of Sensory Mode of Presentation and Reinforcement Procedure.

    ERIC Educational Resources Information Center

    Oyer, Herbert J.; Frankmann, Judith P.

    Programed training filmstrips from Project LIFE (Language Instruction to Facilitate Education) were used with 114 hearing impaired children and 15 normal hearing language impaired children (4- to 13-years old) to assess the effects of auditory supplementation and a token reinforcement program on language learning and to investigate retention and…

  8. Effects of prior cocaine versus morphine or heroin self-administration on extinction learning driven by over-expectation versus omission of reward

    PubMed Central

    Lucantonio, Federica; Kambhampati, S; Haney, Richard Z; Atalayer, Deniz; Rowland, Neil E; Shaham, Yavin; Schoenbaum, Geoffrey

    2014-01-01

    Background Addiction is characterized by an inability to stop using drugs, despite adverse consequences. One contributing factor to this compulsive drug taking could be the impact of drug use on the ability to extinguish drug seeking after changes in expected outcomes. Here we compared effects of cocaine, morphine, and heroin self-administration on two forms of extinction learning: standard extinction driven by reward omission and extinction driven by reward over-expectation. Methods In Experiment 1, we trained rats to self-administer cocaine, morphine, or sucrose for 3 hr/day (limited access). In Experiment 2, we trained rats to self-administer heroin or sucrose for 12 hr/day (extended access). Three weeks later, we trained the rats to associate several cues with palatable food reward, after which we assessed extinction of the learned Pavlovian response, first by pairing two cues together in the over-expectation procedure and later by omitting the food reward. Results Rats trained under limited access conditions to self-administer sucrose or morphine demonstrated normal extinction in response to both over-expectation and reward omission, whereas cocaine-experienced rats or rats trained to self-administer heroin under extended access conditions exhibited normal extinction in response to reward omission but failed to show extinction in response to over-expectation. Conclusions The specific long-lasting effects of cocaine and heroin show that drug exposure induces long-lasting deficits in the ability to extinguish reward seeking after changes in expected outcomes. These deficits were not observed in a standard extinction procedure but instead only affected extinction learning driven by a more complex phenomenon of over-expectation. PMID:25641634

  9. Premotor and Motor Cortices Encode Reward

    PubMed Central

    Ramkumar, Pavan; Dekleva, Brian; Cooler, Sam; Miller, Lee; Kording, Konrad

    2016-01-01

    Rewards associated with actions are critical for motivation and learning about the consequences of one’s actions on the world. The motor cortices are involved in planning and executing movements, but it is unclear whether they encode reward over and above limb kinematics and dynamics. Here, we report a categorical reward signal in dorsal premotor (PMd) and primary motor (M1) neurons that corresponds to an increase in firing rates when a trial was not rewarded regardless of whether or not a reward was expected. We show that this signal is unrelated to error magnitude, reward prediction error, or other task confounds such as reward consumption, return reach plan, or kinematic differences across rewarded and unrewarded trials. The availability of reward information in motor cortex is crucial for theories of reward-based learning and motivational influences on actions. PMID:27564707

  10. Premotor and Motor Cortices Encode Reward.

    PubMed

    Ramkumar, Pavan; Dekleva, Brian; Cooler, Sam; Miller, Lee; Kording, Konrad

    2016-01-01

    Rewards associated with actions are critical for motivation and learning about the consequences of one's actions on the world. The motor cortices are involved in planning and executing movements, but it is unclear whether they encode reward over and above limb kinematics and dynamics. Here, we report a categorical reward signal in dorsal premotor (PMd) and primary motor (M1) neurons that corresponds to an increase in firing rates when a trial was not rewarded regardless of whether or not a reward was expected. We show that this signal is unrelated to error magnitude, reward prediction error, or other task confounds such as reward consumption, return reach plan, or kinematic differences across rewarded and unrewarded trials. The availability of reward information in motor cortex is crucial for theories of reward-based learning and motivational influences on actions. PMID:27564707

  11. Cerebellar and prefrontal cortex contributions to adaptation, strategies, and reinforcement learning.

    PubMed

    Taylor, Jordan A; Ivry, Richard B

    2014-01-01

    Traditionally, motor learning has been studied as an implicit learning process, one in which movement errors are used to improve performance in a continuous, gradual manner. The cerebellum figures prominently in this literature given well-established ideas about the role of this system in error-based learning and the production of automatized skills. Recent developments have brought into focus the relevance of multiple learning mechanisms for sensorimotor learning. These include processes involving repetition, reinforcement learning, and strategy utilization. We examine these developments, considering their implications for understanding cerebellar function and how this structure interacts with other neural systems to support motor learning. Converging lines of evidence from behavioral, computational, and neuropsychological studies suggest a fundamental distinction between processes that use error information to improve action execution or action selection. While the cerebellum is clearly linked to the former, its role in the latter remains an open question. PMID:24916295

  12. Cerebellar and Prefrontal Cortex Contributions to Adaptation, Strategies, and Reinforcement Learning

    PubMed Central

    Taylor, Jordan A.; Ivry, Richard B.

    2014-01-01

    Traditionally, motor learning has been studied as an implicit learning process, one in which movement errors are used to improve performance in a continuous, gradual manner. The cerebellum figures prominently in this literature given well-established ideas about the role of this system in error-based learning and the production of automatized skills. Recent developments have brought into focus the relevance of multiple learning mechanisms for sensorimotor learning. These include processes involving repetition, reinforcement learning, and strategy utilization. We examine these developments, considering their implications for understanding cerebellar function and how this structure interacts with other neural systems to support motor learning. Converging lines of evidence from behavioral, computational, and neuropsychological studies suggest a fundamental distinction between processes that use error information to improve action execution or action selection. While the cerebellum is clearly linked to the former, its role in the latter remains an open question. PMID:24916295

  13. Neural signature of hierarchically structured expectations predicts clustering and transfer of rule sets in reinforcement learning.

    PubMed

    Collins, Anne Gabrielle Eva; Frank, Michael Joshua

    2016-07-01

    Often the world is structured such that distinct sensory contexts signify the same abstract rule set. Learning from feedback thus informs us not only about the value of stimulus-action associations but also about which rule set applies. Hierarchical clustering models suggest that learners discover structure in the environment, clustering distinct sensory events into a single latent rule set. Such structure enables a learner to transfer any newly acquired information to other contexts linked to the same rule set, and facilitates re-use of learned knowledge in novel contexts. Here, we show that humans exhibit this transfer, generalization and clustering during learning. Trial-by-trial model-based analysis of EEG signals revealed that subjects' reward expectations incorporated this hierarchical structure; these structured neural signals were predictive of behavioral transfer and clustering. These results further our understanding of how humans learn and generalize flexibly by building abstract, behaviorally relevant representations of the complex, high-dimensional sensory environment. PMID:27082659

  14. Brain Circuits Encoding Reward from Pain Relief.

    PubMed

    Navratilova, Edita; Atcherley, Christopher W; Porreca, Frank

    2015-11-01

    Relief from pain in humans is rewarding and pleasurable. Primary rewards, or reward-predictive cues, are encoded in brain reward/motivational circuits. While considerable advances have been made in our understanding of reward circuits underlying positive reinforcement, less is known about the circuits underlying the hedonic and reinforcing actions of pain relief. We review findings from electrophysiological, neuroimaging, and behavioral studies supporting the concept that the rewarding effect of pain relief requires opioid signaling in the anterior cingulate cortex (ACC), activation of midbrain dopamine neurons, and the release of dopamine in the nucleus accumbens (NAc). Understanding of circuits that govern the reward of pain relief may allow the discovery of more effective and satisfying therapies for patients with acute or chronic pain. PMID:26603560

  15. Sensitivity to Temporal Reward Structure in Amygdala Neurons

    PubMed Central

    Bermudez, Maria A.; Göbel, Carl; Schultz, Wolfram

    2012-01-01

    Summary The time of reward and the temporal structure of reward occurrence fundamentally influence behavioral reinforcement and decision processes [1–11]. However, despite knowledge about timing in sensory and motor systems [12–17], we know little about temporal mechanisms of neuronal reward processing. In this experiment, visual stimuli predicted different instantaneous probabilities of reward occurrence that resulted in specific temporal reward structures. Licking behavior demonstrated that the animals had developed expectations for the time of reward that reflected the instantaneous reward probabilities. Neurons in the amygdala, a major component of the brain's reward system [18–29], showed two types of reward signal, both of which were sensitive to the expected time of reward. First, the time courses of anticipatory activity preceding reward delivery followed the specific instantaneous reward probabilities and thus paralleled the temporal reward structures. Second, the magnitudes of responses following reward delivery covaried with the instantaneous reward probabilities, reflecting the influence of temporal reward structures at the moment of reward delivery. In being sensitive to temporal reward structure, the reward signals of amygdala neurons reflected the temporally specific expectations of reward. The data demonstrate an active involvement of amygdala neurons in timing processes that are crucial for reward function. PMID:22959346

  16. Toward a Science of Learning Games

    ERIC Educational Resources Information Center

    Howard-Jones, Paul; Demetriou, Skevi; Bogacz, Rafal; Yoo, Jee H.; Leonards, Ute

    2011-01-01

    Reinforcement learning involves a tight coupling of reward-associated behavior and a type of learning that is very different from that promoted by education. However, the emerging understanding of its underlying processes may help derive principles for effective learning games that have, until now, been elusive. This article first reviews findings…

  17. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning.

    PubMed

    Pilarski, Patrick M; Dawson, Michael R; Degris, Thomas; Fahimi, Farbod; Carey, Jason P; Sutton, Richard S

    2011-01-01

    As a contribution toward the goal of adaptable, intelligent artificial limbs, this work introduces a continuous actor-critic reinforcement learning method for optimizing the control of multi-function myoelectric devices. Using a simulated upper-arm robotic prosthesis, we demonstrate how it is possible to derive successful limb controllers from myoelectric data using only a sparse human-delivered training signal, without requiring detailed knowledge about the task domain. This reinforcement-based machine learning framework is well suited for use by both patients and clinical staff, and may be easily adapted to different application domains and the needs of individual amputees. To our knowledge, this is the first myoelectric control approach that facilitates the online learning of new amputee-specific motions based only on a one-dimensional (scalar) feedback signal provided by the user of the prosthesis. PMID:22275543
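
    A minimal sketch of the continuous actor-critic scheme described above, assuming a Gaussian policy over a single control dimension and radial-basis state features; names and parameter values are illustrative, not the authors' implementation:

      import numpy as np

      # Continuous actor-critic sketch (illustrative): a Gaussian policy over one
      # control dimension, trained from a sparse scalar reward signal.
      N = 8
      alpha_v, alpha_pi, gamma, sigma = 0.1, 0.01, 0.95, 0.2
      w_v = np.zeros(N)    # critic weights (state-value estimate)
      w_mu = np.zeros(N)   # actor weights (policy mean)

      def features(state):
          # Hypothetical radial-basis features over a 1-D state in [0, 1].
          centers = np.linspace(0.0, 1.0, N)
          return np.exp(-((state - centers) ** 2) / 0.02)

      def select_action(state):
          return np.random.normal(w_mu @ features(state), sigma)

      def update(state, action, reward, next_state):
          global w_v, w_mu
          x, x_next = features(state), features(next_state)
          delta = reward + gamma * (w_v @ x_next) - (w_v @ x)             # TD error
          w_v += alpha_v * delta * x                                      # critic update
          w_mu += alpha_pi * delta * (action - w_mu @ x) / sigma**2 * x   # actor update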

  18. Error-related negativity predicts reinforcement learning and conflict biases.

    PubMed

    Frank, Michael J; Woroch, Brion S; Curran, Tim

    2005-08-18

    The error-related negativity (ERN) is an electrophysiological marker thought to reflect changes in dopamine when participants make errors in cognitive tasks. Our computational model further predicts that larger ERNs should be associated with better learning to avoid maladaptive responses. Here we show that participants who avoided negative events had larger ERNs than those who were biased to learn more from positive outcomes. We also tested for effects of response conflict on ERN magnitude. While there was no overall effect of conflict, positive learners had larger ERNs when having to choose among two good options (win/win decisions) compared with two bad options (lose/lose decisions), whereas negative learners exhibited the opposite pattern. These results demonstrate that the ERN predicts the degree to which participants are biased to learn more from their mistakes than their correct choices and clarify the extent to which it indexes decision conflict. PMID:16102533

  19. Dissecting Neural Responses to Temporal Prediction, Attention, and Memory: Effects of Reward Learning and Interoception on Time Perception.

    PubMed

    Tomasi, Dardo; Wang, Gene-Jack; Studentsova, Yana; Volkow, Nora D

    2015-10-01

    Temporal prediction (TP) is needed to anticipate future events and is essential for survival. Our sense of time is modulated by emotional and interoceptive (corporal) states that are hypothesized to rely on a dopamine (DA)-modulated "internal clock" in the basal ganglia. However, the neurobiological substrates for TP in the human brain have not been identified. We tested the hypothesis that TP involves DA striato-cortical pathways, and that accurate responses are reinforcing in themselves and activate the nucleus accumbens (NAc). Functional magnetic resonance imaging revealed the involvement of the NAc and anterior insula in the temporal precision of the responses, and of the ventral tegmental area in error processing. Moreover, NAc showed higher activation for successful than for unsuccessful trials, indicating that accurate TP per se is rewarding. Inasmuch as activation of the NAc is associated with drug-induced addictive behaviors, its activation by accurate TP could help explain why video games that rely on TP can trigger compulsive behaviors. PMID:25389123

  20. Reward positivity: Reward prediction error or salience prediction error?

    PubMed

    Heydari, Sepideh; Holroyd, Clay B

    2016-08-01

    The reward positivity is a component of the human ERP elicited by feedback stimuli in trial-and-error learning and guessing tasks. A prominent theory holds that the reward positivity reflects a reward prediction error signal that is sensitive to outcome valence, being larger for unexpected positive events relative to unexpected negative events (Holroyd & Coles, 2002). Although the theory has found substantial empirical support, most of these studies have utilized either monetary or performance feedback to test the hypothesis. However, in apparent contradiction to the theory, a recent study found that unexpected physical punishments also elicit the reward positivity (Talmi, Atkinson, & El-Deredy, 2013). The authors of this report argued that the reward positivity reflects a salience prediction error rather than a reward prediction error. To investigate this finding further, in the present study participants navigated a virtual T maze and received feedback on each trial under two conditions. In a reward condition, the feedback indicated that they would either receive a monetary reward or not and in a punishment condition the feedback indicated that they would receive a small shock or not. We found that the feedback stimuli elicited a typical reward positivity in the reward condition and an apparently delayed reward positivity in the punishment condition. Importantly, this signal was more positive to the stimuli that predicted the omission of a possible punishment relative to stimuli that predicted a forthcoming punishment, which is inconsistent with the salience hypothesis. PMID:27184070

  1. Integrating Service Learning into Public Relations Coursework: Applications, Implications, Challenges, and Rewards

    ERIC Educational Resources Information Center

    Gleason, James P.; Violette, Jayne L.

    2012-01-01

    Drawing on a theoretical framework based on "use-inspired" applied research and service learning practice (Honnet-Porter & Poulsen, 1989), this paper argues the relationship between a service-learning approach and Public Relations coursework is a natural and highly desirable fit. Through examination of the goals of both service-learning and public…

  2. Rewards and Costs of Faculty Involvement in Intergenerational Service-Learning

    ERIC Educational Resources Information Center

    Bulot, James J.; Johnson, Christopher J.

    2006-01-01

    Service-learning (S-L) has been regarded as a relatively well-established and effective teaching pedagogy. Students who participate in S-L are more likely to learn more efficiently, more effectively, and remember more of what they have learned than their counterparts. Current studies have been done on the experiences of students in…

  3. Visual reinforcement shapes eye movements in visual search.

    PubMed

    Paeye, Céline; Schütz, Alexander C; Gegenfurtner, Karl R

    2016-08-01

    We use eye movements to gain information about our visual environment; this information can indirectly be used to affect the environment. Whereas eye movements are affected by explicit rewards such as points or money, it is not clear whether the information gained by finding a hidden target has a similar reward value. Here we tested whether finding a visual target can reinforce eye movements in visual search performed in a noise background, which conforms to natural scene statistics and contains a large number of possible target locations. First we tested whether presenting the target more often in one specific quadrant would modify eye movement search behavior. Surprisingly, participants did not learn to search for the target more often in high probability areas. Presumably, participants could not learn the reward structure of the environment. In two subsequent experiments we used a gaze-contingent display to gain full control over the reinforcement schedule. The target was presented more often after saccades into a specific quadrant or a specific direction. The proportions of saccades meeting the reinforcement criteria increased considerably, and participants matched their search behavior to the relative reinforcement rates of targets. Reinforcement learning seems to serve as the mechanism to optimize search behavior with respect to the statistics of the task. PMID:27559719

  4. Orbitofrontal Cortex Volume in Area 11/13 Predicts Reward Devaluation, But Not Reversal Learning Performance, in Young and Aged Monkeys

    PubMed Central

    Burke, Sara N.; Thome, Alex; Plange, Kojo; Engle, James R.; Trouard, Theodore P.; Gothard, Katalin M.

    2014-01-01

    The orbitofrontal cortex (OFC) and amygdala are both necessary for decisions based on expected outcomes. Although behavioral and imaging data suggest that these brain regions are affected by advanced age, the extent to which aging alters appetitive processes coordinated by the OFC and the amygdala is unknown. In the current experiment, young and aged bonnet macaques were trained on OFC- and amygdala-dependent tasks that test the degree to which response selection is guided by reward value and can be adapted when expected outcomes change. To assess whether the structural integrity of these regions varies with levels of performance on reward devaluation and object reversal tasks, volumes of areas 11/13 and 14 of the OFC, central/medial (CM), and basolateral (BL) nuclei of the amygdala were determined from high-resolution anatomical MRIs. With age, there were significant reductions in OFC, but not CM and BL, volume. Moreover, the aged monkeys showed impairments in the ability to associate an object with a higher value reward, and to reverse a previously learned association. Interestingly, greater OFC volume of area 11/13, but not 14, was significantly correlated with an animal's ability to anticipate the reward outcome associated with an object, and smaller BL volume was predictive of an animal's tendency to choose a higher value reward, but volume of neither region correlated with reversal learning. Together, these data indicate that OFC volume has an impact on monkeys' ability to guide choice behavior based on reward value but does not impact ability to reverse a previously learned association. PMID:25057193

  5. An introduction to stochastic control theory, path integrals and reinforcement learning

    NASA Astrophysics Data System (ADS)

    Kappen, Hilbert J.

    2007-02-01

    Control theory is a mathematical description of how to act optimally to gain future rewards. In this paper I give an introduction to deterministic and stochastic control theory and I give an overview of the possible application of control theory to the modeling of animal behavior and learning. I discuss a class of non-linear stochastic control problems that can be efficiently solved using a path integral or by MC sampling. In this control formalism the central concept of cost-to-go becomes a free energy and methods and concepts from statistical physics can be readily applied.
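
    For reference, the statement that the cost-to-go becomes a free energy can be written explicitly. In generic notation (not necessarily the paper's), the optimal cost-to-go of a path-integral control problem is

      J(x, t) = -\lambda \log \, \mathbb{E}_{\tau \sim p_0(\cdot \mid x, t)}\left[ \exp\!\left( -S(\tau)/\lambda \right) \right]

    where p_0 denotes the uncontrolled (passive) dynamics, S(\tau) is the accumulated path cost of trajectory \tau, and \lambda scales with the noise level. The expectation can be approximated by Monte Carlo sampling of uncontrolled rollouts, which is the MC approach mentioned above.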

  6. Reward-associated stimuli capture the eyes in spite of strategic attentional set.

    PubMed

    Hickey, Clayton; van Zoest, Wieske

    2013-11-01

    Theories of reinforcement learning have proposed that the association of reward to visual stimuli may cause these objects to become fundamentally salient and thus attention-drawing. A number of recent studies have investigated the oculomotor correlates of this reward-priming effect, but there is some ambiguity in this literature regarding the involvement of top-down attentional set. Existing paradigms tend to create a situation where participants are actively looking for a reward-associated stimulus before subsequently showing that this selective bias sustains when it no longer has strategic purpose. This perseveration of attentional set is potentially different in nature than the direct impact of reward proposed by theory. Here we investigate the effect of reward on saccadic selection in a paradigm where strategic attentional set is decoupled from the effect of reward. We find that during search for a uniquely oriented target, the receipt of reward following selection of a target characterized by an irrelevant unique color causes subsequent stimuli characterized by this color to be preferentially selected. Importantly, this occurs regardless of whether the color characterizes the target or distractor. Other analyses demonstrate that only features associated with correct selection of the target prime the target representation, and that the magnitude of this effect can be predicted by variability in saccadic indices of feedback processing. These results add to a growing literature demonstrating that reward guides visual selection, often in spite of our strategic efforts otherwise. PMID:24084197

  7. Reactivation of Reward-Related Patterns from Single Past Episodes Supports Memory-Based Decision Making.

    PubMed

    Wimmer, G Elliott; Büchel, Christian

    2016-03-01

    Rewarding experiences exert a strong influence on later decision making. While decades of neuroscience research have shown how reinforcement gradually shapes preferences, decisions are often influenced by single past experiences. Surprisingly, relatively little is known about the influence of single learning episodes. Although recent work has proposed a role for episodes in decision making, it is largely unknown whether and how episodic experiences contribute to value-based decision making and how the values of single episodes are represented in the brain. In multiple behavioral experiments and an fMRI experiment, we tested whether and how rewarding episodes could support later decision making. Participants experienced episodes of high reward or low reward in conjunction with incidental, trial-unique neutral pictures. In a surprise test phase, we found that participants could indeed remember the associated level of reward, as evidenced by accurate source memory for value and preferences to re-engage with rewarded objects. Further, in a separate experiment, we found that high-reward objects shown as primes before a gambling task increased financial risk taking. Neurally, re-exposure to objects in the test phase led to significant reactivation of reward-related patterns. Importantly, individual variability in the strength of reactivation predicted value memory performance. Our results provide a novel demonstration that affect-related neural patterns are reactivated during later experience. Reactivation of value information represents a mechanism by which memory can guide decision making. PMID:26961943

  8. Better late than never? The effect of feedback delay on ERP indices of reward processing

    PubMed Central

    Weinberg, Anna; Luhmann, Christian C.; Bress, Jennifer N.; Hajcak, Greg

    2012-01-01

    The feedback negativity (FN), an early neural response that differentiates rewards from losses, appears to be generated in part by reward circuits in the brain. A prominent model of the FN suggests that it reflects learning processes by which environmental feedback shapes behavior. Although there is evidence that human behavior is more strongly influenced by rewards that quickly follow actions, in nonlaboratory settings, optimal behaviors are not always followed by immediate rewards. However, it is not clear how the introduction of a delay between response selection and feedback impacts the FN. Thus, the present study used a simple forced choice gambling task to elicit the FN, in which feedback about rewards and losses was presented after either 1 or 6 s. Results suggest that, at short delays (1 s), participants clearly differentiated losses from rewards, as evidenced in the magnitude of the FN. At long delays (6 s), on the other hand, the difference between losses and rewards was negligible. Results are discussed in terms of eligibility traces and the reinforcement learning model of the FN. PMID:22752976
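
    One way to connect feedback delay to the reinforcement-learning account is through an eligibility trace that decays between the choice and the feedback, so that less credit reaches the choice as the delay grows. A minimal sketch with made-up decay parameters (not the authors' model):

      # Eligibility-trace sketch: credit assigned to a choice decays with feedback delay.
      lam, gamma = 0.9, 0.95          # trace decay and discount per second (illustrative values)

      def credit_after_delay(delay_seconds, dt=1.0):
          # Fraction of the trace remaining on the chosen action when feedback arrives.
          steps = int(delay_seconds / dt)
          return (lam * gamma) ** steps

      print(credit_after_delay(1))    # short delay: most credit retained
      print(credit_after_delay(6))    # long delay: little credit remains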

  9. Factors Contributing to the Effectiveness of Social and Nonsocial Reinforcers in the Discrimination Learning of Children from Two Socioeconomic Groups

    ERIC Educational Resources Information Center

    Spence, Janet Taylor

    1973-01-01

    Middle- and lower-class children who had been treated by E in a warm or aloof manner were given a discrimination learning task under one of six conditions forming a 3 by 2 design: three reinforcement types (Verbal-intoned, Verbal-nonintoned, or Symbolic) and reinforcement for correct or incorrect responses. (Editor)

  10. Toward Generalization of Automated Temporal Abstraction to Partially Observable Reinforcement Learning.

    PubMed

    Çilden, Erkin; Polat, Faruk

    2015-08-01

    Temporal abstraction for reinforcement learning (RL) aims to decrease learning time by making use of repeated sub-policy patterns in the learning task. Automatic extraction of abstractions during the RL process is difficult, presenting challenges such as dealing with the curse of dimensionality. Various studies have explored the subject under the assumption that the problem domain is fully observable by the learning agent. Learning abstractions for partially observable RL is a relatively less explored area. In this paper, we adapt an existing automatic abstraction method, namely extended sequence tree, originally designed for fully observable problems. The modified method covers a certain family of model-based partially observable RL settings. We also introduce belief state discretization methods that can be used with this new abstraction mechanism. The effectiveness of the proposed abstraction method is shown empirically by experimenting on well-known benchmark problems. PMID:25216494
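
    A minimal sketch of the kind of belief update and discretization such a method builds on, assuming a small, known POMDP model; the rounding-to-a-grid scheme is purely illustrative, not the specific discretization proposed in the paper:

      import numpy as np

      # Belief update for a tiny two-state POMDP, followed by a coarse discretization
      # so that tabular abstraction machinery can treat the belief as a discrete state.
      T = np.array([[[0.9, 0.1], [0.2, 0.8]],     # T[a, s, s']: transition model
                    [[0.5, 0.5], [0.5, 0.5]]])
      O = np.array([[0.8, 0.2], [0.3, 0.7]])      # O[s', o]: observation model

      def belief_update(b, a, o):
          b_pred = b @ T[a]                # predict: sum_s b(s) T(s, a, s')
          b_new = b_pred * O[:, o]         # correct with the observation likelihood
          return b_new / b_new.sum()

      def discretize(b, resolution=0.1):
          # Map the continuous belief to a grid cell usable as a discrete "state".
          return tuple(np.round(b / resolution).astype(int))

      b = belief_update(np.array([0.5, 0.5]), a=0, o=1)
      print(b, discretize(b))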

  11. Online Pedagogical Tutorial Tactics Optimization Using Genetic-Based Reinforcement Learning

    PubMed Central

    Lin, Hsuan-Ta; Lee, Po-Ming; Hsiao, Tzu-Chien

    2015-01-01

    Tutorial tactics are policies for an Intelligent Tutoring System (ITS) to decide the next action when there are multiple actions available. Recent research has demonstrated that when the learning contents were controlled so as to be the same, different tutorial tactics would make a difference in students' learning gains. However, the Reinforcement Learning (RL) techniques that were used in previous studies to induce tutorial tactics are insufficient when encountering large problems and hence were used in an offline manner. Therefore, we introduced a Genetic-Based Reinforcement Learning (GBML) approach to induce tutorial tactics in an online-learning manner without relying on any preexisting dataset. The introduced method can learn a set of rules from the environment in a manner similar to RL. It includes a genetic-based optimizer for the rule discovery task that generates new rules from old ones. This increases the scalability of an RL learner for larger problems. The results support our hypothesis about the capability of the GBML method to induce tutorial tactics. This suggests that the GBML method should be favorable in developing real-world ITS applications in the domain of tutorial tactics induction. PMID:26065018

  12. Opponent actor learning (OpAL): modeling interactive effects of striatal dopamine on reinforcement learning and choice incentive.

    PubMed

    Collins, Anne G E; Frank, Michael J

    2014-07-01

    The striatal dopaminergic system has been implicated in reinforcement learning (RL), motor performance, and incentive motivation. Various computational models have been proposed to account for each of these effects individually, but a formal analysis of their interactions is lacking. Here we present a novel algorithmic model expanding the classical actor-critic architecture to include fundamental interactive properties of neural circuit models, incorporating both incentive and learning effects into a single theoretical framework. The standard actor is replaced by a dual opponent actor system representing distinct striatal populations, which come to differentially specialize in discriminating positive and negative action values. Dopamine modulates the degree to which each actor component contributes to both learning and choice discriminations. In contrast to standard frameworks, this model simultaneously captures documented effects of dopamine on both learning and choice incentive, and their interactions, across a variety of studies, including probabilistic RL, effort-based choice, and motor skill learning. PMID:25090423
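
    A schematic sketch of the opponent actor-critic idea described above: a critic computes a prediction error, and two actor weights (G for "Go", N for "NoGo") are updated in opposite directions, with separate gain parameters standing in for dopamine's effect on choice. Parameter names and values are illustrative rather than the paper's exact equations:

      import numpy as np

      # Opponent actor-critic (OpAL-style) sketch for one stimulus and two candidate actions.
      n_actions = 2
      V = 0.0                        # critic value for the stimulus
      G = np.ones(n_actions)         # "Go" actor weights
      N = np.ones(n_actions)         # "NoGo" actor weights
      alpha_c, alpha_g, alpha_n = 0.1, 0.1, 0.1
      beta_g, beta_n = 1.0, 1.0      # asymmetric gains model high vs. low dopamine states

      def choose():
          act = beta_g * G - beta_n * N
          p = np.exp(act - act.max()); p /= p.sum()        # softmax over the opponent activations
          return np.random.choice(n_actions, p=p)

      def learn(action, reward):
          global V
          delta = reward - V                               # critic prediction error
          V += alpha_c * delta
          G[action] += alpha_g * G[action] * delta         # Go weight amplified by positive errors
          N[action] += alpha_n * N[action] * (-delta)      # NoGo weight amplified by negative errors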

  13. Informing sequential clinical decision-making through reinforcement learning: an empirical study

    PubMed Central

    Shortreed, Susan M.; Laber, Eric; Lizotte, Daniel J.; Stroup, T. Scott; Pineau, Joelle; Murphy, Susan A.

    2011-01-01

    This paper highlights the role that reinforcement learning can play in the optimization of treatment policies for chronic illnesses. Before applying any off-the-shelf reinforcement learning methods in this setting, we must first tackle a number of challenges. We outline some of these challenges and present methods for overcoming them. First, we describe a multiple imputation approach to overcome the problem of missing data. Second, we discuss the use of function approximation in the context of a highly variable observation set. Finally, we discuss approaches to summarizing the evidence in the data for recommending a particular action and quantifying the uncertainty around the Q-function of the recommended policy. We present the results of applying these methods to real clinical trial data of patients with schizophrenia. PMID:21799585
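
    As a schematic of the batch setting described above, Q-function estimation from observed trajectories can be written as fitted Q-iteration with a linear approximator per treatment action. This is a generic sketch, not the authors' analysis pipeline; multiple imputation and the uncertainty quantification discussed in the paper are omitted:

      import numpy as np

      # Fitted Q-iteration with one linear Q-function per action.
      # batch: list of (state, action, reward, next_state) tuples from observed trajectories.
      def fitted_q_iteration(batch, n_actions=2, n_iters=50, gamma=0.9):
          dim = len(batch[0][0])
          W = np.zeros((n_actions, dim))                   # one weight vector per action
          for _ in range(n_iters):
              for a in range(n_actions):
                  rows = [(np.asarray(s), r, np.asarray(s2))
                          for s, act, r, s2 in batch if act == a]
                  X = np.array([s for s, r, s2 in rows])
                  y = np.array([r + gamma * max(W @ s2) for s, r, s2 in rows])  # bootstrapped targets
                  W[a], *_ = np.linalg.lstsq(X, y, rcond=None)                  # refit Q_a
          return W

      def recommend(W, state):
          # Recommended action for a new state: argmax over actions of the learned Q-values.
          return int(np.argmax(W @ np.asarray(state)))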

  14. Application of fuzzy logic-neural network based reinforcement learning to proximity and docking operations

    NASA Technical Reports Server (NTRS)

    Jani, Yashvant

    1992-01-01

    As part of the Research Institute for Computing and Information Systems (RICIS) activity, the reinforcement learning techniques developed at Ames Research Center are being applied to proximity and docking operations using the Shuttle and Solar Max satellite simulation. This activity is carried out in the software technology laboratory utilizing the Orbital Operations Simulator (OOS). This interim report provides the status of the project and outlines the future plans.

  15. Attitudes to the Rights and Rewards for Author Contributions to Repositories for Teaching and Learning

    ERIC Educational Resources Information Center

    Bates, Melanie; Loddington, Steve; Manuel, Sue; Oppenheim, Charles

    2007-01-01

    In the United Kingdom over the past few years there has been a dramatic growth of national and regional repositories to collect and disseminate resources related to teaching and learning. Most notable of these are the Joint Information Systems Committee's Online Repository for [Learning and Teaching] Materials as well as the Higher Education…

  16. An Individual or a Group Grade: Exploring Reward Structures and Motivation for Learning

    ERIC Educational Resources Information Center

    Collins, C. S.

    2012-01-01

    From a student perspective, grades are a central part in the educational experience. In an effort to learn more about student motivation for learning and grades, this study was designed to examine student reactions to the opportunity to choose between the traditional individual grading structure and a group grading structure where all students…

  17. Monetary rewards modulate inhibitory control.

    PubMed

    Herrera, Paula M; Speranza, Mario; Hampshire, Adam; Bekinschtein, Tristán A

    2014-01-01

    The ability to override a dominant response, often referred to as behavioral inhibition, is considered a key element of executive cognition. Poor behavioral inhibition is a defining characteristic of several neurological and psychiatric populations. Recently, there has been increasing interest in the motivational dimension of behavioral inhibition, with some experiments incorporating emotional contingencies in classical inhibitory paradigms such as the Go/NoGo and Stop Signal Tasks (SSTs). Several studies have reported a positive modulatory effect of reward on performance in pathological conditions such as substance abuse, pathological gambling, and Attention Deficit Hyperactivity Disorder (ADHD). However, experiments that directly investigate the modulatory effects of reward magnitudes on the performance of inhibitory tasks are scarce and little is known about the finer grained relationship between motivation and inhibitory control. Here we probed the effect of reward magnitude and context on behavioral inhibition with three modified versions of the widely used SST. The pilot study compared inhibition performance during six blocks alternating neutral feedback, low, medium, and high monetary rewards. Study One compared increasing vs. decreasing rewards, with low, high rewards, and neutral feedback; whilst Study Two compared low and high reward magnitudes alone also in an increasing and decreasing reward design. The reward magnitude effect was not demonstrated in the pilot study, probably due to a learning effect induced by practice in this lengthy task. The reward effect per se was weak but the context (order of reward) was clearly suggested in Study One, and was particularly strongly confirmed in Study Two. In addition, these findings revealed a "kick start effect" over global performance measures. Specifically, there was a long lasting improvement in performance throughout the task when participants received the highest reward magnitudes at the beginning of the

  18. Monetary rewards modulate inhibitory control

    PubMed Central

    Herrera, Paula M.; Speranza, Mario; Hampshire, Adam; Bekinschtein, Tristán A.

    2014-01-01

    The ability to override a dominant response, often referred to as behavioral inhibition, is considered a key element of executive cognition. Poor behavioral inhibition is a defining characteristic of several neurological and psychiatric populations. Recently, there has been increasing interest in the motivational dimension of behavioral inhibition, with some experiments incorporating emotional contingencies in classical inhibitory paradigms such as the Go/NoGo and Stop Signal Tasks (SSTs). Several studies have reported a positive modulatory effect of reward on performance in pathological conditions such as substance abuse, pathological gambling, and Attention Deficit Hyperactivity Disorder (ADHD). However, experiments that directly investigate the modulatory effects of reward magnitudes on the performance of inhibitory tasks are scarce and little is known about the finer grained relationship between motivation and inhibitory control. Here we probed the effect of reward magnitude and context on behavioral inhibition with three modified versions of the widely used SST. The pilot study compared inhibition performance during six blocks alternating neutral feedback, low, medium, and high monetary rewards. Study One compared increasing vs. decreasing rewards, with low, high rewards, and neutral feedback; whilst Study Two compared low and high reward magnitudes alone also in an increasing and decreasing reward design. The reward magnitude effect was not demonstrated in the pilot study, probably due to a learning effect induced by practice in this lengthy task. The reward effect per se was weak but the context (order of reward) was clearly suggested in Study One, and was particularly strongly confirmed in Study Two. In addition, these findings revealed a “kick start effect” over global performance measures. Specifically, there was a long lasting improvement in performance throughout the task when participants received the highest reward magnitudes at the beginning of the

  19. Acceleration of reinforcement learning by policy evaluation using nonstationary iterative method.

    PubMed

    Senda, Kei; Hattori, Suguru; Hishinuma, Toru; Kohda, Takehisa

    2014-12-01

    Typical methods for solving reinforcement learning problems iterate two steps, policy evaluation and policy improvement. This paper proposes algorithms for the policy evaluation step to improve learning efficiency. The proposed algorithms are based on the Krylov Subspace Method (KSM), which is a nonstationary iterative method. The algorithms based on KSM are tens to hundreds of times more efficient than existing algorithms based on stationary iterative methods, and far more efficient than has generally been expected. This paper clarifies what makes algorithms based on KSM more efficient, using numerical examples and theoretical discussions. PMID:24733037
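
    Policy evaluation amounts to solving the linear system (I - gamma * P^pi) v = r^pi, so a Krylov subspace method can simply be applied to that system. The sketch below hands it to GMRES as an illustration of the idea, not a reconstruction of the paper's algorithms:

      import numpy as np
      from scipy.sparse.linalg import gmres

      # Policy evaluation as a linear solve with a Krylov subspace (nonstationary iterative) method.
      # P: state-transition matrix under the fixed policy; r: expected one-step reward per state.
      def evaluate_policy_gmres(P, r, gamma=0.95):
          A = np.eye(P.shape[0]) - gamma * P        # (I - gamma * P^pi)
          v, info = gmres(A, r)                     # info == 0 indicates convergence
          return v

      # Tiny three-state example with a fixed policy.
      P = np.array([[0.5, 0.5, 0.0],
                    [0.0, 0.5, 0.5],
                    [0.5, 0.0, 0.5]])
      r = np.array([0.0, 1.0, 0.0])
      print(evaluate_policy_gmres(P, r))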

  20. RPS Market Analysis Based on Reinforcement Learning in Power Systems

    NASA Astrophysics Data System (ADS)

    Sugano, Takanori; Kita, Hiroyuki; Tanaka, Eiichi; Hasegawa, Jun

    Deregulation and restructuring of the electric power supply business are proceeding all over the world. In many cases, a competitive environment is introduced, where a market to transact electric power is established and various attempts are made to decrease the price. On the other hand, environmental problems have been pointed out in recent years, and there is a possibility that cost reduction of electric power will lead to environmental deterioration. In this paper, the RPS (Renewable Portfolio Standard) system is taken up as a method for addressing environmental problems under the deregulation of the electric power supply business. An RPS model is created using multi-agent theory, where Q-learning is used as the decision-making technique of each agent. By using this model, the RPS system is verified for its effectiveness and influence.
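
    The agents' decision making can be sketched as standard tabular Q-learning; the market states, actions, and rewards below are placeholders rather than the paper's model:

      import random
      from collections import defaultdict

      # Generic tabular Q-learning agent, as might drive each market participant's decisions.
      class QLearningAgent:
          def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
              self.q = defaultdict(float)           # Q[(state, action)]
              self.actions = actions
              self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

          def act(self, state):
              if random.random() < self.epsilon:    # explore occasionally
                  return random.choice(self.actions)
              return max(self.actions, key=lambda a: self.q[(state, a)])

          def learn(self, state, action, reward, next_state):
              best_next = max(self.q[(next_state, a)] for a in self.actions)
              target = reward + self.gamma * best_next
              self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])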

  1. Concept learning without differential reinforcement in pigeons by means of contextual cueing.

    PubMed

    Couto, Kalliu C; Navarro, Victor M; Smith, Tatiana R; Wasserman, Edward A

    2016-04-01

    How supervision is arranged can affect the way that humans learn concepts. Yet very little is known about the role that supervision plays in nonhuman concept learning. Prior research in pigeon concept learning has commonly used differential response-reinforcer procedures (involving high-level supervision) to support reliable discrimination and generalization involving 4 to 16 concurrently presented photographic categories. In the present project, we used contextual cueing, a nondifferential reinforcement procedure (involving low-level supervision), to investigate concept learning in pigeons. We found that pigeons were faster to peck a target stimulus when 8 members from each of 4 categories of black-and-white photographs (dogs, trees, shoes, and keys) correctly cued its location than when they did not. This faster detection of the target also generalized to 4 untrained members from each of the 4 photographic categories. Our results thus pass the prime behavioral tests of conceptualization and suggest that high-level supervision is unnecessary to support concept learning in pigeons. (PsycINFO Database Record) PMID:26914972

  2. The hypocretins and the reward function: what have we learned so far?

    PubMed

    Boutrel, Benjamin; Steiner, Nadia; Halfon, Olivier

    2013-01-01

    A general consensus acknowledges that drug consumption (including alcohol, tobacco, and illicit drugs) constitutes the leading cause of preventable death worldwide. But the global burden of drug abuse extends beyond the mortality statistics. Indeed, the comorbid long-term debilitating effects of the disease also significantly deteriorate the quality of life of individuals suffering from addiction disorders. Despite the large body of evidence delineating the cellular and molecular adaptations induced by chronic drug consumption, the brain mechanisms responsible for drug craving and relapse remain insufficiently understood, and even the most recent developments in the field have not brought significant improvement in the management of drug dependence. However, recent preclinical evidence suggests that disrupting the hypocretin (orexin) system may serve as an anticraving medication therapy. Here, we discuss how the hypocretins, which orchestrate normal wakefulness, metabolic health and the execution of goal-oriented behaviors, may be compromised and contribute to elicit compulsive drug seeking. We propose an overview of the most recent studies demonstrating an important role for the hypocretin neuropeptide system in the regulation of drug reward and the prevention of drug relapse, and we question the relevance of disrupting the hypocretin system to alleviate symptoms of drug addiction. PMID:23781178

  3. Reinforcement Learning for Weakly-Coupled MDPs and an Application to Planetary Rover Control

    NASA Technical Reports Server (NTRS)

    Bernstein, Daniel S.; Zilberstein, Shlomo

    2003-01-01

    Weakly-coupled Markov decision processes can be decomposed into subprocesses that interact only through a small set of bottleneck states. We study a hierarchical reinforcement learning algorithm designed to take advantage of this particular type of decomposability. To test our algorithm, we use a decision-making problem faced by autonomous planetary rovers. In this problem, a Mars rover must decide which activities to perform and when to traverse between science sites in order to make the best use of its limited resources. In our experiments, the hierarchical algorithm performs better than Q-learning in the early stages of learning, but unlike Q-learning it converges to a suboptimal policy. This suggests that it may be advantageous to use the hierarchical algorithm when training time is limited.

  4. μ-Opioid receptor activation in the basolateral amygdala mediates the learning of increases but not decreases in the incentive value of a food reward.

    PubMed

    Wassum, Kate M; Cely, Ingrid C; Balleine, Bernard W; Maidment, Nigel T

    2011-02-01

    The decision to perform, or not perform, actions known to lead to a rewarding outcome is strongly influenced by the current incentive value of the reward. Incentive value is largely determined by the affective experience derived during previous consumption of the reward-the process of incentive learning. We trained rats on a two-lever, seeking-taking chain paradigm for sucrose reward, in which responding on the initial seeking lever of the chain was demonstrably controlled by the incentive value of the reward. We found that infusion of the μ-opioid receptor antagonist, CTOP (d-Phe-Cys-Tyr-d-Trp-Orn-Thr-Pen-Thr-NH(2)), into the basolateral amygdala (BLA) during posttraining, noncontingent consumption of sucrose in a novel elevated-hunger state (a positive incentive learning opportunity) blocked the encoding of incentive value information normally used to increase subsequent sucrose-seeking responses. Similar treatment with δ [N, N-diallyl-Tyr-Aib-Aib-Phe-Leu-OH (ICI 174,864)] or κ [5'-guanidinonaltrindole (GNTI)] antagonists was without effect. Interestingly, none of these drugs affected the ability of the rats to encode a decrease in incentive value resulting from experiencing the sucrose in a novel reduced-hunger state. However, the μ agonist, DAMGO ([d-Ala2, NMe-Phe4, Gly5-ol]-enkephalin), appeared to attenuate this negative incentive learning. These data suggest that upshifts and downshifts in endogenous opioid transmission in the BLA mediate the encoding of positive and negative shifts in incentive value, respectively, through actions at μ-opioid receptors, and provide insight into a mechanism through which opiates may elicit inappropriate desire resulting in their continued intake in the face of diminishing affective experience. PMID:21289167

  5. Do Economic Rewards Work?

    ERIC Educational Resources Information Center

    Wallace, Brian D.

    2009-01-01

    The love of learning--that intrinsic desire to gain knowledge and insight into new subjects--was once its own reward. That was altered decades ago when parents started using the proverbial "stick and carrot" to motivate their children to do well in school, or even just show up. Today, educators across the country have taken hold of this approach…

  6. Monetary rewards influence retrieval orientations.

    PubMed

    Halsband, Teresa M; Ferdinand, Nicola K; Bridger, Emma K; Mecklinger, Axel

    2012-09-01

    Reward anticipation during learning is known to support memory formation, but its role in retrieval processes is so far unclear. Retrieval orientations, as a reflection of controlled retrieval processing, are one aspect of retrieval that might be modulated by reward. These processes can be measured using the event-related potentials (ERPs) elicited by retrieval cues from tasks with different retrieval requirements, such as via changes in the class of targeted memory information. To determine whether retrieval orientations of this kind are modulated by reward during learning, we investigated the effects of high and low reward expectancy on the ERP correlates of retrieval orientation in two separate experiments. The reward manipulation at study in Experiment 1 was associated with later memory performance, whereas in Experiment 2, reward was directly linked to accuracy in the study task. In both studies, the participants encoded mixed lists of pictures and words preceded by high- or low-reward cues. After 24 h, they performed a recognition memory exclusion task, with words as the test items. In addition to a previously reported material-specific effect of retrieval orientation, a frontally distributed, reward-associated retrieval orientation effect was found in both experiments. These findings suggest that reward motivation during learning leads to the adoption of a reward-associated retrieval orientation to support the retrieval of highly motivational information. Thus, ERP retrieval orientation effects not only reflect retrieval processes related to the sought-for materials, but also relate to the reward conditions with which items were combined during encoding. PMID:22547161

  7. Simulating the Effect of Reinforcement Learning on Neuronal Synchrony and Periodicity in the Striatum.

    PubMed

    Hélie, Sébastien; Fleischer, Pierson J

    2016-01-01

    The study of rhythms and oscillations in the brain is gaining attention. While it is unclear exactly what the role of oscillation, synchrony, and rhythm is, it appears increasingly likely that synchrony is related to normal and abnormal brain states and possibly cognition. In this article, we explore the relationship between basal ganglia (BG) synchrony and reinforcement learning. We simulate a biologically-realistic model of the striatum initially proposed by Ponzi and Wickens (2010) and enhance the model by adding plastic cortico-BG synapses that can be modified using reinforcement learning. The effect of reinforcement learning on striatal rhythmic activity is then explored, and disrupted using simulated deep brain stimulation (DBS). The stimulator injects current in the brain structure to which it is attached, which affects neuronal synchrony. The results show that training the model without DBS yields a high accuracy in the learning task and reduces the number of active neurons in the striatum, along with increased firing periodicity and decreased firing synchrony between neurons in the same assembly. In addition, a spectral decomposition shows a stronger signal for correct trials than incorrect trials in high frequency bands. If the DBS is ON during the training phase, but not the test phase, the amount of learning in the model is reduced, along with firing periodicity. Similar to when the DBS is OFF, spectral decomposition shows a stronger signal for correct trials than for incorrect trials in high frequency domains, but this phenomenon happens in higher frequency bands than when the DBS is OFF. Synchrony between the neurons is not affected. Finally, the results show that turning the DBS ON at test increases both firing periodicity and striatal synchrony, and spectral decomposition of the signal shows that neural activity synchronizes with the DBS fundamental frequency (and its harmonics). Turning the DBS ON during the test phase results in chance

  8. Simulating the Effect of Reinforcement Learning on Neuronal Synchrony and Periodicity in the Striatum

    PubMed Central

    Hélie, Sébastien; Fleischer, Pierson J.

    2016-01-01

    The study of rhythms and oscillations in the brain is gaining attention. While it is unclear exactly what the role of oscillation, synchrony, and rhythm is, it appears increasingly likely that synchrony is related to normal and abnormal brain states and possibly cognition. In this article, we explore the relationship between basal ganglia (BG) synchrony and reinforcement learning. We simulate a biologically-realistic model of the striatum initially proposed by Ponzi and Wickens (2010) and enhance the model by adding plastic cortico-BG synapses that can be modified using reinforcement learning. The effect of reinforcement learning on striatal rhythmic activity is then explored, and disrupted using simulated deep brain stimulation (DBS). The stimulator injects current in the brain structure to which it is attached, which affects neuronal synchrony. The results show that training the model without DBS yields a high accuracy in the learning task and reduces the number of active neurons in the striatum, along with increased firing periodicity and decreased firing synchrony between neurons in the same assembly. In addition, a spectral decomposition shows a stronger signal for correct trials than incorrect trials in high frequency bands. If the DBS is ON during the training phase, but not the test phase, the amount of learning in the model is reduced, along with firing periodicity. Similar to when the DBS is OFF, spectral decomposition shows a stronger signal for correct trials than for incorrect trials in high frequency domains, but this phenomenon happens in higher frequency bands than when the DBS is OFF. Synchrony between the neurons is not affected. Finally, the results show that turning the DBS ON at test increases both firing periodicity and striatal synchrony, and spectral decomposition of the signal shows that neural activity synchronizes with the DBS fundamental frequency (and its harmonics). Turning the DBS ON during the test phase results in chance
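
    The reinforcement-driven synaptic modification described in the two records above is commonly expressed as a three-factor rule, in which a reward prediction error gates Hebbian co-activity. A generic sketch of such a rule (not the specific plasticity implementation used in the model):

      import numpy as np

      # Three-factor plasticity sketch: weight change = learning rate * prediction error
      # * presynaptic (cortical) activity * postsynaptic (striatal) activity.
      def update_weights(W, pre, post, reward, value_estimate, eta=0.01):
          delta = reward - value_estimate           # dopamine-like reward prediction error
          return W + eta * delta * np.outer(post, pre)

      W = np.zeros((4, 6))                          # 4 striatal units, 6 cortical inputs
      pre = np.random.rand(6)                       # cortical firing rates
      post = np.random.rand(4)                      # striatal firing rates
      W = update_weights(W, pre, post, reward=1.0, value_estimate=0.4)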

  9. A Competency-Based Technical Training Model That Embraces Learning Flexibility and Rewards Competency

    ERIC Educational Resources Information Center

    Yasinski, Lee

    2014-01-01

    Today's adult learners are continuously searching for successful programs with added learner flexibility, a positive learning experience, and the best education for their investment. Red Deer College's unique competency based welder apprenticeship training model fulfills this desire for many adult learners.

  10. Real Clients, Real Management, Real Failure: The Risks and Rewards of Service Learning

    ERIC Educational Resources Information Center

    Cyphert, Dale

    2006-01-01

    There are multiple advantages to service-learning projects across the business curriculum, but in communication classes the author has found their biggest value to be authenticity. A "real-world" assignment requires the flexible, creative integration of communication skills in an environment where, "unlike exams and other typical university…

  11. Cooperative Learning: Effects of Task, Reward, and Group Size on Individual Achievement.

    ERIC Educational Resources Information Center

    Hagman, Joseph D.; Hayes, John F.

    This report examines whether cooperative learning can be used to promote individual achievement, and identifies conditions under which a benefit can be expected. Two experiments were conducted at the Quartermaster School, Fort Lee, Virginia. The first experiment compared test performance of 280 trainees after they had completed practical exercises…

  12. Life as a Married Couple with Learning Disabilities: Rewards and Challenges Times Two

    ERIC Educational Resources Information Center

    Yuan, Frances

    2010-01-01

    This study examined the outcomes and issues for married couples who are graduates of the Threshold Program at Lesley University, Cambridge, Massachusetts. Threshold is a non-degree post-secondary program that aims to help young adults with severe learning disabilities and low-average intelligence develop the skills necessary to make the transition…

  13. Reward deficiency and anti-reward in pain chronification.

    PubMed

    Borsook, D; Linnman, C; Faria, V; Strassman, A M; Becerra, L; Elman, I

    2016-09-01

    Converging lines of evidence suggest that the pathophysiology of pain is mediated to a substantial degree via allostatic neuroadaptations in reward- and stress-related brain circuits. Thus, reward deficiency (RD) represents a within-system neuroadaptation to pain-induced protracted activation of the reward circuits that leads to depletion-like hypodopaminergia, clinically manifested anhedonia, and diminished motivation for natural reinforcers. Anti-reward (AR) conversely pertains to a between-systems neuroadaptation involving over-recruitment of key limbic structures (e.g., the central and basolateral amygdala nuclei, the bed nucleus of the stria terminalis, the lateral tegmental noradrenergic nuclei of the brain stem, the hippocampus and the habenula) responsible for massive outpouring of stressogenic neurochemicals (e.g., norepinephrine, corticotropin releasing factor, vasopressin, hypocretin, and substance P) giving rise to such negative affective states as anxiety, fear and depression. We propose here the Combined Reward deficiency and Anti-reward Model (CReAM), in which biopsychosocial variables modulating brain reward, motivation and stress functions can interact in a 'downward spiral' fashion to exacerbate the intensity, chronicity and comorbidities of chronic pain syndromes (i.e., pain chronification). PMID:27246519

  14. The role of the medial prefrontal cortex in updating reward value and avoiding perseveration.

    PubMed

    Laskowski, C S; Williams, R J; Martens, K M; Gruber, A J; Fisher, K G; Euston, D R

    2016-06-01

    The medial prefrontal cortex (mPFC) plays a major role in goal-directed behaviours, but it is unclear whether it plays a role in breaking away from a high-value reward in order to explore for better options. To address this question, we designed a novel 3-arm Bandit Task in which rats were required to choose one of three potential reward arms, each of which was associated with a different amount of food reward and time-out punishment. After a variable number of choice trials the reward locations were shuffled and animals had to disengage from the now devalued arm and explore the other options in order to optimise payout. Lesion and control groups' behaviours on the task were then analysed by fitting data with a reinforcement learning model. As expected, lesioned animals obtained less reward overall due to an inability to flexibly adapt their behaviours after a change in reward location. However, modelling results showed that lesioned animals were no more likely to explore than control animals. We also discovered that all animals showed a strong preference for certain maze arms, at the expense of reward. This tendency was exacerbated in the lesioned animals, with the strongest effects seen in a subset of animals with damage to dorsal mPFC. The results confirm a role for mPFC in goal-directed behaviours but suggest that rats rely on other areas to resolve the explore-exploit dilemma. PMID:26965571
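
    Fitting choice data with a reinforcement learning model, as described above, typically means maximizing the likelihood of the observed choices under a delta-rule value update and a softmax choice rule. A generic sketch of that procedure (the paper's model may include additional terms, such as arm-preference biases):

      import numpy as np
      from scipy.optimize import minimize

      # Negative log-likelihood of a 3-arm bandit choice sequence under a simple
      # delta-rule + softmax reinforcement learning model.
      def neg_log_likelihood(params, choices, rewards, n_arms=3):
          alpha, beta = params                      # learning rate, inverse temperature
          Q = np.zeros(n_arms)
          nll = 0.0
          for c, r in zip(choices, rewards):
              p = np.exp(beta * Q - np.max(beta * Q))
              p /= p.sum()
              nll -= np.log(p[c] + 1e-12)           # likelihood of the observed choice
              Q[c] += alpha * (r - Q[c])            # delta-rule value update
          return nll

      # Example fit to hypothetical choice and reward sequences.
      choices = [0, 0, 1, 2, 2, 2, 1, 0]
      rewards = [1, 0, 1, 1, 1, 0, 0, 1]
      fit = minimize(neg_log_likelihood, x0=[0.5, 2.0], args=(choices, rewards),
                     bounds=[(0.01, 1.0), (0.1, 20.0)])
      print(fit.x)                                  # estimated (alpha, beta)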

  15. Orbitofrontal Cortex Volume and Brain Reward Response in Obesity

    PubMed Central

    Shott, Megan E.; Cornier, Marc-Andre; Mittal, Vijay A.; Pryor, Tamara L.; Orr, Joseph M.; Brown, Mark S.; Frank, Guido K.W.

    2014-01-01

    Background/Objectives What drives overconsumption of food is poorly understood. Alterations in brain structure and function could contribute to increased food seeking. Recently, brain orbitofrontal cortex volume has been implicated in dysregulated eating, but little is known about how brain structure relates to function. Subjects/Methods We examined obese (n=18, age=28.7±8.3 years) and healthy control women (n=24, age=27.4±6.3 years) using a multimodal brain imaging approach. We applied magnetic resonance and diffusion tensor imaging to study brain gray and white matter volume as well as white matter integrity, and tested whether orbitofrontal cortex volume predicts brain reward circuitry activation in a taste reinforcement-learning paradigm that has been associated with dopamine function. Results Obese individuals displayed lower gray and associated white matter volumes (p<.05 family wise error (FWE)-small volume corrected) compared to controls in the orbitofrontal cortex, striatum, and insula. White matter integrity was reduced in obese individuals in fiber tracts including the external capsule, corona radiata, sagittal stratum, and the uncinate, inferior fronto-occipital, and inferior longitudinal fasciculi. Gray matter volume of the gyrus rectus at the medial edge of the orbitofrontal cortex predicted functional taste reward-learning response in frontal cortex, insula, basal ganglia, amygdala, hypothalamus and anterior cingulate cortex in control but not obese individuals. Conclusions This study indicates a strong association between medial orbitofrontal cortex volume and taste reinforcement-learning activation in the brain in control but not in obese women. Lower brain volumes in the orbitofrontal cortex and other brain regions associated with taste reward function as well as lower integrity of connecting pathways in obesity may support a more widespread disruption of reward pathways. The medial orbitofrontal cortex is an important structure in the termination of

  16. Dissociable functions of reward inference in the lateral prefrontal cortex and the striatum.

    PubMed

    Tanaka, Shingo; Pan, Xiaochuan; Oguchi, Mineki; Taylor, Jessica E; Sakagami, Masamichi

    2015-01-01

    In a complex and uncertain world, how do we select appropriate behavior? One possibility is that we choose actions that are highly reinforced by their probabilistic consequences (model-free processing). However, we may instead plan actions prior to their actual execution by predicting their consequences (model-based processing). It has been suggested that the brain contains multiple yet distinct systems involved in reward prediction. Several studies have tried to allocate model-free and model-based systems to the striatum and the lateral prefrontal cortex (LPFC), respectively. Although there is much support for this hypothesis, recent research has revealed discrepancies. To understand the nature of the reward prediction systems in the LPFC and the striatum, a series of single-unit recording experiments were conducted. LPFC neurons were found to infer the reward associated with the stimuli even when the monkeys had not yet learned the stimulus-reward (SR) associations directly. Striatal neurons seemed to predict the reward for each stimulus only after directly experiencing the SR contingency. However, the one exception was "Exclusive Or" situations in which striatal neurons could predict the reward without direct experience. Previous single-unit studies in monkeys have reported that neurons in the LPFC encode category information, and represent reward information specific to a group of stimuli. Here, as an extension of these, we review recent evidence that a group of LPFC neurons can predict reward specific to a category of visual stimuli defined by relevant behavioral responses. We suggest that the functional difference in reward prediction between the LPFC and the striatum is that while LPFC neurons can utilize abstract code, striatal neurons can code individual associations between stimuli and reward but cannot utilize abstract code. PMID:26236266

  17. Dissociable functions of reward inference in the lateral prefrontal cortex and the striatum

    PubMed Central

    Tanaka, Shingo; Pan, Xiaochuan; Oguchi, Mineki; Taylor, Jessica E.; Sakagami, Masamichi

    2015-01-01

    In a complex and uncertain world, how do we select appropriate behavior? One possibility is that we choose actions that are highly reinforced by their probabilistic consequences (model-free processing). However, we may instead plan actions prior to their actual execution by predicting their consequences (model-based processing). It has been suggested that the brain contains multiple yet distinct systems involved in reward prediction. Several studies have tried to allocate model-free and model-based systems to the striatum and the lateral prefrontal cortex (LPFC), respectively. Although there is much support for this hypothesis, recent research has revealed discrepancies. To understand the nature of the reward prediction systems in the LPFC and the striatum, a series of single-unit recording experiments were conducted. LPFC neurons were found to infer the reward associated with the stimuli even when the monkeys had not yet learned the stimulus-reward (SR) associations directly. Striatal neurons seemed to predict the reward for each stimulus only after directly experiencing the SR contingency. However, the one exception was “Exclusive Or” situations in which striatal neurons could predict the reward without direct experience. Previous single-unit studies in monkeys have reported that neurons in the LPFC encode category information, and represent reward information specific to a group of stimuli. Here, as an extension of these, we review recent evidence that a group of LPFC neurons can predict reward specific to a category of visual stimuli defined by relevant behavioral responses. We suggest that the functional difference in reward prediction between the LPFC and the striatum is that while LPFC neurons can utilize abstract code, striatal neurons can code individual associations between stimuli and reward but cannot utilize abstract code. PMID:26236266

  18. The role of the orbitofrontal cortex in the pursuit of happiness and more specific rewards.

    PubMed

    Burke, Kathryn A; Franz, Theresa M; Miller, Danielle N; Schoenbaum, Geoffrey

    2008-07-17

    Cues that reliably predict rewards trigger the thoughts and emotions normally evoked by those rewards. Humans and other animals will work, often quite hard, for these cues. This is termed conditioned reinforcement. The ability to use conditioned reinforcers to guide our behaviour is normally beneficial; however, it can go awry. For example, corporate icons, such as McDonald's Golden Arches, influence consumer behaviour in powerful and sometimes surprising ways, and drug-associated cues trigger relapse to drug seeking in addicts and animals exposed to addictive drugs, even after abstinence or extinction. Yet, despite their prevalence, it is not known how conditioned reinforcers control human or other animal behaviour. One possibility is that they act through the use of the specific rewards they predict; alternatively, they could control behaviour directly by activating emotions that are independent of any specific reward. In other words, the Golden Arches may drive business because they evoke thoughts of hamburgers and fries, or instead, may be effective because they also evoke feelings of hunger or happiness. Moreover, different brain circuits could support conditioned reinforcement mediated by thoughts of specific outcomes versus more general affective information. Here we have attempted to address these questions in rats. Rats were trained to learn that different cues predicted different rewards using specialized conditioning procedures that controlled whether the cues evoked thoughts of specific outcomes or general affective representations common to different outcomes. Subsequently, these rats were given the opportunity to press levers to obtain short and otherwise unrewarded presentations of these cues. We found that rats were willing to work for cues that evoked either outcome-specific or general affective representations. Furthermore the orbitofrontal cortex, a prefrontal region important for adaptive decision-making, was critical for the former but not for

  19. Can Service Learning Reinforce Social and Cultural Bias? Exploring a Popular Model of Family Involvement for Early Childhood Teacher Candidates

    ERIC Educational Resources Information Center

    Dunn-Kenney, Maylan

    2010-01-01

    Service learning is often used in teacher education as a way to challenge social bias and provide teacher candidates with skills needed to work in partnership with diverse families. Although some literature suggests that service learning could reinforce cultural bias, there is little documentation. In a study of 21 early childhood teacher…

  20. Rewarding imperfect motor performance reduces adaptive changes.

    PubMed

    van der Kooij, K; Overvliet, K E

    2016-06-01

    Could a pat on the back affect motor adaptation? Recent studies indeed suggest that rewards can boost motor adaptation. However, the rewards used were typically reward gradients that carried quite detailed information about performance. We investigated whether simple binary rewards affected how participants learned to correct for a visual rotation of performance feedback in a 3D pointing task. To do so, we asked participants to align their unseen hand with virtual target cubes in alternating blocks with and without spatial performance feedback. Forty participants were assigned to one of two groups: a 'spatial only' group, in which the feedback consisted of showing the (perturbed) endpoint of the hand, or to a 'spatial & reward' group, in which a reward could be received in addition to the spatial feedback. In addition, six participants were tested in a 'reward only' group. Binary reward was given when the participants' hand landed in a virtual 'hit area' that was adapted to individual performance to reward about half the trials. The results show a typical pattern of adaptation in both the 'spatial only' and the 'spatial & reward' groups, whereas the 'reward only' group was unable to adapt. The rewards did not affect the overall pattern of adaptation in the 'spatial & reward' group. However, on a trial-by-trial basis, the rewards reduced adaptive changes to spatial errors. PMID:26758721

  1. Neuromuscular control of the point to point and oscillatory movements of a sagittal arm with the actor-critic reinforcement learning method.

    PubMed

    Golkhou, Vahid; Parnianpour, Mohamad; Lucas, Caro

    2005-04-01

    In this study, we have used a single link system with a pair of muscles that are excited with alpha and gamma signals to achieve both point to point and oscillatory movements with variable amplitu