Science.gov

Sample records for reward reinforcement learning

  1. Reward, motivation, and reinforcement learning.

    PubMed

    Dayan, Peter; Balleine, Bernard W

    2002-10-10

    There is substantial evidence that dopamine is involved in reward learning and appetitive conditioning. However, the major reinforcement learning-based theoretical models of classical conditioning (crudely, prediction learning) are actually based on rules designed to explain instrumental conditioning (action learning). Extensive anatomical, pharmacological, and psychological data, particularly concerning the impact of motivational manipulations, show that these models are unreasonable. We review the data and consider the involvement of a rich collection of different neural systems in various aspects of these forms of conditioning. Dopamine plays a pivotal, but complicated, role.

  2. Online learning of shaping rewards in reinforcement learning.

    PubMed

    Grześ, Marek; Kudenko, Daniel

    2010-05-01

    Potential-based reward shaping has been shown to be a powerful method to improve the convergence rate of reinforcement learning agents. It is a flexible technique for incorporating background knowledge into temporal-difference learning in a principled way. However, the question remains of how to compute the potential function that is used to shape the reward given to the learning agent. In this paper, we show how, in the absence of knowledge to define the potential function manually, this function can be learned online in parallel with the actual reinforcement learning process. Two cases are considered. The first solution, based on multi-grid discretisation, is designed for model-free reinforcement learning. In the second case, an approach for the prototypical model-based R-max algorithm is proposed; it learns the potential function using the free-space assumption about the transitions in the environment. Two novel algorithms are presented and evaluated.
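
    For illustration, here is a minimal sketch (not from the paper) of potential-based shaping inside tabular Q-learning, assuming a generic discrete environment. The potential function phi below is a hand-supplied stand-in for the one the authors learn online, and all names and constants are illustrative:

      from collections import defaultdict

      ACTIONS = [0, 1]

      def shaped_q_update(Q, phi, s, a, r, s_next, alpha=0.1, gamma=0.99):
          """One Q-learning step with potential-based shaping F = gamma*phi(s') - phi(s)."""
          F = gamma * phi(s_next) - phi(s)    # shaping term; leaves the optimal policy unchanged
          target = r + F + gamma * max(Q[(s_next, b)] for b in ACTIONS)
          Q[(s, a)] += alpha * (target - Q[(s, a)])

      Q = defaultdict(float)
      phi = lambda s: -abs(10 - s)            # toy potential: states near 10 are promising
      shaped_q_update(Q, phi, s=3, a=1, r=0.0, s_next=4)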

  3. Balancing Multiple Sources of Reward in Reinforcement Learning

    DTIC Science & Technology

    2006-01-01

    For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar...problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.

  4. Finding intrinsic rewards by embodied evolution and constrained reinforcement learning.

    PubMed

    Uchibe, Eiji; Doya, Kenji

    2008-12-01

    Understanding the design principle of reward functions is a substantial challenge both in artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals, but also for intermediate states to promote effective exploration. This paper proposes a method for designing 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and 'mating' by software reproduction are three major 'extrinsic' rewards. We show in hardware experiments that the robots can find appropriate 'intrinsic' rewards for the vision of battery packs and other robots to promote approach behaviors.

  5. Reward and reinforcement activity in the nucleus accumbens during learning

    PubMed Central

    Gale, John T.; Shields, Donald C.; Ishizawa, Yumiko; Eskandar, Emad N.

    2014-01-01

    The nucleus accumbens core (NAcc) has been implicated in learning associations between sensory cues and profitable motor responses. However, the precise mechanisms that underlie these functions remain unclear. We recorded single-neuron activity from the NAcc of primates trained to perform a visual-motor associative learning task. During learning, we found two distinct classes of NAcc neurons. The first class demonstrated progressive increases in firing rates at the go-cue, feedback/tone and reward epochs of the task, as novel associations were learned. This suggests that these neurons may play a role in the exploitation of rewarding behaviors. In contrast, the second class exhibited attenuated firing rates, but only at the reward epoch of the task. These findings suggest that some NAcc neurons play a role in reward-based reinforcement during learning. PMID:24765069

  6. Optimal Reward Functions in Distributed Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Tumer, Kagan

    2000-01-01

    We consider the design of multi-agent systems so as to optimize an overall world utility function when (1) those systems lack centralized communication and control, and (2) each agent runs a distinct Reinforcement Learning (RL) algorithm. A crucial issue in such design problems is how to initialize/update each agent's private utility function so as to induce the best possible world utility. Traditional 'team game' solutions to this problem sidestep this issue and simply assign to each agent the world utility as its private utility function. In previous work we used the 'Collective Intelligence' framework to derive a better choice of private utility functions, one that results in world utility performance up to orders of magnitude superior to that ensuing from use of the team game utility. In this paper we extend these results. We derive the general class of private utility functions that both are easy for the individual agents to learn and that, if learned well, result in high world utility. We demonstrate experimentally that using these new utility functions can result in significantly improved performance over that of our previously proposed utility, over and above that previous utility's superiority to the conventional team game utility.
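
    As a rough illustration of the idea (the paper's exact utility definitions are not reproduced here), a difference-style private utility credits each agent with its marginal contribution to the world utility by clamping that agent's action to a fixed default. The world utility G and the clamping scheme below are hypothetical:

      def world_utility(actions):
          # Hypothetical world utility G: reward coverage of distinct actions.
          return len(set(actions))

      def difference_utility(actions, i, default=0):
          """Private utility for agent i: G(z) minus G(z with agent i clamped).
          Aligned with G, but far more sensitive to agent i's own action."""
          clamped = list(actions)
          clamped[i] = default
          return world_utility(actions) - world_utility(clamped)

      joint = [2, 2, 1]   # actions of three agents
      print([difference_utility(joint, i) for i in range(3)])   # [-1, -1, 0]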

  7. Inferring reward prediction errors in patients with schizophrenia: a dynamic reward task for reinforcement learning.

    PubMed

    Li, Chia-Tzu; Lai, Wen-Sung; Liu, Chih-Min; Hsu, Yung-Fong

    2014-01-01

    Abnormalities in the dopamine system have long been implicated in explanations of reinforcement learning and psychosis. The updated reward prediction error (RPE)-a discrepancy between the predicted and actual rewards-is thought to be encoded by dopaminergic neurons. Dysregulation of dopamine systems could alter the appraisal of stimuli and eventually lead to schizophrenia. Accordingly, the measurement of RPE provides a potential behavioral index for the evaluation of brain dopamine activity and psychotic symptoms. Here, we assess two features potentially crucial to the RPE process, namely belief formation and belief perseveration, via a probability learning task and reinforcement-learning modeling. Forty-five patients with schizophrenia [26 high-psychosis and 19 low-psychosis, based on their p1 and p3 scores in the positive-symptom subscales of the Positive and Negative Syndrome Scale (PANSS)] and 24 controls were tested in a feedback-based dynamic reward task for their RPE-related decision making. While task scores across the three groups were similar, matching law analysis revealed that the reward sensitivities of both psychosis groups were lower than that of controls. Trial-by-trial data were further fit with a reinforcement learning model using the Bayesian estimation approach. Model fitting results indicated that both psychosis groups tend to update their reward values more rapidly than controls. Moreover, among the three groups, high-psychosis patients had the lowest degree of choice perseveration. Lumping patients' data together, we also found that patients' perseveration appears to be negatively correlated (p = 0.09, trending toward significance) with their PANSS p1 + p3 scores. Our method provides an alternative for investigating reward-related learning and decision making in basic and clinical settings.
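
    A minimal sketch of the kind of trial-by-trial model such studies fit (the parameterization here is illustrative, not the paper's): a delta-rule value update with a learning rate, plus a perseveration weight that biases the softmax toward repeating the previous choice:

      import math, random

      def simulate(trials, alpha=0.3, beta=3.0, persev=0.5, p_reward=(0.7, 0.3)):
          """Delta-rule learner with choice perseveration (illustrative parameters)."""
          Q, last = [0.0, 0.0], None
          for _ in range(trials):
              # Softmax over value plus a bonus for repeating the previous choice.
              u = [beta * Q[a] + (persev if a == last else 0.0) for a in (0, 1)]
              p0 = 1.0 / (1.0 + math.exp(u[1] - u[0]))
              a = 0 if random.random() < p0 else 1
              r = 1.0 if random.random() < p_reward[a] else 0.0
              Q[a] += alpha * (r - Q[a])      # reward prediction error update
              last = a
          return Q

      print(simulate(200))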

  8. An Upside to Reward Sensitivity: The Hippocampus Supports Enhanced Reinforcement Learning in Adolescence.

    PubMed

    Davidow, Juliet Y; Foerde, Karin; Galván, Adriana; Shohamy, Daphna

    2016-10-05

    Adolescents are notorious for engaging in reward-seeking behaviors, a tendency attributed to heightened activity in the brain's reward systems during adolescence. It has been suggested that reward sensitivity in adolescence might be adaptive, but evidence of an adaptive role has been scarce. Using a probabilistic reinforcement learning task combined with reinforcement learning models and fMRI, we found that adolescents showed better reinforcement learning and a stronger link between reinforcement learning and episodic memory for rewarding outcomes. This behavioral benefit was related to heightened prediction error-related BOLD activity in the hippocampus and to stronger functional connectivity between the hippocampus and the striatum at the time of reinforcement. These findings reveal an important role for the hippocampus in reinforcement learning in adolescence and suggest that reward sensitivity in adolescence is related to adaptive differences in how adolescents learn from experience.

  9. Beyond simple reinforcement learning: the computational neurobiology of reward-learning and valuation.

    PubMed

    O'Doherty, John P

    2012-04-01

    Neural computational accounts of reward-learning have been dominated by the hypothesis that dopamine neurons behave like a reward-prediction error and thus facilitate reinforcement learning in striatal target neurons. While this framework is consistent with a lot of behavioral and neural evidence, this theory fails to account for a number of behavioral and neurobiological observations. In this special issue of EJN we feature a combination of theoretical and experimental papers highlighting some of the explanatory challenges faced by simple reinforcement-learning models and describing some of the ways in which the framework is being extended in order to address these challenges.

  10. Framing Reinforcement Learning from Human Reward: Reward Positivity, Temporal Discounting, Episodicity, and Performance

    DTIC Science & Technology

    2014-09-29

    maximize the learning objective, how is task performance affected and what implications do these effects on task performance have...will enhance the effectiveness of teaching by human reward...Our results provide evidence for the incompatibility of non-myopic learning and...Section 1, this article examines the effect of various objectives for learning from human reward on task performance. In particular, we focus on reward...

  11. Homeostatic reinforcement learning for integrating reward collection and physiological stability.

    PubMed

    Keramati, Mehdi; Gutkin, Boris

    2014-12-02

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system.
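
    The core idea can be sketched in a few lines (an illustrative reading, not the paper's full theory): define the reward of an outcome as the reduction in distance between the internal state and its homeostatic setpoint, so that seeking reward and defending homeostasis coincide. All names and numbers are illustrative:

      def drive(h, setpoint=(50.0, 50.0)):
          """Drive = distance of the internal state h from the homeostatic setpoint."""
          return sum((hi - si) ** 2 for hi, si in zip(h, setpoint)) ** 0.5

      def homeostatic_reward(h_before, h_after):
          """Reward of an outcome = how much it reduces the drive (need fulfillment)."""
          return drive(h_before) - drive(h_after)

      # Consuming when an internal variable is low is rewarding; overshooting is punishing.
      print(homeostatic_reward((30.0, 50.0), (45.0, 50.0)))   # 15.0 > 0
      print(homeostatic_reward((50.0, 50.0), (65.0, 50.0)))   # -15.0 < 0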

  12. When, What, and How Much to Reward in Reinforcement Learning-Based Models of Cognition

    ERIC Educational Resources Information Center

    Janssen, Christian P.; Gray, Wayne D.

    2012-01-01

    Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing a reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward; that is, when (the moment: end of trial, subtask, or some other…

  13. Computational models of reinforcement learning: the role of dopamine as a reward signal

    PubMed Central

    Samson, R. D.; Frank, M. J.

    2010-01-01

    Reinforcement learning is ubiquitous. Unlike other forms of learning, it involves the processing of fast yet content-poor feedback information to correct assumptions about the nature of a task or of a set of stimuli. This feedback information is often delivered as generic rewards or punishments, and has little to do with the stimulus features to be learned. How can such low-content feedback lead to such an efficient learning paradigm? Through a review of existing neuro-computational models of reinforcement learning, we suggest that the efficiency of this type of learning resides in the dynamic and synergistic cooperation of brain systems that use different levels of computation. The implementation of reward signals at the synaptic, cellular, network, and system levels gives the organism the necessary robustness, adaptability, and processing speed required for evolutionary and behavioral success. PMID:21629583

  14. Does reward frequency or magnitude drive reinforcement-learning in attention-deficit/hyperactivity disorder?

    PubMed

    Luman, Marjolein; Van Meel, Catharina S; Oosterlaan, Jaap; Sergeant, Joseph A; Geurts, Hilde M

    2009-08-15

    Children with attention-deficit/hyperactivity disorder (ADHD) show an impaired ability to use feedback in the context of learning. A stimulus-response learning task was used to investigate whether (1) children with ADHD displayed flatter learning curves, (2) reinforcement-learning in ADHD was sensitive to either reward frequency, magnitude, or both, and (3) altered sensitivity to reward was specific to ADHD or would co-occur in a group of children with autism spectrum disorder (ASD). Performance of 23 boys with ADHD was compared with that of 30 normal controls (NCs) and 21 boys with ASD, all aged 8-12. Rewards were delivered contingent on performance and varied both in frequency (low, high) and magnitude (small, large). The findings showed that, although learning rates were comparable across groups, both clinical groups committed more errors than NCs. In contrast to the NC boys, boys with ADHD were unaffected by frequency and magnitude of reward. The NC group and, to some extent, the ASD group showed improved performance, when rewards were delivered infrequently versus frequently. Children with ADHD as well as children with ASD displayed difficulties in stimulus-response coupling that were independent of motivational modulations. Possibly, these deficits are related to abnormal reinforcement expectancy.

  15. Toward an autonomous brain machine interface: integrating sensorimotor reward modulation and reinforcement learning.

    PubMed

    Marsh, Brandi T; Tarigoppula, Venkata S Aditya; Chen, Chen; Francis, Joseph T

    2015-05-13

    For decades, neurophysiologists have worked on elucidating the function of the cortical sensorimotor control system from the standpoint of kinematics or dynamics. Recently, computational neuroscientists have developed models that can emulate changes seen in the primary motor cortex during learning. However, these simulations rely on the existence of a reward-like signal in the primary sensorimotor cortex. Reward modulation of the primary sensorimotor cortex has yet to be characterized at the level of neural units. Here we demonstrate that single units/multiunits and local field potentials in the primary motor (M1) cortex of nonhuman primates (Macaca radiata) are modulated by reward expectation during reaching movements and that this modulation is present even while subjects passively view cursor motions that are predictive of either reward or nonreward. After establishing this reward modulation, we set out to determine whether we could correctly classify rewarding versus nonrewarding trials, on a moment-to-moment basis. This reward information could then be used in collaboration with reinforcement learning principles toward an autonomous brain-machine interface. The autonomous brain-machine interface would use M1 for both decoding movement intention and extraction of reward expectation information as evaluative feedback, which would then update the decoding algorithm as necessary. In the work presented here, we show that this, in theory, is possible.

  16. Adaptive Design of Role Differentiation by Division of Reward Function in Multi-Agent Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Taniguchi, Tadahiro; Tabuchi, Kazuma; Sawaragi, Tetsuo

    There are several problems that discourage an organization from achieving tasks in multi-agent reinforcement learning, e.g., partial observability, credit assignment, and concurrent learning. In many conventional approaches, each agent estimates hidden states, e.g., sensor inputs, positions, and policies of other agents, and reduces the uncertainty in the partially observable Markov decision process (POMDP); this only partially solves the multi-agent reinforcement learning problem. In contrast, people in real-world organizations reduce uncertainty by autonomously dividing the roles played by individual agents. In a reinforcement learning framework, roles are mainly represented by goals for individual agents. This paper presents a method for generating internal rewards from manager agents to worker agents. It also explicitly divides the roles, which enables the POMDP task for each agent to be transformed into a simple MDP task under certain conditions. Several situational experiments are described and the validity of the proposed method is evaluated.

  17. The left hemisphere learns what is right: Hemispatial reward learning depends on reinforcement learning processes in the contralateral hemisphere.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly Crystal; Schwartz, Sophie

    2016-08-01

    Orienting biases refer to consistent, trait-like direction of attention or locomotion toward one side of space. Recent studies suggest that such hemispatial biases may determine how well people memorize information presented in the left or right hemifield. Moreover, lesion studies indicate that learning rewarded stimuli in one hemispace depends on the integrity of the contralateral striatum. However, the exact neural and computational mechanisms underlying the influence of individual orienting biases on reward learning remain unclear. Because reward-based behavioural adaptation depends on the dopaminergic system and prediction error (PE) encoding in the ventral striatum, we hypothesized that hemispheric asymmetries in dopamine (DA) function may determine individual spatial biases in reward learning. To test this prediction, we acquired fMRI in 33 healthy human participants while they performed a lateralized reward task. Learning differences between hemispaces were assessed by presenting stimuli, assigned to different reward probabilities, to the left or right of central fixation, i.e. presented in the left or right visual hemifield. Hemispheric differences in DA function were estimated through differential fMRI responses to positive vs. negative feedback in the left vs. right ventral striatum, and a computational approach was used to identify the neural correlates of PEs. Our results show that spatial biases favoring reward learning in the right (vs. left) hemifield were associated with increased reward responses in the left hemisphere and relatively better neural encoding of PEs for stimuli presented in the right (vs. left) hemifield. These findings demonstrate that trait-like spatial biases implicate hemisphere-specific learning mechanisms, with individual differences between hemispheres contributing to reinforcing spatial biases.

  1. Reward-weighted regression with sample reuse for direct policy search in reinforcement learning.

    PubMed

    Hachiya, Hirotaka; Peters, Jan; Sugiyama, Masashi

    2011-11-01

    Direct policy search is a promising reinforcement learning framework, in particular for controlling continuous, high-dimensional systems. Policy search often requires a large number of samples for obtaining a stable policy update estimator, and this is prohibitive when the sampling cost is expensive. In this letter, we extend an expectation-maximization-based policy search method so that previously collected samples can be efficiently reused. The usefulness of the proposed method, reward-weighted regression with sample reuse (R3), is demonstrated through robot learning experiments. (This letter is an extended version of our earlier conference paper: Hachiya, Peters, & Sugiyama, 2009.)
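
    A toy sketch of the reward-weighted regression idea, without the sample-reuse (importance-weighting) machinery the letter adds: fit a linear policy by weighted least squares, weighting each explored action by a nonnegative transform of its reward. The exponential transform and one-dimensional linear-Gaussian policy are assumptions of this sketch:

      import numpy as np

      def rwr_update(states, actions, rewards):
          """One reward-weighted regression step for a linear policy a = theta * s."""
          w = np.exp(rewards - rewards.max())   # soft, nonnegative reward weights
          num = np.sum(w * states * actions)
          den = np.sum(w * states * states)
          return num / den                      # weighted least-squares estimate of theta

      rng = np.random.default_rng(0)
      s = rng.normal(size=500)
      a = 2.0 * s + rng.normal(scale=0.5, size=500)  # exploratory actions around a = 2s
      r = -(a - 2.0 * s) ** 2                        # reward peaks at the target policy
      print(rwr_update(s, a, r))                     # close to 2.0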

  2. Principal components analysis of reward prediction errors in a reinforcement learning task.

    PubMed

    Sambrook, Thomas D; Goslin, Jeremy

    2016-01-01

    Models of reinforcement learning represent reward and punishment in terms of reward prediction errors (RPEs), quantitative signed terms describing the degree to which outcomes are better than expected (positive RPEs) or worse (negative RPEs). An electrophysiological component known as the feedback-related negativity (FRN) occurs at frontocentral sites 240-340 ms after feedback on whether a reward or punishment is obtained, and has been claimed to neurally encode an RPE. An outstanding question, however, is whether the FRN is sensitive to the size of both positive RPEs and negative RPEs. Previous attempts to answer this question have examined the simple effects of RPE size for positive RPEs and negative RPEs separately. However, this methodology can be compromised by overlap from components coding for unsigned prediction error size, or "salience", which are sensitive to the absolute size of a prediction error but not its valence. In our study, positive and negative RPEs were parametrically modulated using both reward likelihood and magnitude, with principal components analysis used to separate out overlying components. This revealed a single RPE-encoding component responsive to the size of positive RPEs, peaking at ~330 ms and occupying the delta frequency band. Other components responsive to unsigned prediction error size were shown, but no component sensitive to negative RPE size was found.
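
    In such designs the signed and unsigned ("salience") prediction errors per trial are simple functions of reward likelihood and magnitude; a minimal sketch under those definitions (function names are illustrative):

      def prediction_errors(p_reward, magnitude, rewarded):
          """Signed RPE = outcome - expectation; salience = its absolute size."""
          expected = p_reward * magnitude
          outcome = magnitude if rewarded else 0.0
          rpe = outcome - expected          # positive: better than expected
          return rpe, abs(rpe)              # (signed RPE, unsigned salience)

      print(prediction_errors(0.25, 4.0, rewarded=True))    # (3.0, 3.0): large positive RPE
      print(prediction_errors(0.75, 4.0, rewarded=False))   # (-3.0, 3.0): same salience, opposite sign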

  3. Extending Hierarchical Reinforcement Learning to Continuous-Time, Average-Reward, and Multi-Agent Models

    DTIC Science & Technology

    2003-07-09

    Hierarchical reinforcement learning (HRL) is a general framework that studies how to exploit the structure of actions and tasks to accelerate policy...framework could suffice, we focus in this paper on the MAXQ framework. We describe three new hierarchical reinforcement learning algorithms: continuous-time...reinforcement learning to speed up the acquisition of cooperative multiagent tasks. We extend the MAXQ framework to the multiagent case, which we term...

  4. Subjective and model-estimated reward prediction: association with the feedback-related negativity (FRN) and reward prediction error in a reinforcement learning task.

    PubMed

    Ichikawa, Naho; Siegle, Greg J; Dombrovski, Alexandre; Ohira, Hideki

    2010-12-01

    In this study, we examined whether the feedback-related negativity (FRN) is associated with both subjective and objective (model-estimated) reward prediction errors (RPEs) per trial in a reinforcement learning task in healthy adults (n=25). The level of RPE was assessed (1) by subjective ratings per trial and (2) by a computational model of reinforcement learning. Model-estimated RPE was highly correlated with subjective RPE (r=.82), and the grand-averaged ERP waves based on trials with high versus low model-estimated RPE differed significantly only in the time window of the FRN component (p<.05). Regardless of the time course of learning, the FRN was associated with both subjective and model-estimated RPEs within subjects (r=.47, p<.001; r=.40, p<.05) and between subjects (r=.33, p<.05; r=.41, p<.005), but only in the Learnable condition, where the internal reward prediction varied enough with the behavior-reward contingency.

  5. [A model of reward choice based on the theory of reinforcement learning].

    PubMed

    Smirnitskaia, I A; Frolov, A A; Merzhanova, G Kh

    2007-01-01

    We developed a model of the alimentary instrumental conditioned bar-pressing reflex in cats making a choice between an immediate small reinforcement ("impulsive behavior") and a delayed, more valuable reinforcement ("self-control behavior"). Our model is based on reinforcement learning theory. We emulated the dopamine contribution by the discount coefficient of this theory (a subjective decrease in the value of a delayed reinforcement). The results of computer simulation showed that "cats" with a large discount coefficient demonstrated "self-control behavior", whereas a small discount coefficient was associated with "impulsive behavior". These data are in agreement with experimental data indicating that impulsive behavior is due to a decreased amount of dopamine in the striatum.

  6. Dopaminergic control of motivation and reinforcement learning: a closed-circuit account for reward-oriented behavior.

    PubMed

    Morita, Kenji; Morishima, Mieko; Sakai, Katsuyuki; Kawaguchi, Yasuo

    2013-05-15

    Humans and animals take actions quickly when they expect that the actions lead to reward, reflecting their motivation. Injection of dopamine receptor antagonists into the striatum has been shown to slow such reward-seeking behavior, suggesting that dopamine is involved in the control of motivational processes. Meanwhile, neurophysiological studies have revealed that phasic response of dopamine neurons appears to represent reward prediction error, indicating that dopamine plays central roles in reinforcement learning. However, previous attempts to elucidate the mechanisms of these dopaminergic controls have not fully explained how the motivational and learning aspects are related and whether they can be understood by the way the activity of dopamine neurons itself is controlled by their upstream circuitries. To address this issue, we constructed a closed-circuit model of the corticobasal ganglia system based on recent findings regarding intracortical and corticostriatal circuit architectures. Simulations show that the model could reproduce the observed distinct motivational effects of D1- and D2-type dopamine receptor antagonists. Simultaneously, our model successfully explains the dopaminergic representation of reward prediction error as observed in behaving animals during learning tasks and could also explain distinct choice biases induced by optogenetic stimulation of the D1 and D2 receptor-expressing striatal neurons. These results indicate that the suggested roles of dopamine in motivational control and reinforcement learning can be understood in a unified manner through a notion that the indirect pathway of the basal ganglia represents the value of states/actions at a previous time point, an empirically driven key assumption of our model.

  7. A model of reward choice based on the theory of reinforcement learning.

    PubMed

    Smirnitskaya, I A; Frolov, A A; Merzhanova, G Kh

    2008-03-01

    A model explaining behavioral "impulsivity" and "self-control" is proposed on the basis of the theory of reinforcement learning. The discount coefficient gamma, which in this theory accounts for the subjective reduction in the value of a delayed reinforcement, is identified with the overall level of dopaminergic neuron activity which, according to published data, also determines the behavioral variant. Computer modeling showed that high values of gamma are characteristic of predominantly "self-controlled" subjects, while smaller values of gamma are characteristic of "impulsive" subjects.
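
    The choice mechanism reduces to comparing an immediate reward with an exponentially discounted delayed one; a worked sketch with illustrative numbers, showing how a lower gamma flips preference toward the immediate small reward:

      def prefers_delayed(r_small, r_large, delay, gamma):
          """Compare an immediate small reward with a discounted delayed large one."""
          return gamma ** delay * r_large > r_small

      # "Self-controlled" agent (high gamma) waits; "impulsive" agent (low gamma) does not.
      print(prefers_delayed(1.0, 3.0, delay=5, gamma=0.9))   # True:  0.9**5 * 3 = 1.77 > 1
      print(prefers_delayed(1.0, 3.0, delay=5, gamma=0.6))   # False: 0.6**5 * 3 = 0.23 < 1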

  8. The Rewards of Learning.

    ERIC Educational Resources Information Center

    Chance, Paul

    1992-01-01

    Although intrinsic rewards are important, they (along with punishment and encouragement) are insufficient for efficient learning. Teachers must supplement intrinsic rewards with extrinsic rewards, such as praising, complimenting, applauding, and providing other forms of recognition for good work. Teachers should use the weakest reward required to…

  9. States versus Rewards: Dissociable neural prediction error signals underlying model-based and model-free reinforcement learning

    PubMed Central

    Gläscher, Jan; Daw, Nathaniel; Dayan, Peter; O’Doherty, John P.

    2010-01-01

    Reinforcement learning (RL) uses sequential experience with situations (“states”) and outcomes to assess actions. Whereas model-free RL uses this experience directly, in the form of a reward prediction error (RPE), model-based RL uses it indirectly, building a model of the state transition and outcome structure of the environment, and evaluating actions by searching this model. A state prediction error (SPE) plays a central role, reporting discrepancies between the current model and the observed state transitions. Using functional magnetic resonance imaging in humans solving a probabilistic Markov decision task we found the neural signature of an SPE in the intraparietal sulcus and lateral prefrontal cortex, in addition to the previously well-characterized RPE in the ventral striatum. This finding supports the existence of two unique forms of learning signal in humans, which may form the basis of distinct computational strategies for guiding behavior. PMID:20510862
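
    The two learning signals can be contrasted in a few lines (a schematic sketch, not the paper's fitted model): model-free learning updates action values from an RPE, while model-based learning updates a transition model from an SPE. Sizes and rates are illustrative:

      import numpy as np

      N_STATES, N_ACTIONS = 3, 2
      Q = np.zeros((N_STATES, N_ACTIONS))    # model-free action values
      T = np.full((N_STATES, N_ACTIONS, N_STATES), 1.0 / N_STATES)  # transition model

      def model_free_step(s, a, r, s_next, alpha=0.1, gamma=0.9):
          rpe = r + gamma * Q[s_next].max() - Q[s, a]   # reward prediction error
          Q[s, a] += alpha * rpe

      def model_based_step(s, a, s_next, eta=0.1):
          spe = 1.0 - T[s, a, s_next]        # state prediction error (surprise)
          T[s, a] *= (1.0 - eta)             # decay all successor probabilities...
          T[s, a, s_next] += eta             # ...and bump the observed transition
          return spe

      model_free_step(0, 1, r=1.0, s_next=2)
      print(model_based_step(0, 1, s_next=2))   # ~0.67 for an unexpected transition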

  10. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.

    PubMed

    Glimcher, Paul W

    2011-09-13

    A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn.

  11. [Reinforcement learning by the striatum].

    PubMed

    Kunisato, Yoshihiko; Okada, Go; Okamoto, Yasumasa

    2009-04-01

    Recently, computational models of reinforcement learning have been applied to the analysis of neuroimaging data, and it has been clarified that the striatum plays a key role in decision making. We review reinforcement learning theory and the biological structures and signals, such as neuromodulators, associated with reinforcement learning. We also investigated the function of the striatum and the neurotransmitter serotonin in reward prediction. We first studied the brain mechanisms for reward prediction at different time scales. Our experiment on the striatum showed that the ventroanterior regions are involved in predicting immediate rewards and the dorsoposterior regions are involved in predicting future rewards. Further, we investigated whether serotonin regulates both reward selection and the striatal specialization for reward prediction at different time scales. To this end, we regulated the dietary intake of tryptophan, a precursor of serotonin. Our experiment showed that the activity of the ventral part of the striatum was correlated with reward prediction at shorter time scales, and this activity was stronger at low serotonin levels. By contrast, the activity of the dorsal part of the striatum was correlated with reward prediction at longer time scales, and this activity was stronger at high serotonin levels. Further, a higher proportion of small-reward choices, together with a higher rate of discounting of delayed rewards, was observed in the low-serotonin condition than in the control and high-serotonin conditions. Further examination is required to assess the relation between the disturbance of reward prediction caused by low serotonin and serotonin-related mental disorders such as depression.

  12. Single Dose of a Dopamine Agonist Impairs Reinforcement Learning in Humans: Behavioral Evidence from a Laboratory-based Measure of Reward Responsiveness

    PubMed Central

    Pizzagalli, Diego A.; Evins, A. Eden; Schetter, Erika Cowman; Frank, Michael J.; Pajtas, Petra E.; Santesso, Diane L.; Culhane, Melissa

    2007-01-01

    Rationale The dopaminergic system, particularly D2-like dopamine receptors, has been strongly implicated in reward processing. Animal studies have emphasized the role of phasic dopamine (DA) signaling in reward-related learning, but these processes remain largely unexplored in humans. Objectives To evaluate the effect of a single, low dose of a D2/D3 agonist—pramipexole—on reinforcement learning in healthy adults. Based on prior evidence indicating that low doses of DA agonists decrease phasic DA release through autoreceptor stimulation, we hypothesized that 0.5 mg of pramipexole would impair reward learning due to presynaptic mechanisms. Methods Using a double-blind design, a single 0.5 mg dose of pramipexole or placebo was administered to 32 healthy volunteers, who performed a probabilistic reward task involving a differential reinforcement schedule as well as various control tasks. Results As hypothesized, response bias toward the more frequently rewarded stimulus was impaired in the pramipexole group, even after adjusting for transient adverse effects. In addition, the pramipexole group showed reaction time and motor speed slowing and increased negative affect; however, when adverse physical side effects were considered, group differences in motor speed and negative affect disappeared. Conclusions These findings show that a single low dose of pramipexole impaired the acquisition of reward-related behavior in healthy participants, and they are consistent with prior evidence suggesting that phasic DA signaling is required to reinforce actions leading to reward. The potential implications of the present findings to psychiatric conditions, including depression and impulse control disorders related to addiction, are discussed. PMID:17909750

  13. Reward and learning in the goldfish.

    PubMed

    Lowes, G; Bitterman, M E

    1967-07-28

    An experiment with goldfish showed the effects of change in amount of reward that are predicted from reinforcement theory. The performance of animals shifted from small to large reward improved gradually to the level of unshifted large-reward controls, while the performance of animals shifted from large to small reward remained at the large-reward level. The difference between these results and those obtained in analogous experiments with the rat suggests that reward functions differently in the instrumental learning of the two animals.

  14. Heads for learning, tails for memory: reward, reinforcement and a role of dopamine in determining behavioral relevance across multiple timescales

    PubMed Central

    Baudonnat, Mathieu; Huber, Anna; David, Vincent; Walton, Mark E.

    2013-01-01

    Dopamine has long been tightly associated with aspects of reinforcement learning and motivation in simple situations where there are a limited number of stimuli to guide behavior and constrained range of outcomes. In naturalistic situations, however, there are many potential cues and foraging strategies that could be adopted, and it is critical that animals determine what might be behaviorally relevant in such complex environments. This requires not only detecting discrepancies with what they have recently experienced, but also identifying similarities with past experiences stored in memory. Here, we review what role dopamine might play in determining how and when to learn about the world, and how to develop choice policies appropriate to the situation faced. We discuss evidence that dopamine is shaped by motivation and memory and in turn shapes reward-based memory formation. In particular, we suggest that hippocampal-striatal-dopamine networks may interact to determine how surprising the world is and to either inhibit or promote actions at time of behavioral uncertainty. PMID:24130514

  15. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning.

    PubMed

    Balasubramani, Pragathi P; Chakravarthy, V Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models have mostly focused on its role in behavioral inhibition. In this study, we present a model of risk-based decision making in a modified Reinforcement Learning (RL) framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in the Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation, which accommodates both 5HT and DA, reconciles some of the diverse roles of 5HT, particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) risk-sensitive decision making, where 5HT controls risk assessment; (2) temporal reward prediction, where 5HT controls the time scale of reward prediction; and (3) reward/punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG.

  16. Hypocretin/orexin involvement in reward and reinforcement

    PubMed Central

    España, Rodrigo A.

    2015-01-01

    Since the discovery of the hypocretins/orexins, a series of observations have indicated that these peptides influence a variety of physiological processes including feeding, sleep/wake function, memory, and stress. More recently, the hypocretins have been implicated in reinforcement and reward-related processes via actions on the mesolimbic dopamine system. Although investigation into the relationship between the hypocretins and reinforcement/reward remains in relatively early stages, accumulating evidence suggests that continued research into this area may offer new insights into the addiction process and provide the foundation to generate novel pharmacotherapies for drug abuse. The current chapter will focus on contemporary perspectives of hypocretin regulation of cocaine reward and reinforcement via actions on the mesolimbic dopamine system. PMID:22640614

  17. Reinforcement learning with Marr.

    PubMed

    Niv, Yael; Langdon, Angela

    2016-10-01

    To many, the poster child for David Marr's famous three levels of scientific inquiry is reinforcement learning: a computational theory of reward optimization, which readily prescribes algorithmic solutions that evidence striking resemblance to signals found in the brain, suggesting a straightforward neural implementation. Here we review questions that remain open at each level of analysis, concluding that the path forward to their resolution calls for inspiration across levels, rather than a focus on mutual constraints.

  18. Beta-endorphin and drug-induced reward and reinforcement.

    PubMed

    Roth-Deri, Ilana; Green-Sadan, Tamar; Yadid, Gal

    2008-09-01

    Although drugs of abuse have different acute mechanisms of action, their brain reward pathways exhibit common functional effects upon both acute and chronic administration. Long known for its analgesic effect, the opioid beta-endorphin has now been shown to induce euphoria and to have rewarding and reinforcing properties. In this review, we summarize the present neurobiological and behavioral evidence that supports the involvement of beta-endorphin in drug-induced reward and reinforcement. Currently, evidence supports a prominent role for beta-endorphin in the reward pathways of cocaine and alcohol. The existing information indicating the importance of beta-endorphin neurotransmission in mediating the reward pathways of nicotine and THC is thus far circumstantial. The studies described herein employed diverse techniques, such as biochemical measurements of beta-endorphin in various brain sites and plasma, and behavioral measurements conducted following elimination (via administration of anti-beta-endorphin antibodies or use of mutant mice) or augmentation (by intracerebral administration) of beta-endorphin. We suggest that the reward pathways for different addictive drugs converge to a common pathway in which beta-endorphin is a modulating element. Beta-endorphin is also involved in distress; however, the data collected so far imply a discrete role, beyond that of a stress response, for beta-endorphin in mediating the substance-of-abuse reward pathway. This may occur via interaction with the mesolimbic dopaminergic system and also through its effects on learning and memory. The functional meaning of beta-endorphin in the process of drug-seeking behavior is discussed.

  1. Placebo Intervention Enhances Reward Learning in Healthy Individuals

    PubMed Central

    Turi, Zsolt; Mittner, Matthias; Paulus, Walter; Antal, Andrea

    2017-01-01

    According to the placebo-reward hypothesis, placebo is a reward-anticipation process that increases midbrain dopamine (DA) levels. Reward-based learning processes, such as reinforcement learning, involve a large part of the DA-ergic network that is also activated by the placebo intervention. Given the neurochemical overlap between placebo and reward learning, we investigated whether verbal instructions in conjunction with a placebo intervention are capable of enhancing reward learning in healthy individuals, using a monetary reward-based reinforcement-learning task. The placebo intervention was performed with non-invasive brain stimulation techniques. In a randomized, triple-blind, cross-over study we investigated this cognitive placebo effect in healthy individuals by manipulating the participants' perceived uncertainty about the intervention's efficacy. Volunteers in the purportedly low- and high-uncertainty conditions earned more money, responded more quickly, and had a higher learning rate from monetary rewards relative to baseline. Participants in the purportedly high-uncertainty condition showed enhanced reward learning, and a model-free computational analysis revealed a higher learning rate from monetary rewards compared to the purportedly low-uncertainty and baseline conditions. Our results indicate that the placebo response is able to enhance reward learning in healthy individuals, opening up exciting avenues for future research on placebo effects on other cognitive functions. PMID:28112207

  2. Reward learning in normal and mutant Drosophila

    PubMed Central

    Tempel, Bruce L.; Bonini, Nancy; Dawson, Douglas R.; Quinn, William G.

    1983-01-01

    Hungry fruit flies can be trained by exposing them to two chemical odorants, one paired with the opportunity to feed on 1 M sucrose. On later testing, when given a choice between odorants the flies migrate specifically toward the sucrose-paired odor. This appetitively reinforced learning by the flies is similar in strength and character to previously demonstrated negatively reinforced learning, but it differs in several properties. Both memory consolidation and memory decay proceed relatively slowly after training with sucrose reward. Consolidation of learned information into anesthesia-resistant long-term memory requires about 100 min after training with sucrose compared to about 30 min after training with electric shock. Memory in wild-type flies persists for 24 hr after training with sucrose compared to 4-6 hr after training with electric shock. Memory in amnesiac mutants appears to be similarly lengthened, from 1 hr to 6 hr, by substituting sucrose reward for shock punishment. Two other mutants, dunce and rutabaga, which were isolated because they failed to learn the shock-avoidance task, learn normally in response to sucrose reward but forget rapidly afterward. One mutant, turnip, does not learn in either paradigm. Reward and punishment can be combined in olfactory discrimination training by pairing one odor to sucrose and the other to electric shock. In this situation, the expression of learning is approximately the sum of that obtained by using either reinforcement alone. After such training, memory decays at two distinct rates, each characteristic of one type of reinforcement. PMID:6572401

  3. Prosocial Reward Learning in Children and Adolescents

    PubMed Central

    Kwak, Youngbin; Huettel, Scott A.

    2016-01-01

    Adolescence is a period of increased sensitivity to social contexts. To evaluate how social context sensitivity changes over development—and influences reward learning—we investigated how children and adolescents perceive and integrate rewards for oneself and others during a dynamic risky decision-making task. Children and adolescents (N = 75, 8–16 years) performed the Social Gambling Task (SGT, Kwak et al., 2014) and completed a set of questionnaires measuring other-regarding behavior. In the SGT, participants choose amongst four card decks that have different payout structures for oneself and for a charity. We examined patterns of choices, overall decision strategies, and how reward outcomes led to trial-by-trial adjustments in behavior, as estimated using a reinforcement-learning model. Performance of children and adolescents was compared to data from a previously collected sample of adults (N = 102) performing the identical task. We found that children/adolescents were not only more sensitive to rewards directed to the charity than to themselves but also showed greater prosocial tendencies on independent measures of other-regarding behavior. Children and adolescents also showed less use of a strategy that prioritizes rewards for self at the expense of rewards for others. These results support the conclusion that, compared to adults, children and adolescents show greater sensitivity to outcomes for others when making decisions and learning about potential rewards. PMID:27761125

  4. Quantum reinforcement learning.

    PubMed

    Dong, Daoyi; Chen, Chunlin; Li, Hanxiong; Tarn, Tzyh-Jong

    2008-10-01

    The key approaches for machine learning, particularly learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of a value-updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of an eigen action is determined by its probability amplitude, which is updated in parallel according to rewards. Some related characteristics of QRL, such as convergence, optimality, and the balance between exploration and exploitation, are also analyzed, showing that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given, and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. This paper is also an effective exploration of the application of quantum computation to artificial intelligence.
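
    A classical toy simulation of the amplitude-update scheme the abstract describes (no genuine quantum hardware involved): action probabilities come from squared amplitudes, "measurement" collapses to one eigen action, and the rewarded amplitude is amplified and renormalized. All constants are illustrative:

      import math, random

      def measure(amps):
          """Sample an eigen action with probability |amplitude|^2 (collapse postulate)."""
          return random.choices(range(len(amps)), weights=[a * a for a in amps])[0]

      def reinforce_amplitude(amps, action, reward, k=0.1):
          """Grow the chosen action's amplitude with reward, then renormalize."""
          amps[action] += k * reward
          norm = math.sqrt(sum(a * a for a in amps))
          return [a / norm for a in amps]

      amps = [0.5, 0.5, 0.5, 0.5]            # uniform superposition over four actions
      for _ in range(100):
          act = measure(amps)
          r = 1.0 if act == 2 else 0.0       # toy task: action 2 is rewarded
          amps = reinforce_amplitude(amps, act, r)
      print(amps)                            # the amplitude of action 2 now dominates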

  5. General functioning predicts reward and punishment learning in schizophrenia.

    PubMed

    Somlai, Zsuzsanna; Moustafa, Ahmed A; Kéri, Szabolcs; Myers, Catherine E; Gluck, Mark A

    2011-04-01

    Previous studies investigating feedback-driven reinforcement learning in patients with schizophrenia have provided mixed results. In this study, we explored the clinical predictors of reward and punishment learning using a probabilistic classification learning task. Patients with schizophrenia (n=40) performed similarly to healthy controls (n=30) on the classification learning task. However, more severe negative and general symptoms were associated with lower reward-learning performance, whereas poorer general psychosocial functioning was correlated with both lower reward- and punishment-learning performance. Multiple linear regression analyses indicated that general psychosocial functioning was the only significant predictor of reinforcement learning performance when education, antipsychotic dose, and positive, negative, and general symptoms were included in the analysis. These results suggest a close relationship between reinforcement learning and general psychosocial functioning in schizophrenia.

  6. Variable Resolution Reinforcement Learning.

    DTIC Science & Technology

    1995-04-01

    Can reinforcement learning ever become a practical method for real control problems? This paper begins by reviewing three reinforcement learning algorithms...reinforcement learning. In addition to exploring state space and developing a control policy to achieve a task, partigame also learns a kd-tree partitioning of...

  7. Partial Planning Reinforcement Learning

    DTIC Science & Technology

    2012-08-31

    This project explored several problems in the areas of reinforcement learning, probabilistic planning, and transfer learning. In particular, it...studied Bayesian Optimization for model-based and model-free reinforcement learning, transfer in the context of model-free reinforcement learning based on...

  8. Global reinforcement learning in neural networks.

    PubMed

    Ma, Xiaolong; Likharev, Konstantin K

    2007-03-01

    In this letter, we have found a more general formulation of the REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility (REINFORCE) learning principle first suggested by Williams. The new formulation has enabled us to apply the principle to global reinforcement learning in networks with various sources of randomness, and to suggest several simple local rules for such networks. Numerical simulations have shown that for simple classification and reinforcement learning tasks, at least one family of the new learning rules gives results comparable to those provided by the famous Rules A(r-i) and A(r-p) for the Boltzmann machines.
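
    Williams's original rule, which the letter generalizes, has a compact form: each weight changes by a learning rate times (reward minus baseline) times the characteristic eligibility d ln(pi)/dw. A sketch for a single Bernoulli-logistic unit (the task and constants are illustrative, not the letter's networks):

      import math, random

      def reinforce_step(w, x, baseline=0.0, lr=0.5):
          """One REINFORCE update for a stochastic unit y ~ Bernoulli(sigmoid(w*x))."""
          p = 1.0 / (1.0 + math.exp(-w * x))
          y = 1 if random.random() < p else 0
          r = 1.0 if y == 1 else 0.0          # toy task: output 1 is rewarded
          eligibility = (y - p) * x           # d ln Pr(y) / dw for the logistic unit
          return w + lr * (r - baseline) * eligibility

      w = 0.0
      for _ in range(200):
          w = reinforce_step(w, x=1.0)
      print(w)                                # drifts positive: the unit learns to fire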

  9. Tonic Dopamine Modulates Exploitation of Reward Learning

    PubMed Central

    Beeler, Jeff A.; Daw, Nathaniel; Frazier, Cristianne R. M.; Zhuang, Xiaoxi

    2010-01-01

    The impact of dopamine on adaptive behavior in a naturalistic environment is largely unexamined. Experimental work suggests that phasic dopamine is central to reinforcement learning, whereas tonic dopamine may modulate performance without altering learning per se; however, this idea has not been developed formally or integrated with computational models of dopamine function. We quantitatively evaluate the role of tonic dopamine in these functions by studying the behavior of hyperdopaminergic DAT knockdown mice in an instrumental task in a semi-naturalistic homecage environment. In this “closed economy” paradigm, subjects earn all of their food by pressing either of two levers, but the relative cost for food on each lever shifts frequently. Compared to wild-type mice, hyperdopaminergic mice allocate more lever presses on high-cost levers, thus working harder to earn a given amount of food and maintain their body weight. However, both groups show a similarly quick reaction to shifts in lever cost, suggesting that the hyperdopaminergic mice are not slower at detecting changes, as would be expected with a learning deficit. We fit the lever choice data using reinforcement learning models to assess the distinction between acquisition and expression that the models formalize. In these analyses, hyperdopaminergic mice displayed normal learning from recent reward history but diminished capacity to exploit this learning: a reduced coupling between choice and reward history. These data suggest that dopamine modulates the degree to which prior learning biases action selection and consequently alters the expression of learned, motivated behavior. PMID:21120145
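
    The acquisition/expression distinction described here can be sketched with a standard Q-learning-plus-softmax model, where the inverse temperature beta controls the coupling between choice and learned reward history; this is a hedged illustration, with all names and parameter values chosen for clarity rather than taken from the paper:

        import numpy as np

        def softmax_choice(q, beta):
            # beta couples choice to learned values; lowering it weakens the
            # exploitation of reward history without touching the learning rule.
            p = np.exp(beta * q - np.max(beta * q))
            return int(np.random.choice(len(q), p=p / p.sum()))

        def run_session(beta, alpha=0.2, n_trials=500, payoffs=(0.8, 0.2)):
            q = np.zeros(2)
            for _ in range(n_trials):
                a = softmax_choice(q, beta)
                r = float(np.random.rand() < payoffs[a])
                q[a] += alpha * (r - q[a])      # intact learning from recent rewards
            return q

        # "Wild-type" vs "hyperdopaminergic" in this sketch: same alpha (learning),
        # different beta (expression), e.g. run_session(beta=5.0) vs run_session(beta=1.0).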

  10. Reinforcement learning and Tourette syndrome.

    PubMed

    Palminteri, Stefano; Pessiglione, Mathias

    2013-01-01

    In this chapter, we report the first experimental explorations of reinforcement learning in Tourette syndrome, carried out by our team in the last few years. This report is preceded by an introduction aimed at providing the reader with the state of the art of knowledge concerning the neural bases of reinforcement learning at the time of these studies and the scientific rationale behind them. In short, reinforcement learning is learning by trial and error to maximize rewards and minimize punishments. This decision-making and learning process implicates the dopaminergic system projecting to the frontal cortex-basal ganglia circuits. A large body of evidence suggests that dysfunction of the same neural systems is implicated in the pathophysiology of Tourette syndrome. Our results show that the Tourette condition, as well as the most common pharmacological treatments (dopamine antagonists), affects reinforcement learning performance in these patients. Specifically, the results suggest a deficit in negative reinforcement learning, possibly underpinned by a functional hyperdopaminergia, which could explain the persistence of tics despite their evidently maladaptive (negative) value. This idea, together with the implications of these results for Tourette therapy and future perspectives, is discussed in Section 4 of this chapter.

  11. Reinforcement Learning: A Tutorial.

    DTIC Science & Technology

    1997-01-01

    The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in... provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section (2) the parts of a reinforcement learning problem... reinforcement learning algorithms. These include TD(lambda) and both the residual and direct forms of value iteration, Q-learning, and advantage learning

  12. [Multiple Dopamine Signals and Their Contributions to Reinforcement Learning].

    PubMed

    Matsumoto, Masayuki

    2016-10-01

    Midbrain dopamine neurons are activated by reward and by sensory cues that predict reward. Their responses resemble the reward prediction error, the discrepancy between obtained and expected reward values, which has been thought to play an important role as a teaching signal in reinforcement learning. Indeed, pharmacological blockade of dopamine transmission interferes with reinforcement learning. Recent studies reported, however, that not all dopamine neurons transmit the reward-related signal. They found that a subset of dopamine neurons transmits signals related to non-rewarding, salient experiences such as aversive stimulation and cognitively demanding events. How these signals contribute to animal behavior is not yet well understood. This article reviews recent findings on dopamine signals related to rewarding and non-rewarding experiences, and discusses their contributions to reinforcement learning.
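
    For reference, the reward prediction error at the heart of this account is the temporal-difference error delta = r + gamma * V(s') - V(s); a minimal sketch of the corresponding update (the value table v and all parameters are illustrative):

        def td_update(v, s, s_next, reward, alpha=0.1, gamma=0.95):
            # Reward prediction error: discrepancy between obtained and expected value.
            delta = reward + gamma * v[s_next] - v[s]
            v[s] += alpha * delta               # dopamine-like teaching signal
            return delta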

  13. A model of food reward learning with dynamic reward exposure

    PubMed Central

    Hammond, Ross A.; Ornstein, Joseph T.; Fellows, Lesley K.; Dubé, Laurette; Levitan, Robert; Dagher, Alain

    2012-01-01

    The process of conditioning via reward learning is highly relevant to the study of food choice and obesity. Learning is itself shaped by environmental exposure, with the potential for such exposures to vary substantially across individuals and across place and time. In this paper, we use computational techniques to extend a well-validated standard model of reward learning, introducing both substantial heterogeneity and dynamic reward exposures. We then apply the extended model to a food choice context. The model produces a variety of individual behaviors and population-level patterns which are not evident from the traditional formulation, but which offer potential insights for understanding food reward learning and obesity. These include a “lock-in” effect, through which early exposure can strongly shape later reward valuation. We discuss potential implications of our results for the study and prevention of obesity, for the reward learning field, and for future experimental and computational work. PMID:23087640
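
    A hypothetical sketch of how dynamic exposure can produce a "lock-in" effect of the kind described: here exposure probability is assumed proportional to past consumption while values follow a standard Rescorla-Wagner update, so an early head start in sampling one food compounds over time. This illustrates the idea only and is not the authors' model:

        import numpy as np

        def simulate_lock_in(n_steps=2000, alpha=0.05):
            v = np.zeros(2)                     # learned reward values
            counts = np.array([5.0, 1.0])       # early exposure favours food 0
            for _ in range(n_steps):
                p = counts / counts.sum()       # dynamic, history-dependent exposure
                food = int(np.random.choice(2, p=p))
                counts[food] += 1.0
                r = np.random.normal(1.0, 0.1)  # both foods equally rewarding
                v[food] += alpha * (r - v[food])  # Rescorla-Wagner value update
            return v, counts / counts.sum()

    Even though both foods are equally rewarding, the exposure share tends to stay locked near its early bias, so valuation is shaped more by history than by any real difference between the options.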

  14. Mind matters: placebo enhances reward learning in Parkinson's disease.

    PubMed

    Schmidt, Liane; Braun, Erin Kendall; Wager, Tor D; Shohamy, Daphna

    2014-12-01

    Expectations have a powerful influence on how we experience the world. Neurobiological and computational models of learning suggest that dopamine is crucial for shaping expectations of reward and that expectations alone may influence dopamine levels. However, because expectations and reinforcers are typically manipulated together, the role of expectations per se has remained unclear. We separated these two factors using a placebo dopaminergic manipulation in individuals with Parkinson's disease. We combined a reward learning task with functional magnetic resonance imaging to test how expectations of dopamine release modulate learning-related activity in the brain. We found that the mere expectation of dopamine release enhanced reward learning and modulated learning-related signals in the striatum and the ventromedial prefrontal cortex. These effects were selective to learning from reward: neither medication nor placebo had an effect on learning to avoid monetary loss. These findings suggest a neurobiological mechanism by which expectations shape learning and affect.

  15. Microstimulation of the human substantia nigra alters reinforcement learning.

    PubMed

    Ramayya, Ashwin G; Misra, Amrit; Baltuch, Gordon H; Kahana, Michael J

    2014-05-14

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action-reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action-reward associations rather than stimulus-reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action-reward associations during reinforcement learning.

  16. Microstimulation of the Human Substantia Nigra Alters Reinforcement Learning

    PubMed Central

    Ramayya, Ashwin G.; Misra, Amrit

    2014-01-01

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action–reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action–reward associations rather than stimulus–reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action–reward associations during reinforcement learning. PMID:24828643

  17. Individual differences in sensitivity to reward and punishment and neural activity during reward and avoidance learning.

    PubMed

    Kim, Sang Hee; Yoon, HeungSik; Kim, Hackjin; Hamann, Stephan

    2015-09-01

    In this functional neuroimaging study, we investigated neural activations during the process of learning to gain monetary rewards and to avoid monetary loss, and how these activations are modulated by individual differences in reward and punishment sensitivity. Healthy young volunteers performed a reinforcement learning task where they chose one of two fractal stimuli associated with monetary gain (reward trials) or avoidance of monetary loss (avoidance trials). Trait sensitivity to reward and punishment was assessed using the behavioral inhibition/activation scales (BIS/BAS). Functional neuroimaging results showed activation of the striatum during the anticipation and reception periods of reward trials. During avoidance trials, activation of the dorsal striatum and prefrontal regions was found. As expected, individual differences in reward sensitivity were positively associated with activation in the left and right ventral striatum during reward reception. Individual differences in sensitivity to punishment were negatively associated with activation in the left dorsal striatum during avoidance anticipation and also with activation in the right lateral orbitofrontal cortex during receiving monetary loss. These results suggest that learning to attain reward and learning to avoid loss are dependent on separable sets of neural regions whose activity is modulated by trait sensitivity to reward or punishment.

  18. Individual differences in sensitivity to reward and punishment and neural activity during reward and avoidance learning

    PubMed Central

    Yoon, HeungSik; Kim, Hackjin; Hamann, Stephan

    2015-01-01

    In this functional neuroimaging study, we investigated neural activations during the process of learning to gain monetary rewards and to avoid monetary loss, and how these activations are modulated by individual differences in reward and punishment sensitivity. Healthy young volunteers performed a reinforcement learning task where they chose one of two fractal stimuli associated with monetary gain (reward trials) or avoidance of monetary loss (avoidance trials). Trait sensitivity to reward and punishment was assessed using the behavioral inhibition/activation scales (BIS/BAS). Functional neuroimaging results showed activation of the striatum during the anticipation and reception periods of reward trials. During avoidance trials, activation of the dorsal striatum and prefrontal regions was found. As expected, individual differences in reward sensitivity were positively associated with activation in the left and right ventral striatum during reward reception. Individual differences in sensitivity to punishment were negatively associated with activation in the left dorsal striatum during avoidance anticipation and also with activation in the right lateral orbitofrontal cortex during receiving monetary loss. These results suggest that learning to attain reward and learning to avoid loss are dependent on separable sets of neural regions whose activity is modulated by trait sensitivity to reward or punishment. PMID:25680989

  19. A neural signature of hierarchical reinforcement learning.

    PubMed

    Ribas-Fernandes, José J F; Solway, Alec; Diuk, Carlos; McGuire, Joseph T; Barto, Andrew G; Niv, Yael; Botvinick, Matthew M

    2011-07-28

    Human behavior displays hierarchical structure: simple actions cohere into subtask sequences, which work together to accomplish overall task goals. Although the neural substrates of such hierarchy have been the target of increasing research, they remain poorly understood. We propose that the computations supporting hierarchical behavior may relate to those in hierarchical reinforcement learning (HRL), a machine-learning framework that extends reinforcement-learning mechanisms into hierarchical domains. To test this, we leveraged a distinctive prediction arising from HRL. In ordinary reinforcement learning, reward prediction errors are computed when there is an unanticipated change in the prospects for accomplishing overall task goals. HRL entails that prediction errors should also occur in relation to task subgoals. In three neuroimaging studies we observed neural responses consistent with such subgoal-related reward prediction errors, within structures previously implicated in reinforcement learning. The results reported support the relevance of HRL to the neural processes underlying hierarchical behavior.
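
    A minimal sketch of the distinctive HRL prediction tested here, assuming an options-style agent that maintains a task-level value function alongside a subtask value function trained on pseudo-reward delivered at subgoals; all names are illustrative:

        def hrl_prediction_errors(v_task, v_option, s, s_next, reward,
                                  pseudo_reward, alpha=0.1, gamma=0.95):
            # Ordinary prediction error, tied to overall task goals.
            delta_task = reward + gamma * v_task[s_next] - v_task[s]
            # Subgoal prediction error, tied to the active subtask's pseudo-reward;
            # this second signal is the signature HRL adds over ordinary RL.
            delta_sub = pseudo_reward + gamma * v_option[s_next] - v_option[s]
            v_task[s] += alpha * delta_task
            v_option[s] += alpha * delta_sub
            return delta_task, delta_sub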

  20. (Reinforcement?) Learning to forage optimally.

    PubMed

    Kolling, Nils; Akam, Thomas

    2017-09-14

    Foraging effectively is critical to the survival of all animals, and this imperative is thought to have profoundly shaped brain evolution. Decisions made by foraging animals often approximate optimal strategies, but the learning and decision mechanisms generating these choices remain poorly understood. Recent work with laboratory foraging tasks in humans suggests their behaviour is poorly explained by model-free reinforcement learning: simple heuristic strategies better describe behaviour in some tasks, while others show evidence of prospective prediction of the future state of the environment. We suggest that model-based average reward reinforcement learning may provide a common framework for understanding these apparently divergent foraging strategies.
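
    A minimal sketch of average-reward RL of the kind invoked here, using one standard R-learning-style update in which action values are measured relative to a running estimate of the reward rate, the quantity an ideal forager tracks when deciding whether to stay in a patch or leave; the tabular q, state indices, and parameters are illustrative:

        import numpy as np

        def r_learning_update(q, r_bar, s, a, reward, s_next, alpha=0.1, beta=0.01):
            # Values are defined relative to the running reward rate r_bar,
            # not to a discounted sum of future rewards.
            delta = reward - r_bar + np.max(q[s_next]) - q[s, a]
            q[s, a] += alpha * delta
            r_bar += beta * delta               # update the average-reward estimate
            return r_bar

        # usage: r_bar = r_learning_update(q, r_bar, s, a, reward, s_next)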

  1. Reinforcement of Learning

    ERIC Educational Resources Information Center

    Jones, Peter

    1977-01-01

    A company trainer shows some ways of scheduling reinforcement of learning for trainees: continuous reinforcement, fixed ratio, variable ratio, fixed interval, and variable interval. As there are problems with all methods, he suggests trying combinations of various types of reinforcement. (MF)

  2. Reinforcement learning: Computational theory and biological mechanisms.

    PubMed

    Doya, Kenji

    2007-05-01

    Reinforcement learning is a computational framework for an active agent to learn behaviors on the basis of a scalar reward signal. The agent can be an animal, a human, or an artificial system such as a robot or a computer program. The reward can be food, water, money, or any other measure of the agent's performance. The theory of reinforcement learning, which was developed in the artificial intelligence community with intuitions from animal learning theory, now gives a coherent account of the function of the basal ganglia. It serves as the "common language" in which biologists, engineers, and social scientists can exchange their problems and findings. This article reviews the basic theoretical framework of reinforcement learning and discusses its recent and future contributions toward the understanding of animal behaviors and human decision making.

  3. Hierarchical Multiagent Reinforcement Learning

    DTIC Science & Technology

    2004-01-25

    In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multiagent tasks. We... introduce a hierarchical multiagent reinforcement learning (RL) framework and propose a hierarchical multiagent RL algorithm called Cooperative HRL. In

  4. Reinforcement learning: Solving two case studies

    NASA Astrophysics Data System (ADS)

    Duarte, Ana Filipa; Silva, Pedro; dos Santos, Cristina Peixoto

    2012-09-01

    Reinforcement Learning algorithms offer interesting features for the control of autonomous systems, such as the ability to learn from direct interaction with the environment and the use of a simple reward signal, as opposed to the input-output pairs used in classic supervised learning. The reward signal indicates the success or failure of the actions executed by the agent in the environment. In this work, we describe RL algorithms applied to two case studies: the Crawler robot and the widely known inverted pendulum. We explore RL capabilities to autonomously learn a basic locomotion pattern in the Crawler, and approach the balancing problem of biped locomotion using the inverted pendulum.
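
    A minimal tabular Q-learning loop of the sort typically applied to a discretised inverted pendulum; env_step(s, a) -> (s_next, reward, done) is an assumed interface standing in for the simulated plant, and all parameters are illustrative rather than the authors':

        import numpy as np

        def q_learning(env_step, n_states, n_actions, episodes=500,
                       alpha=0.1, gamma=0.99, eps=0.1):
            q = np.zeros((n_states, n_actions))
            for _ in range(episodes):
                s, done = 0, False              # assume state 0 is the start state
                while not done:
                    # Epsilon-greedy action selection.
                    a = (np.random.randint(n_actions) if np.random.rand() < eps
                         else int(np.argmax(q[s])))
                    s_next, r, done = env_step(s, a)  # reward signals success/failure
                    q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
                    s = s_next
            return q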

  5. Reinforcement learning in scheduling

    NASA Technical Reports Server (NTRS)

    Dietterich, Tom G.; Ok, Dokyeong; Zhang, Wei; Tadepalli, Prasad

    1994-01-01

    The goal of this research is to apply reinforcement learning methods to real-world problems like scheduling. In this preliminary paper, we show that learning to solve scheduling problems such as the Space Shuttle Payload Processing and the Automatic Guided Vehicle (AGV) scheduling can be usefully studied in the reinforcement learning framework. We discuss some of the special challenges posed by the scheduling domain to these methods and propose some possible solutions we plan to implement.

  6. Theory meets pigeons: the influence of reward-magnitude on discrimination-learning.

    PubMed

    Rose, Jonas; Schmidt, Robert; Grabemann, Marco; Güntürkün, Onur

    2009-03-02

    Modern theoretical accounts of reward-based learning are commonly based on reinforcement learning algorithms. Most noted in this context is the temporal-difference (TD) algorithm, in which the difference between predicted and obtained reward, the prediction error, serves as a learning signal. Consequently, larger rewards cause bigger prediction errors and lead to faster learning than smaller rewards. Therefore, if animals employ a neural implementation of TD learning, reward magnitude should affect learning accordingly. Here we test this prediction by training pigeons on a simple color-discrimination task with two pairs of colors. In each pair, correct discrimination is rewarded; in pair one with a large reward, in pair two with a small reward. Pigeons acquired the large-reward discrimination faster than the small-reward discrimination. Animal behavior and an implementation of the TD algorithm yielded comparable results with respect to the difference between learning curves in the large-reward and small-reward conditions. We conclude that the influence of reward magnitude on the acquisition of a simple discrimination paradigm is accurately reflected by a TD implementation of reinforcement learning.
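
    A minimal TD-style account of this effect, assuming learning is expressed once the learned value crosses a fixed decision threshold: with a prediction-error update, larger rewards generate bigger errors and cross the threshold in fewer trials. The function and parameters are illustrative (and reward_magnitude must exceed the threshold for the loop to terminate):

        def trials_to_threshold(reward_magnitude, alpha=0.1, threshold=0.5):
            v, t = 0.0, 0
            while v < threshold:
                v += alpha * (reward_magnitude - v)  # prediction error scales with reward
                t += 1
            return t

        # trials_to_threshold(4.0) < trials_to_threshold(1.0): faster acquisition
        # for the large-reward discrimination, as observed in the pigeons.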

  7. Adding a reward increases the reinforcing value of fruit.

    PubMed

    De Cock, Nathalie; Vervoort, Leentje; Kolsteren, Patrick; Huybregts, Lieven; Van Lippevelde, Wendy; Vangeel, Jolien; Notebaert, Melissa; Beullens, Kathleen; Goossens, Lien; Maes, Lea; Deforche, Benedicte; Braet, Caroline; Eggermont, Steven; Van Camp, John; Lachat, Carl

    2017-02-01

    Adolescents' snack choices could be altered by increasing the reinforcing value (RV) of healthy snacks compared with unhealthy snacks. This study assessed whether the RV of fruit increased by linking it to a reward and whether this increased RV was comparable with the RV of unhealthy snacks alone. Moderation effects of sex, hunger, BMI z-scores and sensitivity to reward were also explored. The RV of snacks was assessed in a sample of 165 adolescents (15·1 (sd 1·5) years, 39·4 % boys and 17·4 % overweight) using a computerised food reinforcement task. Adolescents obtained points for snacks through mouse clicks (responses) following progressive ratio schedules of increasing response requirements. Participants were (computer) randomised to three experimental groups (1:1:1): fruit (n 53), fruit+reward (n 60) or unhealthy snacks (n 69). The RV was evaluated as the total number of responses and the breakpoint (the schedule at which the food reinforcement task was terminated). Multilevel regression analyses (total number of responses) and Cox's proportional hazard regression models (breakpoint) were used. The total number of responses made was not different between fruit+reward and fruit (b -473; 95 % CI -1152, 205, P=0·17) or unhealthy snacks (b 410; 95 % CI -222, 1043, P=0·20). The breakpoint was slightly higher for fruit than fruit+reward (HR 1·34; 95 % CI 1·00, 1·79, P=0·050), whereas no difference between unhealthy snacks and fruit+reward (HR 0·86; 95 % CI 0·62, 1·18, P=0·34) was observed. No indication of moderation was found. Offering rewards slightly increases the RV of fruit and may be a promising strategy to increase healthy food choices. Future studies should, however, explore whether other rewards could reach larger effect sizes.

  8. α-Synuclein gene duplication impairs reward learning.

    PubMed

    Kéri, Szabolcs; Moustafa, Ahmed A; Myers, Catherine E; Benedek, György; Gluck, Mark A

    2010-09-07

    alpha-Synuclein (SNCA) plays an important role in the regulation of dopaminergic neurotransmission and neurodegeneration in Parkinson disease. We investigated reward and punishment learning in asymptomatic carriers of a rare SNCA gene duplication who were healthy siblings of patients with Parkinson disease. Results revealed that healthy SNCA duplication carriers displayed impaired reward and intact punishment learning compared with noncarriers. These results demonstrate that a copy number variation of the SNCA gene is associated with selective impairments on reinforcement learning in asymptomatic carriers without the motor symptoms of Parkinson disease.

  9. α-Synuclein gene duplication impairs reward learning

    PubMed Central

    Kéri, Szabolcs; Moustafa, Ahmed A.; Myers, Catherine E.; Benedek, György; Gluck, Mark A.

    2010-01-01

    α-Synuclein (SNCA) plays an important role in the regulation of dopaminergic neurotransmission and neurodegeneration in Parkinson disease. We investigated reward and punishment learning in asymptomatic carriers of a rare SNCA gene duplication who were healthy siblings of patients with Parkinson disease. Results revealed that healthy SNCA duplication carriers displayed impaired reward and intact punishment learning compared with noncarriers. These results demonstrate that a copy number variation of the SNCA gene is associated with selective impairments on reinforcement learning in asymptomatic carriers without the motor symptoms of Parkinson disease. PMID:20733075

  10. Model-Based Reinforcement Learning under Concurrent Schedules of Reinforcement in Rodents

    ERIC Educational Resources Information Center

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-01-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's…

  12. Modeling the Violation of Reward Maximization and Invariance in Reinforcement Schedules

    PubMed Central

    La Camera, Giancarlo; Richmond, Barry J.

    2008-01-01

    It is often assumed that animals and people adjust their behavior to maximize reward acquisition. In visually cued reinforcement schedules, monkeys make errors in trials that are not immediately rewarded, despite having to repeat error trials. Here we show that error rates are typically smaller in trials equally distant from reward but belonging to longer schedules (referred to as “schedule length effect”). This violates the principles of reward maximization and invariance and cannot be predicted by the standard methods of Reinforcement Learning, such as the method of temporal differences. We develop a heuristic model that accounts for all of the properties of the behavior in the reinforcement schedule task but whose predictions are not different from those of the standard temporal difference model in choice tasks. In the modification of temporal difference learning introduced here, the effect of schedule length emerges spontaneously from the sensitivity to the immediately preceding trial. We also introduce a policy for general Markov Decision Processes, where the decision made at each node is conditioned on the motivation to perform an instrumental action, and show that the application of our model to the reinforcement schedule task and the choice task are special cases of this general theoretical framework. Within this framework, Reinforcement Learning can approach contextual learning with the mixture of empirical findings and principled assumptions that seem to coexist in the best descriptions of animal behavior. As examples, we discuss two phenomena observed in humans that often derive from the violation of the principle of invariance: “framing,” wherein equivalent options are treated differently depending on the context in which they are presented, and the “sunk cost” effect, the greater tendency to continue an endeavor once an investment in money, effort, or time has been made. The schedule length effect might be a manifestation of these phenomena.

  13. Mate call as reward: Acoustic communication signals can acquire positive reinforcing values during adulthood in female zebra finches (Taeniopygia guttata).

    PubMed

    Hernandez, Alexandra M; Perez, Emilie C; Mulard, Hervé; Mathevon, Nicolas; Vignal, Clémentine

    2016-02-01

    Social stimuli can have rewarding properties and promote learning. In birds, conspecific vocalizations like song can act as a reinforcer, and specific song variants can acquire particular rewarding values during early life exposure. Here we ask if, during adulthood, an acoustic signal simpler and shorter than song can become a reward for a female songbird because of its particular social value. Using an operant choice apparatus, we showed that female zebra finches display a preferential response toward their mate's calls. This reinforcing value of mate's calls could be involved in the maintenance of the monogamous pair-bond of the zebra finch.

  14. Do learning rates adapt to the distribution of rewards?

    PubMed

    Gershman, Samuel J

    2015-10-01

    Studies of reinforcement learning have shown that humans learn differently in response to positive and negative reward prediction errors, a phenomenon that can be captured computationally by positing asymmetric learning rates. This asymmetry, motivated by neurobiological and cognitive considerations, has been invoked to explain learning differences across the lifespan as well as a range of psychiatric disorders. Recent theoretical work, motivated by normative considerations, has hypothesized that the learning rate asymmetry should be modulated by the distribution of rewards across the available options. In particular, the learning rate for negative prediction errors should be higher than the learning rate for positive prediction errors when the average reward rate is high, and this relationship should reverse when the reward rate is low. We tested this hypothesis in a series of experiments. Contrary to the theoretical predictions, we found that the asymmetry was largely insensitive to the average reward rate; instead, the dominant pattern was a higher learning rate for negative than for positive prediction errors, possibly reflecting risk aversion.
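
    The asymmetry at issue is commonly modelled with separate learning rates for positive and negative prediction errors; a minimal sketch, with the default values chosen purely to reflect the dominant pattern reported here (higher learning rate for negative errors):

        def asymmetric_update(v, reward, alpha_pos=0.20, alpha_neg=0.35):
            # Rescorla-Wagner update with valence-dependent learning rates.
            delta = reward - v                  # prediction error
            alpha = alpha_pos if delta >= 0 else alpha_neg
            return v + alpha * delta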

  15. Dose Dependent Dopaminergic Modulation of Reward-Based Learning in Parkinson's Disease

    ERIC Educational Resources Information Center

    van Wouwe, N. C.; Ridderinkhof, K. R.; Band, G. P. H.; van den Wildenberg, W. P. M.; Wylie, S. A.

    2012-01-01

    Learning to select optimal behavior in new and uncertain situations is a crucial aspect of living and requires the ability to quickly associate stimuli with actions that lead to rewarding outcomes. Mathematical models of reinforcement-based learning to select rewarding actions distinguish between (1) the formation of stimulus-action-reward…

  17. A universal role of the ventral striatum in reward-based learning: Evidence from human studies

    PubMed Central

    Daniel, Reka; Pollmann, Stefan

    2014-01-01

    Reinforcement learning enables organisms to adjust their behavior in order to maximize rewards. Electrophysiological recordings of dopaminergic midbrain neurons have shown that they code the difference between actual and predicted rewards, i.e., the reward prediction error, in many species. This error signal is conveyed to both the striatum and cortical areas and is thought to play a central role in learning to optimize behavior. However, in human daily life rewards are diverse and often only indirect feedback is available. Here we explore the range of rewards that are processed by the dopaminergic system in human participants, and examine whether it is also involved in learning in the absence of explicit rewards. While results from electrophysiological recordings in humans are sparse, evidence linking dopaminergic activity to the metabolic signal recorded from the midbrain and striatum with functional magnetic resonance imaging (fMRI) is available. Results from fMRI studies suggest that the human ventral striatum (VS) receives valuation information for a diverse set of rewarding stimuli. These range from simple primary reinforcers such as juice rewards, through abstract social rewards, to internally generated signals of perceived correctness, suggesting that the VS is involved in learning from trial-and-error irrespective of the specific nature of provided rewards. In addition, we summarize evidence that the VS can also be implicated when learning from observing others, and in tasks that go beyond simple stimulus-action-outcome learning, indicating that the reward system is also recruited in more complex learning tasks. PMID:24825620

  18. Reinforcement Learning or Active Inference?

    PubMed Central

    Friston, Karl J.; Daunizeau, Jean; Kiebel, Stefan J.

    2009-01-01

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain. PMID:19641614

  19. Reinforcement learning or active inference?

    PubMed

    Friston, Karl J; Daunizeau, Jean; Kiebel, Stefan J

    2009-07-29

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

  20. Manifold Regularized Reinforcement Learning.

    PubMed

    Li, Hongliang; Liu, Derong; Wang, Ding

    2017-01-27

    This paper introduces a novel manifold regularized reinforcement learning scheme for continuous Markov decision processes. Smooth feature representations for value function approximation can be automatically learned using the unsupervised manifold regularization method. The learned features are data-driven, and can be adapted to the geometry of the state space. Furthermore, the scheme provides a direct basis representation extension for novel samples during policy learning and control. The performance of the proposed scheme is evaluated on two benchmark control tasks, i.e., the inverted pendulum and the energy storage problem. Simulation results illustrate the concepts of the proposed scheme and show that it can obtain excellent performance.
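
    The following sketch illustrates one Laplacian-based way to learn smooth, geometry-adapted features from sampled states, in the spirit of proto-value functions; the paper's actual manifold regularization scheme differs in detail, and the function name and parameters (k, n_features) are illustrative choices:

        import numpy as np
        from scipy.sparse.csgraph import laplacian

        def manifold_features(states, k=5, n_features=10):
            # Build a symmetric k-nearest-neighbour graph over sampled states
            # (shape: n x dim) and take the smoothest Laplacian eigenvectors
            # as basis functions for value-function approximation.
            n = len(states)
            d = np.linalg.norm(states[:, None] - states[None, :], axis=-1)
            w = np.zeros((n, n))
            for i in range(n):
                for j in np.argsort(d[i])[1:k + 1]:   # skip self at index 0
                    w[i, j] = w[j, i] = 1.0
            lap = laplacian(w, normed=True)
            _, vecs = np.linalg.eigh(lap)       # eigenvalues in ascending order
            return vecs[:, :n_features]         # low-frequency, geometry-adapted features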

  1. Learning Reward Uncertainty in the Basal Ganglia

    PubMed Central

    Bogacz, Rafal

    2016-01-01

    Learning the reliability of different sources of rewards is critical for making optimal choices. However, despite the existence of detailed theory describing how the expected reward is learned in the basal ganglia, it is not known how reward uncertainty is estimated in these circuits. This paper presents a class of models that encode both the mean reward and the spread of the rewards, the former in the difference between the synaptic weights of D1 and D2 neurons, and the latter in their sum. In the models, the tendency to seek (or avoid) options with variable reward can be controlled by increasing (or decreasing) the tonic level of dopamine. The models are consistent with the physiology of and synaptic plasticity in the basal ganglia, they explain the effects of dopaminergic manipulations on choices involving risks, and they make multiple experimental predictions. PMID:27589489
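
    A minimal sketch of this class of models, assuming single scalar "Go" (D1) and "NoGo" (D2) weights, a rectified prediction-error update, and weight decay; the parameter names and values are illustrative, not the paper's:

        def d1_d2_update(g, n, reward, alpha=0.05, decay=0.02):
            # Positive prediction errors strengthen D1 ("Go") weights, negative
            # ones strengthen D2 ("NoGo") weights; both decay slowly.
            delta = reward - (g - n)            # error relative to expected reward
            g += alpha * max(delta, 0) - decay * g
            n += alpha * max(-delta, 0) - decay * n
            return g, n

        # Readout in this scheme: (g - n) tracks the mean reward,
        # while (g + n) grows with the spread of the rewards.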

  2. Social stress reactivity alters reward and punishment learning.

    PubMed

    Cavanagh, James F; Frank, Michael J; Allen, John J B

    2011-06-01

    To examine how stress affects cognitive functioning, individual differences in trait vulnerability (punishment sensitivity) and state reactivity (negative affect) to social evaluative threat were examined during concurrent reinforcement learning. Lower trait-level punishment sensitivity predicted better reward learning and poorer punishment learning; the opposite pattern was found in more punishment sensitive individuals. Increasing state-level negative affect was directly related to punishment learning accuracy in highly punishment sensitive individuals, but these measures were inversely related in less sensitive individuals. Combined electrophysiological measurement, performance accuracy and computational estimations of learning parameters suggest that trait and state vulnerability to stress alter cortico-striatal functioning during reinforcement learning, possibly mediated via medio-frontal cortical systems.

  3. Learned reward association improves visual working memory.

    PubMed

    Gong, Mengyuan; Li, Sheng

    2014-04-01

    Statistical regularities in the natural environment play a central role in adaptive behavior. Among other regularities, reward association is potentially the most prominent factor that influences our daily life. Recent studies have suggested that pre-established reward association yields strong influence on the spatial allocation of attention. Here we show that reward association can also improve visual working memory (VWM) performance when the reward-associated feature is task-irrelevant. We established the reward association during a visual search training session, and investigated the representation of reward-associated features in VWM by the application of a change detection task before and after the training. The results showed that the improvement in VWM was significantly greater for items in the color associated with high reward than for those in low reward-associated or nonrewarded colors. In particular, the results from control experiments demonstrate that the observed reward effect in VWM could not be sufficiently accounted for by attentional capture toward the high reward-associated item. This was further confirmed when the effect of attentional capture was minimized by presenting the items in the sample and test displays of the change detection task with the same color. The results showed significantly larger improvement in VWM performance when the items in a display were in the high reward-associated color than those in the low reward-associated or nonrewarded colors. Our findings suggest that, apart from inducing space-based attentional capture, the learned reward association could also facilitate the perceptual representation of high reward-associated items through feature-based attentional modulation.

  4. Learning Contextual Reward Expectations for Value Adaptation.

    PubMed

    Rigoli, Francesco; Chew, Benjamin; Dayan, Peter; Dolan, Raymond J

    2017-09-26

    Substantial evidence indicates that subjective value is adapted to the statistics of reward expected within a given temporal context. However, how these contextual expectations are learnt is poorly understood. To examine such learning, we exploited a recent observation that participants performing a gambling task adjust their preferences as a function of context. We show that, in the absence of contextual cues providing reward information, an average reward expectation was learned from recent past experience. Learning dependent on contextual cues emerged when two contexts alternated at a fast rate, whereas both cue-independent and cue-dependent forms of learning were apparent when two contexts alternated at a slower rate. Motivated by these behavioral findings, we reanalyzed a previous fMRI data set to probe the neural substrates of learning contextual reward expectations. We observed a form of reward prediction error related to average reward such that, at option presentation, activity in ventral tegmental area/substantia nigra and ventral striatum correlated positively and negatively, respectively, with the actual and predicted value of options. Moreover, an inverse correlation between activity in ventral tegmental area/substantia nigra (but not striatum) and predicted option value was greater in participants showing enhanced choice adaptation to context. These findings help clarify the mechanisms underlying the learning of contextual reward expectations.
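
    A minimal sketch of the two learning processes suggested here, assuming a slow, cue-independent average-reward estimate maintained alongside fast, cue-specific expectations; the function and rate parameters are illustrative:

        def update_expectations(r_bar_global, r_bar_cue, cue, reward,
                                alpha_fast=0.3, alpha_slow=0.05):
            # Cue-independent expectation: a slow running average of recent rewards.
            r_bar_global += alpha_slow * (reward - r_bar_global)
            # Cue-dependent expectation: a fast estimate kept per context cue.
            r_bar_cue[cue] += alpha_fast * (reward - r_bar_cue[cue])
            return r_bar_global

        # usage: r_bar_global = update_expectations(r_bar_global, r_bar_cue, cue, r)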

  5. Reduced reward-related probability learning in schizophrenia patients.

    PubMed

    Yılmaz, Alpaslan; Simsek, Fatma; Gonul, Ali Saffet

    2012-01-01

    Although it is known that individuals with schizophrenia demonstrate marked impairment in reinforcement learning, the details of this impairment are not known. The aim of this study was to test the hypothesis that reward-related probability learning is altered in schizophrenia patients. Twenty-five clinically stable schizophrenia patients and 25 age- and gender-matched controls participated in the study. A simple gambling paradigm was used in which five different cues were associated with different reward probabilities (50%, 67%, and 100%). Participants were asked to make their best guess about the reward probability of each cue. Compared with controls, patients had significant impairment in learning contingencies on the basis of reward-related feedback. The correlation analyses revealed that the impairment of patients partially correlated with the severity of negative symptoms as measured on the Positive and Negative Syndrome Scale but that it was not related to antipsychotic dose. In conclusion, the present study showed that the schizophrenia patients had impaired reward-based learning and that this was independent from their medication status.

  6. How instructed knowledge modulates the neural systems of reward learning

    PubMed Central

    Delgado, Mauricio R.; Phelps, Elizabeth A.

    2011-01-01

    Recent research in neuroeconomics has demonstrated that the reinforcement learning model of reward learning captures the patterns of both behavioral performance and neural responses during a range of economic decision-making tasks. However, this powerful theoretical model has its limits. Trial-and-error is only one of the means by which individuals can learn the value associated with different decision options. Humans have also developed efficient, symbolic means of communication for learning without the necessity for committing multiple errors across trials. In the present study, we observed that instructed knowledge of cue-reward probabilities improves behavioral performance and diminishes reinforcement learning-related blood-oxygen level-dependent (BOLD) responses to feedback in the nucleus accumbens, ventromedial prefrontal cortex, and hippocampal complex. The decrease in BOLD responses in these brain regions to reward-feedback signals was functionally correlated with activation of the dorsolateral prefrontal cortex (DLPFC). These results suggest that when learning action values, participants use the DLPFC to dynamically adjust outcome responses in valuation regions depending on the usefulness of action-outcome information. PMID:21173266

  7. Mapping anhedonia onto reinforcement learning: a behavioural meta-analysis

    PubMed Central

    2013-01-01

    Background: Depression is characterised partly by blunted reactions to reward. However, tasks probing this deficiency have not distinguished insensitivity to reward from insensitivity to the prediction errors for reward that determine learning and are putatively reported by the phasic activity of dopamine neurons. We attempted to disentangle these factors with respect to anhedonia in the context of stress, Major Depressive Disorder (MDD), Bipolar Disorder (BPD) and a dopaminergic challenge. Methods: Six behavioural datasets involving 392 experimental sessions were subjected to a model-based, Bayesian meta-analysis. Participants across all six studies performed a probabilistic reward task that used an asymmetric reinforcement schedule to assess reward learning. Healthy controls were tested under baseline conditions, stress or after receiving the dopamine D2 agonist pramipexole. In addition, participants with current or past MDD or BPD were evaluated. Reinforcement learning models isolated the contributions of variation in reward sensitivity and learning rate. Results: MDD and anhedonia reduced reward sensitivity more than they affected the learning rate, while a low dose of the dopamine D2 agonist pramipexole showed the opposite pattern. Stress led to a pattern consistent with a mixed effect on reward sensitivity and learning rate. Conclusion: Reward-related learning reflected at least two partially separable contributions. The first related to phasic prediction error signalling, and was preferentially modulated by a low dose of the dopamine agonist pramipexole. The second related directly to reward sensitivity, and was preferentially reduced in MDD and anhedonia. Stress altered both components. Collectively, these findings highlight the contribution of model-based reinforcement learning meta-analysis for dissecting anhedonic behavior. PMID:23782813
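
    The two separable contributions map naturally onto a Rescorla-Wagner-style rule with a reward-sensitivity parameter that scales the impact of reward, distinct from the learning rate; a minimal sketch with illustrative names:

        def rw_with_sensitivity(v, reward, rho=1.0, alpha=0.1):
            # rho scales the subjective impact of reward (blunted in MDD/anhedonia);
            # alpha is the learning rate (modulated here by pramipexole).
            return v + alpha * (rho * reward - v)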

  8. The Computational Development of Reinforcement Learning during Adolescence

    PubMed Central

    Palminteri, Stefano; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-01-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents’ behaviour was better explained by a basic reinforcement learning algorithm, adults’ behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence. PMID:27322574
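
    A minimal sketch of a counterfactual learning module of the kind described, assuming complete feedback delivers the foregone outcome of the unchosen option; the function name and rates are illustrative:

        def counterfactual_update(v, chosen, unchosen, r_obtained, r_foregone,
                                  alpha=0.2, alpha_cf=0.2):
            # Factual update from the obtained outcome of the chosen option.
            v[chosen] += alpha * (r_obtained - v[chosen])
            # Counterfactual update from the foregone outcome of the unchosen
            # option: the feature adults exploited but adolescents did not.
            v[unchosen] += alpha_cf * (r_foregone - v[unchosen])
            return v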

  9. Nucleus Accumbens Core and Shell Differentially Encode Reward-Associated Cues after Reinforcer Devaluation

    PubMed Central

    West, Elizabeth A.

    2016-01-01

    Nucleus accumbens (NAc) neurons encode features of stimulus learning and action selection associated with rewards. The NAc is necessary for using information about expected outcome values to guide behavior after reinforcer devaluation. Evidence suggests that core and shell subregions may play dissociable roles in guiding motivated behavior. Here, we recorded neural activity in the NAc core and shell during training and performance of a reinforcer devaluation task. Long–Evans male rats were trained that presses on a lever under an illuminated cue light delivered a flavored sucrose reward. On subsequent test days, each rat was given free access to one of two distinctly flavored foods to consume to satiation and were then immediately tested on the lever pressing task under extinction conditions. Rats decreased pressing on the test day when the reinforcer earned during training was the sated flavor (devalued) compared with the test day when the reinforcer was not the sated flavor (nondevalued), demonstrating evidence of outcome-selective devaluation. Cue-selective encoding during training by NAc core (but not shell) neurons reliably predicted subsequent behavioral performance; that is, the greater the percentage of neurons that responded to the cue, the better the rats suppressed responding after devaluation. In contrast, NAc shell (but not core) neurons significantly decreased cue-selective encoding in the devalued condition compared with the nondevalued condition. These data reveal that NAc core and shell neurons encode information differentially about outcome-specific cues after reinforcer devaluation that are related to behavioral performance and outcome value, respectively. SIGNIFICANCE STATEMENT Many neuropsychiatric disorders are marked by impairments in behavioral flexibility. Although the nucleus accumbens (NAc) is required for behavioral flexibility, it is not known how NAc neurons encode this information. Here, we recorded NAc neurons during a training

  10. Nucleus Accumbens Core and Shell Differentially Encode Reward-Associated Cues after Reinforcer Devaluation.

    PubMed

    West, Elizabeth A; Carelli, Regina M

    2016-01-27

    Nucleus accumbens (NAc) neurons encode features of stimulus learning and action selection associated with rewards. The NAc is necessary for using information about expected outcome values to guide behavior after reinforcer devaluation. Evidence suggests that core and shell subregions may play dissociable roles in guiding motivated behavior. Here, we recorded neural activity in the NAc core and shell during training and performance of a reinforcer devaluation task. Long-Evans male rats were trained that presses on a lever under an illuminated cue light delivered a flavored sucrose reward. On subsequent test days, each rat was given free access to one of two distinctly flavored foods to consume to satiation and were then immediately tested on the lever pressing task under extinction conditions. Rats decreased pressing on the test day when the reinforcer earned during training was the sated flavor (devalued) compared with the test day when the reinforcer was not the sated flavor (nondevalued), demonstrating evidence of outcome-selective devaluation. Cue-selective encoding during training by NAc core (but not shell) neurons reliably predicted subsequent behavioral performance; that is, the greater the percentage of neurons that responded to the cue, the better the rats suppressed responding after devaluation. In contrast, NAc shell (but not core) neurons significantly decreased cue-selective encoding in the devalued condition compared with the nondevalued condition. These data reveal that NAc core and shell neurons encode information differentially about outcome-specific cues after reinforcer devaluation that are related to behavioral performance and outcome value, respectively. Many neuropsychiatric disorders are marked by impairments in behavioral flexibility. Although the nucleus accumbens (NAc) is required for behavioral flexibility, it is not known how NAc neurons encode this information. Here, we recorded NAc neurons during a training session in which rats

  11. Multiplexing signals in reinforcement learning with internal models and dopamine.

    PubMed

    Nakahara, Hiroyuki

    2014-04-01

    A fundamental challenge for computational and cognitive neuroscience is to understand how reward-based learning and decision-making are made and how accrued knowledge and internal models of the environment are incorporated. Remarkable progress has been made in the field, guided by the midbrain dopamine reward prediction error hypothesis and the underlying reinforcement learning framework, which does not involve internal models ('model-free'). Recent studies, however, have begun not only to address more complex decision-making processes that are integrated with model-free decision-making, but also to include internal models about environmental reward structures and the minds of other agents, including model-based reinforcement learning and using generalized prediction errors. Even dopamine, a classic model-free signal, may work as multiplexed signals using model-based information and contribute to representational learning of reward structure.

  12. Individual differences in reinforcement learning: behavioral, electrophysiological, and neuroimaging correlates.

    PubMed

    Santesso, Diane L; Dillon, Daniel G; Birk, Jeffrey L; Holmes, Avram J; Goetz, Elena; Bogdan, Ryan; Pizzagalli, Diego A

    2008-08-15

    During reinforcement learning, phasic modulations of activity in midbrain dopamine neurons are conveyed to the dorsal anterior cingulate cortex (dACC) and basal ganglia (BG) and serve to guide adaptive responding. While the animal literature supports a role for the dACC in integrating reward history over time, most human electrophysiological studies of dACC function have focused on responses to single positive and negative outcomes. The present electrophysiological study investigated the role of the dACC in probabilistic reward learning in healthy subjects using a task that required integration of reinforcement history over time. We recorded the feedback-related negativity (FRN) to reward feedback in subjects who developed a response bias toward a more frequently rewarded ("rich") stimulus ("learners") versus subjects who did not ("non-learners"). Compared to non-learners, learners showed more positive (i.e., smaller) FRNs and greater dACC activation upon receiving reward for correct identification of the rich stimulus. In addition, dACC activation and a bias to select the rich stimulus were positively correlated. The same participants also completed a monetary incentive delay (MID) task administered during functional magnetic resonance imaging. Compared to non-learners, learners displayed stronger BG responses to reward in the MID task. These findings raise the possibility that learners in the probabilistic reinforcement task were characterized by stronger dACC and BG responses to rewarding outcomes. Furthermore, these results highlight the importance of the dACC to probabilistic reward learning in humans.

  13. Reinforcement Learning Trees.

    PubMed

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings.

  14. Reinforcement Learning Trees

    PubMed Central

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R.

    2015-01-01

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings. PMID:26903687

  15. Reward and non-reward learning of flower colours in the butterfly Byasa alcinous (Lepidoptera: Papilionidae)

    NASA Astrophysics Data System (ADS)

    Kandori, Ikuo; Yamaki, Takafumi

    2012-09-01

    Learning plays an important role in food acquisition for a wide range of insects. To increase their foraging efficiency, flower-visiting insects may learn to associate floral cues with the presence (so-called reward learning) or the absence (so-called non-reward learning) of a reward. Reward learning whilst foraging for flowers has been demonstrated in many insect taxa, whilst non-reward learning in flower-visiting insects has been demonstrated only in honeybees, bumblebees and hawkmoths. This study examined both reward and non-reward learning abilities in the butterfly Byasa alcinous whilst foraging among artificial flowers of different colours. This butterfly showed both types of learning, although butterflies of both sexes learned faster via reward learning. In addition, females learned via reward learning faster than males. To the best of our knowledge, these are the first empirical data on the learning speed of both reward and non-reward learning in insects. We discuss the adaptive significance of a lower learning speed for non-reward learning when foraging on flowers.

  16. Dopamine, reward learning, and active inference

    PubMed Central

    FitzGerald, Thomas H. B.; Dolan, Raymond J.; Friston, Karl

    2015-01-01

    Temporal difference learning models propose phasic dopamine signaling encodes reward prediction errors that drive learning. This is supported by studies where optogenetic stimulation of dopamine neurons can stand in lieu of actual reward. Nevertheless, a large body of data also shows that dopamine is not necessary for learning, and that dopamine depletion primarily affects task performance. We offer a resolution to this paradox based on an hypothesis that dopamine encodes the precision of beliefs about alternative actions, and thus controls the outcome-sensitivity of behavior. We extend an active inference scheme for solving Markov decision processes to include learning, and show that simulated dopamine dynamics strongly resemble those actually observed during instrumental conditioning. Furthermore, simulated dopamine depletion impairs performance but spares learning, while simulated excitation of dopamine neurons drives reward learning, through aberrant inference about outcome states. Our formal approach provides a novel and parsimonious reconciliation of apparently divergent experimental findings. PMID:26581305
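
    The proposed role for dopamine (precision rather than prediction error) can be caricatured as a single precision parameter scaling a softmax over expected action values; a schematic Python sketch of our own, not the authors' full active inference scheme:

        import numpy as np

        def action_probabilities(values, precision):
            # Softmax over expected values; `precision` plays the role the
            # paper ascribes to dopamine, controlling outcome-sensitivity.
            z = precision * np.asarray(values, dtype=float)
            z -= z.max()                    # numerical stability
            p = np.exp(z)
            return p / p.sum()

        values = [1.0, 0.5, 0.0]            # learned values are left untouched
        print(action_probabilities(values, precision=4.0))   # intact: decisive
        print(action_probabilities(values, precision=0.2))   # 'depleted': near-random

    On this reading, lowering precision degrades performance while sparing the values themselves, consistent with the depletion findings summarized above.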

  17. Coexistence of Reward and Unsupervised Learning During the Operant Conditioning of Neural Firing Rates

    PubMed Central

    Kerr, Robert R.; Grayden, David B.; Thomas, Doreen A.; Gilson, Matthieu; Burkitt, Anthony N.

    2014-01-01

    A fundamental goal of neuroscience is to understand how cognitive processes, such as operant conditioning, are performed by the brain. Typical and well studied examples of operant conditioning, in which the firing rates of individual cortical neurons in monkeys are increased using rewards, provide an opportunity for insight into this. Studies of reward-modulated spike-timing-dependent plasticity (RSTDP), and of other models such as R-max, have reproduced this learning behavior, but they have assumed that no unsupervised learning is present (i.e., no learning occurs without, or independent of, rewards). We show that these models cannot elicit firing rate reinforcement while exhibiting both reward learning and ongoing, stable unsupervised learning. To fix this issue, we propose a new RSTDP model of synaptic plasticity based upon the observed effects that dopamine has on long-term potentiation and depression (LTP and LTD). We show, both analytically and through simulations, that our new model can exhibit unsupervised learning and lead to firing rate reinforcement. This requires that the strengthening of LTP by the reward signal is greater than the strengthening of LTD and that the reinforced neuron exhibits irregular firing. We show the robustness of our findings to spike-timing correlations, to the synaptic weight dependence that is assumed, and to changes in the mean reward. We also consider our model in the differential reinforcement of two nearby neurons. Our model aligns more strongly with experimental studies than previous models and makes testable predictions for future experiments. PMID:24475240
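
    The model's key requirement (reward strengthens LTP more than LTD, on top of a stable unsupervised baseline) can be sketched as a reward-modulated STDP window; a simplified caricature with invented constants, not the authors' exact equations:

        import numpy as np

        A_LTP, A_LTD = 1.0, 1.1     # baseline (unsupervised) STDP amplitudes
        K_LTP, K_LTD = 2.0, 1.2     # reward scaling, with K_LTP > K_LTD as required
        TAU = 20.0                  # STDP time constant (ms)

        def dw(dt, reward):
            # Weight change for one spike pair, dt = t_post - t_pre (ms).
            # With reward = 0 the rule reduces to ordinary unsupervised STDP;
            # reward scales potentiation more strongly than depression.
            if dt > 0:      # pre before post: potentiation
                return (A_LTP + K_LTP * reward) * np.exp(-dt / TAU)
            else:           # post before pre: depression
                return -(A_LTD + K_LTD * reward) * np.exp(dt / TAU)

        for r in (0.0, 0.5):
            print(r, dw(+10.0, r), dw(-10.0, r))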

  18. Coexistence of reward and unsupervised learning during the operant conditioning of neural firing rates.

    PubMed

    Kerr, Robert R; Grayden, David B; Thomas, Doreen A; Gilson, Matthieu; Burkitt, Anthony N

    2014-01-01

    A fundamental goal of neuroscience is to understand how cognitive processes, such as operant conditioning, are performed by the brain. Typical and well studied examples of operant conditioning, in which the firing rates of individual cortical neurons in monkeys are increased using rewards, provide an opportunity for insight into this. Studies of reward-modulated spike-timing-dependent plasticity (RSTDP), and of other models such as R-max, have reproduced this learning behavior, but they have assumed that no unsupervised learning is present (i.e., no learning occurs without, or independent of, rewards). We show that these models cannot elicit firing rate reinforcement while exhibiting both reward learning and ongoing, stable unsupervised learning. To fix this issue, we propose a new RSTDP model of synaptic plasticity based upon the observed effects that dopamine has on long-term potentiation and depression (LTP and LTD). We show, both analytically and through simulations, that our new model can exhibit unsupervised learning and lead to firing rate reinforcement. This requires that the strengthening of LTP by the reward signal is greater than the strengthening of LTD and that the reinforced neuron exhibits irregular firing. We show the robustness of our findings to spike-timing correlations, to the synaptic weight dependence that is assumed, and to changes in the mean reward. We also consider our model in the differential reinforcement of two nearby neurons. Our model aligns more strongly with experimental studies than previous models and makes testable predictions for future experiments.

  19. Interrelated mechanisms in reward and learning.

    PubMed

    Lajtha, Abel

    2008-01-01

    This brief review focuses on recent work in our laboratory, in which we assayed nicotine-induced neurotransmitter changes, compared them to changes induced by other compounds, and examined the receptor systems and their interactions that mediate the changes. The primary aim of our studies is to examine the role of neurotransmitter changes in reward and learning processes. We find that these processes are interlinked and interact: reward-addiction mechanisms include processes of learning, and learning-memory mechanisms include processes of reward. Despite being interlinked, the two processes have different functions and distinct properties, and our long-term aim is to identify the factors that control these processes and the differences between them. Here, we discuss reward processes, which we define as changes examined after administration of nicotine, cocaine, or food, each of which induces changes in neurotransmitter levels and functions in cognitive areas as well as in reward areas. The changes are regionally heterogeneous and are drug or stimulus specific. They include changes in the transmitters assayed (catecholamines, amino acids, and acetylcholine) and also in their metabolites; hence, in addition to release, uptake and metabolism are involved. Many receptors modulate the response with direct and indirect effects. The involvement of many transmitters, receptors, and their interactions, together with the stimulus specificity of the response, indicates that reward and learning each involve a different pattern of changes for a different stimulus; therefore, many different learning and many different reward processes are active, allowing stimulus-specific responses. The complex pattern of reward-induced changes in neurotransmitters is only a part of the multiple changes observed, but one with a crucial, controlling function.

  20. Social Influence as Reinforcement Learning

    DTIC Science & Technology

    2016-01-13

    Final Report: Social Influence as Reinforcement Learning. This project examined a reinforcement learning model of conformity and social influence. Under this model, individuals… Approved for public release; distribution unlimited.

  1. Neural Basis of Reinforcement Learning and Decision Making

    PubMed Central

    Lee, Daeyeol; Seo, Hyojung; Jung, Min Whan

    2012-01-01

    Reinforcement learning is an adaptive process in which an animal utilizes its previous experience to improve the outcomes of future choices. Computational theories of reinforcement learning play a central role in the newly emerging areas of neuroeconomics and decision neuroscience. In this framework, actions are chosen according to their value functions, which describe how much future reward is expected from each action. Value functions can be adjusted not only through reward and penalty, but also by the animal’s knowledge of its current environment. Studies have revealed that a large proportion of the brain is involved in representing and updating value functions and using them to choose an action. However, how the nature of a behavioral task affects the neural mechanisms of reinforcement learning remains incompletely understood. Future studies should uncover the principles by which different computational elements of reinforcement learning are dynamically coordinated across the entire brain. PMID:22462543
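
    The bookkeeping described here (values adjusted by reward and penalty, then used to choose actions) is commonly formalized as in the following minimal tabular sketch; this is standard Q-learning with softmax choice, illustrative rather than tied to any one study in this listing:

        import numpy as np

        rng = np.random.default_rng(1)
        alpha, beta = 0.2, 3.0          # learning rate, inverse temperature
        q = np.zeros(2)                 # value functions: expected reward per action
        p_reward = [0.8, 0.2]           # true (hidden) reward probabilities

        for trial in range(500):
            p = np.exp(beta * q) / np.exp(beta * q).sum()   # choose by value
            a = rng.choice(2, p=p)
            r = float(rng.random() < p_reward[a])           # reward outcome
            q[a] += alpha * (r - q[a])                      # update value function

        print(q)    # approaches the true reward probabilities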

  2. Time-Extended Policies in Multi-Agent Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2004-01-01

    Reinforcement learning methods perform well in many domains where a single agent needs to take a sequence of actions to perform a task. These methods use sequences of single-time-step rewards to create a policy that tries to maximize a time-extended utility, which is a (possibly discounted) sum of these rewards. In this paper we build on our previous work showing how these methods can be extended to a multi-agent environment where each agent creates its own policy that works towards maximizing a time-extended global utility over all agents' actions. We show improved methods for creating time-extended utilities for the agents that are both "aligned" with the global utility and "learnable." We then show how to create single-time-step rewards while avoiding the pitfall of having rewards aligned with the global reward lead to utilities not aligned with the global utility. Finally, we apply these reward functions to the multi-agent Gridworld problem. We explicitly quantify a utility's learnability and alignment, and show that reinforcement learning agents using the prescribed reward functions successfully trade off learnability and alignment. As a result, they outperform both global (e.g., team game) and local (e.g., "perfectly learnable") reinforcement learning solutions by as much as an order of magnitude.
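
    One concrete construction in this spirit is the difference reward, which scores an agent by the global utility computed with and without that agent's contribution, so the signal stays aligned with the global utility while depending mostly on the agent's own action. A toy Python sketch follows; the congestion-style global utility is our own invention, not the paper's Gridworld:

        import numpy as np

        def global_utility(counts, capacity=4):
            # Toy congestion utility: each location yields utility that peaks
            # near `capacity` agents and degrades when overcrowded.
            return float(np.sum(counts * np.exp(-counts / capacity)))

        def difference_reward(choices, i, n_locations=3):
            # D_i = G(z) - G(z without agent i): aligned with G, but mostly
            # sensitive to agent i's own action, hence more learnable.
            counts = np.bincount(choices, minlength=n_locations)
            g = global_utility(counts)
            counts[choices[i]] -= 1          # remove agent i's contribution
            return g - global_utility(counts)

        choices = np.array([0, 0, 1, 2, 2, 2])
        print([round(difference_reward(choices, i), 3) for i in range(len(choices))])

    Removing agent i's contribution isolates its effect on the global utility, which is what makes the signal learnable in the sense used above.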

  3. Risk-sensitive reinforcement learning.

    PubMed

    Shen, Yun; Tobia, Michael J; Sommer, Tobias; Obermayer, Klaus

    2014-07-01

    We derive a family of risk-sensitive reinforcement learning methods for agents who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.
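
    The core construction (a utility function applied to the TD error) can be sketched in a two-armed bandit. In the following Python sketch the parameters are illustrative and the piecewise-linear utility is a caricature of the prospect-theory-style transforms the paper considers:

        import numpy as np

        def utility(delta, k_gain=1.0, k_loss=1.5):
            # Piecewise-linear utility of the TD error: losses loom larger
            # than gains (k_loss > k_gain), as in prospect theory.
            return k_gain * delta if delta >= 0 else k_loss * delta

        rng = np.random.default_rng(2)
        q, alpha = np.zeros(2), 0.1
        arms = [(1.0, 0.0, 0.5), (0.5, 0.5, 1.0)]   # (high, low, p_high): risky vs safe

        for _ in range(2000):
            a = rng.integers(2)
            hi, lo, p = arms[a]
            r = hi if rng.random() < p else lo
            delta = r - q[a]                 # TD error (single-step task)
            q[a] += alpha * utility(delta)   # nonlinear transform of the TD error

        print(q)

    With k_loss > k_gain the learned value of the risky arm settles near k_gain / (k_gain + k_loss) = 0.4, below its expected payoff of 0.5, reproducing risk aversion; the safe arm's value stays at 0.5.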

  4. Reward Learning, Neurocognition, Social Cognition, and Symptomatology in Psychosis.

    PubMed

    Lewandowski, Kathryn E; Whitton, Alexis E; Pizzagalli, Diego A; Norris, Lesley A; Ongur, Dost; Hall, Mei-Hua

    2016-01-01

    …symptoms - across diagnoses, and was predictive of worse social cognition. Reward learning was not associated with neurocognitive performance, suggesting that, across patient groups, social cognition but not neurocognition may share common pathways with this aspect of reinforcement learning. Better understanding of how cognitive dysfunction and reward processing deficits relate to one another, to other key symptom dimensions (e.g., psychosis), and to diagnostic categories, may help clarify shared etiological pathways and guide efforts toward targeted treatment approaches.

  5. Mind matters: Placebo enhances reward learning in Parkinson’s disease

    PubMed Central

    Schmidt, Liane; Braun, Erin Kendall; Wager, Tor D.; Shohamy, Daphna

    2015-01-01

    Expectations have a powerful influence on how we experience the world. Neurobiological and computational models of learning suggest that dopamine is crucial for shaping expectations of reward and that expectations alone may influence dopamine levels. However, because expectations and reinforcers are typically manipulated together, the role of expectations per se has remained unclear. Here, we separated these two factors using a placebo dopaminergic manipulation in Parkinson’s patients. We combined a reward learning task with fMRI to test how expectations of dopamine release modulate learning-related activity in the brain. We found that the mere expectation of dopamine release enhances reward learning and modulates learning-related signals in the striatum and the ventromedial prefrontal cortex. These effects were selective to learning from reward: neither medication nor placebo had an effect on learning to avoid monetary loss. These findings suggest a neurobiological mechanism by which expectations shape learning and affect. PMID:25326691

  6. The dissociable effects of punishment and reward on motor learning.

    PubMed

    Galea, Joseph M; Mallia, Elizabeth; Rothwell, John; Diedrichsen, Jörn

    2015-04-01

    A common assumption regarding error-based motor learning (motor adaptation) in humans is that its underlying mechanism is automatic and insensitive to reward- or punishment-based feedback. Contrary to this hypothesis, we show in a double dissociation that the two have independent effects on the learning and retention components of motor adaptation. Negative feedback, whether graded or binary, accelerated learning. While it was not necessary for the negative feedback to be coupled to monetary loss, it had to be clearly related to the actual performance on the preceding movement. Positive feedback did not speed up learning, but it increased retention of the motor memory when performance feedback was withdrawn. These findings reinforce the view that independent mechanisms underpin learning and retention in motor adaptation, reject the assumption that motor adaptation is independent of motivational feedback, and raise new questions regarding the neural basis of negative and positive motivational feedback in motor learning.

  7. Credit assignment during movement reinforcement learning.

    PubMed

    Dam, Gregory; Kording, Konrad; Wei, Kunlin

    2013-01-01

    We often need to learn how to move based on a single performance measure that reflects the overall success of our movements. However, movements have many properties, such as their trajectories, speeds and timing of end-points, so the brain needs to decide which properties of movements should be improved: it needs to solve the credit assignment problem. Currently, little is known about how humans solve credit assignment problems in the context of reinforcement learning. Here we tested how human participants solve such problems during a trajectory-learning task. Without an explicitly-defined target movement, participants made hand reaches and received monetary rewards as feedback on a trial-by-trial basis. The curvature and direction of the attempted reach trajectories determined the monetary rewards received, in a manner that could be manipulated experimentally. Based on the history of action-reward pairs, participants quickly solved the credit assignment problem and learned the implicit payoff function. A Bayesian credit-assignment model with built-in forgetting accurately predicts their trial-by-trial learning.
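
    The study's Bayesian credit-assignment model is specific to its task, but the flavor can be sketched as a posterior over which movement property controls reward, decayed toward uniform each trial to implement forgetting. All names and constants in this Python sketch are hypothetical:

        import numpy as np

        rng = np.random.default_rng(3)
        n_props = 3          # candidate movement properties (e.g. curvature, direction, speed)
        true_prop = 1        # property that actually determines reward
        posterior = np.full(n_props, 1.0 / n_props)
        forget = 0.05        # forgetting: mix posterior back toward uniform

        for trial in range(200):
            features = rng.normal(size=n_props)       # this trial's movement properties
            reward = features[true_prop] + 0.3 * rng.normal()
            # Likelihood of the reward under each "property k is responsible" model
            like = np.exp(-0.5 * (reward - features) ** 2 / 0.3 ** 2)
            posterior *= like
            posterior /= posterior.sum()              # Bayes update
            posterior = (1 - forget) * posterior + forget / n_props  # forgetting

        print(posterior)   # mass concentrates on the truly rewarded property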

  8. How Transitions from Nonrewarded to Rewarded Trials Regulate Responding in Pavlovian and Instrumental Learning Following Extensive Acquisition Training

    ERIC Educational Resources Information Center

    Capaldi, E.J.; Haas, A.; Miller, R.M.; Martins, A.

    2005-01-01

    In both discrimination learning and partial reinforcement, transitions may occur from nonrewarded to rewarded trials (NR transition). In discrimination learning, NR transitions may occur in two different stimulus alternatives (NR different transitions). In partial reward, NR transitions may occur in a single stimulus alternative (NR same…

  9. Reinforcement Learning Through Gradient Descent

    DTIC Science & Technology

    1999-05-14

    Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for… practice of existing types of algorithms, the gradient descent approach makes it possible to create entirely new classes of reinforcement learning algorithms…

  10. Synthetic cathinones and their rewarding and reinforcing effects in rodents

    PubMed Central

    Watterson, Lucas R.; Olive, M. Foster

    2014-01-01

    Synthetic cathinones, colloquially referred to as “bath salts”, are derivatives of the psychoactive alkaloid cathinone found in Catha edulis (Khat). Since the mid-to-late 2000’s, these amphetamine-like psychostimulants have gained popularity amongst drug users due to their potency, low cost, ease of procurement, and constantly evolving chemical structures. Concomitant with their increased use is the emergence of a growing collection of case reports of bizarre and dangerous behaviors, toxicity to numerous organ systems, and death. However, scientific information regarding the abuse liability of these drugs has been relatively slower to materialize. Recently we have published several studies demonstrating that laboratory rodents will readily self-administer the “first generation” synthetic cathinones methylenedioxypyrovalerone (MDPV) and methylone via the intravenous route, in patterns similar to those of methamphetamine. Under progressive ratio schedules of reinforcement, the rank order of reinforcing efficacy of these compounds are MDPV ≥ methamphetamine > methylone. MDPV and methylone, as well as the “second generation” synthetic cathinones α-pyrrolidinovalerophenone (α-PVP) and 4-methylethcathinone (4-MEC), also dose-dependently increase brain reward function. Collectively, these findings indicate that synthetic cathinones have a high abuse and addiction potential and underscore the need for future assessment of the extent and duration of neurotoxicity induced by these emerging drugs of abuse. PMID:25328910

  11. Phasic dopamine as a prediction error of intrinsic and extrinsic reinforcements driving both action acquisition and reward maximization: a simulated robotic study.

    PubMed

    Mirolli, Marco; Santucci, Vieri G; Baldassarre, Gianluca

    2013-03-01

    An important issue of recent neuroscientific research is to understand the functional role of the phasic release of dopamine in the striatum, and in particular its relation to reinforcement learning. The literature is split between two alternative hypotheses: one considers phasic dopamine as a reward prediction error similar to the computational TD-error, whose function is to guide an animal to maximize future rewards; the other holds that phasic dopamine is a sensory prediction error signal that lets the animal discover and acquire novel actions. In this paper we propose an original hypothesis that integrates these two contrasting positions: according to our view phasic dopamine represents a TD-like reinforcement prediction error learning signal determined by both unexpected changes in the environment (temporary, intrinsic reinforcements) and biological rewards (permanent, extrinsic reinforcements). Accordingly, dopamine plays the functional role of driving both the discovery and acquisition of novel actions and the maximization of future rewards. To validate our hypothesis we perform a series of experiments with a simulated robotic system that has to learn different skills in order to get rewards. We compare different versions of the system in which we vary the composition of the learning signal. The results show that only the system reinforced by both extrinsic and intrinsic reinforcements is able to reach high performance in sufficiently complex conditions.
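
    The integration proposed here amounts to summing a permanent extrinsic reward with a temporary intrinsic one before computing a single TD-like error; a schematic Python sketch of our own minimal rendering of the hypothesis:

        import numpy as np

        gamma, alpha = 0.95, 0.1
        Q = {}  # state-action values, keyed by (state, action)

        def td_error(s, a, r_extrinsic, r_intrinsic, s_next, actions):
            # One TD-like reinforcement prediction error combining a permanent
            # extrinsic reward with a temporary intrinsic one (e.g. surprise at
            # an unexpected environmental change, which habituates over time).
            r = r_extrinsic + r_intrinsic
            v_next = max(Q.get((s_next, b), 0.0) for b in actions)
            delta = r + gamma * v_next - Q.get((s, a), 0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
            return delta

        # Early on, a novel light turning on yields intrinsic reward even
        # in the absence of any food reward:
        print(td_error('s0', 'press', r_extrinsic=0.0, r_intrinsic=1.0,
                       s_next='s1', actions=['press', 'wait']))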

  12. Contextual modulation of value signals in reward and punishment learning.

    PubMed

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-08-25

    Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values on a relative (context-dependent) scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of option values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system.
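
    The proposed computation (option values learned relative to a context value that acts as the reference point) can be sketched as follows; the punishment-avoidance simulation and constants in this Python sketch are illustrative:

        import numpy as np

        alpha_v, alpha_q = 0.1, 0.2
        context_value = 0.0            # V(c): learned reference point for this context
        option_values = {'avoid': 0.0, 'other': 0.0}

        def update(option, outcome):
            # Outcomes are recentred on the context value before updating the
            # option value, so avoidance in a bad context gains positive value.
            global context_value
            relative = outcome - context_value        # reference-point subtraction
            option_values[option] += alpha_q * (relative - option_values[option])
            context_value += alpha_v * (outcome - context_value)

        rng = np.random.default_rng(4)
        for _ in range(300):
            # Punishment context: outcomes are -1 (shock) or 0 (successful avoidance)
            option = 'avoid' if rng.random() < 0.7 else 'other'
            outcome = 0.0 if option == 'avoid' else -1.0
            update(option, outcome)

        print(context_value, option_values)   # 'avoid' acquires a positive value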

  13. Contextual modulation of value signals in reward and punishment learning

    PubMed Central

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-01-01

    Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values on a relative (context-dependent) scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of option values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system. PMID:26302782

  14. Statistical Mechanics of the Delayed Reward-Based Learning with Node Perturbation

    NASA Astrophysics Data System (ADS)

    Saito, Hiroshi; Katahira, Kentaro; Okanoya, Kazuo; Okada, Masato

    2010-06-01

    In reward-based learning, reward is typically given with some delay after the behavior that causes it. In the machine learning literature, the framework of the eligibility trace has been used as one solution for handling delayed reward in reinforcement learning. Recent studies suggest that the eligibility trace is also important for a difficult neuroscience problem known as the "distal reward problem". Node perturbation is a stochastic gradient method, one of many reinforcement learning implementations, that estimates the gradient by introducing perturbations into a network. Since stochastic gradient methods do not require the derivative of an objective function, they may be able to account for the learning mechanisms of a complex system such as the brain. We study node perturbation with the eligibility trace as a specific example of delayed reward-based learning and analyze it using a statistical mechanics approach. As a result, we show the optimal time constant of the eligibility trace with respect to the reward delay and the existence of unlearnable parameter configurations.
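
    A minimal Python sketch of node perturbation with an eligibility trace bridging a fixed reward delay, for a single linear unit (a toy rendering of the kind of rule the paper analyzes; all constants are illustrative):

        import numpy as np
        from collections import deque

        rng = np.random.default_rng(5)
        n_in, sigma, eta, lam, delay = 10, 0.1, 0.005, 0.7, 3
        w_true = rng.normal(size=n_in)     # target weights defining the task
        w = np.zeros(n_in)
        trace = np.zeros(n_in)
        pending = deque()                  # rewards in transit (delayed delivery)

        for step in range(20000):
            x = rng.normal(size=n_in)
            xi = sigma * rng.normal()      # node perturbation on the unit's output
            y = w @ x + xi
            # Reward advantage: perturbed vs unperturbed performance
            adv = -((y - w_true @ x) ** 2) + ((w @ x - w_true @ x) ** 2)
            pending.append(adv)
            trace = lam * trace + xi * x   # eligibility trace of perturbation-input products
            if len(pending) > delay:       # reward arrives `delay` steps late
                w += (eta / sigma ** 2) * pending.popleft() * trace

        print(float(np.linalg.norm(w - w_true)))  # shrinks well below ||w_true||

    Only the trace term from the rewarded step correlates with the delayed reward, attenuated by lam**delay; this is the trade-off between trace time constant and reward delay that the paper characterizes.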

  15. Digital Badges--Rewards for Learning?

    ERIC Educational Resources Information Center

    Shields, Rebecca; Chugh, Ritesh

    2017-01-01

    Digital badges are quickly becoming an appropriate, easy and efficient way for educators, community groups and other professional organisations, to exhibit and reward participants for skills obtained in professional development or formal and informal learning. This paper offers an account of digital badges, how they work and the underlying…

  16. Inter-module credit assignment in modular reinforcement learning.

    PubMed

    Samejima, Kazuyuki; Doya, Kenji; Kawato, Mitsuo

    2003-09-01

    Critical issues in modular or hierarchical reinforcement learning (RL) are (i) how to decompose a task into sub-tasks, (ii) how to achieve independence of learning of sub-tasks, and (iii) how to assure optimality of the composite policy for the entire task. The second and last requirements are often under trade-off. We propose a method for propagating the reward for the entire task achievement between modules. This is done in the form of a 'modular reward', which is calculated from the temporal difference of the module gating signal and the value of the succeeding module. We implement modular reward for a multiple model-based reinforcement learning (MMRL) architecture and show its effectiveness in simulations of a pursuit task with hidden states and a continuous-time non-linear control task.

  17. Learning Analytics: Readiness and Rewards

    ERIC Educational Resources Information Center

    Friesen, Norm

    2013-01-01

    This position paper introduces the relatively new field of learning analytics, first by considering the relevant meanings of both "learning" and "analytics," and then by looking at two main levels at which learning analytics can be or has been implemented in educational organizations. Although integrated turnkey systems or…

  18. Optogenetic mimicry of the transient activation of dopamine neurons by natural reward is sufficient for operant reinforcement.

    PubMed

    Kim, Kyung Man; Baratta, Michael V; Yang, Aimei; Lee, Doheon; Boyden, Edward S; Fiorillo, Christopher D

    2012-01-01

    Activation of dopamine receptors in forebrain regions, for minutes or longer, is known to be sufficient for positive reinforcement of stimuli and actions. However, the firing rate of dopamine neurons is increased for only about 200 milliseconds following natural reward events that are better than expected, a response which has been described as a "reward prediction error" (RPE). Although RPE drives reinforcement learning (RL) in computational models, it has not been possible to directly test whether the transient dopamine signal actually drives RL. Here we have performed optical stimulation of genetically targeted ventral tegmental area (VTA) dopamine neurons expressing Channelrhodopsin-2 (ChR2) in mice. We mimicked the transient activation of dopamine neurons that occurs in response to natural reward by applying a light pulse of 200 ms in VTA. When a single light pulse followed each self-initiated nose poke, it was sufficient in itself to cause operant reinforcement. Furthermore, when optical stimulation was delivered in separate sessions according to a predetermined pattern, it increased locomotion and contralateral rotations, behaviors that are known to result from activation of dopamine neurons. All three of the optically induced operant and locomotor behaviors were tightly correlated with the number of VTA dopamine neurons that expressed ChR2, providing additional evidence that the behavioral responses were caused by activation of dopamine neurons. These results provide strong evidence that the transient activation of dopamine neurons provides a functional reward signal that drives learning, in support of RL theories of dopamine function.

  19. Optogenetic Mimicry of the Transient Activation of Dopamine Neurons by Natural Reward Is Sufficient for Operant Reinforcement

    PubMed Central

    Kim, Kyung Man; Baratta, Michael V.; Yang, Aimei; Lee, Doheon; Boyden, Edward S.; Fiorillo, Christopher D.

    2012-01-01

    Activation of dopamine receptors in forebrain regions, for minutes or longer, is known to be sufficient for positive reinforcement of stimuli and actions. However, the firing rate of dopamine neurons is increased for only about 200 milliseconds following natural reward events that are better than expected, a response which has been described as a “reward prediction error” (RPE). Although RPE drives reinforcement learning (RL) in computational models, it has not been possible to directly test whether the transient dopamine signal actually drives RL. Here we have performed optical stimulation of genetically targeted ventral tegmental area (VTA) dopamine neurons expressing Channelrhodopsin-2 (ChR2) in mice. We mimicked the transient activation of dopamine neurons that occurs in response to natural reward by applying a light pulse of 200 ms in VTA. When a single light pulse followed each self-initiated nose poke, it was sufficient in itself to cause operant reinforcement. Furthermore, when optical stimulation was delivered in separate sessions according to a predetermined pattern, it increased locomotion and contralateral rotations, behaviors that are known to result from activation of dopamine neurons. All three of the optically induced operant and locomotor behaviors were tightly correlated with the number of VTA dopamine neurons that expressed ChR2, providing additional evidence that the behavioral responses were caused by activation of dopamine neurons. These results provide strong evidence that the transient activation of dopamine neurons provides a functional reward signal that drives learning, in support of RL theories of dopamine function. PMID:22506004

  20. Reinforcement learning: the good, the bad and the ugly.

    PubMed

    Dayan, Peter; Niv, Yael

    2008-04-01

    Reinforcement learning provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making in the face of rewards and punishments. Here we review the latest dispatches from the forefront of this field, and map out some of the territories where lie monsters.

  1. Changes in corticostriatal connectivity during reinforcement learning in humans.

    PubMed

    Horga, Guillermo; Maia, Tiago V; Marsh, Rachel; Hao, Xuejun; Xu, Dongrong; Duan, Yunsuo; Tau, Gregory Z; Graniello, Barbara; Wang, Zhishun; Kangarlu, Alayar; Martinez, Diana; Packard, Mark G; Peterson, Bradley S

    2015-02-01

    Many computational models assume that reinforcement learning relies on changes in synaptic efficacy between cortical regions representing stimuli and striatal regions involved in response selection, but this assumption has thus far lacked empirical support in humans. We recorded hemodynamic signals with fMRI while participants navigated a virtual maze to find hidden rewards. We fitted a reinforcement-learning algorithm to participants' choice behavior and evaluated the neural activity and the changes in functional connectivity related to trial-by-trial learning variables. Activity in the posterior putamen during choice periods increased progressively during learning. Furthermore, the functional connections between the sensorimotor cortex and the posterior putamen strengthened progressively as participants learned the task. These changes in corticostriatal connectivity differentiated participants who learned the task from those who did not. These findings provide a direct link between changes in corticostriatal connectivity and learning, thereby supporting a central assumption common to several computational models of reinforcement learning.
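
    Fitting a reinforcement-learning algorithm to choice behavior, as done here, typically means maximizing the likelihood of the observed choices under a Q-learning-plus-softmax model. A minimal Python sketch of that generic recipe (not the study's exact model):

        import numpy as np
        from scipy.optimize import minimize

        def neg_log_likelihood(params, choices, rewards, n_actions=2):
            # Negative log-likelihood of observed choices under tabular
            # Q-learning with softmax choice; params = (alpha, beta).
            alpha, beta = params
            q = np.zeros(n_actions)
            nll = 0.0
            for a, r in zip(choices, rewards):
                logits = beta * q
                m = logits.max()
                logp = logits - (m + np.log(np.exp(logits - m).sum()))
                nll -= logp[a]
                q[a] += alpha * (r - q[a])   # trial-by-trial learning variable
            return nll

        # Simulate a participant with known parameters, then recover them.
        rng = np.random.default_rng(6)
        alpha_true, beta_true, q = 0.3, 4.0, np.zeros(2)
        choices, rewards = [], []
        for _ in range(400):
            p = np.exp(beta_true * q) / np.exp(beta_true * q).sum()
            a = rng.choice(2, p=p)
            r = float(rng.random() < (0.8 if a == 0 else 0.3))
            choices.append(a)
            rewards.append(r)
            q[a] += alpha_true * (r - q[a])

        fit = minimize(neg_log_likelihood, x0=[0.5, 1.0], args=(choices, rewards),
                       bounds=[(0.01, 1.0), (0.1, 20.0)])
        print(fit.x)   # estimates should land near (0.3, 4.0)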

  2. Reward: From Basic Reinforcers to Anticipation of Social Cues.

    PubMed

    Rademacher, Lena; Schulte-Rüther, Martin; Hanewald, Bernd; Lammertz, Sarah

    2017-01-01

    Reward processing plays a major role in goal-directed behavior and motivation. On the neural level, it is mediated by a complex network of brain structures called the dopaminergic reward system. In the last decade, neuroscientific researchers have become increasingly interested in aspects of social interaction that are experienced as rewarding. Recent neuroimaging studies have provided evidence that the reward system mediates the processing of social stimuli in a manner analogous to nonsocial rewards and thus motivates social behavior. In this context, the neuropeptide oxytocin is assumed to play a key role by activating dopaminergic reward pathways in response to social cues, inducing the rewarding quality of social interactions. Alterations in the dopaminergic reward system have been found in several psychiatric disorders that are accompanied by social interaction and motivation problems, for example autism, attention deficit/hyperactivity disorder, addiction disorders, and schizophrenia.

  3. Psychological distance to reward: Segmentation of aperiodic schedules of reinforcement

    PubMed Central

    Leung, Jin-Pang

    1993-01-01

    College students responded for monetary rewards in two experiments on choice between differentially segmented aperiodic schedules of reinforcement. On a microcomputer, the concurrent chains were simulated as an air-defense video game in which subjects used two radars for detecting and destroying enemy aircraft. To earn more cash-exchangeable points, subjects had to shoot down as many planes as possible within a given period of time. For both experiments, access to one of two radar systems (terminal link) was controlled by a pair of independent concurrent variable-interval 60-s schedules (initial link) with a 4-s changeover delay always in effect. In Experiment 1, the appearance of an enemy aircraft in the terminal link was determined by a variable-interval (15 s or 60 s) schedule or a two-component chained variable-interval schedule of equal duration. Experiment 2 was similar to Experiment 1 except for the segmented schedule, which had three components. Subjects preferred the unsegmented schedule over its segmented counterpart in the conditions with variable-interval 60 s, and preference tended to be more pronounced with more components in the segmented schedule. These findings are compatible with those from previous studies of periodic and aperiodic schedules with pigeons or humans as subjects. PMID:16812691

  4. Implicit and explicit reward learning in chronic nicotine use.

    PubMed

    Paelecke-Habermann, Yvonne; Paelecke, Marko; Giegerich, Katharina; Reschke, Katja; Kübler, Andrea

    2013-04-01

    Chronic tobacco use is related to specific neurobiological alterations in the dopaminergic brain reward system that can be termed "reward deficiency syndrome" in dependent nicotine consumers. The close linkage of dopaminergic activity and reward learning led us to expect implicit and explicit reward learning deficits in dependent compared to non-smokers. Smokers who maintain a less regular, occasional use may also, to a lesser extent, show implicit reward learning deficits. The purpose of our study was to examine the behavioral effects of the neurobiological alterations on reward related learning. We also tested whether any deficits observed in an abstinent state are also present in a satiated state. In two studies, we examined implicit and explicit reward learning in smokers. Participants were administered a probabilistic implicit reward learning task, and an explicit reward- and punishment-based trial-and-error learning task. In Study 1, we compared dependent, occasional, and non-smokers, and in Study 2 satiated and abstinent smokers. In Study 1, chronic and occasional smokers showed impairments in both implicit and explicit reward learning tasks. In Study 2, satiated smokers did not perform better than abstinent smokers. The results support the hypothesis of reward learning deficits. These deficits are not limited to explicit but extend to implicit reward learning and cannot be explained by tobacco withdrawal.

  5. Roles of octopaminergic and dopaminergic neurons in mediating reward and punishment signals in insect visual learning.

    PubMed

    Unoki, Sae; Matsumoto, Yukihisa; Mizunami, Makoto

    2006-10-01

    Insects, like vertebrates, have considerable ability to associate visual, olfactory or other sensory signals with reward or punishment. Previous studies in crickets, honey bees and fruit-flies have suggested that octopamine (OA, invertebrate counterpart of noradrenaline) and dopamine (DA) mediate various kinds of reward and punishment signals in olfactory learning. However, whether the roles of OA and DA in mediating positive and negative reinforcing signals can be generalized to learning of sensory signals other than odors remained unknown. Here we first established a visual learning paradigm for crickets in which a visual pattern is associated with water reward or saline punishment, and found that memory after aversive conditioning decayed much faster than that after appetitive conditioning. Then, we pharmacologically studied the roles of OA and DA in appetitive and aversive forms of visual learning. Crickets injected with epinastine or mianserin, OA receptor antagonists, into the hemolymph exhibited a complete impairment of appetitive learning to associate a visual pattern with water reward, but aversive learning with saline punishment was unaffected. By contrast, fluphenazine, chlorpromazine or spiperone, DA receptor antagonists, completely impaired aversive learning without affecting appetitive learning. The results demonstrate that OA and DA participate in reward and punishment conditioning in visual learning. This finding, together with results of previous studies on the roles of OA and DA in olfactory learning, suggests ubiquitous roles of the octopaminergic reward system and dopaminergic punishment system in insect learning.

  6. Meta-learning in reinforcement learning.

    PubMed

    Schweighofer, Nicolas; Doya, Kenji

    2003-01-01

    Meta-parameters in reinforcement learning should be tuned to the environmental dynamics and the animal performance. Here, we propose a biologically plausible meta-reinforcement learning algorithm for tuning these meta-parameters in a dynamic, adaptive manner. We tested our algorithm in both a simulation of a Markov decision task and in a non-linear control task. Our results show that the algorithm robustly finds appropriate meta-parameter values, and controls the meta-parameter time course, in both static and dynamic environments. We suggest that the phasic and tonic components of dopamine neuron firing can encode the signal required for meta-learning of reinforcement learning.
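
    The scheme can be caricatured as stochastic search over a meta-parameter, retaining perturbations whenever a mid-term reward average beats a long-term baseline. The following Python sketch is a simplified rendering in the spirit of the paper, with invented constants; here the tuned meta-parameter is the softmax inverse temperature:

        import numpy as np

        rng = np.random.default_rng(7)

        def run_agent(beta, n_trials=200):
            # Tiny two-armed bandit; returns mean reward obtained for a given
            # softmax inverse temperature (the meta-parameter being tuned).
            q, p_reward = np.zeros(2), [0.8, 0.4]
            total = 0.0
            for _ in range(n_trials):
                p = np.exp(beta * q) / np.exp(beta * q).sum()
                a = rng.choice(2, p=p)
                r = float(rng.random() < p_reward[a])
                q[a] += 0.2 * (r - q[a])
                total += r
            return total / n_trials

        beta, sigma = 0.5, 0.3
        long_term = run_agent(beta)               # slow baseline reward average
        for episode in range(50):
            trial_beta = max(0.05, beta + sigma * rng.normal())  # perturb meta-parameter
            mid_term = run_agent(trial_beta)      # fast reward average under perturbation
            if mid_term > long_term:              # keep changes that beat the baseline
                beta = trial_beta
            long_term += 0.1 * (mid_term - long_term)

        print(beta)   # typically drifts toward a higher, more exploitative value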

  7. How we learn to make decisions: rapid propagation of reinforcement learning prediction errors in humans.

    PubMed

    Krigolson, Olav E; Hassall, Cameron D; Handy, Todd C

    2014-03-01

    Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors-discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833-1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129-141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769-776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679-709, 2002]. Here, we used the event-related brain potential (ERP) technique to demonstrate that not only do rewards elicit a neural response akin to a prediction error but also that this signal rapidly diminished and propagated to the time of choice presentation with learning. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component that has a similar timing and topography as the feedback error-related negativity that increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward…

  8. The combination of appetitive and aversive reinforcers and the nature of their interaction during auditory learning.

    PubMed

    Ilango, A; Wetzel, W; Scheich, H; Ohl, F W

    2010-03-31

    Learned changes in behavior can be elicited by either appetitive or aversive reinforcers. It is, however, not clear whether the two types of motivation (approaching appetitive stimuli and avoiding aversive stimuli) drive learning in the same or different ways, nor is their interaction understood in situations where the two types are combined in a single experiment. To investigate this question we have developed a novel learning paradigm for Mongolian gerbils, which not only allows rewards and punishments to be presented in isolation or in combination with each other, but also can use these opposite reinforcers to drive the same learned behavior. Specifically, we studied learning of tone-conditioned hurdle crossing in a shuttle box driven by either an appetitive reinforcer (brain stimulation reward) or an aversive reinforcer (electrical footshock), or by a combination of both. Combination of the two reinforcers potentiated speed of acquisition, led to maximum possible performance, and delayed extinction as compared to either reinforcer alone. Additional experiments, using partial reinforcement protocols and experiments in which one of the reinforcers was omitted after the animals had been previously trained with the combination of both reinforcers, indicated that appetitive and aversive reinforcers operated together but acted in different ways: in this particular experimental context, punishment appeared to be more effective for initial acquisition and reward more effective to maintain a high level of conditioned responses (CRs). The results imply that learning mechanisms in problem solving were maximally effective when the initial punishment of mistakes was combined with the subsequent rewarding of correct performance.

  9. The Dopamine Prediction Error: Contributions to Associative Models of Reward Learning

    PubMed Central

    Nasser, Helen M.; Calu, Donna J.; Schoenbaum, Geoffrey; Sharpe, Melissa J.

    2017-01-01

    Phasic activity of midbrain dopamine neurons is currently thought to encapsulate the prediction-error signal described in Sutton and Barto’s (1981) model-free reinforcement learning algorithm. This phasic signal is thought to contain information about the quantitative value of reward, which transfers to the reward-predictive cue after learning. This is argued to endow the reward-predictive cue with the value inherent in the reward, motivating behavior toward cues signaling the presence of reward. Yet theoretical and empirical research has implicated prediction-error signaling in learning that extends far beyond a transfer of quantitative value to a reward-predictive cue. Here, we review the research which demonstrates the complexity of how dopaminergic prediction errors facilitate learning. After briefly discussing the literature demonstrating that phasic dopaminergic signals can act in the manner described by Sutton and Barto (1981), we consider how these signals may also influence attentional processing across multiple attentional systems in distinct brain circuits. Then, we discuss how prediction errors encode and promote the development of context-specific associations between cues and rewards. Finally, we consider recent evidence that shows dopaminergic activity contains information about causal relationships between cues and rewards that reflect information garnered from rich associative models of the world that can be adapted in the absence of direct experience. In discussing this research we hope to support the expansion of how dopaminergic prediction errors are thought to contribute to the learning process beyond the traditional concept of transferring quantitative value. PMID:28275359

  10. The Dopamine Prediction Error: Contributions to Associative Models of Reward Learning.

    PubMed

    Nasser, Helen M; Calu, Donna J; Schoenbaum, Geoffrey; Sharpe, Melissa J

    2017-01-01

    Phasic activity of midbrain dopamine neurons is currently thought to encapsulate the prediction-error signal described in Sutton and Barto's (1981) model-free reinforcement learning algorithm. This phasic signal is thought to contain information about the quantitative value of reward, which transfers to the reward-predictive cue after learning. This is argued to endow the reward-predictive cue with the value inherent in the reward, motivating behavior toward cues signaling the presence of reward. Yet theoretical and empirical research has implicated prediction-error signaling in learning that extends far beyond a transfer of quantitative value to a reward-predictive cue. Here, we review the research which demonstrates the complexity of how dopaminergic prediction errors facilitate learning. After briefly discussing the literature demonstrating that phasic dopaminergic signals can act in the manner described by Sutton and Barto (1981), we consider how these signals may also influence attentional processing across multiple attentional systems in distinct brain circuits. Then, we discuss how prediction errors encode and promote the development of context-specific associations between cues and rewards. Finally, we consider recent evidence that shows dopaminergic activity contains information about causal relationships between cues and rewards that reflect information garnered from rich associative models of the world that can be adapted in the absence of direct experience. In discussing this research we hope to support the expansion of how dopaminergic prediction errors are thought to contribute to the learning process beyond the traditional concept of transferring quantitative value.

  11. Model-based reinforcement learning with dimension reduction.

    PubMed

    Tangkaratt, Voot; Morimoto, Jun; Sugiyama, Masashi

    2016-12-01

    The goal of reinforcement learning is to learn an optimal policy which controls an agent to acquire the maximum cumulative reward. The model-based reinforcement learning approach learns a transition model of the environment from data, and then derives the optimal policy using the transition model. However, learning an accurate transition model in high-dimensional environments requires a large amount of data which is difficult to obtain. To overcome this difficulty, in this paper, we propose to combine model-based reinforcement learning with the recently developed least-squares conditional entropy (LSCE) method, which simultaneously performs transition model estimation and dimension reduction. We also further extend the proposed method to imitation learning scenarios. The experimental results show that policy search combined with LSCE performs well for high-dimensional control tasks including real humanoid robot control.

  12. Stochastic optimization of multireservoir systems via reinforcement learning

    NASA Astrophysics Data System (ADS)

    Lee, Jin-Hee; Labadie, John W.

    2007-11-01

    Although several variants of stochastic dynamic programming have been applied to optimal operation of multireservoir systems, they have been plagued by a high-dimensional state space and the inability to accurately incorporate the stochastic environment as characterized by temporally and spatially correlated hydrologic inflows. Reinforcement learning has emerged as an effective approach to solving sequential decision problems by combining concepts from artificial intelligence, cognitive science, and operations research. A reinforcement learning system has a mathematical foundation similar to dynamic programming and Markov decision processes, with the goal of maximizing the long-term reward or returns as conditioned on the state of the system environment and the immediate reward obtained from operational decisions. Reinforcement learning can include Monte Carlo simulation where transition probabilities and rewards are not explicitly known a priori. The Q-Learning method in reinforcement learning is demonstrated on the two-reservoir Geum River system, South Korea, and is shown to outperform implicit stochastic dynamic programming and sampling stochastic dynamic programming methods.

  13. Model-based reinforcement learning under concurrent schedules of reinforcement in rodents.

    PubMed

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-05-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's knowledge or model of the environment in model-based reinforcement learning algorithms. To investigate how animals update value functions, we trained rats under two different free-choice tasks. The reward probability of the unchosen target remained unchanged in one task, whereas it increased over time since the target was last chosen in the other task. The results show that goal choice probability increased as a function of the number of consecutive alternative choices in the latter, but not the former task, indicating that the animals were aware of time-dependent increases in arming probability and used this information in choosing goals. In addition, the choice behavior in the latter task was better accounted for by a model-based reinforcement learning algorithm. Our results show that rats adopt a decision-making process that cannot be accounted for by simple reinforcement learning models even in a relatively simple binary choice task, suggesting that rats can readily improve their decision-making strategy through the knowledge of their environments.

  14. Reinforcement Learning and Savings Behavior.

    PubMed

    Choi, James J; Laibson, David; Madrian, Brigitte C; Metrick, Andrew

    2009-12-01

    We show that individual investors over-extrapolate from their personal experience when making savings decisions. Investors who experience particularly rewarding outcomes from saving in their 401(k) (a high average and/or low variance return) increase their 401(k) savings rate more than investors who have less rewarding experiences with saving. This finding is not driven by aggregate time-series shocks, income effects, rational learning about investing skill, investor fixed effects, or time-varying investor-level heterogeneity that is correlated with portfolio allocations to stock, bond, and cash asset classes. We discuss implications for the equity premium puzzle and interventions aimed at improving household financial outcomes.

  15. Optimizing reproducibility of operant testing through reinforcer standardization: identification of key nutritional constituents determining reward strength in touchscreens.

    PubMed

    Kim, Eun Woo; Phillips, Benjamin U; Heath, Christopher J; Cho, So Yeon; Kim, Hyunjeong; Sreedharan, Jemeen; Song, Ho-Taek; Lee, Jong Eun; Bussey, Timothy J; Kim, Chul Hoon; Kim, Eosu; Saksida, Lisa M

    2017-07-17

    Reliable and reproducible assessment of animal learning and behavior is a central aim of basic and translational neuroscience research. Recent developments in automated operant chamber technology have led to the possibility of universal standard protocols, in addition to increased translational potential, reliability and accuracy. However, the impact of regional and national differences in the supplies of available reinforcers in this system on behavioural performance and inter-laboratory variability is an unknown and at present uncontrolled variable. Therefore, we aimed to identify which constituent(s) of the reward determines reinforcer strength to enable improved standardization of this parameter across laboratories. Male C57BL/6 mice were examined in the touchscreen-based fixed ratio (FR) and progressive ratio (PR) schedules, reinforced with different kinds of milk-based reinforcers to directly compare the incentive values of plain milk (PM, high-calorie: high-fat/low-sugar), strawberry-flavored milk (SM, high-calorie: low-fat/high-sugar), and semi-skimmed low-fat milk (LM, low-calorie: low-fat/low-sugar) on the basis of differences in caloric content, sugar/fat content, and flavor. Use of a higher caloric content reward was effective in increasing operant training acquisition rate. Total trial number completed in FR and breakpoint in PR were higher using the two isocaloric milk products (PM and SM) than the lower caloric LM, with comparable outcomes between PM and SM conditions, suggesting that total caloric content determines reward strength. Analysis of within-session changes in response rate revealed that overall outputs in FR and PR primarily depend on the response rate at the initial phase of a session, which itself was dependent on reinforcer caloric content. Interestingly, the rate of satiation, indicated by decay in response rate within a FR session, was highest when reinforced with SM, suggesting a rapid satiating effect of sugar. The key contribution…

  16. Differential Reward Learning for Self and Others Predicts Self-Reported Altruism

    PubMed Central

    Kwak, Youngbin; Pearson, John; Huettel, Scott A.

    2014-01-01

    In social environments, decisions not only determine rewards for oneself but also for others. However, individual differences in pro-social behaviors have been typically studied through self-report. We developed a decision-making paradigm in which participants chose from card decks with differing rewards for themselves and charity; some decks gave similar rewards to both, while others gave higher rewards for one or the other. We used a reinforcement-learning model that estimated each participant's relative weighting of self versus charity reward. As shown both in choices and model parameters, individuals who showed relatively better learning of rewards for charity – compared to themselves – were more likely to engage in pro-social behavior outside of a laboratory setting, as indicated by self-report. Overall rates of reward learning, however, did not predict individual differences in pro-social tendencies. These results support the idea that biases toward learning about social rewards are associated with one's altruistic tendencies. PMID:25215883
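
    A minimal Python sketch of the kind of model described, with a hypothetical weighting parameter w_self that trades off one's own reward against the charity's inside a standard delta-rule update (all names and values are assumptions, not the paper's):

        alpha = 0.2     # learning rate (assumed)
        w_self = 0.6    # weighting of self reward vs. charity reward (assumed)

        def update_deck_value(values, deck, r_self, r_charity):
            # Blend the two reward streams into one subjective outcome, then
            # apply a standard prediction-error update to the chosen deck.
            r = w_self * r_self + (1.0 - w_self) * r_charity
            values[deck] += alpha * (r - values[deck])

        values = {"deck_self": 0.0, "deck_charity": 0.0, "deck_both": 0.0}
        update_deck_value(values, "deck_both", r_self=2.0, r_charity=2.0)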

  17. Reinforcement learning in supply chains.

    PubMed

    Valluri, Annapurna; North, Michael J; Macal, Charles M

    2009-10-01

    Effective management of supply chains creates value and can strategically position companies. In practice, human beings have been found to be both surprisingly successful and disappointingly inept at managing supply chains. The related fields of cognitive psychology and artificial intelligence have postulated a variety of potential mechanisms to explain this behavior. One of the leading candidates is reinforcement learning. This paper applies agent-based modeling to investigate the comparative behavioral consequences of three simple reinforcement learning algorithms in a multi-stage supply chain. For the first time, our findings show that the specific algorithm that is employed can have dramatic effects on the results obtained. Reinforcement learning is found to be valuable in multi-stage supply chains with several learning agents, as independent agents can learn to coordinate their behavior. However, learning in multi-stage supply chains using these postulated approaches from cognitive psychology and artificial intelligence takes extremely long time periods to achieve stability, which raises questions about their ability to explain behavior in real supply chains. The fact that it takes thousands of periods for agents to learn in this simple multi-agent setting provides new evidence that real-world decision makers are unlikely to be using strict reinforcement learning in practice.
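
    As a rough illustration of the class of learners involved, the following Python sketch shows a tabular Q-learning agent for a single stage's ordering decision, under assumed state and action encodings; each stage of the chain would run its own copy:

        import random
        from collections import defaultdict

        alpha, gamma, epsilon = 0.1, 0.95, 0.1   # assumed parameters
        ORDERS = range(5)                        # hypothetical order quantities
        Q = defaultdict(float)                   # (inventory, order) -> value

        def choose_order(inventory):
            # Epsilon-greedy action selection over order quantities.
            if random.random() < epsilon:
                return random.choice(list(ORDERS))
            return max(ORDERS, key=lambda a: Q[(inventory, a)])

        def learn(inventory, order, profit, next_inventory):
            # One-step Q-learning update on the ordering decision.
            best_next = max(Q[(next_inventory, a)] for a in ORDERS)
            Q[(inventory, order)] += alpha * (profit + gamma * best_next
                                              - Q[(inventory, order)])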

  18. Role of Dopamine D2 Receptors in Human Reinforcement Learning

    PubMed Central

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-01-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well. PMID:24713613

  19. Role of dopamine D2 receptors in human reinforcement learning.

    PubMed

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-09-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well.

  20. Habits, action sequences and reinforcement learning.

    PubMed

    Dezfouli, Amir; Balleine, Bernard W

    2012-04-01

    It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary, and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions.

  1. Habits, action sequences, and reinforcement learning

    PubMed Central

    Dezfouli, Amir; Balleine, Bernard W.

    2012-01-01

    It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions. PMID:22487034

  2. Reinforcement active learning in the vibrissae system: optimal object localization.

    PubMed

    Gordon, Goren; Dorfman, Nimrod; Ahissar, Ehud

    2013-01-01

    Rats move their whiskers to acquire information about their environment. It has been observed that they palpate novel objects and objects they are required to localize in space. We analyze whisker-based object localization using two complementary paradigms, namely, active learning and intrinsic-reward reinforcement learning. Active learning algorithms select the next training samples according to the hypothesized solution in order to better discriminate between correct and incorrect labels. Intrinsic-reward reinforcement learning uses prediction errors as the reward to an actor-critic design, such that behavior converges to the one that optimizes the learning process. We show that in the context of object localization, the two paradigms result in palpation whisking as their respective optimal solution. These results suggest that rats may employ principles of active learning and/or intrinsic reward in tactile exploration and can guide future research to seek the underlying neuronal mechanisms that implement them. Furthermore, these paradigms are easily transferable to biomimetic whisker-based artificial sensors and can improve the active exploration of their environment. Copyright © 2012 Elsevier Ltd. All rights reserved.
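
    A schematic Python sketch of the intrinsic-reward paradigm, with the prediction error itself serving as reward in an actor-critic loop; the forward model and sensor below are stand-ins, and all names are assumptions:

        import numpy as np

        rng = np.random.default_rng(0)
        value, policy_param = 0.0, 0.0    # critic estimate and whisk amplitude
        alpha_v, alpha_p = 0.1, 0.05      # assumed learning rates

        for trial in range(200):
            amplitude = policy_param + rng.normal(0.0, 0.1)  # exploratory whisk
            observation = np.sin(amplitude)     # stand-in for a whisker contact
            prediction = 0.0                    # stand-in for the forward model
            reward = abs(observation - prediction)  # prediction error as reward
            td_error = reward - value
            value += alpha_v * td_error                                  # critic
            policy_param += alpha_p * td_error * (amplitude - policy_param)  # actor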

  3. Feature Reinforcement Learning: Part I. Unstructured MDPs

    NASA Astrophysics Data System (ADS)

    Hutter, Marcus

    2009-12-01

    General-purpose, intelligent, learning agents cycle through sequences of observations, actions, and rewards that are complex, uncertain, unknown, and non-Markovian. On the other hand, reinforcement learning is well-developed for small finite state Markov decision processes (MDPs). Up to now, extracting the right state representations out of bare observations, that is, reducing the general agent setup to the MDP framework, is an art that involves significant effort by designers. The primary goal of this work is to automate the reduction process and thereby significantly expand the scope of many existing reinforcement learning algorithms and the agents that employ them. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in Part II (Hutter, 2009c). The role of POMDPs is also considered there.

  4. Multi-agent Reinforcement Learning Model for Effective Action Selection

    NASA Astrophysics Data System (ADS)

    Youk, Sang Jo; Lee, Bong Keun

    Reinforcement learning is a subarea of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. In the multi-agent case especially, the state and action spaces become enormous compared to the single-agent case, so an effective action-selection strategy is needed for reinforcement learning to remain practical. This paper proposes a multi-agent reinforcement learning model based on a fuzzy inference system in order to improve learning speed and select effective actions in multi-agent settings. The paper verifies the effectiveness of the action-selection strategy through evaluation tests on RoboCup Keepaway, one of the standard test-beds for multi-agent research. The proposed model can be applied to evaluate the efficiency of various intelligent multi-agent systems, and also to the strategy and tactics of robot soccer systems.

  5. Associations among Smoking, Anhedonia, and Reward Learning in Depression

    PubMed Central

    Liverant, Gabrielle I.; Sloan, Denise M.; Pizzagalli, Diego A.; Harte, Christopher B.; Kamholz, Barbara W.; Rosebrock, Laina E.; Cohen, Andrew L.; Fava, Maurizio; Kaplan, Gary B.

    2015-01-01

    Depression and cigarette smoking co-occur at high rates. However, the etiological mechanisms that contribute to this relationship remain unclear. Anhedonia and associated impairments in reward learning are key features of depression, which also have been linked to the onset and maintenance of cigarette smoking. However, few studies have investigated differences in anhedonia and reward learning among depressed smokers and depressed nonsmokers. The goal of this study was to examine putative differences in anhedonia and reward learning in depressed smokers (n = 36) and depressed nonsmokers (n = 44). To this end, participants completed self-report measures of anhedonia and behavioral activation (BAS reward responsiveness scores) as well as a probabilistic reward task rooted in signal detection theory, which measures reward learning (Pizzagalli, Jahn, & O’Shea, 2005). When considering self-report measures, depressed smokers reported higher trait anhedonia and reduced BAS reward responsiveness scores compared to depressed nonsmokers. In contrast to self-report measures, nicotine-satiated depressed smokers demonstrated greater acquisition of reward-based learning compared to depressed nonsmokers as indexed by the probabilistic reward task. Findings may point to a potential mechanism underlying the frequent co-occurrence of smoking and depression. These results highlight the importance of continued investigation of the role of anhedonia and reward system functioning in the co-occurrence of depression and nicotine abuse. Results also may support the use of treatments targeting reward learning (e.g., behavioral activation) to enhance smoking cessation among individuals with depression. PMID:25022776

  6. Associations among smoking, anhedonia, and reward learning in depression.

    PubMed

    Liverant, Gabrielle I; Sloan, Denise M; Pizzagalli, Diego A; Harte, Christopher B; Kamholz, Barbara W; Rosebrock, Laina E; Cohen, Andrew L; Fava, Maurizio; Kaplan, Gary B

    2014-09-01

    Depression and cigarette smoking co-occur at high rates. However, the etiological mechanisms that contribute to this relationship remain unclear. Anhedonia and associated impairments in reward learning are key features of depression, which also have been linked to the onset and maintenance of cigarette smoking. However, few studies have investigated differences in anhedonia and reward learning among depressed smokers and depressed nonsmokers. The goal of this study was to examine putative differences in anhedonia and reward learning in depressed smokers (n=36) and depressed nonsmokers (n=44). To this end, participants completed self-report measures of anhedonia and behavioral activation (BAS reward responsiveness scores) as well as a probabilistic reward task rooted in signal detection theory, which measures reward learning (Pizzagalli, Jahn, & O'Shea, 2005). When considering self-report measures, depressed smokers reported higher trait anhedonia and reduced BAS reward responsiveness scores compared to depressed nonsmokers. In contrast to self-report measures, nicotine-satiated depressed smokers demonstrated greater acquisition of reward-based learning compared to depressed nonsmokers as indexed by the probabilistic reward task. Findings may point to a potential mechanism underlying the frequent co-occurrence of smoking and depression. These results highlight the importance of continued investigation of the role of anhedonia and reward system functioning in the co-occurrence of depression and nicotine abuse. Results also may support the use of treatments targeting reward learning (e.g., behavioral activation) to enhance smoking cessation among individuals with depression.

  7. Impaired associative learning with food rewards in obese women.

    PubMed

    Zhang, Zhihao; Manson, Kirk F; Schiller, Daniela; Levy, Ifat

    2014-08-04

    Obesity is a major epidemic in many parts of the world. One of the main factors contributing to obesity is overconsumption of high-fat and high-calorie food, which is driven by the rewarding properties of these types of food. Previous studies have suggested that dysfunction in reward circuits may be associated with overeating and obesity. The nature of this dysfunction, however, is still unknown. Here, we demonstrate impairment in reward-based associative learning specific to food in obese women. Normal-weight and obese participants performed an appetitive reversal learning task in which they had to learn and modify cue-reward associations. To test whether any learning deficits were specific to food reward or were more general, we used a between-subject design in which half of the participants received food reward and the other half received money reward. Our results reveal a marked difference in associative learning between normal-weight and obese women when food was used as reward. Importantly, no learning deficits were observed with money reward. Multiple regression analyses also established a robust negative association between body mass index and learning performance in the food domain in female participants. Interestingly, such impairment was not observed in obese men. These findings suggest that obesity may be linked to impaired reward-based associative learning and that this impairment may be specific to the food domain. Copyright © 2014 Elsevier Ltd. All rights reserved.

  8. Differential Effect of Reward and Punishment on Procedural Learning

    PubMed Central

    Wächter, Tobias; Lungu, Ovidiu V.; Liu, Tao; Willingham, Daniel T.; Ashe, James

    2009-01-01

    Reward and punishment are potent modulators of associative learning in instrumental and classical conditioning. However, the effect of reward and punishment on procedural learning is not known. The striatum is known to be an important locus of reward-related neural signals and part of the neural substrate of procedural learning. Here, using an implicit motor learning task, we show that reward leads to enhancement of learning in human subjects, whereas punishment is associated only with improvement in motor performance. Furthermore, these behavioral effects have distinct neural substrates with the learning effect of reward being mediated through the dorsal striatum and the performance effect of punishment through the insula. Our results suggest that reward and punishment engage separate motivational systems with distinctive behavioral effects and neural substrates. PMID:19144843

  9. Differential effect of reward and punishment on procedural learning.

    PubMed

    Wächter, Tobias; Lungu, Ovidiu V; Liu, Tao; Willingham, Daniel T; Ashe, James

    2009-01-14

    Reward and punishment are potent modulators of associative learning in instrumental and classical conditioning. However, the effect of reward and punishment on procedural learning is not known. The striatum is known to be an important locus of reward-related neural signals and part of the neural substrate of procedural learning. Here, using an implicit motor learning task, we show that reward leads to enhancement of learning in human subjects, whereas punishment is associated only with improvement in motor performance. Furthermore, these behavioral effects have distinct neural substrates with the learning effect of reward being mediated through the dorsal striatum and the performance effect of punishment through the insula. Our results suggest that reward and punishment engage separate motivational systems with distinctive behavioral effects and neural substrates.

  10. A Comparative Analysis of Reinforcement Learning Methods

    DTIC Science & Technology

    1991-10-01

    reinforcement learning for both programming and adapting situated agents. In the first part of the paper we discuss two specific reinforcement learning algorithms: Q-learning and the Bucket Brigade. We introduce a special case of the Bucket Brigade, and analyze and compare its performance to Q-learning in a number of experiments. The second part of the paper discusses the key problems of reinforcement learning: time and space complexity, input generalization, sensitivity to parameter values, and selection of the reinforcement

  11. Attention-gated reinforcement learning of internal representations for classification.

    PubMed

    Roelfsema, Pieter R; van Ooyen, Arjen

    2005-10-01

    Animal learning is associated with changes in the efficacy of connections between neurons. The rules that govern this plasticity can be tested in neural networks. Rules that train neural networks to map stimuli onto outputs are given by supervised learning and reinforcement learning theories. Supervised learning is efficient but biologically implausible. In contrast, reinforcement learning is biologically plausible but comparatively inefficient. It lacks a mechanism that can identify units at early processing levels that play a decisive role in the stimulus-response mapping. Here we show that this so-called credit assignment problem can be solved by a new role for attention in learning. There are two factors in our new learning scheme that determine synaptic plasticity: (1) a reinforcement signal that is homogeneous across the network and depends on the amount of reward obtained after a trial, and (2) an attentional feedback signal from the output layer that limits plasticity to those units at earlier processing levels that are crucial for the stimulus-response mapping. The new scheme is called attention-gated reinforcement learning (AGREL). We show that it is as efficient as supervised learning in classification tasks. AGREL is biologically realistic and integrates the role of feedback connections, attention effects, synaptic plasticity, and reinforcement learning signals into a coherent framework.
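
    A schematic numpy sketch of the two-factor rule described above: a network-wide reward-prediction-error signal multiplied by attentional feedback from the selected output unit, which restricts plasticity at the earlier layer (architecture, sizes, and names are assumptions):

        import numpy as np

        rng = np.random.default_rng(1)
        W_in = rng.normal(0.0, 0.1, (4, 8))    # input -> hidden weights
        W_out = rng.normal(0.0, 0.1, (8, 3))   # hidden -> output weights
        lr = 0.05

        def agrel_step(x, chosen, reward, expected_reward):
            h = np.tanh(x @ W_in)
            # Factor 1: a reinforcement signal shared by every synapse.
            delta = reward - expected_reward
            # Factor 2: attentional feedback from the chosen output unit; only
            # hidden units that drove the selected response receive plasticity.
            attention = W_out[:, chosen] * (1.0 - h ** 2)
            W_out[:, chosen] += lr * delta * h
            W_in[:, :] += lr * delta * np.outer(x, attention)

        x = np.array([1.0, 0.0, 0.5, 0.0])
        agrel_step(x, chosen=2, reward=1.0, expected_reward=0.4)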

  12. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults.

    PubMed

    Vernetti, Angélina; Smith, Tim J; Senju, Atsushi

    2017-03-15

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue-reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue-reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that social engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting.

  13. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults

    PubMed Central

    Vernetti, Angélina; Smith, Tim J.; Senju, Atsushi

    2017-01-01

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue–reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue–reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that social engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting. PMID:28250186

  14. Conflict acts as an implicit cost in reinforcement learning.

    PubMed

    Cavanagh, James F; Masters, Sean E; Bath, Kevin; Frank, Michael J

    2014-11-04

    Conflict has been proposed to act as a cost in action selection, implying a general function of medio-frontal cortex in the adaptation to aversive events. Here we investigate if response conflict acts as a cost during reinforcement learning by modulating experienced reward values in cortical and striatal systems. Electroencephalography recordings show that conflict diminishes the relationship between reward-related frontal theta power and cue preference yet it enhances the relationship between punishment and cue avoidance. Individual differences in the cost of conflict on reward versus punishment sensitivity are also related to a genetic polymorphism associated with striatal D1 versus D2 pathway balance (DARPP-32). We manipulate these patterns with the D2 agent cabergoline, which induces a strong bias to amplify the aversive value of punishment outcomes following conflict. Collectively, these findings demonstrate that interactive cortico-striatal systems implicitly modulate experienced reward and punishment values as a function of conflict.

  15. Dissecting components of reward: 'liking', 'wanting', and learning.

    PubMed

    Berridge, Kent C; Robinson, Terry E; Aldridge, J Wayne

    2009-02-01

    In recent years significant progress has been made delineating the psychological components of reward and their underlying neural mechanisms. Here we briefly highlight findings on three dissociable psychological components of reward: 'liking' (hedonic impact), 'wanting' (incentive salience), and learning (predictive associations and cognitions). A better understanding of the components of reward, and their neurobiological substrates, may help in devising improved treatments for disorders of mood and motivation, ranging from depression to eating disorders, drug addiction, and related compulsive pursuits of rewards.

  16. Tunnel Ventilation Control Using Reinforcement Learning Methodology

    NASA Astrophysics Data System (ADS)

    Chu, Baeksuk; Kim, Dongnam; Hong, Daehie; Park, Jooyoung; Chung, Jin Taek; Kim, Tae-Hyung

    The main purpose of a tunnel ventilation system is to maintain CO pollutant concentration and VI (visibility index) at an adequate level to provide drivers with a comfortable and safe driving environment. Moreover, it is necessary to minimize the power consumed to operate the ventilation system. To achieve these objectives, the control algorithm used in this research is the reinforcement learning (RL) method. RL is goal-directed learning of a mapping from situations to actions without relying on exemplary supervision or complete models of the environment. The goal of RL is to maximize a reward, which is evaluative feedback from the environment. In constructing the reward of the tunnel ventilation system, both objectives listed above are included, that is, maintaining an adequate level of pollutants and minimizing power consumption. An RL algorithm based on an actor-critic architecture and a gradient-following algorithm is adopted for the tunnel ventilation system. Simulation results obtained with real data collected from an existing tunnel ventilation system, together with real experimental verification, are provided in this paper. It is confirmed that with the suggested controller, the pollutant level inside the tunnel was well maintained under the allowable limit and energy consumption was improved compared to the conventional control scheme.
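
    One plausible way to fold the two objectives into the scalar reward such a controller maximizes is sketched below in Python; the limits and weights are assumptions, not values from the paper:

        CO_LIMIT = 25.0           # hypothetical allowable CO concentration (ppm)
        VI_MIN = 50.0             # hypothetical minimum visibility index (%)
        w_air, w_power = 1.0, 0.01

        def ventilation_reward(co_ppm, vi_percent, power_kw):
            # Penalize pollutant levels beyond their limits and, more weakly,
            # fan power consumption; the RL controller maximizes this reward.
            air_penalty = max(0.0, co_ppm - CO_LIMIT) + max(0.0, VI_MIN - vi_percent)
            return -(w_air * air_penalty + w_power * power_kw)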

  17. Is Avoiding an Aversive Outcome Rewarding? Neural Substrates of Avoidance Learning in the Human Brain

    PubMed Central

    Kim, Hackjin; Shimojo, Shinsuke

    2006-01-01

    Avoidance learning poses a challenge for reinforcement-based theories of instrumental conditioning, because once an aversive outcome is successfully avoided an individual may no longer experience extrinsic reinforcement for their behavior. One possible account for this is to propose that avoiding an aversive outcome is in itself a reward, and thus avoidance behavior is positively reinforced on each trial when the aversive outcome is successfully avoided. In the present study we aimed to test this possibility by determining whether avoidance of an aversive outcome recruits the same neural circuitry as that elicited by a reward itself. We scanned 16 human participants with functional MRI while they performed an instrumental choice task, in which on each trial they chose from one of two actions in order to either win money or else avoid losing money. Neural activity in a region previously implicated in encoding stimulus reward value, the medial orbitofrontal cortex, was found to increase, not only following receipt of reward, but also following successful avoidance of an aversive outcome. This neural signal may itself act as an intrinsic reward, thereby serving to reinforce actions during instrumental avoidance. PMID:16802856

  18. Is avoiding an aversive outcome rewarding? Neural substrates of avoidance learning in the human brain.

    PubMed

    Kim, Hackjin; Shimojo, Shinsuke; O'Doherty, John P

    2006-07-01

    Avoidance learning poses a challenge for reinforcement-based theories of instrumental conditioning, because once an aversive outcome is successfully avoided an individual may no longer experience extrinsic reinforcement for their behavior. One possible account for this is to propose that avoiding an aversive outcome is in itself a reward, and thus avoidance behavior is positively reinforced on each trial when the aversive outcome is successfully avoided. In the present study we aimed to test this possibility by determining whether avoidance of an aversive outcome recruits the same neural circuitry as that elicited by a reward itself. We scanned 16 human participants with functional MRI while they performed an instrumental choice task, in which on each trial they chose from one of two actions in order to either win money or else avoid losing money. Neural activity in a region previously implicated in encoding stimulus reward value, the medial orbitofrontal cortex, was found to increase, not only following receipt of reward, but also following successful avoidance of an aversive outcome. This neural signal may itself act as an intrinsic reward, thereby serving to reinforce actions during instrumental avoidance.

  19. Estimation of Distribution Algorithms for Solving Reinforcement Learning Problems

    NASA Astrophysics Data System (ADS)

    Handa, Hisashi

    Estimation of Distribution Algorithms (EDAs) are a promising evolutionary computation method. Due to the use of probabilistic models, EDAs can outperform conventional evolutionary computation. In this paper, EDAs are extended to solve reinforcement learning problems, a framework for autonomous agents. In reinforcement learning problems, we must find a policy for the agents such that it yields a large amount of future reward. In general, such a policy can be represented by conditional probabilities of the agents' actions, given the perceptual inputs. In order to estimate such a conditional probability distribution, Conditional Random Fields (CRFs) by Lafferty (2001) are introduced into EDAs. CRFs are adopted because they are able to learn conditional probabilistic distributions from a large amount of input-output data, i.e., episodes in the case of reinforcement learning problems. Computer simulations on Probabilistic Transition Problems and Perceptual Aliasing Maze Problems show the effectiveness of EDA-RL.

  20. The influence of attention and reward on the learning of stimulus-response associations.

    PubMed

    Vartak, Devavrat; Jeurissen, Danique; Self, Matthew W; Roelfsema, Pieter R

    2017-08-22

    We can learn new tasks by listening to a teacher, but we can also learn by trial-and-error. Here, we investigate the factors that determine how participants learn new stimulus-response mappings by trial-and-error. Does learning in human observers comply with reinforcement learning theories, which describe how subjects learn from rewards and punishments? If yes, what is the influence of selective attention in the learning process? We developed a novel redundant-relevant learning paradigm to examine the conjoint influence of attention and reward feedback. We found that subjects only learned stimulus-response mappings for attended shapes, even when unattended shapes were equally informative. Reward magnitude also influenced learning, an effect that was stronger for attended than for non-attended shapes and that carried over to a subsequent visual search task. Our results provide insights into how attention and reward jointly determine how we learn. They support the powerful learning rules that capitalize on the conjoint influence of these two factors on neuronal plasticity.

  1. Robot-assisted motor training: assistance decreases exploration during reinforcement learning.

    PubMed

    Sans-Muntadas, Albert; Duarte, Jaime E; Reinkensmeyer, David J

    2014-01-01

    Reinforcement learning (RL) is a form of motor learning that robotic therapy devices could potentially manipulate to promote neurorehabilitation. We developed a system that requires trainees to use RL to learn a predefined target movement. The system provides higher rewards for movements that are more similar to the target movement. We also developed a novel algorithm that rewards trainees of different abilities with comparable reward sizes. This algorithm measures a trainee's performance relative to their best performance, rather than relative to an absolute target performance, to determine reward. We hypothesized this algorithm would permit subjects who cannot normally achieve high reward levels to do so while still learning. In an experiment with 21 unimpaired human subjects, we found that all subjects quickly learned to make a first target movement with and without the reward equalization. However, artificially increasing reward decreased the subjects' tendency to engage in exploration and therefore slowed learning, particularly when we changed the target movement. An anti-slacking watchdog algorithm further slowed learning. These results suggest that robotic algorithms that assist trainees in achieving rewards or in preventing slacking might, over time, discourage the exploration needed for reinforcement learning.
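
    A minimal Python sketch of the reward-equalization idea: each movement is graded against the trainee's own best performance so far rather than an absolute target (names and scaling are assumptions):

        def make_relative_reward(max_reward=100.0):
            best = 0.0
            def reward(similarity_to_target):
                # Grade each movement against this trainee's personal best, so
                # trainees of different abilities can earn comparable rewards.
                nonlocal best
                best = max(best, similarity_to_target)
                return max_reward * similarity_to_target / best if best > 0 else 0.0
            return reward

        trainee_reward = make_relative_reward()
        trainee_reward(0.4)   # first movement sets the personal best -> 100.0
        trainee_reward(0.3)   # a weaker movement now earns 75.0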

  2. Scaling prediction errors to reward variability benefits error-driven learning in humans

    PubMed Central

    Schultz, Wolfram

    2015-01-01

    Effective error-driven learning requires individuals to adapt learning to environmental reward variability. The adaptive mechanism may involve decays in learning rate across subsequent trials, as shown previously, and rescaling of reward prediction errors. The present study investigated the influence of prediction error scaling and, in particular, the consequences for learning performance. Participants explicitly predicted reward magnitudes that were drawn from different probability distributions with specific standard deviations. By fitting the data with reinforcement learning models, we found scaling of prediction errors, in addition to the learning rate decay shown previously. Importantly, the prediction error scaling was closely related to learning performance, defined as accuracy in predicting the mean of reward distributions, across individual participants. In addition, participants who scaled prediction errors relative to standard deviation also presented with more similar performance for different standard deviations, indicating that increases in standard deviation did not substantially decrease “adapters'” accuracy in predicting the means of reward distributions. However, exaggerated scaling beyond the standard deviation resulted in impaired performance. Thus efficient adaptation makes learning more robust to changing variability. PMID:26180123
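
    A minimal Python sketch of the fitted mechanism, assuming a hyperbolic learning-rate decay and prediction errors rescaled by the reward distribution's standard deviation (both functional forms are assumptions):

        alpha0 = 0.5   # initial learning rate (assumed)

        def scaled_update(estimate, reward, sigma, trial):
            # The learning rate decays across trials, and the prediction error
            # is normalized by the spread of the current reward distribution,
            # making learning robust to changes in reward variability.
            alpha = alpha0 / (1.0 + trial)
            delta = (reward - estimate) / sigma
            return estimate + alpha * delta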

  3. Stress modulates reinforcement learning in younger and older adults.

    PubMed

    Lighthall, Nichole R; Gorlick, Marissa A; Schoeke, Andrej; Frank, Michael J; Mather, Mara

    2013-03-01

    Animal research and human neuroimaging studies indicate that stress increases dopamine levels in brain regions involved in reward processing, and stress also appears to increase the attractiveness of addictive drugs. The current study tested the hypothesis that stress increases reward salience, leading to more effective learning about positive than negative outcomes in a probabilistic selection task. Changes to dopamine pathways with age raise the question of whether stress effects on incentive-based learning differ by age. Thus, the present study also examined whether effects of stress on reinforcement learning differed for younger (age 18-34) and older participants (age 65-85). Cold pressor stress was administered to half of the participants in each age group, and salivary cortisol levels were used to confirm biophysiological response to cold stress. After the manipulation, participants completed a probabilistic learning task involving positive and negative feedback. In both younger and older adults, stress enhanced learning about cues that predicted positive outcomes. In addition, during the initial learning phase, stress diminished sensitivity to recent feedback across age groups. These results indicate that stress affects reinforcement learning in both younger and older adults and suggests that stress exerts different effects on specific components of reinforcement learning depending on their neural underpinnings.

  4. Learning strategies in table tennis using inverse reinforcement learning.

    PubMed

    Muelling, Katharina; Boularias, Abdeslam; Mohler, Betty; Schölkopf, Bernhard; Peters, Jan

    2014-10-01

    Learning a complex task such as table tennis is a challenging problem for both robots and humans. Even after acquiring the necessary motor skills, a strategy is needed to choose where and how to return the ball to the opponent's court in order to win the game. The data-driven identification of basic strategies in interactive tasks, such as table tennis, is a largely unexplored problem. In this paper, we suggest a computational model for representing and inferring strategies, based on a Markov decision problem, where the reward function models the goal of the task as well as the strategic information. We show how this reward function can be discovered from demonstrations of table tennis matches using model-free inverse reinforcement learning. The resulting framework allows us to identify basic elements on which the selection of striking movements is based. We tested our approach on data collected from players with different playing styles and under different playing conditions. The estimated reward function was able to capture expert-specific strategic information that sufficed to distinguish the expert among players with different skill levels as well as different playing styles.

  5. Rational and Mechanistic Perspectives on Reinforcement Learning

    ERIC Educational Resources Information Center

    Chater, Nick

    2009-01-01

    This special issue describes important recent developments in applying reinforcement learning models to capture neural and cognitive function. But reinforcement learning, as a theoretical framework, can apply at two very different levels of description: "mechanistic" and "rational." Reinforcement learning is often viewed in mechanistic terms--as…

  6. A reinforcement learning approach to online clustering.

    PubMed

    Likas, A

    1999-11-15

    A general technique is proposed for embedding online clustering algorithms based on competitive learning in a reinforcement learning framework. The basic idea is that the clustering system can be viewed as a reinforcement learning system that learns through reinforcements to follow the clustering strategy we wish to implement. In this sense, the reinforcement guided competitive learning (RGCL) algorithm is proposed that constitutes a reinforcement-based adaptation of learning vector quantization (LVQ) with enhanced clustering capabilities. In addition, we suggest extensions of RGCL and LVQ that are characterized by the property of sustained exploration and significantly improve the performance of those algorithms, as indicated by experimental tests on well-known data sets.
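
    A minimal numpy sketch of reinforcement-guided competitive learning in this spirit: a winner-take-all step followed by an LVQ-style update whose sign is set by a reinforcement signal (the success criterion below is a hypothetical threshold):

        import numpy as np

        def rgcl_step(prototypes, x, lr=0.05, success_radius=1.0):
            # Competitive step: the nearest prototype wins the input.
            dists = np.linalg.norm(prototypes - x, axis=1)
            win = int(np.argmin(dists))
            # Reinforcement step: a success signal decides whether the winner
            # is attracted to or repelled from the input (LVQ-style update).
            r = 1.0 if dists[win] < success_radius else -1.0
            prototypes[win] += lr * r * (x - prototypes[win])
            return win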

  7. Behavioral and neural properties of social reinforcement learning.

    PubMed

    Jones, Rebecca M; Somerville, Leah H; Li, Jian; Ruberry, Erika J; Libby, Victoria; Glover, Gary; Voss, Henning U; Ballon, Douglas J; Casey, B J

    2011-09-14

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based on work in nonhuman primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging. Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis--social preferences, response latencies, and modeling neural responses--are consistent with reinforcement learning theory and nonhuman primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one's peers in altering subsequent behavior.

  8. Behavioral and neural properties of social reinforcement learning

    PubMed Central

    Jones, Rebecca M.; Somerville, Leah H.; Li, Jian; Ruberry, Erika J.; Libby, Victoria; Glover, Gary; Voss, Henning U.; Ballon, Douglas J.; Casey, BJ

    2011-01-01

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based upon work in non-human primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging (fMRI). Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis - social preferences, response latencies and modeling neural responses – are consistent with reinforcement learning theory and non-human primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one’s peers in altering subsequent behavior. PMID:21917787

  9. SAN-RL: combining spreading activation networks and reinforcement learning to learn configurable behaviors

    NASA Astrophysics Data System (ADS)

    Gaines, Daniel M.; Wilkes, Don M.; Kusumalnukool, Kanok; Thongchai, Siripun; Kawamura, Kazuhiko; White, John H.

    2002-02-01

    Reinforcement learning techniques have been successful in allowing an agent to learn a policy for achieving tasks. The overall behavior of the agent can be controlled with an appropriate reward function. However, the policy that is learned will be fixed to this reward function. If the user wishes to change his or her preference about how the task is achieved, the agent must be retrained with this new reward function. We address this challenge by combining Spreading Activation Networks and Reinforcement Learning in an approach we call SAN-RL. This approach provides the agent with a causal structure, the spreading activation network, relating goals to the actions that can achieve those goals. This enables the agent to select actions relative to the goal priorities. We combine this with reinforcement learning to enable the agent to learn a policy. Together, these approaches enable the learning of configurable behaviors: a policy that can be adapted to meet the current preferences. We compare the approach with Q-learning on a robot navigation task. We demonstrate that SAN-RL exhibits goal-directed behavior before learning, exploits the causal structure of the network to focus its search during learning, and results in configurable behaviors after learning.

  10. Reinforcement Learning and Savings Behavior*

    PubMed Central

    Choi, James J.; Laibson, David; Madrian, Brigitte C.; Metrick, Andrew

    2009-01-01

    We show that individual investors over-extrapolate from their personal experience when making savings decisions. Investors who experience particularly rewarding outcomes from saving in their 401(k)—a high average and/or low variance return—increase their 401(k) savings rate more than investors who have less rewarding experiences with saving. This finding is not driven by aggregate time-series shocks, income effects, rational learning about investing skill, investor fixed effects, or time-varying investor-level heterogeneity that is correlated with portfolio allocations to stock, bond, and cash asset classes. We discuss implications for the equity premium puzzle and interventions aimed at improving household financial outcomes. PMID:20352013

  11. Force-proportional reinforcement: pimozide does not reduce rats' emission of higher forces for sweeter rewards.

    PubMed

    Kirkpatrick, M A; Fowler, S C

    1989-02-01

    A two-step force-proportional reinforcement procedure was used to assess the efficacy of a sucrose reward under neuroleptic challenge. The force-proportional reinforcement method entails an increase in the quality of reward contingent upon higher force-emission. This paradigm was conceived as a rate-free means of addressing the putative anhedonic effects of dopaminergic receptor-blocking agents. Results failed to support the anhedonia interpretation of neuroleptic-induced response decrements: Pimozide did not diminish the ability of a high-concentration sucrose solution to maintain elevated response forces. Alternatives to the anhedonia interpretation were discussed with emphasis on the drug's motor effects in the temporal domain.

  12. A parallel framework for Bayesian reinforcement learning

    NASA Astrophysics Data System (ADS)

    Barrett, Enda; Duggan, Jim; Howley, Enda

    2014-01-01

    Solving a finite Markov decision process using techniques from dynamic programming such as value or policy iteration require a complete model of the environmental dynamics. The distribution of rewards, transition probabilities, states and actions all need to be fully observable, discrete and complete. For many problem domains, a complete model containing a full representation of the environmental dynamics may not be readily available. Bayesian reinforcement learning (RL) is a technique devised to make better use of the information observed through learning than simply computing Q-functions. However, this approach can often require extensive experience in order to build up an accurate representation of the true values. To address this issue, this paper proposes a method for parallelising a Bayesian RL technique aimed at reducing the time it takes to approximate the missing model. We demonstrate the technique on learning next state transition probabilities without prior knowledge. The approach is general enough for approximating any probabilistically driven component of the model. The solution involves multiple learning agents learning in parallel on the same task. Agents share probability density estimates amongst each other in an effort to speed up convergence to the true values.
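
    A minimal Python sketch of the sharing step, assuming agents keep Dirichlet pseudo-counts over next-state transitions and periodically pool them; the sizes and pooling rule are assumptions:

        import numpy as np

        N_STATES, N_ACTIONS = 5, 2   # assumed problem size

        class BayesAgent:
            def __init__(self):
                # Dirichlet(1, ..., 1) prior over next-state transitions.
                self.counts = np.ones((N_STATES, N_ACTIONS, N_STATES))

            def observe(self, s, a, s_next):
                self.counts[s, a, s_next] += 1.0

            def transition_estimate(self):
                return self.counts / self.counts.sum(axis=2, keepdims=True)

        def share(agents):
            # Pool the agents' evidence (subtracting the duplicated priors)
            # so each agent's density estimate converges faster.
            pooled = sum(a.counts for a in agents) - (len(agents) - 1)
            for a in agents:
                a.counts = pooled.copy()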

  13. Reward-Guided Learning with and without Causal Attribution.

    PubMed

    Jocham, Gerhard; Brodersen, Kay H; Constantinescu, Alexandra O; Kahn, Martin C; Ianni, Angela M; Walton, Mark E; Rushworth, Matthew F S; Behrens, Timothy E J

    2016-04-06

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task.

  14. Reward-Guided Learning with and without Causal Attribution

    PubMed Central

    Jocham, Gerhard; Brodersen, Kay H.; Constantinescu, Alexandra O.; Kahn, Martin C.; Ianni, Angela M.; Walton, Mark E.; Rushworth, Matthew F.S.; Behrens, Timothy E.J.

    2016-01-01

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task. PMID:26971947

  15. Dopamine-Dependent Reinforcement of Motor Skill Learning: Evidence from Gilles de la Tourette Syndrome

    ERIC Educational Resources Information Center

    Palminteri, Stefano; Lebreton, Mael; Worbe, Yulia; Hartmann, Andreas; Lehericy, Stephane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-01-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only…

  17. Simulation of rat behavior by a reinforcement learning algorithm in consideration of appearance probabilities of reinforcement signals.

    PubMed

    Murakoshi, Kazushi; Noguchi, Takuya

    2005-04-01

    Brown and Wagner [Brown, R.T., Wagner, A.R., 1964. Resistance to punishment and extinction following training with shock or nonreinforcement. J. Exp. Psychol. 68, 503-507] investigated rat behaviors with the following features: (1) rats were exposed to reward and punishment at the same time, (2) the environment changed and rats relearned, and (3) rats were stochastically exposed to reward and punishment. The results are that exposure to nonreinforcement produces resistance to the decremental effects of behavior after a stochastic reward schedule and that exposure to both punishment and reinforcement produces resistance to the decremental effects of behavior after a stochastic punishment schedule. This paper aims to simulate these rat behaviors with a reinforcement learning algorithm that takes the appearance probabilities of reinforcement signals into account. Earlier reinforcement learning algorithms were unable to simulate behavior of feature (3). We improve on them by controlling learning parameters in consideration of the acquisition probabilities of reinforcement signals. The proposed algorithm qualitatively simulates the result of the animal experiment of Brown and Wagner.

  18. The impact of mineralocorticoid receptor ISO/VAL genotype (rs5522) and stress on reward learning.

    PubMed

    Bogdan, R; Perlis, R H; Fagerness, J; Pizzagalli, D A

    2010-08-01

    Research suggests that stress disrupts reinforcement learning and induces anhedonia. The mineralocorticoid receptor (MR) determines the sensitivity of the stress response, and the missense iso/val polymorphism (Ile180Val, rs5522) of the MR gene (NR3C2) has been associated with enhanced physiological stress responses, elevated depressive symptoms and reduced cortisol-induced MR gene expression. The goal of these studies was to evaluate whether rs5522 genotype and stress independently and interactively influence reward learning. In study 1, participants (n = 174) completed a probabilistic reward task under baseline (i.e. no-stress) conditions. In study 2, participants (n = 53) completed the task during a stress (threat-of-shock) and no-stress condition. Reward learning, i.e. the ability to modulate behavior as a function of reinforcement history, was the main variable of interest. In study 1, in which participants were evaluated under no-stress conditions, reward learning was enhanced in val carriers. In study 2, participants developed a weaker response bias toward a more frequently rewarded stimulus under the stress relative to no-stress condition. Critically, stress-induced reward learning deficits were largest in val carriers. Although preliminary and in need of replication due to the small sample size, findings indicate that psychiatrically healthy individuals carrying the MR val allele, which has recently been linked to depression, showed a reduced ability to modulate behavior as a function of reward when facing an acute, uncontrollable stressor. Future studies are warranted to evaluate whether rs5522 genotype interacts with naturalistic stressors to increase the risk of depression and whether stress-induced anhedonia might moderate such risk.
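
    For context, probabilistic reward tasks of this kind typically quantify reward learning as a signal-detection response bias toward the more frequently rewarded ("rich") stimulus. A sketch of that standard computation follows; the half-trial correction for empty cells is a common convention assumed here, not taken from this paper.

```python
import math

def response_bias(rich_correct, rich_incorrect, lean_correct, lean_incorrect):
    """Signal-detection response bias (log b) toward the rich stimulus.

    A half-trial correction is added to every cell, a common convention
    that avoids division by zero when a cell is empty.
    """
    rc, ri = rich_correct + 0.5, rich_incorrect + 0.5
    lc, li = lean_correct + 0.5, lean_incorrect + 0.5
    return 0.5 * math.log((rc * li) / (ri * lc))

# Example: a participant who favors the rich stimulus shows log b > 0.
print(response_bias(rich_correct=80, rich_incorrect=20,
                    lean_correct=55, lean_incorrect=45))
```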

  19. Reinforcement Learning Deficits in People with Schizophrenia Persist after Extended Trials

    PubMed Central

    Cicero, David C.; Martin, Elizabeth A.; Becker, Theresa M.; Kerns, John G.

    2014-01-01

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and are then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning. PMID:25172610

  20. Reinforcement learning deficits in people with schizophrenia persist after extended trials.

    PubMed

    Cicero, David C; Martin, Elizabeth A; Becker, Theresa M; Kerns, John G

    2014-12-30

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and are then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive feedback and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning.
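
    For reference, the Probabilistic Selection Task separates the two forms of feedback learning at test: positive-feedback learning is conventionally indexed by how often the most-rewarded training stimulus ("A") is chosen in novel pairings, and negative-feedback learning by how often the most-punished stimulus ("B") is avoided. A minimal scoring sketch under those conventions; the data format is an assumption.

```python
def pst_scores(test_trials):
    """test_trials: iterable of ((stim1, stim2), choice) tuples from the
    test phase, e.g. (('A', 'C'), 'A'). Pairs containing both A and B
    are excluded by the filters below."""
    choose_a = [c == 'A' for pair, c in test_trials
                if 'A' in pair and 'B' not in pair]
    avoid_b = [c != 'B' for pair, c in test_trials
               if 'B' in pair and 'A' not in pair]
    frac = lambda xs: sum(xs) / len(xs) if xs else float('nan')
    return {'choose_A': frac(choose_a), 'avoid_B': frac(avoid_b)}

print(pst_scores([(('A', 'C'), 'A'), (('D', 'B'), 'D'), (('A', 'E'), 'E')]))
# {'choose_A': 0.5, 'avoid_B': 1.0}
```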

  1. Reinforcement Learning for Robots Using Neural Networks

    DTIC Science & Technology

    1993-01-06

    Reinforcement learning agents are adaptive, reactive, and self-supervised. The aim of this dissertation is to extend the state of the art of... reinforcement learning and enable its applications to complex robot-learning problems. In particular, it focuses on two issues. First, learning from sparse... reinforcement learning methods assume that the world is a Markov decision process. This assumption is too strong for many robot tasks of interest…

  2. Rule Learning in Autism: The Role of Reward Type and Social Context

    PubMed Central

    Jones, E. J. H.; Webb, S. J.; Estes, A.; Dawson, G.

    2013-01-01

    Learning abstract rules is central to social and cognitive development. Across two experiments, we used Delayed Non-Matching to Sample tasks to characterize the longitudinal development and nature of rule-learning impairments in children with Autism Spectrum Disorder (ASD). Results showed that children with ASD consistently experienced more difficulty learning an abstract rule from a discrete physical reward than children with developmental delay (DD). Rule learning was facilitated by the provision of more concrete reinforcement, suggesting an underlying difficulty in forming conceptual connections. Learning abstract rules about social stimuli remained challenging through late childhood, indicating the importance of testing executive functions in both social and non-social contexts. PMID:23311315

  3. Use of Inverse Reinforcement Learning for Identity Prediction

    NASA Technical Reports Server (NTRS)

    Hayes, Roy; Bao, Jonathan; Beling, Peter; Horowitz, Barry

    2011-01-01

    We adopt Markov Decision Processes (MDP) to model sequential decision problems, which have the characteristic that the current decision made by a human decision maker has an uncertain impact on future opportunity. We hypothesize that the individuality of decision makers can be modeled as differences in the reward function under a common MDP model. A machine learning technique, Inverse Reinforcement Learning (IRL), was used to learn an individual's reward function based on limited observation of his or her decision choices. This work serves as an initial investigation for using IRL to analyze decision making, conducted through a human experiment in a cyber shopping environment. Specifically, the ability to determine the demographic identity of users is conducted through prediction analysis and supervised learning. The results show that IRL can be used to correctly identify participants, at a rate of 68% for gender and 66% for one of three college major categories.
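
    The prediction pipeline can be caricatured as follows: model each demographic group by a reward weight vector over decision features, then label a new user by whichever group's reward function best explains his or her observed choices. This sketch is an assumed simplification (a myopic softmax policy stands in for solving the full MDP, and all names are hypothetical).

```python
import math

def choice_loglik(w, trajectories, features):
    """Log-likelihood of observed choices under a softmax policy whose
    utilities are linear in reward weights w.
    trajectories: list of (state, chosen_action, available_actions).
    features(state, action) -> list of floats."""
    ll = 0.0
    for state, chosen, options in trajectories:
        scores = {a: sum(wi * fi for wi, fi in zip(w, features(state, a)))
                  for a in options}
        log_z = math.log(sum(math.exp(v) for v in scores.values()))
        ll += scores[chosen] - log_z
    return ll

def predict_identity(candidate_ws, trajectories, features):
    """candidate_ws: {group label -> reward weights fit to that group}."""
    return max(candidate_ws,
               key=lambda g: choice_loglik(candidate_ws[g], trajectories, features))
```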

  4. The ubiquity of model-based reinforcement learning.

    PubMed

    Doll, Bradley B; Simon, Dylan A; Daw, Nathaniel D

    2012-12-01

    The reward prediction error (RPE) theory of dopamine (DA) function has enjoyed great success in the neuroscience of learning and decision-making. This theory is derived from model-free reinforcement learning (RL), in which choices are made simply on the basis of previously realized rewards. Recently, attention has turned to correlates of more flexible, albeit computationally complex, model-based methods in the brain. These methods are distinguished from model-free learning by their evaluation of candidate actions using expected future outcomes according to a world model. Puzzlingly, signatures of these computations seem to be pervasive in the very same regions previously thought to support model-free learning. Here, we review recent behavioral and neural evidence about these two systems, in an attempt to reconcile their enigmatic cohabitation in the brain.
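
    The distinction the review draws can be made concrete with two toy evaluation rules; the notation below is an illustrative assumption, not the review's.

```python
GAMMA, ALPHA = 0.9, 0.1

def model_free_update(Q, s, a, r, s_next, actions):
    # Model-free: cache action values directly from realized rewards
    # (a standard Q-learning/TD update).
    old = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)

def model_based_value(s, a, T, R, V):
    # Model-based: evaluate a candidate action by its expected future
    # outcomes under a learned world model. T[s][a] maps successor
    # states to probabilities; R[s][a] is the expected immediate reward.
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a].items())
```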

  5. Probabilistic reward learning in adults with Attention Deficit Hyperactivity Disorder--an electrophysiological study.

    PubMed

    Thoma, Patrizia; Edel, Marc-Andreas; Suchan, Boris; Bellebaum, Christian

    2015-01-30

    Attention Deficit Hyperactivity Disorder (ADHD) is hypothesized to be characterized by altered reinforcement sensitivity. The main aim of the present study was to assess alterations in the electrophysiological correlates of monetary reward processing in adult patients with ADHD of the combined subtype. Fourteen adults with ADHD of the combined subtype and 14 healthy control participants performed an active and an observational probabilistic reward-based learning task while an electroencephalogram (EEG) was recorded. Regardless of feedback valence, there was a general feedback-related negativity (FRN) enhancement in combination with reduced learning performance during both active and observational reward learning in patients with ADHD relative to healthy controls. Other feedback-locked potentials such as the P200 and P300 and response-locked potentials were unaltered in the patients. There were no significant correlations between learning performance, FRN amplitudes and clinical symptoms, neither in the overall group involving all participants, nor in patients or controls considered separately. This pattern of findings might reflect generally impaired reward prediction in adults with ADHD of the combined subtype. We demonstrated for the first time that patients with ADHD of the combined subtype show not only deficient active reward learning but are also impaired when learning by observing other people's outcomes.

  6. Striatal dopamine D1 receptor suppression impairs reward-associative learning.

    PubMed

    Higa, Kerin K; Young, Jared W; Ji, Baohu; Nichols, David E; Geyer, Mark A; Zhou, Xianjin

    2017-04-14

    Dopamine (DA) is required for reinforcement learning. Hence, disruptions in DA signaling may contribute to the learning deficits associated with psychiatric disorders. The DA D1 receptor (D1R) has been linked to learning and is a target for cognitive/motivational enhancement in patients with schizophrenia. Separating the striatal D1R contribution to learning vs. motivation, however, has been challenging. We suppressed striatal D1R expression in mice using a D1R-targeting short hairpin RNA (shRNA), delivered locally to the striatum via an adeno-associated virus (AAV). We then assessed reward- and punishment-associative learning using a probabilistic learning task and motivation using a progressive-ratio breakpoint procedure. We confirmed suppression of striatal D1Rs immunohistochemically and by testing locomotor activity after the administration of (+)-doxanthrine, a full D1R agonist, in control mice and those treated with the D1R shRNA. D1R-shRNA-treated mice exhibited impaired reward-associative learning, while punishment-associative learning was spared. This deficit was unrelated to general learning impairments or amotivation, because the D1R-shRNA-treated mice exhibited normal Barnes maze learning and normal motivation in the progressive-ratio breakpoint procedure. Suppression of striatal D1Rs selectively impaired reward-associative learning, whereas punishment-associative learning, aversion-motivated learning, and appetitive motivation were spared. Because patients with schizophrenia exhibit similar reward-associative learning deficits, D1R-targeted treatments should be investigated to improve reward learning in these patients.

  7. Overlapping prediction errors in dorsal striatum during instrumental learning with juice and money reward in the human brain.

    PubMed

    Valentin, Vivian V; O'Doherty, John P

    2009-12-01

    Prediction error signals have been reported in human imaging studies in target areas of dopamine neurons such as ventral and dorsal striatum during learning with many different types of reinforcers. However, a key question that has yet to be addressed is whether prediction error signals recruit distinct or overlapping regions of striatum and elsewhere during learning with different types of reward. To address this, we scanned 17 healthy subjects with functional magnetic resonance imaging while they chose actions to obtain either a pleasant juice reward (1 ml apple juice), or a monetary gain (5 cents) and applied a computational reinforcement learning model to subjects' behavioral and imaging data. Evidence for an overlapping prediction error signal during learning with juice and money rewards was found in a region of dorsal striatum (caudate nucleus), while prediction error signals in a subregion of ventral striatum were significantly stronger during learning with money but not juice reward. These results provide evidence for partially overlapping reward prediction signals for different types of appetitive reinforcers within the striatum, a finding with important implications for understanding the nature of associative encoding in the striatum as a function of reinforcer type.

  8. Generalization of value in reinforcement learning by humans.

    PubMed

    Wimmer, G Elliott; Daw, Nathaniel D; Shohamy, Daphna

    2012-04-01

    Research in decision-making has focused on the role of dopamine and its striatal targets in guiding choices via learned stimulus-reward or stimulus-response associations, behavior that is well described by reinforcement learning theories. However, basic reinforcement learning is relatively limited in scope and does not explain how learning about stimulus regularities or relations may guide decision-making. A candidate mechanism for this type of learning comes from the domain of memory, which has highlighted a role for the hippocampus in learning of stimulus-stimulus relations, typically dissociated from the role of the striatum in stimulus-response learning. Here, we used functional magnetic resonance imaging and computational model-based analyses to examine the joint contributions of these mechanisms to reinforcement learning. Humans performed a reinforcement learning task with added relational structure, modeled after tasks used to isolate hippocampal contributions to memory. On each trial participants chose one of four options, but the reward probabilities for pairs of options were correlated across trials. This (uninstructed) relationship between pairs of options potentially enabled an observer to learn about option values based on experience with the other options and to generalize across them. We observed blood oxygen level-dependent (BOLD) activity related to learning in the striatum and also in the hippocampus. By comparing a basic reinforcement learning model to one augmented to allow feedback to generalize between correlated options, we tested whether choice behavior and BOLD activity were influenced by the opportunity to generalize across correlated options. Although such generalization goes beyond standard computational accounts of reinforcement learning and striatal BOLD, both choices and striatal BOLD activity were better explained by the augmented model. Consistent with the hypothesized role for the hippocampus in this generalization, functional…
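
    The "augmented" model the authors compare against can be sketched roughly as follows: when feedback arrives for the chosen option, its correlated partner is updated as well. The abstract does not state the sign of the correlation or the update rule, so the RHO = -1 (anti-correlated) setting, parameter names, and exact update below are assumptions for illustration.

```python
ALPHA, KAPPA = 0.2, 0.5    # KAPPA: generalization strength (assumed)
RHO = -1                   # assumed sign of the pairwise correlation

def update_with_generalization(V, chosen, partner, reward):
    # Standard delta-rule update for the chosen option...
    V[chosen] += ALPHA * (reward - V[chosen])
    # ...plus a generalization step: the correlated partner is nudged
    # toward the same outcome (RHO > 0) or its complement (RHO < 0).
    target = reward if RHO > 0 else (1 - reward)
    V[partner] += ALPHA * KAPPA * (target - V[partner])

V = {'opt1': 0.5, 'opt2': 0.5}
update_with_generalization(V, 'opt1', 'opt2', reward=1)
print(V)  # opt1 moves up toward 1, opt2 down toward 0
```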

  9. Working memory and reward association learning impairments in obesity

    PubMed Central

    Coppin, Géraldine; Nolan-Poupart, Sarah; Jones-Gotman, Marilyn; Small, Dana M.

    2014-01-01

    Obesity has been associated with impaired executive functions including working memory. Less explored is the influence of obesity on learning and memory. In the current study we assessed stimulus reward association learning, explicit learning and memory and working memory in healthy weight, overweight and obese individuals. Explicit learning and memory did not differ as a function of group. In contrast, working memory was significantly and similarly impaired in both overweight and obese individuals compared to the healthy weight group. In the first reward association learning task, the obese, but not healthy weight or overweight, participants consistently formed paradoxical preferences for a pattern associated with a negative outcome (fewer food rewards). To determine if the deficit was specific to food reward a second experiment was conducted using money. Consistent with Experiment 1, obese individuals selected the pattern associated with a negative outcome (fewer monetary rewards) more frequently than healthy weight individuals and thus failed to develop a significant preference for the most rewarded patterns as was observed in the healthy weight group. Finally, on a probabilistic learning task, obese compared to healthy weight individuals showed deficits in negative, but not positive outcome learning. Taken together, our results demonstrate deficits in working memory and stimulus reward learning in obesity and suggest that obese individuals are impaired in learning to avoid negative outcomes. PMID:25447070

  10. Working memory and reward association learning impairments in obesity.

    PubMed

    Coppin, Géraldine; Nolan-Poupart, Sarah; Jones-Gotman, Marilyn; Small, Dana M

    2014-12-01

    Obesity has been associated with impaired executive functions including working memory. Less explored is the influence of obesity on learning and memory. In the current study we assessed stimulus reward association learning, explicit learning and memory and working memory in healthy weight, overweight and obese individuals. Explicit learning and memory did not differ as a function of group. In contrast, working memory was significantly and similarly impaired in both overweight and obese individuals compared to the healthy weight group. In the first reward association learning task, the obese, but not healthy weight or overweight, participants consistently formed paradoxical preferences for a pattern associated with a negative outcome (fewer food rewards). To determine if the deficit was specific to food reward a second experiment was conducted using money. Consistent with Experiment 1, obese individuals selected the pattern associated with a negative outcome (fewer monetary rewards) more frequently than healthy weight individuals and thus failed to develop a significant preference for the most rewarded patterns as was observed in the healthy weight group. Finally, on a probabilistic learning task, obese compared to healthy weight individuals showed deficits in negative, but not positive outcome learning. Taken together, our results demonstrate deficits in working memory and stimulus reward learning in obesity and suggest that obese individuals are impaired in learning to avoid negative outcomes. Copyright © 2014 Elsevier Ltd. All rights reserved.

  11. Learning the specific quality of taste reinforcement in larval Drosophila.

    PubMed

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-27

    The only property of reinforcement insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing-in any brain.

  12. Dopamine selectively remediates 'model-based' reward learning: a computational approach.

    PubMed

    Sharp, Madeleine E; Foerde, Karin; Daw, Nathaniel D; Shohamy, Daphna

    2016-02-01

    Patients with loss of dopamine due to Parkinson's disease are impaired at learning from reward. However, it remains unknown precisely which aspect of learning is impaired. In particular, learning from reward, or reinforcement learning, can be driven by two distinct computational processes. One involves habitual stamping-in of stimulus-response associations, hypothesized to arise computationally from 'model-free' learning. The other, 'model-based' learning, involves learning a model of the world that is believed to support goal-directed behaviour. Much work has pointed to a role for dopamine in model-free learning. But recent work suggests model-based learning may also involve dopamine modulation, raising the possibility that model-based learning may contribute to the learning impairment in Parkinson's disease. To directly test this, we used a two-step reward-learning task which dissociates model-free versus model-based learning. We evaluated learning in patients with Parkinson's disease tested ON versus OFF their dopamine replacement medication and in healthy controls. Surprisingly, we found no effect of disease or medication on model-free learning. Instead, we found that patients tested OFF medication showed a marked impairment in model-based learning, and that this impairment was remediated by dopaminergic medication. Moreover, model-based learning was positively correlated with a separate measure of working memory performance, raising the possibility of common neural substrates. Our results suggest that some learning deficits in Parkinson's disease may be related to an inability to pursue reward based on complete representations of the environment. © The Author (2015). Published by Oxford University Press on behalf of the Guarantors of Brain. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
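
    Two-step tasks of this kind are commonly analyzed by fitting a hybrid agent whose first-stage valuation mixes the two systems, with a weight w indexing reliance on model-based control. A sketch under that common convention; the paper's exact fitted model is not reproduced here, and all names are assumptions.

```python
def model_based_q1(trans_probs, q_stage2):
    """First-stage model-based values: expected best second-stage value
    under the task's transition structure.
    trans_probs[a] maps second-stage states to probabilities;
    q_stage2[s2] maps second-stage actions to values."""
    return {a: sum(p * max(q_stage2[s2].values())
                   for s2, p in trans_probs[a].items())
            for a in trans_probs}

def hybrid_q1(q_mf, q_mb, w):
    """Values used for choice: w = 1 purely model-based, w = 0 purely model-free."""
    return {a: w * q_mb[a] + (1 - w) * q_mf[a] for a in q_mf}
```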

  13. No Apparent Influence of Reward upon Visual Statistical Learning.

    PubMed

    Rogers, Leeland L; Friedman, Kyle G; Vickery, Timothy J

    2016-01-01

    Humans are capable of detecting and exploiting a variety of environmental regularities, including stimulus-stimulus contingencies (e.g., visual statistical learning) and stimulus-reward contingencies. However, the relationship between these two types of learning is poorly understood. In two experiments, we sought evidence that the occurrence of rewarding events enhances or impairs visual statistical learning. Across all of our attempts to find such evidence, we employed a training stage during which we grouped shapes into triplets and presented triplets one shape at a time in an undifferentiated stream. Participants subsequently performed a surprise recognition task in which they were tested on their knowledge of the underlying structure of the triplets. Unbeknownst to participants, triplets were also assigned no-, low-, or high-reward status. In Experiments 1A and 1B, participants viewed shape streams while low and high rewards were "randomly" given, presented as low- and high-pitched tones played through headphones. Rewards were always given on the third shape of a triplet (Experiment 1A) or the first shape of a triplet (Experiment 1B), and high- and low-reward sounds were always consistently paired with the same triplets. Experiment 2 was similar to Experiment 1, except that participants were required to learn value associations of a subset of shapes before viewing the shape stream. Across all experiments, we observed significant visual statistical learning effects, but the strength of learning did not differ amongst no-, low-, or high-reward conditions for any of the experiments. Thus, our experiments failed to find any influence of rewards on statistical learning, implying that visual statistical learning may be unaffected by the occurrence of reward. The system that detects basic stimulus-stimulus regularities may operate independently of the system that detects reward contingencies.

  14. No Apparent Influence of Reward upon Visual Statistical Learning

    PubMed Central

    Rogers, Leeland L.; Friedman, Kyle G.; Vickery, Timothy J.

    2016-01-01

    Humans are capable of detecting and exploiting a variety of environmental regularities, including stimulus-stimulus contingencies (e.g., visual statistical learning) and stimulus-reward contingencies. However, the relationship between these two types of learning is poorly understood. In two experiments, we sought evidence that the occurrence of rewarding events enhances or impairs visual statistical learning. Across all of our attempts to find such evidence, we employed a training stage during which we grouped shapes into triplets and presented triplets one shape at a time in an undifferentiated stream. Participants subsequently performed a surprise recognition task in which they were tested on their knowledge of the underlying structure of the triplets. Unbeknownst to participants, triplets were also assigned no-, low-, or high-reward status. In Experiments 1A and 1B, participants viewed shape streams while low and high rewards were “randomly” given, presented as low- and high-pitched tones played through headphones. Rewards were always given on the third shape of a triplet (Experiment 1A) or the first shape of a triplet (Experiment 1B), and high- and low-reward sounds were always consistently paired with the same triplets. Experiment 2 was similar to Experiment 1, except that participants were required to learn value associations of a subset of shapes before viewing the shape stream. Across all experiments, we observed significant visual statistical learning effects, but the strength of learning did not differ amongst no-, low-, or high-reward conditions for any of the experiments. Thus, our experiments failed to find any influence of rewards on statistical learning, implying that visual statistical learning may be unaffected by the occurrence of reward. The system that detects basic stimulus-stimulus regularities may operate independently of the system that detects reward contingencies. PMID:27853441

  15. The role of reward in word learning and its implications for language acquisition.

    PubMed

    Ripollés, Pablo; Marco-Pallarés, Josep; Hielscher, Ulrike; Mestres-Missé, Anna; Tempelmann, Claus; Heinze, Hans-Jochen; Rodríguez-Fornells, Antoni; Noesselt, Toemme

    2014-11-03

    The exact neural processes behind humans' drive to acquire a new language, first as infants and later as second-language learners, are yet to be established. Recent theoretical models have proposed that during human evolution, emerging language-learning mechanisms might have been glued to phylogenetically older subcortical reward systems, reinforcing human motivation to learn a new language. Supporting this hypothesis, our results showed that adult participants exhibited robust fMRI activation in the ventral striatum (VS), a core region of reward processing, when successfully learning the meaning of new words. This activation was similar to the VS recruitment elicited using an independent reward task. Moreover, the VS showed enhanced functional and structural connectivity with neocortical language areas during successful word learning. Together, our results provide evidence for the neural substrate of reward and motivation during word learning. We suggest that this strong functional and anatomical coupling between neocortical language regions and the subcortical reward system provided a crucial advantage in humans that eventually enabled our lineage to successfully acquire linguistic skills. Copyright © 2014 Elsevier Ltd. All rights reserved.

  16. Credit assignment in movement-dependent reinforcement learning.

    PubMed

    McDougle, Samuel D; Boggess, Matthew J; Crossley, Matthew J; Parvin, Darius; Ivry, Richard B; Taylor, Jordan A

    2016-06-14

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants' explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem.

  17. Credit assignment in movement-dependent reinforcement learning

    PubMed Central

    McDougle, Samuel D.; Boggess, Matthew J.; Crossley, Matthew J.; Parvin, Darius; Ivry, Richard B.; Taylor, Jordan A.

    2016-01-01

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants’ explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem. PMID:27247404
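
    The gating hypothesis supported by both versions of this study lends itself to a one-line modification of a standard value update: prediction errors are withheld when the absence of reward is attributable to an execution error. A minimal sketch with assumed names and parameters.

```python
ALPHA = 0.3

def gated_update(V, choice, reward, execution_error):
    """Update the value of the chosen option unless the trial ended in a
    motor execution failure, in which case the reward prediction error
    is gated: the outcome is uninformative about the option's value."""
    if execution_error:
        return
    V[choice] += ALPHA * (reward - V[choice])

V = {'left': 0.5, 'right': 0.5}
gated_update(V, 'left', reward=0, execution_error=True)   # no change
gated_update(V, 'left', reward=0, execution_error=False)  # value decreases
print(V)
```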

  18. Multi Agent Reward Analysis for Learning in Noisy Domains

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2005-01-01

    In many multiagent learning problems, it is difficult to determine, a priori, the agent reward structure that will lead to good performance. This problem is particularly pronounced in continuous, noisy domains ill-suited to the simple table backup schemes commonly used in TD(lambda)/Q-learning. In this paper, we present a new reward evaluation method that allows the tradeoff between coordination among the agents and the difficulty of the learning problem each agent faces to be visualized. This method is independent of the learning algorithm and is only a function of the problem domain and the agents' reward structure. We then use this reward efficiency visualization method to determine an effective reward without performing extensive simulations. We test this method in both a static and a dynamic multi-rover learning domain where the agents have continuous state spaces and where their actions are noisy (e.g., the agents' movement decisions are not always carried out properly). Our results show that in the more difficult dynamic domain, the reward efficiency visualization method provides a two order of magnitude speedup in selecting a good reward. Most importantly, it allows one to quickly create and verify rewards tailored to the observational limitations of the domain.
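
    One agent reward structure frequently evaluated in this line of multiagent work (an addition here, not named in the abstract) is the difference reward: the global utility minus what the system would have earned had agent i's action been replaced by a null action. It stays aligned with coordination while shrinking each agent's credit-assignment problem. A toy sketch; the global utility G below is a made-up example, not the paper's rover domain.

```python
def difference_reward(G, joint_actions, i, null_action=None):
    """D_i = G(z) - G(z with agent i's action nulled out)."""
    counterfactual = list(joint_actions)
    counterfactual[i] = null_action
    return G(joint_actions) - G(counterfactual)

# Toy global utility: number of distinct areas covered by any agent.
G = lambda z: len({a for a in z if a is not None})
print(difference_reward(G, ['area1', 'area1', 'area2'], i=1))  # 0 (redundant)
print(difference_reward(G, ['area1', 'area3', 'area2'], i=1))  # 1 (unique contribution)
```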

  19. Efficient reinforcement learning: computational theories, neuroscience and robotics.

    PubMed

    Kawato, Mitsuo; Samejima, Kazuyuki

    2007-04-01

    Reinforcement learning algorithms have provided some of the most influential computational theories for behavioral learning that depends on reward and penalty. After briefly reviewing supporting experimental data, this paper tackles three difficult theoretical issues that remain to be explored. First, plain reinforcement learning is much too slow to be considered a plausible brain model. Second, although the temporal-difference error has an important role both in theory and in experiments, how to compute it remains an enigma. Third, the function of all brain areas, including the cerebral cortex, cerebellum, brainstem and basal ganglia, seems to necessitate a new computational framework. Computational studies that emphasize meta-parameters, hierarchy, modularity and supervised learning to resolve these issues are reviewed here, together with the related experimental data.

  20. Pleasurable music affects reinforcement learning according to the listener.

    PubMed

    Gold, Benjamin P; Frank, Michael J; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy.

  1. Pleasurable music affects reinforcement learning according to the listener

    PubMed Central

    Gold, Benjamin P.; Frank, Michael J.; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  2. Dopamine neurons learn relative chosen value from probabilistic rewards

    PubMed Central

    Lak, Armin; Stauffer, William R; Schultz, Wolfram

    2016-01-01

    Economic theories posit reward probability as one of the factors defining reward value. Individuals learn the value of cues that predict probabilistic rewards from experienced reward frequencies. Building on the notion that responses of dopamine neurons increase with reward probability and expected value, we asked how dopamine neurons in monkeys acquire this value signal that may represent an economic decision variable. We found in a Pavlovian learning task that reward probability-dependent value signals arose from experienced reward frequencies. We then assessed neuronal response acquisition during choices among probabilistic rewards. Here, dopamine responses became sensitive to the value of both chosen and unchosen options. Both experiments also showed novelty responses of dopamine neurons that decreased as learning advanced. These results show that dopamine neurons acquire predictive value signals from the frequency of experienced rewards. This flexible and fast signal reflects a specific decision variable and could update neuronal decision mechanisms. DOI: http://dx.doi.org/10.7554/eLife.18044.001 PMID:27787196
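
    The acquisition described here, value signals arising from experienced reward frequencies, matches the behavior of a simple delta rule, shown below as an illustration rather than the authors' fitted model; the learning rate and reward probability are arbitrary.

```python
import random

ALPHA = 0.1
random.seed(1)

v = 0.0                                   # learned cue value
for _ in range(300):
    reward = 1.0 if random.random() < 0.75 else 0.0
    v += ALPHA * (reward - v)             # delta-rule update
print(round(v, 2))                        # hovers near the 0.75 reward probability
```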

  3. Can model-free reinforcement learning explain deontological moral judgments?

    PubMed

    Ayars, Alisabeth

    2016-05-01

    Dual-systems frameworks propose that moral judgments are derived from both an immediate emotional response, and controlled/rational cognition. Recently Cushman (2013) proposed a new dual-system theory based on model-free and model-based reinforcement learning. Model-free learning attaches values to actions based on their history of reward and punishment, and explains some deontological, non-utilitarian judgments. Model-based learning involves the construction of a causal model of the world and allows for far-sighted planning; this form of learning fits well with utilitarian considerations that seek to maximize certain kinds of outcomes. I present three concerns regarding the use of model-free reinforcement learning to explain deontological moral judgment. First, many actions that humans find aversive from model-free learning are not judged to be morally wrong. Moral judgment must require something in addition to model-free learning. Second, there is a dearth of evidence for central predictions of the reinforcement account, e.g., that people with different reinforcement histories will, all else equal, make different moral judgments. Finally, to account for the effect of intention within the framework requires certain assumptions which lack support. These challenges are reasonable foci for future empirical/theoretical work on the model-free/model-based framework.

  4. Neuronal tuning in a brain-machine interface during Reinforcement Learning.

    PubMed

    Mahmoudi, Babak; Digiovanna, Jack; Principe, Jose C; Sanchez, Justin C

    2008-01-01

    In this research, we have used neural tuning to quantify the neural representation of a prosthetic arm's actions in a new BMI framework based on Reinforcement Learning (RLBMI). We observed that through closed-loop brain control, the neural representation changed to encode robot actions that maximize rewards. This is an interesting result because in our paradigm robot actions are directly controlled by a Computer Agent (CA) with reward states compatible with the user's rewards. Through co-adaptation, neural modulation is used to establish the value of robot actions to achieve reward.

  5. Bayesian Cue Integration as a Developmental Outcome of Reward Mediated Learning

    PubMed Central

    Weisswange, Thomas H.; Rothkopf, Constantin A.; Rodemann, Tobias; Triesch, Jochen

    2011-01-01

    Average human behavior in cue combination tasks is well predicted by Bayesian inference models. As this capability is acquired over developmental timescales, the question arises of how it is learned. Here we investigated whether reward-dependent learning, which is well established at the computational, behavioral, and neuronal levels, could contribute to this development. It is shown that a model-free reinforcement learning algorithm can indeed learn to do cue integration, i.e., weight uncertain cues according to their respective reliabilities, and even do so if reliabilities are changing. We also consider the case of causal inference, where multimodal signals can originate from one or multiple separate objects and should not always be integrated. In this case, the learner is shown to develop a behavior that is closest to Bayesian model averaging. We conclude that reward-mediated learning could be a driving force for the development of cue integration and causal inference. PMID:21750717
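
    For reference, the Bayesian benchmark that the model-free learner is shown to approximate: with independent Gaussian cues, the optimal estimate weights each cue by its reliability (inverse variance). A small sketch of that standard computation.

```python
def bayes_combine(cues):
    """cues: list of (value, variance) pairs for independent Gaussian cues.
    Returns the mean and variance of the combined (posterior) estimate."""
    precisions = [1.0 / var for _, var in cues]
    total = sum(precisions)
    mean = sum(x / var for x, var in cues) / total
    return mean, 1.0 / total

# A reliable cue (variance 0.25) dominates an unreliable one (variance 1.0).
print(bayes_combine([(1.0, 0.25), (2.0, 1.0)]))  # (1.2, 0.2)
```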

  6. Goal-Directed and Habit-Like Modulations of Stimulus Processing during Reinforcement Learning.

    PubMed

    Luque, David; Beesley, Tom; Morris, Richard W; Jack, Bradley N; Griffiths, Oren; Whitford, Thomas J; Le Pelley, Mike E

    2017-03-15

    Recent research has shown that perceptual processing of stimuli previously associated with high-value rewards is automatically prioritized even when rewards are no longer available. It has been hypothesized that such reward-related modulation of stimulus salience is conceptually similar to an "attentional habit." Recording event-related potentials in humans during a reinforcement learning task, we show strong evidence in favor of this hypothesis. Resistance to outcome devaluation (the defining feature of a habit) was shown by the stimulus-locked P1 component, reflecting activity in the extrastriate visual cortex. Analysis at longer latencies revealed a positive component (corresponding to the P3b, from 550-700 ms) sensitive to outcome devaluation. Therefore, distinct spatiotemporal patterns of brain activity were observed corresponding to habitual and goal-directed processes. These results demonstrate that reinforcement learning engages both attentional habits and goal-directed processes in parallel. Consequences for brain and computational models of reinforcement learning are discussed. SIGNIFICANCE STATEMENT: The human attentional network adapts to detect stimuli that predict important rewards. A recent hypothesis suggests that the visual cortex automatically prioritizes reward-related stimuli, driven by cached representations of reward value; that is, stimulus-response habits. Alternatively, the neural system may track the current value of the predicted outcome. Our results demonstrate for the first time that visual cortex activity is increased for reward-related stimuli even when the rewarding event is temporarily devalued. In contrast, longer-latency brain activity was specifically sensitive to transient changes in reward value. Therefore, we show that both habit-like attention and goal-directed processes occur in the same learning episode at different latencies. This result has important consequences for computational models of reinforcement learning.

  7. Social and monetary reward learning engage overlapping neural substrates

    PubMed Central

    Lin, Alice; Adolphs, Ralph

    2012-01-01

    Learning to make choices that yield rewarding outcomes requires the computation of three distinct signals: stimulus values that are used to guide choices at the time of decision making, experienced utility signals that are used to evaluate the outcomes of those decisions and prediction errors that are used to update the values assigned to stimuli during reward learning. Here we investigated whether monetary and social rewards involve overlapping neural substrates during these computations. Subjects engaged in two probabilistic reward learning tasks that were identical except that rewards were either social (pictures of smiling or angry people) or monetary (gaining or losing money). We found substantial overlap between the two types of rewards for all components of the learning process: a common area of ventromedial prefrontal cortex (vmPFC) correlated with stimulus value at the time of choice and another common area of vmPFC correlated with reward magnitude and common areas in the striatum correlated with prediction errors. Taken together, the findings support the hypothesis that shared anatomical substrates are involved in the computation of both monetary and social rewards. PMID:21427193

  8. Social and monetary reward learning engage overlapping neural substrates.

    PubMed

    Lin, Alice; Adolphs, Ralph; Rangel, Antonio

    2012-03-01

    Learning to make choices that yield rewarding outcomes requires the computation of three distinct signals: stimulus values that are used to guide choices at the time of decision making, experienced utility signals that are used to evaluate the outcomes of those decisions and prediction errors that are used to update the values assigned to stimuli during reward learning. Here we investigated whether monetary and social rewards involve overlapping neural substrates during these computations. Subjects engaged in two probabilistic reward learning tasks that were identical except that rewards were either social (pictures of smiling or angry people) or monetary (gaining or losing money). We found substantial overlap between the two types of rewards for all components of the learning process: a common area of ventromedial prefrontal cortex (vmPFC) correlated with stimulus value at the time of choice and another common area of vmPFC correlated with reward magnitude and common areas in the striatum correlated with prediction errors. Taken together, the findings support the hypothesis that shared anatomical substrates are involved in the computation of both monetary and social rewards. © The Author (2011). Published by Oxford University Press.

  9. Anhedonia and the Relative Reward Value of Drug and Nondrug Reinforcers in Cigarette Smokers

    PubMed Central

    Leventhal, Adam M.; Trujillo, Michael; Ameringer, Katherine J.; Tidey, Jennifer W.; Sussman, Steve; Kahler, Christopher W.

    2015-01-01

    Anhedonia—a psychopathologic trait indicative of diminished interest, pleasure, and enjoyment—has been linked to use of and addiction to several substances, including tobacco. We hypothesized that anhedonic drug users develop an imbalance in the relative reward value of drug versus nondrug reinforcers, which could maintain drug use behavior. To test this hypothesis, we examined whether anhedonia predicted the tendency to choose an immediate drug reward (i.e., smoking) over a less immediate nondrug reward (i.e., money) in a laboratory study of non–treatment-seeking adult cigarette smokers. Participants (N = 275, ≥ 10 cigarettes/day) attended a baseline visit that involved anhedonia assessment followed by 2 counterbalanced experimental visits: (a) after 16-hr smoking abstinence and (b) nonabstinent. At both experimental visits, participants completed self-report measures of mood state followed by a behavioral smoking task, which measured 2 aspects of the relative reward value of smoking versus money: (1) latency to initiate smoking when delaying smoking was monetarily rewarded and (2) willingness to purchase individual cigarettes. Results indicated that higher anhedonia predicted quicker smoking initiation and more cigarettes purchased. These relations were partially mediated by low positive and high negative mood states assessed immediately prior to the smoking task. Abstinence amplified the extent to which anhedonia predicted cigarette consumption among those who responded to the abstinence manipulation, but not the entire sample. Anhedonia may bias motivation toward smoking over alternative reinforcers, perhaps by giving rise to poor acute mood states. An imbalance in the reward value assigned to drug versus nondrug reinforcers may link anhedonia-related psychopathology to drug use. PMID:24886011

  10. Anhedonia and the relative reward value of drug and nondrug reinforcers in cigarette smokers.

    PubMed

    Leventhal, Adam M; Trujillo, Michael; Ameringer, Katherine J; Tidey, Jennifer W; Sussman, Steve; Kahler, Christopher W

    2014-05-01

    Anhedonia, a psychopathologic trait indicative of diminished interest, pleasure, and enjoyment, has been linked to use of and addiction to several substances, including tobacco. We hypothesized that anhedonic drug users develop an imbalance in the relative reward value of drug versus nondrug reinforcers, which could maintain drug use behavior. To test this hypothesis, we examined whether anhedonia predicted the tendency to choose an immediate drug reward (i.e., smoking) over a less immediate nondrug reward (i.e., money) in a laboratory study of non-treatment-seeking adult cigarette smokers. Participants (N = 275, ≥10 cigarettes/day) attended a baseline visit that involved anhedonia assessment followed by 2 counterbalanced experimental visits: (a) after 16-hr smoking abstinence and (b) nonabstinent. At both experimental visits, participants completed self-report measures of mood state followed by a behavioral smoking task, which measured 2 aspects of the relative reward value of smoking versus money: (1) latency to initiate smoking when delaying smoking was monetarily rewarded and (2) willingness to purchase individual cigarettes. Results indicated that higher anhedonia predicted quicker smoking initiation and more cigarettes purchased. These relations were partially mediated by low positive and high negative mood states assessed immediately prior to the smoking task. Abstinence amplified the extent to which anhedonia predicted cigarette consumption among those who responded to the abstinence manipulation, but not the entire sample. Anhedonia may bias motivation toward smoking over alternative reinforcers, perhaps by giving rise to poor acute mood states. An imbalance in the reward value assigned to drug versus nondrug reinforcers may link anhedonia-related psychopathology to drug use.

  11. Collaborating Fuzzy Reinforcement Learning Agents

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1997-01-01

    Earlier, we introduced GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Relearning and at the local level, each agent learns and operates based on ANTARCTIC, a technique for fuzzy reinforcement learning. In this paper, we show that it is possible for these agents to compete in order to affect the selected control policy but at the same time, they can collaborate while investigating the state space. In this model, the evaluator or the critic learns by observing all the agents' behaviors, but the control policy changes only based on the behavior of the winning agent, also known as the super agent.

  13. The influence of trial order on learning from reward vs. punishment in a probabilistic categorization task: experimental and computational analyses.

    PubMed

    Moustafa, Ahmed A; Gluck, Mark A; Herzallah, Mohammad M; Myers, Catherine E

    2015-01-01

    Previous research has shown that trial ordering affects cognitive performance, but this has not been tested using category-learning tasks that differentiate learning from reward and punishment. Here, we tested two groups of healthy young adults using a probabilistic category learning task of reward and punishment in which there are two types of trials (reward, punishment) and three possible outcomes: (1) positive feedback for correct responses in reward trials; (2) negative feedback for incorrect responses in punishment trials; and (3) no feedback for incorrect answers in reward trials and correct answers in punishment trials. Hence, trials without feedback are ambiguous, and may represent either successful avoidance of punishment or failure to obtain reward. In Experiment 1, the first group of subjects received an intermixed task in which reward and punishment trials were presented in the same block, as a standard baseline task. In Experiment 2, a second group completed the separated task, in which reward and punishment trials were presented in separate blocks. Additionally, in order to understand the mechanisms underlying performance in the experimental conditions, we fit individual data using a Q-learning model. Results from Experiment 1 show that subjects who completed the intermixed task paradoxically valued the no-feedback outcome as a reinforcer when it occurred on reinforcement-based trials, and as a punisher when it occurred on punishment-based trials. This is supported by patterns of empirical responding, where subjects showed more win-stay behavior following an explicit reward than following an omission of punishment, and more lose-shift behavior following an explicit punisher than following an omission of reward. In Experiment 2, results showed similar performance whether subjects received reward-based or punishment-based trials first. However, when the Q-learning model was applied to these data, there were differences between subjects in the reward…
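
    The Q-learning model fit here can be caricatured by letting the subjective value of the no-feedback outcome depend on trial type. The sign pattern below follows the paradoxical valuation reported in the abstract; the magnitudes, names, and learning rate are illustrative assumptions.

```python
ALPHA = 0.2
# Assumed subjective value of the ambiguous no-feedback outcome, by trial
# type (positive on reward trials, negative on punishment trials, as the
# abstract reports for the intermixed group).
R_NO_FEEDBACK = {'reward_trial': 0.5, 'punishment_trial': -0.5}

def q_update(Q, stimulus, action, feedback, trial_type):
    """feedback: +1 explicit reward, -1 explicit punishment, 0 no feedback."""
    r = feedback if feedback != 0 else R_NO_FEEDBACK[trial_type]
    key = (stimulus, action)
    Q[key] = Q.get(key, 0.0) + ALPHA * (r - Q.get(key, 0.0))
```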

  14. Pain relief produces negative reinforcement through activation of mesolimbic reward-valuation circuitry.

    PubMed

    Navratilova, Edita; Xie, Jennifer Y; Okun, Alec; Qu, Chaoling; Eyde, Nathan; Ci, Shuang; Ossipov, Michael H; King, Tamara; Fields, Howard L; Porreca, Frank

    2012-12-11

    Relief of pain is rewarding. Using a model of experimental postsurgical pain we show that blockade of afferent input from the injury with local anesthetic elicits conditioned place preference, activates ventral tegmental dopaminergic cells, and increases dopamine release in the nucleus accumbens. Importantly, place preference is associated with increased activity in midbrain dopaminergic neurons and blocked by dopamine antagonists injected into the nucleus accumbens. The data directly support the hypothesis that relief of pain produces negative reinforcement through activation of the mesolimbic reward-valuation circuitry.

  15. A Reinforcement Learning Approach to Control.

    DTIC Science & Technology

    1997-05-31

    acquisition is inherently a partially observable Markov decision problem. This report describes an efficient, scalable reinforcement learning approach to the ... deployment of refined intelligent gaze control techniques. This report first lays a theoretical foundation for reinforcement learning. It then introduces ... perform well in both high and low SNR ATR environments. Reinforcement learning coupled with history features appears to be both a sound foundation and a practical scalable base for gaze control.

  16. Reward learning and negative emotion during rapid attentional competition

    PubMed Central

    Yokoyama, Takemasa; Padmala, Srikanth; Pessoa, Luiz

    2015-01-01

    Learned stimulus-reward associations influence how attention is allocated, such that stimuli rewarded in the past are favored in situations involving limited resources and competition. At the same time, task-irrelevant, high-arousal negative stimuli capture attention and divert resources away from tasks, resulting in poor behavioral performance. Yet, investigations of how reward learning and negative stimuli affect perceptual and attentional processing have been conducted in a largely independent fashion. We have recently reported that performance-based monetary rewards reduce interference from negative stimuli during perception. The goal of the present study was to investigate how stimuli associated with past monetary rewards compete with negative stimuli during a subsequent attentional task when, critically, no performance-based rewards were at stake. Across two experiments, we found that target stimuli associated with high reward reduced the interference effect of potent negative distractors. Similar to our recent findings with performance-based rewards, our results demonstrate that reward-associated stimuli reduce the deleterious impact of negative stimuli on behavior. PMID:25814971

  17. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation.

    PubMed

    Kato, Ayaka; Morita, Kenji

    2016-10-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of 'Go' or 'No-Go' selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of 'Go' values towards a goal, and (2) value-contrasts between 'Go' and 'No-Go' are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological systems for value-learning
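
    A minimal sketch of the value-decay mechanism may make this concrete. The toy task (a chain of states ending in reward), learning rate, and decay rate below are illustrative assumptions; the point is only that decaying all learned values toward zero keeps the RPE from vanishing even for a fully predicted reward, and produces a value gradient toward the goal.

    ```python
    N = 8                                  # toy task: a chain of Go steps to a goal
    ALPHA, GAMMA, DECAY = 0.5, 0.97, 0.01  # illustrative parameters

    V = [0.0] * (N + 1)                    # V[N] is the (terminal) goal state

    for episode in range(500):
        for s in range(N):
            r = 1.0 if s == N - 1 else 0.0        # reward upon reaching the goal
            rpe = r + GAMMA * V[s + 1] - V[s]     # reward prediction error
            V[s] += ALPHA * rpe
            # Forgetting: every learned value decays toward zero on each step,
            # so the RPE never fully vanishes even for a predictable reward.
            V = [v * (1.0 - DECAY) for v in V]

    print([round(v, 2) for v in V])        # a value gradient rising toward the goal
    ```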

  18. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation

    PubMed Central

    Morita, Kenji

    2016-01-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of ‘Go’ or ‘No-Go’ selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of ‘Go’ values towards a goal, and (2) value-contrasts between ‘Go’ and ‘No-Go’ are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological

  19. Methamphetamine-induced disruption of frontostriatal reward learning signals: relation to psychotic symptoms.

    PubMed

    Bernacer, Javier; Corlett, Philip R; Ramachandra, Pranathi; McFarlane, Brady; Turner, Danielle C; Clark, Luke; Robbins, Trevor W; Fletcher, Paul C; Murray, Graham K

    2013-11-01

    Frontostriatal circuitry is critical to learning processes, and its disruption may underlie maladaptive decision making and the generation of psychotic symptoms in schizophrenia. However, there is a paucity of evidence directly examining the role of modulatory neurotransmitters on frontostriatal function in humans. In order to probe the effects of modulation on frontostriatal circuitry during learning and to test whether disruptions in learning processes may be related to the pathogenesis of psychosis, the authors explored the brain representations of reward prediction error and incentive value, two key reinforcement learning parameters, before and after methamphetamine challenge. Healthy volunteers (N=18) underwent functional MRI (fMRI) scanning while performing a reward learning task on three occasions: after placebo, after methamphetamine infusion (0.3 mg/kg body weight), and after pretreatment with 400 mg of amisulpride and then methamphetamine infusion. Brain fMRI representations of learning signals, calculated using a reinforcement Q-learning algorithm, were compared across drug conditions. In the placebo condition, reward prediction error was coded in the ventral striatum bilaterally and incentive value in the ventromedial prefrontal cortex bilaterally. Reward prediction error and incentive value signals were disrupted by methamphetamine in the left nucleus accumbens and left ventromedial prefrontal cortex, respectively. Psychotic symptoms were significantly correlated with incentive value disruption in the ventromedial prefrontal and posterior cingulate cortex. Amisulpride pretreatment did not significantly alter methamphetamine-induced effects. The results demonstrate that methamphetamine impairs brain representations of computational parameters that underpin learning. They also demonstrate a significant link between psychosis and abnormal monoamine-regulated learning signals in the prefrontal and cingulate cortices.
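
    For readers unfamiliar with the approach, here is a minimal sketch of how trial-wise learning signals of this kind are derived for fMRI analysis; the choice/outcome sequences and learning rate are hypothetical placeholders, not the study's data or fitted parameters.

    ```python
    ALPHA = 0.3                        # assumed learning rate

    choices  = [0, 1, 1, 0, 1]         # hypothetical data: option chosen per trial
    outcomes = [1, 0, 1, 1, 0]         # hypothetical data: reward received (0/1)

    Q = [0.5, 0.5]                     # initial option values
    rpe_series, value_series = [], []
    for c, r in zip(choices, outcomes):
        value_series.append(Q[c])      # incentive value of the chosen option
        rpe = r - Q[c]                 # reward prediction error at outcome
        rpe_series.append(rpe)
        Q[c] += ALPHA * rpe            # Q-learning update

    # These trial-wise series would be convolved with a hemodynamic response
    # function and entered as parametric regressors in the fMRI analysis.
    print(rpe_series, value_series)
    ```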

  20. Distributed reinforcement learning for adaptive and robust network intrusion response

    NASA Astrophysics Data System (ADS)

    Malialis, Kleanthis; Devlin, Sam; Kudenko, Daniel

    2015-07-01

    Distributed denial of service (DDoS) attacks constitute a rapidly evolving threat in the current Internet. Multiagent Router Throttling is a novel approach to defend against DDoS attacks where multiple reinforcement learning agents are installed on a set of routers and learn to rate-limit or throttle traffic towards a victim server. The focus of this paper is on online learning and scalability. We propose an approach that incorporates task decomposition, team rewards and a form of reward shaping called difference rewards. One of the novel characteristics of the proposed system is that it provides a decentralised coordinated response to the DDoS problem, thus being resilient to DDoS attacks themselves. The proposed system learns remarkably fast, thus being suitable for online learning. Furthermore, its scalability is successfully demonstrated in experiments involving 1000 learning agents. We compare our approach against a baseline and a popular state-of-the-art throttling technique from the network security literature and show that the proposed approach is more effective, adaptive to sophisticated attack rate dynamics and robust to agent failures.
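
    The difference-rewards form of shaping mentioned above has a compact statement: agent i receives D_i = G(z) - G(z_-i), the global reward minus the global reward that would have resulted had agent i's action been replaced by a default action. A minimal sketch follows, with a made-up global objective standing in for the router-throttling domain (the function and numbers are illustrative assumptions, not the paper's formulation).

    ```python
    def global_reward(throttles):
        """Hypothetical global objective: credit for admitted traffic (1 - t per
        router) minus a heavy penalty when the victim's capacity is exceeded."""
        admitted = sum(1.0 - t for t in throttles)
        capacity = 0.6 * len(throttles)
        return admitted - 10.0 * max(0.0, admitted - capacity)

    def difference_reward(throttles, i, default=1.0):
        """D_i = G(z) - G(z_-i): agent i's marginal contribution to the global
        reward, computed by replacing its action with a default (full throttle)."""
        counterfactual = list(throttles)
        counterfactual[i] = default
        return global_reward(throttles) - global_reward(counterfactual)

    actions = [0.2, 0.5, 0.9]          # illustrative per-router throttle levels
    print([round(difference_reward(actions, i), 2) for i in range(len(actions))])
    ```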

  1. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.

    PubMed

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-04-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
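
    For reference, in the continuous TD framework of Doya (2000) on which the model builds, the TD error takes the form δ(t) = r(t) - V(t)/τ + dV(t)/dt, with τ the reward discount time constant. A minimal Euler-discretized sketch is below; the reward pulse and value function are placeholders, and nothing of the paper's spiking-network implementation is reproduced.

    ```python
    import math

    TAU, DT = 1.0, 0.01        # assumed discount time constant and Euler step

    def reward(t):
        return 1.0 if 0.5 < t < 0.6 else 0.0     # placeholder reward pulse

    def value(t):
        return 0.3 * math.exp(-t)                # placeholder value estimate

    def td_error(t):
        """delta(t) = r(t) - V(t)/tau + dV/dt (dV/dt via finite differences)."""
        dv_dt = (value(t + DT) - value(t)) / DT
        return reward(t) - value(t) / TAU + dv_dt

    print([round(td_error(0.1 * k), 3) for k in range(10)])
    ```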

  2. Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons

    PubMed Central

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-01-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity. PMID:23592970

  3. Classroom Reinforcement and Learning: A Quantitative Synthesis.

    ERIC Educational Resources Information Center

    Lysakowski, Richard S.; Walberg, Herbert J.

    1981-01-01

    A review of statistical data from previous studies determined the benefits of positive reinforcement on learning in students from kindergarten through college. Results indicate that differences between reinforced and control groups are greater for girls and for students from special schools, and that reinforcement appears to have a strong effect…

  4. Reinforcement learning in multidimensional environments relies on attention mechanisms.

    PubMed

    Niv, Yael; Daniel, Reka; Geana, Andra; Gershman, Samuel J; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C

    2015-05-27

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this "representation learning" process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the "curse of dimensionality" in reinforcement learning. Copyright © 2015 the authors.
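
    A minimal sketch of one form such representation learning can take: stimulus value is an attention-weighted sum of feature weights, and the prediction error updates only the chosen stimulus's features, scaled by attention. The dimensions, attention weights, and parameters below are illustrative assumptions, not the specific model fitted in the study.

    ```python
    ALPHA = 0.2                                                  # assumed learning rate
    attention = {"color": 0.8, "shape": 0.15, "texture": 0.05}   # learned relevance
    w = {}                                                       # per-feature weights

    def value(stimulus):
        """Stimulus value: attention-weighted sum of its features' weights."""
        return sum(attention[d] * w.get((d, f), 0.0) for d, f in stimulus.items())

    def update(stimulus, reward):
        """Prediction error updates only the chosen stimulus's features, with
        credit concentrated on the dimensions attention marks as relevant."""
        delta = reward - value(stimulus)
        for d, f in stimulus.items():
            w[(d, f)] = w.get((d, f), 0.0) + ALPHA * attention[d] * delta

    chosen = {"color": "red", "shape": "circle", "texture": "plaid"}
    update(chosen, 1.0)            # mostly the "red" weight moves
    print(round(value(chosen), 3))
    ```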

  5. A reinforcement learning approach to gait training improves retention

    PubMed Central

    Hasson, Christopher J.; Manczurowsky, Julia; Yen, Sheng-Che

    2015-01-01

    Many gait training programs are based on supervised learning principles: an individual is guided towards a desired gait pattern with directional error feedback. While this results in rapid adaptation, improvements quickly disappear. This study tested the hypothesis that a reinforcement learning approach improves retention and transfer of a new gait pattern. The results of a pilot study and larger experiment are presented. Healthy subjects were randomly assigned to either a supervised group, who received explicit instructions and directional error feedback while they learned a new gait pattern on a treadmill, or a reinforcement group, who was only shown whether they were close to or far from the desired gait. Subjects practiced for 10 min, followed by immediate and overnight retention and over-ground transfer tests. The pilot study showed that subjects could learn a new gait pattern under a reinforcement learning paradigm. The larger experiment, which had twice as many subjects (16 in each group) showed that the reinforcement group had better overnight retention than the supervised group (a 32% vs. 120% error increase, respectively), but there were no differences for over-ground transfer. These results suggest that encouraging participants to find rewarding actions through self-guided exploration is beneficial for retention. PMID:26379524

  6. A reinforcement learning approach to gait training improves retention.

    PubMed

    Hasson, Christopher J; Manczurowsky, Julia; Yen, Sheng-Che

    2015-01-01

    Many gait training programs are based on supervised learning principles: an individual is guided towards a desired gait pattern with directional error feedback. While this results in rapid adaptation, improvements quickly disappear. This study tested the hypothesis that a reinforcement learning approach improves retention and transfer of a new gait pattern. The results of a pilot study and larger experiment are presented. Healthy subjects were randomly assigned to either a supervised group, who received explicit instructions and directional error feedback while they learned a new gait pattern on a treadmill, or a reinforcement group, who was only shown whether they were close to or far from the desired gait. Subjects practiced for 10 min, followed by immediate and overnight retention and over-ground transfer tests. The pilot study showed that subjects could learn a new gait pattern under a reinforcement learning paradigm. The larger experiment, which had twice as many subjects (16 in each group) showed that the reinforcement group had better overnight retention than the supervised group (a 32% vs. 120% error increase, respectively), but there were no differences for over-ground transfer. These results suggest that encouraging participants to find rewarding actions through self-guided exploration is beneficial for retention.

  7. Accelerating Multiagent Reinforcement Learning by Equilibrium Transfer.

    PubMed

    Hu, Yujing; Gao, Yang; An, Bo

    2015-07-01

    An important approach in multiagent reinforcement learning (MARL) is equilibrium-based MARL, which adopts equilibrium solution concepts in game theory and requires agents to play equilibrium strategies at each state. However, most existing equilibrium-based MARL algorithms cannot scale due to a large number of computationally expensive equilibrium computations (e.g., computing Nash equilibria is PPAD-hard) during learning. For the first time, this paper finds that during the learning process of equilibrium-based MARL, the one-shot games corresponding to each state's successive visits often have the same or similar equilibria (for some states more than 90% of games corresponding to successive visits have similar equilibria). Inspired by this observation, this paper proposes to use equilibrium transfer to accelerate equilibrium-based MARL. The key idea of equilibrium transfer is to reuse previously computed equilibria when each agent has a small incentive to deviate. By introducing transfer loss and transfer condition, a novel framework called equilibrium transfer-based MARL is proposed. We prove that although equilibrium transfer brings transfer loss, equilibrium-based MARL algorithms can still converge to an equilibrium policy under certain assumptions. Experimental results in widely used benchmarks (e.g., grid world game, soccer game, and wall game) show that the proposed framework: 1) not only significantly accelerates equilibrium-based MARL (up to 96.7% reduction in learning time), but also achieves higher average rewards than algorithms without equilibrium transfer and 2) scales significantly better than algorithms without equilibrium transfer when the state/action space grows and the number of agents increases.
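
    The transfer condition can be phrased as an ε-equilibrium check: reuse a previously computed equilibrium at a newly encountered one-shot game whenever no agent could gain more than ε by unilaterally deviating. A minimal two-player sketch follows (the payoff matrices and ε are illustrative; the paper's transfer-loss machinery is not reproduced).

    ```python
    def deviation_incentive(payoffs, profile):
        """Largest gain any player could obtain by unilaterally deviating from a
        pure-strategy profile of a 2-player game; payoffs[i][a0][a1] is player
        i's payoff when players 0 and 1 play a0 and a1."""
        a0, a1 = profile
        gain = 0.0
        for alt in range(len(payoffs[0])):          # player 0's deviations
            gain = max(gain, payoffs[0][alt][a1] - payoffs[0][a0][a1])
        for alt in range(len(payoffs[0][0])):       # player 1's deviations
            gain = max(gain, payoffs[1][a0][alt] - payoffs[1][a0][a1])
        return gain

    # Reuse the previously computed equilibrium if deviating gains at most EPS.
    old_equilibrium = (0, 0)
    new_game = [[[3, 0], [0, 1]],    # player 0's payoff matrix
                [[3, 0], [0, 1]]]    # player 1's payoff matrix
    EPS = 0.1
    print("transfer:", deviation_incentive(new_game, old_equilibrium) <= EPS)
    ```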

  8. A Selective Role for Lmo4 in Cue–Reward Learning

    PubMed Central

    Mangieri, Regina A.; Morrisett, Richard A.; Heberlein, Ulrike; Messing, Robert O.

    2015-01-01

    The ability to use environmental cues to predict rewarding events is essential to survival. The basolateral amygdala (BLA) plays a central role in such forms of associative learning. Aberrant cue–reward learning is thought to underlie many psychopathologies, including addiction, so understanding the underlying molecular mechanisms can inform strategies for intervention. The transcriptional regulator LIM-only 4 (LMO4) is highly expressed in pyramidal neurons of the BLA, where it plays an important role in fear learning. Because the BLA also contributes to cue–reward learning, we investigated the role of BLA LMO4 in this process using Lmo4-deficient mice and RNA interference. Lmo4-deficient mice showed a selective deficit in conditioned reinforcement. Knockdown of LMO4 in the BLA, but not in the nucleus accumbens, recapitulated this deficit in wild-type mice. Molecular and electrophysiological studies identified a deficit in dopamine D2 receptor signaling in the BLA of Lmo4-deficient mice. These results reveal a novel, LMO4-dependent transcriptional program within the BLA that is essential to cue–reward learning. PMID:26134647

  9. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems

    PubMed Central

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider. PMID:24137146

  10. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems.

    PubMed

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider.

  11. Reinforcement Learning in Multidimensional Environments Relies on Attention Mechanisms

    PubMed Central

    Daniel, Reka; Geana, Andra; Gershman, Samuel J.; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C.

    2015-01-01

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this “representation learning” process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the “curse of dimensionality” in reinforcement learning. PMID:26019331

  12. Probabilistic reinforcement learning in adults with autism spectrum disorders.

    PubMed

    Solomon, Marjorie; Smith, Anne C; Frank, Michael J; Ly, Stanford; Carter, Cameron S

    2011-04-01

    Autism spectrum disorders (ASDs) can be conceptualized as disorders of learning; however, there have been few experimental studies taking this perspective. We examined the probabilistic reinforcement learning performance of 28 adults with ASDs and 30 typically developing adults on a task requiring learning relationships between three stimulus pairs consisting of Japanese characters with feedback that was valid with different probabilities (80%, 70%, and 60%). Both univariate and Bayesian state-space data analytic methods were employed. Hypotheses were based on the extant literature as well as on neurobiological and computational models of reinforcement learning. Both groups learned the task after training. However, there were group differences in early learning in the first task block, where individuals with ASDs acquired the most frequently accurately reinforced stimulus pair (80%) comparably to typically developing individuals; exhibited poorer acquisition of the less frequently reinforced 70% pair as assessed by state-space learning curves; and outperformed typically developing individuals on the near-chance (60%) pair. Individuals with ASDs also demonstrated deficits in using positive feedback to exploit rewarded choices. Results support the contention that individuals with ASDs are slower learners. Based on neurobiology and on the results of computational modeling, one interpretation of this pattern of findings is that impairments are related to deficits in flexible updating of reinforcement history as mediated by the orbito-frontal cortex, with spared functioning of the basal ganglia. This hypothesis about the pathophysiology of learning in ASDs can be tested using functional magnetic resonance imaging. Copyright © 2011, International Society for Autism Research, Wiley-Liss, Inc.

  13. Enhanced Experience Replay for Deep Reinforcement Learning

    DTIC Science & Technology

    2015-11-01

    ARL-TR-7538, November 2015. US Army Research Laboratory. Enhanced Experience Replay for Deep Reinforcement Learning, by David Doria, Bryan Dawson, and Manuel Vindiola, Computational and Information Sciences Directorate.
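
    For context, the data structure at the core of experience replay is simple. A minimal sketch with uniform sampling follows; the report's specific enhancements are not reproduced here.

    ```python
    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-capacity store of (state, action, reward, next_state, done)
        transitions; sampling minibatches uniformly at random breaks the temporal
        correlations in the agent's stream of experience."""

        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions drop off

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)

    buf = ReplayBuffer()
    for t in range(100):                           # fill with dummy transitions
        buf.push(t, t % 4, 0.0, t + 1, False)
    print(len(buf.sample(8)))
    ```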

  14. The Influence of Emotional State on Learning From Reward and Punishment in Borderline Personality Disorder.

    PubMed

    Dixon-Gordon, Katherine L; Tull, Matthew T; Hackel, Leor M; Gratz, Kim L

    2017-06-08

    Despite preliminary evidence that individuals with borderline personality disorder (BPD) demonstrate deficits in learning from corrective feedback, no studies have examined the influence of emotional state on these learning deficits in BPD. This laboratory study examined the influence of negative emotions on learning among participants with BPD (n = 17), compared with clinical (past-year mood/anxiety disorder; n = 20) and healthy (n = 23) controls. Participants completed a reinforcement learning task before and after a negative emotion induction. The learning task involved presenting pairs of stimuli with probabilistic feedback in the training phase, and subsequently assessing accuracy for choosing previously rewarded stimuli or avoiding previously punished stimuli. ANOVAs and ANCOVAs revealed no significant between-group differences in overall learning accuracy. However, there was an effect of group in the ANCOVA for postemotion induction high-conflict punishment learning accuracy, with the BPD group showing greater decrements in learning accuracy than controls following the negative emotion induction.

  15. A quantitative analysis of the reward-enhancing effects of nicotine using reinforcer demand

    PubMed Central

    Barrett, Scott T.; Bevins, Rick A.

    2013-01-01

    Reward enhancement by nicotine has been suggested as an important phenomenon contributing toward tobacco abuse and dependence. Reinforcement value is a multifaceted construct not fully represented by any single measure of response strength. The present study evaluated the changes in the reinforcement value of a visual stimulus in 16 male Sprague–Dawley rats using the reinforcer demand technique proposed by Hursh and Silberberg. The different parameters of the model have been shown to represent differing facets of reinforcement value, including intensity, perseverance, and sensitivity to changes in response cost. Rats lever-pressed for 1-min presentations of a compound visual stimulus over blocks of 10 sessions across a range of response requirements (fixed ratio 1, 2, 4, 8, 14, 22, 32). Nicotine (0.4 mg/kg, base) or saline was administered 5 min before each session. Estimates from the demand model were calculated between nicotine and saline administration conditions within subjects and changes in reinforcement value were assessed as differences in Q0, Pmax, Omax, and essential value. Nicotine administration increased operant responding across the entire range of reinforcement schedules tested, and uniformly affected model parameter estimates in a manner suggesting increased reinforcement value of the visual stimulus. PMID:23080311
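
    For reference, the exponential demand model of Hursh and Silberberg can be written as follows, where Q is consumption at unit price (response cost) C, Q_0 is demand intensity at zero price, k is a constant scaling the consumption range in log units, and α is the rate of decline in consumption (essential value varies inversely with α). P_max is conventionally the price at which demand becomes unit elastic, and O_max the response output at that price:

    ```latex
    \log_{10} Q = \log_{10} Q_0 + k\left(e^{-\alpha Q_0 C} - 1\right),
    \qquad O_{\max} = P_{\max}\, Q(P_{\max})
    ```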

  16. Resting-state EEG theta activity and risk learning: sensitivity to reward or punishment?

    PubMed

    Massar, Stijn A A; Kenemans, J Leon; Schutter, Dennis J L G

    2014-03-01

    An increased theta (4-7 Hz) to beta (13-30 Hz) power ratio in resting-state electroencephalography (EEG) has been associated with risky, disadvantageous decision making and with impaired reinforcement learning. However, the specific contributions of theta and beta power to risky decision making remain unclear. The first aim of the present study was to replicate the earlier-found relationship and examine the specific contributions of theta and beta power to risky decision making using the Iowa Gambling Task. The second aim of the study was to examine whether the relation was associated with differences in reward or punishment sensitivity. We replicated the earlier-found relationship by showing a positive association between theta/beta ratio and risky decision making. This correlation was mainly driven by theta oscillations. Furthermore, theta power correlated with reward-motivated learning, but not with punishment learning. The present results replicate and extend earlier findings by providing novel insights into the relation between theta/beta ratios and risky decision making. Specifically, findings show that resting-state theta activity is correlated with reinforcement learning, and that this association may be explained by differences in reward sensitivity.

  17. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling.

    PubMed

    Redish, A David; Jensen, Steve; Johnson, Adam; Kurth-Nelson, Zeb

    2007-07-01

    Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated value and are thus unable to capture renewal. TDRL models are based on the hypothesis that dopamine carries a reward prediction error signal; these models predict reward by driving that reward error to zero. The authors construct a TDRL model that can accommodate extinction and renewal through two simple processes: (a) a TDRL process that learns the value of situation-action pairs and (b) a situation recognition process that categorizes the observed cues into situations. This model has implications for dysfunctional states, including relapse after addiction and problem gambling.
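
    A minimal sketch of the two processes may be useful. Below, values attach to (situation, action) pairs, and a situation-recognition step maps cue patterns to discrete situations; in this toy version the extinction situation is given as a distinct label, whereas in the full model the split would be inferred from persistent prediction errors. Parameters and cue labels are illustrative assumptions.

    ```python
    ALPHA = 0.3       # illustrative learning rate

    situations = {}   # maps an observed cue pattern to a discrete situation id
    Q = {}            # learned values of (situation, action) pairs

    def recognize(cues):
        """Situation-recognition process: categorize observed cues into
        situations; an unfamiliar cue pattern is assigned a brand-new one."""
        if cues not in situations:
            situations[cues] = len(situations)
        return situations[cues]

    def td_update(cues, action, reward):
        """TDRL process over situation-action pairs (one-step version)."""
        key = (recognize(cues), action)
        old = Q.get(key, 0.0)
        Q[key] = old + ALPHA * (reward - old)

    for _ in range(50):                        # acquisition: cue pattern rewarded
        td_update("lever+light", "press", 1.0)
    # Extinction arrives as a recognizably different situation (here a label;
    # in the full model, inferred from persistent prediction errors), so the
    # acquired value is never unlearned...
    for _ in range(50):
        td_update("lever+light/extinction", "press", 0.0)
    # ...which is why renewal is fast when the original situation reappears.
    print(round(Q[(situations["lever+light"], "press")], 3))   # still ~1.0
    ```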

  18. Abnormal temporal difference reward-learning signals in major depression.

    PubMed

    Kumar, P; Waiter, G; Ahearn, T; Milders, M; Reid, I; Steele, J D

    2008-08-01

    Anhedonia is a core symptom of major depressive disorder (MDD), long thought to be associated with reduced dopaminergic function. However, most antidepressants do not act directly on the dopamine system and all antidepressants have a delayed full therapeutic effect. Recently, it has been proposed that antidepressants fail to alter dopamine function in antidepressant unresponsive MDD. There is compelling evidence that dopamine neurons code a specific phasic (short duration) reward-learning signal, described by temporal difference (TD) theory. There is no current evidence for other neurons coding a TD reward-learning signal, although such evidence may be found in time. The neuronal substrates of the TD signal were not explored in this study. Phasic signals are believed to have quite different properties to tonic (long duration) signals. No studies have investigated phasic reward-learning signals in MDD. Therefore, adults with MDD receiving long-term antidepressant medication, and comparison controls both unmedicated and acutely medicated with the antidepressant citalopram, were scanned using fMRI during a reward-learning task. Three hypotheses were tested: first, patients with MDD have blunted TD reward-learning signals; second, controls given an antidepressant acutely have blunted TD reward-learning signals; third, the extent of alteration in TD signals in major depression correlates with illness severity ratings. The results supported the hypotheses. Patients with MDD had significantly reduced reward-learning signals in many non-brainstem regions: ventral striatum (VS), rostral and dorsal anterior cingulate, retrosplenial cortex (RC), midbrain and hippocampus. However, the TD signal was increased in the brainstem of patients. As predicted, acute antidepressant administration to controls was associated with a blunted TD signal, and the brainstem TD signal was not increased by acute citalopram administration. In a number of regions, the magnitude of the abnormal

  19. Human dorsal striatal activity during choice discriminates reinforcement learning behavior from the gambler's fallacy.

    PubMed

    Jessup, Ryan K; O'Doherty, John P

    2011-04-27

    Reinforcement learning theory has generated substantial interest in neurobiology, particularly because of the resemblance between phasic dopamine and reward prediction errors. Actor-critic theories have been adapted to account for the functions of the striatum, with parts of the dorsal striatum equated to the actor. Here, we specifically test whether the human dorsal striatum--as predicted by an actor-critic instantiation--is used on a trial-to-trial basis at the time of choice to choose in accordance with reinforcement learning theory, as opposed to a competing strategy: the gambler's fallacy. Using a partial-brain functional magnetic resonance imaging scanning protocol focused on the striatum and other ventral brain areas, we found that the dorsal striatum is more active when choosing consistent with reinforcement learning compared with the competing strategy. Moreover, an overlapping area of dorsal striatum along with the ventral striatum was found to be correlated with reward prediction errors at the time of outcome, as predicted by the actor-critic framework. These findings suggest that the same region of dorsal striatum involved in learning stimulus-response associations may contribute to the control of behavior during choice, thereby using those learned associations. Intriguingly, neither reinforcement learning nor the gambler's fallacy conformed to the optimal choice strategy on the specific decision-making task we used. Thus, the dorsal striatum may contribute to the control of behavior according to reinforcement learning even when the prescriptions of such an algorithm are suboptimal in terms of maximizing future rewards.

  20. Multiagent cooperation and competition with deep reinforcement learning.

    PubMed

    Tampuu, Ardi; Matiisen, Tambet; Kodelja, Dorian; Kuzovkin, Ilya; Korjus, Kristjan; Aru, Juhan; Aru, Jaan; Vicente, Raul

    2017-01-01

    Evolution of cooperation and competition can appear when multiple adaptive agents share a biological, social, or technological niche. In the present work we study how cooperation and competition emerge between autonomous agents that learn by reinforcement while using only their raw visual input as the state representation. In particular, we extend the Deep Q-Learning framework to multiagent environments to investigate the interaction between two learning agents in the well-known video game Pong. By manipulating the classical rewarding scheme of Pong we show how competitive and collaborative behaviors emerge. We also describe the progression from competitive to collaborative behavior when the incentive to cooperate is increased. Finally we show how learning by playing against another adaptive agent, instead of against a hard-wired algorithm, results in more robust strategies. The present work shows that Deep Q-Networks can become a useful tool for studying decentralized learning of multiagent systems coping with high-dimensional environments.
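
    One way to parameterize a rewarding scheme that interpolates between competition and cooperation in Pong is sketched below; the details of the scheme actually used in the paper may differ, so treat this as an assumption-laden illustration.

    ```python
    def pong_rewards(scorer, rho):
        """Hypothetical parameterization: the player who concedes the point
        always receives -1, while the scoring player receives rho. rho = +1
        gives the fully competitive zero-sum game; rho = -1 makes losing the
        ball costly for both players, encouraging cooperative rallies."""
        rewards = [0.0, 0.0]
        rewards[scorer] = rho
        rewards[1 - scorer] = -1.0
        return rewards

    for rho in (1.0, 0.0, -1.0):
        print(rho, pong_rewards(scorer=0, rho=rho))
    ```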

  1. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading.

    PubMed

    Deng, Yue; Bao, Feng; Kong, Youyong; Ren, Zhiquan; Dai, Qionghai

    2017-03-01

    Can we train the computer to beat experienced traders at financial asset trading? In this paper, we try to address this challenge by introducing a recurrent deep neural network (NN) for real-time financial signal representation and trading. Our model is inspired by two biologically related learning concepts, deep learning (DL) and reinforcement learning (RL). In the framework, the DL part automatically senses the dynamic market condition for informative feature learning. Then, the RL module interacts with the deep representations and makes trading decisions to accumulate the ultimate rewards in an unknown environment. The learning system is implemented in a complex NN that exhibits both deep and recurrent structures. Hence, we propose a task-aware backpropagation-through-time method to cope with the vanishing-gradient issue in deep training. The robustness of the neural system is verified on both the stock and the commodity futures markets under broad testing conditions.

  2. Multiagent cooperation and competition with deep reinforcement learning

    PubMed Central

    Kodelja, Dorian; Kuzovkin, Ilya; Korjus, Kristjan; Aru, Juhan; Aru, Jaan; Vicente, Raul

    2017-01-01

    Evolution of cooperation and competition can appear when multiple adaptive agents share a biological, social, or technological niche. In the present work we study how cooperation and competition emerge between autonomous agents that learn by reinforcement while using only their raw visual input as the state representation. In particular, we extend the Deep Q-Learning framework to multiagent environments to investigate the interaction between two learning agents in the well-known video game Pong. By manipulating the classical rewarding scheme of Pong we show how competitive and collaborative behaviors emerge. We also describe the progression from competitive to collaborative behavior when the incentive to cooperate is increased. Finally we show how learning by playing against another adaptive agent, instead of against a hard-wired algorithm, results in more robust strategies. The present work shows that Deep Q-Networks can become a useful tool for studying decentralized learning of multiagent systems coping with high-dimensional environments. PMID:28380078

  3. Intra-accumbens amphetamine increases the conditioned incentive salience of sucrose reward: enhancement of reward "wanting" without enhanced "liking" or response reinforcement.

    PubMed

    Wyvell, C L; Berridge, K C

    2000-11-01

    Amphetamine microinjection into the nucleus accumbens shell enhanced the ability of a Pavlovian reward cue to trigger increased instrumental performance for sucrose reward in a pure conditioned incentive paradigm. Rats were first trained to press one of two levers to obtain sucrose pellets. They were separately conditioned to associate a Pavlovian cue (30 sec light) with free sucrose pellets. On test days, the rats received bilateral microinjection of intra-accumbens vehicle or amphetamine (0.0, 2.0, 10.0, or 20.0 microgram/0.5 microliter), and lever pressing was tested in the absence of any reinforcement contingency, while the Pavlovian cue alone was freely presented at intervals throughout the session. Amphetamine microinjection selectively potentiated the cue-elicited increase in sucrose-associated lever pressing, although instrumental responding was not reinforced by either sucrose or the cue during the test. Intra-accumbens amphetamine can therefore potentiate cue-triggered incentive motivation for reward in the absence of primary or secondary reinforcement. Using the taste reactivity measure of hedonic impact, it was shown that intra-accumbens amphetamine failed to increase positive hedonic reaction patterns elicited by sucrose (i.e., sucrose "liking") at doses that effectively increase sucrose "wanting." We conclude that nucleus accumbens dopamine specifically mediates the ability of reward cues to trigger "wanting" (incentive salience) for their associated rewards, independent of both hedonic impact and response reinforcement.

  4. How Food as a Reward Is Detrimental to Children's Health, Learning, and Behavior.

    PubMed

    Fedewa, Alicia L; Davis, Matthew Cody

    2015-09-01

    Despite small- and wide-scale prevention efforts to curb obesity, the percentage of children classified as overweight and obese has remained relatively consistent in the last decade. As school personnel are increasingly pressured to enhance student performance, many educators use food as a reward to motivate and reinforce positive behavior and high achievement. Yet, many educators have missed the link between student health and academic achievement. Based on a review of the literature, this article explores the link between childhood obesity and adverse mental and physical health, learning, and behavior outcomes. The role of providing children with food as a reward in the relationship between obesity and detrimental health and performance outcomes is examined. The use of food as a reward is pervasive in school classrooms. Although there is a paucity of research in this area, the few studies published in this area show detrimental outcomes for children in the areas of physical health, learning, and behavior. It is imperative that educators understand the adverse outcomes associated with using food as a reward for good behavior and achievement. This study provides alternatives to using food as a reward and outlines future directions for research. © 2015, American School Health Association.

  5. Reward-based learning for virtual neurorobotics through emotional speech processing.

    PubMed

    Jayet Bray, Laurence C; Ferneyhough, Gareth B; Barker, Emily R; Thibeault, Corey M; Harris, Frederick C

    2013-01-01

    Reward-based learning can easily be applied to real life, as in common methods for teaching children. It also allows machines and software agents to automatically determine the ideal behavior from simple reward feedback (e.g., encouragement) to maximize their performance. Advancements in affective computing, especially emotional speech processing (ESP), have allowed for more natural interaction between humans and robots. Our research focuses on integrating a novel ESP system in a relevant virtual neurorobotic (VNR) application. We created an emotional speech classifier that successfully distinguished happy and sad utterances. The accuracy of the system was 95.3 and 98.7% during the offline mode (using an emotional speech database) and the live mode (using live recordings), respectively. It was then integrated in a neurorobotic scenario, where a virtual neurorobot had to learn a simple exercise through reward-based learning. If the correct decision was made, the robot received a spoken reward, which in turn stimulated synapses (in our simulated model) undergoing spike-timing-dependent plasticity (STDP) and reinforced the corresponding neural pathways. Both our ESP and neurorobotic systems allowed our neurorobot to successfully and consistently learn the exercise. The integration of ESP in a real-time computational neuroscience architecture is a first step toward the combination of human emotions and virtual neurorobotics.

  6. Reward-based learning for virtual neurorobotics through emotional speech processing

    PubMed Central

    Jayet Bray, Laurence C.; Ferneyhough, Gareth B.; Barker, Emily R.; Thibeault, Corey M.; Harris, Frederick C.

    2013-01-01

    Reward-based learning can easily be applied to real life, as in common methods for teaching children. It also allows machines and software agents to automatically determine the ideal behavior from simple reward feedback (e.g., encouragement) to maximize their performance. Advancements in affective computing, especially emotional speech processing (ESP), have allowed for more natural interaction between humans and robots. Our research focuses on integrating a novel ESP system in a relevant virtual neurorobotic (VNR) application. We created an emotional speech classifier that successfully distinguished happy and sad utterances. The accuracy of the system was 95.3 and 98.7% during the offline mode (using an emotional speech database) and the live mode (using live recordings), respectively. It was then integrated in a neurorobotic scenario, where a virtual neurorobot had to learn a simple exercise through reward-based learning. If the correct decision was made, the robot received a spoken reward, which in turn stimulated synapses (in our simulated model) undergoing spike-timing-dependent plasticity (STDP) and reinforced the corresponding neural pathways. Both our ESP and neurorobotic systems allowed our neurorobot to successfully and consistently learn the exercise. The integration of ESP in a real-time computational neuroscience architecture is a first step toward the combination of human emotions and virtual neurorobotics. PMID:23641213

  7. Two spatiotemporally distinct value systems shape reward-based learning in the human brain

    PubMed Central

    Fouragnan, Elsa; Retzler, Chris; Mullinger, Karen; Philiastides, Marios G.

    2015-01-01

    Avoiding repeated mistakes and learning to reinforce rewarding decisions is critical for human survival and adaptive actions. Yet, the neural underpinnings of the value systems that encode different decision-outcomes remain elusive. Here, coupling single-trial electroencephalography with simultaneously acquired functional magnetic resonance imaging, we uncover the spatiotemporal dynamics of two separate but interacting value systems encoding decision-outcomes. Consistent with a role in regulating alertness and switching behaviours, an early system is activated only by negative outcomes and engages arousal-related and motor-preparatory brain structures. Consistent with a role in reward-based learning, a later system differentially suppresses or activates regions of the human reward network in response to negative and positive outcomes, respectively. Following negative outcomes, the early system interacts and downregulates the late system, through a thalamic interaction with the ventral striatum. Critically, the strength of this coupling predicts participants' switching behaviour and avoidance learning, directly implicating the thalamostriatal pathway in reward-based learning. PMID:26348160

  8. Honeybees learn the sign and magnitude of reward variations.

    PubMed

    Gil, Mariana; De Marco, Rodrigo J

    2009-09-01

    In this study, we asked whether honeybees learn the sign and magnitude of variations in the level of reward. We designed an experiment in which bees first had to forage on a three-flower patch offering variable reward levels, and then search for food at the site in the absence of reward and after a long foraging pause. At the time of training, we presented the bees with a decrease in reward level or, instead, with either a small or a large increase in reward level. Testing took place as soon as they visited the patch on the day following training, when we measured the bees' food-searching behaviours. We found that the bees that had experienced increasing reward levels searched for food more persistently than the bees that had experienced decreasing reward levels, and that the bees that had experienced a large increase in reward level searched for food more persistently than the bees that had experienced a small increase in reward level. Because these differences at the time of testing cannot be accounted for by the bees' previous crop loads and food-intake rates, our results unambiguously demonstrate that honeybees adjust their investment of time/energy during foraging in relation to both the sign and the magnitude of past variations in the level of reward. It is likely that such variations lead to the formation of reward expectations enhancing a forager's reliance on a feeding site. Ultimately, this would make it more likely for honeybees to find food when forage is scarce.

  9. Reinforcement learning of periodical gaits in locomotion robots

    NASA Astrophysics Data System (ADS)

    Svinin, Mikhail; Yamada, Kazuyaki; Ushio, S.; Ueda, Kanji

    1999-08-01

    Emergence of stable gaits in locomotion robots is studied in this paper. A classifier system, implementing an instance-based reinforcement learning scheme, is used for sensory-motor control of an eight-legged mobile robot. An important feature of the classifier system is its ability to work with a continuous sensor space. The robot has no prior knowledge of the environment, no internal model of itself, and no goal coordinates. It is only assumed that the robot can acquire stable gaits by learning how to reach a light source. During the learning process, the control system is self-organized by reinforcement signals. Reaching the light source earns a global reward. Forward motion earns a local reward, while stepping back and falling down incur a local punishment. Feasibility of the proposed self-organized system is tested in simulation and experiment. The control actions are specified at the leg level. It is shown that, as learning progresses, the number of action rules in the classifier system stabilizes at a level corresponding to the acquired gait patterns.

  10. DeltaFosB in the nucleus accumbens is critical for reinforcing effects of sexual reward

    PubMed Central

    Pitchers, Kyle K.; Frohmader, Karla S.; Vialou, Vincent; Mouzon, Ezekiell; Nestler, Eric J.; Lehman, Michael N.; Coolen, Lique M.

    2010-01-01

    Sexual behavior in male rats is rewarding and reinforcing. However, little is known about the specific cellular and molecular mechanisms mediating sexual reward or the reinforcing effects of reward on subsequent expression of sexual behavior. The current study tests the hypothesis that ΔFosB, the stably expressed truncated form of FosB, plays a critical role in the reinforcement of sexual behavior and experience-induced facilitation of sexual motivation and performance. Sexual experience was shown to cause ΔFosB accumulation in several limbic brain regions including the nucleus accumbens (NAc), medial prefrontal cortex, ventral tegmental area and caudate putamen, but not the medial preoptic nucleus. Next, the induction of c-Fos, a downstream (repressed) target of ΔFosB, was measured in sexually experienced and naïve animals. The number of mating-induced c-Fos-IR cells was significantly decreased in sexually experienced animals compared to sexually naïve controls. Finally, ΔFosB levels and its activity in the NAc were manipulated using viral-mediated gene transfer to study its potential role in mediating sexual experience and experience-induced facilitation of sexual performance. Animals with ΔFosB over-expression displayed enhanced facilitation of sexual performance with sexual experience relative to controls. In contrast, the expression of ΔJunD, a dominant-negative binding partner of ΔFosB, attenuated sexual experience-induced facilitation of sexual performance, and stunted long-term maintenance of facilitation compared to GFP and ΔFosB over-expressing groups. Together, these findings support a critical role for ΔFosB expression in the NAc for the reinforcing effects of sexual behavior and sexual experience-induced facilitation of sexual performance. PMID:20618447

  11. Reward-based contextual learning supported by anterior cingulate cortex.

    PubMed

    Umemoto, Akina; HajiHosseini, Azadeh; Yates, Michael E; Holroyd, Clay B

    2017-02-24

    The anterior cingulate cortex (ACC) is commonly associated with cognitive control and decision making, but its specific function is highly debated. To explore a recent theory that the ACC learns the reward values of task contexts (Holroyd & McClure in Psychological Review, 122, 54-83, 2015; Holroyd & Yeung in Trends in Cognitive Sciences, 16, 122-128, 2012), we recorded the event-related brain potentials (ERPs) from participants as they played a novel gambling task. The participants were first required to select from among three games in one "virtual casino," and subsequently they were required to select from among three different games in a different virtual casino; unbeknownst to them, the payoffs for the games were higher in one casino than in the other. Analysis of the reward positivity, an ERP component believed to reflect reward-related signals carried to the ACC by the midbrain dopamine system, revealed that the ACC is sensitive to differences in the reward values associated with both the casinos and the games inside the casinos, indicating that participants learned the values of the contexts in which rewards were delivered. These results highlight the importance of the ACC in learning the reward values of task contexts in order to guide action selection.

  12. Novelty and Inductive Generalization in Human Reinforcement Learning

    PubMed Central

    Gershman, Samuel J.; Niv, Yael

    2015-01-01

    In reinforcement learning, a decision maker searching for the most rewarding option is often faced with the question: what is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: how can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of reinforcement learning in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional reinforcement learning algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty. PMID:25808176
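
    A sketch of the inductive idea: the value of a never-tried option can be drawn from a group-level prior learned over the options already experienced, then updated by observation. This assumes a conjugate normal-normal model with illustrative numbers; it is not the authors' exact hierarchical model.

```python
import statistics

# Values learned for previously experienced options in this environment.
known_values = [0.7, 0.65, 0.8, 0.72]

# Group-level (environment) prior inferred from the known options.
mu0 = statistics.mean(known_values)           # prior mean
tau0_sq = statistics.variance(known_values)   # prior variance

# A novel, never-tried option inherits the group-level prior.
novel_value_estimate = mu0

def posterior_mean(observations, mu0, tau0_sq, sigma_sq=0.05):
    """Conjugate normal-normal update with assumed observation noise."""
    n = len(observations)
    precision = 1.0 / tau0_sq + n / sigma_sq
    return (mu0 / tau0_sq + sum(observations) / sigma_sq) / precision

print(novel_value_estimate)                           # guess before any data
print(posterior_mean([1.0, 0.0, 1.0], mu0, tau0_sq))  # shrunk toward the prior
```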

  13. Effects of the chronic restraint stress induced depression on reward-related learning in rats.

    PubMed

    Xu, Pan; Wang, Kezhu; Lu, Cong; Dong, Liming; Chen, Yixi; Wang, Qiong; Shi, Zhe; Yang, Yanyan; Chen, Shanguang; Liu, Xinmin

    2017-03-15

    Chronic mild or unpredictable stress produces a persistent depressive-like state. The main symptoms of depression include weight loss, despair, anhedonia, diminished motivation, and mild cognitive impairment, all of which can influence reward-related learning. In the present study, we aimed to evaluate the effects of chronic restraint stress on the performance of reward-related learning in rats. We used repeated restraint stress (6 h/day for 28 days) to induce depression-like behavior in rats. We then designed tasks including Pavlovian conditioning (magazine head entries), acquisition and maintenance of instrumental conditioning (lever pressing), and goal-directed learning (a higher fixed-ratio schedule of reinforcement) to study the effects of chronic restraint stress. The results indicated that chronic restraint stress affected the acquisition of a Pavlovian stimulus-outcome (S-O) association, the formation and maintenance of an action-outcome (A-O) causal relation, and the ability to learn under a higher fixed-ratio schedule. In conclusion, depression markedly influenced performance in reward-related learning, and this series of instrumental learning tasks may have potential as a method to evaluate cognitive changes in depression.

  14. Anticipated Reward Enhances Offline Learning during Sleep

    ERIC Educational Resources Information Center

    Fischer, Stefan; Born, Jan

    2009-01-01

    Sleep is known to promote the consolidation of motor memories. In everyday life, typically more than 1 isolated motor skill is acquired at a time, and this possibly gives rise to interference during consolidation. Here, it is shown that reward expectancy determines the amount of sleep-dependent memory consolidation. Subjects were trained on 2…

  16. Memory Transformation Enhances Reinforcement Learning in Dynamic Environments.

    PubMed

    Santoro, Adam; Frankland, Paul W; Richards, Blake A

    2016-11-30

    Over the course of systems consolidation, there is a switch from a reliance on detailed episodic memories to generalized schematic memories. This switch is sometimes referred to as "memory transformation." Here we demonstrate a previously unappreciated benefit of memory transformation, namely, its ability to enhance reinforcement learning in a dynamic environment. We developed a neural network that is trained to find rewards in a foraging task where reward locations are continuously changing. The network can use memories for specific locations (episodic memories) and statistical patterns of locations (schematic memories) to guide its search. We find that switching from an episodic to a schematic strategy over time leads to enhanced performance due to the tendency for the reward location to be highly correlated with itself in the short term, but to regress to a stable distribution in the long term. We also show that the statistics of the environment determine the optimal utilization of both types of memory. Our work recasts the theoretical question of why memory transformation occurs, shifting the focus from the avoidance of memory interference toward the enhancement of reinforcement learning across multiple timescales. As time passes, memories transform from a highly detailed state to a more gist-like state, in a process called "memory transformation." Theories of memory transformation speak to its advantages in terms of reducing memory interference, increasing memory robustness, and building models of the environment. However, the role of memory transformation from the perspective of an agent that continuously acts and receives reward in its environment is not well explored. In this work, we demonstrate a view of memory transformation that defines it as a way of optimizing behavior across multiple timescales.
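
    A toy rendering of the episodic-versus-schematic trade-off described above, assuming a one-dimensional foraging world and a hand-set blending weight; the dynamics and names are illustrative only.

```python
# Toy 1-D foraging world: the reward location drifts in the short term but
# regresses to a stable mean in the long term, as in the abstract.

def episodic_guess(history):
    """Detailed memory: predict the most recent reward location."""
    return history[-1]

def schematic_guess(history):
    """Gist-like memory: predict the long-run average location."""
    return sum(history) / len(history)

def transformed_guess(history, w):
    """Blend: rely on episodic detail early (w near 0), schema later (w near 1)."""
    return (1 - w) * episodic_guess(history) + w * schematic_guess(history)

history = [48.0, 53.0, 51.0, 47.0]
for w in (0.0, 0.5, 1.0):
    print(w, transformed_guess(history, w))
```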

  17. Incidental Learning of Rewarded Associations Bolsters Learning on an Associative Task

    ERIC Educational Resources Information Center

    Freedberg, Michael; Schacherer, Jonathan; Hazeltine, Eliot

    2016-01-01

    Reward has been shown to change behavior as a result of incentive learning (by motivating the individual to increase their effort) and instrumental learning (by increasing the frequency of a particular behavior). However, Palminteri et al. (2011) demonstrated that reward can also improve the incidental learning of a motor skill even when…

  18. Rewards.

    PubMed

    Gunderman, Richard B; Kamer, Aaron P

    2011-05-01

    For much of the 20th century, psychologists and economists operated on the assumption that work is devoid of intrinsic rewards, and the only way to get people to work harder is through the use of rewards and punishments. This so-called carrot-and-stick model of workplace motivation, when applied to medical practice, emphasizes the use of financial incentives and disincentives to manipulate behavior. More recently, however, it has become apparent that, particularly when applied to certain kinds of work, such approaches can be ineffective or even frankly counterproductive. Instead of focusing on extrinsic rewards such as compensation, organizations and their leaders need to devote more attention to the intrinsic rewards of work itself. This article reviews this new understanding of rewards and traces out its practical implications for radiology today.

  19. Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning.

    PubMed

    Frank, Michael J; Moustafa, Ahmed A; Haughey, Heather M; Curran, Tim; Hutchison, Kent E

    2007-10-09

    What are the genetic and neural components that support adaptive learning from positive and negative outcomes? Here, we show with genetic analyses that three independent dopaminergic mechanisms contribute to reward and avoidance learning in humans. A polymorphism in the DARPP-32 gene, associated with striatal dopamine function, predicted relatively better probabilistic reward learning. Conversely, the C957T polymorphism of the DRD2 gene, associated with striatal D2 receptor function, predicted the degree to which participants learned to avoid choices that had been probabilistically associated with negative outcomes. The Val/Met polymorphism of the COMT gene, associated with prefrontal cortical dopamine function, predicted participants' ability to rapidly adapt behavior on a trial-to-trial basis. These findings support a neurocomputational dissociation between striatal and prefrontal dopaminergic mechanisms in reinforcement learning. Computational maximum likelihood analyses reveal independent gene effects on three reinforcement learning parameters that can explain the observed dissociations.

  20. Emotion and reward are dissociable from error during motor learning.

    PubMed

    Festini, Sara B; Preston, Stephanie D; Reuter-Lorenz, Patricia A; Seidler, Rachael D

    2016-06-01

    Although emotion is known to reciprocally interact with cognitive and motor performance, contemporary theories of motor learning do not specifically consider how dynamic variations in a learner's affective state may influence motor performance during motor learning. Using a prism adaptation paradigm, we assessed emotion during motor learning on a trial-by-trial basis. We designed two dart-throwing experiments to dissociate motor performance and reward outcomes by giving participants maximum points for accurate throws and reduced points for throws that hit zones away from the target (i.e., "accidental points"). Experiment 1 dissociated motor performance from emotional responses and found that affective ratings tracked points earned more closely than error magnitude. Further, both reward and error uniquely contributed to motor learning, as indexed by the change in error from one trial to the next. Experiment 2 manipulated accidental point locations vertically, whereas prism displacement remained horizontal. Results demonstrated that reward could bias motor performance even when concurrent sensorimotor adaptation was taking place in a perpendicular direction. Thus, these experiments demonstrate that affective states were dissociable from error magnitude during motor learning and that affect more closely tracked points earned. Our findings further implicate reward as another factor, other than error, that contributes to motor learning, suggesting the importance of incorporating affective states into models of motor learning.

  1. The Establishment of Learned Reinforcers in Mildly Retarded Children. IMRID Behavioral Science Monograph No. 24.

    ERIC Educational Resources Information Center

    Worley, John C., Jr.

    Research regarding the establishment of learned reinforcement with mildly retarded children is reviewed. Noted are findings which indicate that educable retarded students, possibly due to cultural differences, are less responsive to social rewards than either nonretarded or more severely retarded children. Characteristics of primary and secondary…

  2. Reinforcement learning for discounted values often loses the goal in the application to animal learning.

    PubMed

    Yamaguchi, Yoshiya; Sakai, Yutaka

    2012-11-01

    The impulsive preference of an animal for an immediate reward implies that it subjectively discounts the value of potential future outcomes. A theoretical framework for maximizing discounted subjective value has been established in reinforcement learning theory and has been successfully applied in engineering. However, this study identifies a limitation when the framework is applied to animal behavior: in some cases, the discounted objective defines no learning goal. A possible learning framework is proposed here that is well-posed in all cases and remains consistent with impulsive preference.
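
    For concreteness, the discounted objective at issue is G = sum_t gamma^t * r_t. A small sketch with illustrative numbers shows how steep discounting produces the impulsive preference the abstract mentions:

```python
def discounted_return(rewards, gamma=0.9):
    """Exponentially discounted subjective value of a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Undiscounted average-reward criterion, one alternative objective."""
    return sum(rewards) / len(rewards)

# An impulsive preference: a small immediate reward can outweigh a larger
# delayed one once future outcomes are discounted steeply enough.
print(discounted_return([1.0, 0, 0, 0], gamma=0.5))  # immediate small: 1.0
print(discounted_return([0, 0, 0, 3.0], gamma=0.5))  # delayed large: 0.375
```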

  3. Indices of extinction-induced "depression" after operant learning using a runway vs. a cued free-reward delivery schedule.

    PubMed

    Topic, Bianca; Kröger, Inga; Vildirasova, Petya G; Huston, Joseph P

    2012-11-01

    Loss of reward is one of the etiological factors leading to affective disorders, such as major depression. We have proposed several variants of an animal model of depression based on extinction of reinforced behavior in rats. A number of behaviors emitted during extinction trials were found to be attenuated by antidepressant treatment and thus qualified as indices of extinction-induced "despair". These include increased immobility in the Morris water maze, withdrawal from the former source of reward, and biting behavior in operant chambers. Here, we assess the effects of reward omission on behaviors after learning of (a) a cued free-reward delivery schedule in an operant chamber and (b) food-reinforced runway behavior. Sixty adult male Wistar rats were trained either to receive food reinforcement every 90 s following a 5-s cue light (FI 90), or to traverse an alley to gain food reward. Daily treatment with the selective serotonin reuptake inhibitor citalopram, the tricyclic antidepressant imipramine (10 mg/kg each), or vehicle began either 25 days (operant chamber) or 3 days (runway) prior to extinction. The antidepressants suppressed rearing behavior in both paradigms specifically during the extinction trials, which marks this measure as a useful indicator of depression-related behavior, possibly reflecting vertical withdrawal. In the operant chamber, only marginal effects on learned operant responses during extinction were found. In the runway, the learned operant responses (run time and distance to the goal), as well as total distance moved, grooming, and quiescence, were also influenced by the antidepressants, providing a potential set of markers for extinction-induced "depression" in the runway. The two paradigms differ substantially with respect to the anticipation of reward and the behaviors that are learned and that accompany extinction. Accordingly, antidepressant treatment influenced different sets of behaviors in these two learning tasks.

  4. Hypocretin/orexin regulation of dopamine signaling: implications for reward and reinforcement mechanisms

    PubMed Central

    Calipari, Erin S.; España, Rodrigo A.

    2012-01-01

    The hypocretins/orexins are comprised of two neuroexcitatory peptides that are synthesized exclusively within a circumscribed region of the lateral hypothalamus. These peptides project widely throughout the brain and interact with a variety of regions involved in the regulation of arousal-related processes including those associated with motivated behavior. The current review focuses on emerging evidence indicating that the hypocretins influence reward and reinforcement processing via actions on the mesolimbic dopamine system. We discuss contemporary perspectives of hypocretin regulation of mesolimbic dopamine signaling in both drug free and drug states, as well as hypocretin regulation of behavioral responses to drugs of abuse, particularly as it relates to cocaine. PMID:22933994

  5. Drive-Reinforcement Learning System Applications

    DTIC Science & Technology

    1992-07-31

    Evidence suggests that D-R would be effective in control system applications outside the robotics arena. Keywords: drive-reinforcement learning, neural network controllers, robotics, manipulator kinematics, dynamics and control.

  6. Tree-Based Hierarchical Reinforcement Learning

    DTIC Science & Technology

    2002-08-01

  7. DAT isn’t all that: cocaine reward and reinforcement requires Toll Like Receptor 4 signaling

    PubMed Central

    Northcutt, A.L.; Hutchinson, M.R.; Wang, X.; Baratta, M.V.; Hiranita, T.; Cochran, T.A.; Pomrenze, M.B.; Galer, E.L.; Kopajtic, T.A.; Li, C.M.; Amat, J.; Larson, G.; Cooper, D.C.; Huang, Y.; O’Neill, C.E.; Yin, H.; Zahniser, N.R.; Katz, J.L.; Rice, K.C.; Maier, S.F.; Bachtell, R.K.; Watkins, L.R.

    2014-01-01

    The initial reinforcing properties of drugs of abuse, such as cocaine, are largely attributed to their ability to activate the mesolimbic dopamine system. Resulting increases in extracellular dopamine in the nucleus accumbens (NAc) are traditionally thought to result from cocaine’s ability to block dopamine transporters (DATs). Here we demonstrate that cocaine also interacts with the immunosurveillance receptor complex, Toll-Like Receptor 4 (TLR4), on microglial cells to initiate central innate immune signaling. Disruption of cocaine signaling at TLR4 suppresses cocaine-induced extracellular dopamine in the NAc, as well as cocaine conditioned place preference and cocaine self-administration. These results provide a novel understanding of the neurobiological mechanisms underlying cocaine reward/reinforcement that includes a critical role for central immune signaling, and offer a new target for medication development for cocaine abuse treatment. PMID:25644383

  8. Reinforcement learning improves behaviour from evaluative feedback

    NASA Astrophysics Data System (ADS)

    Littman, Michael L.

    2015-05-01

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  9. Reinforcement learning improves behaviour from evaluative feedback.

    PubMed

    Littman, Michael L

    2015-05-28

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  10. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain.

    PubMed

    Niv, Yael; Edlund, Jeffrey A; Dayan, Peter; O'Doherty, John P

    2012-01-11

    Humans and animals are exquisitely, though idiosyncratically, sensitive to risk or variance in the outcomes of their actions. Economic, psychological, and neural aspects of this are well studied when information about risk is provided explicitly. However, we must normally learn about outcomes from experience, through trial and error. Traditional models of such reinforcement learning focus on learning about the mean reward value of cues and ignore higher order moments such as variance. We used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk. Our analysis focused on anatomically delineated regions of a priori interest in the nucleus accumbens, where blood oxygenation level-dependent (BOLD) signals have been suggested as correlating with quantities derived from reinforcement learning. We first provide unbiased evidence that the raw BOLD signal in these regions corresponds closely to a reward prediction error. We then derive from this signal the learned values of cues that predict rewards of equal mean but different variance and show that these values are indeed modulated by experienced risk. Moreover, a close neurometric-psychometric coupling exists between the fluctuations of the experience-based evaluations of risky options that we measured neurally and the fluctuations in behavioral risk aversion. This suggests that risk sensitivity is integral to human learning, illuminating economic models of choice, neuroscientific models of affective learning, and the workings of the underlying neural mechanisms.
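
    One common way to make a temporal-difference learner risk-sensitive is to weight positive and negative prediction errors asymmetrically. The sketch below uses that device with illustrative parameters; it is an assumption for illustration, not necessarily the exact model fit in the paper.

```python
import random
random.seed(0)

def risk_sensitive_td(values, cue, reward, alpha_pos=0.1, alpha_neg=0.2):
    """TD update with asymmetric learning rates: over-weighting negative
    prediction errors makes high-variance cues look worse than their mean."""
    delta = reward - values[cue]
    alpha = alpha_pos if delta >= 0 else alpha_neg
    values[cue] += alpha * delta

V = {"safe": 0.0, "risky": 0.0}
for _ in range(2000):
    risk_sensitive_td(V, "safe", 0.5)                         # certain reward
    risk_sensitive_td(V, "risky", random.choice([0.0, 1.0]))  # equal mean, higher variance
print(V)  # the risky cue settles below 0.5: learned risk aversion
```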

  11. Cocaine addiction as a homeostatic reinforcement learning disorder.

    PubMed

    Keramati, Mehdi; Durand, Audrey; Girardeau, Paul; Gutkin, Boris; Ahmed, Serge H

    2017-03-01

    Drug addiction implicates both reward learning and homeostatic regulation mechanisms of the brain. This has stimulated 2 partially successful theoretical perspectives on addiction. Many important aspects of addiction, however, remain to be explained within a single, unified framework that integrates the 2 mechanisms. Building upon a recently developed homeostatic reinforcement learning theory, the authors focus on a key transition stage of addiction that is well modeled in animals, escalation of drug use, and propose a computational theory of cocaine addiction where cocaine reinforces behavior due to its rapid homeostatic corrective effect, whereas its chronic use induces slow and long-lasting changes in homeostatic setpoint. Simulations show that our new theory accounts for key behavioral and neurobiological features of addiction, most notably, escalation of cocaine use, drug-primed craving and relapse, individual differences underlying dose-response curves, and dopamine D2-receptor downregulation in addicts. The theory also generates unique predictions about cocaine self-administration behavior in rats that are confirmed by new experimental results. Viewing addiction as a homeostatic reinforcement learning disorder coherently explains many behavioral and neurobiological aspects of the transition to cocaine addiction, and suggests a new perspective toward understanding addiction.

  12. The role of basal ganglia in reinforcement learning and imprinting in domestic chicks.

    PubMed

    Izawa, E; Yanagihara, S; Atsumi, T; Matsushima, T

    2001-06-13

    Effects of bilateral kainate lesions of telencephalic basal ganglia (lobus parolfactorius, LPO) were examined in domestic chicks. In the imprinting paradigm, where chicks learned to selectively approach a moving object without any explicitly associated reward, both the pre- and post-training lesions were without effects. On the other hand, in the water-reinforced pecking task, pre-training lesions of LPO severely impaired immediate reinforcement as well as formation of the association memory. However, post-training LPO lesions did not cause amnesia, and chicks selectively pecked at the reinforced color. The LPO could thus be involved specifically in the evaluation of present rewards and the instantaneous reinforcement of pecking, but not in the execution of selective behavior based on a memorized color cue.

  13. Reward and Cognition: Integrating Reinforcement Sensitivity Theory and Social Cognitive Theory to Predict Drinking Behavior.

    PubMed

    Hasking, Penelope; Boyes, Mark; Mullan, Barbara

    2015-01-01

    Both Reinforcement Sensitivity Theory and Social Cognitive Theory have been applied to understanding drinking behavior. We propose that theoretical relationships between these models support an integrated approach to understanding alcohol use and misuse. We aimed to test an integrated model in which the relationships between reward sensitivity and drinking behavior (alcohol consumption, alcohol-related problems, and symptoms of dependence) were mediated by alcohol expectancies and drinking refusal self-efficacy. Online questionnaires assessing the constructs of interest were completed by 443 Australian adults (M age = 26.40, sd = 1.83) in 2013 and 2014. Path analysis revealed both direct and indirect effects and implicated two pathways to drinking behavior with differential outcomes. Drinking refusal self-efficacy both in social situations and for emotional relief was related to alcohol consumption. Sensitivity to reward was associated with alcohol-related problems, but operated through expectations of increased confidence and personal belief in the ability to limit drinking in social situations. Conversely, sensitivity to punishment operated through negative expectancies and drinking refusal self-efficacy for emotional relief to predict symptoms of dependence. Two pathways relating reward sensitivity, alcohol expectancies, and drinking refusal self-efficacy may underlie social and dependent drinking, which has implications for development of intervention to limit harmful drinking.

  14. Early Years Education: Are Young Students Intrinsically or Extrinsically Motivated Towards School Activities? A Discussion about the Effects of Rewards on Young Children's Learning

    ERIC Educational Resources Information Center

    Theodotou, Evgenia

    2014-01-01

    Rewards can reinforce and at the same time forestall young children's willingness to learn. However, they are broadly used in the field of education, especially in early years settings, to stimulate children towards learning activities. This paper reviews the theoretical and research literature related to intrinsic and extrinsic motivational…

  15. The "proactive" model of learning: Integrative framework for model-free and model-based reinforcement learning utilizing the associative learning-based proactive brain concept.

    PubMed

    Zsuga, Judit; Biro, Klara; Papp, Csaba; Tajti, Gabor; Gesztelyi, Rudolf

    2016-02-01

    Reinforcement learning (RL) is a powerful concept underlying forms of associative learning governed by the use of a scalar reward signal, with learning taking place when expectations are violated. RL may be assessed using model-based and model-free approaches. Model-based reinforcement learning involves the amygdala, the hippocampus, and the orbitofrontal cortex (OFC). The model-free system involves the pedunculopontine-tegmental nucleus (PPTgN), the ventral tegmental area (VTA), and the ventral striatum (VS). Based on the functional connectivity of the VS, the model-free and model-based RL systems converge on the VS, which computes value by integrating model-free signals (received as reward prediction errors) with model-based, reward-related input. Using the concept of the reinforcement learning agent, we propose that the VS serves as the value-function component of the RL agent. For the model used in model-based computations, we turn to the proactive brain concept, which assigns a ubiquitous function to the default network based on its large functional overlap with contextual associative areas. By means of the default network, the brain continuously organizes its environment into context frames, enabling the formulation of analogy-based associations that are turned into predictions of what to expect. The OFC integrates reward-related information into context frames by computing reward expectation, compiling stimulus-reward and context-reward information offered by the amygdala and hippocampus, respectively. Furthermore, we suggest that the integration of model-based reward expectations into the value signal is further supported by efferents of the OFC that reach structures canonical for model-free learning (e.g., the PPTgN, VTA, and VS).

  16. Altering spatial priority maps via reward-based learning.

    PubMed

    Chelazzi, Leonardo; Eštočinová, Jana; Calletti, Riccardo; Lo Gerfo, Emanuele; Sani, Ilaria; Della Libera, Chiara; Santandrea, Elisa

    2014-06-18

    Spatial priority maps are real-time representations of the behavioral salience of locations in the visual field, resulting from the combined influence of stimulus driven activity and top-down signals related to the current goals of the individual. They arbitrate which of a number of (potential) targets in the visual scene will win the competition for attentional resources. As a result, deployment of visual attention to a specific spatial location is determined by the current peak of activation (corresponding to the highest behavioral salience) across the map. Here we report a behavioral study performed on healthy human volunteers, where we demonstrate that spatial priority maps can be shaped via reward-based learning, reflecting long-lasting alterations (biases) in the behavioral salience of specific spatial locations. These biases exert an especially strong influence on performance under conditions where multiple potential targets compete for selection, conferring competitive advantage to targets presented in spatial locations associated with greater reward during learning relative to targets presented in locations associated with lesser reward. Such acquired biases of spatial attention are persistent, are nonstrategic in nature, and generalize across stimuli and task contexts. These results suggest that reward-based attentional learning can induce plastic changes in spatial priority maps, endowing these representations with the "intelligent" capacity to learn from experience.

  17. Dopamine-dependent reinforcement of motor skill learning: evidence from Gilles de la Tourette syndrome.

    PubMed

    Palminteri, Stefano; Lebreton, Maël; Worbe, Yulia; Hartmann, Andreas; Lehéricy, Stéphane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-08-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only affect choices but also motor skills such as typing. Here, we employed a novel paradigm to demonstrate that monetary rewards can improve motor skill learning in humans. Indeed, healthy participants progressively got faster in executing sequences of key presses that were repeatedly rewarded with 10 euro compared with 1 cent. Control tests revealed that the effect of reinforcement on motor skill learning was independent of subjects being aware of sequence-reward associations. To account for this implicit effect, we developed an actor-critic model, in which reward prediction errors are used by the critic to update state values and by the actor to facilitate action execution. To assess the role of dopamine in such computations, we applied the same paradigm in patients with Gilles de la Tourette syndrome, who were either unmedicated or treated with neuroleptics. We also included patients with focal dystonia, as an example of hyperkinetic motor disorder unrelated to dopamine. Model fit showed the following dissociation: while motor skills were affected in all patient groups, reinforcement learning was selectively enhanced in unmedicated patients with Gilles de la Tourette syndrome and impaired by neuroleptics. These results support the hypothesis that overactive dopamine transmission leads to excessive reinforcement of motor sequences, which might explain the formation of tics in Gilles de la Tourette syndrome.
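
    A minimal sketch of the actor-critic scheme the abstract describes, in which a single reward prediction error both updates the critic's state value and facilitates the actor's execution (here collapsed into a scalar "vigor"); all parameters and magnitudes are illustrative.

```python
V = 0.0          # critic: expected reward for the motor state
vigor = 1.0      # actor: execution speed/facilitation for the sequence
ALPHA_C, ALPHA_A = 0.1, 0.05

def trial(reward):
    global V, vigor
    delta = reward - V          # reward prediction error
    V += ALPHA_C * delta        # critic: update state value
    vigor += ALPHA_A * delta    # actor: same error facilitates execution
    return delta

# Sequences rewarded with 10 euros vs. 1 cent (values in euros, illustrative).
for _ in range(50):
    trial(10.0)
print(V, vigor)   # vigor grows with repeated large rewards, then saturates
```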

  18. Spiking neural networks with different reinforcement learning (RL) schemes in a multiagent setting.

    PubMed

    Christodoulou, Chris; Cleanthous, Aristodemos

    2010-12-31

    This paper investigates the effectiveness of spiking agents when trained with reinforcement learning (RL) in a challenging multiagent task. In particular, it explores learning through reward-modulated spike-timing dependent plasticity (STDP) and compares it to reinforcement of stochastic synaptic transmission in the general-sum game of the Iterated Prisoner's Dilemma (IPD). More specifically, a computational model is developed where we implement two spiking neural networks as two "selfish" agents learning simultaneously but independently, competing in the IPD game. The purpose of our system (or collective) is to maximise its accumulated reward in the presence of reward-driven competing agents within the collective. This can only be achieved when the agents engage in a behaviour of mutual cooperation during the IPD. Previously, we successfully applied reinforcement of stochastic synaptic transmission to the IPD game. The current study utilises reward-modulated STDP with eligibility trace and results show that the system managed to exhibit the desired behaviour by establishing mutual cooperation between the agents. It is noted that the cooperative outcome was attained after a relatively short learning period which enhanced the accumulation of reward by the system. As in our previous implementation, the successful application of the learning algorithm to the IPD becomes possible only after we extended it with additional global reinforcement signals in order to enhance competition at the neuronal level. Moreover it is also shown that learning is enhanced (as indicated by an increased IPD cooperative outcome) through: (i) strong memory for each agent (regulated by a high eligibility trace time constant) and (ii) firing irregularity produced by equipping the agents' LIF neurons with a partial somatic reset mechanism.
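
    The core mechanism, reward-modulated STDP with an eligibility trace, can be sketched compactly: spike pairings write to a decaying trace, and a later reward converts the trace into a weight change. The constants below are illustrative, not the paper's.

```python
import math

A_PLUS, A_MINUS = 0.01, 0.012   # STDP amplitudes (assumed)
TAU_STDP = 20.0                 # ms, STDP window time constant (assumed)
TAU_ELIG = 500.0                # ms, eligibility-trace ("memory") constant
ETA = 0.5                       # learning rate (assumed)

def stdp(dt):
    """Classic exponential STDP window; dt = t_post - t_pre (ms)."""
    if dt >= 0:
        return A_PLUS * math.exp(-dt / TAU_STDP)
    return -A_MINUS * math.exp(dt / TAU_STDP)

elig, w = 0.0, 0.5
events = [(+5.0, 0.0), (+8.0, 0.0), (-10.0, 0.0), (+3.0, 1.0)]  # (dt, reward)
for dt, reward in events:
    elig = elig * math.exp(-100.0 / TAU_ELIG) + stdp(dt)  # decay per 100 ms step
    w += ETA * reward * elig                              # reward gates plasticity
print(w)
```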

  19. Dynamic Sensor Tasking for Space Situational Awareness via Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Linares, R.; Furfaro, R.

    2016-09-01

    This paper studies the Sensor Management (SM) problem for optical Space Object (SO) tracking. The tasking problem is formulated as a Markov Decision Process (MDP) and solved using Reinforcement Learning (RL) with an actor-critic policy gradient approach. The actor provides a stochastic policy over actions, given by a parametric probability density function (pdf). The critic evaluates the policy by calculating the estimated total reward, or value function, for the problem. The parameters of the policy pdf are optimized using gradients with respect to the reward function. Both the critic and the actor are modeled using deep (multi-layer) neural networks. The policy network takes the current state as input and outputs probabilities for each possible action; actions are executed by sampling according to these probabilities. The critic network approximates the total reward, and this estimate is used to approximate the gradient of the policy network with respect to its parameters. This approach is used to find the non-myopic optimal policy for tasking optical sensors to estimate SO orbits. The reward function is based on reducing the uncertainty for the overall catalog to below a user-specified threshold; this work uses a 30 km total position error as the threshold. The RL method receives a negative reward as long as any SO has a total position error above the threshold, penalizing policies that take longer to achieve the desired accuracy, and a positive reward once all SOs are below the catalog uncertainty threshold. An optimal policy is thus sought that achieves the desired catalog uncertainty in minimum time. The policy is trained in simulation by letting it task a single sensor and "learn" from its performance.
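
    A compact sketch of the actor-critic policy-gradient loop described above, with a softmax policy and a scalar baseline standing in for the paper's deep networks; the environment, reward values, and action set are placeholders.

```python
import math, random
random.seed(0)

N_ACTIONS = 3                      # e.g., which object a sensor observes next
theta = [0.0] * N_ACTIONS          # actor parameters (policy logits)
baseline = 0.0                     # critic: running estimate of total reward
ALPHA_A, ALPHA_C = 0.1, 0.1

def policy():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

def fake_reward(action):
    # Placeholder: action 2 best reduces "catalog uncertainty" (illustrative).
    return [0.1, 0.3, 0.9][action] + random.gauss(0, 0.05)

for _ in range(1000):
    probs = policy()
    a = random.choices(range(N_ACTIONS), weights=probs)[0]
    r = fake_reward(a)
    advantage = r - baseline                    # critic evaluates the policy
    for i in range(N_ACTIONS):                  # REINFORCE-style gradient step
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += ALPHA_A * advantage * grad
    baseline += ALPHA_C * advantage
print(policy())  # probability mass concentrates on the best action
```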

  20. Reinforcement learning for routing in cognitive radio ad hoc networks.

    PubMed

    Al-Rawi, Hasan A A; Yau, Kok-Lim Alvin; Mohamad, Hafizal; Ramli, Nordin; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL in routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs.
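
    As a sketch of the general idea, a node can keep a value estimate per next hop and update it from observed link rewards, trading off exploration and exploitation; the topology, reward model, and parameters below are invented for illustration.

```python
import random
random.seed(0)

neighbors = ["B", "C", "D"]
Q = {n: 0.0 for n in neighbors}   # estimated route quality via each neighbor
ALPHA, EPSILON = 0.2, 0.1

def link_reward(n):
    # Placeholder: higher reward = lower route cost / less PU interference.
    return {"B": 0.4, "C": 0.8, "D": 0.2}[n] + random.gauss(0, 0.1)

for _ in range(500):
    if random.random() < EPSILON:        # exploration
        hop = random.choice(neighbors)
    else:                                # exploitation
        hop = max(Q, key=Q.get)
    r = link_reward(hop)
    Q[hop] += ALPHA * (r - Q[hop])       # incremental value update
print(Q)   # routing via "C" is learned to be best
```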

  1. Reinforcement Learning for Routing in Cognitive Radio Ad Hoc Networks

    PubMed Central

    Al-Rawi, Hasan A. A.; Mohamad, Hafizal; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL in routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs. PMID:25140350

  2. Hippocampal lesions facilitate instrumental learning with delayed reinforcement but induce impulsive choice in rats

    PubMed Central

    Cheung, Timothy HC; Cardinal, Rudolf N

    2005-01-01

    Background Animals must frequently act to influence the world even when the reinforcing outcomes of their actions are delayed. Learning with action-outcome delays is a complex problem, and little is known of the neural mechanisms that bridge such delays. When outcomes are delayed, they may be attributed to (or associated with) the action that caused them, or mistakenly attributed to other stimuli, such as the environmental context. Consequently, animals that are poor at forming context-outcome associations might learn action-outcome associations better with delayed reinforcement than normal animals. The hippocampus contributes to the representation of environmental context, being required for aspects of contextual conditioning. We therefore hypothesized that animals with hippocampal lesions would be better than normal animals at learning to act on the basis of delayed reinforcement. We tested the ability of hippocampal-lesioned rats to learn a free-operant instrumental response using delayed reinforcement, and what is potentially a related ability – the ability to exhibit self-controlled choice, or to sacrifice an immediate, small reward in order to obtain a delayed but larger reward. Results Rats with sham or excitotoxic hippocampal lesions acquired an instrumental response with different delays (0, 10, or 20 s) between the response and reinforcer delivery. These delays retarded learning in normal rats. Hippocampal-lesioned rats responded slightly less than sham-operated controls in the absence of delays, but they became better at learning (relative to shams) as the delays increased; delays impaired learning less in hippocampal-lesioned rats than in shams. In contrast, lesioned rats exhibited impulsive choice, preferring an immediate, small reward to a delayed, larger reward, even though they preferred the large reward when it was not delayed. Conclusion These results support the view that the hippocampus hinders action-outcome learning with delayed outcomes.

  3. Functional Contour-following via Haptic Perception and Reinforcement Learning.

    PubMed

    Hellman, Randall B; Tekin, Cem; Schaar, Mihaela van der; Santos, Veronica J

    2017-09-18

    Many tasks involve the fine manipulation of objects despite limited visual feedback. In such scenarios, tactile and proprioceptive feedback can be leveraged for task completion. We present an approach for real-time haptic perception and decision-making for a haptics-driven, functional contour-following task: the closure of a ziplock bag. This task is challenging for robots because the bag is deformable, transparent, and visually occluded by artificial fingertip sensors that are also compliant. A deep neural net classifier was trained to estimate the state of a zipper within a robot's pinch grasp. A Contextual Multi-Armed Bandit (C-MAB) reinforcement learning algorithm was implemented to maximize cumulative rewards by balancing exploration versus exploitation of the state-action space. The C-MAB learner outperformed a benchmark Q-learner by more efficiently exploring the state-action space while learning a hard-to-code task. The learned C-MAB policy was tested with novel ziplock bag scenarios and contours (wire, rope). Importantly, this work contributes to the development of reinforcement learning approaches that account for limited resources such as hardware life and researcher time. As robots are used to perform complex, physically interactive tasks in unstructured or unmodeled environments, it becomes important to develop methods that enable efficient and effective learning with physical testbeds.
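
    The contextual-bandit idea can be sketched with a simple per-context epsilon-greedy learner (a stand-in for the paper's C-MAB algorithm); the states, actions, and reward rule are illustrative.

```python
import random
random.seed(0)

# Discrete zipper-state contexts and manipulation actions (illustrative).
states = ["left_of_zipper", "on_zipper", "right_of_zipper"]
actions = ["move_left", "pinch_and_slide", "move_right"]
Q = {(s, a): 0.0 for s in states for a in actions}
N = {(s, a): 0 for s in states for a in actions}
EPSILON = 0.1

def reward(s, a):
    # Placeholder: sliding is only rewarded when the grasp is on the zipper.
    return 1.0 if (s == "on_zipper" and a == "pinch_and_slide") else 0.0

for _ in range(3000):
    s = random.choice(states)                     # context arrives
    if random.random() < EPSILON:                 # explore
        a = random.choice(actions)
    else:                                         # exploit per-context values
        a = max(actions, key=lambda x: Q[(s, x)])
    r = reward(s, a)
    N[(s, a)] += 1
    Q[(s, a)] += (r - Q[(s, a)]) / N[(s, a)]      # sample-average update
print(max(actions, key=lambda x: Q[("on_zipper", x)]))
```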

  4. What is the optimal task difficulty for reinforcement learning of brain self-regulation?

    PubMed

    Bauer, Robert; Vukelić, Mathias; Gharabaghi, Alireza

    2016-09-01

    The balance between action and reward during neurofeedback may influence reinforcement learning of brain self-regulation. Eleven healthy volunteers participated in three runs of motor imagery-based brain-machine interface feedback where a robot passively opened the hand contingent on β-band modulation. For each run, the β-desynchronization threshold to initiate the hand robot movement increased in difficulty (low, moderate, and demanding). In this context, the incentive to learn was estimated by the change of reward per action, operationalized as the change in reward duration per movement onset. Variance analysis revealed a significant interaction between threshold difficulty and the relationship between reward duration and number of movement onsets (p<0.001), indicating a negative learning incentive for low difficulty, but a positive learning incentive for moderate and demanding runs. Exploration of different thresholds in the same data set indicated that the learning incentive peaked at higher thresholds than the threshold that resulted in maximum classification accuracy. Specificity is more important than sensitivity of neurofeedback for reinforcement learning of brain self-regulation. Learning efficiency requires adequate challenge by neurofeedback interventions.

  5. Efficient exploration through active learning for value function approximation in reinforcement learning.

    PubMed

    Akiyama, Takayuki; Hachiya, Hirotaka; Sugiyama, Masashi

    2010-06-01

    Appropriately designing sampling policies is highly important for obtaining better control policies in reinforcement learning. In this paper, we first show that the least-squares policy iteration (LSPI) framework allows us to employ statistical active learning methods for linear regression. Then we propose a design method of good sampling policies for efficient exploration, which is particularly useful when the sampling cost of immediate rewards is high. The effectiveness of the proposed method, which we call active policy iteration (API), is demonstrated through simulations with a batting robot.
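
    The least-squares machinery underlying LSPI can be illustrated with an LSTD(0) policy-evaluation step, which fits linear value weights from sampled transitions; the features and toy chain below are assumptions for illustration.

```python
import numpy as np

def lstd(transitions, gamma=0.9, ridge=1e-3):
    """LSTD(0): fit w so that V(s) ≈ w . phi(s) from (phi, r, phi_next) samples."""
    k = len(transitions[0][0])
    A = ridge * np.eye(k)        # small ridge term keeps A invertible
    b = np.zeros(k)
    for phi, r, phi_next in transitions:
        phi, phi_next = np.asarray(phi), np.asarray(phi_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)

# Two-state chain with one-hot features: state 0 -> state 1 (reward 0),
# state 1 -> state 1 (reward 1). True values: V(1)=10, V(0)=9 for gamma=0.9.
data = [([1.0, 0.0], 0.0, [0.0, 1.0]),
        ([0.0, 1.0], 1.0, [0.0, 1.0])]
print(lstd(data))   # ≈ [9, 10]
```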

  6. Racial bias shapes social reinforcement learning.

    PubMed

    Lindström, Björn; Selbing, Ida; Molapour, Tanaz; Olsson, Andreas

    2014-03-01

    Both emotional facial expressions and markers of racial-group belonging are ubiquitous signals in social interaction, but little is known about how these signals together affect future behavior through learning. To address this issue, we investigated how emotional (threatening or friendly) in-group and out-group faces reinforced behavior in a reinforcement-learning task. We asked whether reinforcement learning would be modulated by intergroup attitudes (i.e., racial bias). The results showed that individual differences in racial bias critically modulated reinforcement learning. As predicted, racial bias was associated with more efficiently learned avoidance of threatening out-group individuals. We used computational modeling analysis to quantitatively delimit the underlying processes affected by social reinforcement. These analyses showed that racial bias modulates the rate at which exposure to threatening out-group individuals is transformed into future avoidance behavior. In concert, these results shed new light on the learning processes underlying social interaction with racial-in-group and out-group individuals.

  7. Adaptive Educational Software by Applying Reinforcement Learning

    ERIC Educational Resources Information Center

    Bennane, Abdellah

    2013-01-01

    The introduction of intelligence into teaching software is the object of this paper. In the software elaboration process, learning techniques are used to adapt the teaching software to the characteristics of the student. Generally, artificial intelligence techniques such as reinforcement learning and Bayesian networks are used in order to adapt…

  8. Using a board game to reinforce learning.

    PubMed

    Yoon, Bona; Rodriguez, Leslie; Faselis, Charles J; Liappis, Angelike P

    2014-03-01

    Experiential gaming strategies offer a variation on traditional learning. A board game was used to present synthesized content of fundamental catheter care concepts and reinforce evidence-based practices relevant to nursing. Board games are innovative educational tools that can enhance active learning.

  9. A REINFORCEMENT LEARNING MODEL OF PERSUASIVE COMMUNICATION.

    ERIC Educational Resources Information Center

    WEISS, ROBERT FRANK

    Theoretical and experimental analogies are drawn between learning theory and persuasive communication as an extension of liberalized stimulus-response theory. In the first experiment on instrumental conditioning of attitudes, the subjects read an opinion to be learned, followed by a supporting argument assumed to function as a reinforcer. The time…

  10. Ventral striatum and orbitofrontal cortex are both required for model-based, but not model-free, reinforcement learning.

    PubMed

    McDannald, Michael A; Lucantonio, Federica; Burke, Kathryn A; Niv, Yael; Schoenbaum, Geoffrey

    2011-02-16

    In many cases, learning is thought to be driven by differences between the value of rewards we expect and rewards we actually receive. Yet learning can also occur when the identity of the reward we receive is not as expected, even if its value remains unchanged. Learning from changes in reward identity implies access to an internal model of the environment, from which information about the identity of the expected reward can be derived. As a result, such learning is not easily accounted for by model-free reinforcement learning theories such as temporal difference reinforcement learning (TDRL), which predicate learning on changes in reward value, but not identity. Here, we used unblocking procedures to assess learning driven by value- versus identity-based prediction errors. Rats were trained to associate distinct visual cues with different food quantities and identities. These cues were subsequently presented in compound with novel auditory cues and the reward quantity or identity was selectively changed. Unblocking was assessed by presenting the auditory cues alone in a probe test. Consistent with neural implementations of TDRL models, we found that the ventral striatum was necessary for learning in response to changes in reward value. However, this area, along with orbitofrontal cortex, was also required for learning driven by changes in reward identity. This observation requires that existing models of TDRL in the ventral striatum be modified to include information about the specific features of expected outcomes derived from model-based representations, and that the role of orbitofrontal cortex in these models be clearly delineated.
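
    The distinction between value- and identity-based prediction errors can be made concrete: if the expected outcome is represented as a feature vector rather than a scalar, a change in identity produces a learning signal even when value is unchanged. The features and numbers below are illustrative.

```python
import numpy as np

# Feature order: [banana_pellets, grain_pellets] (illustrative outcome features).
expected = {"value": 2.0, "identity": np.array([2.0, 0.0])}  # expect 2 banana pellets
received = {"value": 2.0, "identity": np.array([0.0, 2.0])}  # get 2 grain pellets

value_pe = received["value"] - expected["value"]        # model-free TDRL signal
identity_pe = received["identity"] - expected["identity"]  # model-based signal

print(value_pe)     # 0.0 -> no learning signal from value alone
print(identity_pe)  # [-2.  2.] -> learning signal despite equal value
```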

  11. Neural mechanisms of reinforcement learning in unmedicated patients with major depressive disorder.

    PubMed

    Rothkirch, Marcus; Tonn, Jonas; Köhler, Stephan; Sterzer, Philipp

    2017-04-01

    According to current concepts, major depressive disorder is strongly related to dysfunctional neural processing of motivational information, entailing impairments in reinforcement learning. While computational modelling can reveal the precise nature of neural learning signals, it has not been used to study learning-related neural dysfunctions in unmedicated patients with major depressive disorder so far. We thus aimed at comparing the neural coding of reward and punishment prediction errors, representing indicators of neural learning-related processes, between unmedicated patients with major depressive disorder and healthy participants. To this end, a group of unmedicated patients with major depressive disorder (n = 28) and a group of age- and sex-matched healthy control participants (n = 30) completed an instrumental learning task involving monetary gains and losses during functional magnetic resonance imaging. The two groups did not differ in their learning performance. Patients and control participants showed the same level of prediction error-related activity in the ventral striatum and the anterior insula. In contrast, neural coding of reward prediction errors in the medial orbitofrontal cortex was reduced in patients. Moreover, neural reward prediction error signals in the medial orbitofrontal cortex and ventral striatum showed negative correlations with anhedonia severity. Using a standard instrumental learning paradigm we found no evidence for an overall impairment of reinforcement learning in medication-free patients with major depressive disorder. Importantly, however, the attenuated neural coding of reward in the medial orbitofrontal cortex and the relation between anhedonia and reduced reward prediction error-signalling in the medial orbitofrontal cortex and ventral striatum likely reflect an impairment in experiencing pleasure from rewarding events as a key mechanism of anhedonia in major depressive disorder.

  12. Go and no-go learning in reward and punishment: Interactions between affect and effect

    PubMed Central

    Guitart-Masip, Marc; Huys, Quentin J.M.; Fuentemilla, Lluis; Dayan, Peter; Duzel, Emrah; Dolan, Raymond J.

    2012-01-01

    Decision-making invokes two fundamental axes of control: affect or valence, spanning reward and punishment, and effect or action, spanning invigoration and inhibition. We studied the acquisition of instrumental responding in healthy human volunteers in a task in which we orthogonalized action requirements and outcome valence. Subjects were much more successful in learning active choices in rewarded conditions, and passive choices in punished conditions. Using computational reinforcement-learning models, we teased apart contributions from putatively instrumental and Pavlovian components in the generation of the observed asymmetry during learning. Moreover, using model-based fMRI, we showed that BOLD signals in striatum and substantia nigra/ventral tegmental area (SN/VTA) correlated with instrumentally learnt action values, but with opposite signs for go and no-go choices. Finally, we showed that successful instrumental learning depends on engagement of bilateral inferior frontal gyrus. Our behavioral and computational data showed that instrumental learning is contingent on overcoming inherent and plastic Pavlovian biases, while our neuronal data showed this learning is linked to unique patterns of brain activity in regions implicated in action and inhibition respectively. PMID:22548809
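
    In the spirit of the models the authors describe, a Pavlovian term can be folded into the instrumental action weights so that appetitive state values promote "go" and aversive values suppress it; the parameterization below is an illustrative sketch, not the fitted model.

```python
import math

def action_weights(q_go, q_nogo, state_value, go_bias=0.3, pav_weight=0.5):
    """Instrumental Q-values plus a static go bias and a Pavlovian term:
    appetitive states (V > 0) promote 'go', aversive states (V < 0) suppress it."""
    w_go = q_go + go_bias + pav_weight * state_value
    w_nogo = q_nogo
    return w_go, w_nogo

def p_go(w_go, w_nogo):
    """Softmax (logistic) choice rule over the two action weights."""
    return 1.0 / (1.0 + math.exp(-(w_go - w_nogo)))

# Same instrumental values, opposite state valence:
print(p_go(*action_weights(0.2, 0.2, state_value=+1.0)))  # reward context: go favored
print(p_go(*action_weights(0.2, 0.2, state_value=-1.0)))  # punishment context: no-go favored
```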

  13. Generalization of value in reinforcement learning by humans

    PubMed Central

    Wimmer, G. Elliott; Daw, Nathaniel D.; Shohamy, Daphna

    2012-01-01

    Research in decision making has focused on the role of dopamine and its striatal targets in guiding choices via learned stimulus-reward or stimulus-response associations, behavior that is well-described by reinforcement learning (RL) theories. However, basic RL is relatively limited in scope and does not explain how learning about stimulus regularities or relations may guide decision making. A candidate mechanism for this type of learning comes from the domain of memory, which has highlighted a role for the hippocampus in learning of stimulus-stimulus relations, typically dissociated from the role of the striatum in stimulus-response learning. Here, we used fMRI and computational model-based analyses to examine the joint contributions of these mechanisms to RL. Humans performed an RL task with added relational structure, modeled after tasks used to isolate hippocampal contributions to memory. On each trial participants chose one of four options, but the reward probabilities for pairs of options were correlated across trials. This (uninstructed) relationship between pairs of options potentially enabled an observer to learn about options’ values based on experience with the other options and to generalize across them. We observed BOLD activity related to learning in the striatum and also in the hippocampus. By comparing a basic RL model to one augmented to allow feedback to generalize between correlated options, we tested whether choice behavior and BOLD activity were influenced by the opportunity to generalize across correlated options. Although such generalization goes beyond standard computational accounts of RL and striatal BOLD, both choices and striatal BOLD were better explained by the augmented model. Consistent with the hypothesized role for the hippocampus in this generalization, functional connectivity between the ventral striatum and hippocampus was modulated, across participants, by the ability of the augmented model to capture participants’ choice
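
    The augmented model's key move can be sketched simply: when one option is rewarded, its correlated partner's value is nudged by the same feedback. The coupling parameter and reward probabilities below are illustrative assumptions, not the fitted model.

```python
import random
random.seed(0)

ALPHA, ALPHA_GEN = 0.2, 0.1        # direct and generalized learning rates
pair_of = {0: 1, 1: 0, 2: 3, 3: 2}  # options whose reward probabilities covary
Q = [0.5] * 4

def update(option, reward):
    Q[option] += ALPHA * (reward - Q[option])          # standard RL update
    partner = pair_of[option]
    Q[partner] += ALPHA_GEN * (reward - Q[partner])    # generalized feedback

for _ in range(300):
    update(0, 1.0 if random.random() < 0.8 else 0.0)   # only option 0 sampled
print(Q)  # option 1 has moved toward 0.8 without direct experience
```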

  14. Context Transfer in Reinforcement Learning Using Action-Value Functions

    PubMed Central

    Mousavi, Amin; Nadjar Araabi, Babak; Nili Ahmadabadi, Majid

    2014-01-01

    This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different states or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents' MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task. PMID:25610457

  15. Context transfer in reinforcement learning using action-value functions.

    PubMed

    Mousavi, Amin; Nadjar Araabi, Babak; Nili Ahmadabadi, Majid

    2014-01-01

    This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different states or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents' MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task.

  16. The attention habit: how reward learning shapes attentional selection.

    PubMed

    Anderson, Brian A

    2016-04-01

    There is growing consensus that reward plays an important role in the control of attention. Until recently, reward was thought to influence attention indirectly by modulating task-specific motivation and its effects on voluntary control over selection. Such an account was consistent with the goal-directed (endogenous) versus stimulus-driven (exogenous) framework that had long dominated the field of attention research. Now, a different perspective is emerging. Demonstrations that previously reward-associated stimuli can automatically capture attention even when physically inconspicuous and task-irrelevant challenge previously held assumptions about attentional control. The idea that attentional selection can be value driven, reflecting a distinct and previously unrecognized control mechanism, has gained traction. Since these early demonstrations, the influence of reward learning on attention has rapidly become an area of intense investigation, sparking many new insights. The result is an emerging picture of how the reward system of the brain automatically biases information processing. Here, I review the progress that has been made in this area, synthesizing a wealth of recent evidence to provide an integrated, up-to-date account of value-driven attention and some of its broader implications.

  17. Frontostriatal white matter integrity mediates adult age differences in probabilistic reward learning.

    PubMed

    Samanez-Larkin, Gregory R; Levens, Sara M; Perry, Lee M; Dougherty, Robert F; Knutson, Brian

    2012-04-11

    Frontostriatal circuits have been implicated in reward learning, and emerging findings suggest that frontal white matter structural integrity and probabilistic reward learning are reduced in older age. This cross-sectional study examined whether age differences in frontostriatal white matter integrity could account for age differences in reward learning in a community life span sample of human adults. By combining diffusion tensor imaging with a probabilistic reward learning task, we found that older age was associated with decreased reward learning and decreased white matter integrity in specific pathways running from the thalamus to the medial prefrontal cortex and from the medial prefrontal cortex to the ventral striatum. Further, white matter integrity in these thalamocorticostriatal paths could statistically account for age differences in learning. These findings suggest that the integrity of frontostriatal white matter pathways critically supports reward learning. The findings also raise the possibility that interventions that bolster frontostriatal integrity might improve reward learning and decision making.

  18. A proposed resolution to the paradox of drug reward: Dopamine's evolution from an aversive signal to a facilitator of drug reward via negative reinforcement.

    PubMed

    Ting-A-Kee, Ryan; Heinmiller, Andrew; van der Kooy, Derek

    2015-09-01

    The mystery surrounding how plant neurotoxins came to possess reinforcing properties is termed the paradox of drug reward. Here we propose a resolution to this paradox whereby dopamine - which has traditionally been viewed as a signal of reward - initially signaled aversion and encouraged escape. We suggest that after being consumed, plant neurotoxins such as nicotine activated an aversive dopaminergic pathway, thereby deterring predatory herbivores. Later evolutionary events - including the development of a GABAergic system capable of modulating dopaminergic activity - led to the ability to down-regulate and 'control' this dopamine-based aversion. We speculate that this negative reinforcement system evolved so that animals could suppress aversive states such as hunger in order to attend to other internal drives (such as mating and shelter) that would result in improved organismal fitness.

  19. Common Neural Mechanisms Underlying Reversal Learning by Reward and Punishment

    PubMed Central

    Xue, Gui; Xue, Feng; Droutman, Vita; Lu, Zhong-Lin; Bechara, Antoine; Read, Stephen

    2013-01-01

    Impairments in flexible goal-directed decisions, often examined by reversal learning, are associated with behavioral abnormalities characterized by impulsiveness and disinhibition. Although the lateral orbital frontal cortex (OFC) has been consistently implicated in reversal learning, it is still unclear whether this region is involved in negative feedback processing, behavioral control, or both, and whether reward and punishment might have different effects on lateral OFC involvement. Using a relatively large sample (N = 47), and a categorical learning task with either monetary reward or moderate electric shock as feedback, we found overlapping activations in the right lateral OFC (and adjacent insula) for reward and punishment reversal learning when comparing correct reversal trials with correct acquisition trials, whereas we found overlapping activations in the right dorsolateral prefrontal cortex (DLPFC) when negative feedback signaled contingency change. The right lateral OFC and DLPFC also showed greater sensitivity to punishment than did their left homologues, indicating an asymmetry in how punishment is processed. We propose that the right lateral OFC and anterior insula are important for transforming affective feedback to behavioral adjustment, whereas the right DLPFC is involved in higher level attention control. These results provide insight into the neural mechanisms of reversal learning and behavioral flexibility, which can be leveraged to understand risky behaviors among vulnerable populations. PMID:24349211

  20. Common neural mechanisms underlying reversal learning by reward and punishment.

    PubMed

    Xue, Gui; Xue, Feng; Droutman, Vita; Lu, Zhong-Lin; Bechara, Antoine; Read, Stephen

    2013-01-01

    Impairments in flexible goal-directed decisions, often examined by reversal learning, are associated with behavioral abnormalities characterized by impulsiveness and disinhibition. Although the lateral orbital frontal cortex (OFC) has been consistently implicated in reversal learning, it is still unclear whether this region is involved in negative feedback processing, behavioral control, or both, and whether reward and punishment might have different effects on lateral OFC involvement. Using a relatively large sample (N = 47), and a categorical learning task with either monetary reward or moderate electric shock as feedback, we found overlapping activations in the right lateral OFC (and adjacent insula) for reward and punishment reversal learning when comparing correct reversal trials with correct acquisition trials, whereas we found overlapping activations in the right dorsolateral prefrontal cortex (DLPFC) when negative feedback signaled contingency change. The right lateral OFC and DLPFC also showed greater sensitivity to punishment than did their left homologues, indicating an asymmetry in how punishment is processed. We propose that the right lateral OFC and anterior insula are important for transforming affective feedback to behavioral adjustment, whereas the right DLPFC is involved in higher level attention control. These results provide insight into the neural mechanisms of reversal learning and behavioral flexibility, which can be leveraged to understand risky behaviors among vulnerable populations.

  1. Reinforcement learning in depression: A review of computational research.

    PubMed

    Chen, Chong; Takahashi, Taiki; Nakagawa, Shin; Inoue, Takeshi; Kusumi, Ichiro

    2015-08-01

    Despite being considered primarily a mood disorder, major depressive disorder (MDD) is characterized by cognitive and decision making deficits. Recent research has employed computational models of reinforcement learning (RL) to address these deficits. The computational approach has the advantage of making explicit predictions about learning and behavior, specifying the process parameters of RL, differentiating between model-free and model-based RL, and enabling computational model-based analyses of functional magnetic resonance imaging and electroencephalography data. With these merits, computational psychiatry has emerged as a field, and here we review specific studies that focused on MDD. Considerable evidence suggests that MDD is associated with impaired brain signals of reward prediction error and expected value ('wanting'), decreased reward sensitivity ('liking') and/or learning (be it model-free or model-based), etc., although the causality remains unclear. These parameters may serve as valuable intermediate phenotypes of MDD, linking general clinical symptoms to underlying molecular dysfunctions. We believe future computational research at clinical, systems, and cellular/molecular/genetic levels will propel us toward a better understanding of the disease.

  2. Evolution with reinforcement learning in negotiation.

    PubMed

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and supports more robust predictions in the long term than in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of an evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions while the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stability than their counterparts using a classic evolutionary algorithm.
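
    The core update the abstract alludes to can be sketched as follows; the blending weight lam and the fitness-proportional evolutionary step are illustrative assumptions, not the paper's exact operators (payoffs are assumed positive for the sampling step):

        import random

        def update_value(history_value, payoff, lam=0.7):
            # lam trades off accumulated (historical) information
            # against the payoff of the current interaction
            return lam * history_value + (1.0 - lam) * payoff

        def evolve(strategies, values):
            # strategies with higher learned value are copied more often
            return random.choices(strategies, weights=values, k=len(strategies))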

  3. Evolution with Reinforcement Learning in Negotiation

    PubMed Central

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and supports more robust predictions in the long term than in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of an evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions while the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stability than their counterparts using a classic evolutionary algorithm. PMID:25048108

  4. Reinforcement Learning with Bounded Information Loss

    NASA Astrophysics Data System (ADS)

    Peters, Jan; Mülling, Katharina; Seldin, Yevgeny; Altun, Yasemin

    2011-03-01

    Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information. Hence, it has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant or natural policy gradients, many of these problems may be addressed by constraining the information loss. In this paper, we continue this path of reasoning and suggest two reinforcement learning methods, i.e., a model-based and a model-free algorithm, that bound the loss in relative entropy while maximizing their return. The resulting methods differ significantly from previous policy gradient approaches and yield an exact update step. They work well on typical reinforcement learning benchmark problems as well as on novel evaluations in robotics. We also give a Bayesian bound motivation for this new approach [8].
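
    An episodic sketch of the bounded-information-loss idea: reweight sampled returns with a temperature eta chosen so that the Kullback-Leibler divergence of the reweighted sample distribution from the uniform one stays within a bound epsilon, then fit the new policy by weighted maximum likelihood. The bisection search below is a simplification; the paper obtains eta from a dual optimization:

        import numpy as np

        def bounded_loss_weights(returns, epsilon=0.1):
            R = np.asarray(returns, dtype=float)
            R = R - R.max()                # numerical stability
            lo, hi = 1e-6, 1e6
            for _ in range(100):           # bisect on the temperature eta
                eta = np.sqrt(lo * hi)
                w = np.exp(R / eta)
                p = w / w.sum()
                kl = float(np.sum(p * np.log(p * len(p) + 1e-12)))
                if kl > epsilon:
                    lo = eta               # update too greedy: raise eta
                else:
                    hi = eta               # update too conservative: lower eta
            return w / w.sum()

        # the new policy is then fit by weighted maximum likelihood on the samples
        weights = bounded_loss_weights([1.0, 2.0, 0.5, 3.0])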

  5. Short-term memory traces for action bias in human reinforcement learning.

    PubMed

    Bogacz, Rafal; McClure, Samuel M; Li, Jian; Cohen, Jonathan D; Montague, P Read

    2007-06-11

    Recent experimental and theoretical work on reinforcement learning has shed light on the neural bases of learning from rewards and punishments. One fundamental problem in reinforcement learning is the credit assignment problem, or how to properly assign credit to actions that lead to reward or punishment following a delay. Temporal difference learning solves this problem, but its efficiency can be significantly improved by the addition of eligibility traces (ET). In essence, ETs function as decaying memories of previous choices that are used to scale synaptic weight changes. It has been shown in theoretical studies that ETs spanning a number of actions may improve the performance of reinforcement learning. However, it remains an open question whether including ETs that persist over sequences of actions allows reinforcement learning models to better fit empirical data regarding the behaviors of humans and other animals. Here, we report an experiment in which human subjects performed a sequential economic decision game in which the long-term optimal strategy differed from the strategy that leads to the greatest short-term return. We demonstrate that human subjects' performance in the task is significantly affected by the time between choices in a surprising and seemingly counterintuitive way. However, this behavior is naturally explained by a temporal difference learning model which includes ETs persisting across actions. Furthermore, we review recent findings that suggest that short-term synaptic plasticity in dopamine neurons may provide a realistic biophysical mechanism for producing ETs that persist on a timescale consistent with behavioral observations.
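
    The mechanism in question is the standard eligibility-trace update; a minimal tabular SARSA(lambda) pass over one recorded episode looks like this (state/action coding is illustrative):

        import numpy as np

        def sarsa_lambda_update(Q, trajectory, alpha=0.1, gamma=0.95, lam=0.9):
            """Eligibility traces e decay by gamma*lam per step, so a delayed
            reward propagates credit back across the recent choice sequence."""
            e = np.zeros_like(Q)
            for (s, a, r, s2, a2) in trajectory:
                delta = r + gamma * Q[s2, a2] - Q[s, a]  # TD error
                e[s, a] += 1.0                           # accumulating trace
                Q += alpha * delta * e                   # traced pairs share credit
                e *= gamma * lam                         # memories of past choices fade
            return Q

        # toy usage: 3 states, 2 actions, one short recorded episode
        Q = sarsa_lambda_update(np.zeros((3, 2)),
                                [(0, 1, 0.0, 1, 0), (1, 0, 1.0, 2, 0)])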

  6. Heightened reward learning under stress in generalized anxiety disorder: a predictor of depression resistance?

    PubMed

    Morris, Bethany H; Rottenberg, Jonathan

    2015-02-01

    Stress-induced anhedonia is associated with depression vulnerability (Bogdan & Pizzagalli, 2006). We investigated stress-induced deficits in reward learning in a depression-vulnerable group with analogue generalized anxiety disorder (GAD, n = 34), and never-depressed healthy controls (n = 41). Utilizing a computerized signal detection task, reward learning was assessed under stressor and neutral conditions. Controls displayed intact reward learning in the neutral condition, and the expected stress-induced blunting. The GAD group as a whole also showed intact reward learning in the neutral condition. When GAD subjects were analyzed as a function of prior depression history, never-depressed GAD subjects showed heightened reward learning in the stressor condition. Better reward learning under stress among GAD subjects predicted lower depression symptoms 1 month later. Robust reward learning under stress may indicate depression resistance among anxious individuals.

  7. Effort-Reward Imbalance for Learning Is Associated with Fatigue in School Children

    ERIC Educational Resources Information Center

    Fukuda, Sanae; Yamano, Emi; Joudoi, Takako; Mizuno, Kei; Tanaka, Masaaki; Kawatani, Junko; Takano, Miyuki; Tomoda, Akemi; Imai-Matsumura, Kyoko; Miike, Teruhisa; Watanabe, Yasuyoshi

    2010-01-01

    We examined relationships among fatigue, sleep quality, and effort-reward imbalance for learning in school children. We developed an effort-reward for learning scale in school students and examined its reliability and validity. Self-administered surveys, including the effort-reward for learning scale and a fatigue scale, were completed by 1,023…

  8. Environmental Service Learning: Relevant, Rewarding, and Responsible

    ERIC Educational Resources Information Center

    Leege, Lissa; Cawthorn, Michelle

    2008-01-01

    At Georgia Southern University (GSU), a regional university of 17,000 students, environmental science is a required introductory course for all students. Consequently, environmental-biology class sizes are large, often approaching 1,000 students each semester in multiple sections of up to 250 students. To improve students' learning and sense of…

  9. Learning to Obtain Reward, but Not Avoid Punishment, Is Affected by Presence of PTSD Symptoms in Male Veterans: Empirical Data and Computational Model

    PubMed Central

    Myers, Catherine E.; Moustafa, Ahmed A.; Sheynin, Jony; VanMeenen, Kirsten M.; Gilbertson, Mark W.; Orr, Scott P.; Beck, Kevin D.; Pang, Kevin C. H.; Servatius, Richard J.

    2013-01-01

    Post-traumatic stress disorder (PTSD) symptoms include behavioral avoidance which is acquired and tends to increase with time. This avoidance may represent a general learning bias; indeed, individuals with PTSD are often faster than controls on acquiring conditioned responses based on physiologically-aversive feedback. However, it is not clear whether this learning bias extends to cognitive feedback, or to learning from both reward and punishment. Here, male veterans with self-reported current, severe PTSD symptoms (PTSS group) or with few or no PTSD symptoms (control group) completed a probabilistic classification task that included both reward-based and punishment-based trials, where feedback could take the form of reward, punishment, or an ambiguous “no-feedback” outcome that could signal either successful avoidance of punishment or failure to obtain reward. The PTSS group outperformed the control group in total points obtained; the PTSS group specifically performed better than the control group on reward-based trials, with no difference on punishment-based trials. To better understand possible mechanisms underlying observed performance, we used a reinforcement learning model of the task, and applied maximum likelihood estimation techniques to derive estimated parameters describing individual participants’ behavior. Estimations of the reinforcement value of the no-feedback outcome were significantly greater in the control group than the PTSS group, suggesting that the control group was more likely to value this outcome as positively reinforcing (i.e., signaling successful avoidance of punishment). This is consistent with the control group’s generally poorer performance on reward trials, where reward feedback was to be obtained in preference to the no-feedback outcome. Differences in the interpretation of ambiguous feedback may contribute to the facilitated reinforcement learning often observed in PTSD patients, and may in turn provide new insight into

  10. Learning to obtain reward, but not avoid punishment, is affected by presence of PTSD symptoms in male veterans: empirical data and computational model.

    PubMed

    Myers, Catherine E; Moustafa, Ahmed A; Sheynin, Jony; Vanmeenen, Kirsten M; Gilbertson, Mark W; Orr, Scott P; Beck, Kevin D; Pang, Kevin C H; Servatius, Richard J

    2013-01-01

    Post-traumatic stress disorder (PTSD) symptoms include behavioral avoidance which is acquired and tends to increase with time. This avoidance may represent a general learning bias; indeed, individuals with PTSD are often faster than controls on acquiring conditioned responses based on physiologically-aversive feedback. However, it is not clear whether this learning bias extends to cognitive feedback, or to learning from both reward and punishment. Here, male veterans with self-reported current, severe PTSD symptoms (PTSS group) or with few or no PTSD symptoms (control group) completed a probabilistic classification task that included both reward-based and punishment-based trials, where feedback could take the form of reward, punishment, or an ambiguous "no-feedback" outcome that could signal either successful avoidance of punishment or failure to obtain reward. The PTSS group outperformed the control group in total points obtained; the PTSS group specifically performed better than the control group on reward-based trials, with no difference on punishment-based trials. To better understand possible mechanisms underlying observed performance, we used a reinforcement learning model of the task, and applied maximum likelihood estimation techniques to derive estimated parameters describing individual participants' behavior. Estimations of the reinforcement value of the no-feedback outcome were significantly greater in the control group than the PTSS group, suggesting that the control group was more likely to value this outcome as positively reinforcing (i.e., signaling successful avoidance of punishment). This is consistent with the control group's generally poorer performance on reward trials, where reward feedback was to be obtained in preference to the no-feedback outcome. Differences in the interpretation of ambiguous feedback may contribute to the facilitated reinforcement learning often observed in PTSD patients, and may in turn provide new insight into how
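
    The model-fitting procedure described above can be sketched as maximum likelihood estimation over a simple RL model in which the ambiguous no-feedback outcome carries a free reinforcement value; the two-option task coding below is a hypothetical simplification of the actual task:

        import numpy as np
        from scipy.optimize import minimize

        def neg_log_lik(params, choices, outcomes):
            """outcomes: +1 reward, -1 punishment, 0 the ambiguous no-feedback
            outcome, whose subjective value v_nf is estimated per participant."""
            alpha, beta, v_nf = params
            q = np.zeros(2)
            ll = 0.0
            for c, o in zip(choices, outcomes):
                p = np.exp(beta * q) / np.exp(beta * q).sum()  # softmax choice rule
                ll += np.log(p[c] + 1e-12)
                r = v_nf if o == 0 else float(o)               # value of the outcome
                q[c] += alpha * (r - q[c])
            return -ll

        # toy usage with a four-trial choice/outcome record
        res = minimize(neg_log_lik, x0=[0.3, 2.0, 0.0],
                       args=([0, 1, 0, 0], [1, -1, 0, 1]),
                       bounds=[(0.01, 1.0), (0.1, 10.0), (-1.0, 1.0)])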

  11. Optimal control in microgrid using multi-agent reinforcement learning.

    PubMed

    Li, Fu-Dong; Wu, Min; He, Yong; Chen, Xin

    2012-11-01

    This paper presents an improved reinforcement learning method to minimize electricity costs while satisfying the power balance and generation limits of units in a grid-connected microgrid. First, the microgrid control requirements are analyzed and the objective function of optimal control for the microgrid is proposed. Then, a state variable, "Average Electricity Price Trend", which expresses the most probable transitions of the system, is developed to reduce the complexity and randomness of the microgrid, and a multi-agent architecture comprising agents, state variables, action variables, and a reward function is formulated. Furthermore, dynamic hierarchical reinforcement learning, based on the change rate of a key state variable, is established to carry out optimal policy exploration. The analysis shows that the proposed method helps handle the "curse of dimensionality" and speeds up learning in unknown large-scale environments. Finally, simulation results under JADE (Java Agent Development Framework) demonstrate the validity of the presented method for optimal control of a grid-connected microgrid.
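
    A toy illustration of the setup, with the reward defined as the negative electricity cost and the state including a discretized price-trend variable; the encoding is a hypothetical stand-in for the paper's architecture:

        from collections import defaultdict

        Q = defaultdict(float)  # (state, action) -> value

        def q_update(state, action, cost, next_state, actions, alpha=0.1, gamma=0.9):
            reward = -cost  # minimizing electricity cost = maximizing reward
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

        # toy usage: state = (battery level bin, average electricity price trend)
        actions = ["charge", "discharge", "idle"]
        q_update(("low", "rising"), "charge", cost=3.2,
                 next_state=("mid", "rising"), actions=actions)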

  12. Deficits in reinforcement learning but no link to apathy in patients with schizophrenia

    PubMed Central

    Hartmann-Riemer, Matthias N.; Aschenbrenner, Steffen; Bossert, Magdalena; Westermann, Celina; Seifritz, Erich; Tobler, Philippe N.; Weisbrod, Matthias; Kaiser, Stefan

    2017-01-01

    Negative symptoms in schizophrenia have been linked to selective reinforcement learning deficits in the context of gains combined with intact loss-avoidance learning. Fundamental mechanisms of reinforcement learning and choice are prediction error signaling and the precise representation of reward value for future decisions. It is unclear which of these mechanisms contribute to the impairments in learning from positive outcomes observed in schizophrenia. A recent study suggested that patients with severe apathy symptoms show deficits in the representation of expected value. Considering the fundamental relevance for the understanding of these symptoms, we aimed to assess the stability of these findings across studies. Sixty-four patients with schizophrenia and 19 healthy control participants performed a probabilistic reward learning task. They had to associate stimuli with gain or loss-avoidance. In a transfer phase participants indicated valuation of the previously learned stimuli by choosing among them. Patients demonstrated an overall impairment in learning compared to healthy controls. No effects of apathy symptoms on task indices were observed. However, patients with schizophrenia learned better in the context of loss-avoidance than in the context of gain. Earlier findings were thus partially replicated. Further studies are needed to clarify the mechanistic link between negative symptoms and reinforcement learning. PMID:28071747

  13. Deficits in reinforcement learning but no link to apathy in patients with schizophrenia.

    PubMed

    Hartmann-Riemer, Matthias N; Aschenbrenner, Steffen; Bossert, Magdalena; Westermann, Celina; Seifritz, Erich; Tobler, Philippe N; Weisbrod, Matthias; Kaiser, Stefan

    2017-01-10

    Negative symptoms in schizophrenia have been linked to selective reinforcement learning deficits in the context of gains combined with intact loss-avoidance learning. Fundamental mechanisms of reinforcement learning and choice are prediction error signaling and the precise representation of reward value for future decisions. It is unclear which of these mechanisms contribute to the impairments in learning from positive outcomes observed in schizophrenia. A recent study suggested that patients with severe apathy symptoms show deficits in the representation of expected value. Considering the fundamental relevance for the understanding of these symptoms, we aimed to assess the stability of these findings across studies. Sixty-four patients with schizophrenia and 19 healthy control participants performed a probabilistic reward learning task. They had to associate stimuli with gain or loss-avoidance. In a transfer phase participants indicated valuation of the previously learned stimuli by choosing among them. Patients demonstrated an overall impairment in learning compared to healthy controls. No effects of apathy symptoms on task indices were observed. However, patients with schizophrenia learned better in the context of loss-avoidance than in the context of gain. Earlier findings were thus partially replicated. Further studies are needed to clarify the mechanistic link between negative symptoms and reinforcement learning.

  14. Pollen Elicits Proboscis Extension but Does Not Reinforce PER Learning in Honeybees

    PubMed Central

    Nicholls, Elizabeth; Hempel de Ibarra, Natalie

    2013-01-01

    The function of pollen as a reward for foraging bees is little understood, though there is evidence to suggest that it can reinforce associations with visual and olfactory floral cues. Foraging bees do not feed on pollen, thus one could argue that it cannot serve as an appetitive reinforcer in the same way as sucrose. However, ingestion is not a critical parameter for sucrose reinforcement, since olfactory proboscis extension (PER) learning can be conditioned through antennal stimulation only. During pollen collection, the antennae and mouthparts come into contact with pollen, thus it is possible that pollen reinforces associative learning through similar gustatory pathways as sucrose. Here pollen was presented as the unconditioned stimulus (US), either in its natural state or in a 30% pollen-water solution, and was found to elicit proboscis extension following antennal stimulation. Control groups were exposed to either sucrose or a clean sponge as the US, or an unpaired presentation of the conditioned stimulus (CS) and pollen US. Despite steady levels of responding to the US, bees did not learn to associate a neutral odour with the delivery of a pollen reward, thus whilst pollen has a proboscis extension releasing function, it does not reinforce olfactory PER learning. PMID:26462523

  15. Antipsychotic dose modulates behavioral and neural responses to feedback during reinforcement learning in schizophrenia.

    PubMed

    Insel, Catherine; Reinen, Jenna; Weber, Jochen; Wager, Tor D; Jarskog, L Fredrik; Shohamy, Daphna; Smith, Edward E

    2014-03-01

    Schizophrenia is characterized by an abnormal dopamine system, and dopamine blockade is the primary mechanism of antipsychotic treatment. Consistent with the known role of dopamine in reward processing, prior research has demonstrated that patients with schizophrenia exhibit impairments in reward-based learning. However, it remains unknown how treatment with antipsychotic medication impacts the behavioral and neural signatures of reinforcement learning in schizophrenia. The goal of this study was to examine whether antipsychotic medication modulates behavioral and neural responses to prediction error coding during reinforcement learning. Patients with schizophrenia completed a reinforcement learning task while undergoing functional magnetic resonance imaging. The task consisted of two separate conditions in which participants accumulated monetary gain or avoided monetary loss. Behavioral results indicated that antipsychotic medication dose was associated with altered behavioral approaches to learning, such that patients taking higher doses of medication showed increased sensitivity to negative reinforcement. Higher doses of antipsychotic medication were also associated with higher learning rates (LRs), suggesting that medication enhanced sensitivity to trial-by-trial feedback. Neuroimaging data demonstrated that antipsychotic dose was related to differences in neural signatures of feedback prediction error during the loss condition. Specifically, patients taking higher doses of medication showed attenuated prediction error responses in the striatum and the medial prefrontal cortex. These findings indicate that antipsychotic medication treatment may influence motivational processes in patients with schizophrenia.

  16. Refining Linear Fuzzy Rules by Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.; Malkani, Anil

    1996-01-01

    Linear fuzzy rules are increasingly being used in the development of fuzzy logic systems. Radial basis functions have also been used in the antecedents of the rules for clustering in product space which can automatically generate a set of linear fuzzy rules from an input/output data set. Manual methods are usually used in refining these rules. This paper presents a method for refining the parameters of these rules using reinforcement learning which can be applied in domains where supervised input-output data is not available and reinforcements are received only after a long sequence of actions. This is shown for a generalization of radial basis functions. The formation of fuzzy rules from data and their automatic refinement is an important step in closing the gap between the application of reinforcement learning methods in the domains where only some limited input-output data is available.
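
    One way to picture the refinement step, under stated assumptions (RBF antecedents, linear consequents, and a generic reward-perturbation rule rather than the paper's exact method):

        import numpy as np

        def fuzzy_output(x, centers, widths, theta):
            """Rules fire via RBF memberships; outputs are linear consequents
            theta[k] = [w_k, b_k], combined by normalized firing strength."""
            mu = np.exp(-((x - centers) ** 2) / (2 * widths ** 2))
            mu = mu / mu.sum()
            return float(np.sum(mu * (theta[:, 0] * x + theta[:, 1])))

        def refine_step(theta, episode_return, baseline, noise, lr=0.05):
            # consequent parameters perturbed by `noise` this episode move in
            # that direction when the delayed return beats the baseline
            return theta + lr * (episode_return - baseline) * noise

        # toy usage: three rules over a scalar input
        theta = np.zeros((3, 2))
        y = fuzzy_output(0.5, np.array([-1.0, 0.0, 1.0]), np.ones(3), theta)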

  17. The basolateral amygdala in reward learning and addiction

    PubMed Central

    Wassum, Kate M.; Izquierdo, Alicia

    2015-01-01

    Sophisticated behavioral paradigms, partnered with the emergence of increasingly selective techniques to target the basolateral amygdala (BLA), have resulted in an enhanced understanding of the role of this nucleus in learning and using reward information. Due to the wide variety of behavioral approaches, many questions remain about the circumscribed role of the BLA in appetitive behavior. In this review, we first integrate conclusions of BLA function in reward-related behavior using traditional interference techniques (lesion, pharmacological inactivation) with those using newer methodological approaches in experimental animals that allow in vivo manipulation of cell type-specific populations and neural recordings. Second, from a review of appetitive behavioral tasks in rodents and monkeys and recent computational models of reward procurement, we derive evidence for the BLA as a neural integrator of reward value, history, and cost parameters. Taken together, the BLA codes specific and temporally dynamic outcome representations in a distributed network to orchestrate adaptive responses. We provide evidence that experiences with opiates and psychostimulants alter these outcome representations in the BLA, resulting in long-term modified action. PMID:26341938

  18. Autonomous reinforcement learning with experience replay.

    PubMed

    Wawrzyński, Paweł; Tanwani, Ajay Kumar

    2013-05-01

    This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy with the use of previously collected samples and autonomously estimates the appropriate step-sizes for the learning updates. The algorithm is based on the actor-critic with experience replay, whose step-sizes are determined on-line by an enhanced fixed-point algorithm for on-line neural network training. An experimental study with a simulated octopus arm and a half-cheetah demonstrates the feasibility of the proposed algorithm to solve difficult learning control problems in an autonomous way within a reasonably short time.
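
    A minimal tabular sketch of the actor-critic-with-replay structure; unlike the paper, step-sizes here are fixed rather than estimated on-line, and no importance correction is applied to the replayed samples:

        import random
        import numpy as np

        class ReplayActorCritic:
            def __init__(self, n_states, n_actions, gamma=0.95):
                self.v = np.zeros(n_states)                  # critic: state values
                self.pref = np.zeros((n_states, n_actions))  # actor: preferences
                self.buffer = []                             # stored experience
                self.gamma = gamma

            def store(self, s, a, r, s2):
                self.buffer.append((s, a, r, s2))

            def replay(self, batch=32, alpha_v=0.1, alpha_p=0.05):
                k = min(batch, len(self.buffer))
                for s, a, r, s2 in random.choices(self.buffer, k=k):
                    delta = r + self.gamma * self.v[s2] - self.v[s]  # TD error
                    self.v[s] += alpha_v * delta                     # critic update
                    self.pref[s, a] += alpha_p * delta               # actor update

        # toy usage
        ac = ReplayActorCritic(n_states=3, n_actions=2)
        ac.store(0, 1, 1.0, 2)
        ac.replay()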

  1. Geographical Inquiry and Learning Reinforcement Theory.

    ERIC Educational Resources Information Center

    Davies, Christopher S.

    1983-01-01

    Although instructors have been reluctant to utilize the Keller Plan (a personalized system of instruction), it lends itself to teaching introductory geography. College students found that the routine and frequent reinforcement led to progressive learning. However, it does not lend itself to the study of reflexive or polemical concepts. (IS)

  2. Classroom Reinforcement and Learning: A Quantitative Synthesis.

    ERIC Educational Resources Information Center

    Lysakowski, Richard S.; Walberg, Herbert J.

    To estimate the influence of positive reinforcement on classroom learning, the authors analyzed statistical data from 39 studies spanning the years 1958-1978 and containing a combined sample of 4,842 students in 202 classes. Twenty-nine characteristics of each study's sample, methodology, and reliability were coded to measure their effects on…

  3. Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioural ecology.

    PubMed

    Schultz, Wolfram

    2004-04-01

    Neurons in a small number of brain structures detect rewards and reward-predicting stimuli and are active during the expectation of predictable food and liquid rewards. These neurons code the reward information according to basic terms of various behavioural theories that seek to explain reward-directed learning, approach behaviour and decision-making. The involved brain structures include groups of dopamine neurons, the striatum including the nucleus accumbens, the orbitofrontal cortex and the amygdala. The reward information is fed to brain structures involved in decision-making and organisation of behaviour, such as the dorsolateral prefrontal cortex and possibly the parietal cortex. The neural coding of basic reward terms derived from formal theories puts the neurophysiological investigation of reward mechanisms on firm conceptual grounds and provides neural correlates for the function of rewards in learning, approach behaviour and decision-making.

  4. Reinforcement Learning in Information Searching

    ERIC Educational Resources Information Center

    Cen, Yonghua; Gan, Liren; Bai, Chen

    2013-01-01

    Introduction: The study seeks to answer two questions: How do university students learn to use correct strategies to conduct scholarly information searches without instructions? and, What are the differences in learning mechanisms between users at different cognitive levels? Method: Two groups of users, thirteen first year undergraduate students…

  5. The Reinforcing and Rewarding Effects of Methylone, a Synthetic Cathinone Commonly Found in "Bath Salts"

    PubMed

    Watterson, Lucas R; Hood, Lauren; Sewalia, Kaveish; Tomek, Seven E; Yahn, Stephanie; Johnson, Craig Trevor; Wegner, Scott; Blough, Bruce E; Marusich, Julie A; Olive, M Foster

    2012-12-01

    Methylone is a member of the designer drug class known as synthetic cathinones, which have become increasingly popular drugs of abuse in recent years. Commonly referred to as "bath salts", these amphetamine-like compounds are sold as "legal" alternatives to illicit drugs such as cocaine, methamphetamine, and 3,4-methylenedioxymethamphetamine (MDMA, ecstasy). Following their dramatic rise in popularity along with numerous reports of toxicity and death, several of these drugs were classified as Schedule I drugs in the United States in 2012. Despite these bans, these drugs and other new structurally similar analogues continue to be abused. Currently, however, it is unknown whether these compounds possess the potential for compulsive use and addiction. The present study sought to determine the relative abuse liability of methylone by employing intravenous self-administration (IVSA) and intracranial self-stimulation (ICSS) paradigms in rats. We demonstrate that methylone (0.05, 0.1, 0.2, and 0.5 mg/kg/infusion) dose-dependently functions as a reinforcer, and that there is a significant positive relationship between methylone dose and reinforcer efficacy. Furthermore, responding during short access sessions (ShA, 2 hr/day) appeared more robust than in previous IVSA studies with MDMA. However, unlike previous findings with abused stimulants such as cocaine or methamphetamine, long access sessions (LgA, 6 hr/day) did not lead to escalated drug intake or increased reinforcer efficacy. Finally, methylone produced a dose-dependent, but statistically non-significant, trend towards reductions in ICSS thresholds. Together these results reveal that methylone may possess an addiction potential similar to or greater than that of MDMA, yet patterns of self-administration and effects on brain reward function suggest that this drug may have a lower potential for abuse and compulsive use than prototypical psychostimulants.

  6. Potent rewarding and reinforcing effects of the synthetic cathinone 3,4-methylenedioxypyrovalerone (MDPV)

    PubMed Central

    Watterson, Lucas R.; Kufahl, Peter R.; Nemirovsky, Natali E.; Sewalia, Kaveish; Grabenauer, Megan; Thomas, Brian F.; Marusich, Julie A.; Wegner, Scott; Olive, M. Foster

    2012-01-01

    Reports of abuse and toxic effects of synthetic cathinones, frequently sold as “bath salts” or “legal highs”, have increased dramatically in recent years. One of the most widely used synthetic cathinones is 3,4-methylenedioxypyrovalerone (MDPV). The current study evaluated the abuse potential of MDPV by assessing its ability to support intravenous self-administration and lower thresholds for intracranial self-stimulation (ICSS) in rats. In the first experiment, rats were trained to intravenously self-administer MDPV in daily 2 hr sessions for 10 days at doses of 0.05, 0.1, or 0.2 mg/kg/infusion. Rats were then allowed to self-administer MDPV under a progressive ratio (PR) schedule of reinforcement. Next, rats self-administered MDPV for an additional 10 days under short (2 hr/day, ShA) or long (6 hr/day, LgA) access conditions to assess escalation of intake. A separate group of rats underwent the same procedures with the exception of self-administering methamphetamine (0.05 mg/kg/infusion) instead of MDPV. In a second experiment, the effects of MDPV on ICSS thresholds following acute administration (0.1, 0.5, 1 and 2 mg/kg i.p.) were assessed. MDPV maintained self-administration across all doses tested. A positive relationship between MDPV dose and breakpoints for reinforcement under PR conditions was observed. LgA conditions led to escalation of drug intake at the 0.1 and 0.2 mg/kg doses, and rats self-administering methamphetamine showed similar patterns of escalation. Finally, MDPV significantly lowered ICSS thresholds at all doses tested. Together, these findings indicate that MDPV has reinforcing properties and activates brain reward circuitry, suggesting a potential for abuse and addiction in humans. PMID:22784198

  7. Potent rewarding and reinforcing effects of the synthetic cathinone 3,4-methylenedioxypyrovalerone (MDPV).

    PubMed

    Watterson, Lucas R; Kufahl, Peter R; Nemirovsky, Natali E; Sewalia, Kaveish; Grabenauer, Megan; Thomas, Brian F; Marusich, Julie A; Wegner, Scott; Olive, M Foster

    2014-03-01

    Reports of abuse and toxic effects of synthetic cathinones, frequently sold as 'bath salts' or 'legal highs', have increased dramatically in recent years. One of the most widely used synthetic cathinones is 3,4-methylenedioxypyrovalerone (MDPV). The current study evaluated the abuse potential of MDPV by assessing its ability to support intravenous self-administration and to lower thresholds for intracranial self-stimulation (ICSS) in rats. In the first experiment, the rats were trained to intravenously self-administer MDPV in daily 2-hour sessions for 10 days at doses of 0.05, 0.1 or 0.2 mg/kg per infusion. The rats were then allowed to self-administer MDPV under a progressive ratio (PR) schedule of reinforcement. Next, the rats self-administered MDPV for an additional 10 days under short access (ShA; 2 hours/day) or long access (LgA; 6 hours/day) conditions to assess escalation of intake. A separate group of rats underwent the same procedures, with the exception of self-administering methamphetamine (0.05 mg/kg per infusion) instead of MDPV. In the second experiment, the effects of MDPV on ICSS thresholds following acute administration (0.1, 0.5, 1 and 2 mg/kg, i.p.) were assessed. MDPV maintained self-administration across all doses tested. A positive relationship between MDPV dose and breakpoints for reinforcement under PR conditions was observed. LgA conditions led to escalation of drug intake at 0.1 and 0.2 mg/kg doses, and rats self-administering methamphetamine showed similar patterns of escalation. Finally, MDPV significantly lowered ICSS thresholds at all doses tested. Together, these findings indicate that MDPV has reinforcing properties and activates brain reward circuitry, suggesting a potential for abuse and addiction in humans. © 2012 The Authors, Addiction Biology © 2012 Society for the Study of Addiction.

  8. Dissociating error-based and reinforcement-based loss functions during sensorimotor learning

    PubMed Central

    McGregor, Heather R.; Mohatarem, Ayman

    2017-01-01

    It has been proposed that the sensorimotor system uses a loss (cost) function to evaluate potential movements in the presence of random noise. Here we test this idea in the context of both error-based and reinforcement-based learning. In a reaching task, we laterally shifted a cursor relative to true hand position using a skewed probability distribution. This skewed probability distribution had its mean and mode separated, allowing us to dissociate the optimal predictions of an error-based loss function (corresponding to the mean of the lateral shifts) and a reinforcement-based loss function (corresponding to the mode). We then examined how the sensorimotor system uses error feedback and reinforcement feedback, in isolation and combination, when deciding where to aim the hand during a reach. We found that participants compensated differently to the same skewed lateral shift distribution depending on the form of feedback they received. When provided with error feedback, participants compensated based on the mean of the skewed noise. When provided with reinforcement feedback, participants compensated based on the mode. Participants receiving both error and reinforcement feedback continued to compensate based on the mean while repeatedly missing the target, despite receiving auditory, visual and monetary reinforcement feedback that rewarded hitting the target. Our work shows that reinforcement-based and error-based learning are separable and can occur independently. Further, when error and reinforcement feedback are in conflict, the sensorimotor system heavily weights error feedback over reinforcement feedback. PMID:28753634
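
    The mean/mode dissociation at the heart of the design is easy to reproduce numerically (the distribution parameters below are illustrative, not the experiment's):

        import numpy as np

        rng = np.random.default_rng(1)
        shifts = rng.gamma(shape=2.0, scale=1.0, size=100_000) - 1.0  # skewed shifts

        mean_shift = shifts.mean()            # optimum under an error-based
                                              # (squared-error) loss function
        hist, edges = np.histogram(shifts, bins=200)
        mode_shift = edges[hist.argmax()]     # optimum under a hit/miss
                                              # (reinforcement) loss function

        # aiming opposite the mean minimizes expected squared error, while
        # aiming opposite the mode maximizes the probability of a target hit
        print(mean_shift, mode_shift)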

  9. Dissociating error-based and reinforcement-based loss functions during sensorimotor learning.

    PubMed

    Cashaback, Joshua G A; McGregor, Heather R; Mohatarem, Ayman; Gribble, Paul L

    2017-07-01

    It has been proposed that the sensorimotor system uses a loss (cost) function to evaluate potential movements in the presence of random noise. Here we test this idea in the context of both error-based and reinforcement-based learning. In a reaching task, we laterally shifted a cursor relative to true hand position using a skewed probability distribution. This skewed probability distribution had its mean and mode separated, allowing us to dissociate the optimal predictions of an error-based loss function (corresponding to the mean of the lateral shifts) and a reinforcement-based loss function (corresponding to the mode). We then examined how the sensorimotor system uses error feedback and reinforcement feedback, in isolation and combination, when deciding where to aim the hand during a reach. We found that participants compensated differently to the same skewed lateral shift distribution depending on the form of feedback they received. When provided with error feedback, participants compensated based on the mean of the skewed noise. When provided with reinforcement feedback, participants compensated based on the mode. Participants receiving both error and reinforcement feedback continued to compensate based on the mean while repeatedly missing the target, despite receiving auditory, visual and monetary reinforcement feedback that rewarded hitting the target. Our work shows that reinforcement-based and error-based learning are separable and can occur independently. Further, when error and reinforcement feedback are in conflict, the sensorimotor system heavily weights error feedback over reinforcement feedback.

  10. Reinforcement learning on slow features of high-dimensional input streams.

    PubMed

    Legenstein, Robert; Wilbert, Niko; Wiskott, Laurenz

    2010-08-19

    Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
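
    The preprocessing stage can be illustrated with plain linear slow feature analysis (the paper uses a hierarchical SFA network; this single linear stage is a simplification):

        import numpy as np

        def linear_sfa(X, n_slow=2):
            """X: (T, d) time series. Whiten the inputs, then keep the
            directions in which the time derivative varies least, i.e. the
            slowest features."""
            X = X - X.mean(axis=0)
            evals, evecs = np.linalg.eigh(np.cov(X.T))
            Z = X @ (evecs / np.sqrt(evals + 1e-8))   # whitened signals
            devals, devecs = np.linalg.eigh(np.cov(np.diff(Z, axis=0).T))
            return Z @ devecs[:, :n_slow]             # slowest components first

        # the slow features then serve as the state input to a simple
        # reward-trained learner, as in the two-stage system described above
        X = np.cumsum(np.random.default_rng(2).normal(size=(500, 10)), axis=0)
        S = linear_sfa(X)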

  11. Reinforcement Learning on Slow Features of High-Dimensional Input Streams

    PubMed Central

    Legenstein, Robert; Wilbert, Niko; Wiskott, Laurenz

    2010-01-01

    Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning. PMID:20808883

  12. Reinforcement learning for robot control

    NASA Astrophysics Data System (ADS)

    Smart, William D.; Pack Kaelbling, Leslie

    2002-02-01

    Writing control code for mobile robots can be a very time-consuming process. Even for apparently simple tasks, it is often difficult to specify in detail how the robot should accomplish them. Robot control code is typically full of magic numbers that must be painstakingly set for each environment that the robot must operate in. The idea of having a robot learn how to accomplish a task, rather than being told explicitly, is an appealing one. It seems easier and much more intuitive for the programmer to specify what the robot should be doing, and let it learn the fine details of how to do it. In this paper, we describe JAQL, a framework for efficient learning on mobile robots, and present the results of using it to learn control policies for simple tasks.

  13. Prediction error in reinforcement learning: a meta-analysis of neuroimaging studies.

    PubMed

    Garrison, Jane; Erdeniz, Burak; Done, John

    2013-08-01

    Activation likelihood estimation (ALE) meta-analyses were used to examine the neural correlates of prediction error in reinforcement learning. The findings are interpreted in the light of current computational models of learning and action selection. In this context, particular consideration is given to the comparison of activation patterns from studies using instrumental and Pavlovian conditioning, and where reinforcement involved rewarding or punishing feedback. The striatum was the key brain area encoding for prediction error, with activity encompassing dorsal and ventral regions for instrumental and Pavlovian reinforcement alike, a finding which challenges the functional separation of the striatum into a dorsal 'actor' and a ventral 'critic'. Prediction error activity was further observed in diverse areas of predominantly anterior cerebral cortex including medial prefrontal cortex and anterior cingulate cortex. Distinct patterns of prediction error activity were found for studies using rewarding and aversive reinforcers; reward prediction errors were observed primarily in the striatum while aversive prediction errors were found more widely including insula and habenula.
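
    For reference, the quantity meta-analyzed here is the standard temporal-difference prediction error, δ_t = r_t + γ·V(s_{t+1}) − V(s_t): the reward received plus the discounted value of the next state, minus the value predicted for the current state. Reward prediction errors arise when outcomes are better than predicted; aversive prediction errors when they are worse.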

  14. Ventral tegmental area neurons in learned appetitive behavior and positive reinforcement.

    PubMed

    Fields, Howard L; Hjelmstad, Gregory O; Margolis, Elyssa B; Nicola, Saleem M

    2007-01-01

    Ventral tegmental area (VTA) neuron firing precedes behaviors elicited by reward-predictive sensory cues and scales with the magnitude and unpredictability of received rewards. These patterns are consistent with roles in the performance of learned appetitive behaviors and in positive reinforcement, respectively. The VTA includes subpopulations of neurons with different afferent connections, neurotransmitter content, and projection targets. Because the VTA and substantia nigra pars compacta are the sole sources of striatal and limbic forebrain dopamine, measurements of dopamine release and manipulations of dopamine function have provided critical evidence supporting a VTA contribution to these functions. However, the VTA also sends GABAergic and glutamatergic projections to the nucleus accumbens and prefrontal cortex. Furthermore, VTA-mediated but dopamine-independent positive reinforcement has been demonstrated. Consequently, identifying the neurotransmitter content and projection target of VTA neurons recorded in vivo will be critical for determining their contribution to learned appetitive behaviors.

  15. Can Traditions Emerge from the Interaction of Stimulus Enhancement and Reinforcement Learning? An Experimental Model

    PubMed Central

    MATTHEWS, LUKE J; PAUKNER, ANNIKA; SUOMI, STEPHEN J

    2010-01-01

    The study of social learning in captivity and behavioral traditions in the wild are two burgeoning areas of research, but few empirical studies have tested how learning mechanisms produce emergent patterns of tradition. Studies have examined how social learning mechanisms that are cognitively complex and possessed by few species, such as imitation, result in traditional patterns, yet traditional patterns are also exhibited by species that may not possess such mechanisms. We propose an explicit model of how stimulus enhancement and reinforcement learning could interact to produce traditions. We tested the model experimentally with tufted capuchin monkeys (Cebus apella), which exhibit traditions in the wild but have rarely demonstrated imitative abilities in captive experiments. Monkeys showed both stimulus enhancement learning and a habitual bias to perform whichever behavior first obtained them a reward. These results support our model that simple social learning mechanisms combined with reinforcement can result in traditional patterns of behavior. PMID:21135912
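
    An explicit toy version of the proposed interaction, with illustrative parameter values (not fitted to the capuchin data): stimulus enhancement adds a transient bonus to the behavior just demonstrated by another individual, and reinforcement makes the first rewarded behavior habitual:

        import math
        import random

        def choose(values, enhanced=None, beta=3.0, bonus=1.0):
            """Softmax choice over behaviors; the behavior just demonstrated
            by another individual gets a temporary enhancement bonus."""
            scores = [v + (bonus if i == enhanced else 0.0)
                      for i, v in enumerate(values)]
            exps = [math.exp(beta * s) for s in scores]
            return random.choices(range(len(values)), weights=exps, k=1)[0]

        def reinforce(values, chosen, rewarded, alpha=0.5):
            # the first behavior that pays off gains value and tends to repeat
            values[chosen] += alpha * ((1.0 if rewarded else 0.0) - values[chosen])
            return values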

  16. Can Traditions Emerge from the Interaction of Stimulus Enhancement and Reinforcement Learning? An Experimental Model.

    PubMed

    Matthews, Luke J; Paukner, Annika; Suomi, Stephen J

    2010-06-01

    The study of social learning in captivity and behavioral traditions in the wild are two burgeoning areas of research, but few empirical studies have tested how learning mechanisms produce emergent patterns of tradition. Studies have examined how social learning mechanisms that are cognitively complex and possessed by few species, such as imitation, result in traditional patterns, yet traditional patterns are also exhibited by species that may not possess such mechanisms. We propose an explicit model of how stimulus enhancement and reinforcement learning could interact to produce traditions. We tested the model experimentally with tufted capuchin monkeys (Cebus apella), which exhibit traditions in the wild but have rarely demonstrated imitative abilities in captive experiments. Monkeys showed both stimulus enhancement learning and a habitual bias to perform whichever behavior first obtained them a reward. These results support our model that simple social learning mechanisms combined with reinforcement can result in traditional patterns of behavior.
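
    In the spirit of the authors' proposal, a toy simulation can make the claimed interaction concrete. The sketch below is a loose illustration under stated assumptions (a two-action apparatus, a fixed salience bonus for the action demonstrators perform, and a fixed bonus for whichever action was first rewarded); it is not the authors' model.

    ```python
    import random

    def choose(salience, habit, options=("lift", "slide")):
        # Stimulus enhancement: weight actions by how salient others made them;
        # habit: extra weight on whichever action was first rewarded (assumed +2).
        weights = []
        for o in options:
            w = 1.0 + salience.get(o, 0.0)
            if habit == o:
                w += 2.0
            weights.append(w)
        return random.choices(options, weights=weights)[0]

    random.seed(0)
    salience = {"lift": 3.0}      # demonstrators mostly manipulate the lid
    habit = None
    for trial in range(20):
        action = choose(salience, habit)
        rewarded = (action == "lift")        # assume lifting opens the apparatus
        if rewarded and habit is None:
            habit = action                   # first rewarded behaviour sticks
    print("traditional behaviour:", habit)   # 'lift' propagates as the tradition
    ```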

  17. Extraversion differentiates between model-based and model-free strategies in a reinforcement learning task.

    PubMed

    Skatova, Anya; Chan, Patricia A; Daw, Nathaniel D

    2013-01-01

    Prominent computational models describe a neural mechanism for learning from reward prediction errors, and it has been suggested that variations in this mechanism are reflected in personality factors such as trait extraversion. However, although trait extraversion has been linked to improved reward learning, it is not yet known whether this relationship is selective for the particular computational strategy associated with error-driven learning, known as model-free reinforcement learning, vs. another strategy, model-based learning, which the brain is also known to employ. In the present study we test this relationship by examining whether humans' scores on an extraversion scale predict individual differences in the balance between model-based and model-free learning strategies in a sequentially structured decision task designed to distinguish between them. In previous studies with this task, participants have shown a combination of both types of learning, but with substantial individual variation in the balance between them. In the current study, extraversion predicted worse behavior across both sorts of learning. However, the hypothesis that extraverts would be selectively better at model-free reinforcement learning held up among a subset of the more engaged participants, and overall, higher task engagement was associated with a more selective pattern by which extraversion predicted better model-free learning. The findings indicate a relationship between a broad personality orientation and detailed computational learning mechanisms. Results like those in the present study suggest an intriguing and rich relationship between core neuro-computational mechanisms and broader life orientations and outcomes.
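
    For readers unfamiliar with the distinction, the sketch below shows the standard hybrid valuation that studies using this two-step task typically fit: a weight w mixes model-based and model-free action values. All numbers here are invented for illustration.

    ```python
    # Hybrid valuation: w = 1 is purely model-based, w = 0 purely model-free.
    w = 0.6                                  # balance parameter, fit per participant
    q_mf = {"left": 0.40, "right": 0.55}     # cached, experience-based values
    q_mb = {"left": 0.70, "right": 0.30}     # values computed from a task model

    q_net = {a: w * q_mb[a] + (1 - w) * q_mf[a] for a in q_mf}
    print(q_net)  # 'left' wins under this w; a model-free agent (w = 0) would pick 'right'
    ```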

  18. Reinforcement learning based artificial immune classifier.

    PubMed

    Karakose, Mehmet

    2013-01-01

    Artificial immune systems are among the widely used methods for classification, a decision-making process. Artificial immune systems, which are based on the natural immune system, can be successfully applied to classification, optimization, recognition, and learning in real-world problems. In this study, a reinforcement learning based artificial immune classifier is proposed as a new approach. This approach uses reinforcement learning to find better antibodies with immune operators. The proposed approach offers several advantages over other methods in the literature, such as effectiveness, fewer memory cells, high accuracy, speed, and data adaptability. The performance of the proposed approach is demonstrated by simulation and experimental results using real data in Matlab and FPGA. Some benchmark data and remote image data are used for the experimental results. Comparative results with supervised/unsupervised artificial immune systems, a negative selection classifier, and a resource-limited artificial immune classifier are given to demonstrate the effectiveness of the proposed method.

  19. Contingency learning in alcohol dependence and pathological gambling: learning and unlearning reward contingencies.

    PubMed

    Vanes, Lucy D; van Holst, Ruth J; Jansen, Jochem M; van den Brink, Wim; Oosterlaan, Jaap; Goudriaan, Anna E

    2014-06-01

    Patients with alcohol dependence (AD) and pathological gambling (PG) are characterized by dysfunctional reward processing and their ability to adapt to alterations of reward contingencies is impaired. However, most neurocognitive tasks investigating reward processing involve a complex mix of elements, such as working memory, immediate and delayed rewards, and risk-taking. As a consequence, it is not clear whether contingency learning is altered in AD or PG. Therefore, the current study aimed to examine performance in a deterministic contingency learning task, investigating discrimination, reversal, and extinction learning. Thirty-three alcohol-dependent patients (ADs), 28 pathological gamblers (PGs), and 18 healthy controls (HCs) performed a contingency learning task in which they learned stimulus-reward associations that were first reversed and later extinguished while receiving deterministic feedback throughout. Accumulated points, number of perseverative errors and trials required to reach a criterion in each learning phase were compared between groups using nonparametric Kruskal-Wallis rank-sum tests. Regression analyses were performed to compare learning curves. PGs and ADs did not differ from HCs in discrimination learning, reversal learning, or extinction learning on the nonparametric tests. Regression analyses, however, showed differences in the initial speed of learning: PGs were significantly faster in discrimination learning compared to ADs, and both PGs and ADs learned more slowly than HCs in the reversal learning and extinction phases of the task. Learning rates for reversal and extinction were slower for the AD and PG groups compared to HCs, suggesting that reversing and extinguishing learned contingencies require more effort in ADs and PGs. This implies diminished flexibility in overcoming previously learned contingencies.

  20. Appetitive olfactory learning and memory in the honeybee depend on sugar reward identity.

    PubMed

    Simcock, Nicola K; Gray, Helen; Bouchebti, Sofia; Wright, Geraldine A

    2017-08-24

    One of the most important tasks of the brain is to learn and remember information associated with food. Studies in mice and Drosophila have shown that sugar rewards must be metabolisable to form lasting memories, but few other animals have been studied. Here, we trained adult worker honeybees (Apis mellifera) in two olfactory tasks (massed and spaced conditioning) known to affect memory formation to test how the schedule of reinforcement and the nature of a sugar reward affected learning and memory. The antennae and mouthparts of honeybees were most sensitive to sucrose, but glucose and fructose were equally phagostimulatory. Whether or not bees could learn the tasks depended on sugar identity and concentration. However, only bees rewarded with glucose or sucrose formed robust long-term memory. This was true for bees trained in both the massed and spaced conditioning tasks. Honeybees fed with glucose or fructose exhibited a surge in haemolymph sugar of greater than 120 mM within 30 s that remained elevated for as long as 20 min after a single feeding event. For bees fed with sucrose, this change in haemolymph glucose and fructose occurred with a 30 s delay. Our data showed that olfactory learning in honeybees was affected by sugar identity and concentration, but that olfactory memory was most strongly affected by sugar identity. Taken together, these data suggest that the neural mechanisms involved in memory formation sense rapid changes in haemolymph glucose that occur during and after conditioning.

  1. Online reinforcement learning for dynamic multimedia systems.

    PubMed

    Mastronarde, Nicholas; van der Schaar, Mihaela

    2010-02-01

    In our previous work, we proposed a systematic cross-layer framework for dynamic multimedia systems, which allows each layer to make autonomous and foresighted decisions that maximize the system's long-term performance, while meeting the application's real-time delay constraints. The proposed solution solved the cross-layer optimization offline, under the assumption that the multimedia system's probabilistic dynamics were known a priori, by modeling the system as a layered Markov decision process. In practice, however, these dynamics are unknown a priori and, therefore, must be learned online. In this paper, we address this problem by allowing the multimedia system layers to learn, through repeated interactions with each other, to autonomously optimize the system's long-term performance at run-time. The two key challenges in this layered learning setting are: (i) each layer's learning performance is directly impacted by not only its own dynamics, but also by the learning processes of the other layers with which it interacts; and (ii) selecting a learning model that appropriately balances time-complexity (i.e., learning speed) with the multimedia system's limited memory and the multimedia application's real-time delay constraints. We propose two reinforcement learning algorithms for optimizing the system under different design constraints: the first algorithm solves the cross-layer optimization in a centralized manner and the second solves it in a decentralized manner. We analyze both algorithms in terms of their required computation, memory, and interlayer communication overheads. After noting that the proposed reinforcement learning algorithms learn too slowly, we introduce a complementary accelerated learning algorithm that exploits partial knowledge about the system's dynamics in order to dramatically improve the system's performance. In our experiments, we demonstrate that decentralized learning can perform equally as well as centralized learning, while

  2. Functional polymorphism of the mu-opioid receptor gene (OPRM1) influences reinforcement learning in humans.

    PubMed

    Lee, Mary R; Gallen, Courtney L; Zhang, Xiaochu; Hodgkinson, Colin A; Goldman, David; Stein, Elliot A; Barr, Christina S

    2011-01-01

    Previous reports on the functional effects (i.e., gain or loss of function), and phenotypic outcomes (e.g., changes in addiction vulnerability and stress response) of a commonly occurring functional single nucleotide polymorphism (SNP) of the mu-opioid receptor (OPRM1 A118G) have been inconsistent. Here we examine the effect of this polymorphism on implicit reward learning. We used a probabilistic signal detection task to determine whether this polymorphism impacts response bias to monetary reward in 63 healthy adult subjects: 51 AA homozygotes and 12 G allele carriers. OPRM1 AA homozygotes exhibited typical responding to the rewarded response--that is, their bias to the rewarded stimulus increased over time. However, OPRM1 G allele carriers exhibited a decline in response to the rewarded stimulus compared to the AA homozygotes. These results extend previous reports on the heritability of performance on this task by implicating a specific polymorphism. Through comparison with other studies using this task, we suggest a possible mechanism by which the OPRM1 polymorphism may confer reduced response to natural reward through a dopamine-mediated decrease during positive reinforcement learning.

  3. Fuzzy Q-Learning for Generalization of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1996-01-01

    Fuzzy Q-Learning, introduced earlier by the author, is an extension of Q-Learning into fuzzy environments. GARIC is a methodology for fuzzy reinforcement learning. In this paper, we introduce GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning and at the local level, each agent learns and operates based on GARIC. GARIC-Q improves the speed and applicability of Fuzzy Q-Learning through generalization of input space by using fuzzy rules and bridges the gap between Q-Learning and rule based intelligent systems.

  5. fMRI and EEG predictors of dynamic decision parameters during human reinforcement learning.

    PubMed

    Frank, Michael J; Gagne, Chris; Nyhus, Erika; Masters, Sean; Wiecki, Thomas V; Cavanagh, James F; Badre, David

    2015-01-14

    What are the neural dynamics of choice processes during reinforcement learning? Two largely separate literatures have examined dynamics of reinforcement learning (RL) as a function of experience but assuming a static choice process, or conversely, the dynamics of choice processes in decision making but based on static decision values. Here we show that human choice processes during RL are well described by a drift diffusion model (DDM) of decision making in which the learned trial-by-trial reward values are sequentially sampled, with a choice made when the value signal crosses a decision threshold. Moreover, simultaneous fMRI and EEG recordings revealed that this decision threshold is not fixed across trials but varies as a function of activity in the subthalamic nucleus (STN) and is further modulated by trial-by-trial measures of decision conflict and activity in the dorsomedial frontal cortex (pre-SMA BOLD and mediofrontal theta in EEG). These findings provide converging multimodal evidence for a model in which decision threshold in reward-based tasks is adjusted as a function of communication from pre-SMA to STN when choices differ subtly in reward values, allowing more time to choose the statistically more rewarding option.

  6. fMRI and EEG Predictors of Dynamic Decision Parameters during Human Reinforcement Learning

    PubMed Central

    Gagne, Chris; Nyhus, Erika; Masters, Sean; Wiecki, Thomas V.; Cavanagh, James F.; Badre, David

    2015-01-01

    What are the neural dynamics of choice processes during reinforcement learning? Two largely separate literatures have examined dynamics of reinforcement learning (RL) as a function of experience but assuming a static choice process, or conversely, the dynamics of choice processes in decision making but based on static decision values. Here we show that human choice processes during RL are well described by a drift diffusion model (DDM) of decision making in which the learned trial-by-trial reward values are sequentially sampled, with a choice made when the value signal crosses a decision threshold. Moreover, simultaneous fMRI and EEG recordings revealed that this decision threshold is not fixed across trials but varies as a function of activity in the subthalamic nucleus (STN) and is further modulated by trial-by-trial measures of decision conflict and activity in the dorsomedial frontal cortex (pre-SMA BOLD and mediofrontal theta in EEG). These findings provide converging multimodal evidence for a model in which decision threshold in reward-based tasks is adjusted as a function of communication from pre-SMA to STN when choices differ subtly in reward values, allowing more time to choose the statistically more rewarding option. PMID:25589744
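
    A minimal sequential-sampling sketch (all parameters invented) illustrates the role the decision threshold plays in this account: raising it trades speed for accuracy, which is the adjustment the study attributes to pre-SMA/STN communication on high-conflict trials.

    ```python
    import random

    def ddm_trial(value_diff, threshold, noise=1.0, dt=0.01):
        # Noisy evidence accumulates at a rate set by the value difference
        # until it crosses +threshold (option A) or -threshold (option B).
        x, t = 0.0, 0.0
        while abs(x) < threshold:
            x += value_diff * dt + random.gauss(0.0, noise) * dt ** 0.5
            t += dt
        return ("A" if x > 0 else "B"), t

    random.seed(1)
    for thr in (0.5, 2.0):                   # low vs raised decision threshold
        choice, rt = ddm_trial(value_diff=0.2, threshold=thr)
        print(f"threshold={thr}: chose {choice} in {rt:.2f}s")
    ```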

  7. Neural correlates of reinforcement learning and social preferences in competitive bidding.

    PubMed

    van den Bos, Wouter; Talwar, Arjun; McClure, Samuel M

    2013-01-30

    In competitive social environments, people often deviate from what rational choice theory prescribes, resulting in losses or suboptimal monetary gains. We investigate how competition affects learning and decision-making in a common value auction task. During the experiment, groups of five human participants were simultaneously scanned using MRI while playing the auction task. We first demonstrate that bidding is well characterized by reinforcement learning with biased reward representations dependent on social preferences. Indicative of reinforcement learning, we found that estimated trial-by-trial prediction errors correlated with activity in the striatum and ventromedial prefrontal cortex. Additionally, we found that individual differences in social preferences were related to activity in the temporal-parietal junction and anterior insula. Connectivity analyses suggest that monetary and social value signals are integrated in the ventromedial prefrontal cortex and striatum. Based on these results, we argue for a novel mechanistic account for the integration of reinforcement history and social preferences in competitive decision-making.

  8. Role of CB2 Cannabinoid Receptors in the Rewarding, Reinforcing, and Physical Effects of Nicotine

    PubMed Central

    Navarrete, Francisco; Rodríguez-Arias, Marta; Martín-García, Elena; Navarro, Daniela; García-Gutiérrez, María S; Aguilar, María A; Aracil-Fernández, Auxiliadora; Berbel, Pere; Miñarro, José; Maldonado, Rafael; Manzanares, Jorge

    2013-01-01

    This study aimed to evaluate the involvement of CB2 cannabinoid receptors (CB2r) in the rewarding, reinforcing and motivational effects of nicotine. Conditioned place preference (CPP) and intravenous self-administration experiments were carried out in knockout mice lacking CB2r (CB2KO) and wild-type (WT) littermates treated with the CB2r antagonist AM630 (1 and 3 mg/kg). Gene expression analyses of tyrosine hydroxylase (TH) and α3- and α4-nicotinic acetylcholine receptor subunits (nAChRs) in the ventral tegmental area (VTA) and immunohistochemical studies to elucidate whether CB2r colocalized with α3- and α4-nAChRs in the nucleus accumbens and VTA were performed. Mecamylamine-precipitated withdrawal syndrome after chronic nicotine exposure was evaluated in CB2KO mice and WT mice treated with AM630 (1 and 3 mg/kg). CB2KO mice did not show nicotine-induced place conditioning and self-administered significantly less nicotine. In addition, AM630 was able to block (3 mg/kg) nicotine-induced CPP and reduce (1 and 3 mg/kg) nicotine self-administration. Under baseline conditions, TH, α3-nAChR, and α4-nAChR mRNA levels in the VTA of CB2KO mice were significantly lower compared with WT mice. Confocal microscopy images revealed that CB2r colocalized with α3- and α4-nAChRs. Somatic signs of nicotine withdrawal (rearings, groomings, scratches, teeth chattering, and body tremors) increased significantly in WT but were absent in CB2KO mice. Interestingly, the administration of AM630 blocked the nicotine withdrawal syndrome and failed to alter basal behavior in saline-treated WT mice. These results suggest that CB2r play a relevant role in the rewarding, reinforcing, and motivational effects of nicotine. Pharmacological manipulation of this receptor deserves further consideration as a potential new valuable target for the treatment of nicotine dependence. PMID:23817165

  9. Multi-Agent Reinforcement Learning and Adaptive Neural Networks.

    DTIC Science & Technology

    2007-11-02

    The objective was to study the utility of reinforcement learning as an approach to complex decentralized control problems. The major accomplishment was a detailed study of multi-agent reinforcement learning applied to a large-scale decentralized stochastic control problem. This study included a very successful demonstration that a multi-agent reinforcement learning system using neural networks could learn high-performance

  10. Reinforcement Learning Models and Their Neural Correlates: An Activation Likelihood Estimation Meta-Analysis

    PubMed Central

    Kumar, Poornima; Eickhoff, Simon B.; Dombrovski, Alexandre Y.

    2015-01-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments – prediction error – is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies suggest that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that employed algorithmic reinforcement learning models, across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, while instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually-estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies. PMID:25665667

  11. Reinforcement learning models and their neural correlates: An activation likelihood estimation meta-analysis.

    PubMed

    Chase, Henry W; Kumar, Poornima; Eickhoff, Simon B; Dombrovski, Alexandre Y

    2015-06-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments-prediction error-is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies have suggested that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that had employed algorithmic reinforcement learning models across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, whereas instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies.

  12. Learning and generalization from reward and punishment in opioid addiction.

    PubMed

    Myers, Catherine E; Rego, Janice; Haber, Paul; Morley, Kirsten; Beck, Kevin D; Hogarth, Lee; Moustafa, Ahmed A

    2017-01-15

    This study adapts a widely-used acquired equivalence paradigm to investigate how opioid-addicted individuals learn from positive and negative feedback, and how they generalize this learning. The opioid-addicted group consisted of 33 participants with a history of heroin dependency currently in a methadone maintenance program; the control group consisted of 32 healthy participants without a history of drug addiction. All participants performed a novel variant of the acquired equivalence task, where they learned to map some stimuli to correct outcomes in order to obtain reward, and to map other stimuli to correct outcomes in order to avoid punishment; some stimuli were implicitly "equivalent" in the sense of being paired with the same outcome. On the initial training phase, both groups performed similarly on learning to obtain reward, but as memory load grew, the control group outperformed the addicted group on learning to avoid punishment. On a subsequent testing phase, the addicted and control groups performed similarly on retention trials involving previously-trained stimulus-outcome pairs, as well as on generalization trials to assess acquired equivalence. Since prior work with acquired equivalence tasks has associated stimulus-outcome learning with the nigrostriatal dopamine system, and generalization with the hippocampal region, the current results are consistent with basal ganglia dysfunction in the opioid-addicted patients. Further, a selective deficit in learning from punishment could contribute to processes by which addicted individuals continue to pursue drug use even at the cost of negative consequences such as loss of income and the opportunity to engage in other life activities.

  13. The effects of aging on the interaction between reinforcement learning and attention.

    PubMed

    Radulescu, Angela; Daniel, Reka; Niv, Yael

    2016-11-01

    Reinforcement learning (RL) in complex environments relies on selective attention to uncover those aspects of the environment that are most predictive of reward. Whereas previous work has focused on age-related changes in RL, it is not known whether older adults learn differently from younger adults when selective attention is required. In 2 experiments, we examined how aging affects the interaction between RL and selective attention. Younger and older adults performed a learning task in which only 1 stimulus dimension was relevant to predicting reward, and within it, 1 "target" feature was the most rewarding. Participants had to discover this target feature through trial and error. In Experiment 1, stimuli varied on 1 or 3 dimensions and participants received hints that revealed the target feature, the relevant dimension, or gave no information. Group-related differences in accuracy and RTs differed systematically as a function of the number of dimensions and the type of hint available. In Experiment 2 we used trial-by-trial computational modeling of the learning process to test for age-related differences in learning strategies. Behavior of both young and older adults was explained well by a reinforcement-learning model that uses selective attention to constrain learning. However, the model suggested that older adults restricted their learning to fewer features, employing more focused attention than younger adults. Furthermore, this difference in strategy predicted age-related deficits in accuracy. We discuss these results, suggesting that a narrower attentional filter may reflect an adaptation to the reduced capabilities of the reinforcement learning system in older adults.
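
    The kind of attention-constrained learning model described here can be sketched in a few lines; the feature names, attention weights, and learning rate below are assumptions for illustration, not the authors' fitted parameters.

    ```python
    alpha = 0.3
    attention = {"colour": 0.7, "shape": 0.2, "texture": 0.1}  # a narrow filter
    values = {f: 0.0 for f in attention}

    def learn(reward):
        v = sum(attention[f] * values[f] for f in attention)   # weighted value
        pe = reward - v                                        # prediction error
        for f in attention:
            values[f] += alpha * attention[f] * pe             # attention gates learning

    for _ in range(50):
        learn(reward=1.0)
    print({f: round(x, 2) for f, x in values.items()})  # attended features learn most
    ```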

  14. Reinforcement learning, spike-time-dependent plasticity, and the BCM rule.

    PubMed

    Baras, Dorit; Meir, Ron

    2007-08-01

    Learning agents, whether natural or artificial, must update their internal parameters in order to improve their behavior over time. In reinforcement learning, this plasticity is influenced by an environmental signal, termed a reward, that directs the changes in appropriate directions. We apply a recently introduced policy learning algorithm from machine learning to networks of spiking neurons and derive a spike-time-dependent plasticity rule that ensures convergence to a local optimum of the expected average reward. The approach is applicable to a broad class of neuronal models, including the Hodgkin-Huxley model. We demonstrate the effectiveness of the derived rule in several toy problems. Finally, through statistical analysis, we show that the synaptic plasticity rule established is closely related to the widely used BCM rule, for which good biological evidence exists.
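
    The flavour of the derived rule can be conveyed by the simplest policy-gradient learner: a single Bernoulli-logistic unit whose parameter moves along (reward - baseline) times the score function. This toy omits spiking dynamics entirely, and every constant is an assumption.

    ```python
    import math
    import random

    theta, eta, baseline = 0.0, 0.2, 0.0
    random.seed(0)
    for trial in range(500):
        p = 1.0 / (1.0 + math.exp(-theta))       # probability of emitting action 1
        a = 1 if random.random() < p else 0
        r = 1.0 if a == 1 else 0.0               # toy task: action 1 is rewarded
        theta += eta * (r - baseline) * (a - p)  # REINFORCE step; (a - p) is the score
        baseline += 0.01 * (r - baseline)        # running-average reward baseline
    print(round(1.0 / (1.0 + math.exp(-theta)), 2))   # p(action 1) ends near 1
    ```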

  15. Reward-dependent learning in neuronal networks for planning and decision making.

    PubMed

    Dehaene, S; Changeux, J P

    2000-01-01

    Neuronal network models have been proposed for the organization of evaluation and decision processes in prefrontal circuitry and their putative neuronal and molecular bases. The models all include an implementation and simulation of an elementary reward mechanism. Their central hypothesis is that tentative rules of behavior, which are coded by clusters of active neurons in prefrontal cortex, are selected or rejected based on an evaluation by this reward signal, which may be conveyed, for instance, by the mesencephalic dopaminergic neurons with which the prefrontal cortex is densely interconnected. At the molecular level, the reward signal is postulated to be a neurotransmitter such as dopamine, which exerts a global modulatory action on prefrontal synaptic efficacies, either via volume transmission or via targeted synaptic triads. Negative reinforcement has the effect of destabilizing the currently active rule-coding clusters; subsequently, spontaneous activity varies again from one cluster to another, giving the organism the chance to discover and learn a new rule. Thus, reward signals function as effective selection signals that either maintain or suppress currently active prefrontal representations as a function of their current adequacy. Simulations of this variation-selection have successfully accounted for the main features of several major tasks that depend on prefrontal cortex integrity, such as the delayed-response test, the Wisconsin card sorting test, the Tower of London test and the Stroop test. For the more complex tasks, we have found it necessary to supplement the external reward input with a second mechanism that supplies an internal reward; it consists of an auto-evaluation loop which short-circuits the reward input from the exterior. This allows for an internal evaluation of covert motor intentions without actualizing them as behaviors, by simply testing them covertly by comparison with memorized former experiences. This element of architecture

  16. Hemispheric Asymmetries in Striatal Reward Responses Relate to Approach-Avoidance Learning and Encoding of Positive-Negative Prediction Errors in Dopaminergic Midbrain Regions.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly C; Schwartz, Sophie

    2015-10-28

    Some individuals are better at learning about rewarding situations, whereas others are inclined to avoid punishments (i.e., enhanced approach or avoidance learning, respectively). In reinforcement learning, action values are increased when outcomes are better than predicted (positive prediction errors [PEs]) and decreased for worse than predicted outcomes (negative PEs). Because actions with high and low values are approached and avoided, respectively, individual differences in the neural encoding of PEs may influence the balance between approach-avoidance learning. Recent correlational approaches also indicate that biases in approach-avoidance learning involve hemispheric asymmetries in dopamine function. However, the computational and neural mechanisms underpinning such learning biases remain unknown. Here we assessed hemispheric reward asymmetry in striatal activity in 34 human participants who performed a task involving rewards and punishments. We show that the relative difference in reward response between hemispheres relates to individual biases in approach-avoidance learning. Moreover, using a computational modeling approach, we demonstrate that better encoding of positive (vs negative) PEs in dopaminergic midbrain regions is associated with better approach (vs avoidance) learning, specifically in participants with larger reward responses in the left (vs right) ventral striatum. Thus, individual dispositions or traits may be determined by neural processes acting to constrain learning about specific aspects of the world.
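
    The asymmetry described here is usually captured with separate learning rates for positive and negative prediction errors, as in the minimal sketch below (all numbers invented).

    ```python
    # Asymmetric delta rule: alpha_pos > alpha_neg produces an approach-biased
    # (optimistic) learner; the reverse produces an avoidance-biased one.
    def update(q, reward, alpha_pos=0.3, alpha_neg=0.1):
        pe = reward - q
        return q + (alpha_pos if pe > 0 else alpha_neg) * pe

    q = 0.0
    for r in [1, 0, 1, 1, 0, 1]:   # hypothetical win/loss sequence (win rate 0.67)
        q = update(q, r)
    print(round(q, 3))             # 0.705: value sits above the true win rate
    ```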

  17. Reinforcement learning in professional basketball players

    PubMed Central

    Neiman, Tal; Loewenstein, Yonatan

    2011-01-01

    Reinforcement learning in complex natural environments is a challenging task because the agent should generalize from the outcomes of actions taken in one state of the world to future actions in different states of the world. The extent to which human experts find the proper level of generalization is unclear. Here we show, using the sequences of field goal attempts made by professional basketball players, that the outcome of even a single field goal attempt has a considerable effect on the rate of subsequent 3-point shot attempts, in line with standard models of reinforcement learning. However, this change in behaviour is associated with negative correlations between the outcomes of successive field goal attempts. These results indicate that despite years of experience and high motivation, professional players overgeneralize from the outcomes of their most recent actions, which leads to decreased performance. PMID:22146388

  18. The cerebellum: a neural system for the study of reinforcement learning.

    PubMed

    Swain, Rodney A; Kerr, Abigail L; Thompson, Richard F

    2011-01-01

    In its strictest application, the term "reinforcement learning" refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Given that Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed), any neural system that supports reinforcement learning must also be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning.

  19. Optimal chaos control through reinforcement learning.

    PubMed

    Gadaleta, Sabino; Dangelmayr, Gerhard

    1999-09-01

    A general purpose chaos control algorithm based on reinforcement learning is introduced and applied to the stabilization of unstable periodic orbits in various chaotic systems and to the targeting problem. The algorithm requires no information about the dynamical system or about the location of periodic orbits. Numerical tests demonstrate good and fast performance under noisy and nonstationary conditions.

  20. Embedded Incremental Feature Selection for Reinforcement Learning

    DTIC Science & Technology

    2012-05-01

    policy by a problem-specific fitness function. The composition of the selected subset is reported in terms of the fraction of relevant features among selected features. In Figure 4b we see the composition of the selected subsets by the three algorithms; IFSE-NEAT clearly has the highest percentage of relevant features.

  1. The Function of Direct and Vicarious Reinforcement in Human Learning.

    ERIC Educational Resources Information Center

    Owens, Carl R.; And Others

    The role of reinforcement has long been an issue in learning theory. The effects of reinforcement in learning were investigated under circumstances which made the information necessary for correct performance equally available to reinforced and nonreinforced subjects. Fourth graders (N=36) were given a pre-test of 20 items from the Peabody Picture…

  2. Separating the effect of reward from corrective feedback during learning in patients with Parkinson's disease.

    PubMed

    Freedberg, Michael; Schacherer, Jonathan; Chen, Kuan-Hua; Uc, Ergun Y; Narayanan, Nandakumar S; Hazeltine, Eliot

    2017-04-10

    Parkinson's disease (PD) is associated with procedural learning deficits. Nonetheless, studies have demonstrated that reward-related learning is comparable between patients with PD and controls (Bódi et al., Brain, 132(9), 2385-2395, 2009; Frank, Seeberger, & O'Reilly, Science, 306(5703), 1940-1943, 2004; Palminteri et al., Proceedings of the National Academy of Sciences of the United States of America, 106(45), 19179-19184, 2009). However, because these studies do not separate the effect of reward from the effect of practice, it is difficult to determine whether the effect of reward on learning is distinct from the effect of corrective feedback on learning. Thus, it is unknown whether these group differences in learning are due to reward processing or learning in general. Here, we compared the performance of medicated PD patients to demographically matched healthy controls (HCs) on a task where the effect of reward can be examined separately from the effect of practice. We found that patients with PD showed significantly less reward-related learning improvements compared to HCs. In addition, stronger learning of rewarded associations over unrewarded associations was significantly correlated with smaller skin-conductance responses for HCs but not PD patients. These results demonstrate that when separating the effect of reward from the effect of corrective feedback, PD patients do not benefit from reward.

  3. FMRQ-A Multiagent Reinforcement Learning Algorithm for Fully Cooperative Tasks.

    PubMed

    Zhang, Zhen; Zhao, Dongbin; Gao, Junwei; Wang, Dongqing; Dai, Yujie

    2016-04-14

    In this paper, we propose a multiagent reinforcement learning algorithm dealing with fully cooperative tasks. The algorithm is called frequency of the maximum reward Q-learning (FMRQ). FMRQ aims to achieve one of the optimal Nash equilibria so as to optimize the performance index in multiagent systems. The frequency of obtaining the highest global immediate reward, instead of the immediate reward itself, is used as the reinforcement signal. With FMRQ each agent does not need to observe the other agents' actions and only shares its state and reward at each step. We validate FMRQ through case studies of repeated games: four cases of two-player two-action games and one case of a three-player two-action game. It is demonstrated that FMRQ can converge to one of the optimal Nash equilibria in these cases. Moreover, comparison experiments on tasks with multiple states and finite steps are conducted. One is a box-pushing task and the other is a distributed sensor network problem. Experimental results show that the proposed algorithm outperforms the compared methods.
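
    The abstract gives only the signal's definition, so the following is a loose toy rendering (not the published algorithm): each agent scores its own actions by how often they coincided with the maximal global reward, rather than by the reward value itself. The 2x2 cooperative game and all constants are invented.

    ```python
    import random

    payoff = {("a", "a"): 10, ("a", "b"): 0, ("b", "a"): 0, ("b", "b"): 5}
    max_reward = max(payoff.values())
    counts = [{"a": [0, 0], "b": [0, 0]} for _ in range(2)]   # [max-hits, plays]

    def greedy(agent):
        freq = {k: (h / p if p else 0.0) for k, (h, p) in counts[agent].items()}
        return max(freq, key=freq.get)

    random.seed(3)
    for episode in range(2000):
        acts = [random.choice("ab") if random.random() < 0.1 else greedy(i)
                for i in range(2)]
        r = payoff[tuple(acts)]
        for i, a in enumerate(acts):
            counts[i][a][1] += 1                      # times the action was played
            counts[i][a][0] += int(r == max_reward)   # frequency-of-max signal
    print(greedy(0), greedy(1))   # both agents settle on 'a', the optimal joint action
    ```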

  4. Reinforcement learning design for cancer clinical trials

    PubMed Central

    Zhao, Yufan; Kosorok, Michael R.; Zeng, Donglin

    2009-01-01

    Summary We develop reinforcement learning trials for discovering individualized treatment regimens for life-threatening diseases such as cancer. A temporal-difference learning method called Q-learning is utilized which involves learning an optimal policy from a single training set of finite longitudinal patient trajectories. Approximating the Q-function with time-indexed parameters can be achieved by using support vector regression or extremely randomized trees. Within this framework, we demonstrate that the procedure can extract optimal strategies directly from clinical data without relying on the identification of any accurate mathematical models, unlike approaches based on adaptive design. We show that reinforcement learning has tremendous potential in clinical research because it can select actions that improve outcomes by taking into account delayed effects even when the relationship between actions and outcomes is not fully known. To support our claims, the methodology's practical utility is illustrated in a simulation analysis. In the immediate future, we will apply this general strategy to studying and identifying new treatments for advanced metastatic stage IIIB/IV non-small cell lung cancer, which usually includes multiple lines of chemotherapy treatment. Moreover, there is significant potential of the proposed methodology for developing personalized treatment strategies in other cancers, in cystic fibrosis, and in other life-threatening diseases. PMID:19750510
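
    A minimal sketch of the fitted-Q idea the summary names, using extremely randomized trees as the Q-function approximator: the snippet fits a single decision stage from synthetic logged data (the real method works backward over multiple treatment stages). The data-generating story and every constant are invented.

    ```python
    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    rng = np.random.default_rng(0)
    n = 500
    states = rng.normal(size=(n, 3))        # e.g., hypothetical patient covariates
    actions = rng.integers(0, 2, size=n)    # two candidate treatments
    # Hypothetical outcome: treatment 1 helps only when the first covariate is high.
    rewards = (actions == (states[:, 0] > 0)).astype(float) + 0.1 * rng.normal(size=n)

    X = np.column_stack([states, actions])
    q_model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, rewards)

    def best_action(state):
        # Greedy policy: evaluate Q(s, a) for each action and take the argmax.
        q = [q_model.predict(np.r_[state, a].reshape(1, -1))[0] for a in (0, 1)]
        return int(np.argmax(q))

    print(best_action(np.array([1.5, 0.0, 0.0])))    # expected: 1
    print(best_action(np.array([-1.5, 0.0, 0.0])))   # expected: 0
    ```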

  5. The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces

    PubMed Central

    Huertas, Marco A.; Schwettmann, Sarah E.; Shouval, Harel Z.

    2016-01-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different than those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for

  6. The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces.

    PubMed

    Huertas, Marco A; Schwettmann, Sarah E; Shouval, Harel Z

    2016-01-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different than those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for
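
    The two-trace idea can be rendered as a toy computation: LTP- and LTD-tagged traces decay at different rates, and the weight change at reward delivery is their gain-weighted difference. All constants below are illustrative assumptions, not the published model's values.

    ```python
    w = 0.5
    trace_ltp, trace_ltd = 0.0, 0.0
    decay_ltp, decay_ltd = 0.9, 0.7    # LTP trace assumed to outlast the LTD trace
    gain_ltp, gain_ltd = 1.0, 0.8

    for t in range(10):
        if t == 0:                     # pre/post coincidence tags both traces
            trace_ltp += 1.0
            trace_ltd += 1.0
        trace_ltp *= decay_ltp
        trace_ltd *= decay_ltd
        if t == 5:                     # delayed reinforcement (neuromodulator) arrives
            w += gain_ltp * trace_ltp - gain_ltd * trace_ltd
    print(round(w, 3))                 # 0.937: net potentiation despite the delay
    ```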

  7. Motivational neural circuits underlying reinforcement learning.

    PubMed

    Averbeck, Bruno B; Costa, Vincent D

    2017-03-29

    Reinforcement learning (RL) is the behavioral process of learning the values of actions and objects. Most models of RL assume that the dopaminergic prediction error signal drives plasticity in frontal-striatal circuits. The striatum then encodes value representations that drive decision processes. However, the amygdala has also been shown to play an important role in forming Pavlovian stimulus-outcome associations. These Pavlovian associations can drive motivated behavior via the amygdala projections to the ventral striatum or the ventral tegmental area. The amygdala may, therefore, play a central role in RL. Here we compare the contributions of the amygdala and the striatum to RL and show that both the amygdala and striatum learn and represent expected values in RL tasks. Furthermore, value representations in the striatum may be inherited, to some extent, from the amygdala. The striatum may, therefore, play less of a primary role in learning stimulus-outcome associations in RL than previously suggested.

  8. Reinforcement learning and dopamine in schizophrenia: dimensions of symptoms or specific features of a disease group?

    PubMed

    Deserno, Lorenz; Boehme, Rebecca; Heinz, Andreas; Schlagenhauf, Florian

    2013-12-23

    Abnormalities in reinforcement learning are a key finding in schizophrenia and have been proposed to be linked to elevated levels of dopamine neurotransmission. Behavioral deficits in reinforcement learning and their neural correlates may contribute to the formation of clinical characteristics of schizophrenia. The ability to form predictions about future outcomes is fundamental for environmental interactions and depends on neuronal teaching signals, like reward prediction errors. While aberrant prediction errors, that encode non-salient events as surprising, have been proposed to contribute to the formation of positive symptoms, a failure to build neural representations of decision values may result in negative symptoms. Here, we review behavioral and neuroimaging research in schizophrenia and focus on studies that implemented reinforcement learning models. In addition, we discuss studies that combined reinforcement learning with measures of dopamine. Thereby, we suggest how reinforcement learning abnormalities in schizophrenia may contribute to the formation of psychotic symptoms and may interact with cognitive deficits. These ideas point toward an interplay of more rigid versus flexible control over reinforcement learning. Pronounced deficits in the flexible or model-based domain may allow for a detailed characterization of well-established cognitive deficits in schizophrenia patients based on computational models of learning. Finally, we propose a framework based on the potentially crucial contribution of dopamine to dysfunctional reinforcement learning on the level of neural networks. Future research may strongly benefit from computational modeling but also requires further methodological improvement for clinical group studies. These research tools may help to improve our understanding of disease-specific mechanisms and may help to identify clinically relevant subgroups of the heterogeneous entity schizophrenia.

  9. Bi-directional effect of increasing doses of baclofen on reinforcement learning.

    PubMed

    Terrier, Jean; Ort, Andres; Yvon, Cédric; Saj, Arnaud; Vuilleumier, Patrik; Lüscher, Christian

    2011-01-01

    In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. It has been shown that in brain slices of mice, GABA(B)-receptor agonists at low concentrations increase the firing frequency of VTA-DA neurons, while high concentrations reduce the firing frequency. However, it remains elusive whether baclofen can modulate reinforcement learning in humans. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high concentration of oral baclofen, a high-affinity GABA(B)-receptor agonist, in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen, on the other hand, did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels and thus facilitates reinforcement learning.

  10. Bi-Directional Effect of Increasing Doses of Baclofen on Reinforcement Learning

    PubMed Central

    Terrier, Jean; Ort, Andres; Yvon, Cédric; Saj, Arnaud; Vuilleumier, Patrik; Lüscher, Christian

    2011-01-01

    In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. It has been shown that in brain slices of mice, GABAB-receptor agonists at low concentrations increase the firing frequency of VTA–DA neurons, while high concentrations reduce the firing frequency. However, it remains elusive whether baclofen can modulate reinforcement learning in humans. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high concentration of oral baclofen, a high-affinity GABAB-receptor agonist, in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen, on the other hand, did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels and thus facilitates reinforcement learning. PMID:21811448

  11. Pressure to cooperate: is positive reward interdependence really needed in cooperative learning?

    PubMed

    Buchs, Céline; Gilles, Ingrid; Dutrévis, Marion; Butera, Fabrizio

    2011-03-01

    BACKGROUND. Despite extensive research on cooperative learning, the debate regarding whether or not its effectiveness depends on positive reward interdependence has not yet found clear evidence. AIMS. We tested the hypothesis that positive reward interdependence, as compared to reward independence, enhances cooperative learning only if learners work on a 'routine task'; if the learners work on a 'true group task', positive reward interdependence induces the same level of learning as reward independence. SAMPLE. The study involved 62 psychology students during regular workshops. METHOD. Students worked on two psychology texts in cooperative dyads for three sessions. The type of task was manipulated through resource interdependence: students worked on either identical (routine task) or complementary (true group task) information. Students expected to be assessed with a Multiple Choice Test (MCT) on the two texts. The MCT assessment type was introduced according to two reward interdependence conditions, either individual (reward independence) or common (positive reward interdependence). A follow-up individual test took place 4 weeks after the third session of dyadic work to examine individual learning. RESULTS. The predicted interaction between the two types of interdependence was significant, indicating that students learned more with positive reward interdependence than with reward independence when they worked on identical information (routine task), whereas students who worked on complementary information (group task) learned the same with or without reward interdependence. CONCLUSIONS. This experiment sheds light on the conditions under which positive reward interdependence enhances cooperative learning, and suggests that creating a true group task makes it possible to avoid the need for positive reward interdependence.

  12. Greater striatopallidal adaptive coding during cue-reward learning and food reward habituation predict future weight gain

    PubMed Central

    Burger, Kyle S.; Stice, Eric

    2014-01-01

    Animal experiments indicate that after repeated pairings of palatable food receipt and cues that predict palatable food receipt, dopamine signaling increases in response to predictive cues, but decreases in response to food receipt. Using functional MRI and mixed effects growth curve models with 35 females (M age = 15.5 ± 0.9; M BMI = 24.5 ± 5.4) we documented an increase in BOLD response in the caudate (r = .42) during exposure to cues predicting impending milkshake receipt over repeated exposures, demonstrating a direct measure of in vivo cue-reward learning in humans. Further, we observed a simultaneous decrease in putamen (r = −.33) and ventral pallidum (r = −.45) response during milkshake receipt that occurred over repeated exposures, putatively reflecting food reward habituation. We then tested whether cue-reward learning and habituation slopes predicted future weight over 2-year follow-up. Those who exhibited the greatest escalation in ventral pallidum responsivity to cues and the greatest decrease in caudate response to milkshake receipt showed significantly larger increases in BMI (r = .39 and −.69, respectively). Interestingly, cue-reward learning propensity and food reward habituation were not correlated, implying that these factors may constitute qualitatively distinct vulnerability pathways to excess weight gain. These two individual difference factors may provide insight as to why certain people have shown obesity onset in response to the current obesogenic environment in western cultures, whereas others have not. PMID:24893320

  13. Novelty and Inductive Generalization in Human Reinforcement Learning.

    PubMed

    Gershman, Samuel J; Niv, Yael

    2015-07-01

    In reinforcement learning (RL), a decision maker searching for the most rewarding option is often faced with the question: What is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: How can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and we describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of RL in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional RL algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty. Copyright © 2015 Cognitive Science Society, Inc.
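
    A minimal sketch of the kind of hierarchical inference the abstract describes: option values are treated as draws from a shared, environment-level Gaussian, so the posterior over that group-level mean serves as the prior for a never-tried option. The normal-normal assumptions, variances, and payoffs below are illustrative, not the authors' model code.

        import numpy as np

        # Mean payoffs of options already tried in this environment.
        known_option_means = np.array([0.7, 0.6, 0.8])

        # Hierarchical assumption: option values ~ Normal(mu_env, tau2),
        # and observations are corrupted by noise with variance sigma2.
        tau2, sigma2 = 0.05, 0.10

        # Posterior over the environment-level mean (flat prior on mu_env).
        n = len(known_option_means)
        mu_env = known_option_means.mean()
        var_env = (tau2 + sigma2) / n

        # A novel option inherits the group-level posterior as its prior.
        prior_mean, prior_var = mu_env, var_env + tau2

        # One observed reward from the novel option shrinks toward the group.
        r = 0.2
        gain = prior_var / (prior_var + sigma2)      # Kalman-style gain
        posterior_mean = prior_mean + gain * (r - prior_mean)
        print(round(prior_mean, 2), round(posterior_mean, 2))

    The last update has the delta-rule form value += gain * (reward - value), which is the kind of correspondence with temporal-difference learning the abstract alludes to.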

  14. The Cerebellum: A Neural System for the Study of Reinforcement Learning

    PubMed Central

    Swain, Rodney A.; Kerr, Abigail L.; Thompson, Richard F.

    2011-01-01

    In its strictest application, the term “reinforcement learning” refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Given that Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated, and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed), any neural system that supports reinforcement learning must also be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning. PMID:21427778

  15. Autistic Traits Moderate the Impact of Reward Learning on Social Behaviour.

    PubMed

    Panasiti, Maria Serena; Puzzo, Ignazio; Chakrabarti, Bhismadev

    2016-04-01

    A deficit in empathy has been suggested to underlie social behavioural atypicalities in autism. A parallel theoretical account proposes that reduced social motivation (i.e., low responsivity to social rewards) can account for the said atypicalities. Recent evidence suggests that autistic traits modulate the link between reward and proxy metrics related to empathy. Using an evaluative conditioning paradigm to associate high and low rewards with faces, a previous study has shown that individuals high in autistic traits show reduced spontaneous facial mimicry of faces associated with high vs. low reward. This observation raises the possibility that autistic traits modulate the magnitude of evaluative conditioning. To test this, we investigated (a) whether autistic traits could modulate the ability to implicitly associate a reward value with a social stimulus (reward learning/conditioning, using the Implicit Association Task, IAT); (b) whether the learned association could modulate participants' prosocial behaviour (i.e., social reciprocity, measured using the cyberball task); and (c) whether the strength of this modulation was influenced by autistic traits. In 43 neurotypical participants, we found that autistic traits moderated the effect of social reward learning on prosocial behaviour but not reward learning itself. This evidence suggests that while autistic traits do not directly influence social reward learning, they modulate the relationship of social rewards with prosocial behaviour.

  17. Excitotoxic lesions of the medial striatum delay extinction of a reinforcement color discrimination operant task in domestic chicks; a functional role of reward anticipation.

    PubMed

    Ichikawa, Yoko; Izawa, Ei-Ichi; Matsushima, Toshiya

    2004-12-01

    To reveal the functional roles of the striatum, we examined the effects of excitotoxic lesions to the bilateral medial striatum (mSt) and nucleus accumbens (Ac) in a food reinforcement color discrimination operant task. With a food reward as reinforcement, 1-week-old domestic chicks were trained to peck selectively at red and yellow beads (S+) and not to peck at a blue bead (S-). Those chicks then received either lesions or sham operations and were tested in extinction training sessions, during which yellow turned out to be nonrewarding (S-), whereas red and blue remained unchanged. To further examine the effects on postoperant noninstrumental aspects of behavior, we also measured the "waiting time", during which chicks stayed at the empty feeder after pecking at yellow. Although the lesioned chicks showed significantly higher error rates in the nonrewarding yellow trials, their postoperant waiting time gradually decreased similarly to the sham controls. Furthermore, the lesioned chicks waited significantly longer than the controls, even from the first extinction block. In the blue trials, both lesioned and sham chicks consistently refrained from pecking, indicating that the delayed extinction was not due to a general disinhibition of pecking. Similarly, no effects were found in the novel training sessions, suggesting that the lesions had selective effects on the extinction of a learned operant. These results suggest that a neural representation of memory-based reward anticipation in the mSt/Ac could contribute to the anticipation error required for extinction.

  18. Implication of dopaminergic modulation in operant reward learning and the induction of compulsive-like feeding behavior in Aplysia.

    PubMed

    Bédécarrats, Alexis; Cornet, Charles; Simmers, John; Nargeot, Romuald

    2013-05-16

    Feeding in Aplysia provides an amenable model system for analyzing the neuronal substrates of motivated behavior and its adaptability by associative reward learning and neuromodulation. Among such learning processes, appetitive operant conditioning that leads to a compulsive-like expression of feeding actions is known to be associated with changes in the membrane properties and electrical coupling of essential action-initiating B63 neurons in the buccal central pattern generator (CPG). Moreover, the food-reward signal for this learning is conveyed in the esophageal nerve (En), an input nerve rich in dopamine-containing fibers. Here, to investigate whether dopamine (DA) is involved in this learning-induced plasticity, we used an in vitro analog of operant conditioning in which electrical stimulation of En substituted for the contingent reinforcement of biting movements in vivo. Our data indicate that contingent En stimulation does, indeed, replicate the operant learning-induced changes in CPG output and the underlying membrane and synaptic properties of B63. Significantly, moreover, this network and cellular plasticity was blocked when the input nerve was stimulated in the presence of the DA receptor antagonist cis-flupenthixol. These results therefore suggest that En-derived dopaminergic modulation of CPG circuitry contributes to the operant reward-dependent emergence of a compulsive-like expression of Aplysia's feeding behavior.

  19. Intrinsically Motivated Reinforcement Learning: A Promising Framework for Developmental Robot Learning

    DTIC Science & Technology

    2005-01-01

    for intrinsically motivated reinforcement learning that strives to achieve broad competence in an environment in a task-nonspecific manner by...hierarchical learning, intrinsically motivated reinforcement learning is an obvious choice for organizing behavior in developmental robotics. We present

  20. Striatal dopamine ramping may indicate flexible reinforcement learning with forgetting in the cortico-basal ganglia circuits.

    PubMed

    Morita, Kenji; Kato, Ayaka

    2014-01-01

    It has been suggested that the midbrain dopamine (DA) neurons, receiving inputs from the cortico-basal ganglia (CBG) circuits and the brainstem, compute reward prediction error (RPE), the difference between reward obtained or expected to be obtained and reward that had been expected to be obtained. These reward expectations are suggested to be stored in the CBG synapses and updated according to RPE through synaptic plasticity, which is induced by released DA. These together constitute the "DA=RPE" hypothesis, which describes the mutual interaction between DA and the CBG circuits and serves as the primary working hypothesis in studying reward learning and value-based decision-making. However, recent work has revealed a new type of DA signal that appears not to represent RPE. Specifically, it has been found in a reward-associated maze task that striatal DA concentration primarily shows a gradual increase toward the goal. We explored whether such ramping DA could be explained by extending the "DA=RPE" hypothesis by taking into account biological properties of the CBG circuits. In particular, we examined effects of possible time-dependent decay of DA-dependent plastic changes of synaptic strengths by incorporating decay of learned values into the RPE-based reinforcement learning model and simulating reward learning tasks. We then found that incorporation of such a decay dramatically changes the model's behavior, causing gradual ramping of RPE. Moreover, we further incorporated magnitude-dependence of the rate of decay, which could potentially be in accord with some past observations, and found that near-sigmoidal ramping of RPE, resembling the observed DA ramping, could then occur. Given that synaptic decay can be useful for flexibly reversing and updating the learned reward associations, especially in case the baseline DA is low and encoding of negative RPE by DA is limited, the observed DA ramping would be indicative of the operation of such flexible reward learning.
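
    The abstract's central mechanism, a per-step decay ("forgetting") of learned values added to an otherwise standard temporal-difference model, can be reproduced on a toy linear track: with decay, the TD error ramps up toward the goal instead of flattening out after learning. The sketch and its parameters are ours, for illustration only, not the authors' simulation code.

        import numpy as np

        n_states, alpha, gamma, decay = 7, 0.3, 0.97, 0.01
        V = np.zeros(n_states + 1)        # V[n_states]: terminal, post-goal state

        for episode in range(500):
            for s in range(n_states):
                r = 1.0 if s == n_states - 1 else 0.0   # reward only at the goal
                rpe = r + gamma * V[s + 1] - V[s]       # standard TD error
                V[s] += alpha * rpe
            V *= (1.0 - decay)            # forgetting: learned values decay away

        # Decay keeps values far from the goal from fully converging, so the TD
        # error (a proxy for DA release) rises gradually along the path.
        print(np.round([gamma * V[s + 1] - V[s] for s in range(n_states - 1)], 3))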

  1. A Flexible Mechanism of Rule Selection Enables Rapid Feature-Based Reinforcement Learning

    PubMed Central

    Balcarras, Matthew; Womelsdorf, Thilo

    2016-01-01

    Learning in a new environment is influenced by prior learning and experience. Correctly applying a rule that maps a context to stimuli, actions, and outcomes enables faster learning and better outcomes compared to relying on strategies for learning that are ignorant of task structure. However, it is often difficult to know when and how to apply learned rules in new contexts. In our study we explored how subjects employ different strategies for learning the relationship between stimulus features and positive outcomes in a probabilistic task context. We test the hypothesis that task naive subjects will show enhanced learning of feature specific reward associations by switching to the use of an abstract rule that associates stimuli by feature type and restricts selections to that dimension. To test this hypothesis we designed a decision making task where subjects receive probabilistic feedback following choices between pairs of stimuli. In the task, trials are grouped in two contexts by blocks, where in one type of block there is no unique relationship between a specific feature dimension (stimulus shape or color) and positive outcomes, and following an un-cued transition, alternating blocks have outcomes that are linked to either stimulus shape or color. Two-thirds of subjects (n = 22/32) exhibited behavior that was best fit by a hierarchical feature-rule model. Supporting the prediction of the model mechanism these subjects showed significantly enhanced performance in feature-reward blocks, and rapidly switched their choice strategy to using abstract feature rules when reward contingencies changed. Choice behavior of other subjects (n = 10/32) was fit by a range of alternative reinforcement learning models representing strategies that do not benefit from applying previously learned rules. In summary, these results show that untrained subjects are capable of flexibly shifting between behavioral rules by leveraging simple model-free reinforcement learning and context

  3. Reward-based learning of a redundant task.

    PubMed

    Tamagnone, Irene; Casadio, Maura; Sanguineti, Vittorio

    2013-06-01

    Motor skill learning has different components. When we acquire a new motor skill, we have both to learn a reliable action-value map to select a highly rewarded action (task model) and to develop an internal representation of the novel dynamics of the task environment, in order to execute properly the action previously selected (internal model). Here we focus on a 'pure' motor skill learning task, in which adaptation to a novel dynamical environment is negligible and the problem is reduced to the acquisition of an action-value map, based only on knowledge of results. Subjects performed point-to-point movements, in which start and target positions were fixed and visible, but the score provided at the end of the movement depended on the distance of the trajectory from a hidden via-point. Subjects had no clues about the correct movement other than the score value. The task is highly redundant, as infinite trajectories are compatible with the maximum score. Our aim was to capture the strategies subjects use in the exploration of the task space and in the exploitation of the task redundancy during learning. The main findings were that (i) subjects did not converge to a unique solution; rather, their final trajectories were determined by subject-specific histories of exploration; and (ii) with learning, subjects reduced the trajectory's overall variability, but the point of minimum variability gradually shifted toward the portion of the trajectory closer to the hidden via-point.

  4. The Effects of Verbal and Material Rewards and Punishers on the Performance of Impulsive and Reflective Children

    ERIC Educational Resources Information Center

    Firestone, Philip; Douglas, Virginia I.

    1977-01-01

    Impulsive and reflective children performed in a discrimination learning task which included four reinforcement conditions: verbal-reward, verbal-punishment, material-reward, and material-punishment. (SB)

  5. Dissociable neural representations of reinforcement and belief prediction errors underlie strategic learning.

    PubMed

    Zhu, Lusha; Mathewson, Kyle E; Hsu, Ming

    2012-01-31

    Decision-making in the presence of other competitive intelligent agents is fundamental for social and economic behavior. Such decisions require agents to behave strategically, where in addition to learning about the rewards and punishments available in the environment, they also need to anticipate and respond to actions of others competing for the same rewards. However, whereas we know much about strategic learning at both theoretical and behavioral levels, we know relatively little about the underlying neural mechanisms. Here, we show using a multi-strategy competitive learning paradigm that strategic choices can be characterized by extending the reinforcement learning (RL) framework to incorporate agents' beliefs about the actions of their opponents. Furthermore, using this characterization to generate putative internal values, we used model-based functional magnetic resonance imaging to investigate neural computations underlying strategic learning. We found that the distinct notions of prediction errors derived from our computational model are processed in a partially overlapping but distinct set of brain regions. Specifically, we found that the RL prediction error was correlated with activity in the ventral striatum. In contrast, activity in the ventral striatum, as well as the rostral anterior cingulate (rACC), was correlated with a previously uncharacterized belief-based prediction error. Furthermore, activity in rACC reflected individual differences in degree of engagement in belief learning. These results suggest a model of strategic behavior where learning arises from interaction of dissociable reinforcement and belief-based inputs.

  6. Dopamine and opioid gene variants are associated with increased smoking reward and reinforcement owing to negative mood.

    PubMed

    Perkins, Kenneth A; Lerman, Caryn; Grottenthaler, Amy; Ciccocioppo, Melinda M; Milanak, Melissa; Conklin, Cynthia A; Bergen, Andrew W; Benowitz, Neal L

    2008-09-01

    Negative mood increases smoking reinforcement and risk of relapse. We explored associations of gene variants in the dopamine, opioid, and serotonin pathways with smoking reward ('liking') and reinforcement (latency to first puff and total puffs) as a function of negative mood and expected versus actual nicotine content of the cigarette. Smokers of European ancestry (n=72) were randomized to one of four groups in a 2×2 balanced placebo design, corresponding with manipulation of actual (0.6 vs. 0.05 mg) and expected (told nicotine and told denicotinized) nicotine 'dose' in cigarettes during each of two sessions (negative vs. positive mood induction). Following mood induction and expectancy instructions, they sampled and rated the assigned cigarette, and then smoked additional cigarettes ad lib during continued mood induction. The increase in smoking amount owing to negative mood was associated with: dopamine D2 receptor (DRD2) C957T (CC>TT or CT), SLC6A3 (presence of 9 repeat>absence of 9), and among those given a nicotine cigarette, DRD4 (presence of 7 repeat>absence of 7) and DRD2/ANKK1 TaqIA (TT or CT>CC). SLC6A3 and DRD2/ANKK1 TaqIA were also associated with smoking reward and smoking latency. OPRM1 (AA>AG or GG) was associated with smoking reward, but the SLC6A4 variable number tandem repeat was unrelated to any of these measures. These results warrant replication but provide the first evidence for genetic associations with the acute increase in smoking reward and reinforcement owing to negative mood.

  7. Analysis of Reward Functions in Learning: Unconscious Information Processing: Noncognitive Determinants of Response Strength

    DTIC Science & Technology

    1984-05-01

    Research Note 84-76. Analysis of Reward Functions in Learning: Unconscious Information Processing: Noncognitive Determinants of Response Strength. Final Report, covering the period Sept. 1978 - Sept. 15. Melvin H. Marx, University of Missouri, Columbia; David W. Bessemer, Contracting Officer's Representative; submitted by Robert M. Sasmor, Director, Basic Research.

  8. The Influence of Personality on Neural Mechanisms of Observational Fear and Reward Learning

    ERIC Educational Resources Information Center

    Hooker, Christine I.; Verosky, Sara C.; Miyakawa, Asako; Knight, Robert T.; D'Esposito, Mark

    2008-01-01

    Fear and reward learning can occur through direct experience or observation. Both channels can enhance survival or create maladaptive behavior. We used fMRI to isolate neural mechanisms of observational fear and reward learning and investigate whether neural response varied according to individual differences in neuroticism and extraversion.…

  9. Effects of Cooperative versus Individual Study on Learning and Motivation after Reward-Removal

    ERIC Educational Resources Information Center

    Sears, David A.; Pai, Hui-Hua

    2012-01-01

    Rewards are frequently used in classrooms and recommended as a key component of well-researched methods of cooperative learning (e.g., Slavin, 1995). While many studies of cooperative learning find beneficial effects of rewards, many studies of individuals find negative effects (e.g., Deci, Koestner, & Ryan, 1999; Lepper, 1988). This may be…

  11. Reinforcement learning can account for associative and perceptual learning on a visual-decision task.

    PubMed

    Law, Chi-Tat; Gold, Joshua I

    2009-05-01

    We recently showed that improved perceptual performance on a visual motion direction-discrimination task corresponds to changes in how an unmodified sensory representation in the brain is interpreted to form a decision that guides behavior. Here we found that these changes can be accounted for using a reinforcement-learning rule to shape functional connectivity between the sensory and decision neurons. We modeled performance on the basis of the readout of simulated responses of direction-selective sensory neurons in the middle temporal area (MT) of monkey cortex. A reward prediction error guided changes in connections between these sensory neurons and the decision process, first establishing the association between motion direction and response direction, and then gradually improving perceptual sensitivity by selectively strengthening the connections from the most sensitive neurons in the sensory population. The results suggest a common, feedback-driven mechanism for some forms of associative and perceptual learning.
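
    A schematic version of the described learning rule: pooled responses of direction-selective units drive a binary decision, and a reward prediction error gates a Hebbian update of the readout weights, so the most sensitive units gradually dominate the decision. Unit counts, noise levels, and learning rates are invented for illustration; this is not the authors' simulation code.

        import numpy as np

        rng = np.random.default_rng(0)
        n_units = 50
        tuning = rng.uniform(-1, 1, n_units)   # signed direction sensitivity per unit
        w = np.zeros(n_units)                  # readout weights onto the decision
        alpha, reward_rate = 0.05, 0.5

        for trial in range(2000):
            direction = rng.choice([-1, 1])
            rates = tuning * direction + rng.normal(0, 1, n_units)  # noisy responses
            choice = 1 if w @ rates + rng.normal(0, 0.1) > 0 else -1
            reward = 1.0 if choice == direction else 0.0
            rpe = reward - reward_rate                  # reward prediction error
            reward_rate += 0.01 * rpe
            w += alpha * rpe * rates * choice           # RPE-gated Hebbian update

        # Readout weights align with unit sensitivity as performance improves.
        print(round(float(np.corrcoef(w, tuning)[0, 1]), 2), round(reward_rate, 2))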

  12. Hierarchical extreme learning machine based reinforcement learning for goal localization

    NASA Astrophysics Data System (ADS)

    AlDahoul, Nouar; Zaw Htike, Zaw; Akmeliawati, Rini

    2017-03-01

    The objective of goal localization is to find the location of goals in noisy environments. Simple actions are performed to move the agent towards the goal. The goal detector should be capable of minimizing the error between the predicted locations and the true ones. Only a few regions need to be processed by the agent, to reduce the computational effort and increase the speed of convergence. In this paper, a reinforcement learning (RL) method was utilized to find an optimal series of actions to localize the goal region. The visual data, a set of images, are high-dimensional unstructured data and need to be represented efficiently to obtain a robust detector. Different deep reinforcement learning models have already been used to localize a goal, but most of them take a long time to learn the model. This long learning time results from the weight fine-tuning stage that is applied iteratively to find an accurate model. The Hierarchical Extreme Learning Machine (H-ELM) was used as a fast deep model that does not fine-tune the weights; in other words, hidden weights are generated randomly and output weights are calculated analytically. The H-ELM algorithm was used in this work to find good features for effective representation. This paper proposes a combination of the Hierarchical Extreme Learning Machine and reinforcement learning to find an optimal policy directly from visual input. This combination outperforms other methods in terms of accuracy and learning speed. Simulations and results were analysed using MATLAB.
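
    The property the abstract leans on, random hidden weights with analytically computed output weights, is easy to demonstrate for a single ELM layer (a hierarchical H-ELM stacks several such layers). A minimal sketch on a toy regression problem with invented dimensions:

        import numpy as np

        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 10))                      # 200 samples, 10 features
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)    # toy regression target

        # Hidden weights are generated randomly and never fine-tuned...
        n_hidden = 100
        W, b = rng.normal(size=(10, n_hidden)), rng.normal(size=n_hidden)
        H = np.tanh(X @ W + b)                              # random hidden features

        # ...while output weights are calculated analytically (least squares).
        beta, *_ = np.linalg.lstsq(H, y, rcond=None)
        print(round(float(np.mean((H @ beta - y) ** 2)), 4))  # training MSE

    Skipping the iterative fine-tuning stage is what buys the learning-speed advantage claimed in the abstract.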

  13. Knowledge-Based Reinforcement Learning for Data Mining

    NASA Astrophysics Data System (ADS)

    Kudenko, Daniel; Grzes, Marek

    Data Mining is the process of extracting patterns from data. Two general avenues of research in the intersecting areas of agents and data mining can be distinguished. The first approach is concerned with mining an agent’s observation data in order to extract patterns, categorize environment states, and/or make predictions of future states. In this setting, data is normally available as a batch, and the agent’s actions and goals are often independent of the data mining task. The data collection is mainly considered as a side effect of the agent’s activities. Machine learning techniques applied in such situations fall into the class of supervised learning. In contrast, the second scenario occurs where an agent is actively performing the data mining, and is responsible for the data collection itself. For example, a mobile network agent is acquiring and processing data (where the acquisition may incur a certain cost), or a mobile sensor agent is moving in a (perhaps hostile) environment, collecting and processing sensor readings. In these settings, the tasks of the agent and the data mining are highly intertwined and interdependent (or even identical). Supervised learning is not a suitable technique for these cases. Reinforcement Learning (RL) enables an agent to learn from experience (in the form of reward and punishment for explorative actions) and adapt to new situations, without a teacher. RL is an ideal learning technique for these data mining scenarios, because it fits the agent paradigm of continuous sensing and acting, and the RL agent is able to learn to make decisions on the sampling of the environment which provides the data. Nevertheless, RL still suffers from scalability problems, which have prevented its successful use in many complex real-world domains. The more complex the tasks, the longer it takes a reinforcement learning algorithm to converge to a good solution. For many real-world tasks, human expert knowledge is available. For example, human

  14. An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning

    DTIC Science & Technology

    2000-10-01

    Learning behaviors in a multiagent environment are crucial for developing and adapting multiagent systems. Reinforcement learning techniques have...presentation of the relevant techniques for solving stochastic games from both the game theory community and reinforcement learning communities. We examine the

  15. Curiosity and reward: Valence predicts choice and information prediction errors enhance learning.

    PubMed

    Marvin, Caroline B; Shohamy, Daphna

    2016-03-01

    Curiosity drives many of our daily pursuits and interactions; yet, we know surprisingly little about how it works. Here, we harness an idea implied in many conceptualizations of curiosity: that information has value in and of itself. Reframing curiosity as the motivation to obtain reward, where the reward is information, allows one to leverage major advances in theoretical and computational mechanisms of reward-motivated learning. We provide new evidence supporting 2 predictions that emerge from this framework. First, we find an asymmetric effect of positive versus negative information, with positive information enhancing both curiosity and long-term memory for information. Second, we find that it is not the absolute value of information that drives learning but, rather, the gap between the reward expected and reward received, an "information prediction error." These results support the idea that information functions as a reward, much like money or food, guiding choices and driving learning in systematic ways.

  16. Somatic and Reinforcement-Based Plasticity in the Initial Stages of Human Motor Learning.

    PubMed

    Sidarta, Ananda; Vahdat, Shahabeddin; Bernardi, Nicolò F; Ostry, David J

    2016-11-16

    As one learns to dance or play tennis, the desired somatosensory state is typically unknown. Trial and error is important as motor behavior is shaped by successful and unsuccessful movements. As an experimental model, we designed a task in which human participants make reaching movements to a hidden target and receive positive reinforcement when successful. We identified somatic and reinforcement-based sources of plasticity on the basis of changes in functional connectivity using resting-state fMRI before and after learning. The neuroimaging data revealed reinforcement-related changes in both motor and somatosensory brain areas in which a strengthening of connectivity was related to the amount of positive reinforcement during learning. Areas of prefrontal cortex were similarly altered in relation to reinforcement, with connectivity between sensorimotor areas of putamen and the reward-related ventromedial prefrontal cortex strengthened in relation to the amount of successful feedback received. In other analyses, we assessed connectivity related to changes in movement direction between trials, a type of variability that presumably reflects exploratory strategies during learning. We found that connectivity in a network linking motor and somatosensory cortices increased with trial-to-trial changes in direction. Connectivity varied as well with the change in movement direction following incorrect movements. Here the changes were observed in a somatic memory and decision making network involving ventrolateral prefrontal cortex and second somatosensory cortex. Our results point to the idea that the initial stages of motor learning are not wholly motor but rather involve plasticity in somatic and prefrontal networks related both to reward and exploration.

  17. Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity

    PubMed Central

    Warlaumont, Anne S.; Finnegan, Megan K.

    2016-01-01

    At around 7 months of age, human infants begin to reliably produce well-formed syllables containing both consonants and vowels, a behavior called canonical babbling. Over subsequent months, the frequency of canonical babbling continues to increase. How the infant’s nervous system supports the acquisition of this ability is unknown. Here we present a computational model that combines a spiking neural network, reinforcement-modulated spike-timing-dependent plasticity, and a human-like vocal tract to simulate the acquisition of canonical babbling. Like human infants, the model’s frequency of canonical babbling gradually increases. The model is rewarded when it produces a sound that is more auditorily salient than sounds it has previously produced. This is consistent with data from human infants indicating that contingent adult responses shape infant behavior and with data from deaf and tracheostomized infants indicating that hearing, including hearing one’s own vocalizations, is critical for canonical babbling development. Reward receipt increases the level of dopamine in the neural network. The neural network contains a reservoir with recurrent connections and two motor neuron groups, one agonist and one antagonist, which control the masseter and orbicularis oris muscles, promoting or inhibiting mouth closure. The model learns to increase the number of salient, syllabic sounds it produces by adjusting the base level of muscle activation and increasing their range of activity. Our results support the possibility that through dopamine-modulated spike-timing-dependent plasticity, the motor cortex learns to harness its natural oscillations in activity in order to produce syllabic sounds. It thus suggests that learning to produce rhythmic mouth movements for speech production may be supported by general cortical learning mechanisms. The model makes several testable predictions and has implications for our understanding not only of how syllabic vocalizations develop

  18. Integration of reinforcement learning and optimal decision-making theories of the basal ganglia.

    PubMed

    Bogacz, Rafal; Larsen, Tobias

    2011-04-01

    This article seeks to integrate two sets of theories describing action selection in the basal ganglia: reinforcement learning theories describing learning which actions to select to maximize reward and decision-making theories proposing that the basal ganglia selects actions on the basis of sensory evidence accumulated in the cortex. In particular, we present a model that integrates the actor-critic model of reinforcement learning and a model assuming that the cortico-basal-ganglia circuit implements a statistically optimal decision-making procedure. The values of cortico-striatal weights required for optimal decision making in our model differ from those provided by standard reinforcement learning models. Nevertheless, we show that an actor-critic model converges to the weights required for optimal decision making when biologically realistic limits on synaptic weights are introduced. We also describe the model's predictions concerning reaction times and neural responses during learning, and we discuss directions required for further integration of reinforcement learning and optimal decision-making theories.
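
    A compact sketch of the actor-critic component the article builds on, with the biologically motivated constraint that the (cortico-striatal-like) actor weights stay within fixed bounds; the bandit task, bounds, and parameters are illustrative rather than taken from the paper's model.

        import numpy as np

        rng = np.random.default_rng(2)
        p_reward = np.array([0.8, 0.2])     # true reward probability of each action
        actor_w = np.full(2, 0.5)           # "cortico-striatal" action weights
        critic_v = 0.0                      # state value maintained by the critic
        a_actor, a_critic, beta = 0.1, 0.1, 3.0

        for trial in range(3000):
            p = np.exp(beta * actor_w); p /= p.sum()     # softmax action selection
            a = rng.choice(2, p=p)
            r = float(rng.random() < p_reward[a])
            delta = r - critic_v                         # critic's prediction error
            critic_v += a_critic * delta
            actor_w[a] += a_actor * delta
            actor_w = np.clip(actor_w, 0.0, 1.0)         # realistic weight limits

        print(actor_w.round(2), round(critic_v, 2))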

  19. Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control.

    PubMed

    Uragami, Daisuke; Takahashi, Tatsuji; Matsuo, Yoshiki

    2014-02-01

    Many algorithms and methods in artificial intelligence or machine learning were inspired by human cognition. As a mechanism to handle the exploration-exploitation dilemma in reinforcement learning, the loosely symmetric (LS) value function that models the causal intuition of humans was proposed (Shinohara et al., 2007). While LS shows the highest correlation with causal induction by humans, it has been reported to work effectively in multi-armed bandit problems, the simplest class of tasks representing the dilemma. However, the scope of application of LS was limited to reinforcement learning problems with K actions and only one state (K-armed bandit problems). This study proposes the LS-Q learning architecture, which can deal with general reinforcement learning tasks with multiple states and delayed reward. We tested the learning performance of the new architecture on giant-swing robot motion learning, where the uncertainty and unknownness of the environment are substantial. In the test, no help from ready-made internal models or function approximation of the state space was given. The simulations showed that while the ordinary Q-learning agent does not reach giant-swing motion because of stagnant loops (local optima with low rewards), LS-Q escapes such loops and acquires the giant-swing. It was confirmed that the smaller the number of states (in other words, the more coarse-grained the division of states and the more incomplete the state observation), the better LS-Q performs in comparison with Q-learning. We also showed that the high performance of LS-Q depends comparatively little on parameter tuning and learning time. This suggests that the proposed method, inspired by human cognition, works adaptively in real environments. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  20. Post-learning hippocampal dynamics promote preferential retention of rewarding events

    PubMed Central

    Gruber, Matthias J.; Ritchey, Maureen; Wang, Shao-Fang; Doss, Manoj K.; Ranganath, Charan

    2016-01-01

    Reward motivation is known to modulate memory encoding, and this effect depends on interactions between the substantia nigra/ventral tegmental area complex (SN/VTA) and the hippocampus. It is unknown, however, whether these interactions influence offline neural activity in the human brain that is thought to promote memory consolidation. Here, we used functional magnetic resonance imaging (fMRI) to test the effect of reward motivation on post-learning neural dynamics and subsequent memory for objects that were learned in high- or low-reward motivation contexts. We found that post-learning increases in resting-state functional connectivity between the SN/VTA and hippocampus predicted preferential retention of objects that were learned in high-reward contexts. In addition, multivariate pattern classification revealed that hippocampal representations of high-reward contexts were preferentially reactivated during post-learning rest, and the number of hippocampal reactivations was predictive of preferential retention of items learned in high-reward contexts. These findings indicate that reward motivation alters offline post-learning dynamics between the SN/VTA and hippocampus, providing novel evidence for a potential mechanism by which reward could influence memory consolidation. PMID:26875624

  1. Experiential reward learning outweighs instruction prior to adulthood.

    PubMed

    Decker, Johannes H; Lourenco, Frederico S; Doll, Bradley B; Hartley, Catherine A

    2015-06-01

    Throughout our lives, we face the important task of distinguishing rewarding actions from those that are best avoided. Importantly, there are multiple means by which we acquire this information. Through trial and error, we use experiential feedback to evaluate our actions. We also learn which actions are advantageous through explicit instruction from others. Here, we examined whether the influence of these two forms of learning on choice changes across development by placing instruction and experience in competition in a probabilistic-learning task. Whereas inaccurate instruction markedly biased adults' estimations of a stimulus's value, children and adolescents were better able to objectively estimate stimulus values through experience. Instructional control of learning is thought to recruit prefrontal-striatal brain circuitry, which continues to mature into adulthood. Our behavioral data suggest that this protracted neurocognitive maturation may cause the motivated actions of children and adolescents to be less influenced by explicit instruction than are those of adults. This absence of a confirmation bias in children and adolescents represents a paradoxical developmental advantage of youth over adults in the unbiased evaluation of actions through positive and negative experience.

  3. Stochastic Scheduling and Planning Using Reinforcement Learning

    DTIC Science & Technology

    2007-11-02

    reinforcement learning (RL) methods to large-scale optimization problems relevant to Air Force operations planning, scheduling, and maintenance. The objectives of this project were to: (1) investigate the utility of RL on large-scale logistics problems; (2) extend existing RL theory and practice to enhance the ease of application and the performance of RL on these problems; and (3) explore new problem formulations in order to take maximal advantage of RL methods. A method using RL to modify local search cost functions was developed and shown to yield significant

  4. A reinforcement learning approach to instrumental contingency degradation in rats.

    PubMed

    Dutech, Alain; Coutureau, Etienne; Marchand, Alain R

    2011-01-01

    Goal-directed action involves a representation of action consequences. Adapting to changes in action-outcome contingency requires the prefrontal region. Indeed, rats with lesions of the medial prefrontal cortex do not adapt their free operant response when food delivery becomes unrelated to lever-pressing. The present study explores the bases of this deficit through a combined behavioural and computational approach. We show that lesioned rats retain some behavioural flexibility and stop pressing if this action prevents food delivery. We attempt to model this phenomenon in a reinforcement learning framework. The model assumes that distinct action values are learned in an incremental manner in distinct states. The model represents states as n-uplets of events, emphasizing sequences rather than the continuous passage of time. Probabilities of lever-pressing and visits to the food magazine observed in the behavioural experiments are first analyzed as a function of these states, to identify sequences of events that influence action choice. Observed action probabilities appear to be essentially a function of the last event that occurred, with reward delivery and waiting significantly facilitating magazine visits and lever-pressing respectively. Behavioural sequences of normal and lesioned rats are then fed into the model, action values are updated at each event transition according to the SARSA algorithm, and predicted action probabilities are derived through a softmax policy. The model captures the time course of learning, as well as the differential adaptation of normal and prefrontal lesioned rats to contingency degradation with the same parameters for both groups. The results suggest that simple temporal difference algorithms with low learning rates can largely account for instrumental learning and performance. Prefrontal lesioned rats appear to mainly differ from control rats in their low rates of visits to the magazine after a lever press, and their inability to
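
    The modeling pipeline sketched in the abstract (discrete event-defined states, SARSA updates at each event transition, a softmax policy over action values) fits in a few lines. The schedule, state coding, and parameters below are placeholders, not the fitted values from the study.

        import numpy as np
        from collections import defaultdict

        rng = np.random.default_rng(3)
        Q = defaultdict(lambda: np.zeros(2))   # state -> values of [visit magazine, press lever]
        alpha, gamma, beta = 0.1, 0.9, 2.0

        def softmax(q):
            p = np.exp(beta * q); p /= p.sum()
            return rng.choice(len(q), p=p)

        state = "waiting"                      # the last event defines the state
        action = softmax(Q[state])
        for step in range(5000):
            # Contingent schedule: lever-pressing while waiting can deliver food.
            reward = 1.0 if (action == 1 and state == "waiting"
                             and rng.random() < 0.3) else 0.0
            next_state = "reward" if reward else "waiting"
            next_action = softmax(Q[next_state])
            # SARSA: on-policy update applied at each event transition.
            Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action]
                                         - Q[state][action])
            state, action = next_state, next_action

        print({s: q.round(2) for s, q in Q.items()})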

  5. Reward learning in pediatric depression and anxiety: preliminary findings in a high-risk sample.

    PubMed

    Morris, Bethany H; Bylsma, Lauren M; Yaroslavsky, Ilya; Kovacs, Maria; Rottenberg, Jonathan

    2015-05-01

    Reward learning has been postulated as a critical component of hedonic functioning that predicts depression risk. Reward learning deficits have been established in adults with current depressive disorders, but no prior studies have examined the relationship of reward learning and depression in children. The present study investigated reward learning as a function of familial depression risk and current diagnostic status in a pediatric sample. The sample included 204 children of parents with a history of depression (n = 86 high-risk offspring) or parents with no history of major mental disorder (n = 118 low-risk offspring). Semistructured clinical interviews were used to establish current mental diagnoses in the children. A modified signal detection task was used for assessing reward learning. We tested whether reward learning was impaired in high-risk offspring relative to low-risk offspring. We also tested whether reward learning was impaired in children with current disorders known to blunt hedonic function (depression, social phobia, PTSD, GAD, n = 13) compared to children with no disorders and to a psychiatric comparison group with ADHD. High- and low-risk youth did not differ in reward learning. However, youth with current anhedonic disorders (depression, social phobia, PTSD, GAD) exhibited blunted reward learning relative to nondisordered youth and those with ADHD. Our results are a first demonstration that reward learning deficits are present among youth with disorders known to blunt anhedonic function and that these deficits have some degree of diagnostic specificity. We advocate for future studies to replicate and extend these preliminary findings. © 2015 Wiley Periodicals, Inc.
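
    The "modified signal detection task" mentioned here is typically scored as a response bias toward the more frequently rewarded stimulus; one standard summary statistic is log b, computed below on made-up trial counts (our sketch of the conventional measure, not the authors' analysis code).

        import math

        # Hypothetical trial counts from a probabilistic reward task.
        rich_correct, rich_error = 80, 20   # frequently rewarded stimulus
        lean_correct, lean_error = 55, 45   # rarely rewarded stimulus

        # Response bias: positive log b means responding is pulled toward
        # the rich stimulus, the usual index of reward learning.
        log_b = 0.5 * math.log((rich_correct * lean_error)
                               / (rich_error * lean_correct))

        # Discriminability separates perceptual sensitivity from that bias.
        log_d = 0.5 * math.log((rich_correct * lean_correct)
                               / (rich_error * lean_error))

        print(f"log b = {log_b:.2f}, log d = {log_d:.2f}")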

  6. Learning processes affecting human decision making: An assessment of reinforcer-selective Pavlovian-to-instrumental transfer following reinforcer devaluation.

    PubMed

    Allman, Melissa J; DeLeon, Iser G; Cataldo, Michael F; Holland, Peter C; Johnson, Alexander W

    2010-07-01

    In reinforcer-selective transfer, Pavlovian stimuli that are predictive of specific outcomes bias performance toward responses associated with those outcomes. Although this phenomenon has been extensively examined in rodents, recent assessments have extended to humans. Using a stock market paradigm, adults were trained to associate particular symbols and responses with particular currencies. During the first test, individuals showed a preference for responding on actions associated with the same outcome as that predicted by the presented stimulus (i.e., a reinforcer-selective transfer effect). In the second test of the experiment, one of the currencies was devalued. We found it notable that this served to reduce responses to those stimuli associated with the devalued currency. This finding is in contrast to that typically observed in rodent studies, and suggests that participants in this task represented the sensory features that differentiate the reinforcers and their value during reinforcer-selective transfer. These results are discussed in terms of implications for understanding associative learning processes in humans and the ability of reward-paired cues to direct adaptive and maladaptive behavior.

  7. A reinforcement learning mechanism responsible for the valuation of free choice.

    PubMed

    Cockburn, Jeffrey; Collins, Anne G E; Frank, Michael J

    2014-08-06

    Humans exhibit a preference for options they have freely chosen over equally valued options they have not; however, the neural mechanism that drives this bias and its functional significance have yet to be identified. Here, we propose a model in which choice biases arise due to amplified positive reward prediction errors associated with free choice. Using a novel variant of a probabilistic learning task, we show that choice biases are selective to options that are predominantly associated with positive outcomes. A polymorphism in DARPP-32, a gene linked to dopaminergic striatal plasticity and individual differences in reinforcement learning, was found to predict the effect of choice as a function of value. We propose that these choice biases are the behavioral byproduct of a credit assignment mechanism responsible for ensuring the effective delivery of dopaminergic reinforcement learning signals broadcast to the striatum.
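
    The proposed mechanism, amplified positive reward prediction errors for freely chosen options, reduces to a one-line modification of a standard value update; the gain value and trial structure below are illustrative, not taken from the paper.

        import numpy as np

        rng = np.random.default_rng(4)
        alpha, choice_gain = 0.2, 1.5   # gain > 1 boosts positive RPEs for free choices
        v_free = v_forced = 0.0

        for trial in range(2000):
            r = float(rng.random() < 0.5)          # both options equally rewarding
            delta = r - v_free
            if delta > 0:                          # amplify positive RPEs only
                delta *= choice_gain
            v_free += alpha * delta
            v_forced += alpha * (r - v_forced)     # standard update, forced option

        # Identical outcomes, yet the freely chosen option ends up valued higher.
        print(round(v_free, 2), round(v_forced, 2))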

  9. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning.

    PubMed

    Franklin, Nicholas T; Frank, Michael J

    2015-12-25

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning three Marr's levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments.
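
    At the algorithmic level, the TAN mechanism described here amounts to letting uncertainty set the effective learning rate. One common abstraction, used below purely for illustration (it is not the paper's neural model), tracks outcomes with a slowly forgetting Beta distribution and scales the learning rate by its variance, so learning re-opens after change-points.

        import numpy as np

        rng = np.random.default_rng(5)
        a = b = 1.0                 # Beta(a, b) belief about the reward probability
        V, base_lr, p_true = 0.5, 0.5, 0.8

        for trial in range(300):
            if trial == 150:
                p_true = 0.2        # change-point in the outcome contingency
            r = float(rng.random() < p_true)
            var = a * b / ((a + b) ** 2 * (a + b + 1))   # posterior variance
            lr = base_lr * var / 0.25      # normalize by max Bernoulli variance
            V += lr * (r - V)              # uncertainty-scaled value update
            # Mild forgetting keeps uncertainty from collapsing to zero, so the
            # effective learning rate can rise again when outcomes turn volatile.
            a = 0.98 * a + r
            b = 0.98 * b + (1 - r)

        print(round(V, 2), round(lr, 3))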

  10. Rewards versus Learning: A Response to Paul Chance.

    ERIC Educational Resources Information Center

    Kohn, Alfie

    1993-01-01

    Responding to Paul Chance's November 1992 "Kappan" article on motivational value of rewards, this article argues that manipulating student behavior with either punishments or rewards is unnecessary and counterproductive. Extrinsic rewards can never buy more than short-term compliance because they are inherently controlling and…

  11. Choice modulates the neural dynamics of prediction error processing during rewarded learning.

    PubMed

    Peterson, David A; Lotz, Daniel T; Halgren, Eric; Sejnowski, Terrence J; Poizner, Howard

    2011-01-15

    Our ability to selectively engage with our environment enables us to guide our learning and to take advantage of its benefits. When facing multiple possible actions, our choices are a critical aspect of learning. In the case of learning from rewarding feedback, there has been substantial theoretical and empirical progress in elucidating the associated behavioral and neural processes, predominantly in terms of a reward prediction error, a measure of the discrepancy between actual versus expected reward. Nevertheless, the distinct influence of choice on prediction error processing and its neural dynamics remains relatively unexplored. In this study we used a novel paradigm to determine how choice influences prediction error processing and to examine whether there are correspondingly distinct neural dynamics. We recorded scalp electroencephalogram while healthy adults were administered a rewarded learning task in which choice trials were intermingled with control trials involving the same stimuli, motor responses, and probabilistic rewards. We used a temporal difference learning model of subjects' trial-by-trial choices to infer subjects' image valuations and corresponding prediction errors. As expected, choices were associated with lower overall prediction error magnitudes, most notably over the course of learning the stimulus-reward contingencies. Choices also induced a higher-amplitude relative positivity in the frontocentral event-related potential about 200 ms after reward signal onset that was negatively correlated with the differential effect of choice on the prediction error. Thus choice influences the neural dynamics associated with how reward signals are processed during learning. Behavioral, computational, and neurobiological models of rewarded learning should therefore accommodate a distinct influence for choice during rewarded learning. Copyright © 2010 Elsevier Inc. All rights reserved.
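
    The temporal difference model used to infer trial-by-trial valuations has a simple skeleton: update the chosen image's value by the prediction error after each reward signal, and keep the per-trial RPE as a regressor for the EEG analysis. A schematic version with invented parameters, not the authors' fitted model:

        import numpy as np

        rng = np.random.default_rng(6)
        alpha, beta = 0.3, 5.0
        V = np.zeros(2)                          # learned values of the two images
        p_reward = np.array([0.7, 0.3])
        rpes = []

        for trial in range(200):
            p = np.exp(beta * V); p /= p.sum()   # softmax over image valuations
            a = rng.choice(2, p=p)               # (simulated) subject's choice
            r = float(rng.random() < p_reward[a])
            rpe = r - V[a]                       # prediction error regressor
            V[a] += alpha * rpe
            rpes.append(abs(rpe))

        # |RPE| shrinks as the stimulus-reward contingencies are learned.
        print(round(float(np.mean(rpes[:50])), 2), round(float(np.mean(rpes[-50:])), 2))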

  12. Reinforcement Learning of Targeted Movement in a Spiking Neuronal Model of Motor Cortex

    PubMed Central

    Chadderdon, George L.; Neymotin, Samuel A.; Kerr, Cliff C.; Lytton, William W.

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint “forearm” to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (−1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior. PMID:23094042

  13. Reinforcement learning of targeted movement in a spiking neuronal model of motor cortex.

    PubMed

    Chadderdon, George L; Neymotin, Samuel A; Kerr, Cliff C; Lytton, William W

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint "forearm" to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (-1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior.
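
    The learning rule described in these two records, synapse-specific eligibility traces gated by a global 3-valued reinforcement signal, can be sketched compactly. The following Python fragment is a schematic rate-style approximation, not the published spiking implementation; the network sizes, rates, and spike vectors are placeholders.

      import numpy as np

      rng = np.random.default_rng(1)

      n_in, n_out = 16, 8
      w = rng.uniform(0.2, 0.4, size=(n_out, n_in))  # plastic feedforward weights
      elig = np.zeros_like(w)                        # eligibility traces

      def plasticity_step(pre, post, reinforcement, lr=0.01, decay=0.9):
          """Tag recently coactive synapses in an eligibility trace; a global
          3-valued signal (+1 reward, 0 no learning, -1 punishment) then
          converts the tags into weight changes."""
          elig[:] = decay * elig + np.outer(post, pre)
          w[:] = np.clip(w + lr * reinforcement * elig, 0.0, 1.0)

      # toy usage: random spike vectors, rewarded because the hand approached the target
      pre = (rng.random(n_in) < 0.2).astype(float)
      post = (rng.random(n_out) < 0.2).astype(float)
      plasticity_step(pre, post, reinforcement=+1)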

  14. Attenuating GABA(A) receptor signaling in dopamine neurons selectively enhances reward learning and alters risk preference in mice.

    PubMed

    Parker, Jones G; Wanat, Matthew J; Soden, Marta E; Ahmad, Kinza; Zweifel, Larry S; Bamford, Nigel S; Palmiter, Richard D

    2011-11-23

    Phasic dopamine (DA) transmission encodes the value of reward-predictive stimuli and influences both learning and decision-making. Altered DA signaling is associated with psychiatric conditions characterized by risky choices such as pathological gambling. These observations highlight the importance of understanding how DA neuron activity is modulated. While excitatory drive onto DA neurons is critical for generating phasic DA responses, emerging evidence suggests that inhibitory signaling also modulates these responses. To address the functional importance of inhibitory signaling in DA neurons, we generated mice lacking the β3 subunit of the GABA(A) receptor specifically in DA neurons (β3-KO mice) and examined their behavior in tasks that assessed appetitive learning, aversive learning, and risk preference. DA neurons in midbrain slices from β3-KO mice exhibited attenuated GABA-evoked IPSCs. Furthermore, electrical stimulation of excitatory afferents to DA neurons elicited more DA release in the nucleus accumbens of β3-KO mice as measured by fast-scan cyclic voltammetry. β3-KO mice were more active than controls when given morphine, which correlated with potential compensatory upregulation of GABAergic tone onto DA neurons. β3-KO mice learned faster in two food-reinforced learning paradigms, but extinguished their learned behavior normally. Enhanced learning was specific for appetitive tasks, as aversive learning was unaffected in β3-KO mice. Finally, we found that β3-KO mice had enhanced risk preference in a probabilistic selection task that required mice to choose between a small certain reward and a larger uncertain reward. Collectively, these findings identify a selective role for GABA(A) signaling in DA neurons in appetitive learning and decision-making.

  15. Single amino acids in sucrose rewards modulate feeding and associative learning in the honeybee.

    PubMed

    Simcock, Nicola K; Gray, Helen E; Wright, Geraldine A

    2014-10-01

    Obtaining the correct balance of nutrients requires that the brain integrates information about the body's nutritional state with sensory information from food to guide feeding behaviour. Learning is a mechanism that allows animals to identify cues associated with nutrients so that they can be located quickly when required. Feedback about nutritional state is essential for nutrient balancing and could influence learning. How specific this feedback is to individual nutrients has not often been examined. Here, we tested how the honeybee's nutritional state influenced the likelihood it would feed on and learn sucrose solutions containing single amino acids. Nutritional state was manipulated by pre-feeding bees with either 1 M sucrose or 1 M sucrose containing 100 mM of isoleucine, proline, phenylalanine, or methionine 24 h prior to olfactory conditioning of the proboscis extension response. We found that bees pre-fed sucrose solution consumed less of solutions containing amino acids and were also less likely to learn to associate amino acid solutions with odours. Unexpectedly, bees pre-fed solutions containing an amino acid were also less likely to learn to associate odours with sucrose the next day. Furthermore, they consumed more of and were more likely to learn when rewarded with an amino acid solution if they were pre-fed isoleucine and proline. Our data indicate that single amino acids at relatively high concentrations inhibit feeding on sucrose solutions containing them, and they can act as appetitive reinforcers during learning. Our data also suggest that select amino acids interact with mechanisms that signal nutritional sufficiency to reduce hunger. Based on these experiments, we predict that nutrient balancing for essential amino acids during learning requires integration of information about several amino acids experienced simultaneously.

  16. Single amino acids in sucrose rewards modulate feeding and associative learning in the honeybee

    PubMed Central

    Simcock, Nicola K.; Gray, Helen E.; Wright, Geraldine A.

    2014-01-01

    Obtaining the correct balance of nutrients requires that the brain integrates information about the body’s nutritional state with sensory information from food to guide feeding behaviour. Learning is a mechanism that allows animals to identify cues associated with nutrients so that they can be located quickly when required. Feedback about nutritional state is essential for nutrient balancing and could influence learning. How specific this feedback is to individual nutrients has not often been examined. Here, we tested how the honeybee’s nutritional state influenced the likelihood it would feed on and learn sucrose solutions containing single amino acids. Nutritional state was manipulated by pre-feeding bees with either 1 M sucrose or 1 M sucrose containing 100 mM of isoleucine, proline, phenylalanine, or methionine 24 h prior to olfactory conditioning of the proboscis extension response. We found that bees pre-fed sucrose solution consumed less of solutions containing amino acids and were also less likely to learn to associate amino acid solutions with odours. Unexpectedly, bees pre-fed solutions containing an amino acid were also less likely to learn to associate odours with sucrose the next day. Furthermore, they consumed more of and were more likely to learn when rewarded with an amino acid solution if they were pre-fed isoleucine and proline. Our data indicate that single amino acids at relatively high concentrations inhibit feeding on sucrose solutions containing them, and they can act as appetitive reinforcers during learning. Our data also suggest that select amino acids interact with mechanisms that signal nutritional sufficiency to reduce hunger. Based on these experiments, we predict that nutrient balancing for essential amino acids during learning requires integration of information about several amino acids experienced simultaneously. PMID:24819203

  17. Vicarious reinforcement learning signals when instructing others.

    PubMed

    Apps, Matthew A J; Lesage, Elise; Ramnani, Narender

    2015-02-18

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action-outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors.

  18. Vicarious Reinforcement Learning Signals When Instructing Others

    PubMed Central

    Lesage, Elise; Ramnani, Narender

    2015-01-01

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action–outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors. PMID:25698730

  19. Forward shift of feeding buzz components of dolphins and belugas during associative learning reveals a likely connection to reward expectation, pleasure and brain dopamine activation.

    PubMed

    Ridgway, S H; Moore, P W; Carder, D A; Romano, T A

    2014-08-15

    For many years, we heard sounds associated with reward from dolphins and belugas. We named these pulsed sounds victory squeals (VS), as they remind us of a child's squeal of delight. Here we put these sounds in context with natural and learned behavior. Like bats, echolocating cetaceans produce feeding buzzes as they approach and catch prey. Unlike bats, cetaceans continue their feeding buzzes after prey capture and the after portion is what we call the VS. Prior to training (or conditioning), the VS comes after the fish reward; with repeated trials it moves to before the reward. During training, we use a whistle or other sound to signal a correct response by the animal. This sound signal, named a secondary reinforcer (SR), leads to the primary reinforcer, fish. Trainers usually name their whistle or other SR a bridge, as it bridges the time gap between the correct response and reward delivery. During learning, the SR becomes associated with reward and the VS comes after the SR rather than after the fish. By following the SR, the VS confirms that the animal expects a reward. Results of early brain stimulation work suggest to us that SR stimulates brain dopamine release, which leads to the VS. Although there are no direct studies of dopamine release in cetaceans, we found that the timing of our VS is consistent with a response after dopamine release. We compared trained vocal responses to auditory stimuli with VS responses to SR sounds. Auditory stimuli that did not signal reward resulted in faster responses by a mean of 151 ms for dolphins and 250 ms for belugas. In laboratory animals, there is a 100 to 200 ms delay for dopamine release. VS delay in our animals is similar and consistent with vocalization after dopamine release. Our novel observation suggests that the dopamine reward system is active in cetacean brains.

  20. Seizure Control in a Computational Model Using a Reinforcement Learning Stimulation Paradigm.

    PubMed

    Nagaraj, Vivek; Lamperski, Andrew; Netoff, Theoden I

    2016-11-02

    Neuromodulation technologies such as vagus nerve stimulation and deep brain stimulation have shown some efficacy in controlling seizures in medically intractable patients. However, inherent patient-to-patient variability of seizure disorders leads to a wide range of therapeutic efficacy. A patient-specific approach to determining stimulation parameters may lead to increased therapeutic efficacy while minimizing stimulation energy and side effects. This paper presents a reinforcement learning algorithm that optimizes stimulation frequency for controlling seizures with minimum stimulation energy. We apply our method to a computational model called the Epileptor, which simulates inter-ictal and ictal local field potential data. In order to apply reinforcement learning to the Epileptor, we introduce a specialized reward function and state-space discretization. With the reward function and discretization fixed, we test the effectiveness of the temporal difference reinforcement learning algorithm (TD(0)). For periodic pulsatile stimulation, we derive a relation that describes, for any stimulation frequency, the minimal pulse amplitude required to suppress seizures. The TD(0) algorithm is able to identify parameters that control seizures quickly. Additionally, our results show that the TD(0) algorithm refines the stimulation frequency to minimize stimulation energy, thereby converging to optimal parameters reliably. An advantage of the TD(0) algorithm is that it is adaptive, so that the parameters necessary to control the seizures can change over time. We show that the algorithm can converge on the optimal solution in simulation with slow and fast inter-seizure intervals.
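
    TD(0) is the simplest temporal-difference method: it nudges each state's value toward a one-step bootstrapped return. A toy Python sketch follows; the state discretization, transition noise, and reward below are invented stand-ins, not the paper's Epileptor-derived states or published reward function.

      import numpy as np

      rng = np.random.default_rng(2)

      n_states = 5                  # hypothetical LFP-derived states; the last one means "seizing"
      V = np.zeros(n_states)
      alpha, gamma, energy_cost = 0.1, 0.95, 0.01

      s = 0
      for t in range(10000):
          stim = rng.uniform(0.0, 1.0)   # placeholder stimulation choice
          s_next = int(np.clip(s + rng.integers(-1, 2), 0, n_states - 1))
          r = -float(s_next == n_states - 1) - energy_cost * stim  # penalize seizures and energy
          V[s] += alpha * (r + gamma * V[s_next] - V[s])           # TD(0) update
          s = s_next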

  1. Molecular mechanisms underlying a cellular analogue of operant reward learning

    PubMed Central

    Lorenzetti, Fred D.; Baxter, Douglas A.; Byrne, John H.

    2008-01-01

    Operant conditioning is a ubiquitous but mechanistically poorly understood form of associative learning in which an animal learns the consequences of its behavior. Using a single-cell analogue of operant conditioning in neuron B51 of Aplysia, we examined second-messenger pathways engaged by activity and reward and how they may provide a biochemical association underlying operant learning. Conditioning was blocked by Rp-cAMP, a peptide inhibitor of PKA, a PKC inhibitor, and by expressing a dominant negative isoform of Ca2+-dependent PKC (apl-I). Thus, both PKA and PKC were necessary for operant conditioning. Injection of cAMP into B51 mimicked the effects of operant conditioning. Activation of PKC also mimicked conditioning, but was dependent on both cAMP and PKA, suggesting that PKC acted at some point upstream of PKA activation. Our results demonstrate how these molecules can interact to mediate operant conditioning in an individual neuron important for the expression of the conditioned behavior. PMID:18786364

  2. Ventral striatal control of appetitive motivation: role in ingestive behavior and reward-related learning.

    PubMed

    Kelley, Ann E

    2004-01-01

    The nucleus accumbens is a brain region that participates in the control of behaviors related to natural reinforcers, such as ingestion, sexual behavior, incentive and instrumental learning, and that also plays a role in addictive processes. This paper comprises a review of work from our laboratory that focuses on two main research areas: (i) the role of the nucleus accumbens in food motivation, and (ii) its putative functions in cellular plasticity underlying appetitive learning. First, work within a number of different behavioral paradigms has shown that accumbens neurochemical systems play specific and dissociable roles in different aspects of food seeking and food intake, and part of this function depends on integration with the lateral hypothalamus and amygdala. We propose that the nucleus accumbens integrates information related to cognitive, sensory, and emotional processing with hypothalamic mechanisms mediating energy balance. This system as a whole enables complex hierarchical control of adaptive ingestive behavior. Regarding the second research area, our studies examining acquisition of lever-pressing for food in rats have shown that activation of glutamate N-methyl-d-aspartate (NMDA) receptors, within broadly distributed but interconnected regions (nucleus accumbens core, posterior striatum, prefrontal cortex, basolateral and central amygdala), is critical for such learning to occur. This receptor stimulation triggers intracellular cascades that involve protein phosphorylation and new protein synthesis. It is hypothesized that activity in this distributed network (including D1 receptor activity) computes coincident events and thus enhances the probability that temporally related actions and events (e.g. lever pressing and delivery of reward) become associated. Such basic mechanisms of plasticity within this reinforcement learning network also appear to be profoundly affected in addiction.

  3. Stochastic reinforcement benefits skill acquisition.

    PubMed

    Dayan, Eran; Averbeck, Bruno B; Richmond, Barry J; Cohen, Leonardo G

    2014-02-14

    Learning complex skills is driven by reinforcement, which facilitates both online within-session gains and retention of the acquired skills. Yet, in ecologically relevant situations, skills are often acquired when mapping between actions and rewarding outcomes is unknown to the learning agent, resulting in reinforcement schedules of a stochastic nature. Here we trained subjects on a visuomotor learning task, comparing reinforcement schedules with higher, lower, or no stochasticity. Training under higher levels of stochastic reinforcement benefited skill acquisition, enhancing both online gains and long-term retention. These findings indicate that the enhancing effects of reinforcement on skill acquisition depend on reinforcement schedules.

  4. Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback.

    PubMed

    Tan, A H; Lu, N; Xiao, D

    2008-02-01

    This paper presents a neural architecture for learning category nodes encoding mappings across multimodal patterns involving sensory inputs, actions, and rewards. By integrating adaptive resonance theory (ART) and temporal difference (TD) methods, the proposed neural model, called TD fusion architecture for learning, cognition, and navigation (TD-FALCON), enables an autonomous agent to adapt and function in a dynamic environment with immediate as well as delayed evaluative feedback (reinforcement) signals. TD-FALCON learns the value functions of the state-action space estimated through on-policy and off-policy TD learning methods, specifically state-action-reward-state-action (SARSA) and Q-learning. The learned value functions are then used to determine the optimal actions based on an action selection policy. We have developed TD-FALCON systems using various TD learning strategies and compared their performance in terms of task completion, learning speed, as well as time and space efficiency. Experiments based on a minefield navigation task have shown that TD-FALCON systems are able to learn effectively with both immediate and delayed reinforcement and achieve a stable performance at a pace much faster than those of standard gradient-descent-based reinforcement learning systems.
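
    The two TD methods named in this record differ only in their bootstrap target. A generic tabular sketch in Python (not the TD-FALCON code itself), with Q as a per-state list of action values:

      def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
          # On-policy SARSA: bootstrap on the action actually taken next.
          Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

      def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
          # Off-policy Q-learning: bootstrap on the best available next action.
          Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

      # usage: Q = [[0.0] * n_actions for _ in range(n_states)]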

  5. Dopamine-mediated reinforcement learning signals in the striatum and ventromedial prefrontal cortex underlie value-based choices.

    PubMed

    Jocham, Gerhard; Klein, Tilmann A; Ullsperger, Markus

    2011-02-02

    A large body of evidence exists on the role of dopamine in reinforcement learning. Less is known about how dopamine shapes the relative impact of positive and negative outcomes to guide value-based choices. We combined administration of the dopamine D(2) receptor antagonist amisulpride with functional magnetic resonance imaging in healthy human volunteers. Amisulpride did not affect initial reinforcement learning. However, in a later transfer phase that involved novel choice situations requiring decisions between two symbols based on their previously learned values, amisulpride improved participants' ability to select the better of two highly rewarding options, while it had no effect on choices between two very poor options. During the learning phase, activity in the striatum encoded a reward prediction error. In the transfer phase, in the absence of any outcome, ventromedial prefrontal cortex (vmPFC) continually tracked the learned value of the available options on each trial. Both striatal prediction error coding and tracking of learned value in the vmPFC were predictive of subjects' choice performance in the transfer phase, and both were enhanced under amisulpride. These findings show that dopamine-dependent mechanisms enhance reinforcement learning signals in the striatum and sharpen representations of associative values in prefrontal cortex that are used to guide reinforcement-based decisions.

  6. Affective personality predictors of disrupted reward learning and pursuit in major depressive disorder.

    PubMed

    DelDonno, Sophie R; Weldon, Anne L; Crane, Natania A; Passarotti, Alessandra M; Pruitt, Patrick J; Gabriel, Laura B; Yau, Wendy; Meyers, Kortni K; Hsu, David T; Taylor, Stephen F; Heitzeg, Mary M; Herbener, Ellen; Shankman, Stewart A; Mickey, Brian J; Zubieta, Jon-Kar; Langenecker, Scott A

    2015-11-30

    Anhedonia, the diminished anticipation and pursuit of reward, is a core symptom of major depressive disorder (MDD). Trait behavioral activation (BA), as a proxy for anhedonia, and behavioral inhibition (BI) may moderate the relationship between MDD and reward-seeking. The present studies probed for reward learning deficits, potentially due to aberrant BA and/or BI, in active or remitted MDD individuals compared to healthy controls (HC). Active MDD (Study 1) and remitted MDD (Study 2) participants completed the modified monetary incentive delay task (mMIDT), a behavioral reward-seeking task whose response window parameters were individually titrated to theoretically elicit equivalent accuracy between groups. Participants completed the BI Scale and BA Reward-Responsiveness and Drive Scales. Despite individual titration, active MDD participants won significantly less money than HCs. Higher Reward-Responsiveness scores predicted more money won; Drive and BI were not predictive. Remitted MDD participants' performance did not differ from controls', and trait BA and BI measures did not predict r-MDD performance. These results suggest that diminished reward-responsiveness may contribute to decreased motivation and reward pursuit during active MDD, but that reward learning is intact in remission. Understanding individual reward processing deficits in MDD may inform personalized intervention addressing anhedonia and motivation deficits in select MDD patients. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  7. Rewarding properties of visual stimuli.

    PubMed

    Blatter, Katharina; Schultz, Wolfram

    2006-01-01

    The behavioral functions of rewards comprise the induction of learning and approach behavior. Rewards are not only related to vegetative states of hunger, thirst and reproduction but may also consist of visual stimuli. The present experiment tested the reward potential of different types of still and moving pictures in three operant tasks involving key press, touch of computer monitor and choice behavior in a laboratory environment. We found that all tested visual stimuli induced approach behavior in all three tasks, and that action movies sustained consistently higher rates of responding compared to changing still pictures, which were more effective than constant still pictures. These results demonstrate that visual stimuli can serve as positive reinforcers for operant reactions of animals in controlled laboratory settings. In particular, the coherently animated visual stimuli of movies have considerable reward potential. These observations would allow similar forms of visual rewards to be used for neurophysiological investigations of mechanisms related to non-vegetative rewards.

  8. Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats

    PubMed Central

    Lloyd, Kevin; Becker, Nadine; Jones, Matthew W.; Bogacz, Rafal

    2012-01-01

    Learning to form appropriate, task-relevant working memory representations is a complex process central to cognition. Gating models frame working memory as a collection of past observations and use reinforcement learning (RL) to solve the problem of when to update these observations. Investigation of how gating models relate to brain and behavior remains, however, at an early stage. The current study sought to explore the ability of simple RL gating models to replicate rule learning behavior in rats. Rats were trained in a maze-based spatial learning task that required animals to make trial-by-trial choices contingent upon their previous experience. Using an abstract version of this task, we tested the ability of two gating algorithms, one based on the Actor-Critic and the other on the State-Action-Reward-State-Action (SARSA) algorithm, to generate behavior consistent with the rats'. Both models produced rule-acquisition behavior consistent with the experimental data, though only the SARSA gating model mirrored faster learning following rule reversal. We also found that both gating models learned multiple strategies in solving the initial task, a property which highlights the multi-agent nature of such models and which is of importance in considering the neural basis of individual differences in behavior. PMID:23115551

  9. What motivates adolescents? Neural responses to rewards and their influence on adolescents' risk taking, learning, and cognitive control.

    PubMed

    van Duijvenvoorde, Anna C K; Peters, Sabine; Braams, Barbara R; Crone, Eveline A

    2016-11-01

    Adolescence is characterized by pronounced changes in motivated behavior, during which emphasis on potential rewards may result in an increased tendency to approach things that are novel and bring potential for positive reinforcement. While this may result in risky and health-endangering behavior, it may also lead to positive consequences, such as behavioral flexibility and greater learning. In this review we will discuss both the maladaptive and adaptive properties of heightened reward-sensitivity in adolescents by reviewing recent cognitive neuroscience findings in relation to behavioral outcomes. First, we identify brain regions involved in processing rewards in adults and adolescents. Second, we discuss how functional changes in reward-related brain activity during adolescence are related to two behavioral domains: risk taking and cognitive control. Finally, we conclude that progress lies in new levels of explanation by further integration of neural results with behavioral theories and computational models. In addition, we highlight that longitudinal measures, and a better conceptualization of adolescence and environmental determinants, are of crucial importance for understanding positive and negative developmental trajectories.

  10. Connectionist reinforcement learning of robot control skills

    NASA Astrophysics Data System (ADS)

    Araújo, Rui; Nunes, Urbano; de Almeida, A. T.

    1998-07-01

    Many robot manipulator tasks are difficult to model explicitly and it is difficult to design and program automatic control algorithms for them. The development, improvement, and application of learning techniques taking advantage of sensory information would enable the acquisition of new robot skills and avoid some of the difficulties of explicit programming. In this paper we use a reinforcement learning approach for on-line generation of skills for control of robot manipulator systems. Instead of generating skills by explicit programming of a perception to action mapping, they are generated by trial and error learning, guided by a performance evaluation feedback function. The resulting system may be seen as an anticipatory system that constructs an internal representation model of itself and of its environment. This enables it to identify its current situation and to generate corresponding appropriate commands to the system in order to perform the required skill. The method was applied to the problem of learning a force control skill in which the tool-tip of a robot manipulator must be moved from free space to a contact state with a compliant surface while maintaining a constant interaction force.

  11. Stochastic Reinforcement Benefits Skill Acquisition

    ERIC Educational Resources Information Center

    Dayan, Eran; Averbeck, Bruno B.; Richmond, Barry J.; Cohen, Leonardo G.

    2014-01-01

    Learning complex skills is driven by reinforcement, which facilitates both online within-session gains and retention of the acquired skills. Yet, in ecologically relevant situations, skills are often acquired when mapping between actions and rewarding outcomes is unknown to the learning agent, resulting in reinforcement schedules of a stochastic…

  13. Updating dopamine reward signals

    PubMed Central

    Schultz, Wolfram

    2013-01-01

    Recent work has advanced our knowledge of phasic dopamine reward prediction error signals. The error signal is bidirectional, reflects well the higher order prediction error described by temporal difference learning models, is compatible with model-free and model-based reinforcement learning, reports the subjective rather than physical reward value during temporal discounting and reflects subjective stimulus perception rather than physical stimulus aspects. Dopamine activations are primarily driven by reward, and to some extent risk, whereas punishment and salience have only limited activating effects when appropriate controls are respected. The signal is homogeneous in terms of time course but heterogeneous in many other aspects. It is essential for synaptic plasticity and a range of behavioural learning situations. PMID:23267662
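
    For reference, the temporal-difference prediction error that such phasic dopamine responses are thought to report is conventionally written (in standard notation, not notation specific to this review) as

      \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

    where a positive \delta_t marks an outcome better than predicted and a negative \delta_t one worse than predicted, matching the bidirectional signal described above.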

  14. Choice as a function of reinforcer "hold": from probability learning to concurrent reinforcement.

    PubMed

    Jensen, Greg; Neuringer, Allen

    2008-10-01

    Two procedures commonly used to study choice are concurrent reinforcement and probability learning. Under concurrent-reinforcement procedures, once a reinforcer is scheduled, it remains available indefinitely until collected. Therefore reinforcement becomes increasingly likely with passage of time or responses on other operanda. Under probability learning, reinforcer probabilities are constant and independent of passage of time or responses. Therefore a particular reinforcer is gained or not, on the basis of a single response, and potential reinforcers are not retained, as when betting at a roulette wheel. In the "real" world, continued availability of reinforcers often lies between these two extremes, with potential reinforcers being lost owing to competition, maturation, decay, and random scatter. The authors parametrically manipulated the likelihood of continued reinforcer availability, defined as hold, and examined the effects on pigeons' choices. Choices varied as power functions of obtained reinforcers under all values of hold. Stochastic models provided generally good descriptions of choice emissions with deviations from stochasticity systematically related to hold. Thus, a single set of principles accounted for choices across hold values that represent a wide range of real-world conditions.
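
    The hold manipulation is easy to state procedurally: each operandum is armed with a fixed probability per trial, and an uncollected reinforcer persists to the next trial with probability hold, so hold = 1 approximates concurrent reinforcement and hold = 0 approximates probability learning. The Python sketch below illustrates such a schedule; the arming probabilities and the random stand-in for the subject's choices are invented.

      import numpy as np

      rng = np.random.default_rng(3)

      def run_session(hold, p_arm=(0.10, 0.05), n_trials=5000):
          armed = [False, False]
          earned = [0, 0]
          for _ in range(n_trials):
              for i in range(2):
                  armed[i] = armed[i] or (rng.random() < p_arm[i])
              choice = int(rng.integers(0, 2))      # stand-in for the subject's choice
              if armed[choice]:
                  earned[choice] += 1               # collect the scheduled reinforcer
                  armed[choice] = False
              other = 1 - choice
              if armed[other] and rng.random() > hold:
                  armed[other] = False              # uncollected reinforcer is lost
          return earned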

  15. Reinforcement Learning for the Adaptive Control of Perception and Action

    DTIC Science & Technology

    1992-02-01

    This dissertation applies reinforcement learning to the adaptive control of active sensory-motor systems. Active sensory-motor systems, in addition...distinct states in the external world. This phenomenon, called perceptual aliasing, is shown to destabilize existing reinforcement learning algorithms

  16. Reinforcement of Science Learning through Local Culture: A Delphi Study

    ERIC Educational Resources Information Center

    Nuangchalerm, Prasart

    2008-01-01

    This study aims to explore the ways to reinforce science learning through local culture by using Delphi technique. Twenty four participants in various fields of study were selected. The result of study provides a framework for reinforcement of science learning through local culture on the theme life and environment. (Contains 1 table.)

  17. Punishment Insensitivity and Impaired Reinforcement Learning in Preschoolers

    ERIC Educational Resources Information Center

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2014-01-01

    Background: Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a…

  18. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia

    PubMed Central

    Markou, Athina; Salamone, John D.; Bussey, Timothy; Mar, Adam; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-01-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. PMID:23994273

  19. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia.

    PubMed

    Markou, Athina; Salamone, John D; Bussey, Timothy J; Mar, Adam C; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-11-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits.

  20. DAT isn't all that: cocaine reward and reinforcement require Toll-like receptor 4 signaling.

    PubMed

    Northcutt, A L; Hutchinson, M R; Wang, X; Baratta, M V; Hiranita, T; Cochran, T A; Pomrenze, M B; Galer, E L; Kopajtic, T A; Li, C M; Amat, J; Larson, G; Cooper, D C; Huang, Y; O'Neill, C E; Yin, H; Zahniser, N R; Katz, J L; Rice, K C; Maier, S F; Bachtell, R K; Watkins, L R

    2015-12-01

    The initial reinforcing properties of drugs of abuse, such as cocaine, are largely attributed to their ability to activate the mesolimbic dopamine system. Resulting increases in extracellular dopamine in the nucleus accumbens (NAc) are traditionally thought to result from cocaine's ability to block dopamine transporters (DATs). Here we demonstrate that cocaine also interacts with the immunosurveillance receptor complex, Toll-like receptor 4 (TLR4), on microglial cells to initiate central innate immune signaling. Disruption of cocaine signaling at TLR4 suppresses cocaine-induced extracellular dopamine in the NAc, as well as cocaine conditioned place preference and cocaine self-administration. These results provide a novel understanding of the neurobiological mechanisms underlying cocaine reward/reinforcement that includes a critical role for central immune signaling, and offer a new target for medication development for cocaine abuse treatment.

  1. A Discussion of Possibility of Reinforcement Learning Using Event-Related Potential in BCI

    NASA Astrophysics Data System (ADS)

    Yamagishi, Yuya; Tsubone, Tadashi; Wada, Yasuhiro

    Recently, brain-computer interfaces (BCIs), which provide a direct pathway between a human brain and an external device such as a computer or a robot, have attracted a great deal of attention. Because a BCI can control machines such as robots through brain activity alone, without recourse to voluntary muscles, it may become a useful communication tool for handicapped persons, for instance, amyotrophic lateral sclerosis patients. However, in order to realize a BCI system that can perform precise tasks in various environments, it is necessary to design control rules that adapt to dynamic environments. Reinforcement learning is one approach to designing such control rules. If reinforcement learning could be driven by brain activity itself, it would lead to a BCI with general versatility. In this research, we focused on the P300 event-related potential as an alternative to the reward signal in reinforcement learning. We discriminated between success and failure trials from single-trial EEG P300 responses using a proposed discrimination algorithm based on a support vector machine. The possibility of reinforcement learning was examined in terms of the number of correctly discriminated trials, and it was shown that learning would be possible in most subjects.
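
    Single-trial discrimination of this kind is commonly implemented with a linear SVM over post-feedback EEG samples. A minimal Python sketch using scikit-learn follows; the feature matrix here is random placeholder data, not the study's EEG, and the paper's actual algorithm may differ in preprocessing and kernel choice.

      import numpy as np
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      rng = np.random.default_rng(4)

      # placeholder single-trial features, e.g. amplitudes in a window around 300 ms
      X = rng.normal(size=(200, 64))          # 200 trials x 64 features
      y = rng.integers(0, 2, size=200)        # success (1) vs. failure (0) labels

      clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
      scores = cross_val_score(clf, X, y, cv=5)   # per-fold classification accuracy
      print(scores.mean())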

  2. Comparing the neural basis of monetary reward and cognitive feedback during information-integration category learning.

    PubMed

    Daniel, Reka; Pollmann, Stefan

    2010-01-06

    The dopaminergic system is known to play a central role in reward-based learning (Schultz, 2006), yet it was also observed to be involved when only cognitive feedback is given (Aron et al., 2004). Within the domain of information-integration category learning, in which information from several stimulus dimensions has to be integrated predecisionally (Ashby and Maddox, 2005), the importance of contingent feedback is well established (Maddox et al., 2003). We examined the common neural correlates of reward anticipation and prediction error in this task. Sixteen subjects performed two parallel information-integration tasks within a single event-related functional magnetic resonance imaging session but received a monetary reward only for one of them. Similar functional areas including basal ganglia structures were activated in both task versions. In contrast, a single structure, the nucleus accumbens, showed higher activation during monetary reward anticipation compared with the anticipation of cognitive feedback in information-integration learning. Additionally, this activation was predicted by measures of intrinsic motivation in the cognitive feedback task and by measures of extrinsic motivation in the rewarded task. Our results indicate that, although all other structures implicated in category learning are not significantly affected by altering the type of reward, the nucleus accumbens responds to the positive incentive properties of an expected reward depending on the specific type of the reward.

  3. Stimulus-Reward Association and Reversal Learning in Individuals with Asperger Syndrome

    ERIC Educational Resources Information Center

    Zalla, Tiziana; Sav, Anca-Maria; Leboyer, Marion

    2009-01-01

    In the present study, performance of a group of adults with Asperger Syndrome (AS) on two series of object reversal and extinction was compared with that of a group of adults with typical development. Participants were requested to learn a stimulus-reward association rule and monitor changes in reward value of stimuli in order to gain as many…

  4. The Roles of Dopamine and Related Compounds in Reward-Seeking Behavior Across Animal Phyla

    PubMed Central

    Barron, Andrew B.; Søvik, Eirik; Cornish, Jennifer L.

    2010-01-01

    Motile animals actively seek out and gather resources they find rewarding, and this is an extremely powerful organizer and motivator of animal behavior. Mammalian studies have revealed interconnected neurobiological systems for reward learning, reward assessment, reinforcement and reward-seeking; all involving the biogenic amine dopamine. The neurobiology of reward-seeking behavioral systems is less well understood in invertebrates, but in many diverse invertebrate groups, reward learning and responses to food rewards also involve dopamine. The obvious exceptions are the arthropods in which the chemically related biogenic amine octopamine has a greater effect on reward learning and reinforcement than dopamine. Here we review the functions of these biogenic amines in behavioral responses to rewards in different animal groups, and discuss these findings in an evolutionary context. PMID:21048897

  5. Episodic memory encoding interferes with reward learning and decreases striatal prediction errors.

    PubMed

    Wimmer, G Elliott; Braun, Erin Kendall; Daw, Nathaniel D; Shohamy, Daphna

    2014-11-05

    Learning is essential for adaptive decision making. The striatum and its dopaminergic inputs are known to support incremental reward-based learning, while the hippocampus is known to support encoding of single events (episodic memory). Although traditionally studied separately, in even simple experiences, these two types of learning are likely to co-occur and may interact. Here we sought to understand the nature of this interaction by examining how incremental reward learning is related to concurrent episodic memory encoding. During the experiment, human participants made choices between two options (colored squares), each associated with a drifting probability of reward, with the goal of earning as much money as possible. Incidental, trial-unique object pictures, unrelated to the choice, were overlaid on each option. The next day, participants were given a surprise memory test for these pictures. We found that better episodic memory was related to a decreased influence of recent reward experience on choice, both within and across participants. fMRI analyses further revealed that during learning the canonical striatal reward prediction error signal was significantly weaker when episodic memory was stronger. This decrease in reward prediction error signals in the striatum was associated with enhanced functional connectivity between the hippocampus and striatum at the time of choice. Our results suggest a mechanism by which memory encoding may compete for striatal processing and provide insight into how interactions between different forms of learning guide reward-based decision making.

  6. Episodic Memory Encoding Interferes with Reward Learning and Decreases Striatal Prediction Errors

    PubMed Central

    Braun, Erin Kendall; Daw, Nathaniel D.

    2014-01-01

    Learning is essential for adaptive decision making. The striatum and its dopaminergic inputs are known to support incremental reward-based learning, while the hippocampus is known to support encoding of single events (episodic memory). Although traditionally studied separately, in even simple experiences, these two types of learning are likely to co-occur and may interact. Here we sought to understand the nature of this interaction by examining how incremental reward learning is related to concurrent episodic memory encoding. During the experiment, human participants made choices between two options (colored squares), each associated with a drifting probability of reward, with the goal of earning as much money as possible. Incidental, trial-unique object pictures, unrelated to the choice, were overlaid on each option. The next day, participants were given a surprise memory test for these pictures. We found that better episodic memory was related to a decreased influence of recent reward experience on choice, both within and across participants. fMRI analyses further revealed that during learning the canonical striatal reward prediction error signal was significantly weaker when episodic memory was stronger. This decrease in reward prediction error signals in the striatum was associated with enhanced functional connectivity between the hippocampus and striatum at the time of choice. Our results suggest a mechanism by which memory encoding may compete for striatal processing and provide insight into how interactions between different forms of learning guide reward-based decision making. PMID:25378157

  7. Establishing the dopamine dependency of human striatal signals during reward and punishment reversal learning.

    PubMed

    van der Schaaf, Marieke E; van Schouwenburg, Martine R; Geurts, Dirk E M; Schellekens, Arnt F A; Buitelaar, Jan K; Verkes, Robbert Jan; Cools, Roshan

    2014-03-01

    Drugs that alter dopamine transmission have opposite effects on reward and punishment learning. These opposite effects have been suggested to depend on dopamine in the striatum. Here, we establish for the first time the neurochemical specificity of such drug effects, during reward and punishment learning in humans, by adopting a coadministration design. Participants (N = 22) were scanned on 4 occasions using functional magnetic resonance imaging, following intake of placebo, bromocriptine (dopamine-receptor agonist), sulpiride (dopamine-receptor antagonist), or a combination of both drugs. A reversal-learning task was employed, in which both unexpected rewards and punishments signaled reversals. Drug effects were stratified with baseline working memory to take into account individual variations in drug response. Sulpiride induced parallel span-dependent changes on striatal blood oxygen level-dependent (BOLD) signal during unexpected rewards and punishments. These drug effects were found to be partially dopamine-dependent, as they were blocked by coadministration with bromocriptine. In contrast, sulpiride elicited opposite effects on behavioral measures of reward and punishment learning. Moreover, sulpiride-induced increases in striatal BOLD signal during both outcomes were associated with behavioral improvement in reward versus punishment learning. These results provide a strong support for current theories, suggesting that drug effects on reward and punishment learning are mediated via striatal dopamine.

  8. CLEANing the Reward: Counterfactual Actions to Remove Exploratory Action Noise in Multiagent Learning

    NASA Technical Reports Server (NTRS)

    HolmesParker, Chris; Taylor, Mathew E.; Tumer, Kagan; Agogino, Adrian

    2014-01-01

    Learning in multiagent systems can be slow because agents must learn both how to behave in a complex environment and how to account for the actions of other agents. The inability of an agent to distinguish between the true environmental dynamics and those caused by the stochastic exploratory actions of other agents creates noise in each agent's reward signal. This learning noise can have unforeseen and often undesirable effects on the resultant system performance. We define such noise as exploratory action noise, demonstrate the critical impact it can have on the learning process in multiagent settings, and introduce a reward structure to effectively remove such noise from each agent's reward signal. In particular, we introduce Coordinated Learning without Exploratory Action Noise (CLEAN) rewards and empirically demonstrate their benefits.
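
    CLEAN builds on counterfactual reward shaping of this general kind. The Python sketch below shows a generic difference-style reward, scoring an agent by how much the global utility changes when its action is swapped for a fixed default; it illustrates the principle, not the paper's exact reward definition.

      def counterfactual_reward(joint_action, i, default_action, global_utility):
          """Return the change in global utility attributable to agent i
          choosing its current action instead of a fixed default, so that
          the stochastic choices of other agents largely cancel out of
          agent i's signal. `global_utility` maps a joint action (list)
          to a scalar."""
          g = global_utility(joint_action)
          cf = list(joint_action)
          cf[i] = default_action
          return g - global_utility(cf)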

  9. Dissecting components of reward: ‘liking’, ‘wanting’, and learning

    PubMed Central

    Berridge, Kent C; Robinson, Terry E; Aldridge, J Wayne

    2009-01-01

    In recent years significant progress has been made delineating the psychological components of reward and their underlying neural mechanisms. Here we briefly highlight findings on three dissociable psychological components of reward: ‘liking’ (hedonic impact), ‘wanting’ (incentive salience), and learning (predictive associations and cognitions). A better understanding of the components of reward, and their neurobiological substrates, may help in devising improved treatments for disorders of mood and motivation, ranging from depression to eating disorders, drug addiction, and related compulsive pursuits of rewards. PMID:19162544

  10. Reinforcement learning and counterfactual reasoning explain adaptive behavior in a changing environment.

    PubMed

    Zhang, Yunfeng; Paik, Jaehyon; Pirolli, Peter

    2015-04-01

    Animals routinely adapt to changes in the environment in order to survive. Though reinforcement learning may play a role in such adaptation, it is not clear that it is the only mechanism involved, as it is not well suited to producing rapid, relatively immediate changes in strategy in response to environmental changes. This research proposes that counterfactual reasoning might be an additional mechanism that facilitates change detection. An experiment was conducted in which the task state changed over time and participants had to detect the changes in order to perform well and gain monetary rewards. A cognitive model was constructed that combines reinforcement learning with counterfactual reasoning to quickly adjust the utility of task strategies in response to changes. The results show that the model accurately explains the human data and that counterfactual reasoning is key to reproducing the various effects observed in this change-detection paradigm. Copyright © 2015 Cognitive Science Society, Inc.
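
    The model itself is not reproduced in this record; as a minimal sketch of the mechanism described (names are illustrative, not the authors'), the Python fragment below applies a standard reinforcement-learning update to the chosen strategy and counterfactual updates to the strategies not chosen, so all utilities can shift quickly when the environment changes.

      def update_strategy_utilities(utilities, chosen, payoff, foregone, alpha=0.2):
          """utilities: dict strategy -> utility; foregone: dict of
          counterfactually inferred payoffs for strategies not taken."""
          utilities[chosen] += alpha * (payoff - utilities[chosen])  # RL update
          for strategy, cf_payoff in foregone.items():
              if strategy != chosen:
                  # Counterfactual reasoning: learn from what would have happened.
                  utilities[strategy] += alpha * (cf_payoff - utilities[strategy])
          return utilities

      # Example: strategy A paid off; B would not have.
      U = update_strategy_utilities({"A": 0.5, "B": 0.5}, "A",
                                    payoff=1.0, foregone={"B": 0.0})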

  11. Reinforcement learning in complementarity game and population dynamics.

    PubMed

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005)] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This comparison gives insight into issues such as quick adaptation versus systematic exploration and the role of learning rates.
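
    For readers unfamiliar with the three schemes, the Python sketch below gives minimal textbook forms (the paper's exact parameterizations, such as forgetting and experimentation terms, are omitted); as one plausible reading of the abstract, the "modified Roth-Erev" is rendered by raising propensities to the power 1.5 in the choice rule.

      import math

      def roth_erev_probs(propensities, power=1.5):
          """power=1.0 recovers the standard Roth-Erev choice rule."""
          w = [q ** power for q in propensities]
          return [wi / sum(w) for wi in w]

      def roth_erev_update(propensities, action, payoff):
          propensities[action] += payoff       # accumulate received payoffs
          return propensities

      def softmax_probs(values, temperature=0.1):
          w = [math.exp(v / temperature) for v in values]
          return [wi / sum(w) for wi in w]

      def bush_mosteller_update(probs, action, payoff, rate=0.1):
          """Shift choice probabilities toward the action just taken,
          in proportion to the payoff received (payoff in [0, 1])."""
          target = [1.0 if a == action else 0.0 for a in range(len(probs))]
          return [p + rate * payoff * (t - p) for p, t in zip(probs, target)]

      # Example round of the modified Roth-Erev scheme:
      q = [1.0, 1.0]
      probs = roth_erev_probs(q)               # [0.5, 0.5]
      q = roth_erev_update(q, action=0, payoff=1.0)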

  12. Response-reinforcement learning is dependent on N-methyl-D-aspartate receptor activation in the nucleus accumbens core.

    PubMed

    Kelley, A E; Smith-Roe, S L; Holahan, M R

    1997-10-28

    The nucleus accumbens, a site within the ventral striatum, is best known for its prominent role in mediating the reinforcing effects of drugs of abuse such as cocaine, alcohol, and nicotine. Indeed, it is generally believed that this structure subserves motivated behaviors, such as feeding, drinking, sexual behavior, and exploratory locomotion, which are elicited by natural rewards or incentive stimuli. A basic rule of positive reinforcement is that motor responses will increase in magnitude and vigor if followed by a rewarding event. It is likely, therefore, that the nucleus accumbens may serve as a substrate for reinforcement learning. However, there is surprisingly little information concerning the neural mechanisms by which appetitive responses are learned. In the present study, we report that treatment of the nucleus accumbens core with the selective competitive N-methyl-D-aspartate (NMDA) antagonist 2-amino-5-phosphonopentanoic acid (AP-5; 5 nmol/0.5 microl bilaterally) impairs response-reinforcement learning in the acquisition of a simple lever-press task to obtain food. Once the rats learned the task, AP-5 had no effect, demonstrating the requirement of NMDA receptor-dependent plasticity in the early stages of learning. Infusion of AP-5 into the accumbens shell produced a much smaller impairment of learning. Additional experiments showed that AP-5 core-treated rats had normal feeding and locomotor responses and were capable of acquiring stimulus-reward associations. We hypothesize that stimulation of NMDA receptors within the accumbens core is a key process through which motor responses become established in response to reinforcing stimuli. Further, this mechanism may also play a critical role in the motivational and addictive properties of drugs of abuse.

  13. Affective modulation of the startle reflex and the Reinforcement Sensitivity Theory of personality: The role of sensitivity to reward.

    PubMed

    Aluja, Anton; Blanch, Angel; Blanco, Eduardo; Balada, Ferran

    2015-01-01

    This study evaluated differences in startle reflex amplitude in relation to the Sensitivity to Reward (SR) and Sensitivity to Punishment (SP) personality variables of the Reinforcement Sensitivity Theory (RST). We hypothesized that subjects with higher scores on SR would show a larger startle reflex when exposed to pleasant pictures than those with lower scores, while subjects with higher scores on SP would show a larger startle reflex when exposed to unpleasant pictures than those with lower scores on this dimension. The sample consisted of 112 healthy female undergraduate psychology students. Personality was assessed using the short version of the Sensitivity to Punishment and Sensitivity to Reward Questionnaire (SPSRQ). Laboratory anxiety was controlled with the State Anxiety Inventory. The startle blink reflex was recorded electromyographically (EMG) from the right orbicularis oculi muscle in response to pleasant, neutral, and unpleasant pictures from the International Affective Picture System (IAPS). Subjects higher in SR showed a significantly larger startle reflex response to pleasant pictures than lower scorers (48.48 vs 46.28, p<0.012). Subjects with higher scores on SP showed a slight tendency toward larger startle responses to unpleasant pictures in a non-parametric local regression (LOESS) graphical analysis. The findings shed light on the relationships among impulsive-disinhibited personality, sensitivity to reward, and emotions evoked by pictures with emotional content. Copyright © 2014 Elsevier Inc. All rights reserved.

  14. Relative reinforcing value of food and delayed reward discounting in obesity and disordered eating: A systematic review.

    PubMed

    Stojek, Monika M K; MacKillop, James

    2017-07-01

    Understanding food choice decision-making may help identify those at higher risk for excess weight gain and dysregulated eating patterns. This paper systematically reviews the literature on eating behavior and the behavioral economic constructs of relative reinforcing value of food (RRVfood) and delayed reward discounting (DRD). RRVfood characterizes how valuable energy-dense food is to the individual, and DRD characterizes preferences for smaller immediate rewards over larger future rewards, an index of impulsivity. A literature search of PubMed was conducted using combinations of terms involving behavioral economics and dysregulated eating in youth and adults. Forty-seven articles were reviewed. There is consistent evidence that obese youth and adults exhibit higher RRVfood. There is a need for more research on the role of RRVfood in eating disorders, as too few studies exist to draw meaningful conclusions. There is accumulating evidence that obese individuals show steeper DRD, but the study of moderators of this relationship is crucial. Only a small number of studies have been conducted on DRD and binge eating, and no clear conclusions can be drawn at present. Approximately half of the existing studies suggest lower DRD in individuals with anorexia nervosa. Research implications and treatment applications are discussed. Copyright © 2017 Elsevier Ltd. All rights reserved.
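
    The review does not commit to a functional form, but DRD in this literature is commonly quantified with a hyperbolic discount function V = A / (1 + kD), where a steeper k indexes greater impulsivity. Under that assumption, and with illustrative amounts, the worked Python example below shows how k determines the choice between a smaller immediate and a larger delayed reward.

      def discounted_value(amount, delay_days, k):
          """Hyperbolic discounting: present value of a delayed reward."""
          return amount / (1.0 + k * delay_days)

      # $100 in 30 days versus $60 now, for a patient and an impulsive k:
      for k in (0.01, 0.1):
          v = discounted_value(100, 30, k)
          choice = "delayed" if v > 60 else "immediate"
          print(f"k={k}: delayed reward is worth {v:.1f} now -> chooses {choice}")

    With k=0.01 the delayed $100 is worth about $76.9 now and is preferred; with k=0.1 it is worth only $25.0, so the smaller immediate reward wins, which is the impulsive pattern described above.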

  15. Value Learning and Arousal in the Extinction of Probabilistic Rewards: The Role of Dopamine in a Modified Temporal Difference Model

    PubMed Central

    Song, Minryung R.; Fellous, Jean-Marc

    2014-01-01

    Because most rewarding events are probabilistic and changing, the extinction of probabilistic rewards is important for survival. It has been proposed that the extinction of probabilistic rewards depends on arousal and the amount of learning of reward values. Midbrain dopamine neurons were suggested to play a role in both arousal and learning reward values. Despite extensive research on modeling dopaminergic activity in reward learning (e.g. temporal difference models), few studies have been done on modeling its role in arousal. Although temporal difference models capture key characteristics of dopaminergic activity during the extinction of deterministic rewards, they have been less successful at simulating the extinction of probabilistic rewards. By adding an arousal signal to a temporal difference model, we were able to simulate the extinction of probabilistic rewards and its dependence on the amount of learning. Our simulations propose that arousal allows the probability of reward to have lasting effects on the updating of reward value, which slows the extinction of low probability rewards. Using this model, we predicted that, by signaling the prediction error, dopamine determines the learned reward value that has to be extinguished during extinction and participates in regulating the size of the arousal signal that controls the learning rate. These predictions were supported by pharmacological experiments in rats. PMID:24586823
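
    The published equations are not reproduced in this record; one plausible minimal reading of the proposal, sketched in Python below, is a standard temporal difference update whose learning rate is gated by an arousal signal that itself tracks recent surprise.

      def td_arousal_step(value, reward, arousal, alpha=0.1, arousal_rate=0.05):
          """One extinction trial for a single predictive state (a sketch,
          not the authors' exact model)."""
          delta = reward - value                             # prediction error
          value += alpha * arousal * delta                   # arousal gates learning
          arousal += arousal_rate * (abs(delta) - arousal)   # arousal tracks surprise
          return value, arousal, delta

      # Example extinction trial: reward omitted after partial training.
      value, arousal, delta = td_arousal_step(value=0.4, reward=0.0, arousal=0.3)

    Under this sketch, values acquired under low reward probability enter extinction with lower arousal and hence a lower effective learning rate, which is one way to produce the slowed extinction of low-probability rewards that the abstract describes.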

  17. The role of GABAB receptors in human reinforcement learning.

    PubMed

    Ort, Andres; Kometer, Michael; Rohde, Judith; Seifritz, Erich; Vollenweider, Franz X

    2014-10-01

    Behavioral evidence from human studies suggests that the γ-aminobutyric acid type B receptor (GABAB receptor) agonist baclofen modulates reinforcement learning and reduces craving in patients with addiction spectrum disorders. However, in contrast to the well-established role of dopamine in reinforcement learning, the mechanisms by which the GABAB receptor influences reinforcement learning in humans remain completely unknown. To further elucidate this issue, a cross-over, double-blind, placebo-controlled study was performed in healthy human subjects (N=15) to test the effects of baclofen (20 and 50 mg p.o.) on probabilistic reinforcement learning. Outcomes were the feedback-induced P2 component of the event-related potential, the feedback-related negativity, and the P300 component of the event-related potential. Baclofen produced a reduction of P2 amplitude over the course of the experiment, but did not modulate the feedback-related negativity. Furthermore, there was a trend towards increased learning after baclofen administration relative to placebo over the course of the experiment. The present results extend previous theories of reinforcement learning, which focus on the importance of mesolimbic dopamine signaling, and indicate that stimulation of cortical GABAB receptors in a fronto-parietal network leads to better attentional allocation in reinforcement learning. This observation is a first step in our understanding of how baclofen may improve reinforcement learning in healthy subjects. Further studies with larger sample sizes are needed to corroborate this conclusion and to test this effect in patients with addiction spectrum disorders.

  18. How Food as a Reward Is Detrimental to Children's Health, Learning, and Behavior

    ERIC Educational Resources Information Center

    Fedewa, Alicia L.; Davis, Matthew Cody

    2015-01-01

    Background: Despite small- and wide-scale prevention efforts to curb obesity, the percentage of children classified as overweight and obese has remained relatively consistent in the last decade. As school personnel are increasingly pressured to enhance student performance, many educators use food as a reward to motivate and reinforce positive…

  20. Repeated electrical stimulation of reward-related brain regions affects cocaine but not "natural" reinforcement.

    PubMed

    Levy, Dino; Shabat-Simon, Maytal; Shalev, Uri; Barnea-Ygael, Noam; Cooper, Ayelet; Zangen, Abraham

    2007-12-19

    Drug addiction is associated with long-lasting neuronal adaptations including alterations in dopamine and glutamate receptors in the brain reward system. Treatment strategies for cocaine addiction and especially the prevention of craving and relapse are limited, and their effectiveness is still questionable. We hypothesized that repeated stimulation of the brain reward system can induce localized neuronal adaptations that may either potentiate or reduce addictive behaviors. The present study was designed to test how repeated interference with the brain reward system, using localized electrical stimulation of the medial forebrain bundle at the lateral hypothalamus (LH) or the prefrontal cortex (PFC), affects cocaine addiction-associated behaviors and some of the neuronal adaptations induced by repeated exposure to cocaine. Repeated high-frequency stimulation at either site influenced cocaine-related, but not sucrose-related, reward behaviors. Stimulation of the LH reduced cue-induced seeking behavior, whereas stimulation of the PFC reduced both cocaine-seeking behavior and the motivation for its consumption. The behavioral findings were accompanied by glutamate receptor subtype alterations in the nucleus accumbens and the ventral tegmental area, both key structures of the reward system. It is therefore suggested that repeated electrical stimulation of the PFC could become a novel strategy for treating addiction.

  1. Comparing rewarding and reinforcing properties between 'bath salt' 3,4-methylenedioxypyrovalerone (MDPV) and cocaine using ultrasonic vocalizations in rats.

    PubMed

    Simmons, Steven J; Gregg, Ryan A; Tran, Fionya H; Mo, Lili; von Weltin, Eva; Barker, David J; Gentile, Taylor A; Watterson, Lucas R; Rawls, Scott M; Muschamp, John W

    2016-12-01

    Abuse of synthetic psychostimulants like synthetic cathinones has risen in recent years. 3,4-Methylenedioxypyrovalerone (MDPV) is one such synthetic cathinone that demonstrates a mechanism of action similar to cocaine. Compared to cocaine, MDPV is more potent at blocking dopamine and norepinephrine reuptake and is readily self-administered by rodents. The present study compared the rewarding and reinforcing properties of MDPV and cocaine using systemic injection dose-response and self-administration models. Fifty-kilohertz ultrasonic vocalizations (USVs) were recorded as an index of positive affect throughout the experiments. In Experiment 1, MDPV and cocaine dose-dependently elicited 50-kHz USVs upon systemic injection, but MDPV increased USVs at greater rates and with greater persistence relative to cocaine. In Experiment 2, latency to begin MDPV self-administration was shorter than latency to begin cocaine self-administration, and self-administered MDPV elicited greater and more persistent rates of 50-kHz USVs than cocaine. MDPV-elicited 50-kHz USVs were sustained over the course of drug load-up, whereas cocaine-elicited USVs waned following the initial infusions. Notably, we observed robust context-elicited 50-kHz USVs from both MDPV- and cocaine-self-administering rats. Collectively, these data suggest that MDPV has powerful rewarding and reinforcing effects relative to cocaine at one-tenth the dose. Consistent with prior work, we additionally interpret these data as supporting the view that MDPV carries significant abuse risk based on its potency and subjectively positive effects. Future studies will be needed to refine therapeutic strategies aimed at reducing the rewarding effects of cathinone analogs, in an effort ultimately to reduce abuse liability. © 2016 Society for the Study of Addiction.

  2. Deficient reinforcement learning in medial frontal cortex as a model of dopamine-related motivational deficits in ADHD.

    PubMed

    Silvetti, Massimo; Wiersema, Jan R; Sonuga-Barke, Edmund; Verguts, Tom

    2013-10-01

    Attention Deficit/Hyperactivity Disorder (ADHD) is a pathophysiologically complex and heterogeneous condition with both cognitive and motivational components. We propose a novel computational hypothesis of motivational deficits in ADHD, drawing together recent evidence on the role of anterior cingulate cortex (ACC) and associated mesolimbic dopamine circuits in both reinforcement learning and ADHD. Based on findings of dopamine dysregulation and ACC involvement in ADHD, we simulated a lesion in a previously validated computational model of ACC (the Reward Value and Prediction Model, RVPM). We explored the effects of the lesion on the processing of reinforcement signals. We tested specific behavioral predictions about the profile of reinforcement-related deficits in ADHD in three experimental contexts: a probability-tracking task, partial and continuous reward schedules, and immediate versus delayed rewards. In addition, predictions were made at the neurophysiological level. Behavioral and neurophysiological predictions from the RVPM-based lesion model of motivational dysfunction in ADHD were confirmed by data from previously published studies. The RVPM represents a promising model of ADHD reinforcement learning, suggesting that ACC dysregulation might play a role in the pathogenesis of motivational deficits in ADHD. However, more behavioral and neurophysiological studies are required to test core predictions of the model. In addition, the interaction with the brain networks underpinning other aspects of ADHD neuropathology (i.e., executive function) needs to be better understood.
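
    The RVPM itself is not given in this record; as a toy illustration of the lesion logic only (an assumption-laden sketch, not the published model), the Python fragment below scales the dopaminergic prediction-error signal of an ACC-like value learner by a gain factor, so a simulated lesion (gain below 1) blunts learning from reinforcement.

      def acc_value_step(value, reward, gain=1.0, alpha=0.1):
          """gain=1.0 -> intact model; gain<1.0 -> simulated dopaminergic lesion."""
          delta = gain * (reward - value)   # dopamine-scaled prediction error
          value += alpha * delta
          return value, delta

      # Intact versus lesioned learning from the same surprising reward:
      print(acc_value_step(value=0.2, reward=1.0, gain=1.0))
      print(acc_value_step(value=0.2, reward=1.0, gain=0.5))

    With gain=0.5, value estimates converge more slowly and prediction-error responses are blunted, qualitatively in the spirit of the motivational-deficit profile simulated above.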

  3. A Neurogenetic Dissociation between Punishment-, Reward-, and Relief-Learning in Drosophila

    PubMed Central

    Yarali, Ayse; Gerber, Bertram

    2010-01-01

    What is particularly worth remembering about a traumatic experience is what brought it about, and what made it cease. For example, fruit flies avoid an odor which during training had preceded electric shock punishment; on the other hand, if the odor had followed shock during training, it is later approached as a signal for the relieving end of shock. We provide a neurogenetic analysis of such relief learning. Blocking output (using UAS-shibire^ts1) from a particular set of dopaminergic neurons defined by the TH-Gal4 driver partially impaired punishment learning, but left relief learning intact. Thus, with respect to these particular neurons, relief learning differs from punishment learning. Targeting another set of dopaminergic/serotonergic neurons defined by the DDC-Gal4 driver, on the other hand, affected neither punishment nor relief learning. As for the octopaminergic system, the tbh^M18 mutation, which compromises octopamine biosynthesis, partially impaired sugar-reward learning, but not relief learning. Thus, with respect to this particular mutation, relief learning and reward learning are dissociated. Finally, blocking output from the set of octopaminergic/tyraminergic neurons defined by the TDC2-Gal4 driver affected neither reward nor relief learning. We conclude that, with regard to the genetic tools used, relief learning is neurogenetically dissociated from both punishment and reward learning. This may be a message relevant also for analyses of relief learning in other experimental systems, including man. PMID:21206762

  4. Using Fuzzy Logic for Performance Evaluation in Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.

    1992-01-01

    Current reinforcement learning algorithms require long training periods, which generally limits their applicability to small problems. A new architecture is described which uses fuzzy rules to initialize its two neural networks: one for performance evaluation and another for action selection. This architecture is applied to the control of dynamic systems, and it is demonstrated that it is possible to start with approximate prior knowledge and refine it through experience using reinforcement learning.
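
    As a minimal sketch of the architectural idea, with invented names (rules_to_table is hypothetical, and lookup tables stand in for the paper's neural networks): both learners, the performance-evaluation component and the action-selection component, start from approximate prior knowledge rather than from scratch, and reinforcement learning then refines those starting values.

      def rules_to_table(rules):
          """Hypothetical translation of approximate fuzzy rules into
          initial values, e.g. {'near_goal': 0.8, 'far_from_goal': 0.1}."""
          return dict(rules)

      def refine(table, state, target, alpha=0.1):
          """Reinforcement learning refines the approximate prior."""
          table[state] += alpha * (target - table[state])
          return table[state]

      # Critic seeded with rough "this region is good/bad" knowledge:
      critic = rules_to_table({"near_goal": 0.8, "far_from_goal": 0.1})
      refine(critic, "far_from_goal", 0.3)   # experience adjusts the rough rule

    The design point is the initialization: because learning starts near a sensible prior rather than at zero, far fewer training experiences are needed, which is the speed-up the abstract claims.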

  5. Rewarded by Punishment: Reflections on the Disuse of Positive Reinforcement in Education.

    ERIC Educational Resources Information Center

    Maag, John W.

    2001-01-01

    This article delineates the reasons why educators find punishment a more acceptable approach for managing students' challenging behaviors than positive reinforcement. The article argues that educators should plan the occurrence of positive reinforcement to increase appropriate behaviors rather than running the risk of it haphazardly promoting…

  6. Coevolutionary networks of reinforcement-learning agents.

    PubMed

    Kianercy, Ardeshir; Galstyan, Aram

    2013-07-01

    This paper presents a model of network formation in repeated games where the players adapt their strategies and network ties simultaneously using a simple reinforcement-learning scheme. It is demonstrated that the coevolutionary dynamics of such systems can be described via coupled replicator equations. We provide a comprehensive analysis for three-player two-action games, which is the minimum system size with nontrivial structural dynamics. In particular, we characterize the Nash equilibria (NE) in such games and examine the local stability of the rest points corresponding to those equilibria. We also study general n-player networks via both simulations and analytical methods and find that, in the absence of exploration, the stable equilibria consist of star motifs as the main building blocks of the network. Furthermore, in all stable equilibria the agents play pure strategies, even when the game allows mixed NE. Finally, we study the impact of exploration on learning outcomes and observe that there is a critical exploration rate above which the symmetric and uniformly connected network topology becomes stable.
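
    The coupled replicator equations are not reproduced in this record; the Python sketch below numerically integrates a standard replicator form with an entropic exploration term, consistent with the coupled replicator equations the abstract mentions (the payoff matrices and exploration rate T are illustrative choices, not taken from the paper).

      import math

      def replicator_step(x, y, A, B, T, dt=0.01):
          """One Euler step of coupled replicator dynamics with exploration.
          x, y: interior mixed strategies; A, B: payoff matrices (rows index
          each player's own actions); T: exploration rate."""
          def deriv(p, q, M):
              payoff = [sum(M[i][j] * q[j] for j in range(len(q)))
                        for i in range(len(p))]
              avg = sum(pi * fi for pi, fi in zip(p, payoff))
              entropy = [sum(p[j] * math.log(p[j] / p[i]) for j in range(len(p)))
                         for i in range(len(p))]
              return [p[i] * (payoff[i] - avg + T * entropy[i])
                      for i in range(len(p))]
          dx, dy = deriv(x, y, A), deriv(y, x, B)
          return ([xi + dt * d for xi, d in zip(x, dx)],
                  [yi + dt * d for yi, d in zip(y, dy)])

      # Illustrative matching-pennies payoffs with a high exploration rate:
      A = [[1, -1], [-1, 1]]
      B = [[-1, 1], [1, -1]]
      x, y = [0.6, 0.4], [0.3, 0.7]
      for _ in range(1000):
          x, y = replicator_step(x, y, A, B, T=0.5)

    For large T the entropic term dominates and both players are driven toward uniform mixing, which mirrors the critical exploration rate above which the symmetric, uniformly connected topology becomes stable.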

  8. Developing PFC representations using reinforcement learning.

    PubMed

    Reynolds, Jeremy R; O'Reilly, Randall C

    2009-12-01

    From both functional and biological considerations, it is widely believed that action production, planning, and goal-oriented behaviors supported by the frontal cortex are organized hierarchically [Fuster (1991); Koechlin, E., Ody, C.,