Science.gov

Sample records for reward reinforcement learning

  1. Reward, motivation, and reinforcement learning.

    PubMed

    Dayan, Peter; Balleine, Bernard W

    2002-10-10

    There is substantial evidence that dopamine is involved in reward learning and appetitive conditioning. However, the major reinforcement learning-based theoretical models of classical conditioning (crudely, prediction learning) are actually based on rules designed to explain instrumental conditioning (action learning). Extensive anatomical, pharmacological, and psychological data, particularly concerning the impact of motivational manipulations, show that these models are unreasonable. We review the data and consider the involvement of a rich collection of different neural systems in various aspects of these forms of conditioning. Dopamine plays a pivotal, but complicated, role.
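
The prediction-error learning rule that these reinforcement-learning accounts of conditioning build on can be sketched in a few lines. This is an illustrative Rescorla-Wagner / temporal-difference style update, not the specific model the authors critique; the learning rate and reward values are assumptions.

```python
# Minimal sketch of prediction-error learning: the value estimate moves a
# fraction alpha toward each observed outcome.

def update_value(v, reward, alpha=0.1):
    """One trial: compute the reward prediction error, update the value."""
    delta = reward - v          # prediction error: actual minus predicted reward
    return v + alpha * delta    # value moves a fraction alpha toward the outcome

v = 0.0
for _ in range(100):            # repeated pairings of a cue with reward 1.0
    v = update_value(v, 1.0)
print(round(v, 3))              # the learned value converges toward 1.0
```

In the dopamine interpretation, `delta` is the quantity the phasic dopamine response is hypothesized to carry.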

  2. Online learning of shaping rewards in reinforcement learning.

    PubMed

    Grześ, Marek; Kudenko, Daniel

    2010-05-01

    Potential-based reward shaping has been shown to be a powerful method to improve the convergence rate of reinforcement learning agents. It is a flexible technique to incorporate background knowledge into temporal-difference learning in a principled way. However, the question remains of how to compute the potential function which is used to shape the reward that is given to the learning agent. In this paper, we show how, in the absence of knowledge to define the potential function manually, this function can be learned online in parallel with the actual reinforcement learning process. Two cases are considered. The first solution, based on multi-grid discretisation, is designed for model-free reinforcement learning. In the second case, an approach for the prototypical model-based R-max algorithm is proposed. It learns the potential function using the free-space assumption about the transitions in the environment. Two novel algorithms are presented and evaluated.
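
The shaping scheme the abstract refers to augments the environment reward with F(s, s') = γ·Φ(s') − Φ(s), which leaves the optimal policy unchanged. A minimal sketch, where the distance-to-goal potential and the goal location are illustrative assumptions:

```python
# Hedged sketch of potential-based reward shaping. phi is an example
# potential function (higher when closer to an assumed goal state).

GAMMA = 0.99

def phi(state, goal=10):
    return -abs(goal - state)   # potential: negative distance to the goal

def shaped_reward(r_env, s, s_next):
    # F(s, s') = gamma * phi(s') - phi(s) is added to the environment reward
    return r_env + GAMMA * phi(s_next) - phi(s)

# Moving toward the goal yields a positive shaping bonus even when the
# environment reward itself is zero.
print(shaped_reward(0.0, s=3, s_next=4) > 0)
```

The paper's contribution is learning Φ online rather than hand-specifying it as done here.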

  3. Balancing Multiple Sources of Reward in Reinforcement Learning

    DTIC Science & Technology

    2006-01-01

    For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar...problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.

  4. Finding intrinsic rewards by embodied evolution and constrained reinforcement learning.

    PubMed

    Uchibe, Eiji; Doya, Kenji

    2008-12-01

    Understanding the design principle of reward functions is a substantial challenge both in artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals, but also for intermediate states to promote effective exploration. This paper proposes a method for designing 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and 'mating' by software reproduction are three major 'extrinsic' rewards. We show in hardware experiments that the robots can find appropriate 'intrinsic' rewards for the vision of battery packs and other robots to promote approach behaviors.

  5. Reward and reinforcement activity in the nucleus accumbens during learning

    PubMed Central

    Gale, John T.; Shields, Donald C.; Ishizawa, Yumiko; Eskandar, Emad N.

    2014-01-01

    The nucleus accumbens core (NAcc) has been implicated in learning associations between sensory cues and profitable motor responses. However, the precise mechanisms that underlie these functions remain unclear. We recorded single-neuron activity from the NAcc of primates trained to perform a visual-motor associative learning task. During learning, we found two distinct classes of NAcc neurons. The first class demonstrated progressive increases in firing rates at the go-cue, feedback/tone and reward epochs of the task, as novel associations were learned. This suggests that these neurons may play a role in the exploitation of rewarding behaviors. In contrast, the second class exhibited attenuated firing rates, but only at the reward epoch of the task. These findings suggest that some NAcc neurons play a role in reward-based reinforcement during learning. PMID:24765069

  6. Optimal Reward Functions in Distributed Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Tumer, Kagan

    2000-01-01

    We consider the design of multi-agent systems so as to optimize an overall world utility function when (1) those systems lack centralized communication and control, and (2) each agent runs a distinct Reinforcement Learning (RL) algorithm. A crucial issue in such design problems is to initialize/update each agent's private utility function, so as to induce the best possible world utility. Traditional 'team game' solutions to this problem sidestep this issue and simply assign to each agent the world utility as its private utility function. In previous work we used the 'Collective Intelligence' framework to derive a better choice of private utility functions, one that results in world utility performance up to orders of magnitude superior to that ensuing from use of the team game utility. In this paper we extend these results. We derive the general class of private utility functions that both are easy for the individual agents to learn and that, if learned well, result in high world utility. We demonstrate experimentally that using these new utility functions can result in significantly improved performance over that of our previously proposed utility, over and above that previous utility's superiority to the conventional team game utility.
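
The contrast between the team-game utility and a Collective-Intelligence-style private utility can be sketched schematically. The toy world utility and the "null action" baseline below are illustrative assumptions, not the paper's actual construction:

```python
# Toy contrast: a team-game utility hands every agent the world utility G,
# while a difference-style private utility subtracts out the other agents'
# contributions, giving each agent a cleaner learning signal.

def world_utility(actions):
    return sum(actions)                     # toy world utility G(z)

def team_game_utility(actions, i):
    return world_utility(actions)           # every agent just receives G

def difference_utility(actions, i):
    without_i = actions[:i] + [0] + actions[i + 1:]  # agent i set to a null action
    return world_utility(actions) - world_utility(without_i)

acts = [3, 5, 2]
print(team_game_utility(acts, 1))   # 10: depends on what every agent did
print(difference_utility(acts, 1))  # 5: isolates agent 1's own contribution
```

The learnability advantage the paper exploits is visible here: the difference utility changes only with agent 1's own action, so its learning problem is far less noisy.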

  7. Inferring reward prediction errors in patients with schizophrenia: a dynamic reward task for reinforcement learning.

    PubMed

    Li, Chia-Tzu; Lai, Wen-Sung; Liu, Chih-Min; Hsu, Yung-Fong

    2014-01-01

    Abnormalities in the dopamine system have long been implicated in explanations of reinforcement learning and psychosis. The updated reward prediction error (RPE)-a discrepancy between the predicted and actual rewards-is thought to be encoded by dopaminergic neurons. Dysregulation of dopamine systems could alter the appraisal of stimuli and eventually lead to schizophrenia. Accordingly, the measurement of RPE provides a potential behavioral index for the evaluation of brain dopamine activity and psychotic symptoms. Here, we assess two features potentially crucial to the RPE process, namely belief formation and belief perseveration, via a probability learning task and reinforcement-learning modeling. Forty-five patients with schizophrenia [26 high-psychosis and 19 low-psychosis, based on their p1 and p3 scores in the positive-symptom subscales of the Positive and Negative Syndrome Scale (PANSS)] and 24 controls were tested in a feedback-based dynamic reward task for their RPE-related decision making. While task scores across the three groups were similar, matching law analysis revealed that the reward sensitivities of both psychosis groups were lower than that of controls. Trial-by-trial data were further fit with a reinforcement learning model using the Bayesian estimation approach. Model fitting results indicated that both psychosis groups tend to update their reward values more rapidly than controls. Moreover, among the three groups, high-psychosis patients had the lowest degree of choice perseveration. Lumping patients' data together, we also found that patients' perseveration appears to be negatively correlated (p = 0.09, trending toward significance) with their PANSS p1 + p3 scores. Our method provides an alternative for investigating reward-related learning and decision making in basic and clinical settings.

  8. Inferring reward prediction errors in patients with schizophrenia: a dynamic reward task for reinforcement learning

    PubMed Central

    Li, Chia-Tzu; Lai, Wen-Sung; Liu, Chih-Min; Hsu, Yung-Fong

    2014-01-01

    Abnormalities in the dopamine system have long been implicated in explanations of reinforcement learning and psychosis. The updated reward prediction error (RPE)—a discrepancy between the predicted and actual rewards—is thought to be encoded by dopaminergic neurons. Dysregulation of dopamine systems could alter the appraisal of stimuli and eventually lead to schizophrenia. Accordingly, the measurement of RPE provides a potential behavioral index for the evaluation of brain dopamine activity and psychotic symptoms. Here, we assess two features potentially crucial to the RPE process, namely belief formation and belief perseveration, via a probability learning task and reinforcement-learning modeling. Forty-five patients with schizophrenia [26 high-psychosis and 19 low-psychosis, based on their p1 and p3 scores in the positive-symptom subscales of the Positive and Negative Syndrome Scale (PANSS)] and 24 controls were tested in a feedback-based dynamic reward task for their RPE-related decision making. While task scores across the three groups were similar, matching law analysis revealed that the reward sensitivities of both psychosis groups were lower than that of controls. Trial-by-trial data were further fit with a reinforcement learning model using the Bayesian estimation approach. Model fitting results indicated that both psychosis groups tend to update their reward values more rapidly than controls. Moreover, among the three groups, high-psychosis patients had the lowest degree of choice perseveration. Lumping patients' data together, we also found that patients' perseveration appears to be negatively correlated (p = 0.09, trending toward significance) with their PANSS p1 + p3 scores. Our method provides an alternative for investigating reward-related learning and decision making in basic and clinical settings. PMID:25426091

  9. An Upside to Reward Sensitivity: The Hippocampus Supports Enhanced Reinforcement Learning in Adolescence.

    PubMed

    Davidow, Juliet Y; Foerde, Karin; Galván, Adriana; Shohamy, Daphna

    2016-10-05

    Adolescents are notorious for engaging in reward-seeking behaviors, a tendency attributed to heightened activity in the brain's reward systems during adolescence. It has been suggested that reward sensitivity in adolescence might be adaptive, but evidence of an adaptive role has been scarce. Using a probabilistic reinforcement learning task combined with reinforcement learning models and fMRI, we found that adolescents showed better reinforcement learning and a stronger link between reinforcement learning and episodic memory for rewarding outcomes. This behavioral benefit was related to heightened prediction error-related BOLD activity in the hippocampus and to stronger functional connectivity between the hippocampus and the striatum at the time of reinforcement. These findings reveal an important role for the hippocampus in reinforcement learning in adolescence and suggest that reward sensitivity in adolescence is related to adaptive differences in how adolescents learn from experience.

  10. Beyond simple reinforcement learning: the computational neurobiology of reward-learning and valuation.

    PubMed

    O'Doherty, John P

    2012-04-01

    Neural computational accounts of reward-learning have been dominated by the hypothesis that dopamine neurons behave like a reward-prediction error and thus facilitate reinforcement learning in striatal target neurons. While this framework is consistent with a lot of behavioral and neural evidence, this theory fails to account for a number of behavioral and neurobiological observations. In this special issue of EJN we feature a combination of theoretical and experimental papers highlighting some of the explanatory challenges faced by simple reinforcement-learning models and describing some of the ways in which the framework is being extended in order to address these challenges.

  11. Homeostatic reinforcement learning for integrating reward collection and physiological stability.

    PubMed

    Keramati, Mehdi; Gutkin, Boris

    2014-12-02

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system.
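
The core idea, reward as the reduction in a "drive" measuring distance from a physiological setpoint, can be sketched directly. The quadratic drive function and the setpoint and state values below are illustrative assumptions, not the paper's exact formulation:

```python
# Sketch of the homeostatic-RL reward definition: an outcome is rewarding to
# the extent that it moves the internal state closer to its setpoint.

def drive(state, setpoint=5.0):
    return (setpoint - state) ** 2            # deviation from the setpoint

def homeostatic_reward(state_before, state_after):
    return drive(state_before) - drive(state_after)

# Eating while depleted (state 2 -> 4) reduces drive and is rewarding;
# the same meal while sated (state 5 -> 7) increases drive and is punishing.
print(homeostatic_reward(2.0, 4.0))   # 9 - 1 = 8.0
print(homeostatic_reward(5.0, 7.0))   # 0 - 4 = -4.0
```

This state-dependence is what lets the same outcome count as a primary reward or a punisher depending on internal state.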

  12. When, What, and How Much to Reward in Reinforcement Learning-Based Models of Cognition

    ERIC Educational Resources Information Center

    Janssen, Christian P.; Gray, Wayne D.

    2012-01-01

    Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing a reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward; that is, when (the moment: end of trial, subtask, or some other…

  13. Homeostatic reinforcement learning for integrating reward collection and physiological stability

    PubMed Central

    Keramati, Mehdi; Gutkin, Boris

    2014-01-01

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system. DOI: http://dx.doi.org/10.7554/eLife.04811.001 PMID:25457346

  14. A Reward Optimization Method Based on Action Subrewards in Hierarchical Reinforcement Learning

    PubMed Central

    Liu, Quan; Ling, Xionghong; Cui, Zhiming

    2014-01-01

    Reinforcement learning (RL) is a kind of interactive learning method whose main characteristics are “trial and error” and “related reward.” A hierarchical reinforcement learning method based on action subrewards is proposed to address the “curse of dimensionality” problem, in which the state space grows exponentially with the number of features, causing slow convergence. The method can greatly reduce the state space and choose actions purposefully and efficiently, so as to optimize the reward function and enhance convergence speed. Applied to online learning in the Tetris game, the experimental results show that the convergence speed of the algorithm is evidently enhanced by the new method, which combines a hierarchical reinforcement learning algorithm with action subrewards. The “curse of dimensionality” problem is also solved to a certain extent by the hierarchical method. Performance with different parameters is compared and analyzed as well. PMID:24600318

  15. Framing Reinforcement Learning from Human Reward: Reward Positivity, Temporal Discounting, Episodicity, and Performance

    DTIC Science & Technology

    2014-09-29

    Achieving master level play in 9×9 computer go. In: Proceedings of AAAI. pp. 1537–1540. Grollman, D., Jenkins, O., Apr 2007. Dogged learning for robots...punishment and will make the screen flash red. You can think of this as similar to training a dog or another animal through reward and punishment, but it will

  16. Computational models of reinforcement learning: the role of dopamine as a reward signal

    PubMed Central

    Samson, R. D.; Frank, M. J.

    2010-01-01

    Reinforcement learning is ubiquitous. Unlike other forms of learning, it involves the processing of fast yet content-poor feedback information to correct assumptions about the nature of a task or of a set of stimuli. This feedback information is often delivered as generic rewards or punishments, and has little to do with the stimulus features to be learned. How can such low-content feedback lead to such an efficient learning paradigm? Through a review of existing neuro-computational models of reinforcement learning, we suggest that the efficiency of this type of learning resides in the dynamic and synergistic cooperation of brain systems that use different levels of computations. The implementation of reward signals at the synaptic, cellular, network and system levels give the organism the necessary robustness, adaptability and processing speed required for evolutionary and behavioral success. PMID:21629583

  17. Does reward frequency or magnitude drive reinforcement-learning in attention-deficit/hyperactivity disorder?

    PubMed

    Luman, Marjolein; Van Meel, Catharina S; Oosterlaan, Jaap; Sergeant, Joseph A; Geurts, Hilde M

    2009-08-15

    Children with attention-deficit/hyperactivity disorder (ADHD) show an impaired ability to use feedback in the context of learning. A stimulus-response learning task was used to investigate whether (1) children with ADHD displayed flatter learning curves, (2) reinforcement-learning in ADHD was sensitive to either reward frequency, magnitude, or both, and (3) altered sensitivity to reward was specific to ADHD or would co-occur in a group of children with autism spectrum disorder (ASD). Performance of 23 boys with ADHD was compared with that of 30 normal controls (NCs) and 21 boys with ASD, all aged 8-12. Rewards were delivered contingent on performance and varied both in frequency (low, high) and magnitude (small, large). The findings showed that, although learning rates were comparable across groups, both clinical groups committed more errors than NCs. In contrast to the NC boys, boys with ADHD were unaffected by frequency and magnitude of reward. The NC group and, to some extent, the ASD group showed improved performance, when rewards were delivered infrequently versus frequently. Children with ADHD as well as children with ASD displayed difficulties in stimulus-response coupling that were independent of motivational modulations. Possibly, these deficits are related to abnormal reinforcement expectancy.

  18. Toward an autonomous brain machine interface: integrating sensorimotor reward modulation and reinforcement learning.

    PubMed

    Marsh, Brandi T; Tarigoppula, Venkata S Aditya; Chen, Chen; Francis, Joseph T

    2015-05-13

    For decades, neurophysiologists have worked on elucidating the function of the cortical sensorimotor control system from the standpoint of kinematics or dynamics. Recently, computational neuroscientists have developed models that can emulate changes seen in the primary motor cortex during learning. However, these simulations rely on the existence of a reward-like signal in the primary sensorimotor cortex. Reward modulation of the primary sensorimotor cortex has yet to be characterized at the level of neural units. Here we demonstrate that single units/multiunits and local field potentials in the primary motor (M1) cortex of nonhuman primates (Macaca radiata) are modulated by reward expectation during reaching movements and that this modulation is present even while subjects passively view cursor motions that are predictive of either reward or nonreward. After establishing this reward modulation, we set out to determine whether we could correctly classify rewarding versus nonrewarding trials, on a moment-to-moment basis. This reward information could then be used in collaboration with reinforcement learning principles toward an autonomous brain-machine interface. The autonomous brain-machine interface would use M1 for both decoding movement intention and extraction of reward expectation information as evaluative feedback, which would then update the decoding algorithm as necessary. In the work presented here, we show that this, in theory, is possible.

  19. The left hemisphere learns what is right: Hemispatial reward learning depends on reinforcement learning processes in the contralateral hemisphere.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly Crystal; Schwartz, Sophie

    2016-08-01

    Orienting biases refer to consistent, trait-like direction of attention or locomotion toward one side of space. Recent studies suggest that such hemispatial biases may determine how well people memorize information presented in the left or right hemifield. Moreover, lesion studies indicate that learning rewarded stimuli in one hemispace depends on the integrity of the contralateral striatum. However, the exact neural and computational mechanisms underlying the influence of individual orienting biases on reward learning remain unclear. Because reward-based behavioural adaptation depends on the dopaminergic system and prediction error (PE) encoding in the ventral striatum, we hypothesized that hemispheric asymmetries in dopamine (DA) function may determine individual spatial biases in reward learning. To test this prediction, we acquired fMRI in 33 healthy human participants while they performed a lateralized reward task. Learning differences between hemispaces were assessed by presenting stimuli, assigned to different reward probabilities, to the left or right of central fixation, i.e. presented in the left or right visual hemifield. Hemispheric differences in DA function were estimated through differential fMRI responses to positive vs. negative feedback in the left vs. right ventral striatum, and a computational approach was used to identify the neural correlates of PEs. Our results show that spatial biases favoring reward learning in the right (vs. left) hemifield were associated with increased reward responses in the left hemisphere and relatively better neural encoding of PEs for stimuli presented in the right (vs. left) hemifield. These findings demonstrate that trait-like spatial biases implicate hemisphere-specific learning mechanisms, with individual differences between hemispheres contributing to reinforcing spatial biases.

  20. Adaptive Design of Role Differentiation by Division of Reward Function in Multi-Agent Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Taniguchi, Tadahiro; Tabuchi, Kazuma; Sawaragi, Tetsuo

    There are several problems which discourage an organization from achieving tasks, e.g., partial observation, credit assignment, and concurrent learning in multi-agent reinforcement learning. In many conventional approaches, each agent estimates hidden states, e.g., sensor inputs, positions, and policies of other agents, and reduces the uncertainty in the partially-observable Markov decision process (POMDP), which partially solves the multi-agent reinforcement learning problem. In contrast, people reduce uncertainty in human organizations in the real world by autonomously dividing the roles played by individual agents. In a framework of reinforcement learning, roles are mainly represented by goals for individual agents. This paper presents a method for generating internal rewards from manager agents to worker agents. It also explicitly divides the roles, which enables a POMDP task for each agent to be transformed into a simple MDP task under certain conditions. Several situational experiments are also described and the validity of the proposed method is evaluated.

  1. Reward-weighted regression with sample reuse for direct policy search in reinforcement learning.

    PubMed

    Hachiya, Hirotaka; Peters, Jan; Sugiyama, Masashi

    2011-11-01

    Direct policy search is a promising reinforcement learning framework, in particular for controlling continuous, high-dimensional systems. Policy search often requires a large number of samples for obtaining a stable policy update estimator, and this is prohibitive when the sampling cost is expensive. In this letter, we extend an expectation-maximization-based policy search method so that previously collected samples can be efficiently reused. The usefulness of the proposed method, reward-weighted regression with sample reuse (R3), is demonstrated through robot learning experiments. (This letter is an extended version of our earlier conference paper: Hachiya, Peters, & Sugiyama, 2009.)
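
The reward-weighted regression idea behind this family of methods can be sketched in its simplest form: the new policy parameter is a reward-weighted average of sampled actions, an EM-style update. The Gaussian policy, the toy reward, and the fixed exploration noise below are illustrative assumptions, and no sample reuse (the paper's actual contribution) is shown:

```python
# Minimal sketch of reward-weighted regression for a 1-D Gaussian policy:
# each iteration samples actions, weights them by reward, and sets the new
# policy mean to the weighted average.
import math
import random

random.seed(0)
mu, sigma = 0.0, 1.0
target = 2.0                                 # unknown optimum of the toy reward

for _ in range(50):
    actions = [random.gauss(mu, sigma) for _ in range(100)]
    rewards = [math.exp(-(a - target) ** 2) for a in actions]  # toy reward
    mu = sum(r * a for r, a in zip(rewards, actions)) / sum(rewards)

print(abs(mu - target) < 0.5)   # the policy mean drifts toward the optimum
```

Because every update is a weighted average of on-policy samples, discarding old samples is wasteful when sampling is expensive, which motivates the sample-reuse extension.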

  2. Principal components analysis of reward prediction errors in a reinforcement learning task.

    PubMed

    Sambrook, Thomas D; Goslin, Jeremy

    2016-01-01

    Models of reinforcement learning represent reward and punishment in terms of reward prediction errors (RPEs), quantitative signed terms describing the degree to which outcomes are better than expected (positive RPEs) or worse (negative RPEs). An electrophysiological component known as feedback related negativity (FRN) occurs at frontocentral sites 240-340 ms after feedback on whether a reward or punishment is obtained, and has been claimed to neurally encode an RPE. An outstanding question, however, is whether the FRN is sensitive to the size of both positive RPEs and negative RPEs. Previous attempts to answer this question have examined the simple effects of RPE size for positive RPEs and negative RPEs separately. However, this methodology can be compromised by overlap from components coding for unsigned prediction error size, or "salience", which are sensitive to the absolute size of a prediction error but not its valence. In our study, positive and negative RPEs were parametrically modulated using both reward likelihood and magnitude, with principal components analysis used to separate out overlying components. This revealed a single RPE-encoding component responsive to the size of positive RPEs, peaking at ~330 ms and occupying the delta frequency band. Other components responsive to unsigned prediction error size were shown, but no component sensitive to negative RPE size was found.
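
The distinction the study turns on, a signed RPE versus unsigned "salience", is simple to state computationally; the quantities below are the standard definitions, with illustrative expected-value and outcome numbers:

```python
# A signed RPE carries valence; salience is its absolute size, discarding
# whether the outcome was better or worse than expected.

def rpe(expected, outcome):
    return outcome - expected        # signed: positive = better than expected

def salience(expected, outcome):
    return abs(outcome - expected)   # unsigned: magnitude only

print(rpe(0.5, 1.0), salience(0.5, 1.0))   # 0.5 0.5  (positive RPE)
print(rpe(0.5, 0.0), salience(0.5, 0.0))   # -0.5 0.5 (negative RPE, same salience)
```

An electrophysiological component tracking only `salience` would mimic RPE sensitivity in analyses that vary positive and negative outcomes separately, which is why the authors needed PCA to separate the overlapping components.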

  3. Extending Hierarchical Reinforcement Learning to Continuous-Time, Average-Reward, and Multi-Agent Models

    DTIC Science & Technology

    2003-07-09

    Hierarchical reinforcement learning (HRL) is a general framework that studies how to exploit the structure of actions and tasks to accelerate policy...framework could suffice, we focus in this paper on the MAXQ framework. We describe three new hierarchical reinforcement learning algorithms: continuous-time... reinforcement learning to speed up the acquisition of cooperative multiagent tasks. We extend the MAXQ framework to the multiagent case, which we term

  4. Subjective and model-estimated reward prediction: association with the feedback-related negativity (FRN) and reward prediction error in a reinforcement learning task.

    PubMed

    Ichikawa, Naho; Siegle, Greg J; Dombrovski, Alexandre; Ohira, Hideki

    2010-12-01

    In this study, we examined whether the feedback-related negativity (FRN) is associated with both subjective and objective (model-estimated) reward prediction errors (RPE) per trial in a reinforcement learning task in healthy adults (n=25). The level of RPE was assessed by 1) subjective ratings per trial and by 2) a computational model of reinforcement learning. As a result, model-estimated RPE was highly correlated with subjective RPE (r=.82), and the grand-averaged ERP waves based on trials with high and low model-estimated RPE showed a significant difference only in the time period of the FRN component (p<.05). Regardless of the time course of learning, FRN was associated with both subjective and model-estimated RPEs within subject (r=.47, p<.001; r=.40, p<.05) and between subjects (r=.33, p<.05; r=.41, p<.005) only in the Learnable condition, where the internal reward prediction varied enough with a behavior-reward contingency.

  5. [The model of reward choice based on the theory of reinforcement learning].

    PubMed

    Smirnitskaia, I A; Frolov, A A; Merzhanova, G Kh

    2007-01-01

    We developed a model of an alimentary instrumental conditioned bar-pressing reflex for cats making a choice between either an immediate small reinforcement ("impulsive behavior") or a delayed, more valuable reinforcement ("self-control behavior"). Our model is based on reinforcement learning theory. We emulated the dopamine contribution by the discount coefficient of this theory (a subjective decrease in the value of a delayed reinforcement). The results of computer simulation showed that "cats" with a large discount coefficient demonstrated "self-control behavior"; a small discount coefficient was associated with "impulsive behavior". These data are in agreement with experimental data indicating that impulsive behavior is due to a decreased amount of dopamine in the striatum.
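
The choice mechanism this model (and the closely related entry 8 below) describes reduces to comparing discounted values γ^delay · r. A minimal sketch; the specific reward magnitudes and delays are illustrative assumptions:

```python
# Sketch of the discounting account of impulsive vs. self-control choice:
# a small immediate reward competes with a larger delayed one.

def discounted_value(reward, delay, gamma):
    return gamma ** delay * reward

def choose(gamma, small=(1.0, 0), large=(3.0, 5)):
    v_small = discounted_value(*small, gamma)
    v_large = discounted_value(*large, gamma)
    return "self-control" if v_large > v_small else "impulsive"

print(choose(gamma=0.9))   # 3.0 * 0.9**5 ~ 1.77 > 1.0 -> "self-control"
print(choose(gamma=0.5))   # 3.0 * 0.5**5 ~ 0.09 < 1.0 -> "impulsive"
```

Identifying γ with tonic dopamine level, as the authors do, makes low dopamine predict impulsive choice.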

  6. Dopaminergic control of motivation and reinforcement learning: a closed-circuit account for reward-oriented behavior.

    PubMed

    Morita, Kenji; Morishima, Mieko; Sakai, Katsuyuki; Kawaguchi, Yasuo

    2013-05-15

    Humans and animals take actions quickly when they expect that the actions lead to reward, reflecting their motivation. Injection of dopamine receptor antagonists into the striatum has been shown to slow such reward-seeking behavior, suggesting that dopamine is involved in the control of motivational processes. Meanwhile, neurophysiological studies have revealed that phasic response of dopamine neurons appears to represent reward prediction error, indicating that dopamine plays central roles in reinforcement learning. However, previous attempts to elucidate the mechanisms of these dopaminergic controls have not fully explained how the motivational and learning aspects are related and whether they can be understood by the way the activity of dopamine neurons itself is controlled by their upstream circuitries. To address this issue, we constructed a closed-circuit model of the corticobasal ganglia system based on recent findings regarding intracortical and corticostriatal circuit architectures. Simulations show that the model could reproduce the observed distinct motivational effects of D1- and D2-type dopamine receptor antagonists. Simultaneously, our model successfully explains the dopaminergic representation of reward prediction error as observed in behaving animals during learning tasks and could also explain distinct choice biases induced by optogenetic stimulation of the D1 and D2 receptor-expressing striatal neurons. These results indicate that the suggested roles of dopamine in motivational control and reinforcement learning can be understood in a unified manner through a notion that the indirect pathway of the basal ganglia represents the value of states/actions at a previous time point, an empirically driven key assumption of our model.

  7. The Rewards of Learning.

    ERIC Educational Resources Information Center

    Chance, Paul

    1992-01-01

    Although intrinsic rewards are important, they (along with punishment and encouragement) are insufficient for efficient learning. Teachers must supplement intrinsic rewards with extrinsic rewards, such as praising, complimenting, applauding, and providing other forms of recognition for good work. Teachers should use the weakest reward required to…

  8. A model of reward choice based on the theory of reinforcement learning.

    PubMed

    Smirnitskaya, I A; Frolov, A A; Merzhanova, G Kh

    2008-03-01

    A model explaining behavioral "impulsivity" and "self-control" is proposed on the basis of the theory of reinforcement learning. The discount coefficient gamma, which in this theory accounts for the subjective reduction in the value of a delayed reinforcement, is identified with the overall level of dopaminergic neuron activity which, according to published data, also determines the behavioral variant. Computer modeling showed that high values of gamma are characteristic of predominantly "self-controlled" subjects, while smaller values of gamma are characteristic of "impulsive" subjects.
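The role of the discount coefficient gamma described above can be made concrete with a small sketch (illustrative values and names, not the authors' implementation): a delayed reward is discounted by gamma per time step, so a low-gamma agent prefers a small immediate reward while a high-gamma agent waits for the larger delayed one.

```python
# Sketch: how the discount coefficient gamma shifts choice between an
# immediate small reward and a delayed large reward (illustrative values).

def discounted_value(reward: float, delay: int, gamma: float) -> float:
    """Value of a reward received after `delay` steps, discounted by gamma."""
    return reward * (gamma ** delay)

small_now = discounted_value(reward=2.0, delay=0, gamma=0.5)        # 2.0
large_later_lo = discounted_value(reward=10.0, delay=5, gamma=0.5)  # 0.3125
large_later_hi = discounted_value(reward=10.0, delay=5, gamma=0.95) # ~7.74

# A low-gamma ("impulsive") agent prefers the immediate small reward...
assert small_now > large_later_lo
# ...while a high-gamma ("self-controlled") agent waits for the large one.
assert large_later_hi > small_now
```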

  9. States versus Rewards: Dissociable neural prediction error signals underlying model-based and model-free reinforcement learning

    PubMed Central

    Gläscher, Jan; Daw, Nathaniel; Dayan, Peter; O’Doherty, John P.

    2010-01-01

    Reinforcement learning (RL) uses sequential experience with situations (“states”) and outcomes to assess actions. Whereas model-free RL uses this experience directly, in the form of a reward prediction error (RPE), model-based RL uses it indirectly, building a model of the state transition and outcome structure of the environment, and evaluating actions by searching this model. A state prediction error (SPE) plays a central role, reporting discrepancies between the current model and the observed state transitions. Using functional magnetic resonance imaging in humans solving a probabilistic Markov decision task we found the neural signature of an SPE in the intraparietal sulcus and lateral prefrontal cortex, in addition to the previously well-characterized RPE in the ventral striatum. This finding supports the existence of two unique forms of learning signal in humans, which may form the basis of distinct computational strategies for guiding behavior. PMID:20510862
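The two learning signals contrasted above can be sketched side by side. This is a minimal illustration with made-up states and parameters, not the task model from the study: the reward prediction error updates a cached value, while the state prediction error updates a learned transition model.

```python
# Sketch: model-free RPE updates a value estimate; model-based SPE updates
# a transition model. States, actions, and parameters are illustrative.

gamma, alpha, eta = 0.9, 0.1, 0.2

# Model-free: temporal-difference reward prediction error.
V = {"s1": 0.0, "s2": 0.5}
def td_update(s, r, s_next):
    rpe = r + gamma * V[s_next] - V[s]   # delta = r + gamma*V(s') - V(s)
    V[s] += alpha * rpe
    return rpe

# Model-based: state prediction error on the transition model T(s, a, s').
T = {("s1", "go"): {"s2": 0.5, "s3": 0.5}}
def spe_update(s, a, s_observed):
    probs = T[(s, a)]
    spe = 1.0 - probs[s_observed]        # surprise at the observed next state
    for s_next in probs:
        target = 1.0 if s_next == s_observed else 0.0
        probs[s_next] += eta * (target - probs[s_next])
    return spe

rpe = td_update("s1", r=1.0, s_next="s2")  # 1 + 0.9*0.5 - 0 = 1.45
spe = spe_update("s1", "go", "s2")         # 1 - 0.5 = 0.5
```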

  10. Learning from Noisy and Delayed Rewards: The Value of Reinforcement Learning to Defense Modeling and Simulation

    DTIC Science & Technology

    2012-09-01

    mission demands in the context of a combat scenario. The current approach employs a linear program that maximizes value over a finite-time horizon...modification to ensure convergence is to gradually decay the learning rate as a function of time or samples. Note that Sutton makes a claim that the...(1992). Convergence in the limit does not provide practical benefit in most real-world applications and this measure, while relevant to the theoretical
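The learning-rate decay mentioned in the excerpt is the standard stochastic-approximation device for convergence; a minimal sketch (illustrative, not from the report):

```python
# Sketch: decaying the learning rate alpha_t over samples so that
# sum(alpha_t) diverges while sum(alpha_t^2) converges (the classic
# Robbins-Monro conditions for convergence). Illustrative only.

def alpha(t: int) -> float:
    """Harmonic decay: sum alpha_t = inf, sum alpha_t^2 < inf."""
    return 1.0 / (t + 1)

# Running-average estimate of a noisy reward signal.
estimate = 0.0
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
for t, r in enumerate(rewards):
    estimate += alpha(t) * (r - estimate)

# With alpha_t = 1/(t+1) the estimate equals the sample mean so far.
assert abs(estimate - sum(rewards) / len(rewards)) < 1e-12
```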

  11. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.

    PubMed

    Glimcher, Paul W

    2011-09-13

    A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn.

  12. Reward and learning in the goldfish.

    PubMed

    Lowes, G; Bitterman, M E

    1967-07-28

    An experiment with goldfish showed the effects of change in amount of reward that are predicted from reinforcement theory. The performance of animals shifted from small to large reward improved gradually to the level of unshifted large-reward controls, while the performance of animals shifted from large to small reward remained at the large-reward level. The difference between these results and those obtained in analogous experiments with the rat suggests that reward functions differently in the instrumental learning of the two animals.

  13. [Reinforcement learning by striatum].

    PubMed

    Kunisato, Yoshihiko; Okada, Go; Okamoto, Yasumasa

    2009-04-01

    Recently, computational models of reinforcement learning have been applied for the analysis of neuroimaging data. It has been clarified that the striatum plays a key role in decision making. We review reinforcement learning theory and the brain structures and signals, such as neuromodulators, associated with reinforcement learning. We also investigated the function of the striatum and the neurotransmitter serotonin in reward prediction. We first studied the brain mechanisms for reward prediction at different time scales. Our experiment on the striatum showed that the ventroanterior regions are involved in predicting immediate rewards and the dorsoposterior regions are involved in predicting future rewards. Further, we investigated whether serotonin regulates both reward selection and the specialization of striatal function for reward prediction at different time scales. To this end, we regulated the dietary intake of tryptophan, a precursor of serotonin. Our experiment showed that the activity of the ventral part of the striatum was correlated with reward prediction at shorter time scales, and this activity was stronger at low serotonin levels. By contrast, the activity of the dorsal part of the striatum was correlated with reward prediction at longer time scales, and this activity was stronger at high serotonin levels. Further, a higher proportion of small-reward choices, together with a higher rate of discounting of delayed rewards, was observed in the low-serotonin condition than in the control and high-serotonin conditions. Further examinations are required in the future to assess the relation between the disturbance of reward prediction caused by low serotonin and serotonin-related mental disorders such as depression.

  14. Single Dose of a Dopamine Agonist Impairs Reinforcement Learning in Humans: Behavioral Evidence from a Laboratory-based Measure of Reward Responsiveness

    PubMed Central

    Pizzagalli, Diego A.; Evins, A. Eden; Schetter, Erika Cowman; Frank, Michael J.; Pajtas, Petra E.; Santesso, Diane L.; Culhane, Melissa

    2007-01-01

    Rationale The dopaminergic system, particularly D2-like dopamine receptors, has been strongly implicated in reward processing. Animal studies have emphasized the role of phasic dopamine (DA) signaling in reward-related learning, but these processes remain largely unexplored in humans. Objectives To evaluate the effect of a single, low dose of a D2/D3 agonist—pramipexole—on reinforcement learning in healthy adults. Based on prior evidence indicating that low doses of DA agonists decrease phasic DA release through autoreceptor stimulation, we hypothesized that 0.5 mg of pramipexole would impair reward learning due to presynaptic mechanisms. Methods Using a double-blind design, a single 0.5 mg dose of pramipexole or placebo was administered to 32 healthy volunteers, who performed a probabilistic reward task involving a differential reinforcement schedule as well as various control tasks. Results As hypothesized, response bias toward the more frequently rewarded stimulus was impaired in the pramipexole group, even after adjusting for transient adverse effects. In addition, the pramipexole group showed reaction time and motor speed slowing and increased negative affect; however, when adverse physical side effects were considered, group differences in motor speed and negative affect disappeared. Conclusions These findings show that a single low dose of pramipexole impaired the acquisition of reward-related behavior in healthy participants, and they are consistent with prior evidence suggesting that phasic DA signaling is required to reinforce actions leading to reward. The potential implications of the present findings to psychiatric conditions, including depression and impulse control disorders related to addiction, are discussed. PMID:17909750

  15. Heads for learning, tails for memory: reward, reinforcement and a role of dopamine in determining behavioral relevance across multiple timescales

    PubMed Central

    Baudonnat, Mathieu; Huber, Anna; David, Vincent; Walton, Mark E.

    2013-01-01

    Dopamine has long been tightly associated with aspects of reinforcement learning and motivation in simple situations where there are a limited number of stimuli to guide behavior and constrained range of outcomes. In naturalistic situations, however, there are many potential cues and foraging strategies that could be adopted, and it is critical that animals determine what might be behaviorally relevant in such complex environments. This requires not only detecting discrepancies with what they have recently experienced, but also identifying similarities with past experiences stored in memory. Here, we review what role dopamine might play in determining how and when to learn about the world, and how to develop choice policies appropriate to the situation faced. We discuss evidence that dopamine is shaped by motivation and memory and in turn shapes reward-based memory formation. In particular, we suggest that hippocampal-striatal-dopamine networks may interact to determine how surprising the world is and to either inhibit or promote actions at times of behavioral uncertainty. PMID:24130514

  16. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning.

    PubMed

    Balasubramani, Pragathi P; Chakravarthy, V Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk based decision making in a modified Reinforcement Learning (RL)-framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation that accommodates both 5HT and DA reconciles some of the diverse roles of 5HT particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) Risk-sensitive decision making, where 5HT controls risk assessment, (2) Temporal reward prediction, where 5HT controls time-scale of reward prediction, and (3) Reward/Punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG.
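The division of labor described in the abstract, dopamine as the temporal-difference error (δ) and serotonin as a risk-weighting parameter (α), can be sketched roughly as follows. This is a loose mean-variance illustration under assumed update rules, not the authors' model:

```python
# Sketch: DA-like TD error (delta) drives value learning; a 5HT-like
# parameter (alpha) weights a learned risk (variance) term in the action
# utility. Update rules and values are illustrative assumptions.
import math

lr = 0.1
Q = {"safe": 0.0, "risky": 0.0}   # expected reward per action
h = {"safe": 0.0, "risky": 0.0}   # risk (variance) estimate per action

def update(action, reward):
    delta = reward - Q[action]    # DA-like prediction error (single-step)
    Q[action] += lr * delta
    xi = delta**2 - h[action]     # risk prediction error
    h[action] += lr * xi
    return delta

def utility(action, alpha):
    # Higher alpha (5HT) penalizes high-variance actions more strongly.
    return Q[action] - alpha * math.sqrt(h[action])
```

Feeding the "risky" action alternating large wins and losses leaves its mean value near zero but inflates its risk estimate, so a high-alpha agent's utility favors the safe option even when mean values match.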

  17. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning

    PubMed Central

    Balasubramani, Pragathi P.; Chakravarthy, V. Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A.

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk based decision making in a modified Reinforcement Learning (RL)-framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation that accommodates both 5HT and DA reconciles some of the diverse roles of 5HT particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) Risk-sensitive decision making, where 5HT controls risk assessment, (2) Temporal reward prediction, where 5HT controls time-scale of reward prediction, and (3) Reward/Punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG. PMID:24795614

  18. A Social Reinforcement Learning Hypothesis of Mutual Reward Preferences in Rats.

    PubMed

    Hernandez-Lallement, Julen; van Wingerden, Marijn; Schäble, Sandra; Kalenscher, Tobias

    2017-01-01

    Although the use of neuroimaging techniques has revealed much about the neural correlates of social decision making (SDM) in humans, it remains poorly understood how social stimuli are represented, and how social decisions are implemented at the neural level in humans and in other species. To address this issue, the establishment of novel animal paradigms allowing a broad spectrum of neurobiological causal manipulations and neurophysiological recordings provides an exciting tool to investigate the neural implementation of social valuation in the brain. Here, we discuss the potential of a rodent model, Rattus norvegicus, for the understanding of SDM and its neural underpinnings. Particularly, we consider recent data collected in a rodent prosocial choice task within a social reinforcement framework and discuss factors that could drive SDM in rodents.

  19. Reinforcement learning with Marr.

    PubMed

    Niv, Yael; Langdon, Angela

    2016-10-01

    To many, the poster child for David Marr's famous three levels of scientific inquiry is reinforcement learning: a computational theory of reward optimization, which readily prescribes algorithmic solutions that evidence striking resemblance to signals found in the brain, suggesting a straightforward neural implementation. Here we review questions that remain open at each level of analysis, concluding that the path forward to their resolution calls for inspiration across levels, rather than a focus on mutual constraints.

  20. Role of brain dopamine in food reward and reinforcement

    PubMed Central

    Wise, Roy A

    2006-01-01

    The ability of food to establish and maintain response habits and conditioned preferences depends largely on the function of brain dopamine systems. While dopaminergic transmission in the nucleus accumbens appears sufficient for some forms of reward, the role of dopamine in food reward does not appear to be restricted to this region. Dopamine plays an important role in both the ability to energize feeding and to reinforce food-seeking behaviour; the role in energizing feeding is secondary to the prerequisite role in reinforcement. Dopaminergic activation is triggered by the auditory and visual as well as the tactile, olfactory, and gustatory stimuli of foods. While dopamine plays a central role in the feeding and food-seeking of normal animals, some food rewarded learning can be seen in genetically engineered dopamine-deficient mice. PMID:16874930

  1. Placebo Intervention Enhances Reward Learning in Healthy Individuals

    PubMed Central

    Turi, Zsolt; Mittner, Matthias; Paulus, Walter; Antal, Andrea

    2017-01-01

    According to the placebo-reward hypothesis, placebo is a reward-anticipation process that increases midbrain dopamine (DA) levels. Reward-based learning processes, such as reinforcement learning, involve a large part of the DA-ergic network that is also activated by the placebo intervention. Given the neurochemical overlap between placebo and reward learning, we investigated whether verbal instructions in conjunction with a placebo intervention are capable of enhancing reward learning in healthy individuals by using a monetary reward-based reinforcement-learning task. Placebo intervention was performed with non-invasive brain stimulation techniques. In a randomized, triple-blind, cross-over study we investigated this cognitive placebo effect in healthy individuals by manipulating the participants’ perceived uncertainty about the intervention’s efficacy. Volunteers in the purportedly low- and high-uncertainty conditions earned more money, responded more quickly and had a higher learning rate from monetary rewards relative to baseline. Participants in the purportedly high-uncertainty conditions showed enhanced reward learning, and a model-free computational analysis revealed a higher learning rate from monetary rewards compared to the purportedly low-uncertainty and baseline conditions. Our results indicate that the placebo response is able to enhance reward learning in healthy individuals, opening up exciting avenues for future research in placebo effects on other cognitive functions. PMID:28112207

  2. Prosocial Reward Learning in Children and Adolescents

    PubMed Central

    Kwak, Youngbin; Huettel, Scott A.

    2016-01-01

    Adolescence is a period of increased sensitivity to social contexts. To evaluate how social context sensitivity changes over development—and influences reward learning—we investigated how children and adolescents perceive and integrate rewards for oneself and others during a dynamic risky decision-making task. Children and adolescents (N = 75, 8–16 years) performed the Social Gambling Task (SGT, Kwak et al., 2014) and completed a set of questionnaires measuring other-regarding behavior. In the SGT, participants choose amongst four card decks that have different payout structures for oneself and for a charity. We examined patterns of choices, overall decision strategies, and how reward outcomes led to trial-by-trial adjustments in behavior, as estimated using a reinforcement-learning model. Performance of children and adolescents was compared to data from a previously collected sample of adults (N = 102) performing the identical task. We found that children/adolescents were not only more sensitive to rewards directed to the charity than self but also showed greater prosocial tendencies on independent measures of other-regarding behavior. Children and adolescents also showed less use of a strategy that prioritizes rewards for self at the expense of rewards for others. These results support the conclusion that, compared to adults, children and adolescents show greater sensitivity to outcomes for others when making decisions and learning about potential rewards. PMID:27761125

  3. General functioning predicts reward and punishment learning in schizophrenia.

    PubMed

    Somlai, Zsuzsanna; Moustafa, Ahmed A; Kéri, Szabolcs; Myers, Catherine E; Gluck, Mark A

    2011-04-01

    Previous studies investigating feedback-driven reinforcement learning in patients with schizophrenia have provided mixed results. In this study, we explored the clinical predictors of reward and punishment learning using a probabilistic classification learning task. Patients with schizophrenia (n=40) performed similarly to healthy controls (n=30) on the classification learning task. However, more severe negative and general symptoms were associated with lower reward-learning performance, whereas poorer general psychosocial functioning was correlated with both lower reward- and punishment-learning performances. Multiple linear regression analyses indicated that general psychosocial functioning was the only significant predictor of reinforcement learning performance when education, antipsychotic dose, and positive, negative and general symptoms were included in the analysis. These results suggest a close relationship between reinforcement learning and general psychosocial functioning in schizophrenia.

  4. Quantum reinforcement learning.

    PubMed

    Dong, Daoyi; Chen, Chunlin; Li, Hanxiong; Tarn, Tzyh-Jong

    2008-10-01

    The key approaches for machine learning, particularly learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of a value-updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of the eigen action is determined by the probability amplitude, which is updated in parallel according to rewards. Some related characteristics of QRL such as convergence, optimality, and balancing between exploration and exploitation are also analyzed, which shows that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through the quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given, and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. This paper is also an effective exploration on the application of quantum computation to artificial intelligence.
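The amplitude-based action selection described above can be caricatured classically. This sketch only mirrors the idea that selection probability follows the squared amplitude and that rewarded amplitudes are amplified and renormalized; it is a loose illustration, not the paper's QRL algorithm:

```python
# Classical caricature of amplitude-based action selection: each "eigen
# action" carries an amplitude, selection probability is the squared
# amplitude (Born-rule style), and rewarded amplitudes are amplified and
# renormalized. Names and the update factor k are illustrative.
import math, random

amps = {"left": 1 / math.sqrt(2), "right": 1 / math.sqrt(2)}

def select_action():
    r, acc = random.random(), 0.0
    for a, amp in amps.items():
        acc += amp**2                 # probability = squared amplitude
        if r < acc:
            return a
    return a                          # numerical fallback: last action

def reinforce(action, k=1.1):
    amps[action] *= k                 # amplify the rewarded action...
    norm = math.sqrt(sum(v**2 for v in amps.values()))
    for a in amps:                    # ...and renormalize so probs sum to 1
        amps[a] /= norm
```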

  5. Variable Resolution Reinforcement Learning.

    DTIC Science & Technology

    1995-04-01

    Can reinforcement learning ever become a practical method for real control problems? This paper begins by reviewing three reinforcement learning algorithms...reinforcement learning. In addition to exploring state space and developing a control policy to achieve a task, partigame also learns a kd-tree partitioning of

  6. Partial Planning Reinforcement Learning

    DTIC Science & Technology

    2012-08-31

    This project explored several problems in the areas of reinforcement learning, probabilistic planning, and transfer learning. In particular, it...studied Bayesian Optimization for model-based and model-free reinforcement learning, transfer in the context of model-free reinforcement learning based on

  7. Mind matters: placebo enhances reward learning in Parkinson's disease.

    PubMed

    Schmidt, Liane; Braun, Erin Kendall; Wager, Tor D; Shohamy, Daphna

    2014-12-01

    Expectations have a powerful influence on how we experience the world. Neurobiological and computational models of learning suggest that dopamine is crucial for shaping expectations of reward and that expectations alone may influence dopamine levels. However, because expectations and reinforcers are typically manipulated together, the role of expectations per se has remained unclear. We separated these two factors using a placebo dopaminergic manipulation in individuals with Parkinson's disease. We combined a reward learning task with functional magnetic resonance imaging to test how expectations of dopamine release modulate learning-related activity in the brain. We found that the mere expectation of dopamine release enhanced reward learning and modulated learning-related signals in the striatum and the ventromedial prefrontal cortex. These effects were selective to learning from reward: neither medication nor placebo had an effect on learning to avoid monetary loss. These findings suggest a neurobiological mechanism by which expectations shape learning and affect.

  8. Individual differences in sensitivity to reward and punishment and neural activity during reward and avoidance learning.

    PubMed

    Kim, Sang Hee; Yoon, HeungSik; Kim, Hackjin; Hamann, Stephan

    2015-09-01

    In this functional neuroimaging study, we investigated neural activations during the process of learning to gain monetary rewards and to avoid monetary loss, and how these activations are modulated by individual differences in reward and punishment sensitivity. Healthy young volunteers performed a reinforcement learning task where they chose one of two fractal stimuli associated with monetary gain (reward trials) or avoidance of monetary loss (avoidance trials). Trait sensitivity to reward and punishment was assessed using the behavioral inhibition/activation scales (BIS/BAS). Functional neuroimaging results showed activation of the striatum during the anticipation and reception periods of reward trials. During avoidance trials, activation of the dorsal striatum and prefrontal regions was found. As expected, individual differences in reward sensitivity were positively associated with activation in the left and right ventral striatum during reward reception. Individual differences in sensitivity to punishment were negatively associated with activation in the left dorsal striatum during avoidance anticipation and also with activation in the right lateral orbitofrontal cortex during receiving monetary loss. These results suggest that learning to attain reward and learning to avoid loss are dependent on separable sets of neural regions whose activity is modulated by trait sensitivity to reward or punishment.

  9. Global reinforcement learning in neural networks.

    PubMed

    Ma, Xiaolong; Likharev, Konstantin K

    2007-03-01

    In this letter, we have found a more general formulation of the REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility (REINFORCE) learning principle first suggested by Williams. The new formulation has enabled us to apply the principle to global reinforcement learning in networks with various sources of randomness, and to suggest several simple local rules for such networks. Numerical simulations have shown that for simple classification and reinforcement learning tasks, at least one family of the new learning rules gives results comparable to those provided by the famous Rules A(r-i) and A(r-p) for the Boltzmann machines.

  10. [Multiple Dopamine Signals and Their Contributions to Reinforcement Learning].

    PubMed

    Matsumoto, Masayuki

    2016-10-01

    Midbrain dopamine neurons are activated by reward and sensory cue that predicts reward. Their responses resemble reward prediction error that indicates the discrepancy between obtained and expected reward values, which has been thought to play an important role as a teaching signal in reinforcement learning. Indeed, pharmacological blockade of dopamine transmission interferes with reinforcement learning. Recent studies reported, however, that not all dopamine neurons transmit the reward-related signal. They found that a subset of dopamine neurons transmits signals related to non-rewarding, salient experiences such as aversive stimulations and cognitively demanding events. How these signals contribute to animal behavior is not yet well understood. This article reviews recent findings on dopamine signals related to rewarding and non-rewarding experiences, and discusses their contributions to reinforcement learning.

  11. Reinforcement learning and Tourette syndrome.

    PubMed

    Palminteri, Stefano; Pessiglione, Mathias

    2013-01-01

    In this chapter, we report the first experimental explorations of reinforcement learning in Tourette syndrome, realized by our team in the last few years. This report will be preceded by an introduction aimed at providing the reader with the state of the art of knowledge concerning the neural bases of reinforcement learning at the time of these studies and the scientific rationale behind them. In short, reinforcement learning is learning by trial and error to maximize rewards and minimize punishments. This decision-making and learning process implicates the dopaminergic system projecting to the frontal cortex-basal ganglia circuits. A large body of evidence suggests that the dysfunction of the same neural systems is implicated in the pathophysiology of Tourette syndrome. Our results show that the Tourette condition, as well as the most common pharmacological treatments (dopamine antagonists), affects reinforcement learning performance in these patients. Specifically, the results suggest a deficit in negative reinforcement learning, possibly underpinned by a functional hyperdopaminergia, which could explain the persistence of tics, despite their evident maladaptive (negative) value. This idea, together with the implications of these results in Tourette therapy and the future perspectives, is discussed in Section 4 of this chapter.

  12. Reinforcement Learning: A Tutorial.

    DTIC Science & Technology

    1997-01-01

    The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in...provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section (2) the parts of a reinforcement learning problem...reinforcement learning algorithms. These include TD(lambda) and both the residual and direct forms of value iteration, Q-learning, and advantage learning
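Of the algorithms the tutorial lists, Q-learning has the most compact update rule; a minimal sketch with illustrative states and parameters:

```python
# Minimal Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
# States, actions, and parameter values are illustrative.

alpha, gamma = 0.5, 0.9
Q = {}  # (state, action) -> value, defaulting to 0

def q_update(s, a, r, s_next, actions):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

actions = ["left", "right"]
q_update("s0", "right", 1.0, "s1", actions)  # Q[(s0, right)] = 0.5
```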

  13. Microstimulation of the Human Substantia Nigra Alters Reinforcement Learning

    PubMed Central

    Ramayya, Ashwin G.; Misra, Amrit

    2014-01-01

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action–reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action–reward associations rather than stimulus–reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action–reward associations during reinforcement learning. PMID:24828643

  14. Microstimulation of the human substantia nigra alters reinforcement learning.

    PubMed

    Ramayya, Ashwin G; Misra, Amrit; Baltuch, Gordon H; Kahana, Michael J

    2014-05-14

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action-reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action-reward associations rather than stimulus-reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action-reward associations during reinforcement learning.

  15. A neural signature of hierarchical reinforcement learning.

    PubMed

    Ribas-Fernandes, José J F; Solway, Alec; Diuk, Carlos; McGuire, Joseph T; Barto, Andrew G; Niv, Yael; Botvinick, Matthew M

    2011-07-28

    Human behavior displays hierarchical structure: simple actions cohere into subtask sequences, which work together to accomplish overall task goals. Although the neural substrates of such hierarchy have been the target of increasing research, they remain poorly understood. We propose that the computations supporting hierarchical behavior may relate to those in hierarchical reinforcement learning (HRL), a machine-learning framework that extends reinforcement-learning mechanisms into hierarchical domains. To test this, we leveraged a distinctive prediction arising from HRL. In ordinary reinforcement learning, reward prediction errors are computed when there is an unanticipated change in the prospects for accomplishing overall task goals. HRL entails that prediction errors should also occur in relation to task subgoals. In three neuroimaging studies we observed neural responses consistent with such subgoal-related reward prediction errors, within structures previously implicated in reinforcement learning. The results reported support the relevance of HRL to the neural processes underlying hierarchical behavior.
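    A minimal sketch of the distinctive HRL prediction described above (the numeric values are illustrative assumptions, not the paper's model): reaching a subgoal can elicit a prediction error within the option, even when the prospects for the overall task goal are unchanged and the flat, top-level error is zero.

```python
# Illustrative TD errors at the moment a subgoal is reached.
GAMMA = 1.0  # undiscounted for simplicity

def td_error(r, v_next, v_curr):
    """Standard temporal-difference prediction error."""
    return r + GAMMA * v_next - v_curr

# Top level: no external reward, and the value of the overall task
# is the same before and after the subgoal, so the flat RL error is 0.
top_level_delta = td_error(r=0.0, v_next=0.6, v_curr=0.6)

# Option level: the subgoal delivers a pseudo-reward evaluated against
# the option's own value function, producing a subgoal-related error.
subgoal_delta = td_error(r=1.0, v_next=0.0, v_curr=0.8)
```

    It is this nonzero option-level error at a moment when the top-level error is flat that the neuroimaging studies probe.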

  16. Theory meets pigeons: the influence of reward-magnitude on discrimination-learning.

    PubMed

    Rose, Jonas; Schmidt, Robert; Grabemann, Marco; Güntürkün, Onur

    2009-03-02

    Modern theoretical accounts on reward-based learning are commonly based on reinforcement learning algorithms. Most noted in this context is the temporal-difference (TD) algorithm in which the difference between predicted and obtained reward, the prediction-error, serves as a learning signal. Consequently, larger rewards cause bigger prediction-errors and lead to faster learning than smaller rewards. Therefore, if animals employ a neural implementation of TD learning, reward-magnitude should affect learning in animals accordingly. Here we test this prediction by training pigeons on a simple color-discrimination task with two pairs of colors. In each pair, correct discrimination is rewarded; in pair one with a large-reward, in pair two with a small-reward. Pigeons acquired the 'large-reward' discrimination faster than the 'small-reward' discrimination. Animal behavior and an implementation of the TD-algorithm yielded comparable results with respect to the difference between learning curves in the large-reward and in the small-reward conditions. We conclude that the influence of reward-magnitude on the acquisition of a simple discrimination paradigm is accurately reflected by a TD implementation of reinforcement learning.
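    A minimal Rescorla-Wagner/TD(0) sketch of the prediction tested above (not the authors' implementation): the prediction error scales with reward magnitude, so the value of the large-reward discriminandum grows faster in absolute terms on every trial.

```python
def learning_curve(reward, alpha=0.1, trials=50):
    """Trial-by-trial value estimate under a delta-rule update."""
    v = 0.0
    values = []
    for _ in range(trials):
        delta = reward - v   # prediction error: bigger reward, bigger error
        v += alpha * delta   # value update proportional to the error
        values.append(v)
    return values

large = learning_curve(reward=1.0)    # 'large-reward' pair
small = learning_curve(reward=0.25)   # 'small-reward' pair
```

    At every trial the large-reward value estimate exceeds the small-reward one, mirroring the faster acquisition of the large-reward discrimination.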

  17. Reinforcement of Learning

    ERIC Educational Resources Information Center

    Jones, Peter

    1977-01-01

    A company trainer shows some ways of scheduling reinforcement of learning for trainees: continuous reinforcement, fixed ratio, variable ratio, fixed interval, and variable interval. As there are problems with all methods, he suggests trying combinations of various types of reinforcement. (MF)
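    A minimal simulation of two of the schedules listed above (the helper names are mine, not the article's): a fixed-ratio schedule reinforces exactly every nth response, while a variable-ratio schedule reinforces each response with probability 1/n, giving the same average ratio with unpredictable timing.

```python
import random

def fixed_ratio(n):
    """Reinforce exactly every nth response."""
    count = 0
    def respond():
        nonlocal count
        count += 1
        if count == n:
            count = 0
            return True
        return False
    return respond

def variable_ratio(n, rng):
    """Reinforce each response with probability 1/n (average ratio n)."""
    def respond():
        return rng.random() < 1.0 / n
    return respond

fr5 = fixed_ratio(5)
fr_outcomes = [fr5() for _ in range(20)]     # reinforced on responses 5, 10, 15, 20

vr5 = variable_ratio(5, random.Random(0))
vr_outcomes = [vr5() for _ in range(1000)]   # roughly 200 reinforcements
```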

  18. Reinforcement learning: Computational theory and biological mechanisms.

    PubMed

    Doya, Kenji

    2007-05-01

    Reinforcement learning is a computational framework for an active agent to learn behaviors on the basis of a scalar reward signal. The agent can be an animal, a human, or an artificial system such as a robot or a computer program. The reward can be food, water, money, or whatever measure of the performance of the agent. The theory of reinforcement learning, which was developed in an artificial intelligence community with intuitions from animal learning theory, is now giving a coherent account on the function of the basal ganglia. It now serves as the "common language" in which biologists, engineers, and social scientists can exchange their problems and findings. This article reviews the basic theoretical framework of reinforcement learning and discusses its recent and future contributions toward the understanding of animal behaviors and human decision making.

  19. Enhanced Experience Replay for Deep Reinforcement Learning

    DTIC Science & Technology

    2015-11-01

    Temporal-difference Q-learning is used to train the network, and a memory of state–action–reward transitions is kept and used in an experience-replay...al. 2015) uses a convolutional neural network to automatically extract relevant features from the video-game display, then uses reinforcement...situations. To counteract this problem, a memory of past experiences (the state–action–reward information) is stored during training and the network is
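    A minimal sketch of the replay mechanism the excerpt describes (not the cited system): transitions are stored in a bounded memory and sampled uniformly for updates, which breaks the correlation between consecutive experiences.

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling decorrelates the training batch.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=100)
for t in range(500):                      # old transitions fall out of memory
    memory.push((t, 0, 0.0, t + 1))
batch = memory.sample(batch_size=32)      # minibatch for a Q-learning update
```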

  20. Reinforcement learning: Solving two case studies

    NASA Astrophysics Data System (ADS)

    Duarte, Ana Filipa; Silva, Pedro; dos Santos, Cristina Peixoto

    2012-09-01

    Reinforcement Learning algorithms offer interesting features for the control of autonomous systems, such as the ability to learn from direct interaction with the environment, and the use of a simple reward signal, as opposed to the input-output pairs used in classic supervised learning. The reward signal indicates the success or failure of the actions executed by the agent in the environment. In this work, we describe RL algorithms applied to two case studies: the Crawler robot and the widely known inverted pendulum. We explore RL capabilities to autonomously learn a basic locomotion pattern with the Crawler, and approach the balancing problem of biped locomotion using the inverted pendulum.

  1. Hierarchical Multiagent Reinforcement Learning

    DTIC Science & Technology

    2004-01-25

    In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multiagent tasks. We...introduce a hierarchical multiagent reinforcement learning (RL) framework and propose a hierarchical multiagent RL algorithm called Cooperative HRL. In

  2. Modular Inverse Reinforcement Learning for Visuomotor Behavior

    PubMed Central

    Rothkopf, Constantin A.; Ballard, Dana H.

    2013-01-01

    In a large variety of situations one would like to have an expressive and accurate model of observed animal or human behavior. While general purpose mathematical models may capture successfully properties of observed behavior, it is desirable to root models in biological facts. Because of ample empirical evidence for reward-based learning in visuomotor tasks we use a computational model based on the assumption that the observed agent is balancing the costs and benefits of its behavior to meet its goals. This leads to using the framework of Reinforcement Learning, which additionally provides well-established algorithms for learning of visuomotor task solutions. To quantify the agent’s goals as rewards implicit in the observed behavior, we propose to use inverse reinforcement learning. Based on the assumption of a modular cognitive architecture, we introduce a modular inverse reinforcement learning algorithm that estimates the relative reward contributions of the component tasks in navigation, consisting of following a path while avoiding obstacles and approaching targets. It is shown how to recover the component reward weights for individual tasks and that variability in observed trajectories can be explained succinctly through behavioral goals. It is demonstrated through simulations that good estimates can be obtained already with modest amounts of observation data, which in turn allows the prediction of behavior in novel configurations. PMID:23832417

  3. Reinforcement learning in scheduling

    NASA Technical Reports Server (NTRS)

    Dietterich, Tom G.; Ok, Dokyeong; Zhang, Wei; Tadepalli, Prasad

    1994-01-01

    The goal of this research is to apply reinforcement learning methods to real-world problems like scheduling. In this preliminary paper, we show that learning to solve scheduling problems such as the Space Shuttle Payload Processing and the Automatic Guided Vehicle (AGV) scheduling can be usefully studied in the reinforcement learning framework. We discuss some of the special challenges posed by the scheduling domain to these methods and propose some possible solutions we plan to implement.

  4. Do learning rates adapt to the distribution of rewards?

    PubMed

    Gershman, Samuel J

    2015-10-01

    Studies of reinforcement learning have shown that humans learn differently in response to positive and negative reward prediction errors, a phenomenon that can be captured computationally by positing asymmetric learning rates. This asymmetry, motivated by neurobiological and cognitive considerations, has been invoked to explain learning differences across the lifespan as well as a range of psychiatric disorders. Recent theoretical work, motivated by normative considerations, has hypothesized that the learning rate asymmetry should be modulated by the distribution of rewards across the available options. In particular, the learning rate for negative prediction errors should be higher than the learning rate for positive prediction errors when the average reward rate is high, and this relationship should reverse when the reward rate is low. We tested this hypothesis in a series of experiments. Contrary to the theoretical predictions, we found that the asymmetry was largely insensitive to the average reward rate; instead, the dominant pattern was a higher learning rate for negative than for positive prediction errors, possibly reflecting risk aversion.
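    A minimal sketch of the asymmetry at issue (parameter values are illustrative, not fitted): a single-option delta-rule update with separate learning rates for positive and negative prediction errors. The defaults here follow the dominant pattern the study reports, a higher rate for negative than for positive errors.

```python
def update(v, reward, alpha_pos=0.1, alpha_neg=0.3):
    """Delta-rule update with asymmetric learning rates."""
    delta = reward - v                       # reward prediction error
    alpha = alpha_pos if delta > 0 else alpha_neg
    return v + alpha * delta

v = 0.5
v_after_win = update(v, reward=1.0)   # positive error: slow update with these rates
v_after_loss = update(v, reward=0.0)  # negative error: fast update
```

    With these rates a loss moves the value estimate more than an equally sized win; the normative hypothesis tested above predicts that the asymmetry should flip with the average reward rate.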

  5. Dose Dependent Dopaminergic Modulation of Reward-Based Learning in Parkinson's Disease

    ERIC Educational Resources Information Center

    van Wouwe, N. C.; Ridderinkhof, K. R.; Band, G. P. H.; van den Wildenberg, W. P. M.; Wylie, S. A.

    2012-01-01

    Learning to select optimal behavior in new and uncertain situations is a crucial aspect of living and requires the ability to quickly associate stimuli with actions that lead to rewarding outcomes. Mathematical models of reinforcement-based learning to select rewarding actions distinguish between (1) the formation of stimulus-action-reward…

  6. Model-Based Reinforcement Learning under Concurrent Schedules of Reinforcement in Rodents

    ERIC Educational Resources Information Center

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-01-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's…

  7. A universal role of the ventral striatum in reward-based learning: Evidence from human studies

    PubMed Central

    Daniel, Reka; Pollmann, Stefan

    2014-01-01

    Reinforcement learning enables organisms to adjust their behavior in order to maximize rewards. Electrophysiological recordings of dopaminergic midbrain neurons have shown that they code the difference between actual and predicted rewards, i.e., the reward prediction error, in many species. This error signal is conveyed to both the striatum and cortical areas and is thought to play a central role in learning to optimize behavior. However, in human daily life rewards are diverse and often only indirect feedback is available. Here we explore the range of rewards that are processed by the dopaminergic system in human participants, and examine whether it is also involved in learning in the absence of explicit rewards. While results from electrophysiological recordings in humans are sparse, evidence linking dopaminergic activity to the metabolic signal recorded from the midbrain and striatum with functional magnetic resonance imaging (fMRI) is available. Results from fMRI studies suggest that the human ventral striatum (VS) receives valuation information for a diverse set of rewarding stimuli. These range from simple primary reinforcers such as juice rewards, through abstract social rewards, to internally generated signals on perceived correctness, suggesting that the VS is involved in learning from trial-and-error irrespective of the specific nature of provided rewards. In addition, we summarize evidence that the VS can also be implicated when learning from observing others, and in tasks that go beyond simple stimulus-action-outcome learning, indicating that the reward system is also recruited in more complex learning tasks. PMID:24825620

  8. Mate call as reward: Acoustic communication signals can acquire positive reinforcing values during adulthood in female zebra finches (Taeniopygia guttata).

    PubMed

    Hernandez, Alexandra M; Perez, Emilie C; Mulard, Hervé; Mathevon, Nicolas; Vignal, Clémentine

    2016-02-01

    Social stimuli can have rewarding properties and promote learning. In birds, conspecific vocalizations like song can act as a reinforcer, and specific song variants can acquire particular rewarding values during early life exposure. Here we ask if, during adulthood, an acoustic signal simpler and shorter than song can become a reward for a female songbird because of its particular social value. Using an operant choice apparatus, we showed that female zebra finches display a preferential response toward their mate's calls. This reinforcing value of mate's calls could be involved in the maintenance of the monogamous pair-bond of the zebra finch.

  9. Learning Reward Uncertainty in the Basal Ganglia

    PubMed Central

    Bogacz, Rafal

    2016-01-01

    Learning the reliability of different sources of rewards is critical for making optimal choices. However, despite the existence of detailed theory describing how the expected reward is learned in the basal ganglia, it is not known how reward uncertainty is estimated in these circuits. This paper presents a class of models that encode both the mean reward and the spread of the rewards, the former in the difference between the synaptic weights of D1 and D2 neurons, and the latter in their sum. In the models, the tendency to seek (or avoid) options with variable reward can be controlled by increasing (or decreasing) the tonic level of dopamine. The models are consistent with the physiology of and synaptic plasticity in the basal ganglia, they explain the effects of dopaminergic manipulations on choices involving risks, and they make multiple experimental predictions. PMID:27589489
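    A minimal sketch of the coding scheme described above (the update and decay rule here are illustrative assumptions, not the paper's exact plasticity model): a D1-like weight is driven by positive prediction errors and a D2-like weight by negative ones, so their difference tracks the mean reward while their sum tracks its spread.

```python
import random

def learn(rewards, alpha=0.05):
    """Return (mean estimate, spread estimate) from D1/D2-like weights."""
    g = n = 0.5                        # D1 ("Go") and D2 ("NoGo") weights
    for r in rewards:
        delta = r - (g - n)            # prediction error vs. expected reward
        if delta > 0:
            g += alpha * delta         # positive errors strengthen D1
        else:
            n += alpha * (-delta)      # negative errors strengthen D2
        g *= 1 - alpha * 0.1           # mild decay keeps weights bounded
        n *= 1 - alpha * 0.1
    return g - n, g + n                # difference = mean, sum = spread

rng = random.Random(1)
certain = [0.5] * 2000                              # fixed reward of 0.5
risky = [rng.choice([0.0, 1.0]) for _ in range(2000)]  # same mean, high spread
mean_c, spread_c = learn(certain)
mean_r, spread_r = learn(risky)
```

    Both options settle on a similar mean estimate, but the risky option yields a much larger D1+D2 sum, i.e., a larger learned spread.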

  10. Social stress reactivity alters reward and punishment learning.

    PubMed

    Cavanagh, James F; Frank, Michael J; Allen, John J B

    2011-06-01

    To examine how stress affects cognitive functioning, individual differences in trait vulnerability (punishment sensitivity) and state reactivity (negative affect) to social evaluative threat were examined during concurrent reinforcement learning. Lower trait-level punishment sensitivity predicted better reward learning and poorer punishment learning; the opposite pattern was found in more punishment sensitive individuals. Increasing state-level negative affect was directly related to punishment learning accuracy in highly punishment sensitive individuals, but these measures were inversely related in less sensitive individuals. Combined electrophysiological measurement, performance accuracy and computational estimations of learning parameters suggest that trait and state vulnerability to stress alter cortico-striatal functioning during reinforcement learning, possibly mediated via medio-frontal cortical systems.

  11. Reduced reward-related probability learning in schizophrenia patients.

    PubMed

    Yılmaz, Alpaslan; Simsek, Fatma; Gonul, Ali Saffet

    2012-01-01

    Although it is known that individuals with schizophrenia demonstrate marked impairment in reinforcement learning, the details of this impairment are not known. The aim of this study was to test the hypothesis that reward-related probability learning is altered in schizophrenia patients. Twenty-five clinically stable schizophrenia patients and 25 age- and gender-matched controls participated in the study. A simple gambling paradigm was used in which five different cues were associated with different reward probabilities (50%, 67%, and 100%). Participants were asked to make their best guess about the reward probability of each cue. Compared with controls, patients had significant impairment in learning contingencies on the basis of reward-related feedback. The correlation analyses revealed that the impairment of patients partially correlated with the severity of negative symptoms as measured on the Positive and Negative Syndrome Scale but that it was not related to antipsychotic dose. In conclusion, the present study showed that the schizophrenia patients had impaired reward-based learning and that this was independent from their medication status.

  12. How instructed knowledge modulates the neural systems of reward learning

    PubMed Central

    Delgado, Mauricio R.; Phelps, Elizabeth A.

    2011-01-01

    Recent research in neuroeconomics has demonstrated that the reinforcement learning model of reward learning captures the patterns of both behavioral performance and neural responses during a range of economic decision-making tasks. However, this powerful theoretical model has its limits. Trial-and-error is only one of the means by which individuals can learn the value associated with different decision options. Humans have also developed efficient, symbolic means of communication for learning without the necessity for committing multiple errors across trials. In the present study, we observed that instructed knowledge of cue-reward probabilities improves behavioral performance and diminishes reinforcement learning-related blood-oxygen level-dependent (BOLD) responses to feedback in the nucleus accumbens, ventromedial prefrontal cortex, and hippocampal complex. The decrease in BOLD responses in these brain regions to reward-feedback signals was functionally correlated with activation of the dorsolateral prefrontal cortex (DLPFC). These results suggest that when learning action values, participants use the DLPFC to dynamically adjust outcome responses in valuation regions depending on the usefulness of action-outcome information. PMID:21173266

  13. Reinforcement learning or active inference?

    PubMed

    Friston, Karl J; Daunizeau, Jean; Kiebel, Stefan J

    2009-07-29

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

  14. Reinforcement Learning or Active Inference?

    PubMed Central

    Friston, Karl J.; Daunizeau, Jean; Kiebel, Stefan J.

    2009-01-01

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain. PMID:19641614

  15. Manifold Regularized Reinforcement Learning.

    PubMed

    Li, Hongliang; Liu, Derong; Wang, Ding

    2017-01-27

    This paper introduces a novel manifold regularized reinforcement learning scheme for continuous Markov decision processes. Smooth feature representations for value function approximation can be automatically learned using the unsupervised manifold regularization method. The learned features are data-driven, and can be adapted to the geometry of the state space. Furthermore, the scheme provides a direct basis representation extension for novel samples during policy learning and control. The performance of the proposed scheme is evaluated on two benchmark control tasks, i.e., the inverted pendulum and the energy storage problem. Simulation results illustrate the concepts of the proposed scheme and show that it can obtain excellent performance.

  16. Dopamine, reward learning, and active inference

    PubMed Central

    FitzGerald, Thomas H. B.; Dolan, Raymond J.; Friston, Karl

    2015-01-01

    Temporal difference learning models propose phasic dopamine signaling encodes reward prediction errors that drive learning. This is supported by studies where optogenetic stimulation of dopamine neurons can stand in lieu of actual reward. Nevertheless, a large body of data also shows that dopamine is not necessary for learning, and that dopamine depletion primarily affects task performance. We offer a resolution to this paradox based on an hypothesis that dopamine encodes the precision of beliefs about alternative actions, and thus controls the outcome-sensitivity of behavior. We extend an active inference scheme for solving Markov decision processes to include learning, and show that simulated dopamine dynamics strongly resemble those actually observed during instrumental conditioning. Furthermore, simulated dopamine depletion impairs performance but spares learning, while simulated excitation of dopamine neurons drives reward learning, through aberrant inference about outcome states. Our formal approach provides a novel and parsimonious reconciliation of apparently divergent experimental findings. PMID:26581305

  17. Reward and non-reward learning of flower colours in the butterfly Byasa alcinous (Lepidoptera: Papilionidae)

    NASA Astrophysics Data System (ADS)

    Kandori, Ikuo; Yamaki, Takafumi

    2012-09-01

    Learning plays an important role in food acquisition for a wide range of insects. To increase their foraging efficiency, flower-visiting insects may learn to associate floral cues with the presence (so-called reward learning) or the absence (so-called non-reward learning) of a reward. Reward learning whilst foraging for flowers has been demonstrated in many insect taxa, whilst non-reward learning in flower-visiting insects has been demonstrated only in honeybees, bumblebees and hawkmoths. This study examined both reward and non-reward learning abilities in the butterfly Byasa alcinous whilst foraging among artificial flowers of different colours. This butterfly showed both types of learning, although butterflies of both sexes learned faster via reward learning. In addition, females learned via reward learning faster than males. To the best of our knowledge, these are the first empirical data on the learning speed of both reward and non-reward learning in insects. We discuss the adaptive significance of a lower learning speed for non-reward learning when foraging on flowers.

  18. The Computational Development of Reinforcement Learning during Adolescence

    PubMed Central

    Palminteri, Stefano; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-01-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents’ behaviour was better explained by a basic reinforcement learning algorithm, adults’ behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence. PMID:27322574

  19. The Computational Development of Reinforcement Learning during Adolescence.

    PubMed

    Palminteri, Stefano; Kilford, Emma J; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-06-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents' behaviour was better explained by a basic reinforcement learning algorithm, adults' behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence.
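    A minimal sketch of the counterfactual module described above (an illustration with assumed parameter values, not the paper's full model): under complete feedback, the unchosen option is also updated from its forgone outcome. Setting the counterfactual rate to zero recovers the basic learner that better described adolescent behaviour.

```python
def update_values(v, chosen, outcomes, alpha=0.2, alpha_cf=0.2):
    """Update a two-option value vector from factual and forgone outcomes."""
    v = list(v)
    unchosen = 1 - chosen
    # Factual update from the obtained outcome.
    v[chosen] += alpha * (outcomes[chosen] - v[chosen])
    # Counterfactual update from the forgone outcome of the unchosen option.
    v[unchosen] += alpha_cf * (outcomes[unchosen] - v[unchosen])
    return v

v_adult = update_values([0.0, 0.0], chosen=0, outcomes=(1.0, -1.0))
v_basic = update_values([0.0, 0.0], chosen=0, outcomes=(1.0, -1.0), alpha_cf=0.0)
```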

  20. Mapping anhedonia onto reinforcement learning: a behavioural meta-analysis

    PubMed Central

    2013-01-01

    Background Depression is characterised partly by blunted reactions to reward. However, tasks probing this deficiency have not distinguished insensitivity to reward from insensitivity to the prediction errors for reward that determine learning and are putatively reported by the phasic activity of dopamine neurons. We attempted to disentangle these factors with respect to anhedonia in the context of stress, Major Depressive Disorder (MDD), Bipolar Disorder (BPD) and a dopaminergic challenge. Methods Six behavioural datasets involving 392 experimental sessions were subjected to a model-based, Bayesian meta-analysis. Participants across all six studies performed a probabilistic reward task that used an asymmetric reinforcement schedule to assess reward learning. Healthy controls were tested under baseline conditions, stress or after receiving the dopamine D2 agonist pramipexole. In addition, participants with current or past MDD or BPD were evaluated. Reinforcement learning models isolated the contributions of variation in reward sensitivity and learning rate. Results MDD and anhedonia reduced reward sensitivity more than they affected the learning rate, while a low dose of the dopamine D2 agonist pramipexole showed the opposite pattern. Stress led to a pattern consistent with a mixed effect on reward sensitivity and learning rate. Conclusion Reward-related learning reflected at least two partially separable contributions. The first related to phasic prediction error signalling, and was preferentially modulated by a low dose of the dopamine agonist pramipexole. The second related directly to reward sensitivity, and was preferentially reduced in MDD and anhedonia. Stress altered both components. Collectively, these findings highlight the contribution of model-based reinforcement learning meta-analysis for dissecting anhedonic behavior. PMID:23782813
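    A minimal sketch of the two parameters the meta-analysis dissociates (notation assumed): a reward sensitivity rho scales the impact of each reward, while a learning rate alpha scales how quickly prediction errors are incorporated. Reduced rho lowers the asymptote of learned value; reduced alpha slows the approach to it.

```python
def simulate(rho, alpha, rewards):
    """Final value estimate after delta-rule learning on scaled rewards."""
    v = 0.0
    for r in rewards:
        v += alpha * (rho * r - v)   # prediction error on sensitivity-scaled reward
    return v

rewards = [1.0] * 30
v_baseline = simulate(rho=1.0, alpha=0.3, rewards=rewards)
v_low_rho = simulate(rho=0.4, alpha=0.3, rewards=rewards)   # blunted sensitivity
v_low_alpha = simulate(rho=1.0, alpha=0.05, rewards=rewards)  # slowed learning
```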

  1. Interrelated mechanisms in reward and learning.

    PubMed

    Lajtha, Abel

    2008-01-01

    This brief review focuses on recent work in our laboratory, in which we assayed nicotine-induced neurotransmitter changes, compared them to changes induced by other compounds, and examined the receptor systems and their interactions that mediate the changes. The primary aim of our studies is to examine the role of neurotransmitter changes in reward and learning processes. We find that these processes are interlinked and interact: reward-addiction mechanisms include processes of learning, and learning-memory mechanisms include processes of reward. Despite being interlinked, the two processes have different functions and distinct properties, and our long-term aim is to identify the factors that control these processes and the differences among them. Here, we discuss reward processes, which we define as changes examined after administration of nicotine, cocaine, or food, each of which induces changes in neurotransmitter levels and functions in cognitive areas as well as in reward areas. The changes are regionally heterogeneous and are drug- or stimulus-specific. They include changes in the transmitters assayed (catecholamines, amino acids, and acetylcholine) and also in their metabolites; hence, in addition to release, uptake and metabolism are involved. Many receptors modulate the response, with direct and indirect effects. The involvement of many transmitters and receptors, their interactions, and the stimulus specificity of the response indicate that each function, reward and learning, involves a different pattern of changes with a different stimulus; therefore, many different learning and many different reward processes are active, which allows stimulus-specific responses. The complex pattern of reward-induced changes in neurotransmitters is only a part of the multiple changes observed, but one which has a crucial and controlling function.

  2. Multiplexing signals in reinforcement learning with internal models and dopamine.

    PubMed

    Nakahara, Hiroyuki

    2014-04-01

    A fundamental challenge for computational and cognitive neuroscience is to understand how reward-based learning and decision-making are made and how accrued knowledge and internal models of the environment are incorporated. Remarkable progress has been made in the field, guided by the midbrain dopamine reward prediction error hypothesis and the underlying reinforcement learning framework, which does not involve internal models ('model-free'). Recent studies, however, have begun not only to address more complex decision-making processes that are integrated with model-free decision-making, but also to include internal models about environmental reward structures and the minds of other agents, including model-based reinforcement learning and using generalized prediction errors. Even dopamine, a classic model-free signal, may work as multiplexed signals using model-based information and contribute to representational learning of reward structure.
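The model-free/model-based contrast drawn above can be made concrete with a toy sketch (a generic textbook-style illustration, not the paper's model; the state/action names and the depth-limited unrolling are arbitrary choices): a model-free learner caches values from experienced prediction errors, while a model-based learner computes values by unrolling an internal model of transitions and rewards.

```python
def model_free_update(Q, s, a, r, s2, actions=(0, 1), alpha=0.1, gamma=0.9):
    """Model-free: cache action values directly from an experienced reward."""
    target = r + gamma * max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def model_based_value(T, R, s, a, actions=(0, 1), gamma=0.9, depth=3):
    """Model-based: evaluate an action by unrolling an internal model
    (T: deterministic transitions, R: rewards) to a fixed depth."""
    if depth == 0:
        return 0.0
    s2 = T[(s, a)]
    future = max(model_based_value(T, R, s2, b, actions, gamma, depth - 1)
                 for b in actions)
    return R[(s, a)] + gamma * future
```

The model-based value reflects reward structure immediately once the model changes, whereas the cached model-free value only moves after repeated experience.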

  3. Individual differences in reinforcement learning: behavioral, electrophysiological, and neuroimaging correlates.

    PubMed

    Santesso, Diane L; Dillon, Daniel G; Birk, Jeffrey L; Holmes, Avram J; Goetz, Elena; Bogdan, Ryan; Pizzagalli, Diego A

    2008-08-15

    During reinforcement learning, phasic modulations of activity in midbrain dopamine neurons are conveyed to the dorsal anterior cingulate cortex (dACC) and basal ganglia (BG) and serve to guide adaptive responding. While the animal literature supports a role for the dACC in integrating reward history over time, most human electrophysiological studies of dACC function have focused on responses to single positive and negative outcomes. The present electrophysiological study investigated the role of the dACC in probabilistic reward learning in healthy subjects using a task that required integration of reinforcement history over time. We recorded the feedback-related negativity (FRN) to reward feedback in subjects who developed a response bias toward a more frequently rewarded ("rich") stimulus ("learners") versus subjects who did not ("non-learners"). Compared to non-learners, learners showed more positive (i.e., smaller) FRNs and greater dACC activation upon receiving reward for correct identification of the rich stimulus. In addition, dACC activation and a bias to select the rich stimulus were positively correlated. The same participants also completed a monetary incentive delay (MID) task administered during functional magnetic resonance imaging. Compared to non-learners, learners displayed stronger BG responses to reward in the MID task. These findings raise the possibility that learners in the probabilistic reinforcement task were characterized by stronger dACC and BG responses to rewarding outcomes. Furthermore, these results highlight the importance of the dACC to probabilistic reward learning in humans.

  4. Mind matters: Placebo enhances reward learning in Parkinson’s disease

    PubMed Central

    Schmidt, Liane; Braun, Erin Kendall; Wager, Tor D.; Shohamy, Daphna

    2015-01-01

    Expectations have a powerful influence on how we experience the world. Neurobiological and computational models of learning suggest that dopamine is crucial for shaping expectations of reward and that expectations alone may influence dopamine levels. However, because expectations and reinforcers are typically manipulated together, the role of expectations per se has remained unclear. Here, we separated these two factors using a placebo dopaminergic manipulation in Parkinson’s patients. We combined a reward learning task with fMRI to test how expectations of dopamine release modulate learning-related activity in the brain. We found that the mere expectation of dopamine release enhances reward learning and modulates learning-related signals in the striatum and the ventromedial prefrontal cortex. These effects were selective to learning from reward: neither medication nor placebo had an effect on learning to avoid monetary loss. These findings suggest a neurobiological mechanism by which expectations shape learning and affect. PMID:25326691

  5. Reinforcement Learning Trees.

    PubMed

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings.

  6. Reinforcement Learning Trees

    PubMed Central

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R.

    2015-01-01

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings. PMID:26903687

  7. The dissociable effects of punishment and reward on motor learning.

    PubMed

    Galea, Joseph M; Mallia, Elizabeth; Rothwell, John; Diedrichsen, Jörn

    2015-04-01

    A common assumption regarding error-based motor learning (motor adaptation) in humans is that its underlying mechanism is automatic and insensitive to reward- or punishment-based feedback. Contrary to this hypothesis, we show in a double dissociation that the two have independent effects on the learning and retention components of motor adaptation. Negative feedback, whether graded or binary, accelerated learning. While it was not necessary for the negative feedback to be coupled to monetary loss, it had to be clearly related to the actual performance on the preceding movement. Positive feedback did not speed up learning, but it increased retention of the motor memory when performance feedback was withdrawn. These findings reinforce the view that independent mechanisms underpin learning and retention in motor adaptation, reject the assumption that motor adaptation is independent of motivational feedback, and raise new questions regarding the neural basis of negative and positive motivational feedback in motor learning.

  8. How Transitions from Nonrewarded to Rewarded Trials Regulate Responding in Pavlovian and Instrumental Learning Following Extensive Acquisition Training

    ERIC Educational Resources Information Center

    Capaldi, E.J.; Haas, A.; Miller, R.M.; Martins, A.

    2005-01-01

    In both discrimination learning and partial reinforcement, transitions may occur from nonrewarded to rewarded trials (NR transition). In discrimination learning, NR transitions may occur in two different stimulus alternatives (NR different transitions). In partial reward, NR transitions may occur in a single stimulus alternative (NR same…

  9. Neural basis of reinforcement learning and decision making.

    PubMed

    Lee, Daeyeol; Seo, Hyojung; Jung, Min Whan

    2012-01-01

    Reinforcement learning is an adaptive process in which an animal utilizes its previous experience to improve the outcomes of future choices. Computational theories of reinforcement learning play a central role in the newly emerging areas of neuroeconomics and decision neuroscience. In this framework, actions are chosen according to their value functions, which describe how much future reward is expected from each action. Value functions can be adjusted not only through reward and penalty, but also by the animal's knowledge of its current environment. Studies have revealed that a large proportion of the brain is involved in representing and updating value functions and using them to choose an action. However, how the nature of a behavioral task affects the neural mechanisms of reinforcement learning remains incompletely understood. Future studies should uncover the principles by which different computational elements of reinforcement learning are dynamically coordinated across the entire brain.
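The value-function machinery described here can be illustrated with a minimal tabular Q-learning loop (a standard sketch, not tied to any specific study; the `step` environment interface is a hypothetical stand-in):

```python
import random

def q_learning(n_states, n_actions, step, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning: act on current value estimates (epsilon-greedy),
    then adjust them from the reward and the value of the next state."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if random.random() < eps:
                a = random.randrange(n_actions)                   # explore
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])  # exploit
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])  # TD update of the value function
            s = s2
    return Q
```

Value functions here are adjusted through reward alone; incorporating the animal's knowledge of the environment, as the abstract notes, requires model-based extensions beyond this sketch.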

  10. Time-Extended Policies in Multi-Agent Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2004-01-01

Reinforcement learning methods perform well in many domains where a single agent needs to take a sequence of actions to perform a task. These methods use sequences of single-time-step rewards to create a policy that tries to maximize a time-extended utility, which is a (possibly discounted) sum of these rewards. In this paper we build on our previous work showing how these methods can be extended to a multi-agent environment where each agent creates its own policy that works towards maximizing a time-extended global utility over all agents' actions. We show improved methods for creating time-extended utilities for the agents that are both "aligned" with the global utility and "learnable." We then show how to create single-time-step rewards while avoiding the pitfall of having rewards aligned with the global reward leading to utilities not aligned with the global utility. Finally, we apply these reward functions to the multi-agent Gridworld problem. We explicitly quantify a utility's learnability and alignment, and show that reinforcement learning agents using the prescribed reward functions successfully trade off learnability and alignment. As a result they outperform both global (e.g., team games) and local (e.g., "perfectly learnable") reinforcement learning solutions by as much as an order of magnitude.
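Per-agent rewards that are both aligned with the global utility and learnable are commonly realized as difference rewards, D_i = G(z) - G(z_-i): the global utility minus what it would have been without agent i's action. A minimal sketch, where the coverage-style global utility is a made-up example:

```python
def global_utility(actions):
    """Hypothetical global utility: number of distinct slots covered."""
    return len(set(actions))

def difference_reward(actions, i):
    """D_i = G(z) - G(z_-i): recompute the global utility with agent i's
    action removed, so each agent is rewarded for its marginal contribution."""
    counterfactual = actions[:i] + actions[i + 1:]
    return global_utility(actions) - global_utility(counterfactual)
```

An agent duplicating another's choice gets zero credit, while an agent covering a unique slot gets full credit: the signal stays aligned with G yet depends sharply on the agent's own action, which is what makes it learnable.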

  11. Contextual modulation of value signals in reward and punishment learning

    PubMed Central

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-01-01

Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values on a relative (context-dependent) scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of option values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system. PMID:26302782
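The relative, context-dependent valuation described here can be sketched as follows (a simplified reading of the abstract; the learning rate and the exact update form are assumptions):

```python
def contextual_update(Q, V, context, option, reward, alpha=0.3):
    """Update an option's value relative to the context (state) value.

    The context value sets the reference point: in a punishment context
    (negative V), successfully avoiding punishment (outcome 0) yields a
    positive relative outcome, reinforcing the avoidance response.
    """
    relative = reward - V[context]                     # compare outcome to reference point
    q = Q.get((context, option), 0.0)
    Q[(context, option)] = q + alpha * (relative - q)  # update option value on a relative scale
    V[context] += alpha * (reward - V[context])        # track the context's expected value
    return Q, V
```

In a context with expected value -0.5, an outcome of 0 (punishment avoided) produces a positive relative outcome and a positive option value, exactly the effect the authors highlight.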

  12. Contextual modulation of value signals in reward and punishment learning.

    PubMed

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-08-25

Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values on a relative (context-dependent) scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of option values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system.

  13. Risk-sensitive reinforcement learning.

    PubMed

    Shen, Yun; Tobia, Michael J; Sommer, Tobias; Obermayer, Klaus

    2014-07-01

We derive a family of risk-sensitive reinforcement learning methods for agents who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subjects' responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition, we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.
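The core idea, applying a utility function to the TD error before the update, can be sketched as below (the piecewise-linear utility is a hypothetical choice for illustration, not the authors' exact family):

```python
def utility(delta, kappa=0.5):
    """Piecewise-linear utility on the TD error: with kappa > 0, losses are
    weighted more heavily than gains (risk aversion). Hypothetical form."""
    return (1 - kappa) * delta if delta >= 0 else (1 + kappa) * delta

def risk_sensitive_q_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9, kappa=0.5):
    """One Q-learning step with the utility applied to the TD error, so the
    learned values reflect risk preferences rather than expected value alone."""
    delta = r + gamma * max(Q[s2]) - Q[s][a]
    Q[s][a] += alpha * utility(delta, kappa)
    return Q
```

With kappa = 0.5, a loss of the same magnitude moves the value three times as far as a gain, giving the asymmetric treatment of gains and losses that prospect theory predicts.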

  14. Statistical Mechanics of the Delayed Reward-Based Learning with Node Perturbation

    NASA Astrophysics Data System (ADS)

Saito, Hiroshi; Katahira, Kentaro; Okanoya, Kazuo; Okada, Masato

    2010-06-01

In reward-based learning, reward is typically given with some delay after the behavior that causes it. In the machine learning literature, the framework of the eligibility trace has been used as one solution for handling delayed reward in reinforcement learning. Recent studies imply that the eligibility trace is important for a difficult neuroscience problem known as the “distal reward problem”. Node perturbation is one of the stochastic gradient methods among the many implementations of reinforcement learning; it estimates the gradient by introducing a perturbation into a network. Since the stochastic gradient method does not require a differential of the objective function, it is expected to be able to account for the learning mechanism of a complex system, like a brain. We study node perturbation with the eligibility trace as a specific example of delayed reward-based learning and analyze it using a statistical mechanics approach. As a result, we show the optimal time constant of the eligibility trace with respect to the reward delay and the existence of unlearnable parameter configurations.
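A minimal sketch of node perturbation on a single linear unit may help (an illustrative toy with immediate reward, so the eligibility trace spans only one step; all constants are arbitrary):

```python
import random

def train_node_perturbation(target=1.0, steps=2000, eta=0.5, sigma=0.1, seed=0):
    """A single linear unit y = w * x learns to hit `target` for x = 1 from a
    scalar reward alone. A perturbation xi is injected at the node; the
    eligibility trace (here xi * x, one step long since reward is immediate)
    records which way the perturbation pushed the output."""
    random.seed(seed)
    w, x = 0.0, 1.0
    for _ in range(steps):
        xi = random.gauss(0.0, sigma)        # perturbation at the node
        y_clean = w * x
        r = -((y_clean + xi) - target) ** 2  # reward with the perturbation
        r0 = -(y_clean - target) ** 2        # baseline reward without it
        trace = xi * x                       # eligibility: perturbation-input correlation
        w += eta * (r - r0) * trace          # reward-modulated gradient estimate
    return w
```

In expectation the update equals gradient descent on the squared error scaled by the perturbation variance, even though no derivative of the objective is ever computed; the delayed-reward case studied in the paper decays the trace over several steps before the reward arrives.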

  15. Learning Analytics: Readiness and Rewards

    ERIC Educational Resources Information Center

    Friesen, Norm

    2013-01-01

    This position paper introduces the relatively new field of learning analytics, first by considering the relevant meanings of both "learning" and "analytics," and then by looking at two main levels at which learning analytics can be or has been implemented in educational organizations. Although integrated turnkey systems or…

  16. Credit assignment during movement reinforcement learning.

    PubMed

    Dam, Gregory; Kording, Konrad; Wei, Kunlin

    2013-01-01

    We often need to learn how to move based on a single performance measure that reflects the overall success of our movements. However, movements have many properties, such as their trajectories, speeds and timing of end-points, thus the brain needs to decide which properties of movements should be improved; it needs to solve the credit assignment problem. Currently, little is known about how humans solve credit assignment problems in the context of reinforcement learning. Here we tested how human participants solve such problems during a trajectory-learning task. Without an explicitly-defined target movement, participants made hand reaches and received monetary rewards as feedback on a trial-by-trial basis. The curvature and direction of the attempted reach trajectories determined the monetary rewards received in a manner that can be manipulated experimentally. Based on the history of action-reward pairs, participants quickly solved the credit assignment problem and learned the implicit payoff function. A Bayesian credit-assignment model with built-in forgetting accurately predicts their trial-by-trial learning.

  17. Synthetic cathinones and their rewarding and reinforcing effects in rodents

    PubMed Central

    Watterson, Lucas R.; Olive, M. Foster

    2014-01-01

Synthetic cathinones, colloquially referred to as “bath salts”, are derivatives of the psychoactive alkaloid cathinone found in Catha edulis (Khat). Since the mid-to-late 2000s, these amphetamine-like psychostimulants have gained popularity amongst drug users due to their potency, low cost, ease of procurement, and constantly evolving chemical structures. Concomitant with their increased use is the emergence of a growing collection of case reports of bizarre and dangerous behaviors, toxicity to numerous organ systems, and death. However, scientific information regarding the abuse liability of these drugs has been relatively slower to materialize. Recently we have published several studies demonstrating that laboratory rodents will readily self-administer the “first generation” synthetic cathinones methylenedioxypyrovalerone (MDPV) and methylone via the intravenous route, in patterns similar to those of methamphetamine. Under progressive ratio schedules of reinforcement, the rank order of reinforcing efficacy of these compounds is MDPV ≥ methamphetamine > methylone. MDPV and methylone, as well as the “second generation” synthetic cathinones α-pyrrolidinovalerophenone (α-PVP) and 4-methylethcathinone (4-MEC), also dose-dependently increase brain reward function. Collectively, these findings indicate that synthetic cathinones have a high abuse and addiction potential and underscore the need for future assessment of the extent and duration of neurotoxicity induced by these emerging drugs of abuse. PMID:25328910

  18. Reinforcement Learning Through Gradient Descent

    DTIC Science & Technology

    1999-05-14

    Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for...practice of existing types of algorithms, the gradient descent approach makes it possible to create entirely new classes of reinforcement learning algorithms

  19. Phasic dopamine as a prediction error of intrinsic and extrinsic reinforcements driving both action acquisition and reward maximization: a simulated robotic study.

    PubMed

    Mirolli, Marco; Santucci, Vieri G; Baldassarre, Gianluca

    2013-03-01

    An important issue of recent neuroscientific research is to understand the functional role of the phasic release of dopamine in the striatum, and in particular its relation to reinforcement learning. The literature is split between two alternative hypotheses: one considers phasic dopamine as a reward prediction error similar to the computational TD-error, whose function is to guide an animal to maximize future rewards; the other holds that phasic dopamine is a sensory prediction error signal that lets the animal discover and acquire novel actions. In this paper we propose an original hypothesis that integrates these two contrasting positions: according to our view phasic dopamine represents a TD-like reinforcement prediction error learning signal determined by both unexpected changes in the environment (temporary, intrinsic reinforcements) and biological rewards (permanent, extrinsic reinforcements). Accordingly, dopamine plays the functional role of driving both the discovery and acquisition of novel actions and the maximization of future rewards. To validate our hypothesis we perform a series of experiments with a simulated robotic system that has to learn different skills in order to get rewards. We compare different versions of the system in which we vary the composition of the learning signal. The results show that only the system reinforced by both extrinsic and intrinsic reinforcements is able to reach high performance in sufficiently complex conditions.
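The proposed composite learning signal, a TD-like error driven by both permanent extrinsic rewards and transient intrinsic reinforcements, might be sketched as follows (the exponential novelty decay is an assumed stand-in for "unexpected changes in the environment", not the authors' exact formulation):

```python
class IntrinsicExtrinsicTD:
    """TD-like prediction error combining a permanent extrinsic reward with a
    transient intrinsic reinforcement that fades as an event becomes familiar."""

    def __init__(self, gamma=0.95, decay=0.5):
        self.gamma, self.decay = gamma, decay
        self.familiarity = {}

    def intrinsic(self, event):
        n = self.familiarity.get(event, 0)
        self.familiarity[event] = n + 1
        return self.decay ** n  # 1.0 on first occurrence, then fades

    def td_error(self, v_s, v_s2, extrinsic_r, event):
        total_r = extrinsic_r + self.intrinsic(event)
        return total_r + self.gamma * v_s2 - v_s
```

Early encounters with a surprising event generate a large error that drives skill discovery even without extrinsic reward; as the event becomes predictable the intrinsic term vanishes and only biological rewards sustain learning, matching the temporary/permanent distinction in the abstract.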

  20. Benchmarking for Bayesian Reinforcement Learning

    PubMed Central

    Ernst, Damien; Couëtoux, Adrien

    2016-01-01

In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the rewards collected while interacting with their environment, drawing on prior knowledge acquired beforehand. Many BRL algorithms have already been proposed, but the benchmarks used to compare them are only relevant for specific cases. This paper addresses that problem and provides a new BRL comparison methodology along with the corresponding open-source library. In this methodology, a comparison criterion is defined that measures the performance of algorithms on large sets of Markov Decision Processes (MDPs) drawn from some probability distributions. In order to enable the comparison of non-anytime algorithms, our methodology also includes a detailed analysis of the computation time requirement of each algorithm. Our library is released with all source code and documentation: it includes three test problems, each with two different prior distributions, and seven state-of-the-art RL algorithms. Finally, our library is illustrated by comparing all the available algorithms, and the results are discussed. PMID:27304891

  1. Benchmarking for Bayesian Reinforcement Learning.

    PubMed

    Castronovo, Michael; Ernst, Damien; Couëtoux, Adrien; Fonteneau, Raphael

    2016-01-01

In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the rewards collected while interacting with their environment, drawing on prior knowledge acquired beforehand. Many BRL algorithms have already been proposed, but the benchmarks used to compare them are only relevant for specific cases. This paper addresses that problem and provides a new BRL comparison methodology along with the corresponding open-source library. In this methodology, a comparison criterion is defined that measures the performance of algorithms on large sets of Markov Decision Processes (MDPs) drawn from some probability distributions. In order to enable the comparison of non-anytime algorithms, our methodology also includes a detailed analysis of the computation time requirement of each algorithm. Our library is released with all source code and documentation: it includes three test problems, each with two different prior distributions, and seven state-of-the-art RL algorithms. Finally, our library is illustrated by comparing all the available algorithms, and the results are discussed.

  2. Roles of octopaminergic and dopaminergic neurons in mediating reward and punishment signals in insect visual learning.

    PubMed

    Unoki, Sae; Matsumoto, Yukihisa; Mizunami, Makoto

    2006-10-01

Insects, like vertebrates, have considerable ability to associate visual, olfactory or other sensory signals with reward or punishment. Previous studies in crickets, honey bees and fruit flies have suggested that octopamine (OA, the invertebrate counterpart of noradrenaline) and dopamine (DA) mediate various kinds of reward and punishment signals in olfactory learning. However, whether the roles of OA and DA in mediating positive and negative reinforcing signals can be generalized to learning of sensory signals other than odors remained unknown. Here we first established a visual learning paradigm for crickets in which a visual pattern is associated with water reward or saline punishment, and found that memory after aversive conditioning decayed much faster than that after appetitive conditioning. We then pharmacologically studied the roles of OA and DA in appetitive and aversive forms of visual learning. Crickets injected with epinastine or mianserin, OA receptor antagonists, into the hemolymph exhibited a complete impairment of appetitive learning to associate a visual pattern with water reward, but aversive learning with saline punishment was unaffected. By contrast, fluphenazine, chlorpromazine or spiperone, DA receptor antagonists, completely impaired aversive learning without affecting appetitive learning. The results demonstrate that OA and DA participate in reward and punishment conditioning in visual learning. This finding, together with results of previous studies on the roles of OA and DA in olfactory learning, suggests ubiquitous roles of the octopaminergic reward system and dopaminergic punishment system in insect learning.

  3. The Dopamine Prediction Error: Contributions to Associative Models of Reward Learning

    PubMed Central

    Nasser, Helen M.; Calu, Donna J.; Schoenbaum, Geoffrey; Sharpe, Melissa J.

    2017-01-01

    Phasic activity of midbrain dopamine neurons is currently thought to encapsulate the prediction-error signal described in Sutton and Barto’s (1981) model-free reinforcement learning algorithm. This phasic signal is thought to contain information about the quantitative value of reward, which transfers to the reward-predictive cue after learning. This is argued to endow the reward-predictive cue with the value inherent in the reward, motivating behavior toward cues signaling the presence of reward. Yet theoretical and empirical research has implicated prediction-error signaling in learning that extends far beyond a transfer of quantitative value to a reward-predictive cue. Here, we review the research which demonstrates the complexity of how dopaminergic prediction errors facilitate learning. After briefly discussing the literature demonstrating that phasic dopaminergic signals can act in the manner described by Sutton and Barto (1981), we consider how these signals may also influence attentional processing across multiple attentional systems in distinct brain circuits. Then, we discuss how prediction errors encode and promote the development of context-specific associations between cues and rewards. Finally, we consider recent evidence that shows dopaminergic activity contains information about causal relationships between cues and rewards that reflect information garnered from rich associative models of the world that can be adapted in the absence of direct experience. In discussing this research we hope to support the expansion of how dopaminergic prediction errors are thought to contribute to the learning process beyond the traditional concept of transferring quantitative value. PMID:28275359
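The transfer of the prediction error from reward delivery to the reward-predictive cue can be reproduced with a two-event TD(0) toy (a generic textbook sketch of the Sutton-and-Barto-style account, with cue onset treated as unpredictable):

```python
def td_transfer(n_trials=100, alpha=0.2):
    """TD(0) over repeated cue -> reward trials. Early on, the prediction
    error fires at reward delivery; after learning it has moved to cue
    onset, whose PE equals the learned cue value V(cue)."""
    V_cue = 0.0
    history = []
    for _ in range(n_trials):
        cue_pe = V_cue           # PE at cue onset: the unexpected cue predicts V(cue)
        rew_pe = 1.0 - V_cue     # PE at reward: reward minus the cue's prediction
        V_cue += alpha * rew_pe  # learn the cue's value from the reward PE
        history.append((cue_pe, rew_pe))
    return history
```

On the first trial the error occurs entirely at reward time; after training it occurs entirely at the cue, which is the quantitative-value transfer the review takes as its starting point before arguing for richer contributions.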

  4. The Dopamine Prediction Error: Contributions to Associative Models of Reward Learning.

    PubMed

    Nasser, Helen M; Calu, Donna J; Schoenbaum, Geoffrey; Sharpe, Melissa J

    2017-01-01

    Phasic activity of midbrain dopamine neurons is currently thought to encapsulate the prediction-error signal described in Sutton and Barto's (1981) model-free reinforcement learning algorithm. This phasic signal is thought to contain information about the quantitative value of reward, which transfers to the reward-predictive cue after learning. This is argued to endow the reward-predictive cue with the value inherent in the reward, motivating behavior toward cues signaling the presence of reward. Yet theoretical and empirical research has implicated prediction-error signaling in learning that extends far beyond a transfer of quantitative value to a reward-predictive cue. Here, we review the research which demonstrates the complexity of how dopaminergic prediction errors facilitate learning. After briefly discussing the literature demonstrating that phasic dopaminergic signals can act in the manner described by Sutton and Barto (1981), we consider how these signals may also influence attentional processing across multiple attentional systems in distinct brain circuits. Then, we discuss how prediction errors encode and promote the development of context-specific associations between cues and rewards. Finally, we consider recent evidence that shows dopaminergic activity contains information about causal relationships between cues and rewards that reflect information garnered from rich associative models of the world that can be adapted in the absence of direct experience. In discussing this research we hope to support the expansion of how dopaminergic prediction errors are thought to contribute to the learning process beyond the traditional concept of transferring quantitative value.

  5. Reinforcement Learning for Scheduling of Maintenance

    NASA Astrophysics Data System (ADS)

    Knowles, Michael; Baglee, David; Wermter, Stefan

Improving maintenance scheduling has become an area of crucial importance in recent years. Condition-based maintenance (CBM) has started to move away from scheduled maintenance by providing an indication of the likelihood of failure. Improving the timing of maintenance based on this information, so as to maintain high reliability without resorting to over-maintenance, remains a problem, however. In this paper we propose Reinforcement Learning (RL), which improves long-term reward for a multistage decision based on feedback given either during or at the end of a sequence of actions, as a potential solution to this problem. Several indicative scenarios are presented, and simulated experiments illustrate the performance of RL in this application.
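A toy condition-based maintenance MDP shows how RL can time maintenance without over-maintaining (the states, costs, and wear process below are invented for illustration and are not the paper's scenarios):

```python
import random

def learn_maintenance_policy(episodes=3000, alpha=0.1, gamma=0.95, eps=0.1, seed=1):
    """Q-learning on a toy CBM problem: condition degrades from 0 (new) to
    3 (failed). Action 0 = run (production reward, stochastic wear, large
    cost on failure); action 1 = maintain (small cost, restores condition)."""
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(4)]
    for _ in range(episodes):
        s = 0
        for _ in range(20):  # finite operating horizon per episode
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = 0 if Q[s][0] >= Q[s][1] else 1
            if a == 1:            # maintain: small cost, machine restored
                r, s2 = -1.0, 0
            elif s == 3:          # ran a failed machine: large cost, forced repair
                r, s2 = -10.0, 0
            else:                 # run: production reward, stochastic wear
                r = 1.0
                s2 = s + (1 if random.random() < 0.5 else 0)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

The learned values prefer running a healthy machine and repairing a degraded one, i.e. the policy trades production reward against failure cost rather than maintaining on a fixed schedule.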

  6. Inter-module credit assignment in modular reinforcement learning.

    PubMed

    Samejima, Kazuyuki; Doya, Kenji; Kawato, Mitsuo

    2003-09-01

    Critical issues in modular or hierarchical reinforcement learning (RL) are (i) how to decompose a task into sub-tasks, (ii) how to achieve independence of learning of sub-tasks, and (iii) how to assure optimality of the composite policy for the entire task. The second and third requirements often involve a trade-off. We propose a method for propagating the reward for the entire task achievement between modules. This is done in the form of a 'modular reward', which is calculated from the temporal difference of the module gating signal and the value of the succeeding module. We implement modular reward for a multiple model-based reinforcement learning (MMRL) architecture and show its effectiveness in simulations of a pursuit task with hidden states and a continuous-time non-linear control task.

  7. Optogenetic Mimicry of the Transient Activation of Dopamine Neurons by Natural Reward Is Sufficient for Operant Reinforcement

    PubMed Central

    Kim, Kyung Man; Baratta, Michael V.; Yang, Aimei; Lee, Doheon; Boyden, Edward S.; Fiorillo, Christopher D.

    2012-01-01

    Activation of dopamine receptors in forebrain regions, for minutes or longer, is known to be sufficient for positive reinforcement of stimuli and actions. However, the firing rate of dopamine neurons is increased for only about 200 milliseconds following natural reward events that are better than expected, a response which has been described as a “reward prediction error” (RPE). Although RPE drives reinforcement learning (RL) in computational models, it has not been possible to directly test whether the transient dopamine signal actually drives RL. Here we have performed optical stimulation of genetically targeted ventral tegmental area (VTA) dopamine neurons expressing Channelrhodopsin-2 (ChR2) in mice. We mimicked the transient activation of dopamine neurons that occurs in response to natural reward by applying a light pulse of 200 ms in VTA. When a single light pulse followed each self-initiated nose poke, it was sufficient in itself to cause operant reinforcement. Furthermore, when optical stimulation was delivered in separate sessions according to a predetermined pattern, it increased locomotion and contralateral rotations, behaviors that are known to result from activation of dopamine neurons. All three of the optically induced operant and locomotor behaviors were tightly correlated with the number of VTA dopamine neurons that expressed ChR2, providing additional evidence that the behavioral responses were caused by activation of dopamine neurons. These results provide strong evidence that the transient activation of dopamine neurons provides a functional reward signal that drives learning, in support of RL theories of dopamine function. PMID:22506004

  8. Optogenetic mimicry of the transient activation of dopamine neurons by natural reward is sufficient for operant reinforcement.

    PubMed

    Kim, Kyung Man; Baratta, Michael V; Yang, Aimei; Lee, Doheon; Boyden, Edward S; Fiorillo, Christopher D

    2012-01-01

    Activation of dopamine receptors in forebrain regions, for minutes or longer, is known to be sufficient for positive reinforcement of stimuli and actions. However, the firing rate of dopamine neurons is increased for only about 200 milliseconds following natural reward events that are better than expected, a response which has been described as a "reward prediction error" (RPE). Although RPE drives reinforcement learning (RL) in computational models, it has not been possible to directly test whether the transient dopamine signal actually drives RL. Here we have performed optical stimulation of genetically targeted ventral tegmental area (VTA) dopamine neurons expressing Channelrhodopsin-2 (ChR2) in mice. We mimicked the transient activation of dopamine neurons that occurs in response to natural reward by applying a light pulse of 200 ms in VTA. When a single light pulse followed each self-initiated nose poke, it was sufficient in itself to cause operant reinforcement. Furthermore, when optical stimulation was delivered in separate sessions according to a predetermined pattern, it increased locomotion and contralateral rotations, behaviors that are known to result from activation of dopamine neurons. All three of the optically induced operant and locomotor behaviors were tightly correlated with the number of VTA dopamine neurons that expressed ChR2, providing additional evidence that the behavioral responses were caused by activation of dopamine neurons. These results provide strong evidence that the transient activation of dopamine neurons provides a functional reward signal that drives learning, in support of RL theories of dopamine function.

  9. Reinforcement learning: the good, the bad and the ugly.

    PubMed

    Dayan, Peter; Niv, Yael

    2008-04-01

    Reinforcement learning provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making in the face of rewards and punishments. Here we review the latest dispatches from the forefront of this field, and map out some of the territories where lie monsters.

  10. Reinforcement learning in multidimensional environments relies on attention mechanisms.

    PubMed

    Niv, Yael; Daniel, Reka; Geana, Andra; Gershman, Samuel J; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C

    2015-05-27

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this "representation learning" process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the "curse of dimensionality" in reinforcement learning.

  11. Changes in corticostriatal connectivity during reinforcement learning in humans.

    PubMed

    Horga, Guillermo; Maia, Tiago V; Marsh, Rachel; Hao, Xuejun; Xu, Dongrong; Duan, Yunsuo; Tau, Gregory Z; Graniello, Barbara; Wang, Zhishun; Kangarlu, Alayar; Martinez, Diana; Packard, Mark G; Peterson, Bradley S

    2015-02-01

    Many computational models assume that reinforcement learning relies on changes in synaptic efficacy between cortical regions representing stimuli and striatal regions involved in response selection, but this assumption has thus far lacked empirical support in humans. We recorded hemodynamic signals with fMRI while participants navigated a virtual maze to find hidden rewards. We fitted a reinforcement-learning algorithm to participants' choice behavior and evaluated the neural activity and the changes in functional connectivity related to trial-by-trial learning variables. Activity in the posterior putamen during choice periods increased progressively during learning. Furthermore, the functional connections between the sensorimotor cortex and the posterior putamen strengthened progressively as participants learned the task. These changes in corticostriatal connectivity differentiated participants who learned the task from those who did not. These findings provide a direct link between changes in corticostriatal connectivity and learning, thereby supporting a central assumption common to several computational models of reinforcement learning.

  12. Changes in corticostriatal connectivity during reinforcement learning in humans

    PubMed Central

    Horga, Guillermo; Maia, Tiago V.; Marsh, Rachel; Hao, Xuejun; Xu, Dongrong; Duan, Yunsuo; Tau, Gregory Z.; Graniello, Barbara; Wang, Zhishun; Kangarlu, Alayar; Martinez, Diana; Packard, Mark G.; Peterson, Bradley S.

    2015-01-01

    Many computational models assume that reinforcement learning relies on changes in synaptic efficacy between cortical regions representing stimuli and striatal regions involved in response selection, but this assumption has thus far lacked empirical support in humans. We recorded hemodynamic signals with fMRI while participants navigated a virtual maze to find hidden rewards. We fitted a reinforcement-learning algorithm to participants’ choice behavior and evaluated the neural activity and the changes in functional connectivity related to trial-by-trial learning variables. Activity in the posterior putamen during choice periods increased progressively during learning. Furthermore, the functional connections between the sensorimotor cortex and the posterior putamen strengthened progressively as participants learned the task. These changes in corticostriatal connectivity differentiated participants who learned the task from those who did not. These findings provide a direct link between changes in corticostriatal connectivity and learning, thereby supporting a central assumption common to several computational models of reinforcement learning. PMID:25393839

  13. Differential Reward Learning for Self and Others Predicts Self-Reported Altruism

    PubMed Central

    Kwak, Youngbin; Pearson, John; Huettel, Scott A.

    2014-01-01

    In social environments, decisions not only determine rewards for oneself but also for others. However, individual differences in pro-social behaviors have been typically studied through self-report. We developed a decision-making paradigm in which participants chose from card decks with differing rewards for themselves and charity; some decks gave similar rewards to both, while others gave higher rewards for one or the other. We used a reinforcement-learning model that estimated each participant's relative weighting of self versus charity reward. As shown both in choices and model parameters, individuals who showed relatively better learning of rewards for charity – compared to themselves – were more likely to engage in pro-social behavior outside of a laboratory setting indicated by self-report. Overall rates of reward learning, however, did not predict individual differences in pro-social tendencies. These results support the idea that biases toward learning about social rewards are associated with one's altruistic tendencies. PMID:25215883
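
    The kind of reinforcement-learning model described, in which a single parameter sets the relative weighting of self versus charity reward, can be sketched with a delta rule. The deck payoffs, weight value, and names below are invented for illustration, not the study's fitted parameters:

```python
# Hypothetical delta-rule sketch: parameter W weights rewards to self vs.
# charity when a deck's value is updated. Payoffs and constants are invented.
import random

random.seed(3)
ALPHA, W = 0.2, 0.3                  # learning rate; weight on self reward
decks = {"self_heavy": (10, 2),      # (reward to self, reward to charity)
         "charity_heavy": (2, 10)}
Q = {d: 0.0 for d in decks}

for _ in range(200):
    d = random.choice(list(decks))
    r_self, r_charity = decks[d]
    outcome = W * r_self + (1 - W) * r_charity   # subjective combined reward
    Q[d] += ALPHA * (outcome - Q[d])             # delta-rule value update
```

    With W below 0.5, the agent's learned values favor the charity-heavy deck, which is the qualitative pattern the study links to self-reported altruism.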

  14. Reward: From Basic Reinforcers to Anticipation of Social Cues.

    PubMed

    Rademacher, Lena; Schulte-Rüther, Martin; Hanewald, Bernd; Lammertz, Sarah

    2017-01-01

    Reward processing plays a major role in goal-directed behavior and motivation. On the neural level, it is mediated by a complex network of brain structures called the dopaminergic reward system. In the last decade, neuroscientific researchers have become increasingly interested in aspects of social interaction that are experienced as rewarding. Recent neuroimaging studies have provided evidence that the reward system mediates the processing of social stimuli in a manner analogous to nonsocial rewards and thus motivates social behavior. In this context, the neuropeptide oxytocin is assumed to play a key role by activating dopaminergic reward pathways in response to social cues, inducing the rewarding quality of social interactions. Alterations in the dopaminergic reward system have been found in several psychiatric disorders that are accompanied by social interaction and motivation problems, for example autism, attention deficit/hyperactivity disorder, addiction disorders, and schizophrenia.

  15. Psychological distance to reward: Segmentation of aperiodic schedules of reinforcement

    PubMed Central

    Leung, Jin-Pang

    1993-01-01

    College students responded for monetary rewards in two experiments on choice between differentially segmented aperiodic schedules of reinforcement. On a microcomputer, the concurrent chains were simulated as an air-defense video game in which subjects used two radars for detecting and destroying enemy aircraft. To earn more cash-exchangeable points, subjects had to shoot down as many planes as possible within a given period of time. For both experiments, access to one of two radar systems (terminal link) was controlled by a pair of independent concurrent variable-interval 60-s schedules (initial link) with a 4-s changeover delay always in effect. In Experiment 1, the appearance of an enemy aircraft in the terminal link was determined by a variable-interval (15 s or 60 s) schedule or a two-component chained variable-interval schedule of equal duration. Experiment 2 was similar to Experiment 1 except for the segmented schedule, which had three components. Subjects preferred the unsegmented schedule over its segmented counterpart in the conditions with variable-interval 60 s, and preference tended to be more pronounced with more components in the segmented schedule. These findings are compatible with those from previous studies of periodic and aperiodic schedules with pigeons or humans as subjects. PMID:16812691

  16. How we learn to make decisions: rapid propagation of reinforcement learning prediction errors in humans.

    PubMed

    Krigolson, Olav E; Hassall, Cameron D; Handy, Todd C

    2014-03-01

    Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors-discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833-1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129-141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769-776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679-709, 2002]. Here, we used the brain ERP technique to demonstrate that not only do rewards elicit a neural response akin to a prediction error but also that this signal rapidly diminished and propagated to the time of choice presentation with learning. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component that has a similar timing and topography as the feedback error-related negativity that increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward

  17. Meta-learning in reinforcement learning.

    PubMed

    Schweighofer, Nicolas; Doya, Kenji

    2003-01-01

    Meta-parameters in reinforcement learning should be tuned to the environmental dynamics and the animal performance. Here, we propose a biologically plausible meta-reinforcement learning algorithm for tuning these meta-parameters in a dynamic, adaptive manner. We tested our algorithm in both a simulation of a Markov decision task and in a non-linear control task. Our results show that the algorithm robustly finds appropriate meta-parameter values, and controls the meta-parameter time course, in both static and dynamic environments. We suggest that the phasic and tonic components of dopamine neuron firing can encode the signal required for meta-learning of reinforcement learning.
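
    A crude sketch of tuning a meta-parameter online, loosely in the spirit of the scheme described: the meta-level compares a mid-term and a long-term running average of reward and retains perturbations of the learning rate that coincide with improvement. The bandit environment, constants, and update rule below are simplifying assumptions, not the paper's algorithm:

```python
# Hedged sketch of online meta-parameter tuning: perturb the learning rate,
# and keep a little of the perturbation when the mid-term reward average
# exceeds the long-term average. Environment and constants are invented.
import random

random.seed(0)

def bandit(action):
    # Two-armed bandit; arm 1 pays more on average.
    return random.gauss(0.3 if action == 0 else 0.7, 0.1)

alpha = 0.5                      # meta-parameter being tuned (learning rate)
Q = [0.0, 0.0]
r_fast, r_slow = 0.0, 0.0        # mid- and long-term reward averages
for t in range(5000):
    noise = random.gauss(0.0, 0.05)
    a_eff = min(max(alpha + noise, 0.01), 0.99)    # perturbed learning rate
    action = (random.randrange(2) if random.random() < 0.1
              else max(range(2), key=lambda i: Q[i]))
    r = bandit(action)
    Q[action] += a_eff * (r - Q[action])
    r_fast += 0.05 * (r - r_fast)
    r_slow += 0.005 * (r - r_slow)
    # Correlate the perturbation with the reward trend, then clip.
    alpha = min(max(alpha + 0.01 * (r_fast - r_slow) * noise, 0.01), 0.99)
```

    The point of the sketch is only the two-timescale comparison; the paper's full algorithm tunes several meta-parameters and relates the two averages to tonic and phasic dopamine signals.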

  18. The combination of appetitive and aversive reinforcers and the nature of their interaction during auditory learning.

    PubMed

    Ilango, A; Wetzel, W; Scheich, H; Ohl, F W

    2010-03-31

    Learned changes in behavior can be elicited by either appetitive or aversive reinforcers. It is, however, not clear whether the two types of motivation (approaching appetitive stimuli and avoiding aversive stimuli) drive learning in the same or different ways, nor is their interaction understood in situations where the two types are combined in a single experiment. To investigate this question we have developed a novel learning paradigm for Mongolian gerbils, which not only allows rewards and punishments to be presented in isolation or in combination with each other, but also can use these opposite reinforcers to drive the same learned behavior. Specifically, we studied learning of tone-conditioned hurdle crossing in a shuttle box driven by either an appetitive reinforcer (brain stimulation reward) or an aversive reinforcer (electrical footshock), or by a combination of both. Combination of the two reinforcers potentiated speed of acquisition, led to maximum possible performance, and delayed extinction as compared to either reinforcer alone. Additional experiments, using partial reinforcement protocols and experiments in which one of the reinforcers was omitted after the animals had been previously trained with the combination of both reinforcers, indicated that appetitive and aversive reinforcers operated together but acted in different ways: in this particular experimental context, punishment appeared to be more effective for initial acquisition and reward more effective to maintain a high level of conditioned responses (CRs). The results imply that learning mechanisms in problem solving were maximally effective when the initial punishment of mistakes was combined with the subsequent rewarding of correct performance.

  19. Model-based reinforcement learning with dimension reduction.

    PubMed

    Tangkaratt, Voot; Morimoto, Jun; Sugiyama, Masashi

    2016-12-01

    The goal of reinforcement learning is to learn an optimal policy which controls an agent to acquire the maximum cumulative reward. The model-based reinforcement learning approach learns a transition model of the environment from data, and then derives the optimal policy using the transition model. However, learning an accurate transition model in high-dimensional environments requires a large amount of data which is difficult to obtain. To overcome this difficulty, in this paper, we propose to combine model-based reinforcement learning with the recently developed least-squares conditional entropy (LSCE) method, which simultaneously performs transition model estimation and dimension reduction. We also further extend the proposed method to imitation learning scenarios. The experimental results show that policy search combined with LSCE performs well for high-dimensional control tasks including real humanoid robot control.
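
    The model-based loop the abstract describes (learn a transition model from data, then derive the policy from the learned model) can be sketched in tabular form. The LSCE dimension-reduction step is omitted, and the three-state chain environment below is an invented toy:

```python
# Tabular model-based RL sketch: (1) estimate transition probabilities from
# exploration data, (2) run value iteration on the learned model. The toy
# chain MDP and all constants are illustrative; the paper's LSCE step for
# high-dimensional problems is not shown.
import random
from collections import defaultdict

random.seed(1)
N_STATES, GOAL, ACTIONS, GAMMA = 3, 2, [0, 1], 0.9

def step(s, a):
    # Noisy chain: action 1 tends to move right toward the goal, 0 left.
    if random.random() < 0.8:
        s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    else:
        s2 = s
    return s2

# 1) Learn the transition model from random exploration.
counts = defaultdict(lambda: defaultdict(int))
s = 0
for _ in range(5000):
    a = random.choice(ACTIONS)
    s2 = step(s, a)
    counts[(s, a)][s2] += 1
    s = 0 if s2 == GOAL else s2            # the goal is terminal; restart

def q(st, a, V):
    # Expected return of (st, a) under the estimated transition model;
    # reward is 1 on entering the goal, 0 otherwise.
    total = sum(counts[(st, a)].values())
    return sum(n / total * ((1.0 if s2 == GOAL else 0.0) + GAMMA * V[s2])
               for s2, n in counts[(st, a)].items())

# 2) Derive the policy by value iteration on the learned model.
V = [0.0] * N_STATES                       # V[GOAL] stays 0 (terminal)
for _ in range(100):
    for st in range(GOAL):
        V[st] = max(q(st, a, V) for a in ACTIONS)

policy = [max(ACTIONS, key=lambda a: q(st, a, V)) for st in range(GOAL)]
```

    Both non-terminal states should learn to move right (action 1), the shortest route to the rewarded goal state; the data requirement of accurate model estimation grows quickly with dimensionality, which is the difficulty the paper's dimension-reduction approach targets.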

  20. Associations among smoking, anhedonia, and reward learning in depression.

    PubMed

    Liverant, Gabrielle I; Sloan, Denise M; Pizzagalli, Diego A; Harte, Christopher B; Kamholz, Barbara W; Rosebrock, Laina E; Cohen, Andrew L; Fava, Maurizio; Kaplan, Gary B

    2014-09-01

    Depression and cigarette smoking co-occur at high rates. However, the etiological mechanisms that contribute to this relationship remain unclear. Anhedonia and associated impairments in reward learning are key features of depression, which also have been linked to the onset and maintenance of cigarette smoking. However, few studies have investigated differences in anhedonia and reward learning among depressed smokers and depressed nonsmokers. The goal of this study was to examine putative differences in anhedonia and reward learning in depressed smokers (n=36) and depressed nonsmokers (n=44). To this end, participants completed self-report measures of anhedonia and behavioral activation (BAS reward responsiveness scores), as well as a probabilistic reward task rooted in signal detection theory, which measures reward learning (Pizzagalli, Jahn, & O'Shea, 2005). When considering self-report measures, depressed smokers reported higher trait anhedonia and reduced BAS reward responsiveness scores compared to depressed nonsmokers. In contrast to self-report measures, nicotine-satiated depressed smokers demonstrated greater acquisition of reward-based learning compared to depressed nonsmokers as indexed by the probabilistic reward task. Findings may point to a potential mechanism underlying the frequent co-occurrence of smoking and depression. These results highlight the importance of continued investigation of the role of anhedonia and reward system functioning in the co-occurrence of depression and nicotine abuse. Results also may support the use of treatments targeting reward learning (e.g., behavioral activation) to enhance smoking cessation among individuals with depression.

  1. Differential effect of reward and punishment on procedural learning.

    PubMed

    Wächter, Tobias; Lungu, Ovidiu V; Liu, Tao; Willingham, Daniel T; Ashe, James

    2009-01-14

    Reward and punishment are potent modulators of associative learning in instrumental and classical conditioning. However, the effect of reward and punishment on procedural learning is not known. The striatum is known to be an important locus of reward-related neural signals and part of the neural substrate of procedural learning. Here, using an implicit motor learning task, we show that reward leads to enhancement of learning in human subjects, whereas punishment is associated only with improvement in motor performance. Furthermore, these behavioral effects have distinct neural substrates with the learning effect of reward being mediated through the dorsal striatum and the performance effect of punishment through the insula. Our results suggest that reward and punishment engage separate motivational systems with distinctive behavioral effects and neural substrates.

  2. Differential Effect of Reward and Punishment on Procedural Learning

    PubMed Central

    Wächter, Tobias; Lungu, Ovidiu V.; Liu, Tao; Willingham, Daniel T.; Ashe, James

    2009-01-01

    Reward and punishment are potent modulators of associative learning in instrumental and classical conditioning. However, the effect of reward and punishment on procedural learning is not known. The striatum is known to be an important locus of reward-related neural signals and part of the neural substrate of procedural learning. Here, using an implicit motor learning task, we show that reward leads to enhancement of learning in human subjects, whereas punishment is associated only with improvement in motor performance. Furthermore, these behavioral effects have distinct neural substrates with the learning effect of reward being mediated through the dorsal striatum and the performance effect of punishment through the insula. Our results suggest that reward and punishment engage separate motivational systems with distinctive behavioral effects and neural substrates. PMID:19144843

  3. Stochastic optimization of multireservoir systems via reinforcement learning

    NASA Astrophysics Data System (ADS)

    Lee, Jin-Hee; Labadie, John W.

    2007-11-01

    Although several variants of stochastic dynamic programming have been applied to optimal operation of multireservoir systems, they have been plagued by a high-dimensional state space and the inability to accurately incorporate the stochastic environment as characterized by temporally and spatially correlated hydrologic inflows. Reinforcement learning has emerged as an effective approach to solving sequential decision problems by combining concepts from artificial intelligence, cognitive science, and operations research. A reinforcement learning system has a mathematical foundation similar to dynamic programming and Markov decision processes, with the goal of maximizing the long-term reward or returns as conditioned on the state of the system environment and the immediate reward obtained from operational decisions. Reinforcement learning can include Monte Carlo simulation where transition probabilities and rewards are not explicitly known a priori. The Q-Learning method in reinforcement learning is demonstrated on the two-reservoir Geum River system, South Korea, and is shown to outperform implicit stochastic dynamic programming and sampling stochastic dynamic programming methods.
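
    The tabular Q-learning rule the study applies can be sketched on a toy single-reservoir problem. The discretization, inflow distribution, and reward function below are invented for illustration; the study itself optimized the two-reservoir Geum River system, and here state-action pairs are sampled uniformly rather than along operating trajectories:

```python
# Hedged Q-learning sketch on an invented single-reservoir problem:
# state = discretized storage level, action = discretized release volume,
# reward = benefit of released water minus a penalty for overtopping.
import random

random.seed(2)
LEVELS = 5                         # discretized storage levels 0..4
ACTIONS = [0, 1, 2]                # discretized release volumes
ALPHA, GAMMA = 0.1, 0.95
Q = [[0.0] * len(ACTIONS) for _ in range(LEVELS)]

def transition(level, action):
    release = min(action, level)           # cannot release more than is stored
    inflow = random.choice([0, 1, 2])      # stochastic hydrologic inflow
    new = level - release + inflow
    reward = float(release)                # benefit of released water
    if new >= LEVELS:                      # overtopping: flood penalty
        reward -= 10.0
        new = LEVELS - 1
    return new, reward

for _ in range(30000):
    level = random.randrange(LEVELS)
    action = random.randrange(len(ACTIONS))
    new, r = transition(level, action)
    # Q-learning: off-policy TD update toward r + GAMMA * max_a' Q(s', a').
    Q[level][action] += ALPHA * (r + GAMMA * max(Q[new]) - Q[level][action])
```

    Note that the update never uses transition probabilities explicitly; they are sampled by simulation, which is the Monte Carlo aspect the abstract highlights. At a full reservoir the learned values should favor larger releases, since small releases risk the flood penalty.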

  4. Dissecting components of reward: 'liking', 'wanting', and learning.

    PubMed

    Berridge, Kent C; Robinson, Terry E; Aldridge, J Wayne

    2009-02-01

    In recent years significant progress has been made delineating the psychological components of reward and their underlying neural mechanisms. Here we briefly highlight findings on three dissociable psychological components of reward: 'liking' (hedonic impact), 'wanting' (incentive salience), and learning (predictive associations and cognitions). A better understanding of the components of reward, and their neurobiological substrates, may help in devising improved treatments for disorders of mood and motivation, ranging from depression to eating disorders, drug addiction, and related compulsive pursuits of rewards.

  5. Reinforcement Learning and Savings Behavior.

    PubMed

    Choi, James J; Laibson, David; Madrian, Brigitte C; Metrick, Andrew

    2009-12-01

    We show that individual investors over-extrapolate from their personal experience when making savings decisions. Investors who experience particularly rewarding outcomes from saving in their 401(k) (a high average and/or low-variance return) increase their 401(k) savings rate more than investors who have less rewarding experiences with saving. This finding is not driven by aggregate time-series shocks, income effects, rational learning about investing skill, investor fixed effects, or time-varying investor-level heterogeneity that is correlated with portfolio allocations to stock, bond, and cash asset classes. We discuss implications for the equity premium puzzle and interventions aimed at improving household financial outcomes.

  6. Model-based reinforcement learning under concurrent schedules of reinforcement in rodents.

    PubMed

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-05-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's knowledge or model of the environment in model-based reinforcement learning algorithms. To investigate how animals update value functions, we trained rats under two different free-choice tasks. The reward probability of the unchosen target remained unchanged in one task, whereas it increased over time since the target was last chosen in the other task. The results show that goal choice probability increased as a function of the number of consecutive alternative choices in the latter, but not the former task, indicating that the animals were aware of time-dependent increases in arming probability and used this information in choosing goals. In addition, the choice behavior in the latter task was better accounted for by a model-based reinforcement learning algorithm. Our results show that rats adopt a decision-making process that cannot be accounted for by simple reinforcement learning models even in a relatively simple binary choice task, suggesting that rats can readily improve their decision-making strategy through the knowledge of their environments.
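
    The task structure described for the second condition can be sketched as follows (the base arming probability and function names are invented for illustration): the unchosen target's reward probability grows with the number of trials since it was last chosen, so an agent with a model of this rule should switch after a run of consecutive choices of the other target.

```python
# Hedged sketch of a "baited" choice task: an unchosen target's reward
# ("arming") probability grows with the trials since it was last chosen.
# The base probability and decision rule are illustrative assumptions.
P0 = 0.3                                   # per-trial arming probability

def arming_prob(trials_unchosen):
    # Probability the target is armed after n trials of being unchosen.
    return 1.0 - (1.0 - P0) ** trials_unchosen

def model_based_choice(n_left_unchosen, n_right_unchosen):
    # An agent with a model of the arming rule picks the target with the
    # higher current arming probability.
    if arming_prob(n_left_unchosen) >= arming_prob(n_right_unchosen):
        return "left"
    return "right"
```

    A simple model-free learner, updating values only from its own trial-and-error outcomes, carries no representation of this time dependence, which is why the abstract reports that the rats' switching behavior was better captured by a model-based algorithm.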

  7. Role of Dopamine D2 Receptors in Human Reinforcement Learning

    PubMed Central

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-01-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well. PMID:24713613

  8. Role of dopamine D2 receptors in human reinforcement learning.

    PubMed

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-09-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well.

  9. Reinforcement active learning in the vibrissae system: optimal object localization.

    PubMed

    Gordon, Goren; Dorfman, Nimrod; Ahissar, Ehud

    2013-01-01

    Rats move their whiskers to acquire information about their environment. It has been observed that they palpate novel objects and objects they are required to localize in space. We analyze whisker-based object localization using two complementary paradigms, namely, active learning and intrinsic-reward reinforcement learning. Active learning algorithms select the next training samples according to the hypothesized solution in order to better discriminate between correct and incorrect labels. Intrinsic-reward reinforcement learning uses prediction errors as the reward to an actor-critic design, such that behavior converges to the one that optimizes the learning process. We show that in the context of object localization, the two paradigms result in palpation whisking as their respective optimal solution. These results suggest that rats may employ principles of active learning and/or intrinsic reward in tactile exploration and can guide future research to seek the underlying neuronal mechanisms that implement them. Furthermore, these paradigms are easily transferable to biomimetic whisker-based artificial sensors and can improve the active exploration of their environment.
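
The intrinsic-reward loop described above can be sketched in a few lines of Python. This is a toy illustration rather than the paper's model: the five whisker positions, the signal values, and the learning rates are invented for the example, and the critic is a simple per-position predictor whose absolute prediction error serves as the actor's reward.

```python
import math
import random

random.seed(0)

# Toy world: sensor reading at each of 5 whisker positions (unknown to the
# agent); the "object" sits at position 2.
true_signal = [0.0, 0.0, 1.0, 0.0, 0.0]

prediction = [0.5] * 5   # critic: predicted reading per position
preference = [0.0] * 5   # actor: softmax preferences over positions
ALPHA, BETA = 0.5, 1.0   # critic and actor learning rates

def softmax_choice(prefs):
    weights = [math.exp(p) for p in prefs]
    r = random.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(prefs) - 1

for _ in range(200):
    pos = softmax_choice(preference)
    error = true_signal[pos] - prediction[pos]
    prediction[pos] += ALPHA * error        # critic refines its world model
    preference[pos] += BETA * abs(error)    # prediction error is the intrinsic reward
```

As the critic's predictions converge, the intrinsic reward dries up, mirroring how palpation concentrates on poorly predicted locations only while they remain informative.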

  10. Reinforcement learning in supply chains.

    PubMed

    Valluri, Annapurna; North, Michael J; Macal, Charles M

    2009-10-01

    Effective management of supply chains creates value and can strategically position companies. In practice, human beings have been found to be both surprisingly successful and disappointingly inept at managing supply chains. The related fields of cognitive psychology and artificial intelligence have postulated a variety of potential mechanisms to explain this behavior. One of the leading candidates is reinforcement learning. This paper applies agent-based modeling to investigate the comparative behavioral consequences of three simple reinforcement learning algorithms in a multi-stage supply chain. For the first time, our findings show that the specific algorithm that is employed can have dramatic effects on the results obtained. Reinforcement learning is found to be valuable in multi-stage supply chains with several learning agents, as independent agents can learn to coordinate their behavior. However, learning in multi-stage supply chains using these postulated approaches from cognitive psychology and artificial intelligence takes extremely long time periods to achieve stability, which raises questions about their ability to explain behavior in real supply chains. The fact that it takes thousands of periods for agents to learn in this simple multi-agent setting provides new evidence that real-world decision makers are unlikely to be using strict reinforcement learning in practice.

  11. Is Avoiding an Aversive Outcome Rewarding? Neural Substrates of Avoidance Learning in the Human Brain

    PubMed Central

    Kim, Hackjin; Shimojo, Shinsuke

    2006-01-01

    Avoidance learning poses a challenge for reinforcement-based theories of instrumental conditioning, because once an aversive outcome is successfully avoided an individual may no longer experience extrinsic reinforcement for their behavior. One possible account for this is to propose that avoiding an aversive outcome is in itself a reward, and thus avoidance behavior is positively reinforced on each trial when the aversive outcome is successfully avoided. In the present study we aimed to test this possibility by determining whether avoidance of an aversive outcome recruits the same neural circuitry as that elicited by a reward itself. We scanned 16 human participants with functional MRI while they performed an instrumental choice task, in which on each trial they chose from one of two actions in order to either win money or else avoid losing money. Neural activity in a region previously implicated in encoding stimulus reward value, the medial orbitofrontal cortex, was found to increase, not only following receipt of reward, but also following successful avoidance of an aversive outcome. This neural signal may itself act as an intrinsic reward, thereby serving to reinforce actions during instrumental avoidance. PMID:16802856

  12. Is avoiding an aversive outcome rewarding? Neural substrates of avoidance learning in the human brain.

    PubMed

    Kim, Hackjin; Shimojo, Shinsuke; O'Doherty, John P

    2006-07-01

    Avoidance learning poses a challenge for reinforcement-based theories of instrumental conditioning, because once an aversive outcome is successfully avoided an individual may no longer experience extrinsic reinforcement for their behavior. One possible account for this is to propose that avoiding an aversive outcome is in itself a reward, and thus avoidance behavior is positively reinforced on each trial when the aversive outcome is successfully avoided. In the present study we aimed to test this possibility by determining whether avoidance of an aversive outcome recruits the same neural circuitry as that elicited by a reward itself. We scanned 16 human participants with functional MRI while they performed an instrumental choice task, in which on each trial they chose from one of two actions in order to either win money or else avoid losing money. Neural activity in a region previously implicated in encoding stimulus reward value, the medial orbitofrontal cortex, was found to increase, not only following receipt of reward, but also following successful avoidance of an aversive outcome. This neural signal may itself act as an intrinsic reward, thereby serving to reinforce actions during instrumental avoidance.

  13. Habits, action sequences, and reinforcement learning

    PubMed Central

    Dezfouli, Amir; Balleine, Bernard W.

    2012-01-01

    It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions. PMID:22487034

  14. Habits, action sequences and reinforcement learning.

    PubMed

    Dezfouli, Amir; Balleine, Bernard W

    2012-04-01

    It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary, and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions.
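
The "caching" idea attributed to model-free RL above can be made concrete with tabular Q-learning, which stores a value per state-action pair and updates it from reward history alone, with no transition model. The corridor task, parameter values, and optimistic initialization below are illustrative choices, not taken from the paper.

```python
import random

random.seed(1)

# A 4-state corridor; action 1 moves right, action 0 moves left.
# Reaching state 3 pays reward 1 and ends the episode.
N_STATES, GOAL, GAMMA, ALPHA, EPS = 4, 3, 0.9, 0.5, 0.1
Q = [[1.0, 1.0] for _ in range(N_STATES)]  # cached values; optimistic init drives exploration

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

for _ in range(300):
    s, done = 0, False
    while not done:
        a = random.randrange(2) if random.random() < EPS else max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])  # update from reward history only
        s = s2
```

A model-based agent would instead learn the transition function and plan through it; the cached table is what makes the model-free controller cheap but, as the abstract notes, insensitive to sudden changes in the action-reward contingency.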

  15. Reward-Guided Learning with and without Causal Attribution.

    PubMed

    Jocham, Gerhard; Brodersen, Kay H; Constantinescu, Alexandra O; Kahn, Martin C; Ianni, Angela M; Walton, Mark E; Rushworth, Matthew F S; Behrens, Timothy E J

    2016-04-06

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task.

  16. Reward-Guided Learning with and without Causal Attribution

    PubMed Central

    Jocham, Gerhard; Brodersen, Kay H.; Constantinescu, Alexandra O.; Kahn, Martin C.; Ianni, Angela M.; Walton, Mark E.; Rushworth, Matthew F.S.; Behrens, Timothy E.J.

    2016-01-01

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task. PMID:26971947

  17. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults

    PubMed Central

    Smith, Tim J.; Senju, Atsushi

    2017-01-01

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children's and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue–reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue–reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that socially engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting. PMID:28250186

  18. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults.

    PubMed

    Vernetti, Angélina; Smith, Tim J; Senju, Atsushi

    2017-03-15

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children's and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue-reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue-reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that socially engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting.

  19. Conflict acts as an implicit cost in reinforcement learning.

    PubMed

    Cavanagh, James F; Masters, Sean E; Bath, Kevin; Frank, Michael J

    2014-11-04

    Conflict has been proposed to act as a cost in action selection, implying a general function of medio-frontal cortex in the adaptation to aversive events. Here we investigate if response conflict acts as a cost during reinforcement learning by modulating experienced reward values in cortical and striatal systems. Electroencephalography recordings show that conflict diminishes the relationship between reward-related frontal theta power and cue preference yet it enhances the relationship between punishment and cue avoidance. Individual differences in the cost of conflict on reward versus punishment sensitivity are also related to a genetic polymorphism associated with striatal D1 versus D2 pathway balance (DARPP-32). We manipulate these patterns with the D2 agent cabergoline, which induces a strong bias to amplify the aversive value of punishment outcomes following conflict. Collectively, these findings demonstrate that interactive cortico-striatal systems implicitly modulate experienced reward and punishment values as a function of conflict.

  20. A Comparative Analysis of Reinforcement Learning Methods

    DTIC Science & Technology

    1991-10-01

    reinforcement learning for both programming and adapting situated agents. In the first part of the paper we discuss two specific reinforcement learning algorithms: Q-learning and the Bucket Brigade. We introduce a special case of the Bucket Brigade, and analyze and compare its performance to Q-learning in a number of experiments. The second part of the paper discusses the key problems of reinforcement learning: time and space complexity, input generalization, sensitivity to parameter values, and selection of the reinforcement
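
The Bucket Brigade mentioned above passes credit backward along a chain of rule activations: each firing rule pays a fraction of its strength (its "bid") to the rule that preceded it, and the environment pays the final rule. A deliberately simplified sketch, with an invented chain length, bid fraction, and reward:

```python
# Four rules that must fire in sequence to earn an external reward.
strengths = [10.0, 10.0, 10.0, 10.0]
BID = 0.1        # fraction of strength a firing rule bids
REWARD = 100.0   # external payoff to the final rule in the chain

for _ in range(50):
    for i in range(len(strengths)):
        bid = BID * strengths[i]
        strengths[i] -= bid           # the rule pays its bid...
        if i > 0:
            strengths[i - 1] += bid   # ...into the bucket of its predecessor
    strengths[-1] += REWARD           # environment rewards the last rule
```

Reward trickles backward one link per pass, so strengths end up ordered from the chain's end toward its start; Q-learning achieves a similar backward propagation through its bootstrapped max-value target.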

  1. The impact of mineralocorticoid receptor ISO/VAL genotype (rs5522) and stress on reward learning.

    PubMed

    Bogdan, R; Perlis, R H; Fagerness, J; Pizzagalli, D A

    2010-08-01

    Research suggests that stress disrupts reinforcement learning and induces anhedonia. The mineralocorticoid receptor (MR) determines the sensitivity of the stress response, and the missense iso/val polymorphism (Ile180Val, rs5522) of the MR gene (NR3C2) has been associated with enhanced physiological stress responses, elevated depressive symptoms and reduced cortisol-induced MR gene expression. The goal of these studies was to evaluate whether rs5522 genotype and stress independently and interactively influence reward learning. In study 1, participants (n = 174) completed a probabilistic reward task under baseline (i.e. no-stress) conditions. In study 2, participants (n = 53) completed the task during a stress (threat-of-shock) and no-stress condition. Reward learning, i.e. the ability to modulate behavior as a function of reinforcement history, was the main variable of interest. In study 1, in which participants were evaluated under no-stress conditions, reward learning was enhanced in val carriers. In study 2, participants developed a weaker response bias toward a more frequently rewarded stimulus under the stress relative to no-stress condition. Critically, stress-induced reward learning deficits were largest in val carriers. Although preliminary and in need of replication due to small sample size, findings indicate that psychiatrically healthy individuals carrying the MR val allele, which has recently been linked to depression, showed a reduced ability to modulate behavior as a function of reward when facing an acute, uncontrollable stressor. Future studies are warranted to evaluate whether rs5522 genotype interacts with naturalistic stressors to increase the risk of depression and whether stress-induced anhedonia might moderate such risk.
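
Reward learning in probabilistic reward tasks of this kind is commonly quantified as a signal-detection response bias toward the more frequently rewarded ("rich") stimulus. A sketch of one standard formula; the 0.5 cell correction and the example trial counts are illustrative:

```python
import math

def response_bias(rich_correct, rich_incorrect, lean_correct, lean_incorrect):
    """Log response bias toward the rich stimulus; 0.5 is added to every
    cell so the ratio stays defined even when a cell count is zero."""
    num = (rich_correct + 0.5) * (lean_incorrect + 0.5)
    den = (rich_incorrect + 0.5) * (lean_correct + 0.5)
    return 0.5 * math.log(num / den)

biased = response_bias(rich_correct=80, rich_incorrect=20,
                       lean_correct=55, lean_incorrect=45)  # favours rich: positive
neutral = response_bias(70, 30, 70, 30)                     # symmetric: zero bias
```

A weaker bias under stress, as reported for val carriers in study 2, would show up as this statistic shrinking toward zero.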

  2. Robot-assisted motor training: assistance decreases exploration during reinforcement learning.

    PubMed

    Sans-Muntadas, Albert; Duarte, Jaime E; Reinkensmeyer, David J

    2014-01-01

    Reinforcement learning (RL) is a form of motor learning that robotic therapy devices could potentially manipulate to promote neurorehabilitation. We developed a system that requires trainees to use RL to learn a predefined target movement. The system provides higher rewards for movements that are more similar to the target movement. We also developed a novel algorithm that rewards trainees of different abilities with comparable reward sizes. This algorithm measures a trainee's performance relative to their best performance, rather than relative to an absolute target performance, to determine reward. We hypothesized this algorithm would permit subjects who cannot normally achieve high reward levels to do so while still learning. In an experiment with 21 unimpaired human subjects, we found that all subjects quickly learned to make a first target movement with and without the reward equalization. However, artificially increasing reward decreased the subjects' tendency to engage in exploration and therefore slowed learning, particularly when we changed the target movement. An anti-slacking watchdog algorithm further slowed learning. These results suggest that robotic algorithms that assist trainees in achieving rewards or in preventing slacking might, over time, discourage the exploration needed for reinforcement learning.
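
The reward-equalization idea (scoring a trainee against their own best rather than an absolute target) can be sketched as follows; the scoring rule and the error values are hypothetical, not the authors' algorithm.

```python
def equalized_reward(error, best_error):
    """Reward in [0, 1]: full reward for a new personal best, otherwise
    discounted by how far the trial falls short of the best so far."""
    if error < best_error:
        return 1.0
    return max(0.0, 1.0 - (error - best_error) / best_error)

best = float("inf")
rewards = []
for err in [10.0, 8.0, 9.0, 6.0, 12.0]:   # trial-by-trial movement errors
    rewards.append(equalized_reward(err, best))
    best = min(best, err)
```

Because reward tracks relative rather than absolute performance, a low-ability trainee can still reach high reward levels; the experiment's caution is that inflating reward this way may also suppress the exploration that reinforcement learning depends on.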

  3. Striatal dopamine D1 receptor suppression impairs reward-associative learning.

    PubMed

    Higa, Kerin K; Young, Jared W; Ji, Baohu; Nichols, David E; Geyer, Mark A; Zhou, Xianjin

    2017-04-14

    Dopamine (DA) is required for reinforcement learning. Hence, disruptions in DA signaling may contribute to the learning deficits associated with psychiatric disorders. The DA D1 receptor (D1R) has been linked to learning and is a target for cognitive/motivational enhancement in patients with schizophrenia. Separating the striatal D1R contribution to learning vs. motivation, however, has been challenging. We suppressed striatal D1R expression in mice using a D1R-targeting short hairpin RNA (shRNA), delivered locally to the striatum via an adeno-associated virus (AAV). We then assessed reward- and punishment-associative learning using a probabilistic learning task and motivation using a progressive-ratio breakpoint procedure. We confirmed suppression of striatal D1Rs immunohistochemically and by testing locomotor activity after the administration of (+)-doxanthrine, a full D1R agonist, in control mice and those treated with the D1RshRNA. D1RshRNA-treated mice exhibited impaired reward-associative learning, while punishment-associative learning was spared. This deficit was unrelated to general learning impairments or amotivation, because the D1RshRNA-treated mice exhibited normal Barnes maze learning and normal motivation in the progressive-ratio breakpoint procedure. Suppression of striatal D1Rs selectively impaired reward-associative learning whereas punishment-associative learning, aversion-motivated learning, and appetitive motivation were spared. Because patients with schizophrenia exhibit similar reward-associative learning deficits, D1R-targeted treatments should be investigated to improve reward learning in these patients.

  4. Tunnel Ventilation Control Using Reinforcement Learning Methodology

    NASA Astrophysics Data System (ADS)

    Chu, Baeksuk; Kim, Dongnam; Hong, Daehie; Park, Jooyoung; Chung, Jin Taek; Kim, Tae-Hyung

    The main purpose of a tunnel ventilation system is to maintain CO pollutant concentration and VI (visibility index) at an adequate level to provide drivers with a comfortable and safe driving environment. Moreover, it is necessary to minimize the power consumed to operate the ventilation system. To achieve these objectives, the control algorithm used in this research is the reinforcement learning (RL) method. RL is goal-directed learning of a mapping from situations to actions that does not rely on exemplary supervision or complete models of the environment. The goal of RL is to maximize a reward, which is an evaluative feedback signal from the environment. In constructing the reward for the tunnel ventilation system, the two objectives listed above are both included, that is, maintaining an adequate level of pollutants and minimizing power consumption. An RL algorithm based on an actor-critic architecture and a gradient-following algorithm is applied to the tunnel ventilation system. Simulation results obtained with real data collected from an existing tunnel ventilation system, together with real experimental verification, are provided in this paper. It is confirmed that with the suggested controller the pollutant level inside the tunnel was well maintained below the allowable limit and energy consumption was reduced compared to the conventional control scheme.

  5. Working memory and reward association learning impairments in obesity

    PubMed Central

    Coppin, Géraldine; Nolan-Poupart, Sarah; Jones-Gotman, Marilyn; Small, Dana M.

    2014-01-01

    Obesity has been associated with impaired executive functions including working memory. Less explored is the influence of obesity on learning and memory. In the current study we assessed stimulus-reward association learning, explicit learning and memory, and working memory in healthy weight, overweight and obese individuals. Explicit learning and memory did not differ as a function of group. In contrast, working memory was significantly and similarly impaired in both overweight and obese individuals compared to the healthy weight group. In the first reward association learning task the obese, but not healthy weight or overweight participants consistently formed paradoxical preferences for a pattern associated with a negative outcome (fewer food rewards). To determine if the deficit was specific to food reward a second experiment was conducted using money. Consistent with experiment 1, obese individuals selected the pattern associated with a negative outcome (fewer monetary rewards) more frequently than healthy weight individuals and thus failed to develop a significant preference for the most rewarded patterns as was observed in the healthy weight group. Finally, on a probabilistic learning task, obese individuals showed deficits in negative, but not positive, outcome learning compared to healthy weight individuals. Taken together, our results demonstrate deficits in working memory and stimulus-reward learning in obesity and suggest that obese individuals are impaired in learning to avoid negative outcomes. PMID:25447070

  6. SAN-RL: combining spreading activation networks and reinforcement learning to learn configurable behaviors

    NASA Astrophysics Data System (ADS)

    Gaines, Daniel M.; Wilkes, Don M.; Kusumalnukool, Kanok; Thongchai, Siripun; Kawamura, Kazuhiko; White, John H.

    2002-02-01

    Reinforcement learning techniques have been successful in allowing an agent to learn a policy for achieving tasks. The overall behavior of the agent can be controlled with an appropriate reward function. However, the policy that is learned will be fixed to this reward function. If the user wishes to change his or her preference about how the task is achieved, the agent must be retrained with this new reward function. We address this challenge by combining Spreading Activation Networks and Reinforcement Learning in an approach we call SAN-RL. This approach provides the agent with a causal structure, the spreading activation network, relating goals to the actions that can achieve those goals. This enables the agent to select actions relative to the goal priorities. We combine this with reinforcement learning to enable the agent to learn a policy. Together, these approaches enable the learning of configurable behaviors: policies that can be adapted to meet the user's current preferences. We compare the approach with Q-learning on a robot navigation task. We demonstrate that SAN-RL exhibits goal-directed behavior before learning, exploits the causal structure of the network to focus its search during learning and results in configurable behaviors after learning.

  7. Probabilistic reward learning in adults with Attention Deficit Hyperactivity Disorder--an electrophysiological study.

    PubMed

    Thoma, Patrizia; Edel, Marc-Andreas; Suchan, Boris; Bellebaum, Christian

    2015-01-30

    Attention Deficit Hyperactivity Disorder (ADHD) is hypothesized to be characterized by altered reinforcement sensitivity. The main aim of the present study was to assess alterations in the electrophysiological correlates of monetary reward processing in adult patients with ADHD of the combined subtype. Fourteen adults with ADHD of the combined subtype and 14 healthy control participants performed an active and an observational probabilistic reward-based learning task while an electroencephalogram (EEG) was recorded. Regardless of feedback valence, there was a general feedback-related negativity (FRN) enhancement in combination with reduced learning performance during both active and observational reward learning in patients with ADHD relative to healthy controls. Other feedback-locked potentials such as the P200 and P300 and response-locked potentials were unaltered in the patients. There were no significant correlations between learning performance, FRN amplitudes and clinical symptoms, neither in the overall group involving all participants, nor in patients or controls considered separately. This pattern of findings might reflect generally impaired reward prediction in adults with ADHD of the combined subtype. We demonstrated for the first time that patients with ADHD of the combined subtype not only show deficient active reward learning but are also impaired when learning by observing other people's outcomes.

  8. Learning strategies in table tennis using inverse reinforcement learning.

    PubMed

    Muelling, Katharina; Boularias, Abdeslam; Mohler, Betty; Schölkopf, Bernhard; Peters, Jan

    2014-10-01

    Learning a complex task such as table tennis is a challenging problem for both robots and humans. Even after acquiring the necessary motor skills, a strategy is needed to choose where and how to return the ball to the opponent's court in order to win the game. The data-driven identification of basic strategies in interactive tasks, such as table tennis, is a largely unexplored problem. In this paper, we suggest a computational model for representing and inferring strategies, based on a Markov decision problem, where the reward function models the goal of the task as well as the strategic information. We show how this reward function can be discovered from demonstrations of table tennis matches using model-free inverse reinforcement learning. The resulting framework allows the identification of basic elements on which the selection of striking movements is based. We tested our approach on data collected from players with different playing styles and under different playing conditions. The estimated reward function was able to capture expert-specific strategic information that sufficed to distinguish the expert among players with different skill levels as well as different playing styles.
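
The core of inverse RL as used here is recovering a reward function under which the demonstrated choices look near-optimal. A minimal sketch with a linear reward over two hypothetical stroke features and a softmax choice model, fit by gradient ascent on the likelihood of the demonstrated selections; the features, candidate strokes, and data are invented, not the paper's model.

```python
import math

# Hypothetical demonstrations: on each stroke the player chose among 3
# candidate returns, each described by 2 features (e.g. speed, placement).
candidates = [
    [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)],
    [(0.8, 0.3), (0.1, 0.9), (0.4, 0.4)],
    [(0.7, 0.2), (0.3, 0.7), (0.6, 0.1)],
]
chosen = [0, 0, 0]   # this expert consistently picked the high-speed option

w = [0.0, 0.0]       # linear reward weights to recover
LR = 0.5

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    tot = sum(exps)
    return [e / tot for e in exps]

for _ in range(200):
    grad = [0.0, 0.0]
    for feats, c in zip(candidates, chosen):
        probs = softmax([w[0] * f[0] + w[1] * f[1] for f in feats])
        for j in range(2):
            # gradient of log-likelihood: chosen features minus expected features
            grad[j] += feats[c][j] - sum(p * f[j] for p, f in zip(probs, feats))
    w = [w[j] + LR * grad[j] for j in range(2)]
```

The recovered weights then summarize the demonstrator's strategy, here a strong preference for the first feature, which is the kind of expert-specific signature the abstract describes.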

  9. Stress modulates reinforcement learning in younger and older adults.

    PubMed

    Lighthall, Nichole R; Gorlick, Marissa A; Schoeke, Andrej; Frank, Michael J; Mather, Mara

    2013-03-01

    Animal research and human neuroimaging studies indicate that stress increases dopamine levels in brain regions involved in reward processing, and stress also appears to increase the attractiveness of addictive drugs. The current study tested the hypothesis that stress increases reward salience, leading to more effective learning about positive than negative outcomes in a probabilistic selection task. Changes to dopamine pathways with age raise the question of whether stress effects on incentive-based learning differ by age. Thus, the present study also examined whether effects of stress on reinforcement learning differed for younger (age 18-34) and older participants (age 65-85). Cold pressor stress was administered to half of the participants in each age group, and salivary cortisol levels were used to confirm biophysiological response to cold stress. After the manipulation, participants completed a probabilistic learning task involving positive and negative feedback. In both younger and older adults, stress enhanced learning about cues that predicted positive outcomes. In addition, during the initial learning phase, stress diminished sensitivity to recent feedback across age groups. These results indicate that stress affects reinforcement learning in both younger and older adults and suggest that stress exerts different effects on specific components of reinforcement learning depending on their neural underpinnings.
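
Valence asymmetries like "better learning from positive than negative outcomes" are often modeled with separate learning rates for better- and worse-than-expected feedback. A toy simulation of value tracking for a single probabilistically rewarded cue; the reward probability and learning rates are illustrative, not fit to the study's data.

```python
import random

random.seed(3)

def learn_value(p_reward, alpha_gain, alpha_loss, trials=2000):
    """Track the value of one cue rewarded with probability p_reward, using
    different learning rates for positive and negative prediction errors."""
    q = 0.5
    for _ in range(trials):
        r = 1.0 if random.random() < p_reward else 0.0
        delta = r - q
        q += (alpha_gain if delta > 0 else alpha_loss) * delta
    return q

balanced = learn_value(0.5, 0.2, 0.2)       # settles near the true rate, 0.5
optimistic = learn_value(0.5, 0.3, 0.05)    # gain-weighted: settles well above 0.5
```

Overweighting gains inflates the learned value of rewarded cues, which is one simple way to capture the stress-enhanced positive-outcome learning the abstract reports.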

  10. Behavioral and neural properties of social reinforcement learning.

    PubMed

    Jones, Rebecca M; Somerville, Leah H; Li, Jian; Ruberry, Erika J; Libby, Victoria; Glover, Gary; Voss, Henning U; Ballon, Douglas J; Casey, B J

    2011-09-14

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based on work in nonhuman primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging. Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis--social preferences, response latencies, and modeling neural responses--are consistent with reinforcement learning theory and nonhuman primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one's peers in altering subsequent behavior.

  11. No Apparent Influence of Reward upon Visual Statistical Learning

    PubMed Central

    Rogers, Leeland L.; Friedman, Kyle G.; Vickery, Timothy J.

    2016-01-01

    Humans are capable of detecting and exploiting a variety of environmental regularities, including stimulus-stimulus contingencies (e.g., visual statistical learning) and stimulus-reward contingencies. However, the relationship between these two types of learning is poorly understood. In two experiments, we sought evidence that the occurrence of rewarding events enhances or impairs visual statistical learning. Across all of our attempts to find such evidence, we employed a training stage during which we grouped shapes into triplets and presented triplets one shape at a time in an undifferentiated stream. Participants subsequently performed a surprise recognition task in which they were tested on their knowledge of the underlying structure of the triplets. Unbeknownst to participants, triplets were also assigned no-, low-, or high-reward status. In Experiments 1A and 1B, participants viewed shape streams while low and high rewards were “randomly” given, presented as low- and high-pitched tones played through headphones. Rewards were always given on the third shape of a triplet (Experiment 1A) or the first shape of a triplet (Experiment 1B), and high- and low-reward sounds were always consistently paired with the same triplets. Experiment 2 was similar to Experiment 1, except that participants were required to learn value associations of a subset of shapes before viewing the shape stream. Across all experiments, we observed significant visual statistical learning effects, but the strength of learning did not differ amongst no-, low-, or high-reward conditions for any of the experiments. Thus, our experiments failed to find any influence of rewards on statistical learning, implying that visual statistical learning may be unaffected by the occurrence of reward. The system that detects basic stimulus-stimulus regularities may operate independently of the system that detects reward contingencies. PMID:27853441

  12. No Apparent Influence of Reward upon Visual Statistical Learning.

    PubMed

    Rogers, Leeland L; Friedman, Kyle G; Vickery, Timothy J

    2016-01-01

    Humans are capable of detecting and exploiting a variety of environmental regularities, including stimulus-stimulus contingencies (e.g., visual statistical learning) and stimulus-reward contingencies. However, the relationship between these two types of learning is poorly understood. In two experiments, we sought evidence that the occurrence of rewarding events enhances or impairs visual statistical learning. Across all of our attempts to find such evidence, we employed a training stage during which we grouped shapes into triplets and presented triplets one shape at a time in an undifferentiated stream. Participants subsequently performed a surprise recognition task in which they were tested on their knowledge of the underlying structure of the triplets. Unbeknownst to participants, triplets were also assigned no-, low-, or high-reward status. In Experiments 1A and 1B, participants viewed shape streams while low and high rewards were "randomly" given, presented as low- and high-pitched tones played through headphones. Rewards were always given on the third shape of a triplet (Experiment 1A) or the first shape of a triplet (Experiment 1B), and high- and low-reward sounds were always consistently paired with the same triplets. Experiment 2 was similar to Experiment 1, except that participants were required to learn value associations of a subset of shapes before viewing the shape stream. Across all experiments, we observed significant visual statistical learning effects, but the strength of learning did not differ amongst no-, low-, or high-reward conditions for any of the experiments. Thus, our experiments failed to find any influence of rewards on statistical learning, implying that visual statistical learning may be unaffected by the occurrence of reward. The system that detects basic stimulus-stimulus regularities may operate independently of the system that detects reward contingencies.

  13. Force-proportional reinforcement: pimozide does not reduce rats' emission of higher forces for sweeter rewards.

    PubMed

    Kirkpatrick, M A; Fowler, S C

    1989-02-01

    A two-step force-proportional reinforcement procedure was used to assess the efficacy of a sucrose reward under neuroleptic challenge. The force-proportional reinforcement method entails an increase in the quality of reward contingent upon higher force-emission. This paradigm was conceived as a rate-free means of addressing the putative anhedonic effects of dopaminergic receptor-blocking agents. Results failed to support the anhedonia interpretation of neuroleptic-induced response decrements: Pimozide did not diminish the ability of a high-concentration sucrose solution to maintain elevated response forces. Alternatives to the anhedonia interpretation were discussed with emphasis on the drug's motor effects in the temporal domain.

  14. Dopamine selectively remediates 'model-based' reward learning: a computational approach.

    PubMed

    Sharp, Madeleine E; Foerde, Karin; Daw, Nathaniel D; Shohamy, Daphna

    2016-02-01

    Patients with loss of dopamine due to Parkinson's disease are impaired at learning from reward. However, it remains unknown precisely which aspect of learning is impaired. In particular, learning from reward, or reinforcement learning, can be driven by two distinct computational processes. One involves habitual stamping-in of stimulus-response associations, hypothesized to arise computationally from 'model-free' learning. The other, 'model-based' learning, involves learning a model of the world that is believed to support goal-directed behaviour. Much work has pointed to a role for dopamine in model-free learning. But recent work suggests model-based learning may also involve dopamine modulation, raising the possibility that model-based learning may contribute to the learning impairment in Parkinson's disease. To directly test this, we used a two-step reward-learning task which dissociates model-free versus model-based learning. We evaluated learning in patients with Parkinson's disease tested ON versus OFF their dopamine replacement medication and in healthy controls. Surprisingly, we found no effect of disease or medication on model-free learning. Instead, we found that patients tested OFF medication showed a marked impairment in model-based learning, and that this impairment was remediated by dopaminergic medication. Moreover, model-based learning was positively correlated with a separate measure of working memory performance, raising the possibility of common neural substrates. Our results suggest that some learning deficits in Parkinson's disease may be related to an inability to pursue reward based on complete representations of the environment.

  15. Rational and Mechanistic Perspectives on Reinforcement Learning

    ERIC Educational Resources Information Center

    Chater, Nick

    2009-01-01

    This special issue describes important recent developments in applying reinforcement learning models to capture neural and cognitive function. But reinforcement learning, as a theoretical framework, can apply at two very different levels of description: "mechanistic" and "rational." Reinforcement learning is often viewed in mechanistic terms--as…

  16. A parallel framework for Bayesian reinforcement learning

    NASA Astrophysics Data System (ADS)

    Barrett, Enda; Duggan, Jim; Howley, Enda

    2014-01-01

    Solving a finite Markov decision process using techniques from dynamic programming such as value or policy iteration requires a complete model of the environmental dynamics. The distribution of rewards, transition probabilities, states and actions all need to be fully observable, discrete and complete. For many problem domains, a complete model containing a full representation of the environmental dynamics may not be readily available. Bayesian reinforcement learning (RL) is a technique devised to make better use of the information observed through learning than simply computing Q-functions. However, this approach can often require extensive experience in order to build up an accurate representation of the true values. To address this issue, this paper proposes a method for parallelising a Bayesian RL technique aimed at reducing the time it takes to approximate the missing model. We demonstrate the technique on learning next state transition probabilities without prior knowledge. The approach is general enough for approximating any probabilistically driven component of the model. The solution involves multiple learning agents learning in parallel on the same task. Agents share probability density estimates with one another in an effort to speed up convergence to the true values.
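
    A minimal sketch of the sharing step described above: each agent keeps Dirichlet-style transition counts, and agents periodically merge counts so the pooled estimate of P(s' | s, a) sharpens faster than any single agent's. The class, method names, and pseudocount prior are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

class TransitionEstimator:
    """Dirichlet-count estimate of P(s' | s, a); counts can be merged
    across agents, which is the core of the parallel sharing step."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1

    def prob(self, s, a, s_next, prior=1):
        """Posterior-mean probability with a symmetric pseudocount prior."""
        c = self.counts[(s, a)]
        states = set(c) | {s_next}
        total = sum(c.values()) + prior * len(states)
        return (c[s_next] + prior) / total

    def merge(self, other):
        """Pool another agent's counts into this one's."""
        for key, dist in other.counts.items():
            for s_next, n in dist.items():
                self.counts[key][s_next] += n

# Two agents observing the same task in parallel:
a1, a2 = TransitionEstimator(), TransitionEstimator()
for _ in range(8):
    a1.observe('s0', 'go', 's1')
for _ in range(2):
    a2.observe('s0', 'go', 's2')
a1.merge(a2)  # pooled counts: 8 + 2 observations instead of 8
print(round(a1.prob('s0', 'go', 's1'), 3))
```

After merging, agent 1's estimate reflects ten observations rather than eight, which is the convergence speedup the parallel framework aims at.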

  17. Multi Agent Reward Analysis for Learning in Noisy Domains

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2005-01-01

    In many multi agent learning problems, it is difficult to determine, a priori, the agent reward structure that will lead to good performance. This problem is particularly pronounced in continuous, noisy domains ill-suited to simple table backup schemes commonly used in TD(lambda)/Q-learning. In this paper, we present a new reward evaluation method that allows the tradeoff between coordination among the agents and the difficulty of the learning problem each agent faces to be visualized. This method is independent of the learning algorithm and is only a function of the problem domain and the agents' reward structure. We then use this reward efficiency visualization method to determine an effective reward without performing extensive simulations. We test this method in both a static and a dynamic multi-rover learning domain where the agents have continuous state spaces and where their actions are noisy (e.g., the agents' movement decisions are not always carried out properly). Our results show that in the more difficult dynamic domain, the reward efficiency visualization method provides a two orders of magnitude speedup in selecting a good reward. Most importantly, it allows one to quickly create and verify rewards tailored to the observational limitations of the domain.
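
    One well-known reward structure that navigates this coordination-vs-learnability tradeoff is the difference reward, D_i = G(z) - G(z_-i): the global reward minus the global reward with agent i's contribution removed. The sketch below illustrates the idea only; the toy "coverage" objective and the removal-based counterfactual are assumptions for the example, not the paper's evaluation method.

```python
def global_reward(actions):
    """Toy team objective (an assumption for illustration): the number
    of distinct targets the team covers."""
    return len(set(actions))

def difference_reward(actions, i):
    """D_i = G(z) - G(z_-i): global reward minus the global reward with
    agent i removed. An agent whose action duplicates a teammate's
    receives zero, giving each learner a cleaner, less noisy signal."""
    without_i = actions[:i] + actions[i + 1:]
    return global_reward(actions) - global_reward(without_i)

team = ['a', 'b', 'b']
credits = [difference_reward(team, i) for i in range(len(team))]
# Only agent 0 covers a target no one else covers, so only it gets credit.
```

The global reward alone would pay all three agents identically; the difference reward isolates each agent's marginal contribution, which is the kind of property a reward-efficiency analysis is meant to surface.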

  18. The role of reward in word learning and its implications for language acquisition.

    PubMed

    Ripollés, Pablo; Marco-Pallarés, Josep; Hielscher, Ulrike; Mestres-Missé, Anna; Tempelmann, Claus; Heinze, Hans-Jochen; Rodríguez-Fornells, Antoni; Noesselt, Toemme

    2014-11-03

    The exact neural processes behind humans' drive to acquire a new language--first as infants and later as second-language learners--are yet to be established. Recent theoretical models have proposed that during human evolution, emerging language-learning mechanisms might have been glued to phylogenetically older subcortical reward systems, reinforcing human motivation to learn a new language. Supporting this hypothesis, our results showed that adult participants exhibited robust fMRI activation in the ventral striatum (VS)--a core region of reward processing--when successfully learning the meaning of new words. This activation was similar to the VS recruitment elicited using an independent reward task. Moreover, the VS showed enhanced functional and structural connectivity with neocortical language areas during successful word learning. Together, our results provide evidence for the neural substrate of reward and motivation during word learning. We suggest that this strong functional and anatomical coupling between neocortical language regions and the subcortical reward system provided a crucial advantage in humans that eventually enabled our lineage to successfully acquire linguistic skills.

  19. Reinforcement learning deficits in people with schizophrenia persist after extended trials.

    PubMed

    Cicero, David C; Martin, Elizabeth A; Becker, Theresa M; Kerns, John G

    2014-12-30

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and are then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive feedback and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning.

  20. Reinforcement Learning Deficits in People with Schizophrenia Persist after Extended Trials

    PubMed Central

    Cicero, David C.; Martin, Elizabeth A.; Becker, Theresa M.; Kerns, John G.

    2014-01-01

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and are then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning. PMID:25172610

  1. Dopamine-Dependent Reinforcement of Motor Skill Learning: Evidence from Gilles de la Tourette Syndrome

    ERIC Educational Resources Information Center

    Palminteri, Stefano; Lebreton, Mael; Worbe, Yulia; Hartmann, Andreas; Lehericy, Stephane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-01-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only…

  2. Dopamine neurons learn relative chosen value from probabilistic rewards

    PubMed Central

    Lak, Armin; Stauffer, William R; Schultz, Wolfram

    2016-01-01

    Economic theories posit reward probability as one of the factors defining reward value. Individuals learn the value of cues that predict probabilistic rewards from experienced reward frequencies. Building on the notion that responses of dopamine neurons increase with reward probability and expected value, we asked how dopamine neurons in monkeys acquire this value signal that may represent an economic decision variable. We found in a Pavlovian learning task that reward probability-dependent value signals arose from experienced reward frequencies. We then assessed neuronal response acquisition during choices among probabilistic rewards. Here, dopamine responses became sensitive to the value of both chosen and unchosen options. Both experiments also showed novelty responses of dopamine neurons that decreased as learning advanced. These results show that dopamine neurons acquire predictive value signals from the frequency of experienced rewards. This flexible and fast signal reflects a specific decision variable and could update neuronal decision mechanisms. DOI: http://dx.doi.org/10.7554/eLife.18044.001 PMID:27787196

  3. Simulation of rat behavior by a reinforcement learning algorithm in consideration of appearance probabilities of reinforcement signals.

    PubMed

    Murakoshi, Kazushi; Noguchi, Takuya

    2005-04-01

    Brown and Wagner [Brown, R.T., Wagner, A.R., 1964. Resistance to punishment and extinction following training with shock or nonreinforcement. J. Exp. Psychol. 68, 503-507] investigated rat behaviors with the following features: (1) rats were exposed to reward and punishment at the same time, (2) the environment changed and the rats relearned, and (3) rats were stochastically exposed to reward and punishment. The results are that exposure to nonreinforcement produces resistance to the decremental effects on behavior after a stochastic reward schedule, and that exposure to both punishment and reinforcement produces resistance to the decremental effects on behavior after a stochastic punishment schedule. This paper aims to simulate these rat behaviors with a reinforcement learning algorithm that takes the appearance probabilities of reinforcement signals into account. Earlier reinforcement learning algorithms were unable to simulate behavior of feature (3). We improve on them by controlling learning parameters in consideration of the acquisition probabilities of reinforcement signals. The proposed algorithm qualitatively reproduces the results of Brown and Wagner's animal experiment.
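
    One way to make learning parameters depend on estimated reinforcement probabilities is to shrink the update when the observed signal matches expectation and enlarge it when the signal is surprising. The sketch below is such an illustration (a Pearce-Hall-flavoured rule under Laplace-smoothed probability estimates), not the authors' exact algorithm; it reproduces the qualitative resistance-to-extinction pattern after partial reinforcement.

```python
class ProbabilityAwareLearner:
    """Value learner whose step size depends on how surprising the
    observed reinforcement signal is, given its estimated probability.
    (Illustrative rule, not the paper's exact parameterization.)"""

    def __init__(self, base_alpha=0.5):
        self.q = 0.0
        self.base_alpha = base_alpha
        self.n_reward, self.n_total = 1, 2  # Laplace pseudocounts

    def update(self, reward):
        p_reward = self.n_reward / self.n_total
        p_signal = p_reward if reward > 0 else 1.0 - p_reward
        alpha = self.base_alpha * (1.0 - p_signal)  # surprise-scaled step
        self.q += alpha * (reward - self.q)
        self.n_reward += 1 if reward > 0 else 0
        self.n_total += 1

continuous, partial = ProbabilityAwareLearner(), ProbabilityAwareLearner()
for t in range(20):                       # acquisition phase
    continuous.update(1.0)                # reinforced on every trial
    partial.update(1.0 if t % 2 == 0 else 0.0)  # reinforced on half
for _ in range(10):                       # extinction: no more reward
    continuous.update(0.0)
    partial.update(0.0)
# For the partially reinforced learner, nonreward is unsurprising, so
# its value decays more slowly during extinction.
```

The comparison at the end is the qualitative signature of interest: partial reinforcement leaves extinction less surprising, hence more resistant.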

  4. The ubiquity of model-based reinforcement learning.

    PubMed

    Doll, Bradley B; Simon, Dylan A; Daw, Nathaniel D

    2012-12-01

    The reward prediction error (RPE) theory of dopamine (DA) function has enjoyed great success in the neuroscience of learning and decision-making. This theory is derived from model-free reinforcement learning (RL), in which choices are made simply on the basis of previously realized rewards. Recently, attention has turned to correlates of more flexible, albeit computationally complex, model-based methods in the brain. These methods are distinguished from model-free learning by their evaluation of candidate actions using expected future outcomes according to a world model. Puzzlingly, signatures from these computations seem to be pervasive in the very same regions previously thought to support model-free learning. Here, we review recent behavioral and neural evidence about these two systems, in an attempt to reconcile their enigmatic cohabitation in the brain.
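
    The model-free/model-based distinction can be caricatured in a few lines: a model-free learner caches values directly from realized rewards, while a model-based learner plans by iterating over a learned world model. Everything below (the state names, the deterministic toy model) is an illustrative assumption, not a model from the review.

```python
def model_free_update(q, action, reward, alpha=0.1):
    """Model-free: cache an action value directly from realized reward."""
    q[action] += alpha * (reward - q[action])
    return q

def model_based_values(transitions, rewards, gamma=0.9, iters=50):
    """Model-based: plan by value iteration over a learned world model.
    transitions[s][a] -> next state (deterministic toy world);
    rewards[s] -> reward received on entering s."""
    v = {s: 0.0 for s in transitions}
    for _ in range(iters):
        for s, acts in transitions.items():
            v[s] = max(rewards[s2] + gamma * v[s2] for s2 in acts.values())
    return v

# Toy world: from s0, action 'L' leads to the rewarding state s1, 'R' to s2.
transitions = {'s0': {'L': 's1', 'R': 's2'},
               's1': {'stay': 's1'}, 's2': {'stay': 's2'}}
rewards = {'s0': 0.0, 's1': 1.0, 's2': 0.0}

q = model_free_update({'L': 0.0, 'R': 0.0}, 'L', reward=1.0)
v = model_based_values(transitions, rewards)
```

The model-free learner needs to experience the reward to nudge q['L'] upward, whereas the planner evaluates 'L' from the model alone; the "enigmatic cohabitation" in the abstract is that neural signatures of both computations appear in the same regions.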

  5. Use of Inverse Reinforcement Learning for Identity Prediction

    NASA Technical Reports Server (NTRS)

    Hayes, Roy; Bao, Jonathan; Beling, Peter; Horowitz, Barry

    2011-01-01

    We adopt Markov Decision Processes (MDP) to model sequential decision problems, which have the characteristic that the current decision made by a human decision maker has an uncertain impact on future opportunity. We hypothesize that the individuality of decision makers can be modeled as differences in the reward function under a common MDP model. A machine learning technique, Inverse Reinforcement Learning (IRL), was used to learn an individual's reward function based on limited observation of his or her decision choices. This work serves as an initial investigation for using IRL to analyze decision making, conducted through a human experiment in a cyber shopping environment. Specifically, the ability to determine the demographic identity of users is conducted through prediction analysis and supervised learning. The results show that IRL can be used to correctly identify participants, at a rate of 68% for gender and 66% for one of three college major categories.

  6. Reinforcement Learning for Robots Using Neural Networks

    DTIC Science & Technology

    1993-01-06

    Reinforcement learning agents are adaptive, reactive, and self-supervised. The aim of this dissertation is to extend the state of the art of... reinforcement learning and enable its applications to complex robot-learning problems. In particular, it focuses on two issues. First, learning from sparse... reinforcement learning methods assume that the world is a Markov decision process. This assumption is too strong for many robot tasks of interest...

  7. Generalization of value in reinforcement learning by humans.

    PubMed

    Wimmer, G Elliott; Daw, Nathaniel D; Shohamy, Daphna

    2012-04-01

    Research in decision-making has focused on the role of dopamine and its striatal targets in guiding choices via learned stimulus-reward or stimulus-response associations, behavior that is well described by reinforcement learning theories. However, basic reinforcement learning is relatively limited in scope and does not explain how learning about stimulus regularities or relations may guide decision-making. A candidate mechanism for this type of learning comes from the domain of memory, which has highlighted a role for the hippocampus in learning of stimulus-stimulus relations, typically dissociated from the role of the striatum in stimulus-response learning. Here, we used functional magnetic resonance imaging and computational model-based analyses to examine the joint contributions of these mechanisms to reinforcement learning. Humans performed a reinforcement learning task with added relational structure, modeled after tasks used to isolate hippocampal contributions to memory. On each trial participants chose one of four options, but the reward probabilities for pairs of options were correlated across trials. This (uninstructed) relationship between pairs of options potentially enabled an observer to learn about option values based on experience with the other options and to generalize across them. We observed blood oxygen level-dependent (BOLD) activity related to learning in the striatum and also in the hippocampus. By comparing a basic reinforcement learning model to one augmented to allow feedback to generalize between correlated options, we tested whether choice behavior and BOLD activity were influenced by the opportunity to generalize across correlated options. Although such generalization goes beyond standard computational accounts of reinforcement learning and striatal BOLD, both choices and striatal BOLD activity were better explained by the augmented model. Consistent with the hypothesized role for the hippocampus in this generalization, functional
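
    The augmented model described here can be implemented by letting feedback for the chosen option also update its correlated partner, with a generalization weight that reduces to the basic model-free learner when set to zero. The function name, weight g, and option labels below are illustrative assumptions, not the study's fitted model.

```python
def augmented_update(q, chosen, reward, associate, alpha=0.2, g=0.5):
    """Delta-rule update on the chosen option, plus a weighted
    generalization of the same feedback to its correlated associate.
    Setting g = 0 recovers the standard model-free learner."""
    q[chosen] += alpha * (reward - q[chosen])
    if g:
        q[associate] += g * alpha * (reward - q[associate])
    return q

q = {'A': 0.5, 'B': 0.5, 'C': 0.5, 'D': 0.5}
augmented_update(q, chosen='A', reward=1.0, associate='B')
# 'B' moves toward the feedback given for 'A'; 'C' and 'D' are untouched.
```

Comparing fits of the g = 0 and g > 0 variants to choices and striatal BOLD is the kind of model comparison the abstract describes.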

  8. The influence of trial order on learning from reward vs. punishment in a probabilistic categorization task: experimental and computational analyses.

    PubMed

    Moustafa, Ahmed A; Gluck, Mark A; Herzallah, Mohammad M; Myers, Catherine E

    2015-01-01

    Previous research has shown that trial ordering affects cognitive performance, but this has not been tested using category-learning tasks that differentiate learning from reward and punishment. Here, we tested two groups of healthy young adults using a probabilistic category learning task of reward and punishment in which there are two types of trials (reward, punishment) and three possible outcomes: (1) positive feedback for correct responses in reward trials; (2) negative feedback for incorrect responses in punishment trials; and (3) no feedback for incorrect answers in reward trials and correct answers in punishment trials. Hence, trials without feedback are ambiguous, and may represent either successful avoidance of punishment or failure to obtain reward. In Experiment 1, the first group of subjects received an intermixed task in which reward and punishment trials were presented in the same block, as a standard baseline task. In Experiment 2, a second group completed the separated task, in which reward and punishment trials were presented in separate blocks. Additionally, in order to understand the mechanisms underlying performance in the experimental conditions, we fit individual data using a Q-learning model. Results from Experiment 1 show that subjects who completed the intermixed task paradoxically valued the no-feedback outcome as a reinforcer when it occurred on reinforcement-based trials, and as a punisher when it occurred on punishment-based trials. This is supported by patterns of empirical responding, where subjects showed more win-stay behavior following an explicit reward than following an omission of punishment, and more lose-shift behavior following an explicit punisher than following an omission of reward. In Experiment 2, results showed similar performance whether subjects received reward-based or punishment-based trials first. However, when the Q-learning model was applied to these data, there were differences between subjects in the reward
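
    The ambiguous no-feedback outcome can be handled in a Q-learning model by giving each outcome label, including "none", its own subjective value: on reward trials the fitted value of "none" comes out negative (a missed reward), on punishment trials positive (successful avoidance). The parameterization and numbers below are a sketch of that idea, not the authors' fitted model.

```python
def q_update(q, choice, outcome, outcome_values, alpha=0.3):
    """Delta-rule update where every outcome label, including the
    ambiguous 'none', carries its own subjective value."""
    r = outcome_values[outcome]
    q[choice] += alpha * (r - q[choice])
    return q

# Hypothetical subjective values reflecting the paradoxical pattern reported:
reward_trial_values = {'reward': 1.0, 'none': -0.3}   # omitted reward punishes
punish_trial_values = {'punish': -1.0, 'none': 0.3}   # avoided punishment rewards

q = {'A': 0.0, 'B': 0.0}
q_update(q, 'A', 'none', punish_trial_values)   # q['A'] rises: avoidance
q_update(q, 'B', 'none', reward_trial_values)   # q['B'] falls: missed reward
```

The same physical event (no feedback) pushes values in opposite directions depending on trial type, matching the win-stay/lose-shift asymmetries observed in the intermixed task.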

  9. Learning the specific quality of taste reinforcement in larval Drosophila.

    PubMed

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-27

    The only property of reinforcement that insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing, in any brain.

  10. Learning the specific quality of taste reinforcement in larval Drosophila

    PubMed Central

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-01

    The only property of reinforcement that insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing—in any brain. DOI: http://dx.doi.org/10.7554/eLife.04711.001 PMID:25622533

  11. Credit assignment in movement-dependent reinforcement learning

    PubMed Central

    McDougle, Samuel D.; Boggess, Matthew J.; Crossley, Matthew J.; Parvin, Darius; Ivry, Richard B.; Taylor, Jordan A.

    2016-01-01

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants’ explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem. PMID:27247404

  12. Credit assignment in movement-dependent reinforcement learning.

    PubMed

    McDougle, Samuel D; Boggess, Matthew J; Crossley, Matthew J; Parvin, Darius; Ivry, Richard B; Taylor, Jordan A

    2016-06-14

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants' explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem.
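
    The gating hypothesis can be stated compactly: when the movement itself misses, the negative outcome is credited to motor execution rather than to the chosen option, and the option's value is left unchanged. The sketch below illustrates that rule; the function name, values, and parameters are assumptions for the example, not the study's fitted model.

```python
def gated_update(q, choice, reward, exec_error, alpha=0.3):
    """Gating-hypothesis sketch: an execution failure suppresses the
    reward prediction error (blame the reach, not the option);
    otherwise apply a standard delta-rule update."""
    if exec_error:
        return q  # prediction error gated out; value unchanged
    q[choice] += alpha * (reward - q[choice])
    return q

q = {'left': 0.5, 'right': 0.5}
gated_update(q, 'left', reward=0.0, exec_error=True)    # no value change
gated_update(q, 'left', reward=0.0, exec_error=False)   # value drops
```

Because gated misses do not count against an option, a reaching learner discounts execution failures, which is one way to explain the reversal of risk aversion between the key-press and reach conditions.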

  13. Pleasurable music affects reinforcement learning according to the listener

    PubMed Central

    Gold, Benjamin P.; Frank, Michael J.; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  14. Pleasurable music affects reinforcement learning according to the listener.

    PubMed

    Gold, Benjamin P; Frank, Michael J; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy.

  15. Goal-Directed and Habit-Like Modulations of Stimulus Processing during Reinforcement Learning.

    PubMed

    Luque, David; Beesley, Tom; Morris, Richard W; Jack, Bradley N; Griffiths, Oren; Whitford, Thomas J; Le Pelley, Mike E

    2017-03-15

    Recent research has shown that perceptual processing of stimuli previously associated with high-value rewards is automatically prioritized even when rewards are no longer available. It has been hypothesized that such reward-related modulation of stimulus salience is conceptually similar to an "attentional habit." Recording event-related potentials in humans during a reinforcement learning task, we show strong evidence in favor of this hypothesis. Resistance to outcome devaluation (the defining feature of a habit) was shown by the stimulus-locked P1 component, reflecting activity in the extrastriate visual cortex. Analysis at longer latencies revealed a positive component (corresponding to the P3b, from 550-700 ms) sensitive to outcome devaluation. Therefore, distinct spatiotemporal patterns of brain activity were observed corresponding to habitual and goal-directed processes. These results demonstrate that reinforcement learning engages both attentional habits and goal-directed processes in parallel. Consequences for brain and computational models of reinforcement learning are discussed. SIGNIFICANCE STATEMENT The human attentional network adapts to detect stimuli that predict important rewards. A recent hypothesis suggests that the visual cortex automatically prioritizes reward-related stimuli, driven by cached representations of reward value; that is, stimulus-response habits. Alternatively, the neural system may track the current value of the predicted outcome. Our results demonstrate for the first time that visual cortex activity is increased for reward-related stimuli even when the rewarding event is temporarily devalued. In contrast, longer-latency brain activity was specifically sensitive to transient changes in reward value. Therefore, we show that both habit-like attention and goal-directed processes occur in the same learning episode at different latencies. This result has important consequences for computational models of reinforcement learning.
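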

  16. Efficient reinforcement learning: computational theories, neuroscience and robotics.

    PubMed

    Kawato, Mitsuo; Samejima, Kazuyuki

    2007-04-01

    Reinforcement learning algorithms have provided some of the most influential computational theories for behavioral learning that depends on reward and penalty. After briefly reviewing supporting experimental data, this paper tackles three difficult theoretical issues that remain to be explored. First, plain reinforcement learning is much too slow to be considered a plausible brain model. Second, although the temporal-difference error has an important role both in theory and in experiments, how to compute it remains an enigma. Third, function of all brain areas, including the cerebral cortex, cerebellum, brainstem and basal ganglia, seems to necessitate a new computational framework. Computational studies that emphasize meta-parameters, hierarchy, modularity and supervised learning to resolve these issues are reviewed here, together with the related experimental data.

  17. Can model-free reinforcement learning explain deontological moral judgments?

    PubMed

    Ayars, Alisabeth

    2016-05-01

    Dual-systems frameworks propose that moral judgments are derived from both an immediate emotional response, and controlled/rational cognition. Recently Cushman (2013) proposed a new dual-system theory based on model-free and model-based reinforcement learning. Model-free learning attaches values to actions based on their history of reward and punishment, and explains some deontological, non-utilitarian judgments. Model-based learning involves the construction of a causal model of the world and allows for far-sighted planning; this form of learning fits well with utilitarian considerations that seek to maximize certain kinds of outcomes. I present three concerns regarding the use of model-free reinforcement learning to explain deontological moral judgment. First, many actions that humans find aversive from model-free learning are not judged to be morally wrong. Moral judgment must require something in addition to model-free learning. Second, there is a dearth of evidence for central predictions of the reinforcement account, e.g., that people with different reinforcement histories will, all else equal, make different moral judgments. Finally, to account for the effect of intention within the framework requires certain assumptions which lack support. These challenges are reasonable foci for future empirical/theoretical work on the model-free/model-based framework.

  18. A Selective Role for Lmo4 in Cue–Reward Learning

    PubMed Central

    Mangieri, Regina A.; Morrisett, Richard A.; Heberlein, Ulrike; Messing, Robert O.

    2015-01-01

    The ability to use environmental cues to predict rewarding events is essential to survival. The basolateral amygdala (BLA) plays a central role in such forms of associative learning. Aberrant cue–reward learning is thought to underlie many psychopathologies, including addiction, so understanding the underlying molecular mechanisms can inform strategies for intervention. The transcriptional regulator LIM-only 4 (LMO4) is highly expressed in pyramidal neurons of the BLA, where it plays an important role in fear learning. Because the BLA also contributes to cue–reward learning, we investigated the role of BLA LMO4 in this process using Lmo4-deficient mice and RNA interference. Lmo4-deficient mice showed a selective deficit in conditioned reinforcement. Knockdown of LMO4 in the BLA, but not in the nucleus accumbens, recapitulated this deficit in wild-type mice. Molecular and electrophysiological studies identified a deficit in dopamine D2 receptor signaling in the BLA of Lmo4-deficient mice. These results reveal a novel, LMO4-dependent transcriptional program within the BLA that is essential to cue–reward learning. PMID:26134647

  19. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems.

    PubMed

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider.
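    The model's key ingredient, internal reward generated when the goal-directed system's model of the environment improves, can be sketched in a few lines. This is a hypothetical scalar-prediction example; the paper's dorsomedial/dorsolateral architecture and cost-of-reasoning trade-off are far richer:

```python
def intrinsic_reward_from_model_improvement(observations, lr=0.5):
    """Intrinsic reward = how much the agent's world model improved."""
    estimate = 0.0       # the agent's (toy) model: a single predicted outcome
    prev_error = None
    rewards = []
    for obs in observations:
        error = abs(obs - estimate)          # current model's prediction error
        if prev_error is not None:
            # Internal reward is elicited when the model got better,
            # i.e. when prediction error shrank relative to the last step.
            rewards.append(max(0.0, prev_error - error))
        estimate += lr * (obs - estimate)    # update the model toward the data
        prev_error = error
    return rewards
```

    On a stationary stream of observations the intrinsic reward dries up as the model converges, which is consistent with the habitual, model-free system (which elicits no such reward) eventually taking over.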

  20. Anhedonia and the relative reward value of drug and nondrug reinforcers in cigarette smokers.

    PubMed

    Leventhal, Adam M; Trujillo, Michael; Ameringer, Katherine J; Tidey, Jennifer W; Sussman, Steve; Kahler, Christopher W

    2014-05-01

    Anhedonia-a psychopathologic trait indicative of diminished interest, pleasure, and enjoyment-has been linked to use of and addiction to several substances, including tobacco. We hypothesized that anhedonic drug users develop an imbalance in the relative reward value of drug versus nondrug reinforcers, which could maintain drug use behavior. To test this hypothesis, we examined whether anhedonia predicted the tendency to choose an immediate drug reward (i.e., smoking) over a less immediate nondrug reward (i.e., money) in a laboratory study of non-treatment-seeking adult cigarette smokers. Participants (N = 275, ≥10 cigarettes/day) attended a baseline visit that involved anhedonia assessment followed by 2 counterbalanced experimental visits: (a) after 16-hr smoking abstinence and (b) nonabstinent. At both experimental visits, participants completed self-report measures of mood state followed by a behavioral smoking task, which measured 2 aspects of the relative reward value of smoking versus money: (1) latency to initiate smoking when delaying smoking was monetarily rewarded and (2) willingness to purchase individual cigarettes. Results indicated that higher anhedonia predicted quicker smoking initiation and more cigarettes purchased. These relations were partially mediated by low positive and high negative mood states assessed immediately prior to the smoking task. Abstinence amplified the extent to which anhedonia predicted cigarette consumption among those who responded to the abstinence manipulation, but not the entire sample. Anhedonia may bias motivation toward smoking over alternative reinforcers, perhaps by giving rise to poor acute mood states. An imbalance in the reward value assigned to drug versus nondrug reinforcers may link anhedonia-related psychopathology to drug use.

  1. Anhedonia and the Relative Reward Value of Drug and Nondrug Reinforcers in Cigarette Smokers

    PubMed Central

    Leventhal, Adam M.; Trujillo, Michael; Ameringer, Katherine J.; Tidey, Jennifer W.; Sussman, Steve; Kahler, Christopher W.

    2015-01-01

    Anhedonia—a psychopathologic trait indicative of diminished interest, pleasure, and enjoyment—has been linked to use of and addiction to several substances, including tobacco. We hypothesized that anhedonic drug users develop an imbalance in the relative reward value of drug versus nondrug reinforcers, which could maintain drug use behavior. To test this hypothesis, we examined whether anhedonia predicted the tendency to choose an immediate drug reward (i.e., smoking) over a less immediate nondrug reward (i.e., money) in a laboratory study of non–treatment-seeking adult cigarette smokers. Participants (N = 275, ≥ 10 cigarettes/day) attended a baseline visit that involved anhedonia assessment followed by 2 counterbalanced experimental visits: (a) after 16-hr smoking abstinence and (b) nonabstinent. At both experimental visits, participants completed self-report measures of mood state followed by a behavioral smoking task, which measured 2 aspects of the relative reward value of smoking versus money: (1) latency to initiate smoking when delaying smoking was monetarily rewarded and (2) willingness to purchase individual cigarettes. Results indicated that higher anhedonia predicted quicker smoking initiation and more cigarettes purchased. These relations were partially mediated by low positive and high negative mood states assessed immediately prior to the smoking task. Abstinence amplified the extent to which anhedonia predicted cigarette consumption among those who responded to the abstinence manipulation, but not the entire sample. Anhedonia may bias motivation toward smoking over alternative reinforcers, perhaps by giving rise to poor acute mood states. An imbalance in the reward value assigned to drug versus nondrug reinforcers may link anhedonia-related psychopathology to drug use. PMID:24886011

  2. Abnormal temporal difference reward-learning signals in major depression.

    PubMed

    Kumar, P; Waiter, G; Ahearn, T; Milders, M; Reid, I; Steele, J D

    2008-08-01

    Anhedonia is a core symptom of major depressive disorder (MDD), long thought to be associated with reduced dopaminergic function. However, most antidepressants do not act directly on the dopamine system and all antidepressants have a delayed full therapeutic effect. Recently, it has been proposed that antidepressants fail to alter dopamine function in antidepressant unresponsive MDD. There is compelling evidence that dopamine neurons code a specific phasic (short duration) reward-learning signal, described by temporal difference (TD) theory. There is no current evidence for other neurons coding a TD reward-learning signal, although such evidence may be found in time. The neuronal substrates of the TD signal were not explored in this study. Phasic signals are believed to have quite different properties to tonic (long duration) signals. No studies have investigated phasic reward-learning signals in MDD. Therefore, adults with MDD receiving long-term antidepressant medication, and comparison controls both unmedicated and acutely medicated with the antidepressant citalopram, were scanned using fMRI during a reward-learning task. Three hypotheses were tested: first, patients with MDD have blunted TD reward-learning signals; second, controls given an antidepressant acutely have blunted TD reward-learning signals; third, the extent of alteration in TD signals in major depression correlates with illness severity ratings. The results supported the hypotheses. Patients with MDD had significantly reduced reward-learning signals in many non-brainstem regions: ventral striatum (VS), rostral and dorsal anterior cingulate, retrosplenial cortex (RC), midbrain and hippocampus. However, the TD signal was increased in the brainstem of patients. As predicted, acute antidepressant administration to controls was associated with a blunted TD signal, and the brainstem TD signal was not increased by acute citalopram administration. In a number of regions, the magnitude of the abnormal

  3. Pain relief produces negative reinforcement through activation of mesolimbic reward-valuation circuitry.

    PubMed

    Navratilova, Edita; Xie, Jennifer Y; Okun, Alec; Qu, Chaoling; Eyde, Nathan; Ci, Shuang; Ossipov, Michael H; King, Tamara; Fields, Howard L; Porreca, Frank

    2012-12-11

    Relief of pain is rewarding. Using a model of experimental postsurgical pain we show that blockade of afferent input from the injury with local anesthetic elicits conditioned place preference, activates ventral tegmental dopaminergic cells, and increases dopamine release in the nucleus accumbens. Importantly, place preference is associated with increased activity in midbrain dopaminergic neurons and blocked by dopamine antagonists injected into the nucleus accumbens. The data directly support the hypothesis that relief of pain produces negative reinforcement through activation of the mesolimbic reward-valuation circuitry.

  4. Resting-state EEG theta activity and risk learning: sensitivity to reward or punishment?

    PubMed

    Massar, Stijn A A; Kenemans, J Leon; Schutter, Dennis J L G

    2014-03-01

    Increased theta (4-7 Hz)-beta (13-30 Hz) power ratio in resting state electroencephalography (EEG) has been associated with risky disadvantageous decision making and with impaired reinforcement learning. However, the specific contributions of theta and beta power in risky decision making remain unclear. The first aim of the present study was to replicate the earlier-found relationship and examine the specific contributions of theta and beta power in risky decision making using the Iowa Gambling Task. The second aim of the study was to examine whether this relation was associated with differences in reward or punishment sensitivity. We replicated the earlier-found relationship by showing a positive association between theta/beta ratio and risky decision making. This correlation was mainly driven by theta oscillations. Furthermore, theta power correlated with reward-motivated learning, but not with punishment learning. The present results replicate and extend earlier findings by providing novel insights into the relation between theta-beta ratios and risky decision making. Specifically, findings show that resting-state theta activity is correlated with reinforcement learning, and that this association may be explained by differences in reward sensitivity.

  5. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.

    PubMed

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-04-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
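    A discrete-time, tabular caricature of the actor-critic scheme this abstract describes may help fix ideas. The paper itself works in continuous time with spiking neurons and continuous states, so everything here (the chain task, exploration rate, and parameters) is a simplifying assumption:

```python
import random

def actor_critic_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9, seed=1):
    """Tabular, discrete-time actor-critic on a short chain task."""
    rng = random.Random(seed)
    v = [0.0] * (n_states + 1)                    # critic: state values (goal = n_states)
    pref = [[0.0, 0.0] for _ in range(n_states)]  # actor: preferences (0 = forward, 1 = back)
    for _ in range(episodes):
        s = 0
        while s < n_states:
            if rng.random() < 0.1:                # occasional exploration
                a = rng.randrange(2)
            else:                                 # otherwise act greedily
                a = 0 if pref[s][0] >= pref[s][1] else 1
            s_next = s + 1 if a == 0 else max(0, s - 1)
            r = 1.0 if s_next == n_states else 0.0
            v_next = 0.0 if s_next == n_states else v[s_next]
            delta = r + gamma * v_next - v[s]     # TD error: the putative dopamine signal
            v[s] += alpha * delta                 # critic learns to predict future reward
            pref[s][a] += alpha * delta           # actor reinforces rewarded choices
            s = s_next
    return v, pref
```

    The same TD error trains both modules, mirroring the single neuromodulatory signal delivered to critic and actor in the model described above.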

  6. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation.

    PubMed

    Kato, Ayaka; Morita, Kenji

    2016-10-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of 'Go' or 'No-Go' selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of 'Go' values towards a goal, and (2) value-contrasts between 'Go' and 'No-Go' are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological systems for value-learning
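    The decay-induced sustained RPE described in this record can be reproduced in miniature: with a small per-step decay of learned values, the TD error stays positive at every step of a fully learned Go sequence, forming the value gradient toward the goal. Parameter values below are illustrative assumptions:

```python
def sustained_rpe_with_decay(steps=10, trials=500, alpha=0.5, gamma=0.97, decay=0.02):
    """TD learning over a Go sequence, with per-step decay of all values."""
    v = [0.0] * (steps + 1)      # v[steps] is the terminal (post-reward) state
    last_trial_rpes = []
    for _ in range(trials):
        last_trial_rpes = []
        for s in range(steps):
            r = 1.0 if s == steps - 1 else 0.0      # fully predictable reward
            rpe = r + gamma * v[s + 1] - v[s]       # TD error (putative DA)
            v[s] += alpha * rpe
            last_trial_rpes.append(rpe)
        v = [(1.0 - decay) * x for x in v]          # forgetting: values decay
    return v, last_trial_rpes
```

    With decay set to 0.0 the late-trial RPEs shrink toward zero, recovering the standard prediction that DA should not respond to a predicted reward.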

  7. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation

    PubMed Central

    Morita, Kenji

    2016-01-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of ‘Go’ or ‘No-Go’ selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of ‘Go’ values towards a goal, and (2) value-contrasts between ‘Go’ and ‘No-Go’ are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological

  8. Collaborating Fuzzy Reinforcement Learning Agents

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1997-01-01

    Earlier, we introduced GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning and, at the local level, each agent learns and operates based on GARIC, a technique for fuzzy reinforcement learning. In this paper, we show that it is possible for these agents to compete in order to affect the selected control policy while at the same time collaborating in investigating the state space. In this model, the evaluator, or critic, learns by observing all the agents' behaviors, but the control policy changes only based on the behavior of the winning agent, also known as the super agent.

  9. Honeybees learn the sign and magnitude of reward variations.

    PubMed

    Gil, Mariana; De Marco, Rodrigo J

    2009-09-01

    In this study, we asked whether honeybees learn the sign and magnitude of variations in the level of reward. We designed an experiment in which bees first had to forage on a three-flower patch offering variable reward levels, and then search for food at the site in the absence of reward and after a long foraging pause. At the time of training, we presented the bees with a decrease in reward level or, instead, with either a small or a large increase in reward level. Testing took place as soon as they visited the patch on the day following training, when we measured the bees' food-searching behaviours. We found that the bees that had experienced increasing reward levels searched for food more persistently than the bees that had experienced decreasing reward levels, and that the bees that had experienced a large increase in reward level searched for food more persistently than the bees that had experienced a small increase in reward level. Because these differences at the time of testing cannot be accounted for by the bees' previous crop loads and food-intake rates, our results unambiguously demonstrate that honeybees adjust their investment of time/energy during foraging in relation to both the sign and the magnitude of past variations in the level of reward. It is likely that such variations lead to the formation of reward expectations enhancing a forager's reliance on a feeding site. Ultimately, this would make it more likely for honeybees to find food when forage is scarce.

  10. Distributed reinforcement learning for adaptive and robust network intrusion response

    NASA Astrophysics Data System (ADS)

    Malialis, Kleanthis; Devlin, Sam; Kudenko, Daniel

    2015-07-01

    Distributed denial of service (DDoS) attacks constitute a rapidly evolving threat in the current Internet. Multiagent Router Throttling is a novel approach to defend against DDoS attacks where multiple reinforcement learning agents are installed on a set of routers and learn to rate-limit or throttle traffic towards a victim server. The focus of this paper is on online learning and scalability. We propose an approach that incorporates task decomposition, team rewards and a form of reward shaping called difference rewards. One of the novel characteristics of the proposed system is that it provides a decentralised coordinated response to the DDoS problem, thus being resilient to DDoS attacks themselves. The proposed system learns remarkably fast, thus being suitable for online learning. Furthermore, its scalability is successfully demonstrated in experiments involving 1000 learning agents. We compare our approach against a baseline and a popular state-of-the-art throttling technique from the network security literature and show that the proposed approach is more effective, adaptive to sophisticated attack rate dynamics and robust to agent failures.
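    The difference-rewards shaping mentioned in this abstract, giving each agent the global reward minus the global reward under a counterfactual default action, can be illustrated with a toy global objective. The function names and the capped-traffic objective are assumptions for illustration, not the paper's setup:

```python
def difference_reward(global_reward, joint_action, agent_index, default_action):
    """D_i = G(z) - G(z with agent i's action replaced by a default)."""
    counterfactual = list(joint_action)
    counterfactual[agent_index] = default_action
    return global_reward(joint_action) - global_reward(counterfactual)

def served_traffic(throttle_rates, demand=10.0, capacity=25.0):
    """Toy global objective: traffic let through by all routers, capped
    by the victim server's capacity (purely illustrative)."""
    allowed = sum(demand * (1.0 - t) for t in throttle_rates)
    return min(allowed, capacity)
```

    Because the counterfactual varies only agent i's action, the difference reward isolates that agent's marginal contribution to the team objective, which is what makes the signal informative for decentralised learners.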

  11. A Reinforcement Learning Approach to Control.

    DTIC Science & Technology

    1997-05-31

    acquisition is inherently a partially observable Markov decision problem. This report describes an efficient, scalable reinforcement learning approach to the...deployment of refined intelligent gaze control techniques. This report first lays a theoretical foundation for reinforcement learning. It then introduces...perform well in both high and low SNR ATR environments. Reinforcement learning coupled with history features appears to be both a sound foundation and a practical scalable base for gaze control.

  12. Two spatiotemporally distinct value systems shape reward-based learning in the human brain

    PubMed Central

    Fouragnan, Elsa; Retzler, Chris; Mullinger, Karen; Philiastides, Marios G.

    2015-01-01

    Avoiding repeated mistakes and learning to reinforce rewarding decisions is critical for human survival and adaptive actions. Yet, the neural underpinnings of the value systems that encode different decision-outcomes remain elusive. Here coupling single-trial electroencephalography with simultaneously acquired functional magnetic resonance imaging, we uncover the spatiotemporal dynamics of two separate but interacting value systems encoding decision-outcomes. Consistent with a role in regulating alertness and switching behaviours, an early system is activated only by negative outcomes and engages arousal-related and motor-preparatory brain structures. Consistent with a role in reward-based learning, a later system differentially suppresses or activates regions of the human reward network in response to negative and positive outcomes, respectively. Following negative outcomes, the early system interacts and downregulates the late system, through a thalamic interaction with the ventral striatum. Critically, the strength of this coupling predicts participants' switching behaviour and avoidance learning, directly implicating the thalamostriatal pathway in reward-based learning. PMID:26348160

  13. Reward-based learning for virtual neurorobotics through emotional speech processing

    PubMed Central

    Jayet Bray, Laurence C.; Ferneyhough, Gareth B.; Barker, Emily R.; Thibeault, Corey M.; Harris, Frederick C.

    2013-01-01

    Reward-based learning can easily be applied to real life, with a prevalence in methods for teaching children. It also allows machines and software agents to automatically determine the ideal behavior from simple reward feedback (e.g., encouragement) to maximize their performance. Advancements in affective computing, especially emotional speech processing (ESP), have allowed for more natural interaction between humans and robots. Our research focuses on integrating a novel ESP system in a relevant virtual neurorobotic (VNR) application. We created an emotional speech classifier that successfully distinguished happy and sad utterances. The accuracy of the system was 95.3 and 98.7% during the offline mode (using an emotional speech database) and the live mode (using live recordings), respectively. It was then integrated into a neurorobotic scenario, where a virtual neurorobot had to learn a simple exercise through reward-based learning. If the correct decision was made, the robot received a spoken reward, which in turn stimulated synapses (in our simulated model) undergoing spike-timing dependent plasticity (STDP) and reinforced the corresponding neural pathways. Both our ESP and neurorobotic systems allowed our neurorobot to successfully and consistently learn the exercise. The integration of ESP in a real-time computational neuroscience architecture is a first step toward the combination of human emotions and virtual neurorobotics. PMID:23641213

  14. Classroom Reinforcement and Learning: A Quantitative Synthesis.

    ERIC Educational Resources Information Center

    Lysakowski, Richard S.; Walberg, Herbert J.

    1981-01-01

    A review of statistical data from previous studies determined the benefits of positive reinforcement on learning in students from kindergarten through college. Results indicate that differences between reinforced and control groups are greater for girls and for students from special schools and that reinforcement appears to have a strong effect…

  15. Reward-based contextual learning supported by anterior cingulate cortex.

    PubMed

    Umemoto, Akina; HajiHosseini, Azadeh; Yates, Michael E; Holroyd, Clay B

    2017-02-24

    The anterior cingulate cortex (ACC) is commonly associated with cognitive control and decision making, but its specific function is highly debated. To explore a recent theory that the ACC learns the reward values of task contexts (Holroyd & McClure in Psychological Review, 122, 54-83, 2015; Holroyd & Yeung in Trends in Cognitive Sciences, 16, 122-128, 2012), we recorded the event-related brain potentials (ERPs) from participants as they played a novel gambling task. The participants were first required to select from among three games in one "virtual casino," and subsequently they were required to select from among three different games in a different virtual casino; unbeknownst to them, the payoffs for the games were higher in one casino than in the other. Analysis of the reward positivity, an ERP component believed to reflect reward-related signals carried to the ACC by the midbrain dopamine system, revealed that the ACC is sensitive to differences in the reward values associated with both the casinos and the games inside the casinos, indicating that participants learned the values of the contexts in which rewards were delivered. These results highlight the importance of the ACC in learning the reward values of task contexts in order to guide action selection.

  16. A reinforcement learning approach to gait training improves retention

    PubMed Central

    Hasson, Christopher J.; Manczurowsky, Julia; Yen, Sheng-Che

    2015-01-01

    Many gait training programs are based on supervised learning principles: an individual is guided towards a desired gait pattern with directional error feedback. While this results in rapid adaptation, improvements quickly disappear. This study tested the hypothesis that a reinforcement learning approach improves retention and transfer of a new gait pattern. The results of a pilot study and larger experiment are presented. Healthy subjects were randomly assigned to either a supervised group, who received explicit instructions and directional error feedback while they learned a new gait pattern on a treadmill, or a reinforcement group, who was only shown whether they were close to or far from the desired gait. Subjects practiced for 10 min, followed by immediate and overnight retention and over-ground transfer tests. The pilot study showed that subjects could learn a new gait pattern under a reinforcement learning paradigm. The larger experiment, which had twice as many subjects (16 in each group) showed that the reinforcement group had better overnight retention than the supervised group (a 32% vs. 120% error increase, respectively), but there were no differences for over-ground transfer. These results suggest that encouraging participants to find rewarding actions through self-guided exploration is beneficial for retention. PMID:26379524

  18. Punishment insensitivity and impaired reinforcement learning in preschoolers

    PubMed Central

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2013-01-01

    Background Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a developmental vulnerability to psychopathic traits. Methods 157 preschoolers (mean age 4.7 ±0.8 years) participated in a substudy that was embedded within a larger project. Children completed the “Stars-in-Jars” task, which involved learning to select rewarded jars and avoid punished jars. Maternal report of responsiveness to socialization was assessed with the Punishment Insensitivity and Low Concern for Others scales of the Multidimensional Assessment of Preschool Disruptive Behavior (MAP-DB). Results Punishment Insensitivity, but not Low Concern for Others, was significantly associated with reinforcement learning in multivariate models that accounted for age and sex. Specifically, higher Punishment Insensitivity was associated with significantly lower overall performance and more errors on punished trials (“passive avoidance”). Conclusions Impairments in reinforcement learning manifest in preschoolers who are high in maternal ratings of Punishment Insensitivity. If replicated, these findings may help to pinpoint the neurodevelopmental antecedents of psychopathic tendencies and suggest novel intervention targets beginning in early childhood. PMID:24033313

  19. Accelerating Multiagent Reinforcement Learning by Equilibrium Transfer.

    PubMed

    Hu, Yujing; Gao, Yang; An, Bo

    2015-07-01

    An important approach in multiagent reinforcement learning (MARL) is equilibrium-based MARL, which adopts equilibrium solution concepts in game theory and requires agents to play equilibrium strategies at each state. However, most existing equilibrium-based MARL algorithms cannot scale due to a large number of computationally expensive equilibrium computations (e.g., computing Nash equilibria is PPAD-hard) during learning. For the first time, this paper finds that during the learning process of equilibrium-based MARL, the one-shot games corresponding to each state's successive visits often have the same or similar equilibria (for some states more than 90% of games corresponding to successive visits have similar equilibria). Inspired by this observation, this paper proposes to use equilibrium transfer to accelerate equilibrium-based MARL. The key idea of equilibrium transfer is to reuse previously computed equilibria when each agent has a small incentive to deviate. By introducing transfer loss and transfer condition, a novel framework called equilibrium transfer-based MARL is proposed. We prove that although equilibrium transfer brings transfer loss, equilibrium-based MARL algorithms can still converge to an equilibrium policy under certain assumptions. Experimental results in widely used benchmarks (e.g., grid world game, soccer game, and wall game) show that the proposed framework: 1) not only significantly accelerates equilibrium-based MARL (up to 96.7% reduction in learning time), but also achieves higher average rewards than algorithms without equilibrium transfer and 2) scales significantly better than algorithms without equilibrium transfer when the state/action space grows and the number of agents increases.
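The reuse test at the heart of equilibrium transfer can be sketched as an epsilon-equilibrium check: a previously computed equilibrium carries over to a new one-shot game if no agent can gain more than a small threshold by deviating unilaterally. The two-player matrix-game setup, function names, and threshold below are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the equilibrium-transfer reuse test for a
# two-player one-shot game. Payoff matrices are illustrative.

def deviation_incentive(payoff, joint):
    """Max unilateral gain for the row player.
    payoff[a1][a2] is the row player's payoff; joint = (a1, a2)."""
    a1, a2 = joint
    return max(payoff[b][a2] for b in range(len(payoff))) - payoff[a1][a2]

def can_transfer(payoff_row, payoff_col, joint, eps=0.1):
    """Reuse the old equilibrium 'joint' in the new game if both
    players' incentives to deviate (the transfer loss) stay within eps."""
    a1, a2 = joint
    # View the game from the column player's side by transposing.
    col_view = [[payoff_col[i][j] for i in range(len(payoff_col))]
                for j in range(len(payoff_col[0]))]
    return (deviation_incentive(payoff_row, joint) <= eps and
            deviation_incentive(col_view, (a2, a1)) <= eps)

# Coordination game: (0, 0) is a Nash equilibrium, so it transfers;
# the miscoordinated joint action (0, 1) does not.
row = [[1.0, 0.0], [0.0, 1.0]]
col = [[1.0, 0.0], [0.0, 1.0]]
print(can_transfer(row, col, (0, 0)))   # True
print(can_transfer(row, col, (0, 1)))   # False
```

When the check passes, the expensive equilibrium computation for the new game is skipped, which is where the reported learning-time reductions come from.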

  20. Effects of the chronic restraint stress induced depression on reward-related learning in rats.

    PubMed

    Xu, Pan; Wang, Kezhu; Lu, Cong; Dong, Liming; Chen, Yixi; Wang, Qiong; Shi, Zhe; Yang, Yanyan; Chen, Shanguang; Liu, Xinmin

    2017-03-15

    Chronic mild or unpredictable stress produces a persistent depressive-like state. The main symptoms of depression include weight loss, despair, anhedonia, diminished motivation and mild cognitive impairment, which can affect the capacity for reward-related learning. In the present study, we aimed to evaluate the effects of chronic restraint stress on the reward-related learning performance of rats. We used repeated restraint stress (6 h/day for 28 days) to induce depression-like behavior in rats. We then designed tasks including Pavlovian conditioning (magazine head entries), acquisition and maintenance of instrumental conditioning (lever pressing) and goal-directed learning (a higher fixed-ratio schedule of reinforcement) to study the effects of chronic restraint stress. The results indicated that chronic restraint stress affected the acquisition of a Pavlovian stimulus-outcome (S-O) association, the formation and maintenance of an action-outcome (A-O) causal relation, and learning under a higher fixed-ratio schedule. In conclusion, depression markedly impaired performance in reward-related learning, and this series of instrumental learning tasks may have potential as a method to evaluate cognitive changes in depression.

  1. Anticipated Reward Enhances Offline Learning during Sleep

    ERIC Educational Resources Information Center

    Fischer, Stefan; Born, Jan

    2009-01-01

    Sleep is known to promote the consolidation of motor memories. In everyday life, typically more than 1 isolated motor skill is acquired at a time, and this possibly gives rise to interference during consolidation. Here, it is shown that reward expectancy determines the amount of sleep-dependent memory consolidation. Subjects were trained on 2…

  2. Reinforcement Learning in Multidimensional Environments Relies on Attention Mechanisms

    PubMed Central

    Daniel, Reka; Geana, Andra; Gershman, Samuel J.; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C.

    2015-01-01

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this “representation learning” process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the “curse of dimensionality” in reinforcement learning. PMID:26019331

  3. Memory Transformation Enhances Reinforcement Learning in Dynamic Environments.

    PubMed

    Santoro, Adam; Frankland, Paul W; Richards, Blake A

    2016-11-30

    Over the course of systems consolidation, there is a switch from a reliance on detailed episodic memories to generalized schematic memories. This switch is sometimes referred to as "memory transformation." Here we demonstrate a previously unappreciated benefit of memory transformation, namely, its ability to enhance reinforcement learning in a dynamic environment. We developed a neural network that is trained to find rewards in a foraging task where reward locations are continuously changing. The network can use memories for specific locations (episodic memories) and statistical patterns of locations (schematic memories) to guide its search. We find that switching from an episodic to a schematic strategy over time leads to enhanced performance due to the tendency for the reward location to be highly correlated with itself in the short-term, but regress to a stable distribution in the long-term. We also show that the statistics of the environment determine the optimal utilization of both types of memory. Our work recasts the theoretical question of why memory transformation occurs, shifting the focus from the avoidance of memory interference toward the enhancement of reinforcement learning across multiple timescales.
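The episodic-versus-schematic trade-off described above can be illustrated with a toy foraging model. This is my own construction for illustration only (the authors trained a neural network); the autocorrelation parameter and noise level are assumptions. Reward locations are correlated with the previous location in the short term but regress to a stable distribution in the long term, so the environment's statistics determine which memory type searches better.

```python
# Toy foraging model: an 'episodic' searcher returns to the last reward
# location; a 'schematic' searcher goes to the running mean of all
# past locations. Parameter a controls short-term autocorrelation.
import random

CENTER = 5.0   # centre of the stable long-run reward distribution
NOISE = 0.5

def next_location(prev, a):
    """Correlated with the previous location (weight a), regressing
    toward the stable distribution around CENTER."""
    return a * prev + (1.0 - a) * CENTER + random.gauss(0.0, NOISE)

def run(strategy, a, steps=2000):
    """Mean absolute search error for the given strategy."""
    random.seed(42)  # same location sequence for both strategies
    loc, history, err = CENTER, [], 0.0
    for _ in range(steps):
        if strategy == "episodic" and history:
            guess = history[-1]
        elif history:
            guess = sum(history) / len(history)
        else:
            guess = CENTER
        loc = next_location(loc, a)
        err += abs(guess - loc)
        history.append(loc)
    return err / steps

# Strong short-term correlation favours episodic memory...
print(run("episodic", a=0.9) < run("schematic", a=0.9))   # True
# ...fast regression to the stable distribution favours schematic memory.
print(run("schematic", a=0.1) < run("episodic", a=0.1))   # True
```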

  4. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling.

    PubMed

    Redish, A David; Jensen, Steve; Johnson, Adam; Kurth-Nelson, Zeb

    2007-07-01

    Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated value and are thus unable to capture renewal. TDRL models are based on the hypothesis that dopamine carries a reward prediction error signal; these models predict reward by driving that reward error to zero. The authors construct a TDRL model that can accommodate extinction and renewal through two simple processes: (a) a TDRL process that learns the value of situation-action pairs and (b) a situation recognition process that categorizes the observed cues into situations. This model has implications for dysfunctional states, including relapse after addiction and problem gambling.
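The two processes in this model can be sketched in a few lines: a standard TD value update driven by the prediction error, plus a situation-recognition step that assigns extinction trials to a new situation, so the acquisition value is never unlearned and remains available for renewal. State labels and parameters below are illustrative assumptions, not the authors' exact model.

```python
# Sketch of TD learning with situation recognition: extinction in a
# recognised-as-different situation leaves the original value intact.
ALPHA, GAMMA = 0.5, 0.9

def td_update(V, s, reward, s_next=None):
    """V(s) <- V(s) + alpha * delta, with delta = r + gamma*V(s') - V(s)."""
    target = reward + (GAMMA * V.get(s_next, 0.0) if s_next else 0.0)
    delta = target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + ALPHA * delta
    return delta

V = {}
for _ in range(20):                      # acquisition: cue -> reward
    td_update(V, "cue@context_A", 1.0)
for _ in range(20):                      # extinction, categorised as a
    td_update(V, "cue@context_B", 0.0)   # *different* situation
print(round(V["cue@context_A"], 3))      # acquisition value preserved
```

Without the situation-recognition step (i.e., if extinction updated `cue@context_A` directly), the value would be driven back to zero, which is exactly the unlearning behaviour that fails to explain renewal.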

  5. Emotion and reward are dissociable from error during motor learning.

    PubMed

    Festini, Sara B; Preston, Stephanie D; Reuter-Lorenz, Patricia A; Seidler, Rachael D

    2016-06-01

    Although emotion is known to reciprocally interact with cognitive and motor performance, contemporary theories of motor learning do not specifically consider how dynamic variations in a learner's affective state may influence motor performance during motor learning. Using a prism adaptation paradigm, we assessed emotion during motor learning on a trial-by-trial basis. We designed two dart-throwing experiments to dissociate motor performance and reward outcomes by giving participants maximum points for accurate throws and reduced points for throws that hit zones away from the target (i.e., "accidental points"). Experiment 1 dissociated motor performance from emotional responses and found that affective ratings tracked points earned more closely than error magnitude. Further, both reward and error uniquely contributed to motor learning, as indexed by the change in error from one trial to the next. Experiment 2 manipulated accidental point locations vertically, whereas prism displacement remained horizontal. Results demonstrated that reward could bias motor performance even when concurrent sensorimotor adaptation was taking place in a perpendicular direction. Thus, these experiments demonstrate that affective states were dissociable from error magnitude during motor learning and that affect more closely tracked points earned. Our findings further implicate reward as another factor, other than error, that contributes to motor learning, suggesting the importance of incorporating affective states into models of motor learning.

  6. Human dorsal striatal activity during choice discriminates reinforcement learning behavior from the gambler's fallacy.

    PubMed

    Jessup, Ryan K; O'Doherty, John P

    2011-04-27

    Reinforcement learning theory has generated substantial interest in neurobiology, particularly because of the resemblance between phasic dopamine and reward prediction errors. Actor-critic theories have been adapted to account for the functions of the striatum, with parts of the dorsal striatum equated to the actor. Here, we specifically test whether the human dorsal striatum--as predicted by an actor-critic instantiation--is used on a trial-to-trial basis at the time of choice to choose in accordance with reinforcement learning theory, as opposed to a competing strategy: the gambler's fallacy. Using a partial-brain functional magnetic resonance imaging scanning protocol focused on the striatum and other ventral brain areas, we found that the dorsal striatum is more active when choosing consistent with reinforcement learning compared with the competing strategy. Moreover, an overlapping area of dorsal striatum along with the ventral striatum was found to be correlated with reward prediction errors at the time of outcome, as predicted by the actor-critic framework. These findings suggest that the same region of dorsal striatum involved in learning stimulus-response associations may contribute to the control of behavior during choice, thereby using those learned associations. Intriguingly, neither reinforcement learning nor the gambler's fallacy conformed to the optimal choice strategy on the specific decision-making task we used. Thus, the dorsal striatum may contribute to the control of behavior according to reinforcement learning even when the prescriptions of such an algorithm are suboptimal in terms of maximizing future rewards.

  7. Intra-accumbens amphetamine increases the conditioned incentive salience of sucrose reward: enhancement of reward "wanting" without enhanced "liking" or response reinforcement.

    PubMed

    Wyvell, C L; Berridge, K C

    2000-11-01

    Amphetamine microinjection into the nucleus accumbens shell enhanced the ability of a Pavlovian reward cue to trigger increased instrumental performance for sucrose reward in a pure conditioned incentive paradigm. Rats were first trained to press one of two levers to obtain sucrose pellets. They were separately conditioned to associate a Pavlovian cue (30 sec light) with free sucrose pellets. On test days, the rats received bilateral microinjection of intra-accumbens vehicle or amphetamine (0.0, 2.0, 10.0, or 20.0 microgram/0.5 microliter), and lever pressing was tested in the absence of any reinforcement contingency, while the Pavlovian cue alone was freely presented at intervals throughout the session. Amphetamine microinjection selectively potentiated the cue-elicited increase in sucrose-associated lever pressing, although instrumental responding was not reinforced by either sucrose or the cue during the test. Intra-accumbens amphetamine can therefore potentiate cue-triggered incentive motivation for reward in the absence of primary or secondary reinforcement. Using the taste reactivity measure of hedonic impact, it was shown that intra-accumbens amphetamine failed to increase positive hedonic reaction patterns elicited by sucrose (i.e., sucrose "liking") at doses that effectively increase sucrose "wanting." We conclude that nucleus accumbens dopamine specifically mediates the ability of reward cues to trigger "wanting" (incentive salience) for their associated rewards, independent of both hedonic impact and response reinforcement.

  8. Multiagent cooperation and competition with deep reinforcement learning.

    PubMed

    Tampuu, Ardi; Matiisen, Tambet; Kodelja, Dorian; Kuzovkin, Ilya; Korjus, Kristjan; Aru, Juhan; Aru, Jaan; Vicente, Raul

    2017-01-01

    Evolution of cooperation and competition can appear when multiple adaptive agents share a biological, social, or technological niche. In the present work we study how cooperation and competition emerge between autonomous agents that learn by reinforcement while using only their raw visual input as the state representation. In particular, we extend the Deep Q-Learning framework to multiagent environments to investigate the interaction between two learning agents in the well-known video game Pong. By manipulating the classical rewarding scheme of Pong we show how competitive and collaborative behaviors emerge. We also describe the progression from competitive to collaborative behavior when the incentive to cooperate is increased. Finally we show how learning by playing against another adaptive agent, instead of against a hard-wired algorithm, results in more robust strategies. The present work shows that Deep Q-Networks can become a useful tool for studying decentralized learning of multiagent systems coping with high-dimensional environments.
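The rewarding-scheme manipulation described above can be sketched with a single parameter that interpolates between competitive and cooperative regimes when the ball is lost. The function below is an illustrative reformulation, not the authors' exact code; the parameter name `rho` and its range are assumptions.

```python
# Sketch of a parameterised Pong reward scheme: the player who loses
# the ball always gets -1; the other player's reward is rho, sliding
# from +1 (zero-sum, competitive) to -1 (both penalised, cooperative).

def rewards_on_ball_lost(loser, rho):
    """Per-player rewards on the event that 'loser' misses the ball."""
    scorer_reward = rho       # +1.0 competitive ... -1.0 cooperative
    loser_reward = -1.0       # losing the ball is always bad
    if loser == "left":
        return {"left": loser_reward, "right": scorer_reward}
    return {"left": scorer_reward, "right": loser_reward}

print(rewards_on_ball_lost("left", rho=1.0))    # competitive: zero-sum
print(rewards_on_ball_lost("left", rho=-1.0))   # cooperative: shared loss
```

With the cooperative setting, neither agent benefits from scoring, so the learned policies come to favour long rallies over winning points.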

  9. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading.

    PubMed

    Deng, Yue; Bao, Feng; Kong, Youyong; Ren, Zhiquan; Dai, Qionghai

    2017-03-01

    Can we train the computer to beat experienced traders at financial asset trading? In this paper, we try to address this challenge by introducing a recurrent deep neural network (NN) for real-time financial signal representation and trading. Our model is inspired by two biologically inspired learning concepts: deep learning (DL) and reinforcement learning (RL). In the framework, the DL part automatically senses the dynamic market condition for informative feature learning. Then, the RL module interacts with deep representations and makes trading decisions to accumulate the ultimate rewards in an unknown environment. The learning system is implemented in a complex NN that exhibits both deep and recurrent structures. Hence, we propose a task-aware backpropagation through time method to cope with the vanishing gradient issue in deep training. The robustness of the neural system is verified on both the stock and the commodity futures markets under broad testing conditions.

  10. Rewards.

    PubMed

    Gunderman, Richard B; Kamer, Aaron P

    2011-05-01

    For much of the 20th century, psychologists and economists operated on the assumption that work is devoid of intrinsic rewards, and the only way to get people to work harder is through the use of rewards and punishments. This so-called carrot-and-stick model of workplace motivation, when applied to medical practice, emphasizes the use of financial incentives and disincentives to manipulate behavior. More recently, however, it has become apparent that, particularly when applied to certain kinds of work, such approaches can be ineffective or even frankly counterproductive. Instead of focusing on extrinsic rewards such as compensation, organizations and their leaders need to devote more attention to the intrinsic rewards of work itself. This article reviews this new understanding of rewards and traces out its practical implications for radiology today.

  11. Reinforcement learning of periodical gaits in locomotion robots

    NASA Astrophysics Data System (ADS)

    Svinin, Mikhail; Yamada, Kazuyaki; Ushio, S.; Ueda, Kanji

    1999-08-01

    Emergence of stable gaits in locomotion robots is studied in this paper. A classifier system implementing an instance-based reinforcement learning scheme is used for sensory-motor control of an eight-legged mobile robot. An important feature of the classifier system is its ability to work with a continuous sensor space. The robot has no prior knowledge of the environment, its own internal model, or the goal coordinates. It is only assumed that the robot can acquire stable gaits by learning how to reach a light source. During the learning process the control system is self-organized by reinforcement signals. Reaching the light source yields a global reward. Forward motion receives a local reward, while stepping back and falling down receive a local punishment. Feasibility of the proposed self-organized system is tested in simulation and experiment. The control actions are specified at the leg level. It is shown that, as learning progresses, the number of action rules in the classifier system stabilizes at a certain level, corresponding to the acquired gait patterns.

  12. Indices of extinction-induced "depression" after operant learning using a runway vs. a cued free-reward delivery schedule.

    PubMed

    Topic, Bianca; Kröger, Inga; Vildirasova, Petya G; Huston, Joseph P

    2012-11-01

    Loss of reward is one of the etiological factors leading to affective disorders, such as major depression. We have proposed several variants of an animal model of depression based on extinction of reinforced behavior in rats. A number of behaviors emitted during extinction trials were found to be attenuated by antidepressant treatment and thus qualified as indices of extinction-induced "despair". These include increases in immobility in the Morris water maze, withdrawal from the former source of reward, and biting behavior in operant chambers. Here, we assess the effects of reward omission on behaviors after learning of (a) a cued free-reward delivery in an operant chamber and (b) food-reinforced runway behavior. Sixty adult male Wistar rats were either trained to receive food reinforcement every 90 s, delivered after a 5-s cue light (FI 90), or to traverse an alley to gain food reward. Daily drug treatment with either the selective serotonin reuptake inhibitor citalopram or the tricyclic antidepressant imipramine (each 10 mg/kg) or vehicle was begun either 25 days (operant chamber) or 3 days (runway) prior to extinction. The antidepressants suppressed rearing behavior in both paradigms specifically during the extinction trials, which indicates this measure as a useful marker of depression-related behavior, possibly reflecting vertical withdrawal. In the operant chamber, only marginal effects on learned operant responses during extinction were found. In the runway, the learned operant responses run time and distance to the goal, as well as total distance moved, grooming and quiescence, were also influenced by the antidepressants, providing a potential set of markers for extinction-induced "depression" in the runway. The two paradigms differ substantially with respect to the anticipation of reward and the behaviors that are learned and that accompany extinction. Accordingly, antidepressant treatment influenced different sets of behaviors in these two learning tasks.

  13. DeltaFosB in the nucleus accumbens is critical for reinforcing effects of sexual reward

    PubMed Central

    Pitchers, Kyle K.; Frohmader, Karla S.; Vialou, Vincent; Mouzon, Ezekiell; Nestler, Eric J.; Lehman, Michael N.; Coolen, Lique M.

    2010-01-01

    Sexual behavior in male rats is rewarding and reinforcing. However, little is known about the specific cellular and molecular mechanisms mediating sexual reward or the reinforcing effects of reward on subsequent expression of sexual behavior. The current study tests the hypothesis that ΔFosB, the stably expressed truncated form of FosB, plays a critical role in the reinforcement of sexual behavior and experience-induced facilitation of sexual motivation and performance. Sexual experience was shown to cause ΔFosB accumulation in several limbic brain regions including the nucleus accumbens (NAc), medial prefrontal cortex, ventral tegmental area and caudate putamen, but not the medial preoptic nucleus. Next, the induction of c-Fos, a downstream (repressed) target of ΔFosB, was measured in sexually experienced and naïve animals. The number of mating-induced c-Fos-IR cells was significantly decreased in sexually experienced animals compared to sexually naïve controls. Finally, ΔFosB levels and its activity in the NAc were manipulated using viral-mediated gene transfer to study its potential role in mediating sexual experience and experience-induced facilitation of sexual performance. Animals with ΔFosB over-expression displayed enhanced facilitation of sexual performance with sexual experience relative to controls. In contrast, the expression of ΔJunD, a dominant-negative binding partner of ΔFosB, attenuated sexual experience-induced facilitation of sexual performance, and stunted long-term maintenance of facilitation compared to GFP and ΔFosB over-expressing groups. Together, these findings support a critical role for ΔFosB expression in the NAc for the reinforcing effects of sexual behavior and sexual experience-induced facilitation of sexual performance. PMID:20618447

  14. Novelty and Inductive Generalization in Human Reinforcement Learning

    PubMed Central

    Gershman, Samuel J.; Niv, Yael

    2015-01-01

    In reinforcement learning, a decision maker searching for the most rewarding option is often faced with the question: what is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: how can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of reinforcement learning in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional reinforcement learning algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty. PMID:25808176

  15. Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning.

    PubMed

    Frank, Michael J; Moustafa, Ahmed A; Haughey, Heather M; Curran, Tim; Hutchison, Kent E

    2007-10-09

    What are the genetic and neural components that support adaptive learning from positive and negative outcomes? Here, we show with genetic analyses that three independent dopaminergic mechanisms contribute to reward and avoidance learning in humans. A polymorphism in the DARPP-32 gene, associated with striatal dopamine function, predicted relatively better probabilistic reward learning. Conversely, the C957T polymorphism of the DRD2 gene, associated with striatal D2 receptor function, predicted the degree to which participants learned to avoid choices that had been probabilistically associated with negative outcomes. The Val/Met polymorphism of the COMT gene, associated with prefrontal cortical dopamine function, predicted participants' ability to rapidly adapt behavior on a trial-to-trial basis. These findings support a neurocomputational dissociation between striatal and prefrontal dopaminergic mechanisms in reinforcement learning. Computational maximum likelihood analyses reveal independent gene effects on three reinforcement learning parameters that can explain the observed dissociations.

  16. Reinforcement learning for discounted values often loses the goal in the application to animal learning.

    PubMed

    Yamaguchi, Yoshiya; Sakai, Yutaka

    2012-11-01

    The impulsive preference of an animal for an immediate reward implies that it might subjectively discount the value of potential future outcomes. A theoretical framework for maximizing the discounted subjective value has been established in reinforcement learning theory and has been applied successfully in engineering. However, this study identified a limitation when the framework is applied to animal behavior: in some cases, there is no well-defined learning goal. Here, a possible learning framework is proposed that is well-posed in all cases and that remains consistent with impulsive preference.
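
    The discounted-value objective at issue here is easy to state concretely: a reward arriving t steps in the future is weighted by gamma**t. A minimal sketch (the discount factor 0.5 is an illustrative assumption, not taken from the paper):

```python
def discounted_return(rewards, gamma=0.5):
    """Present value of a reward sequence: a reward arriving t steps
    from now is worth gamma**t of its immediate value."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Impulsive preference falls out of steep discounting: a small
# immediate reward outweighs a larger but delayed one.
small_now = discounted_return([1.0])              # 1.0
large_later = discounted_return([0.0, 0.0, 2.0])  # 2.0 * 0.5**2 = 0.5
```

    Under this objective the small immediate reward is preferred, which is exactly the impulsive choice pattern the abstract describes.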

  17. Altering spatial priority maps via reward-based learning.

    PubMed

    Chelazzi, Leonardo; Eštočinová, Jana; Calletti, Riccardo; Lo Gerfo, Emanuele; Sani, Ilaria; Della Libera, Chiara; Santandrea, Elisa

    2014-06-18

    Spatial priority maps are real-time representations of the behavioral salience of locations in the visual field, resulting from the combined influence of stimulus driven activity and top-down signals related to the current goals of the individual. They arbitrate which of a number of (potential) targets in the visual scene will win the competition for attentional resources. As a result, deployment of visual attention to a specific spatial location is determined by the current peak of activation (corresponding to the highest behavioral salience) across the map. Here we report a behavioral study performed on healthy human volunteers, where we demonstrate that spatial priority maps can be shaped via reward-based learning, reflecting long-lasting alterations (biases) in the behavioral salience of specific spatial locations. These biases exert an especially strong influence on performance under conditions where multiple potential targets compete for selection, conferring competitive advantage to targets presented in spatial locations associated with greater reward during learning relative to targets presented in locations associated with lesser reward. Such acquired biases of spatial attention are persistent, are nonstrategic in nature, and generalize across stimuli and task contexts. These results suggest that reward-based attentional learning can induce plastic changes in spatial priority maps, endowing these representations with the "intelligent" capacity to learn from experience.

  18. The Establishment of Learned Reinforcers in Mildly Retarded Children. IMRID Behavioral Science Monograph No. 24.

    ERIC Educational Resources Information Center

    Worley, John C., Jr.

    Research regarding the establishment of learned reinforcement with mildly retarded children is reviewed. Noted are findings which indicate that educable retarded students, possibly due to cultural differences, are less responsive to social rewards than either nonretarded or more severely retarded children. Characteristics of primary and secondary…

  19. Early Years Education: Are Young Students Intrinsically or Extrinsically Motivated Towards School Activities? A Discussion about the Effects of Rewards on Young Children's Learning

    ERIC Educational Resources Information Center

    Theodotou, Evgenia

    2014-01-01

    Rewards can reinforce and at the same time forestall young children's willingness to learn. However, they are broadly used in the field of education, especially in early years settings, to stimulate children towards learning activities. This paper reviews the theoretical and research literature related to intrinsic and extrinsic motivational…

  20. Hypocretin/orexin regulation of dopamine signaling: implications for reward and reinforcement mechanisms

    PubMed Central

    Calipari, Erin S.; España, Rodrigo A.

    2012-01-01

    The hypocretins/orexins comprise two neuroexcitatory peptides that are synthesized exclusively within a circumscribed region of the lateral hypothalamus. These peptides project widely throughout the brain and interact with a variety of regions involved in the regulation of arousal-related processes including those associated with motivated behavior. The current review focuses on emerging evidence indicating that the hypocretins influence reward and reinforcement processing via actions on the mesolimbic dopamine system. We discuss contemporary perspectives of hypocretin regulation of mesolimbic dopamine signaling in both drug-free and drug states, as well as hypocretin regulation of behavioral responses to drugs of abuse, particularly as it relates to cocaine. PMID:22933994

  1. The attention habit: how reward learning shapes attentional selection.

    PubMed

    Anderson, Brian A

    2016-04-01

    There is growing consensus that reward plays an important role in the control of attention. Until recently, reward was thought to influence attention indirectly by modulating task-specific motivation and its effects on voluntary control over selection. Such an account was consistent with the goal-directed (endogenous) versus stimulus-driven (exogenous) framework that had long dominated the field of attention research. Now, a different perspective is emerging. Demonstrations that previously reward-associated stimuli can automatically capture attention even when physically inconspicuous and task-irrelevant challenge previously held assumptions about attentional control. The idea that attentional selection can be value driven, reflecting a distinct and previously unrecognized control mechanism, has gained traction. Since these early demonstrations, the influence of reward learning on attention has rapidly become an area of intense investigation, sparking many new insights. The result is an emerging picture of how the reward system of the brain automatically biases information processing. Here, I review the progress that has been made in this area, synthesizing a wealth of recent evidence to provide an integrated, up-to-date account of value-driven attention and some of its broader implications.

  2. Frontostriatal white matter integrity mediates adult age differences in probabilistic reward learning.

    PubMed

    Samanez-Larkin, Gregory R; Levens, Sara M; Perry, Lee M; Dougherty, Robert F; Knutson, Brian

    2012-04-11

    Frontostriatal circuits have been implicated in reward learning, and emerging findings suggest that frontal white matter structural integrity and probabilistic reward learning are reduced in older age. This cross-sectional study examined whether age differences in frontostriatal white matter integrity could account for age differences in reward learning in a community life span sample of human adults. By combining diffusion tensor imaging with a probabilistic reward learning task, we found that older age was associated with decreased reward learning and decreased white matter integrity in specific pathways running from the thalamus to the medial prefrontal cortex and from the medial prefrontal cortex to the ventral striatum. Further, white matter integrity in these thalamocorticostriatal paths could statistically account for age differences in learning. These findings suggest that the integrity of frontostriatal white matter pathways critically supports reward learning. The findings also raise the possibility that interventions that bolster frontostriatal integrity might improve reward learning and decision making.

  3. Drive-Reinforcement Learning System Applications

    DTIC Science & Technology

    1992-07-31

    evidence suggests that D-R would be effective in control system applications outside the robotics arena. Keywords: Drive-Reinforcement Learning, Neural Network Controllers, Robotics, Manipulator Kinematics, Dynamics and Control.

  4. Common neural mechanisms underlying reversal learning by reward and punishment.

    PubMed

    Xue, Gui; Xue, Feng; Droutman, Vita; Lu, Zhong-Lin; Bechara, Antoine; Read, Stephen

    2013-01-01

    Impairments in flexible goal-directed decisions, often examined by reversal learning, are associated with behavioral abnormalities characterized by impulsiveness and disinhibition. Although the lateral orbital frontal cortex (OFC) has been consistently implicated in reversal learning, it is still unclear whether this region is involved in negative feedback processing, behavioral control, or both, and whether reward and punishment might have different effects on lateral OFC involvement. Using a relatively large sample (N = 47), and a categorical learning task with either monetary reward or moderate electric shock as feedback, we found overlapping activations in the right lateral OFC (and adjacent insula) for reward and punishment reversal learning when comparing correct reversal trials with correct acquisition trials, whereas we found overlapping activations in the right dorsolateral prefrontal cortex (DLPFC) when negative feedback signaled contingency change. The right lateral OFC and DLPFC also showed greater sensitivity to punishment than did their left homologues, indicating an asymmetry in how punishment is processed. We propose that the right lateral OFC and anterior insula are important for transforming affective feedback to behavioral adjustment, whereas the right DLPFC is involved in higher level attention control. These results provide insight into the neural mechanisms of reversal learning and behavioral flexibility, which can be leveraged to understand risky behaviors among vulnerable populations.

  5. DAT isn’t all that: cocaine reward and reinforcement requires Toll Like Receptor 4 signaling

    PubMed Central

    Northcutt, A.L.; Hutchinson, M.R.; Wang, X.; Baratta, M.V.; Hiranita, T.; Cochran, T.A.; Pomrenze, M.B.; Galer, E.L.; Kopajtic, T.A.; Li, C.M.; Amat, J.; Larson, G.; Cooper, D.C.; Huang, Y.; O’Neill, C.E.; Yin, H.; Zahniser, N.R.; Katz, J.L.; Rice, K.C.; Maier, S.F.; Bachtell, R.K.; Watkins, L.R.

    2014-01-01

    The initial reinforcing properties of drugs of abuse, such as cocaine, are largely attributed to their ability to activate the mesolimbic dopamine system. Resulting increases in extracellular dopamine in the nucleus accumbens (NAc) are traditionally thought to result from cocaine’s ability to block dopamine transporters (DATs). Here we demonstrate that cocaine also interacts with the immunosurveillance receptor complex, Toll-Like Receptor 4 (TLR4), on microglial cells to initiate central innate immune signaling. Disruption of cocaine signaling at TLR4 suppresses cocaine-induced extracellular dopamine in the NAc, as well as cocaine conditioned place preference and cocaine self-administration. These results provide a novel understanding of the neurobiological mechanisms underlying cocaine reward/reinforcement that includes a critical role for central immune signaling, and offer a new target for medication development for cocaine abuse treatment. PMID:25644383

  6. The "proactive" model of learning: Integrative framework for model-free and model-based reinforcement learning utilizing the associative learning-based proactive brain concept.

    PubMed

    Zsuga, Judit; Biro, Klara; Papp, Csaba; Tajti, Gabor; Gesztelyi, Rudolf

    2016-02-01

    Reinforcement learning (RL) is a powerful concept underlying forms of associative learning governed by a scalar reward signal, with learning taking place when expectations are violated. RL may be assessed using model-based and model-free approaches. Model-based reinforcement learning involves the amygdala, the hippocampus, and the orbitofrontal cortex (OFC). The model-free system involves the pedunculopontine-tegmental nucleus (PPTgN), the ventral tegmental area (VTA), and the ventral striatum (VS). Based on the functional connectivity of the VS, both model-free and model-based RL systems center on the VS, which computes value by integrating model-free signals (received as reward prediction errors) with model-based reward-related input. Using the concept of a reinforcement learning agent, we propose that the VS serves as the value-function component of the RL agent. Regarding the model utilized for model-based computations, we turned to the proactive brain concept, which proposes a ubiquitous function for the default network based on its substantial functional overlap with contextual associative areas. By means of the default network, the brain continuously organizes its environment into context frames, enabling the formulation of analogy-based associations that are turned into predictions of what to expect. The OFC integrates reward-related information into context frames by computing reward expectation from the stimulus-reward and context-reward information offered by the amygdala and hippocampus, respectively. Furthermore, we suggest that the integration of model-based reward expectations into the value signal is further supported by efferents of the OFC that reach structures canonical for model-free learning (e.g., the PPTgN, VTA, and VS).

  7. Neural control of dopamine neurotransmission: implications for reinforcement learning.

    PubMed

    Aggarwal, Mayank; Hyland, Brian I; Wickens, Jeffery R

    2012-04-01

    In the past few decades there has been remarkable convergence of machine learning with neurobiological understanding of reinforcement learning mechanisms, exemplified by temporal difference (TD) learning models. The anatomy of the basal ganglia provides a number of potential substrates for instantiation of the TD mechanism. In contrast to the traditional concept of direct and indirect pathway outputs from the striatum, we emphasize that projection neurons of the striatum are branched and individual striatofugal neurons innervate both globus pallidus externa and globus pallidus interna/substantia nigra (GPi/SNr). This suggests that the GPi/SNr has the necessary inputs to operate as the source of a TD signal. We also discuss the mechanism for the timing processes necessary for learning in the TD framework. The TD framework has been particularly successful in analysing electrophysiogical recordings from dopamine (DA) neurons during learning, in terms of reward prediction error. However, present understanding of the neural control of DA release is limited, and hence the neural mechanisms involved are incompletely understood. Inhibition is very conspicuously present among the inputs to the DA neurons, with inhibitory synapses accounting for the majority of synapses on DA neurons. Furthermore, synchronous firing of the DA neuron population requires disinhibition and excitation to occur together in a coordinated manner. We conclude that the inhibitory circuits impinging directly or indirectly on the DA neurons play a central role in the control of DA neuron activity and further investigation of these circuits may provide important insight into the biological mechanisms of reinforcement learning.
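
    The TD framework discussed above is compactly summarized by the TD(0) value update, in which the prediction-error signal delta is what dopamine neuron firing is proposed to encode. A minimal sketch (the learning rate, discount factor, and toy cue-outcome task are illustrative assumptions):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """One TD(0) step. delta is the reward prediction error that
    dopamine neuron activity is proposed to report."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = {"cue": 0.0, "outcome": 0.0, "end": 0.0}
# Repeated cue -> outcome -> reward pairings: the prediction error
# at the outcome shrinks as the cue comes to predict the reward,
# mimicking the transfer of dopamine responses during learning.
for _ in range(300):
    td_update(V, "outcome", 1.0, "end")   # reward delivered at the outcome
    td_update(V, "cue", 0.0, "outcome")   # cue precedes the outcome
```

    After training, the cue's value approaches gamma times the outcome's value, so the surprise (and the hypothesized dopamine burst) moves backward from the reward to the cue.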

  8. Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning.

    PubMed

    Piot, Bilal; Geist, Matthieu; Pietquin, Olivier

    2016-05-04

    Learning from demonstrations is a paradigm by which an apprentice agent learns a control policy for a dynamic environment by observing demonstrations delivered by an expert agent. It is usually implemented as either imitation learning (IL) or inverse reinforcement learning (IRL) in the literature. On the one hand, IRL is a paradigm relying on the Markov decision processes, where the goal of the apprentice agent is to find a reward function from the expert demonstrations that could explain the expert behavior. On the other hand, IL consists in directly generalizing the expert strategy, observed in the demonstrations, to unvisited states (and it is therefore close to classification, when there is a finite set of possible decisions). While these two visions are often considered as opposite to each other, the purpose of this paper is to exhibit a formal link between these approaches from which new algorithms can be derived. We show that IL and IRL can be redefined in a way that they are equivalent, in the sense that there exists an explicit bijective operator (namely, the inverse optimal Bellman operator) between their respective spaces of solutions. To do so, we introduce the set-policy framework that creates a clear link between the IL and the IRL. As a result, the IL and IRL solutions making the best of both worlds are obtained. In addition, it is a unifying framework from which existing IL and IRL algorithms can be derived and which opens the way for the IL methods able to deal with the environment's dynamics. Finally, the IRL algorithms derived from the set-policy framework are compared with the algorithms belonging to the more common trajectory-matching family. Experiments demonstrate that the set-policy-based algorithms outperform both the standard IRL and IL ones and result in more robust solutions.

  9. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain.

    PubMed

    Niv, Yael; Edlund, Jeffrey A; Dayan, Peter; O'Doherty, John P

    2012-01-11

    Humans and animals are exquisitely, though idiosyncratically, sensitive to risk or variance in the outcomes of their actions. Economic, psychological, and neural aspects of this are well studied when information about risk is provided explicitly. However, we must normally learn about outcomes from experience, through trial and error. Traditional models of such reinforcement learning focus on learning about the mean reward value of cues and ignore higher order moments such as variance. We used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk. Our analysis focused on anatomically delineated regions of a priori interest in the nucleus accumbens, where blood oxygenation level-dependent (BOLD) signals have been suggested as correlating with quantities derived from reinforcement learning. We first provide unbiased evidence that the raw BOLD signal in these regions corresponds closely to a reward prediction error. We then derive from this signal the learned values of cues that predict rewards of equal mean but different variance and show that these values are indeed modulated by experienced risk. Moreover, a close neurometric-psychometric coupling exists between the fluctuations of the experience-based evaluations of risky options that we measured neurally and the fluctuations in behavioral risk aversion. This suggests that risk sensitivity is integral to human learning, illuminating economic models of choice, neuroscientific models of affective learning, and the workings of the underlying neural mechanisms.
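
    One simple way to obtain the risk sensitivity described above, consistent in spirit with the model class tested in this line of work, is an asymmetric TD rule with separate learning rates for positive and negative prediction errors; a pessimistic learner then values a variable option below its mean payoff. A sketch with assumed parameter values:

```python
import random

def risk_sensitive_update(v, r, alpha_plus=0.1, alpha_minus=0.3):
    """Asymmetric TD update: negative surprises are weighted more
    heavily, so risky options are valued below their mean reward."""
    delta = r - v
    alpha = alpha_plus if delta >= 0 else alpha_minus
    return v + alpha * delta

random.seed(0)
v_safe = v_risky = 0.0
for _ in range(5000):
    v_safe = risk_sensitive_update(v_safe, 0.5)                          # certain 0.5
    v_risky = risk_sensitive_update(v_risky, random.choice([0.0, 1.0]))  # mean 0.5
# v_safe approaches 0.5, while v_risky settles near 0.25: two options
# with equal mean reward are no longer valued equally (risk aversion).
```

    This is exactly the signature the study measures neurally: learned values of equal-mean cues that are modulated by experienced variance.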

  10. Reinforcement learning improves behaviour from evaluative feedback

    NASA Astrophysics Data System (ADS)

    Littman, Michael L.

    2015-05-01

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  11. Heightened reward learning under stress in generalized anxiety disorder: a predictor of depression resistance?

    PubMed

    Morris, Bethany H; Rottenberg, Jonathan

    2015-02-01

    Stress-induced anhedonia is associated with depression vulnerability (Bogdan & Pizzagalli, 2006). We investigated stress-induced deficits in reward learning in a depression-vulnerable group with analogue generalized anxiety disorder (GAD, n = 34), and never-depressed healthy controls (n = 41). Utilizing a computerized signal detection task, reward learning was assessed under stressor and neutral conditions. Controls displayed intact reward learning in the neutral condition, and the expected stress-induced blunting. The GAD group as a whole also showed intact reward learning in the neutral condition. When GAD subjects were analyzed as a function of prior depression history, never-depressed GAD subjects showed heightened reward learning in the stressor condition. Better reward learning under stress among GAD subjects predicted lower depression symptoms 1 month later. Robust reward learning under stress may indicate depression resistance among anxious individuals.

  12. Cocaine addiction as a homeostatic reinforcement learning disorder.

    PubMed

    Keramati, Mehdi; Durand, Audrey; Girardeau, Paul; Gutkin, Boris; Ahmed, Serge H

    2017-03-01

    Drug addiction implicates both reward learning and homeostatic regulation mechanisms of the brain. This has stimulated 2 partially successful theoretical perspectives on addiction. Many important aspects of addiction, however, remain to be explained within a single, unified framework that integrates the 2 mechanisms. Building upon a recently developed homeostatic reinforcement learning theory, the authors focus on a key transition stage of addiction that is well modeled in animals, escalation of drug use, and propose a computational theory of cocaine addiction where cocaine reinforces behavior due to its rapid homeostatic corrective effect, whereas its chronic use induces slow and long-lasting changes in homeostatic setpoint. Simulations show that our new theory accounts for key behavioral and neurobiological features of addiction, most notably, escalation of cocaine use, drug-primed craving and relapse, individual differences underlying dose-response curves, and dopamine D2-receptor downregulation in addicts. The theory also generates unique predictions about cocaine self-administration behavior in rats that are confirmed by new experimental results. Viewing addiction as a homeostatic reinforcement learning disorder coherently explains many behavioral and neurobiological aspects of the transition to cocaine addiction, and suggests a new perspective toward understanding addiction.

  13. Spiking neural networks with different reinforcement learning (RL) schemes in a multiagent setting.

    PubMed

    Christodoulou, Chris; Cleanthous, Aristodemos

    2010-12-31

    This paper investigates the effectiveness of spiking agents when trained with reinforcement learning (RL) in a challenging multiagent task. In particular, it explores learning through reward-modulated spike-timing dependent plasticity (STDP) and compares it to reinforcement of stochastic synaptic transmission in the general-sum game of the Iterated Prisoner's Dilemma (IPD). More specifically, a computational model is developed where we implement two spiking neural networks as two "selfish" agents learning simultaneously but independently, competing in the IPD game. The purpose of our system (or collective) is to maximise its accumulated reward in the presence of reward-driven competing agents within the collective. This can only be achieved when the agents engage in a behaviour of mutual cooperation during the IPD. Previously, we successfully applied reinforcement of stochastic synaptic transmission to the IPD game. The current study utilises reward-modulated STDP with eligibility trace and results show that the system managed to exhibit the desired behaviour by establishing mutual cooperation between the agents. It is noted that the cooperative outcome was attained after a relatively short learning period which enhanced the accumulation of reward by the system. As in our previous implementation, the successful application of the learning algorithm to the IPD becomes possible only after we extended it with additional global reinforcement signals in order to enhance competition at the neuronal level. Moreover it is also shown that learning is enhanced (as indicated by an increased IPD cooperative outcome) through: (i) strong memory for each agent (regulated by a high eligibility trace time constant) and (ii) firing irregularity produced by equipping the agents' LIF neurons with a partial somatic reset mechanism.
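
    For readers unfamiliar with the task, the Iterated Prisoner's Dilemma used here has the standard payoff structure (the classic Axelrod values T=5, R=3, P=1, S=0 are assumed for illustration), under which mutual cooperation maximizes the collective reward the two-agent system is trying to accumulate:

```python
# Payoffs (row player, column player) under standard assumed values:
# T=5 temptation, R=3 mutual cooperation, P=1 mutual defection, S=0 sucker.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def joint_reward(a1, a2):
    """Collective reward accumulated by the two-agent system per round."""
    r1, r2 = PAYOFF[(a1, a2)]
    return r1 + r2

# Mutual cooperation yields 6 per round, versus 5 for exploitation and
# 2 for mutual defection: the collective does best only when both
# independently reward-driven agents learn to cooperate.
```

    This payoff asymmetry is what makes the task challenging for selfish learners: defection is individually tempting each round, yet only sustained mutual cooperation maximizes the system's accumulated reward.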

  14. Dopamine-dependent reinforcement of motor skill learning: evidence from Gilles de la Tourette syndrome.

    PubMed

    Palminteri, Stefano; Lebreton, Maël; Worbe, Yulia; Hartmann, Andreas; Lehéricy, Stéphane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-08-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only affect choices but also motor skills such as typing. Here, we employed a novel paradigm to demonstrate that monetary rewards can improve motor skill learning in humans. Indeed, healthy participants progressively got faster in executing sequences of key presses that were repeatedly rewarded with 10 euro compared with 1 cent. Control tests revealed that the effect of reinforcement on motor skill learning was independent of subjects being aware of sequence-reward associations. To account for this implicit effect, we developed an actor-critic model, in which reward prediction errors are used by the critic to update state values and by the actor to facilitate action execution. To assess the role of dopamine in such computations, we applied the same paradigm in patients with Gilles de la Tourette syndrome, who were either unmedicated or treated with neuroleptics. We also included patients with focal dystonia, as an example of hyperkinetic motor disorder unrelated to dopamine. Model fit showed the following dissociation: while motor skills were affected in all patient groups, reinforcement learning was selectively enhanced in unmedicated patients with Gilles de la Tourette syndrome and impaired by neuroleptics. These results support the hypothesis that overactive dopamine transmission leads to excessive reinforcement of motor sequences, which might explain the formation of tics in Gilles de la Tourette syndrome.
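
    The actor-critic scheme the authors describe, in which a single reward prediction error both updates state values (critic) and facilitates action execution (actor), can be sketched as follows; the two-action task, softmax choice rule, and all parameter values are illustrative assumptions, not the paper's fitted model:

```python
import math
import random

random.seed(1)

V = 0.0                          # critic: value of the single state
pref = [0.0, 0.0]                # actor: preferences for two key sequences
alpha_critic = alpha_actor = 0.1
reward = [1.0, 0.01]             # e.g. a highly vs weakly rewarded sequence

def softmax_choice(prefs):
    """Sample an action with probability proportional to exp(preference)."""
    exps = [math.exp(p) for p in prefs]
    u = random.random() * sum(exps)
    acc = 0.0
    for a, e in enumerate(exps):
        acc += e
        if u <= acc:
            return a
    return len(prefs) - 1

for _ in range(500):
    a = softmax_choice(pref)
    delta = reward[a] - V            # shared reward prediction error
    V += alpha_critic * delta        # critic updates the state value
    pref[a] += alpha_actor * delta   # actor: the same error reinforces the action
```

    The highly rewarded sequence ends up with the larger preference and is executed more readily, an effect that requires no explicit awareness of the sequence-reward association, matching the implicit motor-skill effect reported above.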

  15. Efficient exploration through active learning for value function approximation in reinforcement learning.

    PubMed

    Akiyama, Takayuki; Hachiya, Hirotaka; Sugiyama, Masashi

    2010-06-01

    Appropriately designing sampling policies is highly important for obtaining better control policies in reinforcement learning. In this paper, we first show that the least-squares policy iteration (LSPI) framework allows us to employ statistical active learning methods for linear regression. Then we propose a design method of good sampling policies for efficient exploration, which is particularly useful when the sampling cost of immediate rewards is high. The effectiveness of the proposed method, which we call active policy iteration (API), is demonstrated through simulations with a batting robot.

  16. Dynamic Sensor Tasking for Space Situational Awareness via Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Linares, R.; Furfaro, R.

    2016-09-01

    This paper studies the Sensor Management (SM) problem for optical Space Object (SO) tracking. The tasking problem is formulated as a Markov Decision Process (MDP) and solved using Reinforcement Learning (RL). The RL problem is solved using the actor-critic policy gradient approach. The actor provides a policy which is random over actions and given by a parametric probability density function (pdf). The critic evaluates the policy by calculating the estimated total reward or the value function for the problem. The parameters of the policy action pdf are optimized using gradients with respect to the reward function. Both the critic and the actor are modeled using deep neural networks (multi-layer neural networks). The policy neural network takes the current state as input and outputs probabilities for each possible action. This policy is random, and can be evaluated by sampling random actions using the probabilities determined by the policy neural network's outputs. The critic approximates the total reward using a neural network. The estimated total reward is used to approximate the gradient of the policy network with respect to the network parameters. This approach is used to find the non-myopic optimal policy for tasking optical sensors to estimate SO orbits. The reward function is based on reducing the uncertainty for the overall catalog to below a user specified uncertainty threshold. This work uses a 30 km total position error for the uncertainty threshold. This work provides the RL method with a negative reward as long as any SO has a total position error above the uncertainty threshold. This penalizes policies that take longer to achieve the desired accuracy. A positive reward is provided when all SOs are below the catalog uncertainty threshold. An optimal policy is sought that takes actions to achieve the desired catalog uncertainty in minimum time. This work trains the policy in simulation by letting it task a single sensor to "learn" from its performance

  17. Hippocampal lesions facilitate instrumental learning with delayed reinforcement but induce impulsive choice in rats

    PubMed Central

    Cheung, Timothy HC; Cardinal, Rudolf N

    2005-01-01

    Background: Animals must frequently act to influence the world even when the reinforcing outcomes of their actions are delayed. Learning with action-outcome delays is a complex problem, and little is known of the neural mechanisms that bridge such delays. When outcomes are delayed, they may be attributed to (or associated with) the action that caused them, or mistakenly attributed to other stimuli, such as the environmental context. Consequently, animals that are poor at forming context-outcome associations might learn action-outcome associations better with delayed reinforcement than normal animals. The hippocampus contributes to the representation of environmental context, being required for aspects of contextual conditioning. We therefore hypothesized that animals with hippocampal lesions would be better than normal animals at learning to act on the basis of delayed reinforcement. We tested the ability of hippocampal-lesioned rats to learn a free-operant instrumental response using delayed reinforcement, and what is potentially a related ability – the ability to exhibit self-controlled choice, or to sacrifice an immediate, small reward in order to obtain a delayed but larger reward. Results: Rats with sham or excitotoxic hippocampal lesions acquired an instrumental response with different delays (0, 10, or 20 s) between the response and reinforcer delivery. These delays retarded learning in normal rats. Hippocampal-lesioned rats responded slightly less than sham-operated controls in the absence of delays, but they became better at learning (relative to shams) as the delays increased; delays impaired learning less in hippocampal-lesioned rats than in shams. In contrast, lesioned rats exhibited impulsive choice, preferring an immediate, small reward to a delayed, larger reward, even though they preferred the large reward when it was not delayed. Conclusion: These results support the view that the hippocampus hinders action-outcome learning with delayed outcomes.

  18. Effort-Reward Imbalance for Learning Is Associated with Fatigue in School Children

    ERIC Educational Resources Information Center

    Fukuda, Sanae; Yamano, Emi; Joudoi, Takako; Mizuno, Kei; Tanaka, Masaaki; Kawatani, Junko; Takano, Miyuki; Tomoda, Akemi; Imai-Matsumura, Kyoko; Miike, Teruhisa; Watanabe, Yasuyoshi

    2010-01-01

    We examined relationships among fatigue, sleep quality, and effort-reward imbalance for learning in school children. We developed an effort-reward for learning scale for school students and examined its reliability and validity. Self-administered surveys, including the effort-reward for learning scale and the fatigue scale, were completed by 1,023…

  19. Reinforcement learning for routing in cognitive radio ad hoc networks.

    PubMed

    Al-Rawi, Hasan A A; Yau, Kok-Lim Alvin; Mohamad, Hafizal; Ramli, Nordin; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL in routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs.
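    As a rough sketch of how the three regulated quantities interact, the fragment below uses an epsilon-greedy rule (exploration vs. exploitation) and a learning rate `alpha` to update per-neighbor route values from a scalar routing reward. The function names and the toy reward scheme are illustrative assumptions, not the paper's protocol.

```python
import random

def choose_next_hop(q, neighbors, epsilon):
    """Explore a random neighbor with probability epsilon;
    otherwise exploit the neighbor with the best learned value."""
    if random.random() < epsilon:
        return random.choice(neighbors)
    return max(neighbors, key=lambda n: q[n])

def update_route_value(q, next_hop, reward, alpha):
    """Move the chosen neighbor's value toward the observed routing
    reward at rate alpha (the learning rate)."""
    q[next_hop] += alpha * (reward - q[next_hop])
```

    With epsilon too low the node may never discover a better route; with alpha too high the values chase noise, which is one way to read the paper's finding that these parameters must be well regulated.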

  1. Artist Agent: A Reinforcement Learning Approach to Automatic Stroke Generation in Oriental Ink Painting

    NASA Astrophysics Data System (ADS)

    Xie, Ning; Hachiya, Hirotaka; Sugiyama, Masashi

    Oriental ink painting, called Sumi-e, is one of the most appealing painting styles that has attracted artists around the world. Major challenges in computer-based Sumi-e simulation are to abstract complex scene information and draw smooth and natural brush strokes. To automatically find such strokes, we propose to model the brush as a reinforcement learning agent, and learn desired brush-trajectories by maximizing the sum of rewards in the policy search framework. We also provide elaborate design of actions, states, and rewards tailored for a Sumi-e agent. The effectiveness of our proposed approach is demonstrated through simulated Sumi-e experiments.

  2. Ventral striatum and orbitofrontal cortex are both required for model-based, but not model-free, reinforcement learning.

    PubMed

    McDannald, Michael A; Lucantonio, Federica; Burke, Kathryn A; Niv, Yael; Schoenbaum, Geoffrey

    2011-02-16

    In many cases, learning is thought to be driven by differences between the value of rewards we expect and rewards we actually receive. Yet learning can also occur when the identity of the reward we receive is not as expected, even if its value remains unchanged. Learning from changes in reward identity implies access to an internal model of the environment, from which information about the identity of the expected reward can be derived. As a result, such learning is not easily accounted for by model-free reinforcement learning theories such as temporal difference reinforcement learning (TDRL), which predicate learning on changes in reward value, but not identity. Here, we used unblocking procedures to assess learning driven by value- versus identity-based prediction errors. Rats were trained to associate distinct visual cues with different food quantities and identities. These cues were subsequently presented in compound with novel auditory cues and the reward quantity or identity was selectively changed. Unblocking was assessed by presenting the auditory cues alone in a probe test. Consistent with neural implementations of TDRL models, we found that the ventral striatum was necessary for learning in response to changes in reward value. However, this area, along with orbitofrontal cortex, was also required for learning driven by changes in reward identity. This observation requires that existing models of TDRL in the ventral striatum be modified to include information about the specific features of expected outcomes derived from model-based representations, and that the role of orbitofrontal cortex in these models be clearly delineated.
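    The value-based prediction error at the heart of TDRL can be written in one line, and the one-liner makes the paper's point visible: the error is blind to reward identity. A minimal sketch (the function name and scalar encoding are illustrative):

```python
def td_error(reward_value, v_next, v_current, gamma=1.0):
    """TDRL value prediction error: delta = r + gamma * V(s') - V(s).
    It depends only on reward *value*, never on reward identity."""
    return reward_value + gamma * v_next - v_current
```

    If a rat expects a reward worth 1.0 and receives a different food of equal value, delta is zero, so a purely model-free learner has no signal to learn from; the identity-unblocking result shows learning occurs anyway, implicating model-based information.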

  3. Racial bias shapes social reinforcement learning.

    PubMed

    Lindström, Björn; Selbing, Ida; Molapour, Tanaz; Olsson, Andreas

    2014-03-01

    Both emotional facial expressions and markers of racial-group belonging are ubiquitous signals in social interaction, but little is known about how these signals together affect future behavior through learning. To address this issue, we investigated how emotional (threatening or friendly) in-group and out-group faces reinforced behavior in a reinforcement-learning task. We asked whether reinforcement learning would be modulated by intergroup attitudes (i.e., racial bias). The results showed that individual differences in racial bias critically modulated reinforcement learning. As predicted, racial bias was associated with more efficiently learned avoidance of threatening out-group individuals. We used computational modeling analysis to quantitatively delimit the underlying processes affected by social reinforcement. These analyses showed that racial bias modulates the rate at which exposure to threatening out-group individuals is transformed into future avoidance behavior. In concert, these results shed new light on the learning processes underlying social interaction with racial-in-group and out-group individuals.
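    In standard computational-modeling terms, "the rate at which exposure is transformed into future avoidance" corresponds to the learning rate of a prediction-error (delta-rule) update. A hedged sketch, with condition-specific rates as an assumption about the model class rather than the paper's exact parameterization:

```python
def delta_rule_update(value, outcome, learning_rate):
    """One prediction-error update: value moves toward the outcome
    in proportion to the learning rate."""
    return value + learning_rate * (outcome - value)

def learn_series(outcomes, learning_rate, value=0.0):
    """Apply the update across a sequence of outcomes; return the final value."""
    for outcome in outcomes:
        value = delta_rule_update(value, outcome, learning_rate)
    return value
```

    A higher rate for one face category means the same aversive outcomes produce strong avoidance values after fewer exposures.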

  4. Learning to obtain reward, but not avoid punishment, is affected by presence of PTSD symptoms in male veterans: empirical data and computational model.

    PubMed

    Myers, Catherine E; Moustafa, Ahmed A; Sheynin, Jony; Vanmeenen, Kirsten M; Gilbertson, Mark W; Orr, Scott P; Beck, Kevin D; Pang, Kevin C H; Servatius, Richard J

    2013-01-01

    Post-traumatic stress disorder (PTSD) symptoms include behavioral avoidance which is acquired and tends to increase with time. This avoidance may represent a general learning bias; indeed, individuals with PTSD are often faster than controls on acquiring conditioned responses based on physiologically-aversive feedback. However, it is not clear whether this learning bias extends to cognitive feedback, or to learning from both reward and punishment. Here, male veterans with self-reported current, severe PTSD symptoms (PTSS group) or with few or no PTSD symptoms (control group) completed a probabilistic classification task that included both reward-based and punishment-based trials, where feedback could take the form of reward, punishment, or an ambiguous "no-feedback" outcome that could signal either successful avoidance of punishment or failure to obtain reward. The PTSS group outperformed the control group in total points obtained; the PTSS group specifically performed better than the control group on reward-based trials, with no difference on punishment-based trials. To better understand possible mechanisms underlying observed performance, we used a reinforcement learning model of the task, and applied maximum likelihood estimation techniques to derive estimated parameters describing individual participants' behavior. Estimations of the reinforcement value of the no-feedback outcome were significantly greater in the control group than the PTSS group, suggesting that the control group was more likely to value this outcome as positively reinforcing (i.e., signaling successful avoidance of punishment). This is consistent with the control group's generally poorer performance on reward trials, where reward feedback was to be obtained in preference to the no-feedback outcome. Differences in the interpretation of ambiguous feedback may contribute to the facilitated reinforcement learning often observed in PTSD patients, and may in turn provide new insight into how…

  5. The basolateral amygdala in reward learning and addiction

    PubMed Central

    Wassum, Kate M.; Izquierdo, Alicia

    2015-01-01

    Sophisticated behavioral paradigms partnered with the emergence of increasingly selective techniques to target the basolateral amygdala (BLA) have resulted in an enhanced understanding of the role of this nucleus in learning and using reward information. Due to the wide variety of behavioral approaches, many questions remain on the circumscribed role of BLA in appetitive behavior. In this review, we first integrate conclusions of BLA function in reward-related behavior using traditional interference techniques (lesion, pharmacological inactivation) with those using newer methodological approaches in experimental animals that allow in vivo manipulation of cell type-specific populations and neural recordings. Second, from a review of appetitive behavioral tasks in rodents and monkeys and recent computational models of reward procurement, we derive evidence for BLA as a neural integrator of reward value, history, and cost parameters. Taken together, BLA codes specific and temporally dynamic outcome representations in a distributed network to orchestrate adaptive responses. We provide evidence that experiences with opiates and psychostimulants alter these outcome representations in BLA, resulting in long-term modified action. PMID:26341938

  6. A REINFORCEMENT LEARNING MODEL OF PERSUASIVE COMMUNICATION.

    ERIC Educational Resources Information Center

    WEISS, ROBERT FRANK

    Theoretical and experimental analogies are drawn between learning theory and persuasive communication as an extension of liberalized stimulus-response theory. In the first experiment on instrumental conditioning of attitudes, the subjects read an opinion to be learned, followed by a supporting argument assumed to function as a reinforcer. The time…

  7. Adaptive Educational Software by Applying Reinforcement Learning

    ERIC Educational Resources Information Center

    Bennane, Abdellah

    2013-01-01

    The introduction of intelligence into teaching software is the subject of this paper. In the software elaboration process, learning techniques are used to adapt the teaching software to the characteristics of the student. Generally, artificial intelligence techniques such as reinforcement learning and Bayesian networks are used to adapt…

  8. Generalization of value in reinforcement learning by humans

    PubMed Central

    Wimmer, G. Elliott; Daw, Nathaniel D.; Shohamy, Daphna

    2012-01-01

    Research in decision making has focused on the role of dopamine and its striatal targets in guiding choices via learned stimulus-reward or stimulus-response associations, behavior that is well-described by reinforcement learning (RL) theories. However, basic RL is relatively limited in scope and does not explain how learning about stimulus regularities or relations may guide decision making. A candidate mechanism for this type of learning comes from the domain of memory, which has highlighted a role for the hippocampus in learning of stimulus-stimulus relations, typically dissociated from the role of the striatum in stimulus-response learning. Here, we used fMRI and computational model-based analyses to examine the joint contributions of these mechanisms to RL. Humans performed an RL task with added relational structure, modeled after tasks used to isolate hippocampal contributions to memory. On each trial participants chose one of four options, but the reward probabilities for pairs of options were correlated across trials. This (uninstructed) relationship between pairs of options potentially enabled an observer to learn about options’ values based on experience with the other options and to generalize across them. We observed BOLD activity related to learning in the striatum and also in the hippocampus. By comparing a basic RL model to one augmented to allow feedback to generalize between correlated options, we tested whether choice behavior and BOLD activity were influenced by the opportunity to generalize across correlated options. Although such generalization goes beyond standard computational accounts of RL and striatal BOLD, both choices and striatal BOLD were better explained by the augmented model. Consistent with the hypothesized role for the hippocampus in this generalization, functional connectivity between the ventral striatum and hippocampus was modulated, across participants, by the ability of the augmented model to capture participants’ choice…
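    The contrast between the basic and augmented models can be sketched as follows: the basic learner updates only the chosen option, while the augmented learner also propagates a fraction of the feedback to the option whose payoffs are correlated with it. The generalization weight `kappa` and the update form are illustrative assumptions, not the paper's fitted model, and the sign of the generalized update would depend on whether the pair is positively or negatively correlated.

```python
def basic_rl_update(values, chosen, reward, alpha=0.3):
    """Standard RL: feedback updates only the chosen option."""
    values[chosen] += alpha * (reward - values[chosen])

def augmented_rl_update(values, chosen, partner, reward, alpha=0.3, kappa=0.5):
    """Augmented RL: feedback also generalizes, scaled by kappa,
    to the correlated partner option."""
    values[chosen] += alpha * (reward - values[chosen])
    values[partner] += kappa * alpha * (reward - values[partner])
```

    Fitting both models and comparing their likelihoods on participants' choices is what lets the authors argue for generalization beyond standard RL.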

  9. A proposed resolution to the paradox of drug reward: Dopamine's evolution from an aversive signal to a facilitator of drug reward via negative reinforcement.

    PubMed

    Ting-A-Kee, Ryan; Heinmiller, Andrew; van der Kooy, Derek

    2015-09-01

    The mystery surrounding how plant neurotoxins came to possess reinforcing properties is termed the paradox of drug reward. Here we propose a resolution to this paradox whereby dopamine - which has traditionally been viewed as a signal of reward - initially signaled aversion and encouraged escape. We suggest that after being consumed, plant neurotoxins such as nicotine activated an aversive dopaminergic pathway, thereby deterring predatory herbivores. Later evolutionary events - including the development of a GABAergic system capable of modulating dopaminergic activity - led to the ability to down-regulate and 'control' this dopamine-based aversion. We speculate that this negative reinforcement system evolved so that animals could suppress aversive states such as hunger in order to attend to other internal drives (such as mating and shelter) that would result in improved organismal fitness.

  10. Context transfer in reinforcement learning using action-value functions.

    PubMed

    Mousavi, Amin; Nadjar Araabi, Babak; Nili Ahmadabadi, Majid

    2014-01-01

    This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different states or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents' MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task.
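    The transfer scheme reduces to three mechanical steps: map each source state-action pair through the partial feature-space mapping, merge the mapped Q-values, and use the result to initialize the target learner. The sketch below merges by simple averaging; the paper itself uses an interval-based representation, so treat the merge rule and all names as simplifying assumptions.

```python
def transfer_q_values(source_q_tables, mapping):
    """Map source-task Q-values into the target's state-action space and merge.

    mapping is a *partial* dict from source (state, action) keys to target
    keys; source pairs with no counterpart in the target are dropped."""
    collected = {}
    for q_table in source_q_tables:
        for key, value in q_table.items():
            if key in mapping:
                collected.setdefault(mapping[key], []).append(value)
    # merge the mapped values (averaging here) to seed the target Q-table
    return {key: sum(vals) / len(vals) for key, vals in collected.items()}
```

    The returned dict initializes the target task's Q-table, so learning starts from transferred estimates instead of zeros.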

  12. Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioural ecology.

    PubMed

    Schultz, Wolfram

    2004-04-01

    Neurons in a small number of brain structures detect rewards and reward-predicting stimuli and are active during the expectation of predictable food and liquid rewards. These neurons code the reward information according to basic terms of various behavioural theories that seek to explain reward-directed learning, approach behaviour and decision-making. The involved brain structures include groups of dopamine neurons, the striatum including the nucleus accumbens, the orbitofrontal cortex and the amygdala. The reward information is fed to brain structures involved in decision-making and organisation of behaviour, such as the dorsolateral prefrontal cortex and possibly the parietal cortex. The neural coding of basic reward terms derived from formal theories puts the neurophysiological investigation of reward mechanisms on firm conceptual grounds and provides neural correlates for the function of rewards in learning, approach behaviour and decision-making.

  13. Reinforcement learning in depression: A review of computational research.

    PubMed

    Chen, Chong; Takahashi, Taiki; Nakagawa, Shin; Inoue, Takeshi; Kusumi, Ichiro

    2015-08-01

    Despite being considered primarily a mood disorder, major depressive disorder (MDD) is characterized by cognitive and decision making deficits. Recent research has employed computational models of reinforcement learning (RL) to address these deficits. The computational approach has the advantage of making explicit predictions about learning and behavior, specifying the process parameters of RL, differentiating between model-free and model-based RL, and enabling computational model-based functional magnetic resonance imaging and electroencephalography analyses. With these merits, an emerging field of computational psychiatry has developed, and here we review specific studies that focused on MDD. Considerable evidence suggests that MDD is associated with impaired brain signals of reward prediction error and expected value ('wanting'), decreased reward sensitivity ('liking') and/or learning (be it model-free or model-based), etc., although the causality remains unclear. These parameters may serve as valuable intermediate phenotypes of MDD, linking general clinical symptoms to underlying molecular dysfunctions. We believe future computational research at clinical, systems, and cellular/molecular/genetic levels will propel us toward a better understanding of the disease.

  14. Short-term memory traces for action bias in human reinforcement learning.

    PubMed

    Bogacz, Rafal; McClure, Samuel M; Li, Jian; Cohen, Jonathan D; Montague, P Read

    2007-06-11

    Recent experimental and theoretical work on reinforcement learning has shed light on the neural bases of learning from rewards and punishments. One fundamental problem in reinforcement learning is the credit assignment problem, or how to properly assign credit to actions that lead to reward or punishment following a delay. Temporal difference learning solves this problem, but its efficiency can be significantly improved by the addition of eligibility traces (ET). In essence, ETs function as decaying memories of previous choices that are used to scale synaptic weight changes. It has been shown in theoretical studies that ETs spanning a number of actions may improve the performance of reinforcement learning. However, it remains an open question whether including ETs that persist over sequences of actions allows reinforcement learning models to better fit empirical data regarding the behaviors of humans and other animals. Here, we report an experiment in which human subjects performed a sequential economic decision game in which the long-term optimal strategy differed from the strategy that leads to the greatest short-term return. We demonstrate that human subjects' performance in the task is significantly affected by the time between choices in a surprising and seemingly counterintuitive way. However, this behavior is naturally explained by a temporal difference learning model which includes ETs persisting across actions. Furthermore, we review recent findings that suggest that short-term synaptic plasticity in dopamine neurons may provide a realistic biophysical mechanism for producing ETs that persist on a timescale consistent with behavioral observations.
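    The mechanism under test, eligibility traces persisting across actions, can be sketched with tabular TD(lambda): each visited state keeps a decaying memory, so a single delayed reward updates the whole recent sequence of states at once. A minimal illustration, not the paper's fitted model:

```python
def td_lambda_step(v, traces, state, reward, next_state, done,
                   alpha=0.1, gamma=0.95, lam=0.8):
    """One TD(lambda) step: the TD error updates every remembered state
    in proportion to its decaying eligibility trace."""
    delta = reward + (0.0 if done else gamma * v[next_state]) - v[state]
    traces[state] = traces.get(state, 0.0) + 1.0     # mark the current state
    for s in list(traces):
        v[s] += alpha * delta * traces[s]            # credit by eligibility
        traces[s] *= gamma * lam                     # decay all traces
    return delta
```

    With lam = 0 only the most recent state is updated; with lam > 0, credit from a delayed reward reaches earlier choices within the same episode, which is the property used to fit the behavioral data.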

  15. Evolution with reinforcement learning in negotiation.

    PubMed

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and makes more robust predictions in the long term as compared to in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions when the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stability than their counterparts using a classic evolutionary algorithm.

  17. Reinforcement Learning with Bounded Information Loss

    NASA Astrophysics Data System (ADS)

    Peters, Jan; Mülling, Katharina; Seldin, Yevgeny; Altun, Yasemin

    2011-03-01

    Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information, and policy search has hence been marred by premature convergence and implausible solutions. As first suggested in the context of covariant or natural policy gradients, many of these problems may be addressed by constraining the information loss. In this paper, we continue this line of reasoning and suggest two reinforcement learning methods, a model-based and a model-free algorithm, that bound the loss in relative entropy while maximizing their return. The resulting methods differ significantly from previous policy gradient approaches and yield an exact update step. They work well on typical reinforcement learning benchmark problems as well as on novel evaluations in robotics. We also show a Bayesian bound motivation for this new approach [8].
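    At the core of relative-entropy-bounded policy search is an exponential reweighting of sampled returns, where a temperature eta is tied to the allowed information loss (KL divergence) between successive policies. A schematic fragment: the closed-form choice of eta from the bound is omitted, and the function name is an illustrative assumption.

```python
import math

def bounded_update_weights(returns, eta):
    """Weight each sampled return by exp(return / eta). Small eta
    concentrates the new policy on the best samples (large update,
    large information loss); large eta keeps the weights near uniform
    (the new policy stays close to the old one)."""
    shift = max(returns)                        # subtract max for numerical stability
    weights = [math.exp((r - shift) / eta) for r in returns]
    total = sum(weights)
    return [w / total for w in weights]
```

    The bound on relative entropy is what determines eta in the actual algorithms; here it is left as a free parameter to show the greedy-versus-conservative tradeoff.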

  18. Deficits in reinforcement learning but no link to apathy in patients with schizophrenia.

    PubMed

    Hartmann-Riemer, Matthias N; Aschenbrenner, Steffen; Bossert, Magdalena; Westermann, Celina; Seifritz, Erich; Tobler, Philippe N; Weisbrod, Matthias; Kaiser, Stefan

    2017-01-10

    Negative symptoms in schizophrenia have been linked to selective reinforcement learning deficits in the context of gains combined with intact loss-avoidance learning. Fundamental mechanisms of reinforcement learning and choice are prediction error signaling and the precise representation of reward value for future decisions. It is unclear which of these mechanisms contribute to the impairments in learning from positive outcomes observed in schizophrenia. A recent study suggested that patients with severe apathy symptoms show deficits in the representation of expected value. Considering the fundamental relevance for the understanding of these symptoms, we aimed to assess the stability of these findings across studies. Sixty-four patients with schizophrenia and 19 healthy control participants performed a probabilistic reward learning task. They had to associate stimuli with gain or loss-avoidance. In a transfer phase participants indicated valuation of the previously learned stimuli by choosing among them. Patients demonstrated an overall impairment in learning compared to healthy controls. No effects of apathy symptoms on task indices were observed. However, patients with schizophrenia learned better in the context of loss-avoidance than in the context of gain. Earlier findings were thus partially replicated. Further studies are needed to clarify the mechanistic link between negative symptoms and reinforcement learning.

  20. Optimal control in microgrid using multi-agent reinforcement learning.

    PubMed

    Li, Fu-Dong; Wu, Min; He, Yong; Chen, Xin

    2012-11-01

    This paper presents an improved reinforcement learning method to minimize electricity costs, subject to the power balance and generation limits of units, in a grid-connected microgrid. Firstly, the microgrid control requirements are analyzed and the objective function for optimal microgrid control is proposed. Then, a state variable, "Average Electricity Price Trend", which expresses the most likely transitions of the system, is developed to reduce the complexity and randomness of the microgrid, and a multi-agent architecture comprising agents, state variables, action variables, and a reward function is formulated. Furthermore, dynamic hierarchical reinforcement learning, based on the change rate of a key state variable, is established to carry out optimal policy exploration. The analysis shows that the proposed method helps to handle the "curse of dimensionality" and speeds up learning in large, unknown environments. Finally, simulation results under JADE (Java Agent Development Framework) demonstrate the validity of the presented method for optimal control of a grid-connected microgrid.
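
    The cost-driven learning described above can be illustrated with a toy single-agent tabular Q-learning sketch; the states, actions, and prices below are invented for illustration, and the paper's multi-agent hierarchical method is not reproduced:

```python
import random

random.seed(1)
states = ["low_price", "high_price"]
actions = ["buy_grid", "use_battery"]
# Hypothetical electricity cost per step; reward is its negative.
cost = {("low_price", "buy_grid"): 1.0, ("low_price", "use_battery"): 2.0,
        ("high_price", "buy_grid"): 4.0, ("high_price", "use_battery"): 2.0}

Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, eps = 0.2, 0.9, 0.1
s = "low_price"
for _ in range(2000):
    # Epsilon-greedy action selection.
    a = random.choice(actions) if random.random() < eps \
        else max(actions, key=lambda b: Q[(s, b)])
    r = -cost[(s, a)]
    s_next = random.choice(states)  # price regime evolves randomly in this toy
    # One-step Q-learning update.
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                          - Q[(s, a)])
    s = s_next

# Learned greedy policy: buy from the grid when cheap, use the battery when dear.
policy = {s: max(actions, key=lambda b: Q[(s, b)]) for s in states}
print(policy)
```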

  1. Antipsychotic dose modulates behavioral and neural responses to feedback during reinforcement learning in schizophrenia.

    PubMed

    Insel, Catherine; Reinen, Jenna; Weber, Jochen; Wager, Tor D; Jarskog, L Fredrik; Shohamy, Daphna; Smith, Edward E

    2014-03-01

    Schizophrenia is characterized by an abnormal dopamine system, and dopamine blockade is the primary mechanism of antipsychotic treatment. Consistent with the known role of dopamine in reward processing, prior research has demonstrated that patients with schizophrenia exhibit impairments in reward-based learning. However, it remains unknown how treatment with antipsychotic medication impacts the behavioral and neural signatures of reinforcement learning in schizophrenia. The goal of this study was to examine whether antipsychotic medication modulates behavioral and neural responses to prediction error coding during reinforcement learning. Patients with schizophrenia completed a reinforcement learning task while undergoing functional magnetic resonance imaging. The task consisted of two separate conditions in which participants accumulated monetary gain or avoided monetary loss. Behavioral results indicated that antipsychotic medication dose was associated with altered behavioral approaches to learning, such that patients taking higher doses of medication showed increased sensitivity to negative reinforcement. Higher doses of antipsychotic medication were also associated with higher learning rates (LRs), suggesting that medication enhanced sensitivity to trial-by-trial feedback. Neuroimaging data demonstrated that antipsychotic dose was related to differences in neural signatures of feedback prediction error during the loss condition. Specifically, patients taking higher doses of medication showed attenuated prediction error responses in the striatum and the medial prefrontal cortex. These findings indicate that antipsychotic medication treatment may influence motivational processes in patients with schizophrenia.

  2. Learning and generalization from reward and punishment in opioid addiction.

    PubMed

    Myers, Catherine E; Rego, Janice; Haber, Paul; Morley, Kirsten; Beck, Kevin D; Hogarth, Lee; Moustafa, Ahmed A

    2017-01-15

    This study adapts a widely-used acquired equivalence paradigm to investigate how opioid-addicted individuals learn from positive and negative feedback, and how they generalize this learning. The opioid-addicted group consisted of 33 participants with a history of heroin dependency currently in a methadone maintenance program; the control group consisted of 32 healthy participants without a history of drug addiction. All participants performed a novel variant of the acquired equivalence task, where they learned to map some stimuli to correct outcomes in order to obtain reward, and to map other stimuli to correct outcomes in order to avoid punishment; some stimuli were implicitly "equivalent" in the sense of being paired with the same outcome. On the initial training phase, both groups performed similarly on learning to obtain reward, but as memory load grew, the control group outperformed the addicted group on learning to avoid punishment. On a subsequent testing phase, the addicted and control groups performed similarly on retention trials involving previously-trained stimulus-outcome pairs, as well as on generalization trials to assess acquired equivalence. Since prior work with acquired equivalence tasks has associated stimulus-outcome learning with the nigrostriatal dopamine system, and generalization with the hippocampal region, the current results are consistent with basal ganglia dysfunction in the opioid-addicted patients. Further, a selective deficit in learning from punishment could contribute to processes by which addicted individuals continue to pursue drug use even at the cost of negative consequences such as loss of income and the opportunity to engage in other life activities.

  3. Reinforcement learning on slow features of high-dimensional input streams.

    PubMed

    Legenstein, Robert; Wilbert, Niko; Wiskott, Laurenz

    2010-08-19

    Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
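
    The slowness principle behind the preprocessing stage can be shown in a few lines of linear slow feature analysis: whiten the input, then take the direction whose temporal derivative has the least variance. This is a minimal sketch on a synthetic two-channel signal; the paper itself uses a hierarchical nonlinear SFA network on visual input:

```python
import numpy as np

# Two observed channels that linearly mix a slow and a fast latent source.
t = np.linspace(0, 2 * np.pi, 500)
slow = np.sin(t)          # slowly varying source
fast = np.sin(29 * t)     # quickly varying source
X = np.column_stack([slow + 0.5 * fast, fast - 0.5 * slow])

X = X - X.mean(axis=0)
# Whiten so the input has identity covariance.
eigval, eigvec = np.linalg.eigh(np.cov(X.T))
W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
Z = X @ W
# Slowest feature = smallest-eigenvalue eigenvector of the covariance of
# the temporal differences of the whitened signal (eigh sorts ascending).
dZ = np.diff(Z, axis=0)
_, dvec = np.linalg.eigh(np.cov(dZ.T))
slow_feature = Z @ dvec[:, 0]

# The extracted feature recovers the slow source up to sign and scale.
corr = abs(np.corrcoef(slow_feature, slow)[0, 1])
print(round(corr, 2))
```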

  4. Autonomous reinforcement learning with experience replay.

    PubMed

    Wawrzyński, Paweł; Tanwani, Ajay Kumar

    2013-05-01

    This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy using previously collected samples and autonomously estimates appropriate step-sizes for the learning updates. The algorithm is based on actor-critic with experience replay, with step-sizes determined on-line by an enhanced fixed-point algorithm for on-line neural network training. An experimental study with a simulated octopus arm and a half-cheetah demonstrates the feasibility of the proposed algorithm for solving difficult learning control problems autonomously within a reasonably short time.
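
    The experience-replay idea the algorithm builds on, storing transitions and re-sampling them for many updates instead of discarding data, can be sketched on a toy bandit (a generic illustration; the paper's actor-critic and automatic step-size estimation are not reproduced):

```python
import random
from collections import deque

random.seed(2)
buffer = deque(maxlen=10_000)

# Toy two-armed bandit: action 1 always pays 1.0, action 0 pays 0.2.
for _ in range(200):                      # collect experience once
    a = random.randrange(2)
    r = 1.0 if a == 1 else 0.2
    buffer.append((a, r))

Q = [0.0, 0.0]
for _ in range(1000):                     # replay it many times
    a, r = random.choice(buffer)
    Q[a] += 0.05 * (r - Q[a])             # incremental update toward reward

print([round(q, 2) for q in Q])           # converges to [0.2, 1.0]
```

Replaying stored samples lets each collected transition contribute to many updates, which is what makes the approach sample-efficient enough for real-life control.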

  5. Refining Linear Fuzzy Rules by Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.; Malkani, Anil

    1996-01-01

    Linear fuzzy rules are increasingly being used in the development of fuzzy logic systems. Radial basis functions have also been used in the antecedents of the rules for clustering in the product space, which can automatically generate a set of linear fuzzy rules from an input/output data set. Manual methods are usually used in refining these rules. This paper presents a method for refining the parameters of these rules using reinforcement learning, which can be applied in domains where supervised input-output data is not available and reinforcements are received only after a long sequence of actions. This is shown for a generalization of radial basis functions. The formation of fuzzy rules from data and their automatic refinement is an important step toward applying reinforcement learning methods in domains where only limited input-output data is available.

  6. Hemispheric Asymmetries in Striatal Reward Responses Relate to Approach-Avoidance Learning and Encoding of Positive-Negative Prediction Errors in Dopaminergic Midbrain Regions.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly C; Schwartz, Sophie

    2015-10-28

    Some individuals are better at learning about rewarding situations, whereas others are inclined to avoid punishments (i.e., enhanced approach or avoidance learning, respectively). In reinforcement learning, action values are increased when outcomes are better than predicted (positive prediction errors [PEs]) and decreased for worse than predicted outcomes (negative PEs). Because actions with high and low values are approached and avoided, respectively, individual differences in the neural encoding of PEs may influence the balance between approach-avoidance learning. Recent correlational approaches also indicate that biases in approach-avoidance learning involve hemispheric asymmetries in dopamine function. However, the computational and neural mechanisms underpinning such learning biases remain unknown. Here we assessed hemispheric reward asymmetry in striatal activity in 34 human participants who performed a task involving rewards and punishments. We show that the relative difference in reward response between hemispheres relates to individual biases in approach-avoidance learning. Moreover, using a computational modeling approach, we demonstrate that better encoding of positive (vs negative) PEs in dopaminergic midbrain regions is associated with better approach (vs avoidance) learning, specifically in participants with larger reward responses in the left (vs right) ventral striatum. Thus, individual dispositions or traits may be determined by neural processes acting to constrain learning about specific aspects of the world.
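
    The asymmetry the authors describe can be expressed with separate learning rates for positive and negative prediction errors: a larger positive-PE rate yields an approach-leaning learner, a larger negative-PE rate an avoidance-leaning one. The parameter values below are illustrative, not the fitted ones:

```python
def update_value(value, outcome, alpha_pos=0.3, alpha_neg=0.1):
    """Value update with separate learning rates for better-than-expected
    (positive PE) and worse-than-expected (negative PE) outcomes."""
    pe = outcome - value                 # prediction error
    alpha = alpha_pos if pe > 0 else alpha_neg
    return value + alpha * pe

v = 0.0
for outcome in [1, 1, 0, 1, 0, 1]:       # a mixed run of wins and losses
    v = update_value(v, outcome)

# Wins move the value up faster than losses pull it down, so the final
# estimate overshoots the 4/6 win rate.
print(round(v, 2))                       # 0.69
```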

  7. SAwSu: an integrated model of associative and reinforcement learning.

    PubMed

    Veksler, Vladislav D; Myers, Christopher W; Gluck, Kevin A

    2014-04-01

    Successfully explaining and replicating the complexity and generality of human and animal learning will require the integration of a variety of learning mechanisms. Here, we introduce a computational model which integrates associative learning (AL) and reinforcement learning (RL). We contrast the integrated model with standalone AL and RL models in three simulation studies. First, a synthetic grid-navigation task is employed to highlight performance advantages for the integrated model in an environment where the reward structure is both diverse and dynamic. The second and third simulations contrast the performances of the three models in behavioral experiments, demonstrating advantages for the integrated model in accounting for behavioral data.

  8. Time representation in reinforcement learning models of the basal ganglia

    PubMed Central

    Gershman, Samuel J.; Moustafa, Ahmed A.; Ludvig, Elliot A.

    2014-01-01

    Reinforcement learning (RL) models have been influential in understanding many aspects of basal ganglia function, from reward prediction to action selection. Time plays an important role in these models, but there is still no theoretical consensus about what kind of time representation is used by the basal ganglia. We review several theoretical accounts and their supporting evidence. We then discuss the relationship between RL models and the timing mechanisms that have been attributed to the basal ganglia. We hypothesize that a single computational system may underlie both RL and interval timing—the perception of duration in the range of seconds to hours. This hypothesis, which extends earlier models by incorporating a time-sensitive action selection mechanism, may have important implications for understanding disorders like Parkinson's disease in which both decision making and timing are impaired. PMID:24409138

  9. Geographical Inquiry and Learning Reinforcement Theory.

    ERIC Educational Resources Information Center

    Davies, Christopher S.

    1983-01-01

    Although instructors have been reluctant to utilize the Keller Plan (a personalized system of instruction), it lends itself to teaching introductory geography. College students found that the routine and frequent reinforcement led to progressive learning. However, it does not lend itself to the study of reflexive or polemical concepts. (IS)

  10. Classroom Reinforcement and Learning: A Quantitative Synthesis.

    ERIC Educational Resources Information Center

    Lysakowski, Richard S.; Walberg, Herbert J.

    To estimate the influence of positive reinforcement on classroom learning, the authors analyzed statistical data from 39 studies spanning the years 1958-1978 and containing a combined sample of 4,842 students in 202 classes. Twenty-nine characteristics of each study's sample, methodology, and reliability were coded to measure their effects on…

  11. Extraversion differentiates between model-based and model-free strategies in a reinforcement learning task.

    PubMed

    Skatova, Anya; Chan, Patricia A; Daw, Nathaniel D

    2013-01-01

    Prominent computational models describe a neural mechanism for learning from reward prediction errors, and it has been suggested that variations in this mechanism are reflected in personality factors such as trait extraversion. However, although trait extraversion has been linked to improved reward learning, it is not yet known whether this relationship is selective for the particular computational strategy associated with error-driven learning, known as model-free reinforcement learning, vs. another strategy, model-based learning, which the brain is also known to employ. In the present study we test this relationship by examining whether humans' scores on an extraversion scale predict individual differences in the balance between model-based and model-free learning strategies in a sequentially structured decision task designed to distinguish between them. In previous studies with this task, participants have shown a combination of both types of learning, but with substantial individual variation in the balance between them. In the current study, extraversion predicted worse behavior across both sorts of learning. However, the hypothesis that extraverts would be selectively better at model-free reinforcement learning held up among a subset of the more engaged participants, and overall, higher task engagement was associated with a more selective pattern by which extraversion predicted better model-free learning. The findings indicate a relationship between a broad personality orientation and detailed computational learning mechanisms. Results like those in the present study suggest an intriguing and rich relationship between core neuro-computational mechanisms and broader life orientations and outcomes.

  12. Reinforcement Learning in Information Searching

    ERIC Educational Resources Information Center

    Cen, Yonghua; Gan, Liren; Bai, Chen

    2013-01-01

    Introduction: The study seeks to answer two questions: How do university students learn to use correct strategies to conduct scholarly information searches without instructions? and, What are the differences in learning mechanisms between users at different cognitive levels? Method: Two groups of users, thirteen first year undergraduate students…

  13. Can Traditions Emerge from the Interaction of Stimulus Enhancement and Reinforcement Learning? An Experimental Model.

    PubMed

    Matthews, Luke J; Paukner, Annika; Suomi, Stephen J

    2010-06-01

    The study of social learning in captivity and behavioral traditions in the wild are two burgeoning areas of research, but few empirical studies have tested how learning mechanisms produce emergent patterns of tradition. Studies have examined how social learning mechanisms that are cognitively complex and possessed by few species, such as imitation, result in traditional patterns, yet traditional patterns are also exhibited by species that may not possess such mechanisms. We propose an explicit model of how stimulus enhancement and reinforcement learning could interact to produce traditions. We tested the model experimentally with tufted capuchin monkeys (Cebus apella), which exhibit traditions in the wild but have rarely demonstrated imitative abilities in captive experiments. Monkeys showed both stimulus enhancement learning and a habitual bias to perform whichever behavior first obtained them a reward. These results support our model that simple social learning mechanisms combined with reinforcement can result in traditional patterns of behavior.

  14. Ventral tegmental area neurons in learned appetitive behavior and positive reinforcement.

    PubMed

    Fields, Howard L; Hjelmstad, Gregory O; Margolis, Elyssa B; Nicola, Saleem M

    2007-01-01

    Ventral tegmental area (VTA) neuron firing precedes behaviors elicited by reward-predictive sensory cues and scales with the magnitude and unpredictability of received rewards. These patterns are consistent with roles in the performance of learned appetitive behaviors and in positive reinforcement, respectively. The VTA includes subpopulations of neurons with different afferent connections, neurotransmitter content, and projection targets. Because the VTA and substantia nigra pars compacta are the sole sources of striatal and limbic forebrain dopamine, measurements of dopamine release and manipulations of dopamine function have provided critical evidence supporting a VTA contribution to these functions. However, the VTA also sends GABAergic and glutamatergic projections to the nucleus accumbens and prefrontal cortex. Furthermore, VTA-mediated but dopamine-independent positive reinforcement has been demonstrated. Consequently, identifying the neurotransmitter content and projection target of VTA neurons recorded in vivo will be critical for determining their contribution to learned appetitive behaviors.

  15. Prediction error in reinforcement learning: a meta-analysis of neuroimaging studies.

    PubMed

    Garrison, Jane; Erdeniz, Burak; Done, John

    2013-08-01

    Activation likelihood estimation (ALE) meta-analyses were used to examine the neural correlates of prediction error in reinforcement learning. The findings are interpreted in the light of current computational models of learning and action selection. In this context, particular consideration is given to the comparison of activation patterns from studies using instrumental and Pavlovian conditioning, and where reinforcement involved rewarding or punishing feedback. The striatum was the key brain area encoding for prediction error, with activity encompassing dorsal and ventral regions for instrumental and Pavlovian reinforcement alike, a finding which challenges the functional separation of the striatum into a dorsal 'actor' and a ventral 'critic'. Prediction error activity was further observed in diverse areas of predominantly anterior cerebral cortex including medial prefrontal cortex and anterior cingulate cortex. Distinct patterns of prediction error activity were found for studies using rewarding and aversive reinforcers; reward prediction errors were observed primarily in the striatum while aversive prediction errors were found more widely including insula and habenula.
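
    The prediction error whose neural correlates the meta-analysis aggregates is, in temporal-difference form, delta = r + gamma * V(s') - V(s): the outcome plus the discounted value of the next state, minus the current expectation. This is the textbook definition, not any single study's estimator:

```python
def td_error(r, v_next, v_current, gamma=0.95):
    """Temporal-difference prediction error."""
    return r + gamma * v_next - v_current

# Better-than-expected outcomes yield positive errors, worse-than-expected
# outcomes negative ones (the rewarding vs. aversive contrast above).
print(round(td_error(r=1.0, v_next=0.0, v_current=0.4), 2))    # positive error
print(round(td_error(r=-1.0, v_next=0.0, v_current=-0.4), 2))  # negative error
```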

  16. Functional polymorphism of the mu-opioid receptor gene (OPRM1) influences reinforcement learning in humans.

    PubMed

    Lee, Mary R; Gallen, Courtney L; Zhang, Xiaochu; Hodgkinson, Colin A; Goldman, David; Stein, Elliot A; Barr, Christina S

    2011-01-01

    Previous reports on the functional effects (i.e., gain or loss of function), and phenotypic outcomes (e.g., changes in addiction vulnerability and stress response) of a commonly occurring functional single nucleotide polymorphism (SNP) of the mu-opioid receptor (OPRM1 A118G) have been inconsistent. Here we examine the effect of this polymorphism on implicit reward learning. We used a probabilistic signal detection task to determine whether this polymorphism impacts response bias to monetary reward in 63 healthy adult subjects: 51 AA homozygotes and 12 G allele carriers. OPRM1 AA homozygotes exhibited typical responding to the rewarded response--that is, their bias to the rewarded stimulus increased over time. However, OPRM1 G allele carriers exhibited a decline in response to the rewarded stimulus compared to the AA homozygotes. These results extend previous reports on the heritability of performance on this task by implicating a specific polymorphism. Through comparison with other studies using this task, we suggest a possible mechanism by which the OPRM1 polymorphism may confer reduced response to natural reward through a dopamine-mediated decrease during positive reinforcement learning.
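
    Response bias in probabilistic signal-detection tasks of this kind is conventionally quantified as log b, where a positive value indicates a bias toward the more frequently rewarded ("rich") stimulus. The trial counts below are hypothetical, not the study's data:

```python
import math

def log_b(rich_correct, rich_incorrect, lean_correct, lean_incorrect):
    """Signal-detection response bias: log b > 0 means responding is
    biased toward the more frequently rewarded stimulus."""
    return 0.5 * math.log((rich_correct * lean_incorrect) /
                          (rich_incorrect * lean_correct))

# A participant who hits the rich stimulus more reliably than the lean one
# shows a positive bias.
print(round(log_b(80, 20, 60, 40), 2))   # 0.49
```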

  17. Pressure to cooperate: is positive reward interdependence really needed in cooperative learning?

    PubMed

    Buchs, Céline; Gilles, Ingrid; Dutrévis, Marion; Butera, Fabrizio

    2011-03-01

    BACKGROUND. Despite extensive research on cooperative learning, the debate regarding whether or not its effectiveness depends on positive reward interdependence has not yet found clear evidence. AIMS. We tested the hypothesis that positive reward interdependence, as compared to reward independence, enhances cooperative learning only if learners work on a 'routine task'; if the learners work on a 'true group task', positive reward interdependence induces the same level of learning as reward independence. SAMPLE. The study involved 62 psychology students during regular workshops. METHOD. Students worked on two psychology texts in cooperative dyads for three sessions. The type of task was manipulated through resource interdependence: students worked on either identical (routine task) or complementary (true group task) information. Students expected to be assessed with a Multiple Choice Test (MCT) on the two texts. The MCT assessment type was introduced according to two reward interdependence conditions, either individual (reward independence) or common (positive reward interdependence). A follow-up individual test took place 4 weeks after the third session of dyadic work to examine individual learning. RESULTS. The predicted interaction between the two types of interdependence was significant, indicating that students learned more with positive reward interdependence than with reward independence when they worked on identical information (routine task), whereas students who worked on complementary information (true group task) learned the same with or without reward interdependence. CONCLUSIONS. This experiment sheds light on the conditions under which positive reward interdependence enhances cooperative learning, and suggests that creating a true group task makes it possible to avoid the need for positive reward interdependence.

  18. Fuzzy Q-Learning for Generalization of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1996-01-01

    Fuzzy Q-Learning, introduced earlier by the author, is an extension of Q-Learning into fuzzy environments. GARIC is a methodology for fuzzy reinforcement learning. In this paper, we introduce GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning and at the local level, each agent learns and operates based on GARIC. GARIC-Q improves the speed and applicability of Fuzzy Q-Learning through generalization of input space by using fuzzy rules and bridges the gap between Q-Learning and rule based intelligent systems.

  19. Reinforcement learning for robot control

    NASA Astrophysics Data System (ADS)

    Smart, William D.; Pack Kaelbling, Leslie

    2002-02-01

    Writing control code for mobile robots can be a very time-consuming process. Even for apparently simple tasks, it is often difficult to specify in detail how the robot should accomplish them. Robot control code is typically full of magic numbers that must be painstakingly set for each environment that the robot must operate in. The idea of having a robot learn how to accomplish a task, rather than being told explicitly, is an appealing one. It seems easier and much more intuitive for the programmer to specify what the robot should be doing, and let it learn the fine details of how to do it. In this paper, we describe JAQL, a framework for efficient learning on mobile robots, and present the results of using it to learn control policies for simple tasks.

  20. fMRI and EEG Predictors of Dynamic Decision Parameters during Human Reinforcement Learning

    PubMed Central

    Gagne, Chris; Nyhus, Erika; Masters, Sean; Wiecki, Thomas V.; Cavanagh, James F.; Badre, David

    2015-01-01

    What are the neural dynamics of choice processes during reinforcement learning? Two largely separate literatures have examined dynamics of reinforcement learning (RL) as a function of experience but assuming a static choice process, or conversely, the dynamics of choice processes in decision making but based on static decision values. Here we show that human choice processes during RL are well described by a drift diffusion model (DDM) of decision making in which the learned trial-by-trial reward values are sequentially sampled, with a choice made when the value signal crosses a decision threshold. Moreover, simultaneous fMRI and EEG recordings revealed that this decision threshold is not fixed across trials but varies as a function of activity in the subthalamic nucleus (STN) and is further modulated by trial-by-trial measures of decision conflict and activity in the dorsomedial frontal cortex (pre-SMA BOLD and mediofrontal theta in EEG). These findings provide converging multimodal evidence for a model in which decision threshold in reward-based tasks is adjusted as a function of communication from pre-SMA to STN when choices differ subtly in reward values, allowing more time to choose the statistically more rewarding option. PMID:25589744

  1. fMRI and EEG predictors of dynamic decision parameters during human reinforcement learning.

    PubMed

    Frank, Michael J; Gagne, Chris; Nyhus, Erika; Masters, Sean; Wiecki, Thomas V; Cavanagh, James F; Badre, David

    2015-01-14

    What are the neural dynamics of choice processes during reinforcement learning? Two largely separate literatures have examined dynamics of reinforcement learning (RL) as a function of experience but assuming a static choice process, or conversely, the dynamics of choice processes in decision making but based on static decision values. Here we show that human choice processes during RL are well described by a drift diffusion model (DDM) of decision making in which the learned trial-by-trial reward values are sequentially sampled, with a choice made when the value signal crosses a decision threshold. Moreover, simultaneous fMRI and EEG recordings revealed that this decision threshold is not fixed across trials but varies as a function of activity in the subthalamic nucleus (STN) and is further modulated by trial-by-trial measures of decision conflict and activity in the dorsomedial frontal cortex (pre-SMA BOLD and mediofrontal theta in EEG). These findings provide converging multimodal evidence for a model in which decision threshold in reward-based tasks is adjusted as a function of communication from pre-SMA to STN when choices differ subtly in reward values, allowing more time to choose the statistically more rewarding option.
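
    The choice process described in this pair of records, value-based evidence accumulation to a threshold, can be simulated with a minimal drift-diffusion model; raising the threshold trades speed for accuracy, which is the adjustment the STN account attributes to decision conflict. This is a generic sketch with illustrative parameters, not the authors' hierarchical DDM fit:

```python
import random

def simulate_ddm(drift, threshold, rng, noise=1.0, dt=0.01):
    """Accumulate noisy evidence until it crosses the +/- threshold.
    Drift stands in for the learned value difference between options."""
    x, t = 0.0, 0.0
    while abs(x) < threshold:
        x += drift * dt + noise * (dt ** 0.5) * rng.gauss(0, 1)
        t += dt
    return x > 0, t  # (chose the higher-valued option?, decision time)

rng = random.Random(3)
stats = {}
for threshold in (0.5, 1.5):  # low vs. high decision threshold
    trials = [simulate_ddm(1.0, threshold, rng) for _ in range(500)]
    acc = sum(choice for choice, _ in trials) / len(trials)
    rt = sum(t for _, t in trials) / len(trials)
    stats[threshold] = (acc, rt)
    print(f"threshold={threshold}: accuracy={acc:.2f}, mean RT={rt:.2f}s")
```

A higher threshold yields slower but more accurate choices, mirroring the claim that threshold adjustment allows more time to pick the statistically better option.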

  2. Reinforcement learning modulates the stability of cognitive control settings for object selection

    PubMed Central

    Sali, Anthony W.; Anderson, Brian A.; Yantis, Steven

    2013-01-01

    Cognitive flexibility reflects both a trait that reliably differs between individuals and a state that can fluctuate moment-to-moment. Whether individuals can undergo persistent changes in cognitive flexibility as a result of reward learning is less understood. Here, we investigated whether reinforcing a periodic shift in an object selection strategy can make an individual more prone to switch strategies in a subsequent unrelated task. Participants completed two different choice tasks in which they selected one of four objects in an attempt to obtain a hidden reward on each trial. During a training phase, objects were defined by color. Participants received either consistent reward contingencies in which one color was more often rewarded, or contingencies in which the color that was more often rewarded changed periodically and without warning. Following the training phase, all participants completed a test phase in which reward contingencies were defined by spatial location and the location that was more often rewarded remained constant across the entire task. Those participants who received inconsistent contingencies during training continued to make more variable selections during the test phase in comparison to those who received the consistent training. Furthermore, a difference in the likelihood to switch selections on a trial-by-trial basis emerged between training groups: participants who received consistent contingencies during training were less likely to switch object selections following an unrewarded trial and more likely to repeat a selection following reward. Our findings provide evidence that the extent to which priority shifting is reinforced modulates the stability of cognitive control settings in a persistent manner, such that individuals become generally more or less prone to shifting priorities in the future. PMID:24391557

  3. Autistic Traits Moderate the Impact of Reward Learning on Social Behaviour

    PubMed Central

    Panasiti, Maria Serena; Puzzo, Ignazio

    2015-01-01

    A deficit in empathy has been suggested to underlie social behavioural atypicalities in autism. A parallel theoretical account proposes that reduced social motivation (i.e., low responsivity to social rewards) can account for the said atypicalities. Recent evidence suggests that autistic traits modulate the link between reward and proxy metrics related to empathy. Using an evaluative conditioning paradigm to associate high and low rewards with faces, a previous study has shown that individuals high in autistic traits show reduced spontaneous facial mimicry of faces associated with high vs. low reward. This observation raises the possibility that autistic traits modulate the magnitude of evaluative conditioning. To test this, we investigated (a) if autistic traits could modulate the ability to implicitly associate a reward value to a social stimulus (reward learning/conditioning, using the Implicit Association Task, IAT); (b) if the learned association could modulate participants’ prosocial behaviour (i.e., social reciprocity, measured using the cyberball task); (c) if the strength of this modulation was influenced by autistic traits. In 43 neurotypical participants, we found that autistic traits moderated the relationship of social reward learning on prosocial behaviour but not reward learning itself. This evidence suggests that while autistic traits do not directly influence social reward learning, they modulate the relationship of social rewards with prosocial behaviour. Autism Res 2016, 9: 471–479. © 2015 The Authors Autism Research published by Wiley Periodicals, Inc. on behalf of International Society for Autism Research PMID:26280134

  4. Autistic Traits Moderate the Impact of Reward Learning on Social Behaviour.

    PubMed

    Panasiti, Maria Serena; Puzzo, Ignazio; Chakrabarti, Bhismadev

    2016-04-01

    A deficit in empathy has been suggested to underlie social behavioural atypicalities in autism. A parallel theoretical account proposes that reduced social motivation (i.e., low responsivity to social rewards) can account for the said atypicalities. Recent evidence suggests that autistic traits modulate the link between reward and proxy metrics related to empathy. Using an evaluative conditioning paradigm to associate high and low rewards with faces, a previous study has shown that individuals high in autistic traits show reduced spontaneous facial mimicry of faces associated with high vs. low reward. This observation raises the possibility that autistic traits modulate the magnitude of evaluative conditioning. To test this, we investigated (a) if autistic traits could modulate the ability to implicitly associate a reward value to a social stimulus (reward learning/conditioning, using the Implicit Association Task, IAT); (b) if the learned association could modulate participants' prosocial behaviour (i.e., social reciprocity, measured using the cyberball task); (c) if the strength of this modulation was influenced by autistic traits. In 43 neurotypical participants, we found that autistic traits moderated the effect of social reward learning on prosocial behaviour, but not reward learning itself. This evidence suggests that while autistic traits do not directly influence social reward learning, they modulate the relationship between social rewards and prosocial behaviour.

  5. Reinforcement learning based artificial immune classifier.

    PubMed

    Karakose, Mehmet

    2013-01-01

    Artificial immune systems are among the widely used methods for classification, a decision-making process. Artificial immune systems, based on the natural immune system, can be successfully applied to classification, optimization, recognition, and learning in real-world problems. In this study, a reinforcement learning based artificial immune classifier is proposed as a new approach. This approach uses reinforcement learning together with immune operators to find better antibodies. Compared with other methods in the literature, the proposed approach offers several advantages, such as effectiveness, fewer memory cells, high accuracy, speed, and data adaptability. The performance of the proposed approach is demonstrated by simulation and experimental results using real data in Matlab and FPGA. Some benchmark data and remote image data are used for the experimental results. Comparative results with supervised/unsupervised artificial immune systems, a negative selection classifier, and a resource-limited artificial immune classifier are given to demonstrate the effectiveness of the proposed new method.

  6. Reinforcement Learning Based Artificial Immune Classifier

    PubMed Central

    Karakose, Mehmet

    2013-01-01

    Artificial immune systems are among the widely used methods for classification, a decision-making process. Artificial immune systems, based on the natural immune system, can be successfully applied to classification, optimization, recognition, and learning in real-world problems. In this study, a reinforcement learning based artificial immune classifier is proposed as a new approach. This approach uses reinforcement learning together with immune operators to find better antibodies. Compared with other methods in the literature, the proposed approach offers several advantages, such as effectiveness, fewer memory cells, high accuracy, speed, and data adaptability. The performance of the proposed approach is demonstrated by simulation and experimental results using real data in Matlab and FPGA. Some benchmark data and remote image data are used for the experimental results. Comparative results with supervised/unsupervised artificial immune systems, a negative selection classifier, and a resource-limited artificial immune classifier are given to demonstrate the effectiveness of the proposed new method. PMID:23935424

  7. Neural correlates of reinforcement learning and social preferences in competitive bidding.

    PubMed

    van den Bos, Wouter; Talwar, Arjun; McClure, Samuel M

    2013-01-30

    In competitive social environments, people often deviate from what rational choice theory prescribes, resulting in losses or suboptimal monetary gains. We investigate how competition affects learning and decision-making in a common value auction task. During the experiment, groups of five human participants were simultaneously scanned using MRI while playing the auction task. We first demonstrate that bidding is well characterized by reinforcement learning with biased reward representations dependent on social preferences. Indicative of reinforcement learning, we found that estimated trial-by-trial prediction errors correlated with activity in the striatum and ventromedial prefrontal cortex. Additionally, we found that individual differences in social preferences were related to activity in the temporal-parietal junction and anterior insula. Connectivity analyses suggest that monetary and social value signals are integrated in the ventromedial prefrontal cortex and striatum. Based on these results, we argue for a novel mechanistic account for the integration of reinforcement history and social preferences in competitive decision-making.

  8. Reinforcement learning models and their neural correlates: An activation likelihood estimation meta-analysis.

    PubMed

    Chase, Henry W; Kumar, Poornima; Eickhoff, Simon B; Dombrovski, Alexandre Y

    2015-06-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments (prediction error) is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies have suggested that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that had employed algorithmic reinforcement learning models across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, whereas instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies.
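
The prediction-error signal at the heart of these models has a simple algebraic core. As a minimal sketch (function and parameter names are illustrative, not taken from any study in the meta-analysis), the expected value is nudged toward each outcome by a fraction of the discrepancy between actual and expected reward:

```python
def update_value(v, reward, alpha=0.1):
    """One delta-rule update; returns the new expected value and the prediction error."""
    delta = reward - v              # prediction error: actual minus expected reward
    return v + alpha * delta, delta

v = 0.0
for _ in range(200):                # a stimulus repeatedly followed by reward 1.0
    v, delta = update_value(v, 1.0)
print(round(v, 3))                  # expected value converges toward 1.0
```

Under this delta rule, repeated rewards drive the expected value toward the mean outcome, and the shrinking prediction error is the trial-by-trial quantity that model-based fMRI studies regress against striatal activity.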

  9. Reinforcement Learning Models and Their Neural Correlates: An Activation Likelihood Estimation Meta-Analysis

    PubMed Central

    Kumar, Poornima; Eickhoff, Simon B.; Dombrovski, Alexandre Y.

    2015-01-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments – prediction error – is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies suggest that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that employed algorithmic reinforcement learning models, across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, while instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually-estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies. PMID:25665667

  10. Implication of dopaminergic modulation in operant reward learning and the induction of compulsive-like feeding behavior in Aplysia.

    PubMed

    Bédécarrats, Alexis; Cornet, Charles; Simmers, John; Nargeot, Romuald

    2013-05-16

    Feeding in Aplysia provides an amenable model system for analyzing the neuronal substrates of motivated behavior and its adaptability by associative reward learning and neuromodulation. Among such learning processes, appetitive operant conditioning that leads to a compulsive-like expression of feeding actions is known to be associated with changes in the membrane properties and electrical coupling of essential action-initiating B63 neurons in the buccal central pattern generator (CPG). Moreover, the food-reward signal for this learning is conveyed in the esophageal nerve (En), an input nerve rich in dopamine-containing fibers. Here, to investigate whether dopamine (DA) is involved in this learning-induced plasticity, we used an in vitro analog of operant conditioning in which electrical stimulation of En substituted for the contingent reinforcement of biting movements in vivo. Our data indicate that contingent En stimulation does, indeed, replicate the operant learning-induced changes in CPG output and the underlying membrane and synaptic properties of B63. Significantly, moreover, this network and cellular plasticity was blocked when the input nerve was stimulated in the presence of the DA receptor antagonist cis-flupenthixol. These results therefore suggest that En-derived dopaminergic modulation of CPG circuitry contributes to the operant reward-dependent emergence of a compulsive-like expression of Aplysia's feeding behavior.

  11. Reinforcement learning, spike-time-dependent plasticity, and the BCM rule.

    PubMed

    Baras, Dorit; Meir, Ron

    2007-08-01

    Learning agents, whether natural or artificial, must update their internal parameters in order to improve their behavior over time. In reinforcement learning, this plasticity is influenced by an environmental signal, termed a reward, that directs the changes in appropriate directions. We apply a recently introduced policy learning algorithm from machine learning to networks of spiking neurons and derive a spike-time-dependent plasticity rule that ensures convergence to a local optimum of the expected average reward. The approach is applicable to a broad class of neuronal models, including the Hodgkin-Huxley model. We demonstrate the effectiveness of the derived rule in several toy problems. Finally, through statistical analysis, we show that the synaptic plasticity rule established is closely related to the widely used BCM rule, for which good biological evidence exists.
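
The class of rule derived in such work can be caricatured in a few lines: a synaptic eligibility trace accumulates correlations between pre- and postsynaptic activity, and a global reward signal converts the trace into a weight change. The dynamics and constants below are illustrative assumptions, not the paper's derivation:

```python
import math

def step(w, trace, pre, post, reward, eta=0.05, tau=5.0):
    """One time step: decay the eligibility trace, add the pre*post
    correlation, then let the global reward signal gate the weight change."""
    trace = trace * math.exp(-1.0 / tau) + pre * post
    w = w + eta * reward * trace
    return w, trace

# Correlated pre/post activity followed by a delayed reward strengthens the synapse.
w, trace = 0.5, 0.0
for t in range(20):
    reward = 1.0 if t >= 10 else 0.0    # reward arrives only in the second half
    w, trace = step(w, trace, pre=1.0, post=1.0, reward=reward)
print(w > 0.5)
```

Without reward the trace accumulates but the weight never moves, which is the key property distinguishing reward-modulated plasticity from plain Hebbian learning.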

  12. The effects of aging on the interaction between reinforcement learning and attention.

    PubMed

    Radulescu, Angela; Daniel, Reka; Niv, Yael

    2016-11-01

    Reinforcement learning (RL) in complex environments relies on selective attention to uncover those aspects of the environment that are most predictive of reward. Whereas previous work has focused on age-related changes in RL, it is not known whether older adults learn differently from younger adults when selective attention is required. In 2 experiments, we examined how aging affects the interaction between RL and selective attention. Younger and older adults performed a learning task in which only 1 stimulus dimension was relevant to predicting reward, and within it, 1 "target" feature was the most rewarding. Participants had to discover this target feature through trial and error. In Experiment 1, stimuli varied on 1 or 3 dimensions and participants received hints that revealed the target feature, the relevant dimension, or gave no information. Group-related differences in accuracy and RTs differed systematically as a function of the number of dimensions and the type of hint available. In Experiment 2 we used trial-by-trial computational modeling of the learning process to test for age-related differences in learning strategies. Behavior of both young and older adults was explained well by a reinforcement-learning model that uses selective attention to constrain learning. However, the model suggested that older adults restricted their learning to fewer features, employing more focused attention than younger adults. Furthermore, this difference in strategy predicted age-related deficits in accuracy. We discuss these results suggesting that a narrower filter of attention may reflect an adaptation to the reduced capabilities of the reinforcement learning system.
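
One way to see how attention constrains learning in such models: give each feature its own value, score a stimulus by an attention-weighted sum, and scale each feature's update by how much it is attended. All names and parameters below are illustrative assumptions, not the authors' fitted model:

```python
def stimulus_value(features, values, attention):
    """Value of a multi-feature stimulus: attention-weighted sum of feature values."""
    return sum(attention[f] * values[f] for f in features)

def update(features, values, attention, reward, alpha=0.3):
    """Delta-rule update, scaled per feature by how strongly it is attended."""
    delta = reward - stimulus_value(features, values, attention)
    for f in features:
        values[f] += alpha * attention[f] * delta
    return delta

values = {"red": 0.0, "square": 0.0}
attention = {"red": 0.9, "square": 0.1}     # narrow, focused attention on one feature
for _ in range(100):
    update(["red", "square"], values, attention, reward=1.0)

print(values["red"] > values["square"])     # the attended feature absorbs the credit
```

With narrow attention, credit concentrates on the attended feature while the rest barely update, mirroring the more focused learning strategy the model attributed to older adults.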

  13. Reward-based learning of a redundant task.

    PubMed

    Tamagnone, Irene; Casadio, Maura; Sanguineti, Vittorio

    2013-06-01

    Motor skill learning has different components. When we acquire a new motor skill we have both to learn a reliable action-value map to select a highly rewarded action (task model) and to develop an internal representation of the novel dynamics of the task environment, in order to execute properly the action previously selected (internal model). Here we focus on a 'pure' motor skill learning task, in which adaptation to a novel dynamical environment is negligible and the problem is reduced to the acquisition of an action-value map, based only on knowledge of results. Subjects performed point-to-point movements, in which start and target positions were fixed and visible, but the score provided at the end of the movement depended on the distance of the trajectory from a hidden via-point. Subjects had no clues about the correct movement other than the score value. The task is highly redundant, as infinite trajectories are compatible with the maximum score. Our aim was to capture the strategies subjects use in the exploration of the task space and in the exploitation of the task redundancy during learning. The main findings were that (i) subjects did not converge to a unique solution; rather, their final trajectories were determined by the subject-specific history of exploration; and (ii) with learning, subjects reduced the trajectory's overall variability, but the point of minimum variability gradually shifted toward the portion of the trajectory closer to the hidden via-point.
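
A toy version makes the score-only learning concrete. Here the "trajectory" is reduced to a single intermediate point, the score decays with distance from a hidden via-point, and the learner simply keeps any random perturbation that improves the score. Everything below is an illustrative assumption, not the authors' task code:

```python
import random

HIDDEN = (0.3, 0.7)                          # hidden via-point, unknown to the learner

def score(point):
    """Knowledge of results: score rises as the point nears the hidden via-point."""
    d = ((point[0] - HIDDEN[0]) ** 2 + (point[1] - HIDDEN[1]) ** 2) ** 0.5
    return max(0.0, 1.0 - d)

random.seed(0)
point = (0.5, 0.5)                           # initial guess
best = score(point)
for _ in range(2000):                        # explore by random perturbation
    trial = (point[0] + random.gauss(0, 0.05),
             point[1] + random.gauss(0, 0.05))
    if score(trial) > best:                  # exploit: keep only improvements
        point, best = trial, score(trial)
```

Because only accepted perturbations shape the path, where such a greedy learner ends up depends on its particular history of exploration, echoing the paper's first finding.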

  14. Multi-Agent Reinforcement Learning and Adaptive Neural Networks.

    DTIC Science & Technology

    2007-11-02

    learning method. The objective was to study the utility of reinforcement learning as an approach to complex decentralized control problems. The major accomplishment was a detailed study of multi-agent reinforcement learning applied to a large-scale decentralized stochastic control problem. This study included a very successful demonstration that a multi-agent reinforcement learning system using neural networks could learn high-performance

  15. Effects of Cooperative versus Individual Study on Learning and Motivation after Reward-Removal

    ERIC Educational Resources Information Center

    Sears, David A.; Pai, Hui-Hua

    2012-01-01

    Rewards are frequently used in classrooms and recommended as a key component of well-researched methods of cooperative learning (e.g., Slavin, 1995). While many studies of cooperative learning find beneficial effects of rewards, many studies of individuals find negative effects (e.g., Deci, Koestner, & Ryan, 1999; Lepper, 1988). This may be…

  16. Analysis of Reward Functions in Learning: Unconscious Information Processing: Noncognitive Determinants of Response Strength

    DTIC Science & Technology

    1984-05-01

    Research Note 84-76. Analysis of Reward Functions in Learning: Unconscious Information Processing: Noncognitive Determinants of Response Strength. Melvin H. Marx, University of Missouri, Columbia; David W. Bessemer, Contracting Officer's Representative. Submitted by Robert M. Sasmor, Director, Basic Research. Report type & period covered: Final Report, Sept. 1978 - Sept. 15

  17. Curiosity and reward: Valence predicts choice and information prediction errors enhance learning.

    PubMed

    Marvin, Caroline B; Shohamy, Daphna

    2016-03-01

    Curiosity drives many of our daily pursuits and interactions; yet, we know surprisingly little about how it works. Here, we harness an idea implied in many conceptualizations of curiosity: that information has value in and of itself. Reframing curiosity as the motivation to obtain reward, where the reward is information, allows one to leverage major advances in theoretical and computational mechanisms of reward-motivated learning. We provide new evidence supporting 2 predictions that emerge from this framework. First, we find an asymmetric effect of positive versus negative information, with positive information enhancing both curiosity and long-term memory for information. Second, we find that it is not the absolute value of information that drives learning but, rather, the gap between the reward expected and reward received, an "information prediction error." These results support the idea that information functions as a reward, much like money or food, guiding choices and driving learning in systematic ways.

  18. The cerebellum: a neural system for the study of reinforcement learning.

    PubMed

    Swain, Rodney A; Kerr, Abigail L; Thompson, Richard F

    2011-01-01

    In its strictest application, the term "reinforcement learning" refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Given that Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated, and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed), it follows that any neural system that supports reinforcement learning must also be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning.

  19. Post-learning hippocampal dynamics promote preferential retention of rewarding events

    PubMed Central

    Gruber, Matthias J.; Ritchey, Maureen; Wang, Shao-Fang; Doss, Manoj K.; Ranganath, Charan

    2016-01-01

    Reward motivation is known to modulate memory encoding, and this effect depends on interactions between the substantia nigra/ventral tegmental area complex (SN/VTA) and the hippocampus. It is unknown, however, whether these interactions influence offline neural activity in the human brain that is thought to promote memory consolidation. Here, we used functional magnetic resonance imaging (fMRI) to test the effect of reward motivation on post-learning neural dynamics and subsequent memory for objects that were learned in high- or low-reward motivation contexts. We found that post-learning increases in resting-state functional connectivity between the SN/VTA and hippocampus predicted preferential retention of objects that were learned in high-reward contexts. In addition, multivariate pattern classification revealed that hippocampal representations of high-reward contexts were preferentially reactivated during post-learning rest, and the number of hippocampal reactivations was predictive of preferential retention of items learned in high-reward contexts. These findings indicate that reward motivation alters offline post-learning dynamics between the SN/VTA and hippocampus, providing novel evidence for a potential mechanism by which reward could influence memory consolidation. PMID:26875624

  20. The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces.

    PubMed

    Huertas, Marco A; Schwettmann, Sarah E; Shouval, Harel Z

    2016-01-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different from those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces.
What could be the role of different neuromodulators for
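
The competition between the two traces can be sketched with two exponentials of different amplitude and time constant: whichever trace dominates at the moment the neuromodulatory signal arrives determines the sign of the weight change. The constants below are illustrative, not the model's fitted values:

```python
import math

def net_change(dt, a_ltp=1.0, tau_ltp=2.0, a_ltd=0.6, tau_ltd=6.0):
    """Net plasticity when the reinforcing signal arrives dt after the
    traces were set: a fast, large LTP trace competes with a slow,
    smaller LTD trace."""
    ltp = a_ltp * math.exp(-dt / tau_ltp)
    ltd = a_ltd * math.exp(-dt / tau_ltd)
    return ltp - ltd

print(net_change(0.0) > 0)   # early reinforcement: the LTP trace wins
print(net_change(6.0) < 0)   # late reinforcement: the slower LTD trace wins
```

Because the difference of the two traces crosses zero at an intermediate delay, the same reinforcing signal can potentiate, depress, or leave a synapse unchanged, which is the competition the title refers to.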

  1. The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces

    PubMed Central

    Huertas, Marco A.; Schwettmann, Sarah E.; Shouval, Harel Z.

    2016-01-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different from those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces.
What could be the role of different neuromodulators for

  2. Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity

    PubMed Central

    Warlaumont, Anne S.; Finnegan, Megan K.

    2016-01-01

    At around 7 months of age, human infants begin to reliably produce well-formed syllables containing both consonants and vowels, a behavior called canonical babbling. Over subsequent months, the frequency of canonical babbling continues to increase. How the infant’s nervous system supports the acquisition of this ability is unknown. Here we present a computational model that combines a spiking neural network, reinforcement-modulated spike-timing-dependent plasticity, and a human-like vocal tract to simulate the acquisition of canonical babbling. Like human infants, the model’s frequency of canonical babbling gradually increases. The model is rewarded when it produces a sound that is more auditorily salient than sounds it has previously produced. This is consistent with data from human infants indicating that contingent adult responses shape infant behavior and with data from deaf and tracheostomized infants indicating that hearing, including hearing one’s own vocalizations, is critical for canonical babbling development. Reward receipt increases the level of dopamine in the neural network. The neural network contains a reservoir with recurrent connections and two motor neuron groups, one agonist and one antagonist, which control the masseter and orbicularis oris muscles, promoting or inhibiting mouth closure. The model learns to increase the number of salient, syllabic sounds it produces by adjusting the base level of muscle activation and increasing their range of activity. Our results support the possibility that through dopamine-modulated spike-timing-dependent plasticity, the motor cortex learns to harness its natural oscillations in activity in order to produce syllabic sounds. It thus suggests that learning to produce rhythmic mouth movements for speech production may be supported by general cortical learning mechanisms. The model makes several testable predictions and has implications for our understanding not only of how syllabic vocalizations develop

  3. FMRQ-A Multiagent Reinforcement Learning Algorithm for Fully Cooperative Tasks.

    PubMed

    Zhang, Zhen; Zhao, Dongbin; Gao, Junwei; Wang, Dongqing; Dai, Yujie

    2016-04-14

    In this paper, we propose a multiagent reinforcement learning algorithm dealing with fully cooperative tasks. The algorithm is called frequency of the maximum reward Q-learning (FMRQ). FMRQ aims to achieve one of the optimal Nash equilibria so as to optimize the performance index in multiagent systems. The frequency of obtaining the highest global immediate reward, instead of the immediate reward itself, is used as the reinforcement signal. With FMRQ, each agent does not need to observe the other agents' actions and only shares its state and reward at each step. We validate FMRQ through case studies of repeated games: four cases of two-player two-action and one case of three-player two-action. It is demonstrated that FMRQ can converge to one of the optimal Nash equilibria in these cases. Moreover, comparison experiments on tasks with multiple states and finite steps are conducted: one is box-pushing and the other is a distributed sensor network problem. Experimental results show that the proposed algorithm outperforms the others.
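
The abstract's key substitution (frequency of the maximum reward in place of the raw reward) can be sketched as simple per-action bookkeeping; the demo reward values and names below are assumptions for illustration, not the paper's algorithm in full:

```python
from collections import defaultdict

R_MAX = 10                               # assumed highest global reward in the game

counts = defaultdict(int)                # times each action was played
hits = defaultdict(int)                  # times it produced the maximum reward

def reinforce(action, reward):
    """Record one play; count a hit only when the global maximum was obtained."""
    counts[action] += 1
    if reward == R_MAX:
        hits[action] += 1

def value(action):
    """Frequency of the maximum reward, used as the reinforcement signal."""
    return hits[action] / counts[action] if counts[action] else 0.0

for r in (10, 10, 10, 2, 10):            # action 'a': max reward on 4 of 5 plays
    reinforce("a", r)
for r in (2, 10, 2, 2, 2):               # action 'b': max reward on 1 of 5 plays
    reinforce("b", r)

print(value("a"), value("b"))            # 0.8 0.2
```

Ranking actions by how often they co-occur with the global maximum, rather than by average payoff, is what lets each agent coordinate toward an optimal joint action without observing the others' choices.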

  4. Embedded Incremental Feature Selection for Reinforcement Learning

    DTIC Science & Technology

    2012-05-01

    policy by a problem-specific fitness function. The composition of the selected subset in terms of the fraction of relevant features among selected ... features. In Figure 4b we see the composition of the selected subsets by the three algorithms. IFSE-NEAT clearly has the highest percentage of relevant ... Kroon, M. and Whiteson, S. (2009). Automatic feature selection for model-based reinforcement learning in factored MDPs. In Proceedings of the

  5. Optimal chaos control through reinforcement learning.

    PubMed

    Gadaleta, Sabino; Dangelmayr, Gerhard

    1999-09-01

    A general purpose chaos control algorithm based on reinforcement learning is introduced and applied to the stabilization of unstable periodic orbits in various chaotic systems and to the targeting problem. The algorithm does not require any information about the dynamical system nor about the location of periodic orbits. Numerical tests demonstrate good and fast performance under noisy and nonstationary conditions. (c) 1999 American Institute of Physics.

  6. The Function of Direct and Vicarious Reinforcement in Human Learning.

    ERIC Educational Resources Information Center

    Owens, Carl R.; And Others

    The role of reinforcement has long been an issue in learning theory. The effects of reinforcement in learning were investigated under circumstances which made the information necessary for correct performance equally available to reinforced and nonreinforced subjects. Fourth graders (N=36) were given a pre-test of 20 items from the Peabody Picture…

  7. The Effects of Verbal and Material Rewards and Punishers on the Performance of Impulsive and Reflective Children

    ERIC Educational Resources Information Center

    Firestone, Philip; Douglas, Virginia I.

    1977-01-01

    Impulsive and reflective children performed in a discrimination learning task which included four reinforcement conditions: verbal-reward, verbal-punishment, material-reward, and material-punishment. (SB)

  8. How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis.

    PubMed

    Collins, Anne G E; Frank, Michael J

    2012-04-01

    Instrumental learning involves corticostriatal circuitry and the dopaminergic system. This system is typically modeled in the reinforcement learning (RL) framework by incrementally accumulating reward values of states and actions. However, human learning also implicates prefrontal cortical mechanisms involved in higher level cognitive functions. The interaction of these systems remains poorly understood, and models of human behavior often ignore working memory (WM) and therefore incorrectly assign behavioral variance to the RL system. Here we designed a task that highlights the profound entanglement of these two processes, even in simple learning problems. By systematically varying the size of the learning problem and delay between stimulus repetitions, we separately extracted WM-specific effects of load and delay on learning. We propose a new computational model that accounts for the dynamic integration of RL and WM processes observed in subjects' behavior. Incorporating capacity-limited WM into the model allowed us to capture behavioral variance that could not be captured in a pure RL framework even if we (implausibly) allowed separate RL systems for each set size. The WM component also allowed for a more reasonable estimation of a single RL process. Finally, we report effects of two genetic polymorphisms having relative specificity for prefrontal and basal ganglia functions. Whereas the COMT gene coding for catechol-O-methyl transferase selectively influenced model estimates of WM capacity, the GPR6 gene coding for G-protein-coupled receptor 6 influenced the RL learning rate. Thus, this study allowed us to specify distinct influences of the high-level and low-level cognitive functions on instrumental learning, beyond the possibilities offered by simple RL models.
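    The dynamic integration of RL and WM described above can be sketched as a weighted mixture of two softmax policies, with the WM policy's weight limited by capacity relative to set size. The functional form, parameter names, and values below are illustrative assumptions in the spirit of the abstract, not the published model.

```python
import numpy as np

def rlwm_choice_probs(Q, WM, set_size, capacity=3, rho=0.9, beta=8.0):
    """Mix a slow, incremental RL policy with a fast but capacity-limited
    WM policy. The WM weight shrinks once the set size exceeds capacity,
    so behavior in large sets leans more heavily on RL."""
    w = rho * min(1.0, capacity / set_size)        # WM reliability weight
    def softmax(v):
        e = np.exp(beta * (v - v.max()))           # numerically stable softmax
        return e / e.sum()
    return w * softmax(WM) + (1.0 - w) * softmax(Q)
```

    With set sizes at or below capacity the mixture is dominated by WM; as set size grows, the same stimulus-action values are increasingly read out from the slower RL component.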

  9. Bi-Directional Effect of Increasing Doses of Baclofen on Reinforcement Learning

    PubMed Central

    Terrier, Jean; Ort, Andres; Yvon, Cédric; Saj, Arnaud; Vuilleumier, Patrik; Lüscher, Christian

    2011-01-01

    In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. It has been shown that in brain slices of mice, GABAB-receptor agonists at low concentrations increase the firing frequency of VTA–DA neurons, while high concentrations reduce the firing frequency. It remains however elusive whether baclofen can modulate reinforcement learning in humans. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high concentration of oral baclofen, a high affinity GABAB-receptor agonist, in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen on the other hand did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels and thus facilitates reinforcement learning. PMID:21811448

  10. Bi-directional effect of increasing doses of baclofen on reinforcement learning.

    PubMed

    Terrier, Jean; Ort, Andres; Yvon, Cédric; Saj, Arnaud; Vuilleumier, Patrik; Lüscher, Christian

    2011-01-01

    In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. It has been shown that in brain slices of mice, GABA(B)-receptor agonists at low concentrations increase the firing frequency of VTA-DA neurons, while high concentrations reduce the firing frequency. It remains however elusive whether baclofen can modulate reinforcement learning in humans. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high concentration of oral baclofen, a high affinity GABA(B)-receptor agonist, in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen on the other hand did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels and thus facilitates reinforcement learning.

  11. Reinforcement learning and dopamine in schizophrenia: dimensions of symptoms or specific features of a disease group?

    PubMed

    Deserno, Lorenz; Boehme, Rebecca; Heinz, Andreas; Schlagenhauf, Florian

    2013-12-23

    Abnormalities in reinforcement learning are a key finding in schizophrenia and have been proposed to be linked to elevated levels of dopamine neurotransmission. Behavioral deficits in reinforcement learning and their neural correlates may contribute to the formation of clinical characteristics of schizophrenia. The ability to form predictions about future outcomes is fundamental for environmental interactions and depends on neuronal teaching signals, like reward prediction errors. While aberrant prediction errors, which encode non-salient events as surprising, have been proposed to contribute to the formation of positive symptoms, a failure to build neural representations of decision values may result in negative symptoms. Here, we review behavioral and neuroimaging research in schizophrenia and focus on studies that implemented reinforcement learning models. In addition, we discuss studies that combined reinforcement learning with measures of dopamine. Thereby, we suggest how reinforcement learning abnormalities in schizophrenia may contribute to the formation of psychotic symptoms and may interact with cognitive deficits. These ideas point toward an interplay of more rigid versus flexible control over reinforcement learning. Pronounced deficits in the flexible or model-based domain may allow for a detailed characterization of well-established cognitive deficits in schizophrenia patients based on computational models of learning. Finally, we propose a framework based on the potentially crucial contribution of dopamine to dysfunctional reinforcement learning on the level of neural networks. Future research may strongly benefit from computational modeling but also requires further methodological improvement for clinical group studies. These research tools may help to improve our understanding of disease-specific mechanisms and may help to identify clinically relevant subgroups of the heterogeneous entity schizophrenia.

  12. Reinforcement Learning and Dopamine in Schizophrenia: Dimensions of Symptoms or Specific Features of a Disease Group?

    PubMed Central

    Deserno, Lorenz; Boehme, Rebecca; Heinz, Andreas; Schlagenhauf, Florian

    2013-01-01

    Abnormalities in reinforcement learning are a key finding in schizophrenia and have been proposed to be linked to elevated levels of dopamine neurotransmission. Behavioral deficits in reinforcement learning and their neural correlates may contribute to the formation of clinical characteristics of schizophrenia. The ability to form predictions about future outcomes is fundamental for environmental interactions and depends on neuronal teaching signals, like reward prediction errors. While aberrant prediction errors, which encode non-salient events as surprising, have been proposed to contribute to the formation of positive symptoms, a failure to build neural representations of decision values may result in negative symptoms. Here, we review behavioral and neuroimaging research in schizophrenia and focus on studies that implemented reinforcement learning models. In addition, we discuss studies that combined reinforcement learning with measures of dopamine. Thereby, we suggest how reinforcement learning abnormalities in schizophrenia may contribute to the formation of psychotic symptoms and may interact with cognitive deficits. These ideas point toward an interplay of more rigid versus flexible control over reinforcement learning. Pronounced deficits in the flexible or model-based domain may allow for a detailed characterization of well-established cognitive deficits in schizophrenia patients based on computational models of learning. Finally, we propose a framework based on the potentially crucial contribution of dopamine to dysfunctional reinforcement learning on the level of neural networks. Future research may strongly benefit from computational modeling but also requires further methodological improvement for clinical group studies. These research tools may help to improve our understanding of disease-specific mechanisms and may help to identify clinically relevant subgroups of the heterogeneous entity schizophrenia. PMID:24391603

  13. Rewards versus Learning: A Response to Paul Chance.

    ERIC Educational Resources Information Center

    Kohn, Alfie

    1993-01-01

    Responding to Paul Chance's November 1992 "Kappan" article on motivational value of rewards, this article argues that manipulating student behavior with either punishments or rewards is unnecessary and counterproductive. Extrinsic rewards can never buy more than short-term compliance because they are inherently controlling and…

  14. Excitotoxic lesions of the medial striatum delay extinction of a reinforcement color discrimination operant task in domestic chicks; a functional role of reward anticipation.

    PubMed

    Ichikawa, Yoko; Izawa, Ei-Ichi; Matsushima, Toshiya

    2004-12-01

    To reveal the functional roles of the striatum, we examined the effects of excitotoxic lesions to the bilateral medial striatum (mSt) and nucleus accumbens (Ac) in a food reinforcement color discrimination operant task. With a food reward as reinforcement, 1-week-old domestic chicks were trained to peck selectively at red and yellow beads (S+) and not to peck at a blue bead (S-). Those chicks then received either lesions or sham operations and were tested in extinction training sessions, during which yellow turned out to be nonrewarding (S-), whereas red and blue remained unchanged. To further examine the effects on postoperant noninstrumental aspects of behavior, we also measured the "waiting time", during which chicks stayed at the empty feeder after pecking at yellow. Although the lesioned chicks showed significantly higher error rates in the nonrewarding yellow trials, their postoperant waiting time gradually decreased similarly to the sham controls. Furthermore, the lesioned chicks waited significantly longer than the controls, even from the first extinction block. In the blue trials, both lesioned and sham chicks consistently refrained from pecking, indicating that the delayed extinction was not due to a general disinhibition of pecking. Similarly, no effects were found in the novel training sessions, suggesting that the lesions had selective effects on the extinction of a learned operant. These results suggest that a neural representation of memory-based reward anticipation in the mSt/Ac could contribute to the anticipation error required for extinction.

  15. Identifying Cognitive Remediation Change Through Computational Modelling—Effects on Reinforcement Learning in Schizophrenia

    PubMed Central

    Cella, Matteo; Bishara, Anthony J.; Medin, Evelina; Swan, Sarah; Reeder, Clare; Wykes, Til

    2014-01-01

    Objective: Converging research suggests that individuals with schizophrenia show a marked impairment in reinforcement learning, particularly in tasks requiring flexibility and adaptation. The problem has been associated with dopamine reward systems. This study explores, for the first time, the characteristics of this impairment and how it is affected by a behavioral intervention—cognitive remediation. Method: Using computational modelling, 3 reinforcement learning parameters based on the Wisconsin Card Sorting Test (WCST) trial-by-trial performance were estimated: R (reward sensitivity), P (punishment sensitivity), and D (choice consistency). In Study 1 the parameters were compared between a group of individuals with schizophrenia (n = 100) and a healthy control group (n = 50). In Study 2 the effect of cognitive remediation therapy (CRT) on these parameters was assessed in 2 groups of individuals with schizophrenia, one receiving CRT (n = 37) and the other receiving treatment as usual (TAU, n = 34). Results: In Study 1 individuals with schizophrenia showed impairment in the R and P parameters compared with healthy controls. Study 2 demonstrated that sensitivity to negative feedback (P) and reward (R) improved in the CRT group after therapy compared with the TAU group. R and P parameter change correlated with WCST outputs. Improvements in R and P after CRT were associated with working memory gains and reduction of negative symptoms, respectively. Conclusion: Schizophrenia reinforcement learning difficulties negatively influence performance in shift learning tasks. CRT can improve sensitivity to reward and punishment. Identifying parameters that show change may be useful in experimental medicine studies to identify cognitive domains susceptible to improvement. PMID:24214932
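    A rough sketch of how the three fitted parameters could enter a trial-by-trial model: R and P scale asymmetric updates of attention weights over sorting dimensions, and D controls how deterministically the highest-weighted dimension is chosen. The update rule below is a hypothetical stand-in, not the authors' exact model.

```python
import numpy as np

def update_attention_weights(a, chosen_dim, feedback, R=0.3, P=0.2):
    """Asymmetric trial-by-trial update of attention weights over sorting
    dimensions: R scales learning from reward, P from punishment."""
    a = a.copy()
    if feedback > 0:
        a[chosen_dim] += R * (1.0 - a[chosen_dim])   # reward sensitivity
    else:
        a[chosen_dim] -= P * a[chosen_dim]           # punishment sensitivity
    return a / a.sum()                               # renormalize the weights

def choice_probs(a, D=2.0):
    """Choice consistency: larger D makes selection more deterministic."""
    p = a ** D
    return p / p.sum()
```

    Under this reading, the reported therapy effect corresponds to larger fitted R and P values, i.e. feedback moving the attention weights more strongly from one trial to the next.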

  16. Novelty and Inductive Generalization in Human Reinforcement Learning.

    PubMed

    Gershman, Samuel J; Niv, Yael

    2015-07-01

    In reinforcement learning (RL), a decision maker searching for the most rewarding option is often faced with the question: What is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: How can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and we describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of RL in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional RL algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty.
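    The core inductive idea, valuing a never-tried option by generalizing from other options, can be sketched with a simple hierarchical Gaussian shrinkage estimate; the model form and parameters are illustrative assumptions, not the paper's full hierarchical Bayesian treatment.

```python
import numpy as np

def option_value(rewards, group_mean, prior_var=1.0, noise_var=1.0):
    """Posterior-mean value of an option under a simple hierarchical
    Gaussian model: shrink the option's sample mean toward the
    environment-level mean, more strongly when data are scarce."""
    n = len(rewards)
    if n == 0:
        return group_mean                  # novel option: pure generalization
    k = (n / noise_var) / (n / noise_var + 1.0 / prior_var)  # shrinkage weight
    return k * float(np.mean(rewards)) + (1.0 - k) * group_mean
```

    With no observations the estimate is inherited entirely from abstract knowledge about the environment; as option-specific rewards accumulate, it shifts toward the option's own sample mean.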

  17. Reinforcement learning design for cancer clinical trials

    PubMed Central

    Zhao, Yufan; Kosorok, Michael R.; Zeng, Donglin

    2009-01-01

    Summary We develop reinforcement learning trials for discovering individualized treatment regimens for life-threatening diseases such as cancer. A temporal-difference learning method called Q-learning is utilized which involves learning an optimal policy from a single training set of finite longitudinal patient trajectories. Approximating the Q-function with time-indexed parameters can be achieved by using support vector regression or extremely randomized trees. Within this framework, we demonstrate that the procedure can extract optimal strategies directly from clinical data without relying on the identification of any accurate mathematical models, unlike approaches based on adaptive design. We show that reinforcement learning has tremendous potential in clinical research because it can select actions that improve outcomes by taking into account delayed effects even when the relationship between actions and outcomes is not fully known. To support our claims, the methodology's practical utility is illustrated in a simulation analysis. In the immediate future, we will apply this general strategy to studying and identifying new treatments for advanced metastatic stage IIIB/IV non-small cell lung cancer, which usually includes multiple lines of chemotherapy treatment. Moreover, there is significant potential of the proposed methodology for developing personalized treatment strategies in other cancers, in cystic fibrosis, and in other life-threatening diseases. PMID:19750510
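    The procedure described, learning an optimal policy from a fixed batch of longitudinal trajectories, can be sketched as fitted Q-iteration. Here a plain linear least-squares regressor stands in for the support vector regression or extremely randomized trees mentioned in the abstract, and the scalar state and toy transitions are assumptions for illustration.

```python
import numpy as np

def fitted_q_iteration(transitions, n_actions, n_iters=20, gamma=0.9):
    """Batch fitted Q-iteration over logged (state, action, reward,
    next_state) tuples, with one linear Q-function per action."""
    s, a, r, s2 = (np.array(x, dtype=float) for x in zip(*transitions))
    a = a.astype(int)
    phi = lambda states: np.c_[np.ones(len(states)), states]  # [1, state]
    W = np.zeros((n_actions, 2))
    for _ in range(n_iters):
        q_next = phi(s2) @ W.T                      # Q(s', .) for all actions
        target = r + gamma * q_next.max(axis=1)     # Bellman backup
        for act in range(n_actions):
            m = a == act
            if m.any():                             # refit this action's Q
                W[act], *_ = np.linalg.lstsq(phi(s[m]), target[m], rcond=None)
    return W
```

    Because the targets are rebuilt from the current Q-estimate at each iteration, the procedure propagates delayed effects of actions backward through the logged trajectories without needing a model of the transition dynamics.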

  18. Striatal dopamine ramping may indicate flexible reinforcement learning with forgetting in the cortico-basal ganglia circuits.

    PubMed

    Morita, Kenji; Kato, Ayaka

    2014-01-01

    It has been suggested that the midbrain dopamine (DA) neurons, receiving inputs from the cortico-basal ganglia (CBG) circuits and the brainstem, compute reward prediction error (RPE), the difference between reward obtained or expected to be obtained and reward that had been expected to be obtained. These reward expectations are suggested to be stored in the CBG synapses and updated according to RPE through synaptic plasticity, which is induced by released DA. These together constitute the "DA=RPE" hypothesis, which describes the mutual interaction between DA and the CBG circuits and serves as the primary working hypothesis in studying reward learning and value-based decision-making. However, recent work has revealed a new type of DA signal that appears not to represent RPE. Specifically, it has been found in a reward-associated maze task that striatal DA concentration primarily shows a gradual increase toward the goal. We explored whether such ramping DA could be explained by extending the "DA=RPE" hypothesis by taking into account biological properties of the CBG circuits. In particular, we examined effects of possible time-dependent decay of DA-dependent plastic changes of synaptic strengths by incorporating decay of learned values into the RPE-based reinforcement learning model and simulating reward learning tasks. We then found that incorporation of such a decay dramatically changes the model's behavior, causing gradual ramping of RPE. Moreover, we further incorporated magnitude-dependence of the rate of decay, which could potentially be in accord with some past observations, and found that near-sigmoidal ramping of RPE, resembling the observed DA ramping, could then occur. Given that synaptic decay can be useful for flexibly reversing and updating the learned reward associations, especially in case the baseline DA is low and encoding of negative RPE by DA is limited, the observed DA ramping would be indicative of the operation of such flexible reward learning.
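    The decay extension can be reproduced in a few lines: TD value learning on a linear track with reward at the goal, where learned values decay at each time step. The decay keeps prediction errors positive along the track so they increase toward the goal instead of collapsing onto the earliest predictor. Track length and parameter values are illustrative assumptions.

```python
import numpy as np

def ramping_rpe(n_states=10, n_episodes=200, alpha=0.5, gamma=0.97, decay=0.01):
    """TD value learning on a linear track with reward at the final state,
    with per-step decay of the learned values (the forgetting extension)."""
    V = np.zeros(n_states + 1)              # V[n_states] is the terminal state
    for _ in range(n_episodes):
        for s in range(n_states):
            r = 1.0 if s == n_states - 1 else 0.0
            rpe = r + gamma * V[s + 1] - V[s]   # reward prediction error
            V[s] += alpha * rpe
            V *= 1.0 - decay                # time-dependent decay of values
    # prediction errors along the track after learning
    return [(1.0 if s == n_states - 1 else 0.0) + gamma * V[s + 1] - V[s]
            for s in range(n_states)]
```

    With decay set to zero, the same loop converges to the standard result in which the prediction error vanishes along the track; the nonzero decay is what leaves the residual, goal-directed ramp.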

  19. Striatal dopamine ramping may indicate flexible reinforcement learning with forgetting in the cortico-basal ganglia circuits

    PubMed Central

    Morita, Kenji; Kato, Ayaka

    2014-01-01

    It has been suggested that the midbrain dopamine (DA) neurons, receiving inputs from the cortico-basal ganglia (CBG) circuits and the brainstem, compute reward prediction error (RPE), the difference between reward obtained or expected to be obtained and reward that had been expected to be obtained. These reward expectations are suggested to be stored in the CBG synapses and updated according to RPE through synaptic plasticity, which is induced by released DA. These together constitute the “DA=RPE” hypothesis, which describes the mutual interaction between DA and the CBG circuits and serves as the primary working hypothesis in studying reward learning and value-based decision-making. However, recent work has revealed a new type of DA signal that appears not to represent RPE. Specifically, it has been found in a reward-associated maze task that striatal DA concentration primarily shows a gradual increase toward the goal. We explored whether such ramping DA could be explained by extending the “DA=RPE” hypothesis by taking into account biological properties of the CBG circuits. In particular, we examined effects of possible time-dependent decay of DA-dependent plastic changes of synaptic strengths by incorporating decay of learned values into the RPE-based reinforcement learning model and simulating reward learning tasks. We then found that incorporation of such a decay dramatically changes the model's behavior, causing gradual ramping of RPE. Moreover, we further incorporated magnitude-dependence of the rate of decay, which could potentially be in accord with some past observations, and found that near-sigmoidal ramping of RPE, resembling the observed DA ramping, could then occur. Given that synaptic decay can be useful for flexibly reversing and updating the learned reward associations, especially in case the baseline DA is low and encoding of negative RPE by DA is limited, the observed DA ramping would be indicative of the operation of such flexible reward learning.

  20. Motivational neural circuits underlying reinforcement learning.

    PubMed

    Averbeck, Bruno B; Costa, Vincent D

    2017-03-29

    Reinforcement learning (RL) is the behavioral process of learning the values of actions and objects. Most models of RL assume that the dopaminergic prediction error signal drives plasticity in frontal-striatal circuits. The striatum then encodes value representations that drive decision processes. However, the amygdala has also been shown to play an important role in forming Pavlovian stimulus-outcome associations. These Pavlovian associations can drive motivated behavior via the amygdala projections to the ventral striatum or the ventral tegmental area. The amygdala may, therefore, play a central role in RL. Here we compare the contributions of the amygdala and the striatum to RL and show that both the amygdala and striatum learn and represent expected values in RL tasks. Furthermore, value representations in the striatum may be inherited, to some extent, from the amygdala. The striatum may, therefore, play less of a primary role in learning stimulus-outcome associations in RL than previously suggested.

  1. Molecular mechanisms underlying a cellular analogue of operant reward learning

    PubMed Central

    Lorenzetti, Fred D.; Baxter, Douglas A.; Byrne, John H.

    2008-01-01

    SUMMARY Operant conditioning is a ubiquitous but mechanistically poorly understood form of associative learning in which an animal learns the consequences of its behavior. Using a single-cell analogue of operant conditioning in neuron B51 of Aplysia, we examined second-messenger pathways engaged by activity and reward and how they may provide a biochemical association underlying operant learning. Conditioning was blocked by Rp-cAMP, a peptide inhibitor of PKA, a PKC inhibitor and by expressing a dominant negative isoform of Ca2+-dependent PKC (apl-I). Thus, both PKA and PKC were necessary for operant conditioning. Injection of cAMP into B51 mimicked the effects of operant conditioning. Activation of PKC also mimicked conditioning, but was dependent on both cAMP and PKA, suggesting that PKC acted at some point upstream of PKA activation. Our results demonstrate how these molecules can interact to mediate operant conditioning in an individual neuron important for the expression of the conditioned behavior. PMID:18786364

  2. Intrinsically Motivated Reinforcement Learning: A Promising Framework for Developmental Robot Learning

    DTIC Science & Technology

    2005-01-01

    for intrinsically motivated reinforcement learning that strives to achieve broad competence in an environment in a task-nonspecific manner by... hierarchical learning, intrinsically motivated reinforcement learning is an obvious choice for organizing behavior in developmental robotics. We present

  3. Attenuating GABA(A) receptor signaling in dopamine neurons selectively enhances reward learning and alters risk preference in mice.

    PubMed

    Parker, Jones G; Wanat, Matthew J; Soden, Marta E; Ahmad, Kinza; Zweifel, Larry S; Bamford, Nigel S; Palmiter, Richard D

    2011-11-23

    Phasic dopamine (DA) transmission encodes the value of reward-predictive stimuli and influences both learning and decision-making. Altered DA signaling is associated with psychiatric conditions characterized by risky choices such as pathological gambling. These observations highlight the importance of understanding how DA neuron activity is modulated. While excitatory drive onto DA neurons is critical for generating phasic DA responses, emerging evidence suggests that inhibitory signaling also modulates these responses. To address the functional importance of inhibitory signaling in DA neurons, we generated mice lacking the β3 subunit of the GABA(A) receptor specifically in DA neurons (β3-KO mice) and examined their behavior in tasks that assessed appetitive learning, aversive learning, and risk preference. DA neurons in midbrain slices from β3-KO mice exhibited attenuated GABA-evoked IPSCs. Furthermore, electrical stimulation of excitatory afferents to DA neurons elicited more DA release in the nucleus accumbens of β3-KO mice as measured by fast-scan cyclic voltammetry. β3-KO mice were more active than controls when given morphine, which correlated with potential compensatory upregulation of GABAergic tone onto DA neurons. β3-KO mice learned faster in two food-reinforced learning paradigms, but extinguished their learned behavior normally. Enhanced learning was specific for appetitive tasks, as aversive learning was unaffected in β3-KO mice. Finally, we found that β3-KO mice had enhanced risk preference in a probabilistic selection task that required mice to choose between a small certain reward and a larger uncertain reward. Collectively, these findings identify a selective role for GABA(A) signaling in DA neurons in appetitive learning and decision-making.

  4. Single amino acids in sucrose rewards modulate feeding and associative learning in the honeybee.

    PubMed

    Simcock, Nicola K; Gray, Helen E; Wright, Geraldine A

    2014-10-01

    Obtaining the correct balance of nutrients requires that the brain integrates information about the body's nutritional state with sensory information from food to guide feeding behaviour. Learning is a mechanism that allows animals to identify cues associated with nutrients so that they can be located quickly when required. Feedback about nutritional state is essential for nutrient balancing and could influence learning. How specific this feedback is to individual nutrients has not often been examined. Here, we tested how the honeybee's nutritional state influenced the likelihood it would feed on and learn sucrose solutions containing single amino acids. Nutritional state was manipulated by pre-feeding bees with either 1 M sucrose or 1 M sucrose containing 100 mM of isoleucine, proline, phenylalanine, or methionine 24 h prior to olfactory conditioning of the proboscis extension response. We found that bees pre-fed sucrose solution consumed less of solutions containing amino acids and were also less likely to learn to associate amino acid solutions with odours. Unexpectedly, bees pre-fed solutions containing an amino acid were also less likely to learn to associate odours with sucrose the next day. Furthermore, they consumed more of and were more likely to learn when rewarded with an amino acid solution if they were pre-fed isoleucine and proline. Our data indicate that single amino acids at relatively high concentrations inhibit feeding on sucrose solutions containing them, and they can act as appetitive reinforcers during learning. Our data also suggest that select amino acids interact with mechanisms that signal nutritional sufficiency to reduce hunger. Based on these experiments, we predict that nutrient balancing for essential amino acids during learning requires integration of information about several amino acids experienced simultaneously.

  5. Single amino acids in sucrose rewards modulate feeding and associative learning in the honeybee

    PubMed Central

    Simcock, Nicola K.; Gray, Helen E.; Wright, Geraldine A.

    2014-01-01

    Obtaining the correct balance of nutrients requires that the brain integrates information about the body’s nutritional state with sensory information from food to guide feeding behaviour. Learning is a mechanism that allows animals to identify cues associated with nutrients so that they can be located quickly when required. Feedback about nutritional state is essential for nutrient balancing and could influence learning. How specific this feedback is to individual nutrients has not often been examined. Here, we tested how the honeybee’s nutritional state influenced the likelihood it would feed on and learn sucrose solutions containing single amino acids. Nutritional state was manipulated by pre-feeding bees with either 1 M sucrose or 1 M sucrose containing 100 mM of isoleucine, proline, phenylalanine, or methionine 24 h prior to olfactory conditioning of the proboscis extension response. We found that bees pre-fed sucrose solution consumed less of solutions containing amino acids and were also less likely to learn to associate amino acid solutions with odours. Unexpectedly, bees pre-fed solutions containing an amino acid were also less likely to learn to associate odours with sucrose the next day. Furthermore, they consumed more of and were more likely to learn when rewarded with an amino acid solution if they were pre-fed isoleucine and proline. Our data indicate that single amino acids at relatively high concentrations inhibit feeding on sucrose solutions containing them, and they can act as appetitive reinforcers during learning. Our data also suggest that select amino acids interact with mechanisms that signal nutritional sufficiency to reduce hunger. Based on these experiments, we predict that nutrient balancing for essential amino acids during learning requires integration of information about several amino acids experienced simultaneously. PMID:24819203

  6. Multiagent reinforcement learning: spiking and nonspiking agents in the iterated Prisoner's Dilemma.

    PubMed

    Vassiliades, Vassilis; Cleanthous, Aristodemos; Christodoulou, Chris

    2011-04-01

    This paper investigates multiagent reinforcement learning (MARL) in a general-sum game where the payoffs' structure is such that the agents are required to exploit each other in a way that benefits all agents. The contradictory nature of these games makes their study in multiagent systems quite challenging. In particular, we investigate MARL with spiking and nonspiking agents in the Iterated Prisoner's Dilemma by exploring the conditions required to enhance its cooperative outcome. The spiking agents are neural networks with leaky integrate-and-fire neurons trained with two different learning algorithms: 1) reinforcement of stochastic synaptic transmission, or 2) reward-modulated spike-timing-dependent plasticity with eligibility trace. The nonspiking agents use a tabular representation and are trained with Q- and SARSA learning algorithms, with a novel reward transformation process also being applied to the Q-learning agents. According to the results, the cooperative outcome is enhanced by: 1) transformed internal reinforcement signals and a combination of a high learning rate and a low discount factor with an appropriate exploration schedule in the case of non-spiking agents, and 2) having longer eligibility trace time constant in the case of spiking agents. Moreover, it is shown that spiking and nonspiking agents have similar behavior and therefore they can equally well be used in a multiagent interaction setting. For training the spiking agents in the case where more than one output neuron competes for reinforcement, a novel and necessary modification that enhances competition is applied to the two learning algorithms utilized, in order to avoid a possible synaptic saturation. This is done by administering to the networks additional global reinforcement signals for every spike of the output neurons that were not "responsible" for the preceding decision.
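    The nonspiking agents can be sketched as tabular Q-learners whose state is the previous joint move. Following the abstract's finding that a high learning rate combined with a low discount factor favors cooperation, the sketch uses such values, though the exact numbers, and the omission of the reward-transformation step, are illustrative simplifications.

```python
import random

random.seed(0)  # deterministic run for reproducibility

# payoff for (my move, opponent move): (my reward, opponent reward)
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

class QAgent:
    """Tabular Q-learning agent; state is the previous joint move."""
    def __init__(self, alpha=0.5, gamma=0.2, eps=0.1):
        self.q = {}
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, state):
        if random.random() < self.eps:
            return random.choice('CD')          # epsilon-greedy exploration
        return max('CD', key=lambda a: self.q.get((state, a), 0.0))

    def learn(self, state, action, reward, next_state):
        best_next = max(self.q.get((next_state, a), 0.0) for a in 'CD')
        old = self.q.get((state, action), 0.0)
        self.q[(state, action)] = old + self.alpha * (
            reward + self.gamma * best_next - old)

# two agents play the iterated game; each sees the state from its own side
a1, a2 = QAgent(), QAgent()
state = ('C', 'C')
for _ in range(5000):
    m1, m2 = a1.act(state), a2.act(state[::-1])
    r1, r2 = PAYOFF[(m1, m2)]
    a1.learn(state, m1, r1, (m1, m2))
    a2.learn(state[::-1], m2, r2, (m2, m1))
    state = (m1, m2)
```

    A SARSA variant would differ only in using the action actually taken at the next state in place of the max in `learn`.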

  7. A Flexible Mechanism of Rule Selection Enables Rapid Feature-Based Reinforcement Learning

    PubMed Central

    Balcarras, Matthew; Womelsdorf, Thilo

    2016-01-01

    Learning in a new environment is influenced by prior learning and experience. Correctly applying a rule that maps a context to stimuli, actions, and outcomes enables faster learning and better outcomes compared to relying on strategies for learning that are ignorant of task structure. However, it is often difficult to know when and how to apply learned rules in new contexts. In our study we explored how subjects employ different strategies for learning the relationship between stimulus features and positive outcomes in a probabilistic task context. We test the hypothesis that task-naive subjects will show enhanced learning of feature-specific reward associations by switching to the use of an abstract rule that associates stimuli by feature type and restricts selections to that dimension. To test this hypothesis we designed a decision-making task where subjects receive probabilistic feedback following choices between pairs of stimuli. In the task, trials are grouped in two contexts by blocks: in one type of block there is no unique relationship between a specific feature dimension (stimulus shape or color) and positive outcomes, and following an uncued transition, alternating blocks have outcomes that are linked to either stimulus shape or color. Two-thirds of subjects (n = 22/32) exhibited behavior that was best fit by a hierarchical feature-rule model. Supporting the model's predicted mechanism, these subjects showed significantly enhanced performance in feature-reward blocks and rapidly switched their choice strategy to using abstract feature rules when reward contingencies changed. Choice behavior of other subjects (n = 10/32) was fit by a range of alternative reinforcement learning models representing strategies that do not benefit from applying previously learned rules. In summary, these results show that untrained subjects are capable of flexibly shifting between behavioral rules by leveraging simple model-free reinforcement learning and context

  9. Reinforcement learning can account for associative and perceptual learning on a visual-decision task.

    PubMed

    Law, Chi-Tat; Gold, Joshua I

    2009-05-01

    We recently showed that improved perceptual performance on a visual motion direction-discrimination task corresponds to changes in how an unmodified sensory representation in the brain is interpreted to form a decision that guides behavior. Here we found that these changes can be accounted for using a reinforcement-learning rule to shape functional connectivity between the sensory and decision neurons. We modeled performance on the basis of the readout of simulated responses of direction-selective sensory neurons in the middle temporal area (MT) of monkey cortex. A reward prediction error guided changes in connections between these sensory neurons and the decision process, first establishing the association between motion direction and response direction, and then gradually improving perceptual sensitivity by selectively strengthening the connections from the most sensitive neurons in the sensory population. The results suggest a common, feedback-driven mechanism for some forms of associative and perceptual learning.
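
    A minimal sketch of this kind of mechanism (not the authors' actual model): simulated direction-tuned sensory units feed a stochastic decision unit, and the readout connections are updated in proportion to a reward prediction error, so the more informative units gradually come to dominate the readout. Tuning curves, noise levels, and learning rates here are illustrative assumptions.

```python
import math
import random

def simulate(trials=20000, n_units=20, lr=0.002, seed=0):
    rng = random.Random(seed)
    # half the units prefer leftward (-1) motion, half rightward (+1)
    prefs = [-1] * (n_units // 2) + [1] * (n_units - n_units // 2)
    w = [0.0] * n_units          # readout weights to the decision unit
    expected_reward = 0.5        # slowly updated reward prediction
    correct = []
    for _ in range(trials):
        direction = rng.choice([-1, 1])  # true motion direction
        # noisy responses of direction-tuned sensory units
        r = [1.0 + 0.5 * p * direction + rng.gauss(0.0, 0.5) for p in prefs]
        drive = sum(wi * ri for wi, ri in zip(w, r))
        drive = max(-30.0, min(30.0, drive))      # keep the sigmoid stable
        p_right = 1.0 / (1.0 + math.exp(-drive))  # P(choose +1)
        choice = 1 if rng.random() < p_right else -1
        reward = 1.0 if choice == direction else 0.0
        rpe = reward - expected_reward            # reward prediction error
        expected_reward += 0.005 * rpe
        # RPE-gated update of sensory-to-decision connections
        for i in range(n_units):
            w[i] += lr * rpe * choice * r[i]
        correct.append(reward)
    return sum(correct[-1000:]) / 1000.0  # accuracy over the last 1000 trials
```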

  10. Dissociable neural representations of reinforcement and belief prediction errors underlie strategic learning.

    PubMed

    Zhu, Lusha; Mathewson, Kyle E; Hsu, Ming

    2012-01-31

    Decision-making in the presence of other competitive intelligent agents is fundamental for social and economic behavior. Such decisions require agents to behave strategically, where in addition to learning about the rewards and punishments available in the environment, they also need to anticipate and respond to actions of others competing for the same rewards. However, whereas we know much about strategic learning at both theoretical and behavioral levels, we know relatively little about the underlying neural mechanisms. Here, we show using a multi-strategy competitive learning paradigm that strategic choices can be characterized by extending the reinforcement learning (RL) framework to incorporate agents' beliefs about the actions of their opponents. Furthermore, using this characterization to generate putative internal values, we used model-based functional magnetic resonance imaging to investigate neural computations underlying strategic learning. We found that the distinct notions of prediction errors derived from our computational model are processed in a partially overlapping but distinct set of brain regions. Specifically, we found that the RL prediction error was correlated with activity in the ventral striatum. In contrast, activity in the ventral striatum, as well as the rostral anterior cingulate (rACC), was correlated with a previously uncharacterized belief-based prediction error. Furthermore, activity in rACC reflected individual differences in degree of engagement in belief learning. These results suggest a model of strategic behavior where learning arises from interaction of dissociable reinforcement and belief-based inputs.
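
    The two learning signals the study dissociates can be written down in a few lines. This toy sketch (an assumption-laden illustration, not the authors' fitted model) computes a reinforcement prediction error that updates the chosen action's value, alongside a belief-based prediction error that updates the predicted probabilities of the opponent's actions.

```python
def step(values, beliefs, my_action, opp_action, reward, alpha=0.2, beta=0.2):
    # reinforcement prediction error: outcome minus the chosen action's value
    rl_pe = reward - values[my_action]
    values[my_action] += alpha * rl_pe
    # belief-based prediction error: observed opponent action minus its
    # currently predicted probability, for each possible action
    for a in beliefs:
        observed = 1.0 if a == opp_action else 0.0
        beliefs[a] += beta * (observed - beliefs[a])
    return rl_pe
```

    A strategic learner in the extended framework would then value its own actions against the updated opponent beliefs, rather than against raw reward history alone.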

  11. Forward shift of feeding buzz components of dolphins and belugas during associative learning reveals a likely connection to reward expectation, pleasure and brain dopamine activation.

    PubMed

    Ridgway, S H; Moore, P W; Carder, D A; Romano, T A

    2014-08-15

    For many years, we heard sounds associated with reward from dolphins and belugas. We named these pulsed sounds victory squeals (VS), as they remind us of a child's squeal of delight. Here we put these sounds in context with natural and learned behavior. Like bats, echolocating cetaceans produce feeding buzzes as they approach and catch prey. Unlike bats, cetaceans continue their feeding buzzes after prey capture and the after portion is what we call the VS. Prior to training (or conditioning), the VS comes after the fish reward; with repeated trials it moves to before the reward. During training, we use a whistle or other sound to signal a correct response by the animal. This sound signal, named a secondary reinforcer (SR), leads to the primary reinforcer, fish. Trainers usually name their whistle or other SR a bridge, as it bridges the time gap between the correct response and reward delivery. During learning, the SR becomes associated with reward and the VS comes after the SR rather than after the fish. By following the SR, the VS confirms that the animal expects a reward. Results of early brain stimulation work suggest to us that SR stimulates brain dopamine release, which leads to the VS. Although there are no direct studies of dopamine release in cetaceans, we found that the timing of our VS is consistent with a response after dopamine release. We compared trained vocal responses to auditory stimuli with VS responses to SR sounds. Auditory stimuli that did not signal reward resulted in faster responses by a mean of 151 ms for dolphins and 250 ms for belugas. In laboratory animals, there is a 100 to 200 ms delay for dopamine release. VS delay in our animals is similar and consistent with vocalization after dopamine release. Our novel observation suggests that the dopamine reward system is active in cetacean brains.

  12. Affective personality predictors of disrupted reward learning and pursuit in major depressive disorder.

    PubMed

    DelDonno, Sophie R; Weldon, Anne L; Crane, Natania A; Passarotti, Alessandra M; Pruitt, Patrick J; Gabriel, Laura B; Yau, Wendy; Meyers, Kortni K; Hsu, David T; Taylor, Stephen F; Heitzeg, Mary M; Herbener, Ellen; Shankman, Stewart A; Mickey, Brian J; Zubieta, Jon-Kar; Langenecker, Scott A

    2015-11-30

    Anhedonia, the diminished anticipation and pursuit of reward, is a core symptom of major depressive disorder (MDD). Trait behavioral activation (BA), as a proxy for anhedonia, and behavioral inhibition (BI) may moderate the relationship between MDD and reward-seeking. The present studies probed for reward learning deficits, potentially due to aberrant BA and/or BI, in active or remitted MDD individuals compared to healthy controls (HC). Active MDD (Study 1) and remitted MDD (Study 2) participants completed the modified monetary incentive delay task (mMIDT), a behavioral reward-seeking task whose response window parameters were individually titrated to theoretically elicit equivalent accuracy between groups. Participants completed the BI Scale and BA Reward-Responsiveness and Drive Scales. Despite individual titration, active MDD participants won significantly less money than HCs. Higher Reward-Responsiveness scores predicted more money won; Drive and BI scores were not predictive. Remitted MDD participants' performance did not differ from controls', and trait BA and BI measures did not predict r-MDD performance. These results suggest that diminished reward-responsiveness may contribute to decreased motivation and reward pursuit during active MDD, but that reward learning is intact in remission. Understanding individual reward processing deficits in MDD may inform personalized intervention addressing anhedonia and motivation deficits in select MDD patients.

  13. Dopamine and opioid gene variants are associated with increased smoking reward and reinforcement owing to negative mood.

    PubMed

    Perkins, Kenneth A; Lerman, Caryn; Grottenthaler, Amy; Ciccocioppo, Melinda M; Milanak, Melissa; Conklin, Cynthia A; Bergen, Andrew W; Benowitz, Neal L

    2008-09-01

    Negative mood increases smoking reinforcement and risk of relapse. We explored associations of gene variants in the dopamine, opioid, and serotonin pathways with smoking reward ('liking') and reinforcement (latency to first puff and total puffs) as a function of negative mood and expected versus actual nicotine content of the cigarette. Smokers of European ancestry (n=72) were randomized to one of four groups in a 2x2 balanced placebo design, corresponding with manipulation of actual (0.6 vs. 0.05 mg) and expected (told nicotine and told denicotinized) nicotine 'dose' in cigarettes during each of two sessions (negative vs. positive mood induction). Following mood induction and expectancy instructions, they sampled and rated the assigned cigarette, and then smoked additional cigarettes ad lib during continued mood induction. The increase in smoking amount owing to negative mood was associated with: dopamine D2 receptor (DRD2) C957T (CC>TT or CT), SLC6A3 (presence of 9 repeat>absence of 9), and among those given a nicotine cigarette, DRD4 (presence of 7 repeat>absence of 7) and DRD2/ANKK1 TaqIA (TT or CT>CC). SLC6A3 and DRD2/ANKK1 TaqIA were also associated with smoking reward and smoking latency. OPRM1 (AA>AG or GG) was associated with smoking reward, but the SLC6A4 variable number tandem repeat was unrelated to any of these measures. These results warrant replication but provide the first evidence for genetic associations with the acute increase in smoking reward and reinforcement owing to negative mood.

  14. Knowledge-Based Reinforcement Learning for Data Mining

    NASA Astrophysics Data System (ADS)

    Kudenko, Daniel; Grzes, Marek

    Data Mining is the process of extracting patterns from data. Two general avenues of research in the intersecting areas of agents and data mining can be distinguished. The first approach is concerned with mining an agent’s observation data in order to extract patterns, categorize environment states, and/or make predictions of future states. In this setting, data is normally available as a batch, and the agent’s actions and goals are often independent of the data mining task. The data collection is mainly considered as a side effect of the agent’s activities. Machine learning techniques applied in such situations fall into the class of supervised learning. In contrast, the second scenario occurs where an agent is actively performing the data mining, and is responsible for the data collection itself. For example, a mobile network agent is acquiring and processing data (where the acquisition may incur a certain cost), or a mobile sensor agent is moving in a (perhaps hostile) environment, collecting and processing sensor readings. In these settings, the tasks of the agent and the data mining are highly intertwined and interdependent (or even identical). Supervised learning is not a suitable technique for these cases. Reinforcement Learning (RL) enables an agent to learn from experience (in form of reward and punishment for explorative actions) and adapt to new situations, without a teacher. RL is an ideal learning technique for these data mining scenarios, because it fits the agent paradigm of continuous sensing and acting, and the RL agent is able to learn to make decisions on the sampling of the environment which provides the data. Nevertheless, RL still suffers from scalability problems, which have prevented its successful use in many complex real-world domains. The more complex the tasks, the longer it takes a reinforcement learning algorithm to converge to a good solution. For many real-world tasks, human expert knowledge is available. For example, human
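
    A standard concrete device for injecting such expert knowledge into RL, and the subject of the authors' related work, is potential-based reward shaping: an extra reward F(s, s') = γΦ(s') − Φ(s) derived from a knowledge-based potential Φ speeds up temporal-difference learning without changing the optimal policy. The chain-world task, step costs, and distance-based potential below are illustrative assumptions.

```python
import random

def shaped_q_learning(n=10, episodes=500, alpha=0.5, gamma=0.95,
                      epsilon=0.1, seed=0):
    """Q-learning on an n-state chain with potential-based shaping."""
    rng = random.Random(seed)
    goal = n - 1
    phi = lambda s: -(goal - s)          # knowledge: negative distance to goal
    q = [[0.0, 0.0] for _ in range(n)]   # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.randrange(2) if rng.random() < epsilon \
                else max((0, 1), key=lambda x: q[s][x])
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = -1.0                                 # step cost; goal ends the episode
            shaped = r + gamma * phi(s2) - phi(s)    # add F(s, s') to the reward
            done = (s2 == goal)
            target = shaped + (0.0 if done else gamma * max(q[s2]))
            q[s][a] += alpha * (target - q[s][a])
            s = s2
    return q
```

    Because F is a difference of potentials, the shaping bonus telescopes along any trajectory, which is why the greedy policy learned here still heads straight for the goal.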

  15. What motivates adolescents? Neural responses to rewards and their influence on adolescents' risk taking, learning, and cognitive control.

    PubMed

    van Duijvenvoorde, Anna C K; Peters, Sabine; Braams, Barbara R; Crone, Eveline A

    2016-11-01

    Adolescence is characterized by pronounced changes in motivated behavior, during which emphasis on potential rewards may result in an increased tendency to approach things that are novel and bring potential for positive reinforcement. While this may result in risky and health-endangering behavior, it may also lead to positive consequences, such as behavioral flexibility and greater learning. In this review we will discuss both the maladaptive and adaptive properties of heightened reward-sensitivity in adolescents by reviewing recent cognitive neuroscience findings in relation to behavioral outcomes. First, we identify brain regions involved in processing rewards in adults and adolescents. Second, we discuss how functional changes in reward-related brain activity during adolescence are related to two behavioral domains: risk taking and cognitive control. Finally, we conclude that progress lies in new levels of explanation by further integration of neural results with behavioral theories and computational models. In addition, we highlight that longitudinal measures, and a better conceptualization of adolescence and environmental determinants, are of crucial importance for understanding positive and negative developmental trajectories.

  16. Somatic and Reinforcement-Based Plasticity in the Initial Stages of Human Motor Learning.

    PubMed

    Sidarta, Ananda; Vahdat, Shahabeddin; Bernardi, Nicolò F; Ostry, David J

    2016-11-16

    As one learns to dance or play tennis, the desired somatosensory state is typically unknown. Trial and error is important as motor behavior is shaped by successful and unsuccessful movements. As an experimental model, we designed a task in which human participants make reaching movements to a hidden target and receive positive reinforcement when successful. We identified somatic and reinforcement-based sources of plasticity on the basis of changes in functional connectivity using resting-state fMRI before and after learning. The neuroimaging data revealed reinforcement-related changes in both motor and somatosensory brain areas in which a strengthening of connectivity was related to the amount of positive reinforcement during learning. Areas of prefrontal cortex were similarly altered in relation to reinforcement, with connectivity between sensorimotor areas of putamen and the reward-related ventromedial prefrontal cortex strengthened in relation to the amount of successful feedback received. In other analyses, we assessed connectivity related to changes in movement direction between trials, a type of variability that presumably reflects exploratory strategies during learning. We found that connectivity in a network linking motor and somatosensory cortices increased with trial-to-trial changes in direction. Connectivity varied as well with the change in movement direction following incorrect movements. Here the changes were observed in a somatic memory and decision making network involving ventrolateral prefrontal cortex and second somatosensory cortex. Our results point to the idea that the initial stages of motor learning are not wholly motor but rather involve plasticity in somatic and prefrontal networks related both to reward and exploration.

  17. Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control.

    PubMed

    Uragami, Daisuke; Takahashi, Tatsuji; Matsuo, Yoshiki

    2014-02-01

    Many algorithms and methods in artificial intelligence or machine learning were inspired by human cognition. As a mechanism to handle the exploration-exploitation dilemma in reinforcement learning, the loosely symmetric (LS) value function that models the causal intuition of humans was proposed (Shinohara et al., 2007). LS shows the highest correlation with causal induction by humans, and it has been reported to work effectively in multi-armed bandit problems, the simplest class of tasks representing the dilemma. However, the scope of application of LS was limited to reinforcement learning problems that have K actions with only one state (K-armed bandit problems). This study proposes an LS-Q learning architecture that can deal with general reinforcement learning tasks with multiple states and delayed reward. We tested the learning performance of the new architecture on giant-swing robot motion learning, where the uncertainty and unknownness of the environment are substantial. In the test, no ready-made internal models or function approximation of the state space were provided. The simulations showed that while an ordinary Q-learning agent does not reach giant-swing motion because of stagnant loops (local optima with low rewards), LS-Q escapes such loops and acquires the giant-swing motion. It is confirmed that the smaller the number of states (in other words, the more coarse-grained the division of states and the more incomplete the state observation), the better LS-Q performs in comparison with Q-learning. We also showed that the high performance of LS-Q depends comparatively little on parameter tuning and learning time. This suggests that the proposed method inspired by human cognition works adaptively in real environments.

  18. Integration of reinforcement learning and optimal decision-making theories of the basal ganglia.

    PubMed

    Bogacz, Rafal; Larsen, Tobias

    2011-04-01

    This article seeks to integrate two sets of theories describing action selection in the basal ganglia: reinforcement learning theories describing learning which actions to select to maximize reward and decision-making theories proposing that the basal ganglia selects actions on the basis of sensory evidence accumulated in the cortex. In particular, we present a model that integrates the actor-critic model of reinforcement learning and a model assuming that the cortico-basal-ganglia circuit implements a statistically optimal decision-making procedure. The values of cortico-striatal weights required for optimal decision making in our model differ from those provided by standard reinforcement learning models. Nevertheless, we show that an actor-critic model converges to the weights required for optimal decision making when biologically realistic limits on synaptic weights are introduced. We also describe the model's predictions concerning reaction times and neural responses during learning, and we discuss directions required for further integration of reinforcement learning and optimal decision-making theories.
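
    For reference, the actor-critic scheme the article builds on can be sketched in tabular form: a critic learns state values, and its temporal-difference error trains both the critic and the actor's action preferences. The two-state toy task, softmax action selection, and all parameters below are illustrative assumptions, not the article's cortico-basal-ganglia model.

```python
import math
import random

def actor_critic(episodes=2000, alpha_v=0.1, alpha_p=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    v = [0.0] * n_states                                  # critic: state values
    prefs = [[0.0] * n_actions for _ in range(n_states)]  # actor: preferences
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # softmax policy over the actor's preferences
            exps = [math.exp(p) for p in prefs[s]]
            a = 0 if rng.random() < exps[0] / sum(exps) else 1
            # toy dynamics: action 1 advances along the chain; taking
            # action 1 in the last state earns reward and ends the episode
            if a == 1 and s == n_states - 1:
                r, s2, done = 1.0, s, True
            elif a == 1:
                r, s2 = 0.0, s + 1
            else:
                r, s2 = 0.0, s
            td_error = r + (0.0 if done else gamma * v[s2]) - v[s]
            v[s] += alpha_v * td_error         # critic update
            prefs[s][a] += alpha_p * td_error  # actor update
            s = s2
    return v, prefs
```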

  19. An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning

    DTIC Science & Technology

    2000-10-01

    Learning behaviors in a multiagent environment are crucial for developing and adapting multiagent systems. Reinforcement learning techniques have...presentation of the relevant techniques for solving stochastic games from both the game theory and reinforcement learning communities. We examine the

  20. Aging Affects Acquisition and Reversal of Reward-Based Associative Learning

    ERIC Educational Resources Information Center

    Weiler, Julia A.; Bellebaum, Christian; Daum, Irene

    2008-01-01

    Reward-based associative learning is mediated by a distributed network of brain regions that are dependent on the dopaminergic system. Age-related changes in key regions of this system, the striatum and the prefrontal cortex, may adversely affect the ability to use reward information for the guidance of behavior. The present study investigated the…

  1. Comparing the neural basis of monetary reward and cognitive feedback during information-integration category learning.

    PubMed

    Daniel, Reka; Pollmann, Stefan

    2010-01-06

    The dopaminergic system is known to play a central role in reward-based learning (Schultz, 2006), yet it was also observed to be involved when only cognitive feedback is given (Aron et al., 2004). Within the domain of information-integration category learning, in which information from several stimulus dimensions has to be integrated predecisionally (Ashby and Maddox, 2005), the importance of contingent feedback is well established (Maddox et al., 2003). We examined the common neural correlates of reward anticipation and prediction error in this task. Sixteen subjects performed two parallel information-integration tasks within a single event-related functional magnetic resonance imaging session but received a monetary reward only for one of them. Similar functional areas including basal ganglia structures were activated in both task versions. In contrast, a single structure, the nucleus accumbens, showed higher activation during monetary reward anticipation compared with the anticipation of cognitive feedback in information-integration learning. Additionally, this activation was predicted by measures of intrinsic motivation in the cognitive feedback task and by measures of extrinsic motivation in the rewarded task. Our results indicate that, although all other structures implicated in category learning are not significantly affected by altering the type of reward, the nucleus accumbens responds to the positive incentive properties of an expected reward depending on the specific type of the reward.

  2. Stimulus-Reward Association and Reversal Learning in Individuals with Asperger Syndrome

    ERIC Educational Resources Information Center

    Zalla, Tiziana; Sav, Anca-Maria; Leboyer, Marion

    2009-01-01

    In the present study, performance of a group of adults with Asperger Syndrome (AS) on two series of object reversal and extinction was compared with that of a group of adults with typical development. Participants were requested to learn a stimulus-reward association rule and monitor changes in reward value of stimuli in order to gain as many…

  3. Establishing the dopamine dependency of human striatal signals during reward and punishment reversal learning.

    PubMed

    van der Schaaf, Marieke E; van Schouwenburg, Martine R; Geurts, Dirk E M; Schellekens, Arnt F A; Buitelaar, Jan K; Verkes, Robbert Jan; Cools, Roshan

    2014-03-01

    Drugs that alter dopamine transmission have opposite effects on reward and punishment learning. These opposite effects have been suggested to depend on dopamine in the striatum. Here, we establish for the first time the neurochemical specificity of such drug effects, during reward and punishment learning in humans, by adopting a coadministration design. Participants (N = 22) were scanned on 4 occasions using functional magnetic resonance imaging, following intake of placebo, bromocriptine (dopamine-receptor agonist), sulpiride (dopamine-receptor antagonist), or a combination of both drugs. A reversal-learning task was employed, in which both unexpected rewards and punishments signaled reversals. Drug effects were stratified by baseline working memory to take into account individual variations in drug response. Sulpiride induced parallel span-dependent changes in striatal blood oxygen level-dependent (BOLD) signal during unexpected rewards and punishments. These drug effects were found to be partially dopamine-dependent, as they were blocked by coadministration with bromocriptine. In contrast, sulpiride elicited opposite effects on behavioral measures of reward and punishment learning. Moreover, sulpiride-induced increases in striatal BOLD signal during both outcomes were associated with behavioral improvement in reward versus punishment learning. These results provide strong support for current theories, suggesting that drug effects on reward and punishment learning are mediated via striatal dopamine.

  4. Episodic memory encoding interferes with reward learning and decreases striatal prediction errors.

    PubMed

    Wimmer, G Elliott; Braun, Erin Kendall; Daw, Nathaniel D; Shohamy, Daphna

    2014-11-05

    Learning is essential for adaptive decision making. The striatum and its dopaminergic inputs are known to support incremental reward-based learning, while the hippocampus is known to support encoding of single events (episodic memory). Although traditionally studied separately, in even simple experiences, these two types of learning are likely to co-occur and may interact. Here we sought to understand the nature of this interaction by examining how incremental reward learning is related to concurrent episodic memory encoding. During the experiment, human participants made choices between two options (colored squares), each associated with a drifting probability of reward, with the goal of earning as much money as possible. Incidental, trial-unique object pictures, unrelated to the choice, were overlaid on each option. The next day, participants were given a surprise memory test for these pictures. We found that better episodic memory was related to a decreased influence of recent reward experience on choice, both within and across participants. fMRI analyses further revealed that during learning the canonical striatal reward prediction error signal was significantly weaker when episodic memory was stronger. This decrease in reward prediction error signals in the striatum was associated with enhanced functional connectivity between the hippocampus and striatum at the time of choice. Our results suggest a mechanism by which memory encoding may compete for striatal processing and provide insight into how interactions between different forms of learning guide reward-based decision making.
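
    The incremental, striatally supported component of such a task is commonly modelled with a simple delta rule. The sketch below is a generic illustration (the drift model, parameters, and epsilon-greedy chooser are assumptions, not the study's design): two options with slowly drifting reward probabilities, tracked by incrementally updated value estimates.

```python
import random

def run(trials=1000, alpha=0.3, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    p = [0.7, 0.3]   # drifting reward probabilities of the two options
    q = [0.5, 0.5]   # learned value estimates
    earned = 0
    for _ in range(trials):
        choice = rng.randrange(2) if rng.random() < epsilon \
            else max((0, 1), key=lambda a: q[a])
        reward = 1 if rng.random() < p[choice] else 0
        q[choice] += alpha * (reward - q[choice])   # delta-rule update
        earned += reward
        # random-walk drift of the reward probabilities, bounded to [0.1, 0.9]
        for i in (0, 1):
            p[i] = min(0.9, max(0.1, p[i] + rng.gauss(0.0, 0.02)))
    return earned, q
```

    In the study's framing, the weight recent outcomes carry in this update is exactly what stronger episodic encoding appeared to reduce.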

  5. A reinforcement learning approach to instrumental contingency degradation in rats.

    PubMed

    Dutech, Alain; Coutureau, Etienne; Marchand, Alain R

    2011-01-01

    Goal-directed action involves a representation of action consequences. Adapting to changes in action-outcome contingency requires the prefrontal region. Indeed, rats with lesions of the medial prefrontal cortex do not adapt their free operant response when food delivery becomes unrelated to lever-pressing. The present study explores the bases of this deficit through a combined behavioural and computational approach. We show that lesioned rats retain some behavioural flexibility and stop pressing if this action prevents food delivery. We attempt to model this phenomenon in a reinforcement learning framework. The model assumes that distinct action values are learned in an incremental manner in distinct states. The model represents states as n-uplets of events, emphasizing sequences rather than the continuous passage of time. Probabilities of lever-pressing and visits to the food magazine observed in the behavioural experiments are first analyzed as a function of these states, to identify sequences of events that influence action choice. Observed action probabilities appear to be essentially a function of the last event that occurred, with reward delivery and waiting significantly facilitating magazine visits and lever-pressing respectively. Behavioural sequences of normal and lesioned rats are then fed into the model, action values are updated at each event transition according to the SARSA algorithm, and predicted action probabilities are derived through a softmax policy. The model captures the time course of learning, as well as the differential adaptation of normal and prefrontal lesioned rats to contingency degradation with the same parameters for both groups. The results suggest that simple temporal difference algorithms with low learning rates can largely account for instrumental learning and performance. Prefrontal lesioned rats appear to mainly differ from control rats in their low rates of visits to the magazine after a lever press, and their inability to
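
    The named model components are standard and easy to state. Below is a minimal sketch (the toy state labels and parameter values are assumptions): action values are updated at each event transition with the SARSA rule, and choice probabilities are derived from the values through a softmax policy.

```python
import math

def softmax(qs, temp=1.0):
    """Choice probabilities from action values via a softmax policy."""
    exps = [math.exp(q / temp) for q in qs]
    z = sum(exps)
    return [e / z for e in exps]

def sarsa_update(q, s, a, r, s2, a2, alpha=0.05, gamma=0.9):
    """SARSA: the target uses the action actually taken next (a2),
    not the maximum over actions as in Q-learning."""
    q[s][a] += alpha * (r + gamma * q[s2][a2] - q[s][a])
```

    With a low learning rate such as the 0.05 used here, values change slowly across event transitions, matching the paper's point that slow temporal-difference learning can account for the observed time course.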

  6. CLEANing the Reward: Counterfactual Actions to Remove Exploratory Action Noise in Multiagent Learning

    NASA Technical Reports Server (NTRS)

    HolmesParker, Chris; Taylor, Mathew E.; Tumer, Kagan; Agogino, Adrian

    2014-01-01

    Learning in multiagent systems can be slow because agents must learn both how to behave in a complex environment and how to account for the actions of other agents. The inability of an agent to distinguish between the true environmental dynamics and those caused by the stochastic exploratory actions of other agents creates noise in each agent's reward signal. This learning noise can have unforeseen and often undesirable effects on the resultant system performance. We define such noise as exploratory action noise, demonstrate the critical impact it can have on the learning process in multiagent settings, and introduce a reward structure to effectively remove such noise from each agent's reward signal. In particular, we introduce Coordinated Learning without Exploratory Action Noise (CLEAN) rewards and empirically demonstrate their benefits
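
    A concrete way to see counterfactual reward shaping in multiagent learning is the difference-reward form from the same line of work: evaluate the global reward with and without agent i's contribution, so noise contributed by the other agents largely cancels. The toy team objective and the fixed default action below are illustrative assumptions, not CLEAN's exact construction.

```python
def global_reward(actions, target=3):
    # system performance: how close the team's joint output is to a target
    return -abs(sum(actions) - target)

def difference_reward(actions, i, default=0):
    # agent i's reward: global reward minus the global reward that
    # would have occurred had agent i taken a fixed default action
    counterfactual = list(actions)
    counterfactual[i] = default
    return global_reward(actions) - global_reward(counterfactual)
```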

  7. Updating dopamine reward signals

    PubMed Central

    Schultz, Wolfram

    2013-01-01

    Recent work has advanced our knowledge of phasic dopamine reward prediction error signals. The error signal is bidirectional, closely reflects the higher-order prediction error described by temporal difference learning models, is compatible with both model-free and model-based reinforcement learning, reports subjective rather than physical reward value during temporal discounting, and reflects subjective stimulus perception rather than physical stimulus aspects. Dopamine activations are primarily driven by reward, and to some extent risk, whereas punishment and salience have only limited activating effects when appropriate controls are respected. The signal is homogeneous in terms of time course but heterogeneous in many other aspects. It is essential for synaptic plasticity and a range of behavioural learning situations. PMID:23267662

  8. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning.

    PubMed

    Franklin, Nicholas T; Frank, Michael J

    2015-12-25

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning Marr's three levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments.

  9. A reinforcement learning mechanism responsible for the valuation of free choice.

    PubMed

    Cockburn, Jeffrey; Collins, Anne G E; Frank, Michael J

    2014-08-06

    Humans exhibit a preference for options they have freely chosen over equally valued options they have not; however, the neural mechanism that drives this bias and its functional significance have yet to be identified. Here, we propose a model in which choice biases arise due to amplified positive reward prediction errors associated with free choice. Using a novel variant of a probabilistic learning task, we show that choice biases are selective to options that are predominantly associated with positive outcomes. A polymorphism in DARPP-32, a gene linked to dopaminergic striatal plasticity and individual differences in reinforcement learning, was found to predict the effect of choice as a function of value. We propose that these choice biases are the behavioral byproduct of a credit assignment mechanism responsible for ensuring the effective delivery of dopaminergic reinforcement learning signals broadcast to the striatum.

  10. A reinforcement learning mechanism responsible for the valuation of free choice

    PubMed Central

    Cockburn, Jeffrey; Collins, Anne G.E.; Frank, Michael J.

    2014-01-01

    Summary Humans exhibit a preference for options they have freely chosen over equally valued options they have not; however, the neural mechanism that drives this bias and its functional significance have yet to be identified. Here, we propose a model in which choice biases arise due to amplified positive reward prediction errors associated with free choice. Using a novel variant of a probabilistic learning task, we show that choice biases are selective to options that are predominantly associated with positive outcomes. A polymorphism in DARPP-32, a gene linked to dopaminergic striatal plasticity and individual differences in reinforcement learning, was found to predict the effect of choice as a function of value. We propose that these choice biases are the behavioral byproduct of a credit assignment mechanism responsible for ensuring the effective delivery of dopaminergic reinforcement learning signals broadcast to the striatum. PMID:25066083

  11. Reinforcement learning of targeted movement in a spiking neuronal model of motor cortex.

    PubMed

    Chadderdon, George L; Neymotin, Samuel A; Kerr, Cliff C; Lytton, William W

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint "forearm" to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (-1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior.
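    The learning rule in this model (a global +1/0/-1 signal gating plasticity through per-synapse eligibility traces) can be sketched abstractly. The weight and trace bookkeeping below is a deliberate simplification of the spiking network, with hypothetical synapse labels:

    ```python
    def apply_global_signal(weights, traces, signal, lr=0.1, decay=0.9):
        """Update every plastic synapse in proportion to its eligibility trace.

        signal: +1 (reward), 0 (no learning) or -1 (punishment), broadcast
        to all synapses, mimicking phasic increases, no change, or phasic
        decreases of dopaminergic cell firing.
        """
        for syn in weights:
            weights[syn] += lr * signal * traces[syn]
            traces[syn] *= decay  # traces fade between reinforcement events
        return weights

    w = {("in0", "out0"): 0.5, ("in1", "out0"): 0.5}
    e = {("in0", "out0"): 1.0, ("in1", "out0"): 0.0}  # only in0->out0 recently active
    apply_global_signal(w, e, signal=+1)
    ```

    Only the recently active synapse is strengthened, which is how a single scalar signal can solve the credit (or blame) assignment problem across many synapses.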

  12. Reinforcement Learning of Targeted Movement in a Spiking Neuronal Model of Motor Cortex

    PubMed Central

    Chadderdon, George L.; Neymotin, Samuel A.; Kerr, Cliff C.; Lytton, William W.

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint “forearm” to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (−1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior. PMID:23094042

  13. Stochastic Scheduling and Planning Using Reinforcement Learning

    DTIC Science & Technology

    2007-11-02

    reinforcement learning (RL) methods to large-scale optimization problems relevant to Air Force operations planning, scheduling, and maintenance. The objectives of this project were to: (1) investigate the utility of RL on large-scale logistics problems; (2) extend existing RL theory and practice to enhance the ease of application and the performance of RL on these problems; and (3) explore new problem formulations in order to take maximal advantage of RL methods. A method using RL to modify local search cost functions was developed and shown to yield significant

  14. Learning processes affecting human decision making: An assessment of reinforcer-selective Pavlovian-to-instrumental transfer following reinforcer devaluation.

    PubMed

    Allman, Melissa J; DeLeon, Iser G; Cataldo, Michael F; Holland, Peter C; Johnson, Alexander W

    2010-07-01

    In reinforcer-selective transfer, Pavlovian stimuli that are predictive of specific outcomes bias performance toward responses associated with those outcomes. Although this phenomenon has been extensively examined in rodents, recent assessments have extended to humans. Using a stock market paradigm adults were trained to associate particular symbols and responses with particular currencies. During the first test, individuals showed a preference for responding on actions associated with the same outcome as that predicted by the presented stimulus (i.e., a reinforcer-selective transfer effect). In the second test of the experiment, one of the currencies was devalued. We found it notable that this served to reduce responses to those stimuli associated with the devalued currency. This finding is in contrast to that typically observed in rodent studies, and suggests that participants in this task represented the sensory features that differentiate the reinforcers and their value during reinforcer-selective transfer. These results are discussed in terms of implications for understanding associative learning processes in humans and the ability of reward-paired cues to direct adaptive and maladaptive behavior.

  15. Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback.

    PubMed

    Tan, A H; Lu, N; Xiao, D

    2008-02-01

    This paper presents a neural architecture for learning category nodes encoding mappings across multimodal patterns involving sensory inputs, actions, and rewards. By integrating adaptive resonance theory (ART) and temporal difference (TD) methods, the proposed neural model, called TD fusion architecture for learning, cognition, and navigation (TD-FALCON), enables an autonomous agent to adapt and function in a dynamic environment with immediate as well as delayed evaluative feedback (reinforcement) signals. TD-FALCON learns the value functions of the state-action space estimated through on-policy and off-policy TD learning methods, specifically state-action-reward-state-action (SARSA) and Q-learning. The learned value functions are then used to determine the optimal actions based on an action selection policy. We have developed TD-FALCON systems using various TD learning strategies and compared their performance in terms of task completion, learning speed, as well as time and space efficiency. Experiments based on a minefield navigation task have shown that TD-FALCON systems are able to learn effectively with both immediate and delayed reinforcement and achieve a stable performance in a pace much faster than those of standard gradient-descent-based reinforcement learning systems.
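    The on-policy/off-policy distinction between the two TD methods used by TD-FALCON comes down to the bootstrap target. A tabular sketch makes it concrete (the ART-based function approximation of the paper is not reproduced here):

    ```python
    def sarsa_target(q, s_next, a_next, reward, gamma=0.9):
        """On-policy (SARSA): bootstrap from the action actually selected next."""
        return reward + gamma * q[s_next][a_next]

    def q_learning_target(q, s_next, reward, gamma=0.9):
        """Off-policy (Q-learning): bootstrap from the greedy next action."""
        return reward + gamma * max(q[s_next])

    q = {"s2": [0.2, 0.8]}
    t_sarsa = sarsa_target(q, "s2", 0, reward=1.0)     # follows exploratory action 0
    t_qlearn = q_learning_target(q, "s2", reward=1.0)  # assumes greedy action 1
    ```

    When the next action is exploratory, the two targets diverge: SARSA's target reflects the policy actually being followed, while Q-learning's reflects the greedy policy.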

  16. Seizure Control in a Computational Model Using a Reinforcement Learning Stimulation Paradigm.

    PubMed

    Nagaraj, Vivek; Lamperski, Andrew; Netoff, Theoden I

    2016-11-02

    Neuromodulation technologies such as vagus nerve stimulation and deep brain stimulation have shown some efficacy in controlling seizures in medically intractable patients. However, inherent patient-to-patient variability of seizure disorders leads to a wide range of therapeutic efficacy. A patient-specific approach to determining stimulation parameters may lead to increased therapeutic efficacy while minimizing stimulation energy and side effects. This paper presents a reinforcement learning algorithm that optimizes stimulation frequency for controlling seizures with minimum stimulation energy. We apply our method to a computational model called the Epileptor, which simulates inter-ictal and ictal local field potential data. In order to apply reinforcement learning to the Epileptor, we introduce a specialized reward function and state-space discretization. With the reward function and discretization fixed, we test the effectiveness of the temporal difference reinforcement learning algorithm (TD(0)). For periodic pulsatile stimulation, we derive a relation that describes, for any stimulation frequency, the minimal pulse amplitude required to suppress seizures. The TD(0) algorithm is able to identify seizure-controlling parameters quickly. Additionally, our results show that the TD(0) algorithm refines the stimulation frequency to minimize stimulation energy, thereby converging to optimal parameters reliably. An advantage of the TD(0) algorithm is that it is adaptive, so the parameters necessary to control the seizures can change over time. We show that the algorithm can converge on the optimal solution in simulation with slow and fast inter-seizure intervals.
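    The TD(0) rule at the core of such a controller is the standard tabular state-value update. This is a generic sketch with hypothetical state labels; the paper's reward function and state-space discretization for the Epileptor are not reproduced:

    ```python
    def td0_update(v, s, reward, s_next, alpha=0.1, gamma=0.99):
        """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
        delta = reward + gamma * v[s_next] - v[s]
        v[s] += alpha * delta
        return delta

    v = {"ictal": 0.0, "inter_ictal": 0.0}
    # Transitioning out of a seizure state earns a positive (illustrative) reward.
    delta = td0_update(v, "ictal", reward=1.0, s_next="inter_ictal")
    ```

    Because the update runs continuously, the controller can track slow drifts in the patient-specific dynamics, which is the adaptivity property highlighted in the abstract.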

  17. Value learning and arousal in the extinction of probabilistic rewards: the role of dopamine in a modified temporal difference model.

    PubMed

    Song, Minryung R; Fellous, Jean-Marc

    2014-01-01

    Because most rewarding events are probabilistic and changing, the extinction of probabilistic rewards is important for survival. It has been proposed that the extinction of probabilistic rewards depends on arousal and the amount of learning of reward values. Midbrain dopamine neurons were suggested to play a role in both arousal and learning reward values. Despite extensive research on modeling dopaminergic activity in reward learning (e.g. temporal difference models), few studies have been done on modeling its role in arousal. Although temporal difference models capture key characteristics of dopaminergic activity during the extinction of deterministic rewards, they have been less successful at simulating the extinction of probabilistic rewards. By adding an arousal signal to a temporal difference model, we were able to simulate the extinction of probabilistic rewards and its dependence on the amount of learning. Our simulations propose that arousal allows the probability of reward to have lasting effects on the updating of reward value, which slows the extinction of low probability rewards. Using this model, we predicted that, by signaling the prediction error, dopamine determines the learned reward value that has to be extinguished during extinction and participates in regulating the size of the arousal signal that controls the learning rate. These predictions were supported by pharmacological experiments in rats.
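    The proposal above, in which arousal lets reward probability have lasting effects by controlling the learning rate, can be sketched as a TD update whose effective learning rate is scaled by an arousal signal. The multiplicative coupling is an assumption for illustration, not the authors' exact model:

    ```python
    def arousal_td_update(v, s, reward, v_next, arousal, base_alpha=0.2, gamma=0.95):
        """TD value update with arousal scaling the effective learning rate.

        arousal in [0, 1]: low arousal slows updating, so the learned value of
        a low-probability reward is extinguished more slowly.
        """
        delta = reward + gamma * v_next - v[s]
        v[s] += base_alpha * arousal * delta
        return delta

    v_low = {"cue": 1.0}
    v_high = {"cue": 1.0}
    arousal_td_update(v_low, "cue", reward=0.0, v_next=0.0, arousal=0.2)   # extinction, low arousal
    arousal_td_update(v_high, "cue", reward=0.0, v_next=0.0, arousal=1.0)  # extinction, high arousal
    ```

    After one omitted reward the high-arousal value drops further than the low-arousal one, reproducing the qualitative prediction that extinction speed depends on the arousal signal.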

  18. Dopamine-mediated reinforcement learning signals in the striatum and ventromedial prefrontal cortex underlie value-based choices.

    PubMed

    Jocham, Gerhard; Klein, Tilmann A; Ullsperger, Markus

    2011-02-02

    A large body of evidence exists on the role of dopamine in reinforcement learning. Less is known about how dopamine shapes the relative impact of positive and negative outcomes to guide value-based choices. We combined administration of the dopamine D(2) receptor antagonist amisulpride with functional magnetic resonance imaging in healthy human volunteers. Amisulpride did not affect initial reinforcement learning. However, in a later transfer phase that involved novel choice situations requiring decisions between two symbols based on their previously learned values, amisulpride improved participants' ability to select the better of two highly rewarding options, while it had no effect on choices between two very poor options. During the learning phase, activity in the striatum encoded a reward prediction error. In the transfer phase, in the absence of any outcome, ventromedial prefrontal cortex (vmPFC) continually tracked the learned value of the available options on each trial. Both striatal prediction error coding and tracking of learned value in the vmPFC were predictive of subjects' choice performance in the transfer phase, and both were enhanced under amisulpride. These findings show that dopamine-dependent mechanisms enhance reinforcement learning signals in the striatum and sharpen representations of associative values in prefrontal cortex that are used to guide reinforcement-based decisions.

  19. The Roles of Dopamine and Related Compounds in Reward-Seeking Behavior Across Animal Phyla

    PubMed Central

    Barron, Andrew B.; Søvik, Eirik; Cornish, Jennifer L.

    2010-01-01

    Motile animals actively seek out and gather resources they find rewarding, and this is an extremely powerful organizer and motivator of animal behavior. Mammalian studies have revealed interconnected neurobiological systems for reward learning, reward assessment, reinforcement and reward-seeking; all involving the biogenic amine dopamine. The neurobiology of reward-seeking behavioral systems is less well understood in invertebrates, but in many diverse invertebrate groups, reward learning and responses to food rewards also involve dopamine. The obvious exceptions are the arthropods in which the chemically related biogenic amine octopamine has a greater effect on reward learning and reinforcement than dopamine. Here we review the functions of these biogenic amines in behavioral responses to rewards in different animal groups, and discuss these findings in an evolutionary context. PMID:21048897

  20. Vicarious reinforcement learning signals when instructing others.

    PubMed

    Apps, Matthew A J; Lesage, Elise; Ramnani, Narender

    2015-02-18

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action-outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors.

  1. Vicarious Reinforcement Learning Signals When Instructing Others

    PubMed Central

    Apps, Matthew A. J.; Lesage, Elise; Ramnani, Narender

    2015-01-01

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action–outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors. PMID:25698730

  2. Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats

    PubMed Central

    Lloyd, Kevin; Becker, Nadine; Jones, Matthew W.; Bogacz, Rafal

    2012-01-01

    Learning to form appropriate, task-relevant working memory representations is a complex process central to cognition. Gating models frame working memory as a collection of past observations and use reinforcement learning (RL) to solve the problem of when to update these observations. Investigation of how gating models relate to brain and behavior remains, however, at an early stage. The current study sought to explore the ability of simple RL gating models to replicate rule learning behavior in rats. Rats were trained in a maze-based spatial learning task that required animals to make trial-by-trial choices contingent upon their previous experience. Using an abstract version of this task, we tested the ability of two gating algorithms, one based on the Actor-Critic and the other on the State-Action-Reward-State-Action (SARSA) algorithm, to generate behavior consistent with the rats'. Both models produced rule-acquisition behavior consistent with the experimental data, though only the SARSA gating model mirrored faster learning following rule reversal. We also found that both gating models learned multiple strategies in solving the initial task, a property which highlights the multi-agent nature of such models and which is of importance in considering the neural basis of individual differences in behavior. PMID:23115551

  3. A Neurogenetic Dissociation between Punishment-, Reward-, and Relief-Learning in Drosophila

    PubMed Central

    Yarali, Ayse; Gerber, Bertram

    2010-01-01

    What is particularly worth remembering about a traumatic experience is what brought it about, and what made it cease. For example, fruit flies avoid an odor which during training had preceded electric shock punishment; on the other hand, if the odor had followed shock during training, it is later on approached as a signal for the relieving end of shock. We provide a neurogenetic analysis of such relief learning. Blocking, using UAS-shibirets1, the output from a particular set of dopaminergic neurons defined by the TH-Gal4 driver partially impaired punishment learning, but left relief learning intact. Thus, with respect to these particular neurons, relief learning differs from punishment learning. Targeting another set of dopaminergic/serotonergic neurons defined by the DDC-Gal4 driver on the other hand affected neither punishment nor relief learning. As for the octopaminergic system, the tbhM18 mutation, compromising octopamine biosynthesis, partially impaired sugar-reward learning, but not relief learning. Thus, with respect to this particular mutation, relief learning, and reward learning are dissociated. Finally, blocking output from the set of octopaminergic/tyraminergic neurons defined by the TDC2-Gal4 driver affected neither reward, nor relief learning. We conclude that regarding the used genetic tools, relief learning is neurogenetically dissociated from both punishment and reward learning. This may be a message relevant also for analyses of relief learning in other experimental systems including man. PMID:21206762

  4. How Food as a Reward Is Detrimental to Children's Health, Learning, and Behavior

    ERIC Educational Resources Information Center

    Fedewa, Alicia L.; Davis, Matthew Cody

    2015-01-01

    Background: Despite small- and wide-scale prevention efforts to curb obesity, the percentage of children classified as overweight and obese has remained relatively consistent in the last decade. As school personnel are increasingly pressured to enhance student performance, many educators use food as a reward to motivate and reinforce positive…

  5. Stochastic reinforcement benefits skill acquisition.

    PubMed

    Dayan, Eran; Averbeck, Bruno B; Richmond, Barry J; Cohen, Leonardo G

    2014-02-14

    Learning complex skills is driven by reinforcement, which facilitates both online within-session gains and retention of the acquired skills. Yet, in ecologically relevant situations, skills are often acquired when mapping between actions and rewarding outcomes is unknown to the learning agent, resulting in reinforcement schedules of a stochastic nature. Here we trained subjects on a visuomotor learning task, comparing reinforcement schedules with higher, lower, or no stochasticity. Training under higher levels of stochastic reinforcement benefited skill acquisition, enhancing both online gains and long-term retention. These findings indicate that the enhancing effects of reinforcement on skill acquisition depend on reinforcement schedules.

  6. Extraversion differentiates between model-based and model-free strategies in a reinforcement learning task

    PubMed Central

    Skatova, Anya; Chan, Patricia A.; Daw, Nathaniel D.

    2013-01-01

    Prominent computational models describe a neural mechanism for learning from reward prediction errors, and it has been suggested that variations in this mechanism are reflected in personality factors such as trait extraversion. However, although trait extraversion has been linked to improved reward learning, it is not yet known whether this relationship is selective for the particular computational strategy associated with error-driven learning, known as model-free reinforcement learning, vs. another strategy, model-based learning, which the brain is also known to employ. In the present study we test this relationship by examining whether humans' scores on an extraversion scale predict individual differences in the balance between model-based and model-free learning strategies in a sequentially structured decision task designed to distinguish between them. In previous studies with this task, participants have shown a combination of both types of learning, but with substantial individual variation in the balance between them. In the current study, extraversion predicted worse behavior across both sorts of learning. However, the hypothesis that extraverts would be selectively better at model-free reinforcement learning held up among a subset of the more engaged participants, and overall, higher task engagement was associated with a more selective pattern by which extraversion predicted better model-free learning. The findings indicate a relationship between a broad personality orientation and detailed computational learning mechanisms. Results like those in the present study suggest an intriguing and rich relationship between core neuro-computational mechanisms and broader life orientations and outcomes. PMID:24027514

  7. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia

    PubMed Central

    Markou, Athina; Salamone, John D.; Bussey, Timothy; Mar, Adam; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-01-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. PMID:23994273

  8. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia.

    PubMed

    Markou, Athina; Salamone, John D; Bussey, Timothy J; Mar, Adam C; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-11-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia initiative (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits.

  9. Connectionist reinforcement learning of robot control skills

    NASA Astrophysics Data System (ADS)

    Araújo, Rui; Nunes, Urbano; de Almeida, A. T.

    1998-07-01

Many robot manipulator tasks are difficult to model explicitly, which makes it difficult to design and program automatic control algorithms for them. The development, improvement, and application of learning techniques that take advantage of sensory information would enable the acquisition of new robot skills and avoid some of the difficulties of explicit programming. In this paper we use a reinforcement learning approach for on-line generation of skills for the control of robot manipulator systems. Instead of generating skills by explicitly programming a perception-to-action mapping, they are generated by trial-and-error learning guided by a performance-evaluation feedback function. The resulting system may be seen as an anticipatory system that constructs an internal representation model of itself and of its environment. This enables it to identify its current situation and to generate the corresponding appropriate commands to the system in order to perform the required skill. The method was applied to the problem of learning a force-control skill in which the tool-tip of a robot manipulator must be moved from free space to contact with a compliant surface while maintaining a constant interaction force.

  10. Stochastic Reinforcement Benefits Skill Acquisition

    ERIC Educational Resources Information Center

    Dayan, Eran; Averbeck, Bruno B.; Richmond, Barry J.; Cohen, Leonardo G.

    2014-01-01

    Learning complex skills is driven by reinforcement, which facilitates both online within-session gains and retention of the acquired skills. Yet, in ecologically relevant situations, skills are often acquired when mapping between actions and rewarding outcomes is unknown to the learning agent, resulting in reinforcement schedules of a stochastic…

  11. DAT isn't all that: cocaine reward and reinforcement require Toll-like receptor 4 signaling.

    PubMed

    Northcutt, A L; Hutchinson, M R; Wang, X; Baratta, M V; Hiranita, T; Cochran, T A; Pomrenze, M B; Galer, E L; Kopajtic, T A; Li, C M; Amat, J; Larson, G; Cooper, D C; Huang, Y; O'Neill, C E; Yin, H; Zahniser, N R; Katz, J L; Rice, K C; Maier, S F; Bachtell, R K; Watkins, L R

    2015-12-01

    The initial reinforcing properties of drugs of abuse, such as cocaine, are largely attributed to their ability to activate the mesolimbic dopamine system. Resulting increases in extracellular dopamine in the nucleus accumbens (NAc) are traditionally thought to result from cocaine's ability to block dopamine transporters (DATs). Here we demonstrate that cocaine also interacts with the immunosurveillance receptor complex, Toll-like receptor 4 (TLR4), on microglial cells to initiate central innate immune signaling. Disruption of cocaine signaling at TLR4 suppresses cocaine-induced extracellular dopamine in the NAc, as well as cocaine conditioned place preference and cocaine self-administration. These results provide a novel understanding of the neurobiological mechanisms underlying cocaine reward/reinforcement that includes a critical role for central immune signaling, and offer a new target for medication development for cocaine abuse treatment.

  12. Reinforcement Learning for the Adaptive Control of Perception and Action

    DTIC Science & Technology

    1992-02-01

    This dissertation applies reinforcement learning to the adaptive control of active sensory-motor systems. Active sensory-motor systems, in addition...distinct states in the external world. This phenomenon, called perceptual aliasing, is shown to destabilize existing reinforcement learning algorithms

  13. Reinforcement of Science Learning through Local Culture: A Delphi Study

    ERIC Educational Resources Information Center

    Nuangchalerm, Prasart

    2008-01-01

    This study aims to explore the ways to reinforce science learning through local culture by using Delphi technique. Twenty four participants in various fields of study were selected. The result of study provides a framework for reinforcement of science learning through local culture on the theme life and environment. (Contains 1 table.)

  14. Punishment Insensitivity and Impaired Reinforcement Learning in Preschoolers

    ERIC Educational Resources Information Center

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2014-01-01

    Background: Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a…

  15. A Discussion of Possibility of Reinforcement Learning Using Event-Related Potential in BCI

    NASA Astrophysics Data System (ADS)

    Yamagishi, Yuya; Tsubone, Tadashi; Wada, Yasuhiro

Recently, brain-computer interfaces (BCIs), which provide a direct pathway between the human brain and an external device such as a computer or a robot, have attracted considerable attention. Because a BCI can control machines such as robots from brain activity alone, without the use of voluntary muscles, it may become a useful communication tool for handicapped persons, for instance, patients with amyotrophic lateral sclerosis. However, in order to realize a BCI system that can perform precise tasks in various environments, it is necessary to design control rules that adapt to dynamic environments. Reinforcement learning is one approach to the design of such control rules. If this reinforcement learning could be driven by brain activity itself, it would lead to a BCI with general versatility. In this research, we focused on the P300 component of the event-related potential as an alternative signal for the reward in reinforcement learning. Using a proposed discrimination algorithm based on a support vector machine, we discriminated between success and failure trials from single-trial P300 EEG recordings. The possibility of reinforcement learning was examined in terms of the number of correctly discriminated trials. The results showed that learning was possible in most subjects.

  16. Differential Modulation of Reinforcement Learning by D2 Dopamine and NMDA Glutamate Receptor Antagonism

    PubMed Central

    Klein, Tilmann A.; Ullsperger, Markus

    2014-01-01

The firing pattern of midbrain dopamine (DA) neurons is well known to reflect reward prediction errors (PEs), the difference between obtained and expected rewards. The PE is thought to be a crucial signal for instrumental learning, and interference with DA transmission impairs learning. Phasic increases of DA neuron firing during positive PEs are driven by activation of NMDA receptors, whereas phasic suppression of firing during negative PEs is likely mediated by inputs from the lateral habenula. We aimed to determine the contribution of DA D2-class and NMDA receptors to appetitively and aversively motivated reinforcement learning. Healthy human volunteers were scanned with functional magnetic resonance imaging while they performed an instrumental learning task under the influence of either the DA D2 receptor antagonist amisulpride (400 mg), the NMDA receptor antagonist memantine (20 mg), or placebo. Participants quickly learned to select (“approach”) rewarding and to reject (“avoid”) punishing options. Amisulpride impaired both approach and avoidance learning, while memantine mildly attenuated approach learning but had no effect on avoidance learning. These behavioral effects of the antagonists were paralleled by their modulation of striatal PEs. Amisulpride reduced both appetitive and aversive PEs, while memantine diminished appetitive, but not aversive PEs. These data suggest that striatal D2-class receptors contribute to both approach and avoidance learning by detecting both the phasic DA increases and decreases during appetitive and aversive PEs. NMDA receptors, in contrast, appear to be required only for approach learning because phasic DA increases during positive PEs are NMDA dependent, whereas phasic decreases during negative PEs are not. PMID:25253860

  17. Reinforcement learning and counterfactual reasoning explain adaptive behavior in a changing environment.

    PubMed

    Zhang, Yunfeng; Paik, Jaehyon; Pirolli, Peter

    2015-04-01

    Animals routinely adapt to changes in the environment in order to survive. Though reinforcement learning may play a role in such adaptation, it is not clear that it is the only mechanism involved, as it is not well suited to producing rapid, relatively immediate changes in strategies in response to environmental changes. This research proposes that counterfactual reasoning might be an additional mechanism that facilitates change detection. An experiment is conducted in which a task state changes over time and the participants had to detect the changes in order to perform well and gain monetary rewards. A cognitive model is constructed that incorporates reinforcement learning with counterfactual reasoning to help quickly adjust the utility of task strategies in response to changes. The results show that the model can accurately explain human data and that counterfactual reasoning is key to reproducing the various effects observed in this change detection paradigm.

  18. Disentangling pleasure from incentive salience and learning signals in brain reward circuitry.

    PubMed

    Smith, Kyle S; Berridge, Kent C; Aldridge, J Wayne

    2011-07-05

Multiple signals for reward (hedonic impact, motivation, and learned associative prediction) are funneled through brain mesocorticolimbic circuits involving the nucleus accumbens and ventral pallidum. Here, we show how the hedonic "liking" and motivation "wanting" signals for a sweet reward are distinctly modulated and tracked in this circuit separately from signals for Pavlovian predictions (learning). Animals first learned to associate a fixed sequence of Pavlovian cues with sucrose reward. Subsequent intraaccumbens microinjections of an opioid-stimulating drug increased the hedonic liking impact of sucrose in behavior and firing signals of ventral pallidum neurons, and likewise, they increased incentive salience signals in firing to the reward-proximal incentive cue (but did not alter firing signals to the learned prediction value of a reward-distal cue). Microinjection of a dopamine-stimulating drug instead enhanced only the motivation component but did not alter hedonic impact or learned prediction signals. Different dedicated neuronal subpopulations in the ventral pallidum tracked signal enhancements for hedonic impact vs. incentive salience, and a faster firing pattern also distinguished incentive signals from slower hedonic signals, even for a third overlapping population. These results reveal separate neural representations of wanting, liking, and prediction components of the same reward within the nucleus accumbens to ventral pallidum segment of mesocorticolimbic circuitry.

  19. Choice as a function of reinforcer "hold": from probability learning to concurrent reinforcement.

    PubMed

    Jensen, Greg; Neuringer, Allen

    2008-10-01

    Two procedures commonly used to study choice are concurrent reinforcement and probability learning. Under concurrent-reinforcement procedures, once a reinforcer is scheduled, it remains available indefinitely until collected. Therefore reinforcement becomes increasingly likely with passage of time or responses on other operanda. Under probability learning, reinforcer probabilities are constant and independent of passage of time or responses. Therefore a particular reinforcer is gained or not, on the basis of a single response, and potential reinforcers are not retained, as when betting at a roulette wheel. In the "real" world, continued availability of reinforcers often lies between these two extremes, with potential reinforcers being lost owing to competition, maturation, decay, and random scatter. The authors parametrically manipulated the likelihood of continued reinforcer availability, defined as hold, and examined the effects on pigeons' choices. Choices varied as power functions of obtained reinforcers under all values of hold. Stochastic models provided generally good descriptions of choice emissions with deviations from stochasticity systematically related to hold. Thus, a single set of principles accounted for choices across hold values that represent a wide range of real-world conditions.

  20. Affective modulation of the startle reflex and the Reinforcement Sensitivity Theory of personality: The role of sensitivity to reward.

    PubMed

    Aluja, Anton; Blanch, Angel; Blanco, Eduardo; Balada, Ferran

    2015-01-01

    This study evaluated differences in the amplitude of startle reflex and Sensitivity to Reward (SR) and Sensitivity to Punishment (SP) personality variables of the Reinforcement Sensitivity Theory (RST). We hypothesized that subjects with higher scores in SR would obtain a higher startle reflex when exposed to pleasant pictures than lower scores, while higher scores in SP would obtain a higher startle reflex when exposed to unpleasant pictures than subjects with lower scores in this dimension. The sample consisted of 112 healthy female undergraduate psychology students. Personality was assessed using the short version of the Sensitivity to Punishment and Sensitivity Reward Questionnaire (SPSRQ). Laboratory anxiety was controlled by the State Anxiety Inventory. The startle blink reflex was recorded electromyographically (EMG) from the right orbicularis oculi muscle as a response to the International Affective Picture System (IAPS) pleasant, neutral and unpleasant pictures. Subjects higher in SR obtained a significant higher startle reflex response in pleasant pictures than lower scorers (48.48 vs 46.28, p<0.012). Subjects with higher scores in SP showed a light tendency of higher startle responses in unpleasant pictures in a non-parametric local regression graphical analysis (LOESS). The findings shed light on the relationships among the impulsive-disinhibited personality, including sensitivity to reward and emotions evoked through pictures of emotional content.

  1. Instrumental Learning in Preschool Children as a Function of Type of Task, Type of Reward, and Some Organismic Variables

    ERIC Educational Resources Information Center

    Clarke, Alex M.; And Others

    1974-01-01

    The effects on instrumental behavior of differences in type of task, type of reward and three organismic variables (history of social reinforcement from peers, extraversion, and intelligence) were investigated in preschool children. (ST)

  2. Safe Exploration Algorithms for Reinforcement Learning Controllers.

    PubMed

    Mannucci, Tommaso; van Kampen, Erik-Jan; de Visser, Cornelis; Chu, Qiping

    2017-02-06

Self-learning approaches, such as reinforcement learning, offer new possibilities for autonomous control of uncertain or time-varying systems. However, exploring an unknown environment under limited prediction capabilities is a challenge for a learning agent. If the environment is dangerous, free exploration can result in physical damage or in otherwise unacceptable behavior. With respect to existing methods, the main contribution of this paper is the definition of a new approach that requires neither global safety functions nor specific formulations of the dynamics or of the environment, but relies on interval estimation of the dynamics of the agent during the exploration phase, assuming a limited capability of the agent to perceive the presence of incoming fatal states. Two algorithms are presented with this approach. The first is the Safety Handling Exploration with Risk Perception Algorithm (SHERPA), which provides safety by identifying temporary safety functions, called backups. SHERPA is demonstrated on a simulated, simplified quadrotor task, in which dangerous states are avoided. The second algorithm, named OptiSHERPA, uses safety metrics to safely handle more dynamically complex systems for which SHERPA is not sufficient. An application of OptiSHERPA is simulated on an aircraft altitude control task.

  3. Reinforcement learning in complementarity game and population dynamics.

    PubMed

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005)] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This gives insight into aspects like the issue of quick adaptation as opposed to systematic exploration or the role of learning rates.
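    The core of a Roth-Erev scheme can be sketched as follows. This is a minimal illustration, not the paper's exact parameterization (forgetting and experimentation terms are omitted), and the placement of the power exponent in the choice rule is an assumption for illustration; `power=1.0` recovers the standard proportional rule.

    ```python
    def choice_probs(propensities, power=1.5):
        # Choice probabilities proportional to propensity**power;
        # power=1.0 gives the standard Roth-Erev choice rule.
        weights = [q ** power for q in propensities]
        total = sum(weights)
        return [w / total for w in weights]

    def roth_erev_update(propensities, action, reward):
        # Standard Roth-Erev reinforcement: add the received payoff
        # to the chosen action's propensity.
        propensities[action] += reward
        return propensities
    ```

    Raising the exponent above 1 sharpens the choice distribution, biasing the learner more strongly toward actions that have accumulated high propensities.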

  4. Repeated electrical stimulation of reward-related brain regions affects cocaine but not "natural" reinforcement.

    PubMed

    Levy, Dino; Shabat-Simon, Maytal; Shalev, Uri; Barnea-Ygael, Noam; Cooper, Ayelet; Zangen, Abraham

    2007-12-19

    Drug addiction is associated with long-lasting neuronal adaptations including alterations in dopamine and glutamate receptors in the brain reward system. Treatment strategies for cocaine addiction and especially the prevention of craving and relapse are limited, and their effectiveness is still questionable. We hypothesized that repeated stimulation of the brain reward system can induce localized neuronal adaptations that may either potentiate or reduce addictive behaviors. The present study was designed to test how repeated interference with the brain reward system using localized electrical stimulation of the medial forebrain bundle at the lateral hypothalamus (LH) or the prefrontal cortex (PFC) affects cocaine addiction-associated behaviors and some of the neuronal adaptations induced by repeated exposure to cocaine. Repeated high-frequency stimulation in either site influenced cocaine, but not sucrose reward-related behaviors. Stimulation of the LH reduced cue-induced seeking behavior, whereas stimulation of the PFC reduced both cocaine-seeking behavior and the motivation for its consumption. The behavioral findings were accompanied by glutamate receptor subtype alterations in the nucleus accumbens and the ventral tegmental area, both key structures of the reward system. It is therefore suggested that repeated electrical stimulation of the PFC can become a novel strategy for treating addiction.

  5. Comparing rewarding and reinforcing properties between 'bath salt' 3,4-methylenedioxypyrovalerone (MDPV) and cocaine using ultrasonic vocalizations in rats.

    PubMed

    Simmons, Steven J; Gregg, Ryan A; Tran, Fionya H; Mo, Lili; von Weltin, Eva; Barker, David J; Gentile, Taylor A; Watterson, Lucas R; Rawls, Scott M; Muschamp, John W

    2016-12-01

Abuse of synthetic psychostimulants such as synthetic cathinones has risen in recent years. 3,4-Methylenedioxypyrovalerone (MDPV) is one such synthetic cathinone that demonstrates a mechanism of action similar to cocaine. Compared to cocaine, MDPV is more potent at blocking dopamine and norepinephrine reuptake and is readily self-administered by rodents. The present study compared the rewarding and reinforcing properties of MDPV and cocaine using systemic injection dose-response and self-administration models. Fifty-kilohertz ultrasonic vocalizations (USVs) were recorded as an index of positive affect throughout the experiments. In Experiment 1, MDPV and cocaine dose-dependently elicited 50-kHz USVs upon systemic injection, but MDPV increased USVs at greater rates and with greater persistence relative to cocaine. In Experiment 2, latency to begin MDPV self-administration was shorter than latency to begin cocaine self-administration, and self-administered MDPV elicited greater and more persistent rates of 50-kHz USVs versus cocaine. MDPV-elicited 50-kHz USVs were sustained over the course of drug load-up, whereas cocaine-elicited USVs waned following initial infusions. Notably, we observed a robust presence of context-elicited 50-kHz USVs in both MDPV and cocaine self-administering rats. Collectively, these data suggest that MDPV has powerful rewarding and reinforcing effects relative to cocaine at one-tenth the dose. Consistent with prior work, we additionally interpret these data as supporting the conclusion that MDPV carries significant abuse risk based on its potency and subjectively positive effects. Future studies will be needed to better refine therapeutic strategies targeted at reducing the rewarding effects of cathinone analogs in efforts to ultimately reduce abuse liability.

  6. Deficient reinforcement learning in medial frontal cortex as a model of dopamine-related motivational deficits in ADHD.

    PubMed

    Silvetti, Massimo; Wiersema, Jan R; Sonuga-Barke, Edmund; Verguts, Tom

    2013-10-01

    Attention Deficit/Hyperactivity Disorder (ADHD) is a pathophysiologically complex and heterogeneous condition with both cognitive and motivational components. We propose a novel computational hypothesis of motivational deficits in ADHD, drawing together recent evidence on the role of anterior cingulate cortex (ACC) and associated mesolimbic dopamine circuits in both reinforcement learning and ADHD. Based on findings of dopamine dysregulation and ACC involvement in ADHD we simulated a lesion in a previously validated computational model of ACC (Reward Value and Prediction Model, RVPM). We explored the effects of the lesion on the processing of reinforcement signals. We tested specific behavioral predictions about the profile of reinforcement-related deficits in ADHD in three experimental contexts; probability tracking task, partial and continuous reward schedules, and immediate versus delayed rewards. In addition, predictions were made at the neurophysiological level. Behavioral and neurophysiological predictions from the RVPM-based lesion-model of motivational dysfunction in ADHD were confirmed by data from previously published studies. RVPM represents a promising model of ADHD reinforcement learning suggesting that ACC dysregulation might play a role in the pathogenesis of motivational deficits in ADHD. However, more behavioral and neurophysiological studies are required to test core predictions of the model. In addition, the interaction with different brain networks underpinning other aspects of ADHD neuropathology (i.e., executive function) needs to be better understood.

  7. The role of GABAB receptors in human reinforcement learning.

    PubMed

    Ort, Andres; Kometer, Michael; Rohde, Judith; Seifritz, Erich; Vollenweider, Franz X

    2014-10-01

Behavioral evidence from human studies suggests that the γ-aminobutyric acid type B receptor (GABAB receptor) agonist baclofen modulates reinforcement learning and reduces craving in patients with addiction spectrum disorders. However, in contrast to the well established role of dopamine in reinforcement learning, the mechanisms by which the GABAB receptor influences reinforcement learning in humans remain completely unknown. To further elucidate this issue, a cross-over, double-blind, placebo-controlled study was performed in healthy human subjects (N=15) to test the effects of baclofen (20 and 50 mg p.o.) on probabilistic reinforcement learning. Outcomes were the feedback-induced P2 component of the event-related potential, the feedback-related negativity, and the P300 component of the event-related potential. Baclofen produced a reduction of P2 amplitude over the course of the experiment, but did not modulate the feedback-related negativity. Furthermore, there was a trend towards increased learning after baclofen administration relative to placebo over the course of the experiment. The present results extend previous theories of reinforcement learning, which focus on the importance of mesolimbic dopamine signaling, and indicate that stimulation of cortical GABAB receptors in a fronto-parietal network leads to better attentional allocation in reinforcement learning. This observation is a first step in our understanding of how baclofen may improve reinforcement learning in healthy subjects. Further studies with larger sample sizes are needed to corroborate this conclusion and to test this effect in patients with addiction spectrum disorders.

  8. Oculomotor learning revisited: a model of reinforcement learning in the basal ganglia incorporating an efference copy of motor actions

    PubMed Central

    Fee, Michale S.

    2012-01-01

    In its simplest formulation, reinforcement learning is based on the idea that if an action taken in a particular context is followed by a favorable outcome, then, in the same context, the tendency to produce that action should be strengthened, or reinforced. While reinforcement learning forms the basis of many current theories of basal ganglia (BG) function, these models do not incorporate distinct computational roles for signals that convey context, and those that convey what action an animal takes. Recent experiments in the songbird suggest that vocal-related BG circuitry receives two functionally distinct excitatory inputs. One input is from a cortical region that carries context information about the current “time” in the motor sequence. The other is an efference copy of motor commands from a separate cortical brain region that generates vocal variability during learning. Based on these findings, I propose here a general model of vertebrate BG function that combines context information with a distinct motor efference copy signal. The signals are integrated by a learning rule in which efference copy inputs gate the potentiation of context inputs (but not efference copy inputs) onto medium spiny neurons in response to a rewarded action. The hypothesis is described in terms of a circuit that implements the learning of visually guided saccades. The model makes testable predictions about the anatomical and functional properties of hypothesized context and efference copy inputs to the striatum from both thalamic and cortical sources. PMID:22754501
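    The proposed gated learning rule, in which an efference copy of the motor command gates potentiation of context inputs (but not efference copy inputs) after a rewarded action, might be caricatured as follows. All names and the scalar learning rate are illustrative, not taken from the paper.

    ```python
    def gated_potentiation(w_context, context, efference, rewarded, lr=0.05):
        # Context weights are potentiated only when (i) the action was
        # rewarded and (ii) the efference-copy input was active;
        # efference-copy weights themselves are never potentiated here.
        if rewarded and efference:
            for i in range(len(w_context)):
                if context[i]:
                    w_context[i] += lr
        return w_context
    ```

    The gating means that "when" (context) and "what" (action) information play distinct computational roles: credit is assigned to the context only for the action that was actually emitted.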

  9. Modulation of spatial attention by goals, statistical learning, and monetary reward

    PubMed Central

    Sha, Li Z.; Remington, Roger W.

    2015-01-01

    This study documented the relative strength of task goals, visual statistical learning, and monetary reward in guiding spatial attention. Using a difficult T-among-L search task, we cued spatial attention to one visual quadrant by (i) instructing people to prioritize it (goal-driven attention), (ii) placing the target frequently there (location probability learning), or (iii) associating that quadrant with greater monetary gain (reward-based attention). Results showed that successful goal-driven attention exerted the strongest influence on search RT. Incidental location probability learning yielded a smaller though still robust effect. Incidental reward learning produced negligible guidance for spatial attention. The 95 % confidence intervals of the three effects were largely nonoverlapping. To understand these results, we simulated the role of location repetition priming in probability cuing and reward learning. Repetition priming underestimated the strength of location probability cuing, suggesting that probability cuing involved long-term statistical learning of how to shift attention. Repetition priming provided a reasonable account for the negligible effect of reward on spatial attention. We propose a multiple-systems view of spatial attention that includes task goals, search habit, and priming as primary drivers of top-down attention. PMID:26105657

  10. Modulation of spatial attention by goals, statistical learning, and monetary reward.

    PubMed

    Jiang, Yuhong V; Sha, Li Z; Remington, Roger W

    2015-10-01

    This study documented the relative strength of task goals, visual statistical learning, and monetary reward in guiding spatial attention. Using a difficult T-among-L search task, we cued spatial attention to one visual quadrant by (i) instructing people to prioritize it (goal-driven attention), (ii) placing the target frequently there (location probability learning), or (iii) associating that quadrant with greater monetary gain (reward-based attention). Results showed that successful goal-driven attention exerted the strongest influence on search RT. Incidental location probability learning yielded a smaller though still robust effect. Incidental reward learning produced negligible guidance for spatial attention. The 95 % confidence intervals of the three effects were largely nonoverlapping. To understand these results, we simulated the role of location repetition priming in probability cuing and reward learning. Repetition priming underestimated the strength of location probability cuing, suggesting that probability cuing involved long-term statistical learning of how to shift attention. Repetition priming provided a reasonable account for the negligible effect of reward on spatial attention. We propose a multiple-systems view of spatial attention that includes task goals, search habit, and priming as primary drivers of top-down attention.

  11. Using Fuzzy Logic for Performance Evaluation in Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.

    1992-01-01

    Current reinforcement learning algorithms require long training periods which generally limit their applicability to small size problems. A new architecture is described which uses fuzzy rules to initialize its two neural networks: a neural network for performance evaluation and another for action selection. This architecture is applied to control of dynamic systems and it is demonstrated that it is possible to start with an approximate prior knowledge and learn to refine it through experiments using reinforcement learning.

  12. Short-term plasticity as cause-effect hypothesis testing in distal reward learning.

    PubMed

    Soltoggio, Andrea

    2015-02-01

    Asynchrony, overlaps, and delays in sensory-motor signals introduce ambiguity as to which stimuli, actions, and rewards are causally related. Only the repetition of reward episodes helps distinguish true cause-effect relationships from coincidental occurrences. In the model proposed here, a novel plasticity rule employs short- and long-term changes to evaluate hypotheses on cause-effect relationships. Transient weights represent hypotheses that are consolidated in long-term memory only when they consistently predict or cause future rewards. The main objective of the model is to preserve existing network topologies when learning with ambiguous information flows. Learning is also improved by biasing the exploration of the stimulus-response space toward actions that in the past occurred before rewards. The model indicates under which conditions beliefs can be consolidated in long-term memory, it suggests a solution to the plasticity-stability dilemma, and proposes an interpretation of the role of short-term plasticity.
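    The idea of transient weights acting as hypotheses that are consolidated into long-term memory only after consistently preceding reward might be sketched as below. The gain, decay rate, and consolidation threshold are invented for illustration and are not the model's actual parameters.

    ```python
    def update_hypothesis(transient, longterm, co_active, rewarded,
                          gain=1.0, decay=0.5, threshold=2.5):
        # Transient weight: strengthened when the stimulus-action pair
        # precedes a reward, otherwise decayed toward zero, so
        # coincidental pairings fade away.
        if co_active and rewarded:
            transient += gain
        else:
            transient *= decay
        # Consolidate into long-term memory only once the hypothesis
        # has repeatedly predicted reward.
        if transient > threshold:
            longterm += transient
            transient = 0.0
        return transient, longterm
    ```

    Because only repeatedly confirmed hypotheses reach long-term memory, the existing network topology is preserved against one-off coincidences, which is the stability half of the plasticity-stability trade-off the abstract mentions.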

  13. Rewarded by Punishment: Reflections on the Disuse of Positive Reinforcement in Education.

    ERIC Educational Resources Information Center

    Maag, John W.

    2001-01-01

    This article delineates the reasons why educators find punishment a more acceptable approach for managing students' challenging behaviors than positive reinforcement. The article argues that educators should plan the occurrence of positive reinforcement to increase appropriate behaviors rather than running the risk of it haphazardly promoting…

  14. Frontal theta links prediction errors to behavioral adaptation in reinforcement learning.

    PubMed

    Cavanagh, James F; Frank, Michael J; Klein, Theresa J; Allen, John J B

    2010-02-15

    Investigations into action monitoring have consistently detailed a frontocentral voltage deflection in the event-related potential (ERP) following the presentation of negatively valenced feedback, sometimes termed the feedback-related negativity (FRN). The FRN has been proposed to reflect a neural response to prediction errors during reinforcement learning, yet the single-trial relationship between neural activity and the quanta of expectation violation remains untested. Although ERP methods are not well suited to single-trial analyses, the FRN has been associated with theta band oscillatory perturbations in the medial prefrontal cortex. Mediofrontal theta oscillations have been previously associated with expectation violation and behavioral adaptation and are well suited to single-trial analysis. Here, we recorded EEG activity during a probabilistic reinforcement learning task and fit the performance data to an abstract computational model (Q-learning) for calculation of single-trial reward prediction errors. Single-trial theta oscillatory activities following feedback were investigated within the context of expectation (prediction error) and adaptation (subsequent reaction time change). Results indicate that interactive medial and lateral frontal theta activities reflect the degree of negative and positive reward prediction error in the service of behavioral adaptation. These different brain areas use prediction error calculations for different behavioral adaptations, with medial frontal theta reflecting the utilization of prediction errors for reaction time slowing (specifically following errors), but lateral frontal theta reflecting prediction errors leading to working memory-related reaction time speeding for the correct choice.
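    The single-trial quantity at the heart of this analysis, the reward prediction error from a fitted Q-learning model, is straightforward to compute. A minimal sketch, assuming a two-choice task and a fixed learning rate (in the study itself the model parameters were fit to each subject's performance data):

    ```python
    # Sketch: single-trial reward prediction errors (RPEs) from a simple
    # Q-learning model of a two-choice probabilistic task. The learning rate
    # and the choice/reward sequences below are illustrative assumptions.

    def q_learning_rpes(choices, rewards, alpha=0.3, n_actions=2):
        """Return the reward prediction error on each trial."""
        q = [0.0] * n_actions
        rpes = []
        for a, r in zip(choices, rewards):
            rpe = r - q[a]         # delta_t = r_t - Q(a_t)
            q[a] += alpha * rpe    # incremental Q-value update
            rpes.append(rpe)
        return rpes
    ```

    The error delta_t is positive for better-than-expected outcomes and negative for worse-than-expected ones, which is the trial-by-trial regressor related to theta power here.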

  15. Prespeech motor learning in a neural network using reinforcement.

    PubMed

    Warlaumont, Anne S; Westermann, Gert; Buder, Eugene H; Oller, D Kimbrough

    2013-02-01

    Vocal motor development in infancy provides a crucial foundation for language development. Some significant early accomplishments include learning to control the process of phonation (the production of sound at the larynx) and learning to produce the sounds of one's language. Previous work has shown that social reinforcement shapes the kinds of vocalizations infants produce. We present a neural network model that provides an account of how vocal learning may be guided by reinforcement. The model consists of a self-organizing map that outputs to muscles of a realistic vocalization synthesizer. Vocalizations are spontaneously produced by the network. If a vocalization meets certain acoustic criteria, it is reinforced, and the weights are updated to make similar muscle activations increasingly likely to recur. We ran simulations of the model under various reinforcement criteria and tested the types of vocalizations it produced after learning in the different conditions. When reinforcement was contingent on the production of phonated (i.e. voiced) sounds, the network's post-learning productions were almost always phonated, whereas when reinforcement was not contingent on phonation, the network's post-learning productions were almost always not phonated. When reinforcement was contingent on both phonation and proximity to English vowels as opposed to Korean vowels, the model's post-learning productions were more likely to resemble the English vowels and vice versa.
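    A hedged sketch of the reinforcement scheme described above, with a threshold on the produced activation standing in for an acoustic criterion such as phonation. The map structure, parameters, and criterion are illustrative assumptions, not the published model:

    ```python
    import random

    # Illustrative sketch: units drive muscle activations; an update toward
    # the produced activation occurs only when the outcome meets the
    # reinforcement criterion (here, activation above a threshold standing
    # in for "phonation"). All parameters are assumptions.

    def train_vocal_map(n_units=10, trials=500, lr=0.1, threshold=0.6, seed=1):
        rng = random.Random(seed)
        weights = [rng.random() for _ in range(n_units)]  # unit -> activation
        for _ in range(trials):
            unit = rng.randrange(n_units)                   # spontaneous production
            activation = weights[unit] + rng.gauss(0, 0.1)  # motor exploration
            if activation > threshold:                      # reinforcement criterion
                weights[unit] += lr * (activation - weights[unit])
        return weights
    ```

    Because only criterion-meeting productions trigger the update, similar muscle activations become increasingly likely to recur, mirroring the contingency manipulated in the simulations.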

  16. Coevolutionary networks of reinforcement-learning agents

    NASA Astrophysics Data System (ADS)

    Kianercy, Ardeshir; Galstyan, Aram

    2013-07-01

    This paper presents a model of network formation in repeated games where the players adapt their strategies and network ties simultaneously using a simple reinforcement-learning scheme. It is demonstrated that the coevolutionary dynamics of such systems can be described via coupled replicator equations. We provide a comprehensive analysis for three-player two-action games, which is the minimum system size with nontrivial structural dynamics. In particular, we characterize the Nash equilibria (NE) in such games and examine the local stability of the rest points corresponding to those equilibria. We also study general n-player networks via both simulations and analytical methods and find that, in the absence of exploration, the stable equilibria consist of star motifs as the main building blocks of the network. Furthermore, in all stable equilibria the agents play pure strategies, even when the game allows mixed NE. Finally, we study the impact of exploration on learning outcomes and observe that there is a critical exploration rate above which the symmetric and uniformly connected network topology becomes stable.

  17. Coevolutionary networks of reinforcement-learning agents.

    PubMed

    Kianercy, Ardeshir; Galstyan, Aram

    2013-07-01

    This paper presents a model of network formation in repeated games where the players adapt their strategies and network ties simultaneously using a simple reinforcement-learning scheme. It is demonstrated that the coevolutionary dynamics of such systems can be described via coupled replicator equations. We provide a comprehensive analysis for three-player two-action games, which is the minimum system size with nontrivial structural dynamics. In particular, we characterize the Nash equilibria (NE) in such games and examine the local stability of the rest points corresponding to those equilibria. We also study general n-player networks via both simulations and analytical methods and find that, in the absence of exploration, the stable equilibria consist of star motifs as the main building blocks of the network. Furthermore, in all stable equilibria the agents play pure strategies, even when the game allows mixed NE. Finally, we study the impact of exploration on learning outcomes and observe that there is a critical exploration rate above which the symmetric and uniformly connected network topology becomes stable.
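    For Q-learning with Boltzmann exploration, the coupled replicator equations referred to in both records take, per agent, the form dx_i/dt = x_i[(Ax)_i - x.Ax] + T * x_i * sum_j x_j ln(x_j/x_i), where A is the payoff matrix and T the exploration rate. A minimal Euler-step sketch (the payoffs and step size below are illustrative assumptions):

    ```python
    import math

    # One Euler step of exploration-augmented replicator dynamics: selection
    # toward above-average payoffs plus an entropic exploration term scaled
    # by the exploration rate T.

    def replicator_step(x, A, T=0.0, dt=0.01):
        n = len(x)
        f = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]  # (Ax)_i
        avg = sum(x[i] * f[i] for i in range(n))                       # x.Ax
        new = []
        for i in range(n):
            explore = sum(x[j] * math.log(x[j] / x[i]) for j in range(n))
            new.append(x[i] + dt * (x[i] * (f[i] - avg) + T * x[i] * explore))
        s = sum(new)
        return [v / s for v in new]  # renormalise onto the simplex
    ```

    With T = 0 the better-performing action gains probability mass; a sufficiently large T pulls the strategy toward uniform, consistent with the critical exploration rate reported above.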

  18. Developing PFC representations using reinforcement learning.

    PubMed

    Reynolds, Jeremy R; O'Reilly, Randall C

    2009-12-01

    From both functional and biological considerations, it is widely believed that action production, planning, and goal-oriented behaviors supported by the frontal cortex are organized hierarchically [Fuster (1991); Koechlin, E., Ody, C., & Kouneiher, F. (2003). Neuroscience: The architecture of cognitive control in the human prefrontal cortex. Science, 302, 1181-1185; Miller, G. A., Galanter, E., & Pribram, K. H. (1960). Plans and the structure of behavior. New York: Holt]. However, the nature of the different levels of the hierarchy remains unclear, and little attention has been paid to the origins of such a hierarchy. We address these issues through biologically inspired computational models that develop representations through reinforcement learning. We explore several different factors in these models that might plausibly give rise to a hierarchical organization of representations within the PFC, including an initial connectivity hierarchy within PFC, a hierarchical set of connections between PFC and subcortical structures controlling it, and differential synaptic plasticity schedules. Simulation results indicate that architectural constraints contribute to the segregation of different types of representations, and that this segregation facilitates learning. These findings are consistent with the idea that there is a functional hierarchy in PFC, as captured in our earlier computational models of PFC function and a growing body of empirical data.

  19. The Use of Rewards in Instructional Digital Games: An Application of Positive Reinforcement

    ERIC Educational Resources Information Center

    Malala, John; Major, Anthony; Maunez-Cuadra, Jose; McCauley-Bell, Pamela

    2007-01-01

    The main argument being presented in this paper is that instructional designers and educational researchers need to shift their attention from performance to interest. Educational digital games have to aim at building lasting interest in real world applications. The main hypothesis advocated in this paper is that the use of rewards in educational…

  20. Modeling Social Structure as Network Effects: Rewards for Learning Improves Performance

    NASA Astrophysics Data System (ADS)

    Hazy, James K.; Tivnan, Brian F.; Schwandt, David R.

    A theoretical representation of social structure in agent-based organizations is developed. To test the model we generated a hypothesis from organizational learning theory and tested it using computational experiments. We found that emergent social structure associated with rewarding agent learning increased collective output over and above pay for performance.

  1. A Computer-Assisted Learning Model Based on the Digital Game Exponential Reward System

    ERIC Educational Resources Information Center

    Moon, Man-Ki; Jahng, Surng-Gahb; Kim, Tae-Yong

    2011-01-01

    The aim of this research was to construct a motivational model which would stimulate voluntary and proactive learning using digital game methods offering players more freedom and control. The theoretical framework of this research lays the foundation for a pedagogical learning model based on digital games. We analyzed the game reward system, which…

  2. Distractions from Teaching and Learning: Lessons from Kentucky's Use of Rewards.

    ERIC Educational Resources Information Center

    Abelmann, Charles H.; Kenyon, Susan B.

    If rewards are to be used as a school-reform tool, their formats must be more closely tailored to the organizational characteristics of schools and to the purpose of improving teaching and learning. This paper describes lessons learned from Kentucky's collective incentive system, the Kentucky Instructional Results Information System (KIRIS). The…

  3. Re-evaluating the role of the orbitofrontal cortex in reward and reinforcement.

    PubMed

    Noonan, M P; Kolling, N; Walton, M E; Rushworth, M F S

    2012-04-01

    The orbitofrontal cortex and adjacent ventromedial prefrontal cortex carry reward representations and mediate flexible behaviour when circumstances change. Here we review how recent experiments in humans and macaques have confirmed the existence of a major difference between the functions of the ventromedial prefrontal cortex and adjacent medial orbitofrontal cortex (mOFC) on the one hand and the lateral orbitofrontal cortex (lOFC) on the other. These differences, however, may not be best accounted for in terms of specializations for reward and error/punishment processing as is commonly assumed. Instead we argue that both lesion and functional magnetic resonance imaging studies reveal that the lOFC is concerned with the assignment of credit for both reward and error outcomes to the choice of specific stimuli and with the linking of specific stimulus representations to representations of specific types of reward outcome. By contrast, we argue that the ventromedial prefrontal cortex/mOFC is concerned with evaluation, value-guided decision-making and maintenance of a choice over successive decisions. Despite the popular view that they cause perseveration of behaviour and inability to inhibit repetition of a previously made choice, we found that lesions in neither orbitofrontal subdivision caused perseveration. On the contrary, lesions in the lOFC made animals switch more rapidly between choices when they were finding it difficult to assign reward values to choices. Lesions in the mOFC caused animals to lose their normal predisposition to repeat previously successful choices, suggesting that the mOFC does not just mediate value comparison in choice but also facilitates maintenance of the same choice if it has been successful.

  4. A hypothesis for basal ganglia-dependent reinforcement learning in the songbird

    PubMed Central

    Fee, Michale S.; Goldberg, Jesse H.

    2011-01-01

    Most of our motor skills are not innately programmed, but are learned by a combination of motor exploration and performance evaluation, suggesting that they proceed through a reinforcement learning (RL) mechanism. Songbirds have emerged as a model system to study how a complex behavioral sequence can be learned through an RL-like strategy. Interestingly, like motor sequence learning in mammals, song learning in birds requires a basal ganglia (BG)-thalamocortical loop, suggesting common neural mechanisms. Here we outline a specific working hypothesis for how BG-forebrain circuits could utilize an internally computed reinforcement signal to direct song learning. Our model includes a number of general concepts borrowed from the mammalian BG literature, including a dopaminergic reward prediction error and dopamine mediated plasticity at corticostriatal synapses. We also invoke a number of conceptual advances arising from recent observations in the songbird. Specifically, there is evidence for a specialized cortical circuit that adds trial-to-trial variability to stereotyped cortical motor programs, and a role for the BG in ‘biasing’ this variability to improve behavioral performance. This BG-dependent ‘premotor bias’ may in turn guide plasticity in downstream cortical synapses to consolidate recently-learned song changes. Given the similarity between mammalian and songbird BG-thalamocortical circuits, our model for the role of the BG in this process may have broader relevance to mammalian BG function. PMID:22015923

  5. SOVEREIGN: An autonomous neural system for incrementally learning planned action sequences to navigate towards a rewarded goal.

    PubMed

    Gnadt, William; Grossberg, Stephen

    2008-06-01

    How do reactive and planned behaviors interact in real time? How are sequences of such behaviors released at appropriate times during autonomous navigation to realize valued goals? Controllers for both animals and mobile robots, or animats, need reactive mechanisms for exploration, and learned plans to reach goal objects once an environment becomes familiar. The SOVEREIGN (Self-Organizing, Vision, Expectation, Recognition, Emotion, Intelligent, Goal-oriented Navigation) animat model embodies these capabilities, and is tested in a 3D virtual reality environment. SOVEREIGN includes several interacting subsystems which model complementary properties of cortical What and Where processing streams and which clarify similarities between mechanisms for navigation and arm movement control. As the animat explores an environment, visual inputs are processed by networks that are sensitive to visual form and motion in the What and Where streams, respectively. Position-invariant and size-invariant recognition categories are learned by real-time incremental learning in the What stream. Estimates of target position relative to the animat are computed in the Where stream, and can activate approach movements toward the target. Motion cues from animat locomotion can elicit head-orienting movements to bring a new target into view. Approach and orienting movements are alternately performed during animat navigation. Cumulative estimates of each movement are derived from interacting proprioceptive and visual cues. Movement sequences are stored within a motor working memory. Sequences of visual categories are stored in a sensory working memory. These working memories trigger learning of sensory and motor sequence categories, or plans, which together control planned movements. Predictively effective chunk combinations are selectively enhanced via reinforcement learning when the animat is rewarded. Selected planning chunks effect a gradual transition from variable reactive exploratory

  6. Oxytocin selectively facilitates learning with social feedback and increases activity and functional connectivity in emotional memory and reward processing regions.

    PubMed

    Hu, Jiehui; Qi, Song; Becker, Benjamin; Luo, Lizhu; Gao, Shan; Gong, Qiyong; Hurlemann, René; Kendrick, Keith M

    2015-06-01

    In male Caucasian subjects, learning is facilitated by receipt of social compared with non-social feedback, and the neuropeptide oxytocin (OXT) facilitates this effect. In this study, we have first shown a cultural difference in that male Chinese subjects actually perform significantly worse in the same reinforcement associated learning task with social (emotional faces) compared with non-social feedback. Nevertheless, in two independent double-blind placebo (PLC) controlled between-subject design experiments we found OXT still selectively facilitated learning with social feedback. Similar to Caucasian subjects this OXT effect was strongest with feedback using female rather than male faces. One experiment performed in conjunction with functional magnetic resonance imaging showed that during the response, but not feedback phase of the task, OXT selectively increased activity in the amygdala, hippocampus, parahippocampal gyrus and putamen during the social feedback condition, and functional connectivity between the amygdala and insula and caudate. Therefore, OXT may be increasing the salience and reward value of anticipated social feedback. In the PLC group, response times and state anxiety scores during social feedback were associated with signal changes in these same regions but not in the OXT group. OXT may therefore have also facilitated learning by reducing anxiety in the social feedback condition. Overall our results provide the first evidence for cultural differences in social facilitation of learning per se, but a similar selective enhancement of learning with social feedback under OXT. This effect of OXT may be associated with enhanced responses and functional connectivity in emotional memory and reward processing regions.

  7. On the Possibility of a Reinforcement Theory of Cognitive Learning.

    ERIC Educational Resources Information Center

    Smith, Kendon

    This paper discusses cognitive learning in terms of reinforcement theory and presents arguments suggesting that a viable theory of cognition based on reinforcement principles is not out of the question. This position is supported by a discussion of the weaknesses of theories based entirely on contiguity and of considerations that are more positive…

  8. Novelty as a Reinforcer for Position Learning in Children

    ERIC Educational Resources Information Center

    Wilson, Marian Monyok

    1974-01-01

    The stimulus-familiarization-effect (SFE) paradigm, a reaction-time (RT) task based on a response-to-novelty procedure, was modified to assess responding for novelty, i.e., a response-reinforcement sequence. The potential implications of attention for reinforcement theory and learning in general are discussed. (Author/CS)

  9. Covert Operant Reinforcement of Remedial Reading Learning Tasks.

    ERIC Educational Resources Information Center

    Schmickley, Verne G.

    The effects of covert operant reinforcement upon remedial reading learning tasks were investigated. Forty junior high school students were taught to imagine either neutral scenes (control) or positive scenes (treatment) upon cue while reading. It was hypothesized that positive covert reinforcement would enhance performance on several measures of…

  10. Knockout crickets for the study of learning and memory: Dopamine receptor Dop1 mediates aversive but not appetitive reinforcement in crickets.

    PubMed

    Awata, Hiroko; Watanabe, Takahito; Hamanaka, Yoshitaka; Mito, Taro; Noji, Sumihare; Mizunami, Makoto

    2015-11-02

    Elucidation of reinforcement mechanisms in associative learning is an important subject in neuroscience. In mammals, dopamine neurons are thought to play critical roles in mediating both appetitive and aversive reinforcement. Our pharmacological studies suggested that octopamine and dopamine neurons mediate reward and punishment, respectively, in crickets, but recent studies in fruit-flies concluded that dopamine neurons mediate both reward and punishment, via the type 1 dopamine receptor Dop1. To resolve the discrepancy between studies in different insect species, we produced Dop1 knockout crickets using the CRISPR/Cas9 system and found that they are defective in aversive learning with sodium chloride punishment but not appetitive learning with water or sucrose reward. The results suggest that dopamine and octopamine neurons mediate aversive and appetitive reinforcement, respectively, in crickets. We suggest unexpected diversity in neurotransmitters mediating appetitive reinforcement between crickets and fruit-flies, although the neurotransmitter mediating aversive reinforcement is conserved. This study demonstrates the usefulness of the CRISPR/Cas9 system for producing knockout animals for the study of learning and memory.

  11. Knockout crickets for the study of learning and memory: Dopamine receptor Dop1 mediates aversive but not appetitive reinforcement in crickets

    PubMed Central

    Awata, Hiroko; Watanabe, Takahito; Hamanaka, Yoshitaka; Mito, Taro; Noji, Sumihare; Mizunami, Makoto

    2015-01-01

    Elucidation of reinforcement mechanisms in associative learning is an important subject in neuroscience. In mammals, dopamine neurons are thought to play critical roles in mediating both appetitive and aversive reinforcement. Our pharmacological studies suggested that octopamine and dopamine neurons mediate reward and punishment, respectively, in crickets, but recent studies in fruit-flies concluded that dopamine neurons mediate both reward and punishment, via the type 1 dopamine receptor Dop1. To resolve the discrepancy between studies in different insect species, we produced Dop1 knockout crickets using the CRISPR/Cas9 system and found that they are defective in aversive learning with sodium chloride punishment but not appetitive learning with water or sucrose reward. The results suggest that dopamine and octopamine neurons mediate aversive and appetitive reinforcement, respectively, in crickets. We suggest unexpected diversity in neurotransmitters mediating appetitive reinforcement between crickets and fruit-flies, although the neurotransmitter mediating aversive reinforcement is conserved. This study demonstrates the usefulness of the CRISPR/Cas9 system for producing knockout animals for the study of learning and memory. PMID:26521965

  12. Hedging Your Bets by Learning Reward Correlations in the Human Brain

    PubMed Central

    Wunderlich, Klaus; Symmonds, Mkael; Bossaerts, Peter; Dolan, Raymond J.

    2011-01-01

    Human subjects are proficient at tracking the mean and variance of rewards and updating these via prediction errors. Here, we addressed whether humans can also learn about higher-order relationships between distinct environmental outcomes, a defining ecological feature of contexts where multiple sources of rewards are available. By manipulating the degree to which distinct outcomes are correlated, we show that subjects implemented an explicit model-based strategy to learn the associated outcome correlations and were adept in using that information to dynamically adjust their choices in a task that required a minimization of outcome variance. Importantly, the experimentally generated outcome correlations were explicitly represented neuronally in right midinsula with a learning prediction error signal expressed in rostral anterior cingulate cortex. Thus, our data show that the human brain represents higher-order correlation structures between rewards, a core adaptive ability whose immediate benefit is optimized sampling. PMID:21943609
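    The variance-minimization objective in this task has a closed form. For two reward sources with standard deviations s1, s2 and correlation rho, the allocation w to the first source that minimizes Var(w*r1 + (1-w)*r2) is the classic minimum-variance hedge, sketched below (the numbers used are illustrative):

    ```python
    # Sketch: the minimum-variance allocation between two correlated reward
    # sources, i.e. the quantity an ideal variance-minimizing subject would
    # track by learning the outcome correlation.

    def min_variance_weight(s1, s2, rho):
        cov = rho * s1 * s2
        return (s2 ** 2 - cov) / (s1 ** 2 + s2 ** 2 - 2 * cov)
    ```

    Learning rho online is what makes the correlation representation behaviorally useful: the optimal split shifts as the estimated correlation changes.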

  13. Dopamine prediction errors in reward learning and addiction: from theory to neural circuitry

    PubMed Central

    Keiflin, Ronald; Janak, Patricia H.

    2015-01-01

    Midbrain dopamine (DA) neurons are proposed to signal reward prediction error (RPE), a fundamental parameter in associative learning models. This RPE hypothesis provides a compelling theoretical framework for understanding DA function in reward learning and addiction. New studies support a causal role for DA-mediated RPE activity in promoting learning about natural reward; however, this question has not been explicitly tested in the context of drug addiction. In this review, we integrate theoretical models with experimental findings on the activity of DA systems, and on the causal role of specific neuronal projections and cell types, to provide a circuit-based framework for probing DA-RPE function in addiction. By examining error-encoding DA neurons in the neural network in which they are embedded, hypotheses regarding circuit-level adaptations that possibly contribute to pathological error-signaling and addiction can be formulated and tested. PMID:26494275

  14. Assessment of rewarding and reinforcing properties of biperiden in conditioned place preference in rats.

    PubMed

    Allahverdiyev, Oruc; Nurten, Asiye; Enginar, Nurhan

    2011-12-01

    Biperiden is one of the most commonly abused anticholinergic drugs. This study assessed its motivational effects in the acquisition of conditioned place preference in rats. Biperiden neither produced place conditioning itself nor enhanced the rewarding effect of morphine. Furthermore, biperiden in combination with haloperidol also did not affect place preference. These findings suggest that biperiden is devoid of abuse potential, at least at the doses used.

  15. Vicarious reinforcement in rhesus macaques (Macaca mulatta).

    PubMed

    Chang, Steve W C; Winecoff, Amy A; Platt, Michael L

    2011-01-01

    What happens to others profoundly influences our own behavior. Such other-regarding outcomes can drive observational learning, as well as motivate cooperation, charity, empathy, and even spite. Vicarious reinforcement may serve as one of the critical mechanisms mediating the influence of other-regarding outcomes on behavior and decision-making in groups. Here we show that rhesus macaques spontaneously derive vicarious reinforcement from observing rewards given to another monkey, and that this reinforcement can motivate them to subsequently deliver or withhold rewards from the other animal. We exploited Pavlovian and instrumental conditioning to associate rewards to self (M1) and/or rewards to another monkey (M2) with visual cues. M1s made more errors in the instrumental trials when cues predicted reward to M2 compared to when cues predicted reward to M1, but made even more errors when cues predicted reward to no one. In subsequent preference tests between pairs of conditioned cues, M1s preferred cues paired with reward to M2 over cues paired with reward to no one. By contrast, M1s preferred cues paired with reward to self over cues paired with reward to both monkeys simultaneously. Rates of attention to M2 strongly predicted the strength and valence of vicarious reinforcement. These patterns of behavior, which were absent in non-social control trials, are consistent with vicarious reinforcement based upon sensitivity to observed, or counterfactual, outcomes with respect to another individual. Vicarious reward may play a critical role in shaping cooperation and competition, as well as motivating observational learning and group coordination in rhesus macaques, much as it does in humans. We propose that vicarious reinforcement signals mediate these behaviors via homologous neural circuits involved in reinforcement learning and decision-making.

  16. Motive to Avoid Success, Locus of Control, and Reinforcement Avoidance.

    ERIC Educational Resources Information Center

    Katovsky, Walter

    Subjects, four groups of 12 college women high or low in motive to avoid success (MAS) and locus of control (LC), were reinforced for response A on a fixed partial reinforcement schedule in three concept learning tasks: one consisting of combined reward and punishment, another of reward only, and one of punishment only. Response B was…

  17. Stress, genotype and norepinephrine in the prediction of mouse behavior using reinforcement learning.

    PubMed

    Luksys, Gediminas; Gerstner, Wulfram; Sandi, Carmen

    2009-09-01

    Individual behavioral performance during learning is known to be affected by modulatory factors, such as stress and motivation, and by genetic predispositions that influence sensitivity to these factors. Despite numerous studies, no integrative framework is available that could predict how a given animal would perform a certain learning task in a realistic situation. We found that a simple reinforcement learning model can predict mouse behavior in a hole-box conditioning task if model metaparameters are dynamically controlled on the basis of the mouse's genotype and phenotype, stress conditions, recent performance feedback and pharmacological manipulations of adrenergic alpha-2 receptors. We find that stress and motivation affect behavioral performance by altering the exploration-exploitation balance in a genotype-dependent manner. Our results also provide computational insights into how an inverted U-shape relation between stress/arousal/norepinephrine levels and behavioral performance could be explained through changes in task performance accuracy and future reward discounting.
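    The exploration-exploitation balance that stress and motivation are found to modulate is typically controlled by a single metaparameter, the inverse temperature of a softmax (Boltzmann) choice rule. A minimal sketch (the values used are illustrative):

    ```python
    import math

    # Softmax (Boltzmann) action selection. The inverse temperature `beta`
    # is the kind of metaparameter that stress or arousal could modulate:
    # higher beta -> exploitation, lower beta -> exploration.

    def softmax_policy(q_values, beta):
        exps = [math.exp(beta * q) for q in q_values]
        z = sum(exps)
        return [e / z for e in exps]
    ```

    Dynamically adjusting beta (and related metaparameters such as the discount factor) as a function of stress condition and genotype is what lets a single model fit the behavioral differences described above.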

  18. Continuous theta-burst stimulation (cTBS) over the lateral prefrontal cortex alters reinforcement learning bias.

    PubMed

    Ott, Derek V M; Ullsperger, Markus; Jocham, Gerhard; Neumann, Jane; Klein, Tilmann A

    2011-07-15

    The prefrontal cortex is known to play a key role in higher-order cognitive functions. Recently, we showed that this brain region is active in reinforcement learning, during which subjects constantly have to integrate trial outcomes in order to optimize performance. To further elucidate the role of the dorsolateral prefrontal cortex (DLPFC) in reinforcement learning, we applied continuous theta-burst stimulation (cTBS) either to the left or right DLPFC, or to the vertex as a control region, respectively, prior to the performance of a probabilistic learning task in an fMRI environment. While there was no influence of cTBS on learning performance per se, we observed a stimulation-dependent modulation of reward vs. punishment sensitivity: Left-hemispherical DLPFC stimulation led to a more reward-guided performance, while right-hemispherical cTBS induced a more avoidance-guided behavior. FMRI results showed enhanced prediction error coding in the ventral striatum in subjects stimulated over the left as compared to the right DLPFC. Both behavioral and imaging results are in line with recent findings that left, but not right-hemispherical stimulation can trigger a release of dopamine in the ventral striatum, which has been suggested to increase the relative impact of rewards rather than punishment on behavior.

  19. Reinforcement learning of motor skills with policy gradients.

    PubMed

    Peters, Jan; Schaal, Stefan

    2008-05-01

    Autonomous learning is one of the hallmarks of human and animal behavior, and understanding the principles of learning will be crucial in order to achieve true autonomy in advanced machines like humanoid robots. In this paper, we examine learning of complex motor skills with human-like limbs. While supervised learning can offer useful tools for bootstrapping behavior, e.g., by learning from demonstration, it is only reinforcement learning that offers a general approach to the final trial-and-error improvement that is needed by each individual acquiring a skill. Neither neurobiological nor machine learning studies have, so far, offered compelling results on how reinforcement learning can be scaled to the high-dimensional continuous state and action spaces of humans or humanoids. Here, we combine two recent research developments on learning motor control in order to achieve this scaling. First, we interpret the idea of modular motor control by means of motor primitives as a suitable way to generate parameterized control policies for reinforcement learning. Second, we combine motor primitives with the theory of stochastic policy gradient learning, which currently seems to be the only feasible framework for reinforcement learning for humanoids. We evaluate different policy gradient methods with a focus on their applicability to parameterized motor primitives. We compare these algorithms in the context of motor primitive learning, and show that our most modern algorithm, the Episodic Natural Actor-Critic outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method in the application of learning to hit a baseball with an anthropomorphic robot arm.
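    A hedged sketch of the episodic policy-gradient idea underlying these methods, using plain REINFORCE with a running-average baseline on a one-parameter Gaussian policy (far simpler than the Episodic Natural Actor-Critic; the reward function and all constants are illustrative assumptions):

    ```python
    import random

    # REINFORCE-style update: sample an action from a Gaussian policy,
    # score the rollout, and move the policy mean along
    # (reward - baseline) * grad log pi(action).

    def reinforce_gaussian(mean=0.0, sigma=0.5, target=1.0,
                           episodes=2000, lr=0.02, seed=0):
        rng = random.Random(seed)
        baseline = 0.0
        for _ in range(episodes):
            action = rng.gauss(mean, sigma)          # trial-and-error rollout
            reward = -(action - target) ** 2         # toy task: hit the target
            baseline += 0.05 * (reward - baseline)   # running-average baseline
            grad_log = (action - mean) / sigma ** 2  # d/d(mean) log N(action)
            mean += lr * (reward - baseline) * grad_log
        return mean
    ```

    The baseline does not change the expected gradient but reduces its variance, which is the same concern that motivates the more sophisticated natural-gradient estimators evaluated in the paper.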

  20. Reinforcement learning, conditioning, and the brain: Successes and challenges.

    PubMed

    Maia, Tiago V

    2009-12-01

    The field of reinforcement learning has greatly influenced the neuroscientific study of conditioning. This article provides an introduction to reinforcement learning, followed by an examination of the successes and challenges of using reinforcement learning to understand the neural bases of conditioning. Successes reviewed include (1) the mapping of positive and negative prediction errors to the firing of dopamine neurons and neurons in the lateral habenula, respectively; (2) the mapping of model-based and model-free reinforcement learning to associative and sensorimotor cortico-basal ganglia-thalamo-cortical circuits, respectively; and (3) the mapping of actor and critic to the dorsal and ventral striatum, respectively. Challenges reviewed consist of several behavioral and neural findings that are at odds with standard reinforcement-learning models, including, among others, evidence for hyperbolic discounting and adaptive coding. The article suggests ways of reconciling reinforcement-learning models with many of the challenging findings, and highlights the need for further theoretical developments where necessary. Additional information related to this study may be downloaded from http://cabn.psychonomic-journals.org/content/supplemental.
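
    The prediction-error mapping in point (1) rests on a simple delta-rule computation; a minimal sketch for a single cue (a Rescorla-Wagner/TD-style update; the learning rate is an illustrative assumption):

```python
def td_prediction_errors(rewards, alpha=0.1):
    """Delta-rule value learning for a single cue: the prediction error
    delta = r - V is the teaching signal mapped onto dopamine firing
    (positive delta: better than expected; negative: worse)."""
    V, deltas = 0.0, []
    for r in rewards:
        delta = r - V        # prediction error
        deltas.append(delta)
        V += alpha * delta   # value moves toward the observed reward
    return V, deltas
```

    As the value estimate converges, the prediction errors shrink toward zero, mirroring how dopamine responses migrate from reward delivery to the earliest reliable predictor.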

  1. Reconsidering Food Reward, Brain Stimulation, and Dopamine: Incentives Act Forward.

    PubMed

    Newquist, Gunnar; Gardner, R Allen

    2015-01-01

    In operant conditioning, rats pressing levers and pigeons pecking keys depend on contingent food reinforcement. Food reward agrees with Skinner's behaviorism, undergraduate textbooks, and folk psychology. However, nearly a century of experimental evidence shows, instead, that food in an operant conditioning chamber acts forward to evoke species-specific feeding behavior rather than backward to reinforce experimenter-defined responses. Furthermore, recent findings in neuroscience show consistently that intracranial stimulation to reward centers and dopamine release, the proposed reward molecule, also act forward to evoke inborn species-specific behavior. These results challenge longstanding views of hedonic learning and must be incorporated into contemporary learning theory.

  2. Rats bred for helplessness exhibit positive reinforcement learning deficits which are not alleviated by an antidepressant dose of the MAO-B inhibitor deprenyl.

    PubMed

    Schulz, Daniela; Henn, Fritz A; Petri, David; Huston, Joseph P

    2016-08-04

    Principles of negative reinforcement learning may play a critical role in the etiology and treatment of depression. We examined the integrity of positive reinforcement learning in congenitally helpless (cH) rats, an animal model of depression, using a random ratio schedule and a devaluation-extinction procedure. Furthermore, we tested whether an antidepressant dose of the monoamine oxidase (MAO)-B inhibitor deprenyl would reverse any deficits in positive reinforcement learning. We found that cH rats (n=9) were impaired in the acquisition of even simple operant contingencies, such as a fixed interval (FI) 20 schedule. cH rats exhibited no apparent deficits in appetite or reward sensitivity. They reacted to the devaluation of food in a manner consistent with a dose-response relationship. Reinforcer motivation as assessed by lever pressing across sessions with progressively decreasing reward probabilities was highest in congenitally non-helpless (cNH, n=10) rats as long as the reward probabilities remained relatively high. cNH compared to wild-type (n=10) rats were also more resistant to extinction across sessions. Compared to saline (n=5), deprenyl (n=5) reduced the duration of immobility of cH rats in the forced swimming test, indicative of antidepressant effects, but did not restore any deficits in the acquisition of a FI 20 schedule. We conclude that positive reinforcement learning was impaired in rats bred for helplessness, possibly due to motivational impairments but not deficits in reward sensitivity, and that deprenyl exerted antidepressant effects but did not reverse the deficits in positive reinforcement learning.

  3. WWC Quick Review of the Article "Culture and the Interaction of Student Ethnicity with Reward Structure in Group Learning" Revised

    ERIC Educational Resources Information Center

    What Works Clearinghouse, 2010

    2010-01-01

    This paper presents an updated WWC (What Works Clearinghouse) Review of the Article "Culture and the Interaction of Student Ethnicity with Reward Structure in Group Learning". The study examined the effects of different reward systems used in group learning situations on the math skills of African-American and White students. The…

  4. Sound Sequence Discrimination Learning Motivated by Reward Requires Dopaminergic D2 Receptor Activation in the Rat Auditory Cortex

    ERIC Educational Resources Information Center

    Kudoh, Masaharu; Shibuki, Katsuei

    2006-01-01

    We have previously reported that sound sequence discrimination learning requires cholinergic inputs to the auditory cortex (AC) in rats. In that study, reward was used for motivating discrimination behavior in rats. Therefore, dopaminergic inputs mediating reward signals may have an important role in the learning. We tested the possibility in the…

  5. The impact of effort-reward imbalance and learning motivation on teachers' sickness absence.

    PubMed

    Derycke, Hanne; Vlerick, Peter; Van de Ven, Bart; Rots, Isabel; Clays, Els

    2013-02-01

    The aim of this study was to analyse the impact of the effort-reward imbalance and learning motivation on sickness absence duration and sickness absence frequency among beginning teachers in Flanders (Belgium). A total of 603 teachers, who recently graduated, participated in this study. Effort-reward imbalance and learning motivation were assessed by means of self-administered questionnaires. Prospective data of registered sickness absence during 12 months follow-up were collected. Multivariate logistic regression analyses were performed. An imbalance between high efforts and low rewards (extrinsic hypothesis) was associated with longer sickness absence duration and more frequent absences. A low level of learning motivation (intrinsic hypothesis) was not associated with longer sickness absence duration but was significantly positively associated with sickness absence frequency. No significant results were obtained for the interaction hypothesis between imbalance and learning motivation. Further research is needed to deepen our understanding of the impact of psychosocial work conditions and personal resources on both sickness absence duration and frequency. Specifically, attention could be given to optimizing or reducing efforts spent at work, increasing rewards and stimulating learning motivation to influence sickness absence.

  6. Effect of reinforcement learning on coordination of multiagent systems

    NASA Astrophysics Data System (ADS)

    Bukkapatnam, Satish T. S.; Gao, Greg

    2000-12-01

    For effective coordination of distributed environments involving multiagent systems, the learning ability of each agent in the environment plays a crucial role. In this paper, we develop a simple group learning method based on reinforcement, and study its effect on coordination through application to a supply chain procurement scenario involving a computer manufacturer. Here, all parties are represented by self-interested, autonomous agents, each capable of performing specific simple tasks. They negotiate with each other to perform complex tasks and thus coordinate supply chain procurement. Reinforcement learning is intended to enable each agent to reach the best negotiable price within the shortest possible time. Our simulations of the application scenario under different learning strategies reveal the positive effects of reinforcement learning on an agent's as well as the system's performance.

  7. Toward a common theory for learning from reward, affect, and motivation: the SIMON framework

    PubMed Central

    Madan, Christopher R.

    2013-01-01

    While the effects of reward, affect, and motivation on learning have each developed into their own fields of research, they largely have been investigated in isolation. As all three of these constructs are highly related, and use similar experimental procedures, an important advance in research would be to consider the interplay between these constructs. Here we first define each of the three constructs, and then discuss how they may influence each other within a common framework. Finally, we delineate several sources of evidence supporting the framework. By considering the constructs of reward, affect, and motivation within a single framework, we can develop a better understanding of the processes involved in learning and how they interplay, and work toward a comprehensive theory that encompasses reward, affect, and motivation. PMID:24109436

  8. Neural regions that underlie reinforcement learning are also active for social expectancy violations.

    PubMed

    Harris, Lasana T; Fiske, Susan T

    2010-01-01

    Prediction error, the difference between an expected and an actual outcome, serves as a learning signal that interacts with reward and punishment value to direct future behavior during reinforcement learning. We hypothesized that similar learning and valuation signals may underlie social expectancy violations. Here, we explore the neural correlates of social expectancy violation signals along the universal person-perception dimensions trait warmth and competence. In this context, social learning may result from expectancy violations that occur when a target is inconsistent with an a priori schema. Expectancy violation may activate neural regions normally implicated in prediction error and valuation during appetitive and aversive conditioning. Using fMRI, we first gave perceivers high warmth or competence behavioral information that led to dispositional or situational attributions for the behavior. Participants then saw pictures of people responsible for the behavior; they represented social groups either inconsistent (rated low on either warmth or competence) or consistent (rated high on either warmth or competence) with the behavior information. Warmth and competence expectancy violations activate striatal regions that represent evaluative and prediction error signals. Social cognition regions underlie consistent expectations. These findings suggest that regions underlying reinforcement learning may work in concert with social cognition regions in warmth and competence social expectancy. This study illustrates the neural overlap between neuroeconomics and social neuroscience.

  9. Reinforcement Learning and Episodic Memory in Humans and Animals: An Integrative Framework.

    PubMed

    Gershman, Samuel J; Daw, Nathaniel D

    2017-01-03

    We review the psychology and neuroscience of reinforcement learning (RL), which has experienced significant progress in the past two decades, enabled by the comprehensive experimental study of simple learning and decision-making tasks. However, one challenge in the study of RL is computational: The simplicity of these tasks ignores important aspects of reinforcement learning in the real world: (a) State spaces are high-dimensional, continuous, and partially observable; this implies that (b) data are relatively sparse and, indeed, precisely the same situation may never be encountered twice; furthermore, (c) rewards depend on the long-term consequences of actions in ways that violate the classical assumptions that make RL tractable. A seemingly distinct challenge is that, cognitively, theories of RL have largely involved procedural and semantic memory, the way in which knowledge about action values or world models extracted gradually from many experiences can drive choice. This focus on semantic memory leaves out many aspects of memory, such as episodic memory, related to the traces of individual events. We suggest that these two challenges are related. The computational challenge can be dealt with, in part, by endowing RL systems with episodic memory, allowing them to (a) efficiently approximate value functions over complex state spaces, (b) learn with very little data, and (c) bridge long-term dependencies between actions and rewards. We review the computational theory underlying this proposal and the empirical evidence to support it. Our proposal suggests that the ubiquitous and diverse roles of memory in RL may function as part of an integrated learning system.
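
    One common concrete form of the proposal — using episodic traces to approximate values over complex state spaces — is nearest-neighbor episodic control: estimate a state's value from the returns of the most similar stored episodes. A minimal sketch, with an assumed Euclidean similarity and invented toy data:

```python
import numpy as np

def episodic_value(memory, state, k=3):
    """Episodic-control style value estimate: the value of a state is the
    mean return of its k nearest stored (state, return) episodes, which
    lets a learner generalize from very little data."""
    states = np.array([s for s, _ in memory], dtype=float)
    returns = np.array([g for _, g in memory], dtype=float)
    dists = np.linalg.norm(states - np.asarray(state, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    return returns[nearest].mean()
```

    Because each stored episode carries its full observed return, lookups of this kind also bridge long-term dependencies between actions and rewards without step-by-step bootstrapping.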

  10. Learning, using natural reinforcements, in insect preparations that permit cellular neuronal analysis.

    PubMed

    Hoyle, G

    1980-07-01

    A general paradigm is described that permits testing the ability of an arthropod to learn (by operant conditioning) to alter the position of a single leg segment in order to relate to behaviorally appropriate reinforcement. The paradigm was designed so that intracellular recording from the identified neurons involved would be possible during the training of a locust or grasshopper, for which extensive neuron maps are available. As a prelude to such studies, electromyograms were made from the antagonistic muscles that move the conditioned limb, which in the present experiments was the tibia of the metathoracic leg. Negative (aversive) reinforcement was provided by a loud sound/vibration and positive (reward) reinforcement by food in the form of sugar-water or fresh-growing grass. In the aversive reinforcement experiments the sound, which reflexly caused flexion, was on continually except when the tibia of one hind leg was voluntarily placed in an electronically set position "window" displaced, in extension, away from the preferred position. In feeding experiments, food was brought automatically to the mouth by a motor-driven arm when the tibia was held within a position window set away from the preferred position in either extension or flexion. Whole or headless insects learned to turn off the sound permanently, except for sporadic brief interruptions, by tonic shifting of tibial position. Insects learned to bring food to the mouth by modifying the plateau phase of a position displacement lasting for a few minutes, which was found to occur from time to time in controls as well. In aversive learning, minimum times to turn off the sound were 22 sec for the easiest position and 4 min for the most difficult. The longest time in the easiest position was 1 min 40 sec and in the most difficult 39 min, excluding measurements from individuals that did not learn. In reward learning, the minimum time in the easiest position was just under 1 min, and 12 min in the most difficult.

  11. Central amygdala GluA1 facilitates associative learning of opioid reward.

    PubMed

    Cai, You-Qing; Wang, Wei; Hou, Yuan-Yuan; Zhang, Zhi; Xie, Jun; Pan, Zhizhong Z

    2013-01-23

    GluA1 subunits of AMPA glutamate receptors are implicated in the synaptic plasticity induced by drugs of abuse for behaviors of drug addiction, but GluA1 roles in emotional learning and memories of drug reward in the development of drug addiction remain unclear. In this study of the central nucleus of the amygdala (CeA), which is critical in emotional learning of drug reward, we investigated how adaptive changes in the expression of GluA1 subunits affected the learning process of opioid-induced context-reward association (associative learning) for the acquisition of reward-related behavior. In CeA neurons, we found that CeA GluA1 expression was significantly increased 2 h after conditioning treatment with morphine, but not 24 h after the conditioning when the behavior of conditioned place preference (CPP) was fully established in rats. Adenoviral overexpression of GluA1 subunits in CeA accelerated associative learning, as shown by reduced minimum time of morphine conditioning required for CPP acquisition and by facilitated CPP extinction through extinction training with no morphine involved. Adenoviral shRNA-mediated downregulation of CeA GluA1 produced opposite effects, inhibiting the processes of both CPP acquisition and CPP extinction. Adenoviral knockdown of CeA GluA2 subunits facilitated CPP acquisition, but did not alter CPP extinction. Whole-cell recording revealed enhanced electrophysiological properties of postsynaptic GluA2-lacking AMPA receptors in adenoviral GluA1-infected CeA neurons. These results suggest that increased GluA1 expression of CeA AMPA receptors facilitates the associative learning of context-drug reward, an important process in both development and relapse of drug-seeking behaviors in drug addiction.

  12. Dopamine Replacement Therapy, Learning and Reward Prediction in Parkinson’s Disease: Implications for Rehabilitation

    PubMed Central

    Ferrazzoli, Davide; Carter, Adrian; Ustun, Fatma S.; Palamara, Grazia; Ortelli, Paola; Maestri, Roberto; Yücel, Murat; Frazzitta, Giuseppe

    2016-01-01

    The principal feature of Parkinson’s disease (PD) is the impaired ability to acquire and express habitual-automatic actions due to the loss of dopamine in the dorsolateral striatum, the region of the basal ganglia associated with the control of habitual behavior. Dopamine replacement therapy (DRT) compensates for the lack of dopamine, representing the standard treatment for different motor symptoms of PD (such as rigidity, bradykinesia and resting tremor). On the other hand, rehabilitation treatments, exploiting the use of cognitive strategies, feedback and external cues, allow patients to “learn to bypass” the defective basal ganglia (using the dorsolateral prefrontal cortex) and perform correct movements under executive-volitional control. Therefore, DRT and rehabilitation seem to be two complementary and synergistic approaches. Learning and reward are central in rehabilitation: both of these mechanisms are the basis for the success of any rehabilitative treatment. However, it is known that “learning resources” and reward can be negatively influenced by dopaminergic drugs. Furthermore, DRT causes several well-known complications: among these, dyskinesias, motor fluctuations, and dopamine dysregulation syndrome (DDS) are intimately linked with alterations in learning and reward mechanisms and can seriously impact rehabilitative outcomes. These considerations highlight the need for careful titration of DRT to produce the desired improvement in motor symptoms while minimizing the associated detrimental effects. This is important in order to maximize the motor re-learning based on repetition, reward and practice during rehabilitation. In this scenario, we review the knowledge concerning the interactions between DRT, learning and reward, examine the most impactful DRT side effects and provide suggestions for optimizing rehabilitation in PD. PMID:27378872

  13. Involvement of the Rat Anterior Cingulate Cortex in Control of Instrumental Responses Guided by Reward Expectancy

    ERIC Educational Resources Information Center

    Schweimer, Judith; Hauber, Wolfgang

    2005-01-01

    The anterior cingulate cortex (ACC) plays a critical role in stimulus-reinforcement learning and reward-guided selection of actions. Here we conducted a series of experiments to further elucidate the role of the ACC in instrumental behavior involving effort-based decision-making and instrumental learning guided by reward-predictive stimuli. In…

  14. Medial orbitofrontal cortex modulates associative learning between environmental cues and reward probability.

    PubMed

    Hall-McMaster, Sam; Millar, Jessica; Ruan, Ming; Ward, Ryan D

    2017-02-01

    It has recently been recognized that orbitofrontal cortex has 2 subdivisions that are anatomically and functionally distinct. Most rodent research has focused on the lateral subdivision, leaving the medial subdivision (mOFC) relatively unexplored. We recently showed that inhibiting mOFC neurons eliminated the differential impact of reward probability cues on discrimination accuracy in a sustained attention task. In the present study, we tested whether increasing mOFC neuronal activity in rats would accelerate acquisition of reward contingencies. mOFC neuronal activity was increased using the DREADD (Designer Receptors Exclusively Activated by Designer Drugs) method, in which clozapine-N-oxide administration leads to neuronal modulation by acting on synthetic receptors not normally expressed in the rat brain. We predicted that rats with neuronal activation in mOFC would require fewer sessions than controls for acquisition of a task in which visual cues signal the probability of reward for correct discrimination performance. Contrary to this prediction, mOFC neuronal activation impaired task acquisition, suggesting mOFC may play a role in learning relationships between environmental cues and reward probability, or in using that information in adaptive decision-making. In addition, disrupted mOFC activity may contribute to psychiatric conditions in which learning associations between environmental cues and reward probability is impaired.

  15. Impaired reward learning and intact motivation after serotonin depletion in rats

    PubMed Central

    Izquierdo, Alicia; Carlos, Kathleen; Ostrander, Serena; Rodriguez, Danilo; McCall-Craddolph, Aaron; Yagnik, Gargey; Zhou, Feimeng

    2012-01-01

    Aside from the well-known influence of serotonin (5-hydroxytryptamine, 5-HT) on emotional regulation, more recent investigations have revealed the importance of this monoamine in modulating cognition. Parachlorophenylalanine (PCPA) depletes 5-HT by inhibiting tryptophan hydroxylase, the enzyme required for 5-HT synthesis and, if administered at sufficiently high doses, can result in a depletion of at least 90% of the brain's 5-HT levels. The present study assessed the long-lasting effects of widespread 5-HT depletions on two tasks of cognitive flexibility in Long Evans rats: effort discounting and reversal learning. We assessed performance on these tasks after administration of either 250 or 500 mg/kg PCPA or saline (SAL) on two consecutive days. Consistent with a previous report investigating the role of 5-HT on effort discounting, pretreatment with either dose of PCPA resulted in normal effortful choice: All rats continued to climb tall barriers to obtain large rewards and were not work-averse. Additionally, rats receiving the lower dose of PCPA displayed normal reversal learning. However, despite intact motivation to work for food rewards, rats receiving the largest dose of PCPA were unexpectedly impaired relative to SAL rats on the pretraining stages leading up to reversal learning, ultimately failing to approach and respond to the stimuli associated with reward. High performance liquid chromatography (HPLC) with electrochemical detection confirmed 5-HT, and not dopamine, levels in the ventromedial frontal cortex were correlated with this measure of associative reward learning. PMID:22652392

  16. Sensory Responsiveness and the Effects of Equal Subjective Rewards on Tactile Learning and Memory of Honeybees

    ERIC Educational Resources Information Center

    Scheiner, Ricarda; Kuritz-Kaiser, Anthea; Menzel, Randolf; Erber, Joachim

    2005-01-01

    In tactile learning, sucrose is the unconditioned stimulus and reward, which is usually applied to the antenna to elicit proboscis extension and which the bee can drink when it is subsequently applied to the extended proboscis. The conditioned stimulus is a tactile object that the bee can scan with its antennae. In this paper we describe the…

  17. A spiking network model of decision making employing rewarded STDP.

    PubMed

    Skorheim, Steven; Lonjers, Peter; Bazhenov, Maxim

    2014-01-01

    Reward-modulated spike timing dependent plasticity (STDP) combines unsupervised STDP with a reinforcement signal that modulates synaptic changes. It was proposed as a learning rule capable of solving the distal reward problem in reinforcement learning. Nonetheless, this learning mechanism's performance and limitations in solving biological problems have yet to be tested. In our work, rewarded STDP was implemented to model foraging behavior in a simulated environment. Over the course of training, the network of spiking neurons developed the capability of producing highly successful decision-making. The network performance remained stable even after significant perturbations of synaptic structure. However, rewarded STDP alone was insufficient to learn effective decision making due to the difficulty of maintaining homeostatic equilibrium of synaptic weights and the development of local performance maxima. Our study predicts that successful learning requires stabilizing mechanisms that allow neurons to balance their input and output synapses as well as synaptic noise.
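
    The gating mechanism at the heart of rewarded STDP — raw STDP changes accumulate on a decaying eligibility trace and are committed only when a reinforcement signal arrives — can be reduced to a scalar sketch. The paper's model is a full spiking network; the trace constant and learning rate here are illustrative assumptions.

```python
def rstdp_train(events, tau_e=0.8, lr=0.5):
    """Reward-modulated STDP for a single synapse, collapsed to scalars.
    Each timestep supplies (dw, reward): dw is the raw STDP change from
    pre/post spike pairings, reward is the global reinforcement signal.
    dw is not applied directly; it feeds a decaying eligibility trace e,
    and only reward converts e into an actual weight change."""
    w, e = 0.0, 0.0
    for dw, reward in events:
        e = tau_e * e + dw          # trace decays, then accumulates pairings
        w += lr * reward * e        # weight changes only when reward arrives
    return w
```

    A pairing followed a few steps later by reward is potentiated through the decayed trace, while a pairing never followed by reward leaves the weight untouched; this is how rewarded STDP addresses the distal reward problem.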

  18. Corticotropin-releasing hormone receptor type 1 (CRHR1) genetic variation and stress interact to influence reward learning.

    PubMed

    Bogdan, Ryan; Santesso, Diane L; Fagerness, Jesen; Perlis, Roy H; Pizzagalli, Diego A

    2011-09-14

    Stress is a general risk factor for psychopathology, but the mechanisms underlying this relationship remain largely unknown. Animal studies and limited human research suggest that stress can induce anhedonic behavior. Moreover, emerging data indicate that genetic variation within the corticotropin-releasing hormone type 1 receptor gene (CRHR1) at rs12938031 may promote psychopathology, particularly in the context of stress. Using an intermediate phenotypic neurogenetics approach, we assessed how stress and CRHR1 genetic variation (rs12938031) influence reward learning, an important component of anhedonia. Psychiatrically healthy female participants (n = 75) completed a probabilistic reward learning task during stress and no-stress conditions while 128-channel event-related potentials were recorded. Fifty-six participants were also genotyped across CRHR1. Response bias, an individual's ability to modulate behavior as a function of reward, was the primary behavioral variable of interest. The feedback-related positivity (FRP) in response to reward feedback was used as a neural index of reward learning. Relative to the no-stress condition, acute stress was associated with blunted response bias as well as a smaller and delayed FRP (indicative of disrupted reward learning) and reduced anterior cingulate and orbitofrontal cortex activation to reward. Critically, rs12938031 interacted with stress to influence reward learning: both behaviorally and neurally, A homozygotes showed stress-induced reward learning abnormalities. These findings indicate that acute, uncontrollable stressors reduce participants' ability to modulate behavior as a function of reward, and that such effects are modulated by CRHR1 genotype. Homozygosity for the A allele at rs12938031 may increase risk for psychopathology via stress-induced reward learning deficits.

  19. Reinforcement Learning of Optimal Supervisor based on the Worst-Case Behavior

    NASA Astrophysics Data System (ADS)

    Kajiwara, Kouji; Yamasaki, Tatsushi

    The supervisory control initiated by Ramadge and Wonham is a framework for logical control of discrete event systems. In the original supervisory control framework, the costs of occurrence and disabling of events were not considered. Optimal supervisory control based on quantitative measures has therefore also been studied. This paper proposes a synthesis method for an optimal supervisor based on the worst-case behavior of discrete event systems. We introduce new value functions for the assigned control patterns. These value functions are based not on expected total rewards, but on the most undesirable event occurrence in the assigned control pattern. In the proposed method, the supervisor learns how to assign the control pattern via reinforcement learning so as to maximize the value functions. We show the efficiency of the proposed method by computer simulation.
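
    The departure from expected-reward value functions can be shown in a toy one-step form: score each control pattern by its most undesirable enabled event rather than its average, then pick the maximin pattern. This only illustrates the evaluation criterion, not the paper's learning algorithm; the pattern names and reward values are invented.

```python
def pick_control_pattern(patterns):
    """Score each control pattern by the reward of its worst enabled event
    (rather than the expectation over events), then return the pattern
    with the best worst case, together with all the scores."""
    value = {name: min(event_rewards.values())
             for name, event_rewards in patterns.items()}
    best = max(value, key=value.get)
    return best, value
```

    With patterns {'enable_all': {'a': 5, 'b': -10}, 'enable_a_only': {'a': 5}}, the worst-case criterion scores enable_all at -10, so the supervisor keeps the undesirable event b disabled.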

  20. Situational Reinforcement: The New Old Way to Learn Languages.

    ERIC Educational Resources Information Center

    Petrucelli, Gerald J.

    1977-01-01

    Situational Reinforcement, a teaching methodology developed out of the cognitive-field theory of learning, is described. It combines many techniques and methods developed over the years. This discussion of it considers common learning problems: (1) boredom, apathy and passivity on the part of the student, (2) the teacher's preoccupation with…

  1. Social Learning, Reinforcement and Crime: Evidence from Three European Cities

    ERIC Educational Resources Information Center

    Tittle, Charles R.; Antonaccio, Olena; Botchkovar, Ekaterina

    2012-01-01

    This study reports a cross-cultural test of Social Learning Theory using direct measures of social learning constructs and focusing on the causal structure implied by the theory. Overall, the results strongly confirm the main thrust of the theory. Prior criminal reinforcement and current crime-favorable definitions are highly related in all three…

  2. Mastery Learning through Individualized Instruction: A Reinforcement Strategy

    ERIC Educational Resources Information Center

    Sagy, John; Ravi, R.; Ananthasayanam, R.

    2009-01-01

    The present study attempts to gauge the effect of individualized instructional methods as a reinforcement strategy for mastery learning. Among various individualized instructional methods, the study focuses on PIM (Programmed Instructional Method) and CAIM (Computer Assisted Instruction Method). Mastery learning is a process where students achieve…

  3. Drift diffusion model of reward and punishment learning in schizophrenia: Modeling and experimental data.

    PubMed

    Moustafa, Ahmed A; Kéri, Szabolcs; Somlai, Zsuzsanna; Balsdon, Tarryn; Frydecka, Dorota; Misiak, Blazej; White, Corey

    2015-09-15

    In this study, we tested reward- and punishment-learning performance using a probabilistic classification learning task in patients with schizophrenia (n=37) and healthy controls (n=48). We also fit subjects' data using a Drift Diffusion Model (DDM) of simple decisions to investigate which components of the decision process differ between patients and controls. Modeling results show between-group differences in multiple components of the decision process. Specifically, patients had slower motor/encoding time, higher response caution (favoring accuracy over speed), and a deficit in classification learning for punishment, but not reward, trials. The results suggest that patients with schizophrenia adopt a compensatory strategy of favoring accuracy over speed to improve performance, yet still show signs of a deficit in learning based on negative feedback. Our data highlight the importance of fitting models (particularly drift diffusion models) to behavioral data. The implications of these findings are discussed relative to theories of schizophrenia and cognitive processing.
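
    The components the DDM separates — drift rate (evidence quality), boundary separation (response caution), and non-decision time (motor/encoding) — can be made concrete with a small simulation. The parameter values and the Euler discretization are illustrative assumptions.

```python
import random

def ddm_trial(drift, boundary, ndt=0.3, dt=0.001, sigma=1.0, rng=random):
    """One drift-diffusion trial: evidence starts at 0 and accumulates
    drift*dt plus Gaussian noise each step until it crosses +/- boundary.
    ndt (non-decision time) shifts RTs without affecting accuracy."""
    x, t = 0.0, 0.0
    while abs(x) < boundary:
        x += drift * dt + sigma * rng.gauss(0.0, dt ** 0.5)
        t += dt
    return x > 0, ndt + t          # (correct response, reaction time)

def ddm_summary(drift, boundary, ndt, n=500, seed=1):
    """Average accuracy and mean RT over n simulated trials."""
    rng = random.Random(seed)
    trials = [ddm_trial(drift, boundary, ndt, rng=rng) for _ in range(n)]
    accuracy = sum(correct for correct, _ in trials) / n
    mean_rt = sum(rt for _, rt in trials) / n
    return accuracy, mean_rt
```

    Raising the boundary reproduces the accuracy-over-speed pattern attributed to the patients: accuracy rises while mean RT lengthens, whereas raising ndt alone lengthens RTs with no effect on accuracy.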

  4. Reinforcement learning in complementarity game and population dynamics

    NASA Astrophysics Data System (ADS)

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005), 10.1016/j.physa.2004.07.005] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This gives insight into aspects like the issue of quick adaptation as opposed to systematic exploration or the role of learning rates.
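
    The modified Roth-Erev scheme the authors find best can be sketched directly: propensities decay ("forgetting"), are bumped by received payoffs, and choice probabilities go as propensity raised to a power (1.5 in the modified variant, 1.0 in the standard one). The bandit setup and parameter values here are illustrative assumptions, not the paper's complementarity game.

```python
import random

def roth_erev_bandit(arm_probs=(0.2, 0.8), steps=3000, power=1.5,
                     forgetting=0.05, seed=0):
    """Modified Roth-Erev reinforcement learning on a Bernoulli bandit:
    choice probability is proportional to propensity**power; propensities
    decay each step and are incremented by the reward actually received."""
    rng = random.Random(seed)
    q = [1.0] * len(arm_probs)                       # initial propensities
    for _ in range(steps):
        weights = [qa ** power for qa in q]
        a = rng.choices(range(len(q)), weights=weights)[0]
        reward = 1.0 if rng.random() < arm_probs[a] else 0.0
        q = [(1.0 - forgetting) * qa for qa in q]    # forgetting
        q[a] += reward                               # reinforcement
    total = sum(qa ** power for qa in q)
    return [qa ** power / total for qa in q]         # final choice probs
```

    The power exponent only reshapes how propensities map onto choice probabilities; the paper's comparison finds that the 1.5 variant outperforms the standard exponent of 1 in the complementarity game.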

  5. Cingulate neglect in humans: disruption of contralesional reward learning in right brain damage.

    PubMed

    Lecce, Francesca; Rotondaro, Francesca; Bonnì, Sonia; Carlesimo, Augusto; Thiebaut de Schotten, Michel; Tomaiuolo, Francesco; Doricchi, Fabrizio

    2015-01-01

    Motivational valence plays a key role in orienting spatial attention. Nonetheless, clinical documentation and understanding of motivationally based deficits of spatial orienting in humans is limited. Here, in one group study and two single-case studies, we have examined right brain-damaged (RBD) patients with and without left spatial neglect in a spatial reward-learning task, in which the motivational valence of the left contralesional and the right ipsilesional space was contrasted. In each trial two visual boxes were presented, one to the left and one to the right of central fixation. In one session monetary rewards were released more frequently in the box on the left side (75% of trials) whereas in another session they were released more frequently on the right side. In each trial patients were required to: 1) point to each one of the two boxes; 2) choose one of the boxes for obtaining monetary reward; 3) report explicitly the position of reward and whether or not this position matched the original choice. Despite defective spontaneous allocation of attention toward the contralesional space, RBD patients with left spatial neglect showed preserved contralesional reward learning, i.e., comparable to ipsilesional learning and to reward learning displayed by patients without neglect. A notable exception in the group of neglect patients was L.R., who showed no sign of contralesional reward learning in a series of 120 consecutive trials despite reaching the learning criterion in only 20 trials in the ipsilesional space. L.R. suffered cortical-subcortical brain damage affecting the anterior components of the parietal-frontal attentional network and, compared with all other neglect and non-neglect patients, had additional lesion involvement of the medial anterior cingulate cortex (ACC) and of the adjacent sectors of the corpus callosum. In contrast to his lateralized motivational learning deficit, L.R. 
had no lateral bias in the early phases of

  6. Distinguishing between learning and motivation in behavioral tests of the reinforcement sensitivity theory of personality.

    PubMed

    Smillie, Luke D; Dalgleish, Len I; Jackson, Chris J

    2007-04-01

    According to Gray's (1973) Reinforcement Sensitivity Theory (RST), a Behavioral Inhibition System (BIS) and a Behavioral Activation System (BAS) mediate effects of goal conflict and reward on behavior. BIS functioning has been linked with individual differences in trait anxiety and BAS functioning with individual differences in trait impulsivity. In this article, it is argued that behavioral outputs of the BIS and BAS can be distinguished in terms of learning and motivation processes and that these can be operationalized using the Signal Detection Theory measures of response-sensitivity and response-bias. In Experiment 1, two measures of BIS-reactivity predicted increased response-sensitivity under goal conflict, whereas one measure of BAS-reactivity predicted increased response-sensitivity under reward. In Experiment 2, two measures of BIS-reactivity predicted response-bias under goal conflict, whereas a measure of BAS-reactivity predicted response-bias under reward. In both experiments, impulsivity measures did not predict criteria for BAS-reactivity as traditionally predicted by RST.

  7. Human-level control through deep reinforcement learning

    NASA Astrophysics Data System (ADS)

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-01

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. 
This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
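
    The TD target at the heart of the deep Q-network is easiest to see in its tabular form. A DQN regresses a deep network toward the same target, with minibatches sampled from a replay buffer and a periodically frozen target network; none of that machinery appears in this sketch:

```python
def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update toward the TD target
    r + gamma * max_a' Q(s', a'). `Q` maps each state to a list of
    action values; `alpha` is the learning rate, `gamma` the discount."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

    Repeated application of this update, with sufficient exploration and decaying `alpha`, converges to the optimal action values in finite Markov decision processes; the DQN contribution is making the same target stable when `Q` is a deep network over raw pixels.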

  8. Human-level control through deep reinforcement learning.

    PubMed

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A; Veness, Joel; Bellemare, Marc G; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-26

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. 
This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

  9. Dynamic Interaction between Reinforcement Learning and Attention in Multidimensional Environments.

    PubMed

    Leong, Yuan Chang; Radulescu, Angela; Daniel, Reka; DeWoskin, Vivian; Niv, Yael

    2017-01-18

    Little is known about the relationship between attention and learning during decision making. Using eye tracking and multivariate pattern analysis of fMRI data, we measured participants' dimensional attention as they performed a trial-and-error learning task in which only one of three stimulus dimensions was relevant for reward at any given time. Analysis of participants' choices revealed that attention biased both value computation during choice and value update during learning. Value signals in the ventromedial prefrontal cortex and prediction errors in the striatum were similarly biased by attention. In turn, participants' focus of attention was dynamically modulated by ongoing learning. Attentional switches across dimensions correlated with activity in a frontoparietal attention network, which showed enhanced connectivity with the ventromedial prefrontal cortex between switches. Our results suggest a bidirectional interaction between attention and learning: attention constrains learning to relevant dimensions of the environment, while we learn what to attend to via trial and error.
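
    The attention-biased value computation described here can be sketched as an attention-weighted sum over stimulus dimensions. The data layout and names are illustrative, not the paper's fitted model:

```python
def stimulus_value(feature_values, attention, stimulus):
    """Value of a multidimensional stimulus as an attention-weighted sum
    of learned per-feature values. `feature_values[dim][feat]` holds the
    learned value of feature `feat` on dimension `dim`; `attention` gives
    one weight per dimension (summing to 1)."""
    return sum(w * feature_values[dim][feat]
               for dim, (feat, w) in enumerate(zip(stimulus, attention)))
```

    In the same spirit, the learning update would scale each dimension's prediction-error-driven value change by its attention weight, which is what biases both choice and learning toward attended dimensions.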

  10. Protein interaction network constructing based on text mining and reinforcement learning with application to prostate cancer.

    PubMed

    Zhu, Fei; Liu, Quan; Zhang, Xiaofang; Shen, Bairong

    2015-08-01

    Constructing interaction networks from biomedical texts is an important and challenging task. The authors take advantage of text mining and reinforcement learning approaches to establish a protein interaction network. Considering the high computational efficiency of co-occurrence-based interaction extraction approaches and the high precision of linguistic-pattern approaches, the authors propose an interaction extraction algorithm in which they use frequently occurring linguistic patterns to extract interactions from texts and then identify interactions in extended, unprocessed texts following the basic idea of the co-occurrence approach, while discounting the interactions extracted from extended texts. They put forward a reinforcement learning-based algorithm to establish a protein interaction network, where nodes represent proteins and edges denote interactions. During the evolutionary process, a node selects another node, and the attained reward determines which predicted interaction should be reinforced. The topology of the network is updated by the agent until an optimal network is formed. They used texts downloaded from PubMed to construct a prostate cancer protein interaction network with the proposed methods. The results show that their method achieved a good matching rate. Network topology analysis also demonstrates that the node degree distribution, node degree probability and probability distribution curves of the constructed network accord well with those of a scale-free network.

  11. Negative reinforcement impairs overnight memory consolidation.

    PubMed

    Stamm, Andrew W; Nguyen, Nam D; Seicol, Benjamin J; Fagan, Abigail; Oh, Angela; Drumm, Michael; Lundt, Maureen; Stickgold, Robert; Wamsley, Erin J

    2014-11-01

    Post-learning sleep is beneficial for human memory. However, it may be that not all memories benefit equally from sleep. Here, we manipulated a spatial learning task using monetary reward and performance feedback, asking whether enhancing the salience of the task would augment overnight memory consolidation and alter its incorporation into dreaming. Contrary to our hypothesis, we found that the addition of reward impaired overnight consolidation of spatial memory. Our findings seemingly contradict prior reports that enhancing the reward value of learned information augments sleep-dependent memory processing. Given that the reward followed a negative reinforcement paradigm, consolidation may have been impaired via a stress-related mechanism.

  12. Tactile learning and the individual evaluation of the reward in honey bees (Apis mellifera L.).

    PubMed

    Scheiner, R; Erber, J; Page, R E

    1999-07-01

    Using the proboscis extension response we conditioned pollen and nectar foragers of the honey bee (Apis mellifera L.) to tactile patterns under laboratory conditions. Pollen foragers demonstrated better acquisition, extinction, and reversal learning than nectar foragers. We tested whether the known differences in response thresholds to sucrose between pollen and nectar foragers could explain the observed differences in learning and found that nectar foragers with low response thresholds performed better during acquisition and extinction than ones with higher thresholds. Conditioning pollen and nectar foragers with similar response thresholds did not yield differences in their learning performance. These results suggest that differences in the learning performance of pollen and nectar foragers are a consequence of differences in their perception of sucrose. Furthermore, we analysed the effect which the perception of sucrose reward has on associative learning. Nectar foragers with uniform low response thresholds were conditioned using varying concentrations of sucrose. We found significant positive correlations between the concentrations of the sucrose rewards and the performance during acquisition and extinction. The results are summarised in a model which describes the relationships between learning performance, response threshold to sucrose, concentration of sucrose and the number of rewards.

  13. Social Cognition as Reinforcement Learning: Feedback Modulates Emotion Inference.

    PubMed

    Zaki, Jamil; Kallman, Seth; Wimmer, G Elliott; Ochsner, Kevin; Shohamy, Daphna

    2016-09-01

    Neuroscientific studies of social cognition typically employ paradigms in which perceivers draw single-shot inferences about the internal states of strangers. Real-world social inference features very different parameters: People often encounter and learn about particular social targets (e.g., friends) over time and receive feedback about whether their inferences are correct or incorrect. Here, we examined this process and, more broadly, the intersection between social cognition and reinforcement learning. Perceivers were scanned using fMRI while repeatedly encountering three social targets who produced conflicting visual and verbal emotional cues. Perceivers guessed how targets felt and received feedback about whether they had guessed correctly. Visual cues reliably predicted one target's emotion, verbal cues predicted a second target's emotion, and neither reliably predicted the third target's emotion. Perceivers successfully used this information to update their judgments over time. Furthermore, trial-by-trial learning signals, estimated using two reinforcement learning models, tracked activity in ventral striatum and ventromedial pFC, structures associated with reinforcement learning, and regions associated with updating social impressions, including TPJ. These data suggest that learning about others' emotions, like other forms of feedback learning, relies on domain-general reinforcement mechanisms as well as domain-specific social information processing.

  14. Effort provides its own reward: endeavors reinforce subjective expectation and evaluation of task performance.

    PubMed

    Wang, Lei; Zheng, Jiehui; Meng, Liang

    2017-04-01

    Although many studies have investigated the relationship between the amount of effort invested in a certain task and one's attitude towards the subsequent reward, whether exerted effort would impact one's expectation and evaluation of performance feedback itself still remains to be examined. In the present study, two types of calculation tasks that varied in the required effort were adopted, and we resorted to electroencephalography to probe the temporal dynamics of how exerted effort would affect one's anticipation and evaluation of performance feedback. In the high-effort condition, a more salient stimulus-preceding negativity was detected during the anticipation stage, which was accompanied with a more salient FRN/P300 complex (a more positive P300 and a less negative feedback-related negativity) in response to positive outcomes in the evaluation stage. These results suggested that when more effort was invested, an enhanced anticipatory attention would be paid toward one's task performance feedback and that positive outcomes would be subjectively valued to a greater extent.

  15. One-trial reward learning in the snail Lymnaea stagnalis.

    PubMed

    Alexander, J; Audesirk, T E; Audesirk, G J

    1984-01-01

    We present evidence that the pond snail Lymnaea stagnalis is capable of acquisition and extensive retention of an appetitively reinforced feeding response after only a single training trial. Food-deprived snails presented with a single pairing of a phagostimulant (a mixture of sucrose and casein digest) and a novel, non-food chemostimulus (amyl acetate) subsequently made feeding responses to the amyl acetate and retained the association for at least 19 days. This demonstration of one-trial, non-aversive classical conditioning enhances the utility of Lymnaea stagnalis as a model system for the detailed analysis of neural mechanisms underlying plasticity.

  16. Reward quality influences the development of learned olfactory biases in honeybees

    PubMed Central

    Wright, Geraldine A.; Choudhary, Amir F.; Bentley, Michael A.

    2009-01-01

    Plants produce flowers with complex visual and olfactory signals, but we know relatively little about the way that signals such as floral scents have evolved. One important factor that may direct the evolution of floral signals is a pollinator's ability to learn. When animals learn to associate two similar signals with different outcomes, biases in their responses to new signals can be formed. Here, we investigated whether or not pollinators develop learned biases towards floral scents that depend on nectar reward quality by training restrained honeybees to learn to associate two similar odour signals with different outcomes using a classical conditioning assay. Honeybees developed learned biases towards odours as a result of differential conditioning, and the extent to which an olfactory bias could be produced depended upon the difference in the quality of the nectar rewards experienced during conditioning. Our results suggest that differences in reward quality offered by flowers influence odour recognition by pollinators, which in turn could influence the evolution of floral scents in natural populations of co-flowering plants. PMID:19369260

  17. Reward quality influences the development of learned olfactory biases in honeybees.

    PubMed

    Wright, Geraldine A; Choudhary, Amir F; Bentley, Michael A

    2009-07-22

    Plants produce flowers with complex visual and olfactory signals, but we know relatively little about the way that signals such as floral scents have evolved. One important factor that may direct the evolution of floral signals is a pollinator's ability to learn. When animals learn to associate two similar signals with different outcomes, biases in their responses to new signals can be formed. Here, we investigated whether or not pollinators develop learned biases towards floral scents that depend on nectar reward quality by training restrained honeybees to learn to associate two similar odour signals with different outcomes using a classical conditioning assay. Honeybees developed learned biases towards odours as a result of differential conditioning, and the extent to which an olfactory bias could be produced depended upon the difference in the quality of the nectar rewards experienced during conditioning. Our results suggest that differences in reward quality offered by flowers influence odour recognition by pollinators, which in turn could influence the evolution of floral scents in natural populations of co-flowering plants.

  18. Linking Individual Learning Styles to Approach-Avoidance Motivational Traits and Computational Aspects of Reinforcement Learning.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly C; Schwartz, Sophie

    2016-01-01

    Learning how to gain rewards (approach learning) and avoid punishments (avoidance learning) is fundamental for everyday life. While individual differences in approach and avoidance learning styles have been related to genetics and aging, the contribution of personality factors, such as traits, remains undetermined. Moreover, little is known about the computational mechanisms mediating differences in learning styles. Here, we used a probabilistic selection task with positive and negative feedback, in combination with computational modelling, to show that individuals displaying better approach (vs. avoidance) learning scored higher on measures of approach (vs. avoidance) trait motivation, but, paradoxically, also displayed reduced learning speed following positive (vs. negative) outcomes. These data suggest that learning different types of information depends on associated reward values and internal motivational drives, possibly determined by personality traits.

  19. Linking Individual Learning Styles to Approach-Avoidance Motivational Traits and Computational Aspects of Reinforcement Learning

    PubMed Central

    Carl Aberg, Kristoffer; Doell, Kimberly C.; Schwartz, Sophie

    2016-01-01

    Learning how to gain rewards (approach learning) and avoid punishments (avoidance learning) is fundamental for everyday life. While individual differences in approach and avoidance learning styles have been related to genetics and aging, the contribution of personality factors, such as traits, remains undetermined. Moreover, little is known about the computational mechanisms mediating differences in learning styles. Here, we used a probabilistic selection task with positive and negative feedback, in combination with computational modelling, to show that individuals displaying better approach (vs. avoidance) learning scored higher on measures of approach (vs. avoidance) trait motivation, but, paradoxically, also displayed reduced learning speed following positive (vs. negative) outcomes. These data suggest that learning different types of information depends on associated reward values and internal motivational drives, possibly determined by personality traits. PMID:27851807

  20. Literacy and Learning: Integrated Skills Reinforcement.

    ERIC Educational Resources Information Center

    Anderson, JoAnn Romero; And Others

    1991-01-01

    Describes the integrated skills reinforcement (ISR) approach to Language across the Curriculum used at La Guardia Community College (LCC) to teach basic skills within the context of the subject-content of various disciplines. Explains LCC's student-centered approach to faculty development, and the use of ISR as a basis for curricular,…

  1. Effect of green tea on reward learning in healthy individuals: a randomized, double-blind, placebo-controlled pilot study

    PubMed Central

    2013-01-01

    Background: Both clinical and preclinical studies have revealed that regular intake of green tea reduces the prevalence of depressive symptoms and produces antidepressant-like effects in rodents. Evidence suggests that disturbed reward learning is associated with the development of anhedonia, a core symptom of depression. However, the relationship between green tea and reward learning is poorly investigated. Our goal was to test whether chronic treatment with green tea in healthy subjects affects the process of reward learning and subsequently regulates depressive symptoms. Methods: Seventy-four healthy subjects participated in a double-blind, randomized, placebo-controlled study with oral administration of green tea or placebo for 5 weeks. We used the monetary incentive delay task to evaluate reward learning by measuring responses to reward and no-reward trials, and compared reaction times indexing reward responsiveness between the green tea and placebo groups. Furthermore, we used the Montgomery-Asberg Depression Rating Scale (MADRS) and the 17-item Hamilton Rating Scale for Depression (HRSD-17) to assess depressive symptoms in the two groups. Results: Chronic treatment with green tea increased reward learning compared with placebo, as indicated by decreased reaction times in the monetary incentive delay task. Moreover, participants treated with green tea showed lower MADRS and HRSD-17 scores than participants treated with placebo. Conclusions: Our findings reveal that chronic green tea intake increased reward learning and prevented depressive symptoms. These results also raise the possibility that supplementary administration of green tea might reverse the development of depression through normalization of reward function. PMID:23777561

  2. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning

    PubMed Central

    Franklin, Nicholas T; Frank, Michael J

    2015-01-01

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning Marr's three levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments. DOI: http://dx.doi.org/10.7554/eLife.12029.001 PMID:26705698
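
    Functionally, the TAN mechanism amounts to an uncertainty-gated learning rate. A one-line delta-rule sketch (our abstraction of long pauses ~ high uncertainty ~ faster updating, not the paper's spiking-network model):

```python
def uncertainty_gated_update(value, reward, uncertainty, base_alpha=0.6):
    """Delta-rule update whose effective learning rate scales with an
    uncertainty signal in [0, 1]: high uncertainty (e.g., after a
    change-point) speeds updating, low uncertainty protects learned
    values from spurious outcomes."""
    alpha = base_alpha * uncertainty
    return value + alpha * (reward - value)
```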

  3. Neural Circuits Trained with Standard Reinforcement Learning Can Accumulate Probabilistic Information during Decision Making.

    PubMed

    Kurzawa, Nils; Summerfield, Christopher; Bogacz, Rafal

    2017-02-01

    Much experimental evidence suggests that during decision making, neural circuits accumulate evidence supporting alternative options. A computational model that describes this accumulation well for choices between two options assumes that the brain integrates the log ratios of the likelihoods of the sensory inputs given the two options. Several models have been proposed for how neural circuits can learn these log-likelihood ratios from experience, but all of these models introduced novel and specially dedicated synaptic plasticity rules. Here we show that for a certain wide class of tasks, the log-likelihood ratios are approximately linearly proportional to the expected rewards for selecting actions. Therefore, a simple model based on standard reinforcement learning rules is able to estimate the log-likelihood ratios from experience and on each trial accumulate the log-likelihood ratios associated with presented stimuli while selecting an action. The simulations of the model replicate experimental data on both behavior and neural activity in tasks requiring accumulation of probabilistic cues. Our results suggest that there is no need for the brain to support dedicated plasticity rules, as the standard mechanisms proposed to describe reinforcement learning can enable the neural circuits to perform efficient probabilistic inference.
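
    The core claim, that values learned by a standard delta rule are approximately proportional to log-likelihood ratios and can simply be summed across presented cues, can be sketched as follows. The trial format and reward coding (+1 for evidence toward option A, -1 toward option B) are illustrative assumptions:

```python
def learn_cue_values(trials, alpha=0.1, n_cues=4):
    """Learn one value per cue with a standard delta rule. Each trial is
    (cues_present, reward); all presented cues share the prediction
    error, as in ordinary reinforcement learning."""
    v = [0.0] * n_cues
    for cues, reward in trials:
        total = sum(v[c] for c in cues)       # accumulated evidence
        for c in cues:
            v[c] += alpha * (reward - total)  # shared prediction error
    return v

def choose(values, cues):
    """Accumulate learned values over presented cues; pick A if positive."""
    return "A" if sum(values[c] for c in cues) > 0 else "B"
```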

  4. Neuronal learning of invariant object representation in the ventral visual stream is not dependent on reward.

    PubMed

    Li, Nuo; Dicarlo, James J

    2012-05-09

    Neurons at the top of primate ventral visual stream [inferior temporal cortex (IT)] have selectivity for objects that is highly tolerant to variation in the object's appearance on the retina. Previous nonhuman primate (Macaca mulatta) studies suggest that this neuronal tolerance is at least partly supported by the natural temporal contiguity of visual experience, because altering that temporal contiguity can robustly alter adult IT position and size tolerance. According to that work, it is the statistics of the subject's visual experience, not the subject's reward, that instruct the specific images that IT treats as equivalent. But is reward necessary for gating this type of learning in the ventral stream? Here we show that this is not the case: temporal tolerance learning proceeds at the same rate, regardless of reward magnitude and regardless of the temporal co-occurrence of reward, even in a behavioral task that does not require the subject to engage the object images. This suggests that the ventral visual stream uses autonomous, fully unsupervised mechanisms to constantly leverage all visual experience to help build its invariant object representation.

  5. SOCIAL REINFORCEMENT, PERSONALITY AND LEARNING PERFORMANCE IN CROSS-CULTURAL PROGRAMMED INSTRUCTION.

    DTIC Science & Technology

    …of learning program, containing four conditions of social reinforcement: positive reinforcement for correct choices, negative reinforcement for incorrect choices, both positive and negative evaluation for either response, and no social evaluation. It was found that the presence of negative reinforcement as a factor significantly lowered the learning performance in one group. The opposite trend was evidenced in the other group. This discrepancy

  6. Immediate Reinforcement and the Disadvantaged Learner. A Practical Application of Learning Theory.

    ERIC Educational Resources Information Center

    FANTINI, MARIO; WEINSTEIN, GERALD

    Implicit in the concept of immediate reinforcement are two assumptions: (1) that a need must be satisfied, and (2) that a reward can serve to satisfy this need. The culturally disadvantaged child needs encouragement or discouragement right away; his society operates in this way. In the classroom, such reinforcement may take many forms. One teacher…

  7. Reinforcement and inference in cross-situational word learning

    PubMed Central

    Tilles, Paulo F. C.; Fontanari, José F.

    2013-01-01

    Cross-situational word learning is based on the notion that a learner can determine the referent of a word by finding something in common across many observed uses of that word. Here we propose an adaptive learning algorithm that contains a parameter that controls the strength of the reinforcement applied to associations between concurrent words and referents, and a parameter that regulates inference, which includes built-in biases, such as mutual exclusivity, and information of past learning events. By adjusting these parameters so that the model predictions agree with data from representative experiments on cross-situational word learning, we were able to explain the learning strategies adopted by the participants of those experiments in terms of a trade-off between reinforcement and inference. These strategies can vary wildly depending on the conditions of the experiments. For instance, for fast mapping experiments (i.e., the correct referent could, in principle, be inferred in a single observation) inference is prevalent, whereas for segregated contextual diversity experiments (i.e., the referents are separated in groups and are exhibited with members of their groups only) reinforcement is predominant. Other experiments are explained with more balanced doses of reinforcement and inference. PMID:24312030
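
    A toy version of the reinforcement-plus-inference idea: co-occurrence reinforcement of word-referent associations, gated by a mutual-exclusivity-style inference. The threshold and strength parameter are illustrative, not the paper's fitted values:

```python
def cross_situational_update(assoc, words, referents, chi=0.3):
    """After hearing `words` with `referents` present, add reinforcement
    `chi` to each co-occurring word-referent pair. Inference enters as a
    mutual-exclusivity-style bias: once a word has strong mappings
    (strength > 0.5), competing referents are no longer reinforced."""
    for w in words:
        known = {r for (w2, r), s in assoc.items() if w2 == w and s > 0.5}
        for r in referents:
            if known and r not in known:
                continue  # inference: skip referents that compete with known mappings
            assoc[(w, r)] = assoc.get((w, r), 0.0) + chi
    return assoc
```

    Tuning `chi` against the inference threshold trades off reinforcement and inference in the way the paper uses to fit the different experimental conditions.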

  8. Cooperation and Coordination Between Fuzzy Reinforcement Learning Agents in Continuous State Partially Observable Markov Decision Processes

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Vengerov, David

    1999-01-01

    Successful operations of future multi-agent intelligent systems require efficient cooperation schemes between agents sharing learning experiences. We consider a pseudo-realistic world in which one or more opportunities appear and disappear in random locations. Agents use fuzzy reinforcement learning to learn which opportunities are most worthy of pursuing based on their promised rewards, expected lifetimes, path lengths and expected path costs. We show that this world is partially observable because the history of an agent influences the distribution of its future states. We consider a cooperation mechanism in which agents share experience by using and updating one joint behavior policy. We also implement a coordination mechanism for allocating opportunities to different agents in the same world. Our results demonstrate that K cooperative agents each learning in a separate world over N time steps outperform K independent agents each learning in a separate world over K*N time steps, with this result becoming more pronounced as the degree of partial observability in the environment increases. We also show that cooperation between agents learning in the same world decreases performance with respect to independent agents. Since cooperation reduces diversity between agents, we conclude that diversity is a key parameter in the trade-off between maximizing utility from cooperation when diversity is low and maximizing utility from competitive coordination when diversity is high.
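
    A stylized stand-in for the cooperation scheme, assuming a simple multi-armed-bandit world in place of the fuzzy RL system: K agents either share one jointly updated value table or each learn independently. All names and parameters here are illustrative, not from the NASA report:

```python
import random

def bandit_pull(arm, probs, rng):
    # a stylized "opportunity": the arm pays 1 with its success probability
    return 1.0 if rng.random() < probs[arm] else 0.0

def run(k_agents, steps, probs, shared, eps=0.1, alpha=0.1, seed=0):
    # With shared=True all agents read and update one joint value table
    # (the joint behavior policy); otherwise each learns independently.
    rng = random.Random(seed)
    n = len(probs)
    tables = [[0.0] * n for _ in range(1 if shared else k_agents)]
    total = 0.0
    for _ in range(steps):
        for a in range(k_agents):
            q = tables[0] if shared else tables[a]
            # epsilon-greedy choice over the available opportunities
            arm = rng.randrange(n) if rng.random() < eps else q.index(max(q))
            r = bandit_pull(arm, probs, rng)
            q[arm] += alpha * (r - q[arm])  # shared table gets K updates per step
            total += r
    return total / (k_agents * steps)
```

    The shared table accumulates K agents' experience per time step, which is the intuition behind K cooperative agents over N steps matching K independent agents over K*N steps.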

  9. Transcriptomic analysis of instinctive and learned reward-related behaviors in honey bees.

    PubMed

    Naeger, Nicholas L; Robinson, Gene E

    2016-11-15

    We used transcriptomics to compare instinctive and learned, reward-based honey bee behaviors with similar spatio-temporal components: mating flights by males (drones) and time-trained foraging flights by females (workers), respectively. Genome-wide gene expression profiling via RNA sequencing was performed on the mushroom bodies, a region of the brain known for multi-modal sensory integration and responsive to various types of reward. Differentially expressed genes (DEGs) associated with the onset of mating (623 genes) were enriched for the gene ontology (GO) categories of Transcription, Unfolded Protein Binding, Post-embryonic Development, and Neuron Differentiation. DEGs associated with the onset of foraging (473) were enriched for Lipid Transport, Regulation of Programmed Cell Death, and Actin Cytoskeleton Organization. These results demonstrate that there are fundamental molecular differences between similar instinctive and learned behaviors. In addition, there were 166 genes with strong similarities in expression across the two behaviors - a statistically significant overlap in gene expression, also seen in Weighted Gene Co-Expression Network Analysis. This finding indicates that similar instinctive and learned behaviors also share common molecular architecture. This common set of DEGs was enriched for Regulation of RNA Metabolic Process, Transcription Factor Activity, and Response to Ecdysone. These findings provide a starting point for better understanding the relationship between instincts and learned behaviors. In addition, because bees collect food for their colony rather than for themselves, these results also support the idea that altruistic behavior relies, in part, on elements of brain reward systems associated with selfish behavior.

  10. Decision theory, reinforcement learning, and the brain.

    PubMed

    Dayan, Peter; Daw, Nathaniel D

    2008-12-01

    Decision making is a core competence for animals and humans acting and surviving in environments they only partially comprehend, gaining rewards and punishments for their troubles. Decision-theoretic concepts permeate experiments and computational models in ethology, psychology, and neuroscience. Here, we review a well-known, coherent Bayesian approach to decision making, showing how it unifies issues in Markovian decision problems, signal detection psychophysics, sequential sampling, and optimal exploration and discuss paradigmatic psychological and neural examples of each problem. We discuss computational issues concerning what subjects know about their task and how ambitious they are in seeking optimal solutions; we address algorithmic topics concerning model-based and model-free methods for making choices; and we highlight key aspects of the neural implementation of decision making.
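
    The model-based versus model-free distinction the review highlights can be illustrated on a toy problem (the two-state MDP and all parameters below are invented for illustration): value iteration sweeps a known model, while Q-learning estimates the same values from sampled experience alone:

```python
import random

# A two-state toy MDP: action 0 stays put for no reward; action 1
# switches state, paying 1 only when taken from state 0.
STATES, ACTIONS, GAMMA = (0, 1), (0, 1), 0.9

def step(s, a):
    # known dynamics: returns (next_state, reward), deterministic for clarity
    return (s, 0.0) if a == 0 else (1 - s, 1.0 if s == 0 else 0.0)

def value_iteration(sweeps=100):
    # model-based: sweep the known model until the values settle
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in ACTIONS)
             for s in STATES}
    return V

def q_learning(steps=5000, alpha=0.1, seed=0):
    # model-free: learn the same values from sampled transitions only
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(steps):
        a = rng.choice(ACTIONS)  # fully exploratory behavior policy
        s2, r = step(s, a)
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # temporal-difference update
        s = s2
    return Q
```

    Both methods converge to the same optimal values here; the difference is that Q-learning never consults the transition model directly.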

  11. Reinforcement Learning in Young Adults with Developmental Language Impairment

    PubMed Central

    Lee, Joanna C.; Tomblin, J. Bruce

    2012-01-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank et al., 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic selection task was used to assess how participants implicitly extracted reinforcement history from the environment based on probabilistic positive/negative feedback. The findings showed impaired RL in individuals with DLI, indicating an altered gating function of the striatum during testing. However, they employed learning strategies similar to those of comparison participants at the beginning of training, reflecting relatively intact functions of the prefrontal cortex to rapidly update reinforcement information. Within the context of Frank’s model, these results can be interpreted as evidence for alterations in the basal ganglia of individuals with DLI. PMID:22921956
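
    A minimal sketch of learning in a Frank-style probabilistic selection task, assuming a delta-rule learner with softmax choice; the pair structure and all parameters below are illustrative, not those of the study:

```python
import math
import random

# Training pairs in a Frank-style probabilistic selection task: within
# pair "AB", choosing A is rewarded 80% of the time and B 20%, and so on.
PAIRS = {"AB": 0.8, "CD": 0.7, "EF": 0.6}

def train(trials=600, alpha=0.2, beta=3.0, seed=1):
    rng = random.Random(seed)
    Q = {s: 0.5 for s in "ABCDEF"}  # learned reward expectancies
    for _ in range(trials):
        pair, p = rng.choice(sorted(PAIRS.items()))
        a, b = pair
        # softmax choice between the two stimuli in the pair
        p_a = 1.0 / (1.0 + math.exp(-beta * (Q[a] - Q[b])))
        choice = a if rng.random() < p_a else b
        # probabilistic feedback drawn from the pair's reward schedule
        p_reward = p if choice == a else 1.0 - p
        r = 1.0 if rng.random() < p_reward else 0.0
        Q[choice] += alpha * (r - Q[choice])  # delta-rule update
    return Q
```

    After training, the learned expectancies can be probed with novel pairings of the stimuli, which is how the task separates learning from positive versus negative feedback.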

  12. Reinforcement learning in young adults with developmental language impairment.

    PubMed

    Lee, Joanna C; Tomblin, J Bruce

    2012-12-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank, Seeberger, & O'Reilly, 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic selection task was used to assess how participants implicitly extracted reinforcement history from the environment based on probabilistic positive/negative feedback. The findings showed impaired RL in individuals with DLI, indicating an altered gating function of the striatum during testing. However, they employed learning strategies similar to those of comparison participants at the beginning of training, reflecting relatively intact functions of the prefrontal cortex to rapidly update reinforcement information. Within the context of Frank's model, these results can be interpreted as evidence for alterations in the basal ganglia of individuals with DLI.

  13. Probabilistic reward- and punishment-based learning in opioid addiction: Experimental and computational data.

    PubMed

    Myers, Catherine E; Sheynin, Jony; Balsdon, Tarryn; Luzardo, Andre; Beck, Kevin D; Hogarth, Lee; Haber, Paul; Moustafa, Ahmed A

    2016-01-01

    Addiction is the continuation of a habit in spite of negative consequences. A vast literature gives evidence that this poor decision-making behavior in individuals addicted to drugs also generalizes to laboratory decision-making tasks, suggesting that the impairment in decision-making is not limited to decisions about taking drugs. In the current experiment, opioid-addicted individuals and matched controls with no history of illicit drug use were administered a probabilistic classification task that embeds both reward-based and punishment-based learning trials, and a computational model of decision making was applied to understand the mechanisms describing individuals' performance on the task. Although behavioral results showed that opioid-addicted individuals performed as well as controls on both reward- and punishment-based learning, the modeling results suggested subtle differences in how decisions were made between the two groups. Specifically, the opioid-addicted group showed decreased tendency to repeat prior responses, meaning that they were more likely to "chase reward" when expectancies were violated, whereas controls were more likely to stick with a previously-successful response rule, despite occasional expectancy violations. This tendency to chase short-term reward, potentially at the expense of developing rules that maximize reward over the long term, may be a contributing factor to opioid addiction. Further work is indicated to better understand whether this tendency arises as a result of brain changes in the wake of continued opioid use/abuse, or might be a pre-existing factor that may contribute to risk for addiction.
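
    The "tendency to repeat prior responses" that such models capture is commonly formalized as a perseveration bonus added to a softmax choice rule. A hedged sketch of that idea (the names and parameters are illustrative, not the study's fitted model):

```python
import math
import random

def choose(q_values, last_choice, kappa, beta=2.0, rng=random):
    # Softmax choice with a perseveration bonus kappa: kappa > 0 biases
    # toward repeating the previous response (the control-like pattern),
    # while kappa near 0 lets choices track reward expectancies alone
    # ("reward chasing"). Returns (chosen index, choice probabilities).
    util = [beta * q + (kappa if i == last_choice else 0.0)
            for i, q in enumerate(q_values)]
    m = max(util)
    weights = [math.exp(u - m) for u in util]
    z = sum(weights)
    probs = [w / z for w in weights]
    draw, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if draw < acc:
            return i, probs
    return len(probs) - 1, probs
```

    Fitting kappa per group is one way a model comparison could reveal the decreased tendency to repeat responses described above.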

  14. Reinforcement learning accounts for moody conditional cooperation behavior: experimental results

    PubMed Central

    Horita, Yutaka; Takezawa, Masanori; Inukai, Keigo; Kita, Toshimasa; Masuda, Naoki

    2017-01-01

    In social dilemma games, human participants often show conditional cooperation (CC) behavior or its variant called moody conditional cooperation (MCC), with which they basically tend to cooperate when many other peers have previously cooperated. Recent computational studies showed that CC and MCC behavioral patterns could be explained by reinforcement learning. In the present study, we use a repeated multiplayer prisoner’s dilemma game and the repeated public goods game played by human participants to examine whether MCC is observed across different types of game and the possibility that reinforcement learning explains observed behavior. We observed MCC behavior in both games, but the MCC that we observed was different from that observed in the past experiments. In the present study, whether or not a focal participant cooperated previously affected the overall level of cooperation, instead of changing the tendency of cooperation in response to cooperation of other participants in the previous time step. We found that, across different conditions, reinforcement learning models were approximately as accurate as an MCC model in describing the experimental results. Consistent with the previous computational studies, the present results suggest that reinforcement learning may be a major proximate mechanism governing MCC behavior. PMID:28071646
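
    Reinforcement-learning accounts of conditional cooperation are often built on Bush-Mosteller-style updates, in which a satisfying payoff reinforces the action just taken and a disappointing one inhibits it. A minimal sketch of one such update rule (not the authors' fitted model):

```python
def update_p_coop(p, cooperated, payoff, aspiration, scale=1.0):
    # Bush-Mosteller update: the stimulus s in [-1, 1] measures how
    # satisfying the realized payoff was relative to an aspiration level.
    s = max(-1.0, min(1.0, (payoff - aspiration) / scale))
    if cooperated:
        # satisfying payoff reinforces cooperation; disappointing one inhibits it
        return p + s * (1.0 - p) if s >= 0 else p + s * p
    # mirrored update when the agent defected
    return p - s * p if s >= 0 else p - s * (1.0 - p)
```

    Because the update depends on the agent's own last action and payoff rather than directly on peers' choices, it can reproduce the "moody" dependence on one's own previous cooperation described above.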