Science.gov

Sample records for reward reinforcement learning

  1. Auto-exploratory average reward reinforcement learning

    SciTech Connect

    Ok, DoKyeong; Tadepalli, P.

    1996-12-31

    We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the original H-learning under previously studied exploration methods such as random, recency-based, or counter-based exploration.
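
    For readers coming from discounted RL, the sketch below illustrates the kind of average-reward value update that H-learning builds on: state values are backed up against a running estimate of the average reward per step instead of a discount factor. It is a minimal tabular illustration on an assumed toy MDP, not the authors' algorithm or their scheduling task.

      # Average-reward value update in the spirit of H-learning (illustrative only).
      # The transition and reward tables below are assumed toy values.
      states, actions = [0, 1, 2], [0, 1]
      next_state = {s: {a: (s + a + 1) % 3 for a in actions} for s in states}
      reward = {s: {a: 1.0 if (s == 2 and a == 1) else 0.0 for a in actions} for s in states}

      h = {s: 0.0 for s in states}   # relative (average-adjusted) state values
      rho = 0.0                      # running estimate of average reward per step
      beta = 0.05                    # step size for updating rho

      def backed_up(s, a):
          return reward[s][a] + h[next_state[s][a]]

      s = 0
      for _ in range(5000):
          a = max(actions, key=lambda b: backed_up(s, b))       # greedy action
          rho += beta * (reward[s][a] - rho)                    # update average reward
          h[s] = max(backed_up(s, b) for b in actions) - rho    # average-adjusted backup
          s = next_state[s][a]

      print(round(rho, 2), {k: round(v, 2) for k, v in h.items()})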

  2. Finding intrinsic rewards by embodied evolution and constrained reinforcement learning.

    PubMed

    Uchibe, Eiji; Doya, Kenji

    2008-12-01

    Understanding the design principle of reward functions is a substantial challenge both in artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals, but also for intermediate states to promote effective exploration. This paper proposes a method for designing 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and 'mating' by software reproduction are three major 'extrinsic' rewards. We show in hardware experiments that the robots can find appropriate 'intrinsic' rewards for the vision of battery packs and other robots to promote approach behaviors.

  3. Finding intrinsic rewards by embodied evolution and constrained reinforcement learning.

    PubMed

    Uchibe, Eiji; Doya, Kenji

    2008-12-01

    Understanding the design principle of reward functions is a substantial challenge both in artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals, but also for intermediate states to promote effective exploration. This paper proposes a method for designing 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and 'mating' by software reproduction are three major 'extrinsic' rewards. We show in hardware experiments that the robots can find appropriate 'intrinsic' rewards for the vision of battery packs and other robots to promote approach behaviors. PMID:19013054

  4. Optimal Reward Functions in Distributed Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Tumer, Kagan

    2000-01-01

    We consider the design of multi-agent systems so as to optimize an overall world utility function when (1) those systems lack centralized communication and control, and (2) each agent runs a distinct Reinforcement Learning (RL) algorithm. A crucial issue in such design problems is how to initialize/update each agent's private utility function, so as to induce the best possible world utility. Traditional 'team game' solutions to this problem sidestep this issue and simply assign to each agent the world utility as its private utility function. In previous work we used the 'Collective Intelligence' framework to derive a better choice of private utility functions, one that results in world utility performance up to orders of magnitude superior to that ensuing from use of the team game utility. In this paper we extend these results. We derive the general class of private utility functions that both are easy for the individual agents to learn and that, if learned well, result in high world utility. We demonstrate experimentally that using these new utility functions can result in significantly improved performance over that of our previously proposed utility, over and above that previous utility's superiority to the conventional team game utility.
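
    The abstract contrasts the 'team game' assignment, in which every agent is handed the world utility, with better-shaped private utilities. One well-known shaping from this line of work is a difference utility: an agent is credited with the world utility minus the world utility recomputed with its own action replaced by a fixed default. The snippet below only illustrates that idea; the world-utility function and the null action are assumptions, not taken from the paper.

      # Team-game utility vs. a difference utility for one agent (illustrative).
      def world_utility(joint_action):
          # Assumed world utility: reward agents for covering distinct choices.
          return len(set(joint_action))

      def team_game_utility(i, joint_action):
          return world_utility(joint_action)

      def difference_utility(i, joint_action, null_action=0):
          counterfactual = list(joint_action)
          counterfactual[i] = null_action          # replace agent i's action by a default
          return world_utility(joint_action) - world_utility(counterfactual)

      joint = [2, 1, 1, 0]
      for i in range(len(joint)):
          print(i, team_game_utility(i, joint), difference_utility(i, joint))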

  5. Inferring reward prediction errors in patients with schizophrenia: a dynamic reward task for reinforcement learning.

    PubMed

    Li, Chia-Tzu; Lai, Wen-Sung; Liu, Chih-Min; Hsu, Yung-Fong

    2014-01-01

    Abnormalities in the dopamine system have long been implicated in explanations of reinforcement learning and psychosis. The updated reward prediction error (RPE), a discrepancy between the predicted and actual rewards, is thought to be encoded by dopaminergic neurons. Dysregulation of dopamine systems could alter the appraisal of stimuli and eventually lead to schizophrenia. Accordingly, the measurement of RPE provides a potential behavioral index for the evaluation of brain dopamine activity and psychotic symptoms. Here, we assess two features potentially crucial to the RPE process, namely belief formation and belief perseveration, via a probability learning task and reinforcement-learning modeling. Forty-five patients with schizophrenia [26 high-psychosis and 19 low-psychosis, based on their p1 and p3 scores in the positive-symptom subscales of the Positive and Negative Syndrome Scale (PANSS)] and 24 controls were tested in a feedback-based dynamic reward task for their RPE-related decision making. While task scores across the three groups were similar, matching law analysis revealed that the reward sensitivities of both psychosis groups were lower than that of controls. Trial-by-trial data were further fit with a reinforcement learning model using the Bayesian estimation approach. Model fitting results indicated that both psychosis groups tend to update their reward values more rapidly than controls. Moreover, among the three groups, high-psychosis patients had the lowest degree of choice perseveration. Lumping patients' data together, we also found that patients' perseveration appears to be negatively correlated (p = 0.09, trending toward significance) with their PANSS p1 + p3 scores. Our method provides an alternative for investigating reward-related learning and decision making in basic and clinical settings.
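
    The model fitting described above compares learning rates and choice perseveration across groups. The sketch below is a generic model of that family (a delta-rule value update plus a perseveration bonus inside a softmax choice rule); parameter names and values are illustrative assumptions, not the authors' fitted model.

      import math, random

      def simulate(trials, p_reward, alpha, beta, persev, seed=0):
          """Delta-rule learner: alpha = learning rate, beta = choice determinism,
          persev = bonus added to the option chosen on the previous trial."""
          random.seed(seed)
          q, last_choice, choices = [0.0, 0.0], None, []
          for _ in range(trials):
              logits = [beta * q[a] + (persev if a == last_choice else 0.0) for a in (0, 1)]
              m = max(logits)
              weights = [math.exp(l - m) for l in logits]
              z = sum(weights)
              probs = [w / z for w in weights]
              a = 0 if random.random() < probs[0] else 1
              r = 1.0 if random.random() < p_reward[a] else 0.0
              q[a] += alpha * (r - q[a])            # reward prediction error update
              last_choice = a
              choices.append(a)
          return choices

      rapid_updating = simulate(200, [0.7, 0.3], alpha=0.8, beta=3.0, persev=0.2)
      strong_persev = simulate(200, [0.7, 0.3], alpha=0.2, beta=3.0, persev=1.0)
      print(sum(rapid_updating), sum(strong_persev))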

  6. Beyond simple reinforcement learning: the computational neurobiology of reward-learning and valuation.

    PubMed

    O'Doherty, John P

    2012-04-01

    Neural computational accounts of reward-learning have been dominated by the hypothesis that dopamine neurons behave like a reward-prediction error and thus facilitate reinforcement learning in striatal target neurons. While this framework is consistent with a lot of behavioral and neural evidence, this theory fails to account for a number of behavioral and neurobiological observations. In this special issue of EJN we feature a combination of theoretical and experimental papers highlighting some of the explanatory challenges faced by simple reinforcement-learning models and describing some of the ways in which the framework is being extended in order to address these challenges.

  7. Homeostatic reinforcement learning for integrating reward collection and physiological stability.

    PubMed

    Keramati, Mehdi; Gutkin, Boris

    2014-12-02

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system.

  8. Homeostatic reinforcement learning for integrating reward collection and physiological stability.

    PubMed

    Keramati, Mehdi; Gutkin, Boris

    2014-01-01

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system. PMID:25457346

  9. When, What, and How Much to Reward in Reinforcement Learning-Based Models of Cognition

    ERIC Educational Resources Information Center

    Janssen, Christian P.; Gray, Wayne D.

    2012-01-01

    Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing a reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward; that is, when (the moment: end of trial, subtask, or some other…

  10. Homeostatic reinforcement learning for integrating reward collection and physiological stability

    PubMed Central

    Keramati, Mehdi; Gutkin, Boris

    2014-01-01

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system. DOI: http://dx.doi.org/10.7554/eLife.04811.001 PMID:25457346
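
    The theory summarized in these records defines reward by its effect on internal physiological state. A common way to write that down is to score an outcome by how much it reduces a drive function, i.e. the distance between the internal state and its setpoint. The quadratic drive below is an assumed, simple choice used only for illustration, not the paper's exact formulation.

      # Illustrative drive-reduction reward: outcomes that move the internal state
      # toward the homeostatic setpoint yield positive reward, and vice versa.
      def drive(state, setpoint=(50.0, 50.0)):
          return sum((s - sp) ** 2 for s, sp in zip(state, setpoint)) ** 0.5

      def reward(state_before, outcome_effect):
          state_after = tuple(s + d for s, d in zip(state_before, outcome_effect))
          return drive(state_before) - drive(state_after)

      hungry = (30.0, 50.0)                      # assumed (energy, hydration) state
      print(reward(hungry, (10.0, 0.0)))         # food when depleted: positive reward
      print(reward((50.0, 50.0), (10.0, 0.0)))   # same food at the setpoint: negative reward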

  11. Toward an autonomous brain machine interface: integrating sensorimotor reward modulation and reinforcement learning.

    PubMed

    Marsh, Brandi T; Tarigoppula, Venkata S Aditya; Chen, Chen; Francis, Joseph T

    2015-05-13

    For decades, neurophysiologists have worked on elucidating the function of the cortical sensorimotor control system from the standpoint of kinematics or dynamics. Recently, computational neuroscientists have developed models that can emulate changes seen in the primary motor cortex during learning. However, these simulations rely on the existence of a reward-like signal in the primary sensorimotor cortex. Reward modulation of the primary sensorimotor cortex has yet to be characterized at the level of neural units. Here we demonstrate that single units/multiunits and local field potentials in the primary motor (M1) cortex of nonhuman primates (Macaca radiata) are modulated by reward expectation during reaching movements and that this modulation is present even while subjects passively view cursor motions that are predictive of either reward or nonreward. After establishing this reward modulation, we set out to determine whether we could correctly classify rewarding versus nonrewarding trials, on a moment-to-moment basis. This reward information could then be used in collaboration with reinforcement learning principles toward an autonomous brain-machine interface. The autonomous brain-machine interface would use M1 for both decoding movement intention and extraction of reward expectation information as evaluative feedback, which would then update the decoding algorithm as necessary. In the work presented here, we show that this, in theory, is possible. PMID:25972167

  12. Toward an autonomous brain machine interface: integrating sensorimotor reward modulation and reinforcement learning.

    PubMed

    Marsh, Brandi T; Tarigoppula, Venkata S Aditya; Chen, Chen; Francis, Joseph T

    2015-05-13

    For decades, neurophysiologists have worked on elucidating the function of the cortical sensorimotor control system from the standpoint of kinematics or dynamics. Recently, computational neuroscientists have developed models that can emulate changes seen in the primary motor cortex during learning. However, these simulations rely on the existence of a reward-like signal in the primary sensorimotor cortex. Reward modulation of the primary sensorimotor cortex has yet to be characterized at the level of neural units. Here we demonstrate that single units/multiunits and local field potentials in the primary motor (M1) cortex of nonhuman primates (Macaca radiata) are modulated by reward expectation during reaching movements and that this modulation is present even while subjects passively view cursor motions that are predictive of either reward or nonreward. After establishing this reward modulation, we set out to determine whether we could correctly classify rewarding versus nonrewarding trials, on a moment-to-moment basis. This reward information could then be used in collaboration with reinforcement learning principles toward an autonomous brain-machine interface. The autonomous brain-machine interface would use M1 for both decoding movement intention and extraction of reward expectation information as evaluative feedback, which would then update the decoding algorithm as necessary. In the work presented here, we show that this, in theory, is possible.

  13. The left hemisphere learns what is right: Hemispatial reward learning depends on reinforcement learning processes in the contralateral hemisphere.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly Crystal; Schwartz, Sophie

    2016-08-01

    Orienting biases refer to consistent, trait-like direction of attention or locomotion toward one side of space. Recent studies suggest that such hemispatial biases may determine how well people memorize information presented in the left or right hemifield. Moreover, lesion studies indicate that learning rewarded stimuli in one hemispace depends on the integrity of the contralateral striatum. However, the exact neural and computational mechanisms underlying the influence of individual orienting biases on reward learning remain unclear. Because reward-based behavioural adaptation depends on the dopaminergic system and prediction error (PE) encoding in the ventral striatum, we hypothesized that hemispheric asymmetries in dopamine (DA) function may determine individual spatial biases in reward learning. To test this prediction, we acquired fMRI in 33 healthy human participants while they performed a lateralized reward task. Learning differences between hemispaces were assessed by presenting stimuli, assigned to different reward probabilities, to the left or right of central fixation, i.e. presented in the left or right visual hemifield. Hemispheric differences in DA function were estimated through differential fMRI responses to positive vs. negative feedback in the left vs. right ventral striatum, and a computational approach was used to identify the neural correlates of PEs. Our results show that spatial biases favoring reward learning in the right (vs. left) hemifield were associated with increased reward responses in the left hemisphere and relatively better neural encoding of PEs for stimuli presented in the right (vs. left) hemifield. These findings demonstrate that trait-like spatial biases implicate hemisphere-specific learning mechanisms, with individual differences between hemispheres contributing to reinforcing spatial biases. PMID:27221149

  14. The left hemisphere learns what is right: Hemispatial reward learning depends on reinforcement learning processes in the contralateral hemisphere.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly Crystal; Schwartz, Sophie

    2016-08-01

    Orienting biases refer to consistent, trait-like direction of attention or locomotion toward one side of space. Recent studies suggest that such hemispatial biases may determine how well people memorize information presented in the left or right hemifield. Moreover, lesion studies indicate that learning rewarded stimuli in one hemispace depends on the integrity of the contralateral striatum. However, the exact neural and computational mechanisms underlying the influence of individual orienting biases on reward learning remain unclear. Because reward-based behavioural adaptation depends on the dopaminergic system and prediction error (PE) encoding in the ventral striatum, we hypothesized that hemispheric asymmetries in dopamine (DA) function may determine individual spatial biases in reward learning. To test this prediction, we acquired fMRI in 33 healthy human participants while they performed a lateralized reward task. Learning differences between hemispaces were assessed by presenting stimuli, assigned to different reward probabilities, to the left or right of central fixation, i.e. presented in the left or right visual hemifield. Hemispheric differences in DA function were estimated through differential fMRI responses to positive vs. negative feedback in the left vs. right ventral striatum, and a computational approach was used to identify the neural correlates of PEs. Our results show that spatial biases favoring reward learning in the right (vs. left) hemifield were associated with increased reward responses in the left hemisphere and relatively better neural encoding of PEs for stimuli presented in the right (vs. left) hemifield. These findings demonstrate that trait-like spatial biases implicate hemisphere-specific learning mechanisms, with individual differences between hemispheres contributing to reinforcing spatial biases.

  15. Principal components analysis of reward prediction errors in a reinforcement learning task.

    PubMed

    Sambrook, Thomas D; Goslin, Jeremy

    2016-01-01

    Models of reinforcement learning represent reward and punishment in terms of reward prediction errors (RPEs), quantitative signed terms describing the degree to which outcomes are better than expected (positive RPEs) or worse (negative RPEs). An electrophysiological component known as feedback-related negativity (FRN) occurs at frontocentral sites 240-340 ms after feedback on whether a reward or punishment is obtained, and has been claimed to neurally encode an RPE. An outstanding question, however, is whether the FRN is sensitive to the size of both positive RPEs and negative RPEs. Previous attempts to answer this question have examined the simple effects of RPE size for positive RPEs and negative RPEs separately. However, this methodology can be compromised by overlap from components coding for unsigned prediction error size, or "salience", which are sensitive to the absolute size of a prediction error but not its valence. In our study, positive and negative RPEs were parametrically modulated using both reward likelihood and magnitude, with principal components analysis used to separate out overlying components. This revealed a single RPE encoding component responsive to the size of positive RPEs, peaking at ~330 ms, and occupying the delta frequency band. Other components responsive to unsigned prediction error size were shown, but no component sensitive to negative RPE size was found.
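
    The distinction drawn above between a signed prediction error and an unsigned "salience" signal is easy to state concretely; the snippet below is a plain illustration of the two quantities.

      # Signed reward prediction error (RPE) versus unsigned "salience".
      def prediction_errors(expected, obtained):
          signed = obtained - expected      # positive if better than expected, negative if worse
          unsigned = abs(signed)            # size of the surprise, regardless of valence
          return signed, unsigned

      print(prediction_errors(expected=0.5, obtained=1.0))   # (0.5, 0.5)  better than expected
      print(prediction_errors(expected=0.5, obtained=0.0))   # (-0.5, 0.5) worse than expected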

  16. When, what, and how much to reward in reinforcement learning-based models of cognition.

    PubMed

    Janssen, Christian P; Gray, Wayne D

    2012-03-01

    Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing a reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward; that is, when (the moment: end of trial, subtask, or some other interval of task performance), what (the objective function: e.g., performance time or performance accuracy), and how much (the magnitude: with binary, categorical, or continuous values). In this article, we explore the problem space of these three parameters in the context of a task whose completion entails some combination of 36 state-action pairs, where all intermediate states (i.e., after the initial state and prior to the end state) represent progressive but partial completion of the task. Different choices produce profoundly different learning paths and outcomes, with the strongest effect for moment. Unfortunately, there is little discussion in the literature of the effect of such choices. This absence is disappointing, as the choice of when, what, and how much needs to be made by a modeler for every learning model. PMID:22257174

  17. [A model of reward choice based on the theory of reinforcement learning].

    PubMed

    Smirnitskaia, I A; Frolov, A A; Merzhanova, G Kh

    2007-01-01

    We developed a model of an alimentary instrumental conditioned bar-pressing reflex for cats choosing between an immediate small reinforcement ("impulsive behavior") and a delayed, more valuable reinforcement ("self-control behavior"). Our model is based on reinforcement learning theory. We emulated the contribution of dopamine by the discount coefficient of this theory (a subjective decrease in the value of a delayed reinforcement). The results of computer simulation showed that "cats" with a large discount coefficient demonstrated "self-control behavior", while a small discount coefficient was associated with "impulsive behavior". These data agree with experimental findings indicating that impulsive behavior is due to a decreased amount of dopamine in the striatum.

  18. Dopaminergic control of motivation and reinforcement learning: a closed-circuit account for reward-oriented behavior.

    PubMed

    Morita, Kenji; Morishima, Mieko; Sakai, Katsuyuki; Kawaguchi, Yasuo

    2013-05-15

    Humans and animals take actions quickly when they expect that the actions lead to reward, reflecting their motivation. Injection of dopamine receptor antagonists into the striatum has been shown to slow such reward-seeking behavior, suggesting that dopamine is involved in the control of motivational processes. Meanwhile, neurophysiological studies have revealed that phasic response of dopamine neurons appears to represent reward prediction error, indicating that dopamine plays central roles in reinforcement learning. However, previous attempts to elucidate the mechanisms of these dopaminergic controls have not fully explained how the motivational and learning aspects are related and whether they can be understood by the way the activity of dopamine neurons itself is controlled by their upstream circuitries. To address this issue, we constructed a closed-circuit model of the corticobasal ganglia system based on recent findings regarding intracortical and corticostriatal circuit architectures. Simulations show that the model could reproduce the observed distinct motivational effects of D1- and D2-type dopamine receptor antagonists. Simultaneously, our model successfully explains the dopaminergic representation of reward prediction error as observed in behaving animals during learning tasks and could also explain distinct choice biases induced by optogenetic stimulation of the D1 and D2 receptor-expressing striatal neurons. These results indicate that the suggested roles of dopamine in motivational control and reinforcement learning can be understood in a unified manner through a notion that the indirect pathway of the basal ganglia represents the value of states/actions at a previous time point, an empirically driven key assumption of our model.

  19. A model of reward choice based on the theory of reinforcement learning.

    PubMed

    Smirnitskaya, I A; Frolov, A A; Merzhanova, G Kh

    2008-03-01

    A model explaining behavioral "impulsivity" and "self-control" is proposed on the basis of the theory of reinforcement learning. The discount coefficient gamma, which in this theory accounts for the subjective reduction in the value of a delayed reinforcement, is identified with the overall level of dopaminergic neuron activity which, according to published data, also determines the behavioral variant. Computer modeling showed that high values of gamma are characteristic of predominantly "self-controlled" subjects, while smaller values of gamma are characteristic of "impulsive" subjects.
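
    As a concrete illustration of how the discount coefficient gamma separates "impulsive" from "self-controlled" choices, the snippet below compares the discounted value of an immediate small reward with that of a delayed larger reward for a low and a high gamma; the reward sizes and delay are made-up numbers.

      # Discounted value of a reward r received after `delay` steps: gamma**delay * r.
      def discounted_value(reward, delay, gamma):
          return (gamma ** delay) * reward

      small_now = (1.0, 0)       # small reward, available immediately
      large_later = (4.0, 6)     # larger reward, delayed by 6 steps

      for gamma in (0.5, 0.9):   # low gamma ~ "impulsive", high gamma ~ "self-controlled"
          v_small = discounted_value(*small_now, gamma)
          v_large = discounted_value(*large_later, gamma)
          choice = "wait for large reward" if v_large > v_small else "take small reward now"
          print(f"gamma={gamma}: small={v_small:.2f}, large={v_large:.2f} -> {choice}")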

  20. The Rewards of Learning.

    ERIC Educational Resources Information Center

    Chance, Paul

    1992-01-01

    Although intrinsic rewards are important, they (along with punishment and encouragement) are insufficient for efficient learning. Teachers must supplement intrinsic rewards with extrinsic rewards, such as praising, complimenting, applauding, and providing other forms of recognition for good work. Teachers should use the weakest reward required to…

  21. [Reinforcement learning by striatum].

    PubMed

    Kunisato, Yoshihiko; Okada, Go; Okamoto, Yasumasa

    2009-04-01

    Recently, computational models of reinforcement learning have been applied to the analysis of neuroimaging data. It has been clarified that the striatum plays a key role in decision making. We review reinforcement learning theory and the brain structures and neuromodulator signals associated with reinforcement learning. We also investigated the function of the striatum and the neurotransmitter serotonin in reward prediction. We first studied the brain mechanisms for reward prediction at different time scales. Our experiment on the striatum showed that the ventroanterior regions are involved in predicting immediate rewards and the dorsoposterior regions are involved in predicting future rewards. Further, we investigated whether serotonin regulates reward selection and whether striatal subregions are specialized for reward prediction at different time scales. To this end, we regulated the dietary intake of tryptophan, a precursor of serotonin. Our experiment showed that the activity of the ventral part of the striatum was correlated with reward prediction at shorter time scales, and this activity was stronger at low serotonin levels. By contrast, the activity of the dorsal part of the striatum was correlated with reward prediction at longer time scales, and this activity was stronger at high serotonin levels. Further, a higher proportion of small reward choices, together with a higher rate of discounting of delayed rewards, was observed in the low-serotonin condition than in the control and high-serotonin conditions. Further examinations are required in the future to assess the relation between the disturbance of reward prediction caused by low serotonin and mental disorders related to serotonin, such as depression.

  22. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.

    PubMed

    Glimcher, Paul W

    2011-09-13

    A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn.

  23. Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis

    PubMed Central

    Glimcher, Paul W.

    2011-01-01

    A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn. PMID:21389268

  24. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.

    PubMed

    Glimcher, Paul W

    2011-09-13

    A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn. PMID:21389268

  25. Reward and learning in the goldfish.

    PubMed

    Lowes, G; Bitterman, M E

    1967-07-28

    An experiment with goldfish showed the effects of change in amount of reward that are predicted from reinforcement theory. The performance of animals shifted from small to large reward improved gradually to the level of unshifted large-reward controls, while the performance of animals shifted from large to small reward remained at the large-reward level. The difference between these results and those obtained in analogous experiments with the rat suggests that reward functions differently in the instrumental learning of the two animals.

  26. Heads for learning, tails for memory: reward, reinforcement and a role of dopamine in determining behavioral relevance across multiple timescales

    PubMed Central

    Baudonnat, Mathieu; Huber, Anna; David, Vincent; Walton, Mark E.

    2013-01-01

    Dopamine has long been tightly associated with aspects of reinforcement learning and motivation in simple situations where there are a limited number of stimuli to guide behavior and constrained range of outcomes. In naturalistic situations, however, there are many potential cues and foraging strategies that could be adopted, and it is critical that animals determine what might be behaviorally relevant in such complex environments. This requires not only detecting discrepancies with what they have recently experienced, but also identifying similarities with past experiences stored in memory. Here, we review what role dopamine might play in determining how and when to learn about the world, and how to develop choice policies appropriate to the situation faced. We discuss evidence that dopamine is shaped by motivation and memory and in turn shapes reward-based memory formation. In particular, we suggest that hippocampal-striatal-dopamine networks may interact to determine how surprising the world is and to either inhibit or promote actions at time of behavioral uncertainty. PMID:24130514

  27. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning

    PubMed Central

    Balasubramani, Pragathi P.; Chakravarthy, V. Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A.

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk based decision making in a modified Reinforcement Learning (RL)-framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation that accommodates both 5HT and DA reconciles some of the diverse roles of 5HT particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) Risk-sensitive decision making, where 5HT controls risk assessment, (2) Temporal reward prediction, where 5HT controls time-scale of reward prediction, and (3) Reward/Punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG. PMID:24795614

  28. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning.

    PubMed

    Balasubramani, Pragathi P; Chakravarthy, V Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk based decision making in a modified Reinforcement Learning (RL)-framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation that accommodates both 5HT and DA reconciles some of the diverse roles of 5HT particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) Risk-sensitive decision making, where 5HT controls risk assessment, (2) Temporal reward prediction, where 5HT controls time-scale of reward prediction, and (3) Reward/Punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG.
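
    The abstract above maps dopamine onto the temporal-difference error and serotonin onto a parameter controlling risk sensitivity. The sketch below shows one simple reading of that mapping: expected reward and the variance of prediction errors are learned side by side, and the action utility trades them off with an alpha weight. It is a schematic illustration, not the authors' basal ganglia model.

      import random

      def learn(p_reward, reward_sizes, alpha, trials=2000, lr=0.1, seed=1):
          """Track expected reward (q) and prediction-error variance (h, 'risk');
          return a risk-sensitive utility q - alpha * sqrt(h)."""
          random.seed(seed)
          q, h = 0.0, 0.0
          for _ in range(trials):
              r = reward_sizes[0] if random.random() < p_reward else reward_sizes[1]
              delta = r - q               # dopamine-like prediction error
              q += lr * delta             # expected reward
              h += lr * (delta ** 2 - h)  # risk (variance of prediction errors)
          return q - alpha * h ** 0.5     # serotonin-like alpha penalizes risk

      # A risky option (large, rare reward) vs. a safe option, under low and high alpha.
      for alpha in (0.0, 1.0):
          risky = learn(0.2, (5.0, 0.0), alpha)
          safe = learn(1.0, (1.0, 0.0), alpha)
          print(f"alpha={alpha}: risky={risky:.2f}, safe={safe:.2f}")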

  29. Reward feedback accelerates motor learning.

    PubMed

    Nikooyan, Ali A; Ahmed, Alaa A

    2015-01-15

    Recent findings have demonstrated that reward feedback alone can drive motor learning. However, it is not yet clear whether reward feedback alone can lead to learning when a perturbation is introduced abruptly, or how a reward gradient can modulate learning. In this study, we provide reward feedback that decays continuously with increasing error. We asked whether it is possible to learn an abrupt visuomotor rotation by reward alone, and if the learning process could be modulated by combining reward and sensory feedback and/or by using different reward landscapes. We designed a novel visuomotor learning protocol during which subjects experienced an abruptly introduced rotational perturbation. Subjects received either visual feedback or reward feedback, or a combination of the two. Two different reward landscapes, where the reward decayed either linearly or cubically with distance from the target, were tested. Results demonstrate that it is possible to learn from reward feedback alone and that the combination of reward and sensory feedback accelerates learning. An analysis of the underlying mechanisms reveals that although reward feedback alone does not allow for sensorimotor remapping, it can nonetheless lead to broad generalization, highlighting a dissociation between remapping and generalization. Also, the combination of reward and sensory feedback accelerates learning without compromising sensorimotor remapping. These findings suggest that the use of reward feedback is a promising approach to either supplement or substitute sensory feedback in the development of improved neurorehabilitation techniques. More generally, they point to an important role played by reward in the motor learning process.

  30. Robust reinforcement learning.

    PubMed

    Morimoto, Jun; Doya, Kenji

    2005-02-01

    This letter proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both offline learning using simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, and often unwanted, results. Based on the theory of H∞ control, we consider a differential game in which a "disturbing" agent tries to make the worst possible disturbance while a "control" agent tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the amount of the reward and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call robust reinforcement learning (RRL), on the control task of an inverted pendulum. In the linear domain, the policy and the value function learned by online algorithms coincided with those derived analytically by linear H∞ control theory. For a fully nonlinear swing-up task, RRL achieved robust performance with changes in the pendulum weight and friction, while a standard reinforcement learning algorithm could not deal with these changes. We also applied RRL to the cart-pole swing-up task, and a robust swing-up policy was acquired.
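
    The min-max formulation described above can be illustrated on a tiny discrete problem: the value of a state is backed up using the best control action against the worst-case disturbance. The one-dimensional task below is an assumption made for illustration; the paper itself works with continuous control tasks and H∞ theory.

      # Min-max (robust) value iteration: the controller maximizes, an adversarial
      # disturbance minimizes. States, actions and dynamics are made up.
      states = range(5)                 # positions on a line; state 4 is the goal
      actions = [-2, +2]                # control moves (stronger than the disturbance)
      disturbances = [-1, 0, +1]        # adversarial push
      gamma = 0.9

      def step(s, a, d):
          s_next = min(max(s + a + d, 0), 4)
          return s_next, (1.0 if s_next == 4 else 0.0)

      V = {s: 0.0 for s in states}

      def backed_up(s, a, d):
          s_next, r = step(s, a, d)
          return r + gamma * V[s_next]

      for _ in range(100):
          V = {s: max(min(backed_up(s, a, d) for d in disturbances) for a in actions)
               for s in states}

      print({s: round(v, 2) for s, v in V.items()})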

  31. Dimensional reduction for reward-based learning.

    PubMed

    Swinehart, Christian D; Abbott, L F

    2006-09-01

    Reward-based learning in neural systems is challenging because a large number of parameters that affect network function must be optimized solely on the basis of a reward signal that indicates improved performance. Searching the parameter space for an optimal solution is particularly difficult if the network is large. We show that Hebbian forms of synaptic plasticity applied to synapses between a supervisor circuit and the network it is controlling can effectively reduce the dimension of the space of parameters being searched to support efficient reinforcement-based learning in large networks. The critical element is that the connections between the supervisor units and the network must be reciprocal. Once the appropriate connections have been set up by Hebbian plasticity, a reinforcement-based learning procedure leads to rapid learning in a function approximation task. Hebbian plasticity within the network being supervised ultimately allows the network to perform the task without input from the supervisor.

  32. Prosocial Reward Learning in Children and Adolescents

    PubMed Central

    Kwak, Youngbin; Huettel, Scott A.

    2016-01-01

    Adolescence is a period of increased sensitivity to social contexts. To evaluate how social context sensitivity changes over development—and influences reward learning—we investigated how children and adolescents perceive and integrate rewards for oneself and others during a dynamic risky decision-making task. Children and adolescents (N = 75, 8–16 years) performed the Social Gambling Task (SGT, Kwak et al., 2014) and completed a set of questionnaires measuring other-regarding behavior. In the SGT, participants choose amongst four card decks that have different payout structures for oneself and for a charity. We examined patterns of choices, overall decision strategies, and how reward outcomes led to trial-by-trial adjustments in behavior, as estimated using a reinforcement-learning model. Performance of children and adolescents was compared to data from a previously collected sample of adults (N = 102) performing the identical task. We found that children/adolescents were not only more sensitive to rewards directed to the charity than self but also showed greater prosocial tendencies on independent measures of other-regarding behavior. Children and adolescents also showed less use of a strategy that prioritizes rewards for self at the expense of rewards for others. These results support the conclusion that, compared to adults, children and adolescents show greater sensitivity to outcomes for others when making decisions and learning about potential rewards. PMID:27761125

  33. Quantum reinforcement learning.

    PubMed

    Dong, Daoyi; Chen, Chunlin; Li, Hanxiong; Tarn, Tzyh-Jong

    2008-10-01

    The key approaches for machine learning, particularly learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of a value-updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of the eigen action is determined by the probability amplitude, which is updated in parallel according to rewards. Some related characteristics of QRL such as convergence, optimality, and balancing between exploration and exploitation are also analyzed, which shows that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through the quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given, and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. This paper is also an effective exploration on the application of quantum computation to artificial intelligence.

  34. Quantum reinforcement learning.

    PubMed

    Dong, Daoyi; Chen, Chunlin; Li, Hanxiong; Tarn, Tzyh-Jong

    2008-10-01

    The key approaches for machine learning, particularly learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of a value-updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of the eigen action is determined by the probability amplitude, which is updated in parallel according to rewards. Some related characteristics of QRL such as convergence, optimality, and balancing between exploration and exploitation are also analyzed, which shows that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through the quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given, and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. This paper is also an effective exploration on the application of quantum computation to artificial intelligence. PMID:18784007
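
    As a rough, classically simulated caricature of the amplitude bookkeeping described above: each action carries an amplitude, observation samples an action with probability equal to its squared amplitude, and rewarded actions have their amplitude boosted and the vector renormalized. This is an assumed illustration of the idea only; it does not reproduce the paper's quantum value-updating algorithm.

      import random

      amp = [1.0, 1.0]                  # amplitudes over two actions for one state

      def normalize(a):
          norm = sum(x * x for x in a) ** 0.5
          return [x / norm for x in a]

      amp = normalize(amp)
      random.seed(0)
      for _ in range(500):
          probs = [x * x for x in amp]                        # "measurement" probabilities
          action = 0 if random.random() < probs[0] else 1
          reward = 1.0 if (action == 1 and random.random() < 0.8) else 0.0
          amp[action] += 0.05 * reward                        # boost rewarded action
          amp = normalize(amp)

      print([round(x * x, 3) for x in amp])                   # mass shifts toward action 1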

  35. Reinforcement, Reward, and Intrinsic Motivation: A Meta-Analysis.

    ERIC Educational Resources Information Center

    Cameron, Judy; Pierce, W. David

    1994-01-01

    A meta-analysis including 96 experimental studies considers the effects of reinforcement/reward on intrinsic motivation. Results indicate that reward does not decrease intrinsic motivation, although interaction effects must be examined. An analysis with five studies also indicates that reinforcement does not harm intrinsic motivation. (SLD)

  36. Reinforcement learning and Tourette syndrome.

    PubMed

    Palminteri, Stefano; Pessiglione, Mathias

    2013-01-01

    In this chapter, we report the first experimental explorations of reinforcement learning in Tourette syndrome, realized by our team in the last few years. This report will be preceded by an introduction aimed at providing the reader with the state of the art of knowledge concerning the neural bases of reinforcement learning at the time of these studies and the scientific rationale behind them. In short, reinforcement learning is learning by trial and error to maximize rewards and minimize punishments. This decision-making and learning process implicates the dopaminergic system projecting to the frontal cortex-basal ganglia circuits. A large body of evidence suggests that the dysfunction of the same neural systems is implicated in the pathophysiology of Tourette syndrome. Our results show that the Tourette condition, as well as the most common pharmacological treatments (dopamine antagonists), affects reinforcement learning performance in these patients. Specifically, the results suggest a deficit in negative reinforcement learning, possibly underpinned by a functional hyperdopaminergia, which could explain the persistence of tics, despite their evident inadaptive (negative) value. This idea, together with the implications of these results in Tourette therapy and the future perspectives, is discussed in Section 4 of this chapter.

  37. Microstimulation of the human substantia nigra alters reinforcement learning.

    PubMed

    Ramayya, Ashwin G; Misra, Amrit; Baltuch, Gordon H; Kahana, Michael J

    2014-05-14

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action-reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action-reward associations rather than stimulus-reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action-reward associations during reinforcement learning. PMID:24828643

  38. Individual differences in sensitivity to reward and punishment and neural activity during reward and avoidance learning.

    PubMed

    Kim, Sang Hee; Yoon, HeungSik; Kim, Hackjin; Hamann, Stephan

    2015-09-01

    In this functional neuroimaging study, we investigated neural activations during the process of learning to gain monetary rewards and to avoid monetary loss, and how these activations are modulated by individual differences in reward and punishment sensitivity. Healthy young volunteers performed a reinforcement learning task where they chose one of two fractal stimuli associated with monetary gain (reward trials) or avoidance of monetary loss (avoidance trials). Trait sensitivity to reward and punishment was assessed using the behavioral inhibition/activation scales (BIS/BAS). Functional neuroimaging results showed activation of the striatum during the anticipation and reception periods of reward trials. During avoidance trials, activation of the dorsal striatum and prefrontal regions was found. As expected, individual differences in reward sensitivity were positively associated with activation in the left and right ventral striatum during reward reception. Individual differences in sensitivity to punishment were negatively associated with activation in the left dorsal striatum during avoidance anticipation and also with activation in the right lateral orbitofrontal cortex during receiving monetary loss. These results suggest that learning to attain reward and learning to avoid loss are dependent on separable sets of neural regions whose activity is modulated by trait sensitivity to reward or punishment.

  39. Reinforcement, Expectancy, and Learning

    ERIC Educational Resources Information Center

    Bolles, Robert

    1972-01-01

    Surveys some of the difficulties currently confronting the reinforcement concept and considers some alternatives to reinforcement as the fundamental basis of learning. Two specific alternatives considered are: an incentive motivation approach and a cognitive approach. (Author)

  40. Reinforcement of Learning

    ERIC Educational Resources Information Center

    Jones, Peter

    1977-01-01

    A company trainer shows some ways of scheduling reinforcement of learning for trainees: continuous reinforcement, fixed ratio, variable ratio, fixed interval, and variable interval. As there are problems with all methods, he suggests trying combinations of various types of reinforcement. (MF)

  41. Reinforcement learning: Computational theory and biological mechanisms.

    PubMed

    Doya, Kenji

    2007-05-01

    Reinforcement learning is a computational framework for an active agent to learn behaviors on the basis of a scalar reward signal. The agent can be an animal, a human, or an artificial system such as a robot or a computer program. The reward can be food, water, money, or whatever measure of the performance of the agent. The theory of reinforcement learning, which was developed in an artificial intelligence community with intuitions from animal learning theory, is now giving a coherent account on the function of the basal ganglia. It now serves as the "common language" in which biologists, engineers, and social scientists can exchange their problems and findings. This article reviews the basic theoretical framework of reinforcement learning and discusses its recent and future contributions toward the understanding of animal behaviors and human decision making.

  42. Reinforcement learning: Computational theory and biological mechanisms

    PubMed Central

    Doya, Kenji

    2007-01-01

    Reinforcement learning is a computational framework for an active agent to learn behaviors on the basis of a scalar reward signal. The agent can be an animal, a human, or an artificial system such as a robot or a computer program. The reward can be food, water, money, or whatever measure of the performance of the agent. The theory of reinforcement learning, which was developed in an artificial intelligence community with intuitions from animal learning theory, is now giving a coherent account on the function of the basal ganglia. It now serves as the “common language” in which biologists, engineers, and social scientists can exchange their problems and findings. This article reviews the basic theoretical framework of reinforcement learning and discusses its recent and future contributions toward the understanding of animal behaviors and human decision making. PMID:19404458
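
    As a concrete anchor for the "scalar reward signal" framing in the two records above, the snippet below is a minimal tabular Q-learning loop on a made-up two-state task; it illustrates the general framework only and is not tied to this particular review.

      import random

      random.seed(0)
      n_states, n_actions = 2, 2
      Q = [[0.0] * n_actions for _ in range(n_states)]
      alpha, gamma, epsilon = 0.1, 0.9, 0.1

      def env_step(state, action):
          # Assumed toy dynamics: action 1 in state 1 pays off and resets to state 0.
          if state == 1 and action == 1:
              return 0, 1.0
          return (state + 1) % n_states, 0.0

      state = 0
      for _ in range(5000):
          if random.random() < epsilon:                       # occasional exploration
              action = random.randrange(n_actions)
          else:
              action = max(range(n_actions), key=lambda a: Q[state][a])
          next_state, r = env_step(state, action)
          td_target = r + gamma * max(Q[next_state])
          Q[state][action] += alpha * (td_target - Q[state][action])   # TD update
          state = next_state

      print([[round(q, 2) for q in row] for row in Q])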

  43. Reinforcement learning: Solving two case studies

    NASA Astrophysics Data System (ADS)

    Duarte, Ana Filipa; Silva, Pedro; dos Santos, Cristina Peixoto

    2012-09-01

    Reinforcement Learning algorithms offer interesting features for the control of autonomous systems, such as the ability to learn from direct interaction with the environment, and the use of a simple reward signal, as opposed to the input-output pairs used in classic supervised learning. The reward signal indicates the success or failure of the actions executed by the agent in the environment. In this work, we describe RL algorithms applied to two case studies: the Crawler robot and the widely known inverted pendulum. We explore RL capabilities to autonomously learn a basic locomotion pattern in the Crawler, and approach the balancing problem of biped locomotion using the inverted pendulum.

  44. Modular Inverse Reinforcement Learning for Visuomotor Behavior

    PubMed Central

    Rothkopf, Constantin A.; Ballard, Dana H.

    2013-01-01

    In a large variety of situations one would like to have an expressive and accurate model of observed animal or human behavior. While general purpose mathematical models may successfully capture properties of observed behavior, it is desirable to root models in biological facts. Because of ample empirical evidence for reward-based learning in visuomotor tasks we use a computational model based on the assumption that the observed agent is balancing the costs and benefits of its behavior to meet its goals. This leads to using the framework of Reinforcement Learning, which additionally provides well-established algorithms for learning of visuomotor task solutions. To quantify the agent’s goals as rewards implicit in the observed behavior, we propose to use inverse reinforcement learning. Based on the assumption of a modular cognitive architecture, we introduce a modular inverse reinforcement learning algorithm that estimates the relative reward contributions of the component tasks in navigation, consisting of following a path while avoiding obstacles and approaching targets. It is shown how to recover the component reward weights for individual tasks and that variability in observed trajectories can be explained succinctly through behavioral goals. It is demonstrated through simulations that good estimates can be obtained already with modest amounts of observation data, which in turn allows the prediction of behavior in novel configurations. PMID:23832417

  45. Theory meets pigeons: the influence of reward-magnitude on discrimination-learning.

    PubMed

    Rose, Jonas; Schmidt, Robert; Grabemann, Marco; Güntürkün, Onur

    2009-03-01

    Modern theoretical accounts of reward-based learning are commonly based on reinforcement learning algorithms. Most noted in this context is the temporal-difference (TD) algorithm in which the difference between predicted and obtained reward, the prediction-error, serves as a learning signal. Consequently, larger rewards cause bigger prediction-errors and lead to faster learning than smaller rewards. Therefore, if animals employ a neural implementation of TD learning, reward-magnitude should affect learning in animals accordingly. Here we test this prediction by training pigeons on a simple color-discrimination task with two pairs of colors. In each pair, correct discrimination is rewarded; in pair one with a large reward, in pair two with a small reward. Pigeons acquired the 'large-reward' discrimination faster than the 'small-reward' discrimination. Animal behavior and an implementation of the TD-algorithm yielded comparable results with respect to the difference between learning curves in the large-reward and in the small-reward conditions. We conclude that the influence of reward-magnitude on the acquisition of a simple discrimination paradigm is accurately reflected by a TD implementation of reinforcement learning.
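
    The prediction tested above, that larger rewards produce larger prediction errors and faster acquisition, can be reproduced with a delta-rule learner plus a softmax choice rule; the learning rate, temperature, and criterion below are arbitrary illustration values.

      import math

      def trials_to_criterion(reward_magnitude, lr=0.2, beta=3.0, criterion=0.65, max_trials=200):
          """Trials until the probability of choosing the rewarded color exceeds criterion."""
          v_correct, v_wrong = 0.0, 0.0
          for t in range(1, max_trials + 1):
              p_correct = 1.0 / (1.0 + math.exp(-beta * (v_correct - v_wrong)))
              if p_correct >= criterion:
                  return t
              v_correct += lr * (reward_magnitude - v_correct)   # prediction error scales with reward
              v_wrong += lr * (0.0 - v_wrong)
          return None

      print("large reward:", trials_to_criterion(1.0))    # reaches criterion sooner
      print("small reward:", trials_to_criterion(0.25))   # reaches criterion later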

  6. Reinforcement learning in scheduling

    NASA Technical Reports Server (NTRS)

    Dietterich, Tom G.; Ok, Dokyeong; Zhang, Wei; Tadepalli, Prasad

    1994-01-01

    The goal of this research is to apply reinforcement learning methods to real-world problems like scheduling. In this preliminary paper, we show that learning to solve scheduling problems such as the Space Shuttle Payload Processing and the Automatic Guided Vehicle (AGV) scheduling can be usefully studied in the reinforcement learning framework. We discuss some of the special challenges posed by the scheduling domain to these methods and propose some possible solutions we plan to implement.

  7. Steady potential correlates of positive reinforcement: reward contingent positive variation.

    PubMed

    Marczynski, T J; York, J L; Hackett, J T

    1969-01-17

    A positive reinforcement with food produced high-voltage bursts of alpha activity over the posterior marginal gyrus in a cat deprived of food and water. This synchronization was always associated with a large (180 to 300 microvolt), positive steady potential shift comparable to that occurring during the onset of sleep. Since this shift was contingent upon the relative appropriateness and desirability of food reward, it was termed reward contingent positive variation.

  8. Neuroeconomics: a formal test of dopamine's role in reinforcement learning.

    PubMed

    DeWitt, Eric E J

    2014-04-14

    Over the last two decades, dopamine and reinforcement learning have been increasingly linked. Using a novel, axiomatic approach, a recent study shows that dopamine meets the necessary and sufficient conditions required by the theory to encode a reward prediction error.

  9. Model-Based Reinforcement Learning under Concurrent Schedules of Reinforcement in Rodents

    ERIC Educational Resources Information Center

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-01-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's…

  10. Reward-produced memories regulate memory-discrimination learning, extinction, and other forms of discrimination learning.

    PubMed

    Capaldi, E J; Birmingham, K M

    1998-07-01

    In memory-discrimination learning, reward-produced memories are differentially rewarded such that they are the only stimuli available to support discriminative responding. Memory-discrimination learning was used in this study as follows: Reward-produced memories that were assumed to regulate instrumental performance in previously reported extinction and discrimination learning investigations were isolated and explicitly differentially reinforced (prior to a shift to extinction) in each of 4 runway investigations with rats. Results obtained here in the explicit discrimination learning stage and in the subsequent extinction stage were consistent with the prediction of the memory view and with prior discrimination learning and extinction findings. The memory interpretation was applied to memory-discrimination learning, to extinction, and to 2 other types of discrimination learning. It appears that a theory must use reward-produced memories to explain all 4 types of discrimination learning.

  11. Do learning rates adapt to the distribution of rewards?

    PubMed

    Gershman, Samuel J

    2015-10-01

    Studies of reinforcement learning have shown that humans learn differently in response to positive and negative reward prediction errors, a phenomenon that can be captured computationally by positing asymmetric learning rates. This asymmetry, motivated by neurobiological and cognitive considerations, has been invoked to explain learning differences across the lifespan as well as a range of psychiatric disorders. Recent theoretical work, motivated by normative considerations, has hypothesized that the learning rate asymmetry should be modulated by the distribution of rewards across the available options. In particular, the learning rate for negative prediction errors should be higher than the learning rate for positive prediction errors when the average reward rate is high, and this relationship should reverse when the reward rate is low. We tested this hypothesis in a series of experiments. Contrary to the theoretical predictions, we found that the asymmetry was largely insensitive to the average reward rate; instead, the dominant pattern was a higher learning rate for negative than for positive prediction errors, possibly reflecting risk aversion.
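
    As a hedged illustration of the asymmetric-learning-rate model the abstract refers to (parameter values are invented for the example), the update applies different step sizes to positive and negative prediction errors:

      import numpy as np

      # Asymmetric learning rates: positive and negative prediction errors update
      # the value estimate with different step sizes.
      def update(value, reward, alpha_pos=0.3, alpha_neg=0.1):
          delta = reward - value                      # reward prediction error
          alpha = alpha_pos if delta > 0 else alpha_neg
          return value + alpha * delta

      rng = np.random.default_rng(0)
      v = 0.0
      for _ in range(200):
          r = rng.choice([0.0, 1.0], p=[0.3, 0.7])    # a 70%-rewarded option
          v = update(v, r)
      print(round(v, 2))   # with alpha_pos > alpha_neg, the estimate settles above 0.7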

  12. Dose Dependent Dopaminergic Modulation of Reward-Based Learning in Parkinson's Disease

    ERIC Educational Resources Information Center

    van Wouwe, N. C.; Ridderinkhof, K. R.; Band, G. P. H.; van den Wildenberg, W. P. M.; Wylie, S. A.

    2012-01-01

    Learning to select optimal behavior in new and uncertain situations is a crucial aspect of living and requires the ability to quickly associate stimuli with actions that lead to rewarding outcomes. Mathematical models of reinforcement-based learning to select rewarding actions distinguish between (1) the formation of stimulus-action-reward…

  13. A universal role of the ventral striatum in reward-based learning: Evidence from human studies

    PubMed Central

    Daniel, Reka; Pollmann, Stefan

    2014-01-01

    Reinforcement learning enables organisms to adjust their behavior in order to maximize rewards. Electrophysiological recordings of dopaminergic midbrain neurons have shown that they code the difference between actual and predicted rewards, i.e., the reward prediction error, in many species. This error signal is conveyed to both the striatum and cortical areas and is thought to play a central role in learning to optimize behavior. However, in human daily life rewards are diverse and often only indirect feedback is available. Here we explore the range of rewards that are processed by the dopaminergic system in human participants, and examine whether it is also involved in learning in the absence of explicit rewards. While results from electrophysiological recordings in humans are sparse, evidence linking dopaminergic activity to the metabolic signal recorded from the midbrain and striatum with functional magnetic resonance imaging (fMRI) is available. Results from fMRI studies suggest that the human ventral striatum (VS) receives valuation information for a diverse set of rewarding stimuli. These range from simple primary reinforcers such as juice rewards, through abstract social rewards, to internally generated signals of perceived correctness, suggesting that the VS is involved in learning from trial-and-error irrespective of the specific nature of provided rewards. In addition, we summarize evidence that the VS can also be implicated when learning from observing others, and in tasks that go beyond simple stimulus-action-outcome learning, indicating that the reward system is also recruited in more complex learning tasks. PMID:24825620

  14. Modeling the Violation of Reward Maximization and Invariance in Reinforcement Schedules

    PubMed Central

    La Camera, Giancarlo; Richmond, Barry J.

    2008-01-01

    It is often assumed that animals and people adjust their behavior to maximize reward acquisition. In visually cued reinforcement schedules, monkeys make errors in trials that are not immediately rewarded, despite having to repeat error trials. Here we show that error rates are typically smaller in trials equally distant from reward but belonging to longer schedules (referred to as “schedule length effect”). This violates the principles of reward maximization and invariance and cannot be predicted by the standard methods of Reinforcement Learning, such as the method of temporal differences. We develop a heuristic model that accounts for all of the properties of the behavior in the reinforcement schedule task but whose predictions are not different from those of the standard temporal difference model in choice tasks. In the modification of temporal difference learning introduced here, the effect of schedule length emerges spontaneously from the sensitivity to the immediately preceding trial. We also introduce a policy for general Markov Decision Processes, where the decision made at each node is conditioned on the motivation to perform an instrumental action, and show that the application of our model to the reinforcement schedule task and the choice task are special cases of this general theoretical framework. Within this framework, Reinforcement Learning can approach contextual learning with the mixture of empirical findings and principled assumptions that seem to coexist in the best descriptions of animal behavior. As examples, we discuss two phenomena observed in humans that often derive from the violation of the principle of invariance: “framing,” wherein equivalent options are treated differently depending on the context in which they are presented, and the “sunk cost” effect, the greater tendency to continue an endeavor once an investment in money, effort, or time has been made. The schedule length effect might be a manifestation of these phenomena

  15. Reinforcement learning or active inference?

    PubMed

    Friston, Karl J; Daunizeau, Jean; Kiebel, Stefan J

    2009-01-01

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

  16. Learning Reward Uncertainty in the Basal Ganglia.

    PubMed

    Mikhael, John G; Bogacz, Rafal

    2016-09-01

    Learning the reliability of different sources of rewards is critical for making optimal choices. However, despite the existence of detailed theory describing how the expected reward is learned in the basal ganglia, it is not known how reward uncertainty is estimated in these circuits. This paper presents a class of models that encode both the mean reward and the spread of the rewards, the former in the difference between the synaptic weights of D1 and D2 neurons, and the latter in their sum. In the models, the tendency to seek (or avoid) options with variable reward can be controlled by increasing (or decreasing) the tonic level of dopamine. The models are consistent with the physiology of and synaptic plasticity in the basal ganglia, they explain the effects of dopaminergic manipulations on choices involving risks, and they make multiple experimental predictions. PMID:27589489
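
    A heavily simplified, hypothetical reading of such a scheme (not the authors' exact equations) can be sketched as follows: positive prediction errors increment one weight, negative errors increment the other, and a small decay keeps both bounded, so that their difference tracks the mean reward while their sum scales with its variability.

      import numpy as np

      # Schematic sketch: G and N stand in for D1 and D2 synaptic strengths.
      rng = np.random.default_rng(0)
      alpha, decay = 0.05, 0.05
      G = N = 0.0
      for _ in range(20000):
          r = rng.normal(1.0, 0.5)          # rewards with mean 1.0 and spread 0.5
          delta = r - (G - N)               # prediction error
          G += alpha * (max(delta, 0.0) - decay * G)
          N += alpha * (max(-delta, 0.0) - decay * N)

      print("mean estimate (G - N):", round(G - N, 2))    # approaches the mean reward (1.0)
      print("spread estimate (G + N):", round(G + N, 2))  # grows with reward variability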

  17. Learning Reward Uncertainty in the Basal Ganglia

    PubMed Central

    Bogacz, Rafal

    2016-01-01

    Learning the reliability of different sources of rewards is critical for making optimal choices. However, despite the existence of detailed theory describing how the expected reward is learned in the basal ganglia, it is not known how reward uncertainty is estimated in these circuits. This paper presents a class of models that encode both the mean reward and the spread of the rewards, the former in the difference between the synaptic weights of D1 and D2 neurons, and the latter in their sum. In the models, the tendency to seek (or avoid) options with variable reward can be controlled by increasing (or decreasing) the tonic level of dopamine. The models are consistent with the physiology of and synaptic plasticity in the basal ganglia, they explain the effects of dopaminergic manipulations on choices involving risks, and they make multiple experimental predictions. PMID:27589489

  18. Social stress reactivity alters reward and punishment learning.

    PubMed

    Cavanagh, James F; Frank, Michael J; Allen, John J B

    2011-06-01

    To examine how stress affects cognitive functioning, individual differences in trait vulnerability (punishment sensitivity) and state reactivity (negative affect) to social evaluative threat were examined during concurrent reinforcement learning. Lower trait-level punishment sensitivity predicted better reward learning and poorer punishment learning; the opposite pattern was found in more punishment sensitive individuals. Increasing state-level negative affect was directly related to punishment learning accuracy in highly punishment sensitive individuals, but these measures were inversely related in less sensitive individuals. Combined electrophysiological measurement, performance accuracy and computational estimations of learning parameters suggest that trait and state vulnerability to stress alter cortico-striatal functioning during reinforcement learning, possibly mediated via medio-frontal cortical systems.

  19. Framework for robot skill learning using reinforcement learning

    NASA Astrophysics Data System (ADS)

    Wei, Yingzi; Zhao, Mingyang

    2003-09-01

    Robot skill acquisition is a process similar to human skill learning. Reinforcement learning (RL) is an on-line, actor-critic method by which a robot can develop its skills. The reinforcement function is the critical component, since it evaluates actions and guides the learning process. We present an augmented reward function that provides a new way for the RL controller to incorporate prior knowledge and experience. The difference form of the augmented reward function is also considered carefully. The additional reward beyond the conventional reward provides more heuristic information for RL. In this paper, we present a strategy for learning complex skills: an automatic robot-shaping policy dissolves the complex skill into a hierarchical learning process. A new form of value function is introduced to attain smooth motion switching swiftly. We present a formal but practical framework for robot skill learning, and illustrate with an example the utility of the method for learning skilled robot control on-line.

  20. The Computational Development of Reinforcement Learning during Adolescence

    PubMed Central

    Palminteri, Stefano; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-01-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents’ behaviour was better explained by a basic reinforcement learning algorithm, adults’ behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence. PMID:27322574
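
    One of the computational features mentioned here, counterfactual learning, can be sketched in a few lines of Python (an illustration, not the authors' fitted model): under complete feedback, the learner also updates the unchosen option from the outcome it would have produced. The learning rate and outcomes are invented for the example.

      import numpy as np

      # Factual vs counterfactual prediction-error updates under complete feedback.
      def update(values, chosen, outcomes, alpha=0.2, counterfactual=False):
          values = values.copy()
          values[chosen] += alpha * (outcomes[chosen] - values[chosen])       # factual update
          if counterfactual:
              other = 1 - chosen
              values[other] += alpha * (outcomes[other] - values[other])      # counterfactual update
          return values

      v = np.zeros(2)
      v = update(v, chosen=0, outcomes=np.array([1.0, -1.0]), counterfactual=True)
      print(v)   # both options are updated when complete feedback is used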

  1. The Computational Development of Reinforcement Learning during Adolescence.

    PubMed

    Palminteri, Stefano; Kilford, Emma J; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-06-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents' behaviour was better explained by a basic reinforcement learning algorithm, adults' behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence.

  2. The Computational Development of Reinforcement Learning during Adolescence.

    PubMed

    Palminteri, Stefano; Kilford, Emma J; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-06-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents' behaviour was better explained by a basic reinforcement learning algorithm, adults' behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence. PMID:27322574

  3. Dopamine, reward learning, and active inference

    PubMed Central

    FitzGerald, Thomas H. B.; Dolan, Raymond J.; Friston, Karl

    2015-01-01

    Temporal difference learning models propose phasic dopamine signaling encodes reward prediction errors that drive learning. This is supported by studies where optogenetic stimulation of dopamine neurons can stand in lieu of actual reward. Nevertheless, a large body of data also shows that dopamine is not necessary for learning, and that dopamine depletion primarily affects task performance. We offer a resolution to this paradox based on an hypothesis that dopamine encodes the precision of beliefs about alternative actions, and thus controls the outcome-sensitivity of behavior. We extend an active inference scheme for solving Markov decision processes to include learning, and show that simulated dopamine dynamics strongly resemble those actually observed during instrumental conditioning. Furthermore, simulated dopamine depletion impairs performance but spares learning, while simulated excitation of dopamine neurons drives reward learning, through aberrant inference about outcome states. Our formal approach provides a novel and parsimonious reconciliation of apparently divergent experimental findings. PMID:26581305

  4. Reward and non-reward learning of flower colours in the butterfly Byasa alcinous (Lepidoptera: Papilionidae)

    NASA Astrophysics Data System (ADS)

    Kandori, Ikuo; Yamaki, Takafumi

    2012-09-01

    Learning plays an important role in food acquisition for a wide range of insects. To increase their foraging efficiency, flower-visiting insects may learn to associate floral cues with the presence (so-called reward learning) or the absence (so-called non-reward learning) of a reward. Reward learning whilst foraging for flowers has been demonstrated in many insect taxa, whilst non-reward learning in flower-visiting insects has been demonstrated only in honeybees, bumblebees and hawkmoths. This study examined both reward and non-reward learning abilities in the butterfly Byasa alcinous whilst foraging among artificial flowers of different colours. This butterfly showed both types of learning, although butterflies of both sexes learned faster via reward learning. In addition, females learned via reward learning faster than males. To the best of our knowledge, these are the first empirical data on the learning speed of both reward and non-reward learning in insects. We discuss the adaptive significance of a lower learning speed for non-reward learning when foraging on flowers.

  5. Nucleus Accumbens Core and Shell Differentially Encode Reward-Associated Cues after Reinforcer Devaluation

    PubMed Central

    West, Elizabeth A.

    2016-01-01

    Nucleus accumbens (NAc) neurons encode features of stimulus learning and action selection associated with rewards. The NAc is necessary for using information about expected outcome values to guide behavior after reinforcer devaluation. Evidence suggests that core and shell subregions may play dissociable roles in guiding motivated behavior. Here, we recorded neural activity in the NAc core and shell during training and performance of a reinforcer devaluation task. Long–Evans male rats were trained that presses on a lever under an illuminated cue light delivered a flavored sucrose reward. On subsequent test days, each rat was given free access to one of two distinctly flavored foods to consume to satiation and were then immediately tested on the lever pressing task under extinction conditions. Rats decreased pressing on the test day when the reinforcer earned during training was the sated flavor (devalued) compared with the test day when the reinforcer was not the sated flavor (nondevalued), demonstrating evidence of outcome-selective devaluation. Cue-selective encoding during training by NAc core (but not shell) neurons reliably predicted subsequent behavioral performance; that is, the greater the percentage of neurons that responded to the cue, the better the rats suppressed responding after devaluation. In contrast, NAc shell (but not core) neurons significantly decreased cue-selective encoding in the devalued condition compared with the nondevalued condition. These data reveal that NAc core and shell neurons encode information differentially about outcome-specific cues after reinforcer devaluation that are related to behavioral performance and outcome value, respectively. SIGNIFICANCE STATEMENT Many neuropsychiatric disorders are marked by impairments in behavioral flexibility. Although the nucleus accumbens (NAc) is required for behavioral flexibility, it is not known how NAc neurons encode this information. Here, we recorded NAc neurons during a training

  6. Coexistence of reward and unsupervised learning during the operant conditioning of neural firing rates.

    PubMed

    Kerr, Robert R; Grayden, David B; Thomas, Doreen A; Gilson, Matthieu; Burkitt, Anthony N

    2014-01-01

    A fundamental goal of neuroscience is to understand how cognitive processes, such as operant conditioning, are performed by the brain. Typical and well studied examples of operant conditioning, in which the firing rates of individual cortical neurons in monkeys are increased using rewards, provide an opportunity for insight into this. Studies of reward-modulated spike-timing-dependent plasticity (RSTDP), and of other models such as R-max, have reproduced this learning behavior, but they have assumed that no unsupervised learning is present (i.e., no learning occurs without, or independent of, rewards). We show that these models cannot elicit firing rate reinforcement while exhibiting both reward learning and ongoing, stable unsupervised learning. To fix this issue, we propose a new RSTDP model of synaptic plasticity based upon the observed effects that dopamine has on long-term potentiation and depression (LTP and LTD). We show, both analytically and through simulations, that our new model can exhibit unsupervised learning and lead to firing rate reinforcement. This requires that the strengthening of LTP by the reward signal is greater than the strengthening of LTD and that the reinforced neuron exhibits irregular firing. We show the robustness of our findings to spike-timing correlations, to the synaptic weight dependence that is assumed, and to changes in the mean reward. We also consider our model in the differential reinforcement of two nearby neurons. Our model aligns more strongly with experimental studies than previous models and makes testable predictions for future experiments. PMID:24475240

  7. Coexistence of reward and unsupervised learning during the operant conditioning of neural firing rates.

    PubMed

    Kerr, Robert R; Grayden, David B; Thomas, Doreen A; Gilson, Matthieu; Burkitt, Anthony N

    2014-01-01

    A fundamental goal of neuroscience is to understand how cognitive processes, such as operant conditioning, are performed by the brain. Typical and well studied examples of operant conditioning, in which the firing rates of individual cortical neurons in monkeys are increased using rewards, provide an opportunity for insight into this. Studies of reward-modulated spike-timing-dependent plasticity (RSTDP), and of other models such as R-max, have reproduced this learning behavior, but they have assumed that no unsupervised learning is present (i.e., no learning occurs without, or independent of, rewards). We show that these models cannot elicit firing rate reinforcement while exhibiting both reward learning and ongoing, stable unsupervised learning. To fix this issue, we propose a new RSTDP model of synaptic plasticity based upon the observed effects that dopamine has on long-term potentiation and depression (LTP and LTD). We show, both analytically and through simulations, that our new model can exhibit unsupervised learning and lead to firing rate reinforcement. This requires that the strengthening of LTP by the reward signal is greater than the strengthening of LTD and that the reinforced neuron exhibits irregular firing. We show the robustness of our findings to spike-timing correlations, to the synaptic weight dependence that is assumed, and to changes in the mean reward. We also consider our model in the differential reinforcement of two nearby neurons. Our model aligns more strongly with experimental studies than previous models and makes testable predictions for future experiments.

  8. Neural basis of reinforcement learning and decision making.

    PubMed

    Lee, Daeyeol; Seo, Hyojung; Jung, Min Whan

    2012-01-01

    Reinforcement learning is an adaptive process in which an animal utilizes its previous experience to improve the outcomes of future choices. Computational theories of reinforcement learning play a central role in the newly emerging areas of neuroeconomics and decision neuroscience. In this framework, actions are chosen according to their value functions, which describe how much future reward is expected from each action. Value functions can be adjusted not only through reward and penalty, but also by the animal's knowledge of its current environment. Studies have revealed that a large proportion of the brain is involved in representing and updating value functions and using them to choose an action. However, how the nature of a behavioral task affects the neural mechanisms of reinforcement learning remains incompletely understood. Future studies should uncover the principles by which different computational elements of reinforcement learning are dynamically coordinated across the entire brain.

  9. Risk-sensitive reinforcement learning.

    PubMed

    Shen, Yun; Tobia, Michael J; Sommer, Tobias; Obermayer, Klaus

    2014-07-01

    We derive a family of risk-sensitive reinforcement learning methods for agents, who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979 ), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.
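
    The core mechanism described here, applying a utility function to the TD error before the update, can be sketched as follows (the piecewise-linear utility and parameters are illustrative, not the paper's fitted values):

      # Risk-sensitive TD update: a utility function weights gains and losses asymmetrically.
      def utility(delta, k_plus=1.0, k_minus=2.0):
          return k_plus * delta if delta >= 0 else k_minus * delta   # losses loom larger

      def q_update(q, reward, q_next_max, alpha=0.1, gamma=0.95):
          delta = reward + gamma * q_next_max - q                    # standard TD error
          return q + alpha * utility(delta)                          # risk-sensitive step

      print(q_update(0.5, reward=1.0, q_next_max=0.0))
      print(q_update(0.5, reward=-1.0, q_next_max=0.0))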

  10. Time-Extended Policies in Multi-Agent Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2004-01-01

    Reinforcement learning methods perform well in many domains where a single agent needs to take a sequence of actions to perform a task. These methods use sequences of single-time-step rewards to create a policy that tries to maximize a time-extended utility, which is a (possibly discounted) sum of these rewards. In this paper we build on our previous work showing how these methods can be extended to a multi-agent environment where each agent creates its own policy that works towards maximizing a time-extended global utility over all agents' actions. We show improved methods for creating time-extended utilities for the agents that are both "aligned" with the global utility and "learnable." We then show how to create single-time-step rewards while avoiding the pitfall of rewards that are aligned with the global reward yet lead to utilities not aligned with the global utility. Finally, we apply these reward functions to the multi-agent Gridworld problem. We explicitly quantify a utility's learnability and alignment, and show that reinforcement learning agents using the prescribed reward functions successfully trade off learnability and alignment. As a result they outperform both global (e.g., team games) and local (e.g., "perfectly learnable") reinforcement learning solutions by as much as an order of magnitude.
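
    One well-known construction consistent with the "aligned and learnable" rewards discussed here is a difference reward, in which each agent is credited with the global utility minus the global utility recomputed with that agent's contribution removed. The sketch below uses a toy global utility, not the Gridworld objective from the paper.

      # Difference reward: credit each agent with its marginal effect on the global utility.
      def global_utility(actions):
          # toy objective: reward covering distinct cells, penalise crowding
          active = [a for a in actions if a is not None]
          return len(set(active)) - 0.1 * len(active)

      def difference_reward(actions, i, null_action=None):
          counterfactual = list(actions)
          counterfactual[i] = null_action              # remove agent i's contribution
          return global_utility(actions) - global_utility(counterfactual)

      actions = [0, 0, 1, 2]                           # grid cells chosen by four agents
      for i in range(len(actions)):
          print(i, round(difference_reward(actions, i), 2))   # redundant agents get little or no credit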

  11. Mind matters: Placebo enhances reward learning in Parkinson’s disease

    PubMed Central

    Schmidt, Liane; Braun, Erin Kendall; Wager, Tor D.; Shohamy, Daphna

    2015-01-01

    Expectations have a powerful influence on how we experience the world. Neurobiological and computational models of learning suggest that dopamine is crucial for shaping expectations of reward and that expectations alone may influence dopamine levels. However, because expectations and reinforcers are typically manipulated together, the role of expectations per se has remained unclear. Here, we separated these two factors using a placebo dopaminergic manipulation in Parkinson’s patients. We combined a reward learning task with fMRI to test how expectations of dopamine release modulate learning-related activity in the brain. We found that the mere expectation of dopamine release enhances reward learning and modulates learning-related signals in the striatum and the ventromedial prefrontal cortex. These effects were selective to learning from reward: neither medication nor placebo had an effect on learning to avoid monetary loss. These findings suggest a neurobiological mechanism by which expectations shape learning and affect. PMID:25326691

  12. Preference for unpredictable food rewards occurs with high proportion of reinforced trials or alcohol if rewards are not delayed.

    PubMed

    Daly, H B

    1989-01-01

    Organisms typically prefer situations where reward and nonreward are predictable rather than unpredictable. Although many theories can account for this result (e.g., information theory and delay-reduction theory), a recently developed mathematical model (DMOD) also predicts that subjects prefer the unpredictable reward situation under conditions that substantially decrease aversiveness of unpredictable nonreward (Daly & Daly, 1982). Because a high proportion of reinforced trials (lenient schedule) and alcohol injections decrease aversive conditioning, these variables were tested with rats in five E-maze experiments. A choice to one side of the maze resulted in a stimulus uncorrelated with reward outcome (unpredictable situation). A choice to the other side resulted in stimuli correlated with reward and nonreward (predictable situation). The stimuli were not visible until after the choice was made. A lenient reinforcement schedule resulted in preference for the unpredictable reward situation if rewards were not delayed. Alcohol resulted in preference for the unpredictable reward situation if a medium five-pellet reward was given. A lenient reinforcement schedule combined with an alcohol injection resulted in faster acquisition of the preference for the unpredictable reward situation than did a lenient schedule combined with a saline control injection. These results pose a major challenge to most theories, yet were predicted by DMOD.

  13. The dissociable effects of punishment and reward on motor learning.

    PubMed

    Galea, Joseph M; Mallia, Elizabeth; Rothwell, John; Diedrichsen, Jörn

    2015-04-01

    A common assumption regarding error-based motor learning (motor adaptation) in humans is that its underlying mechanism is automatic and insensitive to reward- or punishment-based feedback. Contrary to this hypothesis, we show in a double dissociation that the two have independent effects on the learning and retention components of motor adaptation. Negative feedback, whether graded or binary, accelerated learning. While it was not necessary for the negative feedback to be coupled to monetary loss, it had to be clearly related to the actual performance on the preceding movement. Positive feedback did not speed up learning, but it increased retention of the motor memory when performance feedback was withdrawn. These findings reinforce the view that independent mechanisms underpin learning and retention in motor adaptation, reject the assumption that motor adaptation is independent of motivational feedback, and raise new questions regarding the neural basis of negative and positive motivational feedback in motor learning. PMID:25706473

  14. Reinforcing brain stimulation in competition with water reward and shock avoidance.

    PubMed

    VALENSTEIN, E S; BEER, B

    1962-09-28

    Employing response rate as the index of reinforcing strength in self-stimulation experiments is questioned. With water reward or shock avoidance placed in competition with brain stimulation, self-stimulation rate does not reflect relative reinforcement value. The results agree with preference tests which show that, for a given electrode site, stimulus intensity, not rate, is directly related to reward strength.

  15. Contextual modulation of value signals in reward and punishment learning.

    PubMed

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-08-25

    Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values in a relative--context-dependent--scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of options values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system.
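
    The relative-value update described here can be illustrated with a minimal, hypothetical sketch (parameters and feedback structure are invented for the example): a context value tracks the average outcome and serves as the reference point, so the better avoidance option acquires a positive value even though its outcomes are never positive.

      import numpy as np

      # Context-dependent value learning in a punishment-avoidance context with complete feedback.
      rng = np.random.default_rng(0)
      alpha = 0.1
      v_context = 0.0
      q = np.zeros(2)               # option 0 punished on 25% of trials; option 1 on 75%
      for _ in range(500):
          outcomes = np.array([0.0 if rng.random() < 0.75 else -1.0,
                               0.0 if rng.random() < 0.25 else -1.0])
          v_context += alpha * (outcomes.mean() - v_context)        # reference point (state value)
          for i in range(2):                                        # complete feedback: update both options
              q[i] += alpha * ((outcomes[i] - v_context) - q[i])

      print(round(v_context, 2))    # about -0.5: the context is aversive overall
      print(np.round(q, 2))         # the better avoidance option acquires a positive relative value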

  16. Contextual modulation of value signals in reward and punishment learning

    PubMed Central

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-01-01

    Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values in a relative—context-dependent—scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of options values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system. PMID:26302782

  17. Contextual modulation of value signals in reward and punishment learning.

    PubMed

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-01-01

    Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values in a relative--context-dependent--scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of options values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system. PMID:26302782

  18. Benchmarking for Bayesian Reinforcement Learning

    PubMed Central

    Ernst, Damien; Couëtoux, Adrien

    2016-01-01

    In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the rewards collected while interacting with their environment, using prior knowledge that is available beforehand. Many BRL algorithms have already been proposed, but the benchmarks used to compare them are only relevant for specific cases. The paper addresses this problem, and provides a new BRL comparison methodology along with the corresponding open source library. In this methodology, a comparison criterion that measures the performance of algorithms on large sets of Markov Decision Processes (MDPs) drawn from some probability distributions is defined. In order to enable the comparison of non-anytime algorithms, our methodology also includes a detailed analysis of the computation time requirement of each algorithm. Our library is released with all source code and documentation: it includes three test problems, each of which has two different prior distributions, and seven state-of-the-art RL algorithms. Finally, our library is illustrated by comparing all the available algorithms and the results are discussed. PMID:27304891

  19. Phasic dopamine as a prediction error of intrinsic and extrinsic reinforcements driving both action acquisition and reward maximization: a simulated robotic study.

    PubMed

    Mirolli, Marco; Santucci, Vieri G; Baldassarre, Gianluca

    2013-03-01

    An important issue of recent neuroscientific research is to understand the functional role of the phasic release of dopamine in the striatum, and in particular its relation to reinforcement learning. The literature is split between two alternative hypotheses: one considers phasic dopamine as a reward prediction error similar to the computational TD-error, whose function is to guide an animal to maximize future rewards; the other holds that phasic dopamine is a sensory prediction error signal that lets the animal discover and acquire novel actions. In this paper we propose an original hypothesis that integrates these two contrasting positions: according to our view phasic dopamine represents a TD-like reinforcement prediction error learning signal determined by both unexpected changes in the environment (temporary, intrinsic reinforcements) and biological rewards (permanent, extrinsic reinforcements). Accordingly, dopamine plays the functional role of driving both the discovery and acquisition of novel actions and the maximization of future rewards. To validate our hypothesis we perform a series of experiments with a simulated robotic system that has to learn different skills in order to get rewards. We compare different versions of the system in which we vary the composition of the learning signal. The results show that only the system reinforced by both extrinsic and intrinsic reinforcements is able to reach high performance in sufficiently complex conditions.
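
    A schematic of the hypothesis, not the authors' robot model: the learning signal is a TD-like prediction error computed on a reward that sums a permanent extrinsic component and a temporary intrinsic component that decays as the triggering event becomes predictable. All quantities below are invented for the illustration.

      # TD-like prediction error on a combined extrinsic + intrinsic reinforcement.
      value, alpha, gamma = 0.0, 0.1, 0.9
      extrinsic, surprise = 0.0, 1.0           # no biological reward; a novel event occurs
      for t in range(50):
          r = extrinsic + surprise             # combined reinforcement
          delta = r + gamma * value - value    # TD-like learning signal
          value += alpha * delta
          surprise *= 0.9                      # the event becomes familiar, intrinsic reward fades
      print(round(value, 2))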

  20. Synthetic cathinones and their rewarding and reinforcing effects in rodents

    PubMed Central

    Watterson, Lucas R.; Olive, M. Foster

    2014-01-01

    Synthetic cathinones, colloquially referred to as “bath salts”, are derivatives of the psychoactive alkaloid cathinone found in Catha edulis (Khat). Since the mid-to-late 2000’s, these amphetamine-like psychostimulants have gained popularity amongst drug users due to their potency, low cost, ease of procurement, and constantly evolving chemical structures. Concomitant with their increased use is the emergence of a growing collection of case reports of bizarre and dangerous behaviors, toxicity to numerous organ systems, and death. However, scientific information regarding the abuse liability of these drugs has been relatively slower to materialize. Recently we have published several studies demonstrating that laboratory rodents will readily self-administer the “first generation” synthetic cathinones methylenedioxypyrovalerone (MDPV) and methylone via the intravenous route, in patterns similar to those of methamphetamine. Under progressive ratio schedules of reinforcement, the rank order of reinforcing efficacy of these compounds are MDPV ≥ methamphetamine > methylone. MDPV and methylone, as well as the “second generation” synthetic cathinones α-pyrrolidinovalerophenone (α-PVP) and 4-methylethcathinone (4-MEC), also dose-dependently increase brain reward function. Collectively, these findings indicate that synthetic cathinones have a high abuse and addiction potential and underscore the need for future assessment of the extent and duration of neurotoxicity induced by these emerging drugs of abuse. PMID:25328910

  1. Changes in corticostriatal connectivity during reinforcement learning in humans

    PubMed Central

    Horga, Guillermo; Maia, Tiago V.; Marsh, Rachel; Hao, Xuejun; Xu, Dongrong; Duan, Yunsuo; Tau, Gregory Z.; Graniello, Barbara; Wang, Zhishun; Kangarlu, Alayar; Martinez, Diana; Packard, Mark G.; Peterson, Bradley S.

    2015-01-01

    Many computational models assume that reinforcement learning relies on changes in synaptic efficacy between cortical regions representing stimuli and striatal regions involved in response selection, but this assumption has thus far lacked empirical support in humans. We recorded hemodynamic signals with fMRI while participants navigated a virtual maze to find hidden rewards. We fitted a reinforcement-learning algorithm to participants’ choice behavior and evaluated the neural activity and the changes in functional connectivity related to trial-by-trial learning variables. Activity in the posterior putamen during choice periods increased progressively during learning. Furthermore, the functional connections between the sensorimotor cortex and the posterior putamen strengthened progressively as participants learned the task. These changes in corticostriatal connectivity differentiated participants who learned the task from those who did not. These findings provide a direct link between changes in corticostriatal connectivity and learning, thereby supporting a central assumption common to several computational models of reinforcement learning. PMID:25393839

  2. Reinforcement learning in multidimensional environments relies on attention mechanisms.

    PubMed

    Niv, Yael; Daniel, Reka; Geana, Andra; Gershman, Samuel J; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C

    2015-05-27

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this "representation learning" process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the "curse of dimensionality" in reinforcement learning.

  3. The cerebellum is involved in reward-based reversal learning.

    PubMed

    Thoma, Patrizia; Bellebaum, Christian; Koch, Benno; Schwarz, Michael; Daum, Irene

    2008-01-01

    The cerebellum has recently been discussed in terms of a possible involvement in reward-based associative learning. To clarify the cerebellar contribution, eight patients with focal vascular lesions of the cerebellum and a group of 24 healthy subjects matched for age and IQ were compared on a range of different probabilistic outcome-based associative learning tasks, assessing acquisition, reversal, cognitive transfer, and generalization as well as the effect of reward magnitude. Cerebellar patients showed intact acquisition of stimulus contingencies, while reward-based reversal learning was significantly impaired. In addition, the patients showed slower acquisition of new stimulus contingencies in a second reward-based learning task, which might reflect reduced carry-over effects. Reward magnitude affected learning only during initial acquisition, with better learning on trials with high rewards in patients and control subjects. Overall, the findings suggest that the cerebellum is implicated in reversal learning as a dissociable component of reward-based learning. PMID:18592331

  4. Meta-learning in reinforcement learning.

    PubMed

    Schweighofer, Nicolas; Doya, Kenji

    2003-01-01

    Meta-parameters in reinforcement learning should be tuned to the environmental dynamics and the animal performance. Here, we propose a biologically plausible meta-reinforcement learning algorithm for tuning these meta-parameters in a dynamic, adaptive manner. We tested our algorithm in both a simulation of a Markov decision task and in a non-linear control task. Our results show that the algorithm robustly finds appropriate meta-parameter values, and controls the meta-parameter time course, in both static and dynamic environments. We suggest that the phasic and tonic components of dopamine neuron firing can encode the signal required for meta-learning of reinforcement learning.
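
    A rough sketch of the meta-learning idea described above (not the authors' equations): a meta-parameter, here a learning rate, is perturbed by noise, and the perturbation is retained in proportion to whether a mid-term running average of reward exceeds a long-term running average. The stand-in task and all parameter values are invented for the example.

      import numpy as np

      # Meta-parameter tuning by stochastic perturbation of a learning rate.
      rng = np.random.default_rng(0)

      def run_episodes(alpha, n=20):
          # stand-in task whose reward peaks when alpha is near a "good" value of 0.2
          return float(np.mean([1.0 - abs(alpha - 0.2) + 0.05 * rng.normal() for _ in range(n)]))

      alpha, meta_lr, sigma = 0.8, 0.5, 0.2
      mid = longterm = run_episodes(alpha)             # initialise both reward averages
      for step in range(500):
          noise = sigma * rng.normal()
          reward = run_episodes(alpha + noise)
          mid = 0.7 * mid + 0.3 * reward               # mid-term reward average
          longterm = 0.99 * longterm + 0.01 * reward   # long-term reward average
          alpha = float(np.clip(alpha + meta_lr * (mid - longterm) * noise, 0.01, 1.0))
      print(round(alpha, 2))   # drifts toward the region that yields more reward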

  5. How we learn to make decisions: rapid propagation of reinforcement learning prediction errors in humans.

    PubMed

    Krigolson, Olav E; Hassall, Cameron D; Handy, Todd C

    2014-03-01

    Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors-discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833-1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129-141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769-776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679-709, 2002]. Here, we used the brain ERP technique to demonstrate that not only do rewards elicit a neural response akin to a prediction error but also that this signal rapidly diminished and propagated to the time of choice presentation with learning. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component that has a similar timing and topography as the feedback error-related negativity that increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward

  6. How we learn to make decisions: rapid propagation of reinforcement learning prediction errors in humans.

    PubMed

    Krigolson, Olav E; Hassall, Cameron D; Handy, Todd C

    2014-03-01

    Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors-discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833-1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129-141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769-776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679-709, 2002]. Here, we used the brain ERP technique to demonstrate that not only do rewards elicit a neural response akin to a prediction error but also that this signal rapidly diminished and propagated to the time of choice presentation with learning. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component that has a similar timing and topography as the feedback error-related negativity that increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward

  7. Hybrid Approach to Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Boulebtateche, Brahim; Fezari, Mourad; Boughazi, Mohamed

    2008-06-01

    Reinforcement Learning (RL) is a general framework in which an autonomous agent tries to learn an optimal policy of actions from direct interaction with the surrounding environment. However, one difficulty for the application of RL control is its slow convergence, especially in environments with a continuous state space. In this paper, a modified structure of RL is proposed to speed up reinforcement learning control. In this approach, a supervision technique is combined with standard Q-learning, a model-free reinforcement learning algorithm. A priori information is provided to the RL agent by an optimal LQ-controller, used to indicate preferred actions at intermittent times. It is shown that the convergence speed of the supervised RL agent is greatly improved compared to the conventional Q-learning algorithm. Simulation work and results on the cart-pole balancing problem are given to illustrate the efficiency of the proposed method.
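
    A hedged sketch of this hybrid scheme (the task, supervisor, and parameters below are illustrative stand-ins, not the paper's cart-pole setup): standard tabular Q-learning in which, at intermittent steps, the action is taken from a supervising controller rather than from the epsilon-greedy policy.

      import numpy as np

      # Q-learning with intermittent supervised actions from a simple controller.
      rng = np.random.default_rng(0)
      n_states, actions = 21, np.array([-1, 0, 1])
      Q = np.zeros((n_states, len(actions)))
      alpha, gamma, epsilon, supervise_every = 0.2, 0.95, 0.1, 5

      def supervisor_action(state):
          centre = n_states // 2        # trivial proportional rule standing in for the LQ-controller
          if state < centre:
              return 2                  # push right (+1), toward the centre
          if state > centre:
              return 0                  # push left (-1)
          return 1                      # stay

      for episode in range(500):
          s = int(rng.integers(n_states))
          for t in range(50):
              if t % supervise_every == 0:
                  a = supervisor_action(s)               # supervised step
              elif rng.random() < epsilon:
                  a = int(rng.integers(len(actions)))    # exploratory step
              else:
                  a = int(np.argmax(Q[s]))               # greedy step
              s2 = int(np.clip(s + actions[a], 0, n_states - 1))
              r = 1.0 if s2 == n_states // 2 else -0.01  # reward at the target state
              Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
              s = s2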

  8. Individual Differences in Reinforcement Learning: Behavioral, Electrophysiological, and Neuroimaging Correlates

    PubMed Central

    Santesso, Diane L.; Dillon, Daniel G.; Birk, Jeffrey L.; Holmes, Avram J.; Goetz, Elena; Bogdan, Ryan; Pizzagalli, Diego A.

    2008-01-01

    During reinforcement learning, phasic modulations of activity in midbrain dopamine neurons are conveyed to the dorsal anterior cingulate cortex (dACC) and basal ganglia and serve to guide adaptive responding. While the animal literature supports a role for the dACC in integrating reward history over time, most human electrophysiological studies of dACC function have focused on responses to single positive and negative outcomes. The present electrophysiological study investigated the role of the dACC in probabilistic reward learning in healthy subjects using a task that required integration of reinforcement history over time. We recorded the feedback-related negativity (FRN) to reward feedback in subjects who developed a response bias toward a more frequently rewarded (“rich”) stimulus (“learners”) versus subjects who did not (“non-learners”). Compared to non-learners, learners showed more positive (i.e., smaller) FRNs and greater dACC activation upon receiving reward for correct identification of the rich stimulus. In addition, dACC activation and a bias to select the rich stimulus were positively correlated. The same participants also completed a monetary incentive delay (MID) task administered during functional magnetic resonance imaging. Compared to non-learners, learners displayed stronger basal ganglia responses to reward in the MID task. These findings raise the possibility that learners in the probabilistic reinforcement task were characterized by stronger dACC and basal ganglia responses to rewarding outcomes. Furthermore, these results highlight the importance of the dACC to probabilistic reward learning in humans. PMID:18595740

  9. The combination of appetitive and aversive reinforcers and the nature of their interaction during auditory learning.

    PubMed

    Ilango, A; Wetzel, W; Scheich, H; Ohl, F W

    2010-03-31

    Learned changes in behavior can be elicited by either appetitive or aversive reinforcers. It is, however, not clear whether the two types of motivation (approaching appetitive stimuli and avoiding aversive stimuli) drive learning in the same or different ways, nor is their interaction understood in situations where the two types are combined in a single experiment. To investigate this question we have developed a novel learning paradigm for Mongolian gerbils, which not only allows rewards and punishments to be presented in isolation or in combination with each other, but also can use these opposite reinforcers to drive the same learned behavior. Specifically, we studied learning of tone-conditioned hurdle crossing in a shuttle box driven by either an appetitive reinforcer (brain stimulation reward) or an aversive reinforcer (electrical footshock), or by a combination of both. Combination of the two reinforcers potentiated speed of acquisition, led to maximum possible performance, and delayed extinction as compared to either reinforcer alone. Additional experiments, using partial reinforcement protocols and experiments in which one of the reinforcers was omitted after the animals had been previously trained with the combination of both reinforcers, indicated that appetitive and aversive reinforcers operated together but acted in different ways: in this particular experimental context, punishment appeared to be more effective for initial acquisition and reward more effective to maintain a high level of conditioned responses (CRs). The results imply that learning mechanisms in problem solving were maximally effective when the initial punishment of mistakes was combined with the subsequent rewarding of correct performance.

  10. Reinforcement of perceptual inference: reward and punishment alter conscious visual perception during binocular rivalry

    PubMed Central

    Wilbertz, Gregor; van Slooten, Joanne; Sterzer, Philipp

    2014-01-01

    Perception is an inferential process, which becomes immediately evident when sensory information is conflicting or ambiguous and thus allows for more than one perceptual interpretation. Thinking the idea of perception as inference through to the end results in a blurring of boundaries between perception and action selection, as perceptual inference implies the construction of a percept as an active process. Here we therefore wondered whether perception shares a key characteristic of action selection, namely that it is shaped by reinforcement learning. In two behavioral experiments, we used binocular rivalry to examine whether perceptual inference can be influenced by the association of perceptual outcomes with reward or punishment, respectively, in analogy to instrumental conditioning. Binocular rivalry was evoked by two orthogonal grating stimuli presented to the two eyes, resulting in perceptual alternations between the two gratings. Perception was tracked indirectly and objectively through a target detection task, which allowed us to preclude potential reporting biases. Monetary rewards or punishments were given repeatedly during perception of only one of the two rivaling stimuli. We found an increase in dominance durations for the percept associated with reward, relative to the non-rewarded percept. In contrast, punishment led to an increase of the non-punished percept relative to a decrease of the punished percept. Our results show that perception shares key characteristics with action selection, in that it is influenced by reward and punishment in opposite directions, thus narrowing the gap between the conceptually separated domains of perception and action selection. We conclude that perceptual inference is an adaptive process that is shaped by its consequences. PMID:25520687

  11. Reinforcement learning in supply chains.

    PubMed

    Valluri, Annapurna; North, Michael J; Macal, Charles M

    2009-10-01

    Effective management of supply chains creates value and can strategically position companies. In practice, human beings have been found to be both surprisingly successful and disappointingly inept at managing supply chains. The related fields of cognitive psychology and artificial intelligence have postulated a variety of potential mechanisms to explain this behavior. One of the leading candidates is reinforcement learning. This paper applies agent-based modeling to investigate the comparative behavioral consequences of three simple reinforcement learning algorithms in a multi-stage supply chain. For the first time, our findings show that the specific algorithm that is employed can have dramatic effects on the results obtained. Reinforcement learning is found to be valuable in multi-stage supply chains with several learning agents, as independent agents can learn to coordinate their behavior. However, learning in multi-stage supply chains using these postulated approaches from cognitive psychology and artificial intelligence takes extremely long time periods to achieve stability, which raises questions about their ability to explain behavior in real supply chains. The fact that it takes thousands of periods for agents to learn in this simple multi-agent setting provides new evidence that real world decision makers are unlikely to be using strict reinforcement learning in practice.
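
    As a concrete illustration of one learning agent of this kind, the sketch below implements a single ordering stage with tabular Q-learning. The newsvendor-style costs, demand distribution, and parameters are invented and are not the three algorithms compared in the paper; the point is simply that even this toy agent needs thousands of periods to settle on an ordering policy.

        import random

        ORDERS = [0, 1, 2, 3, 4]       # admissible order quantities (assumed)
        MAX_INV = 8
        Q = {(inv, o): 0.0 for inv in range(MAX_INV + 1) for o in ORDERS}
        alpha, gamma, eps = 0.1, 0.9, 0.1

        def outcome(inv, order, demand):
            # revenue for units sold, holding cost for leftovers, penalty for unmet demand
            stock = min(MAX_INV, inv + order)
            sold = min(stock, demand)
            r = 2.0 * sold - 0.5 * (stock - sold) - 1.0 * max(0, demand - stock)
            return r, stock - sold

        inv = 0
        for period in range(20000):                    # learning is slow, as noted above
            a = random.choice(ORDERS) if random.random() < eps else \
                max(ORDERS, key=lambda o: Q[(inv, o)])
            demand = random.randint(0, 3)              # assumed demand distribution
            r, inv2 = outcome(inv, a, demand)
            best_next = max(Q[(inv2, o)] for o in ORDERS)
            Q[(inv, a)] += alpha * (r + gamma * best_next - Q[(inv, a)])
            inv = inv2
        print("learned order when inventory is empty:",
              max(ORDERS, key=lambda o: Q[(0, o)]))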

  12. Role of Dopamine D2 Receptors in Human Reinforcement Learning

    PubMed Central

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-01-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date of a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well. PMID:24713613

  13. Reinforcement active learning in the vibrissae system: optimal object localization.

    PubMed

    Gordon, Goren; Dorfman, Nimrod; Ahissar, Ehud

    2013-01-01

    Rats move their whiskers to acquire information about their environment. It has been observed that they palpate novel objects and objects they are required to localize in space. We analyze whisker-based object localization using two complementary paradigms, namely, active learning and intrinsic-reward reinforcement learning. Active learning algorithms select the next training samples according to the hypothesized solution in order to better discriminate between correct and incorrect labels. Intrinsic-reward reinforcement learning uses prediction errors as the reward to an actor-critic design, such that behavior converges to the one that optimizes the learning process. We show that in the context of object localization, the two paradigms result in palpation whisking as their respective optimal solution. These results suggest that rats may employ principles of active learning and/or intrinsic reward in tactile exploration and can guide future research to seek the underlying neuronal mechanisms that implement them. Furthermore, these paradigms are easily transferable to biomimetic whisker-based artificial sensors and can improve the active exploration of their environment. PMID:22789551

  14. Reinforcement active learning in the vibrissae system: optimal object localization.

    PubMed

    Gordon, Goren; Dorfman, Nimrod; Ahissar, Ehud

    2013-01-01

    Rats move their whiskers to acquire information about their environment. It has been observed that they palpate novel objects and objects they are required to localize in space. We analyze whisker-based object localization using two complementary paradigms, namely, active learning and intrinsic-reward reinforcement learning. Active learning algorithms select the next training samples according to the hypothesized solution in order to better discriminate between correct and incorrect labels. Intrinsic-reward reinforcement learning uses prediction errors as the reward to an actor-critic design, such that behavior converges to the one that optimizes the learning process. We show that in the context of object localization, the two paradigms result in palpation whisking as their respective optimal solution. These results suggest that rats may employ principles of active learning and/or intrinsic reward in tactile exploration and can guide future research to seek the underlying neuronal mechanisms that implement them. Furthermore, these paradigms are easily transferable to biomimetic whisker-based artificial sensors and can improve the active exploration of their environment.
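
    The sketch below illustrates the intrinsic-reward idea in miniature: the observation prediction error itself is fed as reward to a simple actor-critic, so the agent quickly comes to direct its "whisking" at the poorly predicted (novel) location, the source of large prediction errors. The two-location setup, noise level, and learning rates are invented for illustration and are not the authors' model.

        import math, random

        true_pos = [0.2, 0.9]      # location 0 is familiar, location 1 is novel
        estimate = [0.2, 0.0]      # the familiar location is already predicted well
        prefs = [0.0, 0.0]         # actor preferences over which location to whisk
        baseline = 0.0             # critic estimate of the average intrinsic reward
        a_actor, a_est, a_crit = 0.3, 0.2, 0.1

        def p_novel():             # softmax probability of whisking at location 1
            return 1.0 / (1.0 + math.exp(prefs[0] - prefs[1]))

        for t in range(300):
            a = 1 if random.random() < p_novel() else 0
            obs = true_pos[a] + random.gauss(0.0, 0.02)
            intrinsic_r = abs(obs - estimate[a])        # prediction error acts as reward
            estimate[a] += a_est * (obs - estimate[a])  # improve the forward model
            td = intrinsic_r - baseline
            baseline += a_crit * td                     # critic update
            grad = (1 - p_novel()) if a == 1 else -p_novel()
            prefs[1] += a_actor * td * grad             # actor (policy-gradient) update
            prefs[0] -= a_actor * td * grad
            if t in (10, 50, 290):
                print(f"t={t}: P(whisk novel)={p_novel():.2f}, estimate={estimate[1]:.2f}")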

  15. Habits, action sequences, and reinforcement learning

    PubMed Central

    Dezfouli, Amir; Balleine, Bernard W.

    2012-01-01

    It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcomes, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states, allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions. PMID:22487034
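
    To make the contrast concrete, the sketch below places the two controllers side by side on a trivially small one-step task: the model-free learner caches action values from sampled rewards, while the model-based learner estimates the transition structure and evaluates actions by lookahead. The task and parameters are invented for illustration.

        import random

        # From a single choice state, action a reaches the rewarded state with probability P[a].
        P = {0: 0.2, 1: 0.8}        # true (hidden) transition probabilities
        R = {0: 0.0, 1: 1.0}        # reward received in the arrived-at state
        alpha = 0.1

        Q_mf = {0: 0.0, 1: 0.0}     # model-free cached values
        T_hat = {0: 0.5, 1: 0.5}    # model-based transition estimates
        for t in range(2000):
            a = random.randrange(2)
            s1 = 1 if random.random() < P[a] else 0
            r = R[s1]
            Q_mf[a] += alpha * (r - Q_mf[a])                        # cache reward history
            T_hat[a] += alpha * ((1 if s1 == 1 else 0) - T_hat[a])  # learn the world model

        Q_mb = {a: T_hat[a] * R[1] + (1 - T_hat[a]) * R[0] for a in (0, 1)}  # lookahead
        print("model-free values: ", {a: round(v, 2) for a, v in Q_mf.items()})
        print("model-based values:", {a: round(v, 2) for a, v in Q_mb.items()})
        # If R were changed (e.g., outcome devaluation), the model-based values would update
        # immediately through the lookahead, whereas the cached model-free values would not.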

  16. Associations among smoking, anhedonia, and reward learning in depression.

    PubMed

    Liverant, Gabrielle I; Sloan, Denise M; Pizzagalli, Diego A; Harte, Christopher B; Kamholz, Barbara W; Rosebrock, Laina E; Cohen, Andrew L; Fava, Maurizio; Kaplan, Gary B

    2014-09-01

    Depression and cigarette smoking co-occur at high rates. However, the etiological mechanisms that contribute to this relationship remain unclear. Anhedonia and associated impairments in reward learning are key features of depression, which also have been linked to the onset and maintenance of cigarette smoking. However, few studies have investigated differences in anhedonia and reward learning among depressed smokers and depressed nonsmokers. The goal of this study was to examine putative differences in anhedonia and reward learning in depressed smokers (n=36) and depressed nonsmokers (n=44). To this end, participants completed self-report measures of anhedonia and behavioral activation (BAS reward responsiveness scores) as well as a probabilistic reward task rooted in signal detection theory, which measures reward learning (Pizzagalli, Jahn, & O'Shea, 2005). When considering self-report measures, depressed smokers reported higher trait anhedonia and reduced BAS reward responsiveness scores compared to depressed nonsmokers. In contrast to self-report measures, nicotine-satiated depressed smokers demonstrated greater acquisition of reward-based learning compared to depressed nonsmokers as indexed by the probabilistic reward task. Findings may point to a potential mechanism underlying the frequent co-occurrence of smoking and depression. These results highlight the importance of continued investigation of the role of anhedonia and reward system functioning in the co-occurrence of depression and nicotine abuse. Results also may support the use of treatments targeting reward learning (e.g., behavioral activation) to enhance smoking cessation among individuals with depression.
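
    The probabilistic reward task cited above is typically scored with signal-detection measures; a minimal scoring sketch follows, using hypothetical trial counts. The 0.5 added to each cell is a common correction for empty cells, and the exact formulas should be checked against the cited paper before reuse.

        import math

        def response_bias_and_discriminability(rich_correct, rich_incorrect,
                                                lean_correct, lean_incorrect):
            # add 0.5 to every cell to avoid division by zero with extreme response patterns
            rc, ri = rich_correct + 0.5, rich_incorrect + 0.5
            lc, li = lean_correct + 0.5, lean_incorrect + 0.5
            log_b = 0.5 * math.log((rc * li) / (ri * lc))  # bias toward the rich stimulus
            log_d = 0.5 * math.log((rc * lc) / (ri * li))  # ability to tell stimuli apart
            return log_b, log_d

        # hypothetical block: rich stimulus identified correctly on 80/100 trials, lean on 55/100
        print(response_bias_and_discriminability(80, 20, 55, 45))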

  17. Multi-agent Reinforcement Learning Model for Effective Action Selection

    NASA Astrophysics Data System (ADS)

    Youk, Sang Jo; Lee, Bong Keun

    Reinforcement learning is a subarea of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. In the multi-agent case in particular, the state and action spaces become enormous compared to the single-agent case, so an effective action selection strategy is needed for reinforcement learning to work well. This paper proposes a multi-agent reinforcement learning model based on a fuzzy inference system in order to improve learning speed and select effective actions in multi-agent settings. The paper verifies the effectiveness of the action selection strategy through evaluation tests based on RoboCup Keepaway, one of the standard test-beds for multi-agent learning. The proposed model can be used to evaluate the efficiency of various intelligent multi-agent systems and can also be applied to the strategy and tactics of robot soccer systems.

  18. Role of adenosine receptor subtypes in methamphetamine reward and reinforcement.

    PubMed

    Kavanagh, Kevin A; Schreiner, Drew C; Levis, Sophia C; O'Neill, Casey E; Bachtell, Ryan K

    2015-02-01

    The neurobiology of methamphetamine (MA) remains largely unknown despite its high abuse liability. The present series of studies explored the role of adenosine receptors on MA reward and reinforcement and identified alterations in the expression of adenosine receptors in dopamine terminal areas following MA administration in rats. We tested whether stimulating adenosine A1 or A2A receptor subtypes would influence MA-induced place preference or MA self-administration on fixed and progressive ratio schedules in male Sprague-Dawley rats. Stimulation of either adenosine A1 or A2A receptors significantly reduced the development of MA-induced place preference. Stimulating adenosine A1, but not A2A, receptors reduced MA self-administration responding. We next tested whether repeated experimenter-delivered MA administration would alter the expression of adenosine receptors in the striatal areas using immunoblotting. We observed no change in the expression of adenosine receptors. Lastly, rats were trained to self-administer MA or saline for 14 days and we detected changes in adenosine A1 and A2A receptor expression using immunoblotting. MA self-administration significantly increased adenosine A1 in the nucleus accumbens shell, caudate-putamen and prefrontal cortex. MA self-administration significantly decreased adenosine A2A receptor expression in the nucleus accumbens shell, but increased A2A receptor expression in the amygdala. These findings demonstrate that MA self-administration produces selective alterations in adenosine receptor expression in the nucleus accumbens shell and that stimulation of adenosine receptors reduces several behavioral indices of MA addiction. Together, these studies shed light onto the neurobiological alterations incurred through chronic MA use that may aid in the development of treatments for MA addiction.

  19. Monetary reward modulates task-irrelevant perceptual learning for invisible stimuli.

    PubMed

    Pascucci, David; Mastropasqua, Tommaso; Turatto, Massimo

    2015-01-01

    Task Irrelevant Perceptual Learning (TIPL) shows that the brain's discriminative capacity can improve also for invisible and unattended visual stimuli. It has been hypothesized that this form of "unconscious" neural plasticity is mediated by an endogenous reward mechanism triggered by the correct task performance. Although this result has challenged the mandatory role of attention in perceptual learning, no direct evidence exists of the hypothesized link between target recognition, reward and TIPL. Here, we manipulated the reward value associated with a target to demonstrate the involvement of reinforcement mechanisms in sensory plasticity for invisible inputs. Participants were trained in a central task associated with either high or low monetary incentives, provided only at the end of the experiment, while subliminal stimuli were presented peripherally. Our results showed that high incentive-value targets induced a greater degree of perceptual improvement for the subliminal stimuli, supporting the role of reinforcement mechanisms in TIPL. PMID:25942318

  20. Conflict acts as an implicit cost in reinforcement learning.

    PubMed

    Cavanagh, James F; Masters, Sean E; Bath, Kevin; Frank, Michael J

    2014-11-04

    Conflict has been proposed to act as a cost in action selection, implying a general function of medio-frontal cortex in the adaptation to aversive events. Here we investigate if response conflict acts as a cost during reinforcement learning by modulating experienced reward values in cortical and striatal systems. Electroencephalography recordings show that conflict diminishes the relationship between reward-related frontal theta power and cue preference yet it enhances the relationship between punishment and cue avoidance. Individual differences in the cost of conflict on reward versus punishment sensitivity are also related to a genetic polymorphism associated with striatal D1 versus D2 pathway balance (DARPP-32). We manipulate these patterns with the D2 agent cabergoline, which induces a strong bias to amplify the aversive value of punishment outcomes following conflict. Collectively, these findings demonstrate that interactive cortico-striatal systems implicitly modulate experienced reward and punishment values as a function of conflict.

  1. Conflict acts as an implicit cost in reinforcement learning.

    PubMed

    Cavanagh, James F; Masters, Sean E; Bath, Kevin; Frank, Michael J

    2014-01-01

    Conflict has been proposed to act as a cost in action selection, implying a general function of medio-frontal cortex in the adaptation to aversive events. Here we investigate if response conflict acts as a cost during reinforcement learning by modulating experienced reward values in cortical and striatal systems. Electroencephalography recordings show that conflict diminishes the relationship between reward-related frontal theta power and cue preference yet it enhances the relationship between punishment and cue avoidance. Individual differences in the cost of conflict on reward versus punishment sensitivity are also related to a genetic polymorphism associated with striatal D1 versus D2 pathway balance (DARPP-32). We manipulate these patterns with the D2 agent cabergoline, which induces a strong bias to amplify the aversive value of punishment outcomes following conflict. Collectively, these findings demonstrate that interactive cortico-striatal systems implicitly modulate experienced reward and punishment values as a function of conflict. PMID:25367437
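
    One schematic reading of this idea (not the authors' fitted model) is that the outcome value entering the learning update is shifted by an implicit conflict cost, so gains feel smaller and losses larger on high-conflict trials. A minimal sketch, with an assumed cost and learning rate:

        conflict_cost = 0.3      # assumed size of the implicit cost
        alpha = 0.2
        value = {"cueA": 0.0, "cueB": 0.0}

        def update(cue, outcome, high_conflict):
            # shift the experienced outcome by the conflict cost, then apply a delta rule
            subjective = outcome - (conflict_cost if high_conflict else 0.0)
            value[cue] += alpha * (subjective - value[cue])

        update("cueA", +1.0, high_conflict=False)   # easy rewarded trial
        update("cueB", +1.0, high_conflict=True)    # rewarded, but under response conflict
        print(value)   # cueB acquires a smaller learned value than cueA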

  2. Tunnel Ventilation Control Using Reinforcement Learning Methodology

    NASA Astrophysics Data System (ADS)

    Chu, Baeksuk; Kim, Dongnam; Hong, Daehie; Park, Jooyoung; Chung, Jin Taek; Kim, Tae-Hyung

    The main purpose of a tunnel ventilation system is to maintain CO pollutant concentration and VI (visibility index) under an adequate level to provide drivers with a comfortable and safe driving environment. Moreover, it is necessary to minimize the power consumed to operate the ventilation system. To achieve these objectives, the control algorithm used in this research is the reinforcement learning (RL) method. RL is a goal-directed learning of a mapping from situations to actions without relying on exemplary supervision or complete models of the environment. The goal of RL is to maximize a reward, which is an evaluative feedback from the environment. In the process of constructing the reward of the tunnel ventilation system, the two objectives listed above are included, that is, maintaining an adequate level of pollutants and minimizing power consumption. An RL algorithm based on an actor-critic architecture and a gradient-following algorithm is applied to the tunnel ventilation system. Simulation results obtained with real data collected from an existing tunnel ventilation system, together with real experimental verification, are provided in this paper. It is confirmed that with the suggested controller, the pollutant level inside the tunnel was well maintained under the allowable limit and energy consumption was improved compared to the conventional control scheme.
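
    A reward of the kind described above can be written as a weighted combination of air-quality penalties and a power term. The sketch below is only an illustration of that structure; the limits, weights, and linear form are assumptions, not the reward actually used for the tunnel ventilation controller.

        def ventilation_reward(co_ppm, visibility_index, power_kw,
                               co_limit=25.0, vi_limit=0.005, w_air=1.0, w_power=0.01):
            # penalize exceeding the CO and visibility limits, and penalize power consumption
            co_penalty = max(0.0, co_ppm - co_limit)
            vi_penalty = max(0.0, visibility_index - vi_limit)
            return -w_air * (co_penalty + 100.0 * vi_penalty) - w_power * power_kw

        print(ventilation_reward(co_ppm=20.0, visibility_index=0.004, power_kw=300.0))  # within limits
        print(ventilation_reward(co_ppm=35.0, visibility_index=0.004, power_kw=150.0))  # CO too high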

  3. Scaling prediction errors to reward variability benefits error-driven learning in humans

    PubMed Central

    Schultz, Wolfram

    2015-01-01

    Effective error-driven learning requires individuals to adapt learning to environmental reward variability. The adaptive mechanism may involve decays in learning rate across subsequent trials, as shown previously, and rescaling of reward prediction errors. The present study investigated the influence of prediction error scaling and, in particular, the consequences for learning performance. Participants explicitly predicted reward magnitudes that were drawn from different probability distributions with specific standard deviations. By fitting the data with reinforcement learning models, we found scaling of prediction errors, in addition to the learning rate decay shown previously. Importantly, the prediction error scaling was closely related to learning performance, defined as accuracy in predicting the mean of reward distributions, across individual participants. In addition, participants who scaled prediction errors relative to standard deviation also presented with more similar performance for different standard deviations, indicating that increases in standard deviation did not substantially decrease “adapters'” accuracy in predicting the means of reward distributions. However, exaggerated scaling beyond the standard deviation resulted in impaired performance. Thus efficient adaptation makes learning more robust to changing variability. PMID:26180123
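
    A minimal sketch of this adaptive mechanism is shown below: the prediction error is divided by a running estimate of the reward spread before the value update, so learning steps are comparable across low- and high-variability contexts. The learning rates and reward distribution are assumptions, not the fitted model from the study.

        import random

        alpha, beta = 0.1, 0.1
        value, spread = 0.0, 1.0
        rewards = [random.gauss(10.0, 4.0) for _ in range(1000)]   # high-variability context

        for r in rewards:
            delta = r - value
            value += alpha * (delta / max(spread, 1e-6))   # scaled prediction error
            spread += beta * (abs(delta) - spread)         # track reward variability (mean absolute deviation)

        print("estimated mean:", round(value, 2), "| true mean: 10.0")
        print("tracked spread:", round(spread, 2))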

  4. Is Avoiding an Aversive Outcome Rewarding? Neural Substrates of Avoidance Learning in the Human Brain

    PubMed Central

    Kim, Hackjin; Shimojo, Shinsuke

    2006-01-01

    Avoidance learning poses a challenge for reinforcement-based theories of instrumental conditioning, because once an aversive outcome is successfully avoided an individual may no longer experience extrinsic reinforcement for their behavior. One possible account for this is to propose that avoiding an aversive outcome is in itself a reward, and thus avoidance behavior is positively reinforced on each trial when the aversive outcome is successfully avoided. In the present study we aimed to test this possibility by determining whether avoidance of an aversive outcome recruits the same neural circuitry as that elicited by a reward itself. We scanned 16 human participants with functional MRI while they performed an instrumental choice task, in which on each trial they chose from one of two actions in order to either win money or else avoid losing money. Neural activity in a region previously implicated in encoding stimulus reward value, the medial orbitofrontal cortex, was found to increase, not only following receipt of reward, but also following successful avoidance of an aversive outcome. This neural signal may itself act as an intrinsic reward, thereby serving to reinforce actions during instrumental avoidance. PMID:16802856

  5. Learning strategies in table tennis using inverse reinforcement learning.

    PubMed

    Muelling, Katharina; Boularias, Abdeslam; Mohler, Betty; Schölkopf, Bernhard; Peters, Jan

    2014-10-01

    Learning a complex task such as table tennis is a challenging problem for both robots and humans. Even after acquiring the necessary motor skills, a strategy is needed to choose where and how to return the ball to the opponent's court in order to win the game. The data-driven identification of basic strategies in interactive tasks, such as table tennis, is a largely unexplored problem. In this paper, we suggest a computational model for representing and inferring strategies, based on a Markov decision problem, where the reward function models the goal of the task as well as the strategic information. We show how this reward function can be discovered from demonstrations of table tennis matches using model-free inverse reinforcement learning. The resulting framework allows us to identify basic elements on which the selection of striking movements is based. We tested our approach on data collected from players with different playing styles and under different playing conditions. The estimated reward function was able to capture expert-specific strategic information that sufficed to distinguish the expert among players with different skill levels as well as different playing styles.

  6. Learning strategies in table tennis using inverse reinforcement learning.

    PubMed

    Muelling, Katharina; Boularias, Abdeslam; Mohler, Betty; Schölkopf, Bernhard; Peters, Jan

    2014-10-01

    Learning a complex task such as table tennis is a challenging problem for both robots and humans. Even after acquiring the necessary motor skills, a strategy is needed to choose where and how to return the ball to the opponent's court in order to win the game. The data-driven identification of basic strategies in interactive tasks, such as table tennis, is a largely unexplored problem. In this paper, we suggest a computational model for representing and inferring strategies, based on a Markov decision problem, where the reward function models the goal of the task as well as the strategic information. We show how this reward function can be discovered from demonstrations of table tennis matches using model-free inverse reinforcement learning. The resulting framework allows us to identify basic elements on which the selection of striking movements is based. We tested our approach on data collected from players with different playing styles and under different playing conditions. The estimated reward function was able to capture expert-specific strategic information that sufficed to distinguish the expert among players with different skill levels as well as different playing styles. PMID:24756167

  7. Robot-assisted motor training: assistance decreases exploration during reinforcement learning.

    PubMed

    Sans-Muntadas, Albert; Duarte, Jaime E; Reinkensmeyer, David J

    2014-01-01

    Reinforcement learning (RL) is a form of motor learning that robotic therapy devices could potentially manipulate to promote neurorehabilitation. We developed a system that requires trainees to use RL to learn a predefined target movement. The system provides higher rewards for movements that are more similar to the target movement. We also developed a novel algorithm that rewards trainees of different abilities with comparable reward sizes. This algorithm measures a trainee's performance relative to their best performance, rather than relative to an absolute target performance, to determine reward. We hypothesized this algorithm would permit subjects who cannot normally achieve high reward levels to do so while still learning. In an experiment with 21 unimpaired human subjects, we found that all subjects quickly learned to make a first target movement with and without the reward equalization. However, artificially increasing reward decreased the subjects' tendency to engage in exploration and therefore slowed learning, particularly when we changed the target movement. An anti-slacking watchdog algorithm further slowed learning. These results suggest that robotic algorithms that assist trainees in achieving rewards or in preventing slacking might, over time, discourage the exploration needed for reinforcement learning.
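
    The reward-equalization idea can be sketched as scoring each attempt against the trainee's own best performance so far rather than an absolute target. The mapping to a 0-1 reward below is an assumption for illustration, not the authors' algorithm.

        def equalized_reward(error, best_error):
            # a new personal best earns the maximum reward; otherwise reward falls off
            # with how far the attempt is from the personal best
            if error < best_error:
                return 1.0, error
            return max(0.0, 1.0 - (error - best_error) / best_error), best_error

        best = float("inf")
        for err in [8.0, 6.5, 7.0, 5.0, 5.5]:     # hypothetical movement errors across trials
            r, best = equalized_reward(err, best)
            print(f"error={err:.1f}  reward={r:.2f}  best so far={best:.1f}")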

  8. Stress modulates reinforcement learning in younger and older adults.

    PubMed

    Lighthall, Nichole R; Gorlick, Marissa A; Schoeke, Andrej; Frank, Michael J; Mather, Mara

    2013-03-01

    Animal research and human neuroimaging studies indicate that stress increases dopamine levels in brain regions involved in reward processing, and stress also appears to increase the attractiveness of addictive drugs. The current study tested the hypothesis that stress increases reward salience, leading to more effective learning about positive than negative outcomes in a probabilistic selection task. Changes to dopamine pathways with age raise the question of whether stress effects on incentive-based learning differ by age. Thus, the present study also examined whether effects of stress on reinforcement learning differed for younger (age 18-34) and older participants (age 65-85). Cold pressor stress was administered to half of the participants in each age group, and salivary cortisol levels were used to confirm biophysiological response to cold stress. After the manipulation, participants completed a probabilistic learning task involving positive and negative feedback. In both younger and older adults, stress enhanced learning about cues that predicted positive outcomes. In addition, during the initial learning phase, stress diminished sensitivity to recent feedback across age groups. These results indicate that stress affects reinforcement learning in both younger and older adults and suggest that stress exerts different effects on specific components of reinforcement learning depending on their neural underpinnings.

  9. Reward-Guided Learning with and without Causal Attribution.

    PubMed

    Jocham, Gerhard; Brodersen, Kay H; Constantinescu, Alexandra O; Kahn, Martin C; Ianni, Angela M; Walton, Mark E; Rushworth, Matthew F S; Behrens, Timothy E J

    2016-04-01

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task. PMID:26971947

  10. Reward-Guided Learning with and without Causal Attribution.

    PubMed

    Jocham, Gerhard; Brodersen, Kay H; Constantinescu, Alexandra O; Kahn, Martin C; Ianni, Angela M; Walton, Mark E; Rushworth, Matthew F S; Behrens, Timothy E J

    2016-04-01

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task.

  11. Reward-Guided Learning with and without Causal Attribution

    PubMed Central

    Jocham, Gerhard; Brodersen, Kay H.; Constantinescu, Alexandra O.; Kahn, Martin C.; Ianni, Angela M.; Walton, Mark E.; Rushworth, Matthew F.S.; Behrens, Timothy E.J.

    2016-01-01

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task. PMID:26971947

  12. Reinforcement Learning and Savings Behavior*

    PubMed Central

    Choi, James J.; Laibson, David; Madrian, Brigitte C.; Metrick, Andrew

    2009-01-01

    We show that individual investors over-extrapolate from their personal experience when making savings decisions. Investors who experience particularly rewarding outcomes from saving in their 401(k)—a high average and/or low variance return—increase their 401(k) savings rate more than investors who have less rewarding experiences with saving. This finding is not driven by aggregate time-series shocks, income effects, rational learning about investing skill, investor fixed effects, or time-varying investor-level heterogeneity that is correlated with portfolio allocations to stock, bond, and cash asset classes. We discuss implications for the equity premium puzzle and interventions aimed at improving household financial outcomes. PMID:20352013

  13. Rational and Mechanistic Perspectives on Reinforcement Learning

    ERIC Educational Resources Information Center

    Chater, Nick

    2009-01-01

    This special issue describes important recent developments in applying reinforcement learning models to capture neural and cognitive function. But reinforcement learning, as a theoretical framework, can apply at two very different levels of description: "mechanistic" and "rational." Reinforcement learning is often viewed in mechanistic terms--as…

  14. Behavioral and neural properties of social reinforcement learning.

    PubMed

    Jones, Rebecca M; Somerville, Leah H; Li, Jian; Ruberry, Erika J; Libby, Victoria; Glover, Gary; Voss, Henning U; Ballon, Douglas J; Casey, B J

    2011-09-14

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based on work in nonhuman primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging. Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis--social preferences, response latencies, and modeling neural responses--are consistent with reinforcement learning theory and nonhuman primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one's peers in altering subsequent behavior.

  15. Behavioral and neural properties of social reinforcement learning.

    PubMed

    Jones, Rebecca M; Somerville, Leah H; Li, Jian; Ruberry, Erika J; Libby, Victoria; Glover, Gary; Voss, Henning U; Ballon, Douglas J; Casey, B J

    2011-09-14

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based on work in nonhuman primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging. Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis--social preferences, response latencies, and modeling neural responses--are consistent with reinforcement learning theory and nonhuman primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one's peers in altering subsequent behavior. PMID:21917787

  16. Behavioral and neural properties of social reinforcement learning

    PubMed Central

    Jones, Rebecca M.; Somerville, Leah H.; Li, Jian; Ruberry, Erika J.; Libby, Victoria; Glover, Gary; Voss, Henning U.; Ballon, Douglas J.; Casey, BJ

    2011-01-01

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based upon work in non-human primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging (fMRI). Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis - social preferences, response latencies and modeling neural responses – are consistent with reinforcement learning theory and non-human primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one’s peers in altering subsequent behavior. PMID:21917787

  17. Learning arm's posture control using reinforcement learning and feedback-error-learning.

    PubMed

    Kambara, H; Kim, J; Sato, M; Koike, Y

    2004-01-01

    In this paper, we propose a learning model using the Actor-Critic method and the feedback-error-learning scheme. The Actor-Critic method, which is one of the major frameworks in reinforcement learning, has attracted attention as a computational learning model of the basal ganglia. Meanwhile, feedback-error-learning is a learning architecture proposed as a computationally coherent model of cerebellar motor learning. The purpose of this learning architecture is to acquire a feed-forward controller by using a feedback controller's output as an error signal. In past research, a predetermined constant-gain feedback controller was used for feedback-error-learning. We use the Actor-Critic method for obtaining the feedback controller in feedback-error-learning. By applying the proposed learning model to an arm's posture control, we show that high-performance feedback and feed-forward controllers can be acquired by using only a scalar reward value. PMID:17271719
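
    The feedback-error-learning part of this scheme can be sketched compactly: a feed-forward command is trained online using the feedback controller's output as its error signal. In the sketch below a fixed proportional gain stands in for the actor-critic-acquired feedback controller, and the one-dimensional plant and all gains are assumptions for illustration.

        target = 1.0        # desired position
        kp = 2.0            # feedback (proportional) gain, standing in for the learned controller
        w_ff = 0.0          # feed-forward command for this target, learned online
        eta = 0.5           # feed-forward learning rate
        x = 0.0             # plant state (e.g., arm position)

        for step in range(40):
            u_fb = kp * (target - x)     # feedback command from the tracking error
            u = w_ff + u_fb              # total motor command
            x = 0.8 * x + 0.2 * u        # leaky first-order plant (assumed dynamics)
            w_ff += eta * u_fb           # the feedback output serves as the training signal
            if step % 10 == 0:
                print(f"step {step}: position={x:.3f} feed-forward={w_ff:.3f} feedback={u_fb:.3f}")
        # The feed-forward term converges to the command that holds the target on its own,
        # and the feedback contribution shrinks toward zero.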

  18. A selective role for dopamine in stimulus-reward learning.

    PubMed

    Flagel, Shelly B; Clark, Jeremy J; Robinson, Terry E; Mayo, Leah; Czuj, Alayna; Willuhn, Ingo; Akers, Christina A; Clinton, Sarah M; Phillips, Paul E M; Akil, Huda

    2011-01-01

    Individuals make choices and prioritize goals using complex processes that assign value to rewards and associated stimuli. During Pavlovian learning, previously neutral stimuli that predict rewards can acquire motivational properties, becoming attractive and desirable incentive stimuli. However, whether a cue acts solely as a predictor of reward, or also serves as an incentive stimulus, differs between individuals. Thus, individuals vary in the degree to which cues bias choice and potentially promote maladaptive behaviour. Here we use rats that differ in the incentive motivational properties they attribute to food cues to probe the role of the neurotransmitter dopamine in stimulus-reward learning. We show that intact dopamine transmission is not required for all forms of learning in which reward cues become effective predictors. Rather, dopamine acts selectively in a form of stimulus-reward learning in which incentive salience is assigned to reward cues. In individuals with a propensity for this form of learning, reward cues come to powerfully motivate and control behaviour. This work provides insight into the neurobiology of a form of stimulus-reward learning that confers increased susceptibility to disorders of impulse control.

  19. The impact of mineralocorticoid receptor ISO/VAL genotype (rs5522) and stress on reward learning.

    PubMed

    Bogdan, R; Perlis, R H; Fagerness, J; Pizzagalli, D A

    2010-08-01

    Research suggests that stress disrupts reinforcement learning and induces anhedonia. The mineralocorticoid receptor (MR) determines the sensitivity of the stress response, and the missense iso/val polymorphism (Ile180Val, rs5522) of the MR gene (NR3C2) has been associated with enhanced physiological stress responses, elevated depressive symptoms and reduced cortisol-induced MR gene expression. The goal of these studies was to evaluate whether rs5522 genotype and stress independently and interactively influence reward learning. In study 1, participants (n = 174) completed a probabilistic reward task under baseline (i.e. no-stress) conditions. In study 2, participants (n = 53) completed the task during a stress (threat-of-shock) and no-stress condition. Reward learning, i.e. the ability to modulate behavior as a function of reinforcement history, was the main variable of interest. In study 1, in which participants were evaluated under no-stress conditions, reward learning was enhanced in val carriers. In study 2, participants developed a weaker response bias toward a more frequently rewarded stimulus under the stress relative to the no-stress condition. Critically, stress-induced reward learning deficits were largest in val carriers. Although preliminary and in need of replication due to the small sample size, findings indicate that psychiatrically healthy individuals carrying the MR val allele, which has recently been linked to depression, showed a reduced ability to modulate behavior as a function of reward when facing an acute, uncontrollable stressor. Future studies are warranted to evaluate whether rs5522 genotype interacts with naturalistic stressors to increase the risk of depression and whether stress-induced anhedonia might moderate such risk.

  20. Reinforcement learning deficits in people with schizophrenia persist after extended trials.

    PubMed

    Cicero, David C; Martin, Elizabeth A; Becker, Theresa M; Kerns, John G

    2014-12-30

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and are then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive feedback and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning.

  1. Dopamine-Dependent Reinforcement of Motor Skill Learning: Evidence from Gilles de la Tourette Syndrome

    ERIC Educational Resources Information Center

    Palminteri, Stefano; Lebreton, Mael; Worbe, Yulia; Hartmann, Andreas; Lehericy, Stephane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-01-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only…

  2. Rule Learning in Autism: The Role of Reward Type and Social Context

    PubMed Central

    Jones, E. J. H.; Webb, S. J.; Estes, A.; Dawson, G.

    2013-01-01

    Learning abstract rules is central to social and cognitive development. Across two experiments, we used Delayed Non-Matching to Sample tasks to characterize the longitudinal development and nature of rule-learning impairments in children with Autism Spectrum Disorder (ASD). Results showed that children with ASD consistently experienced more difficulty learning an abstract rule from a discrete physical reward than children with DD. Rule learning was facilitated by the provision of more concrete reinforcement, suggesting an underlying difficulty in forming conceptual connections. Learning abstract rules about social stimuli remained challenging through late childhood, indicating the importance of testing executive functions in both social and non-social contexts. PMID:23311315

  3. Rule learning in autism: the role of reward type and social context.

    PubMed

    Jones, E J H; Webb, S J; Estes, A; Dawson, G

    2013-01-01

    Learning abstract rules is central to social and cognitive development. Across two experiments, we used Delayed Non-Matching to Sample tasks to characterize the longitudinal development and nature of rule-learning impairments in children with Autism Spectrum Disorder (ASD). Results showed that children with ASD consistently experienced more difficulty learning an abstract rule from a discrete physical reward than children with DD. Rule learning was facilitated by the provision of more concrete reinforcement, suggesting an underlying difficulty in forming conceptual connections. Learning abstract rules about social stimuli remained challenging through late childhood, indicating the importance of testing executive functions in both social and non-social contexts.

  4. Simulation of rat behavior by a reinforcement learning algorithm in consideration of appearance probabilities of reinforcement signals.

    PubMed

    Murakoshi, Kazushi; Noguchi, Takuya

    2005-04-01

    Brown and Wanger [Brown, R.T., Wanger, A.R., 1964. Resistance to punishment and extinction following training with shock or nonreinforcement. J. Exp. Psychol. 68, 503-507] investigated rat behaviors with the following features: (1) rats were exposed to reward and punishment at the same time, (2) the environment changed and rats relearned, and (3) rats were stochastically exposed to reward and punishment. The results are that exposure to nonreinforcement produces resistance to the decremental effects on behavior after a stochastic reward schedule and that exposure to both punishment and reinforcement produces resistance to the decremental effects on behavior after a stochastic punishment schedule. This paper aims to simulate these rat behaviors by a reinforcement learning algorithm that takes into consideration the appearance probabilities of reinforcement signals. Earlier reinforcement learning algorithms were unable to simulate the behavior described in feature (3). We improve on these algorithms by controlling the learning parameters in consideration of the acquisition probabilities of the reinforcement signals. The proposed algorithm qualitatively simulates the result of the animal experiment of Brown and Wanger. PMID:15740837

  5. Simulation of rat behavior by a reinforcement learning algorithm in consideration of appearance probabilities of reinforcement signals.

    PubMed

    Murakoshi, Kazushi; Noguchi, Takuya

    2005-04-01

    Brown and Wanger [Brown, R.T., Wanger, A.R., 1964. Resistance to punishment and extinction following training with shock or nonreinforcement. J. Exp. Psychol. 68, 503-507] investigated rat behaviors with the following features: (1) rats were exposed to reward and punishment at the same time, (2) the environment changed and rats relearned, and (3) rats were stochastically exposed to reward and punishment. The results are that exposure to nonreinforcement produces resistance to the decremental effects on behavior after a stochastic reward schedule and that exposure to both punishment and reinforcement produces resistance to the decremental effects on behavior after a stochastic punishment schedule. This paper aims to simulate these rat behaviors by a reinforcement learning algorithm that takes into consideration the appearance probabilities of reinforcement signals. Earlier reinforcement learning algorithms were unable to simulate the behavior described in feature (3). We improve on these algorithms by controlling the learning parameters in consideration of the acquisition probabilities of the reinforcement signals. The proposed algorithm qualitatively simulates the result of the animal experiment of Brown and Wanger.
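
    One simple reading of "controlling the learning parameters in consideration of the acquisition probabilities of the reinforcement signals" is sketched below: the weight given to a reward or punishment signal is scaled by the estimated probability with which that signal has been appearing. This is an illustrative interpretation, not the authors' algorithm, and every number is assumed.

        base_alpha = 0.3
        value = 0.0
        p_reward, p_punish = 0.5, 0.5    # running estimates of signal appearance probabilities
        track = 0.1                      # rate for tracking the appearance probabilities

        def trial(reward_given, punish_given):
            global value, p_reward, p_punish
            p_reward += track * ((1.0 if reward_given else 0.0) - p_reward)
            p_punish += track * ((1.0 if punish_given else 0.0) - p_punish)
            outcome = (1.0 if reward_given else 0.0) - (1.0 if punish_given else 0.0)
            alpha = base_alpha
            if reward_given:
                alpha *= p_reward    # a sporadically appearing signal is weighted less strongly
            if punish_given:
                alpha *= p_punish
            value += alpha * (outcome - value)

        for t in range(50):
            trial(reward_given=True, punish_given=False)            # continuous reward phase
        for t in range(50):
            trial(reward_given=(t % 2 == 0), punish_given=False)    # stochastic reward phase
        print("value after the stochastic schedule:", round(value, 2))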

  6. Working memory and reward association learning impairments in obesity

    PubMed Central

    Coppin, Géraldine; Nolan-Poupart, Sarah; Jones-Gotman, Marilyn; Small, Dana M.

    2014-01-01

    Obesity has been associated with impaired executive functions including working memory. Less explored is the influence of obesity on learning and memory. In the current study we assessed stimulus reward association learning, explicit learning and memory and working memory in healthy weight, overweight and obese individuals. Explicit learning and memory did not differ as a function of group. In contrast, working memory was significantly and similarly impaired in both overweight and obese individuals compared to the healthy weight group. In the first reward association learning task the obese, but not healthy weight or overweight participants consistently formed paradoxical preferences for a pattern associated with a negative outcome (fewer food rewards). To determine if the deficit was specific to food reward a second experiment was conducted using money. Consistent with experiment 1, obese individuals selected the pattern associated with a negative outcome (fewer monetary rewards) more frequently than healthy weight individuals and thus failed to develop a significant preference for the most rewarded patterns as was observed in the healthy weight group. Finally, on a probabilistic learning task, obese compared to healthy weight individuals showed deficits in negative, but not positive outcome learning. Taken together, our results demonstrate deficits in working memory and stimulus reward learning in obesity and suggest that obese individuals are impaired in learning to avoid negative outcomes. PMID:25447070

  7. The ubiquity of model-based reinforcement learning.

    PubMed

    Doll, Bradley B; Simon, Dylan A; Daw, Nathaniel D

    2012-12-01

    The reward prediction error (RPE) theory of dopamine (DA) function has enjoyed great success in the neuroscience of learning and decision-making. This theory is derived from model-free reinforcement learning (RL), in which choices are made simply on the basis of previously realized rewards. Recently, attention has turned to correlates of more flexible, albeit computationally complex, model-based methods in the brain. These methods are distinguished from model-free learning by their evaluation of candidate actions using expected future outcomes according to a world model. Puzzlingly, signatures from these computations seem to be pervasive in the very same regions previously thought to support model-free learning. Here, we review recent behavioral and neural evidence about these two systems, in an attempt to reconcile their enigmatic cohabitation in the brain.

  8. Use of Inverse Reinforcement Learning for Identity Prediction

    NASA Technical Reports Server (NTRS)

    Hayes, Roy; Bao, Jonathan; Beling, Peter; Horowitz, Barry

    2011-01-01

    We adopt Markov Decision Processes (MDP) to model sequential decision problems, which have the characteristic that the current decision made by a human decision maker has an uncertain impact on future opportunity. We hypothesize that the individuality of decision makers can be modeled as differences in the reward function under a common MDP model. A machine learning technique, Inverse Reinforcement Learning (IRL), was used to learn an individual's reward function based on limited observation of his or her decision choices. This work serves as an initial investigation for using IRL to analyze decision making, conducted through a human experiment in a cyber shopping environment. Specifically, the ability to determine the demographic identity of users is conducted through prediction analysis and supervised learning. The results show that IRL can be used to correctly identify participants, at a rate of 68% for gender and 66% for one of three college major categories.
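
    The record does not state which IRL variant was used, so the sketch below shows one common formulation (maximum-entropy IRL over a small, fully known MDP with one reward weight per state); the function name, featurization, and all parameters are illustrative assumptions. The recovered per-user weight vectors could then be fed to an ordinary supervised classifier for identity prediction.

      import numpy as np

      def maxent_irl(P, demos, horizon, gamma=0.95, lr=0.1, iters=200):
          """P[a][s, s2]: transition probabilities; demos: list of observed state sequences."""
          n_actions, n_states = len(P), P[0].shape[0]
          mu_emp = np.zeros(n_states)                  # empirical state-visitation counts
          start = np.zeros(n_states)                   # empirical start-state distribution
          for traj in demos:
              start[traj[0]] += 1.0 / len(demos)
              for s in traj:
                  mu_emp[s] += 1.0 / len(demos)
          theta = np.zeros(n_states)                   # one reward weight per state (assumed features)
          for _ in range(iters):
              V = np.zeros(n_states)                   # soft (max-entropy) value iteration
              for _ in range(horizon):
                  Q = np.stack([theta + gamma * (P[a] @ V) for a in range(n_actions)])
                  V = np.log(np.exp(Q).sum(axis=0))
              pi = np.exp(Q - V)                       # stochastic policy, shape (actions, states)
              d = start.copy()                         # expected visitations under that policy
              mu_exp = d.copy()
              for _ in range(horizon - 1):
                  d = sum(P[a].T @ (pi[a] * d) for a in range(n_actions))
                  mu_exp = mu_exp + d
              theta += lr * (mu_emp - mu_exp)          # gradient: empirical minus expected visitations
          return theta                                 # per-user reward estimate for later classification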

  9. Pedunculopontine tegmental nucleus lesions impair probabilistic reversal learning by reducing sensitivity to positive reward feedback.

    PubMed

    Syed, Anam; Baker, Phillip M; Ragozzino, Michael E

    2016-05-01

    Recent findings indicate that pedunculopontine tegmental nucleus (PPTg) neurons encode reward-related information that is context-dependent. This information is critical for behavioral flexibility when reward outcomes change, signaling that a shift in response patterns should occur. The present experiment investigated whether NMDA lesions of the PPTg affect the acquisition and/or reversal learning of a spatial discrimination using probabilistic reinforcement. Male Long-Evans rats received a bilateral infusion of NMDA (30 nmol/side) or saline into the PPTg. Subsequently, rats were tested in a spatial discrimination test using a probabilistic learning procedure. One spatial location was rewarded with an 80% probability and the other with a 20% probability. After reaching an acquisition criterion of 10 consecutive correct trials, the spatial location-reward contingencies were reversed in the following test session. Bilateral and unilateral PPTg-lesioned rats acquired the spatial discrimination comparably to sham controls. In contrast, bilateral PPTg lesions, but not unilateral PPTg lesions, impaired reversal learning. The reversal learning deficit occurred because of increased regressions to the previously 'correct' spatial location after initially selecting the new 'correct' choice. PPTg lesions also reduced the frequency of win-stay behavior early in the reversal learning session, but did not modify the frequency of lose-shift behavior during reversal learning. The present results suggest that the PPTg contributes to behavioral flexibility under conditions in which outcomes are uncertain, e.g. probabilistic reinforcement, by facilitating sensitivity to positive reward outcomes, which allows the reliable execution of a new choice pattern. PMID:26976089

  10. COMT Val158Met genotype is associated with reward learning: a replication study and meta-analysis.

    PubMed

    Corral-Frías, N S; Pizzagalli, D A; Carré, J M; Michalski, L J; Nikolova, Y S; Perlis, R H; Fagerness, J; Lee, M R; Conley, E Drabant; Lancaster, T M; Haddad, S; Wolf, A; Smoller, J W; Hariri, A R; Bogdan, R

    2016-06-01

    Identifying mechanisms through which individual differences in reward learning emerge offers an opportunity to understand both a fundamental form of adaptive responding and etiological pathways through which aberrant reward learning may contribute to maladaptive behaviors and psychopathology. One candidate mechanism through which individual differences in reward learning may emerge is variability in dopaminergic reinforcement signaling. A common functional polymorphism within the catechol-O-methyl transferase gene (COMT; rs4680, Val158Met) has been linked to reward learning, where homozygosity for the Met allele (linked to heightened prefrontal dopamine function and decreased dopamine synthesis in the midbrain) has been associated with relatively increased reward learning. Here, we used a probabilistic reward learning task to assess response bias, a behavioral form of reward learning, across three separate samples that were combined for analyses (age: 21.80 ± 3.95; n = 392; 268 female; European-American: n = 208). We replicate prior reports that COMT rs4680 Met allele homozygosity is associated with increased reward learning in European-American participants (β = 0.20, t = 2.75, P < 0.01; ΔR² = 0.04). Moreover, a meta-analysis of 4 studies, including the current one, confirmed the association between COMT rs4680 genotype and reward learning (95% CI -0.11 to -0.03; z = 3.2; P < 0.01). These results suggest that variability in dopamine signaling associated with COMT rs4680 influences individual differences in reward learning, which may potentially contribute to psychopathology characterized by reward dysfunction. PMID:27138112
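
    Probabilistic reward tasks of this kind typically quantify response bias with a signal-detection measure; the record does not spell out the formula it used, so the standard form is shown here only for orientation, with Rich and Lean denoting responses to the more and less frequently rewarded stimuli:

      \log b \;=\; \tfrac{1}{2}\,\log\!\left(\frac{\mathrm{Rich}_{\mathrm{correct}}\cdot\mathrm{Lean}_{\mathrm{incorrect}}}{\mathrm{Rich}_{\mathrm{incorrect}}\cdot\mathrm{Lean}_{\mathrm{correct}}}\right)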

  11. COMT Val158Met genotype is associated with reward learning: a replication study and meta-analysis.

    PubMed

    Corral-Frías, N S; Pizzagalli, D A; Carré, J M; Michalski, L J; Nikolova, Y S; Perlis, R H; Fagerness, J; Lee, M R; Conley, E Drabant; Lancaster, T M; Haddad, S; Wolf, A; Smoller, J W; Hariri, A R; Bogdan, R

    2016-06-01

    Identifying mechanisms through which individual differences in reward learning emerge offers an opportunity to understand both a fundamental form of adaptive responding and etiological pathways through which aberrant reward learning may contribute to maladaptive behaviors and psychopathology. One candidate mechanism through which individual differences in reward learning may emerge is variability in dopaminergic reinforcement signaling. A common functional polymorphism within the catechol-O-methyl transferase gene (COMT; rs4680, Val158Met) has been linked to reward learning, where homozygosity for the Met allele (linked to heightened prefrontal dopamine function and decreased dopamine synthesis in the midbrain) has been associated with relatively increased reward learning. Here, we used a probabilistic reward learning task to assess response bias, a behavioral form of reward learning, across three separate samples that were combined for analyses (age: 21.80 ± 3.95; n = 392; 268 female; European-American: n = 208). We replicate prior reports that COMT rs4680 Met allele homozygosity is associated with increased reward learning in European-American participants (β = 0.20, t = 2.75, P < 0.01; ΔR² = 0.04). Moreover, a meta-analysis of 4 studies, including the current one, confirmed the association between COMT rs4680 genotype and reward learning (95% CI -0.11 to -0.03; z = 3.2; P < 0.01). These results suggest that variability in dopamine signaling associated with COMT rs4680 influences individual differences in reward learning, which may potentially contribute to psychopathology characterized by reward dysfunction.

  12. Dopamine selectively remediates 'model-based' reward learning: a computational approach.

    PubMed

    Sharp, Madeleine E; Foerde, Karin; Daw, Nathaniel D; Shohamy, Daphna

    2016-02-01

    Patients with loss of dopamine due to Parkinson's disease are impaired at learning from reward. However, it remains unknown precisely which aspect of learning is impaired. In particular, learning from reward, or reinforcement learning, can be driven by two distinct computational processes. One involves habitual stamping-in of stimulus-response associations, hypothesized to arise computationally from 'model-free' learning. The other, 'model-based' learning, involves learning a model of the world that is believed to support goal-directed behaviour. Much work has pointed to a role for dopamine in model-free learning. But recent work suggests model-based learning may also involve dopamine modulation, raising the possibility that model-based learning may contribute to the learning impairment in Parkinson's disease. To directly test this, we used a two-step reward-learning task which dissociates model-free versus model-based learning. We evaluated learning in patients with Parkinson's disease tested ON versus OFF their dopamine replacement medication and in healthy controls. Surprisingly, we found no effect of disease or medication on model-free learning. Instead, we found that patients tested OFF medication showed a marked impairment in model-based learning, and that this impairment was remediated by dopaminergic medication. Moreover, model-based learning was positively correlated with a separate measure of working memory performance, raising the possibility of common neural substrates. Our results suggest that some learning deficits in Parkinson's disease may be related to an inability to pursue reward based on complete representations of the environment.

  13. The role of reward in word learning and its implications for language acquisition.

    PubMed

    Ripollés, Pablo; Marco-Pallarés, Josep; Hielscher, Ulrike; Mestres-Missé, Anna; Tempelmann, Claus; Heinze, Hans-Jochen; Rodríguez-Fornells, Antoni; Noesselt, Toemme

    2014-11-01

    The exact neural processes behind humans' drive to acquire a new language--first as infants and later as second-language learners--are yet to be established. Recent theoretical models have proposed that during human evolution, emerging language-learning mechanisms might have been glued to phylogenetically older subcortical reward systems, reinforcing human motivation to learn a new language. Supporting this hypothesis, our results showed that adult participants exhibited robust fMRI activation in the ventral striatum (VS)--a core region of reward processing--when successfully learning the meaning of new words. This activation was similar to the VS recruitment elicited using an independent reward task. Moreover, the VS showed enhanced functional and structural connectivity with neocortical language areas during successful word learning. Together, our results provide evidence for the neural substrate of reward and motivation during word learning. We suggest that this strong functional and anatomical coupling between neocortical language regions and the subcortical reward system provided a crucial advantage in humans that eventually enabled our lineage to successfully acquire linguistic skills.

  14. Multi Agent Reward Analysis for Learning in Noisy Domains

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2005-01-01

    In many multi-agent learning problems, it is difficult to determine, a priori, the agent reward structure that will lead to good performance. This problem is particularly pronounced in continuous, noisy domains ill-suited to simple table backup schemes commonly used in TD(λ)/Q-learning. In this paper, we present a new reward evaluation method that allows the tradeoff between coordination among the agents and the difficulty of the learning problem each agent faces to be visualized. This method is independent of the learning algorithm and is only a function of the problem domain and the agents' reward structure. We then use this reward efficiency visualization method to determine an effective reward without performing extensive simulations. We test this method in both a static and a dynamic multi-rover learning domain where the agents have continuous state spaces and where their actions are noisy (e.g., the agents' movement decisions are not always carried out properly). Our results show that in the more difficult dynamic domain, the reward efficiency visualization method provides a two-order-of-magnitude speedup in selecting a good reward. Most importantly, it allows one to quickly create and verify rewards tailored to the observational limitations of the domain.

  15. Learning the specific quality of taste reinforcement in larval Drosophila.

    PubMed

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-27

    The only property of reinforcement insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing-in any brain.

  16. Learning the specific quality of taste reinforcement in larval Drosophila.

    PubMed

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-01

    The only property of reinforcement insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing-in any brain. PMID:25622533

  17. Learning the specific quality of taste reinforcement in larval Drosophila

    PubMed Central

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-01

    The only property of reinforcement insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing—in any brain. DOI: http://dx.doi.org/10.7554/eLife.04711.001 PMID:25622533

  18. Learned Helplessness in Learning Disabled Adolescents as a Function of Noncontingent Rewards.

    ERIC Educational Resources Information Center

    Kleinhammer-Tramill, P. Jeannie; And Others

    1983-01-01

    Twenty-four learning disabled students (10 to 14 years old) assigned to three reward schedules (response contingent reward, 100 percent noncontingent reward, and 50 percent random noncontingent reward) were asked to replicate block designs and solve coding problems. No differences were found in errors, but Ss in noncontingent conditions…

  19. A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task.

    PubMed

    Suri, R E; Schultz, W

    1999-01-01

    This study investigated how the simulated response of dopamine neurons to reward-related stimuli could be used as reinforcement signal for learning a spatial delayed response task. Spatial delayed response tasks assess the functions of frontal cortex and basal ganglia in short-term memory, movement preparation and expectation of environmental events. In these tasks, a stimulus appears for a short period at a particular location, and after a delay the subject moves to the location indicated. Dopamine neurons are activated by unpredicted rewards and reward-predicting stimuli, are not influenced by fully predicted rewards, and are depressed by omitted rewards. Thus, they appear to report an error in the prediction of reward, which is the crucial reinforcement term in formal learning theories. Theoretical studies on reinforcement learning have shown that signals similar to dopamine responses can be used as effective teaching signals for learning. A neural network model implementing the temporal difference algorithm was trained to perform a simulated spatial delayed response task. The reinforcement signal was modeled according to the basic characteristics of dopamine responses to novel stimuli, primary rewards and reward-predicting stimuli. A Critic component analogous to dopamine neurons computed a temporal error in the prediction of reinforcement and emitted this signal to an Actor component which mediated the behavioral output. The spatial delayed response task was learned via two subtasks introducing spatial choices and temporal delays, in the same manner as monkeys in the laboratory. In all three tasks, the reinforcement signal of the Critic developed in a similar manner to the responses of natural dopamine neurons in comparable learning situations, and the learning curves of the Actor replicated the progress of learning observed in the animals. Several manipulations demonstrated further the efficacy of the particular characteristics of the dopamine
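
    A minimal tabular sketch of the actor-critic arrangement the abstract describes, with the critic's temporal-difference error standing in for the dopamine-like reinforcement signal; the task interface (step), table sizes, and parameters are assumptions, and the original model's stimulus representations and eligibility traces are omitted.

      import numpy as np

      def actor_critic(n_states, n_actions, episodes, step, gamma=0.95,
                       alpha_v=0.1, alpha_p=0.1, seed=0):
          """`step(state, action) -> (next_state, reward, done)` defines the task."""
          V = np.zeros(n_states)                   # critic: state-value estimates
          prefs = np.zeros((n_states, n_actions))  # actor: action preferences
          rng = np.random.default_rng(seed)
          for _ in range(episodes):
              s, done = 0, False
              while not done:
                  p = np.exp(prefs[s] - prefs[s].max())
                  p /= p.sum()                     # softmax action selection
                  a = int(rng.choice(n_actions, p=p))
                  s2, r, done = step(s, a)
                  target = r if done else r + gamma * V[s2]
                  delta = target - V[s]            # TD error: the dopamine-like teaching signal
                  V[s] += alpha_v * delta          # critic update
                  prefs[s, a] += alpha_p * delta   # actor update, driven by the same signal
                  s = s2
          return V, prefs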

  20. Learned helplessness and learned prevalence: exploring the causal relations among perceived controllability, reward prevalence, and exploration.

    PubMed

    Teodorescu, Kinneret; Erev, Ido

    2014-10-01

    Exposure to uncontrollable outcomes has been found to trigger learned helplessness, a state in which the agent, because of lack of exploration, fails to take advantage of regained control. Although the implications of this phenomenon have been widely studied, its underlying cause remains undetermined. One can learn not to explore because the environment is uncontrollable, because the average reinforcement for exploring is low, or because rewards for exploring are rare. In the current research, we tested a simple experimental paradigm that contrasts the predictions of these three contributors and offers a unified psychological mechanism that underlies the observed phenomena. Our results demonstrate that learned helplessness is not correlated with either the perceived controllability of one's environment or the average reward, which suggests that reward prevalence is a better predictor of exploratory behavior than the other two factors. A simple computational model in which exploration decisions were based on small samples of past experiences captured the empirical phenomena while also providing a cognitive basis for feelings of uncontrollability.
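
    A rough sketch of the small-sample choice rule the abstract alludes to, under the assumption that each option's past outcomes are stored and a few of them are resampled before every choice; the sample size and tie-breaking are arbitrary.

      import random

      def small_sample_choice(histories, sample_size=3):
          """histories[option] is the list of outcomes previously obtained from that option."""
          best_option, best_mean = None, float('-inf')
          for option, past in enumerate(histories):
              if not past:                            # unexplored options are tried first
                  return option
              sample = random.choices(past, k=sample_size)
              mean = sum(sample) / sample_size
              if mean > best_mean:
                  best_option, best_mean = option, mean
          return best_option

    When rewards for exploring are rare, such a small sample usually contains no reward at all, so exploration is abandoned largely irrespective of the environment's true controllability or average payoff, in line with the record's conclusion.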

  1. Bayesian Cue Integration as a Developmental Outcome of Reward Mediated Learning

    PubMed Central

    Weisswange, Thomas H.; Rothkopf, Constantin A.; Rodemann, Tobias; Triesch, Jochen

    2011-01-01

    Average human behavior in cue combination tasks is well predicted by Bayesian inference models. As this capability is acquired over developmental timescales, the question arises of how it is learned. Here we investigated whether reward-dependent learning, which is well established at the computational, behavioral, and neuronal levels, could contribute to this development. It is shown that a model-free reinforcement learning algorithm can indeed learn to do cue integration, i.e., weight uncertain cues according to their respective reliabilities, and can even do so if the reliabilities are changing. We also consider the case of causal inference, where multimodal signals can originate from one or multiple separate objects and should not always be integrated. In this case, the learner is shown to develop a behavior that is closest to Bayesian model averaging. We conclude that reward-mediated learning could be a driving force for the development of cue integration and causal inference. PMID:21750717
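
    A hedged sketch of how a model-free learner could come to weight cues by reliability, with all task details (discretization, noise levels, quadratic reward) assumed rather than taken from the record: the tabular Q-learner below sees two discretized noisy cues about a hidden target and is rewarded for accurate estimates, and with enough training its greedy estimates approach the reliability-weighted combination of the cues.

      import numpy as np

      def train_cue_integration(n_bins=11, sigma1=0.5, sigma2=1.5, episodes=50000,
                                alpha=0.1, epsilon=0.1, seed=0):
          rng = np.random.default_rng(seed)
          # State: (discretized cue 1, discretized cue 2); action: discretized estimate.
          Q = np.zeros((n_bins, n_bins, n_bins))
          edges = np.linspace(-3, 3, n_bins - 1)
          centers = np.linspace(-3, 3, n_bins)
          for _ in range(episodes):
              target = rng.uniform(-2, 2)
              c1 = np.digitize(target + rng.normal(0, sigma1), edges)
              c2 = np.digitize(target + rng.normal(0, sigma2), edges)
              if rng.random() < epsilon:
                  a = int(rng.integers(n_bins))
              else:
                  a = int(np.argmax(Q[c1, c2]))
              reward = -(centers[a] - target) ** 2       # more accurate estimates earn more reward
              Q[c1, c2, a] += alpha * (reward - Q[c1, c2, a])
          return Q, centers                              # argmax over the last axis is the learned estimate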

  2. Credit assignment in movement-dependent reinforcement learning.

    PubMed

    McDougle, Samuel D; Boggess, Matthew J; Crossley, Matthew J; Parvin, Darius; Ivry, Richard B; Taylor, Jordan A

    2016-06-14

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants' explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem. PMID:27247404
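
    A minimal sketch of the gating account described above, in which the reward prediction error is simply withheld from the value update when the reaching movement itself missed; the function name and learning rate are assumptions.

      def update_value(value, reward, movement_error, alpha=0.2):
          """Return the updated value of the chosen option in a reaching bandit."""
          if movement_error:
              return value            # gated: an execution failure assigns no credit to the option
          return value + alpha * (reward - value)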

  3. Credit assignment in movement-dependent reinforcement learning.

    PubMed

    McDougle, Samuel D; Boggess, Matthew J; Crossley, Matthew J; Parvin, Darius; Ivry, Richard B; Taylor, Jordan A

    2016-06-14

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants' explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem.

  4. Efficient reinforcement learning: computational theories, neuroscience and robotics.

    PubMed

    Kawato, Mitsuo; Samejima, Kazuyuki

    2007-04-01

    Reinforcement learning algorithms have provided some of the most influential computational theories for behavioral learning that depends on reward and penalty. After briefly reviewing supporting experimental data, this paper tackles three difficult theoretical issues that remain to be explored. First, plain reinforcement learning is much too slow to be considered a plausible brain model. Second, although the temporal-difference error has an important role both in theory and in experiments, how to compute it remains an enigma. Third, the function of all brain areas, including the cerebral cortex, cerebellum, brainstem and basal ganglia, seems to necessitate a new computational framework. Computational studies that emphasize meta-parameters, hierarchy, modularity and supervised learning to resolve these issues are reviewed here, together with the related experimental data.

  5. Pleasurable music affects reinforcement learning according to the listener.

    PubMed

    Gold, Benjamin P; Frank, Michael J; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  6. Pleasurable music affects reinforcement learning according to the listener.

    PubMed

    Gold, Benjamin P; Frank, Michael J; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy.

  7. Pleasurable music affects reinforcement learning according to the listener

    PubMed Central

    Gold, Benjamin P.; Frank, Michael J.; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  8. Can model-free reinforcement learning explain deontological moral judgments?

    PubMed

    Ayars, Alisabeth

    2016-05-01

    Dual-systems frameworks propose that moral judgments are derived from both an immediate emotional response, and controlled/rational cognition. Recently Cushman (2013) proposed a new dual-system theory based on model-free and model-based reinforcement learning. Model-free learning attaches values to actions based on their history of reward and punishment, and explains some deontological, non-utilitarian judgments. Model-based learning involves the construction of a causal model of the world and allows for far-sighted planning; this form of learning fits well with utilitarian considerations that seek to maximize certain kinds of outcomes. I present three concerns regarding the use of model-free reinforcement learning to explain deontological moral judgment. First, many actions that humans find aversive from model-free learning are not judged to be morally wrong. Moral judgment must require something in addition to model-free learning. Second, there is a dearth of evidence for central predictions of the reinforcement account-e.g., that people with different reinforcement histories will, all else equal, make different moral judgments. Finally, to account for the effect of intention within the framework requires certain assumptions which lack support. These challenges are reasonable foci for future empirical/theoretical work on the model-free/model-based framework. PMID:26918742

  9. Can model-free reinforcement learning explain deontological moral judgments?

    PubMed

    Ayars, Alisabeth

    2016-05-01

    Dual-systems frameworks propose that moral judgments are derived from both an immediate emotional response, and controlled/rational cognition. Recently Cushman (2013) proposed a new dual-system theory based on model-free and model-based reinforcement learning. Model-free learning attaches values to actions based on their history of reward and punishment, and explains some deontological, non-utilitarian judgments. Model-based learning involves the construction of a causal model of the world and allows for far-sighted planning; this form of learning fits well with utilitarian considerations that seek to maximize certain kinds of outcomes. I present three concerns regarding the use of model-free reinforcement learning to explain deontological moral judgment. First, many actions that humans find aversive from model-free learning are not judged to be morally wrong. Moral judgment must require something in addition to model-free learning. Second, there is a dearth of evidence for central predictions of the reinforcement account-e.g., that people with different reinforcement histories will, all else equal, make different moral judgments. Finally, to account for the effect of intention within the framework requires certain assumptions which lack support. These challenges are reasonable foci for future empirical/theoretical work on the model-free/model-based framework.

  10. The neural basis of associative reward learning in honeybees.

    PubMed

    Hammer, M

    1997-06-01

    Appetitive learning of food-predicting stimuli, an essential part of foraging behavior in honeybees, follows the rules of associative learning. In the learning of odors as reward-predicting stimuli, an individual neuron, one of a small group of large ascending neurons that serve principal brain neuropiles, mediates the reward and has experience-dependent response properties. This implies that this neuron functions as an integral part of associative memory, might underlie more complex features of learning, and could participate in the implementation of learning rules. Moreover, its structural properties suggest that it organizes the interaction of functionally different neural nets during learning and experience-dependent behavior.

  11. Collaborating Fuzzy Reinforcement Learning Agents

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1997-01-01

    Earlier, we introduced GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning, while at the local level each agent learns and operates based on GARIC, a technique for fuzzy reinforcement learning. In this paper, we show that it is possible for these agents to compete in order to affect the selected control policy while, at the same time, collaborating in exploring the state space. In this model, the evaluator, or critic, learns by observing all the agents' behaviors, but the control policy changes only based on the behavior of the winning agent, also known as the super agent.

  12. The influence of trial order on learning from reward vs. punishment in a probabilistic categorization task: experimental and computational analyses

    PubMed Central

    Moustafa, Ahmed A.; Gluck, Mark A.; Herzallah, Mohammad M.; Myers, Catherine E.

    2015-01-01

    Previous research has shown that trial ordering affects cognitive performance, but this has not been tested using category-learning tasks that differentiate learning from reward and punishment. Here, we tested two groups of healthy young adults using a probabilistic category learning task of reward and punishment in which there are two types of trials (reward, punishment) and three possible outcomes: (1) positive feedback for correct responses in reward trials; (2) negative feedback for incorrect responses in punishment trials; and (3) no feedback for incorrect answers in reward trials and correct answers in punishment trials. Hence, trials without feedback are ambiguous, and may represent either successful avoidance of punishment or failure to obtain reward. In Experiment 1, the first group of subjects received an intermixed task in which reward and punishment trials were presented in the same block, as a standard baseline task. In Experiment 2, a second group completed the separated task, in which reward and punishment trials were presented in separate blocks. Additionally, in order to understand the mechanisms underlying performance in the experimental conditions, we fit individual data using a Q-learning model. Results from Experiment 1 show that subjects who completed the intermixed task paradoxically valued the no-feedback outcome as a reinforcer when it occurred on reinforcement-based trials, and as a punisher when it occurred on punishment-based trials. This is supported by patterns of empirical responding, where subjects showed more win-stay behavior following an explicit reward than following an omission of punishment, and more lose-shift behavior following an explicit punisher than following an omission of reward. In Experiment 2, results showed similar performance whether subjects received reward-based or punishment-based trials first. However, when the Q-learning model was applied to these data, there were differences between subjects in the reward
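
    A minimal sketch of the kind of Q-learning model described above (see also the two duplicate records that follow), with the ambiguous no-feedback outcome given its own subjective value r0 as a free parameter; the exact parameterization the authors fit is not given in the record, so the names and defaults below are assumptions.

      import math, random

      def q_learning_trial(q, stimulus, trial_type, correct_action,
                           alpha=0.2, beta=3.0, r0=0.0):
          """q[stimulus] is a two-element list of learned action values."""
          qa, qb = q[stimulus]
          p_a = 1.0 / (1.0 + math.exp(-beta * (qa - qb)))         # softmax over the two responses
          action = 0 if random.random() < p_a else 1
          if trial_type == 'reward':
              outcome = 1.0 if action == correct_action else r0   # explicit reward or no feedback
          else:                                                   # punishment trial
              outcome = r0 if action == correct_action else -1.0  # no feedback or explicit punishment
          q[stimulus][action] += alpha * (outcome - q[stimulus][action])
          return action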

  13. The influence of trial order on learning from reward vs. punishment in a probabilistic categorization task: experimental and computational analyses.

    PubMed

    Moustafa, Ahmed A; Gluck, Mark A; Herzallah, Mohammad M; Myers, Catherine E

    2015-01-01

    Previous research has shown that trial ordering affects cognitive performance, but this has not been tested using category-learning tasks that differentiate learning from reward and punishment. Here, we tested two groups of healthy young adults using a probabilistic category learning task of reward and punishment in which there are two types of trials (reward, punishment) and three possible outcomes: (1) positive feedback for correct responses in reward trials; (2) negative feedback for incorrect responses in punishment trials; and (3) no feedback for incorrect answers in reward trials and correct answers in punishment trials. Hence, trials without feedback are ambiguous, and may represent either successful avoidance of punishment or failure to obtain reward. In Experiment 1, the first group of subjects received an intermixed task in which reward and punishment trials were presented in the same block, as a standard baseline task. In Experiment 2, a second group completed the separated task, in which reward and punishment trials were presented in separate blocks. Additionally, in order to understand the mechanisms underlying performance in the experimental conditions, we fit individual data using a Q-learning model. Results from Experiment 1 show that subjects who completed the intermixed task paradoxically valued the no-feedback outcome as a reinforcer when it occurred on reinforcement-based trials, and as a punisher when it occurred on punishment-based trials. This is supported by patterns of empirical responding, where subjects showed more win-stay behavior following an explicit reward than following an omission of punishment, and more lose-shift behavior following an explicit punisher than following an omission of reward. In Experiment 2, results showed similar performance whether subjects received reward-based or punishment-based trials first. However, when the Q-learning model was applied to these data, there were differences between subjects in the reward

  14. The influence of trial order on learning from reward vs. punishment in a probabilistic categorization task: experimental and computational analyses.

    PubMed

    Moustafa, Ahmed A; Gluck, Mark A; Herzallah, Mohammad M; Myers, Catherine E

    2015-01-01

    Previous research has shown that trial ordering affects cognitive performance, but this has not been tested using category-learning tasks that differentiate learning from reward and punishment. Here, we tested two groups of healthy young adults using a probabilistic category learning task of reward and punishment in which there are two types of trials (reward, punishment) and three possible outcomes: (1) positive feedback for correct responses in reward trials; (2) negative feedback for incorrect responses in punishment trials; and (3) no feedback for incorrect answers in reward trials and correct answers in punishment trials. Hence, trials without feedback are ambiguous, and may represent either successful avoidance of punishment or failure to obtain reward. In Experiment 1, the first group of subjects received an intermixed task in which reward and punishment trials were presented in the same block, as a standard baseline task. In Experiment 2, a second group completed the separated task, in which reward and punishment trials were presented in separate blocks. Additionally, in order to understand the mechanisms underlying performance in the experimental conditions, we fit individual data using a Q-learning model. Results from Experiment 1 show that subjects who completed the intermixed task paradoxically valued the no-feedback outcome as a reinforcer when it occurred on reinforcement-based trials, and as a punisher when it occurred on punishment-based trials. This is supported by patterns of empirical responding, where subjects showed more win-stay behavior following an explicit reward than following an omission of punishment, and more lose-shift behavior following an explicit punisher than following an omission of reward. In Experiment 2, results showed similar performance whether subjects received reward-based or punishment-based trials first. However, when the Q-learning model was applied to these data, there were differences between subjects in the reward

  15. The combination of appetitive and aversive reinforcers and the nature of their interaction during auditory learning.

    PubMed

    Ilango, A; Wetzel, W; Scheich, H; Ohl, F W

    2010-03-31

    Learned changes in behavior can be elicited by either appetitive or aversive reinforcers. It is, however, not clear whether the two types of motivation (approaching appetitive stimuli and avoiding aversive stimuli) drive learning in the same or different ways, nor is their interaction understood in situations where the two types are combined in a single experiment. To investigate this question, we have developed a novel learning paradigm for Mongolian gerbils, which not only allows rewards and punishments to be presented in isolation or in combination with each other, but also can use these opposite reinforcers to drive the same learned behavior. Specifically, we studied learning of tone-conditioned hurdle crossing in a shuttle box driven by either an appetitive reinforcer (brain stimulation reward) or an aversive reinforcer (electrical footshock), or by a combination of both. Combination of the two reinforcers potentiated speed of acquisition, led to maximum possible performance, and delayed extinction as compared to either reinforcer alone. Additional experiments, using partial reinforcement protocols and experiments in which one of the reinforcers was omitted after the animals had been previously trained with the combination of both reinforcers, indicated that appetitive and aversive reinforcers operated together but acted in different ways: in this particular experimental context, punishment appeared to be more effective for initial acquisition and reward more effective to maintain a high level of conditioned responses (CRs). The results imply that learning mechanisms in problem solving were maximally effective when the initial punishment of mistakes was combined with the subsequent rewarding of correct performance. PMID:20080152

  16. Ensemble algorithms in reinforcement learning.

    PubMed

    Wiering, Marco A; van Hasselt, Hado

    2008-08-01

    This paper describes several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent. The aim is to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms. We designed and implemented four different ensemble methods combining the following five different RL algorithms: Q-learning, Sarsa, actor-critic (AC), QV-learning, and AC learning automaton. The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms, in contrast to previous work where ensemble methods have been used in RL for representing and learning a single value function. We show experiments on five maze problems of varying complexity; the first problem is simple, but the other four maze tasks are of a dynamic or partially observable nature. The results indicate that the BM and MV ensembles significantly outperform the single RL algorithms.
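
    Hedged sketches of two of the ensemble rules named above, majority voting and Boltzmann multiplication, assuming each member algorithm exposes its current action-value estimates for the state; function names and the temperature are illustrative.

      import numpy as np

      def majority_vote(q_values_per_algo):
          """q_values_per_algo: one 1-D array of action values per member RL algorithm."""
          votes = np.zeros(len(q_values_per_algo[0]))
          for q in q_values_per_algo:
              votes[int(np.argmax(q))] += 1          # each algorithm votes for its greedy action
          return int(np.argmax(votes))

      def boltzmann_multiplication(q_values_per_algo, tau=1.0):
          probs = np.ones(len(q_values_per_algo[0]))
          for q in q_values_per_algo:
              e = np.exp((np.asarray(q) - np.max(q)) / tau)
              probs *= e / e.sum()                   # multiply the members' Boltzmann policies
          probs /= probs.sum()
          return int(np.random.choice(len(probs), p=probs))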

  17. Ensemble algorithms in reinforcement learning.

    PubMed

    Wiering, Marco A; van Hasselt, Hado

    2008-08-01

    This paper describes several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent. The aim is to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms. We designed and implemented four different ensemble methods combining the following five different RL algorithms: Q-learning, Sarsa, actor-critic (AC), QV-learning, and AC learning automaton. The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms, in contrast to previous work where ensemble methods have been used in RL for representing and learning a single value function. We show experiments on five maze problems of varying complexity; the first problem is simple, but the other four maze tasks are of a dynamic or partially observable nature. The results indicate that the BM and MV ensembles significantly outperform the single RL algorithms. PMID:18632380

  18. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation

    PubMed Central

    Morita, Kenji

    2016-01-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of ‘Go’ or ‘No-Go’ selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of ‘Go’ values towards a goal, and (2) value-contrasts between ‘Go’ and ‘No-Go’ are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological
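
    A minimal sketch of temporal-difference learning with value decay ('forgetting') as described above: every stored value relaxes toward zero on each step, so the prediction error toward a fully predictable reward never vanishes entirely. The decay rate and tabular representation are assumptions, not the paper's exact model.

      def td_update_with_decay(V, s, s_next, reward, alpha=0.3, gamma=0.97, phi=0.01):
          """V maps states to learned values; phi is the per-step decay (forgetting) rate."""
          delta = reward + gamma * V[s_next] - V[s]   # prediction error stays sustained under decay
          V[s] += alpha * delta
          for state in V:                             # every stored value relaxes toward zero
              V[state] *= (1.0 - phi)
          return delta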

  19. Distributed reinforcement learning for adaptive and robust network intrusion response

    NASA Astrophysics Data System (ADS)

    Malialis, Kleanthis; Devlin, Sam; Kudenko, Daniel

    2015-07-01

    Distributed denial of service (DDoS) attacks constitute a rapidly evolving threat in the current Internet. Multiagent Router Throttling is a novel approach to defend against DDoS attacks where multiple reinforcement learning agents are installed on a set of routers and learn to rate-limit or throttle traffic towards a victim server. The focus of this paper is on online learning and scalability. We propose an approach that incorporates task decomposition, team rewards and a form of reward shaping called difference rewards. One of the novel characteristics of the proposed system is that it provides a decentralised coordinated response to the DDoS problem, thus being resilient to DDoS attacks themselves. The proposed system learns remarkably fast, thus being suitable for online learning. Furthermore, its scalability is successfully demonstrated in experiments involving 1000 learning agents. We compare our approach against a baseline and a popular state-of-the-art throttling technique from the network security literature and show that the proposed approach is more effective, adaptive to sophisticated attack rate dynamics and robust to agent failures.
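
    A brief sketch of the difference-reward shaping named above: each agent is credited with the global reward minus the global reward that would have resulted had it taken a default (null) action, isolating its own contribution. The evaluation function global_reward and the null action are assumed to be supplied by the simulator.

      def difference_reward(joint_action, agent_index, global_reward, null_action=0):
          """Credit one agent with its marginal contribution to the team reward."""
          counterfactual = list(joint_action)
          counterfactual[agent_index] = null_action   # replace this agent's action with a default
          return global_reward(joint_action) - global_reward(counterfactual)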

  20. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.

    PubMed

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-04-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity. PMID:23592970
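
    For orientation, the continuous-time temporal-difference error that this line of work builds on (Doya, 2000) takes roughly the following form, with tau the discount time constant; the record's spiking, network-level implementation adds detail not captured here:

      \delta(t) \;=\; r(t) \;-\; \frac{1}{\tau}\,V(t) \;+\; \dot{V}(t)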

  1. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.

    PubMed

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-04-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.

  2. Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons

    PubMed Central

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-01-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity. PMID:23592970

  3. A Selective Role for Lmo4 in Cue-Reward Learning.

    PubMed

    Maiya, Rajani; Mangieri, Regina A; Morrisett, Richard A; Heberlein, Ulrike; Messing, Robert O

    2015-07-01

    The ability to use environmental cues to predict rewarding events is essential to survival. The basolateral amygdala (BLA) plays a central role in such forms of associative learning. Aberrant cue-reward learning is thought to underlie many psychopathologies, including addiction, so understanding the underlying molecular mechanisms can inform strategies for intervention. The transcriptional regulator LIM-only 4 (LMO4) is highly expressed in pyramidal neurons of the BLA, where it plays an important role in fear learning. Because the BLA also contributes to cue-reward learning, we investigated the role of BLA LMO4 in this process using Lmo4-deficient mice and RNA interference. Lmo4-deficient mice showed a selective deficit in conditioned reinforcement. Knockdown of LMO4 in the BLA, but not in the nucleus accumbens, recapitulated this deficit in wild-type mice. Molecular and electrophysiological studies identified a deficit in dopamine D2 receptor signaling in the BLA of Lmo4-deficient mice. These results reveal a novel, LMO4-dependent transcriptional program within the BLA that is essential to cue-reward learning. PMID:26134647

  4. Accelerating Multiagent Reinforcement Learning by Equilibrium Transfer.

    PubMed

    Hu, Yujing; Gao, Yang; An, Bo

    2015-07-01

    An important approach in multiagent reinforcement learning (MARL) is equilibrium-based MARL, which adopts equilibrium solution concepts in game theory and requires agents to play equilibrium strategies at each state. However, most existing equilibrium-based MARL algorithms cannot scale due to a large number of computationally expensive equilibrium computations (e.g., computing Nash equilibria is PPAD-hard) during learning. For the first time, this paper finds that during the learning process of equilibrium-based MARL, the one-shot games corresponding to each state's successive visits often have the same or similar equilibria (for some states more than 90% of games corresponding to successive visits have similar equilibria). Inspired by this observation, this paper proposes to use equilibrium transfer to accelerate equilibrium-based MARL. The key idea of equilibrium transfer is to reuse previously computed equilibria when each agent has a small incentive to deviate. By introducing transfer loss and transfer condition, a novel framework called equilibrium transfer-based MARL is proposed. We prove that although equilibrium transfer brings transfer loss, equilibrium-based MARL algorithms can still converge to an equilibrium policy under certain assumptions. Experimental results in widely used benchmarks (e.g., grid world game, soccer game, and wall game) show that the proposed framework: 1) not only significantly accelerates equilibrium-based MARL (up to 96.7% reduction in learning time), but also achieves higher average rewards than algorithms without equilibrium transfer and 2) scales significantly better than algorithms without equilibrium transfer when the state/action space grows and the number of agents increases.
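
    The core of equilibrium transfer, as the abstract describes it, is to reuse a previously computed joint strategy whenever no agent's incentive to deviate exceeds a small threshold. A minimal sketch of that check for a two-player stage game follows; the payoff-matrix layout and the threshold value are assumptions, not the paper's exact formulation.

    ```python
    # Hedged sketch of the "transfer condition": reuse a previously computed joint
    # strategy for a two-player stage game if no agent can gain more than eps by a
    # pure-strategy deviation. The payoff-matrix layout and eps are assumptions.
    import numpy as np

    def deviation_gain(payoff, s1, s2, player):
        """Best pure-deviation gain of `player` against the joint mixed strategy (s1, s2)."""
        expected = s1 @ payoff @ s2                                  # current expected payoff
        best = np.max(payoff @ s2) if player == 0 else np.max(s1 @ payoff)
        return best - expected

    def maybe_transfer(payoff_1, payoff_2, old_equilibrium, eps=0.05):
        """Return the old equilibrium if the transfer condition holds, else None."""
        s1, s2 = old_equilibrium
        loss = max(deviation_gain(payoff_1, s1, s2, 0),
                   deviation_gain(payoff_2, s1, s2, 1))
        return old_equilibrium if loss <= eps else None   # None -> recompute an equilibrium

    # Example: matching-pennies-like payoffs and a uniform joint strategy
    p1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
    p2 = -p1
    print(maybe_transfer(p1, p2, (np.array([0.5, 0.5]), np.array([0.5, 0.5]))))
    ```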

  5. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems.

    PubMed

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider. PMID:24137146

  6. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems.

    PubMed

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider.

  7. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems

    PubMed Central

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider. PMID:24137146
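
    The records above describe arbitration between a goal-directed and a habitual controller based on the trade-off between the cost of reasoning and the value of information, with the goal-directed system earning intrinsic reward when its world model improves. The toy functions below sketch one way such an arbitration rule and model-improvement bonus could be written down; the value-of-information proxy and the cost constant are illustrative assumptions, not the authors' model.

    ```python
    # Toy sketch of the arbitration and intrinsic-reward ideas described above:
    # deliberate (goal-directed) only when an assumed value-of-information proxy
    # exceeds an assumed reasoning cost, and grant an internal bonus when the
    # world model's prediction error shrinks. Both quantities are illustrative.
    import numpy as np

    def choose_system(q_habitual, model_uncertainty, reasoning_cost=0.1):
        """Return which controller to use on this trial."""
        # Crude value-of-information proxy: uncertainty scaled by how much the
        # habitual value estimates disagree.
        voi = model_uncertainty * (np.max(q_habitual) - np.min(q_habitual))
        return "goal-directed" if voi > reasoning_cost else "habitual"

    def intrinsic_reward(prev_model_error, new_model_error):
        """Internal reward for the goal-directed system when its model improves."""
        return max(0.0, abs(prev_model_error) - abs(new_model_error))

    print(choose_system(np.array([0.2, 0.8]), model_uncertainty=0.5))   # -> goal-directed
    print(intrinsic_reward(prev_model_error=0.4, new_model_error=0.1))  # -> 0.3
    ```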

  8. Abnormal temporal difference reward-learning signals in major depression.

    PubMed

    Kumar, P; Waiter, G; Ahearn, T; Milders, M; Reid, I; Steele, J D

    2008-08-01

    Anhedonia is a core symptom of major depressive disorder (MDD), long thought to be associated with reduced dopaminergic function. However, most antidepressants do not act directly on the dopamine system and all antidepressants have a delayed full therapeutic effect. Recently, it has been proposed that antidepressants fail to alter dopamine function in antidepressant unresponsive MDD. There is compelling evidence that dopamine neurons code a specific phasic (short duration) reward-learning signal, described by temporal difference (TD) theory. There is no current evidence for other neurons coding a TD reward-learning signal, although such evidence may be found in time. The neuronal substrates of the TD signal were not explored in this study. Phasic signals are believed to have quite different properties to tonic (long duration) signals. No studies have investigated phasic reward-learning signals in MDD. Therefore, adults with MDD receiving long-term antidepressant medication, and comparison controls both unmedicated and acutely medicated with the antidepressant citalopram, were scanned using fMRI during a reward-learning task. Three hypotheses were tested: first, patients with MDD have blunted TD reward-learning signals; second, controls given an antidepressant acutely have blunted TD reward-learning signals; third, the extent of alteration in TD signals in major depression correlates with illness severity ratings. The results supported the hypotheses. Patients with MDD had significantly reduced reward-learning signals in many non-brainstem regions: ventral striatum (VS), rostral and dorsal anterior cingulate, retrosplenial cortex (RC), midbrain and hippocampus. However, the TD signal was increased in the brainstem of patients. As predicted, acute antidepressant administration to controls was associated with a blunted TD signal, and the brainstem TD signal was not increased by acute citalopram administration. In a number of regions, the magnitude of the abnormal

  9. Abnormal temporal difference reward-learning signals in major depression.

    PubMed

    Kumar, P; Waiter, G; Ahearn, T; Milders, M; Reid, I; Steele, J D

    2008-08-01

    Anhedonia is a core symptom of major depressive disorder (MDD), long thought to be associated with reduced dopaminergic function. However, most antidepressants do not act directly on the dopamine system and all antidepressants have a delayed full therapeutic effect. Recently, it has been proposed that antidepressants fail to alter dopamine function in antidepressant unresponsive MDD. There is compelling evidence that dopamine neurons code a specific phasic (short duration) reward-learning signal, described by temporal difference (TD) theory. There is no current evidence for other neurons coding a TD reward-learning signal, although such evidence may be found in time. The neuronal substrates of the TD signal were not explored in this study. Phasic signals are believed to have quite different properties to tonic (long duration) signals. No studies have investigated phasic reward-learning signals in MDD. Therefore, adults with MDD receiving long-term antidepressant medication, and comparison controls both unmedicated and acutely medicated with the antidepressant citalopram, were scanned using fMRI during a reward-learning task. Three hypotheses were tested: first, patients with MDD have blunted TD reward-learning signals; second, controls given an antidepressant acutely have blunted TD reward-learning signals; third, the extent of alteration in TD signals in major depression correlates with illness severity ratings. The results supported the hypotheses. Patients with MDD had significantly reduced reward-learning signals in many non-brainstem regions: ventral striatum (VS), rostral and dorsal anterior cingulate, retrosplenial cortex (RC), midbrain and hippocampus. However, the TD signal was increased in the brainstem of patients. As predicted, acute antidepressant administration to controls was associated with a blunted TD signal, and the brainstem TD signal was not increased by acute citalopram administration. In a number of regions, the magnitude of the abnormal
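
    Studies of this kind typically generate a trial-by-trial TD prediction-error sequence from the behavioral task and regress it against the imaging signal. The sketch below computes such a sequence for a single repeatedly presented cue; the learning rate and the single-cue simplification are assumptions, not this study's fitted model.

    ```python
    # Sketch of the trial-by-trial phasic TD prediction-error sequence such studies
    # typically regress against the imaging signal. The single-cue simplification
    # and the learning rate are assumptions, not this study's fitted model.
    def td_prediction_errors(outcomes, alpha=0.2):
        """Per-trial prediction errors for one repeatedly presented, rewarded cue."""
        v = 0.0                       # learned value of the cue
        deltas = []
        for r in outcomes:            # r = 1 if the trial was rewarded, else 0
            delta = r - v             # phasic prediction error at outcome
            deltas.append(delta)
            v += alpha * delta        # value update carried to the next trial
        return deltas

    # Example: a cue rewarded on 7 of 10 trials
    print(td_prediction_errors([1, 1, 0, 1, 1, 0, 1, 1, 0, 1]))
    ```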

  10. Resting-state EEG theta activity and risk learning: sensitivity to reward or punishment?

    PubMed

    Massar, Stijn A A; Kenemans, J Leon; Schutter, Dennis J L G

    2014-03-01

    Increased theta (4-7 Hz)-beta (13-30 Hz) power ratio in resting-state electroencephalography (EEG) has been associated with risky, disadvantageous decision making and with impaired reinforcement learning. However, the specific contributions of theta and beta power to risky decision making remain unclear. The first aim of the present study was to replicate the earlier-found relationship and examine the specific contributions of theta and beta power to risky decision making using the Iowa Gambling Task. The second aim of the study was to examine whether this relation was associated with differences in reward or punishment sensitivity. We replicated the earlier-found relationship by showing a positive association between the theta/beta ratio and risky decision making. This correlation was mainly driven by theta oscillations. Furthermore, theta power correlated with reward-motivated learning, but not with punishment learning. The present results replicate and extend earlier findings by providing novel insights into the relation between theta/beta ratios and risky decision making. Specifically, the findings show that resting-state theta activity is correlated with reinforcement learning, and that this association may be explained by differences in reward sensitivity.
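
    For readers unfamiliar with the measure, a resting-state theta/beta ratio is simply the ratio of band-limited EEG power in the theta (4-7 Hz) and beta (13-30 Hz) ranges. One plausible way to compute it for a single channel is sketched below; the sampling rate, Welch window, and band edges are assumptions rather than this study's exact pipeline.

    ```python
    # One plausible way to compute a resting-state theta/beta power ratio for a
    # single EEG channel; the sampling rate, Welch window, and band edges are
    # assumptions rather than this study's exact pipeline.
    import numpy as np
    from scipy.signal import welch
    from scipy.integrate import trapezoid

    def band_power(x, fs, lo, hi):
        freqs, psd = welch(x, fs=fs, nperseg=int(2 * fs))   # 2-second Welch segments
        mask = (freqs >= lo) & (freqs <= hi)
        return trapezoid(psd[mask], freqs[mask])            # integrate PSD over the band

    def theta_beta_ratio(eeg, fs=250.0):
        theta = band_power(eeg, fs, 4.0, 7.0)
        beta = band_power(eeg, fs, 13.0, 30.0)
        return theta / beta

    # Example on one minute of synthetic data
    rng = np.random.default_rng(0)
    print(theta_beta_ratio(rng.standard_normal(int(250.0 * 60))))
    ```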

  11. Pain relief produces negative reinforcement through activation of mesolimbic reward-valuation circuitry.

    PubMed

    Navratilova, Edita; Xie, Jennifer Y; Okun, Alec; Qu, Chaoling; Eyde, Nathan; Ci, Shuang; Ossipov, Michael H; King, Tamara; Fields, Howard L; Porreca, Frank

    2012-12-11

    Relief of pain is rewarding. Using a model of experimental postsurgical pain we show that blockade of afferent input from the injury with local anesthetic elicits conditioned place preference, activates ventral tegmental dopaminergic cells, and increases dopamine release in the nucleus accumbens. Importantly, place preference is associated with increased activity in midbrain dopaminergic neurons and blocked by dopamine antagonists injected into the nucleus accumbens. The data directly support the hypothesis that relief of pain produces negative reinforcement through activation of the mesolimbic reward-valuation circuitry.

  12. Human dorsal striatal activity during choice discriminates reinforcement learning behavior from the gambler's fallacy.

    PubMed

    Jessup, Ryan K; O'Doherty, John P

    2011-04-27

    Reinforcement learning theory has generated substantial interest in neurobiology, particularly because of the resemblance between phasic dopamine and reward prediction errors. Actor-critic theories have been adapted to account for the functions of the striatum, with parts of the dorsal striatum equated to the actor. Here, we specifically test whether the human dorsal striatum--as predicted by an actor-critic instantiation--is used on a trial-to-trial basis at the time of choice to choose in accordance with reinforcement learning theory, as opposed to a competing strategy: the gambler's fallacy. Using a partial-brain functional magnetic resonance imaging scanning protocol focused on the striatum and other ventral brain areas, we found that the dorsal striatum is more active when choosing consistent with reinforcement learning compared with the competing strategy. Moreover, an overlapping area of dorsal striatum along with the ventral striatum was found to be correlated with reward prediction errors at the time of outcome, as predicted by the actor-critic framework. These findings suggest that the same region of dorsal striatum involved in learning stimulus-response associations may contribute to the control of behavior during choice, thereby using those learned associations. Intriguingly, neither reinforcement learning nor the gambler's fallacy conformed to the optimal choice strategy on the specific decision-making task we used. Thus, the dorsal striatum may contribute to the control of behavior according to reinforcement learning even when the prescriptions of such an algorithm are suboptimal in terms of maximizing future rewards.
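
    The contrast the study tests can be made concrete with two toy choice rules: a reinforcement learner that favours the option with the higher learned value, and a gambler's-fallacy chooser that favours the option that has paid off least recently because it feels "due". Both rules below are rough illustrative simplifications, not the regressors used in the paper.

    ```python
    # Illustrative contrast between the two choice rules the study compares; both
    # are simplifications, not the paper's regressors.
    import numpy as np

    def rl_choice(learned_values):
        """Pick the option with the higher learned value."""
        return int(np.argmax(learned_values))

    def gamblers_fallacy_choice(recent_outcomes):
        """Pick the option with the fewest recent wins (it feels 'due')."""
        return int(np.argmin([sum(o) for o in recent_outcomes]))

    values = [0.6, 0.4]                        # learned values of options 0 and 1
    recent = [[1, 1, 0], [0, 0, 0]]            # last three outcomes per option
    print(rl_choice(values), gamblers_fallacy_choice(recent))   # 0 versus 1
    ```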

  13. Two spatiotemporally distinct value systems shape reward-based learning in the human brain

    PubMed Central

    Fouragnan, Elsa; Retzler, Chris; Mullinger, Karen; Philiastides, Marios G.

    2015-01-01

    Avoiding repeated mistakes and learning to reinforce rewarding decisions is critical for human survival and adaptive actions. Yet, the neural underpinnings of the value systems that encode different decision-outcomes remain elusive. Here coupling single-trial electroencephalography with simultaneously acquired functional magnetic resonance imaging, we uncover the spatiotemporal dynamics of two separate but interacting value systems encoding decision-outcomes. Consistent with a role in regulating alertness and switching behaviours, an early system is activated only by negative outcomes and engages arousal-related and motor-preparatory brain structures. Consistent with a role in reward-based learning, a later system differentially suppresses or activates regions of the human reward network in response to negative and positive outcomes, respectively. Following negative outcomes, the early system interacts and downregulates the late system, through a thalamic interaction with the ventral striatum. Critically, the strength of this coupling predicts participants' switching behaviour and avoidance learning, directly implicating the thalamostriatal pathway in reward-based learning. PMID:26348160

  14. Reward-based learning for virtual neurorobotics through emotional speech processing

    PubMed Central

    Jayet Bray, Laurence C.; Ferneyhough, Gareth B.; Barker, Emily R.; Thibeault, Corey M.; Harris, Frederick C.

    2013-01-01

    Reward-based learning can easily be applied to real life, with a prevalence in children's teaching methods. It also allows machines and software agents to automatically determine the ideal behavior from simple reward feedback (e.g., encouragement) to maximize their performance. Advancements in affective computing, especially emotional speech processing (ESP), have allowed for more natural interaction between humans and robots. Our research focuses on integrating a novel ESP system in a relevant virtual neurorobotic (VNR) application. We created an emotional speech classifier that successfully distinguished happy and utterances. The accuracy of the system was 95.3% and 98.7% during the offline mode (using an emotional speech database) and the live mode (using live recordings), respectively. It was then integrated into a neurorobotic scenario, where a virtual neurorobot had to learn a simple exercise through reward-based learning. If the correct decision was made, the robot received a spoken reward, which in turn stimulated synapses (in our simulated model) undergoing spike-timing-dependent plasticity (STDP) and reinforced the corresponding neural pathways. Both our ESP and neurorobotic systems allowed our neurorobot to successfully and consistently learn the exercise. The integration of ESP in a real-time computational neuroscience architecture is a first step toward the combination of human emotions and virtual neurorobotics. PMID:23641213
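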

  15. Reward-based learning for virtual neurorobotics through emotional speech processing.

    PubMed

    Jayet Bray, Laurence C; Ferneyhough, Gareth B; Barker, Emily R; Thibeault, Corey M; Harris, Frederick C

    2013-01-01

    Reward-based learning can easily be applied to real life, with a prevalence in children's teaching methods. It also allows machines and software agents to automatically determine the ideal behavior from simple reward feedback (e.g., encouragement) to maximize their performance. Advancements in affective computing, especially emotional speech processing (ESP), have allowed for more natural interaction between humans and robots. Our research focuses on integrating a novel ESP system in a relevant virtual neurorobotic (VNR) application. We created an emotional speech classifier that successfully distinguished happy and utterances. The accuracy of the system was 95.3% and 98.7% during the offline mode (using an emotional speech database) and the live mode (using live recordings), respectively. It was then integrated into a neurorobotic scenario, where a virtual neurorobot had to learn a simple exercise through reward-based learning. If the correct decision was made, the robot received a spoken reward, which in turn stimulated synapses (in our simulated model) undergoing spike-timing-dependent plasticity (STDP) and reinforced the corresponding neural pathways. Both our ESP and neurorobotic systems allowed our neurorobot to successfully and consistently learn the exercise. The integration of ESP in a real-time computational neuroscience architecture is a first step toward the combination of human emotions and virtual neurorobotics.

  16. Reward-based learning for virtual neurorobotics through emotional speech processing.

    PubMed

    Jayet Bray, Laurence C; Ferneyhough, Gareth B; Barker, Emily R; Thibeault, Corey M; Harris, Frederick C

    2013-01-01

    Reward-based learning can easily be applied to real life, with a prevalence in children's teaching methods. It also allows machines and software agents to automatically determine the ideal behavior from simple reward feedback (e.g., encouragement) to maximize their performance. Advancements in affective computing, especially emotional speech processing (ESP), have allowed for more natural interaction between humans and robots. Our research focuses on integrating a novel ESP system in a relevant virtual neurorobotic (VNR) application. We created an emotional speech classifier that successfully distinguished happy and utterances. The accuracy of the system was 95.3% and 98.7% during the offline mode (using an emotional speech database) and the live mode (using live recordings), respectively. It was then integrated into a neurorobotic scenario, where a virtual neurorobot had to learn a simple exercise through reward-based learning. If the correct decision was made, the robot received a spoken reward, which in turn stimulated synapses (in our simulated model) undergoing spike-timing-dependent plasticity (STDP) and reinforced the corresponding neural pathways. Both our ESP and neurorobotic systems allowed our neurorobot to successfully and consistently learn the exercise. The integration of ESP in a real-time computational neuroscience architecture is a first step toward the combination of human emotions and virtual neurorobotics. PMID:23641213

  17. Reinforcement learning in continuous time and space.

    PubMed

    Doya, K

    2000-01-01

    This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(lambda) algorithms are shown. For policy improvement, two methods-a continuous actor-critic method and a value-gradient-based greedy policy-are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. The advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model. PMID:10636940
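
    For reference, the continuous-time TD quantities this framework is built on can be written as follows. The notation (value V, reward r, discount time constant tau, eligibility time constant kappa, critic parameters w_i) follows the standard presentation of this approach; the exact forms below are a reconstruction, not a quotation from the article.

    ```latex
    % Reconstruction of the continuous-time TD quantities referred to above
    % (standard presentation of this framework, not a quotation from the article).
    % V: value function, r: reward, tau: reward discount time constant,
    % kappa: eligibility-trace time constant, w_i: critic parameters, eta: learning rate.
    \begin{align}
    V(t) &= \int_{t}^{\infty} e^{-(s-t)/\tau}\, r(s)\, ds
      && \text{(discounted future reward)} \\
    \delta(t) &= r(t) - \tfrac{1}{\tau} V(t) + \dot{V}(t)
      && \text{(continuous-time TD error)} \\
    \dot{e}_i(t) &= -\tfrac{1}{\kappa}\, e_i(t) + \frac{\partial V(t)}{\partial w_i}
      && \text{(exponential eligibility trace)} \\
    \dot{w}_i(t) &= \eta\, \delta(t)\, e_i(t)
      && \text{(parameter update)}
    \end{align}
    ```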

  18. Reinforcement learning of periodical gaits in locomotion robots

    NASA Astrophysics Data System (ADS)

    Svinin, Mikhail; Yamada, Kazuyaki; Ushio, S.; Ueda, Kanji

    1999-08-01

    The emergence of stable gaits in locomotion robots is studied in this paper. A classifier system, implementing an instance-based reinforcement learning scheme, is used for sensory-motor control of an eight-legged mobile robot. An important feature of the classifier system is its ability to work with a continuous sensor space. The robot has no prior knowledge of the environment, no internal model of itself, and no goal coordinates. It is only assumed that the robot can acquire stable gaits by learning how to reach a light source. During the learning process, the control system is self-organized by reinforcement signals. Reaching the light source yields a global reward; forward motion earns a local reward, while stepping back and falling down incur a local punishment. The feasibility of the proposed self-organized system is tested in simulation and in experiments. The control actions are specified at the leg level. It is shown that, as learning progresses, the number of action rules in the classifier system stabilizes at a level corresponding to the acquired gait patterns.
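
    The reinforcement structure described above mixes a global reward for reaching the light source with local rewards and punishments for individual movements. A sketch of such a mixed signal is given below; the numeric magnitudes are illustrative assumptions, not the original settings.

    ```python
    # Sketch of the mixed global/local reinforcement signal described above; the
    # numeric magnitudes are illustrative assumptions, not the original settings.
    def gait_reward(reached_light, forward_progress, stepped_back, fell):
        r = 0.0
        if reached_light:
            r += 10.0     # global reward: the light source was reached
        if forward_progress > 0.0:
            r += 0.1      # local reward: forward motion
        if stepped_back:
            r -= 0.1      # local punishment: stepping back
        if fell:
            r -= 1.0      # local punishment: falling down
        return r

    print(gait_reward(False, 0.02, False, False))   # a small local reward: 0.1
    ```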

  19. Incidental Learning of Rewarded Associations Bolsters Learning on an Associative Task

    ERIC Educational Resources Information Center

    Freedberg, Michael; Schacherer, Jonathan; Hazeltine, Eliot

    2016-01-01

    Reward has been shown to change behavior as a result of incentive learning (by motivating the individual to increase their effort) and instrumental learning (by increasing the frequency of a particular behavior). However, Palminteri et al. (2011) demonstrated that reward can also improve the incidental learning of a motor skill even when…

  20. Learning to reach by reinforcement learning using a receptive field based function approximation approach with continuous actions.

    PubMed

    Tamosiunaite, Minija; Asfour, Tamim; Wörgötter, Florentin

    2009-03-01

    Reinforcement learning methods can be used in robotics applications, especially for specific target-oriented problems such as the reward-based recalibration of goal-directed actions. To this end, relatively large and continuous state-action spaces still need to be handled efficiently. The goal of this paper is, thus, to develop a novel, rather simple method that uses reinforcement learning with function approximation in conjunction with different reward strategies for solving such problems. To test our method, we use a four-degree-of-freedom reaching problem in 3D space, simulated by a robot arm system with two joints of two DOF each. Function approximation is based on 4D overlapping kernels (receptive fields), and the state-action space contains about 10,000 of these. Different types of reward structures are compared, for example, reward-on-touching-only against reward-on-approach. Furthermore, forbidden joint configurations are punished. A continuous action space is used. In spite of the rather large number of states and the continuous action space, these reward/punishment strategies allow the system to find a good solution usually within about 20 trials. The efficiency of our method demonstrated in this test scenario suggests that it might be possible to use it on a real robot for problems where mixed rewards can be defined, in situations where other types of learning might be difficult. PMID:19229556
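
    The function approximator described above covers the joint state-action space with overlapping kernels (receptive fields) and reads out a continuous action from the learned values. The sketch below shows one plausible way to set that up; the dimensions, kernel widths, candidate-action read-out, and update rule are assumptions, not the authors' exact configuration.

    ```python
    # Plausible sketch of value approximation with overlapping Gaussian receptive
    # fields over a joint state-action space, plus a simple continuous-action
    # read-out. Dimensions, kernel widths, and the update rule are assumptions.
    import numpy as np

    rng = np.random.default_rng(1)
    n_kernels = 200
    centers = rng.uniform(-1.0, 1.0, size=(n_kernels, 4 + 1))   # 4 state dims + 1 action dim
    width = 0.3
    weights = np.zeros(n_kernels)

    def activations(state, action):
        z = np.concatenate([state, [action]])
        return np.exp(-np.sum((centers - z) ** 2, axis=1) / (2 * width ** 2))

    def q_value(state, action):
        return weights @ activations(state, action)

    def greedy_action(state, candidates=np.linspace(-1.0, 1.0, 21)):
        """Continuous-action read-out: best of a set of sampled candidate actions."""
        return max(candidates, key=lambda a: q_value(state, a))

    def td_update(state, action, target, lr=0.05):
        """Move Q(state, action) toward a TD target such as r + gamma * max_a' Q(s', a')."""
        global weights
        feats = activations(state, action)
        weights = weights + lr * (target - weights @ feats) * feats

    state = np.zeros(4)
    td_update(state, 0.2, target=1.0)
    print(greedy_action(state))
    ```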

  1. Anticipated Reward Enhances Offline Learning during Sleep

    ERIC Educational Resources Information Center

    Fischer, Stefan; Born, Jan

    2009-01-01

    Sleep is known to promote the consolidation of motor memories. In everyday life, typically more than 1 isolated motor skill is acquired at a time, and this possibly gives rise to interference during consolidation. Here, it is shown that reward expectancy determines the amount of sleep-dependent memory consolidation. Subjects were trained on 2…

  2. Novelty and Inductive Generalization in Human Reinforcement Learning

    PubMed Central

    Gershman, Samuel J.; Niv, Yael

    2015-01-01

    In reinforcement learning, a decision maker searching for the most rewarding option is often faced with the question: what is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: how can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of reinforcement learning in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional reinforcement learning algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty. PMID:25808176

  3. Emotion and reward are dissociable from error during motor learning.

    PubMed

    Festini, Sara B; Preston, Stephanie D; Reuter-Lorenz, Patricia A; Seidler, Rachael D

    2016-06-01

    Although emotion is known to reciprocally interact with cognitive and motor performance, contemporary theories of motor learning do not specifically consider how dynamic variations in a learner's affective state may influence motor performance during motor learning. Using a prism adaptation paradigm, we assessed emotion during motor learning on a trial-by-trial basis. We designed two dart-throwing experiments to dissociate motor performance and reward outcomes by giving participants maximum points for accurate throws and reduced points for throws that hit zones away from the target (i.e., "accidental points"). Experiment 1 dissociated motor performance from emotional responses and found that affective ratings tracked points earned more closely than error magnitude. Further, both reward and error uniquely contributed to motor learning, as indexed by the change in error from one trial to the next. Experiment 2 manipulated accidental point locations vertically, whereas prism displacement remained horizontal. Results demonstrated that reward could bias motor performance even when concurrent sensorimotor adaptation was taking place in a perpendicular direction. Thus, these experiments demonstrate that affective states were dissociable from error magnitude during motor learning and that affect more closely tracked points earned. Our findings further implicate reward as another factor, other than error, that contributes to motor learning, suggesting the importance of incorporating affective states into models of motor learning.

  4. Reinforcement learning for discounted values often loses the goal in the application to animal learning.

    PubMed

    Yamaguchi, Yoshiya; Sakai, Yutaka

    2012-11-01

    The impulsive preference of an animal for an immediate reward implies that it might subjectively discount the value of potential future outcomes. A theoretical framework to maximize the discounted subjective value has been established in reinforcement learning theory. The framework has been successfully applied in engineering. However, this study identified a limitation when applied to animal behavior, where, in some cases, there is no learning goal. Here, a possible learning framework was proposed that is well-posed in all cases and that is consistent with the impulsive preference. PMID:22960494
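
    The two objectives at issue in this record are the exponentially discounted return and, as one well-posed alternative, the long-run average reward. The standard textbook forms are reproduced below for orientation; they are not the authors' specific proposal.

    ```latex
    % The two standard objectives at issue (textbook forms, not the authors'
    % specific proposal): the exponentially discounted return and the long-run
    % average reward of a policy pi.
    \begin{align}
    V_{\gamma}^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \;\middle|\; s_0 = s\right],
      \qquad 0 \le \gamma < 1, \\
    \rho^{\pi} &= \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left[\,\sum_{t=1}^{T} r_t\right].
    \end{align}
    ```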

  5. Reinforcement learning for discounted values often loses the goal in the application to animal learning.

    PubMed

    Yamaguchi, Yoshiya; Sakai, Yutaka

    2012-11-01

    The impulsive preference of an animal for an immediate reward implies that it might subjectively discount the value of potential future outcomes. A theoretical framework to maximize the discounted subjective value has been established in reinforcement learning theory. The framework has been successfully applied in engineering. However, this study identified a limitation when applied to animal behavior, where, in some cases, there is no learning goal. Here, a possible learning framework was proposed that is well-posed in all cases and that is consistent with the impulsive preference.

  6. Dual mechanisms governing reward-driven perceptual learning

    PubMed Central

    Kim, Dongho; Ling, Sam; Watanabe, Takeo

    2015-01-01

    In this review, we explore how reward signals shape perceptual learning in animals and humans. Perceptual learning is the well-established phenomenon by which extensive practice elicits selective improvement in one’s perceptual discrimination of basic visual features, such as oriented lines or moving stimuli. While perceptual learning has long been thought to rely on ‘top-down’ processes, such as attention and decision-making, a wave of recent findings suggests that these higher-level processes are, in fact, not necessary.  Rather, these recent findings indicate that reward signals alone, in the absence of the contribution of higher-level cognitive processes, are sufficient to drive the benefits of perceptual learning. Here, we will review the literature tying reward signals to perceptual learning. Based on these findings, we propose dual underlying mechanisms that give rise to perceptual learning: one mechanism that operates ‘automatically’ and is tied directly to reward signals, and another mechanism that involves more ‘top-down’, goal-directed computations. PMID:26539293

  7. Cortical mechanisms for reinforcement learning in competitive games.

    PubMed

    Seo, Hyojung; Lee, Daeyeol

    2008-12-12

    Game theory analyses optimal strategies for multiple decision makers interacting in a social group. However, the behaviours of individual humans and animals often deviate systematically from the optimal strategies described by game theory. The behaviours of rhesus monkeys (Macaca mulatta) in simple zero-sum games showed similar patterns, but their departures from the optimal strategies were well accounted for by a simple reinforcement-learning algorithm. During a computer-simulated zero-sum game, neurons in the dorsolateral prefrontal cortex often encoded the previous choices of the animal and its opponent as well as the animal's reward history. By contrast, the neurons in the anterior cingulate cortex predominantly encoded the animal's reward history. Using simple competitive games, therefore, we have demonstrated functional specialization between different areas of the primate frontal cortex involved in outcome monitoring and action selection. Temporally extended signals related to the animal's previous choices might facilitate the association between choices and their delayed outcomes, whereas information about the choices of the opponent might be used to estimate the reward expected from a particular action. Finally, signals related to the reward history might be used to monitor the overall success of the animal's current decision-making strategy.

  8. Cortical mechanisms for reinforcement learning in competitive games.

    PubMed

    Seo, Hyojung; Lee, Daeyeol

    2008-12-12

    Game theory analyses optimal strategies for multiple decision makers interacting in a social group. However, the behaviours of individual humans and animals often deviate systematically from the optimal strategies described by game theory. The behaviours of rhesus monkeys (Macaca mulatta) in simple zero-sum games showed similar patterns, but their departures from the optimal strategies were well accounted for by a simple reinforcement-learning algorithm. During a computer-simulated zero-sum game, neurons in the dorsolateral prefrontal cortex often encoded the previous choices of the animal and its opponent as well as the animal's reward history. By contrast, the neurons in the anterior cingulate cortex predominantly encoded the animal's reward history. Using simple competitive games, therefore, we have demonstrated functional specialization between different areas of the primate frontal cortex involved in outcome monitoring and action selection. Temporally extended signals related to the animal's previous choices might facilitate the association between choices and their delayed outcomes, whereas information about the choices of the opponent might be used to estimate the reward expected from a particular action. Finally, signals related to the reward history might be used to monitor the overall success of the animal's current decision-making strategy. PMID:18829430

  9. The Establishment of Learned Reinforcers in Mildly Retarded Children. IMRID Behavioral Science Monograph No. 24.

    ERIC Educational Resources Information Center

    Worley, John C., Jr.

    Research regarding the establishment of learned reinforcement with mildly retarded children is reviewed. Noted are findings which indicate that educable retarded students, possibly due to cultural differences, are less responsive to social rewards than either nonretarded or more severely retarded children. Characteristics of primary and secondary…

  10. The Effect of a Token Reinforcement Program on the Reading Comprehension of a Learning Disabled Student.

    ERIC Educational Resources Information Center

    Galbreath, Joy; Feldman, David

    The relationship of reading comprehension accuracy and a contingently administered token reinforcement program used with an elementary level learning disabled student in the classroom was examined. The S earned points for each correct answer made after oral reading sessions. At the conclusion of the class he could exchange his points for rewards.…

  11. Operant tempo varies with reinforcement rate: implications for measurement of reward efficacy.

    PubMed

    Conover, Kent L.; Fulton, Stephanie; Shizgal, Peter

    2001-11-01

    Herrnstein's melioration theory has been used to account for the hyperbolic form of the single operant matching law and to scale the effectiveness of reinforcing brain stimulation. Underlying this scaling method is the assumption that the mean rate of responding during operant bouts (the response 'tempo') is fixed and does not vary with the rate of reinforcement. The validity of this account was assessed by testing the constant-tempo assumption via a survivor analysis of the distributions of inter-response times at different variable-intervals (VIs) in rats responding for rewarding electrical stimulation of the lateral hypothalamus. Contrary to the constant-tempo assumption, response tempo was not fixed but rather decreased as the VI was lengthened. This demonstration challenges Herrnstein's account of single-operant matching and suggests that the reinforcement rate that supports a half-maximal rate of responding on a single VI schedule may not provide a valid scale for the value of brain stimulation. Possible remedies are discussed. Although the conclusions of the study are restricted to experiments on brain stimulation reward in rats, the inter-response time analysis employed can provide the basis for testing the validity of the constant-tempo assumption in other species and for other reinforcers.
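
    The scaling method discussed above rests on Herrnstein's single-operant hyperbola, in which the asymptotic response rate k is the "tempo" assumed to be constant. The standard form of that relation is shown below (a textbook statement, not a quotation from the article).

    ```latex
    % Herrnstein's single-operant hyperbola (standard textbook form, not a
    % quotation from the article): B is response rate, R reinforcement rate,
    % k the asymptotic response rate ("tempo"), and R_e the reinforcement rate
    % attributed to extraneous sources. The constant-tempo assumption is that k
    % does not vary with R.
    B = \frac{k\,R}{R + R_{e}}
    ```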

  12. Rewards.

    PubMed

    Gunderman, Richard B; Kamer, Aaron P

    2011-05-01

    For much of the 20th century, psychologists and economists operated on the assumption that work is devoid of intrinsic rewards, and the only way to get people to work harder is through the use of rewards and punishments. This so-called carrot-and-stick model of workplace motivation, when applied to medical practice, emphasizes the use of financial incentives and disincentives to manipulate behavior. More recently, however, it has become apparent that, particularly when applied to certain kinds of work, such approaches can be ineffective or even frankly counterproductive. Instead of focusing on extrinsic rewards such as compensation, organizations and their leaders need to devote more attention to the intrinsic rewards of work itself. This article reviews this new understanding of rewards and traces out its practical implications for radiology today. PMID:21531311

  13. Rewards.

    PubMed

    Gunderman, Richard B; Kamer, Aaron P

    2011-05-01

    For much of the 20th century, psychologists and economists operated on the assumption that work is devoid of intrinsic rewards, and the only way to get people to work harder is through the use of rewards and punishments. This so-called carrot-and-stick model of workplace motivation, when applied to medical practice, emphasizes the use of financial incentives and disincentives to manipulate behavior. More recently, however, it has become apparent that, particularly when applied to certain kinds of work, such approaches can be ineffective or even frankly counterproductive. Instead of focusing on extrinsic rewards such as compensation, organizations and their leaders need to devote more attention to the intrinsic rewards of work itself. This article reviews this new understanding of rewards and traces out its practical implications for radiology today.

  14. Reinforcement learning improves behaviour from evaluative feedback.

    PubMed

    Littman, Michael L

    2015-05-28

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems. PMID:26017443

  15. Reinforcement learning improves behaviour from evaluative feedback.

    PubMed

    Littman, Michael L

    2015-05-28

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  16. Reinforcement learning improves behaviour from evaluative feedback

    NASA Astrophysics Data System (ADS)

    Littman, Michael L.

    2015-05-01

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  17. The "proactive" model of learning: Integrative framework for model-free and model-based reinforcement learning utilizing the associative learning-based proactive brain concept.

    PubMed

    Zsuga, Judit; Biro, Klara; Papp, Csaba; Tajti, Gabor; Gesztelyi, Rudolf

    2016-02-01

    Reinforcement learning (RL) is a powerful concept underlying forms of associative learning governed by the use of a scalar reward signal, with learning taking place if expectations are violated. RL may be assessed using model-based and model-free approaches. Model-based reinforcement learning involves the amygdala, the hippocampus, and the orbitofrontal cortex (OFC). The model-free system involves the pedunculopontine-tegmental nucleus (PPTgN), the ventral tegmental area (VTA), and the ventral striatum (VS). Based on the functional connectivity of the VS, the model-free and model-based RL systems converge on the VS, which computes value by integrating model-free signals (received as reward prediction errors) with model-based, reward-related input. Using the concept of a reinforcement learning agent, we propose that the VS serves as the value-function component of the RL agent. Regarding the model utilized for model-based computations, we turn to the proactive brain concept, which offers a ubiquitous function for the default network based on its great functional overlap with contextual associative areas. Hence, by means of the default network, the brain continuously organizes its environment into context frames, enabling the formulation of analogy-based associations that are turned into predictions of what to expect. The OFC integrates reward-related information into context frames when computing reward expectation by compiling stimulus-reward and context-reward information offered by the amygdala and hippocampus, respectively. Furthermore, we suggest that the integration of model-based reward expectations into the value signal is further supported by efferents of the OFC that reach structures canonical for model-free learning (e.g., the PPTgN, VTA, and VS).

  18. Altering spatial priority maps via reward-based learning.

    PubMed

    Chelazzi, Leonardo; Eštočinová, Jana; Calletti, Riccardo; Lo Gerfo, Emanuele; Sani, Ilaria; Della Libera, Chiara; Santandrea, Elisa

    2014-06-18

    Spatial priority maps are real-time representations of the behavioral salience of locations in the visual field, resulting from the combined influence of stimulus driven activity and top-down signals related to the current goals of the individual. They arbitrate which of a number of (potential) targets in the visual scene will win the competition for attentional resources. As a result, deployment of visual attention to a specific spatial location is determined by the current peak of activation (corresponding to the highest behavioral salience) across the map. Here we report a behavioral study performed on healthy human volunteers, where we demonstrate that spatial priority maps can be shaped via reward-based learning, reflecting long-lasting alterations (biases) in the behavioral salience of specific spatial locations. These biases exert an especially strong influence on performance under conditions where multiple potential targets compete for selection, conferring competitive advantage to targets presented in spatial locations associated with greater reward during learning relative to targets presented in locations associated with lesser reward. Such acquired biases of spatial attention are persistent, are nonstrategic in nature, and generalize across stimuli and task contexts. These results suggest that reward-based attentional learning can induce plastic changes in spatial priority maps, endowing these representations with the "intelligent" capacity to learn from experience.

  19. Early Years Education: Are Young Students Intrinsically or Extrinsically Motivated Towards School Activities? A Discussion about the Effects of Rewards on Young Children's Learning

    ERIC Educational Resources Information Center

    Theodotou, Evgenia

    2014-01-01

    Rewards can reinforce and at the same time forestall young children's willingness to learn. However, they are broadly used in the field of education, especially in early years settings, to stimulate children towards learning activities. This paper reviews the theoretical and research literature related to intrinsic and extrinsic motivational…

  20. Reinforcement learning for routing in cognitive radio ad hoc networks.

    PubMed

    Al-Rawi, Hasan A A; Yau, Kok-Lim Alvin; Mohamad, Hafizal; Ramli, Nordin; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL to routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs. PMID:25140350
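
    The kind of per-node learning the paper investigates can be illustrated with a small value update plus epsilon-greedy next-hop selection. The class below is a hedged sketch under assumed parameter names and an assumed reward definition, not the paper's exact scheme.

    ```python
    # Hedged sketch of per-node Q-learning for next-hop selection of the kind the
    # paper studies; the reward definition, parameter values, and epsilon-greedy
    # exploration are illustrative assumptions, not the paper's exact scheme.
    import random

    class RLRouter:
        def __init__(self, neighbors, alpha=0.5, epsilon=0.1):
            self.q = {n: 0.0 for n in neighbors}   # learned route quality per next hop
            self.alpha = alpha                     # learning rate
            self.epsilon = epsilon                 # exploration probability

        def select_next_hop(self):
            if random.random() < self.epsilon:               # explore
                return random.choice(list(self.q))
            return max(self.q, key=self.q.get)               # exploit

        def update(self, next_hop, reward):
            """`reward` could combine delivery success, delay, and PU interference."""
            self.q[next_hop] += self.alpha * (reward - self.q[next_hop])

    router = RLRouter(neighbors=["B", "C", "D"])
    hop = router.select_next_hop()
    router.update(hop, reward=1.0)   # e.g. the packet was delivered without harming a PU
    print(hop, router.q)
    ```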

  1. Reinforcement Learning for Routing in Cognitive Radio Ad Hoc Networks

    PubMed Central

    Al-Rawi, Hasan A. A.; Mohamad, Hafizal; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL to routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs. PMID:25140350

  2. Reinforcement learning for routing in cognitive radio ad hoc networks.

    PubMed

    Al-Rawi, Hasan A A; Yau, Kok-Lim Alvin; Mohamad, Hafizal; Ramli, Nordin; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL to routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs.

  3. Spiking neural networks with different reinforcement learning (RL) schemes in a multiagent setting.

    PubMed

    Christodoulou, Chris; Cleanthous, Aristodemos

    2010-12-31

    This paper investigates the effectiveness of spiking agents when trained with reinforcement learning (RL) in a challenging multiagent task. In particular, it explores learning through reward-modulated spike-timing dependent plasticity (STDP) and compares it to reinforcement of stochastic synaptic transmission in the general-sum game of the Iterated Prisoner's Dilemma (IPD). More specifically, a computational model is developed where we implement two spiking neural networks as two "selfish" agents learning simultaneously but independently, competing in the IPD game. The purpose of our system (or collective) is to maximise its accumulated reward in the presence of reward-driven competing agents within the collective. This can only be achieved when the agents engage in a behaviour of mutual cooperation during the IPD. Previously, we successfully applied reinforcement of stochastic synaptic transmission to the IPD game. The current study utilises reward-modulated STDP with eligibility trace and results show that the system managed to exhibit the desired behaviour by establishing mutual cooperation between the agents. It is noted that the cooperative outcome was attained after a relatively short learning period which enhanced the accumulation of reward by the system. As in our previous implementation, the successful application of the learning algorithm to the IPD becomes possible only after we extended it with additional global reinforcement signals in order to enhance competition at the neuronal level. Moreover it is also shown that learning is enhanced (as indicated by an increased IPD cooperative outcome) through: (i) strong memory for each agent (regulated by a high eligibility trace time constant) and (ii) firing irregularity produced by equipping the agents' LIF neurons with a partial somatic reset mechanism.
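
    In reward-modulated STDP with an eligibility trace, the raw STDP update is first written into a decaying trace and only converted into a weight change when a global reward signal arrives. The sketch below illustrates that gating on a single synapse; the time constants and pairing protocol are illustrative assumptions, not the paper's network.

    ```python
    # Minimal single-synapse sketch of reward-modulated STDP with an eligibility
    # trace: each spike pairing is written into a decaying trace and only turned
    # into a weight change when a global reward signal arrives. Time constants and
    # the pairing protocol are illustrative assumptions, not the paper's network.
    import numpy as np

    def stdp_window(delta_t, a_plus=1.0, a_minus=1.0, tau=20.0):
        """Pre-before-post (delta_t >= 0) potentiates; post-before-pre depresses."""
        return a_plus * np.exp(-delta_t / tau) if delta_t >= 0 else -a_minus * np.exp(delta_t / tau)

    def simulate(pairings, rewards, tau_e=500.0, step=1.0, lr=0.01, w=0.5):
        """pairings: per-ms spike-timing difference or None; rewards: per-ms reward."""
        trace = 0.0
        for pair, r in zip(pairings, rewards):
            trace *= np.exp(-step / tau_e)       # the eligibility trace decays
            if pair is not None:
                trace += stdp_window(pair)       # the pairing is stored, not applied
            w += lr * r * trace                  # reward gates the stored plasticity
        return w

    # One pre-before-post pairing at t = 10 ms, reward delivered at t = 200 ms
    pairings = [10.0 if t == 10 else None for t in range(400)]
    rewards = [1.0 if t == 200 else 0.0 for t in range(400)]
    print(simulate(pairings, rewards))
    ```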

  4. Dopamine-dependent reinforcement of motor skill learning: evidence from Gilles de la Tourette syndrome.

    PubMed

    Palminteri, Stefano; Lebreton, Maël; Worbe, Yulia; Hartmann, Andreas; Lehéricy, Stéphane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-08-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only affect choices but also motor skills such as typing. Here, we employed a novel paradigm to demonstrate that monetary rewards can improve motor skill learning in humans. Indeed, healthy participants progressively got faster in executing sequences of key presses that were repeatedly rewarded with 10 euro compared with 1 cent. Control tests revealed that the effect of reinforcement on motor skill learning was independent of subjects being aware of sequence-reward associations. To account for this implicit effect, we developed an actor-critic model, in which reward prediction errors are used by the critic to update state values and by the actor to facilitate action execution. To assess the role of dopamine in such computations, we applied the same paradigm in patients with Gilles de la Tourette syndrome, who were either unmedicated or treated with neuroleptics. We also included patients with focal dystonia, as an example of hyperkinetic motor disorder unrelated to dopamine. Model fit showed the following dissociation: while motor skills were affected in all patient groups, reinforcement learning was selectively enhanced in unmedicated patients with Gilles de la Tourette syndrome and impaired by neuroleptics. These results support the hypothesis that overactive dopamine transmission leads to excessive reinforcement of motor sequences, which might explain the formation of tics in Gilles de la Tourette syndrome.
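
    A minimal sketch of the actor-critic idea described above, in which a single reward prediction error updates the critic's state value and an actor term that facilitates execution of the rewarded key-press sequence; the names and constants are illustrative, not the authors' fitted model.

      # Actor-critic sketch: one prediction error updates the critic's state value
      # and an actor "vigour" term that facilitates the rewarded motor sequence.
      ALPHA_CRITIC, ALPHA_ACTOR = 0.1, 0.05

      def trial(value, vigour, reward):
          delta = reward - value              # reward prediction error
          value += ALPHA_CRITIC * delta       # critic: expected reward of the state
          vigour += ALPHA_ACTOR * delta       # actor: execution strength of the sequence
          return value, vigour, delta

      v, g = 0.0, 0.0
      for r in [10.0, 10.0, 0.01, 10.0]:      # 10-euro versus 1-cent rewards
          v, g, _ = trial(v, g, r)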

  5. Dopamine-dependent reinforcement of motor skill learning: evidence from Gilles de la Tourette syndrome.

    PubMed

    Palminteri, Stefano; Lebreton, Maël; Worbe, Yulia; Hartmann, Andreas; Lehéricy, Stéphane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-08-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only affect choices but also motor skills such as typing. Here, we employed a novel paradigm to demonstrate that monetary rewards can improve motor skill learning in humans. Indeed, healthy participants progressively got faster in executing sequences of key presses that were repeatedly rewarded with 10 euro compared with 1 cent. Control tests revealed that the effect of reinforcement on motor skill learning was independent of subjects being aware of sequence-reward associations. To account for this implicit effect, we developed an actor-critic model, in which reward prediction errors are used by the critic to update state values and by the actor to facilitate action execution. To assess the role of dopamine in such computations, we applied the same paradigm in patients with Gilles de la Tourette syndrome, who were either unmedicated or treated with neuroleptics. We also included patients with focal dystonia, as an example of hyperkinetic motor disorder unrelated to dopamine. Model fit showed the following dissociation: while motor skills were affected in all patient groups, reinforcement learning was selectively enhanced in unmedicated patients with Gilles de la Tourette syndrome and impaired by neuroleptics. These results support the hypothesis that overactive dopamine transmission leads to excessive reinforcement of motor sequences, which might explain the formation of tics in Gilles de la Tourette syndrome. PMID:21727098

  6. Racial bias shapes social reinforcement learning.

    PubMed

    Lindström, Björn; Selbing, Ida; Molapour, Tanaz; Olsson, Andreas

    2014-03-01

    Both emotional facial expressions and markers of racial-group belonging are ubiquitous signals in social interaction, but little is known about how these signals together affect future behavior through learning. To address this issue, we investigated how emotional (threatening or friendly) in-group and out-group faces reinforced behavior in a reinforcement-learning task. We asked whether reinforcement learning would be modulated by intergroup attitudes (i.e., racial bias). The results showed that individual differences in racial bias critically modulated reinforcement learning. As predicted, racial bias was associated with more efficiently learned avoidance of threatening out-group individuals. We used computational modeling analysis to quantitatively delimit the underlying processes affected by social reinforcement. These analyses showed that racial bias modulates the rate at which exposure to threatening out-group individuals is transformed into future avoidance behavior. In concert, these results shed new light on the learning processes underlying social interaction with racial-in-group and out-group individuals.

  7. Adaptive Educational Software by Applying Reinforcement Learning

    ERIC Educational Resources Information Center

    Bennane, Abdellah

    2013-01-01

    The introduction of intelligence into teaching software is the object of this paper. In the software elaboration process, one uses learning techniques in order to adapt the teaching software to the characteristics of the student. Generally, one uses artificial intelligence techniques such as reinforcement learning and Bayesian networks in order to adapt…

  8. Learning to trade via direct reinforcement.

    PubMed

    Moody, J; Saffell, M

    2001-01-01

    We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision-making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.
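
    A rough sketch of a direct-reinforcement (RRL-style) trader is shown below: a recurrent decision function maps recent returns and the previous position to a new position, and its parameters are adjusted by gradient ascent on the cost-adjusted trading return. For brevity the gradient is taken numerically and cumulative return stands in for the differential Sharpe ratio; all of this is an illustrative assumption, not the paper's implementation.

      import numpy as np

      # Direct-reinforcement trading sketch: a recurrent decision function maps the
      # last few returns and the previous position to a new position in [-1, 1];
      # parameters are tuned by ascent on the cost-adjusted trading return.
      def positions(theta, price_returns):
          w, u, b = theta[:-2], theta[-2], theta[-1]
          f_prev, out = 0.0, []
          for t in range(len(w), len(price_returns)):
              x = price_returns[t - len(w):t]
              f_prev = np.tanh(np.dot(w, x) + u * f_prev + b)
              out.append(f_prev)
          return np.array(out)

      def trading_return(theta, price_returns, cost=0.001):
          f = positions(theta, price_returns)
          prev = np.concatenate(([0.0], f[:-1]))
          r = price_returns[len(theta) - 2:]
          return np.sum(prev * r - cost * np.abs(f - prev))

      rng = np.random.default_rng(0)
      rets = rng.normal(0.0, 0.01, 500)
      theta = rng.normal(0.0, 0.1, 7)          # 5 return lags + recurrent weight + bias
      for _ in range(50):                      # crude numerical gradient ascent
          grad = np.array([(trading_return(theta + 1e-4 * e, rets)
                            - trading_return(theta - 1e-4 * e, rets)) / 2e-4
                           for e in np.eye(len(theta))])
          theta += 0.1 * grad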

  9. Go and no-go learning in reward and punishment: Interactions between affect and effect

    PubMed Central

    Guitart-Masip, Marc; Huys, Quentin J.M.; Fuentemilla, Lluis; Dayan, Peter; Duzel, Emrah; Dolan, Raymond J.

    2012-01-01

    Decision-making invokes two fundamental axes of control: affect or valence, spanning reward and punishment, and effect or action, spanning invigoration and inhibition. We studied the acquisition of instrumental responding in healthy human volunteers in a task in which we orthogonalized action requirements and outcome valence. Subjects were much more successful in learning active choices in rewarded conditions, and passive choices in punished conditions. Using computational reinforcement-learning models, we teased apart contributions from putatively instrumental and Pavlovian components in the generation of the observed asymmetry during learning. Moreover, using model-based fMRI, we showed that BOLD signals in striatum and substantia nigra/ventral tegmental area (SN/VTA) correlated with instrumentally learnt action values, but with opposite signs for go and no-go choices. Finally, we showed that successful instrumental learning depends on engagement of bilateral inferior frontal gyrus. Our behavioral and computational data showed that instrumental learning is contingent on overcoming inherent and plastic Pavlovian biases, while our neuronal data showed this learning is linked to unique patterns of brain activity in regions implicated in action and inhibition respectively. PMID:22548809
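
    A minimal sketch of a reinforcement-learning model with a Pavlovian bias of the kind used above: the tendency to emit a "go" response is boosted by the learned stimulus value, so reward-predictive cues favour action and punishment-predictive cues favour inhibition. Parameter names and values are illustrative assumptions.

      import numpy as np

      # Instrumental learner with a Pavlovian "go" bias.
      ALPHA, BETA, PI = 0.2, 3.0, 0.5   # learning rate, inverse temperature, Pavlovian weight

      def p_go(q_go, q_nogo, v):
          # The "go" action weight is boosted by the Pavlovian stimulus value v.
          w_go, w_nogo = q_go + PI * v, q_nogo
          return 1.0 / (1.0 + np.exp(-BETA * (w_go - w_nogo)))

      def update(q, v, action, reward):
          # q: dict with 'go'/'nogo' values; v: Pavlovian stimulus value.
          q[action] += ALPHA * (reward - q[action])
          v += ALPHA * (reward - v)
          return q, v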

  10. Common Neural Mechanisms Underlying Reversal Learning by Reward and Punishment

    PubMed Central

    Xue, Gui; Xue, Feng; Droutman, Vita; Lu, Zhong-Lin; Bechara, Antoine; Read, Stephen

    2013-01-01

    Impairments in flexible goal-directed decisions, often examined by reversal learning, are associated with behavioral abnormalities characterized by impulsiveness and disinhibition. Although the lateral orbital frontal cortex (OFC) has been consistently implicated in reversal learning, it is still unclear whether this region is involved in negative feedback processing, behavioral control, or both, and whether reward and punishment might have different effects on lateral OFC involvement. Using a relatively large sample (N = 47), and a categorical learning task with either monetary reward or moderate electric shock as feedback, we found overlapping activations in the right lateral OFC (and adjacent insula) for reward and punishment reversal learning when comparing correct reversal trials with correct acquisition trials, whereas we found overlapping activations in the right dorsolateral prefrontal cortex (DLPFC) when negative feedback signaled contingency change. The right lateral OFC and DLPFC also showed greater sensitivity to punishment than did their left homologues, indicating an asymmetry in how punishment is processed. We propose that the right lateral OFC and anterior insula are important for transforming affective feedback to behavioral adjustment, whereas the right DLPFC is involved in higher level attention control. These results provide insight into the neural mechanisms of reversal learning and behavioral flexibility, which can be leveraged to understand risky behaviors among vulnerable populations. PMID:24349211

  11. DAT isn’t all that: cocaine reward and reinforcement require Toll-Like Receptor 4 signaling

    PubMed Central

    Northcutt, A.L.; Hutchinson, M.R.; Wang, X.; Baratta, M.V.; Hiranita, T.; Cochran, T.A.; Pomrenze, M.B.; Galer, E.L.; Kopajtic, T.A.; Li, C.M.; Amat, J.; Larson, G.; Cooper, D.C.; Huang, Y.; O’Neill, C.E.; Yin, H.; Zahniser, N.R.; Katz, J.L.; Rice, K.C.; Maier, S.F.; Bachtell, R.K.; Watkins, L.R.

    2014-01-01

    The initial reinforcing properties of drugs of abuse, such as cocaine, are largely attributed to their ability to activate the mesolimbic dopamine system. Resulting increases in extracellular dopamine in the nucleus accumbens (NAc) are traditionally thought to result from cocaine’s ability to block dopamine transporters (DATs). Here we demonstrate that cocaine also interacts with the immunosurveillance receptor complex, Toll-Like Receptor 4 (TLR4), on microglial cells to initiate central innate immune signaling. Disruption of cocaine signaling at TLR4 suppresses cocaine-induced extracellular dopamine in the NAc, as well as cocaine conditioned place preference and cocaine self-administration. These results provide a novel understanding of the neurobiological mechanisms underlying cocaine reward/reinforcement that includes a critical role for central immune signaling, and offer a new target for medication development for cocaine abuse treatment. PMID:25644383

  12. Context transfer in reinforcement learning using action-value functions.

    PubMed

    Mousavi, Amin; Nadjar Araabi, Babak; Nili Ahmadabadi, Majid

    2014-01-01

    This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different states or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents' MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task. PMID:25610457
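
    A minimal sketch of the transfer step described above: Q-values from source tasks are pushed through a partial state-action mapping, summarised as an interval, merged, and used to initialise the target task's Q-table. The data structures and the merge-by-averaging rule are illustrative assumptions, not the paper's exact interval-based method.

      # Context transfer sketch: map source Q-values into the target task's
      # state-action space and merge them into an initial target Q-table.
      def transfer_q(source_qs, mappings):
          # source_qs: list of dicts {(s, a): value}, one per source task.
          # mappings:  list of dicts {(s_t, a_t): [(s_s, a_s), ...]} (partial maps).
          target_q = {}
          for q_src, mapping in zip(source_qs, mappings):
              for target_sa, source_pairs in mapping.items():
                  vals = [q_src[sa] for sa in source_pairs if sa in q_src]
                  if vals:
                      lo, hi = min(vals), max(vals)     # interval of mapped values
                      target_q.setdefault(target_sa, []).append((lo + hi) / 2.0)
          # Merge the contributions of the different source tasks by averaging.
          return {sa: sum(v) / len(v) for sa, v in target_q.items()}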

  13. Dopamine-related deficit in reward learning after catecholamine depletion in unmedicated, remitted subjects with bulimia nervosa.

    PubMed

    Grob, Simona; Pizzagalli, Diego A; Dutra, Sunny J; Stern, Jair; Mörgeli, Hanspeter; Milos, Gabriella; Schnyder, Ulrich; Hasler, Gregor

    2012-07-01

    Disturbances in reward processing have been implicated in bulimia nervosa (BN). Abnormalities in processing reward-related stimuli might be linked to dysfunctions of the catecholaminergic neurotransmitter system, but findings have been inconclusive. A powerful way to investigate the relationship between catecholaminergic function and behavior is to examine behavioral changes in response to experimental catecholamine depletion (CD). The purpose of this study was to uncover putative catecholaminergic dysfunction in remitted subjects with BN who performed a reinforcement-learning task after CD. CD was achieved by oral alpha-methyl-para-tyrosine (AMPT) in 19 unmedicated female subjects with remitted BN (rBN) and 28 demographically matched healthy female controls (HC). Sham depletion administered identical capsules containing diphenhydramine. The study design consisted of a randomized, double-blind, placebo-controlled crossover, single-site experimental trial. The main outcome measures were reward learning in a probabilistic reward task analyzed using signal-detection theory. Secondary outcome measures included self-report assessments, including the Eating Disorder Examination-Questionnaire. Relative to healthy controls, rBN subjects were characterized by blunted reward learning in the AMPT--but not in placebo--condition. Highlighting the specificity of these findings, groups did not differ in their ability to perceptually distinguish between stimuli. Increased CD-induced anhedonic (but not eating disorder) symptoms were associated with a reduced response bias toward a more frequently rewarded stimulus. In conclusion, under CD, rBN subjects showed reduced reward learning compared with healthy control subjects. These deficits uncover disturbance of the central reward processing systems in rBN related to altered brain catecholamine levels, which might reflect a trait-like deficit increasing vulnerability to BN.
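
    Reward learning in the probabilistic reward task is usually quantified with signal-detection measures; a sketch of the conventional response-bias and discriminability computation is given below (the 0.5 correction and the cell names are the standard choices, shown for illustration rather than taken from the paper itself).

      import math

      # Response bias toward the more frequently rewarded ("rich") stimulus and
      # discriminability; 0.5 is added to each cell to avoid log(0).
      def prt_measures(rich_hit, rich_miss, lean_hit, lean_miss):
          rich_hit, rich_miss, lean_hit, lean_miss = (x + 0.5 for x in
                                                      (rich_hit, rich_miss, lean_hit, lean_miss))
          log_b = 0.5 * math.log((rich_hit * lean_miss) / (rich_miss * lean_hit))
          log_d = 0.5 * math.log((rich_hit * lean_hit) / (rich_miss * lean_miss))
          return log_b, log_d

      # Hypothetical counts: 80/20 correct on the rich stimulus, 60/40 on the lean one.
      bias, discrim = prt_measures(rich_hit=80, rich_miss=20, lean_hit=60, lean_miss=40)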

  14. Dopamine-related deficit in reward learning after catecholamine depletion in unmedicated, remitted subjects with bulimia nervosa.

    PubMed

    Grob, Simona; Pizzagalli, Diego A; Dutra, Sunny J; Stern, Jair; Mörgeli, Hanspeter; Milos, Gabriella; Schnyder, Ulrich; Hasler, Gregor

    2012-07-01

    Disturbances in reward processing have been implicated in bulimia nervosa (BN). Abnormalities in processing reward-related stimuli might be linked to dysfunctions of the catecholaminergic neurotransmitter system, but findings have been inconclusive. A powerful way to investigate the relationship between catecholaminergic function and behavior is to examine behavioral changes in response to experimental catecholamine depletion (CD). The purpose of this study was to uncover putative catecholaminergic dysfunction in remitted subjects with BN who performed a reinforcement-learning task after CD. CD was achieved by oral alpha-methyl-para-tyrosine (AMPT) in 19 unmedicated female subjects with remitted BN (rBN) and 28 demographically matched healthy female controls (HC). Sham depletion administered identical capsules containing diphenhydramine. The study design consisted of a randomized, double-blind, placebo-controlled crossover, single-site experimental trial. The main outcome measures were reward learning in a probabilistic reward task analyzed using signal-detection theory. Secondary outcome measures included self-report assessments, including the Eating Disorder Examination-Questionnaire. Relative to healthy controls, rBN subjects were characterized by blunted reward learning in the AMPT--but not in placebo--condition. Highlighting the specificity of these findings, groups did not differ in their ability to perceptually distinguish between stimuli. Increased CD-induced anhedonic (but not eating disorder) symptoms were associated with a reduced response bias toward a more frequently rewarded stimulus. In conclusion, under CD, rBN subjects showed reduced reward learning compared with healthy control subjects. These deficits uncover disturbance of the central reward processing systems in rBN related to altered brain catecholamine levels, which might reflect a trait-like deficit increasing vulnerability to BN. PMID:22491353

  15. Dopamine-Related Deficit in Reward Learning After Catecholamine Depletion in Unmedicated, Remitted Subjects with Bulimia Nervosa

    PubMed Central

    Grob, Simona; Pizzagalli, Diego A; Dutra, Sunny J; Stern, Jair; Mörgeli, Hanspeter; Milos, Gabriella; Schnyder, Ulrich; Hasler, Gregor

    2012-01-01

    Disturbances in reward processing have been implicated in bulimia nervosa (BN). Abnormalities in processing reward-related stimuli might be linked to dysfunctions of the catecholaminergic neurotransmitter system, but findings have been inconclusive. A powerful way to investigate the relationship between catecholaminergic function and behavior is to examine behavioral changes in response to experimental catecholamine depletion (CD). The purpose of this study was to uncover putative catecholaminergic dysfunction in remitted subjects with BN who performed a reinforcement-learning task after CD. CD was achieved by oral alpha-methyl-para-tyrosine (AMPT) in 19 unmedicated female subjects with remitted BN (rBN) and 28 demographically matched healthy female controls (HC). Sham depletion administered identical capsules containing diphenhydramine. The study design consisted of a randomized, double-blind, placebo-controlled crossover, single-site experimental trial. The main outcome measures were reward learning in a probabilistic reward task analyzed using signal-detection theory. Secondary outcome measures included self-report assessments, including the Eating Disorder Examination-Questionnaire. Relative to healthy controls, rBN subjects were characterized by blunted reward learning in the AMPT—but not in placebo—condition. Highlighting the specificity of these findings, groups did not differ in their ability to perceptually distinguish between stimuli. Increased CD-induced anhedonic (but not eating disorder) symptoms were associated with a reduced response bias toward a more frequently rewarded stimulus. In conclusion, under CD, rBN subjects showed reduced reward learning compared with healthy control subjects. These deficits uncover disturbance of the central reward processing systems in rBN related to altered brain catecholamine levels, which might reflect a trait-like deficit increasing vulnerability to BN. PMID:22491353

  16. Partial reinforcement in human biofeedback learning.

    PubMed

    Morley, S

    1979-09-01

    This paper reviews the evidence for the efficacy of partial reinforcement in producing resistance to extinction in human biofeedback experiments. The methodological criteria necessary to demonstrate such effects are discussed, as is the status of the analogy of reinforcement and information feedback. It is suggested that the problem of maintaining responding in the absence of feedback should be tackled empirically rather than assuming the validity of findings from other areas of learning theory.

  17. Reinforcement learning in depression: A review of computational research.

    PubMed

    Chen, Chong; Takahashi, Taiki; Nakagawa, Shin; Inoue, Takeshi; Kusumi, Ichiro

    2015-08-01

    Despite being considered primarily a mood disorder, major depressive disorder (MDD) is characterized by cognitive and decision-making deficits. Recent research has employed computational models of reinforcement learning (RL) to address these deficits. The computational approach has the advantage of making explicit predictions about learning and behavior, specifying the process parameters of RL, differentiating between model-free and model-based RL, and enabling computational model-based functional magnetic resonance imaging and electroencephalography. With these merits, there has been an emerging field of computational psychiatry, and here we review specific studies that focused on MDD. Considerable evidence suggests that MDD is associated with impaired brain signals of reward prediction error and expected value ('wanting'), decreased reward sensitivity ('liking') and/or learning (be it model-free or model-based), etc., although the causality remains unclear. These parameters may serve as valuable intermediate phenotypes of MDD, linking general clinical symptoms to underlying molecular dysfunctions. We believe future computational research at clinical, systems, and cellular/molecular/genetic levels will propel us toward a better understanding of the disease. PMID:25979140

  18. Reinforcement learning in depression: A review of computational research.

    PubMed

    Chen, Chong; Takahashi, Taiki; Nakagawa, Shin; Inoue, Takeshi; Kusumi, Ichiro

    2015-08-01

    Despite being considered primarily a mood disorder, major depressive disorder (MDD) is characterized by cognitive and decision-making deficits. Recent research has employed computational models of reinforcement learning (RL) to address these deficits. The computational approach has the advantage of making explicit predictions about learning and behavior, specifying the process parameters of RL, differentiating between model-free and model-based RL, and enabling computational model-based functional magnetic resonance imaging and electroencephalography. With these merits, there has been an emerging field of computational psychiatry, and here we review specific studies that focused on MDD. Considerable evidence suggests that MDD is associated with impaired brain signals of reward prediction error and expected value ('wanting'), decreased reward sensitivity ('liking') and/or learning (be it model-free or model-based), etc., although the causality remains unclear. These parameters may serve as valuable intermediate phenotypes of MDD, linking general clinical symptoms to underlying molecular dysfunctions. We believe future computational research at clinical, systems, and cellular/molecular/genetic levels will propel us toward a better understanding of the disease.

  19. Evolution with reinforcement learning in negotiation.

    PubMed

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and makes more robust predictions in the long term than in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of an evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions when the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stability than their counterparts using the classic evolutionary algorithm.

  20. Evolution with Reinforcement Learning in Negotiation

    PubMed Central

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and makes more robust predictions in the long term than in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of an evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions when the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stability than their counterparts using the classic evolutionary algorithm. PMID:25048108

  1. Reinforcement Learning with Bounded Information Loss

    NASA Astrophysics Data System (ADS)

    Peters, Jan; Mülling, Katharina; Seldin, Yevgeny; Altun, Yasemin

    2011-03-01

    Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information. Hence, it has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant or natural policy gradients, many of these problems may be addressed by constraining the information loss. In this paper, we continue this path of reasoning and suggest two reinforcement learning methods, i.e., a model-based and a model-free algorithm, that bound the loss in relative entropy while maximizing the return. The resulting methods differ significantly from previous policy gradient approaches and yield an exact update step. They work well on typical reinforcement learning benchmark problems as well as on novel evaluations in robotics. We also show a Bayesian-bound motivation for this new approach [8].
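
    A toy illustration of bounding the information loss of a policy update is given below: sampled actions are reweighted by exponentiated advantages, with the temperature chosen so the relative entropy from the previous (uniform) weighting stays under a bound. This is only a sketch in the spirit of relative-entropy policy search, not the authors' model-based or model-free algorithms.

      import numpy as np

      # Sample reweighting with a bound on information loss.
      EPS = 0.1   # maximum allowed relative entropy between successive policies

      def bounded_weights(advantages, etas=np.logspace(-2, 2, 200)):
          n = len(advantages)
          best = np.full(n, 1.0 / n)
          for eta in sorted(etas, reverse=True):       # large eta = small policy change
              w = np.exp((advantages - advantages.max()) / eta)
              w /= w.sum()
              kl = np.sum(w * np.log(w * n + 1e-12))   # KL against the uniform old weights
              if kl > EPS:
                  break
              best = w
          return best

      rng = np.random.default_rng(1)
      weights = bounded_weights(rng.normal(size=20))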

  2. Effort-Reward Imbalance for Learning Is Associated with Fatigue in School Children

    ERIC Educational Resources Information Center

    Fukuda, Sanae; Yamano, Emi; Joudoi, Takako; Mizuno, Kei; Tanaka, Masaaki; Kawatani, Junko; Takano, Miyuki; Tomoda, Akemi; Imai-Matsumura, Kyoko; Miike, Teruhisa; Watanabe, Yasuyoshi

    2010-01-01

    We examined relationships among fatigue, sleep quality, and effort-reward imbalance for learning in school children. We developed an effort-reward for learning scale for school students and examined its reliability and validity. Self-administered surveys, including the effort-reward for learning scale and a fatigue scale, were completed by 1,023…

  3. Short-term memory traces for action bias in human reinforcement learning.

    PubMed

    Bogacz, Rafal; McClure, Samuel M; Li, Jian; Cohen, Jonathan D; Montague, P Read

    2007-06-11

    Recent experimental and theoretical work on reinforcement learning has shed light on the neural bases of learning from rewards and punishments. One fundamental problem in reinforcement learning is the credit assignment problem, or how to properly assign credit to actions that lead to reward or punishment following a delay. Temporal difference learning solves this problem, but its efficiency can be significantly improved by the addition of eligibility traces (ET). In essence, ETs function as decaying memories of previous choices that are used to scale synaptic weight changes. It has been shown in theoretical studies that ETs spanning a number of actions may improve the performance of reinforcement learning. However, it remains an open question whether including ETs that persist over sequences of actions allows reinforcement learning models to better fit empirical data regarding the behaviors of humans and other animals. Here, we report an experiment in which human subjects performed a sequential economic decision game in which the long-term optimal strategy differed from the strategy that leads to the greatest short-term return. We demonstrate that human subjects' performance in the task is significantly affected by the time between choices in a surprising and seemingly counterintuitive way. However, this behavior is naturally explained by a temporal difference learning model which includes ETs persisting across actions. Furthermore, we review recent findings that suggest that short-term synaptic plasticity in dopamine neurons may provide a realistic biophysical mechanism for producing ETs that persist on a timescale consistent with behavioral observations.
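
    A minimal sketch of temporal difference learning with eligibility traces, the mechanism discussed above: the prediction error at the time of reward also updates states visited several steps earlier, in proportion to their decaying traces. The constants are illustrative.

      import numpy as np

      # TD(lambda) value learning with eligibility traces.
      ALPHA, GAMMA, LAM = 0.1, 0.95, 0.8

      def run_episode(values, state_sequence, rewards):
          elig = np.zeros_like(values)
          for s, s_next, r in zip(state_sequence[:-1], state_sequence[1:], rewards):
              delta = r + GAMMA * values[s_next] - values[s]
              elig *= GAMMA * LAM          # decay all traces
              elig[s] += 1.0               # mark the state just visited
              values = values + ALPHA * delta * elig
          return values

      v = run_episode(np.zeros(4), state_sequence=[0, 1, 2, 3], rewards=[0.0, 0.0, 1.0])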

  4. Adding prediction risk to the theory of reward learning.

    PubMed

    Preuschoff, Kerstin; Bossaerts, Peter

    2007-05-01

    This article analyzes the simple Rescorla-Wagner learning rule from the vantage point of least squares learning theory. In particular, it suggests how measures of risk, such as prediction risk, can be used to adjust the learning constant in reinforcement learning. It argues that prediction risk is most effectively incorporated by scaling the prediction errors. This way, the learning rate needs adjusting only when the covariance between optimal predictions and past (scaled) prediction errors changes. Evidence is discussed that suggests that the dopaminergic system in the (human and nonhuman) primate brain encodes prediction risk, and that prediction errors are indeed scaled with prediction risk (adaptive encoding).
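
    A minimal sketch of the idea described above: a Rescorla-Wagner update in which the prediction error is scaled by a running estimate of prediction risk, so the effective learning rate adapts to reward noise. The specific risk estimator and constants are illustrative assumptions, not the article's formulation.

      # Rescorla-Wagner value update with the prediction error scaled by an
      # on-line estimate of prediction risk (expected unsigned error). The cap
      # keeps the effective learning rate at or below 1.
      KAPPA = 0.3        # learning constant applied to the risk-scaled error
      RISK_LR = 0.1      # learning rate for the risk estimate itself

      def rw_risk_step(value, risk, reward):
          delta = reward - value                              # prediction error
          rate = min(KAPPA / max(risk, 1e-6), 1.0)            # risk-scaled, capped
          value += rate * delta
          risk += RISK_LR * (abs(delta) - risk)               # track typical error size
          return value, risk

      v, s = 0.0, 1.0
      for r in [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]:
          v, s = rw_risk_step(v, s, r)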

  5. Learning about multiple attributes of reward in Pavlovian conditioning.

    PubMed

    Delamater, Andrew R; Oakeshott, Stephen

    2007-05-01

    The nature of the reward representation in Pavlovian conditioning has been of perennial interest to students of associative learning theory. We consider the view that it consists of a range of different attributes, each of which may be governed by different learning rules. We investigated this issue through a series of experiments using a time-sensitive Pavlovian-to-instrumental transfer procedure, aiming to dissociate learning about temporal and specific sensory features of a reward. Our results successfully demonstrated that learning about these different features appears to be dissociable, with learning about the specific sensory features of a Pavlovian unconditioned stimulus (US) occurring very rapidly across a wide range of experimental procedures, while learning about the temporal features of the US occurred slightly less quickly and was more sensitive to parametric disruption. These results are discussed with regard to the potential independence or interdependence of the relevant learning processes, and to some recent neurophysiological recording and brain lesion work, which provide additional means to investigate these dissociations.

  6. Reinforcement learning deficits in people with schizophrenia persist after extended trials.

    PubMed

    Cicero, David C; Martin, Elizabeth A; Becker, Theresa M; Kerns, John G

    2014-12-30

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and are then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive feedback and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning. PMID:25172610

  7. Learning to Obtain Reward, but Not Avoid Punishment, Is Affected by Presence of PTSD Symptoms in Male Veterans: Empirical Data and Computational Model

    PubMed Central

    Myers, Catherine E.; Moustafa, Ahmed A.; Sheynin, Jony; VanMeenen, Kirsten M.; Gilbertson, Mark W.; Orr, Scott P.; Beck, Kevin D.; Pang, Kevin C. H.; Servatius, Richard J.

    2013-01-01

    Post-traumatic stress disorder (PTSD) symptoms include behavioral avoidance which is acquired and tends to increase with time. This avoidance may represent a general learning bias; indeed, individuals with PTSD are often faster than controls on acquiring conditioned responses based on physiologically-aversive feedback. However, it is not clear whether this learning bias extends to cognitive feedback, or to learning from both reward and punishment. Here, male veterans with self-reported current, severe PTSD symptoms (PTSS group) or with few or no PTSD symptoms (control group) completed a probabilistic classification task that included both reward-based and punishment-based trials, where feedback could take the form of reward, punishment, or an ambiguous “no-feedback” outcome that could signal either successful avoidance of punishment or failure to obtain reward. The PTSS group outperformed the control group in total points obtained; the PTSS group specifically performed better than the control group on reward-based trials, with no difference on punishment-based trials. To better understand possible mechanisms underlying observed performance, we used a reinforcement learning model of the task, and applied maximum likelihood estimation techniques to derive estimated parameters describing individual participants’ behavior. Estimations of the reinforcement value of the no-feedback outcome were significantly greater in the control group than the PTSS group, suggesting that the control group was more likely to value this outcome as positively reinforcing (i.e., signaling successful avoidance of punishment). This is consistent with the control group’s generally poorer performance on reward trials, where reward feedback was to be obtained in preference to the no-feedback outcome. Differences in the interpretation of ambiguous feedback may contribute to the facilitated reinforcement learning often observed in PTSD patients, and may in turn provide new insight into

  8. Learning to obtain reward, but not avoid punishment, is affected by presence of PTSD symptoms in male veterans: empirical data and computational model.

    PubMed

    Myers, Catherine E; Moustafa, Ahmed A; Sheynin, Jony; Vanmeenen, Kirsten M; Gilbertson, Mark W; Orr, Scott P; Beck, Kevin D; Pang, Kevin C H; Servatius, Richard J

    2013-01-01

    Post-traumatic stress disorder (PTSD) symptoms include behavioral avoidance which is acquired and tends to increase with time. This avoidance may represent a general learning bias; indeed, individuals with PTSD are often faster than controls on acquiring conditioned responses based on physiologically-aversive feedback. However, it is not clear whether this learning bias extends to cognitive feedback, or to learning from both reward and punishment. Here, male veterans with self-reported current, severe PTSD symptoms (PTSS group) or with few or no PTSD symptoms (control group) completed a probabilistic classification task that included both reward-based and punishment-based trials, where feedback could take the form of reward, punishment, or an ambiguous "no-feedback" outcome that could signal either successful avoidance of punishment or failure to obtain reward. The PTSS group outperformed the control group in total points obtained; the PTSS group specifically performed better than the control group on reward-based trials, with no difference on punishment-based trials. To better understand possible mechanisms underlying observed performance, we used a reinforcement learning model of the task, and applied maximum likelihood estimation techniques to derive estimated parameters describing individual participants' behavior. Estimations of the reinforcement value of the no-feedback outcome were significantly greater in the control group than the PTSS group, suggesting that the control group was more likely to value this outcome as positively reinforcing (i.e., signaling successful avoidance of punishment). This is consistent with the control group's generally poorer performance on reward trials, where reward feedback was to be obtained in preference to the no-feedback outcome. Differences in the interpretation of ambiguous feedback may contribute to the facilitated reinforcement learning often observed in PTSD patients, and may in turn provide new insight into how
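
    A sketch of the kind of model fitting described above is given below: a simple Q-learner in which the ambiguous no-feedback outcome carries its own reinforcement value, estimated together with the learning rate and inverse temperature by maximum likelihood. The model structure, data format, and bounds are illustrative assumptions, not the authors' exact model, and the example data are hypothetical.

      import numpy as np
      from scipy.optimize import minimize

      # Q-learner with a fitted reinforcement value for the "no-feedback" outcome.
      def neg_log_likelihood(params, choices, outcomes):
          alpha, beta, v_none = params
          payoff = {'reward': 1.0, 'punish': -1.0, 'none': v_none}
          q, nll = np.zeros(2), 0.0
          for c, o in zip(choices, outcomes):
              p = np.exp(beta * q) / np.exp(beta * q).sum()   # softmax choice rule
              nll -= np.log(p[c] + 1e-12)
              q[c] += alpha * (payoff[o] - q[c])              # update the chosen option
          return nll

      choices = [0, 1, 0, 0, 1, 0]                             # hypothetical data
      outcomes = ['reward', 'none', 'reward', 'none', 'punish', 'reward']
      fit = minimize(neg_log_likelihood, x0=[0.3, 2.0, 0.0], args=(choices, outcomes),
                     bounds=[(0.01, 1.0), (0.1, 10.0), (-1.0, 1.0)])
      alpha_hat, beta_hat, v_none_hat = fit.x   # v_none_hat: value of the no-feedback outcome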

  9. Optimal control in microgrid using multi-agent reinforcement learning.

    PubMed

    Li, Fu-Dong; Wu, Min; He, Yong; Chen, Xin

    2012-11-01

    This paper presents an improved reinforcement learning method to minimize electricity costs on the premise of satisfying the power balance and generation limit of units in a microgrid with grid-connected mode. Firstly, the microgrid control requirements are analyzed and the objective function of optimal control for microgrid is proposed. Then, a state variable "Average Electricity Price Trend" which is used to express the most possible transitions of the system is developed so as to reduce the complexity and randomicity of the microgrid, and a multi-agent architecture including agents, state variables, action variables and reward function is formulated. Furthermore, dynamic hierarchical reinforcement learning, based on change rate of key state variable, is established to carry out optimal policy exploration. The analysis shows that the proposed method is beneficial to handle the problem of "curse of dimensionality" and speed up learning in the unknown large-scale world. Finally, the simulation results under JADE (Java Agent Development Framework) demonstrate the validity of the presented method in optimal control for a microgrid with grid-connected mode.
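
    A minimal sketch of a Q-learning agent for the kind of dispatch problem described above: the state carries a price-trend indicator, the action adjusts a unit's output, and the reward is the negative electricity cost. All names, actions, and constants are illustrative assumptions, not the paper's multi-agent hierarchical design.

      import random

      # Q-learning sketch for grid-connected microgrid dispatch.
      ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
      ACTIONS = (-1.0, 0.0, 1.0)           # decrease / hold / increase unit output (kW)

      def choose_action(q, state):
          if random.random() < EPSILON:
              return random.choice(ACTIONS)
          return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

      def learn(q, state, action, cost, next_state):
          reward = -cost                   # minimising cost equals maximising reward
          best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
          old = q.get((state, action), 0.0)
          q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)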

  10. Optimal control in microgrid using multi-agent reinforcement learning.

    PubMed

    Li, Fu-Dong; Wu, Min; He, Yong; Chen, Xin

    2012-11-01

    This paper presents an improved reinforcement learning method to minimize electricity costs on the premise of satisfying the power balance and generation limit of units in a microgrid with grid-connected mode. Firstly, the microgrid control requirements are analyzed and the objective function of optimal control for microgrid is proposed. Then, a state variable "Average Electricity Price Trend" which is used to express the most possible transitions of the system is developed so as to reduce the complexity and randomicity of the microgrid, and a multi-agent architecture including agents, state variables, action variables and reward function is formulated. Furthermore, dynamic hierarchical reinforcement learning, based on change rate of key state variable, is established to carry out optimal policy exploration. The analysis shows that the proposed method is beneficial to handle the problem of "curse of dimensionality" and speed up learning in the unknown large-scale world. Finally, the simulation results under JADE (Java Agent Development Framework) demonstrate the validity of the presented method in optimal control for a microgrid with grid-connected mode. PMID:22824135

  11. Human dorsal striatal activity during choice discriminates reinforcement learning behavior from the gambler's fallacy.

    PubMed

    Jessup, Ryan K; O'Doherty, John P

    2011-04-27

    Reinforcement learning theory has generated substantial interest in neurobiology, particularly because of the resemblance between phasic dopamine and reward prediction errors. Actor-critic theories have been adapted to account for the functions of the striatum, with parts of the dorsal striatum equated to the actor. Here, we specifically test whether the human dorsal striatum--as predicted by an actor-critic instantiation--is used on a trial-to-trial basis at the time of choice to choose in accordance with reinforcement learning theory, as opposed to a competing strategy: the gambler's fallacy. Using a partial-brain functional magnetic resonance imaging scanning protocol focused on the striatum and other ventral brain areas, we found that the dorsal striatum is more active when choosing consistent with reinforcement learning compared with the competing strategy. Moreover, an overlapping area of dorsal striatum along with the ventral striatum was found to be correlated with reward prediction errors at the time of outcome, as predicted by the actor-critic framework. These findings suggest that the same region of dorsal striatum involved in learning stimulus-response associations may contribute to the control of behavior during choice, thereby using those learned associations. Intriguingly, neither reinforcement learning nor the gambler's fallacy conformed to the optimal choice strategy on the specific decision-making task we used. Thus, the dorsal striatum may contribute to the control of behavior according to reinforcement learning even when the prescriptions of such an algorithm are suboptimal in terms of maximizing future rewards. PMID:21525269

  12. The basolateral amygdala in reward learning and addiction

    PubMed Central

    Wassum, Kate M.; Izquierdo, Alicia

    2015-01-01

    Sophisticated behavioral paradigms partnered with the emergence of increasingly selective techniques to target the basolateral amygdala (BLA) have resulted in an enhanced understanding of the role of this nucleus in learning and using reward information. Due to the wide variety of behavioral approaches many questions remain on the circumscribed role of BLA in appetitive behavior. In this review, we integrate conclusions of BLA function in reward-related behavior using traditional interference techniques (lesion, pharmacological inactivation) with those using newer methodological approaches in experimental animals that allow in vivo manipulation of cell type-specific populations and neural recordings. Secondly, from a review of appetitive behavioral tasks in rodents and monkeys and recent computational models of reward procurement, we derive evidence for BLA as a neural integrator of reward value, history, and cost parameters. Taken together, BLA codes specific and temporally dynamic outcome representations in a distributed network to orchestrate adaptive responses. We provide evidence that experiences with opiates and psychostimulants alter these outcome representations in BLA, resulting in long-term modified action. PMID:26341938

  13. The basolateral amygdala in reward learning and addiction.

    PubMed

    Wassum, Kate M; Izquierdo, Alicia

    2015-10-01

    Sophisticated behavioral paradigms partnered with the emergence of increasingly selective techniques to target the basolateral amygdala (BLA) have resulted in an enhanced understanding of the role of this nucleus in learning and using reward information. Due to the wide variety of behavioral approaches many questions remain on the circumscribed role of BLA in appetitive behavior. In this review, we integrate conclusions of BLA function in reward-related behavior using traditional interference techniques (lesion, pharmacological inactivation) with those using newer methodological approaches in experimental animals that allow in vivo manipulation of cell type-specific populations and neural recordings. Secondly, from a review of appetitive behavioral tasks in rodents and monkeys and recent computational models of reward procurement, we derive evidence for BLA as a neural integrator of reward value, history, and cost parameters. Taken together, BLA codes specific and temporally dynamic outcome representations in a distributed network to orchestrate adaptive responses. We provide evidence that experiences with opiates and psychostimulants alter these outcome representations in BLA, resulting in long-term modified action.

  14. The basolateral amygdala in reward learning and addiction.

    PubMed

    Wassum, Kate M; Izquierdo, Alicia

    2015-10-01

    Sophisticated behavioral paradigms partnered with the emergence of increasingly selective techniques to target the basolateral amygdala (BLA) have resulted in an enhanced understanding of the role of this nucleus in learning and using reward information. Due to the wide variety of behavioral approaches many questions remain on the circumscribed role of BLA in appetitive behavior. In this review, we integrate conclusions of BLA function in reward-related behavior using traditional interference techniques (lesion, pharmacological inactivation) with those using newer methodological approaches in experimental animals that allow in vivo manipulation of cell type-specific populations and neural recordings. Secondly, from a review of appetitive behavioral tasks in rodents and monkeys and recent computational models of reward procurement, we derive evidence for BLA as a neural integrator of reward value, history, and cost parameters. Taken together, BLA codes specific and temporally dynamic outcome representations in a distributed network to orchestrate adaptive responses. We provide evidence that experiences with opiates and psychostimulants alter these outcome representations in BLA, resulting in long-term modified action. PMID:26341938

  15. Antipsychotic dose modulates behavioral and neural responses to feedback during reinforcement learning in schizophrenia.

    PubMed

    Insel, Catherine; Reinen, Jenna; Weber, Jochen; Wager, Tor D; Jarskog, L Fredrik; Shohamy, Daphna; Smith, Edward E

    2014-03-01

    Schizophrenia is characterized by an abnormal dopamine system, and dopamine blockade is the primary mechanism of antipsychotic treatment. Consistent with the known role of dopamine in reward processing, prior research has demonstrated that patients with schizophrenia exhibit impairments in reward-based learning. However, it remains unknown how treatment with antipsychotic medication impacts the behavioral and neural signatures of reinforcement learning in schizophrenia. The goal of this study was to examine whether antipsychotic medication modulates behavioral and neural responses to prediction error coding during reinforcement learning. Patients with schizophrenia completed a reinforcement learning task while undergoing functional magnetic resonance imaging. The task consisted of two separate conditions in which participants accumulated monetary gain or avoided monetary loss. Behavioral results indicated that antipsychotic medication dose was associated with altered behavioral approaches to learning, such that patients taking higher doses of medication showed increased sensitivity to negative reinforcement. Higher doses of antipsychotic medication were also associated with higher learning rates (LRs), suggesting that medication enhanced sensitivity to trial-by-trial feedback. Neuroimaging data demonstrated that antipsychotic dose was related to differences in neural signatures of feedback prediction error during the loss condition. Specifically, patients taking higher doses of medication showed attenuated prediction error responses in the striatum and the medial prefrontal cortex. These findings indicate that antipsychotic medication treatment may influence motivational processes in patients with schizophrenia.

  16. A proposed resolution to the paradox of drug reward: Dopamine's evolution from an aversive signal to a facilitator of drug reward via negative reinforcement.

    PubMed

    Ting-A-Kee, Ryan; Heinmiller, Andrew; van der Kooy, Derek

    2015-09-01

    The mystery surrounding how plant neurotoxins came to possess reinforcing properties is termed the paradox of drug reward. Here we propose a resolution to this paradox whereby dopamine - which has traditionally been viewed as a signal of reward - initially signaled aversion and encouraged escape. We suggest that after being consumed, plant neurotoxins such as nicotine activated an aversive dopaminergic pathway, thereby deterring predatory herbivores. Later evolutionary events - including the development of a GABAergic system capable of modulating dopaminergic activity - led to the ability to down-regulate and 'control' this dopamine-based aversion. We speculate that this negative reinforcement system evolved so that animals could suppress aversive states such as hunger in order to attend to other internal drives (such as mating and shelter) that would result in improved organismal fitness.

  17. What is learned in sequential learning? An associative model of reward magnitude serial-pattern learning.

    PubMed

    Wallace, Douglas G; Fountain, Stephen B

    2002-01-01

    A computational model of sequence learning is described that is based on pairwise associations and generalization. Simulations by the model predicted that rats should learn a long monotonic pattern of food quantities better than a nonmonotonic pattern, as predicted by rule-learning theory, and that they should learn a short nonmonotonic pattern with highly discriminable elements better than one with less discriminable elements, as predicted by interitem association theory. In two other studies, the model also simulated behavioral "rule generalization," "extrapolation," and associative transfer data motivated by both rule-learning and associative perspectives. Although these simulations do not rule out the possibility that rats can use rule induction to learn serial patterns, they show that a simple associative model can account for the classical behavioral studies implicating rule learning in reward magnitude serial-pattern learning.

  18. Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioural ecology.

    PubMed

    Schultz, Wolfram

    2004-04-01

    Neurons in a small number of brain structures detect rewards and reward-predicting stimuli and are active during the expectation of predictable food and liquid rewards. These neurons code the reward information according to basic terms of various behavioural theories that seek to explain reward-directed learning, approach behaviour and decision-making. The involved brain structures include groups of dopamine neurons, the striatum including the nucleus accumbens, the orbitofrontal cortex and the amygdala. The reward information is fed to brain structures involved in decision-making and organisation of behaviour, such as the dorsolateral prefrontal cortex and possibly the parietal cortex. The neural coding of basic reward terms derived from formal theories puts the neurophysiological investigation of reward mechanisms on firm conceptual grounds and provides neural correlates for the function of rewards in learning, approach behaviour and decision-making.

  19. Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioural ecology.

    PubMed

    Schultz, Wolfram

    2004-04-01

    Neurons in a small number of brain structures detect rewards and reward-predicting stimuli and are active during the expectation of predictable food and liquid rewards. These neurons code the reward information according to basic terms of various behavioural theories that seek to explain reward-directed learning, approach behaviour and decision-making. The involved brain structures include groups of dopamine neurons, the striatum including the nucleus accumbens, the orbitofrontal cortex and the amygdala. The reward information is fed to brain structures involved in decision-making and organisation of behaviour, such as the dorsolateral prefrontal cortex and possibly the parietal cortex. The neural coding of basic reward terms derived from formal theories puts the neurophysiological investigation of reward mechanisms on firm conceptual grounds and provides neural correlates for the function of rewards in learning, approach behaviour and decision-making. PMID:15082317

  20. Autonomous reinforcement learning with experience replay.

    PubMed

    Wawrzyński, Paweł; Tanwani, Ajay Kumar

    2013-05-01

    This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy with the use of previously collected samples, and autonomously estimates the appropriate step-sizes for the learning updates. The algorithm is based on the actor-critic with experience replay whose step-sizes are determined on-line by an enhanced fixed point algorithm for on-line neural network training. An experimental study with simulated octopus arm and half-cheetah demonstrates the feasibility of the proposed algorithm to solve difficult learning control problems in an autonomous way within reasonably short time. PMID:23237972
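
    A minimal tabular sketch of actor-critic learning with experience replay, as described above: each new transition is stored in a buffer and followed by several replayed critic and actor updates. The fixed step sizes and tabular representation are simplifying assumptions; the paper's algorithm uses neural networks and estimates the step sizes on-line.

      import numpy as np

      # Tabular actor-critic with experience replay (illustrative constants).
      ALPHA_V, ALPHA_PI, GAMMA = 0.05, 0.02, 0.95

      class ReplayActorCritic:
          def __init__(self, n_states, n_actions, replay_per_step=10):
              self.v = np.zeros(n_states)                 # critic: state values
              self.prefs = np.zeros((n_states, n_actions))  # actor: action preferences
              self.buffer, self.replay_per_step = [], replay_per_step

          def policy(self, s):
              p = np.exp(self.prefs[s] - self.prefs[s].max())
              return p / p.sum()

          def observe(self, s, a, r, s_next):
              # Store the transition, then replay previously collected samples.
              self.buffer.append((s, a, r, s_next))
              for _ in range(self.replay_per_step):
                  i = np.random.randint(len(self.buffer))
                  self._update(*self.buffer[i])

          def _update(self, s, a, r, s_next):
              delta = r + GAMMA * self.v[s_next] - self.v[s]
              self.v[s] += ALPHA_V * delta
              self.prefs[s, a] += ALPHA_PI * delta        # reinforce positive-TD actions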

  1. Refining Linear Fuzzy Rules by Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.; Malkani, Anil

    1996-01-01

    Linear fuzzy rules are increasingly being used in the development of fuzzy logic systems. Radial basis functions have also been used in the antecedents of the rules for clustering in product space which can automatically generate a set of linear fuzzy rules from an input/output data set. Manual methods are usually used in refining these rules. This paper presents a method for refining the parameters of these rules using reinforcement learning which can be applied in domains where supervised input-output data is not available and reinforcements are received only after a long sequence of actions. This is shown for a generalization of radial basis functions. The formation of fuzzy rules from data and their automatic refinement is an important step in closing the gap between the application of reinforcement learning methods in the domains where only some limited input-output data is available.
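
    A rough sketch of the refinement idea described above: linear (TSK-style) rule consequents weighted by radial-basis memberships, with the consequent parameters nudged by a finite-difference estimate of a delayed return. This is an illustrative stand-in for the paper's method, and run_episode is a hypothetical user-supplied evaluation of a whole action sequence.

      import numpy as np

      # Linear fuzzy rules with Gaussian (radial-basis) antecedents, refined by
      # delayed reinforcement via a finite-difference parameter perturbation.
      def fuzzy_output(x, centres, widths, coeffs):
          # centres, widths: (n_rules, n_inputs); coeffs: (n_rules, n_inputs + 1).
          firing = np.exp(-np.sum(((x - centres) / widths) ** 2, axis=1))
          consequents = coeffs[:, :-1] @ x + coeffs[:, -1]
          return np.dot(firing, consequents) / (firing.sum() + 1e-12)

      def refine(coeffs, run_episode, sigma=0.05, lr=0.1):
          # Perturb the consequent parameters, run an episode with the perturbed
          # controller, and move in the direction that raised the delayed return.
          noise = sigma * np.random.randn(*coeffs.shape)
          advantage = run_episode(coeffs + noise) - run_episode(coeffs)
          return coeffs + lr * advantage * noise / (sigma ** 2)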

  2. Reinforcement Learning in Information Searching

    ERIC Educational Resources Information Center

    Cen, Yonghua; Gan, Liren; Bai, Chen

    2013-01-01

    Introduction: The study seeks to answer two questions: How do university students learn to use correct strategies to conduct scholarly information searches without instructions? and, What are the differences in learning mechanisms between users at different cognitive levels? Method: Two groups of users, thirteen first year undergraduate students…

  3. Facilitation of Discrimination Learning with Delayed Reinforcement.

    ERIC Educational Resources Information Center

    Goldstein, Sondra Blevins; Siegel, Alexander W.

    Eighty-four nine year old children, twelve in each of seven experimental groups, learned a two-choice successive discrimination problem. The parameters that were systematically manipulated included: (1) immediate versus delayed reinforcement; and (2) forced preresponse stimulus exposure versus stimulus exposure during the delay interval,…

  4. Geographical Inquiry and Learning Reinforcement Theory.

    ERIC Educational Resources Information Center

    Davies, Christopher S.

    1983-01-01

    Although instructors have been reluctant to utilize the Keller Plan (a personalized system of instruction), it lends itself to teaching introductory geography. College students found that the routine and frequent reinforcement led to progressive learning. However, it does not lend itself to the study of reflexive or polemical concepts. (IS)

  5. Classroom Reinforcement and Learning: A Quantitative Synthesis.

    ERIC Educational Resources Information Center

    Lysakowski, Richard S.; Walberg, Herbert J.

    To estimate the influence of positive reinforcement on classroom learning, the authors analyzed statistical data from 39 studies spanning the years 1958-1978 and containing a combined sample of 4,842 students in 202 classes. Twenty-nine characteristics of each study's sample, methodology, and reliability were coded to measure their effects on…

  6. Reward-related learning via multiple memory systems

    PubMed Central

    Delgado, Mauricio R.; Dickerson, Kathryn C.

    2013-01-01

    The application of a neuroeconomic approach to the study of reward-related processes has provided significant insights into our understanding of human learning and decision-making. Much of this research has primarily focused on the contributions of the cortico-striatal circuitry, involved in trial and error reward learning. As a result, less consideration has been allotted to the potential influence of other neural mechanisms, such as the hippocampus, or to more common ways in which information is acquired and used in human society to reach a decision, such as through explicit instruction rather than trial and error learning. This review examines the basic and applied value of considering the individual contributions of multiple learning and memory neural systems and their interactions during human decision-making in normal individuals and neuropsychiatric populations. Specifically, the anatomical and functional connectivity across multiple memory systems is highlighted to suggest that probing the role of the hippocampus and its interactions with the cortico-striatal circuitry via the application of model-based neuroeconomic approaches may provide novel insights into several neuropsychiatric populations who suffer from damage to one of these structures, and as a consequence have deficits in learning, memory, or decision-making. PMID:22365667

  7. Time representation in reinforcement learning models of the basal ganglia

    PubMed Central

    Gershman, Samuel J.; Moustafa, Ahmed A.; Ludvig, Elliot A.

    2014-01-01

    Reinforcement learning (RL) models have been influential in understanding many aspects of basal ganglia function, from reward prediction to action selection. Time plays an important role in these models, but there is still no theoretical consensus about what kind of time representation is used by the basal ganglia. We review several theoretical accounts and their supporting evidence. We then discuss the relationship between RL models and the timing mechanisms that have been attributed to the basal ganglia. We hypothesize that a single computational system may underlie both RL and interval timing—the perception of duration in the range of seconds to hours. This hypothesis, which extends earlier models by incorporating a time-sensitive action selection mechanism, may have important implications for understanding disorders like Parkinson's disease in which both decision making and timing are impaired. PMID:24409138

  8. The relationship between reinforcement and memory: parallels in the rewarding and mnemonic effects of the neuropeptide substance P.

    PubMed

    Huston, J P; Oitzl, M S

    1989-01-01

    A theory of reinforcement is presented which accounts for the backward action of a reinforcer on operant behavior in terms of its effect on memory traces left by the operant. Several possible ways in which a reinforcer could strengthen the probability of recurrence of an operant are discussed. Predictions from the model regarding general memory-promoting effects of reinforcers presented posttrial in various learning paradigms are outlined. The theory also predicts a parallelism in the reinforcing and memory-promoting effects of stimuli, including drugs. The second part of the chapter outlines experiments investigating the memory-modulating and reinforcing effects of the neuropeptide substance P (SP). In general, SP is positively reinforcing when injected into parts of the brain where it has been shown to facilitate learning. Peripheral injection of SP is also reinforcing at the dose known to promote passive avoidance learning when presented posttrial.

  9. Fuzzy Q-Learning for Generalization of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1996-01-01

    Fuzzy Q-Learning, introduced earlier by the author, is an extension of Q-Learning into fuzzy environments. GARIC is a methodology for fuzzy reinforcement learning. In this paper, we introduce GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning, while at the local level each agent learns and operates based on GARIC. GARIC-Q improves the speed and applicability of Fuzzy Q-Learning through generalization of the input space using fuzzy rules, and bridges the gap between Q-Learning and rule-based intelligent systems.
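
    A minimal sketch of the generic fuzzy Q-learning update that approaches like GARIC-Q build on (not the GARIC-Q architecture itself): each fuzzy rule keeps q-values for its candidate actions, the global value is a firing-strength-weighted combination, and the TD error is redistributed to the rules in proportion to their firing strengths. Shapes, names, and parameters below are illustrative.

      import numpy as np

      def fuzzy_q_update(memberships, q, chosen, reward,
                         memberships_next, q_next_max, alpha=0.1, gamma=0.95):
          """One generic fuzzy Q-learning step (illustrative).

          memberships      : rule firing strengths in the current state, shape (n_rules,)
          q                : per-rule action values, shape (n_rules, n_actions)
          chosen           : index of the action each fired rule selected, shape (n_rules,)
          q_next_max       : per-rule maximal action value in the next state, shape (n_rules,)
          """
          w = memberships / memberships.sum()
          w_next = memberships_next / memberships_next.sum()
          rows = np.arange(len(w))
          q_sa = np.sum(w * q[rows, chosen])           # global value of the executed fuzzy action
          q_next = np.sum(w_next * q_next_max)         # greedy global value of the next state
          td_error = reward + gamma * q_next - q_sa
          q[rows, chosen] += alpha * w * td_error      # credit each rule by its firing strength
          return td_error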

  10. Extraversion differentiates between model-based and model-free strategies in a reinforcement learning task.

    PubMed

    Skatova, Anya; Chan, Patricia A; Daw, Nathaniel D

    2013-01-01

    Prominent computational models describe a neural mechanism for learning from reward prediction errors, and it has been suggested that variations in this mechanism are reflected in personality factors such as trait extraversion. However, although trait extraversion has been linked to improved reward learning, it is not yet known whether this relationship is selective for the particular computational strategy associated with error-driven learning, known as model-free reinforcement learning, vs. another strategy, model-based learning, which the brain is also known to employ. In the present study we test this relationship by examining whether humans' scores on an extraversion scale predict individual differences in the balance between model-based and model-free learning strategies in a sequentially structured decision task designed to distinguish between them. In previous studies with this task, participants have shown a combination of both types of learning, but with substantial individual variation in the balance between them. In the current study, extraversion predicted worse behavior across both sorts of learning. However, the hypothesis that extraverts would be selectively better at model-free reinforcement learning held up among a subset of the more engaged participants, and overall, higher task engagement was associated with a more selective pattern by which extraversion predicted better model-free learning. The findings indicate a relationship between a broad personality orientation and detailed computational learning mechanisms. Results like those in the present study suggest an intriguing and rich relationship between core neuro-computational mechanisms and broader life orientations and outcomes.
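
    The balance referred to above is commonly formalized as a weighted mixture of model-based and model-free action values, with a per-participant weight w capturing how far behavior leans toward one strategy; a minimal sketch of that mixture and a softmax choice rule (illustrative of the standard hybrid model for this task, not this paper's exact fitting code).

      import numpy as np

      def hybrid_action_values(q_mf, q_mb, w):
          """Mix model-free and model-based values; w in [0, 1] indexes the balance."""
          return w * q_mb + (1.0 - w) * q_mf

      def softmax_choice_prob(q_values, beta):
          """Softmax choice rule with inverse temperature beta."""
          z = beta * (q_values - q_values.max())
          p = np.exp(z)
          return p / p.sum()

      # Example: a participant with w = 0.7 relies mostly on model-based values.
      q_mf = np.array([0.2, 0.5])
      q_mb = np.array([0.6, 0.3])
      print(softmax_choice_prob(hybrid_action_values(q_mf, q_mb, w=0.7), beta=3.0))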

  11. Multitask Reinforcement Learning on the Distribution of MDPs

    NASA Astrophysics Data System (ADS)

    Fumihide, Tanaka; Masayuki, Yamamura

    In this paper we address a new problem in reinforcement learning: an agent that faces multiple learning tasks within its lifetime. The agent’s objective is to maximize its total reward over the lifetime as well as the conventional return in each task. To realize this, it has to be endowed with the ability to retain its past learning experiences and utilize them to improve future learning performance. We formalize this problem here. The central idea is to introduce an environmental class, BV-MDPs, defined by a distribution over MDPs. As an approach to exploiting past learning experiences, we focus on statistical information (mean and deviation) about the agent’s value tables. The mean can be used as the initial values of the table when a new task is presented. The deviation can be viewed as measuring the reliability of the mean, and we utilize it to calculate the priority of simulated backups. We conduct simulation experiments to evaluate the effectiveness of the approach.
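
    A minimal sketch of the mechanism described above, assuming tabular value functions: the mean of the value tables collected in past tasks initializes the table for a new task, and the per-state deviation is turned into a priority for simulated backups. Function names are illustrative.

      import numpy as np

      def value_table_statistics(past_tables):
          """Mean and standard deviation of value tables from previously learned tasks."""
          stacked = np.stack(past_tables)              # shape (n_tasks, n_states)
          return stacked.mean(axis=0), stacked.std(axis=0)

      def start_new_task(past_tables):
          mean, deviation = value_table_statistics(past_tables)
          V = mean.copy()                              # initialize the new task's values with the mean
          # High deviation marks states where the mean is an unreliable prior,
          # so those states receive a larger share of simulated backups.
          backup_priority = deviation / (deviation.sum() + 1e-12)
          return V, backup_priority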

  12. Can Traditions Emerge from the Interaction of Stimulus Enhancement and Reinforcement Learning? An Experimental Model

    PubMed Central

    MATTHEWS, LUKE J; PAUKNER, ANNIKA; SUOMI, STEPHEN J

    2010-01-01

    The study of social learning in captivity and behavioral traditions in the wild are two burgeoning areas of research, but few empirical studies have tested how learning mechanisms produce emergent patterns of tradition. Studies have examined how social learning mechanisms that are cognitively complex and possessed by few species, such as imitation, result in traditional patterns, yet traditional patterns are also exhibited by species that may not possess such mechanisms. We propose an explicit model of how stimulus enhancement and reinforcement learning could interact to produce traditions. We tested the model experimentally with tufted capuchin monkeys (Cebus apella), which exhibit traditions in the wild but have rarely demonstrated imitative abilities in captive experiments. Monkeys showed both stimulus enhancement learning and a habitual bias to perform whichever behavior first obtained them a reward. These results support our model that simple social learning mechanisms combined with reinforcement can result in traditional patterns of behavior. PMID:21135912

  13. Ventral tegmental area neurons in learned appetitive behavior and positive reinforcement.

    PubMed

    Fields, Howard L; Hjelmstad, Gregory O; Margolis, Elyssa B; Nicola, Saleem M

    2007-01-01

    Ventral tegmental area (VTA) neuron firing precedes behaviors elicited by reward-predictive sensory cues and scales with the magnitude and unpredictability of received rewards. These patterns are consistent with roles in the performance of learned appetitive behaviors and in positive reinforcement, respectively. The VTA includes subpopulations of neurons with different afferent connections, neurotransmitter content, and projection targets. Because the VTA and substantia nigra pars compacta are the sole sources of striatal and limbic forebrain dopamine, measurements of dopamine release and manipulations of dopamine function have provided critical evidence supporting a VTA contribution to these functions. However, the VTA also sends GABAergic and glutamatergic projections to the nucleus accumbens and prefrontal cortex. Furthermore, VTA-mediated but dopamine-independent positive reinforcement has been demonstrated. Consequently, identifying the neurotransmitter content and projection target of VTA neurons recorded in vivo will be critical for determining their contribution to learned appetitive behaviors.

  14. Prediction error in reinforcement learning: a meta-analysis of neuroimaging studies.

    PubMed

    Garrison, Jane; Erdeniz, Burak; Done, John

    2013-08-01

    Activation likelihood estimation (ALE) meta-analyses were used to examine the neural correlates of prediction error in reinforcement learning. The findings are interpreted in the light of current computational models of learning and action selection. In this context, particular consideration is given to the comparison of activation patterns from studies using instrumental and Pavlovian conditioning, and where reinforcement involved rewarding or punishing feedback. The striatum was the key brain area encoding for prediction error, with activity encompassing dorsal and ventral regions for instrumental and Pavlovian reinforcement alike, a finding which challenges the functional separation of the striatum into a dorsal 'actor' and a ventral 'critic'. Prediction error activity was further observed in diverse areas of predominantly anterior cerebral cortex including medial prefrontal cortex and anterior cingulate cortex. Distinct patterns of prediction error activity were found for studies using rewarding and aversive reinforcers; reward prediction errors were observed primarily in the striatum while aversive prediction errors were found more widely including insula and habenula.
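
    For reference, the signal these studies regress against brain activity is the reward prediction error of temporal-difference learning; a minimal sketch of its computation (the toy numbers are illustrative).

      def td_prediction_error(reward, v_current, v_next, gamma=0.95):
          """delta = r + gamma * V(s') - V(s): positive when outcomes beat expectations."""
          return reward + gamma * v_next - v_current

      # A rewarding outcome that was only partly expected yields a positive error:
      print(td_prediction_error(reward=1.0, v_current=0.4, v_next=0.0))  # 0.6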

  15. Prediction error in reinforcement learning: a meta-analysis of neuroimaging studies.

    PubMed

    Garrison, Jane; Erdeniz, Burak; Done, John

    2013-08-01

    Activation likelihood estimation (ALE) meta-analyses were used to examine the neural correlates of prediction error in reinforcement learning. The findings are interpreted in the light of current computational models of learning and action selection. In this context, particular consideration is given to the comparison of activation patterns from studies using instrumental and Pavlovian conditioning, and where reinforcement involved rewarding or punishing feedback. The striatum was the key brain area encoding for prediction error, with activity encompassing dorsal and ventral regions for instrumental and Pavlovian reinforcement alike, a finding which challenges the functional separation of the striatum into a dorsal 'actor' and a ventral 'critic'. Prediction error activity was further observed in diverse areas of predominantly anterior cerebral cortex including medial prefrontal cortex and anterior cingulate cortex. Distinct patterns of prediction error activity were found for studies using rewarding and aversive reinforcers; reward prediction errors were observed primarily in the striatum while aversive prediction errors were found more widely including insula and habenula. PMID:23567522

  16. Online reinforcement learning for dynamic multimedia systems.

    PubMed

    Mastronarde, Nicholas; van der Schaar, Mihaela

    2010-02-01

    In our previous work, we proposed a systematic cross-layer framework for dynamic multimedia systems, which allows each layer to make autonomous and foresighted decisions that maximize the system's long-term performance, while meeting the application's real-time delay constraints. The proposed solution solved the cross-layer optimization offline, under the assumption that the multimedia system's probabilistic dynamics were known a priori, by modeling the system as a layered Markov decision process. In practice, however, these dynamics are unknown a priori and, therefore, must be learned online. In this paper, we address this problem by allowing the multimedia system layers to learn, through repeated interactions with each other, to autonomously optimize the system's long-term performance at run-time. The two key challenges in this layered learning setting are: (i) each layer's learning performance is directly impacted by not only its own dynamics, but also by the learning processes of the other layers with which it interacts; and (ii) selecting a learning model that appropriately balances time-complexity (i.e., learning speed) with the multimedia system's limited memory and the multimedia application's real-time delay constraints. We propose two reinforcement learning algorithms for optimizing the system under different design constraints: the first algorithm solves the cross-layer optimization in a centralized manner and the second solves it in a decentralized manner. We analyze both algorithms in terms of their required computation, memory, and interlayer communication overheads. After noting that the proposed reinforcement learning algorithms learn too slowly, we introduce a complementary accelerated learning algorithm that exploits partial knowledge about the system's dynamics in order to dramatically improve the system's performance. In our experiments, we demonstrate that decentralized learning can perform equally as well as centralized learning, while

  17. Drive-reinforcement learning system applications

    NASA Astrophysics Data System (ADS)

    Johnson, Daniel W.

    1992-07-01

    The application of Drive-Reinforcement (D-R) to the unsupervised learning of manipulator control functions was investigated. In particular, the ability of a D-R neuronal system to learn servo-level and trajectory-level controls for a robotic mechanism was assessed. Results indicate that D-R based systems can be successful at learning these functions in real-time with actual hardware. Moreover, since the control architectures are generic, the evidence suggests that D-R would be effective in control system applications outside the robotics arena.

  18. Neural correlates of reinforcement learning and social preferences in competitive bidding.

    PubMed

    van den Bos, Wouter; Talwar, Arjun; McClure, Samuel M

    2013-01-30

    In competitive social environments, people often deviate from what rational choice theory prescribes, resulting in losses or suboptimal monetary gains. We investigate how competition affects learning and decision-making in a common value auction task. During the experiment, groups of five human participants were simultaneously scanned using MRI while playing the auction task. We first demonstrate that bidding is well characterized by reinforcement learning with biased reward representations dependent on social preferences. Indicative of reinforcement learning, we found that estimated trial-by-trial prediction errors correlated with activity in the striatum and ventromedial prefrontal cortex. Additionally, we found that individual differences in social preferences were related to activity in the temporal-parietal junction and anterior insula. Connectivity analyses suggest that monetary and social value signals are integrated in the ventromedial prefrontal cortex and striatum. Based on these results, we argue for a novel mechanistic account for the integration of reinforcement history and social preferences in competitive decision-making.

  19. Providing a food reward reduces inhibitory avoidance learning in zebrafish.

    PubMed

    Manuel, Remy; Zethof, Jan; Flik, Gert; van den Bos, Ruud

    2015-11-01

    As shown in male rats, prior history of subjects changes behavioural and stress-responses to challenges: a two-week history of exposure to rewards at fixed intervals led to slightly, but consistently, lower physiological stress-responses and anxiety-like behaviour. Here, we tested whether similar effects are present in zebrafish (Danio rerio). After two weeks of providing Artemia (brine shrimp; Artemia salina) as food reward or flake food (Tetramin) as control at fixed intervals, zebrafish were exposed to a fear-avoidance learning task using an inhibitory avoidance protocol. Half of the fish received a 3V shock on day 1 and were tested and sacrificed on day 2; the other half received a second 3V shock on day 2 and were tested and sacrificed on day 3. The latter was done to assess whether effects are robust, as effects in rats have been shown to be modest. Zebrafish that were given Artemia showed less inhibitory avoidance after one shock, but not after two shocks, than zebrafish that were given flake food. Reduced avoidance behaviour was associated with lower telencephalic gene expression levels of cannabinoid receptor 1 (cnr1) and higher gene expression levels of corticotropin releasing factor (crf). These results suggest that providing rewards at fixed intervals alters fear avoidance behaviour, albeit modestly, in zebrafish. We discuss the data in the context of similar underlying brain structures in mammals and fish.

  20. Providing a food reward reduces inhibitory avoidance learning in zebrafish.

    PubMed

    Manuel, Remy; Zethof, Jan; Flik, Gert; van den Bos, Ruud

    2015-11-01

    As shown in male rats, prior history of subjects changes behavioural and stress-responses to challenges: a two-week history of exposure to rewards at fixed intervals led to slightly, but consistently, lower physiological stress-responses and anxiety-like behaviour. Here, we tested whether similar effects are present in zebrafish (Danio rerio). After two weeks of providing Artemia (brine shrimp; Artemia salina) as food reward or flake food (Tetramin) as control at fixed intervals, zebrafish were exposed to a fear-avoidance learning task using an inhibitory avoidance protocol. Half of the fish received a 3V shock on day 1 and were tested and sacrificed on day 2; the other half received a second 3V shock on day 2 and were tested and sacrificed on day 3. The latter was done to assess whether effects are robust, as effects in rats have been shown to be modest. Zebrafish that were given Artemia showed less inhibitory avoidance after one shock, but not after two shocks, than zebrafish that were given flake food. Reduced avoidance behaviour was associated with lower telencephalic gene expression levels of cannabinoid receptor 1 (cnr1) and higher gene expression levels of corticotropin releasing factor (crf). These results suggest that providing rewards at fixed intervals alters fear avoidance behaviour, albeit modestly, in zebrafish. We discuss the data in the context of similar underlying brain structures in mammals and fish. PMID:26342856

  1. Hemispheric Asymmetries in Striatal Reward Responses Relate to Approach-Avoidance Learning and Encoding of Positive-Negative Prediction Errors in Dopaminergic Midbrain Regions.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly C; Schwartz, Sophie

    2015-10-28

    Some individuals are better at learning about rewarding situations, whereas others are inclined to avoid punishments (i.e., enhanced approach or avoidance learning, respectively). In reinforcement learning, action values are increased when outcomes are better than predicted (positive prediction errors [PEs]) and decreased for worse than predicted outcomes (negative PEs). Because actions with high and low values are approached and avoided, respectively, individual differences in the neural encoding of PEs may influence the balance between approach-avoidance learning. Recent correlational approaches also indicate that biases in approach-avoidance learning involve hemispheric asymmetries in dopamine function. However, the computational and neural mechanisms underpinning such learning biases remain unknown. Here we assessed hemispheric reward asymmetry in striatal activity in 34 human participants who performed a task involving rewards and punishments. We show that the relative difference in reward response between hemispheres relates to individual biases in approach-avoidance learning. Moreover, using a computational modeling approach, we demonstrate that better encoding of positive (vs negative) PEs in dopaminergic midbrain regions is associated with better approach (vs avoidance) learning, specifically in participants with larger reward responses in the left (vs right) ventral striatum. Thus, individual dispositions or traits may be determined by neural processes acting to constrain learning about specific aspects of the world.
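
    One standard way to express the positive/negative asymmetry described above is a value update with separate learning rates for positive and negative prediction errors; the sketch below is a generic asymmetric-learning-rate model, not necessarily the authors' exact formulation.

      def asymmetric_value_update(v, reward, alpha_pos=0.3, alpha_neg=0.1):
          """Update a value estimate with different gains for positive and negative PEs."""
          pe = reward - v
          alpha = alpha_pos if pe > 0 else alpha_neg
          return v + alpha * pe

      # alpha_pos > alpha_neg biases learning toward approach (reward-driven) updates;
      # alpha_neg > alpha_pos biases it toward avoidance (punishment-driven) updates.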

  2. Hemispheric Asymmetries in Striatal Reward Responses Relate to Approach-Avoidance Learning and Encoding of Positive-Negative Prediction Errors in Dopaminergic Midbrain Regions.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly C; Schwartz, Sophie

    2015-10-28

    Some individuals are better at learning about rewarding situations, whereas others are inclined to avoid punishments (i.e., enhanced approach or avoidance learning, respectively). In reinforcement learning, action values are increased when outcomes are better than predicted (positive prediction errors [PEs]) and decreased for worse than predicted outcomes (negative PEs). Because actions with high and low values are approached and avoided, respectively, individual differences in the neural encoding of PEs may influence the balance between approach-avoidance learning. Recent correlational approaches also indicate that biases in approach-avoidance learning involve hemispheric asymmetries in dopamine function. However, the computational and neural mechanisms underpinning such learning biases remain unknown. Here we assessed hemispheric reward asymmetry in striatal activity in 34 human participants who performed a task involving rewards and punishments. We show that the relative difference in reward response between hemispheres relates to individual biases in approach-avoidance learning. Moreover, using a computational modeling approach, we demonstrate that better encoding of positive (vs negative) PEs in dopaminergic midbrain regions is associated with better approach (vs avoidance) learning, specifically in participants with larger reward responses in the left (vs right) ventral striatum. Thus, individual dispositions or traits may be determined by neural processes acting to constrain learning about specific aspects of the world. PMID:26511241

  3. Reinforcement Learning Models and Their Neural Correlates: An Activation Likelihood Estimation Meta-Analysis

    PubMed Central

    Kumar, Poornima; Eickhoff, Simon B.; Dombrovski, Alexandre Y.

    2015-01-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments – prediction error – is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies suggest that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that employed algorithmic reinforcement learning models, across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, while instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually-estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies. PMID:25665667

  4. Predictive reward signal of dopamine neurons.

    PubMed

    Schultz, W

    1998-07-01

    The effects of lesions, receptor blocking, electrical self-stimulation, and drugs of abuse suggest that midbrain dopamine systems are involved in processing reward information and learning approach behavior. Most dopamine neurons show phasic activations after primary liquid and food rewards and conditioned, reward-predicting visual and auditory stimuli. They show biphasic, activation-depression responses after stimuli that resemble reward-predicting stimuli or are novel or particularly salient. However, only a few phasic activations follow aversive stimuli. Thus dopamine neurons label environmental stimuli with appetitive value, predict and detect rewards, and signal alerting and motivating events. By failing to discriminate between different rewards, dopamine neurons appear to emit an alerting message about the surprising presence or absence of rewards. All responses to rewards and reward-predicting stimuli depend on event predictability. Dopamine neurons are activated by rewarding events that are better than predicted, remain uninfluenced by events that are as good as predicted, and are depressed by events that are worse than predicted. By signaling rewards according to a prediction error, dopamine responses have the formal characteristics of a teaching signal postulated by reinforcement learning theories. Dopamine responses transfer during learning from primary rewards to reward-predicting stimuli. This may contribute to neuronal mechanisms underlying the retrograde action of rewards, one of the main puzzles in reinforcement learning. The impulse response releases a short pulse of dopamine onto many dendrites, thus broadcasting a rather global reinforcement signal to postsynaptic neurons. This signal may improve approach behavior by providing advance reward information before the behavior occurs, and may contribute to learning by modifying synaptic transmission. The dopamine reward signal is supplemented by activity in neurons in striatum, frontal cortex, and

  5. Reinforcement learning in professional basketball players.

    PubMed

    Neiman, Tal; Loewenstein, Yonatan

    2011-01-01

    Reinforcement learning in complex natural environments is a challenging task because the agent should generalize from the outcomes of actions taken in one state of the world to future actions in different states of the world. The extent to which human experts find the proper level of generalization is unclear. Here we show, using the sequences of field goal attempts made by professional basketball players, that the outcome of even a single field goal attempt has a considerable effect on the rate of subsequent 3 point shot attempts, in line with standard models of reinforcement learning. However, this change in behaviour is associated with negative correlations between the outcomes of successive field goal attempts. These results indicate that despite years of experience and high motivation, professional players overgeneralize from the outcomes of their most recent actions, which leads to decreased performance.

  6. The cerebellum: a neural system for the study of reinforcement learning.

    PubMed

    Swain, Rodney A; Kerr, Abigail L; Thompson, Richard F

    2011-01-01

    In its strictest application, the term "reinforcement learning" refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Given that Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated, and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed), any neural system that supports reinforcement learning must also be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning.

  7. The Function of Direct and Vicarious Reinforcement in Human Learning.

    ERIC Educational Resources Information Center

    Owens, Carl R.; And Others

    The role of reinforcement has long been an issue in learning theory. The effects of reinforcement in learning were investigated under circumstances which made the information necessary for correct performance equally available to reinforced and nonreinforced subjects. Fourth graders (N=36) were given a pre-test of 20 items from the Peabody Picture…

  8. Reward functions of the basal ganglia.

    PubMed

    Schultz, Wolfram

    2016-07-01

    Besides their fundamental movement function evidenced by Parkinsonian deficits, the basal ganglia are involved in processing closely linked non-motor, cognitive and reward information. This review describes the reward functions of three brain structures that are major components of the basal ganglia or are closely associated with the basal ganglia, namely midbrain dopamine neurons, pedunculopontine nucleus, and striatum (caudate nucleus, putamen, nucleus accumbens). Rewards are involved in learning (positive reinforcement), approach behavior, economic choices and positive emotions. The response of dopamine neurons to rewards consists of an early detection component and a subsequent reward component that reflects a prediction error in economic utility, but is unrelated to movement. Dopamine activations to non-rewarded or aversive stimuli reflect physical impact, but not punishment. Neurons in pedunculopontine nucleus project their axons to dopamine neurons and process sensory stimuli, movements and rewards and reward-predicting stimuli without coding outright reward prediction errors. Neurons in striatum, besides their pronounced movement relationships, process rewards irrespective of sensory and motor aspects, integrate reward information into movement activity, code the reward value of individual actions, change their reward-related activity during learning, and code own reward in social situations depending on whose action produces the reward. These data demonstrate a variety of well-characterized reward processes in specific basal ganglia nuclei consistent with an important function in non-motor aspects of motivated behavior. PMID:26838982

  9. Multiagent reinforcement learning with unshared value functions.

    PubMed

    Hu, Yujing; Gao, Yang; An, Bo

    2015-04-01

    One important approach of multiagent reinforcement learning (MARL) is equilibrium-based MARL, which is a combination of reinforcement learning and game theory. Most existing algorithms involve computationally expensive calculation of mixed strategy equilibria and require agents to replicate the other agents' value functions for equilibrium computing in each state. This is unrealistic since agents may not be willing to share such information due to privacy or safety concerns. This paper aims to develop novel and efficient MARL algorithms without the need for agents to share value functions. First, we adopt pure strategy equilibrium solution concepts instead of mixed strategy equilibria given that a mixed strategy equilibrium is often computationally expensive. In this paper, three types of pure strategy profiles are utilized as equilibrium solution concepts: pure strategy Nash equilibrium, equilibrium-dominating strategy profile, and nonstrict equilibrium-dominating strategy profile. The latter two solution concepts are strategy profiles from which agents can gain higher payoffs than one or more pure strategy Nash equilibria. Theoretical analysis shows that these strategy profiles are symmetric meta equilibria. Second, we propose a multistep negotiation process for finding pure strategy equilibria since value functions are not shared among agents. By putting these together, we propose a novel MARL algorithm called negotiation-based Q-learning (NegoQ). Experiments are first conducted in grid-world games, which are widely used to evaluate MARL algorithms. In these games, NegoQ learns equilibrium policies and runs significantly faster than existing MARL algorithms (correlated Q-learning and Nash Q-learning). Surprisingly, we find that NegoQ also performs well in team Markov games such as pursuit games, as compared with team-task-oriented MARL algorithms (such as friend Q-learning and distributed Q-learning).

  10. Multiagent reinforcement learning with unshared value functions.

    PubMed

    Hu, Yujing; Gao, Yang; An, Bo

    2015-04-01

    One important approach of multiagent reinforcement learning (MARL) is equilibrium-based MARL, which is a combination of reinforcement learning and game theory. Most existing algorithms involve computationally expensive calculation of mixed strategy equilibria and require agents to replicate the other agents' value functions for equilibrium computing in each state. This is unrealistic since agents may not be willing to share such information due to privacy or safety concerns. This paper aims to develop novel and efficient MARL algorithms without the need for agents to share value functions. First, we adopt pure strategy equilibrium solution concepts instead of mixed strategy equilibria given that a mixed strategy equilibrium is often computationally expensive. In this paper, three types of pure strategy profiles are utilized as equilibrium solution concepts: pure strategy Nash equilibrium, equilibrium-dominating strategy profile, and nonstrict equilibrium-dominating strategy profile. The latter two solution concepts are strategy profiles from which agents can gain higher payoffs than one or more pure strategy Nash equilibria. Theoretical analysis shows that these strategy profiles are symmetric meta equilibria. Second, we propose a multistep negotiation process for finding pure strategy equilibria since value functions are not shared among agents. By putting these together, we propose a novel MARL algorithm called negotiation-based Q-learning (NegoQ). Experiments are first conducted in grid-world games, which are widely used to evaluate MARL algorithms. In these games, NegoQ learns equilibrium policies and runs significantly faster than existing MARL algorithms (correlated Q-learning and Nash Q-learning). Surprisingly, we find that NegoQ also performs well in team Markov games such as pursuit games, as compared with team-task-oriented MARL algorithms (such as friend Q-learning and distributed Q-learning). PMID:25014990

  11. Reinforcement Learning and Dopamine in Schizophrenia: Dimensions of Symptoms or Specific Features of a Disease Group?

    PubMed Central

    Deserno, Lorenz; Boehme, Rebecca; Heinz, Andreas; Schlagenhauf, Florian

    2013-01-01

    Abnormalities in reinforcement learning are a key finding in schizophrenia and have been proposed to be linked to elevated levels of dopamine neurotransmission. Behavioral deficits in reinforcement learning and their neural correlates may contribute to the formation of clinical characteristics of schizophrenia. The ability to form predictions about future outcomes is fundamental for environmental interactions and depends on neuronal teaching signals, like reward prediction errors. While aberrant prediction errors, that encode non-salient events as surprising, have been proposed to contribute to the formation of positive symptoms, a failure to build neural representations of decision values may result in negative symptoms. Here, we review behavioral and neuroimaging research in schizophrenia and focus on studies that implemented reinforcement learning models. In addition, we discuss studies that combined reinforcement learning with measures of dopamine. Thereby, we suggest how reinforcement learning abnormalities in schizophrenia may contribute to the formation of psychotic symptoms and may interact with cognitive deficits. These ideas point toward an interplay of more rigid versus flexible control over reinforcement learning. Pronounced deficits in the flexible or model-based domain may allow for a detailed characterization of well-established cognitive deficits in schizophrenia patients based on computational models of learning. Finally, we propose a framework based on the potentially crucial contribution of dopamine to dysfunctional reinforcement learning on the level of neural networks. Future research may strongly benefit from computational modeling but also requires further methodological improvement for clinical group studies. These research tools may help to improve our understanding of disease-specific mechanisms and may help to identify clinically relevant subgroups of the heterogeneous entity schizophrenia. PMID:24391603

  12. Novelty and Inductive Generalization in Human Reinforcement Learning.

    PubMed

    Gershman, Samuel J; Niv, Yael

    2015-07-01

    In reinforcement learning (RL), a decision maker searching for the most rewarding option is often faced with the question: What is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: How can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and we describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of RL in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional RL algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty.

  13. Role of CB2 Cannabinoid Receptors in the Rewarding, Reinforcing, and Physical Effects of Nicotine

    PubMed Central

    Navarrete, Francisco; Rodríguez-Arias, Marta; Martín-García, Elena; Navarro, Daniela; García-Gutiérrez, María S; Aguilar, María A; Aracil-Fernández, Auxiliadora; Berbel, Pere; Miñarro, José; Maldonado, Rafael; Manzanares, Jorge

    2013-01-01

    This study was aimed to evaluate the involvement of CB2 cannabinoid receptors (CB2r) in the rewarding, reinforcing and motivational effects of nicotine. Conditioned place preference (CPP) and intravenous self-administration experiments were carried out in knockout mice lacking CB2r (CB2KO) and wild-type (WT) littermates treated with the CB2r antagonist AM630 (1 and 3 mg/kg). Gene expression analyses of tyrosine hydroxylase (TH) and α3- and α4-nicotinic acetylcholine receptor subunits (nAChRs) in the ventral tegmental area (VTA) and immunohistochemical studies to elucidate whether CB2r colocalized with α3- and α4-nAChRs in the nucleus accumbens and VTA were performed. Mecamylamine-precipitated withdrawal syndrome after chronic nicotine exposure was evaluated in CB2KO mice and WT mice treated with AM630 (1 and 3 mg/kg). CB2KO mice did not show nicotine-induced place conditioning and self-administered significantly less nicotine. In addition, AM630 was able to block (3 mg/kg) nicotine-induced CPP and reduce (1 and 3 mg/kg) nicotine self-administration. Under baseline conditions, TH, α3-nAChR, and α4-nAChR mRNA levels in the VTA of CB2KO mice were significantly lower compared with WT mice. Confocal microscopy images revealed that CB2r colocalized with α3- and α4-nAChRs. Somatic signs of nicotine withdrawal (rearings, groomings, scratches, teeth chattering, and body tremors) increased significantly in WT but were absent in CB2KO mice. Interestingly, the administration of AM630 blocked the nicotine withdrawal syndrome and failed to alter basal behavior in saline-treated WT mice. These results suggest that CB2r play a relevant role in the rewarding, reinforcing, and motivational effects of nicotine. Pharmacological manipulation of this receptor deserves further consideration as a potential new valuable target for the treatment of nicotine dependence. PMID:23817165

  14. Reward-based learning of a redundant task.

    PubMed

    Tamagnone, Irene; Casadio, Maura; Sanguineti, Vittorio

    2013-06-01

    Motor skill learning has different components. When we acquire a new motor skill we have both to learn a reliable action-value map to select a highly rewarded action (task model) and to develop an internal representation of the novel dynamics of the task environment, in order to execute properly the action previously selected (internal model). Here we focus on a 'pure' motor skill learning task, in which adaptation to a novel dynamical environment is negligible and the problem is reduced to the acquisition of an action-value map, based only on knowledge of results. Subjects performed point-to-point movements, in which start and target positions were fixed and visible, but the score provided at the end of the movement depended on the distance of the trajectory from a hidden via-point. Subjects had no cues about the correct movement other than the score value. The task is highly redundant, as infinite trajectories are compatible with the maximum score. Our aim was to capture the strategies subjects use in the exploration of the task space and in the exploitation of the task redundancy during learning. The main findings were that (i) subjects did not converge to a unique solution; rather, their final trajectories were determined by a subject-specific history of exploration; and (ii) with learning, subjects reduced the trajectory's overall variability, but the point of minimum variability gradually shifted toward the portion of the trajectory closer to the hidden via-point. PMID:24187205

  15. Reward-based learning of a redundant task.

    PubMed

    Tamagnone, Irene; Casadio, Maura; Sanguineti, Vittorio

    2013-06-01

    Motor skill learning has different components. When we acquire a new motor skill we have both to learn a reliable action-value map to select a highly rewarded action (task model) and to develop an internal representation of the novel dynamics of the task environment, in order to execute properly the action previously selected (internal model). Here we focus on a 'pure' motor skill learning task, in which adaptation to a novel dynamical environment is negligible and the problem is reduced to the acquisition of an action-value map, based only on knowledge of results. Subjects performed point-to-point movements, in which start and target positions were fixed and visible, but the score provided at the end of the movement depended on the distance of the trajectory from a hidden via-point. Subjects had no cues about the correct movement other than the score value. The task is highly redundant, as infinite trajectories are compatible with the maximum score. Our aim was to capture the strategies subjects use in the exploration of the task space and in the exploitation of the task redundancy during learning. The main findings were that (i) subjects did not converge to a unique solution; rather, their final trajectories were determined by a subject-specific history of exploration; and (ii) with learning, subjects reduced the trajectory's overall variability, but the point of minimum variability gradually shifted toward the portion of the trajectory closer to the hidden via-point.

  16. Multiagent reinforcement learning: spiking and nonspiking agents in the iterated Prisoner's Dilemma.

    PubMed

    Vassiliades, Vassilis; Cleanthous, Aristodemos; Christodoulou, Chris

    2011-04-01

    This paper investigates multiagent reinforcement learning (MARL) in a general-sum game where the payoffs' structure is such that the agents are required to exploit each other in a way that benefits all agents. The contradictory nature of these games makes their study in multiagent systems quite challenging. In particular, we investigate MARL with spiking and nonspiking agents in the Iterated Prisoner's Dilemma by exploring the conditions required to enhance its cooperative outcome. The spiking agents are neural networks with leaky integrate-and-fire neurons trained with two different learning algorithms: 1) reinforcement of stochastic synaptic transmission, or 2) reward-modulated spike-timing-dependent plasticity with eligibility trace. The nonspiking agents use a tabular representation and are trained with Q- and SARSA learning algorithms, with a novel reward transformation process also being applied to the Q-learning agents. According to the results, the cooperative outcome is enhanced by: 1) transformed internal reinforcement signals and a combination of a high learning rate and a low discount factor with an appropriate exploration schedule in the case of non-spiking agents, and 2) having longer eligibility trace time constant in the case of spiking agents. Moreover, it is shown that spiking and nonspiking agents have similar behavior and therefore they can equally well be used in a multiagent interaction setting. For training the spiking agents in the case where more than one output neuron competes for reinforcement, a novel and necessary modification that enhances competition is applied to the two learning algorithms utilized, in order to avoid a possible synaptic saturation. This is done by administering to the networks additional global reinforcement signals for every spike of the output neurons that were not "responsible" for the preceding decision. PMID:21421435
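
    A minimal sketch of the nonspiking side of such a setup: two tabular Q-learning agents playing the iterated Prisoner's Dilemma, each conditioning on the previous joint action. The payoff matrix and hyperparameters are illustrative, and the reward-transformation and spiking-network components of the paper are not modeled.

      import random

      # Row player's payoffs: C = cooperate, D = defect (illustrative values).
      PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
      ACTIONS = ['C', 'D']
      STATES = [None, ('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]

      def make_agent():
          return {(s, a): 0.0 for s in STATES for a in ACTIONS}

      def choose(q, state, eps=0.1):
          if random.random() < eps:
              return random.choice(ACTIONS)
          return max(ACTIONS, key=lambda a: q[(state, a)])

      def play(episodes=5000, alpha=0.1, gamma=0.9):
          q1, q2 = make_agent(), make_agent()
          state = None                            # no history before the first move
          for _ in range(episodes):
              a1, a2 = choose(q1, state), choose(q2, state)
              r1, r2 = PAYOFF[(a1, a2)], PAYOFF[(a2, a1)]
              next_state = (a1, a2)
              for q, a, r in ((q1, a1, r1), (q2, a2, r2)):
                  best_next = max(q[(next_state, b)] for b in ACTIONS)
                  q[(state, a)] += alpha * (r + gamma * best_next - q[(state, a)])
              state = next_state
          return q1, q2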

  17. Effects of Cooperative versus Individual Study on Learning and Motivation after Reward-Removal

    ERIC Educational Resources Information Center

    Sears, David A.; Pai, Hui-Hua

    2012-01-01

    Rewards are frequently used in classrooms and recommended as a key component of well-researched methods of cooperative learning (e.g., Slavin, 1995). While many studies of cooperative learning find beneficial effects of rewards, many studies of individuals find negative effects (e.g., Deci, Koestner, & Ryan, 1999; Lepper, 1988). This may be…

  18. The Influence of Personality on Neural Mechanisms of Observational Fear and Reward Learning

    ERIC Educational Resources Information Center

    Hooker, Christine I.; Verosky, Sara C.; Miyakawa, Asako; Knight, Robert T.; D'Esposito, Mark

    2008-01-01

    Fear and reward learning can occur through direct experience or observation. Both channels can enhance survival or create maladaptive behavior. We used fMRI to isolate neural mechanisms of observational fear and reward learning and investigate whether neural response varied according to individual differences in neuroticism and extraversion.…

  19. A Flexible Mechanism of Rule Selection Enables Rapid Feature-Based Reinforcement Learning

    PubMed Central

    Balcarras, Matthew; Womelsdorf, Thilo

    2016-01-01

    Learning in a new environment is influenced by prior learning and experience. Correctly applying a rule that maps a context to stimuli, actions, and outcomes enables faster learning and better outcomes compared to relying on strategies for learning that are ignorant of task structure. However, it is often difficult to know when and how to apply learned rules in new contexts. In our study we explored how subjects employ different strategies for learning the relationship between stimulus features and positive outcomes in a probabilistic task context. We test the hypothesis that task-naive subjects will show enhanced learning of feature-specific reward associations by switching to the use of an abstract rule that associates stimuli by feature type and restricts selections to that dimension. To test this hypothesis we designed a decision making task where subjects receive probabilistic feedback following choices between pairs of stimuli. In the task, trials are grouped in two contexts by blocks, where in one type of block there is no unique relationship between a specific feature dimension (stimulus shape or color) and positive outcomes, and following an un-cued transition, alternating blocks have outcomes that are linked to either stimulus shape or color. Two-thirds of subjects (n = 22/32) exhibited behavior that was best fit by a hierarchical feature-rule model. Supporting the prediction of the model mechanism, these subjects showed significantly enhanced performance in feature-reward blocks, and rapidly switched their choice strategy to using abstract feature rules when reward contingencies changed. Choice behavior of other subjects (n = 10/32) was fit by a range of alternative reinforcement learning models representing strategies that do not benefit from applying previously learned rules. In summary, these results show that untrained subjects are capable of flexibly shifting between behavioral rules by leveraging simple model-free reinforcement learning and context
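
    A minimal sketch of the feature-based value learning that such a task is designed to detect: the prediction error is credited to the features of the chosen stimulus rather than to the stimulus as a whole, so value generalizes across stimuli sharing a feature. Feature names and parameters are illustrative, and the hierarchical rule-selection mechanism itself is not modeled.

      def feature_value(stimulus, w):
          """Value of a stimulus = sum of the values of its features."""
          return sum(w[f] for f in stimulus)

      def feature_rl_update(stimulus, reward, w, alpha=0.2):
          """Credit the prediction error to every feature of the chosen stimulus."""
          pe = reward - feature_value(stimulus, w)
          for f in stimulus:
              w[f] += alpha * pe
          return pe

      # Example: value learned for 'red' generalizes to any other red stimulus.
      w = {'red': 0.0, 'blue': 0.0, 'square': 0.0, 'circle': 0.0}
      feature_rl_update(('red', 'square'), reward=1.0, w=w)
      print(feature_value(('red', 'circle'), w))   # > 0 although never chosen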

  20. A Flexible Mechanism of Rule Selection Enables Rapid Feature-Based Reinforcement Learning.

    PubMed

    Balcarras, Matthew; Womelsdorf, Thilo

    2016-01-01

    Learning in a new environment is influenced by prior learning and experience. Correctly applying a rule that maps a context to stimuli, actions, and outcomes enables faster learning and better outcomes compared to relying on strategies for learning that are ignorant of task structure. However, it is often difficult to know when and how to apply learned rules in new contexts. In our study we explored how subjects employ different strategies for learning the relationship between stimulus features and positive outcomes in a probabilistic task context. We test the hypothesis that task-naive subjects will show enhanced learning of feature-specific reward associations by switching to the use of an abstract rule that associates stimuli by feature type and restricts selections to that dimension. To test this hypothesis we designed a decision making task where subjects receive probabilistic feedback following choices between pairs of stimuli. In the task, trials are grouped in two contexts by blocks, where in one type of block there is no unique relationship between a specific feature dimension (stimulus shape or color) and positive outcomes, and following an un-cued transition, alternating blocks have outcomes that are linked to either stimulus shape or color. Two-thirds of subjects (n = 22/32) exhibited behavior that was best fit by a hierarchical feature-rule model. Supporting the prediction of the model mechanism, these subjects showed significantly enhanced performance in feature-reward blocks, and rapidly switched their choice strategy to using abstract feature rules when reward contingencies changed. Choice behavior of other subjects (n = 10/32) was fit by a range of alternative reinforcement learning models representing strategies that do not benefit from applying previously learned rules. In summary, these results show that untrained subjects are capable of flexibly shifting between behavioral rules by leveraging simple model-free reinforcement learning and context

  1. Striatal dopamine ramping may indicate flexible reinforcement learning with forgetting in the cortico-basal ganglia circuits.

    PubMed

    Morita, Kenji; Kato, Ayaka

    2014-01-01

    It has been suggested that the midbrain dopamine (DA) neurons, receiving inputs from the cortico-basal ganglia (CBG) circuits and the brainstem, compute reward prediction error (RPE), the difference between reward obtained or expected to be obtained and reward that had been expected to be obtained. These reward expectations are suggested to be stored in the CBG synapses and updated according to RPE through synaptic plasticity, which is induced by released DA. These together constitute the "DA=RPE" hypothesis, which describes the mutual interaction between DA and the CBG circuits and serves as the primary working hypothesis in studying reward learning and value-based decision-making. However, recent work has revealed a new type of DA signal that appears not to represent RPE. Specifically, it has been found in a reward-associated maze task that striatal DA concentration primarily shows a gradual increase toward the goal. We explored whether such ramping DA could be explained by extending the "DA=RPE" hypothesis to take into account biological properties of the CBG circuits. In particular, we examined the effects of a possible time-dependent decay of DA-dependent plastic changes of synaptic strengths by incorporating decay of learned values into the RPE-based reinforcement learning model and simulating reward learning tasks. We then found that incorporating such a decay dramatically changes the model's behavior, causing gradual ramping of RPE. Moreover, we further incorporated magnitude-dependence of the rate of decay, which could potentially be in accord with some past observations, and found that near-sigmoidal ramping of RPE, resembling the observed DA ramping, could then occur. Given that synaptic decay can be useful for flexibly reversing and updating the learned reward associations, especially in cases where baseline DA is low and encoding of negative RPE by DA is limited, the observed DA ramping would be indicative of the operation of such flexible reward learning.
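
    A minimal sketch of the key manipulation described above: a standard TD value update on a linear track plus a per-episode decay of the learned values. Without decay the RPE vanishes once values converge; with decay the values stay below their asymptotes and the RPE ramps toward the rewarded goal state. The track length and parameters are illustrative.

      import numpy as np

      def run_episode(V, rewards, alpha=0.3, gamma=1.0, decay=0.02):
          """One pass along a linear track; returns the per-state RPEs."""
          rpes = []
          for s in range(len(rewards)):
              v_next = V[s + 1] if s + 1 < len(V) else 0.0
              rpe = rewards[s] + gamma * v_next - V[s]
              rpes.append(rpe)
              V[s] += alpha * rpe
          V *= (1.0 - decay)                  # decay ("forgetting") of learned values
          return rpes

      n_states = 10
      V = np.zeros(n_states)
      rewards = np.zeros(n_states)
      rewards[-1] = 1.0                       # single reward at the goal
      for _ in range(200):
          rpes = run_episode(V, rewards)
      print(np.round(rpes, 3))                # with decay > 0 the RPE ramps up toward the goal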

  2. Dissociable neural representations of reinforcement and belief prediction errors underlie strategic learning

    PubMed Central

    Zhu, Lusha; Mathewson, Kyle E.; Hsu, Ming

    2012-01-01

    Decision-making in the presence of other competitive intelligent agents is fundamental for social and economic behavior. Such decisions require agents to behave strategically, where in addition to learning about the rewards and punishments available in the environment, they also need to anticipate and respond to actions of others competing for the same rewards. However, whereas we know much about strategic learning at both theoretical and behavioral levels, we know relatively little about the underlying neural mechanisms. Here, we show using a multi-strategy competitive learning paradigm that strategic choices can be characterized by extending the reinforcement learning (RL) framework to incorporate agents’ beliefs about the actions of their opponents. Furthermore, using this characterization to generate putative internal values, we used model-based functional magnetic resonance imaging to investigate neural computations underlying strategic learning. We found that the distinct notions of prediction errors derived from our computational model are processed in a partially overlapping but distinct set of brain regions. Specifically, we found that the RL prediction error was correlated with activity in the ventral striatum. In contrast, activity in the ventral striatum, as well as the rostral anterior cingulate (rACC), was correlated with a previously uncharacterized belief-based prediction error. Furthermore, activity in rACC reflected individual differences in degree of engagement in belief learning. These results suggest a model of strategic behavior where learning arises from interaction of dissociable reinforcement and belief-based inputs. PMID:22307594
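
    The following is a simplified illustration of the general idea of combining a reinforcement prediction error with a belief-based prediction error in a competitive two-action game; the random opponent, payoff rule, and update constants are assumptions for the example and do not reproduce the authors' fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hybrid learner: action values are driven both by a
# reinforcement prediction error (reward received vs. expected) and by a
# belief prediction error (the opponent's observed action vs. the agent's
# belief about the opponent's strategy), with expected/forgone payoffs
# evaluated against that belief.
n_actions = 2
Q = np.zeros(n_actions)
belief = np.full(n_actions, 1.0 / n_actions)
alpha_q, alpha_b, beta = 0.2, 0.2, 3.0

def payoff(my_action, opp_action):
    # Matching-pennies-style payoff: the agent wins when the actions match.
    return 1.0 if my_action == opp_action else 0.0

for t in range(300):
    probs = np.exp(beta * Q) / np.exp(beta * Q).sum()    # softmax choice
    a = int(rng.choice(n_actions, p=probs))
    opp = int(rng.integers(n_actions))                   # stand-in opponent playing randomly

    # Reinforcement prediction error: only the chosen action is updated
    # from the reward actually received.
    Q[a] += alpha_q * (payoff(a, opp) - Q[a])

    # Belief prediction error: the opponent's observed action updates the
    # belief about their strategy ...
    belief += alpha_b * (np.eye(n_actions)[opp] - belief)
    # ... and every action is re-evaluated against the updated belief
    # (the belief-based component of learning).
    for act in range(n_actions):
        expected = sum(belief[o] * payoff(act, o) for o in range(n_actions))
        Q[act] += alpha_b * (expected - Q[act])

print("action values:", np.round(Q, 3), "opponent belief:", np.round(belief, 3))
```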

  3. Convergence of reinforcement learning algorithms and acceleration of learning

    NASA Astrophysics Data System (ADS)

    Potapov, A.; Ali, M. K.

    2003-02-01

    The techniques of reinforcement learning have been gaining increasing popularity recently. However, the question of their convergence rate is still open. We consider the problem of choosing the learning steps α_n and their relation to the discount factor γ and the exploration degree ε. Appropriate choices of these parameters may drastically influence the convergence rate of the techniques. From analytical examples, we conjecture optimal values of α_n and then use numerical examples to verify our conjectures.
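
    A hedged illustration of the parameters discussed (not the paper's analytical examples): tabular Q-learning on a small chain task with a noisy terminal reward, comparing a decaying learning-step schedule α_n = 1/n against a constant step, under ε-greedy exploration and discount γ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy chain task: actions are 0 = stay (no reward) and 1 = step right; a
# noisy reward (mean 1) is delivered on reaching the last state. The chain
# length, noise, and parameter values are assumptions for the example.
n_states, gamma, epsilon, episodes = 5, 0.9, 0.1, 2000

def run(schedule):
    Q = np.zeros((n_states, 2))
    visits = np.zeros((n_states, 2))
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            a = int(rng.integers(2)) if rng.random() < epsilon else int(Q[s].argmax())
            s_next = s + 1 if a == 1 else s
            r = rng.normal(1.0, 0.5) if s_next == n_states - 1 else 0.0
            visits[s, a] += 1
            alpha = schedule(visits[s, a])               # learning step alpha_n
            target = r + gamma * (0.0 if s_next == n_states - 1 else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q[0, 1]   # learned value of stepping right from the start state

true_value = gamma ** (n_states - 2)   # expected discounted value (terminal reward has mean 1)
print("true value            :", round(true_value, 4))
print("alpha_n = 1/n         :", round(run(lambda n: 1.0 / n), 4))
print("alpha_n = 0.1 (const) :", round(run(lambda n: 0.1), 4))
```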

  4. Experiential reward learning outweighs instruction prior to adulthood

    PubMed Central

    Decker, Johannes H.; Lourenco, Frederico S.; Doll, Bradley B.; Hartley, Catherine A.

    2015-01-01

    Throughout our lives, we face the important task of distinguishing rewarding actions from those that are best avoided. Importantly, there are multiple means by which we acquire this information. Through trial and error, we use experiential feedback to evaluate our actions. We also learn which actions are advantageous through explicit instruction from others. Here, we examined whether the influence of these two forms of learning on choice changes across development by placing instruction and experience in competition in a probabilistic-learning task. Whereas inaccurate instruction markedly biased adults’ estimations of a stimulus’s value, children and adolescents were better able to objectively estimate stimulus values through experience. Instructional control of learning is thought to recruit prefrontal–striatal brain circuitry, which continues to mature into adulthood. Our behavioral data suggest that this protracted neurocognitive maturation may cause the motivated actions of children and adolescents to be less influenced by explicit instruction than are those of adults. This absence of a confirmation bias in children and adolescents represents a paradoxical developmental advantage of youth over adults in the unbiased evaluation of actions through positive and negative experience. PMID:25582607

  5. Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity.

    PubMed

    Warlaumont, Anne S; Finnegan, Megan K

    2016-01-01

    At around 7 months of age, human infants begin to reliably produce well-formed syllables containing both consonants and vowels, a behavior called canonical babbling. Over subsequent months, the frequency of canonical babbling continues to increase. How the infant's nervous system supports the acquisition of this ability is unknown. Here we present a computational model that combines a spiking neural network, reinforcement-modulated spike-timing-dependent plasticity, and a human-like vocal tract to simulate the acquisition of canonical babbling. Like human infants, the model's frequency of canonical babbling gradually increases. The model is rewarded when it produces a sound that is more auditorily salient than sounds it has previously produced. This is consistent with data from human infants indicating that contingent adult responses shape infant behavior and with data from deaf and tracheostomized infants indicating that hearing, including hearing one's own vocalizations, is critical for canonical babbling development. Reward receipt increases the level of dopamine in the neural network. The neural network contains a reservoir with recurrent connections and two motor neuron groups, one agonist and one antagonist, which control the masseter and orbicularis oris muscles, promoting or inhibiting mouth closure. The model learns to increase the number of salient, syllabic sounds it produces by adjusting the base level of muscle activation and increasing their range of activity. Our results support the possibility that through dopamine-modulated spike-timing-dependent plasticity, the motor cortex learns to harness its natural oscillations in activity in order to produce syllabic sounds. It thus suggests that learning to produce rhythmic mouth movements for speech production may be supported by general cortical learning mechanisms. The model makes several testable predictions and has implications for our understanding not only of how syllabic vocalizations develop in
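
    A drastically reduced, single-synapse sketch of the plasticity mechanism named above: pair-based spike-timing-dependent plasticity (STDP) writes into an eligibility trace, and a phasic dopamine-like reward pulse converts that trace into a weight change. The spike rates, time constants, and reward schedule are illustrative assumptions, not the paper's spiking reservoir and vocal-tract model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy reward-modulated STDP at one synapse: pre/post spike pairings build up
# an eligibility trace, and the weight only changes when a dopamine-like
# reward signal arrives. Everything here is an illustrative assumption.
dt = 1.0                         # ms per step
T = 2000                         # total steps
tau_plus = tau_minus = 20.0      # STDP trace time constants (ms)
tau_elig = 500.0                 # eligibility-trace time constant (ms)
A_plus, A_minus = 0.01, 0.012    # potentiation / depression amplitudes
lr = 0.5                         # scale of dopamine-gated weight change

w = 0.5
pre_trace = post_trace = elig = 0.0

for t in range(T):
    pre = rng.random() < 0.02    # Poisson-like presynaptic spike
    post = rng.random() < 0.02   # Poisson-like postsynaptic spike
    # Low-pass filtered spike traces used by pair-based STDP.
    pre_trace += -pre_trace * dt / tau_plus + (1.0 if pre else 0.0)
    post_trace += -post_trace * dt / tau_minus + (1.0 if post else 0.0)
    # The STDP contribution goes into an eligibility trace, not the weight.
    stdp = (A_plus * pre_trace if post else 0.0) - (A_minus * post_trace if pre else 0.0)
    elig += -elig * dt / tau_elig + stdp
    # Phasic dopamine: a reward pulse every 500 ms gates the weight change.
    dopamine = 1.0 if (t % 500) == 499 else 0.0
    w = float(np.clip(w + lr * dopamine * elig, 0.0, 1.0))

print("final synaptic weight:", round(w, 3))
```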

  6. Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity.

    PubMed

    Warlaumont, Anne S; Finnegan, Megan K

    2016-01-01

    At around 7 months of age, human infants begin to reliably produce well-formed syllables containing both consonants and vowels, a behavior called canonical babbling. Over subsequent months, the frequency of canonical babbling continues to increase. How the infant's nervous system supports the acquisition of this ability is unknown. Here we present a computational model that combines a spiking neural network, reinforcement-modulated spike-timing-dependent plasticity, and a human-like vocal tract to simulate the acquisition of canonical babbling. Like human infants, the model's frequency of canonical babbling gradually increases. The model is rewarded when it produces a sound that is more auditorily salient than sounds it has previously produced. This is consistent with data from human infants indicating that contingent adult responses shape infant behavior and with data from deaf and tracheostomized infants indicating that hearing, including hearing one's own vocalizations, is critical for canonical babbling development. Reward receipt increases the level of dopamine in the neural network. The neural network contains a reservoir with recurrent connections and two motor neuron groups, one agonist and one antagonist, which control the masseter and orbicularis oris muscles, promoting or inhibiting mouth closure. The model learns to increase the number of salient, syllabic sounds it produces by adjusting the base level of muscle activation and increasing their range of activity. Our results support the possibility that through dopamine-modulated spike-timing-dependent plasticity, the motor cortex learns to harness its natural oscillations in activity in order to produce syllabic sounds. It thus suggests that learning to produce rhythmic mouth movements for speech production may be supported by general cortical learning mechanisms. The model makes several testable predictions and has implications for our understanding not only of how syllabic vocalizations develop in

  7. Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity

    PubMed Central

    Warlaumont, Anne S.; Finnegan, Megan K.

    2016-01-01

    At around 7 months of age, human infants begin to reliably produce well-formed syllables containing both consonants and vowels, a behavior called canonical babbling. Over subsequent months, the frequency of canonical babbling continues to increase. How the infant’s nervous system supports the acquisition of this ability is unknown. Here we present a computational model that combines a spiking neural network, reinforcement-modulated spike-timing-dependent plasticity, and a human-like vocal tract to simulate the acquisition of canonical babbling. Like human infants, the model’s frequency of canonical babbling gradually increases. The model is rewarded when it produces a sound that is more auditorily salient than sounds it has previously produced. This is consistent with data from human infants indicating that contingent adult responses shape infant behavior and with data from deaf and tracheostomized infants indicating that hearing, including hearing one’s own vocalizations, is critical for canonical babbling development. Reward receipt increases the level of dopamine in the neural network. The neural network contains a reservoir with recurrent connections and two motor neuron groups, one agonist and one antagonist, which control the masseter and orbicularis oris muscles, promoting or inhibiting mouth closure. The model learns to increase the number of salient, syllabic sounds it produces by adjusting the base level of muscle activation and increasing their range of activity. Our results support the possibility that through dopamine-modulated spike-timing-dependent plasticity, the motor cortex learns to harness its natural oscillations in activity in order to produce syllabic sounds. It thus suggests that learning to produce rhythmic mouth movements for speech production may be supported by general cortical learning mechanisms. The model makes several testable predictions and has implications for our understanding not only of how syllabic vocalizations develop

  8. Knowledge-Based Reinforcement Learning for Data Mining

    NASA Astrophysics Data System (ADS)

    Kudenko, Daniel; Grzes, Marek

    Data Mining is the process of extracting patterns from data. Two general avenues of research in the intersecting areas of agents and data mining can be distinguished. The first approach is concerned with mining an agent’s observation data in order to extract patterns, categorize environment states, and/or make predictions of future states. In this setting, data is normally available as a batch, and the agent’s actions and goals are often independent of the data mining task. The data collection is mainly considered a side effect of the agent’s activities. Machine learning techniques applied in such situations fall into the class of supervised learning. In contrast, the second scenario occurs where an agent is actively performing the data mining, and is responsible for the data collection itself. For example, a mobile network agent is acquiring and processing data (where the acquisition may incur a certain cost), or a mobile sensor agent is moving in a (perhaps hostile) environment, collecting and processing sensor readings. In these settings, the tasks of the agent and the data mining are highly intertwined and interdependent (or even identical). Supervised learning is not a suitable technique for these cases. Reinforcement Learning (RL) enables an agent to learn from experience (in the form of reward and punishment for explorative actions) and adapt to new situations, without a teacher. RL is an ideal learning technique for these data mining scenarios, because it fits the agent paradigm of continuous sensing and acting, and the RL agent is able to learn to make decisions on the sampling of the environment which provides the data. Nevertheless, RL still suffers from scalability problems, which have prevented its successful use in many complex real-world domains. The more complex the tasks, the longer it takes a reinforcement learning algorithm to converge to a good solution. For many real-world tasks, human expert knowledge is available. For example, human

  9. Post-learning Hippocampal Dynamics Promote Preferential Retention of Rewarding Events.

    PubMed

    Gruber, Matthias J; Ritchey, Maureen; Wang, Shao-Fang; Doss, Manoj K; Ranganath, Charan

    2016-03-01

    Reward motivation is known to modulate memory encoding, and this effect depends on interactions between the substantia nigra/ventral tegmental area complex (SN/VTA) and the hippocampus. It is unknown, however, whether these interactions influence offline neural activity in the human brain that is thought to promote memory consolidation. Here we used fMRI to test the effect of reward motivation on post-learning neural dynamics and subsequent memory for objects that were learned in high- and low-reward motivation contexts. We found that post-learning increases in resting-state functional connectivity between the SN/VTA and hippocampus predicted preferential retention of objects that were learned in high-reward contexts. In addition, multivariate pattern classification revealed that hippocampal representations of high-reward contexts were preferentially reactivated during post-learning rest, and the number of hippocampal reactivations was predictive of preferential retention of items learned in high-reward contexts. These findings indicate that reward motivation alters offline post-learning dynamics between the SN/VTA and hippocampus, providing novel evidence for a potential mechanism by which reward could influence memory consolidation. PMID:26875624

  10. Post-learning Hippocampal Dynamics Promote Preferential Retention of Rewarding Events.

    PubMed

    Gruber, Matthias J; Ritchey, Maureen; Wang, Shao-Fang; Doss, Manoj K; Ranganath, Charan

    2016-03-01

    Reward motivation is known to modulate memory encoding, and this effect depends on interactions between the substantia nigra/ventral tegmental area complex (SN/VTA) and the hippocampus. It is unknown, however, whether these interactions influence offline neural activity in the human brain that is thought to promote memory consolidation. Here we used fMRI to test the effect of reward motivation on post-learning neural dynamics and subsequent memory for objects that were learned in high- and low-reward motivation contexts. We found that post-learning increases in resting-state functional connectivity between the SN/VTA and hippocampus predicted preferential retention of objects that were learned in high-reward contexts. In addition, multivariate pattern classification revealed that hippocampal representations of high-reward contexts were preferentially reactivated during post-learning rest, and the number of hippocampal reactivations was predictive of preferential retention of items learned in high-reward contexts. These findings indicate that reward motivation alters offline post-learning dynamics between the SN/VTA and hippocampus, providing novel evidence for a potential mechanism by which reward could influence memory consolidation.

  11. Reward learning in pediatric depression and anxiety: Preliminary findings in a high-risk sample

    PubMed Central

    Morris, Bethany H.; Bylsma, Lauren M.; Yaroslavsky, Ilya; Kovacs, Maria; Rottenberg, Jonathan

    2015-01-01

    Background Reward learning has been postulated as a critical component of hedonic functioning that predicts depression risk. Reward learning deficits have been established in adults with current depressive disorders, but no prior studies have examined the relationship of reward learning and depression in children. The present study investigated reward learning as a function of familial depression risk and current diagnostic status in a pediatric sample. Method The sample included 204 children of parents with a history of depression (n=86 high-risk offspring) or parents with no history of major mental disorder (n=118 low-risk offspring). Semi-structured clinical interviews were used to establish current mental diagnoses in the children. A modified signal detection task was used to assess reward learning. We tested whether reward learning was impaired in high-risk offspring relative to low-risk offspring. We also tested whether reward learning was impaired in children with current disorders known to blunt hedonic function (depression, social phobia, PTSD, GAD, n=13) compared to children with no disorders and to a psychiatric comparison group with ADHD. Results High- and low-risk youth did not differ in reward learning. However, youth with current anhedonic disorders (depression, social phobia, PTSD, GAD) exhibited blunted reward learning relative to nondisordered youth and those with ADHD. Conclusions Our results are a first demonstration that reward learning deficits are present among youth with disorders known to blunt hedonic function and that these deficits have some degree of diagnostic specificity. We advocate for future studies to replicate and extend these preliminary findings. PMID:25826304

  12. Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control.

    PubMed

    Uragami, Daisuke; Takahashi, Tatsuji; Matsuo, Yoshiki

    2014-02-01

    Many algorithms and methods in artificial intelligence and machine learning were inspired by human cognition. As a mechanism to handle the exploration-exploitation dilemma in reinforcement learning, the loosely symmetric (LS) value function, which models human causal intuition, was proposed (Shinohara et al., 2007). While LS shows the highest correlation with causal induction by humans, it has been reported to work effectively in multi-armed bandit problems, the simplest class of tasks representing the dilemma. However, the scope of application of LS was limited to reinforcement learning problems that have K actions with only one state (K-armed bandit problems). This study proposes the LS-Q learning architecture, which can deal with general reinforcement learning tasks with multiple states and delayed reward. We tested the learning performance of the new architecture on giant-swing robot motion learning, where the uncertainty and unfamiliarity of the environment are large. In the test, no ready-made internal models or function approximation of the state space were provided. The simulations showed that while an ordinary Q-learning agent does not acquire the giant-swing motion because of stagnant loops (local optima with low rewards), LS-Q escapes such loops and acquires the giant-swing. We confirmed that the smaller the number of states, in other words, the more coarse-grained the division of states and the more incomplete the state observation, the better LS-Q performs in comparison with Q-learning. We also showed that the high performance of LS-Q depends comparatively little on parameter tuning and learning time. This suggests that the proposed method, inspired by human cognition, works adaptively in real environments.
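
    The abstract compares LS-Q against ordinary Q-learning on a coarsely discretized state space. Below is a sketch of that ordinary tabular Q-learning baseline on a toy pendulum swing-up standing in for the giant-swing task; the dynamics, discretization, and rewards are assumptions for the example, and the loosely symmetric (LS) value function itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)

# Tabular Q-learning baseline on a crude pendulum swing-up with a
# deliberately coarse state division. LS-Q would replace the value estimate
# used for action selection with the loosely symmetric function; that part
# is not shown here. All constants are illustrative assumptions.
n_angle_bins, n_vel_bins = 8, 5
actions = np.array([-1.0, 0.0, 1.0])     # applied torque
gamma, alpha, epsilon = 0.95, 0.1, 0.1
Q = np.zeros((n_angle_bins * n_vel_bins, len(actions)))

def discretize(theta, omega):
    a_bin = int((theta % (2 * np.pi)) / (2 * np.pi) * n_angle_bins)
    v_bin = int(np.clip((omega + 8) / 16 * n_vel_bins, 0, n_vel_bins - 1))
    return a_bin * n_vel_bins + v_bin

for episode in range(500):
    theta, omega = np.pi, 0.0            # start hanging straight down (theta = 0 is upright)
    s = discretize(theta, omega)
    for step in range(200):
        a = int(rng.integers(len(actions))) if rng.random() < epsilon else int(Q[s].argmax())
        omega += (9.8 * np.sin(theta) + 2.0 * actions[a]) * 0.05   # crude dynamics
        theta += omega * 0.05
        r = np.cos(theta)                # highest reward when upright
        s_next = discretize(theta, omega)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("mean learned state value:", round(float(Q.max(axis=1).mean()), 3))
```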

  13. Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control.

    PubMed

    Uragami, Daisuke; Takahashi, Tatsuji; Matsuo, Yoshiki

    2014-02-01

    Many algorithms and methods in artificial intelligence and machine learning were inspired by human cognition. As a mechanism to handle the exploration-exploitation dilemma in reinforcement learning, the loosely symmetric (LS) value function, which models human causal intuition, was proposed (Shinohara et al., 2007). While LS shows the highest correlation with causal induction by humans, it has been reported to work effectively in multi-armed bandit problems, the simplest class of tasks representing the dilemma. However, the scope of application of LS was limited to reinforcement learning problems that have K actions with only one state (K-armed bandit problems). This study proposes the LS-Q learning architecture, which can deal with general reinforcement learning tasks with multiple states and delayed reward. We tested the learning performance of the new architecture on giant-swing robot motion learning, where the uncertainty and unfamiliarity of the environment are large. In the test, no ready-made internal models or function approximation of the state space were provided. The simulations showed that while an ordinary Q-learning agent does not acquire the giant-swing motion because of stagnant loops (local optima with low rewards), LS-Q escapes such loops and acquires the giant-swing. We confirmed that the smaller the number of states, in other words, the more coarse-grained the division of states and the more incomplete the state observation, the better LS-Q performs in comparison with Q-learning. We also showed that the high performance of LS-Q depends comparatively little on parameter tuning and learning time. This suggests that the proposed method, inspired by human cognition, works adaptively in real environments. PMID:24296286

  14. The Effects of Verbal and Material Rewards and Punishers on the Performance of Impulsive and Reflective Children

    ERIC Educational Resources Information Center

    Firestone, Philip; Douglas, Virginia I.

    1977-01-01

    Impulsive and reflective children performed in a discrimination learning task which included four reinforcement conditions: verbal-reward, verbal-punishment, material-reward, and material-punishment. (SB)

  15. Integration of reinforcement learning and optimal decision-making theories of the basal ganglia.

    PubMed

    Bogacz, Rafal; Larsen, Tobias

    2011-04-01

    This article seeks to integrate two sets of theories describing action selection in the basal ganglia: reinforcement learning theories describing learning which actions to select to maximize reward and decision-making theories proposing that the basal ganglia selects actions on the basis of sensory evidence accumulated in the cortex. In particular, we present a model that integrates the actor-critic model of reinforcement learning and a model assuming that the cortico-basal-ganglia circuit implements a statistically optimal decision-making procedure. The values of cortico-striatal weights required for optimal decision making in our model differ from those provided by standard reinforcement learning models. Nevertheless, we show that an actor-critic model converges to the weights required for optimal decision making when biologically realistic limits on synaptic weights are introduced. We also describe the model's predictions concerning reaction times and neural responses during learning, and we discuss directions required for further integration of reinforcement learning and optimal decision-making theories. PMID:21222528
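
    A minimal actor-critic sketch with a non-negativity bound on the actor ("cortico-striatal") weights, to illustrate the kind of biologically realistic limit on synaptic strengths the abstract refers to. The two-alternative bandit task and constants are assumptions; the article's integration of accumulated sensory evidence is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)

# Actor-critic on a two-alternative probabilistic task. The critic learns a
# state value and emits a prediction error; the actor weights are updated by
# that error and clipped to stay non-negative, a simple stand-in for a
# biologically motivated bound on synaptic strength. Constants are assumed.
p_reward = np.array([0.8, 0.2])    # reward probability of each action
w = np.zeros(2)                    # actor weights (bounded below at 0)
v = 0.0                            # critic's state value
alpha_actor, alpha_critic, beta = 0.1, 0.1, 5.0

for t in range(2000):
    probs = np.exp(beta * w) / np.exp(beta * w).sum()   # softmax over actor weights
    a = int(rng.choice(2, p=probs))
    r = float(rng.random() < p_reward[a])
    delta = r - v                          # prediction error from the critic
    v += alpha_critic * delta
    w[a] += alpha_actor * delta            # actor update gated by the prediction error
    w = np.clip(w, 0.0, None)              # realistic lower bound on the weights

print("actor weights:", np.round(w, 3), "critic value:", round(v, 3))
```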

  16. Integration of reinforcement learning and optimal decision-making theories of the basal ganglia.

    PubMed

    Bogacz, Rafal; Larsen, Tobias

    2011-04-01

    This article seeks to integrate two sets of theories describing action selection in the basal ganglia: reinforcement learning theories describing learning which actions to select to maximize reward and decision-making theories proposing that the basal ganglia selects actions on the basis of sensory evidence accumulated in the cortex. In particular, we present a model that integrates the actor-critic model of reinforcement learning and a model assuming that the cortico-basal-ganglia circuit implements a statistically optimal decision-making procedure. The values of cortico-striatal weights required for optimal decision making in our model differ from those provided by standard reinforcement learning models. Nevertheless, we show that an actor-critic model converges to the weights required for optimal decision making when biologically realistic limits on synaptic weights are introduced. We also describe the model's predictions concerning reaction times and neural responses during learning, and we discuss directions required for further integration of reinforcement learning and optimal decision-making theories.

  17. A reinforcement learning approach to instrumental contingency degradation in rats.

    PubMed

    Dutech, Alain; Coutureau, Etienne; Marchand, Alain R

    2011-01-01

    Goal-directed action involves a representation of action consequences. Adapting to changes in action-outcome contingency requires the prefrontal region. Indeed, rats with lesions of the medial prefrontal cortex do not adapt their free operant response when food delivery becomes unrelated to lever-pressing. The present study explores the bases of this deficit through a combined behavioural and computational approach. We show that lesioned rats retain some behavioural flexibility and stop pressing if this action prevents food delivery. We attempt to model this phenomenon in a reinforcement learning framework. The model assumes that distinct action values are learned in an incremental manner in distinct states. The model represents states as n-uplets of events, emphasizing sequences rather than the continuous passage of time. Probabilities of lever-pressing and visits to the food magazine observed in the behavioural experiments are first analyzed as a function of these states, to identify sequences of events that influence action choice. Observed action probabilities appear to be essentially a function of the last event that occurred, with reward delivery and waiting significantly facilitating magazine visits and lever-pressing, respectively. Behavioural sequences of normal and lesioned rats are then fed into the model, action values are updated at each event transition according to the SARSA algorithm, and predicted action probabilities are derived through a softmax policy. The model captures the time course of learning, as well as the differential adaptation of normal and prefrontal lesioned rats to contingency degradation, using the same parameters for both groups. The results suggest that simple temporal difference algorithms with low learning rates can largely account for instrumental learning and performance. Prefrontal lesioned rats appear to mainly differ from control rats in their low rates of visits to the magazine after a lever press, and their inability to
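
    A compact sketch of the modelling recipe described above: states defined by the most recent event, SARSA updates at each event transition, and a softmax policy over lever-press, magazine-visit, and wait actions. The event set, contingency, and parameters are illustrative assumptions, not the fitted model.

```python
import numpy as np

rng = np.random.default_rng(5)

# States are the last event; SARSA updates action values at event
# transitions; a softmax over the values yields action probabilities.
events = ["press", "visit", "reward", "wait"]
actions = ["press", "visit", "wait"]
alpha, gamma, beta = 0.1, 0.9, 3.0
Q = {e: np.zeros(len(actions)) for e in events}   # Q[state][action]

def softmax(q):
    z = np.exp(beta * (q - q.max()))
    return z / z.sum()

def environment(action, contingent=True):
    """Return the next event. Under the contingent schedule a lever press
    sometimes produces food; degradation would remove that link."""
    if action == "press":
        return "reward" if (contingent and rng.random() < 0.3) else "press"
    return "visit" if action == "visit" else "wait"

state, a_idx = "wait", actions.index("wait")
for t in range(5000):
    next_state = environment(actions[a_idx])
    # Food is only collected by visiting the magazine after a delivery.
    r = 1.0 if (state == "reward" and actions[a_idx] == "visit") else 0.0
    next_a = int(rng.choice(len(actions), p=softmax(Q[next_state])))
    # SARSA: bootstrap from the action actually taken in the next state.
    Q[state][a_idx] += alpha * (r + gamma * Q[next_state][next_a] - Q[state][a_idx])
    state, a_idx = next_state, next_a

for e in events:
    print(e, "-> choice probabilities", np.round(softmax(Q[e]), 2))
```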

  18. Single amino acids in sucrose rewards modulate feeding and associative learning in the honeybee.

    PubMed

    Simcock, Nicola K; Gray, Helen E; Wright, Geraldine A

    2014-10-01

    Obtaining the correct balance of nutrients requires that the brain integrates information about the body's nutritional state with sensory information from food to guide feeding behaviour. Learning is a mechanism that allows animals to identify cues associated with nutrients so that they can be located quickly when required. Feedback about nutritional state is essential for nutrient balancing and could influence learning. How specific this feedback is to individual nutrients has not often been examined. Here, we tested how the honeybee's nutritional state influenced the likelihood it would feed on and learn sucrose solutions containing single amino acids. Nutritional state was manipulated by pre-feeding bees with either 1M sucrose or 1M sucrose containing 100mM of isoleucine, proline, phenylalanine, or methionine 24h prior to olfactory conditioning of the proboscis extension response. We found that bees pre-fed sucrose solution consumed less of solutions containing amino acids and were also less likely to learn to associate amino acid solutions with odours. Unexpectedly, bees pre-fed solutions containing an amino acid were also less likely to learn to associate odours with sucrose the next day. Furthermore, they consumed more of and were more likely to learn when rewarded with an amino acid solution if they were pre-fed isoleucine and proline. Our data indicate that single amino acids at relatively high concentrations inhibit feeding on sucrose solutions containing them, and they can act as appetitive reinforcers during learning. Our data also suggest that select amino acids interact with mechanisms that signal nutritional sufficiency to reduce hunger. Based on these experiments, we predict that nutrient balancing for essential amino acids during learning requires integration of information about several amino acids experienced simultaneously. PMID:24819203

  19. Single amino acids in sucrose rewards modulate feeding and associative learning in the honeybee.

    PubMed

    Simcock, Nicola K; Gray, Helen E; Wright, Geraldine A

    2014-10-01

    Obtaining the correct balance of nutrients requires that the brain integrates information about the body's nutritional state with sensory information from food to guide feeding behaviour. Learning is a mechanism that allows animals to identify cues associated with nutrients so that they can be located quickly when required. Feedback about nutritional state is essential for nutrient balancing and could influence learning. How specific this feedback is to individual nutrients has not often been examined. Here, we tested how the honeybee's nutritional state influenced the likelihood it would feed on and learn sucrose solutions containing single amino acids. Nutritional state was manipulated by pre-feeding bees with either 1M sucrose or 1M sucrose containing 100mM of isoleucine, proline, phenylalanine, or methionine 24h prior to olfactory conditioning of the proboscis extension response. We found that bees pre-fed sucrose solution consumed less of solutions containing amino acids and were also less likely to learn to associate amino acid solutions with odours. Unexpectedly, bees pre-fed solutions containing an amino acid were also less likely to learn to associate odours with sucrose the next day. Furthermore, they consumed more of and were more likely to learn when rewarded with an amino acid solution if they were pre-fed isoleucine and proline. Our data indicate that single amino acids at relatively high concentrations inhibit feeding on sucrose solutions containing them, and they can act as appetitive reinforcers during learning. Our data also suggest that select amino acids interact with mechanisms that signal nutritional sufficiency to reduce hunger. Based on these experiments, we predict that nutrient balancing for essential amino acids during learning requires integration of information about several amino acids experienced simultaneously.

  20. Negative reinforcement learning is affected in substance dependence

    PubMed Central

    Thompson, Laetitia L.; Claus, Eric D.; Mikulich-Gilbertson, Susan K.; Banich, Marie T.; Crowley, Thomas; Krmpotich, Theodore; Miller, David; Tanabe, Jody

    2011-01-01

    Background Negative reinforcement results in behavior to escape or avoid an aversive outcome. Withdrawal symptoms are purported to be negative reinforcers in perpetuating substance dependence, but little is known about negative reinforcement learning in this population. The purpose of this study was to examine reinforcement learning in substance dependent individuals (SDI), with an emphasis on assessing negative reinforcement learning. We modified the Iowa Gambling Task to separately assess positive and negative reinforcement. We hypothesized that SDI would show differences in negative reinforcement learning compared to controls and we investigated whether learning differed as a function of the relative magnitude or frequency of the reinforcer. Methods Thirty subjects dependent on psychostimulants were compared with 28 community controls on a decision making task that manipulated outcome frequencies and magnitudes and required an action to avoid a negative outcome. Results SDI did not learn to avoid negative outcomes to the same degree as controls. This difference was driven by the magnitude, not the frequency, of negative feedback. In contrast, approach behaviors in response to positive reinforcement were similar in both groups. Conclusions Our findings are consistent with a specific deficit in negative reinforcement learning in SDI. SDI were relatively insensitive to the magnitude, not frequency, of loss. If this generalizes to drug-related stimuli, it suggests that repeated episodes of withdrawal may drive relapse more than the severity of a single episode. PMID:22079143

  1. Rewards versus Learning: A Response to Paul Chance.

    ERIC Educational Resources Information Center

    Kohn, Alfie

    1993-01-01

    Responding to Paul Chance's November 1992 "Kappan" article on motivational value of rewards, this article argues that manipulating student behavior with either punishments or rewards is unnecessary and counterproductive. Extrinsic rewards can never buy more than short-term compliance because they are inherently controlling and ineffective and make…

  2. Human neural learning depends on reward prediction errors in the blocking paradigm.

    PubMed

    Tobler, Philippe N; O'Doherty, John P; Dolan, Raymond J; Schultz, Wolfram

    2006-01-01

    Learning occurs when an outcome deviates from expectation (prediction error). According to formal learning theory, the defining paradigm demonstrating the role of prediction errors in learning is the blocking test. Here, a novel stimulus is blocked from learning when it is associated with a fully predicted outcome, presumably because the occurrence of the outcome fails to produce a prediction error. We investigated the role of prediction errors in human reward-directed learning using a blocking paradigm and measured brain activation with functional magnetic resonance imaging. Participants showed blocking of behavioral learning with juice rewards as predicted by learning theory. The medial orbitofrontal cortex and the ventral putamen showed significantly lower responses to blocked, compared with nonblocked, reward-predicting stimuli. In reward-predicting control situations, deactivation in orbitofrontal cortex and ventral putamen occurred at the time of unpredicted reward omissions. Responses in discrete parts of orbitofrontal cortex correlated with the degree of behavioral learning during, and after, the learning phase. These data suggest that learning in primary reward structures in the human brain correlates with prediction errors in a manner that complies with principles of formal learning theory.
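
    A worked Rescorla-Wagner example of the blocking logic the study relies on: once cue A fully predicts the reward, pairing the compound A+B with the same reward generates no prediction error, so the added cue B acquires almost no associative strength. Cue labels, learning rate, and trial counts are assumed for the illustration.

```python
# Rescorla-Wagner simulation of blocking. Associative strengths are updated
# by the shared prediction error of all cues present on a trial.
alpha, reward = 0.2, 1.0
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0}

# Phase 1: A -> reward, so A becomes a full predictor. Control cue C is not pre-trained.
for _ in range(50):
    pe = reward - V["A"]
    V["A"] += alpha * pe

# Phase 2: compound conditioning. A+B -> reward (B is blocked);
# C+D -> reward (D is a non-blocked control, learned alongside C).
for _ in range(50):
    pe_ab = reward - (V["A"] + V["B"])
    V["A"] += alpha * pe_ab
    V["B"] += alpha * pe_ab
    pe_cd = reward - (V["C"] + V["D"])
    V["C"] += alpha * pe_cd
    V["D"] += alpha * pe_cd

print({k: round(v, 3) for k, v in V.items()})
# Expected pattern: V["B"] stays near 0 (blocked), while C and D each acquire about 0.5.
```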

  3. A reinforcement learning mechanism responsible for the valuation of free choice

    PubMed Central

    Cockburn, Jeffrey; Collins, Anne G.E.; Frank, Michael J.

    2014-01-01

    Summary Humans exhibit a preference for options they have freely chosen over equally valued options they have not; however, the neural mechanism that drives this bias and its functional significance have yet to be identified. Here, we propose a model in which choice biases arise due to amplified positive reward prediction errors associated with free choice. Using a novel variant of a probabilistic learning task, we show that choice biases are selective to options that are predominantly associated with positive outcomes. A polymorphism in DARPP-32, a gene linked to dopaminergic striatal plasticity and individual differences in reinforcement learning, was found to predict the effect of choice as a function of value. We propose that these choice biases are the behavioral byproduct of a credit assignment mechanism responsible for ensuring the effective delivery of dopaminergic reinforcement learning signals broadcast to the striatum. PMID:25066083

  4. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning.

    PubMed

    Franklin, Nicholas T; Frank, Michael J

    2015-01-01

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning three Marr's levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments. PMID:26705698
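
    An algorithm-level sketch of the principle described (not the article's neural or Bayesian models): the effective learning rate is scaled by a running estimate of uncertainty, so the learner updates quickly after change-points and more slowly in stable periods. The uncertainty proxy and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Two-armed bandit with a reversal halfway through. The learning rate is
# boosted when recent absolute prediction errors (a crude uncertainty proxy)
# are large, analogous to the abstract's dynamic adjustment of the
# effective striatal learning rate.
p_reward = np.array([0.8, 0.2])
Q = np.zeros(2)
recent_pe = []                     # running window of absolute prediction errors
base_alpha, beta = 0.1, 5.0

for t in range(2000):
    if t == 1000:
        p_reward = p_reward[::-1]  # change-point: contingencies reverse
    probs = np.exp(beta * Q) / np.exp(beta * Q).sum()
    a = int(rng.choice(2, p=probs))
    r = float(rng.random() < p_reward[a])
    pe = r - Q[a]
    recent_pe = (recent_pe + [abs(pe)])[-20:]
    uncertainty = float(np.mean(recent_pe))        # high after surprises / change-points
    alpha = base_alpha + 0.5 * uncertainty         # uncertainty scales the learning rate
    Q[a] += alpha * pe

print("values after the reversal:", np.round(Q, 3))
```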

  5. Molecular mechanisms underlying a cellular analogue of operant reward learning

    PubMed Central

    Lorenzetti, Fred D.; Baxter, Douglas A.; Byrne, John H.

    2008-01-01

    SUMMARY Operant conditioning is a ubiquitous but mechanistically poorly understood form of associative learning in which an animal learns the consequences of its behavior. Using a single-cell analogue of operant conditioning in neuron B51 of Aplysia, we examined second-messenger pathways engaged by activity and reward and how they may provide a biochemical association underlying operant learning. Conditioning was blocked by Rp-cAMP, by a peptide inhibitor of PKA, by a PKC inhibitor, and by expressing a dominant-negative isoform of Ca2+-dependent PKC (Apl-I). Thus, both PKA and PKC were necessary for operant conditioning. Injection of cAMP into B51 mimicked the effects of operant conditioning. Activation of PKC also mimicked conditioning, but was dependent on both cAMP and PKA, suggesting that PKC acted at some point upstream of PKA activation. Our results demonstrate how these molecules can interact to mediate operant conditioning in an individual neuron important for the expression of the conditioned behavior. PMID:18786364

  6. Reinforcement learning of targeted movement in a spiking neuronal model of motor cortex.

    PubMed

    Chadderdon, George L; Neymotin, Samuel A; Kerr, Cliff C; Lytton, William W

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint "forearm" to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (-1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior.
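
    A drastically reduced, non-spiking sketch of the learning rule's structure: exploratory output fluctuations paired with their inputs build an eligibility trace, and a global three-valued signal (+1 reward, 0 no learning, -1 punishment), defined here by whether the distance to the target decreased, converts that trace into weight changes. The one-dimensional "arm", the fixed input pattern, and all constants are illustrative; this is not the paper's spiking model.

```python
import numpy as np

rng = np.random.default_rng(7)

# Perturbation-style learner: exploratory noise ("motor babbling") on the
# output, an eligibility trace of input x noise, and a global +1/0/-1
# reinforcement signal based on whether the error shrank.
n_inputs = 8
x = rng.random(n_inputs)              # fixed proprioceptive-like input pattern (simplification)
w = rng.normal(0.0, 0.1, n_inputs)    # plastic feedforward weights
target = 0.7                          # target "joint angle"
lr, noise_sd = 0.05, 0.2
prev_error = None

for trial in range(500):
    exploration = rng.normal(0.0, noise_sd)      # motor babbling
    angle = float(w @ x) + exploration           # produced movement
    error = abs(target - angle)
    if prev_error is None or error == prev_error:
        signal = 0                               # no learning
    elif error < prev_error:
        signal = 1                               # reward: error decreased
    else:
        signal = -1                              # punishment: error increased
    eligibility = exploration * x                # credit to inputs active during the fluctuation
    w += lr * signal * eligibility
    prev_error = error

print("final error without exploration noise:", round(abs(target - float(w @ x)), 3))
```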

  7. Vicarious reinforcement learning signals when instructing others.

    PubMed

    Apps, Matthew A J; Lesage, Elise; Ramnani, Narender

    2015-02-18

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action-outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors. PMID:25698730

  8. Vicarious reinforcement learning signals when instructing others.

    PubMed

    Apps, Matthew A J; Lesage, Elise; Ramnani, Narender

    2015-02-18

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action-outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors.

  9. Forward shift of feeding buzz components of dolphins and belugas during associative learning reveals a likely connection to reward expectation, pleasure and brain dopamine activation.

    PubMed

    Ridgway, S H; Moore, P W; Carder, D A; Romano, T A

    2014-08-15

    For many years, we heard sounds associated with reward from dolphins and belugas. We named these pulsed sounds victory squeals (VS), as they remind us of a child's squeal of delight. Here we put these sounds in context with natural and learned behavior. Like bats, echolocating cetaceans produce feeding buzzes as they approach and catch prey. Unlike bats, cetaceans continue their feeding buzzes after prey capture and the after portion is what we call the VS. Prior to training (or conditioning), the VS comes after the fish reward; with repeated trials it moves to before the reward. During training, we use a whistle or other sound to signal a correct response by the animal. This sound signal, named a secondary reinforcer (SR), leads to the primary reinforcer, fish. Trainers usually name their whistle or other SR a bridge, as it bridges the time gap between the correct response and reward delivery. During learning, the SR becomes associated with reward and the VS comes after the SR rather than after the fish. By following the SR, the VS confirms that the animal expects a reward. Results of early brain stimulation work suggest to us that SR stimulates brain dopamine release, which leads to the VS. Although there are no direct studies of dopamine release in cetaceans, we found that the timing of our VS is consistent with a response after dopamine release. We compared trained vocal responses to auditory stimuli with VS responses to SR sounds. Auditory stimuli that did not signal reward resulted in faster responses by a mean of 151 ms for dolphins and 250 ms for belugas. In laboratory animals, there is a 100 to 200 ms delay for dopamine release. VS delay in our animals is similar and consistent with vocalization after dopamine release. Our novel observation suggests that the dopamine reward system is active in cetacean brains. PMID:25122919

  10. Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback.

    PubMed

    Tan, A H; Lu, N; Xiao, D

    2008-02-01

    This paper presents a neural architecture for learning category nodes encoding mappings across multimodal patterns involving sensory inputs, actions, and rewards. By integrating adaptive resonance theory (ART) and temporal difference (TD) methods, the proposed neural model, called TD fusion architecture for learning, cognition, and navigation (TD-FALCON), enables an autonomous agent to adapt and function in a dynamic environment with immediate as well as delayed evaluative feedback (reinforcement) signals. TD-FALCON learns the value functions of the state-action space estimated through on-policy and off-policy TD learning methods, specifically state-action-reward-state-action (SARSA) and Q-learning. The learned value functions are then used to determine the optimal actions based on an action selection policy. We have developed TD-FALCON systems using various TD learning strategies and compared their performance in terms of task completion, learning speed, as well as time and space efficiency. Experiments based on a minefield navigation task have shown that TD-FALCON systems are able to learn effectively with both immediate and delayed reinforcement and achieve stable performance at a pace much faster than that of standard gradient-descent-based reinforcement learning systems.
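
    For reference, the two temporal-difference update rules mentioned, SARSA (on-policy) and Q-learning (off-policy), shown side by side on a toy chain task; the ART-based category learning and the minefield navigation task of TD-FALCON are not reproduced, and all constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)

# SARSA bootstraps from the action actually selected next; Q-learning
# bootstraps from the greedy action. Both are run on a small 1-D chain
# where moving right eventually earns a terminal reward of 1.
n_states, gamma, alpha, epsilon = 6, 0.9, 0.1, 0.2

def epsilon_greedy(Q, s):
    return int(rng.integers(2)) if rng.random() < epsilon else int(Q[s].argmax())

def episode(Q, rule):
    s = 0
    a = epsilon_greedy(Q, s)
    while s < n_states - 1:
        s_next = s + 1 if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        a_next = epsilon_greedy(Q, s_next)
        if s_next == n_states - 1:
            target = r                                  # terminal state
        elif rule == "sarsa":
            target = r + gamma * Q[s_next, a_next]      # value of the action actually taken
        else:
            target = r + gamma * Q[s_next].max()        # value of the greedy action
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next

for rule in ("sarsa", "q-learning"):
    Q = np.zeros((n_states, 2))
    for _ in range(500):
        episode(Q, rule)
    print(rule, "value of moving right from the start:", round(Q[0, 1], 3))
```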

  11. Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats.

    PubMed

    Lloyd, Kevin; Becker, Nadine; Jones, Matthew W; Bogacz, Rafal

    2012-01-01

    Learning to form appropriate, task-relevant working memory representations is a complex process central to cognition. Gating models frame working memory as a collection of past observations and use reinforcement learning (RL) to solve the problem of when to update these observations. Investigation of how gating models relate to brain and behavior remains, however, at an early stage. The current study sought to explore the ability of simple RL gating models to replicate rule learning behavior in rats. Rats were trained in a maze-based spatial learning task that required animals to make trial-by-trial choices contingent upon their previous experience. Using an abstract version of this task, we tested the ability of two gating algorithms, one based on the Actor-Critic and the other on the State-Action-Reward-State-Action (SARSA) algorithm, to generate behavior consistent with the rats'. Both models produced rule-acquisition behavior consistent with the experimental data, though only the SARSA gating model mirrored faster learning following rule reversal. We also found that both gating models learned multiple strategies in solving the initial task, a property which highlights the multi-agent nature of such models and which is of importance in considering the neural basis of individual differences in behavior. PMID:23115551
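
    The core gating idea in miniature: the agent's action repertoire includes both an overt response and an internal decision about whether to write the current cue into working memory, and both are trained by the same reinforcement learning update. The delayed-cue task below is an illustrative stand-in, not the paper's maze task or its specific Actor-Critic and SARSA gating algorithms.

```python
import numpy as np

rng = np.random.default_rng(9)

# Two-phase delayed-cue task. Phase 0: a cue is visible and the agent picks
# a gate action (ignore or store the cue in working memory). Phase 1: the
# cue is gone, and only the working-memory content distinguishes the states
# in which the overt response is made. Both action kinds share the TD update.
gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = {}

def q(state):
    return Q.setdefault(state, np.zeros(2))

def choose(state):
    return int(rng.integers(2)) if rng.random() < epsilon else int(q(state).argmax())

for episode in range(5000):
    cue = int(rng.integers(2))
    s0 = ("cue", cue, "wm_empty")
    gate = choose(s0)                     # 0 = ignore, 1 = store the cue
    wm = cue if gate == 1 else None
    s1 = ("respond", wm)
    resp = choose(s1)                     # overt response
    r = 1.0 if resp == cue else 0.0
    # TD updates: the response value moves toward the reward, and the gating
    # value moves toward the discounted value of the response it enabled.
    q(s1)[resp] += alpha * (r - q(s1)[resp])
    q(s0)[gate] += alpha * (gamma * q(s1)[resp] - q(s0)[gate])

for cue in (0, 1):
    print("cue", cue, "gate values [ignore, store]:", np.round(q(("cue", cue, "wm_empty")), 3))
```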

  12. Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats

    PubMed Central

    Lloyd, Kevin; Becker, Nadine; Jones, Matthew W.; Bogacz, Rafal

    2012-01-01

    Learning to form appropriate, task-relevant working memory representations is a complex process central to cognition. Gating models frame working memory as a collection of past observations and use reinforcement learning (RL) to solve the problem of when to update these observations. Investigation of how gating models relate to brain and behavior remains, however, at an early stage. The current study sought to explore the ability of simple RL gating models to replicate rule learning behavior in rats. Rats were trained in a maze-based spatial learning task that required animals to make trial-by-trial choices contingent upon their previous experience. Using an abstract version of this task, we tested the ability of two gating algorithms, one based on the Actor-Critic and the other on the State-Action-Reward-State-Action (SARSA) algorithm, to generate behavior consistent with the rats'. Both models produced rule-acquisition behavior consistent with the experimental data, though only the SARSA gating model mirrored faster learning following rule reversal. We also found that both gating models learned multiple strategies in solving the initial task, a property which highlights the multi-agent nature of such models and which is of importance in considering the neural basis of individual differences in behavior. PMID:23115551

  13. Spike-based reinforcement learning in continuous state and action space: when policy gradient methods fail.

    PubMed

    Vasilaki, Eleni; Frémaux, Nicolas; Urbanczik, Robert; Senn, Walter; Gerstner, Wulfram

    2009-12-01

    Changes of synaptic connections between neurons are thought to be the physiological basis of learning. These changes can be gated by neuromodulators that encode the presence of reward. We study a family of reward-modulated synaptic learning rules for spiking neurons on a learning task in continuous space inspired by the Morris Water maze. The synaptic update rule modifies the release probability of synaptic transmission and depends on the timing of presynaptic spike arrival, postsynaptic action potentials, as well as the membrane potential of the postsynaptic neuron. The family of learning rules includes an optimal rule derived from policy gradient methods as well as reward modulated Hebbian learning. The synaptic update rule is implemented in a population of spiking neurons using a network architecture that combines feedforward input with lateral connections. Actions are represented by a population of hypothetical action cells with strong mexican-hat connectivity and are read out at theta frequency. We show that in this architecture, a standard policy gradient rule fails to solve the Morris watermaze task, whereas a variant with a Hebbian bias can learn the task within 20 trials, consistent with experiments. This result does not depend on implementation details such as the size of the neuronal populations. Our theoretical approach shows how learning new behaviors can be linked to reward-modulated plasticity at the level of single synapses and makes predictions about the voltage and spike-timing dependence of synaptic plasticity and the influence of neuromodulators such as dopamine. It is an important step towards connecting formal theories of reinforcement learning with neuronal and synaptic properties. PMID:19997492

  14. Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail

    PubMed Central

    Vasilaki, Eleni; Frémaux, Nicolas; Urbanczik, Robert; Senn, Walter; Gerstner, Wulfram

    2009-01-01

    Changes of synaptic connections between neurons are thought to be the physiological basis of learning. These changes can be gated by neuromodulators that encode the presence of reward. We study a family of reward-modulated synaptic learning rules for spiking neurons on a learning task in continuous space inspired by the Morris Water maze. The synaptic update rule modifies the release probability of synaptic transmission and depends on the timing of presynaptic spike arrival, postsynaptic action potentials, as well as the membrane potential of the postsynaptic neuron. The family of learning rules includes an optimal rule derived from policy gradient methods as well as reward modulated Hebbian learning. The synaptic update rule is implemented in a population of spiking neurons using a network architecture that combines feedforward input with lateral connections. Actions are represented by a population of hypothetical action cells with strong mexican-hat connectivity and are read out at theta frequency. We show that in this architecture, a standard policy gradient rule fails to solve the Morris watermaze task, whereas a variant with a Hebbian bias can learn the task within 20 trials, consistent with experiments. This result does not depend on implementation details such as the size of the neuronal populations. Our theoretical approach shows how learning new behaviors can be linked to reward-modulated plasticity at the level of single synapses and makes predictions about the voltage and spike-timing dependence of synaptic plasticity and the influence of neuromodulators such as dopamine. It is an important step towards connecting formal theories of reinforcement learning with neuronal and synaptic properties. PMID:19997492

  15. Dopamine and opioid gene variants are associated with increased smoking reward and reinforcement owing to negative mood.

    PubMed

    Perkins, Kenneth A; Lerman, Caryn; Grottenthaler, Amy; Ciccocioppo, Melinda M; Milanak, Melissa; Conklin, Cynthia A; Bergen, Andrew W; Benowitz, Neal L

    2008-09-01

    Negative mood increases smoking reinforcement and risk of relapse. We explored associations of gene variants in the dopamine, opioid, and serotonin pathways with smoking reward ('liking') and reinforcement (latency to first puff and total puffs) as a function of negative mood and expected versus actual nicotine content of the cigarette. Smokers of European ancestry (n=72) were randomized to one of four groups in a 2x2 balanced placebo design, corresponding with manipulation of actual (0.6 vs. 0.05 mg) and expected (told nicotine and told denicotinized) nicotine 'dose' in cigarettes during each of two sessions (negative vs. positive mood induction). Following mood induction and expectancy instructions, they sampled and rated the assigned cigarette, and then smoked additional cigarettes ad lib during continued mood induction. The increase in smoking amount owing to negative mood was associated with: dopamine D2 receptor (DRD2) C957T (CC>TT or CT), SLC6A3 (presence of 9 repeat>absence of 9), and among those given a nicotine cigarette, DRD4 (presence of 7 repeat>absence of 7) and DRD2/ANKK1 TaqIA (TT or CT>CC). SLC6A3 and DRD2/ANKK1 TaqIA were also associated with smoking reward and smoking latency. OPRM1 (AA>AG or GG) was associated with smoking reward, but SLC6A4 variable number tandem repeat was unrelated to any of these measures. These results warrant replication but provide the first evidence for genetic associations with the acute increase in smoking reward and reinforcement owing to negative mood.

  17. Connectionist reinforcement learning of robot control skills

    NASA Astrophysics Data System (ADS)

    Araújo, Rui; Nunes, Urbano; de Almeida, A. T.

    1998-07-01

    Many robot manipulator tasks are difficult to model explicitly and it is difficult to design and program automatic control algorithms for them. The development, improvement, and application of learning techniques taking advantage of sensory information would enable the acquisition of new robot skills and avoid some of the difficulties of explicit programming. In this paper we use a reinforcement learning approach for on-line generation of skills for control of robot manipulator systems. Instead of generating skills by explicit programming of a perception-to-action mapping, they are generated by trial-and-error learning, guided by a performance evaluation feedback function. The resulting system may be seen as an anticipatory system that constructs an internal representation model of itself and of its environment. This enables it to identify its current situation and to generate corresponding appropriate commands to the system in order to perform the required skill. The method was applied to the problem of learning a force control skill in which the tool-tip of a robot manipulator must be moved from a free-space situation to a contact state with a compliant surface while maintaining a constant interaction force.

  18. Mechanisms of reinforcement learning and decision making in the primate dorsolateral prefrontal cortex.

    PubMed

    Lee, Daeyeol; Seo, Hyojung

    2007-05-01

    To a first approximation, decision making is a process of optimization in which the decision maker tries to maximize the desirability of the outcomes resulting from chosen actions. Estimates of desirability are referred to as utilities or value functions, and they must be continually revised through experience according to the discrepancies between the predicted and obtained rewards. Reinforcement learning theory prescribes various algorithms for updating value functions and can parsimoniously account for the results of numerous behavioral, neurophysiological, and imaging studies in humans and other primates. In this article, we first discuss relative merits of various decision-making tasks used in neurophysiological studies of decision making in nonhuman primates. We then focus on how reinforcement learning theory can shed new light on the function of the primate dorsolateral prefrontal cortex. Similar to the findings from other brain areas, such as cingulate cortex and basal ganglia, activity in the dorsolateral prefrontal cortex often signals the value of expected reward and actual outcome. Thus, the dorsolateral prefrontal cortex is likely to be a part of the broader network involved in adaptive decision making. In addition, reward-related activity in the dorsolateral prefrontal cortex is influenced by the animal's choices and other contextual information, and therefore may provide a neural substrate by which the animals can flexibly modify their decision-making strategies according to the demands of specific tasks.
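
    The error-driven value update referred to here has a simple canonical form (a Rescorla-Wagner/temporal-difference-style rule); the snippet below is a generic illustration rather than the specific model of any study discussed in this record.

        def update_value(value, reward, alpha=0.1):
            """Revise a value estimate by the discrepancy between obtained and predicted reward."""
            prediction_error = reward - value        # obtained minus predicted
            return value + alpha * prediction_error

        # Example: the estimate drifts toward the average payoff of an option.
        v = 0.0
        for r in [1, 0, 1, 1, 0, 1]:
            v = update_value(v, r)
        print(v)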

  19. Extraversion differentiates between model-based and model-free strategies in a reinforcement learning task

    PubMed Central

    Skatova, Anya; Chan, Patricia A.; Daw, Nathaniel D.

    2013-01-01

    Prominent computational models describe a neural mechanism for learning from reward prediction errors, and it has been suggested that variations in this mechanism are reflected in personality factors such as trait extraversion. However, although trait extraversion has been linked to improved reward learning, it is not yet known whether this relationship is selective for the particular computational strategy associated with error-driven learning, known as model-free reinforcement learning, vs. another strategy, model-based learning, which the brain is also known to employ. In the present study we test this relationship by examining whether humans' scores on an extraversion scale predict individual differences in the balance between model-based and model-free learning strategies in a sequentially structured decision task designed to distinguish between them. In previous studies with this task, participants have shown a combination of both types of learning, but with substantial individual variation in the balance between them. In the current study, extraversion predicted worse behavior across both sorts of learning. However, the hypothesis that extraverts would be selectively better at model-free reinforcement learning held up among a subset of the more engaged participants, and overall, higher task engagement was associated with a more selective pattern by which extraversion predicted better model-free learning. The findings indicate a relationship between a broad personality orientation and detailed computational learning mechanisms. Results like those in the present study suggest an intriguing and rich relationship between core neuro-computational mechanisms and broader life orientations and outcomes. PMID:24027514
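
    The balance between the two strategies in this kind of task is commonly modeled as a weighted mixture of model-based and model-free action values fed into a softmax choice rule. The following sketch assumes that simplified parameterization; the weighting parameter w and the value arrays are illustrative, not estimates from the study.

        import numpy as np

        def combined_action_values(q_model_free, q_model_based, w):
            """Mixture of strategies: w = 1 is purely model-based, w = 0 purely model-free."""
            return w * q_model_based + (1.0 - w) * q_model_free

        def softmax_choice_probs(q, beta=3.0):
            """Softmax over action values with inverse temperature beta."""
            z = beta * (q - np.max(q))
            p = np.exp(z)
            return p / p.sum()

        q_mf = np.array([0.2, 0.6])     # hypothetical model-free values
        q_mb = np.array([0.7, 0.3])     # hypothetical model-based values
        for w in (0.0, 0.5, 1.0):
            print(w, softmax_choice_probs(combined_action_values(q_mf, q_mb, w)))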

  20. Punishment Insensitivity and Impaired Reinforcement Learning in Preschoolers

    ERIC Educational Resources Information Center

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2014-01-01

    Background: Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a…

  1. Reinforcement of Science Learning through Local Culture: A Delphi Study

    ERIC Educational Resources Information Center

    Nuangchalerm, Prasart

    2008-01-01

    This study aims to explore the ways to reinforce science learning through local culture by using the Delphi technique. Twenty-four participants in various fields of study were selected. The results of the study provide a framework for reinforcement of science learning through local culture on the theme life and environment. (Contains 1 table.)

  2. CLEANing the Reward: Counterfactual Actions to Remove Exploratory Action Noise in Multiagent Learning

    NASA Technical Reports Server (NTRS)

    HolmesParker, Chris; Taylor, Mathew E.; Tumer, Kagan; Agogino, Adrian

    2014-01-01

    Learning in multiagent systems can be slow because agents must learn both how to behave in a complex environment and how to account for the actions of other agents. The inability of an agent to distinguish between the true environmental dynamics and those caused by the stochastic exploratory actions of other agents creates noise in each agent's reward signal. This learning noise can have unforeseen and often undesirable effects on the resultant system performance. We define such noise as exploratory action noise, demonstrate the critical impact it can have on the learning process in multiagent settings, and introduce a reward structure to effectively remove such noise from each agent's reward signal. In particular, we introduce Coordinated Learning without Exploratory Action Noise (CLEAN) rewards and empirically demonstrate their benefits
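
    CLEAN rewards build on the idea of evaluating an agent against a counterfactual in which its own action is swapped out, so that noise contributed by other agents largely cancels. The fragment below sketches only that underlying difference-reward computation, not the full CLEAN scheme (which additionally keeps exploratory actions private and evaluates them off-line); the global-reward function and action encoding are toy assumptions.

        def difference_reward(global_reward, joint_action, agent, counterfactual_action):
            """D_i = G(joint action) - G(joint action with agent i's action replaced)."""
            actual = global_reward(joint_action)
            replaced = list(joint_action)
            replaced[agent] = counterfactual_action
            return actual - global_reward(tuple(replaced))

        # Toy example: the global reward simply counts agents that chose action 1.
        G = lambda actions: float(sum(actions))
        print(difference_reward(G, (1, 0, 1), agent=0, counterfactual_action=0))   # 1.0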

  3. Aging Affects Acquisition and Reversal of Reward-Based Associative Learning

    ERIC Educational Resources Information Center

    Weiler, Julia A.; Bellebaum, Christian; Daum, Irene

    2008-01-01

    Reward-based associative learning is mediated by a distributed network of brain regions that are dependent on the dopaminergic system. Age-related changes in key regions of this system, the striatum and the prefrontal cortex, may adversely affect the ability to use reward information for the guidance of behavior. The present study investigated the…

  4. Stimulus-Reward Association and Reversal Learning in Individuals with Asperger Syndrome

    ERIC Educational Resources Information Center

    Zalla, Tiziana; Sav, Anca-Maria; Leboyer, Marion

    2009-01-01

    In the present study, performance of a group of adults with Asperger Syndrome (AS) on two series of object reversal and extinction was compared with that of a group of adults with typical development. Participants were requested to learn a stimulus-reward association rule and monitor changes in reward value of stimuli in order to gain as many…

  5. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia.

    PubMed

    Markou, Athina; Salamone, John D; Bussey, Timothy J; Mar, Adam C; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-11-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. PMID:23994273

  8. A Discussion of Possibility of Reinforcement Learning Using Event-Related Potential in BCI

    NASA Astrophysics Data System (ADS)

    Yamagishi, Yuya; Tsubone, Tadashi; Wada, Yasuhiro

    Recently, the brain-computer interface (BCI), a direct communication pathway between an external device such as a computer or a robot and a human brain, has gotten a lot of attention. Since a BCI can control machines such as robots using brain activity without relying on voluntary muscle activity, it may become a useful communication tool for handicapped persons, for instance, amyotrophic lateral sclerosis patients. However, in order to realize a BCI system that can perform precise tasks in various environments, it is necessary to design control rules that adapt to dynamic environments. Reinforcement learning is one approach to the design of such control rules. If this reinforcement learning can be driven by brain activity, it could lead to a BCI with general versatility. In this research, we focused on the P300 event-related potential as an alternative signal for the reward of reinforcement learning. We discriminated between success and failure trials from the single-trial P300 of the EEG by using a proposed discrimination algorithm based on a support vector machine. The possibility of reinforcement learning was examined from the viewpoint of the number of discriminated trials. It was shown that learning was possible in most subjects.
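
    The core computational step, classifying single trials as success or failure from P300-related features and treating the classifier output as a surrogate reward, can be prototyped in a few lines. The sketch below assumes scikit-learn is available and uses synthetic feature vectors in place of real single-trial EEG; the array names and the added "P300-like" effect are purely illustrative.

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n_trials, n_features = 200, 16
        features = rng.standard_normal((n_trials, n_features))   # stand-in for single-trial EEG features
        labels = rng.integers(0, 2, n_trials)                    # 1 = success trial, 0 = failure trial
        features[labels == 1, 0] += 1.0                          # toy "P300-like" amplitude difference

        clf = SVC(kernel="linear", C=1.0)
        scores = cross_val_score(clf, features, labels, cv=5)    # single-trial classification accuracy
        print(scores.mean())
        # The binary classifier output could then stand in for the reward signal
        # driving a reinforcement-learning agent from the user's EEG.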

  9. Differential Modulation of Reinforcement Learning by D2 Dopamine and NMDA Glutamate Receptor Antagonism

    PubMed Central

    Klein, Tilmann A.; Ullsperger, Markus

    2014-01-01

    The firing pattern of midbrain dopamine (DA) neurons is well known to reflect reward prediction errors (PEs), the difference between obtained and expected rewards. The PE is thought to be a crucial signal for instrumental learning, and interference with DA transmission impairs learning. Phasic increases of DA neuron firing during positive PEs are driven by activation of NMDA receptors, whereas phasic suppression of firing during negative PEs is likely mediated by inputs from the lateral habenula. We aimed to determine the contribution of DA D2-class and NMDA receptors to appetitively and aversively motivated reinforcement learning. Healthy human volunteers were scanned with functional magnetic resonance imaging while they performed an instrumental learning task under the influence of either the DA D2 receptor antagonist amisulpride (400 mg), the NMDA receptor antagonist memantine (20 mg), or placebo. Participants quickly learned to select (“approach”) rewarding and to reject (“avoid”) punishing options. Amisulpride impaired both approach and avoidance learning, while memantine mildly attenuated approach learning but had no effect on avoidance learning. These behavioral effects of the antagonists were paralleled by their modulation of striatal PEs. Amisulpride reduced both appetitive and aversive PEs, while memantine diminished appetitive, but not aversive PEs. These data suggest that striatal D2-class receptors contribute to both approach and avoidance learning by detecting both the phasic DA increases and decreases during appetitive and aversive PEs. NMDA receptors on the contrary appear to be required only for approach learning because phasic DA increases during positive PEs are NMDA dependent, whereas phasic decreases during negative PEs are not. PMID:25253860

  10. Reinforcement learning for port-Hamiltonian systems.

    PubMed

    Sprangers, Olivier; Babuška, Robert; Nageshrao, Subramanya P; Lopes, Gabriel A D

    2015-05-01

    Passivity-based control (PBC) for port-Hamiltonian systems provides an intuitive way of achieving stabilization by rendering a system passive with respect to a desired storage function. However, in most instances the control law is obtained without any performance considerations and it has to be calculated by solving a complex partial differential equation (PDE). In order to address these issues we introduce a reinforcement learning (RL) approach into the energy-balancing passivity-based control (EB-PBC) method, which is a form of PBC in which the closed-loop energy is equal to the difference between the stored and supplied energies. We propose a technique to parameterize EB-PBC that preserves the system's PDE matching conditions, does not require the specification of a global desired Hamiltonian, includes performance criteria, and is robust. The parameters of the control law are found by using actor-critic (AC) RL, enabling the search for near-optimal control policies satisfying a desired closed-loop energy landscape. The advantage is that the solutions learned can be interpreted in terms of energy shaping and damping injection, which makes it possible to numerically assess stability using passivity theory. From the RL perspective, our proposal allows for the class of port-Hamiltonian systems to be incorporated in the AC framework, speeding up the learning thanks to the resulting parameterization of the policy. The method has been successfully applied to the pendulum swing-up problem in simulations and real-life experiments. PMID:25167564
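
    The control-law parameters in this approach are adjusted by an actor-critic learner. The fragment below shows only a generic actor-critic update with linear function approximation, not the paper's EB-PBC parameterization or matching conditions; all symbols are illustrative.

        import numpy as np

        def actor_critic_step(theta, v_w, phi_s, phi_s_next, score, reward,
                              gamma=0.98, alpha_critic=0.1, alpha_actor=0.01):
            """One generic actor-critic update.

            theta  -- parameters of the (control-law) policy
            v_w    -- critic weights; the value estimate is v_w @ phi_s
            phi_s, phi_s_next -- state features at the current and next step
            score  -- gradient of the log-probability of the taken action w.r.t. theta
            """
            td_error = reward + gamma * (v_w @ phi_s_next) - (v_w @ phi_s)
            v_w = v_w + alpha_critic * td_error * phi_s        # critic update
            theta = theta + alpha_actor * td_error * score     # actor update
            return theta, v_w

        theta, v_w = np.zeros(3), np.zeros(4)
        theta, v_w = actor_critic_step(theta, v_w, np.ones(4), np.ones(4),
                                       np.array([0.1, -0.2, 0.3]), reward=1.0)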

  12. Dopaminergic and Prefrontal Contributions to Reward-Based Learning and Outcome Monitoring during Child Development and Aging

    ERIC Educational Resources Information Center

    Hammerer, Dorothea; Eppinger, Ben

    2012-01-01

    In many instances, children and older adults show similar difficulties in reward-based learning and outcome monitoring. These impairments are most pronounced in situations in which reward is uncertain (e.g., probabilistic reward schedules) and if outcome information is ambiguous (e.g., the relative value of outcomes has to be learned).…

  13. Updating dopamine reward signals.

    PubMed

    Schultz, Wolfram

    2013-04-01

    Recent work has advanced our knowledge of phasic dopamine reward prediction error signals. The error signal is bidirectional, reflects well the higher order prediction error described by temporal difference learning models, is compatible with model-free and model-based reinforcement learning, reports the subjective rather than physical reward value during temporal discounting and reflects subjective stimulus perception rather than physical stimulus aspects. Dopamine activations are primarily driven by reward, and to some extent risk, whereas punishment and salience have only limited activating effects when appropriate controls are respected. The signal is homogeneous in terms of time course but heterogeneous in many other aspects. It is essential for synaptic plasticity and a range of behavioural learning situations.

  14. Reward expectation alters learning and memory: The impact of the amygdala on appetitive-driven behaviors

    PubMed Central

    Savage, Lisa M.; Ramos, Raddy L.

    2009-01-01

    The capacity to seek and obtain rewards is essential for survival. Pavlovian conditioning is one mechanism by which organisms develop predictions about rewards and such anticipatory or expectancy states enable successful behavioral adaptations to environmental demands. Reward expectancies have both affective/motivational and discriminative properties that allow for the modulation of instrumental goal-directed behavior. Recent data provides evidence that different cognitive strategies (cue-outcome associations) and neural systems (amygdala) are used when subjects are trained under conditions that allow Pavlovian-induced reward expectancies to guide instrumental behavioral choices. Furthermore, it has been demonstrated that impairments typically observed in a number of brain-damaged models are alleviated or eliminated by embedding unique reward expectancies into learning/memory tasks. These results suggest that Pavlovian-induced reward expectancies can change both behavioral and brain processes. PMID:19022299

  15. Stochastic Reinforcement Benefits Skill Acquisition

    ERIC Educational Resources Information Center

    Dayan, Eran; Averbeck, Bruno B.; Richmond, Barry J.; Cohen, Leonardo G.

    2014-01-01

    Learning complex skills is driven by reinforcement, which facilitates both online within-session gains and retention of the acquired skills. Yet, in ecologically relevant situations, skills are often acquired when mapping between actions and rewarding outcomes is unknown to the learning agent, resulting in reinforcement schedules of a stochastic…

  16. Reinforcement learning in complementarity game and population dynamics.

    PubMed

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005)] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This gives insight into aspects like the issue of quick adaptation as opposed to systematic exploration or the role of learning rates.
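
    The schemes compared here are all simple propensity- or value-based update rules. As one concrete example, a Roth-Erev-style learner with a power exponent in the choice rule can be sketched as follows; where exactly the 1.5 exponent enters in the authors' modified scheme is an assumption of this illustration.

        import numpy as np

        rng = np.random.default_rng(0)

        def roth_erev_choice(propensities, power=1.5):
            """Choose an action with probability proportional to propensity**power
            (power = 1 recovers the standard Roth-Erev choice rule)."""
            weights = propensities ** power
            probs = weights / weights.sum()
            return rng.choice(len(propensities), p=probs)

        def roth_erev_update(propensities, action, payoff, forgetting=0.05):
            """Reinforce the chosen action's propensity by the received payoff."""
            propensities = (1.0 - forgetting) * propensities
            propensities[action] += payoff
            return propensities

        q = np.ones(2)                       # initial propensities for two actions
        payoffs = np.array([1.0, 0.5])       # action 0 pays better in this toy game
        for _ in range(100):
            a = roth_erev_choice(q)
            q = roth_erev_update(q, a, payoffs[a])
        print(q)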

  17. The Roles of Dopamine and Related Compounds in Reward-Seeking Behavior Across Animal Phyla

    PubMed Central

    Barron, Andrew B.; Søvik, Eirik; Cornish, Jennifer L.

    2010-01-01

    Motile animals actively seek out and gather resources they find rewarding, and this is an extremely powerful organizer and motivator of animal behavior. Mammalian studies have revealed interconnected neurobiological systems for reward learning, reward assessment, reinforcement and reward-seeking; all involving the biogenic amine dopamine. The neurobiology of reward-seeking behavioral systems is less well understood in invertebrates, but in many diverse invertebrate groups, reward learning and responses to food rewards also involve dopamine. The obvious exceptions are the arthropods in which the chemically related biogenic amine octopamine has a greater effect on reward learning and reinforcement than dopamine. Here we review the functions of these biogenic amines in behavioral responses to rewards in different animal groups, and discuss these findings in an evolutionary context. PMID:21048897

  18. The role of GABAB receptors in human reinforcement learning.

    PubMed

    Ort, Andres; Kometer, Michael; Rohde, Judith; Seifritz, Erich; Vollenweider, Franz X

    2014-10-01

    Behavioral evidence from human studies suggests that the γ-aminobutyric acid type B receptor (GABAB receptor) agonist baclofen modulates reinforcement learning and reduces craving in patients with addiction spectrum disorders. However, in contrast to the well established role of dopamine in reinforcement learning, the mechanisms by which the GABAB receptor influences reinforcement learning in humans remain completely unknown. To further elucidate this issue, a cross-over, double-blind, placebo-controlled study was performed in healthy human subjects (N=15) to test the effects of baclofen (20 and 50mg p.o.) on probabilistic reinforcement learning. Outcomes were the feedback-induced P2 component of the event-related potential, the feedback-related negativity, and the P300 component of the event-related potential. Baclofen produced a reduction of P2 amplitude over the course of the experiment, but did not modulate the feedback-related negativity. Furthermore, there was a trend towards increased learning after baclofen administration relative to placebo over the course of the experiment. The present results extend previous theories of reinforcement learning, which focus on the importance of mesolimbic dopamine signaling, and indicate that stimulation of cortical GABAB receptors in a fronto-parietal network leads to better attentional allocation in reinforcement learning. This observation is a first step in our understanding of how baclofen may improve reinforcement learning in healthy subjects. Further studies with bigger sample sizes are needed to corroborate this conclusion and furthermore, test this effect in patients with addiction spectrum disorder.

  19. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback.

    PubMed

    Legenstein, Robert; Pecevski, Dejan; Maass, Wolfgang

    2008-10-01

    Reward-modulated spike-timing-dependent plasticity (STDP) has recently emerged as a candidate for a learning rule that could explain how behaviorally relevant adaptive changes in complex networks of spiking neurons could be achieved in a self-organizing manner through local synaptic plasticity. However, the capabilities and limitations of this learning rule could so far only be tested through computer simulations. This article provides tools for an analytic treatment of reward-modulated STDP, which allows us to predict under which conditions reward-modulated STDP will achieve a desired learning effect. These analytical results imply that neurons can learn through reward-modulated STDP to classify not only spatial but also temporal firing patterns of presynaptic neurons. They also can learn to respond to specific presynaptic firing patterns with particular spike patterns. Finally, the resulting learning theory predicts that even difficult credit-assignment problems, where it is very hard to tell which synaptic weights should be modified in order to increase the global reward for the system, can be solved in a self-organizing manner through reward-modulated STDP. This yields an explanation for a fundamental experimental result on biofeedback in monkeys by Fetz and Baker. In this experiment monkeys were rewarded for increasing the firing rate of a particular neuron in the cortex and were able to solve this extremely difficult credit assignment problem. Our model for this experiment relies on a combination of reward-modulated STDP with variable spontaneous firing activity. Hence it also provides a possible functional explanation for trial-to-trial variability, which is characteristic for cortical networks of neurons but has no analogue in currently existing artificial computing systems. In addition our model demonstrates that reward-modulated STDP can be applied to all synapses in a large recurrent neural network without endangering the stability of the network
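
    The rule analyzed here has a standard simplified computational form: an STDP-shaped eligibility trace that is converted into a weight change only when multiplied by a (possibly delayed) reward signal. The class below is a minimal discrete-time sketch under that simplification; time constants and amplitudes are arbitrary placeholders, not values from the analysis.

        class RStdpSynapse:
            """Simplified reward-modulated STDP for a single synapse."""

            def __init__(self, w=0.5, tau_pre=0.02, tau_post=0.02, tau_e=0.5,
                         a_plus=1.0, a_minus=-0.6, lr=0.01):
                self.w, self.lr = w, lr
                self.tau_pre, self.tau_post, self.tau_e = tau_pre, tau_post, tau_e
                self.a_plus, self.a_minus = a_plus, a_minus
                self.x_pre = 0.0     # trace of recent presynaptic spikes
                self.x_post = 0.0    # trace of recent postsynaptic spikes
                self.elig = 0.0      # eligibility trace

            def step(self, pre_spike, post_spike, reward, dt=0.001):
                # exponential decay of all traces
                self.x_pre -= dt / self.tau_pre * self.x_pre
                self.x_post -= dt / self.tau_post * self.x_post
                self.elig -= dt / self.tau_e * self.elig
                # STDP-shaped contributions to the eligibility trace
                if post_spike:
                    self.elig += self.a_plus * self.x_pre    # pre-before-post: potentiation
                if pre_spike:
                    self.elig += self.a_minus * self.x_post  # post-before-pre: depression
                # spikes bump their own traces (after being used above)
                if pre_spike:
                    self.x_pre += 1.0
                if post_spike:
                    self.x_post += 1.0
                # reward gates the actual weight change
                self.w += self.lr * reward * self.elig
                return self.w

        syn = RStdpSynapse()
        for t in range(1000):
            pre = (t % 50 == 0)              # periodic presynaptic spikes
            post = (t % 50 == 5)             # postsynaptic spike shortly afterwards
            r = 1.0 if t == 900 else 0.0     # a single delayed reward
            syn.step(pre, post, r)
        print(syn.w)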

  20. How Food as a Reward Is Detrimental to Children's Health, Learning, and Behavior

    ERIC Educational Resources Information Center

    Fedewa, Alicia L.; Davis, Matthew Cody

    2015-01-01

    Background: Despite small- and wide-scale prevention efforts to curb obesity, the percentage of children classified as overweight and obese has remained relatively consistent in the last decade. As school personnel are increasingly pressured to enhance student performance, many educators use food as a reward to motivate and reinforce positive…

  1. Using Fuzzy Logic for Performance Evaluation in Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.

    1992-01-01

    Current reinforcement learning algorithms require long training periods which generally limit their applicability to small size problems. A new architecture is described which uses fuzzy rules to initialize its two neural networks: a neural network for performance evaluation and another for action selection. This architecture is applied to control of dynamic systems and it is demonstrated that it is possible to start with an approximate prior knowledge and learn to refine it through experiments using reinforcement learning.

  2. Coevolutionary networks of reinforcement-learning agents.

    PubMed

    Kianercy, Ardeshir; Galstyan, Aram

    2013-07-01

    This paper presents a model of network formation in repeated games where the players adapt their strategies and network ties simultaneously using a simple reinforcement-learning scheme. It is demonstrated that the coevolutionary dynamics of such systems can be described via coupled replicator equations. We provide a comprehensive analysis for three-player two-action games, which is the minimum system size with nontrivial structural dynamics. In particular, we characterize the Nash equilibria (NE) in such games and examine the local stability of the rest points corresponding to those equilibria. We also study general n-player networks via both simulations and analytical methods and find that, in the absence of exploration, the stable equilibria consist of star motifs as the main building blocks of the network. Furthermore, in all stable equilibria the agents play pure strategies, even when the game allows mixed NE. Finally, we study the impact of exploration on learning outcomes and observe that there is a critical exploration rate above which the symmetric and uniformly connected network topology becomes stable.
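
    The coupled replicator equations that describe these learning dynamics can be integrated directly. The sketch below does so for a generic two-player, two-action game; the optional entropy term is the exploration correction that arises from Boltzmann-style reinforcement learning and is included here as an assumption, with payoff matrices and initial strategies chosen arbitrarily.

        import numpy as np

        def replicator_step(x, y, A, B, dt=0.01, temperature=0.0):
            """One Euler step of coupled replicator dynamics for mixed strategies x, y."""
            fx = A @ y                                 # payoffs to player 1's actions
            fy = B.T @ x                               # payoffs to player 2's actions
            dx = x * (fx - x @ fx)
            dy = y * (fy - y @ fy)
            if temperature > 0:                        # entropy-driven exploration term
                dx += temperature * x * (np.log(x) @ x - np.log(x))
                dy += temperature * y * (np.log(y) @ y - np.log(y))
            x = np.clip(x + dt * dx, 1e-12, None)
            y = np.clip(y + dt * dy, 1e-12, None)
            return x / x.sum(), y / y.sum()

        # Matching pennies: strategies cycle around the mixed equilibrium.
        A = np.array([[1.0, -1.0], [-1.0, 1.0]])
        B = -A
        x, y = np.array([0.8, 0.2]), np.array([0.3, 0.7])
        for _ in range(500):
            x, y = replicator_step(x, y, A, B)
        print(x, y)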

  4. Switching Reinforcement Learning for Continuous Action Space

    NASA Astrophysics Data System (ADS)

    Nagayoshi, Masato; Murao, Hajime; Tamaki, Hisashi

    Reinforcement Learning (RL) attracts much attention as a technique for realizing computational intelligence such as adaptive and autonomous decentralized systems. In general, however, it is not easy to put RL into practical use. This difficulty includes the problem of designing a suitable action space for an agent, i.e., satisfying two requirements that trade off against each other: (i) to keep the characteristics (or structure) of an original search space as much as possible in order to seek strategies that lie close to the optimal, and (ii) to reduce the search space as much as possible in order to expedite the learning process. In order to design a suitable action space adaptively, we propose a switching RL model that mimics the process of an infant's motor development, in which gross motor skills develop before fine motor skills. A method for switching controllers is then constructed by introducing and referring to the “entropy”. Further, the validity of the proposed method has been confirmed through computational experiments using robot navigation problems with one- and two-dimensional continuous action spaces.
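
    One plausible reading of the entropy-based switching criterion is that the agent moves from a coarse to a finer action space once its action selection has become sufficiently deterministic. The snippet below illustrates that reading only; the threshold and the use of Shannon entropy over action probabilities are assumptions, not the paper's exact criterion.

        import numpy as np

        def action_entropy(probs):
            """Shannon entropy (in nats) of an action-selection distribution."""
            p = np.asarray(probs, dtype=float)
            p = p[p > 0]
            return float(-(p * np.log(p)).sum())

        def select_controller(probs, threshold=0.5):
            """Switch to the finer-grained controller once entropy drops below the threshold."""
            return "fine" if action_entropy(probs) < threshold else "coarse"

        print(select_controller([0.25, 0.25, 0.25, 0.25]))   # coarse: still exploring
        print(select_controller([0.97, 0.01, 0.01, 0.01]))   # fine: selection has converged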

  5. Oculomotor learning revisited: a model of reinforcement learning in the basal ganglia incorporating an efference copy of motor actions

    PubMed Central

    Fee, Michale S.

    2012-01-01

    In its simplest formulation, reinforcement learning is based on the idea that if an action taken in a particular context is followed by a favorable outcome, then, in the same context, the tendency to produce that action should be strengthened, or reinforced. While reinforcement learning forms the basis of many current theories of basal ganglia (BG) function, these models do not incorporate distinct computational roles for signals that convey context, and those that convey what action an animal takes. Recent experiments in the songbird suggest that vocal-related BG circuitry receives two functionally distinct excitatory inputs. One input is from a cortical region that carries context information about the current “time” in the motor sequence. The other is an efference copy of motor commands from a separate cortical brain region that generates vocal variability during learning. Based on these findings, I propose here a general model of vertebrate BG function that combines context information with a distinct motor efference copy signal. The signals are integrated by a learning rule in which efference copy inputs gate the potentiation of context inputs (but not efference copy inputs) onto medium spiny neurons in response to a rewarded action. The hypothesis is described in terms of a circuit that implements the learning of visually guided saccades. The model makes testable predictions about the anatomical and functional properties of hypothesized context and efference copy inputs to the striatum from both thalamic and cortical sources. PMID:22754501
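
    The proposed learning rule is a three-factor, gated form of plasticity: context-input weights onto a striatal neuron change only when an efference copy of the executed command and a reward signal are both present. A minimal sketch of such a gated update, with all names and scalings purely illustrative, follows.

        import numpy as np

        def gated_corticostriatal_update(w_context, context_input, efference_copy,
                                         reward, lr=0.05):
            """Potentiate context-input weights only when the efference copy of the
            executed action coincides with reward; efference-copy weights themselves
            are left unchanged, as the model proposes."""
            gate = efference_copy * reward            # plasticity is gated by both factors
            return w_context + lr * gate * context_input

        w = np.zeros(4)
        context = np.array([0.0, 1.0, 0.0, 0.0])      # "time in the sequence" input
        w = gated_corticostriatal_update(w, context, efference_copy=1.0, reward=1.0)
        print(w)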

  6. DAT isn't all that: cocaine reward and reinforcement require Toll-like receptor 4 signaling.

    PubMed

    Northcutt, A L; Hutchinson, M R; Wang, X; Baratta, M V; Hiranita, T; Cochran, T A; Pomrenze, M B; Galer, E L; Kopajtic, T A; Li, C M; Amat, J; Larson, G; Cooper, D C; Huang, Y; O'Neill, C E; Yin, H; Zahniser, N R; Katz, J L; Rice, K C; Maier, S F; Bachtell, R K; Watkins, L R

    2015-12-01

    The initial reinforcing properties of drugs of abuse, such as cocaine, are largely attributed to their ability to activate the mesolimbic dopamine system. Resulting increases in extracellular dopamine in the nucleus accumbens (NAc) are traditionally thought to result from cocaine's ability to block dopamine transporters (DATs). Here we demonstrate that cocaine also interacts with the immunosurveillance receptor complex, Toll-like receptor 4 (TLR4), on microglial cells to initiate central innate immune signaling. Disruption of cocaine signaling at TLR4 suppresses cocaine-induced extracellular dopamine in the NAc, as well as cocaine conditioned place preference and cocaine self-administration. These results provide a novel understanding of the neurobiological mechanisms underlying cocaine reward/reinforcement that includes a critical role for central immune signaling, and offer a new target for medication development for cocaine abuse treatment.

  7. Disentangling pleasure from incentive salience and learning signals in brain reward circuitry.

    PubMed

    Smith, Kyle S; Berridge, Kent C; Aldridge, J Wayne

    2011-07-01

    Multiple signals for reward-hedonic impact, motivation, and learned associative prediction-are funneled through brain mesocorticolimbic circuits involving the nucleus accumbens and ventral pallidum. Here, we show how the hedonic "liking" and motivation "wanting" signals for a sweet reward are distinctly modulated and tracked in this circuit separately from signals for Pavlovian predictions (learning). Animals first learned to associate a fixed sequence of Pavlovian cues with sucrose reward. Subsequent intraaccumbens microinjections of an opioid-stimulating drug increased the hedonic liking impact of sucrose in behavior and firing signals of ventral pallidum neurons, and likewise, they increased incentive salience signals in firing to the reward-proximal incentive cue (but did not alter firing signals to the learned prediction value of a reward-distal cue). Microinjection of a dopamine-stimulating drug instead enhanced only the motivation component but did not alter hedonic impact or learned prediction signals. Different dedicated neuronal subpopulations in the ventral pallidum tracked signal enhancements for hedonic impact vs. incentive salience, and a faster firing pattern also distinguished incentive signals from slower hedonic signals, even for a third overlapping population. These results reveal separate neural representations of wanting, liking, and prediction components of the same reward within the nucleus accumbens to ventral pallidum segment of mesocorticolimbic circuitry.

  8. Macaque monkeys can learn token values from human models through vicarious reward.

    PubMed

    Bevacqua, Sara; Cerasti, Erika; Falcone, Rossella; Cervelloni, Milena; Brunamonti, Emiliano; Ferraina, Stefano; Genovesio, Aldo

    2013-01-01

    Monkeys can learn the symbolic meaning of tokens and exchange them to get a reward. Monkeys can also learn the symbolic value of a token by observing conspecifics, but it is not clear if they can learn passively by observing other actors, e.g., humans. To answer this question, we tested two monkeys in a token exchange paradigm in three experiments. Monkeys learned token values through observation of human models exchanging them. We used, after a phase of object familiarization, different sets of tokens. One token of each set was rewarded with a bit of apple. Other tokens had zero value (neutral tokens). Each token was presented only in one set. During the observation phase, monkeys watched the human model exchange tokens and watched them consume rewards (vicarious rewards). In the test phase, the monkeys were asked to exchange one of the tokens for food reward. Sets of three tokens were used in the first experiment and sets of two tokens were used in the second and third experiments. The valuable token was presented with different probabilities in the observation phase during the first and second experiments, in which the monkeys exchanged the valuable token more frequently than any of the neutral tokens. The third experiment examined the effect of unequal probabilities. Our results support the view that monkeys can learn from non-conspecific actors through vicarious reward, even in a symbolic task like the token-exchange task. PMID:23544115

  9. Single Dose of a Dopamine Agonist Impairs Reinforcement Learning in Humans: Evidence from Event-related Potentials and Computational Modeling of Striatal-Cortical Function

    PubMed Central

    Santesso, Diane L.; Evins, A. Eden; Frank, Michael J.; Cowman Schetter, Erika M.; Bogdan, Ryan; Pizzagalli, Diego A.

    2011-01-01

    Animal findings have highlighted the modulatory role of phasic dopamine (DA) signaling in incentive learning, particularly in the acquisition of reward-related behavior. In humans, these processes remain largely unknown. In a recent study we demonstrated that a single low dose of a D2/D3 agonist (pramipexole) – assumed to activate DA autoreceptors and thus reduce phasic DA bursts – impaired reward learning in healthy subjects performing a probabilistic reward task. The purpose of the present study was to extend these behavioral findings using event-related potentials and computational modeling. Compared to the placebo group, participants receiving pramipexole showed increased feedback-related negativity to probabilistic rewards and decreased activation in dorsal anterior cingulate regions previously implicated in integrating reinforcement history over time. Additionally, findings of blunted reward learning in participants receiving pramipexole were simulated by reduced presynaptic DA signaling in response to reward in a neural network model of striatal-cortical function. These preliminary findings offer important insights on the role of phasic DA signals on reinforcement learning in humans, and provide initial evidence regarding the spatio-temporal dynamics of brain mechanisms underlying these processes. PMID:18726908

  10. Affective modulation of the startle reflex and the Reinforcement Sensitivity Theory of personality: The role of sensitivity to reward.

    PubMed

    Aluja, Anton; Blanch, Angel; Blanco, Eduardo; Balada, Ferran

    2015-01-01

    This study evaluated differences in the amplitude of the startle reflex in relation to the Sensitivity to Reward (SR) and Sensitivity to Punishment (SP) personality variables of the Reinforcement Sensitivity Theory (RST). We hypothesized that subjects with higher scores in SR would obtain a higher startle reflex when exposed to pleasant pictures than those with lower scores, while subjects with higher scores in SP would obtain a higher startle reflex when exposed to unpleasant pictures than subjects with lower scores in this dimension. The sample consisted of 112 healthy female undergraduate psychology students. Personality was assessed using the short version of the Sensitivity to Punishment and Sensitivity Reward Questionnaire (SPSRQ). Laboratory anxiety was controlled by the State Anxiety Inventory. The startle blink reflex was recorded electromyographically (EMG) from the right orbicularis oculi muscle as a response to the International Affective Picture System (IAPS) pleasant, neutral and unpleasant pictures. Subjects higher in SR obtained a significantly higher startle reflex response to pleasant pictures than lower scorers (48.48 vs 46.28, p<0.012). Subjects with higher scores in SP showed a slight tendency toward higher startle responses to unpleasant pictures in a non-parametric local regression graphical analysis (LOESS). The findings shed light on the relationship between the impulsive-disinhibited personality, including sensitivity to reward, and the emotions evoked by pictures of emotional content. PMID:25447471

  12. Novelty as a Reinforcer for Position Learning in Children

    ERIC Educational Resources Information Center

    Wilson, Marian Monyok

    1974-01-01

    The stimulus-familiarization-effect (SFE) paradigm, a reaction-time (RT) task based on a response-to-novelty procedure, was modified to assess response for novelty, i.e., a response-reinforcement sequence. The potential implications of attention for reinforcement theory and learning in general are discussed. (Author/CS)

  13. On the Possibility of a Reinforcement Theory of Cognitive Learning.

    ERIC Educational Resources Information Center

    Smith, Kendon

    This paper discusses cognitive learning in terms of reinforcement theory and presents arguments suggesting that a viable theory of cognition based on reinforcement principles is not out of the question. This position is supported by a discussion of the weaknesses of theories based entirely on contiguity and of considerations that are more positive…

  14. Covert Operant Reinforcement of Remedial Reading Learning Tasks.

    ERIC Educational Resources Information Center

    Schmickley, Verne G.

    The effects of covert operant reinforcement upon remedial reading learning tasks were investigated. Forty junior high school students were taught to imagine either neutral scenes (control) or positive scenes (treatment) upon cue while reading. It was hypothesized that positive covert reinforcement would enhance performance on several measures of…

  15. Repeated electrical stimulation of reward-related brain regions affects cocaine but not "natural" reinforcement.

    PubMed

    Levy, Dino; Shabat-Simon, Maytal; Shalev, Uri; Barnea-Ygael, Noam; Cooper, Ayelet; Zangen, Abraham

    2007-12-19

    Drug addiction is associated with long-lasting neuronal adaptations including alterations in dopamine and glutamate receptors in the brain reward system. Treatment strategies for cocaine addiction and especially the prevention of craving and relapse are limited, and their effectiveness is still questionable. We hypothesized that repeated stimulation of the brain reward system can induce localized neuronal adaptations that may either potentiate or reduce addictive behaviors. The present study was designed to test how repeated interference with the brain reward system using localized electrical stimulation of the medial forebrain bundle at the lateral hypothalamus (LH) or the prefrontal cortex (PFC) affects cocaine addiction-associated behaviors and some of the neuronal adaptations induced by repeated exposure to cocaine. Repeated high-frequency stimulation at either site influenced cocaine, but not sucrose, reward-related behaviors. Stimulation of the LH reduced cue-induced seeking behavior, whereas stimulation of the PFC reduced both cocaine-seeking behavior and the motivation for its consumption. The behavioral findings were accompanied by glutamate receptor subtype alterations in the nucleus accumbens and the ventral tegmental area, both key structures of the reward system. It is therefore suggested that repeated electrical stimulation of the PFC can become a novel strategy for treating addiction. PMID:18094257

  16. Modulation of spatial attention by goals, statistical learning, and monetary reward.

    PubMed

    Jiang, Yuhong V; Sha, Li Z; Remington, Roger W

    2015-10-01

    This study documented the relative strength of task goals, visual statistical learning, and monetary reward in guiding spatial attention. Using a difficult T-among-L search task, we cued spatial attention to one visual quadrant by (i) instructing people to prioritize it (goal-driven attention), (ii) placing the target frequently there (location probability learning), or (iii) associating that quadrant with greater monetary gain (reward-based attention). Results showed that successful goal-driven attention exerted the strongest influence on search RT. Incidental location probability learning yielded a smaller though still robust effect. Incidental reward learning produced negligible guidance for spatial attention. The 95 % confidence intervals of the three effects were largely nonoverlapping. To understand these results, we simulated the role of location repetition priming in probability cuing and reward learning. Repetition priming underestimated the strength of location probability cuing, suggesting that probability cuing involved long-term statistical learning of how to shift attention. Repetition priming provided a reasonable account for the negligible effect of reward on spatial attention. We propose a multiple-systems view of spatial attention that includes task goals, search habit, and priming as primary drivers of top-down attention. PMID:26105657

  18. Neuropeptide F neurons modulate sugar reward during associative olfactory learning of Drosophila larvae.

    PubMed

    Rohwedder, Astrid; Selcho, Mareike; Chassot, Bérénice; Thum, Andreas S

    2015-12-15

    All organisms continuously have to adapt their behavior according to changes in the environment in order to survive. Experience-driven changes in behavior are usually mediated and maintained by modifications in signaling within defined brain circuits. Given the simplicity of the larval brain of Drosophila and its experimental accessibility on the genetic and behavioral level, we analyzed if Drosophila neuropeptide F (dNPF) neurons are involved in classical olfactory conditioning. dNPF is an ortholog of the mammalian neuropeptide Y, a highly conserved neuromodulator that stimulates food-seeking behavior. We provide a comprehensive anatomical analysis of the dNPF neurons on the single-cell level. We demonstrate that artificial activation of dNPF neurons inhibits appetitive olfactory learning by modulating the sugar reward signal during acquisition. No effect is detectable for the retrieval of an established appetitive olfactory memory. The modulatory effect is based on the joint action of three distinct cell types that, if tested on the single-cell level, inhibit and invert the conditioned behavior. Taken together, our work describes anatomically and functionally a new part of the sugar reinforcement signaling pathway for classical olfactory conditioning in Drosophila larvae.

  19. Rewarded by Punishment: Reflections on the Disuse of Positive Reinforcement in Education.

    ERIC Educational Resources Information Center

    Maag, John W.

    2001-01-01

    This article delineates the reasons why educators find punishment a more acceptable approach for managing students' challenging behaviors than positive reinforcement. The article argues that educators should plan the occurrence of positive reinforcement to increase appropriate behaviors rather than running the risk of it haphazardly promoting…

  20. Reinforcement learning of motor skills with policy gradients.

    PubMed

    Peters, Jan; Schaal, Stefan

    2008-05-01

    Autonomous learning is one of the hallmarks of human and animal behavior, and understanding the principles of learning will be crucial in order to achieve true autonomy in advanced machines like humanoid robots. In this paper, we examine learning of complex motor skills with human-like limbs. While supervised learning can offer useful tools for bootstrapping behavior, e.g., by learning from demonstration, it is only reinforcement learning that offers a general approach to the final trial-and-error improvement that is needed by each individual acquiring a skill. Neither neurobiological nor machine learning studies have, so far, offered compelling results on how reinforcement learning can be scaled to the high-dimensional continuous state and action spaces of humans or humanoids. Here, we combine two recent research developments on learning motor control in order to achieve this scaling. First, we interpret the idea of modular motor control by means of motor primitives as a suitable way to generate parameterized control policies for reinforcement learning. Second, we combine motor primitives with the theory of stochastic policy gradient learning, which currently seems to be the only feasible framework for reinforcement learning for humanoids. We evaluate different policy gradient methods with a focus on their applicability to parameterized motor primitives. We compare these algorithms in the context of motor primitive learning, and show that our most modern algorithm, the Episodic Natural Actor-Critic outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method in the application of learning to hit a baseball with an anthropomorphic robot arm.
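
    The paper itself develops the Episodic Natural Actor-Critic for motor primitives; as a much smaller illustration of the underlying idea, the sketch below implements plain episodic ("vanilla") policy-gradient learning (REINFORCE with a running baseline) for a linear-Gaussian policy on an invented one-dimensional reaching task. The task, features, and constants are assumptions made for illustration, not the paper's setup.

```python
import numpy as np

# Minimal episodic policy gradient (REINFORCE with a running baseline) for a
# linear-Gaussian policy on a toy 1-D reaching task. Illustrative stand-in,
# not the paper's motor-primitive / Episodic Natural Actor-Critic method.

rng = np.random.default_rng(0)
theta = np.zeros(2)      # policy parameters: [gain on distance-to-goal, bias]
sigma = 0.1              # fixed exploration noise of the Gaussian policy
alpha = 0.05             # learning rate
goal = 1.0

def rollout(theta):
    """Run one episode; return the summed log-policy gradient and the return."""
    x, grad_sum, ret = 0.0, np.zeros_like(theta), 0.0
    for _ in range(20):
        feats = np.array([goal - x, 1.0])
        mean = theta @ feats
        a = mean + sigma * rng.standard_normal()          # sample an action
        grad_sum += (a - mean) / sigma**2 * feats         # d log pi(a|x) / d theta
        x += 0.1 * a                                      # simple point-mass dynamics
        ret += -(goal - x) ** 2                           # quadratic cost as reward
    return grad_sum, ret

baseline = 0.0
for episode in range(2000):
    grad_logp, R = rollout(theta)
    baseline += 0.05 * (R - baseline)                     # running-average baseline
    theta += alpha * (R - baseline) * grad_logp           # REINFORCE update
```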

  1. Reinforcement learning of motor skills with policy gradients.

    PubMed

    Peters, Jan; Schaal, Stefan

    2008-05-01

    Autonomous learning is one of the hallmarks of human and animal behavior, and understanding the principles of learning will be crucial in order to achieve true autonomy in advanced machines like humanoid robots. In this paper, we examine learning of complex motor skills with human-like limbs. While supervised learning can offer useful tools for bootstrapping behavior, e.g., by learning from demonstration, it is only reinforcement learning that offers a general approach to the final trial-and-error improvement that is needed by each individual acquiring a skill. Neither neurobiological nor machine learning studies have, so far, offered compelling results on how reinforcement learning can be scaled to the high-dimensional continuous state and action spaces of humans or humanoids. Here, we combine two recent research developments on learning motor control in order to achieve this scaling. First, we interpret the idea of modular motor control by means of motor primitives as a suitable way to generate parameterized control policies for reinforcement learning. Second, we combine motor primitives with the theory of stochastic policy gradient learning, which currently seems to be the only feasible framework for reinforcement learning for humanoids. We evaluate different policy gradient methods with a focus on their applicability to parameterized motor primitives. We compare these algorithms in the context of motor primitive learning, and show that our most modern algorithm, the Episodic Natural Actor-Critic outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method in the application of learning to hit a baseball with an anthropomorphic robot arm. PMID:18482830

  2. Reinforcement learning, conditioning, and the brain: Successes and challenges.

    PubMed

    Maia, Tiago V

    2009-12-01

    The field of reinforcement learning has greatly influenced the neuroscientific study of conditioning. This article provides an introduction to reinforcement learning followed by an examination of the successes and challenges using reinforcement learning to understand the neural bases of conditioning. Successes reviewed include (1) the mapping of positive and negative prediction errors to the firing of dopamine neurons and neurons in the lateral habenula, respectively; (2) the mapping of model-based and model-free reinforcement learning to associative and sensorimotor cortico-basal ganglia-thalamo-cortical circuits, respectively; and (3) the mapping of actor and critic to the dorsal and ventral striatum, respectively. Challenges reviewed consist of several behavioral and neural findings that are at odds with standard reinforcement-learning models, including, among others, evidence for hyperbolic discounting and adaptive coding. The article suggests ways of reconciling reinforcement-learning models with many of the challenging findings, and highlights the need for further theoretical developments where necessary. Additional information related to this study may be downloaded from http://cabn.psychonomic-journals.org/content/supplemental. PMID:19897789
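
    As a toy illustration of the actor/critic mapping described above (a critic learning state values and an actor learning action preferences, both driven by a dopamine-like prediction error), the following sketch runs a tabular actor-critic on an invented three-state task. All task details and constants are assumptions, not taken from the article.

```python
import numpy as np

# Tabular actor-critic sketch: the critic's TD error plays the role of the
# dopamine-like prediction error that trains both value and policy. The toy
# environment and constants are illustrative.

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2
V = np.zeros(n_states)                 # critic: state values
H = np.zeros((n_states, n_actions))    # actor: action preferences
alpha_v, alpha_h, gamma = 0.1, 0.1, 0.95

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(s, a):
    """Toy environment: action 0 taken in state 2 is rewarded."""
    r = 1.0 if (s == 2 and a == 0) else 0.0
    return (s + 1) % n_states, r

s = 0
for t in range(5000):
    a = rng.choice(n_actions, p=softmax(H[s]))
    s_next, r = step(s, a)
    delta = r + gamma * V[s_next] - V[s]   # prediction error
    V[s] += alpha_v * delta                # critic update
    H[s, a] += alpha_h * delta             # actor update
    s = s_next
```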

  3. Knockout crickets for the study of learning and memory: Dopamine receptor Dop1 mediates aversive but not appetitive reinforcement in crickets.

    PubMed

    Awata, Hiroko; Watanabe, Takahito; Hamanaka, Yoshitaka; Mito, Taro; Noji, Sumihare; Mizunami, Makoto

    2015-01-01

    Elucidation of reinforcement mechanisms in associative learning is an important subject in neuroscience. In mammals, dopamine neurons are thought to play critical roles in mediating both appetitive and aversive reinforcement. Our pharmacological studies suggested that octopamine and dopamine neurons mediate reward and punishment, respectively, in crickets, but recent studies in fruit-flies concluded that dopamine neurons mediate both reward and punishment, via the type 1 dopamine receptor Dop1. To resolve the discrepancy between studies in different insect species, we produced Dop1 knockout crickets using the CRISPR/Cas9 system and found that they are defective in aversive learning with sodium chloride punishment but not in appetitive learning with water or sucrose reward. The results suggest that dopamine and octopamine neurons mediate aversive and appetitive reinforcement, respectively, in crickets. We suggest unexpected diversity in neurotransmitters mediating appetitive reinforcement between crickets and fruit-flies, although the neurotransmitter mediating aversive reinforcement is conserved. This study demonstrates the usefulness of the CRISPR/Cas9 system for producing knockout animals for the study of learning and memory. PMID:26521965

  4. Knockout crickets for the study of learning and memory: Dopamine receptor Dop1 mediates aversive but not appetitive reinforcement in crickets.

    PubMed

    Awata, Hiroko; Watanabe, Takahito; Hamanaka, Yoshitaka; Mito, Taro; Noji, Sumihare; Mizunami, Makoto

    2015-11-02

    Elucidation of reinforcement mechanisms in associative learning is an important subject in neuroscience. In mammals, dopamine neurons are thought to play critical roles in mediating both appetitive and aversive reinforcement. Our pharmacological studies suggested that octopamine and dopamine neurons mediate reward and punishment, respectively, in crickets, but recent studies in fruit-flies concluded that dopamine neurons mediate both reward and punishment, via the type 1 dopamine receptor Dop1. To resolve the discrepancy between studies in different insect species, we produced Dop1 knockout crickets using the CRISPR/Cas9 system and found that they are defective in aversive learning with sodium chloride punishment but not in appetitive learning with water or sucrose reward. The results suggest that dopamine and octopamine neurons mediate aversive and appetitive reinforcement, respectively, in crickets. We suggest unexpected diversity in neurotransmitters mediating appetitive reinforcement between crickets and fruit-flies, although the neurotransmitter mediating aversive reinforcement is conserved. This study demonstrates the usefulness of the CRISPR/Cas9 system for producing knockout animals for the study of learning and memory.

  5. Knockout crickets for the study of learning and memory: Dopamine receptor Dop1 mediates aversive but not appetitive reinforcement in crickets

    PubMed Central

    Awata, Hiroko; Watanabe, Takahito; Hamanaka, Yoshitaka; Mito, Taro; Noji, Sumihare; Mizunami, Makoto

    2015-01-01

    Elucidation of reinforcement mechanisms in associative learning is an important subject in neuroscience. In mammals, dopamine neurons are thought to play critical roles in mediating both appetitive and aversive reinforcement. Our pharmacological studies suggested that octopamine and dopamine neurons mediate reward and punishment, respectively, in crickets, but recent studies in fruit-flies concluded that dopamine neurons mediate both reward and punishment, via the type 1 dopamine receptor Dop1. To resolve the discrepancy between studies in different insect species, we produced Dop1 knockout crickets using the CRISPR/Cas9 system and found that they are defective in aversive learning with sodium chloride punishment but not in appetitive learning with water or sucrose reward. The results suggest that dopamine and octopamine neurons mediate aversive and appetitive reinforcement, respectively, in crickets. We suggest unexpected diversity in neurotransmitters mediating appetitive reinforcement between crickets and fruit-flies, although the neurotransmitter mediating aversive reinforcement is conserved. This study demonstrates the usefulness of the CRISPR/Cas9 system for producing knockout animals for the study of learning and memory. PMID:26521965

  6. The Use of Rewards in Instructional Digital Games: An Application of Positive Reinforcement

    ERIC Educational Resources Information Center

    Malala, John; Major, Anthony; Maunez-Cuadra, Jose; McCauley-Bell, Pamela

    2007-01-01

    The main argument being presented in this paper is that instructional designers and educational researchers need to shift their attention from performance to interest. Educational digital games have to aim at building lasting interest in real world applications. The main hypothesis advocated in this paper is that the use of rewards in educational…

  7. A Brain-like Learning System with Supervised, Unsupervised and Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Sasakawa, Takafumi; Hu, Jinglu; Hirasawa, Kotaro

    Our brain has three different learning paradigms: supervised, unsupervised and reinforcement learning. It has been suggested that these learning paradigms relate closely to the cerebellum, cerebral cortex and basal ganglia, respectively. Inspired by this knowledge of the brain, we present a brain-like learning system with these three learning algorithms. The proposed system consists of three parts: the supervised learning (SL) part, the unsupervised learning (UL) part and the reinforcement learning (RL) part. The SL part, corresponding to the cerebellum, learns an input-output mapping by supervised learning. The UL part, corresponding to the cerebral cortex, is a competitive learning network and divides the input space into subspaces by unsupervised learning. The RL part, corresponding to the basal ganglia, optimizes the model performance by reinforcement learning. Numerical simulations show that the proposed brain-like learning system optimizes its performance automatically and outperforms an ordinary neural network.
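
    A schematic sketch of the three-part idea, under assumptions of our own (the original paper's network and simulations are not reproduced here): a competitive layer partitions the input space without supervision, a local linear model is fitted by supervised learning within each partition, and a small bandit-style reinforcement learner tunes a global output gain from scalar reward.

```python
import numpy as np

# Schematic stand-in for the SL/UL/RL decomposition: competitive prototypes (UL)
# partition the input space, local linear models (SL) fit the mapping inside each
# partition, and an epsilon-greedy bandit (RL) tunes a global output gain from
# scalar reward. Everything here is illustrative.

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * X[:, 0])                        # target input-output mapping

K = 8
prototypes = rng.uniform(-1, 1, size=(K, 1))   # UL part: competitive units
W = np.zeros((K, 2))                           # SL part: per-unit (slope, bias)

for epoch in range(50):
    for x, t in zip(X, y):
        k = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))  # winning unit
        prototypes[k] += 0.05 * (x - prototypes[k])                 # competitive update
        phi = np.array([x[0], 1.0])
        W[k] += 0.1 * (t - W[k] @ phi) * phi                        # LMS (supervised) update

# RL part: epsilon-greedy bandit over candidate output gains, reward = -squared error.
gains = np.linspace(0.5, 1.5, 11)
Q = np.zeros(len(gains))
for step in range(2000):
    g = rng.integers(len(gains)) if rng.random() < 0.1 else int(np.argmax(Q))
    i = rng.integers(len(X))
    k = int(np.argmin(np.linalg.norm(prototypes - X[i], axis=1)))
    pred = gains[g] * (W[k] @ np.array([X[i, 0], 1.0]))
    r = -(pred - y[i]) ** 2
    Q[g] += 0.05 * (r - Q[g])                                       # incremental value update
```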

  8. A Computer-Assisted Learning Model Based on the Digital Game Exponential Reward System

    ERIC Educational Resources Information Center

    Moon, Man-Ki; Jahng, Surng-Gahb; Kim, Tae-Yong

    2011-01-01

    The aim of this research was to construct a motivational model which would stimulate voluntary and proactive learning using digital game methods offering players more freedom and control. The theoretical framework of this research lays the foundation for a pedagogical learning model based on digital games. We analyzed the game reward system, which…

  9. Continuous theta-burst stimulation (cTBS) over the lateral prefrontal cortex alters reinforcement learning bias.

    PubMed

    Ott, Derek V M; Ullsperger, Markus; Jocham, Gerhard; Neumann, Jane; Klein, Tilmann A

    2011-07-15

    The prefrontal cortex is known to play a key role in higher-order cognitive functions. Recently, we showed that this brain region is active in reinforcement learning, during which subjects constantly have to integrate trial outcomes in order to optimize performance. To further elucidate the role of the dorsolateral prefrontal cortex (DLPFC) in reinforcement learning, we applied continuous theta-burst stimulation (cTBS) either to the left or right DLPFC, or to the vertex as a control region, respectively, prior to the performance of a probabilistic learning task in an fMRI environment. While there was no influence of cTBS on learning performance per se, we observed a stimulation-dependent modulation of reward vs. punishment sensitivity: Left-hemispherical DLPFC stimulation led to a more reward-guided performance, while right-hemispherical cTBS induced a more avoidance-guided behavior. FMRI results showed enhanced prediction error coding in the ventral striatum in subjects stimulated over the left as compared to the right DLPFC. Both behavioral and imaging results are in line with recent findings that left, but not right-hemispherical stimulation can trigger a release of dopamine in the ventral striatum, which has been suggested to increase the relative impact of rewards rather than punishment on behavior. PMID:21554966

  10. The influence of personality on neural mechanisms of observational fear and reward learning

    PubMed Central

    Hooker, Christine I.; Verosky, Sara C.; Miyakawa, Asako; Knight, Robert T.; D’Esposito, Mark

    2012-01-01

    Fear and reward learning can occur through direct experience or observation. Both channels can enhance survival or create maladaptive behavior. We used fMRI to isolate neural mechanisms of observational fear and reward learning and investigate whether neural response varied according to individual differences in neuroticism and extraversion. Participants learned object-emotion associations by observing a woman respond with fearful (or neutral) and happy (or neutral) facial expressions to novel objects. The amygdala-hippocampal complex was active when learning the object-fear association, and the hippocampus was active when learning the object-happy association. After learning, objects were presented alone; amygdala activity was greater for the fear (vs. neutral) and happy (vs. neutral) associated object. Importantly, greater amygdala-hippocampal activity during fear (vs. neutral) learning predicted better recognition of learned objects on a subsequent memory test. Furthermore, personality modulated neural mechanisms of learning. Neuroticism positively correlated with neural activity in the amygdala and hippocampus during fear (vs. neutral) learning. Low extraversion/high introversion was related to faster behavioral predictions of the fearful and neutral expressions during fear learning. In addition, low extraversion/high introversion was related to greater amygdala activity during happy (vs. neutral) learning, happy (vs. neutral) object recognition, and faster reaction times for predicting happy and neutral expressions during reward learning. These findings suggest that neuroticism is associated with an increased sensitivity in the neural mechanism for fear learning which leads to enhanced encoding of fear associations, and that low extraversion/high introversion is related to enhanced conditionability for both fear and reward learning. PMID:18573512

  11. Motive to Avoid Success, Locus of Control, and Reinforcement Avoidance.

    ERIC Educational Resources Information Center

    Katovsky, Walter

    Subjects, four groups of 12 college women high or low in motive to avoid success (MAS) and locus of control (LC), were reinforced for response A on a fixed partial reinforcement schedule across three concept learning tasks: one task consisting of combined reward and punishment, another of reward only, and one of punishment only. Response B was…

  12. Vicarious reinforcement in rhesus macaques (Macaca mulatta).

    PubMed

    Chang, Steve W C; Winecoff, Amy A; Platt, Michael L

    2011-01-01

    What happens to others profoundly influences our own behavior. Such other-regarding outcomes can drive observational learning, as well as motivate cooperation, charity, empathy, and even spite. Vicarious reinforcement may serve as one of the critical mechanisms mediating the influence of other-regarding outcomes on behavior and decision-making in groups. Here we show that rhesus macaques spontaneously derive vicarious reinforcement from observing rewards given to another monkey, and that this reinforcement can motivate them to subsequently deliver or withhold rewards from the other animal. We exploited Pavlovian and instrumental conditioning to associate rewards to self (M1) and/or rewards to another monkey (M2) with visual cues. M1s made more errors in the instrumental trials when cues predicted reward to M2 compared to when cues predicted reward to M1, but made even more errors when cues predicted reward to no one. In subsequent preference tests between pairs of conditioned cues, M1s preferred cues paired with reward to M2 over cues paired with reward to no one. By contrast, M1s preferred cues paired with reward to self over cues paired with reward to both monkeys simultaneously. Rates of attention to M2 strongly predicted the strength and valence of vicarious reinforcement. These patterns of behavior, which were absent in non-social control trials, are consistent with vicarious reinforcement based upon sensitivity to observed, or counterfactual, outcomes with respect to another individual. Vicarious reward may play a critical role in shaping cooperation and competition, as well as motivating observational learning and group coordination in rhesus macaques, much as it does in humans. We propose that vicarious reinforcement signals mediate these behaviors via homologous neural circuits involved in reinforcement learning and decision-making.

  13. Replication study implicates COMT val158met polymorphism as a modulator of probabilistic reward learning.

    PubMed

    Lancaster, T M; Heerey, E A; Mantripragada, K; Linden, D E J

    2015-07-01

    Previous studies suggest that a single nucleotide polymorphism in the catechol-O-methyltransferase (COMT) gene (val158met) may modulate reward-guided decision making in healthy individuals. The polymorphism affects dopamine catabolism and thus modulates prefrontal dopamine levels, which may lead to variation in individual responses to risk and reward. We previously showed, using tasks that index reward responsiveness (measured by responses bias towards reinforced stimuli) and risk taking (measured by the Balloon Analogue Risk Task), that COMT met homozygotes had increased reward responsiveness and, thus, an increased propensity to seek reward. In this study, we sought to replicate these effects in a larger, independent cohort of Caucasian UK university students and staff with similar demographic characteristics (n = 101; 54 females, mean age: 22.2 years). Similarly to our previous study, we observed a significant trial × COMT genotype interaction (P = 0.047; η(2) = 0.052), which was driven by a significant effect of COMT on the incremental acquisition of response bias [response bias at block 3 - block 1 (met/met > val/val: P = 0.028) and block 3 - block 2 (met/met > val/val: P = 0.007)], suggesting that COMT met homozygotes demonstrated higher levels of reward responsiveness by the end of the task. However, we failed to see main effects of COMT genotype on overall response bias or risk-seeking behaviour. These results provide additional evidence that prefrontal dopaminergic variation may have a role in reward responsiveness, but not risk-seeking behaviour. Our findings may have implications for neuropsychiatric disorders characterized by clinical deficits in reward processing such as anhedonia. PMID:26096878

  14. Rats bred for helplessness exhibit positive reinforcement learning deficits which are not alleviated by an antidepressant dose of the MAO-B inhibitor deprenyl.

    PubMed

    Schulz, Daniela; Henn, Fritz A; Petri, David; Huston, Joseph P

    2016-08-01

    Principles of negative reinforcement learning may play a critical role in the etiology and treatment of depression. We examined the integrity of positive reinforcement learning in congenitally helpless (cH) rats, an animal model of depression, using a random ratio schedule and a devaluation-extinction procedure. Furthermore, we tested whether an antidepressant dose of the monoamine oxidase (MAO)-B inhibitor deprenyl would reverse any deficits in positive reinforcement learning. We found that cH rats (n=9) were impaired in the acquisition of even simple operant contingencies, such as a fixed interval (FI) 20 schedule. cH rats exhibited no apparent deficits in appetite or reward sensitivity. They reacted to the devaluation of food in a manner consistent with a dose-response relationship. Reinforcer motivation as assessed by lever pressing across sessions with progressively decreasing reward probabilities was highest in congenitally non-helpless (cNH, n=10) rats as long as the reward probabilities remained relatively high. cNH compared to wild-type (n=10) rats were also more resistant to extinction across sessions. Compared to saline (n=5), deprenyl (n=5) reduced the duration of immobility of cH rats in the forced swimming test, indicative of antidepressant effects, but did not restore any deficits in the acquisition of a FI 20 schedule. We conclude that positive reinforcement learning was impaired in rats bred for helplessness, possibly due to motivational impairments but not deficits in reward sensitivity, and that deprenyl exerted antidepressant effects but did not reverse the deficits in positive reinforcement learning. PMID:27163379

  15. Rats bred for helplessness exhibit positive reinforcement learning deficits which are not alleviated by an antidepressant dose of the MAO-B inhibitor deprenyl.

    PubMed

    Schulz, Daniela; Henn, Fritz A; Petri, David; Huston, Joseph P

    2016-08-01

    Principles of negative reinforcement learning may play a critical role in the etiology and treatment of depression. We examined the integrity of positive reinforcement learning in congenitally helpless (cH) rats, an animal model of depression, using a random ratio schedule and a devaluation-extinction procedure. Furthermore, we tested whether an antidepressant dose of the monoamine oxidase (MAO)-B inhibitor deprenyl would reverse any deficits in positive reinforcement learning. We found that cH rats (n=9) were impaired in the acquisition of even simple operant contingencies, such as a fixed interval (FI) 20 schedule. cH rats exhibited no apparent deficits in appetite or reward sensitivity. They reacted to the devaluation of food in a manner consistent with a dose-response relationship. Reinforcer motivation as assessed by lever pressing across sessions with progressively decreasing reward probabilities was highest in congenitally non-helpless (cNH, n=10) rats as long as the reward probabilities remained relatively high. cNH compared to wild-type (n=10) rats were also more resistant to extinction across sessions. Compared to saline (n=5), deprenyl (n=5) reduced the duration of immobility of cH rats in the forced swimming test, indicative of antidepressant effects, but did not restore any deficits in the acquisition of a FI 20 schedule. We conclude that positive reinforcement learning was impaired in rats bred for helplessness, possibly due to motivational impairments but not deficits in reward sensitivity, and that deprenyl exerted antidepressant effects but did not reverse the deficits in positive reinforcement learning.

  16. Neural Regions that Underlie Reinforcement Learning Also Engage in Social Expectancy Violations

    PubMed Central

    Harris, Lasana T.; Fiske, Susan T.

    2013-01-01

    Prediction error, the difference between an expected and actual outcome, serves as a learning signal that interacts with reward and punishment value to direct future behavior during reinforcement learning. We hypothesized that similar learning and valuation signals may underlie social expectancy violations. Here, we explore the neural correlates of social expectancy violation signals along the universal person-perception dimensions of trait warmth and competence. In this context, social learning may result from expectancy violations that occur when a target is inconsistent with an a priori schema. Expectancy violation may activate neural regions normally implicated in prediction error and valuation during appetitive and aversive conditioning. Using fMRI, we first gave perceivers warmth or competence behavioral information. Participants then saw pictures of people responsible for the behavior; they represented social groups either inconsistent (rated low on either warmth or competence) or consistent (rated high on either warmth or competence) with the behavior information. Warmth and competence expectancy violations activate striatal regions and frontal cortex respectively, areas that represent evaluative and prediction-error signals. These findings suggest that regions underlying reinforcement learning may be engaged in warmth and competence social expectancy violation, and illustrate the neural overlap between neuroeconomics and social neuroscience. PMID:20119878

  17. Neural regions that underlie reinforcement learning are also active for social expectancy violations.

    PubMed

    Harris, Lasana T; Fiske, Susan T

    2010-01-01

    Prediction error, the difference between an expected and an actual outcome, serves as a learning signal that interacts with reward and punishment value to direct future behavior during reinforcement learning. We hypothesized that similar learning and valuation signals may underlie social expectancy violations. Here, we explore the neural correlates of social expectancy violation signals along the universal person-perception dimensions of trait warmth and competence. In this context, social learning may result from expectancy violations that occur when a target is inconsistent with an a priori schema. Expectancy violation may activate neural regions normally implicated in prediction error and valuation during appetitive and aversive conditioning. Using fMRI, we first gave perceivers high warmth or competence behavioral information that led to dispositional or situational attributions for the behavior. Participants then saw pictures of people responsible for the behavior; they represented social groups either inconsistent (rated low on either warmth or competence) or consistent (rated high on either warmth or competence) with the behavior information. Warmth and competence expectancy violations activate striatal regions that represent evaluative and prediction error signals. Social cognition regions underlie consistent expectations. These findings suggest that regions underlying reinforcement learning may work in concert with social cognition regions in warmth and competence social expectancy. This study illustrates the neural overlap between neuroeconomics and social neuroscience.

  18. Dopamine prediction errors in reward learning and addiction: from theory to neural circuitry

    PubMed Central

    Keiflin, Ronald; Janak, Patricia H.

    2015-01-01

    Midbrain dopamine (DA) neurons are proposed to signal reward prediction error (RPE), a fundamental parameter in associative learning models. This RPE hypothesis provides a compelling theoretical framework for understanding DA function in reward learning and addiction. New studies support a causal role for DA-mediated RPE activity in promoting learning about natural reward; however, this question has not been explicitly tested in the context of drug addiction. In this review, we integrate theoretical models with experimental findings on the activity of DA systems, and on the causal role of specific neuronal projections and cell types, to provide a circuit-based framework for probing DA-RPE function in addiction. By examining error-encoding DA neurons in the neural network in which they are embedded, hypotheses regarding circuit-level adaptations that possibly contribute to pathological error-signaling and addiction can be formulated and tested. PMID:26494275

  19. Dopamine Prediction Errors in Reward Learning and Addiction: From Theory to Neural Circuitry.

    PubMed

    Keiflin, Ronald; Janak, Patricia H

    2015-10-21

    Midbrain dopamine (DA) neurons are proposed to signal reward prediction error (RPE), a fundamental parameter in associative learning models. This RPE hypothesis provides a compelling theoretical framework for understanding DA function in reward learning and addiction. New studies support a causal role for DA-mediated RPE activity in promoting learning about natural reward; however, this question has not been explicitly tested in the context of drug addiction. In this review, we integrate theoretical models with experimental findings on the activity of DA systems, and on the causal role of specific neuronal projections and cell types, to provide a circuit-based framework for probing DA-RPE function in addiction. By examining error-encoding DA neurons in the neural network in which they are embedded, hypotheses regarding circuit-level adaptations that possibly contribute to pathological error signaling and addiction can be formulated and tested. PMID:26494275

  20. Dopamine Prediction Errors in Reward Learning and Addiction: From Theory to Neural Circuitry.

    PubMed

    Keiflin, Ronald; Janak, Patricia H

    2015-10-21

    Midbrain dopamine (DA) neurons are proposed to signal reward prediction error (RPE), a fundamental parameter in associative learning models. This RPE hypothesis provides a compelling theoretical framework for understanding DA function in reward learning and addiction. New studies support a causal role for DA-mediated RPE activity in promoting learning about natural reward; however, this question has not been explicitly tested in the context of drug addiction. In this review, we integrate theoretical models with experimental findings on the activity of DA systems, and on the causal role of specific neuronal projections and cell types, to provide a circuit-based framework for probing DA-RPE function in addiction. By examining error-encoding DA neurons in the neural network in which they are embedded, hypotheses regarding circuit-level adaptations that possibly contribute to pathological error signaling and addiction can be formulated and tested.

  1. Assessment of rewarding and reinforcing properties of biperiden in conditioned place preference in rats.

    PubMed

    Allahverdiyev, Oruc; Nurten, Asiye; Enginar, Nurhan

    2011-12-01

    Biperiden is one of the most commonly abused anticholinergic drugs. This study assessed its motivational effects in the acquisition of conditioned place preference in rats. Biperiden neither produced place conditioning itself nor enhanced the rewarding effect of morphine. Furthermore, biperiden in combination with haloperidol also did not affect place preference. These findings suggest that biperiden lacks abuse potential, at least at the doses used.

  2. Qualitative adaptive reward learning with success failure maps: applied to humanoid robot walking.

    PubMed

    Nassour, John; Hugel, Vincent; Ben Ouezdou, Fethi; Cheng, Gordon

    2013-01-01

    In the human brain, rewards are encoded in a flexible and adaptive way after each novel stimulus. Neurons of the orbitofrontal cortex are the key reward structure of the brain. Neurobiological studies show that the anterior cingulate cortex of the brain is primarily responsible for avoiding repeated mistakes. According to a vigilance threshold, which denotes the tolerance to risk, we can differentiate between a learning mechanism that takes risks and one that avoids risks. The tolerance to risk plays an important role in such a learning mechanism. Results have shown differences in learning capacity between risk-taking and risk-averse behaviors. These neurological properties provide promising inspiration for robot learning based on rewards. In this paper, we propose a learning mechanism that learns from negative and positive feedback with adaptive reward coding. It is composed of two phases: evaluation and decision making. In the evaluation phase, we use a Kohonen self-organizing map technique to represent success and failure. Decision making is based on an early-warning mechanism that enables the robot to avoid repeating past mistakes. The attitude toward risk is modulated in order to gain experience of both success and failure. The success map is learned with an adaptive reward that qualifies the learned task in order to optimize efficiency. Our approach is presented with an implementation on the NAO humanoid robot, controlled by a bioinspired neural controller based on a central pattern generator. The learning system adapts the oscillation frequency and the motor neuron gain in pitch and roll in order to walk on flat and sloped terrain, and to switch between them.
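
    A heavily simplified sketch of the success/failure-map idea, under illustrative assumptions: two small Kohonen-style maps store parameter vectors that previously led to success or failure, and a vigilance threshold vetoes candidate parameters that fall too close to a remembered failure (the early-warning mechanism). The toy "trial", map sizes, and constants are not taken from the paper.

```python
import numpy as np

# Simplified sketch of success/failure maps with an early-warning veto. Two small
# Kohonen-style maps store gait parameters that led to success or failure; a
# candidate is rejected if it lies within `vigilance` of a remembered failure.
# The toy trial, sizes, and constants are illustrative only.

rng = np.random.default_rng(3)
dim = 2                                        # e.g. (oscillation frequency, motor gain)
success_map = np.full((16, dim), -1.0)         # start far from the search region
failure_map = np.full((16, dim), -1.0)         # so nothing is vetoed before real failures
vigilance = 0.15                               # larger values = more risk-averse

def som_update(nodes, x, lr=0.2):
    w = int(np.argmin(np.linalg.norm(nodes - x, axis=1)))   # winner node
    nodes[w] += lr * (x - nodes[w])                          # pull winner toward sample

def trial(params):
    """Toy stand-in for a walking trial: succeeds near (0.6, 0.4)."""
    return np.linalg.norm(params - np.array([0.6, 0.4])) < 0.2

for episode in range(300):
    candidate = rng.uniform(0, 1, dim)
    # Early warning: skip candidates too close to a remembered failure.
    if np.min(np.linalg.norm(failure_map - candidate, axis=1)) < vigilance:
        continue
    som_update(success_map if trial(candidate) else failure_map, candidate)
```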

  3. Social Learning, Reinforcement and Crime: Evidence from Three European Cities

    ERIC Educational Resources Information Center

    Tittle, Charles R.; Antonaccio, Olena; Botchkovar, Ekaterina

    2012-01-01

    This study reports a cross-cultural test of Social Learning Theory using direct measures of social learning constructs and focusing on the causal structure implied by the theory. Overall, the results strongly confirm the main thrust of the theory. Prior criminal reinforcement and current crime-favorable definitions are highly related in all three…

  4. One-trial spatial learning: wild hummingbirds relocate a reward after a single visit.

    PubMed

    Flores-Abreu, I Nuri; Hurly, T Andrew; Healy, Susan D

    2012-07-01

    Beaconing to rewarded locations is typically achieved by visual recognition of the actual goal. Spatial recognition, on the other hand, can occur in the absence of the goal itself, relying instead on the landmarks surrounding the goal location. Although the duration or frequency of experiences that an animal needs to learn the landmarks surrounding a goal have been extensively studied with a variety of laboratory tasks, little is known about the way in which wild vertebrates use them in their natural environment. Here, we allowed hummingbirds to feed once only from a rewarding flower (goal) before it was removed. When we presented a similar flower at a different height in another location, birds frequently returned to the location the flower had previously occupied (spatial recognition) before flying to the flower itself (beaconing). After experiencing three rewarded flowers, each in a different location, they were more likely to beacon to the current visible flower than they were to return to previously rewarded locations (without a visible flower). These data show that hummingbirds can encode a rewarded location on the basis of the surrounding landmarks after a single visit. After multiple goal location manipulations, however, the birds changed their strategy to beaconing presumably because they had learned that the flower itself reliably signalled reward.

  5. Chicks' maze learning reinforced by visual pitfall extending downward.

    PubMed

    Hayashibe, K; Hara, M; Tsuji, K

    1989-04-01

    The present study examined whether visually evoked fear of depth could reinforce a particular response in animals, i.e., maze learning. The maze was composed of four Y-shaped alley units. In this maze, visual pitfalls were set behind the corners of the alley in place of a physical barrier. The experiments showed that eight of 13 male chicks achieved the initial learning and that three of the successful ones also achieved reversal learning. The results suggest that the visually evoked fear of depth provided by motion parallax can act as a reinforcer.

  6. Reinforcement learning in complementarity game and population dynamics

    NASA Astrophysics Data System (ADS)

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005), 10.1016/j.physa.2004.07.005] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This gives insight into aspects like the issue of quick adaptation as opposed to systematic exploration or the role of learning rates.
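
    As a hedged illustration of the schemes compared above, the sketch below implements a basic Roth-Erev update with forgetting and a power exponent applied when propensities are turned into choice probabilities (an exponent of 1.0 recovers the standard scheme). The toy payoff function and the exact placement of the exponent are assumptions, not a reproduction of the cited complementarity-game experiments.

```python
import numpy as np

# Roth-Erev reinforcement sketch: the chosen action's propensity is reinforced by
# its payoff, old propensities decay slowly, and choice probabilities are formed
# from propensities raised to a power. Payoffs and constants are illustrative.

rng = np.random.default_rng(4)
n_actions = 5
q = np.ones(n_actions)        # propensities
forgetting = 0.01
power = 1.5                   # 1.0 recovers the standard Roth-Erev scheme

def payoff(a):
    """Illustrative payoff: action 3 pays best, the others pay a little."""
    return 1.0 if a == 3 else 0.2

for t in range(5000):
    probs = q ** power
    probs /= probs.sum()
    a = rng.choice(n_actions, p=probs)
    q *= (1.0 - forgetting)          # gradual forgetting of old reinforcement
    q[a] += payoff(a)                # reinforce the chosen action by its payoff
```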

  7. Human-level control through deep reinforcement learning.

    PubMed

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A; Veness, Joel; Bellemare, Marc G; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-26

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks
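
    The published agent couples this idea with convolutional networks, experience replay, and a periodically updated target network operating on raw Atari frames. As a minimal, assumption-laden stand-in, the sketch below shows only the core Q-learning target it optimizes, y = r + gamma * max over a' of Q_target(s', a'), with a tabular Q-function on a toy chain task.

```python
import numpy as np

# Core deep-Q-learning target, y = r + gamma * max_a' Q_target(s', a'), applied
# with a tabular Q-function and no replay buffer or neural network. A toy
# stand-in for the published agent; the environment and constants are invented.

rng = np.random.default_rng(5)
n_states, n_actions, gamma, lr = 4, 2, 0.9, 0.1
W = np.zeros((n_states, n_actions))        # online Q-values
W_target = W.copy()                        # periodically synced target Q-values

def env_step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; the last state pays 1."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == n_states - 1)

s = 0
for t in range(10000):
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(W[s]))
    s_next, r = env_step(s, a)
    y = r + gamma * np.max(W_target[s_next])     # bootstrapped TD target
    W[s, a] += lr * (y - W[s, a])                # semi-gradient update toward the target
    if t % 100 == 0:
        W_target = W.copy()                      # sync the target network
    s = 0 if r == 1.0 else s_next                # restart the episode at the goal
```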

  8. Human-level control through deep reinforcement learning

    NASA Astrophysics Data System (ADS)

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-01

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

  9. Human-level control through deep reinforcement learning.

    PubMed

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A; Veness, Joel; Bellemare, Marc G; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-26

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

  10. [Reward processing of the basal ganglia--reward function of pedunculopontine tegmental nucleus].

    PubMed

    Kobayashi, Yasushi; Okada, Ken-Ichi

    2009-04-01

    We address the role of neuronal activity in the pathways of the brainstem-midbrain circuit in reward and the basis for the hypothesis that this circuit provides advantages over previous reinforcement learning theories. Several lines of evidence support the reward-based learning theory proposing that midbrain dopamine (DA) neurons emit a teaching signal (the reward prediction error signal) to control synaptic plasticity of the projection area. However, where and how the reward prediction error signal is computed remains unclear. Since the pedunculopontine tegmental nucleus (PPTN) in the brainstem is one of the strongest excitatory input sources to DA neurons, we hypothesized that the PPTN may play an important role in activating the DA neurons and in reinforcement learning by relaying the signals necessary for reward prediction error computation to those neurons. To investigate the involvement of PPTN neurons in reward prediction error computation, we employed a visually guided saccade task while recording neuronal activity in monkeys. Here, we predict that PPTN neurons may relay the excitatory component of tonic reward prediction and phasic primary reward signals, and derive a new computational theory of reward prediction error in DA neurons.

  11. Reinforcement Learning of Optimal Supervisor based on the Worst-Case Behavior

    NASA Astrophysics Data System (ADS)

    Kajiwara, Kouji; Yamasaki, Tatsushi

    Supervisory control, initiated by Ramadge and Wonham, is a framework for logical control of discrete event systems. The original formulation does not consider the costs of event occurrence and disabling, so optimal supervisory control based on quantitative measures has also been studied. This paper proposes a synthesis method for the optimal supervisor based on the worst-case behavior of discrete event systems. We introduce new value functions for the assigned control patterns. These value functions are based not on expected total rewards but on the most undesirable event occurrence in the assigned control pattern. In the proposed method, the supervisor learns, by reinforcement learning, how to assign control patterns so as to maximize the value functions. We show the efficiency of the proposed method by computer simulation.
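
    The sketch below illustrates the worst-case criterion on a tiny invented automaton, solved by value iteration rather than learned online as in the paper: a control pattern is scored by its least favourable enabled event, and the supervisor picks the pattern that maximizes this worst-case value. The automaton, rewards, and controllability structure are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

# Worst-case supervisor sketch: each control pattern is valued by its least
# favourable enabled event, and the supervisor maximizes over patterns.
# Solved by value iteration on an invented automaton, not learned online.

# transitions[s][e] = (next_state, reward); not every event is defined everywhere
transitions = {
    0: {"a": (1, 1.0), "b": (2, -5.0), "c": (0, 0.1)},
    1: {"a": (0, 0.5), "c": (2, -1.0)},
    2: {"a": (0, 2.0), "b": (1, -3.0)},
}
controllable = {"a", "b"}          # "c" is uncontrollable and cannot be disabled
gamma = 0.9
V = {s: 0.0 for s in transitions}

def control_patterns(s):
    """Subsets of enabled controllable events, always keeping uncontrollable ones."""
    enabled = set(transitions[s])
    ctrl = sorted(enabled & controllable)
    forced = enabled - controllable
    for r in range(len(ctrl) + 1):
        for subset in combinations(ctrl, r):
            pattern = set(subset) | forced
            if pattern:                      # a pattern must enable something
                yield pattern

def pattern_value(s, pat):
    """Worst-case value of enabling exactly the events in `pat` at state s."""
    return min(transitions[s][e][1] + gamma * V[transitions[s][e][0]] for e in pat)

for _ in range(200):                         # worst-case value iteration
    for s in transitions:
        V[s] = max(pattern_value(s, pat) for pat in control_patterns(s))
```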

  12. Sensitization of psychomotor stimulation and conditioned reward in mice: differential modulation by contextual learning.

    PubMed

    Mead, Andy N; Crombag, Hans S; Rocha, Beatriz A

    2004-02-01

    Incentive motivation theory ascribes a critical role to reward-associated stimuli in the generation and maintenance of goal-directed behavior. Repeated psychomotor stimulant treatment, in addition to producing sensitization to the psychomotor-activating effects, can enhance the incentive salience of reward-associated cues and increase their ability to influence behavior. In the present study, we sought to investigate this incentive sensitization effect further by developing a model of conditioned reinforcement (CR) in the mouse and investigating the effects of a sensitizing treatment regimen of amphetamine on CR. Furthermore, we assessed the role of contextual stimuli in amphetamine-induced potentiation of CR. We found that mice responded selectively on a lever resulting in the presentation of a cue previously associated with 30% condensed milk solution, indicating that the cue had attained rewarding properties. Prior treatment with amphetamine (4 x 0.5 mg/kg i.p.) resulted in psychomotor sensitization and enhanced subsequent responding for the CR. Furthermore, this enhancement of responding for the cue occurred independent of the drug-paired context, whereas the sensitized locomotor response was only observed when mice were tested in the same environment as that in which they had received previous amphetamine. These results demonstrate that the CR paradigm previously developed in the rat can be successfully adapted for use in the mouse, and suggest that behavioral sensitization to amphetamine increases the rewarding properties (incentive salience) of reward-paired cues, independent of the drug-paired context.

  13. Social Cognition as Reinforcement Learning: Feedback Modulates Emotion Inference.

    PubMed

    Zaki, Jamil; Kallman, Seth; Wimmer, G Elliott; Ochsner, Kevin; Shohamy, Daphna

    2016-09-01

    Neuroscientific studies of social cognition typically employ paradigms in which perceivers draw single-shot inferences about the internal states of strangers. Real-world social inference features much different parameters: People often encounter and learn about particular social targets (e.g., friends) over time and receive feedback about whether their inferences are correct or incorrect. Here, we examined this process and, more broadly, the intersection between social cognition and reinforcement learning. Perceivers were scanned using fMRI while repeatedly encountering three social targets who produced conflicting visual and verbal emotional cues. Perceivers guessed how targets felt and received feedback about whether they had guessed correctly. Visual cues reliably predicted one target's emotion, verbal cues predicted a second target's emotion, and neither reliably predicted the third target's emotion. Perceivers successfully used this information to update their judgments over time. Furthermore, trial-by-trial learning signals, estimated using two reinforcement learning models, tracked activity in ventral striatum and ventromedial pFC, structures associated with reinforcement learning, and regions associated with updating social impressions, including TPJ. These data suggest that learning about others' emotions, like other forms of feedback learning, relies on domain-general reinforcement mechanisms as well as domain-specific social information processing. PMID:27167401
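
    A minimal sketch, under our own assumptions, of the kind of delta-rule (Rescorla-Wagner style) model such studies fit to estimate trial-by-trial learning signals: the perceiver tracks, per target, how reliable the visual cue is and updates that belief from feedback. Target names, reliabilities, and the learning rate are illustrative.

```python
import numpy as np

# Delta-rule sketch of cue-reliability learning: for each target, a single weight
# tracks the belief that the visual cue signals that target's true emotion, and
# the prediction error on each trial serves as the trial-by-trial learning signal.

rng = np.random.default_rng(6)
alpha = 0.15
# True probability that the VISUAL cue matches each target's emotion (illustrative).
true_visual_reliability = {"target_A": 0.9, "target_B": 0.1, "target_C": 0.5}
w = {t: 0.5 for t in true_visual_reliability}      # learned belief in the visual cue

for trial in range(300):
    target = rng.choice(list(w))
    # Feedback reveals whether the visual cue matched the target's true emotion.
    visual_was_right = rng.random() < true_visual_reliability[target]
    delta = float(visual_was_right) - w[target]     # prediction error (learning signal)
    w[target] += alpha * delta                      # delta-rule update
```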

  14. Distinguishing between learning and motivation in behavioral tests of the reinforcement sensitivity theory of personality.

    PubMed

    Smillie, Luke D; Dalgleish, Len I; Jackson, Chris J

    2007-04-01

    According to Gray's (1973) Reinforcement Sensitivity Theory (RST), a Behavioral Inhibition System (BIS) and a Behavioral Activation System (BAS) mediate effects of goal conflict and reward on behavior. BIS functioning has been linked with individual differences in trait anxiety and BAS functioning with individual differences in trait impulsivity. In this article, it is argued that behavioral outputs of the BIS and BAS can be distinguished in terms of learning and motivation processes and that these can be operationalized using the Signal Detection Theory measures of response-sensitivity and response-bias. In Experiment 1, two measures of BIS-reactivity predicted increased response-sensitivity under goal conflict, whereas one measure of BAS-reactivity predicted increased response-sensitivity under reward. In Experiment 2, two measures of BIS-reactivity predicted response-bias under goal conflict, whereas a measure of BAS-reactivity predicted motivation response-bias under reward. In both experiments, impulsivity measures did not predict criteria for BAS-reactivity as traditionally predicted by RST.

  15. Correlates of reward-predictive value in learning-related hippocampal neural activity

    PubMed Central

    Okatan, Murat

    2009-01-01

    Temporal difference learning (TD) is a popular algorithm in machine learning. Two learning signals that are derived from this algorithm, the predictive value and the prediction error, have been shown to explain changes in neural activity and behavior during learning across species. Here, the predictive value signal is used to explain the time course of learning-related changes in the activity of hippocampal neurons in monkeys performing an associative learning task. The TD algorithm serves as the centerpiece of a joint probability model for the learning-related neural activity and the behavioral responses recorded during the task. The neural component of the model consists of spiking neurons that compete and learn the reward-predictive value of task-relevant input signals. The predictive-value signaled by these neurons influences the behavioral response generated by a stochastic decision stage, which constitutes the behavioral component of the model. It is shown that the time course of the changes in neural activity and behavioral performance generated by the model exhibits key features of the experimental data. The results suggest that information about correct associations may be expressed in the hippocampus before it is detected in the behavior of a subject. In this way, the hippocampus may be among the earliest brain areas to express learning and drive the behavioral changes associated with learning. Correlates of reward-predictive value may be expressed in the hippocampus through rate remapping within spatial memory representations, they may represent reward-related aspects of a declarative or explicit relational memory representation of task contingencies, or they may correspond to reward-related components of episodic memory representations. These potential functions are discussed in connection with hippocampal cell assembly sequences and their reverse reactivation during the awake state. The results provide further support for the proposal that neural
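
    The two signals named above, predictive value and prediction error, come from temporal difference learning; the sketch below computes both for a simple cue-then-reward trial under illustrative assumptions (trial length, learning rate, and discount factor are invented).

```python
import numpy as np

# TD(0) sketch of the two learning signals: the predictive value V of each
# within-trial state and the prediction error delta that updates it. A cue at
# time 0 is followed by a reward at the end of each trial.

alpha, gamma = 0.1, 0.98
n_steps = 10                      # time steps within a trial; cue at 0, reward at the end
V = np.zeros(n_steps)             # predictive value of each within-trial state

for trial in range(200):
    for t in range(n_steps):
        r = 1.0 if t == n_steps - 1 else 0.0
        v_next = V[t + 1] if t + 1 < n_steps else 0.0
        delta = r + gamma * v_next - V[t]     # prediction error
        V[t] += alpha * delta                 # predictive value update

# After learning, V[0] approaches gamma**(n_steps - 1): reward is predicted at cue onset.
```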

  16. Sound Sequence Discrimination Learning Motivated by Reward Requires Dopaminergic D2 Receptor Activation in the Rat Auditory Cortex

    ERIC Educational Resources Information Center

    Kudoh, Masaharu; Shibuki, Katsuei

    2006-01-01

    We have previously reported that sound sequence discrimination learning requires cholinergic inputs to the auditory cortex (AC) in rats. In that study, reward was used for motivating discrimination behavior in rats. Therefore, dopaminergic inputs mediating reward signals may have an important role in the learning. We tested the possibility in the…

  17. Reconsidering Food Reward, Brain Stimulation, and Dopamine: Incentives Act Forward.

    PubMed

    Newquist, Gunnar; Gardner, R Allen

    2015-01-01

    In operant conditioning, rats pressing levers and pigeons pecking keys depend on contingent food reinforcement. Food reward agrees with Skinner's behaviorism, undergraduate textbooks, and folk psychology. However, nearly a century of experimental evidence shows, instead, that food in an operant conditioning chamber acts forward to evoke species-specific feeding behavior rather than backward to reinforce experimenter-defined responses. Furthermore, recent findings in neuroscience show consistently that intracranial stimulation to reward centers and dopamine release, the proposed reward molecule, also act forward to evoke inborn species-specific behavior. These results challenge longstanding views of hedonic learning and must be incorporated into contemporary learning theory. PMID:26721172

  18. Reconsidering Food Reward, Brain Stimulation, and Dopamine: Incentives Act Forward.

    PubMed

    Newquist, Gunnar; Gardner, R Allen

    2015-01-01

    In operant conditioning, rats pressing levers and pigeons pecking keys depend on contingent food reinforcement. Food reward agrees with Skinner's behaviorism, undergraduate textbooks, and folk psychology. However, nearly a century of experimental evidence shows, instead, that food in an operant conditioning chamber acts forward to evoke species-specific feeding behavior rather than backward to reinforce experimenter-defined responses. Furthermore, recent findings in neuroscience show consistently that intracranial stimulation to reward centers and dopamine release, the proposed reward molecule, also act forward to evoke inborn species-specific behavior. These results challenge longstanding views of hedonic learning and must be incorporated into contemporary learning theory.

  19. Protein interaction network constructing based on text mining and reinforcement learning with application to prostate cancer.

    PubMed

    Zhu, Fei; Liu, Quan; Zhang, Xiaofang; Shen, Bairong

    2015-08-01

    Constructing interaction networks from biomedical texts is an important and challenging task. The authors combine text mining and reinforcement learning approaches to establish a protein interaction network. Given the high computational efficiency of co-occurrence-based interaction extraction and the high precision of linguistic-pattern approaches, the authors propose an interaction extraction algorithm in which frequently used linguistic patterns extract interactions from texts, additional interactions are then found in extended, unprocessed texts using the basic co-occurrence idea, and the interactions extracted from the extended texts are discounted. They then put forward a reinforcement learning-based algorithm to establish a protein interaction network in which nodes represent proteins and edges denote interactions. During the evolutionary process, a node selects another node and the attained reward determines which predicted interaction should be reinforced. The topology of the network is updated by the agent until an optimal network is formed. The authors used texts downloaded from PubMed to construct a prostate cancer protein interaction network with the proposed methods. The results show that the method achieves a good matching rate, and network topology analysis demonstrates that the node degree distribution, node degree probability and probability distribution of the constructed network agree well with those of a scale-free network. PMID:26243825
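
    The abstract does not give the authors' exact update rule; the Python sketch below only illustrates the general idea of reinforcing candidate edges whose selection yields reward, and the protein names, co-occurrence scores, learning rate and threshold are all hypothetical.

      import random
      from collections import defaultdict

      # Hypothetical candidate interactions mined from text, with co-occurrence scores in [0, 1].
      candidates = {("AR", "HSP90"): 0.8, ("AR", "PTEN"): 0.6, ("PTEN", "TP53"): 0.3}

      def build_network(candidates, n_steps=5000, lr=0.05, keep_threshold=0.5, seed=1):
          """Reinforce an edge's weight whenever selecting it yields reward
          (here, reward 1 with probability equal to the mined score)."""
          random.seed(seed)
          weights = defaultdict(lambda: 0.5)       # initial belief about each candidate edge
          edges = list(candidates)
          for _ in range(n_steps):
              edge = random.choice(edges)          # the agent picks a candidate interaction
              reward = 1.0 if random.random() < candidates[edge] else 0.0
              weights[edge] += lr * (reward - weights[edge])   # move toward the observed reward
          return {e: w for e, w in weights.items() if w >= keep_threshold}

      print(build_network(candidates))   # edges whose reinforced weight clears the threshold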

  20. Protein interaction network constructing based on text mining and reinforcement learning with application to prostate cancer.

    PubMed

    Zhu, Fei; Liu, Quan; Zhang, Xiaofang; Shen, Bairong

    2015-08-01

    Constructing interaction networks from biomedical texts is an important and challenging task. The authors combine text mining and reinforcement learning approaches to establish a protein interaction network. Given the high computational efficiency of co-occurrence-based interaction extraction and the high precision of linguistic-pattern approaches, the authors propose an interaction extraction algorithm in which frequently used linguistic patterns extract interactions from texts, additional interactions are then found in extended, unprocessed texts using the basic co-occurrence idea, and the interactions extracted from the extended texts are discounted. They then put forward a reinforcement learning-based algorithm to establish a protein interaction network in which nodes represent proteins and edges denote interactions. During the evolutionary process, a node selects another node and the attained reward determines which predicted interaction should be reinforced. The topology of the network is updated by the agent until an optimal network is formed. The authors used texts downloaded from PubMed to construct a prostate cancer protein interaction network with the proposed methods. The results show that the method achieves a good matching rate, and network topology analysis demonstrates that the node degree distribution, node degree probability and probability distribution of the constructed network agree well with those of a scale-free network.

  1. The impact of effort-reward imbalance and learning motivation on teachers' sickness absence.

    PubMed

    Derycke, Hanne; Vlerick, Peter; Van de Ven, Bart; Rots, Isabel; Clays, Els

    2013-02-01

    The aim of this study was to analyse the impact of the effort-reward imbalance and learning motivation on sickness absence duration and sickness absence frequency among beginning teachers in Flanders (Belgium). A total of 603 teachers, who recently graduated, participated in this study. Effort-reward imbalance and learning motivation were assessed by means of self-administered questionnaires. Prospective data of registered sickness absence during 12 months follow-up were collected. Multivariate logistic regression analyses were performed. An imbalance between high efforts and low rewards (extrinsic hypothesis) was associated with longer sickness absence duration and more frequent absences. A low level of learning motivation (intrinsic hypothesis) was not associated with longer sickness absence duration but was significantly positively associated with sickness absence frequency. No significant results were obtained for the interaction hypothesis between imbalance and learning motivation. Further research is needed to deepen our understanding of the impact of psychosocial work conditions and personal resources on both sickness absence duration and frequency. Specifically, attention could be given to optimizing or reducing efforts spent at work, increasing rewards and stimulating learning motivation to influence sickness absence. PMID:22337584

  2. Reinforcement learning by Hebbian synapses with adaptive thresholds.

    PubMed

    Pennartz, C M

    1997-11-01

    A central problem in learning theory is how the vertebrate brain processes reinforcing stimuli in order to master complex sensorimotor tasks. This problem belongs to the domain of supervised learning, in which errors in the response of a neural network serve as the basis for modification of synaptic connectivity in the network and thereby train it on a computational task. The model presented here shows how a reinforcing feedback can modify synapses in a neuronal network according to the principles of Hebbian learning. The reinforcing feedback steers synapses towards long-term potentiation or depression by critically influencing the rise in postsynaptic calcium, in accordance with findings on synaptic plasticity in mammalian brain. An important feature of the model is the dependence of modification thresholds on the previous history of reinforcing feedback processed by the network. The learning algorithm trained networks successfully on a task in which a population vector in the motor output was required to match a sensory stimulus vector presented shortly before. In another task, networks were trained to compute coordinate transformations by combining different visual inputs. The model continued to behave well when simplified units were replaced by single-compartment neurons equipped with several conductances and operating in continuous time. This novel form of reinforcement learning incorporates essential properties of Hebbian synaptic plasticity and thereby shows that supervised learning can be accomplished by a learning rule similar to those used in physiologically plausible models of unsupervised learning. The model can be crudely correlated to the anatomy and electrophysiology of the amygdala, prefrontal and cingulate cortex and has predictive implications for further experiments on synaptic plasticity and learning processes mediated by these areas.
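
    As a rough algorithmic abstraction of the two ideas emphasized above (reinforcement steering synapses toward potentiation or depression, and a modification threshold that tracks the history of reinforcement), the following Python sketch uses a three-factor Hebbian rule with an adaptive threshold; it is not the published calcium-based model, and all parameters are illustrative.

      import numpy as np

      def reward_modulated_hebb(pre, post, rewards, lr=0.05, theta_rate=0.1):
          """Toy three-factor rule for a single synapse: reinforcement above an adaptive
          threshold potentiates coactive synapses, reinforcement below it depresses them."""
          w, theta = 0.5, 0.0
          for x, y, r in zip(pre, post, rewards):
              w += lr * x * y * (r - theta)        # Hebbian term gated by reward vs. threshold
              theta += theta_rate * (r - theta)    # threshold follows recent reinforcement
              w = float(np.clip(w, 0.0, 1.0))
          return w

      rng = np.random.default_rng(0)
      pre, post = rng.integers(0, 2, 100), rng.integers(0, 2, 100)
      rewards = rng.uniform(0, 1, 100)
      print(reward_modulated_hebb(pre, post, rewards))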

  3. Toward a common theory for learning from reward, affect, and motivation: the SIMON framework

    PubMed Central

    Madan, Christopher R.

    2013-01-01

    While the effects of reward, affect, and motivation on learning have each developed into their own fields of research, they largely have been investigated in isolation. As all three of these constructs are highly related, and use similar experimental procedures, an important advance in research would be to consider the interplay between these constructs. Here we first define each of the three constructs, and then discuss how they may influence each other within a common framework. Finally, we delineate several sources of evidence supporting the framework. By considering the constructs of reward, affect, and motivation within a single framework, we can develop a better understanding of the processes involved in learning and how they interplay, and work toward a comprehensive theory that encompasses reward, affect, and motivation. PMID:24109436

  4. Finding the right motivation: genotype-dependent differences in effective reinforcements for spatial learning.

    PubMed

    Youn, Jiun; Ellenbroek, Bart A; van Eck, Inti; Roubos, Sandra; Verhage, Matthijs; Stiedl, Oliver

    2012-01-15

    Memory impairments of DBA/2J mice have been frequently reported in spatial and emotional behavior tests. However, in some memory tests involving food reward, DBA/2J mice perform as well as C57BL/6J mice or even outperform them. Thus, it is conceivable that motivational factors differentially affect cognitive performance of different mouse strains. Therefore, spatial memory of DBA/2J and C57BL/6J mice was investigated in a modified version of the Barnes maze (mBM) test with increased complexity. The modified Barnes maze test allowed the use of either aversive or appetitive reinforcement, but with identical spatial cues and motor requirements. Both mouse strains acquired spatial learning in mBM tests with either reinforcement. However, DBA/2J mice learned more slowly than C57BL/6J mice when aversive reinforcement was used. In contrast, the two strains performed equally well when appetitive reinforcement was used. The superior performance of C57BL/6J mice in the aversive version of the mBM test was accompanied by a more frequent use of the spatial strategy. In the appetitive version of the mBM test, both strains used the spatial strategy to a similar extent. The present results demonstrate that the cognitive performance of mice depends heavily on motivational factors. Our findings underscore the importance of an effective experimental design when assessing spatial memory and challenge interpretations of impaired hippocampal function in DBA/2J mice drawn on the basis of behavior tests depending on aversive reinforcement.

  5. Learning in neural networks by reinforcement of irregular spiking

    NASA Astrophysics Data System (ADS)

    Xie, Xiaohui; Seung, H. Sebastian

    2004-04-01

    Artificial neural networks are often trained by using the back propagation algorithm to compute the gradient of an objective function with respect to the synaptic strengths. For a biological neural network, such a gradient computation would be difficult to implement, because of the complex dynamics of intrinsic and synaptic conductances in neurons. Here we show that irregular spiking similar to that observed in biological neurons could be used as the basis for a learning rule that calculates a stochastic approximation to the gradient. The learning rule is derived based on a special class of model networks in which neurons fire spike trains with Poisson statistics. The learning is compatible with forms of synaptic dynamics such as short-term facilitation and depression. By correlating the fluctuations in irregular spiking with a reward signal, the learning rule performs stochastic gradient ascent on the expected reward. It is applied to two examples, learning the XOR computation and learning direction selectivity using depressing synapses. We also show in simulation that the learning rule is applicable to a network of noisy integrate-and-fire neurons.
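
    A toy version of this idea, for a single stochastic unit and an invented task (fire only when the first input is on), is sketched below; the reward definition and parameters are illustrative assumptions, not the XOR or direction-selectivity simulations of the paper.

      import numpy as np

      def spike_reinforce(n_trials=5000, lr=0.2, seed=0):
          """REINFORCE-style rule: correlate the fluctuation of stochastic firing with
          reward to perform stochastic gradient ascent on the expected reward."""
          rng = np.random.default_rng(seed)
          w = np.zeros(3)                                  # two inputs plus a bias term
          for _ in range(n_trials):
              x = np.append(rng.integers(0, 2, 2), 1.0)    # binary inputs + bias
              rate = 1.0 / (1.0 + np.exp(-(w @ x)))        # firing probability on this trial
              spike = float(rng.random() < rate)           # irregular (stochastic) firing
              reward = 1.0 if spike == x[0] else 0.0       # rewarded for firing only when x[0] is on
              w += lr * reward * (spike - rate) * x        # reward times the spike fluctuation
          return w

      print(spike_reinforce())   # weights should come to favor firing only when x[0] is on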

  6. The Role of BDNF, Leptin, and Catecholamines in Reward Learning in Bulimia Nervosa

    PubMed Central

    Grob, Simona; Milos, Gabriella; Schnyder, Ulrich; Eckert, Anne; Lang, Undine; Hasler, Gregor

    2015-01-01

    Background: A relationship between bulimia nervosa and reward-related behavior is supported by several lines of evidence. The dopaminergic dysfunctions in the processing of reward-related stimuli have been shown to be modulated by the neurotrophin brain derived neurotrophic factor (BDNF) and the hormone leptin. Methods: Using a randomized, double-blind, placebo-controlled, crossover design, a reward learning task was applied to study the behavior of 20 female subjects with remitted bulimia nervosa and 27 female healthy controls under placebo and catecholamine depletion with alpha-methyl-para-tyrosine (AMPT). The plasma levels of BDNF and leptin were measured twice during the placebo and the AMPT condition, immediately before and 1 hour after a standardized breakfast. Results: AMPT–induced differences in plasma BDNF levels were positively correlated with the AMPT–induced differences in reward learning in the whole sample (P=.05). Across conditions, plasma brain derived neurotrophic factor levels were higher in remitted bulimia nervosa subjects compared with controls (diagnosis effect; P=.001). Plasma BDNF and leptin levels were higher in the morning before compared with after a standardized breakfast across groups and conditions (time effect; P<.0001). The plasma leptin levels were higher under catecholamine depletion compared with placebo in the whole sample (treatment effect; P=.0004). Conclusions: This study reports on preliminary findings that suggest a catecholamine-dependent association of plasma BDNF and reward learning in subjects with remitted bulimia nervosa and controls. A role of leptin in reward learning is not supported by this study. However, leptin levels were sensitive to a depletion of catecholamine stores in both remitted bulimia nervosa and controls. PMID:25522424

  7. RM-SORN: a reward-modulated self-organizing recurrent neural network.

    PubMed

    Aswolinskiy, Witali; Pipa, Gordon

    2015-01-01

    Neural plasticity plays an important role in learning and memory. Reward-modulation of plasticity offers an explanation for the ability of the brain to adapt its neural activity to achieve a rewarded goal. Here, we define a neural network model that learns through the interaction of Intrinsic Plasticity (IP) and reward-modulated Spike-Timing-Dependent Plasticity (STDP). IP enables the network to explore possible output sequences and STDP, modulated by reward, reinforces the creation of the rewarded output sequences. The model is tested on tasks for prediction, recall, non-linear computation, pattern recognition, and sequence generation. It achieves performance comparable to networks trained with supervised learning, while using simple, biologically motivated plasticity rules, and rewarding strategies. The results confirm the importance of investigating the interaction of several plasticity rules in the context of reward-modulated learning and whether reward-modulated self-organization can explain the amazing capabilities of the brain.
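
    A single-synapse sketch of the two named ingredients (reward-modulated STDP plus intrinsic plasticity) is given below; the reward definition, traces and parameters are illustrative and far simpler than the recurrent RM-SORN network itself.

      import numpy as np

      def simulate_rm_stdp(n_steps=1000, seed=0):
          """One synapse with reward-modulated STDP and an intrinsically plastic threshold."""
          rng = np.random.default_rng(seed)
          w, thr = 0.5, 0.5
          pre_tr = post_tr = 0.0
          a_plus, a_minus, tau = 0.01, 0.012, 0.9
          eta_ip, target_rate = 0.001, 0.1
          for _ in range(n_steps):
              pre = float(rng.random() < 0.2)                   # presynaptic spike
              post = float(w * pre - thr + 0.3 * rng.standard_normal() > 0.0)
              pre_tr = tau * pre_tr + pre
              post_tr = tau * post_tr + post
              # STDP eligibility: pre-before-post potentiates, post-before-pre depresses.
              elig = a_plus * pre_tr * post - a_minus * post_tr * pre
              reward = pre * post                               # toy reward for coincident firing
              w = float(np.clip(w + reward * elig, 0.0, 1.0))   # reward gates the plasticity
              thr += eta_ip * (post - target_rate)              # IP keeps firing near a target rate
          return w, thr

      print(simulate_rm_stdp())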

  8. RM-SORN: a reward-modulated self-organizing recurrent neural network

    PubMed Central

    Aswolinskiy, Witali; Pipa, Gordon

    2015-01-01

    Neural plasticity plays an important role in learning and memory. Reward-modulation of plasticity offers an explanation for the ability of the brain to adapt its neural activity to achieve a rewarded goal. Here, we define a neural network model that learns through the interaction of Intrinsic Plasticity (IP) and reward-modulated Spike-Timing-Dependent Plasticity (STDP). IP enables the network to explore possible output sequences and STDP, modulated by reward, reinforces the creation of the rewarded output sequences. The model is tested on tasks for prediction, recall, non-linear computation, pattern recognition, and sequence generation. It achieves performance comparable to networks trained with supervised learning, while using simple, biologically motivated plasticity rules, and rewarding strategies. The results confirm the importance of investigating the interaction of several plasticity rules in the context of reward-modulated learning and whether reward-modulated self-organization can explain the amazing capabilities of the brain. PMID:25852533

  9. Dopamine Replacement Therapy, Learning and Reward Prediction in Parkinson's Disease: Implications for Rehabilitation.

    PubMed

    Ferrazzoli, Davide; Carter, Adrian; Ustun, Fatma S; Palamara, Grazia; Ortelli, Paola; Maestri, Roberto; Yücel, Murat; Frazzitta, Giuseppe

    2016-01-01

    The principal feature of Parkinson's disease (PD) is the impaired ability to acquire and express habitual-automatic actions due to the loss of dopamine in the dorsolateral striatum, the region of the basal ganglia associated with the control of habitual behavior. Dopamine replacement therapy (DRT) compensates for the lack of dopamine, representing the standard treatment for different motor symptoms of PD (such as rigidity, bradykinesia and resting tremor). On the other hand, rehabilitation treatments, exploiting the use of cognitive strategies, feedback and external cues, make it possible to "learn to bypass" the defective basal ganglia (using the dorsolateral area of the prefrontal cortex), allowing the patients to perform correct movements under executive-volitional control. Therefore, DRT and rehabilitation seem to be two complementary and synergistic approaches. Learning and reward are central in rehabilitation: both of these mechanisms are the basis for the success of any rehabilitative treatment. However, it is known that "learning resources" and reward can be negatively influenced by dopaminergic drugs. Furthermore, DRT causes different well-known complications: among these, dyskinesias, motor fluctuations, and dopamine dysregulation syndrome (DDS) are intimately linked to alterations in the learning and reward mechanisms and can seriously impact rehabilitative outcomes. These considerations highlight the need for careful titration of DRT to produce the desired improvement in motor symptoms while minimizing the associated detrimental effects. This is important in order to maximize the motor re-learning based on repetition, reward and practice during rehabilitation. In this scenario, we review the knowledge concerning the interactions between DRT, learning and reward, examine the most impactful DRT side effects and provide suggestions for optimizing rehabilitation in PD. PMID:27378872

  10. Dopamine Replacement Therapy, Learning and Reward Prediction in Parkinson’s Disease: Implications for Rehabilitation

    PubMed Central

    Ferrazzoli, Davide; Carter, Adrian; Ustun, Fatma S.; Palamara, Grazia; Ortelli, Paola; Maestri, Roberto; Yücel, Murat; Frazzitta, Giuseppe

    2016-01-01

    The principal feature of Parkinson’s disease (PD) is the impaired ability to acquire and express habitual-automatic actions due to the loss of dopamine in the dorsolateral striatum, the region of the basal ganglia associated with the control of habitual behavior. Dopamine replacement therapy (DRT) compensates for the lack of dopamine, representing the standard treatment for different motor symptoms of PD (such as rigidity, bradykinesia and resting tremor). On the other hand, rehabilitation treatments, exploiting the use of cognitive strategies, feedback and external cues, make it possible to “learn to bypass” the defective basal ganglia (using the dorsolateral area of the prefrontal cortex), allowing the patients to perform correct movements under executive-volitional control. Therefore, DRT and rehabilitation seem to be two complementary and synergistic approaches. Learning and reward are central in rehabilitation: both of these mechanisms are the basis for the success of any rehabilitative treatment. However, it is known that “learning resources” and reward can be negatively influenced by dopaminergic drugs. Furthermore, DRT causes different well-known complications: among these, dyskinesias, motor fluctuations, and dopamine dysregulation syndrome (DDS) are intimately linked to alterations in the learning and reward mechanisms and can seriously impact rehabilitative outcomes. These considerations highlight the need for careful titration of DRT to produce the desired improvement in motor symptoms while minimizing the associated detrimental effects. This is important in order to maximize the motor re-learning based on repetition, reward and practice during rehabilitation. In this scenario, we review the knowledge concerning the interactions between DRT, learning and reward, examine the most impactful DRT side effects and provide suggestions for optimizing rehabilitation in PD. PMID:27378872

  11. Negative reinforcement impairs overnight memory consolidation.

    PubMed

    Stamm, Andrew W; Nguyen, Nam D; Seicol, Benjamin J; Fagan, Abigail; Oh, Angela; Drumm, Michael; Lundt, Maureen; Stickgold, Robert; Wamsley, Erin J

    2014-11-01

    Post-learning sleep is beneficial for human memory. However, it may be that not all memories benefit equally from sleep. Here, we manipulated a spatial learning task using monetary reward and performance feedback, asking whether enhancing the salience of the task would augment overnight memory consolidation and alter its incorporation into dreaming. Contrary to our hypothesis, we found that the addition of reward impaired overnight consolidation of spatial memory. Our findings seemingly contradict prior reports that enhancing the reward value of learned information augments sleep-dependent memory processing. Given that the reward followed a negative reinforcement paradigm, consolidation may have been impaired via a stress-related mechanism.

  12. Negative reinforcement impairs overnight memory consolidation.

    PubMed

    Stamm, Andrew W; Nguyen, Nam D; Seicol, Benjamin J; Fagan, Abigail; Oh, Angela; Drumm, Michael; Lundt, Maureen; Stickgold, Robert; Wamsley, Erin J

    2014-11-01

    Post-learning sleep is beneficial for human memory. However, it may be that not all memories benefit equally from sleep. Here, we manipulated a spatial learning task using monetary reward and performance feedback, asking whether enhancing the salience of the task would augment overnight memory consolidation and alter its incorporation into dreaming. Contrary to our hypothesis, we found that the addition of reward impaired overnight consolidation of spatial memory. Our findings seemingly contradict prior reports that enhancing the reward value of learned information augments sleep-dependent memory processing. Given that the reward followed a negative reinforcement paradigm, consolidation may have been impaired via a stress-related mechanism. PMID:25320351

  13. Negative reinforcement impairs overnight memory consolidation

    PubMed Central

    Stamm, Andrew W.; Nguyen, Nam D.; Seicol, Benjamin J.; Fagan, Abigail; Oh, Angela; Drumm, Michael; Lundt, Maureen; Stickgold, Robert

    2014-01-01

    Post-learning sleep is beneficial for human memory. However, it may be that not all memories benefit equally from sleep. Here, we manipulated a spatial learning task using monetary reward and performance feedback, asking whether enhancing the salience of the task would augment overnight memory consolidation and alter its incorporation into dreaming. Contrary to our hypothesis, we found that the addition of reward impaired overnight consolidation of spatial memory. Our findings seemingly contradict prior reports that enhancing the reward value of learned information augments sleep-dependent memory processing. Given that the reward followed a negative reinforcement paradigm, consolidation may have been impaired via a stress-related mechanism. PMID:25320351

  14. Involvement of the Rat Anterior Cingulate Cortex in Control of Instrumental Responses Guided by Reward Expectancy

    ERIC Educational Resources Information Center

    Schweimer, Judith; Hauber, Wolfgang

    2005-01-01

    The anterior cingulate cortex (ACC) plays a critical role in stimulus-reinforcement learning and reward-guided selection of actions. Here we conducted a series of experiments to further elucidate the role of the ACC in instrumental behavior involving effort-based decision-making and instrumental learning guided by reward-predictive stimuli. In…

  15. Sensory Responsiveness and the Effects of Equal Subjective Rewards on Tactile Learning and Memory of Honeybees

    ERIC Educational Resources Information Center

    Scheiner, Ricarda; Kuritz-Kaiser, Anthea; Menzel, Randolf; Erber, Joachim

    2005-01-01

    In tactile learning, sucrose is the unconditioned stimulus and reward, which is usually applied to the antenna to elicit proboscis extension and which the bee can drink when it is subsequently applied to the extended proboscis. The conditioned stimulus is a tactile object that the bee can scan with its antennae. In this paper we describe the…

  16. Construction of a Learning Agent Handling Its Rewards According to Environmental Situations

    NASA Astrophysics Data System (ADS)

    Moriyama, Koichi; Numao, Masayuki

    The authors aim to construct an agent that learns appropriate actions in a Multi-Agent environment with and without social dilemmas. For this aim, the agent must have a degree of nonrationality that makes it give up its own profit when it should. Since there are many studies on rational learning methods that maximize profit, it is desirable to build on them when constructing the agent. Therefore, the authors use a reward-handling manner that derives an internal evaluation from the agent's rewards; the agent then learns actions with a rational learning method driven by this internal evaluation. If the agent has only a single fixed manner, however, it does not act well in environments both with and without dilemmas. Thus, the authors equip the agent with several reward-handling manners and with criteria for selecting the one that is effective for the current environmental situation. In humans, what generates such an internal evaluation is usually called emotion; hence, this study also aims to shed light on human emotional activity from a constructive viewpoint. In this paper, the authors divide a Multi-Agent environment into three situations and construct an agent having the reward-handling manners and the selection criteria, and they observe that the agent acts well in all three Multi-Agent situations composed of homogeneous agents.
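
    The abstract does not specify the reward-handling manners or the underlying learner; the Python sketch below illustrates the general architecture with invented handlers that map the raw payoff to an internal evaluation, which then drives a simple stateless Q-learning update.

      # Hypothetical reward-handling manners: each maps the raw payoff to an internal evaluation.
      HANDLERS = {
          "selfish":    lambda own, others_mean: own,
          "fair":       lambda own, others_mean: own - abs(own - others_mean),
          "altruistic": lambda own, others_mean: 0.5 * own + 0.5 * others_mean,
      }

      def q_update(q, state, action, own_payoff, others_mean, handler, alpha=0.1):
          """Rational learning step driven by the internal evaluation, not the raw reward."""
          internal = HANDLERS[handler](own_payoff, others_mean)
          old = q.get((state, action), 0.0)
          q[(state, action)] = old + alpha * (internal - old)

      q = {}
      q_update(q, "dilemma", "cooperate", own_payoff=1.0, others_mean=3.0, handler="fair")
      print(q)   # the 'fair' handler penalizes the payoff gap before the value update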

  17. Differential contributions of infralimbic prefrontal cortex and nucleus accumbens during reward-based learning and extinction.

    PubMed

    Francois, Jennifer; Huxter, John; Conway, Michael W; Lowry, John P; Tricklebank, Mark D; Gilmour, Gary

    2014-01-01

    Using environmental cues for the prediction of future events is essential for survival. Such cue-outcome associations are thought to depend on mesolimbic circuitry involving the nucleus accumbens (NAc) and prefrontal cortex (PFC). Several studies have identified roles for both NAc and PFC in the expression of stable goal-directed behaviors, but much remains unknown about their roles during learning of such behaviors. To further address this question, we used in vivo oxygen amperometry, a proxy for blood oxygen level-dependent (BOLD) signal measurement in human functional magnetic resonance imaging, in rats performing a cued lever-pressing task requiring discrimination between a rewarded and nonrewarded cue. Simultaneous oxygen recordings were obtained from infralimbic PFC (IFC) and NAc throughout both acquisition and extinction of this task. Activation of NAc was specifically observed following rewarded cue onset during the entire acquisition phase and also during the first days of extinction. In contrast, IFC activated only during the earliest periods of acquisition and extinction, more specifically to the nonrewarded cue. Thus, in vivo oxygen amperometry permits a novel, stable form of longitudinal analysis of brain activity in behaving animals, allowing dissociation of the roles of different brain regions over time during learning of reward-driven instrumental action. The present results offer a unique temporal perspective on how NAc may promote actions directed toward anticipated positive outcome throughout learning, while IFC might suppress actions that no longer result in reward, but only during critical periods of learning. PMID:24403158

  18. Components and characteristics of the dopamine reward utility signal.

    PubMed

    Stauffer, William R; Lak, Armin; Kobayashi, Shunsuke; Schultz, Wolfram

    2016-06-01

    Rewards are defined by their behavioral functions in learning (positive reinforcement), approach behavior, economic choices, and emotions. Dopamine neurons respond to rewards with two components, similar to higher order sensory and cognitive neurons. The initial, rapid, unselective dopamine detection component reports all salient environmental events irrespective of their reward association. It is highly sensitive to factors related to reward and thus detects a maximal number of potential rewards. It also senses aversive stimuli but reports their physical impact rather than their aversiveness. The second response component processes reward value accurately and starts early enough to prevent confusion with unrewarded stimuli and objects. It codes reward value as a numeric, quantitative utility prediction error, consistent with formal concepts of economic decision theory. Thus, the dopamine reward signal is fast, highly sensitive and appropriate for driving and updating economic decisions. PMID:26272220

  19. Components and characteristics of the dopamine reward utility signal.

    PubMed

    Stauffer, William R; Lak, Armin; Kobayashi, Shunsuke; Schultz, Wolfram

    2016-06-01

    Rewards are defined by their behavioral functions in learning (positive reinforcement), approach behavior, economic choices, and emotions. Dopamine neurons respond to rewards with two components, similar to higher order sensory and cognitive neurons. The initial, rapid, unselective dopamine detection component reports all salient environmental events irrespective of their reward association. It is highly sensitive to factors related to reward and thus detects a maximal number of potential rewards. It also senses aversive stimuli but reports their physical impact rather than their aversiveness. The second response component processes reward value accurately and starts early enough to prevent confusion with unrewarded stimuli and objects. It codes reward value as a numeric, quantitative utility prediction error, consistent with formal concepts of economic decision theory. Thus, the dopamine reward signal is fast, highly sensitive and appropriate for driving and updating economic decisions.

  20. A Spiking Network Model of Decision Making Employing Rewarded STDP

    PubMed Central

    Bazhenov, Maxim

    2014-01-01

    Reward-modulated spike timing dependent plasticity (STDP) combines unsupervised STDP with a reinforcement signal that modulates synaptic changes. It was proposed as a learning rule capable of solving the distal reward problem in reinforcement learning. Nonetheless, the performance and limitations of this learning mechanism have yet to be tested on biologically relevant problems. In our work, rewarded STDP was implemented to model foraging behavior in a simulated environment. Over the course of training, the network of spiking neurons developed the capability of producing highly successful decision-making. The network performance remained stable even after significant perturbations of synaptic structure. Rewarded STDP alone was insufficient to learn effective decision making, due to the difficulty of maintaining homeostatic equilibrium of synaptic weights and the emergence of local performance maxima. Our study predicts that successful learning requires stabilizing mechanisms that allow neurons to balance their input and output synapses as well as synaptic noise. PMID:24632858

  1. Corticotropin-Releasing Hormone Receptor Type 1 (CRHR1) Genetic Variation and Stress Interact to Influence Reward Learning

    PubMed Central

    Bogdan, Ryan; Santesso, Diane L.; Fagerness, Jesen; Perlis, Roy H.; Pizzagalli, Diego A.

    2011-01-01

    Stress is a general risk factor for psychopathology but the mechanisms underlying this relationship remain largely unknown. Animal studies and limited human research suggest that stress can induce anhedonic behavior. Moreover, emerging data indicate that genetic variation within the corticotropin-releasing hormone type 1 receptor gene (CRHR1) at rs12938031 may promote psychopathology, particularly in the context of stress. Using an intermediate phenotypic neurogenetics approach, we assessed how stress and CRHR1 genetic variation (rs12938031) influence reward learning, an important component of anhedonia. Psychiatrically healthy female participants (n = 75) completed a probabilistic reward learning task during stress and no-stress conditions while 128-channel event-related potentials were recorded. Fifty-six participants were also genotyped across CRHR1. Response bias, an individual’s ability to modulate behavior as a function of reward, was the primary behavioral variable of interest. The feedback-related positivity (FRP) in response to reward feedback was used as a neural index of reward learning. Relative to the no-stress condition, acute stress was associated with blunted response bias as well as a smaller and delayed FRP (indicative of disrupted reward learning) and reduced anterior cingulate and orbitofrontal cortex activation to reward. Critically, rs12938031 interacted with stress to influence reward learning: both behaviorally and neurally, A homozygotes showed stress-induced reward learning abnormalities. These findings indicate that acute, uncontrollable stressors reduce participants’ ability to modulate behavior as a function of reward, and that such effects are modulated by CRHR1 genotype. Homozygosity for the A allele at rs12938031 may increase risk for psychopathology via stress-induced reward learning deficits. PMID:21917807
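
    Response bias in probabilistic reward tasks of this kind is commonly computed with a signal-detection formula over the four response counts; whether this exact variant was used here is not stated in the abstract, and the trial counts below are hypothetical.

      import math

      def response_bias(rich_correct, rich_incorrect, lean_correct, lean_incorrect):
          """Signal-detection response bias (log b): larger values indicate a stronger
          bias toward the more frequently rewarded ('rich') stimulus."""
          return 0.5 * math.log((rich_correct * lean_incorrect) /
                                (rich_incorrect * lean_correct))

      # Hypothetical trial counts for one block.
      print(response_bias(rich_correct=80, rich_incorrect=20, lean_correct=60, lean_incorrect=40))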

  2. Corticotropin-releasing hormone receptor type 1 (CRHR1) genetic variation and stress interact to influence reward learning.

    PubMed

    Bogdan, Ryan; Santesso, Diane L; Fagerness, Jesen; Perlis, Roy H; Pizzagalli, Diego A

    2011-09-14

    Stress is a general risk factor for psychopathology, but the mechanisms underlying this relationship remain largely unknown. Animal studies and limited human research suggest that stress can induce anhedonic behavior. Moreover, emerging data indicate that genetic variation within the corticotropin-releasing hormone type 1 receptor gene (CRHR1) at rs12938031 may promote psychopathology, particularly in the context of stress. Using an intermediate phenotypic neurogenetics approach, we assessed how stress and CRHR1 genetic variation (rs12938031) influence reward learning, an important component of anhedonia. Psychiatrically healthy female participants (n = 75) completed a probabilistic reward learning task during stress and no-stress conditions while 128-channel event-related potentials were recorded. Fifty-six participants were also genotyped across CRHR1. Response bias, an individual's ability to modulate behavior as a function of reward, was the primary behavioral variable of interest. The feedback-related positivity (FRP) in response to reward feedback was used as a neural index of reward learning. Relative to the no-stress condition, acute stress was associated with blunted response bias as well as a smaller and delayed FRP (indicative of disrupted reward learning) and reduced anterior cingulate and orbitofrontal cortex activation to reward. Critically, rs12938031 interacted with stress to influence reward learning: both behaviorally and neurally, A homozygotes showed stress-induced reward learning abnormalities. These findings indicate that acute, uncontrollable stressors reduce participants' ability to modulate behavior as a function of reward, and that such effects are modulated by CRHR1 genotype. Homozygosity for the A allele at rs12938031 may increase risk for psychopathology via stress-induced reward learning deficits.

  3. Basal ganglia orient eyes to reward.

    PubMed

    Hikosaka, Okihide; Nakamura, Kae; Nakahara, Hiroyuki

    2006-02-01

    Expectation of reward motivates our behaviors and influences our decisions. Indeed, neuronal activity in many brain areas is modulated by expected reward. However, it is still unclear where and how the reward-dependent modulation of neuronal activity occurs and how the reward-modulated signal is transformed into motor outputs. Recent studies suggest an important role of the basal ganglia. Sensorimotor/cognitive activities of neurons in the basal ganglia are strongly modulated by expected reward. Through their abundant outputs to the brain stem motor areas and the thalamocortical circuits, the basal ganglia appear capable of producing body movements based on expected reward. A good behavioral measure to test this hypothesis is saccadic eye movement because its brain stem mechanism has been extensively studied. Studies from our laboratory suggest that the basal ganglia play a key role in guiding the gaze to the location where reward is available. Neurons in the caudate nucleus and the substantia nigra pars reticulata are extremely sensitive to the positional difference in expected reward, which leads to a bias in excitability between the superior colliculi such that the saccade to the to-be-rewarded position occurs more quickly. It is suggested that the reward modulation occurs in the caudate where cortical inputs carrying spatial signals and dopaminergic inputs carrying reward-related signals are integrated. These data support a specific form of reinforcement learning theories, but also suggest further refinement of the theory.

  4. Socially learned preferences for differentially rewarded tokens in the brown capuchin monkey (Cebus apella).

    PubMed

    Brosnan, Sarah F; de Waal, Frans B M

    2004-06-01

    Social learning is assumed to underlie traditions, yet evidence indicating social learning in capuchin monkeys (Cebus apella), which exhibit traditions, is sparse. The authors tested capuchins for their ability to learn the value of novel tokens using a previously familiar token-exchange economy. Capuchins change their preferences in favor of a token worth a high-value food reward after watching a conspecific model exchange 2 differentially rewarded tokens, yet they fail to develop a similar preference after watching tokens paired with foods in the absence of a conspecific model. They also fail to learn that the value of familiar tokens has changed. Information about token value is available in all situations, but capuchins seem to pay more attention in a social situation involving novel tokens. PMID:15250800

  5. Drift diffusion model of reward and punishment learning in schizophrenia: Modeling and experimental data.

    PubMed

    Moustafa, Ahmed A; Kéri, Szabolcs; Somlai, Zsuzsanna; Balsdon, Tarryn; Frydecka, Dorota; Misiak, Blazej; White, Corey

    2015-09-15

    In this study, we tested reward- and punishment learning performance using a probabilistic classification learning task in patients with schizophrenia (n=37) and healthy controls (n=48). We also fit subjects' data using a Drift Diffusion Model (DDM) of simple decisions to investigate which components of the decision process differ between patients and controls. Modeling results show between-group differences in multiple components of the decision process. Specifically, patients had slower motor/encoding time, higher response caution (favoring accuracy over speed), and a deficit in classification learning for punishment, but not reward, trials. The results suggest that patients with schizophrenia adopt a compensatory strategy of favoring accuracy over speed to improve performance, yet still show signs of a deficit in learning based on negative feedback. Our data highlight the importance of fitting models (particularly drift diffusion models) to behavioral data. The implications of these findings are discussed relative to theories of schizophrenia and cognitive processing.
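
    The drift diffusion decomposition referenced above can be illustrated with a small simulation; the parameter values below are illustrative, not the fitted values from the study, but they show how a higher boundary (response caution) and a longer non-decision (motor/encoding) time slow responses.

      import numpy as np

      def simulate_ddm(drift, boundary, ndt, n_trials=1000, dt=0.001, noise=1.0, seed=0):
          """Simulate choices and reaction times from a basic drift diffusion model with
          symmetric boundaries at +/- boundary and non-decision time ndt (seconds)."""
          rng = np.random.default_rng(seed)
          choices, rts = [], []
          for _ in range(n_trials):
              x, t = 0.0, 0.0
              while abs(x) < boundary:
                  x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
                  t += dt
              choices.append(1 if x > 0 else 0)
              rts.append(t + ndt)
          return np.array(choices), np.array(rts)

      for label, a, ter in [("low caution, fast encoding", 1.0, 0.3),
                            ("high caution, slow encoding", 1.4, 0.4)]:
          c, rt = simulate_ddm(drift=1.5, boundary=a, ndt=ter)
          print(label, "accuracy=%.2f" % c.mean(), "mean RT=%.2fs" % rt.mean())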

  6. Drift diffusion model of reward and punishment learning in schizophrenia: Modeling and experimental data.

    PubMed

    Moustafa, Ahmed A; Kéri, Szabolcs; Somlai, Zsuzsanna; Balsdon, Tarryn; Frydecka, Dorota; Misiak, Blazej; White, Corey

    2015-09-15

    In this study, we tested reward- and punishment learning performance using a probabilistic classification learning task in patients with schizophrenia (n=37) and healthy controls (n=48). We also fit subjects' data using a Drift Diffusion Model (DDM) of simple decisions to investigate which components of the decision process differ between patients and controls. Modeling results show between-group differences in multiple components of the decision process. Specifically, patients had slower motor/encoding time, higher response caution (favoring accuracy over speed), and a deficit in classification learning for punishment, but not reward, trials. The results suggest that patients with schizophrenia adopt a compensatory strategy of favoring accuracy over speed to improve performance, yet still show signs of a deficit in learning based on negative feedback. Our data highlight the importance of fitting models (particularly drift diffusion models) to behavioral data. The implications of these findings are discussed relative to theories of schizophrenia and cognitive processing. PMID:26005124

  7. Cingulate neglect in humans: disruption of contralesional reward learning in right brain damage.

    PubMed

    Lecce, Francesca; Rotondaro, Francesca; Bonnì, Sonia; Carlesimo, Augusto; Thiebaut de Schotten, Michel; Tomaiuolo, Francesco; Doricchi, Fabrizio

    2015-01-01

    Motivational valence plays a key role in orienting spatial attention. Nonetheless, clinical documentation and understanding of motivationally based deficits of spatial orienting in humans is limited. Here, in one group study and two single-case studies, we examined right brain damaged (RBD) patients with and without left spatial neglect in a spatial reward-learning task, in which the motivational valence of the left contralesional and the right ipsilesional space was contrasted. In each trial two visual boxes were presented, one to the left and one to the right of central fixation. In one session monetary rewards were released more frequently in the box on the left side (75% of trials), whereas in another session they were released more frequently on the right side. In each trial patients were required to: 1) point to each one of the two boxes; 2) choose one of the boxes to obtain monetary reward; 3) explicitly report the position of the reward and whether this position matched the original choice. Despite defective spontaneous allocation of attention toward the contralesional space, RBD patients with left spatial neglect showed preserved contralesional reward learning, i.e., comparable to ipsilesional learning and to reward learning displayed by patients without neglect. A notable exception in the group of neglect patients was L.R., who showed no sign of contralesional reward learning in a series of 120 consecutive trials despite reaching the learning criterion in only 20 trials in the ipsilesional space. L.R. suffered cortical-subcortical brain damage affecting the anterior components of the parietal-frontal attentional network and, compared with all other neglect and non-neglect patients, had additional lesion involvement of the medial anterior cingulate cortex (ACC) and of the adjacent sectors of the corpus callosum. In contrast to his lateralized motivational learning deficit, L.R. had no lateral bias in the early phases of…

  8. A theory of cerebral learning regulated by the reward system. I. Hypotheses and mathematical description.

    PubMed

    Nakamura, K

    1993-01-01

    Hypothetical mechanisms of the neocorticohippocampal system are presented. Neurophysiological and neuroanatomical findings concerning the system are integrated to demonstrate how animals associate sensory stimuli with rewarding actions: (1) cortical plasticity regulated by cholinergic/noradrenergic inputs from the hypothalamic reward system reinforces association connections between the most activated columns in the cortex; (2) the repetitive reinforcement forms association pathways connecting sensory cortical columns activated by the stimuli with motor cortical columns producing the rewarding actions; (3) after the pathways are formed, the cortex is capable of temporarily memorizing the stimuli by producing long-term potentiation through the cortico-hippocampal circuits; and (4) the memory allows the cortex to extend correct association pathways even in an environment where sensory stimuli rapidly change. A mathematical model of parts of the nervous system is presented to quantitatively examine the mechanisms. Membrane characteristics of single neurons are given by the Hodgkin-Huxley electric circuit. According to anatomical data, neural circuits of the neocortico-hippocampal system are composed by connecting populations of the model neurons. Computer simulation using physiological data concerning ion channels demonstrates how the mechanisms work and how to test the hypotheses presented.

  9. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning

    PubMed Central

    Franklin, Nicholas T; Frank, Michael J

    2015-01-01

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning Marr's three levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments. DOI: http://dx.doi.org/10.7554/eLife.12029.001 PMID:26705698
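
    At the algorithmic level, the mechanism described above resembles a delta-rule learner whose learning rate is gated by an uncertainty signal; the following sketch (close in spirit to Pearce-Hall associability, not the circuit model itself) shows how such gating speeds re-learning after a change-point, with all parameters illustrative.

      import numpy as np

      def adaptive_rate_learning(rewards, base_alpha=0.3, eta=0.1):
          """Delta-rule learning in which an uncertainty estimate (tracking the size of
          recent prediction errors) scales the effective learning rate."""
          v, uncertainty, values = 0.0, 1.0, []
          for r in rewards:
              delta = r - v
              v += base_alpha * uncertainty * delta            # uncertainty gates the update
              uncertainty += eta * (abs(delta) - uncertainty)  # large surprises raise the rate
              values.append(v)
          return np.array(values)

      rng = np.random.default_rng(0)
      # Bernoulli rewards with a change-point: p(reward) drops from 0.8 to 0.2 halfway through.
      rewards = np.concatenate([rng.random(100) < 0.8, rng.random(100) < 0.2]).astype(float)
      v = adaptive_rate_learning(rewards)
      print(v[99], v[119], v[199])   # value before the switch, shortly after, and at the end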

  10. Observing others stay or switch - How social prediction errors are integrated into reward reversal learning.

    PubMed

    Ihssen, Niklas; Mussweiler, Thomas; Linden, David E J

    2016-08-01

    Reward properties of stimuli can undergo sudden changes, and the detection of these 'reversals' is often made difficult by the probabilistic nature of rewards/punishments. Here we tested whether and how humans use social information (someone else's choices) to overcome uncertainty during reversal learning. We show a substantial social influence during reversal learning, which was modulated by the type of observed behavior. Participants frequently followed observed conservative choices (no switches after punishment) made by the (fictitious) other player but ignored impulsive choices (switches), even though the experiment was set up so that both types of response behavior would be similarly beneficial/detrimental (Study 1). Computational modeling showed that participants integrated the observed choices as a 'social prediction error' instead of ignoring or blindly following the other player. Modeling also confirmed higher learning rates for 'conservative' versus 'impulsive' social prediction errors. Importantly, this 'conservative bias' was boosted by interpersonal similarity, which in conjunction with the lack of effects observed in a non-social control experiment (Study 2) confirmed its social nature. A third study suggested that relative weighting of observed impulsive responses increased with increased volatility (frequency of reversals). Finally, simulations showed that in the present paradigm integrating social and reward information was not necessarily more adaptive to maximize earnings than learning from reward alone. Moreover, integrating social information increased accuracy only when conservative and impulsive choices were weighted similarly during learning. These findings suggest that to guide decisions in choice contexts that involve reward reversals humans utilize social cues conforming with their preconceptions more strongly than cues conflicting with them, especially when the other is similar. PMID:27128170
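
    The abstract describes the model only at a high level; the sketch below is one hypothetical way to combine a reward prediction error with a 'social prediction error', using separate learning rates for observed stays versus switches as the modeling results suggest. The names and parameter values are invented for illustration.

      def update_choice_value(v, own_reward, other_chose_same, observed_was_switch,
                              alpha_reward=0.3, alpha_stay=0.4, alpha_switch=0.1):
          """Update the value of the currently favored option from one's own outcome and
          from the other player's observed choice (weighted more when it was a 'stay')."""
          v += alpha_reward * (own_reward - v)                   # standard reward prediction error
          alpha_social = alpha_switch if observed_was_switch else alpha_stay
          v += alpha_social * (float(other_chose_same) - v)      # social prediction error
          return v

      v = 0.5
      v = update_choice_value(v, own_reward=0.0, other_chose_same=True, observed_was_switch=False)
      print(v)   # pulled down by the loss, then pushed back up by the other's conservative stay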

  11. Fuel not fun: Reinterpreting attenuated brain responses to reward in obesity.

    PubMed

    Kroemer, Nils B; Small, Dana M

    2016-08-01

    There is a well-established literature linking obesity to altered dopamine signaling and brain response to food-related stimuli. Neuroimaging studies frequently report enhanced responses in dopaminergic regions during food anticipation and decreased responses during reward receipt. This has been interpreted as reflecting anticipatory "reward surfeit", and consummatory "reward deficiency". In particular, attenuated response in the dorsal striatum to primary food rewards is proposed to reflect anhedonia, which leads to overeating in an attempt to compensate for the reward deficit. In this paper, we propose an alternative view. We consider brain response to food-related stimuli in a reinforcement-learning framework, which can be employed to separate the contributions of reward sensitivity and reward-related learning that are typically entangled in the brain response to reward. Consequently, we posit that decreased striatal responses to milkshake receipt reflect reduced reward-related learning rather than reward deficiency or anhedonia because reduced reward sensitivity would translate uniformly into reduced anticipatory and consummatory responses to reward. By re-conceptualizing reward deficiency as a shift in learning about subjective value of rewards, we attempt to reconcile neuroimaging findings with the putative role of dopamine in effort, energy expenditure and exploration and suggest that attenuated brain responses to energy dense foods reflect the "fuel", not the fun entailed by the reward. PMID:27085908

  12. Reinforcement learning to adaptive control of nonlinear systems.

    PubMed

    Hwang, Kao-Shing; Tan, S W; Tsai, Min-Cheng

    2003-01-01

    Based on feedback linearization theory, this paper presents how a reinforcement learning scheme adopted to construct artificial neural networks (ANNs) can effectively linearize a nonlinear system. The proposed reinforcement linearization learning system (RLLS) consists of two sub-systems: the evaluation predictor (EP) is a long-term policy selector, and the other is a short-term action selector composed of linearizing control (LC) and reinforce predictor (RP) elements. In addition, a reference model plays the role of the environment, providing the reinforcement signal to the linearizing process. The RLLS thus receives reinforcement signals to accomplish the linearizing behavior, controlling the nonlinear system so that it behaves similarly to the reference model. Eventually, the RLLS performs identification and linearization concurrently. Simulation results demonstrate that the proposed learning scheme, applied to linearizing a pendulum system, provides better control reliability and robustness than conventional ANN schemes. Furthermore, a PI controller is used to control the linearized plant, where the affine system behaves like a linear system.

  13. Cooperation and Coordination Between Fuzzy Reinforcement Learning Agents in Continuous State Partially Observable Markov Decision Processes

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Vengerov, David

    1999-01-01

    Successful operations of future multi-agent intelligent systems require efficient cooperation schemes between agents sharing learning experiences. We consider a pseudo-realistic world in which one or more opportunities appear and disappear in random locations. Agents use fuzzy reinforcement learning to learn which opportunities are most worthy of pursuing based on their promise rewards, expected lifetimes, path lengths and expected path costs. We show that this world is partially observable because the history of an agent influences the distribution of its future states. We consider a cooperation mechanism in which agents share experience by using and updating one joint behavior policy. We also implement a coordination mechanism for allocating opportunities to different agents in the same world. Our results demonstrate that K cooperative agents each learning in a separate world over N time steps outperform K independent agents each learning in a separate world over K*N time steps, with this result becoming more pronounced as the degree of partial observability in the environment increases. We also show that cooperation between agents learning in the same world decreases performance with respect to independent agents. Since cooperation reduces diversity between agents, we conclude that diversity is a key parameter in the trade-off between maximizing utility from cooperation when diversity is low and maximizing utility from competitive coordination when diversity is high.

  14. Tactile learning and the individual evaluation of the reward in honey bees (Apis mellifera L.).

    PubMed

    Scheiner, R; Erber, J; Page, R E

    1999-07-01

    Using the proboscis extension response we conditioned pollen and nectar foragers of the honey bee (Apis mellifera L.) to tactile patterns under laboratory conditions. Pollen foragers demonstrated better acquisition, extinction, and reversal learning than nectar foragers. We tested whether the known differences in response thresholds to sucrose between pollen and nectar foragers could explain the observed differences in learning and found that nectar foragers with low response thresholds performed better during acquisition and extinction than ones with higher thresholds. Conditioning pollen and nectar foragers with similar response thresholds did not yield differences in their learning performance. These results suggest that differences in the learning performance of pollen and nectar foragers are a consequence of differences in their perception of sucrose. Furthermore, we analysed the effect which the perception of sucrose reward has on associative learning. Nectar foragers with uniform low response thresholds were conditioned using varying concentrations of sucrose. We found significant positive correlations between the concentrations of the sucrose rewards and the performance during acquisition and extinction. The results are summarised in a model which describes the relationships between learning performance, response threshold to sucrose, concentration of sucrose and the number of rewards.

  15. Decision theory, reinforcement learning, and the brain.

    PubMed

    Dayan, Peter; Daw, Nathaniel D

    2008-12-01

    Decision making is a core competence for animals and humans acting and surviving in environments they only partially comprehend, gaining rewards and punishments for their troubles. Decision-theoretic concepts permeate experiments and computational models in ethology, psychology, and neuroscience. Here, we review a well-known, coherent Bayesian approach to decision making, showing how it unifies issues in Markovian decision problems, signal detection psychophysics, sequential sampling, and optimal exploration and discuss paradigmatic psychological and neural examples of each problem. We discuss computational issues concerning what subjects know about their task and how ambitious they are in seeking optimal solutions; we address algorithmic topics concerning model-based and model-free methods for making choices; and we highlight key aspects of the neural implementation of decision making.

  16. Immediate Reinforcement and the Disadvantaged Learner. A Practical Application of Learning Theory.

    ERIC Educational Resources Information Center

    Fantini, Mario; Weinstein, Gerald

    Implicit in the concept of immediate reinforcement are two assumptions--(1) that a need must be satisfied, and (2) that a reward can serve to satisfy this need. The culturally disadvantaged child needs encouragement or discouragement right away. His society operates in this way. In the classroom, such reinforcement may take many forms. One teacher…

  17. Reinforcement Learning in Young Adults with Developmental Language Impairment

    PubMed Central

    Lee, Joanna C.; Tomblin, J. Bruce

    2012-01-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank et al., 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic selection task was used to assess how participants implicitly extracted reinforcement history from the environment based on probabilistic positive/negative feedback. The findings showed impaired RL in individuals with DLI, indicating an altered gating function of the striatum in testing. However, they exploited similar learning strategies as comparison participants at the beginning of training, reflecting relatively intact functions of the prefrontal cortex to rapidly update reinforcement information. Within the context of Frank’s model, these results can be interpreted as evidence for alterations in the basal ganglia of individuals with DLI. PMID:22921956

  18. Reinforcement learning in young adults with developmental language impairment.

    PubMed

    Lee, Joanna C; Tomblin, J Bruce

    2012-12-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank, Seeberger, & O'Reilly, 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic selection task was used to assess how participants implicitly extracted reinforcement history from the environment based on probabilistic positive/negative feedback. The findings showed impaired RL in individuals with DLI, indicating an altered gating function of the striatum in testing. However, they exploited similar learning strategies as comparison participants at the beginning of training, reflecting relatively intact functions of the prefrontal cortex to rapidly update reinforcement information. Within the context of Frank's model, these results can be interpreted as evidence for alterations in the basal ganglia of individuals with DLI.

  19. Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin.

    PubMed

    Ezaki, Takahiro; Horita, Yutaka; Takezawa, Masanori; Masuda, Naoki

    2016-07-01

    Direct reciprocity, or repeated interaction, is a main mechanism to sustain cooperation under social dilemmas involving two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. Mechanisms underlying these behaviors largely remain unclear. Here we provide a proximate account for this behavior by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff is larger than a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results obtained in the present study are general in that they explain extant experimental results obtained for both so-called moody and non-moody conditional cooperation, prisoner's dilemma and public goods games, and well-mixed groups and networks. In contrast to the previous theory, individuals are assumed to have no access to information about what other individuals are doing, so they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity of cooperation is modulated in every discrete time step, explains the conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from the Pavlov strategy, a reinforcement learning strategy promoting mutual cooperation in two-player situations.
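
    A minimal sketch of the aspiration-learning rule summarised above, with prisoner's dilemma payoffs, the aspiration level, the learning rate, and a Tit-for-Tat partner all chosen purely for illustration:

    ```python
    import random

    # Illustrative prisoner's dilemma payoffs for the focal player (C = cooperate, D = defect).
    PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
    ASPIRATION = 2.0        # fixed aspiration level
    LEARNING_RATE = 0.2

    p_cooperate = 0.5       # unconditional propensity to cooperate
    partner_move = 'C'
    for _ in range(1000):
        my_move = 'C' if random.random() < p_cooperate else 'D'
        payoff = PAYOFF[(my_move, partner_move)]

        # Reinforce the chosen action if the payoff met the aspiration level, otherwise anti-reinforce it.
        satisfied = payoff >= ASPIRATION
        target = 1.0 if (my_move == 'C') == satisfied else 0.0
        p_cooperate += LEARNING_RATE * (target - p_cooperate)

        partner_move = my_move              # the partner plays Tit-for-Tat for illustration
    print(round(p_cooperate, 2))
    ```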

  20. Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin

    PubMed Central

    Ezaki, Takahiro; Horita, Yutaka; Masuda, Naoki

    2016-01-01

    Direct reciprocity, or repeated interaction, is a main mechanism to sustain cooperation under social dilemmas involving two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. Mechanisms underlying these behaviors largely remain unclear. Here we provide a proximate account for this behavior by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff is larger than a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results obtained in the present study are general in that they explain extant experimental results obtained for both so-called moody and non-moody conditional cooperation, prisoner’s dilemma and public goods games, and well-mixed groups and networks. In contrast to the previous theory, individuals are assumed to have no access to information about what other individuals are doing, so they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity of cooperation is modulated in every discrete time step, explains the conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from the Pavlov strategy, a reinforcement learning strategy promoting mutual cooperation in two-player situations. PMID:27438888

  1. Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin.

    PubMed

    Ezaki, Takahiro; Horita, Yutaka; Takezawa, Masanori; Masuda, Naoki

    2016-07-01

    Direct reciprocity, or repeated interaction, is a main mechanism to sustain cooperation under social dilemmas involving two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. Mechanisms underlying these behaviors largely remain unclear. Here we provide a proximate account for this behavior by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff is larger than a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results obtained in the present study are general in that they explain extant experimental results obtained for both so-called moody and non-moody conditional cooperation, prisoner's dilemma and public goods games, and well-mixed groups and networks. In contrast to the previous theory, individuals are assumed to have no access to information about what other individuals are doing, so they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity of cooperation is modulated in every discrete time step, explains the conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from the Pavlov strategy, a reinforcement learning strategy promoting mutual cooperation in two-player situations. PMID:27438888

  2. Kinesthetic Reinforcement-Is It a Boon to Learning?

    ERIC Educational Resources Information Center

    Bohrer, Roxilu K.

    1970-01-01

    Language instruction, particularly in the elementary school, should be reinforced through the use of visual aids and through associated physical activity. Kinesthetic experiences provide an opportunity to make use of non-verbal cues to meaning, enliven classroom activities, and maximize learning for pupils. The author discusses the educational…

  3. Human Operant Learning under Concurrent Reinforcement of Response Variability

    ERIC Educational Resources Information Center

    Maes, J. H. R.; van der Goot, M.

    2006-01-01

    This study asked whether the concurrent reinforcement of behavioral variability facilitates learning to emit a difficult target response. Sixty students repeatedly pressed sequences of keys, with an originally infrequently occurring target sequence consistently being followed by positive feedback. Three conditions differed in the feedback given to…

  4. Reinforcement Learning in Young Adults with Developmental Language Impairment

    ERIC Educational Resources Information Center

    Lee, Joanna C.; Tomblin, J. Bruce

    2012-01-01

    The aim of the study was to examine reinforcement learning (RL) in young adults with developmental language impairment (DLI) within the context of a neurocomputational model of the basal ganglia-dopamine system (Frank, Seeberger, & O'Reilly, 2004). Two groups of young adults, one with DLI and the other without, were recruited. A probabilistic…

  5. Deficits in Positive Reinforcement Learning and Uncertainty-Driven Exploration are Associated with Distinct Aspects of Negative Symptoms in Schizophrenia

    PubMed Central

    Strauss, Gregory P.; Frank, Michael J.; Waltz, James A.; Kasanova, Zuzana; Herbener, Ellen S.; Gold, James M.

    2011-01-01

    Background Negative symptoms are core features of schizophrenia; however, the cognitive and neural basis for individual negative symptom domains remains unclear. Converging evidence suggests a role for striatal and prefrontal dopamine in reward learning and the exploration of actions that might produce outcomes that are better than the status quo. The current study examines whether deficits in reinforcement learning and uncertainty-driven exploration predict specific negative symptom domains. Methods We administered a temporal decision making task, which required trial-by-trial adjustment of reaction time (RT) to maximize reward receipt, to 51 patients with schizophrenia and 39 age-matched healthy controls. Task conditions were designed such that expected value (probability * magnitude) increased (IEV), decreased (DEV), or remained constant (CEV) with increasing response times. Computational analyses were applied to estimate the degree to which trial-by-trial responses are influenced by reinforcement history. Results Individuals with schizophrenia showed impaired Go learning, but intact NoGo learning relative to controls. These effects were pronounced as a function of global measures of negative symptoms. Uncertainty-based exploration was substantially reduced in individuals with schizophrenia, and selectively correlated with clinical ratings of anhedonia. Conclusions Schizophrenia patients, particularly those with high negative symptoms, failed to speed RTs to increase positive outcomes and showed reduced tendency to explore when alternative actions could lead to better outcomes than the status quo. Results are interpreted in the context of current computational, genetic, and pharmacological data supporting the roles of striatal and prefrontal dopamine in these processes. PMID:21168124

  6. Learning to minimize efforts versus maximizing rewards: computational principles and neural correlates.

    PubMed

    Skvortsova, Vasilisa; Palminteri, Stefano; Pessiglione, Mathias

    2014-11-19

    The mechanisms of reward maximization have been extensively studied at both the computational and neural levels. By contrast, little is known about how the brain learns to choose the options that minimize action cost. In principle, the brain could have evolved a general mechanism that applies the same learning rule to the different dimensions of choice options. To test this hypothesis, we scanned healthy human volunteers while they performed a probabilistic instrumental learning task that varied in both the physical effort and the monetary outcome associated with choice options. Behavioral data showed that the same computational rule, using prediction errors to update expectations, could account for both reward maximization and effort minimization. However, these learning-related variables were encoded in partially dissociable brain areas. In line with previous findings, the ventromedial prefrontal cortex was found to positively represent expected and actual rewards, regardless of effort. A separate network, encompassing the anterior insula, the dorsal anterior cingulate, and the posterior parietal cortex, correlated positively with expected and actual efforts. These findings suggest that the same computational rule is applied by distinct brain systems, depending on the choice dimension (cost or benefit) that has to be learned.
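
    The shared learning rule the authors describe, a single prediction-error update applied to both the reward and the effort dimension of each option, might be sketched as follows; the options, outcome probabilities, and learning rate are invented for illustration:

    ```python
    import random

    ALPHA = 0.15    # learning rate (illustrative)

    # Each option carries an expected monetary reward and an expected physical effort.
    expected = {'A': {'reward': 0.0, 'effort': 0.0}, 'B': {'reward': 0.0, 'effort': 0.0}}
    true_prob = {'A': {'reward': 0.8, 'effort': 0.2}, 'B': {'reward': 0.4, 'effort': 0.7}}

    for _ in range(500):
        option = random.choice(('A', 'B'))
        for dim in ('reward', 'effort'):
            outcome = 1.0 if random.random() < true_prob[option][dim] else 0.0
            # Same delta rule on both dimensions: expectation += alpha * prediction error.
            expected[option][dim] += ALPHA * (outcome - expected[option][dim])

    print({opt: {d: round(v, 2) for d, v in dims.items()} for opt, dims in expected.items()})
    ```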

  7. Neurofeedback in Learning Disabled Children: Visual versus Auditory Reinforcement.

    PubMed

    Fernández, Thalía; Bosch-Bayard, Jorge; Harmony, Thalía; Caballero, María I; Díaz-Comas, Lourdes; Galán, Lídice; Ricardo-Garcell, Josefina; Aubert, Eduardo; Otero-Ojeda, Gloria

    2016-03-01

    Children with learning disabilities (LD) frequently have an EEG characterized by an excess of theta and a deficit of alpha activities. Neurofeedback (NFB) using an auditory stimulus as a reinforcer has proven to be a useful tool to treat LD children by positively reinforcing decreases of the theta/alpha ratio. The aim of the present study was to optimize the NFB procedure by comparing the efficacy of visual (with eyes open) versus auditory (with eyes closed) reinforcers. Twenty LD children with an abnormally high theta/alpha ratio were randomly assigned to the Auditory or the Visual group, where a 500 Hz tone or a visual stimulus (a white square), respectively, was used as a positive reinforcer when the value of the theta/alpha ratio was reduced. Both groups had signs consistent with EEG maturation, but only the Auditory Group showed behavioral/cognitive improvements. In conclusion, the auditory reinforcer was more efficacious in reducing the theta/alpha ratio, and it improved the cognitive abilities more than the visual reinforcer.
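
    The training contingency described here, delivering a reinforcer whenever the theta/alpha ratio falls below a criterion, could be sketched roughly as below. The band limits, sampling rate, epoch length, and threshold rule are assumptions for illustration, not the study's exact protocol:

    ```python
    import numpy as np

    FS = 250                                 # sampling rate in Hz (assumed)
    THETA_BAND, ALPHA_BAND = (4, 8), (8, 12)

    def band_power(epoch, band, fs=FS):
        """Mean spectral power of one EEG epoch within a frequency band."""
        freqs = np.fft.rfftfreq(len(epoch), 1.0 / fs)
        power = np.abs(np.fft.rfft(epoch)) ** 2
        mask = (freqs >= band[0]) & (freqs < band[1])
        return power[mask].mean()

    def deliver_reinforcer(epoch, criterion):
        """Return True (play the tone or show the white square) when theta/alpha is below criterion."""
        ratio = band_power(epoch, THETA_BAND) / band_power(epoch, ALPHA_BAND)
        return ratio < criterion

    # One second of random noise standing in for an EEG epoch.
    epoch = np.random.default_rng(0).standard_normal(FS)
    print(deliver_reinforcer(epoch, criterion=1.0))
    ```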

  8. Working memory contributions to reinforcement learning impairments in schizophrenia.

    PubMed

    Collins, Anne G E; Brown, Jaime K; Gold, James M; Waltz, James A; Frank, Michael J

    2014-10-01

    Previous research has shown that patients with schizophrenia are impaired in reinforcement learning tasks. However, behavioral learning curves in such tasks originate from the interaction of multiple neural processes, including the basal ganglia- and dopamine-dependent reinforcement learning (RL) system, but also prefrontal cortex-dependent cognitive strategies involving working memory (WM). Thus, it is unclear which specific system induces impairments in schizophrenia. We recently developed a task and computational model allowing us to separately assess the roles of RL (slow, cumulative learning) mechanisms versus WM (fast but capacity-limited) mechanisms in healthy adult human subjects. Here, we used this task to assess patients' specific sources of impairments in learning. In 15 separate blocks, subjects learned to pick one of three actions for stimuli. The number of stimuli to learn in each block varied from two to six, allowing us to separate influences of capacity-limited WM from the incremental RL system. As expected, both patients (n = 49) and healthy controls (n = 36) showed effects of set size and delay between stimulus repetitions, confirming the presence of working memory effects. Patients performed significantly worse than controls overall, but computational model fits and behavioral analyses indicate that these deficits could be entirely accounted for by changes in WM parameters (capacity and reliability), whereas RL processes were spared. These results suggest that the working memory system contributes strongly to learning impairments in schizophrenia. PMID:25297101
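
    A rough sketch of the kind of dual-system account used here, an incremental RL learner running alongside a fast but capacity-limited working-memory store; the capacity, learning rate, and mixing weight are invented, and the model in the paper is considerably more detailed:

    ```python
    import random

    ACTIONS = (0, 1, 2)
    ALPHA, WM_CAPACITY, WM_RELIANCE = 0.1, 3, 0.8    # illustrative parameters

    Q = {}      # slow, cumulative RL values
    wm = {}     # fast but capacity-limited stimulus -> action store

    def choose(stimulus):
        if stimulus in wm and random.random() < WM_RELIANCE:
            return wm[stimulus]                       # WM answers when the item is still held
        return max(ACTIONS, key=lambda a: Q.get((stimulus, a), 0.0))

    def update(stimulus, action, reward):
        q = Q.get((stimulus, action), 0.0)
        Q[(stimulus, action)] = q + ALPHA * (reward - q)   # incremental RL update
        if reward > 0:
            wm[stimulus] = action                          # store the successful mapping
            while len(wm) > WM_CAPACITY:                   # capacity limit: drop the oldest item
                wm.pop(next(iter(wm)))

    # With a set size above WM capacity, more of the burden falls on the slow RL values.
    for trial in range(300):
        stim = trial % 6
        act = choose(stim)
        update(stim, act, 1.0 if act == stim % 3 else 0.0)
    ```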

  9. Structure identification in fuzzy inference using reinforcement learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1993-01-01

    In our previous work on the GARIC architecture, we have shown that the system can start with surface structure of the knowledge base (i.e., the linguistic expression of the rules) and learn the deep structure (i.e., the fuzzy membership functions of the labels used in the rules) by using reinforcement learning. Assuming the surface structure, GARIC refines the fuzzy membership functions used in the consequents of the rules using a gradient descent procedure. This hybrid fuzzy logic and reinforcement learning approach can learn to balance a cart-pole system and to back up a truck to its docking location after a few trials. In this paper, we discuss how to do structure identification using reinforcement learning in fuzzy inference systems. This involves identifying both surface as well as deep structure of the knowledge base. The term set of fuzzy linguistic labels used in describing the values of each control variable must be derived. In this process, splitting a label refers to creating new labels which are more granular than the original label, and merging two labels creates a more general label. Splitting and merging of labels directly transform the structure of the action selection network used in GARIC by increasing or decreasing the number of hidden layer nodes.

  10. Robot Docking Based on Omnidirectional Vision and Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Muse, David; Weber, Cornelius; Wermter, Stefan

    We present a system for visual robotic docking using an omnidirectional camera coupled with the actor-critic reinforcement learning algorithm. The system enables a PeopleBot robot to locate and approach a table so that it can pick an object from it using the pan-tilt camera mounted on the robot. We use a staged approach to solve this problem, as there are distinct subtasks and different sensors used. The robot starts by wandering randomly until the table is located via a landmark; a network trained via reinforcement then allows the robot to turn to and approach the table. Once at the table, the robot picks the object from it. We argue that our approach has considerable potential, allowing robot control for navigation to be learned and removing the need for internal maps of the environment. This is achieved by allowing the robot to learn couplings between motor actions and the position of a landmark.
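
    A bare-bones version of the actor-critic update that drives the approach phase can be sketched as follows. The discretised landmark bearing, the three actions, and the learning rates are simplified assumptions rather than the system's actual sensors or parameters:

    ```python
    import math
    import random

    STATES = range(5)                       # discretised bearing of the table landmark (assumed)
    ACTIONS = ('left', 'forward', 'right')
    GAMMA, ALPHA_CRITIC, ALPHA_ACTOR = 0.95, 0.1, 0.05

    value = {s: 0.0 for s in STATES}                           # critic: state values
    prefs = {(s, a): 0.0 for s in STATES for a in ACTIONS}     # actor: action preferences

    def select_action(state):
        weights = [math.exp(prefs[(state, a)]) for a in ACTIONS]
        return random.choices(ACTIONS, weights=weights)[0]

    def learn(state, action, reward, next_state):
        # The critic's TD error evaluates the move and also trains the actor.
        td_error = reward + GAMMA * value[next_state] - value[state]
        value[state] += ALPHA_CRITIC * td_error
        prefs[(state, action)] += ALPHA_ACTOR * td_error

    # One illustrative update: a small positive reward for a move that improves the bearing.
    learn(state=0, action=select_action(0), reward=0.1, next_state=1)
    ```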

  11. A tribute to charlie chaplin: induced positive affect improves reward-based decision-learning in Parkinson's disease.

    PubMed

    Ridderinkhof, K Richard; van Wouwe, Nelleke C; Band, Guido P H; Wylie, Scott A; Van der Stigchel, Stefan; van Hees, Pieter; Buitenweg, Jessika; van de Vijver, Irene; van den Wildenberg, Wery P M

    2012-01-01

    Reward-based decision-learning refers to the process of learning to select those actions that lead to rewards while avoiding actions that lead to punishments. This process, known to rely on dopaminergic activity in striatal brain regions, is compromised in Parkinson's disease (PD). We hypothesized that such decision-learning deficits are alleviated by induced positive affect, which is thought to incur transient boosts in midbrain and striatal dopaminergic activity. Computational measures of probabilistic reward-based decision-learning were determined for 51 patients diagnosed with PD. Previous work has shown these measures to rely on the nucleus caudatus (outcome evaluation during the early phases of learning) and the putamen (reward prediction during later phases of learning). We observed that induced positive affect facilitated learning, through its effects on reward prediction rather than outcome evaluation. Viewing a few minutes of comedy clips served to remedy dopamine-related problems associated with frontostriatal circuitry and, consequently, learning to predict which actions will yield reward. PMID:22707944

  12. A Tribute to Charlie Chaplin: Induced Positive Affect Improves Reward-Based Decision-Learning in Parkinson’s Disease

    PubMed Central

    Ridderinkhof, K. Richard; van Wouwe, Nelleke C.; Band, Guido P. H.; Wylie, Scott A.; Van der Stigchel, Stefan; van Hees, Pieter; Buitenweg, Jessika; van de Vijver, Irene; van den Wildenberg, Wery P. M.

    2012-01-01

    Reward-based decision-learning refers to the process of learning to select those actions that lead to rewards while avoiding actions that lead to punishments. This process, known to rely on dopaminergic activity in striatal brain regions, is compromised in Parkinson’s disease (PD). We hypothesized that such decision-learning deficits are alleviated by induced positive affect, which is thought to incur transient boosts in midbrain and striatal dopaminergic activity. Computational measures of probabilistic reward-based decision-learning were determined for 51 patients diagnosed with PD. Previous work has shown these measures to rely on the nucleus caudatus (outcome evaluation during the early phases of learning) and the putamen (reward prediction during later phases of learning). We observed that induced positive affect facilitated learning, through its effects on reward prediction rather than outcome evaluation. Viewing a few minutes of comedy clips served to remedy dopamine-related problems associated with frontostriatal circuitry and, consequently, learning to predict which actions will yield reward. PMID:22707944

  13. Reinforcing Communication Skills While Registered Nurses Simultaneously Learn Course Content: A Response to Learning Needs.

    ERIC Educational Resources Information Center

    De Simone, Barbara B.

    1994-01-01

    Fifteen nursing students participated in Integrated Skills Reinforcement, an approach that reinforces writing, reading, speaking, and listening skills while students learn course content. Pre-/postassessment of writing showed that 93% achieved writing improvement. All students agreed that the approach improved understanding of course content, a…

  14. Effects of subconscious and conscious emotions on human cue-reward association learning.

    PubMed

    Watanabe, Noriya; Haruno, Masahiko

    2015-01-01

    Life demands that we adapt our behaviour continuously in situations in which much of our incoming information is emotional and unrelated to our immediate behavioural goals. Such information is often processed without our consciousness. This poses an intriguing question of whether subconscious exposure to irrelevant emotional information (e.g. the surrounding social atmosphere) affects the way we learn. Here, we addressed this issue by examining whether the learning of cue-reward associations changes when an emotional facial expression is shown subconsciously or consciously prior to the presentation of a reward-predicting cue. We found that both subconscious (0.027 s and 0.033 s) and conscious (0.047 s) emotional signals increased the rate of learning, and this increase was smallest at the border of conscious duration (0.040 s). These data suggest not only that the subconscious and conscious processing of emotional signals enhances value-updating in cue-reward association learning, but also that the computational processes underlying the subconscious enhancement are at least partially dissociable from their conscious counterparts. PMID:25684237

  15. Effective reinforcement learning following cerebellar damage requires a balance between exploration and motor noise.

    PubMed

    Therrien, Amanda S; Wolpert, Daniel M; Bastian, Amy J

    2016-01-01

    Reinforcement and error-based processes are essential for motor learning, with the cerebellum thought to be required only for the error-based mechanism. Here we examined learning and retention of a reaching skill under both processes. Control subjects learned similarly from reinforcement and error-based feedback, but showed much better retention under reinforcement. To apply reinforcement to cerebellar patients, we developed a closed-loop reinforcement schedule in which task difficulty was controlled based on recent performance. This schedule produced substantial learning in cerebellar patients and controls. Cerebellar patients varied in their learning under reinforcement but fully retained what was learned. In contrast, they showed complete lack of retention in error-based learning. We developed a mechanistic model of the reinforcement task and found that learning depended on a balance between exploration variability and motor noise. While the cerebellar and control groups had similar exploration variability, the patients had greater motor noise and hence learned less. Our results suggest that cerebellar damage indirectly impairs reinforcement learning by increasing motor noise, but does not interfere with the reinforcement mechanism itself. Therefore, reinforcement can be used to learn and retain novel skills, but optimal reinforcement learning requires a balance between exploration variability and motor noise.
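
    The closed-loop reinforcement schedule mentioned above, in which task difficulty tracks recent performance, could be implemented along these lines; the target success rate, step size, and bounds are assumptions for illustration:

    ```python
    def adjust_target_width(width, recent_hits, target_rate=0.65, step=0.05,
                            min_width=0.5, max_width=10.0):
        """Shrink the rewarded target zone when the learner succeeds more often than the
        target rate, and widen it when they succeed less often (illustrative rule)."""
        hit_rate = sum(recent_hits) / len(recent_hits)
        if hit_rate > target_rate:
            width -= step          # make the task harder
        elif hit_rate < target_rate:
            width += step          # make the task easier
        return min(max(width, min_width), max_width)

    width = 5.0
    for window in ([1, 1, 1, 0, 1], [0, 0, 1, 0, 0], [1, 0, 1, 1, 0]):
        width = adjust_target_width(width, window)
    print(round(width, 2))
    ```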

  16. Reinforcement learning agents providing advice in complex video games

    NASA Astrophysics Data System (ADS)

    Taylor, Matthew E.; Carboni, Nicholas; Fachantidis, Anestis; Vlahavas, Ioannis; Torrey, Lisa

    2014-01-01

    This article introduces a teacher-student framework for reinforcement learning, synthesising and extending material that appeared in conference proceedings [Torrey, L., & Taylor, M. E. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. Proceedings of the international conference on autonomous agents and multiagent systems] and in a non-archival workshop paper [Carboni, N., & Taylor, M. E. (2013, May). Preliminary results for 1 vs. 1 tactics in StarCraft. Proceedings of the adaptive and learning agents workshop (at AAMAS-13)]. In this framework, a teacher agent instructs a student agent by suggesting actions the student should take as it learns. However, the teacher may only give such advice a limited number of times. We present several novel algorithms that teachers can use to budget their advice effectively, and we evaluate them in two complex video games: StarCraft and Pac-Man. Our results show that the same amount of advice, given at different moments, can have different effects on student learning, and that teachers can significantly affect student learning even when students use different learning methods and state representations.
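
    One simple way to budget advice, spending it only in states where the teacher's own value estimates say the choice matters most, might look like the sketch below; the Q-table, budget, and importance threshold are invented for illustration, and this is not a reproduction of the paper's algorithms:

    ```python
    class BudgetedTeacher:
        """Offers advice only while budget remains and only in states where the gap between
        the best and worst action values exceeds an importance threshold (illustrative rule)."""

        def __init__(self, q_table, budget, importance_threshold):
            self.q = q_table                    # teacher's own state -> {action: value} estimates
            self.budget = budget
            self.threshold = importance_threshold

        def advise(self, state):
            if self.budget <= 0 or state not in self.q:
                return None
            values = self.q[state]
            if max(values.values()) - min(values.values()) < self.threshold:
                return None                     # save the budget for states where advice matters more
            self.budget -= 1
            return max(values, key=values.get)

    teacher = BudgetedTeacher({'s0': {'attack': 1.0, 'retreat': -1.0}}, budget=10, importance_threshold=0.5)
    print(teacher.advise('s0'))   # -> 'attack'
    ```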

  17. Reward prediction error computation in the pedunculopontine tegmental nucleus neurons.

    PubMed

    Kobayashi, Yasushi; Okada, Ken-Ichi

    2007-05-01

    In this article, we address the role of neuronal activity in the pathways of the brainstem-midbrain circuit in reward and the basis for believing that this circuit provides advantages over previous reinforcement learning theory. Several lines of evidence support the reward-based learning theory proposing that midbrain dopamine (DA) neurons send a teaching signal (the reward prediction error signal) to control synaptic plasticity of the projection area. However, the underlying mechanism of where and how the reward prediction error signal is computed still remains unclear. Since the pedunculopontine tegmental nucleus (PPTN) in the brainstem is one of the strongest excitatory input sources to DA neurons, we hypothesized that the PPTN may play an important role in activating DA neurons and reinforcement learning by relaying necessary signals for reward prediction error computation to DA neurons. To investigate the involvement of the PPTN neurons in computation of reward prediction error, we used a visually guided saccade task (VGST) during recording of neuronal activity in monkeys. Here, we predict that PPTN neurons may relay the excitatory component of tonic reward prediction and phasic primary reward signals, and derive a new computational theory of the reward prediction error in DA neurons.
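
    In temporal-difference terms, the reward prediction error discussed here is the mismatch between what was received (plus discounted future value) and what was already predicted; a minimal sketch with made-up numbers:

    ```python
    GAMMA = 0.9    # discount factor (illustrative)

    def reward_prediction_error(reward, value_current, value_next, gamma=GAMMA):
        """TD-style prediction error: obtained reward plus discounted future value,
        minus the value already predicted for the current state."""
        return reward + gamma * value_next - value_current

    # A fully predicted reward yields roughly zero error; an unexpected one yields a positive error.
    print(reward_prediction_error(reward=1.0, value_current=1.0, value_next=0.0))   # 0.0
    print(reward_prediction_error(reward=1.0, value_current=0.0, value_next=0.0))   # 1.0
    ```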

  18. Democratic reinforcement: learning via self-organization

    SciTech Connect

    Stassinopoulos, D.; Bak, P.

    1995-12-31

    The problem of learning in the absence of external intelligence is discussed in the context of a simple model. The model consists of a set of randomly connected or layered integrate-and-fire neurons. Inputs to and outputs from the environment are connected randomly to subsets of neurons. The connections between firing neurons are strengthened or weakened according to whether the action is successful or not. The model departs from the traditional gradient-descent-based approaches to learning by operating at a highly susceptible "critical" state, with low activity and sparse connections between firing neurons. Quantitative studies on the performance of our model in a simple association task show that by tuning our system close to this critical state we can obtain dramatic gains in performance.

  19. Does temporal discounting explain unhealthy behavior? A systematic review and reinforcement learning perspective.

    PubMed

    Story, Giles W; Vlaev, Ivo; Seymour, Ben; Darzi, Ara; Dolan, Raymond J

    2014-01-01

    The tendency to make unhealthy choices is hypothesized to be related to an individual's temporal discount rate, the theoretical rate at which they devalue delayed rewards. Furthermore, a particular form of temporal discounting, hyperbolic discounting, has been proposed to explain why unhealthy behavior can occur despite healthy intentions. We examine these two hypotheses in turn. We first systematically review studies which investigate whether discount rates can predict unhealthy behavior. These studies reveal that high discount rates for money (and in some instances food or drug rewards) are associated with several unhealthy behaviors and markers of health status, establishing discounting as a promising predictive measure. We secondly examine whether intention-incongruent unhealthy actions are consistent with hyperbolic discounting. We conclude that intention-incongruent actions are often triggered by environmental cues or changes in motivational state, whose effects are not parameterized by hyperbolic discounting. We propose a framework for understanding these state-based effects in terms of the interplay of two distinct reinforcement learning mechanisms: a "model-based" (or goal-directed) system and a "model-free" (or habitual) system. Under this framework, while discounting of delayed health may contribute to the initiation of unhealthy behavior, with repetition, many unhealthy behaviors become habitual; if health goals then change, habitual behavior can still arise in response to environmental cues. We propose that the burgeoning development of computational models of these processes will permit further identification of health decision-making phenotypes. PMID:24659960
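
    The hyperbolic discounting referred to above is usually written V = A / (1 + kD), where A is the delayed amount, D the delay, and k the individual's discount rate. The numbers in the sketch below are invented, but they show how a steep discounter devalues a delayed health benefit relative to an immediate temptation:

    ```python
    def hyperbolic_value(amount, delay, k):
        """Subjective value of a reward of size `amount` delivered after `delay`,
        for an individual with discount rate `k`: V = A / (1 + k * D)."""
        return amount / (1.0 + k * delay)

    immediate_snack = hyperbolic_value(amount=2.0, delay=0, k=0.5)
    delayed_health_steep = hyperbolic_value(amount=10.0, delay=30, k=0.5)     # steep discounter
    delayed_health_shallow = hyperbolic_value(amount=10.0, delay=30, k=0.05)  # shallow discounter

    print(immediate_snack, round(delayed_health_steep, 2), round(delayed_health_shallow, 2))
    # 2.0 0.62 4.0 -> only the steep discounter prefers the immediate snack.
    ```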

  20. Does temporal discounting explain unhealthy behavior? A systematic review and reinforcement learning perspective

    PubMed Central

    Story, Giles W.; Vlaev, Ivo; Seymour, Ben; Darzi, Ara; Dolan, Raymond J.

    2014-01-01

    The tendency to make unhealthy choices is hypothesized to be related to an individual's temporal discount rate, the theoretical rate at which they devalue delayed rewards. Furthermore, a particular form of temporal discounting, hyperbolic discounting, has been proposed to explain why unhealthy behavior can occur despite healthy intentions. We examine these two hypotheses in turn. We first systematically review studies which investigate whether discount rates can predict unhealthy behavior. These studies reveal that high discount rates for money (and in some instances food or drug rewards) are associated with several unhealthy behaviors and markers of health status, establishing discounting as a promising predictive measure. We secondly examine whether intention-incongruent unhealthy actions are consistent with hyperbolic discounting. We conclude that intention-incongruent actions are often triggered by environmental cues or changes in motivational state, whose effects are not parameterized by hyperbolic discounting. We propose a framework for understanding these state-based effects in terms of the interplay of two distinct reinforcement learning mechanisms: a “model-based” (or goal-directed) system and a “model-free” (or habitual) system. Under this framework, while discounting of delayed health may contribute to the initiation of unhealthy behavior, with repetition, many unhealthy behaviors become habitual; if health goals then change, habitual behavior can still arise in response to environmental cues. We propose that the burgeoning development of computational models of these processes will permit further identification of health decision-making phenotypes. PMID:24659960

  1. Probabilistic reward- and punishment-based learning in opioid addiction: Experimental and computational data.

    PubMed

    Myers, Catherine E; Sheynin, Jony; Balsdon, Tarryn; Luzardo, Andre; Beck, Kevin D; Hogarth, Lee; Haber, Paul; Moustafa, Ahmed A

    2016-01-01

    Addiction is the continuation of a habit in spite of negative consequences. A vast literature gives evidence that this poor decision-making behavior in individuals addicted to drugs also generalizes to laboratory decision making tasks, suggesting that the impairment in decision-making is not limited to decisions about taking drugs. In the current experiment, opioid-addicted individuals and matched controls with no history of illicit drug use were administered a probabilistic classification task that embeds both reward-based and punishment-based learning trials, and a computational model of decision making was applied to understand the mechanisms describing individuals' performance on the task. Although behavioral results showed that opioid-addicted individuals performed as well as controls on both reward- and punishment-based learning, the modeling results suggested subtle differences in how decisions were made between the two groups. Specifically, the opioid-addicted group showed decreased tendency to repeat prior responses, meaning that they were more likely to "chase reward" when expectancies were violated, whereas controls were more likely to stick with a previously-successful response rule, despite occasional expectancy violations. This tendency to chase short-term reward, potentially at the expense of developing rules that maximize reward over the long term, may be a contributing factor to opioid addiction. Further work is indicated to better understand whether this tendency arises as a result of brain changes in the wake of continued opioid use/abuse, or might be a pre-existing factor that may contribute to risk for addiction. PMID:26381438

  2. Probabilistic reward- and punishment-based learning in opioid addiction: Experimental and computational data.

    PubMed

    Myers, Catherine E; Sheynin, Jony; Balsdon, Tarryn; Luzardo, Andre; Beck, Kevin D; Hogarth, Lee; Haber, Paul; Moustafa, Ahmed A

    2016-01-01

    Addiction is the continuation of a habit in spite of negative consequences. A vast literature gives evidence that this poor decision-making behavior in individuals addicted to drugs also generalizes to laboratory decision making tasks, suggesting that the impairment in decision-making is not limited to decisions about taking drugs. In the current experiment, opioid-addicted individuals and matched controls with no history of illicit drug use were administered a probabilistic classification task that embeds both reward-based and punishment-based learning trials, and a computational model of decision making was applied to understand the mechanisms describing individuals' performance on the task. Although behavioral results showed that opioid-addicted individuals performed as well as controls on both reward- and punishment-based learning, the modeling results suggested subtle differences in how decisions were made between the two groups. Specifically, the opioid-addicted group showed decreased tendency to repeat prior responses, meaning that they were more likely to "chase reward" when expectancies were violated, whereas controls were more likely to stick with a previously-successful response rule, despite occasional expectancy violations. This tendency to chase short-term reward, potentially at the expense of developing rules that maximize reward over the long term, may be a contributing factor to opioid addiction. Further work is indicated to better understand whether this tendency arises as a result of brain changes in the wake of continued opioid use/abuse, or might be a pre-existing factor that may contribute to risk for addiction.

  3. A reinforcement learning-based architecture for fuzzy logic control

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1992-01-01

    This paper introduces a new method for learning to refine a rule-based fuzzy logic controller. A reinforcement learning technique is used in conjunction with a multilayer neural network model of a fuzzy controller. The approximate reasoning based intelligent control (ARIC) architecture proposed here learns by updating its prediction of the physical system's behavior and fine tunes a control knowledge base. Its theory is related to Sutton's temporal difference (TD) method. Because ARIC has the advantage of using the control knowledge of an experienced operator and fine tuning it through the process of learning, it learns faster than systems that train networks from scratch. The approach is applied to a cart-pole balancing system.

  4. Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning

    PubMed Central

    Legenstein, Robert; Chase, Steven M.; Schwartz, Andrew B.; Maass, Wolfgang

    2011-01-01

    The control of neuroprosthetic devices from the activity of motor cortex neurons benefits from learning effects where the function of these neurons is adapted to the control task. It was recently shown that tuning properties of neurons in monkey motor cortex are adapted selectively in order to compensate for an erroneous interpretation of their activity. In particular, it was shown that the tuning curves of those neurons whose preferred directions had been misinterpreted changed more than those of other neurons. In this article, we show that the experimentally observed self-tuning properties of the system can be explained on the basis of a simple learning rule. This learning rule utilizes neuronal noise for exploration and performs Hebbian weight updates that are modulated by a global reward signal. In contrast to most previously proposed reward-modulated Hebbian learning rules, this rule does not require extraneous knowledge about what is noise and what is signal. The learning rule is able to optimize the performance of the model system within biologically realistic periods of time and under high noise levels. When the neuronal noise is fitted to experimental data, the model produces learning effects similar to those found in monkey experiments. PMID:25284966
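
    A rough sketch of a reward-modulated Hebbian rule in this family: exploratory noise perturbs the output, and each weight changes in proportion to presynaptic activity, the output fluctuation, and a baseline-corrected global reward. Note that the rule in the paper is designed to avoid an explicit noise term; the version below keeps the noise explicit for brevity, and all sizes and rates are invented:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    N_IN, ETA, NOISE_STD = 5, 0.005, 0.1            # illustrative sizes and rates

    w = 0.1 * rng.standard_normal(N_IN)             # readout weights being learned
    target_w = rng.standard_normal(N_IN)            # unknown "correct" readout defining the task
    reward_baseline = 0.0

    for trial in range(50000):
        x = rng.standard_normal(N_IN)               # presynaptic activity
        noise = rng.normal(0.0, NOISE_STD)          # exploratory fluctuation of the output
        y = w @ x + noise
        reward = -(y - target_w @ x) ** 2           # global scalar reward: closer output is better

        # Hebbian update modulated by how much better than usual this trial's reward was.
        w += ETA * (reward - reward_baseline) * noise * x
        reward_baseline += 0.1 * (reward - reward_baseline)

    print(round(float(np.mean((w - target_w) ** 2)), 3))   # squared weight error shrinks with training
    ```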

  5. Deep Brain Stimulation of the Subthalamic Nucleus Improves Reward-Based Decision-Learning in Parkinson's Disease

    PubMed Central

    van Wouwe, Nelleke C.; Ridderinkhof, K. R.; van den Wildenberg, W. P. M.; Band, G. P. H.; Abisogun, A.; Elias, W. J.; Frysinger, R.; Wylie, S. A.

    2011-01-01

    Recently, the subthalamic nucleus (STN) has been shown to be critically involved in decision-making, action selection, and motor control. Here we investigate the effect of deep brain stimulation (DBS) of the STN on reward-based decision-learning in patients diagnosed with Parkinson's disease (PD). We determined computational measures of outcome evaluation and reward prediction from PD patients who performed a probabilistic reward-based decision-learning task. In previous work, these measures covaried with activation in the nucleus caudatus (outcome evaluation during the early phases of learning) and the putamen (reward prediction during later phases of learning). We observed that stimulation of the STN motor regions in PD patients served to improve reward-based decision-learning, probably through its effect on activity in frontostriatal motor loops (prominently involving the putamen and, hence, reward prediction). In a subset of relatively younger patients with relatively shorter disease duration, the effects of DBS appeared to spread to more cognitive regions of the STN, benefiting loops that connect the caudate to various prefrontal areas important for outcome evaluation. These results highlight positive effects of STN stimulation on cognitive functions that may benefit PD patients in daily-life association-learning situations. PMID:21519377

  6. Measuring anhedonia: impaired ability to pursue, experience, and learn about reward.

    PubMed

    Thomsen, Kristine Rømer

    2015-01-01

    Ribot's (1896) long standing definition of anhedonia as "the inability to experience pleasure" has been challenged recently following progress in affective neuroscience. In particular, accumulating evidence suggests that reward consists of multiple subcomponents of wanting, liking and learning, as initially outlined by Berridge and Robinson (2003), and these processes have been proposed to relate to appetitive, consummatory and satiety phases of a pleasure cycle. Building on this work, we recently proposed to reconceptualize anhedonia as "impairments in the ability to pursue, experience, and/or learn about pleasure, which is often, but not always accessible to conscious awareness." (Rømer Thomsen et al., 2015). This framework is in line with Treadway and Zald's (2011) proposal to differentiate between motivational and consummatory types of anhedonia, and stresses the need to combine traditional self-report measures with behavioral measures or procedures. In time, this approach may lead to improved clinical assessment and treatment. In line with our reconceptualization, increasing evidence suggests that reward processing deficits are not restricted to impaired hedonic impact in major psychiatric disorders. Successful translations of animal models have led to strong evidence of impairments in the ability to pursue and learn about reward in psychiatric disorders such as major depressive disorder, schizophrenia, and addiction. It is of high importance that we continue to systematically target impairments in all phases of reward processing across disorders using behavioral testing in combination with neuroimaging techniques. This in turn has implications for diagnosis and treatment, and is essential for the purposes of identifying the underlying neurobiological mechanisms. Here I review recent progress in the development and application of behavioral procedures that measure subcomponents of anhedonia across relevant patient groups, and discuss methodological caveats as

  7. Measuring anhedonia: impaired ability to pursue, experience, and learn about reward

    PubMed Central

    Thomsen, Kristine Rømer

    2015-01-01

    Ribot’s (1896) long standing definition of anhedonia as “the inability to experience pleasure” has been challenged recently following progress in affective neuroscience. In particular, accumulating evidence suggests that reward consists of multiple subcomponents of wanting, liking and learning, as initially outlined by Berridge and Robinson (2003), and these processes have been proposed to relate to appetitive, consummatory and satiety phases of a pleasure cycle. Building on this work, we recently proposed to reconceptualize anhedonia as “impairments in the ability to pursue, experience, and/or learn about pleasure, which is often, but not always accessible to conscious awareness.” (Rømer Thomsen et al., 2015). This framework is in line with Treadway and Zald’s (2011) proposal to differentiate between motivational and consummatory types of anhedonia, and stresses the need to combine traditional self-report measures with behavioral measures or procedures. In time, this approach may lead to improved clinical assessment and treatment. In line with our reconceptualization, increasing evidence suggests that reward processing deficits are not restricted to impaired hedonic impact in major psychiatric disorders. Successful translations of animal models have led to strong evidence of impairments in the ability to pursue and learn about reward in psychiatric disorders such as major depressive disorder, schizophrenia, and addiction. It is of high importance that we continue to systematically target impairments in all phases of reward processing across disorders using behavioral testing in combination with neuroimaging techniques. This in turn has implications for diagnosis and treatment, and is essential for the purposes of identifying the underlying neurobiological mechanisms. Here I review recent progress in the development and application of behavioral procedures that measure subcomponents of anhedonia across relevant patient groups, and discuss

  8. Towards autonomous neuroprosthetic control using Hebbian reinforcement learning

    NASA Astrophysics Data System (ADS)

    Mahmoudi, Babak; Pohlmeyer, Eric A.; Prins, Noeline W.; Geng, Shijia; Sanchez, Justin C.

    2013-12-01

    Objective. Our goal was to design an adaptive neuroprosthetic controller that could learn the mapping from neural states to prosthetic actions and automatically adjust adaptation using only a binary evaluative feedback as a measure of desirability/undesirability of performance. Approach. Hebbian reinforcement learning (HRL) in a connectionist network was used for the design of the adaptive controller. The method combines the efficiency of supervised learning with the generality of reinforcement learning. The convergence properties of this approach were studied using both closed-loop control simulations and open-loop simulations that used primate neural data from robot-assisted reaching tasks. Main results. The HRL controller was able to perform classification and regression tasks using its episodic and sequential learning modes, respectively. In our experiments, the HRL controller quickly achieved convergence to an effective control policy, followed by robust performance. The controller also automatically stopped adapting the parameters after converging to a satisfactory control policy. Additionally, when the input neural vector was reorganized, the controller resumed adaptation to maintain performance. Significance. By estimating an evaluative feedback directly from the user, the HRL control algorithm may provide an efficient method for autonomous adaptation of neuroprosthetic systems. This method may enable the user to teach the controller the desired behavior using only a simple feedback signal.

  9. Learning and tuning fuzzy logic controllers through reinforcements

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1992-01-01

    A new method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system is presented. In particular, our Generalized Approximate Reasoning-based Intelligent Control (GARIC) architecture: (1) learns and tunes a fuzzy logic controller even when only weak reinforcement, such as a binary failure signal, is available; (2) introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; (3) introduces a new localized mean of maximum (LMOM) method in combining the conclusions of several firing control rules; and (4) learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward network, which can then adaptively improve performance by using gradient descent methods. We extend the AHC algorithm of Barto, Sutton, and Anderson to include the prior control knowledge of human operators. The GARIC architecture is applied to a cart-pole balancing system and has demonstrated significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

  10. Learning and tuning fuzzy logic controllers through reinforcements

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1992-01-01

    This paper presents a new method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system. In particular, our generalized approximate reasoning-based intelligent control (GARIC) architecture (1) learns and tunes a fuzzy logic controller even when only weak reinforcement, such as a binary failure signal, is available; (2) introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; (3) introduces a new localized mean of maximum (LMOM) method in combining the conclusions of several firing control rules; and (4) learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward neural network, which can then adaptively improve performance by using gradient descent methods. We extend the AHC algorithm of Barto et al. (1983) to include the prior control knowledge of human operators. The GARIC architecture is applied to a cart-pole balancing system and demonstrates significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

  11. Learning to maximize reward rate: a model based on semi-Markov decision processes.

    PubMed

    Khodadadi, Arash; Fakhari, Pegah; Busemeyer, Jerome R

    2014-01-01

    When animals have to make a number of decisions during a limited time interval, they face a fundamental problem: how much time they should spend on each decision in order to achieve the maximum possible total outcome. Deliberating more on one decision usually leads to more outcome but less time will remain for other decisions. In the framework of sequential sampling models, the question is how animals learn to set their decision threshold such that the total expected outcome achieved during a limited time is maximized. The aim of this paper is to provide a theoretical framework for answering this question. To this end, we consider an experimental design in which each trial can come from one of the several possible "conditions." A condition specifies the difficulty of the trial, the reward, the penalty and so on. We show that to maximize the expected reward during a limited time, the subject should set a separate value of decision threshold for each condition. We propose a model of learning the optimal value of decision thresholds based on the theory of semi-Markov decision processes (SMDP). In our model, the experimental environment is modeled as an SMDP with each "condition" being a "state" and the value of decision thresholds being the "actions" taken in those states. The problem of finding the optimal decision thresholds then is cast as the stochastic optimal control problem of taking actions in each state in the corresponding SMDP such that the average reward rate is maximized. Our model utilizes a biologically plausible learning algorithm to solve this problem. The simulation results show that at the beginning of learning the model chooses high values of decision threshold which lead to sub-optimal performance. With experience, however, the model learns to lower the value of decision thresholds till finally it finds the optimal values.
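
    The quantity being optimised here is reward per unit time rather than reward per trial. A toy sketch of learning a separate threshold per condition by tracking empirical reward rates is given below; the timing model, difficulty values, and the simple epsilon-greedy search are invented for illustration and are not the paper's SMDP algorithm:

    ```python
    import random

    def simulate_trial(threshold, difficulty):
        """Toy trial: a higher threshold means slower but more accurate decisions (assumed)."""
        decision_time = 0.5 + threshold
        p_correct = 1.0 - difficulty / (1.0 + threshold)
        correct = random.random() < p_correct
        reward = 1.0 if correct else 0.0
        total_time = decision_time + (0.0 if correct else 3.0)   # errors incur a timeout
        return reward, total_time

    CONDITIONS = {'easy': 0.2, 'hard': 0.6}          # difficulty per condition (illustrative)
    THRESHOLDS = [0.2, 0.6, 1.0, 1.4, 1.8]

    # Track cumulative reward and elapsed time per (condition, threshold) and prefer, within each
    # condition, the threshold with the best empirical reward rate.
    stats = {(c, th): [0.0, 1e-6] for c in CONDITIONS for th in THRESHOLDS}

    def rate(condition, threshold):
        reward_sum, time_sum = stats[(condition, threshold)]
        return reward_sum / time_sum

    for _ in range(30000):
        condition = random.choice(list(CONDITIONS))
        if random.random() < 0.1:
            th = random.choice(THRESHOLDS)           # occasional exploration
        else:
            th = max(THRESHOLDS, key=lambda t: rate(condition, t))
        r, dt = simulate_trial(th, CONDITIONS[condition])
        stats[(condition, th)][0] += r
        stats[(condition, th)][1] += dt

    print({c: max(THRESHOLDS, key=lambda t: rate(c, t)) for c in CONDITIONS})
    ```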

  12. Learning to maximize reward rate: a model based on semi-Markov decision processes

    PubMed Central

    Khodadadi, Arash; Fakhari, Pegah; Busemeyer, Jerome R.

    2014-01-01

    When animals have to make a number of decisions during a limited time interval, they face a fundamental problem: how much time they should spend on each decision in order to achieve the maximum possible total outcome. Deliberating more on one decision usually leads to more outcome but less time will remain for other decisions. In the framework of sequential sampling models, the question is how animals learn to set their decision threshold such that the total expected outcome achieved during a limited time is maximized. The aim of this paper is to provide a theoretical framework for answering this question. To this end, we consider an experimental design in which each trial can come from one of the several possible “conditions.” A condition specifies the difficulty of the trial, the reward, the penalty and so on. We show that to maximize the expected reward during a limited time, the subject should set a separate value of decision threshold for each condition. We propose a model of learning the optimal value of decision thresholds based on the theory of semi-Markov decision processes (SMDP). In our model, the experimental environment is modeled as an SMDP with each “condition” being a “state” and the value of decision thresholds being the “actions” taken in those states. The problem of finding the optimal decision thresholds then is cast as the stochastic optimal control problem of taking actions in each state in the corresponding SMDP such that the average reward rate is maximized. Our model utilizes a biologically plausible learning algorithm to solve this problem. The simulation results show that at the beginning of learning the model chooses high values of decision threshold which lead to sub-optimal performance. With experience, however, the model learns to lower the value of decision thresholds till finally it finds the optimal values. PMID:24904252

  13. Learning from experience: event-related potential correlates of reward processing, neural adaptation, and behavioral choice.

    PubMed

    Walsh, Matthew M; Anderson, John R

    2012-09-01

    To behave adaptively, we must learn from the consequences of our actions. Studies using event-related potentials (ERPs) have been informative with respect to the question of how such learning occurs. These studies have revealed a frontocentral negativity termed the feedback-related negativity (FRN) that appears after negative feedback. According to one prominent theory, the FRN tracks the difference between the values of actual and expected outcomes, or reward prediction errors. As such, the FRN provides a tool for studying reward valuation and decision making. We begin this review by examining the neural significance of the FRN. We then examine its functional significance. To understand the cognitive processes that occur when the FRN is generated, we explore variables that influence its appearance and amplitude. Specifically, we evaluate four hypotheses: (1) the FRN encodes a quantitative reward prediction error; (2) the FRN is evoked by outcomes and by stimuli that predict outcomes; (3) the FRN and behavior change with experience; and (4) the system that produces the FRN is maximally engaged by volitional actions.

  14. Multiagent reinforcement learning in the Iterated Prisoner's Dilemma.

    PubMed

    Sandholm, T W; Crites, R H

    1996-01-01

    Reinforcement learning (RL) is based on the idea that the tendency to produce an action should be strengthened (reinforced) if it produces favorable results, and weakened if it produces unfavorable results. Q-learning is a recent RL algorithm that does not need a model of its environment and can be used on-line. Therefore, it is well suited for use in repeated games against an unknown opponent. Most RL research has been confined to single-agent settings or to multiagent settings where the agents have totally positively correlated payoffs (team problems) or totally negatively correlated payoffs (zero-sum games). This paper is an empirical study of reinforcement learning in the Iterated Prisoner's Dilemma (IPD), where the agents' payoffs are neither totally positively nor totally negatively correlated. RL is considerably more difficult in such a domain. This paper investigates the ability of a variety of Q-learning agents to play the IPD game against an unknown opponent. In some experiments, the opponent is the fixed strategy Tit-For-Tat, while in others it is another Q-learner. All the Q-learners learned to play optimally against Tit-For-Tat. Playing against another learner was more difficult because the adaptation of the other learner created a non-stationary environment, and because the other learner was not endowed with any a priori knowledge about the IPD game such as a policy designed to encourage cooperation. The learners that were studied varied along three dimensions: the length of history they received as context, the type of memory they employed (lookup tables based on restricted history windows or recurrent neural networks that can theoretically store features from arbitrarily deep in the past), and the exploration schedule they followed. Although all the learners faced difficulties when playing against other learners, agents with longer history windows, lookup table memories, and longer exploration schedules fared best in the IPD games.
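
    A compact sketch of the shortest-history case described above, a Q-learner whose state is just the opponent's previous move, playing against Tit-for-Tat; the payoff matrix and learning parameters are chosen for illustration:

    ```python
    import random

    # Row player's prisoner's dilemma payoffs (C = cooperate, D = defect), chosen for illustration.
    PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
    MOVES = ('C', 'D')
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

    # State = opponent's previous move; one Q-value per (state, action) pair in a lookup table.
    Q = {(s, a): 0.0 for s in MOVES for a in MOVES}

    opp_prev, my_prev = 'C', 'C'
    for _ in range(50000):
        state = opp_prev
        if random.random() < EPSILON:
            action = random.choice(MOVES)                       # exploration
        else:
            action = max(MOVES, key=lambda a: Q[(state, a)])

        opp_action = my_prev                                    # Tit-for-Tat copies our last move
        reward = PAYOFF[(action, opp_action)]
        next_state = opp_action

        Q[(state, action)] += ALPHA * (
            reward + GAMMA * max(Q[(next_state, b)] for b in MOVES) - Q[(state, action)]
        )
        opp_prev, my_prev = opp_action, action

    print({k: round(v, 1) for k, v in Q.items()})   # learned action values after training
    ```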

  15. Preliminary Work for Examining the Scalability of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Clouse, Jeff

    1998-01-01

    Researchers began studying automated agents that learn to perform multiple-step tasks early in the history of artificial intelligence (Samuel, 1963; Samuel, 1967; Waterman, 1970; Fikes, Hart & Nilsonn, 1972). Multiple-step tasks are tasks that can only be solved via a sequence of decisions, such as control problems, robotics problems, classic problem-solving, and game-playing. The objective of agents attempting to learn such tasks is to use the resources they have available in order to become more proficient at the tasks. In particular, each agent attempts to develop a good policy, a mapping from states to actions, that allows it to select actions that optimize a measure of its performance on the task; for example, reducing the number of steps necessary to complete the task successfully. Our study focuses on reinforcement learning, a set of learning techniques where the learner performs trial-and-error experiments in the task and adapts its policy based on the outcome of those experiments. Much of the work in reinforcement learning has focused on a particular, simple representation, where every problem state is represented explicitly in a table, and associated with each state are the actions that can be chosen in that state. A major advantage of this table lookup representation is that one can prove that certain reinforcement learning techniques will develop an optimal policy for the current task. The drawback is that the representation limits the application of reinforcement learning to multiple-step tasks with relatively small state-spaces. There has been a little theoretical work that proves that convergence to optimal solutions can be obtained when using generalization structures, but the structures are quite simple. The