Science.gov

Sample records for reward reinforcement learning

  1. Reward, motivation, and reinforcement learning.

    PubMed

    Dayan, Peter; Balleine, Bernard W

    2002-10-10

    There is substantial evidence that dopamine is involved in reward learning and appetitive conditioning. However, the major reinforcement learning-based theoretical models of classical conditioning (crudely, prediction learning) are actually based on rules designed to explain instrumental conditioning (action learning). Extensive anatomical, pharmacological, and psychological data, particularly concerning the impact of motivational manipulations, show that these models are unreasonable. We review the data and consider the involvement of a rich collection of different neural systems in various aspects of these forms of conditioning. Dopamine plays a pivotal, but complicated, role.

  2. Online learning of shaping rewards in reinforcement learning.

    PubMed

    Grześ, Marek; Kudenko, Daniel

    2010-05-01

    Potential-based reward shaping has been shown to be a powerful method to improve the convergence rate of reinforcement learning agents. It is a flexible technique for incorporating background knowledge into temporal-difference learning in a principled way. However, the question remains of how to compute the potential function that is used to shape the reward given to the learning agent. In this paper, we show how, in the absence of knowledge to define the potential function manually, this function can be learned online in parallel with the actual reinforcement learning process. Two cases are considered. The first solution, based on multi-grid discretisation, is designed for model-free reinforcement learning. In the second case, an approach for the prototypical model-based R-max algorithm is proposed; it learns the potential function using the free-space assumption about the transitions in the environment. Two novel algorithms are presented and evaluated.
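
    For illustration, here is a minimal sketch (not from the paper) of potential-based shaping inside tabular Q-learning, assuming a generic discrete environment. The potential function phi below is a hand-supplied stand-in for the one the authors learn online, and all names and constants are illustrative:

      from collections import defaultdict

      ACTIONS = [0, 1]

      def shaped_q_update(Q, phi, s, a, r, s_next, alpha=0.1, gamma=0.99):
          """One Q-learning step with potential-based shaping F = gamma*phi(s') - phi(s)."""
          F = gamma * phi(s_next) - phi(s)    # shaping term; leaves the optimal policy unchanged
          target = r + F + gamma * max(Q[(s_next, b)] for b in ACTIONS)
          Q[(s, a)] += alpha * (target - Q[(s, a)])

      Q = defaultdict(float)
      phi = lambda s: -abs(10 - s)            # toy potential: states near 10 are promising
      shaped_q_update(Q, phi, s=3, a=1, r=0.0, s_next=4)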

  3. Balancing Multiple Sources of Reward in Reinforcement Learning

    DTIC Science & Technology

    2006-01-01

    For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar...problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.

  4. Finding intrinsic rewards by embodied evolution and constrained reinforcement learning.

    PubMed

    Uchibe, Eiji; Doya, Kenji

    2008-12-01

    Understanding the design principle of reward functions is a substantial challenge both in artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals, but also for intermediate states to promote effective exploration. This paper proposes a method for designing 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and 'mating' by software reproduction are three major 'extrinsic' rewards. We show in hardware experiments that the robots can find appropriate 'intrinsic' rewards for the vision of battery packs and other robots to promote approach behaviors.

  5. Reward and reinforcement activity in the nucleus accumbens during learning

    PubMed Central

    Gale, John T.; Shields, Donald C.; Ishizawa, Yumiko; Eskandar, Emad N.

    2014-01-01

    The nucleus accumbens core (NAcc) has been implicated in learning associations between sensory cues and profitable motor responses. However, the precise mechanisms that underlie these functions remain unclear. We recorded single-neuron activity from the NAcc of primates trained to perform a visual-motor associative learning task. During learning, we found two distinct classes of NAcc neurons. The first class demonstrated progressive increases in firing rates at the go-cue, feedback/tone and reward epochs of the task, as novel associations were learned. This suggests that these neurons may play a role in the exploitation of rewarding behaviors. In contrast, the second class exhibited attenuated firing rates, but only at the reward epoch of the task. These findings suggest that some NAcc neurons play a role in reward-based reinforcement during learning. PMID:24765069

  6. Optimal Reward Functions in Distributed Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Wolpert, David H.; Tumer, Kagan

    2000-01-01

    We consider the design of multi-agent systems so as to optimize an overall world utility function when (1) those systems lack centralized communication and control, and (2) each agent runs a distinct Reinforcement Learning (RL) algorithm. A crucial issue in such design problems is how to initialize/update each agent's private utility function so as to induce the best possible world utility. Traditional 'team game' solutions to this problem sidestep this issue and simply assign to each agent the world utility as its private utility function. In previous work we used the 'Collective Intelligence' framework to derive a better choice of private utility functions, one that results in world utility performance up to orders of magnitude superior to that ensuing from use of the team game utility. In this paper we extend these results. We derive the general class of private utility functions that both are easy for the individual agents to learn and that, if learned well, result in high world utility. We demonstrate experimentally that using these new utility functions can result in significantly improved performance over that of our previously proposed utility, over and above that previous utility's superiority to the conventional team game utility.
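
    As a rough illustration of the idea (the paper's exact utility definitions are not reproduced here), a difference-style private utility credits each agent with its marginal contribution to the world utility by clamping that agent's action to a fixed default. The world utility G and the clamping scheme below are hypothetical:

      def world_utility(actions):
          # Hypothetical world utility G: reward coverage of distinct actions.
          return len(set(actions))

      def difference_utility(actions, i, default=0):
          """Private utility for agent i: G(z) minus G(z with agent i clamped).
          Aligned with G, but far more sensitive to agent i's own action."""
          clamped = list(actions)
          clamped[i] = default
          return world_utility(actions) - world_utility(clamped)

      joint = [2, 2, 1]   # actions of three agents
      print([difference_utility(joint, i) for i in range(3)])   # [-1, -1, 0]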

  7. Inferring reward prediction errors in patients with schizophrenia: a dynamic reward task for reinforcement learning.

    PubMed

    Li, Chia-Tzu; Lai, Wen-Sung; Liu, Chih-Min; Hsu, Yung-Fong

    2014-01-01

    Abnormalities in the dopamine system have long been implicated in explanations of reinforcement learning and psychosis. The updated reward prediction error (RPE)-a discrepancy between the predicted and actual rewards-is thought to be encoded by dopaminergic neurons. Dysregulation of dopamine systems could alter the appraisal of stimuli and eventually lead to schizophrenia. Accordingly, the measurement of RPE provides a potential behavioral index for the evaluation of brain dopamine activity and psychotic symptoms. Here, we assess two features potentially crucial to the RPE process, namely belief formation and belief perseveration, via a probability learning task and reinforcement-learning modeling. Forty-five patients with schizophrenia [26 high-psychosis and 19 low-psychosis, based on their p1 and p3 scores in the positive-symptom subscales of the Positive and Negative Syndrome Scale (PANSS)] and 24 controls were tested in a feedback-based dynamic reward task for their RPE-related decision making. While task scores across the three groups were similar, matching law analysis revealed that the reward sensitivities of both psychosis groups were lower than that of controls. Trial-by-trial data were further fit with a reinforcement learning model using the Bayesian estimation approach. Model fitting results indicated that both psychosis groups tend to update their reward values more rapidly than controls. Moreover, among the three groups, high-psychosis patients had the lowest degree of choice perseveration. Lumping patients' data together, we also found that patients' perseveration appears to be negatively correlated (p = 0.09, trending toward significance) with their PANSS p1 + p3 scores. Our method provides an alternative for investigating reward-related learning and decision making in basic and clinical settings.
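
    A minimal sketch of the kind of trial-by-trial model such studies fit (the parameterization here is illustrative, not the paper's): a delta-rule value update with a learning rate, plus a perseveration weight that biases the softmax toward repeating the previous choice:

      import math, random

      def simulate(trials, alpha=0.3, beta=3.0, persev=0.5, p_reward=(0.7, 0.3)):
          """Delta-rule learner with choice perseveration (illustrative parameters)."""
          Q, last = [0.0, 0.0], None
          for _ in range(trials):
              # Softmax over value plus a bonus for repeating the previous choice.
              u = [beta * Q[a] + (persev if a == last else 0.0) for a in (0, 1)]
              p0 = 1.0 / (1.0 + math.exp(u[1] - u[0]))
              a = 0 if random.random() < p0 else 1
              r = 1.0 if random.random() < p_reward[a] else 0.0
              Q[a] += alpha * (r - Q[a])      # reward prediction error update
              last = a
          return Q

      print(simulate(200))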

  8. An Upside to Reward Sensitivity: The Hippocampus Supports Enhanced Reinforcement Learning in Adolescence.

    PubMed

    Davidow, Juliet Y; Foerde, Karin; Galván, Adriana; Shohamy, Daphna

    2016-10-05

    Adolescents are notorious for engaging in reward-seeking behaviors, a tendency attributed to heightened activity in the brain's reward systems during adolescence. It has been suggested that reward sensitivity in adolescence might be adaptive, but evidence of an adaptive role has been scarce. Using a probabilistic reinforcement learning task combined with reinforcement learning models and fMRI, we found that adolescents showed better reinforcement learning and a stronger link between reinforcement learning and episodic memory for rewarding outcomes. This behavioral benefit was related to heightened prediction error-related BOLD activity in the hippocampus and to stronger functional connectivity between the hippocampus and the striatum at the time of reinforcement. These findings reveal an important role for the hippocampus in reinforcement learning in adolescence and suggest that reward sensitivity in adolescence is related to adaptive differences in how adolescents learn from experience.

  9. Beyond simple reinforcement learning: the computational neurobiology of reward-learning and valuation.

    PubMed

    O'Doherty, John P

    2012-04-01

    Neural computational accounts of reward-learning have been dominated by the hypothesis that dopamine neurons behave like a reward-prediction error and thus facilitate reinforcement learning in striatal target neurons. While this framework is consistent with a lot of behavioral and neural evidence, this theory fails to account for a number of behavioral and neurobiological observations. In this special issue of EJN we feature a combination of theoretical and experimental papers highlighting some of the explanatory challenges faced by simple reinforcement-learning models and describing some of the ways in which the framework is being extended in order to address these challenges.

  10. Framing Reinforcement Learning from Human Reward: Reward Positivity, Temporal Discounting, Episodicity, and Performance

    DTIC Science & Technology

    2014-09-29

    maximize the learning objective, how is task performance affected and what implications do these effects on task performance have...will enhance the effectiveness of teaching by human reward...Our results provide evidence for the incompatibility of non-myopic learning and...Section 1, this article examines the effect of various objectives for learning from human reward on task performance. In particular, we focus on reward...

  11. Homeostatic reinforcement learning for integrating reward collection and physiological stability.

    PubMed

    Keramati, Mehdi; Gutkin, Boris

    2014-12-02

    Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system.
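
    The core idea can be sketched in a few lines (an illustrative reading, not the paper's full theory): define the reward of an outcome as the reduction in distance between the internal state and its homeostatic setpoint, so that seeking reward and defending homeostasis coincide. All names and numbers are illustrative:

      def drive(h, setpoint=(50.0, 50.0)):
          """Drive = distance of the internal state h from the homeostatic setpoint."""
          return sum((hi - si) ** 2 for hi, si in zip(h, setpoint)) ** 0.5

      def homeostatic_reward(h_before, h_after):
          """Reward of an outcome = how much it reduces the drive (need fulfillment)."""
          return drive(h_before) - drive(h_after)

      # Consuming when an internal variable is low is rewarding; overshooting is punishing.
      print(homeostatic_reward((30.0, 50.0), (45.0, 50.0)))   # 15.0 > 0
      print(homeostatic_reward((50.0, 50.0), (65.0, 50.0)))   # -15.0 < 0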

  12. When, What, and How Much to Reward in Reinforcement Learning-Based Models of Cognition

    ERIC Educational Resources Information Center

    Janssen, Christian P.; Gray, Wayne D.

    2012-01-01

    Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing a reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward; that is, when (the moment: end of trial, subtask, or some other…

  13. Computational models of reinforcement learning: the role of dopamine as a reward signal

    PubMed Central

    Samson, R. D.; Frank, M. J.

    2010-01-01

    Reinforcement learning is ubiquitous. Unlike other forms of learning, it involves the processing of fast yet content-poor feedback information to correct assumptions about the nature of a task or of a set of stimuli. This feedback information is often delivered as generic rewards or punishments, and has little to do with the stimulus features to be learned. How can such low-content feedback lead to such an efficient learning paradigm? Through a review of existing neuro-computational models of reinforcement learning, we suggest that the efficiency of this type of learning resides in the dynamic and synergistic cooperation of brain systems that use different levels of computation. The implementation of reward signals at the synaptic, cellular, network, and system levels gives the organism the necessary robustness, adaptability, and processing speed required for evolutionary and behavioral success. PMID:21629583

  14. Does reward frequency or magnitude drive reinforcement-learning in attention-deficit/hyperactivity disorder?

    PubMed

    Luman, Marjolein; Van Meel, Catharina S; Oosterlaan, Jaap; Sergeant, Joseph A; Geurts, Hilde M

    2009-08-15

    Children with attention-deficit/hyperactivity disorder (ADHD) show an impaired ability to use feedback in the context of learning. A stimulus-response learning task was used to investigate whether (1) children with ADHD displayed flatter learning curves, (2) reinforcement-learning in ADHD was sensitive to either reward frequency, magnitude, or both, and (3) altered sensitivity to reward was specific to ADHD or would co-occur in a group of children with autism spectrum disorder (ASD). Performance of 23 boys with ADHD was compared with that of 30 normal controls (NCs) and 21 boys with ASD, all aged 8-12. Rewards were delivered contingent on performance and varied both in frequency (low, high) and magnitude (small, large). The findings showed that, although learning rates were comparable across groups, both clinical groups committed more errors than NCs. In contrast to the NC boys, boys with ADHD were unaffected by frequency and magnitude of reward. The NC group and, to some extent, the ASD group showed improved performance, when rewards were delivered infrequently versus frequently. Children with ADHD as well as children with ASD displayed difficulties in stimulus-response coupling that were independent of motivational modulations. Possibly, these deficits are related to abnormal reinforcement expectancy.

  15. Toward an autonomous brain machine interface: integrating sensorimotor reward modulation and reinforcement learning.

    PubMed

    Marsh, Brandi T; Tarigoppula, Venkata S Aditya; Chen, Chen; Francis, Joseph T

    2015-05-13

    For decades, neurophysiologists have worked on elucidating the function of the cortical sensorimotor control system from the standpoint of kinematics or dynamics. Recently, computational neuroscientists have developed models that can emulate changes seen in the primary motor cortex during learning. However, these simulations rely on the existence of a reward-like signal in the primary sensorimotor cortex. Reward modulation of the primary sensorimotor cortex has yet to be characterized at the level of neural units. Here we demonstrate that single units/multiunits and local field potentials in the primary motor (M1) cortex of nonhuman primates (Macaca radiata) are modulated by reward expectation during reaching movements and that this modulation is present even while subjects passively view cursor motions that are predictive of either reward or nonreward. After establishing this reward modulation, we set out to determine whether we could correctly classify rewarding versus nonrewarding trials, on a moment-to-moment basis. This reward information could then be used in collaboration with reinforcement learning principles toward an autonomous brain-machine interface. The autonomous brain-machine interface would use M1 for both decoding movement intention and extraction of reward expectation information as evaluative feedback, which would then update the decoding algorithm as necessary. In the work presented here, we show that this, in theory, is possible.

  16. Adaptive Design of Role Differentiation by Division of Reward Function in Multi-Agent Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Taniguchi, Tadahiro; Tabuchi, Kazuma; Sawaragi, Tetsuo

    There are several problems that discourage an organization from achieving tasks in multi-agent reinforcement learning, e.g., partial observability, credit assignment, and concurrent learning. In many conventional approaches, each agent estimates hidden states, e.g., sensor inputs, positions, and policies of other agents, and reduces the uncertainty in the partially observable Markov decision process (POMDP); this only partially solves the multi-agent reinforcement learning problem. In contrast, people in real-world organizations reduce uncertainty by autonomously dividing the roles played by individual agents. In a reinforcement learning framework, roles are mainly represented by goals for individual agents. This paper presents a method for generating internal rewards from manager agents to worker agents. It also explicitly divides the roles, which enables the POMDP task for each agent to be transformed into a simple MDP task under certain conditions. Several situational experiments are described and the validity of the proposed method is evaluated.

  17. The left hemisphere learns what is right: Hemispatial reward learning depends on reinforcement learning processes in the contralateral hemisphere.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly Crystal; Schwartz, Sophie

    2016-08-01

    Orienting biases refer to consistent, trait-like direction of attention or locomotion toward one side of space. Recent studies suggest that such hemispatial biases may determine how well people memorize information presented in the left or right hemifield. Moreover, lesion studies indicate that learning rewarded stimuli in one hemispace depends on the integrity of the contralateral striatum. However, the exact neural and computational mechanisms underlying the influence of individual orienting biases on reward learning remain unclear. Because reward-based behavioural adaptation depends on the dopaminergic system and prediction error (PE) encoding in the ventral striatum, we hypothesized that hemispheric asymmetries in dopamine (DA) function may determine individual spatial biases in reward learning. To test this prediction, we acquired fMRI in 33 healthy human participants while they performed a lateralized reward task. Learning differences between hemispaces were assessed by presenting stimuli, assigned to different reward probabilities, to the left or right of central fixation, i.e. presented in the left or right visual hemifield. Hemispheric differences in DA function were estimated through differential fMRI responses to positive vs. negative feedback in the left vs. right ventral striatum, and a computational approach was used to identify the neural correlates of PEs. Our results show that spatial biases favoring reward learning in the right (vs. left) hemifield were associated with increased reward responses in the left hemisphere and relatively better neural encoding of PEs for stimuli presented in the right (vs. left) hemifield. These findings demonstrate that trait-like spatial biases implicate hemisphere-specific learning mechanisms, with individual differences between hemispheres contributing to reinforcing spatial biases.

  1. Reward-weighted regression with sample reuse for direct policy search in reinforcement learning.

    PubMed

    Hachiya, Hirotaka; Peters, Jan; Sugiyama, Masashi

    2011-11-01

    Direct policy search is a promising reinforcement learning framework, in particular for controlling continuous, high-dimensional systems. Policy search often requires a large number of samples for obtaining a stable policy update estimator, and this is prohibitive when the sampling cost is expensive. In this letter, we extend an expectation-maximization-based policy search method so that previously collected samples can be efficiently reused. The usefulness of the proposed method, reward-weighted regression with sample reuse (R3), is demonstrated through robot learning experiments. (This letter is an extended version of our earlier conference paper: Hachiya, Peters, & Sugiyama, 2009.)
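
    A toy sketch of the reward-weighted regression idea, without the sample-reuse (importance-weighting) machinery the letter adds: fit a linear policy by weighted least squares, weighting each explored action by a nonnegative transform of its reward. The exponential transform and one-dimensional linear-Gaussian policy are assumptions of this sketch:

      import numpy as np

      def rwr_update(states, actions, rewards):
          """One reward-weighted regression step for a linear policy a = theta * s."""
          w = np.exp(rewards - rewards.max())   # soft, nonnegative reward weights
          num = np.sum(w * states * actions)
          den = np.sum(w * states * states)
          return num / den                      # weighted least-squares estimate of theta

      rng = np.random.default_rng(0)
      s = rng.normal(size=500)
      a = 2.0 * s + rng.normal(scale=0.5, size=500)  # exploratory actions around a = 2s
      r = -(a - 2.0 * s) ** 2                        # reward peaks at the target policy
      print(rwr_update(s, a, r))                     # close to 2.0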

  2. Principal components analysis of reward prediction errors in a reinforcement learning task.

    PubMed

    Sambrook, Thomas D; Goslin, Jeremy

    2016-01-01

    Models of reinforcement learning represent reward and punishment in terms of reward prediction errors (RPEs), quantitative signed terms describing the degree to which outcomes are better than expected (positive RPEs) or worse (negative RPEs). An electrophysiological component known as the feedback-related negativity (FRN) occurs at frontocentral sites 240-340 ms after feedback on whether a reward or punishment is obtained, and has been claimed to neurally encode an RPE. An outstanding question, however, is whether the FRN is sensitive to the size of both positive RPEs and negative RPEs. Previous attempts to answer this question have examined the simple effects of RPE size for positive RPEs and negative RPEs separately. However, this methodology can be compromised by overlap from components coding for unsigned prediction error size, or "salience", which are sensitive to the absolute size of a prediction error but not its valence. In our study, positive and negative RPEs were parametrically modulated using both reward likelihood and magnitude, with principal components analysis used to separate out overlying components. This revealed a single RPE-encoding component responsive to the size of positive RPEs, peaking at ~330 ms and occupying the delta frequency band. Other components responsive to unsigned prediction error size were shown, but no component sensitive to negative RPE size was found.
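
    In such designs the signed and unsigned ("salience") prediction errors per trial are simple functions of reward likelihood and magnitude; a minimal sketch under those definitions (function names are illustrative):

      def prediction_errors(p_reward, magnitude, rewarded):
          """Signed RPE = outcome - expectation; salience = its absolute size."""
          expected = p_reward * magnitude
          outcome = magnitude if rewarded else 0.0
          rpe = outcome - expected          # positive: better than expected
          return rpe, abs(rpe)              # (signed RPE, unsigned salience)

      print(prediction_errors(0.25, 4.0, rewarded=True))    # (3.0, 3.0): large positive RPE
      print(prediction_errors(0.75, 4.0, rewarded=False))   # (-3.0, 3.0): same salience, opposite sign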

  3. Extending Hierarchical Reinforcement Learning to Continuous-Time, Average-Reward, and Multi-Agent Models

    DTIC Science & Technology

    2003-07-09

    Hierarchical reinforcement learning (HRL) is a general framework that studies how to exploit the structure of actions and tasks to accelerate policy...framework could suffice, we focus in this paper on the MAXQ framework. We describe three new hierarchical reinforcement learning algorithms: continuous-time...reinforcement learning to speed up the acquisition of cooperative multiagent tasks. We extend the MAXQ framework to the multiagent case, which we term...

  4. Subjective and model-estimated reward prediction: association with the feedback-related negativity (FRN) and reward prediction error in a reinforcement learning task.

    PubMed

    Ichikawa, Naho; Siegle, Greg J; Dombrovski, Alexandre; Ohira, Hideki

    2010-12-01

    In this study, we examined whether the feedback-related negativity (FRN) is associated with both subjective and objective (model-estimated) reward prediction errors (RPEs) per trial in a reinforcement learning task in healthy adults (n=25). The level of RPE was assessed (1) by subjective ratings per trial and (2) by a computational model of reinforcement learning. Model-estimated RPE was highly correlated with subjective RPE (r=.82), and the grand-averaged ERP waves based on trials with high versus low model-estimated RPE differed significantly only in the time window of the FRN component (p<.05). Regardless of the time course of learning, the FRN was associated with both subjective and model-estimated RPEs within subjects (r=.47, p<.001; r=.40, p<.05) and between subjects (r=.33, p<.05; r=.41, p<.005), but only in the Learnable condition, where the internal reward prediction varied enough with the behavior-reward contingency.

  5. [A model of reward choice based on the theory of reinforcement learning].

    PubMed

    Smirnitskaia, I A; Frolov, A A; Merzhanova, G Kh

    2007-01-01

    We developed a model of the alimentary instrumental conditioned bar-pressing reflex in cats making a choice between an immediate small reinforcement ("impulsive behavior") and a delayed, more valuable reinforcement ("self-control behavior"). Our model is based on reinforcement learning theory. We emulated the dopamine contribution by the discount coefficient of this theory (a subjective decrease in the value of a delayed reinforcement). The results of computer simulation showed that "cats" with a large discount coefficient demonstrated "self-control behavior", whereas a small discount coefficient was associated with "impulsive behavior". These data are in agreement with experimental data indicating that impulsive behavior is due to a decreased amount of dopamine in the striatum.

  6. Dopaminergic control of motivation and reinforcement learning: a closed-circuit account for reward-oriented behavior.

    PubMed

    Morita, Kenji; Morishima, Mieko; Sakai, Katsuyuki; Kawaguchi, Yasuo

    2013-05-15

    Humans and animals take actions quickly when they expect that the actions lead to reward, reflecting their motivation. Injection of dopamine receptor antagonists into the striatum has been shown to slow such reward-seeking behavior, suggesting that dopamine is involved in the control of motivational processes. Meanwhile, neurophysiological studies have revealed that phasic response of dopamine neurons appears to represent reward prediction error, indicating that dopamine plays central roles in reinforcement learning. However, previous attempts to elucidate the mechanisms of these dopaminergic controls have not fully explained how the motivational and learning aspects are related and whether they can be understood by the way the activity of dopamine neurons itself is controlled by their upstream circuitries. To address this issue, we constructed a closed-circuit model of the corticobasal ganglia system based on recent findings regarding intracortical and corticostriatal circuit architectures. Simulations show that the model could reproduce the observed distinct motivational effects of D1- and D2-type dopamine receptor antagonists. Simultaneously, our model successfully explains the dopaminergic representation of reward prediction error as observed in behaving animals during learning tasks and could also explain distinct choice biases induced by optogenetic stimulation of the D1 and D2 receptor-expressing striatal neurons. These results indicate that the suggested roles of dopamine in motivational control and reinforcement learning can be understood in a unified manner through a notion that the indirect pathway of the basal ganglia represents the value of states/actions at a previous time point, an empirically driven key assumption of our model.

  7. A model of reward choice based on the theory of reinforcement learning.

    PubMed

    Smirnitskaya, I A; Frolov, A A; Merzhanova, G Kh

    2008-03-01

    A model explaining behavioral "impulsivity" and "self-control" is proposed on the basis of the theory of reinforcement learning. The discount coefficient gamma, which in this theory accounts for the subjective reduction in the value of a delayed reinforcement, is identified with the overall level of dopaminergic neuron activity which, according to published data, also determines the behavioral variant. Computer modeling showed that high values of gamma are characteristic of predominantly "self-controlled" subjects, while smaller values of gamma are characteristic of "impulsive" subjects.
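
    The choice mechanism reduces to comparing an immediate reward with an exponentially discounted delayed one; a worked sketch with illustrative numbers, showing how a lower gamma flips preference toward the immediate small reward:

      def prefers_delayed(r_small, r_large, delay, gamma):
          """Compare an immediate small reward with a discounted delayed large one."""
          return gamma ** delay * r_large > r_small

      # "Self-controlled" agent (high gamma) waits; "impulsive" agent (low gamma) does not.
      print(prefers_delayed(1.0, 3.0, delay=5, gamma=0.9))   # True:  0.9**5 * 3 = 1.77 > 1
      print(prefers_delayed(1.0, 3.0, delay=5, gamma=0.6))   # False: 0.6**5 * 3 = 0.23 < 1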

  8. The Rewards of Learning.

    ERIC Educational Resources Information Center

    Chance, Paul

    1992-01-01

    Although intrinsic rewards are important, they (along with punishment and encouragement) are insufficient for efficient learning. Teachers must supplement intrinsic rewards with extrinsic rewards, such as praising, complimenting, applauding, and providing other forms of recognition for good work. Teachers should use the weakest reward required to…

  9. States versus Rewards: Dissociable neural prediction error signals underlying model-based and model-free reinforcement learning

    PubMed Central

    Gläscher, Jan; Daw, Nathaniel; Dayan, Peter; O’Doherty, John P.

    2010-01-01

    Reinforcement learning (RL) uses sequential experience with situations (“states”) and outcomes to assess actions. Whereas model-free RL uses this experience directly, in the form of a reward prediction error (RPE), model-based RL uses it indirectly, building a model of the state transition and outcome structure of the environment, and evaluating actions by searching this model. A state prediction error (SPE) plays a central role, reporting discrepancies between the current model and the observed state transitions. Using functional magnetic resonance imaging in humans solving a probabilistic Markov decision task we found the neural signature of an SPE in the intraparietal sulcus and lateral prefrontal cortex, in addition to the previously well-characterized RPE in the ventral striatum. This finding supports the existence of two unique forms of learning signal in humans, which may form the basis of distinct computational strategies for guiding behavior. PMID:20510862
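
    The two learning signals can be contrasted in a few lines (a schematic sketch, not the paper's fitted model): model-free learning updates action values from an RPE, while model-based learning updates a transition model from an SPE. Sizes and rates are illustrative:

      import numpy as np

      N_STATES, N_ACTIONS = 3, 2
      Q = np.zeros((N_STATES, N_ACTIONS))    # model-free action values
      T = np.full((N_STATES, N_ACTIONS, N_STATES), 1.0 / N_STATES)  # transition model

      def model_free_step(s, a, r, s_next, alpha=0.1, gamma=0.9):
          rpe = r + gamma * Q[s_next].max() - Q[s, a]   # reward prediction error
          Q[s, a] += alpha * rpe

      def model_based_step(s, a, s_next, eta=0.1):
          spe = 1.0 - T[s, a, s_next]        # state prediction error (surprise)
          T[s, a] *= (1.0 - eta)             # decay all successor probabilities...
          T[s, a, s_next] += eta             # ...and bump the observed transition
          return spe

      model_free_step(0, 1, r=1.0, s_next=2)
      print(model_based_step(0, 1, s_next=2))   # ~0.67 for an unexpected transition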

  10. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.

    PubMed

    Glimcher, Paul W

    2011-09-13

    A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn.

  11. [Reinforcement learning by the striatum].

    PubMed

    Kunisato, Yoshihiko; Okada, Go; Okamoto, Yasumasa

    2009-04-01

    Recently, computational models of reinforcement learning have been applied to the analysis of neuroimaging data, and it has been clarified that the striatum plays a key role in decision making. We review reinforcement learning theory and the biological structures and signals, such as neuromodulators, associated with reinforcement learning. We also investigated the function of the striatum and the neurotransmitter serotonin in reward prediction. We first studied the brain mechanisms for reward prediction at different time scales. Our experiment on the striatum showed that the ventroanterior regions are involved in predicting immediate rewards and the dorsoposterior regions are involved in predicting future rewards. Further, we investigated whether serotonin regulates both reward selection and the striatal specialization for reward prediction at different time scales. To this end, we regulated the dietary intake of tryptophan, a precursor of serotonin. Our experiment showed that the activity of the ventral part of the striatum was correlated with reward prediction at shorter time scales, and this activity was stronger at low serotonin levels. By contrast, the activity of the dorsal part of the striatum was correlated with reward prediction at longer time scales, and this activity was stronger at high serotonin levels. Further, a higher proportion of small-reward choices, together with a higher rate of discounting of delayed rewards, was observed in the low-serotonin condition than in the control and high-serotonin conditions. Further examination is required to assess the relation between the disturbance of reward prediction caused by low serotonin and serotonin-related mental disorders such as depression.

  12. Single Dose of a Dopamine Agonist Impairs Reinforcement Learning in Humans: Behavioral Evidence from a Laboratory-based Measure of Reward Responsiveness

    PubMed Central

    Pizzagalli, Diego A.; Evins, A. Eden; Schetter, Erika Cowman; Frank, Michael J.; Pajtas, Petra E.; Santesso, Diane L.; Culhane, Melissa

    2007-01-01

    Rationale The dopaminergic system, particularly D2-like dopamine receptors, has been strongly implicated in reward processing. Animal studies have emphasized the role of phasic dopamine (DA) signaling in reward-related learning, but these processes remain largely unexplored in humans. Objectives To evaluate the effect of a single, low dose of a D2/D3 agonist—pramipexole—on reinforcement learning in healthy adults. Based on prior evidence indicating that low doses of DA agonists decrease phasic DA release through autoreceptor stimulation, we hypothesized that 0.5 mg of pramipexole would impair reward learning due to presynaptic mechanisms. Methods Using a double-blind design, a single 0.5 mg dose of pramipexole or placebo was administered to 32 healthy volunteers, who performed a probabilistic reward task involving a differential reinforcement schedule as well as various control tasks. Results As hypothesized, response bias toward the more frequently rewarded stimulus was impaired in the pramipexole group, even after adjusting for transient adverse effects. In addition, the pramipexole group showed reaction time and motor speed slowing and increased negative affect; however, when adverse physical side effects were considered, group differences in motor speed and negative affect disappeared. Conclusions These findings show that a single low dose of pramipexole impaired the acquisition of reward-related behavior in healthy participants, and they are consistent with prior evidence suggesting that phasic DA signaling is required to reinforce actions leading to reward. The potential implications of the present findings to psychiatric conditions, including depression and impulse control disorders related to addiction, are discussed. PMID:17909750

  13. Reward and learning in the goldfish.

    PubMed

    Lowes, G; Bitterman, M E

    1967-07-28

    An experiment with goldfish showed the effects of change in amount of reward that are predicted from reinforcement theory. The performance of animals shifted from small to large reward improved gradually to the level of unshifted large-reward controls, while the performance of animals shifted from large to small reward remained at the large-reward level. The difference between these results and those obtained in analogous experiments with the rat suggests that reward functions differently in the instrumental learning of the two animals.

  14. Heads for learning, tails for memory: reward, reinforcement and a role of dopamine in determining behavioral relevance across multiple timescales

    PubMed Central

    Baudonnat, Mathieu; Huber, Anna; David, Vincent; Walton, Mark E.

    2013-01-01

    Dopamine has long been tightly associated with aspects of reinforcement learning and motivation in simple situations where there are a limited number of stimuli to guide behavior and constrained range of outcomes. In naturalistic situations, however, there are many potential cues and foraging strategies that could be adopted, and it is critical that animals determine what might be behaviorally relevant in such complex environments. This requires not only detecting discrepancies with what they have recently experienced, but also identifying similarities with past experiences stored in memory. Here, we review what role dopamine might play in determining how and when to learn about the world, and how to develop choice policies appropriate to the situation faced. We discuss evidence that dopamine is shaped by motivation and memory and in turn shapes reward-based memory formation. In particular, we suggest that hippocampal-striatal-dopamine networks may interact to determine how surprising the world is and to either inhibit or promote actions at time of behavioral uncertainty. PMID:24130514

  15. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning.

    PubMed

    Balasubramani, Pragathi P; Chakravarthy, V Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A

    2014-01-01

    Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models have mostly focused on its role in behavioral inhibition. In this study, we present a model of risk-based decision making in a modified Reinforcement Learning (RL) framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in the Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation, which accommodates both 5HT and DA, reconciles some of the diverse roles of 5HT, particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) risk-sensitive decision making, where 5HT controls risk assessment; (2) temporal reward prediction, where 5HT controls the time scale of reward prediction; and (3) reward/punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG.

  16. Hypocretin/orexin involvement in reward and reinforcement

    PubMed Central

    España, Rodrigo A.

    2015-01-01

    Since the discovery of the hypocretins/orexins, a series of observations have indicated that these peptides influence a variety of physiological processes including feeding, sleep/wake function, memory, and stress. More recently, the hypocretins have been implicated in reinforcement and reward-related processes via actions on the mesolimbic dopamine system. Although investigation into the relationship between the hypocretins and reinforcement/reward remains in relatively early stages, accumulating evidence suggests that continued research into this area may offer new insights into the addiction process and provide the foundation to generate novel pharmacotherapies for drug abuse. The current chapter will focus on contemporary perspectives of hypocretin regulation of cocaine reward and reinforcement via actions on the mesolimbic dopamine system. PMID:22640614

  17. Reinforcement learning with Marr.

    PubMed

    Niv, Yael; Langdon, Angela

    2016-10-01

    To many, the poster child for David Marr's famous three levels of scientific inquiry is reinforcement learning: a computational theory of reward optimization, which readily prescribes algorithmic solutions that evidence striking resemblance to signals found in the brain, suggesting a straightforward neural implementation. Here we review questions that remain open at each level of analysis, concluding that the path forward to their resolution calls for inspiration across levels, rather than a focus on mutual constraints.

  18. Beta-endorphin and drug-induced reward and reinforcement.

    PubMed

    Roth-Deri, Ilana; Green-Sadan, Tamar; Yadid, Gal

    2008-09-01

    Although drugs of abuse have different acute mechanisms of action, their brain reward pathways exhibit common functional effects upon both acute and chronic administration. Long known for its analgesic effect, the opioid beta-endorphin has now been shown to induce euphoria and to have rewarding and reinforcing properties. In this review, we summarize the present neurobiological and behavioral evidence that supports the involvement of beta-endorphin in drug-induced reward and reinforcement. Currently, evidence supports a prominent role for beta-endorphin in the reward pathways of cocaine and alcohol. The existing information indicating the importance of beta-endorphin neurotransmission in mediating the reward pathways of nicotine and THC is thus far circumstantial. The studies described herein employed diverse techniques, such as biochemical measurements of beta-endorphin in various brain sites and plasma, and behavioral measurements conducted following elimination (via administration of anti-beta-endorphin antibodies or use of mutant mice) or augmentation (by intracerebral administration) of beta-endorphin. We suggest that the reward pathways for different addictive drugs converge to a common pathway in which beta-endorphin is a modulating element. Beta-endorphin is also involved in distress; however, the data collected so far imply a discrete role, beyond that of a stress response, for beta-endorphin in mediating the substance-of-abuse reward pathway. This may occur via interaction with the mesolimbic dopaminergic system and also through its effects on learning and memory. The functional meaning of beta-endorphin in the process of drug-seeking behavior is discussed.

  1. Placebo Intervention Enhances Reward Learning in Healthy Individuals

    PubMed Central

    Turi, Zsolt; Mittner, Matthias; Paulus, Walter; Antal, Andrea

    2017-01-01

    According to the placebo-reward hypothesis, placebo is a reward-anticipation process that increases midbrain dopamine (DA) levels. Reward-based learning processes, such as reinforcement learning, involve a large part of the DA-ergic network that is also activated by the placebo intervention. Given the neurochemical overlap between placebo and reward learning, we investigated whether verbal instructions in conjunction with a placebo intervention are capable of enhancing reward learning in healthy individuals, using a monetary reward-based reinforcement-learning task. The placebo intervention was performed with non-invasive brain stimulation techniques. In a randomized, triple-blind, cross-over study we investigated this cognitive placebo effect in healthy individuals by manipulating the participants' perceived uncertainty about the intervention's efficacy. Volunteers in the purportedly low- and high-uncertainty conditions earned more money, responded more quickly, and had a higher learning rate from monetary rewards relative to baseline. Participants in the purportedly high-uncertainty condition showed enhanced reward learning, and a model-free computational analysis revealed a higher learning rate from monetary rewards compared to the purportedly low-uncertainty and baseline conditions. Our results indicate that the placebo response is able to enhance reward learning in healthy individuals, opening up exciting avenues for future research on placebo effects on other cognitive functions. PMID:28112207

  2. Reward learning in normal and mutant Drosophila

    PubMed Central

    Tempel, Bruce L.; Bonini, Nancy; Dawson, Douglas R.; Quinn, William G.

    1983-01-01

    Hungry fruit flies can be trained by exposing them to two chemical odorants, one paired with the opportunity to feed on 1 M sucrose. On later testing, when given a choice between odorants the flies migrate specifically toward the sucrose-paired odor. This appetitively reinforced learning by the flies is similar in strength and character to previously demonstrated negatively reinforced learning, but it differs in several properties. Both memory consolidation and memory decay proceed relatively slowly after training with sucrose reward. Consolidation of learned information into anesthesia-resistant long-term memory requires about 100 min after training with sucrose compared to about 30 min after training with electric shock. Memory in wild-type flies persists for 24 hr after training with sucrose compared to 4-6 hr after training with electric shock. Memory in amnesiac mutants appears to be similarly lengthened, from 1 hr to 6 hr, by substituting sucrose reward for shock punishment. Two other mutants, dunce and rutabaga, which were isolated because they failed to learn the shock-avoidance task, learn normally in response to sucrose reward but forget rapidly afterward. One mutant, turnip, does not learn in either paradigm. Reward and punishment can be combined in olfactory discrimination training by pairing one odor to sucrose and the other to electric shock. In this situation, the expression of learning is approximately the sum of that obtained by using either reinforcement alone. After such training, memory decays at two distinct rates, each characteristic of one type of reinforcement. PMID:6572401

  3. Prosocial Reward Learning in Children and Adolescents

    PubMed Central

    Kwak, Youngbin; Huettel, Scott A.

    2016-01-01

    Adolescence is a period of increased sensitivity to social contexts. To evaluate how social context sensitivity changes over development—and influences reward learning—we investigated how children and adolescents perceive and integrate rewards for oneself and others during a dynamic risky decision-making task. Children and adolescents (N = 75, 8–16 years) performed the Social Gambling Task (SGT, Kwak et al., 2014) and completed a set of questionnaires measuring other-regarding behavior. In the SGT, participants choose amongst four card decks that have different payout structures for oneself and for a charity. We examined patterns of choices, overall decision strategies, and how reward outcomes led to trial-by-trial adjustments in behavior, as estimated using a reinforcement-learning model. Performance of children and adolescents was compared to data from a previously collected sample of adults (N = 102) performing the identical task. We found that children/adolescents were not only more sensitive to rewards directed to the charity than to themselves but also showed greater prosocial tendencies on independent measures of other-regarding behavior. Children and adolescents also showed less use of a strategy that prioritizes rewards for self at the expense of rewards for others. These results support the conclusion that, compared to adults, children and adolescents show greater sensitivity to outcomes for others when making decisions and learning about potential rewards. PMID:27761125

  4. Quantum reinforcement learning.

    PubMed

    Dong, Daoyi; Chen, Chunlin; Li, Hanxiong; Tarn, Tzyh-Jong

    2008-10-01

    The key approaches for machine learning, particularly learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of a value-updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of an eigen action is determined by its probability amplitude, which is updated in parallel according to rewards. Some related characteristics of QRL, such as convergence, optimality, and the balance between exploration and exploitation, are also analyzed, showing that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given, and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. This paper is also an effective exploration of the application of quantum computation to artificial intelligence.
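
    A classical toy simulation of the amplitude-update scheme the abstract describes (no genuine quantum hardware involved): action probabilities come from squared amplitudes, "measurement" collapses to one eigen action, and the rewarded amplitude is amplified and renormalized. All constants are illustrative:

      import math, random

      def measure(amps):
          """Sample an eigen action with probability |amplitude|^2 (collapse postulate)."""
          return random.choices(range(len(amps)), weights=[a * a for a in amps])[0]

      def reinforce_amplitude(amps, action, reward, k=0.1):
          """Grow the chosen action's amplitude with reward, then renormalize."""
          amps[action] += k * reward
          norm = math.sqrt(sum(a * a for a in amps))
          return [a / norm for a in amps]

      amps = [0.5, 0.5, 0.5, 0.5]            # uniform superposition over four actions
      for _ in range(100):
          act = measure(amps)
          r = 1.0 if act == 2 else 0.0       # toy task: action 2 is rewarded
          amps = reinforce_amplitude(amps, act, r)
      print(amps)                            # the amplitude of action 2 now dominates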

  5. General functioning predicts reward and punishment learning in schizophrenia.

    PubMed

    Somlai, Zsuzsanna; Moustafa, Ahmed A; Kéri, Szabolcs; Myers, Catherine E; Gluck, Mark A

    2011-04-01

    Previous studies investigating feedback-driven reinforcement learning in patients with schizophrenia have provided mixed results. In this study, we explored the clinical predictors of reward and punishment learning using a probabilistic classification learning task. Patients with schizophrenia (n=40) performed similarly to healthy controls (n=30) on the classification learning task. However, more severe negative and general symptoms were associated with lower reward-learning performance, whereas poorer general psychosocial functioning was correlated with both lower reward- and punishment-learning performance. Multiple linear regression analyses indicated that general psychosocial functioning was the only significant predictor of reinforcement learning performance when education, antipsychotic dose, and positive, negative, and general symptoms were included in the analysis. These results suggest a close relationship between reinforcement learning and general psychosocial functioning in schizophrenia.

  6. Variable Resolution Reinforcement Learning.

    DTIC Science & Technology

    1995-04-01

    Can reinforcement learning ever become a practical method for real control problems? This paper begins by reviewing three reinforcement learning algorithms...reinforcement learning. In addition to exploring state space and developing a control policy to achieve a task, partigame also learns a kd-tree partitioning of...

  7. Partial Planning Reinforcement Learning

    DTIC Science & Technology

    2012-08-31

    This project explored several problems in the areas of reinforcement learning, probabilistic planning, and transfer learning. In particular, it...studied Bayesian Optimization for model-based and model-free reinforcement learning, transfer in the context of model-free reinforcement learning based on...

  8. Global reinforcement learning in neural networks.

    PubMed

    Ma, Xiaolong; Likharev, Konstantin K

    2007-03-01

    In this letter, we have found a more general formulation of the REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility (REINFORCE) learning principle first suggested by Williams. The new formulation has enabled us to apply the principle to global reinforcement learning in networks with various sources of randomness, and to suggest several simple local rules for such networks. Numerical simulations have shown that for simple classification and reinforcement learning tasks, at least one family of the new learning rules gives results comparable to those provided by the famous Rules A(r-i) and A(r-p) for the Boltzmann machines.
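
    Williams's original rule, which the letter generalizes, has a compact form: each weight changes by a learning rate times (reward minus baseline) times the characteristic eligibility d ln(pi)/dw. A sketch for a single Bernoulli-logistic unit (the task and constants are illustrative, not the letter's networks):

      import math, random

      def reinforce_step(w, x, baseline=0.0, lr=0.5):
          """One REINFORCE update for a stochastic unit y ~ Bernoulli(sigmoid(w*x))."""
          p = 1.0 / (1.0 + math.exp(-w * x))
          y = 1 if random.random() < p else 0
          r = 1.0 if y == 1 else 0.0          # toy task: output 1 is rewarded
          eligibility = (y - p) * x           # d ln Pr(y) / dw for the logistic unit
          return w + lr * (r - baseline) * eligibility

      w = 0.0
      for _ in range(200):
          w = reinforce_step(w, x=1.0)
      print(w)                                # drifts positive: the unit learns to fire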

  9. Tonic Dopamine Modulates Exploitation of Reward Learning

    PubMed Central

    Beeler, Jeff A.; Daw, Nathaniel; Frazier, Cristianne R. M.; Zhuang, Xiaoxi

    2010-01-01

    The impact of dopamine on adaptive behavior in a naturalistic environment is largely unexamined. Experimental work suggests that phasic dopamine is central to reinforcement learning, whereas tonic dopamine may modulate performance without altering learning per se; however, this idea has not been developed formally or integrated with computational models of dopamine function. We quantitatively evaluate the role of tonic dopamine in these functions by studying the behavior of hyperdopaminergic DAT knockdown mice in an instrumental task in a semi-naturalistic homecage environment. In this “closed economy” paradigm, subjects earn all of their food by pressing either of two levers, but the relative cost for food on each lever shifts frequently. Compared to wild-type mice, hyperdopaminergic mice allocate more lever presses on high-cost levers, thus working harder to earn a given amount of food and maintain their body weight. However, both groups show a similarly quick reaction to shifts in lever cost, suggesting that the hyperdopaminergic mice are not slower at detecting changes, as would be expected with a learning deficit. We fit the lever choice data using reinforcement learning models to assess the distinction between acquisition and expression that the models formalize. In these analyses, hyperdopaminergic mice displayed normal learning from recent reward history but diminished capacity to exploit this learning: a reduced coupling between choice and reward history. These data suggest that dopamine modulates the degree to which prior learning biases action selection and consequently alters the expression of learned, motivated behavior. PMID:21120145
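
    The acquisition/expression distinction described here can be sketched with a standard Q-learning-plus-softmax model, where the inverse temperature beta controls the coupling between choice and learned reward history; this is a hedged illustration, with all names and parameter values chosen for clarity rather than taken from the paper:

        import numpy as np

        def softmax_choice(q, beta):
            # beta couples choice to learned values; lowering it weakens the
            # exploitation of reward history without touching the learning rule.
            p = np.exp(beta * q - np.max(beta * q))
            return int(np.random.choice(len(q), p=p / p.sum()))

        def run_session(beta, alpha=0.2, n_trials=500, payoffs=(0.8, 0.2)):
            q = np.zeros(2)
            for _ in range(n_trials):
                a = softmax_choice(q, beta)
                r = float(np.random.rand() < payoffs[a])
                q[a] += alpha * (r - q[a])      # intact learning from recent rewards
            return q

        # "Wild-type" vs "hyperdopaminergic" in this sketch: same alpha (learning),
        # different beta (expression), e.g. run_session(beta=5.0) vs run_session(beta=1.0).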

  10. Reinforcement learning and Tourette syndrome.

    PubMed

    Palminteri, Stefano; Pessiglione, Mathias

    2013-01-01

    In this chapter, we report the first experimental explorations of reinforcement learning in Tourette syndrome, carried out by our team in the last few years. This report is preceded by an introduction aimed at providing the reader with the state of the art of knowledge concerning the neural bases of reinforcement learning at the time of these studies and the scientific rationale behind them. In short, reinforcement learning is learning by trial and error to maximize rewards and minimize punishments. This decision-making and learning process implicates the dopaminergic system projecting to the frontal cortex-basal ganglia circuits. A large body of evidence suggests that dysfunction of the same neural systems is implicated in the pathophysiology of Tourette syndrome. Our results show that the Tourette condition, as well as the most common pharmacological treatments (dopamine antagonists), affects reinforcement learning performance in these patients. Specifically, the results suggest a deficit in negative reinforcement learning, possibly underpinned by a functional hyperdopaminergia, which could explain the persistence of tics despite their evidently maladaptive (negative) value. This idea, together with the implications of these results for Tourette therapy and future perspectives, is discussed in Section 4 of this chapter.

  11. Reinforcement Learning: A Tutorial.

    DTIC Science & Technology

    1997-01-01

    The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in... provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section (2) the parts of a reinforcement learning problem... reinforcement learning algorithms. These include TD(lambda) and both the residual and direct forms of value iteration, Q-learning, and advantage learning

  12. [Multiple Dopamine Signals and Their Contributions to Reinforcement Learning].

    PubMed

    Matsumoto, Masayuki

    2016-10-01

    Midbrain dopamine neurons are activated by reward and by sensory cues that predict reward. Their responses resemble the reward prediction error, the discrepancy between obtained and expected reward values, which has been thought to play an important role as a teaching signal in reinforcement learning. Indeed, pharmacological blockade of dopamine transmission interferes with reinforcement learning. Recent studies reported, however, that not all dopamine neurons transmit the reward-related signal. They found that a subset of dopamine neurons transmits signals related to non-rewarding, salient experiences such as aversive stimulation and cognitively demanding events. How these signals contribute to animal behavior is not yet well understood. This article reviews recent findings on dopamine signals related to rewarding and non-rewarding experiences, and discusses their contributions to reinforcement learning.
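
    For reference, the reward prediction error at the heart of this account is the temporal-difference error delta = r + gamma * V(s') - V(s); a minimal sketch of the corresponding update (the value table v and all parameters are illustrative):

        def td_update(v, s, s_next, reward, alpha=0.1, gamma=0.95):
            # Reward prediction error: discrepancy between obtained and expected value.
            delta = reward + gamma * v[s_next] - v[s]
            v[s] += alpha * delta               # dopamine-like teaching signal
            return delta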

  13. A model of food reward learning with dynamic reward exposure

    PubMed Central

    Hammond, Ross A.; Ornstein, Joseph T.; Fellows, Lesley K.; Dubé, Laurette; Levitan, Robert; Dagher, Alain

    2012-01-01

    The process of conditioning via reward learning is highly relevant to the study of food choice and obesity. Learning is itself shaped by environmental exposure, with the potential for such exposures to vary substantially across individuals and across place and time. In this paper, we use computational techniques to extend a well-validated standard model of reward learning, introducing both substantial heterogeneity and dynamic reward exposures. We then apply the extended model to a food choice context. The model produces a variety of individual behaviors and population-level patterns which are not evident from the traditional formulation, but which offer potential insights for understanding food reward learning and obesity. These include a “lock-in” effect, through which early exposure can strongly shape later reward valuation. We discuss potential implications of our results for the study and prevention of obesity, for the reward learning field, and for future experimental and computational work. PMID:23087640
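
    A hypothetical sketch of how dynamic exposure can produce a "lock-in" effect of the kind described: here exposure probability is assumed proportional to past consumption while values follow a standard Rescorla-Wagner update, so an early head start in sampling one food compounds over time. This illustrates the idea only and is not the authors' model:

        import numpy as np

        def simulate_lock_in(n_steps=2000, alpha=0.05):
            v = np.zeros(2)                     # learned reward values
            counts = np.array([5.0, 1.0])       # early exposure favours food 0
            for _ in range(n_steps):
                p = counts / counts.sum()       # dynamic, history-dependent exposure
                food = int(np.random.choice(2, p=p))
                counts[food] += 1.0
                r = np.random.normal(1.0, 0.1)  # both foods equally rewarding
                v[food] += alpha * (r - v[food])  # Rescorla-Wagner value update
            return v, counts / counts.sum()

    Even though both foods are equally rewarding, the exposure share tends to stay locked near its early bias, so valuation is shaped more by history than by any real difference between the options.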

  14. Mind matters: placebo enhances reward learning in Parkinson's disease.

    PubMed

    Schmidt, Liane; Braun, Erin Kendall; Wager, Tor D; Shohamy, Daphna

    2014-12-01

    Expectations have a powerful influence on how we experience the world. Neurobiological and computational models of learning suggest that dopamine is crucial for shaping expectations of reward and that expectations alone may influence dopamine levels. However, because expectations and reinforcers are typically manipulated together, the role of expectations per se has remained unclear. We separated these two factors using a placebo dopaminergic manipulation in individuals with Parkinson's disease. We combined a reward learning task with functional magnetic resonance imaging to test how expectations of dopamine release modulate learning-related activity in the brain. We found that the mere expectation of dopamine release enhanced reward learning and modulated learning-related signals in the striatum and the ventromedial prefrontal cortex. These effects were selective to learning from reward: neither medication nor placebo had an effect on learning to avoid monetary loss. These findings suggest a neurobiological mechanism by which expectations shape learning and affect.

  15. Microstimulation of the human substantia nigra alters reinforcement learning.

    PubMed

    Ramayya, Ashwin G; Misra, Amrit; Baltuch, Gordon H; Kahana, Michael J

    2014-05-14

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action-reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action-reward associations rather than stimulus-reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action-reward associations during reinforcement learning.

  16. Microstimulation of the Human Substantia Nigra Alters Reinforcement Learning

    PubMed Central

    Ramayya, Ashwin G.; Misra, Amrit

    2014-01-01

    Animal studies have shown that substantia nigra (SN) dopaminergic (DA) neurons strengthen action–reward associations during reinforcement learning, but their role in human learning is not known. Here, we applied microstimulation in the SN of 11 patients undergoing deep brain stimulation surgery for the treatment of Parkinson's disease as they performed a two-alternative probability learning task in which rewards were contingent on stimuli, rather than actions. Subjects demonstrated decreased learning from reward trials that were accompanied by phasic SN microstimulation compared with reward trials without stimulation. Subjects who showed large decreases in learning also showed an increased bias toward repeating actions after stimulation trials; therefore, stimulation may have decreased learning by strengthening action–reward associations rather than stimulus–reward associations. Our findings build on previous studies implicating SN DA neurons in preferentially strengthening action–reward associations during reinforcement learning. PMID:24828643

  17. Individual differences in sensitivity to reward and punishment and neural activity during reward and avoidance learning.

    PubMed

    Kim, Sang Hee; Yoon, HeungSik; Kim, Hackjin; Hamann, Stephan

    2015-09-01

    In this functional neuroimaging study, we investigated neural activations during the process of learning to gain monetary rewards and to avoid monetary loss, and how these activations are modulated by individual differences in reward and punishment sensitivity. Healthy young volunteers performed a reinforcement learning task where they chose one of two fractal stimuli associated with monetary gain (reward trials) or avoidance of monetary loss (avoidance trials). Trait sensitivity to reward and punishment was assessed using the behavioral inhibition/activation scales (BIS/BAS). Functional neuroimaging results showed activation of the striatum during the anticipation and reception periods of reward trials. During avoidance trials, activation of the dorsal striatum and prefrontal regions was found. As expected, individual differences in reward sensitivity were positively associated with activation in the left and right ventral striatum during reward reception. Individual differences in sensitivity to punishment were negatively associated with activation in the left dorsal striatum during avoidance anticipation and also with activation in the right lateral orbitofrontal cortex during receiving monetary loss. These results suggest that learning to attain reward and learning to avoid loss are dependent on separable sets of neural regions whose activity is modulated by trait sensitivity to reward or punishment.

  18. Individual differences in sensitivity to reward and punishment and neural activity during reward and avoidance learning

    PubMed Central

    Yoon, HeungSik; Kim, Hackjin; Hamann, Stephan

    2015-01-01

    In this functional neuroimaging study, we investigated neural activations during the process of learning to gain monetary rewards and to avoid monetary loss, and how these activations are modulated by individual differences in reward and punishment sensitivity. Healthy young volunteers performed a reinforcement learning task where they chose one of two fractal stimuli associated with monetary gain (reward trials) or avoidance of monetary loss (avoidance trials). Trait sensitivity to reward and punishment was assessed using the behavioral inhibition/activation scales (BIS/BAS). Functional neuroimaging results showed activation of the striatum during the anticipation and reception periods of reward trials. During avoidance trials, activation of the dorsal striatum and prefrontal regions was found. As expected, individual differences in reward sensitivity were positively associated with activation in the left and right ventral striatum during reward reception. Individual differences in sensitivity to punishment were negatively associated with activation in the left dorsal striatum during avoidance anticipation and also with activation in the right lateral orbitofrontal cortex during receiving monetary loss. These results suggest that learning to attain reward and learning to avoid loss are dependent on separable sets of neural regions whose activity is modulated by trait sensitivity to reward or punishment. PMID:25680989

  19. A neural signature of hierarchical reinforcement learning.

    PubMed

    Ribas-Fernandes, José J F; Solway, Alec; Diuk, Carlos; McGuire, Joseph T; Barto, Andrew G; Niv, Yael; Botvinick, Matthew M

    2011-07-28

    Human behavior displays hierarchical structure: simple actions cohere into subtask sequences, which work together to accomplish overall task goals. Although the neural substrates of such hierarchy have been the target of increasing research, they remain poorly understood. We propose that the computations supporting hierarchical behavior may relate to those in hierarchical reinforcement learning (HRL), a machine-learning framework that extends reinforcement-learning mechanisms into hierarchical domains. To test this, we leveraged a distinctive prediction arising from HRL. In ordinary reinforcement learning, reward prediction errors are computed when there is an unanticipated change in the prospects for accomplishing overall task goals. HRL entails that prediction errors should also occur in relation to task subgoals. In three neuroimaging studies we observed neural responses consistent with such subgoal-related reward prediction errors, within structures previously implicated in reinforcement learning. The results reported support the relevance of HRL to the neural processes underlying hierarchical behavior.
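
    A minimal sketch of the distinctive HRL prediction tested here, assuming an options-style agent that maintains a task-level value function alongside a subtask value function trained on pseudo-reward delivered at subgoals; all names are illustrative:

        def hrl_prediction_errors(v_task, v_option, s, s_next, reward,
                                  pseudo_reward, alpha=0.1, gamma=0.95):
            # Ordinary prediction error, tied to overall task goals.
            delta_task = reward + gamma * v_task[s_next] - v_task[s]
            # Subgoal prediction error, tied to the active subtask's pseudo-reward;
            # this second signal is the signature HRL adds over ordinary RL.
            delta_sub = pseudo_reward + gamma * v_option[s_next] - v_option[s]
            v_task[s] += alpha * delta_task
            v_option[s] += alpha * delta_sub
            return delta_task, delta_sub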

  20. (Reinforcement?) Learning to forage optimally.

    PubMed

    Kolling, Nils; Akam, Thomas

    2017-09-14

    Foraging effectively is critical to the survival of all animals, and this imperative is thought to have profoundly shaped brain evolution. Decisions made by foraging animals often approximate optimal strategies, but the learning and decision mechanisms generating these choices remain poorly understood. Recent work with laboratory foraging tasks in humans suggests their behaviour is poorly explained by model-free reinforcement learning: simple heuristic strategies better describe behaviour in some tasks, while others show evidence of prospective prediction of the future state of the environment. We suggest that model-based average reward reinforcement learning may provide a common framework for understanding these apparently divergent foraging strategies.
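
    A minimal sketch of average-reward RL of the kind invoked here, using one standard R-learning-style update in which action values are measured relative to a running estimate of the reward rate, the quantity an ideal forager tracks when deciding whether to stay in a patch or leave; the tabular q, state indices, and parameters are illustrative:

        import numpy as np

        def r_learning_update(q, r_bar, s, a, reward, s_next, alpha=0.1, beta=0.01):
            # Values are defined relative to the running reward rate r_bar,
            # not to a discounted sum of future rewards.
            delta = reward - r_bar + np.max(q[s_next]) - q[s, a]
            q[s, a] += alpha * delta
            r_bar += beta * delta               # update the average-reward estimate
            return r_bar

        # usage: r_bar = r_learning_update(q, r_bar, s, a, reward, s_next)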

  1. Reinforcement of Learning

    ERIC Educational Resources Information Center

    Jones, Peter

    1977-01-01

    A company trainer shows some ways of scheduling reinforcement of learning for trainees: continuous reinforcement, fixed ratio, variable ratio, fixed interval, and variable interval. As there are problems with all methods, he suggests trying combinations of various types of reinforcement. (MF)

  2. Reinforcement learning: Computational theory and biological mechanisms.

    PubMed

    Doya, Kenji

    2007-05-01

    Reinforcement learning is a computational framework for an active agent to learn behaviors on the basis of a scalar reward signal. The agent can be an animal, a human, or an artificial system such as a robot or a computer program. The reward can be food, water, money, or any other measure of the agent's performance. The theory of reinforcement learning, which was developed in the artificial intelligence community with intuitions from animal learning theory, now gives a coherent account of the function of the basal ganglia. It serves as the "common language" in which biologists, engineers, and social scientists can exchange their problems and findings. This article reviews the basic theoretical framework of reinforcement learning and discusses its recent and future contributions toward the understanding of animal behaviors and human decision making.

  3. Hierarchical Multiagent Reinforcement Learning

    DTIC Science & Technology

    2004-01-25

    In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multiagent tasks. We... introduce a hierarchical multiagent reinforcement learning (RL) framework and propose a hierarchical multiagent RL algorithm called Cooperative HRL. In

  4. Reinforcement learning: Solving two case studies

    NASA Astrophysics Data System (ADS)

    Duarte, Ana Filipa; Silva, Pedro; dos Santos, Cristina Peixoto

    2012-09-01

    Reinforcement Learning algorithms offer interesting features for the control of autonomous systems, such as the ability to learn from direct interaction with the environment and the use of a simple reward signal, as opposed to the input-output pairs used in classic supervised learning. The reward signal indicates the success or failure of the actions executed by the agent in the environment. In this work, we describe RL algorithms applied to two case studies: the Crawler robot and the widely known inverted pendulum. We explore RL capabilities to autonomously learn a basic locomotion pattern in the Crawler, and approach the balancing problem of biped locomotion using the inverted pendulum.
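
    A minimal tabular Q-learning loop of the sort typically applied to a discretised inverted pendulum; env_step(s, a) -> (s_next, reward, done) is an assumed interface standing in for the simulated plant, and all parameters are illustrative rather than the authors':

        import numpy as np

        def q_learning(env_step, n_states, n_actions, episodes=500,
                       alpha=0.1, gamma=0.99, eps=0.1):
            q = np.zeros((n_states, n_actions))
            for _ in range(episodes):
                s, done = 0, False              # assume state 0 is the start state
                while not done:
                    # Epsilon-greedy action selection.
                    a = (np.random.randint(n_actions) if np.random.rand() < eps
                         else int(np.argmax(q[s])))
                    s_next, r, done = env_step(s, a)  # reward signals success/failure
                    q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
                    s = s_next
            return q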

  5. Reinforcement learning in scheduling

    NASA Technical Reports Server (NTRS)

    Dietterich, Tom G.; Ok, Dokyeong; Zhang, Wei; Tadepalli, Prasad

    1994-01-01

    The goal of this research is to apply reinforcement learning methods to real-world problems like scheduling. In this preliminary paper, we show that learning to solve scheduling problems such as the Space Shuttle Payload Processing and the Automatic Guided Vehicle (AGV) scheduling can be usefully studied in the reinforcement learning framework. We discuss some of the special challenges posed by the scheduling domain to these methods and propose some possible solutions we plan to implement.

  6. Theory meets pigeons: the influence of reward-magnitude on discrimination-learning.

    PubMed

    Rose, Jonas; Schmidt, Robert; Grabemann, Marco; Güntürkün, Onur

    2009-03-02

    Modern theoretical accounts of reward-based learning are commonly based on reinforcement learning algorithms. Most noted in this context is the temporal-difference (TD) algorithm, in which the difference between predicted and obtained reward, the prediction error, serves as a learning signal. Consequently, larger rewards cause bigger prediction errors and lead to faster learning than smaller rewards. Therefore, if animals employ a neural implementation of TD learning, reward magnitude should affect learning accordingly. Here we test this prediction by training pigeons on a simple color-discrimination task with two pairs of colors. In each pair, correct discrimination is rewarded; in pair one with a large reward, in pair two with a small reward. Pigeons acquired the large-reward discrimination faster than the small-reward discrimination. Animal behavior and an implementation of the TD algorithm yielded comparable results with respect to the difference between learning curves in the large-reward and small-reward conditions. We conclude that the influence of reward magnitude on the acquisition of a simple discrimination paradigm is accurately reflected by a TD implementation of reinforcement learning.
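
    A minimal TD-style account of this effect, assuming learning is expressed once the learned value crosses a fixed decision threshold: with a prediction-error update, larger rewards generate bigger errors and cross the threshold in fewer trials. The function and parameters are illustrative (and reward_magnitude must exceed the threshold for the loop to terminate):

        def trials_to_threshold(reward_magnitude, alpha=0.1, threshold=0.5):
            v, t = 0.0, 0
            while v < threshold:
                v += alpha * (reward_magnitude - v)  # prediction error scales with reward
                t += 1
            return t

        # trials_to_threshold(4.0) < trials_to_threshold(1.0): faster acquisition
        # for the large-reward discrimination, as observed in the pigeons.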

  7. Adding a reward increases the reinforcing value of fruit.

    PubMed

    De Cock, Nathalie; Vervoort, Leentje; Kolsteren, Patrick; Huybregts, Lieven; Van Lippevelde, Wendy; Vangeel, Jolien; Notebaert, Melissa; Beullens, Kathleen; Goossens, Lien; Maes, Lea; Deforche, Benedicte; Braet, Caroline; Eggermont, Steven; Van Camp, John; Lachat, Carl

    2017-02-01

    Adolescents' snack choices could be altered by increasing the reinforcing value (RV) of healthy snacks compared with unhealthy snacks. This study assessed whether the RV of fruit increased by linking it to a reward and whether this increased RV was comparable with the RV of unhealthy snacks alone. Moderation effects of sex, hunger, BMI z-scores and sensitivity to reward were also explored. The RV of snacks was assessed in a sample of 165 adolescents (15·1 (sd 1·5) years, 39·4 % boys and 17·4 % overweight) using a computerised food reinforcement task. Adolescents obtained points for snacks through mouse clicks (responses) following progressive ratio schedules of increasing response requirements. Participants were (computer) randomised to three experimental groups (1:1:1): fruit (n 53), fruit+reward (n 60) or unhealthy snacks (n 69). The RV was evaluated as the total number of responses and the breakpoint (the schedule at which the food reinforcement task was terminated). Multilevel regression analyses (total number of responses) and Cox's proportional hazard regression models (breakpoint) were used. The total number of responses made was not different between fruit+reward and fruit (b -473; 95 % CI -1152, 205, P=0·17) or unhealthy snacks (b 410; 95 % CI -222, 1043, P=0·20). The breakpoint was slightly higher for fruit than fruit+reward (HR 1·34; 95 % CI 1·00, 1·79, P=0·050), whereas no difference between unhealthy snacks and fruit+reward (HR 0·86; 95 % CI 0·62, 1·18, P=0·34) was observed. No indication of moderation was found. Offering rewards slightly increases the RV of fruit and may be a promising strategy to increase healthy food choices. Future studies should, however, explore whether other rewards could reach larger effect sizes.

  8. α-Synuclein gene duplication impairs reward learning.

    PubMed

    Kéri, Szabolcs; Moustafa, Ahmed A; Myers, Catherine E; Benedek, György; Gluck, Mark A

    2010-09-07

    alpha-Synuclein (SNCA) plays an important role in the regulation of dopaminergic neurotransmission and neurodegeneration in Parkinson disease. We investigated reward and punishment learning in asymptomatic carriers of a rare SNCA gene duplication who were healthy siblings of patients with Parkinson disease. Results revealed that healthy SNCA duplication carriers displayed impaired reward and intact punishment learning compared with noncarriers. These results demonstrate that a copy number variation of the SNCA gene is associated with selective impairments on reinforcement learning in asymptomatic carriers without the motor symptoms of Parkinson disease.

  9. α-Synuclein gene duplication impairs reward learning

    PubMed Central

    Kéri, Szabolcs; Moustafa, Ahmed A.; Myers, Catherine E.; Benedek, György; Gluck, Mark A.

    2010-01-01

    α-Synuclein (SNCA) plays an important role in the regulation of dopaminergic neurotransmission and neurodegeneration in Parkinson disease. We investigated reward and punishment learning in asymptomatic carriers of a rare SNCA gene duplication who were healthy siblings of patients with Parkinson disease. Results revealed that healthy SNCA duplication carriers displayed impaired reward and intact punishment learning compared with noncarriers. These results demonstrate that a copy number variation of the SNCA gene is associated with selective impairments on reinforcement learning in asymptomatic carriers without the motor symptoms of Parkinson disease. PMID:20733075

  10. Model-Based Reinforcement Learning under Concurrent Schedules of Reinforcement in Rodents

    ERIC Educational Resources Information Center

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-01-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's…

  12. Modeling the Violation of Reward Maximization and Invariance in Reinforcement Schedules

    PubMed Central

    La Camera, Giancarlo; Richmond, Barry J.

    2008-01-01

    It is often assumed that animals and people adjust their behavior to maximize reward acquisition. In visually cued reinforcement schedules, monkeys make errors in trials that are not immediately rewarded, despite having to repeat error trials. Here we show that error rates are typically smaller in trials equally distant from reward but belonging to longer schedules (referred to as “schedule length effect”). This violates the principles of reward maximization and invariance and cannot be predicted by the standard methods of Reinforcement Learning, such as the method of temporal differences. We develop a heuristic model that accounts for all of the properties of the behavior in the reinforcement schedule task but whose predictions are not different from those of the standard temporal difference model in choice tasks. In the modification of temporal difference learning introduced here, the effect of schedule length emerges spontaneously from the sensitivity to the immediately preceding trial. We also introduce a policy for general Markov Decision Processes, where the decision made at each node is conditioned on the motivation to perform an instrumental action, and show that the application of our model to the reinforcement schedule task and the choice task are special cases of this general theoretical framework. Within this framework, Reinforcement Learning can approach contextual learning with the mixture of empirical findings and principled assumptions that seem to coexist in the best descriptions of animal behavior. As examples, we discuss two phenomena observed in humans that often derive from the violation of the principle of invariance: “framing,” wherein equivalent options are treated differently depending on the context in which they are presented, and the “sunk cost” effect, the greater tendency to continue an endeavor once an investment in money, effort, or time has been made. The schedule length effect might be a manifestation of these phenomena.

  13. Mate call as reward: Acoustic communication signals can acquire positive reinforcing values during adulthood in female zebra finches (Taeniopygia guttata).

    PubMed

    Hernandez, Alexandra M; Perez, Emilie C; Mulard, Hervé; Mathevon, Nicolas; Vignal, Clémentine

    2016-02-01

    Social stimuli can have rewarding properties and promote learning. In birds, conspecific vocalizations like song can act as a reinforcer, and specific song variants can acquire particular rewarding values during early life exposure. Here we ask if, during adulthood, an acoustic signal simpler and shorter than song can become a reward for a female songbird because of its particular social value. Using an operant choice apparatus, we showed that female zebra finches display a preferential response toward their mate's calls. This reinforcing value of mate's calls could be involved in the maintenance of the monogamous pair-bond of the zebra finch.

  14. Do learning rates adapt to the distribution of rewards?

    PubMed

    Gershman, Samuel J

    2015-10-01

    Studies of reinforcement learning have shown that humans learn differently in response to positive and negative reward prediction errors, a phenomenon that can be captured computationally by positing asymmetric learning rates. This asymmetry, motivated by neurobiological and cognitive considerations, has been invoked to explain learning differences across the lifespan as well as a range of psychiatric disorders. Recent theoretical work, motivated by normative considerations, has hypothesized that the learning rate asymmetry should be modulated by the distribution of rewards across the available options. In particular, the learning rate for negative prediction errors should be higher than the learning rate for positive prediction errors when the average reward rate is high, and this relationship should reverse when the reward rate is low. We tested this hypothesis in a series of experiments. Contrary to the theoretical predictions, we found that the asymmetry was largely insensitive to the average reward rate; instead, the dominant pattern was a higher learning rate for negative than for positive prediction errors, possibly reflecting risk aversion.
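
    The asymmetry at issue is commonly modelled with separate learning rates for positive and negative prediction errors; a minimal sketch, with the default values chosen purely to reflect the dominant pattern reported here (higher learning rate for negative errors):

        def asymmetric_update(v, reward, alpha_pos=0.20, alpha_neg=0.35):
            # Rescorla-Wagner update with valence-dependent learning rates.
            delta = reward - v                  # prediction error
            alpha = alpha_pos if delta >= 0 else alpha_neg
            return v + alpha * delta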

  15. Dose Dependent Dopaminergic Modulation of Reward-Based Learning in Parkinson's Disease

    ERIC Educational Resources Information Center

    van Wouwe, N. C.; Ridderinkhof, K. R.; Band, G. P. H.; van den Wildenberg, W. P. M.; Wylie, S. A.

    2012-01-01

    Learning to select optimal behavior in new and uncertain situations is a crucial aspect of living and requires the ability to quickly associate stimuli with actions that lead to rewarding outcomes. Mathematical models of reinforcement-based learning to select rewarding actions distinguish between (1) the formation of stimulus-action-reward…

  17. A universal role of the ventral striatum in reward-based learning: Evidence from human studies

    PubMed Central

    Daniel, Reka; Pollmann, Stefan

    2014-01-01

    Reinforcement learning enables organisms to adjust their behavior in order to maximize rewards. Electrophysiological recordings of dopaminergic midbrain neurons have shown that they code the difference between actual and predicted rewards, i.e., the reward prediction error, in many species. This error signal is conveyed to both the striatum and cortical areas and is thought to play a central role in learning to optimize behavior. However, in human daily life rewards are diverse and often only indirect feedback is available. Here we explore the range of rewards that are processed by the dopaminergic system in human participants, and examine whether it is also involved in learning in the absence of explicit rewards. While results from electrophysiological recordings in humans are sparse, evidence linking dopaminergic activity to the metabolic signal recorded from the midbrain and striatum with functional magnetic resonance imaging (fMRI) is available. Results from fMRI studies suggest that the human ventral striatum (VS) receives valuation information for a diverse set of rewarding stimuli. These range from simple primary reinforcers such as juice rewards, through abstract social rewards, to internally generated signals of perceived correctness, suggesting that the VS is involved in learning from trial-and-error irrespective of the specific nature of provided rewards. In addition, we summarize evidence that the VS can also be implicated when learning from observing others, and in tasks that go beyond simple stimulus-action-outcome learning, indicating that the reward system is also recruited in more complex learning tasks. PMID:24825620

  18. Reinforcement Learning or Active Inference?

    PubMed Central

    Friston, Karl J.; Daunizeau, Jean; Kiebel, Stefan J.

    2009-01-01

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain. PMID:19641614

  19. Reinforcement learning or active inference?

    PubMed

    Friston, Karl J; Daunizeau, Jean; Kiebel, Stefan J

    2009-07-29

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

  20. Manifold Regularized Reinforcement Learning.

    PubMed

    Li, Hongliang; Liu, Derong; Wang, Ding

    2017-01-27

    This paper introduces a novel manifold regularized reinforcement learning scheme for continuous Markov decision processes. Smooth feature representations for value function approximation can be automatically learned using the unsupervised manifold regularization method. The learned features are data-driven, and can be adapted to the geometry of the state space. Furthermore, the scheme provides a direct basis representation extension for novel samples during policy learning and control. The performance of the proposed scheme is evaluated on two benchmark control tasks, i.e., the inverted pendulum and the energy storage problem. Simulation results illustrate the concepts of the proposed scheme and show that it can obtain excellent performance.
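
    The following sketch illustrates one Laplacian-based way to learn smooth, geometry-adapted features from sampled states, in the spirit of proto-value functions; the paper's actual manifold regularization scheme differs in detail, and the function name and parameters (k, n_features) are illustrative choices:

        import numpy as np
        from scipy.sparse.csgraph import laplacian

        def manifold_features(states, k=5, n_features=10):
            # Build a symmetric k-nearest-neighbour graph over sampled states
            # (shape: n x dim) and take the smoothest Laplacian eigenvectors
            # as basis functions for value-function approximation.
            n = len(states)
            d = np.linalg.norm(states[:, None] - states[None, :], axis=-1)
            w = np.zeros((n, n))
            for i in range(n):
                for j in np.argsort(d[i])[1:k + 1]:   # skip self at index 0
                    w[i, j] = w[j, i] = 1.0
            lap = laplacian(w, normed=True)
            _, vecs = np.linalg.eigh(lap)       # eigenvalues in ascending order
            return vecs[:, :n_features]         # low-frequency, geometry-adapted features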

  1. Learning Reward Uncertainty in the Basal Ganglia

    PubMed Central

    Bogacz, Rafal

    2016-01-01

    Learning the reliability of different sources of rewards is critical for making optimal choices. However, despite the existence of detailed theory describing how the expected reward is learned in the basal ganglia, it is not known how reward uncertainty is estimated in these circuits. This paper presents a class of models that encode both the mean reward and the spread of the rewards, the former in the difference between the synaptic weights of D1 and D2 neurons, and the latter in their sum. In the models, the tendency to seek (or avoid) options with variable reward can be controlled by increasing (or decreasing) the tonic level of dopamine. The models are consistent with the physiology of and synaptic plasticity in the basal ganglia, they explain the effects of dopaminergic manipulations on choices involving risks, and they make multiple experimental predictions. PMID:27589489
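
    A minimal sketch of this class of models, assuming single scalar "Go" (D1) and "NoGo" (D2) weights, a rectified prediction-error update, and weight decay; the parameter names and values are illustrative, not the paper's:

        def d1_d2_update(g, n, reward, alpha=0.05, decay=0.02):
            # Positive prediction errors strengthen D1 ("Go") weights, negative
            # ones strengthen D2 ("NoGo") weights; both decay slowly.
            delta = reward - (g - n)            # error relative to expected reward
            g += alpha * max(delta, 0) - decay * g
            n += alpha * max(-delta, 0) - decay * n
            return g, n

        # Readout in this scheme: (g - n) tracks the mean reward,
        # while (g + n) grows with the spread of the rewards.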

  2. Social stress reactivity alters reward and punishment learning.

    PubMed

    Cavanagh, James F; Frank, Michael J; Allen, John J B

    2011-06-01

    To examine how stress affects cognitive functioning, individual differences in trait vulnerability (punishment sensitivity) and state reactivity (negative affect) to social evaluative threat were examined during concurrent reinforcement learning. Lower trait-level punishment sensitivity predicted better reward learning and poorer punishment learning; the opposite pattern was found in more punishment sensitive individuals. Increasing state-level negative affect was directly related to punishment learning accuracy in highly punishment sensitive individuals, but these measures were inversely related in less sensitive individuals. Combined electrophysiological measurement, performance accuracy and computational estimations of learning parameters suggest that trait and state vulnerability to stress alter cortico-striatal functioning during reinforcement learning, possibly mediated via medio-frontal cortical systems.

  3. Learned reward association improves visual working memory.

    PubMed

    Gong, Mengyuan; Li, Sheng

    2014-04-01

    Statistical regularities in the natural environment play a central role in adaptive behavior. Among other regularities, reward association is potentially the most prominent factor that influences our daily life. Recent studies have suggested that pre-established reward association yields strong influence on the spatial allocation of attention. Here we show that reward association can also improve visual working memory (VWM) performance when the reward-associated feature is task-irrelevant. We established the reward association during a visual search training session, and investigated the representation of reward-associated features in VWM by the application of a change detection task before and after the training. The results showed that the improvement in VWM was significantly greater for items in the color associated with high reward than for those in low reward-associated or nonrewarded colors. In particular, the results from control experiments demonstrate that the observed reward effect in VWM could not be sufficiently accounted for by attentional capture toward the high reward-associated item. This was further confirmed when the effect of attentional capture was minimized by presenting the items in the sample and test displays of the change detection task with the same color. The results showed significantly larger improvement in VWM performance when the items in a display were in the high reward-associated color than those in the low reward-associated or nonrewarded colors. Our findings suggest that, apart from inducing space-based attentional capture, the learned reward association could also facilitate the perceptual representation of high reward-associated items through feature-based attentional modulation.

  4. Learning Contextual Reward Expectations for Value Adaptation.

    PubMed

    Rigoli, Francesco; Chew, Benjamin; Dayan, Peter; Dolan, Raymond J

    2017-09-26

    Substantial evidence indicates that subjective value is adapted to the statistics of reward expected within a given temporal context. However, how these contextual expectations are learnt is poorly understood. To examine such learning, we exploited a recent observation that participants performing a gambling task adjust their preferences as a function of context. We show that, in the absence of contextual cues providing reward information, an average reward expectation was learned from recent past experience. Learning dependent on contextual cues emerged when two contexts alternated at a fast rate, whereas both cue-independent and cue-dependent forms of learning were apparent when two contexts alternated at a slower rate. Motivated by these behavioral findings, we reanalyzed a previous fMRI data set to probe the neural substrates of learning contextual reward expectations. We observed a form of reward prediction error related to average reward such that, at option presentation, activity in ventral tegmental area/substantia nigra and ventral striatum correlated positively and negatively, respectively, with the actual and predicted value of options. Moreover, an inverse correlation between activity in ventral tegmental area/substantia nigra (but not striatum) and predicted option value was greater in participants showing enhanced choice adaptation to context. These findings help clarify the mechanisms underlying the learning of contextual reward expectations.
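
    A minimal sketch of the two learning processes suggested here, assuming a slow, cue-independent average-reward estimate maintained alongside fast, cue-specific expectations; the function and rate parameters are illustrative:

        def update_expectations(r_bar_global, r_bar_cue, cue, reward,
                                alpha_fast=0.3, alpha_slow=0.05):
            # Cue-independent expectation: a slow running average of recent rewards.
            r_bar_global += alpha_slow * (reward - r_bar_global)
            # Cue-dependent expectation: a fast estimate kept per context cue.
            r_bar_cue[cue] += alpha_fast * (reward - r_bar_cue[cue])
            return r_bar_global

        # usage: r_bar_global = update_expectations(r_bar_global, r_bar_cue, cue, r)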

  5. Reduced reward-related probability learning in schizophrenia patients.

    PubMed

    Yılmaz, Alpaslan; Simsek, Fatma; Gonul, Ali Saffet

    2012-01-01

    Although it is known that individuals with schizophrenia demonstrate marked impairment in reinforcement learning, the details of this impairment are not known. The aim of this study was to test the hypothesis that reward-related probability learning is altered in schizophrenia patients. Twenty-five clinically stable schizophrenia patients and 25 age- and gender-matched controls participated in the study. A simple gambling paradigm was used in which five different cues were associated with different reward probabilities (50%, 67%, and 100%). Participants were asked to make their best guess about the reward probability of each cue. Compared with controls, patients had significant impairment in learning contingencies on the basis of reward-related feedback. The correlation analyses revealed that the impairment of patients partially correlated with the severity of negative symptoms as measured on the Positive and Negative Syndrome Scale but that it was not related to antipsychotic dose. In conclusion, the present study showed that the schizophrenia patients had impaired reward-based learning and that this was independent from their medication status.

  6. How instructed knowledge modulates the neural systems of reward learning

    PubMed Central

    Delgado, Mauricio R.; Phelps, Elizabeth A.

    2011-01-01

    Recent research in neuroeconomics has demonstrated that the reinforcement learning model of reward learning captures the patterns of both behavioral performance and neural responses during a range of economic decision-making tasks. However, this powerful theoretical model has its limits. Trial-and-error is only one of the means by which individuals can learn the value associated with different decision options. Humans have also developed efficient, symbolic means of communication for learning without the necessity for committing multiple errors across trials. In the present study, we observed that instructed knowledge of cue-reward probabilities improves behavioral performance and diminishes reinforcement learning-related blood-oxygen level-dependent (BOLD) responses to feedback in the nucleus accumbens, ventromedial prefrontal cortex, and hippocampal complex. The decrease in BOLD responses in these brain regions to reward-feedback signals was functionally correlated with activation of the dorsolateral prefrontal cortex (DLPFC). These results suggest that when learning action values, participants use the DLPFC to dynamically adjust outcome responses in valuation regions depending on the usefulness of action-outcome information. PMID:21173266

  7. Mapping anhedonia onto reinforcement learning: a behavioural meta-analysis

    PubMed Central

    2013-01-01

    Background: Depression is characterised partly by blunted reactions to reward. However, tasks probing this deficiency have not distinguished insensitivity to reward from insensitivity to the prediction errors for reward that determine learning and are putatively reported by the phasic activity of dopamine neurons. We attempted to disentangle these factors with respect to anhedonia in the context of stress, Major Depressive Disorder (MDD), Bipolar Disorder (BPD) and a dopaminergic challenge. Methods: Six behavioural datasets involving 392 experimental sessions were subjected to a model-based, Bayesian meta-analysis. Participants across all six studies performed a probabilistic reward task that used an asymmetric reinforcement schedule to assess reward learning. Healthy controls were tested under baseline conditions, stress or after receiving the dopamine D2 agonist pramipexole. In addition, participants with current or past MDD or BPD were evaluated. Reinforcement learning models isolated the contributions of variation in reward sensitivity and learning rate. Results: MDD and anhedonia reduced reward sensitivity more than they affected the learning rate, while a low dose of the dopamine D2 agonist pramipexole showed the opposite pattern. Stress led to a pattern consistent with a mixed effect on reward sensitivity and learning rate. Conclusion: Reward-related learning reflected at least two partially separable contributions. The first related to phasic prediction error signalling, and was preferentially modulated by a low dose of the dopamine agonist pramipexole. The second related directly to reward sensitivity, and was preferentially reduced in MDD and anhedonia. Stress altered both components. Collectively, these findings highlight the contribution of model-based reinforcement learning meta-analysis for dissecting anhedonic behavior. PMID:23782813
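
    The two separable contributions map naturally onto a Rescorla-Wagner-style rule with a reward-sensitivity parameter that scales the impact of reward, distinct from the learning rate; a minimal sketch with illustrative names:

        def rw_with_sensitivity(v, reward, rho=1.0, alpha=0.1):
            # rho scales the subjective impact of reward (blunted in MDD/anhedonia);
            # alpha is the learning rate (modulated here by pramipexole).
            return v + alpha * (rho * reward - v)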

  8. The Computational Development of Reinforcement Learning during Adolescence

    PubMed Central

    Palminteri, Stefano; Coricelli, Giorgio; Blakemore, Sarah-Jayne

    2016-01-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents’ behaviour was better explained by a basic reinforcement learning algorithm, adults’ behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence. PMID:27322574
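
    A minimal sketch of a counterfactual learning module of the kind described, assuming complete feedback delivers the foregone outcome of the unchosen option; the function name and rates are illustrative:

        def counterfactual_update(v, chosen, unchosen, r_obtained, r_foregone,
                                  alpha=0.2, alpha_cf=0.2):
            # Factual update from the obtained outcome of the chosen option.
            v[chosen] += alpha * (r_obtained - v[chosen])
            # Counterfactual update from the foregone outcome of the unchosen
            # option: the feature adults exploited but adolescents did not.
            v[unchosen] += alpha_cf * (r_foregone - v[unchosen])
            return v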

  9. Nucleus Accumbens Core and Shell Differentially Encode Reward-Associated Cues after Reinforcer Devaluation

    PubMed Central

    West, Elizabeth A.

    2016-01-01

    Nucleus accumbens (NAc) neurons encode features of stimulus learning and action selection associated with rewards. The NAc is necessary for using information about expected outcome values to guide behavior after reinforcer devaluation. Evidence suggests that core and shell subregions may play dissociable roles in guiding motivated behavior. Here, we recorded neural activity in the NAc core and shell during training and performance of a reinforcer devaluation task. Long–Evans male rats were trained that presses on a lever under an illuminated cue light delivered a flavored sucrose reward. On subsequent test days, each rat was given free access to one of two distinctly flavored foods to consume to satiation and were then immediately tested on the lever pressing task under extinction conditions. Rats decreased pressing on the test day when the reinforcer earned during training was the sated flavor (devalued) compared with the test day when the reinforcer was not the sated flavor (nondevalued), demonstrating evidence of outcome-selective devaluation. Cue-selective encoding during training by NAc core (but not shell) neurons reliably predicted subsequent behavioral performance; that is, the greater the percentage of neurons that responded to the cue, the better the rats suppressed responding after devaluation. In contrast, NAc shell (but not core) neurons significantly decreased cue-selective encoding in the devalued condition compared with the nondevalued condition. These data reveal that NAc core and shell neurons encode information differentially about outcome-specific cues after reinforcer devaluation that are related to behavioral performance and outcome value, respectively. SIGNIFICANCE STATEMENT Many neuropsychiatric disorders are marked by impairments in behavioral flexibility. Although the nucleus accumbens (NAc) is required for behavioral flexibility, it is not known how NAc neurons encode this information. Here, we recorded NAc neurons during a training

  10. Nucleus Accumbens Core and Shell Differentially Encode Reward-Associated Cues after Reinforcer Devaluation.

    PubMed

    West, Elizabeth A; Carelli, Regina M

    2016-01-27

    Nucleus accumbens (NAc) neurons encode features of stimulus learning and action selection associated with rewards. The NAc is necessary for using information about expected outcome values to guide behavior after reinforcer devaluation. Evidence suggests that core and shell subregions may play dissociable roles in guiding motivated behavior. Here, we recorded neural activity in the NAc core and shell during training and performance of a reinforcer devaluation task. Long-Evans male rats were trained that presses on a lever under an illuminated cue light delivered a flavored sucrose reward. On subsequent test days, each rat was given free access to one of two distinctly flavored foods to consume to satiation and were then immediately tested on the lever pressing task under extinction conditions. Rats decreased pressing on the test day when the reinforcer earned during training was the sated flavor (devalued) compared with the test day when the reinforcer was not the sated flavor (nondevalued), demonstrating evidence of outcome-selective devaluation. Cue-selective encoding during training by NAc core (but not shell) neurons reliably predicted subsequent behavioral performance; that is, the greater the percentage of neurons that responded to the cue, the better the rats suppressed responding after devaluation. In contrast, NAc shell (but not core) neurons significantly decreased cue-selective encoding in the devalued condition compared with the nondevalued condition. These data reveal that NAc core and shell neurons encode information differentially about outcome-specific cues after reinforcer devaluation that are related to behavioral performance and outcome value, respectively. Many neuropsychiatric disorders are marked by impairments in behavioral flexibility. Although the nucleus accumbens (NAc) is required for behavioral flexibility, it is not known how NAc neurons encode this information. Here, we recorded NAc neurons during a training session in which rats

  11. Multiplexing signals in reinforcement learning with internal models and dopamine.

    PubMed

    Nakahara, Hiroyuki

    2014-04-01

    A fundamental challenge for computational and cognitive neuroscience is to understand how reward-based learning and decision-making are made and how accrued knowledge and internal models of the environment are incorporated. Remarkable progress has been made in the field, guided by the midbrain dopamine reward prediction error hypothesis and the underlying reinforcement learning framework, which does not involve internal models ('model-free'). Recent studies, however, have begun not only to address more complex decision-making processes that are integrated with model-free decision-making, but also to include internal models about environmental reward structures and the minds of other agents, including model-based reinforcement learning and using generalized prediction errors. Even dopamine, a classic model-free signal, may work as multiplexed signals using model-based information and contribute to representational learning of reward structure.

  12. Individual differences in reinforcement learning: behavioral, electrophysiological, and neuroimaging correlates.

    PubMed

    Santesso, Diane L; Dillon, Daniel G; Birk, Jeffrey L; Holmes, Avram J; Goetz, Elena; Bogdan, Ryan; Pizzagalli, Diego A

    2008-08-15

    During reinforcement learning, phasic modulations of activity in midbrain dopamine neurons are conveyed to the dorsal anterior cingulate cortex (dACC) and basal ganglia (BG) and serve to guide adaptive responding. While the animal literature supports a role for the dACC in integrating reward history over time, most human electrophysiological studies of dACC function have focused on responses to single positive and negative outcomes. The present electrophysiological study investigated the role of the dACC in probabilistic reward learning in healthy subjects using a task that required integration of reinforcement history over time. We recorded the feedback-related negativity (FRN) to reward feedback in subjects who developed a response bias toward a more frequently rewarded ("rich") stimulus ("learners") versus subjects who did not ("non-learners"). Compared to non-learners, learners showed more positive (i.e., smaller) FRNs and greater dACC activation upon receiving reward for correct identification of the rich stimulus. In addition, dACC activation and a bias to select the rich stimulus were positively correlated. The same participants also completed a monetary incentive delay (MID) task administered during functional magnetic resonance imaging. Compared to non-learners, learners displayed stronger BG responses to reward in the MID task. These findings raise the possibility that learners in the probabilistic reinforcement task were characterized by stronger dACC and BG responses to rewarding outcomes. Furthermore, these results highlight the importance of the dACC to probabilistic reward learning in humans.

  13. Reinforcement Learning Trees.

    PubMed

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings.

  14. Reinforcement Learning Trees

    PubMed Central

    Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R.

    2015-01-01

    In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings. PMID:26903687

  15. Reward and non-reward learning of flower colours in the butterfly Byasa alcinous (Lepidoptera: Papilionidae)

    NASA Astrophysics Data System (ADS)

    Kandori, Ikuo; Yamaki, Takafumi

    2012-09-01

    Learning plays an important role in food acquisition for a wide range of insects. To increase their foraging efficiency, flower-visiting insects may learn to associate floral cues with the presence (so-called reward learning) or the absence (so-called non-reward learning) of a reward. Reward learning whilst foraging for flowers has been demonstrated in many insect taxa, whilst non-reward learning in flower-visiting insects has been demonstrated only in honeybees, bumblebees and hawkmoths. This study examined both reward and non-reward learning abilities in the butterfly Byasa alcinous whilst foraging among artificial flowers of different colours. This butterfly showed both types of learning, although butterflies of both sexes learned faster via reward learning. In addition, females learned via reward learning faster than males. To the best of our knowledge, these are the first empirical data on the learning speed of both reward and non-reward learning in insects. We discuss the adaptive significance of a lower learning speed for non-reward learning when foraging on flowers.

  16. Dopamine, reward learning, and active inference

    PubMed Central

    FitzGerald, Thomas H. B.; Dolan, Raymond J.; Friston, Karl

    2015-01-01

    Temporal difference learning models propose phasic dopamine signaling encodes reward prediction errors that drive learning. This is supported by studies where optogenetic stimulation of dopamine neurons can stand in lieu of actual reward. Nevertheless, a large body of data also shows that dopamine is not necessary for learning, and that dopamine depletion primarily affects task performance. We offer a resolution to this paradox based on an hypothesis that dopamine encodes the precision of beliefs about alternative actions, and thus controls the outcome-sensitivity of behavior. We extend an active inference scheme for solving Markov decision processes to include learning, and show that simulated dopamine dynamics strongly resemble those actually observed during instrumental conditioning. Furthermore, simulated dopamine depletion impairs performance but spares learning, while simulated excitation of dopamine neurons drives reward learning, through aberrant inference about outcome states. Our formal approach provides a novel and parsimonious reconciliation of apparently divergent experimental findings. PMID:26581305
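
    The proposed role for dopamine (precision rather than prediction error) can be caricatured as a single precision parameter scaling a softmax over expected action values; a schematic Python sketch of our own, not the authors' full active inference scheme:

        import numpy as np

        def action_probabilities(values, precision):
            # Softmax over expected values; `precision` plays the role the
            # paper ascribes to dopamine, controlling outcome-sensitivity.
            z = precision * np.asarray(values, dtype=float)
            z -= z.max()                    # numerical stability
            p = np.exp(z)
            return p / p.sum()

        values = [1.0, 0.5, 0.0]            # learned values are left untouched
        print(action_probabilities(values, precision=4.0))   # intact: decisive
        print(action_probabilities(values, precision=0.2))   # 'depleted': near-random

    On this reading, lowering precision degrades performance while sparing the values themselves, consistent with the depletion findings summarized above.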

  17. Coexistence of Reward and Unsupervised Learning During the Operant Conditioning of Neural Firing Rates

    PubMed Central

    Kerr, Robert R.; Grayden, David B.; Thomas, Doreen A.; Gilson, Matthieu; Burkitt, Anthony N.

    2014-01-01

    A fundamental goal of neuroscience is to understand how cognitive processes, such as operant conditioning, are performed by the brain. Typical and well studied examples of operant conditioning, in which the firing rates of individual cortical neurons in monkeys are increased using rewards, provide an opportunity for insight into this. Studies of reward-modulated spike-timing-dependent plasticity (RSTDP), and of other models such as R-max, have reproduced this learning behavior, but they have assumed that no unsupervised learning is present (i.e., no learning occurs without, or independent of, rewards). We show that these models cannot elicit firing rate reinforcement while exhibiting both reward learning and ongoing, stable unsupervised learning. To fix this issue, we propose a new RSTDP model of synaptic plasticity based upon the observed effects that dopamine has on long-term potentiation and depression (LTP and LTD). We show, both analytically and through simulations, that our new model can exhibit unsupervised learning and lead to firing rate reinforcement. This requires that the strengthening of LTP by the reward signal is greater than the strengthening of LTD and that the reinforced neuron exhibits irregular firing. We show the robustness of our findings to spike-timing correlations, to the synaptic weight dependence that is assumed, and to changes in the mean reward. We also consider our model in the differential reinforcement of two nearby neurons. Our model aligns more strongly with experimental studies than previous models and makes testable predictions for future experiments. PMID:24475240
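
    The model's key requirement (reward strengthens LTP more than LTD, on top of a stable unsupervised baseline) can be sketched as a reward-modulated STDP window; a simplified caricature with invented constants, not the authors' exact equations:

        import numpy as np

        A_LTP, A_LTD = 1.0, 1.1     # baseline (unsupervised) STDP amplitudes
        K_LTP, K_LTD = 2.0, 1.2     # reward scaling, with K_LTP > K_LTD as required
        TAU = 20.0                  # STDP time constant (ms)

        def dw(dt, reward):
            # Weight change for one spike pair, dt = t_post - t_pre (ms).
            # With reward = 0 the rule reduces to ordinary unsupervised STDP;
            # reward scales potentiation more strongly than depression.
            if dt > 0:      # pre before post: potentiation
                return (A_LTP + K_LTP * reward) * np.exp(-dt / TAU)
            else:           # post before pre: depression
                return -(A_LTD + K_LTD * reward) * np.exp(dt / TAU)

        for r in (0.0, 0.5):
            print(r, dw(+10.0, r), dw(-10.0, r))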

  18. Coexistence of reward and unsupervised learning during the operant conditioning of neural firing rates.

    PubMed

    Kerr, Robert R; Grayden, David B; Thomas, Doreen A; Gilson, Matthieu; Burkitt, Anthony N

    2014-01-01

    A fundamental goal of neuroscience is to understand how cognitive processes, such as operant conditioning, are performed by the brain. Typical and well studied examples of operant conditioning, in which the firing rates of individual cortical neurons in monkeys are increased using rewards, provide an opportunity for insight into this. Studies of reward-modulated spike-timing-dependent plasticity (RSTDP), and of other models such as R-max, have reproduced this learning behavior, but they have assumed that no unsupervised learning is present (i.e., no learning occurs without, or independent of, rewards). We show that these models cannot elicit firing rate reinforcement while exhibiting both reward learning and ongoing, stable unsupervised learning. To fix this issue, we propose a new RSTDP model of synaptic plasticity based upon the observed effects that dopamine has on long-term potentiation and depression (LTP and LTD). We show, both analytically and through simulations, that our new model can exhibit unsupervised learning and lead to firing rate reinforcement. This requires that the strengthening of LTP by the reward signal is greater than the strengthening of LTD and that the reinforced neuron exhibits irregular firing. We show the robustness of our findings to spike-timing correlations, to the synaptic weight dependence that is assumed, and to changes in the mean reward. We also consider our model in the differential reinforcement of two nearby neurons. Our model aligns more strongly with experimental studies than previous models and makes testable predictions for future experiments.

  19. Interrelated mechanisms in reward and learning.

    PubMed

    Lajtha, Abel

    2008-01-01

    This brief review focuses on recent work in our laboratory, in which we assayed nicotine-induced neurotransmitter changes, compared them to changes induced by other compounds, and examined the receptor systems and their interactions that mediate the changes. The primary aim of our studies is to examine the role of neurotransmitter changes in reward and learning processes. We find that these processes are interlinked and interact: reward-addiction mechanisms include processes of learning, and learning-memory mechanisms include processes of reward. Despite being interlinked, the two processes have different functions and distinct properties, and our long-term aim is to identify the factors that control these processes and the differences between them. Here, we discuss reward processes, which we define as changes examined after administration of nicotine, cocaine, or food, each of which induces changes in neurotransmitter levels and functions in cognitive areas as well as in reward areas. The changes are regionally heterogeneous and are drug or stimulus specific. They include changes in the transmitters assayed (catecholamines, amino acids, and acetylcholine) and also in their metabolites; hence, in addition to release, uptake and metabolism are involved. Many receptors modulate the response with direct and indirect effects. The involvement of many transmitters, receptors, and their interactions, together with the stimulus specificity of the response, indicates that reward and learning each involve a different pattern of changes for a different stimulus; therefore, many different learning and many different reward processes are active, allowing stimulus-specific responses. The complex pattern of reward-induced changes in neurotransmitters is only a part of the multiple changes observed, but one with a crucial, controlling function.

  20. Social Influence as Reinforcement Learning

    DTIC Science & Technology

    2016-01-13

    Final Report: Social Influence as Reinforcement Learning. This project examined a reinforcement learning model of conformity and social influence. Under this model, individuals… Approved for public release; distribution unlimited.

  1. Neural Basis of Reinforcement Learning and Decision Making

    PubMed Central

    Lee, Daeyeol; Seo, Hyojung; Jung, Min Whan

    2012-01-01

    Reinforcement learning is an adaptive process in which an animal utilizes its previous experience to improve the outcomes of future choices. Computational theories of reinforcement learning play a central role in the newly emerging areas of neuroeconomics and decision neuroscience. In this framework, actions are chosen according to their value functions, which describe how much future reward is expected from each action. Value functions can be adjusted not only through reward and penalty, but also by the animal’s knowledge of its current environment. Studies have revealed that a large proportion of the brain is involved in representing and updating value functions and using them to choose an action. However, how the nature of a behavioral task affects the neural mechanisms of reinforcement learning remains incompletely understood. Future studies should uncover the principles by which different computational elements of reinforcement learning are dynamically coordinated across the entire brain. PMID:22462543
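
    The bookkeeping described here (values adjusted by reward and penalty, then used to choose actions) is commonly formalized as in the following minimal tabular sketch; this is standard Q-learning with softmax choice, illustrative rather than tied to any one study in this listing:

        import numpy as np

        rng = np.random.default_rng(1)
        alpha, beta = 0.2, 3.0          # learning rate, inverse temperature
        q = np.zeros(2)                 # value functions: expected reward per action
        p_reward = [0.8, 0.2]           # true (hidden) reward probabilities

        for trial in range(500):
            p = np.exp(beta * q) / np.exp(beta * q).sum()   # choose by value
            a = rng.choice(2, p=p)
            r = float(rng.random() < p_reward[a])           # reward outcome
            q[a] += alpha * (r - q[a])                      # update value function

        print(q)    # approaches the true reward probabilities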

  2. Time-Extended Policies in Multi-Agent Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2004-01-01

    Reinforcement learning methods perform well in many domains where a single agent needs to take a sequence of actions to perform a task. These methods use sequences of single-time-step rewards to create a policy that tries to maximize a time-extended utility, which is a (possibly discounted) sum of these rewards. In this paper we build on our previous work showing how these methods can be extended to a multi-agent environment where each agent creates its own policy that works towards maximizing a time-extended global utility over all agents' actions. We show improved methods for creating time-extended utilities for the agents that are both "aligned" with the global utility and "learnable." We then show how to create single-time-step rewards while avoiding the pitfall of having rewards aligned with the global reward lead to utilities not aligned with the global utility. Finally, we apply these reward functions to the multi-agent Gridworld problem. We explicitly quantify a utility's learnability and alignment, and show that reinforcement learning agents using the prescribed reward functions successfully trade off learnability and alignment. As a result, they outperform both global (e.g., team game) and local (e.g., "perfectly learnable") reinforcement learning solutions by as much as an order of magnitude.
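
    One concrete construction in this spirit is the difference reward, which scores an agent by the global utility computed with and without that agent's contribution, so the signal stays aligned with the global utility while depending mostly on the agent's own action. A toy Python sketch follows; the congestion-style global utility is our own invention, not the paper's Gridworld:

        import numpy as np

        def global_utility(counts, capacity=4):
            # Toy congestion utility: each location yields utility that peaks
            # near `capacity` agents and degrades when overcrowded.
            return float(np.sum(counts * np.exp(-counts / capacity)))

        def difference_reward(choices, i, n_locations=3):
            # D_i = G(z) - G(z without agent i): aligned with G, but mostly
            # sensitive to agent i's own action, hence more learnable.
            counts = np.bincount(choices, minlength=n_locations)
            g = global_utility(counts)
            counts[choices[i]] -= 1          # remove agent i's contribution
            return g - global_utility(counts)

        choices = np.array([0, 0, 1, 2, 2, 2])
        print([round(difference_reward(choices, i), 3) for i in range(len(choices))])

    Removing agent i's contribution isolates its effect on the global utility, which is what makes the signal learnable in the sense used above.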

  3. Risk-sensitive reinforcement learning.

    PubMed

    Shen, Yun; Tobia, Michael J; Sommer, Tobias; Obermayer, Klaus

    2014-07-01

    We derive a family of risk-sensitive reinforcement learning methods for agents who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.
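
    The core construction (a utility function applied to the TD error) can be sketched in a two-armed bandit. In the following Python sketch the parameters are illustrative and the piecewise-linear utility is a caricature of the prospect-theory-style transforms the paper considers:

        import numpy as np

        def utility(delta, k_gain=1.0, k_loss=1.5):
            # Piecewise-linear utility of the TD error: losses loom larger
            # than gains (k_loss > k_gain), as in prospect theory.
            return k_gain * delta if delta >= 0 else k_loss * delta

        rng = np.random.default_rng(2)
        q, alpha = np.zeros(2), 0.1
        arms = [(1.0, 0.0, 0.5), (0.5, 0.5, 1.0)]   # (high, low, p_high): risky vs safe

        for _ in range(2000):
            a = rng.integers(2)
            hi, lo, p = arms[a]
            r = hi if rng.random() < p else lo
            delta = r - q[a]                 # TD error (single-step task)
            q[a] += alpha * utility(delta)   # nonlinear transform of the TD error

        print(q)

    With k_loss > k_gain the learned value of the risky arm settles near k_gain / (k_gain + k_loss) = 0.4, below its expected payoff of 0.5, reproducing risk aversion; the safe arm's value stays at 0.5.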

  4. Reward Learning, Neurocognition, Social Cognition, and Symptomatology in Psychosis.

    PubMed

    Lewandowski, Kathryn E; Whitton, Alexis E; Pizzagalli, Diego A; Norris, Lesley A; Ongur, Dost; Hall, Mei-Hua

    2016-01-01

    …symptoms - across diagnoses, and was predictive of worse social cognition. Reward learning was not associated with neurocognitive performance, suggesting that, across patient groups, social cognition but not neurocognition may share common pathways with this aspect of reinforcement learning. Better understanding of how cognitive dysfunction and reward processing deficits relate to one another, to other key symptom dimensions (e.g., psychosis), and to diagnostic categories, may help clarify shared etiological pathways and guide efforts toward targeted treatment approaches.

  5. Mind matters: Placebo enhances reward learning in Parkinson’s disease

    PubMed Central

    Schmidt, Liane; Braun, Erin Kendall; Wager, Tor D.; Shohamy, Daphna

    2015-01-01

    Expectations have a powerful influence on how we experience the world. Neurobiological and computational models of learning suggest that dopamine is crucial for shaping expectations of reward and that expectations alone may influence dopamine levels. However, because expectations and reinforcers are typically manipulated together, the role of expectations per se has remained unclear. Here, we separated these two factors using a placebo dopaminergic manipulation in Parkinson’s patients. We combined a reward learning task with fMRI to test how expectations of dopamine release modulate learning-related activity in the brain. We found that the mere expectation of dopamine release enhances reward learning and modulates learning-related signals in the striatum and the ventromedial prefrontal cortex. These effects were selective to learning from reward: neither medication nor placebo had an effect on learning to avoid monetary loss. These findings suggest a neurobiological mechanism by which expectations shape learning and affect. PMID:25326691

  6. The dissociable effects of punishment and reward on motor learning.

    PubMed

    Galea, Joseph M; Mallia, Elizabeth; Rothwell, John; Diedrichsen, Jörn

    2015-04-01

    A common assumption regarding error-based motor learning (motor adaptation) in humans is that its underlying mechanism is automatic and insensitive to reward- or punishment-based feedback. Contrary to this hypothesis, we show in a double dissociation that the two have independent effects on the learning and retention components of motor adaptation. Negative feedback, whether graded or binary, accelerated learning. While it was not necessary for the negative feedback to be coupled to monetary loss, it had to be clearly related to the actual performance on the preceding movement. Positive feedback did not speed up learning, but it increased retention of the motor memory when performance feedback was withdrawn. These findings reinforce the view that independent mechanisms underpin learning and retention in motor adaptation, reject the assumption that motor adaptation is independent of motivational feedback, and raise new questions regarding the neural basis of negative and positive motivational feedback in motor learning.

  7. Credit assignment during movement reinforcement learning.

    PubMed

    Dam, Gregory; Kording, Konrad; Wei, Kunlin

    2013-01-01

    We often need to learn how to move based on a single performance measure that reflects the overall success of our movements. However, movements have many properties, such as their trajectories, speeds and timing of end-points, so the brain needs to decide which properties of movements should be improved: it needs to solve the credit assignment problem. Currently, little is known about how humans solve credit assignment problems in the context of reinforcement learning. Here we tested how human participants solve such problems during a trajectory-learning task. Without an explicitly-defined target movement, participants made hand reaches and received monetary rewards as feedback on a trial-by-trial basis. The curvature and direction of the attempted reach trajectories determined the monetary rewards received, in a manner that could be manipulated experimentally. Based on the history of action-reward pairs, participants quickly solved the credit assignment problem and learned the implicit payoff function. A Bayesian credit-assignment model with built-in forgetting accurately predicts their trial-by-trial learning.
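
    The study's Bayesian credit-assignment model is specific to its task, but the flavor can be sketched as a posterior over which movement property controls reward, decayed toward uniform each trial to implement forgetting. All names and constants in this Python sketch are hypothetical:

        import numpy as np

        rng = np.random.default_rng(3)
        n_props = 3          # candidate movement properties (e.g. curvature, direction, speed)
        true_prop = 1        # property that actually determines reward
        posterior = np.full(n_props, 1.0 / n_props)
        forget = 0.05        # forgetting: mix posterior back toward uniform

        for trial in range(200):
            features = rng.normal(size=n_props)       # this trial's movement properties
            reward = features[true_prop] + 0.3 * rng.normal()
            # Likelihood of the reward under each "property k is responsible" model
            like = np.exp(-0.5 * (reward - features) ** 2 / 0.3 ** 2)
            posterior *= like
            posterior /= posterior.sum()              # Bayes update
            posterior = (1 - forget) * posterior + forget / n_props  # forgetting

        print(posterior)   # mass concentrates on the truly rewarded property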

  8. How Transitions from Nonrewarded to Rewarded Trials Regulate Responding in Pavlovian and Instrumental Learning Following Extensive Acquisition Training

    ERIC Educational Resources Information Center

    Capaldi, E.J.; Haas, A.; Miller, R.M.; Martins, A.

    2005-01-01

    In both discrimination learning and partial reinforcement, transitions may occur from nonrewarded to rewarded trials (NR transition). In discrimination learning, NR transitions may occur in two different stimulus alternatives (NR different transitions). In partial reward, NR transitions may occur in a single stimulus alternative (NR same…

  9. Reinforcement Learning Through Gradient Descent

    DTIC Science & Technology

    1999-05-14

    Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for… practice of existing types of algorithms, the gradient descent approach makes it possible to create entirely new classes of reinforcement learning algorithms…

  10. Synthetic cathinones and their rewarding and reinforcing effects in rodents

    PubMed Central

    Watterson, Lucas R.; Olive, M. Foster

    2014-01-01

    Synthetic cathinones, colloquially referred to as “bath salts”, are derivatives of the psychoactive alkaloid cathinone found in Catha edulis (Khat). Since the mid-to-late 2000’s, these amphetamine-like psychostimulants have gained popularity amongst drug users due to their potency, low cost, ease of procurement, and constantly evolving chemical structures. Concomitant with their increased use is the emergence of a growing collection of case reports of bizarre and dangerous behaviors, toxicity to numerous organ systems, and death. However, scientific information regarding the abuse liability of these drugs has been relatively slower to materialize. Recently we have published several studies demonstrating that laboratory rodents will readily self-administer the “first generation” synthetic cathinones methylenedioxypyrovalerone (MDPV) and methylone via the intravenous route, in patterns similar to those of methamphetamine. Under progressive ratio schedules of reinforcement, the rank order of reinforcing efficacy of these compounds are MDPV ≥ methamphetamine > methylone. MDPV and methylone, as well as the “second generation” synthetic cathinones α-pyrrolidinovalerophenone (α-PVP) and 4-methylethcathinone (4-MEC), also dose-dependently increase brain reward function. Collectively, these findings indicate that synthetic cathinones have a high abuse and addiction potential and underscore the need for future assessment of the extent and duration of neurotoxicity induced by these emerging drugs of abuse. PMID:25328910

  11. Phasic dopamine as a prediction error of intrinsic and extrinsic reinforcements driving both action acquisition and reward maximization: a simulated robotic study.

    PubMed

    Mirolli, Marco; Santucci, Vieri G; Baldassarre, Gianluca

    2013-03-01

    An important issue of recent neuroscientific research is to understand the functional role of the phasic release of dopamine in the striatum, and in particular its relation to reinforcement learning. The literature is split between two alternative hypotheses: one considers phasic dopamine as a reward prediction error similar to the computational TD-error, whose function is to guide an animal to maximize future rewards; the other holds that phasic dopamine is a sensory prediction error signal that lets the animal discover and acquire novel actions. In this paper we propose an original hypothesis that integrates these two contrasting positions: according to our view phasic dopamine represents a TD-like reinforcement prediction error learning signal determined by both unexpected changes in the environment (temporary, intrinsic reinforcements) and biological rewards (permanent, extrinsic reinforcements). Accordingly, dopamine plays the functional role of driving both the discovery and acquisition of novel actions and the maximization of future rewards. To validate our hypothesis we perform a series of experiments with a simulated robotic system that has to learn different skills in order to get rewards. We compare different versions of the system in which we vary the composition of the learning signal. The results show that only the system reinforced by both extrinsic and intrinsic reinforcements is able to reach high performance in sufficiently complex conditions.
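
    The integration proposed here amounts to summing a permanent extrinsic reward with a temporary intrinsic one before computing a single TD-like error; a schematic Python sketch of our own minimal rendering of the hypothesis:

        import numpy as np

        gamma, alpha = 0.95, 0.1
        Q = {}  # state-action values, keyed by (state, action)

        def td_error(s, a, r_extrinsic, r_intrinsic, s_next, actions):
            # One TD-like reinforcement prediction error combining a permanent
            # extrinsic reward with a temporary intrinsic one (e.g. surprise at
            # an unexpected environmental change, which habituates over time).
            r = r_extrinsic + r_intrinsic
            v_next = max(Q.get((s_next, b), 0.0) for b in actions)
            delta = r + gamma * v_next - Q.get((s, a), 0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
            return delta

        # Early on, a novel light turning on yields intrinsic reward even
        # in the absence of any food reward:
        print(td_error('s0', 'press', r_extrinsic=0.0, r_intrinsic=1.0,
                       s_next='s1', actions=['press', 'wait']))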

  12. Contextual modulation of value signals in reward and punishment learning.

    PubMed

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-08-25

    Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values on a relative (context-dependent) scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of option values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system.
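
    The proposed computation (option values learned relative to a context value that acts as the reference point) can be sketched as follows; the punishment-avoidance simulation and constants in this Python sketch are illustrative:

        import numpy as np

        alpha_v, alpha_q = 0.1, 0.2
        context_value = 0.0            # V(c): learned reference point for this context
        option_values = {'avoid': 0.0, 'other': 0.0}

        def update(option, outcome):
            # Outcomes are recentred on the context value before updating the
            # option value, so avoidance in a bad context gains positive value.
            global context_value
            relative = outcome - context_value        # reference-point subtraction
            option_values[option] += alpha_q * (relative - option_values[option])
            context_value += alpha_v * (outcome - context_value)

        rng = np.random.default_rng(4)
        for _ in range(300):
            # Punishment context: outcomes are -1 (shock) or 0 (successful avoidance)
            option = 'avoid' if rng.random() < 0.7 else 'other'
            outcome = 0.0 if option == 'avoid' else -1.0
            update(option, outcome)

        print(context_value, option_values)   # 'avoid' acquires a positive value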

  13. Contextual modulation of value signals in reward and punishment learning

    PubMed Central

    Palminteri, Stefano; Khamassi, Mehdi; Joffily, Mateus; Coricelli, Giorgio

    2015-01-01

    Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values on a relative (context-dependent) scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of option values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system. PMID:26302782

  14. Statistical Mechanics of the Delayed Reward-Based Learning with Node Perturbation

    NASA Astrophysics Data System (ADS)

    Saito, Hiroshi; Katahira, Kentaro; Okanoya, Kazuo; Okada, Masato

    2010-06-01

    In reward-based learning, reward is typically given with some delay after the behavior that causes it. In the machine learning literature, the framework of the eligibility trace has been used as one solution for handling delayed reward in reinforcement learning. Recent studies suggest that the eligibility trace is also important for a difficult neuroscience problem known as the "distal reward problem". Node perturbation is a stochastic gradient method, one of many reinforcement learning implementations, that estimates the gradient by introducing perturbations into a network. Since stochastic gradient methods do not require the derivative of an objective function, they may be able to account for the learning mechanisms of a complex system such as the brain. We study node perturbation with the eligibility trace as a specific example of delayed reward-based learning and analyze it using a statistical mechanics approach. As a result, we show the optimal time constant of the eligibility trace with respect to the reward delay and the existence of unlearnable parameter configurations.
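
    A minimal Python sketch of node perturbation with an eligibility trace bridging a fixed reward delay, for a single linear unit (a toy rendering of the kind of rule the paper analyzes; all constants are illustrative):

        import numpy as np
        from collections import deque

        rng = np.random.default_rng(5)
        n_in, sigma, eta, lam, delay = 10, 0.1, 0.005, 0.7, 3
        w_true = rng.normal(size=n_in)     # target weights defining the task
        w = np.zeros(n_in)
        trace = np.zeros(n_in)
        pending = deque()                  # rewards in transit (delayed delivery)

        for step in range(20000):
            x = rng.normal(size=n_in)
            xi = sigma * rng.normal()      # node perturbation on the unit's output
            y = w @ x + xi
            # Reward advantage: perturbed vs unperturbed performance
            adv = -((y - w_true @ x) ** 2) + ((w @ x - w_true @ x) ** 2)
            pending.append(adv)
            trace = lam * trace + xi * x   # eligibility trace of perturbation-input products
            if len(pending) > delay:       # reward arrives `delay` steps late
                w += (eta / sigma ** 2) * pending.popleft() * trace

        print(float(np.linalg.norm(w - w_true)))  # shrinks well below ||w_true||

    Only the trace term from the rewarded step correlates with the delayed reward, attenuated by lam**delay; this is the trade-off between trace time constant and reward delay that the paper characterizes.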

  15. Digital Badges--Rewards for Learning?

    ERIC Educational Resources Information Center

    Shields, Rebecca; Chugh, Ritesh

    2017-01-01

    Digital badges are quickly becoming an appropriate, easy and efficient way for educators, community groups and other professional organisations, to exhibit and reward participants for skills obtained in professional development or formal and informal learning. This paper offers an account of digital badges, how they work and the underlying…

  16. Inter-module credit assignment in modular reinforcement learning.

    PubMed

    Samejima, Kazuyuki; Doya, Kenji; Kawato, Mitsuo

    2003-09-01

    Critical issues in modular or hierarchical reinforcement learning (RL) are (i) how to decompose a task into sub-tasks, (ii) how to achieve independence of learning of sub-tasks, and (iii) how to assure optimality of the composite policy for the entire task. The second and last requirements are often under trade-off. We propose a method for propagating the reward for the entire task achievement between modules. This is done in the form of a 'modular reward', which is calculated from the temporal difference of the module gating signal and the value of the succeeding module. We implement modular reward for a multiple model-based reinforcement learning (MMRL) architecture and show its effectiveness in simulations of a pursuit task with hidden states and a continuous-time non-linear control task.

  17. Learning Analytics: Readiness and Rewards

    ERIC Educational Resources Information Center

    Friesen, Norm

    2013-01-01

    This position paper introduces the relatively new field of learning analytics, first by considering the relevant meanings of both "learning" and "analytics," and then by looking at two main levels at which learning analytics can be or has been implemented in educational organizations. Although integrated turnkey systems or…

  18. Optogenetic mimicry of the transient activation of dopamine neurons by natural reward is sufficient for operant reinforcement.

    PubMed

    Kim, Kyung Man; Baratta, Michael V; Yang, Aimei; Lee, Doheon; Boyden, Edward S; Fiorillo, Christopher D

    2012-01-01

    Activation of dopamine receptors in forebrain regions, for minutes or longer, is known to be sufficient for positive reinforcement of stimuli and actions. However, the firing rate of dopamine neurons is increased for only about 200 milliseconds following natural reward events that are better than expected, a response which has been described as a "reward prediction error" (RPE). Although RPE drives reinforcement learning (RL) in computational models, it has not been possible to directly test whether the transient dopamine signal actually drives RL. Here we have performed optical stimulation of genetically targeted ventral tegmental area (VTA) dopamine neurons expressing Channelrhodopsin-2 (ChR2) in mice. We mimicked the transient activation of dopamine neurons that occurs in response to natural reward by applying a light pulse of 200 ms in VTA. When a single light pulse followed each self-initiated nose poke, it was sufficient in itself to cause operant reinforcement. Furthermore, when optical stimulation was delivered in separate sessions according to a predetermined pattern, it increased locomotion and contralateral rotations, behaviors that are known to result from activation of dopamine neurons. All three of the optically induced operant and locomotor behaviors were tightly correlated with the number of VTA dopamine neurons that expressed ChR2, providing additional evidence that the behavioral responses were caused by activation of dopamine neurons. These results provide strong evidence that the transient activation of dopamine neurons provides a functional reward signal that drives learning, in support of RL theories of dopamine function.

  19. Optogenetic Mimicry of the Transient Activation of Dopamine Neurons by Natural Reward Is Sufficient for Operant Reinforcement

    PubMed Central

    Kim, Kyung Man; Baratta, Michael V.; Yang, Aimei; Lee, Doheon; Boyden, Edward S.; Fiorillo, Christopher D.

    2012-01-01

    Activation of dopamine receptors in forebrain regions, for minutes or longer, is known to be sufficient for positive reinforcement of stimuli and actions. However, the firing rate of dopamine neurons is increased for only about 200 milliseconds following natural reward events that are better than expected, a response which has been described as a “reward prediction error” (RPE). Although RPE drives reinforcement learning (RL) in computational models, it has not been possible to directly test whether the transient dopamine signal actually drives RL. Here we have performed optical stimulation of genetically targeted ventral tegmental area (VTA) dopamine neurons expressing Channelrhodopsin-2 (ChR2) in mice. We mimicked the transient activation of dopamine neurons that occurs in response to natural reward by applying a light pulse of 200 ms in VTA. When a single light pulse followed each self-initiated nose poke, it was sufficient in itself to cause operant reinforcement. Furthermore, when optical stimulation was delivered in separate sessions according to a predetermined pattern, it increased locomotion and contralateral rotations, behaviors that are known to result from activation of dopamine neurons. All three of the optically induced operant and locomotor behaviors were tightly correlated with the number of VTA dopamine neurons that expressed ChR2, providing additional evidence that the behavioral responses were caused by activation of dopamine neurons. These results provide strong evidence that the transient activation of dopamine neurons provides a functional reward signal that drives learning, in support of RL theories of dopamine function. PMID:22506004

  20. Reinforcement learning: the good, the bad and the ugly.

    PubMed

    Dayan, Peter; Niv, Yael

    2008-04-01

    Reinforcement learning provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making in the face of rewards and punishments. Here we review the latest dispatches from the forefront of this field, and map out some of the territories where lie monsters.

  1. Changes in corticostriatal connectivity during reinforcement learning in humans.

    PubMed

    Horga, Guillermo; Maia, Tiago V; Marsh, Rachel; Hao, Xuejun; Xu, Dongrong; Duan, Yunsuo; Tau, Gregory Z; Graniello, Barbara; Wang, Zhishun; Kangarlu, Alayar; Martinez, Diana; Packard, Mark G; Peterson, Bradley S

    2015-02-01

    Many computational models assume that reinforcement learning relies on changes in synaptic efficacy between cortical regions representing stimuli and striatal regions involved in response selection, but this assumption has thus far lacked empirical support in humans. We recorded hemodynamic signals with fMRI while participants navigated a virtual maze to find hidden rewards. We fitted a reinforcement-learning algorithm to participants' choice behavior and evaluated the neural activity and the changes in functional connectivity related to trial-by-trial learning variables. Activity in the posterior putamen during choice periods increased progressively during learning. Furthermore, the functional connections between the sensorimotor cortex and the posterior putamen strengthened progressively as participants learned the task. These changes in corticostriatal connectivity differentiated participants who learned the task from those who did not. These findings provide a direct link between changes in corticostriatal connectivity and learning, thereby supporting a central assumption common to several computational models of reinforcement learning.
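
    Fitting a reinforcement-learning algorithm to choice behavior, as done here, typically means maximizing the likelihood of the observed choices under a Q-learning-plus-softmax model. A minimal Python sketch of that generic recipe (not the study's exact model):

        import numpy as np
        from scipy.optimize import minimize

        def neg_log_likelihood(params, choices, rewards, n_actions=2):
            # Negative log-likelihood of observed choices under tabular
            # Q-learning with softmax choice; params = (alpha, beta).
            alpha, beta = params
            q = np.zeros(n_actions)
            nll = 0.0
            for a, r in zip(choices, rewards):
                logits = beta * q
                m = logits.max()
                logp = logits - (m + np.log(np.exp(logits - m).sum()))
                nll -= logp[a]
                q[a] += alpha * (r - q[a])   # trial-by-trial learning variable
            return nll

        # Simulate a participant with known parameters, then recover them.
        rng = np.random.default_rng(6)
        alpha_true, beta_true, q = 0.3, 4.0, np.zeros(2)
        choices, rewards = [], []
        for _ in range(400):
            p = np.exp(beta_true * q) / np.exp(beta_true * q).sum()
            a = rng.choice(2, p=p)
            r = float(rng.random() < (0.8 if a == 0 else 0.3))
            choices.append(a)
            rewards.append(r)
            q[a] += alpha_true * (r - q[a])

        fit = minimize(neg_log_likelihood, x0=[0.5, 1.0], args=(choices, rewards),
                       bounds=[(0.01, 1.0), (0.1, 20.0)])
        print(fit.x)   # estimates should land near (0.3, 4.0)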

  2. Reward: From Basic Reinforcers to Anticipation of Social Cues.

    PubMed

    Rademacher, Lena; Schulte-Rüther, Martin; Hanewald, Bernd; Lammertz, Sarah

    2017-01-01

    Reward processing plays a major role in goal-directed behavior and motivation. On the neural level, it is mediated by a complex network of brain structures called the dopaminergic reward system. In the last decade, neuroscientific researchers have become increasingly interested in aspects of social interaction that are experienced as rewarding. Recent neuroimaging studies have provided evidence that the reward system mediates the processing of social stimuli in a manner analogous to nonsocial rewards and thus motivates social behavior. In this context, the neuropeptide oxytocin is assumed to play a key role by activating dopaminergic reward pathways in response to social cues, inducing the rewarding quality of social interactions. Alterations in the dopaminergic reward system have been found in several psychiatric disorders that are accompanied by social interaction and motivation problems, for example autism, attention deficit/hyperactivity disorder, addiction disorders, and schizophrenia.

  3. Psychological distance to reward: Segmentation of aperiodic schedules of reinforcement

    PubMed Central

    Leung, Jin-Pang

    1993-01-01

    College students responded for monetary rewards in two experiments on choice between differentially segmented aperiodic schedules of reinforcement. On a microcomputer, the concurrent chains were simulated as an air-defense video game in which subjects used two radars for detecting and destroying enemy aircraft. To earn more cash-exchangeable points, subjects had to shoot down as many planes as possible within a given period of time. For both experiments, access to one of two radar systems (terminal link) was controlled by a pair of independent concurrent variable-interval 60-s schedules (initial link) with a 4-s changeover delay always in effect. In Experiment 1, the appearance of an enemy aircraft in the terminal link was determined by a variable-interval (15 s or 60 s) schedule or a two-component chained variable-interval schedule of equal duration. Experiment 2 was similar to Experiment 1 except for the segmented schedule, which had three components. Subjects preferred the unsegmented schedule over its segmented counterpart in the conditions with variable-interval 60 s, and preference tended to be more pronounced with more components in the segmented schedule. These findings are compatible with those from previous studies of periodic and aperiodic schedules with pigeons or humans as subjects. PMID:16812691

  4. Implicit and explicit reward learning in chronic nicotine use.

    PubMed

    Paelecke-Habermann, Yvonne; Paelecke, Marko; Giegerich, Katharina; Reschke, Katja; Kübler, Andrea

    2013-04-01

    Chronic tobacco use is related to specific neurobiological alterations in the dopaminergic brain reward system that can be termed "reward deficiency syndrome" in dependent nicotine consumers. The close linkage of dopaminergic activity and reward learning led us to expect implicit and explicit reward learning deficits in dependent compared to non-smokers. Smokers who maintain a less regular, occasional use may also, to a lesser extent, show implicit reward learning deficits. The purpose of our study was to examine the behavioral effects of the neurobiological alterations on reward related learning. We also tested whether any deficits observed in an abstinent state are also present in a satiated state. In two studies, we examined implicit and explicit reward learning in smokers. Participants were administered a probabilistic implicit reward learning task, and an explicit reward- and punishment-based trial-and-error learning task. In Study 1, we compared dependent, occasional, and non-smokers, and in Study 2 satiated and abstinent smokers. In Study 1, chronic and occasional smokers showed impairments in both implicit and explicit reward learning tasks. In Study 2, satiated smokers did not perform better than abstinent smokers. The results support the hypothesis of reward learning deficits. These deficits are not limited to explicit but extend to implicit reward learning and cannot be explained by tobacco withdrawal.

  5. Roles of octopaminergic and dopaminergic neurons in mediating reward and punishment signals in insect visual learning.

    PubMed

    Unoki, Sae; Matsumoto, Yukihisa; Mizunami, Makoto

    2006-10-01

    Insects, like vertebrates, have considerable ability to associate visual, olfactory or other sensory signals with reward or punishment. Previous studies in crickets, honey bees and fruit-flies have suggested that octopamine (OA, invertebrate counterpart of noradrenaline) and dopamine (DA) mediate various kinds of reward and punishment signals in olfactory learning. However, whether the roles of OA and DA in mediating positive and negative reinforcing signals can be generalized to learning of sensory signals other than odors remained unknown. Here we first established a visual learning paradigm for crickets in which a visual pattern is associated with water reward or saline punishment, and found that memory after aversive conditioning decayed much faster than that after appetitive conditioning. Then, we pharmacologically studied the roles of OA and DA in appetitive and aversive forms of visual learning. Crickets injected with epinastine or mianserin, OA receptor antagonists, into the hemolymph exhibited a complete impairment of appetitive learning to associate a visual pattern with water reward, but aversive learning with saline punishment was unaffected. By contrast, fluphenazine, chlorpromazine or spiperone, DA receptor antagonists, completely impaired aversive learning without affecting appetitive learning. The results demonstrate that OA and DA participate in reward and punishment conditioning in visual learning. This finding, together with results of previous studies on the roles of OA and DA in olfactory learning, suggests ubiquitous roles of the octopaminergic reward system and dopaminergic punishment system in insect learning.

  6. Meta-learning in reinforcement learning.

    PubMed

    Schweighofer, Nicolas; Doya, Kenji

    2003-01-01

    Meta-parameters in reinforcement learning should be tuned to the environmental dynamics and the animal performance. Here, we propose a biologically plausible meta-reinforcement learning algorithm for tuning these meta-parameters in a dynamic, adaptive manner. We tested our algorithm in both a simulation of a Markov decision task and in a non-linear control task. Our results show that the algorithm robustly finds appropriate meta-parameter values, and controls the meta-parameter time course, in both static and dynamic environments. We suggest that the phasic and tonic components of dopamine neuron firing can encode the signal required for meta-learning of reinforcement learning.
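
    The scheme can be caricatured as stochastic search over a meta-parameter, retaining perturbations whenever a mid-term reward average beats a long-term baseline. The following Python sketch is a simplified rendering in the spirit of the paper, with invented constants; here the tuned meta-parameter is the softmax inverse temperature:

        import numpy as np

        rng = np.random.default_rng(7)

        def run_agent(beta, n_trials=200):
            # Tiny two-armed bandit; returns mean reward obtained for a given
            # softmax inverse temperature (the meta-parameter being tuned).
            q, p_reward = np.zeros(2), [0.8, 0.4]
            total = 0.0
            for _ in range(n_trials):
                p = np.exp(beta * q) / np.exp(beta * q).sum()
                a = rng.choice(2, p=p)
                r = float(rng.random() < p_reward[a])
                q[a] += 0.2 * (r - q[a])
                total += r
            return total / n_trials

        beta, sigma = 0.5, 0.3
        long_term = run_agent(beta)               # slow baseline reward average
        for episode in range(50):
            trial_beta = max(0.05, beta + sigma * rng.normal())  # perturb meta-parameter
            mid_term = run_agent(trial_beta)      # fast reward average under perturbation
            if mid_term > long_term:              # keep changes that beat the baseline
                beta = trial_beta
            long_term += 0.1 * (mid_term - long_term)

        print(beta)   # typically drifts toward a higher, more exploitative value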

  7. How we learn to make decisions: rapid propagation of reinforcement learning prediction errors in humans.

    PubMed

    Krigolson, Olav E; Hassall, Cameron D; Handy, Todd C

    2014-03-01

    Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors-discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833-1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129-141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769-776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679-709, 2002]. Here, we used the event-related brain potential (ERP) technique to demonstrate that not only do rewards elicit a neural response akin to a prediction error but also that this signal rapidly diminished and propagated to the time of choice presentation with learning. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component that has a similar timing and topography as the feedback error-related negativity that increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward…

  8. The combination of appetitive and aversive reinforcers and the nature of their interaction during auditory learning.

    PubMed

    Ilango, A; Wetzel, W; Scheich, H; Ohl, F W

    2010-03-31

    Learned changes in behavior can be elicited by either appetitive or aversive reinforcers. It is, however, not clear whether the two types of motivation (approaching appetitive stimuli and avoiding aversive stimuli) drive learning in the same or different ways, nor is their interaction understood in situations where the two types are combined in a single experiment. To investigate this question we have developed a novel learning paradigm for Mongolian gerbils, which not only allows rewards and punishments to be presented in isolation or in combination with each other, but also can use these opposite reinforcers to drive the same learned behavior. Specifically, we studied learning of tone-conditioned hurdle crossing in a shuttle box driven by either an appetitive reinforcer (brain stimulation reward) or an aversive reinforcer (electrical footshock), or by a combination of both. Combination of the two reinforcers potentiated speed of acquisition, led to maximum possible performance, and delayed extinction as compared to either reinforcer alone. Additional experiments, using partial reinforcement protocols and experiments in which one of the reinforcers was omitted after the animals had been previously trained with the combination of both reinforcers, indicated that appetitive and aversive reinforcers operated together but acted in different ways: in this particular experimental context, punishment appeared to be more effective for initial acquisition and reward more effective to maintain a high level of conditioned responses (CRs). The results imply that learning mechanisms in problem solving were maximally effective when the initial punishment of mistakes was combined with the subsequent rewarding of correct performance.

  9. The Dopamine Prediction Error: Contributions to Associative Models of Reward Learning

    PubMed Central

    Nasser, Helen M.; Calu, Donna J.; Schoenbaum, Geoffrey; Sharpe, Melissa J.

    2017-01-01

    Phasic activity of midbrain dopamine neurons is currently thought to encapsulate the prediction-error signal described in Sutton and Barto’s (1981) model-free reinforcement learning algorithm. This phasic signal is thought to contain information about the quantitative value of reward, which transfers to the reward-predictive cue after learning. This is argued to endow the reward-predictive cue with the value inherent in the reward, motivating behavior toward cues signaling the presence of reward. Yet theoretical and empirical research has implicated prediction-error signaling in learning that extends far beyond a transfer of quantitative value to a reward-predictive cue. Here, we review the research which demonstrates the complexity of how dopaminergic prediction errors facilitate learning. After briefly discussing the literature demonstrating that phasic dopaminergic signals can act in the manner described by Sutton and Barto (1981), we consider how these signals may also influence attentional processing across multiple attentional systems in distinct brain circuits. Then, we discuss how prediction errors encode and promote the development of context-specific associations between cues and rewards. Finally, we consider recent evidence that shows dopaminergic activity contains information about causal relationships between cues and rewards that reflect information garnered from rich associative models of the world that can be adapted in the absence of direct experience. In discussing this research we hope to support the expansion of how dopaminergic prediction errors are thought to contribute to the learning process beyond the traditional concept of transferring quantitative value. PMID:28275359

  10. The Dopamine Prediction Error: Contributions to Associative Models of Reward Learning.

    PubMed

    Nasser, Helen M; Calu, Donna J; Schoenbaum, Geoffrey; Sharpe, Melissa J

    2017-01-01

    Phasic activity of midbrain dopamine neurons is currently thought to encapsulate the prediction-error signal described in Sutton and Barto's (1981) model-free reinforcement learning algorithm. This phasic signal is thought to contain information about the quantitative value of reward, which transfers to the reward-predictive cue after learning. This is argued to endow the reward-predictive cue with the value inherent in the reward, motivating behavior toward cues signaling the presence of reward. Yet theoretical and empirical research has implicated prediction-error signaling in learning that extends far beyond a transfer of quantitative value to a reward-predictive cue. Here, we review the research which demonstrates the complexity of how dopaminergic prediction errors facilitate learning. After briefly discussing the literature demonstrating that phasic dopaminergic signals can act in the manner described by Sutton and Barto (1981), we consider how these signals may also influence attentional processing across multiple attentional systems in distinct brain circuits. Then, we discuss how prediction errors encode and promote the development of context-specific associations between cues and rewards. Finally, we consider recent evidence that shows dopaminergic activity contains information about causal relationships between cues and rewards that reflect information garnered from rich associative models of the world that can be adapted in the absence of direct experience. In discussing this research we hope to support the expansion of how dopaminergic prediction errors are thought to contribute to the learning process beyond the traditional concept of transferring quantitative value.

  11. Model-based reinforcement learning with dimension reduction.

    PubMed

    Tangkaratt, Voot; Morimoto, Jun; Sugiyama, Masashi

    2016-12-01

    The goal of reinforcement learning is to learn an optimal policy which controls an agent to acquire the maximum cumulative reward. The model-based reinforcement learning approach learns a transition model of the environment from data, and then derives the optimal policy using the transition model. However, learning an accurate transition model in high-dimensional environments requires a large amount of data which is difficult to obtain. To overcome this difficulty, in this paper, we propose to combine model-based reinforcement learning with the recently developed least-squares conditional entropy (LSCE) method, which simultaneously performs transition model estimation and dimension reduction. We also further extend the proposed method to imitation learning scenarios. The experimental results show that policy search combined with LSCE performs well for high-dimensional control tasks including real humanoid robot control.

  12. Stochastic optimization of multireservoir systems via reinforcement learning

    NASA Astrophysics Data System (ADS)

    Lee, Jin-Hee; Labadie, John W.

    2007-11-01

    Although several variants of stochastic dynamic programming have been applied to optimal operation of multireservoir systems, they have been plagued by a high-dimensional state space and the inability to accurately incorporate the stochastic environment as characterized by temporally and spatially correlated hydrologic inflows. Reinforcement learning has emerged as an effective approach to solving sequential decision problems by combining concepts from artificial intelligence, cognitive science, and operations research. A reinforcement learning system has a mathematical foundation similar to dynamic programming and Markov decision processes, with the goal of maximizing the long-term reward or returns as conditioned on the state of the system environment and the immediate reward obtained from operational decisions. Reinforcement learning can include Monte Carlo simulation where transition probabilities and rewards are not explicitly known a priori. The Q-Learning method in reinforcement learning is demonstrated on the two-reservoir Geum River system, South Korea, and is shown to outperform implicit stochastic dynamic programming and sampling stochastic dynamic programming methods.

  13. Model-based reinforcement learning under concurrent schedules of reinforcement in rodents.

    PubMed

    Huh, Namjung; Jo, Suhyun; Kim, Hoseok; Sul, Jung Hoon; Jung, Min Whan

    2009-05-01

    Reinforcement learning theories postulate that actions are chosen to maximize a long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial-and-error, whereas they are updated according to the decision-maker's knowledge or model of the environment in model-based reinforcement learning algorithms. To investigate how animals update value functions, we trained rats under two different free-choice tasks. The reward probability of the unchosen target remained unchanged in one task, whereas it increased over time since the target was last chosen in the other task. The results show that goal choice probability increased as a function of the number of consecutive alternative choices in the latter, but not the former task, indicating that the animals were aware of time-dependent increases in arming probability and used this information in choosing goals. In addition, the choice behavior in the latter task was better accounted for by a model-based reinforcement learning algorithm. Our results show that rats adopt a decision-making process that cannot be accounted for by simple reinforcement learning models even in a relatively simple binary choice task, suggesting that rats can readily improve their decision-making strategy through the knowledge of their environments.

  14. Reinforcement Learning and Savings Behavior.

    PubMed

    Choi, James J; Laibson, David; Madrian, Brigitte C; Metrick, Andrew

    2009-12-01

    We show that individual investors over-extrapolate from their personal experience when making savings decisions. Investors who experience particularly rewarding outcomes from saving in their 401(k) (a high average and/or low variance return) increase their 401(k) savings rate more than investors who have less rewarding experiences with saving. This finding is not driven by aggregate time-series shocks, income effects, rational learning about investing skill, investor fixed effects, or time-varying investor-level heterogeneity that is correlated with portfolio allocations to stock, bond, and cash asset classes. We discuss implications for the equity premium puzzle and interventions aimed at improving household financial outcomes.

  15. Optimizing reproducibility of operant testing through reinforcer standardization: identification of key nutritional constituents determining reward strength in touchscreens.

    PubMed

    Kim, Eun Woo; Phillips, Benjamin U; Heath, Christopher J; Cho, So Yeon; Kim, Hyunjeong; Sreedharan, Jemeen; Song, Ho-Taek; Lee, Jong Eun; Bussey, Timothy J; Kim, Chul Hoon; Kim, Eosu; Saksida, Lisa M

    2017-07-17

    Reliable and reproducible assessment of animal learning and behavior is a central aim of basic and translational neuroscience research. Recent developments in automated operant chamber technology have led to the possibility of universal standard protocols, in addition to increased translational potential, reliability and accuracy. However, the impact of regional and national differences in the supplies of available reinforcers in this system on behavioural performance and inter-laboratory variability is an unknown and at present uncontrolled variable. Therefore, we aimed to identify which constituent(s) of the reward determines reinforcer strength to enable improved standardization of this parameter across laboratories. Male C57BL/6 mice were examined in the touchscreen-based fixed ratio (FR) and progressive ratio (PR) schedules, reinforced with different kinds of milk-based reinforcers to directly compare the incentive values of plain milk (PM, high-calorie: high-fat/low-sugar), strawberry-flavored milk (SM, high-calorie: low-fat/high-sugar), and semi-skimmed low-fat milk (LM, low-calorie: low-fat/low-sugar) on the basis of differences in caloric content, sugar/fat content, and flavor. Use of a higher caloric content reward was effective in increasing operant training acquisition rate. Total trial number completed in FR and breakpoint in PR were higher using the two isocaloric milk products (PM and SM) than the lower caloric LM, with comparable outcomes between PM and SM conditions, suggesting that total caloric content determines reward strength. Analysis of within-session changes in response rate revealed that overall outputs in FR and PR primarily depend on the response rate at the initial phase of a session, which itself was dependent on reinforcer caloric content. Interestingly, the rate of satiation, indicated by decay in response rate within a FR session, was highest when reinforced with SM, suggesting a rapid satiating effect of sugar. The key contribution…

  16. Differential Reward Learning for Self and Others Predicts Self-Reported Altruism

    PubMed Central

    Kwak, Youngbin; Pearson, John; Huettel, Scott A.

    2014-01-01

    In social environments, decisions not only determine rewards for oneself but also for others. However, individual differences in pro-social behaviors have been typically studied through self-report. We developed a decision-making paradigm in which participants chose from card decks with differing rewards for themselves and charity; some decks gave similar rewards to both, while others gave higher rewards for one or the other. We used a reinforcement-learning model that estimated each participant's relative weighting of self versus charity reward. As shown both in choices and model parameters, individuals who showed relatively better learning of rewards for charity – compared to themselves – were more likely to engage in pro-social behavior outside of a laboratory setting, as indicated by self-report. Overall rates of reward learning, however, did not predict individual differences in pro-social tendencies. These results support the idea that biases toward learning about social rewards are associated with one's altruistic tendencies. PMID:25215883
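
    A minimal Python sketch of the kind of model described, with a hypothetical weighting parameter w_self that trades off one's own reward against the charity's inside a standard delta-rule update (all names and values are assumptions, not the paper's):

        alpha = 0.2     # learning rate (assumed)
        w_self = 0.6    # weighting of self reward vs. charity reward (assumed)

        def update_deck_value(values, deck, r_self, r_charity):
            # Blend the two reward streams into one subjective outcome, then
            # apply a standard prediction-error update to the chosen deck.
            r = w_self * r_self + (1.0 - w_self) * r_charity
            values[deck] += alpha * (r - values[deck])

        values = {"deck_self": 0.0, "deck_charity": 0.0, "deck_both": 0.0}
        update_deck_value(values, "deck_both", r_self=2.0, r_charity=2.0)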

  17. Reinforcement learning in supply chains.

    PubMed

    Valluri, Annapurna; North, Michael J; Macal, Charles M

    2009-10-01

    Effective management of supply chains creates value and can strategically position companies. In practice, human beings have been found to be both surprisingly successful and disappointingly inept at managing supply chains. The related fields of cognitive psychology and artificial intelligence have postulated a variety of potential mechanisms to explain this behavior. One of the leading candidates is reinforcement learning. This paper applies agent-based modeling to investigate the comparative behavioral consequences of three simple reinforcement learning algorithms in a multi-stage supply chain. For the first time, our findings show that the specific algorithm that is employed can have dramatic effects on the results obtained. Reinforcement learning is found to be valuable in multi-stage supply chains with several learning agents, as independent agents can learn to coordinate their behavior. However, learning in multi-stage supply chains using these postulated approaches from cognitive psychology and artificial intelligence takes extremely long time periods to achieve stability, which raises questions about their ability to explain behavior in real supply chains. The fact that it takes thousands of periods for agents to learn in this simple multi-agent setting provides new evidence that real-world decision makers are unlikely to be using strict reinforcement learning in practice.
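
    As a rough illustration of the class of learners involved, the following Python sketch shows a tabular Q-learning agent for a single stage's ordering decision, under assumed state and action encodings; each stage of the chain would run its own copy:

        import random
        from collections import defaultdict

        alpha, gamma, epsilon = 0.1, 0.95, 0.1   # assumed parameters
        ORDERS = range(5)                        # hypothetical order quantities
        Q = defaultdict(float)                   # (inventory, order) -> value

        def choose_order(inventory):
            # Epsilon-greedy action selection over order quantities.
            if random.random() < epsilon:
                return random.choice(list(ORDERS))
            return max(ORDERS, key=lambda a: Q[(inventory, a)])

        def learn(inventory, order, profit, next_inventory):
            # One-step Q-learning update on the ordering decision.
            best_next = max(Q[(next_inventory, a)] for a in ORDERS)
            Q[(inventory, order)] += alpha * (profit + gamma * best_next
                                              - Q[(inventory, order)])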

  18. Role of Dopamine D2 Receptors in Human Reinforcement Learning

    PubMed Central

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-01-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well. PMID:24713613

  19. Role of dopamine D2 receptors in human reinforcement learning.

    PubMed

    Eisenegger, Christoph; Naef, Michael; Linssen, Anke; Clark, Luke; Gandamaneni, Praveen K; Müller, Ulrich; Robbins, Trevor W

    2014-09-01

    Influential neurocomputational models emphasize dopamine (DA) as an electrophysiological and neurochemical correlate of reinforcement learning. However, evidence of a specific causal role of DA receptors in learning has been less forthcoming, especially in humans. Here we combine, in a between-subjects design, administration of a high dose of the selective DA D2/3-receptor antagonist sulpiride with genetic analysis of the DA D2 receptor in a behavioral study of reinforcement learning in a sample of 78 healthy male volunteers. In contrast to predictions of prevailing models emphasizing DA's pivotal role in learning via prediction errors, we found that sulpiride did not disrupt learning, but rather induced profound impairments in choice performance. The disruption was selective for stimuli indicating reward, whereas loss avoidance performance was unaffected. Effects were driven by volunteers with higher serum levels of the drug, and in those with genetically determined lower density of striatal DA D2 receptors. This is the clearest demonstration to date for a causal modulatory role of the DA D2 receptor in choice performance that might be distinct from learning. Our findings challenge current reward prediction error models of reinforcement learning, and suggest that classical animal models emphasizing a role of postsynaptic DA D2 receptors in motivational aspects of reinforcement learning may apply to humans as well.

  20. Habits, action sequences and reinforcement learning.

    PubMed

    Dezfouli, Amir; Balleine, Bernard W

    2012-04-01

    It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary, and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions.

  1. Habits, action sequences, and reinforcement learning

    PubMed Central

    Dezfouli, Amir; Balleine, Bernard W.

    2012-01-01

    It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, that does not form a model of the world but rather caches action values within states allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model; most notably the failure of model-free RL correctly to predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL in instrumental conditioning is unnecessary and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions. PMID:22487034

  2. Reinforcement active learning in the vibrissae system: optimal object localization.

    PubMed

    Gordon, Goren; Dorfman, Nimrod; Ahissar, Ehud

    2013-01-01

    Rats move their whiskers to acquire information about their environment. It has been observed that they palpate novel objects and objects they are required to localize in space. We analyze whisker-based object localization using two complementary paradigms, namely, active learning and intrinsic-reward reinforcement learning. Active learning algorithms select the next training samples according to the hypothesized solution in order to better discriminate between correct and incorrect labels. Intrinsic-reward reinforcement learning uses prediction errors as the reward to an actor-critic design, such that behavior converges to the one that optimizes the learning process. We show that in the context of object localization, the two paradigms result in palpation whisking as their respective optimal solution. These results suggest that rats may employ principles of active learning and/or intrinsic reward in tactile exploration and can guide future research to seek the underlying neuronal mechanisms that implement them. Furthermore, these paradigms are easily transferable to biomimetic whisker-based artificial sensors and can improve the active exploration of their environment. Copyright © 2012 Elsevier Ltd. All rights reserved.
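
    A schematic Python sketch of the intrinsic-reward paradigm, with the prediction error itself serving as reward in an actor-critic loop; the forward model and sensor below are stand-ins, and all names are assumptions:

        import numpy as np

        rng = np.random.default_rng(0)
        value, policy_param = 0.0, 0.0    # critic estimate and whisk amplitude
        alpha_v, alpha_p = 0.1, 0.05      # assumed learning rates

        for trial in range(200):
            amplitude = policy_param + rng.normal(0.0, 0.1)  # exploratory whisk
            observation = np.sin(amplitude)     # stand-in for a whisker contact
            prediction = 0.0                    # stand-in for the forward model
            reward = abs(observation - prediction)  # prediction error as reward
            td_error = reward - value
            value += alpha_v * td_error                                  # critic
            policy_param += alpha_p * td_error * (amplitude - policy_param)  # actor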

  3. Feature Reinforcement Learning: Part I. Unstructured MDPs

    NASA Astrophysics Data System (ADS)

    Hutter, Marcus

    2009-12-01

    General-purpose, intelligent, learning agents cycle through sequences of observations, actions, and rewards that are complex, uncertain, unknown, and non-Markovian. On the other hand, reinforcement learning is well-developed for small finite state Markov decision processes (MDPs). Up to now, extracting the right state representations out of bare observations, that is, reducing the general agent setup to the MDP framework, is an art that involves significant effort by designers. The primary goal of this work is to automate the reduction process and thereby significantly expand the scope of many existing reinforcement learning algorithms and the agents that employ them. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in Part II (Hutter, 2009c). The role of POMDPs is also considered there.

  4. Multi-agent Reinforcement Learning Model for Effective Action Selection

    NASA Astrophysics Data System (ADS)

    Youk, Sang Jo; Lee, Bong Keun

    Reinforcement learning is a subarea of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. In the multi-agent case especially, the state and action spaces become enormous compared to the single-agent case, so an effective action-selection strategy is needed for reinforcement learning to remain practical. This paper proposes a multi-agent reinforcement learning model based on a fuzzy inference system in order to improve learning speed and select effective actions in multi-agent settings. The paper verifies the effectiveness of the action-selection strategy through evaluation tests on RoboCup Keepaway, one of the standard test-beds for multi-agent research. The proposed model can be applied to evaluate the efficiency of various intelligent multi-agent systems, and also to the strategy and tactics of robot soccer systems.

  5. Associations among Smoking, Anhedonia, and Reward Learning in Depression

    PubMed Central

    Liverant, Gabrielle I.; Sloan, Denise M.; Pizzagalli, Diego A.; Harte, Christopher B.; Kamholz, Barbara W.; Rosebrock, Laina E.; Cohen, Andrew L.; Fava, Maurizio; Kaplan, Gary B.

    2015-01-01

    Depression and cigarette smoking co-occur at high rates. However, the etiological mechanisms that contribute to this relationship remain unclear. Anhedonia and associated impairments in reward learning are key features of depression, which also have been linked to the onset and maintenance of cigarette smoking. However, few studies have investigated differences in anhedonia and reward learning among depressed smokers and depressed nonsmokers. The goal of this study was to examine putative differences in anhedonia and reward learning in depressed smokers (n = 36) and depressed nonsmokers (n = 44). To this end, participants completed self-report measures of anhedonia and behavioral activation (BAS reward responsiveness scores) as well as a probabilistic reward task rooted in signal detection theory, which measures reward learning (Pizzagalli, Jahn, & O’Shea, 2005). When considering self-report measures, depressed smokers reported higher trait anhedonia and reduced BAS reward responsiveness scores compared to depressed nonsmokers. In contrast to self-report measures, nicotine-satiated depressed smokers demonstrated greater acquisition of reward-based learning compared to depressed nonsmokers as indexed by the probabilistic reward task. Findings may point to a potential mechanism underlying the frequent co-occurrence of smoking and depression. These results highlight the importance of continued investigation of the role of anhedonia and reward system functioning in the co-occurrence of depression and nicotine abuse. Results also may support the use of treatments targeting reward learning (e.g., behavioral activation) to enhance smoking cessation among individuals with depression. PMID:25022776

  6. Associations among smoking, anhedonia, and reward learning in depression.

    PubMed

    Liverant, Gabrielle I; Sloan, Denise M; Pizzagalli, Diego A; Harte, Christopher B; Kamholz, Barbara W; Rosebrock, Laina E; Cohen, Andrew L; Fava, Maurizio; Kaplan, Gary B

    2014-09-01

    Depression and cigarette smoking co-occur at high rates. However, the etiological mechanisms that contribute to this relationship remain unclear. Anhedonia and associated impairments in reward learning are key features of depression, which also have been linked to the onset and maintenance of cigarette smoking. However, few studies have investigated differences in anhedonia and reward learning among depressed smokers and depressed nonsmokers. The goal of this study was to examine putative differences in anhedonia and reward learning in depressed smokers (n=36) and depressed nonsmokers (n=44). To this end, participants completed self-report measures of anhedonia and behavioral activation (BAS reward responsiveness scores) as well as a probabilistic reward task rooted in signal detection theory, which measures reward learning (Pizzagalli, Jahn, & O'Shea, 2005). When considering self-report measures, depressed smokers reported higher trait anhedonia and reduced BAS reward responsiveness scores compared to depressed nonsmokers. In contrast to self-report measures, nicotine-satiated depressed smokers demonstrated greater acquisition of reward-based learning compared to depressed nonsmokers as indexed by the probabilistic reward task. Findings may point to a potential mechanism underlying the frequent co-occurrence of smoking and depression. These results highlight the importance of continued investigation of the role of anhedonia and reward system functioning in the co-occurrence of depression and nicotine abuse. Results also may support the use of treatments targeting reward learning (e.g., behavioral activation) to enhance smoking cessation among individuals with depression.

  7. Impaired associative learning with food rewards in obese women.

    PubMed

    Zhang, Zhihao; Manson, Kirk F; Schiller, Daniela; Levy, Ifat

    2014-08-04

    Obesity is a major epidemic in many parts of the world. One of the main factors contributing to obesity is overconsumption of high-fat and high-calorie food, which is driven by the rewarding properties of these types of food. Previous studies have suggested that dysfunction in reward circuits may be associated with overeating and obesity. The nature of this dysfunction, however, is still unknown. Here, we demonstrate impairment in reward-based associative learning specific to food in obese women. Normal-weight and obese participants performed an appetitive reversal learning task in which they had to learn and modify cue-reward associations. To test whether any learning deficits were specific to food reward or were more general, we used a between-subject design in which half of the participants received food reward and the other half received money reward. Our results reveal a marked difference in associative learning between normal-weight and obese women when food was used as reward. Importantly, no learning deficits were observed with money reward. Multiple regression analyses also established a robust negative association between body mass index and learning performance in the food domain in female participants. Interestingly, such impairment was not observed in obese men. These findings suggest that obesity may be linked to impaired reward-based associative learning and that this impairment may be specific to the food domain. Copyright © 2014 Elsevier Ltd. All rights reserved.

  8. Differential Effect of Reward and Punishment on Procedural Learning

    PubMed Central

    Wächter, Tobias; Lungu, Ovidiu V.; Liu, Tao; Willingham, Daniel T.; Ashe, James

    2009-01-01

    Reward and punishment are potent modulators of associative learning in instrumental and classical conditioning. However, the effect of reward and punishment on procedural learning is not known. The striatum is known to be an important locus of reward-related neural signals and part of the neural substrate of procedural learning. Here, using an implicit motor learning task, we show that reward leads to enhancement of learning in human subjects, whereas punishment is associated only with improvement in motor performance. Furthermore, these behavioral effects have distinct neural substrates with the learning effect of reward being mediated through the dorsal striatum and the performance effect of punishment through the insula. Our results suggest that reward and punishment engage separate motivational systems with distinctive behavioral effects and neural substrates. PMID:19144843

  9. Differential effect of reward and punishment on procedural learning.

    PubMed

    Wächter, Tobias; Lungu, Ovidiu V; Liu, Tao; Willingham, Daniel T; Ashe, James

    2009-01-14

    Reward and punishment are potent modulators of associative learning in instrumental and classical conditioning. However, the effect of reward and punishment on procedural learning is not known. The striatum is known to be an important locus of reward-related neural signals and part of the neural substrate of procedural learning. Here, using an implicit motor learning task, we show that reward leads to enhancement of learning in human subjects, whereas punishment is associated only with improvement in motor performance. Furthermore, these behavioral effects have distinct neural substrates with the learning effect of reward being mediated through the dorsal striatum and the performance effect of punishment through the insula. Our results suggest that reward and punishment engage separate motivational systems with distinctive behavioral effects and neural substrates.

  10. A Comparative Analysis of Reinforcement Learning Methods

    DTIC Science & Technology

    1991-10-01

    reinforcement learning for both programming and adapting situated agents. In the first part of the paper we discuss two specific reinforcement learning algorithms: Q-learning and the Bucket Brigade. We introduce a special case of the Bucket Brigade, and analyze and compare its performance to Q-learning in a number of experiments. The second part of the paper discusses the key problems of reinforcement learning: time and space complexity, input generalization, sensitivity to parameter values, and selection of the reinforcement

  11. Attention-gated reinforcement learning of internal representations for classification.

    PubMed

    Roelfsema, Pieter R; van Ooyen, Arjen

    2005-10-01

    Animal learning is associated with changes in the efficacy of connections between neurons. The rules that govern this plasticity can be tested in neural networks. Rules that train neural networks to map stimuli onto outputs are given by supervised learning and reinforcement learning theories. Supervised learning is efficient but biologically implausible. In contrast, reinforcement learning is biologically plausible but comparatively inefficient. It lacks a mechanism that can identify units at early processing levels that play a decisive role in the stimulus-response mapping. Here we show that this so-called credit assignment problem can be solved by a new role for attention in learning. There are two factors in our new learning scheme that determine synaptic plasticity: (1) a reinforcement signal that is homogeneous across the network and depends on the amount of reward obtained after a trial, and (2) an attentional feedback signal from the output layer that limits plasticity to those units at earlier processing levels that are crucial for the stimulus-response mapping. The new scheme is called attention-gated reinforcement learning (AGREL). We show that it is as efficient as supervised learning in classification tasks. AGREL is biologically realistic and integrates the role of feedback connections, attention effects, synaptic plasticity, and reinforcement learning signals into a coherent framework.
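
    A schematic numpy sketch of the two-factor rule described above: a network-wide reward-prediction-error signal multiplied by attentional feedback from the selected output unit, which restricts plasticity at the earlier layer (architecture, sizes, and names are assumptions):

        import numpy as np

        rng = np.random.default_rng(1)
        W_in = rng.normal(0.0, 0.1, (4, 8))    # input -> hidden weights
        W_out = rng.normal(0.0, 0.1, (8, 3))   # hidden -> output weights
        lr = 0.05

        def agrel_step(x, chosen, reward, expected_reward):
            h = np.tanh(x @ W_in)
            # Factor 1: a reinforcement signal shared by every synapse.
            delta = reward - expected_reward
            # Factor 2: attentional feedback from the chosen output unit; only
            # hidden units that drove the selected response receive plasticity.
            attention = W_out[:, chosen] * (1.0 - h ** 2)
            W_out[:, chosen] += lr * delta * h
            W_in[:, :] += lr * delta * np.outer(x, attention)

        x = np.array([1.0, 0.0, 0.5, 0.0])
        agrel_step(x, chosen=2, reward=1.0, expected_reward=0.4)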

  12. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults.

    PubMed

    Vernetti, Angélina; Smith, Tim J; Senju, Atsushi

    2017-03-15

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue-reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue-reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that social engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting.

  13. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults

    PubMed Central

    Vernetti, Angélina; Smith, Tim J.; Senju, Atsushi

    2017-01-01

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue–reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue–reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that social engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting. PMID:28250186

  14. Conflict acts as an implicit cost in reinforcement learning.

    PubMed

    Cavanagh, James F; Masters, Sean E; Bath, Kevin; Frank, Michael J

    2014-11-04

    Conflict has been proposed to act as a cost in action selection, implying a general function of medio-frontal cortex in the adaptation to aversive events. Here we investigate if response conflict acts as a cost during reinforcement learning by modulating experienced reward values in cortical and striatal systems. Electroencephalography recordings show that conflict diminishes the relationship between reward-related frontal theta power and cue preference yet it enhances the relationship between punishment and cue avoidance. Individual differences in the cost of conflict on reward versus punishment sensitivity are also related to a genetic polymorphism associated with striatal D1 versus D2 pathway balance (DARPP-32). We manipulate these patterns with the D2 agent cabergoline, which induces a strong bias to amplify the aversive value of punishment outcomes following conflict. Collectively, these findings demonstrate that interactive cortico-striatal systems implicitly modulate experienced reward and punishment values as a function of conflict.

  15. Dissecting components of reward: 'liking', 'wanting', and learning.

    PubMed

    Berridge, Kent C; Robinson, Terry E; Aldridge, J Wayne

    2009-02-01

    In recent years significant progress has been made delineating the psychological components of reward and their underlying neural mechanisms. Here we briefly highlight findings on three dissociable psychological components of reward: 'liking' (hedonic impact), 'wanting' (incentive salience), and learning (predictive associations and cognitions). A better understanding of the components of reward, and their neurobiological substrates, may help in devising improved treatments for disorders of mood and motivation, ranging from depression to eating disorders, drug addiction, and related compulsive pursuits of rewards.

  16. Tunnel Ventilation Control Using Reinforcement Learning Methodology

    NASA Astrophysics Data System (ADS)

    Chu, Baeksuk; Kim, Dongnam; Hong, Daehie; Park, Jooyoung; Chung, Jin Taek; Kim, Tae-Hyung

    The main purpose of a tunnel ventilation system is to maintain CO pollutant concentration and VI (visibility index) at an adequate level to provide drivers with a comfortable and safe driving environment. Moreover, it is necessary to minimize the power consumed to operate the ventilation system. To achieve these objectives, the control algorithm used in this research is the reinforcement learning (RL) method. RL is goal-directed learning of a mapping from situations to actions without relying on exemplary supervision or complete models of the environment. The goal of RL is to maximize a reward, which is evaluative feedback from the environment. In constructing the reward of the tunnel ventilation system, both objectives listed above are included, that is, maintaining an adequate level of pollutants and minimizing power consumption. An RL algorithm based on an actor-critic architecture and a gradient-following algorithm is adopted for the tunnel ventilation system. Simulation results obtained with real data collected from an existing tunnel ventilation system, together with real experimental verification, are provided in this paper. It is confirmed that with the suggested controller, the pollutant level inside the tunnel was well maintained under the allowable limit and energy consumption was improved compared to the conventional control scheme.
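
    One plausible way to fold the two objectives into the scalar reward such a controller maximizes is sketched below in Python; the limits and weights are assumptions, not values from the paper:

        CO_LIMIT = 25.0           # hypothetical allowable CO concentration (ppm)
        VI_MIN = 50.0             # hypothetical minimum visibility index (%)
        w_air, w_power = 1.0, 0.01

        def ventilation_reward(co_ppm, vi_percent, power_kw):
            # Penalize pollutant levels beyond their limits and, more weakly,
            # fan power consumption; the RL controller maximizes this reward.
            air_penalty = max(0.0, co_ppm - CO_LIMIT) + max(0.0, VI_MIN - vi_percent)
            return -(w_air * air_penalty + w_power * power_kw)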

  17. Is Avoiding an Aversive Outcome Rewarding? Neural Substrates of Avoidance Learning in the Human Brain

    PubMed Central

    Kim, Hackjin; Shimojo, Shinsuke

    2006-01-01

    Avoidance learning poses a challenge for reinforcement-based theories of instrumental conditioning, because once an aversive outcome is successfully avoided an individual may no longer experience extrinsic reinforcement for their behavior. One possible account for this is to propose that avoiding an aversive outcome is in itself a reward, and thus avoidance behavior is positively reinforced on each trial when the aversive outcome is successfully avoided. In the present study we aimed to test this possibility by determining whether avoidance of an aversive outcome recruits the same neural circuitry as that elicited by a reward itself. We scanned 16 human participants with functional MRI while they performed an instrumental choice task, in which on each trial they chose from one of two actions in order to either win money or else avoid losing money. Neural activity in a region previously implicated in encoding stimulus reward value, the medial orbitofrontal cortex, was found to increase, not only following receipt of reward, but also following successful avoidance of an aversive outcome. This neural signal may itself act as an intrinsic reward, thereby serving to reinforce actions during instrumental avoidance. PMID:16802856

  18. Is avoiding an aversive outcome rewarding? Neural substrates of avoidance learning in the human brain.

    PubMed

    Kim, Hackjin; Shimojo, Shinsuke; O'Doherty, John P

    2006-07-01

    Avoidance learning poses a challenge for reinforcement-based theories of instrumental conditioning, because once an aversive outcome is successfully avoided an individual may no longer experience extrinsic reinforcement for their behavior. One possible account for this is to propose that avoiding an aversive outcome is in itself a reward, and thus avoidance behavior is positively reinforced on each trial when the aversive outcome is successfully avoided. In the present study we aimed to test this possibility by determining whether avoidance of an aversive outcome recruits the same neural circuitry as that elicited by a reward itself. We scanned 16 human participants with functional MRI while they performed an instrumental choice task, in which on each trial they chose from one of two actions in order to either win money or else avoid losing money. Neural activity in a region previously implicated in encoding stimulus reward value, the medial orbitofrontal cortex, was found to increase, not only following receipt of reward, but also following successful avoidance of an aversive outcome. This neural signal may itself act as an intrinsic reward, thereby serving to reinforce actions during instrumental avoidance.

  19. Estimation of Distribution Algorithms for Solving Reinforcement Learning Problems

    NASA Astrophysics Data System (ADS)

    Handa, Hisashi

    Estimation of Distribution Algorithms (EDAs) are a promising evolutionary computation method. Due to the use of probabilistic models, EDAs can outperform conventional evolutionary computation. In this paper, EDAs are extended to solve reinforcement learning problems, a framework for autonomous agents. In reinforcement learning problems, we must find a policy for the agents such that it yields a large amount of future reward. In general, such a policy can be represented by conditional probabilities of the agents' actions, given the perceptual inputs. In order to estimate such a conditional probability distribution, Conditional Random Fields (CRFs) by Lafferty (2001) are introduced into EDAs. CRFs are adopted because they are able to learn conditional probabilistic distributions from a large amount of input-output data, i.e., episodes in the case of reinforcement learning problems. Computer simulations on Probabilistic Transition Problems and Perceptual Aliasing Maze Problems show the effectiveness of EDA-RL.

  20. The influence of attention and reward on the learning of stimulus-response associations.

    PubMed

    Vartak, Devavrat; Jeurissen, Danique; Self, Matthew W; Roelfsema, Pieter R

    2017-08-22

    We can learn new tasks by listening to a teacher, but we can also learn by trial-and-error. Here, we investigate the factors that determine how participants learn new stimulus-response mappings by trial-and-error. Does learning in human observers comply with reinforcement learning theories, which describe how subjects learn from rewards and punishments? If yes, what is the influence of selective attention in the learning process? We developed a novel redundant-relevant learning paradigm to examine the conjoint influence of attention and reward feedback. We found that subjects only learned stimulus-response mappings for attended shapes, even when unattended shapes were equally informative. Reward magnitude also influenced learning, an effect that was stronger for attended than for non-attended shapes and that carried over to a subsequent visual search task. Our results provide insights into how attention and reward jointly determine how we learn. They support the powerful learning rules that capitalize on the conjoint influence of these two factors on neuronal plasticity.

  1. Robot-assisted motor training: assistance decreases exploration during reinforcement learning.

    PubMed

    Sans-Muntadas, Albert; Duarte, Jaime E; Reinkensmeyer, David J

    2014-01-01

    Reinforcement learning (RL) is a form of motor learning that robotic therapy devices could potentially manipulate to promote neurorehabilitation. We developed a system that requires trainees to use RL to learn a predefined target movement. The system provides higher rewards for movements that are more similar to the target movement. We also developed a novel algorithm that rewards trainees of different abilities with comparable reward sizes. This algorithm measures a trainee's performance relative to their best performance, rather than relative to an absolute target performance, to determine reward. We hypothesized this algorithm would permit subjects who cannot normally achieve high reward levels to do so while still learning. In an experiment with 21 unimpaired human subjects, we found that all subjects quickly learned to make a first target movement with and without the reward equalization. However, artificially increasing reward decreased the subjects' tendency to engage in exploration and therefore slowed learning, particularly when we changed the target movement. An anti-slacking watchdog algorithm further slowed learning. These results suggest that robotic algorithms that assist trainees in achieving rewards or in preventing slacking might, over time, discourage the exploration needed for reinforcement learning.
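
    A minimal Python sketch of the reward-equalization idea: each movement is graded against the trainee's own best performance so far rather than an absolute target (names and scaling are assumptions):

        def make_relative_reward(max_reward=100.0):
            best = 0.0
            def reward(similarity_to_target):
                # Grade each movement against this trainee's personal best, so
                # trainees of different abilities can earn comparable rewards.
                nonlocal best
                best = max(best, similarity_to_target)
                return max_reward * similarity_to_target / best if best > 0 else 0.0
            return reward

        trainee_reward = make_relative_reward()
        trainee_reward(0.4)   # first movement sets the personal best -> 100.0
        trainee_reward(0.3)   # a weaker movement now earns 75.0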

  2. Scaling prediction errors to reward variability benefits error-driven learning in humans

    PubMed Central

    Schultz, Wolfram

    2015-01-01

    Effective error-driven learning requires individuals to adapt learning to environmental reward variability. The adaptive mechanism may involve decays in learning rate across subsequent trials, as shown previously, and rescaling of reward prediction errors. The present study investigated the influence of prediction error scaling and, in particular, the consequences for learning performance. Participants explicitly predicted reward magnitudes that were drawn from different probability distributions with specific standard deviations. By fitting the data with reinforcement learning models, we found scaling of prediction errors, in addition to the learning rate decay shown previously. Importantly, the prediction error scaling was closely related to learning performance, defined as accuracy in predicting the mean of reward distributions, across individual participants. In addition, participants who scaled prediction errors relative to standard deviation also presented with more similar performance for different standard deviations, indicating that increases in standard deviation did not substantially decrease “adapters'” accuracy in predicting the means of reward distributions. However, exaggerated scaling beyond the standard deviation resulted in impaired performance. Thus efficient adaptation makes learning more robust to changing variability. PMID:26180123
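
    A minimal Python sketch of the fitted mechanism, assuming a hyperbolic learning-rate decay and prediction errors rescaled by the reward distribution's standard deviation (both functional forms are assumptions):

        alpha0 = 0.5   # initial learning rate (assumed)

        def scaled_update(estimate, reward, sigma, trial):
            # The learning rate decays across trials, and the prediction error
            # is normalized by the spread of the current reward distribution,
            # making learning robust to changes in reward variability.
            alpha = alpha0 / (1.0 + trial)
            delta = (reward - estimate) / sigma
            return estimate + alpha * delta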

  3. Stress modulates reinforcement learning in younger and older adults.

    PubMed

    Lighthall, Nichole R; Gorlick, Marissa A; Schoeke, Andrej; Frank, Michael J; Mather, Mara

    2013-03-01

    Animal research and human neuroimaging studies indicate that stress increases dopamine levels in brain regions involved in reward processing, and stress also appears to increase the attractiveness of addictive drugs. The current study tested the hypothesis that stress increases reward salience, leading to more effective learning about positive than negative outcomes in a probabilistic selection task. Changes to dopamine pathways with age raise the question of whether stress effects on incentive-based learning differ by age. Thus, the present study also examined whether effects of stress on reinforcement learning differed for younger (age 18-34) and older participants (age 65-85). Cold pressor stress was administered to half of the participants in each age group, and salivary cortisol levels were used to confirm biophysiological response to cold stress. After the manipulation, participants completed a probabilistic learning task involving positive and negative feedback. In both younger and older adults, stress enhanced learning about cues that predicted positive outcomes. In addition, during the initial learning phase, stress diminished sensitivity to recent feedback across age groups. These results indicate that stress affects reinforcement learning in both younger and older adults and suggests that stress exerts different effects on specific components of reinforcement learning depending on their neural underpinnings.

  4. Learning strategies in table tennis using inverse reinforcement learning.

    PubMed

    Muelling, Katharina; Boularias, Abdeslam; Mohler, Betty; Schölkopf, Bernhard; Peters, Jan

    2014-10-01

    Learning a complex task such as table tennis is a challenging problem for both robots and humans. Even after acquiring the necessary motor skills, a strategy is needed to choose where and how to return the ball to the opponent's court in order to win the game. The data-driven identification of basic strategies in interactive tasks, such as table tennis, is a largely unexplored problem. In this paper, we suggest a computational model for representing and inferring strategies, based on a Markov decision problem, where the reward function models the goal of the task as well as the strategic information. We show how this reward function can be discovered from demonstrations of table tennis matches using model-free inverse reinforcement learning. The resulting framework allows us to identify basic elements on which the selection of striking movements is based. We tested our approach on data collected from players with different playing styles and under different playing conditions. The estimated reward function was able to capture expert-specific strategic information that sufficed to distinguish the expert among players with different skill levels as well as different playing styles.

  5. Rational and Mechanistic Perspectives on Reinforcement Learning

    ERIC Educational Resources Information Center

    Chater, Nick

    2009-01-01

    This special issue describes important recent developments in applying reinforcement learning models to capture neural and cognitive function. But reinforcement learning, as a theoretical framework, can apply at two very different levels of description: "mechanistic" and "rational." Reinforcement learning is often viewed in mechanistic terms--as…

  6. A reinforcement learning approach to online clustering.

    PubMed

    Likas, A

    1999-11-15

    A general technique is proposed for embedding online clustering algorithms based on competitive learning in a reinforcement learning framework. The basic idea is that the clustering system can be viewed as a reinforcement learning system that learns through reinforcements to follow the clustering strategy we wish to implement. In this sense, the reinforcement guided competitive learning (RGCL) algorithm is proposed that constitutes a reinforcement-based adaptation of learning vector quantization (LVQ) with enhanced clustering capabilities. In addition, we suggest extensions of RGCL and LVQ that are characterized by the property of sustained exploration and significantly improve the performance of those algorithms, as indicated by experimental tests on well-known data sets.
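
    A minimal numpy sketch of reinforcement-guided competitive learning in this spirit: a winner-take-all step followed by an LVQ-style update whose sign is set by a reinforcement signal (the success criterion below is a hypothetical threshold):

        import numpy as np

        def rgcl_step(prototypes, x, lr=0.05, success_radius=1.0):
            # Competitive step: the nearest prototype wins the input.
            dists = np.linalg.norm(prototypes - x, axis=1)
            win = int(np.argmin(dists))
            # Reinforcement step: a success signal decides whether the winner
            # is attracted to or repelled from the input (LVQ-style update).
            r = 1.0 if dists[win] < success_radius else -1.0
            prototypes[win] += lr * r * (x - prototypes[win])
            return win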

  7. Behavioral and neural properties of social reinforcement learning.

    PubMed

    Jones, Rebecca M; Somerville, Leah H; Li, Jian; Ruberry, Erika J; Libby, Victoria; Glover, Gary; Voss, Henning U; Ballon, Douglas J; Casey, B J

    2011-09-14

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based on work in nonhuman primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging. Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis--social preferences, response latencies, and modeling neural responses--are consistent with reinforcement learning theory and nonhuman primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one's peers in altering subsequent behavior.

  8. Behavioral and neural properties of social reinforcement learning

    PubMed Central

    Jones, Rebecca M.; Somerville, Leah H.; Li, Jian; Ruberry, Erika J.; Libby, Victoria; Glover, Gary; Voss, Henning U.; Ballon, Douglas J.; Casey, BJ

    2011-01-01

    Social learning is critical for engaging in complex interactions with other individuals. Learning from positive social exchanges, such as acceptance from peers, may be similar to basic reinforcement learning. We formally test this hypothesis by developing a novel paradigm that is based upon work in non-human primates and human imaging studies of reinforcement learning. The probability of receiving positive social reinforcement from three distinct peers was parametrically manipulated while brain activity was recorded in healthy adults using event-related functional magnetic resonance imaging (fMRI). Over the course of the experiment, participants responded more quickly to faces of peers who provided more frequent positive social reinforcement, and rated them as more likeable. Modeling trial-by-trial learning showed ventral striatum and orbital frontal cortex activity correlated positively with forming expectations about receiving social reinforcement. Rostral anterior cingulate cortex activity tracked positively with modulations of expected value of the cues (peers). Together, the findings across three levels of analysis - social preferences, response latencies and modeling neural responses – are consistent with reinforcement learning theory and non-human primate electrophysiological studies of reward. This work highlights the fundamental influence of acceptance by one’s peers in altering subsequent behavior. PMID:21917787

  9. SAN-RL: combining spreading activation networks and reinforcement learning to learn configurable behaviors

    NASA Astrophysics Data System (ADS)

    Gaines, Daniel M.; Wilkes, Don M.; Kusumalnukool, Kanok; Thongchai, Siripun; Kawamura, Kazuhiko; White, John H.

    2002-02-01

    Reinforcement learning techniques have been successful in allowing an agent to learn a policy for achieving tasks. The overall behavior of the agent can be controlled with an appropriate reward function. However, the policy that is learned will be fixed to this reward function. If the user wishes to change his or her preference about how the task is achieved, the agent must be retrained with this new reward function. We address this challenge by combining Spreading Activation Networks and Reinforcement Learning in an approach we call SAN-RL. This approach provides the agent with a causal structure, the spreading activation network, relating goals to the actions that can achieve those goals. This enables the agent to select actions relative to the goal priorities. We combine this with reinforcement learning to enable the agent to learn a policy. Together, these approaches enable the learning of configurable behaviors: a policy that can be adapted to meet the current preferences. We compare the approach with Q-learning on a robot navigation task. We demonstrate that SAN-RL exhibits goal-directed behavior before learning, exploits the causal structure of the network to focus its search during learning, and results in configurable behaviors after learning.

  10. Reinforcement Learning and Savings Behavior*

    PubMed Central

    Choi, James J.; Laibson, David; Madrian, Brigitte C.; Metrick, Andrew

    2009-01-01

    We show that individual investors over-extrapolate from their personal experience when making savings decisions. Investors who experience particularly rewarding outcomes from saving in their 401(k)—a high average and/or low variance return—increase their 401(k) savings rate more than investors who have less rewarding experiences with saving. This finding is not driven by aggregate time-series shocks, income effects, rational learning about investing skill, investor fixed effects, or time-varying investor-level heterogeneity that is correlated with portfolio allocations to stock, bond, and cash asset classes. We discuss implications for the equity premium puzzle and interventions aimed at improving household financial outcomes. PMID:20352013

  11. Force-proportional reinforcement: pimozide does not reduce rats' emission of higher forces for sweeter rewards.

    PubMed

    Kirkpatrick, M A; Fowler, S C

    1989-02-01

    A two-step force-proportional reinforcement procedure was used to assess the efficacy of a sucrose reward under neuroleptic challenge. The force-proportional reinforcement method entails an increase in the quality of reward contingent upon higher force-emission. This paradigm was conceived as a rate-free means of addressing the putative anhedonic effects of dopaminergic receptor-blocking agents. Results failed to support the anhedonia interpretation of neuroleptic-induced response decrements: Pimozide did not diminish the ability of a high-concentration sucrose solution to maintain elevated response forces. Alternatives to the anhedonia interpretation were discussed with emphasis on the drug's motor effects in the temporal domain.

  12. A parallel framework for Bayesian reinforcement learning

    NASA Astrophysics Data System (ADS)

    Barrett, Enda; Duggan, Jim; Howley, Enda

    2014-01-01

    Solving a finite Markov decision process using techniques from dynamic programming such as value or policy iteration require a complete model of the environmental dynamics. The distribution of rewards, transition probabilities, states and actions all need to be fully observable, discrete and complete. For many problem domains, a complete model containing a full representation of the environmental dynamics may not be readily available. Bayesian reinforcement learning (RL) is a technique devised to make better use of the information observed through learning than simply computing Q-functions. However, this approach can often require extensive experience in order to build up an accurate representation of the true values. To address this issue, this paper proposes a method for parallelising a Bayesian RL technique aimed at reducing the time it takes to approximate the missing model. We demonstrate the technique on learning next state transition probabilities without prior knowledge. The approach is general enough for approximating any probabilistically driven component of the model. The solution involves multiple learning agents learning in parallel on the same task. Agents share probability density estimates amongst each other in an effort to speed up convergence to the true values.
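
    A minimal Python sketch of the sharing step, assuming agents keep Dirichlet pseudo-counts over next-state transitions and periodically pool them; the sizes and pooling rule are assumptions:

        import numpy as np

        N_STATES, N_ACTIONS = 5, 2   # assumed problem size

        class BayesAgent:
            def __init__(self):
                # Dirichlet(1, ..., 1) prior over next-state transitions.
                self.counts = np.ones((N_STATES, N_ACTIONS, N_STATES))

            def observe(self, s, a, s_next):
                self.counts[s, a, s_next] += 1.0

            def transition_estimate(self):
                return self.counts / self.counts.sum(axis=2, keepdims=True)

        def share(agents):
            # Pool the agents' evidence (subtracting the duplicated priors)
            # so each agent's density estimate converges faster.
            pooled = sum(a.counts for a in agents) - (len(agents) - 1)
            for a in agents:
                a.counts = pooled.copy()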

  13. Reward-Guided Learning with and without Causal Attribution.

    PubMed

    Jocham, Gerhard; Brodersen, Kay H; Constantinescu, Alexandra O; Kahn, Martin C; Ianni, Angela M; Walton, Mark E; Rushworth, Matthew F S; Behrens, Timothy E J

    2016-04-06

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task.

  14. Reward-Guided Learning with and without Causal Attribution

    PubMed Central

    Jocham, Gerhard; Brodersen, Kay H.; Constantinescu, Alexandra O.; Kahn, Martin C.; Ianni, Angela M.; Walton, Mark E.; Rushworth, Matthew F.S.; Behrens, Timothy E.J.

    2016-01-01

    When an organism receives a reward, it is crucial to know which of many candidate actions caused this reward. However, recent work suggests that learning is possible even when this most fundamental assumption is not met. We used novel reward-guided learning paradigms in two fMRI studies to show that humans deploy separable learning mechanisms that operate in parallel. While behavior was dominated by precise contingent learning, it also revealed hallmarks of noncontingent learning strategies. These learning mechanisms were separable behaviorally and neurally. Lateral orbitofrontal cortex supported contingent learning and reflected contingencies between outcomes and their causal choices. Amygdala responses around reward times related to statistical patterns of learning. Time-based heuristic mechanisms were related to activity in sensorimotor corticostriatal circuitry. Our data point to the existence of several learning mechanisms in the human brain, of which only one relies on applying known rules about the causal structure of the task. PMID:26971947

  15. Dopamine-Dependent Reinforcement of Motor Skill Learning: Evidence from Gilles de la Tourette Syndrome

    ERIC Educational Resources Information Center

    Palminteri, Stefano; Lebreton, Mael; Worbe, Yulia; Hartmann, Andreas; Lehericy, Stephane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-01-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only…

  17. Simulation of rat behavior by a reinforcement learning algorithm in consideration of appearance probabilities of reinforcement signals.

    PubMed

    Murakoshi, Kazushi; Noguchi, Takuya

    2005-04-01

    Brown and Wagner [Brown, R.T., Wagner, A.R., 1964. Resistance to punishment and extinction following training with shock or nonreinforcement. J. Exp. Psychol. 68, 503-507] investigated rat behaviors with the following features: (1) rats were exposed to reward and punishment at the same time, (2) the environment changed and rats relearned, and (3) rats were stochastically exposed to reward and punishment. The results are that exposure to nonreinforcement produces resistance to the decremental effects of behavior after a stochastic reward schedule and that exposure to both punishment and reinforcement produces resistance to the decremental effects of behavior after a stochastic punishment schedule. This paper aims to simulate these rat behaviors with a reinforcement learning algorithm that takes the appearance probabilities of reinforcement signals into account. Earlier reinforcement learning algorithms were unable to simulate behavior of feature (3). We improve on them by controlling learning parameters in consideration of the acquisition probabilities of reinforcement signals. The proposed algorithm qualitatively simulates the result of the animal experiment of Brown and Wagner.

  18. The impact of mineralocorticoid receptor ISO/VAL genotype (rs5522) and stress on reward learning.

    PubMed

    Bogdan, R; Perlis, R H; Fagerness, J; Pizzagalli, D A

    2010-08-01

    Research suggests that stress disrupts reinforcement learning and induces anhedonia. The mineralocorticoid receptor (MR) determines the sensitivity of the stress response, and the missense iso/val polymorphism (Ile180Val, rs5522) of the MR gene (NR3C2) has been associated with enhanced physiological stress responses, elevated depressive symptoms and reduced cortisol-induced MR gene expression. The goal of these studies was to evaluate whether rs5522 genotype and stress independently and interactively influence reward learning. In study 1, participants (n = 174) completed a probabilistic reward task under baseline (i.e. no-stress) conditions. In study 2, participants (n = 53) completed the task during a stress (threat-of-shock) and no-stress condition. Reward learning, i.e. the ability to modulate behavior as a function of reinforcement history, was the main variable of interest. In study 1, in which participants were evaluated under no-stress conditions, reward learning was enhanced in val carriers. In study 2, participants developed a weaker response bias toward a more frequently rewarded stimulus under the stress relative to no-stress condition. Critically, stress-induced reward learning deficits were largest in val carriers. Although preliminary and in need of replication due to the small sample size, findings indicate that psychiatrically healthy individuals carrying the MR val allele, which has recently been linked to depression, showed a reduced ability to modulate behavior as a function of reward when facing an acute, uncontrollable stressor. Future studies are warranted to evaluate whether rs5522 genotype interacts with naturalistic stressors to increase the risk of depression and whether stress-induced anhedonia might moderate such risk.
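
    For context, probabilistic reward tasks of this kind typically quantify reward learning as a signal-detection response bias toward the more frequently rewarded ("rich") stimulus. A sketch of that standard computation follows; the half-trial correction for empty cells is a common convention assumed here, not taken from this paper.

```python
import math

def response_bias(rich_correct, rich_incorrect, lean_correct, lean_incorrect):
    """Signal-detection response bias (log b) toward the rich stimulus.

    A half-trial correction is added to every cell, a common convention
    that avoids division by zero when a cell is empty.
    """
    rc, ri = rich_correct + 0.5, rich_incorrect + 0.5
    lc, li = lean_correct + 0.5, lean_incorrect + 0.5
    return 0.5 * math.log((rc * li) / (ri * lc))

# Example: a participant who favors the rich stimulus shows log b > 0.
print(response_bias(rich_correct=80, rich_incorrect=20,
                    lean_correct=55, lean_incorrect=45))
```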

  19. Reinforcement Learning Deficits in People with Schizophrenia Persist after Extended Trials

    PubMed Central

    Cicero, David C.; Martin, Elizabeth A.; Becker, Theresa M.; Kerns, John G.

    2014-01-01

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and are then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning. PMID:25172610

  20. Reinforcement learning deficits in people with schizophrenia persist after extended trials.

    PubMed

    Cicero, David C; Martin, Elizabeth A; Becker, Theresa M; Kerns, John G

    2014-12-30

    Previous research suggests that people with schizophrenia have difficulty learning from positive feedback and when learning needs to occur rapidly. However, they seem to have relatively intact learning from negative feedback when learning occurs gradually. Participants are typically given a limited number of acquisition trials to learn the reward contingencies and are then tested on what they learned. The current study examined whether participants with schizophrenia continue to display these deficits when given extra time to learn the contingencies. Participants with schizophrenia and matched healthy controls completed the Probabilistic Selection Task, which measures positive and negative feedback learning separately. Participants with schizophrenia showed a deficit in learning from both positive feedback and negative feedback. These reward learning deficits persisted even when people with schizophrenia were given extra time (up to 10 blocks of 60 trials) to learn the reward contingencies. These results suggest that the observed deficits cannot be attributed solely to slower learning and instead reflect a specific deficit in reinforcement learning.
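
    For reference, the Probabilistic Selection Task separates the two forms of feedback learning at test: positive-feedback learning is conventionally indexed by how often the most-rewarded training stimulus ("A") is chosen in novel pairings, and negative-feedback learning by how often the most-punished stimulus ("B") is avoided. A minimal scoring sketch under those conventions; the data format is an assumption.

```python
def pst_scores(test_trials):
    """test_trials: iterable of ((stim1, stim2), choice) tuples from the
    test phase, e.g. (('A', 'C'), 'A'). Pairs containing both A and B
    are excluded by the filters below."""
    choose_a = [c == 'A' for pair, c in test_trials
                if 'A' in pair and 'B' not in pair]
    avoid_b = [c != 'B' for pair, c in test_trials
               if 'B' in pair and 'A' not in pair]
    frac = lambda xs: sum(xs) / len(xs) if xs else float('nan')
    return {'choose_A': frac(choose_a), 'avoid_B': frac(avoid_b)}

print(pst_scores([(('A', 'C'), 'A'), (('D', 'B'), 'D'), (('A', 'E'), 'E')]))
# {'choose_A': 0.5, 'avoid_B': 1.0}
```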

  1. Reinforcement Learning for Robots Using Neural Networks

    DTIC Science & Technology

    1993-01-06

    Reinforcement learning agents are adaptive, reactive, and self-supervised. The aim of this dissertation is to extend the state of the art of... reinforcement learning and enable its applications to complex robot-learning problems. In particular, it focuses on two issues. First, learning from sparse... reinforcement learning methods assume that the world is a Markov decision process. This assumption is too strong for many robot tasks of interest…

  2. Rule Learning in Autism: The Role of Reward Type and Social Context

    PubMed Central

    Jones, E. J. H.; Webb, S. J.; Estes, A.; Dawson, G.

    2013-01-01

    Learning abstract rules is central to social and cognitive development. Across two experiments, we used Delayed Non-Matching to Sample tasks to characterize the longitudinal development and nature of rule-learning impairments in children with Autism Spectrum Disorder (ASD). Results showed that children with ASD consistently experienced more difficulty learning an abstract rule from a discrete physical reward than children with developmental delay (DD). Rule learning was facilitated by the provision of more concrete reinforcement, suggesting an underlying difficulty in forming conceptual connections. Learning abstract rules about social stimuli remained challenging through late childhood, indicating the importance of testing executive functions in both social and non-social contexts. PMID:23311315

  3. Use of Inverse Reinforcement Learning for Identity Prediction

    NASA Technical Reports Server (NTRS)

    Hayes, Roy; Bao, Jonathan; Beling, Peter; Horowitz, Barry

    2011-01-01

    We adopt Markov Decision Processes (MDP) to model sequential decision problems, which have the characteristic that the current decision made by a human decision maker has an uncertain impact on future opportunity. We hypothesize that the individuality of decision makers can be modeled as differences in the reward function under a common MDP model. A machine learning technique, Inverse Reinforcement Learning (IRL), was used to learn an individual's reward function based on limited observation of his or her decision choices. This work serves as an initial investigation for using IRL to analyze decision making, conducted through a human experiment in a cyber shopping environment. Specifically, the ability to determine the demographic identity of users is conducted through prediction analysis and supervised learning. The results show that IRL can be used to correctly identify participants, at a rate of 68% for gender and 66% for one of three college major categories.
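
    The prediction pipeline can be caricatured as follows: model each demographic group by a reward weight vector over decision features, then label a new user by whichever group's reward function best explains his or her observed choices. This sketch is an assumed simplification (a myopic softmax policy stands in for solving the full MDP, and all names are hypothetical).

```python
import math

def choice_loglik(w, trajectories, features):
    """Log-likelihood of observed choices under a softmax policy whose
    utilities are linear in reward weights w.
    trajectories: list of (state, chosen_action, available_actions).
    features(state, action) -> list of floats."""
    ll = 0.0
    for state, chosen, options in trajectories:
        scores = {a: sum(wi * fi for wi, fi in zip(w, features(state, a)))
                  for a in options}
        log_z = math.log(sum(math.exp(v) for v in scores.values()))
        ll += scores[chosen] - log_z
    return ll

def predict_identity(candidate_ws, trajectories, features):
    """candidate_ws: {group label -> reward weights fit to that group}."""
    return max(candidate_ws,
               key=lambda g: choice_loglik(candidate_ws[g], trajectories, features))
```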

  4. The ubiquity of model-based reinforcement learning.

    PubMed

    Doll, Bradley B; Simon, Dylan A; Daw, Nathaniel D

    2012-12-01

    The reward prediction error (RPE) theory of dopamine (DA) function has enjoyed great success in the neuroscience of learning and decision-making. This theory is derived from model-free reinforcement learning (RL), in which choices are made simply on the basis of previously realized rewards. Recently, attention has turned to correlates of more flexible, albeit computationally complex, model-based methods in the brain. These methods are distinguished from model-free learning by their evaluation of candidate actions using expected future outcomes according to a world model. Puzzlingly, signatures of these computations seem to be pervasive in the very same regions previously thought to support model-free learning. Here, we review recent behavioral and neural evidence about these two systems, in an attempt to reconcile their enigmatic cohabitation in the brain.
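
    The distinction the review draws can be made concrete with two toy evaluation rules; the notation below is an illustrative assumption, not the review's.

```python
GAMMA, ALPHA = 0.9, 0.1

def model_free_update(Q, s, a, r, s_next, actions):
    # Model-free: cache action values directly from realized rewards
    # (a standard Q-learning/TD update).
    old = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)

def model_based_value(s, a, T, R, V):
    # Model-based: evaluate a candidate action by its expected future
    # outcomes under a learned world model. T[s][a] maps successor
    # states to probabilities; R[s][a] is the expected immediate reward.
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a].items())
```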

  5. Probabilistic reward learning in adults with Attention Deficit Hyperactivity Disorder--an electrophysiological study.

    PubMed

    Thoma, Patrizia; Edel, Marc-Andreas; Suchan, Boris; Bellebaum, Christian

    2015-01-30

    Attention Deficit Hyperactivity Disorder (ADHD) is hypothesized to be characterized by altered reinforcement sensitivity. The main aim of the present study was to assess alterations in the electrophysiological correlates of monetary reward processing in adult patients with ADHD of the combined subtype. Fourteen adults with ADHD of the combined subtype and 14 healthy control participants performed an active and an observational probabilistic reward-based learning task while an electroencephalogram (EEG) was recorded. Regardless of feedback valence, there was a general feedback-related negativity (FRN) enhancement in combination with reduced learning performance during both active and observational reward learning in patients with ADHD relative to healthy controls. Other feedback-locked potentials such as the P200 and P300 and response-locked potentials were unaltered in the patients. There were no significant correlations between learning performance, FRN amplitudes and clinical symptoms, neither in the overall group involving all participants, nor in patients or controls considered separately. This pattern of findings might reflect generally impaired reward prediction in adults with ADHD of the combined subtype. We demonstrated for the first time that patients with ADHD of the combined subtype show not only deficient active reward learning but are also impaired when learning by observing other people's outcomes.

  6. Striatal dopamine D1 receptor suppression impairs reward-associative learning.

    PubMed

    Higa, Kerin K; Young, Jared W; Ji, Baohu; Nichols, David E; Geyer, Mark A; Zhou, Xianjin

    2017-04-14

    Dopamine (DA) is required for reinforcement learning. Hence, disruptions in DA signaling may contribute to the learning deficits associated with psychiatric disorders. The DA D1 receptor (D1R) has been linked to learning and is a target for cognitive/motivational enhancement in patients with schizophrenia. Separating the striatal D1R contribution to learning vs. motivation, however, has been challenging. We suppressed striatal D1R expression in mice using a D1R-targeting short hairpin RNA (shRNA), delivered locally to the striatum via an adeno-associated virus (AAV). We then assessed reward- and punishment-associative learning using a probabilistic learning task and motivation using a progressive-ratio breakpoint procedure. We confirmed suppression of striatal D1Rs immunohistochemically and by testing locomotor activity after the administration of (+)-doxanthrine, a full D1R agonist, in control mice and those treated with the D1R shRNA. D1R-shRNA-treated mice exhibited impaired reward-associative learning, while punishment-associative learning was spared. This deficit was unrelated to general learning impairments or amotivation, because the D1R-shRNA-treated mice exhibited normal Barnes maze learning and normal motivation in the progressive-ratio breakpoint procedure. Suppression of striatal D1Rs selectively impaired reward-associative learning, whereas punishment-associative learning, aversion-motivated learning, and appetitive motivation were spared. Because patients with schizophrenia exhibit similar reward-associative learning deficits, D1R-targeted treatments should be investigated to improve reward learning in these patients.

  7. Overlapping prediction errors in dorsal striatum during instrumental learning with juice and money reward in the human brain.

    PubMed

    Valentin, Vivian V; O'Doherty, John P

    2009-12-01

    Prediction error signals have been reported in human imaging studies in target areas of dopamine neurons such as ventral and dorsal striatum during learning with many different types of reinforcers. However, a key question that has yet to be addressed is whether prediction error signals recruit distinct or overlapping regions of striatum and elsewhere during learning with different types of reward. To address this, we scanned 17 healthy subjects with functional magnetic resonance imaging while they chose actions to obtain either a pleasant juice reward (1 ml apple juice), or a monetary gain (5 cents) and applied a computational reinforcement learning model to subjects' behavioral and imaging data. Evidence for an overlapping prediction error signal during learning with juice and money rewards was found in a region of dorsal striatum (caudate nucleus), while prediction error signals in a subregion of ventral striatum were significantly stronger during learning with money but not juice reward. These results provide evidence for partially overlapping reward prediction signals for different types of appetitive reinforcers within the striatum, a finding with important implications for understanding the nature of associative encoding in the striatum as a function of reinforcer type.

  8. Generalization of value in reinforcement learning by humans.

    PubMed

    Wimmer, G Elliott; Daw, Nathaniel D; Shohamy, Daphna

    2012-04-01

    Research in decision-making has focused on the role of dopamine and its striatal targets in guiding choices via learned stimulus-reward or stimulus-response associations, behavior that is well described by reinforcement learning theories. However, basic reinforcement learning is relatively limited in scope and does not explain how learning about stimulus regularities or relations may guide decision-making. A candidate mechanism for this type of learning comes from the domain of memory, which has highlighted a role for the hippocampus in learning of stimulus-stimulus relations, typically dissociated from the role of the striatum in stimulus-response learning. Here, we used functional magnetic resonance imaging and computational model-based analyses to examine the joint contributions of these mechanisms to reinforcement learning. Humans performed a reinforcement learning task with added relational structure, modeled after tasks used to isolate hippocampal contributions to memory. On each trial participants chose one of four options, but the reward probabilities for pairs of options were correlated across trials. This (uninstructed) relationship between pairs of options potentially enabled an observer to learn about option values based on experience with the other options and to generalize across them. We observed blood oxygen level-dependent (BOLD) activity related to learning in the striatum and also in the hippocampus. By comparing a basic reinforcement learning model to one augmented to allow feedback to generalize between correlated options, we tested whether choice behavior and BOLD activity were influenced by the opportunity to generalize across correlated options. Although such generalization goes beyond standard computational accounts of reinforcement learning and striatal BOLD, both choices and striatal BOLD activity were better explained by the augmented model. Consistent with the hypothesized role for the hippocampus in this generalization, functional…
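
    The "augmented" model the authors compare against can be sketched roughly as follows: when feedback arrives for the chosen option, its correlated partner is updated as well. The abstract does not state the sign of the correlation or the update rule, so the RHO = -1 (anti-correlated) setting, parameter names, and exact update below are assumptions for illustration.

```python
ALPHA, KAPPA = 0.2, 0.5    # KAPPA: generalization strength (assumed)
RHO = -1                   # assumed sign of the pairwise correlation

def update_with_generalization(V, chosen, partner, reward):
    # Standard delta-rule update for the chosen option...
    V[chosen] += ALPHA * (reward - V[chosen])
    # ...plus a generalization step: the correlated partner is nudged
    # toward the same outcome (RHO > 0) or its complement (RHO < 0).
    target = reward if RHO > 0 else (1 - reward)
    V[partner] += ALPHA * KAPPA * (target - V[partner])

V = {'opt1': 0.5, 'opt2': 0.5}
update_with_generalization(V, 'opt1', 'opt2', reward=1)
print(V)  # opt1 moves up toward 1, opt2 down toward 0
```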

  9. Working memory and reward association learning impairments in obesity

    PubMed Central

    Coppin, Géraldine; Nolan-Poupart, Sarah; Jones-Gotman, Marilyn; Small, Dana M.

    2014-01-01

    Obesity has been associated with impaired executive functions including working memory. Less explored is the influence of obesity on learning and memory. In the current study we assessed stimulus reward association learning, explicit learning and memory and working memory in healthy weight, overweight and obese individuals. Explicit learning and memory did not differ as a function of group. In contrast, working memory was significantly and similarly impaired in both overweight and obese individuals compared to the healthy weight group. In the first reward association learning task, the obese, but not healthy weight or overweight, participants consistently formed paradoxical preferences for a pattern associated with a negative outcome (fewer food rewards). To determine if the deficit was specific to food reward a second experiment was conducted using money. Consistent with Experiment 1, obese individuals selected the pattern associated with a negative outcome (fewer monetary rewards) more frequently than healthy weight individuals and thus failed to develop a significant preference for the most rewarded patterns as was observed in the healthy weight group. Finally, on a probabilistic learning task, obese compared to healthy weight individuals showed deficits in negative, but not positive outcome learning. Taken together, our results demonstrate deficits in working memory and stimulus reward learning in obesity and suggest that obese individuals are impaired in learning to avoid negative outcomes. PMID:25447070

  10. Working memory and reward association learning impairments in obesity.

    PubMed

    Coppin, Géraldine; Nolan-Poupart, Sarah; Jones-Gotman, Marilyn; Small, Dana M

    2014-12-01

    Obesity has been associated with impaired executive functions including working memory. Less explored is the influence of obesity on learning and memory. In the current study we assessed stimulus reward association learning, explicit learning and memory and working memory in healthy weight, overweight and obese individuals. Explicit learning and memory did not differ as a function of group. In contrast, working memory was significantly and similarly impaired in both overweight and obese individuals compared to the healthy weight group. In the first reward association learning task, the obese, but not healthy weight or overweight, participants consistently formed paradoxical preferences for a pattern associated with a negative outcome (fewer food rewards). To determine if the deficit was specific to food reward a second experiment was conducted using money. Consistent with Experiment 1, obese individuals selected the pattern associated with a negative outcome (fewer monetary rewards) more frequently than healthy weight individuals and thus failed to develop a significant preference for the most rewarded patterns as was observed in the healthy weight group. Finally, on a probabilistic learning task, obese compared to healthy weight individuals showed deficits in negative, but not positive outcome learning. Taken together, our results demonstrate deficits in working memory and stimulus reward learning in obesity and suggest that obese individuals are impaired in learning to avoid negative outcomes. Copyright © 2014 Elsevier Ltd. All rights reserved.

  11. Learning the specific quality of taste reinforcement in larval Drosophila.

    PubMed

    Schleyer, Michael; Miura, Daisuke; Tanimura, Teiichi; Gerber, Bertram

    2015-01-27

    The only property of reinforcement insects are commonly thought to learn about is its value. We show that larval Drosophila not only remember the value of reinforcement (How much?), but also its quality (What?). This is demonstrated both within the appetitive domain by using sugar vs amino acid as different reward qualities, and within the aversive domain by using bitter vs high-concentration salt as different qualities of punishment. From the available literature, such nuanced memories for the quality of reinforcement are unexpected and pose a challenge to present models of how insect memory is organized. Given that animals as simple as larval Drosophila, endowed with but 10,000 neurons, operate with both reinforcement value and quality, we suggest that both are fundamental aspects of mnemonic processing-in any brain.

  12. Dopamine selectively remediates 'model-based' reward learning: a computational approach.

    PubMed

    Sharp, Madeleine E; Foerde, Karin; Daw, Nathaniel D; Shohamy, Daphna

    2016-02-01

    Patients with loss of dopamine due to Parkinson's disease are impaired at learning from reward. However, it remains unknown precisely which aspect of learning is impaired. In particular, learning from reward, or reinforcement learning, can be driven by two distinct computational processes. One involves habitual stamping-in of stimulus-response associations, hypothesized to arise computationally from 'model-free' learning. The other, 'model-based' learning, involves learning a model of the world that is believed to support goal-directed behaviour. Much work has pointed to a role for dopamine in model-free learning. But recent work suggests model-based learning may also involve dopamine modulation, raising the possibility that model-based learning may contribute to the learning impairment in Parkinson's disease. To directly test this, we used a two-step reward-learning task which dissociates model-free versus model-based learning. We evaluated learning in patients with Parkinson's disease tested ON versus OFF their dopamine replacement medication and in healthy controls. Surprisingly, we found no effect of disease or medication on model-free learning. Instead, we found that patients tested OFF medication showed a marked impairment in model-based learning, and that this impairment was remediated by dopaminergic medication. Moreover, model-based learning was positively correlated with a separate measure of working memory performance, raising the possibility of common neural substrates. Our results suggest that some learning deficits in Parkinson's disease may be related to an inability to pursue reward based on complete representations of the environment. © The Author (2015). Published by Oxford University Press on behalf of the Guarantors of Brain. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
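
    Two-step tasks of this kind are commonly analyzed by fitting a hybrid agent whose first-stage valuation mixes the two systems, with a weight w indexing reliance on model-based control. A sketch under that common convention; the paper's exact fitted model is not reproduced here, and all names are assumptions.

```python
def model_based_q1(trans_probs, q_stage2):
    """First-stage model-based values: expected best second-stage value
    under the task's transition structure.
    trans_probs[a] maps second-stage states to probabilities;
    q_stage2[s2] maps second-stage actions to values."""
    return {a: sum(p * max(q_stage2[s2].values())
                   for s2, p in trans_probs[a].items())
            for a in trans_probs}

def hybrid_q1(q_mf, q_mb, w):
    """Values used for choice: w = 1 purely model-based, w = 0 purely model-free."""
    return {a: w * q_mb[a] + (1 - w) * q_mf[a] for a in q_mf}
```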

  13. No Apparent Influence of Reward upon Visual Statistical Learning.

    PubMed

    Rogers, Leeland L; Friedman, Kyle G; Vickery, Timothy J

    2016-01-01

    Humans are capable of detecting and exploiting a variety of environmental regularities, including stimulus-stimulus contingencies (e.g., visual statistical learning) and stimulus-reward contingencies. However, the relationship between these two types of learning is poorly understood. In two experiments, we sought evidence that the occurrence of rewarding events enhances or impairs visual statistical learning. Across all of our attempts to find such evidence, we employed a training stage during which we grouped shapes into triplets and presented triplets one shape at a time in an undifferentiated stream. Participants subsequently performed a surprise recognition task in which they were tested on their knowledge of the underlying structure of the triplets. Unbeknownst to participants, triplets were also assigned no-, low-, or high-reward status. In Experiments 1A and 1B, participants viewed shape streams while low and high rewards were "randomly" given, presented as low- and high-pitched tones played through headphones. Rewards were always given on the third shape of a triplet (Experiment 1A) or the first shape of a triplet (Experiment 1B), and high- and low-reward sounds were always consistently paired with the same triplets. Experiment 2 was similar to Experiment 1, except that participants were required to learn value associations of a subset of shapes before viewing the shape stream. Across all experiments, we observed significant visual statistical learning effects, but the strength of learning did not differ amongst no-, low-, or high-reward conditions for any of the experiments. Thus, our experiments failed to find any influence of rewards on statistical learning, implying that visual statistical learning may be unaffected by the occurrence of reward. The system that detects basic stimulus-stimulus regularities may operate independently of the system that detects reward contingencies.

  14. No Apparent Influence of Reward upon Visual Statistical Learning

    PubMed Central

    Rogers, Leeland L.; Friedman, Kyle G.; Vickery, Timothy J.

    2016-01-01

    Humans are capable of detecting and exploiting a variety of environmental regularities, including stimulus-stimulus contingencies (e.g., visual statistical learning) and stimulus-reward contingencies. However, the relationship between these two types of learning is poorly understood. In two experiments, we sought evidence that the occurrence of rewarding events enhances or impairs visual statistical learning. Across all of our attempts to find such evidence, we employed a training stage during which we grouped shapes into triplets and presented triplets one shape at a time in an undifferentiated stream. Participants subsequently performed a surprise recognition task in which they were tested on their knowledge of the underlying structure of the triplets. Unbeknownst to participants, triplets were also assigned no-, low-, or high-reward status. In Experiments 1A and 1B, participants viewed shape streams while low and high rewards were “randomly” given, presented as low- and high-pitched tones played through headphones. Rewards were always given on the third shape of a triplet (Experiment 1A) or the first shape of a triplet (Experiment 1B), and high- and low-reward sounds were always consistently paired with the same triplets. Experiment 2 was similar to Experiment 1, except that participants were required to learn value associations of a subset of shapes before viewing the shape stream. Across all experiments, we observed significant visual statistical learning effects, but the strength of learning did not differ amongst no-, low-, or high-reward conditions for any of the experiments. Thus, our experiments failed to find any influence of rewards on statistical learning, implying that visual statistical learning may be unaffected by the occurrence of reward. The system that detects basic stimulus-stimulus regularities may operate independently of the system that detects reward contingencies. PMID:27853441

  15. The role of reward in word learning and its implications for language acquisition.

    PubMed

    Ripollés, Pablo; Marco-Pallarés, Josep; Hielscher, Ulrike; Mestres-Missé, Anna; Tempelmann, Claus; Heinze, Hans-Jochen; Rodríguez-Fornells, Antoni; Noesselt, Toemme

    2014-11-03

    The exact neural processes behind humans' drive to acquire a new language, first as infants and later as second-language learners, are yet to be established. Recent theoretical models have proposed that during human evolution, emerging language-learning mechanisms might have been glued to phylogenetically older subcortical reward systems, reinforcing human motivation to learn a new language. Supporting this hypothesis, our results showed that adult participants exhibited robust fMRI activation in the ventral striatum (VS), a core region of reward processing, when successfully learning the meaning of new words. This activation was similar to the VS recruitment elicited using an independent reward task. Moreover, the VS showed enhanced functional and structural connectivity with neocortical language areas during successful word learning. Together, our results provide evidence for the neural substrate of reward and motivation during word learning. We suggest that this strong functional and anatomical coupling between neocortical language regions and the subcortical reward system provided a crucial advantage in humans that eventually enabled our lineage to successfully acquire linguistic skills. Copyright © 2014 Elsevier Ltd. All rights reserved.

  16. Credit assignment in movement-dependent reinforcement learning.

    PubMed

    McDougle, Samuel D; Boggess, Matthew J; Crossley, Matthew J; Parvin, Darius; Ivry, Richard B; Taylor, Jordan A

    2016-06-14

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants' explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem.

  17. Credit assignment in movement-dependent reinforcement learning

    PubMed Central

    McDougle, Samuel D.; Boggess, Matthew J.; Crossley, Matthew J.; Parvin, Darius; Ivry, Richard B.; Taylor, Jordan A.

    2016-01-01

    When a person fails to obtain an expected reward from an object in the environment, they face a credit assignment problem: Did the absence of reward reflect an extrinsic property of the environment or an intrinsic error in motor execution? To explore this problem, we modified a popular decision-making task used in studies of reinforcement learning, the two-armed bandit task. We compared a version in which choices were indicated by key presses, the standard response in such tasks, to a version in which the choices were indicated by reaching movements, which affords execution failures. In the key press condition, participants exhibited a strong risk aversion bias; strikingly, this bias reversed in the reaching condition. This result can be explained by a reinforcement model wherein movement errors influence decision-making, either by gating reward prediction errors or by modifying an implicit representation of motor competence. Two further experiments support the gating hypothesis. First, we used a condition in which we provided visual cues indicative of movement errors but informed the participants that trial outcomes were independent of their actual movements. The main result was replicated, indicating that the gating process is independent of participants’ explicit sense of control. Second, individuals with cerebellar degeneration failed to modulate their behavior between the key press and reach conditions, providing converging evidence of an implicit influence of movement error signals on reinforcement learning. These results provide a mechanistically tractable solution to the credit assignment problem. PMID:27247404
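
    The gating hypothesis supported by both versions of this study lends itself to a one-line modification of a standard value update: prediction errors are withheld when the absence of reward is attributable to an execution error. A minimal sketch with assumed names and parameters.

```python
ALPHA = 0.3

def gated_update(V, choice, reward, execution_error):
    """Update the value of the chosen option unless the trial ended in a
    motor execution failure, in which case the reward prediction error
    is gated: the outcome is uninformative about the option's value."""
    if execution_error:
        return
    V[choice] += ALPHA * (reward - V[choice])

V = {'left': 0.5, 'right': 0.5}
gated_update(V, 'left', reward=0, execution_error=True)   # no change
gated_update(V, 'left', reward=0, execution_error=False)  # value decreases
print(V)
```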

  18. Multi Agent Reward Analysis for Learning in Noisy Domains

    NASA Technical Reports Server (NTRS)

    Tumer, Kagan; Agogino, Adrian K.

    2005-01-01

    In many multiagent learning problems, it is difficult to determine, a priori, the agent reward structure that will lead to good performance. This problem is particularly pronounced in continuous, noisy domains ill-suited to the simple table backup schemes commonly used in TD(lambda)/Q-learning. In this paper, we present a new reward evaluation method that allows the tradeoff between coordination among the agents and the difficulty of the learning problem each agent faces to be visualized. This method is independent of the learning algorithm and is only a function of the problem domain and the agents' reward structure. We then use this reward efficiency visualization method to determine an effective reward without performing extensive simulations. We test this method in both a static and a dynamic multi-rover learning domain where the agents have continuous state spaces and where their actions are noisy (e.g., the agents' movement decisions are not always carried out properly). Our results show that in the more difficult dynamic domain, the reward efficiency visualization method provides a two order of magnitude speedup in selecting a good reward. Most importantly, it allows one to quickly create and verify rewards tailored to the observational limitations of the domain.
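
    One agent reward structure frequently evaluated in this line of multiagent work (an addition here, not named in the abstract) is the difference reward: the global utility minus what the system would have earned had agent i's action been replaced by a null action. It stays aligned with coordination while shrinking each agent's credit-assignment problem. A toy sketch; the global utility G below is a made-up example, not the paper's rover domain.

```python
def difference_reward(G, joint_actions, i, null_action=None):
    """D_i = G(z) - G(z with agent i's action nulled out)."""
    counterfactual = list(joint_actions)
    counterfactual[i] = null_action
    return G(joint_actions) - G(counterfactual)

# Toy global utility: number of distinct areas covered by any agent.
G = lambda z: len({a for a in z if a is not None})
print(difference_reward(G, ['area1', 'area1', 'area2'], i=1))  # 0 (redundant)
print(difference_reward(G, ['area1', 'area3', 'area2'], i=1))  # 1 (unique contribution)
```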

  19. Efficient reinforcement learning: computational theories, neuroscience and robotics.

    PubMed

    Kawato, Mitsuo; Samejima, Kazuyuki

    2007-04-01

    Reinforcement learning algorithms have provided some of the most influential computational theories for behavioral learning that depends on reward and penalty. After briefly reviewing supporting experimental data, this paper tackles three difficult theoretical issues that remain to be explored. First, plain reinforcement learning is much too slow to be considered a plausible brain model. Second, although the temporal-difference error has an important role both in theory and in experiments, how to compute it remains an enigma. Third, the function of all brain areas, including the cerebral cortex, cerebellum, brainstem and basal ganglia, seems to necessitate a new computational framework. Computational studies that emphasize meta-parameters, hierarchy, modularity and supervised learning to resolve these issues are reviewed here, together with the related experimental data.

  20. Pleasurable music affects reinforcement learning according to the listener.

    PubMed

    Gold, Benjamin P; Frank, Michael J; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy.

  1. Pleasurable music affects reinforcement learning according to the listener

    PubMed Central

    Gold, Benjamin P.; Frank, Michael J.; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  2. Dopamine neurons learn relative chosen value from probabilistic rewards

    PubMed Central

    Lak, Armin; Stauffer, William R; Schultz, Wolfram

    2016-01-01

    Economic theories posit reward probability as one of the factors defining reward value. Individuals learn the value of cues that predict probabilistic rewards from experienced reward frequencies. Building on the notion that responses of dopamine neurons increase with reward probability and expected value, we asked how dopamine neurons in monkeys acquire this value signal that may represent an economic decision variable. We found in a Pavlovian learning task that reward probability-dependent value signals arose from experienced reward frequencies. We then assessed neuronal response acquisition during choices among probabilistic rewards. Here, dopamine responses became sensitive to the value of both chosen and unchosen options. Both experiments also showed novelty responses of dopamine neurons that decreased as learning advanced. These results show that dopamine neurons acquire predictive value signals from the frequency of experienced rewards. This flexible and fast signal reflects a specific decision variable and could update neuronal decision mechanisms. DOI: http://dx.doi.org/10.7554/eLife.18044.001 PMID:27787196
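
    The acquisition described here, value signals arising from experienced reward frequencies, matches the behavior of a simple delta rule, shown below as an illustration rather than the authors' fitted model; the learning rate and reward probability are arbitrary.

```python
import random

ALPHA = 0.1
random.seed(1)

v = 0.0                                   # learned cue value
for _ in range(300):
    reward = 1.0 if random.random() < 0.75 else 0.0
    v += ALPHA * (reward - v)             # delta-rule update
print(round(v, 2))                        # hovers near the 0.75 reward probability
```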

  3. Can model-free reinforcement learning explain deontological moral judgments?

    PubMed

    Ayars, Alisabeth

    2016-05-01

    Dual-systems frameworks propose that moral judgments are derived from both an immediate emotional response, and controlled/rational cognition. Recently Cushman (2013) proposed a new dual-system theory based on model-free and model-based reinforcement learning. Model-free learning attaches values to actions based on their history of reward and punishment, and explains some deontological, non-utilitarian judgments. Model-based learning involves the construction of a causal model of the world and allows for far-sighted planning; this form of learning fits well with utilitarian considerations that seek to maximize certain kinds of outcomes. I present three concerns regarding the use of model-free reinforcement learning to explain deontological moral judgment. First, many actions that humans find aversive from model-free learning are not judged to be morally wrong. Moral judgment must require something in addition to model-free learning. Second, there is a dearth of evidence for central predictions of the reinforcement account, e.g., that people with different reinforcement histories will, all else equal, make different moral judgments. Finally, to account for the effect of intention within the framework requires certain assumptions which lack support. These challenges are reasonable foci for future empirical/theoretical work on the model-free/model-based framework.

  4. Neuronal tuning in a brain-machine interface during Reinforcement Learning.

    PubMed

    Mahmoudi, Babak; Digiovanna, Jack; Principe, Jose C; Sanchez, Justin C

    2008-01-01

    In this research, we have used neural tuning to quantify the neural representation of a prosthetic arm's actions in a new BMI framework based on Reinforcement Learning (RLBMI). We observed that through closed-loop brain control, the neural representation changed to encode robot actions that maximize rewards. This is an interesting result because in our paradigm robot actions are directly controlled by a Computer Agent (CA) with reward states compatible with the user's rewards. Through co-adaptation, neural modulation is used to establish the value of robot actions to achieve reward.

  5. Bayesian Cue Integration as a Developmental Outcome of Reward Mediated Learning

    PubMed Central

    Weisswange, Thomas H.; Rothkopf, Constantin A.; Rodemann, Tobias; Triesch, Jochen

    2011-01-01

    Average human behavior in cue combination tasks is well predicted by Bayesian inference models. As this capability is acquired over developmental timescales, the question arises of how it is learned. Here we investigated whether reward-dependent learning, which is well established at the computational, behavioral, and neuronal levels, could contribute to this development. It is shown that a model-free reinforcement learning algorithm can indeed learn to do cue integration, i.e., weight uncertain cues according to their respective reliabilities, and even do so if reliabilities are changing. We also consider the case of causal inference, where multimodal signals can originate from one or multiple separate objects and should not always be integrated. In this case, the learner is shown to develop a behavior that is closest to Bayesian model averaging. We conclude that reward-mediated learning could be a driving force for the development of cue integration and causal inference. PMID:21750717
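
    For reference, the Bayesian benchmark that the model-free learner is shown to approximate: with independent Gaussian cues, the optimal estimate weights each cue by its reliability (inverse variance). A small sketch of that standard computation.

```python
def bayes_combine(cues):
    """cues: list of (value, variance) pairs for independent Gaussian cues.
    Returns the mean and variance of the combined (posterior) estimate."""
    precisions = [1.0 / var for _, var in cues]
    total = sum(precisions)
    mean = sum(x / var for x, var in cues) / total
    return mean, 1.0 / total

# A reliable cue (variance 0.25) dominates an unreliable one (variance 1.0).
print(bayes_combine([(1.0, 0.25), (2.0, 1.0)]))  # (1.2, 0.2)
```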

  6. Goal-Directed and Habit-Like Modulations of Stimulus Processing during Reinforcement Learning.

    PubMed

    Luque, David; Beesley, Tom; Morris, Richard W; Jack, Bradley N; Griffiths, Oren; Whitford, Thomas J; Le Pelley, Mike E

    2017-03-15

    Recent research has shown that perceptual processing of stimuli previously associated with high-value rewards is automatically prioritized even when rewards are no longer available. It has been hypothesized that such reward-related modulation of stimulus salience is conceptually similar to an "attentional habit." Recording event-related potentials in humans during a reinforcement learning task, we show strong evidence in favor of this hypothesis. Resistance to outcome devaluation (the defining feature of a habit) was shown by the stimulus-locked P1 component, reflecting activity in the extrastriate visual cortex. Analysis at longer latencies revealed a positive component (corresponding to the P3b, from 550-700 ms) sensitive to outcome devaluation. Therefore, distinct spatiotemporal patterns of brain activity were observed corresponding to habitual and goal-directed processes. These results demonstrate that reinforcement learning engages both attentional habits and goal-directed processes in parallel. Consequences for brain and computational models of reinforcement learning are discussed. SIGNIFICANCE STATEMENT: The human attentional network adapts to detect stimuli that predict important rewards. A recent hypothesis suggests that the visual cortex automatically prioritizes reward-related stimuli, driven by cached representations of reward value; that is, stimulus-response habits. Alternatively, the neural system may track the current value of the predicted outcome. Our results demonstrate for the first time that visual cortex activity is increased for reward-related stimuli even when the rewarding event is temporarily devalued. In contrast, longer-latency brain activity was specifically sensitive to transient changes in reward value. Therefore, we show that both habit-like attention and goal-directed processes occur in the same learning episode at different latencies. This result has important consequences for computational models of reinforcement learning.

  7. Social and monetary reward learning engage overlapping neural substrates

    PubMed Central

    Lin, Alice; Adolphs, Ralph

    2012-01-01

    Learning to make choices that yield rewarding outcomes requires the computation of three distinct signals: stimulus values that are used to guide choices at the time of decision making, experienced utility signals that are used to evaluate the outcomes of those decisions and prediction errors that are used to update the values assigned to stimuli during reward learning. Here we investigated whether monetary and social rewards involve overlapping neural substrates during these computations. Subjects engaged in two probabilistic reward learning tasks that were identical except that rewards were either social (pictures of smiling or angry people) or monetary (gaining or losing money). We found substantial overlap between the two types of rewards for all components of the learning process: a common area of ventromedial prefrontal cortex (vmPFC) correlated with stimulus value at the time of choice and another common area of vmPFC correlated with reward magnitude and common areas in the striatum correlated with prediction errors. Taken together, the findings support the hypothesis that shared anatomical substrates are involved in the computation of both monetary and social rewards. PMID:21427193

  8. Social and monetary reward learning engage overlapping neural substrates.

    PubMed

    Lin, Alice; Adolphs, Ralph; Rangel, Antonio

    2012-03-01

    Learning to make choices that yield rewarding outcomes requires the computation of three distinct signals: stimulus values that are used to guide choices at the time of decision making, experienced utility signals that are used to evaluate the outcomes of those decisions and prediction errors that are used to update the values assigned to stimuli during reward learning. Here we investigated whether monetary and social rewards involve overlapping neural substrates during these computations. Subjects engaged in two probabilistic reward learning tasks that were identical except that rewards were either social (pictures of smiling or angry people) or monetary (gaining or losing money). We found substantial overlap between the two types of rewards for all components of the learning process: a common area of ventromedial prefrontal cortex (vmPFC) correlated with stimulus value at the time of choice and another common area of vmPFC correlated with reward magnitude and common areas in the striatum correlated with prediction errors. Taken together, the findings support the hypothesis that shared anatomical substrates are involved in the computation of both monetary and social rewards. © The Author (2011). Published by Oxford University Press.

  9. Anhedonia and the Relative Reward Value of Drug and Nondrug Reinforcers in Cigarette Smokers

    PubMed Central

    Leventhal, Adam M.; Trujillo, Michael; Ameringer, Katherine J.; Tidey, Jennifer W.; Sussman, Steve; Kahler, Christopher W.

    2015-01-01

    Anhedonia—a psychopathologic trait indicative of diminished interest, pleasure, and enjoyment—has been linked to use of and addiction to several substances, including tobacco. We hypothesized that anhedonic drug users develop an imbalance in the relative reward value of drug versus nondrug reinforcers, which could maintain drug use behavior. To test this hypothesis, we examined whether anhedonia predicted the tendency to choose an immediate drug reward (i.e., smoking) over a less immediate nondrug reward (i.e., money) in a laboratory study of non–treatment-seeking adult cigarette smokers. Participants (N = 275, ≥ 10 cigarettes/day) attended a baseline visit that involved anhedonia assessment followed by 2 counterbalanced experimental visits: (a) after 16-hr smoking abstinence and (b) nonabstinent. At both experimental visits, participants completed self-report measures of mood state followed by a behavioral smoking task, which measured 2 aspects of the relative reward value of smoking versus money: (1) latency to initiate smoking when delaying smoking was monetarily rewarded and (2) willingness to purchase individual cigarettes. Results indicated that higher anhedonia predicted quicker smoking initiation and more cigarettes purchased. These relations were partially mediated by low positive and high negative mood states assessed immediately prior to the smoking task. Abstinence amplified the extent to which anhedonia predicted cigarette consumption among those who responded to the abstinence manipulation, but not the entire sample. Anhedonia may bias motivation toward smoking over alternative reinforcers, perhaps by giving rise to poor acute mood states. An imbalance in the reward value assigned to drug versus nondrug reinforcers may link anhedonia-related psychopathology to drug use. PMID:24886011

  10. Anhedonia and the relative reward value of drug and nondrug reinforcers in cigarette smokers.

    PubMed

    Leventhal, Adam M; Trujillo, Michael; Ameringer, Katherine J; Tidey, Jennifer W; Sussman, Steve; Kahler, Christopher W

    2014-05-01

    Anhedonia, a psychopathologic trait indicative of diminished interest, pleasure, and enjoyment, has been linked to use of and addiction to several substances, including tobacco. We hypothesized that anhedonic drug users develop an imbalance in the relative reward value of drug versus nondrug reinforcers, which could maintain drug use behavior. To test this hypothesis, we examined whether anhedonia predicted the tendency to choose an immediate drug reward (i.e., smoking) over a less immediate nondrug reward (i.e., money) in a laboratory study of non-treatment-seeking adult cigarette smokers. Participants (N = 275, ≥10 cigarettes/day) attended a baseline visit that involved anhedonia assessment followed by 2 counterbalanced experimental visits: (a) after 16-hr smoking abstinence and (b) nonabstinent. At both experimental visits, participants completed self-report measures of mood state followed by a behavioral smoking task, which measured 2 aspects of the relative reward value of smoking versus money: (1) latency to initiate smoking when delaying smoking was monetarily rewarded and (2) willingness to purchase individual cigarettes. Results indicated that higher anhedonia predicted quicker smoking initiation and more cigarettes purchased. These relations were partially mediated by low positive and high negative mood states assessed immediately prior to the smoking task. Abstinence amplified the extent to which anhedonia predicted cigarette consumption among those who responded to the abstinence manipulation, but not the entire sample. Anhedonia may bias motivation toward smoking over alternative reinforcers, perhaps by giving rise to poor acute mood states. An imbalance in the reward value assigned to drug versus nondrug reinforcers may link anhedonia-related psychopathology to drug use.

  11. Collaborating Fuzzy Reinforcement Learning Agents

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1997-01-01

    Earlier, we introduced GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Relearning and at the local level, each agent learns and operates based on ANTARCTIC, a technique for fuzzy reinforcement learning. In this paper, we show that it is possible for these agents to compete in order to affect the selected control policy but at the same time, they can collaborate while investigating the state space. In this model, the evaluator or the critic learns by observing all the agents' behaviors, but the control policy changes only based on the behavior of the winning agent, also known as the super agent.

  13. The influence of trial order on learning from reward vs. punishment in a probabilistic categorization task: experimental and computational analyses.

    PubMed

    Moustafa, Ahmed A; Gluck, Mark A; Herzallah, Mohammad M; Myers, Catherine E

    2015-01-01

    Previous research has shown that trial ordering affects cognitive performance, but this has not been tested using category-learning tasks that differentiate learning from reward and punishment. Here, we tested two groups of healthy young adults using a probabilistic category learning task of reward and punishment in which there are two types of trials (reward, punishment) and three possible outcomes: (1) positive feedback for correct responses in reward trials; (2) negative feedback for incorrect responses in punishment trials; and (3) no feedback for incorrect answers in reward trials and correct answers in punishment trials. Hence, trials without feedback are ambiguous, and may represent either successful avoidance of punishment or failure to obtain reward. In Experiment 1, the first group of subjects received an intermixed task in which reward and punishment trials were presented in the same block, as a standard baseline task. In Experiment 2, a second group completed the separated task, in which reward and punishment trials were presented in separate blocks. Additionally, in order to understand the mechanisms underlying performance in the experimental conditions, we fit individual data using a Q-learning model. Results from Experiment 1 show that subjects who completed the intermixed task paradoxically valued the no-feedback outcome as a reinforcer when it occurred on reinforcement-based trials, and as a punisher when it occurred on punishment-based trials. This is supported by patterns of empirical responding, where subjects showed more win-stay behavior following an explicit reward than following an omission of punishment, and more lose-shift behavior following an explicit punisher than following an omission of reward. In Experiment 2, results showed similar performance whether subjects received reward-based or punishment-based trials first. However, when the Q-learning model was applied to these data, there were differences between subjects in the reward…
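
    The Q-learning model fit here can be caricatured by letting the subjective value of the no-feedback outcome depend on trial type. The sign pattern below follows the paradoxical valuation reported in the abstract; the magnitudes, names, and learning rate are illustrative assumptions.

```python
ALPHA = 0.2
# Assumed subjective value of the ambiguous no-feedback outcome, by trial
# type (positive on reward trials, negative on punishment trials, as the
# abstract reports for the intermixed group).
R_NO_FEEDBACK = {'reward_trial': 0.5, 'punishment_trial': -0.5}

def q_update(Q, stimulus, action, feedback, trial_type):
    """feedback: +1 explicit reward, -1 explicit punishment, 0 no feedback."""
    r = feedback if feedback != 0 else R_NO_FEEDBACK[trial_type]
    key = (stimulus, action)
    Q[key] = Q.get(key, 0.0) + ALPHA * (r - Q.get(key, 0.0))
```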

  14. Pain relief produces negative reinforcement through activation of mesolimbic reward-valuation circuitry.

    PubMed

    Navratilova, Edita; Xie, Jennifer Y; Okun, Alec; Qu, Chaoling; Eyde, Nathan; Ci, Shuang; Ossipov, Michael H; King, Tamara; Fields, Howard L; Porreca, Frank

    2012-12-11

    Relief of pain is rewarding. Using a model of experimental postsurgical pain we show that blockade of afferent input from the injury with local anesthetic elicits conditioned place preference, activates ventral tegmental dopaminergic cells, and increases dopamine release in the nucleus accumbens. Importantly, place preference is associated with increased activity in midbrain dopaminergic neurons and blocked by dopamine antagonists injected into the nucleus accumbens. The data directly support the hypothesis that relief of pain produces negative reinforcement through activation of the mesolimbic reward-valuation circuitry.

  15. A Reinforcement Learning Approach to Control.

    DTIC Science & Technology

    1997-05-31

    acquisition is inherently a partially observable Markov decision problem. This report describes an efficient, scalable reinforcement learning approach to the ... deployment of refined intelligent gaze control techniques. This report first lays a theoretical foundation for reinforcement learning. It then introduces ... perform well in both high and low SNR ATR environments. Reinforcement learning coupled with history features appears to be both a sound foundation and a practical scalable base for gaze control.

  16. Reward learning and negative emotion during rapid attentional competition

    PubMed Central

    Yokoyama, Takemasa; Padmala, Srikanth; Pessoa, Luiz

    2015-01-01

    Learned stimulus-reward associations influence how attention is allocated, such that stimuli rewarded in the past are favored in situations involving limited resources and competition. At the same time, task-irrelevant, high-arousal negative stimuli capture attention and divert resources away from tasks, resulting in poor behavioral performance. Yet, investigations of how reward learning and negative stimuli affect perceptual and attentional processing have been conducted in a largely independent fashion. We have recently reported that performance-based monetary rewards reduce interference from negative stimuli during perception. The goal of the present study was to investigate how stimuli associated with past monetary rewards compete with negative stimuli during a subsequent attentional task when, critically, no performance-based rewards were at stake. Across two experiments, we found that target stimuli associated with high reward reduced the interference effect of potent negative distractors. Similar to our recent findings with performance-based rewards, our results demonstrate that reward-associated stimuli reduce the deleterious impact of negative stimuli on behavior. PMID:25814971

  17. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation.

    PubMed

    Kato, Ayaka; Morita, Kenji

    2016-10-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of 'Go' or 'No-Go' selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of 'Go' values towards a goal, and (2) value-contrasts between 'Go' and 'No-Go' are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological systems for value-learning
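
    A minimal sketch of the value-decay mechanism may make this concrete. The toy task (a chain of states ending in reward), learning rate, and decay rate below are illustrative assumptions; the point is only that decaying all learned values toward zero keeps the RPE from vanishing even for a fully predicted reward, and produces a value gradient toward the goal.

    ```python
    N = 8                                  # toy task: a chain of Go steps to a goal
    ALPHA, GAMMA, DECAY = 0.5, 0.97, 0.01  # illustrative parameters

    V = [0.0] * (N + 1)                    # V[N] is the (terminal) goal state

    for episode in range(500):
        for s in range(N):
            r = 1.0 if s == N - 1 else 0.0        # reward upon reaching the goal
            rpe = r + GAMMA * V[s + 1] - V[s]     # reward prediction error
            V[s] += ALPHA * rpe
            # Forgetting: every learned value decays toward zero on each step,
            # so the RPE never fully vanishes even for a predictable reward.
            V = [v * (1.0 - DECAY) for v in V]

    print([round(v, 2) for v in V])        # a value gradient rising toward the goal
    ```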

  18. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation

    PubMed Central

    Morita, Kenji

    2016-01-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of ‘Go’ or ‘No-Go’ selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of ‘Go’ values towards a goal, and (2) value-contrasts between ‘Go’ and ‘No-Go’ are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological

  19. Methamphetamine-induced disruption of frontostriatal reward learning signals: relation to psychotic symptoms.

    PubMed

    Bernacer, Javier; Corlett, Philip R; Ramachandra, Pranathi; McFarlane, Brady; Turner, Danielle C; Clark, Luke; Robbins, Trevor W; Fletcher, Paul C; Murray, Graham K

    2013-11-01

    Frontostriatal circuitry is critical to learning processes, and its disruption may underlie maladaptive decision making and the generation of psychotic symptoms in schizophrenia. However, there is a paucity of evidence directly examining the role of modulatory neurotransmitters on frontostriatal function in humans. In order to probe the effects of modulation on frontostriatal circuitry during learning and to test whether disruptions in learning processes may be related to the pathogenesis of psychosis, the authors explored the brain representations of reward prediction error and incentive value, two key reinforcement learning parameters, before and after methamphetamine challenge. Healthy volunteers (N=18) underwent functional MRI (fMRI) scanning while performing a reward learning task on three occasions: after placebo, after methamphetamine infusion (0.3 mg/kg body weight), and after pretreatment with 400 mg of amisulpride and then methamphetamine infusion. Brain fMRI representations of learning signals, calculated using a reinforcement Q-learning algorithm, were compared across drug conditions. In the placebo condition, reward prediction error was coded in the ventral striatum bilaterally and incentive value in the ventromedial prefrontal cortex bilaterally. Reward prediction error and incentive value signals were disrupted by methamphetamine in the left nucleus accumbens and left ventromedial prefrontal cortex, respectively. Psychotic symptoms were significantly correlated with incentive value disruption in the ventromedial prefrontal and posterior cingulate cortex. Amisulpride pretreatment did not significantly alter methamphetamine-induced effects. The results demonstrate that methamphetamine impairs brain representations of computational parameters that underpin learning. They also demonstrate a significant link between psychosis and abnormal monoamine-regulated learning signals in the prefrontal and cingulate cortices.
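
    For readers unfamiliar with the approach, here is a minimal sketch of how trial-wise learning signals of this kind are derived for fMRI analysis; the choice/outcome sequences and learning rate are hypothetical placeholders, not the study's data or fitted parameters.

    ```python
    ALPHA = 0.3                        # assumed learning rate

    choices  = [0, 1, 1, 0, 1]         # hypothetical data: option chosen per trial
    outcomes = [1, 0, 1, 1, 0]         # hypothetical data: reward received (0/1)

    Q = [0.5, 0.5]                     # initial option values
    rpe_series, value_series = [], []
    for c, r in zip(choices, outcomes):
        value_series.append(Q[c])      # incentive value of the chosen option
        rpe = r - Q[c]                 # reward prediction error at outcome
        rpe_series.append(rpe)
        Q[c] += ALPHA * rpe            # Q-learning update

    # These trial-wise series would be convolved with a hemodynamic response
    # function and entered as parametric regressors in the fMRI analysis.
    print(rpe_series, value_series)
    ```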

  20. Distributed reinforcement learning for adaptive and robust network intrusion response

    NASA Astrophysics Data System (ADS)

    Malialis, Kleanthis; Devlin, Sam; Kudenko, Daniel

    2015-07-01

    Distributed denial of service (DDoS) attacks constitute a rapidly evolving threat in the current Internet. Multiagent Router Throttling is a novel approach to defend against DDoS attacks where multiple reinforcement learning agents are installed on a set of routers and learn to rate-limit or throttle traffic towards a victim server. The focus of this paper is on online learning and scalability. We propose an approach that incorporates task decomposition, team rewards and a form of reward shaping called difference rewards. One of the novel characteristics of the proposed system is that it provides a decentralised coordinated response to the DDoS problem, thus being resilient to DDoS attacks themselves. The proposed system learns remarkably fast, thus being suitable for online learning. Furthermore, its scalability is successfully demonstrated in experiments involving 1000 learning agents. We compare our approach against a baseline and a popular state-of-the-art throttling technique from the network security literature and show that the proposed approach is more effective, adaptive to sophisticated attack rate dynamics and robust to agent failures.
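
    The difference-rewards form of shaping mentioned above has a compact statement: agent i receives D_i = G(z) - G(z_-i), the global reward minus the global reward that would have resulted had agent i's action been replaced by a default action. A minimal sketch follows, with a made-up global objective standing in for the router-throttling domain (the function and numbers are illustrative assumptions, not the paper's formulation).

    ```python
    def global_reward(throttles):
        """Hypothetical global objective: credit for admitted traffic (1 - t per
        router) minus a heavy penalty when the victim's capacity is exceeded."""
        admitted = sum(1.0 - t for t in throttles)
        capacity = 0.6 * len(throttles)
        return admitted - 10.0 * max(0.0, admitted - capacity)

    def difference_reward(throttles, i, default=1.0):
        """D_i = G(z) - G(z_-i): agent i's marginal contribution to the global
        reward, computed by replacing its action with a default (full throttle)."""
        counterfactual = list(throttles)
        counterfactual[i] = default
        return global_reward(throttles) - global_reward(counterfactual)

    actions = [0.2, 0.5, 0.9]          # illustrative per-router throttle levels
    print([round(difference_reward(actions, i), 2) for i in range(len(actions))])
    ```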

  1. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.

    PubMed

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-04-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
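
    For reference, in the continuous TD framework of Doya (2000) on which the model builds, the TD error takes the form δ(t) = r(t) - V(t)/τ + dV(t)/dt, with τ the reward discount time constant. A minimal Euler-discretized sketch is below; the reward pulse and value function are placeholders, and nothing of the paper's spiking-network implementation is reproduced.

    ```python
    import math

    TAU, DT = 1.0, 0.01        # assumed discount time constant and Euler step

    def reward(t):
        return 1.0 if 0.5 < t < 0.6 else 0.0     # placeholder reward pulse

    def value(t):
        return 0.3 * math.exp(-t)                # placeholder value estimate

    def td_error(t):
        """delta(t) = r(t) - V(t)/tau + dV/dt (dV/dt via finite differences)."""
        dv_dt = (value(t + DT) - value(t)) / DT
        return reward(t) - value(t) / TAU + dv_dt

    print([round(td_error(0.1 * k), 3) for k in range(10)])
    ```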

  2. Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons

    PubMed Central

    Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram

    2013-01-01

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity. PMID:23592970

  3. Classroom Reinforcement and Learning: A Quantitative Synthesis.

    ERIC Educational Resources Information Center

    Lysakowski, Richard S.; Walberg, Herbert J.

    1981-01-01

    A review of statistical data from previous studies determined the benefits of positive reinforcement on learning in students from kindergarten through college. Results indicate that differences between reinforced and control groups are greater for girls and for students from special schools, and that reinforcement appears to have a strong effect…

  4. Reinforcement learning in multidimensional environments relies on attention mechanisms.

    PubMed

    Niv, Yael; Daniel, Reka; Geana, Andra; Gershman, Samuel J; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C

    2015-05-27

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this "representation learning" process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the "curse of dimensionality" in reinforcement learning. Copyright © 2015 the authors.
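
    A minimal sketch of one form such representation learning can take: stimulus value is an attention-weighted sum of feature weights, and the prediction error updates only the chosen stimulus's features, scaled by attention. The dimensions, attention weights, and parameters below are illustrative assumptions, not the specific model fitted in the study.

    ```python
    ALPHA = 0.2                                                  # assumed learning rate
    attention = {"color": 0.8, "shape": 0.15, "texture": 0.05}   # learned relevance
    w = {}                                                       # per-feature weights

    def value(stimulus):
        """Stimulus value: attention-weighted sum of its features' weights."""
        return sum(attention[d] * w.get((d, f), 0.0) for d, f in stimulus.items())

    def update(stimulus, reward):
        """Prediction error updates only the chosen stimulus's features, with
        credit concentrated on the dimensions attention marks as relevant."""
        delta = reward - value(stimulus)
        for d, f in stimulus.items():
            w[(d, f)] = w.get((d, f), 0.0) + ALPHA * attention[d] * delta

    chosen = {"color": "red", "shape": "circle", "texture": "plaid"}
    update(chosen, 1.0)            # mostly the "red" weight moves
    print(round(value(chosen), 3))
    ```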

  5. A reinforcement learning approach to gait training improves retention

    PubMed Central

    Hasson, Christopher J.; Manczurowsky, Julia; Yen, Sheng-Che

    2015-01-01

    Many gait training programs are based on supervised learning principles: an individual is guided towards a desired gait pattern with directional error feedback. While this results in rapid adaptation, improvements quickly disappear. This study tested the hypothesis that a reinforcement learning approach improves retention and transfer of a new gait pattern. The results of a pilot study and larger experiment are presented. Healthy subjects were randomly assigned to either a supervised group, who received explicit instructions and directional error feedback while they learned a new gait pattern on a treadmill, or a reinforcement group, who was only shown whether they were close to or far from the desired gait. Subjects practiced for 10 min, followed by immediate and overnight retention and over-ground transfer tests. The pilot study showed that subjects could learn a new gait pattern under a reinforcement learning paradigm. The larger experiment, which had twice as many subjects (16 in each group) showed that the reinforcement group had better overnight retention than the supervised group (a 32% vs. 120% error increase, respectively), but there were no differences for over-ground transfer. These results suggest that encouraging participants to find rewarding actions through self-guided exploration is beneficial for retention. PMID:26379524

  6. A reinforcement learning approach to gait training improves retention.

    PubMed

    Hasson, Christopher J; Manczurowsky, Julia; Yen, Sheng-Che

    2015-01-01

    Many gait training programs are based on supervised learning principles: an individual is guided towards a desired gait pattern with directional error feedback. While this results in rapid adaptation, improvements quickly disappear. This study tested the hypothesis that a reinforcement learning approach improves retention and transfer of a new gait pattern. The results of a pilot study and larger experiment are presented. Healthy subjects were randomly assigned to either a supervised group, who received explicit instructions and directional error feedback while they learned a new gait pattern on a treadmill, or a reinforcement group, who was only shown whether they were close to or far from the desired gait. Subjects practiced for 10 min, followed by immediate and overnight retention and over-ground transfer tests. The pilot study showed that subjects could learn a new gait pattern under a reinforcement learning paradigm. The larger experiment, which had twice as many subjects (16 in each group) showed that the reinforcement group had better overnight retention than the supervised group (a 32% vs. 120% error increase, respectively), but there were no differences for over-ground transfer. These results suggest that encouraging participants to find rewarding actions through self-guided exploration is beneficial for retention.

  7. Accelerating Multiagent Reinforcement Learning by Equilibrium Transfer.

    PubMed

    Hu, Yujing; Gao, Yang; An, Bo

    2015-07-01

    An important approach in multiagent reinforcement learning (MARL) is equilibrium-based MARL, which adopts equilibrium solution concepts in game theory and requires agents to play equilibrium strategies at each state. However, most existing equilibrium-based MARL algorithms cannot scale due to a large number of computationally expensive equilibrium computations (e.g., computing Nash equilibria is PPAD-hard) during learning. For the first time, this paper finds that during the learning process of equilibrium-based MARL, the one-shot games corresponding to each state's successive visits often have the same or similar equilibria (for some states more than 90% of games corresponding to successive visits have similar equilibria). Inspired by this observation, this paper proposes to use equilibrium transfer to accelerate equilibrium-based MARL. The key idea of equilibrium transfer is to reuse previously computed equilibria when each agent has a small incentive to deviate. By introducing transfer loss and transfer condition, a novel framework called equilibrium transfer-based MARL is proposed. We prove that although equilibrium transfer brings transfer loss, equilibrium-based MARL algorithms can still converge to an equilibrium policy under certain assumptions. Experimental results in widely used benchmarks (e.g., grid world game, soccer game, and wall game) show that the proposed framework: 1) not only significantly accelerates equilibrium-based MARL (up to 96.7% reduction in learning time), but also achieves higher average rewards than algorithms without equilibrium transfer and 2) scales significantly better than algorithms without equilibrium transfer when the state/action space grows and the number of agents increases.
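
    The transfer condition can be phrased as an ε-equilibrium check: reuse a previously computed equilibrium at a newly encountered one-shot game whenever no agent could gain more than ε by unilaterally deviating. A minimal two-player sketch follows (the payoff matrices and ε are illustrative; the paper's transfer-loss machinery is not reproduced).

    ```python
    def deviation_incentive(payoffs, profile):
        """Largest gain any player could obtain by unilaterally deviating from a
        pure-strategy profile of a 2-player game; payoffs[i][a0][a1] is player
        i's payoff when players 0 and 1 play a0 and a1."""
        a0, a1 = profile
        gain = 0.0
        for alt in range(len(payoffs[0])):          # player 0's deviations
            gain = max(gain, payoffs[0][alt][a1] - payoffs[0][a0][a1])
        for alt in range(len(payoffs[0][0])):       # player 1's deviations
            gain = max(gain, payoffs[1][a0][alt] - payoffs[1][a0][a1])
        return gain

    # Reuse the previously computed equilibrium if deviating gains at most EPS.
    old_equilibrium = (0, 0)
    new_game = [[[3, 0], [0, 1]],    # player 0's payoff matrix
                [[3, 0], [0, 1]]]    # player 1's payoff matrix
    EPS = 0.1
    print("transfer:", deviation_incentive(new_game, old_equilibrium) <= EPS)
    ```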

  8. A Selective Role for Lmo4 in Cue–Reward Learning

    PubMed Central

    Mangieri, Regina A.; Morrisett, Richard A.; Heberlein, Ulrike; Messing, Robert O.

    2015-01-01

    The ability to use environmental cues to predict rewarding events is essential to survival. The basolateral amygdala (BLA) plays a central role in such forms of associative learning. Aberrant cue–reward learning is thought to underlie many psychopathologies, including addiction, so understanding the underlying molecular mechanisms can inform strategies for intervention. The transcriptional regulator LIM-only 4 (LMO4) is highly expressed in pyramidal neurons of the BLA, where it plays an important role in fear learning. Because the BLA also contributes to cue–reward learning, we investigated the role of BLA LMO4 in this process using Lmo4-deficient mice and RNA interference. Lmo4-deficient mice showed a selective deficit in conditioned reinforcement. Knockdown of LMO4 in the BLA, but not in the nucleus accumbens, recapitulated this deficit in wild-type mice. Molecular and electrophysiological studies identified a deficit in dopamine D2 receptor signaling in the BLA of Lmo4-deficient mice. These results reveal a novel, LMO4-dependent transcriptional program within the BLA that is essential to cue–reward learning. PMID:26134647

  9. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems

    PubMed Central

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider. PMID:24137146

  10. Modeling effects of intrinsic and extrinsic rewards on the competition between striatal learning systems.

    PubMed

    Boedecker, Joschka; Lampe, Thomas; Riedmiller, Martin

    2013-01-01

    A common assumption in psychology, economics, and other fields holds that higher performance will result if extrinsic rewards (such as money) are offered as an incentive. While this principle seems to work well for tasks that require the execution of the same sequence of steps over and over, with little uncertainty about the process, in other cases, especially where creative problem solving is required due to the difficulty in finding the optimal sequence of actions, external rewards can actually be detrimental to task performance. Furthermore, they have the potential to undermine intrinsic motivation to do an otherwise interesting activity. In this work, we extend a computational model of the dorsomedial and dorsolateral striatal reinforcement learning systems to account for the effects of extrinsic and intrinsic rewards. The model assumes that the brain employs both a goal-directed and a habitual learning system, and competition between both is based on the trade-off between the cost of the reasoning process and value of information. The goal-directed system elicits internal rewards when its models of the environment improve, while the habitual system, being model-free, does not. Our results account for the phenomena that initial extrinsic reward leads to reduced activity after extinction compared to the case without any initial extrinsic rewards, and that performance in complex task settings drops when higher external rewards are promised. We also test the hypothesis that external rewards bias the competition in favor of the computationally efficient, but cruder and less flexible habitual system, which can negatively influence intrinsic motivation and task performance in the class of tasks we consider.

  11. Reinforcement Learning in Multidimensional Environments Relies on Attention Mechanisms

    PubMed Central

    Daniel, Reka; Geana, Andra; Gershman, Samuel J.; Leong, Yuan Chang; Radulescu, Angela; Wilson, Robert C.

    2015-01-01

    In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this “representation learning” process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the “curse of dimensionality” in reinforcement learning. PMID:26019331

  12. Probabilistic reinforcement learning in adults with autism spectrum disorders.

    PubMed

    Solomon, Marjorie; Smith, Anne C; Frank, Michael J; Ly, Stanford; Carter, Cameron S

    2011-04-01

    Autism spectrum disorders (ASDs) can be conceptualized as disorders of learning; however, there have been few experimental studies taking this perspective. We examined the probabilistic reinforcement learning performance of 28 adults with ASDs and 30 typically developing adults on a task requiring learning relationships between three stimulus pairs consisting of Japanese characters with feedback that was valid with different probabilities (80%, 70%, and 60%). Both univariate and Bayesian state-space data analytic methods were employed. Hypotheses were based on the extant literature as well as on neurobiological and computational models of reinforcement learning. Both groups learned the task after training. However, there were group differences in early learning in the first task block, where individuals with ASDs acquired the most frequently accurately reinforced stimulus pair (80%) comparably to typically developing individuals; exhibited poorer acquisition of the less frequently reinforced 70% pair as assessed by state-space learning curves; and outperformed typically developing individuals on the near-chance (60%) pair. Individuals with ASDs also demonstrated deficits in using positive feedback to exploit rewarded choices. Results support the contention that individuals with ASDs are slower learners. Based on neurobiology and on the results of computational modeling, one interpretation of this pattern of findings is that impairments are related to deficits in flexible updating of reinforcement history as mediated by the orbito-frontal cortex, with spared functioning of the basal ganglia. This hypothesis about the pathophysiology of learning in ASDs can be tested using functional magnetic resonance imaging. Copyright © 2011, International Society for Autism Research, Wiley-Liss, Inc.

  13. Enhanced Experience Replay for Deep Reinforcement Learning

    DTIC Science & Technology

    2015-11-01

    ARL-TR-7538, November 2015. US Army Research Laboratory. Enhanced Experience Replay for Deep Reinforcement Learning, by David Doria, Bryan Dawson, and Manuel Vindiola, Computational and Information Sciences Directorate.
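
    For context, the data structure at the core of experience replay is simple. A minimal sketch with uniform sampling follows; the report's specific enhancements are not reproduced here.

    ```python
    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-capacity store of (state, action, reward, next_state, done)
        transitions; sampling minibatches uniformly at random breaks the temporal
        correlations in the agent's stream of experience."""

        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions drop off

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)

    buf = ReplayBuffer()
    for t in range(100):                           # fill with dummy transitions
        buf.push(t, t % 4, 0.0, t + 1, False)
    print(len(buf.sample(8)))
    ```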

  14. The Influence of Emotional State on Learning From Reward and Punishment in Borderline Personality Disorder.

    PubMed

    Dixon-Gordon, Katherine L; Tull, Matthew T; Hackel, Leor M; Gratz, Kim L

    2017-06-08

    Despite preliminary evidence that individuals with borderline personality disorder (BPD) demonstrate deficits in learning from corrective feedback, no studies have examined the influence of emotional state on these learning deficits in BPD. This laboratory study examined the influence of negative emotions on learning among participants with BPD (n = 17), compared with clinical (past-year mood/anxiety disorder; n = 20) and healthy (n = 23) controls. Participants completed a reinforcement learning task before and after a negative emotion induction. The learning task involved presenting pairs of stimuli with probabilistic feedback in the training phase, and subsequently assessing accuracy for choosing previously rewarded stimuli or avoiding previously punished stimuli. ANOVAs and ANCOVAs revealed no significant between-group differences in overall learning accuracy. However, there was an effect of group in the ANCOVA for postemotion induction high-conflict punishment learning accuracy, with the BPD group showing greater decrements in learning accuracy than controls following the negative emotion induction.

  15. A quantitative analysis of the reward-enhancing effects of nicotine using reinforcer demand

    PubMed Central

    Barrett, Scott T.; Bevins, Rick A.

    2013-01-01

    Reward enhancement by nicotine has been suggested as an important phenomenon contributing toward tobacco abuse and dependence. Reinforcement value is a multifaceted construct not fully represented by any single measure of response strength. The present study evaluated the changes in the reinforcement value of a visual stimulus in 16 male Sprague–Dawley rats using the reinforcer demand technique proposed by Hursh and Silberberg. The different parameters of the model have been shown to represent differing facets of reinforcement value, including intensity, perseverance, and sensitivity to changes in response cost. Rats lever-pressed for 1-min presentations of a compound visual stimulus over blocks of 10 sessions across a range of response requirements (fixed ratio 1, 2, 4, 8, 14, 22, 32). Nicotine (0.4 mg/kg, base) or saline was administered 5 min before each session. Estimates from the demand model were calculated between nicotine and saline administration conditions within subjects and changes in reinforcement value were assessed as differences in Q0, Pmax, Omax, and essential value. Nicotine administration increased operant responding across the entire range of reinforcement schedules tested, and uniformly affected model parameter estimates in a manner suggesting increased reinforcement value of the visual stimulus. PMID:23080311
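
    For reference, the exponential demand model of Hursh and Silberberg can be written as follows, where Q is consumption at unit price (response cost) C, Q_0 is demand intensity at zero price, k is a constant scaling the consumption range in log units, and α is the rate of decline in consumption (essential value varies inversely with α). P_max is conventionally the price at which demand becomes unit elastic, and O_max the response output at that price:

    ```latex
    \log_{10} Q = \log_{10} Q_0 + k\left(e^{-\alpha Q_0 C} - 1\right),
    \qquad O_{\max} = P_{\max}\, Q(P_{\max})
    ```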

  16. Resting-state EEG theta activity and risk learning: sensitivity to reward or punishment?

    PubMed

    Massar, Stijn A A; Kenemans, J Leon; Schutter, Dennis J L G

    2014-03-01

    An increased theta (4-7 Hz) to beta (13-30 Hz) power ratio in resting-state electroencephalography (EEG) has been associated with risky, disadvantageous decision making and with impaired reinforcement learning. However, the specific contributions of theta and beta power to risky decision making remain unclear. The first aim of the present study was to replicate the earlier-found relationship and examine the specific contributions of theta and beta power to risky decision making using the Iowa Gambling Task. The second aim of the study was to examine whether the relation was associated with differences in reward or punishment sensitivity. We replicated the earlier-found relationship by showing a positive association between theta/beta ratio and risky decision making. This correlation was mainly driven by theta oscillations. Furthermore, theta power correlated with reward-motivated learning, but not with punishment learning. The present results replicate and extend earlier findings by providing novel insights into the relation between theta/beta ratios and risky decision making. Specifically, findings show that resting-state theta activity is correlated with reinforcement learning, and that this association may be explained by differences in reward sensitivity.

  17. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling.

    PubMed

    Redish, A David; Jensen, Steve; Johnson, Adam; Kurth-Nelson, Zeb

    2007-07-01

    Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated value and are thus unable to capture renewal. TDRL models are based on the hypothesis that dopamine carries a reward prediction error signal; these models predict reward by driving that reward error to zero. The authors construct a TDRL model that can accommodate extinction and renewal through two simple processes: (a) a TDRL process that learns the value of situation-action pairs and (b) a situation recognition process that categorizes the observed cues into situations. This model has implications for dysfunctional states, including relapse after addiction and problem gambling.
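
    A minimal sketch of the two processes may be useful. Below, values attach to (situation, action) pairs, and a situation-recognition step maps cue patterns to discrete situations; in this toy version the extinction situation is given as a distinct label, whereas in the full model the split would be inferred from persistent prediction errors. Parameters and cue labels are illustrative assumptions.

    ```python
    ALPHA = 0.3       # illustrative learning rate

    situations = {}   # maps an observed cue pattern to a discrete situation id
    Q = {}            # learned values of (situation, action) pairs

    def recognize(cues):
        """Situation-recognition process: categorize observed cues into
        situations; an unfamiliar cue pattern is assigned a brand-new one."""
        if cues not in situations:
            situations[cues] = len(situations)
        return situations[cues]

    def td_update(cues, action, reward):
        """TDRL process over situation-action pairs (one-step version)."""
        key = (recognize(cues), action)
        old = Q.get(key, 0.0)
        Q[key] = old + ALPHA * (reward - old)

    for _ in range(50):                        # acquisition: cue pattern rewarded
        td_update("lever+light", "press", 1.0)
    # Extinction arrives as a recognizably different situation (here a label;
    # in the full model, inferred from persistent prediction errors), so the
    # acquired value is never unlearned...
    for _ in range(50):
        td_update("lever+light/extinction", "press", 0.0)
    # ...which is why renewal is fast when the original situation reappears.
    print(round(Q[(situations["lever+light"], "press")], 3))   # still ~1.0
    ```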

  18. Abnormal temporal difference reward-learning signals in major depression.

    PubMed

    Kumar, P; Waiter, G; Ahearn, T; Milders, M; Reid, I; Steele, J D

    2008-08-01

    Anhedonia is a core symptom of major depressive disorder (MDD), long thought to be associated with reduced dopaminergic function. However, most antidepressants do not act directly on the dopamine system and all antidepressants have a delayed full therapeutic effect. Recently, it has been proposed that antidepressants fail to alter dopamine function in antidepressant unresponsive MDD. There is compelling evidence that dopamine neurons code a specific phasic (short duration) reward-learning signal, described by temporal difference (TD) theory. There is no current evidence for other neurons coding a TD reward-learning signal, although such evidence may be found in time. The neuronal substrates of the TD signal were not explored in this study. Phasic signals are believed to have quite different properties to tonic (long duration) signals. No studies have investigated phasic reward-learning signals in MDD. Therefore, adults with MDD receiving long-term antidepressant medication, and comparison controls both unmedicated and acutely medicated with the antidepressant citalopram, were scanned using fMRI during a reward-learning task. Three hypotheses were tested: first, patients with MDD have blunted TD reward-learning signals; second, controls given an antidepressant acutely have blunted TD reward-learning signals; third, the extent of alteration in TD signals in major depression correlates with illness severity ratings. The results supported the hypotheses. Patients with MDD had significantly reduced reward-learning signals in many non-brainstem regions: ventral striatum (VS), rostral and dorsal anterior cingulate, retrosplenial cortex (RC), midbrain and hippocampus. However, the TD signal was increased in the brainstem of patients. As predicted, acute antidepressant administration to controls was associated with a blunted TD signal, and the brainstem TD signal was not increased by acute citalopram administration. In a number of regions, the magnitude of the abnormal

  19. Human dorsal striatal activity during choice discriminates reinforcement learning behavior from the gambler's fallacy.

    PubMed

    Jessup, Ryan K; O'Doherty, John P

    2011-04-27

    Reinforcement learning theory has generated substantial interest in neurobiology, particularly because of the resemblance between phasic dopamine and reward prediction errors. Actor-critic theories have been adapted to account for the functions of the striatum, with parts of the dorsal striatum equated to the actor. Here, we specifically test whether the human dorsal striatum--as predicted by an actor-critic instantiation--is used on a trial-to-trial basis at the time of choice to choose in accordance with reinforcement learning theory, as opposed to a competing strategy: the gambler's fallacy. Using a partial-brain functional magnetic resonance imaging scanning protocol focused on the striatum and other ventral brain areas, we found that the dorsal striatum is more active when choosing consistent with reinforcement learning compared with the competing strategy. Moreover, an overlapping area of dorsal striatum along with the ventral striatum was found to be correlated with reward prediction errors at the time of outcome, as predicted by the actor-critic framework. These findings suggest that the same region of dorsal striatum involved in learning stimulus-response associations may contribute to the control of behavior during choice, thereby using those learned associations. Intriguingly, neither reinforcement learning nor the gambler's fallacy conformed to the optimal choice strategy on the specific decision-making task we used. Thus, the dorsal striatum may contribute to the control of behavior according to reinforcement learning even when the prescriptions of such an algorithm are suboptimal in terms of maximizing future rewards.

  20. Multiagent cooperation and competition with deep reinforcement learning.

    PubMed

    Tampuu, Ardi; Matiisen, Tambet; Kodelja, Dorian; Kuzovkin, Ilya; Korjus, Kristjan; Aru, Juhan; Aru, Jaan; Vicente, Raul

    2017-01-01

    Evolution of cooperation and competition can appear when multiple adaptive agents share a biological, social, or technological niche. In the present work we study how cooperation and competition emerge between autonomous agents that learn by reinforcement while using only their raw visual input as the state representation. In particular, we extend the Deep Q-Learning framework to multiagent environments to investigate the interaction between two learning agents in the well-known video game Pong. By manipulating the classical rewarding scheme of Pong we show how competitive and collaborative behaviors emerge. We also describe the progression from competitive to collaborative behavior when the incentive to cooperate is increased. Finally we show how learning by playing against another adaptive agent, instead of against a hard-wired algorithm, results in more robust strategies. The present work shows that Deep Q-Networks can become a useful tool for studying decentralized learning of multiagent systems coping with high-dimensional environments.
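
    One way to parameterize a rewarding scheme that interpolates between competition and cooperation in Pong is sketched below; the details of the scheme actually used in the paper may differ, so treat this as an assumption-laden illustration.

    ```python
    def pong_rewards(scorer, rho):
        """Hypothetical parameterization: the player who concedes the point
        always receives -1, while the scoring player receives rho. rho = +1
        gives the fully competitive zero-sum game; rho = -1 makes losing the
        ball costly for both players, encouraging cooperative rallies."""
        rewards = [0.0, 0.0]
        rewards[scorer] = rho
        rewards[1 - scorer] = -1.0
        return rewards

    for rho in (1.0, 0.0, -1.0):
        print(rho, pong_rewards(scorer=0, rho=rho))
    ```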

  1. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading.

    PubMed

    Deng, Yue; Bao, Feng; Kong, Youyong; Ren, Zhiquan; Dai, Qionghai

    2017-03-01

    Can we train the computer to beat experienced traders at financial asset trading? In this paper, we try to address this challenge by introducing a recurrent deep neural network (NN) for real-time financial signal representation and trading. Our model is inspired by two biologically related learning concepts, deep learning (DL) and reinforcement learning (RL). In the framework, the DL part automatically senses the dynamic market condition for informative feature learning. Then, the RL module interacts with the deep representations and makes trading decisions to accumulate the ultimate rewards in an unknown environment. The learning system is implemented in a complex NN that exhibits both deep and recurrent structures. Hence, we propose a task-aware backpropagation-through-time method to cope with the vanishing-gradient issue in deep training. The robustness of the neural system is verified on both the stock and the commodity futures markets under broad testing conditions.

  2. Multiagent cooperation and competition with deep reinforcement learning

    PubMed Central

    Kodelja, Dorian; Kuzovkin, Ilya; Korjus, Kristjan; Aru, Juhan; Aru, Jaan; Vicente, Raul

    2017-01-01

    Evolution of cooperation and competition can appear when multiple adaptive agents share a biological, social, or technological niche. In the present work we study how cooperation and competition emerge between autonomous agents that learn by reinforcement while using only their raw visual input as the state representation. In particular, we extend the Deep Q-Learning framework to multiagent environments to investigate the interaction between two learning agents in the well-known video game Pong. By manipulating the classical rewarding scheme of Pong we show how competitive and collaborative behaviors emerge. We also describe the progression from competitive to collaborative behavior when the incentive to cooperate is increased. Finally we show how learning by playing against another adaptive agent, instead of against a hard-wired algorithm, results in more robust strategies. The present work shows that Deep Q-Networks can become a useful tool for studying decentralized learning of multiagent systems coping with high-dimensional environments. PMID:28380078

  3. Intra-accumbens amphetamine increases the conditioned incentive salience of sucrose reward: enhancement of reward "wanting" without enhanced "liking" or response reinforcement.

    PubMed

    Wyvell, C L; Berridge, K C

    2000-11-01

    Amphetamine microinjection into the nucleus accumbens shell enhanced the ability of a Pavlovian reward cue to trigger increased instrumental performance for sucrose reward in a pure conditioned incentive paradigm. Rats were first trained to press one of two levers to obtain sucrose pellets. They were separately conditioned to associate a Pavlovian cue (30 sec light) with free sucrose pellets. On test days, the rats received bilateral microinjection of intra-accumbens vehicle or amphetamine (0.0, 2.0, 10.0, or 20.0 microgram/0.5 microliter), and lever pressing was tested in the absence of any reinforcement contingency, while the Pavlovian cue alone was freely presented at intervals throughout the session. Amphetamine microinjection selectively potentiated the cue-elicited increase in sucrose-associated lever pressing, although instrumental responding was not reinforced by either sucrose or the cue during the test. Intra-accumbens amphetamine can therefore potentiate cue-triggered incentive motivation for reward in the absence of primary or secondary reinforcement. Using the taste reactivity measure of hedonic impact, it was shown that intra-accumbens amphetamine failed to increase positive hedonic reaction patterns elicited by sucrose (i.e., sucrose "liking") at doses that effectively increase sucrose "wanting." We conclude that nucleus accumbens dopamine specifically mediates the ability of reward cues to trigger "wanting" (incentive salience) for their associated rewards, independent of both hedonic impact and response reinforcement.

  4. How Food as a Reward Is Detrimental to Children's Health, Learning, and Behavior.

    PubMed

    Fedewa, Alicia L; Davis, Matthew Cody

    2015-09-01

    Despite small- and wide-scale prevention efforts to curb obesity, the percentage of children classified as overweight and obese has remained relatively consistent in the last decade. As school personnel are increasingly pressured to enhance student performance, many educators use food as a reward to motivate and reinforce positive behavior and high achievement. Yet, many educators have missed the link between student health and academic achievement. Based on a review of the literature, this article explores the link between childhood obesity and adverse mental and physical health, learning, and behavior outcomes. The role of providing children with food as a reward in the relationship between obesity and detrimental health and performance outcomes is examined. The use of food as a reward is pervasive in school classrooms. Although there is a paucity of research in this area, the few studies published in this area show detrimental outcomes for children in the areas of physical health, learning, and behavior. It is imperative that educators understand the adverse outcomes associated with using food as a reward for good behavior and achievement. This study provides alternatives to using food as a reward and outlines future directions for research. © 2015, American School Health Association.

  5. Reward-based learning for virtual neurorobotics through emotional speech processing.

    PubMed

    Jayet Bray, Laurence C; Ferneyhough, Gareth B; Barker, Emily R; Thibeault, Corey M; Harris, Frederick C

    2013-01-01

    Reward-based learning can easily be applied to real life, as in common methods for teaching children. It also allows machines and software agents to automatically determine the ideal behavior from simple reward feedback (e.g., encouragement) to maximize their performance. Advancements in affective computing, especially emotional speech processing (ESP), have allowed for more natural interaction between humans and robots. Our research focuses on integrating a novel ESP system in a relevant virtual neurorobotic (VNR) application. We created an emotional speech classifier that successfully distinguished happy and sad utterances. The accuracy of the system was 95.3 and 98.7% during the offline mode (using an emotional speech database) and the live mode (using live recordings), respectively. It was then integrated in a neurorobotic scenario, where a virtual neurorobot had to learn a simple exercise through reward-based learning. If the correct decision was made, the robot received a spoken reward, which in turn stimulated synapses (in our simulated model) undergoing spike-timing-dependent plasticity (STDP) and reinforced the corresponding neural pathways. Both our ESP and neurorobotic systems allowed our neurorobot to successfully and consistently learn the exercise. The integration of ESP in a real-time computational neuroscience architecture is a first step toward the combination of human emotions and virtual neurorobotics.

  6. Reward-based learning for virtual neurorobotics through emotional speech processing

    PubMed Central

    Jayet Bray, Laurence C.; Ferneyhough, Gareth B.; Barker, Emily R.; Thibeault, Corey M.; Harris, Frederick C.

    2013-01-01

    Reward-based learning can easily be applied to real life, as in common methods for teaching children. It also allows machines and software agents to automatically determine the ideal behavior from simple reward feedback (e.g., encouragement) to maximize their performance. Advancements in affective computing, especially emotional speech processing (ESP), have allowed for more natural interaction between humans and robots. Our research focuses on integrating a novel ESP system in a relevant virtual neurorobotic (VNR) application. We created an emotional speech classifier that successfully distinguished happy and sad utterances. The accuracy of the system was 95.3 and 98.7% during the offline mode (using an emotional speech database) and the live mode (using live recordings), respectively. It was then integrated in a neurorobotic scenario, where a virtual neurorobot had to learn a simple exercise through reward-based learning. If the correct decision was made, the robot received a spoken reward, which in turn stimulated synapses (in our simulated model) undergoing spike-timing-dependent plasticity (STDP) and reinforced the corresponding neural pathways. Both our ESP and neurorobotic systems allowed our neurorobot to successfully and consistently learn the exercise. The integration of ESP in a real-time computational neuroscience architecture is a first step toward the combination of human emotions and virtual neurorobotics. PMID:23641213

  7. Two spatiotemporally distinct value systems shape reward-based learning in the human brain

    PubMed Central

    Fouragnan, Elsa; Retzler, Chris; Mullinger, Karen; Philiastides, Marios G.

    2015-01-01

    Avoiding repeated mistakes and learning to reinforce rewarding decisions is critical for human survival and adaptive actions. Yet, the neural underpinnings of the value systems that encode different decision-outcomes remain elusive. Here, coupling single-trial electroencephalography with simultaneously acquired functional magnetic resonance imaging, we uncover the spatiotemporal dynamics of two separate but interacting value systems encoding decision-outcomes. Consistent with a role in regulating alertness and switching behaviours, an early system is activated only by negative outcomes and engages arousal-related and motor-preparatory brain structures. Consistent with a role in reward-based learning, a later system differentially suppresses or activates regions of the human reward network in response to negative and positive outcomes, respectively. Following negative outcomes, the early system interacts and downregulates the late system, through a thalamic interaction with the ventral striatum. Critically, the strength of this coupling predicts participants' switching behaviour and avoidance learning, directly implicating the thalamostriatal pathway in reward-based learning. PMID:26348160

  8. Honeybees learn the sign and magnitude of reward variations.

    PubMed

    Gil, Mariana; De Marco, Rodrigo J

    2009-09-01

    In this study, we asked whether honeybees learn the sign and magnitude of variations in the level of reward. We designed an experiment in which bees first had to forage on a three-flower patch offering variable reward levels, and then search for food at the site in the absence of reward and after a long foraging pause. At the time of training, we presented the bees with a decrease in reward level or, instead, with either a small or a large increase in reward level. Testing took place as soon as they visited the patch on the day following training, when we measured the bees' food-searching behaviours. We found that the bees that had experienced increasing reward levels searched for food more persistently than the bees that had experienced decreasing reward levels, and that the bees that had experienced a large increase in reward level searched for food more persistently than the bees that had experienced a small increase in reward level. Because these differences at the time of testing cannot be accounted for by the bees' previous crop loads and food-intake rates, our results unambiguously demonstrate that honeybees adjust their investment of time/energy during foraging in relation to both the sign and the magnitude of past variations in the level of reward. It is likely that such variations lead to the formation of reward expectations enhancing a forager's reliance on a feeding site. Ultimately, this would make it more likely for honeybees to find food when forage is scarce.

  9. Reinforcement learning of periodical gaits in locomotion robots

    NASA Astrophysics Data System (ADS)

    Svinin, Mikhail; Yamada, Kazuyaki; Ushio, S.; Ueda, Kanji

    1999-08-01

    Emergence of stable gaits in locomotion robots is studied in this paper. A classifier system, implementing an instance-based reinforcement learning scheme, is used for sensory-motor control of an eight-legged mobile robot. An important feature of the classifier system is its ability to work with a continuous sensor space. The robot has no prior knowledge of the environment, no internal model of itself, and no goal coordinates. It is only assumed that the robot can acquire stable gaits by learning how to reach a light source. During the learning process, the control system is self-organized by reinforcement signals. Reaching the light source earns a global reward. Forward motion earns a local reward, while stepping back and falling down incur a local punishment. Feasibility of the proposed self-organized system is tested in simulation and experiment. The control actions are specified at the leg level. It is shown that, as learning progresses, the number of action rules in the classifier system stabilizes at a level corresponding to the acquired gait patterns.

  10. DeltaFosB in the nucleus accumbens is critical for reinforcing effects of sexual reward

    PubMed Central

    Pitchers, Kyle K.; Frohmader, Karla S.; Vialou, Vincent; Mouzon, Ezekiell; Nestler, Eric J.; Lehman, Michael N.; Coolen, Lique M.

    2010-01-01

    Sexual behavior in male rats is rewarding and reinforcing. However, little is known about the specific cellular and molecular mechanisms mediating sexual reward or the reinforcing effects of reward on subsequent expression of sexual behavior. The current study tests the hypothesis that ΔFosB, the stably expressed truncated form of FosB, plays a critical role in the reinforcement of sexual behavior and experience-induced facilitation of sexual motivation and performance. Sexual experience was shown to cause ΔFosB accumulation in several limbic brain regions including the nucleus accumbens (NAc), medial prefrontal cortex, ventral tegmental area and caudate putamen, but not the medial preoptic nucleus. Next, the induction of c-Fos, a downstream (repressed) target of ΔFosB, was measured in sexually experienced and naïve animals. The number of mating-induced c-Fos-IR cells was significantly decreased in sexually experienced animals compared to sexually naïve controls. Finally, ΔFosB levels and its activity in the NAc were manipulated using viral-mediated gene transfer to study its potential role in mediating sexual experience and experience-induced facilitation of sexual performance. Animals with ΔFosB over-expression displayed enhanced facilitation of sexual performance with sexual experience relative to controls. In contrast, the expression of ΔJunD, a dominant-negative binding partner of ΔFosB, attenuated sexual experience-induced facilitation of sexual performance, and stunted long-term maintenance of facilitation compared to GFP and ΔFosB over-expressing groups. Together, these findings support a critical role for ΔFosB expression in the NAc for the reinforcing effects of sexual behavior and sexual experience-induced facilitation of sexual performance. PMID:20618447

  11. Reward-based contextual learning supported by anterior cingulate cortex.

    PubMed

    Umemoto, Akina; HajiHosseini, Azadeh; Yates, Michael E; Holroyd, Clay B

    2017-02-24

    The anterior cingulate cortex (ACC) is commonly associated with cognitive control and decision making, but its specific function is highly debated. To explore a recent theory that the ACC learns the reward values of task contexts (Holroyd & McClure in Psychological Review, 122, 54-83, 2015; Holroyd & Yeung in Trends in Cognitive Sciences, 16, 122-128, 2012), we recorded the event-related brain potentials (ERPs) from participants as they played a novel gambling task. The participants were first required to select from among three games in one "virtual casino," and subsequently they were required to select from among three different games in a different virtual casino; unbeknownst to them, the payoffs for the games were higher in one casino than in the other. Analysis of the reward positivity, an ERP component believed to reflect reward-related signals carried to the ACC by the midbrain dopamine system, revealed that the ACC is sensitive to differences in the reward values associated with both the casinos and the games inside the casinos, indicating that participants learned the values of the contexts in which rewards were delivered. These results highlight the importance of the ACC in learning the reward values of task contexts in order to guide action selection.

  12. Novelty and Inductive Generalization in Human Reinforcement Learning

    PubMed Central

    Gershman, Samuel J.; Niv, Yael

    2015-01-01

    In reinforcement learning, a decision maker searching for the most rewarding option is often faced with the question: what is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: how can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of reinforcement learning in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional reinforcement learning algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty. PMID:25808176
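
    A sketch of the inductive idea: the value of a never-tried option can be drawn from a group-level prior learned over the options already experienced, then updated by observation. This assumes a conjugate normal-normal model with illustrative numbers; it is not the authors' exact hierarchical model.

```python
import statistics

# Values learned for previously experienced options in this environment.
known_values = [0.7, 0.65, 0.8, 0.72]

# Group-level (environment) prior inferred from the known options.
mu0 = statistics.mean(known_values)           # prior mean
tau0_sq = statistics.variance(known_values)   # prior variance

# A novel, never-tried option inherits the group-level prior.
novel_value_estimate = mu0

def posterior_mean(observations, mu0, tau0_sq, sigma_sq=0.05):
    """Conjugate normal-normal update with assumed observation noise."""
    n = len(observations)
    precision = 1.0 / tau0_sq + n / sigma_sq
    return (mu0 / tau0_sq + sum(observations) / sigma_sq) / precision

print(novel_value_estimate)                           # guess before any data
print(posterior_mean([1.0, 0.0, 1.0], mu0, tau0_sq))  # shrunk toward the prior
```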

  13. Effects of the chronic restraint stress induced depression on reward-related learning in rats.

    PubMed

    Xu, Pan; Wang, Kezhu; Lu, Cong; Dong, Liming; Chen, Yixi; Wang, Qiong; Shi, Zhe; Yang, Yanyan; Chen, Shanguang; Liu, Xinmin

    2017-03-15

    Chronic mild or unpredictable stress produces a persistent depressive-like state. The main symptoms of depression include weight loss, despair, anhedonia, diminished motivation, and mild cognitive impairment, all of which can influence reward-related learning. In the present study, we aimed to evaluate the effects of chronic restraint stress on the performance of reward-related learning in rats. We used repeated restraint stress (6 h/day for 28 days) to induce depression-like behavior in rats. We then designed tasks including Pavlovian conditioning (magazine head entries), acquisition and maintenance of instrumental conditioning (lever pressing), and goal-directed learning (a higher fixed-ratio schedule of reinforcement) to study the effects of chronic restraint stress. The results indicated that chronic restraint stress affected the acquisition of a Pavlovian stimulus-outcome (S-O) association, the formation and maintenance of an action-outcome (A-O) causal relation, and the ability to learn under a higher fixed-ratio schedule. In conclusion, depression markedly influenced performance in reward-related learning, and this series of instrumental learning tasks may have potential as a method to evaluate cognitive changes in depression.

  14. Anticipated Reward Enhances Offline Learning during Sleep

    ERIC Educational Resources Information Center

    Fischer, Stefan; Born, Jan

    2009-01-01

    Sleep is known to promote the consolidation of motor memories. In everyday life, typically more than 1 isolated motor skill is acquired at a time, and this possibly gives rise to interference during consolidation. Here, it is shown that reward expectancy determines the amount of sleep-dependent memory consolidation. Subjects were trained on 2…

  16. Memory Transformation Enhances Reinforcement Learning in Dynamic Environments.

    PubMed

    Santoro, Adam; Frankland, Paul W; Richards, Blake A

    2016-11-30

    Over the course of systems consolidation, there is a switch from a reliance on detailed episodic memories to generalized schematic memories. This switch is sometimes referred to as "memory transformation." Here we demonstrate a previously unappreciated benefit of memory transformation, namely, its ability to enhance reinforcement learning in a dynamic environment. We developed a neural network that is trained to find rewards in a foraging task where reward locations are continuously changing. The network can use memories for specific locations (episodic memories) and statistical patterns of locations (schematic memories) to guide its search. We find that switching from an episodic to a schematic strategy over time leads to enhanced performance due to the tendency for the reward location to be highly correlated with itself in the short term, but to regress to a stable distribution in the long term. We also show that the statistics of the environment determine the optimal utilization of both types of memory. Our work recasts the theoretical question of why memory transformation occurs, shifting the focus from the avoidance of memory interference toward the enhancement of reinforcement learning across multiple timescales. As time passes, memories transform from a highly detailed state to a more gist-like state, in a process called "memory transformation." Theories of memory transformation speak to its advantages in terms of reducing memory interference, increasing memory robustness, and building models of the environment. However, the role of memory transformation from the perspective of an agent that continuously acts and receives reward in its environment is not well explored. In this work, we demonstrate a view of memory transformation that defines it as a way of optimizing behavior across multiple timescales.
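
    A toy rendering of the episodic-versus-schematic trade-off described above, assuming a one-dimensional foraging world and a hand-set blending weight; the dynamics and names are illustrative only.

```python
# Toy 1-D foraging world: the reward location drifts in the short term but
# regresses to a stable mean in the long term, as in the abstract.

def episodic_guess(history):
    """Detailed memory: predict the most recent reward location."""
    return history[-1]

def schematic_guess(history):
    """Gist-like memory: predict the long-run average location."""
    return sum(history) / len(history)

def transformed_guess(history, w):
    """Blend: rely on episodic detail early (w near 0), schema later (w near 1)."""
    return (1 - w) * episodic_guess(history) + w * schematic_guess(history)

history = [48.0, 53.0, 51.0, 47.0]
for w in (0.0, 0.5, 1.0):
    print(w, transformed_guess(history, w))
```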

  17. Incidental Learning of Rewarded Associations Bolsters Learning on an Associative Task

    ERIC Educational Resources Information Center

    Freedberg, Michael; Schacherer, Jonathan; Hazeltine, Eliot

    2016-01-01

    Reward has been shown to change behavior as a result of incentive learning (by motivating the individual to increase their effort) and instrumental learning (by increasing the frequency of a particular behavior). However, Palminteri et al. (2011) demonstrated that reward can also improve the incidental learning of a motor skill even when…

  18. Rewards.

    PubMed

    Gunderman, Richard B; Kamer, Aaron P

    2011-05-01

    For much of the 20th century, psychologists and economists operated on the assumption that work is devoid of intrinsic rewards, and the only way to get people to work harder is through the use of rewards and punishments. This so-called carrot-and-stick model of workplace motivation, when applied to medical practice, emphasizes the use of financial incentives and disincentives to manipulate behavior. More recently, however, it has become apparent that, particularly when applied to certain kinds of work, such approaches can be ineffective or even frankly counterproductive. Instead of focusing on extrinsic rewards such as compensation, organizations and their leaders need to devote more attention to the intrinsic rewards of work itself. This article reviews this new understanding of rewards and traces out its practical implications for radiology today.

  19. Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning.

    PubMed

    Frank, Michael J; Moustafa, Ahmed A; Haughey, Heather M; Curran, Tim; Hutchison, Kent E

    2007-10-09

    What are the genetic and neural components that support adaptive learning from positive and negative outcomes? Here, we show with genetic analyses that three independent dopaminergic mechanisms contribute to reward and avoidance learning in humans. A polymorphism in the DARPP-32 gene, associated with striatal dopamine function, predicted relatively better probabilistic reward learning. Conversely, the C957T polymorphism of the DRD2 gene, associated with striatal D2 receptor function, predicted the degree to which participants learned to avoid choices that had been probabilistically associated with negative outcomes. The Val/Met polymorphism of the COMT gene, associated with prefrontal cortical dopamine function, predicted participants' ability to rapidly adapt behavior on a trial-to-trial basis. These findings support a neurocomputational dissociation between striatal and prefrontal dopaminergic mechanisms in reinforcement learning. Computational maximum likelihood analyses reveal independent gene effects on three reinforcement learning parameters that can explain the observed dissociations.

  20. Emotion and reward are dissociable from error during motor learning.

    PubMed

    Festini, Sara B; Preston, Stephanie D; Reuter-Lorenz, Patricia A; Seidler, Rachael D

    2016-06-01

    Although emotion is known to reciprocally interact with cognitive and motor performance, contemporary theories of motor learning do not specifically consider how dynamic variations in a learner's affective state may influence motor performance during motor learning. Using a prism adaptation paradigm, we assessed emotion during motor learning on a trial-by-trial basis. We designed two dart-throwing experiments to dissociate motor performance and reward outcomes by giving participants maximum points for accurate throws and reduced points for throws that hit zones away from the target (i.e., "accidental points"). Experiment 1 dissociated motor performance from emotional responses and found that affective ratings tracked points earned more closely than error magnitude. Further, both reward and error uniquely contributed to motor learning, as indexed by the change in error from one trial to the next. Experiment 2 manipulated accidental point locations vertically, whereas prism displacement remained horizontal. Results demonstrated that reward could bias motor performance even when concurrent sensorimotor adaptation was taking place in a perpendicular direction. Thus, these experiments demonstrate that affective states were dissociable from error magnitude during motor learning and that affect more closely tracked points earned. Our findings further implicate reward as another factor, other than error, that contributes to motor learning, suggesting the importance of incorporating affective states into models of motor learning.

  1. The Establishment of Learned Reinforcers in Mildly Retarded Children. IMRID Behavioral Science Monograph No. 24.

    ERIC Educational Resources Information Center

    Worley, John C., Jr.

    Research regarding the establishment of learned reinforcement with mildly retarded children is reviewed. Noted are findings which indicate that educable retarded students, possibly due to cultural differences, are less responsive to social rewards than either nonretarded or more severely retarded children. Characteristics of primary and secondary…

  2. Reinforcement learning for discounted values often loses the goal in the application to animal learning.

    PubMed

    Yamaguchi, Yoshiya; Sakai, Yutaka

    2012-11-01

    The impulsive preference of an animal for an immediate reward implies that it subjectively discounts the value of potential future outcomes. A theoretical framework for maximizing discounted subjective value has been established in reinforcement learning theory and has been successfully applied in engineering. However, this study identifies a limitation when the framework is applied to animal behavior: in some cases, the discounted objective defines no learning goal. A possible learning framework is proposed here that is well-posed in all cases and remains consistent with impulsive preference.
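
    For concreteness, the discounted objective at issue is G = sum_t gamma^t * r_t. A small sketch with illustrative numbers shows how steep discounting produces the impulsive preference the abstract mentions:

```python
def discounted_return(rewards, gamma=0.9):
    """Exponentially discounted subjective value of a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Undiscounted average-reward criterion, one alternative objective."""
    return sum(rewards) / len(rewards)

# An impulsive preference: a small immediate reward can outweigh a larger
# delayed one once future outcomes are discounted steeply enough.
print(discounted_return([1.0, 0, 0, 0], gamma=0.5))  # immediate small: 1.0
print(discounted_return([0, 0, 0, 3.0], gamma=0.5))  # delayed large: 0.375
```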

  3. Indices of extinction-induced "depression" after operant learning using a runway vs. a cued free-reward delivery schedule.

    PubMed

    Topic, Bianca; Kröger, Inga; Vildirasova, Petya G; Huston, Joseph P

    2012-11-01

    Loss of reward is one of the etiological factors leading to affective disorders, such as major depression. We have proposed several variants of an animal model of depression based on extinction of reinforced behavior in rats. A number of behaviors emitted during extinction trials were found to be attenuated by antidepressant treatment and thus qualified as indices of extinction-induced "despair". These include increased immobility in the Morris water maze, withdrawal from the former source of reward, and biting behavior in operant chambers. Here, we assess the effects of reward omission on behaviors after learning of (a) a cued free-reward delivery schedule in an operant chamber and (b) food-reinforced runway behavior. Sixty adult male Wistar rats were trained either to receive food reinforcement every 90 s following a 5-s cue light (FI 90), or to traverse an alley to gain food reward. Daily treatment with the selective serotonin reuptake inhibitor citalopram, the tricyclic antidepressant imipramine (10 mg/kg each), or vehicle began either 25 days (operant chamber) or 3 days (runway) prior to extinction. The antidepressants suppressed rearing behavior in both paradigms specifically during the extinction trials, which marks this measure as a useful indicator of depression-related behavior, possibly reflecting vertical withdrawal. In the operant chamber, only marginal effects on learned operant responses during extinction were found. In the runway, the learned operant responses (run time and distance to the goal), as well as total distance moved, grooming, and quiescence, were also influenced by the antidepressants, providing a potential set of markers for extinction-induced "depression" in the runway. The two paradigms differ substantially with respect to the anticipation of reward and the behaviors that are learned and that accompany extinction. Accordingly, antidepressant treatment influenced different sets of behaviors in these two learning tasks.

  4. Hypocretin/orexin regulation of dopamine signaling: implications for reward and reinforcement mechanisms

    PubMed Central

    Calipari, Erin S.; España, Rodrigo A.

    2012-01-01

    The hypocretins/orexins are comprised of two neuroexcitatory peptides that are synthesized exclusively within a circumscribed region of the lateral hypothalamus. These peptides project widely throughout the brain and interact with a variety of regions involved in the regulation of arousal-related processes including those associated with motivated behavior. The current review focuses on emerging evidence indicating that the hypocretins influence reward and reinforcement processing via actions on the mesolimbic dopamine system. We discuss contemporary perspectives of hypocretin regulation of mesolimbic dopamine signaling in both drug free and drug states, as well as hypocretin regulation of behavioral responses to drugs of abuse, particularly as it relates to cocaine. PMID:22933994

  5. Drive-Reinforcement Learning System Applications

    DTIC Science & Technology

    1992-07-31

    Evidence suggests that D-R would be effective in control system applications outside the robotics arena. Keywords: drive-reinforcement learning, neural network controllers, robotics, manipulator kinematics, dynamics and control.

  6. Tree-Based Hierarchical Reinforcement Learning

    DTIC Science & Technology

    2002-08-01

  7. DAT isn’t all that: cocaine reward and reinforcement requires Toll Like Receptor 4 signaling

    PubMed Central

    Northcutt, A.L.; Hutchinson, M.R.; Wang, X.; Baratta, M.V.; Hiranita, T.; Cochran, T.A.; Pomrenze, M.B.; Galer, E.L.; Kopajtic, T.A.; Li, C.M.; Amat, J.; Larson, G.; Cooper, D.C.; Huang, Y.; O’Neill, C.E.; Yin, H.; Zahniser, N.R.; Katz, J.L.; Rice, K.C.; Maier, S.F.; Bachtell, R.K.; Watkins, L.R.

    2014-01-01

    The initial reinforcing properties of drugs of abuse, such as cocaine, are largely attributed to their ability to activate the mesolimbic dopamine system. Resulting increases in extracellular dopamine in the nucleus accumbens (NAc) are traditionally thought to result from cocaine’s ability to block dopamine transporters (DATs). Here we demonstrate that cocaine also interacts with the immunosurveillance receptor complex, Toll-Like Receptor 4 (TLR4), on microglial cells to initiate central innate immune signaling. Disruption of cocaine signaling at TLR4 suppresses cocaine-induced extracellular dopamine in the NAc, as well as cocaine conditioned place preference and cocaine self-administration. These results provide a novel understanding of the neurobiological mechanisms underlying cocaine reward/reinforcement that includes a critical role for central immune signaling, and offer a new target for medication development for cocaine abuse treatment. PMID:25644383

  8. Reinforcement learning improves behaviour from evaluative feedback

    NASA Astrophysics Data System (ADS)

    Littman, Michael L.

    2015-05-01

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  9. Reinforcement learning improves behaviour from evaluative feedback.

    PubMed

    Littman, Michael L

    2015-05-28

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  10. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain.

    PubMed

    Niv, Yael; Edlund, Jeffrey A; Dayan, Peter; O'Doherty, John P

    2012-01-11

    Humans and animals are exquisitely, though idiosyncratically, sensitive to risk or variance in the outcomes of their actions. Economic, psychological, and neural aspects of this are well studied when information about risk is provided explicitly. However, we must normally learn about outcomes from experience, through trial and error. Traditional models of such reinforcement learning focus on learning about the mean reward value of cues and ignore higher order moments such as variance. We used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk. Our analysis focused on anatomically delineated regions of a priori interest in the nucleus accumbens, where blood oxygenation level-dependent (BOLD) signals have been suggested as correlating with quantities derived from reinforcement learning. We first provide unbiased evidence that the raw BOLD signal in these regions corresponds closely to a reward prediction error. We then derive from this signal the learned values of cues that predict rewards of equal mean but different variance and show that these values are indeed modulated by experienced risk. Moreover, a close neurometric-psychometric coupling exists between the fluctuations of the experience-based evaluations of risky options that we measured neurally and the fluctuations in behavioral risk aversion. This suggests that risk sensitivity is integral to human learning, illuminating economic models of choice, neuroscientific models of affective learning, and the workings of the underlying neural mechanisms.
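
    One common way to make a temporal-difference learner risk-sensitive is to weight positive and negative prediction errors asymmetrically. The sketch below uses that device with illustrative parameters; it is an assumption for illustration, not necessarily the exact model fit in the paper.

```python
import random
random.seed(0)

def risk_sensitive_td(values, cue, reward, alpha_pos=0.1, alpha_neg=0.2):
    """TD update with asymmetric learning rates: over-weighting negative
    prediction errors makes high-variance cues look worse than their mean."""
    delta = reward - values[cue]
    alpha = alpha_pos if delta >= 0 else alpha_neg
    values[cue] += alpha * delta

V = {"safe": 0.0, "risky": 0.0}
for _ in range(2000):
    risk_sensitive_td(V, "safe", 0.5)                         # certain reward
    risk_sensitive_td(V, "risky", random.choice([0.0, 1.0]))  # equal mean, higher variance
print(V)  # the risky cue settles below 0.5: learned risk aversion
```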

  11. Cocaine addiction as a homeostatic reinforcement learning disorder.

    PubMed

    Keramati, Mehdi; Durand, Audrey; Girardeau, Paul; Gutkin, Boris; Ahmed, Serge H

    2017-03-01

    Drug addiction implicates both reward learning and homeostatic regulation mechanisms of the brain. This has stimulated 2 partially successful theoretical perspectives on addiction. Many important aspects of addiction, however, remain to be explained within a single, unified framework that integrates the 2 mechanisms. Building upon a recently developed homeostatic reinforcement learning theory, the authors focus on a key transition stage of addiction that is well modeled in animals, escalation of drug use, and propose a computational theory of cocaine addiction where cocaine reinforces behavior due to its rapid homeostatic corrective effect, whereas its chronic use induces slow and long-lasting changes in homeostatic setpoint. Simulations show that our new theory accounts for key behavioral and neurobiological features of addiction, most notably, escalation of cocaine use, drug-primed craving and relapse, individual differences underlying dose-response curves, and dopamine D2-receptor downregulation in addicts. The theory also generates unique predictions about cocaine self-administration behavior in rats that are confirmed by new experimental results. Viewing addiction as a homeostatic reinforcement learning disorder coherently explains many behavioral and neurobiological aspects of the transition to cocaine addiction, and suggests a new perspective toward understanding addiction.

  12. The role of basal ganglia in reinforcement learning and imprinting in domestic chicks.

    PubMed

    Izawa, E; Yanagihara, S; Atsumi, T; Matsushima, T

    2001-06-13

    Effects of bilateral kainate lesions of telencephalic basal ganglia (lobus parolfactorius, LPO) were examined in domestic chicks. In the imprinting paradigm, where chicks learned to selectively approach a moving object without any explicitly associated reward, both the pre- and post-training lesions were without effects. On the other hand, in the water-reinforced pecking task, pre-training lesions of LPO severely impaired immediate reinforcement as well as formation of the association memory. However, post-training LPO lesions did not cause amnesia, and chicks selectively pecked at the reinforced color. The LPO could thus be involved specifically in the evaluation of present rewards and the instantaneous reinforcement of pecking, but not in the execution of selective behavior based on a memorized color cue.

  13. Reward and Cognition: Integrating Reinforcement Sensitivity Theory and Social Cognitive Theory to Predict Drinking Behavior.

    PubMed

    Hasking, Penelope; Boyes, Mark; Mullan, Barbara

    2015-01-01

    Both Reinforcement Sensitivity Theory and Social Cognitive Theory have been applied to understanding drinking behavior. We propose that theoretical relationships between these models support an integrated approach to understanding alcohol use and misuse. We aimed to test an integrated model in which the relationships between reward sensitivity and drinking behavior (alcohol consumption, alcohol-related problems, and symptoms of dependence) were mediated by alcohol expectancies and drinking refusal self-efficacy. Online questionnaires assessing the constructs of interest were completed by 443 Australian adults (M age = 26.40, sd = 1.83) in 2013 and 2014. Path analysis revealed both direct and indirect effects and implicated two pathways to drinking behavior with differential outcomes. Drinking refusal self-efficacy both in social situations and for emotional relief was related to alcohol consumption. Sensitivity to reward was associated with alcohol-related problems, but operated through expectations of increased confidence and personal belief in the ability to limit drinking in social situations. Conversely, sensitivity to punishment operated through negative expectancies and drinking refusal self-efficacy for emotional relief to predict symptoms of dependence. Two pathways relating reward sensitivity, alcohol expectancies, and drinking refusal self-efficacy may underlie social and dependent drinking, which has implications for development of intervention to limit harmful drinking.

  14. Early Years Education: Are Young Students Intrinsically or Extrinsically Motivated Towards School Activities? A Discussion about the Effects of Rewards on Young Children's Learning

    ERIC Educational Resources Information Center

    Theodotou, Evgenia

    2014-01-01

    Rewards can reinforce and at the same time forestall young children's willingness to learn. However, they are broadly used in the field of education, especially in early years settings, to stimulate children towards learning activities. This paper reviews the theoretical and research literature related to intrinsic and extrinsic motivational…

  15. The "proactive" model of learning: Integrative framework for model-free and model-based reinforcement learning utilizing the associative learning-based proactive brain concept.

    PubMed

    Zsuga, Judit; Biro, Klara; Papp, Csaba; Tajti, Gabor; Gesztelyi, Rudolf

    2016-02-01

    Reinforcement learning (RL) is a powerful concept underlying forms of associative learning governed by the use of a scalar reward signal, with learning taking place when expectations are violated. RL may be assessed using model-based and model-free approaches. Model-based reinforcement learning involves the amygdala, the hippocampus, and the orbitofrontal cortex (OFC). The model-free system involves the pedunculopontine-tegmental nucleus (PPTgN), the ventral tegmental area (VTA), and the ventral striatum (VS). Based on the functional connectivity of the VS, the model-free and model-based RL systems converge on the VS, which computes value by integrating model-free signals (received as reward prediction errors) with model-based, reward-related input. Using the concept of the reinforcement learning agent, we propose that the VS serves as the value-function component of the RL agent. For the model used in model-based computations, we turn to the proactive brain concept, which assigns a ubiquitous function to the default network based on its large functional overlap with contextual associative areas. By means of the default network, the brain continuously organizes its environment into context frames, enabling the formulation of analogy-based associations that are turned into predictions of what to expect. The OFC integrates reward-related information into context frames by computing reward expectation, compiling stimulus-reward and context-reward information offered by the amygdala and hippocampus, respectively. Furthermore, we suggest that the integration of model-based reward expectations into the value signal is further supported by efferents of the OFC that reach structures canonical for model-free learning (e.g., the PPTgN, VTA, and VS).

  16. Altering spatial priority maps via reward-based learning.

    PubMed

    Chelazzi, Leonardo; Eštočinová, Jana; Calletti, Riccardo; Lo Gerfo, Emanuele; Sani, Ilaria; Della Libera, Chiara; Santandrea, Elisa

    2014-06-18

    Spatial priority maps are real-time representations of the behavioral salience of locations in the visual field, resulting from the combined influence of stimulus driven activity and top-down signals related to the current goals of the individual. They arbitrate which of a number of (potential) targets in the visual scene will win the competition for attentional resources. As a result, deployment of visual attention to a specific spatial location is determined by the current peak of activation (corresponding to the highest behavioral salience) across the map. Here we report a behavioral study performed on healthy human volunteers, where we demonstrate that spatial priority maps can be shaped via reward-based learning, reflecting long-lasting alterations (biases) in the behavioral salience of specific spatial locations. These biases exert an especially strong influence on performance under conditions where multiple potential targets compete for selection, conferring competitive advantage to targets presented in spatial locations associated with greater reward during learning relative to targets presented in locations associated with lesser reward. Such acquired biases of spatial attention are persistent, are nonstrategic in nature, and generalize across stimuli and task contexts. These results suggest that reward-based attentional learning can induce plastic changes in spatial priority maps, endowing these representations with the "intelligent" capacity to learn from experience.

  17. Dopamine-dependent reinforcement of motor skill learning: evidence from Gilles de la Tourette syndrome.

    PubMed

    Palminteri, Stefano; Lebreton, Maël; Worbe, Yulia; Hartmann, Andreas; Lehéricy, Stéphane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-08-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only affect choices but also motor skills such as typing. Here, we employed a novel paradigm to demonstrate that monetary rewards can improve motor skill learning in humans. Indeed, healthy participants progressively got faster in executing sequences of key presses that were repeatedly rewarded with 10 euro compared with 1 cent. Control tests revealed that the effect of reinforcement on motor skill learning was independent of subjects being aware of sequence-reward associations. To account for this implicit effect, we developed an actor-critic model, in which reward prediction errors are used by the critic to update state values and by the actor to facilitate action execution. To assess the role of dopamine in such computations, we applied the same paradigm in patients with Gilles de la Tourette syndrome, who were either unmedicated or treated with neuroleptics. We also included patients with focal dystonia, as an example of hyperkinetic motor disorder unrelated to dopamine. Model fit showed the following dissociation: while motor skills were affected in all patient groups, reinforcement learning was selectively enhanced in unmedicated patients with Gilles de la Tourette syndrome and impaired by neuroleptics. These results support the hypothesis that overactive dopamine transmission leads to excessive reinforcement of motor sequences, which might explain the formation of tics in Gilles de la Tourette syndrome.
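
    A minimal sketch of the actor-critic scheme the abstract describes, in which a single reward prediction error both updates the critic's state value and facilitates the actor's execution (here collapsed into a scalar "vigor"); all parameters and magnitudes are illustrative.

```python
V = 0.0          # critic: expected reward for the motor state
vigor = 1.0      # actor: execution speed/facilitation for the sequence
ALPHA_C, ALPHA_A = 0.1, 0.05

def trial(reward):
    global V, vigor
    delta = reward - V          # reward prediction error
    V += ALPHA_C * delta        # critic: update state value
    vigor += ALPHA_A * delta    # actor: same error facilitates execution
    return delta

# Sequences rewarded with 10 euros vs. 1 cent (values in euros, illustrative).
for _ in range(50):
    trial(10.0)
print(V, vigor)   # vigor grows with repeated large rewards, then saturates
```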

  18. Spiking neural networks with different reinforcement learning (RL) schemes in a multiagent setting.

    PubMed

    Christodoulou, Chris; Cleanthous, Aristodemos

    2010-12-31

    This paper investigates the effectiveness of spiking agents when trained with reinforcement learning (RL) in a challenging multiagent task. In particular, it explores learning through reward-modulated spike-timing dependent plasticity (STDP) and compares it to reinforcement of stochastic synaptic transmission in the general-sum game of the Iterated Prisoner's Dilemma (IPD). More specifically, a computational model is developed where we implement two spiking neural networks as two "selfish" agents learning simultaneously but independently, competing in the IPD game. The purpose of our system (or collective) is to maximise its accumulated reward in the presence of reward-driven competing agents within the collective. This can only be achieved when the agents engage in a behaviour of mutual cooperation during the IPD. Previously, we successfully applied reinforcement of stochastic synaptic transmission to the IPD game. The current study utilises reward-modulated STDP with eligibility trace and results show that the system managed to exhibit the desired behaviour by establishing mutual cooperation between the agents. It is noted that the cooperative outcome was attained after a relatively short learning period which enhanced the accumulation of reward by the system. As in our previous implementation, the successful application of the learning algorithm to the IPD becomes possible only after we extended it with additional global reinforcement signals in order to enhance competition at the neuronal level. Moreover it is also shown that learning is enhanced (as indicated by an increased IPD cooperative outcome) through: (i) strong memory for each agent (regulated by a high eligibility trace time constant) and (ii) firing irregularity produced by equipping the agents' LIF neurons with a partial somatic reset mechanism.
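
    The core mechanism, reward-modulated STDP with an eligibility trace, can be sketched compactly: spike pairings write to a decaying trace, and a later reward converts the trace into a weight change. The constants below are illustrative, not the paper's.

```python
import math

A_PLUS, A_MINUS = 0.01, 0.012   # STDP amplitudes (assumed)
TAU_STDP = 20.0                 # ms, STDP window time constant (assumed)
TAU_ELIG = 500.0                # ms, eligibility-trace ("memory") constant
ETA = 0.5                       # learning rate (assumed)

def stdp(dt):
    """Classic exponential STDP window; dt = t_post - t_pre (ms)."""
    if dt >= 0:
        return A_PLUS * math.exp(-dt / TAU_STDP)
    return -A_MINUS * math.exp(dt / TAU_STDP)

elig, w = 0.0, 0.5
events = [(+5.0, 0.0), (+8.0, 0.0), (-10.0, 0.0), (+3.0, 1.0)]  # (dt, reward)
for dt, reward in events:
    elig = elig * math.exp(-100.0 / TAU_ELIG) + stdp(dt)  # decay per 100 ms step
    w += ETA * reward * elig                              # reward gates plasticity
print(w)
```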

  19. Dynamic Sensor Tasking for Space Situational Awareness via Reinforcement Learning

    NASA Astrophysics Data System (ADS)

    Linares, R.; Furfaro, R.

    2016-09-01

    This paper studies the Sensor Management (SM) problem for optical Space Object (SO) tracking. The tasking problem is formulated as a Markov Decision Process (MDP) and solved using Reinforcement Learning (RL) with an actor-critic policy gradient approach. The actor provides a stochastic policy over actions, given by a parametric probability density function (pdf). The critic evaluates the policy by calculating the estimated total reward, or value function, for the problem. The parameters of the policy pdf are optimized using gradients with respect to the reward function. Both the critic and the actor are modeled using deep (multi-layer) neural networks. The policy network takes the current state as input and outputs probabilities for each possible action; actions are executed by sampling according to these probabilities. The critic network approximates the total reward, and this estimate is used to approximate the gradient of the policy network with respect to its parameters. This approach is used to find the non-myopic optimal policy for tasking optical sensors to estimate SO orbits. The reward function is based on reducing the uncertainty for the overall catalog to below a user-specified threshold; this work uses a 30 km total position error as the threshold. The RL method receives a negative reward as long as any SO has a total position error above the threshold, penalizing policies that take longer to achieve the desired accuracy, and a positive reward once all SOs are below the catalog uncertainty threshold. An optimal policy is thus sought that achieves the desired catalog uncertainty in minimum time. The policy is trained in simulation by letting it task a single sensor and "learn" from its performance.
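
    A compact sketch of the actor-critic policy-gradient loop described above, with a softmax policy and a scalar baseline standing in for the paper's deep networks; the environment, reward values, and action set are placeholders.

```python
import math, random
random.seed(0)

N_ACTIONS = 3                      # e.g., which object a sensor observes next
theta = [0.0] * N_ACTIONS          # actor parameters (policy logits)
baseline = 0.0                     # critic: running estimate of total reward
ALPHA_A, ALPHA_C = 0.1, 0.1

def policy():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

def fake_reward(action):
    # Placeholder: action 2 best reduces "catalog uncertainty" (illustrative).
    return [0.1, 0.3, 0.9][action] + random.gauss(0, 0.05)

for _ in range(1000):
    probs = policy()
    a = random.choices(range(N_ACTIONS), weights=probs)[0]
    r = fake_reward(a)
    advantage = r - baseline                    # critic evaluates the policy
    for i in range(N_ACTIONS):                  # REINFORCE-style gradient step
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += ALPHA_A * advantage * grad
    baseline += ALPHA_C * advantage
print(policy())  # probability mass concentrates on the best action
```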

  20. Reinforcement learning for routing in cognitive radio ad hoc networks.

    PubMed

    Al-Rawi, Hasan A A; Yau, Kok-Lim Alvin; Mohamad, Hafizal; Ramli, Nordin; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL in routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs.
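
    As a sketch of the general idea, a node can keep a value estimate per next hop and update it from observed link rewards, trading off exploration and exploitation; the topology, reward model, and parameters below are invented for illustration.

```python
import random
random.seed(0)

neighbors = ["B", "C", "D"]
Q = {n: 0.0 for n in neighbors}   # estimated route quality via each neighbor
ALPHA, EPSILON = 0.2, 0.1

def link_reward(n):
    # Placeholder: higher reward = lower route cost / less PU interference.
    return {"B": 0.4, "C": 0.8, "D": 0.2}[n] + random.gauss(0, 0.1)

for _ in range(500):
    if random.random() < EPSILON:        # exploration
        hop = random.choice(neighbors)
    else:                                # exploitation
        hop = max(Q, key=Q.get)
    r = link_reward(hop)
    Q[hop] += ALPHA * (r - Q[hop])       # incremental value update
print(Q)   # routing via "C" is learned to be best
```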

  1. Reinforcement Learning for Routing in Cognitive Radio Ad Hoc Networks

    PubMed Central

    Al-Rawi, Hasan A. A.; Mohamad, Hafizal; Hashim, Wahidah

    2014-01-01

    Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL in routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs' network performance without significantly jeopardizing PUs' network performance, specifically SUs' interference to PUs. PMID:25140350

  2. Hippocampal lesions facilitate instrumental learning with delayed reinforcement but induce impulsive choice in rats

    PubMed Central

    Cheung, Timothy HC; Cardinal, Rudolf N

    2005-01-01

    Background Animals must frequently act to influence the world even when the reinforcing outcomes of their actions are delayed. Learning with action-outcome delays is a complex problem, and little is known of the neural mechanisms that bridge such delays. When outcomes are delayed, they may be attributed to (or associated with) the action that caused them, or mistakenly attributed to other stimuli, such as the environmental context. Consequently, animals that are poor at forming context-outcome associations might learn action-outcome associations better with delayed reinforcement than normal animals. The hippocampus contributes to the representation of environmental context, being required for aspects of contextual conditioning. We therefore hypothesized that animals with hippocampal lesions would be better than normal animals at learning to act on the basis of delayed reinforcement. We tested the ability of hippocampal-lesioned rats to learn a free-operant instrumental response using delayed reinforcement, and what is potentially a related ability – the ability to exhibit self-controlled choice, or to sacrifice an immediate, small reward in order to obtain a delayed but larger reward. Results Rats with sham or excitotoxic hippocampal lesions acquired an instrumental response with different delays (0, 10, or 20 s) between the response and reinforcer delivery. These delays retarded learning in normal rats. Hippocampal-lesioned rats responded slightly less than sham-operated controls in the absence of delays, but they became better at learning (relative to shams) as the delays increased; delays impaired learning less in hippocampal-lesioned rats than in shams. In contrast, lesioned rats exhibited impulsive choice, preferring an immediate, small reward to a delayed, larger reward, even though they preferred the large reward when it was not delayed. Conclusion These results support the view that the hippocampus hinders action-outcome learning with delayed outcomes.

  3. Functional Contour-following via Haptic Perception and Reinforcement Learning.

    PubMed

    Hellman, Randall B; Tekin, Cem; Schaar, Mihaela van der; Santos, Veronica J

    2017-09-18

    Many tasks involve the fine manipulation of objects despite limited visual feedback. In such scenarios, tactile and proprioceptive feedback can be leveraged for task completion. We present an approach for real-time haptic perception and decision-making for a haptics-driven, functional contour-following task: the closure of a ziplock bag. This task is challenging for robots because the bag is deformable, transparent, and visually occluded by artificial fingertip sensors that are also compliant. A deep neural net classifier was trained to estimate the state of a zipper within a robot's pinch grasp. A Contextual Multi-Armed Bandit (C-MAB) reinforcement learning algorithm was implemented to maximize cumulative rewards by balancing exploration versus exploitation of the state-action space. The C-MAB learner outperformed a benchmark Q-learner by more efficiently exploring the state-action space while learning a hard-to-code task. The learned C-MAB policy was tested with novel ziplock bag scenarios and contours (wire, rope). Importantly, this work contributes to the development of reinforcement learning approaches that account for limited resources such as hardware life and researcher time. As robots are used to perform complex, physically interactive tasks in unstructured or unmodeled environments, it becomes important to develop methods that enable efficient and effective learning with physical testbeds.
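
    The contextual-bandit idea can be sketched with a simple per-context epsilon-greedy learner (a stand-in for the paper's C-MAB algorithm); the states, actions, and reward rule are illustrative.

```python
import random
random.seed(0)

# Discrete zipper-state contexts and manipulation actions (illustrative).
states = ["left_of_zipper", "on_zipper", "right_of_zipper"]
actions = ["move_left", "pinch_and_slide", "move_right"]
Q = {(s, a): 0.0 for s in states for a in actions}
N = {(s, a): 0 for s in states for a in actions}
EPSILON = 0.1

def reward(s, a):
    # Placeholder: sliding is only rewarded when the grasp is on the zipper.
    return 1.0 if (s == "on_zipper" and a == "pinch_and_slide") else 0.0

for _ in range(3000):
    s = random.choice(states)                     # context arrives
    if random.random() < EPSILON:                 # explore
        a = random.choice(actions)
    else:                                         # exploit per-context values
        a = max(actions, key=lambda x: Q[(s, x)])
    r = reward(s, a)
    N[(s, a)] += 1
    Q[(s, a)] += (r - Q[(s, a)]) / N[(s, a)]      # sample-average update
print(max(actions, key=lambda x: Q[("on_zipper", x)]))
```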

  4. What is the optimal task difficulty for reinforcement learning of brain self-regulation?

    PubMed

    Bauer, Robert; Vukelić, Mathias; Gharabaghi, Alireza

    2016-09-01

    The balance between action and reward during neurofeedback may influence reinforcement learning of brain self-regulation. Eleven healthy volunteers participated in three runs of motor imagery-based brain-machine interface feedback where a robot passively opened the hand contingent on β-band modulation. For each run, the β-desynchronization threshold to initiate the hand robot movement increased in difficulty (low, moderate, and demanding). In this context, the incentive to learn was estimated by the change of reward per action, operationalized as the change in reward duration per movement onset. Variance analysis revealed a significant interaction between threshold difficulty and the relationship between reward duration and number of movement onsets (p<0.001), indicating a negative learning incentive for low difficulty, but a positive learning incentive for moderate and demanding runs. Exploration of different thresholds in the same data set indicated that the learning incentive peaked at higher thresholds than the threshold that resulted in maximum classification accuracy. Specificity is more important than sensitivity of neurofeedback for reinforcement learning of brain self-regulation. Learning efficiency requires adequate challenge by neurofeedback interventions.

  5. Efficient exploration through active learning for value function approximation in reinforcement learning.

    PubMed

    Akiyama, Takayuki; Hachiya, Hirotaka; Sugiyama, Masashi

    2010-06-01

    Appropriately designing sampling policies is highly important for obtaining better control policies in reinforcement learning. In this paper, we first show that the least-squares policy iteration (LSPI) framework allows us to employ statistical active learning methods for linear regression. Then we propose a design method of good sampling policies for efficient exploration, which is particularly useful when the sampling cost of immediate rewards is high. The effectiveness of the proposed method, which we call active policy iteration (API), is demonstrated through simulations with a batting robot.
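
    The least-squares machinery underlying LSPI can be illustrated with an LSTD(0) policy-evaluation step, which fits linear value weights from sampled transitions; the features and toy chain below are assumptions for illustration.

```python
import numpy as np

def lstd(transitions, gamma=0.9, ridge=1e-3):
    """LSTD(0): fit w so that V(s) ≈ w . phi(s) from (phi, r, phi_next) samples."""
    k = len(transitions[0][0])
    A = ridge * np.eye(k)        # small ridge term keeps A invertible
    b = np.zeros(k)
    for phi, r, phi_next in transitions:
        phi, phi_next = np.asarray(phi), np.asarray(phi_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)

# Two-state chain with one-hot features: state 0 -> state 1 (reward 0),
# state 1 -> state 1 (reward 1). True values: V(1)=10, V(0)=9 for gamma=0.9.
data = [([1.0, 0.0], 0.0, [0.0, 1.0]),
        ([0.0, 1.0], 1.0, [0.0, 1.0])]
print(lstd(data))   # ≈ [9, 10]
```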

  6. Racial bias shapes social reinforcement learning.

    PubMed

    Lindström, Björn; Selbing, Ida; Molapour, Tanaz; Olsson, Andreas

    2014-03-01

    Both emotional facial expressions and markers of racial-group belonging are ubiquitous signals in social interaction, but little is known about how these signals together affect future behavior through learning. To address this issue, we investigated how emotional (threatening or friendly) in-group and out-group faces reinforced behavior in a reinforcement-learning task. We asked whether reinforcement learning would be modulated by intergroup attitudes (i.e., racial bias). The results showed that individual differences in racial bias critically modulated reinforcement learning. As predicted, racial bias was associated with more efficiently learned avoidance of threatening out-group individuals. We used computational modeling analysis to quantitatively delimit the underlying processes affected by social reinforcement. These analyses showed that racial bias modulates the rate at which exposure to threatening out-group individuals is transformed into future avoidance behavior. In concert, these results shed new light on the learning processes underlying social interaction with racial-in-group and out-group individuals.

  7. Adaptive Educational Software by Applying Reinforcement Learning

    ERIC Educational Resources Information Center

    Bennane, Abdellah

    2013-01-01

    The introduction of intelligence into teaching software is the object of this paper. In the software elaboration process, learning techniques are used to adapt the teaching software to the characteristics of the student. Generally, artificial intelligence techniques such as reinforcement learning and Bayesian networks are used in order to adapt…

  8. Using a board game to reinforce learning.

    PubMed

    Yoon, Bona; Rodriguez, Leslie; Faselis, Charles J; Liappis, Angelike P

    2014-03-01

    Experiential gaming strategies offer a variation on traditional learning. A board game was used to present synthesized content of fundamental catheter care concepts and reinforce evidence-based practices relevant to nursing. Board games are innovative educational tools that can enhance active learning.

  9. A REINFORCEMENT LEARNING MODEL OF PERSUASIVE COMMUNICATION.

    ERIC Educational Resources Information Center

    WEISS, ROBERT FRANK

    Theoretical and experimental analogies are drawn between learning theory and persuasive communication as an extension of liberalized stimulus-response theory. In the first experiment on instrumental conditioning of attitudes, the subjects read an opinion to be learned, followed by a supporting argument assumed to function as a reinforcer. The time…

  10. Ventral striatum and orbitofrontal cortex are both required for model-based, but not model-free, reinforcement learning.

    PubMed

    McDannald, Michael A; Lucantonio, Federica; Burke, Kathryn A; Niv, Yael; Schoenbaum, Geoffrey

    2011-02-16

    In many cases, learning is thought to be driven by differences between the value of rewards we expect and rewards we actually receive. Yet learning can also occur when the identity of the reward we receive is not as expected, even if its value remains unchanged. Learning from changes in reward identity implies access to an internal model of the environment, from which information about the identity of the expected reward can be derived. As a result, such learning is not easily accounted for by model-free reinforcement learning theories such as temporal difference reinforcement learning (TDRL), which predicate learning on changes in reward value, but not identity. Here, we used unblocking procedures to assess learning driven by value- versus identity-based prediction errors. Rats were trained to associate distinct visual cues with different food quantities and identities. These cues were subsequently presented in compound with novel auditory cues and the reward quantity or identity was selectively changed. Unblocking was assessed by presenting the auditory cues alone in a probe test. Consistent with neural implementations of TDRL models, we found that the ventral striatum was necessary for learning in response to changes in reward value. However, this area, along with orbitofrontal cortex, was also required for learning driven by changes in reward identity. This observation requires that existing models of TDRL in the ventral striatum be modified to include information about the specific features of expected outcomes derived from model-based representations, and that the role of orbitofrontal cortex in these models be clearly delineated.
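
    The distinction between value- and identity-based prediction errors can be made concrete: if the expected outcome is represented as a feature vector rather than a scalar, a change in identity produces a learning signal even when value is unchanged. The features and numbers below are illustrative.

```python
import numpy as np

# Feature order: [banana_pellets, grain_pellets] (illustrative outcome features).
expected = {"value": 2.0, "identity": np.array([2.0, 0.0])}  # expect 2 banana pellets
received = {"value": 2.0, "identity": np.array([0.0, 2.0])}  # get 2 grain pellets

value_pe = received["value"] - expected["value"]        # model-free TDRL signal
identity_pe = received["identity"] - expected["identity"]  # model-based signal

print(value_pe)     # 0.0 -> no learning signal from value alone
print(identity_pe)  # [-2.  2.] -> learning signal despite equal value
```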

  11. Neural mechanisms of reinforcement learning in unmedicated patients with major depressive disorder.

    PubMed

    Rothkirch, Marcus; Tonn, Jonas; Köhler, Stephan; Sterzer, Philipp

    2017-04-01

    According to current concepts, major depressive disorder is strongly related to dysfunctional neural processing of motivational information, entailing impairments in reinforcement learning. While computational modelling can reveal the precise nature of neural learning signals, it has not been used to study learning-related neural dysfunctions in unmedicated patients with major depressive disorder so far. We thus aimed at comparing the neural coding of reward and punishment prediction errors, representing indicators of neural learning-related processes, between unmedicated patients with major depressive disorder and healthy participants. To this end, a group of unmedicated patients with major depressive disorder (n = 28) and a group of age- and sex-matched healthy control participants (n = 30) completed an instrumental learning task involving monetary gains and losses during functional magnetic resonance imaging. The two groups did not differ in their learning performance. Patients and control participants showed the same level of prediction error-related activity in the ventral striatum and the anterior insula. In contrast, neural coding of reward prediction errors in the medial orbitofrontal cortex was reduced in patients. Moreover, neural reward prediction error signals in the medial orbitofrontal cortex and ventral striatum showed negative correlations with anhedonia severity. Using a standard instrumental learning paradigm we found no evidence for an overall impairment of reinforcement learning in medication-free patients with major depressive disorder. Importantly, however, the attenuated neural coding of reward in the medial orbitofrontal cortex and the relation between anhedonia and reduced reward prediction error-signalling in the medial orbitofrontal cortex and ventral striatum likely reflect an impairment in experiencing pleasure from rewarding events as a key mechanism of anhedonia in major depressive disorder.

  12. Go and no-go learning in reward and punishment: Interactions between affect and effect

    PubMed Central

    Guitart-Masip, Marc; Huys, Quentin J.M.; Fuentemilla, Lluis; Dayan, Peter; Duzel, Emrah; Dolan, Raymond J.

    2012-01-01

    Decision-making invokes two fundamental axes of control: affect or valence, spanning reward and punishment, and effect or action, spanning invigoration and inhibition. We studied the acquisition of instrumental responding in healthy human volunteers in a task in which we orthogonalized action requirements and outcome valence. Subjects were much more successful in learning active choices in rewarded conditions, and passive choices in punished conditions. Using computational reinforcement-learning models, we teased apart contributions from putatively instrumental and Pavlovian components in the generation of the observed asymmetry during learning. Moreover, using model-based fMRI, we showed that BOLD signals in striatum and substantia nigra/ventral tegmental area (SN/VTA) correlated with instrumentally learnt action values, but with opposite signs for go and no-go choices. Finally, we showed that successful instrumental learning depends on engagement of bilateral inferior frontal gyrus. Our behavioral and computational data showed that instrumental learning is contingent on overcoming inherent and plastic Pavlovian biases, while our neuronal data showed this learning is linked to unique patterns of brain activity in regions implicated in action and inhibition respectively. PMID:22548809
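
    In the spirit of the models the authors describe, a Pavlovian term can be folded into the instrumental action weights so that appetitive state values promote "go" and aversive values suppress it; the parameterization below is an illustrative sketch, not the fitted model.

```python
import math

def action_weights(q_go, q_nogo, state_value, go_bias=0.3, pav_weight=0.5):
    """Instrumental Q-values plus a static go bias and a Pavlovian term:
    appetitive states (V > 0) promote 'go', aversive states (V < 0) suppress it."""
    w_go = q_go + go_bias + pav_weight * state_value
    w_nogo = q_nogo
    return w_go, w_nogo

def p_go(w_go, w_nogo):
    """Softmax (logistic) choice rule over the two action weights."""
    return 1.0 / (1.0 + math.exp(-(w_go - w_nogo)))

# Same instrumental values, opposite state valence:
print(p_go(*action_weights(0.2, 0.2, state_value=+1.0)))  # reward context: go favored
print(p_go(*action_weights(0.2, 0.2, state_value=-1.0)))  # punishment context: no-go favored
```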

  13. Generalization of value in reinforcement learning by humans

    PubMed Central

    Wimmer, G. Elliott; Daw, Nathaniel D.; Shohamy, Daphna

    2012-01-01

    Research in decision making has focused on the role of dopamine and its striatal targets in guiding choices via learned stimulus-reward or stimulus-response associations, behavior that is well-described by reinforcement learning (RL) theories. However, basic RL is relatively limited in scope and does not explain how learning about stimulus regularities or relations may guide decision making. A candidate mechanism for this type of learning comes from the domain of memory, which has highlighted a role for the hippocampus in learning of stimulus-stimulus relations, typically dissociated from the role of the striatum in stimulus-response learning. Here, we used fMRI and computational model-based analyses to examine the joint contributions of these mechanisms to RL. Humans performed an RL task with added relational structure, modeled after tasks used to isolate hippocampal contributions to memory. On each trial participants chose one of four options, but the reward probabilities for pairs of options were correlated across trials. This (uninstructed) relationship between pairs of options potentially enabled an observer to learn about options’ values based on experience with the other options and to generalize across them. We observed BOLD activity related to learning in the striatum and also in the hippocampus. By comparing a basic RL model to one augmented to allow feedback to generalize between correlated options, we tested whether choice behavior and BOLD activity were influenced by the opportunity to generalize across correlated options. Although such generalization goes beyond standard computational accounts of RL and striatal BOLD, both choices and striatal BOLD were better explained by the augmented model. Consistent with the hypothesized role for the hippocampus in this generalization, functional connectivity between the ventral striatum and hippocampus was modulated, across participants, by the ability of the augmented model to capture participants’ choice
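
    The augmented model's key move can be sketched simply: when one option is rewarded, its correlated partner's value is nudged by the same feedback. The coupling parameter and reward probabilities below are illustrative assumptions, not the fitted model.

```python
import random
random.seed(0)

ALPHA, ALPHA_GEN = 0.2, 0.1        # direct and generalized learning rates
pair_of = {0: 1, 1: 0, 2: 3, 3: 2}  # options whose reward probabilities covary
Q = [0.5] * 4

def update(option, reward):
    Q[option] += ALPHA * (reward - Q[option])          # standard RL update
    partner = pair_of[option]
    Q[partner] += ALPHA_GEN * (reward - Q[partner])    # generalized feedback

for _ in range(300):
    update(0, 1.0 if random.random() < 0.8 else 0.0)   # only option 0 sampled
print(Q)  # option 1 has moved toward 0.8 without direct experience
```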

  14. Context Transfer in Reinforcement Learning Using Action-Value Functions

    PubMed Central

    Mousavi, Amin; Nadjar Araabi, Babak; Nili Ahmadabadi, Majid

    2014-01-01

    This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different states or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents' MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task. PMID:25610457

  15. Context transfer in reinforcement learning using action-value functions.

    PubMed

    Mousavi, Amin; Nadjar Araabi, Babak; Nili Ahmadabadi, Majid

    2014-01-01

    This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different states or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents' MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task.

  16. The attention habit: how reward learning shapes attentional selection.

    PubMed

    Anderson, Brian A

    2016-04-01

    There is growing consensus that reward plays an important role in the control of attention. Until recently, reward was thought to influence attention indirectly by modulating task-specific motivation and its effects on voluntary control over selection. Such an account was consistent with the goal-directed (endogenous) versus stimulus-driven (exogenous) framework that had long dominated the field of attention research. Now, a different perspective is emerging. Demonstrations that previously reward-associated stimuli can automatically capture attention even when physically inconspicuous and task-irrelevant challenge previously held assumptions about attentional control. The idea that attentional selection can be value driven, reflecting a distinct and previously unrecognized control mechanism, has gained traction. Since these early demonstrations, the influence of reward learning on attention has rapidly become an area of intense investigation, sparking many new insights. The result is an emerging picture of how the reward system of the brain automatically biases information processing. Here, I review the progress that has been made in this area, synthesizing a wealth of recent evidence to provide an integrated, up-to-date account of value-driven attention and some of its broader implications.

  17. Frontostriatal white matter integrity mediates adult age differences in probabilistic reward learning.

    PubMed

    Samanez-Larkin, Gregory R; Levens, Sara M; Perry, Lee M; Dougherty, Robert F; Knutson, Brian

    2012-04-11

    Frontostriatal circuits have been implicated in reward learning, and emerging findings suggest that frontal white matter structural integrity and probabilistic reward learning are reduced in older age. This cross-sectional study examined whether age differences in frontostriatal white matter integrity could account for age differences in reward learning in a community life span sample of human adults. By combining diffusion tensor imaging with a probabilistic reward learning task, we found that older age was associated with decreased reward learning and decreased white matter integrity in specific pathways running from the thalamus to the medial prefrontal cortex and from the medial prefrontal cortex to the ventral striatum. Further, white matter integrity in these thalamocorticostriatal paths could statistically account for age differences in learning. These findings suggest that the integrity of frontostriatal white matter pathways critically supports reward learning. The findings also raise the possibility that interventions that bolster frontostriatal integrity might improve reward learning and decision making.

  18. A proposed resolution to the paradox of drug reward: Dopamine's evolution from an aversive signal to a facilitator of drug reward via negative reinforcement.

    PubMed

    Ting-A-Kee, Ryan; Heinmiller, Andrew; van der Kooy, Derek

    2015-09-01

    The mystery surrounding how plant neurotoxins came to possess reinforcing properties is termed the paradox of drug reward. Here we propose a resolution to this paradox whereby dopamine - which has traditionally been viewed as a signal of reward - initially signaled aversion and encouraged escape. We suggest that after being consumed, plant neurotoxins such as nicotine activated an aversive dopaminergic pathway, thereby deterring predatory herbivores. Later evolutionary events - including the development of a GABAergic system capable of modulating dopaminergic activity - led to the ability to down-regulate and 'control' this dopamine-based aversion. We speculate that this negative reinforcement system evolved so that animals could suppress aversive states such as hunger in order to attend to other internal drives (such as mating and shelter) that would result in improved organismal fitness.

  19. Common Neural Mechanisms Underlying Reversal Learning by Reward and Punishment

    PubMed Central

    Xue, Gui; Xue, Feng; Droutman, Vita; Lu, Zhong-Lin; Bechara, Antoine; Read, Stephen

    2013-01-01

    Impairments in flexible goal-directed decisions, often examined by reversal learning, are associated with behavioral abnormalities characterized by impulsiveness and disinhibition. Although the lateral orbital frontal cortex (OFC) has been consistently implicated in reversal learning, it is still unclear whether this region is involved in negative feedback processing, behavioral control, or both, and whether reward and punishment might have different effects on lateral OFC involvement. Using a relatively large sample (N = 47), and a categorical learning task with either monetary reward or moderate electric shock as feedback, we found overlapping activations in the right lateral OFC (and adjacent insula) for reward and punishment reversal learning when comparing correct reversal trials with correct acquisition trials, whereas we found overlapping activations in the right dorsolateral prefrontal cortex (DLPFC) when negative feedback signaled contingency change. The right lateral OFC and DLPFC also showed greater sensitivity to punishment than did their left homologues, indicating an asymmetry in how punishment is processed. We propose that the right lateral OFC and anterior insula are important for transforming affective feedback to behavioral adjustment, whereas the right DLPFC is involved in higher level attention control. These results provide insight into the neural mechanisms of reversal learning and behavioral flexibility, which can be leveraged to understand risky behaviors among vulnerable populations. PMID:24349211

  20. Common neural mechanisms underlying reversal learning by reward and punishment.

    PubMed

    Xue, Gui; Xue, Feng; Droutman, Vita; Lu, Zhong-Lin; Bechara, Antoine; Read, Stephen

    2013-01-01

    Impairments in flexible goal-directed decisions, often examined by reversal learning, are associated with behavioral abnormalities characterized by impulsiveness and disinhibition. Although the lateral orbital frontal cortex (OFC) has been consistently implicated in reversal learning, it is still unclear whether this region is involved in negative feedback processing, behavioral control, or both, and whether reward and punishment might have different effects on lateral OFC involvement. Using a relatively large sample (N = 47), and a categorical learning task with either monetary reward or moderate electric shock as feedback, we found overlapping activations in the right lateral OFC (and adjacent insula) for reward and punishment reversal learning when comparing correct reversal trials with correct acquisition trials, whereas we found overlapping activations in the right dorsolateral prefrontal cortex (DLPFC) when negative feedback signaled contingency change. The right lateral OFC and DLPFC also showed greater sensitivity to punishment than did their left homologues, indicating an asymmetry in how punishment is processed. We propose that the right lateral OFC and anterior insula are important for transforming affective feedback to behavioral adjustment, whereas the right DLPFC is involved in higher level attention control. These results provide insight into the neural mechanisms of reversal learning and behavioral flexibility, which can be leveraged to understand risky behaviors among vulnerable populations.

  1. Reinforcement learning in depression: A review of computational research.

    PubMed

    Chen, Chong; Takahashi, Taiki; Nakagawa, Shin; Inoue, Takeshi; Kusumi, Ichiro

    2015-08-01

    Despite being considered primarily a mood disorder, major depressive disorder (MDD) is characterized by cognitive and decision making deficits. Recent research has employed computational models of reinforcement learning (RL) to address these deficits. The computational approach has the advantage of making explicit predictions about learning and behavior, specifying the process parameters of RL, differentiating between model-free and model-based RL, and enabling computational model-based analyses of functional magnetic resonance imaging and electroencephalography data. With these merits, computational psychiatry has emerged as a field, and here we review specific studies that focused on MDD. Considerable evidence suggests that MDD is associated with impaired brain signals of reward prediction error and expected value ('wanting'), decreased reward sensitivity ('liking') and/or learning (be it model-free or model-based), etc., although the causality remains unclear. These parameters may serve as valuable intermediate phenotypes of MDD, linking general clinical symptoms to underlying molecular dysfunctions. We believe future computational research at clinical, systems, and cellular/molecular/genetic levels will propel us toward a better understanding of the disease.

  2. Evolution with reinforcement learning in negotiation.

    PubMed

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and supports more robust predictions in the long term than in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of an evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions while the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stability than their counterparts using a classic evolutionary algorithm.
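
    The core update the abstract alludes to can be sketched as follows; the blending weight lam and the fitness-proportional evolutionary step are illustrative assumptions, not the paper's exact operators (payoffs are assumed positive for the sampling step):

        import random

        def update_value(history_value, payoff, lam=0.7):
            # lam trades off accumulated (historical) information
            # against the payoff of the current interaction
            return lam * history_value + (1.0 - lam) * payoff

        def evolve(strategies, values):
            # strategies with higher learned value are copied more often
            return random.choices(strategies, weights=values, k=len(strategies))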

  3. Evolution with Reinforcement Learning in Negotiation

    PubMed Central

    Zou, Yi; Zhan, Wenjie; Shao, Yuan

    2014-01-01

    Adaptive behavior depends less on the details of the negotiation process and supports more robust predictions in the long term than in the short term. However, the extant literature on population dynamics for behavior adjustment has only examined the current situation. To offset this limitation, we propose a synergy of an evolutionary algorithm and reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts reinforcement learning with a tradeoff between historical and current information to make decisions while the strategies of agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in payoff, fairness, and stability than their counterparts using a classic evolutionary algorithm. PMID:25048108

  4. Reinforcement Learning with Bounded Information Loss

    NASA Astrophysics Data System (ADS)

    Peters, Jan; Mülling, Katharina; Seldin, Yevgeny; Altun, Yasemin

    2011-03-01

    Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information. Hence, it has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant or natural policy gradients, many of these problems may be addressed by constraining the information loss. In this paper, we continue this path of reasoning and suggest two reinforcement learning methods, i.e., a model-based and a model-free algorithm, that bound the loss in relative entropy while maximizing their return. The resulting methods differ significantly from previous policy gradient approaches and yield an exact update step. They work well on typical reinforcement learning benchmark problems as well as on novel evaluations in robotics. We also give a Bayesian bound motivation for this new approach [8].
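
    An episodic sketch of the bounded-information-loss idea: reweight sampled returns with a temperature eta chosen so that the Kullback-Leibler divergence of the reweighted sample distribution from the uniform one stays within a bound epsilon, then fit the new policy by weighted maximum likelihood. The bisection search below is a simplification; the paper obtains eta from a dual optimization:

        import numpy as np

        def bounded_loss_weights(returns, epsilon=0.1):
            R = np.asarray(returns, dtype=float)
            R = R - R.max()                # numerical stability
            lo, hi = 1e-6, 1e6
            for _ in range(100):           # bisect on the temperature eta
                eta = np.sqrt(lo * hi)
                w = np.exp(R / eta)
                p = w / w.sum()
                kl = float(np.sum(p * np.log(p * len(p) + 1e-12)))
                if kl > epsilon:
                    lo = eta               # update too greedy: raise eta
                else:
                    hi = eta               # update too conservative: lower eta
            return w / w.sum()

        # the new policy is then fit by weighted maximum likelihood on the samples
        weights = bounded_loss_weights([1.0, 2.0, 0.5, 3.0])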

  5. Short-term memory traces for action bias in human reinforcement learning.

    PubMed

    Bogacz, Rafal; McClure, Samuel M; Li, Jian; Cohen, Jonathan D; Montague, P Read

    2007-06-11

    Recent experimental and theoretical work on reinforcement learning has shed light on the neural bases of learning from rewards and punishments. One fundamental problem in reinforcement learning is the credit assignment problem, or how to properly assign credit to actions that lead to reward or punishment following a delay. Temporal difference learning solves this problem, but its efficiency can be significantly improved by the addition of eligibility traces (ET). In essence, ETs function as decaying memories of previous choices that are used to scale synaptic weight changes. It has been shown in theoretical studies that ETs spanning a number of actions may improve the performance of reinforcement learning. However, it remains an open question whether including ETs that persist over sequences of actions allows reinforcement learning models to better fit empirical data regarding the behaviors of humans and other animals. Here, we report an experiment in which human subjects performed a sequential economic decision game in which the long-term optimal strategy differed from the strategy that leads to the greatest short-term return. We demonstrate that human subjects' performance in the task is significantly affected by the time between choices in a surprising and seemingly counterintuitive way. However, this behavior is naturally explained by a temporal difference learning model which includes ETs persisting across actions. Furthermore, we review recent findings that suggest that short-term synaptic plasticity in dopamine neurons may provide a realistic biophysical mechanism for producing ETs that persist on a timescale consistent with behavioral observations.
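
    The mechanism in question is the standard eligibility-trace update; a minimal tabular SARSA(lambda) pass over one recorded episode looks like this (state/action coding is illustrative):

        import numpy as np

        def sarsa_lambda_update(Q, trajectory, alpha=0.1, gamma=0.95, lam=0.9):
            """Eligibility traces e decay by gamma*lam per step, so a delayed
            reward propagates credit back across the recent choice sequence."""
            e = np.zeros_like(Q)
            for (s, a, r, s2, a2) in trajectory:
                delta = r + gamma * Q[s2, a2] - Q[s, a]  # TD error
                e[s, a] += 1.0                           # accumulating trace
                Q += alpha * delta * e                   # traced pairs share credit
                e *= gamma * lam                         # memories of past choices fade
            return Q

        # toy usage: 3 states, 2 actions, one short recorded episode
        Q = sarsa_lambda_update(np.zeros((3, 2)),
                                [(0, 1, 0.0, 1, 0), (1, 0, 1.0, 2, 0)])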

  6. Heightened reward learning under stress in generalized anxiety disorder: a predictor of depression resistance?

    PubMed

    Morris, Bethany H; Rottenberg, Jonathan

    2015-02-01

    Stress-induced anhedonia is associated with depression vulnerability (Bogdan & Pizzagalli, 2006). We investigated stress-induced deficits in reward learning in a depression-vulnerable group with analogue generalized anxiety disorder (GAD, n = 34), and never-depressed healthy controls (n = 41). Utilizing a computerized signal detection task, reward learning was assessed under stressor and neutral conditions. Controls displayed intact reward learning in the neutral condition, and the expected stress-induced blunting. The GAD group as a whole also showed intact reward learning in the neutral condition. When GAD subjects were analyzed as a function of prior depression history, never-depressed GAD subjects showed heightened reward learning in the stressor condition. Better reward learning under stress among GAD subjects predicted lower depression symptoms 1 month later. Robust reward learning under stress may indicate depression resistance among anxious individuals.

  7. Effort-Reward Imbalance for Learning Is Associated with Fatigue in School Children

    ERIC Educational Resources Information Center

    Fukuda, Sanae; Yamano, Emi; Joudoi, Takako; Mizuno, Kei; Tanaka, Masaaki; Kawatani, Junko; Takano, Miyuki; Tomoda, Akemi; Imai-Matsumura, Kyoko; Miike, Teruhisa; Watanabe, Yasuyoshi

    2010-01-01

    We examined relationships among fatigue, sleep quality, and effort-reward imbalance for learning in school children. We developed an effort-reward for learning scale in school students and examined its reliability and validity. Self-administered surveys, including the effort-reward for learning scale and a fatigue scale, were completed by 1,023…

  8. Environmental Service Learning: Relevant, Rewarding, and Responsible

    ERIC Educational Resources Information Center

    Leege, Lissa; Cawthorn, Michelle

    2008-01-01

    At Georgia Southern University (GSU), a regional university of 17,000 students, environmental science is a required introductory course for all students. Consequently, environmental-biology class sizes are large, often approaching 1,000 students each semester in multiple sections of up to 250 students. To improve students' learning and sense of…

  9. Learning to Obtain Reward, but Not Avoid Punishment, Is Affected by Presence of PTSD Symptoms in Male Veterans: Empirical Data and Computational Model

    PubMed Central

    Myers, Catherine E.; Moustafa, Ahmed A.; Sheynin, Jony; VanMeenen, Kirsten M.; Gilbertson, Mark W.; Orr, Scott P.; Beck, Kevin D.; Pang, Kevin C. H.; Servatius, Richard J.

    2013-01-01

    Post-traumatic stress disorder (PTSD) symptoms include behavioral avoidance which is acquired and tends to increase with time. This avoidance may represent a general learning bias; indeed, individuals with PTSD are often faster than controls on acquiring conditioned responses based on physiologically-aversive feedback. However, it is not clear whether this learning bias extends to cognitive feedback, or to learning from both reward and punishment. Here, male veterans with self-reported current, severe PTSD symptoms (PTSS group) or with few or no PTSD symptoms (control group) completed a probabilistic classification task that included both reward-based and punishment-based trials, where feedback could take the form of reward, punishment, or an ambiguous “no-feedback” outcome that could signal either successful avoidance of punishment or failure to obtain reward. The PTSS group outperformed the control group in total points obtained; the PTSS group specifically performed better than the control group on reward-based trials, with no difference on punishment-based trials. To better understand possible mechanisms underlying observed performance, we used a reinforcement learning model of the task, and applied maximum likelihood estimation techniques to derive estimated parameters describing individual participants’ behavior. Estimations of the reinforcement value of the no-feedback outcome were significantly greater in the control group than the PTSS group, suggesting that the control group was more likely to value this outcome as positively reinforcing (i.e., signaling successful avoidance of punishment). This is consistent with the control group’s generally poorer performance on reward trials, where reward feedback was to be obtained in preference to the no-feedback outcome. Differences in the interpretation of ambiguous feedback may contribute to the facilitated reinforcement learning often observed in PTSD patients, and may in turn provide new insight into

  10. Learning to obtain reward, but not avoid punishment, is affected by presence of PTSD symptoms in male veterans: empirical data and computational model.

    PubMed

    Myers, Catherine E; Moustafa, Ahmed A; Sheynin, Jony; Vanmeenen, Kirsten M; Gilbertson, Mark W; Orr, Scott P; Beck, Kevin D; Pang, Kevin C H; Servatius, Richard J

    2013-01-01

    Post-traumatic stress disorder (PTSD) symptoms include behavioral avoidance which is acquired and tends to increase with time. This avoidance may represent a general learning bias; indeed, individuals with PTSD are often faster than controls on acquiring conditioned responses based on physiologically-aversive feedback. However, it is not clear whether this learning bias extends to cognitive feedback, or to learning from both reward and punishment. Here, male veterans with self-reported current, severe PTSD symptoms (PTSS group) or with few or no PTSD symptoms (control group) completed a probabilistic classification task that included both reward-based and punishment-based trials, where feedback could take the form of reward, punishment, or an ambiguous "no-feedback" outcome that could signal either successful avoidance of punishment or failure to obtain reward. The PTSS group outperformed the control group in total points obtained; the PTSS group specifically performed better than the control group on reward-based trials, with no difference on punishment-based trials. To better understand possible mechanisms underlying observed performance, we used a reinforcement learning model of the task, and applied maximum likelihood estimation techniques to derive estimated parameters describing individual participants' behavior. Estimations of the reinforcement value of the no-feedback outcome were significantly greater in the control group than the PTSS group, suggesting that the control group was more likely to value this outcome as positively reinforcing (i.e., signaling successful avoidance of punishment). This is consistent with the control group's generally poorer performance on reward trials, where reward feedback was to be obtained in preference to the no-feedback outcome. Differences in the interpretation of ambiguous feedback may contribute to the facilitated reinforcement learning often observed in PTSD patients, and may in turn provide new insight into how
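
    The model-fitting procedure described above can be sketched as maximum likelihood estimation over a simple RL model in which the ambiguous no-feedback outcome carries a free reinforcement value; the two-option task coding below is a hypothetical simplification of the actual task:

        import numpy as np
        from scipy.optimize import minimize

        def neg_log_lik(params, choices, outcomes):
            """outcomes: +1 reward, -1 punishment, 0 the ambiguous no-feedback
            outcome, whose subjective value v_nf is estimated per participant."""
            alpha, beta, v_nf = params
            q = np.zeros(2)
            ll = 0.0
            for c, o in zip(choices, outcomes):
                p = np.exp(beta * q) / np.exp(beta * q).sum()  # softmax choice rule
                ll += np.log(p[c] + 1e-12)
                r = v_nf if o == 0 else float(o)               # value of the outcome
                q[c] += alpha * (r - q[c])
            return -ll

        # toy usage with a four-trial choice/outcome record
        res = minimize(neg_log_lik, x0=[0.3, 2.0, 0.0],
                       args=([0, 1, 0, 0], [1, -1, 0, 1]),
                       bounds=[(0.01, 1.0), (0.1, 10.0), (-1.0, 1.0)])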

  11. Optimal control in microgrid using multi-agent reinforcement learning.

    PubMed

    Li, Fu-Dong; Wu, Min; He, Yong; Chen, Xin

    2012-11-01

    This paper presents an improved reinforcement learning method to minimize electricity costs while satisfying the power balance and generation limits of units in a grid-connected microgrid. First, the microgrid control requirements are analyzed and the objective function of optimal control for the microgrid is proposed. Then, a state variable, "Average Electricity Price Trend", which expresses the most probable transitions of the system, is developed to reduce the complexity and randomness of the microgrid, and a multi-agent architecture comprising agents, state variables, action variables, and a reward function is formulated. Furthermore, dynamic hierarchical reinforcement learning, based on the change rate of a key state variable, is established to carry out optimal policy exploration. The analysis shows that the proposed method helps handle the "curse of dimensionality" and speeds up learning in unknown large-scale environments. Finally, simulation results under JADE (Java Agent Development Framework) demonstrate the validity of the presented method for optimal control of a grid-connected microgrid.
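
    A toy illustration of the setup, with the reward defined as the negative electricity cost and the state including a discretized price-trend variable; the encoding is a hypothetical stand-in for the paper's architecture:

        from collections import defaultdict

        Q = defaultdict(float)  # (state, action) -> value

        def q_update(state, action, cost, next_state, actions, alpha=0.1, gamma=0.9):
            reward = -cost  # minimizing electricity cost = maximizing reward
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

        # toy usage: state = (battery level bin, average electricity price trend)
        actions = ["charge", "discharge", "idle"]
        q_update(("low", "rising"), "charge", cost=3.2,
                 next_state=("mid", "rising"), actions=actions)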

  12. Deficits in reinforcement learning but no link to apathy in patients with schizophrenia

    PubMed Central

    Hartmann-Riemer, Matthias N.; Aschenbrenner, Steffen; Bossert, Magdalena; Westermann, Celina; Seifritz, Erich; Tobler, Philippe N.; Weisbrod, Matthias; Kaiser, Stefan

    2017-01-01

    Negative symptoms in schizophrenia have been linked to selective reinforcement learning deficits in the context of gains combined with intact loss-avoidance learning. Fundamental mechanisms of reinforcement learning and choice are prediction error signaling and the precise representation of reward value for future decisions. It is unclear which of these mechanisms contribute to the impairments in learning from positive outcomes observed in schizophrenia. A recent study suggested that patients with severe apathy symptoms show deficits in the representation of expected value. Considering the fundamental relevance for the understanding of these symptoms, we aimed to assess the stability of these findings across studies. Sixty-four patients with schizophrenia and 19 healthy control participants performed a probabilistic reward learning task. They had to associate stimuli with gain or loss-avoidance. In a transfer phase participants indicated valuation of the previously learned stimuli by choosing among them. Patients demonstrated an overall impairment in learning compared to healthy controls. No effects of apathy symptoms on task indices were observed. However, patients with schizophrenia learned better in the context of loss-avoidance than in the context of gain. Earlier findings were thus partially replicated. Further studies are needed to clarify the mechanistic link between negative symptoms and reinforcement learning. PMID:28071747

  13. Deficits in reinforcement learning but no link to apathy in patients with schizophrenia.

    PubMed

    Hartmann-Riemer, Matthias N; Aschenbrenner, Steffen; Bossert, Magdalena; Westermann, Celina; Seifritz, Erich; Tobler, Philippe N; Weisbrod, Matthias; Kaiser, Stefan

    2017-01-10

    Negative symptoms in schizophrenia have been linked to selective reinforcement learning deficits in the context of gains combined with intact loss-avoidance learning. Fundamental mechanisms of reinforcement learning and choice are prediction error signaling and the precise representation of reward value for future decisions. It is unclear which of these mechanisms contribute to the impairments in learning from positive outcomes observed in schizophrenia. A recent study suggested that patients with severe apathy symptoms show deficits in the representation of expected value. Considering the fundamental relevance for the understanding of these symptoms, we aimed to assess the stability of these findings across studies. Sixty-four patients with schizophrenia and 19 healthy control participants performed a probabilistic reward learning task. They had to associate stimuli with gain or loss-avoidance. In a transfer phase participants indicated valuation of the previously learned stimuli by choosing among them. Patients demonstrated an overall impairment in learning compared to healthy controls. No effects of apathy symptoms on task indices were observed. However, patients with schizophrenia learned better in the context of loss-avoidance than in the context of gain. Earlier findings were thus partially replicated. Further studies are needed to clarify the mechanistic link between negative symptoms and reinforcement learning.

  14. Pollen Elicits Proboscis Extension but Does Not Reinforce PER Learning in Honeybees

    PubMed Central

    Nicholls, Elizabeth; Hempel de Ibarra, Natalie

    2013-01-01

    The function of pollen as a reward for foraging bees is little understood, though there is evidence to suggest that it can reinforce associations with visual and olfactory floral cues. Foraging bees do not feed on pollen, thus one could argue that it cannot serve as an appetitive reinforcer in the same way as sucrose. However, ingestion is not a critical parameter for sucrose reinforcement, since olfactory proboscis extension (PER) learning can be conditioned through antennal stimulation only. During pollen collection, the antennae and mouthparts come into contact with pollen, thus it is possible that pollen reinforces associative learning through similar gustatory pathways as sucrose. Here pollen was presented as the unconditioned stimulus (US), either in its natural state or in a 30% pollen-water solution, and was found to elicit proboscis extension following antennal stimulation. Control groups were exposed to either sucrose or a clean sponge as the US, or an unpaired presentation of the conditioned stimulus (CS) and pollen US. Despite steady levels of responding to the US, bees did not learn to associate a neutral odour with the delivery of a pollen reward, thus whilst pollen has a proboscis extension releasing function, it does not reinforce olfactory PER learning. PMID:26462523

  15. Antipsychotic dose modulates behavioral and neural responses to feedback during reinforcement learning in schizophrenia.

    PubMed

    Insel, Catherine; Reinen, Jenna; Weber, Jochen; Wager, Tor D; Jarskog, L Fredrik; Shohamy, Daphna; Smith, Edward E

    2014-03-01

    Schizophrenia is characterized by an abnormal dopamine system, and dopamine blockade is the primary mechanism of antipsychotic treatment. Consistent with the known role of dopamine in reward processing, prior research has demonstrated that patients with schizophrenia exhibit impairments in reward-based learning. However, it remains unknown how treatment with antipsychotic medication impacts the behavioral and neural signatures of reinforcement learning in schizophrenia. The goal of this study was to examine whether antipsychotic medication modulates behavioral and neural responses to prediction error coding during reinforcement learning. Patients with schizophrenia completed a reinforcement learning task while undergoing functional magnetic resonance imaging. The task consisted of two separate conditions in which participants accumulated monetary gain or avoided monetary loss. Behavioral results indicated that antipsychotic medication dose was associated with altered behavioral approaches to learning, such that patients taking higher doses of medication showed increased sensitivity to negative reinforcement. Higher doses of antipsychotic medication were also associated with higher learning rates (LRs), suggesting that medication enhanced sensitivity to trial-by-trial feedback. Neuroimaging data demonstrated that antipsychotic dose was related to differences in neural signatures of feedback prediction error during the loss condition. Specifically, patients taking higher doses of medication showed attenuated prediction error responses in the striatum and the medial prefrontal cortex. These findings indicate that antipsychotic medication treatment may influence motivational processes in patients with schizophrenia.

  16. Refining Linear Fuzzy Rules by Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.; Malkani, Anil

    1996-01-01

    Linear fuzzy rules are increasingly being used in the development of fuzzy logic systems. Radial basis functions have also been used in the antecedents of the rules for clustering in product space which can automatically generate a set of linear fuzzy rules from an input/output data set. Manual methods are usually used in refining these rules. This paper presents a method for refining the parameters of these rules using reinforcement learning which can be applied in domains where supervised input-output data is not available and reinforcements are received only after a long sequence of actions. This is shown for a generalization of radial basis functions. The formation of fuzzy rules from data and their automatic refinement is an important step in closing the gap between the application of reinforcement learning methods in the domains where only some limited input-output data is available.
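
    One way to picture the refinement step, under stated assumptions (RBF antecedents, linear consequents, and a generic reward-perturbation rule rather than the paper's exact method):

        import numpy as np

        def fuzzy_output(x, centers, widths, theta):
            """Rules fire via RBF memberships; outputs are linear consequents
            theta[k] = [w_k, b_k], combined by normalized firing strength."""
            mu = np.exp(-((x - centers) ** 2) / (2 * widths ** 2))
            mu = mu / mu.sum()
            return float(np.sum(mu * (theta[:, 0] * x + theta[:, 1])))

        def refine_step(theta, episode_return, baseline, noise, lr=0.05):
            # consequent parameters perturbed by `noise` this episode move in
            # that direction when the delayed return beats the baseline
            return theta + lr * (episode_return - baseline) * noise

        # toy usage: three rules over a scalar input
        theta = np.zeros((3, 2))
        y = fuzzy_output(0.5, np.array([-1.0, 0.0, 1.0]), np.ones(3), theta)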

  17. The basolateral amygdala in reward learning and addiction

    PubMed Central

    Wassum, Kate M.; Izquierdo, Alicia

    2015-01-01

    Sophisticated behavioral paradigms, partnered with the emergence of increasingly selective techniques to target the basolateral amygdala (BLA), have resulted in an enhanced understanding of the role of this nucleus in learning and using reward information. Due to the wide variety of behavioral approaches, many questions remain about the circumscribed role of the BLA in appetitive behavior. In this review, we first integrate conclusions of BLA function in reward-related behavior using traditional interference techniques (lesion, pharmacological inactivation) with those using newer methodological approaches in experimental animals that allow in vivo manipulation of cell type-specific populations and neural recordings. Second, from a review of appetitive behavioral tasks in rodents and monkeys and recent computational models of reward procurement, we derive evidence for the BLA as a neural integrator of reward value, history, and cost parameters. Taken together, the BLA codes specific and temporally dynamic outcome representations in a distributed network to orchestrate adaptive responses. We provide evidence that experiences with opiates and psychostimulants alter these outcome representations in the BLA, resulting in long-term modified action. PMID:26341938

  18. Autonomous reinforcement learning with experience replay.

    PubMed

    Wawrzyński, Paweł; Tanwani, Ajay Kumar

    2013-05-01

    This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy with the use of previously collected samples and autonomously estimates the appropriate step-sizes for the learning updates. The algorithm is based on the actor-critic with experience replay, whose step-sizes are determined on-line by an enhanced fixed-point algorithm for on-line neural network training. An experimental study with a simulated octopus arm and a half-cheetah demonstrates the feasibility of the proposed algorithm to solve difficult learning control problems in an autonomous way within a reasonably short time.
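
    A minimal tabular sketch of the actor-critic-with-replay structure; unlike the paper, step-sizes here are fixed rather than estimated on-line, and no importance correction is applied to the replayed samples:

        import random
        import numpy as np

        class ReplayActorCritic:
            def __init__(self, n_states, n_actions, gamma=0.95):
                self.v = np.zeros(n_states)                  # critic: state values
                self.pref = np.zeros((n_states, n_actions))  # actor: preferences
                self.buffer = []                             # stored experience
                self.gamma = gamma

            def store(self, s, a, r, s2):
                self.buffer.append((s, a, r, s2))

            def replay(self, batch=32, alpha_v=0.1, alpha_p=0.05):
                k = min(batch, len(self.buffer))
                for s, a, r, s2 in random.choices(self.buffer, k=k):
                    delta = r + self.gamma * self.v[s2] - self.v[s]  # TD error
                    self.v[s] += alpha_v * delta                     # critic update
                    self.pref[s, a] += alpha_p * delta               # actor update

        # toy usage
        ac = ReplayActorCritic(n_states=3, n_actions=2)
        ac.store(0, 1, 1.0, 2)
        ac.replay()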

  1. Geographical Inquiry and Learning Reinforcement Theory.

    ERIC Educational Resources Information Center

    Davies, Christopher S.

    1983-01-01

    Although instructors have been reluctant to utilize the Keller Plan (a personalized system of instruction), it lends itself to teaching introductory geography. College students found that the routine and frequent reinforcement led to progressive learning. However, it does not lend itself to the study of reflexive or polemical concepts. (IS)

  2. Classroom Reinforcement and Learning: A Quantitative Synthesis.

    ERIC Educational Resources Information Center

    Lysakowski, Richard S.; Walberg, Herbert J.

    To estimate the influence of positive reinforcement on classroom learning, the authors analyzed statistical data from 39 studies spanning the years 1958-1978 and containing a combined sample of 4,842 students in 202 classes. Twenty-nine characteristics of each study's sample, methodology, and reliability were coded to measure their effects on…

  3. Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioural ecology.

    PubMed

    Schultz, Wolfram

    2004-04-01

    Neurons in a small number of brain structures detect rewards and reward-predicting stimuli and are active during the expectation of predictable food and liquid rewards. These neurons code the reward information according to basic terms of various behavioural theories that seek to explain reward-directed learning, approach behaviour and decision-making. The involved brain structures include groups of dopamine neurons, the striatum including the nucleus accumbens, the orbitofrontal cortex and the amygdala. The reward information is fed to brain structures involved in decision-making and organisation of behaviour, such as the dorsolateral prefrontal cortex and possibly the parietal cortex. The neural coding of basic reward terms derived from formal theories puts the neurophysiological investigation of reward mechanisms on firm conceptual grounds and provides neural correlates for the function of rewards in learning, approach behaviour and decision-making.

  4. Reinforcement Learning in Information Searching

    ERIC Educational Resources Information Center

    Cen, Yonghua; Gan, Liren; Bai, Chen

    2013-01-01

    Introduction: The study seeks to answer two questions: How do university students learn to use correct strategies to conduct scholarly information searches without instructions? and, What are the differences in learning mechanisms between users at different cognitive levels? Method: Two groups of users, thirteen first year undergraduate students…

  5. The Reinforcing and Rewarding Effects of Methylone, a Synthetic Cathinone Commonly Found in "Bath Salts"

    PubMed

    Watterson, Lucas R; Hood, Lauren; Sewalia, Kaveish; Tomek, Seven E; Yahn, Stephanie; Johnson, Craig Trevor; Wegner, Scott; Blough, Bruce E; Marusich, Julie A; Olive, M Foster

    2012-12-01

    Methylone is a member of the designer drug class known as synthetic cathinones, which have become increasingly popular drugs of abuse in recent years. Commonly referred to as "bath salts", these amphetamine-like compounds are sold as "legal" alternatives to illicit drugs such as cocaine, methamphetamine, and 3,4-methylenedioxymethamphetamine (MDMA, ecstasy). Following their dramatic rise in popularity along with numerous reports of toxicity and death, several of these drugs were classified as Schedule I drugs in the United States in 2012. Despite these bans, these drugs and other new structurally similar analogues continue to be abused. Currently, however, it is unknown whether these compounds possess the potential for compulsive use and addiction. The present study sought to determine the relative abuse liability of methylone by employing intravenous self-administration (IVSA) and intracranial self-stimulation (ICSS) paradigms in rats. We demonstrate that methylone (0.05, 0.1, 0.2, and 0.5 mg/kg/infusion) dose-dependently functions as a reinforcer, and that there is a significant positive relationship between methylone dose and reinforcer efficacy. Furthermore, responding during short access sessions (ShA, 2 hr/day) appeared more robust than in previous IVSA studies with MDMA. However, unlike previous findings with abused stimulants such as cocaine or methamphetamine, long access sessions (LgA, 6 hr/day) did not lead to escalated drug intake or increased reinforcer efficacy. Finally, methylone produced a dose-dependent, but statistically non-significant, trend towards reductions in ICSS thresholds. Together these results reveal that methylone may possess an addiction potential similar to or greater than that of MDMA, yet patterns of self-administration and effects on brain reward function suggest that this drug may have a lower potential for abuse and compulsive use than prototypical psychostimulants.

  6. Potent rewarding and reinforcing effects of the synthetic cathinone 3,4-methylenedioxypyrovalerone (MDPV)

    PubMed Central

    Watterson, Lucas R.; Kufahl, Peter R.; Nemirovsky, Natali E.; Sewalia, Kaveish; Grabenauer, Megan; Thomas, Brian F.; Marusich, Julie A.; Wegner, Scott; Olive, M. Foster

    2012-01-01

    Reports of abuse and toxic effects of synthetic cathinones, frequently sold as “bath salts” or “legal highs”, have increased dramatically in recent years. One of the most widely used synthetic cathinones is 3,4-methylenedioxypyrovalerone (MDPV). The current study evaluated the abuse potential of MDPV by assessing its ability to support intravenous self-administration and lower thresholds for intracranial self-stimulation (ICSS) in rats. In the first experiment, rats were trained to intravenously self-administer MDPV in daily 2 hr sessions for 10 days at doses of 0.05, 0.1, or 0.2 mg/kg/infusion. Rats were then allowed to self-administer MDPV under a progressive ratio (PR) schedule of reinforcement. Next, rats self-administered MDPV for an additional 10 days under short (2 hr/day, ShA) or long (6 hr/day, LgA) access conditions to assess escalation of intake. A separate group of rats underwent the same procedures with the exception of self-administering methamphetamine (0.05 mg/kg/infusion) instead of MDPV. In a second experiment, the effects of MDPV on ICSS thresholds following acute administration (0.1, 0.5, 1 and 2 mg/kg i.p.) were assessed. MDPV maintained self-administration across all doses tested. A positive relationship between MDPV dose and breakpoints for reinforcement under PR conditions was observed. LgA conditions led to escalation of drug intake at the 0.1 and 0.2 mg/kg doses, and rats self-administering methamphetamine showed similar patterns of escalation. Finally, MDPV significantly lowered ICSS thresholds at all doses tested. Together, these findings indicate that MDPV has reinforcing properties and activates brain reward circuitry, suggesting a potential for abuse and addiction in humans. PMID:22784198

  7. Potent rewarding and reinforcing effects of the synthetic cathinone 3,4-methylenedioxypyrovalerone (MDPV).

    PubMed

    Watterson, Lucas R; Kufahl, Peter R; Nemirovsky, Natali E; Sewalia, Kaveish; Grabenauer, Megan; Thomas, Brian F; Marusich, Julie A; Wegner, Scott; Olive, M Foster

    2014-03-01

    Reports of abuse and toxic effects of synthetic cathinones, frequently sold as 'bath salts' or 'legal highs', have increased dramatically in recent years. One of the most widely used synthetic cathinones is 3,4-methylenedioxypyrovalerone (MDPV). The current study evaluated the abuse potential of MDPV by assessing its ability to support intravenous self-administration and to lower thresholds for intracranial self-stimulation (ICSS) in rats. In the first experiment, the rats were trained to intravenously self-administer MDPV in daily 2-hour sessions for 10 days at doses of 0.05, 0.1 or 0.2 mg/kg per infusion. The rats were then allowed to self-administer MDPV under a progressive ratio (PR) schedule of reinforcement. Next, the rats self-administered MDPV for an additional 10 days under short access (ShA; 2 hours/day) or long access (LgA; 6 hours/day) conditions to assess escalation of intake. A separate group of rats underwent the same procedures, with the exception of self-administering methamphetamine (0.05 mg/kg per infusion) instead of MDPV. In the second experiment, the effects of MDPV on ICSS thresholds following acute administration (0.1, 0.5, 1 and 2 mg/kg, i.p.) were assessed. MDPV maintained self-administration across all doses tested. A positive relationship between MDPV dose and breakpoints for reinforcement under PR conditions was observed. LgA conditions led to escalation of drug intake at 0.1 and 0.2 mg/kg doses, and rats self-administering methamphetamine showed similar patterns of escalation. Finally, MDPV significantly lowered ICSS thresholds at all doses tested. Together, these findings indicate that MDPV has reinforcing properties and activates brain reward circuitry, suggesting a potential for abuse and addiction in humans. © 2012 The Authors, Addiction Biology © 2012 Society for the Study of Addiction.

  8. Dissociating error-based and reinforcement-based loss functions during sensorimotor learning

    PubMed Central

    McGregor, Heather R.; Mohatarem, Ayman

    2017-01-01

    It has been proposed that the sensorimotor system uses a loss (cost) function to evaluate potential movements in the presence of random noise. Here we test this idea in the context of both error-based and reinforcement-based learning. In a reaching task, we laterally shifted a cursor relative to true hand position using a skewed probability distribution. This skewed probability distribution had its mean and mode separated, allowing us to dissociate the optimal predictions of an error-based loss function (corresponding to the mean of the lateral shifts) and a reinforcement-based loss function (corresponding to the mode). We then examined how the sensorimotor system uses error feedback and reinforcement feedback, in isolation and combination, when deciding where to aim the hand during a reach. We found that participants compensated differently to the same skewed lateral shift distribution depending on the form of feedback they received. When provided with error feedback, participants compensated based on the mean of the skewed noise. When provided with reinforcement feedback, participants compensated based on the mode. Participants receiving both error and reinforcement feedback continued to compensate based on the mean while repeatedly missing the target, despite receiving auditory, visual and monetary reinforcement feedback that rewarded hitting the target. Our work shows that reinforcement-based and error-based learning are separable and can occur independently. Further, when error and reinforcement feedback are in conflict, the sensorimotor system heavily weights error feedback over reinforcement feedback. PMID:28753634
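
    The mean/mode dissociation at the heart of the design is easy to reproduce numerically (the distribution parameters below are illustrative, not the experiment's):

        import numpy as np

        rng = np.random.default_rng(1)
        shifts = rng.gamma(shape=2.0, scale=1.0, size=100_000) - 1.0  # skewed shifts

        mean_shift = shifts.mean()            # optimum under an error-based
                                              # (squared-error) loss function
        hist, edges = np.histogram(shifts, bins=200)
        mode_shift = edges[hist.argmax()]     # optimum under a hit/miss
                                              # (reinforcement) loss function

        # aiming opposite the mean minimizes expected squared error, while
        # aiming opposite the mode maximizes the probability of a target hit
        print(mean_shift, mode_shift)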

  9. Dissociating error-based and reinforcement-based loss functions during sensorimotor learning.

    PubMed

    Cashaback, Joshua G A; McGregor, Heather R; Mohatarem, Ayman; Gribble, Paul L

    2017-07-01

    It has been proposed that the sensorimotor system uses a loss (cost) function to evaluate potential movements in the presence of random noise. Here we test this idea in the context of both error-based and reinforcement-based learning. In a reaching task, we laterally shifted a cursor relative to true hand position using a skewed probability distribution. This skewed probability distribution had its mean and mode separated, allowing us to dissociate the optimal predictions of an error-based loss function (corresponding to the mean of the lateral shifts) and a reinforcement-based loss function (corresponding to the mode). We then examined how the sensorimotor system uses error feedback and reinforcement feedback, in isolation and combination, when deciding where to aim the hand during a reach. We found that participants compensated differently to the same skewed lateral shift distribution depending on the form of feedback they received. When provided with error feedback, participants compensated based on the mean of the skewed noise. When provided with reinforcement feedback, participants compensated based on the mode. Participants receiving both error and reinforcement feedback continued to compensate based on the mean while repeatedly missing the target, despite receiving auditory, visual and monetary reinforcement feedback that rewarded hitting the target. Our work shows that reinforcement-based and error-based learning are separable and can occur independently. Further, when error and reinforcement feedback are in conflict, the sensorimotor system heavily weights error feedback over reinforcement feedback.

  10. Reinforcement learning on slow features of high-dimensional input streams.

    PubMed

    Legenstein, Robert; Wilbert, Niko; Wiskott, Laurenz

    2010-08-19

    Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
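
    The preprocessing stage can be illustrated with plain linear slow feature analysis (the paper uses a hierarchical SFA network; this single linear stage is a simplification):

        import numpy as np

        def linear_sfa(X, n_slow=2):
            """X: (T, d) time series. Whiten the inputs, then keep the
            directions in which the time derivative varies least, i.e. the
            slowest features."""
            X = X - X.mean(axis=0)
            evals, evecs = np.linalg.eigh(np.cov(X.T))
            Z = X @ (evecs / np.sqrt(evals + 1e-8))   # whitened signals
            devals, devecs = np.linalg.eigh(np.cov(np.diff(Z, axis=0).T))
            return Z @ devecs[:, :n_slow]             # slowest components first

        # the slow features then serve as the state input to a simple
        # reward-trained learner, as in the two-stage system described above
        X = np.cumsum(np.random.default_rng(2).normal(size=(500, 10)), axis=0)
        S = linear_sfa(X)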

  11. Reinforcement Learning on Slow Features of High-Dimensional Input Streams

    PubMed Central

    Legenstein, Robert; Wilbert, Niko; Wiskott, Laurenz

    2010-01-01

    Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning. PMID:20808883

  12. Reinforcement learning for robot control

    NASA Astrophysics Data System (ADS)

    Smart, William D.; Pack Kaelbling, Leslie

    2002-02-01

    Writing control code for mobile robots can be a very time-consuming process. Even for apparently simple tasks, it is often difficult to specify in detail how the robot should accomplish them. Robot control code is typically full of magic numbers that must be painstakingly set for each environment that the robot must operate in. The idea of having a robot learn how to accomplish a task, rather than being told explicitly, is an appealing one. It seems easier and much more intuitive for the programmer to specify what the robot should be doing, and let it learn the fine details of how to do it. In this paper, we describe JAQL, a framework for efficient learning on mobile robots, and present the results of using it to learn control policies for simple tasks.

  13. Prediction error in reinforcement learning: a meta-analysis of neuroimaging studies.

    PubMed

    Garrison, Jane; Erdeniz, Burak; Done, John

    2013-08-01

    Activation likelihood estimation (ALE) meta-analyses were used to examine the neural correlates of prediction error in reinforcement learning. The findings are interpreted in the light of current computational models of learning and action selection. In this context, particular consideration is given to the comparison of activation patterns from studies using instrumental and Pavlovian conditioning, and where reinforcement involved rewarding or punishing feedback. The striatum was the key brain area encoding for prediction error, with activity encompassing dorsal and ventral regions for instrumental and Pavlovian reinforcement alike, a finding which challenges the functional separation of the striatum into a dorsal 'actor' and a ventral 'critic'. Prediction error activity was further observed in diverse areas of predominantly anterior cerebral cortex including medial prefrontal cortex and anterior cingulate cortex. Distinct patterns of prediction error activity were found for studies using rewarding and aversive reinforcers; reward prediction errors were observed primarily in the striatum while aversive prediction errors were found more widely including insula and habenula.
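
    For reference, the quantity meta-analyzed here is the standard temporal-difference prediction error, δ_t = r_t + γ·V(s_{t+1}) − V(s_t): the reward received plus the discounted value of the next state, minus the value predicted for the current state. Reward prediction errors arise when outcomes are better than predicted; aversive prediction errors when they are worse.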

  14. Ventral tegmental area neurons in learned appetitive behavior and positive reinforcement.

    PubMed

    Fields, Howard L; Hjelmstad, Gregory O; Margolis, Elyssa B; Nicola, Saleem M

    2007-01-01

    Ventral tegmental area (VTA) neuron firing precedes behaviors elicited by reward-predictive sensory cues and scales with the magnitude and unpredictability of received rewards. These patterns are consistent with roles in the performance of learned appetitive behaviors and in positive reinforcement, respectively. The VTA includes subpopulations of neurons with different afferent connections, neurotransmitter content, and projection targets. Because the VTA and substantia nigra pars compacta are the sole sources of striatal and limbic forebrain dopamine, measurements of dopamine release and manipulations of dopamine function have provided critical evidence supporting a VTA contribution to these functions. However, the VTA also sends GABAergic and glutamatergic projections to the nucleus accumbens and prefrontal cortex. Furthermore, VTA-mediated but dopamine-independent positive reinforcement has been demonstrated. Consequently, identifying the neurotransmitter content and projection target of VTA neurons recorded in vivo will be critical for determining their contribution to learned appetitive behaviors.

  15. Can Traditions Emerge from the Interaction of Stimulus Enhancement and Reinforcement Learning? An Experimental Model

    PubMed Central

    MATTHEWS, LUKE J; PAUKNER, ANNIKA; SUOMI, STEPHEN J

    2010-01-01

    The study of social learning in captivity and behavioral traditions in the wild are two burgeoning areas of research, but few empirical studies have tested how learning mechanisms produce emergent patterns of tradition. Studies have examined how social learning mechanisms that are cognitively complex and possessed by few species, such as imitation, result in traditional patterns, yet traditional patterns are also exhibited by species that may not possess such mechanisms. We propose an explicit model of how stimulus enhancement and reinforcement learning could interact to produce traditions. We tested the model experimentally with tufted capuchin monkeys (Cebus apella), which exhibit traditions in the wild but have rarely demonstrated imitative abilities in captive experiments. Monkeys showed both stimulus enhancement learning and a habitual bias to perform whichever behavior first obtained them a reward. These results support our model that simple social learning mechanisms combined with reinforcement can result in traditional patterns of behavior. PMID:21135912
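
    An explicit toy version of the proposed interaction, with illustrative parameter values (not fitted to the capuchin data): stimulus enhancement adds a transient bonus to the behavior just demonstrated by another individual, and reinforcement makes the first rewarded behavior habitual:

        import math
        import random

        def choose(values, enhanced=None, beta=3.0, bonus=1.0):
            """Softmax choice over behaviors; the behavior just demonstrated
            by another individual gets a temporary enhancement bonus."""
            scores = [v + (bonus if i == enhanced else 0.0)
                      for i, v in enumerate(values)]
            exps = [math.exp(beta * s) for s in scores]
            return random.choices(range(len(values)), weights=exps, k=1)[0]

        def reinforce(values, chosen, rewarded, alpha=0.5):
            # the first behavior that pays off gains value and tends to repeat
            values[chosen] += alpha * ((1.0 if rewarded else 0.0) - values[chosen])
            return values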

  16. Can Traditions Emerge from the Interaction of Stimulus Enhancement and Reinforcement Learning? An Experimental Model.

    PubMed

    Matthews, Luke J; Paukner, Annika; Suomi, Stephen J

    2010-06-01

    The study of social learning in captivity and behavioral traditions in the wild are two burgeoning areas of research, but few empirical studies have tested how learning mechanisms produce emergent patterns of tradition. Studies have examined how social learning mechanisms that are cognitively complex and possessed by few species, such as imitation, result in traditional patterns, yet traditional patterns are also exhibited by species that may not possess such mechanisms. We propose an explicit model of how stimulus enhancement and reinforcement learning could interact to produce traditions. We tested the model experimentally with tufted capuchin monkeys (Cebus apella), which exhibit traditions in the wild but have rarely demonstrated imitative abilities in captive experiments. Monkeys showed both stimulus enhancement learning and a habitual bias to perform whichever behavior first obtained them a reward. These results support our model that simple social learning mechanisms combined with reinforcement can result in traditional patterns of behavior.
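
    In the spirit of the authors' proposal, a toy simulation can make the claimed interaction concrete. The sketch below is a loose illustration under stated assumptions (a two-action apparatus, a fixed salience bonus for the action demonstrators perform, and a fixed bonus for whichever action was first rewarded); it is not the authors' model.

    ```python
    import random

    def choose(salience, habit, options=("lift", "slide")):
        # Stimulus enhancement: weight actions by how salient others made them;
        # habit: extra weight on whichever action was first rewarded (assumed +2).
        weights = []
        for o in options:
            w = 1.0 + salience.get(o, 0.0)
            if habit == o:
                w += 2.0
            weights.append(w)
        return random.choices(options, weights=weights)[0]

    random.seed(0)
    salience = {"lift": 3.0}      # demonstrators mostly manipulate the lid
    habit = None
    for trial in range(20):
        action = choose(salience, habit)
        rewarded = (action == "lift")        # assume lifting opens the apparatus
        if rewarded and habit is None:
            habit = action                   # first rewarded behaviour sticks
    print("traditional behaviour:", habit)   # 'lift' propagates as the tradition
    ```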

  17. Extraversion differentiates between model-based and model-free strategies in a reinforcement learning task.

    PubMed

    Skatova, Anya; Chan, Patricia A; Daw, Nathaniel D

    2013-01-01

    Prominent computational models describe a neural mechanism for learning from reward prediction errors, and it has been suggested that variations in this mechanism are reflected in personality factors such as trait extraversion. However, although trait extraversion has been linked to improved reward learning, it is not yet known whether this relationship is selective for the particular computational strategy associated with error-driven learning, known as model-free reinforcement learning, vs. another strategy, model-based learning, which the brain is also known to employ. In the present study we test this relationship by examining whether humans' scores on an extraversion scale predict individual differences in the balance between model-based and model-free learning strategies in a sequentially structured decision task designed to distinguish between them. In previous studies with this task, participants have shown a combination of both types of learning, but with substantial individual variation in the balance between them. In the current study, extraversion predicted worse behavior across both sorts of learning. However, the hypothesis that extraverts would be selectively better at model-free reinforcement learning held up among a subset of the more engaged participants, and overall, higher task engagement was associated with a more selective pattern by which extraversion predicted better model-free learning. The findings indicate a relationship between a broad personality orientation and detailed computational learning mechanisms. Results like those in the present study suggest an intriguing and rich relationship between core neuro-computational mechanisms and broader life orientations and outcomes.
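
    For readers unfamiliar with the distinction, the sketch below shows the standard hybrid valuation that studies using this two-step task typically fit: a weight w mixes model-based and model-free action values. All numbers here are invented for illustration.

    ```python
    # Hybrid valuation: w = 1 is purely model-based, w = 0 purely model-free.
    w = 0.6                                  # balance parameter, fit per participant
    q_mf = {"left": 0.40, "right": 0.55}     # cached, experience-based values
    q_mb = {"left": 0.70, "right": 0.30}     # values computed from a task model

    q_net = {a: w * q_mb[a] + (1 - w) * q_mf[a] for a in q_mf}
    print(q_net)  # 'left' wins under this w; a model-free agent (w = 0) would pick 'right'
    ```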

  18. Reinforcement learning based artificial immune classifier.

    PubMed

    Karakose, Mehmet

    2013-01-01

    Artificial immune systems are among the widely used methods for classification, a decision-making process. Artificial immune systems, which are based on the natural immune system, can be successfully applied to classification, optimization, recognition, and learning in real-world problems. In this study, a reinforcement learning based artificial immune classifier is proposed as a new approach. This approach uses reinforcement learning to find better antibodies with immune operators. The proposed approach offers several advantages over other methods in the literature, such as effectiveness, fewer memory cells, high accuracy, speed, and data adaptability. The performance of the proposed approach is demonstrated by simulation and experimental results using real data in Matlab and FPGA. Some benchmark data and remote image data are used for the experimental results. Comparative results with supervised/unsupervised artificial immune systems, a negative selection classifier, and a resource-limited artificial immune classifier are given to demonstrate the effectiveness of the proposed method.

  19. Contingency learning in alcohol dependence and pathological gambling: learning and unlearning reward contingencies.

    PubMed

    Vanes, Lucy D; van Holst, Ruth J; Jansen, Jochem M; van den Brink, Wim; Oosterlaan, Jaap; Goudriaan, Anna E

    2014-06-01

    Patients with alcohol dependence (AD) and pathological gambling (PG) are characterized by dysfunctional reward processing and their ability to adapt to alterations of reward contingencies is impaired. However, most neurocognitive tasks investigating reward processing involve a complex mix of elements, such as working memory, immediate and delayed rewards, and risk-taking. As a consequence, it is not clear whether contingency learning is altered in AD or PG. Therefore, the current study aimed to examine performance in a deterministic contingency learning task, investigating discrimination, reversal, and extinction learning. Thirty-three alcohol-dependent patients (ADs), 28 pathological gamblers (PGs), and 18 healthy controls (HCs) performed a contingency learning task in which they learned stimulus-reward associations that were first reversed and later extinguished while receiving deterministic feedback throughout. Accumulated points, number of perseverative errors and trials required to reach a criterion in each learning phase were compared between groups using nonparametric Kruskal-Wallis rank-sum tests. Regression analyses were performed to compare learning curves. PGs and ADs did not differ from HCs in discrimination learning, reversal learning, or extinction learning on the nonparametric tests. Regression analyses, however, showed differences in the initial speed of learning: PGs were significantly faster in discrimination learning compared to ADs, and both PGs and ADs learned more slowly than HCs in the reversal learning and extinction phases of the task. Learning rates for reversal and extinction were slower for the AD and PG groups compared to HCs, suggesting that reversing and extinguishing learned contingencies require more effort in ADs and PGs. This implies diminished flexibility in overcoming previously learned contingencies.

  20. Appetitive olfactory learning and memory in the honeybee depend on sugar reward identity.

    PubMed

    Simcock, Nicola K; Gray, Helen; Bouchebti, Sofia; Wright, Geraldine A

    2017-08-24

    One of the most important tasks of the brain is to learn and remember information associated with food. Studies in mice and Drosophila have shown that sugar rewards must be metabolisable to form lasting memories, but few other animals have been studied. Here, we trained adult worker honeybees (Apis mellifera) in two olfactory tasks (massed and spaced conditioning) known to affect memory formation to test how the schedule of reinforcement and the nature of a sugar reward affected learning and memory. The antennae and mouthparts of honeybees were most sensitive to sucrose, but glucose and fructose were equally phagostimulatory. Whether or not bees could learn the tasks depended on sugar identity and concentration. However, only bees rewarded with glucose or sucrose formed robust long-term memory. This was true for bees trained in both the massed and spaced conditioning tasks. Honeybees fed with glucose or fructose exhibited a surge in haemolymph sugar of greater than 120 mM within 30 s that remained elevated for as long as 20 min after a single feeding event. For bees fed with sucrose, this change in haemolymph glucose and fructose occurred with a 30 s delay. Our data showed that olfactory learning in honeybees was affected by sugar identity and concentration, but that olfactory memory was most strongly affected by sugar identity. Taken together, these data suggest that the neural mechanisms involved in memory formation sense rapid changes in haemolymph glucose that occur during and after conditioning.

  1. Online reinforcement learning for dynamic multimedia systems.

    PubMed

    Mastronarde, Nicholas; van der Schaar, Mihaela

    2010-02-01

    In our previous work, we proposed a systematic cross-layer framework for dynamic multimedia systems, which allows each layer to make autonomous and foresighted decisions that maximize the system's long-term performance, while meeting the application's real-time delay constraints. The proposed solution solved the cross-layer optimization offline, under the assumption that the multimedia system's probabilistic dynamics were known a priori, by modeling the system as a layered Markov decision process. In practice, however, these dynamics are unknown a priori and, therefore, must be learned online. In this paper, we address this problem by allowing the multimedia system layers to learn, through repeated interactions with each other, to autonomously optimize the system's long-term performance at run-time. The two key challenges in this layered learning setting are: (i) each layer's learning performance is directly impacted by not only its own dynamics, but also by the learning processes of the other layers with which it interacts; and (ii) selecting a learning model that appropriately balances time-complexity (i.e., learning speed) with the multimedia system's limited memory and the multimedia application's real-time delay constraints. We propose two reinforcement learning algorithms for optimizing the system under different design constraints: the first algorithm solves the cross-layer optimization in a centralized manner and the second solves it in a decentralized manner. We analyze both algorithms in terms of their required computation, memory, and interlayer communication overheads. After noting that the proposed reinforcement learning algorithms learn too slowly, we introduce a complementary accelerated learning algorithm that exploits partial knowledge about the system's dynamics in order to dramatically improve the system's performance. In our experiments, we demonstrate that decentralized learning can perform equally as well as centralized learning, while

  2. Functional polymorphism of the mu-opioid receptor gene (OPRM1) influences reinforcement learning in humans.

    PubMed

    Lee, Mary R; Gallen, Courtney L; Zhang, Xiaochu; Hodgkinson, Colin A; Goldman, David; Stein, Elliot A; Barr, Christina S

    2011-01-01

    Previous reports on the functional effects (i.e., gain or loss of function), and phenotypic outcomes (e.g., changes in addiction vulnerability and stress response) of a commonly occurring functional single nucleotide polymorphism (SNP) of the mu-opioid receptor (OPRM1 A118G) have been inconsistent. Here we examine the effect of this polymorphism on implicit reward learning. We used a probabilistic signal detection task to determine whether this polymorphism impacts response bias to monetary reward in 63 healthy adult subjects: 51 AA homozygotes and 12 G allele carriers. OPRM1 AA homozygotes exhibited typical responding to the rewarded response--that is, their bias to the rewarded stimulus increased over time. However, OPRM1 G allele carriers exhibited a decline in response to the rewarded stimulus compared to the AA homozygotes. These results extend previous reports on the heritability of performance on this task by implicating a specific polymorphism. Through comparison with other studies using this task, we suggest a possible mechanism by which the OPRM1 polymorphism may confer reduced response to natural reward through a dopamine-mediated decrease during positive reinforcement learning.

  3. Fuzzy Q-Learning for Generalization of Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.

    1996-01-01

    Fuzzy Q-Learning, introduced earlier by the author, is an extension of Q-Learning into fuzzy environments. GARIC is a methodology for fuzzy reinforcement learning. In this paper, we introduce GARIC-Q, a new method for doing incremental Dynamic Programming using a society of intelligent agents which are controlled at the top level by Fuzzy Q-Learning and at the local level, each agent learns and operates based on GARIC. GARIC-Q improves the speed and applicability of Fuzzy Q-Learning through generalization of input space by using fuzzy rules and bridges the gap between Q-Learning and rule based intelligent systems.

  5. fMRI and EEG predictors of dynamic decision parameters during human reinforcement learning.

    PubMed

    Frank, Michael J; Gagne, Chris; Nyhus, Erika; Masters, Sean; Wiecki, Thomas V; Cavanagh, James F; Badre, David

    2015-01-14

    What are the neural dynamics of choice processes during reinforcement learning? Two largely separate literatures have examined dynamics of reinforcement learning (RL) as a function of experience but assuming a static choice process, or conversely, the dynamics of choice processes in decision making but based on static decision values. Here we show that human choice processes during RL are well described by a drift diffusion model (DDM) of decision making in which the learned trial-by-trial reward values are sequentially sampled, with a choice made when the value signal crosses a decision threshold. Moreover, simultaneous fMRI and EEG recordings revealed that this decision threshold is not fixed across trials but varies as a function of activity in the subthalamic nucleus (STN) and is further modulated by trial-by-trial measures of decision conflict and activity in the dorsomedial frontal cortex (pre-SMA BOLD and mediofrontal theta in EEG). These findings provide converging multimodal evidence for a model in which decision threshold in reward-based tasks is adjusted as a function of communication from pre-SMA to STN when choices differ subtly in reward values, allowing more time to choose the statistically more rewarding option.

  6. fMRI and EEG Predictors of Dynamic Decision Parameters during Human Reinforcement Learning

    PubMed Central

    Gagne, Chris; Nyhus, Erika; Masters, Sean; Wiecki, Thomas V.; Cavanagh, James F.; Badre, David

    2015-01-01

    What are the neural dynamics of choice processes during reinforcement learning? Two largely separate literatures have examined dynamics of reinforcement learning (RL) as a function of experience but assuming a static choice process, or conversely, the dynamics of choice processes in decision making but based on static decision values. Here we show that human choice processes during RL are well described by a drift diffusion model (DDM) of decision making in which the learned trial-by-trial reward values are sequentially sampled, with a choice made when the value signal crosses a decision threshold. Moreover, simultaneous fMRI and EEG recordings revealed that this decision threshold is not fixed across trials but varies as a function of activity in the subthalamic nucleus (STN) and is further modulated by trial-by-trial measures of decision conflict and activity in the dorsomedial frontal cortex (pre-SMA BOLD and mediofrontal theta in EEG). These findings provide converging multimodal evidence for a model in which decision threshold in reward-based tasks is adjusted as a function of communication from pre-SMA to STN when choices differ subtly in reward values, allowing more time to choose the statistically more rewarding option. PMID:25589744
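
    A minimal sequential-sampling sketch (all parameters invented) illustrates the role the decision threshold plays in this account: raising it trades speed for accuracy, which is the adjustment the study attributes to pre-SMA/STN communication on high-conflict trials.

    ```python
    import random

    def ddm_trial(value_diff, threshold, noise=1.0, dt=0.01):
        # Noisy evidence accumulates at a rate set by the value difference
        # until it crosses +threshold (option A) or -threshold (option B).
        x, t = 0.0, 0.0
        while abs(x) < threshold:
            x += value_diff * dt + random.gauss(0.0, noise) * dt ** 0.5
            t += dt
        return ("A" if x > 0 else "B"), t

    random.seed(1)
    for thr in (0.5, 2.0):                   # low vs raised decision threshold
        choice, rt = ddm_trial(value_diff=0.2, threshold=thr)
        print(f"threshold={thr}: chose {choice} in {rt:.2f}s")
    ```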

  7. Neural correlates of reinforcement learning and social preferences in competitive bidding.

    PubMed

    van den Bos, Wouter; Talwar, Arjun; McClure, Samuel M

    2013-01-30

    In competitive social environments, people often deviate from what rational choice theory prescribes, resulting in losses or suboptimal monetary gains. We investigate how competition affects learning and decision-making in a common value auction task. During the experiment, groups of five human participants were simultaneously scanned using MRI while playing the auction task. We first demonstrate that bidding is well characterized by reinforcement learning with biased reward representations dependent on social preferences. Indicative of reinforcement learning, we found that estimated trial-by-trial prediction errors correlated with activity in the striatum and ventromedial prefrontal cortex. Additionally, we found that individual differences in social preferences were related to activity in the temporal-parietal junction and anterior insula. Connectivity analyses suggest that monetary and social value signals are integrated in the ventromedial prefrontal cortex and striatum. Based on these results, we argue for a novel mechanistic account for the integration of reinforcement history and social preferences in competitive decision-making.

  8. Role of CB2 Cannabinoid Receptors in the Rewarding, Reinforcing, and Physical Effects of Nicotine

    PubMed Central

    Navarrete, Francisco; Rodríguez-Arias, Marta; Martín-García, Elena; Navarro, Daniela; García-Gutiérrez, María S; Aguilar, María A; Aracil-Fernández, Auxiliadora; Berbel, Pere; Miñarro, José; Maldonado, Rafael; Manzanares, Jorge

    2013-01-01

    This study aimed to evaluate the involvement of CB2 cannabinoid receptors (CB2r) in the rewarding, reinforcing and motivational effects of nicotine. Conditioned place preference (CPP) and intravenous self-administration experiments were carried out in knockout mice lacking CB2r (CB2KO) and wild-type (WT) littermates treated with the CB2r antagonist AM630 (1 and 3 mg/kg). Gene expression analyses of tyrosine hydroxylase (TH) and α3- and α4-nicotinic acetylcholine receptor subunits (nAChRs) in the ventral tegmental area (VTA) and immunohistochemical studies to elucidate whether CB2r colocalized with α3- and α4-nAChRs in the nucleus accumbens and VTA were performed. Mecamylamine-precipitated withdrawal syndrome after chronic nicotine exposure was evaluated in CB2KO mice and WT mice treated with AM630 (1 and 3 mg/kg). CB2KO mice did not show nicotine-induced place conditioning and self-administered significantly less nicotine. In addition, AM630 was able to block (3 mg/kg) nicotine-induced CPP and reduce (1 and 3 mg/kg) nicotine self-administration. Under baseline conditions, TH, α3-nAChR, and α4-nAChR mRNA levels in the VTA of CB2KO mice were significantly lower compared with WT mice. Confocal microscopy images revealed that CB2r colocalized with α3- and α4-nAChRs. Somatic signs of nicotine withdrawal (rearings, groomings, scratches, teeth chattering, and body tremors) increased significantly in WT but were absent in CB2KO mice. Interestingly, the administration of AM630 blocked the nicotine withdrawal syndrome and failed to alter basal behavior in saline-treated WT mice. These results suggest that CB2r play a relevant role in the rewarding, reinforcing, and motivational effects of nicotine. Pharmacological manipulation of this receptor deserves further consideration as a potential new valuable target for the treatment of nicotine dependence. PMID:23817165

  9. Multi-Agent Reinforcement Learning and Adaptive Neural Networks.

    DTIC Science & Technology

    2007-11-02

    The objective was to study the utility of reinforcement learning as an approach to complex decentralized control problems. The major accomplishment was a detailed study of multi-agent reinforcement learning applied to a large-scale decentralized stochastic control problem. This study included a very successful demonstration that a multi-agent reinforcement learning system using neural networks could learn high-performance

  10. Reinforcement Learning Models and Their Neural Correlates: An Activation Likelihood Estimation Meta-Analysis

    PubMed Central

    Kumar, Poornima; Eickhoff, Simon B.; Dombrovski, Alexandre Y.

    2015-01-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments – prediction error – is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies suggest that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that employed algorithmic reinforcement learning models, across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, while instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually-estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies. PMID:25665667

  11. Reinforcement learning models and their neural correlates: An activation likelihood estimation meta-analysis.

    PubMed

    Chase, Henry W; Kumar, Poornima; Eickhoff, Simon B; Dombrovski, Alexandre Y

    2015-06-01

    Reinforcement learning describes motivated behavior in terms of two abstract signals. The representation of discrepancies between expected and actual rewards/punishments-prediction error-is thought to update the expected value of actions and predictive stimuli. Electrophysiological and lesion studies have suggested that mesostriatal prediction error signals control behavior through synaptic modification of cortico-striato-thalamic networks. Signals in the ventromedial prefrontal and orbitofrontal cortex are implicated in representing expected value. To obtain unbiased maps of these representations in the human brain, we performed a meta-analysis of functional magnetic resonance imaging studies that had employed algorithmic reinforcement learning models across a variety of experimental paradigms. We found that the ventral striatum (medial and lateral) and midbrain/thalamus represented reward prediction errors, consistent with animal studies. Prediction error signals were also seen in the frontal operculum/insula, particularly for social rewards. In Pavlovian studies, striatal prediction error signals extended into the amygdala, whereas instrumental tasks engaged the caudate. Prediction error maps were sensitive to the model-fitting procedure (fixed or individually estimated) and to the extent of spatial smoothing. A correlate of expected value was found in a posterior region of the ventromedial prefrontal cortex, caudal and medial to the orbitofrontal regions identified in animal studies. These findings highlight a reproducible motif of reinforcement learning in the cortico-striatal loops and identify methodological dimensions that may influence the reproducibility of activation patterns across studies.

  12. Learning and generalization from reward and punishment in opioid addiction.

    PubMed

    Myers, Catherine E; Rego, Janice; Haber, Paul; Morley, Kirsten; Beck, Kevin D; Hogarth, Lee; Moustafa, Ahmed A

    2017-01-15

    This study adapts a widely-used acquired equivalence paradigm to investigate how opioid-addicted individuals learn from positive and negative feedback, and how they generalize this learning. The opioid-addicted group consisted of 33 participants with a history of heroin dependency currently in a methadone maintenance program; the control group consisted of 32 healthy participants without a history of drug addiction. All participants performed a novel variant of the acquired equivalence task, where they learned to map some stimuli to correct outcomes in order to obtain reward, and to map other stimuli to correct outcomes in order to avoid punishment; some stimuli were implicitly "equivalent" in the sense of being paired with the same outcome. On the initial training phase, both groups performed similarly on learning to obtain reward, but as memory load grew, the control group outperformed the addicted group on learning to avoid punishment. On a subsequent testing phase, the addicted and control groups performed similarly on retention trials involving previously-trained stimulus-outcome pairs, as well as on generalization trials to assess acquired equivalence. Since prior work with acquired equivalence tasks has associated stimulus-outcome learning with the nigrostriatal dopamine system, and generalization with the hippocampal region, the current results are consistent with basal ganglia dysfunction in the opioid-addicted patients. Further, a selective deficit in learning from punishment could contribute to processes by which addicted individuals continue to pursue drug use even at the cost of negative consequences such as loss of income and the opportunity to engage in other life activities.

  13. The effects of aging on the interaction between reinforcement learning and attention.

    PubMed

    Radulescu, Angela; Daniel, Reka; Niv, Yael

    2016-11-01

    Reinforcement learning (RL) in complex environments relies on selective attention to uncover those aspects of the environment that are most predictive of reward. Whereas previous work has focused on age-related changes in RL, it is not known whether older adults learn differently from younger adults when selective attention is required. In 2 experiments, we examined how aging affects the interaction between RL and selective attention. Younger and older adults performed a learning task in which only 1 stimulus dimension was relevant to predicting reward, and within it, 1 "target" feature was the most rewarding. Participants had to discover this target feature through trial and error. In Experiment 1, stimuli varied on 1 or 3 dimensions and participants received hints that revealed the target feature, the relevant dimension, or gave no information. Group-related differences in accuracy and RTs differed systematically as a function of the number of dimensions and the type of hint available. In Experiment 2 we used trial-by-trial computational modeling of the learning process to test for age-related differences in learning strategies. Behavior of both young and older adults was explained well by a reinforcement-learning model that uses selective attention to constrain learning. However, the model suggested that older adults restricted their learning to fewer features, employing more focused attention than younger adults. Furthermore, this difference in strategy predicted age-related deficits in accuracy. We discuss these results, suggesting that a narrower attentional filter may reflect an adaptation to the reduced capabilities of the reinforcement learning system in older adults.
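
    The kind of attention-constrained learning model described here can be sketched in a few lines; the feature names, attention weights, and learning rate below are assumptions for illustration, not the authors' fitted parameters.

    ```python
    alpha = 0.3
    attention = {"colour": 0.7, "shape": 0.2, "texture": 0.1}  # a narrow filter
    values = {f: 0.0 for f in attention}

    def learn(reward):
        v = sum(attention[f] * values[f] for f in attention)   # weighted value
        pe = reward - v                                        # prediction error
        for f in attention:
            values[f] += alpha * attention[f] * pe             # attention gates learning

    for _ in range(50):
        learn(reward=1.0)
    print({f: round(x, 2) for f, x in values.items()})  # attended features learn most
    ```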

  14. Reinforcement learning, spike-time-dependent plasticity, and the BCM rule.

    PubMed

    Baras, Dorit; Meir, Ron

    2007-08-01

    Learning agents, whether natural or artificial, must update their internal parameters in order to improve their behavior over time. In reinforcement learning, this plasticity is influenced by an environmental signal, termed a reward, that directs the changes in appropriate directions. We apply a recently introduced policy learning algorithm from machine learning to networks of spiking neurons and derive a spike-time-dependent plasticity rule that ensures convergence to a local optimum of the expected average reward. The approach is applicable to a broad class of neuronal models, including the Hodgkin-Huxley model. We demonstrate the effectiveness of the derived rule in several toy problems. Finally, through statistical analysis, we show that the synaptic plasticity rule established is closely related to the widely used BCM rule, for which good biological evidence exists.
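
    The flavour of the derived rule can be conveyed by the simplest policy-gradient learner: a single Bernoulli-logistic unit whose parameter moves along (reward - baseline) times the score function. This toy omits spiking dynamics entirely, and every constant is an assumption.

    ```python
    import math
    import random

    theta, eta, baseline = 0.0, 0.2, 0.0
    random.seed(0)
    for trial in range(500):
        p = 1.0 / (1.0 + math.exp(-theta))       # probability of emitting action 1
        a = 1 if random.random() < p else 0
        r = 1.0 if a == 1 else 0.0               # toy task: action 1 is rewarded
        theta += eta * (r - baseline) * (a - p)  # REINFORCE step; (a - p) is the score
        baseline += 0.01 * (r - baseline)        # running-average reward baseline
    print(round(1.0 / (1.0 + math.exp(-theta)), 2))   # p(action 1) ends near 1
    ```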

  15. Reward-dependent learning in neuronal networks for planning and decision making.

    PubMed

    Dehaene, S; Changeux, J P

    2000-01-01

    Neuronal network models have been proposed for the organization of evaluation and decision processes in prefrontal circuitry and their putative neuronal and molecular bases. The models all include an implementation and simulation of an elementary reward mechanism. Their central hypothesis is that tentative rules of behavior, which are coded by clusters of active neurons in prefrontal cortex, are selected or rejected based on an evaluation by this reward signal, which may be conveyed, for instance, by the mesencephalic dopaminergic neurons with which the prefrontal cortex is densely interconnected. At the molecular level, the reward signal is postulated to be a neurotransmitter such as dopamine, which exerts a global modulatory action on prefrontal synaptic efficacies, either via volume transmission or via targeted synaptic triads. Negative reinforcement has the effect of destabilizing the currently active rule-coding clusters; subsequently, spontaneous activity varies again from one cluster to another, giving the organism the chance to discover and learn a new rule. Thus, reward signals function as effective selection signals that either maintain or suppress currently active prefrontal representations as a function of their current adequacy. Simulations of this variation-selection have successfully accounted for the main features of several major tasks that depend on prefrontal cortex integrity, such as the delayed-response test, the Wisconsin card sorting test, the Tower of London test and the Stroop test. For the more complex tasks, we have found it necessary to supplement the external reward input with a second mechanism that supplies an internal reward; it consists of an auto-evaluation loop which short-circuits the reward input from the exterior. This allows for an internal evaluation of covert motor intentions without actualizing them as behaviors, by simply testing them covertly by comparison with memorized former experiences. This element of architecture

  16. Hemispheric Asymmetries in Striatal Reward Responses Relate to Approach-Avoidance Learning and Encoding of Positive-Negative Prediction Errors in Dopaminergic Midbrain Regions.

    PubMed

    Aberg, Kristoffer Carl; Doell, Kimberly C; Schwartz, Sophie

    2015-10-28

    Some individuals are better at learning about rewarding situations, whereas others are inclined to avoid punishments (i.e., enhanced approach or avoidance learning, respectively). In reinforcement learning, action values are increased when outcomes are better than predicted (positive prediction errors [PEs]) and decreased for worse than predicted outcomes (negative PEs). Because actions with high and low values are approached and avoided, respectively, individual differences in the neural encoding of PEs may influence the balance between approach-avoidance learning. Recent correlational approaches also indicate that biases in approach-avoidance learning involve hemispheric asymmetries in dopamine function. However, the computational and neural mechanisms underpinning such learning biases remain unknown. Here we assessed hemispheric reward asymmetry in striatal activity in 34 human participants who performed a task involving rewards and punishments. We show that the relative difference in reward response between hemispheres relates to individual biases in approach-avoidance learning. Moreover, using a computational modeling approach, we demonstrate that better encoding of positive (vs negative) PEs in dopaminergic midbrain regions is associated with better approach (vs avoidance) learning, specifically in participants with larger reward responses in the left (vs right) ventral striatum. Thus, individual dispositions or traits may be determined by neural processes acting to constrain learning about specific aspects of the world.
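
    The asymmetry described here is usually captured with separate learning rates for positive and negative prediction errors, as in the minimal sketch below (all numbers invented).

    ```python
    # Asymmetric delta rule: alpha_pos > alpha_neg produces an approach-biased
    # (optimistic) learner; the reverse produces an avoidance-biased one.
    def update(q, reward, alpha_pos=0.3, alpha_neg=0.1):
        pe = reward - q
        return q + (alpha_pos if pe > 0 else alpha_neg) * pe

    q = 0.0
    for r in [1, 0, 1, 1, 0, 1]:   # hypothetical win/loss sequence (win rate 0.67)
        q = update(q, r)
    print(round(q, 3))             # 0.705: value sits above the true win rate
    ```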

  17. Reinforcement learning in professional basketball players

    PubMed Central

    Neiman, Tal; Loewenstein, Yonatan

    2011-01-01

    Reinforcement learning in complex natural environments is a challenging task because the agent should generalize from the outcomes of actions taken in one state of the world to future actions in different states of the world. The extent to which human experts find the proper level of generalization is unclear. Here we show, using the sequences of field goal attempts made by professional basketball players, that the outcome of even a single field goal attempt has a considerable effect on the rate of subsequent 3-point shot attempts, in line with standard models of reinforcement learning. However, this change in behaviour is associated with negative correlations between the outcomes of successive field goal attempts. These results indicate that despite years of experience and high motivation, professional players overgeneralize from the outcomes of their most recent actions, which leads to decreased performance. PMID:22146388

  18. The cerebellum: a neural system for the study of reinforcement learning.

    PubMed

    Swain, Rodney A; Kerr, Abigail L; Thompson, Richard F

    2011-01-01

    In its strictest application, the term "reinforcement learning" refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Given that Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed), any neural system that supports reinforcement learning must also be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning.

  19. Optimal chaos control through reinforcement learning.

    PubMed

    Gadaleta, Sabino; Dangelmayr, Gerhard

    1999-09-01

    A general purpose chaos control algorithm based on reinforcement learning is introduced and applied to the stabilization of unstable periodic orbits in various chaotic systems and to the targeting problem. The algorithm requires no information about the dynamical system or about the location of periodic orbits. Numerical tests demonstrate good and fast performance under noisy and nonstationary conditions.

  20. Embedded Incremental Feature Selection for Reinforcement Learning

    DTIC Science & Technology

    2012-05-01

    policy by a problem-specific fitness function. The composition of the selected subset is reported in terms of the fraction of relevant features among selected features. In Figure 4b we see the composition of the selected subsets by the three algorithms; IFSE-NEAT clearly has the highest percentage of relevant features.

  1. The Function of Direct and Vicarious Reinforcement in Human Learning.

    ERIC Educational Resources Information Center

    Owens, Carl R.; And Others

    The role of reinforcement has long been an issue in learning theory. The effects of reinforcement in learning were investigated under circumstances which made the information necessary for correct performance equally available to reinforced and nonreinforced subjects. Fourth graders (N=36) were given a pre-test of 20 items from the Peabody Picture…

  2. Separating the effect of reward from corrective feedback during learning in patients with Parkinson's disease.

    PubMed

    Freedberg, Michael; Schacherer, Jonathan; Chen, Kuan-Hua; Uc, Ergun Y; Narayanan, Nandakumar S; Hazeltine, Eliot

    2017-04-10

    Parkinson's disease (PD) is associated with procedural learning deficits. Nonetheless, studies have demonstrated that reward-related learning is comparable between patients with PD and controls (Bódi et al., Brain, 132(9), 2385-2395, 2009; Frank, Seeberger, & O'Reilly, Science, 306(5703), 1940-1943, 2004; Palminteri et al., Proceedings of the National Academy of Sciences of the United States of America, 106(45), 19179-19184, 2009). However, because these studies do not separate the effect of reward from the effect of practice, it is difficult to determine whether the effect of reward on learning is distinct from the effect of corrective feedback on learning. Thus, it is unknown whether these group differences in learning are due to reward processing or learning in general. Here, we compared the performance of medicated PD patients to demographically matched healthy controls (HCs) on a task where the effect of reward can be examined separately from the effect of practice. We found that patients with PD showed significantly less reward-related learning improvements compared to HCs. In addition, stronger learning of rewarded associations over unrewarded associations was significantly correlated with smaller skin-conductance responses for HCs but not PD patients. These results demonstrate that when separating the effect of reward from the effect of corrective feedback, PD patients do not benefit from reward.

  3. FMRQ-A Multiagent Reinforcement Learning Algorithm for Fully Cooperative Tasks.

    PubMed

    Zhang, Zhen; Zhao, Dongbin; Gao, Junwei; Wang, Dongqing; Dai, Yujie

    2016-04-14

    In this paper, we propose a multiagent reinforcement learning algorithm dealing with fully cooperative tasks. The algorithm is called frequency of the maximum reward Q-learning (FMRQ). FMRQ aims to achieve one of the optimal Nash equilibria so as to optimize the performance index in multiagent systems. The frequency of obtaining the highest global immediate reward, instead of the immediate reward itself, is used as the reinforcement signal. With FMRQ each agent does not need to observe the other agents' actions and only shares its state and reward at each step. We validate FMRQ through case studies of repeated games: four cases of two-player two-action games and one case of a three-player two-action game. It is demonstrated that FMRQ can converge to one of the optimal Nash equilibria in these cases. Moreover, comparison experiments on tasks with multiple states and finite steps are conducted. One is a box-pushing task and the other is a distributed sensor network problem. Experimental results show that the proposed algorithm outperforms the compared methods.
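
    The abstract gives only the signal's definition, so the following is a loose toy rendering (not the published algorithm): each agent scores its own actions by how often they coincided with the maximal global reward, rather than by the reward value itself. The 2x2 cooperative game and all constants are invented.

    ```python
    import random

    payoff = {("a", "a"): 10, ("a", "b"): 0, ("b", "a"): 0, ("b", "b"): 5}
    max_reward = max(payoff.values())
    counts = [{"a": [0, 0], "b": [0, 0]} for _ in range(2)]   # [max-hits, plays]

    def greedy(agent):
        freq = {k: (h / p if p else 0.0) for k, (h, p) in counts[agent].items()}
        return max(freq, key=freq.get)

    random.seed(3)
    for episode in range(2000):
        acts = [random.choice("ab") if random.random() < 0.1 else greedy(i)
                for i in range(2)]
        r = payoff[tuple(acts)]
        for i, a in enumerate(acts):
            counts[i][a][1] += 1                      # times the action was played
            counts[i][a][0] += int(r == max_reward)   # frequency-of-max signal
    print(greedy(0), greedy(1))   # both agents settle on 'a', the optimal joint action
    ```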

  4. Reinforcement learning design for cancer clinical trials

    PubMed Central

    Zhao, Yufan; Kosorok, Michael R.; Zeng, Donglin

    2009-01-01

    Summary We develop reinforcement learning trials for discovering individualized treatment regimens for life-threatening diseases such as cancer. A temporal-difference learning method called Q-learning is utilized which involves learning an optimal policy from a single training set of finite longitudinal patient trajectories. Approximating the Q-function with time-indexed parameters can be achieved by using support vector regression or extremely randomized trees. Within this framework, we demonstrate that the procedure can extract optimal strategies directly from clinical data without relying on the identification of any accurate mathematical models, unlike approaches based on adaptive design. We show that reinforcement learning has tremendous potential in clinical research because it can select actions that improve outcomes by taking into account delayed effects even when the relationship between actions and outcomes is not fully known. To support our claims, the methodology's practical utility is illustrated in a simulation analysis. In the immediate future, we will apply this general strategy to studying and identifying new treatments for advanced metastatic stage IIIB/IV non-small cell lung cancer, which usually includes multiple lines of chemotherapy treatment. Moreover, there is significant potential of the proposed methodology for developing personalized treatment strategies in other cancers, in cystic fibrosis, and in other life-threatening diseases. PMID:19750510
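
    A minimal sketch of the fitted-Q idea the summary names, using extremely randomized trees as the Q-function approximator: the snippet fits a single decision stage from synthetic logged data (the real method works backward over multiple treatment stages). The data-generating story and every constant are invented.

    ```python
    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    rng = np.random.default_rng(0)
    n = 500
    states = rng.normal(size=(n, 3))        # e.g., hypothetical patient covariates
    actions = rng.integers(0, 2, size=n)    # two candidate treatments
    # Hypothetical outcome: treatment 1 helps only when the first covariate is high.
    rewards = (actions == (states[:, 0] > 0)).astype(float) + 0.1 * rng.normal(size=n)

    X = np.column_stack([states, actions])
    q_model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, rewards)

    def best_action(state):
        # Greedy policy: evaluate Q(s, a) for each action and take the argmax.
        q = [q_model.predict(np.r_[state, a].reshape(1, -1))[0] for a in (0, 1)]
        return int(np.argmax(q))

    print(best_action(np.array([1.5, 0.0, 0.0])))    # expected: 1
    print(best_action(np.array([-1.5, 0.0, 0.0])))   # expected: 0
    ```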

  5. The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces

    PubMed Central

    Huertas, Marco A.; Schwettmann, Sarah E.; Shouval, Harel Z.

    2016-01-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different than those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for

  6. The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces.

    PubMed

    Huertas, Marco A; Schwettmann, Sarah E; Shouval, Harel Z

    2016-01-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different than those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for
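
    The two-trace idea can be rendered as a toy computation: LTP- and LTD-tagged traces decay at different rates, and the weight change at reward delivery is their gain-weighted difference. All constants below are illustrative assumptions, not the published model's values.

    ```python
    w = 0.5
    trace_ltp, trace_ltd = 0.0, 0.0
    decay_ltp, decay_ltd = 0.9, 0.7    # LTP trace assumed to outlast the LTD trace
    gain_ltp, gain_ltd = 1.0, 0.8

    for t in range(10):
        if t == 0:                     # pre/post coincidence tags both traces
            trace_ltp += 1.0
            trace_ltd += 1.0
        trace_ltp *= decay_ltp
        trace_ltd *= decay_ltd
        if t == 5:                     # delayed reinforcement (neuromodulator) arrives
            w += gain_ltp * trace_ltp - gain_ltd * trace_ltd
    print(round(w, 3))                 # 0.937: net potentiation despite the delay
    ```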

  7. Motivational neural circuits underlying reinforcement learning.

    PubMed

    Averbeck, Bruno B; Costa, Vincent D

    2017-03-29

    Reinforcement learning (RL) is the behavioral process of learning the values of actions and objects. Most models of RL assume that the dopaminergic prediction error signal drives plasticity in frontal-striatal circuits. The striatum then encodes value representations that drive decision processes. However, the amygdala has also been shown to play an important role in forming Pavlovian stimulus-outcome associations. These Pavlovian associations can drive motivated behavior via the amygdala projections to the ventral striatum or the ventral tegmental area. The amygdala may, therefore, play a central role in RL. Here we compare the contributions of the amygdala and the striatum to RL and show that both the amygdala and striatum learn and represent expected values in RL tasks. Furthermore, value representations in the striatum may be inherited, to some extent, from the amygdala. The striatum may, therefore, play less of a primary role in learning stimulus-outcome associations in RL than previously suggested.

  8. Reinforcement learning and dopamine in schizophrenia: dimensions of symptoms or specific features of a disease group?

    PubMed

    Deserno, Lorenz; Boehme, Rebecca; Heinz, Andreas; Schlagenhauf, Florian

    2013-12-23

    Abnormalities in reinforcement learning are a key finding in schizophrenia and have been proposed to be linked to elevated levels of dopamine neurotransmission. Behavioral deficits in reinforcement learning and their neural correlates may contribute to the formation of clinical characteristics of schizophrenia. The ability to form predictions about future outcomes is fundamental for environmental interactions and depends on neuronal teaching signals, like reward prediction errors. While aberrant prediction errors, that encode non-salient events as surprising, have been proposed to contribute to the formation of positive symptoms, a failure to build neural representations of decision values may result in negative symptoms. Here, we review behavioral and neuroimaging research in schizophrenia and focus on studies that implemented reinforcement learning models. In addition, we discuss studies that combined reinforcement learning with measures of dopamine. Thereby, we suggest how reinforcement learning abnormalities in schizophrenia may contribute to the formation of psychotic symptoms and may interact with cognitive deficits. These ideas point toward an interplay of more rigid versus flexible control over reinforcement learning. Pronounced deficits in the flexible or model-based domain may allow for a detailed characterization of well-established cognitive deficits in schizophrenia patients based on computational models of learning. Finally, we propose a framework based on the potentially crucial contribution of dopamine to dysfunctional reinforcement learning on the level of neural networks. Future research may strongly benefit from computational modeling but also requires further methodological improvement for clinical group studies. These research tools may help to improve our understanding of disease-specific mechanisms and may help to identify clinically relevant subgroups of the heterogeneous entity schizophrenia.

  9. Bi-directional effect of increasing doses of baclofen on reinforcement learning.

    PubMed

    Terrier, Jean; Ort, Andres; Yvon, Cédric; Saj, Arnaud; Vuilleumier, Patrik; Lüscher, Christian

    2011-01-01

    In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. It has been shown that in brain slices of mice, GABA(B)-receptor agonists at low concentrations increase the firing frequency of VTA-DA neurons, while high concentrations reduce the firing frequency. However, it remains elusive whether baclofen can modulate reinforcement learning in humans. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high concentration of oral baclofen, a high-affinity GABA(B)-receptor agonist, in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen, on the other hand, did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels and thus facilitates reinforcement learning.

  10. Bi-Directional Effect of Increasing Doses of Baclofen on Reinforcement Learning

    PubMed Central

    Terrier, Jean; Ort, Andres; Yvon, Cédric; Saj, Arnaud; Vuilleumier, Patrik; Lüscher, Christian

    2011-01-01

    In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. It has been shown that in brain slices of mice, GABAB-receptor agonists at low concentrations increase the firing frequency of VTA–DA neurons, while high concentrations reduce the firing frequency. However, it remains elusive whether baclofen can modulate reinforcement learning in humans. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high concentration of oral baclofen, a high-affinity GABAB-receptor agonist, in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen, on the other hand, did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels and thus facilitates reinforcement learning. PMID:21811448

  11. Pressure to cooperate: is positive reward interdependence really needed in cooperative learning?

    PubMed

    Buchs, Céline; Gilles, Ingrid; Dutrévis, Marion; Butera, Fabrizio

    2011-03-01

    BACKGROUND. Despite extensive research on cooperative learning, the debate regarding whether or not its effectiveness depends on positive reward interdependence has not yet found clear evidence. AIMS. We tested the hypothesis that positive reward interdependence, as compared to reward independence, enhances cooperative learning only if learners work on a 'routine task'; if the learners work on a 'true group task', positive reward interdependence induces the same level of learning as reward independence. SAMPLE. The study involved 62 psychology students during regular workshops. METHOD. Students worked on two psychology texts in cooperative dyads for three sessions. The type of task was manipulated through resource interdependence: students worked on either identical (routine task) or complementary (true group task) information. Students expected to be assessed with a Multiple Choice Test (MCT) on the two texts. The MCT assessment type was introduced according to two reward interdependence conditions, either individual (reward independence) or common (positive reward interdependence). A follow-up individual test took place 4 weeks after the third session of dyadic work to examine individual learning. RESULTS. The predicted interaction between the two types of interdependence was significant, indicating that students learned more with positive reward interdependence than with reward independence when they worked on identical information (routine task), whereas students who worked on complementary information (group task) learned the same with or without reward interdependence. CONCLUSIONS. This experiment sheds light on the conditions under which positive reward interdependence enhances cooperative learning, and suggests that creating a true group task makes it possible to avoid the need for positive reward interdependence.

  12. Greater striatopallidal adaptive coding during cue-reward learning and food reward habituation predict future weight gain

    PubMed Central

    Burger, Kyle S.; Stice, Eric

    2014-01-01

    Animal experiments indicate that after repeated pairings of palatable food receipt and cues that predict palatable food receipt, dopamine signaling increases in response to predictive cues, but decreases in response to food receipt. Using functional MRI and mixed effects growth curve models with 35 females (M age = 15.5 ± 0.9; M BMI = 24.5 ± 5.4) we documented an increase in BOLD response in the caudate (r = .42) during exposure to cues predicting impending milkshake receipt over repeated exposures, demonstrating a direct measure of in vivo cue-reward learning in humans. Further, we observed a simultaneous decrease in putamen (r = −.33) and ventral pallidum (r = −.45) response during milkshake receipt that occurred over repeated exposures, putatively reflecting food reward habituation. We then tested whether cue-reward learning and habituation slopes predicted future weight over 2-year follow-up. Those who exhibited the greatest escalation in ventral pallidum responsivity to cues and the greatest decrease in caudate response to milkshake receipt showed significantly larger increases in BMI (r = .39 and −.69, respectively). Interestingly, cue-reward learning propensity and food reward habituation were not correlated, implying that these factors may constitute qualitatively distinct vulnerability pathways to excess weight gain. These two individual difference factors may provide insight as to why certain people have shown obesity onset in response to the current obesogenic environment in western cultures, whereas others have not. PMID:24893320

  13. Novelty and Inductive Generalization in Human Reinforcement Learning.

    PubMed

    Gershman, Samuel J; Niv, Yael

    2015-07-01

    In reinforcement learning (RL), a decision maker searching for the most rewarding option is often faced with the question: What is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: How can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and we describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of RL in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional RL algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty. Copyright © 2015 Cognitive Science Society, Inc.
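
    A minimal sketch of the kind of hierarchical inference the abstract describes: option values are treated as draws from a shared, environment-level Gaussian, so the posterior over that group-level mean serves as the prior for a never-tried option. The normal-normal assumptions, variances, and payoffs below are illustrative, not the authors' model code.

        import numpy as np

        # Mean payoffs of options already tried in this environment.
        known_option_means = np.array([0.7, 0.6, 0.8])

        # Hierarchical assumption: option values ~ Normal(mu_env, tau2),
        # and observations are corrupted by noise with variance sigma2.
        tau2, sigma2 = 0.05, 0.10

        # Posterior over the environment-level mean (flat prior on mu_env).
        n = len(known_option_means)
        mu_env = known_option_means.mean()
        var_env = (tau2 + sigma2) / n

        # A novel option inherits the group-level posterior as its prior.
        prior_mean, prior_var = mu_env, var_env + tau2

        # One observed reward from the novel option shrinks toward the group.
        r = 0.2
        gain = prior_var / (prior_var + sigma2)      # Kalman-style gain
        posterior_mean = prior_mean + gain * (r - prior_mean)
        print(round(prior_mean, 2), round(posterior_mean, 2))

    The last update has the delta-rule form value += gain * (reward - value), which is the kind of correspondence with temporal-difference learning the abstract alludes to.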

  14. The Cerebellum: A Neural System for the Study of Reinforcement Learning

    PubMed Central

    Swain, Rodney A.; Kerr, Abigail L.; Thompson, Richard F.

    2011-01-01

    In its strictest application, the term “reinforcement learning” refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Given that Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated, and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed), any neural system that supports reinforcement learning must also be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning. PMID:21427778

  15. Autistic Traits Moderate the Impact of Reward Learning on Social Behaviour.

    PubMed

    Panasiti, Maria Serena; Puzzo, Ignazio; Chakrabarti, Bhismadev

    2016-04-01

    A deficit in empathy has been suggested to underlie social behavioural atypicalities in autism. A parallel theoretical account proposes that reduced social motivation (i.e., low responsivity to social rewards) can account for the said atypicalities. Recent evidence suggests that autistic traits modulate the link between reward and proxy metrics related to empathy. Using an evaluative conditioning paradigm to associate high and low rewards with faces, a previous study has shown that individuals high in autistic traits show reduced spontaneous facial mimicry of faces associated with high vs. low reward. This observation raises the possibility that autistic traits modulate the magnitude of evaluative conditioning. To test this, we investigated (a) whether autistic traits could modulate the ability to implicitly associate a reward value with a social stimulus (reward learning/conditioning, using the Implicit Association Task, IAT); (b) whether the learned association could modulate participants' prosocial behaviour (i.e., social reciprocity, measured using the cyberball task); and (c) whether the strength of this modulation was influenced by autistic traits. In 43 neurotypical participants, we found that autistic traits moderated the effect of social reward learning on prosocial behaviour but not reward learning itself. This evidence suggests that while autistic traits do not directly influence social reward learning, they modulate the relationship of social rewards with prosocial behaviour.

  17. Excitotoxic lesions of the medial striatum delay extinction of a reinforcement color discrimination operant task in domestic chicks; a functional role of reward anticipation.

    PubMed

    Ichikawa, Yoko; Izawa, Ei-Ichi; Matsushima, Toshiya

    2004-12-01

    To reveal the functional roles of the striatum, we examined the effects of excitotoxic lesions to the bilateral medial striatum (mSt) and nucleus accumbens (Ac) in a food reinforcement color discrimination operant task. With a food reward as reinforcement, 1-week-old domestic chicks were trained to peck selectively at red and yellow beads (S+) and not to peck at a blue bead (S-). Those chicks then received either lesions or sham operations and were tested in extinction training sessions, during which yellow turned out to be nonrewarding (S-), whereas red and blue remained unchanged. To further examine the effects on postoperant noninstrumental aspects of behavior, we also measured the "waiting time", during which chicks stayed at the empty feeder after pecking at yellow. Although the lesioned chicks showed significantly higher error rates in the nonrewarding yellow trials, their postoperant waiting time gradually decreased similarly to the sham controls. Furthermore, the lesioned chicks waited significantly longer than the controls, even from the first extinction block. In the blue trials, both lesioned and sham chicks consistently refrained from pecking, indicating that the delayed extinction was not due to a general disinhibition of pecking. Similarly, no effects were found in the novel training sessions, suggesting that the lesions had selective effects on the extinction of a learned operant. These results suggest that a neural representation of memory-based reward anticipation in the mSt/Ac could contribute to the anticipation error required for extinction.

  18. Implication of dopaminergic modulation in operant reward learning and the induction of compulsive-like feeding behavior in Aplysia.

    PubMed

    Bédécarrats, Alexis; Cornet, Charles; Simmers, John; Nargeot, Romuald

    2013-05-16

    Feeding in Aplysia provides an amenable model system for analyzing the neuronal substrates of motivated behavior and its adaptability by associative reward learning and neuromodulation. Among such learning processes, appetitive operant conditioning that leads to a compulsive-like expression of feeding actions is known to be associated with changes in the membrane properties and electrical coupling of essential action-initiating B63 neurons in the buccal central pattern generator (CPG). Moreover, the food-reward signal for this learning is conveyed in the esophageal nerve (En), an input nerve rich in dopamine-containing fibers. Here, to investigate whether dopamine (DA) is involved in this learning-induced plasticity, we used an in vitro analog of operant conditioning in which electrical stimulation of En substituted for the contingent reinforcement of biting movements in vivo. Our data indicate that contingent En stimulation does, indeed, replicate the operant learning-induced changes in CPG output and the underlying membrane and synaptic properties of B63. Significantly, moreover, this network and cellular plasticity was blocked when the input nerve was stimulated in the presence of the DA receptor antagonist cis-flupenthixol. These results therefore suggest that En-derived dopaminergic modulation of CPG circuitry contributes to the operant reward-dependent emergence of a compulsive-like expression of Aplysia's feeding behavior.

  19. Intrinsically Motivated Reinforcement Learning: A Promising Framework for Developmental Robot Learning

    DTIC Science & Technology

    2005-01-01

    for intrinsically motivated reinforcement learning that strives to achieve broad competence in an environment in a task-nonspecific manner by...hierarchical learning, intrinsically motivated reinforcement learning is an obvious choice for organizing behavior in developmental robotics. We present

  20. Striatal dopamine ramping may indicate flexible reinforcement learning with forgetting in the cortico-basal ganglia circuits.

    PubMed

    Morita, Kenji; Kato, Ayaka

    2014-01-01

    It has been suggested that the midbrain dopamine (DA) neurons, receiving inputs from the cortico-basal ganglia (CBG) circuits and the brainstem, compute reward prediction error (RPE), the difference between reward obtained or expected to be obtained and reward that had been expected to be obtained. These reward expectations are suggested to be stored in the CBG synapses and updated according to RPE through synaptic plasticity, which is induced by released DA. These together constitute the "DA=RPE" hypothesis, which describes the mutual interaction between DA and the CBG circuits and serves as the primary working hypothesis in studying reward learning and value-based decision-making. However, recent work has revealed a new type of DA signal that appears not to represent RPE. Specifically, it has been found in a reward-associated maze task that striatal DA concentration primarily shows a gradual increase toward the goal. We explored whether such ramping DA could be explained by extending the "DA=RPE" hypothesis by taking into account biological properties of the CBG circuits. In particular, we examined effects of possible time-dependent decay of DA-dependent plastic changes of synaptic strengths by incorporating decay of learned values into the RPE-based reinforcement learning model and simulating reward learning tasks. We then found that incorporation of such a decay dramatically changes the model's behavior, causing gradual ramping of RPE. Moreover, we further incorporated magnitude-dependence of the rate of decay, which could potentially be in accord with some past observations, and found that near-sigmoidal ramping of RPE, resembling the observed DA ramping, could then occur. Given that synaptic decay can be useful for flexibly reversing and updating the learned reward associations, especially in case the baseline DA is low and encoding of negative RPE by DA is limited, the observed DA ramping would be indicative of the operation of such flexible reward learning.
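
    The abstract's central mechanism, a per-step decay ("forgetting") of learned values added to an otherwise standard temporal-difference model, can be reproduced on a toy linear track: with decay, the TD error ramps up toward the goal instead of flattening out after learning. The sketch and its parameters are ours, for illustration only, not the authors' simulation code.

        import numpy as np

        n_states, alpha, gamma, decay = 7, 0.3, 0.97, 0.01
        V = np.zeros(n_states + 1)        # V[n_states]: terminal, post-goal state

        for episode in range(500):
            for s in range(n_states):
                r = 1.0 if s == n_states - 1 else 0.0   # reward only at the goal
                rpe = r + gamma * V[s + 1] - V[s]       # standard TD error
                V[s] += alpha * rpe
            V *= (1.0 - decay)            # forgetting: learned values decay away

        # Decay keeps values far from the goal from fully converging, so the TD
        # error (a proxy for DA release) rises gradually along the path.
        print(np.round([gamma * V[s + 1] - V[s] for s in range(n_states - 1)], 3))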

  1. A Flexible Mechanism of Rule Selection Enables Rapid Feature-Based Reinforcement Learning

    PubMed Central

    Balcarras, Matthew; Womelsdorf, Thilo

    2016-01-01

    Learning in a new environment is influenced by prior learning and experience. Correctly applying a rule that maps a context to stimuli, actions, and outcomes enables faster learning and better outcomes compared to relying on strategies for learning that are ignorant of task structure. However, it is often difficult to know when and how to apply learned rules in new contexts. In our study we explored how subjects employ different strategies for learning the relationship between stimulus features and positive outcomes in a probabilistic task context. We test the hypothesis that task naive subjects will show enhanced learning of feature specific reward associations by switching to the use of an abstract rule that associates stimuli by feature type and restricts selections to that dimension. To test this hypothesis we designed a decision making task where subjects receive probabilistic feedback following choices between pairs of stimuli. In the task, trials are grouped in two contexts by blocks, where in one type of block there is no unique relationship between a specific feature dimension (stimulus shape or color) and positive outcomes, and following an un-cued transition, alternating blocks have outcomes that are linked to either stimulus shape or color. Two-thirds of subjects (n = 22/32) exhibited behavior that was best fit by a hierarchical feature-rule model. Supporting the prediction of the model mechanism these subjects showed significantly enhanced performance in feature-reward blocks, and rapidly switched their choice strategy to using abstract feature rules when reward contingencies changed. Choice behavior of other subjects (n = 10/32) was fit by a range of alternative reinforcement learning models representing strategies that do not benefit from applying previously learned rules. In summary, these results show that untrained subjects are capable of flexibly shifting between behavioral rules by leveraging simple model-free reinforcement learning and context

  3. Reward-based learning of a redundant task.

    PubMed

    Tamagnone, Irene; Casadio, Maura; Sanguineti, Vittorio

    2013-06-01

    Motor skill learning has different components. When we acquire a new motor skill, we have both to learn a reliable action-value map to select a highly rewarded action (task model) and to develop an internal representation of the novel dynamics of the task environment, in order to execute properly the action previously selected (internal model). Here we focus on a 'pure' motor skill learning task, in which adaptation to a novel dynamical environment is negligible and the problem is reduced to the acquisition of an action-value map, based only on knowledge of results. Subjects performed point-to-point movements, in which start and target positions were fixed and visible, but the score provided at the end of the movement depended on the distance of the trajectory from a hidden via-point. Subjects had no clues about the correct movement other than the score value. The task is highly redundant, as infinite trajectories are compatible with the maximum score. Our aim was to capture the strategies subjects use in the exploration of the task space and in the exploitation of the task redundancy during learning. The main findings were that (i) subjects did not converge to a unique solution; rather, their final trajectories were determined by subject-specific histories of exploration; and (ii) with learning, subjects reduced the trajectory's overall variability, but the point of minimum variability gradually shifted toward the portion of the trajectory closer to the hidden via-point.

  4. The Effects of Verbal and Material Rewards and Punishers on the Performance of Impulsive and Reflective Children

    ERIC Educational Resources Information Center

    Firestone, Philip; Douglas, Virginia I.

    1977-01-01

    Impulsive and reflective children performed in a discrimination learning task which included four reinforcement conditions: verbal-reward, verbal-punishment, material-reward, and material-punishment. (SB)

  5. Dissociable neural representations of reinforcement and belief prediction errors underlie strategic learning.

    PubMed

    Zhu, Lusha; Mathewson, Kyle E; Hsu, Ming

    2012-01-31

    Decision-making in the presence of other competitive intelligent agents is fundamental for social and economic behavior. Such decisions require agents to behave strategically, where in addition to learning about the rewards and punishments available in the environment, they also need to anticipate and respond to actions of others competing for the same rewards. However, whereas we know much about strategic learning at both theoretical and behavioral levels, we know relatively little about the underlying neural mechanisms. Here, we show using a multi-strategy competitive learning paradigm that strategic choices can be characterized by extending the reinforcement learning (RL) framework to incorporate agents' beliefs about the actions of their opponents. Furthermore, using this characterization to generate putative internal values, we used model-based functional magnetic resonance imaging to investigate neural computations underlying strategic learning. We found that the distinct notions of prediction errors derived from our computational model are processed in a partially overlapping but distinct set of brain regions. Specifically, we found that the RL prediction error was correlated with activity in the ventral striatum. In contrast, activity in the ventral striatum, as well as the rostral anterior cingulate (rACC), was correlated with a previously uncharacterized belief-based prediction error. Furthermore, activity in rACC reflected individual differences in degree of engagement in belief learning. These results suggest a model of strategic behavior where learning arises from interaction of dissociable reinforcement and belief-based inputs.

  6. Dopamine and opioid gene variants are associated with increased smoking reward and reinforcement owing to negative mood.

    PubMed

    Perkins, Kenneth A; Lerman, Caryn; Grottenthaler, Amy; Ciccocioppo, Melinda M; Milanak, Melissa; Conklin, Cynthia A; Bergen, Andrew W; Benowitz, Neal L

    2008-09-01

    Negative mood increases smoking reinforcement and risk of relapse. We explored associations of gene variants in the dopamine, opioid, and serotonin pathways with smoking reward ('liking') and reinforcement (latency to first puff and total puffs) as a function of negative mood and expected versus actual nicotine content of the cigarette. Smokers of European ancestry (n=72) were randomized to one of four groups in a 2×2 balanced placebo design, corresponding with manipulation of actual (0.6 vs. 0.05 mg) and expected (told nicotine and told denicotinized) nicotine 'dose' in cigarettes during each of two sessions (negative vs. positive mood induction). Following mood induction and expectancy instructions, they sampled and rated the assigned cigarette, and then smoked additional cigarettes ad lib during continued mood induction. The increase in smoking amount owing to negative mood was associated with: dopamine D2 receptor (DRD2) C957T (CC>TT or CT), SLC6A3 (presence of 9 repeat>absence of 9), and among those given a nicotine cigarette, DRD4 (presence of 7 repeat>absence of 7) and DRD2/ANKK1 TaqIA (TT or CT>CC). SLC6A3 and DRD2/ANKK1 TaqIA were also associated with smoking reward and smoking latency. OPRM1 (AA>AG or GG) was associated with smoking reward, but the SLC6A4 variable number tandem repeat was unrelated to any of these measures. These results warrant replication but provide the first evidence for genetic associations with the acute increase in smoking reward and reinforcement owing to negative mood.

  7. Analysis of Reward Functions in Learning: Unconscious Information Processing: Noncognitive Determinants of Response Strength

    DTIC Science & Technology

    1984-05-01

    Research Note 84-76. Analysis of Reward Functions in Learning: Unconscious Information Processing: Noncognitive Determinants of Response Strength. Final Report, covering the period Sept. 1978 - Sept. 15. Melvin H. Marx, University of Missouri, Columbia; David W. Bessemer, Contracting Officer's Representative; submitted by Robert M. Sasmor, Director, Basic Research.

  8. The Influence of Personality on Neural Mechanisms of Observational Fear and Reward Learning

    ERIC Educational Resources Information Center

    Hooker, Christine I.; Verosky, Sara C.; Miyakawa, Asako; Knight, Robert T.; D'Esposito, Mark

    2008-01-01

    Fear and reward learning can occur through direct experience or observation. Both channels can enhance survival or create maladaptive behavior. We used fMRI to isolate neural mechanisms of observational fear and reward learning and investigate whether neural response varied according to individual differences in neuroticism and extraversion.…

  9. Effects of Cooperative versus Individual Study on Learning and Motivation after Reward-Removal

    ERIC Educational Resources Information Center

    Sears, David A.; Pai, Hui-Hua

    2012-01-01

    Rewards are frequently used in classrooms and recommended as a key component of well-researched methods of cooperative learning (e.g., Slavin, 1995). While many studies of cooperative learning find beneficial effects of rewards, many studies of individuals find negative effects (e.g., Deci, Koestner, & Ryan, 1999; Lepper, 1988). This may be…

  11. Reinforcement learning can account for associative and perceptual learning on a visual-decision task.

    PubMed

    Law, Chi-Tat; Gold, Joshua I

    2009-05-01

    We recently showed that improved perceptual performance on a visual motion direction-discrimination task corresponds to changes in how an unmodified sensory representation in the brain is interpreted to form a decision that guides behavior. Here we found that these changes can be accounted for using a reinforcement-learning rule to shape functional connectivity between the sensory and decision neurons. We modeled performance on the basis of the readout of simulated responses of direction-selective sensory neurons in the middle temporal area (MT) of monkey cortex. A reward prediction error guided changes in connections between these sensory neurons and the decision process, first establishing the association between motion direction and response direction, and then gradually improving perceptual sensitivity by selectively strengthening the connections from the most sensitive neurons in the sensory population. The results suggest a common, feedback-driven mechanism for some forms of associative and perceptual learning.
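
    A schematic version of the described learning rule: pooled responses of direction-selective units drive a binary decision, and a reward prediction error gates a Hebbian update of the readout weights, so the most sensitive units gradually dominate the decision. Unit counts, noise levels, and learning rates are invented for illustration; this is not the authors' simulation code.

        import numpy as np

        rng = np.random.default_rng(0)
        n_units = 50
        tuning = rng.uniform(-1, 1, n_units)   # signed direction sensitivity per unit
        w = np.zeros(n_units)                  # readout weights onto the decision
        alpha, reward_rate = 0.05, 0.5

        for trial in range(2000):
            direction = rng.choice([-1, 1])
            rates = tuning * direction + rng.normal(0, 1, n_units)  # noisy responses
            choice = 1 if w @ rates + rng.normal(0, 0.1) > 0 else -1
            reward = 1.0 if choice == direction else 0.0
            rpe = reward - reward_rate                  # reward prediction error
            reward_rate += 0.01 * rpe
            w += alpha * rpe * rates * choice           # RPE-gated Hebbian update

        # Readout weights align with unit sensitivity as performance improves.
        print(round(float(np.corrcoef(w, tuning)[0, 1]), 2), round(reward_rate, 2))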

  12. Hierarchical extreme learning machine based reinforcement learning for goal localization

    NASA Astrophysics Data System (ADS)

    AlDahoul, Nouar; Zaw Htike, Zaw; Akmeliawati, Rini

    2017-03-01

    The objective of goal localization is to find the location of goals in noisy environments. Simple actions are performed to move the agent towards the goal. The goal detector should be capable of minimizing the error between the predicted locations and the true ones. Only a few regions need to be processed by the agent, to reduce the computational effort and increase the speed of convergence. In this paper, a reinforcement learning (RL) method was utilized to find an optimal series of actions to localize the goal region. The visual data, a set of images, are high-dimensional unstructured data and need to be represented efficiently to obtain a robust detector. Different deep reinforcement learning models have already been used to localize a goal, but most of them take a long time to learn the model. This long learning time results from the weight fine-tuning stage that is applied iteratively to find an accurate model. The Hierarchical Extreme Learning Machine (H-ELM) was used as a fast deep model that does not fine-tune the weights; in other words, hidden weights are generated randomly and output weights are calculated analytically. The H-ELM algorithm was used in this work to find good features for effective representation. This paper proposes a combination of the Hierarchical Extreme Learning Machine and reinforcement learning to find an optimal policy directly from visual input. This combination outperforms other methods in terms of accuracy and learning speed. Simulations and results were analysed using MATLAB.
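
    The property the abstract leans on, random hidden weights with analytically computed output weights, is easy to demonstrate for a single ELM layer (a hierarchical H-ELM stacks several such layers). A minimal sketch on a toy regression problem with invented dimensions:

        import numpy as np

        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 10))                      # 200 samples, 10 features
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)    # toy regression target

        # Hidden weights are generated randomly and never fine-tuned...
        n_hidden = 100
        W, b = rng.normal(size=(10, n_hidden)), rng.normal(size=n_hidden)
        H = np.tanh(X @ W + b)                              # random hidden features

        # ...while output weights are calculated analytically (least squares).
        beta, *_ = np.linalg.lstsq(H, y, rcond=None)
        print(round(float(np.mean((H @ beta - y) ** 2)), 4))  # training MSE

    Skipping the iterative fine-tuning stage is what buys the learning-speed advantage claimed in the abstract.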

  13. Knowledge-Based Reinforcement Learning for Data Mining

    NASA Astrophysics Data System (ADS)

    Kudenko, Daniel; Grzes, Marek

    Data Mining is the process of extracting patterns from data. Two general avenues of research in the intersecting areas of agents and data mining can be distinguished. The first approach is concerned with mining an agent’s observation data in order to extract patterns, categorize environment states, and/or make predictions of future states. In this setting, data is normally available as a batch, and the agent’s actions and goals are often independent of the data mining task. The data collection is mainly considered as a side effect of the agent’s activities. Machine learning techniques applied in such situations fall into the class of supervised learning. In contrast, the second scenario occurs where an agent is actively performing the data mining, and is responsible for the data collection itself. For example, a mobile network agent is acquiring and processing data (where the acquisition may incur a certain cost), or a mobile sensor agent is moving in a (perhaps hostile) environment, collecting and processing sensor readings. In these settings, the tasks of the agent and the data mining are highly intertwined and interdependent (or even identical). Supervised learning is not a suitable technique for these cases. Reinforcement Learning (RL) enables an agent to learn from experience (in the form of reward and punishment for explorative actions) and adapt to new situations, without a teacher. RL is an ideal learning technique for these data mining scenarios, because it fits the agent paradigm of continuous sensing and acting, and the RL agent is able to learn to make decisions on the sampling of the environment which provides the data. Nevertheless, RL still suffers from scalability problems, which have prevented its successful use in many complex real-world domains. The more complex the tasks, the longer it takes a reinforcement learning algorithm to converge to a good solution. For many real-world tasks, human expert knowledge is available. For example, human

  14. An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning

    DTIC Science & Technology

    2000-10-01

    Learning behaviors in a multiagent environment are crucial for developing and adapting multiagent systems. Reinforcement learning techniques have...presentation of the relevant techniques for solving stochastic games from both the game theory community and reinforcement learning communities. We examine the

  15. Curiosity and reward: Valence predicts choice and information prediction errors enhance learning.

    PubMed

    Marvin, Caroline B; Shohamy, Daphna

    2016-03-01

    Curiosity drives many of our daily pursuits and interactions; yet, we know surprisingly little about how it works. Here, we harness an idea implied in many conceptualizations of curiosity: that information has value in and of itself. Reframing curiosity as the motivation to obtain reward, where the reward is information, allows one to leverage major advances in theoretical and computational mechanisms of reward-motivated learning. We provide new evidence supporting 2 predictions that emerge from this framework. First, we find an asymmetric effect of positive versus negative information, with positive information enhancing both curiosity and long-term memory for information. Second, we find that it is not the absolute value of information that drives learning but, rather, the gap between the reward expected and reward received, an "information prediction error." These results support the idea that information functions as a reward, much like money or food, guiding choices and driving learning in systematic ways.

  16. Somatic and Reinforcement-Based Plasticity in the Initial Stages of Human Motor Learning.

    PubMed

    Sidarta, Ananda; Vahdat, Shahabeddin; Bernardi, Nicolò F; Ostry, David J

    2016-11-16

    As one learns to dance or play tennis, the desired somatosensory state is typically unknown. Trial and error is important as motor behavior is shaped by successful and unsuccessful movements. As an experimental model, we designed a task in which human participants make reaching movements to a hidden target and receive positive reinforcement when successful. We identified somatic and reinforcement-based sources of plasticity on the basis of changes in functional connectivity using resting-state fMRI before and after learning. The neuroimaging data revealed reinforcement-related changes in both motor and somatosensory brain areas in which a strengthening of connectivity was related to the amount of positive reinforcement during learning. Areas of prefrontal cortex were similarly altered in relation to reinforcement, with connectivity between sensorimotor areas of putamen and the reward-related ventromedial prefrontal cortex strengthened in relation to the amount of successful feedback received. In other analyses, we assessed connectivity related to changes in movement direction between trials, a type of variability that presumably reflects exploratory strategies during learning. We found that connectivity in a network linking motor and somatosensory cortices increased with trial-to-trial changes in direction. Connectivity varied as well with the change in movement direction following incorrect movements. Here the changes were observed in a somatic memory and decision making network involving ventrolateral prefrontal cortex and second somatosensory cortex. Our results point to the idea that the initial stages of motor learning are not wholly motor but rather involve plasticity in somatic and prefrontal networks related both to reward and exploration.

  17. Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity

    PubMed Central

    Warlaumont, Anne S.; Finnegan, Megan K.

    2016-01-01

    At around 7 months of age, human infants begin to reliably produce well-formed syllables containing both consonants and vowels, a behavior called canonical babbling. Over subsequent months, the frequency of canonical babbling continues to increase. How the infant’s nervous system supports the acquisition of this ability is unknown. Here we present a computational model that combines a spiking neural network, reinforcement-modulated spike-timing-dependent plasticity, and a human-like vocal tract to simulate the acquisition of canonical babbling. Like human infants, the model’s frequency of canonical babbling gradually increases. The model is rewarded when it produces a sound that is more auditorily salient than sounds it has previously produced. This is consistent with data from human infants indicating that contingent adult responses shape infant behavior and with data from deaf and tracheostomized infants indicating that hearing, including hearing one’s own vocalizations, is critical for canonical babbling development. Reward receipt increases the level of dopamine in the neural network. The neural network contains a reservoir with recurrent connections and two motor neuron groups, one agonist and one antagonist, which control the masseter and orbicularis oris muscles, promoting or inhibiting mouth closure. The model learns to increase the number of salient, syllabic sounds it produces by adjusting the base level of muscle activation and increasing their range of activity. Our results support the possibility that through dopamine-modulated spike-timing-dependent plasticity, the motor cortex learns to harness its natural oscillations in activity in order to produce syllabic sounds. It thus suggests that learning to produce rhythmic mouth movements for speech production may be supported by general cortical learning mechanisms. The model makes several testable predictions and has implications for our understanding not only of how syllabic vocalizations develop

  18. Integration of reinforcement learning and optimal decision-making theories of the basal ganglia.

    PubMed

    Bogacz, Rafal; Larsen, Tobias

    2011-04-01

    This article seeks to integrate two sets of theories describing action selection in the basal ganglia: reinforcement learning theories describing learning which actions to select to maximize reward and decision-making theories proposing that the basal ganglia selects actions on the basis of sensory evidence accumulated in the cortex. In particular, we present a model that integrates the actor-critic model of reinforcement learning and a model assuming that the cortico-basal-ganglia circuit implements a statistically optimal decision-making procedure. The values of cortico-striatal weights required for optimal decision making in our model differ from those provided by standard reinforcement learning models. Nevertheless, we show that an actor-critic model converges to the weights required for optimal decision making when biologically realistic limits on synaptic weights are introduced. We also describe the model's predictions concerning reaction times and neural responses during learning, and we discuss directions required for further integration of reinforcement learning and optimal decision-making theories.
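
    A compact sketch of the actor-critic component the article builds on, with the biologically motivated constraint that the (cortico-striatal-like) actor weights stay within fixed bounds; the bandit task, bounds, and parameters are illustrative rather than taken from the paper's model.

        import numpy as np

        rng = np.random.default_rng(2)
        p_reward = np.array([0.8, 0.2])     # true reward probability of each action
        actor_w = np.full(2, 0.5)           # "cortico-striatal" action weights
        critic_v = 0.0                      # state value maintained by the critic
        a_actor, a_critic, beta = 0.1, 0.1, 3.0

        for trial in range(3000):
            p = np.exp(beta * actor_w); p /= p.sum()     # softmax action selection
            a = rng.choice(2, p=p)
            r = float(rng.random() < p_reward[a])
            delta = r - critic_v                         # critic's prediction error
            critic_v += a_critic * delta
            actor_w[a] += a_actor * delta
            actor_w = np.clip(actor_w, 0.0, 1.0)         # realistic weight limits

        print(actor_w.round(2), round(critic_v, 2))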

  19. Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control.

    PubMed

    Uragami, Daisuke; Takahashi, Tatsuji; Matsuo, Yoshiki

    2014-02-01

    Many algorithms and methods in artificial intelligence or machine learning were inspired by human cognition. As a mechanism to handle the exploration-exploitation dilemma in reinforcement learning, the loosely symmetric (LS) value function that models the causal intuition of humans was proposed (Shinohara et al., 2007). While LS shows the highest correlation with causal induction by humans, it has been reported to work effectively in multi-armed bandit problems, the simplest class of tasks representing the dilemma. However, the scope of application of LS was limited to reinforcement learning problems with K actions and only one state (K-armed bandit problems). This study proposes the LS-Q learning architecture, which can deal with general reinforcement learning tasks with multiple states and delayed reward. We tested the learning performance of the new architecture on giant-swing robot motion learning, where the uncertainty and unknownness of the environment are substantial. In the test, no help from ready-made internal models or function approximation of the state space was given. The simulations showed that while the ordinary Q-learning agent does not reach giant-swing motion because of stagnant loops (local optima with low rewards), LS-Q escapes such loops and acquires the giant-swing. It was confirmed that the smaller the number of states (in other words, the more coarse-grained the division of states and the more incomplete the state observation), the better LS-Q performs in comparison with Q-learning. We also showed that the high performance of LS-Q depends comparatively little on parameter tuning and learning time. This suggests that the proposed method, inspired by human cognition, works adaptively in real environments. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  20. Post-learning hippocampal dynamics promote preferential retention of rewarding events

    PubMed Central

    Gruber, Matthias J.; Ritchey, Maureen; Wang, Shao-Fang; Doss, Manoj K.; Ranganath, Charan

    2016-01-01

    Reward motivation is known to modulate memory encoding, and this effect depends on interactions between the substantia nigra/ventral tegmental area complex (SN/VTA) and the hippocampus. It is unknown, however, whether these interactions influence offline neural activity in the human brain that is thought to promote memory consolidation. Here, we used functional magnetic resonance imaging (fMRI) to test the effect of reward motivation on post-learning neural dynamics and subsequent memory for objects that were learned in high- or low-reward motivation contexts. We found that post-learning increases in resting-state functional connectivity between the SN/VTA and hippocampus predicted preferential retention of objects that were learned in high-reward contexts. In addition, multivariate pattern classification revealed that hippocampal representations of high-reward contexts were preferentially reactivated during post-learning rest, and the number of hippocampal reactivations was predictive of preferential retention of items learned in high-reward contexts. These findings indicate that reward motivation alters offline post-learning dynamics between the SN/VTA and hippocampus, providing novel evidence for a potential mechanism by which reward could influence memory consolidation. PMID:26875624

  1. Experiential reward learning outweighs instruction prior to adulthood.

    PubMed

    Decker, Johannes H; Lourenco, Frederico S; Doll, Bradley B; Hartley, Catherine A

    2015-06-01

    Throughout our lives, we face the important task of distinguishing rewarding actions from those that are best avoided. Importantly, there are multiple means by which we acquire this information. Through trial and error, we use experiential feedback to evaluate our actions. We also learn which actions are advantageous through explicit instruction from others. Here, we examined whether the influence of these two forms of learning on choice changes across development by placing instruction and experience in competition in a probabilistic-learning task. Whereas inaccurate instruction markedly biased adults' estimations of a stimulus's value, children and adolescents were better able to objectively estimate stimulus values through experience. Instructional control of learning is thought to recruit prefrontal-striatal brain circuitry, which continues to mature into adulthood. Our behavioral data suggest that this protracted neurocognitive maturation may cause the motivated actions of children and adolescents to be less influenced by explicit instruction than are those of adults. This absence of a confirmation bias in children and adolescents represents a paradoxical developmental advantage of youth over adults in the unbiased evaluation of actions through positive and negative experience.

  3. Stochastic Scheduling and Planning Using Reinforcement Learning

    DTIC Science & Technology

    2007-11-02

    reinforcement learning (RL) methods to large-scale optimization problems relevant to Air Force operations planning, scheduling, and maintenance. The objectives of this project were to: (1) investigate the utility of RL on large-scale logistics problems; (2) extend existing RL theory and practice to enhance the ease of application and the performance of RL on these problems; and (3) explore new problem formulations in order to take maximal advantage of RL methods. A method using RL to modify local search cost functions was developed and shown to yield significant

  4. A reinforcement learning approach to instrumental contingency degradation in rats.

    PubMed

    Dutech, Alain; Coutureau, Etienne; Marchand, Alain R

    2011-01-01

    Goal-directed action involves a representation of action consequences. Adapting to changes in action-outcome contingency requires the prefrontal region. Indeed, rats with lesions of the medial prefrontal cortex do not adapt their free operant response when food delivery becomes unrelated to lever-pressing. The present study explores the bases of this deficit through a combined behavioural and computational approach. We show that lesioned rats retain some behavioural flexibility and stop pressing if this action prevents food delivery. We attempt to model this phenomenon in a reinforcement learning framework. The model assumes that distinct action values are learned in an incremental manner in distinct states. The model represents states as n-uplets of events, emphasizing sequences rather than the continuous passage of time. Probabilities of lever-pressing and visits to the food magazine observed in the behavioural experiments are first analyzed as a function of these states, to identify sequences of events that influence action choice. Observed action probabilities appear to be essentially a function of the last event that occurred, with reward delivery and waiting significantly facilitating magazine visits and lever-pressing respectively. Behavioural sequences of normal and lesioned rats are then fed into the model, action values are updated at each event transition according to the SARSA algorithm, and predicted action probabilities are derived through a softmax policy. The model captures the time course of learning, as well as the differential adaptation of normal and prefrontal lesioned rats to contingency degradation with the same parameters for both groups. The results suggest that simple temporal difference algorithms with low learning rates can largely account for instrumental learning and performance. Prefrontal lesioned rats appear to mainly differ from control rats in their low rates of visits to the magazine after a lever press, and their inability to
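
    The modeling pipeline sketched in the abstract (discrete event-defined states, SARSA updates at each event transition, a softmax policy over action values) fits in a few lines. The schedule, state coding, and parameters below are placeholders, not the fitted values from the study.

        import numpy as np
        from collections import defaultdict

        rng = np.random.default_rng(3)
        Q = defaultdict(lambda: np.zeros(2))   # state -> values of [visit magazine, press lever]
        alpha, gamma, beta = 0.1, 0.9, 2.0

        def softmax(q):
            p = np.exp(beta * q); p /= p.sum()
            return rng.choice(len(q), p=p)

        state = "waiting"                      # the last event defines the state
        action = softmax(Q[state])
        for step in range(5000):
            # Contingent schedule: lever-pressing while waiting can deliver food.
            reward = 1.0 if (action == 1 and state == "waiting"
                             and rng.random() < 0.3) else 0.0
            next_state = "reward" if reward else "waiting"
            next_action = softmax(Q[next_state])
            # SARSA: on-policy update applied at each event transition.
            Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action]
                                         - Q[state][action])
            state, action = next_state, next_action

        print({s: q.round(2) for s, q in Q.items()})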

  5. Reward learning in pediatric depression and anxiety: preliminary findings in a high-risk sample.

    PubMed

    Morris, Bethany H; Bylsma, Lauren M; Yaroslavsky, Ilya; Kovacs, Maria; Rottenberg, Jonathan

    2015-05-01

    Reward learning has been postulated as a critical component of hedonic functioning that predicts depression risk. Reward learning deficits have been established in adults with current depressive disorders, but no prior studies have examined the relationship of reward learning and depression in children. The present study investigated reward learning as a function of familial depression risk and current diagnostic status in a pediatric sample. The sample included 204 children of parents with a history of depression (n = 86 high-risk offspring) or parents with no history of major mental disorder (n = 118 low-risk offspring). Semistructured clinical interviews were used to establish current mental diagnoses in the children. A modified signal detection task was used for assessing reward learning. We tested whether reward learning was impaired in high-risk offspring relative to low-risk offspring. We also tested whether reward learning was impaired in children with current disorders known to blunt hedonic function (depression, social phobia, PTSD, GAD, n = 13) compared to children with no disorders and to a psychiatric comparison group with ADHD. High- and low-risk youth did not differ in reward learning. However, youth with current anhedonic disorders (depression, social phobia, PTSD, GAD) exhibited blunted reward learning relative to nondisordered youth and those with ADHD. Our results are a first demonstration that reward learning deficits are present among youth with disorders known to blunt anhedonic function and that these deficits have some degree of diagnostic specificity. We advocate for future studies to replicate and extend these preliminary findings. © 2015 Wiley Periodicals, Inc.
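
    The "modified signal detection task" mentioned here is typically scored as a response bias toward the more frequently rewarded stimulus; one standard summary statistic is log b, computed below on made-up trial counts (our sketch of the conventional measure, not the authors' analysis code).

        import math

        # Hypothetical trial counts from a probabilistic reward task.
        rich_correct, rich_error = 80, 20   # frequently rewarded stimulus
        lean_correct, lean_error = 55, 45   # rarely rewarded stimulus

        # Response bias: positive log b means responding is pulled toward
        # the rich stimulus, the usual index of reward learning.
        log_b = 0.5 * math.log((rich_correct * lean_error)
                               / (rich_error * lean_correct))

        # Discriminability separates perceptual sensitivity from that bias.
        log_d = 0.5 * math.log((rich_correct * lean_correct)
                               / (rich_error * lean_error))

        print(f"log b = {log_b:.2f}, log d = {log_d:.2f}")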

  6. Learning processes affecting human decision making: An assessment of reinforcer-selective Pavlovian-to-instrumental transfer following reinforcer devaluation.

    PubMed

    Allman, Melissa J; DeLeon, Iser G; Cataldo, Michael F; Holland, Peter C; Johnson, Alexander W

    2010-07-01

    In reinforcer-selective transfer, Pavlovian stimuli that are predictive of specific outcomes bias performance toward responses associated with those outcomes. Although this phenomenon has been extensively examined in rodents, recent assessments have extended to humans. Using a stock market paradigm, adults were trained to associate particular symbols and responses with particular currencies. During the first test, individuals showed a preference for responding on actions associated with the same outcome as that predicted by the presented stimulus (i.e., a reinforcer-selective transfer effect). In the second test of the experiment, one of the currencies was devalued. We found it notable that this served to reduce responses to those stimuli associated with the devalued currency. This finding is in contrast to that typically observed in rodent studies, and suggests that participants in this task represented the sensory features that differentiate the reinforcers and their value during reinforcer-selective transfer. These results are discussed in terms of implications for understanding associative learning processes in humans and the ability of reward-paired cues to direct adaptive and maladaptive behavior.

  7. A reinforcement learning mechanism responsible for the valuation of free choice.

    PubMed

    Cockburn, Jeffrey; Collins, Anne G E; Frank, Michael J

    2014-08-06

    Humans exhibit a preference for options they have freely chosen over equally valued options they have not; however, the neural mechanism that drives this bias and its functional significance have yet to be identified. Here, we propose a model in which choice biases arise due to amplified positive reward prediction errors associated with free choice. Using a novel variant of a probabilistic learning task, we show that choice biases are selective to options that are predominantly associated with positive outcomes. A polymorphism in DARPP-32, a gene linked to dopaminergic striatal plasticity and individual differences in reinforcement learning, was found to predict the effect of choice as a function of value. We propose that these choice biases are the behavioral byproduct of a credit assignment mechanism responsible for ensuring the effective delivery of dopaminergic reinforcement learning signals broadcast to the striatum.
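
    The proposed mechanism, amplified positive reward prediction errors for freely chosen options, reduces to a one-line modification of a standard value update; the gain value and trial structure below are illustrative, not taken from the paper.

        import numpy as np

        rng = np.random.default_rng(4)
        alpha, choice_gain = 0.2, 1.5   # gain > 1 boosts positive RPEs for free choices
        v_free = v_forced = 0.0

        for trial in range(2000):
            r = float(rng.random() < 0.5)          # both options equally rewarding
            delta = r - v_free
            if delta > 0:                          # amplify positive RPEs only
                delta *= choice_gain
            v_free += alpha * delta
            v_forced += alpha * (r - v_forced)     # standard update, forced option

        # Identical outcomes, yet the freely chosen option ends up valued higher.
        print(round(v_free, 2), round(v_forced, 2))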

  9. A cholinergic feedback circuit to regulate striatal population uncertainty and optimize reinforcement learning.

    PubMed

    Franklin, Nicholas T; Frank, Michael J

    2015-12-25

    Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning three Marr's levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments.
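
    At the algorithmic level, the TAN mechanism described here amounts to letting uncertainty set the effective learning rate. One common abstraction, used below purely for illustration (it is not the paper's neural model), tracks outcomes with a slowly forgetting Beta distribution and scales the learning rate by its variance, so learning re-opens after change-points.

        import numpy as np

        rng = np.random.default_rng(5)
        a = b = 1.0                 # Beta(a, b) belief about the reward probability
        V, base_lr, p_true = 0.5, 0.5, 0.8

        for trial in range(300):
            if trial == 150:
                p_true = 0.2        # change-point in the outcome contingency
            r = float(rng.random() < p_true)
            var = a * b / ((a + b) ** 2 * (a + b + 1))   # posterior variance
            lr = base_lr * var / 0.25      # normalize by max Bernoulli variance
            V += lr * (r - V)              # uncertainty-scaled value update
            # Mild forgetting keeps uncertainty from collapsing to zero, so the
            # effective learning rate can rise again when outcomes turn volatile.
            a = 0.98 * a + r
            b = 0.98 * b + (1 - r)

        print(round(V, 2), round(lr, 3))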

  10. Rewards versus Learning: A Response to Paul Chance.

    ERIC Educational Resources Information Center

    Kohn, Alfie

    1993-01-01

    Responding to Paul Chance's November 1992 "Kappan" article on motivational value of rewards, this article argues that manipulating student behavior with either punishments or rewards is unnecessary and counterproductive. Extrinsic rewards can never buy more than short-term compliance because they are inherently controlling and…

  11. Choice modulates the neural dynamics of prediction error processing during rewarded learning.

    PubMed

    Peterson, David A; Lotz, Daniel T; Halgren, Eric; Sejnowski, Terrence J; Poizner, Howard

    2011-01-15

    Our ability to selectively engage with our environment enables us to guide our learning and to take advantage of its benefits. When facing multiple possible actions, our choices are a critical aspect of learning. In the case of learning from rewarding feedback, there has been substantial theoretical and empirical progress in elucidating the associated behavioral and neural processes, predominantly in terms of a reward prediction error, a measure of the discrepancy between actual versus expected reward. Nevertheless, the distinct influence of choice on prediction error processing and its neural dynamics remains relatively unexplored. In this study we used a novel paradigm to determine how choice influences prediction error processing and to examine whether there are correspondingly distinct neural dynamics. We recorded scalp electroencephalogram while healthy adults were administered a rewarded learning task in which choice trials were intermingled with control trials involving the same stimuli, motor responses, and probabilistic rewards. We used a temporal difference learning model of subjects' trial-by-trial choices to infer subjects' image valuations and corresponding prediction errors. As expected, choices were associated with lower overall prediction error magnitudes, most notably over the course of learning the stimulus-reward contingencies. Choices also induced a higher-amplitude relative positivity in the frontocentral event-related potential about 200 ms after reward signal onset that was negatively correlated with the differential effect of choice on the prediction error. Thus choice influences the neural dynamics associated with how reward signals are processed during learning. Behavioral, computational, and neurobiological models of rewarded learning should therefore accommodate a distinct influence for choice during rewarded learning. Copyright © 2010 Elsevier Inc. All rights reserved.
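
    The temporal difference model used to infer trial-by-trial valuations has a simple skeleton: update the chosen image's value by the prediction error after each reward signal, and keep the per-trial RPE as a regressor for the EEG analysis. A schematic version with invented parameters, not the authors' fitted model:

        import numpy as np

        rng = np.random.default_rng(6)
        alpha, beta = 0.3, 5.0
        V = np.zeros(2)                          # learned values of the two images
        p_reward = np.array([0.7, 0.3])
        rpes = []

        for trial in range(200):
            p = np.exp(beta * V); p /= p.sum()   # softmax over image valuations
            a = rng.choice(2, p=p)               # (simulated) subject's choice
            r = float(rng.random() < p_reward[a])
            rpe = r - V[a]                       # prediction error regressor
            V[a] += alpha * rpe
            rpes.append(abs(rpe))

        # |RPE| shrinks as the stimulus-reward contingencies are learned.
        print(round(float(np.mean(rpes[:50])), 2), round(float(np.mean(rpes[-50:])), 2))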

  12. Reinforcement Learning of Targeted Movement in a Spiking Neuronal Model of Motor Cortex

    PubMed Central

    Chadderdon, George L.; Neymotin, Samuel A.; Kerr, Cliff C.; Lytton, William W.

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint “forearm” to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (−1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior. PMID:23094042

  13. Reinforcement learning of targeted movement in a spiking neuronal model of motor cortex.

    PubMed

    Chadderdon, George L; Neymotin, Samuel A; Kerr, Cliff C; Lytton, William W

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint "forearm" to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (-1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior.
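
    The learning rule described in these two records, synapse-specific eligibility traces gated by a global 3-valued reinforcement signal, can be sketched compactly. The following Python fragment is a schematic rate-style approximation, not the published spiking implementation; the network sizes, rates, and spike vectors are placeholders.

      import numpy as np

      rng = np.random.default_rng(1)

      n_in, n_out = 16, 8
      w = rng.uniform(0.2, 0.4, size=(n_out, n_in))  # plastic feedforward weights
      elig = np.zeros_like(w)                        # eligibility traces

      def plasticity_step(pre, post, reinforcement, lr=0.01, decay=0.9):
          """Tag recently coactive synapses in an eligibility trace; a global
          3-valued signal (+1 reward, 0 no learning, -1 punishment) then
          converts the tags into weight changes."""
          elig[:] = decay * elig + np.outer(post, pre)
          w[:] = np.clip(w + lr * reinforcement * elig, 0.0, 1.0)

      # toy usage: random spike vectors, rewarded because the hand approached the target
      pre = (rng.random(n_in) < 0.2).astype(float)
      post = (rng.random(n_out) < 0.2).astype(float)
      plasticity_step(pre, post, reinforcement=+1)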

  14. Attenuating GABA(A) receptor signaling in dopamine neurons selectively enhances reward learning and alters risk preference in mice.

    PubMed

    Parker, Jones G; Wanat, Matthew J; Soden, Marta E; Ahmad, Kinza; Zweifel, Larry S; Bamford, Nigel S; Palmiter, Richard D

    2011-11-23

    Phasic dopamine (DA) transmission encodes the value of reward-predictive stimuli and influences both learning and decision-making. Altered DA signaling is associated with psychiatric conditions characterized by risky choices such as pathological gambling. These observations highlight the importance of understanding how DA neuron activity is modulated. While excitatory drive onto DA neurons is critical for generating phasic DA responses, emerging evidence suggests that inhibitory signaling also modulates these responses. To address the functional importance of inhibitory signaling in DA neurons, we generated mice lacking the β3 subunit of the GABA(A) receptor specifically in DA neurons (β3-KO mice) and examined their behavior in tasks that assessed appetitive learning, aversive learning, and risk preference. DA neurons in midbrain slices from β3-KO mice exhibited attenuated GABA-evoked IPSCs. Furthermore, electrical stimulation of excitatory afferents to DA neurons elicited more DA release in the nucleus accumbens of β3-KO mice as measured by fast-scan cyclic voltammetry. β3-KO mice were more active than controls when given morphine, which correlated with potential compensatory upregulation of GABAergic tone onto DA neurons. β3-KO mice learned faster in two food-reinforced learning paradigms, but extinguished their learned behavior normally. Enhanced learning was specific for appetitive tasks, as aversive learning was unaffected in β3-KO mice. Finally, we found that β3-KO mice had enhanced risk preference in a probabilistic selection task that required mice to choose between a small certain reward and a larger uncertain reward. Collectively, these findings identify a selective role for GABA(A) signaling in DA neurons in appetitive learning and decision-making.

  15. Single amino acids in sucrose rewards modulate feeding and associative learning in the honeybee.

    PubMed

    Simcock, Nicola K; Gray, Helen E; Wright, Geraldine A

    2014-10-01

    Obtaining the correct balance of nutrients requires that the brain integrates information about the body's nutritional state with sensory information from food to guide feeding behaviour. Learning is a mechanism that allows animals to identify cues associated with nutrients so that they can be located quickly when required. Feedback about nutritional state is essential for nutrient balancing and could influence learning. How specific this feedback is to individual nutrients has not often been examined. Here, we tested how the honeybee's nutritional state influenced the likelihood it would feed on and learn sucrose solutions containing single amino acids. Nutritional state was manipulated by pre-feeding bees with either 1 M sucrose or 1 M sucrose containing 100 mM of isoleucine, proline, phenylalanine, or methionine 24 h prior to olfactory conditioning of the proboscis extension response. We found that bees pre-fed sucrose solution consumed less of solutions containing amino acids and were also less likely to learn to associate amino acid solutions with odours. Unexpectedly, bees pre-fed solutions containing an amino acid were also less likely to learn to associate odours with sucrose the next day. Furthermore, they consumed more of and were more likely to learn when rewarded with an amino acid solution if they were pre-fed isoleucine and proline. Our data indicate that single amino acids at relatively high concentrations inhibit feeding on sucrose solutions containing them, and they can act as appetitive reinforcers during learning. Our data also suggest that select amino acids interact with mechanisms that signal nutritional sufficiency to reduce hunger. Based on these experiments, we predict that nutrient balancing for essential amino acids during learning requires integration of information about several amino acids experienced simultaneously.

  16. Single amino acids in sucrose rewards modulate feeding and associative learning in the honeybee

    PubMed Central

    Simcock, Nicola K.; Gray, Helen E.; Wright, Geraldine A.

    2014-01-01

    Obtaining the correct balance of nutrients requires that the brain integrates information about the body’s nutritional state with sensory information from food to guide feeding behaviour. Learning is a mechanism that allows animals to identify cues associated with nutrients so that they can be located quickly when required. Feedback about nutritional state is essential for nutrient balancing and could influence learning. How specific this feedback is to individual nutrients has not often been examined. Here, we tested how the honeybee’s nutritional state influenced the likelihood it would feed on and learn sucrose solutions containing single amino acids. Nutritional state was manipulated by pre-feeding bees with either 1 M sucrose or 1 M sucrose containing 100 mM of isoleucine, proline, phenylalanine, or methionine 24 h prior to olfactory conditioning of the proboscis extension response. We found that bees pre-fed sucrose solution consumed less of solutions containing amino acids and were also less likely to learn to associate amino acid solutions with odours. Unexpectedly, bees pre-fed solutions containing an amino acid were also less likely to learn to associate odours with sucrose the next day. Furthermore, they consumed more of and were more likely to learn when rewarded with an amino acid solution if they were pre-fed isoleucine and proline. Our data indicate that single amino acids at relatively high concentrations inhibit feeding on sucrose solutions containing them, and they can act as appetitive reinforcers during learning. Our data also suggest that select amino acids interact with mechanisms that signal nutritional sufficiency to reduce hunger. Based on these experiments, we predict that nutrient balancing for essential amino acids during learning requires integration of information about several amino acids experienced simultaneously. PMID:24819203

  17. Vicarious reinforcement learning signals when instructing others.

    PubMed

    Apps, Matthew A J; Lesage, Elise; Ramnani, Narender

    2015-02-18

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action-outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors.

  18. Vicarious Reinforcement Learning Signals When Instructing Others

    PubMed Central

    Lesage, Elise; Ramnani, Narender

    2015-01-01

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action–outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted a RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors. PMID:25698730

  19. Forward shift of feeding buzz components of dolphins and belugas during associative learning reveals a likely connection to reward expectation, pleasure and brain dopamine activation.

    PubMed

    Ridgway, S H; Moore, P W; Carder, D A; Romano, T A

    2014-08-15

    For many years, we heard sounds associated with reward from dolphins and belugas. We named these pulsed sounds victory squeals (VS), as they remind us of a child's squeal of delight. Here we put these sounds in context with natural and learned behavior. Like bats, echolocating cetaceans produce feeding buzzes as they approach and catch prey. Unlike bats, cetaceans continue their feeding buzzes after prey capture and the after portion is what we call the VS. Prior to training (or conditioning), the VS comes after the fish reward; with repeated trials it moves to before the reward. During training, we use a whistle or other sound to signal a correct response by the animal. This sound signal, named a secondary reinforcer (SR), leads to the primary reinforcer, fish. Trainers usually name their whistle or other SR a bridge, as it bridges the time gap between the correct response and reward delivery. During learning, the SR becomes associated with reward and the VS comes after the SR rather than after the fish. By following the SR, the VS confirms that the animal expects a reward. Results of early brain stimulation work suggest to us that SR stimulates brain dopamine release, which leads to the VS. Although there are no direct studies of dopamine release in cetaceans, we found that the timing of our VS is consistent with a response after dopamine release. We compared trained vocal responses to auditory stimuli with VS responses to SR sounds. Auditory stimuli that did not signal reward resulted in faster responses by a mean of 151 ms for dolphins and 250 ms for belugas. In laboratory animals, there is a 100 to 200 ms delay for dopamine release. VS delay in our animals is similar and consistent with vocalization after dopamine release. Our novel observation suggests that the dopamine reward system is active in cetacean brains.

  20. Seizure Control in a Computational Model Using a Reinforcement Learning Stimulation Paradigm.

    PubMed

    Nagaraj, Vivek; Lamperski, Andrew; Netoff, Theoden I

    2016-11-02

    Neuromodulation technologies such as vagus nerve stimulation and deep brain stimulation have shown some efficacy in controlling seizures in medically intractable patients. However, inherent patient-to-patient variability of seizure disorders leads to a wide range of therapeutic efficacy. A patient-specific approach to determining stimulation parameters may lead to increased therapeutic efficacy while minimizing stimulation energy and side effects. This paper presents a reinforcement learning algorithm that optimizes stimulation frequency for controlling seizures with minimum stimulation energy. We apply our method to a computational model called the Epileptor, which simulates inter-ictal and ictal local field potential data. In order to apply reinforcement learning to the Epileptor, we introduce a specialized reward function and state-space discretization. With the reward function and discretization fixed, we test the effectiveness of the temporal difference reinforcement learning algorithm (TD(0)). For periodic pulsatile stimulation, we derive a relation that describes, for any stimulation frequency, the minimal pulse amplitude required to suppress seizures. The TD(0) algorithm is able to identify parameters that control seizures quickly. Additionally, our results show that the TD(0) algorithm refines the stimulation frequency to minimize stimulation energy, thereby converging to optimal parameters reliably. An advantage of the TD(0) algorithm is that it is adaptive, so that the parameters necessary to control the seizures can change over time. We show that the algorithm can converge on the optimal solution in simulation with slow and fast inter-seizure intervals.
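
    TD(0) is the simplest temporal-difference method: it nudges each state's value toward a one-step bootstrapped return. A toy Python sketch follows; the state discretization, transition noise, and reward below are invented stand-ins, not the paper's Epileptor-derived states or published reward function.

      import numpy as np

      rng = np.random.default_rng(2)

      n_states = 5                  # hypothetical LFP-derived states; the last one means "seizing"
      V = np.zeros(n_states)
      alpha, gamma, energy_cost = 0.1, 0.95, 0.01

      s = 0
      for t in range(10000):
          stim = rng.uniform(0.0, 1.0)   # placeholder stimulation choice
          s_next = int(np.clip(s + rng.integers(-1, 2), 0, n_states - 1))
          r = -float(s_next == n_states - 1) - energy_cost * stim  # penalize seizures and energy
          V[s] += alpha * (r + gamma * V[s_next] - V[s])           # TD(0) update
          s = s_next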

  1. Molecular mechanisms underlying a cellular analogue of operant reward learning

    PubMed Central

    Lorenzetti, Fred D.; Baxter, Douglas A.; Byrne, John H.

    2008-01-01

    Operant conditioning is a ubiquitous but mechanistically poorly understood form of associative learning in which an animal learns the consequences of its behavior. Using a single-cell analogue of operant conditioning in neuron B51 of Aplysia, we examined second-messenger pathways engaged by activity and reward and how they may provide a biochemical association underlying operant learning. Conditioning was blocked by Rp-cAMP, a peptide inhibitor of PKA, a PKC inhibitor, and by expressing a dominant negative isoform of Ca2+-dependent PKC (apl-I). Thus, both PKA and PKC were necessary for operant conditioning. Injection of cAMP into B51 mimicked the effects of operant conditioning. Activation of PKC also mimicked conditioning, but was dependent on both cAMP and PKA, suggesting that PKC acted at some point upstream of PKA activation. Our results demonstrate how these molecules can interact to mediate operant conditioning in an individual neuron important for the expression of the conditioned behavior. PMID:18786364

  2. Ventral striatal control of appetitive motivation: role in ingestive behavior and reward-related learning.

    PubMed

    Kelley, Ann E

    2004-01-01

    The nucleus accumbens is a brain region that participates in the control of behaviors related to natural reinforcers, such as ingestion, sexual behavior, incentive and instrumental learning, and that also plays a role in addictive processes. This paper comprises a review of work from our laboratory that focuses on two main research areas: (i) the role of the nucleus accumbens in food motivation, and (ii) its putative functions in cellular plasticity underlying appetitive learning. First, work within a number of different behavioral paradigms has shown that accumbens neurochemical systems play specific and dissociable roles in different aspects of food seeking and food intake, and part of this function depends on integration with the lateral hypothalamus and amygdala. We propose that the nucleus accumbens integrates information related to cognitive, sensory, and emotional processing with hypothalamic mechanisms mediating energy balance. This system as a whole enables complex hierarchical control of adaptive ingestive behavior. Regarding the second research area, our studies examining acquisition of lever-pressing for food in rats have shown that activation of glutamate N-methyl-d-aspartate (NMDA) receptors, within broadly distributed but interconnected regions (nucleus accumbens core, posterior striatum, prefrontal cortex, basolateral and central amygdala), is critical for such learning to occur. This receptor stimulation triggers intracellular cascades that involve protein phosphorylation and new protein synthesis. It is hypothesized that activity in this distributed network (including D1 receptor activity) computes coincident events and thus enhances the probability that temporally related actions and events (e.g. lever pressing and delivery of reward) become associated. Such basic mechanisms of plasticity within this reinforcement learning network also appear to be profoundly affected in addiction.

  3. Stochastic reinforcement benefits skill acquisition.

    PubMed

    Dayan, Eran; Averbeck, Bruno B; Richmond, Barry J; Cohen, Leonardo G

    2014-02-14

    Learning complex skills is driven by reinforcement, which facilitates both online within-session gains and retention of the acquired skills. Yet, in ecologically relevant situations, skills are often acquired when mapping between actions and rewarding outcomes is unknown to the learning agent, resulting in reinforcement schedules of a stochastic nature. Here we trained subjects on a visuomotor learning task, comparing reinforcement schedules with higher, lower, or no stochasticity. Training under higher levels of stochastic reinforcement benefited skill acquisition, enhancing both online gains and long-term retention. These findings indicate that the enhancing effects of reinforcement on skill acquisition depend on reinforcement schedules.

  4. Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback.

    PubMed

    Tan, A H; Lu, N; Xiao, D

    2008-02-01

    This paper presents a neural architecture for learning category nodes encoding mappings across multimodal patterns involving sensory inputs, actions, and rewards. By integrating adaptive resonance theory (ART) and temporal difference (TD) methods, the proposed neural model, called TD fusion architecture for learning, cognition, and navigation (TD-FALCON), enables an autonomous agent to adapt and function in a dynamic environment with immediate as well as delayed evaluative feedback (reinforcement) signals. TD-FALCON learns the value functions of the state-action space estimated through on-policy and off-policy TD learning methods, specifically state-action-reward-state-action (SARSA) and Q-learning. The learned value functions are then used to determine the optimal actions based on an action selection policy. We have developed TD-FALCON systems using various TD learning strategies and compared their performance in terms of task completion, learning speed, as well as time and space efficiency. Experiments based on a minefield navigation task have shown that TD-FALCON systems are able to learn effectively with both immediate and delayed reinforcement and achieve a stable performance at a pace much faster than those of standard gradient-descent-based reinforcement learning systems.
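
    The two TD methods named in this record differ only in their bootstrap target. A generic tabular sketch in Python (not the TD-FALCON code itself), with Q as a per-state list of action values:

      def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
          # On-policy SARSA: bootstrap on the action actually taken next.
          Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

      def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
          # Off-policy Q-learning: bootstrap on the best available next action.
          Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

      # usage: Q = [[0.0] * n_actions for _ in range(n_states)]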

  5. Dopamine-mediated reinforcement learning signals in the striatum and ventromedial prefrontal cortex underlie value-based choices.

    PubMed

    Jocham, Gerhard; Klein, Tilmann A; Ullsperger, Markus

    2011-02-02

    A large body of evidence exists on the role of dopamine in reinforcement learning. Less is known about how dopamine shapes the relative impact of positive and negative outcomes to guide value-based choices. We combined administration of the dopamine D(2) receptor antagonist amisulpride with functional magnetic resonance imaging in healthy human volunteers. Amisulpride did not affect initial reinforcement learning. However, in a later transfer phase that involved novel choice situations requiring decisions between two symbols based on their previously learned values, amisulpride improved participants' ability to select the better of two highly rewarding options, while it had no effect on choices between two very poor options. During the learning phase, activity in the striatum encoded a reward prediction error. In the transfer phase, in the absence of any outcome, ventromedial prefrontal cortex (vmPFC) continually tracked the learned value of the available options on each trial. Both striatal prediction error coding and tracking of learned value in the vmPFC were predictive of subjects' choice performance in the transfer phase, and both were enhanced under amisulpride. These findings show that dopamine-dependent mechanisms enhance reinforcement learning signals in the striatum and sharpen representations of associative values in prefrontal cortex that are used to guide reinforcement-based decisions.

  6. Affective personality predictors of disrupted reward learning and pursuit in major depressive disorder.

    PubMed

    DelDonno, Sophie R; Weldon, Anne L; Crane, Natania A; Passarotti, Alessandra M; Pruitt, Patrick J; Gabriel, Laura B; Yau, Wendy; Meyers, Kortni K; Hsu, David T; Taylor, Stephen F; Heitzeg, Mary M; Herbener, Ellen; Shankman, Stewart A; Mickey, Brian J; Zubieta, Jon-Kar; Langenecker, Scott A

    2015-11-30

    Anhedonia, the diminished anticipation and pursuit of reward, is a core symptom of major depressive disorder (MDD). Trait behavioral activation (BA), as a proxy for anhedonia, and behavioral inhibition (BI) may moderate the relationship between MDD and reward-seeking. The present studies probed for reward learning deficits, potentially due to aberrant BA and/or BI, in active or remitted MDD individuals compared to healthy controls (HC). Active MDD (Study 1) and remitted MDD (Study 2) participants completed the modified monetary incentive delay task (mMIDT), a behavioral reward-seeking task whose response window parameters were individually titrated to theoretically elicit equivalent accuracy between groups. Participants completed the BI Scale and BA Reward-Responsiveness and Drive Scales. Despite individual titration, active MDD participants won significantly less money than HCs. Higher Reward-Responsiveness scores predicted more money won; Drive and BI were not predictive. Remitted MDD participants' performance did not differ from controls', and trait BA and BI measures did not predict r-MDD performance. These results suggest that diminished reward-responsiveness may contribute to decreased motivation and reward pursuit during active MDD, but that reward learning is intact in remission. Understanding individual reward processing deficits in MDD may inform personalized intervention addressing anhedonia and motivation deficits in select MDD patients. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  7. Rewarding properties of visual stimuli.

    PubMed

    Blatter, Katharina; Schultz, Wolfram

    2006-01-01

    The behavioral functions of rewards comprise the induction of learning and approach behavior. Rewards are not only related to vegetative states of hunger, thirst and reproduction but may also consist of visual stimuli. The present experiment tested the reward potential of different types of still and moving pictures in three operant tasks involving key press, touch of computer monitor and choice behavior in a laboratory environment. We found that all tested visual stimuli induced approach behavior in all three tasks, and that action movies sustained consistently higher rates of responding compared to changing still pictures, which were more effective than constant still pictures. These results demonstrate that visual stimuli can serve as positive reinforcers for operant reactions of animals in controlled laboratory settings. In particular, the coherently animated visual stimuli of movies have considerable reward potential. These observations would allow similar forms of visual rewards to be used for neurophysiological investigations of mechanisms related to non-vegetative rewards.

  8. Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats

    PubMed Central

    Lloyd, Kevin; Becker, Nadine; Jones, Matthew W.; Bogacz, Rafal

    2012-01-01

    Learning to form appropriate, task-relevant working memory representations is a complex process central to cognition. Gating models frame working memory as a collection of past observations and use reinforcement learning (RL) to solve the problem of when to update these observations. Investigation of how gating models relate to brain and behavior remains, however, at an early stage. The current study sought to explore the ability of simple RL gating models to replicate rule learning behavior in rats. Rats were trained in a maze-based spatial learning task that required animals to make trial-by-trial choices contingent upon their previous experience. Using an abstract version of this task, we tested the ability of two gating algorithms, one based on the Actor-Critic and the other on the State-Action-Reward-State-Action (SARSA) algorithm, to generate behavior consistent with the rats'. Both models produced rule-acquisition behavior consistent with the experimental data, though only the SARSA gating model mirrored faster learning following rule reversal. We also found that both gating models learned multiple strategies in solving the initial task, a property which highlights the multi-agent nature of such models and which is of importance in considering the neural basis of individual differences in behavior. PMID:23115551

  9. What motivates adolescents? Neural responses to rewards and their influence on adolescents' risk taking, learning, and cognitive control.

    PubMed

    van Duijvenvoorde, Anna C K; Peters, Sabine; Braams, Barbara R; Crone, Eveline A

    2016-11-01

    Adolescence is characterized by pronounced changes in motivated behavior, during which emphasis on potential rewards may result in an increased tendency to approach things that are novel and bring potential for positive reinforcement. While this may result in risky and health-endangering behavior, it may also lead to positive consequences, such as behavioral flexibility and greater learning. In this review we will discuss both the maladaptive and adaptive properties of heightened reward-sensitivity in adolescents by reviewing recent cognitive neuroscience findings in relation to behavioral outcomes. First, we identify brain regions involved in processing rewards in adults and adolescents. Second, we discuss how functional changes in reward-related brain activity during adolescence are related to two behavioral domains: risk taking and cognitive control. Finally, we conclude that progress lies in new levels of explanation by further integration of neural results with behavioral theories and computational models. In addition, we highlight that longitudinal measures, and a better conceptualization of adolescence and environmental determinants, are of crucial importance for understanding positive and negative developmental trajectories.

  10. Connectionist reinforcement learning of robot control skills

    NASA Astrophysics Data System (ADS)

    Araújo, Rui; Nunes, Urbano; de Almeida, A. T.

    1998-07-01

    Many robot manipulator tasks are difficult to model explicitly and it is difficult to design and program automatic control algorithms for them. The development, improvement, and application of learning techniques taking advantage of sensory information would enable the acquisition of new robot skills and avoid some of the difficulties of explicit programming. In this paper we use a reinforcement learning approach for on-line generation of skills for control of robot manipulator systems. Instead of generating skills by explicit programming of a perception to action mapping, they are generated by trial and error learning, guided by a performance evaluation feedback function. The resulting system may be seen as an anticipatory system that constructs an internal representation model of itself and of its environment. This enables it to identify its current situation and to generate corresponding appropriate commands to the system in order to perform the required skill. The method was applied to the problem of learning a force control skill in which the tool-tip of a robot manipulator must be moved from free space to a contact state with a compliant surface while maintaining a constant interaction force.

  11. Stochastic Reinforcement Benefits Skill Acquisition

    ERIC Educational Resources Information Center

    Dayan, Eran; Averbeck, Bruno B.; Richmond, Barry J.; Cohen, Leonardo G.

    2014-01-01

    Learning complex skills is driven by reinforcement, which facilitates both online within-session gains and retention of the acquired skills. Yet, in ecologically relevant situations, skills are often acquired when mapping between actions and rewarding outcomes is unknown to the learning agent, resulting in reinforcement schedules of a stochastic…

  13. Updating dopamine reward signals

    PubMed Central

    Schultz, Wolfram

    2013-01-01

    Recent work has advanced our knowledge of phasic dopamine reward prediction error signals. The error signal is bidirectional, reflects well the higher order prediction error described by temporal difference learning models, is compatible with model-free and model-based reinforcement learning, reports the subjective rather than physical reward value during temporal discounting and reflects subjective stimulus perception rather than physical stimulus aspects. Dopamine activations are primarily driven by reward, and to some extent risk, whereas punishment and salience have only limited activating effects when appropriate controls are respected. The signal is homogeneous in terms of time course but heterogeneous in many other aspects. It is essential for synaptic plasticity and a range of behavioural learning situations. PMID:23267662
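
    For reference, the temporal-difference prediction error that such phasic dopamine responses are thought to report is conventionally written (in standard notation, not notation specific to this review) as

      \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

    where a positive \delta_t marks an outcome better than predicted and a negative \delta_t one worse than predicted, matching the bidirectional signal described above.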

  14. Choice as a function of reinforcer "hold": from probability learning to concurrent reinforcement.

    PubMed

    Jensen, Greg; Neuringer, Allen

    2008-10-01

    Two procedures commonly used to study choice are concurrent reinforcement and probability learning. Under concurrent-reinforcement procedures, once a reinforcer is scheduled, it remains available indefinitely until collected. Therefore reinforcement becomes increasingly likely with passage of time or responses on other operanda. Under probability learning, reinforcer probabilities are constant and independent of passage of time or responses. Therefore a particular reinforcer is gained or not, on the basis of a single response, and potential reinforcers are not retained, as when betting at a roulette wheel. In the "real" world, continued availability of reinforcers often lies between these two extremes, with potential reinforcers being lost owing to competition, maturation, decay, and random scatter. The authors parametrically manipulated the likelihood of continued reinforcer availability, defined as hold, and examined the effects on pigeons' choices. Choices varied as power functions of obtained reinforcers under all values of hold. Stochastic models provided generally good descriptions of choice emissions with deviations from stochasticity systematically related to hold. Thus, a single set of principles accounted for choices across hold values that represent a wide range of real-world conditions.
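
    The hold manipulation is easy to state procedurally: each operandum is armed with a fixed probability per trial, and an uncollected reinforcer persists to the next trial with probability hold, so hold = 1 approximates concurrent reinforcement and hold = 0 approximates probability learning. The Python sketch below illustrates such a schedule; the arming probabilities and the random stand-in for the subject's choices are invented.

      import numpy as np

      rng = np.random.default_rng(3)

      def run_session(hold, p_arm=(0.10, 0.05), n_trials=5000):
          armed = [False, False]
          earned = [0, 0]
          for _ in range(n_trials):
              for i in range(2):
                  armed[i] = armed[i] or (rng.random() < p_arm[i])
              choice = int(rng.integers(0, 2))      # stand-in for the subject's choice
              if armed[choice]:
                  earned[choice] += 1               # collect the scheduled reinforcer
                  armed[choice] = False
              other = 1 - choice
              if armed[other] and rng.random() > hold:
                  armed[other] = False              # uncollected reinforcer is lost
          return earned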

  15. Reinforcement Learning for the Adaptive Control of Perception and Action

    DTIC Science & Technology

    1992-02-01

    This dissertation applies reinforcement learning to the adaptive control of active sensory-motor systems. Active sensory-motor systems, in addition...distinct states in the external world. This phenomenon, called perceptual aliasing, is shown to destabilize existing reinforcement learning algorithms

  16. Reinforcement of Science Learning through Local Culture: A Delphi Study

    ERIC Educational Resources Information Center

    Nuangchalerm, Prasart

    2008-01-01

    This study aims to explore the ways to reinforce science learning through local culture by using Delphi technique. Twenty four participants in various fields of study were selected. The result of study provides a framework for reinforcement of science learning through local culture on the theme life and environment. (Contains 1 table.)

  17. Punishment Insensitivity and Impaired Reinforcement Learning in Preschoolers

    ERIC Educational Resources Information Center

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2014-01-01

    Background: Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a…

  18. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia

    PubMed Central

    Markou, Athina; Salamone, John D.; Bussey, Timothy; Mar, Adam; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-01-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. PMID:23994273

  19. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia.

    PubMed

    Markou, Athina; Salamone, John D; Bussey, Timothy J; Mar, Adam C; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-11-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits.

  20. DAT isn't all that: cocaine reward and reinforcement require Toll-like receptor 4 signaling.

    PubMed

    Northcutt, A L; Hutchinson, M R; Wang, X; Baratta, M V; Hiranita, T; Cochran, T A; Pomrenze, M B; Galer, E L; Kopajtic, T A; Li, C M; Amat, J; Larson, G; Cooper, D C; Huang, Y; O'Neill, C E; Yin, H; Zahniser, N R; Katz, J L; Rice, K C; Maier, S F; Bachtell, R K; Watkins, L R

    2015-12-01

    The initial reinforcing properties of drugs of abuse, such as cocaine, are largely attributed to their ability to activate the mesolimbic dopamine system. Resulting increases in extracellular dopamine in the nucleus accumbens (NAc) are traditionally thought to result from cocaine's ability to block dopamine transporters (DATs). Here we demonstrate that cocaine also interacts with the immunosurveillance receptor complex, Toll-like receptor 4 (TLR4), on microglial cells to initiate central innate immune signaling. Disruption of cocaine signaling at TLR4 suppresses cocaine-induced extracellular dopamine in the NAc, as well as cocaine conditioned place preference and cocaine self-administration. These results provide a novel understanding of the neurobiological mechanisms underlying cocaine reward/reinforcement that includes a critical role for central immune signaling, and offer a new target for medication development for cocaine abuse treatment.

  1. A Discussion of Possibility of Reinforcement Learning Using Event-Related Potential in BCI

    NASA Astrophysics Data System (ADS)

    Yamagishi, Yuya; Tsubone, Tadashi; Wada, Yasuhiro

    Recently, brain-computer interfaces (BCIs), which provide a direct pathway between a human brain and an external device such as a computer or a robot, have attracted a great deal of attention. Because a BCI can control machines such as robots through brain activity alone, without recourse to voluntary muscles, it may become a useful communication tool for handicapped persons, for instance, amyotrophic lateral sclerosis patients. However, in order to realize a BCI system that can perform precise tasks in various environments, it is necessary to design control rules that adapt to dynamic environments. Reinforcement learning is one approach to designing such control rules. If reinforcement learning could be driven by brain activity itself, it would lead to a BCI with general versatility. In this research, we focused on the P300 event-related potential as an alternative to the reward signal in reinforcement learning. We discriminated between success and failure trials from single-trial EEG P300 responses using a proposed discrimination algorithm based on a support vector machine. The possibility of reinforcement learning was examined in terms of the number of correctly discriminated trials, and it was shown that learning would be possible in most subjects.
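
    Single-trial discrimination of this kind is commonly implemented with a linear SVM over post-feedback EEG samples. A minimal Python sketch using scikit-learn follows; the feature matrix here is random placeholder data, not the study's EEG, and the paper's actual algorithm may differ in preprocessing and kernel choice.

      import numpy as np
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      rng = np.random.default_rng(4)

      # placeholder single-trial features, e.g. amplitudes in a window around 300 ms
      X = rng.normal(size=(200, 64))          # 200 trials x 64 features
      y = rng.integers(0, 2, size=200)        # success (1) vs. failure (0) labels

      clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
      scores = cross_val_score(clf, X, y, cv=5)   # per-fold classification accuracy
      print(scores.mean())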

  2. Comparing the neural basis of monetary reward and cognitive feedback during information-integration category learning.

    PubMed

    Daniel, Reka; Pollmann, Stefan

    2010-01-06

    The dopaminergic system is known to play a central role in reward-based learning (Schultz, 2006), yet it was also observed to be involved when only cognitive feedback is given (Aron et al., 2004). Within the domain of information-integration category learning, in which information from several stimulus dimensions has to be integrated predecisionally (Ashby and Maddox, 2005), the importance of contingent feedback is well established (Maddox et al., 2003). We examined the common neural correlates of reward anticipation and prediction error in this task. Sixteen subjects performed two parallel information-integration tasks within a single event-related functional magnetic resonance imaging session but received a monetary reward only for one of them. Similar functional areas including basal ganglia structures were activated in both task versions. In contrast, a single structure, the nucleus accumbens, showed higher activation during monetary reward anticipation compared with the anticipation of cognitive feedback in information-integration learning. Additionally, this activation was predicted by measures of intrinsic motivation in the cognitive feedback task and by measures of extrinsic motivation in the rewarded task. Our results indicate that, although all other structures implicated in category learning are not significantly affected by altering the type of reward, the nucleus accumbens responds to the positive incentive properties of an expected reward depending on the specific type of the reward.

  3. Stimulus-Reward Association and Reversal Learning in Individuals with Asperger Syndrome

    ERIC Educational Resources Information Center

    Zalla, Tiziana; Sav, Anca-Maria; Leboyer, Marion

    2009-01-01

    In the present study, performance of a group of adults with Asperger Syndrome (AS) on two series of object reversal and extinction was compared with that of a group of adults with typical development. Participants were requested to learn a stimulus-reward association rule and monitor changes in reward value of stimuli in order to gain as many…

  4. The Roles of Dopamine and Related Compounds in Reward-Seeking Behavior Across Animal Phyla

    PubMed Central

    Barron, Andrew B.; Søvik, Eirik; Cornish, Jennifer L.

    2010-01-01

    Motile animals actively seek out and gather resources they find rewarding, and this is an extremely powerful organizer and motivator of animal behavior. Mammalian studies have revealed interconnected neurobiological systems for reward learning, reward assessment, reinforcement and reward-seeking; all involving the biogenic amine dopamine. The neurobiology of reward-seeking behavioral systems is less well understood in invertebrates, but in many diverse invertebrate groups, reward learning and responses to food rewards also involve dopamine. The obvious exceptions are the arthropods in which the chemically related biogenic amine octopamine has a greater effect on reward learning and reinforcement than dopamine. Here we review the functions of these biogenic amines in behavioral responses to rewards in different animal groups, and discuss these findings in an evolutionary context. PMID:21048897

  5. Episodic memory encoding interferes with reward learning and decreases striatal prediction errors.

    PubMed

    Wimmer, G Elliott; Braun, Erin Kendall; Daw, Nathaniel D; Shohamy, Daphna

    2014-11-05

    Learning is essential for adaptive decision making. The striatum and its dopaminergic inputs are known to support incremental reward-based learning, while the hippocampus is known to support encoding of single events (episodic memory). Although traditionally studied separately, in even simple experiences, these two types of learning are likely to co-occur and may interact. Here we sought to understand the nature of this interaction by examining how incremental reward learning is related to concurrent episodic memory encoding. During the experiment, human participants made choices between two options (colored squares), each associated with a drifting probability of reward, with the goal of earning as much money as possible. Incidental, trial-unique object pictures, unrelated to the choice, were overlaid on each option. The next day, participants were given a surprise memory test for these pictures. We found that better episodic memory was related to a decreased influence of recent reward experience on choice, both within and across participants. fMRI analyses further revealed that during learning the canonical striatal reward prediction error signal was significantly weaker when episodic memory was stronger. This decrease in reward prediction error signals in the striatum was associated with enhanced functional connectivity between the hippocampus and striatum at the time of choice. Our results suggest a mechanism by which memory encoding may compete for striatal processing and provide insight into how interactions between different forms of learning guide reward-based decision making.

  6. Episodic Memory Encoding Interferes with Reward Learning and Decreases Striatal Prediction Errors

    PubMed Central

    Braun, Erin Kendall; Daw, Nathaniel D.

    2014-01-01

    Learning is essential for adaptive decision making. The striatum and its dopaminergic inputs are known to support incremental reward-based learning, while the hippocampus is known to support encoding of single events (episodic memory). Although traditionally studied separately, in even simple experiences, these two types of learning are likely to co-occur and may interact. Here we sought to understand the nature of this interaction by examining how incremental reward learning is related to concurrent episodic memory encoding. During the experiment, human participants made choices between two options (colored squares), each associated with a drifting probability of reward, with the goal of earning as much money as possible. Incidental, trial-unique object pictures, unrelated to the choice, were overlaid on each option. The next day, participants were given a surprise memory test for these pictures. We found that better episodic memory was related to a decreased influence of recent reward experience on choice, both within and across participants. fMRI analyses further revealed that during learning the canonical striatal reward prediction error signal was significantly weaker when episodic memory was stronger. This decrease in reward prediction error signals in the striatum was associated with enhanced functional connectivity between the hippocampus and striatum at the time of choice. Our results suggest a mechanism by which memory encoding may compete for striatal processing and provide insight into how interactions between different forms of learning guide reward-based decision making. PMID:25378157

  7. Establishing the dopamine dependency of human striatal signals during reward and punishment reversal learning.

    PubMed

    van der Schaaf, Marieke E; van Schouwenburg, Martine R; Geurts, Dirk E M; Schellekens, Arnt F A; Buitelaar, Jan K; Verkes, Robbert Jan; Cools, Roshan

    2014-03-01

    Drugs that alter dopamine transmission have opposite effects on reward and punishment learning. These opposite effects have been suggested to depend on dopamine in the striatum. Here, we establish for the first time the neurochemical specificity of such drug effects, during reward and punishment learning in humans, by adopting a coadministration design. Participants (N = 22) were scanned on 4 occasions using functional magnetic resonance imaging, following intake of placebo, bromocriptine (dopamine-receptor agonist), sulpiride (dopamine-receptor antagonist), or a combination of both drugs. A reversal-learning task was employed, in which both unexpected rewards and punishments signaled reversals. Drug effects were stratified with baseline working memory to take into account individual variations in drug response. Sulpiride induced parallel span-dependent changes on striatal blood oxygen level-dependent (BOLD) signal during unexpected rewards and punishments. These drug effects were found to be partially dopamine-dependent, as they were blocked by coadministration with bromocriptine. In contrast, sulpiride elicited opposite effects on behavioral measures of reward and punishment learning. Moreover, sulpiride-induced increases in striatal BOLD signal during both outcomes were associated with behavioral improvement in reward versus punishment learning. These results provide a strong support for current theories, suggesting that drug effects on reward and punishment learning are mediated via striatal dopamine.

  8. CLEANing the Reward: Counterfactual Actions to Remove Exploratory Action Noise in Multiagent Learning

    NASA Technical Reports Server (NTRS)

    HolmesParker, Chris; Taylor, Mathew E.; Tumer, Kagan; Agogino, Adrian

    2014-01-01

    Learning in multiagent systems can be slow because agents must learn both how to behave in a complex environment and how to account for the actions of other agents. The inability of an agent to distinguish between the true environmental dynamics and those caused by the stochastic exploratory actions of other agents creates noise in each agent's reward signal. This learning noise can have unforeseen and often undesirable effects on the resultant system performance. We define such noise as exploratory action noise, demonstrate the critical impact it can have on the learning process in multiagent settings, and introduce a reward structure to effectively remove such noise from each agent's reward signal. In particular, we introduce Coordinated Learning without Exploratory Action Noise (CLEAN) rewards and empirically demonstrate their benefits.
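
    CLEAN builds on counterfactual reward shaping of this general kind. The Python sketch below shows a generic difference-style reward, scoring an agent by how much the global utility changes when its action is swapped for a fixed default; it illustrates the principle, not the paper's exact reward definition.

      def counterfactual_reward(joint_action, i, default_action, global_utility):
          """Return the change in global utility attributable to agent i
          choosing its current action instead of a fixed default, so that
          the stochastic choices of other agents largely cancel out of
          agent i's signal. `global_utility` maps a joint action (list)
          to a scalar."""
          g = global_utility(joint_action)
          cf = list(joint_action)
          cf[i] = default_action
          return g - global_utility(cf)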

  9. Dissecting components of reward: ‘liking’, ‘wanting’, and learning

    PubMed Central

    Berridge, Kent C; Robinson, Terry E; Aldridge, J Wayne

    2009-01-01

    In recent years significant progress has been made delineating the psychological components of reward and their underlying neural mechanisms. Here we briefly highlight findings on three dissociable psychological components of reward: ‘liking’ (hedonic impact), ‘wanting’ (incentive salience), and learning (predictive associations and cognitions). A better understanding of the components of reward, and their neurobiological substrates, may help in devising improved treatments for disorders of mood and motivation, ranging from depression to eating disorders, drug addiction, and related compulsive pursuits of rewards. PMID:19162544

  10. Reinforcement learning and counterfactual reasoning explain adaptive behavior in a changing environment.

    PubMed

    Zhang, Yunfeng; Paik, Jaehyon; Pirolli, Peter

    2015-04-01

    Animals routinely adapt to changes in the environment in order to survive. Though reinforcement learning may play a role in such adaptation, it is not clear that it is the only mechanism involved, as it is not well suited to producing rapid, relatively immediate changes in strategy in response to environmental changes. This research proposes that counterfactual reasoning might be an additional mechanism that facilitates change detection. An experiment was conducted in which the task state changed over time and participants had to detect the changes in order to perform well and gain monetary rewards. A cognitive model was constructed that combines reinforcement learning with counterfactual reasoning to quickly adjust the utility of task strategies in response to changes. The results show that the model accurately explains the human data and that counterfactual reasoning is key to reproducing the various effects observed in this change-detection paradigm. Copyright © 2015 Cognitive Science Society, Inc.
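
    The model itself is not reproduced in this record; as a minimal sketch of the mechanism described (names are illustrative, not the authors'), the Python fragment below applies a standard reinforcement-learning update to the chosen strategy and counterfactual updates to the strategies not chosen, so all utilities can shift quickly when the environment changes.

      def update_strategy_utilities(utilities, chosen, payoff, foregone, alpha=0.2):
          """utilities: dict strategy -> utility; foregone: dict of
          counterfactually inferred payoffs for strategies not taken."""
          utilities[chosen] += alpha * (payoff - utilities[chosen])  # RL update
          for strategy, cf_payoff in foregone.items():
              if strategy != chosen:
                  # Counterfactual reasoning: learn from what would have happened.
                  utilities[strategy] += alpha * (cf_payoff - utilities[strategy])
          return utilities

      # Example: strategy A paid off; B would not have.
      U = update_strategy_utilities({"A": 0.5, "B": 0.5}, "A",
                                    payoff=1.0, foregone={"B": 0.0})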

  11. Reinforcement learning in complementarity game and population dynamics.

    PubMed

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005)] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This comparison gives insight into issues such as quick adaptation versus systematic exploration and the role of learning rates.
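
    For readers unfamiliar with the three schemes, the Python sketch below gives minimal textbook forms (the paper's exact parameterizations, such as forgetting and experimentation terms, are omitted); as one plausible reading of the abstract, the "modified Roth-Erev" is rendered by raising propensities to the power 1.5 in the choice rule.

      import math

      def roth_erev_probs(propensities, power=1.5):
          """power=1.0 recovers the standard Roth-Erev choice rule."""
          w = [q ** power for q in propensities]
          return [wi / sum(w) for wi in w]

      def roth_erev_update(propensities, action, payoff):
          propensities[action] += payoff       # accumulate received payoffs
          return propensities

      def softmax_probs(values, temperature=0.1):
          w = [math.exp(v / temperature) for v in values]
          return [wi / sum(w) for wi in w]

      def bush_mosteller_update(probs, action, payoff, rate=0.1):
          """Shift choice probabilities toward the action just taken,
          in proportion to the payoff received (payoff in [0, 1])."""
          target = [1.0 if a == action else 0.0 for a in range(len(probs))]
          return [p + rate * payoff * (t - p) for p, t in zip(probs, target)]

      # Example round of the modified Roth-Erev scheme:
      q = [1.0, 1.0]
      probs = roth_erev_probs(q)               # [0.5, 0.5]
      q = roth_erev_update(q, action=0, payoff=1.0)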

  12. Response-reinforcement learning is dependent on N-methyl-D-aspartate receptor activation in the nucleus accumbens core.

    PubMed

    Kelley, A E; Smith-Roe, S L; Holahan, M R

    1997-10-28

    The nucleus accumbens, a site within the ventral striatum, is best known for its prominent role in mediating the reinforcing effects of drugs of abuse such as cocaine, alcohol, and nicotine. Indeed, it is generally believed that this structure subserves motivated behaviors, such as feeding, drinking, sexual behavior, and exploratory locomotion, which are elicited by natural rewards or incentive stimuli. A basic rule of positive reinforcement is that motor responses will increase in magnitude and vigor if followed by a rewarding event. It is likely, therefore, that the nucleus accumbens may serve as a substrate for reinforcement learning. However, there is surprisingly little information concerning the neural mechanisms by which appetitive responses are learned. In the present study, we report that treatment of the nucleus accumbens core with the selective competitive N-methyl-D-aspartate (NMDA) antagonist 2-amino-5-phosphonopentanoic acid (AP-5; 5 nmol/0.5 microl bilaterally) impairs response-reinforcement learning in the acquisition of a simple lever-press task to obtain food. Once the rats learned the task, AP-5 had no effect, demonstrating the requirement of NMDA receptor-dependent plasticity in the early stages of learning. Infusion of AP-5 into the accumbens shell produced a much smaller impairment of learning. Additional experiments showed that AP-5 core-treated rats had normal feeding and locomotor responses and were capable of acquiring stimulus-reward associations. We hypothesize that stimulation of NMDA receptors within the accumbens core is a key process through which motor responses become established in response to reinforcing stimuli. Further, this mechanism may also play a critical role in the motivational and addictive properties of drugs of abuse.

  13. Affective modulation of the startle reflex and the Reinforcement Sensitivity Theory of personality: The role of sensitivity to reward.

    PubMed

    Aluja, Anton; Blanch, Angel; Blanco, Eduardo; Balada, Ferran

    2015-01-01

    This study evaluated differences in startle reflex amplitude in relation to the Sensitivity to Reward (SR) and Sensitivity to Punishment (SP) personality variables of the Reinforcement Sensitivity Theory (RST). We hypothesized that subjects with higher scores on SR would show a larger startle reflex when exposed to pleasant pictures than those with lower scores, while subjects with higher scores on SP would show a larger startle reflex when exposed to unpleasant pictures than those with lower scores on this dimension. The sample consisted of 112 healthy female undergraduate psychology students. Personality was assessed using the short version of the Sensitivity to Punishment and Sensitivity to Reward Questionnaire (SPSRQ). Laboratory anxiety was controlled with the State Anxiety Inventory. The startle blink reflex was recorded electromyographically (EMG) from the right orbicularis oculi muscle in response to pleasant, neutral, and unpleasant pictures from the International Affective Picture System (IAPS). Subjects higher in SR showed a significantly larger startle reflex response to pleasant pictures than lower scorers (48.48 vs 46.28, p<0.012). Subjects with higher scores on SP showed a slight tendency toward larger startle responses to unpleasant pictures in a non-parametric local regression (LOESS) graphical analysis. The findings shed light on the relationships among impulsive-disinhibited personality, sensitivity to reward, and emotions evoked by pictures with emotional content. Copyright © 2014 Elsevier Inc. All rights reserved.

  14. Relative reinforcing value of food and delayed reward discounting in obesity and disordered eating: A systematic review.

    PubMed

    Stojek, Monika M K; MacKillop, James

    2017-07-01

    Understanding food choice decision-making may help identify those at higher risk for excess weight gain and dysregulated eating patterns. This paper systematically reviews the literature on eating behavior and the behavioral economic constructs of relative reinforcing value of food (RRVfood) and delayed reward discounting (DRD). RRVfood characterizes how valuable energy-dense food is to the individual, and DRD characterizes preferences for smaller immediate rewards over larger future rewards, an index of impulsivity. A literature search of PubMed was conducted using combinations of terms involving behavioral economics and dysregulated eating in youth and adults. Forty-seven articles were reviewed. There is consistent evidence that obese youth and adults exhibit higher RRVfood. There is a need for more research on the role of RRVfood in eating disorders, as too few studies exist to draw meaningful conclusions. There is accumulating evidence that obese individuals show steeper DRD, but the study of moderators of this relationship is crucial. Only a small number of studies have been conducted on DRD and binge eating, and no clear conclusions can be drawn at present. Approximately half of the existing studies suggest lower DRD in individuals with anorexia nervosa. Research implications and treatment applications are discussed. Copyright © 2017 Elsevier Ltd. All rights reserved.
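
    The review does not commit to a functional form, but DRD in this literature is commonly quantified with a hyperbolic discount function V = A / (1 + kD), where a steeper k indexes greater impulsivity. Under that assumption, and with illustrative amounts, the worked Python example below shows how k determines the choice between a smaller immediate and a larger delayed reward.

      def discounted_value(amount, delay_days, k):
          """Hyperbolic discounting: present value of a delayed reward."""
          return amount / (1.0 + k * delay_days)

      # $100 in 30 days versus $60 now, for a patient and an impulsive k:
      for k in (0.01, 0.1):
          v = discounted_value(100, 30, k)
          choice = "delayed" if v > 60 else "immediate"
          print(f"k={k}: delayed reward is worth {v:.1f} now -> chooses {choice}")

    With k=0.01 the delayed $100 is worth about $76.9 now and is preferred; with k=0.1 it is worth only $25.0, so the smaller immediate reward wins, which is the impulsive pattern described above.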

  15. Value Learning and Arousal in the Extinction of Probabilistic Rewards: The Role of Dopamine in a Modified Temporal Difference Model

    PubMed Central

    Song, Minryung R.; Fellous, Jean-Marc

    2014-01-01

    Because most rewarding events are probabilistic and changing, the extinction of probabilistic rewards is important for survival. It has been proposed that the extinction of probabilistic rewards depends on arousal and the amount of learning of reward values. Midbrain dopamine neurons were suggested to play a role in both arousal and learning reward values. Despite extensive research on modeling dopaminergic activity in reward learning (e.g. temporal difference models), few studies have been done on modeling its role in arousal. Although temporal difference models capture key characteristics of dopaminergic activity during the extinction of deterministic rewards, they have been less successful at simulating the extinction of probabilistic rewards. By adding an arousal signal to a temporal difference model, we were able to simulate the extinction of probabilistic rewards and its dependence on the amount of learning. Our simulations propose that arousal allows the probability of reward to have lasting effects on the updating of reward value, which slows the extinction of low probability rewards. Using this model, we predicted that, by signaling the prediction error, dopamine determines the learned reward value that has to be extinguished during extinction and participates in regulating the size of the arousal signal that controls the learning rate. These predictions were supported by pharmacological experiments in rats. PMID:24586823
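
    The published equations are not reproduced in this record; one plausible minimal reading of the proposal, sketched in Python below, is a standard temporal difference update whose learning rate is gated by an arousal signal that itself tracks recent surprise.

      def td_arousal_step(value, reward, arousal, alpha=0.1, arousal_rate=0.05):
          """One extinction trial for a single predictive state (a sketch,
          not the authors' exact model)."""
          delta = reward - value                             # prediction error
          value += alpha * arousal * delta                   # arousal gates learning
          arousal += arousal_rate * (abs(delta) - arousal)   # arousal tracks surprise
          return value, arousal, delta

      # Example extinction trial: reward omitted after partial training.
      value, arousal, delta = td_arousal_step(value=0.4, reward=0.0, arousal=0.3)

    Under this sketch, values acquired under low reward probability enter extinction with lower arousal and hence a lower effective learning rate, which is one way to produce the slowed extinction of low-probability rewards that the abstract describes.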

  17. The role of GABAB receptors in human reinforcement learning.

    PubMed

    Ort, Andres; Kometer, Michael; Rohde, Judith; Seifritz, Erich; Vollenweider, Franz X

    2014-10-01

    Behavioral evidence from human studies suggests that the γ-aminobutyric acid type B receptor (GABAB receptor) agonist baclofen modulates reinforcement learning and reduces craving in patients with addiction spectrum disorders. However, in contrast to the well-established role of dopamine in reinforcement learning, the mechanisms by which the GABAB receptor influences reinforcement learning in humans remain completely unknown. To further elucidate this issue, a cross-over, double-blind, placebo-controlled study was performed in healthy human subjects (N=15) to test the effects of baclofen (20 and 50 mg p.o.) on probabilistic reinforcement learning. Outcomes were the feedback-induced P2 component of the event-related potential, the feedback-related negativity, and the P300 component of the event-related potential. Baclofen produced a reduction of P2 amplitude over the course of the experiment, but did not modulate the feedback-related negativity. Furthermore, there was a trend towards increased learning after baclofen administration relative to placebo over the course of the experiment. The present results extend previous theories of reinforcement learning, which focus on the importance of mesolimbic dopamine signaling, and indicate that stimulation of cortical GABAB receptors in a fronto-parietal network leads to better attentional allocation in reinforcement learning. This observation is a first step in our understanding of how baclofen may improve reinforcement learning in healthy subjects. Further studies with larger sample sizes are needed to corroborate this conclusion and to test this effect in patients with addiction spectrum disorders.

  18. How Food as a Reward Is Detrimental to Children's Health, Learning, and Behavior

    ERIC Educational Resources Information Center

    Fedewa, Alicia L.; Davis, Matthew Cody

    2015-01-01

    Background: Despite small- and wide-scale prevention efforts to curb obesity, the percentage of children classified as overweight and obese has remained relatively consistent in the last decade. As school personnel are increasingly pressured to enhance student performance, many educators use food as a reward to motivate and reinforce positive…

  20. Repeated electrical stimulation of reward-related brain regions affects cocaine but not "natural" reinforcement.

    PubMed

    Levy, Dino; Shabat-Simon, Maytal; Shalev, Uri; Barnea-Ygael, Noam; Cooper, Ayelet; Zangen, Abraham

    2007-12-19

    Drug addiction is associated with long-lasting neuronal adaptations including alterations in dopamine and glutamate receptors in the brain reward system. Treatment strategies for cocaine addiction and especially the prevention of craving and relapse are limited, and their effectiveness is still questionable. We hypothesized that repeated stimulation of the brain reward system can induce localized neuronal adaptations that may either potentiate or reduce addictive behaviors. The present study was designed to test how repeated interference with the brain reward system, using localized electrical stimulation of the medial forebrain bundle at the lateral hypothalamus (LH) or the prefrontal cortex (PFC), affects cocaine addiction-associated behaviors and some of the neuronal adaptations induced by repeated exposure to cocaine. Repeated high-frequency stimulation at either site influenced cocaine-related, but not sucrose-related, reward behaviors. Stimulation of the LH reduced cue-induced seeking behavior, whereas stimulation of the PFC reduced both cocaine-seeking behavior and the motivation for its consumption. The behavioral findings were accompanied by glutamate receptor subtype alterations in the nucleus accumbens and the ventral tegmental area, both key structures of the reward system. It is therefore suggested that repeated electrical stimulation of the PFC could become a novel strategy for treating addiction.

  1. Comparing rewarding and reinforcing properties between 'bath salt' 3,4-methylenedioxypyrovalerone (MDPV) and cocaine using ultrasonic vocalizations in rats.

    PubMed

    Simmons, Steven J; Gregg, Ryan A; Tran, Fionya H; Mo, Lili; von Weltin, Eva; Barker, David J; Gentile, Taylor A; Watterson, Lucas R; Rawls, Scott M; Muschamp, John W

    2016-12-01

    Abuse of synthetic psychostimulants like synthetic cathinones has risen in recent years. 3,4-Methylenedioxypyrovalerone (MDPV) is one such synthetic cathinone that demonstrates a mechanism of action similar to cocaine. Compared to cocaine, MDPV is more potent at blocking dopamine and norepinephrine reuptake and is readily self-administered by rodents. The present study compared the rewarding and reinforcing properties of MDPV and cocaine using systemic injection dose-response and self-administration models. Fifty-kilohertz ultrasonic vocalizations (USVs) were recorded as an index of positive affect throughout the experiments. In Experiment 1, MDPV and cocaine dose-dependently elicited 50-kHz USVs upon systemic injection, but MDPV increased USVs at greater rates and with greater persistence relative to cocaine. In Experiment 2, latency to begin MDPV self-administration was shorter than latency to begin cocaine self-administration, and self-administered MDPV elicited greater and more persistent rates of 50-kHz USVs than cocaine. MDPV-elicited 50-kHz USVs were sustained over the course of drug load-up, whereas cocaine-elicited USVs waned following the initial infusions. Notably, we observed robust context-elicited 50-kHz USVs from both MDPV- and cocaine-self-administering rats. Collectively, these data suggest that MDPV has powerful rewarding and reinforcing effects relative to cocaine at one-tenth the dose. Consistent with prior work, we additionally interpret these data as supporting the view that MDPV carries significant abuse risk based on its potency and subjectively positive effects. Future studies will be needed to refine therapeutic strategies aimed at reducing the rewarding effects of cathinone analogs, in an effort ultimately to reduce abuse liability. © 2016 Society for the Study of Addiction.

  2. Deficient reinforcement learning in medial frontal cortex as a model of dopamine-related motivational deficits in ADHD.

    PubMed

    Silvetti, Massimo; Wiersema, Jan R; Sonuga-Barke, Edmund; Verguts, Tom

    2013-10-01

    Attention Deficit/Hyperactivity Disorder (ADHD) is a pathophysiologically complex and heterogeneous condition with both cognitive and motivational components. We propose a novel computational hypothesis of motivational deficits in ADHD, drawing together recent evidence on the role of anterior cingulate cortex (ACC) and associated mesolimbic dopamine circuits in both reinforcement learning and ADHD. Based on findings of dopamine dysregulation and ACC involvement in ADHD, we simulated a lesion in a previously validated computational model of ACC (the Reward Value and Prediction Model, RVPM). We explored the effects of the lesion on the processing of reinforcement signals. We tested specific behavioral predictions about the profile of reinforcement-related deficits in ADHD in three experimental contexts: a probability-tracking task, partial and continuous reward schedules, and immediate versus delayed rewards. In addition, predictions were made at the neurophysiological level. Behavioral and neurophysiological predictions from the RVPM-based lesion model of motivational dysfunction in ADHD were confirmed by data from previously published studies. The RVPM represents a promising model of ADHD reinforcement learning, suggesting that ACC dysregulation might play a role in the pathogenesis of motivational deficits in ADHD. However, more behavioral and neurophysiological studies are required to test core predictions of the model. In addition, the interaction with the brain networks underpinning other aspects of ADHD neuropathology (i.e., executive function) needs to be better understood.
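
    The RVPM itself is not given in this record; as a toy illustration of the lesion logic only (an assumption-laden sketch, not the published model), the Python fragment below scales the dopaminergic prediction-error signal of an ACC-like value learner by a gain factor, so a simulated lesion (gain below 1) blunts learning from reinforcement.

      def acc_value_step(value, reward, gain=1.0, alpha=0.1):
          """gain=1.0 -> intact model; gain<1.0 -> simulated dopaminergic lesion."""
          delta = gain * (reward - value)   # dopamine-scaled prediction error
          value += alpha * delta
          return value, delta

      # Intact versus lesioned learning from the same surprising reward:
      print(acc_value_step(value=0.2, reward=1.0, gain=1.0))
      print(acc_value_step(value=0.2, reward=1.0, gain=0.5))

    With gain=0.5, value estimates converge more slowly and prediction-error responses are blunted, qualitatively in the spirit of the motivational-deficit profile simulated above.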

  3. A Neurogenetic Dissociation between Punishment-, Reward-, and Relief-Learning in Drosophila

    PubMed Central

    Yarali, Ayse; Gerber, Bertram

    2010-01-01

    What is particularly worth remembering about a traumatic experience is what brought it about, and what made it cease. For example, fruit flies avoid an odor which during training had preceded electric shock punishment; on the other hand, if the odor had followed shock during training, it is later approached as a signal for the relieving end of shock. We provide a neurogenetic analysis of such relief learning. Blocking output (using UAS-shibire^ts1) from a particular set of dopaminergic neurons defined by the TH-Gal4 driver partially impaired punishment learning, but left relief learning intact. Thus, with respect to these particular neurons, relief learning differs from punishment learning. Targeting another set of dopaminergic/serotonergic neurons defined by the DDC-Gal4 driver, on the other hand, affected neither punishment nor relief learning. As for the octopaminergic system, the tbh^M18 mutation, which compromises octopamine biosynthesis, partially impaired sugar-reward learning, but not relief learning. Thus, with respect to this particular mutation, relief learning and reward learning are dissociated. Finally, blocking output from the set of octopaminergic/tyraminergic neurons defined by the TDC2-Gal4 driver affected neither reward nor relief learning. We conclude that, with regard to the genetic tools used, relief learning is neurogenetically dissociated from both punishment and reward learning. This may be a message relevant also for analyses of relief learning in other experimental systems, including man. PMID:21206762

  4. Using Fuzzy Logic for Performance Evaluation in Reinforcement Learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap S.

    1992-01-01

    Current reinforcement learning algorithms require long training periods, which generally limits their applicability to small problems. A new architecture is described which uses fuzzy rules to initialize its two neural networks: one for performance evaluation and another for action selection. This architecture is applied to the control of dynamic systems, and it is demonstrated that it is possible to start with approximate prior knowledge and refine it through experience using reinforcement learning.
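
    As a minimal sketch of the architectural idea, with invented names (rules_to_table is hypothetical, and lookup tables stand in for the paper's neural networks): both learners, the performance-evaluation component and the action-selection component, start from approximate prior knowledge rather than from scratch, and reinforcement learning then refines those starting values.

      def rules_to_table(rules):
          """Hypothetical translation of approximate fuzzy rules into
          initial values, e.g. {'near_goal': 0.8, 'far_from_goal': 0.1}."""
          return dict(rules)

      def refine(table, state, target, alpha=0.1):
          """Reinforcement learning refines the approximate prior."""
          table[state] += alpha * (target - table[state])
          return table[state]

      # Critic seeded with rough "this region is good/bad" knowledge:
      critic = rules_to_table({"near_goal": 0.8, "far_from_goal": 0.1})
      refine(critic, "far_from_goal", 0.3)   # experience adjusts the rough rule

    The design point is the initialization: because learning starts near a sensible prior rather than at zero, far fewer training experiences are needed, which is the speed-up the abstract claims.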

  5. Rewarded by Punishment: Reflections on the Disuse of Positive Reinforcement in Education.

    ERIC Educational Resources Information Center

    Maag, John W.

    2001-01-01

    This article delineates the reasons why educators find punishment a more acceptable approach for managing students' challenging behaviors than positive reinforcement. The article argues that educators should plan the occurrence of positive reinforcement to increase appropriate behaviors rather than running the risk of it haphazardly promoting…

  6. Coevolutionary networks of reinforcement-learning agents.

    PubMed

    Kianercy, Ardeshir; Galstyan, Aram

    2013-07-01

    This paper presents a model of network formation in repeated games where the players adapt their strategies and network ties simultaneously using a simple reinforcement-learning scheme. It is demonstrated that the coevolutionary dynamics of such systems can be described via coupled replicator equations. We provide a comprehensive analysis for three-player two-action games, which is the minimum system size with nontrivial structural dynamics. In particular, we characterize the Nash equilibria (NE) in such games and examine the local stability of the rest points corresponding to those equilibria. We also study general n-player networks via both simulations and analytical methods and find that, in the absence of exploration, the stable equilibria consist of star motifs as the main building blocks of the network. Furthermore, in all stable equilibria the agents play pure strategies, even when the game allows mixed NE. Finally, we study the impact of exploration on learning outcomes and observe that there is a critical exploration rate above which the symmetric and uniformly connected network topology becomes stable.
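
    The coupled replicator equations are not reproduced in this record; the Python sketch below numerically integrates a standard replicator form with an entropic exploration term, consistent with the coupled replicator equations the abstract mentions (the payoff matrices and exploration rate T are illustrative choices, not taken from the paper).

      import math

      def replicator_step(x, y, A, B, T, dt=0.01):
          """One Euler step of coupled replicator dynamics with exploration.
          x, y: interior mixed strategies; A, B: payoff matrices (rows index
          each player's own actions); T: exploration rate."""
          def deriv(p, q, M):
              payoff = [sum(M[i][j] * q[j] for j in range(len(q)))
                        for i in range(len(p))]
              avg = sum(pi * fi for pi, fi in zip(p, payoff))
              entropy = [sum(p[j] * math.log(p[j] / p[i]) for j in range(len(p)))
                         for i in range(len(p))]
              return [p[i] * (payoff[i] - avg + T * entropy[i])
                      for i in range(len(p))]
          dx, dy = deriv(x, y, A), deriv(y, x, B)
          return ([xi + dt * d for xi, d in zip(x, dx)],
                  [yi + dt * d for yi, d in zip(y, dy)])

      # Illustrative matching-pennies payoffs with a high exploration rate:
      A = [[1, -1], [-1, 1]]
      B = [[-1, 1], [1, -1]]
      x, y = [0.6, 0.4], [0.3, 0.7]
      for _ in range(1000):
          x, y = replicator_step(x, y, A, B, T=0.5)

    For large T the entropic term dominates and both players are driven toward uniform mixing, which mirrors the critical exploration rate above which the symmetric, uniformly connected topology becomes stable.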

  8. Developing PFC representations using reinforcement learning.

    PubMed

    Reynolds, Jeremy R; O'Reilly, Randall C

    2009-12-01

    From both functional and biological considerations, it is widely believed that action production, planning, and goal-oriented behaviors supported by the frontal cortex are organized hierarchically [Fuster (1991); Koechlin, E., Ody, C.,