Report

Discrete Coding of Reward Probability and Uncertainty by Dopamine Neurons

Science  21 Mar 2003:
Vol. 299, Issue 5614, pp. 1898-1902
DOI: 10.1126/science.1077349

Abstract

Uncertainty is critical in the measure of information and in assessing the accuracy of predictions. It is determined by probability P, being maximal at P = 0.5 and decreasing at higher and lower probabilities. Using distinct stimuli to indicate the probability of reward, we found that the phasic activation of dopamine neurons varied monotonically across the full range of probabilities, supporting past claims that this response codes the discrepancy between predicted and actual reward. In contrast, a previously unobserved response covaried with uncertainty and consisted of a gradual increase in activity until the potential time of reward. The coding of uncertainty suggests a possible role for dopamine signals in attention-based learning and risk-taking behavior.

The brain continuously makes predictions and compares outcomes (or inputs) with those predictions (1–4). Predictions are fundamentally concerned with the probability that an event will occur within a specified time period. It is only through a rich representation of probabilities that an animal can infer the structure of its environment and form associations between correlated events (4–7). Substantial evidence indicates that dopamine neurons of the primate ventral midbrain code errors in the prediction of reward (8–10). In the simplified case in which reward magnitude and timing are held constant, prediction error is the discrepancy between the probability P with which reward is predicted and the actual outcome (reward or no reward). Thus, if dopamine neurons code reward prediction error, their activation after reward should decline monotonically as the probability of reward increases. However, in varying probability across its full range (P = 0 to 1), a fundamentally distinct parameter is introduced. Uncertainty is maximal at P = 0.5 but absent at the two extremes (P = 0 and 1) and is critical in assessing the accuracy of a prediction. We examined the influence of reward probability and uncertainty on the activity of primate dopamine neurons.
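
To make the two parameters concrete, consider a minimal numerical sketch (our illustration, not part of the original analysis): with reward magnitude fixed at 1, the prediction error on rewarded trials is 1 − P and on unrewarded trials is 0 − P, whereas a standard measure of uncertainty, the Bernoulli variance P(1 − P), peaks at P = 0.5 and vanishes at the extremes.

# Illustrative sketch: prediction error vs. uncertainty for a reward of
# fixed magnitude delivered with probability P.
for P in (0.0, 0.25, 0.5, 0.75, 1.0):
    error_rewarded = 1.0 - P        # discrepancy when reward occurs
    error_unrewarded = 0.0 - P      # discrepancy when reward is omitted
    variance = P * (1.0 - P)        # Bernoulli variance, one measure of uncertainty
    print(f"P={P:4.2f}  rewarded error={error_rewarded:+.2f}  "
          f"unrewarded error={error_unrewarded:+.2f}  variance={variance:.4f}")

The positive error declines monotonically as P rises, while the variance traces an inverted U; the two quantities are thus dissociable even though both derive from the same probability.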

Two monkeys were conditioned in a Pavlovian procedure with distinct visual stimuli indicating the probability (P = 0, 0.25, 0.5, 0.75, and 1.0) of liquid reward being delivered after a 2-s delay (11). Anticipatory licking responses during the interval between stimulus and reward increased with the probability of reward (Fig. 1), indicating that the animals discriminated the stimuli behaviorally. However, at none of the intermediate probabilities was there a difference in the amount of anticipatory licking between rewarded and unrewarded trials (fig. S1). This suggests that the expectation of reward did not fluctuate significantly on a trial-by-trial basis as a result of the monkey learning the reward schedule (11).

Figure 1

Conditioned licking behavior increased with reward probability. The ordinate displays the mean duration of licking during the 2-s period from conditioned stimulus onset to potential reward. Each point represents the mean (±SEM) of 2322 to 6668 trials. Standard errors are too small to be visible. The behavioral data shown were collected between the first and last day of recordings and include data collected in the absence of physiological recordings.

Dopamine neurons of ventral midbrain areas A8, A9, and A10 (fig. S2) were identified solely on the basis of previously described electrophysiological characteristics, particularly the long waveform of their impulses (1.5 to 5.0 ms) (11). The analyses presented here are for the entire population of dopamine neurons sampled, without selection for the presence of any event-related response. Dopamine neurons (n = 188) showed little or no response to fully predicted reward (P = 1.0), but they displayed the typical phasic activations (8–10) when reward was delivered with P < 1.0, even after extensive training (Fig. 2, A and B). The magnitude of the reward responses increased as probability decreased, as shown by linear regression (r² = 0.97, P = 0.002 and r² = 0.92, P = 0.01 in monkeys A and B, respectively) (Fig. 2C and fig. S3A) (12). Although dopamine neurons effectively discriminated the full range of probabilities as a population, many single neurons, in contrast to the one shown in Fig. 2A, appeared not to discriminate across the full range (13). For trials in which reward was predicted with intermediate probabilities (P = 0.25 to 0.75) but did not occur, neuronal activity was significantly suppressed. The amount of suppression tended to increase with probability (r² = 0.65, P = 0.20 and r² = 0.80, P = 0.10 in monkeys A and B, respectively) (Fig. 2, B and D), although the quantification of suppression may have been limited by the low spontaneous activity levels. Conditioned stimuli elicited the typical phasic activations (8–10), with magnitudes that increased with reward probability (r² = 0.80, P = 0.04 and r² = 0.69, P = 0.08 in monkeys A and B, respectively) (Figs. 2, A and E, and 3, A and B). In summary, the phasic activations varied monotonically with reward probability, although further conclusions about the quantitative relations are not warranted (13).

Figure 2

Dependence of phasic neuronal responses on reward probability. (A) Rasters and histograms of activity in a single cell, illustrating responses to the conditioned stimuli and reward at various reward probabilities, increasing from top to bottom. The thick vertical line in the middle of the top panel (P = 0) indicates that the conditioned stimulus response to the left and the reward response to the right were not from a single trial type, as in the other panels, but were spliced together. Reward at P = 0.0 was given in the absence of any explicit stimulus at a constant rate of 0.02 per 100 ms and thus presumably occurred with a low subjective probability (11). Only rewarded trials are shown at intermediate probabilities. Bin width = 20 ms. (B) Population histograms of rewarded (left) and unrewarded (right) trials at P = 0.5 (n = 39, monkey A, set 1). Bin width = 10 ms. (C to E) The median response (n = 34 to 62) measured in fixed standard windows, along with symmetric 95% confidence intervals (bars) (11). Circles and squares represent data from analogous experiments, with the squares representing a subsequent replication of the prior "circle" data but with distinct visual stimuli and only two or three probabilities tested. In (C), the median magnitude of reward responses as a function of probability is shown, normalized in each neuron to the response to unpredicted reward. Unpredicted reward caused a median increase in activity that ranged from 76 to 270% above baseline for the four picture sets. Analogous to (C), fig. S3A shows means (±SEM) for a subset of responsive neurons (11). In (D), the median magnitude of responses to no reward as a function of probability is shown, normalized in each neuron to the response at P = 0.5. Median decreases in activity at P = 0.5 ranged from –22 to –55% below baseline. Symbols represent picture sets as shown in (C). At reward probability P = 0 for monkey B, a neutral visual stimulus was predicted (P = 0.5) by the conditioned stimulus; the data point shows the response after the neutral stimulus failed to occur. In (E), responses to conditioned stimuli are shown, normalized in each neuron to the response to the stimulus predicting reward at P = 1.0. The median response to this stimulus ranged from 67 to 194% above baseline. Symbols represent picture sets as shown in (C). The stimuli with P = 0 for monkey A, set 2, and for monkey B, set 1, predicted the subsequent occurrence of a neutral visual stimulus with P = 0.5.

Figure 3

Sustained activation of dopamine neurons precedes uncertain rewards. (A) Rasters and histograms of activity in a single cell with reward probabilities ranging from 0.0 (top) to 1.0 (bottom). This neuron showed sustained activation before potential reward at all three intermediate probabilities. Both rewarded and unrewarded trials are shown at intermediate probabilities; the longer vertical marks in the rasters indicate the occurrence of reward. Bin width = 20 ms. (B) Population histograms at reward probabilities ranging from 0.0 (top) to 1.0 (bottom). Histograms were constructed from every trial in each neuron in the first picture set in monkey A (35 to 44 neurons per stimulus type; 638 total trials at P = 0 and 1200 to 1700 trials for all other probabilities). Both rewarded and unrewarded trials are included at intermediate probabilities. At P = 0.5, the mean (±SD) rate of basal activity in this population was 2.5 ± 1.4 impulses per second before stimulus onset and 3.9 ± 2.7 in the 500 ms before potential reward. (C) Median sustained activation of dopamine neurons as a function of reward probability. Analogously, means (±SEM) are shown in fig. S3B for a subset of responsive neurons (11). Symbols have the same meaning as in Fig. 2C. For monkey A, set 1, the points at P = 0.25 and 0.75 may underestimate the amount of sustained activation, as 11 cells with unusually high levels of sustained activity at P = 0.5 (median activation of 72%) were not tested at P = 0.25 or 0.75. This was because, at the time of those experiments, the novel form of activation cast doubt on the dopaminergic identity of the neurons. For P = 0 in monkey A, set 2, and in monkey B, set 1, there was a 50% chance of a neutral stimulus following the conditioned stimulus. (D) Sustained responses (at P = 0.5) plotted against phasic responses to unpredicted reward (P = 0) for all neurons recorded in both monkeys (188 neurons, plus an additional 53 neurons tested with different reward magnitudes as in Fig. 4B; five neurons that were outliers in both dimensions are not shown).

The present work revealed an additional, previously unreported activation of dopamine neurons. There was a sustained increase in activity that grew from the onset of the conditioned stimulus to the expected time of reward (Fig. 3, A and B). At P = 0.5, 29% of 188 neurons showed significant increases in activity before potential reward, whereas 3% showed decreases (P < 0.05, Wilcoxon test). By contrast, at P = 1.0, only 9% showed significant increases, and 5% showed significant decreases. For the population response, the sustained activation was maximal at P = 0.5, less pronounced at P = 0.25 and 0.75, and absent at P = 0.0 and 1.0 (Fig. 3C and fig. S3B). Statistical analysis revealed a significant effect of uncertainty on the population response (P < 0.005 in each of four data sets) (11), indicating that the sustained activation codes uncertainty (14). Furthermore, the peak of the sustained activation occurs at the time of potential reward, which corresponds to the moment of greatest uncertainty (15). The particular function of uncertainty signaled by dopamine neurons is not known (13), but we note that common measures of uncertainty (variance, standard deviation, and entropy) are all maximal at P = 0.5 and have highly nonlinear relations to probability, being very sensitive to small changes in probability near the extremes (P = 0 or 1).
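
A short sketch (our illustration) makes these properties explicit for a Bernoulli reward with probability P: variance, standard deviation, and entropy all peak at P = 0.5, and entropy in particular changes steeply near P = 0 and P = 1.

import math

def bernoulli_uncertainty(P):
    """Variance, standard deviation, and entropy (bits) of a Bernoulli(P) reward."""
    var = P * (1 - P)
    sd = math.sqrt(var)
    # Define 0 * log2(0) = 0 so the entropy is well defined at P = 0 and 1.
    ent = sum(-p * math.log2(p) for p in (P, 1 - P) if p > 0)
    return var, sd, ent

for P in (0.0, 0.01, 0.25, 0.5, 0.75, 0.99, 1.0):
    var, sd, ent = bernoulli_uncertainty(P)
    print(f"P={P:4.2f}  var={var:.4f}  sd={sd:.4f}  entropy={ent:.4f} bits")

Note that the entropy already reaches about 0.08 bits at P = 0.01, illustrating the high sensitivity near the extremes.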

The phasic and sustained activations differed not only in timing and relation to reward probability, but also in their occurrence in single neurons. In Fig. 3D, the magnitude of the phasic and sustained activation is shown for each neuron (n = 241). First, a substantial number of neurons had little or no response of either type (13); however, the magnitudes of each type of response fell along a continuum, with no evidence for subpopulations among dopamine neurons. Second, the magnitude of the sustained activation showed no consistent relation to the magnitude of phasic activation across neurons. This was the case both for the phasic response to conditioned stimuli (r = 0.095, P > 0.10) and for the response to unpredicted reward (r = –0.024) (Fig. 3D). In contrast, there was a significant positive correlation of phasic responses between conditioned stimuli and reward (r = 0.196, P < 0.01) (fig. S4). Thus, the phasic and sustained activations appear to occur independently and within a single population of dopamine neurons.

Although the sustained activation occurs in response to reward uncertainty, it is important to know whether it is specific to motivationally relevant stimuli or generalizes to all uncertain events. We conditioned two visual stimuli in a series, with the second following the first in only half of the trials (P = 0.5). The stimuli were distinct but entirely analogous to the other stimuli used for conditioning. Dopamine neurons showed neither sustained (Figs. 3C and 4A) nor phasic responses (Fig. 2, D and E) to either the first or second of these stimuli. Thus, the sustained activation seems to be related to uncertainty about motivationally relevant stimuli.

Figure 4

Sustained activation is dependent on the discrepancy in potential reward magnitude. (A) All stimuli predicted potential reward (0.05, 0.15, or 0.5 ml of liquid) or a neutral picture at P = 0.5. Data are from 35 cells in monkey A and 49 cells in monkey B. (B) Each stimulus predicted that reward would be one of two potential magnitudes, each at P = 0.5, as indicated on the abscissa. Every trial was rewarded with one of the two potential reward magnitudes. Data are from 53 cells in monkey B.

If the sustained dopamine activation is related to the motivational properties of uncertain rewards, it should vary with reward magnitude. We used distinct visual stimuli to predict the magnitude of potential reward at P = 0.5 and found that the sustained activation of dopamine neurons increased with increasing reward magnitude (n = 84, P < 0.02 in each monkey) (Fig. 4A) (11). The sustained activation could reflect the discrepancy in potential reward rather than absolute reward magnitude. To address this issue, we performed an additional experiment (53 neurons in monkey B) in which reward was delivered in each trial but varied between two magnitudes at P = 0.5. One stimulus predicted a small or medium reward, another predicted a small or large reward, and a third predicted a medium or large reward. The sustained activation was maximal after the stimulus predicting the largest variation (small versus large reward) (P < 0.01) (Fig. 4B). These data indicate that the amount of sustained activation by reward uncertainty in dopamine neurons increases with the discrepancy between potential rewards.
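
This relation follows directly from the statistics of a two-outcome gamble: at P = 0.5, the standard deviation of the reward is |m1 − m2|/2, growing linearly with the discrepancy between the two magnitudes m1 and m2. A minimal check (our illustration, assuming the magnitudes in Fig. 4B match those listed for Fig. 4A):

# Our illustration: SD of a 50/50 gamble between two reward magnitudes (ml).
# Pairings assumed from Fig. 4B: small/medium, small/large, medium/large.
pairs = [(0.05, 0.15), (0.05, 0.5), (0.15, 0.5)]
for m1, m2 in pairs:
    mean = 0.5 * m1 + 0.5 * m2
    sd = (0.5 * (m1 - mean) ** 2 + 0.5 * (m2 - mean) ** 2) ** 0.5
    print(f"{m1} vs {m2} ml: sd = {sd:.3f} ml (= |m1 - m2| / 2)")

The small-versus-large pairing has the largest standard deviation (0.225 ml), matching the stimulus that evoked the strongest sustained activation.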

The present results demonstrate two distinct response types in dopamine neurons. Brief, phasic activations changed monotonically with increasing reward probability, whereas slower, more sustained activations developed with increasing reward uncertainty. These sustained activations were not observed in previous studies in which predictions had low uncertainty. Thus, the activity of dopamine neurons carries information about two intimately related but fundamentally distinct statistical parameters of reward. A potentially analogous coding scheme was identified in neurons of the fly visual system, in which the visual stimulus and uncertainty about that visual stimulus appeared to be coded independently in single neurons (16).

By systematically varying reward probability, we show that the phasic activity of dopamine neurons matches the quantitative definition of reward prediction error. Responses to reward decreased with increasing reward probability, and, conversely, responses to the predictive stimulus increased. Furthermore, reward always elicited responses when it occurred at P < 1, even after thousands of pairings between stimulus and reward. By always coding prediction error over the full range of probabilities, dopamine neurons could provide a teaching signal in accord with the principles of learning originally described by Rescorla and Wagner (17–19).
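
As a sketch of how such a signal could serve as a teaching signal (our illustration, not the authors' model), the Rescorla-Wagner rule moves a stimulus value V toward each outcome in proportion to the prediction error; with probabilistic reward, V converges near P, yet a nonzero error is generated on every trial, just as the recorded responses persisted after extensive training.

import random

def rescorla_wagner(P, alpha=0.1, trials=5000, seed=0):
    """Learn the value V of a stimulus rewarded with probability P."""
    rng = random.Random(seed)
    V = 0.0
    for _ in range(trials):
        reward = 1.0 if rng.random() < P else 0.0
        delta = reward - V      # prediction error, the putative dopamine signal
        V += alpha * delta      # Rescorla-Wagner update
    return V

for P in (0.25, 0.5, 0.75, 1.0):
    print(f"P={P:4.2f}  learned value V={rescorla_wagner(P):.3f}")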

In addition to the principles described by Rescorla and Wagner, other basic intuitive principles of associative learning have been described, focusing in particular on the importance of attention (20, 21). It is generally accepted that no single principle alone is sufficient to explain all observations of animal learning, and the various theories are thus considered complementary (6, 7). The Pearce-Hall theory proposes that attention (and thus learning) is proportional to uncertainty about reinforcers (21, 22). As dopamine neurons are activated by reward uncertainty, dopamine could facilitate attention and learning in accord with the Pearce-Hall theory. This raises the possibility that two fundamental principles of learning are embodied by two distinct types of response in dopamine neurons (23).
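
A minimal sketch of the Pearce-Hall idea (our illustration; the update constants are arbitrary assumptions, not taken from the theory's literature): the associability alpha of a stimulus tracks the recent unsigned prediction error, which remains large when reward is probabilistic and decays to zero when the outcome is certain.

import random

def pearce_hall_associability(P, gamma=0.1, trials=5000, seed=0):
    """Associability settles near the mean |prediction error| under Bernoulli(P) reward."""
    rng = random.Random(seed)
    V, alpha = 0.0, 1.0
    for _ in range(trials):
        reward = 1.0 if rng.random() < P else 0.0
        surprise = abs(reward - V)             # unsigned prediction error
        V += alpha * 0.1 * (reward - V)        # value update gated by associability
        alpha += gamma * (surprise - alpha)    # associability tracks recent surprise
    return alpha

for P in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"P={P:4.2f}  asymptotic associability={pearce_hall_associability(P):.3f}")

Like the sustained activation, the asymptotic associability is maximal at P = 0.5 and absent at P = 0 and 1.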

The link between uncertainty, attention, and learning has two related aspects [another aspect is given in (24)]. The goal of learning can be seen as finding accurate predictors for motivationally significant events. Subjective uncertainty indicates that the animal lacks an accurate predictor and thus indicates the utility of identifying a more accurate predictor (25). Similarly, and as indicated by mathematical principles of information (26), only in the presence of uncertainty is it anticipated that there will be information available in the outcome. If reward (P = 1) or no reward (P = 0) occurs exactly as predicted, that event contains no information beyond that already given by the conditioned stimulus; that is, it is redundant. However, when the prediction of reward is uncertain, the outcome (reward or no reward) always contains information. The outcome at P = 0.5 contains, on average, the maximal amount of information (one bit) of any probability. The processing of this reward information is demonstrated by the fact that prediction error signals are always generated in dopamine neurons when reward outcomes occur under conditions of uncertainty. Thus, subjective reward uncertainty corresponds both to the utility of identifying more accurate predictors and to the expectation of reward information. Through its widespread influence, dopamine could control a nonselective form of attention or arousal, which is dependent on uncertainty and designed to aid the learning of predictive stimuli and actions.
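
In Shannon's terms, this can be made quantitative with a short worked example (ours): the surprisal of an outcome with probability p is −log2 p, so the expected surprisal of the reward outcome is the binary entropy of P, which is zero at P = 0 and 1 and reaches its maximum of one bit at P = 0.5.

import math

def outcome_information(P):
    """Surprisal (bits) of each outcome and its expectation under reward probability P."""
    bits_reward = -math.log2(P) if P > 0 else float("inf")
    bits_no_reward = -math.log2(1 - P) if P < 1 else float("inf")
    expected = sum(-p * math.log2(p) for p in (P, 1 - P) if p > 0)  # binary entropy
    return bits_reward, bits_no_reward, expected

for P in (0.25, 0.5, 0.75, 1.0):
    r, n, e = outcome_information(P)
    print(f"P={P:4.2f}  reward={r:.2f} bits  no reward={n:.2f} bits  expected={e:.2f} bits")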

Although dopaminergic signals may promote a particular form of attention, an extensive literature has already established the critical importance of dopamine in reward and reinforcement. Whereas the phasic response of dopamine neurons to reward prediction error fits remarkably well with dopamine's presumed role in appetitive reinforcement (10, 17, 18), the activation by reward uncertainty may appear inconsistent with a reinforcing function. This apparent discrepancy would be resolved to the extent that postsynaptic neurons can discriminate the two forms of activity. However, it seems unlikely that the two patterns of activity can be discriminated perfectly, especially given the slow time course of dopamine transmission. Rather than arguing against a role for the activity of dopamine neurons in reinforcement, one might ask whether reward uncertainty itself has rewarding and reinforcing properties. Indeed, gambling behavior is defined by reward uncertainty and is prevalent throughout many cultures. Animals display a potentially related behavior, preferring variable over fixed reward schedules [for discussion, see (27) and (28)]. The present results suggest that dopamine is elevated during gambling in a manner that is dependent on both the probability and magnitude of potential reward. This uncertainty-induced increase in dopamine could contribute to the rewarding properties of gambling, which are not readily explained by overall monetary gain or dopamine's corresponding role in prediction error (as losses tend to outnumber gains) (29). The question arises as to why a reward signal would be produced by reward uncertainty. Although risk-taking behavior may be maladaptive in a laboratory or casino, where the probabilities are fixed and there is nothing useful to learn, it could be advantageous in natural settings, where it would be expected to promote learning of stimuli or actions that are accurate predictors of reward (25). Thus, the sustained, uncertainty-induced increase in dopamine could act to reinforce risk-taking behavior and its consequent reward information, whereas the phasic response after prediction error could mediate the more dominant reinforcement of reward itself.

Supporting Online Material

www.sciencemag.org/cgi/content/full/299/5614/1898/DC1

Materials and Methods

SOM Text

Figs. S1 to S4

References and Notes

* To whom correspondence should be addressed. E-mail: cdf28@cam.ac.uk
