Report

Representation of Action-Specific Reward Values in the Striatum


Science  25 Nov 2005:
Vol. 310, Issue 5752, pp. 1337-1340
DOI: 10.1126/science.1115270

Abstract

The estimation of the reward an action will yield is critical in decision-making. To elucidate the role of the basal ganglia in this process, we recorded striatal neurons of monkeys who chose between left and right handle turns, based on the estimated reward probabilities of the actions. During a delay period before the choices, the activity of more than one-third of striatal projection neurons was selective to the values of one of the two actions. Fewer neurons were tuned to relative values or action choice. These results suggest representation of action values in the striatum, which can guide action selection in the basal ganglia circuit.

Animals and humans flexibly choose actions in pursuit of their specific goals in the environment on a trial-and-error basis (1, 2). Theories of reinforcement learning (3) describe reward-based decision-making and adaptive choice of actions by the following three steps: (i) The organism estimates the action value, defined as how much reward value (probability times volume) an action will yield. (ii) It selects an action by comparing the action values of multiple alternatives. (iii) It updates the action values by the errors of estimated action values. Reinforcement learning models of the basal ganglia have been put forward (4–6). The midbrain dopamine neurons encode errors of reward expectation (7–9) and motivation (9), and they regulate the plasticity of the corticostriatal synapses (10, 11). Neuronal discharge rates in the cerebral cortex (12–15) and striatum (16–18) are modulated by rewards that are estimated by sensory cues and behavioral responses. These observations are consistent with action selection through the reinforcement learning rule (3) and with the notion of stimulus-response learning (19, 20). However, two critical questions remain unanswered: Do the striatal neurons acquire action values in their activity through learning? How is the striatal neuron activity involved in reward-based action selection? Here we show by using a reward-based, free-choice paradigm that the striatal neurons learn to encode the action values through trial-and-error learning and predict choice probability of action options under a reinforcement learning algorithm.

Two macaque monkeys performed a reward-based, free-choice task of turning a handle to the left or right (Fig. 1A). The monkeys held a handle in the center position for 1 s (delay period) with their left hand. Then, they turned the handle in either the left (a = L) or right (a = R) direction. A light-emitting diode (LED) on the turned side was illuminated stochastically in either green or red. The green and red LEDs instructed monkeys that either a large reward (0.2 ml of water) or a small reward (0.07 ml), respectively, would follow. The probabilities of a large reward after left and right turns were fixed during a block of 30 to 150 trials and varied between five types of trial blocks. In the “90-50” block, for example, the probability of a large reward for the left turn was 90%, and for the right turn, 50%. In this case, by taking the small reward as the baseline (r = 0) and the large reward as unity (r = 1), the left action value QL was 0.9 and the right action value QR was 0.5. We used four asymmetrically rewarded blocks, “90-50,” “50-90,” “50-10,” and “10-50,” and one symmetrically rewarded block, “50-50” (Fig. 1B). An important feature of this block design was that the neuronal activity related to the action value could be dissociated from that related to action choice. Although the monkeys should prefer the left turn in both 90-50 and 50-10 blocks, the action value QL for the left turn changes from 0.9 to 0.5. Conversely, in the 90-50 and 10-50 blocks, although the monkey's choice behavior should be the opposite, the action value QR remains at 0.5.
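The dissociation between action value and action choice in this block design can be checked with a short sketch. The block names and probabilities come from the text; the function names and the dictionary layout are illustrative assumptions, not part of the original study.

```python
# Large-reward probabilities (left, right) for the five block types named in the text.
BLOCKS = {
    "90-50": (0.9, 0.5),
    "50-90": (0.5, 0.9),
    "50-10": (0.5, 0.1),
    "10-50": (0.1, 0.5),
    "50-50": (0.5, 0.5),
}


def action_values(block, r_small=0.0, r_large=1.0):
    """Expected reward of each action, with small reward = 0 and large reward = 1."""
    p_left, p_right = BLOCKS[block]
    q_left = p_left * r_large + (1 - p_left) * r_small
    q_right = p_right * r_large + (1 - p_right) * r_small
    return q_left, q_right


def preferred_action(block):
    """Action with the higher value, or None when both values are equal."""
    q_left, q_right = action_values(block)
    if q_left == q_right:
        return None
    return "L" if q_left > q_right else "R"


for name in BLOCKS:
    q_l, q_r = action_values(name)
    print(name, "QL =", q_l, "QR =", q_r, "preferred:", preferred_action(name))
```

In this sketch the 90-50 and 50-10 blocks share the preferred action (left) but differ in QL, whereas the 90-50 and 10-50 blocks share QR but differ in the preferred action.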

Fig. 1.

Reward-based, free-choice task and monkey's performance. (A) Time chart of events that occurred during the task. (B) Diagram of large-reward probabilities for left, P(r | a = L), and right handle turn, P(r | a = R), in five types of trial blocks. (C) Representative record of individual choices in the five blocks of trials. Red and blue vertical lines indicate individual choices of trials (long line: large-reward trial, short line: small-reward trial, crosses: error trials with no reward). The light blue trace in the middle indicates the probability of a left-turn choice (PL, running average of last 10 choices). (D) Average curves of PL (solid line) and its 95% confidence interval (shaded band) in five trial blocks in monkey RO. Data of 977, 306, 282, 277, and 242 blocks are shown for 50-50, 10-50, 50-10, 50-90, and 90-50 blocks, respectively. Color code is the same as in (B).

Figure 1C shows a representative time course of choices on individual trials and the left-turn choice probability, PL. Figure 1D shows the average curves of PL during 2084 blocks of trials by monkey RO. The PL started at around 0.5 (average of first 10 trials: 0.48 for monkey RO and 0.39 for monkey AR) and stayed around 0.5 in a symmetrically rewarded block in both monkeys. In asymmetrically rewarded blocks, the choice probability gradually shifted toward the action with higher reward values (binomial test, P < 0.01 for 50-50 versus four asymmetrically rewarded blocks). Although the time courses of the PL shifts were variable among individual blocks, such as those in Fig. 1C, the average PL values at the same number of trials after the block start were not significantly different between the 90-50 and 50-10 blocks, or between the 50-90 and 10-50 blocks (Fig. 1D, P > 0.05).

We recorded 504 striatal projection neurons in the right putamen and caudate nucleus of two monkeys. The present study focused on the 142 (61 in monkey RO, 81 in monkey AR) neurons that displayed increased discharges during at least one task event and those that had discharge rates higher than 1 spike/s during the delay period. We compared the average discharge rates during the delay period from two asymmetrically rewarded blocks. The comparison was based on the trials after the monkey's choices had reached a “stationary phase,” when the choice probability was biased toward the action with higher reward probability in more than 70% of trials. In half of the neurons (72/142 in two monkeys), activity was modulated by either QL or QR. Figure 2, A and B, shows a representative neuron in which the delay-period discharge rate was significantly higher in the 90-50 block (blue) than in the 10-50 block (orange) (P = 0.003, two-tailed Mann-Whitney U test). Delay period discharges were not significantly different (P = 0.70) between the 50-10 and 50-90 blocks (Fig. 2B), for which preferred actions were the opposite. Thus, the neuron encodes the left action value, QL, but not the action itself. Another neuron in Fig. 2, C and D, showed a significantly higher discharge rate in the 50-10 block than in the 50-90 block (P < 0.001), but there was no significant difference between the 10-50 and 90-50 blocks (P = 0.67). This neuron may code a negative right action value, –QR. We also found neurons (Fig. 2, E and F) that discharged more in the 90-50 block than in the 10-50 block (P = 0.028), but less in the 50-90 block than in the 50-10 block (P = 0.003). This neuron may encode the difference of action values, QL – QR, and choice of left turn.

Fig. 2.

Three representative reward-value coding neurons in the striatum. (A) A left–action value (QL-type) neuron in the anterior striatum. Average discharge rates during 10-50 and 90-50 blocks (left panel) and during 50-10 and 50-90 blocks (right panel) are shown. (B) Three-dimensional bar graph of average magnitudes and standard deviation of activity during delay period [shaded period in (A)]. Floor gradient shows the regression surface of neuronal activity by large-reward probability after left and right turns. (C and D) A right–action value (QR-type) neuron in anterior putamen. (E and F) A differential–action value (ΔQ and m-type) neuron with correlation also to action choice. The average activity curves in (A), (C), and (E) are smoothed with a Gaussian kernel (σ = 50 ms). Double and single asterisks indicate significant difference at P < 0.001 and P < 0.01 in Mann-Whitney U test, respectively.

To study the representation of action values in the population of striatal neurons, we performed a multiple regression analysis of neuronal discharge rates with QL and QR as regressors (21). Figure 3A shows a scatter plot of t-values of the regression coefficients. We found 24 (17%) “QL-type” neurons, which had significant regression coefficients to QL (t-test, P < 0.05) but not to QR, and 31 (22%) “QR-type” neurons correlated to QR but not to QL. There were 16 (11%) “differential action value (ΔQ-type)” neurons correlated with the difference between QL and QR. One neuron was classified as “value (V)-type” (<1%), which was positively correlated with reward values independent of actions. In 41 neurons, there were significant regression coefficients to the behavioral measures including chosen action, reaction time, and movement time (Fig. 3A, open symbols). There were 18 “motor related (m)-type” neurons that had significant t-values only for behavioral measures. The discharge rates of most action value neurons (19/24 in QL-type, 24/31 in QR-type) were not correlated significantly with the behavioral measures. We concluded that, during a delay period before action choices, more than one-third of striatal projection neurons examined (43/142) encoded action values, and that 60% (43/72) of all the reward value–sensitive neurons were action-value neurons.
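A regression of this kind can be sketched as follows. This is a minimal illustration, not the authors' analysis (which is described in the supporting material, 21): it fits rate = b0 + bL·QL + bR·QR by ordinary least squares and reports t-values. The synthetic firing rates, the noise level, the 30 trials per block, and all function names are assumptions made for the example; the behavioral regressors mentioned in the text are omitted.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def regress(rates, q_left, q_right):
    """Least-squares fit of rate on [1, QL, QR]; returns (coefficients, t-values)."""
    X = [[1.0, ql, qr] for ql, qr in zip(q_left, q_right)]
    k, n = 3, len(rates)
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(X, rates)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [y - sum(b * x for b, x in zip(beta, row)) for row, y in zip(X, rates)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    t_values = []
    for i in range(k):
        # i-th diagonal element of (X'X)^-1, obtained by solving against a unit vector
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        var_i = sigma2 * solve(xtx, unit)[i]
        t_values.append(beta[i] / math.sqrt(var_i))
    return beta, t_values


# Synthetic "QL-type" neuron (assumed): rate depends on QL only, plus Gaussian noise.
random.seed(0)
blocks = [(0.9, 0.5), (0.5, 0.9), (0.5, 0.1), (0.1, 0.5), (0.5, 0.5)]
q_left, q_right, rates = [], [], []
for ql, qr in blocks:
    for _ in range(30):
        q_left.append(ql)
        q_right.append(qr)
        rates.append(5.0 + 10.0 * ql + random.gauss(0.0, 2.0))

beta, t = regress(rates, q_left, q_right)
print("slopes (QL, QR):", beta[1], beta[2])
print("t-values (QL, QR):", t[1], t[2])
```

Under the criterion quoted in the Fig. 3 caption (|t| > 1.97 at P = 0.05), a neuron would count as QL-type when the QL t-value exceeds the threshold and the QR t-value does not.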

Fig. 3.

Multiple regression analysis of neuronal activity with regressor of action value. (A) A scatter plot of partial regression coefficients of action values for left turn (QL) and right turn (QR). Blue circles, QL-type; red circles, QR-type; green squares, ΔQ-type; magenta triangles, V-type; crosses, m-type. Dark dots indicate neurons with no significant t-values for either regressor. Interrupted lines indicate levels of significant QL and QR slopes at P = 0.05 (t = ±1.97, 140 degrees of freedom). Open symbols indicate the neurons that also have significant regression coefficient of animals' choice, reaction time, or movement time. Letters a, b, and c indicate the example neurons in Fig. 2; A and B, C and D, and E and F, respectively. (B) Pie chart of neurons categorized into the four main types (QL, QR, ΔQ, and m) and three subtypes (ΔQ and m, QL and m, and QR and m).

We next examined whether the neuronal activity encoding action values predicts the monkeys' action choices (21). The action values QL(i) and QR(i) at the ith trial of a single block of trials were estimated based on a standard reinforcement learning model and the past action a(j) and reward r(j) (j = 1,..., i – 1) (3, 22). The estimated action values successfully predicted the probability of subsequent action choices, a(i) (Fig. 4A and fig. S2). Figure 4, B and C, shows a QL-type neuron whose discharge rate during the delay period followed the time course of QL(i) but not of QR(i) (regression slope for QL(i) = –9.7, P < 0.001, marked by a double asterisk; slope for QR(i) = 2.3, P = 0.29). These results suggest that a large subset of striatal neurons encode the action values that are updated by the history of actions and rewards and determine the probability of selecting a particular action.
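The trial-by-trial estimation can be illustrated with a generic sketch. The exact model and its fitted parameters are given in the supporting material (21) and are not reproduced here, so everything below is an assumption: a delta-rule update of the chosen action's value and a softmax (logistic) choice rule, with a hypothetical learning rate alpha, inverse temperature beta, and initial value q_init.

```python
import math


def estimate_action_values(actions, rewards, alpha=0.1, beta=3.0, q_init=0.5):
    """Return, for each trial i, (QL(i), QR(i), PL(i)) computed from trials 1..i-1.

    actions: sequence of "L" or "R"; rewards: 1 for large reward, 0 for small.
    alpha, beta, and q_init are illustrative values, not fitted parameters.
    """
    q = {"L": q_init, "R": q_init}
    history = []
    for a, r in zip(actions, rewards):
        # Softmax over two actions reduces to a logistic function of QL - QR.
        p_left = 1.0 / (1.0 + math.exp(-beta * (q["L"] - q["R"])))
        history.append((q["L"], q["R"], p_left))
        # Update only the chosen action's value by the reward prediction error.
        q[a] += alpha * (r - q[a])
    return history


# Example: twenty left turns, each followed by a large reward.
history = estimate_action_values(["L"] * 20, [1] * 20)
for i, (q_l, q_r, p_l) in enumerate(history, start=1):
    print(i, round(q_l, 3), round(q_r, 3), round(p_l, 3))
```

In this example QL(i) rises toward 1 while QR(i) stays at its initial value, so the predicted left-turn probability PL(i) increases over trials.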

Fig. 4.

Prediction of action choices and multiple regression analysis of neuronal activity by action values based on a reinforcement learning model. (A) An example of the time course of action values and predicted actions. From the data of actions and rewards (top panel: long vertical blue and red lines, large-reward trial; short lines, small-reward), the action values were estimated (bottom panel: QL(i), blue line; QR(i), red line) by a reinforcement learning model (21). Black and cyan curves indicate the action probability PL given by the action values and the actual action choice ratio given by the weighted averages with a Gaussian kernel (σ = 2.5). (B) An example of the activity of a caudate neuron plotted on the space of estimated action values QL(i) and QR(i). It had significant regression coefficients with QL(i)(slope kQL = –9.7, P < 0.001, t-test), but not with QR(i) (kQR = 2.3, P = 0.29). Heights and colors of stem plots indicate the discharge rates of the neuron and individual choices of actions (blue, left; red, right), respectively. (C) The discharge rates of the neuron in (B) are projected on the QL(i) and QR(i) axes. The color code is the same as in (B). Gray lines are derived from the regression model. Circles and error bars indicate average and SD of neural discharge rates for each of 10 equally populated action value bins. The double asterisk indicates that the discharge rates were significantly correlated with QL(i). This neuron was not selective to action itself (Mann-Whitney U test, P = 0.33). ns, not significant.

Action-value coding in the striatum may be a core feature of information processing in the basal ganglia. The striatum is the primary target of dopaminergic signals, which regulate the plasticity of cortico-striatal synaptic transmission (10, 11), conveying signals of actions and cognition. Thus, the striatum may be the locus where reward value is first encoded in the brain. This idea was supported by theoretical prediction (23) and by neural recordings from the striatum and the prefrontal cortex during association learning (24). Our finding of the successful prediction of individual action choices by estimated action values suggests the involvement of striatal action-value neurons in the process of selection of an action under a reinforcement learning algorithm. Whereas a large population of striatal neurons encoded action values, a much smaller population of neurons encoded forthcoming action during a premovement delay period. This favors action value–based models (4, 25) for the striatal functions over stimulus-response learning and actor-critic models (5, 6). However, further studies on the neuronal activity before and after action choices, not only in the striatum but also downstream from it, are necessary to clarify whether action selection is realized within the striatum through lateral inhibition (5, 26) or in the globus pallidus (4, 27). Deficits in action-value coding may lead to an inappropriate selection of competing actions or an inability to select any action, which might underlie some of the core symptoms of Parkinson's disease.

Supporting Online Material

www.sciencemag.org/cgi/content/full/310/5752/1337/DC1

Materials and Methods

SOM Text

Figs. S1 to S4

Table S1

References

References and Notes
