Fast Readout of Object Identity from Macaque Inferior Temporal Cortex

Science  04 Nov 2005:
Vol. 310, Issue 5749, pp. 863-866
DOI: 10.1126/science.1117593


Understanding the brain computations leading to object recognition requires quantitative characterization of the information represented in inferior temporal (IT) cortex. We used a biologically plausible, classifier-based readout technique to investigate the neural coding of selectivity and invariance at the IT population level. The activity of small neuronal populations (∼100 randomly selected cells) over very short time intervals (as small as 12.5 milliseconds) contained unexpectedly accurate and robust information about both object “identity” and “category.” This information generalized over a range of object positions and scales, even for novel objects. Coarse information about position and scale could also be read out from the same population.

Primates can recognize and categorize objects as quickly as 200 ms after stimulus onset (1). This remarkable ability underscores the high speed and efficiency of the object recognition computations performed by the ventral visual pathway (2–5). Because the feed-forward part of this circuitry requires eight or more synapses from the retina to anterior IT cortex, it has been proposed that the computations at each stage are based on just one or very few spikes per neuron (6, 7). At the end of the ventral stream, single cells in IT cortex show selectivity for complex objects with some tolerance to changes in object scale and position (2–4, 6, 8–16). Small groups of neurons in IT cortex tuned to different objects and object parts might thus provide sufficient information for several visual recognition tasks, including identification and categorization. This information could then be “read out” by circuits receiving input from IT neurons (17–19).

Although physiological and functional imaging data suggest that visual object identity and category are coded in the activity of IT neurons (2–6, 8–16, 20), fundamental aspects of this code remain under debate, including the discriminative power in relation to population size, temporal resolution, and time course. These questions must be understood at the population level to provide quantitative constraints for models of visual object recognition. We examined these issues by obtaining independent recordings from a large unbiased sample of IT neuronal sites and using a population readout technique based on classifiers. The readout approach consists of training a regularization classifier (21) to learn the map from neuronal responses to each object label (Supporting Online Material), as in recent studies in the motor system [e.g., (22)]. Instead of making strong assumptions about the prior probability distribution of the training examples, the classifier learns directly from them and generalizes to novel responses (21). The input consists of the neuronal responses from the independently recorded neurons; different input representations allow quantitative comparisons among neural coding alternatives (10, 13, 22–28). After training, the classifier can be used to decode the responses to novel stimuli. We used a one-versus-all approach whereby for each class of stimuli (8 classes for categorization, 77 classes for identification, 3 classes for scale and position readout; see below), one binary classifier was trained. The overall classifier prediction on test data was given by the binary classifier with the maximum activation. The performance of such classifiers constitutes a lower bound on the information available in the population activity, but is a meaningful measure that could be directly implemented by neuronal hardware.
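The one-versus-all scheme described above can be sketched in a few lines. The following is a minimal illustration, not the study's actual pipeline: the "spike counts" are synthetic Poisson data, the population size and class structure are invented, and a regularized linear SVM stands in for the regularization classifier of (21).

```python
# One-versus-all readout: one binary regularized linear classifier per
# class; the population-level prediction is the class whose binary
# classifier gives the maximum activation on a held-out trial.
# All data below are simulated, not recorded responses.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_classes, n_sites, n_trials = 8, 100, 40  # e.g., 8 categories, ~100 sites

# Each class gets a distinct mean firing pattern; single trials add
# Poisson trial-to-trial variability around that pattern.
means = rng.uniform(0, 5, size=(n_classes, n_sites))
X = np.vstack([rng.poisson(means[c], size=(n_trials, n_sites))
               for c in range(n_classes)]).astype(float)
y = np.repeat(np.arange(n_classes), n_trials)

# Hold out some repetitions for testing (cross-validation in the paper).
train = np.tile(np.arange(n_trials) < 30, n_classes)
clfs = [LinearSVC(C=1.0, max_iter=10000)
        .fit(X[train], (y[train] == c).astype(int))
        for c in range(n_classes)]

# Maximum activation across the binary classifiers gives the prediction.
acts = np.column_stack([clf.decision_function(X[~train]) for clf in clfs])
pred = acts.argmax(axis=1)
accuracy = (pred == y[~train]).mean()
```

Because each binary classifier is linear, the whole readout reduces to a weighted sum of spike counts per class followed by a max, which is the sense in which it "could be directly implemented by neuronal hardware."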

We used the classifier approach to determine the ability of more than 300 sequentially collected IT sites from two passively fixating monkeys to “categorize” 77 gray-scale objects as belonging to one of eight possible groups (29) (Fig. 1A). Figure 1B (red curve) shows the cross-validated performance of classifiers in performing this categorization task as a function of the number of recording sites (30). The spiking activity of 256 randomly selected multi-unit activity (MUA) sites was sufficient to categorize the objects with 94 ± 4% accuracy (mean ± SD; for 100 sites, interpolated performance = 81%; chance = 12.5%). Similarly, we tested the ability of the IT population to identify each of the 77 objects (Fig. 1B, blue curve). Even small populations of IT neurons were capable of performing this identification task at high accuracy (for 256 sites, 72 ± 3% correct; for 100 sites, interpolated performance = 49%; chance = 1.3%), although at lower performance than categorization for the same number of sites (31). Classifier performance increased approximately linearly with the logarithm of the number of sites, which is indicative of a distributed representation in contrast to a grandmother-like representation (13, 28, 32, 33). Very similar levels of performance were obtained when single unit activity (SUA) was considered [Fig. 1C, (28)]. The local field potentials also contain information about object category [Fig. 1C, (28)]. Examination of the classification errors suggests that some objects and categories were easier to discriminate than others (Fig. 1D). All the results reported here were obtained using a linear (regularized) classifier. Classification performance was similar for several different types of classifiers, and the performance of linear classifiers—among the simplest classifiers—could not be substantially improved upon (28, 34).

Fig. 1.

Accurate readout of object category and identity from IT population activity. (A) Example of multi-unit spiking responses of 3 independently recorded sites to 5 of the 77 objects. Rasters show spikes in the 200 ms after stimulus onset for 10 repetitions (black bars indicate object presentation). (B) Performance of a linear classifier over the entire object set on test data (not used for training) as a function of the number of sites for reading out object category (red, chance = 12.5%) or identity (blue, chance = 1.3%). The input from each site was the spike count in consecutive 50-ms bins from 100 to 300 ms after stimulus onset (28). Sequentially recorded sites were combined by assuming independence (Supporting Online Material). In this and subsequent figures, error bars show the SD for 20 random choices of the sites used for training; the dashed lines show chance levels, and the bars next to the dashed lines show the range of performances using the 200 ms before stimulus onset (control). (C) Categorization performance (n = 64 sites, mean ± SEM) for different data sources used as input to the classifier: multi-unit activity (MUA) as shown in (B), single-unit activity (SUA), and local field potentials (LFP, Supporting Online Material). (D) This confusion matrix describes the pattern of mistakes made by the classifier (n = 256 sites). Each row indicates the actual category presented to the monkey (29), and each column indicates the classifier predictions (in color code).

The performance values in Fig. 1, B to D, are based on the responses of single stimulus presentations that were not included in the classifier training. Thus, the level of recognition performance is what real downstream neurons could, in theory, perform on a single trial by simply computing a weighted sum of spikes over a short time interval (100- to 300-ms interval divided into bins of 50 ms in this case) (11, 23, 24, 28). This is notable considering the high trial-to-trial variability of cortical neurons (27). The IT population performance is also robust to biological noise sources such as neuronal death and failures in neurotransmitter release [fig. S1, (35)]. Although Fig. 1 (and most other decoding studies) assumes precise knowledge about stimulus onset time, this is not a limitation because we could also accurately read out stimulus onset time from the same IT population [fig. S5, (28)].
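The single-trial computation attributed to a downstream neuron above, a weighted sum of spike counts in consecutive 50-ms bins spanning 100 to 300 ms after stimulus onset, can be made concrete as follows. Spike times, weights, and the threshold here are invented for illustration; only the binning scheme follows the text.

```python
# A downstream "neuron" classifying one trial as a weighted sum of spike
# counts in consecutive 50-ms bins (100-300 ms after stimulus onset).
import numpy as np

def bin_spike_counts(spike_times_ms, t_start=100.0, t_stop=300.0, bin_ms=50.0):
    """Spike count per site in consecutive bins of the analysis window."""
    edges = np.arange(t_start, t_stop + bin_ms, bin_ms)
    return np.array([np.histogram(st, bins=edges)[0] for st in spike_times_ms])

# Two sites, one trial: spike times in ms relative to stimulus onset.
trial = [np.array([112.0, 130.0, 180.0, 260.0]),   # site 1
         np.array([105.0, 240.0, 290.0])]          # site 2
counts = bin_spike_counts(trial)   # shape (n_sites, 4 bins)
features = counts.ravel()          # input vector to the linear readout

# Linear readout: activation = w . features + bias (weights hypothetical).
w = np.full(features.shape, 0.5)
activation = w @ features - 1.0    # -> 2.5 for this trial
decision = activation > 0
```

A trained classifier would supply learned weights in place of the uniform `w`; the point is that nothing beyond summation and thresholding is required on a single trial.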

A key computational difficulty of object recognition is that it requires both selectivity (different responses to distinct objects such as one face versus another face) and invariance to image transformations (similar responses to, e.g., rotations or translations of the same face) (8, 12, 17). The main achievement of mammalian vision, and one reason why it is still so much better than computer vision algorithms, is the combination of high selectivity and robust invariance. The results in Fig. 1 demonstrate selectivity; the IT population can also support generalization over objects within predefined categories, suggesting that neuronal responses within a category are similar (36). We also explored the ability of the IT population to generalize recognition over changes in position and scale by testing 71 additional sites with the original 77 images and four transformations in position or scale. We could reliably classify (with less than 10% reduction in performance) the objects across these transformations even though the classifier only “saw” each object at one particular scale and position during training (Fig. 2). The “identification” performance also robustly generalized across position and scale (28). Neurons also showed scale and position invariance for novel objects not seen before (fig. S6). The IT population representation is thus both selective and invariant in a highly nontrivial manner. That is, although neuronal population selectivity for objects could be obtained from areas like V1, this selectivity would not generalize over changes in, e.g., position (Supporting Online Material).
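The generalization test behind this result has a simple structure: fit the classifier on responses recorded under one stimulus condition, then score it on responses to transformed versions it never saw. The sketch below simulates that logic with invented data, in which a transformation perturbs each object's mean population pattern; it is not the study's data or exact procedure.

```python
# Train at the "standard" scale/position, test on "transformed" responses
# the classifier never saw during training (cf. Fig. 2). Simulated data:
# an invariant population keeps a similar per-object pattern across
# transformations, up to a perturbation controlled by `jitter`.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
n_obj, n_sites, n_rep = 8, 100, 20
patterns = rng.uniform(0, 5, size=(n_obj, n_sites))  # per-object tuning

def responses(jitter):
    """Simulated single-trial responses; jitter mimics a transformation."""
    rows = []
    for o in range(n_obj):
        mean = np.clip(patterns[o] + jitter * rng.normal(size=n_sites), 0, None)
        rows.append(rng.poisson(mean, size=(n_rep, n_sites)))
    return np.vstack(rows).astype(float)

y = np.repeat(np.arange(n_obj), n_rep)
clf = LinearSVC(max_iter=10000).fit(responses(0.0), y)  # train: standard
acc_same = clf.score(responses(0.0), y)    # test: same condition
acc_shift = clf.score(responses(0.5), y)   # test: "transformed" condition
```

In the actual experiment the transformed responses are real recordings at shifted or rescaled stimulus presentations; the comparison of `acc_same` and `acc_shift` is the analog of the under-10% performance drop reported above.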

Fig. 2.

Invariance to scale and position changes. Classification performance (categorization, n = 64 sites, chance = 12.5%) when the classifier was trained on the responses to the 77 objects at a single scale and position (depicted for one object by “TRAIN”) and performance was evaluated with spatially shifted or scaled versions of those objects (depicted for one object by “TEST”). The classifier never “saw” the shifted/scaled versions during training. Time interval = 100 to 300 ms after stimulus onset, bin size = 50 ms. The left-most column shows the performance for training and testing on separate repetitions of the objects at the same standard position and scale (as in Fig. 1). The second bar shows the performance after training on the standard position and scale (scale = 3.4°, center of gaze) and testing on the shifted and scaled images of the 77 objects. Subsequent columns use different image scales and positions for training and testing.

We studied the temporal resolution of the code by examining how classification performance depended on the spike count bin size in the interval from 100 to 300 ms after stimulus onset (Supporting Online Material). We observed that bin sizes ranging from 12.5 through 50 ms yielded better performance than larger bin sizes (Fig. 3A). This does not imply that downstream neurons are simply integrating over 50-ms intervals or that no useful object information is contained in smaller time intervals. Indeed, we could decode object category at 70 ± 3% accuracy using only the spikes contained in one single bin of 12.5-ms duration at 125-ms latency (Fig. 3B). Notably, this time bin typically contained zero to two spikes (0.18 ± 0.26 spikes/bin, mean ± SD). This shows that a few spikes from a small number of neurons (essentially a binary vector with either ones or zeros) are sufficient to encode “what” information in IT neurons within behaviorally relevant time scales.
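Extracting the classifier input for a single 12.5-ms bin at a fixed latency is a one-liner per site; with 0.18 ± 0.26 spikes per bin, the resulting vector is nearly binary. The spike times below are illustrative, not recorded data.

```python
# Feature for one trial: spike count per site in a single 12.5-ms bin
# starting at a fixed latency after stimulus onset. With so short a
# window, counts are almost always 0 or 1.
import numpy as np

def single_bin_features(spike_times_ms, latency=125.0, bin_ms=12.5):
    """Spike count per site in one short bin starting at `latency`."""
    lo, hi = latency, latency + bin_ms
    return np.array([np.sum((st >= lo) & (st < hi)) for st in spike_times_ms])

# Three sites, one trial (spike times in ms after stimulus onset).
trial = [np.array([112.0, 130.0]),
         np.array([105.0]),
         np.array([126.0, 128.0])]
f = single_bin_features(trial)   # one near-binary entry per site
```

Sweeping `latency` over the trial and retraining at each step reproduces the logic of the latency analysis in Fig. 3B.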

Fig. 3.

Latency and time resolution of the neural code. (A) Classification performance (n = 128 sites) as a function of the bin size (12.5 to 200 ms, i.e., temporal resolution) to count spikes within the 100- to 300-ms window after stimulus onset for categorization (red) and identification (blue). The same linear classifier as in Figs. 1 and 2 was used. (B) Classification performance (n = 256 sites) using a single bin of 12.5 ms to train and test the classifier at different latencies from stimulus onset (x axis). The colors and conventions are as in Fig. 1B.

What other “types” of information are carried in the IT population? Using the readout method, we compared the information available for “categorization” versus “identification” (18, 37, 38). The time course and temporal resolution did not depend strongly on the classification task (Fig. 3); the best sites for categorization overlapped the best sites for identification; the signal-to-noise ratios for categorization and identification were strongly correlated (r = 0.54, p < 10⁻¹⁰); and the same randomly selected sites could be used for both tasks (28). The same IT neuronal population can thus be used by downstream neurons to perform tasks traditionally considered to be different (e.g., “categorization” versus “identification”).

Although anterior IT cortex is generally regarded as the brain area at the top of the ventral “what” stream, the readout approach allowed us to examine the possibility that the IT population might contain useful information about object scale and position (“where”). Our observation that IT populations convey scale- and position-invariant object category and identity information (Fig. 2) might seem to suggest that object position information is lost in IT neurons. However, it is also possible to read out—at least coarsely—both object scale and position (“where” information) based on the activity of the same population, independent of identity or category, by training the classifier to learn the map between neuronal responses and scale or position, irrespective of object identity (fig. S4A). Reading out object position or scale had a similar time course to the readout of object category (fig. S4B). There was little correlation between the ability of each IT site to signal scale/position versus object category information, suggesting that IT neurons encode both types of information (fig. S4C).

Our observations characterize the available information in IT for object recognition, but they do not necessarily imply that the brain utilizes exclusively the IT neurons (39) or the same coding schemes and algorithms that we have used for decoding. However, a linear classifier—which we found to be very close to optimal (34)—could be easily implemented in the brain by summating appropriately weighted inputs to downstream neurons. Thus, targets of IT [such as prefrontal cortex (PFC)] could decode information over brief time intervals, using inputs from small neuronal populations (e.g., ∼100 neurons). It is conceivable that the dynamic setting of the synaptic weights from IT to PFC may switch between different tasks in PFC, reading out information from the same neuronal population in IT cortex (18). In this perspective, some neurons in IT cortex would be similar to tuned units in a learning network, supporting a range of different recognition tasks including “categorization” and “identification” in PFC (40).

The approach described here can be used to characterize the information represented in a cortical area, such as object identity in IT cortex (2–6, 8–11). Classifiers can be trained on any stimulus property and then tested to systematically examine putative neural codes for that stimulus information. Our results quantitatively show how targets of IT cortex may rapidly, accurately, and robustly perform tasks of categorization, identification, and readout of scale and position based on the activity of small neuronal populations in IT cortex.

Supporting Online Material

SOM Text

Figs. S1 to S7


References and Notes
