Unsupervised Natural Experience Rapidly Alters Invariant Object Representation in Visual Cortex


Science  12 Sep 2008:
Vol. 321, Issue 5895, pp. 1502-1507
DOI: 10.1126/science.1160028


Object recognition is challenging because each object produces myriad retinal images. Responses of neurons from the inferior temporal cortex (IT) are selective to different objects, yet tolerant (“invariant”) to changes in object position, scale, and pose. How does the brain construct this neuronal tolerance? We report a form of neuronal learning that suggests the underlying solution. Targeted alteration of the natural temporal contiguity of visual experience caused specific changes in IT position tolerance. This unsupervised temporal slowness learning (UTL) was substantial, increased with experience, and was significant in single IT neurons after just 1 hour. Together with previous theoretical work and human object perception experiments, we speculate that UTL may reflect the mechanism by which the visual stream builds and maintains tolerant object representations.

When presented with a visual image, primates can rapidly (<200 ms) recognize objects despite large variations in object position, scale, and pose (1, 2). This ability likely derives from the responses of neurons at high levels of the primate ventral visual stream (3–5). But how are these powerful “invariant” neuronal object representations built by the visual system? On the basis of theoretical (6–11) and behavioral (12, 13) work, one possibility is that tolerance (“invariance”) is learned from the temporal contiguity of object features during natural visual experience, potentially in an unsupervised manner. Specifically, during natural visual experience, objects tend to remain present for seconds or longer, while object motion or viewer motion (e.g., eye movements) tends to cause rapid changes in the retinal image cast by each object over shorter time intervals (hundreds of ms). The ventral visual stream could construct a tolerant object representation by taking advantage of this natural tendency for temporally contiguous retinal images to belong to the same object. If this hypothesis is correct, it might be possible to uncover a neuronal signature of the underlying learning by using targeted alteration of those spatiotemporal statistics (12, 13).

To look for such a signature, we focused on position tolerance. If two objects consistently swapped identity across temporally contiguous changes in retinal position, then, after sufficient experience in this “altered” visual world, the visual system might incorrectly associate the neural representations of those objects viewed at different positions into a single object representation (12, 13). We focused on the top level of the primate ventral visual stream, the inferior temporal cortex (IT), where many individual neurons possess position tolerance—they respond preferentially to different objects, and that selectivity is largely maintained across changes in object retinal position, even when images are simply presented to a fixating animal (14, 15).

We tested a strong, “online” form of the temporal contiguity hypothesis—two monkeys visually explored an altered visual world (Fig. 1A, “Exposure phase”), and we paused every ∼15 min to test each IT neuron for any change in position tolerance produced by that altered experience (Fig. 1A, “Test phase”). We concentrated on each neuron's responses to two objects that elicited strong (object “P”, preferred) and moderate (object “N”, nonpreferred) responses, and we tested the position tolerance of that object selectivity by briefly presenting each object at 3° above, below, or at the center of gaze (16) (fig. S1). All neuronal data reported in this study were obtained in these test phases: animal tasks unrelated to the test stimuli; no attentional cueing; and completely randomized, brief presentations of test stimuli (16). We alternated between these two phases (test phase ∼5 min; exposure phase ∼15 min) until neuronal isolation was lost.

Fig. 1.

Experimental design and predictions. (A) IT responses were tested in “Test phase” (green boxes, see text), which alternated with “Exposure phase.” Each exposure phase consisted of 100 normal exposures (50 P→P, 50 N→N) and 100 swap exposures (50 P→N, 50 N→P). Stimulus size was 1.5° (16). (B) Each box shows the exposure-phase design for a single neuron. Arrows show the saccade-induced temporal contiguity of retinal images (arrowheads point to the retinal images occurring later in time, i.e., at the end of the saccade). The swap position was strictly alternated (neuron-by-neuron) so that it was counter-balanced across neurons. (C) Prediction for responses collected in the test phase: If the visual system builds tolerance using temporal contiguity (here driven by saccades), the swap exposure should cause incorrect grouping of two different object images (here P and N). Thus, the predicted effect is a decrease in object selectivity at the swap position that increases with increasing exposure (in the limit, reversing object preference), and little or no change in object selectivity at the non-swap position.

To create the altered visual world (“Exposure phase” in Fig. 1A), each monkey freely viewed the video monitor on which isolated objects appeared intermittently, and its only task was to freely look at each object. This exposure “task” is a natural, automatic primate behavior in that it requires no training. However, by means of real-time eye-tracking (17), the images that played out on the monkey's retina during exploration of this world were under precise experimental control (16). The objects were placed on the video monitor so as to (initially) cast their image at one of two possible retinal positions (+3° or –3°). One of these retinal positions was pre-chosen for targeted alteration in visual experience (the “swap” position; counterbalanced across neurons) (Fig. 1B) (16); the other position acted as a control (the “non-swap” position). The monkey quickly saccaded to each object (mean: 108 ms after object appearance), which rapidly brought the object image to the center of its retina (mean saccade duration 23 ms). When the object had appeared at the non-swap position, its identity remained stable as the monkey saccaded to it, typical of real-world visual experience (“Normal exposure”, Fig. 1A) (16). However, when the object had appeared at the swap position, it was always replaced by the other object (e.g., P→N) as the monkey saccaded to it (Fig. 1A, “Swap exposure”). This experience manipulation took advantage of the fact that primates are effectively blind during the brief time it takes to complete a saccade (18). It consistently made the image of one object at a peripheral retinal position (swap position) temporally contiguous with the retinal image of the other object at the center of the retina (Fig. 1).
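The gaze-contingent swap logic described above can be summarized in a small decision sketch. This is our reconstruction from the text, not the authors' real-time experimental code (which additionally required online eye tracking); the function name and arguments are illustrative:

```python
def displayed_object(appear_pos_deg, swap_pos_deg, initial_obj, saccade_complete):
    """Object identity on the screen once the saccade to it lands.

    Illustrative reconstruction of the exposure-phase logic: the +/-3 deg
    positions and P/N labels follow the text, but this is not the authors'
    experimental code.
    """
    if saccade_complete and appear_pos_deg == swap_pos_deg:
        # Swap exposure: the object is replaced mid-saccade (P->N or N->P),
        # exploiting saccadic suppression so the swap is never seen.
        return "N" if initial_obj == "P" else "P"
    # Normal exposure: object identity is stable across the saccade.
    return initial_obj

# Object P appears at the swap position (+3 deg): the monkey lands on N.
print(displayed_object(+3, +3, "P", True))   # "N"
# Object P appears at the non-swap position (-3 deg): identity is unchanged.
print(displayed_object(-3, +3, "P", True))   # "P"
```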

We recorded from 101 IT neurons while the monkeys were exposed to this altered visual world (isolation held for at least two test phases; n = 50 in monkey 1; 51 in monkey 2). For each neuron, we measured its object selectivity at each position as the difference in response to the two objects (P – N; all key effects were also found with a contrast index of selectivity) (fig. S6). We found that, at the swap position, IT neurons (on average) decreased their initial object selectivity for P over N, and this change in object selectivity grew monotonically stronger with increasing numbers of swap exposure trials (Fig. 2, A and C). However, the same IT neurons showed (Fig. 2A) no average change in their object selectivity at the equally eccentric control position (non-swap position), and little change in their object selectivity among two other (nonexposed) control objects (see below).
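The two selectivity measures used above—the response difference (P – N) and a contrast index—can be sketched as follows (illustrative Python, not the authors' analysis code; function names and the example firing rates are our own):

```python
import numpy as np

def selectivity_difference(resp_P, resp_N):
    """Object selectivity as the mean response difference P - N (spikes/s)."""
    return np.mean(resp_P) - np.mean(resp_N)

def selectivity_contrast(resp_P, resp_N):
    """Contrast index of selectivity, (P - N) / (P + N), bounded in [-1, 1]."""
    p, n = np.mean(resp_P), np.mean(resp_N)
    return (p - n) / (p + n)

# Example: responses (spikes/s) over repeated randomized test-phase presentations
resp_P = np.array([22.0, 18.0, 20.0, 24.0])   # preferred object
resp_N = np.array([10.0, 12.0, 8.0, 10.0])    # nonpreferred object
print(selectivity_difference(resp_P, resp_N))  # 11.0 spikes/s
```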

Fig. 2.

Change in the population object selectivity. (A) Mean population object selectivity at the swap and (equally eccentric) non-swap position, and for control objects at the swap position. Each row of plots shows the effect among all neurons held for at least the indicated amount of exposure (e.g., top row shows all neurons held for more than 100 swap exposures—including the neurons from the lower rows). The object selectivity for each neuron was the difference in its response to object P and N. To avoid any bias in this estimate, for each neuron we defined the labels “P” (preferred) and “N” using a portion of the pre-exposure data (10 repetitions), and used the remainder to compute the displayed results in all analyses using these labels. Though there was, by chance, slightly greater initial selectivity at the swap position, this cannot explain the position specificity of the observed change in selectivity (table S2). (B) Mean population object selectivity of 10 multi-unit sites. Error bars (A and B) are SEMs. (C) Histograms of the object selectivity change at the swap position, Δ(P – N) = (P – N)post-exposure – (P – N)pre-exposure. The arrows indicate the means of the distributions. The mean Δ(P – N) at the non-swap position was –0.01, –0.5, –0.9, and –0.9 spikes/s, respectively. The variability around that mean (i.e., distribution along the x axis) is commensurate with repeated measurements in the face of known Poisson spiking variability (fig. S11). (D) Object selectivity changes at the multi-unit sites. The mean Δ(P – N) at the non-swap position was 1.6 spikes/s.

Because each IT neuron was tested for different amounts of exposure time, we first computed a net object selectivity change, Δ(P – N), in the IT population by using the first and last available test phase data for each neuron. The prediction was that Δ(P – N) should be negative (i.e., in the direction of object preference reversal), and greatest at the swap position (Fig. 1C). This prediction was borne out (Fig. 3A). The position specificity of the experience-induced changes in object selectivity was confirmed by two different statistical approaches: (i) a direct comparison of Δ(P – N) between the swap and non-swap position (n = 101; P = 0.005, one-tailed paired t test); (ii) a significant interaction between position and exposure—that is, object selectivity decreased at the swap position with increasing amounts of exposure [P = 0.009 by one-tailed bootstrap; P = 0.007 by one-tailed permutation test; tests were done on (P – N)].
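A resampling test of the kind used here can be sketched as a one-tailed paired sign-flip permutation test on the per-neuron difference of Δ(P – N) between the swap and non-swap positions (our illustrative analog, not the authors' exact bootstrap or permutation procedure; the synthetic data are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_permutation_p(delta_swap, delta_nonswap, n_perm=10000):
    """One-tailed paired permutation test: is the mean of
    (delta_swap - delta_nonswap) more negative than expected by chance?
    Under the null, swap/non-swap labels are exchangeable within a neuron,
    so we randomly flip the sign of each neuron's paired difference."""
    diffs = np.asarray(delta_swap) - np.asarray(delta_nonswap)
    observed = diffs.mean()
    count = 0
    for _ in range(n_perm):
        signs = rng.choice([-1, 1], size=diffs.size)
        if (diffs * signs).mean() <= observed:
            count += 1
    return count / n_perm

# Synthetic example: selectivity declines (negative Delta) at the swap
# position only, as the hypothesis predicts.
delta_swap = [-8.5, -5.5, -8.0, -9.0, -4.0, -6.5, -8.0, -6.5]  # spikes/s
delta_nonswap = [0.0] * 8
p = paired_permutation_p(delta_swap, delta_nonswap)
print(p)  # small for this synthetic data (well below 0.05)
```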

Fig. 3.

Position specificity, object specificity, and time course. (A) Mean object selectivity change, Δ(P – N), at the swap, non-swap, and central (0°) retinal position. Δ(P – N) was computed as in Fig. 2C from each neuron's first and last available test phase (mean ∼200 swap exposures). The insets show the same analysis performed separately for each monkey. (B) Mean object selectivity change for the (exposed) swap objects and (nonexposed) control objects at the swap position. Error bars (A and B) are SEMs. The swap object selectivity change at the swap position is statistically significant (*) in the pooled data as well as in individual animals (P < 0.05, one-tailed t test against 0). (C) Mean object selectivity change as a function of the number of swap exposures for all single-unit (n = 101) and multi-unit sites (n = 10). Each data point shows the average across all the neurons and sites held for a particular amount of time. Gray line is the best linear fit with a zero intercept; slope is mean effect size: –5.6 spikes/s per 400 exposures. The slope at the non-swap position based on the same analysis was 0.6 spikes/s (not shown).

The changes in object selectivity at the swap position were also largely shape-specific. For 88 of the 101 neurons, we monitored the neuron's selectivity among two control objects not shown to the animals during the exposure phase (chosen in the same way as the P and N objects, with fully interleaved testing in each test phase) (16). Across the IT population, control object selectivity at the swap position did not significantly change (Fig. 2A), and the swap object selectivity changed significantly more than the control object selectivity (Fig. 3B) (n = 88, P = 0.009, one-tailed paired t test of swap versus control objects at the swap position).

These changes in object selectivity were substantial (average change of ∼5 spikes/s per 400 exposures at the swap position) (Figs. 2C and 3C) and were readily apparent and highly significant at the population level. In the face of well-known Poisson spiking variability (19, 20), these effects were only weakly visible in most single IT neurons recorded for short durations, but were much more apparent over the maximal 1-hour exposure time that we could hold neurons in isolation (Fig. 2C, lower panels). To determine if the object selectivity changes continued to grow even larger with longer periods of exposure, we next recorded multi-unit activity (MUA) in one animal (monkey 2), which allowed us to record from a number of (nonisolated) neurons around the electrode tip (which all tend to have similar selectivity) (21, 22) while the monkey was exposed to the altered visual world for the entire experiment (∼2 hours) (16). The MUA data replicated the single-unit results—a change in object selectivity only at the swap position (Fig. 2B) (“position × exposure” interaction: P = 0.03, one-tailed bootstrap; P = 0.014, one-tailed permutation test; n = 10). Furthermore, the MUA object selectivity change at the swap position continued to increase as the animal received even more exposure to the altered visual world, followed a very similar time course in the rate of object selectivity change (∼5 spikes/s per 400 exposures) (Fig. 3C), and even showed a slight reversal in object selectivity (N > P in Fig. 4D).

Fig. 4.

Responses to objects P and N. (A) Response data to object P and N at the swap position for three example neurons and one multi-unit site as a function of exposure time. The solid line is standard linear regression. The slope of each line (ΔS) provides a measure of the response change to object P and N for each neuron. Some neurons showed a response decrease to P, some showed a response enhancement to N, and others showed both (see examples). (B) Histograms of the slopes obtained for the object-selective neurons/sites tested for at least 300 exposures. The dark-colored bars indicate neurons with significant change by permutation test (P < 0.05) (16). (C) Histograms of the slopes from linear regression fits to object selectivity (P – N) as a function of exposure time; units are the same as in (B). Arrow indicates the mean of the distribution [the mean ΔS(P – N) at the non-swap position was –1.7 spikes/s, P = 0.38]. The black bars indicate instances (32%; 12 of 38 neurons and sites) that showed a significant change in object selectivity by permutation test (P < 0.05). Results were very similar when we discarded neurons and sites with greater initial selectivity at the swap position (fig. S8). (D) Data from all the neurons and sites that were tested for the longest exposure time. The plot shows the mean normalized response to object P and N as a function of exposure time (compare to Fig. 1C; see fig. S3 for data at the non-swap position and for control objects). Error bars (A and D) are SEMs.

Our main results were similar in magnitude (Fig. 3, A and B) and statistically significant in each of the two monkeys (monkey 1: P = 0.019; monkey 2: P = 0.0192; one-tailed t test). Each monkey performed a different task during the test phase (16), suggesting that these neuronal changes are not task dependent.

Because we selected the objects P and N so that they both tended to drive the neuron (16), the population distribution of selectivity for P and N at each position was very broad [95% range: (–5.7 to 31.0 spikes/s) pooled across position; n = 101]. However, our main prediction assumes that the IT neurons were initially object-selective (i.e., the response to object P was greater than to object N). Consistent with this, neurons in our population with no initial object selectivity at the center of gaze showed little average change in object selectivity at the swap position with exposure (fig. S5). To test the learning effect in the most selective IT neurons, we selected the neurons with significant object selectivity [n = 52 of 101 neurons; two-way analysis of variance (2 objects × 3 positions), P < 0.05, significant main object effect or interaction]. Among this smaller number of object-selective neurons, the learning effect remained highly significant and still specific to the swap position (P = 0.002 by t test; P = 0.009 by bootstrap; P = 0.004 by permutation test).

To further characterize the response changes to individual objects, we closely examined the selective neurons held for at least 300 exposures (n = 28 of 52 neurons) and the multi-unit sites (n = 10). For each neuron and site, we used linear regression to measure any trend in response to each object as a function of exposure time (Fig. 4A). Changes in response to P and N at the swap position were apparent in a fraction of single neurons and sites (Fig. 4A), and statistically significant object selectivity change was encountered in 12 of 38 (32%) instances (Fig. 4C) (16). Across our neuronal population, the change in object selectivity at the swap position was due to both a decreased response to object P and an increased response to object N (approximately equal change) (Fig. 4B). These response changes were highly visible in the single-units and multi-units held for the longest exposure times (Fig. 4D).
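The per-neuron trend measure ΔS—the slope of a standard linear regression of response on exposure—can be sketched as follows (illustrative; the hypothetical neuron's numbers are ours, chosen to produce an effect size near the reported population average):

```python
import numpy as np

def response_slope(exposures, responses):
    """Least-squares slope of response (spikes/s) versus swap-exposure
    count; one slope per object per neuron, as a trend measure."""
    slope, _intercept = np.polyfit(exposures, responses, 1)
    return slope

# Hypothetical neuron whose response to object P declines with exposure
exposures = np.array([0, 100, 200, 300, 400])
resp_P = np.array([20.0, 19.0, 17.0, 16.0, 15.0])  # spikes/s

slope = response_slope(exposures, resp_P)
# Express the trend as spikes/s per 400 exposures, as in the text:
print(slope * 400)  # about -5.2 for this hypothetical neuron
```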

These changes in the position profile of IT object selectivity (i.e., position tolerance) cannot be explained by changes in attention or by adaptation (fig. S10). First, a simple fatigue-adaptation model cannot explain the position specificity of the changes because, during the recording of each neuron, each object was experienced equally often at the swap and non-swap positions (also see additional control in table S2). Second, we measured these object selectivity changes with briefly presented, fully randomized stimuli while the monkeys performed tasks unrelated to the stimuli (16), which argues against an attentional account. Third, both of these explanations predict response decrease to all objects at the swap position, yet we found that the change in object selectivity at the swap position was due to an increase in response to object N (+2.3 spikes/s per 400 swap exposures) as well as a decrease in response to object P (–3.0 spikes/s per 400 swap exposures) (Fig. 4). Fourth, neither possibility can explain the shape specificity of the changes.

We term this effect “unsupervised temporal slowness learning” (UTL), because the selectivity changes depend on the temporal contiguity of object images on the retina and are consistent with the hypothesis that the natural stability (slowness) of object identity instructs the learning without external supervision (6–11). Our current data as well as previous human object perception experiments (12) cannot rule out the possibility that the brain's saccade-generation mechanisms or the associated attentional mechanisms (23, 24) may also be needed. Indeed, eye-movement signals are present in the ventral stream (25, 26). The relatively fast time scale and unsupervised nature of UTL may allow rapid advances in answering these questions, systematically characterizing the spatiotemporal sensory statistics that drive it, and understanding if and how it extends to other types of image tolerance (e.g., changes in object scale, pose) (27, 28).

IT neurons “learn” to give similar responses to different visual shapes (“paired associates”) when reward is used to explicitly teach monkeys to associate those shapes over long time scales [1 to 5 s between images; see, e.g., (29, 30)], but sometimes without explicit instruction (31, 32). A top-down explanation of the neuronal selectivity changes in our study is unlikely because animals performed tasks that were unrelated to the object images when the selectivity was probed, and the selectivity changes were present in the earliest part of the IT responses (∼100 ms; fig. S4). But UTL could be an instance of the same plasticity mechanisms that underlie “paired associate” learning; here, the “associations” are between object images at different retinal positions (which, in the real world, are typically images of the same object). However, UTL may be qualitatively different because (i) the learning is retinal position-specific; (ii) it operates over the much shorter time scales of natural visual exploration (∼200 ms); and (iii) it is unsupervised in that, besides the visual world, no external “teacher” was used to direct the learning (e.g., no association-contingent reward was used, but we do not rule out the role of internal “teachers” such as efferent eye-movement signals). These distinctions are important because we naturally receive orders-of-magnitude more such experience (e.g., ∼10⁸ unsupervised temporal-contiguity saccadic “experiences” per year of life).

Our results show that targeted alteration of natural, unsupervised visual experience changes the position tolerance of IT neurons as predicted by the hypothesis that the brain uses a temporal contiguity learning strategy to build that tolerance in the first place. Several computational models show how such strategies can build tolerance (6–11), and such models can be implemented by means of Hebbian-like learning rules (8, 33) that are consistent with spike-timing–dependent plasticity (34). One can imagine IT neurons using almost temporally coincident activity to learn which sets of its afferents correspond to features of the same object at different positions. The time course and task independence of UTL are consistent with synaptic plasticity (35, 36), but our data do not constrain the locus of plasticity, and changes at multiple levels of the ventral visual stream are likely (37, 38).
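One classic Hebbian-like rule of the kind alluded to above is a Földiák-style trace rule, in which a temporally smoothed output replaces instantaneous activity in the Hebbian product, so that inputs occurring close together in time are wired onto the same output unit. A minimal sketch (all parameters, variable names, and the one-hot inputs are illustrative, not the authors' model):

```python
import numpy as np

def trace_rule_update(w, x, trace, eta=0.05, delta=0.8):
    """One step of a Foldiak-style trace rule. The postsynaptic 'trace'
    (a running average of recent output) replaces instantaneous activity
    in the Hebbian product, binding temporally contiguous inputs onto
    the same output unit."""
    y = w @ x                                 # instantaneous output
    trace = delta * trace + (1 - delta) * y   # temporally smoothed output
    w = w + eta * trace * x                   # Hebbian update with the trace
    return w / np.linalg.norm(w), trace       # normalize to bound weights

# Two one-hot "retinal images" of the same object at different positions,
# made temporally contiguous (as across a saccade):
x_peripheral = np.array([1.0, 0.0, 0.0, 0.0])
x_foveal     = np.array([0.0, 1.0, 0.0, 0.0])

w, trace = np.full(4, 0.5), 0.0
for t in range(100):
    x = x_peripheral if t % 2 == 0 else x_foveal
    w, trace = trace_rule_update(w, x, trace)

# The unit comes to respond to both images (w[0] and w[1] grow together),
# while weights for never-contiguous inputs (w[2], w[3]) fade under
# normalization -- a toy analog of learned position tolerance.
```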

We do not yet know if UTL reflects mechanisms that are necessary for building tolerant representations. But these same experience manipulations change the position tolerance of human object perception—producing a tendency to, for example, perceive one object to be the same identity as another object across a swap position (12). Moreover, given that the animals had a lifetime of visual experience to potentially build their IT position tolerance, the strength of UTL is substantial (∼5 spikes/s change per hour)—just 1 hour of UTL is comparable to attentional effect sizes (39) and is more than double that observed in previous IT learning studies over much longer training intervals (40–42). We do not yet know how far we can extend this learning, but just 2 hours of (highly targeted) unsupervised experience begins to reverse the object preferences of IT neurons (Fig. 4D). This discovery reemphasizes the importance of plasticity in vision (4, 32, 35, 37, 40, 41, 43, 44) by showing that it extends to a bedrock property of the adult ventral visual stream—position-tolerant object selectivity (45–47), and studies along the postnatal developmental time line are now needed.

Supporting Online Material

Materials and Methods

SOM Text

Figs. S1 to S12

Tables S1 and S2

References and Notes

