Research Article

Intersubject Synchronization of Cortical Activity During Natural Vision


Science  12 Mar 2004:
Vol. 303, Issue 5664, pp. 1634-1640
DOI: 10.1126/science.1089506


To what extent do all brains work alike during natural conditions? We explored this question by letting five subjects freely view half an hour of a popular movie while undergoing functional brain imaging. Applying an unbiased analysis in which spatiotemporal activity patterns in one brain were used to “model” activity in another brain, we found a striking level of voxel-by-voxel synchronization between individuals, not only in primary and secondary visual and auditory areas but also in association cortices. The results reveal a surprising tendency of individual brains to “tick collectively” during natural vision. The intersubject synchronization consisted of a widespread cortical activation pattern correlated with emotionally arousing scenes and regionally selective components. The characteristics of these activations were revealed with the use of an open-ended “reverse-correlation” approach, which inverts the conventional analysis by letting the brain signals themselves “pick up” the optimal stimuli for each specialized cortical area.

A fundamental question in neuroscience is to what extent brains of different human individuals operate in a similar manner. Although numerous neuroimaging studies have demonstrated a substantial similarity across different brains, these results have been obtained in highly controlled experimental settings that constrain and remove all spontaneous, individual variations. In the visual domain, this question can be placed in the context of an even broader, fundamental puzzle: Do we all see the world in the same way? In a typical visual mapping experiment, subjects are presented with simplified visual stimuli and asked to maintain fixation and perform identical and particularly demanding attentional tasks. With the use of such highly controlled conditions, a consistent network of functionally distinct retinotopic areas (1, 2) and object-related regions has been described along the entire extent of human lateral-occipital and temporal cortices (3–5). However, natural vision drastically differs from conventional visual mapping studies by relaxing at least four fundamental constraints: (i) Visual stimuli are not presented in isolation but are embedded in a complex multiobject scene. (ii) Objects move in a complex manner within the scene. (iii) Subjects freely move their eyes. (iv) Seeing usually interacts with additional modalities, as well as context and emotional valence. Thus, the world seen in the controlled experimental setting bears little resemblance to our natural viewing experience.

Recently, several studies have begun to investigate the functional architecture of macaque (6–8) and human brains (9–12) under more naturalistic settings. However, because of the spatial and temporal complexity, multidimensionality, and lack of any prefixed protocol, it is difficult to use conventional hypothesis-driven analysis methods in the natural viewing setting [but see (11, 12)]. To overcome this inherent limitation, we have introduced an unbiased type of analysis that does not rely on predetermined stimulation protocols. This was done in two ways: (i) In the intersubject correlation analysis, we used the voxels' time courses of one brain to predict the activity in other brains. The strength of this across-subject correlation measure is that it allows the detection of all sensory-driven cortical areas without the need for any prior design matrix or assumptions as to their exact functional responses. We will refer to this across-subject voxel-by-voxel synchronization as the “intersubject dimension.” (ii) In the “reverse-correlation” analysis, we used regionally specific brain responses to identify the particular attributes or dimensions of complex natural stimuli present at times of peak responses. Put simply, we inverted the classical approach, which uses a set of predefined stimuli to locate brain regions. Instead, we used the brain activations themselves to find the preferred stimuli embedded in the complex stimulation sequence.

We implemented this approach in the study of the functional organization of human cortex under free viewing of a long (30 min) uninterrupted segment taken from an original audiovisual feature film (13). Subjects were instructed to freely view the movie segment and report its plot at the end of the experiment (14). We reasoned that such rich and complex stimulation would be much closer to ecological vision than the highly constrained visual stimuli used in the laboratory.

Intersubject Correlation

To examine the intersubject dimension, we normalized all brains into a Talairach coordinate system, spatially smoothed the data, and then used the time course of each voxel in a given source brain as a predictor of the activation in the corresponding voxel of the target brain (15). Overall, there were 10 unique pairwise comparisons between the five subjects watching the same movie. Despite the free viewing and complex nature of the movie, we found an extensive and highly significant correlation across individuals watching the same movie. Thus, on average over 29% ± 10 SD of the cortical surface showed a highly significant intersubject correlation during the movie (Fig. 1A). Figure 1B shows a representative correlation map between two subjects in whom the percentage of functionally correlated cortical surface was around the mean (30%). All 10 pairwise comparisons, arranged in a descending order according to the fraction of correlated cortex, are presented in fig. S1.
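The pairwise procedure described above can be illustrated with a minimal sketch. It assumes the data are already normalized and smoothed, so that voxel i refers to the same location in every subject. The function names, the toy data, and the fixed correlation threshold `r_threshold` are hypothetical stand-ins; the statistical criterion actually used is the one described in (15), which this sketch does not reproduce.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat time course carries no correlation information
    return sxy / sqrt(sxx * syy)

def intersubject_fraction(source, target, r_threshold):
    """Fraction of voxels whose source/target time courses exceed r_threshold.

    source, target: lists of voxel time courses in the same voxel order.
    """
    rs = [pearson(s, t) for s, t in zip(source, target)]
    return sum(r > r_threshold for r in rs) / len(rs)

def all_pairs(subjects, r_threshold=0.5):
    """Map each unordered subject pair to its fraction of correlated voxels."""
    return {
        (a, b): intersubject_fraction(subjects[a], subjects[b], r_threshold)
        for a, b in combinations(sorted(subjects), 2)
    }

# Toy data (hypothetical): 5 subjects, 2 voxels, 6 time points each.
shared = [0.0, 1.0, 0.0, 2.0, 1.0, 3.0]
toy = {
    name: [
        [v + 0.1 * k for v in shared],          # voxel 0: same shape in everyone
        [(-1.0) ** (k + i) for i in range(6)],  # voxel 1: sign flips with subject index
    ]
    for k, name in enumerate(["s1", "s2", "s3", "s4", "s5"])
}
fractions = all_pairs(toy)
```

Five subjects yield 5 × 4 / 2 = 10 unordered pairs, matching the 10 comparisons reported in the text. In the toy data, voxel 0 is correlated in every pair, whereas voxel 1 is correlated only in pairs whose subject indices have the same parity.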

Fig. 1.

Intersubject correlation during free viewing of an uninterrupted movie segment. (A) Average percentage of functionally correlated cortical surface across all pairwise comparisons between subjects for the entire movie time course (All), for the regionally specific movie time course (after the removal of the nonselective component, Regional) and for the darkness control experiment (In darkness). (B) Voxel-by-voxel intersubject correlation between the source subject (ZO) and the target subject (SN). Correlation maps are shown on unfolded left and right hemispheres (LH and RH, respectively). Color indicates the significance level of the intersubject correlation in each voxel. Black dotted lines denote borders of retinotopic visual areas V1, V2, V3, VP, V3A, V4/V8, and estimated border of auditory cortex (A1+). The face-, object-, and building-related borders (red, blue, and green rings, respectively) are also superimposed on the map. Note the substantial extent of intersubject correlations and the extension of the correlations beyond visual and auditory cortices.

Close inspection of this across-subject correlation revealed that the synchronization was far more extensive than the boundaries of well-known audiovisual sensory cortex defined with the conventional mapping approach (4). This point is illustrated in Fig. 1B, where the borders of early retinotopic areas are marked by black dotted lines, and color contours mark the high-order face-, building-, and common object-related regions. As can be seen, the across-subject correlation covered most of the visual system, including early retinotopic areas as well as high-order object areas within the occipitotemporal and intraparietal cortex. Moreover, the correlation extended far beyond the visual and auditory cortices [estimated border of auditory cortex (A1+) is marked by black dotted line (16)] to the entire superior temporal sulcus (STS) and lateral sulcus (LS), the retrosplenial gyrus, and even secondary somatosensory regions in the postcentral sulcus, as well as multimodal areas in the inferior frontal gyrus and parts of the limbic system in the cingulate gyrus.

This strong intersubject correlation shows that, despite the completely free viewing of dynamical, complex scenes, individual brains “tick together” in synchronized spatiotemporal patterns when exposed to the same visual environment.

In order to rule out the possibility that the across-subject correlations were introduced by scanner noise or preprocessing procedures, we measured intersubject correlations between five subjects scanned while lying passively in the dark with their eyes closed for 10 min. The intersubject correlation during darkness was negligible compared to the movie condition (Fig. 1A and fig. S2).

Correlation across cortical space. What is the source of the strong intersubject correlation? We found two separate components that underlie this correlation: (i) A widespread, spatially nonselective activation wave that was apparent across cortical areas. (ii) A selective, regionally distinct component, associated with specific functional properties of individual cortical regions. First, we will discuss the nonselective activation wave.

To assess the nonselective component we mapped, within each subject, the correlation between anatomically distinct cortical regions. Figure 2A demonstrates the correlation between the average time course of the entire ventral occipito-temporal (VOT) cortex in one cortical hemisphere (red contour, Fig. 2A), and the rest of the cortex, including the other hemisphere. As can be seen, the within-subject correlation to VOT activation was extremely high and widespread and showed similarity to the intersubject correlation map (compare Figs. 1 and 2). Similar maps were obtained with the use of the dorsal occipitotemporal (DOT) subdivision (17). Figure 2B shows the nonselective time courses obtained during the first 10 min of the movie for all five subjects. The striking similarity between subjects attests to the degree to which global cortical activity was synchronized across individuals watching the same movie.

Fig. 2.

Nonselective activation across regions. (A) Correlation between the averaged time course of the VOT cortex in one cortical hemisphere (correlation seed marked by the red contour) and the rest of the cortex, shown on unfolded left and right hemispheres. (B) The average nonselective time course across all activated regions obtained during the first 10 min of the movie for all five subjects. Red line represents the across-subject average time course. There is a striking degree of synchronization among different individuals watching the same movie.

The “reverse-correlation” approach. In order to identify the source of such powerful common “consensus” between different individuals watching the same movie, we adopted an analysis approach loosely analogous to the reverse-correlation method used for single unit mapping (6, 18). In this analysis, we used the peaks of activation in a given region's time course to recover the stimulus events that evoked them, thus constructing a regionally specific “movie” based on the appended sequence of all frames that evoked strong activation in a particular region of interest (ROI) while skipping all weakly activating time points (19).
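A minimal sketch of the peak-picking step is given below. The repetition time `tr_seconds`, the hemodynamic delay `lag_seconds`, and the simple local-maximum rule are assumptions made for illustration only; the procedure actually used is described in (19).

```python
def reverse_correlation(time_course, n_peaks, tr_seconds, lag_seconds):
    """Return movie times (in seconds) of the strongest local peaks, strongest first."""
    peaks = [
        i for i in range(1, len(time_course) - 1)
        if time_course[i - 1] < time_course[i] >= time_course[i + 1]
    ]
    peaks.sort(key=lambda i: time_course[i], reverse=True)
    # Shift each peak back by the assumed hemodynamic lag to approximate
    # the time of the stimulus that evoked it.
    return [max(0.0, i * tr_seconds - lag_seconds) for i in peaks[:n_peaks]]

# Toy ROI time course (hypothetical): local peaks at samples 1, 3, 5, and 7.
roi = [0, 1, 0, 3, 1, 2, 0, 5, 0]
times = reverse_correlation(roi, n_peaks=2, tr_seconds=3.0, lag_seconds=6.0)
```

With these assumed values, the two strongest peaks (samples 7 and 3) map to movie times of 15 s and 3 s. Appending the frames around such times, in order, would give the regionally specific "movie" described above.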

Applying the reverse correlation to the global nonselective time course (red line, Fig. 2B) revealed a substantial component of emotionally charged and surprising moments in the original movie (e.g., all gunshots and explosion scenes, or surprising shifts in the movie plot). The time codes of all frames associated with the peaks of activation in the nonselective time course are presented in table S1. Readers who can obtain the original digital video disc (DVD) version of the movie (13) are encouraged to download the BrainShow.exe program (14) to view the reverse-correlation movie of the nonselective component.

Category selectivity under natural viewing conditions. In addition to the global nonselective activation, the movie evoked distinct activation patterns in different brain regions, which were nevertheless highly correlated between individuals watching the same movie. This was assessed by performing the same across-subject correlation to the movie data set after the removal of the nonselective component from each voxel's time course (20). The extent of correlation between subjects watching the movie remained high and largely unchanged (24% ± 8.5) when applied to the residual data set (without the nonselective component) (Fig. 1A and fig. S3).
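One way to remove a shared component from a voxel's time course is ordinary least-squares regression, sketched below. This is an assumption made for illustration: the removal procedure actually used is described in (20) and may differ. The function name and the toy data are hypothetical.

```python
def remove_component(voxel, component):
    """Residual of voxel after least-squares removal of component and the mean."""
    n = len(voxel)
    mv, mc = sum(voxel) / n, sum(component) / n
    var_c = sum((c - mc) ** 2 for c in component)
    if var_c == 0:
        return [v - mv for v in voxel]  # nothing to regress out of the voxel
    beta = sum((v - mv) * (c - mc) for v, c in zip(voxel, component)) / var_c
    return [(v - mv) - beta * (c - mc) for v, c in zip(voxel, component)]

# Toy example (hypothetical): a voxel that mixes a global wave with its own signal.
global_wave = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0]
own_signal = [1.0, 0.0, 0.0, 0.0, -1.0, 0.0]
voxel = [2.0 * g + s for g, s in zip(global_wave, own_signal)]
residual = remove_component(voxel, global_wave)
```

By construction, the residual is uncorrelated with the removed component, so any intersubject correlation that survives in the residual data cannot be attributed to the nonselective wave alone.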

To assess the functionality of known areas during natural viewing, we examined the time course of two well-studied cortical regions: the face-related posterior fusiform gyrus [pFs, also termed the FFA (21, 22)] and the building-related collateral sulcus [CoS, also termed the PPA (23, 24)]. These regions were defined independently with the use of conventional static object images (ROIs are indicated by red and green for face and building, respectively, on the inflated hemispheres shown from a ventral view, Fig. 3, A and B). We then extracted these regions' time courses of activation during the movie, subtracted the nonselective activation component, and averaged the signal across subjects (19). The level of intersubject synchronization after the subtraction of the nonselective activation was quantified by performing a t test on each time point in the time course across subjects. Points that were significantly different from the mean value (P < 0.05) in the fusiform face-related region (Fig. 3A) and the collateral building-related region (Fig. 3B) are marked in red and green, respectively. As can be seen, all selective peaks after subtraction were significantly different from the mean value across subjects.
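The per-time-point test can be sketched as a one-sample t test across subjects. The sketch assumes that each subject's time course has already been expressed relative to its mean (so the reference value is 0) and that the test is two-tailed; neither detail is stated in the text. With five subjects there are 4 degrees of freedom, for which the two-tailed critical value at P < 0.05 is 2.776. The toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF4 = 2.776  # two-tailed critical t for P < 0.05 with 4 degrees of freedom

def significant_points(courses, t_crit=T_CRIT_DF4):
    """Indices of time points whose across-subject mean differs from zero.

    courses: one time course per subject, all of the same length.
    """
    n = len(courses)
    hits = []
    for t, values in enumerate(zip(*courses)):
        sd = stdev(values)
        if sd == 0:
            continue  # t is undefined without spread; skipped in this sketch
        if abs(mean(values) / (sd / sqrt(n))) > t_crit:
            hits.append(t)
    return hits

# Toy data (hypothetical): 5 subjects, 3 time points.
courses = [
    [1.0, 0.2, -0.1],
    [1.2, -0.3, 0.1],
    [0.9, 0.1, 0.0],
    [1.1, -0.1, -0.2],
    [0.8, 0.1, 0.2],
]
marked = significant_points(courses)
```

Only the first time point is marked here, because all five subjects deviate from zero in the same direction at that moment; the other two time points average out across subjects.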

Fig. 3.

Functional selectivity revealed by the reverse-correlation method. The averaged time course of the regionally selective component of the fusiform face-related region (A) and collateral building-related region (B) during the movie, defined with the use of an external localizer (see inflated hemisphere from a ventral view). Time points that were significantly (P < 0.05) different from baseline across subjects are marked by red and green, respectively. Above each time course are the movie frames that produced the highest activations in that region. The movie frames are ordered according to descending signal amplitude. There is remarkable category selectivity revealed in the frames for faces in the fusiform face-related region (A) and for indoor and outdoor scenes in the building-related region (B). This selectivity was apparent in 15 out of the 16 marked peaks in the fusiform face-related region and in 13 out of the 16 marked peaks in the collateral building-related region. [Movie stills courtesy of MGM CLIP+STILL. The Good, The Bad, and The Ugly. ©1966 Alberto Grimaldi Productions S.A. All Rights Reserved.]

With the use of the reverse-correlation approach, we could then ask what frames of the movie evoked the highest activation in the face-related and in the building-related regions. This approach was free from any predetermined biases regarding the functional properties and thus could reveal for the first time their regional selectivities under natural viewing conditions. Figure 3 shows sampled movie frames from the highest five activation peaks in the face-related (A) and building-related (B) regions. The time codes of all frames are presented in table S1. The activation peaks are ordered according to descending signal amplitude. The fusiform face-related region was activated mainly by close-ups of face images, whereas the building-related region was mostly activated by images of indoor (e.g., peaks 1 and 5) and outdoor scenes, including buildings (e.g., peak 2) and open fields (e.g., peaks 3 and 4) (25).

It should be noted that the observed selectivity in the face- and building-related regions was maintained in most of the activation peaks. In the face-related fusiform gyrus, 15 out of the 16 highest peaks were associated with face images, whereas in the building-related collateral sulcus 13 out of the 16 highest peaks were associated exclusively with scene images. Thus, it is clear that the fusiform face-related and collateral building-related regions indeed maintained their selectivity even under free viewing of natural and complex scenes.

Given that the intersubject correlation actually extended far beyond the well-known sensory regions (Fig. 1), we applied the reverse-correlation approach in the search for additional functional preferences in regions that are typically not activated with the use of the conventional, discrete, object stimuli.

An example of such unexpected selectivity was a cortical region located in the middle postcentral sulcus (PCS), in the vicinity of Brodmann area 5, which showed a highly correlated activation across subjects during the movie (see PCS in Figs. 1 and 4). Figure 4 shows the movie frames that evoked the eight highest activation peaks in this region. The time codes of all frames are presented in table S1. Although at first sight these frames do not seem to share a common property, closer inspection reveals a consistent activation associated with the performance of delicate hand movements during various motor tasks (white arrows, Fig. 4). Furthermore, 15 out of the 16 highest peaks in this region were associated with hand-related movements (25).

Fig. 4.

Selectivity preference of the mid-postcentral sulcus. Averaged time course of the mid-postcentral sulcus region; coloring and frame selection as in Fig. 3. A common theme across all frames, revealed without any prior experimental design, is the usage of hands for performance of various motor tasks, as denoted by the white arrow in each frame. This selectivity was apparent in 15 out of the 16 marked peaks. [Movie stills courtesy of MGM CLIP+STILL. The Good, The Bad, and The Ugly. ©1966 Alberto Grimaldi Productions S.A. All Rights Reserved.]

Direct comparison between natural and controlled viewing conditions. The lack of any predefined design matrix in the unedited movie prevented a direct comparison [e.g., using paradigm-based general linear model (GLM) analysis] between naturalistic vision and the more conventional mapping approach. To allow such comparison, we constructed a condition-based movie (experiment 3), which was composed of consecutive 15-s clips selected to contain preferentially one of four object categories: faces, buildings, open landscape scenes, and miscellaneous images of various objects. Such object-selective movie segments allowed conventional contrast-based analysis (14). We compared the selectivity maps of the standard discrete images (experiment 2, n = 18, general-linear model with random-effect analysis, Fig. 5A) and the movie clips (experiment 3, n = 9, Fig. 5B) for both hemispheres on a flattened brain format.

Fig. 5.

Comparison of free viewing and controlled viewing: activation maps for conventional mapping using line drawings of static objects (A) and free viewing of preedited, category-specific movie clips (B) of faces (red), objects (blue), and buildings (green). The DOT and VOT subdivisions are indicated by dashed rectangles. Colors indicate contrast of each category with the other two (e.g., faces versus buildings and objects). There is a clear similarity between the maps generated by conventional and category-specific movie clips within the occipitotemporal cortex (see small white arrows within the DOT and VOT subdivisions). (C) Correlation level between VOT and DOT subregions sharing similar object preference for faces (F-F) and buildings (B-B), as well as across different object preference (F-B). All data were obtained from ROIs defined by the conventional mapping after removal of the nonselective component. In all three presentation methods, the within-category correlation was dramatically higher than the across-category correlation, indicating that object selectivity was maintained despite the free viewing of complex stimuli. Asterisks indicate P < 0.05.

Recently, we identified seven category-related object regions in human occipitotemporal cortex (4), organized in a mirror symmetry structure. These include two face-related regions, three object-related regions, and two building-related regions. The category-related movie clips produced a map that nicely agrees with the conventionally produced selectivity map within the VOT and DOT subdivisions (dashed rectangles in Fig. 5, A and B). White arrows point to cortical regions showing similar object selectivity under the conventional and movie clips conditions. However, the movie clips of faces induced additional activations, which extended further anteriorly in the occipitotemporal cortex, including regions anterior to area MT (V5) and a region in the vicinity of the STS (black arrows, Fig. 5B). These regions are known to be sensitive to movements of body parts (26, 27) and shifts in eye gaze (28, 29) and were probably strongly affected by the dynamic human-related motion present in the movie clips.

Selectivity index. The results so far indicate that, qualitatively, the object selectivity appeared to be maintained under free viewing of complex natural stimuli. To obtain a more quantitative index of such selectivity, we examined the correlations between cortical regions that are known to have either similar or different functional preferences. In this analysis, we compared the correlations within face-related regions (pFs and inferior occipital gyrus, i.e., F-F), within building-related regions (CoS and transverse occipital gyrus, B-B), as well as correlations across these regions (F-B). “Within” (F-F and B-B) correlations were significantly higher than the “across” (F-B) correlations in all three presentation methods (one-tailed paired t test; P < 0.01) (Fig. 5C). Thus, the selectivity of the face- and building-related regions was maintained under natural and open-ended viewing conditions.
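The within-versus-across comparison can be sketched as a paired t test on per-subject correlation values. The numbers below are hypothetical and serve only to show the computation; they are not the values plotted in Fig. 5C. The sketch also assumes five subjects, giving 4 degrees of freedom, for which the one-tailed critical value at P < 0.01 is 3.747.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(within, across):
    """Paired t statistic and degrees of freedom for (within - across)."""
    diffs = [w - a for w, a in zip(within, across)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Hypothetical per-subject correlations (illustration only, not the paper's data).
within_ff = [0.62, 0.55, 0.70, 0.58, 0.66]  # face region with face region
across_fb = [0.21, 0.18, 0.30, 0.25, 0.19]  # face region with building region

t_stat, df = paired_t(within_ff, across_fb)
T_CRIT_ONE_TAIL_P01_DF4 = 3.747  # one-tailed critical t for P < 0.01, df = 4
within_is_higher = t_stat > T_CRIT_ONE_TAIL_P01_DF4
```

Pairing by subject matters here: each subject serves as his or her own control, so overall differences in correlation level between subjects do not inflate the variance of the comparison.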


In this study, we report the unexpected finding that brains of different individuals show a highly significant tendency to act in unison during free viewing of a complex scene such as a movie sequence (Fig. 1). Such responses imply that a large extent of the human cortex is stereotypically responsive to naturalistic audiovisual stimuli. These intersubject correlations extended beyond the well-known visual and auditory cortices into high-order association areas, e.g., along the STS, LS, and retrosplenial and cingulate cortex, which have not been previously associated with sensory processing. Thus, the “collective” dimension, i.e., the level of intersubject synchronization, appears to provide a new sensitive and quantifiable measure of the involvement of cortical areas with external sensory stimuli. Critically, this measure does not depend on any prior knowledge or assumptions regarding the functionality of these areas.

In addition to the highly synchronized cortex, we also found a pattern of areas which consistently failed to show intersubject coherence. These areas included the supramarginal gyrus, angular gyrus, and prefrontal areas. Thus, the “collective” coherence effect naturally divides the cortex into a system of areas that manifest an across-subject, stereotypical response to external world stimuli versus regions that are linked to unique, individual variations (30).

The intersubject correlation was composed of two unrelated components: a spatially nonselective component and a regionally selective one. Below, we discuss these separately.

Nonselective activation component. The spatially nonselective interarea response could result from several factors. First, it could reflect modulations in the feed-forward processing load imposed by variations in the visual and contextual complexity of the movie scenes. Such global variations are expected particularly if the object representations have a widely distributed nature (31). Second, the nonselective component might reflect the global attentional and arousal impact of the scenes, as indeed was suggested by the highly emotional activation clips generated through reverse correlation of the nonselective component with the movie (25). Although such arousal effects might also produce global autonomic responses, this is unlikely to explain the results, which show a highly stereotyped and heterogeneous neuroanatomical distribution [Fig. 2; also see similar conclusion in (32)].

Selective activation component. Given the complete lack of control over the stimuli, the recovery of the known functional selectivity of cortical areas using the reverse-correlation method has important implications for the nature of object representations. These become apparent when considering which constraints have been relaxed during the movie. First, subjects' eye movements were completely uncontrolled, which allowed subjects to view the stimuli at different retinal locations. The consistency of the results despite the spontaneous nature of eye movement is compatible with the notion that high-order object areas are not very sensitive to changes in retinotopic position (33–35).

Second, the finding that selectivity did not decrease with the movie's spatiotemporal complexity (Fig. 5C) points to the efficient operation of selection mechanisms, which govern subjects' attention (36–38). Thus, given that several objects were embedded within a complex background in each scene, it seems that object-based attention (8, 35) was capable of isolating the object of choice within the complex frame.

Our results demonstrate that the unified nature of conscious experience in fact consists of temporally interleaved and highly selective activations in an ensemble of specialized regions, each of which “picks up” and analyzes its own unique subset of stimuli according to its functional specialization. Finally, the collective correlation attests to the engaging power of the movie to evoke a remarkably similar activation across subjects. In that sense, the across-subject correlation may serve as a potential measure for tracing cultural and attentional differences among various populations. An interesting, related issue is the potential impact of prior experience on the responses to the movie. For example, might the viewer's background (e.g., combat experience or prior exposure to the movie) be a factor in modulating the intersubject synchronization? Future studies involving carefully selected target groups will be needed to further examine these questions.

The reverse-correlation method. The reverse-correlation method has proven to be very effective in uncovering both known and unexpected functional specializations. Thus, the approach could serve as a promising unbiased tool for probing functional characteristics of new brain areas. This was most evident in the unexpected finding that the hand-related somatosensory region in the postcentral sulcus was activated by images of delicate hand movements. It appears likely that this activation is part of the visuo-somato-motor “mirror” system originally reported in macaque monkeys (39, 40) and more recently extended to the human cortex (41). Recently, social psychologists have stressed the role of such a mirror system directly linking perception and behavior as one of the fundamental bases of social cognition (42). Moreover, a recent study of this system has stressed, similar to the present work, the advantage inherent in the use of complex natural stimulation to discover the underlying functions of different brain regions (43).

The reverse-correlation method has several limiting factors as well. First, the rather sluggish nature of the hemodynamic response can make this method unsuitable when presenting a particularly rapid movie sequence. However, it seems that the natural temporal flow of visual events tends to be rather slow, and sufficient for blood oxygenation level–dependent (BOLD) resolution. Thus, although no “blank” periods were introduced to segregate BOLD responses in the present study, our results show that the BOLD signal was able to “pick” quite successfully the object-selective frames appropriate for each cortical area despite the continuous flow of the movie.

Second, given the complexity and multidimensionality of each frame, it will obviously be impossible to isolate the appropriate functional dimensions solely on the basis of the reverse-correlation method. Thus, the reverse-correlation should be viewed as a complementary tool for evaluating putative selectivities found under natural vision and for “pilot” searches, both for normal and pathological cases, which can suggest preliminary functional specializations to be followed by a more controlled set of stimulation conditions. The reverse-correlation method could be further supported by a more quantitative analysis, e.g., by a binomial probability estimation using an objective, binary rating of the presence of specific stimuli (e.g., faces) in each frame of the various movies.
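The binomial estimate suggested above can be sketched as follows. The base rate `p` (the probability that an arbitrary frame of the movie contains, say, a face) is a hypothetical placeholder; in practice it would come from the objective frame-by-frame rating described in the text.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance of seeing the category in at least 15 of 16 peak frames,
# if it appeared in half of all frames (assumed base rate, illustration only).
p_value = binomial_tail(15, 16, 0.5)
```

Under this assumed base rate, the probability is 17/65536, or about 0.00026. A lower base rate would make 15 of 16 category-consistent peaks even less likely by chance, and a higher one more likely, which is why an objective rating of the frames is needed.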

Supporting Online Material

Materials and Methods

Figs. S1 to S3

Table S1

References and Notes
