Technical Comments

Filtering Reveals Form in Temporally Structured Displays

Science  17 Dec 1999:
Vol. 286, Issue 5448, pp. 2231
DOI: 10.1126/science.286.5448.2231a

In their report, Lee and Blake (1) asked whether the visual system could use temporal microstructure to bind image regions into unified objects, as has been proposed in some neural models (2). Lee and Blake presented two regions of dynamic texture. The elements of the target region changed in synchrony according to a random sequence, while the elements of the background region changed at independent times. The stimulus was designed in an attempt to remove all classical form-giving cues such as luminance, contrast, or motion, so that timing itself would provide the only cue. Subjects were readily able to distinguish the shape of the target region. Lee and Blake posited the existence of new visual mechanisms “exquisitely sensitive to the rich temporal structure contained in these high-order stochastic events.” The results have generated much excitement (3).

However, we believe that the effects can be explained with well-known mechanisms. The filtering properties of early vision can convert the task into a simple static or dynamic texture discrimination problem. A sustained cell (temporal lowpass) will emphasize static texture through the mechanisms of visual persistence; a transient cell (temporal bandpass) will emphasize texture that is flickering or moving.
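These two filter types can be illustrated with a toy sketch (our own construction; the kernel shapes, time constants, and lengths are arbitrary choices, not estimates of actual visual mechanisms). A sustained mechanism is modeled as a monophasic temporal kernel and a transient mechanism as a biphasic one:

```python
import numpy as np

def lowpass_kernel(tau=3.0, length=15):
    """Monophasic (sustained) kernel: exponential decay, unit gain at DC."""
    t = np.arange(length)
    k = np.exp(-t / tau)
    return k / k.sum()

def bandpass_kernel(tau=3.0, length=15):
    """Biphasic (transient) kernel: a fast minus a slow lowpass kernel.
    Its weights sum to zero, so an unchanging input produces no response."""
    return lowpass_kernel(tau, length) - lowpass_kernel(2 * tau, length)

static = np.ones(50)  # a texture element that never changes
lp, bp = lowpass_kernel(), bandpass_kernel()

print(np.convolve(static, lp, mode='valid').max())          # ~1: sustained filter passes it
print(np.abs(np.convolve(static, bp, mode='valid')).max())  # ~0: transient filter rejects it
```

The sustained filter preserves static texture; the transient filter responds only where the input changes over time.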

We simulated a lowpass mechanism to see what would emerge. Lee and Blake's stimuli were composed of randomly oriented Gabor elements, where the Gabor phase shifted forward or backward on each frame according to a coin-flip. We downloaded one such movie from their Web site and ran it through a temporal lowpass filter (4); an input frame is shown in Fig. 1A, and a filtered output frame in Fig. 1B. At the particular moment shown, the target region has a lower effective contrast than the background, providing a strong form cue. At other moments the target's contrast may be above or below the background's because of statistical fluctuations in the reversal sequences. If a single Gabor element happens to have a run of multiple shifts in one direction, its effective contrast is low because of the temporal averaging. Conversely, if it has a run of alternating forward and backward shifts, thus “jittering” in place, its contrast remains fairly high. Within the unsynchronized background the local contrasts fluctuate randomly, but within the synchronized target region they all rise and fall in unison, revealing a distinct rectangular form.
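The contrast effect of “runs” versus “jitter” can be reproduced in miniature (a hypothetical one-dimensional stand-in for a single Gabor element; the phase step of π/4 and the eight-frame window are arbitrary choices):

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 64, endpoint=False)  # one spatial period

def effective_contrast(steps, dphi=np.pi / 4):
    """Average the frames of a sinusoid whose phase moves by +dphi or -dphi
    per frame; return the amplitude (contrast) of the averaged waveform."""
    phase = np.cumsum(steps) * dphi
    frames = np.sin(x[None, :] + phase[:, None])
    avg = frames.mean(axis=0)
    return (avg.max() - avg.min()) / 2

run = np.ones(8)                # eight shifts in the same direction
jitter = np.resize([1, -1], 8)  # strictly alternating shifts

print(effective_contrast(run))     # near 0: phases spread around the circle and cancel
print(effective_contrast(jitter))  # high (~0.9): element oscillates between two nearby phases
```

A run spreads the element's phase around the circle, so temporal averaging washes it out; jitter keeps the element near one phase, so its averaged contrast stays high.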

Figure 1: (A) One input frame; (B) result of temporal integration with synchronized target and unsynchronized background; (C and D) results of temporal integration when the target and background are each synchronized.

In a second experiment Lee and Blake synchronized both the target and background region, each to its own random sequence. Here the target was even more clearly visible. This result is predicted by our hypothesis. Since both background and target are synchronized, they will both yield uniform texture contrasts after temporal filtering. There will be moments when, by chance, one region's contrast is high while the other's is low, and the target will become especially clear. Figure 1C shows one such moment, again the result of filtering a movie from the Web site with the lowpass filter. Figure 1D shows a moment when the relative contrasts are reversed. We also ran movies through a temporal bandpass filter (5) with a biphasic impulse response, to simulate a transient mechanism. Again, the target was clearly revealed.
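The uniform-within-region prediction can likewise be checked in simulation (again a schematic stand-in for the actual stimuli, with arbitrary parameters and a fixed random seed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
dphi, n_frames, n_elem, window = np.pi / 4, 40, 16, 8

def element_contrast(steps):
    """Effective contrast of one element after averaging the last `window` frames."""
    phase = np.cumsum(steps) * dphi
    frames = np.sin(x[None, :] + phase[:, None])
    avg = frames[-window:].mean(axis=0)
    return (avg.max() - avg.min()) / 2

# Synchronized region: every element follows the same coin-flip sequence.
shared = rng.choice([1, -1], n_frames)
sync = [element_contrast(shared) for _ in range(n_elem)]

# Unsynchronized region: each element flips its own coin.
unsync = [element_contrast(rng.choice([1, -1], n_frames)) for _ in range(n_elem)]

print(np.std(sync))    # 0: contrast is uniform across the synchronized region
print(np.std(unsync))  # > 0: contrasts scatter from element to element
```

After filtering, the synchronized region forms a uniform texture patch whose contrast rises and falls as a whole, while the unsynchronized region remains mottled.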

Our hypothesis also predicts, with the use of either filter, Lee and Blake's finding that discrimination will be best when the reversal sequences have high entropy, that is, when the coin-flip is unbiased. The contrast cue is best when the target “jitters” in place while the background has a run in a single direction (or vice versa). This condition happens most frequently at high entropy.
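A rough Monte Carlo sketch of this entropy prediction (our own formalization, not Lee and Blake's analysis; window length, phase step, and sample count are arbitrary) measures the expected contrast difference between two independently filtered elements as a function of the coin's bias:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
dphi, window = np.pi / 4, 8

def contrast_after_averaging(steps):
    """Contrast of one element after temporally averaging its frames."""
    phase = np.cumsum(steps) * dphi
    avg = np.sin(x[None, :] + phase[:, None]).mean(axis=0)
    return (avg.max() - avg.min()) / 2

def mean_contrast_spread(p, n_pairs=2000):
    """Expected contrast difference between two independent elements whose
    reversal sequences come from a coin with P(forward) = p."""
    diffs = []
    for _ in range(n_pairs):
        a = contrast_after_averaging(rng.choice([1, -1], window, p=[p, 1 - p]))
        b = contrast_after_averaging(rng.choice([1, -1], window, p=[p, 1 - p]))
        diffs.append(abs(a - b))
    return np.mean(diffs)

print(mean_contrast_spread(0.5))   # large spread: unbiased coin, high entropy
print(mean_contrast_spread(0.95))  # small spread: biased coin, mostly runs
```

With a strongly biased coin nearly every element runs in one direction and has low contrast, so the cue collapses; an unbiased coin maximizes the mixture of runs and jitter and hence the contrast spread.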

Lee and Blake's stimuli are designed to remove form cues from single frames and from frame pairs. However, when one considers the full sequence, strong contrast cues can emerge due to the spatio-temporal filtering present in early vision. These cues probably suffice to explain the perception of form in the experiments. We do not see the need to posit mechanisms other than those already known to exist.


Response: We agree with Adelson and Farid that an appropriately designed, lowpass temporal filter applied to our stochastic animation sequences (1) could extract form defined by luminance contrast without resort to temporal synchrony. We raised that possibility in our report, noting that temporal integration could produce occasional pulses in apparent contrast when, by chance, motion elements repetitively switched back and forth in direction over several successive frames (called “jitter” in Adelson and Farid's comment). The output from Adelson and Farid's model (Fig. 1) confirms our intuition, showing that contrast pulses could be synthesized by a hypothetical temporal filter with the right time constant. But to assert that these infrequent, hypothetical events “explain the perception of form” seems conjectural. In our experiments, observers never saw static single frames such as the one Adelson and Farid point to in their filtered example; successive frames were rapidly animated, and contrast pulses were not conspicuous in these animations. But perhaps this cue, although not salient in the animations, is available and utilized by observers when performing our shape task. In our research, we created conditions in which the putative contrast pulses would occur in both the figure and the background regions. Distributing identical contrast pulses throughout the display, we reasoned, should impair figure/ground segmentation based on perceived contrast. But exactly the opposite was found [see figure 2A in (1)], implying that contrast summation does not mediate performance on our task.

Adelson and Farid's hypothetical temporal filter uncovers a possible consequence that we did not address in our report. Specifically, in animation sequences containing multiple successive frames without change in the direction of motion (runs), effective contrast produced by temporal integration could be temporarily reduced within the figural region, where all elements are doing the same thing. When that happens, this region could stand out from the background, where elements are changing independently. Adelson and Farid's figure 1B depicts this hypothetical situation. Because strings of “no change” frames are more probable at lower entropy values, shape discrimination based on global reductions in contrast within the figure should be particularly easy at low entropy. But just the opposite is true: shape from temporal synchrony is best at high entropy, where “no change” sequences are highly unlikely [see figure 2B in (1)].
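The dependence of run probability on entropy can be made explicit with a simple calculation under the binary coin-flip model (the window length of eight frames is an arbitrary choice):

```python
import math

def run_probability(p, k=8):
    """Probability that k successive coin flips with P(forward) = p
    all go the same direction (a 'no change in direction' run)."""
    return p ** (k - 1) + (1 - p) ** (k - 1)

def entropy(p):
    """Shannon entropy of the biased coin, in bits."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.5, 0.7, 0.9):
    print(f"p={p}: entropy={entropy(p):.2f} bits, run probability={run_probability(p):.4f}")
```

Runs, and hence any global contrast reduction they would produce, become far more probable as entropy falls, which is the opposite of the observed dependence of performance on entropy.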

We are grateful to Adelson and Farid for formalizing a plausible model of temporal integration. Using their model, we have quantitatively indexed the potential strength of luminance cues from temporal integration (2). We find no correlation between this strength index and psychophysical performance on our shape discrimination task. We have gone one step further, using this index to create animations from which temporal integration could produce no luminance cues whatsoever in any frames of the sequence. We did three things to achieve this: (i) the contrast of each moving element was randomized throughout the array and from frame to frame of the animation, (ii) the average luminance of each motion element was assigned randomly throughout the array, and (iii) those frames causing “runs” and “jitter” were selectively pruned from the sequence. Observers still readily perceive shape from temporal synchrony in these sequences that have been purged of potential luminance cues (3). This observation is remarkable considering that contrast randomization and luminance randomization actually introduce conflicting cues for spatial structure. Our findings undermine the supposition that temporal integration alone can “explain the perception of form” (1) in these stochastic displays. On the contrary, it is revealing that temporal integration does not erase visual signals generated by these kinds of dynamic, stochastic events. This constitutes one more piece of evidence that human vision contains mechanisms that preserve the fine temporal structure of dynamic events and exploit that structure in the service of spatial grouping.

Adelson and Farid also suggest that a filter with a biphasic impulse response could be involved in the extraction of shape from our dynamic displays. Here, too, they confirm a point made in our report where we noted that reversals in motion direction—the carriers of temporal structure in our displays—could produce brief neural transients that accurately denote points in time when reversals occur. When applied to our displays, an appropriately tuned biphasic temporal filter accomplishes this operation (change detection). So we agree with Adelson and Farid that there is no need to posit the existence of new visual mechanisms sensitive to stochastic temporal structure. Existing mechanisms provide a reasonable point of departure. Still, change detection is just a first step in extracting shape from temporal structure. It remains a challenge to explain how spatial grouping is accomplished based only on irregularly occurring transients distributed among local neural mechanisms tuned to different directions of motion.
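Change detection by a biphasic filter can be illustrated with a toy piecewise-constant signal standing in for an element's motion state (a schematic sketch; the kernel shape and reversal times are arbitrary):

```python
import numpy as np

def biphasic_kernel(length=6):
    """Toy biphasic (transient) kernel: a +1 lobe followed by a -1 lobe.
    Zero net weight, so constant stretches of the input produce no output."""
    half = length // 2
    k = np.concatenate([np.ones(half), -np.ones(half)])
    return k / half

# Piecewise-constant "motion state": reversals occur at frames 10 and 25.
signal = np.zeros(40)
signal[10:25] = 1.0

response = np.convolve(signal, biphasic_kernel(), mode='same')
transient_times = np.flatnonzero(np.abs(response) > 0.5)
print(transient_times)  # clustered around frames 10 and 25
```

The filter output is zero wherever the state is constant and spikes only around the reversal times, which is exactly the change-detection operation described above; the open problem is how such sparse, irregularly timed transients are then grouped across space.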

