## Balancing costs and performance

Deciding whether a novel object is another instance of something already known or an example of something different is not an easily solved problem. Empirical mapping of human performance across a wide range of domains has established an exponential relationship between the generalization gradient and interstimulus distance. Sims now shows that this relationship can be derived from a consideration of the costs of optimal information coding.

*Science*, this issue p. 652

## Abstract

Perceptual generalization and discrimination are fundamental cognitive abilities. For example, if a bird eats a poisonous butterfly, it will learn to avoid preying on that species again by generalizing its past experience to new perceptual stimuli. In cognitive science, the “universal law of generalization” seeks to explain this ability and states that generalization between stimuli will follow an exponential function of their distance in “psychological space.” Here, I challenge existing theoretical explanations for the universal law and offer an alternative account based on the principle of efficient coding. I show that the universal law emerges inevitably from any information processing system (whether biological or artificial) that minimizes the cost of perceptual error subject to constraints on the ability to process or transmit information.

If a bird eats a poisonous or unpalatable species of butterfly, it will quickly learn to avoid preying on that species again in the future, by avoiding butterflies that look visually similar (*1*). This requires perceptual generalization, as no two butterflies look exactly alike. If generalization is too narrow—it learns to avoid one specific butterfly, but not others of the same species—the bird will continue to mistakenly consume toxic butterflies. However, if generalization is too broad—it avoids all butterflies—it will unnecessarily exclude edible food sources and consequently limit its fitness. A closely related ability is perceptual discrimination: If an edible species of butterfly closely resembles a different, toxic species (Batesian mimicry), the failure to perceptually discriminate between the two will also lead to negative consequences.

These examples demonstrate that adaptive behavior requires perceptual generalization and discrimination abilities that are finely calibrated to the costs of perceptual error. This holds not only for predator–prey relationships but also for expert-level human performance in domains such as medicine (*2*). Not surprisingly, the theoretical study of generalization is also central to progress in artificial intelligence and machine learning (*3*–*5*).

Just over 30 years ago, cognitive scientist Roger Shepard suggested that perceptual generalization was a suitable candidate for the first “universal law” in psychological science (*6*). Shepard’s universal law of generalization states that the generalization between two stimuli (essentially, the probability of confusion) decreases as an exponential function of their distance within an appropriate metric “psychological space.” This exponential generalization pattern has indeed proved to be near-universal, and the success of the empirical law has been impressive, accounting for data across a wide range of domains, sensory modalities, and species (*6*–*8*).

Shepard’s explanation for this phenomenon revolves around the concept of a “consequential region” within psychological space that corresponds to a concept. For example, the concept of poisonous butterflies encompasses some set of stimuli in psychological space. Given one stimulus known to be an element of this set, the task facing the organism is to infer whether a novel stimulus will also fall in the same region; this task can be framed as one of probabilistic inference. Subsequent work (*9*, *10*) expanded on the idea of generalization as probabilistic inference, to include extrapolating from multiple exemplars and exploring alternative measures of perceptual distance or dissimilarity.

Here, I offer a qualitatively different explanation for the origins of the universal law in human perception, based on the principle of efficient coding (*11*), or the idea that biological information processing should seek to maximize performance subject to constraints on information processing capacity.

Critically, the proposed approach also generates unique predictions that distinguish it from competing explanations for the universal law. These include predictions that relate the slope of the generalization gradient to information-theoretic quantities, asymmetric generalization gradients in situations where there are asymmetric costs for perceptual error, and the finding that artificial systems (such as the JPEG image compression algorithm) can also produce an exponential generalization gradient. The result is a revised universal law of perceptual generalization, which subsumes Shepard’s statement of the law as a special case.

The approach uses results from the field of rate-distortion theory, a subdiscipline within information theory concerned with the design and analysis of optimal, but capacity-limited, information channels (*12*–*14*). Previous work has shown that rate-distortion theory offers a compelling account of human visual working memory limitations (*15*, *16*).

The current results can be concisely stated as follows: Perceptual generalization in any efficient communication system will necessarily follow an exponential function of the cost of perceptual error. In this framework, the emergence of the universal law is the signature of an organism that seeks to perceive the world as best as possible, according to some utility measure, subject to available resource limitations.

Figure 1 shows the theoretical framework and its properties. Perception is modeled as a capacity-limited information channel in which afferent sensory signals ($x$) are distributed according to the distribution $p(x)$. The perceived signal ($\hat{x}$) is related to its veridical value by a conditional probability distribution $p(\hat{x} \mid x)$. Capacity limits in the channel prevent transmitting sensory signals with perfect fidelity, and hence in general, $\hat{x} \neq x$. Instead, the goal of the channel is to minimize a given loss function, specified by $L(x, \hat{x})$, subject to the constraint that the amount of information transmitted by the channel, measured by the mutual information $I(x; \hat{x})$, is at or below a capacity limit $C$. Rate-distortion theory provides analytical and numerical tools for solving such constrained optimization problems (*12*–*14*).
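One standard numerical tool for this kind of constrained optimization is the Blahut–Arimoto iteration. The following is a minimal sketch of it, not the model code archived with the paper. It assumes a discrete stimulus set, a uniform source distribution, and a squared-error loss, all chosen here purely for illustration; the function names `blahut_arimoto` and `rate_and_distortion` are hypothetical. The slope parameter `s` is negative, and more negative values correspond to a higher information rate.

```python
import numpy as np

def blahut_arimoto(p_x, loss, s, n_iter=500, tol=1e-12):
    """Return the channel p(xhat | x) that is optimal at slope parameter s <= 0.

    p_x  : (n,) source distribution over stimuli
    loss : (n, m) matrix with loss[i, j] = L(x_i, xhat_j)
    s    : slope parameter (negative; used with the natural exponential)
    """
    p_x = np.asarray(p_x, dtype=float)
    kernel = np.exp(s * np.asarray(loss, dtype=float))   # exp(s * L(x, xhat))
    q = np.full(kernel.shape[1], 1.0 / kernel.shape[1])  # marginal over outputs
    for _ in range(n_iter):
        cond = kernel * q                                # unnormalized p(xhat | x)
        cond /= cond.sum(axis=1, keepdims=True)
        q_new = p_x @ cond                               # updated output marginal
        converged = np.max(np.abs(q_new - q)) < tol
        q = q_new
        if converged:
            break
    cond = kernel * q
    cond /= cond.sum(axis=1, keepdims=True)
    return cond

def rate_and_distortion(p_x, cond, loss):
    """Mutual information I(x; xhat) in bits and expected loss for a channel."""
    p_x = np.asarray(p_x, dtype=float)
    joint = p_x[:, None] * cond
    q = joint.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(joint > 0, np.log2(cond / q), 0.0)
    rate = float(np.sum(joint * log_ratio))
    distortion = float(np.sum(joint * loss))
    return rate, distortion

# Illustrative setup (assumed, not from the paper): 8 equally likely stimuli
# on a line, with squared-error loss between stimulus and percept.
n = 8
values = np.arange(n)
p_x = np.full(n, 1.0 / n)
loss = (values[:, None] - values[None, :]) ** 2.0

channel_lo = blahut_arimoto(p_x, loss, s=-0.1)   # low-capacity channel
channel_hi = blahut_arimoto(p_x, loss, s=-2.0)   # higher-capacity channel
rate_lo, dist_lo = rate_and_distortion(p_x, channel_lo, loss)
rate_hi, dist_hi = rate_and_distortion(p_x, channel_hi, loss)
print(f"s=-0.1: rate={rate_lo:.3f} bits, expected loss={dist_lo:.3f}")
print(f"s=-2.0: rate={rate_hi:.3f} bits, expected loss={dist_hi:.3f}")
```

Sweeping `s` traces out the trade-off between information rate and expected loss: a channel with more capacity transmits more bits and incurs a lower expected cost of perceptual error.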

Notably, several of the properties illustrated in Fig. 1 (such as a “bias to the mean effect,” Fig. 1A) are also predicted by Bayesian models of perception. As both are rational or optimal models of cognition, this is not surprising. Whereas Bayesian models of perception often make atheoretic assumptions about the nature of “internal noise” within a perceptual channel [e.g., (*17*)], rate-distortion theory instead gives sensory processing limitations a strong theoretical interpretation in terms of constructs from information theory. Hence, rate-distortion theory can be viewed as a special case of the more general class of Bayesian models of perception. As will be shown presently, this also allows the framework to generate unique predictions.

To connect rate-distortion theory to perceptual generalization, one needs a measure of the strength of generalization from one stimulus to another. Shepard (*6*) defined the following measure:

$$G(x, \hat{x}) = \sqrt{\frac{p(\hat{x} \mid x)\, p(x \mid \hat{x})}{p(x \mid x)\, p(\hat{x} \mid \hat{x})}} \tag{1}$$

where $p(\hat{x} \mid x)$ indicates the probability that a response associated with stimulus $\hat{x}$ is made to stimulus $x$. According to Shepard’s universal law, generalization will follow an exponential function of the distance between $x$ and $\hat{x}$ in an appropriate psychological space, where the distance is assumed to obey the basic metric axioms. Rate-distortion theory suggests a more general formulation for this law. Using Shepard’s measure of generalization, rate-distortion theory directly predicts that generalization should follow

$$G(x, \hat{x}) = \exp\left\{ \frac{s}{2} \left[ L(x, \hat{x}) + L(\hat{x}, x) - L(x, x) - L(\hat{x}, \hat{x}) \right] \right\} \tag{2}$$

where the constant parameter $s$ is monotonically related to the capacity of the channel. Note that this includes Shepard’s original universal law as a special case. If the loss function satisfies two of the axioms of distance metrics, namely symmetry [$L(x, \hat{x}) = L(\hat{x}, x)$] and identity [$L(x, x) = 0$], then one can easily verify that the generalization function reduces to

$$G(x, \hat{x}) = e^{s L(x, \hat{x})} \tag{3}$$

Consequently, when the loss function is taken to be distance in a psychological space, Shepard’s original universal law emerges from rate-distortion theory exactly. However, the result in Eq. 2 holds true under very general conditions, even when the psychological representation does not correspond to a metric space. As one example, if the mental representation of complex stimuli consists of a taxonomy of nested categories (*18*), the loss function may be defined in terms of tree distance between exemplars.
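The exponential form can be traced in a few lines. The sketch below assumes the standard rate-distortion result that the loss-minimizing channel is an exponential tilt of the output marginal, and that stimuli and responses range over the same set. The symbols $q$ (marginal distribution over responses) and $Z$ (normalizing constant) are notation introduced here for the derivation; $G$ is Shepard's generalization measure (Eq. 1), and $p(a \mid b)$ denotes the probability of response $a$ to stimulus $b$.

```latex
\begin{align*}
p(\hat{x} \mid x) &= \frac{q(\hat{x})\, e^{s L(x, \hat{x})}}{Z(x)},
\qquad Z(x) = \sum_{\hat{x}'} q(\hat{x}')\, e^{s L(x, \hat{x}')} \\[4pt]
G(x, \hat{x})^2
&= \frac{p(\hat{x} \mid x)\, p(x \mid \hat{x})}{p(x \mid x)\, p(\hat{x} \mid \hat{x})}
 = \frac{\dfrac{q(\hat{x})\, e^{s L(x, \hat{x})}}{Z(x)} \cdot \dfrac{q(x)\, e^{s L(\hat{x}, x)}}{Z(\hat{x})}}
        {\dfrac{q(x)\, e^{s L(x, x)}}{Z(x)} \cdot \dfrac{q(\hat{x})\, e^{s L(\hat{x}, \hat{x})}}{Z(\hat{x})}} \\[4pt]
&= \exp\left\{ s \left[ L(x, \hat{x}) + L(\hat{x}, x) - L(x, x) - L(\hat{x}, \hat{x}) \right] \right\} \\[4pt]
G(x, \hat{x}) &= \exp\left\{ \tfrac{s}{2} \left[ L(x, \hat{x}) + L(\hat{x}, x) - L(x, x) - L(\hat{x}, \hat{x}) \right] \right\}
\end{align*}
```

The marginals $q$ and the normalizers $Z$ cancel in the ratio, which is why the gradient depends only on the loss function and on $s$. Under symmetry and identity the bracket equals $2L(x, \hat{x})$, so the expression collapses to $e^{sL(x, \hat{x})}$.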

Rate-distortion theory was applied to the results of several published perceptual identification experiments (Fig. 2) that use a range of perceptual modalities (visual, haptic, auditory, gustatory). Archives of these data, along with model code, are provided online (*19*). On each trial of an identification experiment, a stimulus is randomly selected from a set, and the observer must identify it with a unique response. The resulting data consist of a perceptual confusion matrix, which gives the empirical frequency that stimulus $x$ produced response $\hat{x}$. The perceptual loss function, $L(x, \hat{x})$, is estimated from this confusion matrix by means of Bayesian inference.
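As an illustration of how empirical generalization strength is read off a confusion matrix, the sketch below applies Shepard's measure (Eq. 1) to a table of response counts. The 3 × 3 counts are invented for the example and do not come from any of the experiments analyzed here; the function name `generalization_matrix` is hypothetical. The sketch assumes every stimulus was identified correctly at least once, so that the diagonal is nonzero.

```python
import numpy as np

def generalization_matrix(counts):
    """Shepard's generalization measure for every stimulus pair.

    counts[i, j] is the number of trials on which stimulus i drew response j.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum(axis=1, keepdims=True)  # p[i, j] = P(response j | stimulus i)
    diag = np.diag(p)                               # probabilities of correct identification
    return np.sqrt(p * p.T / np.outer(diag, diag))

# Invented counts, 100 trials per stimulus (illustration only).
counts = np.array([
    [80, 15,  5],
    [10, 80, 10],
    [ 5, 15, 80],
])
G = generalization_matrix(counts)
print(np.round(G, 3))
```

The measure is symmetric by construction and equals 1 on the diagonal, so any asymmetry in the raw confusions is averaged out at this stage.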

As shown in Fig. 2, the observed relationship between the inferred cost of perceptual error (the estimated loss function $L$) and the empirical generalization strength ($G$, given by Eq. 1) follows an exponential gradient nearly exactly. Notably, this is not a consequence of a model that fits the data poorly but forces an exponential gradient. Rather, as shown in Fig. 2, J and K, rate-distortion theory simultaneously produces a precise model of the full probability distribution over perceptual confusion and accurately predicts the exponential form of the generalization gradient. The supplementary materials also include a comparison of rate-distortion theory to an alternative existing model of perceptual identification, known as the Luce–Shepard choice model (*20*).

The key test, however, is whether rate-distortion theory generates predictions that distinguish it from competing explanations. The remainder of the paper focuses on three such predictions. The first is that the steepness of the generalization gradient should be monotonically related to the information rate of the perceptual channel. Specifically, when plotted on a logarithmic axis, exponential curves such as those shown in Fig. 2 will appear as straight lines with slope $s$. Whereas prior work has treated the slope of generalization as a free parameter, rate-distortion theory uniquely provides a strong theoretical interpretation for this quantity. In particular, for an optimal communication channel, the slope satisfies

$$s = \frac{\partial R}{\partial D} \tag{4}$$

where $R$ is the information rate of the channel, $D$ is the expected cost of perceptual error, and the term on the right-hand side of this equation is the slope of the rate-distortion curve for the channel (*12*), as illustrated in Fig. 1D. Consequently, experimental manipulations designed to influence the information rate of the perceptual channel (the numerator of this equation) should have a direct and predictable impact on the slope of the generalization gradient.
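To make the "straight line on a logarithmic axis" reading concrete, the sketch below recovers the slope from paired values of loss and generalization by least squares on the logarithm of generalization. It is only an illustration of the log-linear relationship, not the estimation procedure used for the reported analyses. The line is constrained through the origin on the assumption that generalization equals 1 when the loss is zero, and the data are synthetic, generated from an assumed slope of −0.7; the function name `fit_gradient_slope` is hypothetical.

```python
import numpy as np

def fit_gradient_slope(loss, generalization):
    """Least-squares slope of log(generalization) against loss, through the origin."""
    loss = np.asarray(loss, dtype=float).ravel()
    log_g = np.log(np.asarray(generalization, dtype=float).ravel())
    return float(loss @ log_g / (loss @ loss))

# Synthetic, noise-free data from an assumed exponential gradient.
true_s = -0.7
loss_values = np.linspace(0.5, 5.0, 10)
g_values = np.exp(true_s * loss_values)

estimated_s = fit_gradient_slope(loss_values, g_values)
print(f"estimated slope: {estimated_s:.3f}")
```

With real data the generalization values must be strictly positive for the logarithm to be defined, so stimulus pairs that were never confused would need separate handling.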

A test of this prediction is provided by the classic experiments reported in (*21*). In these experiments, subjects were asked to identify vocal consonants embedded in six different levels of white noise (signal-to-noise ratio ranging from 12 to −18 dB). Intuitively, increasing the amount of noise will decrease the amount of information about the signal that the observer can process. Under the assumption that the stimulus noise influences the information rate of the channel (the numerator of Eq. 4), but not the cost function for perceptual error (the denominator), it is possible to predict the slope of the generalization gradient in a parameter-free manner. The results are shown in Fig. 3A. In this plot, the generalization curves are shown on a logarithmic axis to illustrate the change in slope across stimulus noise conditions. The empirical slope of the generalization gradient closely follows the predictions of rate-distortion theory.

A second prediction of rate-distortion theory stems from the fact that unlike in Shepard’s theoretical account, there is no requirement that perceptual generalization must be symmetric. Empirical asymmetries in generalization have previously been raised as an argument against a metric representation of perceptual similarity (*22*). In the present case, a different theoretical origin for asymmetry is predicted in terms of asymmetric costs of perceptual error. An empirical test of this prediction is found in an experiment reported in (*23*). In this experiment, subjects were tasked with identifying pure tones of varying loudness. Subjects were motivated to perform accurately by awarding points for correct responses and deducting points for errors; points were exchanged for a monetary bonus at the end of the experiment. Each subject completed two experimental conditions. In the neutral condition, payoffs were symmetric for all types of errors, whereas in the biased condition, overestimate errors were more costly than underestimate errors.

The inferred loss functions are illustrated in Fig. 3B for both the neutral and biased conditions. Inferred costs for perceptual error are symmetric in the neutral condition, but substantially asymmetric in the biased condition. Formal model comparisons (reported in the supplementary materials) reveal that the data are better explained by rate-distortion theory with an asymmetric cost function than by an alternative model that assumes symmetric perceptual distance.

Lastly, rate-distortion theory predicts that exponential generalization gradients should not be limited to biological information processing, but rather should be exhibited by any communication system that operates efficiently in the rate-distortion sense, whether natural or artificial. Figure 4 illustrates an identification “experiment” conducted on the JPEG image compression algorithm. The experiment was performed by taking grayscale photographs from a natural scene database (*24*) and encoding them using the JPEG algorithm. As JPEG is a form of lossy compression, the encoded images will almost certainly introduce perceptual “confusions”—an input pixel replaced by a somewhat different pixel at the output stage (Fig. 4A). A confusion matrix is obtained by collecting the joint statistics of input and JPEG-encoded pixels. Compared to human participants, JPEG has the useful feature that the objective for perceptual coding is obtainable by inspection of its algorithm. In brief, JPEG performs a discrete cosine transform (DCT) on an input image and scales the coefficients by a weight matrix that emphasizes coding accuracy for low spatial frequencies. This weighted DCT representation is essentially the “psychological space” for JPEG encoding. Figure 4C plots the strength of generalization between pixel values against the average squared error distance in quantized DCT space. The results illustrate that JPEG image coding also conforms to the universal law of generalization. Although this finding is consistent with rate-distortion theory, it is difficult to reconcile with alternative explanations for the universal law.
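The step of collecting joint statistics of input and encoded pixels can be sketched as follows. To stay self-contained, the example does not call a real JPEG encoder. It substitutes a simplified codec that applies an 8 × 8 block DCT and quantizes every coefficient with a single uniform step, whereas JPEG uses a frequency-dependent weight matrix. The random test image, the step size, and the function names are all assumptions made for illustration.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d

def lossy_block_codec(image, step=40.0, n=8):
    """Encode and decode with a uniformly quantized block DCT.

    Simplified stand-in for JPEG (single quantization step, no entropy coding).
    Image height and width must be multiples of n.
    """
    d = dct_matrix(n)
    h, w = image.shape
    out = np.empty((h, w), dtype=float)
    for r in range(0, h, n):
        for c in range(0, w, n):
            block = image[r:r + n, c:c + n].astype(float) - 128.0
            coeffs = d @ block @ d.T                  # forward 2D DCT
            coeffs = np.round(coeffs / step) * step   # lossy quantization
            out[r:r + n, c:c + n] = d.T @ coeffs @ d + 128.0
    return np.clip(np.round(out), 0, 255).astype(np.int64)

def pixel_confusion_counts(original, decoded, levels=256):
    """counts[i, j] = number of pixels with input value i and decoded value j."""
    idx = original.astype(np.int64).ravel() * levels + decoded.astype(np.int64).ravel()
    return np.bincount(idx, minlength=levels * levels).reshape(levels, levels)

# Synthetic grayscale image (random pixels, not a natural scene).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64))

decoded = lossy_block_codec(image)
counts = pixel_confusion_counts(image, decoded)
print("pixels changed by coding:", int(np.sum(image != decoded)))
```

Each row of `counts`, once normalized, plays the role of the response distribution for one input pixel value, so the same generalization measure used for human confusion data can be applied to it. Natural photographs would be needed to obtain pixel statistics comparable to those described in the text.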

The current work is only part of a growing body of literature showing the broad applicability of efficient coding as a means of understanding biological information processing (*25*, *26*). As a theoretical framework, efficient coding is not an alternative to the popular Bayesian perception framework, but rather is an extension in which sensory limitations are attributed to information processing capacity limitations. As perception exists to maximize the utility of behavior, it is a compelling idea that evolution drives perceptual systems toward the regime of rate-distortion efficiency: optimizing performance subject to information processing constraints.

## Supplementary Materials

This is an article distributed under the terms of the Science Journals Default License.

## References and Notes

**Acknowledgments:**

**Funding:** This research was supported by NSF grant DRL-1560829.

**Author contributions:** C.R.S. conducted the research and wrote the manuscript.

**Competing interests:** None declared.

**Data and materials availability:** Online data archives associated with this paper are provided via the Open Science Framework, at https://osf.io/x5ckn/.