A Bayesian Truth Serum for Subjective Data

Science  15 Oct 2004:
Vol. 306, Issue 5695, pp. 462-466
DOI: 10.1126/science.1102081


Subjective judgments, an essential information source for science and policy, are problematic because there are no public criteria for assessing judgmental truthfulness. I present a scoring method for eliciting truthful subjective data in situations where objective truth is unknowable. The method assigns high scores not to the most common answers but to the answers that are more common than collectively predicted, with predictions drawn from the same population. This simple adjustment in the scoring criterion removes all bias in favor of consensus: Truthful answers maximize expected score even for respondents who believe that their answer represents a minority view.

Subjective judgment from expert and lay sources is woven into all human knowledge. Surveys of behaviors, attitudes, and intentions are a research staple in political science, psychology, sociology, and economics (1). Subjective expert judgment drives environmental risk analysis, business forecasts, historical inferences, and artistic and legal interpretations (2).

The value of subjective data is limited by its quality at the source—the thought process of an individual respondent or expert. Quality would plausibly be enhanced if respondents felt as if their answers were being evaluated by an omniscient scorer who knew the truth (3). This is the situation with tests of objective knowledge, where success is defined as agreement with the scorer's answer key, or in the case of forecasts, an observable outcome (4). Such evaluations are rarely appropriate in social science, because the scientist is reluctant to impose a particular definition of truth, even if one were available (5).

Here, I present a method of eliciting subjective information, designed for situations where objective truth is intrinsically or practically unknowable (6). The method consists of an “information-scoring” system that induces truthful answers from a sample of rational (i.e., Bayesian) expected value–maximizing respondents. Unlike other Bayesian elicitation mechanisms (7–9), the method does not assume that the researcher knows the probabilistic relationship between different responses. Hence, it can be applied to previously unasked questions, by a researcher who is a complete outsider to the domain. Unlike earlier approaches to “test theory without an answer key” (5), or the Delphi method (10), it does not privilege the consensus answer. Hence, there is no reason for respondents to bias their answer toward the likely group mean. Truthful responding remains the correct strategy even for someone who is sure that their answer represents a minority view.

Instead of using consensus as a truth criterion, my method assigns high scores to answers that are more common than collectively predicted, with predictions drawn from the same population that generates the answers. Such responses are “surprisingly common,” and the associated numerical index is called an information score. This adjustment in the target criterion removes the bias inherent in consensus-based methods and levels the playing field between typical and unusual opinions.

The scoring works at the level of a single question. For example, we might ask: (i) What is your probability estimate that humanity will survive past the year 2100 (100-point probability scale)? (ii) Will you vote in the next presidential election (Definitely/Probably/Probably Not/Definitely Not)? (iii) Have you had more than 20 sexual partners over the past year (Yes/No)? (iv) Is Picasso your favorite 20th-century painter (Yes/No)?

Each respondent provides a personal answer and also a prediction of the empirical distribution of answers (i.e., the fraction of people endorsing each answer). Predictions are scored for accuracy, that is, for how well they match the empirical frequencies. The personal answers, which are the main object of interest, are scored for being surprisingly common. An answer endorsed by 10% of the population against a predicted frequency of 5% would be surprisingly common and would receive a high information score; if predictions averaged 25%, it would be a surprisingly uncommon answer, and hence receive a low score.
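The 10% example can be checked numerically. The sketch below assumes the log-ratio definition of the information score given later in the text (Eq. 1); the function and variable names are illustrative, not part of the original method description.

```python
import math

def information_score(actual_freq, predicted_freq):
    """Log-ratio of actual to collectively predicted endorsement frequency."""
    return math.log(actual_freq / predicted_freq)

# An answer endorsed by 10% of respondents, against an average prediction of 5%:
score_surprisingly_common = information_score(0.10, 0.05)    # log 2, positive

# The same 10% endorsement, against an average prediction of 25%:
score_surprisingly_uncommon = information_score(0.10, 0.25)  # log 0.4, negative

print(score_surprisingly_common, score_surprisingly_uncommon)
```

The sign of the score depends only on whether the answer turned out more or less common than predicted, not on whether it is a majority answer.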

The surprisingly common criterion exploits an overlooked implication of Bayesian reasoning about population frequencies. Namely, in most situations, one should expect that others will underestimate the true frequency of one's own opinion or personal characteristic. This implication is a corollary to the more usual Bayesian argument that the highest predictions of the frequency of a given opinion or characteristic in the population should come from individuals who hold that opinion or characteristic, because holding the opinion constitutes a valid and favorable signal about its general popularity (11, 12). People who, for example, rate Picasso as their favorite should, and usually do (13), give higher estimates of the percentage of the population who shares that opinion, because their own feelings are an informative “sample of one” (14). It follows, then, that Picasso lovers, who have reason to believe that their best estimate of Picasso popularity is high compared with others' estimates, should conclude that the true popularity of Picasso is underestimated by the population. Hence, one's true opinion is also the opinion that has the best chance of being surprisingly common.

The validity of this conclusion does not depend on whether the personally truthful answer is believed to be rare or widely shared. For example, a male who has had more than 20 sexual partners [answering question (iii)] may feel that few people fall in this promiscuous category. Nevertheless, according to Bayesian reasoning, he should expect that his personal estimate of the percentage (e.g., 5%) will be somewhat higher than the average of estimates collected from the population as a whole (e.g., 2%). The fact that he has had more than 20 sexual partners is evidence that the general population, which includes persons with fewer partners, will underestimate the prevalence of this profile.

Truth-telling is individually rational in the sense that a truthful answer maximizes expected information score, assuming that everyone is responding truthfully [hence, it is a Bayesian Nash equilibrium (15)]. It is also collectively rational in the sense that no other equilibrium provides a higher expected information score, for any respondent. In actual applications of the method, one would not teach respondents the mathematics of scoring or explain the notion of equilibrium. Rather, one would like to be able to tell them that truthful answers will maximize their expected scores, and that in arriving at their personal true answer they are free to ignore what other respondents might say. The equilibrium analysis confirms that under certain conditions one can make such a claim honestly.

The equilibrium results rest on two assumptions. First, the sample of respondents must be sufficiently large so that a single answer cannot appreciably affect empirical frequencies (16). The results do hold for large finite populations but are simpler to state for a countably infinite population, as is done here. Respondents are indexed by $r \in \{1, 2, \ldots\}$, and their truthful answer to an $m$-alternative multiple-choice question by $t^r = (t^r_1, \ldots, t^r_m)$, with $t^r_k \in \{0, 1\}$ and $\sum_k t^r_k = 1$. Each $t^r_k$ is thus an indicator variable that has a value of one or zero depending on whether answer $k$ is or is not the truthful answer of respondent $r$. The truthful answer is also called a personal opinion or characteristic.

Second, respondents treat personal opinions as an “impersonally informative” signal about the population distribution, which is an unknown parameter, $\omega = (\omega_1, \ldots, \omega_m) \in \Omega$ (17). Formally, I assume common knowledge (18) by respondents that all posterior beliefs, $p(\omega \mid t^r)$, are consistent with Bayesian updating from a single distribution over $\omega$, also called a common prior, $p(\omega)$, and that $p(\omega \mid t^r) = p(\omega \mid t^s)$ if and only if $t^r = t^s$. Opinions thus provide evidence about $\omega$, but the inference is impersonal: Respondents believe that others sharing their opinion will draw the same inference about population frequencies (19). One can therefore denote a generic respondent with opinion $j$ by $t_j$ and suppress the respondent superscript from joint and conditional probabilities: $\mathrm{Prob}(t^s_j = 1 \mid t^r_i = 1)$ becomes $p(t_j \mid t_i)$, and so on.

For a binary question, one may interpret the model as follows. Each respondent privately and independently conducts one toss of a biased coin, with unknown probability $\omega_H$ of heads. The result of the toss represents his opinion. Using this datum, he forms a posterior distribution, $p(\omega_H \mid t^r)$, whose expectation is the predicted frequency of heads. For example, if the prior is uniform, then the posterior distribution following the toss will be triangular on [0,1], skewed toward heads or tails depending on the result of the toss, with an expected frequency of heads of two-thirds after a heads toss and one-third after a tails toss. However, if the prior is not uniform but strongly biased toward the opposite result (i.e., tails), then the expected frequency of heads following a heads toss might still be quite low. This would correspond to a prima facie unusual characteristic, such as having more than 20 sexual partners within the previous year.
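A small numerical sketch of this interpretation follows. It assumes, purely for illustration, that the prior over the heads probability belongs to the Beta family (the uniform prior is Beta(1, 1)); the text does not commit to any parametric family, and the skewed prior Beta(1, 19) is a made-up example.

```python
from fractions import Fraction

def posterior_mean_heads(a, b, toss):
    """Posterior mean of the heads probability under an assumed Beta(a, b)
    prior, after observing one private toss ('H' or 'T')."""
    if toss == "H":
        a += 1
    else:
        b += 1
    return Fraction(a, a + b)

# Uniform prior: predicted frequency of heads is 2/3 after heads, 1/3 after tails.
uniform_heads = posterior_mean_heads(1, 1, "H")
uniform_tails = posterior_mean_heads(1, 1, "T")

# Hypothetical prior strongly biased toward tails: heads remains a rare
# outcome in expectation, yet someone who tossed heads still predicts a
# higher frequency of heads than someone who tossed tails.
skewed_heads = posterior_mean_heads(1, 19, "H")
skewed_tails = posterior_mean_heads(1, 19, "T")

print(uniform_heads, uniform_tails, skewed_heads, skewed_tails)
```

In both cases the respondent holding the "heads" characteristic gives the higher estimate of its frequency, which is the ordering the argument relies on.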

An important simplification in the method is that I never elicit prior or posterior distributions, only answers and predicted frequencies. Denoting answers and predictions by $x^r = (x^r_1, \ldots, x^r_m)$ and $y^r = (y^r_1, \ldots, y^r_m)$, respectively (with $x^r_k \in \{0, 1\}$ and $\sum_k y^r_k = 1$), I calculate the population endorsement frequencies, $\bar{x}_k$, and the (geometric) average, $\bar{y}_k$, of predicted frequencies,

$$\bar{x}_k = \lim_{n \to \infty} \frac{1}{n} \sum_{r=1}^{n} x^r_k, \qquad \log \bar{y}_k = \lim_{n \to \infty} \frac{1}{n} \sum_{r=1}^{n} \log y^r_k.$$

Instead of applying a preset answer key, we evaluate answers according to their information score, which is the log-ratio of actual-to-predicted endorsement frequencies. The information score for answer $k$ is

$$\log \frac{\bar{x}_k}{\bar{y}_k}. \tag{1}$$

At least one answer will have a nonnegative information score. Variance in predictions tends to lower all $\bar{y}_k$ values and hence raises information scores.
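As a rough illustration, the information scores can be computed from a finite sample by replacing the limits with sample averages. This is a sketch under that assumption, with made-up data; it uses raw frequencies and therefore requires every answer to be endorsed at least once and every prediction to be strictly positive.

```python
import math

def information_scores(answers, predictions):
    """Information score (log of endorsement frequency over geometric-mean
    predicted frequency) for each of the m answers, from a finite sample.

    answers:     list of chosen answer indices, one per respondent
    predictions: list of length-m lists of predicted frequencies
    """
    n = len(answers)
    m = len(predictions[0])
    x_bar = [answers.count(k) / n for k in range(m)]
    # Geometric mean of predictions = exp of the average log prediction.
    y_bar = [math.exp(sum(math.log(y[k]) for y in predictions) / n)
             for k in range(m)]
    return [math.log(x_bar[k] / y_bar[k]) for k in range(m)]

# Hypothetical sample: four respondents, binary question.
answers = [0, 0, 0, 1]
predictions = [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3], [0.2, 0.8]]
scores = information_scores(answers, predictions)
print(scores)
```

Because a geometric mean never exceeds the corresponding arithmetic mean, the averaged predictions sum to at most one, which is why at least one answer always ends up with a nonnegative score.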

The total score for a respondent combines the information score with a separate score for the accuracy of predictions (20):

$$\text{score for } r = \text{information score} + \alpha \times \text{prediction score} = \sum_k x^r_k \log \frac{\bar{x}_k}{\bar{y}_k} + \alpha \sum_k \bar{x}_k \log \frac{y^r_k}{\bar{x}_k}, \qquad 0 < \alpha. \tag{2}$$

Equation 2 is the complete payoff equation for the game. It is symmetric, and zero-sum if $\alpha = 1$. The first part of the equation selects a single information-score value, given that $x^r_k = 0$ for all answers except the one endorsed by $r$. The second part is a penalty proportional to the relative entropy (or Kullback-Leibler divergence) between the empirical distribution and $r$'s prediction of that distribution (21, 22). The best prediction score is zero, attained when prediction exactly matches reality, $y^r_k = \bar{x}_k$. Expected prediction score is maximized by reporting expected frequencies, $y^r_k = E\{\bar{x}_k \mid t^r\}$ (2). The constant $\alpha$ fine-tunes the weight given to prediction error.
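The payoff in Eq. 2 can be sketched for a finite sample as follows. This is an illustrative implementation, not the finite-sample formula from the notes: it plugs raw sample frequencies into the infinite-population expression, treats terms with zero endorsement frequency as contributing nothing to the prediction score, and uses made-up data.

```python
import math

def total_scores(answers, predictions, alpha=1.0):
    """Total score per respondent: information score of the endorsed answer
    plus alpha times the (nonpositive) prediction score."""
    n = len(answers)
    m = len(predictions[0])
    x_bar = [answers.count(k) / n for k in range(m)]
    log_y_bar = [sum(math.log(y[k]) for y in predictions) / n for k in range(m)]
    scores = []
    for answer, y in zip(answers, predictions):
        info = math.log(x_bar[answer]) - log_y_bar[answer]
        # Negative relative entropy between empirical frequencies and y.
        pred = sum(x_bar[k] * math.log(y[k] / x_bar[k])
                   for k in range(m) if x_bar[k] > 0)
        scores.append(info + alpha * pred)
    return scores

# Hypothetical sample: four respondents, binary question.
answers = [0, 0, 0, 1]
predictions = [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3], [0.2, 0.8]]
scores_zero_sum = total_scores(answers, predictions, alpha=1.0)
print(scores_zero_sum, sum(scores_zero_sum))
```

With alpha equal to one the scores in this sketch sum to zero (up to rounding), because the summed information scores and the summed prediction penalties cancel term by term.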

To see how this works in the simple coin toss setting, imagine that there are only two equally likely possibilities: Either the coin is fair, or it is unfair, in which case it always comes up heads. A respondent who privately observes a single toss of tails knows that the coin is fair, and predicts a 50-50 split of observations. A respondent observing heads lowers the probability of fairness from the prior 1/2 to a posterior of 1/3, in accord with Bayes' rule, which in turn yields a predicted (i.e., expected) frequency of 1/6 for tails (multiplying 1/3 by 1/2). From the perspective of someone observing tails, the expectation of others' predictions of the frequency of tails will be a mix of predictions of 1/2 (from those tossing tails) and 1/6 (from those tossing heads), yielding a geometric mean clearly lower than his or her predicted frequency of 1/2. Hence, he or she expects that tails will prove to be more common than predicted and receive a positive information score. By contrast, heads is expected to be a surprisingly uncommon toss, because the predicted frequency of 1/2 is lower than the expectation of others' predictions, which is a mix of 1/2 and 5/6 predictions. A similar argument would show that those who draw heads should expect that heads will prove to be the answer with the high information score.
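The two-state coin example above can be worked through numerically. The sketch assumes an infinite population in which everyone else answers and predicts truthfully, so that within each state the endorsement frequencies equal the toss probabilities; the function names are illustrative.

```python
import math

PRIOR = {"fair": 0.5, "unfair": 0.5}       # two equally likely states
P_TAILS = {"fair": 0.5, "unfair": 0.0}     # unfair coin always lands heads

def p_toss(toss, state):
    return P_TAILS[state] if toss == "T" else 1.0 - P_TAILS[state]

def posterior(toss):
    """Posterior over states after one private toss (Bayes' rule)."""
    joint = {s: PRIOR[s] * p_toss(toss, s) for s in PRIOR}
    total = sum(joint.values())
    return {s: v / total for s, v in joint.items()}

def predicted_freq(answer, toss):
    """Truthful predicted frequency of `answer` by someone who saw `toss`."""
    post = posterior(toss)
    return sum(post[s] * p_toss(answer, s) for s in PRIOR)

def info_score(answer, state):
    """Information score of `answer` when the true state is `state`."""
    x_bar = p_toss(answer, state)
    log_y_bar = sum(p_toss(t, state) * math.log(predicted_freq(answer, t))
                    for t in "TH" if p_toss(t, state) > 0)
    log_x_bar = math.log(x_bar) if x_bar > 0 else float("-inf")
    return log_x_bar - log_y_bar

def expected_info_score(answer, toss):
    """Expected score of giving `answer` for a respondent who saw `toss`."""
    post = posterior(toss)
    return sum(post[s] * info_score(answer, s) for s in PRIOR if post[s] > 0)

for toss in "TH":
    for answer in "TH":
        print(toss, answer, expected_info_score(answer, toss))
```

Under these assumptions the truthful answer has the higher expected information score for both types of respondent; a heads observer who falsely reports tails risks an unboundedly negative score, because tails never occurs if the coin is unfair.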

The example illustrates a general property of information scores. Namely, a truthful answer constitutes the best guess about the most surprisingly common answer, if “best” is defined precisely by expected information score and if other respondents are answering truthfully and giving truthful predicted frequencies. This property does not depend on the number of possible answers or on the prior (23). It leads directly to the equilibrium result [proof in the supporting online material (SOM) text].

For this theorem, assume that (i) every respondent $r$ with opinion $t^r$ forms a posterior over the population distribution of opinions, $p(\omega \mid t^r)$, by applying Bayes' rule to a common prior $p(\omega)$; (ii) $p(\omega \mid t^r) = p(\omega \mid t^s)$ if and only if $t^r = t^s$; and (iii) scores are computed according to Eq. 2. Then, (T1) truth-telling is a Nash equilibrium for any $\alpha > 0$: Truth-telling maximizes expected total score of every respondent who believes that others are responding truthfully; (T2) expected equilibrium information scores are nonnegative and attain a maximum for all respondents in the truth-telling equilibrium; (T3) for $\alpha = 1$, the game is zero-sum, and the total scores in the truth-telling equilibrium equal $\log p(\omega \mid t^r) + K$, with $K$ set by the zero-sum constraint.

Truth-telling is defined as truthful answers, $x^r = t^r$, and truthful predictions, $y^r = E\{\omega \mid t^r\}$. T2 states that although there are other equilibria, constructed by mapping multiple true opinions into a single response category or by randomization, these less revealing equilibria result in lower information scores for all respondents. If needed, one can enhance the strategic advantage of truth-telling by giving relatively more weight to information score in Eq. 2 (24). For sufficiently small $\alpha$, the expected total scores in the truth-telling equilibrium will Pareto-dominate expected scores in any other equilibrium. T3 shows that by setting $\alpha = 1$ we also have the option of presenting the survey as a purely competitive, zero-sum contest. Total scores then rank respondents according to how well they anticipate the true distribution of answers. Note that the scoring system asks only for the expected distribution of true answers, $E\{\omega \mid t^r\}$, and not for the posterior distribution $p(\omega \mid t^r)$, which is an $m$-dimensional probability density function. Remarkably, one can infer which respondents assign more probability to the actual value of $\omega$ by means of a procedure that does not elicit these probabilities directly.

In previous economic research on incentive mechanisms, it has been standard to assume that the scorer (or the “center”) knows the prior and posteriors and incorporates this knowledge into the scoring function (7–9, 25). In principle, any change in the prior, whether caused by a change in question wording, in the composition of the sample, or by new public information, would require a recalculation of the scoring functions. By contrast, my method employs a universal “one-size-fits-all” scoring equation, which makes no mention of prior or posterior probabilities. This has three benefits for practical application. First, questions do not need to be limited to some pretested set for which empirically estimated base rates and conditional probabilities are available; instead, one can use the full resources of natural language to tailor a new set of questions for each application. Second, it is possible to apply the same survey to different populations, or in a dynamic setting (which is relevant to political polling). Third, one can honestly instruct respondents to refrain from speculating about the answers of others while formulating their own answer. Truthful answers are optimal for any prior, and there are no posted probabilities for them to consider, and perhaps reject.

These are decisive advantages when it comes to scoring complex, unique questions. In particular, one can apply the method to elicit honest probabilistic judgments about the truth value of any clearly stated proposition, even if actual truth is beyond reach and no prior is available. For example, a recent book, Our Final Century, by a noted British astronomer, gives the chances of human survival beyond the year 2100 at no better than 50:50 (26). It is a provocative assessment, which will not be put to the test anytime soon. With the present method, one could take the question: “Is this our final century?” and submit it to a sample of experts, who would each provide a subjective probability and also estimate probability distributions over others' probabilities. T1 implies that honest reporting of subjective probabilities would maximize expected information score. Experts would face comparable truth-telling incentives as if they were betting on the actual outcome [e.g., as in a futures market (27)] and that outcome could be determined in time for scoring.

I illustrate this with a discrete computation, which assumes that probabilities are elicited at 1% precision by means of a 100-point multiple-choice question (in practice, one would have fewer categories and smooth out the empirical frequencies). The population vector $\omega = (\omega_{00}, \ldots, \omega_{99})$ indexes the unknown distribution of such probabilities among experts. Given any prior, $p(\omega)$, it is a laborious but straightforward exercise to calculate expected information score as a function of true personal probability and endorsed probability. Lines A90 and B90 in Fig. 1 present the results of such calculations, with two different priors, $p_A(\omega)$ and $p_B(\omega)$, for experts who happen to agree that the probability of disaster striking before 2100 is 90%. The experts thus share the same assessment but have different theories about how their assessment is related to the assessment of others. Although lines A90 and B90 differ, the expected information score is in both cases maximized by a truthful endorsement of 90%. This confirms T1. In both cases, each expert believes that his subjective probability is pessimistic relative to the population: The expectation of others' probabilities, conditioned on a personal estimate of 90%, is only 65% with $p_A(\omega)$ and 54% with $p_B(\omega)$.

Fig. 1.

The expected information score is maximized by a truthful report of subjective belief in a proposition (i.e., “this is our final century”), irrespective of priors (A or B) or subjective probability values (50% or 90%). Line A90 gives expected score for different reported probabilities when true personal estimate of catastrophe is 90% and prior probability is 50%. It is optimal to report 90% even though that is expected to be an unusually pessimistic estimate. Changing the prior to 20% (line B90) increases expected scores but does not displace the optimum. Changing subjective probability to 50% shifts the optimum to 50% (A50 assumes a 50% prior, B50 a 20% prior). Standard proper scoring (expectation of Eq. 3, displayed as line PS90) also maximally rewards a truthful report (90%). However, proper scoring requires knowledge of the true outcome, which may remain moot until 2100.

If the subjective probability shifts to 50%, the lines move to A50, B50, and the optimum, in both cases, relocates to 50%. Hence, the optimum automatically tracks changes in subjective belief, in this case the subjective probability of an unknown future event, but is invariant with respect to assumptions about how that belief is related to beliefs of other individuals. Changing these assumptions will simply lead back to the same recommendation: Truthfully report subjective probability.

Respondents are thus free to concentrate on their personal answer and need not worry about formulating an adequate prior. Any model of the prior is likely to be complex and involve strong assumptions. For example, in the calculations in Fig. 1, I assumed that experts' estimates are based on a private signal, distributed between zero and one, representing a personal assessment of the credibility of evidence supporting the bad outcome. The “credibility signal” is a valid but stochastic indicator of the true state of affairs: On the bad scenario, credibility signals are independent draws from a uniform distribution, so that some experts “get the message” and some do not; on the good scenario, they are independent draws from a triangular distribution, peaking at zero (no credibility) and declining linearly to one (full credibility). A prior probability of catastrophe then induces a monotonic mapping from credibility signals to posterior probabilities of catastrophe, as well as a prior over experts' probability estimates, $p(\omega)$.

Lines A and B differ in that the prior probability of catastrophe is presumed to be 50% for line A and 20% for line B. Expected scores are higher for B, because the 90% estimate is more surprising in that case.

One could question any of the assumptions of this model (28). However, changing the assumptions would not move the optimum, as long as the impersonally informative requirement is preserved. (The impersonally informative requirement means that two experts will estimate the same probability of catastrophe if and only if they share the same posterior distribution over other experts' probabilities.) Thus, even though information scoring conditions success on the answers of other people, the respondent does not need to develop a theory of other people's answers; the most popular answer has no advantage of “winning,” and the entire structure of mutual beliefs, as embodied in the prior, is irrelevant.

It is instructive to compare information scores with scores that would be computed if the scorer had a crystal ball and could score estimates for accuracy. The standard instrument for eliciting honest probabilities about publicly verifiable events is the logarithmic proper scoring rule (2, 4, 29). With the rule, an expert who announces a probability distribution $z = (z_1, \ldots, z_n)$ over $n$ mutually exclusive events would receive a score of

$$\log z_i \tag{3}$$

if event $i$ is realized. For instance, an expert whose true subjective probability estimate that humanity will perish by 2100 is 90%, but who announced a possibly different probability $z$, would calculate an expected score of $0.9 \log z + 0.1 \log(1 - z)$, assuming, again, that there was some way to establish the true outcome. This expectation is maximized at the true value, $z = 0.90$, as shown by line PS90 in Fig. 1 (elevation is arbitrary). It is hard to distinguish proper scoring, which requires knowledge of the true outcome, from information scoring, which does not require such knowledge (30).
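The claim that the expected logarithmic score peaks at the true subjective probability can be checked on a grid. This is a minimal sketch; the 1% grid mirrors the 100-point scale discussed earlier, and the names are illustrative.

```python
import math

def expected_log_score(z, true_p=0.9):
    """Expected logarithmic score of announcing probability z when the
    expert's true subjective probability of the event is true_p."""
    return true_p * math.log(z) + (1.0 - true_p) * math.log(1.0 - z)

# Candidate announcements at 1% resolution, excluding 0 and 1 (log of zero).
grid = [k / 100 for k in range(1, 100)]
best_report = max(grid, key=expected_log_score)
print(best_report)
```

Announcing anything other than the true 90% lowers the expected score, which is what makes the rule "proper"; the difference from information scoring is that this rule can only be paid out once the outcome is known.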

There are two generic ways in which the assumption of an impersonally informative prior might fail. First, a true answer might not be informative about population frequencies in the presence of public information about these frequencies (inducing a sharp prior). For instance, a person's gender would have minimal impact on their judgment of the proportion of men and women in the population. This would be a case of $t^r \neq t^s$ but $p(\omega \mid t^r) \cong p(\omega \mid t^s)$, and the difference between expected information scores for honest and deceptive answers would be virtually zero (though still positive). As shown below, the remedy is to combine the gender question with an opinion question that interacts with gender.

Second, respondents with different tastes or characteristics might choose the same answer for different reasons and hence form different posteriors. For example, someone with nonstandard political views might treat his or her liking for a candidate as evidence that most people will prefer someone else. This would be a case of $p(\omega \mid t^r) \neq p(\omega \mid t^s)$ although $t^r = t^s$. Here, too, the remedy is to expand the questionnaire, allowing the person to reveal both the opinion and characteristic.

A last example, an art evaluation, illustrates both remedies. The example assumes existence of experts and laymen, and a binary state-of-nature: a question of whether a particular artist either does or does not represent an original talent. By hypothesis, art experts recognize this distinction quite well, but laymen discriminate poorly and, indeed, have a higher chance of enjoying a derivative artist than an original one. The fraction of experts is common knowledge, as are the other probabilities (Table 1).

Table 1.

An incomplete question can create incentives for misrepresentation. The first pair of columns gives the conditional probabilities of liking the exhibition as a function of originality (so that, for example, experts have a 70% chance of liking an original artist). It is common knowledge that 25% of the sample are experts, and that the prior probability of an original exhibition is 25%. The remaining columns display expected information scores. Answers with the highest expected information score are marked with an asterisk. Truth-telling is optimal in the long version but not in the short version of the survey.

                       Probability of opinion,    Expected information score
                       given exhibition quality   Long version                        Short version
                                                  Expert claim      Layman claim
Respondent   Opinion   Original    Derivative     Like    Dislike   Like    Dislike   Like    Dislike
Expert       Like      70%         10%            +575*   -776      -462    +67       +191*   -57
Expert       Dislike   30%         90%            -934    +95*      +84     -24       -86     +18*
Layman       Like      10%         20%            -826    +32       +45*    -18       -66     +12*
Layman       Dislike   90%         80%            -499    -156      -73     +2*       -6      -4*

In the short version of the survey, respondents only state their opinion; in the long version, they also report their expertise. Table 1 displays expected information scores for all possible answers, as a function of opinion and expertise. With the short version, truth-telling is optimal for experts but not for laymen, who do have a slight incentive to deceive if they happen to like the exhibition. With the long version, however, the diagonal, truth-telling entries have highest expected score. In particular, respondents will do better if they reveal their true expertise even though the distribution of expertise in the surveyed population is common knowledge.

Expected information scores in this and other examples reflect the amount of information associated with a particular opinion or characteristic. In Table 1, experts have a clear advantage even though they comprise a minority of the sample, because their opinion is more informative about population frequencies. In general, the expected information score for opinion $i$ equals the expected relative entropy between distribution $p(\omega \mid t_k, t_i)$ and $p(\omega \mid t_k)$, averaged over all $t_k$. In words, the expected score for $i$ is the information-theoretic measure of how much endorsing opinion $i$ shifts others' posterior beliefs about the population distribution. An expert endorsement will cause greater shift in beliefs, because it is more informative about the underlying variables that drive opinions for both segments (31). This measure of impact is quite insensitive to the size of the expert segment or to the direction of association between expert and nonexpert opinion.
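The relative-entropy characterization can be illustrated with the two-state coin example used earlier in the text (a coin that is either fair or always lands heads, each with prior probability 1/2). The sketch below evaluates the expected score of a truthful answer as an average of Kullback-Leibler divergences; the function names are illustrative, and the two-state setup is only a toy instance of the general statement.

```python
import math

PRIOR = {"fair": 0.5, "unfair": 0.5}
P_TAILS = {"fair": 0.5, "unfair": 0.0}  # unfair coin always lands heads

def lik(toss, state):
    return P_TAILS[state] if toss == "T" else 1.0 - P_TAILS[state]

def posterior(tosses):
    """Posterior over states given a list of independent tosses."""
    joint = {s: PRIOR[s] * math.prod(lik(t, s) for t in tosses) for s in PRIOR}
    total = sum(joint.values())
    return {s: v / total for s, v in joint.items()}

def kl(p, q):
    """Relative entropy of p with respect to q (zero-mass terms skipped)."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)

def expected_score_via_entropy(i):
    """Average over others' opinions k of KL(p(w | k, i) || p(w | k)),
    weighted by p(k | i)."""
    post_i = posterior([i])
    total = 0.0
    for k in "TH":
        p_k_given_i = sum(post_i[s] * lik(k, s) for s in PRIOR)
        if p_k_given_i > 0:
            total += p_k_given_i * kl(posterior([k, i]), posterior([k]))
    return total

score_tails = expected_score_via_entropy("T")
score_heads = expected_score_via_entropy("H")
print(score_tails, score_heads)
```

In this toy case a tails observation is far more informative than a heads observation, because it rules out the unfair coin entirely, and it earns the larger expected score even though tails is the rarer outcome overall.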

By establishing truth-telling incentives, I do not suggest that people are deceitful or unwilling to provide information without explicit financial payoffs. The concern, rather, is that the absence of external criteria can promote self-deception and false confidence even among the well-intentioned. A futurist, or an art critic, can comfortably spend a lifetime making judgments without the reality checks that confront a doctor, scientist, or business investor. In the absence of reality checks, it is tempting to grant special status to the prevailing consensus. The benefit of explicit scoring is precisely to counteract informal pressures to agree (or perhaps to “stand out” and disagree). Indeed, the mere existence of a truth-inducing scoring system provides methodological reassurance for social science, showing that subjective data can, if needed, be elicited by means of a process that is neither faith-based (“all answers are equally good”) nor biased against the exceptional view.

Supporting Online Material

SOM Text

References and Notes

  1. The finite n-player scoring formula (n ≥ 3), for respondent r, is [equation not preserved in source], where [definition not preserved] and [definition not preserved]. The score for r is built up from pairwise comparisons of r against all other respondents s, excluding from the pairwise calculations the answers and predictions of respondents r and s. To prevent infinite scores associated with zero frequencies, I replace the empirical frequencies with Laplace estimates derived from these frequencies. This is equivalent to “seeding” the empirical sample with one extra answer for each possible choice. Any distortion in incentives can be made arbitrarily small by increasing the number of respondents, n. The scoring is zero-sum when α = 1.
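  2. The key step in the proof involves calculation of expected information score for someone with personal opinion $i$ but endorsing a possibly different answer $j$,

$$E\left\{\log \frac{\bar{x}_j}{\bar{y}_j} \,\middle|\, t_i\right\} = \int_\Omega p(\omega \mid t_i)\, E\left\{\log \frac{\bar{x}_j}{\bar{y}_j} \,\middle|\, \omega\right\} d\omega \tag{a}$$

$$= \int_\Omega p(\omega \mid t_i) \sum_k \omega_k \log \frac{\omega_j}{p(t_j \mid t_k)}\, d\omega \tag{b}$$

$$= \sum_k p(t_k \mid t_i) \int_\Omega p(\omega \mid t_k, t_i) \log \frac{p(t_j \mid \omega)\, p(t_k \mid t_j, \omega)}{p(t_j \mid t_k)\, p(t_k \mid \omega)}\, d\omega \tag{c}$$

$$= \sum_k p(t_k \mid t_i) \int_\Omega p(\omega \mid t_k, t_i) \log \frac{p(\omega \mid t_k, t_j)}{p(\omega \mid t_k)}\, d\omega. \tag{d}$$

Once we reach (d), we can use the fact that the integral, $\int_\Omega p(\omega \mid t_k, t_i) \log p(\omega \mid t_k, t_j)\, d\omega$, is maximized when $p(\omega \mid t_k, t_j) = p(\omega \mid t_k, t_i)$, to conclude that a truthful answer, $i$, will have higher expected information score than any other answer $j$. To derive (d), we first compute expected information score (a) with respect to the posterior distribution, $p(\omega \mid t_i)$, and use the assumption that others are responding truthfully to derive (b). For an infinite sample, truthful answers imply $\bar{x}_j = \omega_j$, and truthful predictions imply $\log \bar{y}_j = \sum_k \omega_k \log p(t_j \mid t_k)$, because the fraction $\omega_k$ of respondents who draw $k$ will predict $p(t_j \mid t_k)$ for answer $j$. To derive (c) from (b), we apply conditional independence to write $\omega_k\, p(\omega \mid t_i)$ as $p(t_k \mid t_i)\, p(\omega \mid t_k, t_i)$, $\omega_j$ as $p(t_j \mid \omega)$, and 1 as $p(t_k \mid t_j, \omega) / p(t_k \mid \omega)$, which is inserted into the fraction. (d) follows from (c) by Bayes' rule.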