Technical Comments

Comment on “Ducklings imprint on the relational concept of ‘same or different’”


Science  24 Feb 2017:
Vol. 355, Issue 6327, pp. 806
DOI: 10.1126/science.aah6047


Martinho and Kacelnik’s (Reports, 15 July 2016, p. 286) finding that mallard ducklings can deal with abstract concepts is important for understanding the evolution of cognition. However, a statistically more robust analysis of the data calls their conclusions into question. This example brings to light the risk of drawing too strong an inference by relying solely on P values.

Martinho and Kacelnik (1) report that mallard ducklings are capable of learning the concepts of “same” and “different.” Ducklings were exposed just after hatching to a pair of identical objects, generating a high-fidelity imprinting response. When tested 30 min later, a majority of ducklings followed a novel pair of identical objects more often than a novel pair of nonidentical objects. The opposite relationship held for other ducklings exposed to a pair of nonidentical objects during the imprinting period.

Martinho and Kacelnik computed a probability of 0.0001 of obtaining, by chance alone (the null hypothesis), a difference between the two main conditions at least as large as the observed one. This P value was interpreted, following common practice, as strong (highly significant) evidence in favor of the alternative hypothesis (the learning of relational concepts). However, P values do not indicate the probability of alternative hypotheses, especially a specific one (2, 3). Moreover, they are conditional on the specifications of the model and its implementation (the assumptions required to analyze the data, discussed below). We show that other, arguably more accurate, choices of data analysis were possible. All of them lead to higher P values, raising the probability of obtaining the observed difference by chance to as much as P = 0.37. Moreover, even overall deviations from chance could be accounted for by hypotheses other than abstraction, such as the imprinting of low-level visual features for some of the tested configurations of object pairs.

Many recommendations have been made to address the present crisis of replicability and credibility in the life sciences (4). Making the data available, as Martinho and Kacelnik did, is commendable because it allows the data to be reinterpreted under different hypotheses. Psychologists recommend abandoning the logic of P values and interpreting confidence intervals (CIs) instead (5). Here, we show that CI interpretations are indeed less sensitive to specific analysis choices and are therefore consistent across different analyses, in the sense that they reveal the large magnitude of uncertainty conveyed by the present data.

Three steps in the analysis of the ducklings’ data could be reconsidered.

First, each duckling had to be assigned a preference for a pair of objects. Martinho and Kacelnik counted the number of times each duckling followed each pair and scored as preferred the stimulus that commanded at least one more approach than the other stimulus. Instead, we reclassified ducklings using a statistical threshold to quantify each bird’s behavior relative to chance, whatever the total number of approaches. We obtained much higher P values than those reported (Table 1) (Martinho and Kacelnik also observed that criteria of 5 and 10 more approaches reduced statistical significance). If all data are combined (which assumes that the difference between the two experiments in the proportion of imprinted behavior could have arisen by chance; a chi-square test yielded P = 0.39), there is a 1% chance of obtaining the present result if the null hypothesis is in fact true (100 times the probability reported by Martinho and Kacelnik). Because our classification criterion (P < 0.05) was arbitrary and repeated 117 times across both experiments, a number of categorizations were likely false positives. To keep the family-wise error below 5%, we can apply the Bonferroni correction (i.e., using P < 0.05/117). This analysis suggests that the results of both experiments, as well as those of the combined data (P = 0.18), could have been obtained by chance. Such a strict criterion, however, also entails excluding, for lack of power, ducklings that truly made a choice (false negative categorizations in unknown proportion). Figure 1A displays the full range of results when varying the criterion systematically (gray dots).
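The cost in power of the Bonferroni-corrected criterion can be illustrated with a short calculation. The sketch below is not the authors' code; it assumes a chance level of 0.5 and a two-tailed test, and the helper names (unanimous_p, min_unanimous_approaches) are hypothetical. It asks how many approaches a duckling would need, all directed to the same pair, before its behavior could pass a given criterion.

```python
alpha = 0.05    # family-wise error rate targeted in the text
n_tests = 117   # number of classifications across both experiments (from the text)

# Bonferroni-corrected per-duckling criterion: P < 0.05/117
bonferroni_threshold = alpha / n_tests

def unanimous_p(n):
    """Two-tailed P value when all n approaches go to the same pair (chance = 0.5)."""
    return min(1.0, 2 * 0.5 ** n)

def min_unanimous_approaches(threshold):
    """Smallest number of unanimous approaches whose P value falls below threshold."""
    n = 1
    while unanimous_p(n) >= threshold:
        n += 1
    return n

print(f"Bonferroni criterion: P < {bonferroni_threshold:.4f}")
print("Approaches needed at P < 0.05:", min_unanimous_approaches(alpha))
print("Approaches needed after correction:", min_unanimous_approaches(bonferroni_threshold))
```

Under these assumptions, a duckling needs at least 6 unanimous approaches to pass P < 0.05 and at least 13 to pass the corrected criterion, which is one way to see why the strict criterion excludes birds with few approaches regardless of their true preference.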

Table 1 Categorization of behavior using a statistical criterion (P < 0.05).

The behavior of each duckling was reclassified using a statistical criterion. For example, S04 followed the imprinted pattern 44 times and the novel one 19 times. Such a difference is unlikely to occur by chance (P = 0.002, two-tailed binomial test), so we considered, like Martinho and Kacelnik (who used the criterion 44 > 19), that S04 had displayed the imprinted behavior. Another duckling, S11, followed the imprinted pattern 20 times and the novel one 35 times. Such a difference is more likely to occur by chance (P = 0.058), so we considered that S11 had no preference, unlike Martinho and Kacelnik, who considered that S11 had preferred the novel pair (20 < 35). With fewer than five approaches (birds excluded by the authors), the behavior could never reach significance. The P values and 95% CI for the percentage of birds preferring the imprinted rather than the novel stimuli (last two columns of the table) were computed using the exact two-tailed binomial test. A 95% CI should be interpreted to indicate that if the exact same experiment were repeated a sufficient number of times, 95% of the samples would include the “true” value within their 95% CI. Like a P value, a particular CI does not allow any probability statement on the true value, because any new experiment would yield a new CI. It does indicate, though, the approximate range of possible true values compatible with the data sample. For experiment 1 (“shapes”), the results for subgroups are detailed (Martinho and Kacelnik kindly provided the information that the animals numbered 1 to 18 in their supplementary table belonged to subgroup 1 and animals 19 to 36 belonged to subgroup 2).
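The per-duckling classification described in this caption can be sketched as follows. This is a minimal illustration rather than the authors' actual code: it assumes a chance level of 0.5, uses only the Python standard library, and the function names (binom_two_tailed, classify) are hypothetical. The approach counts for S04, S11, and S14 are those quoted in the text; for S14 the text does not say which pair received four approaches, and the P value is the same either way.

```python
from math import comb

def binom_two_tailed(k, n):
    """Exact two-tailed binomial P value for k of n approaches, chance = 0.5."""
    smaller = min(k, n - k)
    # Probability of a split at least this lopsided in one direction...
    one_tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    # ...doubled for the two-tailed test and capped at 1.
    return min(1.0, 2 * one_tail)

def classify(imprinted, novel, criterion=0.05):
    """Label a duckling's preference, or 'no preference' if P >= criterion."""
    p = binom_two_tailed(imprinted, imprinted + novel)
    if p >= criterion:
        return "no preference", p
    return ("imprinted" if imprinted > novel else "novel"), p

# (name, approaches to imprinted pattern, approaches to novel pattern)
ducklings = [("S04", 44, 19), ("S11", 20, 35), ("S14", 4, 3)]
for name, imprinted, novel in ducklings:
    label, p = classify(imprinted, novel)
    print(f"{name}: {label} (P = {p:.3f})")
```

The printed P values can be compared with those quoted in the caption. Setting criterion to any value above 1 labels every duckling with unequal counts, which corresponds to the "one more approach" rule used by Martinho and Kacelnik.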

Fig. 1 Statistical outcome of the data as a function of analysis choices, computed across both experiments.

(A to C) We varied the statistical criterion to classify each duckling from P = 0.0004 (family-wise error < 0.05) to P = 0.9999. “P = 1” corresponds to the procedure used by Martinho and Kacelnik. Indeed, for birds like S14 in experiment 1 that followed one pair four times and the other three times, the probability of observing by chance a difference at least as large as this one is 1. Probabilities could be computed using only birds considered to have made a choice, as was done by Martinho and Kacelnik [(B) and gray points in (A)], or using all birds and attributing half of the hesitant birds (or birds making fewer than five approaches) to each category [(C) and black points in (A)]. The red dot and error bar (95% CI) correspond to the analysis criterion used by the authors; the blue dots and error bars correspond to our preferred computations. The increasing relationship observed in (B) and (C) means that the weaker the preference expressed by a duckling (especially when preference was indistinguishable from chance), the more often it was in the direction of the imprinted concept. The interpretation of such a surprising relationship (surprising because the noisier the data, the more they deviate from randomness) is beyond the scope of the present Comment. The opposite relationship, or a maximal value observed for some midrange classification criterion, would have made more sense. Alternative ways of coding the behavior could be tried (such as the percentage of time following the objects or the choices made at the end of the test period).

Second, the computed probabilities depend on how ducklings that did not move or failed to make a clear choice are handled. Those birds may not have recognized any similarity between the novel and imprinted stimuli, so that if they had been forced to follow a pair of objects, they would have done so at random. If we include those ambiguous cases by assuming that exactly half of them would have preferred the “imprinted” pair, we obtain P = 0.06 with the classification criterion “P < 0.05” (see the black dots in Fig. 1A for the full range of criteria). Alternatively, we could consider a trinomial probability with “no choice” as a third, independent outcome where birds recognizing that both pairs were novel chose not to follow any of them (whatever the possible recognition of the relational concept). The probability of this third outcome would need to be estimated in an independent experiment, for example, by presenting birds with novel pairs that have no conceptual relation with the learned pair. The interpretation of deviations from chance trinomial probabilities would then require more complex models of the ducklings’ choice behavior.
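The effect of attributing half of the undecided birds to each category can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the analysis: the counts are hypothetical (the per-category counts are not given in the text), the function names are hypothetical, and when the number of undecided birds is odd the leftover bird is assigned to the novel category, which is an arbitrary choice made here.

```python
from math import comb

def binom_two_tailed(k, n):
    """Exact two-tailed binomial P value for k successes in n trials, chance = 0.5."""
    smaller = min(k, n - k)
    one_tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)

def p_with_undecided(imprinted, novel, undecided):
    """Group-level P value after splitting undecided birds evenly between categories.

    Assumption: with an odd number of undecided birds, the extra bird
    is counted as 'novel' (integer division rounds the imprinted share down).
    """
    k = imprinted + undecided // 2
    n = imprinted + novel + undecided
    return binom_two_tailed(k, n)

# Hypothetical counts, chosen only for illustration.
imprinted, novel, undecided = 30, 10, 20
print("Choosers only:      P =", round(p_with_undecided(imprinted, novel, 0), 4))
print("Undecided included: P =", round(p_with_undecided(imprinted, novel, undecided), 4))
```

Because the undecided birds are added symmetrically, they pull the observed proportion toward 50% while increasing the sample size; with these hypothetical counts the first effect dominates and the P value rises.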

Third, to ensure that preferences can be interpreted as the learning of relational concepts and not of specific shape or color patterns, Martinho and Kacelnik cleverly used different configurations [Fig. 1 in (1)]. However, they did not test the effect of these manipulations. There was not enough power in experiment 2 (a maximum of eight ducklings in each group). In experiment 1, abstraction showed up only when birds learned the same shapes in group 2 or the different shapes in group 1 (Table 1). Because group independence was unlikely (chi-square test, P = 0.022), averaging across groups may not be legitimate. In other words, the attempted counterbalancing of stimuli did not guard against all possible contaminating variables.

Conclusions based on P values can change dramatically, depending on the choice of analysis (Fig. 1A). CI-based interpretations are more stable. The 95% CIs range from 59 to 77% with Martinho and Kacelnik’s analysis to 46 to 62% with our most conservative method (Fig. 1, B and C). Intervals are wider when analyzing each experiment separately (Table 1). Thus, whatever the analysis, the CIs are too wide to allow any precise conclusion. Their extent indicates that the data are compatible with large deviations from chance, which would suggest the mastery of relational concepts by ducklings, as well as with little or no deviation, which could reflect the imprinting of some low-level visual features. In summary, Martinho and Kacelnik designed an ingenious study suggesting that mallard ducklings may be able to deal with abstract concepts. Whether ducklings really can do this requires further investigation, as well as a paradigm shift regarding statistics.
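For readers who wish to compute intervals of this kind, the following sketch implements one common exact binomial interval, the Clopper-Pearson interval, by bisection on the binomial distribution. It is an assumption that this matches the exact method used for Table 1 and Fig. 1; the function names are hypothetical, and the counts in the usage lines are hypothetical as well.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_increasing(f, target):
    """Bisection on [0, 1] for f(p) = target, where f increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) confidence interval for a proportion k/n."""
    alpha = 1 - conf
    # Lower bound: the p at which P(X >= k) equals alpha/2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha/2 (negated to be increasing).
    upper = 1.0 if k == n else solve_increasing(
        lambda p: -binom_cdf(k, n, p), -alpha / 2)
    return lower, upper

# Hypothetical counts: the same observed proportion with a small and a larger sample.
for k, n in [(6, 10), (60, 100)]:
    lo, hi = clopper_pearson(k, n)
    print(f"{k}/{n}: 95% CI = {100 * lo:.0f} to {100 * hi:.0f}%")
```

The comparison between the two hypothetical samples shows how the interval narrows with sample size, which is the sense in which the width of the CI conveys the uncertainty in the data.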

References and Notes

Acknowledgments: We thank A. Kacelnik and A. Martinho for their constructive collaboration, as well as M. Price and V. Hajivassiliou for helpful discussions and comments.