## Abstract

Martinho and Kacelnik’s (Reports, 15 July 2016, p. 286) finding that mallard ducklings can deal with abstract concepts is important for understanding the evolution of cognition. However, a statistically more robust analysis of the data calls their conclusions into question. This example brings to light the risk of drawing too strong an inference by relying solely on *P* values.

Martinho and Kacelnik (*1*) report that mallard ducklings are capable of learning the concepts of “same” and “different.” Ducklings were exposed just after hatching to a pair of identical objects, generating a high-fidelity imprinting response. When tested 30 min later, a majority of ducklings followed a novel pair of identical objects more often than a novel pair of nonidentical objects. The opposite relationship held for other ducklings exposed to a pair of nonidentical objects during the imprinting period.

Martinho and Kacelnik computed a probability of 0.0001 of obtaining, by chance alone (the null hypothesis), a difference between the two main conditions at least as large as the observed one. This *P* value was interpreted, following common practice, as strong (highly significant) evidence in favor of the alternative hypothesis (the learning of relational concepts). However, *P* values do not indicate the probability of alternative hypotheses, especially a specific one (*2*, *3*). Moreover, they are conditional on the specification of the model and its implementation (the assumptions required to analyze the data, discussed below). We show that other, arguably more accurate, choices of data analysis were possible. All of them lead to higher *P* values, raising the probability of obtaining the observed difference by chance to as much as *P* = 0.37. Moreover, even overall deviations from chance could be accounted for by hypotheses other than abstraction, such as the imprinting of low-level visual features for some of the tested configurations of object pairs.

Many recommendations have been made to address the present crisis of replicability and credibility in the life sciences (*4*). Making the data available, as done by Martinho and Kacelnik, is commendable because it allows the data to be reinterpreted under different hypotheses. Psychologists have recommended abandoning the logic of *P* values and interpreting confidence intervals (CIs) instead (*5*). Here, we show that CI interpretations are indeed less sensitive to specific analysis choices and are therefore consistent across different analyses, in that they all reveal the large uncertainty in the present data.

Three steps in the analysis of the ducklings’ data could be reconsidered.

First, each duckling had to be assigned a preference for a pair of objects. Martinho and Kacelnik counted the number of times each duckling followed each pair and scored as preferred the stimulus that commanded at least one more approach than the other stimulus. Instead, we reclassified ducklings using a statistical threshold to quantify each bird’s behavior relative to chance, whatever the total number of approaches. We obtained much higher *P* values than those reported (Table 1); Martinho and Kacelnik also observed that criteria of 5 and 10 more approaches reduced statistical significance. When the data from both experiments are combined (which assumes that the difference between the proportions of imprinted behavior observed in the two experiments could have arisen by chance; a chi-square test yielded *P* = 0.39), there is a 1% chance of obtaining the present result if the null hypothesis is in fact true (100 times the value reported by Martinho and Kacelnik). Because our classification criterion (*P* < 0.05) was arbitrary and repeated 117 times across both experiments, a number of categorizations were likely false positives. To keep the family-wise error below 5%, we can apply the Bonferroni correction (i.e., using *P* < 0.05/117). This analysis suggests that the results of each experiment, as well as the combined data (*P* = 0.18), could have been obtained by chance. Such a strict criterion, however, also entails excluding, for lack of power, ducklings that truly made a choice (false negative categorizations in unknown proportion). Figure 1A displays the full range of results when varying the criterion systematically (gray dots).
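The per-bird reclassification described above can be sketched as follows. This is a minimal illustration, not the analysis code used for Table 1: it assumes an exact two-sided binomial test against a chance level of 0.5, and the function names (`binom_two_sided_p`, `classify`) and the approach counts are hypothetical, chosen only to show how the Bonferroni threshold changes the outcome.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial P value for k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Sum every outcome that is no more probable than the observed one.
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))

def classify(toward_imprinted, toward_other, alpha=0.05):
    """Assign a preference only if the bird's approaches deviate from chance."""
    n = toward_imprinted + toward_other
    if n == 0 or binom_two_sided_p(toward_imprinted, n) >= alpha:
        return "no clear choice"
    return "imprinted" if toward_imprinted > toward_other else "other"

# Hypothetical approach counts per bird (NOT the published data):
# (approaches to the imprinted relation, approaches to the other relation)
birds = [(12, 3), (5, 4), (0, 0), (2, 9)]

n_tests = 117  # number of classifications, as stated in the text
for alpha in (0.05, 0.05 / n_tests):  # uncorrected, then Bonferroni-corrected
    print(alpha, [classify(a, b, alpha) for a, b in birds])
```

With these made-up counts, the first bird is classified as "imprinted" at the uncorrected threshold but falls back to "no clear choice" under the Bonferroni threshold, which is the loss of power noted above.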

Second, the computed probabilities depend on how ducklings that did not move or failed to make a clear choice are handled. Those birds may not have recognized any similarity between the novel and imprinted stimuli, so that if they had been forced to follow a pair of objects, they would have done so randomly. If we include those ambiguous cases by assuming that exactly half of them would have preferred the “imprinted” pair, we obtain *P* = 0.06 with the classification criterion “*P* < 0.05” (see the black dots in Fig. 1A for the full range of criteria). Alternatively, we could consider a trinomial probability with “no choice” as a third, independent outcome in which birds recognizing that both pairs were novel chose not to follow either of them (whatever the possible recognition of the relational concept). The probability of this third outcome would need to be estimated in an independent experiment, for example by presenting birds with novel pairs that have no conceptual relation with the learned pair. The interpretation of deviations from chance trinomial probabilities would then require more complex models of the ducklings’ choice behavior.
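The effect of splitting the ambiguous birds evenly can be illustrated with a short sketch. It assumes an exact two-sided binomial test at the group level, which may differ from the test actually used; the helper names and all counts are hypothetical. When the number of ambiguous birds is odd, this sketch simply leaves the leftover bird out, which is another assumption.

```python
from math import comb

def binom_two_sided_p(k, n):
    """Exact two-sided binomial P value against a chance level of 0.5."""
    pmf = [comb(n, i) * 0.5**n for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))

def p_with_ambiguous(n_imprinted, n_other, n_ambiguous):
    """Assign half of the ambiguous birds to each outcome, then test."""
    half = n_ambiguous // 2          # an odd leftover bird is dropped (assumption)
    k = n_imprinted + half           # birds counted as preferring the imprinted relation
    n = n_imprinted + n_other + 2 * half
    return binom_two_sided_p(k, n)

# Hypothetical group counts (NOT the published data)
p_excluded = p_with_ambiguous(30, 14, 0)    # ambiguous birds ignored
p_included = p_with_ambiguous(30, 14, 20)   # ambiguous birds split evenly
print(p_excluded, p_included)
```

Because the even split adds the same number of birds to both outcomes, the absolute difference is unchanged while the sample grows, so the proportion moves toward 0.5 and the *P* value rises.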

Third, to ensure that preferences can be interpreted as the learning of relational concepts and not of specific shape or color patterns, Martinho and Kacelnik cleverly used different configurations [Fig. 1 in (*1*)]. However, they did not test the effect of these manipulations. Experiment 2 lacked the power to do so (a maximum of eight ducklings in each group). In experiment 1, abstraction showed up only when birds learned the same shapes in group 2 or the different shapes in group 1 (Table 1). Because group independence was unlikely (chi-square test, *P* = 0.022), averaging across groups may not be legitimate. In other words, the attempted counterbalancing of stimuli did not guard against all possible contaminating variables.
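A test of whether the outcome depends on the stimulus group can be sketched with a Pearson chi-square test on a contingency table. The example below assumes a 2 × 2 table (two groups by two outcomes) without continuity correction, for which the *P* value with one degree of freedom has a closed form; the layout of the table actually tested is not specified here, and the function name and counts are hypothetical.

```python
from math import erfc, sqrt

def chi2_independence_2x2(table):
    """Pearson chi-square test of independence for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    chi2 = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row[i] * col[j] / n   # count expected under independence
            chi2 += (obs - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = erfc(sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts (NOT the published data):
# rows = stimulus groups, columns = (imprinted relation preferred, other preferred)
table = [[14, 4],
         [6, 12]]
chi2, p = chi2_independence_2x2(table)
print(chi2, p)
```

A small *P* value here indicates that preference depends on the group, in which case pooling the groups hides a stimulus effect rather than averaging it out.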

Conclusions based on *P* values can change dramatically, depending on the choice of analysis (Fig. 1A). CI-based interpretations are more stable. The 95% CIs range from 59 to 77% with Martinho and Kacelnik’s analysis to 46 to 62% with our most conservative method (Fig. 1, B and C). Intervals are wider when analyzing each experiment separately (Table 1). Thus, whatever the analysis, the CIs are too wide to allow any precise conclusion. Their extent indicates that the data are compatible with large deviations from chance, which would suggest the mastery of relational concepts by ducklings, as well as with little or no deviation, which could reflect the imprinting of some low-level visual features. In summary, Martinho and Kacelnik designed an ingenious study suggesting that mallard ducklings may be able to deal with abstract concepts. Whether ducklings really can do this requires further investigation, as well as a paradigm shift regarding statistics.
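For readers who wish to explore how the width of a CI depends on sample size, a 95% interval for a proportion can be sketched as below. The Wilson score interval is used here as one reasonable choice; it is an assumption, not necessarily the method behind the intervals reported above, and the function name and counts are hypothetical.

```python
from math import sqrt

def wilson_ci(k, n, z=1.959964):
    """Wilson score interval for a proportion k/n (z = 1.96 gives about 95%)."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts (NOT the published data): birds preferring the
# imprinted relation out of all birds entering the analysis.
for k, n in [(34, 50), (27, 50), (340, 500)]:
    low, high = wilson_ci(k, n)
    print(f"{k}/{n}: {low:.3f} to {high:.3f}")
```

An interval whose lower bound sits close to 0.5 is compatible both with a substantial preference and with almost none, which is the situation described for the present data.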