Technical Comments

Comment on “Predicting reaction performance in C–N cross-coupling using machine learning”


Science  16 Nov 2018:
Vol. 362, Issue 6416, eaat8603
DOI: 10.1126/science.aat8603


Ahneman et al. (Reports, 13 April 2018) applied machine learning models to predict C–N cross-coupling reaction yields. The models use atomic, electronic, and vibrational descriptors as input features. However, the experimental design is insufficient to distinguish models trained on chemical features from those trained solely on random-valued features in retrospective and prospective test scenarios, thus failing classical controls in machine learning.

A recent report by Ahneman et al. (1) describes a machine learning approach for modeling chemical reactions with data collected through ultrahigh-throughput experimentation. The Buchwald-Hartwig coupling (2) is used as a model reaction, with a Glorius interference approach (3) to study reaction poisoning by isoxazole additives. Reactions are represented by atomic, electronic, and vibrational descriptors that are automatically calculated through a new computational pipeline. The authors find that random forest models outperform linear models in predicting yields on a 70/30 train-test random split, and claim strong performance on an out-of-sample test set of unseen isoxazoles.

We applied the classical method of multiple hypotheses (4, 5) to investigate alternative explanations for the observed machine learning model performance. The experiments in this study explore the effect of four reaction parameters—aryl halide, catalyst, base, and additive—with all combinations exhaustively generated through 4608 different reactions. This complete combinatorial layout provides an underlying structure to the data irrespective of any chemical knowledge. Correspondingly, we posited the alternative hypothesis that the machine learning algorithms exploit patterns within the underlying experimental design, instead of learning solely from meaningful chemical features. A model that learns patterns particular to an experimental layout, rather than from meaningful input features, cannot be relied upon to generalize to new examples (i.e., reaction components).

Following the logic of exclusions (4), we performed two experiments intended to disprove the alternative hypothesis, wherein we ablated all chemical information from the Ahneman et al. dataset and evaluated the same machine learning methods. All computational analyses were performed in the Python package Scikit-learn (6). In the first experiment, we replaced the extracted chemical features of each molecule with random numbers, which effectively creates a unique, random “barcode” mimicking the reaction fingerprints used in the paper (Fig. 1A). For example, the 27 chemical descriptors (¹H and ¹³C nuclear magnetic resonance shifts, dipole moment, etc.) that had been used to represent 4-bromotoluene were replaced with 27 random numbers drawn from a standard normal distribution. Applying these random barcodes to the exact train-test split of the dataset used by Ahneman et al. resulted in “straw” models (7) that achieved predictive performance nearly identical to those trained on actual chemical features (Fig. 1B). In a related second experiment, we encoded each reaction component as a “one-hot” vector (i.e., a “dummy” encoding) that denotes only the presence or absence of each component (e.g., additive-1, additive-2, etc.; see Fig. 1A) (8). One-hot encoding likewise provided near-identical performance for each model (Fig. 1C). Critically, both of these approaches encode no notion of chemistry, and by definition cannot generalize to new chemical entities. We note that these results do not indicate that chemical features are unimportant, but instead suggest that the retrospective study performed in the paper is incapable of distinguishing between meaningful featurization and random featurization.
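The two control featurizations described above can be illustrated with a minimal sketch. It is not the code used in the analyses: the component names, the toy two-dimensional reaction layout, the additive barcode length of 5, and the helper names random_barcodes, one_hot_codes, and featurize are all assumptions made for illustration. Only the barcode length of 27 for aryl halides follows the text.

```python
import numpy as np

def random_barcodes(components, n_features, seed=0):
    """Map each component name to a fixed vector of standard-normal numbers."""
    rng = np.random.default_rng(seed)
    return {name: rng.standard_normal(n_features) for name in components}

def one_hot_codes(components):
    """Map each component name to a one-hot vector marking only its identity."""
    eye = np.eye(len(components))
    return {name: eye[i] for i, name in enumerate(components)}

def featurize(reactions, codebooks):
    """Concatenate the per-component codes of every reaction into one row.

    reactions: list of tuples holding one component name per dimension.
    codebooks: list of dicts (one per dimension) mapping name -> vector.
    """
    return np.array([
        np.concatenate([book[name] for book, name in zip(codebooks, rxn)])
        for rxn in reactions
    ])

# Hypothetical placeholder components, not the real dataset.
aryl_halides = ["halide-1", "halide-2", "halide-3"]
additives = ["additive-1", "additive-2"]

# Full combinatorial layout: every halide paired with every additive.
reactions = [(h, a) for h in aryl_halides for a in additives]

# Random barcodes: 27 numbers per aryl halide (as in the text) and an
# arbitrary, assumed 5 numbers per additive.
X_random = featurize(
    reactions,
    [random_barcodes(aryl_halides, 27, seed=0),
     random_barcodes(additives, 5, seed=1)],
)

# One-hot encoding: presence or absence of each component only.
X_onehot = featurize(
    reactions,
    [one_hot_codes(aryl_halides), one_hot_codes(additives)],
)
```

In both encodings, every reaction that shares a component shares the corresponding block of columns, so a model can still recover the combinatorial layout even though no column carries chemical meaning.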

Fig. 1 Comparison of input representation control experiments.

(A) Schematic of Ahneman et al.’s featurization versus random features and one-hot encoded categorical features. (B) Machine learning models trained using random feature barcodes provide near-identical performance on the exact 70/30 train-test split reported. Note: Bayes’ generalized linear model was not assessed; five-neuron versus 100-neuron networks are shown instead. (C) Comparison of coefficient of determination (R²) and RMSE values between input representations.

Prospective out-of-sample test sets provide a more rigorous measure of model generalization. Ahneman et al. reported that an out-of-sample set of eight isoxazole additives, representing one of the three 1536-well plates used in their study, shows good generalization [root mean square error (RMSE) = 11.3%]. To account for variance in sampling and to establish a comprehensive picture, we completed analyses that, in turn, independently hold out each of the remaining two experimental plates (Fig. 2A). We found that performance dropped starkly (RMSE = 22.0 and 17.3%, respectively), indicating that the models’ generalizability was more limited than expected. Similarly, we repeated the analysis using one-hot encoded representations in place of chemical features, as in Fig. 1A. Machine learning algorithms trained on one-hot representations learn only from the presence or absence of additives in the training set, and by definition cannot generalize to unseen additives. We thus anticipated highly diminished straw model performance in this sanity check. Surprisingly, platewise performance using one-hot encoding (Fig. 2B) tracked closely with that obtained using chemical features (Fig. 2A), as did the random barcode results (not shown). These results failed to strongly distinguish meaningful from random featurization, despite the prospective context.
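The platewise hold-out procedure can be sketched as follows, assuming scikit-learn's LeaveOneGroupOut splitter and a random forest regressor. This is an illustration rather than the analysis code: the data are synthetic stand-ins, the plate labels 1 to 3 and the forest settings are arbitrary choices, and platewise_rmse is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

def platewise_rmse(X, y, plates, seed=0):
    """Hold out each plate in turn; return {plate: RMSE on that plate}."""
    scores = {}
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(X, y, groups=plates):
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        held_out = plates[test_idx][0].item()  # label of the held-out plate
        scores[held_out] = float(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return scores

# Synthetic stand-in data (assumption): 300 "reactions", 8 features,
# yields driven by the first feature plus noise, three equal plates.
rng = np.random.default_rng(0)
n = 300
X = rng.standard_normal((n, 8))
y = 50 + 10 * X[:, 0] + rng.normal(0, 2, n)
plates = np.repeat([1, 2, 3], n // 3)

rmse_by_plate = platewise_rmse(X, y, plates)
```

Running the same function once on chemical descriptors and once on one-hot or random features would give the side-by-side comparison described above; reporting every plate, rather than a single held-out plate, exposes the variance across hold-out choices.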

Fig. 2 Out-of-sample performance on individual plate predictions and analysis of feature importance bias.

(A) Platewise predictions on additives using Ahneman et al.’s chemical featurization. (B) Analogous platewise predictions on additives using one-hot encoding. (C) Box plot of average feature importances extracted from 100 trials of shuffled data, showing median values with first and third quartiles. Plot whiskers represent minimum and maximum importance values across random trials. (D) Top 10 feature importances from a single representative trial.

Looking beyond prediction performance, Ahneman et al. thoughtfully analyzed the relative importance of the chemical features used by their top-performing random forest model. They found that isoxazole additive–based descriptors most significantly affect yield prediction mean square error under permutation analysis (9). By contrast, the random-feature and one-hot encoding straw experiments we performed above suggest that isoxazole additives play only a limited role in predicting reaction outcome, and we looked to understand this discrepancy. Traditional random forest implementations can exhibit feature-importance bias when inputs vary in scale or when categories vary in number of classes (10, 11). We suspected that additive feature importance may be enriched as the result of a similar effect. Consequently, we shuffled all training data to decorrelate the predictive variables (features) from the output (yields) and trained a random forest regressor on the shuffled data. In 100 trials of this randomized-data test, additive features were nonetheless consistently identified as most important, and consistently occupied 9 of the top 10 by rank (Fig. 2, C and D). These results indicate that apparently high additive feature importances cannot be distinguished from hidden structure within the dataset itself.
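A shuffled-data importance check of this kind can be sketched as below. Several points are assumptions rather than details from the analyses: the sketch reads scikit-learn's impurity-based feature_importances_ attribute, which may differ from the importance measure actually used; the two-column synthetic dataset and the trial count of 20 are arbitrary; and shuffled_importances is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def shuffled_importances(X, y, n_trials=20, seed=0):
    """Fit a random forest to permuted targets; return importances per trial."""
    rng = np.random.default_rng(seed)
    results = []
    for trial in range(n_trials):
        y_shuffled = rng.permutation(y)  # breaks any feature-yield link
        model = RandomForestRegressor(n_estimators=50, random_state=trial)
        model.fit(X, y_shuffled)
        results.append(model.feature_importances_)
    return np.array(results)  # shape: (n_trials, n_features)

# Synthetic stand-in data (assumption): one column with many distinct
# values and one binary column, with yields unrelated to either.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    rng.standard_normal(n),      # column 0: many distinct values
    rng.integers(0, 2, size=n),  # column 1: binary indicator
])
y = rng.uniform(0, 100, size=n)

importances = shuffled_importances(X, y)
mean_importance = importances.mean(axis=0)
```

Because the targets are permuted, no feature is truly informative. Any feature that still ranks consistently high across trials does so because of how it is distributed, not because it predicts yield, which is the pattern the randomized-data test is designed to reveal.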

We believe that these results, taken together, illustrate the need to incorporate random-control procedures (7) when applying machine learning to new scientific domains. We find that the experimental design is insufficient to establish that models built on the proposed chemical featurization can generalize to new chemical entities, or meaningfully outperform straw models trained on randomly assigned features. However, we do not conclude that chemical features are unimportant, nor that the ones used here are necessarily incorrect. Nor do we believe that careful design of chemical features is futile. Rather, further studies that more expansively explore each reaction dimension (additional bases, ligands, substrates, and additives) may be a means to demonstrate that these models can be usefully adopted for reaction prediction. Flexible and powerful machine learning models have become widespread and readily available. As these tools permeate the physical and life sciences, so too must accompanying methods to distinguish models that learn peculiarities of an experiment’s layout from those that extract meaningful and actionable patterns beyond it. With randomized controls to guide experimental design, Ahneman et al.’s novel machine learning approach to reaction prediction may best prove its merit.

References and Notes

Acknowledgments: We thank D. T. Ahneman, J. G. Estrada, and A. G. Doyle for their collegiality and discussions of one-hot results. Funding: Supported by a Paul G. Allen Family Foundation Distinguished Investigator Award (M.J.K.). Competing interests: The authors declare no competing interests. Data and materials availability: All code and data used in these analyses are available at