Research Article

Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning

See allHide authors and affiliations

Science  18 Jan 2019:
Vol. 363, Issue 6424, eaau5631
DOI: 10.1126/science.aau5631

You are currently viewing the abstract.

View Full Text

Log in to view the full text

Log in through your institution

Log in through your institution

Predicting catalyst selectivity

Asymmetric catalysis is widely used in chemical research and manufacturing to access just one of two possible mirror-image products. Nonetheless, the process of tuning catalyst structure to optimize selectivity is still largely empirical. Zahrt et al. present a framework for more efficient, predictive optimization. As a proof of principle, they focused on a known coupling reaction of imines and thiols catalyzed by chiral phosphoric acid compounds. By modeling multiple conformations of more than 800 prospective catalysts, and then training machine-learning algorithms on a subset of experimental results, they achieved highly accurate predictions of enantioselectivities.

Science, this issue p. eaau5631

Structured Abstract


The development of new synthetic methods in organic chemistry is traditionally accomplished through empirical optimization. Catalyst design, wherein experimentalists attempt to qualitatively identify correlations between catalyst structure and catalyst efficiency, is no exception. However, this approach is plagued by numerous deficiencies, including the lack of mechanistic understanding of a new transformation, the inherent limitations of human cognitive abilities to find patterns in large collections of data, and the lack of quantitative guidelines to aid catalyst identification. Chemoinformatics provides an attractive alternative to empiricism for several reasons: Mechanistic information is not a prerequisite, catalyst structures can be characterized by three-dimensional (3D) descriptors (numerical representations of molecular properties derived from the 3D molecular structure) that quantify the steric and electronic properties of thousands of candidate molecules, and the suitability of a given catalyst candidate can be quantified by comparing its properties with a computationally derived model trained on experimental data. The ability to accurately predict a selective catalyst by using a set of less than optimal data remains a major goal for machine learning with respect to asymmetric catalysis. We report a method to achieve this goal and propose a more efficient alternative to traditional catalyst design.


The workflow we have created consists of the following components: (i) construction of an in silico library comprising a large collection of conceivable, synthetically accessible catalysts derived from a particular scaffold; (ii) calculation of relevant chemical descriptors for each scaffold; (iii) selection of a representative subset of the catalysts [this subset is termed the universal training set (UTS) because it is agnostic to reaction or mechanism and thus can be used to optimize any reaction catalyzed by that scaffold]; (iv) collection of the training data; and (v) application of machine learning methods to generate models that predict the enantioselectivity of each member of the in silico library. These models are evaluated with an external test set of catalysts (predicting selectivities of catalysts outside of the training data). The validated models can then be used to select the optimal catalyst for a given reaction.


To demonstrate the viability of our method, we predicted reaction outcomes with substrate combinations and catalysts different from the training data and simulated a situation in which highly selective reactions had not been achieved. In the first demonstration, a model was constructed by using support vector machines and validated with three different external test sets. The first test set evaluated the ability of the model to predict the selectivity of only reactions forming new products with catalysts from the training set. The model performed well, with a mean absolute deviation (MAD) of 0.161 kcal/mol. Next, the same model was used to predict the selectivity of an external test set of catalysts with substrate combinations from the training set. The performance of the model was still highly accurate, with a MAD of 0.211 kcal/mol. Lastly, reactions forming new products with the external test catalysts were predicted with a MAD of 0.236 kcal/mol. In the second study, no reactions with selectivity above 80% enantiomeric excess were used as training data. Deep feed-forward neural networks accurately reproduced the experimental selectivity data, successfully predicting the most selective reactions. More notably, the general trends in selectivity, on the basis of average catalyst selectivity, were correctly identified. Despite omitting about half of the experimental free energy range from the training data, we could still make accurate predictions in this region of selectivity space.


The capability to predict selective catalysts has the potential to change the way chemists select and optimize chiral catalysts from an empirically guided to a mathematically guided approach.

Chemoinformatics-guided optimization protocol.

(A) Generation of a large in silico library of catalyst candidates. (B) Calculation of robust chemical descriptors. (C) Selection of a UTS. (D) Acquisition of experimental selectivity data. (E) Application of machine learning to use moderate- to low-selectivity reactions to predict high-selectivity reactions. R, any group; X, O or S; Y, OH, SH, or NHTf; PC, principal component; ΔΔG, mean selectivity.


Catalyst design in asymmetric reaction development has traditionally been driven by empiricism, wherein experimentalists attempt to qualitatively recognize structural patterns to improve selectivity. Machine learning algorithms and chemoinformatics can potentially accelerate this process by recognizing otherwise inscrutable patterns in large datasets. Herein we report a computationally guided workflow for chiral catalyst selection using chemoinformatics at every stage of development. Robust molecular descriptors that are agnostic to the catalyst scaffold allow for selection of a universal training set on the basis of steric and electronic properties. This set can be used to train machine learning methods to make highly accurate predictive models over a broad range of selectivity space. Using support vector machines and deep feed-forward neural networks, we demonstrate accurate predictive modeling in the chiral phosphoric acid–catalyzed thiol addition to N-acylimines.

View Full Text