Predicting human olfactory perception from chemical features of odor molecules

Science  24 Feb 2017:
Vol. 355, Issue 6327, pp. 820-826
DOI: 10.1126/science.aal2014
  • Fig. 1 DREAM Olfaction Prediction Challenge.

    (A) Psychophysical data. (B) Chemoinformatic data. (C) DREAM Challenge flowchart. (D) Individual and population challenges. (E) Hypothetical example of psychophysical profile of a stimulus. (F) Connection strength between 21 attributes for all 476 molecules. Width and color of the lines show the normalized strength of the edge. (G) Perceptual variance of 21 attributes across 49 individuals for all 476 molecules at both concentrations sorted by Euclidean distance. Three clusters are indicated by green, blue, and red bars above the matrix. (H) Model Z-scores, best performers at left. (I and J) Correlations of individual (I) or population (J) perception prediction sorted by team rank. The dotted line represents the P < 0.05 significance threshold with respect to random predictions. The performance of four equations for pleasantness prediction suggested by Zarzo (10) [from top to bottom: equations (10, 9, 11, 7, 12)] and of a linear model based on the first seven principal components inspired by Khan et al. (8) are shown.

  • Fig. 2 Predictions of individual perception.

    (A) Example of a random-forest algorithm that utilizes a subset of molecules from the training set to match a semantic descriptor (e.g., garlic) to a subset of molecular features. (B) Example of a regularized linear model. For each perceptual attribute y_i, a linear model utilizes molecular features x_{i,j} weighted by β_i to predict the psychophysical data of 69 hidden test-set molecules, with sparsity enforced by the magnitude of λ. (C) Correlation values of best-performer model across 69 hidden test-set molecules, sorted by Euclidean distance across 21 perceptual attributes and 49 individuals. (D) Correlation values for the average of all models (red dots, mean ± SD), best-performing model (white dots), and best-predicted individual (black dots), sorted by the average of all models. (E) Prediction correlation of the best-performing random-forest model plotted against measured standard deviation of each subject’s perception across 69 hidden test-set molecules for the four indicated attributes. Each dot represents one of 49 individuals. (F) Correlation values between prediction correlation and measured standard deviation for 21 perceptual attributes across 49 individuals, color coded as in (E). The dotted line represents the P < 0.05 significance threshold obtained from shuffling individuals.
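
The regularized linear model in panel (B) can be illustrated with a small sketch. The caption does not give the exact penalty or solver, so the L1 (lasso) objective, the coordinate-descent solver, and the names `soft_threshold` and `fit_lasso` below are assumptions chosen for illustration. The toy feature matrix and ratings are made up and are not taken from the challenge data.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(features, ratings, lam, n_iter=200):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1.

    features: list of rows (one per molecule), each a list of feature values.
    ratings:  list of ratings for one perceptual attribute.
    Assumes features and ratings are already centered (no intercept term).
    """
    n = len(features)
    p = len(features[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for row, y in zip(features, ratings):
                # Prediction that leaves out feature j's current contribution.
                pred_without_j = sum(row[k] * beta[k] for k in range(p) if k != j)
                rho += row[j] * (y - pred_without_j)
                norm += row[j] ** 2
            if norm == 0.0:
                beta[j] = 0.0  # a constant (all-zero) feature carries no signal
                continue
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Hypothetical toy data: 3 molecules, 2 centered features; ratings depend only
# on the first feature.
toy_features = [[-1.0, 0.5], [0.0, -1.0], [1.0, 0.5]]
toy_ratings = [-2.0, 0.0, 2.0]

weights_small_lam = fit_lasso(toy_features, toy_ratings, lam=0.1)
weights_large_lam = fit_lasso(toy_features, toy_ratings, lam=2.0)
```

With a small λ the weight on the informative feature is shrunk only slightly and the uninformative feature gets a weight of zero. With a large λ every weight is driven to zero, which is the sense in which the magnitude of λ enforces sparsity.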

  • Fig. 3 Predictions of population perception.

    (A) Average of correlation of population predictions. Error bars, SDs calculated across models. (B) Ranked prediction correlation for 69 hidden test-set molecules produced by aggregated models (open black circles; gray bars, SD) and the average of all models (solid black dots; black bars, SD). (C to E) Prediction correlation with increasing number of molecular features using random-forest (red) or linear (black) models. Attributes are ordered from top to bottom and left to right by the number of features required to obtain 80% of the maximum prediction correlation using the random-forest model. Plotted are intensity and pleasantness (C), and attributes that required six or fewer (D) or more than six features (E). The combined training + leaderboard set of 407 molecules was randomly partitioned 250 times to obtain error bars for both types of models.
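
The ordering rule in panels (C to E), the number of features required to reach 80% of the maximum prediction correlation, can be sketched as follows. The function name `features_needed` and the example curve are hypothetical. The real curves come from the 250 random partitions described in the caption, and how those were averaged before applying the threshold is not stated here.

```python
def features_needed(correlations, fraction=0.8):
    """Return the smallest feature count whose prediction correlation
    reaches `fraction` of the maximum correlation on the curve.

    correlations: dict mapping number of features -> prediction correlation.
    """
    target = fraction * max(correlations.values())
    for n_features in sorted(correlations):
        if correlations[n_features] >= target:
            return n_features
    # Only reachable when the maximum correlation is negative.
    raise ValueError("no feature count reaches the requested fraction")


# Hypothetical curve for one attribute (values invented for illustration).
example_curve = {1: 0.20, 2: 0.35, 5: 0.50, 10: 0.58, 20: 0.60}

n_needed = features_needed(example_curve)
needs_six_or_fewer = n_needed <= 6  # would place the attribute in panel (D)
```

In this invented example the threshold is 0.8 × 0.60 = 0.48, which is first reached at five features, so the attribute would fall in the "six or fewer" group.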

  • Fig. 4 Quality of predictions.

    (A and B) Community phase predictions for random-forest (A) and linear (B) models using both Morgan and Dragon features for population prediction. The training set was randomly partitioned 250 times to obtain error bars: *P < 0.05, **P < 0.01, ***P < 0.001, corrected for multiple comparisons [false discovery rate (FDR)]. (C) Comparison between correlation coefficients for model predictions and for test-retest for individual perceptual attributes by using the aggregated predictions from linear and random-forest models. Error bars reflect standard error obtained from jackknife resampling of the retested molecules. Linear regression of the model-test correlation coefficients against the test-retest correlation coefficients yields a slope of 0.80 ± 0.02 and a correlation of r = 0.870 (black line) compared with a theoretically optimal model (perfect prediction given intraindividual variability, dashed red line). Only the model-test correlation coefficient for burnt (15) was statistically distinguishable from the corresponding test-retest coefficient (P < 0.05 with FDR correction). (D) Schematic for reverse-engineering a desired sensory profile from molecular features. The model was presented with the experimental sensory profile of a molecule (spider plot, left) and tasked with searching through 69 hidden test-set molecules (middle) to find the best match (right, model prediction in red). Spider plots represent perceptual data for all 21 attributes, with the lowest rating at the center and highest at the outside of the circle. (E) Example where the model selected a molecule with a sensory profile 7th closest to the target, butyric acid. (F) Population prediction quality for the 69 molecules in the hidden test set when all 19 models are aggregated. The overall area under the curve (AUC) for the prediction is 0.83, compared with 0.5 for a random model (gray dashed line) and 1.0 for a perfect model.
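
The reverse-engineering task in panels (D) and (E) amounts to ranking candidate molecules by how closely their predicted profiles match a target profile. The sketch below assumes Pearson correlation as the similarity measure, which the caption does not specify. The helper names, the molecule labels, and the four-attribute profiles are all hypothetical (the study uses 21 attributes and 69 candidates).

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length rating profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rank_candidates(target_profile, predicted_profiles):
    """Return candidate names sorted from best to worst match to the target."""
    return sorted(
        predicted_profiles,
        key=lambda name: pearson(target_profile, predicted_profiles[name]),
        reverse=True,
    )


# Hypothetical measured profile of the target and model-predicted profiles
# for three candidate molecules (values invented for illustration).
target = [80, 10, 5, 60]
predicted = {
    "mol_a": [70, 20, 10, 50],
    "mol_b": [10, 80, 60, 5],
    "mol_c": [40, 40, 40, 45],
}

ranking = rank_candidates(target, predicted)
# 1-based position of a given molecule in the ranking, analogous to the
# "7th closest" example in panel (E).
rank_of_mol_c = ranking.index("mol_c") + 1
```

A rank of 1 for the true molecule corresponds to a perfect retrieval. Collecting these ranks over all targets is one way a summary curve like the one in panel (F) could be built, though the exact AUC construction is not described in the caption.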

Supplementary Materials

  • Predicting human olfactory perception from chemical features of odor molecules

    Andreas Keller, Richard C. Gerkin, Yuanfang Guan, Amit Dhurandhar, Gabor Turu, Bence Szalai, Joel D. Mainland, Yusuke Ihara, Chung Wen Yu, Russ Wolfinger, Celine Vens, Leander Schietgat, Kurt De Grave, Raquel Norel, DREAM Olfaction Prediction Consortium, Gustavo Stolovitzky, Guillermo A. Cecchi, Leslie B. Vosshall, Pablo Meyer

    Materials/Methods, Supplementary Text, Tables, Figures, and/or References

    • Materials and Methods
    • Supplementary Text
    • Figs. S1 to S7
    • Table S1 legend
    • References
    Table S1
    Raw data including prediction scores and methods, correlation values, molecule CIDs, top molecular features, and LVO-0869 data set.
