Research Article

Human-level concept learning through probabilistic program induction

See allHide authors and affiliations

Science  11 Dec 2015:
Vol. 350, Issue 6266, pp. 1332-1338
DOI: 10.1126/science.aab3050

Handwritten characters drawn by a model

Not only do children learn effortlessly, they do so quickly and with a remarkable ability to use what they have learned as the raw material for creating new stuff. Lake et al. describe a computational model that learns in a similar fashion and does so better than current deep learning algorithms. The model classifies, parses, and recreates handwritten characters, and can generate new letters of the alphabet that look “right” as judged by Turing-like tests of the model's output in comparison to what real humans produce.

Science, this issue p. 1332


People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms—for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world’s alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several “visual Turing tests” probing the model’s creative generalization abilities, which in many cases are indistinguishable from human behavior.

Despite remarkable advances in artificial intelligence and machine learning, two aspects of human conceptual knowledge have eluded machine systems. First, for most interesting kinds of natural and man-made categories, people can learn a new concept from just one or a handful of examples, whereas standard algorithms in machine learning require tens or hundreds of examples to perform similarly. For instance, people may only need to see one example of a novel two-wheeled vehicle (Fig. 1A) in order to grasp the boundaries of the new concept, and even children can make meaningful generalizations via “one-shot learning” (13). In contrast, many of the leading approaches in machine learning are also the most data-hungry, especially “deep learning” models that have achieved new levels of performance on object and speech recognition benchmarks (49). Second, people learn richer representations than machines do, even for simple concepts (Fig. 1B), using them for a wider range of functions, including (Fig. 1, ii) creating new exemplars (10), (Fig. 1, iii) parsing objects into parts and relations (11), and (Fig. 1, iv) creating new abstract categories of objects based on existing categories (12, 13). In contrast, the best machine classifiers do not perform these additional functions, which are rarely studied and usually require specialized algorithms. A central challenge is to explain these two aspects of human-level concept learning: How do people learn new concepts from just one or a few examples? And how do people learn such abstract, rich, and flexible representations? An even greater challenge arises when putting them together: How can learning succeed from such sparse data yet also produce such rich representations? For any theory of learning (4, 1416), fitting a more complicated model requires more data, not less, in order to achieve some measure of good generalization, usually the difference in performance between new and old examples. Nonetheless, people seem to navigate this trade-off with remarkable agility, learning rich concepts that generalize well from sparse data.

Fig. 1 People can learn rich concepts from limited data.

(A and B) A single example of a new concept (red boxes) can be enough information to support the (i) classification of new examples, (ii) generation of new examples, (iii) parsing an object into parts and relations (parts segmented by color), and (iv) generation of new concepts from related concepts. [Image credit for (A), iv, bottom: With permission from Glenn Roberts and Motorcycle Mojo Magazine]

This paper introduces the Bayesian program learning (BPL) framework, capable of learning a large class of visual concepts from just a single example and generalizing in ways that are mostly indistinguishable from people. Concepts are represented as simple probabilistic programs— that is, probabilistic generative models expressed as structured procedures in an abstract description language (17, 18). Our framework brings together three key ideas—compositionality, causality, and learning to learn—that have been separately influential in cognitive science and machine learning over the past several decades (1922). As programs, rich concepts can be built “compositionally” from simpler primitives. Their probabilistic semantics handle noise and support creative generalizations in a procedural form that (unlike other probabilistic models) naturally captures the abstract “causal” structure of the real-world processes that produce examples of a category. Learning proceeds by constructing programs that best explain the observations under a Bayesian criterion, and the model “learns to learn” (23, 24) by developing hierarchical priors that allow previous experience with related concepts to ease learning of new concepts (25, 26). These priors represent a learned inductive bias (27) that abstracts the key regularities and dimensions of variation holding across both types of concepts and across instances (or tokens) of a concept in a given domain. In short, BPL can construct new programs by reusing the pieces of existing ones, capturing the causal and compositional properties of real-world generative processes operating on multiple scales.

In addition to developing the approach sketched above, we directly compared people, BPL, and other computational approaches on a set of five challenging concept learning tasks (Fig. 1B). The tasks use simple visual concepts from Omniglot, a data set we collected of multiple examples of 1623 handwritten characters from 50 writing systems (Fig. 2)(see acknowledgments). Both images and pen strokes were collected (see below) as detailed in section S1 of the online supplementary materials. Handwritten characters are well suited for comparing human and machine learning on a relatively even footing: They are both cognitively natural and often used as a benchmark for comparing learning algorithms. Whereas machine learning algorithms are typically evaluated after hundreds or thousands of training examples per class (5), we evaluated the tasks of classification, parsing (Fig. 1B, iii), and generation (Fig. 1B, ii) of new examples in their most challenging form: after just one example of a new concept. We also investigated more creative tasks that asked people and computational models to generate new concepts (Fig. 1B, iv). BPL was compared with three deep learning models, a classic pattern recognition algorithm, and various lesioned versions of the model—a breadth of comparisons that serve to isolate the role of each modeling ingredient (see section S4 for descriptions of alternative models). We compare with two varieties of deep convolutional networks (28), representative of the current leading approaches to object recognition (7), and a hierarchical deep (HD) model (29), a probabilistic model needed for our more generative tasks and specialized for one-shot learning.

Fig. 2 Simple visual concepts for comparing human and machine learning.

525 (out of 1623) character concepts, shown with one example each.

Bayesian Program Learning

The BPL approach learns simple stochastic programs to represent concepts, building them compositionally from parts (Fig. 3A, iii), subparts (Fig. 3A, ii), and spatial relations (Fig. 3A, iv). BPL defines a generative model that can sample new types of concepts (an “A,” “B,” etc.) by combining parts and subparts in new ways. Each new type is also represented as a generative model, and this lower-level generative model produces new examples (or tokens) of the concept (Fig. 3A, v), making BPL a generative model for generative models. The final step renders the token-level variables in the format of the raw data (Fig. 3A, vi). The joint distribution on types ψ, a set of M tokens of that type θ(1), . . ., θ(M), and the corresponding binary images I(1), . . ., I(M) factors as Embedded Image (1)

Fig. 3 A generative model of handwritten characters.

(A) New types are generated by choosing primitive actions (color coded) from a library (i), combining these subparts (ii) to make parts (iii), and combining parts with relations to define simple programs (iv). New tokens are generated by running these programs (v), which are then rendered as raw data (vi). (B) Pseudocode for generating new types ψ and new token images I (m) for m = 1, ..., M. The function f (·, ·) transforms a subpart sequence and start location into a trajectory.

The generative process for types P(ψ) and tokens P(m)|ψ) are described by the pseudocode in Fig. 3B and detailed along with the image model P(I(m)|θ(m)) in section S2. Source code is available online (see acknowledgments). The model learns to learn by fitting each conditional distribution to a background set of characters from 30 alphabets, using both the image and the stroke data, and this image set was also used to pretrain the alternative deep learning models. Neither the production data nor any alphabets from this set are used in the subsequent evaluation tasks, which provide the models with only raw images of novel characters.

Handwritten character types ψ are an abstract schema of parts, subparts, and relations. Reflecting the causal structure of the handwriting process, character parts Si are strokes initiated by pressing the pen down and terminated by lifting it up (Fig. 3A, iii), and subparts si1, ..., sini are more primitive movements separated by brief pauses of the pen (Fig. 3A, ii). To construct a new character type, first the model samples the number of parts κ and the number of subparts ni, for each part i = 1, ..., κ, from their empirical distributions as measured from the background set. Second, a template for a part Si is constructed by sampling subparts from a set of discrete primitive actions learned from the background set (Fig. 3A, i), such that the probability of the next action depends on the previous. Third, parts are then grounded as parameterized curves (splines) by sampling the control points and scale parameters for each subpart. Last, parts are roughly positioned to begin either independently, at the beginning, at the end, or along previous parts, as defined by relation Ri (Fig. 3A, iv).

Character tokens θ(m) are produced by executing the parts and the relations and modeling how ink flows from the pen to the page. First, motor noise is added to the control points and the scale of the subparts to create token-level stroke trajectories S(m). Second, the trajectory’s precise start location L(m) is sampled from the schematic provided by its relation Ri to previous strokes. Third, global transformations are sampled, including an affine warp A(m) and adaptive noise parameters that ease probabilistic inference (30). Last, a binary image I(m) is created by a stochastic rendering function, lining the stroke trajectories with grayscale ink and interpreting the pixel values as independent Bernoulli probabilities.

Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image I(m). Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses. The most promising candidates are refined by using continuous optimization and local search, forming a discrete approximation to the posterior distribution P(ψ, θ(m)|I(m)) (section S3). Figure 4A shows the set of discovered programs for a training image I(1) and how they are refit to different test images I(2) to compute a classification score log P(I(2)|I(1)) (the log posterior predictive probability), where higher scores indicate that they are more likely to belong to the same class. A high score is achieved when at least one set of parts and relations can successfully explain both the training and the test images, without violating the soft constraints of the learned within-class variability model. Figure 4B compares the model’s best-scoring parses with the ground-truth human parses for several characters.

Fig. 4 Inferring motor programs from images.

Parts are distinguished by color, with a colored dot indicating the beginning of a stroke and an arrowhead indicating the end. (A) The top row shows the five best programs discovered for an image along with their log-probability scores (Eq. 1). Subpart breaks are shown as black dots. For classification, each program was refit to three new test images (left in image triplets), and the best-fitting parse (top right) is shown with its image reconstruction (bottom right) and classification score (log posterior predictive probability). The correctly matching test item receives a much higher classification score and is also more cleanly reconstructed by the best programs induced from the training item. (B) Nine human drawings of three characters (left) are shown with their ground truth parses (middle) and best model parses (right).


People, BPL, and alternative models were compared side by side on five concept learning tasks that examine different forms of generalization from just one or a few examples (example task Fig. 5). All behavioral experiments were run through Amazon’s Mechanical Turk, and the experimental procedures are detailed in section S5. The main results are summarized by Fig. 6, and additional lesion analyses and controls are reported in section S6.

Fig. 5 Generating new exemplars.

Humans and machines were given an image of a novel character (top) and asked to produce new exemplars. The nine-character grids in each pair that were generated by a machine are (by row) 1, 2; 2, 1; 1, 1.

Fig. 6 Human and machine performance was compared on (A) one-shot classification and (B) four generative tasks.

The creative outputs for humans and models were compared by the percent of human judges to correctly identify the machine. Ideal performance is 50%, where the machine is perfectly confusable with humans in these two-alternative forced choice tasks (pink dotted line). Bars show the mean ± SEM [N = 10 alphabets in (A)]. The no learning-to-learn lesion is applied at different levels (bars left to right): (A) token; (B) token, stroke order, type, and type.

One-shot classification was evaluated through a series of within-alphabet classification tasks for 10 different alphabets. As illustrated in Fig. 1B, i, a single image of a new character was presented, and participants selected another example of that same character from a set of 20 distinct characters produced by a typical drawer of that alphabet. Performance is shown in Fig. 6A, where chance is 95% errors. As a baseline, the modified Hausdorff distance (32) was computed between centered images, producing 38.8% errors. People were skilled one-shot learners, achieving an average error rate of 4.5% (N = 40). BPL showed a similar error rate of 3.3%, achieving better performance than a deep convolutional network (convnet; 13.5% errors) and the HD model (34.8%)—each adapted from deep learning methods that have performed well on a range of computer vision tasks. A deep Siamese convolutional network optimized for this one-shot learning task achieved 8.0% errors (33), still about twice as high as humans or our model. BPL’s advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here. BPL’s other key ingredients also make positive contributions, as shown by higher error rates for BPL lesions without learning to learn (token-level only) or compositionality (11.0% errors and 14.0%, respectively). Learning to learn was studied separately at the type and token level by disrupting the learned hyperparameters of the generative model. Compositionality was evaluated by comparing BPL to a matched model that allowed just one spline-based stroke, resembling earlier analysis-by-synthesis models for handwritten characters that were similarly limited (34, 35).

The human capacity for one-shot learning is more than just classification. It can include a suite of abilities, such as generating new examples of a concept. We compared the creative outputs produced by humans and machines through “visual Turing tests,” where naive human judges tried to identify the machine, given paired examples of human and machine behavior. In our most basic task, judges compared the drawings from nine humans asked to produce a new instance of a concept given one example with nine new examples drawn by BPL (Fig. 5). We evaluated each model based on the accuracy of the judges, which we call their identification (ID) level: Ideal model performance is 50% ID level, indicating that they cannot distinguish the model’s behavior from humans; worst-case performance is 100%. Each judge (N = 147) completed 49 trials with blocked feedback, and judges were analyzed individually and in aggregate. The results are shown in Fig. 6B (new exemplars). Judges had only a 52% ID level on average for discriminating human versus BPL behavior. As a group, this performance was barely better than chance [t(47) = 2.03, P = 0.048], and only 3 of 48 judges had an ID level reliably above chance. Three lesioned models were evaluated by different groups of judges in separate conditions of the visual Turing test, examining the necessity of key model ingredients in BPL. Two lesions, learning to learn (token-level only) and compositionality, resulted in significantly easier Turing test tasks (80% ID level with 17 of 19 judges above chance and 65% with 14 of 26, respectively), indicating that this task is a nontrivial one to pass and that these two principles each contribute to BPL’s human-like generative proficiency. To evaluate parsing more directly (Fig. 4B), we ran a dynamic version of this task with a different set of judges (N = 143), where each trial showed paired movies of a person and BPL drawing the same character. BPL performance on this visual Turing test was not perfect (59% average ID level; new exemplars (dynamic) in Fig. 6B), although randomizing the learned prior on stroke order and direction significantly raises the ID level (71%), showing the importance of capturing the right causal dynamics for BPL.

Although learning to learn new characters from 30 background alphabets proved effective, many human learners will have much less experience: perhaps familiarity with only one or a few alphabets, along with related drawing tasks. To see how the models perform with more limited experience, we retrained several of them by using two different subsets of only five background alphabets. BPL achieved similar performance for one-shot classification as with 30 alphabets (4.3% and 4.0% errors, for the two sets, respectively); in contrast, the deep convolutional net performed notably worse than before (24.0% and 22.3% errors). BPL performance on a visual Turing test of exemplar generation (N = 59) was also similar on the first set [52% average ID level that was not significantly different from chance t (26) = 1.04, P > 0.05], with only 3 of 27 judges reliably above chance, although performance on the second set was slightly worse [57% ID level; t (31) = 4.35, P < 0.001; 7 of 32 judges reliably above chance]. These results suggest that although learning to learn is important for BPL’s success, the model’s structure allows it to take nearly full advantage of comparatively limited background training.

The human productive capacity goes beyond generating new examples of a given concept: People can also generate whole new concepts. We tested this by showing a few example characters from 1 of 10 foreign alphabets and asking participants to quickly create a new character that appears to belong to the same alphabet (Fig. 7A). The BPL model can capture this behavior by placing a nonparametric prior on the type level, which favors reusing strokes inferred from the example characters to produce stylistically consistent new characters (section S7). Human judges compared people versus BPL in a visual Turing test (N = 117), viewing a series of displays in the format of Fig. 7A, i and iii. The judges had only a 49% ID level on average [Fig. 6B, new concepts (from type)], which is not significantly different from chance [t(34) = 0.45, P > 0.05]. Individually, only 8 of 35 judges had an ID level significantly above chance. In contrast, a model with a lesion to (type-level) learning to learn was successfully detected by judges on 69% of trials in a separate condition of the visual Turing test, and was significantly easier to detect than BPL (18 of 25 judges above chance). Further comparisons in section S6 suggested that the model’s ability to produce plausible novel characters, rather than stylistic consistency per se, was the crucial factor for passing this test. We also found greater variation in individual judges’ comparisons of people and the BPL model on this task, as reflected in their ID levels: 10 of 35 judges had individual ID levels significantly below chance; in contrast, only two participants had below-chance ID levels for BPL across all the other experiments shown in Fig. 6B.

Fig. 7 Generating new concepts.

(A) Humans and machines were given a novel alphabet (i) and asked to produce new characters for that alphabet. New machine-generated characters are shown in (ii). Human and machine productions can be compared in (iii). The four-character grids in each pair that were generated by the machine are (by row) 1, 1, 2; 1, 2. (B) Humans and machines produced new characters without a reference alphabet. The grids that were generated by a machine are 2; 1; 1; 2.

Last, judges (N = 124) compared people and models on an entirely free-form task of generating novel character concepts, unconstrained by a particular alphabet (Fig. 7B). Sampling from the prior distribution on character types P(ψ) in BPL led to an average ID level of 57% correct in a visual Turing test (11 of 32 judges above chance); with the nonparametric prior that reuses inferred parts from background characters, BPL achieved a 51% ID level [Fig. 7B and new concepts (unconstrained) in Fig. 6B; ID level not significantly different from chance t(24) = 0.497, P > 0.05; 2 of 25 judges above chance]. A lesion analysis revealed that both compositionality (68% and 15 of 22) and learning to learn (64% and 22 of 45) were crucial in passing this test.


Despite a changing artificial intelligence landscape, people remain far better than machines at learning new concepts: They require fewer examples and use their concepts in richer ways. Our work suggests that the principles of compositionality, causality, and learning to learn will be critical in building machines that narrow this gap. Machine learning and computer vision researchers are beginning to explore methods based on simple program induction (3641), and our results show that this approach can perform one-shot learning in classification tasks at human-level accuracy and fool most judges in visual Turing tests of its more creative abilities. For each visual Turing test, fewer than 25% of judges performed significantly better than chance.

Although successful on these tasks, BPL still sees less structure in visual concepts than people do. It lacks explicit knowledge of parallel lines, symmetry, optional elements such as cross bars in “7”s, and connections between the ends of strokes and other strokes. Moreover, people use their concepts for other abilities that were not studied here, including planning (42), explanation (43), communication (44), and conceptual combination (45). Probabilistic programs could capture these richer aspects of concept learning and use, but only with more abstract and complex structure than the programs studied here. More-sophisticated programs could also be suitable for learning compositional, causal representations of many concepts beyond simple perceptual categories. Examples include concepts for physical artifacts, such as tools, vehicles, or furniture, that are well described by parts, relations, and the functions these structures support; fractal structures, such as rivers and trees, where complexity arises from highly iterated but simple generative processes; and even abstract knowledge, such as natural number, natural language semantics, and intuitive physical theories (17, 4648).

Capturing how people learn all these concepts at the level we reached with handwritten characters is a long-term goal. In the near term, applying our approach to other types of symbolic concepts may be particularly promising. Human cultures produce many such symbol systems, including gestures, dance moves, and the words of spoken and signed languages. As with characters, these concepts can be learned to some extent from one or a few examples, even before the symbolic meaning is clear: Consider seeing a “thumbs up,” “fist bump,” or “high five” or hearing the names “Boutros Boutros-Ghali,” “Kofi Annan,” or “Ban Ki-moon” for the first time. From this limited experience, people can typically recognize new examples and even produce a recognizable semblance of the concept themselves. The BPL principles of compositionality, causality, and learning to learn may help to explain how.

To illustrate how BPL applies in the domain of speech, programs for spoken words could be constructed by composing phonemes (subparts) systematically to form syllables (parts), which compose further to form morphemes and entire words. Given an abstract syllable-phoneme parse of a word, realistic speech tokens can be generated from a causal model that captures aspects of speech motor articulation. These part and subpart components are shared across the words in a language, enabling children to acquire them through a long-term learning-to-learn process. We have already found that a prototype model exploiting compositionality and learning to learn, but not causality, is able to capture some aspects of the human ability to learn and generalize new spoken words (e.g., English speakers learning words in Japanese) (49). Further progress may come from adopting a richer causal model of speech generation, in the spirit of classic “analysis-by-synthesis” proposals for speech perception and language comprehension (20, 50).

Although our work focused on adult learners, it raises natural developmental questions. If children learning to write acquire an inductive bias similar to what BPL constructs, the model could help explain why children find some characters difficult and which teaching procedures are most effective (51). Comparing children’s parsing and generalization behavior at different stages of learning and BPL models given varying background experience could better evaluate the model’s learning-to-learn mechanisms and suggest improvements. By testing our classification tasks on infants who categorize visually before they begin drawing or scribbling (52), we can ask whether children learn to perceive characters more causally and compositionally based on their own proto-writing experience. Causal representations are prewired in our current BPL models, but they could conceivably be constructed through learning to learn at an even deeper level of model hierarchy (53).

Last, we hope that our work may shed light on the neural representations of concepts and the development of more neurally grounded learning models. Supplementing feedforward visual processing (54), previous behavioral studies and our results suggest that people learn new handwritten characters in part by inferring abstract motor programs (55), a representation grounded in production yet active in purely perceptual tasks, independent of specific motor articulators and potentially driven by activity in premotor cortex (5658). Could we decode representations structurally similar to those in BPL from brain imaging of premotor cortex (or other action-oriented regions) in humans perceiving and classifying new characters for the first time? Recent large-scale brain models (59) and deep recurrent neural networks (6062) have also focused on character recognition and production tasks—but typically learning from large training samples with many examples of each concept. We see the one-shot learning capacities studied here as a challenge for these neural models: one we expect they might rise to by incorporating the principles of compositionality, causality, and learning to learn that BPL instantiates.

Supplementary Materials

Materials and Methods

Supplementary Text

Figs. S1 to S11

References (6380)

References and Notes

  1. Acknowledgments: This work was supported by a NSF Graduate Research Fellowship to B.M.L.; the Center for Brains, Minds, and Machines funded by NSF Science and Technology Center award CCF-1231216; Army Research Office and Office of Naval Research contracts W911NF-08-1-0242, W911NF-13-1-2012, and N000141310333; the Natural Sciences and Engineering Research Council of Canada; the Canadian Institute for Advanced Research; and the Moore-Sloan Data Science Environment at NYU. We thank J. McClelland, T. Poggio and L. Schulz for many valuable contributions and N. Kanwisher for helping us elucidate the three key principles. We are grateful to J. Gross and the encyclopedia of writing systems for helping to make this data set possible. Online archives are available for visual Turing tests (, Omniglot data set (, and BPL source code (
View Abstract

Stay Connected to Science

Navigate This Article