Research Article

A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs

See allHide authors and affiliations

Science  08 Dec 2017:
Vol. 358, Issue 6368, eaag2612
DOI: 10.1126/science.aag2612

Computer or human?

Proving that we are human is now part of many tasks that we do on the internet, such as creating an email account, voting in an online poll, or even downloading a scientific paper. One of the most popular tests is text-based CAPTCHA, where would-be users are asked to decipher letters that may be distorted, partially obscured, or shown against a busy background. This test is used because computers find it tricky, but (most) humans do not. George et al. developed a hierarchical model for computer vision that was able to solve CAPTCHAs with a high accuracy rate using comparatively little training data. The results suggest that moving away from text-based CAPTCHAs, as some online services have done, may be a good idea.

Science, this issue p. eaag2612

Structured Abstract


Compositionality, generalization, and learning from a few examples are among the hallmarks of human intelligence. CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), images used by websites to block automated interactions, are examples of problems that are easy for people but difficult for computers. CAPTCHAs add clutter and crowd letters together to create a chicken-and-egg problem for algorithmic classifiers—the classifiers work well for characters that have been segmented out, but segmenting requires an understanding of the characters, which may be rendered in a combinatorial number of ways. CAPTCHAs also demonstrate human data efficiency: A recent deep-learning approach for parsing one specific CAPTCHA style required millions of labeled examples, whereas humans solve new styles without explicit training.

By drawing inspiration from systems neuroscience, we introduce recursive cortical network (RCN), a probabilistic generative model for vision in which message-passing–based inference handles recognition, segmentation, and reasoning in a unified manner. RCN learns with very little training data and fundamentally breaks the defense of modern text-based CAPTCHAs by generatively segmenting characters. In addition, RCN outperforms deep neural networks on a variety of benchmarks while being orders of magnitude more data-efficient.


Modern deep neural networks resemble the feed-forward hierarchy of simple and complex cells in the neocortex. Neuroscience has postulated computational roles for lateral and feedback connections, segregated contour and surface representations, and border-ownership coding observed in the visual cortex, yet these features are not commonly used by deep neural nets. We hypothesized that systematically incorporating these findings into a new model could lead to higher data efficiency and generalization. Structured probabilistic models provide a natural framework for incorporating prior knowledge, and belief propagation (BP) is an inference algorithm that can match the cortical computational speed. The representational choices in RCN were determined by investigating the computational underpinnings of neuroscience data under the constraint that accurate inference should be possible using BP.


RCN was effective in breaking a wide variety of CAPTCHAs with very little training data and without using CAPTCHA-specific heuristics. By comparison, a convolutional neural network required a 50,000-fold larger training set and was less robust to perturbations to the input. Similar results are shown on one- and few-shot MNIST (modified National Institute of Standards and Technology handwritten digit data set) classification, where RCN was significantly more robust to clutter introduced during testing. As a generative model, RCN outperformed neural network models when tested on noisy and cluttered examples and generated realistic samples from one-shot training of handwritten characters. RCN also proved to be effective at an occlusion reasoning task that required identifying the precise relationships between characters at multiple points of overlap. On a standard benchmark for parsing text in natural scenes, RCN outperformed state-of-the-art deep-learning methods while requiring 300-fold less training data.


Our work demonstrates that structured probabilistic models that incorporate inductive biases from neuroscience can lead to robust, generalizable machine learning models that learn with high data efficiency. In addition, our model’s effectiveness in breaking text-based CAPTCHAs with very little training data suggests that websites should seek more robust mechanisms for detecting automated interactions.

Breaking CAPTCHAs using a generative vision model.

Text-based CAPTCHAs exploit the data efficiency and generative aspects of human vision to create a challenging task for machines. By handling recognition and segmentation in a unified way, our model fundamentally breaks the defense of text-based CAPTCHAs. Shown are the parses by our model for a variety of CAPTCHAs .


Learning from a few examples and generalizing to markedly different situations are capabilities of human visual intelligence that are yet to be matched by leading machine learning models. By drawing inspiration from systems neuroscience, we introduce a probabilistic generative model for vision in which message-passing–based inference handles recognition, segmentation, and reasoning in a unified way. The model demonstrates excellent generalization and occlusion-reasoning capabilities and outperforms deep neural networks on a challenging scene text recognition benchmark while being 300-fold more data efficient. In addition, the model fundamentally breaks the defense of modern text-based CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) by generatively segmenting characters without CAPTCHA-specific heuristics. Our model emphasizes aspects such as data efficiency and compositionality that may be important in the path toward general artificial intelligence.

The ability to learn and generalize from a few examples is a hallmark of human intelligence (1). CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), images used by websites to block automated interactions, are examples of problems that are easy for humans but difficult for computers. CAPTCHAs are hard for algorithms because they add clutter and crowd letters together to create a chicken-and-egg problem for character classifiers—the classifiers work well for characters that have been segmented out, but segmenting the individual characters requires an understanding of the characters, each of which might be rendered in a combinatorial number of ways (25). A recent deep-learning approach for parsing one specific CAPTCHA style required millions of labeled examples from it (6), and earlier approaches mostly relied on handcrafted style-specific heuristics to segment out the character (3, 7), whereas humans can solve new styles without explicit training (Fig. 1A). The wide variety of ways in which letterforms could be rendered and still be understood by people is illustrated in Fig. 1.

Fig. 1 Flexibility of letterform perception in humans.

(A) Humans are good at parsing unfamiliar CAPTCHAs. (B) The same character shape can be rendered in a wide variety of appearances, and people can detect the “A” in these images regardless. (C) Common sense and context affect letterform perception: (i) m versus u and n. (ii) The same line segments are interpreted as N or S depending on occluder positions. (iii) Perception of the shapes aids the recognition of “b,i,s,o,n” and “b,i,k,e.”


Building models that generalize well beyond their training distribution is an important step toward the flexibility Douglas Hofstadter envisioned when he said that “for any program to handle letterforms with the flexibility that human beings do, it would have to possess full-scale artificial intelligence” (8). Many researchers have conjectured that this could be achieved by incorporating the inductive biases of the visual cortex (912) using the wealth of data generated by neuroscience and cognitive science research. In the mammalian brain, feedback connections in the visual cortex play roles in figure-ground segmentation and in object-based top-down attention that isolates the contours of an object even when partially transparent objects occupy the same spatial locations (1316). Lateral connections in the visual cortex are implicated in enforcing contour continuity (17, 18). Contours and surfaces are represented using separate mechanisms that interact (1921), enabling the recognition and imagination of objects with unusual appearance—for example, a chair made of ice. The timing and topography of cortical activations give clues about contour-surface representations and inference algorithms (22, 23). These insights based on cortical function are yet to be incorporated into leading machine learning models.

We introduce a hierarchical model called the Recursive Cortical Network (RCN) that incorporates these neuroscience insights in a structured probabilistic generative model framework (5, 2427). In addition to developing RCN and its learning and inference algorithms, we applied the model to a variety of visual cognition tasks that required generalizing from one or a few training examples: parsing of CAPTCHAs, one-shot and few-shot recognition and generation of handwritten digits, occlusion reasoning, and scene text recognition. We then compared its performance to state-of-the-art models.

Recursive cortical network

RCN builds on existing compositional models (24, 2832) in important ways [section 6 of (33)]. Although grammar-based models (24) have the advantage of being based on well-known ideas from linguistics, they either limit interpretations to single trees or are computationally infeasible when using attributed relations (32). The seminal work on AND-OR templates and tree-structured compositional models (34) has the advantage of simplified inference but is lacking in selectivity owing to the absence of lateral constraints (35). Models from another important class (25, 29) use lateral constraints, but rather than gradually building invariance through a pooling structure (36), they use parametric transformations for complete scale, rotation, and translation invariance at each level. Custom inference algorithms are required, but those are not effective in propagating the effect of lateral constraints beyond local interactions. The representation of contours and surfaces in (37) does not model their interactions, choosing instead to model these as independent mechanisms. RCNs and composition machines (CMs) (32) share the motivation of placing compositional model ideas in a graphical model formulation. However, CM’s representational choice of “composed distributions”—using a single layer of random variables to collapse feature detection, pooling, and lateral coordination—leads to an expanded state space, which in turn constrains the model to a greedy inference and parsing process. In general, because of the varied and conflicting representational choices, inference in compositional models has relied on custom-crafted methods for different model instantiations, including solving stochastic partial differential equations (30), sampling-based algorithms (24), and pruned dynamic programming (DP) (29).

RCN integrates and builds upon various ideas from compositional models—hierarchical composition, gradual building of invariances, lateral connections for selectivity, contour-surface factorization, and joint-explanation–based parsing—into a structured probabilistic graphical model such that Belief Propagation (38) can be used as the primary approximate inference engine [section 6 of (33)]. Experimental neuroscience data provided important guidance on the representational choices [section 7 of (33)], which were then confirmed to be beneficial using experimental studies. We now discuss the representation of RCN and its inference and learning algorithms. Mathematical details are discussed in sections 2 to 5 of (33).


In RCN, objects are modeled as a combination of contours and surfaces (Fig. 2A). Contours appear at the boundaries of surfaces, both at the outline of objects and at the border between the surfaces that compose the object. Surfaces are modeled using a conditional random field (CRF), which captures the smoothness of variations of surface properties. Contours are modeled using a compositional hierarchy of features (28, 39). Factored representation of contours (shape) and surfaces (appearance) enables the model to recognize object shapes with markedly different appearances without training exhaustively on every possible shape and appearance combination. We now describe the shape and appearance representations in detail.

Fig. 2 Structure of the RCN.

(A) A hierarchy generates the contours of an object, and a CRF generates its surface appearance. (B) Two subnetworks at the same level of the contour hierarchy keep separate lateral connections by making parent-specific copies of child features and connecting them with parent-specific laterals; nodes within the green rectangle are copies of the feature marked “e.” (C) A three-level RCN representing the contours of a square. Features at level 2 represent the four corners, and each corner is represented as a conjunction of four line-segment features. (D) Four-level network representing an “A.”

Figure 2B shows two subnetworks (black and blue) within a level of the RCN contour hierarchy. The filled and empty circular nodes in the graph are binary random variables that correspond to features and pools, respectively. Each feature node encodes an AND relation of its child pools, and each pool variable encodes the OR of its child features, similar to AND-OR graphs (34). Lateral constraints, represented as rectangular “factor nodes,” coordinate the choices between the pools they connect to. The two subnetworks, which can correspond to two objects or object parts, share lower-level features.

Figure 2C shows a three-level network that represents the contours of a square. The features at the lowest, intermediate, and top levels represent line segments, corners, and the entire square, respectively. Each pool variable pools over different deformations, small translations, scale changes, etc., of a “centered” feature, thus introducing the corresponding invariances. Without the lateral connections between the pools (the gray squares in Fig. 2C), generating from a feature node representing a corner can create misaligned line segments, as shown in Fig. 3A. Lateral connections between the pools provide selectivity (35) by ensuring that the choice of a feature in one pool affects the choice of features in pools it is connected to, creating samples where the contours vary more smoothly. The flexibility of lateral constraints is controlled through perturb-factor, a hyperparameter that is specified per level. Through multiple layers of feature pooling, lateral connections, and compositions, a feature node at the top level comes to represent an object that can be recognized with some level of translation, scale, and deformation invariance.

Fig. 3 Samples from RCN.

(A) Samples from a corner feature with and without lateral connections. (B) Samples from character “A” for different deformability settings, determined by pooling and lateral perturb-factors, in a three-level hierarchy similar to Fig. 2D, where the lowest-level features are edges. Column 2 shows a balanced setting where deformability is distributed between the levels to produce local deformations and global translations. The other columns show some extreme configurations. (C) Contour-to-surface CRF interaction for a cube. Green factors, foreground-to-background edges; blue, within-object edges. (D) Different surface-appearance samples for the cubical shape in (C) [see section 3 of (33) for CRF parameters].

Multiple objects are represented in the same shape hierarchy by sharing their parts (Fig. 2B). When multiple parents converge on a single child feature (feature node “e” in Fig. 2B), this will be active when any parent is active (OR gate in the graphical model), and the child feature is allowed to be part of both parents if evidence allows, unlike the exclusive sharing in AND-OR graph grammars (24). Even when two higher-level features share some of the same lower-level features and pools, the higher-level features’ lateral networks are kept separate by making copies of the lower-level feature for each specific higher-level feature it participates in, as shown in Fig. 2B. Parent-specific copies of lateral networks serve to achieve higher-order interactions compared to pairwise connections, similar to the state copying mechanism used in higher-order networks (40). This was also found to be important for message-passing to achieve accurate results and is reminiscent of techniques used in dual decomposition (41). Hierarchy in the RCN network plays two roles. First, it enables the representation of deformations gradually through multiple levels, spreading the amount of variation across layers (Fig. 3B). Second, hierarchy provides efficiency through the sharing of features between different objects (42). Both of these result in efficient learning and inference through shared computations.

Surfaces are modeled using a pairwise CRF (Fig. 3C). Local surface patch properties, such as color, texture, or surface normal, are represented by categorical variables, whose smoothness of variation is enforced by the lateral factors (gray squares in Fig. 2). Contours generated by the contour hierarchy interact with the surface CRF in a specific way: Contours signal the breaks in continuity of surfaces that occur both within an object and between the object and its background, a representational choice inspired by neurobiology (19). Figure 3 (B and D) shows samples generated from an RCN.


To parse a scene, RCN maintains hierarchical graphs for multiple object instances at multiple locations tiling the scene. The parse of a scene can be obtained via maximum a posteriori (MAP) inference on this complex graph, which recovers the best joint configuration, including object identities and their segmentations [section 4 of (33)]. Although the RCN network is extremely loopy, we found that message-passing (38), with a schedule inspired by the timing of activations in the visual cortex (9, 20), resulted in fast and accurate inference. An input image is first passed through PreProc, which converts pixel values to edge likelihoods using a bank of Gabor-like filters. Partial assignments that correspond to object hypotheses are then identified using a forward and backward message-passing in the network, and a complete approximate MAP solution is found by solving the scene-parsing problem on the graph of object hypotheses (Fig. 4). The forward pass gives an upper bound on the log-probability of the nodes at the top level. The backward pass visits the high-scoring forward-pass hypotheses one by one, in a manner similar to a top-down attention process (43, 44), running a conditional inference that assumes that all other nodes are off to find an approximate MAP configuration for the object (Fig. 4A). The backward pass can reject many object hypotheses that were falsely identified in the forward pass.

Fig. 4 Inference and learning.

(A) (i) Forward pass, including lateral propagation, produces hypotheses about the multiple letters present in the input image. PreProc is a bank of Gabor-like filters that convert from pixels to edge likelihoods [section 4.2 of (33)]. (ii) Backward pass and lateral propagation creates the segmentation mask for a selected forward-pass hypothesis, here the letter “A” [section 4.4 of (33)]. (iii) A false hypothesis “V” is hallucinated at the intersection of “A” and “K”; false hypotheses are resolved via parsing [section 4.7 of (33)]. (iv) Multiple hypotheses can be activated to produce a joint explanation that involves explaining away and occlusion reasoning. (B) Learning features at the second feature level. Colored circles represent feature activations. The dotted circle is a proposed feature [see text and section 5 of (33)]. (C) Learning of laterals from contour adjacency (see text).

The global MAP configuration is a subset of all the object hypotheses generated from the forward and backward passes. The number of objects in the scene is inferred as part of this MAP solution. In addition to searching over an exponentially large number of subsets, finding the global MAP requires reasoning about high-order interactions between different hypotheses. We developed an approximate DP method that solves this in linear time. The DP algorithm exploits the fact that each object hypothesis occupies a contiguous region that can be represented as a two-dimensional (2D) mask on the input image. By considering combinations of object hypotheses, that is, parses, that produce spatially contiguous masks when their 2D masks overlap, we create a topological ordering of the parses by sorting them according to masks that are contained in other masks. This results in a recursive computation of the score where only a linear number of candidate parses need to be evaluated in searching for the best parse. See section 4.7 of (33) for more details.


Features and lateral connections up to the penultimate level of the network are trained unsupervised using a generic 3D object data set that is task-agnostic and rendered only as contour images. The resulting learned features vary from simple line segments at the lower levels to curves and corners at the higher levels.

Consider a partially learned model, where new features are being learned at level k, where features up to level k − 1 have already been learned and finalized, and a few features have been learned at level k (Fig. 4B). When a training image is presented, the first step is to find a MAP explanation for the contours of that image using the existing features at level k. This is identical to the inference problem described earlier of finding the MAP solution for a scene. The contours that remain unexplained are parsed using features at level k − 1, and new features are proposed from their contour-continuous conjunctions. Repeating this process for all the training images accumulates counts on the usage of different features at level k, and the final features for this level are selected by optimizing an objective function that balances compression and reconstruction error (31). The same process is repeated level by level [section 5.1 of (33)].

The lateral graph structure, which specifies the connectivity between pool pairs, is learned from the contour connectivity of input images. At the first pooling level, pools with features that are adjacent in the input contours are connected with each other. This process is repeated recursively in the hierarchy where lateral connections at the higher levels are inferred from adjacency in the lower-level graphs.

Features at the topmost level represent whole objects. These are obtained by finding the MAP configuration of a new object up to the penultimate level of the network, connecting pool pairs at the penultimate level according to the contour continuity of the input object, and then storing the conjunction of activations at the penultimate level as a feature in the topmost level. See section 5 of (33) for details.

Once the set of lower-level features and lateral connections are trained, they can be used for different domains by tuning a few hyperparameters [section 8.3 of (33)]. The filter scales in the PreProc are chosen depending on the image and object size, and the flexibility of the lateral connections is set to match the distortions in the data. In addition, the features at the lowest level have a “smoothing parameter” that sets an estimate on the probability that an edge pixel is ON owing to noise. This parameter can be set according to the noise levels in a domain.


A CAPTCHA is considered broken if it can be automatically solved at a rate above 1% (3). RCN was effective in breaking a wide variety of text-based CAPTCHAs with very little training data and without using CAPTCHA-specific heuristics (Fig. 5). It was able to solve reCAPTCHAs at an accuracy rate of 66.6% (character level accuracy of 94.3%), BotDetect at 64.4%, Yahoo at 57.4%, and PayPal at 57.1%, significantly above the 1% rate at which CAPTCHAs are considered ineffective (3). The only differences in architecture across different CAPTCHA tasks are the sets of clean fonts used for training and the different choices of a few hyperparameters, which depend on the size of the CAPTCHA image and the amount of clutter and deformations. These parameters are straightforward to set by hand or can be tuned automatically via cross-validation on an annotated CAPTCHA set. Noisy, cluttered, and deformed examples from the CAPTCHAs were not used for training, yet RCN was effective in generalizing to those variations.

Fig. 5 Parsing CAPTCHAs with RCN.

(A) Representative reCAPTCHA parses showing top two solutions, their segmentations, and labels by two different Amazon Mechanical Turk workers. (B) Word accuracy rates of RCN and CNN on the control CAPTCHA data set. CNN is brittle and RCN is robust when character spacing is changed. (C) Accuracies for different CAPTCHA styles. (D) Representative BotDetect parses and segmentations (indicated by the different colors).

For reCAPTCHA parsing at 66.6% accuracy, RCN required only five clean training examples per character. The model uses three parameters that affect how single characters are combined together to read out a string of characters, and these parameters were both independent of the length of the CAPTCHAs and were robust to the spacing of the characters [Fig. 5B and section 8.4 of (33)]. In addition to obtaining a transcription of the CAPTCHA, the model also provides a highly accurate segmentation into individual characters, as shown in Fig. 5A. To compare, human accuracy on reCAPTCHA is 87.4%. Because many input images have multiple valid interpretations (Fig. 5A), parses from two humans agree only 81% of the time.

In comparison to RCNs, a state-of-the-art convolutional neural network (CNN) (6) required a 50,000-fold larger training set of actual CAPTCHA strings, and it was less robust to perturbations to the input. Because the CNN required a large number of labeled examples, this control study used a CAPTCHA generator that we created to emulate the appearance of reCAPTCHAs [section 8.4.3 of (33)]. The approach used a bank of position-specific CNNs, each trained to discriminate the letter at a particular position. Training the CNNs to achieve a word accuracy rate of 89.9% required more than 2.3 million unique training images, created using translated crops for data augmentation, from 79,000 distinct CAPTCHA words. The resulting network fails on string lengths not present during training, and more importantly, the recognition accuracy of the network deteriorates rapidly with even minor perturbations to the spacing of characters that are barely perceptible to humans—15% more spacing reduced accuracy to 38.4%, and 25% more spacing reduced accuracy to just 7%. This suggests that the deep-learning method learned to exploit the specifics of a particular CAPTCHA, rather than learning models of characters, which are then used for parsing the scene. For RCN, increasing the spacing of the characters results in an improvement in the recognition accuracy (Fig. 5B).

The wide variety of character appearances in BotDetect (Fig. 5C) demonstrates why the factorization of contours and surfaces is important: Models without this factorization could latch on to the specific appearance details of a font, thereby limiting their generalization. The RCN results are based on testing on 10 different styles of CAPTCHAs from BotDetect, all parsed on the basis of a single network trained on 24 training examples per character and using the same parsing parameters across all styles. Although BotDetect CAPTCHAs can be parsed with contour information alone, using the appearance information boosted the accuracy from 61.8 to 64.4%, using the same appearance model across all data sets. See section 8.4.6 of (33) for more details.

RCN outperformed other models on one-shot and few-shot classification tasks on the standard MNIST (modified National Institute of Standards and Technology handwritten digit data set) [section 8.7 of (33)]. We compared RCN’s classification performance on MNIST as we varied the number of training examples from 1 to 100 per category. CNN comparisons were made with two state-of-the art models, a LeNet-5 (45) and the VGG-fc6 CNN (46), with its levels pretrained for ImageNet (47) classification using millions of images. The fully connected layer fc6 of VGG-CNN was chosen for comparison because it gave the best results for this task compared to other pretrained levels of the VGG-CNN and compared to other pretrained CNNs that used the same data set and edge preprocessing as RCN [section 5.1 of (33)]. In addition, we compared against the compositional patch model (CPM) (48) that recently reported state-of-the-art performance on this task. RCN outperformed the CNNs and the CPM (Fig. 6A). The one-shot recognition performance of RCN was 76.6% versus 68.9% for CPM and 54.2% for VGG-fc6. RCN was also robust to different forms of clutter that were introduced during testing, without having to expose the network to those transformations during training. In comparison, such out-of-sample test examples had a large detrimental effect on the generalization performance of CNNs (Fig. 6B). To isolate the contributions of lateral connections, forward pass, and backward pass to RCN’s accuracy, we conducted lesion studies that selectively turned off these mechanisms. The results, summarized in Fig. 6C, show that all these mechanisms contribute significantly toward the performance of RCNs. RCN networks with two levels of feature detection and pooling were sufficient to obtain the best accuracy performance on character parsing tasks. The effect of increasing the number of levels in the hierarchy is to reduce the inference time as detailed in section 8.11 of (33).

Fig. 6 MNIST classification results for training with a few examples.

(A) MNIST classification accuracy for RCN, CNN, and CPM. (B) Classification accuracy on corrupted MNIST tests. Legends show the total number of training examples. (C) MNIST classification accuracy for different RCN configurations.

As a generative model, RCN outperformed variational autoencoders (VAEs) (49) and deep recurrent attentive writer (DRAW) (50) on reconstructing corrupted MNIST images (Fig. 7, A and B). DRAW’s advantage over RCN for the clean test set is not surprising because DRAW is learning an overly flexible model that almost copies the input image in the reconstruction, which hurts its performance on more cluttered data sets [section 8.9 of (33)]. On the Omniglot data (1), examples generated from RCN after one-shot training showed significant variations, while still being identifiable as the original category [Fig. 7D and section 8.6 of (33)].

Fig. 7 Generation, occlusion reasoning, and scene-text parsing with RCN.

Examples of reconstructions (A) and reconstruction error (B) from RCN, VAE, and DRAW on corrupted MNIST. Legends show the number of training examples. (C) Occlusion reasoning. The third column shows the edges remaining after RCN explains away the edges of the first detected object. Ground-truth masks reflect the occlusion relationships between the square and the digit. The portions of the digit that are in front of the square are colored brown and the portions that are behind the square are colored orange. The last column shows the predicted occlusion mask. (D) One-shot generation from Omniglot. In each column, row 1 shows the training example and the remaining rows show generated samples. (E) Examples of ICDAR images successfully parsed by RCN. The yellow outlines show segmentations.

To test occlusion reasoning (5153), we created a variant of the MNIST data set by adding a rectangle to each validation/test image such that some parts of the digit were occluded by the rectangle and some parts of the rectangle were occluded by the digit [Fig. 7C and section 8.8 of (33)]. Occlusion relationships in these images cannot be deduced as a simple layering of one object in front of the other. Classification on this data set is challenging because many parts of the digit are occluded by the rectangle, and because the rectangle acts as clutter. If the rectangle is detected and segmented out, its effect on the evidence for a particular digit can be explained away using the RCN generative model, thereby improving the accuracy of classification and segmentation. RCN was tested for classification accuracy and for occlusion reasoning on this challenging data set. Classification accuracy without explaining away was 47.0%. Explaining away the rectangle boosts the classification accuracy to 80.7%. In addition, RCN was used to parse the scene by reasoning about the occlusion relation between the rectangle and the digit. The model was successful at predicting the precise occlusion relations of the test image as shown in Fig. 7C, obtaining a mean intersection over union of 0.353 measured over the occluded regions.

Last, RCN was tested on the data set from the 2013 International Conference on Document Analysis and Recognition, known as ICDAR-13 (54), a benchmark for text recognition in real-world images (Fig. 7E). For this test, we enhanced the parsing algorithm to include previous knowledge about n-gram and word statistics, and about geometric priors related to the layout of letters in a scene, which includes spacing, relative sizes, and appearance consistency [section 8.5 of (33)]. We compared our result against top participants of the ICDAR competition and against a recent deep learning approach (55) (Table 1). The RCN model outperformed the top contender, PhotoOCR, by 1.9%, despite PhotoOCR using 7.9 million training images, whereas RCN used 1406 training images selected using model-based clustering from 25,584 font images. RCN achieved better accuracy on this task while being 300 times more data efficient, in addition to providing a detailed segmentation of the characters (Fig. 7E) that the competing methods do not provide.

Table 1 Accuracy and number of training images for different methods on the ICDAR-13 robust reading data set.
View this table:


Segmentation resistance, the primary defense of text-based CAPTCHAs, has been a general principle that enabled their automated generation (2, 3). Although specific CAPTCHAs have been broken before using style-specific segmentation heuristics (3, 7), those attacks could be foiled easily by minor alterations to CAPTCHAs. RCN breaks the segmentation defense in a fundamental way and with very little training data, which suggests that websites should move to more robust mechanisms for blocking bots.

Compositional models have been successfully used in the past for generic object recognition and scene parsing, and our preliminary experiments [section 8.12 of (33)] indicate that RCN could be applicable in those domains as well (Fig. 8). The RCN formulation opens up compositional models to a wider array of advanced inference and learning algorithms developed in graphical models, potentially leading to improvements that build on their previous successes in real-world scene parsing (56, 57). Despite the advantage of being a generative model, RCN needs several improvements to achieve superior performance on ImageNet-scale (47) data sets. Flexible merging of multiple instances, the use of surface appearance at all levels of the hierarchy during forward and backward inference, more sophisticated pooling structures that learn to pool over 3D transformations, and generative modeling of scene context and background are problems that need to be investigated and integrated with RCN [section 8.13 of (33)].

Fig. 8 Application of RCN to parsing scenes with objects.

Shown are the detections and instance segmentations obtained when RCN was applied to a scene-parsing task with multiple real-world objects in cluttered scenes on random backgrounds. Our experiments suggest that RCN could be generalized beyond text parsing [see section 8.12 of (33) and Discussion].

The high data efficiency of RCN, compared to whole-image models, such as CNNs and VAEs, derives from the fact that RCN encodes strong assumptions in its structure. Recent neural network models incorporate ideas of compositionality using a spatial attention window (58), but their current instantiations need good separation between the objects in an uncluttered setting because each attention window is modeled using a whole-image VAE. Incorporation of RCN’s object- and part-based compositionality into neural network models would be an interesting research direction. Unlike neural networks, the current version of RCN learning algorithms need clean training data, a drawback we intend to address using gradient-based learning as well as message-passing–based approaches (59).

Combining RCN with Bayesian program learning (BPL) (1) is another avenue for future investigations. BPL has the advantage of precisely modeling the sequential causal mechanisms (for example, the stroke generation in the Omniglot data set), but its inference depends on the contours being separated from the background—something RCN can easily provide. More generally, BPL- and RCN-like graphical models could be combined to obtain the expressive power and efficient inference required to model the parallel and sequential processes (60) involved in perception and cognition.

Of course, Douglas Hofstadter’s challenge—understanding letterforms with the same efficiency and flexibility of humans—still stands as a grand goal for artificial intelligence. People use a lot more commonsense knowledge, in context-sensitive and dynamic ways, when they identify letterforms (Fig. 1C, iii). Our work suggests that incorporating inductive biases from systems neuroscience can lead to robust, generalizable machine learning models that demonstrate high data efficiency. We hope that this work inspires improved models of cortical circuits (61, 62) and investigations that combine the power of neural networks and structured probabilistic models toward general artificial intelligence systems.

Methods summary

For reCAPTCHA experiments, we downloaded 5500 reCAPTCHA images from reCAPTCHA page, of which 500 were used as validation set for parameter tuning, and accuracy numbers are reported on the remaining 5000. The images were scaled up by a factor of 2. A similar-looking font to those used in reCAPTCHA, Georgia, was identified by visual comparison from the fonts available on the local system. RCN was trained on a few rotations of the lowercase and uppercase characters from this font. Hyperparameters were optimized using the validation set. Human accuracy on the reCAPTCHA data set was estimated using Amazon Mechanical Turk (AMT) U.S.-based workers.

Emulated reCAPTCHA data sets, used to train the neural network for control experiments, were created using ImageMagick to produce distortions that are qualitatively similar to the original reCAPTCHA. The emulated data generator is used as an unlimited source to generate random batches for training the neural network. Neural network optimization was run for 80 epochs, where data are permuted at the start of every epoch; data were also augmented by random translations of up to 5 pixels in each cardinal direction per epoch.

Similar methods were used for BotDetect, PayPal, and Yahoo CAPTCHAs. For BotDetect, we downloaded a data set of 50 to 100 images per CAPTCHA style for determining the parsing parameters and training setup, and another 100 images as a testing data set on which the network is not tuned. As training images for the system, we selected a series of fonts and scales from those available on the system by visually comparing a few examples of the BotDetect CAPTCHAs. The BotDetect test images were rescaled by a factor of 1.45. Parsing parameters were optimized using the validation set, and the transferability of the parsing parameters were tested by adapting the parameters for each style separately and then testing those parameters on the other styles.

For training RCN to parse ICDAR, we obtained 492 fonts from Google Fonts, which resulted in 25,584-character training images. From this we selected a set of training images using an automated greedy font selection approach. We rendered binary images for all fonts and then used the resulting images of the same letter to train an RCN. This RCN is then used to recognize the exact images it was trained on, providing a compatibility score (between 0.0 and 1.0) for all pairs of fonts of the same letter. Finally, using a threshold (0.8) as the stopping criterion, we greedily select the most representative fonts until 90% of all fonts are represented, which resulted in 776 unique training images. The parser is trained using 630 word images and the character n-grams are trained using words from the Wikipedia.

RCN classification experiments on the MNIST data set are done by up-sampling the images by a factor of 4. For each training setting, two pooling hyperparameters of the model were adapted using an independent validation set of rotated MNIST digits. Several ways of pretraining the CNN are explored as part of the baselines. To understand the performance of the networks on noisy MNIST data, we created six variants of noise, each one with three levels of severity. For occlusion reasoning, the RCN network was trained with 11 categories: ten MNIST digit categories with 20 examples for category and the rectangular ring category with one example. Reconstruction experiments on the MNIST data set used networks that were trained only on clean MNIST images which were then tested for mean squared reconstruction error on six different noise variants, each with three levels of severity. Full methods are available in the supplementary materials.


Materials and Methods

Supplementary Text

Figs. S1 to S27

Tables S1 to S15

References (63108)

References and Notes

  1. Supplementary materials.
  2. Acknowledgments: We thank the anonymous reviewers for their careful review and helpful suggestions that greatly improved the manuscript. We thank B. Olshausen, T. Dean, B. Lake, and B. Jaros for suggesting improvements after reading early versions of this manuscript. We are grateful to A. Yuille and F.-F. Li for insightful discussions leading to this work. Data sets used in the paper are available for download at The inventions described in this paper are protected by U.S. patents 9262698, 9373085, 9607262, and 9607263. Because text-based CAPTCHAs are still widely used to protect websites, the scientific benefit of releasing the source code must be balanced with the potential for it to be used for circumventing protections that prevent automated interactions with websites. As a compromise, a simplified reference implementation of RCN algorithms for the MNIST data set is available at
View Abstract


Navigate This Article