Research Article

A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs

Science  08 Dec 2017:
Vol. 358, Issue 6368, eaag2612
DOI: 10.1126/science.aag2612
  • Breaking CAPTCHAs using a generative vision model.

    Text-based CAPTCHAs exploit the data efficiency and generative aspects of human vision to create a challenging task for machines. By handling recognition and segmentation in a unified way, our model fundamentally breaks the defense of text-based CAPTCHAs. Shown are the parses by our model for a variety of CAPTCHAs.

  • Fig. 1 Flexibility of letterform perception in humans.

    (A) Humans are good at parsing unfamiliar CAPTCHAs. (B) The same character shape can be rendered in a wide variety of appearances, and people can detect the “A” in these images regardless. (C) Common sense and context affect letterform perception: (i) m versus u and n. (ii) The same line segments are interpreted as N or S depending on occluder positions. (iii) Perception of the shapes aids the recognition of “b,i,s,o,n” and “b,i,k,e.”

  • Fig. 2 Structure of the RCN.

    (A) A hierarchy generates the contours of an object, and a CRF generates its surface appearance. (B) Two subnetworks at the same level of the contour hierarchy keep separate lateral connections by making parent-specific copies of child features and connecting them with parent-specific laterals; nodes within the green rectangle are copies of the feature marked “e.” (C) A three-level RCN representing the contours of a square. Features at level 2 represent the four corners, and each corner is represented as a conjunction of four line-segment features. (D) Four-level network representing an “A.”

  • Fig. 3 Samples from RCN.

    (A) Samples from a corner feature with and without lateral connections. (B) Samples from character “A” for different deformability settings, determined by pooling and lateral perturb-factors, in a three-level hierarchy similar to Fig. 2D, where the lowest-level features are edges. Column 2 shows a balanced setting where deformability is distributed between the levels to produce local deformations and global translations. The other columns show some extreme configurations. (C) Contour-to-surface CRF interaction for a cube. Green factors, foreground-to-background edges; blue, within-object edges. (D) Different surface-appearance samples for the cubical shape in (C) [see section 3 of (33) for CRF parameters].
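
The effect of lateral connections on sampling, as in Fig. 3A, can be illustrated with a toy one-dimensional sketch. It is not the RCN sampler: the chain structure and the names `sample_offsets`, `pool_radius`, and `max_lateral_gap` are assumptions made for illustration. Each feature picks a displacement from its pool, and an optional lateral constraint limits how far neighboring displacements may differ.

```python
import random

def sample_offsets(n_features, pool_radius, max_lateral_gap=None, rng=None):
    """Sample one displacement per feature along a contour (toy model).

    Without a lateral constraint every feature moves independently inside
    its pool. With max_lateral_gap set, each feature must stay within that
    gap of its predecessor, so the contour deforms coherently.
    """
    rng = rng or random.Random()
    offsets = [rng.randint(-pool_radius, pool_radius)]
    for _ in range(n_features - 1):
        lo, hi = -pool_radius, pool_radius
        if max_lateral_gap is not None:
            # Hypothetical lateral constraint: stay close to the neighbor.
            lo = max(lo, offsets[-1] - max_lateral_gap)
            hi = min(hi, offsets[-1] + max_lateral_gap)
        offsets.append(rng.randint(lo, hi))
    return offsets

rng = random.Random(0)
independent = sample_offsets(20, pool_radius=3, rng=rng)
with_laterals = sample_offsets(20, pool_radius=3, max_lateral_gap=1, rng=rng)
```

In this sketch, widening `pool_radius` while keeping `max_lateral_gap` small gives large but smooth displacements, and dropping the lateral constraint lets neighboring features scatter independently.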

  • Fig. 4 Inference and learning.

    (A) (i) Forward pass, including lateral propagation, produces hypotheses about the multiple letters present in the input image. PreProc is a bank of Gabor-like filters that convert from pixels to edge likelihoods [section 4.2 of (33)]. (ii) Backward pass and lateral propagation create the segmentation mask for a selected forward-pass hypothesis, here the letter “A” [section 4.4 of (33)]. (iii) A false hypothesis “V” is hallucinated at the intersection of “A” and “K”; false hypotheses are resolved via parsing [section 4.7 of (33)]. (iv) Multiple hypotheses can be activated to produce a joint explanation that involves explaining away and occlusion reasoning. (B) Learning features at the second feature level. Colored circles represent feature activations. The dotted circle is a proposed feature [see text and section 5 of (33)]. (C) Learning of laterals from contour adjacency (see text).
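
The PreProc stage in (A)(i) is described only as a bank of Gabor-like filters that turns pixels into edge likelihoods. A minimal sketch of that idea follows, assuming odd-phase Gabor kernels at four orientations. The kernel size, `sigma`, `wavelength`, and `gamma` values are illustrative, and the normalization to a score in [0, 1] is a stand-in, not the edge-likelihood computation from section 4.2 of (33).

```python
import math

def gabor_kernel(theta, size=7, sigma=2.0, wavelength=6.0, gamma=0.5):
    """Odd-phase (edge-selective) Gabor kernel; all parameters are illustrative."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.sin(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def response_at(image, kernel, cy, cx):
    """Correlate the kernel with the image patch centered at (cy, cx)."""
    half = len(kernel) // 2
    total = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += image[cy + dy][cx + dx] * kernel[dy + half][dx + half]
    return total

def edge_scores(image, cy, cx, n_orientations=4):
    """Per-orientation edge score in [0, 1] at one pixel (assumed normalization)."""
    thetas = [k * math.pi / n_orientations for k in range(n_orientations)]
    responses = [abs(response_at(image, gabor_kernel(t), cy, cx)) for t in thetas]
    peak = max(responses) or 1.0
    return [r / peak for r in responses]

# Vertical step edge: dark on the left, bright on the right.
image = [[0.0 if x < 5 else 1.0 for x in range(11)] for y in range(11)]
scores = edge_scores(image, cy=5, cx=5)
```

For this step edge, the filter at orientation 0 responds most strongly and the filter at the perpendicular orientation gives almost no response.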

  • Fig. 5 Parsing CAPTCHAs with RCN.

    (A) Representative reCAPTCHA parses showing top two solutions, their segmentations, and labels by two different Amazon Mechanical Turk workers. (B) Word accuracy rates of RCN and CNN on the control CAPTCHA data set. CNN is brittle and RCN is robust when character spacing is changed. (C) Accuracies for different CAPTCHA styles. (D) Representative BotDetect parses and segmentations (indicated by the different colors).
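
The word accuracy rates in (B) can be read as the fraction of CAPTCHAs whose full string is predicted correctly. The sketch below assumes that exact-match definition; the caption does not spell out the scoring rule, and `word_accuracy` is a hypothetical helper name.

```python
def word_accuracy(predictions, ground_truth):
    """Fraction of words predicted exactly right (assumed exact-match scoring)."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not ground_truth:
        return 0.0
    # A word counts only if every character matches, so one wrong
    # character makes the whole CAPTCHA a miss.
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

accuracy = word_accuracy(["w7kq", "hello", "x9"], ["w7kq", "hallo", "x9"])
```

Under this definition, word accuracy is stricter than per-character accuracy, because a single character error fails the whole word.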

  • Fig. 6 MNIST classification results for training with a few examples.

    (A) MNIST classification accuracy for RCN, CNN, and CPM. (B) Classification accuracy on corrupted MNIST tests. Legends show the total number of training examples. (C) MNIST classification accuracy for different RCN configurations.

  • Fig. 7 Generation, occlusion reasoning, and scene-text parsing with RCN.

    Examples of reconstructions (A) and reconstruction error (B) from RCN, VAE, and DRAW on corrupted MNIST. Legends show the number of training examples. (C) Occlusion reasoning. The third column shows the edges remaining after RCN explains away the edges of the first detected object. Ground-truth masks reflect the occlusion relationships between the square and the digit. The portions of the digit that are in front of the square are colored brown and the portions that are behind the square are colored orange. The last column shows the predicted occlusion mask. (D) One-shot generation from Omniglot. In each column, row 1 shows the training example and the remaining rows show generated samples. (E) Examples of ICDAR images successfully parsed by RCN. The yellow outlines show segmentations.

  • Fig. 8 Application of RCN to parsing scenes with objects.

    Shown are the detections and instance segmentations obtained when RCN was applied to a scene-parsing task with multiple real-world objects in cluttered scenes on random backgrounds. Our experiments suggest that RCN could be generalized beyond text parsing [see section 8.12 of (33) and Discussion].

  • Table 1 Accuracy and number of training images for different methods on the ICDAR-13 robust reading data set.
    Method                                  Accuracy   Total no. of training images
    PLT (54)                                64.6%      Unknown
    NESP (54)                               63.7%      Unknown
    PicRead (54)                            63.1%      Unknown
    Deep Structured Output Learning (55)    81.8%      8,000,000
    PhotoOCR (54)                           84.3%      7,900,000
    RCN                                     86.2%      26,214 (reduced to 1406)
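
The training-set sizes in Table 1 can be compared directly. The short calculation below uses only the figures in the table. The table does not say how the set was reduced to 1406 images, so both RCN figures are shown; the variable names are illustrative.

```python
# Training-set sizes copied from Table 1.
photo_ocr_images = 7_900_000
rcn_images_full = 26_214
rcn_images_reduced = 1_406

# How many times more training images PhotoOCR used than RCN.
ratio_full = photo_ocr_images / rcn_images_full
ratio_reduced = photo_ocr_images / rcn_images_reduced

print(f"PhotoOCR vs RCN (26,214 images): about {ratio_full:.0f}x")
print(f"PhotoOCR vs RCN (1406 images):   about {ratio_reduced:.0f}x")
```

On these numbers, PhotoOCR used roughly 300 times as many training images as RCN's full set, and roughly 5,600 times as many as the reduced set, while reporting lower accuracy (84.3% versus 86.2%).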

Supplementary Materials

  • A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs

    D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang, A. Lavin, D. S. Phoenix

    • Materials and Methods
    • Supplementary Text
    • Figs. S1 to S27
    • Tables S1 to S15
    • References
