Research Article

Neural scene representation and rendering


Science  15 Jun 2018:
Vol. 360, Issue 6394, pp. 1204-1210
DOI: 10.1126/science.aar6170
  • Fig. 1 Schematic illustration of the Generative Query Network.

    (A) The agent observes training scene i from different viewpoints (in this example, from v1, v2, and v3). (B) The inputs to the representation network f are the observations made from viewpoints v1 and v2, and the output is the scene representation r, which is obtained by element-wise summing of the observations’ representations. The generation network, a recurrent latent variable model, uses the representation to predict what the scene would look like from a different viewpoint, v3. The generator can succeed only if r contains accurate and complete information about the contents of the scene (e.g., the identities, positions, colors, and counts of the objects, as well as the room’s colors). Training via back-propagation across many scenes, with a randomized number of observations per scene, leads to learned scene representations that capture this information in a concise manner. Only a handful of observations need to be recorded from any single scene to train the GQN. h1, h2, …, hL are the L layers of the generation network. (A toy code sketch of this representation-and-generation pipeline appears after the figure captions.)

  • Fig. 2 Neural scene representation and rendering.

    (A) After having made a single observation of a previously unencountered test scene, the representation network produces a neural description of that scene. Given this neural description, the generator is capable of predicting accurate images from arbitrary query viewpoints. This implies that the scene description captures the identities, positions, colors, and counts of the objects, as well as the position of the light and the colors of the room. (B) The generator’s predictions are consistent with laws of perspective, occlusion, and lighting (e.g., casting object shadows consistently). When observations provide views of different parts of the scene, the GQN correctly aggregates this information (scenes two and three). (C) Sample variability indicates uncertainty over scene contents (in this instance, owing to heavy occlusion). Samples depict plausible scenes, with complete objects rendered in varying positions and colors (see fig. S7 for further examples). The model’s behavior is best visualized in movie format; see movie S1 for real-time, interactive querying of GQN’s representation of test scenes.

  • Fig. 3 Viewpoint invariance, compositionality, and factorization of the learned scene representations.

    (A) t-SNE embeddings. t-SNE is a method for nonlinear dimensionality reduction that approximately preserves the neighborhood structure of the original high-dimensional data. Each dot represents a different view of a different scene, with color indicating scene identity. Whereas the VAE clusters images mostly on the basis of wall angles, GQN clusters images of the same scene independent of view (scene representations are computed from each image individually). Two scenes containing the same objects (marked by asterisk and dagger symbols) but in different positions are clearly separated. (B) Compositionality demonstrated by reconstruction of held-out shape-color combinations. (C) GQN factorizes object and scene properties: the effect of changing a specific property is similar across diverse scenes, as measured by the mean cosine similarity of the resulting changes in the representation (a toy sketch of this metric, together with the scene algebra of Fig. 4, appears after the figure captions). For comparison, we plot chance factorization, as well as the factorization of the image-space and VAE representations. See section 5.3 of (17) for details.

  • Fig. 4 Scene algebra and Bayesian surprise.

    (A) Adding and subtracting representations of related scenes enables control of object and scene properties via “scene algebra” and indicates factorization of shapes, colors, and positions. Pred, prediction. (B) Bayesian surprise at a new observation after having made observations 1 to k, for k = 1 to 5. When the model observes images that contain information about the layout of the scene, its surprise (defined as the Kullback-Leibler divergence between the conditional prior and the posterior) at observing the held-out image decreases (a toy sketch of this divergence appears after the figure captions).

  • Fig. 5 GQN representation enables more robust and data-efficient control.

    (A) The goal is to learn to control a robotic arm to reach a randomly positioned colored object. The controlling policy observes the scene from a fixed or moving camera (gray). We pretrain a GQN representation network by observing random configurations from random viewpoints inside a dome around the arm (light blue). (B) The GQN infers a scene representation that can accurately reconstruct the scene. (C) (Left) For a fixed camera, an asynchronous advantage actor-critic reinforcement learning (RL) agent (44) learns to control the arm using roughly one-fourth as many experiences as a standard agent trained on raw pixels (lines correspond to different hyperparameters; the same hyperparameters are explored for both the standard and GQN agents, and both agents also receive viewpoint coordinates as inputs). The final performance achieved by learning from raw pixels can be slightly higher for some hyperparameters, because some task-specific information may be lost when the compressed representation is learned independently of the RL task, as it is for GQN. (Right) The benefit of GQN is most pronounced when the policy network’s view of the scene moves from frame to frame, suggesting viewpoint invariance in its representation. We normalize scores such that a random agent achieves 0 and an agent trained on “oracle” ground-truth state information achieves 100 (a brief sketch of this normalization appears after the figure captions).

  • Fig. 6 Partial observability and uncertainty.

    (A) The agent (GQN) records several observations of a previously unencountered test maze (indicated by gray triangles). It is then capable of accurately predicting the image that would be observed at a query viewpoint (yellow triangle). It can accomplish this task only by aggregating information across multiple observations. (B) In the kth column, we condition GQN on observations 1 to k and show GQN’s predicted uncertainty, as well as two of GQN’s sampled predictions of the top-down view of the maze. Predicted uncertainty is measured by computing the model’s Bayesian surprise at each location, averaged over three different heading directions. The model’s uncertainty decreases as more observations are made. As the number of observations increases, the model predicts the top-down view with increasing accuracy. See section 3 of (17), fig. S8, and movie S1 for further details and results. nats, natural units of information.
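
Illustrative code sketches

The aggregation-and-query pipeline described in the Fig. 1 caption can be summarized in a few lines. The sketch below is only illustrative: it assumes small flattened images, a 7-dimensional viewpoint encoding, and random linear maps standing in for the trained convolutional representation network f and the recurrent latent variable generator; none of these specific choices are taken from the paper.

```python
# Minimal sketch of the GQN-style pipeline in Fig. 1: encode each
# (image, viewpoint) observation, sum the per-observation codes into a scene
# representation r, and predict an image at a query viewpoint from r.
# Placeholder linear maps stand in for the paper's deep networks.
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM = 16 * 16 * 3   # flattened toy observation image (assumed size)
VIEW_DIM = 7            # viewpoint: 3D position plus orientation encoding (assumed)
REP_DIM = 256           # size of the scene representation r (assumed)

# Random parameters standing in for the trained representation network f and
# generation network g; in the paper both are deep networks trained end to end.
W_f = rng.normal(scale=0.01, size=(REP_DIM, IMG_DIM + VIEW_DIM))
W_g = rng.normal(scale=0.01, size=(IMG_DIM, REP_DIM + VIEW_DIM))

def represent(image, viewpoint):
    """Encode one (image, viewpoint) observation into a per-observation code."""
    x = np.concatenate([image.ravel(), viewpoint])
    return np.tanh(W_f @ x)

def aggregate(observations):
    """Scene representation r = element-wise sum of the observations' codes."""
    return sum(represent(img, v) for img, v in observations)

def generate(r, query_viewpoint):
    """Predict the image at a query viewpoint from r (a deterministic stand-in
    for the recurrent latent variable generator)."""
    return W_g @ np.concatenate([r, query_viewpoint])

# Two observations of the same scene, then a prediction from a third viewpoint.
observations = [(rng.random(IMG_DIM), rng.random(VIEW_DIM)) for _ in range(2)]
r = aggregate(observations)
predicted_image = generate(r, rng.random(VIEW_DIM))
print(r.shape, predicted_image.shape)  # (256,) (768,)
```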
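
The “scene algebra” of Fig. 4A and the factorization metric of Fig. 3C both operate directly on representation vectors. In the toy sketch below, random vectors stand in for learned GQN representations, so the printed similarities sit near chance; the caption compares the trained model against exactly such a chance baseline. The vector names and dimensionality are assumptions.

```python
# Toy illustration of "scene algebra" (Fig. 4A) and the cosine-similarity
# factorization metric (Fig. 3C), using random stand-ins for learned
# representations.
import numpy as np

rng = np.random.default_rng(1)
REP_DIM = 256

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Scene algebra: r(blue sphere) - r(red sphere) + r(red cube) should land near
# r(blue cube) for a factorized representation. With random vectors the
# similarity stays near zero.
r_blue_sphere, r_red_sphere, r_red_cube, r_blue_cube = (
    rng.normal(size=REP_DIM) for _ in range(4)
)
pred_blue_cube = r_blue_sphere - r_red_sphere + r_red_cube
print("scene-algebra similarity:", cosine(pred_blue_cube, r_blue_cube))

# Factorization metric: the change in representation caused by editing one
# property (e.g., recoloring an object) is computed in many different scenes,
# and the mean pairwise cosine similarity of those change vectors is reported.
changes = [rng.normal(size=REP_DIM) for _ in range(10)]  # placeholder deltas
pairwise = [cosine(a, b) for i, a in enumerate(changes) for b in changes[i + 1:]]
print("mean cosine similarity of property changes:", np.mean(pairwise))
```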
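
Figures 4B and 6B quantify uncertainty as Bayesian surprise, the Kullback-Leibler divergence between the model’s conditional prior and its posterior once a new observation arrives. Assuming diagonal Gaussian distributions (a common parameterization for latent variable models; the paper’s exact form is not restated here), the divergence has the closed form sketched below, reported in nats. The means and variances are placeholders for network outputs.

```python
# Bayesian surprise as the KL divergence between two diagonal Gaussians,
# KL(posterior || conditional prior), with placeholder parameters.
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, in nats."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

LATENT_DIM = 64
rng = np.random.default_rng(2)

# Conditional prior (before the new observation) and posterior (after it).
mu_prior, var_prior = np.zeros(LATENT_DIM), np.ones(LATENT_DIM)
mu_post = rng.normal(scale=0.5, size=LATENT_DIM)
var_post = np.full(LATENT_DIM, 0.8)

surprise_nats = kl_diag_gaussians(mu_post, var_post, mu_prior, var_prior)
print(f"Bayesian surprise: {surprise_nats:.2f} nats")
```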
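
The score normalization described in the Fig. 5 caption pins a random agent at 0 and an “oracle” agent trained on ground-truth state at 100; a linear mapping between those two anchors is the natural reading and is sketched below with made-up returns.

```python
# Normalized control score: random agent -> 0, oracle agent -> 100
# (linear interpolation between the two anchors is an assumption; the raw
# returns below are made up).
def normalized_score(agent_return, random_return, oracle_return):
    return 100.0 * (agent_return - random_return) / (oracle_return - random_return)

print(normalized_score(agent_return=7.5, random_return=1.0, oracle_return=10.0))  # ~72.2
```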

Supplementary Materials

  • Neural scene representation and rendering

    S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu, Demis Hassabis

    Materials/Methods, Supplementary Text, Tables, Figures, and/or References

    • Supplementary Text
    • Figs. S1 to S16
    • Algorithms S1 to S3
    • Table S1
    • References

    Images, Video, and Other Media

    Movie S1
