Report

A million spiking-neuron integrated circuit with a scalable communication network and interface


Science  08 Aug 2014:
Vol. 345, Issue 6197, pp. 668-673
DOI: 10.1126/science.1254642

Modeling computer chips on real brains

Computers are nowhere near as versatile as our own brains. Merolla et al. applied our present knowledge of the structure and function of the brain to design a new computer chip that uses the same wiring rules and architecture. The flexible, scalable chip operated efficiently in real time, while using very little power.

Science, this issue p. 668

Abstract

Inspired by the brain’s structure, we have developed an efficient, scalable, and flexible non–von Neumann architecture that leverages contemporary silicon technology. To demonstrate, we built a 5.4-billion-transistor chip with 4096 neurosynaptic cores interconnected via an intrachip network that integrates 1 million programmable spiking neurons and 256 million configurable synapses. Chips can be tiled in two dimensions via an interchip communication interface, seamlessly scaling the architecture to a cortexlike sheet of arbitrary size. The architecture is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification. With 400-pixel-by-240-pixel video input at 30 frames per second, the chip consumes 63 milliwatts.

A long-standing dream (1, 2) has been to harness neuroscientific insights to build a versatile computer that is efficient in terms of energy and space, homogeneously scalable to large networks of neurons and synapses, and flexible enough to run complex behavioral models of the neocortex (3, 4) as well as networks inspired by neural architectures (5).

No such computer exists today. The von Neumann architecture is fundamentally inefficient and nonscalable for representing massively interconnected neural networks (Fig. 1) with respect to computation, memory, and communication (Fig. 1B). Mixed analog-digital neuromorphic approaches have built large-scale systems (6–8) to emulate neurobiology by using custom computational elements, for example, silicon neurons (9, 10), winner-take-all circuits (11), and sensory circuits (12). We have found that a multiplexed digital implementation of spiking neurons is more efficient than previous designs (supplementary section S3) and enables one-to-one correspondence between software and hardware (supplementary section S9). Mixed analog-digital as well as custom microprocessor-based neuromorphic approaches (13) have built event-driven communication frameworks (14) to emulate the interconnectivity of the brain by leveraging the speed of digital electronics. We have found that event-driven communication combined with colocated memory and computation mitigates the von Neumann bottleneck (15). Inspired by neuroscience (Fig. 2, A to C), our key architectural abstraction (Fig. 1C) is a network of neurosynaptic cores that can implement large-scale spiking neural networks that are efficient, scalable, and flexible within today’s technology.

Fig. 1 Computation, communication, and memory.

(A) The parallel, distributed architecture of the brain is different from the sequential, centralized von Neumann architecture of today’s computers. The trend of increasing power densities and clock frequencies of processors (29) is headed away from the brain’s operating point. POWER processors are from IBM, Incorporated; AMD, Advanced Micro Devices, Incorporated; Pentium, Itanium, and Core 2 Duo, Intel, Incorporated. (B) In terms of computation, a single processor has to simulate both a large number of neurons and the inter-neuron communication infrastructure. In terms of memory, the von Neumann bottleneck (15), which is caused by the separation between the external memory and the processor, leads to energy-hungry data movement when updating neuron states and when retrieving synapse states. In terms of communication, interprocessor messaging (25) explodes when simulating highly interconnected networks that do not fit on a single processor. (C) Conceptual blueprint of an architecture that, like the brain, tightly integrates memory, computation, and communication in distributed modules that operate in parallel and communicate via an event-driven network.

Fig. 2 TrueNorth architecture.

Panels are organized into rows at three different scales (core, chip, and multichip) and into columns at four different views (neuroscience inspiration, structural, functional, and physical). (A) The neurosynaptic core is loosely inspired by the idea of a canonical cortical microcircuit. (B) A network of neurosynaptic cores is inspired by the cortex’s two-dimensional sheet. (C) The multichip network is inspired by the long-range connections between cortical regions, shown here for the macaque brain (30). (D) Structure of a neurosynaptic core with axons as inputs, neurons as outputs, and synapses as directed connections from axons to neurons. Multicore networks at (E) chip scale and (F) multichip scale are both created by connecting a neuron on any core to an axon on any core with point-to-point connections. (G) Functional view of core as a crossbar where horizontal lines are axons, cross points are individually programmable synapses, vertical lines are neuron inputs, and triangles are neurons. Information flows from axons via active synapses to neurons. Neuron behaviors are individually programmable, with two examples shown. (H) Functional chip architecture is a two-dimensional array of cores where long-range connections are implemented by sending spike events (packets) over a mesh routing network to activate a target axon. Axonal delay is implemented at the target. (I) Routing network extends across chip boundaries through peripheral merge and split blocks. (J) Physical layout of core in 28-nm CMOS fits in a 240-μm-by-390-μm footprint. A memory (static random-access memory) stores all the data for each neuron, a time-multiplexed neuron circuit updates neuron membrane potentials, a scheduler buffers incoming spike events to implement axonal delays, a router relays spike events, and an event-driven controller orchestrates the core’s operation. (K) Chip layout of 64-by-64 core array, wafer, and chip package. (L) Chip periphery to support multichip networks. I/O, input/output.

From a structural view, the basic building block is a core, a self-contained neural network with 256 input lines (axons) and 256 outputs (neurons) connected via 256-by-256 directed, programmable synaptic connections (Fig. 2D). Building on the local, clustered connectivity of a single neurosynaptic core, we constructed more complex networks by wiring multiple cores together using global, distributed on- and off-chip connectivity (Fig. 2, E and F). Each neuron on every core can target an axon on any other core. Therefore, axonal branching is implemented hierarchically in two stages: First, a single connection travels a long distance between cores (akin to an axonal trunk) and second, upon reaching its target axon, fans out into multiple connections that travel a short distance within a core (akin to an axonal arbor). Neuron dynamics is discretized into 1-ms time steps set by a global 1-kHz clock. Other than this global synchronization signal, which ensures one-to-one equivalence between software and hardware, cores operate in a parallel and event-driven fashion (supplementary section S1). The fundamental currency that mediates fully asynchronous (16) intercore communication and event-driven intracore computation is the all-or-nothing spike event, which represents the firing of an individual neuron. The architecture is efficient because (i) neurons form clusters that draw their inputs from a similar pool of axons (17–19) (Fig. 2A), allowing for memory-computation colocalization (supplementary section S5); (ii) only spike events, which are sparse in time, are communicated between cores via the long-distance communication network; and (iii) active power is proportional to firing activity. The architecture is scalable because (i) cores on a chip, as well as chips themselves, can be tiled in two dimensions similar to the mammalian neocortex (Fig. 2, B and C); (ii) each spike event addresses a pool of neurons on a target core, reducing the number of long-range spike events and thus mitigating a critical bottleneck (supplementary section S4); and (iii) occasional defects at the core and chip level do not disrupt system usability. Last, the architecture is flexible because (i) each neuron is individually configurable, and the neuron model (20) supports a wide variety of computational functions and biologically relevant spiking behaviors; (ii) each synapse can be turned on or off individually, and postsynaptic efficacy can be assigned relative strengths; (iii) each neuron-axon connection is programmable along with its axonal delay; and (iv) the neurons and synapses can exhibit programmed stochastic behavior via a pseudo-random number generator (one per core). The architecture thus supports rich physiological dynamics and anatomical connectivity that includes feed-forward, recurrent, and lateral connections.
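To make this structure concrete, the following is a minimal Python sketch of a single core: a binary 256-by-256 crossbar driving simplified leaky integrate-and-fire neurons in 1-ms ticks. The class, the uniform leak and threshold values, and the random crossbar fill are illustrative assumptions on our part; the actual neuron model (20) is individually programmable and far richer.

```python
import numpy as np

AXONS, NEURONS = 256, 256  # fixed dimensions of a single core

class NeurosynapticCore:
    """Minimal sketch of one core: a binary crossbar feeding leaky
    integrate-and-fire neurons, advanced in 1-ms ticks (illustrative,
    not the full TrueNorth neuron model)."""

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.synapses = rng.random((AXONS, NEURONS)) < 0.1  # on/off crossbar
        self.leak = -1.0                     # constant leak per tick (assumed)
        self.threshold = 10.0                # firing threshold (assumed)
        self.potential = np.zeros(NEURONS)   # membrane potentials

    def tick(self, active_axons):
        """Advance one 1-ms step, given the axons that received spike events."""
        drive = self.synapses[active_axons].sum(axis=0)  # summed synaptic input
        self.potential = np.maximum(self.potential + drive + self.leak, 0.0)
        fired = self.potential >= self.threshold
        self.potential[fired] = 0.0          # reset after firing
        return np.flatnonzero(fired)         # indices of neurons that spiked

core = NeurosynapticCore()
spiking_neurons = core.tick(active_axons=[3, 17, 200])
```

In hardware, each output spike would then be encoded as an event and routed to its programmed target axon, as described next.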

From a functional view, a core has individually addressable axons, a configurable synaptic crossbar array, and programmable neurons (Fig. 2G). Within a core, information flows from presynaptic axons (horizontal lines), through the active synapses in the crossbar (binary-connected crosspoints), to drive inputs for all the connected postsynaptic neurons (vertical lines). Axons are activated by input spike events, which are generated by neurons anywhere in the system and delivered after a desired axonal delay of between 1 and 15 time steps. Although the brain has a dedicated wire for each connection, in our architecture spike events are carried between cores by time-multiplexed wires (21) that interconnect a two-dimensional mesh network of routers, each with five ports (north, south, east, west, and local). The routers form the backbone of a two-dimensional mesh network interconnecting a 64-by-64 core array (Fig. 2H). When a neuron on a core spikes, it looks up in local memory an axonal delay (4 bits) and the destination address (an 8-bit absolute address for the target axon and two 9-bit relative addresses representing core hops in each dimension to the target core). This information is encoded into a packet that is injected into the mesh, where it is handed from core to core, first in the x dimension and then in the y dimension (deadlock-free dimension-order routing), until it arrives at its target core and fans out via the crossbar (fig. S2). To implement feedback connections within a core, where a neuron connects to an axon on the same core, the packet is delivered by using the router’s local channel, which is efficient because it never leaves the core. To scale the two-dimensional mesh across chip boundaries, where the number of interchip connections is limited, we used a merge-split structure at the four edges of the mesh to serialize exiting spikes and deserialize entering spikes (Fig. 2I). Spikes leaving the mesh are tagged with their row (for spikes traveling east-west) or column (for spikes traveling north-south) before being merged onto a shared link that exits the chip. Conversely, spikes entering the chip from a shared link are split to the appropriate row or column by using the tagged information.
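The packet fields and routing discipline described above can be sketched as follows. The field names, the software representation, and the direction convention (east as +x, north as +y) are our assumptions; the text fixes only the field widths (a 4-bit delay, an 8-bit target axon, and two 9-bit relative core offsets) and the x-then-y dimension order.

```python
from dataclasses import dataclass

@dataclass
class SpikePacket:
    """Spike event with the fields described in the text; the on-chip bit
    layout is not specified here, so this representation is illustrative."""
    dx: int     # 9-bit signed core hops remaining in x
    dy: int     # 9-bit signed core hops remaining in y
    axon: int   # 8-bit target axon index (0..255)
    delay: int  # 4-bit axonal delay in ticks (1..15)

def route_step(packet):
    """Deadlock-free dimension-order routing: exhaust x hops first, then
    y hops, then deliver to the local core's scheduler."""
    if packet.dx > 0:
        packet.dx -= 1
        return 'east'
    if packet.dx < 0:
        packet.dx += 1
        return 'west'
    if packet.dy > 0:
        packet.dy -= 1
        return 'north'
    if packet.dy < 0:
        packet.dy += 1
        return 'south'
    return 'local'  # arrived: buffer in scheduler for `delay` ticks, then fire axon

pkt = SpikePacket(dx=2, dy=-1, axon=42, delay=3)
hops = []
while (hop := route_step(pkt)) != 'local':
    hops.append(hop)
# hops == ['east', 'east', 'south']
```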

From a physical view, to implement this functional blueprint, we built TrueNorth, a fully functional digital chip (supplementary section S6) with 1 million spiking neurons and 256 million synapses (nonplastic). With 5.4 billion transistors occupying a 4.3-cm2 area in Samsung’s 28-nm process technology, TrueNorth has ∼428 million bits of on-chip memory. Each core has 104,448 bits of local memory to store synapse states (65,536 bits), neuron states and parameters (31,232 bits), destination addresses (6656 bits), and axonal delays (1024 bits). In terms of efficiency, TrueNorth’s power density is 20 mW per cm2, whereas that of a typical central processing unit (CPU) is 50 to 100 W per cm2 (Fig. 1A). Active power density was low because of our architecture, and passive power density was low because of our choice of a process technology with low-leakage transistors. This work advances a previous experimental prototype of a single neurosynaptic core (22), scaling the number of cores by 4096 times and reducing core area by 15 times and power by 100 times. To enable an event-driven, hybrid asynchronous-synchronous approach, we were required to develop a custom tool flow, beyond the scope of commercial software, for simulation and verification (supplementary section S2).
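The per-core memory budget quoted above decomposes exactly as the stated field widths suggest. A quick sanity check (our arithmetic; the 122 bits of state and parameters per neuron is inferred from the totals rather than stated):

```python
# Per-core memory budget, reconstructed from the figures in the text.
synapse_bits = 256 * 256           # binary crossbar: 65,536 bits
neuron_bits  = 256 * 122           # implies 122 bits of state/parameters per neuron
dest_bits    = 256 * (8 + 9 + 9)   # target axon plus (dx, dy) core offsets: 6,656 bits
delay_bits   = 256 * 4             # 4-bit axonal delay per neuron: 1,024 bits

total = synapse_bits + neuron_bits + dest_bits + delay_bits
assert total == 104_448                 # matches the stated per-core figure
assert 4096 * total == 427_819_008      # ~428 million bits chip-wide, as stated
```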

We used our software ecosystem (supplementary section S9) to map many well-known algorithms to the architecture (23) via offline learning, for example, convolutional networks, liquid state machines, restricted Boltzmann machines, hidden Markov models, support vector machines, optical flow, and multimodal classification. These same algorithms now run without modification on TrueNorth. To test TrueNorth’s applicability to real-world problems, we developed an additional multiobject detection and classification application in a fixed-camera setting. The task had two challenges: (i) to detect people, bicyclists, cars, trucks, and buses that occur sparsely in images while minimizing false detections, and (ii) to correctly identify the object. Operating on a 400-pixel-by-240-pixel aperture, the chip consumed 63 mW on a 30-frame-per-second three-color video (Fig. 3), which, when scaled to a 1920-pixel-by-1080-pixel video, achieved state-of-the-art performance (supplementary section S11). Because the video was prerecorded with a standard camera, we were required to convert the pixels into spike events to interface with TrueNorth. In a live setting, we could use a spike-based retinal camera (12) similar to a previously demonstrated eye-detection application (23). We also implemented a visual map of orientation-selective filters, inspired by early processing in mammalian visual cortex (24) and commonly used in computer vision for feature extraction (supplementary section S10). All 1 million neurons received feed-forward inputs with an orientation bias from visual space as well as recurrent connections between nearby features to sharpen selectivity.
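The text does not specify the kernels behind the orientation map, but a conventional stand-in for orientation-selective feature extraction in computer vision is a bank of Gabor filters, sketched below. On the chip itself, selectivity arises from trained feed-forward spiking connections with recurrent sharpening, not from convolving these kernels directly.

```python
import numpy as np

def gabor_kernel(theta, size=9, sigma=2.0, wavelength=4.0):
    """Oriented Gabor kernel: a Gaussian envelope times a sinusoid whose
    stripes run perpendicular to the preferred orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate into the filter frame
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

# A bank covering four preferred orientations: 0, 45, 90, and 135 degrees.
bank = [gabor_kernel(t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```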

Fig. 3 Real-time multiobject recognition on TrueNorth.

(A) The Neovision2 Tower data set is a video from a fixed camera, where the objective is to identify the labels and locations of objects among five classes. We show an example frame along with the selected region that is input to the chip. (B) The region is transduced from pixels into spike events to create two parallel channels: a high-resolution channel (left) that represents the what pathway for labeling objects and a low-resolution channel (right) that represents the where pathway for locating salient objects. These pathways are inspired by dorsal and ventral streams in visual cortex (4). (C) What and where pathways are combined to form a what-where map. In the what network, colors represent the spiking activity for a grid of neurons, where different neurons were trained (offline) to recognize different object types. By overlaying the responses, brighter colors indicate more-confident labels. In the where network, neurons were trained (offline) to detect salient regions, and darker responses indicate more-salient regions. (D) Object bounding boxes reported by the chip.

The standard benchmark of a computer architecture’s efficiency is energy per operation. In the domain of configurable neural architectures, the fundamental operation is the synaptic event, which corresponds to a source neuron sending a spike event to a target neuron via a unique (nonzero) synapse. Synaptic events are the appropriate atomic units because the computation, memory, communication, power, area, and speed all scale with the number of synapses. By using complex recurrently connected networks (Fig. 4A), we measured the total power of TrueNorth under a range of configurations (Fig. 4B) and computed the energy per synaptic event (Fig. 4C) (supplementary section S7). Power consumption in TrueNorth is a function of spike rate, the average distance traveled by spikes, and the average number of active synapses per neuron (synaptic density). At the operating point where neurons fire on average at 20 Hz and have 128 active synapses, the total measured power was 72 mW (at 0.775 V operating voltage), corresponding to 26 pJ per synaptic event (considering total energy). Compared with an optimized simulator (25) running the exact same network on a modern general-purpose microprocessor, TrueNorth consumes 176,000 times less energy per event (supplementary section S12). Compared with a state-of-the-art multiprocessor neuromorphic approach (13) (48 chips, each with 18 microprocessors) running a similar network, TrueNorth consumes 769 times less energy per event (supplementary section S12). Direct comparison to these platforms is possible because, like TrueNorth, they support individually programmable neurons and connections, as required to run applications like our multiobject recognition example. Direct comparisons with other platforms are not possible because of different network constraints and system capabilities (supplementary section S13). Computation in TrueNorth is measured by using synaptic operations per second (SOPS), whereas in modern supercomputers it is floating-point operations per second (FLOPS). Although not a direct comparison, TrueNorth can deliver 46 billion SOPS per watt for a typical network and 400 billion SOPS per watt for networks with high spike rates and a high number of active synapses (supplementary section S8), whereas today’s most energy-efficient supercomputer achieves 4.5 billion FLOPS per watt.
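The headline energy figure can be roughly reproduced from the numbers given above. The following back-of-envelope calculation is ours and only approximates the supplement’s exact event accounting, but it lands close to the reported values:

```python
# Back-of-envelope check of energy per synaptic event at the stated operating point.
neurons             = 1_000_000
rate_hz             = 20      # average firing rate
synapses_per_neuron = 128     # average active synapses per neuron
power_w             = 0.072   # total measured power at 0.775 V

events_per_s     = neurons * rate_hz * synapses_per_neuron  # 2.56e9 events/s
energy_per_event = power_w / events_per_s                   # ~2.8e-11 J, i.e., ~28 pJ
# Close to the reported 26 pJ; the supplement's event accounting differs slightly.

sops_per_watt = events_per_s / power_w   # ~3.6e10, i.e., ~36 billion SOPS per watt
# Same order as the quoted 46 billion SOPS per watt for a typical network.
```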

Fig. 4 Benchmarking power and energy.

(A) Example network topology used for benchmarking power at real-time operation. Nodes represent cores, and edges represent neural connections; only 64 of 4096 cores are shown. (B) Although power remains low (<150 mW) for all benchmark networks, those with higher synaptic densities and higher spike rates consume more total power, which illustrates that power consumption scales with neuron activity and number of active synapses. (C) The total energy (passive plus active) per synaptic event decreases with higher synaptic density because leakage power and baseline core power are amortized over additional synapses. For a typical network where neurons fire on average at 20 Hz and have 128 active synapses [marked as * in (B) and (C)], the total energy is 26 pJ per synaptic event.

We have begun building neurosynaptic supercomputers by tiling multiple TrueNorth chips, creating systems with hundreds of thousands of cores, hundreds of millions of neurons, and hundreds of billions of synapses. We envisage hybrid computers that combine the von Neumann architecture with TrueNorth, both being Turing complete but efficient for complementary classes of problems. We may be able to map the existing body of neural network algorithms to the architecture in an efficient fashion. In addition, many of the functional primitives of a recent large-scale complex behavioral model (3) map natively to our architecture, and we foresee developing a compiler to translate high-level functional tasks directly into TrueNorth networks. We envision augmenting our neurosynaptic cores with synaptic plasticity [see (26) for a prototype] to create a new generation of field-adaptable neurosynaptic computers capable of online learning. Although today TrueNorth is fabricated by using a modern complementary metal-oxide semiconductor (CMOS) process, the underlying architecture may exploit advances in future memory (27), logic (28), and sensor (12) technologies to deliver lower power, a denser form factor, and faster speed.

Supplementary Materials

www.sciencemag.org/content/345/6197/668/suppl/DC1

Supplementary Text

Figs. S1 to S8

Tables S1 and S2

References (31–40)

Movie S1

References and Notes

  1. Acknowledgments: This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) under contract no. HR0011-09-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. Government. U.S. patents 8473439 and 8515885 and U.S. patent applications 13/235,341 and 13/434,733 pertain to aspects of this work. We are grateful to many collaborators: A. Agrawal, A. Andreopoulos, S. Asaad, C. Baks, D. Barch, M. Beakes, R. Bellofatto, D. Berg, J. Bong, T. Christensen, A. Cox, P. Datta, D. Friedman, S. Gilson, J. Guan, S. Hall, R. Haring, C. Haymes, J. Ho, S. Iyer, J. Krawiecki, J. Kusnitz, J. Liu, J. B. Kuang, E. McQuinn, R. Mousalli, B. Nathanson, T. Nayak, D. Nguyen, H. Nguyen, T. Nguyen, N. Pass, K. Prasad, M. Ohmacht, C. Ortega-Otero, Z. Saliba, D. L. Satterfield, J.-s. Seo, B. Shaw, K. Shimohashi, K. Shiraishi, A. Shokoubakhsh, R. Singh, T. Takken, F. Tsai, J. Tierno, K. Wecker, S.-y. Wang, and T. Wong. We thank two anonymous reviewers for their thoughtful comments.