Adding multiple processing cores to the same chip has become the de facto design choice for continuing to extract more performance per watt. Chips powering smartphones and laptops comprise four to eight cores. Those powering servers comprise tens of cores. And those in supercomputers have hundreds of cores. As transistor sizes decrease, the number of cores that can fit in the same on-chip area continues to increase, providing more processing capability each generation. But to use this capability, the interconnect fabric connecting the cores is of paramount importance, since it is what lets the cores share and distribute data. It must provide low latency (for a high-quality user experience), high throughput (to sustain a high rate of data delivery), and low power (so the chip doesn’t overheat).
Ideally, each core would have a dedicated connection to every core with which it needs to communicate. However, dedicated point-to-point wires between all cores aren’t feasible due to area, power, and wire-layout constraints. Instead, for scalability, cores are connected by a shared network-on-chip (NoC). For small core counts (eight to 16), NoCs are simple buses, rings, or crossbars. However, these topologies don’t scale well: buses require a centralized arbiter and offer limited bandwidth; rings perform distributed arbitration, but their worst-case latency grows linearly with the number of cores; and crossbars offer tremendous bandwidth but are limited by area and power. For large core counts, meshes are the most scalable. A mesh is formed by laying out a grid of wires and adding routers at the intersections; each router decides which message gets to use each wire segment each cycle, so messages travel hop by hop. Each router has four ports (one in each direction) plus one or more ports connecting to a core. Optimized mesh NoCs today take one to two cycles at every hop.
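To make the hop-by-hop idea concrete, here is a minimal sketch in Python (with illustrative, not product-specific, names) of XY dimension-order routing, a common deadlock-free routing algorithm for meshes: each router first moves a packet along the X dimension until its column matches the destination's, then along the Y dimension.

```python
# Minimal sketch of XY (dimension-order) routing on a 2D mesh NoC.
# Coordinates and port names are illustrative only.

def xy_route(curr, dest):
    """Return the output port a router at `curr` uses to move a
    packet toward `dest`. Ports: 'E', 'W', 'N', 'S', 'LOCAL'."""
    cx, cy = curr
    dx, dy = dest
    if cx != dx:                  # first correct the X dimension
        return 'E' if dx > cx else 'W'
    if cy != dy:                  # then correct the Y dimension
        return 'N' if dy > cy else 'S'
    return 'LOCAL'                # arrived: eject to the attached core

def path(src, dest):
    """Trace the sequence of routers a packet visits, hop by hop."""
    hops = [src]
    x, y = src
    while (x, y) != dest:
        port = xy_route((x, y), dest)
        if   port == 'E': x += 1
        elif port == 'W': x -= 1
        elif port == 'N': y += 1
        elif port == 'S': y -= 1
        hops.append((x, y))
    return hops

# Example: on a 4x4 mesh, a message from core (0, 0) to core (3, 2)
# takes 5 hops; at 1-2 cycles per hop, that is roughly 5-10 cycles.
print(path((0, 0), (3, 2)))
```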
Today’s commercial many-core chips are fairly homogeneous, and thus the NoCs within them are also homogeneous and regular. But the entire computing industry is going through a massive transformation due to emerging technology, architecture, and application trends. These, in turn, will have implications for the NoC designs of the future. Let’s consider some of these trends.
An exciting and potentially disruptive technology for on-chip networks is photonics. Its advantage is extremely high bandwidth with no electrical power consumed once the signal becomes optical, so a signal can travel anywhere from a few millimeters to a few meters at the same power. Optical fibers have already replaced electronic cables for inter-chassis interconnections within data centers, and optical backplanes are emerging as viable alternatives between boards within a chassis. Research into photonics for shorter interconnects, such as off-die I/O to DRAM and on-chip networks, is currently active. In 2015, researchers at Berkeley demonstrated a microprocessor chip with on-chip photonic devices for modulating an external laser light source and on-chip silicon waveguides as the transmission medium; these chips communicated with each other directly via optical signals. In 2016, researchers at the Singapore-MIT Alliance for Research and Technology demonstrated LEDs as on-chip light sources using novel III-V materials. NoC architectures inspired by these advances in silicon photonics (light sources, modulators, detectors, and photonic switches) are being actively researched. Once the challenges of building reliable, low-power photonic devices and circuits are addressed, silicon photonics might partially or completely replace on-chip electrical wires and provide high-bandwidth data delivery to multiple processing cores.
The performance and energy scaling that used to accompany transistor technology scaling has diminished. While we can place billions of transistors on a chip, switching them all simultaneously would exceed the chip’s power budget, so large portions of the die must remain powered off at any given time; this is known as dark silicon. To put those otherwise idle transistors to use, general-purpose processing cores are being augmented with specialized accelerators that are turned on only for specific applications. For instance, GPUs accelerate graphics and image processing, DSPs accelerate signal processing, cryptographic accelerators perform fast encryption and decryption, and so on. Such domain-specific accelerators are 100× to 1000× more efficient than general-purpose cores. Future chips will be built from tens to hundreds of cores and accelerators, with only a subset of them active at any time depending on the application. This places an additional burden on the NoC. First, since the physical area of each accelerator isn’t uniform (unlike cores), future NoCs are expected to be irregular and heterogeneous, which raises open questions about topologies, routing algorithms, and contention management. Second, traffic over the NoC may have dynamically varying latency and bandwidth requirements based on the currently active cores and accelerators. This will require quality-of-service (QoS) guarantees from the NoC, especially for chips operating in real-time IoT environments or inside data centers with tight end-to-end latency requirements.
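As one illustration of what QoS support at a router might look like, the following is a hypothetical Python sketch of a simple weighted arbiter that reserves most grant slots for a latency-critical traffic class and gives leftover slots to best-effort traffic. Real NoC QoS mechanisms (time-division, credit-based, or priority-based) are considerably more involved.

```python
# Hypothetical sketch of a QoS-aware router arbiter: a latency-critical
# class is guaranteed up to `reserved` of every `window` grant slots;
# best-effort traffic uses whatever is left over.

from collections import deque

def arbitrate(critical, best_effort, window=4, reserved=3):
    """Interleave grants from two queues over repeated windows."""
    grants = []
    while critical or best_effort:
        for slot in range(window):
            if slot < reserved and critical:
                grants.append(critical.popleft())
            elif best_effort:
                grants.append(best_effort.popleft())
            elif critical:
                grants.append(critical.popleft())
    return grants

crit = deque(f"C{i}" for i in range(4))   # e.g., real-time sensor traffic
bulk = deque(f"B{i}" for i in range(6))   # e.g., bulk DMA traffic
print(arbitrate(crit, bulk))
# -> ['C0', 'C1', 'C2', 'B0', 'C3', 'B1', 'B2', 'B3', 'B4', 'B5']
```

The design intent is that critical messages are never starved by bulk traffic, while bulk traffic still soaks up any unused slots.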
The aforementioned domain-specific accelerators are massively parallel engines with NoCs inside them, and those NoCs need to be tuned for the algorithm. For instance, there’s a great deal of interest in architectures and accelerators for deep neural networks (DNNs), which have shown unprecedented accuracy in vision and speech recognition tasks. Example ASICs include IBM’s TrueNorth, Google’s Tensor Processing Unit (TPU), and MIT’s Eyeriss. At an abstract level, these ASICs comprise hundreds of interconnected multiply-accumulate (MAC) units, the basic computation inside a neuron. The traffic follows a map-reduce style: inputs (e.g., image pixels or filter weights in the case of convolutional neural networks) are mapped, or scattered, to the MAC units; partial or final outputs are then reduced, or gathered, and mapped again to the neurons of the same or subsequent layers. The NoC needs to perform this map-reduce operation over massive datasets in a pipelined manner so that the MAC units are never idle. Building a highly scalable, low-latency, high-bandwidth NoC for such DNN accelerators will be an active research area as novel DNN algorithms continue to emerge.
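As a simplified sketch of this map-reduce traffic (not the dataflow of any particular accelerator), the Python below scatters one neuron's inputs and weights to MAC units, computes partial sums, and then gathers and reduces them into the neuron's output.

```python
# Simplified sketch of map-reduce traffic in a DNN accelerator:
# scatter inputs and weights to MAC units, compute partial sums,
# then gather and reduce them into one output neuron.
# This ignores the tiling, pipelining, and data reuse that real
# accelerator NoCs must orchestrate.

def mac(x, w):
    """One multiply-accumulate unit producing a partial sum."""
    return x * w

def neuron_output(inputs, weights):
    # Map/scatter phase: each (input, weight) pair goes to its own MAC.
    partial_sums = [mac(x, w) for x, w in zip(inputs, weights)]
    # Reduce/gather phase: partial sums are collected and summed,
    # then fed onward to the next layer's neurons.
    return sum(partial_sums)

# Example: a single neuron with 4 inputs.
inputs  = [0.5, -1.0, 2.0, 0.25]
weights = [0.1,  0.4, 0.3, -0.2]
print(neuron_output(inputs, weights))   # 0.05 - 0.4 + 0.6 - 0.05 = 0.2
```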
You now have an idea of what’s coming in terms of the hardware-software co-design of NoCs. Future computer systems will become even more heterogeneous and distributed, and the NoC will remain the communication backbone that ties these pieces together while providing high performance at low energy.
This essay appears in Circuit Cellar 322.
Dr. Tushar Krishna is an Assistant Professor of ECE at Georgia Tech. He holds a PhD (MIT), an MSE (Princeton), and a BTech (IIT Delhi). Dr. Krishna spent a year as a post-doctoral researcher at Intel and a semester at the Singapore-MIT Alliance for Research and Technology.