A neural network accelerator coupled with a microprocessor enables the use of neural network and machine learning techniques in low-power embedded systems. In this article, Schuyler details the development of machine learning accelerators and possibilities for their use going forward to address the issue of “Dark Silicon.”
Following recent trends, machine learning is becoming a dominant component of an increasingly large number of applications. Dedicated application accelerators for machine learning provide one approach towards improving the energy efficiency of systems. In this article I discuss some ongoing work on the development of machine learning accelerators and possibilities for their use going forward to address the preeminent energy efficiency issue of Dark Silicon.
Moore’s Law, the observation by Gordon Moore in 1965 that the number of transistors in an integrated circuit was doubling every two years, has been a wonder of electrical and computer engineering that has only begun to slow in recent years. However, Moore’s Law is often incorrectly credited for driving performance gains in consumer electronics like desktop computers and smartphones.
A related but lesser-known “law,” observed by Robert Dennard in 1974, was that the performance per watt of integrated circuits was scaling exponentially, like Moore’s Law. This meant that approximately every two years the performance of a microprocessor would double while its power dissipation stayed constant. Dennard scaling is the root cause of the seemingly dramatic improvements in consumer electronics performance during the 1990s that most readers will recall. Unfortunately, Dennard scaling broke down in the early 2000s, while the number of transistors has continued to grow.
This confluence of an end of “free lunch” energy efficiency improvements and increasing transistor counts has created a problem for engineers known as Dark Silicon, which this column has previously mentioned. In effect, the end of Dennard scaling and the continuation of Moore’s Law create a situation where integrated circuits have more transistors than can be powered simultaneously. A portion of the transistors must be left unpowered or “dark.”
One of the great questions of engineering research has been how to effectively use these additional transistors which must operate in mutual exclusion with the rest of the chip. One of the proposed solutions has been to design special purpose accelerators. These accelerators, much more energy efficient by design, then replace computation which would normally execute on a traditional, general-purpose microprocessor. The problem of accelerator design then becomes a trade-off between specialization and generalization.
Heavy specialization improves the energy efficiency of a specific application but limits the number of applications that can take advantage of the accelerator. Conversely, an accelerator that is too general is less energy efficient and begins to look like a small, general-purpose microprocessor. Research in this area has focused both on special-purpose accelerators with high impact potential and on programmable accelerators that can be configured to match an application. Accelerators for machine learning applications are one such high-impact candidate.
MACHINE LEARNING AND NEURAL NETWORKS
Machine learning, and specifically neural networks, is an extremely hot area of research for both academia and industry. An artificial neural network, like the one shown in Figure 1, consists of a collection of interconnected neurons, organized into layers, that mimic the behavior of biological neurons like those in the human brain. Neural networks are fantastic when used to tackle hard problems with difficult-to-discern solutions. Examples include the approximation of an input-output relationship or classification of images. Their true utility, however, lies in their ability to “learn” through the use of algorithms, like gradient descent, that incrementally modify the connection weights of a neural network to better represent a set of data.
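To make the idea of learning by weight adjustment concrete, here is a minimal sketch of gradient descent applied to a single artificial neuron. The dataset (the logical AND function), learning rate, and epoch count are illustrative choices, not anything from the architecture described in this article:

```python
import math

# A single artificial neuron: output = sigmoid(w1*x1 + w2*x2 + b).
# Gradient descent nudges the weights to shrink the squared error
# on a toy dataset (the logical AND function).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.0, 0.0]
b = 0.0
lr = 1.0  # learning rate (illustrative choice)

for epoch in range(5000):
    for (x1, x2), target in data:
        out = sigmoid(w[0] * x1 + w[1] * x2 + b)
        # Gradient of the squared error through the sigmoid.
        grad = (out - target) * out * (1.0 - out)
        w[0] -= lr * grad * x1
        w[1] -= lr * grad * x2
        b -= lr * grad

# After training, the neuron reproduces AND.
for (x1, x2), target in data:
    print((x1, x2), round(sigmoid(w[0] * x1 + w[1] * x2 + b)))
```

A real network repeats this update across many interconnected neurons (backpropagation), but the principle is the same: each weight moves a small step in the direction that reduces the error.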
The resurgent use of neural networks, and specifically deep networks with many layers, has enabled modern advances by web services and consumer-facing companies like Google and Facebook to process large, real-world datasets. Consequently, this broad use and adoption makes neural network accelerators strong candidates to include alongside microprocessors and to alleviate Dark Silicon issues. Such accelerator architectures have been appearing in the literature for several years. Similarly, Nvidia, a maker of graphics processing units (GPUs), has been very active in leveraging their parallel processing architectures for machine learning.
Part of the research done at Boston University has focused on the design of a neural network accelerator tightly coupled with a microprocessor. Such an accelerator fits into a different niche than dedicated integrated circuits or graphics cards for neural network acceleration due to its close proximity to the microprocessor. In effect, this accelerator acts more like a co-processor (similar to a floating point unit) than an off-chip device. Having an accelerator close to the microprocessor and its data can enable energy-efficiency-improving techniques, like function approximation, at finer granularities than are possible with an off-chip device.
ACCELERATING NEURAL NETWORKS WITH X-FILES/DANA
Our specific accelerator architecture consists of two components: 1) a set of software and hardware eXtensions For the Integration of machine Learning in Everyday Systems (X-FILES) and 2) a Dynamically Allocated Neural network Accelerator (DANA). The X-FILES consist of user and supervisor software that allow processes to initiate requests to access a backend neural network accelerator, DANA. The X-FILES hardware component acts as a transaction manager that interleaves the execution of multiple neural network transactions (requests to access a specific neural network) on DANA. DANA supports both feedforward (computing the output of a given neural network for a specific input) and learning (modifications to the weights of a neural network to improve the network’s accuracy on an approximation or classification task).
Figure 2 shows the hardware architecture of both the X-FILES hardware arbiter and DANA. We implement the hardware component of X-FILES/DANA in a hardware description language (HDL) called Chisel (like Verilog/SystemVerilog or VHDL, but supporting object-oriented and functional programming capabilities for synthesis). X-FILES/DANA can then be used as a custom coprocessor attached to an open-source RISC-V microprocessor developed at the University of California at Berkeley. The X-FILES hardware arbiter manages transactions, recorded in a Transaction Table, while DANA “executes” transactions. DANA consists of processing elements (PEs) that implement the mathematical operations of one neuron, intermediate local storage, and a Configuration Cache that stores neural network configurations (binary data structures that describe the structure of a neural network and its weights).
This system operates as follows: Using our X-FILES software library, processes are able to request access to a specific neural network specified with a neural network identifier (NNID). A new request is assigned a transaction identifier (TID), which the arbiter returns to the process. Using this TID, the process can then transmit input data for the specific neural network that it wants to access and, eventually, access the output data of the network. All requests initiated by processes are stamped with an operating-system-provided address space identifier (ASID), which prevents one process from accessing another’s neural data or neural networks.
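The request flow above can be sketched as a toy behavioral model. To be clear, the class and method names below (`XFilesArbiter`, `new_request`, `write_data`, `read_data`) are hypothetical illustrations of the NNID/TID/ASID handshake, not the actual X-FILES library API, and the “network” is stood in for by a trivial sum:

```python
# Toy behavioral model of the transaction flow: a process presents an
# NNID, receives a TID, then uses that TID (under its ASID) to move
# data in and out. All names here are hypothetical, not the real API.

class XFilesArbiter:
    def __init__(self):
        self.next_tid = 0
        self.transactions = {}  # TID -> transaction state

    def new_request(self, asid, nnid):
        """A process asks to use network NNID; the arbiter hands back a TID."""
        tid = self.next_tid
        self.next_tid += 1
        self.transactions[tid] = {"asid": asid, "nnid": nnid, "outputs": None}
        return tid

    def write_data(self, asid, tid, inputs):
        t = self.transactions[tid]
        if t["asid"] != asid:        # ASID check: one process cannot
            raise PermissionError    # touch another's transaction
        # Stand-in for DANA: pretend the "network" sums its inputs.
        t["outputs"] = [sum(inputs)]

    def read_data(self, asid, tid):
        t = self.transactions[tid]
        if t["asid"] != asid:
            raise PermissionError
        return t["outputs"]

arbiter = XFilesArbiter()
tid = arbiter.new_request(asid=1, nnid=17)
arbiter.write_data(1, tid, [0.5, 0.25])
print(arbiter.read_data(1, tid))  # [0.75]
```

The ASID check at each access point mirrors the isolation guarantee described above: a TID is only usable by the address space that opened it.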
During operation, the hardware arbiter selects a valid transaction from the Transaction Table and requests that DANA perform some work to move this transaction towards completion. These actions either involve loading the Configuration Cache with the required neural network configuration or allocating a PE for the next neuron in that transaction. Data is communicated between layers in the network via an intermediate scratchpad memory.
At present, the X-FILES/DANA architecture is available via an open-source GitHub repository for the curious reader to test drive. This design, coupled with a RISC-V Rocket microprocessor, can then be simulated in software or evaluated using a Field Programmable Gate Array (FPGA). We provide a rough equivalence with the Fast Artificial Neural Network (FANN) library, a software library for training and running neural networks.
PERFORMANCE EVALUATION OF ROCKET + X-FILES/DANA
Across various neural network configurations, the X-FILES/DANA accelerator infrastructure provides, on average, a 30× speedup over a traditional software implementation. Figure 3 shows the results for the different neural network topologies from Table 1. Note that one interesting facet of this architecture is that throughput improves as more work (larger neural networks with more neurons per layer) is executed on the accelerator. In the event that a network with little parallel work is executing, DANA can interleave the execution of multiple transactions to maintain high throughput.
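The intuition behind throughput improving with layer width can be captured with a toy utilization model. The constants below (number of PEs, cycles per neuron, per-layer overhead) are assumed round numbers for illustration, not measured DANA figures:

```python
import math

# Toy utilization model: with a fixed pool of PEs and a fixed
# per-layer setup overhead, wider layers amortize the overhead
# and keep more PEs busy. All constants are assumptions.

NUM_PES = 4
CYCLES_PER_NEURON = 10
LAYER_OVERHEAD = 20  # cycles to allocate PEs, touch the cache, etc.

def throughput(neurons_per_layer, layers):
    """Neurons computed per cycle for a network of uniform layer width."""
    cycles_per_layer = (LAYER_OVERHEAD +
                        math.ceil(neurons_per_layer / NUM_PES) * CYCLES_PER_NEURON)
    return (neurons_per_layer * layers) / (cycles_per_layer * layers)

for width in (2, 8, 32):
    print(width, round(throughput(width, 3), 3))
```

In this model a 2-neuron layer pays the overhead for little work, while a 32-neuron layer keeps all PEs saturated, so throughput climbs with width, matching the trend described above. Interleaving multiple transactions is, in effect, another way to fill those idle PE slots.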
While we are currently in the process of completing the power evaluation of this specific implementation, an earlier version of this architecture demonstrated an order of magnitude improvement in power consumption. Coupled with the 30× improvement in performance shown here, this indicates that X-FILES/DANA and other machine learning accelerators have the potential to improve energy-delay product (a metric incorporating energy reduction and performance gains) by at least two orders of magnitude.
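The arithmetic behind that claim is worth spelling out. Energy-delay product (EDP) is energy multiplied by delay, so improvements in each factor multiply. Taking a deliberately conservative reading of the round numbers above (the order-of-magnitude power figure treated as roughly a 10× energy reduction, which is an assumed interpretation, not a measurement):

```python
# Back-of-the-envelope EDP arithmetic with the article's round numbers.
# Conservative assumption: the order-of-magnitude power improvement is
# read as a ~10x energy reduction.

speedup = 30.0           # delay shrinks 30x
energy_reduction = 10.0  # energy shrinks ~10x (assumed)

# EDP = energy * delay, so the two factors multiply.
edp_improvement = energy_reduction * speedup
print(edp_improvement)  # 300.0 -- at least two orders of magnitude
```

Even under this conservative reading the combined improvement is roughly 300×, comfortably clearing the two-orders-of-magnitude mark.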
ALLEVIATING DARK SILICON ISSUES
Resurgent interest in neural networks and machine learning, driven by the explosion of real-world data that companies must analyze, makes the inclusion of neural network accelerators a likely candidate to improve energy efficiency. Relatedly, a neural network accelerator tightly coupled with a microprocessor enables the use of neural network and machine learning techniques in low-power, embedded systems. Additionally, the design and inclusion of neural network accelerators alongside traditional computer architectures has the potential to alleviate issues related to Dark Silicon—one of the preeminent concerns in the area of semiconductor technology.
Author’s Note: Work related to X-FILES/DANA was supported by a NASA Space Technology Research Fellowship.
 M. B. Taylor, “Is Dark Silicon Useful?: Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse,” Proceedings of the 49th Annual Design Automation Conference, 2012.
 T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, “DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning,” Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
 G. Venkatesh, J. Sampson, N. Goulding-Hotta, S. K. Venkata, M. B. Taylor, S. Swanson, “QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores,” Proceedings of the 44th International Symposium on Microarchitecture, 2011.
 S. Eldridge, A. Waterland, M. Seltzer, J. Appavoo, and A. Joshi, “Towards General-Purpose Neural Network Computing,” Proceedings of the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015.
 J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanovic, “Chisel: Constructing Hardware in a Scala Embedded Language,” Proceedings of the 49th Design Automation Conference, 2012.
 RISC-V Rocket git repository, https://github.com/ucb-bar/rocket-chip.
 X-FILES/DANA git repository, https://github.com/bu-icsg/xfiles-dana.
 S. Nissen, “Implementation of a Fast Artificial Neural Network Library (FANN),” Technical Report, Department of Computer Science, University of Copenhagen (DIKU).
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • MAY 2016 #310