Deep Learning Solution
Deep learning using convolutional neural networks (CNNs) can offer a robust solution across a wide range of applications and market segments. In this article, the authors examine why FPGAs offer advantages over GPUs when it comes to implementing CNNs—particularly in edge applications.
Machine learning is a technique widely used for analyzing and finding patterns across extremely large data sets as a means to “teach” computers, rather than program them, to perform complex tasks. However, traditional machine learning techniques struggle with high-dimensional data such as images and video. This is where deep learning using convolutional neural networks (CNNs) comes into play, offering a more robust solution for this type of data across a wide range of applications and market segments.
Today, common functions most closely associated with deep learning include speech recognition and scene classification. As the technology matures, deep learning is finding its way into many new applications involving medical instruments, robotics, machine vision, supply chain optimization, data mining and more. While GPUs can be used to implement CNNs, a better approach, especially in edge applications, is to use FPGAs that are aligned with the application’s specific accuracy and performance requirements as well as the available size, cost and power budget.
Deep learning is a subset of machine learning based on the concept of inferring the model of a system by training it with data instead of using physical models. Figure 1 depicts the general deep learning setup. A large, representative training database is used to train the model by evaluating the predictions made and updating the model using a training algorithm. Once the accuracy of the model meets some pre-defined criteria, the trained model can be deployed for inference.
A standard CNN usually consists of a feature extractor and a classifier stage, which are both trainable and used during inference. CNNs have existed for quite a while. In the late 1980s, Yann LeCun had success with the idea of using a trainable feature extractor combined with a trainable classifier for classifying handwritten digits [1]. This system had great success in the US for reading checks and spurred high interest in the field, but ultimately the excitement died down. It was not until the advent of new training techniques, large datasets and the processing power provided by modern graphics processing units (GPUs) and FPGAs that using these deep networks with trainable feature extractors became viable again.
As previously mentioned, a CNN has two primary components, a feature extractor and a classifier. The feature extractor itself consists of cascaded convolution layers, which are highly processing-intensive. With each cascaded convolution layer, the feature extractor can learn progressively more complex and abstract feature representations of the input data. The classifier consists of stacked, fully connected neural network layers, sometimes referred to as dense layers, and is primarily memory-intensive. This point is worth emphasizing: feature extraction is computationally intensive, while the classifier is mainly memory bandwidth intensive, as will be shown later in the article.
Both of these building blocks deserve a closer look. Figure 2 shows the structure of a standard convolution layer. The layer has N input feature maps and M output feature maps. Each output feature map has R rows and C columns. The convolution layer performs a unique filtering operation for each input-output feature map combination. The K×K kernel weights associated with each input-output feature map combination are trainable. A stride, S, can also be specified for the convolution operation. The primary operation for this type of layer is a multichannel convolution. Each output feature map is not just the result of a simple convolution over a single input feature map, but rather the combination of convolutions over all input feature maps. As seen in Figure 2, this stage is very computationally intensive, requiring numerous multiplications, additions and other data functions.
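The multichannel convolution described above can be sketched in a few lines of plain Python. This is only an illustration, not the CDL implementation: the function name `conv_layer`, the nested-list data layout and the absence of padding are assumptions made for the example. As is customary for CNNs, the kernel is applied without flipping.

```python
def conv_layer(inputs, kernels, stride=1):
    """Multichannel convolution, no padding (illustrative sketch).

    inputs:  N input feature maps, each a list of rows.
    kernels: kernels[m][n] is the K x K kernel that links input map n
             to output map m.
    Returns M output feature maps, each with R rows and C columns.
    """
    n_in = len(inputs)
    height, width = len(inputs[0]), len(inputs[0][0])
    k = len(kernels[0][0])
    rows = (height - k) // stride + 1   # R
    cols = (width - k) // stride + 1    # C
    outputs = []
    for m in range(len(kernels)):            # one pass per output feature map
        fmap = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                acc = 0
                for n in range(n_in):        # combine all input feature maps
                    for i in range(k):
                        for j in range(k):
                            acc += (inputs[n][r * stride + i][c * stride + j]
                                    * kernels[m][n][i][j])
                fmap[r][c] = acc
        outputs.append(fmap)
    return outputs


# Hypothetical data: N = 2 input maps of 3x3, M = 1 output map, K = 2.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
w = [[
    [[1, 0], [0, 1]],   # kernel applied to input map 0
    [[1, 1], [1, 1]],   # kernel applied to input map 1
]]
y = conv_layer(x, w)
print(y)  # [[[8, 10], [14, 16]]]
```

Each output value requires N×K×K multiply-accumulate operations, which is why the loop nest grows so quickly with layer size.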
The fully connected layer used to implement the classification stage of the CNN also merits attention. Figure 3a shows an overview of fully connected layers. Every input node is connected to every output node, and a weight is associated with each connection. The fully connected layer applies a linear transformation to the input vector. Figure 3b depicts a more detailed view of the operations within a fully connected network. The value of the output node is simply the inner product of the input nodes and the weights. A bias, B, is also added to the result. The bias and weights in a fully connected layer are all trainable. Stacking multiple fully connected layers increases the expressive power of the classifier and allows the network to learn a non-linear transfer function for the classification phase. As stated previously, the fully connected layer is memory-intensive, requiring frequent data accesses for each of the nodes and weights.
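A minimal sketch of a single fully connected layer follows, assuming a plain-Python list layout in which `weights[o][i]` connects input node `i` to output node `o`. The function name and the sample values are hypothetical.

```python
def fully_connected(inputs, weights, biases):
    """Each output node is the inner product of the inputs and that
    node's weights, plus a bias."""
    return [
        sum(x * w for x, w in zip(inputs, row)) + b
        for row, b in zip(weights, biases)
    ]


x = [1.0, 2.0, 3.0]                 # three input nodes
w = [[0.5, -1.0, 2.0],              # weights feeding output node 0
     [1.0, 1.0, 1.0]]               # weights feeding output node 1
b = [0.5, -6.0]                     # one bias per output node
y = fully_connected(x, w, b)
print(y)  # [5.0, 0.0]
```

Note that every weight is read exactly once per input vector, so there is little opportunity to reuse data once it has been fetched.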
Implementing the complete CNN requires an architecture with both computational power and adequate memory bandwidth. GPUs are often used for CNNs because of their memory bandwidth and processing capabilities. The drawback is that these devices consume significant amounts of power, and for many edge applications this is simply not acceptable. FPGAs, on the other hand, have limited resources in terms of logic elements and fabric memory, but they do have numerous multipliers and adders. With these limitations in mind, two aspects need to be carefully reviewed when implementing CNNs on an FPGA: computation and memory. CNN models can have extremely high computational complexity, with up to 40 billion multiplication and addition operations. Furthermore, the size of these models requires the use of external memory, since the entire network will rarely fit into the fabric memory available on most FPGAs.
Figure 4 shows the time and space complexity analysis of a typical CNN where the first five layers are convolution layers and the last two layers are fully connected. The time complexity of a layer is measured by its computational requirements, represented by the number of operations that need to be performed; space complexity is determined by the number of weights that need to be retrieved from external memory. From Figure 4, it is apparent that the convolution layers require very little bandwidth compared to the fully connected layers. Conversely, the fully connected layers are far less computationally demanding than the convolution layers but need high memory bandwidth.
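The contrast between the two layer types can be made concrete with a rough count of operations and weights. The sketch below is an assumption-laden estimate rather than the analysis behind Figure 4: it counts one multiply-accumulate as two operations, ignores biases, and uses hypothetical layer dimensions.

```python
def conv_cost(m, n, r, c, k):
    """Return (operations, weights) for a convolution layer with
    M output maps, N input maps, R x C outputs and K x K kernels."""
    macs = m * n * r * c * k * k
    return 2 * macs, m * n * k * k


def fc_cost(n_in, n_out):
    """Return (operations, weights) for a fully connected layer."""
    macs = n_in * n_out
    return 2 * macs, n_in * n_out


# Hypothetical layer sizes, chosen only for illustration.
conv_ops, conv_weights = conv_cost(m=64, n=32, r=56, c=56, k=3)
fc_ops, fc_weights = fc_cost(n_in=4096, n_out=4096)

print(conv_ops, conv_weights, conv_ops // conv_weights)
# 115605504 18432 6272
print(fc_ops, fc_weights, fc_ops // fc_weights)
# 33554432 16777216 2
```

In this example the convolution layer performs thousands of operations per weight fetched, while the fully connected layer performs only two, which matches the time versus space trade-off described above.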
Because FPGAs can perform operations in parallel, it is critical to identify specifically where the sources of parallelism reside in a CNN. In addition, it is important to note the data dependency between layers. By looking at the standard structure of a convolutional layer, the following sources of parallelism in CNNs can be identified. (1) Within each convolution there exists parallelism. The multiplication and addition operations for a single kernel can be done in parallel. (2) Each input feature map can be processed independently. Convolutions on different input feature maps can thus happen in parallel. (3) Each output feature map has a set of kernels associated with each input feature map. This means the convolutions for different output feature maps share the same input feature maps but not the same kernels (Figure 2). This, in turn, means different output feature maps can be processed in parallel.
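The three sources of parallelism can be illustrated by splitting the convolution into independent pieces. The sketch below is a software analogy only: the function names and data are hypothetical, and Python threads merely show that the tasks do not depend on one another, whereas an FPGA would map them onto separate hardware units.

```python
from concurrent.futures import ThreadPoolExecutor


def conv2d_single(fmap, kernel):
    """One input map with one K x K kernel (stride 1, no padding)."""
    k = len(kernel)
    rows, cols = len(fmap) - k + 1, len(fmap[0]) - k + 1
    return [
        [
            # (1) The K*K products of one window do not depend on each other.
            sum(fmap[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def output_map(inputs, kernels_for_m):
    """(2) One independent partial result per input map, then an
    element-wise sum of the partial results."""
    partials = [conv2d_single(f, kern)
                for f, kern in zip(inputs, kernels_for_m)]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def conv_layer_parallel(inputs, kernels):
    """(3) Output maps share the inputs but not the kernels, so each
    output map can be computed as its own task."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda ks: output_map(inputs, ks), kernels))


# Hypothetical data: two 3x3 input maps, two output maps, 2x2 kernels.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
w = [
    [[[1, 0], [0, 1]], [[1, 1], [1, 1]]],   # kernels for output map 0
    [[[0, 1], [1, 0]], [[0, 0], [0, 0]]],   # kernels for output map 1
]
y = conv_layer_parallel(x, w)
print(y)  # [[[8, 10], [14, 16]], [[6, 8], [12, 14]]]
```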
The computational operations in a CNN boil down to multiply-accumulate operations. The majority of these operations can be implemented with 8-bit × 8-bit multiplications and additions. Because FPGAs contain numerous multipliers and adders, the convolutional layers are well suited to the parallelism of the FPGA. To capitalize on this parallelism, some FPGAs feature arithmetic blocks capable of implementing two multiplications and two additions in one clock cycle. This is referred to as dot product mode or DOTP (Figure 5). In DOTP mode, a single math block performs four useful operations: two multiplications, one addition of the two products, and one further addition in the accumulator to add the result from the previous block in the cascade. This function is particularly useful in the convolution layer but can also be used in the fully connected layer. Let’s move to the fully connected layers and understand what can be done to best target an FPGA.
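Before turning to the fully connected layers, the DOTP behavior described above can be modeled in software. This is a behavioral sketch under stated assumptions, not a description of any vendor's math block: `dotp_block` and `dot_product` are hypothetical names, and each call stands in for one clock cycle of one block in the cascade.

```python
def dotp_block(a0, b0, a1, b1, cascade_in=0):
    """Model of one block: two multiplications, one addition of the two
    products, and one addition of the cascaded input."""
    return (a0 * b0 + a1 * b1) + cascade_in


def dot_product(a, b):
    """Chain DOTP blocks to form a longer dot product.

    Odd-length inputs are padded with a zero pair. Returns the result
    and the number of blocks used."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    a, b = list(a), list(b)
    if len(a) % 2:
        a.append(0)
        b.append(0)
    acc, blocks = 0, 0
    for i in range(0, len(a), 2):
        acc = dotp_block(a[i], b[i], a[i + 1], b[i + 1], acc)
        blocks += 1
    return acc, blocks


# One 3x3 window of 8-bit pixel values and a 3x3 kernel, both flattened.
pixels = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1, 2, 0, -2, 1, 0, -1]
print(dot_product(pixels, kernel))  # (-8, 5)
```

Under this model, a 3×3 kernel needs five cascaded blocks rather than nine separate multipliers.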
FULLY CONNECTED LAYERS
Compression, here in the form of quantization, is used to reduce the memory access for the fully connected layers and convolutional layers. To reduce the amount of data transferred, the network data and trainable parameters both need to be compressed. One example is the compression performed by the Core Deep Learning (CDL) framework from ASIC Design Services. In its CNN, the company implements quantization as a compression solution. One of the biggest concerns is how quantization will affect the accuracy of the network, especially considering that the weights and data used in neural networks are generally represented in 32-bit floating point. Research has shown that this 32-bit floating point representation is not necessary to keep the same level of accuracy [2][3].
The compression step in the CDL framework implements a re-training and quantization operation where the 32-bit floating point parameters of the CNN are converted to 8-bit fixed-point parameters. The range of the weights and data in different layers of the network can differ significantly. The solution lies in a dynamic 8-bit fixed-point approach, where the location of the radix point is variable for each layer and its set of weights. The multiply-accumulate operations in the layer are performed at a high precision. Thereafter, a set of eight consecutive bits is selected and passed on for further processing. Retraining is required after compression when the accuracy of the quantized network is not sufficient, but the result is a reduced memory access bottleneck while still achieving near full accuracy.
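The dynamic fixed-point idea can be sketched as follows. This is not the CDL framework's algorithm, whose details the article does not give. It assumes signed 8-bit codes, a per-layer count of fractional bits chosen so the largest magnitude still fits, and saturation when the eight output bits are selected. All names and values are hypothetical.

```python
def choose_frac_bits(values, total_bits=8):
    """Pick the largest number of fractional bits (the radix point
    position) for which the largest magnitude still fits."""
    limit = 2 ** (total_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    for frac_bits in range(16, -17, -1):
        if round(max_abs * 2.0 ** frac_bits) <= limit:
            return frac_bits
    raise ValueError("value range too large for this sketch")


def quantize(values, frac_bits, total_bits=8):
    """Convert floating point values to saturated fixed-point codes."""
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return [max(lo, min(hi, round(v * 2.0 ** frac_bits))) for v in values]


def select_eight_bits(acc, shift):
    """Take eight consecutive bits of a wide accumulator by shifting
    right, saturating if the value does not fit."""
    return max(-128, min(127, acc >> shift))


weights = [0.5, -0.25, 0.125, 0.9]      # small range: many fractional bits
data = [12.0, -3.5, 40.0, 1.0]          # large range: few fractional bits

fw, fd = choose_frac_bits(weights), choose_frac_bits(data)
qw, qd = quantize(weights, fw), quantize(data, fd)
print(fw, qw)   # 7 [64, -32, 16, 115]
print(fd, qd)   # 1 [24, -7, 80, 2]

# The multiply-accumulate runs at full precision; the accumulator
# carries fw + fd fractional bits.
acc = sum(w * d for w, d in zip(qw, qd))
out_frac = 3                            # assumed radix point of the next layer
out = select_eight_bits(acc, fw + fd - out_frac)
print(acc, out, out / 2 ** out_frac)    # 3270 102 12.75
```

The floating point result of the same inner product is 12.775, so the 8-bit output of 12.75 is close despite the two operands using different radix points.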
One of the most critical steps that enable a CNN to operate effectively is training. The network must be trained with data and numerous examples. As children, we had no idea what a bird or a dog was. But after reading books, being shown examples by our parents and being told what they were, our minds became trained. The same must be done for an effective CNN. The training data is a representative set of the type of inputs that the CNN would see when deployed in the field. Training on this data determines the weights of the network. The weights represent a transfer function that transforms the input data into another data representation to be used during the classification stage. With repeated data inputs and repeated model updates, the weights are modified based on the evaluation of the previous inference results. This iterative process continues to update the weights and improve the accuracy of the CNN until some accuracy criterion is met.
The optimal implementation of a CNN is one in which the computational power and the memory bandwidth are balanced. If the memory bandwidth cannot keep up with the computing throughput, then the logic resources doing the computation will be under-utilized. The goal is thus to maximize the computational throughput by exploiting parallelism while concurrently minimizing the amount of memory access, which is accomplished by identifying shared memory and applying compression via quantization. Exploring the design space for implementing CNNs on an FPGA is a non-trivial task. Several candidate designs need to be evaluated to find the optimal configuration, and many different combinations of implementation parameters must be tried before the computational power and the maximum memory bandwidth for the CNN are matched. Obtaining this optimal point requires searching a large design space.
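One common way to reason about this balance is a roofline-style estimate, in which the attainable throughput is the smaller of the peak compute rate and the rate at which memory can feed the compute units. The sketch below uses that general model as a stand-in. It is not how any particular tool performs its search, and the peak rate, bandwidth and operations-per-byte figures are hypothetical.

```python
def attainable_gops(peak_gops, bandwidth_gbytes, ops_per_byte):
    """Roofline-style bound: throughput is limited either by the compute
    units or by how fast weights arrive from external memory."""
    return min(peak_gops, bandwidth_gbytes * ops_per_byte)


PEAK_GOPS = 100.0       # hypothetical peak of the math blocks, in Gops/s
BANDWIDTH = 4.0         # hypothetical external memory bandwidth, in GB/s

# Fully connected layer: 2 operations per weight.
fc_fp32 = attainable_gops(PEAK_GOPS, BANDWIDTH, 2 / 4)   # 4-byte weights
fc_int8 = attainable_gops(PEAK_GOPS, BANDWIDTH, 2 / 1)   # 1-byte weights

# Convolution layer: each weight is reused across the whole output map.
conv_int8 = attainable_gops(PEAK_GOPS, BANDWIDTH, 6272 / 1)

print(fc_fp32, fc_int8, conv_int8)  # 2.0 8.0 100.0
```

In this example the fully connected layer is memory bound and gains a factor of four from 8-bit quantization, while the convolution layer is limited only by the available compute.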
TOOL FOR DEEP LEARNING
ASIC Design Services has created a tool called Design Space Exploration (DSE), a software suite provided as part of the CDL framework. The tool helps ensure that the maximum memory bandwidth is utilized and that it at least matches the computational power. Designs that use the full memory bandwidth and come as close as possible to the maximum computational power are the best choices. DSE provides the functionality to transition from a quantized CNN to an optimized FPGA design. The CDL IP core that accelerates CNNs is a parameterized FPGA core written in SystemVerilog. During the DSE phase, the parameters are extracted that define the best way for the CDL FPGA implementation to accelerate the neural network. The designer can also constrain the search, for example by limiting the design to half of the FPGA resources. This does not change, compress or prune the network structure that is implemented, but rather changes the way that the CDL core accelerates the neural network.
The Core Deep Learning accelerator is implemented on the FPGA only and does not require an external processor to perform any deep learning functionality (Figure 6). Several components make up the core. The core is accessed through APB and AXI interfaces, and data transferred from external memory is stored locally in buffers on the FPGA. These buffers are mainly implemented using embedded SRAM blocks. The computing engine performs all mathematical operations and utilizes the embedded math blocks and fabric. Lastly, the core implements several controllers to coordinate data transfer and work between the different components. The structure of the buffers, computing engine and controllers is optimized during the Design Space Exploration phase of the framework.
To demonstrate results, the CDL framework was used to implement several standard CNNs on a PolarFire FPGA from Microsemi (now part of Microchip Technology) with external DDR4 memory. LeNet-5, VGG-16, SqueezeNet and Tiny YOLO v2 were evaluated for utilization and performance (Table 1). Because the PolarFire FPGA family spans from 100k logic elements (LEs) to 500k LEs, the CDL IP can implement different design sizes. With an increase in LE density also comes additional internal RAM and more arithmetic blocks. Depending on design parameters such as maximum power, space constraints, cost and so on, various implementations can be tried.
When considering a deep learning core, one should consider the implementation of the network topology and the network performance. The performance of a network is measured in many ways, but the most common metrics are Gops/s and Gops/s/W. Gops/s stands for giga operations per second and is purely a performance measurement. A larger number means better performance, which shows up in how many frames per second a network can process. Gops/s/W, which is Gops/s per watt, measures the power efficiency of the deep learning core. Gops/s/W is an important criterion that allows comparison across different implementations, whether they are GPUs, processors or FPGAs.
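The relationship between these metrics and frame rate is simple arithmetic, sketched below. The operation count per frame, frame rate and power figure are hypothetical placeholders, not values from Table 1.

```python
def gops_per_second(ops_per_frame, frames_per_second):
    """Throughput in giga operations per second."""
    return ops_per_frame * frames_per_second / 1e9


def gops_per_second_per_watt(gops, watts):
    """Power efficiency in Gops/s/W."""
    return gops / watts


def frames_per_second(gops, ops_per_frame):
    """Frame rate a given throughput can sustain for one network."""
    return gops * 1e9 / ops_per_frame


OPS_PER_FRAME = 3_000_000_000   # hypothetical network: 3 billion ops per frame
throughput = gops_per_second(OPS_PER_FRAME, frames_per_second=20)
efficiency = gops_per_second_per_watt(throughput, watts=2.5)

print(throughput, efficiency)                    # 60.0 24.0
print(frames_per_second(120.0, OPS_PER_FRAME))   # 40.0
```

Because Gops/s/W normalizes throughput by power, it allows a small low-power device to be compared with a much larger one on equal terms.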
LeNet-5 was implemented with two convolutional layers and two fully connected layers, whereas SqueezeNet was implemented with 26 convolutional layers. Tiny YOLO used nine convolutional layers, and VGG-16 was implemented with 14 convolutional layers and two fully connected layers. SqueezeNet and Tiny YOLO use 1×1 convolutions in place of fully connected layers. The same CNN was used for both SqueezeNet results shown in Table 1. The CDL IP simply optimized the implementation differently to show the performance that can be achieved with larger resource utilization. Likewise, both Tiny YOLO results used the same CNN, and the two results differ because of the different resources utilized. These results compare very favorably, especially in applications that are power constrained. Many IoT and edge-based designs could benefit from the power efficiency of the CDL implementation.
When looking to implement a deep learning algorithm, there are many factors to define. Does the designer know what network to implement? Does the application have specific accuracy requirements? What is the size, cost and power budget available? Are there any minimum performance requirements? For many applications, developers should consider implementing CNNs in cost-optimized, mid-range density FPGAs. The Core Deep Learning IP demonstrates this is a compelling implementation for a wide range of applications.
[1] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, Vol. 86 No. 11, pp. 2278-2324, Nov 1998.
[2] T. Dettmers: “8-Bit Approximations for Parallelism in Deep Learning,” Computing Research Repository, Vol. abs/1511.04561, 2015.
[3] P. Gysel, M. Motamedi and S. Ghiasi: “Hardware-oriented Approximation of Convolutional Neural Networks,” Computing Research Repository, Vol. abs/1604.03168, 2016.
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • DECEMBER 2018 #341