Design Solutions Research & Design Hub

Understanding the Role of Inference Engines in AI

Written by Geoff Tate

Benchmarks and Batching

Artificial Intelligence offers huge benefits for embedded systems. But implementing AI well requires making smart technology choices, especially when it comes to selecting a neural inferencing engine. In this article, Flex Logix CEO Geoff Tate explains what inferencing is, how it fits into AI and how embedded system designers can make sure they are using the right solution for their AI processing.

Clearly, software and hardware companies alike recognize that AI—if done right—represents a huge market potential. The question is: How do you do it right? The answer lies in the neural inferencing engines being developed that will power AI in the future. Similar to the engine in an automobile, the inferencing engine determines how well, how fast and how efficiently the vehicle runs.

AI, machine learning and deep learning are all terms for neural networks, which are designed to classify objects into categories after a training phase. What is becoming clear is that the traditional Von Neumann processor architecture is not optimal for neural networks. Artificial intelligence [1] requires powerful chips both for training (learning from large data sets) and for inference (computing answers). Inference is the part of machine learning in which the neural network uses what it learned during the training phase to deliver answers to new problems. This is particularly important in edge applications, which we define as anything outside of the data center.

The edge inferencing market is expected to be one of the biggest over the next five years. Typical applications may include smart surveillance cameras and real-time object recognition, autonomous driving cars (Figure 1) and other IoT devices. In the past, most inferencing engines were developed for the data center. However, the movement of AI to the edge of the network requires a new generation of specialized processors that are scalable, cost effective and consume extremely low power.

FIGURE 1 – Applications requiring edge inferencing technology may include smart surveillance cameras and real-time object recognition, autonomous driving cars and other IoT devices.

Picking the right inferencing engine is a critical factor in developing effective AI solutions. When looking at AI, it's all about throughput, and good inferencing engines provide very high throughput. The problem, however, is that companies don't know how to distinguish a good inferencing engine from a bad one. Without established standard benchmarks, companies have been throwing out random performance figures that don't really relate to overall throughput. Unless designers have spent significant time in this space, they are going to have a hard time figuring out which benchmarks really matter.

When the performance of inferencing engines comes up, vendors often cite benchmarks such as TOPS (tera-operations per second) and TOPS/W. System and chip designers looking into these soon realize that such figures are generally meaningless on their own. What really matters is the throughput an inferencing engine can deliver for a given model, image size, batch size and PVT (process/voltage/temperature) conditions. This is the number one measure of how well it will perform, yet amazingly few vendors provide it.

The biggest problem with TOPS is that when a company says its engine does X TOPS, it typically quotes the number without stating the conditions under which it was measured. Without that information, buyers may erroneously believe that an X-TOPS rating means the engine will actually perform X trillion operations per second on their workload. In reality, a company quoting 130 TOPS may deliver only 27 TOPS of usable throughput.
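The gap between rated and usable TOPS can be sketched in a few lines. The 130-TOPS/27-TOPS figures come from the article's example; the utilization value used to connect them is an assumption for illustration:

```python
# Rough sketch: rated TOPS vs. usable throughput.
# The utilization figure here is an illustrative assumption, not a
# measurement of any real engine.

def usable_tops(rated_tops: float, mac_utilization: float) -> float:
    """Usable throughput is the rated peak scaled by actual MAC utilization."""
    return rated_tops * mac_utilization

# An engine rated at 130 TOPS that achieves only ~21% MAC utilization
# on a real model delivers roughly 27 TOPS of usable throughput:
print(usable_tops(130, 0.208))
```

The point is that the rated number is an upper bound; without the utilization achieved on your model, it tells you little.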

Another benchmark in use, though less commonly, is ResNet-50. The problem with this benchmark is that most companies quoting it don't give batch sizes. When they don't provide this information, a chip designer should assume the figure was measured at a large batch size that maximizes the hardware utilization percentage. This makes ResNet-50 not very helpful as a benchmark. In contrast, YOLOv3, for example, requires roughly 100 times more operations to process a 2-Megapixel image, and hardware utilization will be even more challenged on such "real-world" models.

The biggest difference between ResNet-50 and YOLOv3 is the choice of image size. For example, if ResNet-50 is run on 2-Megapixel images like YOLOv3, the MACs/image increase to 103 billion and the largest activation grows to 33.6 MB. On large images, ResNet-50's characteristics look close to YOLOv3's.

There are several key factors to look at when evaluating neural inferencing engines. The first requirement is to define what an operation is. Some vendors count a multiply (typically INT8 × INT8) as one operation and an accumulation (an addition, typically INT32) as one operation, so a single multiply-accumulate (MAC) equals two operations. However, some vendors include other types of operations in their TOPS specification, so this must be clarified up front.
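Under the MAC-equals-two-operations convention above, peak TOPS follows directly from the MAC count and clock rate. A minimal sketch (the 4,096-MAC, 1 GHz array is a hypothetical example, not any specific product):

```python
# Sketch: converting a MAC rate to TOPS when each multiply-accumulate
# counts as two operations (multiply + add), as described above.

def tops_from_macs(macs_per_second: float, ops_per_mac: int = 2) -> float:
    """Peak TOPS for a given MAC rate under the stated ops-per-MAC convention."""
    return macs_per_second * ops_per_mac / 1e12

# A hypothetical array of 4,096 MACs clocked at 1 GHz:
print(tops_from_macs(4096 * 1e9))  # 8.192 TOPS peak, at 100% utilization
```

If a vendor counts other operations (activations, pooling) in its TOPS figure, the same MAC array will appear to have a higher rating, which is exactly why the convention must be pinned down first.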

It’s also important to define the operating conditions. If a vendor gives a TOPS value without stating the conditions, it is likely assuming room temperature, nominal voltage and typical process. Vendors will usually mention which process node they mean, but operating speeds differ between offerings, and most processes are available at 2, 3 or more nominal voltages. Since performance is a function of frequency, and frequency is a function of voltage, a chip can deliver more than twice the performance at 0.9 V as at 0.6 V. Frequency thus varies widely depending on the conditions and assumptions.

Batch size is also important. Even if a vendor provides worst-case TOPS, chip designers need to figure out whether all of those operations actually contribute to computing their neural network models. In reality, the actual utilization can be very low, because no inferencing engine achieves 100% utilization of all of the MACs all of the time. That is why batch size matters. In batching, the weights for a given layer are loaded once and multiple data sets are processed through that layer at the same time. The reason to do this is to improve throughput, but the trade-off is longer latency (Figure 2). ResNet-50 has over 20 million weights, YOLOv3 has over 60 million, and every weight must be fetched and loaded into the MAC structure for every image. There are too many weights to keep them all resident in the MAC structure.

FIGURE 2 – Batching improves throughputs, but the trade-off is longer latency.
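The throughput/latency trade-off in Figure 2 can be captured with a toy model. All the timing numbers below are illustrative assumptions, not measurements of any real engine:

```python
# Toy model of the batching trade-off: load a layer's weights once,
# then push `batch` images through that layer before reloading.
# weight_load_us and compute_us are purely illustrative values.

def layer_time_us(weight_load_us: float, compute_us: float, batch: int) -> float:
    """Total time to load weights once and process a batch through one layer."""
    return weight_load_us + batch * compute_us

def throughput_and_latency(weight_load_us=100.0, compute_us=10.0, batch=1):
    t = layer_time_us(weight_load_us, compute_us, batch)
    throughput = batch / t  # images per microsecond through this layer
    latency = t             # an image may wait for the whole batch to finish
    return throughput, latency

for b in (1, 8, 32):
    tp, lat = throughput_and_latency(batch=b)
    print(f"batch={b}: throughput={tp:.4f} img/us, latency={lat} us")
# Larger batches amortize the weight load (higher throughput),
# but every image now waits longer (higher latency).
```

This is why a throughput figure quoted at a large batch size can mask latency that is unacceptable for a real-time edge application.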

When it comes to MAC utilization, not all neural networks behave the same. It’s important to find out the actual MAC utilization for the neural inference engine for the neural network model you want to deploy, at the batch size you require.



When it comes to inferencing engines, it’s important that they properly manage the movement of data in memory: first, to keep the MACs supplied with weights and activations and thereby achieve the highest hardware utilization; and second, to do so using the least power possible.

Table 1 depicts two popular neural network models to examine the challenge of memory in inference. We’ll assume the Winograd Transformation is used for the popular 3×3, stride-1 convolutions. As Table 1 shows, caching has little benefit for neural models, which is very different from traditional processor workloads, where data is re-used heavily. The 22.7 million weights are cycled through once and not re-used until the next image, so a weights cache only helps if it holds all of the weights; a smaller weights cache simply flushes continuously. Activations are similar: in some models a few activations are used again in later stages, but for the most part activations are generated and consumed immediately, only to feed the next stage.

TABLE 1 – The table depicts two popular neural network models to examine the challenge of memory in inference.

Therefore, for each image processed by ResNet-50, the memory transactions required are listed below, assuming for now that all memory references are to/from DRAM:

• 0.15 MB input image read in
• 22.7 MB weights read in (assuming 8-bit integers, which is the norm)
• 9.3 MB of activations written cumulatively at the end of all of the stages
• All but the last activation read back in for the next stage: another almost 9.3 MB

This gives a total of 41.4 MB of memory reads/writes per image. We ignore here the memory traffic for the accelerator’s code, since no data is available for any architecture; code may benefit from caching. Note that memory references to DRAM use about 10-100x the power of references to an on-chip SRAM.
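The line items above can be tallied in a few lines (figures are the article's; the activation read-back is "almost" 9.3 MB and is rounded here):

```python
# Per-image DRAM traffic for ResNet-50 in the all-DRAM case,
# using the figures listed in the article (8-bit weights).

resnet50_traffic_mb = {
    "input_image":           0.15,
    "weights_read":          22.7,
    "activations_written":   9.3,
    "activations_read_back": 9.3,   # "almost" 9.3 MB; rounded for this sketch
}

total_mb = sum(resnet50_traffic_mb.values())
print(f"ResNet-50 DRAM traffic per image: {total_mb:.1f} MB")
# Totals roughly 41.4 MB, matching the article's figure.
```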

To reduce DRAM bandwidth, there are two options for ResNet-50. First, add enough SRAM to store all 22.7 MB of weights on chip. Second, add on-chip SRAM to store intermediate activations, so that stage X writes to the activation cache and stage X+1 reads from it. For ResNet-50 the largest intermediate activation is 0.8 MB, so 1 MB of SRAM eliminates about half of the DRAM traffic.
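The effect of each option on the 41.4 MB total can be sketched directly from the article's figures (the arithmetic is approximate, since the activation read-back is "almost" 9.3 MB):

```python
# Sketch of the two ResNet-50 SRAM options described above,
# using the article's per-image traffic figures.

total_mb = 41.4
weights_mb = 22.7
activation_traffic_mb = 9.3 + 9.3  # written once, read back once

# Option 1: hold all weights on chip -> weight traffic disappears.
print(f"with on-chip weights:    ~{total_mb - weights_mb:.1f} MB DRAM traffic")

# Option 2: a ~1 MB activation SRAM (largest activation is 0.8 MB) lets
# stage X write and stage X+1 read on chip -> activation traffic disappears,
# roughly halving the DRAM traffic.
print(f"with activation SRAM:    ~{total_mb - activation_traffic_mb:.1f} MB DRAM traffic")
```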


Now let’s look at YOLOv3 to see the DRAM traffic needed without on-chip SRAM:

• 6 MB input image size (remember each pixel has 3 bytes for RGB).
• 61.9 MB weights read in
• 475 MB activations generated cumulatively as output of all of the stages written to DRAM.
• 475 MB activations read back in for the next layer.

This gives a total of 1,018 MB, or about 1 GB, of DRAM traffic to process just one image! Much more SRAM is required to reduce DRAM bandwidth: 62 MB for weights caching and, since the largest intermediate activation is 64 MB, another 64 MB for activation caching. This would eliminate DRAM bandwidth entirely, but 128 MB of SRAM in 16 nm is about 140 mm2 of silicon, which is very expensive.
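Summing the YOLOv3 line items above the same way shows why the traffic is so much worse than ResNet-50's:

```python
# Per-image DRAM traffic for YOLOv3 with no on-chip SRAM,
# using the figures listed in the article.

yolov3_traffic_mb = {
    "input_image":           6.0,    # 2 Mpixels x 3 bytes/pixel (RGB)
    "weights_read":          61.9,
    "activations_written":   475.0,
    "activations_read_back": 475.0,
}

total_mb = sum(yolov3_traffic_mb.values())
print(f"YOLOv3 DRAM traffic per image: {total_mb:.0f} MB")
# Totals about 1,018 MB -- roughly 1 GB per image, some 25x ResNet-50's 41.4 MB.
```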

The practical option for cost-effective designs is an activation cache big enough for most layers. Only one layer has a 64 MB activation output, two layers have 32 MB activation outputs, and four layers have 16 MB activation outputs; all the rest are 8 MB or less. Thus, there is a trade-off between activation cache size and DRAM bandwidth.

For weights there is no trade-off: either store all 61.9 MB on chip or keep them all in DRAM. You can also see why YOLOv3 doesn’t run faster with batch sizes greater than 1: multiple batches require saving multiple sets of activations, and the activations are too big.


The industry is trending toward larger models and larger images, which makes YOLOv3 more representative of the future of inference acceleration. Using on-chip memory effectively will be critical for low cost/low power inference.

Today, many vendors promote inferencing engines, but hardly any provide ResNet-50 benchmarks. Typically, the only figures they state are TOPS and TOPS/W. These two indicators of performance and power efficiency are almost useless by themselves, because what counts is the throughput an inference engine can deliver for your model, your image size, your batch size and your process and PVT conditions. As a rule of thumb, when you are looking at these engines, remember that a good inferencing engine has high MAC utilization, consumes very low power and lets you keep the design small.


[1] Artificial Intelligence

Flex Logix Technologies



Geoff Tate is CEO/Cofounder of Flex Logix Technologies. He earned a BSc in Computer Science from the University of Alberta and an MBA from Harvard University. Prior to cofounding Rambus in 1990, Geoff served as Senior Vice President of Microprocessors and Logic at AMD.


Copyright © KCK Media Corp.
All Rights Reserved
