FPGA Solutions Evolve to Meet AI Needs

Brainy System ICs

Long gone now are the days when FPGAs were thought of as simple programmable circuitry for interfacing and glue logic. Today, FPGAs are powerful system chips with on-chip processors, DSP functionality and high-speed connectivity.

By Jeff Child, Editor-in-Chief

FPGAs have evolved to the point that calling them “systems-on-chips” is redundant. It’s now simply a given that the high-end lines of the major FPGA vendors include general-purpose CPU cores. Moreover, the signal processing functionality on today’s FPGA chips is well suited to the kind of system-oriented DSP functions used in high-end computing. Better still, FPGAs have enabled AI (artificial intelligence) and machine learning functionality to be implemented in much smaller, embedded systems.

In fact, over the past 12 months, most of the leading FPGA vendors have been rolling out solutions specifically aimed at using FPGA technology to enable AI and machine learning in embedded systems. The two FPGA market leaders, Xilinx and Intel’s Programmable Solutions Group (formerly Altera), have certainly embraced this trend, as have many of their smaller competitors such as Lattice Semiconductor and QuickLogic. Meanwhile, specialists in so-called e-FPGA technology such as Achronix and Flex Logix have their own compelling twist on FPGA system computing.

Project Brainwave

Intel’s high-performance line of FPGAs, the Stratix 10 family, exemplifies the trend toward FPGAs facilitating AI processing. According to Intel, the Stratix 10 FPGAs are capable of 10 TFLOPS, or 10 trillion floating point operations per second (Figure 1). In May, Microsoft debuted its Azure Machine Learning Hardware Accelerated Models powered by Project Brainwave and integrated with the Microsoft Azure Machine Learning SDK. Azure’s architecture is developed with Intel FPGAs and Intel Xeon processors.

Figure 1
Stratix 10 FPGAs are capable of 10 TFLOPS or 10 trillion floating point operations per second.

Intel says its FPGA-powered AI achieves extremely high throughput and can run ResNet-50, an industry-standard deep neural network requiring almost 8 billion calculations, without batching. This is possible because the FPGA’s programmable hardware, including logic, DSP and embedded memory, enables any desired logic function to be easily programmed and optimized for area, performance or power. And because this fabric is implemented in hardware, it can be customized and can perform parallel processing. This makes it possible to achieve orders-of-magnitude performance improvements over traditional software or GPU design methodologies.
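
As a rough sanity check, the two figures quoted above (a 10 TFLOPS peak and almost 8 billion calculations per ResNet-50 inference) imply an upper bound on inference rate. The sketch below is back-of-envelope arithmetic only, not Intel’s or Microsoft’s methodology. It assumes each “calculation” counts as one floating point operation, and the function name and the `utilization` parameter are hypothetical, introduced here for illustration.

```python
# Back-of-envelope bound: how many inferences per second could a device
# sustain if every quoted FLOP went into the network's arithmetic?
# Assumption: one "calculation" == one floating point operation.

STRATIX10_PEAK_FLOPS = 10e12   # 10 TFLOPS, as quoted by Intel
RESNET50_OPS = 8e9             # "almost 8 billion calculations" per inference


def peak_inferences_per_second(peak_flops: float,
                               ops_per_inference: float,
                               utilization: float = 1.0) -> float:
    """Ideal inference rate at a given fraction of peak throughput.

    `utilization` is a hypothetical knob (0 < utilization <= 1) standing in
    for everything that keeps real designs below peak.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if ops_per_inference <= 0:
        raise ValueError("ops_per_inference must be positive")
    return peak_flops * utilization / ops_per_inference


ideal_rate = peak_inferences_per_second(STRATIX10_PEAK_FLOPS, RESNET50_OPS)
half_rate = peak_inferences_per_second(STRATIX10_PEAK_FLOPS, RESNET50_OPS, 0.5)

# Without batching, the ideal per-inference compute time is the reciprocal.
ideal_latency_ms = 1000.0 / ideal_rate

print(f"ideal rate: {ideal_rate:.0f} inferences/s")       # 1250 inferences/s
print(f"ideal latency: {ideal_latency_ms:.2f} ms")         # 0.80 ms
print(f"at 50% utilization: {half_rate:.0f} inferences/s")  # 625 inferences/s
```

These are ceilings rather than predictions: memory traffic, numeric precision and how well the network maps onto the fabric all pull real throughput below the ideal.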

In one application example, Intel cites an effort in which Canada’s National Research Council (NRC) is helping to build the next-generation Square Kilometer Array (SKA) radio telescope, to be deployed in remote regions of South Africa and Australia, where viewing conditions are best for astronomical research. The SKA will be the world’s largest radio telescope, 10,000 times faster and with image resolution 50 times greater than the best radio telescopes we have today. This increased resolution and speed result in an enormous amount of image data: the telescopes will process the equivalent of a year’s data on the Internet every few months.

NRC’s design embeds Intel Stratix 10 SX FPGAs at the Central Processing Facility located at the SKA telescope site in South Africa to perform real-time processing and analysis of collected data at the edge. High-speed analog transceivers allow signal data to be ingested in real time into the core FPGA fabric. After that, the programmable logic can be parallelized to execute any custom algorithm optimized for power efficiency, performance or both, making FPGAs the ideal choice for processing massive amounts of real-time data at the edge.

ACAP for Next Gen

For its part, Xilinx’s high-performance product line is its Virtex UltraScale+ device family (Figure 2). According to the company, these provide the highest performance and integration capabilities in a FinFET node, including the highest signal processing bandwidth at 21.2 TeraMACs of DSP compute performance. They deliver on-chip memory density with up to 500 Mb of total on-chip integrated memory, plus up to 8 GB of HBM Gen2 integrated in-package for 460 GB/s of memory bandwidth. Virtex UltraScale+ devices provide capabilities with integrated IP for PCI Express, Interlaken, 100G Ethernet with FEC and Cache Coherent Interconnect for Accelerators (CCIX).

Figure 2
Virtex UltraScale+ FPGAs provide signal processing bandwidth of 21.2 TeraMACs. They deliver up to 500 Mb of total on-chip integrated memory, plus up to 8 GB of HBM Gen2 integrated in-package for 460 GB/s of memory bandwidth.
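
The compute and memory figures quoted for Virtex UltraScale+ can be related with a simple roofline-style estimate, which shows how many multiply-accumulate operations a workload must perform per byte fetched from HBM before the DSP resources, rather than memory bandwidth, become the limit. This is a minimal sketch under stated assumptions, not a Xilinx sizing method: it takes 1 GB as 10^9 bytes, assumes all operand traffic comes from HBM (ignoring the on-chip memory), and uses hypothetical function names.

```python
# Roofline-style estimate from the quoted Virtex UltraScale+ figures.
# Assumptions: 1 GB = 1e9 bytes; all operands stream from HBM; on-chip
# memory reuse is ignored. Names below are illustrative only.

PEAK_MACS_PER_S = 21.2e12   # 21.2 TeraMACs of DSP compute
HBM_BYTES_PER_S = 460e9     # 460 GB/s of HBM Gen2 bandwidth


def ridge_point(peak_macs: float, bandwidth: float) -> float:
    """MACs per byte at which compute and memory limits coincide."""
    return peak_macs / bandwidth


def attainable_macs_per_s(intensity: float,
                          peak_macs: float = PEAK_MACS_PER_S,
                          bandwidth: float = HBM_BYTES_PER_S) -> float:
    """Upper bound on MAC rate for a workload doing `intensity` MACs/byte."""
    if intensity < 0:
        raise ValueError("intensity must be non-negative")
    return min(peak_macs, bandwidth * intensity)


ridge = ridge_point(PEAK_MACS_PER_S, HBM_BYTES_PER_S)
print(f"ridge point: {ridge:.1f} MACs per byte")

for macs_per_byte in (10, 100):
    rate = attainable_macs_per_s(macs_per_byte)
    limit = "memory-bound" if macs_per_byte < ridge else "compute-bound"
    print(f"{macs_per_byte:>3} MACs/byte -> {rate / 1e12:.1f} TeraMACs ({limit})")
```

Under these assumptions, a workload below the ridge point is limited by HBM bandwidth, which is one reason data reuse in on-chip memory matters for reaching the quoted peak.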

Looking to the next phase of system performance, Xilinx in March announced its strategy toward a new FPGA product category it calls its adaptive compute acceleration platform (ACAP). Touted as going beyond the capabilities of an FPGA, an ACAP is a highly integrated multi-core heterogeneous compute platform that can be changed at the hardware level to adapt to the needs of a wide range of applications and workloads. An ACAP’s adaptation, which can be performed dynamically during operation, delivers levels of performance and performance per watt that are unmatched by CPUs or GPUs, says Xilinx …

Read the full article in the August 337 issue of Circuit Cellar
