FPGAs and GPUs Flex Their AI Muscles: Chip-Level Supercomputing

Written by Jeff Child

AI is quickly infiltrating every corner of embedded system design. But implementing the technology can be challenging. To help smooth the way, processing solutions including FPGAs, GPUs and dedicated AI SoCs are evolving to meet the needs of AI edge applications.

  • How processing solutions including FPGAs, GPUs and dedicated AI SoCs are evolving to meet the needs of AI edge applications
  • Xilinx 7nm Versal ACAP
  • Flex Logix InferX X1
  • Lattice sensAI 3.0
  • QuickLogic QuickAI
  • Intel Agilex FPGAs
  • Achronix Semiconductor Speedster7t FPGAs
  • Microchip VectorBlox Accelerator SDK
  • Nvidia EGX A100 and EGX Jetson Xavier NX

Once relegated to the realm of supercomputers built from rooms full of compute racks, today’s artificial intelligence (AI), machine learning (ML) and neural networking technologies are now routinely implemented at the chip and board level. Supplying the technology for these efforts are the makers of leading-edge embedded processors, FPGAs and GPUs.

In these tasks, GPUs are being used for “general-purpose computing on GPUs,” a technique also known as GPGPU computing. And, while the trend toward FPGAs with general-purpose CPU cores embedded on them is nothing new, the latest crop of FPGAs have shifted their architectures toward configurations that support AI and ML types of processing.

To keep pace with demands for “intelligent” system functionality, embedded processor, GPU and FPGA companies have rolled out a variety of solutions over the last 12 months, aimed at performing AI, ML and other advanced computing functions for several demanding embedded system application segments.

MACHINE LEARNING FOR AV
Two years ago, leading FPGA vendor Xilinx introduced its adaptive compute acceleration platform (ACAP) approach to FPGAs. Its Versal line of ACAP devices combines scalar processing engines and adaptable hardware engines with enhanced memory and interfacing technologies to provide heterogeneous acceleration for any application. Since then, the company has been enhancing this technology, targeting a variety of applications.

Along just such lines, in February, the company announced a range of new and advanced ML capabilities for Xilinx devices targeted at the professional audio/video (Pro AV) and broadcast markets. At the same time, Xilinx unveiled the industry’s first demonstration of a programmable HDMI 2.1 implementation on its 7nm Versal devices (Figure 1). These and other highly adaptable Xilinx solutions for the Pro AV and broadcast markets are designed to help customers reduce costs and future-proof investments while keeping pace with new usage models and evolving industry standards, says Xilinx.

FIGURE 1 – Targeted at the professional audio/video (Pro AV) and broadcast markets, a new set of advanced machine learning (ML) capabilities is available for the Xilinx 7nm Versal ACAP FPGAs. The new ML capabilities for Xilinx Pro AV and broadcast platforms include region-of-interest encoding, intelligent digital signage, automatic object tracking and window cropping and speech recognition.

The new ML capabilities for Xilinx Pro AV and broadcast platforms include region-of-interest encoding, intelligent digital signage, automatic object tracking and window cropping and speech recognition. System developers can now take advantage of these ML capabilities on Xilinx devices, including the already highly integrated Zynq UltraScale+ MPSoC platform, for AI edge processing. The combination of real-time audio and video processing, AV connectivity interfaces, codecs, IP networking, CPU and GPU into an adaptable and scalable single-chip solution can provide users with significant space, power and cost savings, says Xilinx.

Pro AV system developers can apply the new ML capabilities across many applications and workloads. Region-of-interest encoding can be used to detect faces and features using ML and the Zynq UltraScale+ MPSoC integrated H.264/H.265 codec to keep video quality high in those areas and apply higher compression for backgrounds. This reduces the overall bitrate and saves significant costs in live streaming. ML models for gender, age and gesture detection can be used to provide targeted interactive advertising in digital signage. The result is a higher return on investment for advertisers as well as monetizable behavior metrics.
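The region-of-interest idea described above can be sketched in a few lines. This is a minimal illustration that assumes a hypothetical encoder accepting a per-block quantization parameter (QP) map; the function name, block size and QP values are illustrative assumptions and are not taken from Xilinx tools. Blocks that overlap a detected face receive a lower QP (higher quality), and all other blocks receive a higher QP (more compression).

```python
def build_qp_map(frame_w, frame_h, face_boxes, block=64, roi_qp=22, bg_qp=38):
    """Return a 2D list of per-block QP values for one video frame.

    face_boxes: list of (x, y, w, h) pixel rectangles from an ML face detector.
    Lower QP means finer quantization, i.e. higher quality and more bits.
    All names and default values are hypothetical, for illustration only.
    """
    cols = (frame_w + block - 1) // block  # ceiling division
    rows = (frame_h + block - 1) // block
    # Start with every block treated as background (heavier compression).
    qp_map = [[bg_qp] * cols for _ in range(rows)]
    for x, y, w, h in face_boxes:
        if w <= 0 or h <= 0:
            continue  # ignore degenerate boxes
        # Clamp the box to the block grid.
        c0 = max(x // block, 0)
        c1 = min((x + w - 1) // block, cols - 1)
        r0 = max(y // block, 0)
        r1 = min((y + h - 1) // block, rows - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                qp_map[r][c] = roi_qp  # keep quality high where a face is
    return qp_map


# One 200x200-pixel face detected in a 1080p frame.
qp = build_qp_map(1920, 1080, [(100, 100, 200, 200)])
roi_blocks = sum(row.count(22) for row in qp)
print(f"{roi_blocks} of {len(qp) * len(qp[0])} blocks kept at high quality")
```

Because only a small fraction of blocks is encoded at the low QP, most of the frame can be compressed harder, which is where the bitrate saving comes from.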

With data rates up to 48Gbps, HDMI 2.1 supports a range of higher video resolutions and frame rates, including 8K60 and 4K120, for a more immersive experience in LED walls, large format displays, digital signage and applications that rely on fast-moving video such as live sports. Developed in preparation for this, HDMI 2.1 support on Versal ACAP allows customers to transmit, receive and process up to 8K (7680 x 4320 pixels) ultra-high-definition (UHD) video.
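As a rough check on those numbers, the active-pixel data rate of each format can be computed directly. The sketch below assumes 24 bits per pixel (8-bit RGB with no chroma subsampling) and ignores blanking intervals and link-coding overhead, so it is a simplified estimate rather than a full HDMI link budget.

```python
def raw_video_gbps(width, height, fps, bits_per_pixel=24):
    """Active-pixel data rate in Gbps (1 Gbps = 1e9 bits/s).

    Assumption: blanking and link-coding overhead are ignored.
    """
    return width * height * fps * bits_per_pixel / 1e9


formats = {"8K60": (7680, 4320, 60), "4K120": (3840, 2160, 120)}
for name, (w, h, fps) in formats.items():
    print(f"{name}: {raw_video_gbps(w, h, fps):.1f} Gbps")
# 8K60: 47.8 Gbps
# 4K120: 23.9 Gbps
```

Under these assumptions the 8K60 figure lands just under 48Gbps before any overhead is counted, so the pixel format and any compression matter in a real design.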

The 7nm Versal ACAP offers scalar, adaptable and intelligent engines in a software-programmable silicon infrastructure, providing a tight coupling between embedded software, multichannel AV processing pipelines and AI inferencing. Additionally, combining real-time video processing, AV connectivity interfaces, codecs, IP networking, CPU, GPU and more into a single-chip solution means that developers can save space, power and cost, says Xilinx.

AI INFERENCE ENGINE
Flex Logix Technologies last fall announced its InferX X1 edge inference co-processor. According to the company, it has been optimized for what the edge needs: large models and large models at batch=1. The device is programmed using TensorFlow Lite and ONNX, and a performance modeler for it is available now.

InferX X1 is based on Flex Logix’s nnMAX architecture integrating four tiles for 4K MACs and 8MB L2 SRAM. InferX X1 connects to a single x32 LPDDR4 DRAM (Figure 2). Four lanes of PCIe Gen3 connect to the host processor. A x32 GPIO link is available for hosts without PCIe. Two X1s can work together to increase throughput up to 2x. The InferX X1 is expected to begin sampling in Q3 2020 as a chip and as a PCIe board.

In April, Flex Logix announced real-world benchmarks for the InferX X1, showing significant price/performance advantages when compared to Nvidia’s Tesla T4 and Xavier NX when run on actual customer models. The details were presented at the Linley Spring Processor Conference. The InferX X1 has a very small die size: 1/7th the area of Nvidia’s Xavier NX and 1/11th the area of Nvidia’s Tesla T4, says Flex Logix. Despite being so much smaller, the InferX X1 has latency for YOLOv3, an open source model that many customers plan to use, similar to Xavier NX. On two real customer models, InferX X1 was much faster, as much as 10x faster in one case, says Flex Logix.

LOW POWER FOR EDGE AI
For its part, Lattice Semiconductor’s FPGA-based AI technologies place a strong emphasis on low power. In May, the company launched the latest version of its complete solutions stack for on-device AI processing at the Edge, Lattice sensAI 3.0 (Figure 3). The latest version of the stack includes support for the CrossLink-NX family of FPGAs for low-power smart vision applications and features customized convolutional neural network (CNN) IP, a flexible accelerator IP that simplifies implementation of common CNN networks and is optimized to further leverage the parallel processing capabilities of FPGAs.

FIGURE 3 – The Lattice sensAI 3.0 version of its complete solutions stack for on-device AI processing includes support for the CrossLink-NX family of FPGAs for low power smart vision applications and features customized convolutional neural network (CNN) IP, a flexible accelerator IP that simplifies implementation of common CNN networks and is optimized to further leverage the parallel processing capabilities of FPGAs.

To address data security, latency and privacy issues, developers want to move the AI processing that powers their smart vision and other AI applications from the cloud to the edge, says Lattice. And most edge devices are battery-powered or otherwise sensitive to power consumption, so developers need hardware and software solutions that deliver the processing capabilities needed for AI applications, while keeping power consumption as low as possible. By enhancing the sensAI stack, Lattice hopes to widen the range of power and performance options available to embedded developers. For applications like smart vision that require higher Edge AI performance, Lattice CrossLink-NX FPGAs running sensAI software deliver twice the performance at half the power when compared to prior releases of the solutions stack.

CUSTOMIZED CNN ACCELERATOR IP
The new and updated features of the sensAI solutions stack include several elements. The stack now supports a customized CNN accelerator IP running on a CrossLink-NX FPGA that takes advantage of the underlying parallel processing architecture of the FPGA. Updates to the NN (neural network) compiler software tool let developers easily compile a trained NN model and download it to a CrossLink-NX FPGA. A VGG-based object counting demo operating on a CrossLink-NX FPGA delivers 10fps while consuming only 200mW. Object counting is a common smart vision application used in surveillance/security, industrial, automotive and robotics systems.
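To put the object-counting demo figures in battery terms, the short sketch below converts 200mW at 10fps into energy per frame and estimates runtime. The battery capacity is a hypothetical value chosen only for illustration, and the estimate counts the FPGA's reported power alone, not the rest of a system.

```python
def energy_per_frame_mj(power_mw, fps):
    """Average energy per processed frame in millijoules (mW / fps = mJ per frame)."""
    return power_mw / fps


def runtime_hours(battery_wh, power_mw):
    """Hours of operation from a battery of the given watt-hour capacity."""
    return battery_wh / (power_mw / 1000.0)


print(f"{energy_per_frame_mj(200, 10):.0f} mJ per frame")  # 20 mJ per frame

# Hypothetical 3.7 Wh cell (roughly a 1000mAh, 3.7V battery), FPGA load only.
print(f"{runtime_hours(3.7, 200):.1f} hours")  # 18.5 hours
```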

When running on a CrossLink-NX FPGA, the sensAI solutions stack offers up to 2.5Mb of distributed memory and block RAM and additional DSP resources for efficient on-chip implementation of AI workloads to reduce the need for cloud-based analytics. CrossLink-NX FPGAs are manufactured in a 28nm FD-SOI process that provides a 75% reduction in power in comparison to similar competing FPGAs.

Many components—image sensors, applications processors and so forth—used in smart vision systems require support for the MIPI I/O standard. One of the target applications for sensAI is smart vision, and CrossLink-NX devices are currently the only low-power FPGAs to deliver MIPI I/O speeds of up to 2.5Gbps, says Lattice. This makes the FPGAs well suited as a hardware platform for sensAI applications requiring MIPI support. The FPGA’s I/Os offer instant-on performance and are able to configure themselves in less than 3ms, with full-device configuration in as little as 8ms. Previous versions of sensAI supported the VGG and MobileNet v1 neural network models. The latest version of the stack adds support for the MobileNet v2, SSD and ResNet models on the Lattice ECP5 family of general-purpose FPGAs.

AI TECH FIGHTS COVID-19
For its part, QuickLogic’s FPGA-based AI solution revolves around its QuickAI platform for endpoint AI applications. The QuickAI platform features technology, software and toolkits from various AI software providers, including QuickLogic’s own subsidiary SensiML.

In May, SensiML announced that it is collaborating on an effort to use its AI technology to help predict whether people are showing symptoms of COVID-19 infection. One such capability involves using crowdsourcing to collect cough sounds from a large number of volunteers and then analyzing that data combined with other datasets from the consortium using the SensiML Analytics Toolkit to identify the unique cough patterns associated with COVID-19 infections (Figure 4). The goal of the initiative is to give businesses, governments, healthcare and other public facilities access to multi-sensor pre-diagnostic screening mechanisms to help slow the spread of the disease.

FIGURE 4 – SensiML is collaborating on an effort to use its AI technology to help predict whether people are showing symptoms of COVID-19 infection. The SensiML Analytics Toolkit is used to identify the unique cough patterns associated with COVID-19 infections.

The initiative is supported by a consortium of companies, universities and health organizations including Asymmetric Return Capital, SkyWater Technology and Upward Health, an in-home and virtual care medical provider. In addition to its work with the consortium to build an enhanced screening application, SensiML plans to open-source its own crowdsourced cough sound dataset for researchers at large to access.

The concept of utilizing AI for pre-diagnostic screening of cough acoustic samples has been studied and validated in recently published academic research and is supported by ongoing projects at multiple esteemed universities, says QuickLogic. Early published results suggest that using AI to identify coughs as a COVID-19 screening mechanism has significant potential, because the pathomorphology of the disease is distinctive from that of other respiratory diseases.

5G-ERA FPGA SOLUTION
Long gone are the days when FPGAs were simple devices used for interfaces. Today’s leading-edge FPGAs can do AI processing while supporting the speeds of the 5G era and enabling accelerated data analytics. Along those lines, in August, Intel began shipments of the first Intel Agilex line of FPGAs (Figure 5).

FIGURE 5 – Agilex FPGAs have computation and high-speed interfacing capabilities that enable the creation of smarter, higher bandwidth networks and help deliver real-time actionable insights via accelerated AI and other analytics performed at the edge, in the cloud and throughout the network.

In the data-centric, 5G-fueled era, networking throughput must increase, and latency must decrease, says Intel. The Agilex FPGAs provide the flexibility and agility required to meet these challenges by delivering significant gains in performance and inherent low latency. Reconfigurable and with reduced power consumption, Intel Agilex FPGAs have computation and high-speed interfacing capabilities that enable the creation of smarter, higher bandwidth networks and help deliver real-time actionable insights via accelerated AI and other analytics performed at the edge, in the cloud and throughout the network.

The Intel Agilex family combines several Intel technologies including the second-generation HyperFlex FPGA fabric built on Intel’s 10nm process, and heterogeneous 3D silicon-in-package (SiP) technology based on Intel’s embedded multi-die interconnect bridge (EMIB) technology. This combination of advanced technologies allows Intel to integrate analog, memory, custom computing, custom I/O and Intel eASIC device tiles into a single package along with the FPGA fabric.

Intel Agilex FPGAs are the first FPGAs to support Compute Express Link (CXL), a cache and memory coherent interconnect to future Intel Xeon Scalable processors. The FPGAs support hardened BFLOAT16, with up to 40 teraflops of DSP performance. The FPGAs support PCIe Gen 5 as well as up to 112Gbps data rates for high-speed networking requirements for 400GE and beyond. Support is provided for several memory types, including the current DDR4 and upcoming DDR5, HBM and Intel Optane DC persistent memory. Design development tool support for Intel Agilex FPGAs is available today via Intel Quartus Prime Design Software.
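For readers unfamiliar with the BFLOAT16 number format mentioned above, the following software-only sketch shows what it stores: the sign bit, the 8 exponent bits and the top 7 mantissa bits of a 32-bit float. The function names are illustrative, NaN and infinity handling is simplified, and nothing here reflects how Intel implements the format in hardware.

```python
import struct


def float_to_bfloat16_bits(x):
    """Round a Python float to bfloat16 and return the 16-bit pattern.

    bfloat16 is the upper 16 bits of the IEEE 754 float32 encoding.
    Rounding is round-to-nearest-even; NaN/Inf inputs are not handled specially.
    """
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    # Bias so that the dropped 16 bits round to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(bits16):
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


bits = float_to_bfloat16_bits(3.14159)
print(hex(bits), bfloat16_bits_to_float(bits))  # 0x4049 3.140625
```

The example shows the trade-off: the value keeps float32's dynamic range but only about two to three decimal digits of precision, which is often acceptable for neural network workloads.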

FPGA ACCELERATION
There’s no doubt that FPGAs are excellent implementation vehicles for AI/ML algorithms. FPGAs are both programmable and reconfigurable, giving them an advantage over GPUs when implementing newly developed AI/ML algorithms. With that in mind, Achronix Semiconductor says its Speedster7t FPGAs were designed for AI/ML application acceleration. Built with high-performance SerDes, high-bandwidth memory interfaces, dedicated ML processors and high-speed PCIe Gen5 ports, the Speedster7t FPGA family can handle the most demanding workloads.

Speedster7t FPGAs were designed to address key AI/ML design challenges. The devices can ingest massive amounts of data from multiple high-speed input sources. They can then store and retrieve this input data, along with the DNN models, partial results from each layer computation, and completed computations. Those can then be rapidly distributed to on-chip resources that can perform the layer computations quickly. Finally, the computed results are output in a high-speed fashion.

Speedster7t FPGAs feature a 2D network-on-chip (NoC) with greater than 20Tbps bandwidth capacity for moving data from the high-speed interfaces to and across the FPGA fabric. The 2D NoC alleviates data bottlenecks with 256-bit, unidirectional buses in each direction for a total of 512Gbps for each NoC row and column. The primary interface for the NoC is industry-standard AXI channels.
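The bus width and bandwidth quoted above imply a transfer rate, which the short calculation below makes explicit. It assumes the 512Gbps figure is the rate of a single 256-bit bus; this is an inference from the two numbers in the text, not an Achronix specification.

```python
def implied_transfers_per_second(bandwidth_bps, bus_width_bits):
    """Transfers per second needed for a bus of the given width to reach the bandwidth."""
    return bandwidth_bps / bus_width_bits


# Assumption: 512Gbps is carried by one 256-bit unidirectional bus.
rate = implied_transfers_per_second(512e9, 256)
print(f"{rate / 1e9:.1f} GT/s")  # 2.0 GT/s
```

If the 512Gbps were instead the sum of both directions, the same arithmetic would give half that rate per bus.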

In October, Achronix and BittWare announced an FPGA accelerator card targeting high-performance compute and data acceleration applications. The VectorPath S7t-VG6 accelerator card features the 7nm Speedster7t AC7t1500 FPGA and offers the industry’s highest performance interfaces available on a PCIe FPGA accelerator card, says Achronix (Figure 6).

FIGURE 6 – The VectorPath S7t-VG6 accelerator card features the 7nm Speedster7t AC7t1500 FPGA and includes 1x400GbE and 2x100GbE ports and 8 banks of GDDR6 memory with aggregate bandwidth of 4Tbps making it well suited for high-bandwidth data acceleration applications.

The VectorPath accelerator card was designed for high-performance and high-bandwidth data applications. It offers 400GbE QSFP-DD and 100GbE QSFP56 interfaces. For DRAM memory, there are 8 banks of GDDR6 memory delivering 4Tbps aggregate bandwidth and 1 bank of DDR4 running at 2666MHz with ECC. Aside from the 20Tbps 2D NoC mentioned earlier, the Speedster7t FPGA on board also puts to use its 692K 6-input LUTs and 40K Int8 MACs that deliver over 80 tera operations per second (TOPS) performance. A 4-lane PCIe Gen 4 connector is available for connecting expansion cards.
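The headline figures for the card can be related to each other with simple arithmetic. The sketch below assumes "40K" means 40,000 MAC units, that each multiply-accumulate counts as two operations, and that the GDDR6 bandwidth is shared evenly across the 8 banks; these are common conventions used here as assumptions, not figures stated by Achronix.

```python
def implied_mac_rate_ghz(tops, num_macs, ops_per_mac=2):
    """Clock rate (GHz) at which num_macs units would reach the given TOPS."""
    return tops * 1e12 / (num_macs * ops_per_mac) / 1e9


def per_bank_gbps(total_tbps, banks):
    """Even share of aggregate memory bandwidth per bank, in Gbps."""
    return total_tbps * 1000 / banks


print(implied_mac_rate_ghz(80, 40_000))  # 1.0
print(per_bank_gbps(4.0, 8))             # 500.0
```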

The VectorPath accelerator card includes a full suite of Achronix’s ACE development tools along with BittWare’s board management controller and developer toolkit, which includes the API, PCIe drivers, diagnostic self-test and application example designs to enable a rapid out-of-the-box experience. Designed for prototyping as well as high-volume production applications, the VectorPath S7t-VG6 accelerator card provides designers the ability to process massive amounts of data not possible with previous generations of FPGAs.

FPGA KIT FOR AI VISION
With the rise of AI, ML and IoT, applications are moving to the network edge where data is collected, requiring power-efficient solutions to deliver more computational performance in ever smaller, thermally constrained form factors. To meet those needs, Microchip Technology in May released its VectorBlox Accelerator Software Development Kit (SDK) (Figure 7). It’s designed to help developers take advantage of Microchip’s PolarFire FPGAs for creating low-power, flexible overlay-based neural network applications without learning an FPGA tool flow.

FIGURE 7 – The VectorBlox Accelerator Software Development Kit (SDK) is designed to help developers take advantage of Microchip’s PolarFire FPGAs for creating low-power, flexible overlay-based neural network applications without learning an FPGA tool flow.

According to the company, FPGAs are ideal for edge AI applications, such as inferencing in power-constrained compute environments, because they can perform more giga operations per second (GOPS) with greater power efficiency than a CPU or graphics processing unit (GPU), but they require specialized hardware design skills. Microchip’s VectorBlox Accelerator SDK is designed to enable developers to code in C/C++ and program power-efficient neural networks without prior FPGA design experience.

The toolkit can execute models in TensorFlow and the open neural network exchange (ONNX) format, which offers the widest framework interoperability. ONNX supports many frameworks such as Caffe2, MXNet, PyTorch, and MATLAB. Unlike alternative FPGA solutions, Microchip’s VectorBlox Accelerator SDK is supported on Linux and Windows operating systems, and it also includes a bit-accurate simulator, which gives the user the opportunity to validate the accuracy of the hardware while in the software environment. The neural network IP included with the kit also supports the ability to load different network models at run time.

For inferencing at the edge, Microchip says its PolarFire FPGAs deliver up to 50% lower total power than competing devices, while also offering 25% higher-capacity math blocks that can deliver up to 1.5TOPS. By using FPGAs, developers also have greater opportunities for customization and differentiation through the devices’ inherent upgradability and ability to integrate functions on a single chip. The PolarFire FPGA neural network IP is available in a range of sizes to match the performance, power, and package size trade-offs for the application, enabling developers to implement their solutions in package sizes as small as 11mm × 11mm.

GPUs FOR REAL-TIME AI
GPUs have proven themselves as engines for performing AI and deep learning duties. Serving such needs, in May Nvidia announced two products for its EGX Edge AI platform—the EGX A100 for larger commercial off-the-shelf servers and the tiny EGX Jetson Xavier NX (Figure 8) for micro-edge servers—designed for high-performance, secure AI processing at the edge.

FIGURE 8 – The energy-efficient EGX Jetson Xavier NX module delivers up to 21TOPS at 15W, or 14TOPS at 10W. As a result, the module opens the door for embedded edge-computing devices that demand increased performance to support AI workloads but are constrained by size, weight or power budget.

Using the Nvidia EGX Edge AI platform, hospitals, stores, farms and factories can carry out real-time processing and protection of the massive amounts of data streaming from trillions of edge sensors, says Nvidia. The platform makes it possible to securely deploy, manage and update fleets of servers remotely. Servers powered by the EGX A100 can manage hundreds of cameras in airports, for example, while the EGX Jetson Xavier NX is built to manage a handful of cameras in convenience stores. Cloud-native support ensures the entire EGX lineup can use the same optimized AI software to easily build and deploy AI applications.

The EGX A100 is the first edge AI product based on the Nvidia Ampere architecture. As AI moves increasingly to the edge, organizations can include EGX A100 in their servers to carry out real-time processing and protection of the massive amounts of streaming data from edge sensors. It combines the high computing performance of the Nvidia Ampere architecture with the accelerated networking and critical security capabilities of the Nvidia Mellanox ConnectX-6 Dx SmartNIC to transform standard and purpose-built edge servers into secure, cloud-native AI supercomputers.

NVIDIA AMPERE ARCHITECTURE
The Nvidia Ampere architecture, the company’s 8th-generation GPU architecture, boasts its largest-ever generational leap in performance for a wide range of compute-intensive workloads, including AI inference and 5G applications running at the edge. This allows the EGX A100 to process high-volume streaming data in real time from cameras and other IoT sensors to drive faster insights and higher business efficiency.

Nvidia’s EGX Jetson Xavier NX serves as a small, powerful AI supercomputer for microservers and edge AIoT boxes, with more than 20 solutions now available from ecosystem partners. It packs the power of an Nvidia Xavier SoC into a credit card-size module. EGX Jetson Xavier NX, running the EGX cloud-native software stack, can quickly process streaming data from multiple high-resolution sensors.

The energy-efficient module delivers up to 21TOPS at 15W, or 14TOPS at 10W. As a result, EGX Jetson Xavier NX opens the door for embedded edge-computing devices that demand increased performance to support AI workloads but are constrained by size, weight, power budget or cost. The EGX A100 will be available at the end of 2020. Ready-to-deploy micro-edge servers based on the EGX Jetson Xavier NX are available now for companies looking to create high-volume production edge systems.
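The two operating points quoted for the module can be compared on a performance-per-watt basis. This is plain arithmetic on the figures above; the function name is illustrative.

```python
def tops_per_watt(tops, watts):
    """Compute efficiency in TOPS per watt."""
    return tops / watts


for tops, watts in [(21, 15), (14, 10)]:
    print(f"{tops} TOPS at {watts} W: {tops_per_watt(tops, watts):.2f} TOPS/W")
# 21 TOPS at 15 W: 1.40 TOPS/W
# 14 TOPS at 10 W: 1.40 TOPS/W
```

Both modes work out to the same efficiency, so by these figures the choice between them is about the available power budget rather than efficiency.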

HPC EMBEDDED PROCESSORS
Nvidia technologies aren’t the only GPU solutions targeting the embedded space. In April 2019, AMD launched an expansion to the Ryzen Embedded processor lineup with the AMD Ryzen Embedded R1000 SoC. Built on “Zen” CPU and Radeon “Vega” graphics cores, the Ryzen Embedded R1000 processor delivers 3x better CPU performance per watt compared to the previous generation AMD R-series Embedded processor, and 4x better CPU and graphics performance per dollar than the competition, says AMD.

In February of this year, AMD expanded the product line with two new AMD Ryzen Embedded R1000 low-power processors. The two new chips, designed for efficient power envelopes, are the Ryzen Embedded R1102G and R1305G processors. The new processors scale from 6W up to 10W of TDP, while also giving embedded developers the ability to reduce system costs with fewer memory DIMMs and lower power requirements. With this low power envelope, these embedded processors give system designers the ability to create fanless systems, opening new markets that can leverage the high-performance Ryzen Embedded processors.

RESOURCES
Achronix | www.achronix.com
AMD | www.amd.com
BittWare | www.bittware.com
Flex Logix Technologies | www.flex-logix.com
Intel | www.intel.com
Lattice Semiconductor | www.latticesemi.com
Microchip Technology | www.microchip.com
Nvidia | www.nvidia.com
QuickLogic | www.quicklogic.com
Xilinx | www.xilinx.com

PUBLISHED IN CIRCUIT CELLAR MAGAZINE • JULY 2020 #360 – Get a PDF of the issue


Editor-in-Chief at Circuit Cellar

Jeff Child has more than 28 years of experience in the technology magazine business, including editing and writing technical content and engaging in all aspects of magazine leadership and production. He joined Circuit Cellar after serving as Editor-in-Chief of COTS Journal for over 10 years. Over his career Jeff held senior editorial positions at several leading electronic engineering publications, including EE Times, Electronic Design and RTC Magazine. Before entering the world of technology journalism, Jeff worked as a design engineer in the data acquisition market.

Copyright © 2021 KCK Media Corp.
