High-Performance at the Edge
Artificial intelligence (AI) was once a discipline that required big, rack-based supercomputers. Today, FPGAs and GPUs are able to perform AI operations, enabling new types of local decision making in embedded systems.
Like almost every other technology from the general desktop/server computing realm, today’s artificial intelligence (AI) and machine learning (ML) are now easily implemented at the chip and board level and aimed at embedded systems developers. Providing solutions for these efforts are the vendors of cutting-edge FPGAs and GPUs.
While the idea isn’t new, today’s GPUs are widely used for general-purpose computation, a technique also known as general-purpose computing on GPUs (GPGPU), although that term isn’t used much anymore.
Meanwhile, the trend toward FPGAs with general-purpose CPU cores embedded on them is likewise nothing new. But, as the on-chip compute resources of these devices have evolved, the latest crop of FPGAs have shifted their architectures toward configurations that support AI and ML types of processing.
Striving to keep up with demands for “intelligent” system functionality, GPU and FPGA companies have rolled out a variety of solutions over the last 12 months, aimed at performing AI, ML and other advanced computing functions.
GPUs FOR INDUSTRIAL AI/HPC
The concept of using GPUs for general-purpose computing is over 10 years old at this point. The high-performance capabilities of GPUs make them a natural fit for AI. In June, Nvidia announced it had enhanced its HGX AI supercomputing platform with new technologies that fuse AI with high-performance computing (HPC), making supercomputing more useful to a growing number of industries. To accelerate the new era of industrial AI and HPC, Nvidia has added three key technologies to its HGX platform: the Nvidia A100 80GB PCIe GPU, Nvidia NDR 400G InfiniBand networking and Nvidia Magnum IO GPUDirect Storage software (Figure 1). Together, they provide the extreme performance to enable industrial HPC innovation, says Nvidia.
Among the dozens of partner companies using the Nvidia HGX platform for next-generation systems are Atos, Dell Technologies, Hewlett Packard Enterprise (HPE), Lenovo, Microsoft Azure and NetApp. Perhaps most significant among these is General Electric (GE). The HGX platform is being used by GE to apply HPC to computational fluid dynamics (CFD) simulations that guide design innovation in large gas turbines and jet engines. Nvidia says the HGX platform has achieved order-of-magnitude acceleration for breakthrough CFD methods in GE’s GENESIS code, which employs Large Eddy Simulations to study the effects of turbulent flows inside turbines that are composed of hundreds of individual blades and require uniquely complex geometry.
Nvidia A100 Tensor Core GPUs provide HPC acceleration to solve complex AI, data analytics, model training and simulation challenges relevant to industrial HPC. Compared with the A100 40GB, the A100 80GB PCIe GPUs increase GPU memory bandwidth by 25% to 2TB/s and provide 80GB of HBM2e high-bandwidth memory. The A100 80GB PCIe’s enormous memory capacity and high memory bandwidth enable more data and larger neural networks to be held in memory, minimizing internode communication and energy consumption.
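As a rough check on these memory figures, the arithmetic can be sketched in a few lines of Python. The 25% uplift and the 2TB/s figure come from the Nvidia numbers quoted above; the function names, the 2-bytes-per-parameter (FP16) weight size and the 30-billion-parameter model are illustrative assumptions, not Nvidia specifications. The capacity check counts only model weights and ignores activations and other runtime buffers.

```python
def implied_baseline_bandwidth(new_tb_per_s: float, uplift_pct: float) -> float:
    """Back out the older part's bandwidth from the new figure and the quoted uplift."""
    return new_tb_per_s / (1.0 + uplift_pct / 100.0)


def weights_fit(num_params: float, bytes_per_param: int, capacity_gb: float) -> bool:
    """Return True if the model weights alone fit in GPU memory (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param <= capacity_gb * 1e9


if __name__ == "__main__":
    baseline = implied_baseline_bandwidth(2.0, 25.0)
    print(f"Implied A100 40GB bandwidth: {baseline:.1f} TB/s")
    # Hypothetical 30-billion-parameter model stored as 2-byte (FP16) weights.
    for capacity in (40, 80):
        print(f"{capacity}GB card holds the weights: {weights_fit(30e9, 2, capacity)}")
```

Under these assumptions, the quoted uplift implies a baseline of about 1.6TB/s, and the hypothetical 60GB of weights fits on the 80GB card but not on the 40GB one.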
The A100 80GB PCIe is powered by the Nvidia Ampere architecture, which features Multi-Instance GPU (MIG) technology to deliver acceleration for smaller workloads such as AI inference. MIG allows HPC systems to scale compute and memory down with guaranteed quality of service. In addition to PCIe, there are four- and eight-way Nvidia HGX A100 configurations.
The ability to do powerful AI processing in embedded applications means that embedded systems can now be thought of as supercomputing platforms. With that in mind, it’s useful to track innovations from the non-embedded side of computing. Along just those lines, at the 2021 International Supercomputing Conference (ISC) in June, Intel showcased HPC technologies with a series of announcements, including an update on its Xe-HPC-based GPU.
Earlier this year, Intel confirmed via social media and event videos the existence of prototypes of its Xe-HPC-based GPU (code-named “Ponte Vecchio”) and in June the company said it’s in the process of system validation. Ponte Vecchio is an Xe architecture-based GPU optimized for HPC and AI workloads. It leverages Intel’s Foveros 3D packaging technology to integrate multiple intellectual property (IP) blocks in-package, including HBM memory.
The GPU is architected with compute, memory and switch fabric to meet the evolving needs of the world’s most advanced supercomputers, like Aurora. Ponte Vecchio will be available in an OCP Accelerator Module (OAM) form factor and subsystems, serving the scale-up and scale-out capabilities required for HPC applications (Figure 2).
ACAP FOR THE AI EDGE
For its part, FPGA market leader Xilinx oriented its company strategy toward AI for the edge a couple of years ago. This began with the idea of its adaptive compute acceleration platform (ACAP) products. Its most recent achievement there came in June, when it introduced its Versal AI Edge series ACAPs, devices designed to enable AI innovation from the edge to the endpoint (Figure 3). With 4x the AI performance-per-watt versus GPUs and 10x greater compute density versus previous-generation adaptive SoCs, the Versal AI Edge series is the industry’s most scalable and adaptable portfolio for next-generation distributed intelligent systems, says Xilinx.
Versal AI Edge ACAPs are designed to provide intelligence for a wide range of applications including: automated driving with the highest levels of functional safety, collaborative robotics, predictive factory and healthcare systems, and multi-mission payloads for the aerospace and defense markets. The portfolio features AI Engine-ML to deliver 4x more ML computing compared to the previous AI Engine architecture and integrates new accelerator RAM with an enhanced memory hierarchy for evolving AI algorithms. These architectural innovations deliver lower latency resulting in far more capable devices at the edge, according to the company.
AI-enabled automated systems require high compute density that can accelerate entire applications from sensor to AI to real-time control. Versal AI Edge devices achieve this by delivering 10x the compute density versus Zynq UltraScale+ MPSoCs, enabling more intelligent autonomous systems. Additionally, Versal AI Edge devices support multiple safety standards across industrial (IEC 61508), avionics (DO-254/178), and automotive (ISO 26262) markets, where vendors can meet ASIL C random hardware integrity and ASIL D systematic integrity levels. Versal AI Edge series design documentation and support are available to early access customers, with shipments expected during the first half of 2022.
STACK FOR AI/ML MODELS
In June, Lattice Semiconductor announced enhancements to its sensAI solution stack for accelerating AI/ML application development on low-power Lattice FPGAs. Enhancements include support for the Lattice Propel design environment for embedded processor-based development and the TensorFlow Lite deep-learning framework for on-device inferencing (Figure 4). The new version, sensAI 4.0, includes the Lattice sensAI Studio design environment for end-to-end ML model training, validation, and compilation. With sensAI 4.0, developers can use a simple drag-and-drop interface to build FPGA designs with a RISC-V processor and a convolutional neural network (CNN) acceleration engine to enable the quick and easy implementation of ML applications on power-constrained edge devices.
There is growing demand in multiple end markets to add support for low-power AI/ML inferencing for applications like object detection and classification, says Lattice. AI/ML models can be trained to support applications for a range of devices that require low-power operation at the edge, including security and surveillance cameras, industrial robots, and consumer robotics and toys. The Lattice sensAI solution stack helps developers rapidly create AI/ML applications that run on flexible, low-power Lattice FPGAs.
Support for the TensorFlow Lite framework reduces power consumption and increases data co-processing performance in AI/ML inferencing applications. TensorFlow Lite runs anywhere from 2x to 10x faster on a Lattice FPGA than it does on an Arm Cortex-M4-based microcontroller (MCU). The enhanced sensAI stack supports the Lattice Propel environment’s GUI and command-line tools to create, analyze, compile and debug both the hardware and software design of an FPGA-based processor system. Even developers unfamiliar with FPGA design can use the tool’s easy-to-use, drag-and-drop interface to create AI/ML applications on low-power Lattice FPGAs with support for RISC-V-based co-processing.
Lattice sensAI Studio is a GUI-based tool for training, validating and compiling ML models optimized for Lattice FPGAs. The tool makes it easy to take advantage of transfer learning to deploy ML models. By leveraging advances in ML model compression and pruning, sensAI 4.0 can support image processing at 60 frames per second (FPS) at QVGA resolution or 30 FPS at VGA resolution.
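To compare the two operating points, it helps to convert them to pixel throughput. The sketch below assumes the standard definitions of QVGA (320 × 240) and VGA (640 × 480), which Lattice's announcement as quoted here does not spell out; the function and table names are illustrative.

```python
# Standard resolutions (assumed; not stated in the text): width x height in pixels.
RESOLUTIONS = {"QVGA": (320, 240), "VGA": (640, 480)}


def pixels_per_second(mode: str, fps: int) -> int:
    """Pixels processed per second for a given resolution name and frame rate."""
    width, height = RESOLUTIONS[mode]
    return width * height * fps


if __name__ == "__main__":
    for mode, fps in (("QVGA", 60), ("VGA", 30)):
        rate = pixels_per_second(mode, fps)
        print(f"{mode} at {fps} FPS: {rate / 1e6:.3f} Mpixels/s")
```

Under these assumptions, the VGA mode moves twice as many pixels per second as the QVGA mode (about 9.2 versus 4.6 Mpixels/s), even though its frame rate is half.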
AI AND MCUs
No one vendor has all the needed solutions for AI system developers. As a result, industry partnerships have been a significant means for embedded engineers to get the mix of solutions they need. With that in mind, in June, SensiML, a subsidiary of FPGA vendor QuickLogic, announced its partnership with Microchip Technology. The team-up’s goal is to simplify the development of AI code for smart industrial, consumer, and commercial edge IoT applications. This partnership enables embedded developers using Microchip MCUs and its MPLAB X IDE tool suite to quickly and easily add intelligence to their new or legacy designs with SensiML’s Analytics Toolkit.
The new integrated design flow enables users to use the Data Visualizer debug tool included with the MPLAB X IDE tool suite to directly read register-level sensor data and then feed that information into SensiML’s Data Capture Lab, where it can be analyzed and labeled for high-quality AI modeling (Figure 5). This approach means that data from any of the wide range of sensors supported by the MPLAB X IDE tool suite can be converted into usable AI models. The models generated by the SensiML tools are extremely efficient and can easily be supported by nearly any Microchip MCU and its associated memory subsystem while keeping power consumption extremely low.
According to the company, SensiML takes the data science complexity out of AI sensing code for smart industrial, consumer and commercial applications. Typical examples include industrial control, smart buildings, smart cities and mass transit management. The Microchip MPLAB X IDE tool suite and SensiML Analytics Toolkit are available now and support the Microchip SAM-IoT WG Development Board using the SAMD21G15 Arm Cortex-M0+ based 32-bit MCU. Support for additional development kits and processors is expected to be added over the coming months.
eFPGA AND DRAM PAIR-UP
Even as the FPGA leaders have put a strong focus on AI and ML, the same trend is happening among embedded FPGA (eFPGA) companies. An eFPGA is an IP block that allows an FPGA to be incorporated in an SoC, MCU or any kind of IC. Among these eFPGA companies are Flex Logix and Achronix. In January, DRAM vendor Winbond Electronics revealed that its low-power, high-performance LPDDR4X DRAM technology supports eFPGA vendor Flex Logix’s AI technology for applications such as object recognition.
The Winbond LPDDR4X DRAM chip is being paired with Flex Logix’s InferX X1 edge inference accelerator chip, which is based on an architecture that features arrays of reconfigurable Tensor Processors. This provides higher throughput and lower latency at lower cost than existing AI edge computing solutions when processing complex neural network algorithms such as YOLOv3 or Full Accuracy Winograd, says Flex Logix.
To support the InferX X1’s ultra-high-speed operation and 7.5TOPS performance while keeping power consumption to a minimum, Flex Logix has paired the accelerator with the W66CQ2NQUAHJ from Winbond, a 4Gb LPDDR4X DRAM that offers a maximum data rate of 4267Mbps at a maximum clock rate of 2133MHz. To enable use in battery-powered systems and other power-constrained applications, the W66 series device operates in active mode from 1.8V/1.1V power rails, and from a 0.6V supply in quiescent mode. It offers power-saving features including partial array self-refresh.
The Winbond LPDDR4X chip operates alongside the InferX X1 processor in Flex Logix’s half-height/half-length PCIe embedded processor board for edge servers and gateways (Figure 6). The system takes advantage of Flex Logix’s architectural innovations, such as reconfigurable optimized data paths, which reduce the traffic between the processor and DRAM, to increase throughput and reduce latency. The 4Gb W66CQ2NQUAHJ comprises two 2Gb dies in a two-channel configuration. Each die is organized into eight internal banks, which support concurrent operation. The chip is housed in a 200-ball WFBGA package that measures 10mm × 14.5mm.
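The memory figures quoted for the W66CQ2NQUAHJ can be cross-checked with simple arithmetic. This sketch assumes the usual double-data-rate relationship (two transfers per clock cycle on each data pin) and treats 1Gb as 2^30 bits; the function names are illustrative. Peak bandwidth is left out because the bus width is not given here.

```python
def ddr_rate_mbps(clock_mhz: float) -> float:
    """Per-pin data rate for a double-data-rate interface: two transfers per clock."""
    return 2.0 * clock_mhz


def capacity_mebibytes(num_dies: int, gbit_per_die: int) -> float:
    """Total capacity in MiB, treating 1Gb as 2**30 bits (an assumption)."""
    total_bits = num_dies * gbit_per_die * 2**30
    return total_bits / 8 / 2**20


if __name__ == "__main__":
    # 2 x 2133 = 4266; the quoted 4267Mbps presumably reflects a clock that
    # is rounded down to 2133MHz (an assumption, not a Winbond statement).
    print(f"Per-pin data rate: {ddr_rate_mbps(2133):.0f} Mbps")
    print(f"Capacity: {capacity_mebibytes(2, 2):.0f} MiB")
```

With these assumptions, the two 2Gb dies add up to 512 MiB, and doubling the quoted clock lands within 1Mbps of the quoted data rate.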
In April, Achronix Semiconductor started shipping its 7nm Speedster7t AC7t1500 FPGAs to system developers ahead of schedule, according to the company. The Speedster7t family is purpose-built for high bandwidth workloads in a broad range of applications including AI/ML, 5G infrastructure, networking, computational storage and test and measurement. Those are applications where Speedster7t FPGAs eliminate critical performance bottlenecks associated with traditional FPGAs.
The Speedster7t FPGA family is built on TSMC’s 7nm process technology and delivers high performance for networking, storage and compute acceleration (Figure 7). The AC7t1500 has been optimized for high-bandwidth applications and includes the industry’s first 2D network-on-chip (NoC) with more than 20Tbps of bidirectional bandwidth, 112Gbps SerDes, PCIe Gen5, 400G Ethernet and 4Tbps external memory bandwidth with its GDDR6 memory interfaces.
These devices also include an array of the new ML processors (MLPs), which are well suited for the diverse and high-performance workloads required in AI/ML applications. The Speedster7t FPGAs are supported by the Achronix tool suite, which includes Synplify Pro synthesis and the ACE place, route and timing tools. These design tools are available today for customers to evaluate and design for Speedster7t FPGAs.
Achronix says that one of the major architectural innovations in Speedster7t FPGAs is the industry’s first 2D NoC. The 2D NoC has dedicated high-bandwidth paths across the entire FPGA fabric interconnecting all functional blocks and peripheral I/O to each other and to the FPGA fabric. The 2D NoC eliminates complex routing bottlenecks found in traditional FPGAs and can transmit or receive 512Gbps at each of the 80 nodes across the FPGA yielding greater than 20Tbps of bidirectional bandwidth. Engineering samples of the AC7t1500 FPGAs have been shipping to customers. Achronix expects to complete full device validation of the FPGA fabric, hard IP and peripheral interfaces in the second half of 2021 and will begin shipping production devices by the end of 2021.
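The per-node and aggregate NoC figures can be related with a quick calculation. The sketch simply multiplies the two numbers Achronix quotes; how the company counts traffic directions toward its 20Tbps figure is not stated here, so the result should be read as a simple product of the quoted values rather than an official specification. The names are illustrative.

```python
NODES = 80              # NoC nodes quoted by Achronix
GBPS_PER_NODE = 512     # transmit or receive rate per node, in Gbps


def aggregate_tbps(nodes: int, gbps_per_node: float) -> float:
    """Sum of per-node rates, converted from Gbps to Tbps (1 Tbps = 1000 Gbps)."""
    return nodes * gbps_per_node / 1000.0


if __name__ == "__main__":
    total = aggregate_tbps(NODES, GBPS_PER_NODE)
    print(f"Sum over all nodes: {total:.2f} Tbps")
    print(f"Exceeds the quoted 20 Tbps: {total > 20}")
```

The product comes to 40.96Tbps, which is consistent with the claim of greater than 20Tbps.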
Achronix | www.achronix.com
Flex Logix Technologies | www.flex-logix.com
Intel | www.intel.com
Lattice Semiconductor | www.latticesemi.com
Microchip Technology | www.microchip.com
Nvidia | www.nvidia.com
QuickLogic | www.quicklogic.com
Xilinx | www.xilinx.com
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • AUGUST 2021 #373