Embedded Supercomputing
Embedded computing technology has evolved well past the point where complete system functionality on a single chip is remarkable. Today, the levels of compute performance and parallel processing available on an IC mean that what were once supercomputing capabilities can be implemented in chip-level solutions.
While supercomputing has become a generalized term, what system developers are really interested in is crafting artificial intelligence, machine learning and neural networking solutions using today’s embedded processing. Supplying the technology for these efforts are the makers of leading-edge embedded processors, FPGAs and GPUs. In these tasks, GPUs are being used for “general-purpose computing on GPUs,” a technique also known as GPGPU computing.
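To make the GPGPU idea concrete, here is a minimal sketch in Python, assuming an Nvidia CUDA-capable GPU and the third-party Numba compiler (neither tied to any particular vendor covered here). It offloads a simple vector-add kernel to the GPU, the same parallel pattern that underlies the multiply-accumulate work at the heart of neural network inference.

```python
# Minimal GPGPU sketch: offload an element-wise vector add to the GPU.
# Assumes an Nvidia CUDA-capable GPU and the Numba package (pip install numba).
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)              # global thread index across all blocks
    if i < out.size:              # guard against threads past the end of the data
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)   # Numba copies the arrays to and from the device

print(out[:4], (a + b)[:4])       # GPU result should match the CPU computation
```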
With all that in mind, embedded processor, GPU and FPGA companies have rolled out a variety of solutions over the last 12 months, aimed at performing AI, machine learning and other advanced computing functions for several demanding embedded system application segments.
FPGAs TAKE AI FOCUS
Back in March, FPGA vendor Xilinx announced plans to launch a new FPGA product category it calls the adaptive compute acceleration platform (ACAP). Following up on that, in October the company unveiled Versal, the first of its ACAP implementations. Versal ACAPs combine scalar processing engines, adaptable hardware engines and intelligent engines with advanced memory and interfacing technologies to provide heterogeneous acceleration for any application. Even more importantly, according to Xilinx, the Versal ACAP’s hardware and software can be programmed and optimized by software developers, data scientists and hardware developers alike. This is enabled by a host of tools, software, libraries, IP, middleware and frameworks that support industry-standard design flows.
Built on TSMC’s 7-nanometer FinFET process technology, the Versal portfolio combines software programmability with domain-specific hardware acceleration and the adaptability needed to keep up with today’s rapid pace of innovation. The portfolio includes six series of devices uniquely architected to deliver scalability and AI inference capabilities for a host of applications across different markets, from cloud to networking to wireless communications to edge computing and endpoints.
The portfolio includes the Versal Prime series, Premium series and HBM series, which are designed to deliver industry-leading performance, connectivity, bandwidth, and integration for the most demanding applications. It also includes the AI Core series, AI Edge series, and AI RF series, which feature the AI Engine (Figure 1). The AI Engine is a new hardware block designed to address the emerging need for low-latency AI inference for a wide variety of applications and also supports advanced DSP implementations for applications like wireless and radar.

It is tightly coupled with the Versal Adaptable Hardware Engines to enable whole application acceleration, meaning that both the hardware and software can be tuned to ensure maximum performance and efficiency. The portfolio debuts with the Versal Prime series, delivering broad applicability across multiple markets and the Versal AI Core series, delivering an estimated 8X AI inference performance boost versus industry-leading GPUs, according to Xilinx.
LOW-POWER AI SOLUTION
Following the AI trend, back in May Lattice Semiconductor unveiled Lattice sensAI, a technology stack that combines modular hardware kits, neural network IP cores, software tools, reference designs and custom design services. In September, the company announced expanded features of the sensAI stack aimed at developers building flexible machine learning inferencing into consumer and industrial IoT applications. Building on the ultra-low-power (1 mW to 1 W) focus of the sensAI stack, Lattice released new IP cores, reference designs, demos and hardware development kits that provide scalable performance and power for always-on, on-device AI applications.
Embedded system developers can build a variety of solutions enabled by sensAI. They can build stand-alone, always-on, integrated solutions based on iCE40 UltraPlus or ECP5 FPGAs, with latency, security and form factor benefits. Alternatively, they can use the iCE40 UltraPlus as an always-on processor that detects key phrases or objects and wakes up a high-performance AP SoC or ASIC for further analytics only when required, reducing overall system power consumption. Finally, they can use the scalable performance/power benefits of the ECP5 for neural network acceleration, along with its I/O flexibility to seamlessly interface to on-board legacy devices, including sensors and low-end MCUs, for system control.
Updates to the sensAI stack include a new CNN Compact Accelerator IP core for improved accuracy on iCE40 UltraPlus FPGAs and an enhanced CNN Accelerator IP core for improved performance on ECP5 FPGAs. Software tools include an updated neural network compiler with improved ease of use and both Caffe and TensorFlow support for iCE40 UltraPlus FPGAs. Also provided are reference designs and demos for human presence detection and hand gesture recognition (Figure 2). New iCE40 UltraPlus development platform support includes a Himax HM01B0 UPduino shield and the DPControl iCEVision board.
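To give a feel for the front end of such a flow, the sketch below defines a tiny TensorFlow/Keras convolutional network of the kind a developer might train on a workstation before handing it to a vendor’s neural network compiler for FPGA inference. The layer sizes, the two-class “presence” output and the save step are illustrative assumptions, not Lattice’s actual flow, which is documented in the company’s sensAI materials.

```python
# Illustrative only: a tiny CNN for an always-on, two-class vision task
# (e.g., "person present" vs. "no person"), defined in TensorFlow/Keras.
# The trained model would then be fed to a vendor neural network compiler;
# the exact export format and supported layers depend on the vendor tools.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),           # small grayscale frames
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),      # present / not present
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# After training on labeled frames, save the model for the compiler step:
# model.save("presence_cnn.h5")
```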

AI FOR ENDPOINTS
With a focus on AI at the edge, QuickLogic in May launched its QuickAI platform for endpoint AI applications. The QuickAI platform features technology, software and toolkits from General Vision, Nepes, SensiML and QuickLogic (Figure 4). Together, these forge a tightly coupled ecosystem aimed at solving the challenges associated with implementing AI for endpoint applications.
The QuickAI platform is based on General Vision’s NeuroMem neural network IP, which has been licensed by Nepes and integrated into the Nepes neuromorphic NM500 AI learning device. Both General Vision and Nepes provide software for configuring and training the neurons in the network. In addition, SensiML provides an analytics toolkit to quickly and easily build smart sensor algorithms for endpoint IoT applications.
General Vision’s technology enables on-chip exact and fuzzy pattern matching and learning using a scalable parallel architecture of Radial Basis Function (RBF) neurons. The parallel architecture results in a fixed latency for any number of neurons and a very low, energy-efficient operating frequency. General Vision supplies the Knowledge Builder tool suite and SDK used to train and configure the neurons in a NeuroMem network.
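The matching scheme behind such RBF neurons can be sketched in a few lines of Python. The snippet below is a simplified software model for illustration only, not the NeuroMem hardware or General Vision’s API: each “neuron” stores a prototype vector, a category and an influence field; an input is recognized when its distance to a prototype falls within that field, and fuzzy matching falls back to the nearest prototype.

```python
# Simplified software model of RBF-neuron pattern matching.
# For illustration only; this is not the NeuroMem hardware or its API.
import numpy as np

class RBFNeuron:
    def __init__(self, prototype, category, influence):
        self.prototype = np.asarray(prototype, dtype=np.int32)
        self.category = category
        self.influence = influence          # radius of the neuron's influence field

    def distance(self, vector):
        # L1 (Manhattan) distance, a norm commonly used by this class of hardware
        return int(np.abs(self.prototype - vector).sum())

def classify(neurons, vector, fuzzy=False):
    vector = np.asarray(vector, dtype=np.int32)
    # In hardware, every neuron computes its distance in parallel (fixed latency);
    # here we simply loop over the list.
    firing = [(n.distance(vector), n.category) for n in neurons
              if n.distance(vector) <= n.influence]
    if firing:
        return min(firing)[1]               # closest neuron whose field contains the input
    if fuzzy:                               # fuzzy mode: fall back to the nearest prototype
        return min((n.distance(vector), n.category) for n in neurons)[1]
    return None                             # unknown pattern

neurons = [RBFNeuron([10, 10, 10], "A", influence=5),
           RBFNeuron([60, 60, 60], "B", influence=8)]
print(classify(neurons, [11, 9, 12]))                 # -> "A" (exact match inside field)
print(classify(neurons, [40, 40, 40], fuzzy=True))    # no neuron fires; nearest is "B"
```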
EXPANDED AI IP
Early this year, Microsemi (now part of Microchip) expanded the third-party intellectual property (IP) offerings for its cost-optimized, low-power, mid-range PolarFire FPGAs. With new support for AI/machine learning IP and HDMI 2.0b interfaces, the company’s PolarFire devices can now be used in industrial AI applications that leverage the rich resources in the FPGA, particularly its large quantities of DSP math blocks and embedded RAM (Figure 3).

The expanded IP offerings make Microsemi’s PolarFire FPGAs well suited to a wide variety of machine vision applications within the industrial, medical imaging, retail, defense and automotive markets. The HDMI 2.0b IP, available through Microsemi’s collaboration with Bitec, enables displays up to 4K (ultra-high-definition) resolution, which can be used for AI applications such as retail advertising and smart mirror displays, as well as traditional display designs like media servers and display walls.
Machine learning/AI IP, offered via Microsemi’s collaboration with ASIC Design Services, is a key component for machine vision applications such as object sensing, enabling surveillance cameras to detect faces, retail systems to deliver targeted advertising, and advanced driver assistance system (ADAS) applications that allow automobiles to detect cars, pedestrians or other objects. For more on Microsemi’s collaboration with ASIC Design Services, see the article “FPGAs Provide Edge for Convolutional Neural Networks” on page XX of this issue of Circuit Cellar.
eFPGAs FOR AI
While the market-share FPGA leaders discussed so far rev up their AI game, a similar trend is happening among Embedded FPGA (eFPGA) companies. An eFPGA is an IP block that allows an FPGA to be incorporated in an SoC, MCU or any kind of IC. Among these eFPGA companies are Flex Logix and Achronix.
For its part, Flex Logix provides an AI-oriented eFPGA core called EFLX4K AI. The core evolved from its existing DSP eFPGA core. According to the company, the EFLX4K DSP core turns out to have as many or generally more DSP MACs per square millimeter relative to LUTs than other eFPGA and FPGA offerings, but its MAC was designed for digital signal processing and is overkill for AI requirements: AI doesn’t need a 22 x 22 multiplier, nor does it need pre-adders or some of the other logic in the DSP MAC. With that in mind, Flex Logix architected its EFLX4K AI core, optimized for deep learning. It has over 10x the GigaMACs/s per square millimeter of the EFLX4K DSP core. The EFLX4K AI core can be implemented on any process node in 6 to 8 months on customer demand and can be arrayed interchangeably with the EFLX4K Logic/DSP cores.
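A quick numerical sketch illustrates why narrower multipliers suffice for inference. In quantized deep learning, weights and activations are typically held as 8-bit integers and accumulated in a wider register, so an 8 x 8 multiply feeding a 32-bit accumulator gets essentially the same answer as full floating point. The snippet below is an illustration of that idea, not Flex Logix code.

```python
# Illustration (not vendor code): 8-bit quantized multiply-accumulate, the core
# operation of deep learning inference, needs only small multipliers plus a
# wide accumulator rather than the large multipliers found in a DSP-oriented MAC.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)     # weights
x = rng.normal(size=256).astype(np.float32)     # activations

# Symmetric 8-bit quantization: value ~= scale * int8_code
w_scale = np.abs(w).max() / 127.0
x_scale = np.abs(x).max() / 127.0
w_q = np.round(w / w_scale).astype(np.int8)
x_q = np.round(x / x_scale).astype(np.int8)

# 8-bit x 8-bit products accumulated in 32 bits, rescaled once at the end
acc = np.dot(w_q.astype(np.int32), x_q.astype(np.int32))
approx = acc * w_scale * x_scale

print("float32 dot product:", float(np.dot(w, x)))
print("int8 MAC result:    ", float(approx))    # close to the float result
```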
Meanwhile, Achronix Semiconductor offers both standalone FPGA and eFPGA products. For its high-performance eFPGA offering, Achronix provides its Speedcore technology. The benefit of delivering FPGA technology as an embedded solution is that it can be customized to meet the specific requirements of the target system. For example, in machine learning systems, the compute engine can be a fixed-point DSP function, a floating-point DSP function or a massively parallel engine for convolutional neural networks. Designed specifically to be embedded in SoCs and ASICs, Speedcore IP is a fully permutable architecture technology that can be built with densities ranging from less than 10,000 look-up tables (LUTs) up to two million LUTs, plus large amounts of embedded memory and DSP blocks.
GPGPU COMPUTING FOR CARS
For its part, Nvidia’s GPU technologies are showing up in so many AI and deep learning applications these days that it’s hard to keep up with them all. As recently as October, Nvidia and Volvo announced that the Swedish automaker has selected the Nvidia Drive AGX Xavier computer for its next generation of vehicles, with production starting in the early 2020s.
Drive AGX Xavier is a highly integrated AI car computer that enables Volvo to streamline development of self-driving capabilities while reducing total cost of development and support. The initial production release will deliver Level 2+ assisted driving features, going beyond traditional advanced driver assistance systems.
The companies are working together to develop automated driving capabilities, integrating 360-degree surround perception and a driver monitoring system. The Nvidia-based computing platform will enable Volvo to implement new connectivity services, energy management technology, in-car personalization options and autonomous drive technology.
GPU DEVELOPER KIT
Announced in September, the Nvidia Drive AGX Xavier developer kit is a platform for building autonomous driving systems. This open, scalable software and hardware solution enables companies to seamlessly develop and test customized autonomous driving technology, streamlining production.
The developer kit includes a Drive AGX Xavier car computer, along with a vehicle harness to connect the platform to the car, international power supply, camera sensor and other accessories (Figure 4). Nvidia Drive AGX is part of the Nvidia AGX platform for autonomous machines, powering robots, medical devices and self-driving cars.

The platform runs the Nvidia Drive Software 1.0 release, which incorporates a wide range of operations necessary for self-driving, including data collection, obstacle and path perception, advanced driver monitoring and in-vehicle visualization. The release includes DriveWorks modules for a sensor abstraction layer, as well as computer vision and image processing libraries that streamline autonomous driving development. The platform is open, enabling companies to build their own applications on the base software and constantly enhance it via over-the-air updates.
AI SOFTWARE
Autonomous vehicles must be able to perceive their surroundings to safely operate without a human driver. To do so, AI-powered self-driving cars must rely on numerous deep neural networks running simultaneously, identifying everything from lane markings and traffic signals to cars and pedestrians.
With Drive Software 1.0, the vehicle’s perception capabilities become increasingly sophisticated, zeroing in on and classifying a broad range of obstacles. The DriveNet deep neural network enables the car to detect and classify objects in the surrounding environment and track them from one frame to the next. LaneNet and OpenRoadNet enable the vehicle to identify lane markings and detect driveable space. The software also comes with a data recording tool, which allows manufacturers to collect time-stamped data from various sensors for training, testing and validation purposes.
These advanced software capabilities can also be utilized inside the car. Using a driver-facing camera, applications built on the Drive IX SDK can track the driver’s facial expression to know whether they’re drowsy or paying attention to the road. Furthermore, visualization capabilities allow passengers inside the vehicle to see what the car is perceiving, building trust between human and machine.
VISION ACCELERATORS FOR AI
Intel has many fingers in the AI pie, but some of its most recent activity has centered around vision processing. In October, Intel rolled out its family of Intel Vision Accelerator Design Products targeted at AI inference and analytics performance on edge devices, where data originates and is acted upon. The new acceleration solutions come in two forms: one that features an array of Intel Movidius vision processors (Figure 5) and one built on the high-performance Intel Arria 10 FPGA.

These Intel Vision Accelerator Design Products include reference designs, featuring Intel Movidius VPUs and Intel Arria 10 FPGAs that are available for Intel partners to build from. They provide power-efficient deep neural network inference for fast, accurate video analytics and computer vision applications in edge servers, NVRs and edge appliances.
The solutions are designed from the ground up to work with the Open Visual Inference & Neural Network Optimization (OpenVINO) toolkit, providing developers full utilization of Intel architecture across the CPUs, CPUs with integrated graphics, FPGAs and VPUs (Figure 6). These reference designs include PCIe, mini PCIe, and M.2 plug-in cards based on Intel Movidius VPUs or Intel Arria 10 FPGAs. They bring an array of essential features, including I/O, multistream aggregation, in-line processing, and a mix of deep inference and traditional sensor processing acceleration to edge devices.

The reference designs provide a scalable set of validated designs for delivering a variety of inference accelerator cards suitable for a wide range of end-customer applications. Because the designs are programmable via the OpenVINO toolkit, which includes deep neural network compiler tools, developers can rapidly deploy applications such as object and facial detection and recognition to these accelerator designs. Users can choose from a deep neural network inference portfolio to match the performance, cost and power efficiency required at any node, on any Intel architecture.
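As a rough sketch of what deployment through OpenVINO looks like from the developer’s side, the snippet below loads a converted model and compiles it for a target device. It uses the openvino.runtime Core interface from an OpenVINO release later than the toolkit available when this was written, and the model file, input shape and device name (“CPU” here; Movidius VPUs have historically been addressed as “MYRIAD”) are assumptions for illustration, so the toolkit’s own samples remain the definitive reference.

```python
# Hedged sketch of OpenVINO deployment (openvino.runtime API from a later release;
# the model path, input shape and device name are assumptions for illustration).
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("face_detection.xml")         # IR produced by the model optimizer
compiled = core.compile_model(model, "CPU")           # or an accelerator device name

frame = np.zeros((1, 3, 300, 300), dtype=np.float32)  # placeholder for a preprocessed frame
results = compiled([frame])                           # run inference on the target device
detections = results[compiled.output(0)]
print(detections.shape)
```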
HPC HIGH-LEVEL DESIGN
AMD’s latest embedded processors offer solutions for high-performance computing (HPC) system designs. Adding to those efforts, in November AMD announced the availability of the Mentor Graphics Sourcery CodeBench Lite Edition development environment for AMD EPYC and Ryzen CPUs and AMD Radeon Instinct GPUs.
The Sourcery CodeBench Lite Edition is a no cost, complete C/C++ and Fortran development environment for scientific and high-performance computing (HPC) applications. Designed for complex multicore heterogeneous architectures, it allows AMD Radeon Instinct GPUs to be targeted for execution using the industry-standard OpenMP and OpenACC API specifications. This recent release includes a GNU-based C/C++ and Fortran development tool chain for Linux platforms.
Back in February AMD introduced the EPYC Embedded 3000 processor and AMD Ryzen Embedded V1000 processor. The EPYC Embedded 3000 is designed for a variety of new markets including networking, storage and edge computing devices, while AMD Ryzen Embedded V1000 targets medical imaging, industrial systems, digital gaming and thin clients.
The EPYC Embedded 3000 is a highly scalable processor family with designs ranging from four to 16 cores, available in single-threaded and multi-threaded configurations (Figure 7). Thermal design power (TDP) support ranges from 30 W to 100 W. The family provides expansive integrated I/O, with support for up to 64 PCIe lanes and up to eight channels of 10 GbE, along with up to 32 MB of shared L3 cache and up to four independent memory channels.

The Ryzen Embedded V1000 is an accelerated processing unit (APU) coupling high-performance “Zen” CPU cores and a “Vega” GPU on a single die, offering up to four CPU cores/eight threads and up to 11 GPU compute units to achieve processing throughput as high as 3.6 TFLOPS. TDP support ranges from 12 W to 54 W. I/O capabilities include up to 16 PCIe lanes, dual 10 GbE and several USB options, including up to four USB 3.1/USB-C interconnects, with additional USB, SATA and NVMe support.
Clearly, embedded system developers have an expanding set of technology options when it comes to implementing AI, machine learning and neural networking in their designs. Embedded processors, FPGAs and GPUs are all playing a role, along with the sophisticated software and toolkit solutions supporting them. As these technologies evolve, embedded system developers can expect to leverage ever more functionality and intelligence in their system designs.
RESOURCES
Achronix | www.achronix.com
AMD | www.amd.com
Flex Logix Technologies | www.flex-logix.com
Intel | www.intel.com
Lattice Semiconductor | www.latticesemi.com
Microsemi | www.microsemi.com
Nvidia | www.nvidia.com
Quicklogic | www.quicklogic.com
Xilinx | www.xilinx.com
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • DECEMBER 2018 #341
Jeff served as Editor-in-Chief for both LinuxGizmos.com and its sister publication, Circuit Cellar magazine, from 6/2017 to 3/2022. In nearly three decades of covering the embedded electronics and computing industry, Jeff has also held senior editorial positions at EE Times, Computer Design, Electronic Design, Embedded Systems Development, and COTS Journal. His knowledge spans a broad range of electronics and computing topics, including CPUs, MCUs, memory, storage, graphics, power supplies, software development, and real-time OSes.