FPGAs, MCUs and Software
Today’s level of chip integration provides the processing and functionality to enable machine learning in embedded applications. Chip companies are teaming up with machine-learning software experts to provide powerful solutions.
The term “machine learning” often gets lumped together with artificial intelligence (AI). That’s not surprising given that machine learning and AI by and large rely on similar kinds of computing hardware. When machine learning is applied to embedded applications, or the so-called edge, technology requirements become more distinct. Today, the processing needed for both AI and machine learning in such applications can be implemented on chip-level solutions including microcontrollers (MCUs), FPGAs, GPUs and systems-on-chip (SoCs).
Machine learning has a narrower definition than AI. While there are many definitions of AI, the fundamental idea is about enabling a computing system to think, to make decisions as well as or better than a human can. In contrast, machine learning is an application of AI that enables a system to automatically learn and improve from experience. The “learning” part is key: it allows an embedded system to learn on its own, without relying on specific programming to do so.
Over the past several months, a variety of chip vendors have rolled out new solutions to enable machine-learning capabilities for embedded systems. In many of these cases, the chip hardware is only part of the solution—often machine learning technology is provided as software or firmware developed to work with MCUs, FPGAs or GPUs. Such software is provided by the chip vendors themselves in some cases, and through machine learning software specialists in others.
COMPILER FOR MACHINE LEARNING
One trend that’s enabling machine learning in embedded MCU-based applications is the forming of partnerships between MCU vendors and providers of machine-learning software. Along just those lines, in July, NXP released its eIQ Machine Learning software support for the Glow neural network (NN) compiler, delivering the first NN compiler implementation for higher performance with low memory footprint on NXP’s i.MX RT crossover MCUs. Developed by Facebook, Glow can integrate target-specific optimizations, and NXP leveraged this ability using NN operator libraries for Arm Cortex-M cores and the Cadence Tensilica HiFi 4 DSP, maximizing the inferencing performance of its i.MX RT685 and i.MX RT1050 and RT1060. This capability is merged into NXP’s eIQ Machine Learning Software Development Environment, freely available within NXP’s MCUXpresso SDK.
Facebook introduced Glow (the Graph Lowering NN compiler) in May 2018 as an open-source community project, with the goal of providing optimizations to accelerate NN performance on a range of hardware platforms. As an NN compiler, Glow takes in an unoptimized NN and generates highly optimized code. This differs from the typical NN model processing whereby a just-in-time compilation is leveraged, which demands more performance and adds memory overhead. Directly running optimized code, like that possible with Glow, greatly reduces the processing and memory requirements. NXP says it has also taken an active role within the Glow open-source community to help drive broad acceptance of new Glow features.
NXP’s eIQ (edge intelligence) environment for machine learning is a comprehensive toolkit that provides the building blocks that developers need to efficiently implement machine learning in edge devices (Figure 1). NXP’s eIQ now includes inferencing support for both Glow and TensorFlow Lite, for which NXP routinely performs benchmarking activities to measure performance. MCU benchmarks include standard NN models, such as CIFAR-10. Using a CIFAR-10 model as an example, the benchmark data acquired by NXP shows how to leverage the performance advantage of the i.MX RT1060 device (with 600MHz Arm Cortex-M7), i.MX RT1170 device (with 1GHz Arm Cortex-M7) and i.MX RT685 device (with 600MHz Cadence Tensilica HiFi 4 DSP).
NXP’s enablement for Glow is tightly coupled with the Neural Network Library (NNLib) that Cadence provides for its Tensilica HiFi 4 DSP, which delivers 4.8GMACs of performance. In the same CIFAR-10 example, NXP’s implementation of Glow achieves a 25x performance advantage by using this DSP to accelerate the NN operations.
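As a rough sanity check on these figures, a 4.8GMACs peak at a 600MHz clock works out to eight multiply-accumulate operations per cycle. The short sketch below does that arithmetic and estimates a best-case inference time. The model size of 10 million MACs is an assumption chosen for illustration, not a figure published by NXP, and real inference times will be longer because of memory traffic and other overhead.

```python
# Figures quoted above for the HiFi 4 DSP on the i.MX RT685.
PEAK_MACS_PER_SECOND = 4.8e9  # 4.8 GMACs
CLOCK_HZ = 600e6              # 600 MHz

# ASSUMPTION: illustrative workload size, not a number published by NXP.
HYPOTHETICAL_MODEL_MACS = 10e6


def macs_per_cycle(peak_macs_per_second, clock_hz):
    """Peak multiply-accumulates completed per clock cycle."""
    return peak_macs_per_second / clock_hz


def best_case_inference_ms(model_macs, peak_macs_per_second):
    """Lower bound on inference time, ignoring memory stalls and overhead."""
    return model_macs / peak_macs_per_second * 1e3


print(macs_per_cycle(PEAK_MACS_PER_SECOND, CLOCK_HZ))  # 8.0
print(round(best_case_inference_ms(HYPOTHETICAL_MODEL_MACS,
                                   PEAK_MACS_PER_SECOND), 2))  # 2.08
```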
THREE ML TEAM-UPS
In September, Microchip Technology announced it had partnered with Cartesiam, Edge Impulse and Motion Gestures. To simplify machine learning implementation, tools from these vendors have been integrated with Microchip’s MPLAB X Integrated Development Environment (IDE). This enables Microchip to support developers through all phases of their AI/machine learning projects, including data gathering, training the models and inference implementation. It is also easy to test these solutions using Microchip’s machine learning evaluation kits such as the EV18H79A or EV45Y33A (Figure 2).
Cartesiam is a software publisher specializing in AI development tools for MCUs. NanoEdge AI Studio, Cartesiam’s patented development environment, allows embedded developers, without any prior knowledge of AI, to rapidly develop specialized machine learning libraries for MCUs. Edge Impulse is the end-to-end developer platform for embedded machine learning. The platform is free for developers, providing dataset collection, DSP and machine learning algorithms, testing and highly efficient inference code generation across a wide range of sensor, audio and vision applications. Motion Gestures provides embedded AI-based gesture recognition software for different sensors, including touch, motion (IMUs for example) and vision. Unlike conventional solutions, the company’s platform does not require any training data collection or programming and uses advanced machine-learning algorithms.
MCU CONDITION MONITORING
Machine condition monitoring is an application that is particularly well suited to benefit from machine learning technology. With that in mind, in August STMicroelectronics (ST) released a free STM32 software function pack that lets users quickly build, train and deploy intelligent edge devices for industrial condition monitoring using an MCU Discovery kit (Figure 3). Developed in conjunction with machine-learning expert and ST Authorized Partner Cartesiam, the FP-AI-NANOEDG1 software pack contains all the necessary drivers, middleware, documentation and sample code to capture sensor data and to integrate and run Cartesiam’s NanoEdge libraries.
Users without specialist AI skills can quickly create and export custom machine-learning libraries for their applications using Cartesiam’s NanoEdge AI Studio tool running on a Windows 10 or Ubuntu PC. The function pack simplifies complete prototyping and validation free of charge on STM32 development boards, before deploying on customer hardware where standard Cartesiam fees apply.
The straightforward methodology established with Cartesiam uses industrial-grade sensors on board a Discovery kit, such as the STM32L562E-DK, to capture vibration data from the monitored equipment both in normal operating modes and under induced abnormal conditions. Software to configure and acquire sensor data is included in the function pack.
NanoEdge AI Studio analyzes the benchmark data and selects pre-compiled algorithms from more than 500 million possible combinations to create optimized libraries for training and inference. The function-pack software provides stubs for the libraries that can be easily replaced for simple embedding in the application. Once deployed, the device can learn the normal pattern of the operating mode locally during the initial installation phase as well as during the lifetime of the equipment, as the function pack permits switching between learning and monitoring modes.
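The idea of switching between a learning mode and a monitoring mode can be sketched in a few lines. The example below is a generic illustration only: the class name, method names and the simple RMS-plus-threshold rule are all assumptions made here, and it does not reproduce the NanoEdge libraries or the function pack's API. It learns a baseline of vibration energy from windows captured during normal operation and then flags windows that fall far outside that baseline.

```python
import math


class VibrationMonitor:
    """Hypothetical learn/monitor sketch; not the NanoEdge AI API."""

    def __init__(self, threshold_sigmas=3.0):
        self.threshold_sigmas = threshold_sigmas
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    @staticmethod
    def rms(window):
        """Root-mean-square amplitude of one window of samples."""
        return math.sqrt(sum(x * x for x in window) / len(window))

    def learn(self, window):
        """Learning mode: fold one window of normal data into the baseline."""
        value = self.rms(window)
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def is_anomaly(self, window):
        """Monitoring mode: True if the window's RMS is far from the baseline."""
        if self.count < 2:
            raise RuntimeError("call learn() on at least two windows first")
        std = math.sqrt(self.m2 / (self.count - 1))
        limit = self.threshold_sigmas * max(std, 1e-9)
        return abs(self.rms(window) - self.mean) > limit


def sine_window(amplitude, samples=64, period=16):
    """Synthetic stand-in for accelerometer data."""
    return [amplitude * math.sin(2 * math.pi * i / period)
            for i in range(samples)]


monitor = VibrationMonitor()
for amplitude in (0.9, 1.0, 1.1, 1.0, 0.95):  # normal operating conditions
    monitor.learn(sine_window(amplitude))

print(monitor.is_anomaly(sine_window(1.0)))  # False
print(monitor.is_anomaly(sine_window(3.0)))  # True
```

Because the baseline is kept as running statistics, learning can resume at any point in the equipment's lifetime without storing earlier windows, which suits a memory-limited MCU.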
According to ST, using the Discovery kit to acquire data, generate, train and monitor the solution, leveraging free tools and software and the support of the STM32 ecosystem, developers can quickly create a proof-of-concept model at low cost and easily port the application to other STM32 MCUs.
The STM32L562E-DK Discovery kit contains an STM32L562QEI6QU ultra-low-power MCU, an iNEMO 3D accelerometer and 3D gyroscope, as well as two MEMS microphones, a 240×240 color TFT-LCD module and on-board STLINK-V3E debugger/programmer.
INFERENCE IC FOR THE EDGE
An example of a device designed specifically for AI and machine learning is the InferX X1 from Flex Logix Technologies. Released in October, the InferX X1 is claimed by Flex Logix to be the industry’s fastest AI inference chip for edge systems. InferX X1 accelerates NN models, such as those for object detection and recognition, for robotics, industrial automation, medical imaging, gene sequencing, bank security, retail analytics, autonomous vehicles, aerospace and more, says the company (Figure 4). According to Flex Logix, InferX X1 runs YOLOv3 object detection and recognition 30% faster than NVIDIA’s industry-leading Jetson Xavier and runs other real-world customer models up to 10 times faster.
According to Flex Logix, many customers plan to use YOLOv3 in their products in robotics, bank security and retail analytics because it is the highest accuracy object detection and recognition algorithm. Additional customers have custom models they have developed for a range of applications where they need more throughput at lower cost. Flex Logix says it has benchmarked models for these applications and demonstrated to these customers that InferX X1 provides the needed throughput and lower cost.
Based on multiple Flex Logix proprietary technologies, the InferX X1 features a new architecture that achieves more throughput from less silicon area. Flex Logix’s XFLX double-density programmable interconnect is already used in the eFPGA (embedded FPGA) that Flex Logix has supplied for years to multiple customers. This is combined with a reconfigurable Tensor Processor consisting of 64 1-Dimensional Tensor Processors that are reconfigurable to efficiently implement the wide range of NN operations. Because reconfiguration can be done in microseconds, each layer of an NN model can be optimized with its own full-speed data paths. Broad market sampling of the device will begin in Q1 2021 and production parts will be available in Q2 2021.
Another MCU vendor that’s teamed up with machine learning software vendors is Renesas Electronics. In August, the company announced the second phase of new partner solutions supporting the Renesas RA Family of 32-bit Arm Cortex-M MCUs. Among these new solutions was Qeexo’s AutoML.
AutoML is a fully automated, end-to-end platform that empowers users to collect, clean and visualize data to build machine learning models for comparison (Figure 5). Solutions built with Qeexo AutoML are optimized to have ultra-low latency, ultra-low power consumption and an incredibly small memory footprint. A selected model can be deployed to target embedded hardware with just one click.
AutoML supports Arm Cortex-M0 to M4 class MCUs like the Renesas Synergy S5D9 and Renesas RA6M3. The platform enables a wide range of machine learning methods, including: GBM, XGBoost, Random Forest, Logistic Regression, CNN, RNN, ANN, Local Outlier Factor and Isolation Forest. Libraries generated from Qeexo AutoML are optimized for constrained endpoint device architectures: low latency, low power consumption and small footprint.
The AutoML tool automates tedious and repetitive machine learning processes, saves time/cost to production and eliminates room for error. No coding is necessary to use it, and machine-learning expertise is not required.
FPGAS AND MACHINE LEARNING
In October, FPGA vendor QuickLogic announced that its ArcticPro 3 embedded FPGA (eFPGA) IP is now available on Samsung’s 28nm FD-SOI process. According to QuickLogic, the ArcticPro 3 eFPGA IP has been designed from the ground up on the Samsung 28FDS process, resulting in a significant boost in performance and delivering ultra-low standby current leakage. The 28FDS process supports body biasing, giving OEMs and semiconductor companies the ability to “dial in” the ideal performance/power consumption parameters to meet their system requirements.
The ArcticPro 3 IP is developed with a homogeneous, reprogrammable fabric architecture based on SLCs (Super Logic Cells), each consisting of eight LUT4 + Register blocks (Figure 6). It uses a hierarchical routing scheme that strikes the optimum balance of performance and power consumption needed for computation-heavy, battery-powered or other power-sensitive products. In addition to the logic fabric, the eFPGA IP includes the option to integrate fixed-function blocks such as embedded RAM and fracturable Multiply-Accumulate (MAC) blocks to efficiently implement hardware accelerators for NNs and other computationally intensive circuits foundational in AI/machine learning applications.
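For readers less familiar with FPGA fabric, a LUT4 is a four-input lookup table: a 16-entry truth table that can realize any Boolean function of four inputs, typically paired with a register that captures the result on a clock edge. The sketch below models that generic building block in software. The class and helper names are invented for illustration, and the model says nothing about the internal details of QuickLogic's Super Logic Cell.

```python
class Lut4WithRegister:
    """Generic model of a 4-input LUT feeding a flip-flop (illustrative only)."""

    def __init__(self, truth_table):
        if not 0 <= truth_table < (1 << 16):
            raise ValueError("truth table must fit in 16 bits")
        self.truth_table = truth_table  # bit i holds the output for input index i
        self.q = 0                      # registered output

    def lookup(self, a, b, c, d):
        """Combinational output: the four inputs select one truth-table bit."""
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.truth_table >> index) & 1

    def clock(self, a, b, c, d):
        """Capture the LUT output in the register, as on a clock edge."""
        self.q = self.lookup(a, b, c, d)
        return self.q


def truth_table_from(func):
    """Build the 16-bit configuration word for any 4-input Boolean function."""
    table = 0
    for index in range(16):
        bits = [(index >> i) & 1 for i in range(4)]
        if func(*bits):
            table |= 1 << index
    return table


parity = Lut4WithRegister(truth_table_from(lambda a, b, c, d: a ^ b ^ c ^ d))
print(hex(parity.truth_table))   # 0x6996
print(parity.clock(1, 0, 1, 1))  # 1
```

Reprogramming the fabric amounts to loading different truth-table words (and routing settings), which is why the same silicon can implement very different accelerators.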
ArcticPro 3 is supported by QuickLogic’s proprietary tools as well as the QuickLogic Open Reconfigurable Computing (QORC) vendor-supported open source FPGA development tools, giving designers full control over their system software development environment.
SOFTWARE TOOLS UNITE
QuickLogic also provides machine learning solutions via its subsidiary SensiML. In September, SensiML announced that its SensiML Analytics Toolkit is now integrated with Google’s “TensorFlow Lite for Microcontrollers.” This means that developers working with Google’s TensorFlow Lite for Microcontrollers open source NN inference engine have the option to leverage SensiML’s automated data labeling and preprocessing capabilities to reduce dataset errors, build more efficient edge models and do so more quickly, says the company.
Using the standardized workflow within the SensiML model building pipeline, developers can collect and label data using the SensiML Data Capture Lab, create data pre-processing and feature engineering pipelines using the SensiML Analytics Studio and perform classification using TensorFlow Lite for Microcontrollers. The net result is a state-of-the-art toolkit for developing smart sensor algorithms capable of running on low power IoT endpoint devices (Figure 7).
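The three stages of that workflow (capture and label, feature engineering, classification) can be sketched generically. The example below is a minimal illustration under stated assumptions: the function names and the chosen features are invented here and are not SensiML APIs, and a simple nearest-centroid rule stands in for the TensorFlow Lite for Microcontrollers model that would perform classification in a real project.

```python
import math


def windows(samples, size, step):
    """Split a sample stream into fixed-size, possibly overlapping windows."""
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, step)]


def features(window):
    """Feature-engineering stage: reduce a window to (mean, RMS, peak-to-peak)."""
    mean = sum(window) / len(window)
    rms = math.sqrt(sum(x * x for x in window) / len(window))
    return (mean, rms, max(window) - min(window))


def train_centroids(labeled_windows):
    """Average the feature vectors per label (stand-in for model training)."""
    sums = {}
    for label, window in labeled_windows:
        f = features(window)
        total, n = sums.get(label, ((0.0,) * len(f), 0))
        sums[label] = (tuple(t + v for t, v in zip(total, f)), n + 1)
    return {label: tuple(t / n for t in total)
            for label, (total, n) in sums.items()}


def classify(window, centroids):
    """Classification stage: pick the label with the nearest centroid."""
    f = features(window)
    return min(centroids, key=lambda label: math.dist(f, centroids[label]))


# Synthetic, hand-labeled captures standing in for real sensor recordings.
idle_stream = [0.0, 0.1] * 8
shake_stream = [-1.0, 1.0] * 8
labeled = ([("idle", w) for w in windows(idle_stream, 8, 4)] +
           [("shake", w) for w in windows(shake_stream, 8, 4)])

centroids = train_centroids(labeled)
print(classify([0.05, 0.0, 0.1, 0.0] * 2, centroids))  # idle
print(classify([0.9, -0.9] * 4, centroids))            # shake
```

Computing compact features on the device before classification keeps the model input small, which matters on endpoints with only a few kilobytes of RAM.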
TensorFlow Lite for Microcontrollers is a version of TensorFlow Lite from Google that has been specifically designed to implement machine-learning models on MCUs and other memory-limited devices. SensiML, via its SensiML Analytics Toolkit, delivers an easy and transparent set of developer tools for the creation and deployment of edge AI sensor algorithms for IoT devices. Through this tightly coupled integration of SensiML and Google’s TensorFlow, developers can use the solutions for building intelligent sensor AI algorithms capable of running autonomously on IoT edge devices.
ON-CHIP ML PROCESSORS
For its part, eFPGA vendor Achronix views FPGA-based solutions as ideal for AI and machine-learning applications. According to the company, the core of many AI/machine learning algorithms is pattern recognition, and those algorithms are often implemented as an NN. Deep convolutional NNs (DNNs) are a popular approach here because they offer high accuracy for important image-classification tasks. AI/machine learning algorithms generally employ matrix and vector math, which requires trillions of MAC operations per second. These are executed in fast multipliers and adders, generally called MAC units.
With all that in mind, Achronix’s Speedster7t FPGA architecture has been tuned to create an optimized, balanced, massively parallel compute engine for AI/machine learning applications. Each Speedster7t FPGA features a massively parallel array of programmable compute elements, organized into new machine learning processor (MLP) blocks (Figure 8).
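To make the role of the MAC concrete, the sketch below writes a fully connected NN layer as explicit multiply-accumulate steps and counts them, then applies the standard formula for the MAC count of a convolution layer. The function names and the layer dimensions are illustrative assumptions rather than figures from Achronix.

```python
def dense_layer(weights, inputs, biases):
    """Matrix-vector product written as explicit multiply-accumulates."""
    outputs = []
    mac_count = 0
    for row, bias in zip(weights, biases):
        acc = bias
        for w, x in zip(row, inputs):
            acc += w * x  # one MAC: a multiply followed by an accumulate
            mac_count += 1
        outputs.append(acc)
    return outputs, mac_count


def conv_layer_macs(out_h, out_w, out_channels, k_h, k_w, in_channels):
    """MACs needed by one convolution layer (one per weight per output pixel)."""
    return out_h * out_w * out_channels * k_h * k_w * in_channels


outputs, macs = dense_layer([[1, 2, 3], [4, 5, 6]], [1, 0, -1], [0, 10])
print(outputs, macs)  # [-2, 8] 6

# Hypothetical first layer on a 32x32 RGB image: 32 filters of size 3x3.
print(conv_layer_macs(32, 32, 32, 3, 3, 3))  # 884736
```

Even this single small layer needs close to a million MACs per image, which is why hardware that executes many MACs in parallel dominates inference performance.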
Each MLP is a highly configurable, compute-intensive block with up to 32 multipliers. It supports integer formats from 4 bits to 24 bits and various floating-point modes, including direct support for TensorFlow’s bfloat16 format and block floating-point (BFP) format.
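The bfloat16 format keeps the sign bit and all eight exponent bits of an IEEE 754 single-precision value but only the top seven mantissa bits, so a conversion is essentially a rounding of the 32-bit pattern to its upper 16 bits. The sketch below shows that conversion with round-to-nearest-even. The helper names are invented for illustration, and NaN inputs are deliberately not special-cased to keep the example short.

```python
import struct


def float32_bits(value):
    """Return the IEEE 754 single-precision bit pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def to_bfloat16_bits(value):
    """Round a float to bfloat16 (nearest, ties to even); returns 16 bits.

    NaN inputs are not special-cased in this sketch.
    """
    bits = float32_bits(value)
    # Adding 0x7FFF plus the lowest kept bit implements ties-to-even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bfloat16_bits(bits16):
    """Expand a bfloat16 pattern to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


print(hex(to_bfloat16_bits(1.0)))                     # 0x3f80
print(hex(to_bfloat16_bits(3.14159)))                 # 0x4049
print(from_bfloat16_bits(to_bfloat16_bits(3.14159)))  # 3.140625
```

The round trip shows the trade-off: the value keeps float32's dynamic range but only about two to three significant decimal digits, which is often acceptable for NN weights and activations.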
The MLP’s programmable MAC incorporates both a fracturable integer MAC and a hard floating-point MAC. Each MLP block in the Speedster7t fabric also incorporates two memories that are closely coupled to the MAC blocks. One memory is a large, dual-port, 72Kb embedded SRAM (BRAM72k), and the other is a 2Kb (LRAM2k) cyclic buffer. The number of available MLP blocks varies by device, but can number into the thousands.
The MAC’s fracturable nature allows it to optimally handle the reduced-precision calculations increasingly used by AI/machine learning inference algorithms to minimize memory requirements. Because the MAC is fracturable, the MLP can perform a greater number of computations as the precision of the number formats is reduced.
The MLP offers a range of features including integer multiply with optional accumulate, bfloat16 operations, floating point 16, floating point 24 and block floating point 16.
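Block floating point stores one shared exponent for a whole block of values plus a small integer mantissa per value, so the inner loop of a dot product reduces to integer MACs with a single scaling step at the end. The sketch below illustrates the general idea. The 8-bit mantissa width, the function names and the encoding details are assumptions made for this example and are not a description of Achronix's specific BFP formats.

```python
import math


def bfp_encode(values, mantissa_bits=8):
    """Encode a block as (shared_exponent, list of signed integer mantissas)."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0, [0] * len(values)
    # frexp returns largest = m * 2**e with 0.5 <= m < 1.
    _, shared_exponent = math.frexp(largest)
    scale = 2.0 ** (mantissa_bits - 1 - shared_exponent)
    limit = (1 << (mantissa_bits - 1)) - 1
    mantissas = [max(-limit, min(limit, round(v * scale))) for v in values]
    return shared_exponent, mantissas


def bfp_decode(shared_exponent, mantissas, mantissa_bits=8):
    """Recover approximate floating-point values from a BFP block."""
    scale = 2.0 ** (shared_exponent - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


def bfp_dot(a, b, mantissa_bits=8):
    """Dot product using integer MACs, scaled once by the two exponents."""
    exp_a, man_a = bfp_encode(a, mantissa_bits)
    exp_b, man_b = bfp_encode(b, mantissa_bits)
    acc = sum(x * y for x, y in zip(man_a, man_b))  # integer-only inner loop
    return acc * 2.0 ** (exp_a + exp_b - 2 * (mantissa_bits - 1))


print(bfp_encode([1.0, 0.5, -0.25, 0.0]))                      # (1, [64, 32, -16, 0])
print(bfp_dot([1.0, 0.5, -0.25, 0.0], [2.0, 2.0, 4.0, 1.0]))   # 2.0
```

Values much smaller than the largest element in a block lose precision because they share its exponent, which is the price paid for the cheaper integer arithmetic.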
LOW-POWER MACHINE LEARNING
IoT edge applications have extreme requirements when it comes to low power. Such devices often are in remote places where battery replacement isn’t practical. For its part, Eta Compute has brought machine learning to such applications with its TENSAI Flow software. It is a software suite that enables seamless design from concept to firmware, speeding the creation of machine learning applications in IoT and low-power edge devices.
TENSAI Flow software reduces development risks by quickly confirming feasibility and proof of concept, says the company (Figure 9). It includes an NN compiler, an NN “zoo” and middleware comprising FreeRTOS, HAL and frameworks for sensors, as well as IoT/cloud enablement.
According to Eta Compute, the TENSAI Flow exclusive NN compiler delivers the best optimization for NNs running on Eta Compute’s device as well as the industry’s best power efficiency. In addition, the middleware makes dual core programming seamless by eliminating the need to write customized code to take full advantage of DSPs. A unique Neural Network Zoo accelerates and simplifies development with ready-to-use networks for the most common use cases. These will include motion, image and sound classification. Developers simply train the networks with their data. And, with the insight from TENSAI Flow’s real-world applications, developers can see the potential of neural sensor processors precisely in terms of energy efficiency and performance in a variety of field-tested examples.
Compared to direct implementation of the same CIFAR10 NN on a competitive device, the TENSAI NN compiler on the TENSAI SoC improves energy per inference by a factor of 54, says the company. Using the CIFAR10 NN from the TENSAI NN zoo together with the TENSAI NN compiler improves the energy per inference further, bringing the total improvement to a factor of 200.
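Taken together, the two quoted factors imply that switching to the zoo network contributes roughly a further 3.7x on top of the compiler alone. The short sketch below works through that arithmetic. The baseline energy is normalized to 1.0 because the article gives no absolute figure, and the function names are chosen here for illustration.

```python
# Improvement factors quoted by Eta Compute.
COMPILER_GAIN = 54.0            # TENSAI NN compiler on TENSAI SoC
COMPILER_PLUS_ZOO_GAIN = 200.0  # compiler plus the zoo's CIFAR10 network

# ASSUMPTION: baseline normalized to 1.0; no absolute energy is given.
BASELINE_ENERGY = 1.0


def implied_extra_gain(total_gain, partial_gain):
    """Gain attributable to the step that turns partial_gain into total_gain."""
    return total_gain / partial_gain


def energy_after(baseline_energy, gain):
    """Energy per inference once an improvement factor is applied."""
    return baseline_energy / gain


print(round(implied_extra_gain(COMPILER_PLUS_ZOO_GAIN, COMPILER_GAIN), 2))  # 3.7
print(energy_after(BASELINE_ENERGY, COMPILER_PLUS_ZOO_GAIN))                # 0.005
```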
Through its interface with Edge Impulse, TENSAI Flow allows developers to securely acquire and store training data so customers train once and have real-world models for future development. The software automatically optimizes TensorFlow Lite AI models for Eta Compute’s TENSAI SoC.
With TENSAI Flow, TENSAI SoC can load AI models that include sensor interfaces seamlessly. TENSAI Flow provides the foundation to automatically provision and connect devices to the cloud and upgrade firmware over the air based on new models or data.
MACHINE LEARNING FOR CAMERAS
Smart cameras are yet another application area where machine learning can make a difference. Feeding such needs, in July, Qualcomm Technologies, a subsidiary of Qualcomm, introduced the Qualcomm QCS610 and Qualcomm QCS410 SoCs to the Qualcomm Vision Intelligence Platform. The QCS610 and QCS410 are designed to bring advanced camera technology, including AI and machine learning features formerly only available to high-end devices, into mid-tier camera segments (Figure 10).
The QCS610 and QCS410 bring a highly integrated solution designed with multiple features to deliver a one-stop shop for developers building camera-based devices. The platform is built with the company’s upgraded Qualcomm Kryo CPU, Qualcomm Adreno GPU and Qualcomm Hexagon DSP. It includes the Qualcomm AI Engine, which is designed to deliver up to 50% better AI performance than the previous generation. This latest generation was rearchitected to deliver improved efficiencies and faster inferencing in the DSP, resulting in more computing power and AI inferencing at the device level, says the company.
The Qualcomm Vision Intelligence Platform supports Linux and Android OS for a variety of IoT segments, including camera, Edge AI box, retail and robotics. Further enhanced capabilities include support for Microsoft Azure Machine Learning and Azure services. The platform also provides dual ISPs for video capture, along with integrated audio, GNSS and hardware-based security. The connectivity options include 5G/4G, Wi-Fi, Bluetooth and Ethernet.
Achronix | www.achronix.com
AMD | www.amd.com
Eta Compute | www.etacompute.com
Flex Logix Technologies | www.flex-logix.com
Microchip Technology | www.microchip.com
Qualcomm Technologies | www.qualcomm.com
QuickLogic | www.quicklogic.com
SensiML | www.sensiml.com
STMicroelectronics | www.st.com
Renesas Electronics | www.renesas.com
Xilinx | www.xilinx.com
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • DECEMBER 2020 #365