Chip-Level Solutions Feed AI Needs

Embedded Supercomputing

Gone are the days when supercomputing meant big, rack-based systems in an air-conditioned room. Today, embedded processors, FPGAs and GPUs can perform AI and machine learning operations, enabling new types of local decision making in embedded systems.

By Jeff Child, Editor-in-Chief

Embedded computing technology has evolved well past the point where complete system functionality on a single chip is remarkable. Today, the levels of compute performance and parallel processing available on an IC mean that what were once supercomputing levels of capability can now be implemented in chip-level solutions.

While supercomputing has become a generalized term, what system developers are really interested in is crafting artificial intelligence, machine learning and neural networking using today’s embedded processing. Supplying the technology for these efforts are the makers of leading-edge embedded processors, FPGAs and GPUs. In these tasks, GPUs are used for “general-purpose computing on GPUs,” a technique known as GPGPU computing.

With all that in mind, embedded processor, GPU and FPGA companies have rolled out a variety of solutions over the last 12 months, aimed at performing AI, machine learning and other advanced computing functions for several demanding embedded system application segments.

FPGAs Take AI Focus

Back in March, FPGA vendor Xilinx announced its plans to launch a new FPGA product category it calls its adaptive compute acceleration platform (ACAP). Following up on that, in October the company unveiled Versal—the first of its ACAP implementations. Versal ACAPs combine scalar processing engines, adaptable hardware engines and intelligent engines with advanced memory and interfacing technologies to provide heterogeneous acceleration for any application. But even more importantly, according to Xilinx, the Versal ACAP’s hardware and software can be programmed and optimized by software developers, data scientists and hardware developers alike. This is enabled by a host of tools, software, libraries, IP, middleware and frameworks that facilitate industry-standard design flows.

Built on TSMC’s 7-nm FinFET process technology, the Versal portfolio combines software programmability with domain-specific hardware acceleration and adaptability. The portfolio includes six series of devices architected to deliver scalability and AI inference capabilities for a host of applications across different markets, from cloud to networking to wireless communications to edge computing and endpoints.

The portfolio includes the Versal Prime series, Premium series and HBM series, which are designed to deliver high performance, connectivity, bandwidth, and integration for the most demanding applications. It also includes the AI Core series, AI Edge series and AI RF series, which feature the AI Engine (Figure 1). The AI Engine is a new hardware block designed to address the emerging need for low-latency AI inference for a wide variety of applications and also supports advanced DSP implementations for applications like wireless and radar.

Figure 1
Xilinx’s AI Engine is a new hardware block designed to address the emerging need for low-latency AI inference for a wide variety of applications. It also supports advanced DSP implementations for applications like wireless and radar.

The AI Engine is tightly coupled with the Versal Adaptable Hardware Engines to enable whole-application acceleration, meaning that both the hardware and software can be tuned to ensure maximum performance and efficiency. The portfolio debuts with the Versal Prime series, delivering broad applicability across multiple markets, and the Versal AI Core series, delivering an estimated 8x AI inference performance boost compared to industry-leading GPUs, according to Xilinx.

Low-Power AI Solution

Following the AI trend, back in May Lattice Semiconductor unveiled Lattice sensAI, a technology stack that combines modular hardware kits, neural network IP cores, software tools, reference designs and custom design services. In September the company unveiled expanded features of the sensAI stack designed for developers of flexible machine learning inferencing in consumer and industrial IoT applications. Building on the ultra-low power (1 mW to 1 W) focus of the sensAI stack, Lattice released new IP cores, reference designs, demos and hardware development kits that provide scalable performance and power for always-on, on-device AI applications.

Embedded system developers can build a variety of solutions enabled by sensAI. They can build stand-alone iCE40 UltraPlus/ECP5 FPGA-based, always-on, integrated solutions with latency, security and form factor benefits. Alternatively, they can use the iCE40 UltraPlus as an always-on processor that detects key phrases or objects, and wakes up a high-performance AP SoC or ASIC for further analytics only when required, reducing overall system power consumption. And finally, they can use the scalable performance/power benefits of the ECP5 for neural network acceleration, along with its I/O flexibility to seamlessly interface to on-board legacy devices, including sensors and low-end MCUs, for system control.

Figure 2
Human face detection application example. The iCE40 UltraPlus enables AI with an always-on image sensor, while consuming less than 1 mW of active power.

Updates to the sensAI stack include a new CNN (convolutional neural network) Compact Accelerator IP core for improved accuracy on iCE40 UltraPlus FPGAs and an enhanced CNN Accelerator IP core for improved performance on ECP5 FPGAs. Software tools include an updated neural network compiler tool with improved ease-of-use and both Caffe and TensorFlow support for iCE40 UltraPlus FPGAs. Also provided are reference designs and demos enabling human presence detection and hand gesture recognition (Figure 2). New iCE40 UltraPlus development platform support includes a Himax HM01B0 UPduino shield and the DPControl iCEVision board…
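
To give a sense of the scale of network involved, below is a minimal sketch of a compact CNN written against TensorFlow’s Keras API. The layer sizes and the presence-detection task are illustrative assumptions, not Lattice’s reference design; the idea is that a model trained along these lines is then fed to the vendor’s neural network compiler for mapping onto the FPGA accelerator IP.

from tensorflow.keras import layers, models

# Hypothetical compact CNN for 64x64 grayscale human-presence
# detection; layer sizes are illustrative, chosen to suggest the
# footprint a milliwatt-class FPGA accelerator might handle.
model = models.Sequential([
    layers.Conv2D(8, 3, activation='relu', input_shape=(64, 64, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(16, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(2, activation='softmax'),  # person / no person
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()  # training would follow; the trained model then goes
                 # through the vendor's compiler, which is not shown here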

Read the full article in the December 341 issue of Circuit Cellar

Don’t miss out on upcoming issues of Circuit Cellar. Subscribe today!

Note: We’ve made the October 2017 issue of Circuit Cellar available as a free sample issue. In it, you’ll find a rich variety of the kinds of articles and information that exemplify a typical issue of the current magazine.

COM Express Card Sports 3 GHz Core i3 Processor

Congatec has introduced a Computer-on-Module for the entry level of high-end embedded computing, based on Intel’s latest Core i3-8100H processor platform. The board’s 16 fast PCIe Gen 3.0 lanes make it suited for all new artificial intelligence (AI) and machine learning applications requiring multiple GPUs for massive parallel processing.

The new conga-TS370 COM Express Basic Type 6 Computer-on-Module with quad-core Intel Core i3-8100H processor offers a 45 W TDP configurable down to 35 W, includes 6 MB of cache and provides up to 32 GB of dual-channel DDR4 2400 memory. Compared to the preceding 7th generation of Intel Core processors, the improved memory bandwidth also helps to increase the graphics and GPGPU performance of the integrated new Intel UHD 630 graphics, which additionally features an increased maximum dynamic frequency of up to 1.0 GHz for its 24 execution units. It supports up to three independent 4K displays at up to 60 Hz via DP 1.4, HDMI, eDP and LVDS.

Embedded system designers can now switch from eDP to LVDS purely by modifying the software without any hardware changes. The module further provides exceptionally high bandwidth I/Os including 4x USB 3.1 Gen 2 (10 Gbit/s), 8x USB 2.0 and 1x PEG and 8 PCIe Gen 3.0 lanes for powerful system extensions including Intel Optane memory. All common Linux operating systems as well as the 64-bit versions of Microsoft Windows 10 and Windows 10 IoT are supported. Congatec’s personal integration support rounds off the feature set. Additionally, Congatec also offers an extensive range of accessories and comprehensive technical services, which simplify the integration of new modules into customer-specific solutions.

Congatec | www.congatec.com

MPU Targets AI-Based Imaging Processing

Renesas Electronics has developed the new RZ/A2M microprocessor (MPU) to expand the use of embedded artificial intelligence (e-AI) solutions to high-end applications. The new MPU delivers 10 times the image processing performance of its predecessor, the RZ/A1, and incorporates Renesas’ exclusive Dynamically Reconfigurable Processor (DRP), which achieves real-time image processing at low power consumption. This allows applications incorporating embedded devices, such as smart appliances, service robots and compact industrial machinery, to carry out image recognition employing cameras and other AI functions while maintaining low power consumption, accelerating the realization of intelligent endpoints.

Currently, there are several challenges to using AI in the operational technology (OT) field, such as difficulty transferring large amounts of sensor data to the cloud for processing, and delays waiting for AI judgments to be transferred back from the cloud. Renesas already offers AI unit solutions that can detect previously invisible faults in real time by minutely analyzing oscillation waveforms from motors or machines. To accelerate the adoption of AI in the OT field, Renesas has developed the RZ/A2M with DRP, which makes possible image-based AI functionality requiring larger volumes of data and more powerful processing performance than achievable with waveform measurement and analysis.

Since real-time image processing can be accomplished while consuming very little power, battery-powered devices can perform tasks such as real-time image recognition based on camera input, biometric authentication using fingerprints or iris scans, and high-speed scanning by handheld scanners. This solves several issues associated with cloud-based approaches, such as the difficulty of achieving real-time performance, assuring privacy and maintaining security.

The RZ/A2M with DRP is a new addition to the RZ/A Series lineup of MPUs equipped with large capacity on-chip RAM, which eliminates the need for external DRAM. The RZ/A Series MPUs address applications employing human-machine interface (HMI) functionality, and the RZ/A2M adds to this capability with features ideal for applications using cameras. It supports the MIPI camera interface, widely used in mobile devices, and is equipped with a DRP for high-speed image processing.

Renesas has also boosted network functionality with the addition of two-channel Ethernet support, and enhanced secure functionality with an on-chip hardware encryption accelerator. These features enable safe and secure network connectivity, making the new RZ/A2M best suited for a wide range of systems employing image recognition, from home appliances to industrial machinery.

Samples of the RZ/A2M with DRP are available now. The RZ/A2M MPUs are offered with a development board, reference software, and DRP image-processing library, allowing customers to begin evaluating HMI function and image processing performance. Mass production is scheduled to start in the first quarter of 2019, and monthly production volume for all RZ/A2M versions is anticipated to reach a combined 400,000 units by 2021.

Renesas Electronics | www.renesas.com

SoC Provides Neural Network Acceleration

BrainChip claims to be the first company to bring a production spiking neural network architecture to market. Called the Akida Neuromorphic System-on-Chip (NSoC), the device is small, low cost and low power, making it well suited for edge applications such as advanced driver assistance systems (ADAS), autonomous vehicles, drones, vision-guided robotics, surveillance and machine vision systems. Its scalability allows users to network many Akida devices together to perform complex neural network training and inferencing for many markets, including agricultural technology (AgTech), cybersecurity and financial technology (FinTech).

According to BrainChip CEO Lou DiNardo, Akida, which is Greek for “spike,” represents the first in a new breed of hardware solutions for AI. He expects artificial intelligence at the edge to be as significant and prolific as the microcontroller.

The Akida NSoC uses a pure CMOS logic process, ensuring high yields and low cost. Spiking neural networks (SNNs) are inherently lower power than traditional convolutional neural networks (CNNs), as they replace the math-intensive convolutions and back-propagation training methods with biologically inspired neuron functions and feed-forward training methodologies. BrainChip’s research has determined the optimal neuron model and training methods, bringing unprecedented efficiency and accuracy. Each Akida NSoC has effectively 1.2 million neurons and 10 billion synapses, representing 100 times better efficiency than neuromorphic test chips from Intel and IBM. Comparisons to leading CNN accelerator devices show similar performance gains of an order of magnitude better images/second/watt running industry standard benchmarks such as CIFAR-10 with comparable accuracy.
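
To make the distinction concrete, here is a toy leaky integrate-and-fire (LIF) neuron in Python. It is the generic textbook model, not BrainChip’s proprietary neuron design: note that work is done only when input spikes arrive, rather than in dense multiply-accumulate sweeps, which is the root of the power advantage.

def lif_neuron(spikes, weight=0.5, leak=0.9, threshold=1.0):
    # Integrate weighted input events into a leaky membrane potential
    # and emit an output spike whenever the threshold is crossed.
    v = 0.0
    out = []
    for s in spikes:
        v = v * leak + weight * s   # leak old charge, add new input
        if v >= threshold:
            out.append(1)           # fire
            v = 0.0                 # reset membrane potential
        else:
            out.append(0)
    return out

print(lif_neuron([1, 1, 1, 0, 1, 1, 1]))  # -> [0, 0, 1, 0, 0, 0, 1]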

The Akida NSoC is designed for use as a stand-alone embedded accelerator or as a co-processor. It includes sensor interfaces for traditional pixel-based imaging, dynamic vision sensors (DVS), Lidar, audio, and analog signals. It also has high-speed data interfaces such as PCI-Express, USB, and Ethernet. Embedded in the NSoC are data-to-spike converters designed to optimally convert popular data formats into spikes to train and be processed by the Akida Neuron Fabric.

Spiking neural networks are inherently feed-forward dataflows, for both training and inference. Ingrained within the Akida neuron model are innovative training methodologies for supervised and unsupervised training. In the supervised mode, the initial layers of the network train themselves autonomously, while in the final fully-connected layers, labels can be applied, enabling these networks to function as classification networks. The Akida NSoC is designed to allow off-chip training in the Akida Development Environment, or on-chip training. An on-chip CPU is used to control the configuration of the Akida Neuron Fabric as well as off-chip communication of metadata.

The Akida Development Environment is available now for early access customers to begin the creation, training, and testing of spiking neural networks targeting the Akida NSoC. The Akida NSoC is expected to begin sampling in Q3 2019.

Brainchip | www.brainchip.com

SDR Meets AI in a Mash-Up of Jetson TX2, Artix-7 and 2×2 MIMO

By Eric Brown

A Philadelphia-based startup called Deepwave Digital has gone to Crowd Supply to launch its “Artificial Intelligence Radio – Transceiver” (AIR-T) SBC. The AIR-T is a software defined radio (SDR) platform for the 300 MHz to 6 GHz range with AI and deep learning hooks, designed for “low-cost AI, deep learning, and high-performance wireless systems,” says Deepwave Digital. The 170 mm x 170 mm Mini-ITX board is controlled by an Ubuntu stack running on the hexa-core Arm-based Nvidia Jetson TX2 module. There’s also a Xilinx Artix-7 FPGA and an Analog Devices AD9371 RFIC 2×2 MIMO transceiver.


 
AIR-T with Jetson TX2 module

The AIR-T is available through Aug. 14 for $4,995 on Crowd Supply with shipments due at the end of November. Deepwave Digital has passed the halfway point to its $20K goal, but it’s already committed to building the boards regardless of the outcome.

The AIR-T is designed for researchers who want to apply the deep learning powers of the Jetson TX2’s 256-core Pascal GPU and its CUDA libraries to the SDR capabilities provided by the Artix-7 and AD9371 transceiver. The platform can function as a “highly parallel SDR, data recorder, or inference engine for deep learning algorithms,” and provides for “fully autonomous SDR by giving the AI engine complete control over the hardware,” says Deepwave Digital. Resulting SDR applications can process bandwidths greater than 200 MHz in real time, claims the company.

The software platform is built around “custom and open” Ubuntu 16.04 software running on the Jetson TX2, as well as custom FPGA blocks that interface with the open source GNU Radio SDR development platform.

The combined stack enables developers to avoid coding CUDA or VHDL. You can prototype in GNU Radio, and then optionally port the result to Python or C++. More advanced users can program the Artix-7 FPGA and Pascal GPU directly. AIR-T is described as an “open platform,” but this would appear to refer to the software rather than the hardware.
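
For readers who haven’t used it, a GNU Radio flowgraph is simply a Python program that wires signal-processing blocks together. The sketch below is a generic example under that assumption, not Deepwave code: it records one second of a synthetic complex tone to disk, with a signal-generator block standing in for the AIR-T’s radio front end.

from gnuradio import gr, blocks, analog

class ToneToFile(gr.top_block):
    def __init__(self):
        gr.top_block.__init__(self, "tone_to_file")
        samp_rate = 1e6
        # A 100 kHz complex tone stands in for the RF front end here
        src = analog.sig_source_c(samp_rate, analog.GR_COS_WAVE, 100e3, 1.0)
        head = blocks.head(gr.sizeof_gr_complex, int(samp_rate))  # 1 s
        sink = blocks.file_sink(gr.sizeof_gr_complex, "capture.iq")
        self.connect(src, head, sink)

tb = ToneToFile()
tb.run()  # returns once the head block has passed 1 s of samples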



AIR-T software flow

The AIR-T enables the development of new wireless technologies, where AI can help maximize resources with today’s increasingly limited spectrum. Potential capabilities include autonomous signal identification and interference mitigation. The AIR-T can also be used for satellite and terrestrial communications. The latter includes “high-power, high-frequency voice communications to 60 GHz millimeter wave digital technology,” says Deepwave.

Other applications include video, image, and audio recognition. You can “demodulate a signal and apply deep learning to the resulting image, video, or audio data in one integrated platform,” says the company. The product can also be used for electrical engineering or applied physics research.


Jetson TX2

Nvidia’s Jetson TX2 module features 2x high-end “Denver 2” cores, 4x Cortex-A57 cores, and the 256-core Pascal GPU with CUDA libraries for running machine learning algorithms. The TX2 also supplies the AIR-T with 8 GB of LPDDR4 RAM, 32 GB of eMMC 5.1, and 802.11ac Wi-Fi and Bluetooth.

The Xilinx Artix-7 provides 75k logic cells. The FPGA interfaces with the Analog Devices AD9371 dual RF transceiver designed for 300 MHz to 6 GHz frequencies. The AD9371 features 2x RX and 2x TX channels at 100 MHz for each channel, as well as auxiliary observation and sniffer RX channels.

The AIR-T is further equipped with a SATA port and a microSD slot loaded with the Ubuntu stack, as well as GbE, USB 3.0, USB 2.0 and 4K-ready HDMI ports. You also get DIO, an external LO input, a PPS and 10 MHz reference input, and a power supply. It typically runs on 22 W, or as little as 14 W with reduced GPU usage. Other features include 4x MCX-to-SMA cables and an optional enclosure.

Further information

The Artificial Intelligence Radio – Transceiver (AIR-T) is available through Aug. 14 for $4,995 on Crowd Supply — at a 10 percent discount from retail — with shipments due at the end of November. More information may be found on the AIR-T Crowd Supply page and the Deepwave Digital website.

This article originally appeared on LinuxGizmos.com on July 18.

Deepwave Digital | www.deepwavedigital.com

FPGA Solutions Evolve to Meet AI Needs

Brainy System ICs

Long gone now are the days when FPGAs were thought of as simple programmable circuitry for interfacing and glue logic. Today, FPGAs are powerful system chips with on-chip processors, DSP functionality and high-speed connectivity.

By Jeff Child, Editor-in-Chief

Today’s FPGAs have evolved to the point that calling them “systems-on-chips” is redundant. It’s now simply a given that the high-end lines of the major FPGA vendors have general-purpose CPU cores on them. Moreover, the flavors of signal processing functionality on today’s FPGA chips are ideally suited to the kind of system-oriented DSP functions used in high-end computing. Better still, they’ve enabled AI (artificial intelligence) and machine learning styles of functionality to be implemented in much smaller, embedded systems.

In fact, over the past 12 months, most of the leading FPGA vendors have been rolling out solutions specifically aimed at using FPGA technology to enable AI and machine learning in embedded systems. The two main FPGA market leaders, Xilinx and Intel’s Programmable Solutions Group (formerly Altera), have certainly embraced this trend, as have many of their smaller competitors like Lattice Semiconductor and QuickLogic. Meanwhile, specialists in so-called e-FPGA technology like Achronix and Flex Logix have their own compelling twist on FPGA system computing.

Project Brainwave

Exemplifying the trend toward FPGAs facilitating AI processing, Intel’s high-performance line of FPGAs is its Stratix 10 family. According to Intel, the Stratix 10 FPGAs are capable of 10 TFLOPS, or 10 trillion floating point operations per second (Figure 1). In May, Microsoft debuted its Azure Machine Learning Hardware Accelerated Models powered by Project Brainwave, integrated with the Microsoft Azure Machine Learning SDK. Azure’s architecture is developed with Intel FPGAs and Intel Xeon processors.

Figure 1
Stratix 10 FPGAs are capable of 10 TFLOPS or 10 trillion floating point operations per second.

Intel says its FPGA-powered AI is able to achieve extremely high throughput, running ResNet-50, an industry-standard deep neural network requiring almost 8 billion calculations per inference, without batching. This is possible using FPGAs because the programmable hardware—including logic, DSP and embedded memory—enables any desired logic function to be easily programmed and optimized for area, performance or power. And because this fabric is implemented in hardware, it can be customized and can perform parallel processing. This makes it possible to achieve orders of magnitude of performance improvement over traditional software or GPU design methodologies.
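
Those two figures set a simple theoretical ceiling, worked out below. The perfect-utilization assumption is mine, not Intel’s; real designs land somewhere under it, but the arithmetic shows why unbatched, low-latency inference is plausible at this scale.

peak_flops = 10e12        # Stratix 10 peak: 10 TFLOPS (Intel's figure)
flops_per_image = 8e9     # ~8 billion calculations per ResNet-50 pass
ceiling = peak_flops / flops_per_image
print(ceiling, "images/s at 100% utilization")  # -> 1250.0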

In one application example, Intel cites an effort where Canada’s National Research Council (NRC) is helping to build the next-generation Square Kilometer Array (SKA) radio telescope, to be deployed in remote regions of South Africa and Australia, where viewing conditions are most ideal for astronomical research. The SKA will be the world’s largest radio telescope, 10,000 times faster and with image resolution 50 times greater than the best radio telescopes we have today. This increased resolution and speed results in an enormous amount of image data, with the telescopes processing the equivalent of a year’s worth of Internet data every few months.

NRC’s design embeds Intel Stratix 10 SX FPGAs at the Central Processing Facility located at the SKA telescope site in South Africa to perform real-time processing and analysis of collected data at the edge. High-speed analog transceivers allow signal data to be ingested in real time into the core FPGA fabric. After that, the programmable logic can be parallelized to execute any custom algorithm optimized for power efficiency, performance or both, making FPGAs the ideal choice for processing massive amounts of real-time data at the edge.

ACAP for Next Gen

For its part, Xilinx’s high-performance product line is its Virtex UltraScale+ device family (Figure 2). According to the company, these provide the highest performance and integration capabilities in a FinFET node, including the highest signal processing bandwidth at 21.2 TeraMACs of DSP compute performance. They deliver on-chip memory density with up to 500 Mb of total on-chip integrated memory, plus up to 8 GB of HBM Gen2 integrated in-package for 460 GB/s of memory bandwidth. Virtex UltraScale+ devices provide capabilities with integrated IP for PCI Express, Interlaken, 100G Ethernet with FEC and Cache Coherent Interconnect for Accelerators (CCIX).

Figure 2
Virtex UltraScale+ FPGAs provide signal processing bandwidth of 21.2 TeraMACs. They deliver on-chip memory density with up to 500 Mb of total on-chip integrated memory, plus up to 8 GB of HBM Gen2 integrated in-package for 460 GB/s of memory bandwidth.

Looking to the next phase of system performance, Xilinx in March announced its strategy toward a new FPGA product category it calls its adaptive compute acceleration platform (ACAP). Touted as going beyond the capabilities of an FPGA, an ACAP is a highly integrated multi-core heterogeneous compute platform that can be changed at the hardware level to adapt to the needs of a wide range of applications and workloads. An ACAP’s adaptability, which can be done dynamically during operation, delivers levels of performance and performance per watt that are unmatched by CPUs or GPUs, says Xilinx…

Read the full article in the August 337 issue of Circuit Cellar

Don’t miss out on upcoming issues of Circuit Cellar. Subscribe today!

Note: We’ve made the October 2017 issue of Circuit Cellar available as a free sample issue. In it, you’ll find a rich variety of the kinds of articles and information that exemplify a typical issue of the current magazine.

Multiphase PMICs Boast High Efficiency and Small Footprint

Renesas Electronics has announced three programmable power management ICs (PMICs) that offer high power efficiency and a small footprint for application processors in smartphones and tablets: the ISL91302B, ISL91301A and ISL91301B. The PMICs also deliver power to artificial intelligence (AI) processors, FPGAs and industrial microprocessors (MPUs). They are also well suited for powering the supply rails in solid-state drives (SSDs), optical transceivers and a wide range of consumer, industrial and networking devices. The ISL91302B dual/single output, multiphase PMIC provides up to 20 A of output current and 94% peak efficiency in a 70 mm² solution size that is more than 40% smaller than competitive PMICs.

In addition to the ISL91302B, Renesas’ ISL91301A triple output PMIC and ISL91301B quad output PMIC both deliver up to 16 A of output power with 94% peak efficiency. The new programmable PMICs leverage Renesas’ R5 Modulation Technology to provide fast single-cycle transient response, digitally tuned compensation and an ultra-high 6 MHz (maximum) switching frequency during load transients. These features make it easier for power supply designers to design boards with 2 mm x 2 mm, 1 mm low-profile inductors, small capacitors and only a few passive components.

Renesas PMICs also do not require external compensation components or external dividers to set operating conditions. Each PMIC dynamically changes the number of active phases for optimum efficiency at all output currents. Their low quiescent current, superior light load efficiency, regulation accuracy, and fast dynamic response significantly extend battery life for today’s feature-rich, power hungry devices.
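
Why phase count matters can be seen with a toy model of phase shedding. The loss coefficients below are invented for illustration, not Renesas characterization data: conduction loss falls as I²R/n across n phases, while each active phase adds a fixed overhead, so light loads favor fewer phases and heavy loads favor more.

def best_phase_count(load_a, max_phases=4, r_phase=0.01, fixed_w=0.05):
    # Total loss = conduction loss shared across n phases
    #            + per-phase switching/driver overhead
    def loss(n):
        return (load_a ** 2) * r_phase / n + n * fixed_w
    return min(range(1, max_phases + 1), key=loss)

for load in (0.5, 3.0, 10.0, 20.0):
    print(load, "A ->", best_phase_count(load), "phase(s)")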

Key Features of ISL91302B PMIC:

  • Available in three factory configurable options for one or two output rails:
    • Dual-phase (2 + 2) configuration supporting 10 A from each output
    • Triple-phase (3 + 1) configuration supporting 15 A from one output and 5 A from the second output
    • Quad-phase (4 + 0) configuration supporting 20 A from one output
  • Small solution size: 7 mm x 10 mm for 4-phase design
  • Input supply voltage range of 2.5 V to 5.5 V
  • I2C or SPI programmable Vout from 0.3 V to 2 V
  • R5 modulator architecture balances current loads with smooth phase adding and dropping for power efficiency optimization
  • Provides 75 μA quiescent current in discontinuous current mode (DCM)
  • Independent dynamic voltage scaling for each output
  • ±0.7% system accuracy for -10°C to 85°C with remote voltage sensing
  • Integrated telemetry ADC senses phase currents, output current, input/output voltages, and die temperature, enabling PMIC diagnostics during operation
  • Soft-start and fault protection against under voltage (UV), over voltage (OV), over current (OC), over temperature (OT), and short circuit

Key Features of ISL91301A and ISL91301B PMICs:

  • Available in two factory configurable options:
    • ISL91301A: dual-phase, three output rails configured as 2+1+1 phase
    • ISL91301B: single-phase, four output rails configured as 1+1+1+1 phase
  • 4 A per phase for 2.8 V to 5.5 V supply voltage
  • 3 A per phase for 2.5 V to 5.5 V supply voltage
  • Small solution size: 7 mm x 10 mm for 4-phase design
  • I2C or SPI programmable Vout from 0.3 V to 2 V
  • Provides 62 μA quiescent current in discontinuous current mode (DCM)
  • Independent dynamic voltage scaling for each output
  • ±0.7% system accuracy for -10°C to 85°C with remote voltage sensing
  • Soft-start and fault protection against UV, OV, OC, OT, and short circuit

Pricing and Availability

The ISL91302B dual/single output PMIC is available now in a 2.551 mm x 3.670 mm WLCSP package and is priced at $3.90 in 1k quantities. For more information on the ISL91302B, please visit: www.intersil.com/products/isl91302B.

The ISL91301A triple-output PMIC and ISL91301B quad-output PMIC are available now in 2.551 mm x 2.87 mm, 42-ball WLCSP packages, both priced at $3.12 in 1k quantities. For more information on the ISL91301A, please visit: www.intersil.com/products/isl91301A. For more information on the ISL91301B, please visit: www.intersil.com/products/isl91301B.

Renesas Electronics | www.renesas.com

Movidius AI Acceleration Technology Comes to a Mini-PCIe Card

By Eric Brown

UP AI Core (front)

As promised by Intel when it announced an Intel AI: In Production program for its USB stick form factor Movidius Neural Compute Stick, Aaeon has launched a mini-PCIe version of the device called the UP AI Core. It similarly integrates Intel’s AI-infused Myriad 2 Vision Processing Unit (VPU). The mini-PCIe connection should provide faster response times for neural networking and machine vision compared to connecting to a cloud-based service.

UP AI Core (back)

The module, which is available for pre-order at $69 for delivery in April, is designed to “enhance industrial IoT edge devices with hardware accelerated deep learning and enhanced machine vision functionality,” says Aaeon. It can also enable “object recognition in products such as drones, high-end virtual reality headsets, robotics, smart home devices, smart cameras and video surveillance solutions.”

 

 

UP Squared

The UP AI Core is optimized for Aaeon’s Ubuntu-supported UP Squared hacker board, which runs on Intel’s Apollo Lake SoCs. However, it should work with any 64-bit x86 computer or SBC equipped with a mini-PCIe slot that runs Ubuntu 16.04. Host systems also require 1 GB of RAM and 4 GB of free storage. That presents plenty of options for PCs and embedded computers, although the UP Squared is currently the only x86-based, community-backed SBC equipped with a mini-PCIe slot.

Myriad 2 architecture

Aaeon had few technical details about the module, except to say it ships with 512 MB of DDR RAM and offers ultra-low power consumption. The UP AI Core’s mini-PCIe interface likely provides a faster response time than the USB link used by Intel’s $79 Movidius Neural Compute Stick. Aaeon makes no claims to that effect, however, perhaps to avoid disparaging Intel’s Neural Compute Stick or other USB-based products that might emerge from the Intel AI: In Production program.

Intel’s Movidius Neural Compute Stick

It’s also possible the performance difference between the two products is negligible, especially compared with the difference between either local processing solution and an Internet connection. Cloud-based connections for accessing neural networking services suffer from latency, network bandwidth, reliability and security issues, says Aaeon. The company recommends using the Linux-based SDK to “create and train your neural network in the cloud and then run it locally on AI Core.”
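
As a sketch of what “run it locally” means in practice, the snippet below assumes the original Movidius NCSDK v1 Python bindings (mvnc) and a network already compiled offline into a graph file; the file name and input shape are illustrative.

import numpy
from mvnc import mvncapi as mvnc

devices = mvnc.EnumerateDevices()       # attached Myriad 2 devices
device = mvnc.Device(devices[0])
device.OpenDevice()

with open('network.graph', 'rb') as f:  # blob from the offline compiler
    graph = device.AllocateGraph(f.read())

img = numpy.zeros((224, 224, 3), numpy.float16)  # stand-in input tensor
graph.LoadTensor(img, 'user object')    # queue one inference
output, _ = graph.GetResult()           # blocks until scores come back

graph.DeallocateGraph()
device.CloseDevice()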

Performance issues aside, because a mini-PCIe module is usually embedded within a computer, it provides more security than a USB stick. On the other hand, that same trait hinders ease of mobility. Unlike the UP AI Core, the Neural Compute Stick can run on an ARM-based Raspberry Pi, but only with the help of the Stretch desktop or an Ubuntu 16.04 VirtualBox instance.

In 2016, before it was acquired by Intel, Movidius launched its first local-processing version of the Myriad 2 VPU technology, called the Fathom. This Ubuntu-driven USB stick, which miniaturized the technology in the earlier Myriad 2 reference board, is essentially the same technology that re-emerged as Intel’s Movidius Neural Compute Stick.

UP AI Core, front and back

Neural network processors can significantly outperform traditional computing approaches in tasks like language comprehension, image recognition, and pattern detection. The vast majority of such processors — which are often repurposed GPUs — are designed to run on cloud servers.

AIY Vision Kit

The Myriad 2 technology can translate deep learning frameworks like Caffe and TensorFlow into its own format for rapid prototyping. This is one reason why Google adopted the Myriad 2 technology for its recent AIY Vision Kit for the Raspberry Pi Zero W. The kit’s VisionBonnet pHAT board uses the same Movidius MA2450 chip that powers the UP AI Core. On the VisionBonnet, the processor runs Google’s open source TensorFlow machine intelligence library for neural networking, enabling visual perception processing at up to 30 frames per second.

Intel and Google aren’t alone in their desire to bring AI acceleration to the edge. Huawei released a Kirin 970 SoC for its Mate 10 Pro phone that provides a neural processing coprocessor, and Qualcomm followed up with a Snapdragon 845 SoC with its own neural accelerator. The Snapdragon 845 will soon appear on the Samsung Galaxy S9, among other phones, and will also be heading for some high-end embedded devices.

Last month, Arm unveiled two new Project Trillium AI chip designs intended for use as mobile and embedded coprocessors. Available now is Arm’s second-gen Object Detection (OD) Processor for optimizing visual processing and people/object detection. Due this summer is a Machine Learning (ML) Processor, which will accelerate AI applications including machine translation and face recognition.

Further information

The UP AI Core is available for pre-order at $69 for delivery in late April. More information may be found at Aaeon’s UP AI Core announcement and its UP Community UP AI Edge page for the UP AI Core.

Aaeon | www.aaeon.com

This article originally appeared on LinuxGizmos.com on March 6.

NVIDIA Graphics Tapped for Mercedes-Benz MBUX AI Cockpit

At the CES show last month, Mercedes-Benz unveiled its NVIDIA-powered MBUX infotainment system, a next-gen car cabin experience that can learn and adapt to driver and passenger preferences, thanks to artificial intelligence.

According to NVIDIA, all the key MBUX systems are built together with NVIDIA, and they’re all powered by NVIDIA. The announcement comes a year after NVIDIA CEO Jensen Huang joined Mercedes-Benz execs on stage at CES 2017 and said that their companies were collaborating on an AI car that would be ready in 2018.

Powered by NVIDIA graphics and deep learning technologies, the Mercedes-Benz User Experience, or MBUX, has been designed to deliver beautiful new 3D touch-screen displays. It can be controlled with a new voice-activated assistant that can be summoned with the phrase “Hey, Mercedes.” It’s an intelligent learning system that adapts to the requirements of customers, remembering such details as seat and steering wheel settings, lights and other comfort features.

The MBUX announcement highlights the importance of AI to next-generation infotainment systems inside the car, even as automakers are racing to put AI to work to help vehicles navigate the world around them autonomously. The new infotainment system aims to use AI to adapt itself to drivers and passengers—automatically suggesting your favorite music for your drive home, or offering directions to a favorite restaurant at dinner time. It’s also one that will benefit from “over-the-air” updates delivering new features and capabilities.

Debuting this month (February) in the new Mercedes-Benz A-Class, MBUX will power dramatic wide-screen displays that provide navigation, infotainment and other capabilities, touch-control buttons on the car’s steering wheel, as well as an intelligent assistant that can be summoned with a voice command. It’s an interface that can change its look to reflect the driver’s mood—whether they’re seeking serenity or excitement—and understand the way a user talks.

NVIDIA | www.nvidia.com

Current Multipliers Improve Processor Performance

Vicor has announced the introduction of Power-on-Package modular current multipliers for high-performance, high-current CPU/GPU/ASIC (“XPU”) processors. By freeing up XPU socket pins and eliminating losses associated with delivery of current from the motherboard to the XPU, Vicor’s Power-on-Package solution enables higher current delivery for maximum XPU performance.

In response to the ever-increasing demands of high-performance applications—artificial intelligence, machine learning, big data mining—XPU operating currents have risen to hundreds of amperes. Point-of-Load power architectures, in which high-current power delivery units are placed close to the XPU, mitigate power distribution losses on the motherboard but do nothing to lessen interconnect challenges between the XPU and the motherboard. With increasing XPU currents, the remaining short distance to the XPU—the “last inch”—consisting of motherboard conductors and interconnects within the XPU socket, has become a limiting factor in XPU performance and total system efficiency.

Vicor’s new Power-on-Package Modular Current Multipliers (“MCMs”) fit within the XPU package to expand upon the efficiency, density, and bandwidth advantages of Vicor’s Factorized Power Architecture, already established in 48 V Direct-to-XPU motherboard applications by early adopters. As current multipliers, MCMs mounted on the XPU substrate under the XPU package lid, or outside of it, are driven at a fraction (around 1/64th) of the XPU current from an external Modular Current Driver (MCD). The MCD, located on the motherboard, drives MCMs and accurately regulates the XPU voltage with high bandwidth and low noise. The solution profiled today, consisting of two MCMs and one MCD, enables delivery of up to 320 A of continuous current to the XPU, with peak current capability of 640 A.

With MCMs mounted directly to the XPU substrate, the XPU current delivered by the MCMs does not traverse the XPU socket. And, because the MCD drives the MCMs at a low current, power from the MCD can be efficiently routed to the MCMs, reducing interconnect losses by 10X, even as 90% of the XPU pins typically required for power delivery are reclaimed for expanded I/O functionality. Additional benefits include a simplified motherboard design and a substantial reduction in the minimum bypass capacitance required to keep the XPU within its voltage limits.
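
The physics behind that is ordinary I²R scaling, as the quick calculation below illustrates. The trace resistance is a made-up number, and the idealized ratio here overshoots Vicor’s quoted 10X, which is a system-level figure rather than a single-conductor one.

i_xpu = 320.0        # A of continuous current delivered at the XPU
r_trace = 0.0002     # ohms of motherboard routing (illustrative value)
loss_direct = i_xpu ** 2 * r_trace          # routing the full current
loss_mcm = (i_xpu / 64) ** 2 * r_trace      # same route at 1/64th current
print(loss_direct, "W vs", loss_mcm, "W")   # 20.48 W vs 0.005 W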

Multiple MCMs may be operated in parallel for increased current capability. The small (32 mm x 8 mm x 2.75 mm) package and low noise characteristics of the MCM make it suitable for co-packaging with noise-sensitive, high-performance ASICs, GPUs and CPUs. The operating temperature range is -40°C to +125°C. These devices represent the first in a portfolio of Power-on-Package solutions scalable to various XPU needs.

Vicor | www.vicorpower.com

Microsoft Real-time AI Project Leverages FPGAs

At Hot Chips 2017, Microsoft unveiled a new deep learning acceleration platform, codenamed Project Brainwave. The system performs real-time AI. Real-time here means the system processes requests as fast as it receives them, with ultra-low latency. Real-time AI is becoming increasingly important as cloud infrastructures process live data streams, whether they be search queries, videos, sensor streams, or interactions with users.


The Project Brainwave system is built with three main layers: a high-performance, distributed system architecture; a hardware DNN engine synthesized onto FPGAs; and a compiler and runtime for low-friction deployment of trained models. Project Brainwave leverages the massive FPGA infrastructure that Microsoft has been deploying over the past few years. By attaching high-performance FPGAs directly to Microsoft’s datacenter network, they can serve DNNs as hardware microservices, where a DNN can be mapped to a pool of remote FPGAs and called by a server with no software in the loop. This system architecture both reduces latency, since the CPU does not need to process incoming requests, and allows very high throughput, with the FPGA processing requests as fast as the network can stream them.

Project Brainwave uses a powerful “soft” DNN processing unit (or DPU), synthesized onto commercially available FPGAs. A number of companies—both large companies and a slew of startups—are building hardened DPUs. Although some of these chips have high peak performance, they must choose their operators and data types at design time, which limits their flexibility. Project Brainwave takes a different approach, providing a design that scales across a range of data types, with the desired data type being a synthesis-time decision. The design combines both the ASIC digital signal processing blocks on the FPGAs and the synthesizable logic to provide a greater and more optimized number of functional units. This approach exploits the FPGA’s flexibility in two ways. First, the developers have defined highly customized, narrow-precision data types that increase performance without real losses in model accuracy. Second, they can incorporate research innovations into the hardware platform quickly (typically a few weeks), which is essential in this fast-moving space. As a result, the Microsoft team achieved performance comparable to, or greater than, many of these hard-coded DPU chips, and is delivering that performance today. At Hot Chips, Project Brainwave was demonstrated using Intel’s new 14 nm Stratix 10 FPGA.
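
A flavor of the narrow-precision idea, reduced to a generic toy: quantize float32 weights to 8 bits and check the error. Brainwave’s actual custom float formats are proprietary, so this linear int8 scheme is only a stand-in for the accuracy-versus-width trade it describes.

import numpy as np

w = np.random.randn(1024).astype(np.float32)      # stand-in weight vector
scale = np.abs(w).max() / 127.0                   # symmetric linear scale
w_q = np.round(w / scale).astype(np.int8)         # 8-bit representation
w_hat = w_q.astype(np.float32) * scale            # reconstruct
print("max abs error:", np.abs(w - w_hat).max())  # about half a step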

Project Brainwave incorporates a software stack designed to support the wide range of popular deep learning frameworks. It supports the Microsoft Cognitive Toolkit and Google’s TensorFlow, with plans to support many others. Microsoft has defined a graph-based intermediate representation, to which it converts models trained in the popular frameworks, and then compiles them down to its high-performance infrastructure.

Microsoft | www.microsoft.com

Dev Kit Enables Cars to Express Their Emotions

Renesas Electronics has announced a development kit for its R-Car SoCs that takes advantage of the “emotion engine,” an artificial sensibility and intelligence technology pioneered by cocoro SB Corp. The new development kit gives cars the sensibility to read the driver’s emotions and optimally respond to the driver’s needs based on their emotional state.

The development kit includes cocoro SB’s emotion engine, which was developed leveraging its sensibility technology to recognize emotional states such as confidence or uncertainty based on the speech of the driver. The car’s response to the driver’s emotional state is displayed by a new driver-attentive user interface (UI) implemented in the Renesas R-Car system-on-chip (SoC). Since it is possible for the car to understand the driver’s words and emotional state, it can provide the appropriate response that ensures optimal driver safety.


As this technology is linked to artificial intelligence (AI) based machine learning, it is possible for the car to learn from conversations with the driver, enabling it to transform into a car that is capable of providing the best response to the driver. Renesas plans to release the development kit later this year.

Renesas demonstrated its connected car simulator incorporating the new development kit based on cocoro SB’s emotion engine at the SoftBank World 2017 event, held by SoftBank earlier this month at the Prince Park Tower Tokyo.

Renesas considers the driver’s emotional state, facial expression and eyesight direction as key information that combines with the driver’s vital signs to improve the car-driver interface, placing drivers closer to the era of self-driving cars. For example, if the car can recognize that the driver is experiencing an uneasy emotional state, even if he or she has verbally accepted the switch to hands-free autonomous-driving mode, it is possible for the car to ask the driver “would you prefer to continue driving and not switch to autonomous-driving mode for now?” Furthermore, understanding the driver’s emotions enables the car to control vehicle speed according to how the driver is feeling while driving at night in autonomous-driving mode.

By providing carmakers and IT companies with the development kit that takes advantage of this emotion engine, Renesas hopes to expand the possibilities for this service model to the development of new interfaces between cars and drivers and other mobility markets that can take advantage of emotional state information. Based on the newly launched Renesas autonomy, a new advanced driving assistance systems (ADAS) and automated driving platform, Renesas enables a safe, secure and convenient driving experience by providing next-generation solutions for connected cars.

Renesas Electronics America | www.renesas.com

The Future of Intelligent Robots

Robots have been around for over half a century now, making constant progress in terms of their sophistication and intelligence levels, as well as their conceptual and literal closeness to humans. As they become smarter and more aware, it becomes easier to get closer to them both socially and physically. That leads to a world where robots do things not only for us but also with us.

Not-so-intelligent robots made their debut in factory environments in the late ’50s. Their main role was merely to handle the tasks that humans were either not very good at or that were dangerous for them. Traditionally, these robots have had very limited sensing; they have essentially been blind despite being extremely strong, fast and repeatable. Considering what consequences were likely to follow if humans were to freely wander about within close vicinity of these strong, fast and blind robots, it seemed to be a good idea to isolate them from the environment by placing them in safety cages.

Advances in the fields of sensing and compliant control made it possible to get a bit closer to these robots, again both socially and physically. Researchers have started proposing frameworks that would enable human-robot collaborative manipulation and task execution in various scenarios. Bi-manual collaborative manufacturing robots like YuMi by ABB and service robots like HERB by the Personal Robotics Lab of Carnegie Mellon University[1] have started emerging. Various modalities of learning from/programming by demonstration, such as kinesthetic teaching and imitation, make it very natural to interact with these robots and teach them the skills and tasks we want them to perform, the way we teach a child. For instance, the Baxter robot by Rethink Robotics heavily utilizes these capabilities and technologies to potentially bring a teachable robot to every small company with basic manufacturing needs.

As robots get smarter, more aware and safer, it becomes easier to socially accept and trust them as well. This reduces the physical distance between humans and robots even further, leading to assistive robotic technologies that literally “live” side by side with humans 24/7. One such project is the Assistive Dexterous Arm (ADA)[2] that we have been carrying out at the Robotics Institute and the Human-Computer Interaction Institute of Carnegie Mellon University. ADA is a wheelchair-mountable, semi-autonomous manipulator arm that utilizes the sliding autonomy concept in assisting people with disabilities in performing their activities of daily living. Our current focus is on assistive feeding, where the robot is expected to help users eat their meals in a very natural and socially acceptable manner. This requires the ability to predict the user’s behaviors and intentions, as well as spatial and social awareness to avoid awkward situations in social eating settings. Also, safety becomes our utmost concern, as the robot has to be very close to the user’s face and mouth during task execution.

In addition to assistive manipulators, there have also been giant leaps in the research and development of smart and lightweight exoskeletons that make it possible for paraplegics to walk by themselves. These exoskeletons make use of the same set of technologies, such as compliant control, situational awareness through precise sensing, and even learning from demonstration to capture the walking patterns of a healthy individual.

These technologies combined with the recent developments in neuroscience have made it possible to get even closer to humans than an assistive manipulator or an exoskeleton, and literally unite with them through intelligent prosthetics. An intelligent prosthetic limb uses learning algorithms to map the received neural signals to the user’s intentions as the user’s brain is constantly adapting to the artificial limb. It also needs to be highly compliant to be able to handle the vast variance and uncertainty in the real world, not to mention safety.

Extrapolating from the aforementioned developments and many others, we can easily say that robots are going to be woven into our lives. Laser technology used to be unreachable and cutting-edge from an average person’s perspective a couple of decades ago. However, as Rodney Brooks says in his book Robot: The Future of Flesh and Machines (Penguin Books, 2003), now we do not know exactly how many laser devices we have in our houses, and more importantly, we don’t even care! That will be the case for robots. In the not-so-distant future, we will be enjoying the ride in our autonomous vehicle while a bunch of nanobots in our bloodstream deliver drugs and fix problems, and we will feel good knowing that our older relatives are getting great care from their assistive companion robots.

[1] http://www.cmu.edu/herb-robot/
[2] https://youtu.be/glpCAdKEWAA

Tekin Meriçli, PhD, is a well-rounded roboticist with in-depth expertise in machine intelligence and learning, perception, and manipulation. He is currently a Postdoctoral Fellow at the Human-Computer Interaction Institute at Carnegie Mellon University, where he leads the efforts on building intuitive and expressive interfaces to interact with semi-autonomous robotic systems that are intended to assist the elderly and disabled. Previously, he was a Postdoctoral Fellow at the National Robotics Engineering Center (NREC) and the Personal Robotics Lab of the Robotics Institute at Carnegie Mellon University. He received his PhD in Computer Science from Bogazici University, Turkey.

This essay appears in Circuit Cellar 298, May 2015.