Microsoft Real-time AI Project Leverages FPGAs

At Hot Chips 2017 Microsoft unveiled a new deep learning acceleration platform, codenamed Project Brainwave. The system performs real-time AI. Real-time here means the system processes requests as fast as it receives them, with ultra-low latency. Real-time AI is becoming increasingly important as cloud infrastructures process live data streams, whether they be search queries, videos, sensor streams, or interactions with users.



The Project Brainwave system is built with three main layers: a high-performance, distributed system architecture; a hardware DNN engine synthesized onto FPGAs; and a compiler and runtime for low-friction deployment of trained models. Project Brainwave leverages the massive FPGA infrastructure that Microsoft has been deploying over the past few years. By attaching high-performance FPGAs directly to its datacenter network, Microsoft can serve DNNs as hardware microservices: a DNN is mapped to a pool of remote FPGAs and called by a server with no software in the loop. This architecture both reduces latency, since the CPU does not need to process incoming requests, and allows very high throughput, with the FPGA processing requests as fast as the network can stream them.

Project Brainwave uses a powerful “soft” DNN processing unit (DPU), synthesized onto commercially available FPGAs. A number of companies, both large firms and a slew of startups, are building hardened DPUs. Although some of these chips have high peak performance, they must choose their operators and data types at design time, which limits their flexibility. Project Brainwave takes a different approach, providing a design that scales across a range of data types, with the desired type chosen at synthesis time. The design combines the ASIC digital signal processing blocks on the FPGAs with the synthesizable logic to provide a larger, more optimized number of functional units. This approach exploits the FPGA’s flexibility in two ways. First, the developers have defined highly customized, narrow-precision data types that increase performance without real losses in model accuracy. Second, they can incorporate research innovations into the hardware platform quickly (typically in a few weeks), which is essential in this fast-moving space. As a result, the Microsoft team achieves performance comparable to, or greater than, many of these hard-coded DPU chips, and is delivering that performance today. At Hot Chips, Project Brainwave was demonstrated on Intel’s new 14 nm Stratix 10 FPGA.
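The payoff of narrow-precision data types can be illustrated with generic symmetric 8-bit quantization. This is only a sketch of the idea; Brainwave's actual custom narrow floating-point formats are proprietary and not described here.

```python
def quantize_int8(weights):
    """Map floats onto 8-bit integers with one shared scale factor.

    Illustrative only: Brainwave uses custom narrow floating-point
    formats chosen at synthesis time, not this simple fixed scheme.
    """
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

# Example: five "weights" survive an 8-bit round trip with small error
weights = [0.91, -0.42, 0.003, -1.27, 0.5]
q, s = quantize_int8(weights)
restored = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The worst-case round-trip error is bounded by half a quantization step, which is why carefully chosen narrow types can preserve model accuracy while multiplying the number of functional units that fit on the chip.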

Project Brainwave incorporates a software stack designed to support a wide range of popular deep-learning frameworks. It already supports the Microsoft Cognitive Toolkit and Google’s TensorFlow, with plans to support many others. The team has defined a graph-based intermediate representation: models trained in the popular frameworks are converted to this representation and then compiled down to the high-performance infrastructure.
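As a toy sketch of what a graph-based intermediate representation involves, the fragment below lowers a model into op nodes and schedules them in dependency order. The names, ops, and structure are illustrative assumptions, not Brainwave's actual IR.

```python
class Node:
    """One operation in a toy graph IR (illustrative, not Brainwave's)."""
    def __init__(self, name, op, inputs=()):
        self.name, self.op, self.inputs = name, op, tuple(inputs)

def topo_order(nodes):
    """Return node names in dependency order (inputs before consumers)."""
    by_name = {n.name: n for n in nodes}
    order, seen = [], set()
    def visit(n):
        if n.name in seen:
            return
        seen.add(n.name)
        for i in n.inputs:
            visit(by_name[i])
        order.append(n.name)
    for n in nodes:
        visit(n)
    return order

# A trained model lowered to a tiny graph: out = sigmoid(x @ w)
graph = [
    Node("out", "sigmoid", ["dense"]),
    Node("dense", "matmul", ["x", "w"]),
    Node("x", "input"),
    Node("w", "const"),
]
schedule = topo_order(graph)
```

A back end can then walk the schedule and emit hardware operations knowing every node's inputs are already computed.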

Microsoft |

Promoter Group Announces USB 3.2 Spec Update

The USB 3.0 Promoter Group has announced the pending release of the USB 3.2 specification, an incremental update that defines multi-lane operation for new USB 3.2 hosts and devices, effectively doubling the bandwidth to extend existing USB Type-C cable performance. During the upcoming USB Developer Days 2017 event, the promoters will provide detailed technical training covering USB 3.2, fast charging advancements in USB Power Delivery, and other topics.


While USB hosts and devices were originally designed as single-lane solutions, USB Type-C cables were designed to support multi-lane operation to ensure a path for scalable performance. New USB 3.2 hosts and devices can now be designed as multi-lane solutions, allowing for up to two lanes of 5 Gbps or two lanes of 10 Gbps operation. This enables platform developers to continue advancing USB products by effectively doubling the performance across existing cables. For example, a USB 3.2 host connected to a USB 3.2 storage device will now be capable of realizing over 2 GB/sec data transfer performance over an existing USB Type-C cable that is certified for SuperSpeed USB 10 Gbps.
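The "over 2 GB/sec" figure can be checked with quick arithmetic, assuming the 128b/132b line encoding that SuperSpeed USB 10 Gbps already uses (per-packet protocol overhead reduces real-world throughput somewhat further):

```python
# Back-of-the-envelope check of the >2 GB/s claim for two-lane
# 10-Gbps operation under 128b/132b line encoding.
lanes = 2
line_rate_bps = 10e9              # per-lane signaling rate
encoding_efficiency = 128 / 132   # 128 payload bits per 132 line bits
payload_bytes_per_sec = lanes * line_rate_bps * encoding_efficiency / 8
# roughly 2.4e9 bytes/s before protocol overhead
```

Two lanes of 5 Gbps, by contrast, use the older 8b/10b encoding, so the effective doubling applies within each existing speed grade rather than across them.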

Key characteristics of the USB 3.2 solution include:

– Two-lane operation using existing USB Type-C cables

– Continued use of existing SuperSpeed USB physical layer data rates and encoding techniques

– Minor update to hub specification to address increased performance and assure seamless transitions between single and two-lane operation

For users to obtain the full benefit of this performance increase, a new USB 3.2 host must be used with a new USB 3.2 device and the appropriate certified USB Type-C cable. This update is part of the USB performance roadmap and is specifically targeted to developers at this time. Branding and marketing guidelines will be established after the final specification is published. The USB 3.2 specification is now in a final draft review phase with a planned formal release in time for the USB Developer Days North America event in September 2017.

The USB 3.0 Promoter Group, comprising Apple, Hewlett-Packard, Intel Corporation, Microsoft Corporation, Renesas Electronics, STMicroelectronics, and Texas Instruments, continues to develop the USB 3.x family of specifications to meet the market needs for increased functionality and performance in SuperSpeed USB solutions. Additionally, the USB 3.0 Promoter Group develops specification addendums (USB Power Delivery, USB Type-C, and others) to extend or adapt its specifications to support more platform types or use cases where adopting USB 3.x technology will be beneficial in delivering a more ubiquitous, richer user experience.

USB 3.0 Promoter Group |

OPTIGA Trusted Platform Modules Enhance Security for Connected Devices

Microsoft currently uses Infineon Technologies OPTIGA Trusted Platform Modules (TPMs) in its newest personal computing devices, including the Surface Pro 4 tablet and the Surface Book. The dedicated security chips store sensitive data, including keys, certificates, and passwords, and keep it separated from the main processor, which further secures the system against unauthorized access, manipulation, and data theft. For example, the key and password for the Microsoft BitLocker Drive Encryption application are stored in the TPM.

Microsoft’s personal computing devices rely on the OPTIGA TPM SLB 9665, which is the first certified security controller based on TPM 2.0. This standard was defined by the Trusted Computing Group (TCG).

Source: Infineon Technologies

Windows-Compatible Dev Board

Intel, Microsoft, and Circuit Co. have teamed up to produce a development board designed for developing software and drivers for mobile devices, such as phones and tablets, and similar system-on-a-chip (SoC) platforms running the Windows and Android operating systems on Intel processors.



The 6″ × 4″ Sharks Cove board features a number of interfaces, including GPIO, I2C, I2S, UART, SDIO, mini USB, USB, and MIPI display and camera connections.

Its main features include:

  • Intel Atom processor Z3735G, 2 MB cache, four cores, 1.33 GHz (burst up to 1.88 GHz)
  • Intel HD Graphics
  • 1 GB (1 × 32) DDR3L-RS-1333 RAM, 16 GB eMMC storage, microSD card slot
  • HDMI full size connector, MIPI display connector
  • Twelve (5 × 2) shrouded pin headers, one (2 × 10) sensor header, one 2 × 60-pin MIPI connector for display and camera, and five (2 × 2) power headers
  • One USB 2.0 type A connector
  • One micro USB type A/B for debug
  • Audio Codec Realtek ALC5640, speaker output header and onboard digital mic
  • Ethernet or WiFi via USB
  • Intel UEFI BIOS
  • Buttons: power, volume up, volume down, home screen, and rotation lock
  • One micro USB type A/B for Power
  • SPI debug programming header

You can preorder the board for $299. The price includes a Windows 8.1 image together with all the utilities needed to run it on Sharks Cove.

MCU-Based Prosthetic Arm with Kinect

James Kim—a biomedical student at Ryerson University in Toronto, Canada—recently submitted an update on the status of an interesting prosthetic arm design project. The design features a Freescale 9S12 microcontroller and a Microsoft Kinect, which tracks arm movements that are then reproduced on the prosthetic arm.

He also submitted a block diagram.

Overview of the prosthetic arm system (Source: J. Kim)

Kim explains:

The 9S12 microcontroller board we use is Arduino form-factor compatible and was coded in C using Codewarrior.  The Kinect was coded in C# using Visual Studio using the latest version of Microsoft Kinect SDK 1.5.  In the article, I plan to discuss how the microcontroller was set up to do deterministic control of the motors (including the timer setup and the PID code used), how the control was implemented to compensate for gravitational effects on the arm, and how we interfaced the microcontroller to the PC.  This last part will involve a discussion of data logging as well as interfacing with the Kinect.
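The deterministic PID control Kim mentions can be sketched as follows. This is an illustrative Python model, not his Codewarrior C code; the gains, time step, and toy plant below are all assumed values.

```python
class PID:
    """Minimal discrete PID controller (illustrative sketch only)."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured):
        """One fixed-period control step (on the MCU this would run
        from a timer interrupt to keep the period deterministic)."""
        error = setpoint - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error + self.ki * self.integral
                + self.kd * derivative)

# Drive a joint angle toward a Kinect-tracked setpoint of 45 degrees
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.01)
angle = 0.0
for _ in range(1000):
    effort = pid.update(setpoint=45.0, measured=angle)
    angle += effort * 0.01  # crude first-order plant model
```

Gravity compensation, which Kim also describes, would typically add a position-dependent feed-forward term to the controller output rather than relying on the integral term alone.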

The Kinect tracks a user’s movement and the prosthetic arm replicates it. (Source: J. Kim, YouTube)

The system includes the Kinect sensor, the 9S12 microcontroller board, the motorized prosthetic arm, and a PC for data logging.

Circuit Cellar intends to publish an article about the project in an upcoming issue.

Robot Design with Microsoft Kinect, RDS 4, & Parallax’s Eddie

Microsoft announced on March 8 the availability of Robotics Developer Studio 4 (RDS 4) software for robotics applications. RDS 4 was designed to work with the Kinect for Windows SDK. To demonstrate the capabilities of RDS 4, the Microsoft robotics team built the Follow Me Robot with a Parallax Eddie robot, a laptop running Windows 7, and the Kinect.

In the following short video, Microsoft software developer Harsha Kikkeri demonstrates Follow Me Robot.

Circuit Cellar readers are already experimenting with the Kinect and developing embedded systems that work with it in interesting ways. In an upcoming article about a Kinect-based project, designer Miguel Sanchez describes an interesting Kinect-based 3-D imaging system.

Sanchez writes:

My project started as a simple enterprise that later became a bit more challenging. The idea of capturing the silhouette of an individual standing in front of the Kinect was based on isolating those points that are between two distance thresholds from the camera. As the depth image already provides the distance measurement, all the pixels of the subject will fall within a narrow range of distances, while other objects in the scene will be outside of this small range. But I wanted to have just the contour line of a person and not all the pixels that belong to that person’s body. OpenCV is a powerful computer vision library. I used it for my project because of its blobs function, which extracts the contours of the different isolated objects in a scene. As my image would only contain one object, the person standing in front of the camera, the blobs function would return the exact list of coordinates of the contour of the person, which was what I needed. Please note that this function is heavy image processing made easy for the user. It provides not just one, but a list of all the different objects that have been detected in the image. It can also specify whether holes inside a blob are permitted, as well as the minimum and maximum areas of detected blobs. But for my project, I am only interested in the biggest blob returned, which will be the one with index zero, as blobs are stored in decreasing order of area in the array returned by the blobs function.
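The thresholding step Sanchez describes can be sketched in a few lines. This is plain Python rather than the Kinect SDK and OpenCV code he actually used, and the depth values below are made up.

```python
def silhouette_mask(depth, near_mm, far_mm):
    """Mark pixels whose depth (in millimetres) lies between the two
    thresholds -- the isolation step described above. Illustrative only."""
    return [[1 if near_mm < d < far_mm else 0 for d in row] for row in depth]

# A toy 4 x 4 depth frame: the subject stands about 1 m away, the wall 3 m
depth = [
    [3000, 3000, 3000, 3000],
    [3000, 1050, 1100, 3000],
    [3000, 1020, 1080, 3000],
    [3000, 3000, 3000, 3000],
]
mask = silhouette_mask(depth, near_mm=900, far_mm=1300)
```

The resulting binary mask is what a blob-extraction routine would then trace to recover the subject's contour.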

Though it is not a fault of the blobs function, I quickly realized that I was getting more detail than I needed and that there was a bit of noise along the edges of the contour. Filtering a bitmap can be easily accomplished with a blur function, but smoothing out a contour did not sound so obvious to me.

A contour line can be simplified by removing certain points. A clever algorithm can do this by removing those points that are close enough to the overall contour line. One of these algorithms is the Douglas-Peucker recursive contour-simplification algorithm. The algorithm starts with the two endpoints and accepts the one intermediate point whose orthogonal distance from the line connecting those endpoints is the largest, provided it exceeds a given threshold (if no point exceeds the threshold, none is accepted). The process is then repeated recursively on the resulting segments, building the list of accepted points (those that contribute the most to the overall contour for a user-provided threshold). The larger the threshold, the rougher the resulting contour will be.
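The Douglas-Peucker procedure described above fits in a short recursive function. This is an illustrative Python version, not Sanchez's implementation.

```python
import math

def perpendicular_distance(pt, a, b):
    """Orthogonal distance from pt to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / length

def douglas_peucker(points, epsilon):
    """Recursive contour simplification as described above."""
    if len(points) < 3:
        return list(points)
    # Find the intermediate point farthest from the end-to-end line
    dists = [perpendicular_distance(p, points[0], points[-1])
             for p in points[1:-1]]
    idx = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[idx - 1] > epsilon:
        left = douglas_peucker(points[:idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return left[:-1] + right       # drop the duplicated split point
    return [points[0], points[-1]]     # nothing exceeds the threshold

simplified = douglas_peucker([(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6),
                              (5, 7), (6, 8.1), (7, 9), (8, 9), (9, 9)], 1.0)
```

With a threshold of 1.0, the ten noisy points above collapse to the five that define the shape; raising the threshold would discard more of them.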

By simplifying the contour, the human silhouettes now look better and the noise is gone, but they look a bit synthetic. The last step I did was to perform a cubic-spline interpolation, so that the contour becomes a set of curves between the original points of the simplified contour. It may seem a bit twisted to simplify first and then add points back through spline interpolation, but this way it creates a more visually pleasant and curvy result, which was my goal.
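One way to sketch that smoothing step is with a Catmull-Rom spline, a cubic spline that passes through its control points. Sanchez does not say which cubic-spline formulation he used, so this Python sketch is only an illustration, and the contour points below are made up.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Evaluate one Catmull-Rom segment between p1 and p2 (0 <= t <= 1)."""
    def coord(a, b, c, d):
        return 0.5 * (2*b + (c - a)*t + (2*a - 5*b + 4*c - d)*t*t
                      + (3*b - a - 3*c + d)*t*t*t)
    return (coord(p0[0], p1[0], p2[0], p3[0]),
            coord(p0[1], p1[1], p2[1], p3[1]))

def smooth(points, steps=8):
    """Insert interpolated points between each pair of simplified
    contour points (endpoints are repeated as phantom controls)."""
    pts = [points[0]] + list(points) + [points[-1]]
    out = []
    for i in range(1, len(pts) - 2):
        for s in range(steps):
            out.append(catmull_rom(pts[i-1], pts[i], pts[i+1],
                                   pts[i+2], s / steps))
    out.append(points[-1])
    return out

# A simplified 4-point contour becomes a 25-point curvy outline
curve = smooth([(0, 0), (1, 2), (3, 3), (4, 0)])
```

Because the spline passes through every simplified point, the curvy result still honors the contour that Douglas-Peucker preserved.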


(Source: Miguel Sanchez)

The nearby images show aspects of the process Sanchez describes in his article, where an offset between the human figure and the drawn silhouette is apparent.

The entire article is slated to appear in the June or July edition of Circuit Cellar.