Low-Cost Approach
Low-cost microcontrollers integrate many powerful peripherals. You can even perform data capture directly to internal memory. In his article, Colin uses the ChipWhisperer-Nano as a case study in how you might use such features that would otherwise require external programmable logic.
If you’ve followed this column—and some of my other work—you know I often use my open source ChipWhisperer-Lite as part of the demonstration platform. Although the tool is relatively low cost compared to building a test setup with an oscilloscope, at $250 it would still be difficult to use for outfitting a large classroom at a university. This led me down a design path that I’d like to share with you here—a path that ended up with the design of the ChipWhisperer-Nano (Figure 1). This is a $50 tool that can perform embedded power analysis.
Like ChipWhisperer-Lite, ChipWhisperer-Nano includes both a target device to download code to (an STMicroelectronics STM32F0 microcontroller, or MCU) and hardware for performing the power analysis work. It also includes some minor hardware for performing voltage fault injection. The lower cost is achieved by using only an MCU for the capture and getting rid of the FPGA. Other expensive parts are gone too: the 105 Msample/s, 10-bit ADC is replaced with a 20 Msample/s, 8-bit ADC, and the variable high-gain amplifier is replaced with a fixed-gain op amp.
I used a few features of the MCU that you might not have worked with before, so I thought it would be great to use the platform as a focus to talk about these choices. Hopefully you’ll find this article interesting—even though I’m not breaking any embedded devices this time around. But you’ll see the ChipWhisperer-Nano being used for that in future articles.
The architecture of ChipWhisperer-Nano is shown in Figure 2. The analog front end uses an op amp configured for a fixed gain, which amplifies the analog signal ahead of the ADC. The ADC can be driven from one of two clock sources: a clock from the MCU or an external clock. The MCU itself uses a Parallel Data Capture (PDC) feature to perform the data capture. Finally, an STM32F0 MCU is used as the target, and code is downloaded to it via the serial bootloader that is built into the STM32F0.

CLOCKING FUN
One immediate question might be: Why not just use an MCU with a fast ADC? There are a few devices out there, including for example the LPC4370 from NXP Semiconductors, which has an 80 Msample/s ADC. Even the Microchip Technology SAM4S used as the main part of the ChipWhisperer-Nano has a fairly fast ADC.
A core part of the ChipWhisperer capture is performing “clock synchronous” capture. This means that samples are taken at a fixed phase relative to the clock of the target device. Take a look at Figure 3. The top line of Figure 3 is a trigger pin, which is driven high by the MCU in question. While it is high, the device under test (DUT) is performing some operation we want to monitor.

The DUT has a clock that drives this operation. This is the second line in the figure, and it would be your normal MCU clock. Now look at the oscilloscope sample clock, which is asynchronous. It would normally be set to some nice-sounding number, such as 100 MHz for a 100 Msample/s sample rate.
The first sample taken after the trigger line goes high will have a (small) random delay between the location of the sample and the clock of the DUT. If the sample rate is too slow this random delay will mean successive captures of a trigger event won’t see the same waveform, since the starting point within the cycle will be different.
Instead we can use the DUT clock to drive our sample clock. This could either be a direct connection or some multiplication. Either way we always get the same result on successive trigger events, since the sample clock has a constant phase offset relative to the DUT clock.
This is also useful, since you immediately know what sample number corresponds to what clock cycle of the target—useful to match exact points to instructions in the device. Normally with asynchronous capture—such as with an oscilloscope—this problem is solved by using a high sample rate so that this random delay is small enough to be ignored.
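The difference between the two sampling schemes can be illustrated with a small model. The sketch below is not ChipWhisperer code: the function names and the picosecond time base are assumptions made for illustration. It computes the delay from the trigger edge to the first sample, then the spread of that delay over several captures whose triggers fall on successive DUT clock edges.

```c
#include <stdint.h>

/* Delay (in picoseconds) from a trigger edge to the first sample taken at or
 * after it. The sample clock has edges at sample_phase_ps + n * sample_period_ps.
 * Hypothetical helper, for illustration only. */
int64_t first_sample_delay_ps(int64_t trigger_ps, int64_t sample_period_ps,
                              int64_t sample_phase_ps)
{
    int64_t since_edge = (trigger_ps - sample_phase_ps) % sample_period_ps;
    if (since_edge < 0)
        since_edge += sample_period_ps;   /* C's % can return a negative value */
    return (since_edge == 0) ? 0 : sample_period_ps - since_edge;
}

/* Spread (max minus min) of that delay over several captures, with the
 * trigger of capture k placed on DUT clock edge k. A spread of zero means
 * every capture starts at the same point within the DUT clock cycle. */
int64_t delay_spread_ps(int64_t dut_period_ps, int64_t sample_period_ps,
                        int64_t sample_phase_ps, int captures)
{
    if (captures <= 0)
        return 0;

    int64_t min = INT64_MAX;
    int64_t max = 0;
    for (int k = 1; k <= captures; k++) {
        int64_t d = first_sample_delay_ps((int64_t)k * dut_period_ps,
                                          sample_period_ps, sample_phase_ps);
        if (d < min) min = d;
        if (d > max) max = d;
    }
    return max - min;
}
```

With a hypothetical 8 MHz DUT clock (125,000 ps period), a sample period of exactly half that (62,500 ps) gives zero spread, while a free-running sample clock with a 62,000 ps period lets the starting point drift from one capture to the next.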
But if we want a very inexpensive solution that can still be used for attacking real devices, we can use synchronous sampling. The bigger brothers, ChipWhisperer-Lite and ChipWhisperer-Pro, use an FPGA for the ADC sample clock generation and can do tricks like multiplying up an external clock to sample at a multiple of the target clock. For the ChipWhisperer-Nano there is no clock multiplication, only the ability to clock the ADC from an external sample clock.
But if we’re driving the target clock—such as in the built-in target—we can take advantage of the various clock outputs of the SAM4S to achieve a similar effect. In this case the SAM4S has an internal clock of 240 MHz (which is divided by two for the core), and we can divide it down by various degrees as an output clock. For example, this lets us send a 7.5 MHz clock to the target device, while clocking our ADC at 15 MHz. This maintains the perfect clock synchronization, but gives us more samples of the device in-between clock edges.
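The arithmetic behind that example can be checked with a few lines of C. This is a sketch only: the structure and function names are hypothetical, and the divider values of 32 and 16 are simply what the 240 MHz, 7.5 MHz and 15 MHz figures imply. Whether a given divider is actually available on a clock output is a question for the datasheet.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of dividing one source clock down to a target clock and an ADC clock. */
typedef struct {
    uint32_t target_hz;                 /* clock sent to the target device */
    uint32_t adc_hz;                    /* clock sent to the ADC */
    uint32_t samples_per_target_cycle;  /* ADC samples per target clock cycle */
} clock_plan_t;

/* Fills *plan and returns 1 when both dividers give exact integer frequencies
 * and the ADC clock is a whole multiple of the target clock; returns 0
 * otherwise. Hypothetical helper, not part of any vendor library. */
int make_clock_plan(uint32_t source_hz, uint32_t target_div,
                    uint32_t adc_div, clock_plan_t *plan)
{
    if (plan == NULL || target_div == 0 || adc_div == 0)
        return 0;
    if (source_hz % target_div != 0 || source_hz % adc_div != 0)
        return 0;   /* this sketch only handles exact integer frequencies */

    uint32_t target_hz = source_hz / target_div;
    uint32_t adc_hz = source_hz / adc_div;
    if (target_hz == 0 || adc_hz % target_hz != 0)
        return 0;   /* ADC clock must be a whole multiple of the target clock */

    plan->target_hz = target_hz;
    plan->adc_hz = adc_hz;
    plan->samples_per_target_cycle = adc_hz / target_hz;
    return 1;
}
```

Calling make_clock_plan(240000000, 32, 16, &plan) yields a 7.5 MHz target clock, a 15 MHz ADC clock and two samples per target clock cycle. Because both clocks come from the same source, the phase relationship between them stays fixed.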
All of this is to say the ChipWhisperer-Nano can be used for power analysis. And not just analysis of real software algorithms, but even hardware cryptographic accelerators. It comes down to careful consideration of how we perform the sample clocking, and we’ll see how it works in a moment.
PARALLEL DATA CAPTURE
With clocking sorted out, what about the ADC data? I use a feature called Parallel Data Capture (PDC) for this. Don’t get confused by the datasheet, which also uses PDC as an acronym for the Peripheral DMA Controller. The flow is shown in Figure 4. The external data coming in on the parallel data lines is actually dumped directly to a memory buffer in the MCU. Since this uses the DMA controller, we are doing PDC via PDC.

The setup is fairly easy, as you can see in Listing 1. Each transfer is limited to 65,535 bytes, but you can set up two to run back-to-back in case you need more samples transferred at once. In my case I’ve limited the capture to at most 100,000 samples. The total SRAM of the device is 160 KB, so this leaves roughly 60 KB for the rest of the application and stack.
static pdc_packet_t *packet0p;
static pdc_packet_t *packet1p;

/* Reset the PIOA interrupt state, then enable it at the chosen priority. */
NVIC_DisableIRQ(PIOA_IRQn);
NVIC_ClearPendingIRQ(PIOA_IRQn);
NVIC_SetPriority(PIOA_IRQn, PIO_IRQ_PRI);
NVIC_EnableIRQ(PIOA_IRQn);

packet0p = &packet0;
packet1p = NULL;
if (capture_req_length < (uint32_t)0xFFFF) {
    /* Set up PDC receive buffer (one transfer is enough) */
    packet0.ul_addr = (uint32_t) pio_rx_buffer;
    packet0.ul_size = capture_req_length;
} else {
    /* Fill the first transfer, then chain a second one for the remainder */
    packet0.ul_addr = (uint32_t) pio_rx_buffer;
    packet0.ul_size = (uint16_t)0xFFFF;
    packet1.ul_addr = (uint32_t) (pio_rx_buffer + (uint32_t)0xFFFF);
    packet1.ul_size = capture_req_length - (uint32_t)0xFFFF;
    packet1p = &packet1;
}

p_pdc = pio_capture_get_pdc_base(PIOA);
pdc_rx_init(p_pdc, packet0p, packet1p);

/* Enable PDC transfer. */
pdc_enable_transfer(p_pdc, PERIPH_PTCR_RXTEN);

/* Configure the PIO capture interrupt mask. */
pio_capture_enable_interrupt(PIOA, (PIO_PCIER_RXBUFF));
LISTING 1 – Setup of the PDC module offloads the data loading into hardware.
An interrupt can be generated once the transfer is completely done. The SAM4S has external hardware pins that can be configured to control the capture too, so you don’t have to worry about something like having some interrupt jitter affecting when the data capture starts relative to an external signal. It’s a rather nice subsystem hidden away in the otherwise boring port control registers!
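The buffer-splitting logic in Listing 1 can be pulled out and tested on a PC, without any of the hardware calls. The following is a minimal sketch under that assumption: chunk_t and split_capture are hypothetical names, and only the 65,535-byte per-transfer limit comes from the text. Unlike the listing, a request of exactly 65,535 bytes is treated here as a single transfer.

```c
#include <stdint.h>

#define PDC_MAX_TRANSFER 0xFFFFu   /* per-transfer limit of 65,535 bytes */

/* One DMA transfer: where it starts in the receive buffer and how long it is. */
typedef struct {
    uint32_t offset;
    uint32_t size;
} chunk_t;

/* Splits a capture of `length` bytes into at most two back-to-back transfers.
 * Returns the number of transfers used (1 or 2), or 0 if the request is empty
 * or too long for two transfers. Hypothetical helper, for illustration. */
int split_capture(uint32_t length, chunk_t chunks[2])
{
    chunks[0].offset = 0;
    chunks[0].size = 0;
    chunks[1].offset = 0;
    chunks[1].size = 0;

    if (length == 0 || length > 2u * PDC_MAX_TRANSFER)
        return 0;

    if (length <= PDC_MAX_TRANSFER) {
        chunks[0].size = length;        /* fits in a single transfer */
        return 1;
    }

    chunks[0].size = PDC_MAX_TRANSFER;  /* first transfer is filled completely */
    chunks[1].offset = PDC_MAX_TRANSFER;
    chunks[1].size = length - PDC_MAX_TRANSFER;
    return 2;
}
```

For the 100,000-sample maximum mentioned earlier, this gives a first transfer of 65,535 bytes and a second of 34,465 bytes starting at offset 65,535.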
Parallel data capture might be a good solution for a lot of other problems you face. Here I’m using it for capturing ADC data, but have you ever needed to log any fast data? A small CPLD or even an FPGA could be used to convert many formats into an 8-bit parallel bus. This is especially interesting when you might want to do some processing of the data in your embedded system, and not just download it to a computer.
GENERATING FAULTS
Another feature of the bigger ChipWhisperer-Lite and ChipWhisperer-Pro is fault injection. Their FPGA-based clock fault injection circuits use phase-shifting capabilities to generate very narrow pulses with sub-nanosecond positioning. This would be difficult to do with an MCU, so I didn’t even try. Instead I looked at another feature: a MOSFET that shorts the power rails, generating voltage glitches as in Figure 5. The MOSFET is driven from the MCU, which at 120 MHz provides a reasonably quick way of generating the required pulse. While this sounds easy if you are coming from a simple Microchip AVR or similar MCU, the Cortex-M4 core has a more complex pipeline that makes exact timing loops harder to write. Even the nop instruction is not necessarily time-constant, for example: the core may remove it from the pipeline before it is executed.

An example is given in Listing 2. This little bit of code is written in a simplified format to make it (hopefully) easier to understand the program flow. The case statement selects three possible delay options—a base variable delay in each case (which happens in a large fixed increment of 3 cycles), and an additional fixed delay at the end. In this way an arbitrary delay can be created by selecting the proper glitch width and delay.
switch(glitch_width_case){
    case 0:
        asm volatile("isb");
        asm volatile("str r6, [r5, #48]"); //IO High
        for(unsigned int i = glitch_width_cnt; i != 0; i--);
        asm volatile("str r6, [r5, #52]"); //IO Low
        break;
    case 1:
        asm volatile("isb");
        asm volatile("str r6, [r5, #48]"); //IO High
        for(unsigned int i = glitch_width_cnt; i != 0; i--);
        asm volatile("dsb"); //Delay
        asm volatile("str r6, [r5, #52]"); //IO Low
        break;
    case 2:
        asm volatile("isb");
        asm volatile("str r6, [r5, #48]"); //IO High
        for(unsigned int i = glitch_width_cnt; i != 0; i--);
        asm volatile("str r6, [r5, #48]"); //Delay
        asm volatile("str r6, [r5, #52]"); //IO Low
        break;
}
LISTING 2 – The pipelined architecture makes generating a controllable output pulse trickier. The isb instructions help clear the pipeline to make repeatable delay functions.
The str instructions set an I/O pin high or low, generating the waveform we want. This example code has a C-based delay loop, which should be switched to assembly so that compiler optimization cannot change its timing. The important part is the instruction synchronization barrier (isb) instructions, which flush the pipeline at the start of each case. The flushed pipeline gives a constant starting point for each of the loop options that follows. Without these instructions you would see odd behavior: even with identical instructions in each case, the cases would take differing amounts of time to execute, because the pipeline state differs at the start of the loop. It’s something that can easily catch you off guard, especially if you’re used to the simple, consistent timing loops you can write on 8-bit MCUs.
Within each case, different instructions are used to generate 0-, 1-, or 2-cycle delays after the variable delay. Getting a final reliable code base is easier in assembly here, but I’ve left some portions in C to make the flow easier to follow.
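Choosing the two parameters for a requested pulse width comes down to a division and a remainder. The sketch below assumes the loop costs exactly 3 cycles per count and that the three cases add 0, 1 or 2 cycles, as described above. It ignores the fixed overhead of the str instructions and loop setup, so real widths would need to be calibrated against a scope. The type and function names are hypothetical.

```c
#include <stdint.h>

#define LOOP_CYCLES_PER_COUNT 3u   /* assumed cost of one delay-loop iteration */

/* Values to load into glitch_width_cnt and glitch_width_case. */
typedef struct {
    uint32_t width_cnt;    /* iterations of the variable delay loop */
    uint32_t width_case;   /* 0, 1 or 2 extra cycles after the loop */
} glitch_setting_t;

/* Picks the loop count and case for a requested number of delay cycles
 * between the IO High and IO Low stores (fixed overhead not included). */
glitch_setting_t plan_glitch_width(uint32_t delay_cycles)
{
    glitch_setting_t s;
    s.width_cnt = delay_cycles / LOOP_CYCLES_PER_COUNT;
    s.width_case = delay_cycles % LOOP_CYCLES_PER_COUNT;
    return s;
}

/* Delay in cycles that a given setting is expected to produce. */
uint32_t planned_cycles(glitch_setting_t s)
{
    return s.width_cnt * LOOP_CYCLES_PER_COUNT + s.width_case;
}
```

A request for 10 cycles, for instance, becomes three loop iterations in case 1. Splitting the delay this way is what gives single-cycle resolution even though the loop itself only moves in steps of three.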
CHEAP ATTACKS
The PDC code and fault injection code are the two interesting parts of the design. The rest of the code is more standard housekeeping. There is a USB interface to manage control of the device, along with transferring the analog samples. There’s also the code to deal with the serial interface and programming of the STM32F0 target for example.
Putting it all together, what does it give us? It gives us a low-cost ($50) platform for performing side-channel power analysis, allowing us to investigate algorithms loaded onto the STM32F0 target. You can even connect external targets to the ChipWhisperer-Nano instead, if you’d like to investigate hardware crypto present in other MCUs.
As an example, I’ve loaded the AES routines from Mbed TLS onto the STM32F0. A view of the AES routines running is given in Figure 6. AES-128 executes ten rounds in sequence, and these rounds can be clearly seen in the power trace. I’ve gone over details of how side-channel power analysis works in previous articles, so I won’t repeat that here. Suffice it to say that the measurement data allows recovery of the entire AES-128 key by observing as few as 25 encryption operations.
Considering this measurement data was taken with $50 hardware, those aren’t bad results at all. I hope that ChipWhisperer-Nano becomes a useful tool in helping engineers understand what side-channel power analysis is, and how it applies to real products. This will allow you to run attacks on not only the integrated device, but even on real external devices such as other MCUs and development boards.
If you want more details of the ChipWhisperer project, you can see the tutorials and examples posted at ChipWhisperer.com, which also includes links to various design repositories held on GitHub. And if you haven’t seen power analysis before, you can go back and read some of my older Circuit Cellar articles.
Additional materials from the author are available at:
www.circuitcellar.com/article-materials
RESOURCES
Microchip Technology | www.microchip.com
NXP Semiconductors | www.nxp.com
STMicroelectronics | www.st.com
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • JANUARY 2019 #342
Colin O’Flynn has been building and breaking electronic devices for many years. He is an assistant professor at Dalhousie University and also CTO of NewAE Technology, both based in Halifax, NS, Canada. Some of his work is posted on his website (see link above).