Capturing an RSA-1024 Power Trace in One Shot
In this article, Colin shows how we can perform a simple power analysis attack on a real RSA implementation. This follows up on a previous article, published in the November 2017 Circuit Cellar issue #328, in which an attack was performed on a limited 8-bit RSA implementation. Using new features in the ChipWhisperer-Husky, we can record extended power traces from the entire RSA decryption operation on a 32-bit Arm processor.
In this article, I will explore further a new design for power analysis and fault injection called the ChipWhisperer-Husky, which I introduced in an article in the November 2021 Circuit Cellar issue #376. This device has open-source firmware, field-programmable gate array (FPGA) design files, and computer software (written in Python) to let you explore in detail various aspects of power analysis and fault injection attacks.
Since I wrote that initial article describing the ChipWhisperer-Husky, problems with the supply chain for chips used in the device have made it more challenging to get your hands on one. But after some wait, parts I ordered were finally delivered, so that now I can explore some examples of what is possible with the device. As we will see, new features and capabilities have been introduced on the ChipWhisperer-Husky since my November 2021 article.
Supply chain challenges have also affected some of the specifics of the attack: in particular, the “target” is no longer the STM32F3 series microcontroller (which remained on backorder for over a year), but an Atmel ATSAM4S2AA device. In practice, this only means some minor differences in the sort of side-channel leakages we’re able to detect.
Using this ATSAM4S2AA device, I’m going to demonstrate a few interesting features of the ChipWhisperer-Husky. The heavy lifting for all of this is done by its open-source aspects (such as FPGA or Python code), so I’ll also point out a bit of what you can modify “under the hood.” After all, the entire point of open-source tooling is to be able to expand and modify it in ways you couldn’t with standard test equipment.
Compared to what I described in the previous article, the ChipWhisperer-Husky now includes a more advanced “interposer” board called the CW313. This can be seen in Figure 1, where the ATSAM4S2AA target is also plugged into it.
One major feature of the ChipWhisperer-Husky is a full “streaming” capture mode. In this mode the device captures very long algorithms. The device works by sampling the power consumption of the target device with a perfect syncronization with the target clock, meaning we can understand exactly what power is consumed at what point in time. This will enable us to use power analysis, where we infer operations the device is performing from these “power traces.” One issue I had in my November 2017 RSA article was I could only do the demonstration on a limited subset of the full RSA algorithm. As I explained then, the lower-cost ChipWhisperer-Lite was unable to capture the entire RSA algorithm run.
As you can see in Figure 2, that is no longer the case. The full MBED-TLS RSA run has 13,717,087 samples (here sampled at a 1:1 ratio of device clock rate to sample rate). Capturing these longer power traces becomes more important, as assymetric algorithms such as RSA and ECC are becoming more common. These assymetric algorithms are inherently much longer-running, commonly needing millions of cycles. And their potential post-quantum replacements will also have extended run-times that will need the capability to capture long power traces.
Let’s dive into how we could use the ChipWhisperer-Husky to explore the MBED-TLS RSA implementation and see where the side-channel leakage lies.
FILTERING YOUR VIEW
The view in Figure 2 might just resemble a solid block with nothing interesting jumping out. But a small amount of Python code can do a lot to clean this up and make it look nicer. I’ll simply apply a low-pass filter—the code for this is shown in Listing 1. We can change the bandwidth parameter to adjust the low-pass filter cut-off, and as we use a lower bandwidth we expect to see less high-frequency detail.
Simple Python code to filter a waveform.
from scipy.signal import butter, lfilter
bw = 0.001
b, a = butter(3, bw, btype=’low’)
y = lfilter(b, a, trace)
Figure 3 shows the output of our filter, along with some annotations I added for clarity. I’ll refer to the markings in Figure 3 as 3-A, 3-B, and 3-C.
Already this waveform is telling us a little bit about what the actual implementation is doing “under the hood.” In particular, the high degree of similarity between blocks 3-A and 3-B suggest the MBED-TLS implementation is doing something called the Chinese Remainder Theorem (CRT) implementation. In this implementation we split the RSA calculation into two parts, then combine the results. The combination is happening in 3-C, the much shorter block at the end. This combination uses the fact that each of the two “parts” operated on effectively half of the secret key, so we need to combine them to get a result including both parts.
Figure 4 shows a zoomed-in view of the start of 3-A. Here I have again marked two blocks, 4-A and 4-B. The pattern in block 4-B continues to repeat until the end of block 3-A, well beyond the end of the clock cycle that is shown in Figure 4. What we are guessing here is that the power trace in 4-A represents some “setup” code and the remainder in block 4-B represents some long-running loop. Before I go any further, we should step back to understand why this long-running loop might be interesting.
JUST A LITTLE RSA
Due to length considerations I can’t do a full introduction to RSA here, but I covered some implementations of it in my previous RSA article in Circuit Cellar issue #328. I also discuss some implementation details, along with my co-author Jasper Van Woudenberg, in the Hardware Hacking Handbook, which might be a good reference if you haven’t looked at power analyses of RSA before.
Normally we expect RSA with the CRT implementation to process part of the secret key bit-by-bit. In my article in Circuit Cellar issue #328, the AVR-based RSA implementation did just that.
Depending on whether the bit of the key is zero or one, we see either a “square” operation or a “square and multiply” operation, respectively. Looking at the power trace, we expect to be able to differentiate between these two use-cases. This made recovering the secret key very easy.
But the MBED-TLS implementation I’m targeting here is more complex—it uses something called “windowing,” which significantly complicates our side-channel attack. Windowing basically means several bits are processed in such a way that we can’t easily learn anything about consecutive bits. This means our “basic” side-channel attack won’t recover the full key.
Let’s now jump back to the repeating portion of the power trace in block 4-B, which contains the interesting part of the key process.
LOOKING INTO THE WINDOW
Because we are working with an open-source implementation of our target, we can inspect the code to see what it’s doing. If you want to take a peak yourself, you can find the answer at the main ChipWhisperer repository in the file linked in the Resources for this article on Circuit Cellar’s website . At line 1,717 of that file, you’ll see a large while(1) loop, which is the start of the processing of the secret key, loaded bit-by-bit into the ei variable.
The full C code implementation gets into more details than I can easily reproduce, so instead Listing 2 provides a simplified version in Python of the algorithms I expect to see in the power trace. This code will loop through each bit of the secret key (in Listen 2 I’ve truncated the full key). If the bit is 0, it will output an ‘s’ to indicate that a square operation would be performed. If the bit is ‘1’, it will output ‘xxxxxm’ to indicate that a “multiply” operation would be performed, but that five unknown bits are part of the window.
This Python code will print the operations performed with a known key for comparison.
dp = int(“C1ACF567564274FB07A0BBAD5D...”, 16)
str_dp_value = bin(dp)[2:]
#Make operation list from the known algorithm
i = 0
total_ops = 0
op_list = “”
b = str_dp_value[i]
if (b == ‘0’):
op_list += “s”
total_ops += 1
i += 1
op_list += “xxxxxm”
total_ops += 6
i += 5
We learn nothing about the bits because they are part of the window that is processed all at once. To attack these bits will require a more complex template, which is beyond the scope of this article. What I will show you is how we can recover the sequence of ‘xxxxxm’ or ‘s’. With that, we’ll learn a portion of the secret key.
If you run the code in Listing 2 with the full secret key found in the simpleserial-rsa code from the ChipWhisperer repository , it will output a sequence that starts like that shown in Listing 3. Now, let’s finally look at how we can see this sequence in the power trace.
The start of the output from Listing 2 when run with the full private key.
SEEING THE WINDOW
Block 4-B is plotted in Figure 5. Using the somewhat arbitrary threshold of 0.044, I’ve marked the locations of the obvious low spikes with red dots. What we are seeing in the power trace is the different operations from Listing 3. Notice that the red dots are grouped together in the following numbers: seven, five, five, six, and six. Looking at Listing 3, can see that these spike numbers correlate with the number of operations. The first group of seven spikes represents ‘xxxxxmss’, the first line of Listing 3. The next two groups of five spikes represents ‘xxxxxm’, and finally the groups of six spikes represents ‘xxxxxms’.
Each of those operations map to values of the secret key bit, where ‘x’ represents bits we don’t know (yet). Each ‘s’ represents a ‘0’ bit, and each ‘m’ represents a ‘1’ bit. Thus our waveform from Figure 5 shows us the start of the secret key: ‘xxxxx100xxxxx1xxxx1xxxxx10xxxxx10…’ Unfortunately, we haven’t learned the full private key with this attack, but we can see how capturing the entire RSA operation let us better understand the implementation and recover partial secret key information.
ATTACKING A FULL KEY
Attacking the full key can be done with this same power trace. This relies on taking advantage of an artifact of the MBED-TLS implementation, which accidentally does a comparison of each bit of the secret key. This comparison is described in more detail in my Hardware Hacking Handbook, so I won’t recreate that attack here.
The attack I’ve shown you in this article is more general (since it doesn’t rely on the MBED-TLS-specific artifact). More importantly, it shows you how by capturing a long power trace from an embedded device you can easily recover a portion of the secret key.
If you want to try performing the power analysis in this column, you don’t even need the ChipWhisperer-Husky. I used mine to pre-record the power traces and post them online. The MBED-TLS attack is in the ChipWhisperer Jupyter repository, under the “SCA202” course folder . It includes a copy of the power traces used to make this column. If you have a ChipWhisperer-Husky, you can of course recreate this attack starting from the original source code.
 ChipWhisperer Repository File: https://github.com/newaetech/chipwhisperer/blob/develop/hardware/victims/firmware/crypto/mbedtls/library/bignum.c
 Full secret key from the ChipWhisperer Repository: https://github.com/newaetech/chipwhisperer/blob/develop/hardware/victims/firmware/simpleserial-rsa/simpleserial-rsa-arm.c
 MBED-TLS attack is in the ChipWhisperer Jupyter repositor: https://github.com/newaetech/chipwhisperer-jupyter/tree/master/courses/sca202
ChipWhisperer-Husky Open-Source Code: https://github.com/newaetech/chipwhisperer-husky
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • NOVEMBER 2022 #388 – Get a PDF of the issueSponsor this Article
Colin O’Flynn has been building and breaking electronic devices for many years. He is an assistant professor at Dalhousie University, and also CTO of NewAE Technology both based in Halifax, NS, Canada. Some of his work is posted on his website (see link above).