Revisiting Vivado HLS

Written by Colin O'Flynn

Recently, the free version of Xilinx Vivado was expanded to include the High Level Synthesis (HLS) feature. This article revisits the tool to explore the creation of an AES-128 hardware encryption block using C++. Colin uses it within the Vivado IP Block design workflow.

  • How to create an AES-128 hardware encryption block using C++ and use it within the Vivado IP Block design workflow.

  • Artix-7 FPGA Board and Vivado Design Suite from Xilinx

If you’ve used Xilinx products in the past few years, you’ll have heard of Vivado. This is Xilinx’s new FPGA design software, which targets their 7-Series devices. It cannot be used with older devices such as Spartan 6, so there are still lots of people using the “older” Xilinx ISE software.

In my February 2014 article, “Rapid FPGA Design in C Using High-Level Synthesis” (Circuit Cellar 283), I talked about the “High Level Synthesis” feature of Vivado. This can be used to design FPGA blocks in C/C++, taking advantage of features such as fixed-point support built-in and the ability to generate either a fast (but larger area) or slow (but smaller area) design from the same source code. At the time of my column, this was an add-on which required payment of a license fee to use. As of December 2015, this changed—Xilinx Vivado 2015.4 and later webpack (free) versions now include the High Level Synthesis (HLS) feature! With the ability of any user to unlock this feature, I thought it would be worth revisiting.

In my previous column, I showed how you can use HLS to convert C/C++ to Verilog or VHDL, and then import that module into Xilinx ISE if you aren’t yet ready to take the plunge into Vivado (or are using a device which Vivado doesn’t support). This is still possible; see my previous column for details. Note the 2015.4 edition is the only version where HLS officially supports older devices (such as Spartan 6). If you are not using special features of the devices, however, you may be able to synthesize the Verilog or VHDL resulting from the HLS tool onto unsupported devices.

In this column I thought I’d perform a complete design in Vivado, where one block is designed using HLS. This will serve both as a refresher on HLS and as an example of using Vivado in IP-Block design mode.

THE FPGA BOARD
To take advantage of Vivado, we’ll need to select a 7-series device. In this case I’m actually going to use the PCB I first showed you when demonstrating how to solder a BGA device from my December 2015 column. The board in Photo 1 is slightly different from my December 2015 column, as there is a BGA socket mounted with a Xilinx Artix 7 (XC7A35T-FTG256) inside, instead of the FPGA directly soldered to the PCB.

The board has a USB-connected Atmel SAM3U microcontroller, where the external memory bus of the microcontroller is connected to the FPGA. This allows reading and writing directly into registers designed inside the FPGA.

I’ve already got a basic wrapper in Verilog that allows the computer to send a block of data to a core in the FPGA, and read the results back. We’ll use this to test a specific encryption core that we are going to design on the FPGA. Both of these blocks will be interconnected using the Vivado IP-Block methodology instead of using wrappers in Verilog.

BASICS OF AES
In this column I won’t detail the full Advanced Encryption Standard (AES) design, but I want to give you enough details to understand the design being implemented. AES is used to encrypt a block of 16 bytes of data, which converts 16 bytes of plaintext (i.e., the data you want to encrypt) into 16 bytes of ciphertext (i.e., the secret data).

This design will specifically implement AES-128, which uses 128 bits (16 bytes) of secret key material. If you had a 16-byte plaintext and the 16-byte ciphertext it had been encrypted to with AES-128, the design of AES is such that you cannot find a better method of determining the secret key than simply trying all possible secret keys to see if that piece of plaintext maps to that piece of ciphertext. Given that means trying 2^128 keys with AES-128, this is not practical. Even if you could check 10 billion keys per second, it would still take you just over 1 billion trillion (i.e., 1 with 21 zeros) years.
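The arithmetic behind that estimate can be checked with a few lines of C++. This is only a back-of-the-envelope sketch: the function name brute_force_years() is made up for illustration, and it assumes the worst case where every key has to be tried.

```cpp
#include <cmath>

// Rough estimate of how long an exhaustive key search would take.
// key_bits: size of the key space in bits (128 for AES-128).
// keys_per_second: hypothetical search rate (the text assumes 10 billion).
double brute_force_years(int key_bits, double keys_per_second)
{
    const double seconds_per_year = 365.25 * 24.0 * 3600.0;
    const double total_keys = std::ldexp(1.0, key_bits); // 2^key_bits
    return total_keys / keys_per_second / seconds_per_year;
}
```

Calling brute_force_years(128, 1e10) gives a value a little above 1e21 years, which agrees with the figure quoted above.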

As it turns out, there are ways to break specific implementations of cryptographic algorithms. One such class of attack is side-channel power analysis, which is the research area the FPGA board from Photo 1 was designed to be used in. Side-channel power analysis attacks will be a topic for a future column, but I’ll put some links on my website (ProgrammableLogicInPractice.com) if you can’t wait.

So how does AES achieve this magic? The basic format is given in Figure 1. The main part of the algorithm repeats 10 times, with the final loop differing slightly in that it omits one function. Each of these functions will be discussed next. Each run through the loop is called a “round,” so we can talk about the first run through the loop as round 1 for example.

Figure 1 
AES runs the same operations in a number of rounds. The basic flow of data through the functions is shown here, where the data is a 16-byte “state” upon which the various functions act. This transforms the initial state (holding the data we want to encipher) into the ciphered data.

All operations are performed on a 16-byte block we will call the AES state. Initially, the AES state is set to the plaintext, and by the end of the algorithm the AES state will contain the ciphertext.

AddRoundKey: This simply performs a logical XOR of the “round key” with the input data to this round. Each round uses a different 16-byte key, which is derived from the initial 16-byte key given as an input. The first round-key is in fact the same as the input key.
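As a software illustration of AddRoundKey, here is a minimal sketch in plain C++ (no HLS types). The Block alias and the add_round_key() name are assumptions made for this example and are not taken from the design's source code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// 16-byte AES state or round key, modeled as a plain byte array.
using Block = std::array<std::uint8_t, 16>;

// XOR every state byte with the matching byte of the round key.
Block add_round_key(const Block& state, const Block& round_key)
{
    Block out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(state[i] ^ round_key[i]);
    }
    return out;
}
```

Because XOR is its own inverse, applying the same round key a second time returns the original state, which is why decryption can reuse the same operation.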

SubBytes: This uses a specially crafted look-up table to replace each value with a different value. This is done to introduce nonlinearity into the algorithm, and the design of that look-up table (called a substitution box, or S-Box) is critical to preventing certain attacks. The mapping is reversible: each input value always maps to one specific (different) output value, and no two input values map to the same output.
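The S-Box is normally stored as a fixed 256-entry table, but it can also be generated. The sketch below builds it from its usual mathematical definition (a multiplicative inverse in GF(2^8) followed by an affine transform), which is a different route from the precomputed table the HLS design uses. The helper names gf_mul(), rotl8(), make_sbox(), and is_bijective() are invented for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Multiply two bytes in GF(2^8) using the AES reduction polynomial (0x11b).
std::uint8_t gf_mul(std::uint8_t a, std::uint8_t b)
{
    std::uint8_t result = 0;
    for (int bit = 0; bit < 8; ++bit) {
        if (b & 1) {
            result ^= a;
        }
        const bool carry = (a & 0x80) != 0;
        a = static_cast<std::uint8_t>(a << 1);
        if (carry) {
            a ^= 0x1b;
        }
        b >>= 1;
    }
    return result;
}

// Rotate a byte left by n bits (n between 1 and 7).
std::uint8_t rotl8(std::uint8_t v, int n)
{
    return static_cast<std::uint8_t>((v << n) | (v >> (8 - n)));
}

// Build the 256-entry S-Box: field inverse (0 maps to 0), then affine transform.
std::array<std::uint8_t, 256> make_sbox()
{
    std::array<std::uint8_t, 256> sbox{};
    for (int x = 0; x < 256; ++x) {
        std::uint8_t inv = 0;
        for (int y = 1; y < 256; ++y) {
            if (gf_mul(static_cast<std::uint8_t>(x), static_cast<std::uint8_t>(y)) == 1) {
                inv = static_cast<std::uint8_t>(y);
                break;
            }
        }
        sbox[static_cast<std::size_t>(x)] = static_cast<std::uint8_t>(
            inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
    return sbox;
}

// True if every output value appears exactly once, i.e. the table is reversible.
bool is_bijective(const std::array<std::uint8_t, 256>& table)
{
    std::array<bool, 256> seen{};
    for (std::uint8_t value : table) {
        if (seen[value]) {
            return false;
        }
        seen[value] = true;
    }
    return true;
}
```

In hardware you would still use the precomputed table; generating it this way is mainly useful for checking a hand-typed table for mistakes.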

ShiftRows: This reorders bytes within the AES state.
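To make the reordering concrete, the sketch below computes which input byte feeds each output byte. It assumes the usual AES layout where the 16 state bytes are stored column by column (byte index = row + 4 × column) and row r is rotated left by r positions. The names shiftrows_sources() and shift_rows() are illustrative only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Block = std::array<std::uint8_t, 16>;

// For each output byte index, compute the input byte index it is taken from.
// Assumes a column-major state: index = row + 4 * column.
std::array<std::size_t, 16> shiftrows_sources()
{
    std::array<std::size_t, 16> src{};
    for (std::size_t col = 0; col < 4; ++col) {
        for (std::size_t row = 0; row < 4; ++row) {
            src[row + 4 * col] = row + 4 * ((col + row) % 4);
        }
    }
    return src;
}

// Apply ShiftRows by copying each byte from its computed source position.
Block shift_rows(const Block& in)
{
    const std::array<std::size_t, 16> src = shiftrows_sources();
    Block out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = in[src[i]];
    }
    return out;
}
```

The computed index table comes out as 0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11, the same values as the sr_lookup table in Listing 3.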

MixColumns: This performs “mixing” of certain bytes within the AES state, which combines the values of various bytes to ensure that a change of any specific byte at the input will result in multiple bytes changing at the output. The combination of ShiftRows and MixColumns is responsible for the excellent diffusion properties of AES, such that changing a single bit at the input should result in about half the output bits flipping.
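A small software model shows the mixing on a single 4-byte column. This sketch uses the common “xtime” formulation (multiplication by 2 in GF(2^8)); the Column alias and mix_column() name are assumptions for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Column = std::array<std::uint8_t, 4>;

// Multiply by 2 in GF(2^8): shift left, then reduce with 0x1b if the top bit was set.
std::uint8_t xtime(std::uint8_t x)
{
    return static_cast<std::uint8_t>((x << 1) ^ ((x >> 7) * 0x1b));
}

// Mix one column: out[i] = 2*a[i] ^ 3*a[i+1] ^ a[i+2] ^ a[i+3] (indices mod 4).
Column mix_column(const Column& a)
{
    const std::uint8_t all = static_cast<std::uint8_t>(a[0] ^ a[1] ^ a[2] ^ a[3]);
    Column out{};
    for (std::size_t i = 0; i < 4; ++i) {
        const std::uint8_t next = a[(i + 1) % 4];
        out[i] = static_cast<std::uint8_t>(
            a[i] ^ xtime(static_cast<std::uint8_t>(a[i] ^ next)) ^ all);
    }
    return out;
}
```

Feeding in the widely published test column db 13 53 45 produces 8e 4d a1 bc, and changing any one input byte changes all four output bytes of that column.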

With this brief overview of the data flow of AES, let’s now look at how we implement it.

IMPLEMENTING AES
We can find lots of examples of AES implemented in C for software. Using the power of Vivado HLS, we can use these as a basis for a hardware AES implementation that can achieve fairly fast results. As the objective of this implementation is actually to be used as part of an analysis platform, we’ll have a few oddities. In particular we’ll only be implementing the basic “electronic code book” (ECB) mode (normally ECB mode isn’t used stand-alone), and we’ll be designing the system to use a different key for each block being encrypted (normally we’d use the same key for many blocks).

But these oddities don’t affect how we go about the design process. We will first make a basic test bench that checks a single AES-ECB-128 test vector, as shown in Listing 1. This can be run by the Vivado HLS tool to check aes_encrypt128(), which is the target function being turned into an FPGA block.

Listing 1
This shows a simple test bench, which verifies the C++ code Vivado HLS will implement in the FPGA.

#include <stdio.h>
#include "ap_int.h"
#include "aes.h"


int main(void)
{
	/* Single AES test vector */
	ap_uint<128> testdata("6bc1bee22e409f96e93d7e117393172a", 16);
	ap_uint<128> testkey("2b7e151628aed2a6abf7158809cf4f3c", 16);
	ap_uint<128> expected_vec("3ad77bb40d7a3660a89ecaf32466ef97", 16);

	/* Run encryption */
	ap_uint<128> outdata;
	outdata = aes_encrypt128(testdata, testkey);

	/* Print value */
	printf("%s\n", outdata.to_string(16).c_str());

	/* Return 0 if test is OK */
	if (outdata == expected_vec){
		printf("Test vector OK\n");
		return 0;
	} else {
		printf("Test vector failed\n");
		return 1;
	}
}

The aes_encrypt128() function is shown in Listing 2. For the full source code listing, refer to my website, ProgrammableLogicInPractice.com. As a refresher, the C++ HLS extensions give us the arbitrary-length integer types. In this case I’m using 128-bit unsigned integers to hold the core AES state. The extensions allow me to perform functions such as logical or arithmetical shifts and rotates on these integers, as I would expect from a hardware design. I can also select specific bits by simply specifying a range. For example, state(7,0) would select bits 7 to 0 of the state variable.
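If you want to experiment with the range idea without the HLS headers, it can be modeled in standard C++. The sketch below uses std::bitset<128> as a stand-in for ap_uint<128>; get_range(), set_range(), and reverse_bytes() are hypothetical helpers written for this example and are not part of the Xilinx library. They assume hi >= lo and a range no wider than 64 bits.

```cpp
#include <bitset>
#include <cstdint>

// Read bits hi..lo of a 128-bit value, like state(hi, lo) on an ap_uint.
std::uint64_t get_range(const std::bitset<128>& value, unsigned hi, unsigned lo)
{
    std::uint64_t out = 0;
    for (unsigned i = 0; i <= hi - lo; ++i) {
        if (value[lo + i]) {
            out |= (std::uint64_t{1} << i);
        }
    }
    return out;
}

// Write bits hi..lo of a 128-bit value, like state(hi, lo) = bits.
void set_range(std::bitset<128>& value, unsigned hi, unsigned lo, std::uint64_t bits)
{
    for (unsigned i = 0; i <= hi - lo; ++i) {
        value[lo + i] = ((bits >> i) & 1u) != 0;
    }
}

// Reverse the byte order (but not the bit order) of a 128-bit value,
// mirroring the index arithmetic of the reversal loops in Listing 2.
std::bitset<128> reverse_bytes(const std::bitset<128>& in)
{
    std::bitset<128> out;
    for (unsigned i = 0; i < 128; i += 8) {
        set_range(out, i + 7, i, get_range(in, 128 - (i + 1), 128 - (i + 8)));
    }
    return out;
}
```

This is far slower than the real ap_uint type and is only meant to show which bits a given range touches.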

Listing 2
This shows the top-level module performing the AES encryption. A single 16-byte block is encrypted using a given key to a final output.

#include "ap_int.h"
#include "aes.h"

typedef ap_uint<128> aes_state_t;
typedef ap_uint<128> aes_key_t;

/* Encrypt one block using given key */
ap_uint<128> aes_encrypt128(ap_uint<128> inputdata, ap_uint<128> key)
{
	aes_key_t keyrev;
	aes_state_t state;

	//Reverse byte order (but not bit order)
	reversekey: for (unsigned int i = 0; i < 128; i+= 8){
		#pragma HLS UNROLL
		keyrev(i+7,i) = key(128-(i+1),128-(i+8));
	}

	//Reverse byte order (but not bit order)
	reversedata: for (unsigned int i = 0; i < 128; i+= 8){
		#pragma HLS UNROLL
		state(i+7,i) = inputdata(128-(i+1),128-(i+8));
	}

	roundkey_setkey(keyrev);

	aes_mainloop: for (unsigned int i = 0; i < 9; i++){
		#pragma HLS PIPELINE II=1
		state ^= newroundkey();
		state = sbytes(state);
		state = shiftrows(state);
		state = mixcols(state);
		//DEBUG: Print state
		//printf("%s\n", state.to_string(16).c_str());
	}

	state ^= newroundkey();
	state = sbytes(state);
	state = shiftrows(state);
	state ^= newroundkey();

	aes_state_t output;

	//Reverse byte order (but not bit order)
	reverseoutput: for (unsigned int i = 0; i < 128; i+= 8){
		#pragma HLS UNROLL
		output(i+7,i) = state(128-(i+1),128-(i+8));
	}

	return output;
}

Loops need special treatment, as you can tell the HLS system how you’d like this loop to be implemented in hardware. In this case you can see there are loops that are simply swapping the byte order of the input and output, which I tell the system to unroll using the #pragma HLS UNROLL directive. Unrolling means the hardware will perform all iterations through the loop simultaneously.

The main loop is given special treatment, using the #pragma HLS PIPELINE II=1 directive. This tells the system I’d like to pipeline the loop, with an initiation interval (II) of one cycle (i.e., to start a new loop iteration on every clock cycle). I can increase the requested initiation interval and it will result in changes to the estimated resource requirements. For example, a longer initiation interval might reduce the area requirements as the design can perform more resource sharing.
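To get a feel for what the interval setting means for speed, here is a rough textbook model of loop timing. It is only a sketch: the function names are invented, the latency value you pass in is a guess, and the real figures come from the Vivado HLS synthesis report, which may show a larger interval than the one requested when each iteration depends on the previous one.

```cpp
// Estimated cycles for a pipelined loop: the first iteration takes
// iteration_latency cycles, and each later iteration starts `ii` cycles
// after the one before it.
unsigned pipelined_cycles(unsigned trip_count, unsigned iteration_latency, unsigned ii)
{
    if (trip_count == 0) {
        return 0;
    }
    return iteration_latency + ii * (trip_count - 1);
}

// Estimated cycles when each iteration must finish before the next starts.
unsigned sequential_cycles(unsigned trip_count, unsigned iteration_latency)
{
    return trip_count * iteration_latency;
}
```

For example, with nine iterations and a hypothetical four-cycle iteration latency, this model gives 12 cycles at an interval of one, 20 cycles at an interval of two, and 36 cycles with no pipelining at all.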

Some of the supporting functions are shown in Listing 3: mixcols(), shiftrows(), and part of sbytes(). This design uses various C features to improve readability, for example defining a macro to select bits within the larger vector. It also shows the use of a static look-up table to define the mapping of bytes within the reordering function, shiftrows(). In these functions I’m using the UNROLL directive to flatten the loops, and I’m also requesting that the functions themselves be inserted directly into the higher-level function using the INLINE directive. By changing the various directives, I can use this same source code to target either higher speed at the expense of area or lower area at the expense of speed.

Listing 3
This shows implementation details of some of the sub-functions. The full sbox lookup table is not present in this code to save space.

/* Perform S-Box using Lookup Table */
aes_state_t sbytes(aes_state_t input)
{
#pragma HLS INLINE
	aes_state_t output;

	for (unsigned int i = 0; i < input.length(); i += 8)
	{
		#pragma HLS UNROLL
		output(i+7, i) = sbox[input(i+7, i)];
	}

	return output;
}

/* ShiftRows mapping: output byte i is taken from input byte sr_lookup[i] */
const unsigned char sr_lookup[16] = {0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11};

aes_state_t shiftrows(aes_state_t input)
{
#pragma HLS INLINE
	aes_state_t output;

	for (unsigned int i = 0; i < 16; i ++)
	{
		#pragma HLS UNROLL
		output((i*8)+7, (i*8)) = input((sr_lookup[i]*8)+7, sr_lookup[i]*8);
	}

	return output;
}

//Software-like AES implementation uses xtime
ap_uint<8> xtime(ap_uint<8> x)
{
#pragma HLS INLINE
  return ((x<<1) ^ (((x>>7) & 1) * 0x1b));
}

/* Select byte i (0-3) of column col (0-3) within the 128-bit state */
#define MC_INP_BYTE(data, col, i) data((((i)+((col)*4))*8) + 7, ((i)+((col)*4))*8)

aes_state_t mixcols(aes_state_t input)
{
#pragma HLS INLINE
	ap_uint<8> mc_tmp, mc_t;
	for (unsigned int i = 0; i < 4; i++)
	{
		#pragma HLS UNROLL
		mc_t = MC_INP_BYTE(input, i, 0);
		mc_tmp = MC_INP_BYTE(input, i, 0) ^ MC_INP_BYTE(input, i, 1) ^ MC_INP_BYTE(input, i, 2) ^ MC_INP_BYTE(input, i, 3);
		MC_INP_BYTE(input, i, 0) = MC_INP_BYTE(input, i, 0) ^ xtime(MC_INP_BYTE(input, i, 0) ^ MC_INP_BYTE(input, i, 1)) ^ mc_tmp;
		MC_INP_BYTE(input, i, 1) = MC_INP_BYTE(input, i, 1) ^ xtime(MC_INP_BYTE(input, i, 1) ^ MC_INP_BYTE(input, i, 2)) ^ mc_tmp;
		MC_INP_BYTE(input, i, 2) = MC_INP_BYTE(input, i, 2) ^ xtime(MC_INP_BYTE(input, i, 2) ^ MC_INP_BYTE(input, i, 3)) ^ mc_tmp;
		MC_INP_BYTE(input, i, 3) = MC_INP_BYTE(input, i, 3) ^ xtime(MC_INP_BYTE(input, i, 3) ^ mc_t) ^ mc_tmp;
	}
	return input;
}

Some experimentation might be required for finding minimal usage. For the sbytes() function, adding the UNROLL directive to the loop actually resulted in lower resource utilization. This came as a surprise, since I assumed the UNROLL directive would result in duplicating the look-up table 16 times and thus should have increased resource usage. Likewise, failure to declare the main for loop of the aes_encrypt128() function as pipelined results in excessive area usage (about 10× worse).

For more details on the Vivado HLS tools, you should refer to Xilinx UG902 which provides details on usage of the HLS tools. This is also referred to from my February 2014 column, and is linked from the post on ProgrammableLogicInPractice.com as well.

In my February 2014 column, I detailed the various “interface types” available. These allow you to define what control signals are present on the input and output. For this example, I’m using defaults, which will provide me with several signals such as ap_start and ap_done for the block. The inputs and outputs are just mapped as bit vectors to and from the block. We can explore this graphically by exporting our design as a block for the IP Catalog using the “Solution > Export IP” menu in Vivado HLS. Let’s talk more about using this block design methodology.

BLOCK-BASED IP DESIGN
One of the new features in Vivado is the usage of “block-based” design. In this system individual IP blocks (such as a Microblaze processor core, USART, memory interface, and custom logic) are wired together graphically. This system extends well beyond a simple graphical interconnection system, as it has the ability to manage additional information such as timing information (e.g., for clocks) and “active-high” versus “active-low” signal types (e.g., for RESET pins). The design rule check (DRC) system will automatically flag errors such as connecting an active-low reset output to an active-high reset input.

The block-based design can also manage memory mapping of peripherals, detecting errors that might cause implementation (place & route) problems, estimate timing closure, and more. We’ll only be using the most basic features here, but you can see there is a lot more than just a simple graphical interconnect tool!

Getting back to our AES core, assume we’ve now exported it to an IP Block. This can be placed on a new block-based project, which we want to end up looking something like Figure 2. The other half of this is the cw305_top IP block. This IP block is one I had previously designed in Verilog and packaged to appear in the IP Catalog. You can easily move blocks from Verilog into the IP Catalog (by “packaging” the project); these blocks can then be added to the graphical canvas.

Figure 2
This shows a simple graphical design with two custom blocks. Vivado provides an IP Catalog with everything from DDR controllers to microcontrollers to individual logic elements. You can combine all of these into a hierarchical design including off-chip connections.

If you need to update the Verilog source of this project, it’s no problem to repackage the design. Vivado will automatically detect that an updated block is available, and offer to “upgrade” to the latest changes.

The graphical block design is converted into RTL (Verilog/VHDL) sources for the actual synthesis. Opening these sources shows a copy of the original Verilog IP block (cw305_top), so it’s not needlessly mangling my source code. It’s also possible to modify these sources if some tweaking is required, which can speed up your debugging cycle.

Speaking of debugging, I should also mention that Vivado 2015.4 webpack and later now comes with the Integrated Logic Analyzer (ILA) license. I touched on this in my October 2013 column, but at that time you needed a special license to use the ILA functions. Again this is now available to everyone with the free webpack license.

Finally, once the implementation is complete, a bitstream is generated which can be programmed into your system. You’ll find new tools here too, in the form of the “Hardware Manager” view instead of a stand-alone tool. Again Vivado is attempting to integrate the entire workflow into a single environment, instead of a number of stand-alone tools. In my experience, the error reporting of the old (iMPACT-based) tool made it more intuitive to determine where issues were occurring with programming the FPGA, but presumably this will continue to improve with new releases.

VIVADO HLX
If you haven’t tried Vivado yet, the latest release adds several compelling new features to the free webpack edition. While Vivado itself targets the 7-series devices, you can actually use the High Level Synthesis (HLS) tools with older parts, so even if you’re stuck on Spartan-6 you can give that part of the tool a whirl. Once you’ve moved onto the new 7-series devices, you can also experiment with some of the new design flows that make it easier than ever to manage a large design. You can see more details of the HLS tools in my February 2014 column, and be sure to check out the full design from this column on ProgrammableLogicInPractice.com! 

SOURCE
Artix-7 FPGA Board and Vivado Design Suite
Xilinx | www.xilinx.com

PUBLISHED IN CIRCUIT CELLAR MAGAZINE • APRIL 2016 #309

Colin O’Flynn has been building and breaking electronic devices for many years. He is an assistant professor at Dalhousie University, and also CTO of NewAE Technology both based in Halifax, NS, Canada. Some of his work is posted on his website (see link above).

Copyright © KCK Media Corp.
All Rights Reserved
