
Programming the Cortex-M4 in Assembly

Written by David Ludington

Flex Your Arm Skills

Many beginning programmers seek to embark on programming Arm-based devices, but don’t know how to start. With that in mind, David wrote this article that should help beginners establish a good foundation in the basics of Arm and understand the key technical documentation. David recently went through this process himself, and is eager to share his insights.

I’ve noticed on programming blogs that many novice programmers would like to move from the Arduino environment to programming with Arm devices, but do not know how to begin. This article should help those individuals establish a good foundation in the basics of Arm programming, and also highlight the technical documentation needed to continue their learning in this area. Having recently gone through this process myself, I think that my experience will resonate with beginners and address the problems they face when learning this new technology.

There are myriad choices in the Arm world, and this creates a lot of initial confusion for beginners. There are multiple Arm core designs, multiple vendors designing with these cores and multiple vendors supplying the tools to compile and program the devices. Additionally, there are two approaches to programming: direct register-level programming, which places values in various registers; and Application Programming Interface (API) programming, which abstracts away many of the details to focus on the big picture (similar to the Arduino).

In the next section, I will select one processor and development board to be used in the rest of this article, and I will be programming in assembly language using register-level programming. The goal is not to create large assembly programs, but I have found that using assembly language helps to focus on the essentials when learning about the processor architecture and placing code and data in memory.

In addition to my own interest in assembly language programming, I found a Texas Instruments (TI) blog where others were also asking about Arm assembly language programming. Unfortunately, on that blog, a frequent response to these questions is “You don’t really want to do assembly programming. Do C, API programming instead.” And the folks asking those questions were not being helped. With that in mind, much of the motivation for this article is to provide a starting point for those seeking to learn Arm assembly language.

Selecting which Arm core to use will be the most straightforward of the choices to be made. I expect that most people reading this article will be interested in embedded programming for controlling other devices, and so we will be using the Cortex-M core. All the other selections are more subjective, mainly because not all options are available in every category. For example, if you like using an IDE (Integrated Development Environment), some IDEs are available for both Windows and Linux, and some are only available for Windows. Therefore, instead of extensively comparing all the options, I will simply describe the setup that I will be using and the reasons for that choice.

The hardware I will use is one of the TIVA launchpads (TM4C123GXL) from TI (Figure 1). This board contains the TM4C123GH6PM microcontroller with a Cortex-M4 core. The software tool that I will use is the µVision4 IDE from Keil.



FIGURE 1 – The TIVA Launchpad (TM4C123GXL) from Texas Instruments contains the TM4C123GH6PM microcontroller with a Cortex-M4 core.

So, why choose this board and processor instead of other boards and processors? There are two major reasons. First, in my opinion, the software for the TI processors seems more directly associated with each peripheral, and is easier to learn. This does not mean that there is not a lot of work involved with learning the TI software structure, but it seems less work than for other processors. For register-level programming, the differences may not be too large, but for the APIs, the differences are substantial. In particular, I have observed many complaints on the blogs about the complicated organization of the HAL API from STMicroelectronics and the ASF API from Atmel (now Microchip). This is my conclusion, but as the saying goes: YMMV (Your Mileage May Vary).

The second reason to select this board is the availability of code examples. Like most vendors, TI has examples for its API (driverLib) and complete project folders for each example. The API has a few functions that are used by multiple peripherals, but most functions are organized and attached to a single peripheral to avoid dealing with all the register details. Beyond that, there is little further abstraction. This approach is carried over from TI’s acquisition of Luminary Micro, which was the first company to build devices with the Cortex-M architecture (M3). The TI literature also explains that the driverLib API and register programming do not conflict, and can be used together in source files. For example, the API could be used in configuring the peripherals, which is done once, and register programming could be used when processing data for faster response times.

The processor/board combination that I’ve chosen has another excellent source for examples. Professor Jonathan Valvano at the University of Texas at Austin is using the TM4C123GXL Launchpad in some of his computer classes. Over several years, he has developed an extensive (and thoroughly tested) example package (Valvanoware TM4C123G), which has examples for most, if not all, peripherals in “C” for both Keil µVision4 and TI Code Composer Studio. There are also some assembly projects. These examples use register programming, and are ready to be compiled and run in their respective IDEs. There is a project folder for each example containing the files for both µVision4 and Code Composer Studio. When a project is opened with one of these IDEs, only the files needed for that IDE are used.

Before getting to the programs, it would be helpful to briefly discuss the organization of the Arm environment and the documentation that will be needed. Table 1 contains an overview of this documentation. The documents from Arm are extensive and detailed. These are the documents that device vendors use to design their devices. Most of the time, these documents are not needed for programming, but they are the final authority on questions about how the device core, interrupts, memory and instructions should operate. Keil and TI are both Arm tool vendors, and have good documentation on tool operation and some application notes about the processor.

TABLE 1 – Organization and documentation sources

Keil (now owned by Arm) also has extensive online help files. The TM4C123GH6PM data sheet has descriptions of all the peripherals and their registers. Note: There is one error in the GPIO section, which is correct in the TM4C123GH6PM include file. I will cover this in more detail later (see Program 3). The Internet, in general, is also a tremendous source of information and example code for users. It also can be a source of incomplete data and misinformation, so discretion is advised.

It is beyond the scope of this article to provide instructions for downloading the µVision4 IDE and explaining its operation in detail. The vendor website has directions and “getting started” guides, and other sites provide examples and screenshots of each step in the IDE operation. Therefore, for the rest of this article, I’ll proceed with the assumption that the IDE is operational as required. However, I will describe some IDE operations and give suggestions when discussing specific coding and debugging situations.

Having said that, some instruction on µVision4 at this point will be useful. When you open µVision4 for the first time, most of the screen is blank. Go to the project menu and select “New µVision Project.” You will get the “Create New Project” window. Navigate to the location where you want to store the project. Click on “New Folder” and give it the name of the project, then double click to get inside that folder. Type in the project name at the bottom of the window and click “Save.” Now, you are presented with the “Select Device” window.

In the search box, type in “tm4c123gh6pm.” Select the device and then click “OK.” You will then be asked whether you want to copy a startup file to the project. For now, click “NO.” Later, we will include the file. Now a “Target 1” folder will appear. Expand that folder and right click on the “Source Group 1” name. In the resulting menu, left click on “Add New Item to Group.” In the resulting window that appears, left click on the “Asm File” icon, and at the bottom type in the file name. The file will appear under “Source Group 1” and will also be opened in the editor.

Listing 1 shows our first Arm assembly language program. It is a short program with only a few assembly instructions, but it’s surprising how much knowledge and information is needed even for a program this small. The program contains several types of data. First, there are the assembler directives such as lines 2, 3 and others. They tell the assembler the organization of the source code, but they are not code and are not in the final executable. Then, there are comments, which start with a semicolon. The comments can be on their own line or part of another line as long as they are last (for example, lines 5 and 12). Finally, there are the lines of code (12, 13, 27, 28, 29, 31 and optionally line 15). As shown, line 15 is a comment and not assembled. If the semicolon is removed, line 15 is a line of code. In this way, multiple options can easily be tested without making a new program for every option.



LISTING 1 – Our first Arm assembly program, or Program 1, is a short program with only a few assembly instructions.

5: ; Vector Table Mapped to Address 0 at Reset
6: ; Linker requires __Vectors to be exported
9: EXPORT __Vectors
11: __Vectors
12: DCD 0x20008000 ; initial Stack Pointer
13: DCD Reset_Handler ; reset vector
15: ; space 8
19: ; Linker requires Reset_Handler
24: EXPORT Reset_Handler
26: Reset_Handler
27: mov r0, #3
28: mov r1, #5
29: add r2, r0, r1
33: end

When I first started learning about the Arm processor, I was confused about what defined a valid program. Most examples of valid programs that I saw had a startup file, but several websites said that single file programs (such as Listing 1) were valid, but minimal programs. I discovered that the reason the program in Listing 1 is considered minimal is not because it has only a few instructions. I could have included many more instructions and also controlled peripherals, and the program would still have a minimal design. It is named a minimal program because it contains only the necessary vectors (lines 12 and 13) for the processor to begin running. In a non-minimal program, these vectors plus many more are included in the vector table in the typical startup file. Both types of programs are indeed valid programs.

Line 12 contains the address (0x20008000) of the stack pointer when the stack is empty. The read/write memory starts at address 0x20000000 and the amount of data memory in our processor is 32KB (0x8000). The data memory goes from byte 0x20000000 through byte 0x20007FFF. Therefore, the initial stack pointer (0x20008000) is just outside the Random Access Memory (RAM) region. This works because the stack is descending. When data is pushed onto the stack, the pointer is decremented first, and then the data is written (which is inside valid RAM). When data is popped off the stack, the data is read first, and then the stack pointer is incremented. Line 13 contains the label where code instructions start.
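The push and pop ordering described above can be tried out on a PC before touching the board. The following is a minimal host-side sketch in C, not code for the Launchpad. The names stack_model, stack_init, stack_push and stack_pop are hypothetical and exist only for this illustration; the RAM base and size are the values quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RAM_BASE 0x20000000u   /* first RAM address, as quoted in the text */
#define RAM_SIZE 0x8000u       /* 32KB of RAM, as quoted in the text */

/* Hypothetical host-side model of RAM plus the stack pointer. */
typedef struct {
    uint8_t  ram[RAM_SIZE];
    uint32_t sp;
} stack_model;

/* Empty stack: SP sits one byte past the last RAM address (0x20008000). */
void stack_init(stack_model *m)
{
    memset(m->ram, 0, sizeof m->ram);
    m->sp = RAM_BASE + RAM_SIZE;
}

/* Push: decrement SP by 4 first, then write the word (little-endian). */
void stack_push(stack_model *m, uint32_t value)
{
    assert(m->sp >= RAM_BASE + 4u);              /* room for one more word */
    m->sp -= 4u;
    uint32_t off = m->sp - RAM_BASE;
    for (int i = 0; i < 4; i++)
        m->ram[off + i] = (uint8_t)(value >> (8 * i));
}

/* Pop: read the word first, then increment SP by 4. */
uint32_t stack_pop(stack_model *m)
{
    assert(m->sp <= RAM_BASE + RAM_SIZE - 4u);   /* stack is not empty */
    uint32_t off = m->sp - RAM_BASE;
    uint32_t value = 0;
    for (int i = 0; i < 4; i++)
        value |= (uint32_t)m->ram[off + i] << (8 * i);
    m->sp += 4u;
    return value;
}
```

After stack_init, the first stack_push moves the pointer to 0x20007FFC, so the first word written lands in the last four bytes of valid RAM even though the initial pointer itself is outside it.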

Lines 2 and 3 are not necessary for the program to work, but it is still a good idea to include them. The PRESERVE8 in line 2 means the stack is aligned on an 8-byte boundary. This is the default setting, but by placing it in the file, other people know explicitly that this is what is intended. If line 3 is not there, the THUMB directive can still be added as an assembler option. The EXPORT directive (line 9) is used by the linker to allow variables to be visible outside the current source file. In other IDEs a .GLOBAL directive is used for this purpose. The space statement in line 15 is commented out and is not needed in this example, but will be used later when I add interrupts. The __Vectors, Reset_Handler and STOP statements (lines 11, 26, 31) are labels starting in the first column. They are reached from other parts of the program through branch statements or the program start in line 13.

After successfully assembling the program, click on the debugger symbol on the toolbar and then click “OK” on the popup window. The code is now in the simulator, which is the debugger default in the Keil IDE. The first thing to notice is the list of the core registers. At this point, most of the registers are zero, but some have data already. Register R13 is the stack pointer (SP) and contains the initial SP value that I placed in the first vector at line 12. The second vector at line 13 is where the program will start executing. Register R14 is the link register that will contain the return addresses for subroutines and interrupts throughout the operation of the program. Register R15 is the program counter that holds the current program address. As we start the debugger, R15 has a value of 0x08, which is the address of the first instruction (mov r0, #3) after the Reset_Handler label.

We are now ready to single-step the debugger. But before that, go to the view menu and click on the disassembly window option. The disassembler window appears in the source file(s) window. I usually right click on the disassembly name and choose “New Vertical Tab Group.” In this way, the source file and the disassembler can be viewed simultaneously. Now step once, either from the Debug menu or by pressing F11. The register R0 now holds the number 3, and the program counter has increased to decimal 12 or hexadecimal 0x0C. Stepping again moves 5 to R1. Stepping once more places the sum in R2.

From the view menu, click the memory 1 icon (once or twice) and put 0 or 0x0 into the Address box. Then you will see the same part of flash memory as displayed by the disassembler. The default in the memory 1 window is to display the memory in bytes, whereas the disassembler default is to display by addresses, which are in 2-byte format. You can change the memory 1 format by right clicking on the Address label, putting the cursor on the signed or unsigned selection and choosing the format length. The contents of the memory seem to be in the wrong order with more significant bits to the right and less significant bits to the left. This is the reverse of the order when writing numbers on paper. But remember, memory addresses start at lower values and proceed to higher values in the display. The Arm processor is “little endian,” which means that the smallest part of a number goes in the lowest address. The first 32-bit number (line 12) that was placed in memory is 0x20008000. The two hexadecimal digits on the right form the first byte in memory. The next two digits form the next byte and so forth.
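To check your reading of the memory window, it can help to reassemble the displayed bytes into a word by hand. Here is a small host-side C sketch of that reassembly. The function name read_word_le is hypothetical and only used for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuild a 32-bit word from four bytes shown in address order.
   Little-endian: the byte at the lowest address is the least significant. */
uint32_t read_word_le(const uint8_t *mem)
{
    assert(mem != NULL);
    return (uint32_t)mem[0]
         | ((uint32_t)mem[1] << 8)
         | ((uint32_t)mem[2] << 16)
         | ((uint32_t)mem[3] << 24);
}
```

For the first vector, the memory window shows the bytes 00 80 00 20 at addresses 0 through 3, and read_word_le turns them back into 0x20008000.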

Finally, there is an interesting bit of information shown by the debugger. Line 12 and line 13 have the two 32-bit pieces of data that go into the first few flash memory locations. I just discussed the first piece of data (initial stack pointer), which is located in the first 4 bytes (0-3). The second 32-bit piece of data (the program starting point) is located in the next 4 bytes (4-7). When the debugger was launched, the program counter (R15) showed the starting address as 8. Now, if you look at the disassembler and the second set of 4 bytes (4-7), the number stored there is 0x00000009. The program counter says the starting instruction begins at byte 8, while the debugger shows the starting instruction begins at byte 9. So, which is it?

The counter is correct: the instruction starts at byte 8. The address counter is aligned on 2 bytes, so that the least significant bit (LSB) in the address counter is always zero, and an odd address is not possible. Arm has used this situation to advantage for another purpose. If the LSB is zero, then the instruction will be decoded as an Arm instruction. If the LSB is 1, then the instruction is a THUMB instruction. Since the Cortex-M cores execute only THUMB instructions, the LSB of the reset vector must be 1 here. I have seen some programs where a 1 was deliberately added to an address in the source file. I have not done that, and the designation for a THUMB instruction is still shown. I suspect that the THUMB directive in the program takes care of that.
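The relationship between the stored vector (0x00000009) and the real instruction address (8) comes down to masking one bit. The C sketch below runs on a PC and only illustrates the arithmetic. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 0 of a vector entry is a state flag, not part of the address. */
uint32_t vector_target_address(uint32_t vector)
{
    return vector & ~1u;
}

/* Nonzero when the flag marks the target as THUMB code. */
int vector_is_thumb(uint32_t vector)
{
    return (int)(vector & 1u);
}

/* Build the vector entry for THUMB code at a 2-byte-aligned address. */
uint32_t make_thumb_vector(uint32_t address)
{
    assert((address & 1u) == 0u);   /* instruction addresses are even */
    return address | 1u;
}
```

Applied to Program 1, vector_target_address(0x00000009) gives 8, which matches the program counter, and make_thumb_vector(8) gives back the 9 seen in flash.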

Listing 2 shows the second program that we will examine. The program structure is the same as that of the first program. The difference is that there are new instructions, load (ldr) and store (str), that move data between the core registers and RAM. However, before I cover the details of the load and store instructions, I want to continue the discussion on memory organization and how the assembler handles data and instructions.

LISTING 2 – Program 2 has the same program structure as Program 1, but there are new instructions.

6: EXPORT __Vectors
8: __Vectors
9: DCD 0x20008000 ; initial stack pointer
10: DCD Reset_Handler ; program start
11: space 8
12: test dcd 0xCDEF
19: EXPORT Reset_Handler
21: Reset_Handler
22: ldr r7,=test
23: ldr r8,[r7]
25: ldr r3,=0x20000000
26: mov r4,#0xAB
27: str r8,[r3]
28: str r4,[r3]
32: END

There is a counter for flash memory in the assembler. It starts at address zero, and increments when succeeding instructions and data are encountered. (This location of zero is independent of the flash placement in the final program image.) In Listing 2, the first piece of data is the initial stack pointer in line 9. The data is 4 bytes or 32 bits (DCD), with the least significant byte placed in address 0 and other bytes placed in addresses 1, 2, and 3. The assembler counter is incremented by 4 bytes, and the second piece of data (Reset_Handler) is placed in addresses 4-7. In line 11, I am doing something different. The space directive normally is used to allocate space for uninitialized variables in RAM. Here, however, I am using the space directive to increment the counter (by 8 bytes) without anything stored there. The purpose will become clear when I use an interrupt in a later program. The counter then increments 4 bytes for the data labeled test in line 12, and then continues incrementing from the ldr instruction in line 22 until the program stops at line 30.
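The bookkeeping of the flash counter for lines 9 through 12 can be followed with a few lines of C on a PC. This is only a model of the idea, not how the Keil assembler is actually implemented. The names loc_counter and loc_emit are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the assembler's flash location counter. */
typedef struct {
    uint32_t counter;   /* next free offset, starting from zero */
} loc_counter;

/* Reserve `size` bytes and return the offset where they start. */
uint32_t loc_emit(loc_counter *lc, uint32_t size)
{
    assert(lc != 0);
    uint32_t start = lc->counter;
    lc->counter += size;   /* the counter only ever increments */
    return start;
}
```

Starting from a counter of zero, four calls with sizes 4, 4, 8 and 4 (the two DCD vectors, the space directive and the test word) return offsets 0, 4, 8 and 16, so the label test ends up at offset 16 and the counter stands at 20 afterward.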

The RAM memory also has a counter that operates in the same manner as the flash counter. It also starts at address 0 and only increments. When the RAM counter is incrementing, the flash counter is stopped, and when the flash counter is incrementing, the RAM counter is stopped. In the program linker, the flash segments are collected together and the RAM segments are collected together. The linker then places the accumulated flash segment in the program image starting at address 0x00000000, and the accumulated RAM segment in the program image starting at address 0x20000000. The program loader then loads the resulting total program image into the processor.



To summarize: The assembler counters always start at zero and are relative to the flash and RAM starting addresses in the program image. Most vendors of Arm Cortex-M processors have the program flash and RAM starting addresses as described above. However, the STM32 processors from STMicroelectronics have the flash starting address changed to 0x08000000, whereas the RAM starting address remains at 0x20000000.

The load and store instructions are from line 22 through line 28. The first load instruction (line 22) loads the memory address of the symbol test into register r7. The second load instruction (line 23) loads the contents of test (0xCDEF) into register r8. Then r3 is loaded with a 32-bit number, which happens to be the first address of RAM, and r4 is loaded with a small, 8-bit number, 0xAB (binary 10101011, since A = 1010 and B = 1011). Finally, the number 0xCDEF is written to RAM starting at 0x20000000, and then the number 0xAB overwrites 0xCDEF in RAM.

When first starting the debugger, type 0x20000000 into the memory address box to observe that memory location in RAM. Watch that location when single stepping through lines 27 and 28. After line 27, the RAM location will show byte EF first and byte CD second, and after line 28, the RAM location will show AB. As mentioned before, the default memory grouping is in bytes (8 bits). If you right click on the text Address, a menu appears that allows you to select other groupings such as Short (16 bits) or Long (32 bits) or Int. Digital values Byte, Short and Long always have fixed lengths of 8, 16, and 32 bits, whereas the number of bits in Int follows the inherent data length of the processor in question, which in the Arm processor is 32 bits.
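The reason the CD byte disappears after line 28 is that str writes a full 32-bit word, not just the low byte. The host-side C sketch below models the two stores and the Short grouping of the memory window. The names store32 and view_short are hypothetical and only used for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Word store as done by str: all four bytes are written, little-endian. */
void store32(uint8_t *mem, uint32_t value)
{
    assert(mem != 0);
    for (int i = 0; i < 4; i++)
        mem[i] = (uint8_t)(value >> (8 * i));
}

/* Short (16-bit) grouping of the two bytes starting at mem. */
uint16_t view_short(const uint8_t *mem)
{
    assert(mem != 0);
    return (uint16_t)(mem[0] | (mem[1] << 8));
}
```

Storing 0xCDEF leaves the bytes EF CD 00 00, which the Short grouping shows as CDEF. Storing 0xAB afterward leaves AB 00 00 00, because the upper three bytes of the register are zero and are written as well.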

Listing 3 shows the code from Program 3. Some new assembler instructions are used in program 3, but the primary goal for using this program is to learn how to use the General Purpose Input/Output (GPIO) peripheral in the TM4C123GH6PM processor. This is the first time that I have used peripherals, and the procedure that I use here will, in general, be the same for all of the other peripherals. I will make extensive use of two documents in this discussion: The TM4C123GH6PM datasheet (SPMS376E), and the tm4c123gh6pm.h “C” include file.

LISTING 3 – The primary goal for Program 3 is to learn how to use the GPIO peripheral in the TM4C123GH6PM processor.

2: GPIO_PORTF_DIR_R EQU 0x40025400
3: GPIO_PORTF_DEN_R EQU 0x4002551C
10: EXPORT __Vectors
12: __Vectors
13: DCD 0x20008000
14: DCD Reset_Handler
17: EXPORT Reset_Handler
19: Reset_Handler
21: bl portf_init
23: loop
25: mov r0, #0x02
26: STR R0, [R1]
27: bl delay
29: mov r0, #0x08
30: STR R0, [R1]
31: bl delay
32: b loop
36: portf_init
38: mov R0, #0x20
39: STR R0, [R1]
40: NOP
41: NOP
43: mov r0, #0x0E
44: STR R0, [R1]
45: LDR R1, =GPIO_PORTF_DIR_R
46: mov r0, #0x0E
47: STR R0, [R1]
48: bx lr
50: delay
51: ldr r3,=4000000
52: b1
53: subs r3,r3, #1
54: bne b1
55: bx lr
58: END

The datasheet has numbered sections for each peripheral. At the start of each section, there is a description of how the peripheral works and all the options available. At the end of each section, there is a detailed listing of the registers and the bits in the registers that control these options. You will find that typically, there are many more registers per peripheral than in previous 8- and 16-bit processors. These previous processors tended to use most or every bit in a register to keep the register count down. The registers in Arm processors tend to be dedicated to just one function, and if the number of bits needed to control that function is small, most of the bits in the register are unused.

The GPIO is described in section 10 of the datasheet, and the GPIO registers are listed in Table 10.6. This listing gives the name of the register and its offset from a base address. The list is generic, because its information is given only once. However, there are six GPIO ports (PORTA through PORTF) in the TM4C123GH6PM processor, and the entire list of registers is repeated six times in the processor, with each group of registers having the same offset and different base addresses.
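The base-plus-offset arrangement can be checked against the addresses used in Listing 3. The C sketch below runs on a PC and only does the address arithmetic. The PORTF base of 0x40025000 is inferred from the listing addresses rather than quoted from the datasheet, the 4KB block size is an assumption, and the function name gpio_reg_addr is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GPIO_PORTF_BASE 0x40025000u  /* inferred from the Listing 3 addresses */
#define GPIO_DIR_OFFSET 0x400u       /* direction register offset */
#define GPIO_DEN_OFFSET 0x51Cu       /* digital enable register offset */

/* Absolute register address = port base address + register offset. */
uint32_t gpio_reg_addr(uint32_t base, uint32_t offset)
{
    assert(offset < 0x1000u);  /* assumption: each port occupies a 4KB block */
    return base + offset;
}
```

With these values, gpio_reg_addr(GPIO_PORTF_BASE, GPIO_DIR_OFFSET) gives 0x40025400 and the DEN offset gives 0x4002551C, the two addresses that appear in lines 2 and 3 of Listing 3. The same offsets applied to another port's base address would give that port's registers.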

I mentioned earlier that there was an error in the GPIO section of the datasheet. The error is in the first listing in Table 10.6, where it says that GPIODATA is at offset zero. This is not correct. GPIODATA has been moved to offset 3FC (hex), and there is another register at offset zero, which is not shown in Table 10.6. This is an unfortunate discrepancy, but not a serious problem, because the final authority on register names and addresses is the “C” include file tm4c123gh6pm.h, which can be found in the Tivaware INC directory. It is a large file (200 pages) with definitions for all the registers and the bits within each register. For example, the definition for the PORTF_DATA register in “C” is:

#define GPIO_PORTF_DATA_R (*((volatile unsigned long *)0x400253FC)).
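A possible explanation for the 0x3FC offset is offered here as an assumption, not as something confirmed by the sources quoted in this article: on this family of devices, address bits 9 through 2 appear to act as a pin mask when the data register is accessed, so that offset 0x3FC (mask 0xFF) is the address that exposes all eight pins. The C sketch below only shows that arithmetic on a PC. The PORTF base is inferred from the listing addresses, and the function name gpio_data_addr is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GPIO_PORTF_BASE 0x40025000u  /* inferred from the Listing 3 addresses */

/* Address of the data register alias that exposes only the pins in pin_mask.
   Assumption: the 8-bit pin mask occupies address bits [9:2]. */
uint32_t gpio_data_addr(uint32_t base, uint32_t pin_mask)
{
    assert(pin_mask <= 0xFFu);   /* eight pins per port */
    return base + (pin_mask << 2);
}
```

Under that assumption, gpio_data_addr(GPIO_PORTF_BASE, 0xFF) gives 0x400253FC, the address in the include file, while a mask of 0x02 would give an address that touches only pin 1. Check the data register description in the datasheet before relying on this.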

What we need in assembly is the register name (GPIO_PORTF_DATA_R) and the address (0x400253FC). Line 1 in Listing 3 shows how to put these items into the µVision4 assembler format. Starting with this program, I will use the Launchpad hardware instead of the simulator, because there is now a blinking LED to show whether the program is running or not, and there is some mention on the blogs that the simulator does not always handle peripherals correctly. I will still use the debugger, but mainly to run the whole program, instead of single stepping line by line.

Since the simulator is the default in µVision4, we need to change the debugger setting. To do this, right click on “Target 1” in the project window and choose “Options” and then “Debug.” In the upper right of the window click on the “Use” button and select “Stellaris ICDI” from the drop-down window. Then click “OK” at the bottom of the options menu. After assembling the program, click “Download” from the flash menu or the download icon on the toolbar. To run the program without single-stepping, choose the “Start/Stop” selection from the debug menu.

In this program, I have branched to a subroutine (portf_init) to provide initialization for the program. The initialization instructions go from line 37 through line 47 with line 48 being the return. The first four lines in Listing 3 are the labels and addresses that are needed to control the PORTF peripheral, and the address in line 4 is the register, which turns on one or more of the ports. In the initialization routine, line 38 has the value (0x20), which will be placed in the lowest byte of the register specified in line 4. The first 6 bits in the register correspond to controlling ports A-F. When a high level is in one of these first 6 bits, the corresponding port is enabled. Putting a 1 in bit 5 (counting from 0) enables PORTF. The main loop for blinking the LED is from line 23 through line 32. First, we put a 1 (0x02) in bit 1 of PORTF to light the red LED and branch to the delay subroutine. Then we write 0x08, which puts a zero in bit 1 to turn off the red LED (and a 1 in bit 3, which lights the green LED), and again branch to the delay subroutine.
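The value 0x20 in line 38 follows directly from the bit position of PORTF in the enable register. As a quick check on a PC, the C sketch below computes the enable mask for any of the six ports. The function name gpio_port_enable_mask is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit mask that enables one GPIO port: bit 0 is PORTA ... bit 5 is PORTF. */
uint32_t gpio_port_enable_mask(char port)
{
    assert(port >= 'A' && port <= 'F');
    return 1u << (port - 'A');
}
```

For PORTF the result is 1 shifted left by 5, which is 0x20. Masks for several ports can be combined with a bitwise OR if more than one port is needed.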

After the second delay, we branch back to the loop label to repeat the process. In the delay subroutine, we first place a large number in register r3. Then we subtract 1 from r3, and branch back to label b1 if r3 is not zero. When r3 is zero, the routine returns to the first instruction after the current call to the delay subroutine, which is made at line 27 and then again at 31. As you may have noticed, I stored 0x0E in both the digital enable register (line 43) and in the port direction register (line 46). This sets up the three pins (PF1, PF2, and PF3) for digital output. If I use 0x02 (as shown in line 25), the red LED blinks. If I use 0x04, the blue LED blinks, and likewise, if I use 0x08, the green LED blinks. I can also alternate blinking between two colors by using one of the numbers (0x02, 0x04, 0x08) in line 25 and another of those numbers in line 29. In program 3, the value 0x08 is used in line 29.
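The LED values are easier to experiment with when they have names. The C sketch below is a host-side illustration of how the three pin masks combine into 0x0E and how the loop alternates between two values. The macro names and the function blink_value are hypothetical; the pin-to-color mapping is the one described above.

```c
#include <assert.h>
#include <stdint.h>

#define LED_RED   0x02u  /* PF1 */
#define LED_BLUE  0x04u  /* PF2 */
#define LED_GREEN 0x08u  /* PF3 */
#define LED_ALL   (LED_RED | LED_BLUE | LED_GREEN)  /* 0x0E, used for DEN and DIR */

/* Value written to the data register on a given pass through the loop:
   `first` on even steps (line 25) and `second` on odd steps (line 29). */
uint32_t blink_value(uint32_t step, uint32_t first, uint32_t second)
{
    assert((first & ~LED_ALL) == 0u && (second & ~LED_ALL) == 0u);
    return (step % 2u == 0u) ? first : second;
}
```

With first set to LED_RED and second set to LED_GREEN, the sequence of values is 0x02, 0x08, 0x02, 0x08 and so on, which is what Program 3 writes. Passing 0 as the second value would blink a single color instead.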

Listing 4 shows the code from Program 4. This program will use an interrupt to provide timing for LED blinking. Instead of using a software delay, one of the timers will provide the delay. This timer (SysTick) is part of the Arm core and is an optional feature, but most Arm processors include it. It is a simple timer that only counts down from a preloaded value to zero and then repeats the cycle. This timer is often used with a real-time operating system (RTOS) to set the operating time slices, but since I am not using an RTOS, I am free to use this timer for the delay.

LISTING 4 – Program 4 uses an interrupt to provide timing for LED blinking.

2: GPIO_PORTF_DIR_R EQU 0x40025400
3: GPIO_PORTF_DEN_R EQU 0x4002551C
5: NVIC_ST_CTRL_R EQU 0xE000E010
15: EXPORT __Vectors
17: __Vectors
18: DCD 0x20008000
19: DCD Reset_Handler
20: SPACE 52
21: DCD systick_handler
24: EXPORT Reset_Handler
26: Reset_Handler
28: bl portf_init
30: stop b stop
34: portf_init
37: mov R0, #0x20
38: STR R0, [R1]
39: NOP
40: NOP
42: mov r0, #0x0E
43: STR R0, [R1]
46: mov r0, #0x0E
47: STR R0, [R1]
50: LDR R0, =8000000
51: STR R0, [R1]
54: mov r0, #0x00
55: STR R0, [R1]
58: mov r0, #0x07
59: STR R0, [R1]
61: bx lr
63: systick_handler
64: ldr r1, =GPIO_PORTF_DATA_R
65: ldr r2, [r1]
66: eor r2, #0x08
67: str r2, [r1]
68: bx lr
70: END

Two topics are covered in this program. One is the organization and operation of exceptions and interrupts, and the other is the description of the registers that implement the SysTick timer. As usual, I will describe the organization and operation first, and then provide a description of the program.

Although exceptions and interrupts are closely related, they are not exactly the same and have different uses. Exceptions are part of the Arm core and tend to be one-time events. They are typically used to stop the processor. By default, they cause the program to branch to a routine that contains an infinite loop, where the processor is running but nothing happens. If you want other operations to occur when the exception occurs, you need to add code to the default routine. Interrupts are generated by peripherals, when some condition is met in that peripheral. The peripherals are designed into the processor (TM4C123GH6PM) by the device vendor, and are not part of the Arm core.

The TM4C123GH6PM datasheet shows a memory map of the exception and interrupt table (pages 103-107). The exceptions start with the initial stack pointer and reset vector in the first two 32-bit words (4 bytes per word). They are not the typical exceptions, since they don’t stop the processor, but since they are there, they have to be accounted for. Next are 13 more traditional exceptions, and in the last spot is the SysTick timer (also not a typical exception). After that, the peripheral interrupts start and continue 4 bytes at a time, until all exceptions and interrupts are listed.

In this program, I am still not using the startup file. Therefore, I need to keep track of the interrupts and their locations myself. The SysTick timer is exception 15. With 4 bytes per vector, this means that the SysTick vector resides at memory locations 60 through 63. Lines 18 through 21 demonstrate how to get to that region of memory. The first two vectors take 8 bytes, and I need 52 more bytes to get to byte 60. This is done by using the space directive in line 20, which advances the flash counter in the assembler 52 bytes to move from memory location 8 to location 60. There I have placed the label associated with the SysTick timer interrupt subroutine.
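The arithmetic behind the 52 in line 20 generalizes to any exception or interrupt number. The C sketch below, which runs on a PC, shows one way to compute it. The function names vector_offset and space_between are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VECTOR_SIZE 4u  /* each vector is one 32-bit word */

/* Byte offset of a vector from the start of the vector table. */
uint32_t vector_offset(uint32_t exception_number)
{
    return exception_number * VECTOR_SIZE;
}

/* Bytes to skip with the space directive between the last vector already
   defined and the next vector to be defined. */
uint32_t space_between(uint32_t last_defined, uint32_t next_defined)
{
    assert(next_defined > last_defined);
    return vector_offset(next_defined) - vector_offset(last_defined + 1u);
}
```

Here vector_offset(15) is 60, and space_between(1, 15) is 60 minus 8, or 52: the gap between the reset vector (entry 1) and the SysTick vector (entry 15).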

Lines 34-61 are the initialization routine to set up the SysTick timer and PORTF. Since the timer is off at start-up, I did not need to turn it off before setting the parameters in lines 49-55 and turning the timer on with lines 57-59. However, if the running timer needs to be changed during program operation, lines 57-59 need to be run with zero moved to r0, to stop the timer before changing the parameters and then re-enabling the timer. After reaching line 30, the program stays there until the SysTick timer interrupt occurs. Then the program branches to the systick_handler in line 63, and after turning on or off the LED, returns in line 68. This sequence continues indefinitely, blinking the LED.
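To get a feel for the numbers in the initialization routine, the C sketch below estimates the time between interrupts and models the toggle done in the handler. It runs on a PC and makes two assumptions that are not stated in this article: that the processor runs from a 16 MHz clock, and that the three low bits of the control value 0x07 are the enable, interrupt enable and clock source bits. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed meaning of the control bits that make up 0x07. */
#define SYSTICK_ENABLE     0x01u        /* bit 0: counter enable */
#define SYSTICK_INTEN      0x02u        /* bit 1: interrupt enable */
#define SYSTICK_CLK_SRC    0x04u        /* bit 2: use the system clock */
#define SYSTICK_MAX_RELOAD 0x00FFFFFFu  /* the reload value is 24 bits wide */

/* Milliseconds between interrupts. The counter runs from the reload value
   down to zero, so one cycle takes reload + 1 clock ticks. */
uint32_t systick_period_ms(uint32_t reload, uint32_t clock_hz)
{
    assert(reload <= SYSTICK_MAX_RELOAD);
    assert(clock_hz > 0u);
    return (uint32_t)(((uint64_t)reload + 1u) * 1000u / clock_hz);
}

/* What the handler does to the data register: flip the selected LED bit. */
uint32_t toggle_bits(uint32_t data, uint32_t mask)
{
    return data ^ mask;
}
```

Under the 16 MHz assumption, a reload value of 8000000 gives an interrupt roughly every 500 ms, so the LED changes state about twice per second. Each call to toggle_bits with mask 0x08 flips the green LED bit, which is what the eor instruction in line 66 does.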

NOTE: The SysTick timer is the only exception/interrupt that does not need to acknowledge to the processor that it has been called. All other interrupts do need to acknowledge that in the subroutine.

Listing 5 shows the code from Program 5. This program will do the same thing as program 4, but in a slightly different way. This program is different from all previous programs, since I will now use the startup file from the Keil IDE to show how that file will interact with the source file. The easiest way to develop Program 5 source code is to copy the code from Program 4 and then modify it. Most of the changes will be to delete directives and code from Program 4, because these statements are already in the startup file. However, there is one addition in line 54. This is needed because the vector table that refers to the interrupt subroutine is located in the startup file, and the EXPORT directive makes the systick_handler label in the source file visible to other files.

LISTING 5 – Program 5 uses the startup file from the Keil IDE to show how that file will interact with the source file.

2: GPIO_PORTF_DIR_R EQU 0x40025400
3: GPIO_PORTF_DEN_R EQU 0x4002551C
5: NVIC_ST_CTRL_R EQU 0xE000E010
15: EXPORT start
17: start
19: bl portf_init
21: stop b stop
25: portf_init
28: mov R0, #0x20
29: STR R0, [R1]
30: NOP
31: NOP
33: mov r0, #0x0E
34: STR R0, [R1]
37: mov r0, #0x0E
38: STR R0, [R1]
41: LDR R0, =8000000
42: STR R0, [R1]
45: mov r0, #0x00
46: STR R0, [R1]
49: mov r0, #0x07
50: STR R0, [R1]
52: bx lr
54: EXPORT systick_handler
56: systick_handler
57: ldr r1, =GPIO_PORTF_DATA_R
58: ldr r2, [r1]
59: eor r2, #0x04
60: str r2, [r1]
61: bx lr
63: END
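As a rough check on the numbers in Listing 5, the Python sketch below models the timer period and the LED toggle on a PC. It rests on several assumptions that the listing as shown does not state: that 8000000 in line 41 is the SysTick reload value, that 0x07 in line 49 is written to the SysTick control register, that the control bits are named as in the constants below, and that the timer is clocked at 16 MHz.

```python
ENABLE = 0x01   # bit 0: start the counter (assumed bit name)
INTEN = 0x02    # bit 1: request an interrupt when the count reaches zero
CLK_SRC = 0x04  # bit 2: clock the timer from the system clock

ASSUMED_CLOCK_HZ = 16_000_000  # assumption: the clock rate is not in the listing


def systick_period_seconds(reload_value, clock_hz=ASSUMED_CLOCK_HZ):
    """The counter runs from reload_value down to 0, so one period
    lasts reload_value + 1 clock cycles."""
    return (reload_value + 1) / clock_hz


def toggle_led(port_data, mask=0x04):
    """Model of 'eor r2, #0x04': flip only the bit selected by mask."""
    return port_data ^ mask


control_value = ENABLE | INTEN | CLK_SRC     # the 0x07 written in line 49
period = systick_period_seconds(8_000_000)   # about half a second at 16 MHz
```

Under these assumptions the handler runs about twice per second, and because the exclusive OR flips a single bit, every call changes the LED state without touching the other pins of the port.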

There are also changes needed in the startup file so that both files will assemble. The startup file is too large to include in this article, so instead I will tell you what line numbers to change. Most of the changes will be to delete lines, but first I will modify lines that will stay. This will ensure that the line numbers will be consistent, and that my instructions will match your situation, even if you delete a different number of lines. First, double click on the startup file to place it into the editor. Then change the __initial_sp in line 63 to 0x20008000. Then in lines 233, 235, 236, put a semicolon in the space in front of the instructions to comment out references to a SystemInit subroutine.

The options are to delete the routine, comment it out as mentioned, or put a dummy routine in the source file, which only has a return to the Reset_Handler routine in the startup file. This latter choice could be a placeholder for additional start-up processes in the future. Now, we have to decide what to name the start of code in lines 234 and 237. I used the name start in Listing 5, and so I have to replace __main with start. Whatever name you choose has to be the same in both the startup file and the source file. Finally, delete lines 29 – 49 and 939 – 961 in the startup file. After these changes, the program will assemble without errors.

So far, I have described code instructions and constant data, which are placed in flash memory (read-only) starting at address zero, and are not changed during program operation. In the next program, I will be discussing variable data, which will be placed in RAM (read-write). Variable data is data that changes with time and has to be both read and written during program operation. In Program 6 (Listing 6), though the data will reside in RAM, the address for the data will be stored in flash memory. First, the data address will be loaded with an ldr instruction into a register. Then the data is moved into another register, and finally, a str instruction will store the data to the RAM location pointed to by the address. This is shown in lines 8-10 and 12-14 for variables x1 and x2. The x1 label defines 32 bits, which hold the RAM address 0x20001000 where the data x1 will reside. It is similar for variable x2.

LISTING 6 – Program 6 uses variable data, which will be placed in RAM memory.

3: area program6, code, readonly
4: export __main
6: __main
8: ldr r0,x1
9: mov r1,#127
10: str r1,[r0]
12: ldr r0,x2
13: mov r1,#63
14: str r1,[r0]
16: push {r0}
17: pop {r5}
20: stop b stop
22: align
24: x1 dcd 0x20001000
25: x2 dcd 0x20001004
27: end
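The address-in-flash, data-in-RAM arrangement of Listing 6 can be modeled on a PC with the Python sketch below. It is an illustration only: the bytearray stands in for the first 8 KB of RAM, the dictionary stands in for the two dcd words, and the helper names are hypothetical.

```python
import struct

RAM_BASE = 0x20000000
ram = bytearray(0x2000)  # model only the first 8 KB of RAM

# The two dcd words in flash: each holds the RAM address of a variable.
literal_pool = {"x1": 0x20001000, "x2": 0x20001004}


def store_word(address, value):
    """Model of 'str': write a 32-bit little-endian word into RAM."""
    struct.pack_into("<I", ram, address - RAM_BASE, value)


def load_word(address):
    """Model of a word-sized 'ldr' from RAM."""
    return struct.unpack_from("<I", ram, address - RAM_BASE)[0]


# Lines 8-10 and 12-14 of Listing 6
store_word(literal_pool["x1"], 127)
store_word(literal_pool["x2"], 63)
```

After these two stores, load_word(0x20001000) and load_word(0x20001004) return the values that the debugger's memory window would be expected to show at those addresses.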

So, why are the data addresses so high in RAM, instead of near the start of RAM at 0x20000000? This is a simple question, but to answer it, I will backtrack a bit and discuss what I learned as I wrote this article. The discussions of Listing 5 and Listing 6 were written several months apart. During that time, I was trying to get a better understanding of the coding needed to handle data, the stack and the startup file. The approach I used was to program simple examples (such as Program 1), but using “C” instead of assembly (mostly using Code Composer Studio). Then I used the debugger to figure out where the data was placed in RAM and how it got there.

The first thing I learned was where the stack is typically placed. In the first programs that I wrote, I was hard-coding the initial stack pointer at one word past the top of RAM (as is done in many 8- and 16-bit processors). I could not see how the startup file would do that, otherwise. In fact, the startup file does not put the stack high in memory, but instead, at or near the beginning of memory (0x20000000). A stack size of 0x200 (512 bytes) is the startup default and can be changed by the user at the top of the startup file. Apparently, there are two methods of placing data relative to the stack. In the first method, the stack is placed in memory first, and the data is placed after the stack with a few bytes of constant data separating the two. In this method, the initial stack pointer is fixed. In the second method, the data is placed first, and the stack is placed after the data. If more data variables are added, the stack is moved up in memory, and the initial stack pointer is also moved up.
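The two placement methods described above can be put into numbers with the Python sketch below. It is a host-side model under stated assumptions: the stack grows downward, so the initial stack pointer sits at the top of the reserved block; the stack block is rounded up to an 8-byte boundary; and the few bytes of separating data mentioned above are ignored. The function names are hypothetical.

```python
RAM_BASE = 0x20000000
STACK_SIZE = 0x200  # the startup-file default quoted above


def align_up(value, boundary=8):
    """Round value up to a multiple of boundary (assumed stack alignment)."""
    return (value + boundary - 1) // boundary * boundary


def initial_sp_stack_first(stack_size=STACK_SIZE):
    """Method 1: the stack occupies the bottom of RAM and data follows it,
    so the initial stack pointer does not depend on the amount of data."""
    return RAM_BASE + stack_size


def initial_sp_data_first(data_bytes, stack_size=STACK_SIZE):
    """Method 2: data occupies the bottom of RAM and the stack follows it,
    so the initial stack pointer moves up as data is added."""
    return align_up(RAM_BASE + data_bytes) + stack_size


print(hex(initial_sp_stack_first()))
print(hex(initial_sp_data_first(8)))
```

In this model the first method always gives 0x20000200 with the default stack size, while the second gives a value that rises with every variable added.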

Finally, I learned that the stack location is handled at the beginning of the startup file, and the heap (if used) is handled at the end of the startup file. So, in Program 6, I am using the whole startup file except for two things. I am still commenting out the SystemInit routine as before, and also commenting out the statement IMPORT __use_two_region_memory near the end of the file. This statement was one of those deleted at the bottom of the startup file in Program 5. I am still not sure how the assembler takes the 0x0200 stack size at the top of the startup file and combines it with the 0x20000000 start of RAM to get the initial stack pointer. But it works, and so I accept it. I don’t use the heap in my small programs, so instead of solving the two-region-memory issue, I just commented out the statement.

At the moment, I am undecided about whether to use the startup file in future assembly programs. If not, I will hard-code the initial stack pointer near the beginning of memory, rather than at the end. I will also include all the exception vectors in my source code file, using an initial common default exception routine with an infinite loop. If I leave out the exception vectors and have code or constant data right after the Reset Handler data, and an exception occurs in that space, there will be no control over what the processor does. I will handle the interrupts as described in previous programs. I will enable the interrupt on a particular peripheral, and use a space statement to place the interrupt routine in its correct location in flash memory. Then I may also use another space statement to place the start of code at a particular fixed location in flash memory.

I have put the data so high in memory for certain reasons. First, I have left a lot of room for increasing the stack, if needed. Also, I have room for other data, if needed in the future, and the starting address is easy to remember. But mostly, I did it to make a point. Assembly is a language with few requirements and restrictions. After that, you are free to do whatever you want. If other people’s examples are useful, then incorporate them. If not, have the confidence to make your own rules and code.

The push and pop instructions were added to verify where the data in register r0 would be placed in the memory, relative to the initial stack pointer, and where the popped data would be returned. Since the memory window cannot be scrolled to a memory location less than the address you enter, and the stack is descending, the address given to the memory window in the debugger must be less than the initial stack pointer by enough for the data to be visible.
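The Python sketch below models what the push and pop pair does to a descending stack, which may help in choosing the address to type into the memory window. It is a PC-side illustration only. The initial stack pointer of 0x20000200 is an assumption, and the pushed value is the address that r0 holds after line 12 of Listing 6.

```python
import struct

RAM_BASE = 0x20000000
ram = bytearray(0x400)  # model only the first 1 KB of RAM


class Stack:
    """Descending stack: push moves the pointer down, then writes."""

    def __init__(self, initial_sp):
        self.sp = initial_sp

    def push(self, value):
        self.sp -= 4
        struct.pack_into("<I", ram, self.sp - RAM_BASE, value)

    def pop(self):
        value = struct.unpack_from("<I", ram, self.sp - RAM_BASE)[0]
        self.sp += 4
        return value


stack = Stack(0x20000200)   # assumed initial stack pointer
stack.push(0x20001004)      # push {r0}
sp_after_push = stack.sp    # where the pushed word now sits
r5 = stack.pop()            # pop {r5}
```

With these assumptions the pushed word lands 4 bytes below the initial stack pointer, so the memory window has to start at or below that address for the word to be visible.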

Finally, with this data arrangement of coding the data addresses in flash, I will not run into the problem of huge binary (.bin) files that can occur when a separate data section is used. In one case, I had a .bin file of 22 bytes of code, which went to about 450,000,000 bytes when adding 2 bytes to a data section. The code is placed close to 0x00000000 and the data bytes are placed near 0x20000000. The only reason for this that I can think of is that, because a .bin file has no smarts to tell the loader where to place code and data, the software creating the .bin file started with the code at zero, and after that padded the file with millions of bytes to place the data bytes starting at 0x20000000. Thus, it is probably better to have the loader use ELF files, or some other format that holds more than just the raw code and data, such as instructions on where to put the data.
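The padding explanation above can be illustrated with the Python sketch below, which estimates the size of a raw image that has to span every address it covers. It is a simplified model with a hypothetical function name. Real tools may lay out or split the output differently, so the result is not expected to match the file size reported above, only its order of magnitude.

```python
def flat_image_size(sections):
    """sections is a list of (start_address, length) pairs.
    A raw image carries no addresses, so every byte between the lowest
    and the highest address has to be present, padding included."""
    lowest = min(start for start, _ in sections)
    highest = max(start + length for start, length in sections)
    return highest - lowest


code_only = flat_image_size([(0x00000000, 22)])
code_and_data = flat_image_size([(0x00000000, 22), (0x20000000, 2)])
print(code_only, code_and_data)
```

In this model, two data bytes at the start of RAM turn a 22-byte image into one of several hundred million bytes, nearly all of it padding.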

Program 7 (Listing 7) will be the last of the assembly language programs, and it continues the demonstration of how to handle data in a program. Both the Cortex-M3 and Cortex-M4 can use unaligned data. That is, the data does not have to be aligned on a 32-bit boundary. This is accomplished by having variations of the ldr and str instructions that load and store less than 32 bits at a time. Program 7 starts by using str to store 32 one bits (0xFFFFFFFF) at address 0x20001000 (lines 7-9). Then in lines 11-13, strb stores 1 byte (0xAB) at address 0x20001001 without disturbing the remaining bytes of the first 32-bit store. Then, in lines 15-17, strh stores a half word (2 bytes) starting at address 0x20001001. Finally, in lines 19-21, str writes over all 4 bytes starting at address 0x20001000. This demonstrates that unaligned stores will not disturb other data already stored in memory. The same is true of unaligned loads with the variations of ldr.

LISTING 7 – Program 7 also demonstrates how to handle data in a program.

2: area program7, code, readonly
3: export __main
5: __main
6:; b d
7: ldr r0,x1
8: mov r1,#-1
9: str r1,[r0]
11: ldr r0,x2
12: mov r1,#0xab
13: strb r1,[r0]
15: ldr r0,x2
16: mov r1,#0x1234
17: strh r1,[r0]
19: ldr r0,x1
20: mov r1,#31
21: str r1,[r0]
22: ;d
23: ldr r0,x1
24: mov r1,#0xFF
25: mov r2,#3
27: loop
28: str r1,[r0]
29: add r0,r0,#4
30: subs r2,r2,#1
31: bne loop
33: stop b stop
35: align
37: x1 dcd 0x20001000
38: x2 dcd 0x20001001
40: end
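To predict what the memory window should show after each store in lines 7-21 of Listing 7, the Python sketch below replays the four stores on a PC. It is a host-side model that assumes little-endian byte order, and the helper names are hypothetical stand-ins for the instructions they are named after.

```python
import struct

RAM_BASE = 0x20001000
ram = bytearray(8)  # model only the few bytes the program touches


def str_word(address, value):
    """Model of 'str': store 32 bits."""
    struct.pack_into("<I", ram, address - RAM_BASE, value & 0xFFFFFFFF)


def strb(address, value):
    """Model of 'strb': store the low 8 bits only."""
    struct.pack_into("<B", ram, address - RAM_BASE, value & 0xFF)


def strh(address, value):
    """Model of 'strh': store the low 16 bits only."""
    struct.pack_into("<H", ram, address - RAM_BASE, value & 0xFFFF)


def word_at(address):
    """Read back the 32-bit word that starts at address."""
    return struct.unpack_from("<I", ram, address - RAM_BASE)[0]


X1, X2 = 0x20001000, 0x20001001
snapshots = []                 # word at X1 after each store
str_word(X1, -1)               # lines 7-9
snapshots.append(word_at(X1))
strb(X2, 0xAB)                 # lines 11-13
snapshots.append(word_at(X1))
strh(X2, 0x1234)               # lines 15-17
snapshots.append(word_at(X1))
str_word(X1, 31)               # lines 19-21
snapshots.append(word_at(X1))
print([hex(word) for word in snapshots])
```

Each snapshot differs from the previous one only in the bytes that the store was meant to change, which is the behavior the listing is intended to demonstrate.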

Lines 23-31 show something different: the initialization of memory locations with zero or some other value. First, I load the starting memory address as before. Then, I move the value into r1 that will initialize the memory, and a value into r2 that is the number of iterations. To be more efficient, I will initialize 32 bits at a time. Line 28 stores the bit pattern. Then I add 4 bytes to the memory address in r0, and subtract 1 from the counter. Finally, the conditional branch instruction (bne) branches back to loop if the zero flag in the status register is not set. If the zero flag is set, the code moves past line 31 to the next instruction.

When I first wrote Program 7, I used sub instead of subs, and the loop did not stop after three iterations, because the zero flag was not set when zero was reached in line 30. Of the add and sub instructions I have used, only the adds and subs forms update the status register and the zero flag. This is an example of how you need to choose the correct form of an instruction to get the results you want.
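The difference between sub and subs in the loop can be reproduced on a PC with the Python sketch below. It is a simplified model of lines 27-31 of Listing 7 with a hypothetical function name. It assumes the zero flag is clear when the loop is entered, and it caps the number of iterations so that the sub version terminates.

```python
def run_fill_loop(count, sets_flags, max_iterations=10):
    """Return how many times the loop body runs.
    sets_flags=True models 'subs'; sets_flags=False models plain 'sub'."""
    z_flag = False  # assumption: the zero flag is clear on entry
    r2 = count
    iterations = 0
    while iterations < max_iterations:
        iterations += 1              # str and add: one word initialized
        r2 = (r2 - 1) & 0xFFFFFFFF   # sub or subs r2,r2,#1
        if sets_flags:
            z_flag = (r2 == 0)       # only subs updates the zero flag
        if z_flag:                   # bne falls through when the flag is set
            break
    return iterations


print(run_fill_loop(3, sets_flags=True))
print(run_fill_loop(3, sets_flags=False))
```

With subs the model stops after the requested three iterations. With sub it runs until the cap is reached, because nothing in the loop ever changes the flag that bne tests.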

The programs in this article have covered only a few of the Arm assembly instructions and processor peripherals. But the model is there for you to do more. The programs are examples of how to learn and test instructions, and also how to use the peripherals. The first two programs test instructions, and are easy to run in the simulator. The rest of the programs use peripherals, and it is better to run them on a board, since the simulator may not always run correctly when using peripherals.

In addition, I have explained how to use the Keil IDE. But there are additional ways for the Keil IDE to help in organizing our programs. For example, Program 1 shows the structure for testing instructions. If you want to test many small groups of instructions, but don’t want to have a separate project for each group, the Keil IDE has a feature to help. You can add multiple source files to a project, and Keil will list them all under the source files, and include them all when compiling.

Usually, these separate files are all part of the single larger program, and combining them is what you want. However, if the separate files are separate programs, there is a way to include only one program at a time when compiling. The IDE will still list all the source files as before, but if you right click on a particular source file and select options for that file, there are check boxes to include or exclude that source file from the compile. In this way, you can choose which source file to use. As a caution, I would first try this on a project and files that are easily replaced, in case files are lost during the learning process.

Using peripherals in addition to those mentioned in this article (GPIO and SysTick) will use the same approach. The data sheet will give the overall operation, and the registers will be defined. Sometimes additional system control registers will be needed for a particular peripheral, similar to what was needed to select the GPIO port.

Before writing this article, I had done very little with assembly language, and knew almost nothing about using a debugger. I was surprised to see how easy it is to program Arm devices in assembly, and how effective assembly is for learning about the processor, and also for learning programming concepts in general. I have also learned how useful the debugger is for seeing how the code and data are placed in memory. I expect to continue my education about assembly language programming to learn more about integer data processing, and then extend that to floating-point data processing.

Electrical Engineer

David Ludington is a retired electrical engineer with experience in low-noise analog design and infrared system design. He earned a BSEE at Michigan State University and an MSEE at Syracuse University.