Tools of the Machine Code Trade
The AVR microcontroller instruction set provides a simplicity that makes it good for learning the root principles of machine language programming. There’s also a rich set of macros available for the Microchip AVR that ease assembler-level programming. In this article, Wolfgang steps you through these principles, with the goal of helping programmers “think low-level, write high-level” when they approach embedded systems software development.
In today’s modern world, why program in assembler? If you program for a living, many factors drive which programming language you use. You could use a language that’s readily available, one that your customers desire or one that your superiors have mandated. Or you could use a language that’s currently in vogue within a certain community, such as education. But whatever your programming language of choice is, there will always be compelling reasons to program a machine down to the metal. Those who can’t program at the machine level do not really know how a computer works. Moreover, familiarity with the basic machine code level of programming is a prerequisite to being able to write truly efficient high-level code. Yes, compilers generate machine code. But you should be able to read it and to judge its efficiency.
By having a notion of what the machine code produced by your high-level code statements looks like, you’ll be able to write programs that run faster and require less memory space. In short, it’s wise to follow the advice right in the title of a book by Randall Hyde: “Thinking Low-Level, Writing High-Level.” When writing programs for embedded systems (device drivers and the like), it can occasionally be an actual necessity to resort to assembler programming. That’s typically the case if tight timing requirements are to be met. Sometimes, you have to deal with microseconds or even machine cycles. In such cases, it’s not advantageous to rely on a compiler. Finally, it goes without saying that you could dive into assembler programming projects just for fun.
The Role of Macros
Experienced craftspeople select and prepare their tools before they start working. Sometimes, they make their own special devices to do their work—things like gauges, stencils or rigs. Macros may be likened to these kinds of tools for us programmers. Essentially, macros constitute a toolbox. Some macros simply ease program writing. Others are provided to circumvent some shortcomings of the architecture. Let’s suppose that our projects aren’t that big, our memory requirements are not that severe and absolute speed isn’t a top priority. Under these conditions, we can afford to execute multiple instructions instead of one instruction of a more advanced architecture.
The basic idea is to create a simple virtual machine and to define an extended instruction set, which alleviates typical programming chores—like querying and modifying single bits, 16-bit operations, conditional branching and stack-based operations. In this article, we will introduce the virtual machine and illustrate the content of our toolbox. Here we will concentrate on the basics. In the beginning, it is important to get a first impression, an overview of principal tools and methods. A comprehensive description, the source code and application examples can be found on the Internet.
Many instructions and nasty details: Those words summarize the principal impediments of programming a processor down to the metal. In itself, assembler programming isn’t that difficult. However, when the programming task is more demanding, an overwhelming number of instructions have to be written. That’s especially true for RISC and minimalist CISC architectures, which require many instructions even for small operations.
For example, instead of simply moving a memory operand to a peripheral unit, a RISC machine requires you to load it first into a register. At the same time, we have to master the complexity of the machine architecture’s principles of operation, including their restrictions and quirks. To mention a few examples, not all instructions will work with all registers and all types of operands. And some addressing modes have very short address offsets that may, for example, only be able to jump over 64 instructions or to access only 64 bytes.
How do you overcome those restrictions? The most obvious solution would be to program solely in higher-level languages, leaving it to the compiler author to cope with such eccentricities. However, we don’t want to tread that path. The other extreme would be to design a new processor. Thanks to FPGA technologies and hardware description languages, that wouldn’t be completely impossible with today’s technology. Nowadays, it could be done even on the proverbial kitchen table.
A typical run-of-the-mill processor core is not that difficult. An 8-bit RISC CPU core would be merely some kind of a student’s assignment (appropriate textbooks are listed in the article’s references). All that said, when your homemade processor core is up and running, it will lack advanced peripherals, not to mention the absence of a rich ecosystem. Therefore, that option is only a viable approach if done for research, educational purposes or just for fun.
A feasible solution must be based on a well-proven architectural platform and a readily available integrated development environment (IDE). Our goal is to ease assembler programming. This is achieved by substituting cumbersome instructions or tedious instruction sequences with virtual instructions that are more powerful and easier to use. The application programmer writes down one instruction and the machine executes a short program showing the same behavior. Essentially, there are three principles involved in laying out such substitute programs: subroutines, macros and emulation.
Functions: Programs written once and called whenever needed are known as functions, procedures, methods and the like. Consider this simple generic example. Our application example relates to a display unit: think of an LCD or OLED module, or even a window on a PC screen. We want to display a character string at a certain screen position, given by the X and Y coordinates. What could be more natural than providing something similar to the following?
SHOW_TEXT (X, Y, string_pointer) or SHOW_LITERAL (X, Y, plain_text)
The program itself has to be declared and written. This is the so-called procedure body. (For our examples, we orient ourselves by the Ada programming language. Although the syntax is more verbose, it is also more lucid and instructive than C, Java and the like.)
    procedure SHOW_LITERAL (X: in INTEGER; Y: in INTEGER; plain_text: in ASCII_string) is
    begin
       -- Here is the program that does the work …
    end SHOW_LITERAL;
When such substitute programs have been made available to an application program, it is easy to invoke them:
    SHOW_TEXT (message_X, message_Y, IO_error_message);
    SHOW_LITERAL (2, 20, "Temp out of range");
A substitute program must know which data to work with. Those data are passed as parameters. Whoever writes the program body must declare the parameters. In our example procedure SHOW_LITERAL, this is done in the first line. And whoever uses such a program inserts the actual parameters. They must be of the right type, must be within the allowed ranges and so on. When programming in a high-level language, the compiler checks whether the parameters are correct. When programming in assembler, it is solely up to the programmer to avoid errors.
Subroutines: The subroutine is the counterpart to the function, procedure or method in a high-level language. A subroutine is stored only once. To be used, it must be called. In most cases, a high-level language function, procedure or method will be translated to a subroutine. In this process, the compiler will do the housekeeping work. However, when programming in assembler, parameter passing, calling and returning to the calling program have to be coded instruction by instruction. It goes without saying that there are no formal parameter declarations and correctness checks. Returning to our examples, a subroutine SHOW_TEXT begins with a label and ends with a return instruction:
    show_text:
        ; Here is the program that does the work …
        RET                       ; Return to the calling program.
To invoke the subroutine, the parameters are passed first:
    LDI r16, x
    LDI r17, y
    LDI zl, low(text_adrs)
    LDI zh, high(text_adrs)
    CALL show_text
The prerequisites of this example are that the assembly language for the Microchip Technology (formerly Atmel) AVR microcontroller (MCU) is used and that the parameters are passed in the AVR register file. There are different principles of parameter passing. They are explained thoroughly in textbooks on assembler programming; typical examples are listed in the article’s references.
Macros: A macro is a program sequence that is written once but inserted in place wherever it is invoked. The parameters are inserted by the assembler. Parameter passing is a very basic mechanism, without formal declarations and even, as in the case of the AVR assembler, without parameter names. In the macro body, the parameters are simply numbered @0, @1, @2 and so on. A macro body for our example could look like this:
    .macro SHOW_TEXT
        LDI r16, @0               ; X coordinate
        LDI r17, @1               ; Y coordinate
        LDI zl, low(@2)           ; Text address, low byte
        LDI zh, high(@2)          ; Text address, high byte
        ; Here is the program that does the work …
    .endmacro
Invocation is straightforward because the parameters are inserted by the assembler:
SHOW_TEXT example_x, example_y, example_string
A macro will occupy much more memory capacity (the whole substitute program instead of some instructions to pass the parameters and to call the subroutine). However, it will run faster because there is no overhead to pass the parameters, to call the subroutine, and to return to the calling program.
If the substitute program is comparatively large (showing text on a display is a good example), it may be advantageous to write the program that does the work as a subroutine and to provide a macro to invoke it conveniently:
    .macro SHOW_TEXT
        LDI r16, @0
        LDI r17, @1
        LDI zl, low(@2)
        LDI zh, high(@2)
        CALL show_text
    .endmacro
The program is then invoked by the statement SHOW_TEXT x, y, string, as shown above.
Inserting the machine code of a substitute program instead of calling it avoids the overhead of parameter passing, subroutine call and return. Many compilers support this option. For example, a statement pragma inline will cause the compiler to implement and invoke the function or procedure as a macro.
Emulation: An emulator is a program that implements the functions of a processor architecture by software. The machine running the emulator is the host. Its architecture is the host architecture. And the architecture to be emulated is the target architecture. The target instructions are not invoked by instruction fetches of the host machine. Instead, these instructions are treated as data structures, which the emulator program addresses. The other architectural features of the target architecture are represented as well with stored data structures, especially arrays (Figure 1).
In principle, an emulator is a fairly simple program loop (Figure 2). The program to be emulated is loaded into the program memory array. The emulator fetches one instruction after another out of this array and invokes routines that reproduce the effects of the particular instructions. Emulation allows you to implement even unconventional, eccentric target architectures. It’s an affordable method to tinker with your own architectural ideas and instruction sets. A particular strength lies in the realm of debugging, error tracing and the like. Because everything runs in software, all the details of the program behavior can be analyzed. Even the most severe errors in the target program will not crash a well-written emulator.
Furthermore, emulation can easily support multiple virtual target machines. It’s only necessary to provide the data structures shown in Figure 1 and to switch the emulator loop appropriately. This is already depicted in Figure 2 (block 7). All that said, there is a principal downside: Emulation is slow. A typical target instruction will require from about 10 to more than 50 host instructions. So, it only makes sense to resort to this principle when such lower speeds aren’t an impediment.
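The emulator loop described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the four-instruction target machine, its opcodes and the names target_t, step and run are invented here and do not correspond to Figure 1, Figure 2 or to any real architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical target instruction set, invented for this sketch.
   Every instruction occupies three bytes: opcode, operand a, operand b. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2, OP_JNZ = 3 };

typedef struct {
    uint8_t program[256];   /* program memory array holding target instructions */
    uint8_t reg[4];         /* target register file */
    uint8_t pc;             /* target program counter */
    int     halted;
} target_t;

/* Emulate one target instruction. Returns 0 on an illegal opcode, 1 otherwise. */
static int step(target_t *t)
{
    assert(t);
    uint8_t op = t->program[t->pc];
    uint8_t a  = t->program[(uint8_t)(t->pc + 1)];
    uint8_t b  = t->program[(uint8_t)(t->pc + 2)];

    switch (op) {
    case OP_HALT:                              /* stop the target machine */
        t->halted = 1;
        return 1;
    case OP_LOADI:                             /* reg[a] = literal b */
        t->reg[a & 3] = b;
        break;
    case OP_ADD:                               /* reg[a] += reg[b], modulo 256 */
        t->reg[a & 3] = (uint8_t)(t->reg[a & 3] + t->reg[b & 3]);
        break;
    case OP_JNZ:                               /* jump to b if reg[a] != 0 */
        if (t->reg[a & 3] != 0) {
            t->pc = b;
            return 1;
        }
        break;
    default:                                   /* target error: stop, do not crash */
        t->halted = 1;
        return 0;
    }
    t->pc = (uint8_t)(t->pc + 3);
    return 1;
}

/* The emulator loop: fetch and execute until HALT, an error or the step limit. */
static int run(target_t *t, unsigned max_steps)
{
    while (!t->halted && max_steps-- > 0) {
        if (!step(t)) {
            return 0;
        }
    }
    return t->halted;
}
```

Because the target instructions are plain data in an array, an illegal opcode merely ends the loop with an error code. Supporting several virtual machines would amount to keeping several target_t structures and choosing which one to pass to step.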
Why concentrate on macros? It’s because our focus is to get application projects up and running. With this goal in mind, there is no reason to take time to bother with instruction sets, writing our own assembler and the like. We must rely on a well-established ecosystem. That’s why we’re focused here on the assembler language of a chosen MCU family—the AVR—and supplementing it only by some more advanced, convenient or specialized tools. Moreover, we need the speed of our MCU. For all those reasons, emulation is out of the question.
It goes without saying that experienced application programmers will write subroutines and macros whenever they see fit. Our primary goal, however, is to provide something useful in advance. One of the more specific goals could be a specialized application programming interface (API), for example, to cope with Boolean expressions or to support LCD modules. Therefore, we have to ask what the most basic functions are, such as positioning to an X-Y coordinate, showing a character string or drawing a line.
Another class could be API functions to support application programming in general. A typical example is a set of CISC-like or stack-oriented virtual instructions to ease assembler programming of RISC machines. Macros or subroutines can also capture the experience collected from project work. For instance, when a project is reasonably large and programming experience grows, we will recognize instruction sequences that appear again and again.
There are also functions that are noticeably difficult to program, so we want to encode them only once. Short sequences are obvious candidates for macros. When a program is somewhat larger and more complicated, it is often advantageous to implement it as a subroutine and to provide supplementary macros to call it. That relieves the application programmer from the tedious programming chore of parameter passing.
The small virtual machine is a principle that’s useful when assembler-programming any category of architecture. It relieves programmers of the restrictions of an MCU architecture with an outright minimal register model as well as of the complexity of a high-performance processor. While in the majority of applications an above-100-MHz Arm or MIPS will be programmed exclusively in high-level languages, there are some occasions when one would prefer not to rely on a compiler, but rather to program down to the metal.
Typical examples are time-critical device drivers or innermost loops of really performance-hungry applications. With respect to ease of assembler programming, RISC architectures are somewhat infamous. How do you keep track of variables in 16 or 32 registers? So, it could be wise to define a comprehensible small virtual machine with only a few registers and to create a CISC-like programming environment by writing appropriate macros. Besides ease of comprehending, there is another reason to keep the virtual machine small and simple: It should not use up all the processor’s resources (like the complete register file). This will allow resorting to conventional programming, whenever necessary.
It is even possible to collaborate with programs compiled from high-level languages. Compilers impose typical restrictions, for example, which registers the application programmer should not touch. They should be observed carefully. Nevertheless, our own virtual machine has been designed with disregard for this wisdom, because it has been thought of as a purely experimental project.
Why AVR architecture?
AVR MCUs are ubiquitous and inexpensive. The instruction set is comparatively well suited to learning the basics of machine programming. After a few hours, you can have positive first experiences. In other words, it is highly probable that your own first programs will run. AVR is neither as complicated nor as difficult to program as a 32-bit architecture, nor as minimalist as alternative 8-bit architectures. Instead, the AVR architecture is somewhat similar to the more advanced RISC architectures. That the AVR is an 8-bit architecture is not that important.
Any arbitrarily small universal processor can execute any algorithm, provided the storage capacity is sufficient and execution time does not matter. “Turing-completeness” is the technical term describing this fact. For most educational and fun projects, problem complexity and size are rather modest. We talk about projects that a single person can tackle within a few hours to a few weeks. For this purpose, 8-bit processing with a clock speed between 4 and 32 MHz should be good enough, especially given that an AVR requires only a single clock cycle for most instructions. Keep in mind that the computers aboard the Apollo spacecraft had less speed and memory capacity than even one of the more advanced 8-bit MCUs of today.
The AVR architecture is noticeably more advanced than other 8-bit architectures. Nevertheless, it has some grievous restrictions of its own. But which of those restrictions are severe and which are merely nuisances? Some conspicuous restrictions are listed in Table 1, the most severe listed first. The degree of severity has been judged according to programming experience and in comparison to other well-renowned architectures.
Our virtual machine has three working registers A, B, C and three address registers X, Y, Z (Figure 3). These registers consist of two bytes each, which can be accessed separately. Register A is the accumulator. Register B typically receives the second operand. Register C is mainly used for counting and auxiliary functions. The address registers X, Y, Z belong to the basic AVR architecture. Additionally, the macros may access a general working register TEMP, the registers R0 and R1, the stack pointer (SP) and the status register (SREG). The remaining registers of the AVR register file are freely available. The comprehensive documentation, individually describing each macro, is quite voluminous. With that in mind, we will limit the description to an overview (Table 2) and supplement it with some details and examples. The documentation and the source code are available for download. Links to them can be found on Circuit Cellar’s article materials webpage.
An assembler is basically a program that translates character strings—like mnemonics and labels—into bit patterns, like machine instructions, addresses and other memory content. There are so-called meta-assemblers able to create machine-specific assemblers for arbitrarily complicated architectures. Besides cost, the downside is that you have to set up all the translation tables for yourself. Therefore, we will be content with the native AVR assembler.
Our starting point is the assembler 2 (AVRASM2) of the AVR development chain. It’s not overly complicated and it’s easy to understand. The assembler is built around certain keywords, denoting instructions, registers and built-in functions. Additional mnemonics are defined in device definition (.INC) files, which belong to the particular microcontroller devices. Built-in keywords and pre-defined mnemonics cannot be used as user-defined symbols (in other words, as labels or names in .def, .equ or .set directives).
The .macro directive denotes the start of a macro. The .endmacro directive terminates the macro body. The parameters are denoted by @0, @1 and so on. The AVR assembler 2 does not limit the number of parameters. When invoking a macro, the parameters are passed as mnemonics, labels, numeric values or character strings. A macro definition may invoke other macros. How deep macros may be nested is not specified. However, Microchip advises not to overuse this feature. According to practical experience, a nesting depth of three should work.
The application programmer sees a macro as a new instruction mnemonic. Some macros have no parameters at all. Some have one parameter, some two or more. We have to consider quite a few conventions and restrictions: the built-in keywords, the predefined mnemonics and the capabilities and peculiarities of the assembler’s preprocessor. Above all, the assembler does not support the overloading of built-in keywords and predefined mnemonics. As a result, we must circumvent all those mnemonics and create our own. So, for example, we will call a literal what in AVR terms is dubbed an immediate.
However, each word or abbreviation that we turn into a macro mnemonic can no longer be used freely as a label or another mnemonic in the application program. So we should not be too generous. A particular problem is how to encode different variants of a certain operation. Let us contemplate, for example, how we could name a macro to load a 16-bit register (A, B, C, X, Y, Z).
On one hand, we could write macros like LDA label, LDB label and so on, yielding six different macros. On the other hand, we could define one single macro and pass the register as a parameter: LDW A, label; LDW B, label and so on.
In the second variant, the macro body would be considerably more complicated, and we would have to define the mnemonics A, B and so on by .equ statements. The problem is that this approach fails when it comes to X, Y and Z. Those letters are reserved keywords, built into the assembler. Although we can pass a register name to a macro, the preprocessor will not accept register names in conditional statements. So, we had to cram all the different variants into the macro mnemonic, making it occasionally somewhat unwieldy. Nevertheless, this rule is not without exceptions, and there are alternatives. For more details, refer to the documentation.
Memory and I/O addressing
Figure 4 depicts the memory and I/O addressing of different AVR series. All I/O registers are accessible via data memory (SRAM) addresses. A maximum of 64 I/O registers can be addressed in I/O instructions. Single-bit accesses are supported for the first 32 I/O addresses. In the Xmega, the I/O addresses are equal to the data memory addresses. In the ATmega or ATtiny, the data memory address of an I/O register is higher than its I/O address by 32 (20H). This offset must be declared (by .equ statements) or programmed in the application program.
If the address value is less than 40H or (for single-bit access) less than 20H, it is considered an I/O address and access is performed with I/O instructions. Otherwise, it is a data memory (SRAM) address. AVR I/O instructions support only direct addressing. The address parameter is an immediate. If the programmer wants to address an I/O device with an address parameter held in a register (indirect addressing), the device must be addressed with its data memory (SRAM) addresses. Example:
    OUT udr, r16              ; Output to the UDR register (ATmega16) via its I/O address
    LDI r16, 0x55             ; Output of an immediate value to the register UDR
    STS udr + 0x20, r16       ; via its data memory (SRAM) address
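The address-range rule stated above can be written down as a small decision function. The following C sketch only models the decision a macro has to make at assembly time; the names access_t, byte_access, bit_access and io_to_data_adrs are made up for illustration and are not taken from the macro package.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ACCESS_IO, ACCESS_MEMORY } access_t;

/* Byte access: general addresses below 40H are treated as I/O addresses. */
static access_t byte_access(uint16_t general_adrs)
{
    return general_adrs < 0x40 ? ACCESS_IO : ACCESS_MEMORY;
}

/* Single-bit access: only general addresses below 20H are treated as I/O. */
static access_t bit_access(uint16_t general_adrs)
{
    return general_adrs < 0x20 ? ACCESS_IO : ACCESS_MEMORY;
}

/* ATmega/ATtiny: data memory address of an I/O register = I/O address + 20H. */
static uint16_t io_to_data_adrs(uint8_t io_adrs)
{
    assert(io_adrs < 0x40);   /* only 64 I/O addresses exist */
    return (uint16_t)(io_adrs + 0x20);
}
```

For instance, an I/O register at I/O address 0CH would be reached with I/O instructions for byte and single-bit accesses, and at data memory address 2CH when indirect addressing is required.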
When 16-bit registers in I/O units are accessed, the byte order matters. In ATmega or ATtiny versions of the AVR MCU, write into the high-order byte first, read from the low-order byte first. In the Xmega version of the AVR MCU, always access the low-order byte first. When writing into the data memory (SRAM), the byte order is not significant. Since AVR is a little-endian architecture, it is quite natural to access the low-order byte first.
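The byte-order rules above can be condensed into a lookup. The C sketch below is a model for illustration only; family_t, dir_t and io_word_sequence are hypothetical names, and the function merely lists the two byte addresses in the order in which they would have to be accessed, assuming the low-order byte sits at the lower address.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { FAMILY_MEGA_TINY, FAMILY_XMEGA } family_t;
typedef enum { DIR_READ, DIR_WRITE } dir_t;

/* Fill seq[0] and seq[1] with the byte addresses of a 16-bit I/O register
   in the order they must be accessed. adrs is the address of the low byte. */
static void io_word_sequence(family_t family, dir_t dir,
                             uint16_t adrs, uint16_t seq[2])
{
    assert(seq);
    if (family == FAMILY_MEGA_TINY && dir == DIR_WRITE) {
        seq[0] = (uint16_t)(adrs + 1);   /* ATmega/ATtiny write: high byte first */
        seq[1] = adrs;
    } else {
        seq[0] = adrs;                   /* all other cases: low byte first */
        seq[1] = (uint16_t)(adrs + 1);
    }
}
```

Only one of the four combinations deviates from the little-endian order, which is why a single definition at the beginning of the application program can be enough to steer the macros.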
When coding macro bodies, we have to consider the byte order of 16-bit accesses to peripheral units and the type of jump and call instructions the particular device supports. This is indicated by definitions to be inserted at the beginning of the main (application) program. For details, see the documentation on the Internet.
If a macro parameter is a general address, the address range determines which type of access instructions (memory or I/O) will be used. Furthermore, 16-bit accesses will be executed in the proper sequence (low-order or high-order byte first). If a macro parameter is a data memory (SRAM) address, all accesses will be executed by memory-related instructions, and the low-order byte will be accessed first.
Register addresses must adhere to the assembly language conventions. The register names r0 to r31, x, y and z are assembler keywords; the others have been declared by .equ statements. The names xl, xh, yl, yh, zl and zh are declared in the device definition (.INC) files. The particular registers of our virtual machine are declared in the macro source files. The assembler will properly insert register names passed as parameters. However, the preprocessor will neither accept register names in conditional statements nor support 16-bit registers made of two consecutive 8-bit registers, like r1 and r0. As a consequence, we must be content with a somewhat cumbersome notation, or we have to declare separate macros for each register pair.
The first example is a macro to load a register pair with the content of an addressed word (in an I/O unit or in the SRAM). The macro is called GWLD. It has three parameters: two registers and the general address. The high-order and low-order bytes need separate register parameters. Example:
GWLD r3, r2, cntl0
This macro will load the registers r3 and r2 with the content of a 16-bit count register within an Xmega MCU. A second example is a macro to add a 16-bit literal to the content of a register pair. You cannot declare a macro addlit register_pair, literal. To avoid the cumbersome syntax with two register parameters, we decided to declare individual macros for each register pair of our virtual machine. So, we have macros addlita, addlitb and so on. For example, addlita 391 adds the value 391 to the content of the 16-bit register A.
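To make the arithmetic behind a macro like addlita concrete, the following C sketch adds a 16-bit literal to a register pair using only 8-bit operations and a carry, which is what an 8-bit machine has to do. It is a model, not the macro body from the article's source code; regpair_t and add_literal16 are hypothetical names. On a real AVR, which has no add-immediate instruction for arbitrary registers, such a sequence is commonly coded by subtracting the negated literal with SUBI and SBCI.

```c
#include <assert.h>
#include <stdint.h>

/* A 16-bit virtual register made of two 8-bit registers. */
typedef struct {
    uint8_t lo;
    uint8_t hi;
} regpair_t;

/* Add a 16-bit literal to a register pair, byte by byte.
   Returns the carry out of the high-order byte (0 or 1). */
static int add_literal16(regpair_t *r, uint16_t literal)
{
    assert(r);
    /* Step 1: add the low-order bytes and note the carry. */
    uint16_t low_sum = (uint16_t)(r->lo + (literal & 0xFF));
    uint8_t  carry   = low_sum > 0xFF;

    /* Step 2: add the high-order bytes plus the carry from step 1. */
    uint16_t high_sum = (uint16_t)(r->hi + (literal >> 8) + carry);

    r->lo = (uint8_t)low_sum;
    r->hi = (uint8_t)high_sum;
    return high_sum > 0xFF;
}
```

With register A holding 00FFH, adding the literal 391 (0187H) yields 86H in the low-order byte with a carry, and 02H in the high-order byte, that is 0286H or 646.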
However, there are alternative solutions. First, we could define 16 macros gldw0, gldw2 and so on, each loading a register pair. Second, we could introduce new register names r0w, r2w and so on. These conventions would allow writing, for example, gldw r0w, cntl0 or addlitw aw, 391. For details, refer to the documentation on the Internet.
Let’s look at some macro types:
Basic transports: Transport macros are basic move operations dealing with general addresses (Table 3). They take care of the address space (data memory or I/O) and of the byte order (low-order or high-order byte first).
Single-bit operations: The bit is the most basic data structure. There are control bits and output signals to be set, and status bits or input signals to be sensed. Bit addressing should be eased in the registers, the peripheral units, and the data memory (SRAM). Macros are provided to set, clear or toggle a selected bit and to move a selected bit into one of the flags ZF or CF and vice versa. There are different ways to address the byte containing the bit (Table 4 and Table 5).
Bit addressing in unified bit-fields (unified_bitfield_adrs):
When dealing with bits, the programmer must always know the address of the byte the particular bit belongs to. Let’s say, for example, we want to write routines supporting serial communication. While doing so, we come across a bit indicating that the USART has transmitted a byte. When programming an ATmega16, we must know that this bit is called TXC and resides in the register UCSRA at bit position 6. When programming an ATxmega64A4U and using the USART 0 on port E, the bit is called TXCF and resides in the register USARTE0_STATUS at bit position 6.
As a remedy to such complexities, the concept of the unified bit-field address has been introduced. This type of addressing enables you to refer to individual bits, located anywhere in the general address space, by a single name, without having to worry about the byte address. The unified bit-field address is a general address extended by the bit address in the byte. To define such a bit, the general address is to be shifted one byte to the left (address << 8).
Here are some definition examples. The bits to be defined are called sercom_txc, strobe and slave_buffer_empty:
1) sercom_txc = bit 6 in the register UCSRA (for example, ATmega16):
.equ sercom_txc = (ucsra << 8) + 6
2) sercom_txc = bit 6 in the register USARTE0_STATUS (for example, ATxmega64A4U):
.equ sercom_txc = (usarte0_status << 8) + 6
3) strobe = bit 6 in port D:
.equ strobe = (portd << 8) + 6
4) slave_buffer_empty = bit 3 in the byte SERIAL_CHECKS (SRAM)
.equ slave_buffer_empty = (serial_checks << 8) + 3
For example, if you want to set the bit slave_buffer_empty, you simply write:
Without this provision, you would have to program something like:
    LDS r16, serial_checks    ; Fetch the byte containing the bit
    ORI r16, 8                ; Bit 3 will be set in r16
    STS serial_checks, r16    ; Write the byte back
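The encoding of a unified bit-field address, (address << 8) + bit, and its decoding can be modeled in C as follows. This is a sketch for illustration: UNIFIED_BIT and the ubf_ helper names are invented here, and the memory is a plain array standing in for the general address space.

```c
#include <assert.h>
#include <stdint.h>

/* Unified bit-field address: the general address shifted left by one byte,
   with the bit position (0 to 7) in the low-order byte. */
#define UNIFIED_BIT(adrs, bit) (((uint32_t)(adrs) << 8) + (uint32_t)(bit))

/* Recover the byte address from a unified bit-field address. */
static uint16_t ubf_byte_adrs(uint32_t unified)
{
    return (uint16_t)(unified >> 8);
}

/* Recover the bit position within the byte. */
static uint8_t ubf_bit(uint32_t unified)
{
    return (uint8_t)(unified & 7);
}

/* Bit mask for the selected bit, for example 8 for bit 3. */
static uint8_t ubf_mask(uint32_t unified)
{
    return (uint8_t)(1u << ubf_bit(unified));
}

/* Set or clear the selected bit in a memory image. */
static void ubf_set(uint8_t *mem, uint32_t unified)
{
    assert(mem);
    mem[ubf_byte_adrs(unified)] |= ubf_mask(unified);
}

static void ubf_clear(uint8_t *mem, uint32_t unified)
{
    assert(mem);
    mem[ubf_byte_adrs(unified)] &= (uint8_t)~ubf_mask(unified);
}
```

A single symbol thus carries everything a macro needs: the byte address selects the access instructions, and the bit position selects the mask or the bit operand.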
Macros of other functional groups support unified bit-fields, too. For example, to call a subroutine if the bit slave_buffer_empty is set, you simply write
LDCALLU1 slave_buffer_empty, slave_buffer_exception
Branching, subroutine call and return: The AVR’s conditional branch instructions have only a 7-bit address field and, therefore, a short branch distance. The limits (PC – 63 to PC + 64) are related to the address of the current instruction. If the branch target is further away, you have to program around it, for example, by using a branch instruction to skip over an unconditional jump instruction. Appropriate macros support branching within the complete address space. Conditional execution has been provided by jump, skip, call and return macros (Table 6).
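The reach of a conditional branch can be checked with a one-line calculation. The C sketch below models the decision a branch macro has to make; short_branch_reaches and branch_words are hypothetical names, addresses are word addresses, and the long form is assumed to be an inverted branch followed by a two-word JMP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The 7-bit offset k of a conditional branch is signed: -64 to +63.
   The branch executes PC <- PC + k + 1, so targets from PC - 63
   to PC + 64 are reachable. */
static bool short_branch_reaches(uint32_t pc, uint32_t target)
{
    assert(pc <= 0x3FFFFF && target <= 0x3FFFFF);  /* assumed 22-bit PC */
    int32_t k = (int32_t)target - (int32_t)pc - 1;
    return k >= -64 && k <= 63;
}

/* Program words needed: 1 for a plain conditional branch, otherwise
   1 (inverted branch that skips the jump) + 2 (JMP to the target). */
static unsigned branch_words(uint32_t pc, uint32_t target)
{
    return short_branch_reaches(pc, target) ? 1u : 3u;
}
```

For a branch located at word address 100, the targets 37 through 164 can be reached directly; anything beyond needs the longer sequence.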
Summary and suggestions
When you want to (or have to) program in assembler, and your primary goal is getting bulky application software up and running, a well-proven approach is to stay within the established ecosystem and to create some kind of runtime environment by writing appropriate macros and subroutines. It is wise to begin with a comparatively straightforward architecture and inexpensive hardware, like starter kits and small MCU modules. To demonstrate the approach, we chose here to examine the AVR architecture. It goes without saying that the principles could easily be applied both to more advanced and to rather minimalist architectures, and to 32-bit or 64-bit computing.
Detailed article references and additional resources, including the references marked in the article, can be found on Circuit Cellar’s article materials webpage.
Microchip Technology | www.microchip.com
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • JULY 2019 #348
Wolfgang Matthes has developed peripheral subsystems for mainframe computers and conducted research related to special-purpose and universal computer architectures for more than 20 years. He has also taught Microcontroller Design, Computer Architecture and Electronics (both digital and analog) at the University of Applied Sciences in Dortmund, Germany, since 1992. Wolfgang’s research interests include advanced computer architecture and embedded systems design. He has filed over 50 patent applications and written seven books. (www.realcomputerprojects.dev and