
Extending Machine Instructions

Written by Wolfgang Matthes

How to Turn a Processor Temporarily into a Microprogrammed Control Unit

The principles of microprogramming can be useful for projects with embedded field programmable gate arrays, and there are advantages to microprogrammed control units. In this article, Wolfgang Matthes discusses how to turn a processor into a microprogrammed control unit.

  • How can I turn a processor into a microprogrammed control unit?
  • What are the advantages of a microprogrammed control unit?
  • Why should you use microprogramming for eFPGAs?
  • Zilog Z80
  • Rockwell 6502
  • Motorola 6800
  • Intel 8051
  • MicroBlaze soft core
  • NIOS soft core

When contemplating projects with embedded field programmable gate arrays (eFPGAs), we can never have too many design ideas to choose from. With this thought in mind, I previously proposed reviving the principles of microprogramming [1,2,3].

Microprogrammed control units have characteristic advantages. All control signals may be energized at once and all condition signals queried immediately. Programming is brought down to the register-transfer level and the individual machine cycle, thus alleviating hardware design, debugging, and updating. After all, it’s a tried and tested technique we have only to transfer to the FPGAs. The crucial problem is writing the microprograms. If there are not too many microinstructions, the problem should be solvable. However, today’s application tasks require so much software that we simply can’t write it all in assembly language anymore.

As a result, most developers stick to widespread processor architectures and development environments. If the performance of the processor is inadequate and multiple processors are not a viable solution, they consider supplementing the processor core with dedicated accelerators (hardware-software co-design).

THE PRINCIPAL DESIGN IDEA

In a typical project, most of the software runs fast enough, with only a few pieces of code for which the processor’s speed may be too low. The principal design idea is to extend the processor’s instructions outside the processor core. The processor core itself remains unchanged inside. All programs that have nothing to do with the extensions remain as they are.

A control storage is added to the program memory. It is addressed the same way as the processor’s memory (Figure 1), but the extension does not act inside the processor core. This way, the processor temporarily becomes a microprogrammed control unit. Extended instructions become microinstructions outside the processor core, acting on dedicated I/O circuitry or on the data paths passed through while fetching the instruction or reading or writing data.

Some examples are shown in Table 1. Extended and conventional instructions may be intermixed freely. In contrast to a microprogrammed processor based on its own architecture, the extensions do not prevent using existing software and development systems. Writing programs that exploit the microinstruction functions corresponds to programming hardware-related device drivers. For references, sources, and more details, refer to the accompanying material on Circuit Cellar’s Article Materials and Resources webpage, and the author’s project website.

A conventional (non-extended) processor core can access I/O or accelerating circuitry only by I/O instructions. Everything more complex than a simple input or output operation has to be programmed. Typical instruction sequences follow the pattern “read in – evaluate – branch” or “read in – calculate new values – output” and so on. In contrast, our extensions allow machine instructions to interact directly with the inputs and outputs, so that operations traditionally requiring multiple instructions can be programmed by a single extended instruction.
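To make the contrast concrete, here is a minimal Python sketch of the two approaches. It models no specific instruction set; the bit positions and the condition line are illustrative assumptions. The point is that the conventional pattern needs three instruction-level steps, while the extension queries the condition line directly in hardware during the fetch.

```python
# Conventional "read in - evaluate - branch" versus one extended instruction.
# The port contents and bit positions below are made-up examples.

def conventional_skip(io_read, bit):
    value = io_read()               # IN   : read the I/O port
    taken = (value >> bit) & 1      # TEST : evaluate the condition
    return bool(taken)              # Bcc  : branch on the result

def extended_skip(condition_line):
    # The extension samples the condition line during the instruction
    # fetch; no separate IN and TEST instructions are needed.
    return bool(condition_line)

same = conventional_skip(lambda: 0b0100, 2) == extended_skip(1)
```

Both paths yield the same decision; the extended one simply collapses the software sequence into a single fetched instruction.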

For example, if certain conditions are met, the processor receives a NOP instead of the instruction read out of the memory, thus skipping the instruction conditionally. An instruction extension indicates, for example, that data destined to a register of the processor core is also to be entered into a register of the accelerator, or that a register is not loaded with the addressed memory content, but rather with the accelerator’s result. Thus, the processor core (IP core) and the application-specific circuitry, like accelerators or coprocessors, can interact more closely. Figure 2 shows some more details of an extended system.

Figure 1
Our design idea is introduced by comparing a conventional microprocessor-based system with a system supporting extended instructions.
Table 1
Examples of extended functions
Figure 2
The extended system in more detail

When fetching instructions, the control storage is addressed in exactly the same way as the conventional memory. The extension—basically an additional microinstruction—is loaded into the control register (Control Storage Data Register, CSDR). The extension control circuitry energizes control signals to load data or addresses into output registers and injects input data or alternative opcodes into the processor’s read data path. The microinstruction’s effects differ in how they affect the machine instructions fetched at the same time (Tables 2 and 3).
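The parallel fetch can be sketched in a few lines of Python. The addresses, opcode bytes, and extension bits below are invented for illustration; the essential property is that both memories share one address, and the extension word lands in the CSDR while the opcode goes to the processor core.

```python
# Toy model of the parallel fetch from program memory and control storage.
program_memory  = {0x0100: 0x3E, 0x0101: 0x42}  # e.g. a two-byte load
control_storage = {0x0100: 0x80, 0x0101: 0x00}  # per-address extension bits

def fetch(addr):
    opcode = program_memory[addr]    # goes to the processor core
    csdr   = control_storage[addr]   # Control Storage Data Register
    return opcode, csdr
```

A zero extension word (as at address 0x0101) leaves the instruction entirely conventional.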

Before going into more detail, I will demonstrate the merits of our design idea with two examples of frequently encountered problems. The first is attaching an accelerator and the second is supporting hardware breakpoints.

HOW TO ATTACH AN ACCELERATOR

Traditionally, coprocessors or accelerators are controlled by dedicated instructions. Unless the support is part of the machine architecture (as with some coprocessors), typically move or I/O instructions supply the parameters, initiate the operations, and fetch the results (Figure 3).

The accelerator is an autonomous device with program-accessible registers at the inputs and outputs. First, the program running in the processor loads the operands into the input registers and starts the operation to be executed. The processor waits for the accelerator to finish. Then it fetches the results. One parameter of an I/O instruction (IN, OUT) addresses a processor register and the other is an I/O address selecting a register in the accelerator.
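The conventional protocol can be modeled at the register level as follows. This is an illustrative sketch only: the register names, I/O addresses, and the stand-in multiply operation are all hypothetical, and the toy accelerator finishes instantly, so the wait loop never iterates here.

```python
# Register-level toy of the conventional protocol of Figure 3.
class Accelerator:
    def __init__(self):
        self.in_reg = {0: 0, 1: 0}   # program-accessible input registers
        self.busy = False
        self.result = None
    def out(self, io_addr, value):   # OUT: write an input register
        self.in_reg[io_addr] = value
    def start(self):                 # OUT to a control address
        self.result = self.in_reg[0] * self.in_reg[1]  # stand-in operation
        self.busy = False
    def inp(self):                   # IN: fetch the result
        return self.result

acc = Accelerator()
acc.out(0, 6)       # OUT: first operand
acc.out(1, 7)       # OUT: second operand
acc.start()         # OUT: initiate the operation
while acc.busy:     # processor waits for completion
    pass
res = acc.inp()     # IN: fetch the result
```

Every step above costs at least one machine instruction—this is the software overhead the extensions are meant to eliminate.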

In the alternative solution shown in Figure 4, machine instructions are extended outside the processor, so that they act like microinstructions.

Table 2
Principal extended operations at a glance
Table 3
Some extended operations in more detail.
Figure 3
A conventional accelerator is operated as a kind of I/O device, thus incurring considerable software overhead. (On attaching accelerators, see References [9,10,11,12].)
Figure 4
Here, the accelerator is operated by extensions. They accompany instructions that provide the operands, select and initiate the operation, and fetch the result.

The accelerator is controlled by the extensions. The processor reads the operands from memory and loads them into its registers. At the same time, they are tapped to be loaded into the accelerator’s registers. In this example, the operation will start immediately after the last parameter has been entered. Until the result is available, the processor will be held in a wait state. The last of the extended instructions will fetch the result and load it into a processor register.

The software overhead typical of such additional circuitry is thereby eliminated. The accelerator behaves as if it were an inherent part of the processor core instead of some kind of an afterthought.
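A sketch of this extended protocol, with hypothetical register names and addition as the stand-in operation: ordinary LOAD instructions are tapped into the accelerator as a sideband effect, the operation starts when the last operand arrives, and the final extended LOAD injects the result instead of the addressed memory content.

```python
# Sketch of the extended protocol of Figure 4 (names are illustrative).
class TappedAccelerator:
    def __init__(self, n_operands=2):
        self.n = n_operands
        self.regs = []
        self.result = None
    def tap(self, data):               # sideband copy of the read data
        self.regs.append(data)
        if len(self.regs) == self.n:   # starts after the last parameter
            self.result = sum(self.regs)

acc = TappedAccelerator()
r1 = 6; acc.tap(r1)    # LOAD R1,(x) | operand tapped into accelerator
r2 = 7; acc.tap(r2)    # LOAD R2,(y) | tapped; operation starts
r3 = acc.result        # LOAD R3,(z) | result injected instead of memory
```

Three loads the program needed anyway now also run the accelerator; no dedicated I/O instructions remain.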

HOW TO SUPPORT AN UNLIMITED NUMBER OF BREAKPOINTS

Additional memory bits support an unlimited number of breakpoints (Figure 5). During normal operation, the additional memory may serve as an error-checking memory (parity or ECC).

Figure 5 Additional memory may support an unlimited number of breakpoints. During normal operation, it serves as error-checking memory [5].

To establish debugging mode, clear the control register and read and rewrite the whole memory content. To set a breakpoint, set INJECT and INJECT HI, and read from and write to the desired address. Clear the INJECT bits, set ENABLE TRACE and the desired trace conditions.

To return to normal operation, clear ENABLE TRACE, set ENABLE PARITY and ENABLE INJECT, and read and rewrite the whole memory content. Then set ENABLE ERROR SIGNALIZATION.

A set bit in the additional memory causes an address-compare event that triggers an interrupt. If the entire memory is extended in this way, you can set any number of breakpoints, up to single-stepping through the instructions. In contrast, the typical built-in breakpoint provisions of microcontrollers support only a few breakpoints (for example, four).
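The mechanism reduces to one extra bit per memory word, checked on every access. The sketch below assumes debugging mode is already established (the read-and-rewrite pass described above); addresses and opcodes are made up, and a Python exception stands in for the interrupt.

```python
# One extra bit per word; in debugging mode a set bit raises an
# address-compare event (modeled here as an exception) on access.
break_bits = {}                      # address -> breakpoint bit

def set_breakpoint(addr):            # read-and-rewrite with INJECT set
    break_bits[addr] = 1

def fetch_traced(addr, opcode):
    if break_bits.get(addr):
        raise InterruptedError(hex(addr))   # address-compare interrupt
    return opcode

set_breakpoint(0x0200)
```

Because every word carries its own bit, the number of simultaneously set breakpoints is bounded only by the memory size.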

SUITABLE PROCESSOR CORES

What we have in mind are mid-range processor cores. The processor must execute the instructions as they are fetched out of the memory. Processor cores with internal instruction buffering (such as the venerable 8086) or built-in, inaccessible instruction caches, with deep pipelines and speculative instruction execution, are out of the question. Anyway, such machines are not particularly well-suited when it comes to interacting with the outside world.

The design ideas presented here have been proven with conventional 8-bit microprocessors. The Zilog Z80, Rockwell 6502, Motorola 6800, and Intel 8051 are historical examples. Examples of Z80-based solutions are available as references [4,5,6,7,8]. It should not be difficult to adapt soft cores like MicroBlaze or NIOS appropriately. With ARM, MIPS, RISC-V, and similar architectures, we should be able to implement our proposals, provided that we choose a suitable processor core. Maybe even high-performance cores could be adapted by appropriately loading page attribute tables, memory type range registers, and the like. In this respect, programs using our extensions can be likened to device drivers controlling the physical I/O circuitry.

ACCESSING THE MEMORY

Everything should remain straightforward. We don’t want to interfere with the addressing. The addresses are not modified and the extensions do not contain addresses to be generated by the compiler or assembler.

We will only tap the data paths (and occasionally, the address paths, too). A few of our extensions may require inserting wait states. Others require writing into the memory to be inhibited and ignoring particular error signals (such as signals indicating an illegal access). It goes without saying that virtual memory will not be implemented here.

When only data or address paths are tapped, nothing changes from the processor’s point of view.

When information is injected into the data path, our extensions will appear only as a somewhat slower memory. If required, a wait state is to be inserted. Nevertheless, the extended instruction will be faster than a sequence of instructions that otherwise would be required to produce the same effect.

INSTRUCTION EXAMPLES

Examples of extended instructions are given in Figures 4 to 13. We use a simplified assembler notation based on similarly simplified RISC-type instruction formats. The processor’s native instruction is followed by a vertical line and the appropriate mnemonics of the extension. It is easy to see that such extended instructions could be entered into the source program via appropriately defined macros. (Ambitious programmers are invited to write compiler add-ons, too.)

ADDITIONAL OR SIDEBAND EFFECTS

The instructions the processor core executes are neither modified nor tapped. The additional effects are caused by the extensions alone (Figures 6 and 7, Table 4). This kind of problem—to load some registers, save and restore register contents, toggle signals, inhibit interrupts temporarily, and the like—needs to be solved frequently. Conventionally, it is done via general-purpose I/O (GPIO) signals. There is a reason that even the most advanced I/O circuits and processors with integrated I/O provide some GPIOs. To assert and de-assert them requires particular I/O instructions. When implementing our design ideas, however, it would be a sideband effect.
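The sideband principle can be sketched in a few lines. The signal name and the native operation below are illustrative assumptions; what matters is that the machine instruction itself executes unchanged while the extension toggles the signal with no additional instruction.

```python
# Sideband effect per Figure 6: the native instruction runs unchanged
# while the accompanying extension independently toggles a signal.
signals = {"gpio0": 0}

def execute(native_op, a, b, toggle=None):
    if toggle is not None:
        signals[toggle] ^= 1       # sideband effect, costs no instruction
    return native_op(a, b)         # the machine instruction, untouched

total = execute(lambda x, y: x + y, 2, 3, toggle="gpio0")
```

Conventionally, the toggle would require its own I/O instruction before or after the add.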

Figure 6 Machine instructions may be accompanied by microinstructions acting independently.
Figure 7 Microinstructions emit output data. OPC = the extension’s opcode; SELECT = output register selection; EMIT = output the literal. In the instruction example, a RISC-type ADD instruction is accompanied by loading the literal 0x1234 into the output register out_reg2.
Table 4
Examples of additional or sideband effects encoded in the accompanying microinstruction.
CONCURRENT OUTPUT

Data moved between processor and memory is diverted as output data. It may come from the memory or the processor core. Memory contents are tapped during read cycles, immediate values (literals) during instruction-fetch cycles, and data out of the processor during write cycles (Figures 8 and 9).

It is even possible to output the memory address (Figure 10). In a read cycle, the address will be tapped, and the read-in data ignored. In a write cycle, address and data may both be output, provided that the writing into the memory can be inhibited. To mention an historical example, an 8-bit processor may output 16 or even 24 bits at once in this way. A 32-bit processor could output up to 64 bits.
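As a sketch of the widths involved, assume an 8-bit processor with a 16-bit address bus. A DADOW-style extension latches address and write data into output registers within one write cycle, so 24 bits become available simultaneously (the packing below is purely illustrative).

```python
# Illustrative widths: 16-bit address, 8-bit data, output together.
def dadow(address, data):
    """Data and Address Output while Writing: 24 bits in one cycle."""
    assert 0 <= address <= 0xFFFF and 0 <= data <= 0xFF
    return (address << 8) | data   # 24 concurrently presented output bits

packed = dadow(0xBEEF, 0xA5)
```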

Figure 8 Concurrent output (1). The microinstruction taps the data, and the processor reads out from or writes into the memory, respectively. OPC = the extension’s opcode; SELECT = output register selection; DAOR = Data Output while Reading; DAOW = Data Output while Writing. In instruction example (a), a load instruction causes the addressed memory content to be loaded into the output register out_reg1 too. In instruction example (b), a store instruction causes the data to be stored to be loaded into the output register out_reg2 too.
Figure 9 Concurrent output (2). The microinstruction taps the immediate value (literal) contained in the instruction. Because the literal appears during instruction fetch, it must be buffered. OPC = the extension’s opcode; SELECT = output register selection; DAOI = Data Output Immediate. In the instruction example, a load immediate instruction causes the immediate value (literal) to be loaded into the output register out_reg1 too.
Figure 10
Concurrent output (3). The microinstruction taps the memory address. OPC = the extension’s opcode; SELECT = output register selection; ADOR = Address Output while Reading; ADOW = Address Output while Writing; DADOW = Data and Address Output while Writing. In both instruction examples (a) and (b), a load or store instruction causes the memory address to be loaded into an output register. In the instruction example (c), a store instruction causes the address and the data to be stored to be loaded into two output registers, addressed by out_reg3.

INPUT BY FEEDING IN DATA

The extensions described above could be characterized as passive or sideband operations, because the interface between memory and processor core is not touched at all or only tapped. To support input operations, however, we must inject data from outside into the data paths. In our block diagrams (Figures 11 to 13), injecting input data is illustrated by data selectors or 2:1 multiplexers—exactly how it must be done in typical FPGA implementations, where tri-state buses cannot be implemented. In a legacy implementation based on a conventional microprocessor, tri-state drivers would be used instead.

Input data are injected during a read cycle if they will be delivered to the processor, or during a write cycle if they will be written into the memory (Figures 11 and 12). Observe that data coming from outside must be synchronized before being fed into a data path.
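Functionally, each injection point is just a 2:1 multiplexer controlled by the extension. The sketch below models the read-path case (DAIR); the register names follow the figure captions, and the data values are made up.

```python
# 2:1 data selector on the read data path, as in the FPGA implementation.
def read_path_mux(dair, memory_data, in_reg):
    """DAIR asserted: the input register replaces the memory content."""
    return in_reg if dair else memory_data
```

The write-path case (DAIW) is the same selector placed on the data path toward the memory.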

Figure 11
Input data is read into the processor core. OPC = the extension’s opcode; SELECT = output register selection; DAIR = Data Input while Reading. In the instruction example, during the data access of a load instruction, the content of the input register in_reg1 will be injected into the data path, and thus loaded into the destination register R1 instead of the addressed memory content.

Figure 12
Input data are written into the memory. OPC = the extension’s opcode; SELECT = output register selection; DAIW = Data Input while Writing. In the instruction example, during the data access of a write instruction, the content of the input register in_reg1 will be injected into the data path and thus written into the memory, instead of the processor’s register content.

Figure 13
Instructions are executed conditionally. If the condition is not satisfied, the processor core will receive a NOP instead of the instruction and thus skip it. OPC = the extension’s opcode; SAMPLE = capture the conditions; CNDSEL = condition selection; CPL = complement (invert) the condition. In the instruction example, a jump to an error-handling routine will be skipped if no parity error has been detected.

EXECUTE INSTRUCTIONS CONDITIONALLY

In computer architecture, the principle we want to implement here is known as “predication.” Instructions are addressed and fetched sequentially. Predicates decide whether a fetched instruction is executed or not. Our design idea is to intervene during instruction fetch, and let the instruction either pass unmodified or substitute it with a no-operation (NOP) instruction code (Figure 13).

Our predicates are conditions selected from the outside world. Compared to some well-known architectures, ours are not limited to the content of a predicate register or a few condition bits, but instead can select any number of predicates from arbitrary sources. A branch on condition is programmed by an unconditional jump accompanied by the appropriate extension.
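The fetch-path substitution itself is simple; here is a sketch with made-up opcode values. The CPL flag complements the selected condition, as in Figure 13.

```python
# Fetch-path predication: a failed condition turns the fetched
# instruction into a NOP before the core sees it.
NOP = 0x00   # stand-in no-operation opcode

def fetch_predicated(opcode, condition, cpl=False):
    take = (not condition) if cpl else bool(condition)
    return opcode if take else NOP
```

An unconditional jump run through this selector behaves exactly like a branch on condition.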

The details of the implementation are somewhat tricky. The extension may belong to the instruction that is to be executed conditionally (as shown in Figure 13). Because the condition cannot be selected before the beginning of the instruction fetch cycle, it could be necessary to insert a wait state.

Alternatively, the extension could accompany the previous instruction. Then we have to ensure that interrupts between these two instructions are inhibited.

If more than one instruction will be executed conditionally, the condition must be captured at the beginning of this instruction block and then kept. To this end, latching (sampling) the conditions may be controlled by a particular microinstruction bit, as shown in Figure 13.
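The sample-and-hold behavior of the SAMPLE bit can be modeled as a small latch; the clocking below is illustrative, not a description of any particular implementation.

```python
# The SAMPLE microinstruction bit latches the condition at the start
# of the block; the held value then gates all following instructions.
class ConditionLatch:
    def __init__(self):
        self.held = 0
    def clock(self, sample, condition_line):
        if sample:
            self.held = condition_line
        return self.held

latch = ConditionLatch()
first  = latch.clock(1, 1)   # start of block: capture the condition
second = latch.clock(0, 0)   # line changes later, held value does not
```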

If the processor architecture provides for predicates as part of the instruction (for example, think of the ARM architecture), it is not necessary to switch the complete data path between the instruction and a NOP. Instead, it would be sufficient to feed in predicate bits according to the desired effect (execute the instruction or not) and to let the rest of the instruction pass through.

SUMMARY AND SUGGESTIONS

What I have described here is essentially a bouquet of design ideas that may be advantageous when pursuing embedded FPGA projects.

Today, it’s an innate desire to make everything programmable. General-purpose processors, however, have fundamental performance limits. Resorting to hardware and designing accelerating circuitry is an obvious remedy. Application-specific circuitry, however, cannot be debugged and altered as easily as a software-based solution. Microprogramming could be a viable alternative to keep the application-specific circuitry programmable down to the register-transfer level.

Today, however, customers are demanding. They expect Internet access, sophisticated human-machine interfaces, large memory capacity, and even artificial intelligence. To cope with such huge amounts of software, developers have no choice other than to rely on industry-standard architectures. This is where our proposal comes in. The processor core and all programs that do not need acceleration remain as they are. The acceleration is implemented outside the processor core, but is inextricably linked to the machine instructions that are extended to some kind of microinstructions. This way, I/O ports, peripherals, and accelerators are accessible to programming as if they belonged inherently to the processor core, thus easing hardware design, debugging, and alterations.


The next logical step could be to modify the processor core itself. This way, application-specific or advanced-but-incompatible principles of operation could be introduced into the processor architecture. Nevertheless, compatibility would be retained—because nothing will change if instructions are read out of non-extended memory areas or with a zero extension. 

Addendum

RESOURCES
ARM | arm.com
Intel | www.intel.com
MicroBlaze | xilinx.com/products/design-tools/microblaze.html
MIPS | mips.com
Motorola | www.motorola.com
RISC-V International | riscv.org
Zilog | www.zilog.com

REFERENCES
[1] Matthes, Wolfgang: Microprogramming Choices Explained (Part 1). Circuit Cellar, Issue 378, January 2022, p. 26-35.
[2] Matthes, Wolfgang: Microprogramming Choices Explained (Part 2). Circuit Cellar, Issue 379, February 2022, p. 22-32.
[3] Matthes, Wolfgang: Mikroprogrammierung. Prinzipien, Architekturen, Maschinen. ISBN 978-3-8325-5234-3. Logos, 2021.

German patent applications (All patents lapsed long ago.):
[4] Mikrorechneranordnung, vorzugsweise für den Einsatz in Multimikrorechnersystemen (Microprocessor configuration, preferably for the application in multimicroprocessor systems). DE file number: DD 159 916 A1. Application number: 23096181. Application date: June 22, 1981. EP000000067982A2.
[5] Speicheranordnung mit Fehlererkennungs- und Diagnoseeigenschaften, vorzugsweise für Mikrorechner (Memory arrangement with error-detection and diagnostic properties, preferably for microcomputers). DE file number: DD 225 072 6. Application date: Nov 10, 1980. https://register.dpma.de/DPMAregister/pat/register?AKZ=DD154244
[6] Speicheranordnung mit Eingabe-/Ausgabeanschluß, vorzugsweise zum Einsatz in Multimikrorechnersystemen (Memory arrangement with input/output port, preferably for use in multi-microcomputer systems). DE file number: DD 272 021 6. Application date: Dec 28, 1984. https://register.dpma.de/DPMAregister/pat/register?AKZ=DD233435
[7] Mikrorechneranordnung mit erweiterten Steuerwirkungen (Microcomputer arrangement with extended control effects). DE file number: DD 288 148 1. Application date: Mar 21, 1986. https://register.dpma.de/DPMAregister/pat/register?AKZ=DD246858
[8] Mikrorechneranordnung mit programmgesteuertem Interfaceanschluß (Microcomputer arrangement with program-controlled interface port). DE file number: DD 288 145 7. Application date: Mar 21, 1986. https://register.dpma.de/DPMAregister/pat/register?AKZ=DD246860

Attaching accelerators:
[9] Patel, Sanjay; Hwu, Wen-mei: Accelerator Architectures. IEEE Micro, July-August 2008, p. 4-12.
[10] MicroBlaze Processor Reference Guide UG081. Xilinx, 2009.
[11] Rosinger, Hans-Peter: Connecting Customized IP to the MicroBlaze Soft Processor Using the Fast Simplex Link (FSL) Channel. Application Note XAPP529. Xilinx, 2004.
[12] Madinger, Noah: The Co-Processor Architecture: An Embedded System Architecture for Rapid Prototyping. DigiKey, 2022. https://www.digikey.com/en/articles/the-co-processor-architecture-an-embedded-system-architecture-for-rapid-prototyping

PUBLISHED IN CIRCUIT CELLAR MAGAZINE • OCTOBER 2022 #387


Wolfgang Matthes has developed peripheral subsystems for mainframe computers and conducted research related to special-purpose and universal computer architectures for more than 20 years. He has also taught Microcontroller Design, Computer Architecture and Electronics (both digital and analog) at the University of Applied Sciences in Dortmund, Germany, since 1992. Wolfgang’s research interests include advanced computer architecture and embedded systems design. He has filed over 50 patent applications and written seven books. (www.realcomputerprojects.dev and www.controllersandpcs.de/projects)

Copyright © KCK Media Corp.
All Rights Reserved
