In Part 1, Wolfgang explained the basics of implementing a microprogram control unit on an FPGA. Here, in Part 2, he shares some examples of small microprogrammed machines along with a discussion about principles that could be applied to more advanced projects.
In Part 1 of this article series (Circuit Cellar 378, January 2022), we introduced microprogramming as a well-proven principle that could become viable again, above all in FPGA-based projects. It goes without saying that it would not make sense to promote microprogrammed machines as counterparts to the prevalent microcontrollers (MCUs) and processor cores. We imagine, however, opportunities at both extreme ends. The low end can be characterized by programmable sequencers controlling application-specific circuitry or acting as a somewhat more intelligent accelerator, I/O processor or peripheral.
The high end relates to application-specific FPGA-based machines, high-performance accelerators and so on. Here, appropriately powerful microprogram control units could act much like a conductor. Compared to typical RISC cores, they would show shorter latencies and reaction times. Moreover, to allude to our previous article, they bring the first flavor of programmability down to the register transfer level (RTL) and the machine cycles, allowing us to program circuits similarly to MCUs. Besides often simplifying our application-specific circuitry, this makes us less dependent on Boolean synthesis.
Therefore, we present proposals for small microprogrammed machines as well as an outlook on principles well-suited for more advanced projects. What we describe here can only be suggestions and design ideas. To be easy to comprehend, our block diagrams show only simplified or partial circuits. Sometimes, however, it may be advantageous to combine two or more partial solutions in one circuit, thus making the design more versatile. In the block diagrams shown here, clock signals have been omitted. We assume a machine cycle of at least two clock phases, the first loading the CSDR, the second the CSAR. For details, refer to the accompanying material on Circuit Cellar’s article materials webpage and the author’s project homepage.
Microprogrammed state machines
In the beginning, we will limit ourselves to control units that query input signals and excite binary output signals. No data is to be moved and stored, no arithmetic operations executed. We also want to get by with little effort. The typical design goal is the small microprogram control unit as an alternative or supplement to the MCU. It should use only a few resources and be easy to program.
The conceptual model of such control units is the finite state machine (FSM). To illustrate the basics, state machines are best described by their state diagrams (Figure 1). The state diagram of a complete state machine can be thought of as composed of straightforward state transitions, as shown in Figure 2.
Advancing (ADV) corresponds to the unconditional transition from the current state to its only successor. Since we are aiming for simple, easy-to-program machines, we implement it by incrementing the microinstruction address. The microinstruction address register CSAR is a straightforward binary counter. Successive microinstructions are arranged one after the other, as in the usual general-purpose processor.
Waiting (WT) means retaining the current state until the associated waiting condition is met. Then the successor state will be reached. Branching (BR) means choosing between two states according to a condition. If the condition is not met, the first successor state will be reached by incrementing the microinstruction address. Otherwise, the microinstruction address will be loaded from an address field in the microinstruction. Returning to an initial or final state (RDY) means forcing the microinstruction address to a hard-wired value (usually address zero) or inhibiting further state transitions.
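The four transition types can be summarized as a next-address function. The following Python sketch is only a behavioral model under our own assumptions: the names Transition and next_address, the parameter names and the default start address zero are hypothetical, and RDY is modeled only in its "force a hard-wired address" variant.

```python
from enum import Enum


class Transition(Enum):
    ADV = "advance"
    WT = "wait"
    BR = "branch"
    RDY = "ready"


def next_address(csar, transition, condition=False, branch_addr=0, start_addr=0):
    """Return the microinstruction address for the next machine cycle."""
    if transition is Transition.ADV:
        return csar + 1                          # unconditional successor
    if transition is Transition.WT:
        return csar + 1 if condition else csar   # hold the state until the condition is met
    if transition is Transition.BR:
        return branch_addr if condition else csar + 1
    if transition is Transition.RDY:
        return start_addr                        # forced to a hard-wired value
    raise ValueError(f"unknown transition: {transition}")
```

For example, next_address(5, Transition.WT) keeps the address at 5, whereas the same call with condition=True advances to 6.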
Simple sequencers are FSMs that output successive bit patterns. They cannot branch. The state diagrams are straightforward (as shown in Figure 1a and Figure 1b). An obvious solution centers around a memory containing the bit patterns. To emit them consecutively, it is addressed by an address counter. Whether such machines are dubbed memory-based state machines or microprogram control units may be a matter of opinion. Here we want to understand them as simple microprogram control units. We employ the microinstruction address register (CSAR) as the state register and the address decoder of the control storage (CS) as the state decoder.
The microprogram control unit depicted in Figure 3 can execute successive microinstructions, wait until a certain condition is met, jump to the first microinstruction or remain in a final state. But it cannot branch conditionally. Other devices within the application environment may load start addresses (1), initiate the sequencer (2) and query whether the final state has been reached (3).
What makes our approach, designing the hardware as a microprogrammed control unit, different from simply describing the desired behavior, for example, in a Verilog always block? When designed as an application-specific circuit, every change will require running the Boolean synthesis again. In contrast, the microprogram platform can be programmed like any MCU. Changing its behavior means only altering the memory content.
When synthesized in an FPGA, the control storage typically will be implemented with distributed RAMs or block RAMs. Initial microprogram loading (IML) would be part of the FPGA initialization after power up. The complexity of the CTRL block in Figure 3 depends on the state transitions to be implemented. The most straightforward transition pattern is to advance from state to state (Figure 4).
If the machine is to operate continuously, the RDY bit in the microinstruction will cause the CSAR to be cleared or loaded with an initial address. If the machine is to operate only once (start-stop behavior), the RDY bit in the microinstruction inhibits further incrementing. To run through the state sequence again, the CSAR must be reset from outside.
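The two RDY behaviors can be illustrated with a small behavioral model. This is a sketch under assumptions of our own: the control storage is a Python list of (pattern, rdy) pairs, the function name run_sequencer is hypothetical, and in start-stop mode the last pattern is assumed to stay on the outputs.

```python
def run_sequencer(control_storage, cycles, continuous=True, start_addr=0):
    """Emit one output pattern per machine cycle from a list of (pattern, rdy) words."""
    csar = start_addr
    outputs = []
    for _ in range(cycles):
        pattern, rdy = control_storage[csar]   # read the addressed word (CSDR)
        outputs.append(pattern)
        if rdy:
            if continuous:
                csar = start_addr              # RDY reloads the initial address
            # start-stop: RDY inhibits incrementing, so the CSAR keeps its value
        else:
            csar += 1                          # ADV: the CSAR is a plain binary counter
    return outputs


# Three-step pattern; the last word carries the RDY bit.
CS = [(0b001, False), (0b010, False), (0b100, True)]
looping = run_sequencer(CS, cycles=7, continuous=True)
one_shot = run_sequencer(CS, cycles=5, continuous=False)
```

In continuous mode the three patterns repeat; in start-stop mode the model stays at the last word until the CSAR is reset from outside, which here means calling the function again.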
Sequencers that can wait
Machines built according to Figure 3 are based on a memory containing bit patterns addressed by an address counter. The machine cycle must correspond to the shortest duration of the output signals. If signals are to be active over a prolonged time, the bit pattern must be repeated for an appropriate number of cycles.
In practice, however, it is often required to implement sequences with comparatively long periods between the signal transitions (seconds, minutes and more). An impressive example is controlling a liquid-propellant rocket, from pressurization of the fuel tanks, initiation of fuel supply, ignition, buildup of thrust, and so on up to the separation of the stages. In bygone times, this task was solved by electromechanical devices or multi-track magnetic tape drives.
The magnetic tape is an easily understandable example of a memory of arbitrary size. It can accommodate signal sequences that change in milliseconds, but also signal patterns that remain the same for many minutes—the tape just has to be long enough. The principle can also be implemented by addressable memories. There are, for example, MCUs with functional units emitting stored bit patterns (timing pattern controllers—TPCs). However, large memories for time intervals in the range of minutes and more are far too expensive for most applications. The alternative is to introduce wait states (Figure 5 and Figure 6).
In the example of Figure 5, an application-specific circuit (WAIT CTRL) selects the current wait condition and asserts a PROCEED signal if it is satisfied. Two microinstruction formats are concerned with waiting, WSEL to select the particular wait condition and WCTL to wait until the PROCEED signal becomes active, thus ending the wait state.
Figure 6 illustrates how a wait state can be maintained for a certain period. This task is to be solved quite often. In principle, it is the circuit of Figure 5, limited to a single wait condition. Before waiting, a WSEL microinstruction is to be executed to load the duration of the wait state into the wait-time counter.
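A timed wait of this kind might be modeled as follows. The sketch assumes a microprogram given as (operation, argument) pairs; WSEL and WCTL follow the text, while the EMIT operation, the function name run_with_waits and the rule "PROCEED is active when the counter is zero" are our own simplifications.

```python
def run_with_waits(microprogram, max_cycles=1000):
    """Return the output pattern present in each machine cycle."""
    csar, wait_counter, output_reg = 0, 0, 0
    trace = []
    for _ in range(max_cycles):
        if csar >= len(microprogram):
            break                        # end of the microprogram reached
        op, arg = microprogram[csar]
        if op == "EMIT":
            output_reg = arg             # new bit pattern on the outputs
            csar += 1
        elif op == "WSEL":
            wait_counter = arg           # load the duration of the wait state
            csar += 1
        elif op == "WCTL":
            if wait_counter == 0:        # PROCEED: the wait state ends
                csar += 1
            else:
                wait_counter -= 1        # stay in the wait state
        else:
            raise ValueError(f"unknown operation: {op}")
        trace.append(output_reg)
    return trace


PROGRAM = [("EMIT", 0b01), ("WSEL", 3), ("WCTL", None), ("EMIT", 0b10)]
trace = run_with_waits(PROGRAM)
```

In this model, a WCTL microinstruction occupies the loaded count plus one machine cycle. The point is that a long pause costs two control storage words instead of one word per cycle.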
Branching requires you to load the CSAR with the address to which you want to branch. The branch address may have different sources: a field in the microinstruction, a register content, a hard-wired address or it may be composed of signals out of the application environment (functional branching). Here, we limit ourselves to the simplest implementation, where the branch address is an immediate value in the microinstruction. Because it is so straightforward, we may implement all the state transitions as illustrated in Figure 2d together, thus yielding a versatile general-purpose branch sequencer (Figure 7 and Table 1).
Figure 8 shows the state transitions together with a horizontal microinstruction format. A horizontal microinstruction format is rarely suitable for a small programmable machine sitting somewhere in the FPGA as an IP core, for example, to control an accelerator or an interface. If only a few states are as complex as those shown in Figure 2d or Figure 8, too much memory capacity would be wasted.
For small, inexpensive machines, it makes sense to prefer vertical microinstruction formats. With separate microinstructions to emit bit patterns, wait, reset and branch, each state transition can be programmed with as many microinstructions as required. Waiting may be replaced by branching to itself and resetting by branching to a start address.
Two basic formats are sufficient (Figure 9), one for exciting control effects and loading literals (CONTROL) and one for branching (BRANCH). A one-bit format code suffices to distinguish one format from the other.
CONTROL microinstructions accommodate control signals and literals. The address of the next microinstruction is obtained by incrementing. Figure 9a shows a typical example format suitable for many applications. The function to be executed is encoded in the ACTION field. The DEST field (destination, target) selects the flip-flop, register, port or the like to which the respective result is to be delivered. The EMIT field contains an immediate value or—depending on the ACTION field—additional control bits.
BRANCH microinstructions contain the branch address (Figure 9b). The COND field selects the branch condition. The CPL bit inverts it. Thus, you have the choice to branch on a satisfied or not satisfied condition, respectively.
NOP microinstructions (No Operation) only waste time but have no effect otherwise. Here it is a CONTROL microinstruction containing only zeros (Figure 9c).
Figure 10 depicts a simple branch sequencer executing the microinstruction formats of Figure 9. The machine can only execute successive microinstructions or branch conditionally. All other state transitions must be programmed, as shown in Figure 11.
Waiting is conditional branching to itself (wait loop; Figure 11a). If conditions are to be queried or control effects are to be exerted, CTL microinstructions are interspersed, as shown in the second example in Figure 11a. Exiting the loop in an alternative direction can be programmed by an additional branch (Figure 11b). With additional branch microinstructions, the wait loop may be left in several directions (Figure 11c). A halt is an unconditional branch to itself (Figure 11d) or a loop running until a condition for continuing or restarting the microprogram is met (Figure 11e). The unconditional branch is a conditional branch selecting the hard-wired condition 1.
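A wait loop and a halt, programmed with the two vertical formats, can be sketched in Python. The field names ACTION, DEST, EMIT, COND and CPL follow the text; everything else is hypothetical: the action codes, the condition numbering, the READY condition, the ports dictionary and the helper names step and run.

```python
from dataclasses import dataclass

ACTION_NOP, ACTION_LOAD = 0, 1   # hypothetical action codes
COND_ALWAYS, COND_READY = 0, 1   # condition 0 models the hard-wired 1


@dataclass(frozen=True)
class Control:                   # an all-zero CONTROL word is the NOP
    action: int = ACTION_NOP
    dest: int = 0
    emit: int = 0


@dataclass(frozen=True)
class Branch:
    cond: int
    cpl: bool
    addr: int


def step(csar, instr, conditions, ports):
    """Execute one microinstruction and return the next address."""
    if isinstance(instr, Control):
        if instr.action == ACTION_LOAD:
            ports[instr.dest] = instr.emit        # deliver the literal to DEST
        return csar + 1
    taken = conditions[instr.cond] != instr.cpl   # CPL inverts the condition
    return instr.addr if taken else csar + 1


def run(program, ready_from_cycle, cycles):
    csar, ports, trace = 0, {}, []
    for cycle in range(cycles):
        conditions = {COND_ALWAYS: True, COND_READY: cycle >= ready_from_cycle}
        trace.append(csar)
        csar = step(csar, program[csar], conditions, ports)
    return trace, ports


PROGRAM = [
    Control(ACTION_LOAD, dest=0, emit=0x01),    # 0: raise a request signal
    Branch(COND_READY, cpl=True, addr=1),       # 1: wait loop while READY is not set
    Control(ACTION_LOAD, dest=0, emit=0x00),    # 2: drop the request
    Branch(COND_ALWAYS, cpl=False, addr=3),     # 3: halt (branch to itself)
]
trace, ports = run(PROGRAM, ready_from_cycle=4, cycles=8)
```

The address trace shows the machine circling at address 1 until READY arrives and then parking at address 3.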
Algorithmic state machines (ASMs)
The branch sequencer can only react to conditions from outside. Such a machine, however, cannot store data, calculate values or procure branch conditions by processing stored and read-in data. With that in mind, it makes sense to expand the control unit to a general-purpose processor. In many applications, however, true universality (Turing-completeness, to be more specific) is not required at all. To make do with as small an FPGA as possible, we would prefer a machine that supports universality only to the degree necessary for the respective application. Such machines are known as algorithmic state machines (ASMs).
Figure 12 shows a particularly simple ASM together with a horizontal microinstruction format. A microprogram control unit extended by an ALU and a local storage (LS) yields a small ASM that can be used, for example, as a programmable logic controller (PLC) or as an application-specific I/O processor. It supports only operations with immediate values. The I/O circuits are depicted as I/O ports, thus resembling the I/O architecture of typical MCUs. In practice, however, the I/O circuits are usually application-specific. Their registers are addressed the same way as the local storage (memory-mapped I/O). Additionally, the local storage could be implemented with dual-port RAMs, thus allowing accesses from outside—for example, from the RISC core the ASM is assigned to.
To solve many control problems by programming, operations with immediate values are sufficient. Status bits and read-in data may be stored in the local storage. Bits and words to be emitted may be assembled there, and so on. Imagine we would have to memorize some status information like an ERROR bit, an ACCEPTED bit and a VALID bit. In a machine based on a branch sequencer, we would have to provide a flip-flop for each of the bits and microinstructions to set, clear, and query them. On the other hand, bits in the local storage can be set by ORing and cleared by ANDing with appropriate immediate values. Furthermore, ANDing a 1 in the corresponding bit position would deliver a zero condition for conditional branching.
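The three operations on status bits can be written out as follows. This is a minimal sketch assuming an 8-bit local storage word; the bit positions of ERROR, ACCEPTED and VALID and the function names are hypothetical.

```python
ERROR, ACCEPTED, VALID = 0x01, 0x02, 0x04   # hypothetical bit positions
WORD_MASK = 0xFF                            # assumed 8-bit local storage word


def set_bits(word, mask):
    """Set bits by ORing with an immediate value."""
    return (word | mask) & WORD_MASK


def clear_bits(word, mask):
    """Clear bits by ANDing with the complemented immediate value."""
    return word & ~mask & WORD_MASK


def zero_condition(word, mask):
    """ANDing with a 1 in the bit position yields zero if the bit is clear."""
    return (word & mask) == 0


status = 0
status = set_bits(status, VALID | ACCEPTED)
status = clear_bits(status, ACCEPTED)
error_is_clear = zero_condition(status, ERROR)
valid_is_clear = zero_condition(status, VALID)
```

A conditional branch on the zero condition then replaces the dedicated flip-flops and query microinstructions a pure branch sequencer would need.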
Single-address processor cores
When programming more complicated algorithms, immediate values alone will not suffice. Instead, you will need a machine providing operations with both operands stored or read in. For this, we will retain the simple architecture of Figure 12 and provide a working register that can also be used as an accumulator, the A register. If the result is written back to the A register, it acts as an accumulator, otherwise as a working register.
Such a machine, however, would not be genuinely general-purpose (in other words, Turing-complete) because it lacks a memory of sufficient size as well as provisions for address calculation. The memory could be supplemented as an additional functional unit, or the local storage could be extended appropriately. In the first case, we will get a machine with a general-purpose register file, in the second, a conventional single-address machine.
An example of the latter is shown in Figure 13, depicted here as a Harvard machine. The local storage becomes the general-purpose data memory. A single address or index register (AD register) is sufficient to support address calculation, reading data out of the program memory and indirect branching. A register stack is provided to save return addresses. The principal architecture is similar to some legacy single-address machines and small MCUs. Our machine, however, may be of any size—in terms of storage capacity, processing width, and so on. The microinstruction format of Figure 13 relates to a 16-bit machine. It is, however, merely a starting point. When creating a microinstruction format, we deliberately postpone optimization. Instead, we begin concatenating all bits and fields we deem necessary.
Advanced principles: an outlook
It goes without saying that a high-performance microprogram control unit should excite as many control signals and deliver immediate values as wide as required. It is only a question of word length and control storage implementation, including some intricacies of signal paths and clocking. The most significant design challenge is not the microinstruction format but to respond to conditions as fast as possible. The problem has two aspects, conditional branching and interrupts (break-ins, to be more specific).
Late branching: Branching takes time. Only after the branch condition has been queried and the address loaded into the CSAR can the addressed microinstruction be read out of the control storage. Late branching is a principle to avoid this delay. The basic idea: All the microinstructions that could be successors of the current microinstruction are read simultaneously. For this purpose, the control storage is to be built from an appropriate number of modules addressed in parallel. While the current microinstruction is executing, all of its successors are read. Branching in two directions means selecting one of two successors (Figure 14), branching in four directions, one of four successors (Figure 15), and so on.
The selected successor is loaded into the microinstruction register CSDR via data selectors, multiplexers, or transfer gates. These data paths have a particularly short delay. Therefore, it is sufficient that the selection signals become valid immediately before the machine cycle ends (in other words, late). Thus, the results of the current cycle may decide which microinstruction will be next. Such a microprogram control unit is a comparatively large piece of hardware, but the principles of operation and the circuits are straightforward.
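Two-way late branching can be modeled behaviorally in a few lines. The sketch rests on assumptions of our own: two control storage modules are read with the same next address, each word is a (label, next_addr) pair, and the condition produced by the current cycle selects which module's word enters the CSDR. The names late_branch_run, module0 and module1 are hypothetical.

```python
def late_branch_run(module0, module1, first_word, conditions):
    """Return the labels of the microinstructions executed, in order.

    conditions[i] is the result of cycle i; it selects the successor
    only at the end of that cycle.
    """
    csdr = first_word
    executed = []
    for cond in conditions:
        label, next_addr = csdr
        executed.append(label)
        # Both possible successors are read while the current word executes.
        candidate0 = module0[next_addr]
        candidate1 = module1[next_addr]
        # Late selection: a fast multiplexer in front of the CSDR.
        csdr = candidate1 if cond else candidate0
    executed.append(csdr[0])
    return executed


module0 = [("A0", 1), ("B0", 0)]   # successors for condition not met
module1 = [("A1", 1), ("B1", 0)]   # successors for condition met
order = late_branch_run(module0, module1, ("START", 0), [True, False, True])
```

No cycle is spent on loading a branch address: the successor chosen at the end of one cycle executes in the very next one.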
Interrupting the microprogram (break-ins): Such an interrupt causes the control unit to switch from the current microprogram to the execution of another. Everything necessary to continue the interrupted microprogram is to be saved. This way, the microprogram control unit can respond to occurring conditions in an instant. Otherwise, the microprogram would have to query all conditions cyclically.
The typical microprogram interrupt is a so-called lightweight interrupt with only a few registers to save and restore. Usually, it comprises only a few microinstructions. To avoid confusion with the interrupts specified in the system architecture, this kind of interrupt will be dubbed break-in.
The principle was introduced with IBM’s System/360. There its purpose was to employ the processor’s operation section and storage adapter to implement functions of the I/O channels. A channel moves bytes via its device interface. The channels are equipped with buffer memories.
If such a buffer is full (when reading) or empty (when writing), memory accesses are required. However, these access procedures are somewhat complicated. The memory has to be addressed, the address incremented, the number of bytes transmitted has to be subtracted from the byte count specified in the channel command word (CCW), and so on. An autonomous channel control unit is expensive. Therefore, the processor’s operation section is called in for help.
The main adder takes care of address arithmetic and byte-counting. The buffers are emptied or filled via the processor’s main storage interface. All these operations are controlled by microprograms.
What we imagine is a similar use in FPGA-based designs. By implementing a fast break-in mechanism, the microprogram control unit could be called for help. The functional units could be designed so that they control only their elementary functions autonomously. The more complex functions will be relegated to the microprogram, thus easing hardware design and saving FPGA resources.
There are inexpensive implementations of a break-in mechanism, for example, by saving the microinstruction address into a link register or a register stack. We, however, strive for utmost performance. Fortunately, we may rely on a well-proven principle: switching between multiple registers instead of saving register contents. The basic idea: All registers and flip-flops whose contents are to be retained at break-in and hence to be saved are provided for each break-in level separately. This way, the working sets of all active microprograms (addresses, flags, and so on) are preserved. Only one is active at a time. The others remain in their registers or memory cells.
In the simplest case, there are only two levels: the basic level and the break-in level. More complex control units have several levels (0 = basic level, 1 = first break-in level, 2 = second break-in level and so on), as shown in Figure 16.
Each level is assigned its own microinstruction address register (CSAR), which can be supplemented by flag bits, status bits and the like. A break-in causes the respectively assigned CSAR to be selected for microinstruction addressing. Returning from interrupt handling (break-out) consists of selecting the previously used CSAR. In contrast to saving the return information in link registers or a register stack, this principle yields minimal latencies down to a single machine cycle.
At the basic level, the microinstruction address is supplied by CSAR 0. One of the microinstruction address registers CSAR 1, CSAR 2, or CSAR 3 is assigned to each break-in request signal BRQ1 to BRQ3. These CSARs are initially loaded with the start addresses of the respective break-in routines. If the corresponding condition occurs, the associated CSAR is selected. Address counting or loading the CSAR previously active is stopped. Therefore, it contains the address of the next microinstruction to be executed when switching back. All those functions could be executed conditionally. For that, the microinstruction bits of Table 2 are to be supplemented by appropriate selection codes.
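The register-switching principle can be illustrated by a behavioral model. It is a sketch under several assumptions of our own: requests are one-cycle pulses given as a dictionary from cycle number to level, a higher level number has priority, each word carries a hypothetical break_out flag, the history of interrupted levels is kept in a small list, and a handler's CSAR is reloaded with its start address on break-out.

```python
def run_break_in(program, start_addrs, requests, cycles):
    """program: addr -> (label, break_out); requests: cycle -> requested level."""
    csar = list(start_addrs)         # one CSAR per level; index 0 is the basic level
    level, interrupted, trace = 0, [], []
    for cycle in range(cycles):
        requested = requests.get(cycle, 0)
        if requested > level:
            interrupted.append(level)    # the old CSAR simply stops counting
            level = requested            # select the CSAR of the break-in level
        label, break_out = program[csar[level]]
        trace.append(label)
        if break_out:
            csar[level] = start_addrs[level]   # re-arm the handler (assumption)
            level = interrupted.pop()          # select the previously used CSAR
        else:
            csar[level] += 1
    return trace


PROGRAM = {addr: (f"M{addr}", False) for addr in range(6)}   # basic level
PROGRAM[100] = ("H0", False)                                 # break-in routine
PROGRAM[101] = ("H1", True)                                  # last word: break-out
trace = run_break_in(PROGRAM, start_addrs=[0, 100], requests={2: 1}, cycles=6)
```

The handler starts in the cycle in which the request arrives, and the basic level resumes at exactly the microinstruction it would have executed next, without any save or restore cycles.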
By setting SW, we may switch to an arbitrary CSAR selected in the ADSEL field. Programmed level switching is similar to calling a subroutine. Writing microprograms, you are free to decide which of the microinstruction address registers CSAR 1, 2, 3 are to be used for subroutines and which for break-ins. Example: CSAR 1 for the break-ins, CSAR 2 and CSAR 3 for the subroutines. Then the break-in handlers must do without subroutines; the microprograms of the basic level are allowed to nest two subroutines.
Interrupt handling in usual processors begins with saving the instruction address and branching to the interrupt service routine (ISR). When an ISR has been exited (for example, via an RETI instruction), the interrupt handling is completed. The only way to let an ISR run is to trigger an interrupt. Register selection, however, allows switching back and forth at any time to leave the work the break-in has started behind, so to speak, and return later. A simple example is waiting for an interface signal and switching back when a timeout limit has been exceeded. So, you can leave the waiting condition pending and come back to it later.
Summary and suggestions
Microprogramming could be a viable alternative to hardware design as well as to problem-solving by programming. We see it as an additional tool in the box. Here we have outlined a few proposals for small microprogram control units and advanced principles, which may be of use when designing FPGA-based systems in the upper-performance leagues.
Concerning the history of microprogramming, we will borrow only some principles, like multi-way branching, late branching and break-ins. That being said, new microprogram architectures and control units could be developed as IP cores aimed at FPGAs and perhaps ASICs. The next logical steps could focus on true micro-architectures that allow programs written in high-level languages to be compiled to microprograms and that emulate high-level instruction interfaces like JVM (Java Virtual Machine) or Dalvik.
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • FEBRUARY 2022 #379
Wolfgang Matthes has developed peripheral subsystems for mainframe computers and conducted research related to special-purpose and universal computer architectures for more than 20 years. He has also taught Microcontroller Design, Computer Architecture and Electronics (both digital and analog) at the University of Applied Sciences in Dortmund, Germany, since 1992. Wolfgang’s research interests include advanced computer architecture and embedded systems design. He has filed over 50 patent applications and written seven books. (www.realcomputerprojects.dev and