This article is a case study of why you should pay attention to the underlying architecture your HDL code is running on. The use of the simple clock buffer is explored in the Spartan 6 FPGA, and a demonstration of where you might run into placement errors and how you can fix them is given. More generally, this explores the use of the FPGA Editor to get a feel for how your design is being implemented on the physical hardware.
Many FPGA projects live mostly in the HDL design world, being composed of blocks built entirely in Verilog or VHDL. A handful of special FPGA resources might be pulled in, such as clock generators or buffers. But as your project gets more complex, ignoring the underlying architecture of the FPGA that your design is running on gets more and more dangerous. It might mean you have trouble completing a place and route run, or that the design is marginal under certain conditions.
This column will discuss some issues around clock placement and routing. While there is much to be written about the subject, I wanted to give you a quick introduction and demonstrate that it’s actually pretty easy to get your hands dirty doing a little manual placement or optimization. And it might give you a lot more breathing room when it comes time to expand your design!
One thing I won’t be discussing in this column is ensuring your clock has proper constraints, such as telling the tools what sort of frequency input the clock has and the relationship between data input/output and the clock. Proper constraints are critical for a working design, so it’s something I’ll tackle in a future column. In the meantime, you can refer to Austin Lesea’s five-part set of posts, “Timing Constraints,” for a quick introduction (see Resources).
I’m going to be using a Spartan 6 FPGA device specifically, but you can apply the general principles to any architecture from Xilinx or other manufacturers. Ultimately, it comes down to finding the appropriate documentation, so this column is really a case study of how to use that documentation.
When it comes to clocks, one of the first things an FPGA designer will learn is that you should only route clock inputs to special pins on the device. These pins have a special routing fabric inside the FPGA that reduces the amount of skew, and ensures the latency between an external clock edge and the internal clock net is consistent and within design requirements (including between FPGA builds, and with changes in temperature and voltage).
If your external clock is used alongside a data bus, you absolutely need that consistent timing. So as a good FPGA (and PCB) designer, you will diligently ensure that all clock inputs are routed to appropriate clock-capable pins.
Unfortunately, it’s not enough to simply route any clock to such a pin and be done with it. There are additional considerations that mean your design might not be able to properly use that FPGA pin. The first you’ll hear of this is when the tools complain about the placement, and it could be frustrating to understand what exactly is happening. After all, you respected the golden rule and routed the clock net to one of these special pins!
This exact case happened to me recently: the tools complained that they couldn’t use the high-speed clock fabric. I could demote the signal and allow the system to use the regular routing fabric, but that isn’t an ideal situation, and with a little look under the hood I can get a successful placement instead.
THE FPGA EDITOR
The best way to understand these errors is to use the “FPGA Editor,” and in fact sometimes the tools will direct you specifically to this. The FPGA Editor shows you exactly what is happening—where signals are being routed into and out of the FPGA, and where special blocks are being used (such as block RAMs or clock modules).
Depending on how far along the process got, it might not show you as much information. For example, when the system fails to route the clock, it is difficult to see where the error is, since there is no fully routed design to load. In these cases, it can be useful to “hack” your source code so that it temporarily builds. In this case I demoted the clock signal so that it would route on the IO fabric.
The first time you load the FPGA Editor it might look a little overwhelming. It’s a sea of tiny logic blocks, and the more you zoom in the more detail you see (see Figure 1). But you can use the search feature to highlight the location and interconnections of certain blocks. Using the selection window, you can select between, for example, “Nets” and “Placed Components.” You can see an example of this when examining the clock routing over the majority of the device logic (see Figure 2).
If you’re having trouble with a specific clock net, you might highlight it as I’ve done in Figure 3. The clock is actually being routed over two paths: one is the high-speed clock fabric, the second is through the IO fabric. You can see the IO fabric one is “bouncing around” a little as it traverses the general-purpose routing fabric.
The questions now are what happened and why is the clock being routed in such a manner? In my design, that clock then connects to a few different locations beyond just logic. It’s also used as an input to a DCM block, and routes to a clock mux (a BUFGMUX component).
These are all instantiated in (I thought) a reasonable manner, so just looking at the Verilog source code it’s not clear why things failed. But from the FPGA Editor display shown in Figure 3, it looks like it was forced to start routing onto the generic IO fabric. Errors like this normally mean you’ve either asked it to do something impossible, or the FPGA has run out of resources.
To get a better grip on that, we need to understand the clock routing resources available to you. Let’s take a look at what the Spartan 6 has to offer.
CLOCK ROUTING FOUNDATIONS
The details of the clock routing resources are contained in Xilinx’s “Spartan-6 FPGA Clocking Resources” (UG382), which I highly encourage you to read since it’s the “golden reference.” (Most of its 116 pages are figures and tables, so it’s not as intensive as it might seem at a quick glance.)
But the general clocking structure is shown in Figure 4. One of the first things to note is that there are only 16 BUFGMUX devices in the center of the device. These BUFGMUXs can be used to select between two clock inputs, and are also capable of driving the high-speed clock network. If using the BUFGMUX as a simple buffer (the most common case), you drive only one input pin, and the select pin is a fixed logic level.
When used in this manner, the device is referred to as a BUFG. You can manually insert them when driving the clock network from some source, and they are also automatically inserted when using features like clock synthesis (either PLL or DCM), where there is a feedback path (which uses a BUFG) along with driving the clock net (another BUFG). If you have a complex clock structure (such as many clock generation blocks or using different frequencies), it’s possible to run out of BUFG/BUFGMUX devices long before you run out of other features.
Note the clock input pins use a primitive called the IBUFG, which is entirely different from the BUFG. The IBUFG is part of the IO pin for the clock driver, and (despite the name) is not a global clock buffer. If you wish to distribute this clock, you should connect the IBUFG to a BUFG, as the BUFG is the actual global clock buffer.
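As a minimal sketch of that connection, the following Verilog takes a clock from a clock-capable pin through an IBUFG and then onto the global clock network through a BUFG. The module, instance, and signal names are hypothetical and not taken from my design; only the IBUFG and BUFG primitives and their I/O ports come from the Xilinx library.

```verilog
// Hypothetical wrapper: clock pin -> IBUFG -> BUFG -> global clock net.
module clock_input_example (
    input  wire clk_pin,    // assumed to be constrained to a GCLK pin
    output wire clk_global  // buffered clock for distribution to logic
);
    wire clk_ibufg;         // output of the IO buffer, not yet global

    // Input buffer that is part of the clock-capable IO pin
    IBUFG clk_in_buf (
        .I(clk_pin),
        .O(clk_ibufg)
    );

    // The actual global clock buffer (a BUFGMUX with a fixed select)
    BUFG clk_global_buf (
        .I(clk_ibufg),
        .O(clk_global)
    );
endmodule
```

Each BUFG instantiated this way consumes one of the 16 BUFGMUX sites, so it is worth counting them as the clock structure grows.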
So what happened in Figure 3? This is an example where the IBUFG output is taken to several different BUFG inputs. One was being used as a selectable clock for a DCM block, and one was being used to distribute the clock to logic. Either due to this topology or because of how many other BUFGMUX were in use, the system couldn’t route the two separate lines—ultimately, the reason isn’t critical, as adjusting the topology fixed the problem.
Another detail of the BUFGMUX is how they are interconnected. Every two BUFGMUX devices are interconnected as in Figure 5. This is a physical restriction: the I0 input of one device is always the same as the I1 input of the second in the pair (and vice versa). If you are using the devices as BUFGs, that is fine. Assume we are using I0 as the input to the first BUFG. For the second BUFG we then don’t care what appears on its I1 pin, so we can live with the architecture forcing that pin to carry the clock from the first BUFG.
Where this can get you into trouble is if you are using the BUFGMUX to switch clock sources. Now you do care what happens on both pins, so you might not be able to use all 16 BUFGMUX devices (worst-case this would mean you could only use eight of them). Most commonly you’ll find out about this when the tools complain about “unrouteable placement.” If you use more BUFGMUXs than can be legally placed, it will in fact place them in a manner that they cannot be used, then raise this unrouteable placement error.
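One way to picture the pairing restriction is with a simplified behavioral model. This is only a sketch under my own assumptions: the module and signal names are hypothetical, and it ignores the glitch-free switching of the real BUFGMUX. It only captures the fact that the two multiplexers in a pair see the same two physical input lines, with I0 and I1 crossed over.

```verilog
// Simplified, hypothetical model of one BUFGMUX pair (not a Xilinx primitive).
module bufgmux_pair_model (
    input  wire line_a,  // shared line: I0 of the first mux, I1 of the second
    input  wire line_b,  // shared line: I1 of the first mux, I0 of the second
    input  wire s0,      // select of the first mux
    input  wire s1,      // select of the second mux
    output wire o0,      // output of the first mux
    output wire o1       // output of the second mux
);
    // First mux: S=0 selects I0 (line_a), S=1 selects I1 (line_b)
    assign o0 = s0 ? line_b : line_a;

    // Second mux: its inputs are crossed, so S=0 selects line_b
    // and S=1 selects line_a
    assign o1 = s1 ? line_a : line_b;
endmodule
```

In this model, if the first mux needs two different clocks on line_a and line_b, the second mux can only reuse the line_a clock through its I1 input (with s1 tied high), never through I0.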
NOT TOO SMART
Knowing about the restriction in placement, you might assume the system would be smart enough to pin-swap the I0 and I1 pins as needed. That is to say surely it would realize it could pack Listing 1 into a BUFGMUX pair? Unfortunately, this isn’t the case. When I used a LOC constraint to force the BUFGMUX and BUFG into a matched pair location, it generated a routing error.
Listing 1 This Verilog listing will use two BUFGMUX, as the BUFG actually instantiates a BUFGMUX but with a fixed "Select" pin. The two cannot be placed in a paired site in this implementation, as the I0/I1 pins need to be swapped.

BUFGMUX clkgenfx_mux(
    .O(clkgenfx_src_buf), // Clock buffer output
    .I0(clk_sys_in),      // Clock buffer input (S=0)
    .I1(clk_ext_in),      // Clock buffer input (S=1)
    .S(clkgen_source)     // Clock buffer select
);

BUFG another_system_buffer(
    .O(clk_sys_buf),
    .I(clk_sys_in)
);
I solved the problem as in Listing 2, where I’ve swapped the input pin used with the BUFG. Note I’m just using the BUFGMUX resource as a BUFG, but had to instantiate a BUFGMUX to force the use of the matched input pin. I also had to leave the other pin disconnected. Driving it with logic 0 (which seems like a good idea, since it’s an unused pin) will again result in an error. Remember that at the hardware level the router can’t drive pin I0 of one BUFGMUX to logic 0 unless pin I1 of the other BUFGMUX is also logic 0. Leaving it disconnected tells the router you don’t care what signal goes there.
Listing 2 Manually swapping the I0/I1 of either BUFGMUX will fix the problem. Here I swapped the BUFG pins, as the BUFGMUX was part of an already proven design.

BUFGMUX clkgenfx_mux(
    .O(clkgenfx_src_buf), // Clock buffer output
    .I0(clk_sys_in),      // Clock buffer input (S=0)
    .I1(clk_ext_in),      // Clock buffer input (S=1)
    .S(clkgen_source)     // Clock buffer select
);

BUFGMUX another_system_buffer(
    .O(clk_sys_buf),      // Clock buffer output
    .I1(clk_sys_in),      // Clock buffer input (S=1)
    .S(1'b1)              // Clock buffer select
);
You can also see if there is a way to refactor your design a little to take advantage of the BUFGMUX restrictions. For example, it might be that you can adjust where a BUFG is placed or how clock routing is done to allow a shared BUFGMUX pair to work, which could be what you require to fit your design.
There are actually a few other restrictions on the clock system that could bite you, well beyond just the BUFGMUX shared inputs. There are differences when certain devices are placed in the top half of the FPGA and the bottom half—for example, a BUFGMUX in the top half can route to a number of locations that a BUFGMUX in the bottom half cannot (such as general fabric, and certain features like the clock signal for a DCM reprogramming interface).
The shared BUFGMUX inputs also mean you have certain restrictions on which GCLK pins can be used in the device. Again, UG382 lists which pins may have conflicts. There are ways around it using alternative resources in case you have no choice, but if you take this into account at the beginning of your board design, you can avoid the problem entirely.
Talking about board design, there can also be restrictions on which GCLK pins can be used to reach certain PLL or DCM blocks. Routing the clock from the IO pins to these blocks can use special buffers, but they can only reach half the chip. Typically, you can at least figure out how many PLLs you might require, and ensure your clock inputs are spread across enough of the top/bottom that you can reach both buffers.
FPGA EDITOR FOR OPTIMIZATION
So far I’ve mostly concentrated on avoiding impossible scenarios. While on the subject of the FPGA Editor and timing, it’s worth mentioning that it can give you valuable insight into how your design is being implemented, and where you might improve.
Figure 6 shows an output pin that could be moved closer to the logic. In this case there was an output bus spread across a large area on that edge of the FPGA. Using the FPGA Editor to optimize placement (such that the pins were closer together when viewed from the FPGA fabric) allowed the place and route engine to shave 1 to 2 ns off the output delay of the bus. Of course, that type of optimization requires you to have been diligent in creating a full FPGA design before the final PCB is routed!
The FPGA Editor can also help you understand where signals with large fan-out exist, where you might consider trying to improve the design with pipelining or other techniques. For all of these situations it only makes sense to look at the placement in conjunction with the detailed timing report, as that will give you an idea of what sort of delays are involved. Delays coming primarily from the logic (i.e., too many levels of combinational logic on a signal) may call for different solutions than delays coming from routing (i.e., the signal had too long of a trip between points).
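To illustrate the pipelining idea in general terms, here is a minimal sketch that is not taken from my design; the module name, ports, and widths are all assumptions. It splits a four-input adder into two registered stages, so each clock period only has to cover one level of addition instead of two.

```verilog
// Hypothetical example: four-input sum split into two pipeline stages.
module pipelined_sum (
    input  wire        clk,
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire [15:0] c,
    input  wire [15:0] d,
    output reg  [17:0] sum  // valid two clock edges after the inputs
);
    reg [16:0] ab;  // stage 1 partial sum, one extra bit for the carry
    reg [16:0] cd;  // stage 1 partial sum, one extra bit for the carry

    always @(posedge clk) begin
        // Stage 1: two independent additions, registered
        ab  <= a + b;
        cd  <= c + d;
        // Stage 2: combine the registered partial sums
        sum <= ab + cd;
    end
endmodule
```

The trade-off is an extra clock cycle of latency, so any signals that travel alongside the result need to be delayed by the same amount. This helps when the timing report shows the delay is dominated by logic levels; it does less for a path dominated by routing delay.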
The FPGA Editor is a powerful tool for giving you some oversight of what your FPGA design is doing, and how the routing is happening. Even if your design is working OK, it can be useful to get an idea of what problems might crop up as your FPGA fills up, or you later need to run at higher frequencies. As always some of these figures are posted online at ProgrammableLogicInPractice.com along with a very quick video tutorial of the FPGA Editor.
RESOURCES
A. Lesea, “Timing Constraints,” 2009, https://forums.xilinx.com/t5/PLD-Blog/Timing-Constraints-Part-1-of-5/ba-p/57594.
Xilinx, “Spartan-6 FPGA Clocking Resources,” User Guide, UG382 (v1.10), 2015, www.xilinx.com/support/documentation/user_guides/ug382.pdf.
Spartan 6 FPGA
Xilinx | www.xilinx.com
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • June 2016 #311