Snooping and Synching
Cache coherency is a fundamental concept for processor-based systems. Nishant explains the basics of cache coherency and then explores how Arm’s ACE protocol ensures a more cache-friendly system design.
Cache coherency has been a hot topic since the development of accelerated hardware and parallel processing, and it is a key concept for performance-hungry systems. In such hardware and software, we simply cannot afford to perform every read and write against main memory. Compared with accessing data in a local cache, the latency of reads and writes to main memory (DDR4 DRAM, for example) is huge. Using a cache ensures that intermediate data is accessed directly, without any long paths.
The result is that computation completes much faster than it would with continuous writes and reads to and from DDR4. In my article “Understanding the AMBA AXI4 Spec” (Circuit Cellar 370, May 2021), I explained Arm’s AXI protocol, which moves data around the processor but doesn’t take care of caches. In this article, we’ll look at Arm’s ACE protocol, a scheme that is cache friendly to some extent, although there are more advanced protocols, like Arm’s CHI, that provide full cache coverage.
First, let’s discuss cache coherency and how it’s performed in hardware. Next, we’ll explore how the ACE protocol can help tackle the cache problem and the extent of its capabilities. This article is the result of my researching Arm’s AMBA ACE protocol and various other cache coherency concepts.
Terminology note: Although the terms “master” and “slave” have long been used in the electronics industry, those terms are discouraged these days for obvious and valid social reasons. The industry as a whole has not yet come to any widespread agreement on replacement terms. However, for this article we’ve decided to follow the updated documentation on developer.arm.com and use the term “manager” to replace “master,” and use the term “subordinate” to replace “slave.”
Cache coherence is a typical parallel-processor problem: data integrity and data flow are both monitored by the caches and the interconnect so that no data inconsistency or corruption arises between transactions. Cache inconsistency between threads can lead to data corruption or a system “hanging.”
Let’s review a typical cache coherency problem using the example illustrated in Figure 1. First, Manager 1 issues a read from the memory to acquire data at 0x100 and stores it in its local cache. The same sequence is performed by Manager 2. At this stage, Manager 1 and Manager 2 have the same values in their local cache as they have in main memory.
Let’s say Manager 1 wants to change the data in its local cache to “ABCD.” At this point in time, Manager 1 has ABCD and Manager 2 has 1234, which means the system is out of sync. Now let’s say Manager 2 updates half of its data while the remaining half doesn’t change at all. There are then three different variants of the data in the system: Manager 1 has ABCD, Manager 2 has 12AB and main memory has 1234. When Manager 3 reads the data at 0x100, it won’t obtain correct data but rather a mixture of the Manager 1 and Manager 2 copies: it will get ABAB. This creates a data-corrupted system and is a good example of a cache-incoherent system.
The solution to that problem is cache coherency. If performance is not a big concern, you could in theory build a system with the caches disabled and still make it work. However, doing so would really be considered design negligence because it could cause a lot of latency and performance issues. The best cache coherency solutions available in the industry fall into either the software coherency or the hardware coherency category.
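To make the divergence concrete, here is a minimal Python sketch of the first part of that scenario. It is a toy model, not anything from Figure 1: the `Manager` class, the `MAIN_MEMORY` dictionary and the string data are all names I am assuming for illustration. Each manager keeps a private cache and nothing tells the others when a copy changes.

```python
# Toy model: managers with private caches and NO coherency mechanism.
# Manager and MAIN_MEMORY are hypothetical names used only for illustration.

MAIN_MEMORY = {0x100: "1234"}


class Manager:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # private cache: address -> data

    def read(self, addr):
        # On a miss, fetch from main memory; on a hit, use the local copy.
        if addr not in self.cache:
            self.cache[addr] = MAIN_MEMORY[addr]
        return self.cache[addr]

    def write_local(self, addr, data):
        # Update only the local copy; no other cache or memory is informed.
        self.cache[addr] = data


m1, m2 = Manager("Manager 1"), Manager("Manager 2")
m1.read(0x100)
m2.read(0x100)

m1.write_local(0x100, "ABCD")                      # Manager 1 rewrites the line
m2.write_local(0x100, m2.read(0x100)[:2] + "AB")   # Manager 2 updates one half

copies = {
    "Manager 1": m1.cache[0x100],
    "Manager 2": m2.cache[0x100],
    "main memory": MAIN_MEMORY[0x100],
}
print(copies)
```

After these steps the three copies of address 0x100 all differ, which is exactly the situation a coherency protocol exists to prevent.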
Software coherency has been with us in the industry for a long time, but it puts the burden on the heads of software engineers. Software coherency can be difficult and complex to design. Worse, it results in limited performance improvement that’s dependent on how the algorithm does cache management. Fortunately, today we have hardware coherent solutions like the ACE and CHI protocols that have extensive or complete cache coverage and reduce the burden on the software side. Hardware coherency may increase the complexity of the interconnect design, but, overall, that complexity is a good trade-off for the performance and latency improvement it provides.
Coherency protocols start with the concept of the MOESI protocol, which is based on the “Modify and Invalidate” scheme. MOESI stands for Modified, Owned, Exclusive, Shared, Invalid. Figure 2 shows the state diagram of the MOESI protocol from an AMD datasheet.
Cache coherency can be achieved using the following methods:
- Snoopy bus protocol
- Directory-based messaging system
- Shared cache used in multicore system
In this article, we’re only going to examine the snoopy bus and directory-based messaging system methods. “Snoopy,” as the name suggests, keeps monitoring (snooping) all the data transactions happening throughout the system, in both local cache and main memory. In contrast to Figure 1, the interconnect is modified by adding a snoop crossbar layer that tracks all the data and stores it in a table indexed by ID. It also keeps observing the changes made in each local cache and informs the other shareable caches to invalidate themselves or update their state based on the changing data. This is the most widely used protocol in modern processor-based systems. In the directory-based protocol method, multiprocessor systems are connected through crossbar switches, and cache directories are used to keep records of where the copies of cache blocks reside.
Under the snoopy-based protocol, there are two basic policies: write invalidate and write update (write broadcast). Under write invalidate, when a local copy of the data is modified or replaced, all other cached copies in the shared ecosystem are invalidated. Another concept that’s important to understand is the “write/read back” and “write/read through” policies. In a write-through policy, the information is written to both the block in the cache and the block in lower-level memory. In a write-back policy, the information is written only to the block in the cache, and the modified cache block is written to main memory only when it is replaced.
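The directory idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Directory` class and its method names are hypothetical, and a real directory would also track line state rather than just the set of sharers.

```python
# Hypothetical directory: records which managers hold a copy of each address.

class Directory:
    def __init__(self):
        self.sharers = {}  # address -> set of manager names holding a copy

    def record_read(self, addr, manager):
        # A manager that reads a line becomes one of its sharers.
        self.sharers.setdefault(addr, set()).add(manager)

    def record_write(self, addr, manager):
        # Every other sharer now holds stale data and must be told so.
        stale = self.sharers.get(addr, set()) - {manager}
        self.sharers[addr] = {manager}
        return stale


directory = Directory()
for name in ("Manager 0", "Manager 1", "Manager 2"):
    directory.record_read(0x100, name)

to_invalidate = directory.record_write(0x100, "Manager 0")
print(sorted(to_invalidate))
```

The point of the design is that only the recorded sharers receive a message, whereas a snoopy scheme has every cache watch every transaction.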
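The difference between write through and write back is easiest to see by watching when main memory changes. The sketch below is a simplified model with names of my own choosing (`Cache`, `evict`, and the `policy` strings are assumptions, not part of any Arm specification).

```python
# Simplified single-cache model contrasting write-through and write-back.

class Cache:
    def __init__(self, memory, policy):
        if policy not in ("write-through", "write-back"):
            raise ValueError(f"unknown policy: {policy}")
        self.memory = memory   # lower-level memory: address -> data
        self.policy = policy
        self.lines = {}        # cached blocks: address -> data
        self.dirty = set()     # addresses modified but not yet written back

    def write(self, addr, data):
        self.lines[addr] = data
        if self.policy == "write-through":
            self.memory[addr] = data   # memory is updated immediately
        else:
            self.dirty.add(addr)       # memory is updated only on replacement

    def evict(self, addr):
        # Replacing a block: a dirty write-back block is flushed to memory.
        data = self.lines.pop(addr)
        if addr in self.dirty:
            self.memory[addr] = data
            self.dirty.discard(addr)


memory_wt = {0x100: "1234"}
wt = Cache(memory_wt, "write-through")
wt.write(0x100, "ABCD")
after_wt_write = memory_wt[0x100]

memory_wb = {0x100: "1234"}
wb = Cache(memory_wb, "write-back")
wb.write(0x100, "ABCD")
before_evict = memory_wb[0x100]
wb.evict(0x100)
after_evict = memory_wb[0x100]

print(after_wt_write, before_evict, after_evict)
```

With write through, memory sees the new value right away. With write back, memory still holds the old value until the block is replaced.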
Under write broadcast or write update policy, when a local cache is updated, the interconnect broadcasts the modified value to all other shared cache systems at the time of modification. We’ll discuss the snoopy protocol in more detail later when we get into ACE transactions.
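The two snoopy policies can be compared side by side with a small Python sketch. The function name `snoop_write`, the manager labels and the dictionary-based caches are assumptions made for illustration only.

```python
# Toy comparison of write invalidate and write update (write broadcast).

def snoop_write(caches, writer, addr, data, policy):
    """Apply a write by `writer` and propagate it to the other caches."""
    if policy not in ("invalidate", "update"):
        raise ValueError(f"unknown policy: {policy}")
    caches[writer][addr] = data
    for name, cache in caches.items():
        if name == writer or addr not in cache:
            continue
        if policy == "invalidate":
            del cache[addr]       # other copies are dropped
        else:
            cache[addr] = data    # other copies receive the new value


def fresh_caches():
    # Three managers that have all read 0x100 and hold the same value.
    return {name: {0x100: "1234"} for name in ("M0", "M1", "M2")}


invalidated = fresh_caches()
snoop_write(invalidated, "M0", 0x100, "ABCD", "invalidate")

updated = fresh_caches()
snoop_write(updated, "M0", 0x100, "ABCD", "update")

print(invalidated)
print(updated)
```

Write invalidate leaves only the writer with a copy, so the others must fetch the line again on their next read. Write update keeps every copy current at the cost of broadcasting data on each write.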
INTRO TO ACE PROTOCOL
As we transition toward discussing the ACE protocol, let’s first cover how cache coherency and the write invalidate scheme solve the problem of an incoherent system. Consider the AMBA-based system example shown in Figure 3. In stage 1, all the managers read the data at address 0x100 and get 1234. At this stage, we call the cache state Shared Clean. In stage 2, Manager 0 issues a write request to this memory location. Manager 0 changes its state to Unique Clean, causing the other managers to change their state to Invalid, which removes the data at this location from their caches. This process is called a snoop. In stage 3, Manager 0 changes its data at 0x100, thereby changing its state to Unique Dirty and taking responsibility for updating main memory. In summary, for a standard snoop to happen, a manager transitions from Shared Clean to Unique Clean to Unique Dirty.
Now, let’s see what happens when Manager 1 reads data at 0x100. In stage 4, Manager 1 asks the interconnect for the data at 0x100. The interconnect then snoops the other caches for that line. Manager 2 is in the Invalid state, so it returns a NACK. Manager 0 is in the Unique Dirty state, so it transitions from Unique Dirty to Shared Dirty and shares its data for 0x100 with Manager 1. Manager 1 then moves from the Invalid state to Shared Clean. In summary, the MOESI protocol can be compared with AMBA as shown in Table 1.
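Stages 1 through 3 can be traced with a small state model. This is a sketch under my own assumptions: the `State` enum uses the five state names discussed in this article, while `gain_unique` and `local_write` are hypothetical helper names, not ACE transaction names.

```python
from enum import Enum


class State(Enum):
    INVALID = "Invalid"
    SHARED_CLEAN = "SharedClean"
    SHARED_DIRTY = "SharedDirty"
    UNIQUE_CLEAN = "UniqueClean"
    UNIQUE_DIRTY = "UniqueDirty"


def gain_unique(states, writer):
    """Stage 2: the writer becomes Unique Clean; every other copy is invalidated."""
    for name in states:
        states[name] = State.UNIQUE_CLEAN if name == writer else State.INVALID


def local_write(states, writer):
    """Stage 3: the writer modifies the line and becomes Unique Dirty."""
    if states[writer] not in (State.UNIQUE_CLEAN, State.UNIQUE_DIRTY):
        raise RuntimeError("writer must hold the line in a Unique state")
    states[writer] = State.UNIQUE_DIRTY


# Stage 1: every manager has read 0x100 and holds it Shared Clean.
states = {name: State.SHARED_CLEAN
          for name in ("Manager 0", "Manager 1", "Manager 2")}
history = [states["Manager 0"].value]

gain_unique(states, "Manager 0")
history.append(states["Manager 0"].value)

local_write(states, "Manager 0")
history.append(states["Manager 0"].value)

print(history)
```

The `history` list records Manager 0's path through the three stages, while the other two managers end up Invalid.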
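The sketch below captures the usual correspondence between the MOESI names and the ACE state names, followed by a toy version of the stage 4 read. The mapping reflects my understanding of the ACE state model and should be checked against Arm's specification. The `read_shared` helper is a hypothetical simplification: it always leaves the reader in Shared Clean, even when no other cache supplies the line.

```python
# MOESI state name -> ACE state name (as I understand the ACE state model).
MOESI_TO_ACE = {
    "Modified": "UniqueDirty",
    "Owned": "SharedDirty",
    "Exclusive": "UniqueClean",
    "Shared": "SharedClean",
    "Invalid": "Invalid",
}


def read_shared(states, reader):
    """Stage 4: snoop the other caches on behalf of `reader`.

    Returns the name of the manager that supplied the data, or None if
    every other cache was Invalid (the data would then come from memory).
    """
    supplier = None
    for name, state in states.items():
        if name == reader or state == "Invalid":
            continue  # an Invalid cache has nothing to offer
        supplier = name
        if state == "UniqueDirty":
            states[name] = "SharedDirty"   # keeps responsibility for memory
        elif state == "UniqueClean":
            states[name] = "SharedClean"
    states[reader] = "SharedClean"
    return supplier


states = {"Manager 0": "UniqueDirty", "Manager 1": "Invalid", "Manager 2": "Invalid"}
supplier = read_shared(states, "Manager 1")
print(supplier, states)
```

Manager 0 supplies the line and drops to Shared Dirty, Manager 1 gains a Shared Clean copy, and Manager 2 stays Invalid, matching the walkthrough above.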
Now that we know how the caches are handled by an AMBA-based system, let’s explore how the arbitration happens between the manager and subordinate in a typical ACE-controlled system. Figure 4 shows the channel specifications for the AMBA ACE protocol. ACE is an extension to AXI, so we see three channels for write: write address, write data and write response. And we see two channels for read: read data and read address.
What’s added in ACE are the snoop channels and the acknowledge signals. The snoop address (AC) channel is an input to the cached manager from the interconnect and is used for snooping the upstream manager’s cache. The snoop response (CR) channel is an output from the manager to the interconnect. It carries the cache’s response and indicates whether associated data should be expected on the snoop data channel. The snoop data (CD) channel carries data when, for example, a read snoop command is requested. In addition to all those signals, there are two more, WACK (write acknowledge) and RACK (read acknowledge), that go from the manager to the interconnect.
AXI VS. ACE
In my AXI article, I described the many signals that make up the AMBA AXI protocol. For AMBA ACE, those signals remain the same. However, there are a few additions and a few changes in the sideband signals, which I will discuss.
Reviewing the ACE specification, we have the following additional signals for the write channels. The write address channel has AWDOMAIN[1:0], which manages shareability; AWSNOOP[2:0], which selects the type of snoop; and AWBAR[1:0], which indicates barrier types. There is an additional AWUNIQUE signal that is for lower-level cache and indicates the removal of a cache line after completion of a transaction. Meanwhile, the following additional signals are present in the read address channel: ARDOMAIN[1:0], ARSNOOP[3:0] and ARBAR[1:0]. The read data channel adds bits [3:2] to RRESP; in other words, the signal is extended from 2 bits to 4 bits.
We saw the additional snoop channels of the ACE protocol in Figure 4. These channels use signals similar to those of the AXI channels. The snoop address (AC) channel has ACVALID, ACREADY, ACADDR[x:0], ACSNOOP[3:0] and ACPROT[2:0]. The snoop response (CR) channel has CRVALID, CRREADY and CRRESP[4:0], and the snoop data (CD) channel has CDVALID, CDREADY, CDDATA[x:0] and CDLAST. Note that VALID, READY and LAST work the same way in ACE as in AXI, so you can refer to my previous article on AXI for details.
ACE has the capability of dividing the system into shareability domains. These are classified as inner shareable, outer shareable and non-shareable. Inner shareable could be two tightly coupled CPU clusters. Outer shareable could be two managers that would like to perform cache maintenance operations. And non-shareable works something like DMA, where the manager wants to keep its local cache information to itself. All this shareability is controlled by the AxDOMAIN[1:0] signal. Understanding the various types of ACE transactions is beyond the scope of this article, and they can be explored further by reading Arm’s ACE specification. That said, let’s summarize how typical cache handling and data stalling are taken care of in ACE using the RACK and WACK signals.
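For quick reference, the snoop channel signals listed above can be collected into a small table and their widths derived from the bracket notation. The `ACE_SNOOP_CHANNELS` dictionary and the `width` helper are hypothetical names for this sketch. Widths written with `x` depend on the bus configuration, so the helper reports them as unknown.

```python
import re

# Snoop channel signals as listed in the article.
ACE_SNOOP_CHANNELS = {
    "AC": ["ACVALID", "ACREADY", "ACADDR[x:0]", "ACSNOOP[3:0]", "ACPROT[2:0]"],
    "CR": ["CRVALID", "CRREADY", "CRRESP[4:0]"],
    "CD": ["CDVALID", "CDREADY", "CDDATA[x:0]", "CDLAST"],
}


def width(signal):
    """Return the bit width of a signal, or None if it is configuration dependent."""
    match = re.fullmatch(r"\w+\[(\w+):(\d+)\]", signal)
    if match is None:
        return 1                      # no range given: a single-bit signal
    high, low = match.group(1), match.group(2)
    if not high.isdigit():
        return None                   # e.g. [x:0], set by the bus configuration
    return int(high) - int(low) + 1


widths = {
    signal: width(signal)
    for signals in ACE_SNOOP_CHANNELS.values()
    for signal in signals
}
print(widths)
```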
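A decoder for the two-bit AxDOMAIN field might look like the following. The encoding shown is my understanding of Arm's ACE specification and should be verified there. Note that a two-bit field has a fourth value, which the specification calls the system domain and which is not covered in this article. The `Domain` enum and `decode_axdomain` function are names I am assuming for illustration.

```python
from enum import IntEnum


class Domain(IntEnum):
    # Assumed encoding; verify against Arm's ACE specification.
    NON_SHAREABLE = 0b00
    INNER_SHAREABLE = 0b01
    OUTER_SHAREABLE = 0b10
    SYSTEM = 0b11


def decode_axdomain(bits):
    """Decode an AWDOMAIN[1:0] or ARDOMAIN[1:0] value into a Domain."""
    if not 0 <= bits <= 0b11:
        raise ValueError(f"AxDOMAIN is a 2-bit field, got {bits}")
    return Domain(bits)


print(decode_axdomain(0b01).name)
```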
Let’s say we have two managers: Manager 0 and Manager 1. Initially, both are in the Shared Clean state and hold the same data for address 0x100 in their caches. At this stage, Manager 0 issues a make unique transaction to the interconnect. Because of that, the interconnect issues a snoop to Manager 1 to invalidate its copy. As a result, Manager 1 sends a snoop response saying it has invalidated its copy of the data at 0x100.
The interconnect then forwards the make unique transaction to the buffered read channel. At this stage, Manager 1 issues a read shared transaction to the same address as the make unique. The read shared transaction is stalled until the make unique reaches Manager 0’s cache via the buffer. Read shared is acknowledged only once the make unique reaches cache 0 and a WACK signal is triggered. This prevents data collision.
Since the cache line is now in a unique state, it can be updated with new data, and the line is held until this process is completed and Manager 0 changes to the Unique Dirty state. Once the WACK signal is triggered, the snoop operation is allowed to proceed and the read shared transaction is sent on the AC channel. At this stage, Manager 0 changes state from Unique Dirty to Shared Dirty, and Manager 1 receives the data that was in the 0x100 location of Manager 0’s cache. Manager 1 then acknowledges with the RACK signal, and the system is coherent again.
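The ordering rule in this sequence, where a second transaction to the same address waits until the first has been acknowledged, can be modeled with a small queue. This is a sketch of the idea only. The `Interconnect` class and its methods are hypothetical, and a single `acknowledge` call stands in for whichever of RACK or WACK completes the transaction.

```python
from collections import deque


class Interconnect:
    def __init__(self):
        self.in_flight = set()   # addresses with an unacknowledged transaction
        self.stalled = deque()   # (manager, transaction, address) waiting
        self.log = []

    def request(self, manager, txn, addr):
        # A transaction to an address that is still in flight must wait.
        if addr in self.in_flight:
            self.stalled.append((manager, txn, addr))
            self.log.append(f"stall {txn} from {manager}")
            return False
        self.in_flight.add(addr)
        self.log.append(f"issue {txn} from {manager}")
        return True

    def acknowledge(self, addr):
        # The manager signals completion; release the oldest stalled request
        # for the same address, if there is one.
        self.in_flight.discard(addr)
        self.log.append(f"ack {addr:#x}")
        for item in list(self.stalled):
            if item[2] == addr:
                self.stalled.remove(item)
                self.request(*item)
                break


ic = Interconnect()
ic.request("Manager 0", "MakeUnique", 0x100)
ic.request("Manager 1", "ReadShared", 0x100)   # same address: stalled
ic.acknowledge(0x100)                          # MakeUnique completes
ic.acknowledge(0x100)                          # ReadShared completes

for entry in ic.log:
    print(entry)
```

The log shows the read shared request being held back and then issued only after the first acknowledgment, which is the behavior that prevents the two transactions from colliding.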
In this article, we reviewed various concepts of cache coherence and its types. We discussed examples of how cache stalling can happen and how it can be tackled. We also introduced Arm’s ACE protocol and showed how it helps maintain cache coherency.
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • OCTOBER 2021 #375