Introduction

I would like to thank Tom McFadden (mechBgon of the AnandTech forums) for the numerous hours he spent (during his vacation!) helping me edit this article into what it is now.

In early PCs, the various components had one thing in common: they were all really slow :^). The processor was running at 8 MHz or less, and taking many clock cycles to get anything done. It wasn’t very often that the processor would be held up waiting for the system memory, because even though the memory was slow, the processor wasn’t a speed demon either.

That’s a quote from PCGuide.

My, how times have changed. Now practically every chip is starved for memory bandwidth. From the Athlon to the PIII, the Celeron to the Duron, most all current microprocessors are outstripping the resources their memory subsystems provide. If not for the use of special high-bandwidth memory reserves called cache, both on and off the chip, these powerful processors would give pathetic performance simply because getting data to process, and instructions with which to process it, would be bottlenecked by the relatively snail-paced capability of the system’s memory, or RAM. It’s no accident that modern mainstream microprocessors have cache, even multiple layers of it, and we have all heard the term tossed around in relation to our CPU’s, but there’s a lot more to know about cache than simply the size.

Cache is an expensive type of memory, usually made out of SRAM (static ram), and it is distinctly different from the main memory that we’ve come to know (such as EDO, SDRAM, DDR SDRAM, DRDRAM, etc): it uses more transistors for each bit of information; it draws more power because of this; and it takes up more space for the very same reason.

What makes SRAM different? Well, first it must be noted that regular DRAM must be periodically “refreshed,” because the electrical charge of the DRAM cells decays with time, losing the data. SRAM, on the other hand, does not suffer this electrical decay. It uses more transistors per bit, allowing it to operate without losing its charge while a current is flowing through it. SRAMs also have lower latencies (the amount of time that it takes to get information to the processor after being called upon).

Cache and Architecture Terminology (Part 1)

Before we delve into some of the basics of cache, let us get a simple understanding of the whole memory subsystem, known as a memory hierarchy. This term refers to the fact that most computer systems have multiple levels of memory, each level commonly being of a different size, and different speed. The fastest cache is closest to the CPU (in modern processors, on the die), and each subsequent layer gets slower, farther from the processor, and (generally), larger. One of the reasons why each progressively-lower level on the memory hierarchy must be larger (we’ll look at some exceptions later) is due to the fact that each layer keeps a copy of the information that is in the smaller/faster layer above it. What this means is that the hard drive holds the information that’s in the RAM, which holds information that is in the cache, and if there are multiple layers of cache, this process keeps going. The reason for this is explained later. A graphical representation is shown below:

Diagram showing how each level’s information is stored in a layer below it in a traditional memory hierarchy

In a typical machine, L1 means the first level of cache (smallest and fastest), L2 is the second level of cache (larger, and slower), RAM is the main memory (much larger, and much slower still), and then the hard drive, incredibly slow and incredibly large in comparison to the other layers.

Before one can understand the performance differences that cache can play (aside from disabling it in the bios, and watching your precious Q3 or UT framerates drop to single digits, thus realizing that cache is important 😛 ), one must understand how it works.

However, there are times when this isn’t quite the case, such as when the computer is manipulating, and creating large amounts of data, and not just reading stuff from the computer. An example of this would be very scientific calculations, where they don’t write the disk much at all, and therefore, the data in the L1, L2, and main memory might not make it all the way to the hard drive, except in the situation where the results of all the computations need to be saved.

More than one reader pointed out that my previous definition of a cache-line wasn’t accurate:

[A] cache-line is the amount of data transferred between the main memory and the cache by a cache-line fill or write-back operation.

The size of the cache-line takes advantage of the principle called spatial locality, which states that code that is close together is more likely to be executed together. Therefore, the larger the cache-line size, the more data that is close together, and therefore, likely related, is brought into the cache at any one time. The CPU only requests a small piece of information, but it will get whatever other information is contained within the cache-line. If the cache is large enough, then it can easily contain the information within a large cache-line. However, if the cache is too small in comparison to the cache-line size, it can reduce performance (because sometimes, irrelevant information is in the cache-line, and takes up valuable space).
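To make the effect of cache-line size concrete, here is a minimal Python sketch that counts how many line fills a run of sequential accesses would need. It assumes the cache is large enough that nothing is evicted, and the function name `cold_misses` and the 1 KB access pattern are made up for illustration; the 32-byte and 64-byte line sizes are the ones quoted for the P3 and Athlon later in this article.

```python
def cold_misses(addresses, line_size):
    """Count the line fills needed to touch every address once.

    Assumes the cache is big enough that no line is ever evicted.
    """
    # Every address inside the same line shares one fill.
    return len({addr // line_size for addr in addresses})


# 256 sequential 4-byte accesses covering 1 KB (illustrative pattern)
words = range(0, 1024, 4)
fills_32 = cold_misses(words, 32)
fills_64 = cold_misses(words, 64)
print(fills_32, fills_64)
```

With neighbouring data like this, doubling the line size halves the number of fills, which is spatial locality at work.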

Latency refers to the time it takes a task to be accomplished, expressed in clock cycles from the perspective of the device’s clock. For instance, 100mhz SDRAM with a latency of 9 cycles with a 1ghz CPU means a latency of 90 cycles to the CPU! In the case of cache and memory, it refers to the amount of time that it takes for the cache (or memory) to send data.
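The SDRAM example above is just a clock-ratio conversion, which the following Python sketch spells out. The function name `latency_in_cpu_cycles` is my own invention, not anything standard.

```python
def latency_in_cpu_cycles(device_cycles, device_mhz, cpu_mhz):
    """Convert a latency counted in the device's clock cycles into CPU cycles."""
    # One device cycle lasts (cpu_mhz / device_mhz) CPU cycles.
    return device_cycles * cpu_mhz / device_mhz


# 9 cycles of 100 MHz SDRAM as seen by a 1 GHz (1000 MHz) CPU
sdram_latency = latency_in_cpu_cycles(9, 100, 1000)
print(sdram_latency)
```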

A cache hit refers to an occurrence when the CPU asks for information from the cache, and gets it. Likewise, a cache miss is an occurrence when the CPU asks for information from the cache, and does not get it from that level. From this, we can derive the hit rate, or the average percentage of times that the processor will get a cache hit.

So enough already, you say, what’s the bottom line? How is cache helping my framerate in Quake 3 Arena? Well, it turns out that there are still some more important concepts that we have to cover first. Firstly, there’s a term in computer science / computer architecture known as locality of reference: Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends [about] 90% of its execution time in only [about] 10% of the code (Hennessy and Patterson, 38). Well, if a processor is only using 10% of the code most of the time, why not keep that information really close to the processor so that it can get access to that information the fastest? That’s exactly what cache is used for.

It is not that simple though. Cache can be designed many different ways. They are discussed below as defined by PCGuide:

Direct Mapped Cache: Each memory location is mapped to a single cache line that it shares with many others; only one of the many addresses that share this line can use it at a given time. This is the simplest technique both in concept and in implementation. Using this cache means the circuitry to check for hits is fast and easy to design, but the hit ratio is relatively poor compared to the other designs because of its inflexibility. Motherboard-based system caches are typically direct mapped.

Fully Associative Cache: Any memory location can be cached in any cache line. This is the most complex technique and requires sophisticated search algorithms when checking for a hit. It can lead to the whole cache being slowed down because of this, but it offers the best theoretical hit ratio since there are so many options for caching any memory address.

N-Way Set Associative Cache: “N” is typically 2, 4, 8 etc. A compromise between the two previous designs, the cache is broken into sets of “N” lines each, and any memory address can be cached in any of those “N” lines. This improves hit ratios over the direct mapped cache, but without incurring a severe search penalty (since “N” is kept small). The 2-way or 4-way set associative cache is common in processor level 1 caches.
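The three mapping schemes can be sketched with one toy model in Python. This is a simplified illustration, not how any real cache is built: the class name, the sizes, and the access pattern are all hypothetical, and it assumes least-recently-used replacement within each set (a policy this article covers further on). Setting `ways=1` gives a direct mapped cache, and `num_sets=1` gives a fully associative one.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set associative cache with LRU replacement in each set."""

    def __init__(self, num_sets, ways, line_size=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One OrderedDict per set: keys are tags, oldest entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up an address; return True on a hit, False on a miss."""
        line_number = address // self.line_size
        index = line_number % self.num_sets   # which set the line maps to
        tag = line_number // self.num_sets    # identifies the line within the set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)            # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)         # evict the least recently used line
        lines[tag] = True
        self.misses += 1
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Two addresses that collide in the direct mapped case, accessed alternately.
pattern = [0, 128] * 4
direct = SetAssociativeCache(num_sets=4, ways=1)   # 4 lines, direct mapped
two_way = SetAssociativeCache(num_sets=2, ways=2)  # same 4 lines, 2-way
for addr in pattern:
    direct.access(addr)
    two_way.access(addr)
print(direct.hit_rate(), two_way.hit_rate())
```

Both caches hold four lines in total, yet the direct mapped one keeps evicting the line it is about to need, while the 2-way one keeps both addresses in the same set.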

As a rule of thumb, the more “associative” a cache is, the higher the hit rate, but the slower it is (thus higher latencies result as the cache is ramped up to high clock speeds). However, the larger the cache, the less of an impact associativity has upon hit rate. While associativity (read: complexity) plays a role in the speed of a cache, and thus the latencies, so does size. Given the same latencies, the more associative a cache is, generally the higher the hit rate and the better the performance. However, the larger the cache, the more difficult it is to get it to reach both high clock speeds and low latencies. Clock speed is very important in terms of bandwidth, and latencies and bandwidth go hand in hand.

The bandwidth in current caches is expressed in terms of gigabytes per second, or Gbytes/sec. This is calculated by the bus width (for example, 64 bits which is 8 bytes, or 256 bits which is 32 bytes, though many other combinations are possible), times the clock speed, or Clock * (bus width) = bandwidth. This is very important for processors, because, as it has been shown, processors are requiring more and more bandwidth even at the same clock speeds, because they are continuing to be able to crunch through more and more numbers each clock (though this is diminishing in some cases, but that’s not the scope of this article).
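As a quick illustration of the clock * bus width formula, here is a small Python sketch. The 500 MHz clock is a made-up figure chosen only to give round numbers, and the function assumes one transfer per clock.

```python
def bandwidth_bytes_per_sec(clock_hz, bus_width_bits):
    """Peak bandwidth = clock * bus width, assuming one transfer per clock."""
    return clock_hz * (bus_width_bits // 8)


CLOCK_HZ = 500_000_000  # hypothetical 500 MHz cache clock

narrow = bandwidth_bytes_per_sec(CLOCK_HZ, 64)    # 8 bytes per transfer
wide = bandwidth_bytes_per_sec(CLOCK_HZ, 256)     # 32 bytes per transfer
print(narrow / 1e9, "Gbytes/sec vs", wide / 1e9, "Gbytes/sec")
```

At the same clock, the 256-bit bus delivers four times the peak bandwidth of the 64-bit one.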

There are even more variations in L1 cache design, usually between a Harvard Architecture cache, which is a cache that is split between a data cache (also called Dcache, which is the stuff to be computed) and an instruction cache (also called an Icache, which is the information about how the data is to be computed), and a Unified cache, which is where both data and instructions can be stored in the one cache. While a Unified cache might sound better, as a reader pointed out to me, dealing with cache, as stated in the beginning, is a bandwidth issue. I quote from him:

The Harvard architecture is an effort to reduce the number of read ports that a cache needs to have. In other words, it is a bandwidth issue. If we can separate tasks, then they should be separated, so they don’t compete for bandwidth. Otherwise, we will have to have a higher bandwidth interface. A separate data and instruction cache is approximately equivalent to a dual(read)-ported unified cache. (the unified cache would still be slower, as the array would be bigger).

So for the sake of performance, the Harvard architecture is better, due to its more efficient use of bandwidth.

Just as pipelining a processor allows it to reach high clock speeds (an excellent discussion on this can be found at http://www.aceshardware.com/Spades/read.php?article_id=50), caches too can be pipelined. What this means is that, assuming a cache hit, the cache can start sending more information from another request before the information from the first request has gotten to the registers of the CPU. This is certainly advantageous because it allows the cache to be designed with higher latencies and reach higher clock speeds, while still getting maximum bandwidth at the same time. Obviously, a lower latency is more desirable, but it is not always possible depending upon cache size, associativity, and target clock frequency (more on this later).

Most caches today are pipelined, and consequently there are some hazards. One such hazard is a stall on a cache miss. On a cache miss, normally the cache would not be able to continue to send data until the required data was finally received (from a lower level of the memory hierarchy). This is called a blocking cache, and it can be thought of as the cache equivalent of an in-order CPU design. The solution? A non-blocking cache 😉 This can be thought of as an out-of-order cache design, because the data doesn’t have to arrive at the CPU in the same order that it was requested, just like an out-of-order CPU design does not have to execute instructions upon data in the same order that it was received. This is called a “hit-under-miss” optimization. If this is layered, meaning a cache can keep servicing hits while more than one miss is outstanding, this is called “hit under multiple-miss” or “miss under miss” (Hennessy and Patterson, 414).

Now that some design types have been discussed, the next item on the agenda is how to determine what gets to be in the cache.

As stated before, for highest performance, one would want to design a CPU with caches that allow the most recently used information to be nearest the processor so that it has fastest access to the needed information. Well, let us assume a scenario where the L1 cache (the one closest to the CPU) is full. Now assume that the cache does not contain the data needed. Therefore, the processor goes to the L2 cache. Since the data has been used recently (namely, right now), the CPU assumes it will probably be used again fairly soon, and it wants to copy the data to the L1 cache. Since the L1 cache is already full, how does it decide what to get rid of? There are two options. The processor could randomly boot some information from the cache. Alternatively, going back to the concept of locality of reference, the CPU can do the converse of this, and look for the data that has not been used lately. This latter option is called the least-recently-used (LRU) method. This method, as the name implies, boots out the information that has gone the longest without being used, out of all the information in the cache.
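Here is a rough Python sketch of how LRU picks its victims, assuming a tiny cache where any item can go in any slot. The function name `lru_evictions` and the access pattern are invented for the example; real hardware typically tracks recency with a few bits per line rather than an ordered list.

```python
from collections import OrderedDict


def lru_evictions(accesses, capacity):
    """Replay accesses through a small cache and return the victims in order."""
    cache = OrderedDict()  # least recently used entry sits first
    victims = []
    for item in accesses:
        if item in cache:
            cache.move_to_end(item)  # a hit refreshes the item's recency
            continue
        if len(cache) == capacity:
            victim, _ = cache.popitem(last=False)  # boot the stalest entry
            victims.append(victim)
        cache[item] = True
    return victims


victims = lru_evictions(["A", "B", "C", "A", "D", "B"], capacity=3)
print(victims)
```

Note that "A" survives the arrival of "D" even though it was loaded first, because it was touched again in the meantime; "B" is the one that gets booted.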

Newer processors use what is called a victim cache (or alternatively, the victim buffer), which simply holds information that was kicked out (the “victim”) of the L1 cache. When the LRU policy kicks data out of the L1 cache, it goes first to the victim buffer, and then on to the L2 cache. On an L1 miss, the CPU searches the victim cache as well, as it is sure to hold the most recently evicted information from the L1 (remember, although it is “older” than the data in the L1, it is not as old as the data currently in the L2, which is full of information previously evicted, and it is therefore more likely to be needed again). It should be noted that this is used only in an exclusive architecture (explained later).

What about when the CPU alters the information that it got from the cache? There are generally two options employed, write-through cache, and write-back cache. The first term means that the information is written both to the cache, and to the lower layers of the memory subsystem, which takes more time. The second term means that the information is written only to the cache, and the modified information is only written to a lower level when its cache line is being replaced. Write-back is faster because it does not have to write to a lower cache, or to main memory, on every store, and thus is often the mode of choice for current processors. However, write-through is easier, and better for multiple CPU based systems because all CPU’s see the same information. Consider a situation where the CPU’s are using the same information for different tasks, but each has different values for that information. That’s one reason why write-through is used, since it alleviates this problem by making sure the data seen by all processors remains the same. Write-through is also slower, so for systems that do not need to worry about other CPUs (read: uniprocessor systems), write-back is certainly better.

Furthermore, a feature called a dirty-bit allows fewer writes to be made back to memory, since it doesn’t need to write back over the copy in memory if the block is “clean” (unmodified). If the bit is “dirty” (modified), then the write occurs.
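The difference between the two write policies, and the role of the dirty bit, can be sketched in Python by counting how many writes actually reach the lower level for a single cached block. The function name and the event list are made up for illustration.

```python
def count_memory_writes(operations, policy):
    """Count writes that reach the lower level for one cached block.

    operations: sequence of "write" or "evict" events on that block.
    policy: "write-through" or "write-back".
    """
    memory_writes = 0
    dirty = False  # the dirty bit: set once the cached copy is modified
    for op in operations:
        if op == "write":
            if policy == "write-through":
                memory_writes += 1  # every store also goes to the lower level
            else:
                dirty = True        # write-back only updates the cached copy
        elif op == "evict":
            if policy == "write-back" and dirty:
                memory_writes += 1  # a modified block is written back once
            dirty = False           # a freshly loaded block starts out clean
    return memory_writes


ops = ["write", "write", "write", "evict", "evict"]
through = count_memory_writes(ops, "write-through")
back = count_memory_writes(ops, "write-back")
print(through, back)
```

Three stores cost three lower-level writes under write-through but only one under write-back, and the second eviction costs nothing because the block is clean by then.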

Going back to the information about how each larger, slower layer of the memory hierarchy has the information contained within the smaller, faster layer of memory above it, this is not always the case. Most processor families, including Celerons, Pentium IIs and Pentium3s, and the K6-x, do use this general system, which is called an inclusive cache design. Each layer has a copy of the information contained within the layer above it. Notable exceptions are the AMD Thunderbird and Duron families, which are examples of an exclusive cache design, marketed by AMD as their “performance-enhancing cache” design.

Exclusive cache designs mean that the information contained within one layer is not contained within the layer above it. In the Thunderbird and Duron, this means that the information in the L2 cache is not contained within the L1 cache, and vice versa.

Why is this advantageous to use an exclusive cache? Let us consider Intel’s Celeron2 and AMD’s Duron. The Celeron has 32k of L1 cache, equally split between data and instruction caches. It has a unified, on-die 128k L2 cache. So we add the two together, and we have 160k of on-die cache, right? Technically, yes. However, the information in the L1 is duplicated in the L2, because it is an inclusive design. This means that it has 32k of L1, and 96k of effective L2 (because 32k of the information stored in the L2 is the same as the L1), for a total of 128k of useful cache. Contrast that with the Duron, which has 128k of L1 cache, equally split between data and instruction caches. It has a unified, on-die L2 cache that is 64k in size. What!?! Only 64kb? If the Duron were designed inclusively, this defeats the purpose of the L2 cache. In fact, it would defeat the purpose of adding an L2 cache if AMD designed it with a L2 cache size that was equal to, or less than, the size of the L1. It works this way in all exclusive designs (including the never-released Joshua version of the Cyrix III). So AMD made the design exclusive (as they did with the Thunderbird). The Duron has 128kb of L1, plus 64kb of L2, and because neither one contains the same information, one can just add the two together for the effective on-die cache, which amounts to 192kb. The Duron’s is obviously larger.
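The Celeron and Duron arithmetic above can be written as a tiny Python function. This is only the simplified capacity model used in this article (an inclusive L2 duplicates everything in the L1); the function name is my own.

```python
def effective_cache_kb(l1_kb, l2_kb, inclusive):
    """Unique on-die cache capacity, in KB, for a two-level hierarchy."""
    if inclusive:
        # The L2 duplicates the L1's contents, so only the larger level counts.
        return max(l1_kb, l2_kb)
    return l1_kb + l2_kb  # exclusive: no duplication, so the capacities add


celeron = effective_cache_kb(32, 128, inclusive=True)
duron = effective_cache_kb(128, 64, inclusive=False)
duron_if_inclusive = effective_cache_kb(128, 64, inclusive=True)
print(celeron, duron, duron_if_inclusive)
```

The last value shows why an inclusive Duron would make no sense: its 64k L2 would add nothing beyond what the 128k L1 already holds.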

So, let us consider this again using some diagrams (not to scale of course):

Inclusive L2

Exclusive L2 – L1 relationship

What must be discussed (this exclusive diagram is a change from the first version of it) is the nature of the exclusivity of the cache: in AMD’s Duron, the relationship is only a L1/L2 relationship. What this means is that, it is solely exclusive between the L1 and L2, and that the information in the L1 is duplicated in the main memory (and, potentially, though not always or often, in the hard drive), yet not in the L2 cache.

Moving from the Celeron2 and Duron as examples, let’s now look at their more-powerful siblings, the AMD Thunderbird Athlon and the Intel Pentium3 Coppermine. The Thunderbird has an on-die, exclusive 256kb L2 cache, and the Coppermine P3 features an on-die, inclusive 256kb L2 cache. Caches take up enormous numbers of transistors. The P3 went from about 9.5 million transistors to about 28 million just from adding 256k of L2 cache. The Athlon had about 22 million, and adding 256k of L2 cache made the Thunderbird weigh in at a hefty 37 million. These represent rather large fractions of the transistors used in two top-of-the-line x86 processors, which are there solely to feed these processors’ hungry execution units.

Considering these caches take up so much space, they increase die sizes significantly, which is not a good thing because die size plays a crucial role in yields. In fact, the formula for die yields can be expressed as:

 

Die yield = (1 + (((defects per mm^2) * (Die area in mm^2))/n))^-n where n is equal to the number of layers of metal in the CMOS process (assuming wafer yields are 100%) (Hennessy and Patterson, 12).

As you can see, the larger the die, the lower the yields. I add this in only because the original Athlon on a .18 micron process was 102 mm^2, and the Thunderbird was about 20% larger at 120mm^2. This increase in die size is bad because it reduces yields. From the standpoint of economy vs. performance, there need to be good reasons to put the L2 cache on-die, and there are.
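To see the yield formula in action, here is a Python sketch comparing the two die sizes mentioned above. The defect density and the value of n are purely hypothetical placeholders, since the real process figures are not given here; only the direction of the result matters.

```python
def die_yield(defects_per_mm2, die_area_mm2, n):
    """Die yield = (1 + (defect density * die area) / n) ** -n.

    Wafer yield is assumed to be 100%, as in the formula above.
    """
    return (1 + (defects_per_mm2 * die_area_mm2) / n) ** -n


# Hypothetical process parameters, picked only to make the comparison concrete.
DEFECTS_PER_MM2 = 0.01
N = 6

athlon = die_yield(DEFECTS_PER_MM2, 102, N)       # original .18 micron Athlon
thunderbird = die_yield(DEFECTS_PER_MM2, 120, N)  # Thunderbird with on-die L2
print(f"Athlon: {athlon:.1%}  Thunderbird: {thunderbird:.1%}")
```

Whatever the actual defect density is, the larger die always comes out with the lower yield.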

Now that Intel has moved a majority, and AMD the entirety, of their production to socket chips, they do not really need to put the CPUs on expensive PCBs. This is because the cache was formerly placed on the cartridge, but off-die, and those cache chips have now been replaced with on-die cache. There are other benefits besides cost however: in this case, performance.

Despite the common misconception that electricity flows at the speed of light, it does not. Electrical signals certainly travel at speeds far greater than the speed of sound, but they propagate at a finite speed that is lower than that of light (and the electrons themselves drift far more slowly still), and this fact has an impact upon the design and performance of processors. Why mention this? One must remember that computers only deal with information in low and high voltages of electricity. The speed of any given part of a computer is, at the very least, bound by the speed at which electricity can be transmitted across whatever medium it is on. This, in turn, shows us that the ultimate performance bottleneck is necessarily the speed at which electricity can move. This is also the case for cache.

I can recall one of my friends’ 486 DX 33s (hey Adam! If you are reading this… this is YOUR 486 motherboard…. the one we played far too much Duke Nukem 3D on 😉 ) which had a very large (for the time) 256k of L2 cache. This meant, of course, that it was off-die on the motherboard, since no slot 486’s were around. That cache was comparatively slow, due in part to the fact that data had farther to travel to get to the processor. Because electricity does not travel instantaneously (though it may seem that way when sticking your finger in an electrical outlet – I swear I’ve never done that! ), the latency involved with caches is also tied to how far the data has to travel. In this case, even if the latency of the cache itself were zero (not possible – you cannot send anything instantaneously), latency would still be incurred because of the fact that it takes time to transmit the data to the processor.

Another thing must be explained about the latency of a cache. For example, while the latency of the L1 cache is precisely the number given, the L2 cache isn’t just the number of clocks that it takes for the L2 to give the desired information to the core. It is the latency of the L2 cache, plus the latency of the L1. This happens because in most cases (some exceptions include the Itanium which bypasses the L1 cache for FPU data and goes directly to the L2 instead), the processor has to wait for a L1 miss to occur before searching the L2 for the information. So the latency of the L2 is the latency of the L1 cache, plus the timing required for a L2 hit. This works its way down the memory hierarchy.
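The way latencies stack up down the hierarchy can be sketched in a few lines of Python. The cycle counts used here are the Coppermine P3 figures quoted later in this article (3 cycles for the L1, 4 more for an L2 hit); the function name is hypothetical.

```python
def hit_latency(level_costs, hit_level):
    """Cycles until data arrives when the hit happens at hit_level (0 = L1).

    level_costs[i] is the extra time spent at level i, and every level
    above the one that hits has to miss first.
    """
    return sum(level_costs[: hit_level + 1])


coppermine = [3, 4]  # L1 lookup, then the additional cost of an L2 hit
l1_hit = hit_latency(coppermine, 0)
l2_hit = hit_latency(coppermine, 1)
print(l1_hit, l2_hit)
```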

Socket 5 and 7 motherboards continued to use (and actually popularized) external (L2) caches. Slot 1 evolved and continued the use of external caches. Slot 1, Slot 2, and Slot A also allow a “backside bus,” which in this case runs at certain fractions of the core clock speed and transmits data more frequently than when the L2 cache was on the motherboard, which was a great boon for performance. Socket 8, (which chronologically should be placed after Socket 7 but before Slot 1, as it housed the Pentium Pro), was an oddball with on-package, but off-die, cache, which is better than merely being on the same daughter card as the CPU, but not as good as being truly integrated onto the same die. It was, in fact, cost prohibitive, hence one of the reasons for the slot architecture.

Those who are halfway observant of the CPU industry have surely noticed the shift from socket to slot, and from slot back to socket with the advent of the Socket370 Celeron, and subsequently the socketed Pentium IIIs. AMD is doing likewise with their Thunderbird Athlons and their Durons. There’s a good reason for this about-face: it’s now technically possible and economically desirable to build these processors with the cache on the CPU die due to the smaller fabrication processes. When the processes were larger, caches (and everything else of course) took up more space, and thus drew more power and generated more heat as well, which are detriments to clock speed.

Because process technologies have gotten to the point where it can be commonplace to have an L2 cache on the same die as the processor, many designers have started to do this in practice (the Alpha team put on 96K of L2 cache on-die back in a .35 micron process, but they’re not in the cut-throat, relatively low-margin consumer PC x86 world). The benefit from this is threefold: manufacturers no longer need to purchase expensive SRAM’s for their L2 cache, which lowers costs even while taking the larger dies into consideration, since being able to remove the cartridge and get rid of the external SRAM’s saves quite a bit; because the L2 cache is now on-die, there is far less distance for the information to travel from the cache to the registers, and thus, much lower latencies; and, if the architects feel it worthwhile to dig back into the core and make a wider bus interface (say, from 64 to 256 bits wide), they can massively increase bandwidth. Recall that bandwidth = bus width * transfers/second, and I say transfers/second because when something states 300mhz DDR, it is sometimes difficult to determine if it is 150mhz that has been “double pumped” for 300mega-transfers per second, or if it is a 300mhz, but at double data rate, meaning 600mega-transfers per second.
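The “300mhz DDR” ambiguity is easier to see with numbers, so here is a small Python sketch of the two possible readings. The 64-bit bus width is an arbitrary assumption for the example, and the function names are made up.

```python
def transfers_per_sec(clock_mhz, transfers_per_clock):
    """Mega-transfers per second for a bus moving data N times per clock."""
    return clock_mhz * transfers_per_clock


def bandwidth_mb_per_sec(bus_width_bits, mega_transfers):
    """bandwidth = bus width * transfers/second, in Mbytes/sec."""
    return (bus_width_bits // 8) * mega_transfers


BUS_BITS = 64  # hypothetical bus width

reading_a = transfers_per_sec(150, 2)  # 150 MHz clock, double pumped
reading_b = transfers_per_sec(300, 2)  # 300 MHz clock at double data rate
bw_a = bandwidth_mb_per_sec(BUS_BITS, reading_a)
bw_b = bandwidth_mb_per_sec(BUS_BITS, reading_b)
print(bw_a, bw_b)
```

The two readings of the same marketing number differ by a factor of two in bandwidth, which is why transfers per second is the safer unit.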

This does not, of course, stop anyone from making CPUs with more than 2 levels of cache. Makers of enterprise-class server chips have done it many times, and the PC market saw it with the advent of the K6-III, where the L2 cache on the motherboard simply became the L3 cache, and was effectively a small bonus. The Itanium (which, it now appears, will not even become a true server chip, as it has basically become a testing platform) uses 3 levels of cache, two on the CPU, and one on the PCB which is on the cartridge. I even recall seeing on the ‘net a 4-way Xeon motherboard that had 128mb of L3 cache!

With all these “basics” down, the impact that all of these varying factors have on performance can now be taken into account.

Because AMD and Intel are the current major players in the x86 world (Cyrix’s presence is nearly non-existent right now, the same with Centaur, and another upstart company, RISE, never made it to the PC world), I will focus on them, and their current mass-produced flagship chips.

When the Athlon was first introduced, it smashed its rival, the P3, in most everything. Aside from the Athlon’s ability to execute far more instructions per clock cycle than the P3, the Athlon was also better fed. The Athlon’s L1 cache was 4 times the size of the P3’s, and thus the more numerous execution units on the Athlon could be better fed. Let’s take a closer look:

Similarities:

  • 3-cycle latency
  • Harvard Architecture
  • LRU for data
  • Dual ported (meaning, it can do a read and a write to the cache at the same time)

Differences (Athlon vs. P3):

  • 128kb Harvard Architecture (even split) vs. 32kb Harvard Architecture (even split)
  • 2-way associative v. 4-way associative
  • 64byte lines v. 32byte lines
  • 3 Pre-decode-bits v. No pre-decode bits (saves time in decoding x86 code into the RISC-ish ops that current MPU’s use – for the Athlon this adds 24kb of cache [to make up for the 3 bits used in each byte])
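The 24kb figure for the pre-decode bits can be checked with a little arithmetic in Python. This assumes the pre-decode bits apply only to the instruction half of the evenly split 128kb L1, which is my reading of the numbers above rather than something stated outright.

```python
KB = 1024

l1_total_bytes = 128 * KB
icache_bytes = l1_total_bytes // 2   # even split between instructions and data
predecode_bits_per_byte = 3

extra_bits = icache_bytes * predecode_bits_per_byte
extra_kb = extra_bits // 8 // KB     # bits -> bytes -> KB
print(extra_kb)
```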

Now, moving to the L2 cache, at first glance, the L2 cache of the Athlon appears the same as that of the P3, as both L2s have the following in common:

  • 512k of L2 cache (unified)
  • Non-blocking
  • Ran at ½ clock (Later Athlons used 2/5 and 1/3)
  • Inclusive
  • On cartridge
  • 64 bit bus

However, the similarities end there. The differences in the L2 cache are as follows (Athlon first, P3 second – latencies taken from Aceshardware at [http://www.aceshardware.com/Spades/read.php?article_id=86 ]):

  • 24-cycle L2 latency v. 27
  • 2-way associative v. 4-way
  • 64byte lines vs. 32byte lines

Now let us move to the cache styles of the P3 Coppermine and the Thunderbird Athlon. The L1 caches remain the same, however, the L2 architecture on both has changed.

Similarities:

Differences (Athlon vs. P3):

  • Exclusive v. Inclusive
  • 16-way associative v. 8-way associative
  • 8-entry 64-byte-line “victim cache” v. None – not an exclusive design
  • 11-cycle latency (though, as high as 20 in worst case situations) v. 7 cycles
  • 64-bit bus v. 256-bit bus
  • Data sent every cycle v. every other cycle.

As was seen with the Mendocino Celerons (the incarnation of the Celeron which featured 128k L2 cache at full core speed, and approached the performance of the Pentium II in many situations), speed can often make up for size. Intel used this same idea with the Coppermine P3. With the introduction of the Coppermine P3, Intel caught back up to, and in some cases surpassed, the Athlon in performance. This performance increase at the same clock speed over a Katmai P3 is almost exclusively due to the addition of the L2 cache onto the die of the core. It has a phenomenally low L2 latency for such high clock rates – 7 cycles (3 for a L1 miss, 4 for a L2 hit)! Even better, when Intel engineers looked at the (aging) P6 core, they opened it up, and increased the L2 bus width from 64 bits to 256 bits, and had the L2 cache operate at core speed, double that of the Katmai.

However, despite doubling the frequency, and quadrupling the bus width, of the L2 cache, the net bandwidth is not eight times that of the Katmai P3, but “merely” four times. While it is true that doubling frequency, and quadrupling bus width, would mean an eightfold increase in bandwidth, Intel made it so that the L2 cache does not send data on every cycle. Yet it is still full speed – take for example the K6 family. Its FPU ran at “full speed”, just as the P6 cores did. However, it had an average latency of two cycles for most floating point operations, as it wasn’t pipelined. This is a similar situation, where the L2 cache runs at full speed, but does not send information every cycle.
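The “eight times on paper, four times in practice” arithmetic can be laid out as a short Python sketch. It treats the Katmai’s L2 as the baseline of 1 in every respect, which assumes the Katmai’s cache sent data on every one of its own cycles; the function name is invented for the example.

```python
def relative_bandwidth(clock_ratio, width_ratio, transfers_per_cycle):
    """Bandwidth relative to a baseline with ratio 1 in every factor."""
    return clock_ratio * width_ratio * transfers_per_cycle


katmai = relative_bandwidth(1, 1, 1)          # baseline (assumed: data every cycle)
on_paper = relative_bandwidth(2, 4, 1)        # 2x clock, 4x width, data every cycle
coppermine = relative_bandwidth(2, 4, 0.5)    # same, but data only every other cycle
print(on_paper / katmai, coppermine / katmai)
```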

In an attempt to regain some lost ground that they had previously won against the P3, AMD introduced the “Thunderbird” Athlon, which has “Performance-Enhancing Cache,” a phrase that sounds suspiciously similar to the “Advanced Transfer Cache” buzzword that Intel coined for the P3 Coppermine.

Nearly everyone was anticipating that the L2 cache on the new Athlon would be markedly similar to that on the P3 Coppermine: it was expected to have a 256-bit-wide bus, perhaps eight-way associative cache, both like the Coppermine; and lower latencies, though no one expected it to be quite as low as the P3’s latencies. They also expected it to be exclusive. AMD surprised everyone in two ways. One, the cache was 16-way associative (higher hit rates) as opposed to the 8-way expected. Two, in addition to the large L1 cache and the fact that the design was exclusive, which makes for rather high hit rates (think of it as a 6-way 384kb L1.5 cache – this is a good thing), it came out with merely a 64-bit L2 bus, meaning that unlike the P3 Coppermine, whose L2-to-processor bandwidth increased fourfold, the new Athlon’s bandwidth increased by “only” a factor of between two and three (twofold over the half-speed L2-cache versions, threefold over the 1/3-speed L2-cache versions). Lastly, the latency of the Athlon did indeed decrease from 24 cycles to 11, but in some situations (see here), its latency shoots up to 20 cycles. The reason for this was as follows (from here):

…the AMD Athlon processor’s L1 cache is capable of efficiently handling most requests for data. As a result, the victim buffer can be drained during idle cycles to the L2 cache interface.

The reason that the L2 latency can shoot up to 20 cycles is, if the victim buffer (victim cache) is full, it has to send the information to the L2 cache before the L1 reads from the L2 (because the information contained in the victim buffer could be what is being requested), and the time required to send the data from the buffer to the L2 cache is 8 cycles (the same as it is for going from the L2 to the L1). When in the L2 cache, there is a total of about a 4 cycle turnaround, plus the 8 cycles to go back from the L2 cache. 8 + 8 + 4 = 20 cycles, which is exactly what was read from the Cachemem utility in the above link at Aceshardware when the memory footprint increased to a size greater than the L1, but smaller than the L2 cache sizes.

I lurk around the technical forums at Aceshardware a lot, and many people were quite shocked about the meager 64-bit L2 bus. AMD seems to have seen these reactions, as they released the above PDF file stating practically nothing more than why they chose the 64-bit bus and why the latency is in some cases more than 11 cycles (it has plenty of other information, but nothing that couldn’t be found elsewhere).

Because the L1 cache of the Athlon already has such a high hit rate, the amount of bandwidth required of the L2 cache wasn’t enough to warrant the time required to widen the bus any further. Time to market is a very important concept, one which AMD seems to have stolen from Intel as of late (Intel has seemingly forgotten this concept, but that’s another one of my digressions 😉 ). Another reason that many were surprised that the L2 bus width wasn’t increased was this: while AMD may state that the large L1 de-emphasizes the need for such high bandwidth L2, being exclusive in nature means that there is more traffic going through the bus because of the cache evictions – meaning, the L1 has to make sure that the L2 doesn’t have the same information, and it goes over the L2 bus in order to do that, thus increasing bus congestion.
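To put numbers on the “only two to three times” bandwidth gain mentioned earlier, here is a rough sketch of peak L2 bandwidth per core clock. It assumes one transfer per cache clock and that the older off-die L2 used the same 64-bit width (which is what the twofold and threefold figures imply); the function and variable names are hypothetical.

```python
from fractions import Fraction


def peak_bytes_per_core_clock(bus_bits: int, cache_to_core_ratio: Fraction) -> Fraction:
    """Peak L2 bytes per core clock, assuming one transfer per cache clock."""
    return Fraction(bus_bits, 8) * cache_to_core_ratio


BUS_BITS = 64  # L2 bus width quoted in the article

on_die = peak_bytes_per_core_clock(BUS_BITS, Fraction(1))          # full-speed L2
half_speed = peak_bytes_per_core_clock(BUS_BITS, Fraction(1, 2))   # older half-speed L2
third_speed = peak_bytes_per_core_clock(BUS_BITS, Fraction(1, 3))  # older 1/3-speed L2

print(on_die / half_speed)   # gain over the half-speed versions
print(on_die / third_speed)  # gain over the 1/3-speed versions
```

With the width held constant, the whole gain comes from the cache clock, which is why it tops out at the inverse of the old divider.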

Even though the P3 Coppermine doesn’t use an exclusive cache, it is only wasting 32kb of cache right now (one fourth the amount that the Athlon would be wasting if it were inclusive), and it doesn’t need the bandwidth for handling cache evictions to the L2. Instead, it needs very high bandwidth because it needs to be refilled quickly, and often, due to its smaller size.
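The “wasted” cache comparison comes down to simple bookkeeping, sketched below. The model is deliberately idealized: it assumes an inclusive L2 duplicates the entire L1 and an exclusive L2 duplicates nothing. The 128kb Athlon L1 is inferred from the 384kb total and the “one fourth” remark, and the helper names are invented for illustration.

```python
def unique_capacity_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """Distinct data the L1 and L2 can hold together, in KB (idealized)."""
    return l1_kb + l2_kb if exclusive else l2_kb


def duplicated_kb(l1_kb: int, exclusive: bool) -> int:
    """KB of L2 spent holding copies of what is already in the L1."""
    return 0 if exclusive else l1_kb


P3_L1_KB, P3_L2_KB = 32, 256           # figures used in the article
ATHLON_L1_KB, ATHLON_L2_KB = 128, 256  # 128 KB L1 inferred from the 384 KB total

print(unique_capacity_kb(ATHLON_L1_KB, ATHLON_L2_KB, exclusive=True))
print(unique_capacity_kb(P3_L1_KB, P3_L2_KB, exclusive=False))
print(duplicated_kb(P3_L1_KB, exclusive=False))
print(duplicated_kb(ATHLON_L1_KB, exclusive=False))  # hypothetical inclusive Athlon
```

The bigger the L1, the more an inclusive design gives up, which is why exclusivity matters far more to the Athlon than it would to the P3.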

I found a rather interesting quote by Paul DeMone over here, which also helps to explain the situation:

The EV5 [Alpha 21164] has a 96 KB L2 cache on-chip. The EV6 [Alpha 21264] only has one level of cache [on-die]. When you have one level of on-chip cache you have to make it as big and associative as possible while not letting it hurt clock rate (going off chip is murder on performance). So it is a compromise. But with two levels of on-chip cache you make the L2 as big and associative as possible and make low latency the number one priority for the L1.

Processors that have “large” 3 clock L1s along with on-chip L2’s (EV7, K7 t-bird) got that way because the L2 caches were add-ons in subsequent chip versions of the core and because of time to market concerns it wasn’t worth opening up the CPU pipeline and layout in order to go with a more optimal 2 level cache arrangement.

Dirk Meyer, the head architect of the EV6, also happened to subsequently become an employee of AMD, and the head architect of the K7 (the Athlon). This helps to explain some of the similarities in cache design between the Alpha 21264 and the Athlon.

How Cache Sizes Affect Yields

When looking back at the K6-III, the yields were relatively low. This is partly due to the much larger die size (recall the massive number of transistors that 256kb of on-die L2 cache uses). “Backup” cache-lines can be added to allow the disabling of defective cache-lines in the cache, but AMD didn’t use many, because adding additional backup cache-lines increases die size. So yields were relatively low, and to salvage chips with totally defective L2 caches, AMD just disabled the L2 and sold the chip as a K6-2 if the core was fine.
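To see why a bigger die hurts yield, here is a minimal sketch using the simple Poisson yield model, a common textbook simplification that this article does not itself use. The defect density and die areas are hypothetical values picked only to show the trend; they are not K6-III or K6-2 figures.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of defect-free dies under the simple Poisson yield model."""
    return math.exp(-die_area_mm2 * defects_per_mm2)


# Hypothetical numbers, chosen only to illustrate the trend.
DEFECTS_PER_MM2 = 0.005
CORE_ONLY_MM2 = 80.0      # die without on-die L2
CORE_PLUS_L2_MM2 = 120.0  # same core plus a large on-die L2

print(f"core only:    {poisson_yield(CORE_ONLY_MM2, DEFECTS_PER_MM2):.2f}")
print(f"core plus L2: {poisson_yield(CORE_PLUS_L2_MM2, DEFECTS_PER_MM2):.2f}")
```

This model counts any defect as fatal, so it ignores exactly what backup cache-lines buy you: a defect that lands in the cache array can be mapped out instead of killing the die.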

Intel has taken a different approach with the P3 Coppermine. If one half of the cache is bad, but the other is fine, Intel just uses a laser to fuse off the bad half of the cache and disable it. This saves the good part, and allows Intel to sell the chip as a Celeron. Beyond salvaging defective parts, they often just disable half of a good cache anyway to get a Celeron to sell. This is actually cheaper than designing a chip with half the L2 cache, because Intel does not have to create different masks and such for another chip.

One would think that AMD would do with their Duron and Athlon the same as Intel does with their Celeron and P3 Coppermine. However, they did not. Disabling a section of the cache also cuts the associativity in proportion to the fraction that was disabled. Thus the Celeron is only 4-way associative even though it comes from the same core as the 8-way P3 Coppermine. This reduces hit rate, and that’s not something that AMD was willing to do. Also, simply disabling a part of a cache merely to have a product for another market isn’t an efficient use of fabrication capacity. This is something that AMD has to deal with much more than Intel does, because AMD has only two fabs, while Intel has many more.
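The link between disabled cache and lost associativity can be sketched with the usual set-associative geometry (size = sets × ways × line size). The 32-byte line size is an assumption made for this example, and the function names are illustrative; the 256kb/8-way and 4-way figures are the ones from the text.

```python
LINE_BYTES = 32  # assumed cache-line size for this example


def cache_sets(size_kb: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache."""
    return (size_kb * 1024) // (ways * line_bytes)


def disable_ways(size_kb: int, ways: int, ways_disabled: int) -> tuple[int, int]:
    """Return (remaining size in KB, remaining ways) after fusing off ways."""
    remaining = ways - ways_disabled
    return size_kb * remaining // ways, remaining


full_size_kb, full_ways = 256, 8  # P3 Coppermine L2 as described in the article
cut_size_kb, cut_ways = disable_ways(full_size_kb, full_ways, ways_disabled=4)

print(cut_size_kb, cut_ways)               # half the size, half the ways
print(cache_sets(full_size_kb, full_ways)) # sets before
print(cache_sets(cut_size_kb, cut_ways))   # sets after: unchanged
```

Because the number of sets stays the same, the indexing logic does not change at all; the cost shows up purely as fewer places each line can live, hence the lower hit rate.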

What this amounts to is that the Duron and the Thunderbird have different cache sizes, yet the same architecture, latencies, and associativities. AMD implements more redundant cache-lines than they did on their K6-III series, so yields are much higher, plus the Thunderbird’s die is about the same size as the K6-III’s due to the smaller process technology (with the Duron being about 20mm^2 smaller).

An Aside for What Might Have Been…

While the quote by Paul DeMone wasn’t made with RISE (this company attempted to join the x86 PC world, but failed miserably) in mind, it is interesting to apply this concept to the mp6, their first x86 processor. The mp6 (which, interestingly enough, has been introduced “for the first time” twice! [for reference, I read about that at JC’s – http://www.jc-news.com/pc]) was introduced with a miserly 16kb of L1 cache. The last time that was found on an x86 processor was the original Pentium and the Cyrix 6x86. RISE did, however, have plans to add 256k of L2 cache on-die, and they even came out with a socket370 version (they had a GTL+ bus license from the manufacturer, as they run a fabless model) which never made it to market. The reason that this is so interesting is that they followed Paul DeMone’s statement about having a low latency L1, but they got carried away, as they didn’t have the large and fast L2 cache to back it up.

The core of the mp6 (without going into the gory details, which can be found here and here) had the ability to execute many instructions at once (it had a fully pipelined FPU even before the Athlon came out). But it was stuck with its tiny L1 cache, which, though it had an incredibly low latency (1 cycle!!!), was supported only by whatever cache was on the motherboard until the second incarnation of the mp6, the mp6II, could come out with integrated L2. This is why they claimed their PR rating (yes, it used it…) scaled so well with the FSB: the FSB was also the speed at which the L2 cache operates on socket 7 motherboards. If RISE had gotten the version out with L2 cache on-die, perhaps they would have had a very fast solution per clock, but alas, they did not, and they have been relegated to the appliance market (perhaps not a bad thing considering its growth potential) because of poor performance (see here).

PR rating and dismal cache architecture aside, it had another problem, namely the clock speed (which is indeed what necessitated the PR rating, much like AMD’s K5 and Cyrix’s 6x86 core derivatives). Given that it was bandwidth-starved, the addition of the L2 cache could have allowed it to continue to use the PR rating, but as it was, a small company often has great difficulty in getting odd bus speeds to become standard (look at Cyrix – they did it, but it took a while, and never truly became standard).

Conclusion

As you can see, there are numerous design tradeoffs when choosing a cache implementation. Some have worked out well, and such is the case with the P6 and Athlon derivatives. There have also been instances where a memory design, perhaps, tried to go too far, and forced a company out of certain markets despite a general architecture design that had the potential to do well.

As it stands for the future, it is likely that we will not be seeing larger and larger L1 caches on CPUs (with the exception being HP PA-RISC processors), because of the ability to economically integrate L2 cache on-die. Cyrix discussed this in a PDF file about the defunct Jalapeno*, which was to have a smaller L1 cache than Cyrix’s previous design. Their reasoning was that large caches wouldn’t allow for low latencies while maintaining high clock speeds, which is what the processor needs from an L1 cache, and the on-die L2 cache would give the high hit rates.

Intel is following this concept with their 32kb L1 in the Itanium. While the L1 Icache (Trace Cache) for the P4 is about 96k in size, it’s effectively about a 16kb Icache (I won’t get into this, because there is an excellent discussion of it here), plus a very low-latency (2 cycle) and small 8kb L1 Dcache. Both of these processors have high bandwidth L2 caches on-die. The “bigger is better” mentality in regards to cache caught on easily for those trying to gain a better understanding of computer architecture, since from the introduction of an L1 cache in the x86 world with the 486, the size of this cache has always increased, at least until now. This turn of events will do little else but confuse those who were led astray.

* A shout-out to PDF file collectors: I have “121507 (MDR Jalapeno).pdf” but I lost the one I’m referring to, which is a different one put out by Cyrix themselves… anyone want to send it to me?

Bibliography

Hennessy, Patterson. “Computer Architecture: A Quantitative Approach,” 1996.

“AMD Athlon ™ Processor and AMD Duron ™ Processor with Full-Speed On-Die L2 Cache Enabling an Innovative Cache Architecture for Personal Computing.” http://www.amd.com/products/cpg/athlon/pdf/cache_wp.pdf June 19, 2000.

http://www.aceshardware.com/cgi-bin/ace/tech.pl?read=9001

http://www.aceshardware.com/Spades/read.php?article_id=5000173

http://www.aceshardware.com/Spades/read.php?article_id=86

http://www.aceshardware.com/Spades/read_news.php?post_id=671&keyword_highlight=11+cycles

http://www.ixbt-labs.com/cpu/rise-mp6.html

http://www.jc-news.com/pc/article.cgi?Rise/mP6_Preview

http://www.realworldtech.com/page.cfm?ArticleID=RWT091000000000

http://www.sandpile.org/impl/k7.htm

http://www.sandpile.org/impl/p3.htm

About Dewwa Socc