Latencies

Latencies, latencies…

What makes DRAM slow or fast? Because SDRAM is a multi-bank architecture, the chipset can leave a previously accessed row of a given bank “open”.

If the next request accesses the same row, the chipset does not have to wait until the sense amps are charged; this is called a page hit. The RAS to CAS latency is therefore 0 cycles, and the output buffers will contain the right data after the CAS latency. In other words, on a page hit we only wait until the right columns are found on the sense amps, which already contain the requested row.

There are, however, other possibilities. The row requested by the chipset might not be the one that is open: a page miss. In that case the RAS to CAS latency will be 2 or 3 clock cycles, depending on the SDRAM’s quality. This is also the case that we discussed in the chapter on structure and operation.

If the chipset has left a certain row open on a certain bank, and the requested data is in a different row in the SAME bank, things get worse. The sense amps have to write back the old row before they can charge the new one. Writing back the old row takes a certain time, called the “precharge time” (tRP). This is the worst case.
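The three cases described above can be sketched as a small bookkeeping model. This is only an illustration: the `open_rows` dictionary, the function names and the case labels are assumptions made for this example, not how any particular chipset tracks open rows.

```python
# Hypothetical model: open_rows maps a bank number to the row that the
# chipset has left open in that bank. A missing entry means no row is open.

def classify_access(open_rows, bank, row):
    """Return which latency case applies to a request for (bank, row)."""
    open_row = open_rows.get(bank)
    if open_row is None:
        return "page miss"              # RAS to CAS delay + CAS latency
    if open_row == row:
        return "page hit"               # CAS latency only
    return "page miss, row open"        # precharge + RAS to CAS + CAS latency


def access(open_rows, bank, row):
    """Classify the request, then leave the requested row open in its bank."""
    kind = classify_access(open_rows, bank, row)
    open_rows[bank] = row
    return kind


if __name__ == "__main__":
    rows = {}
    for bank, row in [(0, 5), (0, 5), (0, 9), (1, 9)]:
        print(bank, row, access(rows, bank, row))
```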

Real-World Bandwidth

To understand the relationship between latency and bandwidth, take for example a PC100 SDRAM-222. The first 2 means that the CAS latency is 2 cycles, the second number indicates the RAS to CAS latency, and the third the precharge latency.
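As a small illustration of how such a three-digit rating breaks down, the sketch below splits it into its three timings. The `SdramTimings` type and `parse_timings` function are names invented for this example, and accepting a dashed form such as "3-2-2" is an added convenience, not something the rating itself requires.

```python
from typing import NamedTuple


class SdramTimings(NamedTuple):
    cl: int    # CAS latency, in clock cycles
    rcd: int   # RAS to CAS latency, in clock cycles
    rp: int    # precharge latency, in clock cycles


def parse_timings(code: str) -> SdramTimings:
    """Split a rating such as '222' (or '2-2-2') into CL, RCD and RP."""
    digits = code.replace("-", "")
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"expected three single-digit timings, got {code!r}")
    cl, rcd, rp = (int(d) for d in digits)
    return SdramTimings(cl, rcd, rp)


if __name__ == "__main__":
    print(parse_timings("222"))
```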

Let us summarize the different kinds of latencies. In this example we will consider what happens if a cache miss occurs and the CPU is waiting impatiently for the right data. In other words, the case we study is a cache refill that starts a memory read.

Write backs to memory are far less interesting, as they do not stall the CPU. Writes can be buffered: for example, the KX-133 chipset has four cache lines (32 quadwords) of CPU to DRAM write buffers. With such a high-speed FSB (200 MHz instead of 133 MHz) and the write buffers, the CPU can pump results into the buffers in the chipset and continue with its work. The chipset will take care of the transfer from the buffers to main memory when the memory bus is less saturated. Bottom line: memory reads are more interesting.

Back to our table, which should give you an overview of the different latencies and when they occur. The third column lists the latencies that occur in the situation described in the first column. For example, in the case of a page miss, we have to wait 2 cycles until the row is charged (RAS to CAS Delay, RCD), and then another 2 cycles until the right column is found (CAS Latency, CL).

In the fourth column you will notice that we add 5 cycles of extra latency to get the total latency seen by the FSB. Two cycles are added because the addresses have to travel from the CPU to the chipset to the DIMM module, 1 cycle is added to transfer the data to the output buffer, and another two cycles are added to get the data back to the CPU (via the chipset).
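The way the DRAM latency and the 5 overhead cycles add up can be written out as a short sketch. The constant names and the case labels ("hit", "miss", "conflict") are chosen for this example only; the cycle counts are the ones given in the text for a 2-2-2 module.

```python
# Overhead cycles described in the text (2 + 1 + 2 = 5).
ADDRESS_CYCLES = 2        # address travels CPU -> chipset -> DIMM
OUTPUT_BUFFER_CYCLES = 1  # data is moved to the output buffer
RETURN_CYCLES = 2         # data travels back to the CPU via the chipset


def dram_latency(case, cl=2, rcd=2, rp=2):
    """Cycles spent inside the DRAM for each case (labels are illustrative)."""
    if case == "hit":        # requested row is already open
        return cl
    if case == "miss":       # no row open in the bank
        return rcd + cl
    if case == "conflict":   # another row is open in the same bank
        return rp + rcd + cl
    raise ValueError(f"unknown case: {case!r}")


def critical_word_latency(case, cl=2, rcd=2, rp=2):
    """Total cycles until the first word reaches the CPU over the FSB."""
    overhead = ADDRESS_CYCLES + OUTPUT_BUFFER_CYCLES + RETURN_CYCLES
    return overhead + dram_latency(case, cl, rcd, rp)


if __name__ == "__main__":
    for case in ("hit", "miss", "conflict"):
        print(case, dram_latency(case), critical_word_latency(case))
```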

In case of a… | Statistical chance | DRAM latency (cycles) | Total “critical word” latency seen at the FSB (cycles) | Total latency to transfer 32 bytes | Maximum bandwidth (PC100)
--- | --- | --- | --- | --- | ---
Page hit | +/- 55% | CL = 2 | 7 | 7-1-1-1 = 10 cycles | 320 MB/s
“Normal” page miss | +/- 40% | RCD + CL = 4 | 9 | 9-1-1-1 = 12 cycles | 267 MB/s
Page miss, and sense amp is loaded with a previous “open” row | +/- 5% | RP + RCD + CL = 6 | 11 | 11-1-1-1 = 14 cycles | 229 MB/s
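One way to read the “statistical chance” column is as weights for an average. The sketch below is an illustration that goes slightly beyond the table: it assumes the three shares are independent of each other and sum to 100%, and it averages the cycle counts rather than the bandwidths. The names are invented for this example.

```python
# Rows taken from the table above: (share of accesses, cycles per 32 bytes).
CASES = {
    "page hit": (0.55, 10),
    "page miss": (0.40, 12),
    "page miss, row open": (0.05, 14),
}


def average_burst_cycles(cases):
    """Weighted average number of cycles needed to transfer 32 bytes."""
    return sum(share * cycles for share, cycles in cases.values())


def average_bandwidth_mb_s(cases, clock_mhz=100, line_bytes=32):
    """Bandwidth implied by the average cycle count, in MB/s."""
    return line_bytes * clock_mhz / average_burst_cycles(cases)


if __name__ == "__main__":
    print(f"{average_burst_cycles(CASES):.1f} cycles on average")
    print(f"{average_bandwidth_mb_s(CASES):.0f} MB/s on average")
```

Under these assumptions the average lands between the page hit and page miss rows, which is what one would expect when page hits are the most common case.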

The last two columns explain the relation between latency and bandwidth. In the case of a page hit, we found that a memory chip with CAS latency = 2 is able to deliver 32 bytes in 10 cycles. If our memory is clocked at 100 MHz (PC100 SDRAM), each clock cycle is 10 ns long, so we get 32 bytes per 100 ns. That is equal to 0.32 bytes per ns, or 320 MB per second.
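The same arithmetic, written as a minimal sketch. The function names are made up for this example, and the burst is assumed to be four transfers (x-1-1-1), as in the table.

```python
def burst_cycles(first_word_cycles, beats=4):
    """Cycles for an x-1-1-1 burst: the first word plus one cycle per extra beat."""
    return first_word_cycles + (beats - 1)


def bandwidth_mb_s(total_cycles, clock_mhz, line_bytes=32):
    """Bytes per nanosecond, scaled to MB/s (1 byte/ns = 1000 MB/s)."""
    cycle_ns = 1000 / clock_mhz          # 100 MHz -> 10 ns per cycle
    bytes_per_ns = line_bytes / (total_cycles * cycle_ns)
    return bytes_per_ns * 1000


if __name__ == "__main__":
    for first_word in (7, 9, 11):
        cycles = burst_cycles(first_word)
        print(f"{first_word}-1-1-1: {cycles} cycles, "
              f"{bandwidth_mb_s(cycles, 100):.0f} MB/s")
```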

Therefore, contrary to some popular belief, latency and bandwidth are very closely related, especially in a typical PC system, which accesses main memory mostly to fill a cache line.

Have you noticed? The best PC100 SDRAMs (222) can reach, even in the best circumstances (page hit!), no more than 40% of the theoretical bandwidth (800 MB/s). Compare this to the bandwidth that a typical L2 cache SRAM of the PIII (Katmai) or Athlon Classic delivers. Such an SRAM would be able to deliver a 32-byte cache line in 3-1-1-1, or 6 cycles, at 300 MHz (and more). Six cycles at 300 MHz take 20 ns, so that is at least 1.6 GB/s. How does PC133 SDRAM do? Well, let us consider a system with PC133 CAS2, PC133 CAS3 and PC100 CAS2.

DRAM | Total system latency | Total latency to transfer 32 bytes | Maximum bandwidth | Increase over PC100
--- | --- | --- | --- | ---
PC133 CAS2 | +/- 7 cycles | 7-1-1-1 = 10 cycles | 427 MB/s | 33%
PC133 CAS3 | +/- 8 cycles | 8-1-1-1 = 11 cycles | 387 MB/s | 21%
PC100 CAS2 | +/- 7 cycles | 7-1-1-1 = 10 cycles | 320 MB/s | N/A
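The comparison can be reproduced with a few lines. This sketch assumes a PC133 clock of 133.33 MHz; with exactly 133 MHz the results shift by about 1 MB/s, which is why they may not match the table to the last digit. The `MODULES` dictionary and the function name are made up for this example.

```python
# (memory clock in MHz, cycles until the first word) per module.
MODULES = {
    "PC133 CAS2": (133.33, 7),
    "PC133 CAS3": (133.33, 8),
    "PC100 CAS2": (100.0, 7),
}


def line_fill_bandwidth_mb_s(clock_mhz, first_word_cycles, beats=4, line_bytes=32):
    """MB/s for one cache line fill with an x-1-1-1 burst."""
    total_cycles = first_word_cycles + beats - 1
    return line_bytes * clock_mhz / total_cycles


if __name__ == "__main__":
    baseline = line_fill_bandwidth_mb_s(*MODULES["PC100 CAS2"])
    for name, (clock_mhz, first_word) in MODULES.items():
        bw = line_fill_bandwidth_mb_s(clock_mhz, first_word)
        gain = (bw / baseline - 1) * 100
        print(f"{name}: {bw:.0f} MB/s ({gain:+.0f}% vs PC100 CAS2)")
```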

As you can see, a PC133 CAS3 module will return the first word a bit later than the PC100 CAS2 module, so in some applications where there is little access to the memory, the PC100 CAS2 equipped system will perform as well as a PC133 CAS3 equipped system.

An Athlon Classic with its large 512 KB L2-cache shows a smaller difference in most applications between PC133 and PC100 than a Duron. As the Duron accesses the memory a lot, the high bandwidth PC133 CAS3 will almost always prevail.

Our Duron review showed that a Duron equipped with PC133 SDRAM has a tangible performance advantage over one with PC100 SDRAMs. To understand the relationship between CPU performance and memory performance, read our article about this subject.
