High-speed DRAMs keep pace with high-speed systems

Dewwa Socc, September 14, 2017

Craig Hampel, Rambus Inc

To get high performance from high-speed systems, you need high-performance memories. Carefully analyze the EDO, SDRAM, and RDRAM system-timing parameters to see which memory type delivers the performance that your system needs.

As processor clock rates in computers climb to 200 MHz and beyond, high-performance memories are becoming critical to sustaining high performance. Market pressures that keep system costs competitive require low-cost main-memory systems based on standard, high-yielding DRAM cores. Extended-data-out (EDO) DRAM, synchronous DRAM (SDRAM), and Rambus DRAM (RDRAM) can help you obtain good performance from low-cost cores.

You must also evaluate system-transaction times, including controller, DRAM, and board-timing factors, for the three DRAM alternatives. The memory systems in this article use 16-Mbit DRAM devices operating with a 66-MHz system clock.

Consider the worst-case main-memory transactions—cache-line fills or 32-byte read operations from random locations in memory—and compare how long each system takes to retrieve the first block of data. Specifically, examine system latency, or the time the system takes to access the first block of data. Other elements of throughput include the memory system’s bandwidth, hit rate, and ability to overlap multiple memory transactions. Overlapped memory transactions (simultaneous row and column accesses to banks) enable the data transfer of the previous transaction to hide the latency of a second transaction. This technique boosts overall throughput.

Because EDO DRAM, SDRAM, and RDRAM are based on the same core-memory technology, their internal device timings are nearly identical. Thus, the differences among memory subsystems that affect the latency include the rate at which the system can move the address and control information to the DRAM and the rate at which the DRAM can move data from the DRAM to the memory controller.

Determining overall latency
At the system level, the cache-line-fill operation comprises address-transport, data-access, and data-transport times. You can extract these timing values from their respective DRAM data sheets and evaluate the system overhead to determine the overall latency for any memory subsystem.

Address-transport, or -settling, time is how long it takes the controller to send a new address and any control information into the DRAM interface. The address-transport time is a function of the memory interface only; you measure this time from the controller pins.

Data-access time is how long it takes the DRAM to move the data from its internal storage array into its internal sense amps. Because the three DRAM types’ internal arrays are nearly identical, their access times are similar. The RDRAM’s higher clock rate allows it to complete an access faster because the RDRAM can transmit data at the end of a 3.75-nsec clock period. This capability is unlike that of an SDRAM, which must wait for the end of a system-bus clock cycle: every 15 nsec in a 66-MHz system.

Data-transport time is how long it takes the system to move data from the DRAM to the controller. This component is a direct function of the memory interface’s bandwidth.
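
The clock-granularity effect described for data-access time can be illustrated with a short sketch. The 52-nsec core access time is a hypothetical value chosen only for illustration, and `quantized_access_time` is an assumed helper name; the clock periods are the 15-nsec and 3.75-nsec figures given above.

```python
import math

def quantized_access_time(core_access_ns: float, clock_period_ns: float) -> float:
    """Round a core access time up to the next clock boundary (nsec)."""
    clocks = math.ceil(core_access_ns / clock_period_ns)
    return clocks * clock_period_ns

CORE_ACCESS_NS = 52.0  # hypothetical core access time, not taken from a data sheet

sdram_ns = quantized_access_time(CORE_ACCESS_NS, 15.0)  # 66-MHz bus clock
rdram_ns = quantized_access_time(CORE_ACCESS_NS, 3.75)  # RDRAM clock period

print(f"SDRAM: {sdram_ns} nsec, RDRAM: {rdram_ns} nsec")
```

With the same core, the finer clock wastes less time waiting for the next boundary, which is the advantage the text attributes to the RDRAM.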

Performing a detailed analysis
The EDO DRAM is a page-mode DRAM with the addition of a register set that holds the data output. This feature allows the core to access the next column address sooner, resulting in a 30% speed improvement over comparable page-mode DRAMs. The controller presents row-address strobes (RASs) and column-address strobes (CASs) to the core, asynchronously to the bus clock. EDO DRAMs use a single bank architecture and, therefore, must process memory transactions serially; that is, the next transaction cannot begin until the EDO DRAM completes the previous one.

A 66-MHz EDO DRAM system that uses a 64-bit-wide data bus with blocks of four 1M×16-bit EDO DRAMs in parallel has a peak data-transfer rate of 266 Mbytes/sec. Each block of four EDO DRAMs presents one 8-kbyte open page to the controller. This system also provides an 8-Mbyte memory granularity.

The SDRAM and RDRAM have synchronous interfaces, meaning that the bus clock synchronizes the transfer of control, address, and data information. The 16-Mbit SDRAM and RDRAM arrays are organized into two banks. Each bank includes a sense-amp array to hold an open page. The memory devices can burst-transfer data from either open page. A 66-MHz SDRAM system that uses a 64-bit-wide data bus with blocks of four 1M×16-bit SDRAMs provides a 533-Mbyte/sec peak data-transfer rate. Each block of four SDRAMs presents two 2-kbyte open pages to the controller. This system also provides a memory granularity of 8 Mbytes.

A 533-MHz RDRAM system that uses a two-channel configuration comprises two 2M×8-bit RDRAMs, yielding a 16-bit datapath. Because each RDRAM has two 2-kbyte pages, the system of four RDRAMs presents four 4-kbyte open pages to the controller. This system provides a 1.06-Gbyte/sec peak transfer rate and a 4-Mbyte memory granularity.
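
The peak rates quoted for the three systems follow from bus width multiplied by transfer rate. The sketch below is a rough check under stated assumptions: the EDO DRAM system is assumed to transfer once per 30-nsec column cycle, the SDRAM once per 15-nsec bus clock, and the RDRAM on both edges of a 3.75-nsec clock. The function name `peak_rate_mbytes_per_sec` is illustrative only.

```python
def peak_rate_mbytes_per_sec(bus_width_bits: int, mega_transfers_per_sec: float) -> float:
    """Peak rate in Mbytes/sec: bytes per transfer times millions of transfers per second."""
    return (bus_width_bits / 8) * mega_transfers_per_sec

# EDO DRAM: 64-bit bus, one transfer per 30-nsec column cycle (assumed).
edo_rate = peak_rate_mbytes_per_sec(64, 1000 / 30)
# SDRAM: 64-bit bus, one transfer per 15-nsec bus clock.
sdram_rate = peak_rate_mbytes_per_sec(64, 1000 / 15)
# RDRAM: 16-bit datapath, two transfers per 3.75-nsec clock (one per edge).
rdram_rate = peak_rate_mbytes_per_sec(16, 2 * 1000 / 3.75)

for name, rate in [("EDO DRAM", edo_rate), ("SDRAM", sdram_rate), ("RDRAM", rdram_rate)]:
    print(f"{name}: about {rate:.0f} Mbytes/sec")
```

The results land within rounding of the 266-Mbyte/sec, 533-Mbyte/sec, and 1.06-Gbyte/sec figures quoted above.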

Calculating EDO DRAM cache-line-fill latency
You can calculate the EDO DRAM system’s total latency time for a cache-line fill by using typical timing specifications, such as from the Micron (Boise, ID) MT4LC1M16E5-6 EDO DRAM data sheet. Because the EDO interface is asynchronous, the access times are not a function of the system or bus clock, and the board’s physical layout can impact these times. Therefore, a system with few EDO DRAMs and series damping resistors has a different transport time from that of a larger system.

The address-transport time for an EDO DRAM system is typically 9 nsec. System designers base this time on board-level characteristics, including trace and settling times.

A –60 EDO DRAM has a 60-nsec RAS-access time, t_RAC, which equals the data-access time. The column-address-access time, t_AA, for this EDO DRAM is 30 nsec, and the RAS-precharge time, t_RP, is 40 nsec.

The data-transport time is less than the address-transport time because less loading occurs on the data lines. In this configuration, the data-settling time is about 6 nsec—how long it takes for the controller to receive the first 8 bytes. The controller must issue three more CAS cycles to finish transferring the entire cache line, or 32 bytes. Thus,

t_{DATA TRANSPORT}=t_DS+(3·t_{COLUMN CYCLE}) =6+(3·30) =96 NSEC,

where t_DS is the data-settling time for the first 8 bytes.

Without precharge,

TOTAL EDO LATENCY=t_{ADDRESS TRANSPORT}+t_{DATA ACCESS} +t_{DATA TRANSPORT}=9 NSEC+60 NSEC+96 NSEC=165 NSEC.
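
The EDO DRAM arithmetic above can be reproduced with a few lines of code. This is a minimal sketch using the timing values from the text; the function and parameter names are illustrative, and the 8-byte bus width is inferred from the 64-bit data bus.

```python
def edo_fill_latency_ns(
    t_address_transport: int = 9,   # board-level address settling time
    t_rac: int = 60,                # RAS-access time (data-access time)
    t_ds: int = 6,                  # data-settling time for the first 8 bytes
    t_column_cycle: int = 30,       # time per additional CAS cycle
    line_bytes: int = 32,
    bus_bytes: int = 8,             # 64-bit data bus
) -> int:
    """Total cache-line-fill latency in nsec, without precharge."""
    extra_cas_cycles = line_bytes // bus_bytes - 1  # three more after the first
    t_data_transport = t_ds + extra_cas_cycles * t_column_cycle
    return t_address_transport + t_rac + t_data_transport

edo_total = edo_fill_latency_ns()
print(f"EDO DRAM cache-line fill: {edo_total} nsec")
```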

Calculating SDRAM cache-line-fill latency
You can calculate the SDRAM total latency time for a cache-line fill by using typical SDRAM timing specifications, such as from the Hitachi (Brisbane, CA) HM5216165TT-10 data sheet. The cycle time for a 66-MHz bus clock is 15 nsec, and, therefore, the address-transport time is also 15 nsec. The SDRAM controller and array operate with the same clock. The controller drives the address bits to the SDRAM array on one clock edge, and the SDRAM samples the address and control information on the next clock edge.

The data-access time for the SDRAM is:

t_{DATA ACCESS}=t_RCD+((CL–1)·t_CYCLE)+t_AC=30+30+9=69 NSEC,

where t_RCD is the specified minimum time from activating the next RAS until the READ or WRITE command, and CL is the CAS, or read, latency. You must set CL to 3 to achieve an output valid time (t_AC) of 9 nsec and provide sufficient setup time to the controller. The data-transport time is:

t_{DATA TRANSPORT}=t_DPD+(3·t_CYCLE)=6+45=51 NSEC,

where t_DPD is the minimum time necessary to allow the data to switch and propagate from the DRAM to the controller across the wide, unterminated data bus. This time equals the cycle time minus the output valid time. The controller must issue three additional CAS cycles, or bursts, to transfer the remainder of the cache line; this procedure accounts for 45 nsec of additional data-transport time.

TOTAL SDRAM LATENCY=t_{ADDRESS TRANSPORT}+t_{DATA ACCESS} +t_{DATA TRANSPORT}=15 NSEC+69 NSEC+51 NSEC=135 NSEC.
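
The three SDRAM terms can be recomputed from the clock period and data-sheet parameters. The sketch below is illustrative rather than definitive: it takes t_AC = 9 nsec from the data-access equation, derives t_DPD as the cycle time minus t_AC as described above, and uses hypothetical function and parameter names.

```python
def sdram_fill_components_ns(
    t_cycle: int = 15,      # 66-MHz bus clock period
    t_rcd: int = 30,        # activate-to-READ delay
    cas_latency: int = 3,   # CL
    t_ac: int = 9,          # output valid time used in the data-access equation
    line_bytes: int = 32,
    bus_bytes: int = 8,     # 64-bit data bus
) -> tuple:
    """Return (address transport, data access, data transport) in nsec."""
    t_address_transport = t_cycle
    t_data_access = t_rcd + (cas_latency - 1) * t_cycle + t_ac
    t_dpd = t_cycle - t_ac                       # switch-and-propagate time
    extra_bursts = line_bytes // bus_bytes - 1   # three more after the first
    t_data_transport = t_dpd + extra_bursts * t_cycle
    return t_address_transport, t_data_access, t_data_transport

sdram_parts = sdram_fill_components_ns()
sdram_total = sum(sdram_parts)
print("address, access, transport (nsec):", sdram_parts)
print("total (nsec):", sdram_total)
```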

Calculating RDRAM cache-line-fill latency
The RDRAM controller begins a cache-line-access transaction with the transmission of a request packet. The packet is 6 bytes long, and each channel transfers 1 byte on every clock edge, or 2 bytes per clock cycle. The cycle time for the Rambus (Mountain View, CA) 533-MHz R16MC-60-533 RDRAM is 3.75 nsec. Therefore, the address-transport time is three clock cycles, or 11.25 nsec. The RDRAM controller must simultaneously send the same request packet to both RDRAM channels, because the two devices operate in lock-step.

The RDRAMs begin transmitting data 16 clock cycles after the RDRAMs receive the request packet:

t_{DATA ACCESS}=16·t_CYCLE=60 NSEC.

With two RDRAM channels, the controller receives 4 bytes for each clock cycle, or 2 bytes per clock edge. Transporting the entire 32 bytes requires eight clock cycles, or 30 nsec.

TOTAL RDRAM LATENCY=t_{ADDRESS TRANSPORT}+t_{DATA ACCESS} +t_{DATA TRANSPORT}=11.25 NSEC+60 NSEC+30 NSEC
=101.25 NSEC.
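
Because every RDRAM term is a whole number of 3.75-nsec clock cycles, the total can be checked by counting cycles. In this sketch, the three-cycle request-packet time is an assumption inferred from the 11.25-nsec address-transport figure, and the function name is illustrative.

```python
T_CYCLE_NS = 3.75  # RDRAM clock period from the text

def rdram_fill_latency_ns(
    request_cycles: int = 3,    # assumed: 11.25 nsec / 3.75 nsec per cycle
    access_cycles: int = 16,    # request packet to first data
    line_bytes: int = 32,
    bytes_per_cycle: int = 4,   # two channels, 2 bytes per channel per cycle
) -> float:
    """Total cache-line-fill latency in nsec for the two-channel system."""
    transport_cycles = line_bytes // bytes_per_cycle  # eight cycles
    return (request_cycles + access_cycles + transport_cycles) * T_CYCLE_NS

rdram_total = rdram_fill_latency_ns()
print(f"RDRAM cache-line fill: {rdram_total} nsec")
```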

Subsequent data-access timing
Although the preceding analysis focuses on the time memory subsystems need to access the first cache line in a new page, accesses to the next data blocks within the open page depend on column-access time. Applications that show high locality of reference can benefit from keeping a page open after an access. Subsequent access to the open page results in a “hit,” or immediate CAS access.

The CAS-access time is similar to the RAS-access time minus the time it takes to load a new row. All three DRAM types use similar DRAM core technologies with similar RAS cycle times. Filling a cache line of 32 bytes from an open row takes 135 nsec from the EDO system, because the access time drops by 30 nsec, from the 60-nsec t_RAC to the 30-nsec t_AA. Similarly, with the t_RCD component of 30 nsec removed, the SDRAM access to the next 32 bytes is 105 nsec. An RDRAM system can achieve a hit time of 26 nsec, or seven clocks, compared with the RAS-access time of 60 nsec, or 16 clocks. This hit time results in subsequent accesses of 67.25 nsec to any 32 bytes within an open page.
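
The open-page figures come from swapping the row-access term for the shorter column-access term while leaving address and data transport unchanged. The sketch below tabulates that substitution with the values quoted in the text; the dictionary layout and names are illustrative.

```python
# (address transport, data transport) in nsec, from the first-access calculations.
TRANSPORT_NS = {
    "EDO DRAM": (9, 96),
    "SDRAM": (15, 51),
    "RDRAM": (11.25, 30),
}

# Data-access time on a page hit, in nsec.
HIT_ACCESS_NS = {
    "EDO DRAM": 60 - 30,  # t_RAC replaced by t_AA
    "SDRAM": 69 - 30,     # t_RCD removed from the 69-nsec data-access time
    "RDRAM": 26,          # the text's figure for a seven-clock hit
}

def hit_latency_ns(system: str) -> float:
    """Cache-line-fill latency from an already open page."""
    t_address, t_transport = TRANSPORT_NS[system]
    return t_address + HIT_ACCESS_NS[system] + t_transport

hit_latencies = {name: hit_latency_ns(name) for name in TRANSPORT_NS}
for name, latency in hit_latencies.items():
    print(f"{name}: {latency} nsec")
```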

Memory-controller designers may choose to keep pages open in the expectation that subsequent requests will access the open page. If the request is not to the open page, the precharge time must occur before the beginning of the RAS-access time. That is, the controller must close the page before selecting a new row to open.

The precharge time the DRAM core requires adds latency for a page miss. The precharge time is a function of the speed of the core being accessed—about 30 nsec for RDRAMs and SDRAMs and 40 nsec for EDO DRAMs. To accommodate the precharge time, the RDRAM system uses eight extra bus clocks for the access (with a cycle time of 3.75 nsec). The SDRAM specifies its precharge time, t_RP, as the minimum time between the precharge command and a subsequent activate command.
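
A page miss adds the precharge time in front of a full row access. The following sketch converts the precharge times above into bus clocks and adds them to the first-access latencies. It assumes that a synchronous controller rounds the precharge time up to whole clocks; the two-clock SDRAM result is arithmetic under that assumption, not a data-sheet value.

```python
import math

def precharge_clocks(t_precharge_ns: float, t_cycle_ns: float) -> int:
    """Whole bus clocks needed to cover the precharge time (assumed rounding up)."""
    return math.ceil(t_precharge_ns / t_cycle_ns)

def miss_latency_ns(t_precharge_ns: float, first_access_ns: float) -> float:
    """Page-miss latency: close the open page, then perform a full row access."""
    return t_precharge_ns + first_access_ns

rdram_clocks = precharge_clocks(30, 3.75)  # matches the eight extra clocks in the text
sdram_clocks = precharge_clocks(30, 15)

rdram_miss = miss_latency_ns(rdram_clocks * 3.75, 101.25)
edo_miss = miss_latency_ns(40, 165)  # asynchronous interface, no clock rounding

print(f"RDRAM: {rdram_clocks} clocks of precharge, {rdram_miss} nsec on a miss")
print(f"SDRAM: {sdram_clocks} clocks of precharge")
print(f"EDO DRAM: {edo_miss} nsec on a miss")
```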

Average or effective latency
A system’s average latency is a function of the page hit rate. The hit rate determines how often the lower-latency CAS access and the longer precharge, or miss, access occur. Although hit rate depends on the application, the RDRAM system’s higher open-page count results in 50% higher hit rates than those of comparable SDRAM and EDO DRAM systems. Although an individual 16-Mbit SDRAM or RDRAM has two independent banks, how a system designer implements the memory subsystem determines how many open pages are available to the controller. (EDO DRAMs have only one bank.)
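
One simple way to express average latency is as a hit-rate-weighted mix of the hit and miss latencies. The sketch below uses the RDRAM figures from the text (67.25 nsec on a hit; 101.25 nsec plus 30 nsec of precharge on a miss). The hit rates are hypothetical values for illustration only, and the model assumes every access is either a hit or a miss to an open page.

```python
def effective_latency_ns(hit_rate: float, hit_ns: float, miss_ns: float) -> float:
    """Hit-rate-weighted average of hit and miss latencies."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

RDRAM_HIT_NS = 67.25
RDRAM_MISS_NS = 101.25 + 30.0  # full row access plus precharge

for rate in (0.50, 0.75):  # hypothetical hit rates
    avg = effective_latency_ns(rate, RDRAM_HIT_NS, RDRAM_MISS_NS)
    print(f"hit rate {rate:.0%}: {avg} nsec average")
```

Raising the hit rate moves the average toward the hit latency, which is why the number of open pages matters.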

First-word latency
Many designers also must analyze memory systems’ latency to provide the critical first word of the cache line. The access time of the core dominates this latency, because, for small transfers, data-transport time plays a less important role. For memory systems with comparable hit rates, the latency for the first-word access is about the same for EDO DRAM, SDRAM, and RDRAM systems. Again, because RDRAM systems have more open pages than do equivalent EDO DRAM and SDRAM systems, they are more likely to have higher hit rates and, thus, better first-word latency times.

Latency is one component of overall system performance. The other component is effective bandwidth. Pentium Pro, RISC CPUs, unified-memory processors, and graphics controllers can issue multiple outstanding memory requests. In these multiple-request environments, the data-transport time can hide the row-access latency if the memory device can support overlapped memory transactions.

Both the SDRAM and RDRAM systems have multibank designs and thus can support simultaneous RAS/CAS accesses. This capability enables them to interleave transactions. While the controller is transferring data from one bank of an SDRAM or RDRAM, a row access can be in progress from another bank. On the other hand, the single-bank design of EDO DRAM requires the memory controller to process its transactions serially. This design also means that the device must wait the full precharge time from the previous access as well as the RAS time for the next block transfer before transferring data.
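
The benefit of interleaving can be illustrated with a toy timeline model. It is a simplified sketch, not a model of any real controller: it assumes all requests are queued at time zero, requests alternate between banks, precharge is ignored, the command path is never a bottleneck, and only the data bus is shared. The access and transfer times are the SDRAM values from the earlier calculation (15 + 69 nsec and 51 nsec).

```python
def total_time_ns(n_transactions: int, access_ns: float, transfer_ns: float, banks: int) -> float:
    """Time to finish n back-to-back cache-line reads on a shared data bus."""
    bank_free = [0.0] * banks  # when each bank can start its next row access
    bus_free = 0.0             # when the shared data bus is next available
    for i in range(n_transactions):
        bank = i % banks                           # requests alternate between banks
        access_end = bank_free[bank] + access_ns
        transfer_start = max(access_end, bus_free)  # wait for data and for the bus
        bus_free = transfer_start + transfer_ns
        bank_free[bank] = bus_free                 # bank is busy until its data is out
    return bus_free

ACCESS_NS = 84.0    # address transport plus data access
TRANSFER_NS = 51.0  # data transport

serial = total_time_ns(4, ACCESS_NS, TRANSFER_NS, banks=1)
overlapped = total_time_ns(4, ACCESS_NS, TRANSFER_NS, banks=2)
print(f"one bank: {serial} nsec, two banks: {overlapped} nsec")
```

With one bank, each transaction pays the full access and transfer time in sequence; with two banks, part of each row access is hidden behind the previous transfer.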

Faster system buses
System designers are moving to 100-MHz system-bus operation to achieve higher system performance. You cannot cost-effectively use EDO subsystems at these system speeds, however. You can adjust the preceding timing analysis for a 100-MHz system bus by comparing a 600-MHz Rambus system with a 3.3-nsec bus-cycle time to an SDRAM subsystem supporting 100-MHz system operation. Although memory vendors have not determined the timing parameters for SDRAM operation with a 100-MHz bus, a 600-MHz Rambus implementation yields a peak transfer rate of 1.2 Gbytes/sec.

DRAMs for mainstream memory systems must meet several criteria, including performance, availability from many manufacturers, and low component and overall-system cost. EDO DRAMs, SDRAMs, and RDRAMs are all based on low-cost cores, and most top-tier DRAM suppliers support these memories.

Although you may consider EDO DRAMs and SDRAMs as high-bandwidth memories, their interface limits their ability to quickly move data into or out of the DRAM core. In addition, their wide data-bus configuration leads to complex board designs where system designers must resolve tricky timing issues. A Rambus main-memory subsystem allows higher effective bandwidth and lower lead-off latency than either an EDO DRAM or an SDRAM configuration. Despite using the same core technology, SDRAM-based systems take more than 20% longer, and EDO DRAM systems take 60% longer than RDRAMs to retrieve the 32-byte cache line.

Author’s biography

Craig Hampel is an architecture specialist at Rambus Inc (Mountain View, CA), where he has worked for 3 years. He has helped develop PC and RISC chip sets and Rambus DRAMs, concurrent RDRAM, RISC, and x86 workstations and servers. Hampel has a BS in computer engineering from the University of Illinois—Champaign/Urbana. His leisure pursuits include skiing, roller hockey, and camping.

We would like to take this opportunity to thank Craig Hampel and Rambus, Inc. for providing this very informative article.
