RAM Latency - Explained
September 16, 2007
Is Low-Latency RAM worth the extra cost?
For most users, the answer is no. CAS 2.5 RAM costs about 30% more than CAS 3.0, and CAS 2.0 RAM can cost up to twice as much as CAS 3.0.
The amount of improvement in system performance varies greatly depending on the application you're running, but it's safe to generally characterize the improvement by saying that, compared with CAS 3.0 RAM, CAS 2.5 will give you a 1% to 2% speed boost, and CAS 2.0 will give you 2% to 4%.
RAM Latency Basics
When you buy RAM, you'll see two main speed ratings listed: frequency (the maximum rated clock rate), and latency. Memory speed is certainly important if you're considering overclocking, but for now we're concerned only with latency. Low latency memory, running at low latency settings, supposedly speeds up your system without requiring you to overclock it.
Memory latency is almost always designated in one of two ways. It's either a single number denoting the CAS latency, or a string of four numbers denoting several latencies. CL=2.5, CAS=2.5, or C=2.5 would be common "single number" listings for RAM with a CAS latency of 2.5 cycles, for instance. A four-number designation would be something like 3-4-4-8, in which the four numbers are, in order, CAS, tRCD, tRP, and tRAS. That's a lot of weird abbreviations, so here are the basics of what they mean:
CAS
Column Access Strobe (sometimes Column Access Select). This is actually the last stage in finding where data is physically located in RAM. Data is stored in an array of columns and rows: the row is selected first, then the column is selected and the data in memory is either read from or written to. CAS is the amount of time, in cycles, between receiving the column access command and acting upon it. It is usually a value of 2, 2.5, or 3.
tRCD
RAS (Row Access Strobe) to CAS delay. This is the delay, in number of cycles, between finding the row of a location in memory, and finding the column. This value is usually between 3 and 5 cycles, but it doesn't tend to have a huge impact on performance. Sequential bits of data are usually stored along the same row in memory, so rows are not re-selected nearly as often as columns.
tRP
RAS precharge. This is how much time it takes for the memory to stop accessing one row and start accessing another. Like tRCD, this value is typically between 3 and 5 cycles for modern memory systems. It can have an impact on performance when programs use large blocks of memory that span several rows.
tRAS
Active to Precharge Delay. This is the minimum time, in cycles, that a row must stay open (active) before it can be closed again with a precharge. This is generally a pretty big delay, from 5 to 8 cycles on most DDR memories. But it also doesn't have a huge impact on performance, and should only make a big difference when memory access patterns change dramatically.
That's probably all still a bit confusing, so here's the chronological sequence of events: First the pins receive a request to, let's say, retrieve memory at a certain address. If the requested data resides on a different row than the one currently open, the old row has to be closed first. tRAS comes into play here, since a row can't be closed until it has been open for at least that many cycles, and then tRP is the time it takes to close it. After the new row is selected (if necessary), you have the tRCD delay before the column is selected. Then CAS is the time it takes to select the proper column of memory and retrieve data stored there. To recap, listed chronologically for an access to a new row, it's tRAS -> tRP -> tRCD -> CAS. And CAS has the biggest impact on performance, since new columns are accessed more frequently than anything else.
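As a rough illustration of that sequence, here is a minimal Python sketch that parses a four-number listing and estimates how many cycles pass before data starts to arrive. The names Timings, parse_timings, and access_cycles are hypothetical, and the model is deliberately simplified: it assumes a same-row access costs only CAS, a new-row access costs tRP + tRCD + CAS, and it leaves tRAS, burst transfers, and memory controller overhead out entirely.

```python
from typing import NamedTuple


class Timings(NamedTuple):
    cas: float  # CAS latency in cycles; DDR allows half cycles such as 2.5
    trcd: int   # RAS-to-CAS delay
    trp: int    # RAS precharge
    tras: int   # active-to-precharge delay (not used in the estimate below)


def parse_timings(text: str) -> Timings:
    """Parse a four-number listing such as '3-4-4-8' (CAS-tRCD-tRP-tRAS)."""
    parts = text.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected four numbers, got {text!r}")
    return Timings(float(parts[0]), int(parts[1]), int(parts[2]), int(parts[3]))


def access_cycles(t: Timings, same_row: bool) -> float:
    """Estimate cycles before data starts to arrive (simplified model)."""
    if same_row:
        # Row already open: only the column access remains.
        return t.cas
    # Different row: close the old row, open the new one, then read the column.
    return t.trp + t.trcd + t.cas


t = parse_timings("3-4-4-8")
print("same row:", access_cycles(t, same_row=True))
print("new row: ", access_cycles(t, same_row=False))
```

Under these assumptions, a 3-4-4-8 module needs 3 cycles for a same-row access and 11 for a new-row access, which is why the row-related timings matter mostly when access patterns jump around.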
To test the effects of memory latency, we built two nearly identical PCs: a common Athlon 64 system using socket 754, and a high-end 3.4GHz Pentium 4 Extreme Edition machine. With the exception of the motherboard and CPU, both machines were configured identically. It might have been interesting to see the effects of latency on an Athlon 64 system that uses socket 939, with its 128-bit memory controller instead of the 64-bit controller on socket 754, but the fact is that socket 754 drastically outnumbers socket 939 right now. The memory controller on our P4 system is located on the north bridge, so memory access is the same regardless of CPU.
On each of these two systems, we ran a suite of benchmarks with three different sets of memory to determine the performance differences. We used three sets of Kingston HyperX memory that run at three common latency settings. Our high-latency RAM was run at 3-4-4-8, a very common set of latency settings for inexpensive DDR400 memory. When you buy the cheapest DDR400 you can find, this is probably what you're getting. We then re-ran all our benchmarks with a set of RAM running at 2.5-3-3-7, another common speed for DDR400 memory. You'll typically pay a little more for this than CAS 3 RAM, but it's not terribly expensive. Our low-latency setup was 2-2-2-5, an extremely aggressive group of latency settings that will definitely come at a cost premium.
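To put those three settings in absolute terms, the sketch below converts cycles to nanoseconds. It assumes the standard DDR400 clock of 200 MHz (two transfers per cycle), so one cycle is 5 ns. The "new-row" figure simply adds tRP + tRCD + CAS, which is a simplification rather than a full timing model, and the names SETTINGS and to_ns are made up for this example.

```python
CLOCK_MHZ = 200.0                   # DDR400: 200 MHz clock, two transfers per cycle
NS_PER_CYCLE = 1000.0 / CLOCK_MHZ   # 5 ns per cycle

# The three settings used in the tests, as (CAS, tRCD, tRP, tRAS).
SETTINGS = {
    "3-4-4-8": (3, 4, 4, 8),
    "2.5-3-3-7": (2.5, 3, 3, 7),
    "2-2-2-5": (2, 2, 2, 5),
}


def to_ns(cycles: float) -> float:
    """Convert a cycle count to nanoseconds at the assumed clock rate."""
    return cycles * NS_PER_CYCLE


for name, (cas, trcd, trp, tras) in SETTINGS.items():
    new_row = trp + trcd + cas  # simplified: close old row, open new row, read
    print(f"{name}: CAS {to_ns(cas):.1f} ns, new-row access {to_ns(new_row):.1f} ns")
```

By this estimate, CAS alone drops from 15 ns to 10 ns between the slowest and fastest settings, and a new-row access drops from 55 ns to 30 ns.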
Benchmark Results
Business Winstone 2004 is the latest version of VeriTest's Winstone benchmark suite. It consists of a variety of common desktop applications, run in a scripted sequence that resembles actual usage patterns. Most of these are Microsoft Office applications, including Microsoft Project and Access. Also included are Norton Antivirus Professional 2003 and WinZip 8.1.
The latest release of Content Creation Winstone updates most of the applications to recent versions. It also shifts away from Windows Media Encoder 7.1 to the current Windows Media Encoder 9. SoundForge has been replaced with Steinberg's WaveLab. One note: LightWave is currently running as a single threaded application.
Both Business Winstone and Multimedia Content Creation Winstone 2004 can be ordered from VeriTest and delivered on CD-ROM for a nominal shipping charge. They cannot be downloaded.
Our Winstone tests show quite modest improvements with decreasing RAM latency. In the Content Creation test, moving from CAS 3.0 to 2.5 results in an improvement of just over 1% on the Athlon 64 platform and just over 2% on the Pentium 4. There's another modest pickup going from CAS 2.5 to 2.0. Both systems were about 1% faster on the Business Winstone suite moving from 3.0 to 2.5, and almost 2% more going from 2.5 to 2.0. In none of these tests was the overall improvement from CAS 3.0 to 2.0 more than 3%. We expected these production suites, which switch back and forth between different applications, to show more of an improvement.
For our video encoding tests, we ran Windows Media Encoder 9 to recompress a high bitrate AVI file into a 1 megabit CBR WMV9 file, with "CD quality" audio (640x480 video, 16-bit / 44.1KHz audio). Our DivX compression test uses the popular VirtualDub freeware and DivX 5.2. We tested audio compression by converting a 248MB .WAV file into WMA9 at the generic "CD Quality" setting using Windows Media Encoder.
Digital media encoding can really strain the memory subsystem of your computer, as large media files have to be streamed to the CPU very quickly. It's no surprise that we see some of the best improvements here. Our Athlon 64 system picked up almost 6% encoding Windows Media Audio, almost 3% encoding DivX video, and 1.5% encoding WMV. Those improvements are analogous to buying the next fastest CPU.
Our Pentium 4 system shows something interesting. In our media encoding tests for that system, we see virtually no change at all. It's hard to explain this phenomenon, quite frankly. Our media test files are way too huge to fit in the CPU cache, so memory access is definitely a factor. Does it have something to do with Hyper-Threading, and the support of our test applications for multi-threaded operation? Perhaps the P4 is so fast at these tasks that the limiting factor is how fast the data can be pulled off the hard drive? For whatever reason, the fact remains that, with our basic Pentium 4 system, RAM latency alone seems to have little to no effect on performance.
The first of our two 3D rendering tests is the SPEC ViewPerf suite, which runs a suite of applications that mimic the OpenGL 3D rendering performance of popular content creation tools like 3ds max and Lightscape. The ViewPerf benchmark gives results for each of the six application benchmarks; we average them and present that composite score here. We also used Maxon's Cinebench 2003 benchmark, based on the company's Cinema4D engine, which can be found at www.cinebench.com. The Readme file goes into substantial detail on the design of the benchmark. We only use the CPU rendering result: the single CPU test for the Athlon 64 system, and the dual CPU rendering test for the P4 system.
The ViewPerf suite shows a steady increase in performance as RAM latency is reduced. Our Pentium 4 system picks up more steam, gaining about 2% when you move from CAS 3.0 to 2.5, and 3.2% if you make the leap to CAS 2.0. The Athlon 64 system shows a bit more modest improvement, measuring a 1.4% gain from 3.0 to 2.5 or a 2.3% gain if latency is reduced all the way to 2.0. Cinebench, on the other hand, displays no change at all. It's possible that the workload for that particular test fits inside the large caches of these two CPUs, and therefore doesn't stress memory-access latencies at all. We expect most real-world 3D rendering workstation applications to show the kind of modest but predictable speed boost demonstrated by ViewPerf.
The latest iteration of Futuremark's suite of synthetic tests expands on the limited repertoire of the original. Futuremark has added several multithreaded tests and includes storage and graphics. We focus on the memory and CPU tests here, but give the overall "PCMarks" score as well. PCMark tests are synthetic: they don't run actual applications, but they call upon commonly used algorithms like JPEG decoding, WMV video encoding, find-and-replace searches, and so on. PCMark tests are small and tend to fit in the L2 caches of modern high-end CPUs, so access to system RAM does not greatly affect the overall score.
Honestly, we're surprised to see any improvement in PCMark scores at all, given that most of the individual tests are small enough to fit in the processor's cache and memory access is limited. Our Athlon 64 system gains a little more than 1% in the overall PCMark score, the CPU-only tests, and the memory tests. The Pentium 4 is less consistent. Overall PCMark scores improved over 4% going from slowest to fastest RAM, but the CPU score gained only one percent. That test probably contains code and data that reside in cache more easily, and doesn't stress memory bandwidth or latency.
We ran 3DMark 2003 at 640x480 with software vertex shaders forced on, to maximize the strain on the CPU. We also recorded the CPU test scores, which consist of the frame rate for Game Tests 1 and 3 at low resolution with software vertex shaders. Our real-world game tests include four titles we know to be CPU-limited with high-end graphics hardware: Unreal Tournament 2004, Halo, Painkiller, and Doom 3. These were all run at a resolution of 640x480 with graphics details turned up all the way (so that more detailed models and higher-resolution textures would fill up more system RAM and put more strain on the system bus).
On our Athlon 64 system, 3DMark scores didn't really improve a whole lot. We gained about a percent in the overall score, a tiny bit more on the first CPU test, and a tiny bit less on the second CPU test when going from CAS 3.0 to 2.5. Jumping from CAS 3.0 to 2.0 nets about a 2% improvement in the overall 3DMark score and more than 3% on both CPU scores. The Pentium 4 system fares a little better, gaining up to 3.25% in the overall score and as much as 4.35% in the second CPU test when moving from our slowest to fastest RAM. That's a pretty good speed improvement, and looks a lot like the gains you see when moving up to the next fastest CPU frequency.
Our game tests definitely demonstrated the most significant improvements from reduced RAM latency. This makes a lot of sense, actually: games typically access both very large and very small files from RAM, and access patterns are more scattered than when reading in large files during media encoding or 3D rendering. The latency properties of both row and column access are likely stressed by most modern games.
The amount of improvement varies from one game to the next. On the Athlon 64 system, decreasing RAM latency from CAS 3.0 to CAS 2.5 resulted in a speed boost from 1% to 3%, and further decreasing it to CAS 2.0 gives us a difference of 2% to 6%. The speed boost is slightly less pronounced on our Pentium 4 system, though the fact that we are using an Extreme Edition processor with a large L3 cache is almost certainly a factor, and we wouldn't be surprised to see normal P4 processors getting the same speed boost as our Athlon 64 system or more.
The Halo scores in particular are worth pointing out. Our P4 system gains over 4% going from CAS 3.0 to 2.5, and over 10% going to 2.0. The Athlon 64 does even better, earning 7.5% moving from CAS 3.0 to 2.5 and 13.6% going to 2.0. In both cases, that's an absolutely enormous speed increase, but we were able to verify it with repeated testing. Something about the way Halo is coded makes it highly sensitive to RAM latency, and we wouldn't be surprised if there were other games out there with similar properties.
Is Low-Latency RAM Worth It?
Our tests show that improving RAM latency only makes a small difference in the performance of modern high-end PCs. The amount of improvement varies greatly depending on the application you're running, but it's safe to generally characterize the improvement by saying that, compared with CAS 3.0 RAM, CAS 2.5 will give you a 1% to 2% speed boost, and CAS 2.0 will give you 2% to 4%. Those are definitely small numbers, to be sure. System manufacturers often fight with their competitors over performance variances that small, though, so it shouldn't be viewed as insignificant. In fact, it's not uncommon for a new high-end CPU to be released that's just a couple hundred MHz faster than the previous best, giving you a performance advantage of 5% or less.
The real question is this: Is the marginal speed improvement of low-latency RAM worth its higher price? For most users, the answer is no. In a broad survey of online prices, we found that CAS 2.5 RAM costs about 30% more than CAS 3.0. If you're building a system with 1GB of RAM, you'll pay an additional $60 for only one or two percentage points. You're much better off spending that money on a slightly better video card or buying a good sound card instead of relying on integrated audio. Heck, even a really good mouse would probably be a better investment toward making your computing time more enjoyable.
For extreme enthusiasts and the overclocking set, the story is a bit different. CAS 2.0 RAM can cost up to twice as much as CAS 3.0, adding $150 to $200 to the cost of a high-performance PC. If you're the type that spares no expense to get the very fastest machine possible, lowering RAM latency definitely does help. What's more, low-latency RAM generally stands up better to running at higher clock frequencies (sometimes with a slight increase in latency settings), and the biggest performance improvement almost surely comes from running some combination of lower-latency, overclocked RAM. For those who spend well over $2,000 building their dream machine, the cost increase of high performance RAM is a worthwhile investment.
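The cost argument above boils down to simple arithmetic, sketched here in Python. The dollar amounts and percentage ranges are the ones quoted in this article ($60 for a 1% to 2% gain with CAS 2.5; $150 to $200 for a 2% to 4% gain with CAS 2.0). The helper name dollars_per_point is hypothetical, and pairing each price with the full gain range is an assumption made for illustration.

```python
def dollars_per_point(extra_cost: float, gain_low: float, gain_high: float):
    """Return (best case, worst case) dollars paid per percentage point gained."""
    return extra_cost / gain_high, extra_cost / gain_low


# CAS 2.5 vs. CAS 3.0: about $60 extra for a 1% to 2% speed boost.
print("CAS 2.5:        ", dollars_per_point(60, 1, 2))

# CAS 2.0 vs. CAS 3.0: $150 to $200 extra for a 2% to 4% speed boost.
print("CAS 2.0 at $150:", dollars_per_point(150, 2, 4))
print("CAS 2.0 at $200:", dollars_per_point(200, 2, 4))
```

With these inputs, CAS 2.5 works out to $30 to $60 per percentage point, and CAS 2.0 to somewhere between $37.50 and $100 per point.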
Source: http://www.extremetech.com/