Since Apple built the Mac Pro out of Intel workstation components, it unfortunately has to use more expensive Intel workstation memory. In other words, cheap unbuffered DDR2 isn't an option; it's time to welcome ECC-enabled Fully Buffered DIMMs (FBD) to your Mac.
Years ago, Intel saw two problems with most mainstream memory technologies: 1) as we pushed for higher speed memory, the number of memory slots per channel went down, and 2) the rest of the world was going serial (USB, SATA and, more recently, HyperTransport, PCI Express, etc.), yet we were still using fairly antiquated parallel memory buses.
The number of memory slots per channel isn't really an issue on the desktop; currently, with unbuffered DDR2-800 we're limited to two slots per 64-bit channel, giving us a total of four slots on a motherboard with a dual channel memory controller. With four slots, just about any desktop user's needs can be met with the right DRAM density. It's in the high end workstation and server space that this limitation becomes an issue, as memory capacity can be far more important, often requiring 8, 16, 32 or more memory sockets on a single motherboard. At the same time, memory bandwidth is also important, as these workstations and servers will most likely be built around multi-socket, multi-core architectures with high memory bandwidth demands, so simply limiting memory frequency in order to support more memory isn't an ideal solution. You could always add more channels; however, parallel interfaces by nature require more signaling pins than faster serial buses, and thus adding four or eight channels of DDR2 to get around the DIMMs per channel limitation isn't exactly easy.
Intel's first solution was to totally revamp PC memory technology. Instead of going down the path of DDR and eventually DDR2, Intel wanted to move the market to a serial memory technology: RDRAM. RDRAM offered significantly narrower buses (16 bits per channel vs. 64 bits per channel for DDR), much higher bandwidth per pin (at the time, a 64-bit wide RDRAM memory controller would offer 6.4GB/s of memory bandwidth, compared to a 64-bit DDR266 interface, which could only offer 2.1GB/s) and, of course, the ease-of-layout benefits that come with a narrow serial bus.
Unfortunately, RDRAM offered no tangible performance increase, as the demands of processors at the time were nowhere near what the high bandwidth RDRAM solutions could deliver. To make matters worse, RDRAM implementations were plagued by higher latency than their SDRAM and DDR SDRAM counterparts; with no use for the added bandwidth and with higher latency, RDRAM systems were no faster than, if not slower than, their SDR/DDR counterparts. The final nail in the RDRAM coffin on the PC was the issue of pricing; your choices at the time were either to spend $1000 on a 128MB stick of RDRAM or to spend $100 on a stick of equally performing PC133 SDRAM. The market spoke and RDRAM went the way of the dodo.
Intel quietly shied away from attempting to change the natural evolution of memory technologies on the desktop for a while. Intel eventually transitioned away from RDRAM, even after its price dropped significantly, embracing DDR and more recently DDR2 as the memory standards supported by its chipsets. Over the past couple of years, however, Intel got back into the game of shaping the memory market of the future with this idea of Fully Buffered DIMMs.
The approach is quite simple in theory: what caused RDRAM to fail was the high cost of using a non-mass-produced memory device, so why not develop a serial memory interface that uses mass-produced commodity DRAMs such as DDR and DDR2? In a nutshell, that's what FB-DIMMs are: regular DDR2 chips on a module with a special chip that communicates over a serial bus with the memory controller.
The memory controller in the system no longer has a wide parallel interface to the memory modules; instead, it has a narrow 69-pin interface to a device known as an Advanced Memory Buffer (AMB) on the first FB-DIMM in each channel. The memory controller sends all memory requests to the AMB on the first FB-DIMM on each channel and the AMBs take care of the rest. Because all requests (data, command and address) are fully buffered, the memory controller no longer has a load that significantly increases with each additional DIMM, so the number of memory modules supported per channel goes up significantly. The FB-DIMM spec says that each channel can support up to 8 FB-DIMMs, although current Intel chipsets can only address 4 FB-DIMMs per channel. With a significantly lower pin count, you can cram more channels onto your chipset, which is why the Intel 5000 series of chipsets features four FBD channels.
Bandwidth is a little more difficult to determine with FBD than it is with conventional DDR or DDR2 memory buses. During Steve Jobs' keynote, he put up a slide that listed the Mac Pro as having a 256-bit wide DDR2-667 memory controller with 21.3GB/s of memory bandwidth. Unfortunately, that claim isn't entirely honest, as the 256-bit wide interface does not exist between the memory controller and the FB-DIMMs. The memory controller in the Intel 5000X MCH communicates directly with the first AMB it finds on each channel; that interface is actually only 24 bits wide per channel, for a total bus width of 96 bits (24 bits per channel x 4 channels). The bandwidth part of the equation is a bit more complicated, but we'll get to that in a moment.
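To make the two figures above concrete, here is a minimal Python sketch of the arithmetic. It assumes the usual parallel-bus formula (bus width in bytes times transfers per second); the function and constant names are ours, purely for illustration.

```python
def parallel_peak_gb_s(bus_width_bits: int, mega_transfers_per_sec: float) -> float:
    """Peak bandwidth of a parallel bus in GB/s (1 GB = 1000 MB)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_sec / 1000


# The keynote figure: a hypothetical 256-bit wide DDR2-667 interface.
quoted_gb_s = parallel_peak_gb_s(256, 667)

# What actually sits between the MCH and the first AMB on each channel.
LANES_PER_CHANNEL = 24
CHANNELS = 4
physical_link_width_bits = LANES_PER_CHANNEL * CHANNELS

print(f"256-bit DDR2-667 equivalent: {quoted_gb_s:.1f} GB/s")
print(f"Physical FBD link width:     {physical_link_width_bits} bits")
```

The 21.3GB/s number is therefore the bandwidth of the DDR2 devices behind the AMBs taken together, not of the 96 lanes that leave the chipset.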
Below we've got the anatomy of an AMB chip:
The AMB has two major roles: to communicate with the chipset's memory controller (or other AMBs) and to communicate with the memory devices on the same module.
When a memory request is made, the first AMB in the chain figures out whether the request is a read/write to its own module or to another module. If it's the former, the AMB parallelizes the request and sends it off to the DDR2 chips on the module; if the request isn't for this specific module, the AMB passes it on to the next AMB and the process repeats.
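The forwarding behavior described above can be sketched as a simple daisy chain. This is a toy model, not the actual AMB logic: the `Request` and `AMB` classes, the `target_module` field and the hop counter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Request:
    target_module: int  # hypothetical: 0 is the FB-DIMM closest to the controller
    op: str             # "read" or "write"


@dataclass
class AMB:
    module_index: int
    next_amb: Optional["AMB"] = None

    def handle(self, req: Request, hops: int = 1) -> Tuple[int, int]:
        """Serve the request locally or pass it down the chain.

        Returns (module that served the request, number of AMBs visited).
        """
        if req.target_module == self.module_index:
            # Here the real AMB would parallelize the request for its DDR2 chips.
            return self.module_index, hops
        if self.next_amb is None:
            raise LookupError(f"no FB-DIMM {req.target_module} on this channel")
        return self.next_amb.handle(req, hops + 1)


def build_channel(num_modules: int) -> Optional[AMB]:
    """Build a chain of AMBs; the returned AMB is the one wired to the controller."""
    head = None
    for index in reversed(range(num_modules)):
        head = AMB(module_index=index, next_amb=head)
    return head


channel = build_channel(4)
served_by, amb_count = channel.handle(Request(target_module=2, op="read"))
print(f"served by module {served_by} after visiting {amb_count} AMBs")
```

The point of the model is that the controller only ever talks to the head of the chain, and the number of AMBs a request touches grows with the distance of the target module.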
As we mentioned before, the AMB interface is only 24 bits wide thanks to its high speed serial nature, but there's far more detail to this bus than meets the eye. The AMB bus is split into a 14-bit read bus ("Northbound" lanes) and a 10-bit write bus ("Southbound" lanes), with these buses operating at 6 times the DDR2 frequency (e.g. if you're using DDR2-667 FB-DIMMs, then the AMB runs at 667MHz x 6, or 4GHz). By having dedicated read and write buses, reads and writes can happen simultaneously, thus increasing performance in some circumstances. The read bus is a bit wider than the write bus since your system reads from memory more often than it writes to it.
In each bus, there are no dedicated lines for addresses, commands and data; all three types of signals are sent over the same pins. In conventional parallel interfaces, the address of the memory request is placed on a dedicated set of address pins and the data at that address is then placed on another set of data pins. With FBD, the data is sent in packets or frames (much like network traffic); each frame generally consists of either address/control signals or command and data signals. The data frames are 15 bytes for writes and 21 bytes for reads, but not all of that is raw data; some of it is ECC data that we don't normally look at when comparing bandwidths, so we'll have to strip that out.
For northbound traffic (reads), each frame is 12 cycles long and each frame that contains data can carry a maximum of 16 bytes of data, meaning that our peak bandwidth with DDR2-667 FB-DIMMs is 5.34GB/s. For southbound traffic (writes), each frame is still 12 cycles long but only 8 bytes are transferred per frame, giving us a peak data bandwidth of 2.67GB/s.
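The frame sizes and peak figures above can be reproduced with a few lines of Python. This sketch assumes that a 12-cycle frame means 12 bit-times on every lane, which is our reading; it is consistent with the 21-byte and 15-byte frame sizes quoted earlier. All names are illustrative.

```python
DDR2_MT_S = 667            # DDR2-667 data rate, in mega-transfers per second
LINK_MULTIPLIER = 6        # the FBD lanes run at 6x the DDR2 rate
TRANSFERS_PER_FRAME = 12   # assumption: 12 bit-times per lane per frame
NORTHBOUND_LANES = 14      # reads
SOUTHBOUND_LANES = 10      # writes


def frame_bytes(lanes: int) -> int:
    """Total size of one frame, including ECC and other non-data bits."""
    return lanes * TRANSFERS_PER_FRAME // 8


def frames_per_sec_millions(ddr2_mt_s: float = DDR2_MT_S) -> float:
    """Frame rate in millions of frames per second."""
    lane_rate_mbit_s = ddr2_mt_s * LINK_MULTIPLIER
    return lane_rate_mbit_s / TRANSFERS_PER_FRAME


def peak_data_gb_s(payload_bytes: int, ddr2_mt_s: float = DDR2_MT_S) -> float:
    """Peak data bandwidth if every single frame carried a full payload."""
    return frames_per_sec_millions(ddr2_mt_s) * payload_bytes / 1000


print(f"northbound frame: {frame_bytes(NORTHBOUND_LANES)} bytes")
print(f"southbound frame: {frame_bytes(SOUTHBOUND_LANES)} bytes")
print(f"peak read:  {peak_data_gb_s(16):.2f} GB/s")
print(f"peak write: {peak_data_gb_s(8):.2f} GB/s")
```

Note that only 16 of the 21 northbound bytes and 8 of the 15 southbound bytes count as data here, which is the stripping-out of non-data bits mentioned earlier.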
Total data bandwidth then weighs in at just over 8GB/s for a single channel, but also keep in mind that not every frame will be a data frame, so the effective bandwidth will be noticeably lower. What we're touching on here is one of the major drawbacks to serial buses: there's greater overhead than with a parallel bus. Although there is more peak bandwidth on the AMB bus than there is between the AMB and its DDR2 devices (8GB/s vs. 5.34GB/s), there may actually be less peak read bandwidth once you factor in the overhead of the serial bus. There's of course less peak write bandwidth available, but writes take much longer to complete and generally can't ever reach peak bandwidth numbers. At the end of the day, despite the best efforts, there may be some situations where you are actually bandwidth limited by your AMB in an FBD system. How frequently those situations occur and what the average performance impact is are unfortunately both very complicated questions to answer and beyond the scope of this already long article.
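A rough way to see why the read side can end up limited by the AMB link is to scale the peak by the share of frames that actually carry data. The sketch below does exactly that; the `data_frame_fraction` parameter is a hypothetical knob for illustration, not a measured or specified value.

```python
NORTHBOUND_PEAK_GB_S = 16 * (667 * 6 / 12) / 1000  # 16 data bytes per read frame
SOUTHBOUND_PEAK_GB_S = 8 * (667 * 6 / 12) / 1000   # 8 data bytes per write frame
DDR2_SIDE_PEAK_GB_S = 64 / 8 * 667 / 1000          # 64-bit DDR2-667 behind the AMB

TOTAL_LINK_PEAK_GB_S = NORTHBOUND_PEAK_GB_S + SOUTHBOUND_PEAK_GB_S


def effective_read_gb_s(data_frame_fraction: float) -> float:
    """Read bandwidth when only a fraction of northbound frames carry data.

    data_frame_fraction is an illustrative assumption between 0 and 1.
    """
    if not 0.0 <= data_frame_fraction <= 1.0:
        raise ValueError("data_frame_fraction must be between 0 and 1")
    return NORTHBOUND_PEAK_GB_S * data_frame_fraction


print(f"total link peak: {TOTAL_LINK_PEAK_GB_S:.2f} GB/s")
for fraction in (1.0, 0.9, 0.8):
    print(f"{fraction:.0%} data frames -> {effective_read_gb_s(fraction):.2f} GB/s "
          f"(DDR2 side peak {DDR2_SIDE_PEAK_GB_S:.2f} GB/s)")
```

Because the northbound peak only matches the DDR2 side when every frame carries data, any overhead at all puts the read link below what the DRAMs could deliver in theory.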
The FBD proposition gets a little less appetizing when you look at the other major aspect of memory performance: latency. Since the protocol calls for point-to-point communication between AMBs, there's an additional latency penalty for each AMB that has to be contacted in the search for the right FB-DIMM to fulfill the read/write request. Intel states that the additional delay is in the range of 3 - 5 ns per FB-DIMM, meaning that a configuration of 8 x 1GB FB-DIMMs will be slower than 4 x 2GB FB-DIMMs. The argument here in favor of FBD is that even though you give up some latency, you make up for it in the ability to cram more memory channels on your memory controller and support configurations with more DIMMs.
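As a back-of-the-envelope illustration of that comparison, the sketch below estimates the worst-case added latency for a given number of FB-DIMMs. It assumes the modules are spread evenly across four channels and that each module in a channel contributes one 3 - 5 ns penalty; both are simplifying assumptions on our part, and the function name is hypothetical.

```python
import math
from typing import Tuple


def worst_case_added_latency_ns(
    total_dimms: int,
    channels: int = 4,
    per_dimm_ns: Tuple[float, float] = (3.0, 5.0),
) -> Tuple[float, float]:
    """Return the (low, high) added latency for the deepest module in a channel."""
    # Assumption: modules are distributed evenly, so the deepest chain
    # holds ceil(total_dimms / channels) modules.
    chain_depth = math.ceil(total_dimms / channels)
    low, high = per_dimm_ns
    return chain_depth * low, chain_depth * high


for dimms in (4, 8):
    low, high = worst_case_added_latency_ns(dimms)
    print(f"{dimms} FB-DIMMs: {low:.0f} - {high:.0f} ns added in the worst case")
```

Under these assumptions, going from four modules to eight doubles the worst-case penalty, which is why fewer, denser modules come out ahead on latency.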
There's one more issue worth talking about and that is power consumption. The AMB on each FB-DIMM has a pretty big job, converting the 4GHz serialized memory requests into 667MHz parallel requests that can be serviced by regular DDR2 memories. This translation process consumes quite a bit of power and thus causes the AMB to dissipate a noticeable amount of heat. The Mac Pro page on Apple's website states the following about the FB-DIMMs it uses:
That heatsink is made necessary by the AMB on each FB-DIMM, which seems to dissipate somewhere between 3 - 6W. The reason there's a range is that how active the AMB is depends on how close it is to the memory controller. The first AMB in the chain will have to service all requests from the main memory controller, passing them along as needed, while the last AMB in the chain will only receive those requests that are specifically targeted to its module. With 8 FB-DIMM slots in the Mac Pro, you're looking at up to another ~40W of power if you've got all slots populated.
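The power estimate is simple multiplication, sketched here under the assumption that every AMB falls somewhere in the 3 - 6W range quoted above. The function name is ours.

```python
from typing import Tuple


def amb_power_range_w(
    num_dimms: int, per_amb_w: Tuple[float, float] = (3.0, 6.0)
) -> Tuple[float, float]:
    """Return the (low, high) total AMB power in watts for num_dimms modules."""
    low, high = per_amb_w
    return num_dimms * low, num_dimms * high


low_w, high_w = amb_power_range_w(8)
print(f"8 FB-DIMMs: {low_w:.0f} - {high_w:.0f} W from the AMBs alone")
```

Since the AMBs further down each chain see less traffic, a fully populated system should land inside this range rather than at its upper bound, which fits the ~40W figure.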
Despite being a lower pincount bus, current FB-DIMMs use the same number of pins as DDR2 DIMMs. The reason is that each AMB needs two sets of buses, one to communicate with the FB-DIMM before it and one to communicate with the module after it; thus there are approximately 120 signaling pins needed for each AMB. Once you add your power and ground pins, not to mention your reserved pins for future use, you're not that far off the 240 pins used on current desktop DDR2 DIMMs. Rather than introducing a brand new connector and module design, FB-DIMMs simply take the current DDR2 DIMM design and key it differently to only work in FB-DIMM slots. Remember that the signal routing from the chipset to the first memory slot still only uses 69 signaling pins since it doesn't have to communicate with anything "before" it in the chain, so you do still get the benefits of a lower pincount interface.
The major benefit that the Mac Pro seems to get from the use of FB-DIMMs is that its memory bus and FSBs can offer identical bandwidths at 21.3GB/s (ignoring the unknowns we discussed earlier about the efficiency of FBD). By using a lower pincount interface, Intel was able to fit four FBD channels on its 5000 series chipset and thus offer the bandwidth equivalent of a 256-bit wide DDR2 memory controller. However, the additional memory bandwidth comes at the high cost of additional latency, power consumption and more expensive DIMMs.
There are a couple of things you can do to maximize performance and minimize the cost of additional memory on your Mac Pro, and it starts with the number of FB-DIMMs you configure your system with. The Mac Pro ships with a default configuration of 2 x 512MB FB-DIMMs; unfortunately, that means that you're only using two of the four available memory channels, cutting your peak theoretical memory bandwidth in half. You'll want to upgrade to at least four FB-DIMMs so that you can run in quad-channel mode. In the coming weeks we'll be running some tests to figure out exactly how much additional performance you'll gain by doing that and whether it's noticeable or not.
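To put numbers on the two-channel penalty, here is a small sketch of peak theoretical bandwidth as a function of how many FB-DIMMs are installed. It assumes each populated channel contributes the bandwidth of one 64-bit DDR2-667 interface and that modules fill empty channels before doubling up on any one channel; both assumptions and all names are ours.

```python
PER_CHANNEL_GB_S = 64 / 8 * 667 / 1000  # one 64-bit DDR2-667 channel equivalent


def peak_for_dimm_count(num_dimms: int, channels: int = 4) -> float:
    """Peak theoretical bandwidth in GB/s for a given number of FB-DIMMs."""
    if num_dimms < 0:
        raise ValueError("num_dimms cannot be negative")
    # Assumption: one module per channel until every channel is populated.
    populated_channels = min(num_dimms, channels)
    return populated_channels * PER_CHANNEL_GB_S


for dimms in (2, 4, 8):
    print(f"{dimms} FB-DIMMs: {peak_for_dimm_count(dimms):.1f} GB/s peak")
```

In this model the jump from two to four modules doubles the peak figure, while going beyond four adds capacity but no further bandwidth.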
If you do find yourself filling all 8 memory slots on the Mac Pro, we would suggest trying to move to 4 higher density modules instead. Remember that you gain an additional 3 - 5ns of latency (at minimum) with each FB-DIMM hop, so the fewer FB-DIMMs you have the lower your worst case scenario memory latency will be. But since you still want to be running in quad-channel mode you don't want to drop below four FB-DIMMs, making four the magic number with the Mac Pro.
Copyright © AnandTech, Inc.