Single-bit Memory Errors

Ever wonder if a stray cosmic ray or alpha particle might
double your bank account, due to an undetected RAM error?

In a previous blog post, I described how lower-end systems without ECC
will crash (assuming you’re lucky) when a single-bit memory error is
encountered. These “soft faults” are often the result of high-energy
neutrons from cosmic rays, or of alpha particles from the decay of isotopes
within the silicon packaging or surrounding materials. The signal integrity
of the memory bus, for example its susceptibility to crosstalk, also
affects the soft error rate.

Systems with ECC will transparently correct single-bit
errors and log each event, at the cost of a small performance hit. So putting
ECC memory in your desktop or laptop is generally a good idea if you’re involved
in financial or scientific applications (as opposed to just doing Facebook).
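
To make "transparently correct single-bit errors" concrete, here is a
minimal sketch of a single-error-correcting, double-error-detecting
(SECDED) code. It is a toy: it protects 4 data bits with an 8-bit extended
Hamming codeword, whereas ECC DIMMs typically protect 64 data bits with 72.
The function names `encode` and `decode` are made up for this
illustration and do not come from any memory controller.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming parity bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    for bit in code[1:]:
        code[0] ^= bit                      # overall parity over bits 1..7
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions holding a 1
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: assume exactly one flipped bit. The syndrome is its
        # position (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1010)
word[6] ^= 1                 # a single-bit soft error
print(decode(word))          # (10, 'corrected')
word[2] ^= 1                 # a second flipped bit in the same word
print(decode(word))          # (None, 'uncorrectable')
```

The second call shows the limit of this kind of code: two flipped bits in
the same word are flagged, but the original data cannot be recovered.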

But conventional ECC memory can’t correct double-bit or worse
memory errors. Higher-end systems that demand high reliability and
availability need more sophisticated ECC, which can detect and correct
multi-bit errors within a single memory device. Known variously as Chipkill,
Extended ECC, Chipspare and SDDC, these schemes scatter the bits of the ECC
code across multiple chips.
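
The scattering idea can be sketched with a toy layout. The sizes below
(eight x4 chips, four 8-bit codewords) and the two placement functions are
hypothetical and chosen only for illustration; real Chipkill-class
implementations differ in both layout and code. The point is that when
each chip holds at most one bit of any given codeword, losing a whole chip
costs every codeword a single bit, which a single-bit-correcting code can
repair.

```python
CHIPS = 8            # hypothetical: one chip per codeword bit position
PINS_PER_CHIP = 4    # hypothetical x4 devices
CODEWORD_BITS = 8
CODEWORDS = 4        # codewords transferred side by side


def scattered(word, bit):
    """Bit i of every codeword lives on chip i; each word uses its own pin."""
    return bit, word                      # (chip, pin)


def packed(word, bit):
    """Contrast case: each codeword's bits fill consecutive pins."""
    index = word * CODEWORD_BITS + bit
    return index // PINS_PER_CHIP, index % PINS_PER_CHIP


def bits_lost_per_word(placement, failed_chip):
    """Count how many bits of each codeword sit on the failed chip."""
    lost = {word: 0 for word in range(CODEWORDS)}
    for word in range(CODEWORDS):
        for bit in range(CODEWORD_BITS):
            chip, _pin = placement(word, bit)
            if chip == failed_chip:
                lost[word] += 1
    return lost


print(bits_lost_per_word(scattered, 0))   # {0: 1, 1: 1, 2: 1, 3: 1}
print(bits_lost_per_word(packed, 0))      # {0: 4, 1: 0, 2: 0, 3: 0}
```

With the packed layout, a dead chip takes four bits out of one codeword,
which is beyond what a single-bit-correcting code can fix.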

The goal, of course, is to reduce the incidence of soft
errors, to improve the overall performance and robustness of the
system. Although we can’t block cosmic rays and high-energy neutrons, alpha
particle interaction can be mitigated through the use of purer materials
(although at a cost). Signal integrity is the most important factor to
attack, by validating that the design has plenty of margin, because we’ve
seen how temperature, jitter, noise, voltage aberrations and manufacturing
variances can all affect it.

A good memory test program is essential to providing stable
and reliable memory tuning. Many subtle memory failures manifest themselves
only with certain data patterns, and might also depend on the addresses
being accessed. They depend not only on the data being written or read, but
also on the data in the surrounding bytes, which are transferred at the same
time. A simple memory test, one which writes a fixed pattern to all bytes,
will not discover this type of failure. Similarly, failures might only show
up when accessing non-adjacent addresses, so you should use a memory test
program which performs accesses in a non-sequential pattern.
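
As a rough illustration of why a fixed pattern falls short, the sketch
below runs two tests against a simulated RAM. Everything in it is an
assumption made for the example: `ToyMemory`, its single coupling fault
(bit 3 of one "victim" byte reads as 1 only if bit 3 of the next byte is
also 1), and the particular patterns. It is not taken from any real test
tool. In this toy model only the data patterns expose the fault; the
shuffled address order is included just to show the shape of a
non-sequential test.

```python
import random


class ToyMemory:
    """Simulated RAM with one hypothetical data-dependent fault."""

    def __init__(self, size, victim=None):
        self.cells = bytearray(size)
        self.victim = victim            # must be less than size - 1 if set

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        if addr == self.victim:
            neighbour = self.cells[addr + 1]
            # Bit 3 survives only if the neighbour's bit 3 is also set.
            value &= neighbour | 0xF7
        return value


def fixed_pattern_test(mem, size, pattern=0xFF):
    """Write one value everywhere, then read back in address order."""
    for addr in range(size):
        mem.write(addr, pattern)
    return [addr for addr in range(size) if mem.read(addr) != pattern]


def pattern_test(mem, size, seed=1):
    """Varying data per address, written and read in shuffled order."""
    rng = random.Random(seed)
    patterns = [
        lambda a: 0x55 if a % 2 == 0 else 0xAA,   # checkerboard
        lambda a: 0xAA if a % 2 == 0 else 0x55,   # inverse checkerboard
        lambda a: (a * 167 + 13) & 0xFF,          # address-derived data
    ]
    failures = set()
    for pattern in patterns:
        order = list(range(size))
        rng.shuffle(order)
        for addr in order:
            mem.write(addr, pattern(addr))
        rng.shuffle(order)
        for addr in order:
            if mem.read(addr) != pattern(addr):
                failures.add(addr)
    return sorted(failures)


print(fixed_pattern_test(ToyMemory(64, victim=5), 64))   # []
print(pattern_test(ToyMemory(64, victim=5), 64))         # [5]
```

The fixed pattern passes because the victim and its neighbour always hold
the same value; the checkerboard gives adjacent bytes opposite bits, which
is what exposes the coupling.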

A good read on the general topic of memory testing can be
found in our Memory Test Whitepaper. A more specific brief on margining of
DDR3 can be seen in our DDR3 Memory Test toolkit paper.

Alan Sguigna