SilverBulletPC.com
The world's most practical PC.

Error Correction Code (ECC)

If you plan on using your computer in a scenario where memory accuracy is paramount, then please read on.

Here are some examples of memory-sensitive uses...

servers
medical
machine control
banking
complex repetitive processes
pointer-based databases

When a memory error hits, the vast majority of PC software will simply burp, you'll restart the program, and nothing major will come of it. But if your use for the computer is listed above, then a solution to memory errors should be a consideration.

Does a memory error mean the RAM is bad? The answer to this question is counter-intuitive. All high-density memory inevitably gets errors once in a rare while, no matter the quality. This is due to how high-density electronics respond to background cosmic radiation. Compared to a few years ago, today's DDR4 modules are incredibly reliable, leaving mostly the outside causes like background cosmic radiation.

Standard ECC memory adds 8 check bits to every 64 bits of data (a 72-bit word), allowing what's called a SECDED (single-error-correct, double-error-detect) Hamming-type code to both detect and correct bad data. Think of it as a very efficient algorithm for handling the most likely type of error: isolated single-bit errors.

Many online articles and computer experts will talk as though ECC is fool-proof. However, the reality is that it is an algorithm that almost perfectly remedies only the most common type of memory error. Just about anything outside that scope will continue to be a problem.

More specifically, today's ECC-capable memory controller detects and corrects one isolated 1-bit error in any 64-bit word, and detects but does not correct one 2-bit error in any 64-bit word. Fortunately, the vast majority of errors are random, isolated 1-bit errors caused by background cosmic radiation, and this formula for ECC covers those. But unfortunately, there are other causes, including the occasional more intense cosmic events.
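To make the correct-one, detect-two behavior concrete, here is a minimal Python sketch of a textbook extended Hamming (SECDED) code over a 64-bit word: 7 Hamming check bits plus 1 overall parity bit. The layout and the names (encode, decode, flip) are assumptions for illustration only; real memory controllers do this in hardware and may use a different code.

```python
# Illustrative SECDED sketch (assumed textbook layout, not a real controller).
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]   # Hamming check bits
CODE_LEN = 72                                  # position 0 = overall parity
DATA_POSITIONS = [p for p in range(1, CODE_LEN) if p not in PARITY_POSITIONS]


def encode(data):
    """Spread 64 data bits over a 72-bit codeword (list of 0/1)."""
    bits = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, CODE_LEN):
            if pos & p and pos != p:
                parity ^= bits[pos]
        bits[p] = parity
    bits[0] = sum(bits[1:]) % 2                # make total parity even
    return bits


def decode(bits):
    """Return (status, data); data is None when the error is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODE_LEN):
        if bits[pos]:
            syndrome ^= pos                    # XOR of positions of set bits
    overall = sum(bits) % 2
    if overall == 0:
        if syndrome != 0:
            return "uncorrectable", None       # looks like a 2-bit error
        status = "ok"
    else:
        if syndrome >= CODE_LEN:
            return "uncorrectable", None       # points outside the word
        bits[syndrome] ^= 1                    # treat as a single-bit error
        status = "corrected"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= bits[pos] << i
    return status, data


def flip(bits, positions):
    """Return a copy of the codeword with the given positions inverted."""
    out = list(bits)
    for pos in positions:
        out[pos] ^= 1
    return out
```

With this sketch, one flipped bit comes back as "corrected" with the right data, and two flipped bits come back as "uncorrectable". Three flipped bits are another story: flipping positions 3, 5 and 6 together, for example, is reported as "corrected" while the returned data is wrong, which is exactly the kind of silent miss discussed in the rest of this article.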

The way to visualize this is to think of a shotgun shooting at you from a vast distance. You might get one pellet flying through your body, and will likely recover. Maybe two, but unlikely right next to each other. So the damage is relatively low risk, yet painful. Now you add armor that can defend against a single isolated random pellet. As long as that shotgun is way out there you are almost completely safe. But then imagine the same shotgun at close range. The armor can't hold back the concentrated burst of pellets. ECC works about as well as that armor. And the occasional intense cosmic event is about as bad as bringing that shotgun in close. And this is going to happen whether you plan for it or not. Now multiply that likelihood many times over when considering man-made causes for memory errors. Radiation from a bad light bulb can do it. Some machinery can do it. Nearby secret government labs...

The next point to understand is how ECC will respond to a more intense cosmic event. As memory errors get closer together, either time-wise or address-wise, we should expect a corresponding increase in the likelihood that additional errors went completely undetected.

So, as long as the causes of memory errors stay at normal levels, ECC works wonders. But when they do not, ECC will not only fail, it will let some errors through completely undetected in amongst other errors that were detected and corrected as planned. And this could create a blind spot that works against the value of having ECC in the first place, because not having ECC would have resulted in lots of memory errors alerting you to a major problem, while having ECC potentially results in no alert at all to a major problem with memory.

The whole point of this article is to underscore that this scenario could sometimes be worse than not having ECC. That is, if you do not do anything more to improve the picture.

Here's a nutshell explanation for two approaches to greatly improve memory integrity...

Method #1

A safer approach is to configure your system to intentionally crash upon detection of any 2-bit ECC error, and upon any ECC error occurring within a 1-minute window of any previous ECC error. Both of these scenarios should be considered a red flag pointing to a significantly increased likelihood of additional errors that went completely undetected and uncorrected. The purpose of this course of action is to contain those scenarios that were not corrected automatically, so that they do not interfere with the critical nature of your machine's use.
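The decision rule itself is simple enough to sketch in a few lines of Python. This is only an illustration of the policy described above: the class name EccHaltPolicy and its method are hypothetical, and how ECC events are actually obtained from the operating system or firmware is platform specific and not shown here.

```python
# Sketch of the Method #1 rule (hypothetical names; event source not shown).
WINDOW_SECONDS = 60.0   # the 1-minute window from the text


class EccHaltPolicy:
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.last_event = None   # timestamp of the previous ECC event, if any

    def should_halt(self, timestamp, corrected):
        """Decide whether one reported ECC event should stop the machine.

        timestamp: seconds on any monotonic clock
        corrected: True for a corrected 1-bit error,
                   False for a detected but uncorrected (2-bit) error
        """
        previous = self.last_event
        self.last_event = timestamp
        if not corrected:
            return True          # any uncorrected error is a red flag
        if previous is not None and timestamp - previous <= self.window:
            return True          # two ECC events inside the window
        return False             # isolated, corrected error: carry on
```

A monitoring service would call should_halt for every ECC event it receives and trigger a controlled stop when the answer is True. An isolated corrected error ten minutes after the previous one passes; a second error 30 seconds later does not.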

NOTE: If you have an ECC capable machine with ECC memory and would like to be certain that your Windows system alerts you to ECC errors detected, then you should read through this forum thread...

ECC Error logging on Windows 7 and Diagnostic Services

Method #2

Full redundancy (2+ machines), combined with much longer software-generated error detection codes, is a far better approach. Each machine would compare the codes for important transactions from all machines on the fly. From there the machine with the error can be determined automatically and its transaction disregarded.

If properly implemented this method increases data reliability many times, even in the rare event scenarios.
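As a rough illustration of the comparison step, the Python sketch below uses SHA-256 as the software-generated code and a simple majority vote. Both choices are assumptions for the example, as are the function names and machine labels. It also assumes three machines: with only two, a mismatch can be detected, but the comparison alone cannot say which machine is wrong.

```python
# Sketch of Method #2: compare per-machine digests of the same transaction.
# SHA-256 and majority voting are illustrative assumptions.
import hashlib
from collections import Counter


def digest(transaction):
    """A 256-bit code, far longer than the check bits ECC keeps per word."""
    return hashlib.sha256(transaction).hexdigest()


def vote(results):
    """results maps a machine name to the transaction bytes it produced.

    Returns (accepted_transaction, names_of_disagreeing_machines).
    Raises RuntimeError when no strict majority agrees.
    """
    digests = {name: digest(data) for name, data in results.items()}
    winner, count = Counter(digests.values()).most_common(1)[0]
    if count <= len(results) / 2:
        raise RuntimeError("no majority; the transaction must be retried")
    outliers = sorted(name for name, d in digests.items() if d != winner)
    accepted = next(data for name, data in results.items()
                    if digests[name] == winner)
    return accepted, outliers
```

If machines a and b both produce b"debit 100" and machine c produces b"debit 900", vote returns the agreed transaction and flags c as the machine to disregard. In a real deployment each machine would more likely exchange only the digests rather than the full transaction data.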

The picture we're painting here is that ECC by itself is a compromise that may make things worse some of the time. When memory error causes are on the rise (either man-made causes near the computer, or the occasional more intense cosmic event) ECC may become disadvantageous as memory errors may occur that are not caught, while giving every indication that it did its job.

ECC adds about 50%-75% to the cost of memory. A thorough implementation of Method #2 above could be considered complementary to ECC, or it could be considered an alternative to ECC, which may in part explain the trend to use more non-ECC servers. One way to think this through is to first establish whether or not redundancy is necessary. Generally, everyone will agree that it is necessary. And from there the software code perfectly fits the situation, not ECC. Server costs drop, and data integrity improves way beyond that of an ECC based solution. Lower cost and better data integrity. Something to think about.