Cores That Don’t Count: What if Your Processor Thinks 2+2=5?

Software engineers are no strangers to bugs, constantly battling them in the trenches of code. But as systems grow more complex, so do the bugs. Imagine the headache of dealing with a rogue processor core that, under certain conditions, decides that 2+2 equals 5. Imagine the frustration when that error happens silently. This is not a far-fetched nightmare but a reality uncovered by recent research from Google.

In the vast server fleets that power the internet, we usually assume processors either work perfectly or fail in obvious, detectable ways. But Google’s findings challenge this assumption, introducing us to “mercurial cores”—processors that occasionally fail silently, causing incorrect computations without any immediate warning. These errors are called silent “corrupt execution errors” (CEEs).

Why Are CEEs So Hard to Handle?

At first glance, silent failures in processors may seem like just another issue in an already unreliable hardware world. After all, we’ve dealt with similar challenges in storage and networking for decades. But detecting computational errors is far trickier.

Unlike storage and network issues, where data corruption is relatively easy to detect and correct, computational errors often remain hidden until it’s too late, and the stakes are much higher. Imagine a rogue core miscomputing a database query or breaking a cryptographic calculation. In these cases, one small error can propagate and cause significant damage before anyone notices. Worse, because these errors are tied to specific cores and specific instructions, testing for them comprehensively is both time-consuming and expensive.

For storage and networking errors, redundancy and error-correction techniques work at minimal extra cost. CEEs are far harder to catch because of the nature of the failure itself: mercurial cores fail unpredictably and infrequently, often only on specific instructions under rare conditions. To detect such errors reliably, you’d need to replicate every computation across multiple cores and vote on the results, tripling the computational effort.
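The replicate-and-vote idea can be sketched in a few lines. This is only an illustration, not the paper's implementation: the names `vote`, `run_replicated`, `healthy_add`, and `faulty_add` are made up for this example, and the "faulty core" is simulated by a function that deliberately returns the wrong sum. In a real system each replica would have to run on a different physical core.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority, or raise if none exists."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among results: {list(results)!r}")
    return value


def run_replicated(replicas: Sequence[Callable[[], Hashable]]) -> Hashable:
    """Run every replica once and vote on the outcomes.

    Here the replicas run one after another in the same process; this
    sketch does not place them on separate cores.
    """
    return vote([replica() for replica in replicas])


def healthy_add() -> int:
    return 2 + 2


def faulty_add() -> int:
    # Simulated mercurial core: deliberately returns the wrong sum.
    return 5


result = run_replicated([healthy_add, faulty_add, healthy_add])
print(result)  # 4
```

The cost is visible in the call itself: three executions are paid for one answer, which is the tripled effort described above.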

An engineer may never encounter such a bug in an entire career, but when one does strike, it can be catastrophic. A few of the failures the paper attributes to CEEs:

Violation of lock (semaphore) semantics.
A deterministic AES miscomputation: encrypting and decrypting on the same core yielded the expected result, but encrypting on one core and decrypting on another produced gibberish.
Corruption affecting garbage collection in a storage system, causing live data to be lost.
Database index corruption, leading to some queries being nondeterministic.
Corruption of kernel state, resulting in crashes and application malfunctions.

What Causes These Rogue Errors?

Mercurial cores arise from the increasing complexity in modern CPU designs and the ongoing miniaturization of silicon components. These make processors more vulnerable to subtle defects that can escape manufacturing tests. Additionally, these errors can become more likely as processor cores age.

The paper does not disclose an exact CEE rate, but it reports on the order of a few mercurial cores per several thousand machines.

Troubleshooting the Elusive

What makes this even more fascinating (or frustrating, depending on how you look at it) is that tracking down the root cause of mercurial cores can feel like searching for a needle in a haystack. These cores might work perfectly 99.9% of the time, only to misfire on rare occasions. You could see one core malfunctioning, while all others on the same chip operate flawlessly. It’s a nightmare scenario for engineers trying to ensure system reliability.

So, What Can Be Done?

The solution isn’t simple. Google’s research suggests several approaches to detection. One is to watch for error signals, such as application or kernel crashes, that cluster on specific cores. Another is to test CPU cores while they serve real workloads by scheduling a low-priority task that runs CEE tests.
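A per-core screening pass might look roughly like the sketch below. It makes several assumptions: `golden_workload`, `screen_cores`, and `suspect_cores` are hypothetical names, the workload is a toy that exercises only a tiny slice of the instruction set, and pinning relies on `os.sched_setaffinity`, which Python exposes only on some platforms such as Linux. Where pinning is unavailable, the sketch runs the workload once without pinning. A production screener would also run at low priority and use far broader tests.

```python
import hashlib
import os
from collections import Counter


def golden_workload() -> str:
    """Deterministic computation that should give the same digest on every core."""
    digest = hashlib.sha256()
    acc = 1
    for i in range(1, 20_000):
        acc = (acc * 31 + i) % 1_000_003
        digest.update(acc.to_bytes(4, "little"))
    return digest.hexdigest()


def screen_cores(workload=golden_workload) -> dict:
    """Run the workload pinned to each available core; return {core: result}."""
    can_pin = hasattr(os, "sched_setaffinity")  # not available on every OS
    original = os.sched_getaffinity(0) if can_pin else None
    cores = sorted(original) if can_pin else [0]
    results = {}
    try:
        for core in cores:
            if can_pin:
                os.sched_setaffinity(0, {core})  # pin this process to one core
            results[core] = workload()
    finally:
        if can_pin:
            os.sched_setaffinity(0, original)  # restore the original CPU set
    return results


def suspect_cores(results: dict) -> list:
    """Return the cores whose result differs from the most common result."""
    majority, _ = Counter(results.values()).most_common(1)[0]
    return sorted(core for core, value in results.items() if value != majority)


print(suspect_cores(screen_cores()))
```

Comparing against the most common result avoids shipping a precomputed golden value, at the price of assuming that most cores on the machine are healthy.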

It likewise suggests several approaches to mitigation: running a computation on two cores in lock-step; triple modular redundancy, where a computation is performed in triplicate and the results are voted on; and cost-effective, application-specific detection, such as computing an invariant over a database record before committing a transaction. Some of these techniques carry significant performance costs.
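As a rough illustration of the application-specific option, the sketch below computes a CRC32 over a record when it is produced and checks it again just before the commit. Everything here is an assumption made for the example: the record layout, the in-memory `store` dictionary standing in for a database, and the names `record_checksum`, `commit`, and `CorruptionDetected`. The corruption is simulated by changing a field by hand.

```python
import json
import zlib


class CorruptionDetected(Exception):
    """Raised when a record no longer matches its checksum."""


def record_checksum(record: dict) -> int:
    """CRC32 over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return zlib.crc32(payload.encode("utf-8"))


def commit(store: dict, key: str, record: dict, expected_checksum: int) -> None:
    """Write the record only if the invariant still holds."""
    if record_checksum(record) != expected_checksum:
        raise CorruptionDetected(f"checksum mismatch for {key!r}; refusing to commit")
    store[key] = {"record": record, "crc32": expected_checksum}


store = {}
record = {"account": "a-1", "balance": 4}
checksum = record_checksum(record)  # computed when the record is produced

commit(store, "a-1", record, checksum)  # invariant holds, so the write succeeds

corrupted = dict(record, balance=5)  # simulated miscomputation before commit
try:
    commit(store, "a-1", corrupted, checksum)
except CorruptionDetected as err:
    print(err)
```

A check like this is much cheaper than triplicating the whole computation, but it only catches corruption that happens between the two checksum calculations. If the checksum is computed and verified on the same faulty core, a deterministic error could slip through, much like the AES case described earlier.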

As we push the boundaries of processor technology, it’s likely that issues like these will become more common. What was once a rare, nearly invisible problem for a few hyperscalers is now on the radar of the entire tech industry. At the same time, it also opens up new opportunities for researchers in the area of operating systems and compilers.

Sounds intriguing? You can dive deeper into the fascinating world of mercurial cores by reading Google’s full research paper: Cores That Don’t Count
