Google researchers examining these silent corrupt execution errors (CEEs) concluded that "mercurial cores" were to blame: CPU cores that occasionally miscalculated, under varying circumstances, in ways that defied prediction.
The errors are not the result of chip architecture design missteps, and they're not detected during manufacturing tests. Rather, Google engineers theorise, the errors have arisen because we've pushed semiconductor manufacturing to a point where failures have become more frequent, and we lack the tools to identify them in advance.
In a paper titled "Cores that don't count", Hochschild and chums Paul Turner, Jeffrey Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David Culler, and Amin Vahdat cite several plausible reasons why the unreliability of computer cores is only now receiving attention, including larger server fleets that make rare problems more visible, increased attention to overall reliability, and software development improvements that reduce the rate of software bugs.
"But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design", the researchers state, noting that existing verification methods are ill-suited for spotting flaws that occur sporadically or because of physical deterioration after deployment.
The risks posed by misbehaving cores include not only crashes, which the existing fail-stop model for error handling can accommodate, but also incorrect calculations and data loss, which may go unnoticed and pose a particular risk at scale.
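To make the distinction concrete, here is a minimal Python sketch of why fail-stop handling copes with a crash but not with a silent miscalculation. Everything in it is invented for illustration: the `checksum` workload, the `crashing_core` and `mercurial_core` functions, and the single flipped bit are assumptions, not anything described by the researchers. The cross-check in `run_checked` is simply one plausible way to spot a disagreement, not a method attributed to Google.

```python
def checksum(data: bytes) -> int:
    """Reference computation on a healthy core: byte sum modulo 2**16."""
    return sum(data) % 65536


def crashing_core(data: bytes) -> int:
    """Hypothetical core that fails loudly, which fail-stop handling expects."""
    raise RuntimeError("machine check: core halted")


def mercurial_core(data: bytes) -> int:
    """Hypothetical core that silently flips one bit of an otherwise valid result."""
    return checksum(data) ^ 0x0004


def run(core, data: bytes) -> int:
    """Fail-stop handling: fall back to a healthy core only when the first one crashes."""
    try:
        return core(data)
    except RuntimeError:
        return checksum(data)


def run_checked(core, data: bytes) -> int:
    """Repeat the work on a second (healthy) core and compare the two answers."""
    result = run(core, data)
    if result != checksum(data):
        raise ValueError("results disagree: possible corrupt execution")
    return result
```

With `run`, the crash is absorbed and the right answer comes back, while the mercurial core's wrong answer is returned as if nothing had happened. Only `run_checked` objects, and it does so at the price of doing the work twice.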
Hochschild recounted an instance where Google's hardware conducted an auto-erratic ransomware attack - which is not as sexually stimulating as it sounds.
"One of our mercurial cores corrupted encryption. It did it in such a way that only it could decrypt what it had wrongly encrypted."