Stanford University electrical engineer Subhasish Mitra, who specializes in testing computer hardware, said that as switches in computer chips have shrunk to the width of a few atoms, the reliability of chips has become another worry.
Companies such as Amazon, Facebook and Twitter, along with many other sites, have experienced surprising outages over the last year. The outages have had several causes, including programming mistakes and network congestion. But there is growing anxiety that as cloud-computing networks have become larger and more complex, they are still dependent, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable.
Mitra warned that silent errors were coming from the underlying hardware made by various companies.
Mitra said people believe that manufacturing defects are tied to these so-called silent errors, which cannot be easily caught. Researchers worry that they are finding rare defects because they are trying to solve bigger and bigger computing problems, which stresses their systems in unexpected ways.
He said that companies that run large data centres began reporting systematic problems more than a decade ago.
In a microprocessor that has billions of transistors -- or a computer memory board composed of trillions of the tiny switches that can each store a 1 or 0 -- even the smallest error can disrupt systems that now routinely perform billions of calculations each second, he said.