Cosmic Ray Interference
Why would you care about cosmic rays?
Have you ever seen Men in Black? For you, reading this article may be like the moment Will Smith walked into MiB headquarters and found out there was this whole other reality living behind the scenes while the oblivious public lived on in blissful ignorance.
Cosmic rays affect you – or they would, if not for a class of caretakers that protect our daily lives from interstellar invaders from beyond our galaxy.
Without them, computers would go haywire. Bank accounts could be randomly deleted. Missile launch commands could be sent to nuclear silos anywhere on the globe at any time. Planes would crash out of the sky. Cars would take off uncontrollably as if with a mind of their own. And that report you’ve been working on for the last four hours… gone!
While all that may just sound typical for 2020 – right up there with the pandemic, murder hornets, and exploding fertilizer – this could be the reality caused by cosmic ray interference.
What are cosmic rays?
It’s good to start with a basic understanding of the culprit here. Cosmic rays are what we commonly call the bits of atoms that are blasted across the cosmos at darn near the speed of light. The vast majority of these are nuclei of hydrogen or helium atoms, so they’re positively charged, relatively massive, and highly energetic when traveling at that speed. Many are thought to come from stars, especially ones that die and go supernova, whether in our own galaxy or in others far off from Earth, or from even more violent things in our universe. They’re like tiny little bullets randomly sprayed out in every direction.
Most of the time, these little bullets fly through the vacuum of space unimpeded, but that changes when they reach our atmosphere. There’s suddenly a bunch of other “stuff” in the way that the cosmic rays can run into, like gas particles in the air, or all manner of denser things at the surface. The closer to the surface, the more things to collide with. And they do! Cosmic rays smash into other atoms like billiard balls. The result is more subatomic bits being thrown in different directions at varying levels of energy and electrical charge. Some of these bits can hit other bits and cause secondary smashing and splitting and so on. The result is a cascade of high-energy subatomic particles generally heading down toward the surface of the planet, or across it. The highest-energy particles are those that have been slowed down by fewer collisions, so they are more commonly encountered at higher altitudes. Likewise, it’s unlikely you’ll see much evidence of cosmic rays underground.
CDN had a hand in building the Fermi Gamma-ray Space Telescope launched into orbit in 2008. Its sensors detect the energies and origins of gamma and cosmic rays from within and outside our solar system.
How do Cosmic Rays Affect You?
Cosmic rays aren’t too much of a concern for our bodies in low doses, but they’re harsh on electronics, especially anything with a microchip. Think “Y2K” on steroids (if that doesn’t mean anything to you, go ask someone older). If a microchip gets a signal it’s not supposed to, it’s going to do something it’s not supposed to do. Integrated circuits and semiconductors are occasionally hit either by cosmic rays or by the charged subatomic pieces left over from the collisions those cosmic rays cause. When that happens, a ‘0’ can become a ‘1’ or vice versa. Left alone, you get data corruption, erratic operation, and all manner of consequences.
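To make the ‘0’ becomes ‘1’ idea concrete, here is a minimal Python sketch of a single bit flip. The function name flip_bit and the choice of which bit gets hit are made up for illustration; a real upset picks its own target.

```python
def flip_bit(value: int, position: int) -> int:
    """Return `value` with the bit at `position` inverted (0 = least significant bit)."""
    return value ^ (1 << position)


original = 0b01011011            # the number 91 stored in one byte
upset = flip_bit(original, 6)    # pretend a particle inverted bit 6

# One flipped bit turns 91 into 27.
print(f"{original:08b} -> {upset:08b}  ({original} -> {upset})")
```

Nothing about the hardware looks broken afterwards. The byte is still a perfectly valid byte; it just holds the wrong number.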
RadioLab’s “Bit Flip” podcast episode talks about this at length with some examples from recent history.
Any time the particle has a charge and an energy level higher than the signal level in a piece of circuitry, it has the potential to change that signal. There is a movement to monitor and include every aspect of our daily lives in the cloud using battery-powered IoT devices and protocols like BLE. As our electronics get smaller, more distributed, more reliant on battery power efficiency, they use lower and lower signal energies to function. This makes them even more susceptible to disruption by a greater number of these wild particles flying around.
On the other hand, the small size of battery-powered IoT devices means that they are smaller targets and are less likely to be hit by one of those particles. Larger systems present larger targets to the cosmos. In a way, integrated networks of IoT devices can act like large computers spread throughout an area. Data centers, network server farms, and supercomputers, meanwhile, are most certainly large targets. They are much more likely to be hit by high-energy particles, and they increasingly use energy-efficient chips as well. Think about it this way: whether you throw a baseball into a pool or play catch in the rain, the ball gets wet either way.
Using computers at higher altitudes raises the chances of an issue. The cosmic rays encountered in the atmosphere have higher energies and are more likely to change multiple bits of data at once. As if trying to get work done on a laptop during a flight with the turbulence and the little glass of, ahem… “just water” wasn’t hazardous enough, think about the increased potential for data corruption. Even more, what about the airplane itself? One of the issues reportedly examined while the Boeing 737 Max was grounded was how its flight computer would cope with bit flips of the kind cosmic rays can cause.
How do you Protect Electronics from Cosmic Rays?
The easiest way to protect electronics from being disturbed or damaged by cosmic rays is to make them really small, run on high voltage, and bury them deep underground. If for some strange reason that doesn’t work for you, there are a few other tricks that can help out.
Redundant Parallel Circuits – First, accept the fact that the electronics will get hit and signals will get disturbed. Consider, though, that it is extremely, extremely unlikely for this to happen in two specific places at the exact same time in the exact same manner. Two circuits can then be placed side by side, each running the same calculations on the same data. Unless a digital bit was changed by a cosmic ray or some other anomaly, the output from both circuits should always match. When they don’t match, re-run those last calculations. The downside is that you now have the size, expense, and power consumption of two circuits instead of one. The upside, though, is that you catch the glitch and can process data with only minimal slowing from the random recalculation here and there.
Redundant Processing – Knowing also that it is extremely, extremely unlikely for the exact same disturbance to happen at the exact same location at two specific moments in time, you can employ redundancy in software. Software can be made to run the same calculations with the same data twice on the same hardware circuit. If the result is the same, you’re good to go. If the two results don’t match, there was a disturbance. Run the calculations again. This method keeps the hardware simplified to a single circuit, avoiding the added size, power consumption, and expense, but it takes a hit on performance. Processing speed will be slowed significantly since the operations are doubled to achieve the same task.
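The run-it-twice-and-compare idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the names run_with_redundancy and make_faulty_sum are hypothetical, and the “upset” is injected on purpose from a list of bit masks so the behavior is repeatable.

```python
from typing import Callable, Iterable, List


def run_with_redundancy(compute: Callable[[], int], max_attempts: int = 5) -> int:
    """Run `compute` twice per attempt and accept the result only when both runs agree."""
    for _ in range(max_attempts):
        first = compute()
        second = compute()
        if first == second:
            return first
        # Mismatch: something disturbed one of the runs, so try the pair again.
    raise RuntimeError("results never agreed; giving up")


def make_faulty_sum(data: List[int], fault_masks: Iterable[int]) -> Callable[[], int]:
    """Hypothetical workload: sums `data`, XOR-ing the next mask into each result.

    A mask of 0 means a clean run; any other mask imitates flipped bits.
    Once the masks run out, every run is clean.
    """
    masks = iter(fault_masks)

    def compute() -> int:
        return sum(data) ^ next(masks, 0)

    return compute


# The second run is disturbed (one bit flipped), so the first pair disagrees and is retried.
faulty_sum = make_faulty_sum([3, 5, 8, 13], fault_masks=[0, 0b100, 0, 0])
print(run_with_redundancy(faulty_sum))  # 29
```

Note what the sketch cannot do: if both runs were disturbed in exactly the same way, they would agree on the wrong answer. That is the “extremely, extremely unlikely” case this approach accepts.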
Increased Junction Energy – In some circuits, the signal energy level can be raised at critical sections or junction points particularly susceptible to influence from cosmic rays. This hardens the circuit in a way that it can only be affected by the highest energy particles. This is an okay strategy to limit what errors could occur, but it does not catch the few that still do happen. It’s an incomplete solution but does not have major size, cost, or performance drawbacks.
These solutions catch problems that occur while data is being transmitted and processed but do not come into play for memory devices where digital bits are just stored for long periods of time. Errors there can accumulate, and they can be randomly distributed throughout the memory array or sit adjacent to each other if a cosmic ray of sufficient energy smashes into it. Single-bit errors where an individual ‘1’ or ‘0’ is changed are referred to as SEUs (Single Event Upsets). When multiple adjacent disruptions occur, they are called MBUs (Multiple Bit Upsets) and are much more challenging to deal with.
ECC Memory – A special class of memory can detect and sometimes correct errors with Error Correcting Code (ECC). This memory stores the data in little chunks along with some qualifying information about each chunk of data. The simplest example of this uses what is called a parity bit. This is a piece of qualifying information that takes one additional bit of space in the memory for each chunk (word) of actual data and simply says whether the number of 1’s in that chunk of data is an odd number (parity bit = ‘1’) or an even number (parity bit = ‘0’).
Take an 8-bit word, 01011011. It contains a total of five 1’s, and five is an odd number, so the parity bit would be ‘1’. The information stored in memory would then be the eight data bits 01011011 plus the parity bit ‘1’.
While it won’t be able to determine which bit of data was changed or be able to discriminate between an SEU and an MBU, it can at least determine whether the data was changed, as long as an odd number of bits flipped. An even number of flips leaves the parity looking correct. This method is memory efficient since it only requires one bit to monitor a long set of data bits, but it isn’t very powerful.
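Here is a small Python sketch of that parity check using the same word, 01011011. The helper names parity_bit and is_consistent are invented for this example, and the flipped bits are chosen by hand.

```python
def parity_bit(word: int) -> int:
    """Return 1 if `word` has an odd number of 1 bits, otherwise 0."""
    return bin(word).count("1") % 2


def is_consistent(word: int, stored_parity: int) -> bool:
    """Check a word read back from memory against the parity bit stored with it."""
    return parity_bit(word) == stored_parity


word = 0b01011011
stored = parity_bit(word)        # five 1s, so the stored parity bit is 1

single = word ^ 0b00000100       # one bit flipped
double = word ^ 0b00000110       # two adjacent bits flipped

print(is_consistent(word, stored))    # True: untouched data passes
print(is_consistent(single, stored))  # False: the single flip is caught
print(is_consistent(double, stored))  # True: the double flip slips through
```

The last line is the weakness in action: two flips cancel each other out as far as the parity bit is concerned.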
More sophisticated methods utilize what are known as Hamming codes. In a sense, these are similar approaches where small amounts of memory space are used for bits of information that can monitor whether data has been changed. The improvement is that in many cases these codes can locate and correct the error too. The methods are many and complex. For anyone interested, Wikipedia maintains a great entry on Hamming codes with further examples. For our purposes, suffice it to say this is what makes ECC Memory special. It’s not bulletproof, since large MBUs or corruptions in the monitoring bits themselves could occur and be uncorrectable, but it gives a great advantage in stability.
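For a taste of how locating an error works, below is a minimal Python sketch of the classic Hamming(7,4) code, which protects four data bits with three check bits. It is a textbook illustration, not what any particular ECC memory chip does, and the function names are made up for this example.

```python
from typing import List, Tuple


def hamming74_encode(data_bits: List[int]) -> List[int]:
    """Encode four data bits as a 7-bit codeword (positions 1..7, check bits at 1, 2, 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # watches positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # watches positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # watches positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4   # the failed checks spell out the position
    if error_position:
        c[error_position - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], error_position


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
print(codeword)                   # [0, 1, 1, 0, 0, 1, 1]

hit = list(codeword)
hit[4] ^= 1                       # pretend a particle flipped position 5
print(hamming74_decode(hit))      # ([1, 0, 1, 1], 5)
```

The trick is that each check bit watches an overlapping group of positions, so the pattern of failed checks points at the one bit that changed. If two bits flip in the same codeword, this small code gets fooled and “corrects” the wrong bit, which is the kind of uncorrectable MBU case mentioned above.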
There are some downsides to ECC Memory. Naturally, it does require a bit more space to store the same amount of data since some space for the error monitoring bits must be made available. This makes it a bit more expensive than non-ECC memory. Also, it takes a little additional processing time to check for and correct any errors found. This slows the performance of ECC memory by a small amount, but it’s almost negligible: less than 3% slower. These are small tradeoffs for the advantages in stability. Scientific research computer systems and internet servers typically use ECC memory.
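To put a rough number on “a bit more space”, the sketch below counts how many check bits a single-error-correcting Hamming code needs for a given word size, using the standard condition 2^r >= m + r + 1 for m data bits and r check bits. The function name is made up here, and real ECC memory may use a different or extended code, so treat the percentages as ballpark figures only.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


for m in (8, 16, 32, 64):
    r = hamming_check_bits(m)
    print(f"{m} data bits need {r} check bits ({r / m:.1%} extra space)")
```

The overhead shrinks as the word gets longer, which is one reason error correction tends to be applied to wide words rather than to individual bytes.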
Not for use in Safety-Critical Systems
Next time you see a label on a piece of equipment saying something like “this product is not intended for use in safety-critical systems” you can be reasonably assured cosmic rays will cause it to go haywire.
Some systems will work satisfactorily without the need for costly and complex hardening against cosmic ray interference. Industries like medical care and research, banking, data security, military, mass transportation, aircraft, and especially spacecraft all absolutely must include these provisions.
Are the products you work with seeing random digital failures? Are they critical to life support, safety, or stability of other systems? Are you taking the right steps to address the potential effects of cosmic rays and similar digital disruptions?
Be sure to ask questions of the developers you’re working with to ensure the proper level of attention is given to this. It could be critical.
CDN Inc. is a product design and engineering firm that can adapt easily to your project needs; engineering, industrial design, prototyping & manufacturing.