Last year Ridley and I presented at Infiltrate 2013. I’ve been going since the conference started and it’s pretty awesome actually. While we were there we saw this amazing talk by this guy Jasper van Woudenberg “Hardware Sidechannel Attacks and Fault Injections.” Some other guys and I commented on how awesome it was, we mentioned our ARM exploitation training, one thing led to another and then pretty soon Ridley and I found ourselves in San Francisco teaching his guys all about ARM exploits, while we got schooled on all manner of what is colloquially referred to as “DPA”.
I think we (Ridley and I) won out on the exchange, Power Analysis was like magical voodoo for me, and it still is, but less so. And they have lasers for crying out loud! But I’m sure software exploits were pretty mystifying to them, too, I suppose everything is before it’s explained to you.
Anyhow, Van Woudenberg is CTO of Riscure, a Dutch company that has a small office in San Francisco. Basically they specialize in side-channel and fault injection attacks, doing assessments and such, as well as manufacturing equipment and software to conduct these types of attacks.
Side Channel Attacks?
This comes down to analyzing power signatures of equipment, which can be measured “directly” (by tapping the power lines) or indirectly via measuring EMF (electromagnetic fields), and, in the case of fault injection, purposefully “glitching” the equipment while making measurements to induce “unintended” behavior in the microprocessor and then record the result. And it turns out doing this you can do some pretty amazing things.
“Glitching” is accomplished through five basic means: clock (overclocking), power (such as by varying voltage levels above or below specified levels), EMF (by emitting electromagnetic pulses targeted at portions of chips), temperature (heating beyond spec), and, yes, lasers (by literally shooting lasers at equipment). They have their own frickin’ lasers to shoot at microchips, it’s pretty awesome.
They’ve also created this product called “Inspector” which helps automate data collection and analysis and to control the actual physical attacks. Riscure has developed their own testing rigs, that you plug into a computer with the Inspector software installed, allowing you to fine-tune your attacks. The software then records measurements (such as voltage levels), and has a variety of signal processing software, and post processing software, to analyze the data and conduct specific types of attacks (such as breaking RSA).
So here’s a brief overview of what we learned. It was a bit of a firehose, so I tried to take as many notes as I could. Some of what I’m about to say here is most likely a little incorrect. But I’ll do my best.
Introduction to Power Analysis
We’ll start with “DPA” — Differential Power Analysis. I always sort of off-hand referred to “DPA”, when it turns out what I really mean is just “PA”. The “Differential” refers to a specific type of power analysis. But saying DPA for everything just seems to roll off the tongue better.
The basic idea is that semiconductors use power, and the more “logic” a semiconductor has to do, the more power it consumes. Now due to the physical nature of CMOS circuits, the amount of power consumed is basically equal to the number of “ones” it has to store. (Every 0-bit is 0-volts, every 1-bit is 5-volts, get it? Like that, more or less). This means that if, say, a register contains a 32-bit value, that register is going to use power proportional to the number of “1-bits” used to store the number. Conveniently, the term for this is “Hamming Weight” — the number of “1-bits”. So on binary systems, the numbers 1, 2, 4, 8, 16, 32, etc., all are Hamming Weight 1, and are represented using the same amount of power. Whereas the numbers 3, 5, 9, 65664 (0×10080), etc. are all Hamming Weight 2, and consume twice as much power.
Another concept to keep in mind is that the change in power, over time, is a function of the “Hamming Distance“. So if a register went from the value 0×10080 to 9 (both weight 2) the Hamming distance is 0, and so no change in power would be recorded. But a value going from 0×10080 to 0×10081, or any other bit flip, would produce a change in power equal to “one step”.
What this means is that the power going into a chip, or emitted from a chip, MUST leak the Hamming Weights of numbers currently being processed, and changes in Hamming Distances of numbers being processed will produce measurable differences in power. There’s pretty much nothing you can do to stop this with standard circuits.
So first step, is you rig an apparatus up to the power and/or ground of the system under test. They went over a bunch of EE stuff that I simply don’t fully understand, but you need a high-speed oscilloscope to take many many power samples per second. If you are analyzing a microprocessor that runs at 5 MHz, well then you need to probably record at at least 10 MHz in order to take at least one sample per “clock cycle” whilst still accounting for the Nyquist Theorem. You probably will attach a low-pass filter, because we are likely only interested in power fluctuations that occur at the rate of the CPU clock (since that is how often CMOS circuits will be switching, and leaking power levels equivalent to the Hamming Weights of data being processed), and when dealing with physical equipment you do not always get to tune the collection frequencies and such to be exactly the way you want. You then turn on the machine, get it to process some data, and record power levels.
Well if it’s that easy you must be able to hook up an oscilloscope to “Power” and “Ground” and leak out all the secret bits right? Well no, because there is a massive amount of “Noise” in any of the signals you can measure. Furthermore, most of the time you do not have a diagram of the semiconductor, with atomic-scale probes to attach to each individual transistor. No, the semiconductor is doing a million things at once, and your measurements are the combined measurements of all CMOS circuits all going at roughly the same time.
Now if you have any background in signal processing, you know that “Noise” is usually incoherent, or basically its “totally random” and so there is no correlation between samples at any periods in time. But a continuous “Signal” is in some fashion coherent, meaning that at any two separate periods in time the signal looks roughly the “same” and can be mathematically correlated. To spare you the math, the point is that if you add (like literally arithmetically add) a series of incoherent samples together in a timeseries, they tend to cancel out. Conversely the signal gets boosted every time you add it together. Think of a sine-wave: as you add more and more of the sine-wave together, its amplitude increases by one step each time. But the noise is so random it mostly cancels itself out.
Thinking of that sine-wave though, we know it goes up and down, up and down. It’s cyclical. If you were to just randomly pick two periods in time to try and add the signal together, periods “A” and “B”, you might end up with the case where it’s going “down” in period “A” but “up” in period “B”. If you add those two signals together… they’ll cancel each other out! This is called the “phase” of the signal, and so when you do this type of analysis you have to try and correct for that “phase difference”. Basically it means that you may have to “shift” one signal around a little, so that when you try and combine signals, the positive aspects amplify up, instead of cancelling out.
That is basically what they do. You make many measurements, possibly as many as “millions”. You try and “time-shift” the signals (“align” them). You the combine them together to get the “signal” to stand out.
So what “signal” are we looking for? How do we know where to “align”? I mean isn’t the point that we have a noisy signal and we don’t know how to extract the signal? Yes, and this is why there is a human in the loop making judgments. For example, assume we are attacking a naive RSA implementation. In RSA, if one of the values in the key is a “0″, then basically the arithmetic operation ends up being a “square”. But if it was a “1″, then what happens is a “square and multiply”. This means that for every bit, there is going to be some large power draw as it does a lot of math of numbers of many different Hamming Weights. But for “1s” this computational power draw will last longer (two operations instead of one). If, through guesswork and reversing, we knew that RSA occurs “approximately 32 microseconds” after we “inject data”, we could hone in on the power traces at around 32 microseconds. We then eyeball where we think we see something looking like “more power”. We then focus in and try to find a unique power signature, that might correspond to some specific set of instructions being executed. Using that, we then try to shift all the input samples just enough so that that unique power signature that is present in all of them, is “aligned”. This “shifts the phase” of the “sine wave” we are trying to focus on.
We can then combine the input samples together, and get a smoothed out result, where the power signatures corresponding to the RSA algorithm at around 32 microseconds into the trace, are all aligned and “amplify up”, whereas all the noise, and all the signals that may have occurred later in the trace, aren’t really aligned together or are incoherent, and so all cancel out.
From there, we can conduct our actual attack.
Simple Power Analysis
In the hypothetical RSA example just shown, this can be enough to read out the RSA key. Purely by just looking at a power trace, and seeing where power amps up, and then shoots down. “Long” power traces correspond to a “1″ bit doing the square and multiply, “short” traces to a “0″ bit where it only did the square. We did it in the lab: you can literally read out the RSA key with your eyeballs from power traces from an embedded device using a naive RSA implementation. Similar techniques can be used to pull out ECC keys.
So that’s “Simple Power Analysis”. What’s the “Differential” (DPA) kind?
Differential Power Analysis
Well under differential analysis, all you are doing is subtracting recorded power levels to find the difference in power. Why would you do this?
Say you are sending thousands of uniformly random inputs to the system. You then recorded traces for each input. You focus on one bit of the input, say “the zeroth bit”. For each trace where you sent input where the bit was 0, you put that in the “0 bin”, and likewise where that bit was 1, you put in the “1 bin”. You align all traces (0 and 1) together, and then combine all the 0 traces into one averaged trace, and the 1 traces into another average trace. If you now subtract the combined 0 power levels from the combined 1 power levels, the difference will represent a difference in power between performing computations on the 0, vs the 1. Where this difference is small, or near zero, we can say there is basically no difference in power during those computations, which, because CMOS circuit power levels are a function of the bits used in computation, means those samples did not have any data dependency on the value of that bit. But where the difference is “large”, that means that during that period of the computation, the CMOS circuits were responding to the particular value of that bit.
Remember that power differs according to the Hamming Distance. So if we know that at some point a computation involved a “flipped bit”, then the difference in power must be proportional to Hamming Distance! That is how we can determine where in the trace there is a data dependency on a flipped bit.
Let’s assume we want to break DES. So lets say we send 1000 random inputs and record a few hundred samples (clock cycles) for each input, getting 1000 “traces”, and we also record the encrypted 1000 outputs. We align all the traces together. We can then say to ourselves, ok, DES operates on 16 rounds. For each round the computation involves a portion of the key, the subkey for that round. Each round produces two 32-bit values (“L” and “R”). For the 16th round we might call it L16 and R16. In the final round, the subkey is XOR’d with “R15″, input into the S-boxes, and the result is then XOR’d with “L15″. The XOR’d L15 becomes R16, and R15 becomes L16, because at each round the outputs are swapped (according to the Feistel scheme). R16 and L16 then go through the final permutation before becoming the output ciphertext.
Because of that, given a ciphertext, we can inversely compute what L16 and R16 was for the last round (because the permutation just shifts the bits around). R15 is equal to L16 by definition. L15 is XOR’d with the S-box output of the subkey and R15 to produce R16. So we compute all possible potential values of one of the inputs to the S-box, getting 64 (2**6) possible values. And then for each S-box guess, and for each trace, we select a single bit and we say: “OK, for this power trace and encrypted output value, assuming the S-box input we just guessed, and knowing what R15 is, what value would the zeroth bit of the subkey have to be to produce the R16 value we saw? If that bit would have been 0, we’ll put this trace into the ’0 bin’. If that bit would have been 1, we’ll put this trace into the ’1 bin’”.
We then average together all the traces in the “0 bin”, and then average together the “1 bin”. We then subtract the two. IF our guess was correct for the subkey for that bit, that would mean that the traces we stuck in the “0 bin” really did involve a bit value of zero, and the traces we stuck in the “1 bin” really did involve a bit value of one. And that would mean the difference in the averaged power traces should produce a “power spike” somewhere in the power trace corresponding to the computation of the last DES round. Why? Because if our inputs were “binned” correctly, one of them involved a computation of a 0 bit, and one of them with a 1 bit, and so there must be a measured difference in power in the computation because we flipped a bit (and therefore changed the Hamming Distance of some numeric values).
If we see such a spike, we can say “Yep! Our guess for that bit of the subkey for the last round must have been correct.”
If our guess was wrong, that means the “0 bin” and the “1 bin” actually just contained a random selection of signals that were not correlated together in terms of any Hamming Distance, which would mean there wouldn’t be a power spike. So no “peak” or no “power spike” means the guess was wrong.
But not to worry, we made a lot of guesses, one for all possible subkey values for the last round. So we repeat this guessing process, assigning traces to the 0 bin or 1 bin, for each possible subkey value, until we get spikes that readily identify that guess as being correct. We then repeat this process for all bits, and eventually we’ll obtain the subkey for the last round.
If you cannot obtain the subkey, it probably means there was too much noise drowning out the signal, and the solution to this is take more traces. Or, it means you didn’t align signals correctly and you need improve the pre-processing and “alignment” of your data.
At that point, we can XOR R16 with the value we obtained by successive guessing, to yield L15. We have thus broke the very last round of DES. We can then essentially repeat this step backwards for all DES rounds and obtain the complete key. (In practice, you only need to break a few rounds and can then either brute-force the rest of the keyspace or use some more sophisticated statistical analysis to obtain the rest of the key).
I hope I remembered that all correctly. But the same, more or less, hand-waving high level idea here can be used to break 3DES, AES, RSA, ECC, etc. The attacks have to be tailored to the specific structure of the cryptographic algorithms at hand and there is extensive literature about how to conduct these attacks.
Ok! But what about fault injection?
Basic Fault Injection
Well fault injection can be used for many things. One attack basically boils down to trying to glitch the processor when it gets to a “branch” instruction. For example lets say it computes some cryptographic hash, and then compares to a known result, and branches to “explode self” if it doesn’t match, but “run signed code” if it does match. Imagine we could magically cause the status register to be zeroed out right before that compare. Or imagine if we could magically turn that “jnz” into “nop”. How could you do that?
Well overclocking, bumping voltage, and the other glitching techniques described can accomplish exactly that. What happens when computers run out of spec? I’m sure if you’ve tried overclocking your GPU way beyond spec you know what happens.
If you can narrow down where in a power trace the “jump” instruction probably occurs, and bump clock or voltage at around the right moment, you might get the CPU in a weird state where it hangs, crashes, …. or by luck, happens to skip over the jump or execute what more-or-less happens to be a NOP. When we did this, it basically boiled down to glitching a piece of equipment bajillions of times until you get lucky.
In the case of EMF or laser injection, you can even pinpoint the glitch to a specific portion of the microprocessor you are attacking. Here you can potentially reverse fuses, or avoid “glitching” parts of the processor that you don’t want to screw with.
Differential Fault Analysis
But glitching is even more awesome than that. You can glitch processors to …. ta da! Obtain encryption keys. Using DES as an example again, let’s assume the encryption key is baked into the device. We run the process with a ciphertext and obtain a legitimate result. We then run it again and glitch right in the middle of DES during computation of the 15th round, corrupting the “R15″ value, and obtaining a bogus result. (How do you do this? By profiling the application, identifying when the crypto occurs, picking the crypto rounds out of the power trace through statistical analysis and educated guesswork, and so on, and then you target the glitch right when the CPU would’ve been executing those instructions to compute R15). During the last round DES can be modeled like this:
R16 = F(R15, K16) + L15
Where R and L are the two halves of the inputs coming into the last round, F is the Feistel function (S-boxes and so on), and “+” means XOR. If we feed in the same inputs each time, we obtain two results:
R16 = F(R15, K16) + L15
R’16 = F(R’15, K16) + L15
R’15 here is the “glitched” result of the 15th round
R’16 + R16 = F(R15, K16) + L15 + F(R’15, K16) + L15
R’16 + R16 = F(R15, K16) + F(R’15, K16)
Now on the last round, L16 is equal to R15 before permutation. So we can reverse out what R15 was (again, the permutation just switches bits around). That means we have F(known, K16) + F(another_known, K16), and we can then bruteforce what K16 was to obtain the final XOR’d result R’16 + R16.
To go backwards we need to figure out what L15 and L’15 are, and this requires that you do some more math that I forgot what it was, but there are a several published attacks that basically let you work backwards from here and obtain the rest of the DES key.
And again, there are fault attacks for AES, RSA, and so on. Each attack is tuned to particular implementations of the crypto algorithms.
A Plug for Riscure Because They’re Awesome
These attacks were conducted with their Inspector product, which has a ton of algorithms baked in to do the statistical analysis for you. We didn’t go through all of the statistical analyses, but from poking around in their GUI they have a ton of them. I would have to guess that most usable publicly documented algorithms and attacks against various crypto algorithms are already baked into their product, plus I believe they have coded up some of their own secret sauce in there as well.
We flew through a whole bunch of other amazing stuff. Using EMF probes to create a map of power draws on a microprocessor so you can attempt to identify functional components within the chip, using Fourier analysis to find clock rates or I/O rates based on EMF data, how focus your power attacks on those parts of the chip that are of interest (so that perhaps your power traces are not polluted with as much “noise” such as the I/O subsystem, DRAM interface, etc.) When you use low-power IR lasers vs. high-power lasers (gamma lasers? I forgot!) to do laser fault injection. We learned all about Correlational Power Analysis, which requires you have a “model” of the power signature of particular operations, and then uses math to fit signals to that model. This was extraordinarily powerful and allowed us to easily extract 3DES keys from smartcards purely based on power signatures, but I forgot how any of it worked. To give you an idea though, it might take us 15 minutes or more to do capture and analysis using SPA or DPA with tens of thousands of input traces, whereas with the right CPA model the same task can be accomplished in under a minute.
Anyhow, this stuff was damn cool. Jasper and the rest of the Riscure guys are really smart and cool dudes. Check out their site, there’s a lot there and you can find some of their papers on *PA topics with Google. I think Ridley and I could both say this was probably the most exciting stuff either of us have learned in years. Oh well! Back to web pens and reversing malware! (wah-waaaaah)