userbinator 9 years ago

The rowhammer "attack" is successful only because the hardware is just plain broken, and I consider it in the same category as things like a CPU which will calculate 1+1=3 if the computation of 1+1 is done enough times --- nothing software should even try to fix, because the problem is at a lower level. The solution is to demand that the hardware manufacturers make memory which actually works like memory should; and it should be possible, since apparently previous generations of RAM don't have this problem at all. In the early 90s Intel recalled and replaced, free of charge, CPUs which didn't divide correctly. Perhaps the memory manufacturers today should do the same for rowhammer-affected modules and chips.

Memory errors are particularly disturbing because they are often highly dependent on data and access patterns, and can be extremely difficult to pinpoint without special testing tools. I've personally experienced a situation where a system which otherwise appeared to work perfectly well would always corrupt one specific bit of a file when extracting one particular archive.

As a testing tool, MemTest86+ has always worked well for me, and the newer versions can detect rowhammer, although there is this interesting discussion about whether it is actually a problem (to which I say a resounding YES!!!) or if there's some sort of cover-up by the memory industry:

http://www.passmark.com/forum/memtest86/5903-rowhammer-probl...

http://www.passmark.com/forum/memtest86/5475-memtest86-v6-2-...

Run it on your hardware and if it fails, I think you should definitely complain and get it fixed.

  • nickpsecurity 9 years ago

    There are CPUs that do memory integrity checking to contain attacks. They're designed mainly for stopping software and peripheral attacks, but they consider RAM untrusted. They could probably be modified to deal with the new attacks.

    • userbinator 9 years ago

      ECC RAM has been around for a long time and may reduce, but not eliminate the risk. The problem is that the RAM is fundamentally defective.

      • nickpsecurity 9 years ago

        I know what the root problem is. I also know it comes from an oligopoly of companies that only care about money, probably have patents on key features, and operate in a price-sensitive market. Fixing root cause might be tricky unless you could be sure via contracts of volume deals from cloud and other big buyers.

        Meanwhile, small teams in academia are building CPU's that knock out those and other issues. Worth bringing up given that the fix you want isn't doable for most HW designers. RAM vendors might eventually use it as a differentiator but that's not guaranteed.

        • tsukikage 9 years ago

          You can't entirely blame the providers for only caring about money; the consumers that choose the budget hosting options for critical applications must surely share some of it.

          Server grade hardware is certainly available to cloud/VPS providers, but it turns out people are unwilling to pay $2 for a VM if there's one going elsewhere for $1.50.

          • nickpsecurity 9 years ago

            "the consumers that choose the budget hosting options for critical applications must surely share some of it."

            The customers expect the RAM they bought to work correctly. They might have even read papers on ASIC verification where the hardware companies brag about all the techniques they use to prevent recalls like the one Intel had. The issue is that the companies stopped doing, or reduced, verification on specific components to cut costs. What they bring in on the chips is way more than it takes to do that. So, the reason must be greed driving the profits up a little bit.

            This one is the companies' fault. I'd have assigned blame differently if we were talking security of regular, consumer products or even operating systems. Verification of repeating pieces of hardware circuits is an industry-standard practice, though. Except for RAM providers apparently.

    • willvarfar 9 years ago

      Encrypted RAM is offered by the newest Intel server-grade CPUs (SGX, Skylake) and the next AMD server-grade CPUs (SME, Zen).

      One of the main use-cases for these technologies is trusted computing in a cloud environment - the customer can assert that the hardware is securing the program state from the eyes of the computer owner!

      However, the cloud is actually made from cheap commodity boxes without server-grade anything! ;)

      Encrypting RAM pages would prevent the hypervisor from deduping pages between virtual machines, and this would be very negative for cloud providers who want to up the occupancy on each box as much as possible...

      In a few years, or perhaps longer, proper DDR4 and other immune memory will perhaps be mainstream in clouds. But until then, it seems we'll have a cloud fitted out with increasingly aging cheap machines with no rowhammer immunity.

      • makomk 9 years ago

        Don't think I've seen any non-server-grade processors in even the cheapest bargain-basement VPS hosts. (Low-end dedicated is different.) Cramming as many VMs into a big server as possible seems to be too important to their cost structure for that.

        • willvarfar 9 years ago

          We perhaps only disagree on what is "server-grade" vs what is sold for servers.

          Google, for example, are famous for making big data centres out of cheap commodity boxes, and I doubt Amazon are any different. I certainly know the Rackspace blades I've played with didn't make my grade either! :)

          • tristor 9 years ago

            I can't make any claims to the contrary about other providers, but I know at the very least that at one point in the not-too-distant past the primary systems used for Rackspace Cloud hypervisors were Dell R720 rackmount servers. Maybe not the most amazing hardware, but considering how common they are you can hardly deny they're "server-grade". The newer OpenCompute stuff is also clearly well-made hardware.

          • wmf 9 years ago

            Everything I've read implies that cheap commodity servers like Open Compute are just as reliable as name brand Intel servers (not surprising considering that they're made from the same parts), and ~95% of the market appears to be satisfied with that level of reliability.

      • nickpsecurity 9 years ago

        I figured it would end up in security-oriented, bare-metal hosting first. Or racks people rent out for their own boxes. Didn't know something like that was on new Intel/AMD CPU's. Thanks for the tip.

      • marcosdumay 9 years ago

        > However, the cloud is actually made from cheap commodity boxes without server-grade anything! ;)

        You know, I refuse to buy anything that does not support ECC for my home desktops (and don't even pay much for it). Only my laptop got a pass from this because there was literally no option available with it.

        Good to know cloud providers are not as careful... But honestly, shouldn't be a surprise.

        • nickpsecurity 9 years ago

          Same here. It helps to sell it if you don't say ECC = RAM + extra cash. That's the normal method. I instead say you have two options:

          1. RAM that works at this price.

          2. RAM that allows more crashes or corruption of your files for slightly-lower price.

          The Right Thing suddenly looks more obvious, except to cheapskates. Now I just need one with ChipKill built in. That's the next level of ECC. I haven't heard whether Intel or AMD have something similar.

      • amluto 9 years ago

        Encrypted RAM as AMD is implementing it (SME) protects nicely from "cold-boot attacks" but is otherwise largely a feel-good feature. It also probably doesn't help a whole lot against rowhammer-style attacks because it's merely encrypted, not authenticated. The result is that a bit flip will effectively randomize 64 bytes or whatever the block size is but will not be otherwise detected by the hardware. I bet that clever attackers will find a nice way to take over by randomizing 64 bytes.

        Intel's encrypted RAM is authenticated quite nicely, but it's not (yet?) designed for general purpose use -- it's for SGX only right now. Using it for everything would (if I understand correctly) add considerable space overhead and possibly considerable latency.

        • willvarfar 9 years ago

          But encryption will prevent dedupe, meaning a VM cannot attack other occupants as described in the article.

  • Afforess 9 years ago

    I don't agree; there is software that is designed to run on faulty hardware. This is often in high-radiation environments (see: outer space). I agree this is not an area where much hardening has been done in conventional security models, but in other environments it is common to use CRC error detection, parity information or other means to ensure that even if data is partially corrupted, the original can be restored.

    I see no reason to prevent someone from implementing this sort of error correction for GPG and other important cryptography.

    • jerf 9 years ago

      Hostile environments attack your software without intelligence. (When working with them, it may seem otherwise, but that's just cynicism.) Hostile people attack intelligently. Whatever mitigation you may imagine is possible by checking CRCs or something after the fact, you must account for the possibility that the software, the OS, or the CRC has also been attacked by a hostile intelligent adversary. The fact that we can make reliable software in the face of unintelligent attacks is not evidence that we can make secure software in the face of intelligent ones.

      Rowhammer is too powerful a technique to expect secure software to run on machines affected by it. This is an attack based on using rowhammer to change bits in other VM's memory. The only sane response to that, from the perspective of writing secure software, is despair. You can't deal with attackers in possession of that primitive.

      • Dylan16807 9 years ago

        Rowhammer is largely random. You don't get to target specific bits of physical ram. You find scarce weak bits and work to get the data located there. In this case that means you can only pick a couple bits per 4KB to attack. That won't let you fake out a CRC.

        • acobster 9 years ago

          That's where I'm getting a little hazy. The paper says the attacker can "induce bit flips over arbitrary physical memory in a fully controlled way." Sounds a little more advanced than "largely random" to me, and based on the article it sounds like FFS is a step up from "vanilla" Rowhammer...am I missing something?

          • ultramancool 9 years ago

            Yeah, I mean if it won't beat a CRC wouldn't ECC RAM be a reasonably easy solution to this and wouldn't most servers already be secured against it?

            • Dylan16807 9 years ago

              ECC RAM makes it harder, but three bit flips will still survive. It depends on whether the system actually acts properly when it sees a huge number of ECC errors happening.

          • Dylan16807 9 years ago

            They can pick a bit or two per page to attack, but then they're stuck with those bits.

            In theory they could attack a new bit every few minutes, but that requires a system that allows the victim page to be remapped multiple times. KSM does not; any other memory-merging system could work the same way to mitigate things.

            Even if they could keep remapping, it's a very slow attack that way. Reloading the checksum every ten minutes would keep you safe.

    • runeks 9 years ago

      This does not make sense. If an attacker can alter your data, he can alter your CRC codes as well. Or just redirect the pointer to the checkCRC function to one that returns true.

  • niftich 9 years ago

    > The rowhammer "attack" is successful only because the hardware is just plain broken

    I too am of this opinion and am surprised this view isn't widely shared. With DDR4, we should be asking for a refund and/or starting a class-action suit, yet we're putting up with software 'mitigations' instead.

    This isn't like the 2008 Phenom TLB bug [1] where the CPU was locking up so AMD released a workaround that kept it from freezing at the expense of a 14% performance penalty. This is like the floating point division bug [2] where the device no longer meets basic operational and accuracy guarantees. RAM cells bleeding into each other ought to be considered a fatal flaw, not some intellectual curiosity.

    [1] http://techreport.com/review/13741/phenom-tlb-patch-benchmar...

    [2] https://en.wikipedia.org/wiki/Pentium_FDIV_bug

    • dboreham 9 years ago

      Back in the day, when I were a hardware engineer, we called this "pattern sensitivity" and it was a defect!

    • patrick8 9 years ago

      FDIV was really not technically a serious erratum in the grand scheme of errata. The Phenom TLB bug was worse. Intel basically denied/sat on the issue for half a year, stopped just short of slandering Dr. Nicely, etc.; they made it into a complete PR disaster. If they had come out the week after it was reported and just said, here's a workaround, here's an opt-in replacement program (which they finally did, but by then it was too late), you would probably never have heard about the FDIV bug -- like the countless other errata we have software workarounds for.

      • niftich 9 years ago

        In retrospect I regret bringing up the Phenom because my argument could've stood without it, and I could realistically argue either way.

        But my original intention was pointing out that the failure mode of the Phenom was such that it wasn't exploitable for anything other than potentially denial-of-service; it was just inconvenient, and only affected a subsystem of the CPU which worked fine without it using a firmware workaround.

        Though you don't expect your CPU to halt and lock up, I believe it's far more insidious when you feed a device inputs and get the wrong output without any obvious indication that something went wrong, like in the case of rowhammer-vulnerable memory and FDIV.

        • patrick8 9 years ago

          I think that is the reason for the misunderstanding. FDIV was not really insidious in the way you describe. It was 100% predictable: certain bit patterns always gave the wrong answer in the quotient on the affected hardware, and it had a very straightforward software fix (with a performance cost, sure). You could demonstrate it immediately, but it really wasn't severe. (Q9 and Q10 http://www.trnicely.net/pentbug/pentbug.html)

          Rowhammer is a much more complex erratum that I don't feel qualified to comment on, especially the safety of the published mitigations, but it is in a class of bugs where the outcomes are not generally predictable due to the greater number of variables involved.

          My reason for replying initially though, is that I don't think that the line for what types of hardware defects are open to software workarounds is so cut and dry, and I don't think many people outside of kernel/OS dev realize how many errata are on the chips they use everyday with workarounds they don't notice.

          • slededit 9 years ago

            The early 386 32-bit multiplication bug is probably a better example. Fortunately there was little 32-bit software at the time.

    • userbinator 9 years ago

      > I too am of this opinion and am surprised this view isn't widely shared. With DDR4, we should be asking for a refund and/or starting a class-action suit, yet we're putting up with software 'mitigations' instead.

      I extensively test all the hardware I buy (CPU: LINPACK, RAM: MemTest86+) and if it fails any of those tests, it gets returned as "not fit for purpose". I've done this successfully a few times. A lot of other enthusiasts/power users do the same, especially if they're overclocking, and searches on other forums show plenty of users testing and finding (mostly other, not rowhammer) errors in newly-bought RAM even when not overclocking. But as noted in the threads I linked to, manufacturers may be trying to cover this up and downplay its severity. Even in the original paper on rowhammer, the authors didn't disclose which manufacturers and which modules were affected, although I think this should really be treated like the FDIV bug: name and shame. I blame political correctness...

      • pjmlp 9 years ago

        How do you do it, regarding LINPACK?

        I assume just compiling it and executing some tests that are part of it?

        • userbinator 9 years ago

          The Intel LINPACK distribution contains, besides the library, a sample benchmarking application using it, and that happens to be a very intense and "real" workload (solving systems of equations, i.e. scientific computation.) There are plenty of posts on various PC enthusiast forums about how to run it correctly. (And plenty arguing that it's irrelevant, mostly because their insane overclock seems fine but instantly fails this test. There's a good reason most people doing "real" scientific computing don't overclock; a lot of CPUs just barely pass this absolutely realistic test at stock speeds and voltages.)

  • wepple 9 years ago

    Without being an expert in this area, my gut feeling is that the fix to this problem will likely be funded by the end user. Given that competition continues to drive prices down, would 'secure RAM' be viable? Would you pay more for it?

    • userbinator 9 years ago

      > Given that competition continues to drive prices down, would 'secure ram' be viable? would you pay more for it?

      It's funny you mention this, since the problem only affects newer DDR3 and DDR4 modules and older RAM (EDO modules are apparently still in limited production and being sold) does tend to be significantly more expensive. Unfortunately the rest of the hardware needs to be compatible.

      This also means all the older hardware that gets scrapped in massive quantities daily is likely to contain RAM immune to this problem, which is somewhat ironic... maybe it's just a (sad) continuation of the "newer is more volatile" trend that can be traced back to thousand-year-old stone tablets which remain readable today.

      • marcosdumay 9 years ago

        Not stone, but clay tablets are probably more volatile than your USB drive. There's a huge sampling bias here.

        About the main point, why isn't ECC fixing this for everybody? I'll surely get cheaper, more volatile RAM, and use some of it on redundancy so it works better than the more expensive, less volatile kind.

    • GTP 9 years ago

      I have the same opinion as you: I would, but most people wouldn't. The root cause of the problem is that since the trend with RAM is "the bigger the better" (in terms of GBs), we have tons of capacitors on a small surface. I'm no expert either, but I think there's no simple hardware fix for this short of returning to RAM that holds less memory, and most people won't accept that. Maybe we're hitting the limits of the current technology and should switch to another one. Just as a side note, two years ago one of my professors mentioned ongoing research at my university into RAM that formed crystals instead of storing electrons, but I don't know any other details.

  • mankash666 9 years ago

    This blame placed on HW stems from a lack of understanding of RAM physics/electronics. As dimensions scale down, these things happen.

    The market has chosen to adopt the cost benefits of smaller transistors and higher capacity for the same $. It's a mix of physics and market forces, not malfunctioning hardware.

andrewstuart2 9 years ago

It seems the HN title and original title are both pretty wrong, at least according to the article content. The attack vector is really the ability to, if you have a known public key and a server using it, perform a pre-calculated bit flip such that the new public key is much easier to factor, and thus obtain a corresponding private key.

So you're not obtaining original private keys, you're altering original public keys so that you can more quickly factor a private key that will be accepted.

If this is an SSH public key, then you can obtain SSH access. If it's a PGP key trusted by the package manager, then you can craft signatures on packages that would be accepted as valid, assuming you can also get the target machine to download said package.

I think SSH is probably the most interesting attack vector assuming you can get network access to the host once you've jumped through the myriad hoops to perform this attack.

It's a serious issue that should be addressed (probably via forced from-disk reads or at minimum integrity checks), but I think the authors are perhaps a little too eager on the practical implications of corrupting in-memory public keys.

  • DigitalJack 9 years ago

    It's the sort of thing for which the NSA would spend their resources to develop an exploit tool.

    • userbinator 9 years ago

      Rowhammer is such a subtle effect and very easily blamed on many other things that it's not hard for the more paranoid among us to imagine the NSA deliberately sabotaging memories with it to use as a backdoor. When it was first discovered I wrote my thoughts on it here:

      https://news.ycombinator.com/item?id=8716977

  • ec109685 9 years ago

    In the paper, they compromise not only the PGP key but also the Debian update server address, so all it takes for the target to be compromised is a routine software update.

  • willvarfar 9 years ago

    Another attack would be to flip bits in code pages...

    That the attackers illustrated it by changing public keys so they could push updates or SSH into a box doesn't mean those are the only ways a system could be compromised. You can't say "I don't use SSH so I'm safe!" or anything like that.

xorgar831 9 years ago

Here's the crux of the memory issue from one of the link in the article:

DDR memory is laid out in an array of rows and columns, which are assigned in large blocks to various applications and operating system resources. To protect the integrity and security of the entire system, each large chunk of memory is contained in a "sandbox" that can be accessed only by a given app or OS process. Bit flipping works when a hacker-developed app or process accesses two carefully selected rows of memory hundreds of thousands of times in a tiny fraction of a second. By hammering the two "aggressor" memory regions, the exploit can reverse one or more bits in a third "victim" location. In other words, selected zeros in the victim region will turn into ones or vice versa.

  • runeks 9 years ago

    So it doesn't allow reading any data? I'm most nervous about leaking private keys.

    • jfoutz 9 years ago

      in and of itself, no. but it could alter a permission bit, for example, and then reading would be allowed.

  • wepple 9 years ago

    Rowhammer itself has been around a while, and is only 50% of this attack that has been posted.

    The other bit is the newer idea (well, an old idea, with a newer actual implementation): memory deduplication by your hypervisor leads to a very minor timing fingerprint when you write to a page of memory that had previously been deduplicated, i.e. the same physical page was shared among multiple VMs because it was identical... until you wrote to it and the OS/hardware had to copy-on-write it out to your own dedicated copy; that has a higher latency than writing to a memory page that is already exclusively yours.

frostmatthew 9 years ago

This attack wouldn't work with [current versions] of ESXi since VMs now share pages only if the salt value and contents of the pages are identical (each VM uses a unique salt by default). https://kb.vmware.com/selfservice/microsites/search.do?langu...

  • andrewstuart 9 years ago

    Sharing pages seems a big risk to take for saving a little memory.

    Why not turn it off entirely?

    • ams6110 9 years ago

      I would guess if you're a big VM hosting provider and you have thousands of VMs all running the same version of Windows or Linux distro, that it could add up to some real savings to have them share common pages.

      • andrewstuart 9 years ago

        I guess so.

        Seems the savings would be somewhat offset by having your whole business destroyed because it's easy to crack.

        • jerf 9 years ago

          Conceptually, it's safe. UNIX distributions routinely do the equivalent operation within single machines, it's a fundamental part of their operating model.

          It's just that in the face of defective hardware, it's not safe. But this is not surprising, because nothing is safe, so it isn't particularly a criticism of page sharing. This specific attack may have used it, but Rowhammer is a powerful tool. This is not the only way it can be used; it is merely an exemplar.

      • rasz_pl 9 years ago

        Can't you limit sharing to read/execute pages only?

        • AgentME 9 years ago

          Isn't rowhammer done purely by read operations?

          • Sanddancer 9 years ago

            Yep. In DRAM, reads are destructive, so every time you read a row, you have to write that row back.

          • rasz_pl 9 years ago

            from what I remember you need control (=ability to write to) over adjacent rows?

    • xavierd 9 years ago

      There is still significant sharing that can be achieved inside a VM; plus, a lot of the sharing comes from zero pages (full of zeros), which is still performed across VMs.

      Another benefit of the salting mechanism is that it allows the administrator to define groups of mutually trusted VMs within which sharing will be performed.

      disclaimer: I work at VMware and wrote the salting code.

      • lawnchair_larry 9 years ago

        Does the salting address the issue described in the dedup est machina paper? I noticed they did not mention that it worked against VMWare.

  • runeks 9 years ago

    How can VMs share a memory page at all with this scheme when the salt is unique to each VM? It sounds more like turning off inter-VM memory sharing...

ars 9 years ago

People are focusing too much on the exact specific attack shown here: deduplication, modifying a public key, etc. (And proposing solutions like turning off deduplication, checksums, etc.)

But that's just this attack - the fact that they have that much control over memory means there are FAR FAR FAR more possible attacks.

If you can control memory to that level then you are limited only by your imagination.

The only mitigation I can think of at the moment is ECC memory. And shame on Intel for only supporting that on Xeon.

  • andrewstuart 9 years ago

    What can you do to attack other VMs if you don't have shared memory with them?

    • rasz_pl 9 years ago

      Depends: do they run Node? Because there have been successful JavaScript rowhammer implementations demonstrated.

  • mjevans 9 years ago

    That used to be a point in AMD's favor for their desktop CPUs.

  • Dylan16807 9 years ago

    There's a difference between "if you control memory" and "if you control your memory". This does not give you general-purpose access.

walrus01 9 years ago

It is more costly, but this is a good reason to use a dedicated chunk of memory for every Xen PV domU. No oversubscription!

Allowing multiple domU VMs on the same dom0 (or the equivalent in other hypervisor platforms) to re-use memory and balloon/contract memory on the fly is what enables this.

  • runeks 9 years ago

    Can you point me to some services that provide, specifically, Xen PV VMs with non-oversubscribed memory?

    I'm considering deploying a custom unikernel for protecting the private key data for my app[1], until I have enough money for a Hardware Security Module.

    [1] http://security.stackexchange.com/questions/135457/penetrati...

    • walrus01 9 years ago

      Sorry, I can't; we use Debian stable + Xen on our own bare-metal hardware with 256 GB to 1 TB of RAM. I've never tried to buy a rental VM using the same dom0+PV setup. All of my off-site VMs are for testing, some cheap $4/mo-type OpenVZ instances that are basically glorified jails.

    • sn 9 years ago

      I'm not sure if anyone actually oversubscribes ram with Xen. But we (prgmr.com) still allow you to order PV VMs, mostly because NetBSD performance is abysmal in HVM mode.

micro_softy 9 years ago

"For the attacks to work, the cloud hosting the VM must have deduplication enabled so that physical pages are shared between customers."

But the vendor's cloud will not disable sharing pages of physical memory because ____.

This is a great counterpoint to the salesman trying to sell you on "cloud" anything.

Why is it less expensive to use the "cloud"?

One reason is because you do not get your own physical server, including your own RAM.

When the "cloud" buzz began to gain momentum years ago I raised the issue of not knowing who your "neighbors" were on these physical servers that customers are sharing with other customers in datacenters.

As usual, these concerns will just fade into the background... again.

Animats 9 years ago

Will rowhammer attacks work against ECC RAM? Multibit memory errors should be detected, even if they can't be corrected.

  • strstr 9 years ago

    ECC is one of the mitigations, as well as increased refresh rate.

    The 'best' solution is better RAM: some vendors are more vulnerable than others.

trendia 9 years ago

Would this be a threat to services running on AWS?

  • wmf 9 years ago

    No, because AFAIK EC2 does not dedupe RAM.

    • SBArbeit 9 years ago

      What about Hyper-V / Microsoft Azure? Anyone know if they de-dup memory like this?

      • Wakko1 9 years ago

        No. Hyper-V has no memory De-dup function. Azure runs on Hyper-V so it's not vulnerable either.

        • saltyhiker 9 years ago

          What popular cloud providers are vulnerable?

runeks 9 years ago

Ouch. Before reading this article I was seriously considering deploying a signing service as a HaLVM (Haskell) Xen PV unikernel running on EC2. The service would receive its private key after startup, such that the key never touches disk. Now I'm a lot less inclined to pretend that the Xen interface actually protects me...

  • ploxiln 9 years ago

    Xen has had page-table and interrupt vector related security vulnerabilities. But I don't think EC2 would use non-ECC RAM, so I don't think it's vulnerable to this "rowhammer" technique. (I also don't think EC2 would do cross-VM page deduplication, another necessary condition.)

    • willvarfar 9 years ago

      Perhaps we need more certainty than just "think"?

      That AWS don't boast that they are not susceptible to this suggests that perhaps at least some of their setup is?

      • Rezo 9 years ago

        The EC2 FAQ [0] states:

        "In our experience, ECC memory is necessary for server infrastructure, and all the hardware underlying Amazon EC2 uses ECC memory."

        While ECC apparently does not completely mitigate rowhammer, it helps.

        [0] https://aws.amazon.com/ec2/faqs/

  • mseri 9 years ago

    AFAIK Xen does not use memory deduplication. KVM aside, one should be worried about things running inside a Linux host/VM, like containers. Maybe I am missing something.

Annatar 9 years ago

> For the attacks to work, the cloud hosting the VM must have deduplication enabled so that physical pages are shared between customers.

This "Flip Feng Shui" wouldn't work in SmartOS simply because the hypervisor does not implement memory deduplication.

Good luck with VMware though.

tmaly 9 years ago

It seems like a dedicated server would solve this issue in some sense. If you're not on a shared VM, then an attacker could not affect your memory.

For those that cannot be on a dedicated server, what changes could be made to the shared VM memory setup to reduce this attack surface?

lifeisstillgood 9 years ago

some thoughts:

  For the attacks to work, the cloud hosting the VMs must have deduplication enabled so that physical pages are shared between customers.

This seemingly is an attack where two VMs on the same host can read each other's memory, if a deduplication flag is set on the VM controller. This seems to offer cloud hosters some easy (paid-for) upgrades, to be honest

It's not (afaik) Heartbleed time. It's bad, but the effort required is high, and afaik the attacker will replace your key with their key - making it clear you are compromised.

  • acobster 9 years ago

    The abstract says the attack allows "flips over arbitrary physical memory in a fully controlled way." If I'm understanding that correctly, it would be trivial to then restore the old key alongside it, leaving the victim none the wiser.

    Also, as others have pointed out, this is a hardware issue and the clear solution is to swap out the vulnerable RAM. Yeah, paying more is an "easy" way to have peace of mind (if that's even an option for you as a "cloud hoster"), but that's just backwards IMHO: a security vulnerability on the host's side should not translate into an upsell.

caf 9 years ago

I wonder if it would be worth checksumming public keys and re-checking the checksum each time they're used?

  • quickben 9 years ago

    Something like Perspectives for Firefox?

    • caf 9 years ago

      No, much simpler - store a checksum alongside public keys in places like .ssh/authorized_keys, and have the software like sshd recompute the checksum of the in-memory key each time it uses it for authentication.

      The attack relies on glitching the in-memory key.
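
      A minimal sketch of that idea in Python, assuming a hypothetical storage format (not sshd's actual authorized_keys layout): store a SHA-256 digest alongside the key bytes, and recompute it from the in-memory copy before every use.

      ```python
      import hashlib

      def make_record(pubkey_bytes):
          """Store the key together with a SHA-256 checksum (hypothetical format)."""
          return pubkey_bytes, hashlib.sha256(pubkey_bytes).hexdigest()

      def verify_record(pubkey_bytes, stored_digest):
          """Recompute the checksum of the in-memory key before using it."""
          return hashlib.sha256(pubkey_bytes).hexdigest() == stored_digest

      key, digest = make_record(b"ssh-rsa AAAAB3...")     # made-up key bytes
      assert verify_record(key, digest)                    # intact key passes
      flipped = bytes([key[0] ^ 0x01]) + key[1:]           # simulate one bit flip
      assert not verify_record(flipped, digest)            # glitched key is rejected
      ```

      The check only helps against flips that land before the verification; the sub-thread below covers the timing gap.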

      • kccqzy 9 years ago

        This doesn't sound correct. What if the attacker times the operation so that the bit corruption occurs after the checksumming but before being actually used for cryptographic operations?

        • caf 9 years ago

          Yes, to be safe you'd have to do the checksum after performing the RSA operation.
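
          One way to do that final check, sketched with toy numbers (tiny illustrative primes, nothing like real key sizes): perform the private-key operation, then re-verify the result with the public exponent before releasing it, so a glitched computation fails the self-check instead of leaking.

          ```python
          # Toy RSA for illustration only; real code would use a crypto library.
          p, q = 61, 53
          n = p * q                              # modulus
          e = 17                                 # public exponent
          d = pow(e, -1, (p - 1) * (q - 1))      # private exponent (Python 3.8+)

          def sign_checked(message, d, e, n):
              """Sign, then re-verify with the public key before releasing."""
              sig = pow(message, d, n)
              if pow(sig, e, n) != message % n:
                  raise RuntimeError("self-check failed; possible bit flip")
              return sig

          sig = sign_checked(42, d, e, n)
          assert pow(sig, e, n) == 42
          ```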

  • runeks 9 years ago

    That's security by obscurity (which may work to delay the attacker). If an attacker can modify your public keys he can modify your checksums as well.

    Seems to me that a public key should be identified by a cryptographic hash of it, rather than the public key itself. Then the attacker would need to replace the entire hash, rather than just a few bits, because the hash changes completely just by flipping a single bit in the input.
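
    The avalanche effect is easy to check directly; a quick stdlib sketch with made-up key bytes:

    ```python
    import hashlib

    key = b"ssh-rsa AAAAB3NzaC1yc2E..."          # made-up key bytes
    flipped = bytes([key[0] ^ 0x01]) + key[1:]   # flip a single input bit

    h1 = hashlib.sha256(key).hexdigest()
    h2 = hashlib.sha256(flipped).hexdigest()

    # Count differing hex digits between the two digests.
    differing = sum(a != b for a, b in zip(h1, h2))
    assert h1 != h2
    ```

    For a good hash, roughly 60 of the 64 hex digits should differ: as far as the hash is concerned, flipping one input bit is as disruptive as replacing the whole key.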

    • caf 9 years ago

      The attacker isn't making targeted modifications to your public keys, though: they're randomly glitching it, and using the page sharing implemented by the hypervisor to read out and factor the glitched version.

      Even with, say, a 64-bit checksum, there's only a 1 in 2^64 chance of the randomly modified key/checksum pair matching. But you could use a cryptographic hash as your checksum if you wanted.

      I suggest this not because I think it would be a complete defence against all Rowhammer attacks - it wouldn't - but because the general fragility of the RSA construction means that doing it with any potentially corrupted input gives me the willies. There are sources of bitflips other than Rowhammer, and it just strikes me as a generally good idea not to leak the results of RSA operations performed on potentially bitflipped inputs.

  • jameshart 9 years ago

    If your threat model now includes 'the attacker can at arbitrary times make arbitrary alterations to the working memory of my process', then no, this won't help. You can't trust the checksum, you can't trust that the data you just checksummed hasn't subsequently changed, and you can't trust that data which passes a checksum wasn't previously different. Also you can't trust the checksum code itself. Or the operating system you're running on. Or anything.

arrty88 9 years ago

Is Linode safe?