I can hear through these walls
And I hear every sigh, every sound
- Phil Collins
In June last year, under conditions of intense secrecy, teams of developers around the world forfeited their holidays to fix three unprecedented vulnerabilities that lay at the very heart of the hardware that powers our computing infrastructure.
The flaws these developers were scrambling to mitigate gave attackers unprecedented powers. Like the creepy inhabitant of Collins' motel room, listening through a glass pressed to the wall to eavesdrop on his unsuspecting neighbour, it became possible for an attacker using only normal unprivileged code to break through the supposedly impermeable walls that isolate user code from the operating system kernel.
Even worse, in virtualised environments, they could effectively spy on their neighbours. In a cloud environment, where tenants might not even belong to the same organisation, that's a terrifying prospect.
Decades of processor optimisation led to critical flaws
These were not easy flaws to fix. They resulted from more than two decades of hardware innovation that had delivered unprecedented performance gains for modern processors. Specialised hardware, such as the translation lookaside buffer (TLB), the branch target buffer (BTB) and sophisticated multi-level caches, along with architectural changes such as out-of-order and speculative execution (which added a kind of quantum uncertainty to the way code would run), had revolutionised CPU performance. As workloads and software complexity continued to increase, this performance was much needed.
But in competing relentlessly, the CPU vendors largely overlooked a critical security issue.
It arose from their efforts to overcome a significant performance bottleneck. Despite decades of spectacular improvements in chip speed, main memory remains a serious choke point. Its latency has barely improved in nearly two decades, so memory access must be optimised if software is to fully exploit the raw power of a CPU that can execute hundreds or even thousands of instructions in the time it takes to retrieve one item from main memory.
So high-speed cache memory, located right next to the CPU, became critical to mitigating this problem. With sufficiently sophisticated (and large) caches, software, which is mostly ‘loops within loops', could efficiently retrieve the data it needed.
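To see why those caches matter so much, consider a deliberately simple and entirely illustrative experiment: doing exactly the same work in a cache-friendly order and then in a cache-hostile order. The sizes and code below are a sketch rather than a benchmark, but on typical hardware the difference is dramatic.

```c
/* Illustrative sketch (not from the article): the same summation done in a
 * cache-friendly order and a cache-hostile order. In C a 2-D grid is laid
 * out row by row, so walking it row-first reuses each fetched cache line,
 * while walking it column-first keeps jumping to new lines and pays the
 * main-memory latency far more often. Sizes are arbitrary illustrative values. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 8192L  /* 8192 x 8192 ints = 256 MiB, far larger than any CPU cache */

static int *grid;

static double walk_seconds(int row_major) {
    struct timespec t0, t1;
    volatile long sum = 0;                 /* volatile: keep the loads in place */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++)
            sum += row_major ? grid[i * N + j]   /* consecutive addresses */
                             : grid[j * N + i];  /* 32 KiB stride between accesses */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    grid = malloc((size_t)N * N * sizeof *grid);
    if (!grid) return 1;
    for (long k = 0; k < N * N; k++) grid[k] = 1;  /* touch every page for real */

    printf("row-major (cache-friendly):   %.2f s\n", walk_seconds(1));
    printf("column-major (cache-hostile): %.2f s\n", walk_seconds(0));
    free(grid);
    return 0;
}
```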
Now the high-speed caches that are critical to CPU performance can't be read directly by user-level code. They act only as an intermediate marshalling area for data which is used frequently. CPU vendors therefore didn't get too concerned about cached data that a user could never retrieve. If permission checks were made before this data was marshalled from cache back to the processor's registers or main memory, surely the system was intrinsically secure? Or, if speculative or out-of-order code temporarily left data in cache that would never be made available to the user, what did it matter?
Side-channel attacks
However, it turns out that an attacker can infer the contents of cache by timing how long it takes to retrieve data, using carefully crafted code. Because cache is so much faster than main memory, it's possible to infer where data resides and what its values are, despite not having direct access to it. This is known as a side-channel attack.
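The primitive at the heart of such attacks is surprisingly small. The sketch below, a simplified Flush+Reload-style probe rather than an actual Meltdown or Spectre exploit, shows only the measurement trick: a load that hits the cache completes in tens of cycles, one that must go to main memory takes hundreds, and that difference is visible to ordinary unprivileged code. The cycle threshold is an illustrative guess; real attacks calibrate it per machine.

```c
/* Minimal timing probe behind cache side-channel attacks (x86 only).
 * It demonstrates only the measurement step: was a given cache line
 * recently touched, or does the load have to go out to main memory? */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

#define CACHED_THRESHOLD 120  /* cycles; purely illustrative, calibrate in practice */

static uint64_t probe_cycles(volatile uint8_t *addr) {
    unsigned aux;
    _mm_mfence();                  /* make sure earlier memory traffic is done */
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                   /* the access being timed */
    uint64_t end = __rdtscp(&aux); /* timestamp after the load has completed */
    return end - start;
}

int main(void) {
    static uint8_t line[64];       /* one cache-line-sized buffer */

    /* Case 1: flush the line, so the next access goes to main memory. */
    _mm_clflush(line);
    _mm_mfence();
    uint64_t cold = probe_cycles(line);

    /* Case 2: the line was just touched, so this access hits the cache. */
    uint64_t warm = probe_cycles(line);

    printf("cold: %llu cycles, warm: %llu cycles -> %s\n",
           (unsigned long long)cold, (unsigned long long)warm,
           warm < CACHED_THRESHOLD ? "cached" : "not cached");
    return 0;
}
```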
Ironically, at least one research paper pointed out this potential attack vector more than twenty years ago. Unfortunately, no-one seems to have paid heed to its warnings.
Unlike normal malware, this attack code requires no special privileges and triggers no tripwires when executed. No wonder so many people were willing to work so hard to fix it.
Unfortunately, their fixes required fairly radical changes to the software that makes up critical operating system components. As this is a hardware flaw, it affected Microsoft, Apple, the Linux distributions and, of course, the vendors of specialised operating systems such as VMware's ESXi hypervisor.
To save the village we had to destroy it
The software patches that were finally unveiled in January 2018 were more ‘hacks' than patches. To protect against the vulnerabilities, the benefits of those key CPU components, the TLB and BTB, had to be significantly curtailed. Sensitive kernel data now had to be hidden from user code, requiring a constant swapping of page tables as code transitioned from user to kernel mode and back again.
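For administrators wondering whether these page-table isolation and related patches are actually active on a given Linux host, recent kernels report the chosen mitigations under /sys. The following is a minimal sketch assuming the sysfs interface that arrived alongside the January 2018 patches; older kernels simply won't have these files.

```c
/* Small helper sketch: print the kernel's own report of which Meltdown and
 * Spectre mitigations are active, as exposed by recent Linux kernels under
 * /sys/devices/system/cpu/vulnerabilities/. */
#include <stdio.h>

int main(void) {
    const char *dir = "/sys/devices/system/cpu/vulnerabilities/";
    const char *flaws[] = { "meltdown", "spectre_v1", "spectre_v2" };
    char path[256], line[256];

    for (int i = 0; i < 3; i++) {
        snprintf(path, sizeof path, "%s%s", dir, flaws[i]);
        FILE *f = fopen(path, "r");
        if (!f) {
            printf("%-10s: not reported by this kernel\n", flaws[i]);
            continue;
        }
        if (fgets(line, sizeof line, f))
            printf("%-10s: %s", flaws[i], line);   /* e.g. "Mitigation: PTI" */
        fclose(f);
    }
    return 0;
}
```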
Additionally, fences had to be placed around sensitive code in the operating system to restrict the clever speculative and out-of-order execution capabilities of the processor which enable these attacks in the first place.
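To illustrate what such a fence looks like, here is a sketch modelled on the well-known Spectre variant-1 pattern; the function and table names are hypothetical, not taken from any real kernel. Without the barrier, the processor may speculatively perform the array access using an out-of-bounds index before the bounds check has resolved, leaving secret-dependent traces in the cache.

```c
/* Hypothetical example of fencing a bounds check against speculation. */
#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h>   /* _mm_lfence */

#define TABLE_SIZE 256
static uint8_t table[TABLE_SIZE];

uint8_t read_checked(size_t untrusted_index) {
    if (untrusted_index < TABLE_SIZE) {
        /* Speculation barrier: nothing below executes, even speculatively,
         * until the bounds check above has actually resolved. (The Linux
         * kernel mostly prefers index masking via array_index_nospec(),
         * because serialising instructions like this are expensive.) */
        _mm_lfence();
        return table[untrusted_index];
    }
    return 0;
}
```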
This flurry of additional activity has a cost. Much of the performance optimisation provided by this extra CPU hardware is negated; in effect, a good part of a decade's worth of enhancements was suddenly lost.
The loss is felt mainly where the processor is juggling many active tasks and frequently crossing the boundary between user and kernel mode.
As a consequence, while end-users won't feel much, if any, pain, busy servers and cloud infrastructure clearly will take a hit. In some cases it is a significant one, particularly for I/O-heavy work such as disk and network traffic.
And it's not over yet. At least one vulnerability requires that application code itself be carefully patched. This could take years, during which every computer system will carry a known security vulnerability. Meanwhile, security researchers will be crawling all over the low-level details of CPU architectures, a much-belated but necessary scrutiny. Because the CPU vendors keep many critical internals secret, these researchers must reverse-engineer them, but they now have a very strong motivation for doing so. It is quite likely that further hardware flaws will be identified, requiring a fresh wave of patches.
Rock, meet hard place
This leaves organizations in a difficult position. Unlike most security flaws, these don't come with a zero-impact set of mitigations. Instead, difficult decisions have to be made. Should mission-critical servers be left deliberately unpatched because no untrusted user code ever runs on them? A dedicated database server, for example, might arguably be left alone because there is no obvious way for an attacker to get the necessary code onto it.
Alternatively, can virtualised environments be refactored into ‘good' and ‘bad' neighbourhoods? Using, for example, the CPU groups capability introduced with Windows Server 2016, guests and the hypervisor itself can be pinned to dedicated sets of CPUs. Physical walls could then be put up between guests, potentially mitigating the flaws without incurring the performance penalty of the patches.
Because the Spectre vulnerabilities require both OS-level and application-level changes as well as new CPU microcode, older operating systems, applications and servers may not be practical to patch. In that case, it may be appropriate to mitigate as far as possible by migrating workloads to servers using CPUs from a different vendor. That will, of course, depend on whether vendor assurances regarding security prove to be correct.
In the cloud environment it gets trickier. Some performance overheads have already been observed, but organizations have little choice but to trust that their cloud vendor's environment is intrinsically secure. Unfortunately, as in our motel, it's going to be difficult to know who your neighbours are. For some organizations this may be a wake-up call, bringing a halt to further cloud migration until the security implications become clearer. At least with on-premises facilities, you have some control over the neighbourhood.
Cloud vendors may respond by providing some kind of isolation guarantee; i.e., for a price, you can be assured that your computing workload is cleanly isolated to a known set of CPU cores and that no other tenants can access these. That may ameliorate concerns to some extent.
Trust is good, but control is better
Finally, efficient infrastructure management will become increasingly critical for organizations. Allegedly, one major organization, ironically a leading IT vendor, was reduced to using spreadsheets to try to track and manage the rollout of the Meltdown and Spectre patches.
Clearly, managing your infrastructure with SIEM, inventory management, and other tools will be essential with wide-ranging vulnerabilities like these. And with potential performance and stability issues arising from patch deployment to busy servers, you need to be able to automate performance and availability monitoring, and quickly take action if problems arise.
But vendors are already addressing the challenges arising from these new vulnerabilities. With a proactive and informed strategy to manage your infrastructure, along with a careful choice of vendor partners, you can ensure those walls remain impervious both inside, and outside, your organizational boundaries.