> Restarting all the servers would result in many clients reconnecting several times. It was better to avoid it when possible.
As a sibling says, you need a "reconnect now" in the protocol. (GOAWAY, in HTTP/2.)
In addition to what the sibling says, if you have some sort of cordoning/graceful drain facility at the traffic level, you can also prevent the "several times" bit: bring new, patched nodes online. Disallow new connections to the outgoing nodes. Drain the outgoing nodes. Decommission them.
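A rough sketch of that cordon/drain sequence, in case it helps. This is a toy in-memory model, not any real load balancer's API: `Node`, `rolling_replace` and the node names are made up for illustration, and a real drain would send the protocol's own "reconnect now" signal rather than move a counter.

```python
class Node:
    """Toy stand-in for a server holding client connections (hypothetical)."""

    def __init__(self, name, connections=0):
        self.name = name
        self.connections = connections
        self.accepting = True


def rolling_replace(old_nodes, new_nodes):
    """Drain old_nodes into new_nodes and return the number of reconnects."""
    # The new, patched nodes are assumed to be online and accepting already.
    # Cordon: outgoing nodes stop taking new connections before any drain starts.
    for node in old_nodes:
        node.accepting = False

    reconnects = 0
    # Drain: each client is told to reconnect and lands on the least-loaded
    # node that is still accepting, which can only be a new node.
    for node in old_nodes:
        while node.connections > 0:
            target = min(
                (n for n in new_nodes if n.accepting),
                key=lambda n: n.connections,
            )
            node.connections -= 1
            target.connections += 1
            reconnects += 1
    # Decommissioning the now-empty old nodes is left to the caller.
    return reconnects


old = [Node("old-1", 5), Node("old-2", 3)]
new = [Node("new-1"), Node("new-2")]
print(rolling_replace(old, new), [n.connections for n in new])
```

Because every outgoing node is cordoned before the drain begins, no client can land on a node that is about to go away, so each one reconnects once rather than "several times".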
Live patching production kernels makes sense when there is an imminent threat/timeline and rebooting is throttled due to underlying throttling mechanisms that are guarding the health of distributed systems running atop the systems. Here's a real example I am familiar with:
Consider a hyper-converged cluster with many nodes serving distributed block storage, say at N=3 replication. This can tolerate an outage of exactly one node (N=1) during the reboot. It would seem preferable to drain the nodes in a way that allows for more parallelism in the per-node kernel-reboot process, but draining is expensive and it's cheaper to reboot and hope the data comes back to the pool within some period of time after the reboot. This gets worse linearly as the cluster grows.
A non-trivial size cluster facing this can have a reboot rollout easily stretch from hours into days and even weeks. It is further made slower when the roll-out itself is repeatedly paused when any other production issue is detected, or some other in-cluster event is happening and distributed storage health is degraded or unavailable. If a single (additional) node goes out during the reboot roll-out, data goes unavailable and storage must wait and heal. It also simply takes time for the cluster to reconcile when the storage eventually comes back from reboot to make sure it is all still there.
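To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes strictly one node at a time and uses made-up reboot and heal times; none of these figures come from a real cluster, and pauses for other production issues would only add to them.

```python
def serial_rollout_hours(nodes, reboot_minutes, heal_minutes):
    """Hours to reboot every node one at a time, waiting for storage to heal after each."""
    return nodes * (reboot_minutes + heal_minutes) / 60


# Hypothetical figures: 10 minutes to reboot, 20 minutes for the pool to heal.
for nodes in (50, 200, 800):
    print(nodes, "nodes ->", serial_rollout_hours(nodes, 10, 20), "hours")
```

With those assumed timings, 800 nodes comes to 400 hours, which is over two weeks before a single pause.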
If your systems are large enough, things will go so slow that things fall into the trap where the target release changes mid-deployment: to benefit from everything learned in the last many days or weeks, security, performance, crashes, whatever! There is benefit because the fixes you cared about most got onto a portion of the cluster sooner than later. There is also penalty, as this resets the time it takes to deploy, elongating the perceived end-to-end deployment time. This negatively affects OKRs and similarly displaces the release of anything that was queued for upcoming releases.
So yeah, live patching is great to get priority fixes out in a matter of minutes or hours. I also think it is the best tool to get oneself out of this rollout-reset trap and onto the next release sooner. Faster than rollback or rollover.
I've been using KernelCare on my servers for many, many years now, and have yet to experience a single crash. One of them has been up for close to 4 years now.
I guess there's merit for that, especially if you are in a cloud environment. In a previous company, I decided to set up dnf/kpatch for VMs that we considered critical. At the time I had a healthy disregard for reliability, mostly because we had enough trust in our Terraform process, so I decided to automate the whole thing through AWS Systems Manager across the fleet and guard the feature to work only for security patches targeting the kernel. Briefly, every VM that came up would have the necessary packages installed, and from there live patching would execute periodically (I believe once a week) or manually. At some point, after a quarter or so, we had to devise a way to tag VMs to be excluded from this, but this was relatively easy to do and most of the exclusions were testing infra anyway.
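The tag-based exclusion can be as small as the sketch below. To be clear, this is not the setup described above and it calls no AWS API: the `livepatch-exclude` tag, the VM records and `select_patch_targets` are all hypothetical, just plain dictionaries standing in for whatever inventory you have.

```python
def select_patch_targets(vms, exclude_tag="livepatch-exclude"):
    """Return the IDs of VMs that should receive kernel live patches.

    A VM is skipped only when it carries the exclusion tag with value "true".
    """
    return [
        vm["id"]
        for vm in vms
        if vm.get("tags", {}).get(exclude_tag) != "true"
    ]


# Hypothetical fleet: untagged VMs are patched by default.
fleet = [
    {"id": "vm-api-1", "tags": {"env": "prod"}},
    {"id": "vm-test-1", "tags": {"env": "test", "livepatch-exclude": "true"}},
    {"id": "vm-db-1"},
]
print(select_patch_targets(fleet))
```

Making exclusion opt-in means a freshly launched VM is covered without anyone remembering to tag it.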
> Like, running emerge -u @world on a regular basis
You can run emerge -u sys-kernel/whatever-kernel-u-use, maybe followed by `cd /usr/src/linux; make bzImage modules install modules_install...` well, probably you'd use genkernel or something like that, instead of hand-crafted scripts.
The point is: `emerge -u @world` can run into issues esp. if you customized a lot, it can't be automated fully, but I've never run into any issues with updating the kernel, and it can be automated.
It is not so hard to upgrade the kernel; the issue is with the reboot you need to automate. Or with live patching, which doesn't seem encouraging, as you say.
If we are looking at things like gvisor or firecracker, SELinux might be an alternative. From what I can see, SELinux prevented both copy fail and dirty frag, and maybe also fragnesia but I couldn't find any definitive answer on that one.
Last time I tried, it was a pain to set up and a pain to use, but as a sysadmin there are a lot of things that share those attributes. The only question is whether it's worth it. If the current avalanche of patches continues, it might be.
SELinux is a bear when you’re reacting to it, but ever since I took a day to proactively read about it, it’s become much easier to reason about. It’s not actually all that complex.
I still need to troubleshoot from time to time, but I never reach for permanent setenforce 0 anymore.
From what I understand, SELinux can prevent copyfail, dirtyfrag and maybe fragnesia, although it might not always.
I presume you are referring to the GrapheneOS post/thread about this[0], although this implementation is not the same implementation we see on Fedora or Debian for example and it appears these distros were (and are) still vulnerable to this exploit, with the out of the box configuration of SELinux on these systems.
People are doing too much HackTheBox or the like I guess … where you always have some entry point and then need to do privesc to get the root flag.
Then they are forgetting how much untrusted software they are running as their user account that can do much damage without need to do privilege escalation to root.
Clearly, the future is LLM-generated patches that get instantly vibecoded and installed on all machines without any human review. In fact, this is such a good idea that it should be illegal and impossible to run your computer without being connected to such a system. There are no other alternatives. /sarcasm
Many distros deal with the problem of learning about these issues at the same time as the public. Some have fast-track processes to ensure patches can get into their stable/rolling releases, but it is still a lot of work (especially as kernel updates usually mean that automatic updates won't fully protect you without also automatically rebooting after an update).
All of them need to do it. There may be differences, like a different number of supported kernel versions and so less backporting, but distros still have to provide fixed kernels.
With Gentoo I believe it is more fun, because of all the options Gentoo provides out of the box. More kernels, more work to do.
When I used Gentoo the normal was to install gentoo-sources, which gives you the kernel source code but doesn't compile it. You then have to compile and install the kernel yourself without any support from the package manager.
If you're running on a different platform then perhaps you need the raspberrypi or asahi kernel
This is a bit misleading. genkernel/, gentoo-kernel/, gentoo-kernel-bin/, gentoo-sources/, git-sources/, vanilla-kernel/ and vanilla-sources/ are all different packages for the same Linux kernel. There are multiple slots per package for the various supported LTS versions of said kernel, but they will all get +/- the same set of patches for these issues. There is some support for other kernels like Darwin, BSD and HURD but your mileage will vary.
> There is some support for other kernels like Darwin, BSD and HURD but your mileage will vary.
I believe at this writing only Linux and HURD are officially supported standalone. Which IMHO is sad because Gentoo kFreeBSD was really cool but oh well. There is still the Gentoo prefix project, though even there support for the BSDs is iffy:(
Expanding on gentoo's recommendations:
I wonder if we should just universally accept that live patching should become part of the linux kernel? An automatic job that updates (much like some system packages in some distros) that installs (signed) live patches from upstream? Of course we would run into a problem where a malicious patch can now be distributed reliably to hundreds of thousands of machines, but we already have that at a lower level with normal application updates.
Canonical has thus far proved that it can be safe, but they're also a massive organization that is locking this feature for $200/yr for any commercial use.
It would be neat if such patches could retroactively replace tagged functions that have identical semantics, so that they would automatically get backported without extra effort from the maintainers.
Why would the source of the patches be less trusted than the source of (updated) kernels? I expect it to be the same, your distro.
$200/year is peanuts for any commercial use worth the name. The problem, of course, is the whole non-free infrastructure it has to introduce.
I wonder when large and critical OSS projects will start to be seen as a public good they are, with large corporations willingly financing them because not doing so is bad PR.
Public goods are not generally funded by large corporations.
Errr.. are you saying that Linux contributors are only affiliated to small companies or volunteers? Well, last time I checked, some contributors were RedHat(IBM) and Microsoft employees ;)
I am saying that big company funding of some FOSS is exceptional compared to the economy at large. Most public goods are not funded by big companies. They do not do the same for roads or public health services, or free education or public libraries etc.
Even with FOSS few projects get the funding or contributions the Linux kernel does.
After the npm supply chain attacks people suggested automating delays before installing updates, now we're talking about automating update delivery... I'm afraid there won't be any easy or quick fix after decades of treating security as an afterthought.
Linux distros are not npm. It doesn't mean they are immune to malicious actors, but I believe it is possible to make them so for at least some small set of packages.
Attacks are still possible, but if we look at the xz backdoor attack[1], it was an insanely complicated attack and it still failed. Its failure isn't exactly reassuring: the attack could have succeeded, the attacker was just unlucky. Still, it shows that success is not guaranteed.
Theoretically npm can be improved in this way, if there were a separate "distro" for packages, with dedicated maintainers who don't write code, just pull it from upstream and review it. It is not being done because of the tragedy of the commons, not because it is impossible.
[1] https://en.wikipedia.org/wiki/XZ_Utils_backdoor
Whenever you read about an incredibly unlucky criminal, there's a chance that the unlucky event is a parallel construction to the classified real reason why they were caught. Not sure how exactly that would have worked in this case.
Yes, it could be. But it is a hypothetical that smells like a conspiracy theory. I wonder why you think it is a good idea to go for these hypotheticals?
Are you arguing that the system may be more resilient than it seems? Like, maybe there is a conspiracy working on security. And they keep themselves secret so attackers would tend to underestimate the real level of security and make mistakes that would inevitably get them caught?
It seems like an over-stretched explanation, doesn't it? Care to explain yourself?
I mentioned it mainly because I find it interesting. We are probably just slightly less lucky than it seems. This particular case doesn't look that much like a parallel construction to me.
Linux itself, major Linux distros, npm - none of these were designed with a security-first approach. Even the things that do help with security, like package maintenance or containerization, were more incidental to other primary goals like stability, reproducibility and so on rather than being born from a comprehensive security-first strategy. They could have been, but then things would have moved slower. They even exist, like Alpine, OpenBSD, RedoxOS, but the major ones, the ones we're talking about today, were the ones who moved faster and managed to take over. That's the fundamental issue I'm talking about, the mindset shift that would be required before we could even start the Herculean effort of rebuilding much of the existing stack with different architectures, in different languages and using different development models, always knowing that, in the past, the ones who moved fast and broke things instead tended to be the ones who succeeded.
I technically agree, but it seems too abstract to me. What would distro maintenance look like if it was built with a security-first approach?
Maybe I don't have enough imagination and/or creativity, but trying to imagine it, I see just a bit more oversight built into the protocols for approving changes to repositories. I mean, it doesn't seem that improved security needs a "destroy everything and build it from scratch" approach; some additions on top of existing structures would do. Am I wrong?
If we were to start from security first, we would be asking questions like 'how can we make sure that new code is safe?'. Manual review is great, but we can likely think of some desirable invariants for program behavior that could be tested automatically, or even formally verified. Those would come at the very start. The entire mindset right now is that the existing code is probably unsafe and we'll ship fixes as we discover its vulnerabilities. Not immediately applying updates is seen as a kind of moral failure. All major OS and most software projects were developed with this mindset of crossing your fingers at launch and then changing the tires while driving. So much so that we think of it as the natural state of software. If you start from a base of verified code, the mindset shifts. Not that there are zero vulnerabilities guaranteed in the existing code, but you become a lot more suspicious of new code.
> I wonder if we should just universally accept that live patching should become part of the linux kernel?
I think we can learn many lessons from the recent SNAFUs before going all wild on auto-patching.
One lesson for example is that you shouldn't compile into the kernel modules that only about 0.00001% of all Linux installations out there are ever going to use.
Another lesson is that even if the modules are compiled, but not into the kernel, they should probably be blacklisted (preventing them from loading) by default and only removed from the blacklist by people who really know they'll need these rarely used modules.
We're way past the "but it needs to work on all cases": we're now into the "users installing our distro are getting hacked left and right" territory.
In any case I think many things can be done before Linux distros reproduce the "security" practices of the NPM ecosystem.
> we're now into the "users installing our distro are getting hacked left and right" territory
Are we? Are users actually getting hacked, or have they theoretically been exposed to problems that could allow local privilege escalation if exploited but that nobody's seen used in the wild?
(Edit: To be clear, I'm skeptical but this isn't a completely rhetorical question. If there are actual reports of these vulns causing problems, that would strongly incentivize a stronger response.)
It used to be relatively standard even on the "big" distros to compile your own kernel if you needed something outside of the bog-standard. Modularization and all the related auto-detect auto-mod tools have resulted in most distros shipping a "works for almost everyone" kernel that has everything available as a module.
Perhaps we should tend toward the first.
It seems like a reasonable middle ground for most distros is to put things in kernel modules, but then package those modules into separate packages. If you don't need somedriver.ko, then you don't `apt install linux-driver-somedriver`; if you do need it, just install the package and it just works without needing to compile anything and you get automatic updates and everything.
For Gentoo, of course, "just recompile the kernel as desired" is more reasonable, though they have binary packages including for the kernel and I don't see why the same idea shouldn't work there.
> but then package those modules into separate packages. If you don't need somedriver.ko, then you don't `apt install linux-driver-somedriver`
But I don't want to know what drivers I need and will need next. Tomorrow I could buy a different wifi module and then what? Spend 3 hours googling which rtl378326973268632aahaxhabt.ko to install? Thanks but no thanks.
So why can't someone (probably the distro) build a utility that detects the hardware and installs the required kernel module?
We can have security and convenience.
That existed for a very short period of time before it became simpler to just ship everything all the time. I remember at least one distro booting with a single processor kernel and detecting that it could use an SMP kernel and did I want to pull it down?
And how would it get that module without network access? I'd say for network drivers specifically, this is a tough one.
It would work for various other drivers though.
The standard method is for the installer to have the needful and then it knows what packages to install to give you the network drivers you need going forward (shades of slipstreaming in all the network drivers I could find into W2K custom ISOs so I'd not need to find floppies).
On older versions of Windows you used to get popups saying new hardware is detected, would you like to install the driver now?
It was always fun to get those when the hardware hadn't changed.
In a generally available distro, that's a huge PITA.
You can do blacklists easily enough if you want to, just add a few lines of text into /etc.
I'd also like an option for whitelisting. Whitelisting every single NIC driver is harmless enough coz they just won't be loaded, but anything that can be loaded by a non-root userspace action should have the option of only being loaded if it is on the whitelist.
Tho all that is easily doable by just changing userspace AFAIK
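For what it's worth, the "few lines of text" could look like what this sketch generates. `blacklist` and `install` are the modprobe.d directives I'd reach for: `blacklist` only stops alias-based autoloading, while `install <module> /bin/false` makes an explicit modprobe of it fail too. As far as I know there is no built-in whitelist, so this fakes one by blocking every candidate module that isn't explicitly allowed. The function and the module names are made up for illustration.

```python
def render_modprobe_policy(candidates, allowed=()):
    """Render modprobe.d-style lines that block every candidate not in `allowed`."""
    lines = []
    for name in sorted(set(candidates) - set(allowed)):
        # Stop the module from being autoloaded through its aliases.
        lines.append(f"blacklist {name}")
        # Make an explicit `modprobe <name>` fail as well.
        lines.append(f"install {name} /bin/false")
    return "".join(line + "\n" for line in lines)


# Hypothetical module names; a real list would come from your own audit.
policy = render_modprobe_policy(
    ["rare_proto_a", "rare_proto_b", "nic_driver_x"],
    allowed=["nic_driver_x"],
)
print(policy, end="")
```

The output is meant to be dropped into a `.conf` file under /etc/modprobe.d/, and taking a module off the block list is just a matter of adding it to `allowed` and regenerating.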
I don't believe in live patching unless you are AWS.
But I absolutely believe we should have a method for changing kernel configuration (e.g. kernel module blacklists), syscall firewalls and the like.
Easier: Do not start with an "allow all" configuration in the first place.
Maybe all of those userspace-work-done-in-kernel-because-muh-performance features should be restricted to (the "real") CAP_NET_ADMIN, unless positively enumerated as free-for-all-containers. And then subtract from that free-for-all list every time you learn that some kernel module in its currently available version cannot be trusted to do its own memory shuffling.
> live patching should become part of the linux kernel
Services where uptime matters tend to be designed so they can tolerate the reboot of a single node for other reasons besides kernel maintenance. I can't imagine a situation where I can't tolerate the downtime of a reboot but I would be willing to risk the system locking up with brain surgery gone wrong.
We used ksplice for a good number of years (until it was bought by Oracle and they stopped publishing patches for other Linux distros), and all in all it was a very stable technology 10 years ago.
> I can't imagine a situation where I can't tolerate the downtime of a reboot but I would be willing to risk the system locking up with brain surgery gone wrong.
Because you haven't worked at that level in an organization. Doing a restart in some cases might involve paperwork with your client and a maintenance window outside of working hours, even if the service is redundant. And some customers are fine with a little bit of downtime and don't want active/active redundancy, but still insist on maintenance windows for any work like that.
Live patching makes that a whole lot easier.
> Services where uptime matters tend to be designed so they can tolerate the reboot of a single node for other reasons besides kernel maintenance. I can't imagine a situation where I can't tolerate the downtime of a reboot but I would be willing to risk the system locking up with brain surgery gone wrong.
I've run systems with live code updates for userland, and would have considered live kernel updates if it was reasonable on our systems.
The thing is you typically build your system to tolerate reboot or unscheduled stop of a single node. Scheduled stop is nicer, but systems sometimes lock up even when you're not doing risky behaviors, so you know.
But just because the system can tolerate a reboot or restart doesn't mean it's not disruptive. A lock up / etc during hot load is also disruptive, of course. But when you can push code without having to stop anything, with limited impact on users, it makes it easier and faster to do updates. You can use whatever rollout pattern you like to contain risk too; same as you would for an upgrade with restarts.
For us, we have servers with hundreds of thousands or millions of tcp connections from mobile clients. Restarting a server would make all those clients have to reconnect and connecting is expensive. Restarting all the servers would result in many clients reconnecting several times. It was better to avoid it when possible.
You need a server push "reconnect here" in your protocol. Makes maintenance of all sorts a much easier deal.
> Restarting all the servers would result in many clients reconnecting several times. It was better to avoid it when possible.
As a sibling says, you need a "reconnect now" in the protocol. (GOAWAY, in HTTP/2.)
In addition to what the sibling says, if you have some sort of cordoning/graceful drain facility at the traffic level, you can also prevent the "several times" bit: bring new, patched nodes online. Disallow new connections to the outgoing nodes. Drain the outgoing nodes. Decommission them.
(I.e., only permit reconnects to patched nodes.)
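A toy simulation of the difference, as a sketch only: it assumes clients are spread evenly, that a displaced client reconnects to a random eligible node, and that in the drain case there are as many patched nodes as old ones. The function and node names are made up for illustration.

```python
import random


def rollout(num_old, clients_per_node, drain_to_patched, seed=0):
    """Count reconnects per client for an in-place rolling restart vs. drain-to-patched."""
    rng = random.Random(seed)
    old = [f"old-{i}" for i in range(num_old)]
    patched = [f"new-{i}" for i in range(num_old)] if drain_to_patched else []
    location = {c: old[c % num_old] for c in range(num_old * clients_per_node)}
    reconnects = dict.fromkeys(location, 0)
    for node in old:
        if drain_to_patched:
            # Outgoing nodes are cordoned, so only patched nodes accept reconnects.
            targets = patched
        else:
            # In-place restart: clients land on any other node, restarted or not.
            targets = [n for n in old if n != node]
        for client, where in location.items():
            if where == node:
                location[client] = rng.choice(targets)
                reconnects[client] += 1
    return reconnects


if __name__ == "__main__":
    in_place = rollout(4, 1000, drain_to_patched=False)
    drained = rollout(4, 1000, drain_to_patched=True)
    print("in-place max reconnects:", max(in_place.values()))
    print("drain max reconnects:", max(drained.values()))
```

With the drain pattern every client reconnects exactly once, because patched nodes are never taken down again during the rollout. With in-place restarts, anyone kicked off the first node necessarily lands on a node that still has to restart.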
Live patching production kernels makes sense when there is an imminent threat or deadline and reboots are rate-limited by the throttling mechanisms that guard the health of the distributed systems running atop those machines. Here's a real example I am familiar with:
Consider a hyper-converged cluster with many nodes serving distributed block storage, say at N=3 replication. It can tolerate exactly one node being down for a reboot at a time. It would seem preferable to drain the nodes in a way that allows more parallelism in the per-node kernel-reboot process, but draining is expensive, and it's cheaper to reboot and hope the data comes back to the pool within some period of time after the reboot. This gets worse linearly as the cluster grows.
A non-trivial size cluster facing this can have a reboot rollout easily stretch from hours into days and even weeks. It is further made slower when the roll-out itself is repeatedly paused when any other production issue is detected, or some other in-cluster event is happening and distributed storage health is degraded or unavailable. If a single (additional) node goes out during the reboot roll-out, data goes unavailable and storage must wait and heal. It also simply takes time for the cluster to reconcile when the storage eventually comes back from reboot to make sure it is all still there.
If your systems are large enough, the rollout goes so slowly that you fall into a trap where the target release changes mid-deployment, to pick up everything learned in the last many days or weeks: security, performance, crashes, whatever! There is a benefit, because the fixes you cared about most got onto a portion of the cluster sooner rather than later. There is also a penalty, as this resets the time it takes to deploy, elongating the perceived end-to-end deployment time. This negatively affects OKRs and similarly displaces the release of anything that was queued for upcoming releases.
So yeah, live patching is great to get priority fixes out in a matter of minutes or hours. I also think it is the best tool to get oneself out of this rollout-reset trap and onto the next release sooner. Faster than rollback or rollover.
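For a rough feel of the "hours into days and even weeks" arithmetic above, a back-of-the-envelope sketch. The per-node reboot and heal times are invented numbers, not measurements from any real cluster, and the model ignores pauses for unrelated incidents, which only make it worse.

```python
import math


def rollout_hours(nodes, reboot_min, heal_min, max_unavailable=1):
    """Serial rollout time when at most `max_unavailable` nodes may be down at once."""
    batches = math.ceil(nodes / max_unavailable)
    # Each batch must reboot and then wait for storage to heal before the next starts.
    return batches * (reboot_min + heal_min) / 60


if __name__ == "__main__":
    # Assumed figures: 10 min to reboot a node, 20 min for the pool to reconcile.
    for n in (100, 500, 1000):
        hours = rollout_hours(n, reboot_min=10, heal_min=20)
        print(f"{n} nodes: {hours:.0f} h (~{hours / 24:.1f} days)")
```

Under those assumptions 100 nodes already take 50 hours of uninterrupted rollout, and the total scales linearly with node count as long as only one node can be out at a time.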
I've been using KernelCare on my servers for many, many years now, and have yet to experience a single crash. One of them has been up for close to 4 years now.
I guess there's merit in that, especially if you are in a cloud environment. At a previous company, I decided to set up dnf/kpatch for VMs that we considered critical. At the time I had a healthy disregard for reliability, mostly because we had enough trust in our Terraform process, so I decided to automate the whole thing through AWS Systems Manager across the fleet and restrict the feature to security patches targeting the kernel. Briefly, every VM that came up would have the necessary packages installed, and from there live patching would run periodically (I believe once a week) or manually. At some point, after a quarter or so, we had to devise a way to tag VMs to be excluded from this, but that was relatively easy to do and most of the exclusions were testing infra anyway.
> We recommend exploring ways to automate upgrading your kernel
Like, running emerge -u @world on a regular basis, or ...
/me searches
Okay, so https://wiki.gentoo.org/wiki/Live_patching exists but says,
> A note of caution: Kernel live patching is risky. Count on hard freezing or panics to become normal...
That's not encouraging.
---
Another approach: Can we make the kernel vulns less important? Has anyone had luck moving more things to run under gvisor or firecracker or such?
> Like, running emerge -u @world on a regular basis
You can run emerge -u sys-kernel/whatever-kernel-u-use, maybe followed by `cd /usr/src/linux; make bzImage modules install modules_install...` well, probably you'd use genkernel or something like that, instead of hand-crafted scripts.
The point is: `emerge -u @world` can run into issues esp. if you customized a lot, it can't be automated fully, but I've never run into any issues with updating the kernel, and it can be automated.
It is not so hard to upgrade the kernel; the issue is the reboot, which you need to automate. Or live patching, which doesn't seem encouraging, as you say.
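The first half of automating the reboot is knowing when one is pending. A minimal sketch, assuming installed kernels show up as directories under /lib/modules and that a naive component-wise version comparison is good enough (it will mis-order some suffixes such as -rc releases). The helper names are made up.

```python
import os
import platform
import re


def version_key(release):
    """Split e.g. '6.12.21-gentoo' into comparable parts; numbers sort before text."""
    parts = [p for p in re.split(r"[.\-+_]", release) if p]
    return [(0, int(p)) if p.isdigit() else (1, p) for p in parts]


def reboot_needed(running, installed):
    """True if some installed kernel release sorts newer than the running one."""
    newest = max(installed, key=version_key, default=running)
    return version_key(newest) > version_key(running)


if __name__ == "__main__":
    running = platform.release()
    # Assumption: one directory per installed kernel release lives here.
    moddir = "/lib/modules"
    installed = os.listdir(moddir) if os.path.isdir(moddir) else []
    state = "reboot needed" if reboot_needed(running, installed) else "up to date"
    print(f"running {running}: {state}")
```

What to do with the answer (drain, schedule a window, kexec, plain reboot) is the part that actually depends on your environment.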
If we are looking at things like gvisor or firecracker, SELinux might be an alternative. From what I can see, SELinux prevented both copy fail and dirty frag, and maybe also fragnesia but I couldn't find any definitive answer on that one.
Last time I tried it, it was a pain to set up and a pain to use, but as a sysadmin I deal with a lot of things that share those attributes. The only question is whether it's worth it. If the current avalanche of patches continues, it might be.
SELinux is a bear when you’re reacting to it, but ever since I took a day to proactively read about it, it’s become much easier to reason about. It’s not actually all that complex.
I still need to troubleshoot from time to time, but I never reach for permanent setenforce 0 anymore.
From what I understand, SELinux can prevent copyfail, dirtyfrag and maybe fragnesia, although it might not always.
I presume you are referring to the GrapheneOS post/thread about this[0], although this implementation is not the same implementation we see on Fedora or Debian for example and it appears these distros were (and are) still vulnerable to this exploit, with the out of the box configuration of SELinux on these systems.
> Can we make the kernel vulns less important?
How about strong virtualization? https://qubes-os.org
> Can we make the kernel vulns less important?
Local kernel vulns are totally unimportant in any vaguely reasonable environment.
People are doing too much HackTheBox or the like I guess … where you always have some entry point and then need to do privesc to get the root flag.
Then they are forgetting how much untrusted software they are running as their user account that can do much damage without need to do privilege escalation to root.
They absolutely suck in HPC environments. Bunch of users running all types of code that can just LPE themselves to root, not a good scenario.
Yes, but academic HPC environments are very 90s and are far from what is generally considered reasonable in 2026.
Clearly, the future is LLM-generated patches that get instantly vibecoded and installed on all machines without any human review. In fact, this is such a good idea that it should be illegal and impossible to run your computer without being connected to such a system. There are no other alternatives. /sarcasm
Is Gentoo an outlier or do all Linux distributions deal with this problem?
Many distros deal with the problem of learning about these issues at the same time as the public. Some have fast-track processes to ensure patches can get into their stable/rolling releases, but it is still a lot of work (especially as kernel updates usually mean that automatic updates won't fully protect you unless you also automatically reboot after an update).
All of them need to do it. There may be differences, like a different number of supported kernel versions and therefore less backporting, but distros still have to provide fixed kernels.
With Gentoo I believe it is more fun, because of all the options Gentoo provides out of the box. More kernels, more work to do.
Not all these directories are different kernel packages, but anything with -kernel or -sources at the end is.
When I used Gentoo the norm was to install gentoo-sources, which gives you the kernel source code but doesn't compile it. You then have to compile and install the kernel yourself without any support from the package manager.
If you're running on a different platform then perhaps you need the raspberrypi or asahi kernel.
This is a bit misleading. genkernel/, gentoo-kernel/, gentoo-kernel-bin/, gentoo-sources/, git-sources/, vanilla-kernel/ and vanilla-sources/ are all different packages for the same Linux kernel. There are multiple slots per package for the various supported LTS versions of said kernel, but they will all get +/- the same set of patches for these issues. There is some support for other kernels like Darwin, BSD and HURD but your mileage will vary.
https://wiki.gentoo.org/wiki/Kernel/Packages/en
> There is some support for other kernels like Darwin, BSD and HURD but your mileage will vary.
I believe as of this writing only Linux and HURD are officially supported standalone. Which IMHO is sad because Gentoo kFreeBSD was really cool, but oh well. There is still the Gentoo Prefix project, though even there support for the BSDs is iffy :(