We should get rid of average CPU utilization

www.theocharis.dev

29 points by JeremyTheo 11 hours ago

arianvanp 10 hours ago

A more general metric that is useful to watch for is pressure stall information for CPU, IO and Memory.

https://docs.kernel.org/accounting/psi.html

I made a Prometheus exporter for it:

https://github.com/arianvp/cgroup-exporter
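
For anyone who wants to look at the raw numbers first, here is a minimal sketch of parsing a PSI file, assuming the line format described in the kernel docs linked above (`some`/`full` followed by `avg10=`, `avg60=`, `avg300=`, `total=`). The function names `parse_psi` and `read_psi` and the values in `SAMPLE` are made up for illustration; this is not how the exporter is implemented.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text into {"some": {...}, "full": {...}} (hypothetical helper)."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_psi(resource: str = "cpu") -> Dict[str, Dict[str, float]]:
    """Read /proc/pressure/<resource>; needs a Linux kernel with PSI enabled."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())


# Invented sample values, only here to show the shape of the data.
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)

if __name__ == "__main__":
    psi = parse_psi(SAMPLE)
    # avg10 is the share of the last 10 seconds in which some task was stalled.
    print(psi["some"]["avg10"])
```

On a real machine you would call `read_psi("cpu")`, `read_psi("io")` or `read_psi("memory")` instead of parsing the sample string.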

JeremyTheo 9 hours ago

Yes!

JanMa 9 hours ago

I've learned the hard way that CPU resource limits in K8s are a bad idea, as this post shows. Just use CPU requests without limits, so the scheduler has an estimate of your application's CPU requirements but the application can still burst and use more CPU when it's available.

With memory, of course, you should set a limit, and from experience it should be the same as your memory request.
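
To make the CPU limit point concrete, here is a rough back-of-the-envelope sketch of why a limit can hurt latency while average utilization still looks low. It assumes a heavily simplified model of CFS quota enforcement: a single-threaded request that starts exactly at a period boundary, the default 100 ms period, and nothing else running in the container. The function names and the numbers (60 ms of work, a 20 ms quota, one request per second) are invented for illustration.

```python
def simulate_request(cpu_needed_ms: float, quota_ms: float,
                     period_ms: float = 100.0) -> float:
    """Wall-clock latency of one request under a simplified quota model.

    The task may use at most quota_ms of CPU per period_ms. Once the quota
    is spent, the task is throttled until the next period starts.
    """
    if quota_ms <= 0:
        raise ValueError("quota_ms must be positive")
    remaining = cpu_needed_ms
    elapsed = 0.0
    while remaining > quota_ms:
        remaining -= quota_ms   # burn the whole quota for this period
        elapsed += period_ms    # then sit throttled until the period ends
    return elapsed + remaining  # final slice fits within the quota


def average_utilization_of_limit(cpu_needed_ms: float, interval_ms: float,
                                 quota_ms: float,
                                 period_ms: float = 100.0) -> float:
    """Average CPU used over interval_ms, as a fraction of the limit."""
    cpu_allowed_ms = interval_ms * quota_ms / period_ms
    return cpu_needed_ms / cpu_allowed_ms


if __name__ == "__main__":
    # One request per second needing 60 ms of CPU, limit of 0.2 cores.
    print(simulate_request(60, 20))                     # 220.0
    print(simulate_request(60, 1000))                   # 60.0 (effectively unlimited)
    print(average_utilization_of_limit(60, 1000, 20))   # 0.3
```

In this toy model the request takes 220 ms instead of 60 ms, yet the dashboard would show the container at about 30% of its limit. Real schedulers are more complicated (multiple threads, unaligned start times, other tenants), so treat the numbers as an illustration only.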

JeremyTheo 9 hours ago

There is also the concern that a single pod shouldn’t be able to take down an entire node, so there need to be some safety limits. But then again, maybe not. I find this a really complex issue that is not widely known (only inside the Kubernetes bubble).
- ralgozino 9 hours ago
  
  You can reserve node resources for system processes with some kubelet parameters, so that pods don't kill the node: https://kubernetes.io/docs/tasks/administer-cluster/reserve-...
cassianoleal 6 hours ago

This, very much. With memory, I have seen one or two use cases where it made sense to have bigger limits than requests but it's the exception rather than the norm.

nairboon 9 hours ago

No, not at all. Why get rid of a low-level statistical measure? It's not even quite clear what the article argues against. htop doesn't even show you "average CPU utilization", it provides a sample of the current CPU utilization.

To me the problem appears to be that they are trying to do hard real-time computing with strict time guarantees, but are so far up the stack (golang library, golang scheduler, docker, kubernetes, virtualization, etc.) that they don't realize this stack can't guarantee real-time behavior. CPU utilization is a very low-level measure and, in this stack, is only indirectly related to the observed timeouts.

joshspankit 7 hours ago

> It's not even quite clear what the article argues against.
I think it can be summed up as “average CPU utilization, which is the common and intuitive first check, doesn’t tell you the real story”
I would also suggest that these are “outdated” measurements, as common CPU metrics are really designed for moderately multi-threaded, single-foreground-application workloads on bare metal
To your point, someone who deeply understands the stack already knows these are not the metrics to look at, but this is clearly aimed at people who have not (yet) had to dive deep to figure out a scheduling issue

CodesInChaos 10 hours ago

It's well known that many throttling implementations are broken, usually by design. You shouldn't blame the CPU utilization metric for that footgun.

In a well-designed scheduler, a task that has been granted an allotment of at least n cores should never get throttled to less than n cores at any time. It can only end up with less than n cores if CPU utilization is at 100% and another task gets scheduled at that time, since that's unavoidable when you oversubscribe the available resources.

zeafoamrun 10 hours ago

Same thing when it comes to memory. The rabbit hole goes on forever, and metrics lie to you if you don't know how to interpret them properly.

ahartmetz 10 hours ago

No, we shouldn't. We should measure latency if we care about latency.

jiggawatts 10 hours ago

I’ve come to realise that “wide logs” like OpenTelemetry traces are the only way to go, despite the expense of collecting and storing them with current technology.
As open source columnar databases improve, the cost will drop.

VimEscapeArtist 10 hours ago

Let’s measure temperature :)

techpression 10 hours ago

Lovely read. If you’ve ever had even remotely similar issues (you think you’re looking in the right places but you’re not), it reads like a detective novel.

rimworld 10 hours ago

Great article, thanks.

ksk23 10 hours ago

TL;DR: if app slow, give more resources

luipugs 10 hours ago

Or just don't set CPU limits: https://home.robusta.dev/blog/stop-using-cpu-limits
- JeremyTheo 9 hours ago
  
  Yeah, that is mainly the point there. But it's difficult if internal company policies require limits (for security, etc.)
andrepd 10 hours ago

Writing better code is of course out of the question.
- dgellow 10 hours ago
  
  What do you mean, I always append “make it excellent” to all my prompts!
- inglor_cz 10 hours ago
  
  Shockingly many developers have never profiled any code in their life.
  
  ahartmetz 4 hours ago
  
  The nice thing about such code is that, when you come in to improve it, you can make huge improvements in no time. As a user, though...