points by krackers 6 days ago

>pay for reinitializing the cache

Why can't they save the kv cache to disk then later reload it to memory?

stavros 6 days ago

Probably because the costly operation is loading it onto the GPU; it doesn't matter whether it comes from disk or from your request.

  • zozbot234 6 days ago

    The point of prompt caching is to save on prefill which for large contexts (common for agentic workloads) is quite expensive per token. So there is a context length where storing that KV-cache is worth it, because loading it back in is more efficient than recomputing it. For larger SOTA models, the KV cache unit size is also much smaller compared to the compute cost of prefill, so caching becomes worthwhile even for smaller context.
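A minimal sketch of what such a prefix cache looks like, with a stub `compute_kv` standing in for real prefill (the token tuple itself plays the role of the KV tensors here, purely for illustration):

```python
import hashlib

def compute_kv(tokens, past_kv=()):
    # Stub for real prefill: here the "KV" is just the token tuple itself,
    # but a real engine would attend over past_kv while prefilling tokens.
    return tuple(past_kv) + tuple(tokens)

cache = {}  # hash of token prefix -> KV for that prefix

def kv_for_prompt(tokens):
    """Reuse the longest cached prefix; prefill only the remaining suffix."""
    tokens = tuple(tokens)
    for cut in range(len(tokens), 0, -1):
        key = hashlib.sha256(repr(tokens[:cut]).encode()).hexdigest()
        if key in cache:
            kv = compute_kv(tokens[cut:], past_kv=cache[key])
            break
    else:
        kv = compute_kv(tokens)  # cold start: full prefill
    # Store the full-prompt KV so the next turn extends it instead of recomputing.
    cache[hashlib.sha256(repr(tokens).encode()).hexdigest()] = kv
    return kv
```

Note the all-or-nothing nature: one changed token changes the hash, so only an exact prefix match can be reused.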

vanviegen 6 days ago

Isn't that how the kv cache currently works? Of course they could decide to hold on to cache items for longer than an hour, but the storage requirements are pretty significant, while the chance of session resumption shrinks rapidly.

  • zozbot234 6 days ago

    The storage requirements for large-model KV caches are actually comparatively tiny: the per-token size grows far less than model parameters do. Of course, we're talking "tiny" for stashing them on bulk storage and slowly fetching them back to RAM. But that should still be viable for very long contexts, since prefill time grows quadratically with context length.
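The load-vs-recompute tradeoff can be sanity-checked with a toy calculation. Every constant below (active params, layer count, model width, disk bandwidth, FLOPs) is an illustrative assumption, not any provider's real figure:

```python
# Toy break-even: reload KV from disk vs recompute prefill.
# All constants are illustrative assumptions, not measured figures.

def reload_seconds(tokens, kv_bytes_per_token=400e3, disk_gbps=1.0):
    # Linear in context length: just stream the bytes back from disk.
    return tokens * kv_bytes_per_token / (disk_gbps * 1e9)

def prefill_seconds(tokens, active_params=37e9, n_layers=60, d_model=7168,
                    tflops=400):
    # ~2 FLOPs per active parameter per token for the linear layers,
    # plus an attention term that grows quadratically with context length.
    linear_flops = tokens * 2 * active_params
    attn_flops = 2 * n_layers * d_model * tokens ** 2
    return (linear_flops + attn_flops) / (tflops * 1e12)

print(reload_seconds(200_000), prefill_seconds(200_000))
```

With these made-up constants the crossover is around 100k tokens; past that, streaming the cache back from disk beats re-running prefill, and the gap widens quadratically.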

    • vanviegen 6 days ago

      We only have open models to go by, so looking at GLM 5.1 for instance, we're talking about almost 300 GB of kv-cache for a full context window of 200k tokens.

      That's hardly tiny.
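The raw footprint can be back-of-enveloped from the architecture. The configs below are illustrative, not GLM's actual numbers; the point is how much grouped-query attention (fewer KV heads) shrinks the cache:

```python
# Back-of-envelope KV-cache footprint. Config numbers are illustrative,
# not GLM's actual architecture.
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    # 2x for keys and values; bytes_per_val=2 assumes fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * tokens

full_mha = kv_cache_bytes(200_000, n_layers=92, n_kv_heads=96, head_dim=128)
gqa = kv_cache_bytes(200_000, n_layers=92, n_kv_heads=8, head_dim=128)
print(round(full_mha / 1e9), round(gqa / 1e9))  # ~904 GB vs ~75 GB
```

Full multi-head attention lands in hundreds-of-GB territory, consistent with the figure above; GQA or MLA-style KV compression brings the same context down an order of magnitude.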

stingraycharles 6 days ago

It’s a shitload of data, and it only works if all the tokens are 100% identical, i.e. all the attention values are exactly the same.

Typically it’s cached for about 5 minutes, you can pay extra for longer caches.

  • krackers 6 days ago

    If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same. That's the whole point of this caching in the first place.

    If you're willing to incur a latency penalty on a "cold resume" (which is fine for most use-cases), why couldn't they just move it to disk? The size of the KV cache should scale on the order of something like (context_length * n_layers * residual_length). I think for a standard V3-MoE model at 1M token length, this should be on the order of 100G at FP16? And you can surely play tricks with KV compression (e.g. the recent TurboQuant paper). It doesn't seem like an outrageous amount of data to put onto cheap scratch HDD (and it doesn't grow indefinitely since really old conversations can be discarded).
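On the compression point: the sketch below is not TurboQuant, just the simplest possible per-row absmax int8 round-trip, to show that quartering the KV bytes (fp32 to int8 here; halving from fp16) costs only a small reconstruction error. Shapes and the quantization scheme are illustrative:

```python
import numpy as np

# Not TurboQuant -- just per-row absmax int8 quantization, the simplest
# trick for shrinking stored KV. Shapes are illustrative.
def quantize_kv(kv):
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.round(kv / scale).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((60, 1024, 128), dtype=np.float32)  # layers x tokens x dim
q, scale = quantize_kv(kv)
err = np.abs(kv - dequantize_kv(q, scale)).max()
print(kv.nbytes // q.nbytes, err)  # 4x fewer bytes; worst-case error ~ scale/2
```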

    • stingraycharles 6 days ago

      > If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same.

      Correct. When you're using the API you can choose between 60-minute and 5-minute cache writes for this reason, but I believe the subscription doesn't offer this. 60-minute cache writes are about 25% more expensive than regular cache writes.

      I don’t have insights into internals at Anthropic so I don’t know where the pain point is for increasing cache sizes.