jchandra 3 days ago

I’ve been exploring KV cache optimization for LLM inference.

Most methods (Top-K, sliding window) prune tokens. This works on average, but fails selectively — a few tokens cause large errors when removed.
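For context, Top-K pruning in its simplest form looks roughly like this (a toy numpy sketch, not any particular implementation; `scores` stands in for whatever per-token importance estimate is used):

```python
import numpy as np

def topk_prune(keys, values, scores, k):
    """Keep only the k highest-scoring tokens from a KV cache.

    keys, values: (T, d) cached rows; scores: (T,) importance
    estimates (e.g. accumulated attention mass per token).
    """
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order
    return keys[keep], values[keep]
```

The failure mode is that a token with a modest average score can still carry a contribution that is hard to recover once its row is gone entirely.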

I tried reframing the problem as approximating the attention function: Attn(Q, K, V).
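For reference, the function being approximated is standard scaled dot-product attention; a minimal numpy version:

```python
import numpy as np

def attn(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)            # rows sum to 1
    return w @ V
```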

Prototype:

- entropy → identify weak tokens
- OLS → reconstruct their contribution
- SVD → compress them
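To make those three steps concrete, here is a rough numpy sketch of how I read the pipeline (the particular entropy score, the OLS target, and all names here are my guesses, not the author's actual code):

```python
import numpy as np

def compress_weak_tokens(K, V, attn, n_weak=4, r=2):
    """Toy entropy -> OLS -> SVD pass over a KV cache.

    K, V: (T, d) cached keys/values; attn: (Q, T) attention weights
    from recent queries. Hypothetical reading of the steps above.
    """
    # 1) Entropy: score each token by the entropy of the attention it
    #    receives; diffuse (high-entropy) tokens are treated as "weak".
    recv = attn / attn.sum(axis=0, keepdims=True)     # columns sum to 1
    ent = -(recv * np.log(recv + 1e-12)).sum(axis=0)  # per-token entropy
    weak = np.sort(np.argsort(ent)[-n_weak:])         # the n_weak weakest
    strong = np.setdiff1d(np.arange(K.shape[0]), weak)

    # 2) OLS: express the weak keys as a linear combination of the kept
    #    keys, so their contribution can be folded back in:
    #    K[weak] ~= W @ K[strong].
    X, *_ = np.linalg.lstsq(K[strong].T, K[weak].T, rcond=None)
    W = X.T                                           # (n_weak, n_strong)

    # 3) SVD: store the weak tokens' values at rank r instead of full.
    U, s, Vt = np.linalg.svd(V[weak], full_matrices=False)
    V_low = (U[:, :r] * s[:r]) @ Vt[:r]               # rank-r approximation

    return K[strong], V[strong], W, V_low
```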

Early results show lower error than Top-K at low memory, sometimes even lower memory overall.

This is still a small research prototype, would appreciate feedback or pointers to related work.

bee_rider 22 hours ago

Were there any downsides or difficulties?

It would be sort of surprising if an SVD-based opportunity was missed (since it is such a familiar tool). But, your entropy and least-squares ideas are necessary to set that up, so I guess it makes sense that you’d find some new territory here.

  • jchandra 20 hours ago

    That’s a great point and yeah, I’d agree SVD itself isn’t new at all.

    On downsides: definitely a few. The biggest one is latency - SVD is fairly heavy, so even though it’s amortized (runs periodically, not per token), it still adds noticeable overhead. It’s also more complex than simple pruning, and I haven’t validated how well this holds on real downstream tasks yet.

    This is very much a research prototype right now, more about exploring a different tradeoff space than something ready for production.

cowartc 22 hours ago

Interesting direction. One question: How does this hold up outside the synthetic transformer on a real downstream task? Reconstruction error is the right measure, but it's one step removed from the end task. I'm curious whether HAE would show a similar gap on a downstream benchmark.

vivahir215 3 days ago

Interesting approach. Curious about the latency tradeoff: OLS + SVD are much heavier than Top-K. Have you benchmarked end-to-end inference latency?

  • jchandra 3 days ago

    In this prototype, OLS + SVD isn’t per-token, it runs only when the recycle bin fills (amortized over multiple tokens).

    That said, it’s still heavier than Top-K. I haven’t benchmarked end-to-end latency yet; this is mainly exploring the accuracy vs memory tradeoff.
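    The "runs only when the recycle bin fills" scheduling could be sketched like this (hypothetical structure; only the recycle-bin idea comes from the post):

```python
class RecycleBin:
    """Hypothetical sketch of the amortized trigger described above:
    evicted tokens accumulate, and the heavy OLS + SVD pass runs once
    per `capacity` evictions rather than once per generated token."""

    def __init__(self, capacity, compress):
        self.capacity = capacity
        self.compress = compress   # the expensive OLS + SVD pass
        self.pending = []          # evicted (key, value) entries

    def add(self, entry):
        self.pending.append(entry)
        if len(self.pending) >= self.capacity:
            batch, self.pending = self.pending, []
            return self.compress(batch)   # amortized cost paid here
        return None                       # cheap fast path per token
```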

  • scythmic_waves 21 hours ago

    From the conclusion:

    > The primary trade-off observed is the increased calculation time for OLS and SVD steps. Consequently, the next phase of this work involves implementing these operations within custom Triton kernels to amortize latency. By viewing the cache through the lens of reconstruction fidelity rather than just memory capacity, we can develop more sustainable architectures for long-context inference.

    Reading between the lines, the increase in latency was so significant that they didn't want to include it before they had a chance to try and optimize the problem away first.

    Still interesting research. Hope they get good results!

    • jchandra 20 hours ago

      Haha, that’s a very fair reading :)

      Yeah, the latency hit is definitely real. That said, most of what I’ve run so far is CPU-bound, which likely exaggerates it quite a bit, so I didn’t want to draw strong conclusions from that.

      Would need proper GPU implementations to really understand where it lands.

jbellis 20 hours ago

Isn't the "KV Compression Strategies (FAIR)" chart showing that the fancy complex algorithm only barely beats simple topk?

The commentary says that topk "degrades rapidly at low ratios" but the same can be seen for HAE (Entropy + OLS).

  • jchandra 18 hours ago

    Fair point, the gap isn’t huge in that plot, and both degrade at low ratios. The difference is more in how they degrade: Top-K can have sharper, localized failures, while HAE tends to degrade a bit more smoothly. That doesn’t always show up strongly in average MSE.

    That said, the gains are modest right now, this is still a research prototype exploring the tradeoff, and there’s clearly more work to be done.

  • bee_rider 17 hours ago

    Is it really that fancy and complex, though? The “entropy recycling bin” seems fancy to me, but the other stuff is least squares and an SVD, these are solid workhorse numerical routines.

rishabhaiover 20 hours ago

After a certain amount of context usage, I think I empirically see the stated issues with the Top-K compression strategy. It doesn't catastrophically forget, but nuances fade as I reach the tail end of my context limits.

  • jchandra 18 hours ago

    Yeah, that’s consistent. Top-K keeps the obvious tokens, but subtle context gets eroded over time rather than dropped all at once.

thw20 19 hours ago

Good work! This is very interesting. Here's a related work that constructs a low-rank approximation for attention: https://arxiv.org/abs/2505.12942.

Maybe the idea of the query calibration matrix Rxx is of interest to the author!

  • jchandra 19 hours ago

    Thanks, really appreciate the pointer. Will dig into it.

aesthesia 20 hours ago

I notice the experiments are all run with Gaussian token embeddings and weight matrices, which is a very different scenario than you would get in a real model. It shouldn't be much more difficult to try this with an actual model and data and get a much better sense of how well it compresses.

  • jchandra 20 hours ago

    I completely agree. Right now this is all on a synthetic setup to isolate the behavior and understand the reconstruction vs memory tradeoff. Real models will definitely behave differently.

    I’ve started trying this out with actual models, but I’m currently running things CPU-bound, so it’s pretty slow. I’d ideally want to try this properly on GPU, but that gets expensive quickly.

    So yeah, still very much a research prototype, but validating this on real models/data is definitely the next step.