KVarN: Native vLLM backend for KV-cache quantization by Huawei

142 points by theanonymousone 1 day ago

Better performance than TQ and better quality than FP16?

Am I reading this right??

lukasc-ch 2 hours ago

This is awesome! Let's give them some stars:
- https://github.com/huawei-csl/KVarN (original repo, vLLM implementation)
- https://github.com/Anbeeld/beellama.cpp (llama.cpp implementation + awesome evals)

v3ss0n 1 day ago

Why is this not a PR for vLLM?

esafak 1 day ago

It's the output of a research paper; the authors are not trying to build up vLLM, and they probably have no incentive to do so. You can submit a PR, though! It's easier now while the divergence is low, so don't wait. Since there are six authors, I bet you could get help with the inevitable review chores if you just take the step of creating the PR.
edit: It might not be clear that it is based on vLLM 0.22, which is the current version: https://github.com/huawei-csl/KVarN/commit/d6290e99098d7426d.... All you have to do is create a diff off it; it's fairly straightforward.
- jmalicki 1 day ago
  
  And with the help of AI, pointing an AI at this paper and saying "make a vLLM PR from this paper" tends to work surprisingly well, even if you need to nudge it a little along the way.
thefox96 1 day ago

it should be easy to do btw
woadwarrior01 22 hours ago

Last I heard, vLLM was backed by a company that has raised $150m in seed funding. I'm sure they've got the resources to port it.
electronsoup 19 hours ago

Why is this not a PR for llama.cpp?

0xjeffro 20 hours ago

Far ahead ("yao yao ling xian")