meatmanek 1 day ago

This model is pretty cool if you don't have a GPU - I was able to get, I think, 20 or 30 tokens per second on CPU (DDR4 RAM) alone. (I don't remember if that was with q4 or q8.)

Otherwise, if you have a GPU with more than about 4GB of VRAM, there are better models. Gemma4 and Qwen3.6 (or Qwen3.5 if you need the smaller dense models, which haven't been released yet for 3.6) are a good place to start.

  • aziis98 22 hours ago

    > I was able to get I think 20 or 30 tokens per second on CPU (DDR4 ram) alone

    What are you using for inference? I have a recent Intel laptop with 32GB of DDR5 and I'm getting at most 25 tps with the llama.cpp Vulkan backend (that's the fastest; I also tried SYCL, but it's a bit slower).

    • meatmanek 18 hours ago

      OK, I double-checked: I get 21-22 tps with lmstudio-community/LFM2-24B-A2B-Q4_K_M.gguf running under LM Studio on my i5-12400 with 2x32GB sticks of DDR4-3200. This is with a small context (just "Write me a poem about a language model named Liquid" in `lms chat`):

          Prediction Stats:
            Stop Reason: eosFound
            Tokens/Second: 21.10
            Time to First Token: 1.827s
            Prompt Tokens: 42
            Predicted Tokens: 187
            Total Tokens: 229
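
      That lines up with a memory-bandwidth back-of-envelope, by the way. A rough sketch (the ~0.6 bytes/param figure for Q4_K_M is my estimate):

          # Rough sketch: CPU decoding is mostly memory-bandwidth bound, so
          # the ceiling is peak bandwidth / bytes read per generated token.
          # Assumptions: ~2B active params/token, ~0.6 bytes/param at Q4_K_M.
          bandwidth = 2 * 8 * 3200e6          # dual-channel DDR4-3200: ~51.2 GB/s
          bytes_per_token = 2e9 * 0.6         # ~1.2 GB of weights touched per token
          print(bandwidth / bytes_per_token)  # ~43 tok/s ceiling; my measured
                                              # 21-22 is ~50% of peak, typical for CPUs
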
aziis98 22 hours ago

I just tried the Q4_K_M variant of this [] and it's one of the first models that runs at ~20 tps on my laptop. I also tried it with some "hard" maths questions and it clearly knows a lot. Can't wait to try some local coding agent harnesses with it (I recently discovered kon [1] and dirac [2] and want to try them out).

The only thing I'm not sure about is whether this model supports thinking or not.
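
One way to check without downloading the weights is to render the chat template and look for reasoning tags. A quick sketch (the repo id is my guess at the original upload, adjust as needed):

    # Sketch: render the chat template and look for <think>-style reasoning
    # tags in the generation prompt. The repo id below is a guess.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("LiquidAI/LFM2-24B-A2B")
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": "hi"}],
        tokenize=False,
        add_generation_prompt=True,
    )
    print(prompt)  # a trailing <think> block means it was trained to reason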

[1]: https://github.com/0xku/kon

[2]: https://github.com/dirac-run/dirac

trilogic 1 day ago

Liquid AI have made some awesome models (especially the smaller ones; they are lightning fast). I wish they made a fast, small coder. I did a finetune distill of a 0.8B myself and it does in fact work properly, coding like a 30B model, so I know it's possible. Anyway, here you have the 24B-parameter model with 2B active: https://hugston.com/models/lfm2-24b-a2b-q4-k-m
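
For the curious: the core of a finetune distill is just adding a KL term against the teacher's softened logits on top of the usual cross-entropy. A simplified sketch of that loss (illustrative shape only, not my actual training code):

    # Simplified logit-distillation loss: KL against the teacher's softened
    # distribution plus ordinary next-token cross-entropy. Illustrative only.
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # T^2 keeps gradients comparable across temperatures
        ce = F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
        )
        return alpha * kd + (1 - alpha) * ce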

  • MarsIronPI 16 hours ago

    That sounds pretty interesting. Did you publish a write-up anywhere? If not, could you say more about how you did the finetune? Which model did you fine-tune/distill, and what datasets did you use?

alyxya 1 day ago

The blog post was published a couple months ago, and it looks like there hasn't been a follow-up release with the fully trained model. I'm not sure if there's much to take away from an early checkpoint besides the unique architectural choices they made in their model for faster inference.
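
For anyone who hasn't read the post: the headline choice is replacing most attention layers with short gated convolutions, so per-token decode cost stays flat as context grows. A toy paraphrase of that kind of block (my reading of the post, not their actual code):

    # Toy paraphrase of a gated short-convolution block: input and output
    # gates around a short depthwise causal conv. Not LFM2's actual code.
    import torch
    import torch.nn as nn

    class GatedShortConv(nn.Module):
        def __init__(self, dim, kernel_size=3):
            super().__init__()
            self.in_gate = nn.Linear(dim, dim)
            self.out_gate = nn.Linear(dim, dim)
            # Depthwise conv; left padding + slicing keeps it causal.
            self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                                  padding=kernel_size - 1)

        def forward(self, x):                        # x: (batch, seq, dim)
            g = x * torch.sigmoid(self.in_gate(x))   # input gate
            h = self.conv(g.transpose(1, 2))[..., :x.size(1)].transpose(1, 2)
            return h * torch.sigmoid(self.out_gate(x))  # output gate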

  • adrian_b 1 day ago

    Some smaller models from the LFM2.5 family were published on Hugging Face at the end of March, a month ago.

    Presumably this larger model is taking more time to complete post-training, but it should follow those smaller LFM2.5 models in the near future.

BoredomIsFun 1 day ago

All the LFM models I've tried seemed to suffer from serious coherence issues. I found the Gemmas best at tasks requiring rock-solid coherent output; even Qwen isn't comparable.

  • 1dom 1 day ago

    I think context length is important to consider here.

    I find Gemmas really good for a short conversation with maybe 3 or 4 exchanges of a few paragraphs each, which covers a surprisingly large amount of interactions.

    For anything longer form though, particularly with larger code contexts, Qwen is far more useful for me personally.

    I'm not an expert in this field, but my understanding is that Qwen uses hybrid gated attention mechanisms, whereas Gemma is a hybrid that includes a sliding-window attention mechanism, which makes it look like it favours the most recent tokens a little too much at times.

    This is all in the context of local quantized models; I'm aware both have larger cloud variants that wouldn't suffer as much.
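
    To make the sliding-window point concrete, here's the shape of the two masks (toy sketch; real models interleave sliding-window and full-attention layers):

        # Toy sketch: full causal mask vs sliding-window causal mask. With
        # a window W, each token attends only to the W most recent tokens,
        # so older context reaches it only indirectly via deeper layers.
        import torch

        def causal_mask(seq):
            return torch.tril(torch.ones(seq, seq, dtype=torch.bool))

        def sliding_window_mask(seq, window):
            idx = torch.arange(seq)
            recent = (idx[:, None] - idx[None, :]) < window
            return causal_mask(seq) & recent

        print(causal_mask(5).int())
        print(sliding_window_mask(5, window=2).int())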

potatobanana 1 day ago

I liked LFM2-8B-A1B for its speed on CPU or integrated GPU; this one is slower. I used it recently while coding offline in a stressful situation, and it was good enough to propose an OK suggestion for solving a simple-to-intermediate problem in a somewhat obscure language. Multi-turn iteration wasn't working well, though: the code ran, but it didn't exactly fulfill all the expectations from subsequent turns. Still good enough to help and take further, so helpful.

goldenarm 22 hours ago

Comparison with Qwen3.6 35B A3B:

- GPQA Diamond: 47.4% vs 84.1% for Qwen

- HLE: 4.4% vs 20.2% for Qwen

- AA Omniscience Accuracy: 6.4% vs 18.9% for Qwen

- AA Hallucination Rate (lower is better): 30.0% vs 50.3% for Qwen

alfiedotwtf 1 day ago

Tokens per second is nice, but I would also like to see quality benchmarks, especially against other models. I mean, eventually someone's gonna write a blog post comparing models, so why not just do it yourself… that way your marketing department at least gets to control the narrative, rather than a random blogger.

  • mirekrusin 1 day ago

    It's a checkpoint from the middle of training; it makes sense to report speed, which will stay the same, and to report quality as they did.