Don't set your goals so low. We already reached 17k on small models.
Since the whole goal of software architecture schemes is to allow the rest of us non-geniuses to still understand it and modify it, perhaps the same could be true of LLMs.
Perhaps a million-per-second hypothetical (small) model can be more useful than a state of the art big one.
30tok/s looks fine when you're just streaming code, but the issue is that there's a lot of background noise like tool-calling conventions, metadata, "thinking", etc.
This is awesome!! I use Cursor and I've been trending towards medium thinking models as much as possible - I don't like the dev cadence with something like opus 4.7 (thinking: very high) (great for some tasks, like complex plans). Eventually I'd like to make my way to open models and open harness, and this tool or something like it could help me understand what performance I'd need for productive work - bookmarked!
Cool, great UI and showcase. By the way, couldn't this be used as a plugin in the real world? For example, in my claude-code, when it is reasoning or generating code, I could have this tool measure the real-time speed of the tokens.
This reminds me of when I signed up for cerebras to try it out and dumped $20 in and hooked it into opencode and the speed was truly insane. But my one session burnt through like $15 of that in seemingly a matter of minutes. I've since used those really high tok/s options for specific application use cases, but would not advise as a coding agent. Much harder to catch issues when it is moving a million miles an hour and then it is too late and it has already spent a ton of tokens.
I just looked up what my computer is capable of (m2 MacBook Air) and it says 15-35 tokens per second. I could live with that writing code with a local model.
One can say the most profound thing in 3 words, slowly or fast it does not even matter, and can also spout absolute senseless garbage in billions of words at absolutely ridiculous speed.
It’s GLM 4.7, GPT OSS 120B, or llama 3.1 8B so not exactly the latest or best models.
But GLM is good enough for many small tasks, certainly enough to get a taste for Cerebras’ high speeds!
[edit: actually that’s just their general models, I can’t see what Cerebras code offers. It was Qwen-coder when it launched but I don’t know what it is now. I think GLM 4.7 but I’m not completely sure]
> It was Qwen-coder when it launched but I don’t know what it is now.
This was also what I used at the time, the Qwen 3 Coder 480b on Cerebras. Worked great and was so stupidly fast it made me realize that if the hardware can be at that level and commercially available (say, in 5-10 years) for that price, then we will have entirely new bottlenecks. Human review at the pace it was going is completely impossible.
Used them for a while! They didn't seem to have prompt caching so I burnt through the daily 24M token limitations really quickly when doing large scale changes on a codebase (essentially a team's worth of menial migration/refactoring work). A lot of it was okay, but plenty had to be re-done and I still spotted some issues months down the line, in part I blame their model catalogue which did get an update to GLM 4.7 sometime way back, but definitely is showing its age: https://inference-docs.cerebras.ai/models/overview
Quality-wise, Anthropic gives me the best results (Opus for almost everything; I have sub-agents with fresh context review its work, and after 2-10 loops that usually finds most issues). Token-amount-wise for agentic work, DeepSeek V4 is up there. What Cerebras is doing is pretty cool though; apparently they even have prompt caching now like the other big providers: https://inference-docs.cerebras.ai/capabilities/prompt-cachi... At the same time, producing bad code faster was annoying in a uniquely new way.
Wish they'd update the models with their subscription, it could genuinely be great with the proper harness. Like if they can run GLM 4.7, surely they could at least get DeepSeek V4 Flash with a big context window going as a starting point. How can you have so much money to make your own chips, but can't run modern models that you can get for free? It's like they don't want people to use their subscription.
Codex is pretty good, OpenAI models are up there with Anthropic's, though I still prefer the latter for most development tasks (in part UI/UX, in part personal preference for how the model performs and interacts with me and the codebases). That said, if you do get a subscription from OpenAI, they actually have more generous usage limits than Anthropic - Anthropic's Pro tier is borderline useless for agentic development and I just went with their 100 USD Max tier instead. OpenAI might be more cost effective, though GPT-5.5 is more expensive than GPT-5.4, for example.
I'm recently also considering downgrading to Pro and using DeepSeek V4 Pro for anything but the more complex tasks and basically wrote a little utility to hook Claude Code up with 3rd party providers better: https://ccode.kronis.dev/ or tbh I could also just use OpenCode on the CLI or maybe something like KiloCode in Visual Studio Code (sadly RooCode got retired, liked their UI/UX a lot too).
I guess where I'm going with all this is that most of the SOTA or near-SOTA models are pretty okay and if you want, you should either get their more affordable plans for a month and experiment, or maybe hook up whatever tools you have with something like OpenRouter and try out a bunch of them: https://openrouter.ai/ (though some of their providers quantize the models a lot, look out for that) Personally I'd also add the new Kimi and GLM models to the list of the ones to try out.
Paying for API tokens isn't really financially good long term for anyone but companies and eventually most folks just settle on a subscription of some sort, since those are heavily subsidized and more cost effective.
For small enough tasks with tight enough workflows, you can have it right now. I.e., if you can constrain the task to work well with GPT OSS 120B/llama 3.3/qwen 3, then you can get upwards of 600 TPS on groq and up to 3k TPS on Cerebras.
Those models aren’t comparable to Opus, or even weaker models like MiniMax, but for certain tasks (focused context and prompts, strict workflows, single purpose requests) you absolutely can use these models and get insane speeds.
Why is that? It seems the other direction? I want to be sure I can complete a task in a certain amount of wall clock time. If the tokens per second are slow, then I am risking more by running a single approach at a time, and then have an incentive to try to multiplex my attention between separate work-streams. If the generation is fast enough to occupy my attention then there is no more available improvement by having parallel threads.
Branching strategies: do 10 things in parallel and evaluate for the best at the end, or something along the lines of an evolutionary algorithm. Turn up the temperature on an LLM and have a survival mechanism, and generate solutions to the same problem over and over.
Regarding the first, parallel requests to the same loaded model seem to work pretty well, I'm trying to find time to look more into it myself, but this may be something that might already be within reach for local models.
Sure, it's possible, but you'd start to use it much more and in more advanced ways. Like "thinking hard" would consist of spawning a dozen different inferences from the same cached point and then picking the best one.
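To make the "spawn a dozen inferences and pick the best one" idea concrete, here is a minimal best-of-N sketch. Both `generate_candidate` and `score` are hypothetical stand-ins (a seeded random generator and a toy scorer), not any real model API; in practice the first would be a sampled completion at non-zero temperature and the second might run tests or ask a judge model.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def generate_candidate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled LLM completion."""
    rng = random.Random(seed)
    return f"{prompt} -> candidate {rng.randint(0, 999)}"


def score(candidate: str) -> int:
    """Hypothetical evaluator: here just the trailing number of the candidate."""
    return int(candidate.rsplit(" ", 1)[1])


def best_of_n(prompt: str, n: int = 10) -> str:
    """Generate n candidates in parallel and keep the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: generate_candidate(prompt, s), range(n)))
    return max(candidates, key=score)


print(best_of_n("fix the failing test"))
```

With fast enough generation the wall-clock cost is roughly one inference plus the evaluation step, which is why this gets more attractive as tok/s goes up.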
Obviously things will get expensive quickly, but the main thing for me would be not dealing with the context switch every time I leave the agent to do stuff on its own.
Feedback loops for prototyping could become even quicker.
In my experience, current agentic workflows are so slow, that for many cases it only makes sense to run them in parallel. So a lot of context switching. If we could have 10-100× faster token generation, we could have task delivery at the speed of human review.
People seem to use these tools very differently from each other. I value intelligence over speed any day. My programs are written in Haskell, so there are rarely any tasks which require thousands and thousands of lines to solve. Just intelligence. If there are rote tasks, I want the LLM to help me find intelligent ways of automating it: the right abstraction, the right meta-programming technique.
I constantly push Opus and GPT, and they are getting better. But still have to do the hardest parts myself. I would not mind waiting 10-15 minutes for the right 20 lines of code!
Why do you use Haskell? Why not something that produces a more predictable memory use at runtime? (I’m asking earnestly as a former Haskeller turned Rustacean who sees the value in “Boring Haskell”, but favours strictness for anything internet-facing and many things that aren’t compilers.)
I use Haskell because purity and strong typing gives so much control over what each part of the program does. This has huge benefits when it comes to security, and just general lack of bugs. Also, it makes the code easy to write, once the types are in place.
I use Haskell because I find laziness to be a super power. I can solve so many problems in the most straightforward way, and then laziness saves my butt w.r.t. performance.
I use Haskell because it is a better C than C is. The foreign function interface is brilliant, and I can take C primitives and apply all the abstraction mechanisms from Haskell to them. My latest project has been OpenGL based, so lots of caring about byte alignments and shovelling data to the GPU. But all this can be automated with clever use of type classes and Generics (Haskell's super cool meta system of data types).
I use Haskell because I love applying abstractions to make code which describes the problem, and then the compiler finds the solution.
I don’t do programming for embedded, so I am rarely memory constrained. I also understand Haskell memory usage quite well, and can get myself out of trouble.
Google's 3.5 Flash – which came out yesterday – is 200-300 tokens/second (albeit purportedly inefficient in its use of reasoning tokens) and according to Google, 800-1500+ tokens/second on their 8i TPUs when they're out!
It's... suboptimal, but hopefully that's a reason to hope... if Google get themselves together for 3.5 Pro / the next Flash.
The sweet spot for being just fast enough to not irritate you is 10tok/s. Still slow but faster than you can sustain at typing and thinking. Just interesting to observe.
Lol. I was running some content generation on a 3060 recently with speeds measured as seconds-per-token. Not all tokens are equal. And, for the work I do, I will take slow-but-better answers over fast-but-junk every time.
It's interesting how even 5 tok/s is still much faster than you'd typically type, but feels glacially slow for an agent.
On the other hand, I've been using Mimo and Minimax a lot recently. They routinely reach 100-150 tokens per second and that feels too fast, to the point where it's hard to keep up with what it's actually doing. Great for subagents though.
> They routinely reach 100-150 tokens per second and that feels too fast, to the point where it's hard to keep up with what it's actually doing.
There is no way you can follow what is going on even at 30 tokens per second. Maybe you can maintain a rough idea of what is going on for some tens of seconds but that is probably about it. Follow it in any detail, no chance. Reason about what you read, absolutely no chance.
800 tok/s — Cerebras-class, where the bottleneck is your eyeballs
I do not understand why they say this. I am not sure if it is even true. 800 tokens sounds like a page of text and I would assume you can look at one page per second without hitting any limitation of your eyes. Or is the resolution of the human eye not good enough to see an entire page at once, so that you have to scan it with the fovea? Scrolling text might of course hit the temporal resolution limit. But why does this even matter? Your brain cannot process anything close to the amount of information your eyes can take in.
>I do not understand why they say this.
Click on 800.
Try to read the text.
You'll understand.
Because it is scrolling. If they showed one page of text while filling the next one in the background, the result would probably be somewhat like flicking through a book at one page per second. You still cannot read one page per second, but you would not be limited by your eyes being unable to recognize the quickly scrolling text.
EDIT: As others have pointed out and I now did some reading on, it is an illusion that you can see all the text on a page at once, that is beyond the resolution limit of the human eye. To actually see all the words, you have to scan the page and that takes several seconds. From the numbers I have seen, it seems that the ultimate limit is probably below 30 tokens per second, no matter what, even using rapid serial visual presentation to cut out eye movements. Even 10 to 20 tokens per second is probably pushing it and unsustainable for many, if not most, people.
Did someone say rapid serial visual presentation? I made a tool for that! https://wordflashreader.vercel.app
The angular diameter of detailed seeing is very small - something like 1-2 degrees from what I was reading (matches my experience). That's the only area where you can reasonably read, the rest is only good for making out rough shape. So scanning it is.
On top of the other comments, this reads like a half-joke.
I run models in the ~120B class on my old server (96GB DDR4) and it manages about 3-3.5 tok/sec. It is indeed painfully slow to watch, but I find if I walk away or bury the window and do something else, it always seems to be done when I check back
isn't 5 tok/s like 100wpm? Pretty standard typing speed.
You also would need to compare token generation not with the actual output, but with the thoughts and deleted and edited parts.
It's about 240 wpm on text.
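For anyone who wants to check the tok/s to wpm numbers, here is a rough conversion. The `words_per_token` ratio is an assumption: about 0.75 words per token is a common rule of thumb for English prose, and 0.8 is the ratio that gives the 240 wpm figure above. Code and other languages tokenize quite differently.

```python
def tokens_per_sec_to_wpm(tok_per_sec: float, words_per_token: float = 0.75) -> float:
    """Convert a token rate to words per minute.

    words_per_token is an assumed average for English prose, not a
    property of any specific tokenizer.
    """
    return tok_per_sec * 60 * words_per_token


for rate in (5, 10, 30):
    print(f"{rate} tok/s is roughly {tokens_per_sec_to_wpm(rate):.0f} wpm")
```

Under either ratio, 5 tok/s lands well above 200 wpm, so it is much closer to a fast reading speed than to a typing speed.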
100wpm is well above what the average person types at, which is estimated at about 40wpm.
100wpm might still be a bit high even for your average programmer.
I think the metric should be reading speed, not writing speed. At the very least it should be speech speed.
> It's interesting how even 5 tok/s is still much faster than you'd typically type, but feels glacially slow for an agent.
Calling the token rate the rate at which they "type" is a bit misleading. They also do virtually all of their more complex reasoning in tokens, so 5 tokens per second is also their thinking speed. And thinking at 5 tokens per second is glacially slow.
This is why faster versions of strong models do so well on reasoning tasks like playing text adventure games[1]. Their output isn't better on a token-for-token basis, but they get so much more thinking in during a given time window, they get more opportunities to find the right conclusion.
[1]: https://entropicthoughts.com/updated-llm-benchmark
How many tok/s does an average human think?
Most of my thinking is non-verbal. I don't think in sentences. I CAN think in sentences and internally rationalize my actions and explain them, and sometimes that's beneficial (rubber duck debugging; sometimes it's good to verbalize and explain something), but usually I don't do it.
This question gets into information theory way beyond me, but I suspect it depends a lot on the task at hand. Human brains aren't very effective at combining sources of statistical variation, but they're great at other things. I'm personally most impressed by the cerebellum. It is highly trainable, yet translating the things it does to maintain locomotion, proprioception, coordination of movement, etc. into tokens would probably result in a high token rate.
Looking at 5 tok/s after reading this comment made me think about why it felt slow and would be unacceptable for work. If you didn't plan, or even sometimes despite planning, you have absolutely no idea if it is suddenly going to go off the rails in a wrong direction. Every day, I'll look at the thinking and it seems pretty good until suddenly I have to slam the esc key because it decided to pursue a completely wrong direction. Much faster is better for skimming to make sure you don't have to throw everything away.
I think your demo needs more realistic thinking logs, because thinking usually burns at least 2x to 3x the tokens of the code, and much more for harder tasks.
Indeed, at 30tok/s make it pause for 20 seconds while "thinking" is streaming (and hidden); that's the real experience.
Yes, it should use actual output from some of the open models.
Very cool!
> Unless you've actually watched tokens stream at those rates, the numbers are hard to internalize. This is the rendering.
I built something similar recently, for the same reason: https://modal.com/llm-almanac/token-timing-simulator.
I like that the output rendering is closer to typical UIs -- syntax highlighting in code mode, tool calls, dim-italic reasoning.
One feature mine has that the author, or anyone else who vibe codes their own version after seeing this, might like to steal is modeling the distribution of output latencies. My implementation is hacky (log-normal roughly estimated from p50, p90, and p99 values), but still, when you set those to realistic values, it recreates the "jitter" you see in many LLM UIs.
antirez is right that generation tok/s isn't flat as a function of context length, which is a weakness of both simulators.
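A minimal sketch of the log-normal jitter idea, for anyone who wants to try it. This is not the code from either simulator: it fits the distribution from p50 and p90 only (the median pins `mu`, the p90 pins `sigma`), and a p99 value could be used afterwards as a sanity check. The example percentiles (20 ms and 50 ms per token) are made up.

```python
import math
import random
from statistics import NormalDist


def fit_lognormal(p50: float, p90: float) -> tuple[float, float]:
    """Return (mu, sigma) of a log-normal with the given median and 90th percentile."""
    z90 = NormalDist().inv_cdf(0.90)  # standard normal quantile, about 1.28
    mu = math.log(p50)                # the median of a log-normal is exp(mu)
    sigma = (math.log(p90) - mu) / z90
    return mu, sigma


def sample_delays(n: int, p50: float, p90: float, seed: int = 0) -> list[float]:
    """Draw n inter-token delays in seconds from the fitted log-normal."""
    rng = random.Random(seed)
    mu, sigma = fit_lognormal(p50, p90)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


# Assumed example values: median 20 ms, p90 50 ms between tokens.
delays = sample_delays(1000, p50=0.020, p90=0.050)
print(f"mean rate: {len(delays) / sum(delays):.1f} tok/s")
```

A renderer would sleep for each sampled delay before printing the next token, which gives the uneven, bursty feel instead of a metronome.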
I always wonder what the world is going to be like when these are 2 orders of magnitude faster.
Something like https://chatjimmy.ai/ will do 14,000 tokens per second, and it's a completely different experience to what we see now. It's more like a page load than a conversation.
That's really impressive from a speed point of view but it hallucinates like it's on drugs.
I'd much rather trade speed for accuracy.
It's Llama 3.1 8B, so not a particularly good model; it's mostly a tech demo of the speed possible.
We truly are in the dial up era of GenAI.
I'm flashing back to using a 1200 baud modem when the world was on 28.8k. Modems are much more regular-looking, though, since each character is a character. Unless you count color changes and such, which you only really notice at 1200...
How many tok/s does codex usually run at?
The visualiser seems to be quite naive with what it defines as a token. I don't think a token is an entire word as often as the demo shows, and when it gets to the `def estimate_tokens` method, the entire `# Rough heuristic: ~1 token per 4 chars of English` comment is printed all at once as one token, which is certainly not accurate.
This is not a realistic replay of what a common LLM might actually print out - it's entirely fabricated. But for the purpose of estimating the feel of tokens per second, I suppose it's good enough.
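As a rough illustration of the granularity point, the sketch below applies the same "~1 token per 4 chars" heuristic to chunk a line instead of emitting it whole. It is not a real tokenizer: BPE tokenizers split on learned subword boundaries, not fixed widths, so `pseudo_tokens` is only a stand-in for getting the visual pacing closer.

```python
def pseudo_tokens(text: str, chars_per_token: int = 4) -> list[str]:
    """Split text into fixed-size character chunks as a crude stand-in for tokens."""
    return [text[i:i + chars_per_token] for i in range(0, len(text), chars_per_token)]


line = "# Rough heuristic: ~1 token per 4 chars of English"
chunks = pseudo_tokens(line)
print(len(chunks), chunks[:4])
```

Streaming the comment as a dozen or so small chunks rather than one block would already feel much closer to what a real model does.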
I built something similar awhile back [1] and used OpenAI’s tokenizer playground [2] to recalculate tokens on a giant block of lorem ipsum text. I feel like this gives a much more accurate representation.
[1] https://dave.ly/tools/tokenflow/
[2] https://platform.openai.com/tokenizer
Token/sec only makes sense once you tell me four things:
1. decoding t/s, that is, when the model is generating text in the autoregressive fashion.
2. prefill t/s, that is, prompt processing speed.
3. What is the slope of those two numbers as the context size increases? An implementation that decodes at 50 t/s with 2k context but decodes at 7 t/s at 100k context is going to be a lot less useful than it seems at first glance for a large number of real-world use cases.
4. What's your use case? Reading a huge text and then producing a small output like fraud probability=12%? Or reading a small question and generating a lot of text? Whether a model is usable given its prefill/decoding speed changes substantially depending on this.
For instance my DS4F inference on the DGX Spark does prefill at 350 t/s and at 200 t/s on already large contexts. But decodes at 13 t/s.
On the Mac Ultra the prefill is like 400 t/s and decoding 35 t/s.
The two systems can perform dramatically differently or almost the same based on the use case. In general for local inference to be acceptable, even if slow, you want at least 100 t/s prefill, at least 10 t/s generation. To be ok-ish from 200 to 400 t/s prefill, 15-25 t/s generation. To be a wonderful experience thousands of t/s prefill, 100 t/s generation.
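A back-of-the-envelope sketch of why the use case matters so much, using the prefill and decode rates quoted above. The two workload sizes are assumptions picked for illustration, and the model deliberately ignores point 3 (rates are treated as flat, with no slowdown as context grows), so real numbers would be worse on long contexts.

```python
def response_time(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Seconds to process the prompt and then generate the output, at flat rates."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# (prefill t/s, decode t/s) as quoted in the comment above.
systems = {"DGX Spark": (350, 13), "Mac Ultra": (400, 35)}

# (prompt tokens, output tokens); these sizes are illustrative assumptions.
workloads = {
    "long prompt, short answer": (50_000, 50),
    "short prompt, long answer": (200, 2_000),
}

for workload, (prompt, output) in workloads.items():
    for name, (prefill, decode) in systems.items():
        seconds = response_time(prompt, output, prefill, decode)
        print(f"{workload:28s} {name:10s} {seconds:7.1f} s")
```

On the prefill-dominated workload the two machines end up within about 20% of each other, while on the decode-dominated one the gap is more than 2.5x.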
Agreed. Prefill kills me for local model work. The model reads much faster than it writes, but I'd love to get a sense for how fast the model can read large source conversations.
> For instance my DS4F inference on the DGX Spark does prefill at 350 t/s and at 200 t/s on already large contexts. But decodes at 13 t/s.
You should run a multi-session batched decode on that DGX unless your 13 t/s decode is already running into thermal or power limits, which I don't believe it is. (To be clear, this is a real issue on Apple Silicon machines: batched decode does not seem to unlock higher aggregate tok/s unless you're specifically trying to mitigate the drawbacks of slow streamed inference. Especially on the M5 laptops, thermal/power throttling places an early limit on your total compute.
The jury is still out on Strix Halo, but I think batched decode may turn out to be quite useful there since the bandwidth bottleneck is even more constraining there.)
Isn't the thinking part the part that burns the tokens? You're just outputting tokens.
Totally depends, but I think this is mostly just an illustration of overall speed, regardless of the content.
Okay, but I think the realistic thing is:
* burns 18000 tokens thinking of the solution
* outputs 1000 tokens of code
So you can easily follow the 1000 tokens of code, and the 18000 tokens of thinking is you sitting around waiting for your GPU to process the LLM.
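The arithmetic behind that split, as a quick sketch. The 18,000 and 1,000 token counts come from the comment above; the decode rates are just sample points.

```python
THINKING_TOKENS = 18_000
CODE_TOKENS = 1_000

def wait_and_read_seconds(tps):
    """Seconds spent waiting on hidden reasoning vs. watching code stream."""
    return THINKING_TOKENS / tps, CODE_TOKENS / tps

for tps in (13, 30, 100, 800):
    wait_s, read_s = wait_and_read_seconds(tps)
    print(f"{tps:4d} t/s: wait {wait_s / 60:5.1f} min, then {read_s:6.1f} s of visible code")
```

At 30 t/s that is 10 minutes of waiting followed by about half a minute of visible output.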
Cool visualization, but most of the token generation in my sessions doesn't go to output code or even the text I see. Reasoning tokens make up most of the output. That can only occur after processing the input files and context.
For non-trivial work I go through hundreds of thousands of tokens (combined prefill + tg of course) before even getting to some useful text output.
I mostly use LLMs for exploration and studies, rarely code generation. Prefill matters heavily for this. Even in the high hundreds or low thousands prefill rate I spend a lot of time waiting on the LLM (doing other things, not twiddling thumbs)
Interesting. It seems to me that with that speed (20-30) on local hardware the real issue is quality of output, not tokens per sec.
It really depends. With the new "thinking" models they usually spend some time before writing the final answer. If they "think" for 1k tokens, that's a minute of spinning wheel you're gonna see for each question. Add that to the prompt processing, and diminishing speeds as context increases, and it becomes really slow for longer sessions.
Reminds me of the possibility of running DeepSeek at 3-4 t/s with SSD streaming, could be viable if you are running something overnight for example
The nice thing about DeepSeek and off-memory streaming is that you ought to be able to batch multiple sessions of it in parallel. Each individual session would slow down from streaming incrementally more active weights from disk, but your total tok/s would ultimately only be limited by compute. Other models have trouble doing this, because the KV cache takes too much space in RAM (and increases wear-and-tear if stored on disk) even for somewhat limited context.
I wonder when we reach speed of 1000 tps with high quality models. 5 years? 10 years?
We technically can (check Cerebras, Groq, and Gemini Diffusion), but it's not economically viable and not a priority for product managers.
Maybe when intelligence plateaus it could become a main differentiating factor, like smartphones and battery life.
Don't set your goals so low. We already reached 17k on small models.
Since the whole goal of software architecture schemes is to allow the rest of us non-geniuses to still understand it and modify it, perhaps the same could be true of LLMs.
Perhaps a million-per-second hypothetical (small) model can be more useful than a state of the art big one.
30tok/s looks fine when you're just streaming code, but the issue is that there's a lot of background noise like tool-calling conventions, metadata, "thinking", etc.
This is awesome!! I use Cursor and I've been trending towards medium thinking models as much as possible - I don't like the dev cadence with something like opus 4.7 (thinking: very high) (great for some tasks, like complex plans). Eventually I'd like to make my way to open models and open harness, and this tool or something like it could help me understand what performance I'd need for productive work - bookmarked!
Curious about the other way around: how many tokens per second does a productive developer produce, averaged over a day?
Much more, given that you need to incorporate the dev's thought process too.
Interesting point, i didn't consider the thought process as tokens.
I'm not sure if it's much more though.
Cool, great UI and showcase. By the way, couldn't this be used as a plugin in real-world tools? For example, in Claude Code, while it is reasoning or generating code, I'd like to have a tool like this to measure the real-time token speed.
This reminds me of when I signed up for cerebras to try it out and dumped $20 in and hooked it into opencode and the speed was truly insane. But my one session burnt through like $15 of that in seemingly a matter of minutes. I've since used those really high tok/s options for specific application use cases, but would not advise as a coding agent. Much harder to catch issues when it is moving a million miles an hour and then it is too late and it has already spent a ton of tokens.
I just looked up what my computer is capable of (m2 MacBook Air) and it says 15-35 tokens per second. I could live with that writing code with a local model.
One thing I noticed is that prior to AI coding agents, I used to be able to tolerate 10 tokens/s. With AI Agents, I think 60 is minimum.
And yet... it means absolutely nothing.
One can say the most profound thing in 3 words, slowly or fast it does not even matter, and can also spout absolute senseless garbage in billions of words at absolutely ridiculous speed.
Neat website, the visualization is great. I had a hard time wrapping my head around the tokens/s thing but this made it easy.
This is great. Agentic coding at 600+ tokens/sec is going to be a radically different beast. Coming soon-ish?
If you have a Cerebras Code subscription you can experience it right now. Indeed, a very different experience.
It’s GLM 4.7, GPT OSS 120B, or llama 3.1 8B so not exactly the latest or best models.
But GLM is good enough for many small tasks, certainly enough to get a taste for Cerebras’ high speeds!
[edit: actually that’s just their general models, I can’t see what Cerebras code offers. It was Qwen-coder when it launched but I don’t know what it is now. I think GLM 4.7 but I’m not completely sure]
> It was Qwen-coder when it launched but I don’t know what it is now.
This was also what I used at the time, the Qwen 3 Coder 480b on Cerebras. Worked great and was so stupidly fast it made me realize that if the hardware can be at that level and commercially available (say in 5-10 years), for that price, then we will have entirely new bottlenecks. Human review at the pace it was going is completely impossible.
Used them for a while! They didn't seem to have prompt caching so I burnt through the daily 24M token limitations really quickly when doing large scale changes on a codebase (essentially a team's worth of menial migration/refactoring work). A lot of it was okay, but plenty had to be re-done and I still spotted some issues months down the line, in part I blame their model catalogue which did get an update to GLM 4.7 sometime way back, but definitely is showing its age: https://inference-docs.cerebras.ai/models/overview
Quality-wise, Anthropic gives me the best results (Opus for almost everything; I have sub-agents with fresh context review its work, and after 2-10 loops that usually finds most issues). Token-amount-wise for agentic work, DeepSeek V4 is up there. What Cerebras is doing is pretty cool though; apparently they even have prompt caching now like the other big providers: https://inference-docs.cerebras.ai/capabilities/prompt-cachi... At the same time, producing bad code faster was annoying in a uniquely new way.
Wish they'd update the models with their subscription, it could genuinely be great with the proper harness. Like if they can run GLM 4.7, surely they could at least get DeepSeek V4 Flash with a big context window going as a starting point. How can you have so much money to make your own chips, but can't run modern models that you can get for free? It's like they don't want people to use their subscription.
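On burning through the 24M daily limit without prompt caching: a rough sketch of how fast input tokens add up when an agent re-sends its whole context every turn. The starting context, growth per turn, and context cap are hypothetical, and whether re-sent tokens count in full against a limit depends on the provider.

```python
DAILY_LIMIT = 24_000_000  # daily token limit mentioned in the comment above

def turns_until_limit(start_context, growth_per_turn, max_context,
                      limit=DAILY_LIMIT):
    """Count agent turns that fit in the limit, assuming the full context
    is re-sent (and counted) on every turn and grows until it hits a cap."""
    used, context, turns = 0, start_context, 0
    while used + context <= limit:
        used += context
        context = min(context + growth_per_turn, max_context)
        turns += 1
    return turns

# Hypothetical session: 20k tokens to start, +2k per turn, capped at 128k.
print(turns_until_limit(20_000, 2_000, 128_000))
```

Under those made-up numbers the budget covers a couple hundred turns, which a large migration can plausibly use up in one sitting.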
Have you tried Codex? If you have, how does it compare to Opus?
Codex is pretty good, OpenAI models are up there with Anthropic's, though I still prefer the latter for most development tasks (in part UI/UX, in part personal preference for how the model performs and interacts with me and the codebases). That said, if you do get a subscription from OpenAI, they actually have more generous usage limits than Anthropic - Anthropic's Pro tier is borderline useless for agentic development and I just went with their 100 USD Max tier instead. OpenAI might be more cost effective, though GPT-5.5 is more expensive than GPT-5.4, for example.
Recently I've also been considering downgrading to Pro and using DeepSeek V4 Pro for anything but the more complex tasks, and I wrote a little utility to hook Claude Code up with 3rd party providers better: https://ccode.kronis.dev/ Or tbh I could also just use OpenCode on the CLI, or maybe something like KiloCode in Visual Studio Code (sadly RooCode got retired, I liked their UI/UX a lot too).
I guess where I'm going with all this is that most of the SOTA or near-SOTA models are pretty okay and if you want, you should either get their more affordable plans for a month and experiment, or maybe hook up whatever tools you have with something like OpenRouter and try out a bunch of them: https://openrouter.ai/ (though some of their providers quantize the models a lot, look out for that) Personally I'd also add the new Kimi and GLM models to the list of the ones to try out.
Paying for API tokens isn't really financially good long term for anyone but companies and eventually most folks just settle on a subscription of some sort, since those are heavily subsidized and more cost effective.
For small enough tasks with tight enough workflows, you can have it right now. I.e., if you can constrain the task to work well with GPT OSS 120B/llama 3.3/qwen 3, then you can get upwards of 600 TPS on groq and up to 3k TPS on Cerebras.
Those models aren’t comparable to Opus, or even weaker models like MiniMax, but for certain tasks (focused context and prompts, strict workflows, single purpose requests) you absolutely can use these models and get insane speeds.
i really want a qwen on one of these chips: https://chatjimmy.ai
15k tokens/s would get me feeling like its actually worth splitting out worktrees to try several approaches to a problem
Why is that? It seems the other direction? I want to be sure I can complete a task in a certain amount of wall clock time. If the tokens per second are slow, then I am risking more by running a single approach at a time, and then have an incentive to try to multiplex my attention between separate work-streams. If the generation is fast enough to occupy my attention then there is no more available improvement by having parallel threads.
Do you have ideas/suggestions for agentic workflows that only start making sense at such speeds?
Branching strategies: do 10 things in parallel and evaluate for the best at the end, or something along the lines of an evolutionary algorithm. Turn up the temperature on an LLM, add a survival mechanism, and generate solutions to the same problem over and over.
Regarding the first, parallel requests to the same loaded model seem to work pretty well, I'm trying to find time to look more into it myself, but this may be something that might already be within reach for local models.
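The "run several attempts in parallel and keep the best" idea can be sketched roughly like this. `generate_candidate` and `score` are hypothetical stand-ins: a real version would call a model with a high temperature and score the result with tests or a reviewer model.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate_candidate(seed):
    """Stand-in for one sampled LLM completion (hypothetical stub).
    Returns a (solution, quality) pair where quality is a fake score."""
    rng = random.Random(seed)
    return f"candidate-{seed}", rng.random()

def score(candidate):
    """Stand-in evaluator, e.g. a test-suite pass rate or a reviewer model."""
    _, quality = candidate
    return quality

def best_of_n(n):
    """Run n independent attempts in parallel and keep the highest scoring."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate_candidate, range(n)))
    return max(candidates, key=score)

best = best_of_n(10)
print(best)
```

An evolutionary variant would repeat this loop, feeding the surviving candidates back in as starting points for the next round.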
Sure, it's possible, but you'd start to use it much more and in more advanced ways. Like "thinking hard" would consist of spawning a dozen different inferences from the same cached point and then picking the best one.
Obviously things will get expensive quickly, but the main thing for me would be not dealing with the context switch every time I leave the agent to do stuff on its own.
Feedback loops for prototyping could become even quicker.
In my experience, current agentic workflows are so slow, that for many cases it only makes sense to run them in parallel. So a lot of context switching. If we could have 10-100× faster token generation, we could have task delivery at the speed of human review.
People seem to use these tools very differently from each other. I value intelligence over speed any day. My programs are written in Haskell, so there are rarely any tasks which require thousands and thousands of lines to solve. Just intelligence. If there are rote tasks, I want the LLM to help me find intelligent ways of automating them: the right abstraction, the right meta-programming technique.
I constantly push Opus and GPT, and they are getting better. But still have to do the hardest parts myself. I would not mind waiting 10-15 minutes for the right 20 lines of code!
Why do you use Haskell? Why not something that produces a more predictable memory use at runtime? (I’m asking earnestly as a former Haskeller turned Rustacean who sees the value in “Boring Haskell”, but favours strictness for anything internet-facing and many things that aren’t compilers.)
I use Haskell because purity and strong typing give so much control over what each part of the program does. This has huge benefits when it comes to security, and just general lack of bugs. Also, it makes the code easy to write, once the types are in place.
I use Haskell because I find laziness to be a super power. I can solve so many problems in the most straightforward way, and then laziness saves my butt w.r.t. performance.
I use Haskell because it is a better C than C is. The foreign function interface is brilliant, and I can take C primitives and apply all the abstraction mechanisms from Haskell to them. My latest project has been OpenGL based, so lots of caring about byte alignments and shovelling data to the GPU. But all this can be automated with clever use of type classes and Generics (Haskell's super cool meta system of data types.)
I use Haskell because I love applying abstractions to make code which describes the problem, and then the compiler finds the solution.
I don’t do programming for embedded, so I am rarely memory constrained. I also understand Haskell memory usage quite well, and can get myself out of trouble.
Google's 3.5 Flash – which came out yesterday – is 200-300 tokens/second (albeit purportedly inefficient in its use of reasoning tokens) and according to Google, 800-1500+ tokens/second on their 8i TPUs when they're out!
It's... suboptimal, but hopefully that's a reason to hope... if Google get themselves together for 3.5 Pro / the next Flash.
On avg 1 token = 4 chars
So 75 tokens/s is ~ 300 chars per second which is the speed you'd get with a 2400 baud modem
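The modem comparison worked through, as a small sketch. It assumes plain 8-bit characters and the rough 4 characters per token figure from the comment; real tokenizers and real content vary.

```python
CHARS_PER_TOKEN = 4  # rough average from the comment above

def tokens_to_bps(tokens_per_s, bits_per_char):
    """Convert a token rate to an equivalent serial line rate in bits/s."""
    return tokens_per_s * CHARS_PER_TOKEN * bits_per_char

raw = tokens_to_bps(75, 8)      # 8 data bits per character, no framing
framed = tokens_to_bps(75, 10)  # 8N1 framing: start bit + 8 data + stop bit
print(raw, framed)  # 2400 3000
```

So the comparison holds if you count 8 bits per character. With the usual start and stop bits, a 2400 bit/s line carries about 240 characters per second, or roughly 60 tokens/s.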
This is nice. Please tell your agent to make the site "light" or give an option for light/dark.
Nice, I always thought 15 tok/sec is too "slow"
The non-linear scaling on the slider is an excellent UX.
I noticed this as well, glad I'm not the only one
Neat visual. 5 tok/s is still faster than me!
I had the opposite reaction, 5tok/s is so slow that when you include all the reasoning and thinking + warmup it is far slower than me.
The sweet spot for being just fast enough to not irritate you is 10tok/s. Still slow but faster than you can sustain at typing and thinking. Just interesting to observe.
yeah 3t/s seems human. only that i never wrote code perfectly top to bottom.
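For a rough sense of where human typing sits on this scale, here is a conversion sketch. It assumes the roughly 4 characters per token rule of thumb and the common typing-test convention of 5 characters per word; code usually tokenizes less efficiently than prose, so treat the results as ballpark figures.

```python
CHARS_PER_WORD = 5   # typing-test convention for words per minute
CHARS_PER_TOKEN = 4  # rough average, varies by tokenizer and content

def wpm_to_tps(wpm):
    """Convert typing speed in words per minute to tokens per second."""
    return wpm * CHARS_PER_WORD / 60 / CHARS_PER_TOKEN

def tps_to_wpm(tps):
    """Convert tokens per second to an equivalent words per minute."""
    return tps * CHARS_PER_TOKEN * 60 / CHARS_PER_WORD

for wpm in (40, 80, 120):
    print(f"{wpm:3d} WPM is about {wpm_to_tps(wpm):.2f} tok/s")
print(f"5 tok/s is about {tps_to_wpm(5):.0f} WPM")
```

By this estimate even a very fast typist stays under 3 tok/s of raw keystrokes, before counting any time spent thinking or correcting.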
Good reminder that raw tokens/sec numbers can be misleading without latency and context-window considerations.
RIP my browser history, I guess
Thank you for this great utility. I love the "gut feel" calibration utilities like this one!
Not very far til we reach 1MTk/s per LLM. Computing is going to look very different in the future.
Lol. I was running some content generation on a 3060 recently with speeds measured as seconds-per-token. Not all tokens are equal. And, for the work I do, I will take slow-but-better answers over fast-but-junk every time.
This is cool, thanks for making it.
> Now switch between c and t at the same rate. The difference is striking — and intentional.
I don't see a big difference.
super cool, thanks
This is great.