We track performance vs. the all-in cost of completing real engineering tasks, rather than cost per token. [1]
Cost per token is a bit misleading because, as others have noted, different models use tokens in different ways. (Aside - This is also why TPS isn't a great metric).
We found that 5.5 is about 1.5-2x more expensive overall. On a "Pareto" basis, we only find 5.5 xhigh worth it. At the lower reasoning levels, 5.4 still edges it out on cost/perf.
We take a spec-driven approach and mostly work in TS (on product development), so if you use a more steer-y approach, or work in a different domain, YMMV.
[1] https://voratiq.com/leaderboard?x=cost
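To make "all-in cost" concrete, here is a minimal sketch of the shape of the metric (field names are invented for illustration):

    def all_in_cost_per_task(runs):
        # "usd" covers every turn, retry, and tool call spent on a task,
        # not just the tokens in the final answer.
        spent = sum(r["usd"] for r in runs)
        done = sum(1 for r in runs if r["completed"])
        return spent / done if done else float("inf")

    # Two attempts, one completed: the all-in cost is $3.10 per completed
    # task, even though the winning run alone only cost $1.20.
    runs = [
        {"usd": 1.90, "completed": False},
        {"usd": 1.20, "completed": True},
    ]
    print(all_in_cost_per_task(runs))  # 3.1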
Interesting! I've been thinking about how to create a similar type of evaluation system for myself. How do you handle tweaks to agentic tasks? Say that a model gets pretty close to what you want, so you just need a quick follow up prompt to the original response?
Yes! It depends on the extent of changes needed.
If the changes needed are small, I'll apply the best implementation as a foundation and then just iterate directly.
If the changes needed are drastic, it usually signals that there was something wrong or ambiguous in the spec (or the ensemble was too weak, which is rarely the case). In cases like this, I improve the spec and then rerun.
If it's in the middle, I'll usually apply the best and write a follow-on spec.
How does that get integrated into the scoring system? I'm imagining a scenario where a cheaper model may get close, but only needs a small follow up to get the desired result. How would this score in comparison to a larger model that got it right the first time - even if it may have been much more expensive overall?
We also use a secondary signal from blinded multi-verifier reviews. Each verifier ranks the candidates, and those verification outcomes serve as an additional quality signal. It's somewhat similar to consensus labeling.
Btw, this also helps manage scale. E.g., you have 15 diffs to review: run a few verifiers to get a shortlist, then review directly and apply the best.
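A minimal sketch of the ranking-aggregation idea, using a Borda-style tally (the actual rule could differ, and the candidate ids below are made up):

    from collections import defaultdict

    def aggregate_rankings(rankings):
        # Each inner list orders candidate diffs best-to-worst, as produced
        # by one blinded verifier.
        scores = defaultdict(int)
        for ranking in rankings:
            n = len(ranking)
            for place, candidate in enumerate(ranking):
                scores[candidate] += n - place  # best gets n points, worst 1
        return sorted(scores, key=scores.get, reverse=True)

    # Three verifiers rank four candidate diffs; keep a shortlist of two
    # for direct human review.
    verdicts = [
        ["diff-b", "diff-a", "diff-d", "diff-c"],
        ["diff-a", "diff-b", "diff-c", "diff-d"],
        ["diff-b", "diff-d", "diff-a", "diff-c"],
    ]
    print(aggregate_rankings(verdicts)[:2])  # ['diff-b', 'diff-a']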
would be interesting to see some other labs:
- deepseek v4 pro
- glm 5.1
- kimi k2.6
- qwen 3.6 max
- xiaomi 2.5 pro
- minimax 2.7
- grok
I agree!
So far we have been native harnessmaxxing, which simplifies things a lot.
The configuration space around open models is much larger. E.g., which models, capability heterogeneity, which harness, networking, data egress / privacy, etc.
If anyone is getting very good production code out of open models, I'd love to do a user interview to better understand your setup. Email is in my bio.
With how much vendor harnesses are now actively steering the agent with their own instructions on top of user prompts, I think it’d be super interesting to see a comparison of one of the already tested models - so Opus 4.7 or GPT-5.5 - across a range of different harnesses that aren’t their native. OpenCode, Pi, Hermes, Kilo Code. The most popular coding-focused harnesses, basically.
Agreed. Harness is really important. Especially since many labs are now post-training agents directly in their native harness.
(Which is why my prior is that third party harnesses would not perform as well. But I haven't actually measured this.)
OpenCode seems to give me better results than codex-cli. I'd be interested in seeing this too!
It feels pretty weird that your ratings have:
gpt-5-4-high > gpt-5-4-xhigh
gpt-5-4-high > gpt-5-5-high
gpt-5-4 > gpt-5-5
gpt-5-2-high > gpt-5-2-xhigh
No other ratings I've seen show that.
Yes, the signal we are measuring is quite different from most evals.
We are measuring something much closer to: when multiple agents compete on the same spec, which one produces the patch that holds up best in code review?
Most evals are static / synthetic, and for code, generally stop at tests. Test evals are weak proxies for quality, since it's difficult to encode qualities like scope creep/churn, codebase fit, maintainability, etc. in tests. [1]
Almost every agent in a given run can pass tests at this point, but there is a large separation during review.
[1] https://voratiq.com/blog/your-workflow-is-the-eval
Ok, but my point is that the claims you make about more reasoning performing worse seem kinda suspicious, and I haven't seen any analysis exploring why that would happen.
My point is that more reasoning often leads to worse "scope creep/churn, codebase fit, maintainability".
I get it, but that is a significant claim. And the claim could be right, but it could also be wrong, and I see no analysis, not even a blog post on your website saying "wow, look at this weird thing we found". To me that makes the claim suspicious because it signals that nobody thought to investigate what's going on. Investigating weird results is how we demonstrate that what we're doing is right.
It’s mostly a bandwidth thing. We’ve seen the pattern consistently, but haven’t had time yet to write up the analysis carefully.
We are not the only ones to see the reasoning inversion: https://arxiv.org/abs/2510.11977, https://arxiv.org/abs/2502.08235, https://arxiv.org/abs/2507.14417
But what situations seem good enough to enable xhigh?
I feel that the recent iterations of LLMs haven't provided an intuitive qualitative leap. Have they entered a bottleneck period so quickly?
My take is that demand is also increasing, so maybe they are making incremental improvements to model quality while focusing on improving inference costs. Prices are increasing though because even if they achieve a very efficient model, they are still selling at a loss.
For what it's worth, I find GPT 5.5 qualitatively different from 5.4 and 5.3.
If I had to collapse the nature of the difference into one sentence, it'd be that 5.5 does more of what I'm asking it to do, versus doing a small aspect of what I'm asking and then stopping.
5.4 required a lot of "continue" encouragement. 5.5 just "gets it" a bit more.
What it boils down to for me is that even though it's more expensive, I would much rather use 5.5 on low than 5.4/5.3 on high/medium.
They must have changed something recently, as when 5.5 first dropped I was unable to make it do anything. It would say it would implement, but would never actually do it, no matter how many times I told it what it needed to do. It would acknowledge what needed to be done, even create a step-by-step plan, and then ask if it should proceed. I would confirm, and then it would just go in circles, reiterating the plan and insisting that this time it would start. Annoying and funny. Now it doesn't seem to be doing that anymore.
I think that's a failure mode of using the legacy completions API rather than the new responses API. With the responses API, the agent actually goes and does the things it's supposed to do.
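For reference, a minimal sketch of the two call styles with the OpenAI Python SDK (the model id and tool schema here are placeholders, not a recommendation):

    from openai import OpenAI

    client = OpenAI()
    prompt = "Apply the refactor plan we agreed on."

    # Legacy chat-completions style: returns text, and agent loops built on
    # it are where the plan-then-stall behavior tends to show up.
    chat = client.chat.completions.create(
        model="gpt-5.5",  # placeholder id
        messages=[{"role": "user", "content": prompt}],
    )
    print(chat.choices[0].message.content)

    # Responses API: the interface the newer agent harnesses target, with
    # first-class tool calls the model is expected to actually execute.
    resp = client.responses.create(
        model="gpt-5.5",  # placeholder id
        input=prompt,
        tools=[{
            "type": "function",
            "name": "apply_patch",  # hypothetical tool; schema is made up
            "description": "Apply a unified diff to the working tree.",
            "parameters": {
                "type": "object",
                "properties": {"diff": {"type": "string"}},
                "required": ["diff"],
            },
        }],
    )
    print(resp.output_text)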
They probably just tell it to do more in the prompt lmao
Are you running gpt-5.5 on xhigh reasoning? Because I'm seeing a clear difference between that and gpt-5.4 on xhigh.
Considering my use case (web apps), there was already nothing I couldn't do with Opus 4.5. The same will be, or already was, true for more people in other releases, and at some point, which may have already passed, most people will stop finding qualitative leaps.
This doesn't always mean that there is a bottleneck in terms of raw power; it may also mean that your use cases (or the lower-hanging fruit among them) are already covered.
> Have they entered a bottleneck period so quickly?
So quickly - this industry has had trillions thrown around to get here so quickly, heh.
But, yes, capability seems somewhat stagnant. It's about iso-perf with cost improvements, or iso-cost with perf improvements, plus agentic gains.
I am delighted to see the ceiling on small models exponentially increase. I think the "make models unsustainably large because the benchmark improved by 1%" practice is ending. I think the thing boosting small models will be the thing that makes LLMs actually useful. The main thing is research.
They likely entered the same compute constraint scenario as Anthropic.
I.e., they had 100 compute units and demand is 200 units. They have to do some combination of buying more compute, increasing prices, lowering limits, etc.
Capitalism convinced you that the line keeps going up as long as you let it eat all the resources.
Yes, Cuba definitely doesn’t have such wild delusions to the benefit of its residents.
Please stop. Critical theory is easy. Something about “X” sucks. Got it. What is the alternative? It’s the completely unserious philosophy of the peanut gallery.
Bunch of nonsense.
If that is true, then they should all invest resources into projects that will yield efficient use of the compute. The most efficient producer then gains a huge cost advantage AND the capacity to serve more... so yeah, that logic doesn't hold.
You mean the company that just doubled their rate limits? https://www.anthropic.com/news/higher-limits-spacex
They only did that after they "found" ~300k H100-equivalents of compute. Before signing that deal they were severely compute constrained. It was especially visible when the EU timezone was still active and the US east coast would wake up.
I don't disagree. It's just a weird way to describe them currently when they just announced massively increasing limits.
Anthropic hasn't solved their compute constraint issue. Colossus just makes it a little easier.
It's a sigmoid, not a bottleneck.
Azure recently discontinued the gpt-4.1 model. I had to move off of it, and moving to any gpt-5* model was worse (more failures and lower accuracy) and more expensive. I had to rewrite the entire system from high-school-level prompts down to lower-elementary-school-level prompts using non-GPT models.
I would say models entered a bottleneck a long time ago. My personal opinion is that they are now overfitting newer models on coding and "agentic" capabilities at great expense to general abilities in other domains.
I am wondering if everyone is moving to an IPO and striking these bizarre circular deals because they’ve hit the ceiling on what can be done with more compute until a major architectural innovation happens.
Still amazing, but 5.5 does feel like incremental progress with a massive upcharge.
Ofc they have hit a ceiling; why do you think OAI has shut down many of its projects, like the research one called Prism?
The reality is both Anthropic and OAI have converged on LLMs as being a thing for software production - that's where the majority of their revenue is coming from.
Can you elaborate what kind of system you built? I'm curious what specific prompts are getting worse responses with the newer models.
I actually think it makes sense to hone models for coding and agentic capabilities. Those models will be specialized for those tasks, and the results will be cheaper and better. We can still have a general model and specialized models.
5.4 and 5.5 were each a big jump for Codex use
In fairness, I think these recent few iterations have done reasonably well, considering it's largely optimising/fine-tuning/enhancing multimodal integrations in existing foundation models rather than generating new ones. But at some point the next big foundation models will come out.
We'll probably see another stair step change followed by another plateauing curve of incremental improvements when that happens.
I remember thinking the same thing shortly after GPT-5 came out, then Opus 4.5 dropped.
Some releases are just "meh", but I wouldn't rule out exciting new stuff for 2026 just because Opus 4.7 sucked.
GPT-5.5 is a solid leap with Codex or other harnesses. Opus 4.7 I still don't understand how people use... I tried it for a day or two, have tried it for a few hours every week or so since release, and still use 4.6 as my daily driver (with xhi thinking).
As with these daily opinion threads, ymmv. I find GPT's code to be competent, but its voice isn't great. If Claude can be a little too cool, GPT-5.x often reads like 90s era movie hacker technobabble. This has got to be RLHF/alignment and the sort of tone that people like. Also anecdotally I used xhigh for a while and turned it down to medium because it would take so long to do even simple jobs. The instruction following is quite good with 5.5 so there isn't too much need to let it wander off.
Call me cynical but for me these are mostly pricing changes, the change in quality is imperceptible. I believe after a few iterations we will be closer to the real cost.
I do a lot of OCaml and I found 5.5 to be much better, but that's kind of an esoteric language thing
~3.5x more expensive to run my benchmarks[0].
[0]: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...
Sure, but it did better on the test, which matches OpenAI's claim. More bang for more buck.
Interestingly, using your tests as a comparison, 5.5 low beats 5.4 medium at 82% of the cost.[0]
[0]: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...
This doesn't seem to be controlling for the number of turns in any way. Am I missing something?
Stronger models needing fewer turns to achieve a task feels like a prime source of efficiency gains for agentic coding, more so than individual responses being shorter.
They also don't mention what their sample size is, or anything about the distribution of input and response lengths.
It'd be interesting to see the distributions if the author actually plotted the data, so we could see if their analysis holds water or not.
A plot of the input lengths using ggplot2 geom_density, with color and fill by model, 0.1 alpha, and an appropriate bandwidth adjustment, would let us see whether the input distributions look similar across the two models. Doing the same for the output-length distributions, faceted by input-length bins, would give us an idea of whether those look the same too.
Edit: Or even a faceted plot, using input bins, of output length / input length.
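If it helps, here's that plot sketched in plotnine, the Python port of ggplot2 (column names are assumptions, since the underlying data isn't published):

    import pandas as pd
    from plotnine import ggplot, aes, geom_density, facet_wrap, labs

    # One row per request; "model", "input_tokens", "output_tokens" are
    # assumed column names for the (unpublished) raw data.
    df = pd.read_csv("requests.csv")

    # Input-length densities, colored and filled by model, alpha 0.1.
    p_inputs = (
        ggplot(df, aes("input_tokens", color="model", fill="model"))
        + geom_density(alpha=0.1, adjust=1.5)  # adjust = bandwidth tweak
        + labs(x="input tokens")
    )

    # Output-length densities, faceted by binned input length.
    df["input_bin"] = pd.qcut(df["input_tokens"], q=4).astype(str)
    p_outputs = (
        ggplot(df, aes("output_tokens", color="model", fill="model"))
        + geom_density(alpha=0.1, adjust=1.5)
        + facet_wrap("~input_bin")
        + labs(x="output tokens")
    )

    p_inputs.save("inputs.png")
    p_outputs.save("outputs.png")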
I think it should be tested on goals.
E.g. Crack this puzzle, fix this code so these tests pass. (A human can verify it doesn't cheese things).
OpenRouter may see you fire hundreds of requests at them, but they have no idea that "these 50 requests here at 4PM are for task A", "those 100 requests there do task B", etc. So it's a shallow analysis at the "overall request shape" level.
We observed slightly smaller outputs over long-horizon agentic coding for GPT 5.5, at a significant improvement in overall response scores. For one-shot coding responses, GPT 5.5 was actually more verbose than GPT 5.4, but again, the responses were significantly stronger. The expected cost increases reported by OpenRouter seem reasonably accurate (perhaps a bit optimistic), but in my opinion, highly worth it. GPT 5.5 has a pretty wide lead on the #2 model for understanding complex scenarios.
Rankings at https://gertlabs.com/rankings?mode=agentic_coding. See the efficiency chart at the bottom.
New model releases are now like new iPhones--mostly imperceptible improvements with a higher price tag. That's one of the major benefits of open source: you can "freeze" the model you're using. Often the model you know wins over one that is different enough that you have to start from scratch with every major update. Most businesses require cost control and predictability over a cutting edge with limited evidence of profitable output outside of tech.
Really bad comparison.
If I skip 2 models of iPhone upgrades, there is definitely a difference in how the thing feels, and it feels worth the money.
If I skip 2 upgrades of the frontier models now, I highly doubt I could discern what the difference is and what exactly I'd be paying more for.
In terms of work done per dollar, new models from OpenAI and Anthropic are worse than the older models. They are trying to squeeze the customers.
For personal use I switched to coding plans containing GLM 5.1, Kimi K2.6 and Xiaomi MiMo V2.5 Pro, and I've never been happier. I said goodbye to both Claude Max and Cursor.
I do think recent models are too expensive to be used for customer-facing agentic workflows.
It does seem like a step change in token efficiency, though based on the earlier Artificial Analysis reporting it's also quite the cost lottery, and I'm not sure I'm comfortable with that.
Has any enterprising hacker here yet graphed price vs "output" over time since 2023, taking "quality" into account?
That's got to be a very tricky analysis given how subjective quality is. But I'm sure there are people trying to pin it down.
Quality would be performance against different given benchmarks, I assume?
There are multiple open-weight models you can run on a pretty standard computer at home which match the quality of GPT-4. I guess that would also change the equation.
Anything that compares proprietary models will be very miscalibrated and may not be indicative; there have been too many model changes in both chat and the API where model providers didn't say a word before it got too noticeable.
Artificial Analysis has an intelligence benchmark.