gertlabs 1 month ago

Grok 4.3 is a unique model in our tests. It's one of the fastest models, and its responses are far shorter and more token-dense than those of other models with comparable performance.

However, its overall coding reasoning ability is not competitive with the big April releases, and neither Grok 4.20 nor Grok 4.3 has been able to significantly push the intelligence frontier since Grok 4. Grok 4.3 is better in agentic workloads, and a fair analogy would be that its capabilities are approximately at the GPT 5.1 / Gemini 3 Pro Preview level, but much faster and cheaper. So it's definitely a solid release in its own way. Many of the recent open-weights releases are smarter, but slower.

Full benchmarks at https://gertlabs.com/rankings

nomel 1 month ago

Is it possible that the compromise comes from making it work well with post-knowledge-cutoff information (are there benchmarks around this?), which appears to be their primary use case for it?

  • gertlabs 1 month ago

    All models are moving towards more frequent and more efficient tool use, which should close the gap on post-knowledge-cutoff problems. The only tradeoff I see is speed, and Grok 4.3 is currently taking the fast side of that tradeoff.

bel8 1 month ago

Interesting benchmarks. But how is DeepSeek V4 Flash significantly better than Pro in the agentic coding benchmarks?

  • gertlabs 1 month ago

    Pro is smarter in one-shot problems, but it struggles with custom tooling, and spends too much time trying to figure out our harness. We ran a lot of samples, so I can't make excuses for the model. Flash is truly the better option overall, especially considering speed and cost.