points by antirez 2 days ago

Very good move. In my experience, for system programming at least, GPT 5.4 xhigh is vastly superior to Claude Opus 4.6 max effort. I ran many brutal tests, including reconstructing for QEMU the SCSI controller (no longer accessible) of a SVSY UNIX of the early 90s used on a 386. Side by side, always re-mirroring the source trees each time one made a breakthrough in the implementation. Well, GPT 5.4 single-handedly did it all, while Opus kept taking wrong paths. The same goes for my Redis bug tracking and development. But $200 is too much for many people (right now, at least: the reality is that if frontier LLMs are not democratized, we will end up paying a few providers the equivalent of house rent), and also, while GPT 5.4 is much stronger, it is slower and less sharp when the thing to do is simple, so many people went for Claude (also because of better marketing and ethical concerns, even if my POV is different on that side: both companies sell LLM models with similar capabilities and similar internal IP protection and so forth, so to me they look very similar in practical terms). This will surely change things, and I bet many people will end up with a Claude 5x account plus a Codex 5x account.

dweekly 2 days ago

GPT 5.4 is the surly physics PhD post-doc who slowly and angrily sits in a basement to write brilliant, undocumented, uncommented code that encapsulates a breakthrough algorithm.

Opus 4.6 is the L5 new hire SWE keen to prove their chops and quickly turn out totally reasonable code with putatively defensible reasons for doing it that way (that are sometimes tragically wrong) and then catch an after-work yoga class with you.

  • simianwords 2 days ago

    GPT is also cautious and defensive, but Opus is agreeable.

  • fragmede 2 days ago

    > and then catch an after-work yoga class with you.

    That's cute, but do you mean something concrete with this? I.e., are there some non-coding prompts you use it for that you're referring to, or is it simply a throwaway line about L5 SWEs (at a FAANG)?

    (FWIW, I find myself using ChatGPT rather than Claude for non-coding prompts, for some reason, like random questions such as whether oil is fungible.)

    • joncrane 2 days ago

      I think the point they are trying to make is that the golden retriever vibe/energy you get from Claude gives "after-work yoga."

    • dghlsakjg 2 days ago

      It’s an analogy about the “personalities” of the models.

      They are saying that Claude is more of a team player and conformist. It isn’t really much deeper than that.

  • pdntspa 2 days ago

    Who replies to you with fucking emoji brainrot?

    • ponector 2 days ago

      You are absolutely right!

    • Gud 1 day ago

      You can tell it to be no nonsense

Tiberium 2 days ago

Thanks for confirming my impressions; I arrived at the same conclusions about 4 months ago. GPT models are just better at any kind of low-level work: reverse engineering, including understanding what decompiled code/assembly does, renaming that decompiled code (functions/types), any kind of C/C++, and far more reliable security research (Opus will find way more, but most of it will turn out to be false positives). I've had GPT create non-trivial custom decompilers for me for binaries built with specific compilers (a much simpler task than what IDA Pro/Ghidra do, but still complex), and modify existing Java decompilers.

Regarding speed, I don't use xhigh that often, and surprisingly for me GPT 5.4 high is faster than Claude 4.6 Opus high (unless you enable fast mode for Opus).

Of course I still use Opus for frontend, for some small scripts, and for criticizing GPT's code style, especially in Python (getattr).

  • antirez 2 days ago

    In the SCSI controller work I mentioned, a very big part of the work was indeed reasoning about assembly code, how IRQs and the completion of DMAs worked, and so forth. Opus, even though TOOLS.md listed the disassembler and it was asked to use it many times, didn't even bother much. GPT 5.4, instead, did a very great reverse engineering job; it was also a lot more responsive to my high-level suggestions, like: work in that way to make more isolated progress, and so forth.

    • amluto 2 days ago

      GPT 5.4 is remarkably good at figuring out machine code using just binutils. Amusingly, I watched it start downloading ghidra, observe that the download was taking a while, and then mostly succeed at its assignment with objdump :)

  • beering 2 days ago

    Codex also gives you a lot more usage for $20/mo than Claude, so there's also not that fear that high or xhigh reasoning will eat up all your quota. It really comes down to whether you want to try to save some time or not. (I default to xhigh because it's still fast enough for me.)

Asyne 2 days ago

+1 to this, I've found GPT/Codex models consistently stronger in engineering tasks (such as debugging complex, cross-systems issues, concurrency problems, etc).

I use both OpenAI and Anthropic models, though for different purposes. What surprises me is how underrated GPT still feels (or, alternatively, how overhyped Anthropic models can be) given how capable it is in these scenarios. There also seems to be relatively little recognition of this in the broader community (like in your recent YouTube video). My guess is that demand skews toward general codegen rather than the kind of deep debugging and systems work where these differences really show.

  • mediaman 2 days ago

    It's surprising to me how much LLM "personality" seems to matter to people, more than actual capability.

    I do turn to Anthropic for ideation and non-tech things. But I find little reason to use it over codex for engineering tasks. Sometimes for planning, but even there, 5.4 is more critical of my questionable ideas, and will often come up with simpler ways to do things (especially when prompted), which I appreciate.

    And I don't do hard-tech things! I've chosen a b2b field where I can provide competent products for a niche that is underserved and where long term relationships matter, simply because I'm not some brilliant engineer who can completely reinvent how something is done. I'm not writing kernels or complex ML stacks. So I don't really understand what everyone is building where they don't see the limits of Opus. Maybe small greenfield projects with few users.

    • fcarraldo 2 days ago

      > It's surprising to me how much LLM "personality" seems to matter to people, more than actual capability.

      > I do turn to Anthropic for ideation and non-tech things. But I find little reason to use it over codex for engineering tasks. Sometimes for planning, but even there, 5.4 is more critical of my questionable ideas, and will often come up with simpler ways to do things (especially when prompted), which I appreciate.

      Aren't you saying here that the LLM personality matters to you, too? Being critical of you is a personality attribute, not a capabilities one.

      • lo_zamoyski 2 days ago

        Not necessarily. Criticism is the analysis, evaluation, or judgment of the qualities of something; it is an intellectual act. However, you could say that being habitually critical can be partly a result of "personality" or temperament.

        (Of course, strictly speaking, LLMs have neither temperament, "personality", nor intellect, but we understand these terms are used in an analogical or figurative fashion.)

    • randomNumber7 2 days ago

      > I'm not some brilliant engineer who can completely reinvent how something is done

      With an honest evaluation of your own capabilities, you are already far above average. Also, it's hard to see the insane amount of work that was often necessary to invent the brilliant stuff, and most people cannot shit that out consistently.

  • dvfjsdhgfv 2 days ago

    I use codex for cleaning up after Claude, and it always finds so many bugs, some of them quite obvious.

  • beering 2 days ago

    Or rather, it’s hard to ask everyone to side-by-side compare both products on their use cases. So the choice really comes down to word-of-mouth even though their use cases may be better served by Codex.

thisisit 2 days ago

My non-scientific testing has shown that GPT models follow prompts literally. Every time I give it an example, it uses the example in a literal sense instead of using it to enhance its understanding of the ask. This is a good thing if I want it to follow instructions, but bad if I want it to be creative. I have to tell it that the examples I gave are just examples and not to be used in the output. I feel comfortable using it when I have everything mapped out.

Claude, on the other hand, can be creative. It understands that examples are for reference purposes only. But there are times it decides to go off on a tangent on its own and not follow instructions closely. I find it useful for bouncing ideas around or testing something new.

The other thing I notice is Claude has slightly better UI design sensibilities even if you don’t give instructions. GPT on the other hand needs instructions otherwise every UI element will be so huge you need to double scroll to find buttons.

  • veber-alex 2 days ago

    This is also what I noticed.

    GPT doesn't know how to get creative, you need to tell it exactly what to do and what code you want it to write.

    For Claude you can be more general and it will look up solutions for you outside of the scope you gave it.

    I personally prefer Claude.

  • sixothree 2 days ago

    I think you might benefit from the "superpower" plugin. Add the word "brainstorm" before your prompt and it does a little bit better at figuring out how you want things.

postalcoder 2 days ago

What I like most about gpt coding models is how predictable a lever the thinking effort is.

xhigh will gather all the necessary context; low gathers the minimum necessary context.

That doesn't work as well for me with Opus. Even at max effort it'll overlook files necessary to understanding implementations. It's really annoying when you point that out and get hit with a "you're absolutely right".

Codex isn't the greatest one-shot horse in the race, but once you figure out how to harness it, it's hard to go back to other models.

bob1029 2 days ago

GPT 5.4 with any effort level is scary when you combine it with tricks like symbolic recursion. I actually had to reduce the effort level to get the model to stop trying to one-shot everything. I struggled to come up with BS test cases it couldn't dunk on in some clever way. Turning down the reasoning effort made it explore the space better.

  • rolls-reus 2 days ago

    can you explain what you mean by symbolic recursion tricks in this context?

    • bob1029 2 days ago

      The model can call a copy of itself as a tool (i.e., we maintain actual stack frames in the hosting layer). Explicit tools are made available: Call(prompt) & Return(result).

      The user's conversation happens at level 0. Any actual tool use is only permitted at stack depths > 0. When the model calls the Return tool at stack depth 0 we end that logical turn of conversation and the argument to the tool is presented to the user. The user can then continue the conversation if desired with all prior top level conversation available in-scope.

      It's effectively the exact same experience as ChatGPT, except that each time the user types a message, an entire depth-first search process kicks off that can take several minutes to complete.
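      A minimal sketch of the Call/Return stack described above. The `ask_model` function here is a scripted stand-in of my own invention, not a real API; in an actual hosting layer it would be a chat-completions call whose tool schema exposes Call(prompt) and Return(result):

```python
# Sketch of symbolic recursion: the model can Call() a copy of itself as a
# tool, and the hosting layer keeps real stack frames. ask_model is a toy
# scripted stand-in for the actual LLM call.

def make_scripted_model(script):
    """Toy stand-in for an LLM: pops pre-scripted (action, payload) turns."""
    turns = list(script)
    def ask_model(context, allow_real_tools):
        return turns.pop(0)
    return ask_model

def run_frame(ask_model, prompt, depth=0, max_depth=3):
    """One stack frame with its own isolated context."""
    context = [prompt]  # fresh per frame -- no parent history leaks in
    while True:
        # Non-Call/Return tools would only be allowed at depth > 0.
        action, payload = ask_model(context, allow_real_tools=depth > 0)
        if action == "call" and depth < max_depth:
            # Depth-first: push a frame; only its Return value comes back.
            result = run_frame(ask_model, payload, depth + 1, max_depth)
            context.append(f"Call returned: {result}")
        else:
            # A Return at depth 0 ends the logical turn for the user.
            return payload

# e.g. the model Calls once, the child frame Returns, then the top Returns:
script = [("call", "subtask"), ("return", "sub-result"), ("return", "answer")]
print(run_frame(make_scripted_model(script), "user question"))  # -> answer
```

      Each `run_frame` invocation holds its own context list, which is what keeps token pressure off the top-level conversation.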

      • krackers 1 day ago

        How is this different from a standard tool-call agentic loop, or subagents?

        • bob1029 1 day ago

          Each stack frame has its own isolated context. This pushes the token pressure down the stack. The top level conversation can go on for days in this arrangement. There is no need for summarization or other tricks.

          • krackers 1 day ago

            Is this related to the paper on Recursive Language Models? I remember it mentioned something similar about "symbolic recursion", but the way you describe it makes it sound too simple, why is there an entire paper about it?

            • bob1029 1 day ago

              The RLM paper did inspire me to try it; that's where the term comes from. "Symbolic" should be taken to mean "deterministic" or "out of band" in this context. A lot of other recursive LLM schemes rely on the recursion being in the token stream (i.e., "make believe you have a call stack and work through this problem recursively"). Clearly that pales in comparison to actual recursion with a real stack.

          • esperent 22 hours ago

            This is just subagents.

osti 2 days ago

Yup, I've mentioned this in another thread: I got GPT 5.4 xhigh to improve the throughput of a very complex, atypical CUDA kernel by 20x. This was through a combination of architecture changes followed by low-level optimizations, and it did the profiling all by itself. I was extremely impressed.

  • esperent 22 hours ago

    Do you mean the non-codex model? Are people preferring normal GPT over codex?

    • osti 21 hours ago

      I was using the codex CLI with 5.4 xhigh. So it was able to iteratively improve from simple prompts on my part (can you give some architectural ideas to improve the performance? And once it does, I just say: can you implement and benchmark it?).

      I think it was a bit like Karpathy's autoresearch, except I was doing manual prompting... though I feel I could definitely be removed from that equation.

pjjpo 12 hours ago

Really great to see this whole thread after so many questioning looks from people about why I use codex instead of Claude, which generally doesn't work for me.

I never thought it was about low-level vs. high-level usefulness in particular, but it tracks with my generally low-level work.

munksbeer 1 day ago

> right now, at least: the reality is that if frontier LLMs are not democratized, we will end paying like a house rent to a few providers

This part of your comment has slipped through but is very worrying for me. I _think_ we're passing the point now where programmers are accepting that LLMs writing code are the real deal. Lots of antagonism along the way, but the reality is these things are good, and getting better all the time.

What this means in reality, in my opinion, is that if you're an independent programmer, or smaller company trying to compete with others to earn a living, you're almost certainly going to have to use coding agents, which means your competitiveness in the market is going to be gated by the big model providers until we have more options. If you somehow get banned from a few of them, which seems like it can happen through no fault of your own, you're going to be seriously negatively impacted.

That's quite worrying having gatekeepers to our industry where it was previously in our own hands.

SunshineTheCat 2 days ago

1000%. I have been running Claude's work through codex for about a week now, and it's insane the number of mistakes it catches. Not really sure why I've been doing this; just interesting to watch, I guess.

Not to mention a billion times more usage than you get with claude, dollar for dollar.

  • scrollop 2 days ago

    It's widely reported that Opus has been greatly degraded for a number of weeks, since Mythos was released internally.

  • sixothree 1 day ago

    Funny, I've been doing the same thing. I've also been giving them both the same task and seeing who does a better job.

    I think it's all of this controversy around usage limits and model nerfing that made me start doing this.

    In the end though, I _much_ prefer working with claude because it understands the task at hand so much better and I feel like I understand the results better. It's just that codex is doing a better job at the actual coding lately.

zozbot234 2 days ago

The $100/mo giving access to GPT Pro (with reduced usage) is a nice counter to the just teased Claude Mythos. But GPT 5.4 xhigh being able to perform that kind of low-level reconstruction task is very impressive already.

nealmueller 1 day ago

The price change is for ChatGPT, not Codex; you may be mixing them up. Codex (for coding) remains $200.

  • esperent 22 hours ago

    I just checked the codex pricing page: it's Pro 5x for $100 and Pro 20x for $200. The 20x plan has a codex usage boost until the end of May, whatever that means.

    Edit: apparently the usage boost is an additional 2x for both 5x and 20x. So maybe it's time to start watching whichever of these services is currently running offers like this and switch subscriptions every few months.

aerhardt 2 days ago

I completely agree with you on both the technical and ethical reasoning.

Thank you for speaking out. I think it's important that reputable engineers like you do so. The gaslighting from the Claude gang is unhinged right now. It would be none of my concern, but I have to deal with it in the real world: my customers are susceptible to these memes. I'm sure others have to deal with similar IRL consequences, too.

TacticalCoder 1 day ago

I use Claude Code / Anthropic models but...

> I ran many brutal tests, including reconstructing for QEMU the SCSI controller (no longer accessible) of a SVSY UNIX of the early 90s used on a 386.

QEMU is one project that, for a variety of reasons, has said that at the moment they simply refuse any code written by an LLM. Was this just a test? Or just for you? Or do you think QEMU should accept that patch?