enraged_camel 1 day ago

My experience with Fable since its release matches Simon's.

I've been having it orchestrate complex implementations. I give it a parent ticket (issue) on Linear and say "look at the sub-issues on this ticket and determine which ones you can implement yourself, in which order, and determine how your implementation will need to be coordinated with what is currently being worked on by other team members". These tickets are not trivial. They have a lot of moving parts, as well as dependencies between them, both inside the same project and across projects (e.g. backend).

Fable then chooses tickets, delegates each ticket to a subagent (also Fable), which looks at Figma designs for the ticket, implements it perfectly (following repo guidelines and conventions to the letter), takes screenshots of each piece, writes detailed commit messages and PR descriptions, then posts the screenshots in them as evidence. Then it provides a summary in the form of "you'll need to make sure PR #1283 is merged first - btw there were no Figma designs for such-and-such screen but I looked at similar screens that have been implemented and adopted the pattern".

That's probably like... 20% of what it can do. It's a truly, legitimately powerful model.

Opus 4.8 could do a lot of this too, but required a lot of hand-holding, and when it came across a blocker it was likely to just stop and say "I was able to get this far, but I can't proceed."

pshirshov 1 day ago

OK, explain one thing to me: I have a benchmark - I feed an identical prompt to multiple models. Codex produces a rough but working program. Fable produces the same - but with more bugs than Codex. Opus produces something similar to Codex but with a critical bug.

That describes all my tests with Fable.

Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?

I mean, well, yes, it is impressive. It could quickly generate a lot of garbage which sorta does look like code. Two others can do the same. I don't see any groundbreaking improvement - but the price is much higher. Why the hype?

  • enraged_camel 1 day ago

    >> Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?

    I don't care if you're hyped or not. You asked if posts like the OP come from a "parallel reality" and I said no and described my experience. If you're getting good/better results with Codex than with Fable, you should probably continue using that, since it's cheaper and faster.

    • pshirshov 1 day ago

      But can you bring anything measurable in support of your words? I did.

      • bubsneedpumping 1 day ago

        The OP and GP need all genai news to be positive to the point of using doublespeak here unironically.

        "Relentlessly proactive" is a grotesque use of language. A paperclip optimizer is "relentlessly proactive".

        We already had a word for what is being promoted here: wasteful.

      • enraged_camel 1 day ago

        You brought your own benchmark to support your words. I happen to have studied statistics, so I took a look. It is deeply flawed, primarily because it is not a statistical benchmark. It is a single (n=1) autonomous "pi" coding-harness run per model per prompt, scored by an automated battery (A-items, pass/fail), an LLM code review (R-items, 0 to 2 each), and a human manual checklist (M1 to M10) that was never actually completed.

        The grader being an LLM is a big problem. You yourself admit explicitly that the grader is the same model family as the Fable 5 contestant cell and say to "discount accordingly, or re-grade with a non-Claude judge."

        Model configurations appear to not be uniform either. Effort levels differ (mimo-v2.5-pro at @high, everyone else at @xhigh), harnesses differ (codex internal config vs. pi vs. claude -p), context windows differ, and one model (GPT-5.5) had extra MCP tools the others did not.

        The two scored runs seem to use two different rubrics (/22 then /25), so scores are not comparable across runs, and the /22 rubric saturated (there are multiple 22/22 results).

        A provider quota error (HTTP 429) truncated the minimax-m3 run mid-build, but it was still scored (18/25) and ranked, on code that does not compile and has zero tests.

        If you want actual benchmarks, there are dozens of legitimate ones out there. Many of them have been posted on this website. They overwhelmingly disagree with yours. If you have any interest whatsoever in creating a reliable benchmark (so that you can make optimal decisions on what models to use for your work), you should look at them and see how yours needs to be redesigned.

        • pshirshov 11 hours ago

          Yes, I know all the flaws. As I said, it's not an objective way to measure the performance of a model - but it is intended to produce something that only humans could measure. The goal is for you to be able to play the game and judge - and fill in the human checklist for yourself if you wish.

          You didn't get why the automatic review scores are there - all of the reviewers, including Fable, happily assign the highest scores to code which can't even run. In my opinion that is a sort of empirical evidence that these models are very far from the "AGI" state.

          Anyway, while I didn't explain the methodology and the purpose of this experiment, I have something material to discuss. The "awesome Fable" claims are not material at all.

          Can you bring something clearly showcasing Fable's superiority?