Success rate is already up from 2.5% in Q3 2025 to 3.75% with Opus 4.5 (November 2025), and presumably even higher with Opus 4.6 and/or GPT-5.3-Codex: https://www.remotelabor.ai
This paper creates a new benchmark composed of real remote work tasks sourced from the freelancing website Upwork. The best commercial LLMs, such as Opus, GPT, Gemini, and Grok, were tested.
Models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given the performance on other micro-benchmarks, they will probably not be much different on this benchmark.
One of the tasks was "Build an interactive dashboard for exploring data from the World Happiness Report." I can't imagine how Opus 4.5 could've failed that.
This post really should be edited to say 96% of tasks posted on Upwork, since we would all expect that to happen.
They didn't test Opus at all, only Sonnet.
ChatGPT: when you want spellcheck to argue with you.
You think they don't? You think AI can replace programmers, today?
Then go ahead and use AI to fix this: https://gitlab.gnome.org/GNOME/mutter/-/issues/4051