Even (very) noisy LLM evaluators are useful for improving AI agents

SmithersBot 34 minutes ago

As long as OpenAI and Anthropic keep subsidizing dirt-cheap Codex or Claude Code usage, I'll just keep using them as evaluators. The trick is to have a fresh instance doing the reviewing, not the one that did the work.

CharlesW 10 minutes ago

> The trick is to have a fresh instance doing the reviewing, not the one that did the work.
In my experience that's not necessary (some people even claim that you must use a model from a different vendor), and it's expensive, since a fresh instance has to rebuild all the context it needs to review properly and thoroughly. LLMs have no problem throwing "them 5 minutes ago" under the bus when asked to review something "skeptically" and "with fresh eyes".

ai_slop_hater 1 hour ago

What is an LLM evaluator?

Gregaros 1 hour ago

They should define this, but after having read the entire article I think it’s clear they mean “frameworks for evaluating the output of an agent” rather than what first might come to mind as “LLM evals”.
Their thesis is that even when the eval is too noisy to judge the correctness of a single agentic action in production, it still lets you choose between two agents by comparing their scores aggregated over a large collection of tasks. Effectively: you can tune your agentic parameters.
There's nothing new in the idea that taking many samples and averaging can work when a single datapoint doesn't. Presumably this is part of a conversation in which we're lacking context.
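To make the averaging point concrete, here's a rough simulation sketch. None of these numbers come from the article; they're assumptions for illustration: agent A truly succeeds on 60% of tasks, agent B on 70%, and the judge gives the right verdict only 70% of the time. The function names are hypothetical too.

```python
import random


def noisy_judge(truly_correct: bool, judge_accuracy: float, rng: random.Random) -> bool:
    """Return the judge's verdict; it matches the truth with probability judge_accuracy."""
    if rng.random() < judge_accuracy:
        return truly_correct
    return not truly_correct


def judged_pass_rate(true_success_rate: float, judge_accuracy: float,
                     n_tasks: int, rng: random.Random) -> float:
    """Fraction of n_tasks that the noisy judge marks as passed."""
    passes = 0
    for _ in range(n_tasks):
        # Whether the agent actually solved this task (hidden from the judge's user).
        truly_correct = rng.random() < true_success_rate
        # The only thing we observe is the noisy verdict.
        passes += noisy_judge(truly_correct, judge_accuracy, rng)
    return passes / n_tasks


rng = random.Random(0)  # fixed seed so the run is repeatable
JUDGE_ACCURACY = 0.70   # assumed: judge is wrong 30% of the time
N_TASKS = 20_000        # assumed: size of the aggregated task collection

rate_a = judged_pass_rate(0.60, JUDGE_ACCURACY, N_TASKS, rng)  # assumed agent A
rate_b = judged_pass_rate(0.70, JUDGE_ACCURACY, N_TASKS, rng)  # assumed agent B

print(f"judged pass rate A: {rate_a:.3f}")
print(f"judged pass rate B: {rate_b:.3f}")
print("pick:", "B" if rate_b > rate_a else "A")
```

Under these assumptions the expected judged pass rates are 0.6·0.7 + 0.4·0.3 = 0.54 for A and 0.7·0.7 + 0.3·0.3 = 0.58 for B. The noise shrinks the true 10-point gap to about 4 points, but with enough tasks the ordering still comes through, even though any single verdict is wrong 30% of the time.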
- ai_slop_hater 1 hour ago
  
  Are “frameworks for evaluating the output of an agent” and "LLM evals" different? :) If yes, how?
  
  brianwmunz 30 minutes ago
  
  "LLM evals" is maybe an overused term because it can mean a bunch of things. This article talks about LLM-as-a-judge where an LLM scores another system's outputs.