"A benchmark for catching when code doesn't do what its documentation claims"

Suggestions; would it be more maintainable to:

Rewrite this with pytest-evals.

Write pytest tests with pytest.mark.parametrize, fixtures, and mocks. Push to >90% branch coverage with pytest-cov.

I don't think any of these benchmarks yet do model output evals for docs?:

> how to optimize an AGENTS.md:

> [agentevals, foundry-toolkit, ]