"A benchmark for catching when code doesn't do what its documentation claims" github.com 3 points by o2zer0cool 3 days ago
westurner 3 days ago
Suggestions: would it be more maintainable to:
Rewrite this with pytest-evals.
Write pytest tests with pytest.mark.parametrize, fixtures, and mocks. Push to >90% branch coverage with pytest-cov.
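For illustration, a minimal sketch of what the parametrize/fixture/mock suggestion could look like. Everything here is hypothetical and not taken from the linked repo: `score_case`, `CASES`, and `fake_judge` are made-up names, and the mock stands in for whatever model call the benchmark actually makes.

```python
from unittest.mock import Mock

import pytest


def score_case(judge, code: str, doc: str, expected_mismatch: bool) -> bool:
    """Return True when the judge's verdict agrees with the labelled answer.

    `judge` is any callable taking `code` and `doc` keyword arguments and
    returning True if it thinks the code contradicts the documentation.
    """
    verdict = judge(code=code, doc=doc)
    return verdict == expected_mismatch


# Hypothetical labelled cases: (code, documentation, does the doc lie?)
CASES = [
    ("def add(a, b): return a + b", "Returns the sum of a and b.", False),
    ("def add(a, b): return a - b", "Returns the sum of a and b.", True),
]


@pytest.fixture
def fake_judge():
    # Mock replaces the real model call so the test is fast and deterministic.
    return Mock(side_effect=lambda code, doc: "a - b" in code)


@pytest.mark.parametrize("code, doc, expected_mismatch", CASES)
def test_score_case(fake_judge, code, doc, expected_mismatch):
    assert score_case(fake_judge, code, doc, expected_mismatch)
    # The judge should be called exactly once per case, with both inputs.
    fake_judge.assert_called_once_with(code=code, doc=doc)
```

With pytest-cov installed, something like `pytest --cov --cov-branch --cov-fail-under=90` would enforce the branch coverage threshold.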
.
I don't think any of these benchmarks do model output evals for docs yet:
Mcpbr > Supported Benchmarks: https://github.com/supermodeltools/mcpbr#supported-benchmark...
.
On subjectivity and language, there was also this the other day: https://github.com/mozilla/firefox-devtools-mcp/pull/90#issu... :
> how to optimize an AGENTS.md:
> [agentevals, foundry-toolkit, ]