benchmark synthesis for code generation in long horizon tasks
Long-horizon coding evaluations (e.g., SWE-bench) show models perform far better on local code completion than on multi-step planning, debugging loops, and constraint tracking over time.
see also: meta analysis on llm judge reliability across domains · structured output contracts reduce agent failure rates
evidence map
- Success rates fall sharply as task length and dependency count rise.
- Recovery quality after the first error predicts final completion better than initial generation quality does.
- Constraints on tool use (which shells, editors, and test runners the agent may invoke) strongly affect benchmark outcomes.
method boundary
Useful benchmarks need realistic repository context, failing tests, and noisy requirements rather than curated single-file prompts.
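One way to make that boundary concrete is a task schema that forces each instance to carry repository context, failing tests, and a noisy spec. A sketch with hypothetical field names, not any existing benchmark's format:

```python
from dataclasses import dataclass

@dataclass
class LongHorizonTask:
    repo_snapshot: str        # commit hash or path pinning the repository context
    failing_tests: list[str]  # test IDs that must pass on completion
    requirements: str         # deliberately noisy natural-language spec
    max_steps: int = 50       # budget that makes the horizon explicit

def is_solved(task: LongHorizonTask, passing_tests: set[str]) -> bool:
    """A task is solved only when every originally failing test now passes."""
    return set(task.failing_tests) <= passing_tests
```

Anchoring success to pre-registered failing tests keeps grading objective even when the requirements text is intentionally underspecified.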
my take
The frontier problem is not writing snippets. It is maintaining coherent intent over long execution arcs.
linkage
- [[meta analysis on llm judge reliability across domains]]
- [[structured output contracts reduce agent failure rates]]
- [[stateful agents gain safer rollback controls]]
ending questions
which long-horizon benchmark signal best predicts real developer productivity gains?