benchmark synthesis for code generation in long horizon tasks

Long-horizon coding evaluations show that models perform far better on local code completion than on multi-step planning, debugging loops, and tracking constraints over time (see SWE-bench).

see also: meta analysis on llm judge reliability across domains · structured output contracts reduce agent failure rates

evidence map

  • Success rates fall sharply as task length and dependency count rise.
  • Recovery quality after the first error predicts final completion better than initial generation quality does.
  • Tool-use constraints strongly affect benchmark outcomes.
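The first two signals above can be sketched as simple aggregations over per-task records. This is a minimal illustrative sketch, not any real benchmark's schema; the field names (`deps`, `first_error`, `recovered`, `done`) are assumptions.

```python
# Hypothetical sketch: aggregating long-horizon benchmark records to surface
# (a) success-rate decay as dependency count rises and (b) completion rate
# conditioned on recovering from the first error. All field names are
# illustrative, not drawn from any real benchmark.
from dataclasses import dataclass
from collections import defaultdict


@dataclass
class Record:
    task_len: int      # number of steps in the task
    deps: int          # cross-file / cross-step dependency count
    first_error: bool  # did the agent make an early mistake?
    recovered: bool    # did it repair that mistake?
    done: bool         # did the task ultimately pass its tests?


def success_by_deps(records):
    """Success rate bucketed by dependency count."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r.deps].append(r.done)
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}


def completion_given_recovery(records):
    """Completion rates for tasks with an early error, split by recovery."""
    errored = [r for r in records if r.first_error]
    recovered = [r for r in errored if r.recovered]
    not_recovered = [r for r in errored if not r.recovered]
    rate = lambda xs: sum(r.done for r in xs) / len(xs) if xs else 0.0
    return rate(recovered), rate(not_recovered)
```

If the evidence holds, `completion_given_recovery` should show a large gap between the two rates even when `success_by_deps` looks similar across models.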

method boundary

Useful benchmarks need realistic repository context, failing tests, and noisy requirements rather than curated single-file prompts.
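Those ingredients could be made concrete as a task specification. A minimal sketch, assuming hypothetical field names; no real benchmark defines tasks exactly this way.

```python
# Hypothetical long-horizon task spec with the ingredients named above:
# realistic repo context, failing tests, and noisy requirements. All names
# here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class LongHorizonTask:
    repo_snapshot: str                 # path to a realistic repository checkout
    failing_tests: list                # test IDs the agent must make pass
    requirements: str                  # issue text, possibly noisy or contradictory
    tool_constraints: list = field(default_factory=list)  # e.g. allowed CLIs

    def is_solved(self, passing_tests):
        """Solved only when every originally failing test now passes."""
        return all(t in passing_tests for t in self.failing_tests)
```

Scoring against the originally failing tests, rather than against generated diffs, keeps the benchmark anchored to observable behavior instead of curated single-file outputs.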

my take

The frontier problem is not writing snippets. It is maintaining coherent intent over long execution arcs.

linkage

  • [[meta analysis on llm judge reliability across domains]]
  • [[structured output contracts reduce agent failure rates]]
  • [[stateful agents gain safer rollback controls]]

ending questions

which long-horizon benchmark signal best predicts real developer productivity gains?