evidence review on post-deployment eval drift

Evaluation studies and field reports indicate that pre-release benchmark performance often diverges from production outcomes within weeks unless evals are actively recalibrated.


evidence stack

  • Query distribution drift degrades benchmark transfer.
  • Prompt and policy updates alter eval comparability.
  • Continuous sample refresh improves incident prediction.
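The first bullet can be made concrete with a drift statistic. A minimal sketch, assuming we compare a scalar query feature (here, synthetic query lengths) between the pre-release benchmark era and live traffic using the Population Stability Index; the threshold of 0.25 is the conventional rule of thumb, not a validated cutoff:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of a scalar
    query feature (e.g. token length). Higher = more drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# synthetic example: benchmark-era vs. shifted production query lengths
rng = np.random.default_rng(0)
bench = rng.normal(40, 10, 5000)  # pre-release query lengths
prod = rng.normal(55, 14, 5000)   # production distribution after drift
print(psi(bench, prod) > 0.25)    # conventional "significant drift" flag
```

In practice the same statistic can run over embedding-cluster frequencies rather than a single scalar feature; the point is that benchmark transfer degrades exactly when this number climbs.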

method boundary

Static benchmark sets cannot capture production behavior shifts on their own.
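One way around that boundary is continuous sample refresh: keep the eval set as a uniform reservoir sample over live traffic so it tracks production rather than release-time assumptions. A hypothetical sketch (class and method names are illustrative, not a real library API):

```python
import random

class RollingEvalSet:
    """Uniform reservoir sample of production queries, refreshed
    continuously so the eval set tracks live traffic.
    Illustrative helper, not a real library."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.seen = 0
        self.items: list[str] = []
        self.rng = random.Random(seed)

    def offer(self, query: str) -> None:
        """Standard reservoir sampling: each query seen so far has an
        equal chance of being in the current eval set."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(query)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = query

pool = RollingEvalSet(capacity=100)
for i in range(10_000):
    pool.offer(f"query-{i}")
print(len(pool.items))  # bounded at 100 regardless of traffic volume
```

Static benchmark items can still sit alongside this pool as a fixed regression anchor; the reservoir covers the shifting part of the distribution.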

my take

Evaluation is now an ongoing operations function, not a release checklist step.

linkage

  • [[eval driven deployment gates reduce regression churn]]
  • [[survey of safety classifier drift in production]]
  • [[agent governance dashboards become executive weekly ritual]]

ending questions

which post-deployment drift signal should trigger mandatory eval refresh?
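One candidate answer, as a toy gate: combine an input-side signal (query drift) with an output-side signal (eval-score delta against the release baseline) and refresh when either crosses a threshold. All thresholds here are illustrative assumptions, not validated values:

```python
def should_refresh(psi_score: float, eval_delta: float,
                   psi_threshold: float = 0.25,
                   delta_threshold: float = 0.05) -> bool:
    """Toy refresh gate: trigger on query-distribution drift (PSI)
    OR on an eval-score drop versus the release baseline.
    Thresholds are illustrative placeholders."""
    return psi_score > psi_threshold or eval_delta < -delta_threshold

print(should_refresh(0.31, -0.01))  # drift alone triggers
print(should_refresh(0.10, -0.08))  # score drop alone triggers
print(should_refresh(0.10, -0.01))  # neither signal fires
```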