evidence review on post deployment eval drift
Emerging evaluation studies and field reports indicate that, without active recalibration, pre-release benchmark performance often diverges from production outcomes within weeks of deployment.
evidence stack
- Query distribution drift degrades benchmark transfer: production traffic moves away from the distribution the benchmark sampled, so benchmark scores stop predicting live behavior (drift check sketched after this list).
- Prompt and policy updates alter eval comparability: scores before and after a system-prompt or policy change no longer measure the same effective task.
- Continuous sample refresh improves incident prediction: evals that fold in recent production samples catch regressions that static sets miss (refresh sketch under method boundary).
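
A minimal sketch of the drift check behind the first bullet, assuming production queries are logged and reduced to a scalar feature such as prompt length. The feature choice, sample sizes, and the 0.25 threshold are illustrative assumptions, not values the cited reports specify.

```python
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """Population stability index between two samples of a scalar query
    feature. Bin edges come from the expected (benchmark-era) sample;
    eps avoids log(0) on empty bins."""
    eps = 1e-6
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    o_frac = np.histogram(observed, bins=edges)[0] / len(observed) + eps
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

# hypothetical feature: prompt length in tokens, benchmark era vs. live traffic
benchmark_lengths = np.random.default_rng(0).normal(180, 40, 5_000)
production_lengths = np.random.default_rng(1).normal(240, 60, 5_000)

score = psi(benchmark_lengths, production_lengths)
# 0.25 is a common rule-of-thumb cutoff for substantial drift, not a standard
print(f"PSI = {score:.3f}", "-> drift" if score > 0.25 else "-> stable")
```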
method boundary
Static benchmark sets cannot capture production behavior shifts on their own.
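
To make the boundary concrete, a sketch of the continuous-refresh idea from the third evidence bullet: an eval set that keeps a static benchmark core but mixes in a rolling window of recent production samples. The class name, window size, and 50/50 mixing ratio are hypothetical choices for illustration.

```python
import random
from collections import deque

class RollingEvalSet:
    """Eval set with a fixed release-time benchmark core plus a rolling
    window of recent production samples, so evaluation tracks the live
    query distribution instead of freezing at release."""

    def __init__(self, core_items, window_size=500):
        self.core = list(core_items)            # static benchmark core
        self.fresh = deque(maxlen=window_size)  # recent production samples

    def ingest(self, production_item):
        # called from the serving path (sampled, anonymized, labeled later)
        self.fresh.append(production_item)

    def sample(self, n, fresh_ratio=0.5):
        # mix fresh and core items; the ratio is a tunable assumption,
        # and the core is assumed to hold at least n items
        n_fresh = min(int(n * fresh_ratio), len(self.fresh))
        picks = random.sample(list(self.fresh), n_fresh)
        picks += random.sample(self.core, n - n_fresh)
        return picks
```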
my take
Evaluation is now an ongoing operations function, not a release checklist step.
linkage
- [[eval driven deployment gates reduce regression churn]]
- [[survey of safety classifier drift in production]]
- [[agent governance dashboards become executive weekly ritual]]
ending questions
which post-deployment drift signal should trigger mandatory eval refresh?
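
One candidate answer, sketched under the same assumptions as the PSI check above: sustained drift over consecutive windows, or any prompt/policy version change (which breaks comparability regardless of traffic drift). Threshold and window count are placeholders to tune per deployment.

```python
def refresh_required(psi_history, policy_changed, threshold=0.25, k=3):
    """Mandatory eval refresh when the drift metric stays above the
    threshold for k consecutive windows, or immediately after any
    prompt/policy version change."""
    sustained_drift = len(psi_history) >= k and all(
        p > threshold for p in psi_history[-k:]
    )
    return sustained_drift or policy_changed
```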