evidence review on post-deployment eval drift

Evaluation studies and field reports indicate that pre-release benchmark performance often diverges from production outcomes within weeks unless evals are actively recalibrated.


evidence stack

  • Query distribution drift degrades benchmark transfer.
  • Prompt and policy updates alter eval comparability.
  • Continuous sample refresh improves incident prediction.
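The first bullet can be made concrete with a drift statistic. A minimal sketch, assuming we compare a scalar query feature (here, synthetic query lengths) between the pre-release benchmark era and live traffic using the Population Stability Index; the threshold of 0.25 is the conventional rule of thumb, not a validated cutoff:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of a scalar
    query feature (e.g. token length). Higher = more drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# synthetic example: benchmark-era vs. shifted production query lengths
rng = np.random.default_rng(0)
bench = rng.normal(40, 10, 5000)  # pre-release query lengths
prod = rng.normal(55, 14, 5000)   # production distribution after drift
print(psi(bench, prod) > 0.25)    # conventional "significant drift" flag
```

In practice the same statistic can run over embedding-cluster frequencies rather than a single scalar feature; the point is that benchmark transfer degrades exactly when this number climbs.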

method boundary

Static benchmark sets cannot capture production behavior shifts on their own.
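One way around that boundary is continuous sample refresh: keep the eval set as a uniform reservoir sample over live traffic so it tracks production rather than release-time assumptions. A hypothetical sketch (class and method names are illustrative, not a real library API):

```python
import random

class RollingEvalSet:
    """Uniform reservoir sample of production queries, refreshed
    continuously so the eval set tracks live traffic.
    Illustrative helper, not a real library."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.seen = 0
        self.items: list[str] = []
        self.rng = random.Random(seed)

    def offer(self, query: str) -> None:
        """Standard reservoir sampling: each query seen so far has an
        equal chance of being in the current eval set."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(query)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = query

pool = RollingEvalSet(capacity=100)
for i in range(10_000):
    pool.offer(f"query-{i}")
print(len(pool.items))  # bounded at 100 regardless of traffic volume
```

Static benchmark items can still sit alongside this pool as a fixed regression anchor; the reservoir covers the shifting part of the distribution.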

my take

Evaluation is now an ongoing operations function, not a release checklist step.

linkage

  • [[eval driven deployment gates reduce regression churn]]
  • [[survey of safety classifier drift in production]]
  • [[agent governance dashboards become executive weekly ritual]]

ending questions

which post-deployment drift signal should trigger mandatory eval refresh?
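One candidate answer, as a toy gate: combine an input-side signal (query drift) with an output-side signal (eval-score delta against the release baseline) and refresh when either crosses a threshold. All thresholds here are illustrative assumptions, not validated values:

```python
def should_refresh(psi_score: float, eval_delta: float,
                   psi_threshold: float = 0.25,
                   delta_threshold: float = 0.05) -> bool:
    """Toy refresh gate: trigger on query-distribution drift (PSI)
    OR on an eval-score drop versus the release baseline.
    Thresholds are illustrative placeholders."""
    return psi_score > psi_threshold or eval_delta < -delta_threshold

print(should_refresh(0.31, -0.01))  # drift alone triggers
print(should_refresh(0.10, -0.08))  # score drop alone triggers
print(should_refresh(0.10, -0.01))  # neither signal fires
```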