meta-analysis on llm judge reliability across domains

Recent evaluation papers indicate that LLM-judge systems can approximate human ranking quality in constrained settings, but reliability degrades under domain shift and changes in prompt framing (arXiv).
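
One way to make the framing-sensitivity claim testable: re-run the same judge over paraphrased rubric framings and measure per-item score spread. A minimal sketch; `judge_score` and the framings are hypothetical stand-ins, not any paper's protocol:

```python
import statistics

# Toy stand-in so the sketch runs: hashes item+framing into a 1-5 score.
# A real version would call the judge model and parse its score.
def judge_score(item: str, rubric: str) -> float:
    return 1 + hash((item, rubric)) % 5

# Paraphrases of one rubric; the framing is the only thing that varies.
RUBRIC_FRAMINGS = [
    "Rate the answer's factual accuracy from 1 to 5.",
    "On a 1-5 scale, how factually accurate is this answer?",
    "Score factual accuracy (1 = mostly wrong, 5 = fully correct).",
]

def framing_spread(items):
    """Per-item standard deviation of judge scores across rubric framings.

    High spread on unchanged inputs is direct evidence of framing sensitivity.
    """
    return {
        item: statistics.stdev(judge_score(item, r) for r in RUBRIC_FRAMINGS)
        for item in items
    }

print(framing_spread(["candidate answer A", "candidate answer B"]))
```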

evidence stack

  • Agreement with human raters varies widely by task type.
  • Score variance increases under adversarial or ambiguous inputs.
  • Calibration with gold sets materially improves consistency (see the sketch after this list).
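
On the calibration bullet: a minimal sketch of rescaling raw judge scores against a small gold anchor set. The anchor values are invented for illustration, and the isotonic-regression choice is my assumption, not something the cited papers prescribe; it preserves the judge's ranking while correcting absolute offsets:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical anchor set, invented for illustration: raw judge scores
# paired with gold human scores on the same items.
judge_raw = np.array([4.5, 3.0, 4.0, 2.0, 4.8, 1.5, 3.5])
gold = np.array([4.0, 2.0, 3.0, 2.0, 5.0, 1.0, 3.0])

# Fit a monotonic map from the judge's scale to the gold scale.
# Monotonicity keeps the judge's relative ranking intact while
# correcting its absolute offsets and scale drift.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(judge_raw, gold)

# Apply to fresh judge scores before any absolute-threshold decision.
print(calibrator.predict(np.array([4.2, 2.5, 3.8])))
```

Re-fitting the calibrator per domain is the cheap version of the cross-domain protocol the ending question gestures at.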

method boundary

LLM judges are strongest for triage and relative ranking, weakest for high-stakes absolute scoring without human oversight.
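
Playing to that strength means posing comparisons pairwise and aggregating wins rather than asking for absolute scores. A minimal sketch; `judge_prefers` is a hypothetical stand-in for a real judge call, and win-rate is one simple aggregation (Bradley-Terry would be the heavier choice):

```python
from collections import defaultdict
from itertools import combinations

# Toy stand-in so the sketch runs: prefers the longer answer. A real
# version would wrap an LLM prompt ("Which answer is better, A or B?").
def judge_prefers(a: str, b: str) -> bool:
    return len(a) >= len(b)

def rank_by_winrate(answers):
    """Rank candidate answers by pairwise win rate.

    Each pair is judged in both presentation orders to average out
    position bias, a known failure mode of pairwise LLM judging.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for a, b in combinations(answers, 2):
        for first, second in ((a, b), (b, a)):
            winner = first if judge_prefers(first, second) else second
            wins[winner] += 1
            games[a] += 1
            games[b] += 1
    return sorted(
        ((ans, wins[ans] / games[ans]) for ans in answers),
        key=lambda pair: pair[1],
        reverse=True,
    )

print(rank_by_winrate(["short", "a medium answer", "the longest candidate answer"]))
```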

my take

LLM judging is a multiplier for evaluation throughput, not a replacement for accountable review.

linkage

  • [[evidence review on retrieval eval methods in production]]
  • [[ai safety evals move into procurement checklists]]
  • [[enterprise rag failure modes cluster in stale corpora]]

ending questions

which calibration protocol most improves llm judge consistency across domain transitions?