meta-analysis of llm judge reliability across domains
Recent evaluation papers indicate that LLM-judge systems can approximate ranking quality in constrained settings, but that reliability drops under domain shift and changes in prompt framing (arXiv).
evidence stack
- Agreement with human raters varies widely by task type.
- Score variance increases under adversarial or ambiguous inputs.
- Calibration against gold sets materially improves consistency (see the agreement-gate sketch after this list).
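
A minimal sketch of that calibration check, assuming the judge emits categorical verdicts (here "pass"/"fail") and that a small gold-labeled slice exists per domain; `judge_is_calibrated`, the 0.6 threshold, and the toy data are illustrative placeholders, not a specific protocol.

```python
from collections import Counter

def cohens_kappa(judge_labels, gold_labels):
    """Chance-corrected agreement between judge verdicts and gold labels."""
    n = len(gold_labels)
    observed = sum(j == g for j, g in zip(judge_labels, gold_labels)) / n
    # Expected agreement if the judge matched gold only by chance,
    # given each side's marginal label frequencies.
    judge_freq = Counter(judge_labels)
    gold_freq = Counter(gold_labels)
    expected = sum(
        (judge_freq[label] / n) * (gold_freq[label] / n)
        for label in set(judge_labels) | set(gold_labels)
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def judge_is_calibrated(judge_labels, gold_labels, min_kappa=0.6):
    """Gate: only trust the judge on a domain whose gold slice clears the threshold."""
    return cohens_kappa(judge_labels, gold_labels) >= min_kappa

# Toy gold slice for one domain: kappa is ~0.67 here, so the gate passes.
gold  = ["pass", "fail", "pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "pass", "fail", "fail", "fail"]
print(cohens_kappa(judge, gold), judge_is_calibrated(judge, gold))
```

Chance-corrected agreement (kappa) rather than raw accuracy keeps a judge that collapses onto the majority label from looking calibrated.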
method boundary
LLM judges are strongest for triage and relative ranking, weakest for high-stakes absolute scoring without human oversight.
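
To make the relative-ranking mode concrete, a sketch under the assumption that the judge is only ever asked a pairwise question ("is A better than B?"); `judge_prefers` is a stand-in for an actual model call, and round-robin aggregation is one simple choice, not a prescribed method.

```python
from itertools import combinations

def judge_prefers(candidate_a: str, candidate_b: str) -> bool:
    """Stand-in for an LLM call that answers one narrow question:
    'is candidate A a better response than candidate B?'"""
    return len(candidate_a) > len(candidate_b)  # toy heuristic, not a real judge

def rank_by_pairwise_wins(candidates: list[str]) -> list[str]:
    """Relative ranking via round-robin pairwise comparisons: the judge never
    assigns an absolute score, only a preference, and the ordering falls out
    of the win counts."""
    wins = {c: 0 for c in candidates}  # assumes candidates are unique strings
    for a, b in combinations(candidates, 2):
        if judge_prefers(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

print(rank_by_pairwise_wins([
    "short answer",
    "a longer, more detailed answer",
    "mid-length answer",
]))
```

Running each pair in both orders and averaging is a common way to dampen position bias, at the cost of doubling judge calls.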
my take
LLM judging is a multiplier for evaluation throughput, not a replacement for accountable review.
linkage
- [[evidence review on retrieval eval methods in production]]
- [[ai safety evals move into procurement checklists]]
- [[enterprise rag failure modes cluster in stale corpora]]
ending questions
which calibration protocol most improves llm judge consistency across domain transitions?