Zylos
2026-04-10

LLM-as-Judge in Production: Agent Reasoning Verification, Self-Correction, and Hallucination Defense (2026)

llm-judge, ai-agents, verification, self-correction, hallucination, evaluation, guardrails, production, reflexion, constitutional-ai

Executive Summary

  • LLM-as-judge has crossed from evaluation harness territory into load-bearing production infrastructure: more than half of surveyed production agent teams now rely on judge LLMs at runtime for quality gating, hallucination defense, and tool-call verification.
  • Six distinct patterns exist — offline eval, online runtime verifier, self-consistency loops, Reflexion/reflection, constitutional AI/RLAIF, and inference-time reward models — each with different latency budgets, cost profiles, and failure modes.
  • The field has bifurcated into large proprietary judges (GPT-4o, Claude 3.7 Sonnet) for high-stakes verification and small distilled judges (Galileo Luna-2 at 3B–8B, Prometheus 2 at 7B, Patronus Lynx at 8B) for high-throughput inline checking — with the small models often delivering 97% cost reduction at 0.88–0.95 accuracy.
  • Intrinsic self-correction is unreliable: a body of research through 2024–2025 consistently shows that prompting an LLM to "check your work" without external grounding degrades performance on reasoning tasks. Self-correction only reliably helps when grounded in external feedback (unit test results, retrieval verification, tool-output comparison).
  • Production teams should instrument judge checks at three boundaries: before user-facing output, before irreversible tool execution, and on writes to persistent memory — skipping inline judging on every intermediate reasoning step to manage cost.

Why Judge LLMs Became Load-Bearing in 2026

The core insight driving adoption is asymmetry: classifying content is simpler than generating it. A model that struggles to produce a perfectly factual answer can still reliably detect when an answer contradicts a retrieved document. This capability gap makes LLM judges practical even when using smaller, cheaper models as verifiers.

Three pressures converged to push judge LLMs from eval suites into production loops:

Scale of agentic deployments. The LangChain State of Agent Engineering survey found over 57% of respondents had agents in production by early 2026, up from single digits two years prior. As agent pipelines grew multi-step — planning, tool calls, memory reads, sub-agent delegation — the surface area for quality failures expanded faster than human review capacity could track.

RAG hallucination as a concrete problem. RAG systems exposed a specific, measurable failure mode: models synthesizing responses that contradicted their own retrieved context. Judge LLMs provided an affordable inline check — verify that the generated answer is entailed by the source documents — that rule-based approaches couldn't match on coverage.

Cost collapse of capable small models. The release of distilled judge models (Prometheus 2 7B, Galileo Luna-2 3B/8B, Patronus Lynx 8B) made inline verification economically viable at scale. Galileo's Luna-2 achieves 0.88–0.95 accuracy on agentic evaluation tasks with a 97% cost reduction versus GPT-4-based evaluation. At this price point, judging every agent output before delivery became feasible.


Taxonomy: Six Distinct Patterns

1. Offline Evaluation Harnesses

The original use case: batch evaluation of model outputs against ground truth or rubric, run during development, CI/CD gates, or nightly regression checks. Tools like Braintrust (raised $80M Series B, $800M valuation), LangSmith, and Arize Phoenix dominate this space. The judge runs asynchronously — latency is irrelevant, cost is bounded per-run.

Key characteristic: No user is waiting. You can afford large, expensive judges and human review on sampled outputs.

2. Online Runtime Verifiers

The judge runs synchronously in the production request path, blocking delivery until it approves the output. This is the pattern Amazon Prime Video uses (independent LLM evaluates outputs from the analysis agent before returning to users) and what Microsoft Bing implements to verify search-grounded responses against retrieved web content.

Key characteristic: Latency is real. A 76–162ms overhead (as measured by HaluGate's token-level hallucination pipeline) is acceptable; a multi-second GPT-4 call on every output is not. This pattern pushes teams toward small, fast judge models.

3. Self-Consistency and Self-Critique Loops

Rather than a separate judge model, the actor generates multiple candidate outputs, then uses itself (or a same-family model) to select the best one. Best-of-N sampling with a reward model ranker is the standard form. Majority voting across N samples (without a ranker) is the cheapest variant.

The canonical finding: "by sampling multiple answers and heuristically aggregating their answers (e.g., through majority voting or using verifiers to rank the answers), one can achieve consistent performance gains in math domains." The gains are most reliable in formal domains (math, code) where correctness is checkable, and weakest in open-ended generation.
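The majority-voting variant is small enough to sketch; `majority_vote` below is an illustrative helper (not from any cited paper) that also returns the winning vote share as a rough confidence signal:

```python
from collections import Counter

def majority_vote(samples: list[str]) -> tuple[str, float]:
    """Pick the most frequent answer among N samples; the winning
    vote share doubles as a rough confidence signal."""
    counts = Counter(s.strip() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# e.g. majority_vote(["42", "42", "41", "42", "39"]) -> ("42", 0.6)
```

Low vote share is itself a useful signal: many teams route low-agreement answers to a stronger judge or a human rather than delivering them.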

4. Reflexion / Reflection Patterns

Introduced in the NeurIPS 2023 Reflexion paper (Shinn et al.), this pattern has a language agent verbally reflect on task feedback, store reflections in an episodic memory buffer, and use them in subsequent trials — reinforcement learning through text rather than weight updates. Reflexion achieved 91% pass@1 on HumanEval (vs. GPT-4's 80% baseline) and showed 22% absolute improvement on AlfWorld decision tasks over 12 iterative steps.

By 2025–2026, Reflexion has evolved into richer variants: Language Agent Tree Search (LATS) combines Monte Carlo tree search with reflection; Process Reward Models (PRMs) verify each reasoning step rather than final output. Critically, the verbal memory mechanism is far cheaper than fine-tuning — it requires only extended context, not gradient updates.

5. Constitutional AI / RLAIF Judges

Anthropic's Constitutional AI (2022) is the foundational paper, but its impact in 2026 is felt through its production descendants. The training-time pattern: an AI model critiques draft responses against a constitution of principles, and those AI-generated preferences replace (or augment) human labels. At inference time, the same pattern becomes a runtime guardrail: generate, critique against principles, revise.

RLAIF extends this to reward modeling — the judge LLM produces preference labels at scale that train reward models, which then guide further generation. Claude's published constitution (May 2023 version) is the most transparent public example; production constitutions are unpublished.

6. Reward Models in Inference-Time Search

The 2025 wave of "inference-time scaling" research established that allocating more compute at inference (via search guided by reward models) can outperform parameter scaling on reasoning tasks. Two mechanisms dominate:

  • Outcome Reward Models (ORMs): score complete answers; best-of-N with ORM selection
  • Process Reward Models (PRMs): score each reasoning step; enable tree search with judge pruning

The key paper (Snell et al., ICLR 2025): "Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters." PRMs address ORM's sparse-signal limitation by providing dense, step-by-step feedback, but require much more expensive training data (step-level human or model annotations).


The 2026 Landscape

Anthropic

Constitutional AI remains the published foundation. Claude 3.7 Sonnet is widely used as a production judge by third-party teams, and Claude (as Claude 3.5 Sonnet) appears among the frontier judges in Google's FACTS Grounding benchmark alongside Gemini 1.5 Pro and GPT-4o. Anthropic's alignment team published a debate-and-scalable-oversight research program (NeurIPS 2024), finding that debate outperforms consultancy across all task types when the consultant argues for random sides, but that results against direct QA are task-dependent.

OpenAI

o1 and o3 models incorporate inference-time chain-of-thought verification as a core architectural feature — the "thinking" tokens function as an internal self-verification pass before the final answer is emitted. This is the most mainstream deployment of pattern #6 (reward models in inference-time search) in a consumer product.

Google DeepMind

FACTS Grounding (December 2024) established a benchmark for measuring factual grounding quality, with Gemini 1.5 Pro as one of the frontier judges. Google's production systems use LLM judges for search-grounded response verification, with public acknowledgment that the judge approach is part of their hallucination mitigation stack.

Meta

Llama Guard 3 (based on Llama 3.1/3.2 fine-tunes) is the most widely deployed open-source content moderation judge. It operates in 8 languages, handles both input classification (prompt safety) and output classification (response safety), and was extended to multimodal inputs with Llama Guard 3 Vision. A startup case study reported cutting harmful posts by 40% using Llama Guard 3 for forum moderation. The 8B variant shows 15% false-positive reduction versus Llama Guard 2 on Meta's internal benchmarks.

Open-Source Judge Models

| Model | Size | Training | Strengths |
|---|---|---|---|
| Prometheus 2 | 7B / 8x7B | Distilled from GPT-4 labels | Custom rubric evaluation; both pairwise and direct assessment; 72–85% human agreement |
| JudgeLM | 7B / 13B / 33B | 100K GPT-4-judged samples | Bias mitigation (swap augmentation, reference drop); ICLR 2025 Spotlight |
| PandaLM | 7B | Human + model labels | Strong in specialized domains (legal, biomedical) |
| Patronus Lynx | 8B / 70B | Fine-tuned Llama 3 | Hallucination detection specifically; 8B beats GPT-3.5 by 24.5% on HaluBench |
| M-Prometheus | 7B+ | Multilingual extension | Non-English evaluation |

Eval Platforms

Braintrust: Experiment-first approach; $800M valuation after Series B. Strongest for prompt iteration and A/B comparison.

LangSmith: Deep LangChain/LangGraph integration; de facto choice for teams in that ecosystem; automatic instrumentation.

Arize Phoenix: Apache 2.0 open-source for local tracing; Arize AX for enterprise production monitoring. Strong in ML + LLM unified workflows.

W&B Weave: Near-zero integration friction for teams already on Weights & Biases; full observability stack.

Galileo: Luna-2 (3B/8B fine-tuned Llama) is the cost-optimized judge backbone. Best for high-volume production evals where GPT-4 cost is prohibitive.


Architectural Patterns in Production

Verifier-in-the-Loop

The canonical production pattern: actor generates → judge verifies → block or pass. Implementation variations:

  • Serial: actor → judge → user (adds judge latency to p99)
  • Speculative: begin streaming output to user, run judge in parallel, cancel/revise if judge rejects (complex but minimizes perceived latency)
  • Batched async: judge runs on samples post-delivery; results feed back into prompt iteration rather than blocking individual requests
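As a minimal sketch of the serial variant, assuming hypothetical `generate` and `judge` callables standing in for the actor and judge model calls:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    passed: bool
    reason: str

def serial_gate(prompt: str,
                generate: Callable[[str], str],
                judge: Callable[[str, str], Verdict],
                max_retries: int = 1,
                fallback: str = "I can't answer that reliably.") -> str:
    """Actor output is blocked until the judge approves it; after
    max_retries rejected drafts, return a safe fallback instead."""
    for _ in range(max_retries + 1):
        draft = generate(prompt)
        if judge(prompt, draft).passed:
            return draft
    return fallback
```

The retry-then-fallback shape matters: an unbounded regenerate loop turns a judge rejection into a latency and cost incident.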

Ramp's production deployment uses a shadow mode variant: the judge compares agent predictions to human actions without blocking, accumulating accuracy metrics until the agent hits a threshold, at which point live gating is enabled.

Parallel Sampling + Best-of-N

Generate N candidate outputs simultaneously, score all with reward model, return the top-ranked. Cox Automotive's production implementation tracks relevancy, completeness, and tone per candidate. The win rate over greedy decoding is largest in formal domains (code: +15–20pp); in open-ended generation the gains are smaller and come with 3–5x compute cost.
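A best-of-N ranker reduces to generate-then-argmax; the rubric dictionary below (criteria as scoring callables) is a hypothetical stand-in for a reward model, not Cox Automotive's actual implementation:

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score_fns: dict[str, Callable[[str, str], float]],
              n: int = 4) -> str:
    """Generate n candidates, score each against every rubric criterion
    (e.g. relevancy, completeness, tone), return the top-ranked one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates,
               key=lambda c: sum(fn(prompt, c) for fn in score_fns.values()))
```

In production the `generate` calls run in parallel, so wall-clock latency is one generation plus one scoring pass, while compute cost is the full N-fold multiple.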

Tree-of-Thought with Judge Pruning

Extend best-of-N to a search tree: generate multiple intermediate reasoning steps, prune low-scoring branches (via PRM) early, expand promising ones. Language Agent Tree Search (LATS) is the canonical 2025 implementation. Cost is high — O(beam_width × depth) judge calls — so this pattern is reserved for high-value tasks (complex code generation, multi-step research).
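A judge-pruned search can be sketched as plain beam search over reasoning steps, with hypothetical `expand` (proposes candidate next steps) and `prm_score` (step-level judge) callables:

```python
def beam_search(prompt, expand, prm_score,
                beam_width: int = 3, depth: int = 4) -> str:
    """Judge-pruned tree search: expand each partial reasoning chain,
    score every candidate step with a process reward model, and keep
    only the top beam_width chains at each depth."""
    beams = [("", 0.0)]  # (partial chain, cumulative PRM score)
    for _ in range(depth):
        candidates = [
            (chain + step, score + prm_score(prompt, chain, step))
            for chain, score in beams
            for step in expand(prompt, chain)
        ]
        # prune low-scoring branches early; judge calls scale with
        # beam_width x branching factor x depth
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```
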

Debate and Cross-Model Verification

Two AI models argue opposing positions; a judge (human or LLM) evaluates the debate. The Kenton et al. NeurIPS 2024 paper (Anthropic/DeepMind collaboration) found debate outperforms consultancy across all tasks, especially extractive QA with information asymmetry. In production, a weaker version of this appears as ensemble disagreement detection: run two differently-prompted or differently-modeled generations; when they disagree significantly, flag for human review.


Failure Modes

Positional Bias

Judges systematically favor responses appearing in certain positions (first or last in a comparison). A 2025 IJCNLP study ("Judging the Judges") found position bias is strongly modulated by the quality gap between candidates — when candidates are similarly capable, position bias is largest. The standard mitigation is double evaluation with order swap and aggregation. The accuracy shift from position alone can exceed 10% in code evaluation tasks.
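The order-swap mitigation is mechanical; a minimal sketch, assuming a pairwise `judge` callable that returns "A", "B", or "tie":

```python
def debiased_pairwise(judge, prompt: str, a: str, b: str) -> str:
    """Run the pairwise judge twice with candidate order swapped.
    Keep the verdict only when both orders agree on the same candidate;
    otherwise report a tie rather than a position artifact."""
    first = judge(prompt, a, b)
    second = judge(prompt, b, a)          # positions swapped
    flipped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == flipped else "tie"
```

A judge that always prefers the first position produces "A" in both runs, which the swap exposes as a tie; only content-driven preferences survive the aggregation.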

Verbosity Bias

Judges prefer longer, more formal responses regardless of substantive quality, an artifact of generative pretraining on data where longer answers correlate with quality. The "Justice or Prejudice?" framework (CALM) identified verbosity as one of 12 distinct bias types. Mitigation: explicitly score conciseness as a rubric criterion; penalize length inflation in the evaluation prompt.

Self-Preference / Same-Family Bias

When the judge model comes from the same training family as the actor model, agreement rates inflate by 5–7% versus cross-family judging. The judge has absorbed similar stylistic preferences and is less likely to penalize outputs that match its own generation style. Mitigation: use cross-family judges for high-stakes evaluation; the "LLM jury" pattern (majority vote across 3–5 models from different families) reduces this bias 30–40% at 3–5x cost.

Judge Hallucinations

Judges can themselves hallucinate — fabricating evaluation rationales, citing non-existent rubric criteria, or confidently scoring outputs that violate unstated assumptions. This is especially prevalent when the judge is asked to evaluate factual claims in domains outside its training distribution. Mitigation: structure judge outputs with constrained schemas (JSON with explicit field validation); use chain-of-thought to expose reasoning before the verdict.
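A minimal sketch of schema-constrained judge parsing, assuming the judge was prompted to emit JSON with `verdict`, `score`, and `reasoning` fields (the field names are illustrative):

```python
import json

def parse_judge_output(raw: str) -> dict:
    """Reject judge responses that drift from the agreed schema instead
    of trusting a free-form rationale or an out-of-range score."""
    data = json.loads(raw)
    if data.get("verdict") not in ("PASS", "FAIL"):
        raise ValueError("verdict must be PASS or FAIL")
    score = data.get("score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 10:
        raise ValueError("score must be a number in [0, 10]")
    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")
    return data
```

A `ValueError` here feeds the retry path rather than the verdict path: a malformed judge response should never be interpreted as either PASS or FAIL.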

Calibration Drift

Judge models calibrate against their training distribution. As the actor model is fine-tuned or as the production data distribution shifts, the judge's internal calibration drifts — what was a 7/10 in January may be a 5/10 by June. Mitigation: maintain a held-out golden dataset; re-validate judge calibration against human labels on a monthly cadence.

Cost Explosion

A naive "judge everything" approach applied to a high-volume production system using GPT-4-class judges is cost-prohibitive. One study found evaluation costs spanning 175× across models, from $0.45 to $78.96 per 1K evaluations; at 1M evaluations/month, that spread works out to roughly $940,000/year. Additionally, low evaluation completion rates (ECR@1), where models fail to return structured judge output, create retry overhead; one model at 85.4% ECR@1 translates to roughly $1,200/year in retries alone at 1M evals/month.


Cost & Latency Engineering

The Small Judge Principle

The key insight from Galileo's Luna-2: a 3B–8B model fine-tuned specifically for evaluation tasks outperforms a 175B general model on those tasks. Luna-2 achieves 0.88–0.95 accuracy on agentic evaluation at 97% lower cost than GPT-4-based evaluation. Prometheus 2 (7B) retains at least 80% of the evaluation performance of Prometheus 2 (8x7B) while requiring only 16GB of VRAM.

Recommended approach: use distilled judges (Lynx 8B, Prometheus 7B, Luna-2) for high-frequency inline checks; reserve large judges (GPT-4o, Claude 3.7 Sonnet) for low-frequency high-stakes decisions and for generating the training data to distill the next generation of small judges.

Strategic Placement

Don't judge every token — judge at boundaries:

| Boundary | Judge type | Rationale |
|---|---|---|
| Before user-facing output | Fast inline judge (≤8B) | Visible to user; latency budget 50–200ms |
| Before irreversible tool calls (write, send, deploy) | Thorough judge (may use large model) | Cost of error is high; extra latency acceptable |
| On memory writes | Lightweight factuality check | Bad memory compounds across future sessions |
| On tool-call return (before acting on result) | Consistency check | Detect tool hallucinations / unexpected outputs |

Caching Judge Outputs

For repeated patterns (same document + same question class), cache judge verdicts with a semantic similarity threshold. This requires careful invalidation logic — a cached "no hallucination" verdict on a document becomes invalid if the document is updated.
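A sketch of the caching idea; for simplicity this version keys on an exact content hash of the document rather than a semantic similarity threshold, so any document edit invalidates its cached verdicts automatically:

```python
import hashlib

class JudgeCache:
    """Cache judge verdicts keyed on (document content hash, question
    class). Editing a document changes its hash, which invalidates all
    cached verdicts for it without explicit invalidation logic."""

    def __init__(self) -> None:
        self._cache: dict[tuple[str, str], bool] = {}

    @staticmethod
    def doc_key(document: str) -> str:
        return hashlib.sha256(document.encode()).hexdigest()

    def get(self, document: str, question_class: str):
        return self._cache.get((self.doc_key(document), question_class))

    def put(self, document: str, question_class: str, verdict: bool) -> None:
        self._cache[(self.doc_key(document), question_class)] = verdict
```

A semantic-similarity variant replaces the exact hash with an embedding nearest-neighbor lookup, trading some invalidation safety for a higher cache hit rate.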

Sampling Strategies

Not every production request needs a judge call. Effective strategies:

  • Uncertainty sampling: only judge when actor model's log-probability of its output is below a threshold
  • Stratified sampling: judge 100% of a small random sample + 100% of flagged outliers (long outputs, low-confidence generations, tool-call outputs)
  • CI/CD gating: run full judge suite on every PR; run lightweight inline judge in production; reconcile gaps nightly
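Uncertainty sampling is a one-line threshold check in practice; the `-0.5` default below is an illustrative tuning value, not a recommendation:

```python
def should_judge(token_logprobs: list[float], threshold: float = -0.5) -> bool:
    """Invoke the judge only when the actor's mean token log-probability
    falls below the threshold, i.e. the actor was unsure of its output."""
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return mean_lp < threshold
```

The threshold should be calibrated offline: sweep it against a labeled sample and pick the value where judged traffic covers the large majority of actual failures.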

Hallucination Defense Stack

A production hallucination defense stack typically layers three mechanisms:

Layer 1: Retrieval Grounding

Ensure the actor only generates claims that can be grounded in retrieved documents. This is a retrieval design problem, not purely a judge problem — but it sets the baseline for what the judge can verify.

Layer 2: Faithfulness Checking (Judge Layer)

A specialized judge (Patronus Lynx, MiniCheck, NLI classifier) verifies that each claim in the generated output is entailed by the source documents. MiniCheck (Flan-T5-based binary classifier) provides sentence-level entailment checking at very low cost. Lynx-8B provides chain-of-thought reasoning about hallucination types, at a higher but still affordable cost.

The key reference: HaluGate (vLLM blog, December 2025) demonstrates token-level hallucination detection with 76–162ms total overhead, negligible compared to typical LLM generation times of 5–30 seconds.

Layer 3: Citation Verification

For outputs that cite specific sources, verify that cited sources actually say what the output claims. Google's FACTS Grounding benchmark (December 2024) formalized this evaluation approach, using frontier LLM judges (Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet) to evaluate factual grounding accuracy.

Hybrid pattern: RAG retrieval + faithfulness judge has proven more reliable than either approach alone. Standard RAG without a judge does not prevent hallucination — models still fabricate citations, reference non-existent code locations, or conflate memorized patterns with retrieved content.


Self-Correction: When It Helps, When It Hurts

The research consensus as of 2025 is clear and somewhat counterintuitive: intrinsic self-correction — prompting a model to review and revise its own output without external grounding — does not reliably improve performance and often degrades it.

The most cited finding comes from a Google DeepMind paper (Huang et al., ICLR 2024): "Large Language Models Cannot Self-Correct Reasoning Yet." The core mechanism: the quality of self-generated feedback is bounded by the model's existing knowledge and abilities. Internal feedback offers no advantage over the original generation if the model already couldn't produce correct output; it may steer the model away from a correct answer it happened to produce.

A 2025 ACL paper ("Understanding the Dark Side of LLMs' Intrinsic Self-Correction") quantified the failure modes — intrinsic self-correction degraded performance on arithmetic reasoning, closed-book QA, and code generation tasks.

When self-correction does work:

  1. With external execution feedback: CRITIC (ICLR 2024) shows models can reliably self-correct code when given unit test execution results. The judge here is the test runner, not the model itself.
  2. With retrieval verification: comparing a generated claim against a freshly retrieved document provides genuine external signal the model can act on.
  3. In Reflexion's episodic memory pattern: the key is that reflections are grounded in concrete task feedback (did the action succeed?), not the model's self-assessment of whether the action was good.
  4. With fine-tuning: SuperCorrect (ICLR 2025) shows that fine-tuning with thought templates that include explicit self-correction steps enables reliable correction in small models — but this is a training-time intervention, not pure inference-time prompting.

The practical rule: replace "let me check my work" prompts with structured external verification. Run the code. Retrieve the document. Compare tool output to expectation. Only loop back to the model when you have a concrete external signal to provide.


Tooling & OSS Frameworks

| Tool | Category | Key Feature | Production Status |
|---|---|---|---|
| DSPy Assertions | Guardrails framework | Hard/soft constraints with self-refinement retry; 164% improvement in rule adherence in tests | Widely used in 2025 |
| Outlines | Constrained generation | JSON/regex-constrained decoding; prevents structurally invalid outputs before a judge is needed | Production stable |
| Guidance | Constrained generation | Template-based interleaving of generation and constraints | Active development |
| Patronus Lynx | Hallucination judge | Fine-tuned Llama 3 8B/70B; 8B beats GPT-3.5 by 24.5% on HaluBench | Production ready |
| Galileo Luna-2 | Agentic eval judge | 3B/8B Llama fine-tune; 97% cost reduction vs GPT-4; 0.88–0.95 accuracy | Production ready |
| Prometheus 2 | Custom rubric judge | 7B/8x7B; distilled from GPT-4; pairwise + direct assessment | Production ready |
| JudgeLM | Scalable judge | 7B–33B; swap augmentation for bias mitigation; ICLR 2025 Spotlight | Research + production |
| Llama Guard 3 | Content safety judge | Meta; 8B; 8 languages; input + output classification | Production ready |
| LangSmith | Observability + eval | LangChain-native tracing + judge pipeline | Production standard |
| Braintrust | Offline eval | Experiment framework; A/B comparison at scale | Production standard |
| Arize Phoenix | OSS observability | Apache 2.0; local tracing; Arize AX for enterprise | Production ready |
| W&B Weave | Observability | Unified ML + LLM monitoring | Production ready |
| Langfuse | OSS observability | LLM-as-a-judge evaluation tracing; all features now open | Growing adoption |
| MLflow | Eval framework | LLM evaluation integrated into MLOps platform | Production standard |

DSPy's assertion system deserves special note: it introduces two tiers of constraints. dspy.Assert (hard) halts the pipeline if violated after max retries. dspy.Suggest (soft) triggers self-refinement but allows continuation. In four text-generation tests, DSPy assertions improved rule adherence by up to 164% and response quality by up to 37%. This pattern effectively turns guardrails into a lightweight judge-and-retry loop without requiring a separate judge model.


Practical Recipes for Production Agents

Recipe 1: RAG Faithfulness Gate

1. Retrieve documents
2. Generate answer
3. Run Lynx-8B / MiniCheck: is answer entailed by retrieved docs?
4. If FAIL: log + either retry with stronger grounding prompt or return "I don't have enough information"
5. If PASS: deliver to user

Cost profile: ~$0.001 per check with Lynx-8B. Latency overhead: ~100–200ms. Recommended for: all RAG pipelines where factual accuracy matters.
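Recipe 1 can be sketched as a small gate; `retrieve`, `generate`, and `entails` are hypothetical callables (the last standing in for a Lynx-8B or MiniCheck verdict), and the `strict` flag models the stronger grounding prompt on retry:

```python
def rag_faithfulness_gate(question: str, retrieve, generate, entails,
                          retries: int = 1) -> str:
    """Retrieve, generate, check entailment against the retrieved docs,
    then deliver, retry with a stricter grounding prompt, or refuse."""
    docs = retrieve(question)
    for attempt in range(retries + 1):
        answer = generate(question, docs, strict=attempt > 0)
        if entails(docs, answer):  # faithfulness judge verdict
            return answer
    return "I don't have enough information to answer that."
```
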

Recipe 2: Tool-Call Pre-Execution Verification

1. Agent proposes tool call (e.g., DELETE /records/123, send email, deploy)
2. Before execution: run a judge with the tool call + context
   - "Given this conversation, is this tool call appropriate, correct, and intended?"
3. If judge confidence < threshold or verdict is REJECT: pause + escalate to human
4. If PASS: execute tool call

Cost profile: higher per-call cost acceptable because tool errors are expensive. Use a larger judge here (GPT-4o, Claude 3.7 Sonnet) for irreversible actions.
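A sketch of the pre-execution gate, with a hypothetical `judge` callable returning a (verdict, confidence) pair and an illustrative allowlist of irreversible tool names:

```python
# Illustrative set; in practice this comes from tool metadata.
IRREVERSIBLE = {"delete_record", "send_email", "deploy"}

def gate_tool_call(call: dict, context: str, judge,
                   threshold: float = 0.9) -> str:
    """Judge a proposed tool call before execution. Irreversible actions
    additionally require high judge confidence; anything below the bar
    is escalated to a human rather than silently rejected."""
    verdict, confidence = judge(call, context)  # e.g. ("PASS", 0.95)
    if verdict == "PASS" and (call["name"] not in IRREVERSIBLE
                              or confidence >= threshold):
        return "execute"
    if call["name"] in IRREVERSIBLE:
        return "escalate"
    return "reject"
```

Note the asymmetry: a reversible call that fails the judge is simply rejected, while any doubt about an irreversible call routes to a human.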

Recipe 3: Memory Write Auditing

1. Before writing to long-term memory: extract the claim being written
2. Run a factuality check: is this claim supported by the conversation?
3. Run a redundancy check: does this duplicate or contradict existing memory?
4. If PASS on both: write with timestamp + source attribution
5. If FAIL: discard or flag for review

This prevents hallucinated facts from compounding across future sessions — one of the most insidious failure modes in long-running agents.
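The two checks compose into a single write gate; `supported` and `contradicts` below are hypothetical judge callables:

```python
import time

def audit_memory_write(claim: str, conversation: str, memory: list[dict],
                       supported, contradicts):
    """Write a claim to long-term memory only if it is supported by the
    conversation and neither duplicates nor contradicts an existing
    entry; otherwise discard it (or flag it for review)."""
    if not supported(conversation, claim):
        return None                       # factuality check failed
    for entry in memory:
        if entry["claim"] == claim or contradicts(entry["claim"], claim):
            return None                   # redundancy check failed
    record = {"claim": claim, "ts": time.time(), "source": "conversation"}
    memory.append(record)                 # timestamp + source attribution
    return record
```
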

Recipe 4: Periodic Self-Audit

Schedule a background task (not in the live request path) that:

  1. Samples recent agent outputs (1–5% of volume)
  2. Runs full judge suite against rubric: accuracy, helpfulness, policy adherence
  3. Computes trend metrics: hallucination rate, refusal rate, user satisfaction proxies
  4. Alerts if any metric degrades >10% week-over-week

This is the pattern Ramp uses in shadow mode before enabling live gating, and what Cox Automotive uses for continuous quality monitoring.

Recipe 5: Ensemble Disagreement Escalation

1. For high-stakes outputs: generate with two differently-prompted versions of the actor
2. Run semantic similarity between outputs
3. If similarity < 0.85: flag as uncertain; either run a judge to arbitrate or escalate to human
4. If similar: deliver higher-confidence output

Cost: 2x generation; no explicit judge call for the majority of requests. Judge only invoked on disagreements (~10–20% of requests in practice).
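A sketch of the disagreement gate, with `similarity` left abstract (an embedding cosine similarity in practice):

```python
def ensemble_gate(prompt: str, generate_a, generate_b, similarity,
                  threshold: float = 0.85):
    """Run two differently-prompted generations; deliver only when they
    agree, otherwise return the pair for judge arbitration or human
    review. Returns (delivered_output, flagged_pair); exactly one of
    the two is None."""
    out_a, out_b = generate_a(prompt), generate_b(prompt)
    if similarity(out_a, out_b) < threshold:
        return None, (out_a, out_b)       # uncertain: escalate
    return out_a, None
```
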


Open Questions

1. What is the right judge-to-actor model size ratio?

Early intuition was "judge should be larger than actor." The distilled judge research (Prometheus 2, Luna-2) challenges this: a 7B model trained specifically for evaluation can outperform a 70B general model at judging. The open question is whether this holds as actor models become more capable — can a 7B judge reliably evaluate a 70B actor's outputs in 2027?

2. When does adding a judge help, and when does it just add latency?

The honest answer: for open-ended generation tasks (creative writing, general chat), adding a judge rarely produces measurable quality improvement and adds cost and latency. For formal tasks (code, math, structured extraction, factual QA against documents), judge verification has demonstrated consistent value. The boundary between these categories is not always obvious, and the field lacks a reliable pre-deployment test for "is a judge worthwhile here?"

3. How do you avoid judge collusion in same-family setups?

When Claude judges Claude outputs, or GPT-4 judges GPT-4 outputs, the 5–7% self-preference inflation is documented but not fully explained. The mitigation (cross-family judging) is expensive and may introduce different biases. An open question is whether constitutional constraints on the judge prompt (explicit instructions to penalize self-familiar stylistic patterns) can reduce collusion cost-effectively.

4. Calibration maintenance at scale

Judge models degrade in calibration as the world changes and as actor models are updated. There is no industry standard for calibration monitoring frequency, golden dataset construction, or drift detection thresholds. This is a gap that the eval platform vendors (Braintrust, LangSmith, Galileo) are beginning to address but have not yet solved.

5. Multi-agent judge cascades

In systems with dozens of specialized sub-agents, should every agent have its own judge, or should there be a centralized quality control layer? The former adds latency at every step; the latter creates a bottleneck and a single point of failure. Hierarchical judge architectures (step-level PRMs + output-level ORMs + delivery-level faithfulness checks) are emerging but lack standardized patterns.

6. Adversarial judge manipulation

If an actor model (or a malicious prompt injection) can learn to produce outputs that score well on the judge without being genuinely high quality — Goodhart's Law applied to LLM evaluation — then the judge becomes an attack surface rather than a defense. The research on adversarial evaluation is nascent; it is likely to become more pressing as agentic systems are deployed in higher-stakes contexts.


Sources

  1. LangChain, "State of Agent Engineering," 2026. https://www.langchain.com/state-of-agent-engineering
  2. ZenML, "What 1,200 Production Deployments Reveal About LLMOps in 2025." https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
  3. Label Your Data, "LLM as a Judge: A 2026 Guide to Automated Model Assessment." https://labelyourdata.com/articles/llm-as-a-judge
  4. Evidently AI, "LLM-as-a-judge: a complete guide to using LLMs for evaluations." https://www.evidentlyai.com/llm-guide/llm-as-a-judge
  5. Kim et al., "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models," EMNLP 2024. https://arxiv.org/abs/2405.01535
  6. Zhu et al., "JudgeLM: Fine-tuned Large Language Models are Scalable Judges," ICLR 2025 Spotlight. https://arxiv.org/abs/2310.17631
  7. Patronus AI, "Lynx: State-of-the-Art Open Source Hallucination Detection Model." https://www.patronus.ai/blog/lynx-state-of-the-art-open-source-hallucination-detection-model
  8. Patronus AI / Lynx arXiv paper. https://arxiv.org/html/2407.08488v1
  9. Galileo, "Luna-2: Best LLM Eval Platforms Compared." https://galileo.ai/blog/best-llm-eval-platforms-compared
  10. Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet," ICLR 2024. https://arxiv.org/abs/2310.01798
  11. Kamoi et al., "When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey," TACL 2024. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00713/125177/
  12. ACL 2025, "Understanding the Dark Side of LLMs' Intrinsic Self-Correction." https://aclanthology.org/2025.acl-long.1314.pdf
  13. Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning," NeurIPS 2023. https://arxiv.org/abs/2303.11366
  14. Kenton et al. (Anthropic/DeepMind), "On scalable oversight with weak LLMs judging strong LLMs," NeurIPS 2024. https://arxiv.org/html/2407.04622v1
  15. Snell et al., "Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning," ICLR 2025. https://openreview.net/forum?id=4FWAwZtd2n
  16. Zhao et al., "Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge," IJCNLP 2025. https://aclanthology.org/2025.ijcnlp-long.18/
  17. Anthropic, "Constitutional AI: Harmlessness from AI Feedback," 2022. https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
  18. Meta AI, "Llama Guard 3 8B." https://huggingface.co/meta-llama/Llama-Guard-3-8B
  19. Google DeepMind, "FACTS Grounding: A new benchmark for evaluating the factuality of large language models," December 2024. https://deepmind.google/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/
  20. vLLM Blog, "Token-Level Truth: Real-Time Hallucination Detection for Production LLMs (HaluGate)," December 2025. https://blog.vllm.ai/2025/12/14/halugate.html
  21. DSPy, "Assertions: Computational Constraints for Self-Refining Language Model Pipelines." https://arxiv.org/html/2312.13382v1
  22. Guo et al., "A Survey on LLM-as-a-Judge," November 2024. https://arxiv.org/html/2411.15594v6
  23. Braintrust, "Best LLM evaluation platforms 2025." https://www.braintrust.dev/articles/best-llm-evaluation-platforms-2025
  24. Arize, "Comparing LLM Evaluation Platforms: Top Frameworks for 2025." https://arize.com/llm-evaluation-platforms-top-frameworks/
  25. CRITIC (Gou et al., ICLR 2024): "Large Language Models Can Self-Correct with Tool-Interactive Critiquing." https://openreview.net/forum?id=Sx038qxjek