LLM-as-a-judge is the practice of using a large language model to evaluate or rank outputs from another model, system, or workflow, instead of relying only on human reviewers or fixed metrics.
It solves a common problem: many AI outputs are hard to score with simple automated checks. If you are judging a summary for usefulness, a chatbot answer for helpfulness, or two candidate responses for preference, exact-match metrics often miss the point.
Teams reach for LLM-as-a-judge when they need to assess qualities like these across more outputs than human reviewers can cover.
In practice, it is often used as a proxy evaluator: not perfect, but cheaper and more scalable than having humans label every test case.
1. You define a rubric.
The judge model is given criteria such as correctness, completeness, relevance, tone, or safety. Good setups are explicit about what “better” means.
2. You give the judge the inputs and outputs to evaluate.
This might be a user prompt plus one answer, or a prompt plus two answers to compare. The judge then produces a score, a ranking, or a short rationale.
3. You aggregate results across many examples.
The outputs are used to compare prompts, retrieval settings, model versions, or agent behaviors. Some teams use the judge for coarse screening and still keep human review for spot checks.
4. You validate the judge itself.
Because the judge is also a model, it can be biased, inconsistent, or overly sensitive to phrasing. Mature workflows calibrate it against human judgments on a sample set.
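The aggregation and validation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the function names are hypothetical, the labels are assumed to be "A", "B", or "Tie", and Cohen's kappa is just one common choice of chance-corrected agreement statistic, not something the workflow itself prescribes.

```python
from collections import Counter


def win_rates(verdicts):
    """Fraction of examples per verdict label (hypothetical helper)."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {label: n / total for label, n in counts.items()}


def agreement(judge, human):
    """Raw fraction of examples where judge and human labels match."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohen_kappa(judge, human):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    p_o = agreement(judge, human)
    n = len(judge)
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Expected agreement if both raters labeled independently
    # with their observed label frequencies.
    p_e = sum(judge_counts[l] * human_counts[l] for l in judge_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for illustration only.
judge_labels = ["A", "B", "B", "Tie"]
human_labels = ["A", "B", "A", "Tie"]
print(win_rates(judge_labels))
print(agreement(judge_labels, human_labels))
print(cohen_kappa(judge_labels, human_labels))
```

Raw agreement is easy to read but can look high when one label dominates; a chance-corrected statistic helps in that case. Which threshold counts as "good enough" depends on the task and is left open here.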
A subtle point: “LLM-as-a-judge” is a broad term. In the literature and tooling, it can mean direct scoring, pairwise preference ranking, rubric-based grading, or multi-criteria evaluation. The exact protocol matters more than the label.
Prompt to the judge:
Rate which answer is more helpful for a new developer. Prefer correctness and clarity. Return only “A”, “B”, or “Tie”.
Answer A:
“Use a vector database because it is always faster.”
Answer B:
“Use a vector database if you need semantic retrieval over embeddings; benchmark it against your latency and recall needs.”
Judge output:
B
Here the judge is being used as a quick preference rater, not as a final authority on truth.
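A pairwise setup like this one usually needs two small pieces around the model call: a prompt builder and a strict parser for the verdict. The sketch below assumes a hypothetical `question` field (the example above does not show one) and leaves the actual model call out, since that depends on whichever client you use.

```python
from typing import Optional

ALLOWED = ("A", "B", "Tie")


def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise-preference prompt with the rubric stated up front."""
    return (
        "Rate which answer is more helpful for a new developer. "
        "Prefer correctness and clarity. "
        'Return only "A", "B", or "Tie".\n\n'
        f"Question:\n{question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n"
    )


def parse_verdict(raw: str) -> Optional[str]:
    """Return "A", "B", or "Tie"; None if the judge ignored the format."""
    # Tolerate surrounding whitespace, quotes, and a trailing period.
    cleaned = raw.strip().strip("\"'“”.").strip()
    for label in ALLOWED:
        if cleaned.lower() == label.lower():
            return label
    return None  # anything else is treated as a format violation
```

Returning `None` for off-format replies, rather than guessing, keeps malformed judge outputs visible so they can be counted or retried instead of silently skewing the results.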
Using it as ground truth.
A judge model is still a model. It can hallucinate a rationale, inherit biases, or prefer fluent but wrong answers.
Skipping calibration.
If you do not compare judge scores with human labels on a representative sample, you may optimize for the judge rather than the real task.
Rewarding style over substance.
LLM judges often favor verbosity, confidence, or polished phrasing unless the rubric explicitly guards against that.
Using a weak judge for a hard domain.
For legal, medical, or highly specialized technical content, a general-purpose judge may be unreliable without expert review.
Letting the model evaluate itself without controls.
Self-evaluation can be useful, but it is especially prone to bias and overestimation.
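One of these pitfalls, rewarding style over substance, can be probed with a simple diagnostic: how often does the judge pick the longer answer? The sketch below is one rough way to measure that, assuming pairwise records of the form `(answer_a, answer_b, verdict)` and using whitespace word count as a crude length proxy. The function name and record shape are illustrative assumptions.

```python
def longer_answer_pick_rate(records):
    """Share of decisive verdicts that went to the longer answer.

    records: iterable of (answer_a, answer_b, verdict) tuples, with
    verdict in {"A", "B", "Tie"}. Ties and equal-length pairs are skipped.
    Returns None when no decisive verdicts remain.
    """
    picked_longer = 0
    decisive = 0
    for answer_a, answer_b, verdict in records:
        len_a, len_b = len(answer_a.split()), len(answer_b.split())
        if verdict not in ("A", "B") or len_a == len_b:
            continue
        decisive += 1
        longer = "A" if len_a > len_b else "B"
        if verdict == longer:
            picked_longer += 1
    return picked_longer / decisive if decisive else None
```

A high rate is not proof of bias on its own, since longer answers are sometimes genuinely better. It is more informative to compute the same rate from human labels on the same pairs and compare the two.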
If a product decision hinges on a single number, human evaluation or task-specific metrics are usually safer. LLM-as-a-judge is best when you need scalable, approximate evaluation and can tolerate some noise.
LLM-as-a-judge is a scalable way to evaluate AI outputs using another language model. It is useful for ranking, grading, or comparing responses when human review is too slow or expensive, but it should be calibrated against human judgments and not treated as ground truth.