An LLM eval, or evaluation, is a test or set of tests used to measure how well a large language model behaves on a task or under a specific set of conditions.
If you build with LLMs, you need a way to answer practical questions like:
An eval gives you a repeatable way to compare models, prompts, retrieval setups, or fine-tunes. In practice, teams use evals before shipping changes, during prompt iteration, and as regression tests so quality does not quietly drift.
An eval usually starts with a defined task and a dataset or test set. That might be a list of question-answer pairs, conversation snippets, coding tasks, or policy/safety cases.
Then you run the model on those cases and score the outputs against some criterion. The score can be:
For many LLM use cases, the evaluation is not just one number. Teams often track several dimensions, such as helpfulness, factuality, refusal behavior, latency, and cost. That is because a model can improve on one axis and regress on another.
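As a rough illustration of why several dimensions are tracked, the sketch below compares two versions score by score and lists the dimensions that got worse. The function name `find_regressions`, the `tolerance` parameter, and all of the numbers are made up for this example. It assumes every score is "higher is better", so latency and cost would need to be inverted or handled separately.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return dimensions where the candidate is worse than the baseline.

    A drop smaller than `tolerance` is treated as noise (an assumption
    for this sketch, not a recommended threshold).
    """
    return sorted(
        name
        for name in baseline
        if name in candidate and candidate[name] < baseline[name] - tolerance
    )


# Illustrative scores only; they do not come from any real model.
version_a = {"helpfulness": 0.82, "factuality": 0.90, "refusal_behavior": 0.95}
version_b = {"helpfulness": 0.88, "factuality": 0.84, "refusal_behavior": 0.95}

print(find_regressions(version_a, version_b))  # prints ['factuality']
```

Here version B looks better if you only watch helpfulness, which is exactly the case a single headline number would hide.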
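The important idea is repeatability: the same test cases and the same scoring method should let you compare two versions fairly, even if the model itself is non-deterministic. In practice that usually means keeping the test set fixed and, when outputs vary from run to run, averaging the score over several samples per case.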
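The run-and-score loop and the repeatability idea above can be sketched in a few lines. This is a minimal sketch, not a real harness: `toy_model` is a stand-in for an actual model call, `exact_match` is just one possible scoring criterion, and the `trials` and `seed` parameters are assumptions for this example. A real API may not expose a seed, in which case only the averaging over trials applies.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (prompt, expected answer)


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if the output equals the expected answer, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str, random.Random], str],
    cases: List[Case],
    trials: int = 5,
    seed: int = 0,
) -> float:
    """Run every case `trials` times and return the mean score."""
    rng = random.Random(seed)  # fixed seed so the toy run is reproducible
    scores = []
    for prompt, expected in cases:
        for _ in range(trials):
            scores.append(exact_match(model(prompt, rng), expected))
    return mean(scores)


def toy_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical non-deterministic model: answers correctly about 80% of the time."""
    answers = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return answers[prompt] if rng.random() < 0.8 else "I'm not sure"


cases = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
score_first = run_eval(toy_model, cases)
score_second = run_eval(toy_model, cases)
print(score_first == score_second)  # same cases, same seed, same score
```

Averaging over several trials per case keeps a single lucky or unlucky sample from deciding the comparison between two versions.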
Suppose you are testing a customer-support assistant.
Test case:
User: “I was charged twice. What should I do?”
Eval rubric:
You run the model and check whether its answer matches the rubric. If version B starts saying “I’ve already refunded you” when no refund was issued, the eval catches the regression.
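A check like this can be automated. The rubric items are not listed above, so the two criteria in this sketch are illustrative assumptions: the answer must not claim a refund has already been issued, and it should point the user toward a next step. The names `grade`, `passes`, `FORBIDDEN_CLAIM`, and `NEXT_STEP_HINTS` are hypothetical, and keyword matching is only a crude stand-in for a human or model-based grader.

```python
import re
from typing import Dict

# Assumed rubric item 1: the assistant must not claim it already issued a refund.
FORBIDDEN_CLAIM = re.compile(
    r"\b(i['’]ve|i have|we['’]ve|we have)\s+(already\s+)?refunded\b",
    re.IGNORECASE,
)

# Assumed rubric item 2: the answer should mention some concrete next step.
NEXT_STEP_HINTS = ("support", "billing", "transaction", "statement", "order number")


def grade(answer: str) -> Dict[str, bool]:
    """Return a pass/fail result for each assumed rubric item."""
    lowered = answer.lower()
    return {
        "no_false_refund_claim": FORBIDDEN_CLAIM.search(answer) is None,
        "offers_next_step": any(hint in lowered for hint in NEXT_STEP_HINTS),
    }


def passes(answer: str) -> bool:
    """An answer passes only if every rubric item passes."""
    return all(grade(answer).values())


# Made-up outputs for the two versions.
version_a_answer = (
    "Sorry about that. Please share your order number so billing "
    "can review the duplicate charge."
)
version_b_answer = "I’ve already refunded you, so there is nothing more to do."

print(passes(version_a_answer))  # prints True
print(passes(version_b_answer))  # prints False
```

Returning a result per rubric item, rather than a single boolean, makes it easier to see which criterion a new version broke.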