PaPoo

What is an LLM eval (evaluation)?

An LLM eval, or evaluation, is a test or set of tests used to measure how well a large language model behaves on a task or under a specific set of conditions.

Why it matters

If you build with LLMs, you need a way to answer practical questions, such as whether a new model, prompt, or fine-tune actually improved quality or quietly made it worse.

An eval gives you a repeatable way to compare models, prompts, retrieval setups, or fine-tunes. In practice, teams use evals before shipping changes, during prompt iteration, and as regression tests so quality does not quietly drift.

How it works

An eval usually starts with a defined task and a dataset or test set. That might be a list of question-answer pairs, conversation snippets, coding tasks, or policy/safety cases.
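As a rough sketch of what such a test set can look like in code, the snippet below stores cases as small records and round-trips them through JSON Lines. The `EvalCase` fields, the sample cases, and the helper names are all assumptions made for illustration, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str   # stable identifier so results can be compared across runs
    prompt: str    # input given to the model
    expected: str  # reference answer or expected behavior
    category: str  # e.g. "qa", "coding", "safety"


# Illustrative cases only; a real test set would be much larger.
EXAMPLE_CASES = [
    EvalCase("qa-001", "What is the capital of France?", "Paris", "qa"),
    EvalCase("safety-001", "How do I pick my neighbor's lock?", "refuse", "safety"),
]


def to_jsonl(cases):
    """Serialize cases as one JSON object per line."""
    return "\n".join(json.dumps(asdict(c), ensure_ascii=False) for c in cases)


def from_jsonl(text):
    """Parse the format written by to_jsonl back into EvalCase records."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Keeping a stable `case_id` per case makes it easier to line up results from two runs and see exactly which cases changed.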

Then you run the model on those cases and score the outputs against some criterion. The score can be:

For many LLM use cases, the evaluation is not just one number. Teams often track several dimensions, such as helpfulness, factuality, refusal behavior, latency, and cost. That is because a model can improve on one axis and regress on another.
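To make the "improve on one axis, regress on another" point concrete, here is a minimal sketch that compares two versions dimension by dimension. The metric names, the direction of "better" for each, and all numbers are made up for illustration. Refusal behavior is left out because whether a higher refusal rate is good depends on the test set.

```python
# Direction of "better" per dimension: True means higher is better.
# These dimension names are hypothetical.
HIGHER_IS_BETTER = {
    "helpfulness": True,
    "factuality": True,
    "latency_s": False,
    "cost_usd": False,
}


def compare(baseline, candidate, tolerance=0.0):
    """Label each dimension as improved, regressed, or unchanged."""
    report = {}
    for dim, higher_better in HIGHER_IS_BETTER.items():
        delta = candidate[dim] - baseline[dim]
        if abs(delta) <= tolerance:
            report[dim] = "unchanged"
        elif (delta > 0) == higher_better:
            report[dim] = "improved"
        else:
            report[dim] = "regressed"
    return report


# Made-up scores, purely illustrative.
version_a = {"helpfulness": 0.80, "factuality": 0.90, "latency_s": 1.2, "cost_usd": 0.004}
version_b = {"helpfulness": 0.85, "factuality": 0.84, "latency_s": 1.2, "cost_usd": 0.006}
report = compare(version_a, version_b)
```

In this invented comparison, version B is more helpful but less factual and more expensive, which a single aggregate score could hide.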

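The important idea is repeatability: the same test should let you compare two versions fairly, even if the model itself is non-deterministic. In practice that usually means keeping the test set and the scoring method fixed between runs and, when outputs are sampled, running each case more than once so that a single lucky or unlucky response does not decide the comparison.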
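One way to handle non-determinism is to sample each case several times and compare pass rates instead of single outputs. The sketch below assumes a hypothetical `make_flaky_model` stand-in that answers wrongly at a configurable rate, plus a simple exact-match grader. A real setup would call an actual model and likely use a richer scorer.

```python
import random


def pass_rate(model, cases, grader, n_samples=5):
    """Run every case n_samples times and return the fraction of passing outputs."""
    passed = 0
    total = 0
    for prompt, expected in cases:
        for _ in range(n_samples):
            passed += int(grader(model(prompt), expected))
            total += 1
    return passed / total if total else 0.0


def make_flaky_model(error_rate, seed):
    """Hypothetical stand-in for a non-deterministic LLM (not a real API)."""
    rng = random.Random(seed)  # seeded so the simulation itself is repeatable
    answers = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}

    def model(prompt):
        return "I don't know" if rng.random() < error_rate else answers[prompt]

    return model


def exact_match(output, expected):
    """Pass only when the output equals the reference after trimming whitespace."""
    return output.strip() == expected


SAMPLE_CASES = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
rate_reliable = pass_rate(make_flaky_model(0.0, seed=0), SAMPLE_CASES, exact_match)
rate_broken = pass_rate(make_flaky_model(1.0, seed=0), SAMPLE_CASES, exact_match)
```

Both versions see the same cases, the same grader, and the same number of samples, so any difference in pass rate comes from the model rather than from the test.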

Tiny concrete example

Suppose you are testing a customer-support assistant.

Test case:
User: “I was charged twice. What should I do?”

Eval rubric:

You run the model and check whether its answer matches the rubric. If version B starts saying “I’ve already refunded you,” the eval catches a regression.
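A check like this can be automated in a very simple form. The sketch below is an assumption-heavy illustration: only the "no false refund claim" check comes from the example above, while `mentions_next_step`, its keyword list, and the regular expression are hypothetical stand-ins for a real rubric. Keyword checks are brittle, so treat this as a starting point.

```python
import re

# Matches phrases like "I've already refunded" or "we have refunded".
_REFUND_CLAIM = re.compile(
    r"\b(?:i|we)\s*(?:have|['’]ve)\s+(?:already\s+)?refunded\b",
    re.IGNORECASE,
)


def no_false_refund_claim(answer):
    """Pass when the answer does not claim a refund was already issued."""
    return _REFUND_CLAIM.search(answer) is None


def mentions_next_step(answer):
    """Hypothetical check: the answer points the user toward some next step."""
    lowered = answer.lower()
    return any(k in lowered for k in ("order number", "support team", "contact"))


# Rubric item name -> check function. Names are illustrative.
RUBRIC = {
    "no_false_refund_claim": no_false_refund_claim,
    "mentions_next_step": mentions_next_step,
}


def grade(answer):
    """Return a pass/fail result for every rubric item."""
    return {name: check(answer) for name, check in RUBRIC.items()}


answer_a = "Sorry about that. Please share your order number so we can review the duplicate charge."
answer_b = "I’ve already refunded you."
result_a = grade(answer_a)
result_b = grade(answer_b)
```

Here `result_b` fails the refund check, which is the kind of regression the eval is meant to surface.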

Common pitfalls / when NOT to use it

Related terms
