RAG evaluation is the process of checking whether a retrieval-augmented generation system’s answer is faithful to the provided context and relevant to the user’s question.
RAG systems are meant to answer with help from retrieved documents, not just from the model’s memory. Evaluation tells you whether the system is actually doing that well.
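In practice, teams use RAG evaluation to catch two common failure modes: answers that are not supported by the retrieved context (hallucinations), and answers that do not address the user’s question.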
If you only measure one of these, you can miss serious issues. A response can be relevant but hallucinated, or faithful to the context but still unhelpful.
A typical RAG evaluation splits the quality check into two parts:
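Faithfulness: is every claim in the answer supported by the provided context?
Relevance: does the answer actually address the user’s question?
Evaluation is sometimes done at multiple levels, for example scoring retrieval and generation separately.
Scoring can use human review, automatic judges, or both.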
User question: “What year was the company founded?”
Retrieved context: “The company was founded in 2014 in Berlin.”
Answer A: “The company was founded in 2014.”
Answer B: “The company was founded in 2014, and its headquarters are in Paris.”
Answer C: “The company uses a subscription model.”
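Read against the context, Answer A is both supported and on-topic. Answer B answers the question but adds a headquarters claim the context never makes. Answer C neither addresses the question nor appears in the context. One crude way to surface the unsupported parts is to list the answer words that never occur in the context. The sketch below is only a lexical proxy: the stopword list is hand-picked for this example, and the helper names content_words and unsupported_words are hypothetical. It cannot judge relevance or recognize paraphrase, which is why real setups lean on human or LLM judges.

```python
import re

# Tiny hand-picked stopword list; an assumption made for this toy example only.
STOPWORDS = {"the", "was", "in", "and", "its", "are", "a", "an", "of", "is"}

def content_words(text):
    """Lowercase the text and return its non-stopword tokens as a set."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def unsupported_words(answer, context):
    """Content words that appear in the answer but nowhere in the context."""
    return content_words(answer) - content_words(context)

context = "The company was founded in 2014 in Berlin."
answers = {
    "A": "The company was founded in 2014.",
    "B": "The company was founded in 2014, and its headquarters are in Paris.",
    "C": "The company uses a subscription model.",
}

for name, answer in answers.items():
    print(name, sorted(unsupported_words(answer, context)))
# A []
# B ['headquarters', 'paris']
# C ['model', 'subscription', 'uses']
```

An empty list does not prove faithfulness (word overlap ignores meaning), but a non-empty one is a cheap signal that a claim deserves a closer look.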
Confusing retrieval quality with answer quality
A bad answer can come from bad retrieval, bad generation, or both. Evaluate them separately when possible.
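To score retrieval on its own, one common option is recall@k: the share of known-relevant documents that appear in the top k results. The sketch below assumes you have gold relevance labels per question; the document IDs and the function name recall_at_k are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top-k retrieved list."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical ranking returned by a retriever, best match first.
retrieved = ["doc7", "doc2", "doc9", "doc4"]
# Hypothetical gold labels: the documents that actually contain the evidence.
relevant = {"doc2", "doc5"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5
```

Because this number never looks at the generated answer, a low value points at the retriever regardless of how the answer was scored.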
Treating “faithful” as “true in the world”
Faithfulness usually means “supported by the provided context,” not necessarily globally correct.
Using a vague rubric
If you do not specify what counts as support, judges may score inconsistently.
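One way to make the rubric concrete is to write each score level down and hand the same text to every judge, human or automatic. The three-level scale and its wording below are assumptions for illustration, not a standard, and build_judge_prompt and parse_score are hypothetical helpers.

```python
# Hypothetical three-level faithfulness rubric; the wording is an assumption.
RUBRIC = {
    2: "Every claim in the answer is stated in the context or is a direct paraphrase of it.",
    1: "The main claim is supported, but at least one added detail is not in the context.",
    0: "The main claim is missing from the context or contradicts it.",
}

def build_judge_prompt(question, context, answer):
    """Render the rubric and one example into instructions for a judge."""
    levels = "\n".join(f"{score}: {RUBRIC[score]}" for score in sorted(RUBRIC, reverse=True))
    return (
        "Score the answer for faithfulness to the context only.\n"
        "Do not use outside knowledge.\n\n"
        f"Rubric:\n{levels}\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n\n"
        "Reply with a single digit: 2, 1, or 0."
    )

def parse_score(raw):
    """Accept only an exact rubric score so malformed replies fail loudly."""
    text = raw.strip()
    if text not in {str(score) for score in RUBRIC}:
        raise ValueError(f"not a rubric score: {raw!r}")
    return int(text)

print(build_judge_prompt(
    "What year was the company founded?",
    "The company was founded in 2014 in Berlin.",
    "The company was founded in 2014, and its headquarters are in Paris.",
))
```

Under this particular wording, an answer that gets the main fact right but adds an unsupported detail would plausibly land on the middle level, which is exactly the kind of case a vague rubric leaves to chance.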
Over-trusting automatic metrics
LLM judges are useful, but they can be noisy, biased, or overly lenient. Spot-check with humans.
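A spot-check becomes more useful when the human labels are compared with the judge's labels on the same examples. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) by hand; the two label lists are invented for illustration, with 1 meaning "faithful" and 0 meaning "not faithful".

```python
from collections import Counter

def agreement_rate(labels_a, labels_b):
    """Share of examples on which the two raters gave the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = agreement_rate(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented labels for eight answers.
judge = [1, 1, 1, 0, 1, 1, 0, 1]
human = [1, 0, 1, 0, 1, 0, 0, 1]

print(agreement_rate(judge, human))  # 0.75
print(cohen_kappa(judge, human))     # 0.5
```

In this made-up sample both disagreements are cases where the judge said "faithful" and the human did not, the pattern you would expect from an overly lenient judge.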
Using only aggregate scores
A single average can hide systematic failures on certain question types, document types, or languages.
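Breaking the same scores down by slice makes such failures visible. The sketch below assumes each evaluated example is a dict with a 0/1 "faithful" score and a "question_type" tag; the field names and the toy records are hypothetical.

```python
from collections import defaultdict

def mean_by_slice(records, key, score_field="faithful"):
    """Average the score separately for each distinct value of `key`."""
    totals = defaultdict(lambda: [0.0, 0])  # slice value -> [sum, count]
    for record in records:
        bucket = totals[record[key]]
        bucket[0] += record[score_field]
        bucket[1] += 1
    return {value: total / count for value, (total, count) in totals.items()}

# Toy records, invented for illustration.
records = (
    [{"question_type": "factoid", "faithful": 1}] * 6
    + [{"question_type": "multi_hop", "faithful": 0}] * 2
)

overall = sum(r["faithful"] for r in records) / len(records)
print(overall)                                  # 0.75
print(mean_by_slice(records, "question_type"))  # {'factoid': 1.0, 'multi_hop': 0.0}
```

Here the overall 0.75 looks acceptable, while the per-slice view shows that every multi-hop question failed.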
Skipping evaluation on the retrieval side
If the retriever misses the right evidence, the generator may look “unfaithful” even though the root cause is retrieval.
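A simple triage step can keep these cases apart when reviewing failures. The sketch below assumes two boolean signals per example: whether the gold evidence was among the retrieved documents, and whether the answer was judged supported by the retrieved context. The function name, labels, and sample run are hypothetical.

```python
from collections import Counter

def triage(evidence_retrieved, answer_supported):
    """Attribute one evaluated example to a likely root cause."""
    if not evidence_retrieved:
        # Without the right evidence, the generator had little chance to succeed.
        return "retrieval_miss"
    if not answer_supported:
        # Evidence was available, but the answer went beyond or against it.
        return "generation_error"
    return "ok"

# Invented (evidence_retrieved, answer_supported) pairs for four examples.
run = [(True, True), (False, False), (True, False), (False, False)]

summary = Counter(triage(evidence, supported) for evidence, supported in run)
print(summary)
```

Counting failures this way shows whether effort is better spent on the retriever or on the generation step.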