2026-06-20

What is RAG evaluation (faithfulness and relevance)?

RAG evaluation is the process of checking whether a retrieval-augmented generation system’s answer is faithful to the provided context and relevant to the user’s question.

Why it matters

RAG systems are meant to answer with help from retrieved documents, not just from the model’s memory. Evaluation tells you whether the system is actually doing that well.

In practice, teams use RAG evaluation to catch two common failure modes:

Low faithfulness: the answer sounds plausible but is not supported by the retrieved evidence.
Low relevance: the answer may be factually grounded, but it does not address the user’s question well.

If you only measure one of these, you can miss serious issues. A response can be relevant but hallucinated, or faithful to the context but still unhelpful.

How it works

A typical RAG evaluation scores two properties, faithfulness and relevance, and then decides at which level to score them and who does the judging:

Faithfulness
- Ask: Is every important claim in the answer supported by the retrieved context?
- This is a grounding check. If the answer asserts something the documents do not support, faithfulness is low.

Relevance
- Ask: Does the answer actually address the user’s query?
- This checks whether the system stayed on task and answered the question asked, rather than drifting into related but unnecessary details.

Evaluate at more than one level where it helps
- Some setups score the retrieved passages themselves for relevance to the query.
- Others score the final answer for both faithfulness and relevance.
- These are related but not identical. Good retrieval makes a relevant, grounded answer possible, but it does not guarantee that the generator stays faithful to what was retrieved.

Use human review, automatic judges, or both
- Human evaluation is the most trusted reference when stakes are high.
- Automated evaluation often uses an LLM-as-judge or heuristic scoring to scale coverage.
- The exact rubric matters: different papers and tools define these metrics slightly differently, so read the rubric carefully rather than assume a universal definition.
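
To make the rubric point concrete, here is a minimal sketch of a two-axis judge in Python. It is an illustration under stated assumptions, not any particular tool's method: the rubric wording, the binary 0/1 scale, and the names `RagScores`, `build_prompt`, `parse_score`, `evaluate`, and `stub_judge` are all hypothetical. The `judge` argument is any callable that takes a prompt string and returns a reply string; in a real setup that would wrap an LLM call or a human annotation step.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubrics. Real tools word these differently, which is
# exactly why the rubric should be written down explicitly.
FAITHFULNESS_RUBRIC = (
    "Score 1 if every claim in the ANSWER is supported by the CONTEXT, "
    "otherwise score 0. Reply with a single digit."
)
RELEVANCE_RUBRIC = (
    "Score 1 if the ANSWER directly addresses the QUESTION, "
    "otherwise score 0. Reply with a single digit."
)


@dataclass
class RagScores:
    faithfulness: int
    relevance: int


def build_prompt(rubric: str, question: str, context: str, answer: str) -> str:
    """Combine one rubric with the item being judged."""
    return (
        f"{rubric}\n\n"
        f"QUESTION: {question}\n"
        f"CONTEXT: {context}\n"
        f"ANSWER: {answer}\n"
        "SCORE:"
    )


def parse_score(reply: str) -> int:
    """Accept only '0' or '1' so malformed judge output fails loudly."""
    cleaned = reply.strip()
    if cleaned not in {"0", "1"}:
        raise ValueError(f"unexpected judge reply: {reply!r}")
    return int(cleaned)


def evaluate(
    judge: Callable[[str], str], question: str, context: str, answer: str
) -> RagScores:
    """Score faithfulness and relevance with two separate judge calls."""
    faithful = parse_score(
        judge(build_prompt(FAITHFULNESS_RUBRIC, question, context, answer))
    )
    relevant = parse_score(
        judge(build_prompt(RELEVANCE_RUBRIC, question, context, answer))
    )
    return RagScores(faithfulness=faithful, relevance=relevant)


def stub_judge(prompt: str) -> str:
    """Stand-in judge for demonstration only: fails faithfulness, passes relevance."""
    return "0" if prompt.startswith(FAITHFULNESS_RUBRIC) else "1"


if __name__ == "__main__":
    print(evaluate(stub_judge, "a question", "some context", "an answer"))
```

Keeping the two axes as separate judge calls means a relevant but unsupported answer cannot be averaged into a single passing score.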

Tiny concrete example

User question: “What year was the company founded?”

Retrieved context: “The company was founded in 2014 in Berlin.”

Answer A: “The company was founded in 2014.”

Faithful: yes
Relevant: yes

Answer B: “The company was founded in 2014, and its headquarters are in Paris.”

Faithful: only partly. The founding year is supported, but the Paris headquarters claim is not in the context
Relevant: yes

Answer C: “The company uses a subscription model.”

Faithful: no against the context shown, which says nothing about a business model; yes only if other retrieved context states it
Relevant: no, because it does not answer the founding-year question
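
The faithfulness judgments in this example can be reproduced with a deliberately crude claim-level check. The sketch below is an assumption-laden illustration rather than a standard metric: each answer is split into claims by hand, each claim is reduced to a few hand-picked key terms, and a claim counts as supported only if all of its key terms appear in the single context sentence shown above. The names `ANSWER_CLAIMS`, `claim_supported`, and `faithfulness` are hypothetical.

```python
CONTEXT = "The company was founded in 2014 in Berlin."

# Claims are extracted by hand for this toy example; each claim is a
# tuple of key terms that must all appear in the context.
ANSWER_CLAIMS = {
    "A": [("founded", "2014")],
    "B": [("founded", "2014"), ("headquarters", "paris")],
    "C": [("subscription",)],
}


def claim_supported(key_terms, context):
    """Crude support test: every key term occurs as a substring of the context."""
    text = context.lower()
    return all(term.lower() in text for term in key_terms)


def faithfulness(claims, context):
    """Fraction of claims that pass the support test."""
    if not claims:
        raise ValueError("an answer must contain at least one claim")
    supported = sum(claim_supported(claim, context) for claim in claims)
    return supported / len(claims)


scores = {
    label: faithfulness(claims, CONTEXT)
    for label, claims in ANSWER_CLAIMS.items()
}
print(scores)  # {'A': 1.0, 'B': 0.5, 'C': 0.0}
```

Substring matching is far too weak for real use (it ignores negation, paraphrase, and partial word matches), but it shows why Answer B lands between fully faithful and unfaithful. Relevance is not scored here, because it depends on the question rather than on the context.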

Common pitfalls / when NOT to use it

Confusing retrieval quality with answer quality
A bad answer can come from bad retrieval, bad generation, or both. Evaluate them separately when possible.

Treating “faithful” as “true in the world”
Faithfulness usually means “supported by the provided context,” not necessarily globally correct.

Using a vague rubric
If you do not specify what counts as support, judges may score inconsistently.

Over-trusting automatic metrics
LLM judges are useful, but they can be noisy, biased, or overly lenient. Spot-check with humans.

Using only aggregate scores
A single average can hide systematic failures on certain question types, document types, or languages.

Skipping evaluation on the retrieval side
If the retriever misses the right evidence, the generator may look “unfaithful” even though the root cause is retrieval.
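
Two of the pitfalls above, relying on aggregate scores and skipping the retrieval side, can be illustrated with a small breakdown. The sketch below uses toy records invented purely for illustration (they are not real evaluation results), and the field names `question_type`, `retrieval_hit`, and `faithful` as well as the helper `mean_by` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy records invented for illustration; not real evaluation results.
# retrieval_hit marks whether the retriever returned the needed evidence.
records = [
    {"question_type": "lookup", "retrieval_hit": True, "faithful": 1.0},
    {"question_type": "lookup", "retrieval_hit": True, "faithful": 1.0},
    {"question_type": "lookup", "retrieval_hit": True, "faithful": 1.0},
    {"question_type": "lookup", "retrieval_hit": True, "faithful": 1.0},
    {"question_type": "multi_hop", "retrieval_hit": False, "faithful": 0.0},
    {"question_type": "multi_hop", "retrieval_hit": False, "faithful": 0.0},
    {"question_type": "multi_hop", "retrieval_hit": True, "faithful": 1.0},
    {"question_type": "multi_hop", "retrieval_hit": True, "faithful": 1.0},
]


def mean_by(rows, key, metric="faithful"):
    """Average a metric separately for each value of the grouping key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[metric])
    return {value: mean(scores) for value, scores in groups.items()}


overall = mean(row["faithful"] for row in records)
by_type = mean_by(records, "question_type")
by_retrieval = mean_by(records, "retrieval_hit")

print("overall:", overall)
print("by question type:", by_type)
print("by retrieval hit:", by_retrieval)
```

In this toy data the overall average of 0.75 hides that multi-hop questions score 0.5, and slicing by `retrieval_hit` shows that every unfaithful answer coincides with a retrieval miss, which points at the retriever rather than the generator.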