PaPoo

What is RAG evaluation (faithfulness and relevance)?

RAG evaluation is the process of checking whether a retrieval-augmented generation system’s answer is faithful to the provided context and relevant to the user’s question.

Why it matters

RAG systems are meant to answer with help from retrieved documents, not just from the model’s memory. Evaluation tells you whether the system is actually doing that well.

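In practice, teams use RAG evaluation to catch two common failure modes:

    • Unfaithful answers: the response states something the retrieved context does not support (a hallucination).
    • Irrelevant answers: the response does not address what the user actually asked.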

If you only measure one of these, you can miss serious issues. A response can be relevant but hallucinated, or faithful to the context but still unhelpful.
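
To make the two-metric point concrete, here is a minimal sketch that maps a pair of scores to a diagnosis. The function name `diagnose`, the 0-to-1 score scale, and the 0.5 threshold are illustrative assumptions, not part of any standard definition.

```python
def diagnose(faithfulness: float, relevance: float, threshold: float = 0.5) -> str:
    """Map two scores in [0, 1] to one of four outcomes.

    The threshold is an arbitrary illustrative cutoff (an assumption).
    """
    faithful = faithfulness >= threshold
    relevant = relevance >= threshold
    if faithful and relevant:
        return "ok"
    if relevant:
        return "relevant but unsupported (possible hallucination)"
    if faithful:
        return "grounded but off-topic or unhelpful"
    return "neither grounded nor relevant"


print(diagnose(0.9, 0.9))  # ok
print(diagnose(0.2, 0.9))  # relevant but unsupported (possible hallucination)
print(diagnose(0.9, 0.2))  # grounded but off-topic or unhelpful
```

Tracking only one of the two scores would collapse two of these four outcomes into "ok".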

How it works

A typical RAG evaluation scores two properties, faithfulness and relevance, and then decides at which level to score them and who does the judging:

  1. Faithfulness

    • Ask: Is every important claim in the answer supported by the retrieved context?
    • This is a grounding check. If the answer says something the documents do not support, faithfulness is low.
  2. Relevance

    • Ask: Does the answer actually address the user’s query?
    • This checks whether the system stayed on task and answered the right thing, rather than drifting into related but unnecessary details.
  3. Evaluate at multiple levels when needed

    • Some setups score the retrieved passages themselves for relevance to the query.
    • Others score the final answer for both faithfulness and relevance.
    • These are related but not identical. Good retrieval helps relevance, but it does not guarantee a faithful answer.
  4. Use human review, automatic judges, or both

    • Human evaluation is the clearest standard when stakes are high.
    • Automated evaluation often uses an LLM-as-judge or heuristic scoring to scale coverage.
    • The exact rubric matters: different papers and tools define these metrics slightly differently, so teams should read the rubric carefully rather than assume a universal definition.
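
The steps above can be sketched as a small harness with a pluggable judge. This is a sketch under stated assumptions: the `Judge` interface, the `RagScores` container, the `evaluate` function, and the rubric wording are all hypothetical and do not come from any particular tool. A real judge might wrap a human rating form or an LLM call; the stub here only exercises the plumbing.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a judge takes a prompt and returns a score in [0, 1].
Judge = Callable[[str], float]

# Illustrative rubric wording (an assumption; real rubrics differ between tools).
FAITHFULNESS_RUBRIC = (
    "Score from 0 to 1: is every important claim in the ANSWER "
    "supported by the CONTEXT?"
)
RELEVANCE_RUBRIC = "Score from 0 to 1: does the TEXT actually address the QUERY?"


@dataclass
class RagScores:
    passage_relevance: List[float]  # one score per retrieved passage
    answer_faithfulness: float
    answer_relevance: float


def evaluate(query: str, passages: List[str], answer: str, judge: Judge) -> RagScores:
    """Score retrieval and the final answer separately."""
    context = "\n".join(passages)
    # Retrieval level: is each passage relevant to the query?
    passage_relevance = [
        judge(f"{RELEVANCE_RUBRIC}\nQUERY: {query}\nTEXT: {p}") for p in passages
    ]
    # Answer level: grounding in the context, then relevance to the query.
    faith = judge(f"{FAITHFULNESS_RUBRIC}\nCONTEXT: {context}\nANSWER: {answer}")
    rel = judge(f"{RELEVANCE_RUBRIC}\nQUERY: {query}\nTEXT: {answer}")
    return RagScores(passage_relevance, faith, rel)


def stub_judge(prompt: str) -> float:
    """Placeholder that always returns 1.0; replace with a human or LLM judge."""
    return 1.0


scores = evaluate(
    "What year was the company founded?",
    ["The company was founded in 2014 in Berlin."],
    "The company was founded in 2014.",
    stub_judge,
)
print(scores)
```

Keeping the rubric text as explicit constants makes it easy to see, and to change, which definition of each metric is being applied.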

Tiny concrete example

User question: “What year was the company founded?”

Retrieved context: “The company was founded in 2014 in Berlin.”

Answer A: “The company was founded in 2014.”

Answer B: “The company was founded in 2014, and its headquarters are in Paris.”

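Answer C: “The company uses a subscription model.”

Answer A is both faithful and relevant: it answers the question, and the context supports it. Answer B is relevant but only partly faithful, because the context says nothing about headquarters in Paris. Answer C does not address the question, and the context does not support it either.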
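
As a rough illustration, the three answers can be scored with a crude word-overlap heuristic. Everything here is an assumption made for the sketch: the stopword list, the claim-splitting rule, and the function names `faithfulness` and `relevance` are hypothetical, and word overlap is far weaker than a human or LLM judge.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "was", "in", "and", "its", "are", "a", "what", "is", "of"}


def content_words(text: str) -> set:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def split_claims(answer: str) -> list:
    """Naive claim splitter: breaks on ', and' and on sentence punctuation."""
    parts = re.split(r",\s*and\s+|[.;]\s*", answer)
    return [p.strip() for p in parts if p.strip()]


def faithfulness(answer: str, context: str) -> float:
    """Fraction of claims whose content words all appear in the context."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    ctx = content_words(context)
    supported = [c for c in claims if content_words(c) <= ctx]
    return len(supported) / len(claims)


def relevance(answer: str, question: str) -> float:
    """Fraction of the question's content words that appear in the answer."""
    q = content_words(question)
    if not q:
        return 0.0
    return len(q & content_words(answer)) / len(q)


QUESTION = "What year was the company founded?"
CONTEXT = "The company was founded in 2014 in Berlin."
ANSWERS = {
    "A": "The company was founded in 2014.",
    "B": "The company was founded in 2014, and its headquarters are in Paris.",
    "C": "The company uses a subscription model.",
}

for name, answer in ANSWERS.items():
    f = faithfulness(answer, CONTEXT)
    r = relevance(answer, QUESTION)
    print(f"{name}: faithfulness={f:.2f} relevance={r:.2f}")
# A: faithfulness=1.00 relevance=0.67
# B: faithfulness=0.50 relevance=0.67
# C: faithfulness=0.00 relevance=0.33
```

The heuristic shows its limits even on this toy case: Answer C still earns some relevance because it shares the word "company" with the question, and no answer reaches full relevance because "year" never appears literally. That is one reason to read any metric's rubric before trusting its numbers.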

Common pitfalls / when NOT to use it

Related terms
