RAGChecker: A Fine-grained Evaluation Framework For Diagnosing RAG
RAGChecker is a comprehensive evaluation framework designed for Retrieval-Augmented Generation (RAG) systems. It provides a suite of metrics to assess both the retrieval and generation components of RAG systems, offering detailed insights into their performance.
Key features of RAGChecker include:
- Fine-grained analysis using claim-level entailment checking
- Comprehensive metrics for overall performance, retriever effectiveness, and generator accuracy
- Actionable insights for improving RAG systems
For more information, visit the RAGChecker GitHub repository.
RAGChecker Metrics
RAGChecker provides a comprehensive set of metrics to evaluate different aspects of RAG systems:
- Overall Metrics:
  - Precision: The proportion of correct claims in the model’s response.
  - Recall: The proportion of ground truth claims covered by the model’s response.
  - F1 Score: The harmonic mean of precision and recall.
- Retriever Metrics:
  - Claim Recall: The proportion of ground truth claims covered by the retrieved chunks.
  - Context Precision: The proportion of retrieved chunks that are relevant.
- Generator Metrics:
  - Context Utilization: How well the generator uses relevant information from the retrieved chunks.
  - Noise Sensitivity: The generator’s tendency to include incorrect information from retrieved chunks.
  - Hallucination: The proportion of incorrect claims not found in any retrieved chunks.
  - Self-knowledge: The proportion of correct claims not found in any retrieved chunks.
  - Faithfulness: How closely the generator’s response aligns with the retrieved chunks.
These metrics provide a nuanced evaluation of both the retrieval and generation components, allowing for targeted improvements in RAG systems.
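As a quick illustration of how the overall F1 score is derived from precision and recall, here is a minimal sketch (not part of the RAGChecker API); the example values reproduce the overall scores shown in the sample output later in this notebook:

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall.
# This helper is illustrative only and is not part of RAGChecker.
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(66.7, 27.3), 1))  # 38.7, matching the sample output below
```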
Install Requirements
```python
%pip install -qU ragchecker llama-index
```
Setup and Imports
First, let’s import the necessary libraries:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from ragchecker.integrations.llama_index import response_to_rag_results
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics
```
Creating a LlamaIndex Query Engine
Now, let’s create a simple LlamaIndex query engine using a sample dataset:
```python
# Load documents
documents = SimpleDirectoryReader("path/to/your/documents").load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Create query engine
rag_application = index.as_query_engine()
```
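The query engine’s retrieval settings directly influence RAGChecker’s retriever metrics. As an optional tweak, here is a hedged example of retrieving more chunks per query (`similarity_top_k` is a standard LlamaIndex parameter; the value 5 is purely illustrative):

```python
# Optional: retrieve more chunks per query, which affects claim recall and
# context precision downstream. The value 5 is illustrative, not a recommendation.
rag_application = index.as_query_engine(similarity_top_k=5)
```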
Using RAGChecker with LlamaIndex
Now we’ll demonstrate how to use the `response_to_rag_results` function to convert LlamaIndex output to the RAGChecker format:
```python
# User query and ground truth answer
user_query = "What is RAGChecker?"
gt_answer = "RAGChecker is an advanced automatic evaluation framework designed to assess and diagnose Retrieval-Augmented Generation (RAG) systems. It provides a comprehensive suite of metrics and tools for in-depth analysis of RAG performance."

# Get response from LlamaIndex
response_object = rag_application.query(user_query)

# Convert to RAGChecker format
rag_result = response_to_rag_results(
    query=user_query,
    gt_answer=gt_answer,
    response_object=response_object,
)

# Create RAGResults object
rag_results = RAGResults.from_dict({"results": [rag_result]})
print(rag_results)
```
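In practice you will typically evaluate more than one query. Here is a minimal sketch of batching several query/ground-truth pairs into a single `RAGResults` object, reusing the imports and `rag_application` from the cells above (`eval_pairs` is an illustrative placeholder for your own evaluation data):

```python
# Illustrative batch evaluation; eval_pairs is a hypothetical list of
# (query, ground-truth answer) tuples you would supply yourself.
eval_pairs = [
    ("What is RAGChecker?", "RAGChecker is an advanced automatic evaluation framework ..."),
]

results = []
for query, gt in eval_pairs:
    response = rag_application.query(query)
    results.append(
        response_to_rag_results(query=query, gt_answer=gt, response_object=response)
    )

rag_results = RAGResults.from_dict({"results": results})
```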
Evaluating with RAGChecker
Now that we have our results in the correct format, let’s evaluate them using RAGChecker:
```python
# Initialize RAGChecker
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32,
)

# Evaluate using RAGChecker
evaluator.evaluate(rag_results, all_metrics)

# Print detailed results
print(rag_results)
```
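The extractor and checker above run on a Bedrock-hosted Llama 3 model. The identifiers follow provider-prefixed naming, so, assuming RAGChecker accepts other provider-prefixed model names in the same way and you have the corresponding credentials configured, you should be able to swap models in. An illustrative sketch:

```python
# Illustrative alternative configuration. Assumption: RAGChecker accepts other
# provider-prefixed model names the same way; requires OPENAI_API_KEY to be set.
evaluator = RAGChecker(
    extractor_name="openai/gpt-4o",
    checker_name="openai/gpt-4o",
    batch_size_extractor=32,
    batch_size_checker=32,
)
```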
The output will look something like this:
```
RAGResults(
  1 RAG results,
  Metrics:
  {
    "overall_metrics": {
      "precision": 66.7,
      "recall": 27.3,
      "f1": 38.7
    },
    "retriever_metrics": {
      "claim_recall": 54.5,
      "context_precision": 100.0
    },
    "generator_metrics": {
      "context_utilization": 16.7,
      "noise_sensitivity_in_relevant": 0.0,
      "noise_sensitivity_in_irrelevant": 0.0,
      "hallucination": 33.3,
      "self_knowledge": 0.0,
      "faithfulness": 66.7
    }
  }
)
```
This output provides a comprehensive view of the RAG system’s performance, including overall metrics, retriever metrics, and generator metrics as described in the earlier section.
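If you want to consume the scores programmatically rather than reading the printed summary, here is a sketch assuming the computed scores are exposed as a nested `metrics` dictionary keyed by the group names shown above:

```python
# Assumption: RAGResults exposes computed scores as a nested dict under
# .metrics, keyed by the group names visible in the printed output above.
print(rag_results.metrics["overall_metrics"]["f1"])
```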
Selecting Specific Metric Groups
Instead of evaluating all the metrics with `all_metrics`, you can choose specific metric groups as follows:
```python
from ragchecker.metrics import (
    overall_metrics,
    retriever_metrics,
    generator_metrics,
)
```
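For example, to score only the retrieval side, pass a single group in place of `all_metrics`:

```python
# Evaluate only the retriever-related metrics (claim recall, context precision),
# mirroring the evaluate(rag_results, all_metrics) call shown earlier.
evaluator.evaluate(rag_results, retriever_metrics)
```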
Selecting Individual Metrics
For even more granular control, you can choose specific individual metrics for your needs:
```python
from ragchecker.metrics import (
    precision,
    recall,
    f1,
    claim_recall,
    context_precision,
    context_utilization,
    noise_sensitivity_in_relevant,
    noise_sensitivity_in_irrelevant,
    hallucination,
    self_knowledge,
    faithfulness,
)
```
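These can then be passed to the evaluator; the sketch below assumes `evaluate` accepts a list of individual metrics, mirroring the group usage above:

```python
# Assumption: evaluate accepts a list of individual metrics, analogous to
# passing a metric group such as all_metrics or retriever_metrics.
evaluator.evaluate(rag_results, [precision, recall, faithfulness])
```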
Conclusion
This notebook has demonstrated how to integrate RAGChecker with LlamaIndex to evaluate the performance of RAG systems. We’ve covered:
- Setting up RAGChecker with LlamaIndex
- Converting LlamaIndex outputs to RAGChecker format
- Evaluating RAG results using various metrics
- Customizing evaluations with specific metric groups or individual metrics
By leveraging RAGChecker’s comprehensive metrics, you can gain valuable insights into your RAG system’s performance, identify areas for improvement, and optimize both retrieval and generation components. This integration provides a powerful tool for developing and refining more effective RAG applications.