Usage Pattern (Retrieval)
Using RetrieverEvaluator
This runs evaluation over a single query + ground-truth document set, given a retriever.
The standard practice is to specify a set of valid metrics with `from_metric_names`.
```python
from llama_index.core.evaluation import RetrieverEvaluator

# define retriever somewhere (e.g. from index)
# retriever = index.as_retriever(similarity_top_k=2)
retriever = ...

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

retriever_evaluator.evaluate(
    query="query", expected_ids=["node_id1", "node_id2"]
)
```
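Each call returns a result object holding the per-metric scores for that single query. Below is a minimal sketch of capturing and printing them; the `metric_vals_dict` attribute is taken from recent llama-index versions and may differ in yours.

```python
# capture the result instead of discarding it; metric_vals_dict maps
# metric name -> score for this query (assumed attribute)
eval_result = retriever_evaluator.evaluate(
    query="query", expected_ids=["node_id1", "node_id2"]
)
print(eval_result.metric_vals_dict)  # e.g. {"mrr": 1.0, "hit_rate": 1.0}
```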
Building an Evaluation Dataset
You can manually curate a retrieval evaluation dataset of questions + node IDs. We also offer synthetic dataset generation over an existing text corpus with our `generate_question_context_pairs` function:
```python
from llama_index.core.evaluation import generate_question_context_pairs

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
```
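The snippet above assumes `nodes` and `llm` are already defined. One possible setup is sketched below; the data directory, chunk size, and model name are illustrative, and the OpenAI integration requires the `llama-index-llms-openai` package.

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI  # requires llama-index-llms-openai

# split source documents into the nodes the generator will write questions for
documents = SimpleDirectoryReader("./data").load_data()
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)

# LLM used to write the synthetic questions (model name is illustrative)
llm = OpenAI(model="gpt-4o-mini")
```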
`generate_question_context_pairs` returns an `EmbeddingQAFinetuneDataset` object (containing `queries`, `relevant_docs`, and `corpus`).
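Since generation can be slow and costly, it is common to persist the dataset and reload it later via `save_json`/`from_json`. A small sketch (file name is arbitrary) that also shows how the three fields relate:

```python
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

# persist and reload the generated dataset
qa_dataset.save_json("qa_dataset.json")
qa_dataset = EmbeddingQAFinetuneDataset.from_json("qa_dataset.json")

# queries and relevant_docs are keyed by query id; corpus is keyed by node id
for query_id, query in list(qa_dataset.queries.items())[:2]:
    expected_node_ids = qa_dataset.relevant_docs[query_id]
    print(query, "->", expected_node_ids)
```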
Plugging it into RetrieverEvaluator
We offer a convenience function to run a `RetrieverEvaluator` over a dataset in batch mode.
```python
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
```
This should run much faster than calling `.evaluate` on each query separately.
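Each entry in `eval_results` scores one query; to get dataset-level numbers you typically average the per-query metrics. A sketch using pandas, again assuming the `metric_vals_dict` attribute from the single-query example above:

```python
import pandas as pd

# one row per query, one column per metric
metric_df = pd.DataFrame([r.metric_vals_dict for r in eval_results])
print(metric_df[["hit_rate", "mrr"]].mean())
```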