# Evaluating With `LabelledRagDataset`'s
We have already gone through the core abstractions within the Evaluation module that enable various kinds of evaluation methodologies for LLM-based applications or systems, including RAG systems. Of course, to evaluate a system one needs an evaluation method, the system itself, as well as evaluation datasets. It is considered best practice to test the LLM application on several distinct datasets emanating from different sources and domains. Doing so helps to ensure the overall robustness of the system, that is, the degree to which it works well on new, unseen cases.
To this end, we've included the `LabelledRagDataset` abstraction in our library. Its core purpose is to facilitate the evaluation of systems on various datasets, by making these datasets easy to create, easy to use, and widely available.
This dataset consists of examples, where each example carries a `query`, a `reference_answer`, as well as `reference_contexts`. The main reason for using a `LabelledRagDataset` is to test a RAG system's performance by first predicting a response to the given `query` and then comparing that predicted (or generated) response to the `reference_answer`.
```python
from llama_index.core.llama_dataset import (
    LabelledRagDataset,
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
)

example1 = LabelledRagDataExample(
    query="This is some user query.",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_answer="This is a reference answer. Otherwise known as ground-truth answer.",
    reference_contexts=[
        "This is a list",
        "of contexts used to",
        "generate the reference_answer",
    ],
    reference_by=CreatedBy(type=CreatedByType.HUMAN),
)

# a sad dataset consisting of one measly example
rag_dataset = LabelledRagDataset(examples=[example1])
```
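To sanity check what the dataset holds, the examples are exposed on the `examples` attribute, and recent versions of the library also provide a `to_pandas()` helper; a minimal sketch, assuming those are available in your installed version:

```python
# inspect the (single-example) dataset built above
print(len(rag_dataset.examples))        # -> 1
print(rag_dataset.examples[0].query)    # -> "This is some user query."

# render the dataset as a pandas DataFrame for a quick visual check
df = rag_dataset.to_pandas()
print(df.head())
```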
## Building A `LabelledRagDataset`

As we just saw at the end of the previous section, we can build a `LabelledRagDataset` manually by constructing `LabelledRagDataExample`'s one by one. However, this is a bit tedious, and while human-annotated datasets are extremely valuable, datasets that are generated by strong LLMs are also very useful.

As such, the `llama_dataset` module is equipped with the `RagDatasetGenerator` that is able to generate a `LabelledRagDataset` over a set of source `Document`'s.
```python
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.openai import OpenAI
import nest_asyncio

nest_asyncio.apply()

documents = ...  # a set of documents loaded by using, for example, a Reader

llm = OpenAI(model="gpt-4")

dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=10,  # set the number of questions per node
)

rag_dataset = dataset_generator.generate_dataset_from_nodes()
```
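Generating a dataset with a strong LLM can take a while and consume tokens, so it is worth persisting the result. A minimal sketch, assuming the `save_json`/`from_json` helpers available on `LabelledRagDataset` in recent versions of the library:

```python
from llama_index.core.llama_dataset import LabelledRagDataset

# persist the generated dataset so it can be re-used without re-generating
rag_dataset.save_json("rag_dataset.json")

# ...and load it back later
rag_dataset = LabelledRagDataset.from_json("rag_dataset.json")
```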
## Using A `LabelledRagDataset`

As mentioned before, we want to use a `LabelledRagDataset` to evaluate the performance of a RAG system built over the same source `Document`'s. Doing so requires two steps: (1) making predictions on the dataset (i.e., generating a response to the query of each individual example), and (2) evaluating the predicted response by comparing it to the reference answer. In step (2), we also evaluate the RAG system's retrieved contexts by comparing them to the reference contexts, to gain an assessment of the retrieval component of the RAG system.
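The RAG system under test needs to be built over the same source `Document`'s that produced the dataset. As a minimal sketch (not the only possible setup), a simple query engine over a `VectorStoreIndex` built from the `documents` loaded earlier would do:

```python
from llama_index.core import VectorStoreIndex

# build a RAG system over the same source documents used to create rag_dataset
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
```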
For convenience, we have a `LlamaPack` called the `RagEvaluatorPack` that streamlines this evaluation process!
```python
from llama_index.core.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")

rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,  # built with the same source Documents as the rag_dataset
    rag_dataset=rag_dataset,
)
benchmark_df = await rag_evaluator.run()
```
The above `benchmark_df` contains the mean scores for the evaluation measures introduced previously: Correctness, Relevancy, Faithfulness, as well as Context Similarity, which measures the semantic similarity between the reference contexts and the contexts retrieved by the RAG system to generate the predicted response.
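If you prefer not to use the pack, the same comparison can be performed per example with the evaluators from the Evaluation module. A minimal sketch using `CorrectnessEvaluator` on a single example (the choice of judge LLM here is only illustrative):

```python
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

judge = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))

example = rag_dataset.examples[0]
predicted_response = query_engine.query(example.query)

# compare the predicted (generated) response against the reference answer
eval_result = judge.evaluate(
    query=example.query,
    response=str(predicted_response),
    reference=example.reference_answer,
)
print(eval_result.score, eval_result.feedback)
```

The `RagEvaluatorPack` essentially automates this loop over all examples (and over the other evaluators) and aggregates the mean scores for you.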
## Where To Find `LabelledRagDataset`'s

You can find all of the `LabelledRagDataset`'s in llamahub. You can browse each one of these, and if you decide that you'd like to use one to benchmark your RAG workflow, then you can download the dataset as well as the source `Document`'s conveniently through one of two ways: the `llamaindex-cli` or the `download_llama_dataset` utility function in Python.
```bash
# using cli
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
```

```python
# using python
from llama_index.core.llama_dataset import download_llama_dataset

# a LabelledRagDataset and a list of source Document's
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./data"
)
```
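Once downloaded, both artifacts plug straight into the workflow from the previous section. A minimal end-to-end sketch, re-using the `RagEvaluatorPack` downloaded earlier:

```python
from llama_index.core import VectorStoreIndex

# build a query engine over the downloaded source documents
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# benchmark it against the downloaded rag_dataset
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,
    rag_dataset=rag_dataset,
)
benchmark_df = await rag_evaluator.run()
```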
## Contributing A `LabelledRagDataset`

You can also contribute a `LabelledRagDataset` to llamahub. Contributing a `LabelledRagDataset` involves two high-level steps. Generally speaking, you must create the `LabelledRagDataset`, save it as a JSON file, and submit both this JSON file and the source text files to our llama_datasets GitHub repository. Additionally, you'll have to make a pull request to upload the required metadata of the dataset to our llama_hub GitHub repository.

Please refer to the "LlamaDataset Submission Template Notebook" linked below.
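For the first of those steps, the dataset itself can be written to JSON directly from the object; the file layout for the source texts shown below is only illustrative, so follow the submission template for the exact structure:

```python
import os

# step 1: save the dataset as a json file for submission
rag_dataset.save_json("rag_dataset.json")

# keep the raw source text files the dataset was built from (illustrative layout)
os.makedirs("source_files", exist_ok=True)
for i, document in enumerate(documents):
    with open(os.path.join("source_files", f"source_{i}.txt"), "w") as f:
        f.write(document.text)
```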
## Now, Go And Build Robust LLM Applications

This page has hopefully served as a good starting point for you to create, download, and use `LlamaDataset`'s for building robust and performant LLM applications. To learn more, we recommend reading the notebook guides provided below.