---
title: Evaluating With `LabelledRagDataset`'s | Developer Documentation
---

We have already gone through the core abstractions within the Evaluation module that enable various evaluation methodologies for LLM-based applications and systems, including RAG systems. To evaluate a system, one needs an evaluation method, the system itself, and evaluation datasets. It is considered best practice to test an LLM application on several distinct datasets drawn from different sources and domains. Doing so helps ensure the overall robustness of the system (that is, the degree to which it works on new, unseen cases).

To this end, we’ve included the `LabelledRagDataset` abstraction in our library. Its core purpose is to facilitate the evaluation of systems on various datasets by making these datasets easy to create, easy to use, and widely available.

This dataset consists of examples, where each example carries a `query`, a `reference_answer`, and `reference_contexts`. The main reason for using a `LabelledRagDataset` is to test a RAG system’s performance by first predicting a response to the given `query` and then comparing that predicted (or generated) response to the `reference_answer`.

```
from llama_index.core.llama_dataset import (
    LabelledRagDataset,
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
)


example1 = LabelledRagDataExample(
    query="This is some user query.",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_answer="This is a reference answer. Otherwise known as ground-truth answer.",
    reference_contexts=[
        "This is a list",
        "of contexts used to",
        "generate the reference_answer",
    ],
    reference_by=CreatedBy(type=CreatedByType.HUMAN),
)


# a sad dataset consisting of one measly example
rag_dataset = LabelledRagDataset(examples=[example1])
```

## Building A `LabelledRagDataset`

As we just saw at the end of the previous section, we can build a `LabelledRagDataset` manually by constructing `LabelledRagDataExample`’s one by one. However, this is tedious, and while human-annotated datasets are extremely valuable, datasets generated by strong LLMs are also very useful.

As such, the `llama_dataset` module provides the `RagDatasetGenerator`, which can generate a `LabelledRagDataset` from a set of source `Document`’s.

```
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.openai import OpenAI
import nest_asyncio


nest_asyncio.apply()


documents = ...  # a set of documents loaded by using for example a Reader


llm = OpenAI(model="gpt-4")


dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=10,  # number of questions to generate per chunk (node)
)


rag_dataset = dataset_generator.generate_dataset_from_nodes()
```

## Using A `LabelledRagDataset`

As mentioned before, we want to use a `LabelledRagDataset` to evaluate the performance of a RAG system built on the same source `Document`’s. Doing so requires two steps: (1) making predictions on the dataset (i.e., generating a response to the query of each individual example), and (2) evaluating each predicted response by comparing it to the reference answer. In step (2) we also compare the contexts retrieved by the RAG system to the reference contexts, to assess the retrieval component of the RAG system.
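
The two steps can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: `Example`, `token_overlap`, `evaluate`, and `echo_system` are hypothetical names, and the word-overlap score is a crude stand-in for the LLM-based evaluators a real benchmark would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Example:
    """Plain-Python stand-in for one labelled example (not the library class)."""

    query: str
    reference_answer: str
    reference_contexts: List[str]


def token_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase word sets, in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# A "RAG system" here is any callable returning (response, retrieved_contexts).
RagSystem = Callable[[str], Tuple[str, List[str]]]


def evaluate(examples: List[Example], rag_system: RagSystem) -> Dict[str, float]:
    """Run both steps over every example and return the mean scores."""
    if not examples:
        raise ValueError("examples must not be empty")
    answer_scores, context_scores = [], []
    for ex in examples:
        # Step 1: predict a response (and collect the retrieved contexts).
        response, contexts = rag_system(ex.query)
        # Step 2: compare the prediction and the retrieval to the references.
        answer_scores.append(token_overlap(response, ex.reference_answer))
        context_scores.append(
            token_overlap(" ".join(contexts), " ".join(ex.reference_contexts))
        )
    n = len(examples)
    return {
        "mean_answer_score": sum(answer_scores) / n,
        "mean_context_score": sum(context_scores) / n,
    }


def echo_system(query: str) -> Tuple[str, List[str]]:
    """Hypothetical system that always returns the same answer and context."""
    return "Paris is the capital of France", ["Paris is the capital of France"]


examples = [
    Example(
        query="What is the capital of France?",
        reference_answer="Paris is the capital of France",
        reference_contexts=["Paris is the capital of France"],
    )
]
scores = evaluate(examples, echo_system)
```

Swapping `echo_system` for a function that wraps a real query engine leaves the loop unchanged; only the scoring functions would need to be replaced with proper evaluators.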

For convenience, we have a `LlamaPack` called the `RagEvaluatorPack` that streamlines this evaluation process!

```
from llama_index.core.llama_pack import download_llama_pack


RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")


rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,  # built with the same source Documents as the rag_dataset
    rag_dataset=rag_dataset,
)
benchmark_df = await rag_evaluator.arun()  # in synchronous code, call rag_evaluator.run() instead
```

The `benchmark_df` above contains the mean scores for the evaluation measures introduced previously: `Correctness`, `Relevancy`, and `Faithfulness`, as well as `Context Similarity`, which measures the semantic similarity between the reference contexts and the contexts retrieved by the RAG system to generate the predicted response.
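
To build intuition for a similarity-style measure such as `Context Similarity`, the sketch below scores two sets of contexts with cosine similarity over word-count vectors and then averages across examples. This is a toy under stated assumptions: semantic similarity is normally computed on embeddings rather than raw word counts, and `bow_cosine` and `context_similarity` are hypothetical helpers, not functions from the pack.

```python
import math
from collections import Counter
from typing import List


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_similarity(reference: List[str], retrieved: List[str]) -> float:
    """Compare the concatenated reference contexts with the retrieved ones."""
    return bow_cosine(" ".join(reference), " ".join(retrieved))


# One score per example, then a single mean for the whole dataset.
per_example = [
    context_similarity(["the cat sat"], ["the cat sat"]),  # identical wording
    context_similarity(["the cat sat"], ["dogs bark loudly"]),  # no shared words
]
mean_context_similarity = sum(per_example) / len(per_example)
```

A word-count vector treats paraphrases as dissimilar, which is exactly the gap that embedding-based similarity is meant to close.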

## Where To Find `LabelledRagDataset`’s

You can find all of the `LabelledRagDataset`’s in [llamahub](https://llamahub.ai). You can browse each one, and if you decide to use one to benchmark your RAG workflow, you can download the dataset together with its source `Document`’s in one of two ways: with the `llamaindex-cli`, or through Python code using the `download_llama_dataset` utility function.

```
# using cli
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
```

```
# using python
from llama_index.core.llama_dataset import download_llama_dataset


# a LabelledRagDataset and a list of source Document's
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./data"
)
```

### Contributing A `LabelledRagDataset`

You can also contribute a `LabelledRagDataset` to [llamahub](https://llamahub.ai). Contributing a `LabelledRagDataset` involves two high-level steps. First, you must create the `LabelledRagDataset`, save it as a JSON file, and submit both this JSON file and the source text files to our [llama\_datasets](https://github.com/run-llama/llama_datasets) GitHub repository. Second, you must make a pull request to upload the required metadata of the dataset to our [llama\_hub](https://github.com/run-llama/llama-hub) GitHub repository.

Please refer to the “LlamaDataset Submission Template Notebook” linked below.
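
Before submitting, it can help to sanity-check that every example is fully labelled. The sketch below does this on plain dictionaries and round-trips the clean examples through a JSON file. It is only an illustration: `find_problems` and `REQUIRED_FIELDS` are hypothetical, and the `{"examples": [...]}` layout is an assumption, since the actual file format is whatever the library writes when it saves a dataset.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Field names taken from the example structure described on this page.
REQUIRED_FIELDS = ("query", "reference_answer", "reference_contexts")


def find_problems(examples: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    for i, example in enumerate(examples):
        for field in REQUIRED_FIELDS:
            # Missing keys, empty strings, and empty lists all count as problems.
            if not example.get(field):
                problems.append(f"example {i}: missing or empty '{field}'")
    return problems


examples = [
    {
        "query": "This is some user query.",
        "reference_answer": "This is a reference answer.",
        "reference_contexts": ["This is a list", "of contexts"],
    },
    {
        "query": "A query without labels.",
        "reference_answer": "",
        "reference_contexts": [],
    },
]

problems = find_problems(examples)

# Keep only fully labelled examples, write them out, and read them back.
clean = [example for example in examples if not find_problems([example])]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "rag_dataset.json"
    path.write_text(json.dumps({"examples": clean}, indent=2), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))
```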

## Now, Go And Build Robust LLM Applications

Hopefully, this page has served as a good starting point for creating, downloading, and using `LlamaDataset`’s to build robust and performant LLM applications. To learn more, we recommend reading the notebook guides listed below.

## Resources

- [Labelled RAG datasets](/python/examples/llama_dataset/labelled-rag-datasets/index.md)
- [Downloading Llama datasets](/python/examples/llama_dataset/downloading_llama_datasets/index.md)