---
title: Evaluating Evaluators with `LabelledEvaluatorDataset`s | LlamaIndex OSS Documentation
---

The purpose of llama-datasets is to give builders a quick way to benchmark LLM systems or tasks. In that spirit, the `LabelledEvaluatorDataset` exists to make evaluating evaluators themselves seamless and effortless.

Each example in this dataset carries mainly the following attributes: `query`, `answer`, `ground_truth_answer`, `reference_score`, and `reference_feedback`, along with a few supplementary ones. The user flow for producing evaluations with this dataset is to make predictions over the dataset with a provided LLM evaluator, and then to compute metrics that measure how good those evaluations are by comparing them to the corresponding references.
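For concreteness, the sketch below constructs a single example by hand using the attributes listed above. This is purely illustrative; in practice you download a pre-built dataset as shown in the next snippet, and the example class name and exact field requirements assumed here may differ slightly between versions.

```python
from llama_index.core.llama_dataset import LabelledEvaluatorDataExample

# a hand-rolled example, purely for illustration -- real datasets are
# downloaded pre-built (see the snippet below)
example = LabelledEvaluatorDataExample(
    query="What is the capital of France?",
    answer="The capital of France is Paris.",
    ground_truth_answer="Paris",
    reference_score=5.0,
    reference_feedback="The answer is correct and complete.",
)

print(example.reference_score)  # the reference an LLM evaluator is judged against
```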

Below is a code snippet that uses the `EvaluatorBenchmarkerPack` to conveniently handle the process flow described above.

```python
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.gemini import Gemini


# download dataset
evaluator_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./mini_mt_bench_data"
)


# define evaluator
gemini_pro_llm = Gemini(model="models/gemini-pro", temperature=0)
evaluator = CorrectnessEvaluator(llm=gemini_pro_llm)


# download EvaluatorBenchmarkerPack and define the benchmarker
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=evaluator_dataset,
    show_progress=True,
)


# produce the benchmark result
benchmark_df = await evaluator_benchmarker.arun(
    batch_size=5, sleep_time_in_seconds=0.5
)
```
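The `await` call above assumes an async context such as a notebook. In a plain Python script, a minimal way to drive the same coroutine is to wrap it with `asyncio.run`, for instance:

```python
import asyncio

# run the benchmark outside a notebook/async context
benchmark_df = asyncio.run(
    evaluator_benchmarker.arun(batch_size=5, sleep_time_in_seconds=0.5)
)
print(benchmark_df)
```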

## The related `LabelledPairwiseEvaluatorDataset`

A related llama-dataset is the `LabelledPairwiseEvaluatorDataset`, which is likewise meant to evaluate an evaluator, but this time the evaluator is tasked with comparing a pair of LLM responses to a given query and determining which of the two is better. The usage flow described above is exactly the same as for the `LabelledEvaluatorDataset`, with the exception that the LLM evaluator must be equipped to perform the pairwise evaluation task, i.e., it should be a `PairwiseComparisonEvaluator`. A sketch of this pairwise flow follows.
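The snippet below mirrors the earlier flow for the pairwise setting. The dataset name `MtBenchHumanJudgementDataset` is assumed here (it corresponds to the human-judgement notebook linked below), while `PairwiseComparisonEvaluator` provides the pairwise-capable evaluator; treat this as a sketch rather than a definitive recipe.

```python
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.llms.gemini import Gemini

# download a pairwise llama-dataset (dataset name assumed here)
pairwise_dataset, _ = download_llama_dataset(
    "MtBenchHumanJudgementDataset", "./mt_bench_human_judgement_data"
)

# the evaluator must support the pairwise comparison task
pairwise_evaluator = PairwiseComparisonEvaluator(
    llm=Gemini(model="models/gemini-pro", temperature=0)
)

# benchmark exactly as before, reusing the EvaluatorBenchmarkerPack
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=pairwise_evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)
benchmark_df = await evaluator_benchmarker.arun(
    batch_size=5, sleep_time_in_seconds=0.5
)
```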

## More learning materials

To see these datasets in action, be sure to check out the notebooks listed below, which benchmark LLM evaluators on slightly adapted versions of the MT-Bench dataset.

- [MTBench Single Grading](/python/examples/evaluation/mt_bench_single_grading/index.md)
- [MTBench Human Judge](/python/examples/evaluation/mt_bench_human_judgement/index.md)
