Evaluating Evaluators with `LabelledEvaluatorDataset`'s
The purpose of llama-datasets is to provide builders with the means to quickly benchmark LLM systems or tasks. In that spirit, the `LabelledEvaluatorDataset` exists to facilitate the evaluation of evaluators in a seamless and effortless manner. This dataset consists of examples that mainly carry the following attributes: `query`, `answer`, `ground_truth_answer`, `reference_score`, and `reference_feedback`, along with some other supplementary attributes. The user flow for producing evaluations with this dataset consists of making predictions over the dataset with a provided LLM evaluator, and then computing metrics that measure the goodness of those evaluations by comparing them to the corresponding references.
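For a concrete feel of the schema, here is a minimal sketch that downloads the dataset used later in this guide and inspects one example; it assumes the attributes listed above are exposed as fields of the same name on each example object.

```python
from llama_index.core.llama_dataset import download_llama_dataset

# download a LabelledEvaluatorDataset (the same one used in the benchmark below)
evaluator_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./mini_mt_bench_data"
)

# inspect a single example; each attribute mirrors the list above
# (ground_truth_answer may be None for datasets that grade without a reference answer)
example = evaluator_dataset.examples[0]
print(example.query)
print(example.answer)
print(example.ground_truth_answer)
print(example.reference_score)
print(example.reference_feedback)

# the whole dataset can also be viewed as a pandas DataFrame
print(evaluator_dataset.to_pandas().head())
```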
Below is a snippet of code that makes use of the `EvaluatorBenchmarkerPack` to conveniently handle the above-mentioned process flow.
```python
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.gemini import Gemini

# download dataset
evaluator_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./mini_mt_bench_data"
)

# define evaluator
gemini_pro_llm = Gemini(model="models/gemini-pro", temperature=0)
evaluator = CorrectnessEvaluator(llm=gemini_pro_llm)

# download EvaluatorBenchmarkerPack and define the benchmarker
EvaluatorBenchmarkerPack = download_llama_pack("EvaluatorBenchmarkerPack", "./pack")
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=evaluator_dataset,
    show_progress=True,
)

# produce the benchmark result
benchmark_df = await evaluator_benchmarker.arun(
    batch_size=5, sleep_time_in_seconds=0.5
)
```
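The `batch_size` and `sleep_time_in_seconds` arguments throttle the asynchronous calls made to the evaluator LLM; conservative values like the ones above help stay within provider rate limits and can be raised if your quota allows.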
The related `LabelledPairwiseEvaluatorDataset`
A related llama-dataset is the `LabelledPairwiseEvaluatorDataset`, which is again meant to evaluate an evaluator, but this time the evaluator is tasked with comparing a pair of LLM responses to a given query and determining the better of the two. The usage flow described above is exactly the same as for the `LabelledEvaluatorDataset`, with the exception that the LLM evaluator must be equipped to perform the pairwise evaluation task; that is, it should be a `PairwiseComparisonEvaluator`.
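As a rough sketch of that substitution (the pairwise dataset name below is illustrative; swap in whichever `LabelledPairwiseEvaluatorDataset` you want to benchmark against), the same benchmarker can be fed a `PairwiseComparisonEvaluator`:

```python
from llama_index.core.evaluation import PairwiseComparisonEvaluator

# reuse the Gemini LLM from above, now wrapped in a pairwise evaluator
pairwise_evaluator = PairwiseComparisonEvaluator(llm=gemini_pro_llm)

# download a LabelledPairwiseEvaluatorDataset (name shown is illustrative)
pairwise_dataset, _ = download_llama_dataset(
    "MtBenchHumanJudgementDataset", "./mt_bench_data"
)

# benchmark exactly as before
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=pairwise_evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)
benchmark_df = await evaluator_benchmarker.arun(
    batch_size=5, sleep_time_in_seconds=0.5
)
```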
More learning materials
To see these datasets in action, be sure to check out the notebooks listed below, which benchmark LLM evaluators on slightly adapted versions of the MT-Bench dataset.