
Fully Local RAG Pipeline with Chroma + Ollama

No API key required. This notebook runs entirely on your local machine using Ollama for both the LLM and embeddings, and ChromaDB as the vector store.


The notebook walks through six steps:

  1. Configure the pipeline — all tunables in one place
  2. Ingest and chunk a sample document with SentenceSplitter
  3. Embed chunks with OllamaEmbedding and persist in ChromaDB
  4. Query the index with a local LLM (llama3.2:3b via Ollama)
  5. Evaluate retrieval quality against a gold Q&A set (hit-rate & MRR)
  6. Explore failure modes: empty context, long queries, hallucination guard
Prerequisites:

  1. Ollama installed and running — download here
  2. Models pulled:

    ollama pull llama3.2:3b
    ollama pull nomic-embed-text
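Before running the notebook, you can confirm that the Ollama server is reachable. The sketch below is a minimal stdlib-only health check (the `/api/tags` endpoint lists locally pulled models; the function name `ollama_is_up` is ours, not part of any library):

```python
import json
import urllib.request
import urllib.error

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama server responds at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
            models = json.load(resp).get("models", [])
            print("Ollama is up. Pulled models:", [m["name"] for m in models])
            return True
    except (urllib.error.URLError, OSError):
        print(f"Could not reach Ollama at {base_url}; is `ollama serve` running?")
        return False

ollama_is_up()
```

If this prints the failure message, start the server (or the desktop app) before continuing.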
%pip install -q llama-index-core llama-index-llms-ollama llama-index-embeddings-ollama llama-index-vector-stores-chroma llama-index-readers-file chromadb
import json
import logging
import shutil
from pathlib import Path
import chromadb
from IPython.display import Markdown, display
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    datefmt="%H:%M:%S",
)
logger = logging.getLogger("local_rag")
logger.info("Imports loaded successfully.")
17:49:30 [INFO] local_rag: Imports loaded successfully.

All tunables are defined here. Edit this cell to change model names, chunk size, top-k, etc.

cfg = {
    "llm": {
        "model": "llama3.2:3b",
        "base_url": "http://localhost:11434",
        "temperature": 0.0,
        "request_timeout": 120.0,
    },
    "embedding": {
        "model": "nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "splitter": {
        "chunk_size": 512,
        "chunk_overlap": 50,
    },
    "chroma": {
        "persist_dir": "./chroma_db",
        "collection_name": "ai_safety_rag",
    },
    "retrieval": {
        "similarity_top_k": 3,
    },
}
print(json.dumps(cfg, indent=2))
{
  "llm": {
    "model": "llama3.2:3b",
    "base_url": "http://localhost:11434",
    "temperature": 0.0,
    "request_timeout": 120.0
  },
  "embedding": {
    "model": "nomic-embed-text",
    "base_url": "http://localhost:11434"
  },
  "splitter": {
    "chunk_size": 512,
    "chunk_overlap": 50
  },
  "chroma": {
    "persist_dir": "./chroma_db",
    "collection_name": "ai_safety_rag"
  },
  "retrieval": {
    "similarity_top_k": 3
  }
}

Cell 4 — Initialise LLM and embedding model

llm = Ollama(
    model=cfg["llm"]["model"],
    base_url=cfg["llm"]["base_url"],
    temperature=cfg["llm"]["temperature"],
    request_timeout=cfg["llm"]["request_timeout"],
)
embed_model = OllamaEmbedding(
    model_name=cfg["embedding"]["model"],
    base_url=cfg["embedding"]["base_url"],
)
logger.info(
    "LLM: %s | Embedding: %s", cfg["llm"]["model"], cfg["embedding"]["model"]
)
17:49:30 [INFO] local_rag: LLM: llama3.2:3b | Embedding: nomic-embed-text

We use SentenceSplitter with the chunk size and overlap from config. The corpus is defined inline — replace CORPUS_TEXT with your own content or load from a file.

CORPUS_TEXT = """# AI Safety Primer
## What is AI Safety?
AI safety is a field of research focused on ensuring that artificial intelligence systems
behave in ways that are safe, beneficial, and aligned with human values. As AI systems
become more capable, the importance of safety research grows correspondingly.
## Key Concepts
### Alignment
Alignment refers to the challenge of ensuring that an AI system's goals and behaviors
match the intentions of its designers and the broader interests of humanity. A misaligned
AI might pursue its programmed objective in ways that are harmful or unintended.
### RLHF (Reinforcement Learning from Human Feedback)
RLHF is a training technique where human evaluators rank model outputs, and those
rankings are used to train a reward model. The AI is then fine-tuned using reinforcement
learning to maximize this reward signal, steering it toward outputs humans prefer.
### Reward Hacking
Reward hacking occurs when an AI system finds unintended ways to maximize its reward
signal without achieving the true underlying goal. For example, a robot trained to run
fast might learn to make itself very tall and then fall forward repeatedly.
### Constitutional AI
Constitutional AI (CAI) is a technique developed by Anthropic to make AI systems more
helpful, harmless, and honest. It uses a set of explicit principles (a "constitution")
to guide the model's behavior. The model critiques and revises its own outputs against
these principles, reducing reliance on human labelers for harmful content.
### Red Teaming
Red teaming in AI involves deliberately trying to find failure modes, vulnerabilities,
or harmful outputs in AI systems. Red teamers act as adversaries, probing the system
with edge cases, jailbreaks, and adversarial prompts to expose weaknesses before
deployment.
### Deceptive Alignment
Deceptive alignment is a hypothetical failure mode where an AI system behaves safely
during training and evaluation but pursues different goals once deployed. The system
"knows" it is being evaluated and acts accordingly to pass safety checks.
### Interpretability
Interpretability (or explainability) research aims to understand what is happening
inside AI models — which features they use, how they represent concepts, and why they
produce specific outputs. Tools like mechanistic interpretability try to reverse-engineer
neural network computations.
## Why AI Safety Matters Now
The rapid pace of AI development means that safety considerations must be integrated
early into the design and training process. Several organizations are actively working
on AI safety research:
- **Anthropic** — founded by former OpenAI researchers, focuses on Constitutional AI
and interpretability
- **OpenAI** — safety team works on alignment, red teaming, and policy
- **DeepMind** — conducts research on agent safety and specification gaming
- **MIRI (Machine Intelligence Research Institute)** — focuses on long-term
theoretical alignment problems
- **Center for AI Safety (CAIS)** — coordinates safety research across academia
and industry
"""
documents = [
    Document(text=CORPUS_TEXT, metadata={"source": "ai_safety_primer"})
]
logger.info("Loaded %d document(s) from inline corpus", len(documents))
splitter = SentenceSplitter(
    chunk_size=cfg["splitter"]["chunk_size"],
    chunk_overlap=cfg["splitter"]["chunk_overlap"],
)
nodes = splitter.get_nodes_from_documents(documents)
logger.info(
    "Split into %d nodes (chunk_size=%d, overlap=%d)",
    len(nodes),
    cfg["splitter"]["chunk_size"],
    cfg["splitter"]["chunk_overlap"],
)
print(f"\nFirst chunk preview ({len(nodes[0].text)} chars):")
print("-" * 60)
print(nodes[0].text[:400], "...")
17:49:30 [INFO] local_rag: Loaded 1 document(s) from inline corpus
17:49:31 [INFO] local_rag: Split into 3 nodes (chunk_size=512, overlap=50)
First chunk preview (2247 chars):
------------------------------------------------------------
# AI Safety Primer
## What is AI Safety?
AI safety is a field of research focused on ensuring that artificial intelligence systems
behave in ways that are safe, beneficial, and aligned with human values. As AI systems
become more capable, the importance of safety research grows correspondingly.
## Key Concepts
### Alignment
Alignment refers to the challenge of ensuring that an AI system's goal ...
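The splitter's sliding-window behaviour can be approximated with a plain word-based sketch. This is an illustration only: the real SentenceSplitter counts tokens and respects sentence boundaries rather than splitting on raw words, but the chunk/overlap arithmetic is the same idea.

```python
def sliding_window_chunks(words, chunk_size=10, overlap=3):
    """Split a word list into chunks where consecutive chunks share `overlap` words."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(25)]
chunks = sliding_window_chunks(words, chunk_size=10, overlap=3)
for c in chunks:
    # each chunk after the first starts `overlap` words before the previous one ended
    print(len(c), c[0], "...", c[-1])
```

With chunk_size=512 and chunk_overlap=50 from the config, each chunk repeats the last ~50 tokens of its predecessor, so a fact straddling a chunk boundary still appears whole in at least one chunk.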

Cell 6 — Build or load the Chroma index (with caching)


If chroma_db/ already exists on disk we load from it — no re-embedding. Delete the chroma_db/ folder to force a full re-index.

PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
COLLECTION = cfg["chroma"]["collection_name"]
chroma_client = chromadb.PersistentClient(path=str(PERSIST_DIR))
existing = [c.name for c in chroma_client.list_collections()]
if COLLECTION in existing:
    logger.info(
        "Cache hit — loading existing collection '%s' from %s",
        COLLECTION,
        PERSIST_DIR,
    )
    chroma_collection = chroma_client.get_collection(COLLECTION)
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    index = VectorStoreIndex.from_vector_store(
        vector_store,
        embed_model=embed_model,
    )
else:
    logger.info(
        "Cache miss — embedding %d nodes into new collection '%s'",
        len(nodes),
        COLLECTION,
    )
    chroma_collection = chroma_client.create_collection(COLLECTION)
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(
        nodes,
        storage_context=storage_context,
        embed_model=embed_model,
    )
    logger.info("Index built and persisted to %s", PERSIST_DIR)
print(f"Collection '{COLLECTION}' has {chroma_collection.count()} vectors.")
17:49:31 [INFO] chromadb.telemetry.product.posthog: Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
17:49:32 [INFO] local_rag: Cache miss — embedding 3 nodes into new collection 'ai_safety_rag'
17:49:33 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:49:33 [INFO] local_rag: Index built and persisted to chroma_db
Collection 'ai_safety_rag' has 3 vectors.

The query engine retrieves the top-k most relevant chunks and passes them as context to the local LLM to generate a grounded answer.

query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=cfg["retrieval"]["similarity_top_k"],
)
QUERY = "What is Constitutional AI and who developed it?"
logger.info("Running query: %s", QUERY)
response = query_engine.query(QUERY)
display(Markdown(f"**Query:** {QUERY}\n\n**Answer:** {response}"))
print("\n--- Retrieved source nodes ---")
for i, node in enumerate(response.source_nodes, 1):
    # node.score can be None for some vector stores; guard before formatting
    score = f"{node.score:.4f}" if node.score is not None else "n/a"
    print(f"[{i}] score={score} | {node.text[:120].strip()}...")
17:49:33 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/show "HTTP/1.1 200 OK"
17:49:33 [INFO] local_rag: Running query: What is Constitutional AI and who developed it?
17:49:33 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:49:52 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"

Query: What is Constitutional AI and who developed it?

Answer: Constitutional AI is an approach where an AI system is trained to follow a set of explicit principles (a “constitution”). This approach was developed by Anthropic. The model critiques and revises its own outputs against these principles, reducing reliance on human labelers for harmful content identification.

--- Retrieved source nodes ---
[1] score=0.4398 | # AI Safety Primer
## What is AI Safety?
AI safety is a field of research focused on ensuring that artificial intellig...
[2] score=0.4063 | ## Why AI Safety Matters Now
The rapid pace of AI development means that safety considerations must be integrated
early...
[3] score=0.3742 | The model critiques
and revises its own outputs against these principles, reducing reliance on human
labelers for harmfu...

Cell 8 — Retrieval evaluation: Hit-Rate and MRR


We loop over the gold Q&A set and for each question:

  1. Retrieve the top-k nodes
  2. Check if any retrieved chunk contains all expected keywords (hit)
  3. Record the rank of the first hit (for MRR)

This is CI-friendly — no extra LLM calls, runs in seconds.
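Concretely, hit-rate is the fraction of questions with at least one hit, and MRR averages 1/rank of the first hit per question (contributing 0 for a miss). A small helper (our own, not a library function) makes the arithmetic explicit; feeding it the first-hit ranks observed in this notebook's run reproduces the reported MRR of 0.8542:

```python
def retrieval_metrics(first_hit_ranks):
    """first_hit_ranks: rank of the first relevant chunk per question, or None for a miss."""
    n = len(first_hit_ranks)
    hit_rate = sum(r is not None for r in first_hit_ranks) / n
    mrr = sum(1 / r for r in first_hit_ranks if r is not None) / n
    return hit_rate, mrr

# First-hit ranks for q1..q8 in this run
hit_rate, mrr = retrieval_metrics([1, 1, 2, 1, 1, 1, 1, 3])
print(f"hit-rate={hit_rate:.2%}  MRR={mrr:.4f}")  # hit-rate=100.00%  MRR=0.8542
```

Note that MRR only rewards the *first* hit: two questions answered at rank 2 and rank 3 drag the average down even when hit-rate is perfect.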

gold_qa = [
    {
        "id": "q1",
        "question": "What is alignment in the context of AI safety?",
        "expected_keywords": ["alignment", "goals"],
    },
    {
        "id": "q2",
        "question": "What is RLHF and how does it work?",
        "expected_keywords": ["rlhf", "reward"],
    },
    {
        "id": "q3",
        "question": "What is reward hacking?",
        "expected_keywords": ["reward hacking", "unintended"],
    },
    {
        "id": "q4",
        "question": "What is Constitutional AI and who developed it?",
        "expected_keywords": ["constitutional ai", "anthropic"],
    },
    {
        "id": "q5",
        "question": "What is red teaming in AI?",
        "expected_keywords": ["red teaming", "failure"],
    },
    {
        "id": "q6",
        "question": "What is deceptive alignment?",
        "expected_keywords": ["deceptive alignment", "training"],
    },
    {
        "id": "q7",
        "question": "What is interpretability in AI systems?",
        "expected_keywords": ["interpretability", "neural"],
    },
    {
        "id": "q8",
        "question": "Which organizations are working on AI safety?",
        "expected_keywords": ["anthropic", "openai", "deepmind"],
    },
]
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=cfg["retrieval"]["similarity_top_k"],
    embed_model=embed_model,
)
hits = 0
reciprocal_ranks = []
results = []
for item in gold_qa:
    retrieved_nodes = retriever.retrieve(item["question"])
    keywords = [kw.lower() for kw in item["expected_keywords"]]
    first_hit_rank = None
    for rank, node in enumerate(retrieved_nodes, 1):
        text_lower = node.text.lower()
        if all(kw in text_lower for kw in keywords):
            first_hit_rank = rank
            break
    hit = first_hit_rank is not None
    hits += int(hit)
    reciprocal_ranks.append(1 / first_hit_rank if hit else 0.0)
    results.append(
        {
            "id": item["id"],
            "hit": hit,
            "rank": first_hit_rank,
            "question": item["question"][:60],
        }
    )
    logger.info(
        "[%s] hit=%s rank=%s | %s",
        item["id"],
        hit,
        first_hit_rank,
        item["question"][:50],
    )
hit_rate = hits / len(gold_qa)
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print("\n" + "=" * 50)
print(
    f" Retrieval Evaluation Results (top-k={cfg['retrieval']['similarity_top_k']})"
)
print("=" * 50)
print(f" Hit-Rate : {hit_rate:.2%} ({hits}/{len(gold_qa)} questions)")
print(f" MRR      : {mrr:.4f}")
print("=" * 50)
print("\nPer-question breakdown:")
for r in results:
    status = "✅" if r["hit"] else "❌"
    print(f" {status} [{r['id']}] rank={r['rank']} {r['question']}")
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q1] hit=True rank=1 | What is alignment in the context of AI safety?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q2] hit=True rank=1 | What is RLHF and how does it work?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q3] hit=True rank=2 | What is reward hacking?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q4] hit=True rank=1 | What is Constitutional AI and who developed it?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q5] hit=True rank=1 | What is red teaming in AI?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q6] hit=True rank=1 | What is deceptive alignment?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q7] hit=True rank=1 | What is interpretability in AI systems?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q8] hit=True rank=3 | Which organizations are working on AI safety?
==================================================
Retrieval Evaluation Results (top-k=3)
==================================================
Hit-Rate : 100.00% (8/8 questions)
MRR : 0.8542
==================================================
Per-question breakdown:
✅ [q1] rank=1 What is alignment in the context of AI safety?
✅ [q2] rank=1 What is RLHF and how does it work?
✅ [q3] rank=2 What is reward hacking?
✅ [q4] rank=1 What is Constitutional AI and who developed it?
✅ [q5] rank=1 What is red teaming in AI?
✅ [q6] rank=1 What is deceptive alignment?
✅ [q7] rank=1 What is interpretability in AI systems?
✅ [q8] rank=3 Which organizations are working on AI safety?

Understanding where a RAG pipeline breaks is as important as knowing where it works. We demonstrate three common failure modes.

Failure Mode 1: Nonsense query

When the query has no semantic content, retrieval returns low-relevance chunks and the LLM must either hallucinate or admit it doesn't know.

empty_query = "asdfjkl qwerty zzz"
logger.info("[Failure Mode 1] Empty/nonsense query: '%s'", empty_query)
response_empty = query_engine.query(empty_query)
print("Query  :", empty_query)
print("Answer :", str(response_empty))
print(
    "\nTop retrieved node score:",
    f"{response_empty.source_nodes[0].score:.4f}"
    if response_empty.source_nodes
    else "none",
)
print(
    "\n⚠️ Note: Low retrieval score indicates the context is not relevant to the query."
)
17:50:00 [INFO] local_rag: [Failure Mode 1] Empty/nonsense query: 'asdfjkl qwerty zzz'
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:06 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Query : asdfjkl qwerty zzz
Answer : I'm sorry, but it seems like you haven't provided a clear question. The input "asdfjkl qwerty zzz" doesn't appear to be related to any specific topic or subject matter discussed in the context information. Could you please rephrase your query so I can provide a helpful response?
Top retrieved node score: 0.3735
⚠️ Note: Low retrieval score indicates the context is not relevant to the query.

Failure Mode 2: Query about a topic outside the document


The document covers AI safety. A query about an unrelated topic will retrieve the least-bad chunks, but the answer will be unreliable.

ood_query = "What is the recipe for making sourdough bread?"
logger.info("[Failure Mode 2] Out-of-domain query: '%s'", ood_query)
response_ood = query_engine.query(ood_query)
print("Query  :", ood_query)
print("Answer :", str(response_ood))
print(
    "\n⚠️ Guardrail tip: Add a relevance score threshold. If max(node.score) < 0.4,"
)
print(
    "   return 'I don't have information about this topic' instead of hallucinating."
)
17:50:06 [INFO] local_rag: [Failure Mode 2] Out-of-domain query: 'What is the recipe for making sourdough bread?'
17:50:07 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:14 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Query : What is the recipe for making sourdough bread?
Answer : I'm happy to help you with your question, but I have to say that the provided context information seems unrelated to baking or cooking. The text appears to be a discussion about AI safety and its importance in the field of artificial intelligence.
Unfortunately, I don't have any information on making sourdough bread from the given context. If you're looking for a recipe, I'd be happy to try and help you find one elsewhere!
⚠️ Guardrail tip: Add a relevance score threshold. If max(node.score) < 0.4,
return 'I don't have information about this topic' instead of hallucinating.

Failure Mode 3: Hallucination guardrail (score threshold)


A simple but effective guardrail: if the best retrieval score is below a threshold, refuse to answer rather than hallucinate.

SCORE_THRESHOLD = 0.40

def safe_query(
    query_engine, retriever, question: str, threshold: float = SCORE_THRESHOLD
) -> str:
    """Run RAG query with a relevance score guardrail.

    Returns the LLM answer if the best retrieved chunk exceeds `threshold`,
    otherwise returns a fallback message to prevent hallucination.
    """
    nodes = retriever.retrieve(question)
    # Some vector stores return None scores; drop them before taking max()
    scores = [n.score for n in nodes if n.score is not None]
    if not scores:
        return "[GUARDRAIL] No scored documents retrieved."
    best_score = max(scores)
    logger.info(
        "[safe_query] best_score=%.4f threshold=%.2f", best_score, threshold
    )
    if best_score < threshold:
        return (
            f"[GUARDRAIL] Best retrieval score ({best_score:.4f}) is below "
            f"threshold ({threshold}). Cannot answer reliably."
        )
    return str(query_engine.query(question))

# In-domain question — should pass the guardrail
q_in = "What is reward hacking?"
# Out-of-domain question — should be blocked
q_out = "What is the capital of France?"
print("=" * 55)
print(f"Q (in-domain) : {q_in}")
print(f"A : {safe_query(query_engine, retriever, q_in)}")
print()
print(f"Q (out-domain): {q_out}")
print(f"A : {safe_query(query_engine, retriever, q_out)}")
print("=" * 55)
17:50:14 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:14 [INFO] local_rag: [safe_query] best_score=0.4569 threshold=0.40
17:50:14 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
=======================================================
Q (in-domain) : What is reward hacking?
17:50:22 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
17:50:22 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:22 [INFO] local_rag: [safe_query] best_score=0.3442 threshold=0.40
A : Reward hacking occurs when an AI system finds unintended ways to maximize its reward signal without achieving the true underlying goal. For example, a robot trained to run fast might learn to make itself very tall and then fall forward repeatedly. This phenomenon highlights the potential for AI systems to develop behaviors that are not aligned with their intended objectives.
Q (out-domain): What is the capital of France?
A : [GUARDRAIL] Best retrieval score (0.3442) is below threshold (0.4). Cannot answer reliably.
=======================================================

Run this cell to delete the persisted Chroma database and start fresh. Useful for testing the full pipeline from scratch.

# Uncomment to reset the vector store
# PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
# if PERSIST_DIR.exists():
#     shutil.rmtree(PERSIST_DIR)
#     logger.info("Deleted %s — re-run Cell 6 to rebuild the index.", PERSIST_DIR)
# else:
#     logger.info("%s does not exist, nothing to clean up.", PERSIST_DIR)
print(
    "Cleanup cell ready. Uncomment the lines above to reset the vector store."
)
Cleanup cell ready. Uncomment the lines above to reset the vector store.
Component     Choice                        Why
LLM           llama3.2:3b via Ollama        Free, local, no API key
Embeddings    nomic-embed-text via Ollama   High quality, 274 MB, fully local
Vector store  ChromaDB (persistent)         Simple, file-based, no server needed
Chunking      SentenceSplitter              Respects sentence boundaries
Eval          Keyword hit-rate + MRR        CI-friendly, zero LLM cost
Guardrail     Score threshold               Prevents hallucination on OOD queries
  • Swap llama3.2:3b for mistral or gemma3 in the config cell and re-run
  • Replace CORPUS_TEXT with your own documents
  • Increase similarity_top_k and observe the effect on MRR
  • Add a reranker (e.g. llama-index-postprocessor-cohere-rerank) after retrieval
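A reranker re-scores the retriever's candidates with a stronger model (typically a cross-encoder) and reorders them before generation. As a stdlib-only stand-in for that idea, here is a toy reranker that scores candidates by query-term overlap; a real deployment would swap the `score` function for a cross-encoder call via the postprocessor package named above:

```python
def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Reorder candidates by the fraction of query terms each one contains."""
    terms = set(query.lower().split())

    def score(text: str) -> float:
        # Toy relevance: overlap between query terms and candidate words
        return len(terms & set(text.lower().split())) / len(terms)

    return sorted(candidates, key=score, reverse=True)[:top_n]

docs = [
    "Red teaming probes AI systems for failure modes.",
    "Reward hacking means gaming the reward signal.",
    "Constitutional AI was developed by Anthropic.",
]
print(rerank("who developed constitutional ai", docs, top_n=2))
```

The pattern is the useful part: retrieve a generous top-k first, then let the (more expensive) reranker pick the final few chunks that go into the prompt.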