# Fully Local RAG Pipeline with Chroma + Ollama
No API key required. This notebook runs entirely on your local machine using Ollama for both the LLM and embeddings, and ChromaDB as the vector store.
## What you will learn

| Step | Concept |
|---|---|
| 1 | Configure the pipeline — all tunables in one place |
| 2 | Ingest and chunk a sample document with SentenceSplitter |
| 3 | Embed chunks with OllamaEmbedding and persist in ChromaDB |
| 4 | Query the index with a local LLM (llama3.2:3b via Ollama) |
| 5 | Evaluate retrieval quality against a gold Q&A set (hit-rate & MRR) |
| 6 | Explore failure modes: empty context, long queries, hallucination guard |
## Prerequisites

- Ollama installed and running (download from the Ollama website)
- Models pulled:

```sh
ollama pull llama3.2:3b
ollama pull nomic-embed-text
```
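Before running anything, it can help to confirm that the Ollama server is actually reachable. A minimal check (this helper and its name are illustrative, not part of the notebook) pings Ollama's `/api/tags` endpoint, which lists locally pulled models:

```python
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama server responds at base_url."""
    try:
        # /api/tags lists locally pulled models; any successful response
        # means the server is running.
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2):
            return True
    except OSError:
        return False
```

If this returns `False`, start the server (e.g. `ollama serve`) before continuing.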
## Cell 1 — Install dependencies

```python
%pip install -q llama-index-core llama-index-llms-ollama llama-index-embeddings-ollama llama-index-vector-stores-chroma llama-index-readers-file chromadb
```

```
Note: you may need to restart the kernel to use updated packages.
```

## Cell 2 — Imports and logging
```python
import json
import logging
import shutil
from pathlib import Path

import chromadb
from IPython.display import Markdown, display

from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    datefmt="%H:%M:%S",
)
logger = logging.getLogger("local_rag")
logger.info("Imports loaded successfully.")
```

```
17:49:30 [INFO] local_rag: Imports loaded successfully.
```

## Cell 3 — Configuration
All tunables are defined here. Edit this cell to change model names, chunk size, top-k, etc.

```python
cfg = {
    "llm": {
        "model": "llama3.2:3b",
        "base_url": "http://localhost:11434",
        "temperature": 0.0,
        "request_timeout": 120.0,
    },
    "embedding": {
        "model": "nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "splitter": {
        "chunk_size": 512,
        "chunk_overlap": 50,
    },
    "chroma": {
        "persist_dir": "./chroma_db",
        "collection_name": "ai_safety_rag",
    },
    "retrieval": {
        "similarity_top_k": 3,
    },
}

print(json.dumps(cfg, indent=2))
```

```
{
  "llm": {
    "model": "llama3.2:3b",
    "base_url": "http://localhost:11434",
    "temperature": 0.0,
    "request_timeout": 120.0
  },
  "embedding": {
    "model": "nomic-embed-text",
    "base_url": "http://localhost:11434"
  },
  "splitter": {
    "chunk_size": 512,
    "chunk_overlap": 50
  },
  "chroma": {
    "persist_dir": "./chroma_db",
    "collection_name": "ai_safety_rag"
  },
  "retrieval": {
    "similarity_top_k": 3
  }
}
```

## Cell 4 — Initialise LLM and embedding model
```python
llm = Ollama(
    model=cfg["llm"]["model"],
    base_url=cfg["llm"]["base_url"],
    temperature=cfg["llm"]["temperature"],
    request_timeout=cfg["llm"]["request_timeout"],
)

embed_model = OllamaEmbedding(
    model_name=cfg["embedding"]["model"],
    base_url=cfg["embedding"]["base_url"],
)

logger.info(
    "LLM: %s | Embedding: %s", cfg["llm"]["model"], cfg["embedding"]["model"]
)
```

```
17:49:30 [INFO] local_rag: LLM: llama3.2:3b | Embedding: nomic-embed-text
```

## Cell 5 — Load and chunk the document
We use `SentenceSplitter` with the chunk size and overlap from config.

The corpus is defined inline — replace `CORPUS_TEXT` with your own content or load from a file.
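To build intuition for what `chunk_size: 512` and `chunk_overlap: 50` do before running the real splitter, here is a deliberately naive word-level sketch (`SentenceSplitter` itself works on tokens and respects sentence boundaries, so this is an illustration only):

```python
def naive_chunks(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size words, stepping by chunk_size - overlap,
    so each chunk repeats the last `overlap` words of the previous one."""
    step = chunk_size - overlap
    return [words[i : i + chunk_size] for i in range(0, len(words), step)]


words = [f"w{i}" for i in range(10)]
# chunk_size=4 with overlap=2 → each chunk shares 2 words with its neighbour
print(naive_chunks(words, chunk_size=4, overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is the point of the `chunk_overlap` setting above.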
```python
CORPUS_TEXT = """# AI Safety Primer

## What is AI Safety?

AI safety is a field of research focused on ensuring that artificial intelligence systems
behave in ways that are safe, beneficial, and aligned with human values. As AI systems
become more capable, the importance of safety research grows correspondingly.

## Key Concepts

### Alignment

Alignment refers to the challenge of ensuring that an AI system's goals and behaviors
match the intentions of its designers and the broader interests of humanity. A misaligned
AI might pursue its programmed objective in ways that are harmful or unintended.

### RLHF (Reinforcement Learning from Human Feedback)

RLHF is a training technique where human evaluators rank model outputs, and those
rankings are used to train a reward model. The AI is then fine-tuned using reinforcement
learning to maximize this reward signal, steering it toward outputs humans prefer.

### Reward Hacking

Reward hacking occurs when an AI system finds unintended ways to maximize its reward
signal without achieving the true underlying goal. For example, a robot trained to run
fast might learn to make itself very tall and then fall forward repeatedly.

### Constitutional AI

Constitutional AI (CAI) is a technique developed by Anthropic to make AI systems more
helpful, harmless, and honest. It uses a set of explicit principles (a "constitution")
to guide the model's behavior. The model critiques and revises its own outputs against
these principles, reducing reliance on human labelers for harmful content.

### Red Teaming

Red teaming in AI involves deliberately trying to find failure modes, vulnerabilities,
or harmful outputs in AI systems. Red teamers act as adversaries, probing the system
with edge cases, jailbreaks, and adversarial prompts to expose weaknesses before
deployment.

### Deceptive Alignment

Deceptive alignment is a hypothetical failure mode where an AI system behaves safely
during training and evaluation but pursues different goals once deployed. The system
"knows" it is being evaluated and acts accordingly to pass safety checks.

### Interpretability

Interpretability (or explainability) research aims to understand what is happening
inside AI models — which features they use, how they represent concepts, and why they
produce specific outputs. Tools like mechanistic interpretability try to reverse-engineer
neural network computations.

## Why AI Safety Matters Now

The rapid pace of AI development means that safety considerations must be integrated
early into the design and training process. Several organizations are actively working
on AI safety research:

- **Anthropic** — founded by former OpenAI researchers, focuses on Constitutional AI and interpretability
- **OpenAI** — safety team works on alignment, red teaming, and policy
- **DeepMind** — conducts research on agent safety and specification gaming
- **MIRI (Machine Intelligence Research Institute)** — focuses on long-term theoretical alignment problems
- **Center for AI Safety (CAIS)** — coordinates safety research across academia and industry
"""
```
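If you would rather load the corpus from disk, a small helper along these lines can gather raw text (the `load_corpus` name and the `.md`/`.txt` filter are illustrative assumptions, not part of the notebook); each returned string would then be wrapped in a `Document` just as the inline corpus is:

```python
from pathlib import Path


def load_corpus(data_dir: str) -> list[str]:
    """Read every .md or .txt file under data_dir into a list of raw strings."""
    texts = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.suffix in {".md", ".txt"}:
            texts.append(path.read_text(encoding="utf-8"))
    return texts
```

For example, something like `documents = [Document(text=t, metadata={"source": "file"}) for t in load_corpus("./data")]` would replace the inline `CORPUS_TEXT` version.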
```python
documents = [
    Document(text=CORPUS_TEXT, metadata={"source": "ai_safety_primer"})
]
logger.info("Loaded %d document(s) from inline corpus", len(documents))

splitter = SentenceSplitter(
    chunk_size=cfg["splitter"]["chunk_size"],
    chunk_overlap=cfg["splitter"]["chunk_overlap"],
)
nodes = splitter.get_nodes_from_documents(documents)
logger.info(
    "Split into %d nodes (chunk_size=%d, overlap=%d)",
    len(nodes),
    cfg["splitter"]["chunk_size"],
    cfg["splitter"]["chunk_overlap"],
)

print(f"\nFirst chunk preview ({len(nodes[0].text)} chars):")
print("-" * 60)
print(nodes[0].text[:400], "...")
```

```
17:49:30 [WARNING] llama_index.core.readers.file.base: `llama-index-readers-file` package not found, some file readers will not be available if not provided by the `file_extractor` parameter.
17:49:30 [INFO] local_rag: Loaded 1 document(s) from inline corpus
17:49:31 [INFO] local_rag: Split into 3 nodes (chunk_size=512, overlap=50)
```
```
First chunk preview (2247 chars):
------------------------------------------------------------
# AI Safety Primer

## What is AI Safety?

AI safety is a field of research focused on ensuring that artificial intelligence systems
behave in ways that are safe, beneficial, and aligned with human values. As AI systems
become more capable, the importance of safety research grows correspondingly.

## Key Concepts

### Alignment

Alignment refers to the challenge of ensuring that an AI system's goal ...
```

## Cell 6 — Build or load the Chroma index (with caching)
If `chroma_db/` already exists on disk we load from it — no re-embedding. Delete the `chroma_db/` folder to force a full re-index.
```python
PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
COLLECTION = cfg["chroma"]["collection_name"]

chroma_client = chromadb.PersistentClient(path=str(PERSIST_DIR))
existing = [c.name for c in chroma_client.list_collections()]

if COLLECTION in existing:
    logger.info(
        "Cache hit — loading existing collection '%s' from %s",
        COLLECTION,
        PERSIST_DIR,
    )
    chroma_collection = chroma_client.get_collection(COLLECTION)
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    index = VectorStoreIndex.from_vector_store(
        vector_store,
        embed_model=embed_model,
    )
else:
    logger.info(
        "Cache miss — embedding %d nodes into new collection '%s'",
        len(nodes),
        COLLECTION,
    )
    chroma_collection = chroma_client.create_collection(COLLECTION)
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(
        nodes,
        storage_context=storage_context,
        embed_model=embed_model,
    )
    logger.info("Index built and persisted to %s", PERSIST_DIR)

print(f"Collection '{COLLECTION}' has {chroma_collection.count()} vectors.")
```

```
17:49:31 [INFO] chromadb.telemetry.product.posthog: Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
17:49:32 [INFO] local_rag: Cache miss — embedding 3 nodes into new collection 'ai_safety_rag'
17:49:33 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:49:33 [INFO] local_rag: Index built and persisted to chroma_db
Collection 'ai_safety_rag' has 3 vectors.
```

## Cell 7 — RAG query
The query engine retrieves the top-k most relevant chunks and passes them as context to the local LLM to generate a grounded answer.
```python
query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=cfg["retrieval"]["similarity_top_k"],
)

QUERY = "What is Constitutional AI and who developed it?"
logger.info("Running query: %s", QUERY)

response = query_engine.query(QUERY)

display(Markdown(f"**Query:** {QUERY}\n\n**Answer:** {response}"))

print("\n--- Retrieved source nodes ---")
for i, node in enumerate(response.source_nodes, 1):
    score = getattr(node, "score", "n/a")
    print(f"[{i}] score={score:.4f} | {node.text[:120].strip()}...")
```

```
17:49:33 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/show "HTTP/1.1 200 OK"
17:49:33 [INFO] local_rag: Running query: What is Constitutional AI and who developed it?
17:49:33 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:49:52 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
```

Query: What is Constitutional AI and who developed it?
Answer: Constitutional AI is an approach where an AI system is trained to follow a set of explicit principles (a “constitution”). This approach was developed by Anthropic. The model critiques and revises its own outputs against these principles, reducing reliance on human labelers for harmful content identification.
```
--- Retrieved source nodes ---
[1] score=0.4398 | # AI Safety Primer

## What is AI Safety?

AI safety is a field of research focused on ensuring that artificial intellig...
[2] score=0.4063 | ## Why AI Safety Matters Now

The rapid pace of AI development means that safety considerations must be integrated early...
[3] score=0.3742 | The model critiques and revises its own outputs against these principles, reducing reliance on human labelers for harmfu...
```

## Cell 8 — Retrieval evaluation: Hit-Rate and MRR
We loop over the gold Q&A set and for each question:
- Retrieve the top-k nodes
- Check if any retrieved chunk contains all expected keywords (hit)
- Record the rank of the first hit (for MRR)
This is CI-friendly — no extra LLM calls, runs in seconds.
```python
gold_qa = [
    {
        "id": "q1",
        "question": "What is alignment in the context of AI safety?",
        "expected_keywords": ["alignment", "goals"],
    },
    {
        "id": "q2",
        "question": "What is RLHF and how does it work?",
        "expected_keywords": ["rlhf", "reward"],
    },
    {
        "id": "q3",
        "question": "What is reward hacking?",
        "expected_keywords": ["reward hacking", "unintended"],
    },
    {
        "id": "q4",
        "question": "What is Constitutional AI and who developed it?",
        "expected_keywords": ["constitutional ai", "anthropic"],
    },
    {
        "id": "q5",
        "question": "What is red teaming in AI?",
        "expected_keywords": ["red teaming", "failure"],
    },
    {
        "id": "q6",
        "question": "What is deceptive alignment?",
        "expected_keywords": ["deceptive alignment", "training"],
    },
    {
        "id": "q7",
        "question": "What is interpretability in AI systems?",
        "expected_keywords": ["interpretability", "neural"],
    },
    {
        "id": "q8",
        "question": "Which organizations are working on AI safety?",
        "expected_keywords": ["anthropic", "openai", "deepmind"],
    },
]
```
```python
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=cfg["retrieval"]["similarity_top_k"],
    embed_model=embed_model,
)

hits = 0
reciprocal_ranks = []
results = []

for item in gold_qa:
    retrieved_nodes = retriever.retrieve(item["question"])
    keywords = [kw.lower() for kw in item["expected_keywords"]]

    first_hit_rank = None
    for rank, node in enumerate(retrieved_nodes, 1):
        text_lower = node.text.lower()
        if all(kw in text_lower for kw in keywords):
            first_hit_rank = rank
            break

    hit = first_hit_rank is not None
    hits += int(hit)
    reciprocal_ranks.append(1 / first_hit_rank if hit else 0.0)

    results.append(
        {
            "id": item["id"],
            "hit": hit,
            "rank": first_hit_rank,
            "question": item["question"][:60],
        }
    )
    logger.info(
        "[%s] hit=%s rank=%s | %s",
        item["id"],
        hit,
        first_hit_rank,
        item["question"][:50],
    )

hit_rate = hits / len(gold_qa)
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

print("\n" + "=" * 50)
print(
    f" Retrieval Evaluation Results (top-k={cfg['retrieval']['similarity_top_k']})"
)
print("=" * 50)
print(f" Hit-Rate : {hit_rate:.2%} ({hits}/{len(gold_qa)} questions)")
print(f" MRR      : {mrr:.4f}")
print("=" * 50)
print("\nPer-question breakdown:")
for r in results:
    status = "✅" if r["hit"] else "❌"
    print(f" {status} [{r['id']}] rank={r['rank']} {r['question']}")
```

```
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q1] hit=True rank=1 | What is alignment in the context of AI safety?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q2] hit=True rank=1 | What is RLHF and how does it work?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q3] hit=True rank=2 | What is reward hacking?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q4] hit=True rank=1 | What is Constitutional AI and who developed it?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q5] hit=True rank=1 | What is red teaming in AI?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q6] hit=True rank=1 | What is deceptive alignment?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q7] hit=True rank=1 | What is interpretability in AI systems?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q8] hit=True rank=3 | Which organizations are working on AI safety?
```
```
==================================================
 Retrieval Evaluation Results (top-k=3)
==================================================
 Hit-Rate : 100.00% (8/8 questions)
 MRR      : 0.8542
==================================================

Per-question breakdown:
 ✅ [q1] rank=1 What is alignment in the context of AI safety?
 ✅ [q2] rank=1 What is RLHF and how does it work?
 ✅ [q3] rank=2 What is reward hacking?
 ✅ [q4] rank=1 What is Constitutional AI and who developed it?
 ✅ [q5] rank=1 What is red teaming in AI?
 ✅ [q6] rank=1 What is deceptive alignment?
 ✅ [q7] rank=1 What is interpretability in AI systems?
 ✅ [q8] rank=3 Which organizations are working on AI safety?
```

## Cell 9 — Failure mode demos
Understanding where a RAG pipeline breaks is as important as knowing where it works. We demonstrate three common failure modes.

### Failure Mode 1: Empty / nonsense query

When the query has no semantic content, retrieval returns low-relevance chunks and the LLM is forced to hallucinate or admit it doesn't know.
```python
empty_query = "asdfjkl qwerty zzz"
logger.info("[Failure Mode 1] Empty/nonsense query: '%s'", empty_query)

response_empty = query_engine.query(empty_query)

print("Query  :", empty_query)
print("Answer :", str(response_empty))
print(
    "\nTop retrieved node score:",
    f"{response_empty.source_nodes[0].score:.4f}"
    if response_empty.source_nodes
    else "none",
)
print(
    "\n⚠️ Note: Low retrieval score indicates the context is not relevant to the query."
)
```

```
17:50:00 [INFO] local_rag: [Failure Mode 1] Empty/nonsense query: 'asdfjkl qwerty zzz'
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:06 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Query  : asdfjkl qwerty zzz
Answer : I'm sorry, but it seems like you haven't provided a clear question. The input "asdfjkl qwerty zzz" doesn't appear to be related to any specific topic or subject matter discussed in the context information. Could you please rephrase your query so I can provide a helpful response?

Top retrieved node score: 0.3735

⚠️ Note: Low retrieval score indicates the context is not relevant to the query.
```

### Failure Mode 2: Query about a topic outside the document
The document covers AI safety. A query about an unrelated topic will retrieve the least-bad chunks, but the answer will be unreliable.
```python
ood_query = "What is the recipe for making sourdough bread?"
logger.info("[Failure Mode 2] Out-of-domain query: '%s'", ood_query)

response_ood = query_engine.query(ood_query)

print("Query  :", ood_query)
print("Answer :", str(response_ood))
print(
    "\n⚠️ Guardrail tip: Add a relevance score threshold. If max(node.score) < 0.4,"
)
print(
    "   return 'I don't have information about this topic' instead of hallucinating."
)
```

```
17:50:06 [INFO] local_rag: [Failure Mode 2] Out-of-domain query: 'What is the recipe for making sourdough bread?'
17:50:07 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:14 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
```
```
Query  : What is the recipe for making sourdough bread?
Answer : I'm happy to help you with your question, but I have to say that the provided context information seems unrelated to baking or cooking. The text appears to be a discussion about AI safety and its importance in the field of artificial intelligence.

Unfortunately, I don't have any information on making sourdough bread from the given context. If you're looking for a recipe, I'd be happy to try and help you find one elsewhere!

⚠️ Guardrail tip: Add a relevance score threshold. If max(node.score) < 0.4,
   return 'I don't have information about this topic' instead of hallucinating.
```

### Failure Mode 3: Hallucination guardrail (score threshold)
A simple but effective guardrail: if the best retrieval score is below a threshold, refuse to answer rather than hallucinate.
```python
SCORE_THRESHOLD = 0.40


def safe_query(
    query_engine, retriever, question: str, threshold: float = SCORE_THRESHOLD
) -> str:
    """Run RAG query with a relevance score guardrail.

    Returns the LLM answer if the best retrieved chunk exceeds `threshold`,
    otherwise returns a fallback message to prevent hallucination.
    """
    nodes = retriever.retrieve(question)
    if not nodes:
        return "[GUARDRAIL] No documents retrieved."

    best_score = max(n.score for n in nodes if n.score is not None)
    logger.info(
        "[safe_query] best_score=%.4f threshold=%.2f", best_score, threshold
    )

    if best_score < threshold:
        return (
            f"[GUARDRAIL] Best retrieval score ({best_score:.4f}) is below "
            f"threshold ({threshold}). Cannot answer reliably."
        )
    return str(query_engine.query(question))


# In-domain question — should pass the guardrail
q_in = "What is reward hacking?"
# Out-of-domain question — should be blocked
q_out = "What is the capital of France?"

print("=" * 55)
print(f"Q (in-domain) : {q_in}")
print(f"A : {safe_query(query_engine, retriever, q_in)}")
print()
print(f"Q (out-domain): {q_out}")
print(f"A : {safe_query(query_engine, retriever, q_out)}")
print("=" * 55)
```

```
17:50:14 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:14 [INFO] local_rag: [safe_query] best_score=0.4569 threshold=0.40
17:50:14 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
```
```
=======================================================
Q (in-domain) : What is reward hacking?
17:50:22 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
17:50:22 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:22 [INFO] local_rag: [safe_query] best_score=0.3442 threshold=0.40
A : Reward hacking occurs when an AI system finds unintended ways to maximize its reward signal without achieving the true underlying goal. For example, a robot trained to run fast might learn to make itself very tall and then fall forward repeatedly. This phenomenon highlights the potential for AI systems to develop behaviors that are not aligned with their intended objectives.

Q (out-domain): What is the capital of France?
A : [GUARDRAIL] Best retrieval score (0.3442) is below threshold (0.4). Cannot answer reliably.
=======================================================
```

## Cell 10 — Cleanup (optional)
Run this cell to delete the persisted Chroma database and start fresh. Useful for testing the full pipeline from scratch.

```python
# Uncomment to reset the vector store
# PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
# if PERSIST_DIR.exists():
#     shutil.rmtree(PERSIST_DIR)
#     logger.info("Deleted %s — re-run Cell 6 to rebuild the index.", PERSIST_DIR)
# else:
#     logger.info("%s does not exist, nothing to clean up.", PERSIST_DIR)

print(
    "Cleanup cell ready. Uncomment the lines above to reset the vector store."
)
```

```
Cleanup cell ready. Uncomment the lines above to reset the vector store.
```

## Summary
| Component | Choice | Why |
|---|---|---|
| LLM | llama3.2:3b via Ollama | Free, local, no API key |
| Embeddings | nomic-embed-text via Ollama | High quality, 274 MB, fully local |
| Vector store | ChromaDB (persistent) | Simple, file-based, no server needed |
| Chunking | SentenceSplitter | Respects sentence boundaries |
| Eval | Keyword hit-rate + MRR | CI-friendly, zero LLM cost |
| Guardrail | Score threshold | Prevents hallucination on OOD queries |
## Next steps

- Swap `llama3.2:3b` for `mistral` or `gemma3` in the config cell and re-run
- Replace `CORPUS_TEXT` with your own documents
- Increase `similarity_top_k` and observe the effect on MRR
- Add a reranker (e.g. `llama-index-postprocessor-cohere-rerank`) after retrieval
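To see the shape of the post-retrieval reranking step without pulling in another model, here is a toy lexical reranker. A real reranker (such as the Cohere cross-encoder mentioned above) scores query–chunk pairs with a trained model; term overlap is only a stand-in for illustration:

```python
def lexical_rerank(query: str, chunks: list[str]) -> list[str]:
    """Reorder retrieved chunks by how many query terms each one contains —
    a toy stand-in for a model-based reranker, illustration only."""
    terms = set(query.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(terms & set(c.lower().split())),
        reverse=True,
    )


docs = ["reward hacking is a failure mode", "sourdough needs a starter"]
print(lexical_rerank("what is reward hacking", docs)[0])
# → reward hacking is a failure mode
```

In a real pipeline the reranker would sit between `retriever.retrieve(...)` and the LLM call, typically retrieving a larger top-k and keeping only the best few after rescoring.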