Hybrid RAG with Qdrant: multi-tenancy, custom sharding, distributed setup

This notebook implements a production-style Hybrid RAG on Qdrant using LlamaIndex, designed for multi-tenancy and scale-out via custom sharding.

  • Hybrid search: dense embeddings + sparse BM25 for higher recall and precision.
  • Multi-tenancy: isolate tenants using payload filters and shard routing.
  • Custom sharding: keep each tenant local for performance and cost efficiency.
  • Distributed Qdrant: multi-node setup with replication for high availability and throughput.

This notebook walks through an end-to-end Retrieval-Augmented Generation (RAG) workflow that uses Qdrant as a distributed hybrid search backend and LlamaIndex as the orchestration layer. You will build a tenant-aware RAG pipeline that combines dense vectors with sparse signals, isolates each tenant's data with payload filters, and routes both writes and queries with a custom shard key for scale.

  • llama-index: orchestration layer for ingestion, indexing, and retrieval.
  • llama-index-vector-stores-qdrant: Qdrant integration with hybrid support.
  • fastembed: lightweight, CPU-friendly embedding and sparse models.
%pip install -U llama-index llama-index-vector-stores-qdrant fastembed

Make sure you have a distributed Qdrant cluster up and running. Here is a minimal compose.yaml for a two-node cluster:

services:
  qdrant_primary:
    image: "qdrant/qdrant:latest"
    ports:
      - "6333:6333"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--uri", "http://qdrant_primary:6335"]
    restart: always

  qdrant_secondary:
    image: "qdrant/qdrant:latest"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--bootstrap", "http://qdrant_primary:6335"]
    restart: always
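
Before moving on, you can verify that the two nodes formed a cluster. A minimal check, assuming the default REST port exposed by the compose file above (Qdrant's GET /cluster endpoint reports peer membership):

import requests

status = requests.get("http://localhost:6333/cluster").json()
print(status["result"]["status"])       # "enabled" when clustering is active
print(len(status["result"]["peers"]))   # expect 2 peers for this compose file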
  • Embeddings: FastEmbedEmbedding('BAAI/bge-base-en-v1.5') is a compact, high-quality baseline.
  • Connection: QDRANT_URL defaults to an HTTP endpoint; set QDRANT_API_KEY for secured/cloud setups.
import os

from qdrant_client import AsyncQdrantClient, QdrantClient
from qdrant_client import models
from llama_index.core import (
    Settings,
    VectorStoreIndex,
    Document,
    StorageContext,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.fastembed import FastEmbedEmbedding

# Embeddings, small and fast
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Qdrant connection, local by default, set QDRANT_URL and QDRANT_API_KEY for cloud
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

client: QdrantClient = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
aclient: AsyncQdrantClient = AsyncQdrantClient(
    url=QDRANT_URL, api_key=QDRANT_API_KEY
)
COLLECTION = "hybrid_rag_multitenant_sharding_demo"
  • Define vector field names: dense for embeddings and sparse for BM25‑style signals.
  • Dense config:
    • Determine embedding dimensionality at runtime by probing Settings.embed_model (avoids hardcoding).
    • Use cosine distance for semantic similarity.
  • Sparse config:
    • Enable an in‑memory sparse index (on_disk=False) to support hybrid scoring.
  • These settings establish the collection’s dual‑index layout used later by QdrantVectorStore for hybrid retrieval.
dense_vector_name = "dense"
dense_config = models.VectorParams(
    size=len(Settings.embed_model.get_text_embedding("probe")),
    distance=models.Distance.COSINE,
)

sparse_vector_name = "sparse"
sparse_config = models.SparseVectorParams(
    index=models.SparseIndexParams(on_disk=False)
)
  • shard_keys: [‘tenant_a’, ‘tenant_b’] — predefined partitions used with custom sharding to keep each tenant local.
  • payload_indexes: keyword index on tenant_id to accelerate filter-based queries.
  • shard_key_selector_fn(tenant_id) -> tenant_id: returns the shard key used for both writes and reads.
shard_keys = ["tenant_a", "tenant_b"]

payload_indexes = [
    {
        "field_name": "tenant_id",
        "field_schema": models.PayloadSchemaType.KEYWORD,
    }
]

def shard_key_selector_fn(tenant_id: str) -> models.ShardKeySelector:
    return tenant_id

This step creates or attaches to the collection named in COLLECTION and configures a dual‑vector hybrid store:

  • Hybrid search: enable_hybrid=True with dense_vector_name='dense' and sparse_vector_name='sparse'.
  • Dense config: dense_config uses cosine distance and derives size from Settings.embed_model.
  • Sparse config: sparse_config enables an in‑memory sparse index; fastembed_sparse_model='Qdrant/bm25' supplies BM25‑style signals.
  • Distributed topology:
    • sharding_method=Custom with shard_keys=['tenant_a','tenant_b'].
    • shard_key_selector_fn(tenant_id) -> tenant_id routes both writes and reads.
    • shard_number=6, replication_factor=2 for scale and high availability.
  • Payload index: payload_indexes accelerates filtering on tenant_id.

Idempotent behavior: the vector store will create the collection if missing and reuse it on subsequent runs.

vector_store = QdrantVectorStore(
    collection_name=COLLECTION,
    client=client,
    aclient=aclient,
    dense_vector_name=dense_vector_name,
    sparse_vector_name=sparse_vector_name,
    enable_hybrid=True,
    dense_config=dense_config,
    sparse_config=sparse_config,
    fastembed_sparse_model="Qdrant/bm25",
    shard_number=6,
    sharding_method=models.ShardingMethod.CUSTOM,
    shard_key_selector_fn=shard_key_selector_fn,
    shard_keys=shard_keys,
    replication_factor=2,
    payload_indexes=payload_indexes,
)
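
As a sanity check, you can inspect the collection that was just created. A short sketch using qdrant-client's response models (note that under custom sharding, shard_number applies per shard key):

info = client.get_collection(COLLECTION)
print(info.config.params.vectors)          # dense vector config ("dense", cosine)
print(info.config.params.sparse_vectors)   # sparse vector config ("sparse")
print(info.config.params.shard_number)     # 6, per shard key under custom sharding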

We simulate two tenants with a few short documents each. Every Document carries:

  • tenant_id for isolation and shard routing,
  • tags for quick filtering and debugging,
  • text content used for dense/sparse indexing.
TENANT_DOCS: dict[str, list[Document]] = {
    "tenant_a": [
        Document(
            text="Solar panels reduce electricity bills and carbon footprint",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "solar"]},
        ),
        Document(
            text="Inverters convert DC power to AC for home appliances",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "hardware"]},
        ),
        Document(
            text="Net metering policies vary by region and utility provider",
            metadata={
                "tenant_id": "tenant_a",
                "tags": ["policy", "regulation"],
            },
        ),
    ],
    "tenant_b": [
        Document(
            text="Kubernetes orchestrates containers across a cluster",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "k8s"]},
        ),
        Document(
            text="Service meshes add observability and traffic management",
            metadata={
                "tenant_id": "tenant_b",
                "tags": ["cloud", "networking"],
            },
        ),
        Document(
            text="Helm charts package and deploy Kubernetes applications",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "devops"]},
        ),
    ],
}

Here we embed text with the active Settings.embed_model, then upsert each point with its payload and a shard key. This keeps each tenant local to a shard group in the cluster.

  • FastEmbed keeps this demo CPU-friendly. For production, consider a service (e.g., text-embedding-3-large or in-house model) and cache embeddings.
  • If you change the model, update dense_config.size to match and consider reindexing.
  • Avoid embedding on every run in notebooks; persist or cache to speed up iterations (a minimal caching sketch follows the helper below).
def create_dense_embeddings(docs: list[Document]) -> list[Document]:
    for doc in docs:
        doc.embedding = Settings.embed_model.get_text_embedding(doc.text)
    return docs
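
Following the caching tip above, here is a minimal sketch of a disk-backed embedding cache. The cache path and helper name are illustrative, not part of the notebook's API:

import hashlib
import json
from pathlib import Path

CACHE_PATH = Path(".embedding_cache.json")  # hypothetical local cache file

def get_embedding_cached(text: str) -> list[float]:
    # Key embeddings by a hash of the text so reruns skip recomputation
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = Settings.embed_model.get_text_embedding(text)
        CACHE_PATH.write_text(json.dumps(cache))
    return cache[key]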
  • We embed each doc with the configured embedding model (dense) and rely on the vector store to build the sparse representation.
  • Writes use shard_identifier=tenant_id, ensuring documents live on the intended shard group.

Tip: For large batches, prefer the async ingestion APIs and chunk documents for backpressure control.

for tenant_id, docs in TENANT_DOCS.items():
    docs = create_dense_embeddings(docs)
    await vector_store.async_add(docs, shard_identifier=tenant_id)
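
For larger corpora, the batching tip above might look like the following sketch; the batch size is an arbitrary illustration and should be tuned to your payload sizes and cluster capacity:

import asyncio

BATCH_SIZE = 64  # illustrative value

async def ingest_tenant(tenant_id: str, docs: list[Document]) -> None:
    # Chunk uploads so each request stays small and failures are cheap to retry
    for start in range(0, len(docs), BATCH_SIZE):
        batch = create_dense_embeddings(docs[start : start + BATCH_SIZE])
        await vector_store.async_add(batch, shard_identifier=tenant_id)

# In a notebook cell:
# await asyncio.gather(*(ingest_tenant(t, d) for t, d in TENANT_DOCS.items()))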

StorageContext.from_defaults(vector_store=vector_store) binds the Qdrant collection to LlamaIndex’s VectorStoreIndex without re-ingesting data.

Benefits:

  • Reuse the same physical collection for multiple retrievers or query pipelines.
  • Swap retrieval modes (dense-only, sparse-only, hybrid) via retriever config, not data layout (see the sketch after this cell).
  • Keep ingestion concerns (sharding, replication) decoupled from application query logic.
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)
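
As noted above, retrieval modes can be swapped per retriever without touching the stored data. A small sketch over the same index, shown here for dense-only and sparse-only modes:

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores.types import VectorStoreQueryMode

# Dense-only retriever over the same collection
dense_retriever = VectorIndexRetriever(
    index=index,
    vector_store_query_mode=VectorStoreQueryMode.DEFAULT,
    similarity_top_k=5,
    vector_store_kwargs={"shard_identifier": "tenant_a"},
)

# Sparse-only (BM25-style) retriever over the same collection
sparse_retriever = VectorIndexRetriever(
    index=index,
    vector_store_query_mode=VectorStoreQueryMode.SPARSE,
    sparse_top_k=5,
    vector_store_kwargs={"shard_identifier": "tenant_a"},
)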

Use a tenant-scoped hybrid retriever and keep queries shard-local. You can also apply metadata filters to narrow results within the tenant's data (a filter sketch appears at the end of this section).

  • Set vector_store_query_mode=HYBRID to combine dense and sparse. Tune similarity_top_k, sparse_top_k, and hybrid_top_k.
  • Pass vector_store_kwargs={"shard_identifier": tenant_id} to keep queries within the tenant’s shard.
  • Add metadata filters (e.g., on tenant_id or tags) to further narrow candidates when needed.
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores.types import VectorStoreQueryMode

def create_retriever_for_tenant(tenant_id: str) -> VectorIndexRetriever:
    if tenant_id not in shard_keys:
        raise ValueError(
            f"Unknown tenant_id: {tenant_id}. Expected one of {shard_keys}"
        )
    return VectorIndexRetriever(
        index=index,
        vector_store_query_mode=VectorStoreQueryMode.HYBRID,
        similarity_top_k=5,
        sparse_top_k=5,
        hybrid_top_k=5,
        vector_store_kwargs={"shard_identifier": tenant_id},
    )
tenant_id = "tenant_b"
retriever = create_retriever_for_tenant(tenant_id)

query = "manage microservices traffic and observability"
results = retriever.retrieve(query)

print(f"Tenant: {tenant_id} | Query: {query}")
for i, r in enumerate(results, 1):
    meta = r.node.metadata
    print(
        f"{i}. score={r.score:.4f} | tags={meta.get('tags')} | text={r.node.get_content()}"
    )
Tenant: tenant_b | Query: manage microservices traffic and observability
1. score=4.6271 | tags=['cloud', 'networking'] | text=Service meshes add observability and traffic management
2. score=0.1213 | tags=['cloud', 'k8s'] | text=Kubernetes orchestrates containers across a cluster
3. score=0.0000 | tags=['cloud', 'devops'] | text=Helm charts package and deploy Kubernetes applications
  • The printout shows the hybrid score, tags (metadata), and a snippet of the matched text.
  • Verify tenant isolation by switching tenant_id and observing that results come only from that tenant’s documents.
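
As mentioned earlier, metadata filters can narrow candidates within a tenant's shard. A sketch using LlamaIndex's filter types; the tag value comes from the demo data above:

from llama_index.core.vector_stores.types import MetadataFilter, MetadataFilters

# Hybrid retrieval restricted to tenant_b documents tagged "networking"
filtered_retriever = VectorIndexRetriever(
    index=index,
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    similarity_top_k=5,
    filters=MetadataFilters(filters=[MetadataFilter(key="tags", value="networking")]),
    vector_store_kwargs={"shard_identifier": "tenant_b"},
)
for r in filtered_retriever.retrieve("traffic management"):
    print(r.node.get_content())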