Hybrid RAG with Qdrant: multi-tenancy, custom sharding, distributed setup

This notebook implements a production-style Hybrid RAG on Qdrant using LlamaIndex, designed for multi-tenancy and scale-out via custom sharding.

  • Hybrid search: dense embeddings + sparse BM25 for higher recall and precision.
  • Multi-tenancy: isolate tenants using payload filters and shard routing.
  • Custom sharding: keep each tenant local for performance and cost efficiency.
  • Distributed Qdrant: multi-node setup with replication for high availability and throughput.

This notebook walks through an end-to-end Retrieval-Augmented Generation (RAG) workflow that uses Qdrant as a distributed hybrid search backend and LlamaIndex as the orchestration layer. You will build a tenant-aware RAG pipeline that combines dense vectors with sparse signals, isolates each tenant's data with payload filters, and routes both writes and queries with a custom shard key for scale.

  • llama-index: orchestration layer for ingestion, indexing, and retrieval.
  • llama-index-vector-stores-qdrant: Qdrant integration with hybrid support.
  • fastembed: lightweight, CPU-friendly embedding and sparse models.
%pip install -U llama-index llama-index-vector-stores-qdrant fastembed

Make sure you have a distributed Qdrant cluster up and running. Here is a minimal compose.yaml for a two-node cluster:

services:
  qdrant_primary:
    image: "qdrant/qdrant:latest"
    ports:
      - "6333:6333"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--uri", "http://qdrant_primary:6335"]
    restart: always

  qdrant_secondary:
    image: "qdrant/qdrant:latest"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--bootstrap", "http://qdrant_primary:6335"]
    restart: always
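
Before moving on, you can verify that the two nodes formed a cluster. A minimal check, assuming the default REST port exposed by the compose file above (Qdrant's GET /cluster endpoint reports peer membership):

import requests

status = requests.get("http://localhost:6333/cluster").json()
print(status["result"]["status"])       # "enabled" when clustering is active
print(len(status["result"]["peers"]))   # expect 2 peers for this compose file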
  • Embeddings: FastEmbedEmbedding('BAAI/bge-base-en-v1.5') is a compact, high-quality baseline.
  • Connection: QDRANT_URL defaults to an HTTP endpoint; set QDRANT_API_KEY for secured/cloud setups.
import os

from qdrant_client import AsyncQdrantClient, QdrantClient
from qdrant_client import models
from llama_index.core import (
    Settings,
    VectorStoreIndex,
    Document,
    StorageContext,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.fastembed import FastEmbedEmbedding

# Embeddings, small and fast
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Qdrant connection, local by default, set QDRANT_URL and QDRANT_API_KEY for cloud
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

client: QdrantClient = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
aclient: AsyncQdrantClient = AsyncQdrantClient(
    url=QDRANT_URL, api_key=QDRANT_API_KEY
)
COLLECTION = "hybrid_rag_multitenant_sharding_demo"
  • Define vector field names: dense for embeddings and sparse for BM25‑style signals.
  • Dense config:
    • Determine embedding dimensionality at runtime by probing Settings.embed_model (avoids hardcoding).
    • Use cosine distance for semantic similarity.
  • Sparse config:
    • Enable an in‑memory sparse index (on_disk=False) to support hybrid scoring.
  • These settings establish the collection’s dual‑index layout used later by QdrantVectorStore for hybrid retrieval.
dense_vector_name = "dense"
dense_config = models.VectorParams(
    size=len(Settings.embed_model.get_text_embedding("probe")),
    distance=models.Distance.COSINE,
)

sparse_vector_name = "sparse"
sparse_config = models.SparseVectorParams(
    index=models.SparseIndexParams(on_disk=False)
)
  • shard_keys: [‘tenant_a’, ‘tenant_b’] — predefined partitions used with custom sharding to keep each tenant local.
  • payload_indexes: keyword index on tenant_id to accelerate filter-based queries.
  • shard_key_selector_fn(tenant_id) -> tenant_id: returns the shard key used for both writes and reads.
shard_keys = ["tenant_a", "tenant_b"]

payload_indexes = [
    {
        "field_name": "tenant_id",
        "field_schema": models.PayloadSchemaType.KEYWORD,
    }
]

def shard_key_selector_fn(tenant_id: str) -> models.ShardKeySelector:
    return tenant_id

This step creates or attaches to the collection named in COLLECTION and configures a dual‑vector hybrid store:

  • Hybrid search: enable_hybrid=True with dense_vector_name='dense' and sparse_vector_name='sparse'.
  • Dense config: dense_config uses cosine distance and derives size from Settings.embed_model.
  • Sparse config: sparse_config enables an in‑memory sparse index; fastembed_sparse_model='Qdrant/bm25' supplies BM25‑style signals.
  • Distributed topology:
    • sharding_method=Custom with shard_keys=['tenant_a','tenant_b'].
    • shard_key_selector_fn(tenant_id) -> tenant_id routes both writes and reads.
    • shard_number=6, replication_factor=2 for scale and high availability.
  • Payload index: payload_indexes accelerates filtering on tenant_id.

Idempotent behavior: the vector store will create the collection if missing and reuse it on subsequent runs.

vector_store = QdrantVectorStore(
    collection_name=COLLECTION,
    client=client,
    aclient=aclient,
    dense_vector_name=dense_vector_name,
    sparse_vector_name=sparse_vector_name,
    enable_hybrid=True,
    dense_config=dense_config,
    sparse_config=sparse_config,
    fastembed_sparse_model="Qdrant/bm25",
    shard_number=6,
    sharding_method=models.ShardingMethod.CUSTOM,
    shard_key_selector_fn=shard_key_selector_fn,
    shard_keys=shard_keys,
    replication_factor=2,
    payload_indexes=payload_indexes,
)
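
As a sanity check, you can inspect the collection that was just created. A short sketch using qdrant-client's response models (note that under custom sharding, shard_number applies per shard key):

info = client.get_collection(COLLECTION)
print(info.config.params.vectors)          # dense vector config ("dense", cosine)
print(info.config.params.sparse_vectors)   # sparse vector config ("sparse")
print(info.config.params.shard_number)     # 6, per shard key under custom sharding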

We simulate two tenants with a few short documents each. Every Document carries:

  • tenant_id for isolation and shard routing,
  • tags for quick filtering and debugging,
  • text content used for dense/sparse indexing.
TENANT_DOCS: dict[str, list[Document]] = {
    "tenant_a": [
        Document(
            text="Solar panels reduce electricity bills and carbon footprint",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "solar"]},
        ),
        Document(
            text="Inverters convert DC power to AC for home appliances",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "hardware"]},
        ),
        Document(
            text="Net metering policies vary by region and utility provider",
            metadata={
                "tenant_id": "tenant_a",
                "tags": ["policy", "regulation"],
            },
        ),
    ],
    "tenant_b": [
        Document(
            text="Kubernetes orchestrates containers across a cluster",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "k8s"]},
        ),
        Document(
            text="Service meshes add observability and traffic management",
            metadata={
                "tenant_id": "tenant_b",
                "tags": ["cloud", "networking"],
            },
        ),
        Document(
            text="Helm charts package and deploy Kubernetes applications",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "devops"]},
        ),
    ],
}

Here we embed text with the active Settings.embed_model, then upsert each point with its payload and a shard key. This keeps each tenant local to a shard group in the cluster.

  • FastEmbed keeps this demo CPU-friendly. For production, consider a service (e.g., text-embedding-3-large or in-house model) and cache embeddings.
  • If you change the model, update dense_config.size to match and consider reindexing.
  • Avoid embedding on every run in notebooks; persist or cache to speed up iterations (a minimal caching sketch follows the helper below).
def create_dense_embeddings(docs: list[Document]) -> list[Document]:
    for doc in docs:
        doc.embedding = Settings.embed_model.get_text_embedding(doc.text)
    return docs
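
Following the caching tip above, here is a minimal sketch of a disk-backed embedding cache. The cache path and helper name are illustrative, not part of the notebook's API:

import hashlib
import json
from pathlib import Path

CACHE_PATH = Path(".embedding_cache.json")  # hypothetical local cache file

def get_embedding_cached(text: str) -> list[float]:
    # Key embeddings by a hash of the text so reruns skip recomputation
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = Settings.embed_model.get_text_embedding(text)
        CACHE_PATH.write_text(json.dumps(cache))
    return cache[key]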
  • We embed each doc with the configured embedding model (dense) and rely on the vector store to build the sparse representation.
  • Writes use shard_identifier=tenant_id, ensuring documents live on the intended shard group.

Tip: For large batches, prefer the async ingestion APIs and chunk documents for backpressure control.

for tenant_id, docs in TENANT_DOCS.items():
    docs = create_dense_embeddings(docs)
    await vector_store.async_add(docs, shard_identifier=tenant_id)
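
For larger corpora, the batching tip above might look like the following sketch; the batch size is an arbitrary illustration and should be tuned to your payload sizes and cluster capacity:

import asyncio

BATCH_SIZE = 64  # illustrative value

async def ingest_tenant(tenant_id: str, docs: list[Document]) -> None:
    # Chunk uploads so each request stays small and failures are cheap to retry
    for start in range(0, len(docs), BATCH_SIZE):
        batch = create_dense_embeddings(docs[start : start + BATCH_SIZE])
        await vector_store.async_add(batch, shard_identifier=tenant_id)

# In a notebook cell:
# await asyncio.gather(*(ingest_tenant(t, d) for t, d in TENANT_DOCS.items()))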

StorageContext.from_defaults(vector_store=vector_store) binds the Qdrant collection to LlamaIndex’s VectorStoreIndex without re-ingesting data.

Benefits:

  • Reuse the same physical collection for multiple retrievers or query pipelines.
  • Swap retrieval modes (dense-only, sparse-only, hybrid) via retriever config, not data layout (see the sketch after this cell).
  • Keep ingestion concerns (sharding, replication) decoupled from application query logic.
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)
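
As noted above, retrieval modes can be swapped per retriever without touching the stored data. A small sketch over the same index, shown here for dense-only and sparse-only modes:

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores.types import VectorStoreQueryMode

# Dense-only retriever over the same collection
dense_retriever = VectorIndexRetriever(
    index=index,
    vector_store_query_mode=VectorStoreQueryMode.DEFAULT,
    similarity_top_k=5,
    vector_store_kwargs={"shard_identifier": "tenant_a"},
)

# Sparse-only (BM25-style) retriever over the same collection
sparse_retriever = VectorIndexRetriever(
    index=index,
    vector_store_query_mode=VectorStoreQueryMode.SPARSE,
    sparse_top_k=5,
    vector_store_kwargs={"shard_identifier": "tenant_a"},
)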

Use a tenant-scoped hybrid retriever and keep queries shard-local. You can also apply metadata filters to narrow results within the tenant's data (a filter sketch appears at the end of this section).

  • Set vector_store_query_mode=HYBRID to combine dense and sparse. Tune similarity_top_k, sparse_top_k, and hybrid_top_k.
  • Pass vector_store_kwargs={"shard_identifier": tenant_id} to keep queries within the tenant’s shard.
  • Add metadata filters (e.g., on tenant_id or tags) to further narrow candidates when needed.
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores.types import VectorStoreQueryMode

def create_retriever_for_tenant(tenant_id: str) -> VectorIndexRetriever:
    if tenant_id not in shard_keys:
        raise ValueError(
            f"Unknown tenant_id: {tenant_id}. Expected one of {shard_keys}"
        )
    return VectorIndexRetriever(
        index=index,
        vector_store_query_mode=VectorStoreQueryMode.HYBRID,
        similarity_top_k=5,
        sparse_top_k=5,
        hybrid_top_k=5,
        vector_store_kwargs={"shard_identifier": tenant_id},
    )
tenant_id = "tenant_b"
retriever = create_retriever_for_tenant(tenant_id)

query = "manage microservices traffic and observability"
results = retriever.retrieve(query)

print(f"Tenant: {tenant_id} | Query: {query}")
for i, r in enumerate(results, 1):
    meta = r.node.metadata
    print(
        f"{i}. score={r.score:.4f} | tags={meta.get('tags')} | text={r.node.get_content()}"
    )
Tenant: tenant_b | Query: manage microservices traffic and observability
1. score=4.6271 | tags=['cloud', 'networking'] | text=Service meshes add observability and traffic management
2. score=0.1213 | tags=['cloud', 'k8s'] | text=Kubernetes orchestrates containers across a cluster
3. score=0.0000 | tags=['cloud', 'devops'] | text=Helm charts package and deploy Kubernetes applications
  • The printout shows the hybrid score, tags (metadata), and a snippet of the matched text.
  • Verify tenant isolation by switching tenant_id and observing that results come only from that tenant’s documents.
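
As mentioned earlier, metadata filters can narrow candidates within a tenant's shard. A sketch using LlamaIndex's filter types; the tag value comes from the demo data above:

from llama_index.core.vector_stores.types import MetadataFilter, MetadataFilters

# Hybrid retrieval restricted to tenant_b documents tagged "networking"
filtered_retriever = VectorIndexRetriever(
    index=index,
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    similarity_top_k=5,
    filters=MetadataFilters(filters=[MetadataFilter(key="tags", value="networking")]),
    vector_store_kwargs={"shard_identifier": "tenant_b"},
)
for r in filtered_retriever.retrieve("traffic management"):
    print(r.node.get_content())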