Hybrid RAG with Qdrant: multi-tenancy, custom sharding, distributed setup
What you'll build
This notebook implements a production-style Hybrid RAG on Qdrant using LlamaIndex, designed for multitenancy and scale-out via custom sharding.
- Hybrid search: dense embeddings + sparse BM25 for higher recall and precision.
- Multitenancy: isolate tenants using payload filters and shard routing.
- Custom sharding: keep each tenant local for performance and cost efficiency.
- Distributed Qdrant: multi-node setup with replication for high availability and throughput.
The notebook walks through an end-to-end retrieval-augmented generation (RAG) workflow that uses Qdrant as a distributed hybrid search backend and LlamaIndex as the orchestration layer. You will build a tenant-aware RAG that combines dense vectors with sparse signals, isolates data per tenant with filters, and routes both data and queries with a custom shard key for scale.
Install dependencies
About the dependencies
- llama-index: orchestration layer for ingestion, indexing, and retrieval.
- llama-index-vector-stores-qdrant: Qdrant integration with hybrid support.
- llama-index-embeddings-fastembed: provides the FastEmbedEmbedding class imported below.
- fastembed: lightweight, CPU-friendly embedding and sparse models.
%pip install -U llama-index llama-index-vector-stores-qdrant llama-index-embeddings-fastembed fastembed
Make sure you have a distributed Qdrant cluster up and running. Here is a compose.yaml file:
services:
  qdrant_primary:
    image: "qdrant/qdrant:latest"
    ports:
      - "6333:6333"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--uri", "http://qdrant_primary:6335"]
    restart: always
  qdrant_secondary:
    image: "qdrant/qdrant:latest"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--bootstrap", "http://qdrant_primary:6335"]
    restart: always
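Before creating the collection, it can help to confirm that both nodes actually joined the cluster. The sketch below assumes the REST port from the compose file (6333) and uses Qdrant's `GET /cluster` endpoint; the helper names and the trimmed `sample` payload are hypothetical and only illustrate the fields being read.

```python
import json
import urllib.request


def fetch_cluster_info(base_url: str = "http://localhost:6333") -> dict:
    """Call Qdrant's GET /cluster endpoint and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/cluster", timeout=5) as resp:
        return json.load(resp)


def summarize_cluster(info: dict) -> tuple[str, int]:
    """Return (cluster status, number of known peers) from a /cluster response."""
    result = info.get("result", {})
    return result.get("status", "unknown"), len(result.get("peers", {}))


# Hypothetical, trimmed response used for illustration only; a live call
# would be summarize_cluster(fetch_cluster_info()).
sample = {
    "result": {
        "status": "enabled",
        "peers": {
            "111": {"uri": "http://qdrant_primary:6335/"},
            "222": {"uri": "http://qdrant_secondary:6335/"},
        },
    },
    "status": "ok",
}
print(summarize_cluster(sample))
```

A status other than "enabled", or fewer than two peers, would suggest the secondary node did not bootstrap against the primary.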
Imports and global settings
Settings and connectivity
- Embeddings: FastEmbedEmbedding('BAAI/bge-base-en-v1.5') is a compact, high-quality baseline.
- Connection: QDRANT_URL defaults to an HTTP endpoint; set QDRANT_API_KEY for secured/cloud setups.
import os
from qdrant_client import AsyncQdrantClient, QdrantClient
from qdrant_client import models

from llama_index.core import (
    Settings,
    VectorStoreIndex,
    Document,
    StorageContext,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.fastembed import FastEmbedEmbedding

# Embeddings, small and fast
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Qdrant connection, local by default, set QDRANT_URL and QDRANT_API_KEY for cloud
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

client: QdrantClient = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
aclient: AsyncQdrantClient = AsyncQdrantClient(
    url=QDRANT_URL, api_key=QDRANT_API_KEY
)
COLLECTION = "hybrid_rag_multitenant_sharding_demo"
Create distributed-ready collection
Configure dual-vector schema (dense + sparse)
- Define vector field names: dense for embeddings and sparse for BM25-style signals.
- Dense config:
  - Determine embedding dimensionality at runtime by probing Settings.embed_model (avoids hardcoding).
  - Use cosine distance for semantic similarity.
- Sparse config:
  - Enable an in-memory sparse index (on_disk=False) to support hybrid scoring.
- These settings establish the collection's dual-index layout used later by QdrantVectorStore for hybrid retrieval.
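To make the two vector types concrete, here is a minimal pure-Python sketch of how each kind is typically scored: cosine similarity for dense vectors and a dot product over shared indices for sparse ones. The dict-based sparse representation and the function names are assumptions for illustration; Qdrant stores sparse vectors as index/value pairs and computes scores server-side.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product over the token ids that both sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


dense_score = cosine_similarity([1.0, 0.0], [1.0, 1.0])
# Only token id 7 appears in both sparse vectors, so only it contributes.
sparse_score = sparse_dot({3: 1.5, 7: 2.0}, {7: 0.5, 9: 1.0})
print(round(dense_score, 4), sparse_score)
```

Dense scores reward semantic closeness even without shared words, while the sparse score is zero unless query and document share at least one term, which is why the two signals complement each other.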
dense_vector_name = "dense"
dense_config = models.VectorParams(
    size=len(Settings.embed_model.get_text_embedding("probe")),
    distance=models.Distance.COSINE,
)
sparse_vector_name = "sparse"
sparse_config = models.SparseVectorParams(
    index=models.SparseIndexParams(on_disk=False)
)
Shard keys and selector contract
- shard_keys: ["tenant_a", "tenant_b"], predefined partitions used with custom sharding to keep each tenant local.
- payload_indexes: keyword index on tenant_id to accelerate filter-based queries.
- shard_key_selector_fn(tenant_id) -> tenant_id: returns the shard key used for both writes and reads.
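The selector in this notebook is an identity mapping, so an unknown tenant would only fail once the request reaches Qdrant. A stricter variant, sketched here under the assumption that the set of shard keys is fixed up front, validates the tenant before returning the key. The factory name `make_shard_key_selector` is hypothetical.

```python
from typing import Callable


def make_shard_key_selector(allowed_keys: list[str]) -> Callable[[str], str]:
    """Build a selector that maps tenant_id to its shard key, rejecting unknown tenants."""
    allowed = frozenset(allowed_keys)

    def select(tenant_id: str) -> str:
        if tenant_id not in allowed:
            raise ValueError(
                f"Unknown tenant_id: {tenant_id!r}. Expected one of {sorted(allowed)}"
            )
        return tenant_id

    return select


selector = make_shard_key_selector(["tenant_a", "tenant_b"])
print(selector("tenant_a"))
```

Failing fast on the client side keeps a typo in a tenant name from turning into a confusing routing error later.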
shard_keys = ["tenant_a", "tenant_b"]
payload_indexes = [
    {
        "field_name": "tenant_id",
        "field_schema": models.PayloadSchemaType.KEYWORD,
    }
]

def shard_key_selector_fn(tenant_id: str) -> models.ShardKeySelector:
    return tenant_id
Initialize hybrid Qdrant store with custom sharding
This step creates or attaches to the collection named in COLLECTION and configures a dual-vector hybrid store:
- Hybrid search: enable_hybrid=True with dense_vector_name='dense' and sparse_vector_name='sparse'.
- Dense config: dense_config uses cosine distance and derives size from Settings.embed_model.
- Sparse config: sparse_config enables an in-memory sparse index; fastembed_sparse_model='Qdrant/bm25' supplies BM25-style signals.
- Distributed topology:
  - sharding_method=Custom with shard_keys=['tenant_a','tenant_b'].
  - shard_key_selector_fn(tenant_id) -> tenant_id routes both writes and reads.
  - shard_number=6, replication_factor=2 for scale and high availability.
- Payload index: payload_indexes accelerates filtering on tenant_id.
Idempotent behavior: the vector store will create the collection if missing and reuse it on subsequent runs.
vector_store = QdrantVectorStore(
    collection_name=COLLECTION,
    client=client,
    aclient=aclient,
    dense_vector_name=dense_vector_name,
    sparse_vector_name=sparse_vector_name,
    enable_hybrid=True,
    dense_config=dense_config,
    sparse_config=sparse_config,
    fastembed_sparse_model="Qdrant/bm25",
    shard_number=6,
    sharding_method=models.ShardingMethod.CUSTOM,
    shard_key_selector_fn=shard_key_selector_fn,
    shard_keys=shard_keys,
    replication_factor=2,
    payload_indexes=payload_indexes,
)
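It is worth working out how many shard replicas this configuration produces. The sketch below assumes, as Qdrant's documentation on user-defined sharding describes, that shard_number counts shards per shard key when custom sharding is enabled. The helper name and the even spread across nodes are simplifying assumptions.

```python
def shard_layout(
    shard_number: int,
    num_shard_keys: int,
    replication_factor: int,
    num_nodes: int,
) -> dict[str, float]:
    """Estimate shard counts, assuming shard_number applies per shard key."""
    logical_shards = shard_number * num_shard_keys
    shard_replicas = logical_shards * replication_factor
    return {
        "logical_shards": logical_shards,
        "shard_replicas": shard_replicas,
        # Assumes replicas are spread evenly across nodes.
        "replicas_per_node": shard_replicas / num_nodes,
    }


# Values from this notebook: 6 shards per key, 2 tenants, 2 replicas, 2 nodes.
layout = shard_layout(shard_number=6, num_shard_keys=2, replication_factor=2, num_nodes=2)
print(layout)
```

Under these assumptions the two-node cluster holds 12 logical shards and 24 replicas in total, so each node ends up with a copy of every shard. That is generous for a demo with six documents; fewer shards per key may be enough for small tenants.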
Prepare multi-tenant dataset
We create two tenants with small document sets. Each Document carries a tenant_id and tags in its metadata; LlamaIndex assigns the doc_id automatically.
Dataset design and extensibility
We simulate two tenants with a few short documents each. Every Document carries:
- tenant_id for isolation and shard routing,
- tags for quick filtering and debugging,
- text content used for dense/sparse indexing.
TENANT_DOCS: dict[str, list[Document]] = {
    "tenant_a": [
        Document(
            text="Solar panels reduce electricity bills and carbon footprint",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "solar"]},
        ),
        Document(
            text="Inverters convert DC power to AC for home appliances",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "hardware"]},
        ),
        Document(
            text="Net metering policies vary by region and utility provider",
            metadata={
                "tenant_id": "tenant_a",
                "tags": ["policy", "regulation"],
            },
        ),
    ],
    "tenant_b": [
        Document(
            text="Kubernetes orchestrates containers across a cluster",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "k8s"]},
        ),
        Document(
            text="Service meshes add observability and traffic management",
            metadata={
                "tenant_id": "tenant_b",
                "tags": ["cloud", "networking"],
            },
        ),
        Document(
            text="Helm charts package and deploy Kubernetes applications",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "devops"]},
        ),
    ],
}
Ingest with shard key for locality
Here we embed text with the active Settings.embed_model, then upsert each point with payload and a shard key. This keeps each tenant local to a shard group in a cluster.
Embedding strategy
- FastEmbed keeps this demo CPU-friendly. For production, consider a service (e.g., text-embedding-3-large or an in-house model) and cache embeddings.
- If you change the model, update dense_config.size to match and consider reindexing.
- Avoid embedding on every run in notebooks; persist or cache to speed up iterations.
def create_dense_embeddings(docs: list[Document]) -> list[Document]:
    for doc in docs:
        doc.embedding = Settings.embed_model.get_text_embedding(doc.text)
    return docs
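The caching advice above can be sketched with a small in-memory wrapper. This is an illustration under stated assumptions, not part of LlamaIndex: `EmbeddingCache` and `fake_embed` are hypothetical names, and the cache key includes the model name so that switching models does not return stale vectors.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by (model name, text) so repeated runs skip recomputation."""

    def __init__(self, embed_fn: Callable[[str], list[float]], model_name: str) -> None:
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        raw = f"{self._model_name}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
        return self._store[key]


# Stand-in embedder that records how often it is called.
calls: list[str] = []


def fake_embed(text: str) -> list[float]:
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(fake_embed, model_name="demo-model")
cache.get("solar")
cache.get("solar")  # served from the cache, no second embed call
print(len(calls))
```

In the notebook, `Settings.embed_model.get_text_embedding` could be passed as `embed_fn`; persisting `_store` to disk is left out to keep the sketch short.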
Ingestion flow and locality guarantees
- We embed each doc with the configured embedding model (dense) and rely on the vector store to build the sparse representation.
- Writes use shard_identifier=tenant_id, ensuring documents live on the intended shard group.
Tip: For large batches, prefer the async ingestion APIs and chunk documents for backpressure control.
for tenant_id, docs in TENANT_DOCS.items():
    docs = create_dense_embeddings(docs)
    await vector_store.async_add(docs, shard_identifier=tenant_id)
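For larger tenants, the chunking mentioned in the tip can be as simple as slicing the document list into fixed-size batches before each write. The helper below is a hypothetical sketch; the batch size of 64 in the comment is an arbitrary placeholder, not a recommended value.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# In the notebook, the ingestion loop could then become:
#   for batch in batched(docs, 64):
#       await vector_store.async_add(list(batch), shard_identifier=tenant_id)
batches = list(batched([1, 2, 3, 4, 5], 2))
print(batches)
```

Awaiting each batch before sending the next gives a basic form of backpressure, since at most one batch is in flight per tenant.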
Index wrapping and reusability
StorageContext.from_defaults(vector_store=vector_store) binds the Qdrant collection to LlamaIndex's VectorStoreIndex without re-ingesting data.
Benefits:
- Reuse the same physical collection for multiple retrievers or query pipelines.
- Swap retrieval modes (dense-only, sparse-only, hybrid) via retriever config, not data layout.
- Keep ingestion concerns (sharding, replication) decoupled from application query logic.
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)
Multi-tenant retrieval
Use a tenant-scoped hybrid retriever and keep queries shard-local. You can also use metadata filters if you want to filter within the tenant's data.
Retrieval tips for hybrid mode
- Set vector_store_query_mode=HYBRID to combine dense and sparse. Tune similarity_top_k, sparse_top_k, and hybrid_top_k.
- Pass vector_store_kwargs={"shard_identifier": tenant_id} to keep queries within the tenant's shard.
- Add metadata filters (e.g., on tenant_id or tags) to further narrow candidates when needed.
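To build intuition for why several top-k values exist, here is a sketch of one common fusion technique, reciprocal rank fusion (RRF): each retriever contributes a ranked candidate list, and a final cut-off is applied after merging. This is not a claim about how QdrantVectorStore fuses results internally; the function, its parameters, and the document ids are hypothetical.

```python
def reciprocal_rank_fusion(
    dense_ids: list[str],
    sparse_ids: list[str],
    top_k: int,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Merge two ranked id lists; each hit adds 1 / (k + rank) to its document's score."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]


dense_candidates = ["mesh", "k8s", "helm"]  # ranked by dense similarity
sparse_candidates = ["mesh", "helm"]        # ranked by sparse (BM25-style) score
fused = reciprocal_rank_fusion(dense_candidates, sparse_candidates, top_k=2)
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both lists accumulate score from each, so "helm" overtakes "k8s" here despite ranking lower in the dense list. Larger candidate lists give the fusion step more to work with, at the cost of extra retrieval work.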
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores.types import VectorStoreQueryMode

def create_retriever_for_tenant(tenant_id: str) -> VectorIndexRetriever:
    if tenant_id not in shard_keys:
        raise ValueError(
            f"Unknown tenant_id: {tenant_id}. Expected one of {shard_keys}"
        )
    return VectorIndexRetriever(
        index=index,
        vector_store_query_mode=VectorStoreQueryMode.HYBRID,
        similarity_top_k=5,
        sparse_top_k=5,
        hybrid_top_k=5,
        vector_store_kwargs={"shard_identifier": tenant_id},
    )

tenant_id = "tenant_b"
retriever = create_retriever_for_tenant(tenant_id)

query = "manage microservices traffic and observability"
results = retriever.retrieve(query)

print(f"Tenant: {tenant_id} | Query: {query}")
for i, r in enumerate(results, 1):
    meta = r.node.metadata
    print(
        f"{i}. score={r.score:.4f} | tags={meta.get('tags')} | text={r.node.get_content()}"
    )
Tenant: tenant_b | Query: manage microservices traffic and observability
1. score=4.6271 | tags=['cloud', 'networking'] | text=Service meshes add observability and traffic management
2. score=0.1213 | tags=['cloud', 'k8s'] | text=Kubernetes orchestrates containers across a cluster
3. score=0.0000 | tags=['cloud', 'devops'] | text=Helm charts package and deploy Kubernetes applications
Interpreting results
- The printout shows the hybrid score, tags (metadata), and snippet of the matched text.
- Verify tenant isolation by switching tenant_id and observing that results come only from that tenant's documents.
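The isolation check can also be automated instead of eyeballed. The sketch below assumes each result exposes `node.metadata` as in the printout above; `FakeNode` and `FakeResult` are hypothetical stand-ins so the example runs without a cluster, and `leaked_tenants` is a hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Stand-in for a retrieved node; only metadata is needed here."""
    metadata: dict = field(default_factory=dict)


@dataclass
class FakeResult:
    """Stand-in for a retrieval result wrapping a node and its score."""
    node: FakeNode
    score: float = 0.0


def leaked_tenants(results, expected_tenant: str) -> set[str]:
    """Return tenant ids found in results that differ from the expected tenant."""
    seen = {r.node.metadata.get("tenant_id", "<missing>") for r in results}
    return seen - {expected_tenant}


clean = [FakeResult(FakeNode({"tenant_id": "tenant_b"})) for _ in range(3)]
mixed = clean + [FakeResult(FakeNode({"tenant_id": "tenant_a"}))]
print(leaked_tenants(clean, "tenant_b"), leaked_tenants(mixed, "tenant_b"))
```

In the notebook, `assert not leaked_tenants(results, tenant_id)` after each query would turn the manual check into a guard that fails loudly if routing or filtering regresses.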