ClickHouse Vector Store
In this notebook we are going to show a quick demo of using the ClickHouseVectorStore.
If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
!pip install clickhouse_connect
Creating a ClickHouse Client
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from os import environ
import clickhouse_connect

environ["OPENAI_API_KEY"] = "sk-*"

# initialize client
client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username="default",
    password="",
)
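Hard-coded connection details are fine for a local demo, but you may prefer to read them from the environment. The following is a minimal sketch of one way to do that. The `CLICKHOUSE_*` variable names and the helper `clickhouse_settings_from_env` are assumptions made up for illustration; `clickhouse_connect` does not read them on its own. The defaults mirror the local settings used above.

```python
import os


def clickhouse_settings_from_env(env=None):
    """Build keyword arguments for clickhouse_connect.get_client.

    The CLICKHOUSE_* names are hypothetical; defaults match the local
    demo settings (localhost:8123, user "default", empty password).
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("CLICKHOUSE_HOST", "localhost"),
        "port": int(env.get("CLICKHOUSE_PORT", "8123")),
        "username": env.get("CLICKHOUSE_USER", "default"),
        "password": env.get("CLICKHOUSE_PASSWORD", ""),
    }


# Pass a plain dict here to keep the example independent of your shell.
settings = clickhouse_settings_from_env({"CLICKHOUSE_HOST": "db.internal"})
print(settings)
```

The resulting dictionary can then be unpacked into the call shown above, for example `clickhouse_connect.get_client(**settings)`.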
Load documents, build and store the VectorStoreIndex with ClickHouseVectorStore
Here we use a Paul Graham essay as the source text: it is turned into embeddings, stored in a ClickHouseVectorStore, and then queried to find context for our LLM QnA loop.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.clickhouse import ClickHouseVectorStore

# load documents
documents = SimpleDirectoryReader("../data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
print("Number of Documents: ", len(documents))
Document ID: d03ac7db-8dae-4199-bc38-445dec51a534
Number of Documents: 1
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-02-13 10:08:31-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’
data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.003s
2024-02-13 10:08:31 (23.9 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
You can process your files individually using SimpleDirectoryReader:
loader = SimpleDirectoryReader("./data/paul_graham/")
documents = loader.load_data()
for file in loader.input_files:
    print(file)
    # Here is where you would do any preprocessing
data/paul_graham/paul_graham_essay.txt
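As an illustration of the kind of preprocessing that could go in that loop, here is a small sketch that cleans up whitespace in raw text before it is indexed. The helper `normalize_whitespace` and the sample string are hypothetical and not part of LlamaIndex. Whether this step is useful depends on your source files.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and limit consecutive blank lines to one."""
    text = text.replace("\r\n", "\n")       # unify Windows line endings
    text = re.sub(r"[ \t]+", " ", text)     # squeeze spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


# Made-up sample input, only to show the effect of the helper.
sample = "What I  Worked On\r\n\r\n\r\n\r\nBefore college\tthe two main things..."
cleaned = normalize_whitespace(sample)
print(cleaned)
```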
# initialize with metadata filter and store indexes
from llama_index.core import StorageContext

for document in documents:
    document.metadata = {"user_id": "123", "favorite_color": "blue"}
vector_store = ClickHouseVectorStore(clickhouse_client=client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
Query Index
The ClickHouse vector store supports filtered search and hybrid search.
You can learn more about query_engine and retriever.
import textwrap

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(key="user_id", value="123"),
        ]
    ),
    similarity_top_k=2,
    vector_store_query_mode="hybrid",
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
The author learned several things during their time at Interleaf, including the importance of having
technology companies run by product people rather than sales people, the drawbacks of having too
many people edit code, the value of corridor conversations over planned meetings, the challenges of
dealing with big bureaucratic customers, and the importance of being the "entry level" option in a
market.
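The query above combines an exact-match metadata filter with `similarity_top_k=2`. To make that combination concrete, here is a pure-Python sketch of the idea: discard rows whose metadata does not match, then rank the rest by cosine similarity and keep the top k. This is only a conceptual illustration with made-up two-dimensional vectors. The real store delegates the work to ClickHouse, and the sketch does not attempt to model the hybrid mode.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_top_k(query_vec, rows, filters, k=2):
    """Keep rows matching every metadata filter, then rank by similarity."""
    matches = [
        row
        for row in rows
        if all(row["metadata"].get(key) == value for key, value in filters.items())
    ]
    matches.sort(
        key=lambda row: cosine_similarity(query_vec, row["vector"]),
        reverse=True,
    )
    return matches[:k]


# Toy data: hypothetical ids, vectors, and metadata.
rows = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"user_id": "123"}},
    {"id": "b", "vector": [0.0, 1.0], "metadata": {"user_id": "123"}},
    {"id": "c", "vector": [0.9, 0.1], "metadata": {"user_id": "456"}},
    {"id": "d", "vector": [0.7, 0.7], "metadata": {"user_id": "123"}},
]
top = filtered_top_k([1.0, 0.0], rows, {"user_id": "123"}, k=2)
print([row["id"] for row in top])  # ['a', 'd']
```

Row "c" is close to the query vector but is excluded because its `user_id` does not match, which is the effect the metadata filter has in the query above.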
Clear All Indexes
for document in documents:
    index.delete_ref_doc(document.doc_id)