Chroma Multi-Modal Demo with LlamaIndex
Chroma is a AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.
Chroma is fully-typed, fully-tested and fully-documented.
Install Chroma with:
pip install chromadb
Chroma runs in various modes. See below for examples of each integrated with LangChain.
in-memory
- in a python script or jupyter notebookin-memory with persistence
- in a script or notebook and save/load to diskin a docker container
- as a server running your local machine or in the cloud
Like any other database, you can:
.add
.get
.update
.upsert
.delete
.peek
- and
.query
runs the similarity search.
View full docs at docs.
Basic Example
Section titled “Basic Example”In this basic example, we take the a Paul Graham essay, split it into chunks, embed it using an open-source embedding model, load it into Chroma, and then query it.
If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-qdrant%pip install llama-index-embeddings-huggingface%pip install llama-index-vector-stores-chroma
!pip install llama-index
Creating a Chroma Index
Section titled “Creating a Chroma Index”!pip install llama-index chromadb --quiet!pip install chromadb==0.4.17!pip install sentence-transformers!pip install pydantic==1.10.11!pip install open-clip-torch
# importfrom llama_index.core import VectorStoreIndex, SimpleDirectoryReaderfrom llama_index.vector_stores.chroma import ChromaVectorStorefrom llama_index.core import StorageContextfrom llama_index.embeddings.huggingface import HuggingFaceEmbeddingfrom IPython.display import Markdown, displayimport chromadb
# set up OpenAIimport osimport openai
OPENAI_API_KEY = ""openai.api_key = OPENAI_API_KEYos.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
Download Images and Texts from Wikipedia
Section titled “Download Images and Texts from Wikipedia”import requests
def get_wikipedia_images(title): response = requests.get( "https://en.wikipedia.org/w/api.php", params={ "action": "query", "format": "json", "titles": title, "prop": "imageinfo", "iiprop": "url|dimensions|mime", "generator": "images", "gimlimit": "50", }, ).json() image_urls = [] for page in response["query"]["pages"].values(): if page["imageinfo"][0]["url"].endswith(".jpg") or page["imageinfo"][ 0 ]["url"].endswith(".png"): image_urls.append(page["imageinfo"][0]["url"]) return image_urls
from pathlib import Pathimport urllib.request
image_uuid = 0MAX_IMAGES_PER_WIKI = 20
wiki_titles = { "Tesla Model X", "Pablo Picasso", "Rivian", "The Lord of the Rings", "The Matrix", "The Simpsons",}
data_path = Path("mixed_wiki")if not data_path.exists(): Path.mkdir(data_path)
for title in wiki_titles: response = requests.get( "https://en.wikipedia.org/w/api.php", params={ "action": "query", "format": "json", "titles": title, "prop": "extracts", "explaintext": True, }, ).json() page = next(iter(response["query"]["pages"].values())) wiki_text = page["extract"]
with open(data_path / f"{title}.txt", "w") as fp: fp.write(wiki_text)
images_per_wiki = 0 try: # page_py = wikipedia.page(title) list_img_urls = get_wikipedia_images(title) # print(list_img_urls)
for url in list_img_urls: if url.endswith(".jpg") or url.endswith(".png"): image_uuid += 1 # image_file_name = title + "_" + url.split("/")[-1]
urllib.request.urlretrieve( url, data_path / f"{image_uuid}.jpg" ) images_per_wiki += 1 # Limit the number of images downloaded per wiki page to 15 if images_per_wiki > MAX_IMAGES_PER_WIKI: break except: print(str(Exception("No images found for Wikipedia page: ")) + title) continue
Set the embedding model
Section titled “Set the embedding model”from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
# set defalut text and image embedding functionsembedding_function = OpenCLIPEmbeddingFunction()
/Users/haotianzhang/llama_index/venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
Build Chroma Multi-Modal Index with LlamaIndex
Section titled “Build Chroma Multi-Modal Index with LlamaIndex”from llama_index.core.indices import MultiModalVectorStoreIndexfrom llama_index.vector_stores.qdrant import QdrantVectorStorefrom llama_index.core import SimpleDirectoryReader, StorageContextfrom chromadb.utils.data_loaders import ImageLoader
image_loader = ImageLoader()
# create client and a new collectionchroma_client = chromadb.EphemeralClient()chroma_collection = chroma_client.create_collection( "multimodal_collection", embedding_function=embedding_function, data_loader=image_loader,)
# load documentsdocuments = SimpleDirectoryReader("./mixed_wiki/").load_data()
# set up ChromaVectorStore and load in datavector_store = ChromaVectorStore(chroma_collection=chroma_collection)storage_context = StorageContext.from_defaults(vector_store=vector_store)index = VectorStoreIndex.from_documents( documents, storage_context=storage_context,)
Retrieve results from Multi-Modal Index
Section titled “Retrieve results from Multi-Modal Index”retriever = index.as_retriever(similarity_top_k=50)retrieval_results = retriever.retrieve("Picasso famous paintings")
# print(retrieval_results)from llama_index.core.schema import ImageNodefrom llama_index.core.response.notebook_utils import ( display_source_node, display_image_uris,)
image_results = []MAX_RES = 5cnt = 0for r in retrieval_results: if isinstance(r.node, ImageNode): image_results.append(r.node.metadata["file_path"]) else: if cnt < MAX_RES: display_source_node(r) cnt += 1
display_image_uris(image_results, [3, 3], top_k=2)
Node ID: 13adcbba-fe8b-4d51-9139-fb1c55ffc6be
Similarity: 0.774399292477267
Text: == Artistic legacy ==
Picasso’s influence was and remains immense and widely acknowledged by his …
Node ID: 4100593e-6b6a-4b5f-8384-98d1c2468204
Similarity: 0.7695965506408678
Text: === Later works to final years: 1949–1973 ===
Picasso was one of 250 sculptors who exhibited in t…
Node ID: aeed9d43-f9c5-42a9-a7b9-1a3c005e3745
Similarity: 0.7693110304140338
Text: Pablo Ruiz Picasso (25 October 1881 – 8 April 1973) was a Spanish painter, sculptor, printmaker, …
Node ID: 5a6613b6-b599-4e40-92f2-231e10ed54f6
Similarity: 0.7656537748231977
Text: === The Basel vote ===
In the 1940s, a Swiss insurance company based in Basel had bought two pain…
Node ID: cc17454c-030d-4f86-a12e-342d0582f4d3
Similarity: 0.7639671751819532
Text: == Style and technique ==
Picasso was exceptionally prolific throughout his long lifetime. At hi…