Multi-modal models
Concept
Large language models (LLMs) are text-in, text-out. Large Multi-modal Models (LMMs) generalize this beyond text. For instance, models such as GPT-4V allow you to jointly input both images and text, and output text.
We've included a base MultiModalLLM abstraction to allow for text+image models. NOTE: This naming is subject to change!
Usage Pattern
- The following code snippet shows how you can get started using LMMs, e.g. with GPT-4V.
```python
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader

# load image documents from urls
image_documents = load_image_urls(image_urls)

# load image documents from local directory
image_documents = SimpleDirectoryReader(local_directory).load_data()

# non-streaming
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview",
    api_key=OPENAI_API_KEY,
    max_new_tokens=300,
)
response = openai_mm_llm.complete(
    prompt="what is in the image?",
    image_documents=image_documents,
)
```
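The snippet above uses the non-streaming complete call. A minimal streaming sketch is below, assuming stream_complete mirrors complete's prompt/image_documents signature (as it does on the base MultiModalLLM interface).

```python
# streaming: stream_complete yields partial CompletionResponse objects;
# printing each delta reproduces the answer token by token
response_gen = openai_mm_llm.stream_complete(
    prompt="what is in the image?",
    image_documents=image_documents,
)
for partial in response_gen:
    print(partial.delta, end="")
```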
- The following code snippet shows how you can build a MultiModal Vector Store/Index.
```python
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader, StorageContext

import qdrant_client

# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db")

# if you only need image_store for image retrieval,
# you can remove text_store
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)

storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Load text and image documents from local folder
documents = SimpleDirectoryReader("./data_folder/").load_data()

# Create the MultiModal index
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
```
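Before building the index it can help to check what SimpleDirectoryReader actually loaded. A small sketch, assuming image files come back as ImageDocument instances (the type the index routes to the image store):

```python
from llama_index.core.schema import ImageDocument

# split the loaded documents by type: ImageDocuments are embedded into the
# image store, everything else goes into the text store
image_docs = [doc for doc in documents if isinstance(doc, ImageDocument)]
text_docs = [doc for doc in documents if not isinstance(doc, ImageDocument)]
print(f"loaded {len(text_docs)} text documents and {len(image_docs)} image documents")
```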
- The following code snippet shows how you can use the MultiModal Retriever and Query Engine.
```python
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import PromptTemplate
from llama_index.core.query_engine import SimpleMultiModalQueryEngine

retriever_engine = index.as_retriever(
    similarity_top_k=3, image_similarity_top_k=3
)

# retrieve more information from the GPT4V response
retrieval_results = retriever_engine.retrieve(response)

# if you only need image retrieval without text retrieval
# you can use `text_to_image_retrieve`
# retrieval_results = retriever_engine.text_to_image_retrieve(response)

qa_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

query_engine = index.as_query_engine(
    multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
)

query_str = "Tell me more about the Porsche"
response = query_engine.query(query_str)
```
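After the query runs you can inspect both the generated answer and the sources it was grounded on. A hedged sketch, assuming the response object exposes source_nodes the way standard LlamaIndex query responses do:

```python
# print the generated text answer, then the retrieved text/image nodes
# together with their similarity scores
print(str(response))
for node_with_score in response.source_nodes:
    print(node_with_score.score, type(node_with_score.node).__name__)
```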
Legend
- ✅ = should work fine
- ⚠️ = sometimes unreliable, may need more tuning to improve
- 🛑 = not available at the moment
End to End Multi-Modal Work Flow
The tables below outline the initial steps supported by various LlamaIndex features for building your own Multi-Modal RAG (Retrieval Augmented Generation) pipeline. You can combine different modules/steps to compose your own Multi-Modal RAG orchestration.
Query Type | Data Sources for MultiModal Vector Store/Index | MultiModal Embedding | Retriever | Query Engine | Output Data Type |
---|---|---|---|---|---|
Text ✅ | Text ✅ | Text ✅ | Top-k retrieval ✅<br>Simple Fusion retrieval ✅ | Simple Query Engine ✅ | Retrieved Text ✅<br>Generated Text ✅ |
Image ✅ | Image ✅ | Image ✅<br>Image to Text Embedding ✅ | Top-k retrieval ✅<br>Simple Fusion retrieval ✅ | Simple Query Engine ✅ | Retrieved Image ✅<br>Generated Image 🛑 |
Audio 🛑 | Audio 🛑 | Audio 🛑 | 🛑 | 🛑 | Audio 🛑 |
Video 🛑 | Video 🛑 | Video 🛑 | 🛑 | 🛑 | Video 🛑 |
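For the Image query row above, retrieval can also start from an image rather than a text string. A hedged sketch, assuming the multi-modal retriever exposes image_to_image_retrieve alongside text_to_image_retrieve; the image path is a placeholder:

```python
# image-to-image retrieval: embed a query image and fetch the most similar
# images from the image store
retriever_engine = index.as_retriever(image_similarity_top_k=3)
image_results = retriever_engine.image_to_image_retrieve(
    "./data_folder/query_image.png"  # placeholder path
)
for result in image_results:
    print(result.score, result.node.metadata.get("file_path"))
```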
Multi-Modal LLM Models
These notebooks serve as examples of how to leverage and integrate a Multi-Modal LLM, Multi-Modal embeddings, Multi-Modal vector stores, a Retriever, and a Query Engine for composing a Multi-Modal Retrieval Augmented Generation (RAG) orchestration.
Multi-Modal Vision Models | Single Image Reasoning | Multiple Images Reasoning | Image Embeddings | Simple Query Engine | Pydantic Structured Output |
---|---|---|---|---|---|
GPT4V (OpenAI API) | ✅ | ✅ | 🛑 | ✅ | ✅ |
GPT4V-Azure (Azure API) | ✅ | ✅ | 🛑 | ✅ | ✅ |
Gemini (Google) | ✅ | ✅ | 🛑 | ✅ | ✅ |
CLIP (Local host) | 🛑 | 🛑 | ✅ | 🛑 | 🛑 |
LLaVa (replicate) | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
Fuyu-8B (replicate) | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
ImageBind [To integrate] | 🛑 | 🛑 | ✅ | 🛑 | 🛑 |
MiniGPT-4 | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
CogVLM | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
Qwen-VL [To integrate] | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
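Any of these models can be dropped into the same workflow in place of OpenAIMultiModal. A hedged sketch using the Gemini integration, assuming the llama-index-multi-modal-llms-gemini package is installed and that GeminiMultiModal exposes the same complete() interface; the model name is an assumption:

```python
from llama_index.multi_modal_llms.gemini import GeminiMultiModal

# swap the multi-modal LLM while reusing the image documents loaded earlier
gemini_mm_llm = GeminiMultiModal(model_name="models/gemini-pro-vision")
response = gemini_mm_llm.complete(
    prompt="what is in the image?",
    image_documents=image_documents,
)
```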
Multi Modal Vector Stores
The table below lists some vector stores supporting Multi-Modal use cases. Our LlamaIndex built-in MultiModalVectorStoreIndex supports building separate vector stores for image and text embeddings. MultiModalRetriever and SimpleMultiModalQueryEngine support text-to-text/image and image-to-image retrieval, along with simple ranking fusion functions for combining text and image retrieval results.
Multi-Modal Vector Stores | Single Vector Store | Multiple Vector Stores | Text Embedding | Image Embedding |
---|---|---|---|---|
LlamaIndex self-built MultiModal Index | 🛑 | ✅ | Can be arbitrary text embedding (Default is GPT3.5) | Can be arbitrary image embedding (Default is CLIP) |
Chroma | ✅ | 🛑 | CLIP ✅ | CLIP ✅ |
Weaviate [To integrate] | ✅ | 🛑 | CLIP ✅<br>ImageBind ✅ | CLIP ✅<br>ImageBind ✅ |
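If you prefer to wire the pieces together explicitly rather than going through index.as_query_engine, a sketch might look like the following; the constructor parameter names are assumptions based on the snippets above.

```python
from llama_index.core.query_engine import SimpleMultiModalQueryEngine

# build the query engine directly from a multi-modal retriever and a
# multi-modal LLM (parameter names are assumptions)
query_engine = SimpleMultiModalQueryEngine(
    retriever=index.as_retriever(
        similarity_top_k=3, image_similarity_top_k=3
    ),
    multi_modal_llm=openai_mm_llm,
    text_qa_template=qa_tmpl,
)
response = query_engine.query("Tell me more about the Porsche")
```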
Multi-Modal LLM Modules
We support integrations with GPT4-V, Anthropic (Opus, Sonnet), Gemini (Google), CLIP (OpenAI), BLIP (Salesforce), Replicate (LLaVA, Fuyu-8B, MiniGPT-4, CogVLM), and more.
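For example, the CLIP integration can be used directly to embed images and text into the same space. A hedged sketch, assuming the llama-index-embeddings-clip package is installed; the model name and image path are assumptions:

```python
from llama_index.embeddings.clip import ClipEmbedding

# embed an image and a text string with the same CLIP model so they can be
# compared in a shared embedding space
clip_embedding = ClipEmbedding(model_name="ViT-B/32")
image_embedding = clip_embedding.get_image_embedding("./data_folder/porsche.png")
text_embedding = clip_embedding.get_text_embedding("a photo of a sports car")
print(len(image_embedding), len(text_embedding))
```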
Multi-Modal Retrieval Augmented Generation
We support Multi-Modal Retrieval Augmented Generation with different Multi-Modal LLMs and Multi-Modal vector stores.
Evaluation
We support basic evaluation for Multi-Modal LLMs and Retrieval Augmented Generation.
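A hedged sketch of what that evaluation flow can look like, using a multi-modal relevancy evaluator; the import path, constructor arguments, and method signature below are assumptions and may differ in your installed version:

```python
from llama_index.core.evaluation.multi_modal import MultiModalRelevancyEvaluator

# judge whether the retrieved (text + image) context is relevant to the query,
# using a multi-modal LLM as the evaluator
relevancy_evaluator = MultiModalRelevancyEvaluator(multi_modal_llm=openai_mm_llm)
eval_result = relevancy_evaluator.evaluate_response(
    query=query_str, response=response
)
print(eval_result.passing, eval_result.feedback)
```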