
Multimodal RAG with VideoDB

RAG: Multimodal Search on Videos and Stream Video Results 📺

Constructing a RAG pipeline for text is relatively straightforward, thanks to the tools developed for parsing, indexing, and retrieving text data.

However, adapting RAG models for video content presents a greater challenge. Videos combine visual, auditory, and textual elements, requiring more processing power and sophisticated video pipelines.

VideoDB is a serverless database designed to streamline the storage, search, editing, and streaming of video content. VideoDB offers random access to sequential video data by building indexes and developing interfaces for querying and browsing video content. Learn more at docs.videodb.io.

To build a truly multimodal search for videos, you need to work with the different modalities of a video, such as its spoken content and its visual content.

In this notebook, we will develop a multimodal RAG for video using VideoDB and LlamaIndex ✨.

To connect to VideoDB, get your API key and create a connection. This can be done by setting the VIDEO_DB_API_KEY environment variable. You can get the key from the 👉🏼 VideoDB Console (free for the first 50 uploads, no credit card required!).

Get your OPENAI_API_KEY from the OpenAI platform; it is used by the llama_index response synthesizer.

import os
os.environ["VIDEO_DB_API_KEY"] = ""
os.environ["OPENAI_API_KEY"] = ""

To get started, we'll need to install the following packages:

  • llama-index
  • videodb
%pip install videodb
%pip install llama-index

Let's upload our video file first.

You can use any public URL, YouTube link, or local file on your system (a local-file example follows the upload below).

✨ First 50 uploads are free!

from videodb import connect
# connect to VideoDB
conn = connect()
coll = conn.get_collection()
# upload videos to default collection in VideoDB
print("Uploading Video")
video = conn.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")
print(f"Video uploaded with ID: {video.id}")
# video = coll.get_video("m-56f55058-62b6-49c4-bbdc-43c0badf4c0b")
Uploading Video
Video uploaded with ID: m-0ccadfc8-bc8c-4183-b83a-543946460e2a
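If you would rather upload from a local file instead of a URL, the same upload call can take a file path (a minimal sketch; the file_path parameter name and the path itself are assumptions to adapt to your setup):

# Upload from a local file instead of a public URL (file_path and the path are assumed)
video = conn.upload(file_path="path/to/your_video.mp4")
print(f"Video uploaded with ID: {video.id}")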
  • coll = conn.get_collection(): Returns the default collection object.
  • coll.get_videos(): Returns a list of all the videos in a collection.
  • coll.get_video(video_id): Returns a Video object for the given video_id.
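For example, the helpers above can be combined to browse the default collection and re-fetch a video by its ID (a minimal sketch; the id and name attributes on the Video object are assumptions):

# List every video in the default collection
coll = conn.get_collection()
for v in coll.get_videos():
    print(v.id, v.name)
# Re-fetch a specific video later by its ID
# video = coll.get_video("m-0ccadfc8-bc8c-4183-b83a-543946460e2a")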

📸🗣️ Step 2: Extract Scenes from Video

First, we need to extract scenes from the video and then use a vision LLM to obtain a description of each scene.

To learn more about Scene Extraction options, explore the following guides:

  • Scene Extraction Options Guide delves deeper into the various options available for scene extraction within Scene Index. It covers advanced settings, customization features, and tips for optimizing scene extraction based on different needs and preferences.
from videodb import SceneExtractionType
# Specify the scene extraction algorithm
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 2, "select_frames": ["first", "last"]},
    prompt="Describe the scene in detail",
)
video.get_scene_index(index_id)
print(f"Scene Extraction successful with ID: {index_id}")
Indexing Visual content in Video...
Scene Index successful with ID: f3eef7aee2a0ff58
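The scene index is simply a list of scene records, so you can sanity-check the generated descriptions before building the pipeline (a minimal sketch, assuming each record carries start, end, and description fields, as used later in this guide):

# Peek at the first few indexed scenes and their timestamps
scenes = video.get_scene_index(index_id)
for scene in scenes[:3]:
    print(scene["start"], scene["end"], scene["description"][:80])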

✨ Step 3: Incorporating VideoDB into your existing LlamaIndex RAG Pipeline

To develop a thorough multimodal search for videos, you need to handle different video modalities, including spoken content and visual elements.

You can retrieve all Transcript Nodes and Visual Nodes of a video using VideoDB and then incorporate them into your LlamaIndex pipeline.

You can fetch transcript nodes using Video.get_transcript().

To configure the segmenter, use the segmenter and length arguments.

Possible values for segmenter are:

  • Segmenter.time: Segments the video based on the specified length in seconds.
  • Segmenter.word: Segments the video based on the word count specified by length (a word-based example follows the code below).
from videodb import Segmenter
from llama_index.core.schema import TextNode
# Fetch all Transcript Nodes
nodes_transcript_raw = video.get_transcript(
    segmenter=Segmenter.time, length=60
)
# Convert the raw transcript nodes to TextNode objects
nodes_transcript = [
    TextNode(
        text=node["text"],
        metadata={key: value for key, value in node.items() if key != "text"},
    )
    for node in nodes_transcript_raw
]
# Fetch all Scenes
scenes = video.get_scene_index(index_id)
# Convert the scenes to TextNode objects
nodes_scenes = [
    TextNode(
        text=node["description"],
        metadata={
            key: value for key, value in node.items() if key != "description"
        },
    )
    for node in scenes
]
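If word-count segmentation suits your content better than fixed time windows, the same call can use the word segmenter described above (a minimal sketch; the 100-word chunk size is an arbitrary choice):

# Alternative: segment the transcript every 100 words instead of every 60 seconds
nodes_transcript_raw = video.get_transcript(
    segmenter=Segmenter.word, length=100
)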

🔄 Simple RAG Pipeline with Transcript + Scene Nodes

We index both our Transcript Nodes and Scene Nodes.

🔍✨ For simplicity, we are using a basic RAG pipeline. However, you can integrate more advanced LlamaIndex RAG pipelines here for better results (a sketch of a more configurable query engine follows the query below).

from llama_index.core import VectorStoreIndex
# Index both Transcript and Scene Nodes
index = VectorStoreIndex(nodes_scenes + nodes_transcript)
q = index.as_query_engine()
res = q.query(
    "Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy"
)
print(res)
The narrator discusses the location of our Solar System within the Milky Way galaxy, emphasizing its position in one of the minor spiral arms known as the Orion Spur. The images provided offer visual representations of the Milky Way's structure, with labels indicating the specific location of the Solar System within the galaxy.
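As a first step toward the more advanced pipelines mentioned above, the query engine itself accepts tuning parameters (a minimal sketch; the top-k of 5, the tree_summarize response mode, and the follow-up question are illustrative choices, not recommendations from this guide):

# Retrieve more nodes per query and summarize them hierarchically
alt_engine = index.as_query_engine(similarity_top_k=5, response_mode="tree_summarize")
alt_res = alt_engine.query("What does the narrator say about the Orion Spur?")
print(alt_res)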

Our nodes' metadata includes start and end fields, which represent the start and end times relative to the beginning of the video.

Using this information from the relevant nodes, we can create Video Clips corresponding to these nodes.

from videodb import play_stream
# Helper function to merge overlapping intervals
def merge_intervals(intervals):
    if not intervals:
        return []
    intervals.sort(key=lambda x: x[0])
    merged = [intervals[0]]
    for interval in intervals[1:]:
        if interval[0] <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], interval[1])
        else:
            merged.append(interval)
    return merged
# Extract relevant timestamps from the source nodes
relevant_timestamps = [
    [node.metadata["start"], node.metadata["end"]] for node in res.source_nodes
]
# Create a compilation of all relevant timestamps
stream_url = video.generate_stream(merge_intervals(relevant_timestamps))
play_stream(stream_url)
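For clarity, this is how merge_intervals behaves on overlapping timestamps (a small usage example with made-up numbers):

# Two overlapping clips collapse into one; the third stays separate
print(merge_intervals([[10, 30], [25, 40], [100, 120]]))
# -> [[10, 40], [100, 120]]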

In this guide, we built a simple multimodal RAG for videos using VideoDB, LlamaIndex, and OpenAI.

You can optimize the pipeline by incorporating more advanced techniques, such as:

  • Building a search across a whole video collection (a sketch follows this list)
  • Optimizing query transformation
  • Combining retrieved nodes from different modalities in more ways
  • Experimenting with different RAG pipelines, such as a Knowledge Graph
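A minimal sketch of the first idea, searching across the whole collection: it reuses only the calls shown earlier, and tagging each node with its video_id is an assumption about how you might trace results back to the right video:

# Build transcript nodes for every video in the default collection
collection_nodes = []
for v in coll.get_videos():
    for node in v.get_transcript(segmenter=Segmenter.time, length=60):
        collection_nodes.append(
            TextNode(
                text=node["text"],
                # Keep the source video's ID so results can be traced back to it
                metadata={"video_id": v.id, "start": node["start"], "end": node["end"]},
            )
        )
collection_index = VectorStoreIndex(collection_nodes)
collection_engine = collection_index.as_query_engine()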

To learn more about Scene Index, explore the guides in the VideoDB documentation.

👨‍👩‍👧‍👦 Support & Community

If you have any questions or feedback, feel free to reach out to us 🙌🏼