Building RAG from Scratch (Open-source only!)
In this tutorial, we show you how to build, from scratch, a data ingestion pipeline into a vector database and then a retrieval pipeline on top of that database.
Notably, we use a fully open-source stack:
- Sentence Transformers as the embedding model
- Postgres as the vector store (we support many other vector stores too!)
- Llama 2 as the LLM (through llama.cpp)
We set up our open-source components:
- Sentence Transformers
- Llama 2
- Postgres, which we initialize and wrap with our vector store abstractions
Sentence Transformers
```
%pip install llama-index-readers-file pymupdf
%pip install llama-index-vector-stores-postgres
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp
```

```python
# sentence transformers
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")
```
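As an optional sanity check, you can embed a short string and confirm the vector dimension — `BAAI/bge-small-en` produces 384-dimensional embeddings, which is the `embed_dim` we pass to the vector store later:

```python
# optional sanity check: bge-small-en produces 384-dimensional embeddings
sample_embedding = embed_model.get_text_embedding("hello world")
print(len(sample_embedding))  # expected: 384
```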
Llama CPP

In this notebook, we use the llama-2-13b-chat model (in GGUF format, run through llama.cpp), along with the proper prompt formatting.
Check out our Llama CPP guide for full setup instructions/details.
```
!pip install llama-cpp-python
```
# model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
llm = LlamaCPP(    # You can pass in the URL to a GGML model to download it automatically    model_url=model_url,    # optionally, you can set the path to a pre-downloaded model instead of model_url    model_path=None,    temperature=0.1,    max_new_tokens=256,    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room    context_window=3900,    # kwargs to pass to __call__()    generate_kwargs={},    # kwargs to pass to __init__()    # set to at least 1 to use GPU    model_kwargs={"n_gpu_layers": 1},    verbose=True,)Initialize Postgres
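If you want to confirm the model loaded and generates text, a quick completion call like the following should work (the prompt here is arbitrary):

```python
# optional smoke test: generate a short completion with the loaded model
response = llm.complete("Hello! Can you tell me a fun fact about llamas?")
print(str(response))
```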
Initialize Postgres

Using an existing postgres instance running at localhost, create the database we’ll be using.
NOTE: Of course there are plenty of other open-source/self-hosted vector databases you can use, e.g. Chroma, Qdrant, Weaviate, and many more. Take a look at our vector store guide.
NOTE: You will need to set up postgres on your local system. Here’s an example of how to set it up on OSX: https://www.sqlshack.com/setting-up-a-postgresql-database-on-mac/.
NOTE: You will also need to install pgvector (https://github.com/pgvector/pgvector).
You can add a role like the following:
```sql
CREATE ROLE <user> WITH LOGIN PASSWORD '<password>';
ALTER ROLE <user> SUPERUSER;
```

```
!pip install psycopg2-binary pgvector asyncpg "sqlalchemy[asyncio]" greenlet
```

```python
import psycopg2
```
db_name = "vector_db"host = "localhost"password = "password"port = "5432"user = "jerry"# conn = psycopg2.connect(connection_string)conn = psycopg2.connect(    dbname="postgres",    host=host,    password=password,    port=port,    user=user,)conn.autocommit = True
with conn.cursor() as c:    c.execute(f"DROP DATABASE IF EXISTS {db_name}")    c.execute(f"CREATE DATABASE {db_name}")from sqlalchemy import make_urlfrom llama_index.vector_stores.postgres import PGVectorStore
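The pgvector extension must be enabled inside the database that stores the embeddings. PGVectorStore generally takes care of this for you, but if you prefer to do it explicitly, a sketch like the following (reusing the same connection parameters; `extension_conn` is just an illustrative name) works:

```python
# optional: enable the pgvector extension in the freshly created database
# (PGVectorStore usually runs CREATE EXTENSION itself, so this is just a safeguard)
extension_conn = psycopg2.connect(
    dbname=db_name,
    host=host,
    password=password,
    port=port,
    user=user,
)
extension_conn.autocommit = True
with extension_conn.cursor() as c:
    c.execute("CREATE EXTENSION IF NOT EXISTS vector")
```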
```python
from sqlalchemy import make_url
from llama_index.vector_stores.postgres import PGVectorStore

vector_store = PGVectorStore.from_params(
    database=db_name,
    host=host,
    password=password,
    port=port,
    user=user,
    table_name="llama2_paper",
    embed_dim=384,  # bge-small-en embedding dimension
)
```
Build an Ingestion Pipeline from Scratch

We show how to build an ingestion pipeline as mentioned in the introduction.
We fast-track the steps here (for instance, we skip metadata extraction). More details can be found in our dedicated ingestion guide.
1. Load Data
```
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
```

```python
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader

loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")
```
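To confirm the PDF loaded correctly, you can check how many Document objects were returned and preview the first one:

```python
# quick look at what the loader returned
print(len(documents))
print(documents[0].text[:500])
```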
2. Use a Text Splitter to Split Documents

```python
from llama_index.core.node_parser import SentenceSplitter

text_parser = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)

text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
```
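A quick check on the splitter output (the exact counts depend on the PDF and the chunk size):

```python
# how many chunks were produced, plus a preview of the first one
print(len(text_chunks))
print(text_chunks[0][:200])
```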
3. Manually Construct Nodes from Text Chunks

```python
from llama_index.core.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)
```
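You can inspect one of the constructed nodes to see the injected document metadata alongside the chunk text:

```python
# show the first node's content, including its metadata
print(nodes[0].get_content(metadata_mode="all"))
```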
4. Generate Embeddings for each Node

Here we generate embeddings for each Node using a sentence_transformers model.
```python
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
```
5. Load Nodes into a Vector Store

We now insert these nodes into our PGVectorStore.
```python
vector_store.add(nodes)
```
Build Retrieval Pipeline from Scratch

We show how to build a retrieval pipeline. Similar to ingestion, we fast-track the steps. Take a look at our retrieval guide for more details!
query_str = "Can you tell me about the key concepts for safety finetuning"1. Generate a Query Embedding
1. Generate a Query Embedding

```python
query_embedding = embed_model.get_query_embedding(query_str)
```
2. Query the Vector Database

```python
# construct vector store query
from llama_index.core.vector_stores import VectorStoreQuery

query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)

# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())
```
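The VectorStoreQueryResult also carries the similarity scores for the returned nodes (aligned with `query_result.nodes`), which we attach to the nodes in the next step:

```python
# similarity scores line up with query_result.nodes
print(query_result.similarities)
```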
3. Parse Result into a Set of Nodes

```python
from llama_index.core.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))
```
4. Put into a Retriever

```python
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from typing import Any, List


class VectorDBRetriever(BaseRetriever):
    """Retriever over a postgres vector store."""

    def __init__(
        self,
        vector_store: PGVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)

        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))

        return nodes_with_scores


retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)
```
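Before wiring the retriever into a query engine, you can call it directly to check that it returns sensible nodes; `retrieve()` accepts a raw query string:

```python
# try the retriever on its own
retrieved_nodes = retriever.retrieve(query_str)
for n in retrieved_nodes:
    print(n.score, n.node.get_content()[:200])
```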
Plug this into our RetrieverQueryEngine to synthesize a response

```python
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

query_str = "How does Llama 2 perform compared to other open-source models?"

response = query_engine.query(query_str)
print(str(response))
```

Based on the results shown in Table 3, Llama 2 outperforms all open-source models on most of the benchmarks, with an average improvement of around 5 points over the next best model (GPT-3.5).

```python
print(response.source_nodes[0].get_content())
```