Document Stores
Document stores contain ingested document chunks, which we call Node objects.
See the API Reference for more details.
Simple Document Store
By default, the SimpleDocumentStore stores Node objects in-memory. They can be persisted to (and loaded from) disk by calling docstore.persist() (and SimpleDocumentStore.from_persist_path(...), respectively).
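For example, a minimal persist-and-reload sketch (the ./storage/docstore.json path is just an assumed example location; documents is your own list of Document objects):

```python
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

# parse documents into nodes and add them to an in-memory docstore
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

# persist to disk, then load the same store back in a later session
docstore.persist(persist_path="./storage/docstore.json")  # assumed example path
docstore = SimpleDocumentStore.from_persist_path("./storage/docstore.json")
```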
A more complete example can be found here.
MongoDB Document Store
We support MongoDB as an alternative document store backend that persists data as Node objects are ingested.
```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.storage.docstore.mongodb import MongoDocumentStore

# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create (or load) docstore and add nodes
docstore = MongoDocumentStore.from_uri(uri="<mongodb+srv://...>")
docstore.add_documents(nodes)

# create storage context
storage_context = StorageContext.from_defaults(docstore=docstore)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
Under the hood, MongoDocumentStore connects to a fixed MongoDB database and initializes new collections (or loads existing collections) for your nodes.
Note: You can configure the db_name and namespace when instantiating MongoDocumentStore; otherwise they default to db_name="db_docstore" and namespace="docstore".
Note that it’s not necessary to call storage_context.persist() (or docstore.persist()) when using a MongoDocumentStore, since data is persisted by default.

You can easily reconnect to your MongoDB collection and reload the index by re-initializing a MongoDocumentStore with an existing db_name and namespace.
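For instance, a minimal reconnection sketch, reusing the placeholder URI and the defaults from the note above (docs is the docstore's mapping of node IDs to ingested nodes):

```python
from llama_index.storage.docstore.mongodb import MongoDocumentStore

# reconnect to the same MongoDB-backed docstore; no re-ingestion needed
docstore = MongoDocumentStore.from_uri(
    uri="<mongodb+srv://...>",
    db_name="db_docstore",  # the default noted above
    namespace="docstore",  # the default noted above
)

# previously ingested nodes are available again
print(len(docstore.docs))
```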
A more complete example can be found here.
Redis Document Store
We support Redis as an alternative document store backend that persists data as Node objects are ingested.
```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.storage.docstore.redis import RedisDocumentStore

# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create (or load) docstore and add nodes
docstore = RedisDocumentStore.from_host_and_port(
    host="127.0.0.1", port="6379", namespace="llama_index"
)
docstore.add_documents(nodes)

# create storage context
storage_context = StorageContext.from_defaults(docstore=docstore)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
Under the hood, RedisDocumentStore connects to a Redis database and adds your nodes to a namespace stored under {namespace}/docs.
Note: You can configure the namespace when instantiating RedisDocumentStore; otherwise it defaults to namespace="docstore".
You can easily reconnect to your Redis client and reload the index by re-initializing a RedisDocumentStore with an existing host, port, and namespace.
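For instance, a minimal reconnection sketch, reusing the host, port, and namespace from the example above:

```python
from llama_index.storage.docstore.redis import RedisDocumentStore

# reconnect to the same Redis-backed docstore; no re-ingestion needed
docstore = RedisDocumentStore.from_host_and_port(
    host="127.0.0.1", port="6379", namespace="llama_index"
)

# previously ingested nodes are available again
print(len(docstore.docs))
```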
A more complete example can be found here.
Firestore Document Store
We support Firestore as an alternative document store backend that persists data as Node objects are ingested.
```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.storage.docstore.firestore import FirestoreDocumentStore

# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create (or load) docstore and add nodes
docstore = FirestoreDocumentStore.from_database(
    project="project-id",
    database="(default)",
)
docstore.add_documents(nodes)

# create storage context
storage_context = StorageContext.from_defaults(docstore=docstore)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
Under the hood, FirestoreDocumentStore connects to a Firestore database in Google Cloud and adds your nodes to a namespace stored under {namespace}/docs.
Note: You can configure the namespace when instantiating FirestoreDocumentStore; otherwise it defaults to namespace="docstore".
You can easily reconnect to your Firestore database and reload the index by re-initializing a FirestoreDocumentStore with an existing project, database, and namespace.
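For instance, a minimal reconnection sketch, reusing the project and database from the example above (passing namespace explicitly is an assumption based on the note above):

```python
from llama_index.storage.docstore.firestore import FirestoreDocumentStore

# reconnect to the same Firestore-backed docstore; no re-ingestion needed
docstore = FirestoreDocumentStore.from_database(
    project="project-id",
    database="(default)",
    namespace="docstore",  # assumed keyword; the default per the note above
)
```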
A more complete example can be found here.
Couchbase Document Store
We support Couchbase as an alternative document store backend that persists data as Node objects are ingested.
```python
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.storage.docstore.couchbase import CouchbaseDocumentStore

# create couchbase client
auth = PasswordAuthenticator("DB_USERNAME", "DB_PASSWORD")
options = ClusterOptions(authenticator=auth)
cluster = Cluster("couchbase://localhost", options)

# wait until the cluster is ready for use
cluster.wait_until_ready(timedelta(seconds=5))

# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create (or load) docstore and add nodes
docstore = CouchbaseDocumentStore.from_couchbase_client(
    client=cluster,
    bucket_name="llama-index",
    scope_name="_default",
    namespace="default",
)
docstore.add_documents(nodes)

# create storage context
storage_context = StorageContext.from_defaults(docstore=docstore)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
Under the hood, CouchbaseDocumentStore connects to a Couchbase operational database and adds your nodes to a collection named {namespace}_data in the specified {bucket_name} and {scope_name}.
Note: You can configure the namespace, bucket, and scope when instantiating CouchbaseDocumentStore. By default, the collection used is docstore_data. Apart from alphanumeric characters, only -, _, and % are allowed as part of the collection name; the store will automatically convert other special characters to _.
You can easily reconnect to your Couchbase database and reload the index by re-initializing a CouchbaseDocumentStore with an existing client, bucket_name, scope_name, and namespace.
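For instance, a minimal reconnection sketch, reusing the cluster client and names from the example above:

```python
from llama_index.storage.docstore.couchbase import CouchbaseDocumentStore

# reconnect with the same client, bucket, scope, and namespace; no re-ingestion needed
docstore = CouchbaseDocumentStore.from_couchbase_client(
    client=cluster,
    bucket_name="llama-index",
    scope_name="_default",
    namespace="default",
)
```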
Tablestore Document Store
We support Tablestore as an alternative document store backend that persists data as Node objects are ingested.
```python
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.storage.docstore.tablestore import TablestoreDocumentStore

# create parser and parse documents into nodes
parser = SentenceSplitter()
documents = [
    Document(text="I like cat.", id_="1", metadata={"key1": "value1"}),
    Document(text="Mike likes dog.", id_="2", metadata={"key2": "value2"}),
]
nodes = parser.get_nodes_from_documents(documents)

# create (or load) docstore and add nodes
doc_store = TablestoreDocumentStore.from_config(
    endpoint="<tablestore_end_point>",
    instance_name="<tablestore_instance_name>",
    access_key_id="<tablestore_access_key_id>",
    access_key_secret="<tablestore_access_key_secret>",
)
doc_store.add_documents(nodes)

# create storage context
storage_context = StorageContext.from_defaults(docstore=doc_store)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
Under the hood, TablestoreDocumentStore connects to a Tablestore database and adds your nodes to a table named {namespace}_data.
Note: You can configure the namespace when instantiating TablestoreDocumentStore.
You can easily reconnect to your Tablestore database and reload the index by re-initializing a TablestoreDocumentStore with an existing endpoint, instance_name, access_key_id, and access_key_secret.
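For instance, a minimal reconnection sketch, reusing the placeholder credentials from the example above:

```python
from llama_index.storage.docstore.tablestore import TablestoreDocumentStore

# reconnect with the same endpoint, instance, and credentials; no re-ingestion needed
doc_store = TablestoreDocumentStore.from_config(
    endpoint="<tablestore_end_point>",
    instance_name="<tablestore_instance_name>",
    access_key_id="<tablestore_access_key_id>",
    access_key_secret="<tablestore_access_key_secret>",
)
```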
A more complete example can be found here.
Google AlloyDB Document Store
We support AlloyDB as an alternative document store backend that persists data as Node objects are ingested.
This tutorial demonstrates the synchronous interface. All synchronous methods have corresponding asynchronous methods.
```bash
pip install llama-index
pip install llama-index-alloydb-pg
pip install llama-index-llms-vertex
```
```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index_alloydb_pg import AlloyDBEngine, AlloyDBDocumentStore

# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create an AlloyDB Engine for connection pool
engine = AlloyDBEngine.from_instance(
    project_id=PROJECT_ID,
    region=REGION,
    cluster=CLUSTER,
    instance=INSTANCE,
    database=DATABASE,
    user=USER,
    password=PASSWORD,
)

# initialize a new table in AlloyDB
engine.init_doc_store_table(
    table_name=TABLE_NAME,
)

doc_store = AlloyDBDocumentStore.create_sync(
    engine=engine,
    table_name=TABLE_NAME,
)

doc_store.add_documents(nodes)

# create storage context
storage_context = StorageContext.from_defaults(docstore=doc_store)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
Note: You can configure the schema_name along with the table_name when initializing a new table and instantiating AlloyDBDocumentStore. By default the schema_name is public.
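For instance, a sketch of placing the table in a custom schema (assuming, per the note above, that both calls accept a schema_name keyword; my_schema is a hypothetical schema that must already exist in the database):

```python
# hypothetical schema name; the schema itself must already exist in AlloyDB
engine.init_doc_store_table(
    table_name=TABLE_NAME,
    schema_name="my_schema",  # assumed keyword per the note above
)

doc_store = AlloyDBDocumentStore.create_sync(
    engine=engine,
    table_name=TABLE_NAME,
    schema_name="my_schema",  # assumed keyword per the note above
)
```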
Under the hood, AlloyDBDocumentStore connects to the AlloyDB database in Google Cloud and adds your nodes to a table under the schema_name.
You can easily reconnect to your AlloyDB database and reload the index by re-initializing an AlloyDBDocumentStore with an AlloyDBEngine, without initializing a new table.
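For instance, a minimal reconnection sketch, reusing the engine and table from the example above:

```python
# reconnect with the same engine and table; skip engine.init_doc_store_table()
doc_store = AlloyDBDocumentStore.create_sync(
    engine=engine,
    table_name=TABLE_NAME,
)
```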
A more detailed guide can be found here.
Google Cloud SQL for PostgreSQL Document Store
We support Cloud SQL for PostgreSQL as an alternative document store backend that persists data as Node objects are ingested.
This tutorial demonstrates the synchronous interface. All synchronous methods have corresponding asynchronous methods.
```bash
pip install llama-index
pip install llama-index-cloud-sql-pg
```
```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index_cloud_sql_pg import PostgresEngine, PostgresDocumentStore

# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create a Postgres Engine for connection pool
engine = PostgresEngine.from_instance(
    project_id=PROJECT_ID,
    region=REGION,
    instance=INSTANCE,
    database=DATABASE,
    user=USER,
    password=PASSWORD,
)

# initialize a new table in Cloud SQL for PostgreSQL
engine.init_doc_store_table(
    table_name=TABLE_NAME,
)

doc_store = PostgresDocumentStore.create_sync(
    engine=engine,
    table_name=TABLE_NAME,
)

doc_store.add_documents(nodes)

# create storage context
storage_context = StorageContext.from_defaults(docstore=doc_store)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
Note: You can configure the schema_name along with the table_name when initializing a new table and instantiating PostgresDocumentStore. By default the schema_name is public.
Under the hood, PostgresDocumentStore connects to the Cloud SQL for PostgreSQL database in Google Cloud and adds your nodes to a table under the schema_name.
You can easily reconnect to your Postgres database and reload the index by re-initializing a PostgresDocumentStore with a PostgresEngine, without initializing a new table.
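For instance, a minimal reconnection sketch, reusing the engine and table from the example above:

```python
# reconnect with the same engine and table; skip engine.init_doc_store_table()
doc_store = PostgresDocumentStore.create_sync(
    engine=engine,
    table_name=TABLE_NAME,
)
```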
A more detailed guide can be found here.