Parsing & Transformation in LlamaCloud
Once data is loaded from a Data Source, it is pre-processed before being sent to the Data Sink. There are many pre-processing parameters that can be tweaked to optimize the downstream retrieval performance of your index. While LlamaCloud sets you up with reasonable defaults, you can dig deeper and customize them as you see fit for your specific use case.
Parser Settings
A key step of any RAG pipeline is converting your input files into a format that can be used to generate vector embeddings. There are many parameters you can tweak in this conversion process to optimize for your use case. LlamaCloud starts you off with reasonable parsing defaults, but you can customize them for your specific application.
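For a concrete feel for these knobs, here is a minimal parsing sketch using the LlamaParse client exported by llama_cloud_services (the same package used in the retrieval example later on this page); the file path is a placeholder and result_type is just one of the available settings.

from llama_cloud_services import LlamaParse

# Parse a local file into markdown; result_type can also be "text".
parser = LlamaParse(
    api_key="your_api_key",  # or set LLAMA_CLOUD_API_KEY in the environment
    result_type="markdown",
)

documents = parser.load_data("./my_file.pdf")  # placeholder path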
Transformation Settings
The transform configuration defines how your data is transformed before it is ingested into the index. It is a JSON object with two modes: auto and advanced. As the names suggest, auto mode is handled by LlamaCloud, which applies a set of default configurations, while advanced mode is handled by you, with the ability to define your own transformation.
Auto Mode
You can set the mode by passing the transform_config as below on index creation or update.

transform_config = {"mode": "auto"}
When using auto mode, you can also configure the chunk size used for the transformation by passing the chunk_size and chunk_overlap parameters as below.

transform_config = {
    "mode": "auto",
    "chunk_size": 1000,
    "chunk_overlap": 100,
}
Advanced Mode
The advanced mode provides a variety of configuration options for defining your own transformation. It is selected by setting the mode parameter to advanced; the segmentation_config and chunking_config parameters then define the segmentation and chunking configuration, respectively.

transform_config = {
    "mode": "advanced",
    "segmentation_config": {
        "mode": "page",
        "page_separator": "\n---\n",
    },
    "chunking_config": {
        "mode": "sentence",
        "separator": " ",
        "paragraph_separator": "\n",
    },
}
Segmentation Configuration
The segmentation configuration uses the document's structure and/or semantics to divide it into smaller parts along natural segmentation boundaries. The segmentation_config parameter supports three modes: none, page, and element.
None Segmentation Configuration
The none segmentation configuration applies no segmentation.

transform_config = {
    "mode": "advanced",
    "segmentation_config": {"mode": "none"},
}
Page Segmentation Configuration
The page segmentation configuration segments the document by page; the page_separator parameter defines the separator that splits your document into pages.

transform_config = {
    "mode": "advanced",
    "segmentation_config": {
        "mode": "page",
        "page_separator": "\n---\n",
    },
}
Element Segmentation Configuration
The element segmentation configuration segments the document by element, identifying elements of the document such as titles, paragraphs, lists, and tables.

transform_config = {
    "mode": "advanced",
    "segmentation_config": {"mode": "element"},
}
Chunking Configuration
Chunking configuration is mainly used to deal with the context window limitations of embedding models and LLMs. Conceptually, it is the step after segmentation, where segments are further broken down into smaller chunks as necessary to fit into the context window. It includes a few modes: none, character, token, sentence, and semantic.
All chunking modes also let you define the chunk_size and chunk_overlap parameters. The examples below do not always set them, but you can always add them, as in the sketch following this note.
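For instance, combining the sentence mode parameters shown below with explicit sizing (the values here are illustrative) could look like:

transform_config = {
    "mode": "advanced",
    "chunking_config": {
        "mode": "sentence",
        "separator": " ",
        "paragraph_separator": "\n",
        "chunk_size": 1000,    # maximum chunk size
        "chunk_overlap": 100,  # overlap between consecutive chunks
    },
}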
None Chunking Configuration
The none chunking configuration applies no chunking.

transform_config = {
    "mode": "advanced",
    "chunking_config": {"mode": "none"},
}
Character Chunking Configuration
The character chunking configuration chunks the text by character; the chunk_size parameter defines the size of each chunk.

transform_config = {
    "mode": "advanced",
    "chunking_config": {
        "mode": "character",
        "chunk_size": 1000,
    },
}
Token Chunking Configuration
The token chunking configuration chunks the text by token, using the OpenAI tokenizer under the hood. The chunk_size and chunk_overlap parameters define the size of each chunk and the overlap between chunks.

transform_config = {
    "mode": "advanced",
    "chunking_config": {
        "mode": "token",
        "chunk_size": 1000,
        "chunk_overlap": 100,
    },
}
Sentence Chunking Configuration
The sentence chunking configuration chunks the text by sentence; the separator and paragraph_separator parameters define the separators between sentences and paragraphs, respectively.

transform_config = {
    "mode": "advanced",
    "chunking_config": {
        "mode": "sentence",
        "separator": " ",
        "paragraph_separator": "\n",
    },
}
Embedding Model
The embedding model allows you to construct a numerical representation of the text within your files. This is a crucial step in allowing you to search for specific information within your files. There are a wide variety of embedding models to choose from, and we support quite a few on LlamaCloud.
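Embedding models are typically attached when the pipeline is created. The sketch below is a rough illustration only: the embedding_config shape (an OPENAI_EMBEDDING type with a component holding the model name) is an assumption, so consult the pipeline API reference for the exact schema and the full list of supported models.

from llama_cloud import LlamaCloudClient

client = LlamaCloudClient(api_key="your_api_key")

# Assumption: this embedding_config shape is illustrative, not the confirmed schema.
pipeline = client.pipelines.create_pipeline(
    name="my-pipeline",
    embedding_config={
        "type": "OPENAI_EMBEDDING",
        "component": {
            "model_name": "text-embedding-3-small",  # hypothetical model choice
            "api_key": "your_openai_api_key",
        },
    },
)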
Sparse Model Configuration
The sparse model configuration enables hybrid search by combining dense embeddings with sparse embeddings for improved retrieval accuracy. This configuration is particularly useful for scenarios where you want to leverage both semantic similarity (dense) and keyword matching (sparse) capabilities.
Available Sparse Models
LlamaCloud supports three sparse model types:

auto (default): Automatically selects the appropriate sparse model (default: SPLADE)
splade: Uses the SPLADE model for learned sparse representations
bm25: Uses Qdrant's FastEmbed BM25 model for traditional keyword-based sparse embeddings
Configuration
You can configure the sparse model when creating or updating a pipeline:
from llama_cloud import LlamaCloudClient

client = LlamaCloudClient(api_key="your_api_key")

# Create a pipeline with a sparse model configuration
pipeline = client.pipelines.create_pipeline(
    name="my-hybrid-pipeline",
    # ... other pipeline configuration ...
    sparse_model_config={
        "model_type": "splade"  # or "bm25", "auto"
    },
)
Usage in Retrieval
When using hybrid search with configured sparse models, you can control the balance between dense and sparse retrieval:
from llama_cloud_services import LlamaCloudIndex

# Connect to your pipeline
index = LlamaCloudIndex("my-hybrid-pipeline", project_name="Default")

# Configure the retriever for hybrid search
retriever = index.as_retriever(
    dense_similarity_top_k=5,   # number of results from dense search
    sparse_similarity_top_k=5,  # number of results from sparse search
    alpha=0.5,                  # balance between dense (0.0) and sparse (1.0)
    enable_reranking=True,      # optional reranking for better results
    rerank_top_n=10,            # number of results to rerank
)

nodes = retriever.retrieve("your search query")
After pre-processing, your data is ready to be sent to the Data Sink.