Splitting Concatenated Documents

This guide demonstrates how to use the Split API to automatically segment a concatenated PDF into logical document sections based on content categories.

When dealing with large PDFs that contain multiple distinct documents or sections (e.g., a bundle of research papers, a collection of reports), you often need to split them into individual segments. The Split API uses AI to:

  1. Analyze each page’s content
  2. Classify pages into user-defined categories
  3. Group consecutive pages of the same category into segments
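
To make the grouping step concrete, here is a tiny standalone sketch (plain Python, independent of the Split API, with made-up per-page labels) of how consecutive per-page classifications collapse into segments:

from itertools import groupby

# Hypothetical per-page classifications, in page order
page_categories = ["essay", "essay", "research_paper", "research_paper", "research_paper", "essay"]

segments = []
next_page = 1
for category, run in groupby(page_categories):
    pages = [next_page + offset for offset, _ in enumerate(run)]
    next_page += len(pages)
    segments.append({"category": category, "pages": pages})

print(segments)
# [{'category': 'essay', 'pages': [1, 2]},
#  {'category': 'research_paper', 'pages': [3, 4, 5]},
#  {'category': 'essay', 'pages': [6]}]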

We’ll use a PDF containing three concatenated documents:

  • Alan Turing’s essay “Intelligent Machinery, A Heretical Theory” (an essay)
  • ImageNet paper (a research paper)
  • “Attention is All You Need” paper (a research paper)

We’ll split this into segments categorized as either essay or research_paper.

📄 Download the example PDF

Install the required packages:

pip install "llama-cloud>=1.0"

Set up your environment (or pass your API key directly in code later):

export LLAMA_CLOUD_API_KEY="your_api_key_here"
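
If you prefer to configure the key from Python instead of your shell, a minimal sketch is to set the same environment variable before constructing the client (the variable name matches the export above; whether the client constructor also accepts the key directly isn't covered here, so check the SDK reference):

import os

# Equivalent to the shell export above; run this before creating the LlamaCloud client
os.environ["LLAMA_CLOUD_API_KEY"] = "your_api_key_here"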

Upload the concatenated PDF to LlamaCloud using the llama-cloud SDK:

from llama_cloud import LlamaCloud

client = LlamaCloud()  # picks up LLAMA_CLOUD_API_KEY from the environment

# Upload the concatenated PDF so the Split API can reference it by file ID
pdf_path = "./data/turing+imagenet+attention.pdf"
uploaded_file = client.files.create(file=pdf_path, purpose="split")
file_id = uploaded_file.id
print(f"✅ File uploaded: {uploaded_file.id}")

Create a split job with category definitions. Each category needs a name and a description that helps the AI understand what content belongs to that category:

job = client.beta.split.create(
    document_input={
        "type": "file_id",
        "value": file_id,
    },
    categories=[
        {
            "name": "essay",
            "description": "A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic without strict formal structure",
        },
        {
            "name": "research_paper",
            "description": "A formal academic document presenting original research, methodology, experiments, results, and conclusions with citations and references",
        },
    ],
)
print(f"✅ Split job created: {job.id}")
print(f" Status: {job.status}")

The split job runs asynchronously. Poll until it completes:

completed_job = client.beta.split.wait_for_completion(job.id, polling_interval=2.0)
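
The helper returns the finished job object. Because result can be empty when a job does not succeed (the next step guards on it the same way), it is worth a small defensive check; the status field used here is the same one printed when the job was created:

# Guard before reading segments: result may be empty if the job failed
if completed_job.result is None:
    raise RuntimeError(
        f"Split job {job.id} finished without a result (status: {completed_job.status})"
    )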

Examine the split results:

segments = completed_job.result.segments if completed_job.result else []
print(f"📊 Total segments found: {len(segments)}")

for i, segment in enumerate(segments, 1):
    category = segment.category
    pages = segment.pages
    confidence = segment.confidence_category

    if len(pages) == 1:
        page_range = f"Page {pages[0]}"
    else:
        page_range = f"Pages {min(pages)}-{max(pages)}"

    print(f"\nSegment {i}:")
    print(f" Category: {category}")
    print(f" {page_range} ({len(pages)} pages)")
    print(f" Confidence: {confidence}")

📊 Total segments found: 3

Segment 1:
 Category: essay
 Pages 1-4 (4 pages)
 Confidence: high

Segment 2:
 Category: research_paper
 Pages 5-13 (9 pages)
 Confidence: high

Segment 3:
 Category: research_paper
 Pages 14-24 (11 pages)
 Confidence: high

The Split API correctly identified:

  • 1 essay segment: Alan Turing’s “Intelligent Machinery, A Heretical Theory”
  • 2 research paper segments: ImageNet paper and “Attention is All You Need”
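
With the page ranges for each segment, you can write the segments back out as separate PDFs. Below is a minimal sketch using pypdf (not part of the Split API; install it separately), assuming the reported page numbers are 1-based as the output above suggests:

from pypdf import PdfReader, PdfWriter

reader = PdfReader(pdf_path)

for i, segment in enumerate(segments, 1):
    writer = PdfWriter()
    for page_number in sorted(segment.pages):
        # Segment pages appear to be 1-based; pypdf page indices are 0-based
        writer.add_page(reader.pages[page_number - 1])

    output_path = f"./data/segment_{i}_{segment.category}.pdf"
    with open(output_path, "wb") as f:
        writer.write(f)
    print(f"💾 Wrote {output_path}")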

You can use the allow_uncategorized strategy when you want to capture pages that don’t match any defined category:

job = client.beta.split.create(..., splitting_strategy={"allow_uncategorized": True})

With this configuration, any pages that don't match one of your defined categories are grouped into an uncategorized segment (for example, if you defined only essay, the two research papers would come back as uncategorized).
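
For example, you could surface those leftover pages after the job completes; the category value "uncategorized" below is an assumption based on the behavior described above, so confirm the exact value in the REST API reference:

# Assumption: segments that matched no defined category are reported as "uncategorized"
for segment in completed_job.result.segments:
    if segment.category == "uncategorized":
        pages = segment.pages
        print(f"Uncategorized: pages {min(pages)}-{max(pages)} ({len(pages)} pages)")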

  • Explore the REST API reference for all available options
  • Combine Split with LlamaExtract to run targeted extraction on each segment
  • Use LlamaParse to parse individual segments with optimized settings