
Splitting Concatenated Documents

This guide demonstrates how to use the Split API to automatically segment a concatenated PDF into logical document sections based on content categories.

When dealing with large PDFs that contain multiple distinct documents or sections (e.g., a bundle of research papers, a collection of reports), you often need to split them into individual segments. The Split API uses AI to:

  1. Analyze each page’s content
  2. Classify pages into user-defined categories
  3. Group consecutive pages of the same category into segments (sketched below)
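
To make step 3 concrete, here's a conceptual illustration (not the API's implementation) of how consecutive per-page labels collapse into segments:

from itertools import groupby

# Hypothetical per-page labels, as step 2 might produce them
page_labels = ["essay", "essay", "research_paper", "research_paper", "essay"]

segments = []
page = 1
for category, run in groupby(page_labels):
    count = len(list(run))
    segments.append({"category": category, "pages": list(range(page, page + count))})
    page += count

print(segments)
# [{'category': 'essay', 'pages': [1, 2]},
#  {'category': 'research_paper', 'pages': [3, 4]},
#  {'category': 'essay', 'pages': [5]}]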

We’ll use a PDF containing three concatenated documents:

  • Alan Turing’s “Intelligent Machinery, A Heretical Theory” (an essay)
  • The ImageNet paper (a research paper)
  • The “Attention Is All You Need” paper (a research paper)

We’ll split this into segments categorized as either essay or research_paper.

📄 Download the example PDF

Install the required packages:

pip install llama-cloud python-dotenv requests

Set up your environment:

import os
import time
import requests
from dotenv import load_dotenv
load_dotenv()
LLAMA_CLOUD_API_KEY = os.environ.get("LLAMA_CLOUD_API_KEY")
BASE_URL = os.environ.get("LLAMA_CLOUD_BASE_URL", "https://api.cloud.llamaindex.ai")
PROJECT_ID = os.environ.get("LLAMA_CLOUD_PROJECT_ID", None)
headers = {
    "Authorization": f"Bearer {LLAMA_CLOUD_API_KEY}",
    "Content-Type": "application/json",
}

Upload the concatenated PDF to LlamaCloud using the llama-cloud SDK:

from llama_cloud.client import LlamaCloud
client = LlamaCloud(token=LLAMA_CLOUD_API_KEY, base_url=BASE_URL)
pdf_path = "./data/turing+imagenet+attention.pdf"
with open(pdf_path, "rb") as f:
    uploaded_file = client.files.upload_file(upload_file=f, project_id=PROJECT_ID)
file_id = uploaded_file.id
print(f"✅ File uploaded: {uploaded_file.name}")

Create a split job with category definitions. Each category needs a name and a description that helps the AI understand what content belongs to that category:

split_request = {
    "document_input": {
        "type": "file_id",
        "value": file_id,
    },
    "categories": [
        {
            "name": "essay",
            "description": "A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic without strict formal structure",
        },
        {
            "name": "research_paper",
            "description": "A formal academic document presenting original research, methodology, experiments, results, and conclusions with citations and references",
        },
    ],
}

response = requests.post(
    f"{BASE_URL}/api/v1/beta/split/jobs",
    params={"project_id": PROJECT_ID},
    headers=headers,
    json=split_request,
)
response.raise_for_status()
split_job = response.json()
job_id = split_job["id"]
print(f"✅ Split job created: {job_id}")
print(f"   Status: {split_job['status']}")

The split job runs asynchronously. Poll until it completes:

def poll_split_job(job_id: str, max_wait_seconds: int = 180, poll_interval: int = 5):
    """Poll the split job until it completes, fails, or times out."""
    start_time = time.time()
    while (time.time() - start_time) < max_wait_seconds:
        response = requests.get(
            f"{BASE_URL}/api/v1/beta/split/jobs/{job_id}",
            params={"project_id": PROJECT_ID},
            headers=headers,
        )
        response.raise_for_status()
        job = response.json()
        status = job["status"]
        elapsed = int(time.time() - start_time)
        print(f"   Status: {status} (elapsed: {elapsed}s)")
        if status in ["completed", "failed"]:
            return job
        time.sleep(poll_interval)
    raise TimeoutError(f"Job did not complete within {max_wait_seconds} seconds")

completed_job = poll_split_job(job_id)

Examine the split results:

segments = completed_job.get("result", {}).get("segments", [])
print(f"📊 Total segments found: {len(segments)}")

for i, segment in enumerate(segments, 1):
    category = segment["category"]
    pages = segment["pages"]
    confidence = segment["confidence_category"]
    if len(pages) == 1:
        page_range = f"Page {pages[0]}"
    else:
        page_range = f"Pages {min(pages)}-{max(pages)}"
    print(f"\nSegment {i}:")
    print(f"  Category: {category}")
    print(f"  {page_range} ({len(pages)} pages)")
    print(f"  Confidence: {confidence}")
Example output:

📊 Total segments found: 3

Segment 1:
  Category: essay
  Pages 1-4 (4 pages)
  Confidence: high

Segment 2:
  Category: research_paper
  Pages 5-13 (9 pages)
  Confidence: high

Segment 3:
  Category: research_paper
  Pages 14-24 (11 pages)
  Confidence: high

The Split API correctly identified:

  • 1 essay segment: Alan Turing’s “Intelligent Machinery, A Heretical Theory”
  • 2 research paper segments: ImageNet paper and “Attention is All You Need”
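
To act on each segment downstream (e.g., feed it to LlamaParse or LlamaExtract), you may want to write each segment out as its own file. Here's a minimal sketch that slices the original PDF locally; it assumes the pypdf package (any PDF library works) and uses the 1-based page numbers returned above:

from pypdf import PdfReader, PdfWriter

reader = PdfReader(pdf_path)
for i, segment in enumerate(segments, 1):
    writer = PdfWriter()
    for page_number in segment["pages"]:
        # The API's page numbers are 1-based; pypdf's indices are 0-based
        writer.add_page(reader.pages[page_number - 1])
    out_path = f"segment_{i}_{segment['category']}.pdf"
    with open(out_path, "wb") as f:
        writer.write(f)
    print(f"✅ Wrote {out_path}")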

You can use the allow_uncategorized strategy when you want to capture pages that don’t match any defined category:

split_request_uncategorized = {
    "document_input": {"type": "file_id", "value": file_id},
    "categories": [
        {
            "name": "essay",
            "description": "A philosophical or reflective piece of writing that presents personal viewpoints",
        }
        # Only 'essay' defined - research papers will be 'uncategorized'
    ],
    "splitting_strategy": {"allow_uncategorized": True},
}

With this configuration, pages that don’t match essay will be grouped as uncategorized.
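
To inspect those pages, submit the job and filter the results. This sketch reuses the poll_split_job helper from above and assumes the unmatched pages come back under the literal category name "uncategorized":

response = requests.post(
    f"{BASE_URL}/api/v1/beta/split/jobs",
    params={"project_id": PROJECT_ID},
    headers=headers,
    json=split_request_uncategorized,
)
response.raise_for_status()
uncat_job = poll_split_job(response.json()["id"])

# Assumption: the reserved category name is the literal string "uncategorized"
uncat_segments = [
    s for s in uncat_job.get("result", {}).get("segments", [])
    if s["category"] == "uncategorized"
]
print(f"Uncategorized segments: {len(uncat_segments)}")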

  • Explore the REST API reference for all available options
  • Combine Split with LlamaExtract to run targeted extraction on each segment
  • Use LlamaParse to parse individual segments with optimized settings (see the sketch below)
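
For that last point, here's a minimal sketch. It assumes the llama_parse package and its target_pages parameter, which (as an assumption; check the current LlamaParse docs) takes comma-separated, 0-based page indices:

from llama_parse import LlamaParse

# Parse only the pages of the first research_paper segment
segment = next(s for s in segments if s["category"] == "research_paper")
# Convert the API's 1-based page numbers to 0-based indices (assumption)
target_pages = ",".join(str(p - 1) for p in segment["pages"])

parser = LlamaParse(api_key=LLAMA_CLOUD_API_KEY, target_pages=target_pages)
documents = parser.load_data(pdf_path)
print(f"Parsed {len(documents)} document(s) from segment pages {segment['pages'][0]}-{segment['pages'][-1]}")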