Splitting Concatenated Documents
Learn how to split a PDF containing multiple documents into separate segments using AI-powered classification.
This guide demonstrates how to use the Split API to automatically segment a concatenated PDF into logical document sections based on content categories.
Use Case
Section titled “Use Case”When dealing with large PDFs that contain multiple distinct documents or sections (e.g., a bundle of research papers, a collection of reports), you often need to split them into individual segments. The Split API uses AI to:
- Analyze each page’s content
- Classify pages into user-defined categories
- Group consecutive pages of the same category into segments
Example Document
Section titled “Example Document”We’ll use a PDF containing three concatenated documents:
- Alan Turing’s essay “Intelligent Machinery, A Heretical Theory” (an essay)
- ImageNet paper (a research paper)
- “Attention is All You Need” paper (a research paper)
We’ll split this into segments categorized as either essay or research_paper.
Install the required packages:
pip install llama-cloud>=1.0Set up your environment (or pass your API key directly in code later):
export LLAMA_CLOUD_API_KEY="your_api_key_here"npm install @llamaindex/llama-cloudSet up your environment (or pass your API key directly in code later):
export LLAMA_CLOUD_API_KEY="your_api_key_here"Step 1: Upload the PDF
Section titled “Step 1: Upload the PDF”Upload the concatenated PDF to LlamaCloud using the llama-cloud SDK:
from llama_cloud import LlamaCloud
client = LlamaCloud()
pdf_path = "./data/turing+imagenet+attention.pdf"
uploaded_file = client.files.create(file=pdf_path, purpose="split")
file_id = uploaded_file.idprint(f"✅ File uploaded: {uploaded_file.id}")import fs from "fs";import { LlamaCloud } from "@llamaindex/llama-cloud";
const client = new LlamaCloud();
const pdfPath = "./data/turing+imagenet+attention.pdf";
const uploadedFile = await client.files.create({ file: fs.createReadStream(pdfPath), purpose: "split",});
const fileId = uploadedFile.id;console.log(`✅ File uploaded: ${uploadedFile.id}`);Step 2: Create a Split Job
Section titled “Step 2: Create a Split Job”Create a split job with category definitions. Each category needs a name and a description that helps the AI understand what content belongs to that category:
job = client.beta.split.create( document_input={ "type": "file_id", "value": file_id, }, categories=[ { "name": "essay", "description": "A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic without strict formal structure", }, { "name": "research_paper", "description": "A formal academic document presenting original research, methodology, experiments, results, and conclusions with citations and references", }, ],)
print(f"✅ Split job created: {job.id}")print(f" Status: {job.status}")const job = await client.beta.split.create({ document_input: { type: "file_id", value: fileId, }, categories: [ { name: "essay", description: "A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic without strict formal structure", }, { name: "research_paper", description: "A formal academic document presenting original research, methodology, experiments, results, and conclusions with citations and references", }, ],});
console.log(`✅ Split job created: ${job.id}`);console.log(` Status: ${job.status}`);Step 3: Poll for Completion
Section titled “Step 3: Poll for Completion”The split job runs asynchronously. Poll until it completes:
completed_job = client.beta.split.wait_for_completion(job.id, polling_interval=2.0)const completedJob = await client.beta.split.waitForCompletion(job.id, { pollingInterval: 2.0 });Step 4: Analyze Results
Section titled “Step 4: Analyze Results”Examine the split results:
segments = completed_job.result.segments if completed_job.result else []
print(f"📊 Total segments found: {len(segments)}")
for i, segment in enumerate(segments, 1): category = segment.category pages = segment.pages confidence = segment.confidence_category
if len(pages) == 1: page_range = f"Page {pages[0]}" else: page_range = f"Pages {min(pages)}-{max(pages)}"
print(f"\nSegment {i}:") print(f" Category: {category}") print(f" {page_range} ({len(pages)} pages)") print(f" Confidence: {confidence}")const segments = completedJob.result?.segments || [];
console.log(`📊 Total segments found: ${segments.length}`);
segments.forEach((segment, index) => { const category = segment.category; const pages = segment.pages; const confidence = segment.confidence_category;
const pageRange = pages.length === 1 ? `Page ${pages[0]}` : `Pages ${Math.min(...pages)}-${Math.max(...pages)}`;
console.log(`\nSegment ${index + 1}:`); console.log(` Category: ${category}`); console.log(` ${pageRange} (${pages.length} pages)`); console.log(` Confidence: ${confidence}`);});Expected Output
Section titled “Expected Output”📊 Total segments found: 3
Segment 1: Category: essay Pages 1-4 (4 pages) Confidence: high
Segment 2: Category: research_paper Pages 5-13 (9 pages) Confidence: high
Segment 3: Category: research_paper Pages 14-24 (11 pages) Confidence: highThe Split API correctly identified:
- 1 essay segment: Alan Turing’s “Intelligent Machinery, A Heretical Theory”
- 2 research paper segments: ImageNet paper and “Attention is All You Need”
Controlling uncategorized pages
Section titled “Controlling uncategorized pages”By default, pages that don’t match any defined category are grouped as uncategorized. You can control this with allow_uncategorized:
"include"(default) — pages that don’t match any category will be grouped asuncategorizedand included in results"forbid"— all pages must be assigned to a defined category"omit"— pages that don’t match any category will not appear in results
# Exclude uncategorized pages from resultsjob = client.beta.split.create(..., splitting_strategy={"allow_uncategorized": "omit"})
# Force all pages into defined categoriesjob = client.beta.split.create(..., splitting_strategy={"allow_uncategorized": "forbid"})// Exclude uncategorized pages from resultsconst job = await client.beta.split.create({..., splitting_strategy: { allow_uncategorized: "omit" } });
// Force all pages into defined categoriesconst job = await client.beta.split.create({..., splitting_strategy: { allow_uncategorized: "forbid" } });Next Steps
Section titled “Next Steps”- Explore the REST API reference for all available options
- Combine Split with LlamaExtract to run targeted extraction on each segment
- Use LlamaParse to parse individual segments with optimized settings