# Splitting Concatenated Documents
This guide demonstrates how to use the Split API to automatically segment a concatenated PDF into logical document sections based on content categories.
## Use Case

When dealing with large PDFs that contain multiple distinct documents or sections (e.g., a bundle of research papers, a collection of reports), you often need to split them into individual segments. The Split API uses AI to:
- Analyze each page’s content
- Classify pages into user-defined categories
- Group consecutive pages of the same category into segments (sketched below)
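To make that last step concrete, here is a toy sketch of how consecutive same-category page labels collapse into segments. This is an illustration only, not the API’s actual implementation:

```python
from itertools import groupby

def group_pages(labels: list[str]) -> list[dict]:
    """Toy illustration: collapse consecutive same-category page labels into segments."""
    segments, page = [], 1
    for category, run in groupby(labels):
        count = len(list(run))
        segments.append({"category": category, "pages": list(range(page, page + count))})
        page += count
    return segments

# Pages 1-2 labeled essay, pages 3-5 labeled research_paper -> two segments
print(group_pages(["essay", "essay", "research_paper", "research_paper", "research_paper"]))
```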
## Example Document

We’ll use a PDF containing three concatenated documents:
- Alan Turing’s “Intelligent Machinery, A Heretical Theory” (an essay)
- ImageNet paper (a research paper)
- “Attention is All You Need” paper (a research paper)
We’ll split this into segments categorized as either `essay` or `research_paper`.
Install the required packages:

```bash
pip install llama-cloud python-dotenv requests
```

Set up your environment:
```python
import os
import time
import requests
from dotenv import load_dotenv

load_dotenv()

LLAMA_CLOUD_API_KEY = os.environ.get("LLAMA_CLOUD_API_KEY")
BASE_URL = os.environ.get("LLAMA_CLOUD_BASE_URL", "https://api.cloud.llamaindex.ai")
PROJECT_ID = os.environ.get("LLAMA_CLOUD_PROJECT_ID", None)

headers = {
    "Authorization": f"Bearer {LLAMA_CLOUD_API_KEY}",
    "Content-Type": "application/json",
}
```
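`load_dotenv()` reads these variables from a `.env` file in your working directory. A placeholder example (the values shown are illustrative, and `LLAMA_CLOUD_PROJECT_ID` simply defaults to `None` if unset):

```
LLAMA_CLOUD_API_KEY=llx-your-api-key
LLAMA_CLOUD_BASE_URL=https://api.cloud.llamaindex.ai
LLAMA_CLOUD_PROJECT_ID=your-project-id
```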
## Step 1: Upload the PDF

Upload the concatenated PDF to LlamaCloud using the `llama-cloud` SDK:
```python
from llama_cloud.client import LlamaCloud

client = LlamaCloud(token=LLAMA_CLOUD_API_KEY, base_url=BASE_URL)

pdf_path = "./data/turing+imagenet+attention.pdf"

with open(pdf_path, "rb") as f:
    uploaded_file = client.files.upload_file(upload_file=f, project_id=PROJECT_ID)

file_id = uploaded_file.id
print(f"✅ File uploaded: {uploaded_file.name}")
```
## Step 2: Create a Split Job

Create a split job with category definitions. Each category needs a `name` and a `description` that helps the AI understand what content belongs to that category:
```python
split_request = {
    "document_input": {
        "type": "file_id",
        "value": file_id,
    },
    "categories": [
        {
            "name": "essay",
            "description": "A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic without strict formal structure",
        },
        {
            "name": "research_paper",
            "description": "A formal academic document presenting original research, methodology, experiments, results, and conclusions with citations and references",
        },
    ],
}

response = requests.post(
    f"{BASE_URL}/api/v1/beta/split/jobs",
    params={"project_id": PROJECT_ID},
    headers=headers,
    json=split_request,
)
response.raise_for_status()

split_job = response.json()
job_id = split_job["id"]

print(f"✅ Split job created: {job_id}")
print(f"   Status: {split_job['status']}")
```
## Step 3: Poll for Completion

The split job runs asynchronously. Poll until it completes:
```python
def poll_split_job(job_id: str, max_wait_seconds: int = 180, poll_interval: int = 5):
    start_time = time.time()

    while (time.time() - start_time) < max_wait_seconds:
        response = requests.get(
            f"{BASE_URL}/api/v1/beta/split/jobs/{job_id}",
            params={"project_id": PROJECT_ID},
            headers=headers,
        )
        response.raise_for_status()
        job = response.json()

        status = job["status"]
        elapsed = int(time.time() - start_time)
        print(f"  Status: {status} (elapsed: {elapsed}s)")

        if status in ["completed", "failed"]:
            return job

        time.sleep(poll_interval)

    raise TimeoutError(f"Job did not complete within {max_wait_seconds} seconds")

completed_job = poll_split_job(job_id)
```
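Note that `poll_split_job` returns on either terminal state, so it’s worth guarding against failure before reading results. A minimal sketch; the `error` field name is an assumption, so check the API reference for the actual shape of a failed job:

```python
# Guard against a failed job before reading results.
# The "error" field name is an assumption; consult the API reference.
if completed_job["status"] == "failed":
    raise RuntimeError(f"Split job failed: {completed_job.get('error', 'unknown error')}")
```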
## Step 4: Analyze Results

Examine the split results:
```python
segments = completed_job.get("result", {}).get("segments", [])

print(f"📊 Total segments found: {len(segments)}")

for i, segment in enumerate(segments, 1):
    category = segment["category"]
    pages = segment["pages"]
    confidence = segment["confidence_category"]

    if len(pages) == 1:
        page_range = f"Page {pages[0]}"
    else:
        page_range = f"Pages {min(pages)}-{max(pages)}"

    print(f"\nSegment {i}:")
    print(f"  Category: {category}")
    print(f"  {page_range} ({len(pages)} pages)")
    print(f"  Confidence: {confidence}")
```
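Each segment carries a `confidence_category` label (`high` in the run below). If you want to route less certain segments to manual review, a small filter works; label values other than `high` are an assumption here:

```python
# Flag segments the model was less sure about.
# Only "high" is confirmed by the output below; other label values are assumed.
needs_review = [s for s in segments if s["confidence_category"] != "high"]
print(f"{len(needs_review)} segment(s) may need manual review")
```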
## Expected Output

```
📊 Total segments found: 3

Segment 1:
  Category: essay
  Pages 1-4 (4 pages)
  Confidence: high

Segment 2:
  Category: research_paper
  Pages 5-13 (9 pages)
  Confidence: high

Segment 3:
  Category: research_paper
  Pages 14-24 (11 pages)
  Confidence: high
```

The Split API correctly identified:
- 1 essay segment: Alan Turing’s “Intelligent Machinery, A Heretical Theory”
- 2 research paper segments: ImageNet paper and “Attention is All You Need”
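The Split API returns page groupings rather than new files. If you want each segment materialized as its own PDF, here is a sketch using the third-party pypdf library (not part of the Split API); it assumes the `pages` lists are 1-based, as in the output above:

```python
# Sketch: write each segment to a separate PDF with pypdf (pip install pypdf).
# Assumes segment "pages" are 1-based, matching the output above.
from pypdf import PdfReader, PdfWriter

reader = PdfReader(pdf_path)
for i, segment in enumerate(segments, 1):
    writer = PdfWriter()
    for page_number in segment["pages"]:
        writer.add_page(reader.pages[page_number - 1])  # pypdf pages are 0-based
    out_path = f"segment_{i}_{segment['category']}.pdf"
    with open(out_path, "wb") as f:
        writer.write(f)
    print(f"Wrote {out_path}")
```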
## Using `allow_uncategorized`

You can set the `allow_uncategorized` option in the splitting strategy when you want to capture pages that don’t match any defined category:
```python
split_request_uncategorized = {
    "document_input": {"type": "file_id", "value": file_id},
    "categories": [
        {
            "name": "essay",
            "description": "A philosophical or reflective piece of writing that presents personal viewpoints",
        }
        # Only 'essay' defined - research papers will be 'uncategorized'
    ],
    "splitting_strategy": {"allow_uncategorized": True},
}
```

With this configuration, pages that don’t match `essay` will be grouped as `uncategorized`.
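Once this job completes and you collect its segments as in Step 4, you can pull out the unmatched pages. This assumes unmatched segments come back with the literal category `uncategorized`, as described above:

```python
# Assumes unmatched segments are labeled with the category "uncategorized"
uncategorized = [s for s in segments if s["category"] == "uncategorized"]
for segment in uncategorized:
    print(f"Uncategorized pages: {segment['pages']}")
```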
## Next Steps

- Explore the REST API reference for all available options
- Combine Split with LlamaExtract to run targeted extraction on each segment
- Use LlamaParse to parse individual segments with optimized settings