Parse All PDFs in a Folder with LlamaParse
This example demonstrates how to process multiple PDFs from a folder with LlamaParse, controlling concurrency with asyncio and semaphores. You can follow along with this tutorial using the example script in our llama_cloud_services repository, batch_parse_folder.py, which handles async parsing of every PDF in a given directory.
Environment Variables
Set your LLAMA_CLOUD_API_KEY environment variable:
```bash
export LLAMA_CLOUD_API_KEY='llx-...'
```

Or create a .env file:
```
LLAMA_CLOUD_API_KEY=llx-...
```

Install Dependencies
```bash
pip install llama-cloud-services python-dotenv requests
```
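python-dotenv is included above so your scripts can read the key from the .env file instead of requiring an exported variable. A minimal sketch, assuming the .env file sits in your working directory:

```python
import os

from dotenv import load_dotenv

# Load LLAMA_CLOUD_API_KEY from a .env file in the current directory, if present.
load_dotenv()

api_key = os.getenv("LLAMA_CLOUD_API_KEY")
if not api_key:
    raise RuntimeError("LLAMA_CLOUD_API_KEY is not set")
```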
Quick Start

Download Example PDFs
Download sample PDFs to test with:
```python
import os
import requests
from pathlib import Path

# Create sample_files directory
sample_dir = Path("sample_files")
sample_dir.mkdir(exist_ok=True)

# Sample documents to download
sample_docs = {
    "attention.pdf": "https://arxiv.org/pdf/1706.03762.pdf",
    "bert.pdf": "https://arxiv.org/pdf/1810.04805.pdf",
}

# Download sample documents with error handling
for filename, url in sample_docs.items():
    filepath = sample_dir / filename
    if not filepath.exists():
        print(f"📥 Downloading {filename}...")
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()

            # Basic content validation
            if response.headers.get("content-type", "").startswith("application/pdf"):
                with open(filepath, "wb") as f:
                    f.write(response.content)
                print(f" ✅ Downloaded {filename}")
            else:
                print(f" ⚠️ Warning: {filename} may not be a valid PDF")
        except requests.RequestException as e:
            print(f" ❌ Failed to download {filename}: {e}")
    else:
        print(f"📁 {filename} already exists")

print("\n✅ Sample files ready!")
```

Use Asyncio and Semaphore with LlamaParse
We can use asyncio to batch-parse multiple files in a folder. Below is a complete example script that parses all PDF files in a directory with controlled concurrency:
```python
import asyncio
import os
from pathlib import Path

from llama_cloud_services import LlamaParse


# A helper function to parse a single file with the semaphore
async def parse_single_file(parser, file_path, semaphore):
    async with semaphore:
        try:
            print(f"Starting parse: {file_path.name}")
            result = await parser.aparse(str(file_path))
            print(f"✓ Completed: {file_path.name} ({len(result.pages)} pages)")
            return {
                "file": file_path.name,
                "status": "success",
                "result": result,
                "pages": len(result.pages) if result.pages else 0,
            }
        except Exception as e:
            print(f"✗ Error parsing {file_path.name}: {str(e)}")
            return {
                "file": file_path.name,
                "status": "error",
                "error": str(e),
            }


async def main():
    api_key = os.getenv("LLAMA_CLOUD_API_KEY")  # set via env var or .env file
    input_dir = Path("sample_files")
    max_concurrent = 5

    pdf_files = list(input_dir.glob("*.pdf"))

    # Initialize parser
    parser = LlamaParse(
        api_key=api_key,
        num_workers=1,  # We control concurrency with the semaphore
        show_progress=False,  # We'll show our own progress
    )

    # Create semaphore to limit concurrent requests
    semaphore = asyncio.Semaphore(max_concurrent)

    # Create tasks for all files
    tasks = [
        parse_single_file(parser, pdf_file, semaphore) for pdf_file in pdf_files
    ]

    results = await asyncio.gather(*tasks)
    return results


if __name__ == "__main__":
    results = asyncio.run(main())
```
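Once the gather call returns, each successful entry carries the full parse result. As a rough sketch of what you might do next, assuming each page object exposes a `.md` attribute for its markdown output and using a hypothetical `parsed_output` directory, you could write one markdown file per PDF:

```python
from pathlib import Path

# Sketch: persist successful parses as markdown, one file per input PDF.
# Assumes `results` is the list returned by asyncio.gather above and that
# each page exposes a `.md` attribute; adjust if your result schema differs.
output_dir = Path("parsed_output")  # hypothetical output location
output_dir.mkdir(exist_ok=True)

for item in results:
    if item["status"] != "success":
        continue
    markdown = "\n\n".join(page.md for page in item["result"].pages)
    out_path = output_dir / f"{Path(item['file']).stem}.md"
    out_path.write_text(markdown, encoding="utf-8")
    print(f"Wrote {out_path}")
```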
Alternatively, you can use the batch_parse_folder.py script we've provided, pointing it at the sample_files directory you created earlier:

```bash
python batch_parse_folder.py --input-dir ./sample_files --max-concurrent 5
```

Parameters:
- `--input-dir`: Directory containing the PDF files to parse.
- `--max-concurrent`: Maximum number of concurrent parse operations (see the CLI sketch after this list). Adjust based on:
  - Your API rate limits (typically 5-10 for most accounts)
  - Available network bandwidth
  - Server capacity
  - File sizes (larger files may require lower concurrency to avoid memory issues)
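The repository's batch_parse_folder.py already implements these flags; if you want to expose the same options in your own script, a minimal argparse sketch (hypothetical, not the repo's actual implementation) might look like:

```python
import argparse
from pathlib import Path


def parse_args():
    # Hypothetical CLI mirroring the flags described above.
    parser = argparse.ArgumentParser(description="Batch-parse PDFs with LlamaParse")
    parser.add_argument(
        "--input-dir",
        type=Path,
        required=True,
        help="Directory containing PDF files to parse",
    )
    parser.add_argument(
        "--max-concurrent",
        type=int,
        default=5,
        help="Maximum number of concurrent parse operations",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Parsing PDFs in {args.input_dir} with up to {args.max_concurrent} concurrent jobs")
```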
Example Output
```
Found 2 PDF files to parse
Processing 2 files with max 5 concurrent operations...
Starting parse: attention.pdf
Starting parse: bert.pdf
Started parsing the file under job_id 1a7b8f3b-9119-4e38-954d-b67b8e96b3d6
Started parsing the file under job_id 28123aeb-dd3e-4398-b754-0cb101a3b78b
✓ Completed: attention.pdf (15 pages)
✓ Completed: bert.pdf (16 pages)

PARSE SUMMARY
Total files: 2
Successful: 2
Failed: 0
Total time: 10.00 seconds
Average time per file: 5.00 seconds
```

How It Works
- Semaphore-based Concurrency: Uses `asyncio.Semaphore` to limit concurrent requests, preventing API rate-limit errors and managing resource usage.
- Async Processing: Each file is parsed asynchronously using `parser.aparse()`, allowing multiple files to be processed concurrently up to the semaphore limit.
- Result Aggregation: All results are collected and summarized at the end, providing a complete overview of the parsing operation (see the summary sketch after this list).
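The PARSE SUMMARY shown in the example output comes from this aggregation step. A minimal sketch, assuming the `results` list of dicts built above and an elapsed time measured around the `asyncio.gather` call (e.g., with `time.perf_counter()`):

```python
def print_summary(results, elapsed_seconds):
    # Summarize the per-file result dicts built by parse_single_file.
    successful = [r for r in results if r["status"] == "success"]
    failed = [r for r in results if r["status"] == "error"]

    print("PARSE SUMMARY")
    print(f"Total files: {len(results)}")
    print(f"Successful: {len(successful)}")
    print(f"Failed: {len(failed)}")
    print(f"Total time: {elapsed_seconds:.2f} seconds")
    if results:
        print(f"Average time per file: {elapsed_seconds / len(results):.2f} seconds")
```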