LlamaParse Configuration
Configuration and scaling recommendations for LlamaParse OCR services and workers.
Overview
LlamaParse components:
- OCR Service: Text extraction from document images
- LlamaParse Workers: Document processing (fast, balanced, agentic modes)
OCR Service Configuration
The OCR service runs on GPU or CPU infrastructure.
Hardware Recommendations
CPU deployments: Use x86 architecture (50% better throughput than ARM).
Resource Requirements
Configuration | GPU | CPU |
---|---|---|
Minimum instances | 2 | 12 |
Pages per minute per pod | 100 | ~2 per worker |
Recommended workers per pod | 4 | Core count ÷ 2 |
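As a rough worked example of what those per-pod numbers imply (the 16-core pod size below is an illustrative assumption, not a recommendation):

```python
# Per-pod OCR throughput estimate derived from the table above.
cpu_cores_per_pod = 16                      # illustrative pod size
cpu_ocr_workers = cpu_cores_per_pod // 2    # recommended workers: core count ÷ 2
cpu_pages_per_min = cpu_ocr_workers * 2     # ~2 pages/min per CPU worker
gpu_pages_per_min = 100                     # per GPU pod

print(f"CPU pod ({cpu_cores_per_pod} cores): ~{cpu_pages_per_min} pages/min")
print(f"GPU pod: ~{gpu_pages_per_min} pages/min")
```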
Scaling Ratios
- CPU: 2 CPU OCR workers (2 cores each) per LlamaParse worker
- GPU: 1 GPU OCR worker per 8 LlamaParse workers
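A minimal sizing sketch that applies these ratios together with the minimum instance counts from the table above; the 8-worker input mirrors the first row of the Scaling Examples table below:

```python
import math

def ocr_sizing(llamaparse_workers: int) -> dict:
    """Apply the documented OCR-to-LlamaParse scaling ratios and minimums."""
    cpu = max(12, llamaparse_workers * 2)            # 2 CPU OCR workers per LlamaParse worker
    gpu = max(2, math.ceil(llamaparse_workers / 8))  # 1 GPU OCR worker per 8 LlamaParse workers
    return {"cpu_ocr_workers": cpu, "gpu_ocr_workers": gpu}

print(ocr_sizing(8))  # {'cpu_ocr_workers': 16, 'gpu_ocr_workers': 2}
```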
LlamaParse Worker Configuration
Workers process documents in three modes:
Performance by Mode
Mode | Pages per Minute | Use Case |
---|---|---|
Fast | ~10,000 | High-volume, basic text extraction |
Balanced | ~250 | Standard parsing with good accuracy |
Agentic | ~100 | Complex documents requiring AI analysis |
Resource Requirements
Compute:
- CPU: 2 vCPUs per worker
- Memory: 2-16 GB RAM per worker
Deployment:
- Multiple workers per Kubernetes node
- ~6 workers per node (production)
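A small sketch of the node math these figures imply; the 24-worker target and 8 GB per-worker memory are illustrative assumptions:

```python
import math

llamaparse_workers = 24     # illustrative target, not a recommendation
workers_per_node = 6        # ~6 workers per node in production
vcpus_per_worker = 2
mem_gb_per_worker = 8       # pick within the 2-16 GB range for your workload

nodes = math.ceil(llamaparse_workers / workers_per_node)
print(f"{nodes} nodes; per node: "
      f"{workers_per_node * vcpus_per_worker} vCPUs, "
      f"{workers_per_node * mem_gb_per_worker} GB RAM for workers")
```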
Scaling Examples
Target Throughput | LlamaParse Workers | CPU OCR Pods | GPU OCR Pods |
---|---|---|---|
1,000 pages/min | 8 | 16 | 2 |
10,000 pages/min | 64 | 128 | 12 |
GenAI Providers
LlamaParse uses GenAI providers for parsing:
- parse_page_with_llm: LLM parsing (supports gpt-4o-mini, haiku-3.5)
- parse_page_with_lvm: Vision model parsing (supports gemini, openai, claude sonnet)
- parse_page_with_agent: Agentic parsing (supports claude, gemini, openai)
Provider fallback: When multiple providers are configured, LlamaParse automatically falls back to another provider if one becomes unavailable.
Supported providers:
- Claude/Haiku: Anthropic (US), AWS Bedrock, Google VertexAI
- OpenAI: OpenAI (US), OpenAI EU (parse_page_with_llm only), AzureAI
- Gemini: Google Vertex AI, Google GenAI
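Conceptually, the fallback behavior works like the sketch below; this is illustrative only (the provider names and call_provider helper are placeholders, not LlamaParse's actual implementation):

```python
class ProviderUnavailable(Exception):
    """Raised when a configured GenAI provider cannot serve the request."""

def call_provider(name: str, page_image: bytes) -> str:
    # Placeholder: a real deployment would call the configured provider here.
    raise ProviderUnavailable(name)

def parse_with_fallback(page_image: bytes, providers: list[str]) -> str:
    """Try each configured provider in order, falling back on unavailability."""
    for name in providers:
        try:
            return call_provider(name, page_image)
        except ProviderUnavailable:
            continue  # provider down or over quota; try the next one
    raise RuntimeError("All configured providers are unavailable")
```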
Advanced Configuration
OCR Worker Tuning
OCR_WORKER=<value> # Recommended: pod_core_count ÷ 2
OCR Concurrency Control
OCR_CONCURRENCY=8 # Default
- Lower: Fewer OCR pods, slower processing
- Higher: More OCR pods, faster processing
Image Processing Limits
MAX_EXTRACTED_IMAGES_PER_PAGES=30 # Default
Job Queue Concurrency
PDF_JOB_QUEUE_CONCURRENCY=1 # Default (recommended)
Do not change PDF_JOB_QUEUE_CONCURRENCY without understanding the performance implications.
GenAI Throughput Tuning
Limit throughput per mode to match TPM/RPM quotas:
ACCURATE_MODE_LLM_CONCURRENCY=250   # parse_page_with_llm (default)
MULTIMODAL_MODEL_CONCURRENCY=50     # parse_page_with_lvm (default)
PREMIUM_MODE_MODEL_CONCURRENCY=25   # parse_page_with_agent (default)
Token usage per 1k pages:
Mode | Requests | Input Tokens | Output Tokens |
---|---|---|---|
parse_page_with_llm | 1,010 | 1.2M | 1.5M |
parse_page_with_agent | 2,000 | 4M | 2M |
parse_page_with_lvm | 1,200 | 3M | 1.5M |
Providers like AWS Bedrock have low default quotas. Verify quotas accommodate desired parsing volume.
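To sanity-check a provider's quotas, the per-1k-page figures above can be scaled to a target throughput and compared against the provider's TPM/RPM limits; the target rate and quota values in this sketch are illustrative assumptions:

```python
import math

# parse_page_with_agent row from the table above (per 1,000 pages).
requests_per_1k = 2_000
tokens_per_1k = 4_000_000 + 2_000_000   # input + output tokens

target_pages_per_min = 100              # illustrative target parsing rate
scale = target_pages_per_min / 1_000

needed_rpm = math.ceil(requests_per_1k * scale)
needed_tpm = math.ceil(tokens_per_1k * scale)

provider_rpm_quota = 500                # illustrative quotas; check your provider
provider_tpm_quota = 400_000

print(f"Need ~{needed_rpm} requests/min and ~{needed_tpm} tokens/min")
print(f"Fits quota: {needed_rpm <= provider_rpm_quota and needed_tpm <= provider_tpm_quota}")
```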
Monitoring and Optimization
Key Metrics
- OCR throughput: Pages/minute
- Worker utilization: CPU/memory usage
- Queue depth: Pending jobs
- Error rates: Failed operations
Optimization
- Node placement: Co-locate workloads with complementary resource usage patterns
- Horizontal scaling: Add workers before increasing per-worker resources
- OCR scaling: Scale OCR services independently
- Memory management: Use restart policies for long-running deployments