Autoscaling
Configure LlamaCloud services to scale automatically based on resource utilization or queue depth.
Overview
LlamaCloud supports two autoscaling approaches:
- Standard HPA (Default) - CPU/memory-based scaling using Kubernetes HPA
- KEDA-based scaling (Recommended for Production) - Queue-depth based scaling for better workload responsiveness
Both options are configured through Helm values and provide automatic scaling based on different metrics to ensure optimal resource utilization.
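The sketch below shows where the two toggles sit in the chart's Helm values (key names taken from the configuration examples later on this page); enable one and leave the other disabled:

```yaml
# Orientation sketch only; see the full examples under "Helm Configuration" below
llamaParse:
  autoscaling:
    enabled: true    # Option 1: standard CPU/memory HPA (the default)
  keda:
    enabled: false   # Option 2: KEDA queue-based scaling (enable this and disable the HPA)
```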
Autoscaling Options
Standard HPA
- Metrics: CPU and memory utilization
- Best for: General workloads, development environments
- Setup: Enabled by default, no additional components required
KEDA-based Scaling
- Metrics: Queue depth from the LlamaCloud API
- Best for: Production workloads with variable processing queues
- Setup: Requires KEDA operator installation
- Advantage: Scales based on actual work to be done, not just resource usage
Prerequisites
For Standard HPA
- Kubernetes Metrics Server (usually pre-installed)
For KEDA-based Scaling
- KEDA Operator installed in your Kubernetes cluster
- LlamaCloud version 0.5.8+ (for queue status API)
Helm Configuration
Option 1: Standard HPA (Default)
By default, LlamaParse uses standard Kubernetes HPA based on CPU and memory metrics:
```yaml
# Basic HPA configuration (enabled by default)
llamaParse:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 80
    # targetMemoryUtilizationPercentage: 80  # Uncomment to enable memory-based scaling
```
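For reference, the values above correspond roughly to a standard autoscaling/v2 HorizontalPodAutoscaler like the sketch below. The chart renders and manages this object itself; the Deployment and HPA names here are placeholders, not the actual names your release will use.

```yaml
# Illustrative sketch only: approximately what the HPA values above translate to.
# Object names are assumptions; your Helm release will generate its own.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llamaparse-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llamaparse           # placeholder for the LlamaParse worker Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```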
Option 2: KEDA Queue-Based Scaling (Recommended for Production)
We recommend the KEDA-based configuration in production for more robust, queue-depth-based scaling. Note: you must disable the standard HPA when using KEDA.
```yaml
# KEDA configuration for queue-based scaling
llamaParse:
  autoscaling:
    enabled: false  # Must disable HPA for KEDA

  keda:
    enabled: true
    pollingInterval: 15
    cooldownPeriod: 120
    minReplicaCount: 2
    maxReplicaCount: 50

    # Configure queue-based scaling trigger
    triggers:
      - type: metrics-api
        metadata:
          url: "http://llamacloud-backend:8000/api/queue-statusz?queue_prefix=parse_raw_file_job"
          format: "json"
          valueLocation: "total_messages"  # includes both ready (i.e. waiting) and unacked (i.e. processing) messages
          targetValue: "20"  # when the metric reported by the API reaches or exceeds this value, KEDA scales out
```
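Under the hood, KEDA expresses this configuration as a ScaledObject. A hand-written equivalent would look roughly like the sketch below; the chart is expected to render something similar for you, and the target Deployment name here is a placeholder.

```yaml
# Illustrative sketch only: a standalone KEDA ScaledObject roughly equivalent
# to the Helm values above. The scaleTargetRef name is an assumption.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llamaparse-queue-scaler
spec:
  scaleTargetRef:
    name: llamaparse            # placeholder for the LlamaParse worker Deployment
  pollingInterval: 15           # seconds between checks of the queue-status API
  cooldownPeriod: 120
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: metrics-api
      metadata:
        url: "http://llamacloud-backend:8000/api/queue-statusz?queue_prefix=parse_raw_file_job"
        format: "json"
        valueLocation: "total_messages"
        targetValue: "20"
```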
OCR Pod Scaling Based on LlamaParse Worker Pods
For workloads that use OCR services, you can configure KEDA to scale OCR pods based on the number of LlamaParse worker pods. This ensures OCR capacity matches parsing demand.
KEDA Configuration for OCR Scaling
OCR pod scaling uses KEDA to watch the parse queue as a proxy for the number of running LlamaParse Worker pods, effectively applying the formula Max(3, llamaparse_pods / 3) to determine the number of OCR pods.
```yaml
# OCR scaling configuration based on LlamaParse Worker pods
llamaParse-ocr:
  autoscaling:
    enabled: false  # Disable standard HPA for KEDA

  keda:
    enabled: true
    pollingInterval: 15
    cooldownPeriod: 120
    minReplicaCount: 3
    maxReplicaCount: 20

    # Scale OCR pods based on LlamaParse Worker pod count
    triggers:
      - type: metrics-api
        metadata:
          url: "http://llamacloud-backend:8000/api/queue-statusz?queue_prefix=parse_raw_file_job"
          format: "json"
          valueLocation: "total_messages"  # number of jobs queued or running
          targetValue: "60"  # scale up when jobs exceed 3 * 20 (~20 jobs per parse worker pod)
```
Scaling Logic
OCR pods scale based on the parse job count, effectively following the formula Max(3, estimated_parse_workers / 3). The target value of 60 assumes roughly 20 jobs per LlamaParse Worker pod, maintaining a 3:1 LlamaParse Worker to OCR pod ratio for optimal resource efficiency.
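As a rough worked example of the numbers above (illustrative queue depth; assumes KEDA's usual behavior of targeting about metric / targetValue replicas, bounded by the min and max replica counts):

```text
total_messages = 120 jobs in the parse queue
LlamaParse workers: ceil(120 / 20) = 6 pods            (targetValue: "20")
OCR pods:           max(3, ceil(120 / 60)) = 3 pods    (targetValue: "60", minReplicaCount: 3)
Net effect: roughly one OCR pod per three LlamaParse Worker pods, never fewer than 3
```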