Autoscaling
Configure autoscaling for LlamaCloud services to automatically scale based on resource utilization or queue depth.
Overview
LlamaCloud supports two autoscaling approaches:
- Standard HPA (Default) - CPU/memory-based scaling using Kubernetes HPA
- KEDA-based scaling (Recommended for Production) - Queue-depth based scaling for better workload responsiveness
Both options are configured through Helm values and provide automatic scaling based on different metrics to ensure optimal resource utilization.
Autoscaling Options
Standard HPA
- Metrics: CPU and memory utilization
- Best for: General workloads, development environments
- Setup: Enabled by default, no additional components required
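To make the CPU/memory-based behavior concrete, here is a short sketch of the replica calculation the Kubernetes HPA performs (the standard HPA formula; the utilization numbers below are illustrative, not LlamaCloud defaults):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_utilization: float,
                         target_utilization: float) -> int:
    """Kubernetes HPA scaling rule:
    desired = ceil(current_replicas * currentUtilization / targetUtilization)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 4 pods averaging 120% CPU against an 80% target -> scale to 6 replicas
print(hpa_desired_replicas(4, 120, 80))
```

In practice the HPA also clamps the result to the configured `minReplicas`/`maxReplicas` bounds shown in the Helm values below.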
KEDA-based Scaling
- Metrics: Queue depth from LlamaCloud API
- Best for: Production workloads with variable processing queues
- Setup: Requires KEDA operator installation
- Advantage: Scales based on actual work to be done, not just resource usage
Prerequisites
For Standard HPA
Section titled “For Standard HPA”- Kubernetes Metrics Server (usually pre-installed)
For KEDA-based Scaling
- KEDA Operator installed in your Kubernetes cluster
- LlamaCloud version 0.5.8+ (for queue status API)
Helm Configuration
Option 1: Standard HPA (Default)
By default, LlamaParse uses standard Kubernetes HPA based on CPU and memory metrics:
```yaml
# Basic HPA configuration (enabled by default)
llamaParse:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 80
    # targetMemoryUtilizationPercentage: 80 # Uncomment to enable memory-based scaling
```
Option 2: KEDA Queue-Based Scaling (Recommended for Production)
For production, we recommend the KEDA-based configuration, which provides more robust queue-depth based scaling. Note: you must disable the standard HPA when using KEDA.
```yaml
# KEDA configuration for queue-based scaling
llamaParse:
  autoscaling:
    enabled: false # Must disable HPA for KEDA
  keda:
    enabled: true
    pollingInterval: 15
    cooldownPeriod: 120
    minReplicaCount: 2
    maxReplicaCount: 50
    # Configure queue-based scaling trigger
    triggers:
      - type: metrics-api
        metadata:
          url: "http://llamacloud-backend:8000/api/queue-statusz?queue_prefix=parse_raw_file_job"
          format: "json"
          valueLocation: "total_messages" # includes ready (i.e. waiting) & unacked (i.e. processing) messages
          targetValue: "20" # if the metric provided by the API is equal to or higher than this value, KEDA will start scaling out
```
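To clarify the trigger semantics, the following is a rough sketch (in Python) of how the metrics-api trigger behaves: KEDA reads the field named by `valueLocation` from the endpoint's JSON response and, treating `targetValue` as an average-value target, sizes the deployment at approximately ceil(metric / targetValue), clamped to the configured replica bounds. The sample JSON payload here is illustrative, not the API's documented schema.

```python
import json
import math

def desired_replicas(metric: float, target: float, min_r: int, max_r: int) -> int:
    """Approximation of KEDA's average-value scaling decision:
    desired = ceil(metric / targetValue), clamped to
    [minReplicaCount, maxReplicaCount]."""
    return max(min_r, min(max_r, math.ceil(metric / target)))

# Illustrative queue-status response; only "total_messages" (the valueLocation) is read.
response = json.loads('{"total_messages": 135}')

# With targetValue=20, minReplicaCount=2, maxReplicaCount=50 as configured above
print(desired_replicas(response["total_messages"], target=20, min_r=2, max_r=50))
```

An empty queue settles at `minReplicaCount`, and a large backlog is capped at `maxReplicaCount`, which is why queue-based scaling tracks actual pending work rather than resource usage.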