File Storage
File storage is an integral part of LlamaCloud; without it, many key features would not be possible. This page walks through how to configure file storage for your deployment: which buckets you need to create and, for non-AWS deployments, how to configure S3Proxy to interact with them.
Requirements
- A valid blob storage service, such as AWS S3 or Azure Blob Storage fronted by S3Proxy (both are covered below).
- Because LlamaCloud heavily relies on file storage, you will need to create the following buckets (a bucket-creation sketch follows the list):
  - llama-platform-parsed-documents
  - llama-platform-etl
  - llama-platform-external-components
  - llama-platform-file-parsing
  - llama-platform-raw-files
  - llama-cloud-parse-output
  - llama-platform-file-screenshots
  - llama-platform-extract-output (for LlamaExtract)
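If you are provisioning the buckets on AWS yourself, the following is a minimal sketch (not an official script) of pre-creating them with boto3. Note that S3 bucket names are globally unique, so if these defaults are already taken, pick your own names and override them as described in "Overriding Default Bucket Names" below.

```python
# Minimal sketch, assuming AWS credentials are already available in the environment.
import boto3

BUCKETS = [
    "llama-platform-parsed-documents",
    "llama-platform-etl",
    "llama-platform-external-components",
    "llama-platform-file-parsing",
    "llama-platform-raw-files",
    "llama-cloud-parse-output",
    "llama-platform-file-screenshots",
    "llama-platform-extract-output",  # only needed if you use LlamaExtract
]

region = "us-east-1"  # change to your region
s3 = boto3.client("s3", region_name=region)

for name in BUCKETS:
    if region == "us-east-1":
        # us-east-1 is the default location and must not pass a LocationConstraint
        s3.create_bucket(Bucket=name)
    else:
        s3.create_bucket(
            Bucket=name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    print(f"created {name}")
```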
Connecting to AWS S3
Below are two ways to configure a connection to AWS S3:
(Recommended) IAM Role for Service Accounts
We recommend that users create a new IAM Role and Policy for LlamaCloud. You can then attach the role ARN as a service account annotation.
```json
// Example IAM Policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:*"], // this is not secure
      "Resource": [
        "arn:aws:s3:::llama-platform-parsed-documents",
        "arn:aws:s3:::llama-platform-parsed-documents/*",
        ...
      ]
    }
  ]
}
```
After creating a policy similar to the one above, update the backend, jobsService, jobsWorker, and llamaParse service accounts with the EKS role-arn annotation.
```yaml
# Example for the backend service account. Repeat for each of the services listed above.
backend:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<role-name>
```
For more information, refer to the official AWS documentation on IAM roles for service accounts (IRSA).
AWS Credentials
Create an IAM user with a policy attached for the aforementioned S3 buckets. Afterwards, you can configure the platform to use that user's AWS credentials by setting the following values in your values.yaml file:
```yaml
global:
  cloudProvider: "aws"
  config:
    accessKeyId: "<your-access-key-id>"
    secretAccessKey: "<your-secret-access-key>"
```
Overriding Default Bucket Names
We allow users to override the default bucket names in the values.yaml file:
```yaml
global:
  config:
    parsedDocumentsCloudBucketName: "<your-bucket-name>"
    parsedEtlCloudBucketName: "<your-bucket-name>"
    parsedExternalComponentsCloudBucketName: "<your-bucket-name>"
    parsedFileParsingCloudBucketName: "<your-bucket-name>"
    parsedRawFileCloudBucketName: "<your-bucket-name>"
    parsedLlamaCloudParseOutputCloudBucketName: "<your-bucket-name>"
    parsedFileScreenshotCloudBucketName: "<your-bucket-name>"
    llamaExtractOutputCloudBucketName: "<your-bucket-name>"
```
Connecting to Azure Blob Storage or Other Providers with S3Proxy
LlamaCloud was first developed on AWS, which means that we started by natively supporting S3. However, to make a self-hosted solution possible, we need a way for the platform to interact with other storage providers.
We leverage the open-source project S3Proxy to translate S3 API requests into requests to other storage providers. A containerized deployment of S3Proxy is supported out of the box in our Helm charts.
S3Proxy is enabled by default and can be further configured in your values.yaml file. The following is an example of how to connect your LlamaCloud deployment to Azure Blob Storage. For more examples of connecting to different providers, please refer to the project's Examples page.
```yaml
s3proxy:
  enabled: true
  config:
    S3PROXY_ENDPOINT: "http://0.0.0.0:80"
    S3PROXY_AUTHORIZATION: "none"
    S3PROXY_IGNORE_UNKNOWN_HEADERS: "true"
    S3PROXY_CORS_ALLOW_ORIGINS: "*"
    JCLOUDS_PROVIDER: "azureblob"
    JCLOUDS_REGION: "eastus" # Change to your region
    JCLOUDS_AZUREBLOB_AUTH: "azureKey"
    JCLOUDS_IDENTITY: "fill-out" # Change to your storage account name
    JCLOUDS_CREDENTIAL: "fill-out" # Change to your storage account key
    JCLOUDS_ENDPOINT: "fill-out" # Change to your storage account endpoint
```
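Before pointing the rest of the platform at the proxy, it can help to confirm that S3Proxy is reachable and translating S3 calls to Azure Blob Storage. The snippet below is a minimal sketch rather than part of the product: the service URL http://llamacloud-s3proxy:80 is a placeholder for whatever your Helm release exposes, and with S3PROXY_AUTHORIZATION set to "none" the requests can be sent unsigned.

```python
# Minimal sketch for sanity-checking S3Proxy with a standard S3 client.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://llamacloud-s3proxy:80",  # placeholder in-cluster URL
    config=Config(signature_version=UNSIGNED),
)

# Azure containers appear as S3 buckets through the proxy.
print([b["Name"] for b in s3.list_buckets()["Buckets"]])

# Round-trip a small object through one of the required buckets.
s3.put_object(Bucket="llama-platform-raw-files", Key="healthcheck.txt", Body=b"ok")
print(s3.get_object(Bucket="llama-platform-raw-files", Key="healthcheck.txt")["Body"].read())
```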
If your cluster does not allow S3Proxy to bind to a privileged port or run as root, the following variant moves the service and container to the non-privileged port 8080 and applies a restricted security context:

```yaml
s3proxy:
  enabled: true
  #########################################################
  # Configure the service and container to use the
  # non-privileged port 8080:
  # S3PROXY_ENDPOINT: http://0.0.0.0:8080 (in the config section or secret)
  #########################################################
  service:
    port: 8080
    containerPort: 8080
  podSecurityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
  config:
    S3PROXY_ENDPOINT: "http://0.0.0.0:8080" # Must be set to the non-privileged port (8080)
    S3PROXY_AUTHORIZATION: "none"
    S3PROXY_IGNORE_UNKNOWN_HEADERS: "true"
    S3PROXY_CORS_ALLOW_ORIGINS: "*"
    JCLOUDS_PROVIDER: "azureblob"
    JCLOUDS_REGION: "eastus" # Change to your region
    JCLOUDS_AZUREBLOB_AUTH: "azureKey"
    JCLOUDS_IDENTITY: "fill-out" # Change to your storage account name
    JCLOUDS_CREDENTIAL: "fill-out" # Change to your storage account key
    JCLOUDS_ENDPOINT: "fill-out" # Change to your storage account endpoint
```
Instead of placing the storage credentials directly in values.yaml, you can reference an existing Kubernetes secret that contains the S3Proxy environment variables (the S3PROXY_* and JCLOUDS_* settings shown above):

```yaml
s3proxy:
  enabled: true
  envFromSecretName: "existing-s3proxy-secret"
```
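As a minimal sketch of what that secret could contain (the llamacloud namespace and the placeholder values are assumptions, and the keys simply mirror the s3proxy.config block above), you could create it with the official Kubernetes Python client:

```python
# Minimal sketch: create the secret referenced by envFromSecretName.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="existing-s3proxy-secret"),
    type="Opaque",
    string_data={
        "S3PROXY_ENDPOINT": "http://0.0.0.0:8080",
        "S3PROXY_AUTHORIZATION": "none",
        "S3PROXY_IGNORE_UNKNOWN_HEADERS": "true",
        "S3PROXY_CORS_ALLOW_ORIGINS": "*",
        "JCLOUDS_PROVIDER": "azureblob",
        "JCLOUDS_REGION": "eastus",
        "JCLOUDS_AZUREBLOB_AUTH": "azureKey",
        "JCLOUDS_IDENTITY": "<storage-account-name>",
        "JCLOUDS_CREDENTIAL": "<storage-account-key>",
        "JCLOUDS_ENDPOINT": "<storage-account-endpoint>",
    },
)
client.CoreV1Api().create_namespaced_secret(namespace="llamacloud", body=secret)
```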