LlamaExtract Python SDK
For a more programmatic approach, the Python SDK is the recommended way to experiment with different schemas and run extractions at scale. The GitHub repo for the Python SDK is here.
First, get an API key. We recommend putting your key in a file called `.env` that looks like this:

```
LLAMA_CLOUD_API_KEY=llx-xxxxxx
```
Set up a new Python environment using the tool of your choice (we used `poetry init`). Then install the dependencies you'll need:

```
pip install llama-cloud-services python-dotenv
```
Now that we have our libraries and our API key available, let's create an `extract.py` file and extract data from files. In this case, we're using some sample resumes from our example:
Quick Start
```python
from llama_cloud_services import LlamaExtract
from pydantic import BaseModel, Field

# bring in our LLAMA_CLOUD_API_KEY
from dotenv import load_dotenv
load_dotenv()

# Initialize client
extractor = LlamaExtract()

# Define schema using Pydantic
class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")

# Create extraction agent
agent = extractor.create_agent(name="resume-parser", data_schema=Resume)

# Extract data from document
result = agent.extract("resume.pdf")
print(result.data)
```
Now run it like any Python file. This will print the results of the extraction.

```
python extract.py
```
Defining Schemas
Schemas can be defined using either Pydantic models or JSON Schema. Refer to the Schemas page for more details.
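For example, the `Resume` model from the quick start could equivalently be expressed as a JSON Schema dict (the field names here mirror the Pydantic example above; passing a dict in place of the model class is assumed to follow the same `create_agent` call):

```python
# JSON Schema equivalent of the Resume Pydantic model
resume_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Full name of candidate"},
        "email": {"type": "string", "description": "Email address"},
        "skills": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Technical skills and technologies",
        },
    },
    "required": ["name", "email", "skills"],
}

# The dict can then be passed where the Pydantic class was used, e.g.:
# agent = extractor.create_agent(name="resume-parser", data_schema=resume_schema)
```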
Other Extraction APIs
Extraction over bytes or text
Section titled “Extraction over bytes or text”You can use the SourceText
class to extract from bytes or text directly without using a file. If passing the file bytes,
you will need to pass the filename to the SourceText
class.
```python
with open("resume.pdf", "rb") as f:
    file_bytes = f.read()

result = agent.extract(SourceText(file=file_bytes, filename="resume.pdf"))

result = agent.extract(SourceText(text_content="Candidate Name: Jane Doe"))
```
Batch Processing
Process multiple files asynchronously:

```python
# Queue multiple files for extraction
jobs = await agent.queue_extraction(["resume1.pdf", "resume2.pdf"])

# Check job status
for job in jobs:
    status = agent.get_extraction_job(job.id).status
    print(f"Job {job.id}: {status}")

# Get results when complete
results = [agent.get_extraction_run_for_job(job.id) for job in jobs]
```
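Because `queue_extraction` is awaited, running this from a plain script requires an event loop. A minimal sketch of the wrapper (the agent calls from the snippet above are shown as commented placeholders):

```python
import asyncio

async def main():
    # The awaited batch calls go here, e.g.:
    # jobs = await agent.queue_extraction(["resume1.pdf", "resume2.pdf"])
    ...

asyncio.run(main())
```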
Updating Schemas
Schemas can be modified and updated after creation:

```python
# Update schema
agent.data_schema = new_schema

# Save changes
agent.save()
```
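As a concrete (hypothetical) example, `new_schema` might be an extended version of the earlier `Resume` model, keeping the existing fields and adding a new one:

```python
from typing import Optional

from pydantic import BaseModel, Field

class ResumeV2(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")
    # hypothetical field added after the agent was created
    years_of_experience: Optional[int] = Field(
        default=None, description="Total years of professional experience"
    )

new_schema = ResumeV2
```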
Managing Agents
```python
# List all agents
agents = extractor.list_agents()

# Get specific agent
agent = extractor.get_agent(name="resume-parser")

# Delete agent
extractor.delete_agent(agent.id)
```