
SDK Usage

For a more programmatic approach, the SDK is the recommended way to experiment with different schemas and run extractions at scale.

You can visit the GitHub repo for the Python SDK or the TypeScript SDK.

First, get an API key. You can export it as an environment variable for easy access, or pass it directly to clients later.

export LLAMA_CLOUD_API_KEY=llx-xxxxxx
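Once exported, the key can be read back from the environment in Python instead of being hard-coded into scripts (a minimal sketch; depending on the SDK version, the client may also pick the variable up automatically):

```python
import os

# Read the key exported above; empty string if the variable is not set
api_key = os.environ.get("LLAMA_CLOUD_API_KEY", "")
```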

Then, install dependencies:

pip install "llama-cloud>=1.0"

Now that we have the library installed and our API key available, let's create a script file and extract data from files. In this case, we're using some sample resumes from our example:

from pydantic import BaseModel, Field

from llama_cloud import LlamaCloud

# Define the extraction schema using Pydantic
class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")

client = LlamaCloud(api_key="your_api_key")

# Create an extraction agent bound to the schema
agent = client.extraction.extraction_agents.create(
    name="resume-parser",
    data_schema=Resume,
    config={},
)

# Upload a file to extract from
file_obj = client.files.create(file="resume.pdf", purpose="extract")
file_id = file_obj.id

# Extract structured data from the document
result = client.extraction.jobs.extract(
    extraction_agent_id=agent.id,
    file_id=file_id,
)
print(result.data)

Run your script to see the extracted result!

python your_script.py

Schemas can be defined using either Pydantic/Zod models or JSON Schema. Refer to the Schemas page for more details.
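For example, the Pydantic `Resume` model can be converted into an equivalent JSON Schema dict with Pydantic v2's `model_json_schema()`, which is handy if you prefer to work with plain JSON Schema (check the Schemas page for which forms the SDK accepts directly):

```python
from pydantic import BaseModel, Field

class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")

# Convert the model into a plain JSON Schema dict
resume_schema = Resume.model_json_schema()
print(resume_schema["properties"]["name"])
# e.g. {'description': 'Full name of candidate', 'title': 'Name', 'type': 'string'}
```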

You can also call extraction directly over raw text.

import io

from llama_cloud import LlamaCloud

client = LlamaCloud(api_key="your_api_key")

# Wrap the raw text in an in-memory file object
source_text = "Candidate Name: Jane Doe\nEmail: jane.doe@example.com"
source_buffer = io.BytesIO(source_text.encode("utf-8"))

file_obj = client.files.create(file=source_buffer, purpose="extract", external_file_id="resume.txt")
file_id = file_obj.id

# Reuse the extraction agent created earlier
result = client.extraction.jobs.extract(
    extraction_agent_id=agent.id,
    file_id=file_id,
)

You can also process multiple files asynchronously. The example below submits several files for extraction, using a semaphore to cap the number of concurrent requests:

import asyncio

from llama_cloud import AsyncLlamaCloud

client = AsyncLlamaCloud(api_key="your_api_key")
semaphore = asyncio.Semaphore(5)  # Limit to 5 in-flight requests

async def process_path(file_path: str):
    async with semaphore:
        file_obj = await client.files.create(file=file_path, purpose="extract")
        result = await client.extraction.jobs.extract(
            extraction_agent_id=agent.id,
            file_id=file_obj.id,
        )
        return result

async def main():
    file_paths = ["resume1.pdf", "resume2.pdf", "resume3.pdf"]
    return await asyncio.gather(*(process_path(path) for path in file_paths))

results = asyncio.run(main())

Extraction agents can also be updated, listed, retrieved, and deleted after creation:

# Update an agent's schema
client.extraction.extraction_agents.update(
    extraction_agent_id=agent.id,
    data_schema=new_schema,
    config={},
)

# List all agents
agents = client.extraction.extraction_agents.list()

# Get a specific agent
agent = client.extraction.extraction_agents.get(extraction_agent_id="agent_id")

# Delete an agent
client.extraction.extraction_agents.delete(extraction_agent_id="agent_id")