# Getting Started with LlamaSheets
LlamaSheets is a beta API for extracting tables from messy spreadsheets. Normalizing inputs is a critical step in document understanding, and LlamaSheets automates it. Given a spreadsheet, the API will:
- Intelligently identify the tables in each spreadsheet
- Isolate and extract each table
- Output each table as a Parquet file, a portable format supported by many languages that retains type information (for example, you can load these directly as DataFrames with pandas in Python)
- Generate additional metadata about the tables (extracted location) and spreadsheets (titles, descriptions) to assist in downstream flows
## Basic Usage
The Python SDK provides an end-to-end method that wraps the multiple underlying API calls needed to complete an extraction.
```python
from llama_cloud_services.beta.sheets import LlamaSheets, SpreadsheetParsingConfig

client = LlamaSheets(api_key="...")

results = await client.aextract_tables(
    "path_to_file.xlsx",  # Supports paths, bytes, and streams
    config=SpreadsheetParsingConfig(
        sheet_names=None,  # Parse all sheets (default)
        generate_additional_metadata=True,  # Generate titles/descriptions per sheet
    ),
)

# Download parquet files
file_bytes = await client.adownload_table_result(
    results.job_id,
    results.tables[0].id,
    result_type="table",  # can be `table` or `cell_metadata`
)

# Download parquet files and convert to pandas dataframes
df = await client.adownload_table_as_dataframe(
    results.job_id,
    results.tables[0].id,
    result_type="table",
)
```
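The `a`-prefixed SDK methods are coroutines, so outside of an already-running event loop (such as a Jupyter notebook) you need to drive them yourself. A minimal self-contained sketch using standard `asyncio`, wrapping the same calls as above:

```python
import asyncio

from llama_cloud_services.beta.sheets import LlamaSheets, SpreadsheetParsingConfig


async def main() -> None:
    client = LlamaSheets(api_key="...")
    results = await client.aextract_tables(
        "path_to_file.xlsx",
        config=SpreadsheetParsingConfig(generate_additional_metadata=True),
    )
    # Download each extracted table as a DataFrame
    for table in results.tables:
        df = await client.adownload_table_as_dataframe(
            results.job_id, table.id, result_type="table"
        )
        print(table.id, df.shape)


if __name__ == "__main__":
    asyncio.run(main())
```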
## Lower-Level Usage

Using the LlamaSheets API for table extraction generally consists of four main steps.
Below, we detail each step using both the Python SDK and raw HTTP calls.
### 1. Upload a File
First, upload a file and get back a file ID:
```python
from llama_cloud_services.beta.sheets import LlamaSheets

# Initialize the client
client = LlamaSheets(api_key="your_api_key")

# Upload a file
file_response = await client.aupload_file("path/to/your/spreadsheet.xlsx")
print(f"File ID: {file_response.id}")
```

With cURL:

```bash
curl -X POST "https://api.cloud.llamaindex.ai/api/v1/files" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "upload_file=@path/to/your/spreadsheet.xlsx"
```

Response:

```json
{
  "id": "file-id-here",
  ...
}
```
### 2. Create a job

Using the file ID, you can create a table extraction job and get back a job ID:
```python
from llama_cloud_services.beta.sheets.types import SpreadsheetParsingConfig

# Create a job with optional configuration
config = SpreadsheetParsingConfig(
    sheet_names=None,  # Parse all sheets (default)
    generate_additional_metadata=True,  # Generate extra metadata
)

job = await client.acreate_job(file_id=file_response.id, config=config)
print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
```

You can also pass the config as a dict:

```python
job = await client.acreate_job(
    file_id=file_response.id,
    config={"generate_additional_metadata": True},
)
```

With cURL:

```bash
curl -X POST "https://api.cloud.llamaindex.ai/api/v1/beta/spreadsheet/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file-id-here",
    "config": {
      "sheet_names": null,
      "generate_additional_metadata": true
    }
  }'
```

Response:

```json
{
  "id": "job-id-here",
  "file_id": "file-id-here",
  "status": "PENDING",
  "project_id": "project-id",
  "created_at": "2024-01-01T00:00:00Z",
  ...
}
```
### 3. Wait for completion

Now that you have a job ID, you can wait for the job to finish:
```python
# Wait for the job to complete (polls automatically)
job_result = await client.await_for_completion(job_id=job.id)
print(f"Job Status: {job_result.status}")

# Access extracted table metadata
if job_result.tables:
    print(f"Found {len(job_result.tables)} table(s)")
    for table in job_result.tables:
        print(f"  - Table ID: {table.table_id}")
        print(f"    Sheet: {table.sheet_name}")
        print(f"    Location: {table.table_location}")

for worksheet_metadata in job_result.worksheet_metadata:
    print(f"Worksheet Title: {worksheet_metadata.title}")
    print(f"Worksheet Description: {worksheet_metadata.description}")
```

Alternatively, you can poll manually:

```python
import asyncio

while True:
    job_result = await client.aget_job(job_id=job.id, include_results_metadata=True)

    if job_result.status in ["SUCCESS", "PARTIAL_SUCCESS", "ERROR", "FAILURE"]:
        break

    print(f"Status: {job_result.status}")
    await asyncio.sleep(5)
```

With cURL:

```bash
# Poll for job status
curl -X GET "https://api.cloud.llamaindex.ai/api/v1/beta/spreadsheet/jobs/job-id-here?include_results=true" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Response when complete:

```json
{
  "id": "job-id-here",
  "status": "SUCCESS",
  "tables": [
    {
      "table_id": "table-id-1",
      "sheet_name": "Sheet1",
      "table_location": "A1:D10",
      "table_title": "Sales Data",
      "table_description": "Monthly sales figures"
    }
  ],
  ...
}
```

Keep polling until the status is one of `SUCCESS`, `PARTIAL_SUCCESS`, `ERROR`, or `FAILURE`.
### 4. Download the result
With a completed job, you can download the generated Parquet files and read any additional metadata about the job result:
```python
from llama_cloud_services.beta.sheets.types import SpreadsheetResultType

# Download a table directly as a pandas DataFrame
table_id = job_result.tables[0].table_id
df = await client.adownload_table_as_dataframe(
    job_id=job.id,
    table_id=table_id,
    result_type=SpreadsheetResultType.TABLE,
)

print(f"Table shape: {df.shape}")
print(df.head())

# Optionally, download cell metadata
metadata_df = await client.adownload_table_as_dataframe(
    job_id=job.id,
    table_id=table_id,
    result_type=SpreadsheetResultType.CELL_METADATA,
)
print(f"Metadata shape: {metadata_df.shape}")
```

You can also download the raw Parquet bytes:

```python
# Download as raw bytes
parquet_bytes = await client.adownload_table_result(
    job_id=job.id,
    table_id=table_id,
    result_type=SpreadsheetResultType.TABLE,
)

# Save to file
with open("table.parquet", "wb") as f:
    f.write(parquet_bytes)
```

With cURL, downloading is a two-step process:

```bash
# Step 1: Get a presigned URL for the table
curl -X GET "https://api.cloud.llamaindex.ai/api/v1/beta/spreadsheet/jobs/job-id-here/tables/table-id-here/result/table" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Response:

```json
{
  "url": "https://s3.amazonaws.com/...",
  "expires_at": "2024-01-01T01:00:00Z"
}
```

```bash
# Step 2: Download the parquet file using the presigned URL
curl -X GET "https://s3.amazonaws.com/..." -o table.parquet

# Load with pandas
python -c "import pandas as pd; df = pd.read_parquet('table.parquet'); print(df.head())"
```

To download cell metadata, use `result/cell_metadata` instead of `result/table`:

```bash
curl -X GET "https://api.cloud.llamaindex.ai/api/v1/beta/spreadsheet/jobs/job-id-here/tables/table-id-here/result/cell_metadata" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
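The same two-step flow can be done from Python. A sketch using `requests` (assumed installed; the endpoint and response shape are taken from the cURL example above):

```python
from io import BytesIO

import pandas as pd
import requests

API_KEY = "YOUR_API_KEY"
BASE = "https://api.cloud.llamaindex.ai/api/v1/beta/spreadsheet"

# Step 1: get the presigned URL for the table result
resp = requests.get(
    f"{BASE}/jobs/job-id-here/tables/table-id-here/result/table",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
presigned_url = resp.json()["url"]

# Step 2: fetch the Parquet bytes (the presigned URL needs no auth header)
parquet = requests.get(presigned_url)
parquet.raise_for_status()
df = pd.read_parquet(BytesIO(parquet.content))
print(df.head())
```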
## Understanding the Output Format

When a LlamaSheets job completes successfully, you receive rich structured data about the extracted tables. This section explains the different components of the output.
### Job Result Structure
The job result object contains:

```json
{
  "id": "job-id",
  "status": "SUCCESS",
  "file_id": "original-file-id",
  "config": { /* your parsing config */ },
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-01T00:05:00Z",
  "tables": [
    {
      "table_id": "uuid-here",
      "sheet_name": "Sheet1",
      "table_location": "A2:E11",
      "metadata_json": null
    }
  ],
  "worksheet_metadata": [
    {
      "sheet_name": "Sheet1",
      "title": "Sales Data Q1 2024",
      "description": "Quarterly sales figures with revenue, units sold, and regional breakdowns"
    }
  ],
  "errors": []
}
```

Key fields:

- `tables`: Array of extracted tables with their IDs and locations
- `worksheet_metadata`: Generated titles and descriptions for each sheet (when `generate_additional_metadata` is `true`)
- `status`: One of `SUCCESS`, `PARTIAL_SUCCESS`, `ERROR`, or `FAILURE`
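Because a job can end in `PARTIAL_SUCCESS`, it is worth checking the status and the errors array before consuming results. A minimal sketch, assuming the SDK's job object exposes the same `status` and `errors` fields as the JSON above:

```python
job_result = await client.aget_job(job_id=job.id, include_results_metadata=True)

if job_result.status in ("ERROR", "FAILURE"):
    # Assumption: the Python object mirrors the JSON `errors` field
    raise RuntimeError(f"Extraction failed: {job_result.errors}")

if job_result.status == "PARTIAL_SUCCESS":
    # Some tables were extracted; inspect what failed before proceeding
    print(f"Partial result; errors: {job_result.errors}")

for table in job_result.tables:
    print(table.table_id, table.sheet_name, table.table_location)
```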
### Table Data (Parquet Files)
Each extracted table is saved as a Parquet file containing the normalized table data. Parquet is a columnar storage format that:
- Preserves data types (dates, numbers, strings, booleans)
- Is highly efficient and compressed
- Can be read by pandas, polars, DuckDB, and many other tools
Example table structure:
```python
import pandas as pd

df = pd.read_parquet("table.parquet")
print(df.head())

# Output:
#    col_0  col_1      col_2      col_3  col_4
# 0     44 -124.6  Value_0_2 2020-01-01  False
# 1    153  -34.4  Value_1_2 2020-01-02   True
# 2    184   34.4  Value_2_2 2020-01-03  False
```
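Because the output is standard Parquet, you are not tied to pandas. For example, here is a sketch querying the same file in place with DuckDB's Python API (assuming the `duckdb` package is installed, and using the generic `col_N` names from the example above):

```python
import duckdb

# Run SQL directly against the Parquet file; .df() converts the result to pandas
summary = duckdb.sql(
    "SELECT col_4, COUNT(*) AS n FROM 'table.parquet' GROUP BY col_4"
).df()
print(summary)
```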
### Cell Metadata (Parquet Files)

In addition to the table data, you can download rich cell-level metadata that provides detailed information about each cell in the extracted table. This is particularly useful for:
- Understanding cell formatting and styling
- Analyzing table structure and layout
- Detecting data types and patterns
- Preserving formatting for downstream processing
Available metadata fields:
Position & Layout:
row_number,column_number: Cell coordinatescoordinate: Excel-style cell reference (e.g., “A1”)relative_row_position,relative_column_position: Normalized position (0.0 to 1.0)is_in_first_row,is_in_last_row,is_in_first_column,is_in_last_column: Boolean flagsdistance_from_origin,distance_from_center: Geometric distances
Formatting:
font_bold,font_italic: Font style flagsfont_size: Font size in pointsfont_color_rgb,background_color_rgb: Color valueshas_border,border_style_score: Border informationhorizontal_alignment,vertical_alignment: Alignment valuestext_wrap: Text wrapping setting
Cell Properties:
is_merged_cell: Whether the cell is part of a merged rangehorizontal_size,vertical_size: Cell dimensionsalignment_indent: Indentation level
Data Type Detection:
data_type: Detected type (Number, Text, Date, etc.)is_date_like: Boolean flag for date detectionis_percentage,is_currency: Boolean flags for special number formatsnumber_format_category: Excel number format categorytext_length: Length of text contenthas_special_chars: Whether text contains special characters
Content:
cell_value: Processed cell valueraw_cell_value: Original raw value
Clustering & Grouping:
group,sub_group: Cell grouping identifiersl0_category,f_group: Hierarchical categorization
Example metadata usage:
```python
import pandas as pd

# Load cell metadata
metadata_df = pd.read_parquet("metadata.parquet")

# Find all header cells (first row)
headers = metadata_df[metadata_df["is_in_first_row"] == True]

# Find all bolded cells (likely headers or emphasis)
bold_cells = metadata_df[metadata_df["font_bold"] == True]

# Find date columns
date_cells = metadata_df[metadata_df["is_date_like"] == True]
date_columns = date_cells["column_number"].unique()

# Analyze formatting patterns
print(f"Font sizes used: {metadata_df['font_size'].unique()}")
print(f"Data types present: {metadata_df['data_type'].unique()}")
```
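The two files can also be combined. For instance, if the first row of the extracted region is a bold header row, its cell values can replace the generic column names in the table file. A sketch, under the unverified assumptions that `column_number` sorts left-to-right in table-column order and that the cell counts line up:

```python
import pandas as pd

df = pd.read_parquet("table.parquet")
metadata_df = pd.read_parquet("metadata.parquet")

# Cells in the first row of the extracted region
first_row = metadata_df[metadata_df["is_in_first_row"] == True]

# If the first row is mostly bold, treat its values as column headers
if first_row["font_bold"].mean() > 0.5:
    names = first_row.sort_values("column_number")["cell_value"].astype(str).tolist()
    if len(names) == df.shape[1]:
        df.columns = names

print(df.columns.tolist())
```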
### Downloading Results

You can download two types of Parquet files for each extracted table:

- **Table data** (`result_type="table"`): The actual table content
- **Cell metadata** (`result_type="cell_metadata"`): Rich formatting and position metadata
Both are stored as Parquet files and can be easily loaded into pandas DataFrames for analysis.
## Learn More
Beyond just extracting the tables, there are many downstream use cases that can build on these outputs.
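For example, a common downstream pattern is handing an extracted table to an LLM as text context. A minimal sketch using pandas (`DataFrame.to_markdown` requires the `tabulate` package; the prompt wording is illustrative):

```python
import pandas as pd

df = pd.read_parquet("table.parquet")

# Render the extracted table as markdown for inclusion in an LLM prompt
table_text = df.to_markdown(index=False)
prompt = f"Summarize the key trends in this table:\n\n{table_text}"
print(prompt)
```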