Granular Bounding Boxes: Word, Line, and Cell Grounding

This example shows how to get per-word, per-line, and per-table-cell bounding boxes alongside the regular item-level layout boxes that Parse returns, and how to fetch and walk the JSONL sidecar that carries them.

Use this when you need to:

  • Highlight individual words or lines on a PDF viewer for citation back-references.
  • Ground extracted answers down to the exact word or line rather than the whole paragraph.
  • Build a side-by-side preview that hover-syncs from markdown text → highlighted region on the source document.

Granular bounding boxes are not delivered inline on the parse-result response — they live in a separate JSONL sidecar that the result links to via a presigned URL. This is a deliberate split: the sidecar can be many MB on a long document, and most callers don’t need it. The flow is two steps: parse with granular_bboxes set, then download the sidecar URL.

pip install "llama-cloud>=1.0" httpx

import os
from getpass import getpass

from llama_cloud import AsyncLlamaCloud

# Prompt for the key so it never lands in the notebook source.
os.environ["LLAMA_CLOUD_API_KEY"] = getpass("Llama Cloud API Key: ")

client = AsyncLlamaCloud()

Set output_options.granular_bboxes to any subset of "word", "line", "cell". You can request just one level or all three. Parse will produce the JSONL sidecar automatically — there is no corresponding expand value to add.

# 1) Upload the file
file_obj = await client.files.create(
    file="executive-summary-2024.pdf",
    purpose="parse",
)
# 2) Parse with word + line + cell grounding
result = await client.parsing.parse(
    file_id=file_obj.id,
    tier="agentic",
    version="latest",
    output_options={
        "granular_bboxes": ["word", "line", "cell"],
    },
    # `items` is optional; we ask for it so we can compare the inline items tree
    # to the sidecar later. The sidecar URL itself is auto-included on the result.
    expand=["items"],
)

When granular_bboxes is set, the result auto-includes a grounded_items entry under result_content_metadata. Each entry carries size_bytes, an exists flag, and a presigned_url.

sidecar = (result.result_content_metadata or {}).get("grounded_items")
if sidecar is None:
    raise RuntimeError("Sidecar missing: was `granular_bboxes` set on the parse request?")
print(f"Sidecar: {sidecar.size_bytes} bytes")
print(f"URL: {sidecar.presigned_url}")

Presigned URLs are temporary. Download promptly, or call client.parsing.get(job_id=...) again to mint a fresh URL.

The sidecar is JSONL (one JSON object per line, one line per page), not a single JSON array. Parse it one line at a time rather than with a single json.loads call on the whole body.

import json

import httpx

async with httpx.AsyncClient() as http:
    response = await http.get(sidecar.presigned_url)
    response.raise_for_status()

# Each non-empty line is one page row.
pages = [json.loads(line) for line in response.text.splitlines() if line.strip()]
print(f"Pages in sidecar: {len(pages)}")
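
Because the sidecar can run to many MB, it may be preferable to parse rows as they arrive instead of buffering the whole body. The sketch below shows one way to do that, assuming the lines come from an async iterator of strings; `iter_page_rows` is a hypothetical helper name, not part of the SDK, and the demo rows are made up.

```python
import asyncio
import json
from typing import Any, AsyncIterator


async def iter_page_rows(lines: AsyncIterator[str]) -> AsyncIterator[dict[str, Any]]:
    """Yield one parsed page row per non-empty JSONL line."""
    async for line in lines:
        if line.strip():
            yield json.loads(line)


async def _demo() -> list[dict[str, Any]]:
    # Stand-in for a streamed HTTP body; the rows are illustrative only.
    async def fake_lines() -> AsyncIterator[str]:
        yield '{"page_number": 1, "success": true, "items": []}'
        yield ""
        yield '{"page_number": 2, "success": false, "error": "..."}'

    return [row async for row in iter_page_rows(fake_lines())]


demo_rows = asyncio.run(_demo())
```

With httpx, the same helper could be fed from `async with http.stream("GET", sidecar.presigned_url) as response:` by passing `response.aiter_lines()`, so only one page row is held in memory at a time. In a notebook, replace `asyncio.run(_demo())` with `await _demo()`.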

Each page row is one of two shapes:

# Success
{
    "page_number": 1,
    "page_width": 612,
    "page_height": 792,
    "success": True,
    "items": [...],
}
# Failure: grounding could not be produced for this page
{
    "page_number": 2,
    "success": False,
    "error": "...",
}

Always check success before drilling in:

for page in pages:
    if not page["success"]:
        print(f"Page {page['page_number']} failed: {page['error']}")
        continue
    print(f"Page {page['page_number']}: {len(page['items'])} items")

Each item has the same type / md / bbox shape as the regular items response, plus an optional grounding block. For text-shaped items (paragraphs, headings, captions), grounding is a GroundedTextSupport:

{
    "source": "md",  # or "caption"; names the surface the spans index into
    "lines": [
        {
            "span": [0, 11],  # [start, end) byte range into item["md"]
            "bbox": {"x": 72.0, "y": 100.0, "w": 200.0, "h": 12.0},
            "words": [
                {
                    "span": [0, 5],
                    "bbox": {"x": 72.0, "y": 100.0, "w": 35.0, "h": 12.0},
                },
                # ...
            ],
        },
        # ...
    ],
}

To highlight each word on page 1:

page = next(p for p in pages if p["success"] and p["page_number"] == 1)
for item in page["items"]:
    grounding = item.get("grounding")
    # Spans index into the surface named by `source`; this loop slices item["md"],
    # so it only handles md-sourced grounding.
    if not grounding or grounding.get("source") != "md":
        continue
    md_bytes = item["md"].encode("utf-8")  # spans are byte offsets, not str indices
    for line in grounding["lines"]:
        for word in line.get("words") or []:
            start, end = word["span"]
            text = md_bytes[start:end].decode("utf-8")
            box = word["bbox"]
            print(f"  word {text!r} at ({box['x']:.0f}, {box['y']:.0f}) "
                  f"{box['w']:.0f}×{box['h']:.0f}")

The span is a [start, end) UTF-8 byte range into item["md"]. A Python str is indexed by code point, so encode the text to UTF-8 before slicing; item["md"][start:end] gives the same result only when the text is pure ASCII.
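
For the hover-sync use case the lookup runs the other way: given an offset into the item's text, find the word box that covers it. A minimal sketch, assuming the offset is expressed in the same units as the spans; `find_word_at` is a hypothetical helper rather than an SDK function, and the sample values are illustrative.

```python
from typing import Any, Optional


def find_word_at(grounding: dict[str, Any], offset: int) -> Optional[dict[str, Any]]:
    """Return the word entry whose [start, end) span contains offset, else None."""
    for line in grounding.get("lines") or []:
        for word in line.get("words") or []:
            start, end = word["span"]
            if start <= offset < end:
                return word
    return None


# Sample grounding shaped like the GroundedTextSupport example; values are made up.
sample = {
    "source": "md",
    "lines": [
        {
            "span": [0, 11],
            "bbox": {"x": 72.0, "y": 100.0, "w": 200.0, "h": 12.0},
            "words": [
                {"span": [0, 5], "bbox": {"x": 72.0, "y": 100.0, "w": 35.0, "h": 12.0}},
                {"span": [6, 11], "bbox": {"x": 110.0, "y": 100.0, "w": 38.0, "h": 12.0}},
            ],
        }
    ],
}

hit = find_word_at(sample, 7)   # falls inside the second word
miss = find_word_at(sample, 5)  # offset 5 is the gap between the two words
```

A linear scan is enough for a single paragraph; for very long items, sorting word spans once and using `bisect` would cut the lookup cost.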

For table items, grounding is a GroundedTableSupport instead — it carries per-cell boxes and spans, plus row- and column-level boxes:

{
    "rows": [
        # rows[row][col] is a cell or null
        [
            {
                "span": [42, 56],  # optional, into the table cell text
                "lines": [...],    # optional per-line grounding inside the cell
                "bbox": [          # one or more boxes covering the cell
                    {"x": 100.0, "y": 200.0, "w": 50.0, "h": 16.0},
                ],
            },
            None,  # missing/empty cell
            # ...
        ],
        # ...
    ],
    "row_bboxes": [[{...}], ...],     # boxes per row (a row may span multiple)
    "column_bboxes": [[{...}], ...],  # boxes per column
}

To find the bbox of the cell at row 0, column 1 on page 1:

for item in page["items"]:
    if item["type"] != "table":
        continue
    grounding = item.get("grounding")
    if not grounding or not grounding.get("rows"):
        continue
    cell = grounding["rows"][0][1]
    if cell is None:
        print("cell (0, 1) is empty")
        continue
    for box in cell.get("bbox") or []:
        print(f"cell (0, 1) box: ({box['x']:.0f}, {box['y']:.0f}) "
              f"{box['w']:.0f}×{box['h']:.0f}")
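
Indexing rows[0][1] directly raises IndexError on a table with fewer rows or columns than expected. A defensive sketch that walks every cell and yields its boxes instead; `iter_cell_boxes` is a hypothetical helper, and the sample table is made up to match the GroundedTableSupport shape shown earlier.

```python
from typing import Any, Iterator


def iter_cell_boxes(grounding: dict[str, Any]) -> Iterator[tuple[int, int, dict[str, float]]]:
    """Yield (row_index, col_index, box) for every box of every non-empty cell."""
    for r, row in enumerate(grounding.get("rows") or []):
        for c, cell in enumerate(row or []):
            if cell is None:
                continue  # missing/empty cell
            for box in cell.get("bbox") or []:
                yield r, c, box


sample_table = {
    "rows": [
        [
            {"bbox": [{"x": 100.0, "y": 200.0, "w": 50.0, "h": 16.0}]},
            None,
        ],
        [
            None,
            # A cell may be covered by more than one box.
            {"bbox": [
                {"x": 150.0, "y": 216.0, "w": 50.0, "h": 16.0},
                {"x": 150.0, "y": 232.0, "w": 50.0, "h": 16.0},
            ]},
        ],
    ],
}

cells = list(iter_cell_boxes(sample_table))
```

Filtering the result, for example `[b for r, c, b in cells if (r, c) == (0, 1)]`, returns an empty list for a missing cell rather than raising.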

The sidecar carries page_width and page_height for every page — these match the coordinate space of the bboxes. If you also request images_to_save: ["screenshot"], you can overlay the bboxes onto each page screenshot directly. The screenshot’s pixel dimensions may differ from page_width / page_height (PDF points vs. image pixels), so scale the box coordinates accordingly:

scale_x = screenshot_pixel_width / page["page_width"]
scale_y = screenshot_pixel_height / page["page_height"]
x_px = box["x"] * scale_x
y_px = box["y"] * scale_y
w_px = box["w"] * scale_x
h_px = box["h"] * scale_y
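
The same arithmetic can be wrapped in a small helper. A sketch, assuming the screenshot covers the full page and shares its origin with the bbox coordinates; `to_pixels` is a hypothetical name and the page and image sizes are illustrative.

```python
def to_pixels(box: dict, page: dict, image_width: int, image_height: int) -> dict:
    """Scale a page-space box to screenshot pixel coordinates."""
    scale_x = image_width / page["page_width"]
    scale_y = image_height / page["page_height"]
    return {
        "x": box["x"] * scale_x,
        "y": box["y"] * scale_y,
        "w": box["w"] * scale_x,
        "h": box["h"] * scale_y,
    }


# A 612x792 page rendered at 1224x1584 pixels, i.e. a 2x scale on both axes.
page_row = {"page_width": 612, "page_height": 792}
word_box = {"x": 72.0, "y": 100.0, "w": 35.0, "h": 12.0}
px = to_pixels(word_box, page_row, 1224, 1584)
```

Keeping separate x and y scale factors means the helper still works if the screenshot's aspect ratio differs slightly from the page's.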

If box["r"] is present and non-zero, the box should be rotated by r degrees around its center to recover the visual quad — x/y/w/h describe the axis-aligned bounding rect of the unrotated content.
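
A minimal sketch of recovering that quad, assuming (x, y) is the box's minimum corner and that r is applied with the standard rotation matrix (which appears clockwise on screen when y grows downward). The rotation direction is an assumption here, so check it against a page with known rotated text. `rotated_quad` is a hypothetical helper.

```python
import math


def rotated_quad(box: dict) -> list[tuple[float, float]]:
    """Return the four corners of box rotated by box['r'] degrees about its center."""
    cx = box["x"] + box["w"] / 2
    cy = box["y"] + box["h"] / 2
    theta = math.radians(box.get("r") or 0.0)  # missing or zero r means no rotation
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = [
        (box["x"], box["y"]),
        (box["x"] + box["w"], box["y"]),
        (box["x"] + box["w"], box["y"] + box["h"]),
        (box["x"], box["y"] + box["h"]),
    ]
    quad = []
    for px, py in corners:
        dx, dy = px - cx, py - cy
        quad.append((cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t))
    return quad


# A 4x2 box rotated by 90 degrees becomes a 2x4 quad around the same center.
quad = rotated_quad({"x": 0.0, "y": 0.0, "w": 4.0, "h": 2.0, "r": 90})
```

If the overlay is drawn on a scaled screenshot, scale the quad's corner coordinates after rotating, not before, so that unequal x and y scale factors do not skew the angle.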