Document Complexity
Use `is_complex` to decide whether a document needs OCR or heavier parsing before you commit to it.
The `is_complex` command in LiteParse checks whether a document is “complex”. This means looking for images, broken/garbled text, large amounts of vector graphics, and sparse text.
Use it to:
- Route documents to cheaper or more expensive parsing/OCR backends.
- Reject or flag documents you can’t handle, e.g. when running with `--no-ocr`, find the pages that would come back empty.
- Estimate cost up front by counting how many pages actually need OCR.
How it works
Complexity is computed per page. Each page gets a `needs_ocr` verdict and a list of `reasons` explaining why it was flagged. A document is “complex” if any of its pages needs OCR.
The reasons are derived from a few signals:
| Reason | Meaning |
|---|---|
| `scanned` | A single raster covers roughly the whole page with little or no text behind it: a scanned or photographed page. |
| `no-text` | Almost no extractable text and no full-page image: a blank page, cover, or divider. |
| `sparse-text` | Some real text, but it covers very little of the page, typically a figure with a thin caption. |
| `embedded-images` | Substantial embedded raster figures sit alongside the native text. |
| `garbled` | The native text decodes to garbage and is likely unreadable. |
| `vector-text` | Text is painted as filled vector outlines outside the text layer, so no native text items represent it. |
The set of reasons may grow over time as the router learns to recommend heavier pipelines. Treat `reasons` as an open-ended list and route on the values you care about rather than assuming it’s exhaustive.
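As a rough illustration of routing on an open-ended `reasons` list, the sketch below maps the reasons it recognizes to a backend and falls back to the most conservative choice for anything it has not seen before. The backend names (`text-only`, `ocr`, `heavy`), the mapping, and the `route_page` helper are all hypothetical and not part of LiteParse; only the reason strings and the `needs_ocr` and `reasons` fields come from the description above.

```python
from typing import Mapping

# Hypothetical routing policy: reason string -> backend name.
# The reason strings come from the table above; the backend names are made up.
BACKEND_FOR_REASON = {
    "scanned": "ocr",
    "garbled": "ocr",
    "vector-text": "ocr",
    "embedded-images": "heavy",
    "sparse-text": "heavy",
    "no-text": "text-only",  # policy choice: assume blank pages have nothing to recover
}

# Ranking used to pick the most expensive backend when several reasons apply.
BACKEND_RANK = {"text-only": 0, "ocr": 1, "heavy": 2}


def route_page(page: Mapping) -> str:
    """Pick a backend name for one per-page entry (a dict parsed from the JSON)."""
    if not page["needs_ocr"]:
        return "text-only"
    # Unknown reasons fall back to "heavy" so new values are never silently ignored.
    backends = [BACKEND_FOR_REASON.get(reason, "heavy") for reason in page["reasons"]]
    return max(backends, key=BACKEND_RANK.__getitem__, default="heavy")
```

Because unknown reasons map to the most expensive backend, a reason added in a later release costs more to process but does not slip through unhandled.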
```bash
lit is-complex document.pdf
```

The command always prints per-page JSON to stdout, a human-readable verdict to stderr, and sets its exit code to reflect the result, so you can consume it however fits your workflow.

```text
// stdout
[
  {
    "page_number": 1,
    "text_length": 0,
    "text_coverage": 0.0,
    "has_substantial_images": false,
    "image_block_count": 0,
    "image_coverage": 0.0,
    "largest_image_coverage": 0.0,
    "full_page_image": true,
    "uncovered_vector_area": null,
    "is_garbled": false,
    "page_area": 482400.0,
    "needs_ocr": true,
    "reasons": ["scanned"]
  }
]
# stderr
COMPLEX — 1/1 page(s) need OCR
```

The exit code is non-zero when any page needs OCR, so the command works as a shell predicate:

```bash
# Only parse with --no-ocr when the document is simple
lit is-complex document.pdf --quiet && lit parse document.pdf --no-ocr
```

Pipe the JSON into jq to act on individual pages. The example below lists the page numbers that need OCR:

```bash
lit is-complex document.pdf --compact | jq '[.[] | select(.needs_ocr) | .page_number]'
```

Options
| Flag | Description |
|---|---|
| `--compact` | Emit dense, whitespace-free JSON instead of pretty-printed. |
| `--max-pages <N>` | Maximum number of pages to check (default: 1000). |
| `--target-pages <SPEC>` | Check only specific pages, e.g. `"1-5,10,15-20"`. |
| `--password <PASS>` | Password for encrypted/protected documents. |
| `-q, --quiet` | Suppress the stderr logging output. |
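To connect the JSON output with `--target-pages`, the sketch below compresses a list of flagged page numbers into a spec in the `"1-5,10,15-20"` style shown in the table. Both helpers are hypothetical and not part of LiteParse, and the sketch assumes the flag accepts any comma-separated mix of single pages and inclusive ranges in that style.

```python
import json
from typing import Iterable, List


def flagged_pages(stdout_json: str) -> List[int]:
    """Return the page numbers whose entry has needs_ocr set, given the command's stdout."""
    return [page["page_number"] for page in json.loads(stdout_json) if page["needs_ocr"]]


def to_page_spec(page_numbers: Iterable[int]) -> str:
    """Compress page numbers into a spec such as "1-5,10,15-20"."""
    pages = sorted(set(page_numbers))
    if not pages:
        return ""

    def fmt(start: int, end: int) -> str:
        # A run of one page is written as "10", a longer run as "15-20".
        return str(start) if start == end else f"{start}-{end}"

    parts = []
    start = prev = pages[0]
    for number in pages[1:]:
        if number == prev + 1:
            prev = number  # still inside the current consecutive run
            continue
        parts.append(fmt(start, prev))
        start = prev = number
    parts.append(fmt(start, prev))
    return ",".join(parts)
```

The resulting string could then be handed to any command that accepts a `--target-pages` spec, for example to re-check only the pages that were flagged.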
Library
The same check is available programmatically. It returns one entry per page.

```typescript
import { LiteParse } from "@llamaindex/liteparse";

const parser = new LiteParse({ ocrEnabled: false });
const pages = await parser.isComplex("document.pdf");

const complex = pages.some((p) => p.needsOcr);
if (complex) {
  // Route to a heavyweight pipeline, like LlamaParse
  // ...
} else {
  // Cheap path: skip OCR entirely
  const result = await parser.parse("document.pdf");
}

// Inspect why specific pages were flagged
for (const page of pages.filter((p) => p.needsOcr)) {
  console.log(`Page ${page.pageNumber}: ${page.reasons.join(", ")}`);
}
```

```python
from liteparse import LiteParse

parser = LiteParse(ocr_enabled=False)
pages = parser.is_complex("document.pdf")

if any(p.needs_ocr for p in pages):
    # Route to a heavyweight pipeline, like LlamaParse
    ...
else:
    # Cheap path: skip OCR entirely
    result = parser.parse("document.pdf")

# Inspect why specific pages were flagged
for page in pages:
    if page.needs_ocr:
        print(f"Page {page.page_number}: {', '.join(page.reasons)}")
```

```typescript
import { LiteParse } from "@llamaindex/liteparse-wasm";

const parser = new LiteParse();
const pages = await parser.isComplex(pdfBytes); // Uint8Array

const complex = pages.some((p) => p.needsOcr);
```

Both the library and CLI accept raw bytes as well as file paths, so you can run the check on documents you’ve already loaded into memory.
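For the cost-estimation use case, one possible approach is to tally the per-page entries before parsing anything. The sketch below relies only on the `needs_ocr` and `reasons` attributes used in the Python example above. `summarize`, `ComplexitySummary`, and the per-page price are hypothetical, and the stand-in objects at the bottom merely imitate the shape of the library's entries.

```python
from collections import Counter
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Dict, Iterable


@dataclass
class ComplexitySummary:
    total_pages: int
    ocr_pages: int
    reason_counts: Dict[str, int]
    estimated_cost: float


def summarize(pages: Iterable[Any], price_per_ocr_page: float) -> ComplexitySummary:
    """Count flagged pages and tally their reasons from is_complex-style entries."""
    total = 0
    ocr = 0
    reasons: Counter = Counter()
    for page in pages:
        total += 1
        if page.needs_ocr:
            ocr += 1
            reasons.update(page.reasons)
    return ComplexitySummary(total, ocr, dict(reasons), ocr * price_per_ocr_page)


# Stand-in objects shaped like the per-page entries (not real LiteParse output).
pages = [
    SimpleNamespace(needs_ocr=True, reasons=["scanned"]),
    SimpleNamespace(needs_ocr=False, reasons=[]),
    SimpleNamespace(needs_ocr=True, reasons=["garbled", "embedded-images"]),
]
summary = summarize(pages, price_per_ocr_page=0.01)  # hypothetical price
print(f"{summary.ocr_pages}/{summary.total_pages} pages need OCR")
```

Keeping the reason tally alongside the count makes it easier to see whether the estimated cost comes mostly from scans or from pages that a cheaper fix might handle.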
Per-page fields
Every entry includes the raw signals behind the verdict, so you can apply your own thresholds instead of relying solely on `needs_ocr`:

| Field | Description |
|---|---|
| `page_number` | 1-indexed page number. |
| `needs_ocr` | Verdict: the page needs more than the cheap text-only path. Equivalent to `reasons` being non-empty. |
| `reasons` | Every reason the page was flagged (see the table above). Empty exactly when `needs_ocr` is false. |
| `text_length` | Length of usable native text (garbled/unmappable text excluded). |
| `text_coverage` | Fraction of the page area covered by native text (0–1). |
| `has_substantial_images` | Whether any counted inline raster figures are present. |
| `image_block_count` | Number of counted raster image objects (full-page backgrounds excluded). |
| `image_coverage` | Summed image-bbox area over page area, clamped to 1. |
| `largest_image_coverage` | Largest single counted image’s area over page area, clamped to 1. |
| `full_page_image` | A single raster covers ≥90% of the page. This is the signal that tells a scan apart from a blank page. |
| `uncovered_vector_area` | Filled vector-outline area not covered by native text (pt²). `null`/`undefined` when a cheaper signal already decided the page. |
| `is_garbled` | Whether the native text decodes to garbage. |
| `page_area` | Page area in pt². |
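As one example of applying your own thresholds, the sketch below derives a stricter verdict from the raw signals: a page goes to OCR only when it is a scan, its text is garbled, or images dominate the page while little native text is present. The function name and both threshold values are illustrative assumptions, not LiteParse defaults; only the field names come from the table above.

```python
from typing import Mapping


def wants_ocr(
    page: Mapping,
    min_image_coverage: float = 0.5,  # assumed threshold, tune for your corpus
    max_text_coverage: float = 0.1,   # assumed threshold, tune for your corpus
) -> bool:
    """Custom verdict for one per-page entry (a dict parsed from the JSON output)."""
    # Scans and garbled text cannot be recovered from the native text layer.
    if page["full_page_image"] or page["is_garbled"]:
        return True
    # Otherwise require both heavy image coverage and very little native text.
    image_heavy = page["image_coverage"] >= min_image_coverage
    text_light = page["text_coverage"] <= max_text_coverage
    return image_heavy and text_light
```

Under this policy a page flagged only as `embedded-images` with plenty of native text stays on the cheap path, which may or may not suit your documents.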
Next steps
- OCR configuration: Set up the OCR backend you route complex documents to.
- Library usage: Full programmatic API for TypeScript and Python.
- CLI reference: Complete command and option reference.