Document Complexity

Use is_complex to decide whether a document needs OCR or heavier parsing before you commit to it.

The is_complex check in LiteParse (lit is-complex on the command line) decides whether a document is “complex” by looking for four signals: images, broken/garbled text, large amounts of vector graphics, and sparse text.

Use it to:

Route documents to cheaper or more expensive parsing/OCR backends.
Reject or flag documents you can’t handle — e.g. when running with --no-ocr, find the pages that would come back empty.
Estimate cost up front by counting how many pages actually need OCR.

How it works

Complexity is computed per page. Each page gets a needs_ocr verdict and a list of reasons explaining why it was flagged. A document is “complex” if any of its pages needs OCR.
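The roll-up from page verdicts to a document verdict can be sketched in a few lines of Python. This is a minimal illustration, not LiteParse's own code: it assumes the pages have already been loaded as dictionaries with the page_number and needs_ocr keys shown in the CLI output, and summarize is a hypothetical helper name.

```python
from typing import Any


def summarize(pages: list[dict[str, Any]]) -> dict[str, Any]:
    """Roll per-page verdicts up into a document-level summary."""
    flagged = [p["page_number"] for p in pages if p["needs_ocr"]]
    return {
        "complex": bool(flagged),  # complex if any page needs OCR
        "ocr_pages": flagged,      # pages to send to OCR
        "total_pages": len(pages),
    }


# Hand-written sample data in the shape of the per-page output.
pages = [
    {"page_number": 1, "needs_ocr": False, "reasons": []},
    {"page_number": 2, "needs_ocr": True, "reasons": ["scanned"]},
    {"page_number": 3, "needs_ocr": True, "reasons": ["sparse-text"]},
]
summary = summarize(pages)
print(summary)
```

The length of ocr_pages is the figure to use for an up-front cost estimate.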

The reasons are derived from a few signals:

Reason	Meaning
`scanned`	A single raster covers ~the whole page with little or no text behind it — a scanned/photographed page.
`no-text`	Almost no extractable text and no full-page image — a blank page, cover, or divider.
`sparse-text`	Some real text, but it covers very little of the page — typically a figure with a thin caption.
`embedded-images`	Substantial embedded raster figures sit alongside the native text.
`garbled`	The native text decodes to garbage and text is likely unreadable.
`vector-text`	Text is painted as filled vector outlines outside the text layer, so no native text items represent it.

The set of reasons may grow over time as the router learns to recommend heavier pipelines. Treat reasons as an open-ended list and route on the values you care about rather than assuming it’s exhaustive.
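One way to route on reasons without assuming the list is closed is to map the values you know to a backend and send anything unrecognized down a conservative fallback. The sketch below is illustrative only: the backend names ("text-only", "ocr", "heavy") and the choice of which reason maps to which backend are assumptions, not LiteParse recommendations.

```python
# Hypothetical routing policy; replace the backend names with your own pipelines.
ROUTES = {
    "no-text": "text-only",      # assumption: blank pages and dividers have nothing to recover
    "scanned": "ocr",
    "garbled": "ocr",
    "vector-text": "ocr",
    "sparse-text": "ocr",
    "embedded-images": "heavy",  # assumption: figures go to the most capable pipeline
}
FALLBACK = "heavy"                      # used for reasons this table does not know yet
ORDER = ["text-only", "ocr", "heavy"]   # cheapest to most expensive


def route(reasons: list[str]) -> str:
    """Pick the most expensive backend demanded by any reason on the page."""
    backends = [ROUTES.get(reason, FALLBACK) for reason in reasons]
    return max(backends, key=ORDER.index, default="text-only")


print(route([]))                              # text-only
print(route(["scanned"]))                     # ocr
print(route(["scanned", "embedded-images"]))  # heavy
print(route(["some-future-reason"]))          # heavy
```

Because unknown values fall through to FALLBACK, a new reason introduced in a later release is handled cautiously instead of raising an error or being silently treated as simple.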

CLI

lit is-complex document.pdf

The command prints per-page JSON to stdout and a human-readable verdict to stderr (suppressed by --quiet), and sets its exit code to reflect the result, so you can consume it however fits your workflow.

// stdout
[
  {
    "page_number": 1,
    "text_length": 0,
    "text_coverage": 0.0,
    "has_substantial_images": false,
    "image_block_count": 0,
    "image_coverage": 0.0,
    "largest_image_coverage": 0.0,
    "full_page_image": true,
    "uncovered_vector_area": null,
    "is_garbled": false,
    "page_area": 482400.0,
    "needs_ocr": true,
    "reasons": ["scanned"]
  }
]

# stderr
COMPLEX — 1/1 page(s) need OCR

The exit code is non-zero when any page needs OCR, so the command works as a shell predicate:

# Only parse with --no-ocr when the document is simple
lit is-complex document.pdf --quiet && lit parse document.pdf --no-ocr

Pipe the JSON into jq to act on individual pages. The example below lists the page numbers that need OCR:

lit is-complex document.pdf --compact | jq '[.[] | select(.needs_ocr) | .page_number]'
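If jq is not available, the same stdout can be processed in Python. The sketch below groups page numbers by reason; it assumes you have already captured the command's stdout as a string, and pages_by_reason is a hypothetical helper name. The sample string is hand-written in the shape of the output above, trimmed to the fields the function reads.

```python
import json
from collections import defaultdict


def pages_by_reason(stdout_text: str) -> dict[str, list[int]]:
    """Group page numbers by each reason they were flagged for."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for page in json.loads(stdout_text):
        for reason in page["reasons"]:
            grouped[reason].append(page["page_number"])
    return dict(grouped)


sample = (
    '[{"page_number":1,"needs_ocr":true,"reasons":["scanned"]},'
    '{"page_number":2,"needs_ocr":false,"reasons":[]},'
    '{"page_number":3,"needs_ocr":true,"reasons":["scanned","garbled"]}]'
)
print(pages_by_reason(sample))  # {'scanned': [1, 3], 'garbled': [3]}
```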

Options

Flag	Description
`--compact`	Emit dense, whitespace-free JSON instead of pretty-printed.
`--max-pages <N>`	Maximum number of pages to check (default: 1000).
`--target-pages <SPEC>`	Check only specific pages, e.g. `"1-5,10,15-20"`.
`--password <PASS>`	Password for encrypted/protected documents.
`-q`, `--quiet`	Suppress the stderr logging output.
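To re-check only a subset of pages, you need a page spec in the format that --target-pages expects. The helper below is a sketch that collapses a list of page numbers into that comma-and-range form. The format is inferred from the single example in the table, and to_page_spec is a hypothetical name, not part of LiteParse.

```python
def to_page_spec(page_numbers: list[int]) -> str:
    """Collapse page numbers into a spec such as "1-5,10,15-20"."""
    ranges: list[list[int]] = []
    for n in sorted(set(page_numbers)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n      # extend the current run
        else:
            ranges.append([n, n])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(to_page_spec([1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]))  # 1-5,10,15-20
```

Duplicates and unsorted input are tolerated, and an empty list yields an empty string, which you would normally treat as "nothing to check" and skip the call.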

Library

The same check is available programmatically. It returns one entry per page.

import { LiteParse } from "@llamaindex/liteparse";

const parser = new LiteParse({ ocrEnabled: false });
const pages = await parser.isComplex("document.pdf");

const complex = pages.some((p) => p.needsOcr);
if (complex) {
  // Route to a heavyweight pipeline, like LlamaParse
  ...
} else {
  // Cheap path — skip OCR entirely
  const result = await parser.parse("document.pdf");
}

// Inspect why specific pages were flagged
for (const page of pages.filter((p) => p.needsOcr)) {
  console.log(`Page ${page.pageNumber}: ${page.reasons.join(", ")}`);
}

from liteparse import LiteParse

parser = LiteParse(ocr_enabled=False)
pages = parser.is_complex("document.pdf")

if any(p.needs_ocr for p in pages):
    # Route to a heavyweight pipeline, like LlamaParse
    ...
else:
    # Cheap path — skip OCR entirely
    result = parser.parse("document.pdf")

# Inspect why specific pages were flagged
for page in pages:
    if page.needs_ocr:
        print(f"Page {page.page_number}: {', '.join(page.reasons)}")

import { LiteParse } from "@llamaindex/liteparse-wasm";

const parser = new LiteParse();
const pages = await parser.isComplex(pdfBytes); // Uint8Array

const complex = pages.some((p) => p.needsOcr);

Both the library and CLI accept raw bytes as well as file paths, so you can run the check on documents you’ve already loaded into memory.

Per-page fields

Every entry includes the raw signals behind the verdict, so you can apply your own thresholds instead of relying solely on needs_ocr:

Field	Description
`page_number`	1-indexed page number.
`needs_ocr`	Verdict: the page needs more than the cheap text-only path. Equivalent to `reasons` being non-empty.
`reasons`	Every reason the page was flagged (see the table above). Empty exactly when `needs_ocr` is false.
`text_length`	Length of usable native text (garbled/unmappable text excluded).
`text_coverage`	Fraction of the page area covered by native text (0–1).
`has_substantial_images`	Whether any counted inline raster figures are present.
`image_block_count`	Number of counted raster image objects (full-page backgrounds excluded).
`image_coverage`	Summed image-bbox area over page area, clamped to 1.
`largest_image_coverage`	Largest single counted image’s area over page area, clamped to 1.
`full_page_image`	A single raster covers ≥90% of the page — the signal that tells a scan apart from a blank page.
`uncovered_vector_area`	Filled vector-outline area not covered by native text (pt²). `null`/`undefined` when a cheaper signal already decided the page.
`is_garbled`	Whether the native text decodes to garbage.
`page_area`	Page area in pt².
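As an illustration of applying your own thresholds, the sketch below derives a custom verdict from the raw signals instead of reading needs_ocr. The threshold values and the needs_heavy_path name are assumptions chosen for the example, not values LiteParse uses internally; tune them against your own documents. Note the explicit check for a missing uncovered_vector_area, which is null when a cheaper signal already decided the page.

```python
from typing import Any


def needs_heavy_path(
    page: dict[str, Any],
    min_text_coverage: float = 0.05,         # assumed threshold
    max_image_coverage: float = 0.5,         # assumed threshold
    max_vector_area_fraction: float = 0.1,   # assumed threshold
) -> bool:
    """Custom verdict built from the raw per-page signals."""
    if page["is_garbled"] or page["full_page_image"]:
        return True
    if page["image_coverage"] > max_image_coverage:
        return True
    vector_area = page.get("uncovered_vector_area")  # may be None
    if vector_area is not None and page["page_area"] > 0:
        if vector_area / page["page_area"] > max_vector_area_fraction:
            return True
    return page["text_coverage"] < min_text_coverage


scanned_page = {
    "text_coverage": 0.0, "image_coverage": 0.0, "full_page_image": True,
    "uncovered_vector_area": None, "is_garbled": False, "page_area": 482400.0,
}
text_page = {
    "text_coverage": 0.4, "image_coverage": 0.1, "full_page_image": False,
    "uncovered_vector_area": None, "is_garbled": False, "page_area": 482400.0,
}
print(needs_heavy_path(scanned_page), needs_heavy_path(text_page))  # True False
```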

Next steps

OCR configuration: Set up the OCR backend you route complex documents to.
Library usage: Full programmatic API for TypeScript and Python.
CLI reference: Complete command and option reference.
