
Document Complexity

Use is-complex to decide whether a document needs OCR or heavier parsing before you commit to it.

The is-complex command in LiteParse checks whether a document is “complex”: whether it contains images, broken or garbled text, large amounts of vector graphics, or sparse text.

Use it to:

  • Route documents to cheaper or more expensive parsing/OCR backends.
  • Reject or flag documents you can’t handle — e.g. when running with --no-ocr, find the pages that would come back empty.
  • Estimate cost up front by counting how many pages actually need OCR.

Complexity is computed per page. Each page gets a needs_ocr verdict and a list of reasons explaining why it was flagged. A document is “complex” if any of its pages needs OCR.
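
The document-level rule above amounts to a fold over the per-page verdicts. The sketch below assumes the per-page entries have already been parsed from the command's JSON output; `PageVerdict`, `DocumentSummary`, and `summarize` are illustrative names, not part of LiteParse.

```typescript
// Minimal shape of one per-page entry; only the fields used here.
interface PageVerdict {
  page_number: number;
  needs_ocr: boolean;
  reasons: string[];
}

interface DocumentSummary {
  totalPages: number;
  ocrPages: number[]; // 1-indexed page numbers flagged for OCR
  isComplex: boolean; // true when at least one page needs OCR
}

// Hypothetical helper: applies the "any page needs OCR" rule.
function summarize(pages: PageVerdict[]): DocumentSummary {
  const ocrPages = pages
    .filter((p) => p.needs_ocr)
    .map((p) => p.page_number);
  return {
    totalPages: pages.length,
    ocrPages,
    isComplex: ocrPages.length > 0,
  };
}

const summary = summarize([
  { page_number: 1, needs_ocr: false, reasons: [] },
  { page_number: 2, needs_ocr: true, reasons: ["scanned"] },
]);
console.log(`${summary.ocrPages.length}/${summary.totalPages} page(s) need OCR`);
```

The length of `ocrPages` is the number to use when estimating OCR cost up front.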

The reasons are derived from a few signals:

| Reason | Meaning |
| --- | --- |
| scanned | A single raster covers roughly the whole page with little or no text behind it: a scanned or photographed page. |
| no-text | Almost no extractable text and no full-page image: a blank page, cover, or divider. |
| sparse-text | Some real text, but it covers very little of the page; typically a figure with a thin caption. |
| embedded-images | Substantial embedded raster figures sit alongside the native text. |
| garbled | The native text decodes to garbage, so the extracted text is likely unreadable. |
| vector-text | Text is painted as filled vector outlines outside the text layer, so no native text items represent it. |

The set of reasons may grow over time as the router learns to recommend heavier pipelines. Treat reasons as an open-ended list and route on the values you care about rather than assuming it’s exhaustive.
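
One way to treat reasons as open-ended is to route through a lookup with an explicit fallback for values the code has never seen. The sketch below is only an illustration: the `Route` names, the reason-to-route mapping, and the `routePage` helper are assumptions, not LiteParse's own policy.

```typescript
type Route = "text-only" | "ocr" | "heavy";

// Reasons this sketch knows how to route. The mapping is an assumption
// made for illustration; adjust it to your own backends.
const ROUTE_BY_REASON = new Map<string, Route>([
  ["no-text", "text-only"], // assumed: blank pages have nothing to recover
  ["scanned", "ocr"],
  ["sparse-text", "ocr"],
  ["embedded-images", "ocr"],
  ["garbled", "heavy"],
  ["vector-text", "heavy"],
]);

// Relative cost of each route, used to pick the most expensive one needed.
const RANK: Record<Route, number> = { "text-only": 0, ocr: 1, heavy: 2 };

// Hypothetical helper: returns the most expensive route any reason asks for.
// Unknown reasons map to `fallback` instead of throwing, so a newly
// introduced reason never breaks the router.
function routePage(reasons: string[], fallback: Route = "heavy"): Route {
  let route: Route = "text-only";
  for (const reason of reasons) {
    const candidate = ROUTE_BY_REASON.get(reason) ?? fallback;
    if (RANK[candidate] > RANK[route]) {
      route = candidate;
    }
  }
  return route;
}

console.log(routePage(["scanned", "garbled"]));
console.log(routePage(["some-future-reason"], "ocr"));
```

Defaulting unknown reasons to the heaviest route is the conservative choice; pass a cheaper `fallback` if over-routing costs more than an occasional miss.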

lit is-complex document.pdf

The command always prints per-page JSON to stdout, a human-readable verdict to stderr, and sets its exit code to reflect the result — so you can consume it however fits your workflow.

// stdout
[
  {
    "page_number": 1,
    "text_length": 0,
    "text_coverage": 0.0,
    "has_substantial_images": false,
    "image_block_count": 0,
    "image_coverage": 0.0,
    "largest_image_coverage": 0.0,
    "full_page_image": true,
    "uncovered_vector_area": null,
    "is_garbled": false,
    "page_area": 482400.0,
    "needs_ocr": true,
    "reasons": ["scanned"]
  }
]

// stderr
COMPLEX — 1/1 page(s) need OCR

The exit code is non-zero when any page needs OCR, so the command works as a shell predicate:

# Only parse with --no-ocr when the document is simple
lit is-complex document.pdf --quiet && lit parse document.pdf --no-ocr

Pipe the JSON into jq to act on individual pages. The example below will list the page numbers that need OCR:

lit is-complex document.pdf --compact | jq '[.[] | select(.needs_ocr) | .page_number]'

| Flag | Description |
| --- | --- |
| --compact | Emit dense, whitespace-free JSON instead of pretty-printed. |
| --max-pages <N> | Maximum number of pages to check (default: 1000). |
| --target-pages <SPEC> | Check only specific pages, e.g. "1-5,10,15-20". |
| --password <PASS> | Password for encrypted/protected documents. |
| -q, --quiet | Suppress the stderr logging output. |
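
The --target-pages spec reads as a comma-separated list of single pages and inclusive ranges. If you want to re-check only a subset of pages, for example those flagged in an earlier run, a list of page numbers can be collapsed into that form. The sketch below is a hypothetical helper, `toPageSpec`; the output format is inferred from the "1-5,10,15-20" example above rather than from a formal grammar.

```typescript
// Hypothetical helper: collapses 1-indexed page numbers into a spec such as
// "1-5,10,15-20". The format is inferred from the example in the flag table.
function toPageSpec(pages: number[]): string {
  // Deduplicate and sort ascending so consecutive pages end up adjacent.
  const sorted = Array.from(new Set(pages)).sort((a, b) => a - b);
  const parts: string[] = [];
  let start: number | null = null;
  let prev: number | null = null;

  for (const page of sorted) {
    if (start === null || prev === null) {
      // First page opens the first run.
      start = page;
      prev = page;
    } else if (page === prev + 1) {
      // Consecutive page: extend the current run.
      prev = page;
    } else {
      // Gap: close the current run and open a new one.
      parts.push(start === prev ? `${start}` : `${start}-${prev}`);
      start = page;
      prev = page;
    }
  }
  // Close the final run, if any pages were given.
  if (start !== null && prev !== null) {
    parts.push(start === prev ? `${start}` : `${start}-${prev}`);
  }
  return parts.join(",");
}

console.log(toPageSpec([1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]));
```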

The same check is available programmatically. It returns one entry per page.

import { LiteParse } from "@llamaindex/liteparse";

const parser = new LiteParse({ ocrEnabled: false });
const pages = await parser.isComplex("document.pdf");

// The document is complex if any page needs OCR
const complex = pages.some((p) => p.needsOcr);
if (complex) {
  // Route to a heavyweight pipeline, like LlamaParse
  // ...
} else {
  // Cheap path: skip OCR entirely
  const result = await parser.parse("document.pdf");
}

// Inspect why specific pages were flagged
for (const page of pages.filter((p) => p.needsOcr)) {
  console.log(`Page ${page.pageNumber}: ${page.reasons.join(", ")}`);
}

Both the library and CLI accept raw bytes as well as file paths, so you can run the check on documents you’ve already loaded into memory.

Every entry includes the raw signals behind the verdict, so you can apply your own thresholds instead of relying solely on needs_ocr:

| Field | Description |
| --- | --- |
| page_number | 1-indexed page number. |
| needs_ocr | Verdict: the page needs more than the cheap text-only path. Equivalent to reasons being non-empty. |
| reasons | Every reason the page was flagged (see the table above). Empty exactly when needs_ocr is false. |
| text_length | Length of usable native text (garbled/unmappable text excluded). |
| text_coverage | Fraction of the page area covered by native text (0–1). |
| has_substantial_images | Whether any counted inline raster figures are present. |
| image_block_count | Number of counted raster image objects (full-page backgrounds excluded). |
| image_coverage | Summed image-bbox area over page area, clamped to 1. |
| largest_image_coverage | Largest single counted image’s area over page area, clamped to 1. |
| full_page_image | A single raster covers ≥90% of the page. This is the signal that tells a scan apart from a blank page. |
| uncovered_vector_area | Filled vector-outline area not covered by native text (pt²). null/undefined when a cheaper signal already decided the page. |
| is_garbled | Whether the native text decodes to garbage. |
| page_area | Page area in pt². |
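
As an illustration of applying your own thresholds to these signals, the sketch below implements a more lenient policy that keeps text-heavy pages with modest figures on the cheap path. The `Thresholds` type, its default values, and `needsOcrCustom` are hypothetical; the numbers are placeholders, not thresholds LiteParse uses.

```typescript
// Subset of the per-page fields described in the table above.
interface PageSignals {
  page_number: number;
  needs_ocr: boolean;
  text_length: number;
  image_coverage: number;
  full_page_image: boolean;
  is_garbled: boolean;
}

// Hypothetical policy knobs; tune them against your own documents.
interface Thresholds {
  minTextLength: number; // native text needed to trust the text layer
  maxImageCoverage: number; // fraction of page area images may occupy (0 to 1)
}

const DEFAULT_THRESHOLDS: Thresholds = {
  minTextLength: 200,
  maxImageCoverage: 0.5,
};

function needsOcrCustom(
  page: PageSignals,
  t: Thresholds = DEFAULT_THRESHOLDS,
): boolean {
  // Scans and garbled text always go to OCR.
  if (page.full_page_image || page.is_garbled) {
    return true;
  }
  // Enough native text and modest image area: stay on the cheap path,
  // even if the built-in verdict flagged the page. Note that this also
  // overrides pages flagged for other reasons, such as vector-text.
  if (
    page.text_length >= t.minTextLength &&
    page.image_coverage <= t.maxImageCoverage
  ) {
    return false;
  }
  // Otherwise defer to the built-in verdict.
  return page.needs_ocr;
}

const figurePage: PageSignals = {
  page_number: 3,
  needs_ocr: true,
  text_length: 1500,
  image_coverage: 0.2,
  full_page_image: false,
  is_garbled: false,
};
console.log(needsOcrCustom(figurePage));
```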