API Reference

LiteParse

API reference for the @llamaindex/liteparse TypeScript library.

LiteParse — open-source PDF parsing with spatial text extraction, OCR, and bounding boxes.

Example

import { LiteParse } from "@llamaindex/liteparse";

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse("document.pdf");
console.log(result.text);

Classes

LiteParse

Defined in: core/parser.ts:58

Main document parser class. Handles PDF parsing, OCR, format conversion, and screenshot generation.

Examples

import { LiteParse } from "@llamaindex/liteparse";

// Basic usage: parse a PDF and print the extracted text.
const parser = new LiteParse();
const result = await parser.parse("document.pdf");
console.log(result.text);

// JSON output at a higher DPI. result.json is optional, so guard the access.
const jsonParser = new LiteParse({ outputFormat: "json", dpi: 300 });
const jsonResult = await jsonParser.parse("document.pdf");
for (const page of jsonResult.json?.pages ?? []) {
  console.log(`Page ${page.page}: ${page.boundingBoxes.length} bounding boxes`);
}

// OCR through an external HTTP OCR server, using an ISO 639-1 language code.
const ocrParser = new LiteParse({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
});
const ocrResult = await ocrParser.parse("scanned-document.pdf");

Constructors

Constructor

new LiteParse(userConfig?): LiteParse

Defined in: core/parser.ts:68

Create a new LiteParse instance.

Parameters

userConfig?

Partial<LiteParseConfig> = {}

Partial configuration to override defaults. See LiteParseConfig for all options.

Returns

LiteParse
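
As an illustration of how a partial configuration can override defaults, here is a minimal sketch. ConfigSketch, DEFAULTS_SKETCH, and mergeConfigSketch are hypothetical names, not library exports. The default values are copied from the Default Value entries in this reference, and the spread-based merge is an assumption about the technique, not a description of the constructor's internals.

```typescript
// Local stand-in for a few LiteParseConfig fields; not imported from the library.
interface ConfigSketch {
  ocrEnabled: boolean;
  dpi: number;
  outputFormat: "json" | "text";
}

// Defaults taken from the Default Value entries in this reference.
const DEFAULTS_SKETCH: ConfigSketch = { ocrEnabled: true, dpi: 150, outputFormat: "json" };

// Hypothetical helper: user overrides win, every other field keeps its default.
function mergeConfigSketch(userConfig: Partial<ConfigSketch> = {}): ConfigSketch {
  return { ...DEFAULTS_SKETCH, ...userConfig };
}

const merged = mergeConfigSketch({ dpi: 300 });
console.log(merged); // dpi is overridden; ocrEnabled and outputFormat keep their defaults
```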

Methods

getConfig()

getConfig(): LiteParseConfig

Defined in: core/parser.ts:492

Get a copy of the current configuration, including defaults merged with user overrides.

Returns

LiteParseConfig

A shallow copy of the active LiteParseConfig.
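
The practical consequence of a shallow copy can be sketched with plain objects. The types and variable names here are illustrative only, and the spread operator is used as a generic way to make a shallow copy; how getConfig() builds its copy is not stated in this reference.

```typescript
// Illustrative types; only the nesting matters for this example.
interface DebugSketch {
  enabled: boolean;
}
interface ShallowConfigSketch {
  dpi: number;
  debug: DebugSketch;
}

const activeConfig: ShallowConfigSketch = { dpi: 150, debug: { enabled: false } };
const configCopy: ShallowConfigSketch = { ...activeConfig }; // shallow copy

configCopy.dpi = 300; // top-level field: the original is unaffected
configCopy.debug.enabled = true; // nested object is shared with the original

console.log(activeConfig.dpi, activeConfig.debug.enabled); // 150 true
```

If the same holds for getConfig(), nested values such as debug should be treated as read-only.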

parse()

parse(input, quiet?): Promise<ParseResult>

Defined in: core/parser.ts:100

Parse a document and return the extracted text, page data, and optionally structured JSON.

Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before parsing if the required system tools are installed.

Parameters

input

LiteParseInput

A file path, Buffer, or Uint8Array containing document bytes. When given raw bytes, PDF data is parsed directly with zero disk I/O. Non-PDF bytes are written to a temp file for format conversion.

quiet?

boolean = false

If true, suppresses progress logging to stderr.

Returns

Promise<ParseResult>

Parsed document data including text, per-page info, and optional JSON.

Throws

Error if the file cannot be found, converted, or parsed.

screenshot()

screenshot(input, pageNumbers?, quiet?): Promise<ScreenshotResult[]>

Defined in: core/parser.ts:235

Generate screenshots of PDF pages as image buffers.

Uses PDFium for high-quality rendering. Each page is returned as a ScreenshotResult with the raw image buffer and dimensions.

Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before rendering if the required system tools are installed. Text-based formats (TXT, CSV, etc.) cannot be screenshotted and will throw an error.

Parameters

input

LiteParseInput

A file path, Buffer, or Uint8Array containing document bytes.

pageNumbers?

number[]

1-indexed page numbers to screenshot. If omitted, all pages are rendered.

quiet?

boolean = false

If true, suppresses progress logging to stderr.

Returns

Promise<ScreenshotResult[]>

Array of screenshot results, one per rendered page.

Throws

Error if the input is a text-based format that cannot be rendered.

Throws

Error if the file cannot be found, converted, or rendered.

Interfaces

BoundingBox

Defined in: core/types.ts:281

An axis-aligned bounding box defined by its top-left and bottom-right corners.

All coordinates are in PDF points.

Deprecated

Use TextItem coordinates (x, y, width, height) instead. Will be removed in v2.0.

Properties

x1

x1: number

Defined in: core/types.ts:283

X coordinate of the top-left corner.

x2

x2: number

Defined in: core/types.ts:287

X coordinate of the bottom-right corner.

y1

y1: number

Defined in: core/types.ts:285

Y coordinate of the top-left corner.

y2

y2: number

Defined in: core/types.ts:289

Y coordinate of the bottom-right corner.
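
For code migrating away from the deprecated corner-based box, the conversion to the (x, y, width, height) form is simple arithmetic. A minimal sketch follows; BoundingBoxLike, RectLike, and boxToRect are local illustrative names, not library exports.

```typescript
// Local mirrors of the documented shapes.
interface BoundingBoxLike {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}
interface RectLike {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Top-left corner stays as is; width and height are the corner differences.
function boxToRect(box: BoundingBoxLike): RectLike {
  return { x: box.x1, y: box.y1, width: box.x2 - box.x1, height: box.y2 - box.y1 };
}

console.log(boxToRect({ x1: 72, y1: 100, x2: 172, y2: 112 }));
// { x: 72, y: 100, width: 100, height: 12 }
```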

GridDebugConfig

Defined in: processing/gridDebugLogger.ts:13

Configuration for grid projection debug logging.

When enabled, logs detailed information about how text elements are snapped, anchored, and projected during grid layout. Use filters to narrow output to specific elements you’re investigating.

Properties

enabled

enabled: boolean

Defined in: processing/gridDebugLogger.ts:18

Enable debug logging for grid projection.

Default Value

false

lineFilter?

optional lineFilter: number[]

Defined in: processing/gridDebugLogger.ts:29

Only log elements on these line indices (0-based within the page).

outputPath?

optional outputPath: string

Defined in: processing/gridDebugLogger.ts:44

Write log output to a file path instead of stderr. If not set, logs to stderr.

pageFilter?

optional pageFilter: number

Defined in: processing/gridDebugLogger.ts:34

Only log elements on this page number (1-indexed).

regionFilter?

optional regionFilter: object

Defined in: processing/gridDebugLogger.ts:39

Only log elements within this bounding region (PDF coordinates).

x1

x1: number

x2

x2: number

y1

y1: number

y2

y2: number

textFilter?

optional textFilter: string[]

Defined in: processing/gridDebugLogger.ts:24

Only log elements whose text contains one of these substrings (case-insensitive). If empty, all elements are logged.
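
The matching rule described for textFilter can be sketched as follows. matchesTextFilter is a hypothetical helper written for this illustration; the logger's actual implementation may differ.

```typescript
// Returns true when the element should be logged under the given textFilter.
function matchesTextFilter(text: string, textFilter: string[] = []): boolean {
  // An empty filter lets every element through.
  if (textFilter.length === 0) return true;
  const haystack = text.toLowerCase();
  // Case-insensitive substring match against any of the filter entries.
  return textFilter.some((needle) => haystack.includes(needle.toLowerCase()));
}

console.log(matchesTextFilter("Total Revenue", ["revenue"])); // true
console.log(matchesTextFilter("Net income", ["Total", "Revenue"])); // false
console.log(matchesTextFilter("anything", [])); // true
```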

trace?

optional trace: boolean

Defined in: processing/gridDebugLogger.ts:69

Enable trace mode for detailed render decision logging. When enabled, each render logs the full decision chain: initial targetX, lineMax computation, forward anchor checks, and which factor was the binding constraint. Forward anchor mutations are also traced with their triggering item. Respects textFilter/lineFilter/pageFilter.

Default Value

false

visualize?

optional visualize: boolean

Defined in: processing/gridDebugLogger.ts:52

Generate PNG visualizations of the grid projection showing text boxes color-coded by snap type (left/right/center/floating/flowing) with anchor lines overlaid.

Default Value

false

visualizePath?

optional visualizePath: string

Defined in: processing/gridDebugLogger.ts:59

Directory to save visualization PNGs. Each page produces a file named page-{N}-grid.png.

Default Value

"./debug-output"

JsonTextItem

Defined in: core/types.ts:316

A text element from the JSON output with position, size, and font metadata.

Properties

confidence?

optional confidence: number

Defined in: core/types.ts:332

OCR confidence score. Not set if OCR was not used.

fontName?

optional fontName: string

Defined in: core/types.ts:328

Font name.

fontSize?

optional fontSize: number

Defined in: core/types.ts:330

Font size in PDF points.

height

height: number

Defined in: core/types.ts:326

Height of the text item in PDF points.

text

text: string

Defined in: core/types.ts:318

The text content of this item.

width

width: number

Defined in: core/types.ts:324

Width of the text item in PDF points.

x

x: number

Defined in: core/types.ts:320

X coordinate of the top-left corner, in PDF points.

y

y: number

Defined in: core/types.ts:322

Y coordinate of the top-left corner, in PDF points.

LiteParseConfig

Defined in: core/types.ts:36

Configuration options for the LiteParse parser.

All fields have sensible defaults. Pass a Partial<LiteParseConfig> to the constructor to override only the options you need.

Example

const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  dpi: 300,
  outputFormat: "json",
});

Properties

debug?

optional debug: GridDebugConfig

Defined in: core/types.ts:160

Debug configuration for grid projection. When enabled, logs detailed information about how text elements are snapped, anchored, and projected. Can also generate visual PNG overlays of the projection.

Example

const parser = new LiteParse({
  debug: {
    enabled: true,
    textFilter: ["Total", "Revenue"],
    pageFilter: 2,
    visualize: true,
    visualizePath: "./debug-output",
  }
});

dpi

dpi: number

Defined in: core/types.ts:100

DPI (dots per inch) for rendering pages to images. Higher values improve OCR accuracy but increase processing time and memory usage.

Default Value

150
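
To see what a DPI setting means for image size, recall that a PDF point is 1/72 inch, so pixels = points × dpi / 72. The helper below is a sketch of that arithmetic only. pointsToPixels is a hypothetical name, and the rounding LiteParse applies to rendered dimensions is not specified here.

```typescript
const POINTS_PER_INCH = 72; // one PDF point is 1/72 inch

// Convert a length in PDF points to pixels at the given DPI.
function pointsToPixels(points: number, dpi: number): number {
  return Math.round((points * dpi) / POINTS_PER_INCH);
}

// US Letter is 612 x 792 points.
console.log(pointsToPixels(612, 150), pointsToPixels(792, 150)); // 1275 1650
console.log(pointsToPixels(612, 300), pointsToPixels(792, 300)); // 2550 3300
```

Doubling the DPI doubles each dimension and quadruples the pixel count, which is where the extra time and memory go.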

maxPages

maxPages: number

Defined in: core/types.ts:85

Maximum number of pages to parse from the document.

Default Value

1000

numWorkers

numWorkers: number

Defined in: core/types.ts:78

Number of pages to OCR in parallel. Higher values use more memory but process faster on multi-core machines.

Default Value

CPU cores - 1 (minimum 1)
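
The default rule above, written out as code. defaultNumWorkers is a hypothetical helper; in Node.js the core count would typically come from os.cpus().length, which is passed in as a parameter here to keep the sketch self-contained.

```typescript
// One worker fewer than the number of cores, but never fewer than one.
function defaultNumWorkers(cpuCores: number): number {
  return Math.max(1, cpuCores - 1);
}

console.log(defaultNumWorkers(8)); // 7
console.log(defaultNumWorkers(1)); // 1
```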

ocrEnabled

ocrEnabled: boolean

Defined in: core/types.ts:51

Whether to run OCR on pages with little or no native text. When enabled, LiteParse selectively OCRs only images and text-sparse regions.

Default Value

true

ocrLanguage

ocrLanguage: string | string[]

Defined in: core/types.ts:43

OCR language code(s). Uses ISO 639-3 codes for Tesseract (e.g., "eng", "fra") or ISO 639-1 for HTTP OCR servers (e.g., "en", "fr").
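
Because the two OCR backends expect different code families, callers that switch between them may want to normalize codes themselves. The sketch below maps a few ISO 639-1 codes to their ISO 639-3 (Tesseract) equivalents. The table is a small hand-written subset, toTesseractCodes is a hypothetical helper, and this reference does not say whether LiteParse performs any such conversion internally.

```typescript
// Hand-written subset of ISO 639-1 to ISO 639-3 codes; extend as needed.
const ISO_639_1_TO_3: Record<string, string> = {
  en: "eng",
  fr: "fra",
  de: "deu",
  es: "spa",
};

// Accepts one code or several, mirroring the string | string[] type of ocrLanguage.
// Codes that are not in the table are passed through unchanged.
function toTesseractCodes(language: string | string[]): string[] {
  const codes = Array.isArray(language) ? language : [language];
  return codes.map((code) => ISO_639_1_TO_3[code] ?? code);
}

console.log(toTesseractCodes("en").join(",")); // eng
console.log(toTesseractCodes(["fr", "deu"]).join(",")); // fra,deu
```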

Default Value

"en"

ocrServerUrl?

optional ocrServerUrl: string

Defined in: core/types.ts:59

URL of an HTTP OCR server implementing the LiteParse OCR API. If not provided, the built-in Tesseract.js engine is used.

See

OCR API Specification

outputFormat

outputFormat: OutputFormat

Defined in: core/types.ts:107

Output format for parsed results.

Default Value

"json"

password?

optional password: string

Defined in: core/types.ts:140

Password for opening encrypted/protected documents. Used for password-protected PDFs and office documents.

Default Value

undefined

preciseBoundingBox

preciseBoundingBox: boolean

Defined in: core/types.ts:118

Calculate precise bounding boxes for each text line. Disable for faster parsing when bounding boxes aren’t needed.

Deprecated

Controls the deprecated boundingBoxes output. Will be removed in v2.0. Text item coordinates (x, y, width, height) are always present regardless.

Default Value

true

preserveLayoutAlignmentAcrossPages

preserveLayoutAlignmentAcrossPages: boolean

Defined in: core/types.ts:132

Maintain consistent text alignment across page boundaries.

Default Value

false

preserveVerySmallText

preserveVerySmallText: boolean

Defined in: core/types.ts:125

Preserve very small text that would normally be filtered out.

Default Value

false

targetPages?

optional targetPages: string

Defined in: core/types.ts:92

Specific pages to parse, as a comma-separated string of page numbers and ranges.

Example

"1-5,10,15-20"
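
One way such a page specification could be expanded into individual page numbers is sketched below. parseTargetPages is a hypothetical helper, not a library export, and the validation rules (1-indexed pages, ascending ranges, duplicates removed) are assumptions made for this illustration.

```typescript
// Expand a spec like "1-5,10,15-20" into a sorted list of unique page numbers.
function parseTargetPages(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue; // tolerate stray commas
    // A single number is treated as a range of length one.
    const [startText, endText = startText] = trimmed.split("-");
    const start = Number(startText);
    const end = Number(endText);
    if (!Number.isInteger(start) || !Number.isInteger(end) || start < 1 || end < start) {
      throw new Error(`Invalid page range: "${trimmed}"`);
    }
    for (let page = start; page <= end; page++) pages.add(page);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(parseTargetPages("1-5,10,15-20").join(","));
// 1,2,3,4,5,10,15,16,17,18,19,20
```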

tessdataPath?

optional tessdataPath: string

Defined in: core/types.ts:70

Path to a directory containing Tesseract .traineddata files. Used as both the language data source and cache directory for Tesseract.js.

If not set, falls back to the TESSDATA_PREFIX environment variable. If neither is set, Tesseract.js downloads data from cdn.jsdelivr.net.

Example

/opt/tessdata
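
The fallback order described above can be sketched as a small resolver. TessdataSource and resolveTessdataSource are hypothetical names. The environment is passed in as a plain object (process.env in Node.js) so the sketch stays self-contained, and treating an empty string as unset is an assumption.

```typescript
type TessdataSource =
  | { kind: "directory"; path: string }
  | { kind: "cdn"; host: string };

// Resolution order: explicit config, then TESSDATA_PREFIX, then the CDN download.
function resolveTessdataSource(
  tessdataPath: string | undefined,
  env: Record<string, string | undefined>,
): TessdataSource {
  const path = tessdataPath ?? env["TESSDATA_PREFIX"];
  // Assumption: an empty string counts as "not set".
  return path ? { kind: "directory", path } : { kind: "cdn", host: "cdn.jsdelivr.net" };
}

console.log(resolveTessdataSource("/opt/tessdata", {}));
console.log(resolveTessdataSource(undefined, { TESSDATA_PREFIX: "/usr/share/tessdata" }));
console.log(resolveTessdataSource(undefined, {}));
```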

MarkupData

Defined in: core/types.ts:207

Markup annotation data associated with a text item.

Properties

highlight?

optional highlight: string

Defined in: core/types.ts:209

Highlight color (e.g., "yellow", "#FFFF00"), or undefined if not highlighted.

squiggly?

optional squiggly: boolean

Defined in: core/types.ts:213

Whether the text has a squiggly underline.

strikeout?

optional strikeout: boolean

Defined in: core/types.ts:215

Whether the text is struck out.

underline?

optional underline: boolean

Defined in: core/types.ts:211

Whether the text is underlined.

ParsedPage

Defined in: core/types.ts:295

Parsed data for a single page of a document.

Properties

boundingBoxes?

optional boundingBoxes: BoundingBox[]

Defined in: core/types.ts:310

Deprecated

Use TextItem coordinates instead. Will be removed in v2.0. Present when LiteParseConfig.preciseBoundingBox is enabled.

height

height: number

Defined in: core/types.ts:301

Page height in PDF points.

pageNum

pageNum: number

Defined in: core/types.ts:297

1-indexed page number.

text

text: string

Defined in: core/types.ts:303

Full text content of the page with spatial layout preserved.

textItems

textItems: TextItem[]

Defined in: core/types.ts:305

Individual text elements extracted from the page.

width

width: number

Defined in: core/types.ts:299

Page width in PDF points.

ParseResult

Defined in: core/types.ts:376

The result of parsing a document with LiteParse.parse.

Properties

json?

optional json: ParseResultJson

Defined in: core/types.ts:382

Structured JSON data. Present when LiteParseConfig.outputFormat is "json".

pages

pages: ParsedPage[]

Defined in: core/types.ts:378

Per-page parsed data.

text

text: string

Defined in: core/types.ts:380

Full document text, concatenated from all pages.

ParseResultJson

Defined in: core/types.ts:353

Structured JSON representation of parsed document data. Returned when LiteParseConfig.outputFormat is "json".

Properties

ScreenshotResult

Defined in: core/types.ts:388

The result of generating a screenshot with LiteParse.screenshot.

Properties

height

height: number

Defined in: core/types.ts:394

Image height in pixels.

imageBuffer

imageBuffer: Buffer

Defined in: core/types.ts:396

Raw image data as a Node.js Buffer (PNG or JPG).
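
Since the buffer may hold either format, a caller that needs to pick a file extension or MIME type can inspect the leading bytes. The sketch below checks the standard PNG and JPEG signatures. detectImageFormat is a hypothetical helper, and it accepts any Uint8Array (a Node.js Buffer qualifies).

```typescript
// Identify PNG or JPEG data from its leading signature bytes.
function detectImageFormat(bytes: Uint8Array): "png" | "jpg" | "unknown" {
  // PNG files start with the 8-byte signature 89 50 4E 47 0D 0A 1A 0A.
  const pngSignature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
  if (bytes.length >= 8 && pngSignature.every((byte, i) => bytes[i] === byte)) {
    return "png";
  }
  // JPEG files start with FF D8 FF.
  if (bytes.length >= 3 && bytes[0] === 0xff && bytes[1] === 0xd8 && bytes[2] === 0xff) {
    return "jpg";
  }
  return "unknown";
}

const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
console.log(detectImageFormat(pngHeader)); // png
```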

imagePath?

optional imagePath: string

Defined in: core/types.ts:398

File path if the screenshot was saved to disk.

pageNum

pageNum: number

Defined in: core/types.ts:390

1-indexed page number.

width

width: number

Defined in: core/types.ts:392

Image width in pixels.

SearchItemsOptions

Defined in: core/types.ts:338

Options for searchItems.

Properties

caseSensitive?

optional caseSensitive: boolean

Defined in: core/types.ts:346

Whether the search should be case-sensitive.

Default Value

false

phrase

phrase: string

Defined in: core/types.ts:340

Find text items containing this phrase. Matches can span multiple adjacent items.

TextItem

Defined in: core/types.ts:169

An individual text element extracted from a page, with position, size, and font metadata.

Coordinates are in PDF points, measured from an origin at the top-left of the page: x increases to the right and y increases downward. This differs from the native PDF convention, which places the origin at the bottom-left.

Properties

confidence?

optional confidence: number

Defined in: core/types.ts:201

Confidence score from 0.0 to 1.0. Native PDF text defaults to 1.0, OCR text reflects engine confidence.
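
A typical use of the confidence score is flagging OCR text that may need review. The sketch below works on a local TextItemLike shape rather than the library type. lowConfidenceItems is a hypothetical helper, and treating a missing confidence as 1.0 is an assumption based on the native-text default described above.

```typescript
// Local mirror of the two TextItem fields this example needs.
interface TextItemLike {
  str: string;
  confidence?: number;
}

// Return the items whose confidence falls below the threshold.
function lowConfidenceItems(items: TextItemLike[], threshold: number): TextItemLike[] {
  // Assumption: a missing confidence is treated like native text (1.0).
  return items.filter((item) => (item.confidence ?? 1.0) < threshold);
}

const sampleItems: TextItemLike[] = [
  { str: "Invoice", confidence: 1.0 },
  { str: "T0tal", confidence: 0.42 },
  { str: "Date" },
];
console.log(lowConfidenceItems(sampleItems, 0.6).map((item) => item.str)); // [ 'T0tal' ]
```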

fontName?

optional fontName: string

Defined in: core/types.ts:185

Font name (e.g., "Helvetica", "Times-Roman", "OCR" for OCR-detected text).

fontSize?

optional fontSize: number

Defined in: core/types.ts:187

Font size in PDF points.

h

h: number

Defined in: core/types.ts:183

Alias for height.

height

height: number

Defined in: core/types.ts:179

Height of the text item in PDF points.

markup?

optional markup: MarkupData

Defined in: core/types.ts:195

Markup annotations (highlights, underlines, etc.) applied to this text.

r?

optional r: number

Defined in: core/types.ts:189

Rotation angle in degrees. One of 0, 90, 180, or 270.

rx?

optional rx: number

Defined in: core/types.ts:191

X coordinate after rotation transformation.

ry?

optional ry: number

Defined in: core/types.ts:193

Y coordinate after rotation transformation.

str

str: string

Defined in: core/types.ts:171

The text content of this item.

w

w: number

Defined in: core/types.ts:181

Alias for width.

width

width: number

Defined in: core/types.ts:177

Width of the text item in PDF points.

x

x: number

Defined in: core/types.ts:173

X coordinate of the top-left corner, in PDF points.

y

y: number

Defined in: core/types.ts:175

Y coordinate of the top-left corner, in PDF points.

Type Aliases

LiteParseInput

LiteParseInput = string | Buffer | Uint8Array

Defined in: core/types.ts:18

Accepted input types for LiteParse.parse and LiteParse.screenshot.

string — A file path to a document on disk.
Buffer | Uint8Array — Raw file bytes (PDF bytes go straight to the parser with zero disk I/O; non-PDF bytes are written to a temp file for format conversion).
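
When passing raw bytes, it can be useful to know in advance which of the two paths applies. A well-formed PDF begins with the ASCII header %PDF-, which the sketch below checks. looksLikePdf is a hypothetical helper, and this reference does not state how LiteParse itself distinguishes PDF bytes from other formats.

```typescript
// True when the bytes start with the PDF header "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const header = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-" in ASCII
  return bytes.length >= header.length && header.every((byte, i) => bytes[i] === byte);
}

// "%PDF-1.7" as raw bytes.
const pdfBytes = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
// "PK" followed by 03 04: the ZIP header used by DOCX and XLSX files.
const zipBytes = new Uint8Array([0x50, 0x4b, 0x03, 0x04]);

console.log(looksLikePdf(pdfBytes)); // true
console.log(looksLikePdf(zipBytes)); // false
```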

OutputFormat

OutputFormat = "json" | "text"

Defined in: core/types.ts:9

Functions

searchItems()

searchItems(items, options): JsonTextItem[]

Defined in: processing/searchItems.ts:26

Search text items for matches, returning synthetic merged items for each match.

For phrase searches, consecutive text items are concatenated and searched. When a phrase spans multiple items, the result is a single merged item with combined bounding box and the matched text. Font metadata is taken from the first matched item.

Parameters

Example

import { LiteParse, searchItems } from "@llamaindex/liteparse";

const parser = new LiteParse({ outputFormat: "json" });
const result = await parser.parse("report.pdf");

for (const page of result.json?.pages ?? []) {
  const matches = searchItems(page.textItems, { phrase: "0°C to 70°C" });
  for (const match of matches) {
    console.log(`Found "${match.text}" at (${match.x}, ${match.y})`);
  }
}
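
The merged item described above for searchItems can be pictured as the union of the matched items' rectangles, with font metadata carried over from the first item. The sketch below shows that geometry only. JsonTextItemLike and mergeItemsSketch are local illustrative names, and the library's actual merging logic may differ in detail.

```typescript
// Local mirror of the JsonTextItem fields used here.
interface JsonTextItemLike {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
  fontName?: string;
  fontSize?: number;
}

// Combine adjacent items into one synthetic item covering all of them.
function mergeItemsSketch(items: JsonTextItemLike[], matchedText: string): JsonTextItemLike {
  const first = items[0];
  if (first === undefined) throw new Error("Cannot merge an empty list of items");
  const minX = Math.min(...items.map((item) => item.x));
  const minY = Math.min(...items.map((item) => item.y));
  const maxX = Math.max(...items.map((item) => item.x + item.width));
  const maxY = Math.max(...items.map((item) => item.y + item.height));
  // Spreading `first` keeps its font metadata; position and text are replaced.
  return { ...first, text: matchedText, x: minX, y: minY, width: maxX - minX, height: maxY - minY };
}

const mergedItem = mergeItemsSketch(
  [
    { text: "0°C to", x: 100, y: 50, width: 40, height: 10, fontName: "Helvetica" },
    { text: "70°C", x: 144, y: 50, width: 30, height: 10, fontName: "Helvetica-Bold" },
  ],
  "0°C to 70°C",
);
console.log(mergedItem.x, mergedItem.y, mergedItem.width, mergedItem.height); // 100 50 74 10
```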