API Reference

LiteParse

API reference for the @llamaindex/liteparse TypeScript library.

LiteParse — open-source PDF parsing with spatial text extraction, OCR, and bounding boxes.

Example

import { LiteParse } from "@llamaindex/liteparse";

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse("document.pdf");
console.log(result.text);

Classes

LiteParse

Defined in: parser.ts:47

Main document parser class. Handles PDF parsing, OCR, format conversion, and screenshot generation.

Examples

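import { LiteParse } from "@llamaindex/liteparse";

// Example 1: parse with default settings and print the extracted text.
const parser = new LiteParse();
const result = await parser.parse("document.pdf");
console.log(result.text);

// Example 2: request JSON output at 300 DPI and count bounding boxes per page.
// `json` is optional on ParseResult, so guard against it being undefined.
const parser = new LiteParse({ outputFormat: "json", dpi: 300 });
const result = await parser.parse("document.pdf");
for (const page of result.json?.pages ?? []) {
  console.log(`Page ${page.page}: ${page.boundingBoxes.length} bounding boxes`);
}

// Example 3: send OCR to an external HTTP server (ISO 639-1 language code).
const parser = new LiteParse({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
});
const result = await parser.parse("scanned-document.pdf");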

Constructors

Constructor

new LiteParse(userConfig?): LiteParse

Defined in: parser.ts:57

Create a new LiteParse instance.

Parameters

userConfig?

Partial<LiteParseConfig> = {}

Partial configuration to override defaults. See LiteParseConfig for all options.

Returns

LiteParse

Methods

getConfig()

getConfig(): LiteParseConfig

Defined in: parser.ts:400

Get a copy of the current configuration, including defaults merged with user overrides.

Returns

LiteParseConfig

A shallow copy of the active LiteParseConfig.
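
As a rough illustration of what "defaults merged with user overrides" means, the sketch below spreads a partial config over the default values listed on this page and returns a new object. `ConfigSketch`, `DEFAULTS`, and `mergeConfig` are hypothetical names for this example, not the library's internals, and only a subset of fields is mirrored.

```typescript
// Hypothetical local mirror of a few LiteParseConfig fields (not imported from the library).
type OutputFormat = "json" | "text";

interface ConfigSketch {
  dpi: number;
  maxPages: number;
  ocrEnabled: boolean;
  ocrLanguage: string | string[];
  outputFormat: OutputFormat;
  preciseBoundingBox: boolean;
  targetPages?: string;
}

// Default values as listed in this reference.
const DEFAULTS: ConfigSketch = {
  dpi: 150,
  maxPages: 1000,
  ocrEnabled: true,
  ocrLanguage: "en",
  outputFormat: "json",
  preciseBoundingBox: true,
};

// Overrides win over defaults; spreading into a new object yields a shallow copy.
function mergeConfig(userConfig: Partial<ConfigSketch> = {}): ConfigSketch {
  return { ...DEFAULTS, ...userConfig };
}

const merged = mergeConfig({ dpi: 300, outputFormat: "text" });
// merged.dpi is 300 (override); merged.maxPages is 1000 (default)
```

Because the copy is shallow, mutating the returned object would not affect the defaults, but nested values such as an `ocrLanguage` array would still be shared.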

parse()

parse(filePath, quiet?): Promise<ParseResult>

Defined in: parser.ts:87

Parse a document and return the extracted text, page data, and optionally structured JSON.

Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before parsing if the required system tools are installed.

Parameters

filePath

string

Path to the document file.

quiet?

boolean = false

If true, suppresses progress logging to stderr.

Returns

Promise<ParseResult>

Parsed document data including text, per-page info, and optional JSON.

Throws

Error if the file cannot be found, converted, or parsed.
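
Since parse() throws on failure, callers usually wrap it. The sketch below shows one way to turn the exception into a value. It uses a structural `ParserLike` interface so it compiles without the package installed; `ParserLike`, `ParseOutcome`, `tryParse`, and the stub parser are all hypothetical names for this example.

```typescript
// Minimal structural stand-in for the documented parse() signature (hypothetical type).
interface ParserLike {
  parse(filePath: string, quiet?: boolean): Promise<{ text: string }>;
}

type ParseOutcome =
  | { ok: true; text: string }
  | { ok: false; error: string };

// Wraps parse() so a missing or unparseable file becomes a value instead of an exception.
async function tryParse(parser: ParserLike, filePath: string): Promise<ParseOutcome> {
  try {
    const result = await parser.parse(filePath, true); // quiet: suppress progress logging
    return { ok: true, text: result.text };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Stub used only to demonstrate the wrapper without touching the file system.
const stubParser: ParserLike = {
  async parse(filePath) {
    if (!filePath.endsWith(".pdf")) throw new Error(`cannot parse ${filePath}`);
    return { text: "hello" };
  },
};
```

Going by the signature documented above, a LiteParse instance should satisfy `ParserLike` and could be passed in place of the stub.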

screenshot()

screenshot(filePath, pageNumbers?, quiet?): Promise<ScreenshotResult[]>

Defined in: parser.ts:188

Generate screenshots of PDF pages as image buffers.

Uses PDFium for high-quality rendering. Each page is returned as a ScreenshotResult with the raw image buffer and dimensions.

Parameters

filePath

string

Path to the PDF file.

pageNumbers?

number[]

1-indexed page numbers to screenshot. If omitted, all pages are rendered.

quiet?

boolean = false

If true, suppresses progress logging to stderr.

Returns

Promise<ScreenshotResult[]>

Array of screenshot results, one per rendered page.

Interfaces

BoundingBox

Defined in: types.ts:224

An axis-aligned bounding box defined by its top-left and bottom-right corners.

All coordinates are in PDF points.

Properties

x1

x1: number

Defined in: types.ts:226

X coordinate of the top-left corner.

x2

x2: number

Defined in: types.ts:230

X coordinate of the bottom-right corner.

y1

y1: number

Defined in: types.ts:228

Y coordinate of the top-left corner.

y2

y2: number

Defined in: types.ts:232

Y coordinate of the bottom-right corner.
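
The corner representation converts directly to a size. The sketch below derives width, height, and area from a box and checks two boxes for overlap. `boxSize` and `boxesOverlap` are hypothetical helpers, and the code assumes x2 >= x1 and y2 >= y1.

```typescript
// Local copy of the documented BoundingBox shape.
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Width, height, and area of a box, in PDF points (area in square points).
function boxSize(box: BoundingBox): { width: number; height: number; area: number } {
  const width = box.x2 - box.x1;
  const height = box.y2 - box.y1;
  return { width, height, area: width * height };
}

// True when two boxes overlap; boxes that only touch at an edge do not count.
function boxesOverlap(a: BoundingBox, b: BoundingBox): boolean {
  return a.x1 < b.x2 && b.x1 < a.x2 && a.y1 < b.y2 && b.y1 < a.y2;
}

const size = boxSize({ x1: 72, y1: 100, x2: 272, y2: 112 });
// size.width is 200, size.height is 12, size.area is 2400
```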

LiteParseConfig

Defined in: types.ts:25

Configuration options for the LiteParse parser.

All fields have sensible defaults. Pass a Partial<LiteParseConfig> to the constructor to override only the options you need.

Example

const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  dpi: 300,
  outputFormat: "json",
});

Properties

dpi

dpi: number

Defined in: types.ts:78

DPI (dots per inch) for rendering pages to images. Higher values improve OCR accuracy but increase processing time and memory usage.

Default Value

150
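
To get a feel for how DPI affects image size and memory use: a PDF point is 1/72 inch, so a length in points maps to pixels as points × dpi / 72. The helper below is a hypothetical sketch; the exact rounding LiteParse applies is not documented here, so `Math.round` is an assumption.

```typescript
const POINTS_PER_INCH = 72;

// Converts a length in PDF points to pixels at the given DPI.
function pointsToPixels(points: number, dpi: number): number {
  return Math.round((points * dpi) / POINTS_PER_INCH);
}

// A US Letter page is 612 x 792 points.
const widthPx = pointsToPixels(612, 150); // 1275
const heightPx = pointsToPixels(792, 150); // 1650
```

Doubling the DPI to 300 doubles each dimension and so quadruples the pixel count per page.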

maxPages

maxPages: number

Defined in: types.ts:63

Maximum number of pages to parse from the document.

Default Value

1000

numWorkers

numWorkers: number

Defined in: types.ts:56

Number of pages to OCR in parallel. Higher values use more memory but process faster on multi-core machines.

Default Value

CPU cores - 1 (minimum 1)
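
The default above can be written as a one-line rule. `defaultNumWorkers` is a hypothetical helper that restates the documented formula; it is not the library's code.

```typescript
// "CPU cores - 1 (minimum 1)": leave one core free, but never go below one worker.
// In Node.js the core count is typically read from os.cpus().length.
function defaultNumWorkers(cpuCores: number): number {
  return Math.max(1, cpuCores - 1);
}

const onLaptop = defaultNumWorkers(8); // 7
const onSingleCore = defaultNumWorkers(1); // 1
```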

ocrEnabled

ocrEnabled: boolean

Defined in: types.ts:40

Whether to run OCR on pages with little or no native text. When enabled, LiteParse selectively OCRs only images and text-sparse regions.

Default Value

true

ocrLanguage

ocrLanguage: string | string[]

Defined in: types.ts:32

OCR language code(s). Uses ISO 639-3 codes for Tesseract (e.g., "eng", "fra") or ISO 639-1 for HTTP OCR servers (e.g., "en", "fr").

Default Value

"en"

ocrServerUrl?

optional ocrServerUrl: string

Defined in: types.ts:48

URL of an HTTP OCR server implementing the LiteParse OCR API. If not provided, the built-in Tesseract.js engine is used.

See

OCR API Specification
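
Because the expected language-code form differs between the built-in Tesseract engine (ISO 639-3) and HTTP OCR servers (ISO 639-1), code that switches between the two may want to normalize codes itself. The sketch below is one way to do that. The mapping table is a small illustrative subset, `toTesseractLanguages` is a hypothetical helper, and this page does not say whether LiteParse performs a similar conversion internally.

```typescript
// Illustrative subset of ISO 639-1 -> ISO 639-3 codes; extend as needed.
const ISO_639_1_TO_3: Record<string, string> = {
  en: "eng",
  fr: "fra",
  de: "deu",
  es: "spa",
};

// Accepts the documented `string | string[]` shape and returns three-letter codes.
// Codes missing from the table (including ones already in 639-3 form) pass through unchanged.
function toTesseractLanguages(ocrLanguage: string | string[]): string[] {
  const codes = Array.isArray(ocrLanguage) ? ocrLanguage : [ocrLanguage];
  return codes.map((code) => ISO_639_1_TO_3[code] ?? code);
}

const single = toTesseractLanguages("en"); // ["eng"]
const mixed = toTesseractLanguages(["fr", "deu"]); // ["fra", "deu"]
```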

outputFormat

outputFormat: OutputFormat

Defined in: types.ts:85

Output format for parsed results.

Default Value

"json"

preciseBoundingBox

preciseBoundingBox: boolean

Defined in: types.ts:93

Calculate precise bounding boxes for each text line. Disable for faster parsing when bounding boxes aren’t needed.

Default Value

true

preserveLayoutAlignmentAcrossPages

preserveLayoutAlignmentAcrossPages: boolean

Defined in: types.ts:107

Maintain consistent text alignment across page boundaries.

Default Value

false

preserveVerySmallText

preserveVerySmallText: boolean

Defined in: types.ts:100

Preserve very small text that would normally be filtered out.

Default Value

false

targetPages?

optional targetPages: string

Defined in: types.ts:70

Specific pages to parse, as a comma-separated string of page numbers and ranges.

Example

"1-5,10,15-20"
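
To make the range syntax concrete, the sketch below expands such a string into individual 1-indexed page numbers. `expandTargetPages` is a hypothetical helper, not part of the library, and throwing on malformed input is an assumption, since this page does not describe how LiteParse handles invalid values.

```typescript
// Expands "1-5,10,15-20" into a sorted list of unique 1-indexed page numbers.
function expandTargetPages(targetPages: string): number[] {
  const pages = new Set<number>();
  for (const part of targetPages.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${token}"`);
    for (let page = start; page <= end; page++) pages.add(page);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

const expanded = expandTargetPages("1-3,5"); // [1, 2, 3, 5]
```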

MarkupData

Defined in: types.ts:152

Markup annotation data associated with a text item.

Properties

highlight?

optional highlight: string

Defined in: types.ts:154

Highlight color (e.g., "yellow", "#FFFF00"), or undefined if not highlighted.

squiggly?

optional squiggly: boolean

Defined in: types.ts:158

Whether the text has a squiggly underline.

strikeout?

optional strikeout: boolean

Defined in: types.ts:160

Whether the text is struck out.

underline?

optional underline: boolean

Defined in: types.ts:156

Whether the text is underlined.
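
A common use of these flags is filtering text items by annotation. The sketch below collects highlighted text and drops struck-out text. `MarkedItem`, `highlightedText`, and `withoutStrikeout` are hypothetical names; only the `str` and `markup` field names come from this reference.

```typescript
// Local copy of the documented MarkupData shape.
interface MarkupData {
  highlight?: string;
  underline?: boolean;
  squiggly?: boolean;
  strikeout?: boolean;
}

// Only the two TextItem fields this example needs.
interface MarkedItem {
  str: string;
  markup?: MarkupData;
}

// Text of every item that carries a highlight color.
function highlightedText(items: MarkedItem[]): string[] {
  return items
    .filter((item) => item.markup?.highlight !== undefined)
    .map((item) => item.str);
}

// Drops items marked as struck out, e.g. to ignore deleted text in a reviewed draft.
function withoutStrikeout(items: MarkedItem[]): MarkedItem[] {
  return items.filter((item) => item.markup?.strikeout !== true);
}

const sampleItems: MarkedItem[] = [
  { str: "Keep", markup: { highlight: "#FFFF00" } },
  { str: "Old", markup: { strikeout: true } },
  { str: "Plain" },
];
```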

ParsedPage

Defined in: types.ts:238

Parsed data for a single page of a document.

Properties

boundingBoxes?

optional boundingBoxes: BoundingBox[]

Defined in: types.ts:250

Bounding boxes for text lines. Present when LiteParseConfig.preciseBoundingBox is enabled.

height

height: number

Defined in: types.ts:244

Page height in PDF points.

pageNum

pageNum: number

Defined in: types.ts:240

1-indexed page number.

text

text: string

Defined in: types.ts:246

Full text content of the page with spatial layout preserved.

textItems

textItems: TextItem[]

Defined in: types.ts:248

Individual text elements extracted from the page.

width

width: number

Defined in: types.ts:242

Page width in PDF points.
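
To show how these fields fit together, the sketch below summarizes a page and flags it when it has very little text. `PageSketch`, `PageSummary`, `summarizePage`, and the 20-character threshold are assumptions for illustration; only the field names come from this reference.

```typescript
// Local subset of the documented ParsedPage fields.
interface PageSketch {
  pageNum: number;
  width: number;
  height: number;
  text: string;
  textItems: { str: string }[];
}

interface PageSummary {
  pageNum: number;
  orientation: "portrait" | "landscape";
  itemCount: number;
  sparse: boolean; // true when the page has very little text
}

// Arbitrary threshold chosen for this example only.
const SPARSE_TEXT_CHARS = 20;

function summarizePage(page: PageSketch): PageSummary {
  return {
    pageNum: page.pageNum,
    orientation: page.width > page.height ? "landscape" : "portrait",
    itemCount: page.textItems.length,
    sparse: page.text.trim().length < SPARSE_TEXT_CHARS,
  };
}

const summary = summarizePage({ pageNum: 1, width: 612, height: 792, text: "  ", textItems: [] });
// summary.orientation is "portrait"; summary.sparse is true
```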

ParseResult

Defined in: types.ts:286

The result of parsing a document with LiteParse.parse.

Properties

json?

optional json: ParseResultJson

Defined in: types.ts:292

Structured JSON data. Present when LiteParseConfig.outputFormat is "json".

pages

pages: ParsedPage[]

Defined in: types.ts:288

Per-page parsed data.

text

text: string

Defined in: types.ts:290

Full document text, concatenated from all pages.

ParseResultJson

Defined in: types.ts:257

Structured JSON representation of parsed document data. Returned when LiteParseConfig.outputFormat is "json".

Properties

ScreenshotResult

Defined in: types.ts:298

The result of generating a screenshot with LiteParse.screenshot.

Properties

height

height: number

Defined in: types.ts:304

Image height in pixels.

imageBuffer

imageBuffer: Buffer

Defined in: types.ts:306

Raw image data as a Node.js Buffer (PNG or JPG).
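
Since the buffer may hold either PNG or JPG data, a caller that needs the right file extension can check the leading bytes. The sketch below relies on the standard PNG and JPEG file signatures; `detectImageFormat` is a hypothetical helper. It accepts a `Uint8Array`, which a Node.js `Buffer` is a subclass of.

```typescript
type ImageFormat = "png" | "jpg" | "unknown";

// PNG files start with these eight bytes.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Identifies the image format from its leading signature bytes.
function detectImageFormat(bytes: Uint8Array): ImageFormat {
  if (
    bytes.length >= PNG_SIGNATURE.length &&
    PNG_SIGNATURE.every((value, index) => bytes[index] === value)
  ) {
    return "png";
  }
  // JPEG files start with the bytes FF D8 FF.
  if (bytes.length >= 3 && bytes[0] === 0xff && bytes[1] === 0xd8 && bytes[2] === 0xff) {
    return "jpg";
  }
  return "unknown";
}

const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const format = detectImageFormat(pngHeader); // "png"
```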

imagePath?

optional imagePath: string

Defined in: types.ts:308

File path if the screenshot was saved to disk.

pageNum

pageNum: number

Defined in: types.ts:300

1-indexed page number.

width

width: number

Defined in: types.ts:302

Image width in pixels.

TextItem

Defined in: types.ts:116

An individual text element extracted from a page, with position, size, and font metadata.

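Coordinates are in PDF points, in a page space whose origin is at the top-left of the page: x increases to the right and y increases downward. (Native PDF user space, by contrast, places the origin at the bottom-left with y increasing upward.)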

Properties

fontName?

optional fontName: string

Defined in: types.ts:132

Font name (e.g., "Helvetica", "Times-Roman", "OCR" for OCR-detected text).

fontSize?

optional fontSize: number

Defined in: types.ts:134

Font size in PDF points.

h

h: number

Defined in: types.ts:130

Alias for height.

height

height: number

Defined in: types.ts:126

Height of the text item in PDF points.

markup?

optional markup: MarkupData

Defined in: types.ts:142

Markup annotations (highlights, underlines, etc.) applied to this text.

r?

optional r: number

Defined in: types.ts:136

Rotation angle in degrees. One of 0, 90, 180, or 270.

rx?

optional rx: number

Defined in: types.ts:138

X coordinate after rotation transformation.

ry?

optional ry: number

Defined in: types.ts:140

Y coordinate after rotation transformation.

str

str: string

Defined in: types.ts:118

The text content of this item.

w

w: number

Defined in: types.ts:128

Alias for width.

width

width: number

Defined in: types.ts:124

Width of the text item in PDF points.

x

x: number

Defined in: types.ts:120

X coordinate of the top-left corner, in PDF points.

y

y: number

Defined in: types.ts:122

Y coordinate of the top-left corner, in PDF points.
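
The position and size fields map directly onto the corner form used by BoundingBox. The sketch below converts one to the other and sorts items into a naive reading order. `TextItemSketch`, `textItemToBox`, and `sortReadingOrder` are hypothetical names; rotation (`r`, `rx`, `ry`) is ignored, and items on the same visual line may differ slightly in `y`, which this simple sort does not account for.

```typescript
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Local subset of the documented TextItem fields.
interface TextItemSketch {
  str: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// (x, y) is the top-left corner and y grows downward, so the
// bottom-right corner is (x + width, y + height).
function textItemToBox(item: TextItemSketch): BoundingBox {
  return {
    x1: item.x,
    y1: item.y,
    x2: item.x + item.width,
    y2: item.y + item.height,
  };
}

// Returns a sorted copy: top-to-bottom first, then left-to-right.
function sortReadingOrder(items: TextItemSketch[]): TextItemSketch[] {
  return items.slice().sort((a, b) => a.y - b.y || a.x - b.x);
}

const box = textItemToBox({ str: "Title", x: 72, y: 50, width: 120, height: 14 });
// box is { x1: 72, y1: 50, x2: 192, y2: 64 }
```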

Type Aliases

OutputFormat

OutputFormat = "json" | "text"

Defined in: types.ts:7

Supported output formats for parsed documents.

"json" — Structured JSON with per-page text items, bounding boxes, and metadata.
"text" — Plain text with spatial layout preserved.