LiteParse

API Reference

API reference for the @llamaindex/liteparse TypeScript library.

LiteParse — open-source PDF parsing with spatial text extraction, OCR, and bounding boxes.

import { LiteParse } from "@llamaindex/liteparse";
const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse("document.pdf");
console.log(result.text);

LiteParse

Defined in: parser.ts:47

Main document parser class. Handles PDF parsing, OCR, format conversion, and screenshot generation.

import { LiteParse } from "@llamaindex/liteparse";

// Basic usage with default settings
const parser = new LiteParse();
const result = await parser.parse("document.pdf");
console.log(result.text);

// JSON output with bounding boxes, rendered at a higher DPI
const jsonParser = new LiteParse({ outputFormat: "json", dpi: 300 });
const jsonResult = await jsonParser.parse("document.pdf");
// `json` is optional on ParseResult, so guard before iterating
for (const page of jsonResult.json?.pages ?? []) {
  console.log(`Page ${page.page}: ${page.boundingBoxes.length} bounding boxes`);
}

// OCR through an external HTTP OCR server
const ocrParser = new LiteParse({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
});
const ocrResult = await ocrParser.parse("scanned-document.pdf");

new LiteParse(userConfig?): LiteParse

Defined in: parser.ts:57

Create a new LiteParse instance.

userConfig: Partial<LiteParseConfig> = {}

Partial configuration to override defaults. See LiteParseConfig for all options.

Returns: LiteParse

getConfig(): LiteParseConfig

Defined in: parser.ts:400

Get a copy of the current configuration, including defaults merged with user overrides.

Returns: LiteParseConfig

A shallow copy of the active LiteParseConfig.
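To illustrate what a shallow copy implies for callers, here is a minimal sketch that assumes "shallow copy" has its usual JavaScript meaning. `ConfigSubset` is a hypothetical, locally declared stand-in for the real configuration type, and the object spread only imitates what getConfig() is described as returning.

```typescript
// Hypothetical subset of the configuration, declared locally for illustration.
interface ConfigSubset {
  dpi: number;
  ocrLanguage: string | string[];
}

const active: ConfigSubset = { dpi: 150, ocrLanguage: ["en", "fr"] };

// A shallow copy duplicates top-level fields but shares nested objects.
const copy: ConfigSubset = { ...active };

// Only the copy changes here; active.dpi is still 150.
copy.dpi = 300;

if (Array.isArray(copy.ocrLanguage)) {
  // The array is shared between both objects, so active.ocrLanguage grows too.
  copy.ocrLanguage.push("de");
}
```

If that assumption holds, reassigning top-level fields on the returned object is safe, while mutating an array value in place could leak into the parser's own configuration.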

parse(filePath, quiet?): Promise<ParseResult>

Defined in: parser.ts:87

Parse a document and return the extracted text, page data, and optionally structured JSON.

Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before parsing if the required system tools are installed.

filePath: string

Path to the document file.

quiet: boolean = false

If true, suppresses progress logging to stderr.

Returns: Promise<ParseResult>

Parsed document data including text, per-page info, and optional JSON.

Throws: Error if the file cannot be found, converted, or parsed.
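Because parse() rejects on failure, callers may want to turn the rejection into a value. The sketch below is one possible wrapper, not part of the library: `ParserLike`, `ParseOutcome`, and `tryParse` are hypothetical names, and `ParserLike` only mirrors the documented parse(filePath, quiet?) signature so the example stays self-contained.

```typescript
// Structural stand-ins for the documented API (assumptions, not library exports).
interface ParseResultLike {
  text: string;
}

interface ParserLike {
  parse(filePath: string, quiet?: boolean): Promise<ParseResultLike>;
}

type ParseOutcome =
  | { ok: true; text: string }
  | { ok: false; message: string };

// Parse quietly and convert a thrown error into a failure value.
async function tryParse(parser: ParserLike, filePath: string): Promise<ParseOutcome> {
  try {
    const result = await parser.parse(filePath, true);
    return { ok: true, text: result.text };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, message };
  }
}
```

Any object with a compatible parse method, including a LiteParse instance, should satisfy `ParserLike` under TypeScript's structural typing.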

screenshot(filePath, pageNumbers?, quiet?): Promise<ScreenshotResult[]>

Defined in: parser.ts:188

Generate screenshots of PDF pages as image buffers.

Uses PDFium for high-quality rendering. Each page is returned as a ScreenshotResult with the raw image buffer and dimensions.

filePath: string

Path to the PDF file.

pageNumbers?: number[]

1-indexed page numbers to screenshot. If omitted, all pages are rendered.

quiet: boolean = false

If true, suppresses progress logging to stderr.

Returns: Promise<ScreenshotResult[]>

Array of screenshot results, one per rendered page.

BoundingBox

Defined in: types.ts:224

An axis-aligned bounding box defined by its top-left and bottom-right corners.

All coordinates are in PDF points.

x1: number

Defined in: types.ts:226

X coordinate of the top-left corner.

x2: number

Defined in: types.ts:230

X coordinate of the bottom-right corner.

y1: number

Defined in: types.ts:228

Y coordinate of the top-left corner.

y2: number

Defined in: types.ts:232

Y coordinate of the bottom-right corner.
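Since a box is stored as two corners rather than an origin plus a size, width and height have to be derived. The sketch below redeclares the documented shape locally so that it is self-contained. `boxWidth`, `boxHeight`, and `containsPoint` are hypothetical helpers, not library functions, and they assume x2 >= x1 and y2 >= y1.

```typescript
// Local mirror of the documented BoundingBox shape (not imported from the library).
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Horizontal extent in PDF points.
function boxWidth(box: BoundingBox): number {
  return box.x2 - box.x1;
}

// Vertical extent in PDF points.
function boxHeight(box: BoundingBox): number {
  return box.y2 - box.y1;
}

// True when the point lies inside the box or on its edge.
function containsPoint(box: BoundingBox, x: number, y: number): boolean {
  return x >= box.x1 && x <= box.x2 && y >= box.y1 && y <= box.y2;
}

const lineBox: BoundingBox = { x1: 72, y1: 100, x2: 272, y2: 112 };
const lineArea = boxWidth(lineBox) * boxHeight(lineBox); // 200 * 12 = 2400 square points
```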


LiteParseConfig

Defined in: types.ts:25

Configuration options for the LiteParse parser.

All fields have sensible defaults. Pass a Partial<LiteParseConfig> to the constructor to override only the options you need.

const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  dpi: 300,
  outputFormat: "json",
});
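The way a partial override combines with defaults can be sketched with an object spread. This is only an illustration of the pattern: `ExampleConfig`, `DEFAULTS`, and `resolveConfig` are hypothetical names covering a subset of the documented fields, the default values are the ones listed in this reference, and the library's actual merge logic is not shown here.

```typescript
type OutputFormat = "json" | "text";

// Hypothetical subset of LiteParseConfig, declared locally for illustration.
interface ExampleConfig {
  dpi: number;
  maxPages: number;
  ocrEnabled: boolean;
  ocrLanguage: string | string[];
  outputFormat: OutputFormat;
  preciseBoundingBox: boolean;
}

// Defaults as documented in this reference.
const DEFAULTS: ExampleConfig = {
  dpi: 150,
  maxPages: 1000,
  ocrEnabled: true,
  ocrLanguage: "en",
  outputFormat: "json",
  preciseBoundingBox: true,
};

// Later spreads win, so user-supplied keys replace the defaults.
function resolveConfig(userConfig: Partial<ExampleConfig> = {}): ExampleConfig {
  return { ...DEFAULTS, ...userConfig };
}

const resolved = resolveConfig({ ocrLanguage: "fra", dpi: 300 });
// resolved.dpi is 300 and resolved.ocrLanguage is "fra"; every other field keeps its default.
```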

dpi: number

Defined in: types.ts:78

DPI (dots per inch) for rendering pages to images. Higher values improve OCR accuracy but increase processing time and memory usage.

Default: 150
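One PDF point is 1/72 of an inch, so the DPI setting determines how many pixels a page of a given size renders to. The sketch below shows that arithmetic. `pointsToPixels` is a hypothetical helper, and rounding to the nearest pixel is an assumption, since the renderer may round differently.

```typescript
const POINTS_PER_INCH = 72;

// Convert a length in PDF points to pixels at the given DPI.
function pointsToPixels(points: number, dpi: number): number {
  return Math.round((points * dpi) / POINTS_PER_INCH);
}

// A US Letter page is 612 x 792 points.
const letterWidthPx = pointsToPixels(612, 150); // 1275
const letterHeightPx = pointsToPixels(792, 150); // 1650
```

Doubling the DPI doubles each dimension and so quadruples the pixel count, which is where the extra processing time and memory come from.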

maxPages: number

Defined in: types.ts:63

Maximum number of pages to parse from the document.

Default: 1000

numWorkers: number

Defined in: types.ts:56

Number of pages to OCR in parallel. Higher values use more memory but process faster on multi-core machines.

Default: CPU cores - 1 (minimum 1)
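The rule "CPU cores - 1 (minimum 1)" can be written as a one-line clamp. `defaultNumWorkers` is a hypothetical helper; the core count is passed in as a parameter to keep the sketch self-contained, and in Node.js it would typically come from `os.cpus().length`.

```typescript
// Leave one core free for the main thread, but never go below one worker.
function defaultNumWorkers(cpuCores: number): number {
  return Math.max(1, cpuCores - 1);
}

const workersOnEightCores = defaultNumWorkers(8); // 7
const workersOnOneCore = defaultNumWorkers(1); // 1
```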

ocrEnabled: boolean

Defined in: types.ts:40

Whether to run OCR on pages with little or no native text. When enabled, LiteParse selectively OCRs only images and text-sparse regions.

Default: true

ocrLanguage: string | string[]

Defined in: types.ts:32

OCR language code(s). Uses ISO 639-3 codes for Tesseract (e.g., "eng", "fra") or ISO 639-1 for HTTP OCR servers (e.g., "en", "fr").

Default: "en"
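Because the two OCR backends expect different code families, an application that switches between them may want to normalize codes itself. The sketch below maps a handful of ISO 639-1 codes to their ISO 639-3 equivalents. `ISO_639_1_TO_3` and `toTesseractCodes` are hypothetical, the table is deliberately incomplete, and this reference does not say whether LiteParse performs any such mapping internally.

```typescript
// Small, incomplete lookup table from ISO 639-1 to ISO 639-3 codes.
const ISO_639_1_TO_3: Record<string, string> = {
  en: "eng",
  fr: "fra",
  de: "deu",
  es: "spa",
  it: "ita",
};

// Accept one code or several, and pass unknown codes through unchanged.
function toTesseractCodes(language: string | string[]): string[] {
  const codes = Array.isArray(language) ? language : [language];
  return codes.map((code) => ISO_639_1_TO_3[code] ?? code);
}

const single = toTesseractCodes("en"); // ["eng"]
const mixed = toTesseractCodes(["fr", "deu"]); // ["fra", "deu"]
```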

optional ocrServerUrl: string

Defined in: types.ts:48

URL of an HTTP OCR server implementing the LiteParse OCR API. If not provided, the built-in Tesseract.js engine is used.

See: OCR API Specification

outputFormat: OutputFormat

Defined in: types.ts:85

Output format for parsed results.

Default: "json"

preciseBoundingBox: boolean

Defined in: types.ts:93

Calculate precise bounding boxes for each text line. Disable for faster parsing when bounding boxes aren’t needed.

Default: true

preserveLayoutAlignmentAcrossPages: boolean

Defined in: types.ts:107

Maintain consistent text alignment across page boundaries.

Default: false

preserveVerySmallText: boolean

Defined in: types.ts:100

Preserve very small text that would normally be filtered out.

Default: false

optional targetPages: string

Defined in: types.ts:70

Specific pages to parse, as a comma-separated string of page numbers and ranges.

Example: `"1-5,10,15-20"`
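To make the page-range syntax concrete, here is a sketch of how such a string could be expanded into individual page numbers. `expandTargetPages` is a hypothetical helper, not the library's parser. It assumes 1-indexed pages and inclusive ranges, and it rejects malformed tokens, which may differ from how LiteParse treats them.

```typescript
// Expand a spec such as "1-3,5" into a sorted list of unique page numbers.
function expandTargetPages(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) {
      throw new Error(`Invalid page token: "${token}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

const expanded = expandTargetPages("1-3,5"); // [1, 2, 3, 5]
```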

MarkupData

Defined in: types.ts:152

Markup annotation data associated with a text item.

optional highlight: string

Defined in: types.ts:154

Highlight color (e.g., "yellow", "#FFFF00"), or undefined if not highlighted.

optional squiggly: boolean

Defined in: types.ts:158

Whether the text has a squiggly underline.

optional strikeout: boolean

Defined in: types.ts:160

Whether the text is struck out.

optional underline: boolean

Defined in: types.ts:156

Whether the text is underlined.
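All four markup fields are optional, so consumers need to handle missing values. The sketch below turns a markup object into a list of labels. The interface is redeclared locally from the fields documented above, and `markupLabels` is a hypothetical helper.

```typescript
// Local mirror of the documented MarkupData shape.
interface MarkupData {
  highlight?: string;
  underline?: boolean;
  squiggly?: boolean;
  strikeout?: boolean;
}

// List the annotations that are actually set, in a fixed order.
function markupLabels(markup?: MarkupData): string[] {
  if (!markup) return [];
  const labels: string[] = [];
  if (markup.highlight !== undefined) labels.push(`highlight:${markup.highlight}`);
  if (markup.underline) labels.push("underline");
  if (markup.squiggly) labels.push("squiggly");
  if (markup.strikeout) labels.push("strikeout");
  return labels;
}

const labels = markupLabels({ highlight: "yellow", strikeout: true });
// ["highlight:yellow", "strikeout"]
```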


ParsedPage

Defined in: types.ts:238

Parsed data for a single page of a document.

optional boundingBoxes: BoundingBox[]

Defined in: types.ts:250

Bounding boxes for text lines. Present when LiteParseConfig.preciseBoundingBox is enabled.

height: number

Defined in: types.ts:244

Page height in PDF points.

pageNum: number

Defined in: types.ts:240

1-indexed page number.

text: string

Defined in: types.ts:246

Full text content of the page with spatial layout preserved.

textItems: TextItem[]

Defined in: types.ts:248

Individual text elements extracted from the page.

width: number

Defined in: types.ts:242

Page width in PDF points.


ParseResult

Defined in: types.ts:286

The result of parsing a document with LiteParse.parse.

optional json: ParseResultJson

Defined in: types.ts:292

Structured JSON data. Present when LiteParseConfig.outputFormat is "json".

pages: ParsedPage[]

Defined in: types.ts:288

Per-page parsed data.

text: string

Defined in: types.ts:290

Full document text, concatenated from all pages.


ParseResultJson

Defined in: types.ts:257

Structured JSON representation of parsed document data. Returned when LiteParseConfig.outputFormat is "json".

pages: object[]

Defined in: types.ts:259

Array of page data.

boundingBoxes: BoundingBox[]

Bounding boxes for text lines.

height: number

Page height in PDF points.

page: number

1-indexed page number.

text: string

Full text content of the page.

textItems: object[]

Individual text elements with position and font metadata.

width: number

Page width in PDF points.
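Since the json field is only present when the output format is "json", code that reads it should guard against its absence. The sketch below uses locally declared shapes that mirror the fields documented above. `JsonPage`, `ParseResultLike`, and `summarizePages` are hypothetical names.

```typescript
// Local mirrors of the documented shapes, reduced to the fields used here.
interface JsonPage {
  page: number;
  width: number;
  height: number;
  text: string;
  boundingBoxes: { x1: number; y1: number; x2: number; y2: number }[];
}

interface ParseResultLike {
  text: string;
  json?: { pages: JsonPage[] };
}

// Produce one summary line per page, or nothing when json is absent.
function summarizePages(result: ParseResultLike): string[] {
  const pages = result.json?.pages ?? [];
  return pages.map(
    (page) => `Page ${page.page}: ${page.boundingBoxes.length} bounding boxes`,
  );
}
```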


ScreenshotResult

Defined in: types.ts:298

The result of generating a screenshot with LiteParse.screenshot.

height: number

Defined in: types.ts:304

Image height in pixels.

imageBuffer: Buffer

Defined in: types.ts:306

Raw image data as a Node.js Buffer (PNG or JPG).

optional imagePath: string

Defined in: types.ts:308

File path if the screenshot was saved to disk.

pageNum: number

Defined in: types.ts:300

1-indexed page number.

width: number

Defined in: types.ts:302

Image width in pixels.
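Because the buffer may hold either PNG or JPG data, a consumer that needs the right file extension can inspect the leading bytes. The sketch below checks the standard PNG and JPEG signatures. `detectImageFormat` is a hypothetical helper that accepts a `Uint8Array`, which a Node.js Buffer also is.

```typescript
// Identify PNG or JPEG data from its leading signature bytes.
function detectImageFormat(bytes: Uint8Array): "png" | "jpg" | "unknown" {
  // PNG files start with 0x89 'P' 'N' 'G'.
  if (
    bytes.length >= 4 &&
    bytes[0] === 0x89 &&
    bytes[1] === 0x50 &&
    bytes[2] === 0x4e &&
    bytes[3] === 0x47
  ) {
    return "png";
  }
  // JPEG files start with 0xFF 0xD8 0xFF.
  if (bytes.length >= 3 && bytes[0] === 0xff && bytes[1] === 0xd8 && bytes[2] === 0xff) {
    return "jpg";
  }
  return "unknown";
}
```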


TextItem

Defined in: types.ts:116

An individual text element extracted from a page, with position, size, and font metadata.

Coordinates are in PDF points, with the origin at the top-left of the page: x increases to the right and y increases downward.

optional fontName: string

Defined in: types.ts:132

Font name (e.g., "Helvetica", "Times-Roman", "OCR" for OCR-detected text).

optional fontSize: number

Defined in: types.ts:134

Font size in PDF points.

h: number

Defined in: types.ts:130

Alias for height.

height: number

Defined in: types.ts:126

Height of the text item in PDF points.

optional markup: MarkupData

Defined in: types.ts:142

Markup annotations (highlights, underlines, etc.) applied to this text.

optional r: number

Defined in: types.ts:136

Rotation angle in degrees. One of 0, 90, 180, or 270.

optional rx: number

Defined in: types.ts:138

X coordinate after rotation transformation.

optional ry: number

Defined in: types.ts:140

Y coordinate after rotation transformation.

str: string

Defined in: types.ts:118

The text content of this item.

w: number

Defined in: types.ts:128

Alias for width.

width: number

Defined in: types.ts:124

Width of the text item in PDF points.

x: number

Defined in: types.ts:120

X coordinate of the top-left corner, in PDF points.

y: number

Defined in: types.ts:122

Y coordinate of the top-left corner, in PDF points.
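A text item stores its position as a top-left corner plus a size, while a bounding box stores two corners. The sketch below converts between the two and sorts items top to bottom, then left to right. The interfaces are local mirrors of the documented fields, `textItemToBox` and `inReadingOrder` are hypothetical helpers, and rotation (r, rx, ry) is ignored for simplicity.

```typescript
// Local mirrors of the documented shapes, reduced to the fields used here.
interface TextItemLike {
  str: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Top-left origin with y increasing downward, so the far corner is (x + width, y + height).
function textItemToBox(item: TextItemLike): BoundingBox {
  return {
    x1: item.x,
    y1: item.y,
    x2: item.x + item.width,
    y2: item.y + item.height,
  };
}

// Sort a copy of the items by y first, then by x for items on the same line.
function inReadingOrder(items: TextItemLike[]): TextItemLike[] {
  return items.slice().sort((a, b) => a.y - b.y || a.x - b.x);
}
```

Comparing exact y values only groups items that share a baseline exactly; real documents usually need a tolerance, which is left out here.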

OutputFormat = "json" | "text"

Defined in: types.ts:7

Supported output formats for parsed documents.

  • "json" — Structured JSON with per-page text items, bounding boxes, and metadata.
  • "text" — Plain text with spatial layout preserved.
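When the output format comes from untyped input such as a CLI flag or an environment variable, a type guard can narrow it to the union before it reaches the parser configuration. The sketch below redeclares the union locally. `isOutputFormat` is a hypothetical helper, not a library export.

```typescript
// Local mirror of the documented union type.
type OutputFormat = "json" | "text";

// Narrow an arbitrary string to OutputFormat.
function isOutputFormat(value: string): value is OutputFormat {
  return value === "json" || value === "text";
}

// Fall back to the documented default when the input is not a valid format.
function parseOutputFormat(value: string): OutputFormat {
  return isOutputFormat(value) ? value : "json";
}
```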