API Reference
API reference for the @llamaindex/liteparse TypeScript library.
LiteParse — open-source PDF parsing with spatial text extraction, OCR, and bounding boxes.
Example
import { LiteParse } from "@llamaindex/liteparse";
const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse("document.pdf");
console.log(result.text);
Classes
LiteParse
Defined in: core/parser.ts:58
Main document parser class. Handles PDF parsing, OCR, format conversion, and screenshot generation.
Examples
import { LiteParse } from "@llamaindex/liteparse";
// Parse a PDF with the default settings and print its text.
const parser = new LiteParse();
const result = await parser.parse("document.pdf");
console.log(result.text);
// Request JSON output at a higher DPI and inspect each page.
const jsonParser = new LiteParse({ outputFormat: "json", dpi: 300 });
const jsonResult = await jsonParser.parse("document.pdf");
for (const page of jsonResult.json?.pages ?? []) {
  console.log(`Page ${page.page}: ${page.boundingBoxes.length} bounding boxes`);
}
// Send OCR requests to an external HTTP OCR server.
const ocrParser = new LiteParse({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
});
const ocrResult = await ocrParser.parse("scanned-document.pdf");
Constructors
Constructor
new LiteParse(userConfig?): LiteParse
Defined in: core/parser.ts:68
Create a new LiteParse instance.
Parameters
userConfig?
Partial<LiteParseConfig> = {}
Partial configuration to override defaults. See LiteParseConfig for all options.
Returns
LiteParse
Methods
getConfig()
getConfig(): LiteParseConfig
Defined in: core/parser.ts:492
Get a copy of the current configuration, including defaults merged with user overrides.
Returns
A shallow copy of the active LiteParseConfig.
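The merge-then-copy behavior described for the constructor and getConfig() can be pictured with a small sketch. It does not import the library: SketchConfig, SKETCH_DEFAULTS, mergeConfig, and copyConfig are hypothetical names, and only a handful of the documented options are modeled, using the default values listed in this reference.

```typescript
// Hypothetical, trimmed-down stand-in for LiteParseConfig (only a few fields shown).
interface SketchConfig {
  ocrEnabled: boolean;
  ocrLanguage: string | string[];
  dpi: number;
  outputFormat: "json" | "text";
}

// Default values as listed in this reference.
const SKETCH_DEFAULTS: SketchConfig = {
  ocrEnabled: true,
  ocrLanguage: "en",
  dpi: 150,
  outputFormat: "json",
};

// Merge user overrides on top of the defaults; the later spread wins.
function mergeConfig(userConfig: Partial<SketchConfig> = {}): SketchConfig {
  return { ...SKETCH_DEFAULTS, ...userConfig };
}

// Return a shallow copy so callers cannot change the stored object's top-level fields.
function copyConfig(active: SketchConfig): SketchConfig {
  return { ...active };
}

const active = mergeConfig({ dpi: 300 });
const snapshot = copyConfig(active);
snapshot.dpi = 72; // changes the copy only
console.log(active.dpi, snapshot.dpi, active.ocrLanguage); // 300 72 en
```

Because the copy is shallow, nested objects such as a debug configuration would still be shared between the copy and the original in a sketch like this.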
parse()
parse(input, quiet?): Promise<ParseResult>
Defined in: core/parser.ts:100
Parse a document and return the extracted text, page data, and optionally structured JSON.
Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before parsing if the required system tools are installed.
Parameters
input
LiteParseInput
A file path, Buffer, or Uint8Array containing document bytes.
When given raw bytes, PDF data is parsed directly with zero disk I/O.
Non-PDF bytes are written to a temp file for format conversion.
quiet?
boolean = false
If true, suppresses progress logging to stderr.
Returns
Promise<ParseResult>
Parsed document data including text, per-page info, and optional JSON.
Throws
Error if the file cannot be found, converted, or parsed.
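This reference does not say how LiteParse decides whether raw bytes are PDF data. A caller who wants to predict which path a buffer will take could check for the standard "%PDF-" file header. The looksLikePdf helper below is hypothetical and is not part of the library.

```typescript
// The five ASCII bytes "%PDF-" that open a PDF file header.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper: true when the buffer starts with the PDF header.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((value, index) => bytes[index] === value);
}

const pdfBytes = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]); // "%PDF-1.7"
const zipBytes = new Uint8Array([0x50, 0x4b, 0x03, 0x04]); // "PK\x03\x04", e.g. a DOCX container
console.log(looksLikePdf(pdfBytes), looksLikePdf(zipBytes)); // true false
```

A check like this is only a heuristic: some PDF files carry a few stray bytes before the header, and the library may use a different detection method.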
screenshot()
screenshot(input, pageNumbers?, quiet?): Promise<ScreenshotResult[]>
Defined in: core/parser.ts:235
Generate screenshots of PDF pages as image buffers.
Uses PDFium for high-quality rendering. Each page is returned as a ScreenshotResult with the raw image buffer and dimensions.
Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before rendering if the required system tools are installed. Text-based formats (TXT, CSV, etc.) cannot be screenshotted and will throw an error.
Parameters
input
LiteParseInput
A file path, Buffer, or Uint8Array containing document bytes.
pageNumbers?
number[]
1-indexed page numbers to screenshot. If omitted, all pages are rendered.
quiet?
boolean = false
If true, suppresses progress logging to stderr.
Returns
Promise<ScreenshotResult[]>
Array of screenshot results, one per rendered page.
Throws
Error if the input is a text-based format that cannot be rendered.
Error if the file cannot be found, converted, or rendered.
Interfaces
BoundingBox
Defined in: core/types.ts:281
An axis-aligned bounding box defined by its top-left and bottom-right corners.
All coordinates are in PDF points.
Deprecated
Use TextItem coordinates (x, y, width, height) instead. Will be removed in v2.0.
Properties
x1:
number
Defined in: core/types.ts:283
X coordinate of the top-left corner.
x2:
number
Defined in: core/types.ts:287
X coordinate of the bottom-right corner.
y1:
number
Defined in: core/types.ts:285
Y coordinate of the top-left corner.
y2:
number
Defined in: core/types.ts:289
Y coordinate of the bottom-right corner.
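Since BoundingBox is deprecated in favor of x, y, width, and height, a conversion between the two shapes can help during migration. The sketch below uses local SketchBox and SketchRect types that mirror the documented fields; the helper names are hypothetical. It relies on the documented convention that the origin is at the top-left and y increases downward.

```typescript
// Local mirrors of the documented shapes (not imported from the library).
interface SketchBox { x1: number; y1: number; x2: number; y2: number; }
interface SketchRect { x: number; y: number; width: number; height: number; }

// Top-left corner plus size gives the bottom-right corner.
function rectToBox(rect: SketchRect): SketchBox {
  return { x1: rect.x, y1: rect.y, x2: rect.x + rect.width, y2: rect.y + rect.height };
}

// Corner differences give the size.
function boxToRect(box: SketchBox): SketchRect {
  return { x: box.x1, y: box.y1, width: box.x2 - box.x1, height: box.y2 - box.y1 };
}

console.log(rectToBox({ x: 72, y: 100, width: 200, height: 12 }));
// { x1: 72, y1: 100, x2: 272, y2: 112 }
```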
GridDebugConfig
Defined in: processing/gridDebugLogger.ts:13
Configuration for grid projection debug logging.
When enabled, logs detailed information about how text elements are snapped, anchored, and projected during grid layout. Use filters to narrow output to specific elements you’re investigating.
Properties
enabled
enabled:
boolean
Defined in: processing/gridDebugLogger.ts:18
Enable debug logging for grid projection.
Default Value
false
lineFilter?
optional lineFilter: number[]
Defined in: processing/gridDebugLogger.ts:29
Only log elements on these line indices (0-based within the page).
outputPath?
optional outputPath: string
Defined in: processing/gridDebugLogger.ts:44
Write log output to this file path instead of stderr. If not set, output goes to stderr.
pageFilter?
optional pageFilter: number
Defined in: processing/gridDebugLogger.ts:34
Only log elements on this page number (1-indexed).
regionFilter?
optional regionFilter: object
Defined in: processing/gridDebugLogger.ts:39
Only log elements within this bounding region (PDF coordinates).
x1:
number
x2:
number
y1:
number
y2:
number
textFilter?
optional textFilter: string[]
Defined in: processing/gridDebugLogger.ts:24
Only log elements whose text contains one of these substrings (case-insensitive). If empty, all elements are logged.
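The textFilter rule stated above (case-insensitive substring match, with an empty list matching everything) can be sketched as a small predicate. matchesTextFilter is a hypothetical name used for illustration, not a function exported by the library.

```typescript
// Hypothetical predicate mirroring the documented textFilter rule.
function matchesTextFilter(text: string, textFilter: string[] = []): boolean {
  // An empty filter list means every element is logged.
  if (textFilter.length === 0) return true;
  const haystack = text.toLowerCase();
  // Case-insensitive: compare lowercased text against lowercased substrings.
  return textFilter.some((needle) => haystack.includes(needle.toLowerCase()));
}

console.log(matchesTextFilter("TOTAL REVENUE", ["total"])); // true
console.log(matchesTextFilter("Subtotal", ["Revenue"]));    // false
console.log(matchesTextFilter("anything", []));             // true
```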
trace?
optional trace: boolean
Defined in: processing/gridDebugLogger.ts:69
Enable trace mode for detailed render decision logging. When enabled, each render logs the full decision chain: initial targetX, lineMax computation, forward anchor checks, and which factor was the binding constraint. Forward anchor mutations are also traced with their triggering item. Respects textFilter/lineFilter/pageFilter.
Default Value
false
visualize?
optional visualize: boolean
Defined in: processing/gridDebugLogger.ts:52
Generate PNG visualizations of the grid projection showing text boxes color-coded by snap type (left/right/center/floating/flowing) with anchor lines overlaid.
Default Value
false
visualizePath?
optional visualizePath: string
Defined in: processing/gridDebugLogger.ts:59
Directory to save visualization PNGs. Each page produces a file named page-{N}-grid.png.
Default Value
"./debug-output"
JsonTextItem
Defined in: core/types.ts:316
A text element from the JSON output with position, size, and font metadata.
Properties
confidence?
optional confidence: number
Defined in: core/types.ts:332
OCR confidence (null if OCR was not used).
fontName?
optional fontName: string
Defined in: core/types.ts:328
Font name.
fontSize?
optional fontSize: number
Defined in: core/types.ts:330
Font size in PDF points.
height
height:
number
Defined in: core/types.ts:326
Height of the text item in PDF points.
text:
string
Defined in: core/types.ts:318
The text content of this item.
width:
number
Defined in: core/types.ts:324
Width of the text item in PDF points.
x:
number
Defined in: core/types.ts:320
X coordinate of the top-left corner, in PDF points.
y:
number
Defined in: core/types.ts:322
Y coordinate of the top-left corner, in PDF points.
LiteParseConfig
Defined in: core/types.ts:36
Configuration options for the LiteParse parser.
All fields have sensible defaults. Pass a Partial<LiteParseConfig> to the
constructor to override only the options you need.
Example
const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  dpi: 300,
  outputFormat: "json",
});
Properties
debug?
optional debug: GridDebugConfig
Defined in: core/types.ts:160
Debug configuration for grid projection. When enabled, logs detailed information about how text elements are snapped, anchored, and projected. Can also generate visual PNG overlays of the projection.
Example
const parser = new LiteParse({
  debug: {
    enabled: true,
    textFilter: ["Total", "Revenue"],
    pageFilter: 2,
    visualize: true,
    visualizePath: "./debug-output",
  },
});
dpi:
number
Defined in: core/types.ts:100
DPI (dots per inch) for rendering pages to images. Higher values improve OCR accuracy but increase processing time and memory usage.
Default Value
150
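To get a feel for how dpi affects image size and memory, recall that a PDF point is 1/72 inch, so a length in points scales to pixels by dpi / 72. The sketch below assumes the renderer applies exactly that scale and rounds to whole pixels; the actual rounding used by LiteParse is not documented here, and pointsToPixels is a hypothetical helper.

```typescript
// One PDF point is 1/72 inch.
const POINTS_PER_INCH = 72;

// Hypothetical helper: pixel length of a span measured in PDF points at a given DPI.
function pointsToPixels(points: number, dpi: number): number {
  return Math.round((points * dpi) / POINTS_PER_INCH);
}

const letter = { width: 612, height: 792 }; // US Letter in PDF points
console.log(pointsToPixels(letter.width, 150), pointsToPixels(letter.height, 150)); // 1275 1650
console.log(pointsToPixels(letter.width, 300), pointsToPixels(letter.height, 300)); // 2550 3300
```

Doubling the DPI doubles each dimension, so the pixel count (and roughly the memory per rendered page) grows fourfold.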
maxPages
maxPages:
number
Defined in: core/types.ts:85
Maximum number of pages to parse from the document.
Default Value
1000
numWorkers
numWorkers:
number
Defined in: core/types.ts:78
Number of pages to OCR in parallel. Higher values use more memory but process faster on multi-core machines.
Default Value
CPU cores - 1 (minimum 1)
ocrEnabled
ocrEnabled:
boolean
Defined in: core/types.ts:51
Whether to run OCR on pages with little or no native text. When enabled, LiteParse selectively OCRs only images and text-sparse regions.
Default Value
true
ocrLanguage
ocrLanguage:
string | string[]
Defined in: core/types.ts:43
OCR language code(s). Uses ISO 639-3 codes for Tesseract (e.g., "eng", "fra")
or ISO 639-1 for HTTP OCR servers (e.g., "en", "fr").
Default Value
"en"
ocrServerUrl?
optional ocrServerUrl: string
Defined in: core/types.ts:59
URL of an HTTP OCR server implementing the LiteParse OCR API. If not provided, the built-in Tesseract.js engine is used.
outputFormat
outputFormat:
OutputFormat
Defined in: core/types.ts:107
Output format for parsed results.
Default Value
"json"
password?
optional password: string
Defined in: core/types.ts:140
Password for opening encrypted or password-protected PDFs and office documents.
Default Value
undefined
preciseBoundingBox
preciseBoundingBox:
boolean
Defined in: core/types.ts:118
Calculate precise bounding boxes for each text line. Disable for faster parsing when bounding boxes aren’t needed.
Deprecated
Controls the deprecated boundingBoxes output. Will be removed in v2.0.
Text item coordinates (x, y, width, height) are always present regardless.
Default Value
true
preserveLayoutAlignmentAcrossPages
preserveLayoutAlignmentAcrossPages:
boolean
Defined in: core/types.ts:132
Maintain consistent text alignment across page boundaries.
Default Value
false
preserveVerySmallText
preserveVerySmallText:
boolean
Defined in: core/types.ts:125
Preserve very small text that would normally be filtered out.
Default Value
false
targetPages?
optional targetPages: string
Defined in: core/types.ts:92
Specific pages to parse, as a comma-separated string of page numbers and ranges.
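To make the page-spec format concrete, the sketch below expands such a string into a sorted list of page numbers. It is an illustration only: expandPageSpec is a hypothetical helper, and the choices to treat pages as 1-indexed, merge duplicates, and throw on malformed segments are assumptions rather than documented library behavior.

```typescript
// Hypothetical helper: expand a spec like "1-3,10" into [1, 2, 3, 10].
function expandPageSpec(spec: string): number[] {
  const pages = new Set<number>();
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    if (part === "") continue; // tolerate stray commas
    // A segment is either a single number or a "start-end" range.
    const match = /^(\d+)(?:-(\d+))?$/.exec(part);
    if (match === null) throw new Error(`Invalid page spec segment: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${part}"`);
    for (let page = start; page <= end; page++) pages.add(page);
  }
  // The Set removes duplicates; sort numerically for a stable order.
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandPageSpec("1-3,10,15-17"));
// [ 1, 2, 3, 10, 15, 16, 17 ]
```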
Example
"1-5,10,15-20"
tessdataPath?
optional tessdataPath: string
Defined in: core/types.ts:70
Path to a directory containing Tesseract .traineddata files.
Used as both the language data source and cache directory for Tesseract.js.
If not set, falls back to the TESSDATA_PREFIX environment variable.
If neither is set, Tesseract.js downloads data from cdn.jsdelivr.net.
Example
/opt/tessdata
MarkupData
Defined in: core/types.ts:207
Markup annotation data associated with a text item.
Properties
highlight?
optional highlight: string
Defined in: core/types.ts:209
Highlight color (e.g., "yellow", "#FFFF00"), or undefined if not highlighted.
squiggly?
optional squiggly: boolean
Defined in: core/types.ts:213
Whether the text has a squiggly underline.
strikeout?
optional strikeout: boolean
Defined in: core/types.ts:215
Whether the text is struck out.
underline?
optional underline: boolean
Defined in: core/types.ts:211
Whether the text is underlined.
ParsedPage
Defined in: core/types.ts:295
Parsed data for a single page of a document.
Properties
boundingBoxes?
optional boundingBoxes: BoundingBox[]
Defined in: core/types.ts:310
Deprecated
Use TextItem coordinates instead. Will be removed in v2.0. Present when LiteParseConfig.preciseBoundingBox is enabled.
height
height:
number
Defined in: core/types.ts:301
Page height in PDF points.
pageNum
pageNum:
number
Defined in: core/types.ts:297
1-indexed page number.
text:
string
Defined in: core/types.ts:303
Full text content of the page with spatial layout preserved.
textItems
textItems:
TextItem[]
Defined in: core/types.ts:305
Individual text elements extracted from the page.
width:
number
Defined in: core/types.ts:299
Page width in PDF points.
ParseResult
Defined in: core/types.ts:376
The result of parsing a document with LiteParse.parse.
Properties
optional json: ParseResultJson
Defined in: core/types.ts:382
Structured JSON data. Present when LiteParseConfig.outputFormat is "json".
pages:
ParsedPage[]
Defined in: core/types.ts:378
Per-page parsed data.
text:
string
Defined in: core/types.ts:380
Full document text, concatenated from all pages.
ParseResultJson
Defined in: core/types.ts:353
Structured JSON representation of parsed document data.
Returned when LiteParseConfig.outputFormat is "json".
Properties
pages:
object[]
Defined in: core/types.ts:355
Array of page data.
boundingBoxes
boundingBoxes:
BoundingBox[]
Deprecated
Use textItems coordinates instead. Will be removed in v2.0.
height
height:
number
Page height in PDF points.
page:
number
1-indexed page number.
text:
string
Full text content of the page.
textItems
textItems:
JsonTextItem[]
Individual text elements with position and font metadata.
width:
number
Page width in PDF points.
ScreenshotResult
Defined in: core/types.ts:388
The result of generating a screenshot with LiteParse.screenshot.
Properties
height
height:
number
Defined in: core/types.ts:394
Image height in pixels.
imageBuffer
imageBuffer:
Buffer
Defined in: core/types.ts:396
Raw image data as a Node.js Buffer (PNG or JPG).
imagePath?
optional imagePath: string
Defined in: core/types.ts:398
File path if the screenshot was saved to disk.
pageNum
pageNum:
number
Defined in: core/types.ts:390
1-indexed page number.
width:
number
Defined in: core/types.ts:392
Image width in pixels.
SearchItemsOptions
Defined in: core/types.ts:338
Options for searchItems.
Properties
caseSensitive?
optional caseSensitive: boolean
Defined in: core/types.ts:346
Whether the search should be case-sensitive.
Default Value
false
phrase
phrase:
string
Defined in: core/types.ts:340
Find text items containing this phrase. Matches can span multiple adjacent items.
TextItem
Defined in: core/types.ts:169
An individual text element extracted from a page, with position, size, and font metadata.
Coordinates use the PDF coordinate system where the origin is at the top-left of the page, x increases to the right, and y increases downward.
Properties
confidence?
optional confidence: number
Defined in: core/types.ts:201
Confidence score from 0.0 to 1.0. Native PDF text defaults to 1.0, OCR text reflects engine confidence.
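One common use of the confidence score is dropping low-confidence OCR text before further processing. The sketch below works on a local ScoredItem type that mirrors only the str and confidence fields; keepConfident is a hypothetical helper, and treating a missing score as 1.0 is an assumption based on the stated default for native PDF text.

```typescript
// Local mirror of the two TextItem fields used here (not imported from the library).
interface ScoredItem {
  str: string;
  confidence?: number;
}

// Hypothetical helper: keep items at or above the threshold.
// Assumption: an absent confidence is treated as 1.0, like native PDF text.
function keepConfident(items: ScoredItem[], minConfidence: number): ScoredItem[] {
  return items.filter((item) => (item.confidence ?? 1.0) >= minConfidence);
}

const sample: ScoredItem[] = [
  { str: "Invoice", confidence: 1.0 },
  { str: "T0tal", confidence: 0.42 },
  { str: "Amount" },
];
console.log(keepConfident(sample, 0.8).map((item) => item.str)); // [ 'Invoice', 'Amount' ]
```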
fontName?
optional fontName: string
Defined in: core/types.ts:185
Font name (e.g., "Helvetica", "Times-Roman", "OCR" for OCR-detected text).
fontSize?
optional fontSize: number
Defined in: core/types.ts:187
Font size in PDF points.
h:
number
Defined in: core/types.ts:183
Alias for height.
height
height:
number
Defined in: core/types.ts:179
Height of the text item in PDF points.
markup?
optional markup: MarkupData
Defined in: core/types.ts:195
Markup annotations (highlights, underlines, etc.) applied to this text.
optional r: number
Defined in: core/types.ts:189
Rotation angle in degrees. One of 0, 90, 180, or 270.
optional rx: number
Defined in: core/types.ts:191
X coordinate after rotation transformation.
optional ry: number
Defined in: core/types.ts:193
Y coordinate after rotation transformation.
str:
string
Defined in: core/types.ts:171
The text content of this item.
w:
number
Defined in: core/types.ts:181
Alias for width.
width:
number
Defined in: core/types.ts:177
Width of the text item in PDF points.
x:
number
Defined in: core/types.ts:173
X coordinate of the top-left corner, in PDF points.
y:
number
Defined in: core/types.ts:175
Y coordinate of the top-left corner, in PDF points.
Type Aliases
LiteParseInput
LiteParseInput = string | Buffer | Uint8Array
Defined in: core/types.ts:18
Accepted input types for LiteParse.parse and LiteParse.screenshot.
string: A file path to a document on disk.
Buffer | Uint8Array: Raw file bytes (PDF bytes go straight to the parser with zero disk I/O; non-PDF bytes are written to a temp file for format conversion).
OutputFormat
OutputFormat = "json" | "text"
Defined in: core/types.ts:9
Functions
searchItems()
searchItems(items, options): JsonTextItem[]
Defined in: processing/searchItems.ts:26
Search text items for matches, returning synthetic merged items for each match.
For phrase searches, consecutive text items are concatenated and searched. When a phrase spans multiple items, the result is a single merged item with combined bounding box and the matched text. Font metadata is taken from the first matched item.
Parameters
items
options
SearchItemsOptions
Returns
JsonTextItem[]
Example
import { LiteParse, searchItems } from "@llamaindex/liteparse";
const parser = new LiteParse({ outputFormat: "json" });
const result = await parser.parse("report.pdf");
for (const page of result.json?.pages ?? []) {
  const matches = searchItems(page.textItems, { phrase: "0°C to 70°C" });
  for (const match of matches) {
    console.log(`Found "${match.text}" at (${match.x}, ${match.y})`);
  }
}
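For readers curious how a phrase can match across several adjacent items, the following sketch shows one possible approach. It is not the library's implementation: findPhrase and SketchItem are hypothetical names, joining item texts with a single space is an assumption, and font metadata is left out entirely.

```typescript
// Local mirror of the documented JsonTextItem geometry fields (not imported from the library).
interface SketchItem {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical phrase search: joins item texts with single spaces, finds the
// phrase in the joined string, and merges the boxes of every item it touches.
function findPhrase(items: SketchItem[], phrase: string, caseSensitive = false): SketchItem[] {
  const normalize = (value: string): string => (caseSensitive ? value : value.toLowerCase());
  const needle = normalize(phrase);
  if (needle === "") return [];

  // Record which character range of the joined string each item occupies.
  let joined = "";
  const ranges: Array<{ start: number; end: number }> = [];
  items.forEach((item, index) => {
    if (index > 0) joined += " ";
    const start = joined.length;
    joined += normalize(item.text);
    ranges.push({ start, end: joined.length });
  });

  const results: SketchItem[] = [];
  let from = joined.indexOf(needle);
  while (from !== -1) {
    const to = from + needle.length;
    // Keep every item whose character range overlaps the match [from, to).
    const hit = items.filter((_item, index) => {
      const range = ranges[index];
      return range !== undefined && range.start < to && range.end > from;
    });
    if (hit.length > 0) {
      const left = Math.min(...hit.map((item) => item.x));
      const top = Math.min(...hit.map((item) => item.y));
      const right = Math.max(...hit.map((item) => item.x + item.width));
      const bottom = Math.max(...hit.map((item) => item.y + item.height));
      results.push({ text: phrase, x: left, y: top, width: right - left, height: bottom - top });
    }
    from = joined.indexOf(needle, to); // continue after this match (non-overlapping)
  }
  return results;
}

const words: SketchItem[] = [
  { text: "0°C", x: 10, y: 20, width: 30, height: 12 },
  { text: "to", x: 45, y: 20, width: 12, height: 12 },
  { text: "70°C", x: 62, y: 20, width: 36, height: 12 },
];
// One merged item covering all three words: x 10, y 20, width 88, height 12.
console.log(findPhrase(words, "0°C to 70°C"));
```

The merged box is the union of the matched items' rectangles, which corresponds to the "combined bounding box" described for searchItems above.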