API Reference
API reference for the @llamaindex/liteparse TypeScript library.
LiteParse — open-source PDF parsing with spatial text extraction, OCR, and bounding boxes.
Example
import { LiteParse } from "@llamaindex/liteparse";

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse("document.pdf");
console.log(result.text);

Classes
LiteParse
Defined in: parser.ts:47
Main document parser class. Handles PDF parsing, OCR, format conversion, and screenshot generation.
Examples
import { LiteParse } from "@llamaindex/liteparse";

// Basic text extraction
const parser = new LiteParse();
const result = await parser.parse("document.pdf");
console.log(result.text);

// JSON output with bounding boxes, rendered at a higher DPI
const parser = new LiteParse({ outputFormat: "json", dpi: 300 });
const result = await parser.parse("document.pdf");
// result.json is optional, so guard against it being undefined
for (const page of result.json?.pages ?? []) {
  console.log(`Page ${page.page}: ${page.boundingBoxes.length} bounding boxes`);
}

// OCR through an external HTTP OCR server
const parser = new LiteParse({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
});
const result = await parser.parse("scanned-document.pdf");

Constructors
Constructor
new LiteParse(userConfig?): LiteParse
Defined in: parser.ts:57
Create a new LiteParse instance.
Parameters
userConfig?
Partial<LiteParseConfig> = {}
Partial configuration to override defaults. See LiteParseConfig for all options.
Returns
LiteParse
Methods
getConfig()
getConfig(): LiteParseConfig
Defined in: parser.ts:400
Get a copy of the current configuration, including defaults merged with user overrides.
Returns
A shallow copy of the active LiteParseConfig.
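The way constructor overrides combine with defaults can be pictured as a shallow merge. The sketch below is an assumption about the behavior described above, not the library's implementation: `SketchConfig`, `DEFAULTS`, and `mergeConfig` are hypothetical names, and only a few of the documented options are mirrored.

```typescript
// Hypothetical subset of the documented options; not the library's own type.
interface SketchConfig {
  ocrEnabled: boolean;
  dpi: number;
  maxPages: number;
  outputFormat: "json" | "text";
}

// Default values as listed in this reference.
const DEFAULTS: SketchConfig = {
  ocrEnabled: true,
  dpi: 150,
  maxPages: 1000,
  outputFormat: "json",
};

// Shallow merge: user-supplied keys win, every other key keeps its default.
// A new object is returned each time, so callers never share state.
function mergeConfig(userConfig: Partial<SketchConfig> = {}): SketchConfig {
  return { ...DEFAULTS, ...userConfig };
}

const merged = mergeConfig({ dpi: 300 });
console.log(merged.dpi, merged.maxPages); // 300 1000
```

Because the copy is shallow, mutating the returned object does not affect the defaults, which matches the "shallow copy" wording used for getConfig().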
parse()
parse(filePath, quiet?): Promise<ParseResult>
Defined in: parser.ts:87
Parse a document and return the extracted text, page data, and optionally structured JSON.
Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before parsing if the required system tools are installed.
Parameters
filePath
string
Path to the document file.
quiet?
boolean = false
If true, suppresses progress logging to stderr.
Returns
Promise<ParseResult>
Parsed document data including text, per-page info, and optional JSON.
Throws
Error if the file cannot be found, converted, or parsed.
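Because parse() rejects when the file cannot be found, converted, or parsed, callers will usually wrap it in try/catch. The sketch below relies only on the documented signature. `DocumentParser`, `tryParse`, and `stubParser` are hypothetical names, introduced so the example runs without the library installed.

```typescript
// Minimal shape of the documented parse() method. A LiteParse instance is
// assumed to satisfy it structurally, based on the signature above.
interface DocumentParser {
  parse(filePath: string, quiet?: boolean): Promise<{ text: string }>;
}

// Returns the extracted text, or null if parsing rejected.
async function tryParse(parser: DocumentParser, filePath: string): Promise<string | null> {
  try {
    const result = await parser.parse(filePath, true); // quiet = true
    return result.text;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    console.error(`Failed to parse ${filePath}: ${message}`);
    return null;
  }
}

// Stub standing in for a real parser so the sketch is self-contained.
const stubParser: DocumentParser = {
  async parse(filePath: string) {
    if (!/\.pdf$/.test(filePath)) throw new Error("unsupported file");
    return { text: "hello" };
  },
};
```

With the real library, `new LiteParse()` would be passed in place of `stubParser`.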
screenshot()
screenshot(filePath, pageNumbers?, quiet?): Promise<ScreenshotResult[]>
Defined in: parser.ts:188
Generate screenshots of PDF pages as image buffers.
Uses PDFium for high-quality rendering. Each page is returned as a ScreenshotResult with the raw image buffer and dimensions.
Parameters
filePath
string
Path to the PDF file.
pageNumbers?
number[]
1-indexed page numbers to screenshot. If omitted, all pages are rendered.
quiet?
boolean = false
If true, suppresses progress logging to stderr.
Returns
Promise<ScreenshotResult[]>
Array of screenshot results, one per rendered page.
Interfaces
BoundingBox
Defined in: types.ts:224
An axis-aligned bounding box defined by its top-left and bottom-right corners.
All coordinates are in PDF points.
Properties
x1: number
Defined in: types.ts:226
X coordinate of the top-left corner.
x2: number
Defined in: types.ts:230
X coordinate of the bottom-right corner.
y1: number
Defined in: types.ts:228
Y coordinate of the top-left corner.
y2: number
Defined in: types.ts:232
Y coordinate of the bottom-right corner.
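As a small illustration of working with the two-corner representation, the sketch below derives a box's size and tests whether a point falls inside it. The local `BoundingBox` interface mirrors the documented one. `boxSize` and `containsPoint` are hypothetical helpers, not part of the library, and they assume x1 <= x2 and y1 <= y2.

```typescript
// Mirrors the BoundingBox interface documented above.
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Width and height in PDF points, derived from the two corners.
function boxSize(box: BoundingBox): { width: number; height: number } {
  return { width: box.x2 - box.x1, height: box.y2 - box.y1 };
}

// True if the point (x, y) lies inside the box, edges included.
function containsPoint(box: BoundingBox, x: number, y: number): boolean {
  return x >= box.x1 && x <= box.x2 && y >= box.y1 && y <= box.y2;
}

const box: BoundingBox = { x1: 72, y1: 100, x2: 272, y2: 112 };
console.log(boxSize(box)); // { width: 200, height: 12 }
```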
LiteParseConfig
Defined in: types.ts:25
Configuration options for the LiteParse parser.
All fields have sensible defaults. Pass a Partial<LiteParseConfig> to the
constructor to override only the options you need.
Example
const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  dpi: 300,
  outputFormat: "json",
});

Properties
dpi: number
Defined in: types.ts:78
DPI (dots per inch) for rendering pages to images. Higher values improve OCR accuracy but increase processing time and memory usage.
Default Value
150
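To see why higher DPI values cost more memory, it helps to estimate the rendered image size. One PDF point is 1/72 inch, so a dimension in points maps to roughly points × dpi / 72 pixels. The sketch below applies that arithmetic to a US Letter page (612 × 792 points). `pointsToPixels` is a hypothetical helper, and the rounding is an assumption; the library's renderer may round differently.

```typescript
const POINTS_PER_INCH = 72;

// Convert a length in PDF points to pixels at the given DPI.
function pointsToPixels(points: number, dpi: number): number {
  return Math.round((points * dpi) / POINTS_PER_INCH);
}

const letter = { width: 612, height: 792 }; // US Letter in PDF points
for (const dpi of [150, 300]) {
  const w = pointsToPixels(letter.width, dpi);
  const h = pointsToPixels(letter.height, dpi);
  console.log(`${dpi} DPI: ${w} x ${h} px`);
}
// 150 DPI: 1275 x 1650 px
// 300 DPI: 2550 x 3300 px
```

Doubling the DPI quadruples the pixel count per page, which is where the extra processing time and memory go.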
maxPages
maxPages: number
Defined in: types.ts:63
Maximum number of pages to parse from the document.
Default Value
1000
numWorkers
numWorkers: number
Defined in: types.ts:56
Number of pages to OCR in parallel. Higher values use more memory but process faster on multi-core machines.
Default Value
CPU cores - 1 (minimum 1)
ocrEnabled
ocrEnabled: boolean
Defined in: types.ts:40
Whether to run OCR on pages with little or no native text. When enabled, LiteParse selectively OCRs only images and text-sparse regions.
Default Value
true
ocrLanguage
ocrLanguage: string | string[]
Defined in: types.ts:32
OCR language code(s). Uses ISO 639-3 codes for Tesseract (e.g., "eng", "fra") or ISO 639-1 codes for HTTP OCR servers (e.g., "en", "fr").
Default Value
"en"
ocrServerUrl?
optional ocrServerUrl: string
Defined in: types.ts:48
URL of an HTTP OCR server implementing the LiteParse OCR API. If not provided, the built-in Tesseract.js engine is used.
outputFormat
outputFormat: OutputFormat
Defined in: types.ts:85
Output format for parsed results.
Default Value
"json"
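When the output format comes from untrusted input such as a command-line flag or environment variable, it may be worth validating the string before passing it to the constructor. The sketch below redeclares the documented `OutputFormat` union locally. `isOutputFormat` and `parseOutputFormat` are hypothetical helpers, not library exports.

```typescript
// Local copy of the documented union type.
type OutputFormat = "json" | "text";

const OUTPUT_FORMATS: readonly string[] = ["json", "text"];

// Type guard: narrows an arbitrary string to OutputFormat.
function isOutputFormat(value: string): value is OutputFormat {
  return OUTPUT_FORMATS.indexOf(value) !== -1;
}

// Falls back to the documented default ("json") for unknown values.
function parseOutputFormat(value: string, fallback: OutputFormat = "json"): OutputFormat {
  return isOutputFormat(value) ? value : fallback;
}

console.log(parseOutputFormat("text")); // text
console.log(parseOutputFormat("yaml")); // json
```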
preciseBoundingBox
preciseBoundingBox: boolean
Defined in: types.ts:93
Calculate precise bounding boxes for each text line. Disable for faster parsing when bounding boxes aren’t needed.
Default Value
true
preserveLayoutAlignmentAcrossPages
preserveLayoutAlignmentAcrossPages: boolean
Defined in: types.ts:107
Maintain consistent text alignment across page boundaries.
Default Value
false
preserveVerySmallText
preserveVerySmallText: boolean
Defined in: types.ts:100
Preserve very small text that would normally be filtered out.
Default Value
false
targetPages?
optional targetPages: string
Defined in: types.ts:70
Specific pages to parse, as a comma-separated string of page numbers and ranges.
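To make the range syntax concrete, the sketch below expands such a string into individual page numbers. This is only an illustration of the format: `expandPageRanges` is a hypothetical helper, and how the library itself parses or validates the string is not documented here.

```typescript
// Hypothetical helper: expands a comma-separated list of page numbers and
// ranges into sorted, unique page numbers.
function expandPageRanges(spec: string): number[] {
  const pages: number[] = [];
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${token}"`);
    for (let page = start; page <= end; page++) pages.push(page);
  }
  pages.sort((a, b) => a - b);
  // Drop duplicates produced by overlapping ranges.
  return pages.filter((page, i) => i === 0 || page !== pages[i - 1]);
}

console.log(expandPageRanges("1-3,10,15-16")); // [ 1, 2, 3, 10, 15, 16 ]
```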
Example
"1-5,10,15-20"
MarkupData
Defined in: types.ts:152
Markup annotation data associated with a text item.
Properties
highlight?
optional highlight: string
Defined in: types.ts:154
Highlight color (e.g., "yellow", "#FFFF00"), or undefined if not highlighted.
squiggly?
optional squiggly: boolean
Defined in: types.ts:158
Whether the text has a squiggly underline.
strikeout?
optional strikeout: boolean
Defined in: types.ts:160
Whether the text is struck out.
underline?
optional underline: boolean
Defined in: types.ts:156
Whether the text is underlined.
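Since every MarkupData field is optional, consumers need to check each one before using it. The sketch below turns a markup object into a list of labels. The local `MarkupData` interface mirrors the documented one, and `describeMarkup` is a hypothetical helper.

```typescript
// Mirrors the MarkupData interface documented above.
interface MarkupData {
  highlight?: string;
  underline?: boolean;
  squiggly?: boolean;
  strikeout?: boolean;
}

// Lists the annotations present on a text item; empty if there are none.
function describeMarkup(markup?: MarkupData): string[] {
  if (!markup) return [];
  const labels: string[] = [];
  if (markup.highlight !== undefined) labels.push(`highlight:${markup.highlight}`);
  if (markup.underline) labels.push("underline");
  if (markup.squiggly) labels.push("squiggly");
  if (markup.strikeout) labels.push("strikeout");
  return labels;
}

console.log(describeMarkup({ highlight: "yellow", strikeout: true }));
// [ 'highlight:yellow', 'strikeout' ]
```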
ParsedPage
Defined in: types.ts:238
Parsed data for a single page of a document.
Properties
boundingBoxes?
optional boundingBoxes: BoundingBox[]
Defined in: types.ts:250
Bounding boxes for text lines. Present when LiteParseConfig.preciseBoundingBox is enabled.
height
height: number
Defined in: types.ts:244
Page height in PDF points.
pageNum
pageNum: number
Defined in: types.ts:240
1-indexed page number.
text: string
Defined in: types.ts:246
Full text content of the page with spatial layout preserved.
textItems
textItems: TextItem[]
Defined in: types.ts:248
Individual text elements extracted from the page.
width: number
Defined in: types.ts:242
Page width in PDF points.
ParseResult
Defined in: types.ts:286
The result of parsing a document with LiteParse.parse.
Properties
optional json: ParseResultJson
Defined in: types.ts:292
Structured JSON data. Present when LiteParseConfig.outputFormat is "json".
pages: ParsedPage[]
Defined in: types.ts:288
Per-page parsed data.
text: string
Defined in: types.ts:290
Full document text, concatenated from all pages.
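A typical use of the per-page data is locating where something appears in a document. The sketch below searches page text for a term and returns the matching page numbers. `PageLike` and `ResultLike` mirror only the documented fields it needs, and `pagesContaining` is a hypothetical helper rather than a library function.

```typescript
// Minimal shapes mirroring the documented ParsedPage and ParseResult fields.
interface PageLike {
  pageNum: number;
  text: string;
}
interface ResultLike {
  pages: PageLike[];
  text: string;
}

// Returns the 1-indexed numbers of pages whose text contains the term
// (case-insensitive).
function pagesContaining(result: ResultLike, term: string): number[] {
  const needle = term.toLowerCase();
  return result.pages
    .filter((page) => page.text.toLowerCase().indexOf(needle) !== -1)
    .map((page) => page.pageNum);
}

// Hand-written sample data standing in for a real parse result.
const sample: ResultLike = {
  pages: [
    { pageNum: 1, text: "Invoice 2024" },
    { pageNum: 2, text: "Terms and conditions" },
    { pageNum: 3, text: "INVOICE summary" },
  ],
  text: "Invoice 2024\nTerms and conditions\nINVOICE summary",
};
console.log(pagesContaining(sample, "invoice")); // [ 1, 3 ]
```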
ParseResultJson
Defined in: types.ts:257
Structured JSON representation of parsed document data.
Returned when LiteParseConfig.outputFormat is "json".
Properties
pages: object[]
Defined in: types.ts:259
Array of page data. Each entry has the fields listed below.
boundingBoxes
boundingBoxes: BoundingBox[]
Bounding boxes for text lines.
height
height: number
Page height in PDF points.
page: number
1-indexed page number.
text: string
Full text content of the page.
textItems
textItems: object[]
Individual text elements with position and font metadata.
width: number
Page width in PDF points.
ScreenshotResult
Defined in: types.ts:298
The result of generating a screenshot with LiteParse.screenshot.
Properties
height
height: number
Defined in: types.ts:304
Image height in pixels.
imageBuffer
imageBuffer: Buffer
Defined in: types.ts:306
Raw image data as a Node.js Buffer (PNG or JPG).
imagePath?
optional imagePath: string
Defined in: types.ts:308
File path if the screenshot was saved to disk.
pageNum
pageNum: number
Defined in: types.ts:300
1-indexed page number.
width: number
Defined in: types.ts:302
Image width in pixels.
TextItem
Defined in: types.ts:116
An individual text element extracted from a page, with position, size, and font metadata.
Coordinates are measured in PDF points with the origin at the top-left of the page; x increases to the right and y increases downward.
Properties
fontName?
optional fontName: string
Defined in: types.ts:132
Font name (e.g., "Helvetica", "Times-Roman", "OCR" for OCR-detected text).
fontSize?
optional fontSize: number
Defined in: types.ts:134
Font size in PDF points.
h: number
Defined in: types.ts:130
Alias for height.
height
height: number
Defined in: types.ts:126
Height of the text item in PDF points.
markup?
optional markup: MarkupData
Defined in: types.ts:142
Markup annotations (highlights, underlines, etc.) applied to this text.
optional r: number
Defined in: types.ts:136
Rotation angle in degrees. One of 0, 90, 180, or 270.
optional rx: number
Defined in: types.ts:138
X coordinate after rotation transformation.
optional ry: number
Defined in: types.ts:140
Y coordinate after rotation transformation.
str: string
Defined in: types.ts:118
The text content of this item.
w: number
Defined in: types.ts:128
Alias for width.
width: number
Defined in: types.ts:124
Width of the text item in PDF points.
x: number
Defined in: types.ts:120
X coordinate of the top-left corner, in PDF points.
y: number
Defined in: types.ts:122
Y coordinate of the top-left corner, in PDF points.
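Because a TextItem stores a top-left corner plus a size while a BoundingBox stores two corners, converting between them is simple addition. The sketch below shows that conversion under the documented convention that y increases downward. `ItemGeometry` mirrors only the four fields used, `toBoundingBox` is a hypothetical helper, and rotation (r, rx, ry) is deliberately ignored.

```typescript
// The positional fields of a TextItem, as documented above.
interface ItemGeometry {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Mirrors the documented BoundingBox interface.
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Top-left corner plus size becomes top-left and bottom-right corners.
// Assumes an unrotated item; rotation fields are not taken into account.
function toBoundingBox(item: ItemGeometry): BoundingBox {
  return {
    x1: item.x,
    y1: item.y,
    x2: item.x + item.width,
    y2: item.y + item.height,
  };
}

console.log(toBoundingBox({ x: 72, y: 100, width: 200, height: 12 }));
// { x1: 72, y1: 100, x2: 272, y2: 112 }
```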
Type Aliases
OutputFormat
OutputFormat = "json" | "text"
Defined in: types.ts:7
Supported output formats for parsed documents.
"json": Structured JSON with per-page text items, bounding boxes, and metadata.
"text": Plain text with spatial layout preserved.