Library Usage
Use LiteParse programmatically from TypeScript or Python.
LiteParse can be used as a library in your own code, not just from the CLI. There are native packages for both TypeScript and Python.
TypeScript / Node.js usage
Section titled “TypeScript / Node.js usage”Install as a project dependency:
npm install @llamaindex/liteparse# orpnpm add @llamaindex/liteparseParsing a document
Section titled “Parsing a document”import { LiteParse } from "@llamaindex/liteparse";
const parser = new LiteParse({ ocrEnabled: true });const result = await parser.parse("document.pdf");
// Full document text with layout preservedconsole.log(result.text);
// Per-page datafor (const page of result.pages) { console.log(`Page ${page.pageNum}: ${page.textItems.length} text items`);}JSON output with bounding boxes
Section titled “JSON output with bounding boxes”Text items include spatial coordinates (x, y, width, height) in PDF points:
const parser = new LiteParse({ outputFormat: "json" });const result = await parser.parse("document.pdf");
for (const page of result.pages) { for (const item of page.textItems) { console.log(`[${item.x}, ${item.y}] ${item.width}×${item.height} "${item.text}"`); }}Configuration
Section titled “Configuration”Pass any config options to the constructor. You only need to specify what you want to override:
const parser = new LiteParse({ ocrEnabled: true, ocrServerUrl: "http://localhost:8828/ocr", ocrLanguage: "fra", dpi: 300, outputFormat: "json", targetPages: "1-10", password: "secret",});Buffer / Uint8Array input
Section titled “Buffer / Uint8Array input”You can pass raw bytes directly instead of a file path:
import { readFile } from "fs/promises";
const parser = new LiteParse();
// From a file readconst pdfBytes = await readFile("document.pdf");const result = await parser.parse(pdfBytes);
// From an HTTP responseconst response = await fetch("https://example.com/document.pdf");const buffer = Buffer.from(await response.arrayBuffer());const result2 = await parser.parse(buffer);Screenshots
Section titled “Screenshots”Generate page images as buffers — useful for sending to LLMs or saving to disk:
const parser = new LiteParse();const screenshots = parser.screenshot("document.pdf");
for (const shot of screenshots) { console.log(`Page ${shot.pageNum}: ${shot.width}x${shot.height}`); // shot.imageBuffer contains the raw PNG data}
// Screenshot specific pagesconst shots = parser.screenshot("document.pdf", [1, 2, 3]);Environment variables
Section titled “Environment variables”| Variable | Description |
|---|---|
TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. For offline/air-gapped environments. Also available as the tessdataPath config option. |
Python usage
Section titled “Python usage”Install the package from PyPI:
pip install liteparseParsing a document
Section titled “Parsing a document”from liteparse import LiteParse
parser = LiteParse()result = parser.parse("document.pdf")
# Full document textprint(result.text)
# Per-page datafor page in result.pages: print(f"Page {page.page_num}: {len(page.text_items)} text items")Configuration
Section titled “Configuration”All options are passed to the constructor:
parser = LiteParse( ocr_enabled=True, ocr_server_url="http://localhost:8828/ocr", ocr_language="fra", dpi=300, target_pages="1-5", password="secret",)result = parser.parse("document.pdf")Parsing from bytes
Section titled “Parsing from bytes”If you already have file contents in memory (e.g. from a web upload), pass them directly to parse():
with open("document.pdf", "rb") as f: pdf_bytes = f.read()
result = parser.parse(pdf_bytes)print(result.text)Screenshots
Section titled “Screenshots”screenshots = parser.screenshot("document.pdf", page_numbers=[1, 2, 3])for s in screenshots: print(f"Page {s.page_num}: {s.width}x{s.height}") # s.image_bytes contains PNG dataEnvironment variables
Section titled “Environment variables”| Variable | Description |
|---|---|
TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. For offline/air-gapped environments. Also available as the tessdata_path constructor option. |
Rust usage
Section titled “Rust usage”Add liteparse to your Cargo.toml:
[dependencies]liteparse = "2"tokio = { version = "1", features = ["rt-multi-thread", "macros"] }Parsing a document
Section titled “Parsing a document”use liteparse::{LiteParse, LiteParseConfig};
#[tokio::main]async fn main() -> Result<(), Box<dyn std::error::Error>> { let parser = LiteParse::new(LiteParseConfig::default()); let result = parser.parse("document.pdf").await?;
// Full document text println!("{}", result.text);
// Per-page data for page in &result.pages { println!("Page {}: {} text items", page.page_number, page.text_items.len()); } Ok(())}Parsing from bytes
Section titled “Parsing from bytes”Use PdfInput::Bytes when you have the PDF in memory:
use liteparse::{LiteParse, LiteParseConfig};use liteparse::types::PdfInput;
let parser = LiteParse::new(LiteParseConfig::default());let pdf_bytes = std::fs::read("document.pdf")?;let result = parser.parse_input(PdfInput::Bytes(pdf_bytes)).await?;Configuration
Section titled “Configuration”Override only the fields you need:
use liteparse::{LiteParse, LiteParseConfig, OutputFormat};
let config = LiteParseConfig { ocr_enabled: true, ocr_language: "fra".to_string(), dpi: 300.0, target_pages: Some("1-10".to_string()), output_format: OutputFormat::Json, password: Some("secret".to_string()), ..Default::default()};let parser = LiteParse::new(config);Screenshots
Section titled “Screenshots”Generate page images as PNG bytes:
let parser = LiteParse::new(LiteParseConfig::default());let screenshots = parser.screenshot("document.pdf", None).await?;
for shot in &screenshots { println!("Page {}: {}x{}", shot.page_num, shot.width, shot.height); // shot.image_bytes contains the raw PNG data}
// Screenshot specific pageslet shots = parser.screenshot("document.pdf", Some(vec![1, 2, 3])).await?;Custom OCR engine
Section titled “Custom OCR engine”Implement the OcrEngine trait to plug in your own OCR backend:
use liteparse::LiteParse;use liteparse::ocr::OcrEngine;use std::sync::Arc;
let parser = LiteParse::new(Default::default()) .with_ocr_engine(Arc::new(my_engine));Features
Section titled “Features”| Feature | Default | Description |
|---|---|---|
tesseract | Yes | Built-in Tesseract OCR via tesseract-rs. Disable with default-features = false if you only use HTTP OCR or no OCR. |
Browser (WASM) usage
Section titled “Browser (WASM) usage”Install the WASM package:
npm install @llamaindex/liteparse-wasmParsing a document
Section titled “Parsing a document”import init, { LiteParse } from "@llamaindex/liteparse-wasm";
// Initialize the WASM module (required once before use)await init();
const parser = new LiteParse({ ocrEnabled: false });
// Pass a Uint8Array (e.g. from fetch, File input, or drag-drop)const bytes = new Uint8Array(await file.arrayBuffer());const result = await parser.parse(bytes);
console.log(result.text); // full document textconsole.log(result.pages[0]); // per-page items with bboxesConfiguration
Section titled “Configuration”All options are optional and use camelCase:
const parser = new LiteParse({ ocrEnabled: true, ocrLanguage: "fra", dpi: 300, outputFormat: "json", targetPages: "1-10", maxPages: 100, password: "secret", quiet: false,});OCR in the browser
Section titled “OCR in the browser”The native Tesseract and HTTP OCR backends are not available in the browser. To use OCR, provide a JS-side engine with a recognize method:
const parser = new LiteParse({ ocrEnabled: true, ocrLanguage: "eng", ocrEngine: { /** * @param imageData - PNG-encoded image bytes * @param width - rendered page width in pixels * @param height - rendered page height in pixels * @param language - e.g. "eng" * @returns array of recognized text regions */ async recognize(imageData, width, height, language) { // e.g. call tesseract.js in a worker, or a remote OCR service return [ { text: "Hello", bbox: [10, 20, 80, 40], confidence: 0.98 }, ]; }, },});- Input must be a
Uint8Array— file paths are not supported in the browser. - The WASM module must be initialized with
await init()before creating aLiteParseinstance.