OCR Configuration
Configure OCR in LiteParse — built-in Tesseract, or bring your own via HTTP servers.
LiteParse uses OCR selectively — only on embedded images or pages where native text extraction didn’t find text. This keeps parsing fast while still capturing text from scanned pages and embedded images.
Built-in Tesseract (default)
Tesseract.js is bundled with LiteParse. The only setup is the automatic download of the Tesseract model files on first use. Just run:

lit parse document.pdf

If you are bundling LiteParse into a Docker container or server environment, you may want to pre-download the Tesseract files ahead of time by running the command above (or a similar one), which avoids network calls at runtime.
Language support
Specify the OCR language for better accuracy on non-English documents:

lit parse document.pdf --ocr-language fra # French
lit parse document.pdf --ocr-language deu # German
lit parse document.pdf --ocr-language jpn # Japanese

Tesseract uses ISO 639-3 language codes (eng, fra, deu, etc.).
Disabling OCR
If you don’t need OCR (pure native-text PDFs, or you don’t care about images), disable it for faster parsing:

lit parse document.pdf --no-ocr

HTTP OCR servers
For higher accuracy or GPU-accelerated OCR, you can point LiteParse at an HTTP OCR server. LiteParse ships with ready-to-use examples for popular OCR engines.
EasyOCR
# Start the EasyOCR server (requires Python)
git clone https://github.com/run-llama/liteparse.git
cd liteparse/ocr/easyocr
pip install -r requirements.txt
python server.py

# Parse with EasyOCR in another terminal
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

PaddleOCR
# Start the PaddleOCR server (requires Python)
git clone https://github.com/run-llama/liteparse.git
cd liteparse/ocr/paddleocr
pip install -r requirements.txt
python server.py

# Parse with PaddleOCR in another terminal
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

Parallel OCR workers
LiteParse OCRs multiple pages in parallel. By default, it uses one fewer worker than your CPU core count. Override this with:

lit parse document.pdf --num-workers 8

Lower the value if you need to slow down OCR requests to an external server; raise it if your OCR engine is GPU-accelerated and can handle more concurrency.
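The default described above amounts to simple arithmetic. The Python sketch below is illustrative only: LiteParse's actual implementation is not shown here, `default_num_workers` is a hypothetical name, and the floor of one worker on single-core machines is an assumption.

```python
import os


def default_num_workers(cpu_count=None):
    """Return one fewer worker than the CPU core count, never less than 1.

    Hypothetical helper; the minimum of 1 is an assumption, not documented behavior.
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores - 1)
```

A value computed this way could be passed to `--num-workers` explicitly when the default needs to be pinned in a deployment script.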
Custom OCR servers
You can integrate any OCR engine by implementing the LiteParse OCR API. Your server needs a single endpoint:

POST /ocr
Content-Type: multipart/form-data

Request fields:
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Image file (PNG, JPG, etc.) |
| language | string | No | ISO 639-1 language code (default: en) |
Response format:

{
  "results": [
    {
      "text": "recognized text",
      "bbox": [x1, y1, x2, y2],
      "confidence": 0.95
    }
  ]
}

Each result contains:
| Field | Type | Description |
|---|---|---|
| text | string | Recognized text |
| bbox | [x1, y1, x2, y2] | Bounding box in pixels. Origin is top-left, x goes right, y goes down |
| confidence | number | Score from 0.0 to 1.0 |
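A server's output can be checked against this shape before pointing LiteParse at it. The following is a minimal sketch in Python; `validate_ocr_response` is a hypothetical helper, not part of LiteParse, and its checks simply mirror the fields in the tables above.

```python
from numbers import Real


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_ocr_response(payload):
    """Return a list of problems found in an OCR response; empty means it looks valid.

    Hypothetical helper (not part of LiteParse) that mirrors the documented fields.
    """
    results = payload.get("results") if isinstance(payload, dict) else None
    if not isinstance(results, list):
        return ['top-level object must contain a "results" list']

    problems = []
    for i, item in enumerate(results):
        if not isinstance(item, dict):
            problems.append(f"results[{i}]: must be an object")
            continue
        if not isinstance(item.get("text"), str):
            problems.append(f"results[{i}]: text must be a string")

        bbox = item.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4
                and all(_is_number(v) for v in bbox)):
            problems.append(f"results[{i}]: bbox must be [x1, y1, x2, y2]")
        elif bbox[0] > bbox[2] or bbox[1] > bbox[3]:
            problems.append(f"results[{i}]: bbox must run top-left to bottom-right")

        conf = item.get("confidence")
        if not (_is_number(conf) and 0.0 <= conf <= 1.0):
            problems.append(f"results[{i}]: confidence must be between 0.0 and 1.0")
    return problems
```

Running this on the parsed JSON from a test request gives a quick list of anything that deviates from the documented format.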
Testing your server
# Quick test with curl
curl -X POST http://localhost:8080/ocr \
  -F "file=@test.png" \
  -F "language=en" | jq .

# Use with LiteParse
lit parse document.pdf --ocr-server-url http://localhost:8080/ocr

Common Gotchas
- Return {"results": []} if no text is detected
- Bounding boxes must be axis-aligned ([x1, y1, x2, y2], running from top-left to bottom-right)
- If your engine returns rotated boxes, convert to axis-aligned by taking min/max coordinates
- If your engine doesn’t provide confidence scores, return 1.0
- Results should be in reading order (top-to-bottom, left-to-right)
- Cache OCR models in memory rather than reloading per request
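Several of these points can be handled in one post-processing step before the response is built. The sketch below assumes a hypothetical engine output of (text, corner points, optional score) tuples; `normalize_detections`, `to_axis_aligned`, and the row-bucketing heuristic for reading order are illustrative and not part of LiteParse.

```python
def to_axis_aligned(points):
    """Collapse a polygon (e.g. four rotated corners) into [x1, y1, x2, y2]."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs), max(ys)]


def normalize_detections(detections, row_height=20):
    """Convert raw detections into the response shape.

    `detections` is an assumed input format: (text, points, score) tuples,
    where score may be None. `row_height` is an arbitrary pixel tolerance
    used to group boxes into rows for reading order.
    """
    results = []
    for text, points, score in detections:
        results.append({
            "text": text,
            "bbox": to_axis_aligned(points),
            # Engines without confidence scores report 1.0.
            "confidence": 1.0 if score is None else float(score),
        })
    # Approximate reading order: bucket by row, then sort left to right.
    results.sort(key=lambda r: (r["bbox"][1] // row_height, r["bbox"][0]))
    return {"results": results}
```

An empty input yields {"results": []}, which matches the first point in the list. The row tolerance would need tuning to the typical line height of the images being processed.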
A note on OCR approaches
These days, it’s common to apply the term “OCR” to both traditional approaches and newer LLM-based document understanding models.
The LiteParse OCR API is designed specifically for approaches that return text with bounding boxes.
If you are trying to integrate a method that doesn’t return bounding boxes, you will have to generate dummy bounding boxes.
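One way to produce placeholder boxes is to stack the lines of returned text in equal-height bands across the page image. This is only a sketch under assumptions: `dummy_results` is a hypothetical helper, the image size must be supplied by the caller, and the boxes carry no real positional information.

```python
def dummy_results(text, image_width, image_height, confidence=1.0):
    """Wrap plain text in the response shape using evenly stacked full-width boxes.

    Hypothetical helper: each non-empty line gets a band of equal height, so
    reading order is preserved even though the positions are synthetic.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return {"results": []}

    band = image_height / len(lines)
    results = []
    for i, line in enumerate(lines):
        results.append({
            "text": line,
            "bbox": [0, round(i * band), image_width, round((i + 1) * band)],
            "confidence": confidence,
        })
    return {"results": results}
```

For example, dummy_results("Title\n\nBody text", 800, 1000) produces two results whose boxes split the image into a top and a bottom half.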