OCR Configuration

Configure OCR in LiteParse: use the built-in Tesseract engine, or bring your own via an HTTP server.

LiteParse runs OCR selectively: only on embedded images and on pages where native text extraction found no text. This keeps parsing fast while still capturing text from scanned pages and embedded images.

Built-in Tesseract (default)

Tesseract.js is bundled with LiteParse. The only setup is the automatic download of the Tesseract model files on first use. Just run:

lit parse document.pdf

If you bundle LiteParse into a Docker container or server environment, consider pre-downloading the Tesseract files, for example by running the command above (or a similar one) during setup, so that no network calls are needed at runtime.

Language support

Specify the OCR language for better accuracy on non-English documents:

lit parse document.pdf --ocr-language fra    # French
lit parse document.pdf --ocr-language deu    # German
lit parse document.pdf --ocr-language jpn    # Japanese

Tesseract uses ISO 639-3 language codes (eng, fra, deu, etc.).

Disabling OCR

If you don’t need OCR, for example because your PDFs contain only native text or you don’t care about text in images, disable it for faster parsing:

lit parse document.pdf --no-ocr

HTTP OCR servers

For higher accuracy or GPU-accelerated OCR, you can point LiteParse at an HTTP OCR server. LiteParse ships with ready-to-use examples for popular OCR engines.

EasyOCR

# Start the EasyOCR server (requires Python)
git clone https://github.com/run-llama/liteparse.git
cd liteparse/ocr/easyocr
pip install -r requirements.txt
python server.py

# Parse with EasyOCR in another terminal
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

PaddleOCR

# Start the PaddleOCR server (requires Python)
git clone https://github.com/run-llama/liteparse.git
cd liteparse/ocr/paddleocr
pip install -r requirements.txt
python server.py

# Parse with PaddleOCR in another terminal
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

Parallel OCR workers

LiteParse OCRs multiple pages in parallel. By default, it uses one fewer worker than your CPU core count. Override this with:

lit parse document.pdf --num-workers 8

Lower the value to throttle requests to an external OCR server, or raise it if your OCR engine is GPU-accelerated and can handle more concurrency.

Custom OCR servers

You can integrate any OCR engine by implementing the LiteParse OCR API. Your server needs a single endpoint:

POST /ocr
Content-Type: multipart/form-data

Request fields:

Field	Type	Required	Description
`file`	binary	Yes	Image file (PNG, JPG, etc.)
`language`	string	No	ISO 639-1 language code (default: `en`)

Response format:

{
  "results": [
    {
      "text": "recognized text",
      "bbox": [x1, y1, x2, y2],
      "confidence": 0.95
    }
  ]
}

Each result contains:

Field	Type	Description
`text`	string	Recognized text
`bbox`	`[x1, y1, x2, y2]`	Bounding box in pixels. Origin is top-left, x goes right, y goes down
`confidence`	number	Score from 0.0 to 1.0
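
When developing a custom server, it can help to check responses against the format above before pointing LiteParse at it. The following is a minimal sketch of such a check, not part of LiteParse: the function name `validate_ocr_response` is hypothetical, and the rules it enforces are only those stated in the tables above.

```python
from numbers import Real


def validate_ocr_response(payload):
    """Return a list of problems found in an OCR response; empty means valid.

    Hypothetical helper: it checks only the fields described in this guide.
    """
    if not isinstance(payload, dict):
        return ["response must be a JSON object"]
    results = payload.get("results")
    if not isinstance(results, list):
        return ['response must contain a "results" list']

    problems = []
    for i, item in enumerate(results):
        if not isinstance(item, dict):
            problems.append(f"results[{i}]: must be an object")
            continue
        if not isinstance(item.get("text"), str):
            problems.append(f"results[{i}]: text must be a string")

        bbox = item.get("bbox")
        is_number_list = (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(v, Real) and not isinstance(v, bool) for v in bbox)
        )
        if not is_number_list:
            problems.append(f"results[{i}]: bbox must be [x1, y1, x2, y2]")
        elif bbox[0] > bbox[2] or bbox[1] > bbox[3]:
            problems.append(f"results[{i}]: bbox must run top-left to bottom-right")

        conf = item.get("confidence")
        if isinstance(conf, bool) or not isinstance(conf, Real) or not 0.0 <= conf <= 1.0:
            problems.append(f"results[{i}]: confidence must be between 0.0 and 1.0")
    return problems
```

Passing the parsed JSON body of a server response to this function returns an empty list when the response matches the format, and one message per violation otherwise.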

Testing your server

# Quick test with curl
curl -X POST http://localhost:8080/ocr \
  -F "file=@test.png" \
  -F "language=en" | jq .

# Use with LiteParse
lit parse document.pdf --ocr-server-url http://localhost:8080/ocr

Common gotchas

Return {"results": []} if no text is detected
Bounding boxes must be axis-aligned: [x1, y1, x2, y2], running from the top-left corner to the bottom-right corner
If your engine returns rotated boxes, convert to axis-aligned by taking min/max coordinates
If your engine doesn’t provide confidence scores, return 1.0
Results should be in reading order (top-to-bottom, left-to-right)
Cache OCR models in memory rather than reloading per request
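
Several of the points above can be handled in one place when converting engine output into the response body. The sketch below assumes a hypothetical engine that returns each detection as a dict with `text`, `points` (corner coordinates in pixels, possibly rotated), and an optional `confidence`. The names `to_axis_aligned` and `build_response` are illustrative, not part of LiteParse or any OCR library.

```python
def to_axis_aligned(points):
    """Collapse polygon corners [(x, y), ...] into [x1, y1, x2, y2]."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [min(xs), min(ys), max(xs), max(ys)]


def build_response(detections):
    """Turn hypothetical engine detections into the OCR response body."""
    results = []
    for det in detections:
        confidence = det.get("confidence")
        results.append({
            "text": det["text"],
            # Rotated boxes become axis-aligned via min/max of the corners.
            "bbox": to_axis_aligned(det["points"]),
            # Engines without confidence scores report 1.0.
            "confidence": 1.0 if confidence is None else confidence,
        })
    # Crude reading order: sort by top edge, then by left edge.
    results.sort(key=lambda r: (r["bbox"][1], r["bbox"][0]))
    # With no detections this is {"results": []}, as required.
    return {"results": results}
```

Sorting by top edge and then left edge is only an approximation of reading order: words on the same line whose boxes start at slightly different heights may be interleaved, so a real server may need to group boxes into lines first.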

A note on OCR approaches

These days, it’s common to apply the term “OCR” to both traditional approaches and newer LLM-based document understanding models.

The LiteParse OCR API is designed specifically for approaches that return text with bounding boxes.

If you are trying to integrate a method that doesn’t return bounding boxes, you will have to generate dummy bounding boxes.
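
One possible way to generate dummy boxes is sketched below, assuming the model returns plain text and that the image dimensions are known. It splits the text into lines and assigns each line a full-width horizontal band, so the boxes at least preserve top-to-bottom order. The function name `dummy_line_boxes` and the banding scheme are assumptions made for illustration. This guide does not specify how LiteParse uses the boxes, so the resulting layout may be inaccurate.

```python
def dummy_line_boxes(text, image_width, image_height):
    """Wrap box-less text in results with evenly spaced placeholder boxes."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return {"results": []}

    band = image_height / len(lines)  # height of each placeholder band
    results = []
    for i, line in enumerate(lines):
        results.append({
            "text": line,
            # Full-width band; position reflects line order only.
            "bbox": [0, round(i * band), image_width, round((i + 1) * band)],
            # No real score is available, so report 1.0.
            "confidence": 1.0,
        })
    return {"results": results}
```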