OCR Configuration

Configure OCR in LiteParse — built-in Tesseract, or bring your own via HTTP servers.

LiteParse uses OCR selectively — only on embedded images or pages where native text extraction didn’t find text. This keeps parsing fast while still capturing text from scanned pages and embedded images.

Tesseract.js is bundled with LiteParse. The only setup is the automatic download of the Tesseract model files on first use. Just run:

lit parse document.pdf

If you bundle LiteParse into a Docker container or server environment, you may want to pre-download the Tesseract model files at build time, by running the above command or something similar, so that no network calls are needed at runtime.

Specify the OCR language for better accuracy on non-English documents:

lit parse document.pdf --ocr-language fra # French
lit parse document.pdf --ocr-language deu # German
lit parse document.pdf --ocr-language jpn # Japanese

Tesseract uses ISO 639-3 language codes (eng, fra, deu, etc.).

If you don’t need OCR (pure native-text PDFs, or you don’t care about images), disable it for faster parsing:

lit parse document.pdf --no-ocr

For higher accuracy or GPU-accelerated OCR, you can point LiteParse at an HTTP OCR server. LiteParse ships with ready-to-use examples for popular OCR engines.

# Start the EasyOCR server (requires Python)
git clone https://github.com/run-llama/liteparse.git
cd liteparse/ocr/easyocr
pip install -r requirements.txt
python server.py

# Parse with EasyOCR in another terminal
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

# Start the PaddleOCR server (requires Python)
git clone https://github.com/run-llama/liteparse.git
cd liteparse/ocr/paddleocr
pip install -r requirements.txt
python server.py

# Parse with PaddleOCR in another terminal
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

LiteParse OCRs multiple pages in parallel. By default, it uses one fewer worker than your CPU core count. Override this with:

lit parse document.pdf --num-workers 8

Lower this value to throttle OCR requests to an external server, or raise it if your OCR engine is GPU-accelerated and can handle more concurrency.

You can integrate any OCR engine by implementing the LiteParse OCR API. Your server needs a single endpoint:

POST /ocr
Content-Type: multipart/form-data

Request fields:

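| Field    | Type   | Required | Description                           |
|----------|--------|----------|---------------------------------------|
| file     | binary | Yes      | Image file (PNG, JPG, etc.)           |
| language | string | No       | ISO 639-1 language code (default: en) |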

Response format:

{
  "results": [
    {
      "text": "recognized text",
      "bbox": [x1, y1, x2, y2],
      "confidence": 0.95
    }
  ]
}

Each result contains:

| Field      | Type             | Description                                                           |
|------------|------------------|-----------------------------------------------------------------------|
| text       | string           | Recognized text                                                       |
| bbox       | [x1, y1, x2, y2] | Bounding box in pixels. Origin is top-left, x goes right, y goes down |
| confidence | number           | Score from 0.0 to 1.0                                                 |

# Quick test with curl
curl -X POST http://localhost:8080/ocr \
  -F "file=@test.png" \
  -F "language=en" | jq .

# Use with LiteParse
lit parse document.pdf --ocr-server-url http://localhost:8080/ocr
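When developing a custom server, it can help to check a response against the format described above before pointing LiteParse at it. The following is a minimal sketch of such a check, not part of LiteParse: the function name `validate_ocr_response` is hypothetical, and the ordering rule (x1 <= x2, y1 <= y2) is an assumption inferred from the top-left origin convention.

```python
from numbers import Real


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_ocr_response(payload):
    """Return a list of problems found in a decoded OCR response.

    An empty list means the payload matches the documented shape.
    """
    results = payload.get("results") if isinstance(payload, dict) else None
    if not isinstance(results, list):
        return ['top-level object must contain a "results" list']

    problems = []
    for i, item in enumerate(results):
        if not isinstance(item, dict):
            problems.append(f"results[{i}]: must be an object")
            continue

        if not isinstance(item.get("text"), str):
            problems.append(f"results[{i}].text: must be a string")

        bbox = item.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4
                and all(_is_number(v) for v in bbox)):
            problems.append(f"results[{i}].bbox: must be [x1, y1, x2, y2]")
        elif bbox[0] > bbox[2] or bbox[1] > bbox[3]:
            # Assumption: (x1, y1) is the top-left corner, (x2, y2) the bottom-right.
            problems.append(f"results[{i}].bbox: expected x1 <= x2 and y1 <= y2")

        confidence = item.get("confidence")
        if not (_is_number(confidence) and 0.0 <= confidence <= 1.0):
            problems.append(f"results[{i}].confidence: must be between 0.0 and 1.0")

    return problems
```

Pass it the decoded JSON body, for example the result of `json.loads` on the curl output. An empty `results` list is accepted, since that is the documented response when no text is detected.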
  • Return {"results": []} if no text is detected
  • Bounding boxes must be axis-aligned: [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner
  • If your engine returns rotated boxes, convert to axis-aligned by taking min/max coordinates
  • If your engine doesn’t provide confidence scores, return 1.0
  • Results should be in reading order (top-to-bottom, left-to-right)
  • Cache OCR models in memory rather than reloading per request
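The rotated-box and confidence tips above can be sketched as follows. This assumes an engine that reports each detection as a list of corner points `(x, y)`. The helper names `to_axis_aligned` and `to_result` are hypothetical and only illustrate the conversion.

```python
def to_axis_aligned(points):
    """Collapse corner points [(x, y), ...] into [x1, y1, x2, y2].

    Taking the min/max of each coordinate gives the smallest
    axis-aligned box that contains the rotated one.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [min(xs), min(ys), max(xs), max(ys)]


def to_result(text, points, confidence=None):
    """Build one entry of the "results" list from an engine detection."""
    return {
        "text": text,
        "bbox": to_axis_aligned(points),
        # Fall back to 1.0 when the engine gives no confidence score.
        "confidence": 1.0 if confidence is None else float(confidence),
    }


# A slightly rotated quadrilateral, listed clockwise from the top-left corner.
quad = [(10, 2), (50, 8), (48, 20), (8, 14)]
print(to_result("hello", quad))
```

The resulting box is slightly larger than the rotated original, which is the expected cost of the min/max conversion.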

These days, it’s common to apply the term “OCR” to both traditional approaches and newer LLM-based document understanding models.

The LiteParse OCR API is designed specifically for approaches that return text with bounding boxes.

If you are trying to integrate a method that doesn’t return bounding boxes, you will have to generate dummy bounding boxes.
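One possible way to generate dummy bounding boxes is to split the model's text output into lines and give each line an equal-height, full-width strip of the page image, so the boxes at least preserve reading order. This is a sketch under that assumption, not a LiteParse convention. The function name `dummy_results` is hypothetical, and your server must supply the image dimensions.

```python
def dummy_results(text, image_width, image_height):
    """Wrap plain text in the OCR response format using placeholder boxes.

    Each non-empty line of text gets a full-width horizontal strip;
    strips are stacked top to bottom in the order the lines appear.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return {"results": []}

    row_height = image_height / len(lines)
    results = []
    for i, line in enumerate(lines):
        results.append({
            "text": line,
            "bbox": [
                0,
                round(i * row_height),
                image_width,
                round((i + 1) * row_height),
            ],
            # The boxes are synthetic, so no real confidence exists; use 1.0.
            "confidence": 1.0,
        })
    return {"results": results}


response = dummy_results("Invoice 42\n\nTotal: 10.00", image_width=100, image_height=40)
print(response)
```

Because the coordinates are invented, anything downstream that relies on precise positions should not be trusted for these results.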