OCR Configuration
Configure OCR in LiteParse — built-in Tesseract, or bring your own via HTTP servers.
LiteParse uses OCR selectively — only on embedded images or pages where native text extraction didn’t find text. This keeps parsing fast while still capturing text from scanned pages and embedded images.
Built-in Tesseract (default)
Tesseract is bundled with LiteParse and works out of the box. Just run:

    lit parse document.pdf

Language support
Specify the OCR language for better accuracy on non-English documents:

    lit parse document.pdf --ocr-language fra  # French
    lit parse document.pdf --ocr-language deu  # German
    lit parse document.pdf --ocr-language jpn  # Japanese

Tesseract uses ISO 639-3 language codes (eng, fra, deu, etc.).
Offline / air-gapped environments
For environments without internet access, point Tesseract at a local directory containing pre-downloaded .traineddata files:

    # Via environment variable
    export TESSDATA_PREFIX=/path/to/tessdata
    lit parse document.pdf --ocr-language eng

    # Or via CLI flag
    lit parse document.pdf --tessdata-path /path/to/tessdata

The tessdata_path / tessdataPath option is also available in the library APIs.
Troubleshooting: missing language data
If Tesseract can’t find its language data, you’ll see an error like:

    Error opening data file tessdata/eng.traineddata
    Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
    Failed loading language 'eng'

The bundled Tesseract still needs the .traineddata file for your language. It is normally downloaded and cached automatically on first use (under ~/.tesseract-rs/tessdata on Linux, ~/Library/Application Support/tesseract-rs/tessdata on macOS). If that download didn’t happen (for example, because of an offline machine, a restricted network, or a sandboxed install), OCR cannot run.
To resolve it, do any one of the following:
- Download the language file (e.g. eng.traineddata) and point LiteParse at it with TESSDATA_PREFIX or --tessdata-path (see above).
- Use an HTTP OCR server instead of the built-in engine.
- Disable OCR with --no-ocr if you don’t need it.
When language data is missing for every page, LiteParse now fails with a clear "OCR failed for all N page(s)" error instead of returning a document with silently empty OCR text, so an unresolved setup surfaces immediately rather than producing partial results that look complete.
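If you are not sure whether the language data is actually present, a small script can check the locations mentioned above. The sketch below is a diagnostic aid only and not part of LiteParse: the function name find_traineddata is hypothetical, and the search order (TESSDATA_PREFIX first, then the Linux and macOS cache directories) is an assumption, not a description of what LiteParse does internally.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_traineddata(
    lang: str,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first existing <lang>.traineddata file, or None if none is found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    candidates = []
    # An explicit TESSDATA_PREFIX is checked first (assumed precedence).
    if env.get("TESSDATA_PREFIX"):
        candidates.append(Path(env["TESSDATA_PREFIX"]))
    # Cache directories named in this guide: Linux, then macOS.
    candidates.append(home / ".tesseract-rs" / "tessdata")
    candidates.append(
        home / "Library" / "Application Support" / "tesseract-rs" / "tessdata"
    )

    for directory in candidates:
        path = directory / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    found = find_traineddata("eng")
    print(found if found else "eng.traineddata not found in the checked locations")
```

If the script prints a path, the directory that contains the file is the one to pass to --tessdata-path or TESSDATA_PREFIX.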
Disabling OCR
If you don’t need OCR (pure native-text PDFs, or you don’t care about images), disable it for faster parsing:

    lit parse document.pdf --no-ocr

HTTP OCR servers
For higher accuracy or GPU-accelerated OCR, you can point LiteParse at an HTTP OCR server. LiteParse ships with ready-to-use examples for popular OCR engines.
EasyOCR
    # Start the EasyOCR server (requires Python)
    git clone https://github.com/run-llama/liteparse.git
    cd liteparse/ocr/easyocr
    pip install -r requirements.txt
    python server.py

    # Parse with EasyOCR in another terminal
    lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

PaddleOCR
    # Start the PaddleOCR server (requires Python)
    git clone https://github.com/run-llama/liteparse.git
    cd liteparse/ocr/paddleocr
    pip install -r requirements.txt
    python server.py

    # Parse with PaddleOCR in another terminal
    lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

Parallel OCR workers
LiteParse OCRs multiple pages in parallel. By default, it uses one fewer worker than your CPU core count. Override this with:

    lit parse document.pdf --num-workers 8

This is useful if you need to slow down OCR requests to an external server or if your OCR engine is GPU-accelerated and can handle more concurrency.
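To see roughly what the default works out to on a given machine, the sketch below computes "one fewer than the CPU core count" in Python. The helper name default_ocr_workers is hypothetical, and the floor of one worker on single-core machines is an assumption, since this guide doesn't say how that case is handled.

```python
import os
from typing import Optional


def default_ocr_workers(cpu_count: Optional[int] = None) -> int:
    """One fewer worker than CPU cores; the minimum of 1 is an assumption."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores - 1)


if __name__ == "__main__":
    print(f"Estimated default worker count here: {default_ocr_workers()}")
```

On an 8-core machine this yields 7, so passing --num-workers 8 would raise concurrency slightly above the default.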
Custom OCR servers
You can integrate any OCR engine by implementing the LiteParse OCR API. Your server needs a single endpoint:

    POST /ocr
    Content-Type: multipart/form-data

Request fields:
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Image file (PNG, JPG, etc.) |
| language | string | No | ISO 639-1 language code (default: en) |
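Note that this request field uses two-letter codes (en), while the built-in Tesseract flag uses three-letter codes (eng). If your server wraps an engine that expects three-letter codes, it may need to translate between the two. A minimal sketch, assuming a small hand-maintained table: the names ISO_639_1_TO_3 and to_three_letter are hypothetical, and the table covers only the languages used as examples in this guide.

```python
# Hypothetical lookup table: a hand-maintained subset, not a complete ISO list.
ISO_639_1_TO_3 = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "ja": "jpn",
}


def to_three_letter(code: str = "en") -> str:
    """Translate a two-letter request language into a three-letter code."""
    normalized = code.strip().lower()
    if normalized in ISO_639_1_TO_3:
        return ISO_639_1_TO_3[normalized]
    if normalized in ISO_639_1_TO_3.values():
        return normalized  # already a known three-letter code
    raise ValueError(f"unsupported language code: {code!r}")


if __name__ == "__main__":
    print(to_three_letter("fr"))
```

Raising on unknown codes is a design choice for the sketch; a server could equally fall back to its default language.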
Response format:
    {
      "results": [
        {
          "text": "recognized text",
          "bbox": [x1, y1, x2, y2],
          "confidence": 0.95
        }
      ]
    }

Each result contains:
| Field | Type | Description |
|---|---|---|
| text | string | Recognized text |
| bbox | [x1, y1, x2, y2] | Bounding box in pixels. Origin is top-left, x goes right, y goes down |
| confidence | number | Score from 0.0 to 1.0 |
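When developing a custom server, it can help to check responses against the fields described above before pointing LiteParse at it. The sketch below is one way to do that in Python. The function validate_ocr_response is a hypothetical helper, not part of LiteParse, and the checks reflect only what this page states about the format.

```python
import json
from numbers import Real


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_ocr_response(payload) -> list:
    """Return a list of problems; an empty list means the payload looks valid."""
    if not isinstance(payload, dict) or not isinstance(payload.get("results"), list):
        return ['top-level "results" must be a list']

    problems = []
    for i, item in enumerate(payload["results"]):
        if not isinstance(item, dict):
            problems.append(f"results[{i}] must be an object")
            continue
        if not isinstance(item.get("text"), str):
            problems.append(f"results[{i}].text must be a string")
        bbox = item.get("bbox")
        if not (
            isinstance(bbox, (list, tuple))
            and len(bbox) == 4
            and all(_is_number(v) for v in bbox)
        ):
            problems.append(f"results[{i}].bbox must be four numbers")
        elif bbox[0] > bbox[2] or bbox[1] > bbox[3]:
            problems.append(f"results[{i}].bbox must run top-left to bottom-right")
        confidence = item.get("confidence")
        if not (_is_number(confidence) and 0.0 <= confidence <= 1.0):
            problems.append(f"results[{i}].confidence must be between 0.0 and 1.0")
    return problems


if __name__ == "__main__":
    sample = '{"results": [{"text": "hi", "bbox": [0, 0, 10, 5], "confidence": 0.95}]}'
    print(validate_ocr_response(json.loads(sample)) or "response looks valid")
```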
Testing your server
    # Quick test with curl
    curl -X POST http://localhost:8080/ocr \
      -F "file=@test.png" \
      -F "language=en" | jq .

    # Use with LiteParse
    lit parse document.pdf --ocr-server-url http://localhost:8080/ocr

Common Gotchas
- Return {"results": []} if no text is detected
- Bounding boxes must be axis-aligned ([x1, y1, x2, y2], running from the top-left corner to the bottom-right corner)
- If your engine returns rotated boxes, convert to axis-aligned by taking min/max coordinates
- If your engine doesn’t provide confidence scores, return 1.0
- Results should be in reading order (top-to-bottom, left-to-right)
- Cache OCR models in memory rather than reloading per request
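Several of the points above can be handled in one post-processing step. The following is a minimal sketch, assuming your engine yields (text, corner points, confidence) tuples. That input shape and the names quad_to_aabb and normalize_results are hypothetical. The reading-order sort is deliberately naive (top edge, then left edge) and may misorder words on lines that are slightly skewed.

```python
def quad_to_aabb(points):
    """Collapse corner points [(x, y), ...] into [x1, y1, x2, y2] via min/max."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs), max(ys)]


def normalize_results(detections):
    """Build a response dict from (text, corner_points, confidence_or_None) tuples."""
    results = []
    for text, points, confidence in detections:
        results.append({
            "text": text,
            "bbox": quad_to_aabb(points),
            # Fall back to 1.0 when the engine gives no confidence score.
            "confidence": 1.0 if confidence is None else float(confidence),
        })
    # Naive reading order: sort by top edge, then by left edge.
    results.sort(key=lambda r: (r["bbox"][1], r["bbox"][0]))
    # An empty input naturally produces {"results": []}.
    return {"results": results}


if __name__ == "__main__":
    rotated = [(12, 0), (52, 2), (50, 8), (10, 6)]
    print(normalize_results([("title", rotated, None)]))
```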
OCR in the browser (WASM)
The built-in Tesseract and HTTP OCR backends are not available in the WASM build. Instead, you can pass a custom ocrEngine object with a recognize method. See the browser usage guide for details.
A note on OCR approaches
These days, it’s common to apply the term “OCR” to both traditional approaches and newer LLM-based document understanding models.
The LiteParse OCR API is designed specifically for approaches that return text with bounding boxes.
If you are trying to integrate a method that doesn’t return bounding boxes, you will have to generate dummy bounding boxes.
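One possible way to generate dummy bounding boxes is to give each line of returned text an equal-height band spanning the full image width. This is only a sketch under that assumption: the function name dummy_bboxes is hypothetical, the boxes carry no real position information, and how well the resulting layout works downstream is not something this page specifies.

```python
def dummy_bboxes(text: str, image_width: int, image_height: int) -> dict:
    """Assign each non-empty line of text a full-width, equal-height placeholder box."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return {"results": []}

    band = image_height / len(lines)  # height of each placeholder band in pixels
    results = []
    for i, line in enumerate(lines):
        results.append({
            "text": line,
            "bbox": [0, round(i * band), image_width, round((i + 1) * band)],
            "confidence": 1.0,  # no real score is available for placeholder boxes
        })
    return {"results": results}


if __name__ == "__main__":
    print(dummy_bboxes("Invoice 42\nTotal due: 10.00", 800, 600))
```

Stacking lines top to bottom keeps the results in reading order, which matches the guidance in the gotchas list.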