Installation
Requirements
- Python 3.10, 3.11, or 3.12
- Tesseract OCR with Nepali language pack (for legacy/scanned PDFs)
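The two requirements above can be checked from Python before installing anything. This is a minimal sketch using only the standard library; the helper names `python_supported` and `tesseract_on_path` are illustrative and not part of LamiSema.

```python
import shutil
import sys

# Python versions listed under Requirements.
SUPPORTED_MINORS = {(3, 10), (3, 11), (3, 12)}


def python_supported(version=None):
    """Return True if the (major, minor) version is one of the supported ones."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_MINORS


def tesseract_on_path():
    """Return the path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    print("Python supported:", python_supported())
    print("Tesseract binary:", tesseract_on_path() or "not found")
```

A missing Tesseract binary only matters for legacy-encoded or scanned PDFs, so a `None` result here is not necessarily an error.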
Install LamiSema
The base install provides the core pipeline with pdfplumber and PyMuPDF. Tesseract is not bundled: it is a system binary and must be installed separately.
Install with S3/Minio support
This adds boto3 for persistence to S3-compatible object storage.
Install all extras
This includes Tesseract support, EasyOCR, and S3 persistence drivers.
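To see which optional libraries an install actually pulled in, one option is to probe for their import names. The sketch below uses the standard library only. The mapping of names is an assumption based on the libraries mentioned on this page (PyMuPDF is imported as `fitz`), and `installed_libraries` is a hypothetical helper, not a LamiSema function.

```python
from importlib.util import find_spec

# Import names (not necessarily the pip package names) of the libraries
# mentioned on this page. PyMuPDF is imported as "fitz".
LIBRARIES = {
    "PyMuPDF": "fitz",
    "pdfplumber": "pdfplumber",
    "pytesseract": "pytesseract",
    "easyocr": "easyocr",
    "boto3": "boto3",
}


def installed_libraries(libraries=LIBRARIES):
    """Map each display name to True if its top-level module can be found."""
    return {name: find_spec(module) is not None for name, module in libraries.items()}


if __name__ == "__main__":
    for name, present in installed_libraries().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

This checks only that the modules can be located. It does not import them or confirm that they work.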
Install from source
Install Tesseract
Tesseract is required for legacy-encoded and scanned PDFs. The Nepali language pack (nep) must be installed alongside the binary.
On Windows, download the installer from the Tesseract GitHub releases page. During installation, select Nepali from the list of additional language packs.
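Whether the `nep` pack is present can be confirmed by asking the binary for its language list with `tesseract --list-langs`. The following is a sketch under the assumption that `tesseract` is on `PATH`; `parse_langs` and `installed_langs` are illustrative helper names.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # The first line is a header such as 'List of available languages (3):'.
    return [line for line in lines if line and not line.lower().startswith("list of")]


def installed_langs():
    """Run the tesseract binary and return its installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list to stderr instead of stdout.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    try:
        print("Nepali pack installed:", "nep" in installed_langs())
    except FileNotFoundError:
        print("tesseract binary not found on PATH")
```

If the binary is installed but `nep` is missing from the list, only the language data needs to be added, not Tesseract itself.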
Verify installation
from lamisema import LamiSema
lamisema = LamiSema()
print(lamisema.ocr_backend) # TesseractBackend or EasyOCRBackend or None
print(lamisema.storage) # InMemoryStorage (default)
Or start the API server and hit the health endpoint:
The response shows which libraries and storage backends are active:
{
  "status": "online",
  "libraries": {
    "PyMuPDF": true,
    "pdfplumber": true,
    "pytesseract": true,
    "easyocr": false
  },
  "ocr_backend": "tesseract",
  "storage_backend": "InMemoryStorage"
}
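A response of this shape is easy to check in a script, for example in a deployment smoke test. The sketch below parses the sample payload shown above. The field names come from that sample, while `summarize` and `fetch_health` are hypothetical helpers. The health endpoint's path is not given on this page, so `fetch_health` takes the full URL as an argument.

```python
import json
from urllib.request import urlopen

# The sample health response shown above.
SAMPLE = """
{
  "status": "online",
  "libraries": {
    "PyMuPDF": true,
    "pdfplumber": true,
    "pytesseract": true,
    "easyocr": false
  },
  "ocr_backend": "tesseract",
  "storage_backend": "InMemoryStorage"
}
"""


def summarize(payload):
    """Reduce a health response (JSON text) to the fields worth asserting on."""
    data = json.loads(payload)
    return {
        "online": data["status"] == "online",
        "active_libraries": sorted(n for n, ok in data["libraries"].items() if ok),
        "ocr_backend": data["ocr_backend"],
        "storage_backend": data["storage_backend"],
    }


def fetch_health(url, timeout=5):
    """Fetch and summarize a live health response; the URL is caller-supplied."""
    with urlopen(url, timeout=timeout) as response:
        return summarize(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(summarize(SAMPLE))
```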
Running the API server
# via installed CLI (Recommended)
lamisema serve
# via Python module
python -m lamisema.api.app
# via uvicorn directly (with S3 storage selected, to test the persistent storage fallback)
export LAMI_STORAGE_TYPE=s3
uvicorn lamisema.api.app:app --port 9001 --reload
Interactive docs: http://localhost:9001/docs
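The uvicorn invocation above can also be started from a Python script, which is sometimes convenient in test setups. This is a sketch, assuming uvicorn is installed in the same environment. It reuses the `lamisema.api.app:app` target, port 9001, and the `LAMI_STORAGE_TYPE` variable from the commands above, while `uvicorn_command` and `start_server` are illustrative names.

```python
import os
import subprocess
import sys


def uvicorn_command(port=9001, reload=True):
    """Build the argv equivalent to the uvicorn invocation above."""
    cmd = [sys.executable, "-m", "uvicorn", "lamisema.api.app:app", "--port", str(port)]
    if reload:
        cmd.append("--reload")
    return cmd


def start_server(storage_type=None, port=9001, reload=True):
    """Start the API server as a child process and return the Popen handle."""
    env = dict(os.environ)
    if storage_type is not None:
        # Same variable as the `export` line in the shell example.
        env["LAMI_STORAGE_TYPE"] = storage_type
    return subprocess.Popen(uvicorn_command(port, reload), env=env)
```

Calling `start_server("s3")` mirrors the shell example. The caller is responsible for stopping the process, for instance with `terminate()` on the returned handle.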