Extending LamiSema to Other Languages
LamiSema is built around a pluggable NLPBackend interface. Adding a new language requires implementing one class — the rest of the pipeline (encoding detection, OCR routing, confidence scoring, REST API) works unchanged.
How It Works
The pipeline is split into two independent layers:
| Layer | Interface | What it does | Language-specific? |
|---|---|---|---|
| PDF routing | `PDFPreflightService` | Detects encoding type, routes to text layer or OCR | No (works for any PDF) |
| NLP analysis | `NLPBackend` | Script ratio, entity extraction, confidence scoring | Yes (one per language) |
When you call LamiSema(nlp_backend=YourBackend()), only the NLP layer changes. Everything else is identical.
The NLPBackend Interface
from abc import ABC, abstractmethod
from typing import List

from lamisema.models import Entity


class NLPBackend(ABC):
    @property
    @abstractmethod
    def language_code(self) -> str:
        """ISO 639-1 code where one exists (e.g. 'ne', 'hi', 'en'); otherwise the ISO 639-2/3 code (e.g. 'mai')."""
        ...

    @property
    @abstractmethod
    def ocr_language(self) -> str:
        """Tesseract language string, e.g. 'nep+eng', 'hin+eng', 'eng'"""
        ...

    @abstractmethod
    def script_ratio(self, text: str) -> float:
        """Fraction of characters in the primary script. Used as a quality signal."""
        ...

    @abstractmethod
    def extract_entities(self, text: str) -> List[Entity]:
        """Run NER on extracted text. Return an empty list if not implemented."""
        ...

    @abstractmethod
    def compute_confidence(self, text: str, extraction_method: str) -> float:
        """Per-page confidence score in [0.0, 1.0]."""
        ...
Step-by-Step: Adding a New Language
1. Create the backend file
Create lamisema/nlp/<language>.py. Use lamisema/nlp/nepali.py as your reference implementation.
2. Implement NLPBackend
A minimum viable implementation is shown below; NER patterns can be added incrementally:
# lamisema/nlp/english.py
import re
from typing import List

from lamisema.models import Entity
from lamisema.nlp.base import NLPBackend

# ASCII/Latin character range used as the primary script signal for English
_LATIN_RANGE = (0x0020, 0x007E)


class EnglishNLPBackend(NLPBackend):
    @property
    def language_code(self) -> str:
        return "en"

    @property
    def ocr_language(self) -> str:
        return "eng"  # Tesseract default

    def script_ratio(self, text: str) -> float:
        """Fraction of printable ASCII characters."""
        if not text:
            return 0.0
        latin = sum(1 for ch in text if _LATIN_RANGE[0] <= ord(ch) <= _LATIN_RANGE[1])
        return latin / len(text)

    def extract_entities(self, text: str) -> List[Entity]:
        """
        Add English-specific NER patterns here.
        Return an empty list until patterns are implemented.
        """
        entities = []
        # TODO: DATE patterns (YYYY-MM-DD, Month DD YYYY), USD/GBP currency,
        # organization suffixes (Ltd, Inc, Corp, plc), etc.
        return entities

    def compute_confidence(self, text: str, extraction_method: str) -> float:
        if not text or len(text) < 10:
            return 0.05
        ratio = self.script_ratio(text)
        method_base = {
            "text_layer": 0.90,
            "tesseract": 0.72,
            "easyocr": 0.75,
        }.get(extraction_method, 0.50)
        return round(min((method_base * 0.60) + (ratio * 0.40), 1.0), 4)
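To see how the weighting in `compute_confidence` plays out, the sketch below reproduces the same formula as a standalone function and evaluates it for a few inputs. The name `blended_confidence` is made up for this example; the baselines and the 60/40 split are copied from the backend above.

```python
# Method baselines copied from the example backend above.
METHOD_BASE = {"text_layer": 0.90, "tesseract": 0.72, "easyocr": 0.75}


def blended_confidence(ratio: float, extraction_method: str) -> float:
    """60% extraction-method baseline plus 40% script ratio, capped at 1.0."""
    method_base = METHOD_BASE.get(extraction_method, 0.50)
    return round(min(method_base * 0.60 + ratio * 0.40, 1.0), 4)


print(blended_confidence(1.0, "text_layer"))  # 0.94
print(blended_confidence(0.5, "tesseract"))   # 0.632
print(blended_confidence(1.0, "unknown"))     # 0.7
```

With these baselines the highest reachable score is 0.94 (a clean text layer with a script ratio of 1.0), so the `min(..., 1.0)` cap only matters if the baselines or weights are raised.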
3. Use it
from lamisema import LamiSema
from lamisema.nlp.english import EnglishNLPBackend

lamisema = LamiSema(nlp_backend=EnglishNLPBackend())

with open("report.pdf", "rb") as f:
    result = lamisema.extract(f.read(), filename="report.pdf")

print(result.language)  # "en"
print(result.pages[0].entities)
4. Install the Tesseract language pack
# Ubuntu / Debian
sudo apt-get install tesseract-ocr-eng # usually pre-installed
# macOS (all language packs included)
brew install tesseract tesseract-lang
# Verify
tesseract --list-langs | grep eng
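Because `ocr_language` can combine several packs with `+` (for example 'nep+eng'), it can help to check every component against the installed list. The sketch below parses the text printed by `tesseract --list-langs`. The function names are illustrative and not part of LamiSema, and the sample output is a made-up example of the usual shape: a header line ending in a colon, followed by one language per line.

```python
import subprocess
from typing import List, Set


def parse_list_langs(output: str) -> Set[str]:
    """Parse `tesseract --list-langs` output into a set of language names.

    Assumes a header line ending in ':' followed by one name per line.
    """
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        if not line or line.endswith(":"):
            continue  # skip blank lines and the header line
        langs.add(line)
    return langs


def missing_packs(ocr_language: str, installed: Set[str]) -> List[str]:
    """Return the components of an ocr_language string that are not installed."""
    return [lang for lang in ocr_language.split("+") if lang not in installed]


def installed_langs() -> Set[str]:
    """Ask the local tesseract binary for its language list."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Older releases may write the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


# Made-up sample output, so the example runs without tesseract installed.
sample = 'List of available languages in "/usr/share/tessdata/" (2):\neng\nosd\n'
print(missing_packs("nep+eng", parse_list_langs(sample)))  # ['nep']
```

On a machine with Tesseract installed, `missing_packs(backend.ocr_language, installed_langs())` should come back empty before the backend is used for OCR.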
Language Support Matrix
| Language | Code | Tesseract pack | Backend | Status |
|---|---|---|---|---|
| Nepali (नेपाली) | ne | `nep` | `NepaliNLPBackend` | ✅ Default |
| Hindi (हिन्दी) | hi | `hin` | none | Planned |
| English | en | `eng` | none | Planned |
| Maithili (मैथिली) | mai | `mai` | none | Planned |
| Newari (नेवारी) | new | none | none | Research |
What Each Language Backend Should Implement
| Feature | Required | Notes |
|---|---|---|
| `language_code` | ✅ | ISO 639-1 code, or the ISO 639-2/3 code where no two-letter code exists (e.g. `mai`) |
| `ocr_language` | ✅ | Tesseract language string |
| `script_ratio` | ✅ | Quality signal for extraction |
| `compute_confidence` | ✅ | Per-page score |
| `extract_entities` | Optional | Can return `[]`; add NER incrementally |
| Date normalization | Optional | Language-specific calendar handling |
| Currency patterns | Optional | NPR, INR, USD, etc. |
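The optional rows in the table above usually start life as a handful of regular expressions. The following is a minimal sketch of that approach, assuming ISO-style dates and a few currency codes. `SimpleEntity` is a stand-in for `lamisema.models.Entity`, whose real fields may differ, and `find_entities` is a hypothetical helper name.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleEntity:
    """Stand-in for lamisema.models.Entity; the real fields may differ."""
    label: str
    text: str
    start: int
    end: int


# Illustrative patterns only; real documents will need broader coverage.
_PATTERNS = [
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("MONEY", re.compile(r"\b(?:NPR|INR|USD)\s?\d+(?:,\d+)*(?:\.\d+)?")),
]


def find_entities(text: str) -> List[SimpleEntity]:
    """Collect regex matches for every pattern, ordered by position."""
    found = [
        SimpleEntity(label, m.group(0), m.start(), m.end())
        for label, pattern in _PATTERNS
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda e: e.start)


for entity in find_entities("Invoice dated 2024-03-15 for NPR 12,500.50 total."):
    print(entity.label, entity.text)
# DATE 2024-03-15
# MONEY NPR 12,500.50
```

Keeping the patterns in one list makes it easy to add a label per release without touching the matching loop.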
Notes
- The OCR backend is separate from the NLP backend. The same `TesseractBackend` works for all languages; it reads `nlp_backend.ocr_language` to choose the recognition model.
- Devanagari script is shared across Nepali, Hindi, Maithili, and Sanskrit. The `script_ratio` logic from `NepaliNLPBackend` can be reused as-is for any Devanagari-script language.
- NER is purely additive. You can ship a backend with an empty `extract_entities` and add patterns in later versions without breaking the interface.
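As a rough illustration of the shared-script point, a Devanagari script ratio can be computed from the Unicode Devanagari block (U+0900 to U+097F). This is a sketch under that assumption and not the actual `NepaliNLPBackend` code, which may differ, for instance in how it treats whitespace, digits, or punctuation. The function name `devanagari_ratio` is hypothetical.

```python
# Unicode Devanagari block; the range is standard, the function is a sketch.
_DEVANAGARI_RANGE = (0x0900, 0x097F)


def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for ch in chars
        if _DEVANAGARI_RANGE[0] <= ord(ch) <= _DEVANAGARI_RANGE[1]
    )
    return hits / len(chars)


print(devanagari_ratio("नेपाल"))        # 1.0: every code point is Devanagari
print(devanagari_ratio("नेपाल Nepal"))  # 0.5: five Devanagari, five Latin
```

Whitespace is excluded from the denominator here so that ordinary word spacing does not lower the score; counting it instead is an equally defensible choice as long as the confidence thresholds are tuned to match.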