Extending LamiSema to Other Languages

LamiSema is built around a pluggable NLPBackend interface. Adding a new language requires implementing one class — the rest of the pipeline (encoding detection, OCR routing, confidence scoring, REST API) works unchanged.


How It Works

The pipeline is split into two independent layers:

| Layer | Interface | What it does | Language-specific? |
| --- | --- | --- | --- |
| PDF routing | PDFPreflightService | Detects encoding type, routes to text layer or OCR | No, works for any PDF |
| NLP analysis | NLPBackend | Script ratio, entity extraction, confidence scoring | Yes, one per language |

When you call LamiSema(nlp_backend=YourBackend()), only the NLP layer changes. Everything else is identical.


The NLPBackend Interface

from abc import ABC, abstractmethod
from typing import List
from lamisema.models import Entity

class NLPBackend(ABC):

    @property
    @abstractmethod
    def language_code(self) -> str:
        """ISO 639-1 code where one exists (e.g. 'ne', 'hi', 'en'); otherwise ISO 639-3 (e.g. 'mai')."""
        ...

    @property
    @abstractmethod
    def ocr_language(self) -> str:
        """Tesseract language string, e.g. 'nep+eng', 'hin+eng', 'eng'"""
        ...

    @abstractmethod
    def script_ratio(self, text: str) -> float:
        """Fraction of characters in the primary script, in [0.0, 1.0]. Used as a quality signal."""
        ...

    @abstractmethod
    def extract_entities(self, text: str) -> List[Entity]:
        """Run NER on extracted text. Return empty list if not implemented."""
        ...

    @abstractmethod
    def compute_confidence(self, text: str, extraction_method: str) -> float:
        """Per-page confidence score in [0.0, 1.0]."""
        ...

Step-by-Step: Adding a New Language

1. Create the backend file

Create lamisema/nlp/<language>.py. Use lamisema/nlp/nepali.py as your reference implementation.

2. Implement NLPBackend

Minimum viable implementation — add NER patterns incrementally:

# lamisema/nlp/english.py
import re
from typing import List
from lamisema.models import Entity
from lamisema.nlp.base import NLPBackend

# Printable ASCII range (space through tilde), used as the primary-script signal for English
_LATIN_RANGE = (0x0020, 0x007E)

class EnglishNLPBackend(NLPBackend):

    @property
    def language_code(self) -> str:
        return "en"

    @property
    def ocr_language(self) -> str:
        return "eng"  # Tesseract default

    def script_ratio(self, text: str) -> float:
        """Fraction of printable ASCII characters."""
        if not text:
            return 0.0
        latin = sum(1 for ch in text if _LATIN_RANGE[0] <= ord(ch) <= _LATIN_RANGE[1])
        return latin / len(text)

    def extract_entities(self, text: str) -> List[Entity]:
        """
        Add English-specific NER patterns here.
        Return an empty list until patterns are implemented.
        """
        entities = []
        # TODO: DATE patterns (YYYY-MM-DD, Month DD YYYY), USD/GBP currency,
        #       organization suffixes (Ltd, Inc, Corp, plc), etc.
        return entities

    def compute_confidence(self, text: str, extraction_method: str) -> float:
        if not text or len(text) < 10:
            return 0.05
        ratio = self.script_ratio(text)
        method_base = {
            "text_layer": 0.90,
            "tesseract": 0.72,
            "easyocr": 0.75,
        }.get(extraction_method, 0.50)
        return round(min((method_base * 0.60) + (ratio * 0.40), 1.0), 4)
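
To make the weighting in compute_confidence concrete, here is a small worked sketch that reproduces only the blend from the example above (60% extraction-method prior, 40% script ratio). The function name `blended_confidence` is illustrative, and the short-text guard that returns 0.05 is deliberately left out.

```python
# Method priors copied from the EnglishNLPBackend example above.
METHOD_BASE = {"text_layer": 0.90, "tesseract": 0.72, "easyocr": 0.75}


def blended_confidence(ratio: float, extraction_method: str) -> float:
    """Blend the method prior (weight 0.60) with the script ratio (weight 0.40)."""
    base = METHOD_BASE.get(extraction_method, 0.50)  # unknown methods fall back to 0.50
    return round(min(base * 0.60 + ratio * 0.40, 1.0), 4)


for method, ratio in [("text_layer", 0.95), ("tesseract", 0.80), ("unknown", 0.50)]:
    print(method, blended_confidence(ratio, method))
# text_layer: 0.90 * 0.60 + 0.95 * 0.40 = 0.92
# tesseract:  0.72 * 0.60 + 0.80 * 0.40 = 0.752
# unknown:    0.50 * 0.60 + 0.50 * 0.40 = 0.5
```

With these weights the highest reachable score is 0.94 (text layer, script ratio 1.0), so the `min(..., 1.0)` clamp only matters if the priors or weights are later raised.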

3. Use it

from lamisema import LamiSema
from lamisema.nlp.english import EnglishNLPBackend

lamisema = LamiSema(nlp_backend=EnglishNLPBackend())

with open("report.pdf", "rb") as f:
    result = lamisema.extract(f.read(), filename="report.pdf")

print(result.language)   # "en"
print(result.pages[0].entities)

4. Install the Tesseract language pack

# Ubuntu / Debian
sudo apt-get install tesseract-ocr-eng   # normally installed together with tesseract-ocr

# macOS (tesseract-lang installs all language packs)
brew install tesseract tesseract-lang

# Verify
tesseract --list-langs | grep eng
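
The same verification can be scripted, which is handy when ocr_language names several packs (for example 'nep+eng'). This is a sketch under assumptions: the helper names are hypothetical, and the parser assumes the header line printed by `tesseract --list-langs` starts with "List of", which may vary between Tesseract versions.

```python
import shutil
import subprocess
from typing import List, Set


def parse_installed_langs(list_langs_output: str) -> Set[str]:
    """Turn the text printed by `tesseract --list-langs` into a set of pack names."""
    lines = (line.strip() for line in list_langs_output.splitlines())
    # Skip blank lines and the header (assumed to start with "List of").
    return {line for line in lines if line and not line.startswith("List of")}


def missing_packs(ocr_language: str, installed: Set[str]) -> List[str]:
    """Return the packs named in an ocr_language string that are not installed."""
    return [lang for lang in ocr_language.split("+") if lang not in installed]


def installed_langs() -> Set[str]:
    """Query the local Tesseract binary; returns an empty set if it is not on PATH."""
    if shutil.which("tesseract") is None:
        return set()
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some older releases print the list to stderr rather than stdout.
    return parse_installed_langs(proc.stdout or proc.stderr)


# Deterministic demonstration using captured output instead of a live call.
sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nnep\nosd\n'
print(missing_packs("hin+eng", parse_installed_langs(sample)))  # prints ['hin']
```

On a real machine, `missing_packs(backend.ocr_language, installed_langs())` returning an empty list means every pack the backend asks for is available.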

Language Support Matrix

| Language | Code | Tesseract pack | Backend | Status |
| --- | --- | --- | --- | --- |
| Nepali (नेपाली) | ne | nep | NepaliNLPBackend | ✅ Default |
| Hindi (हिन्दी) | hi | hin | | Planned |
| English | en | eng | | Planned |
| Maithili (मैथिली) | mai | mai | | Planned |
| Newari (नेवारी) | new | | | Research |

What Each Language Backend Should Implement

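| Feature | Required | Notes |
| --- | --- | --- |
| language_code | Required | ISO 639-1 code, or ISO 639-3 where no two-letter code exists |
| ocr_language | Required | Tesseract language string |
| script_ratio | Required | Quality signal for extraction |
| compute_confidence | Required | Per-page score |
| extract_entities | Optional | Must be defined, but can return [] while NER is added incrementally |
| Date normalization | Optional | Language-specific calendar handling |
| Currency patterns | Optional | NPR, INR, USD, etc. |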
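
The optional rows above can start as a handful of regular expressions inside extract_entities. The sketch below is illustrative only: `SketchEntity` is a hypothetical stand-in because the fields of lamisema.models.Entity are not shown in this guide, the labels and patterns are assumptions, and no calendar conversion is attempted.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SketchEntity:
    """Hypothetical stand-in; the real lamisema.models.Entity may have other fields."""
    label: str
    text: str
    start: int
    end: int


# Deliberately small, assumed pattern set: ISO dates, prefixed currency amounts,
# and capitalised names ending in a company suffix.
_PATTERNS = [
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("MONEY", re.compile(r"\b(?:NPR|INR|USD|GBP)\s?\d+(?:,\d{2,3})*(?:\.\d+)?")),
    ("ORG", re.compile(r"\b(?:[A-Z][\w&]*\s)+(?:Ltd|Inc|Corp|plc)\b")),
]


def extract_sketch_entities(text: str) -> List[SketchEntity]:
    """Run every pattern over the text and return matches in document order."""
    found = [
        SketchEntity(label, match.group(0), match.start(), match.end())
        for label, pattern in _PATTERNS
        for match in pattern.finditer(text)
    ]
    return sorted(found, key=lambda entity: entity.start)


for entity in extract_sketch_entities("Acme Trading Ltd paid USD 1,250.50 on 2024-03-15."):
    print(entity.label, repr(entity.text))
# ORG 'Acme Trading Ltd'
# MONEY 'USD 1,250.50'
# DATE '2024-03-15'
```

The `,\d{2,3}` group accepts both Western (1,250,000) and South Asian (12,50,000) digit grouping, which matters once NPR and INR amounts are in scope.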

Notes

  • The OCR backend is separate from the NLP backend. The same TesseractBackend works for all languages; it reads nlp_backend.ocr_language to choose the recognition model.
  • Devanagari script is shared across Nepali, Hindi, Maithili, and Sanskrit. The script_ratio logic from NepaliNLPBackend can be reused as-is for any Devanagari-script language.
  • NER is purely additive. You can ship a backend with an empty extract_entities and add patterns in later versions without breaking the interface.
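
As an illustration of the shared-script note above, a Devanagari script ratio can be computed from the Unicode Devanagari block (U+0900 to U+097F). This is a sketch, not the code in NepaliNLPBackend, whose implementation is not shown here; in particular, ignoring whitespace in the denominator is an assumption.

```python
# Unicode Devanagari block; covers letters, vowel signs, digits, and the danda.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Devanagari block."""
    counted = [ch for ch in text if not ch.isspace()]
    if not counted:
        return 0.0
    hits = sum(1 for ch in counted if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END)
    return hits / len(counted)


print(devanagari_ratio("नेपाल"))      # 1.0
print(devanagari_ratio("नेपाल abc"))  # 0.625 (5 of 8 non-space characters)
```

Because the function depends only on the script, the same ratio applies to Hindi or Maithili text; telling those languages apart needs a different signal.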