LamiSema
Lamichhane Semantic — Structured Information Extraction for Nepali Documents
LamiSema extracts structured meaning from Nepali PDFs, not just characters. It combines encoding-aware routing, layout intelligence, and a Nepali NER ontology purpose-built for the schemas of Nepali law, economics, finance, land records, and general-purpose government documents.
The Problem in One Paragraph
Nepal has decades of government records — land titles, court judgments, budget reports, economics analyses, gazette notices, and census data — stored as PDFs. Most of these documents use legacy Nepali fonts (Preeti, Kantipur, Sagarmatha) that store Devanagari characters as ASCII bytes. When any standard extraction tool reads the text layer, it returns garbage: g]kfn instead of नेपाल. No error is raised, no warning emitted — the output is silently wrong. Other documents are scanned images with no text layer at all. And even when characters are extracted correctly, mainstream tools have no understanding of Nepal’s document schemas: Bikram Sambat dates, ward and VDC hierarchies, kittaa land parcel notation, or NPR currency.
LamiSema solves this end-to-end: detecting encoding type first, routing to the correct extraction strategy, recovering document structure, and extracting structured meaning using Nepal’s own administrative and legal vocabulary.
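To make the silent-failure mode concrete, here is a minimal sketch built only from the single example above (g]kfn standing for नेपाल). The names `PREETI_SAMPLE_MAP`, `devanagari_ratio`, and `decode_sample` are hypothetical, and the five-entry table is illustrative rather than a usable Preeti mapping. It is also not how LamiSema handles legacy fonts: per the pipeline described in this document, those pages are rendered and sent to OCR.

```python
# Partial, illustrative mapping: only the five correspondences implied by
# the example "g]kfn" -> "नेपाल". A real legacy-font table has many more
# entries and needs glyph reordering rules that this sketch ignores.
PREETI_SAMPLE_MAP = {"g": "न", "]": "े", "k": "प", "f": "ा", "n": "ल"}


def devanagari_ratio(text: str) -> float:
    """Share of non-space characters inside the Devanagari block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= c <= "\u097f" for c in chars) / len(chars)


def decode_sample(raw: str) -> str:
    """Map each character through the sample table; unknown characters pass through."""
    return "".join(PREETI_SAMPLE_MAP.get(c, c) for c in raw)


raw = "g]kfn"                       # what a text-layer extractor returns
print(devanagari_ratio(raw))        # 0.0, pure ASCII, yet no error was raised
print(decode_sample(raw))           # नेपाल
print(devanagari_ratio(decode_sample(raw)))  # 1.0
```

A ratio near zero for a document known to be Nepali is one cheap signal that the text layer is legacy-encoded rather than Unicode.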
How It Works
PDF Input
    │
    ▼
┌──────────────────────────────────┐
│ Stage 1: Pre-flight Analysis     │  Detects font names, encoding
│ PDFPreflightService              │  type, presence of text layer
└──────────────────────────────────┘
    │
    ├─ unicode_native ──→ pdfplumber (fast, lossless)
    ├─ legacy_encoded ──→ 300 DPI render → Tesseract nep+eng
    └─ scanned ─────────→ 300 DPI render → Tesseract nep+eng
    │
    ▼
┌──────────────────────────────────┐
│ Stage 2: Layout Intelligence     │  Tables, sections, headings,
│                                  │  multi-column reading order
└──────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────┐
│ Stage 3: Symbolic NLP Layer      │  Rule-based NER (20+ types):
│ DevanagariTextAnalyzer           │  BS dates, currency, orgs,
└──────────────────────────────────┘  wards, districts, land parcels
    │
    ▼
┌──────────────────────────────────┐
│ Stage 4: Domain Schema Output    │  Typed models: budget, land,
│                                  │  gazette, court, economics,
└──────────────────────────────────┘  general-purpose documents
    │
    ▼
Structured JSON output with confidence scores
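The Stage 1 routing decision can be sketched in a few lines. This is an assumption-laden illustration, not the `PDFPreflightService` implementation: `PreflightResult`, `classify`, `route`, and the strategy labels are hypothetical names, and the font list contains only the three families named in this document.

```python
from dataclasses import dataclass, field

# Legacy font families named in this document; a real detector would know more.
LEGACY_FONT_HINTS = ("preeti", "kantipur", "sagarmatha")


@dataclass
class PreflightResult:
    """Hypothetical stand-in for what a pre-flight pass might report."""
    font_names: list[str] = field(default_factory=list)
    has_text_layer: bool = False


def classify(pre: PreflightResult) -> str:
    """Return one of the three encoding types shown in the diagram."""
    if not pre.has_text_layer:
        return "scanned"
    # Embedded font names often carry a subset prefix, e.g. "ABCDEF+Preeti".
    for name in pre.font_names:
        if any(hint in name.lower() for hint in LEGACY_FONT_HINTS):
            return "legacy_encoded"
    return "unicode_native"


# Both non-Unicode branches end in the same OCR path, as in the diagram.
STRATEGY = {
    "unicode_native": "text_layer",  # direct extraction (pdfplumber)
    "legacy_encoded": "ocr",         # 300 DPI render, Tesseract nep+eng
    "scanned": "ocr",                # 300 DPI render, Tesseract nep+eng
}


def route(pre: PreflightResult) -> str:
    return STRATEGY[classify(pre)]


print(route(PreflightResult(["ABCDEF+Preeti"], has_text_layer=True)))  # ocr
print(route(PreflightResult(["NotoSansDevanagari"], has_text_layer=True)))  # text_layer
```

Classifying before extracting is the point of the design: a legacy-encoded text layer looks valid to an extractor, so the decision has to come from font metadata rather than from the extracted text.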
Quickstart
pip install lamisema
brew install tesseract tesseract-lang # macOS
# apt-get install tesseract-ocr tesseract-ocr-nep # Ubuntu
from lamisema import LamiSema
lamisema = LamiSema()
with open("budget-2081.pdf", "rb") as f:
    result = lamisema.extract(f.read(), filename="budget-2081.pdf")
print(result.encoding_type) # "legacy_encoded"
print(result.language) # "ne"
print(result.overall_confidence) # 0.74
print(result.pages[0].entities) # [Entity(type="DATE_BS", ...)]
See Installation for full setup instructions.
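Once a result is in hand, a common next step is to collect entities across pages. The helper below is a sketch that relies only on the attributes shown in the Quickstart (`result.pages`, `page.entities`, `entity.type`); `entities_by_type` is a hypothetical name, the `"CURRENCY"` label is an assumed entity type, and the demo uses stand-in objects rather than a real extraction result.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_type(result):
    """Group every entity on every page by its `type` attribute."""
    grouped = defaultdict(list)
    for page in result.pages:
        for entity in page.entities:
            grouped[entity.type].append(entity)
    return dict(grouped)


# Stand-in objects shaped like the Quickstart result, for illustration only.
fake_result = SimpleNamespace(
    pages=[
        SimpleNamespace(entities=[
            SimpleNamespace(type="DATE_BS"),
            SimpleNamespace(type="CURRENCY"),  # assumed type name
        ]),
        SimpleNamespace(entities=[SimpleNamespace(type="DATE_BS")]),
    ]
)

counts = {t: len(es) for t, es in entities_by_type(fake_result).items()}
print(counts)  # {'DATE_BS': 2, 'CURRENCY': 1}
```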
Key Features
| Feature | Description |
|---|---|
| Encoding pre-flight | Detects 20+ legacy Nepali fonts before extraction |
| Automatic routing | No config — correct strategy chosen per document |
| Layout intelligence | Tables, sections, headings, multi-column layouts as structure |
| Deep Nepali NER | 20+ entity types: dates, currency, orgs, wards, districts, land parcels, court refs, gazette refs |
| High-availability storage | Pluggable backends: memory, disk, or S3/MinIO with automatic fallback |
| Language-agnostic core | NLPBackend interface allows adding Hindi, English, etc. |
| Bikram Sambat normalization | Converts BS dates to approximate AD equivalents |
| Domain schemas | Typed output for budget, land, gazette, court, economics, and general-purpose documents |
| Cross-document intelligence | Deduplication, version tracking, entity co-reference across corpora |
| FastAPI + Docker | Ready for production deployment with docker-compose |
| Fully offline | No API keys, no network calls, no data sent anywhere |
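
As an illustration of the rule-based NER and Bikram Sambat normalization rows, the sketch below finds numeric BS dates and attaches an approximate AD year range. It is not the `DevanagariTextAnalyzer` implementation: `BS_DATE`, `find_bs_dates`, and the output keys are hypothetical, and it assumes the input is already known to use BS dates, since a bare pattern cannot tell a BS year from an AD one. The year offset reflects that a BS year starts in mid-April and so overlaps two AD years; an exact day-level conversion needs a calendar lookup table because BS month lengths vary from year to year.

```python
import re

# Year-month-day with "-", "/" or "." separators. For str patterns, \d is
# Unicode-aware, so it matches both ASCII and Devanagari digits.
BS_DATE = re.compile(r"(?<!\d)(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})(?!\d)")


def find_bs_dates(text: str) -> list[dict]:
    """Return DATE_BS candidates with an approximate AD year range."""
    found = []
    for match in BS_DATE.finditer(text):
        # int() accepts Devanagari digits, so "२०८१" parses as 2081.
        year, month, day = (int(part) for part in match.groups())
        # BS months run up to 32 days; reject anything outside that.
        if not (1 <= month <= 12 and 1 <= day <= 32):
            continue
        found.append({
            "type": "DATE_BS",
            "text": match.group(0),
            "year": year,
            "month": month,
            "day": day,
            # A BS year Y overlaps AD years Y-57 and Y-56.
            "ad_years": (year - 57, year - 56),
        })
    return found


for hit in find_bs_dates("मिति २०८१-०१-१५ र 2081/04/01"):
    print(hit["text"], hit["ad_years"])
```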