LamiSema

Lamichhane Semantic — Structured Information Extraction for Nepali Documents

LamiSema extracts structured meaning from Nepali PDFs, not just characters. It combines encoding-aware routing, layout intelligence, and a Nepali NER ontology purpose-built for the schemas of Nepali law, economics, finance, land records, and general-purpose government documents.

The Problem in One Paragraph

Nepal has decades of government records (land titles, court judgments, budget reports, economic analyses, gazette notices, and census data) stored as PDFs. Most of these documents use legacy Nepali fonts (Preeti, Kantipur, Sagarmatha) that map Devanagari glyphs onto ASCII code points. When a standard extraction tool reads the text layer, it returns garbage: g]kfn instead of नेपाल. No error is raised and no warning is emitted; the output is silently wrong. Other documents are scanned images with no text layer at all. And even when characters are extracted correctly, mainstream tools have no understanding of Nepal’s document schemas: Bikram Sambat dates, ward and VDC hierarchies, kitta land parcel notation, or NPR currency.
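To make the silent failure concrete, here is a small illustrative sketch of why the text layer of a Preeti document reads as ASCII. The five-entry table covers only the characters in the example word; `PREETI_SAMPLE_MAP` and `decode_preeti_sample` are hypothetical names, and this is not how LamiSema processes legacy fonts (the pipeline described on this page renders such pages and runs OCR instead).

```python
# Partial, illustrative Preeti-to-Unicode table: only the five keys needed
# for one word. A real converter needs hundreds of rules plus reordering.
PREETI_SAMPLE_MAP = {
    "g": "\u0928",  # NA
    "]": "\u0947",  # vowel sign E
    "k": "\u092a",  # PA
    "f": "\u093e",  # vowel sign AA
    "n": "\u0932",  # LA
}


def decode_preeti_sample(raw: str) -> str:
    """Replace each ASCII character with the Devanagari character Preeti draws for it."""
    return "".join(PREETI_SAMPLE_MAP.get(ch, ch) for ch in raw)


def is_ascii_only(text: str) -> bool:
    """True when every character is plain ASCII, as in a legacy-font text layer."""
    return all(ord(ch) < 128 for ch in text)


raw = "g]kfn"                      # what a generic extractor returns
print(is_ascii_only(raw))          # True: nothing signals that this is Nepali
print(decode_preeti_sample(raw))   # नेपाल
```

The extracted string is valid ASCII, so no decoding step fails. That is why the problem has to be detected from the font metadata rather than from the text itself.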

LamiSema solves this end-to-end: detecting encoding type first, routing to the correct extraction strategy, recovering document structure, and extracting structured meaning using Nepal’s own administrative and legal vocabulary.
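The detect-then-route idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LamiSema's implementation: `classify_encoding`, `LEGACY_FONT_HINTS`, and the match-by-font-name heuristic are hypothetical, and only the three fonts named above are listed. The three labels and their strategies are the ones shown in the pipeline diagram on this page.

```python
# Hypothetical hint list: the three legacy fonts named on this page.
LEGACY_FONT_HINTS = ("preeti", "kantipur", "sagarmatha")

# Route labels and strategies as shown in the pipeline diagram.
STRATEGY_BY_ENCODING = {
    "unicode_native": "pdfplumber text extraction",
    "legacy_encoded": "300 DPI render, then Tesseract nep+eng",
    "scanned": "300 DPI render, then Tesseract nep+eng",
}


def classify_encoding(font_names, has_text_layer):
    """Pick a route label from embedded font names and text-layer presence."""
    if not has_text_layer:
        return "scanned"
    for name in font_names:
        # Embedded subset fonts carry a prefix such as "ABCDEF+"; drop it.
        base = name.split("+", 1)[-1].lower()
        if any(hint in base for hint in LEGACY_FONT_HINTS):
            return "legacy_encoded"
    return "unicode_native"


label = classify_encoding(["ABCDEF+Preeti"], has_text_layer=True)
print(label, "->", STRATEGY_BY_ENCODING[label])
```

A real pre-flight step would also have to handle documents that mix Unicode and legacy fonts, which this sketch resolves by treating any legacy font as decisive.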

How It Works

PDF Input
   │
   ▼
┌──────────────────────────────────┐
│  Stage 1 — Pre-flight Analysis   │  Detects font names, encoding
│  PDFPreflightService             │  type, presence of text layer
└──────────────────────────────────┘
   │
   ├─ unicode_native ──→  pdfplumber (fast, lossless)
   ├─ legacy_encoded ──→  300 DPI render → Tesseract nep+eng
   └─ scanned        ──→  300 DPI render → Tesseract nep+eng
   │
   ▼
┌──────────────────────────────────┐
│  Stage 2 — Layout Intelligence   │  Tables, sections, headings,
│                                  │  multi-column reading order
└──────────────────────────────────┘
   │
   ▼
┌──────────────────────────────────┐
│  Stage 3 — Symbolic NLP Layer    │  Rule-based NER (20+ types):
│  DevanagariTextAnalyzer          │  BS dates, currency, orgs,
└──────────────────────────────────┘  wards, districts, land parcels
   │
   ▼
┌──────────────────────────────────┐
│  Stage 4 — Domain Schema Output  │  Typed models: budget, land,
│                                  │  gazette, court, economics,
└──────────────────────────────────┘  general-purpose documents
   │
   ▼
Structured JSON output with confidence scores
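As an illustration of what a rule-based layer like Stage 3 might do, the sketch below pulls Bikram Sambat dates and NPR amounts out of text with regular expressions. The patterns, the `extract_entities` function, and the `CURRENCY_NPR` label are assumptions made for this example (only `DATE_BS` appears in the Quickstart output); the actual rules in `DevanagariTextAnalyzer` are not shown on this page.

```python
import re

# Map Devanagari digits to ASCII so one pattern covers both scripts.
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")

# Hypothetical patterns: a numeric Y/M/D date, and an amount after a currency marker.
BS_DATE = re.compile(r"\b(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\b")
NPR_AMOUNT = re.compile(r"(?:Rs\.?|रु\.?|NPR)\s*(\d[\d,]*(?:\.\d+)?)")


def extract_entities(text):
    """Return a list of dicts for BS-style dates and NPR amounts found in text."""
    normalized = text.translate(DEVANAGARI_DIGITS)
    entities = []
    for m in BS_DATE.finditer(normalized):
        year, month, day = (int(g) for g in m.groups())
        # BS months run up to 32 days, so day 32 is allowed.
        if 1 <= month <= 12 and 1 <= day <= 32:
            entities.append({"type": "DATE_BS", "text": m.group(0),
                             "year": year, "month": month, "day": day})
    for m in NPR_AMOUNT.finditer(normalized):
        amount = float(m.group(1).replace(",", ""))
        entities.append({"type": "CURRENCY_NPR", "text": m.group(0), "amount": amount})
    return entities


for entity in extract_entities("मिति २०८१/०४/१५ मा रु. १,५०,००० विनियोजन"):
    print(entity)
```

A pattern this simple cannot tell a BS date from a Gregorian one written in the same format, so a production rule set would need surrounding context (for example the word मिति or an explicit "BS" marker) before assigning the type.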

Quickstart

pip install lamisema
brew install tesseract tesseract-lang   # macOS
# apt-get install tesseract-ocr tesseract-ocr-nep  # Ubuntu

from lamisema import LamiSema

lamisema = LamiSema()

with open("budget-2081.pdf", "rb") as f:
    result = lamisema.extract(f.read(), filename="budget-2081.pdf")

print(result.encoding_type)        # "legacy_encoded"
print(result.language)             # "ne"
print(result.overall_confidence)   # 0.74
print(result.pages[0].entities)    # [Entity(type="DATE_BS", ...)]

See Installation for full setup instructions.

Key Features

Feature	Description
Encoding pre-flight	Detects 20+ legacy Nepali fonts before extraction
Automatic routing	No configuration needed; the correct strategy is chosen per document
Layout intelligence	Tables, sections, headings, multi-column layouts as structure
Deep Nepali NER	20+ entity types: dates, currency, orgs, wards, districts, land parcels, court refs, gazette refs
High-availability storage	Pluggable backends: `memory`, `disk`, or `S3/MinIO` with automatic fallback
Language-agnostic core	`NLPBackend` interface allows adding Hindi, English, and other languages
Bikram Sambat normalization	Converts BS dates to approximate AD equivalents
Domain schemas	Typed output for budget, land, gazette, court, economics, and general-purpose documents
Cross-document intelligence	Deduplication, version tracking, entity co-reference across corpora
FastAPI + Docker	Ready for production deployment with `docker-compose`
Fully offline	No API keys, no network calls, no data sent anywhere
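The "approximate AD equivalents" in the Bikram Sambat row can be illustrated with year-level arithmetic. The sketch below is an assumption about what an approximation could look like, not LamiSema's normalizer; `approximate_ad_years` is a hypothetical helper. It relies on the BS new year falling in mid-April, so a BS year runs about 56 years and 8.5 months ahead of the Gregorian calendar.

```python
def approximate_ad_years(bs_year: int, bs_month: int) -> tuple:
    """Return the Gregorian year(s) a BS month can fall in (month 1 = Baisakh)."""
    if not 1 <= bs_month <= 12:
        raise ValueError("bs_month must be between 1 and 12")
    if bs_month <= 8:
        # Baisakh to Mangsir: mid-April to mid-December.
        return (bs_year - 57,)
    if bs_month == 9:
        # Poush straddles the Gregorian new year.
        return (bs_year - 57, bs_year - 56)
    # Magh to Chaitra: mid-January to mid-April.
    return (bs_year - 56,)


print(approximate_ad_years(2081, 1))    # (2024,)
print(approximate_ad_years(2081, 9))    # (2024, 2025)
print(approximate_ad_years(2081, 12))   # (2025,)
```

Exact day-level conversion needs a lookup table, because BS month lengths vary from year to year; that is presumably why the feature is described as approximate.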