The Nepali PDF Encoding Problem

To see why LamiSema exists, you need to know how Nepali PDFs store text, and why extracting it the wrong way fails without any visible sign.

Three Encoding Types

Nepali PDFs fall into exactly three categories. Each requires a completely different extraction strategy. Using the wrong one produces silent corruption — no error, no warning.

1. Unicode-Native

Source: Modern government portals (e.g. mof.gov.np), banks, international organizations, documents produced after ~2010.

How it works: Characters are stored as actual Unicode codepoints in the Devanagari block (U+0900–U+097F). The text layer contains real न, े, प, ा, ल characters.
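As a rough illustration of what "real Devanagari codepoints" means in practice, the sketch below measures how much of an extracted string falls inside U+0900–U+097F. The function name `devanagari_ratio` is hypothetical and is not part of LamiSema; it only shows the kind of check the codepoint range makes possible.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


print(devanagari_ratio("नेपाल सरकार"))  # → 1.0 (every non-space character is Devanagari)
print(devanagari_ratio("g]kfn ;/sf/"))   # → 0.0 (ASCII only)
```

A ratio near zero on a document that is visibly Nepali is a hint that the text layer is not Unicode, although mixed Nepali and English documents will naturally land somewhere in between.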

Extraction: Direct text layer extraction via pdfplumber or PyMuPDF. Fast (milliseconds per page), lossless, no OCR needed.

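Detection: Font encoding is Identity-H, WinAnsiEncoding, or a standard Unicode CMap, and no legacy font names are present. The encoding alone is not decisive, because legacy fonts can also report WinAnsiEncoding; the absence of legacy font names is the deciding signal.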

2. Legacy-Encoded (Preeti / Kantipur / Sagarmatha)

Source: Government documents from before ~2010, district offices, many court records, land registry documents, older newspaper archives.

How it works: These fonts predate widespread Unicode support for Devanagari. They remap ASCII codepoints so that each one renders as a Devanagari glyph. The font file maps g → न, ] → े, k → प, f → ा, n → ल, so the word "नेपाल" is stored as the ASCII string g]kfn.

On screen, the PDF looks perfect because the rendering engine applies the font map. But in the text layer, the stored bytes are ASCII.
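The remapping described above can be reproduced in a few lines. This is only a sketch built from the five mappings quoted in this section; `PREETI_SAMPLE_MAP` and `decode_sample` are hypothetical names, and this is not how LamiSema extracts text (it uses OCR, as described below). A complete Preeti table has many more entries and needs reordering rules that a plain character-by-character lookup cannot express.

```python
# Only the five mappings quoted in the text. Escapes are used so that the
# combining vowel signs are unambiguous in source form.
PREETI_SAMPLE_MAP = {
    "g": "\u0928",  # न
    "]": "\u0947",  # vowel sign e
    "k": "\u092a",  # प
    "f": "\u093e",  # vowel sign aa
    "n": "\u0932",  # ल
}


def decode_sample(stored: str) -> str:
    """Replace each stored ASCII character with the glyph the font draws."""
    return "".join(PREETI_SAMPLE_MAP.get(ch, ch) for ch in stored)


print(decode_sample("g]kfn"))  # → नेपाल
```

The lookup works for this word because its glyphs happen to be stored in the same order Unicode expects, which is not true of legacy-encoded text in general.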

What happens without encoding detection:

import pdfplumber

with pdfplumber.open("preeti-budget.pdf") as pdf:
    text = pdf.pages[0].extract_text()
# text → 'g]kfn ;/sf/ sf] cfGtl/s /fh:j'
# Looks like garbage. It IS garbage. No error raised.

Extraction: The text layer must be bypassed completely. LamiSema renders each page at 300 DPI and runs Tesseract OCR (nep+eng) on the result. The rendered image shows the Devanagari glyphs exactly as a reader sees them, so Tesseract works from the visible text rather than from the ASCII text layer.
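A minimal sketch of the render-then-OCR step, assuming PyMuPDF, pytesseract, and Pillow are installed and that the Tesseract `nep` and `eng` language data are available. The function `ocr_page` and the two constants are hypothetical names for illustration; LamiSema's internal code may be organised differently.

```python
import io

OCR_DPI = 300
OCR_LANGS = "nep+eng"  # Tesseract language packs named in the text above


def ocr_page(pdf_path: str, page_index: int = 0) -> str:
    """Render one page to an image and OCR it, ignoring the text layer."""
    # Third-party imports are local so this module loads without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        # Rasterize the page; what Tesseract sees is what a reader sees.
        pix = doc[page_index].get_pixmap(dpi=OCR_DPI)
        image = Image.open(io.BytesIO(pix.tobytes("png")))
        return pytesseract.image_to_string(image, lang=OCR_LANGS)
```

The image is passed to Tesseract through an in-memory buffer, so nothing is written to disk by this sketch itself.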

Detection: PyMuPDF's page.get_fonts() returns font metadata including the base font name. LamiSema strips subset prefixes (ABCDEF+Preeti → Preeti) and checks against a list of 20+ known legacy Nepali font names.
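The prefix stripping and list lookup might look roughly like the following. The names `normalize_font_name`, `is_legacy_font`, and `legacy_fonts_in_pdf` are hypothetical, the font set is a small sample of the list at the end of this page, and the removal of style suffixes such as `-Bold` is an assumption of this sketch rather than documented LamiSema behaviour.

```python
import re

# A small sample of the known legacy names; the real list is longer.
LEGACY_FONTS = {"preeti", "kantipur", "sagarmatha", "himali"}

# Subset fonts are tagged with six uppercase letters and a plus sign.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def normalize_font_name(base_font: str) -> str:
    """'ABCDEF+Preeti-Bold' -> 'preeti' (prefix and style suffix removed)."""
    name = SUBSET_PREFIX.sub("", base_font)
    name = re.split(r"[-,]", name, maxsplit=1)[0]
    return name.strip().lower()


def is_legacy_font(base_font: str) -> bool:
    return normalize_font_name(base_font) in LEGACY_FONTS


def legacy_fonts_in_pdf(pdf_path: str) -> set[str]:
    """Collect normalized legacy font names found on any page."""
    import fitz  # PyMuPDF, imported lazily

    found = set()
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Each entry is (xref, ext, type, basefont, name, encoding).
            for font in page.get_fonts():
                if is_legacy_font(font[3]):
                    found.add(normalize_font_name(font[3]))
    return found
```

An exact match on the normalized name is used here instead of a substring test, so that a Unicode font such as `NotoSansDevanagari-Regular` is not mistaken for a legacy one.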

3. Scanned / Image

Source: Physical forms scanned to PDF, old records photographed, faxed documents, field survey forms.

How it works: The PDF contains no text layer at all — just a rasterized image of each page. page.extract_text() returns an empty string.

Extraction: Same as legacy-encoded: render at 300 DPI and run OCR. The difference is detection — no fonts to inspect, just an absence of text.

Detection: LamiSema checks the first three pages for any extractable text. If none is found and no legacy fonts are present, the document is classified as scanned.
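Put together, the three detection rules amount to a small decision function. The sketch below assumes the caller has already extracted text from the pages and checked the fonts; `classify_pdf` and its string labels are hypothetical, and the order of the checks is an assumption consistent with the rule above.

```python
def classify_pdf(page_texts: list[str], has_legacy_fonts: bool) -> str:
    """Return 'legacy', 'scanned', or 'unicode' for a document."""
    if has_legacy_fonts:
        return "legacy"
    sample = page_texts[:3]  # only the first three pages are checked
    # Treat None and whitespace-only results as "no text".
    has_text = any((text or "").strip() for text in sample)
    return "unicode" if has_text else "scanned"


print(classify_pdf(["", "  ", ""], has_legacy_fonts=False))  # → scanned
print(classify_pdf(["g]kfn"], has_legacy_fonts=True))        # → legacy
```

Checking for legacy fonts first matters: a legacy-encoded page does have extractable text, so a text-only test would misfile it as Unicode.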

Why 300 DPI

Tesseract's Devanagari accuracy degrades significantly below 300 DPI. Devanagari has complex conjuncts, and its matras (vowel signs) sit above, below, and beside consonants. At 150 DPI many of these become indistinguishable. At 300 DPI, the distinction between visually similar characters (ग vs ग़, घ vs ध, ी vs ि) becomes reliable.

300 DPI also produces large images (a typical A4 page becomes ~2480×3508 pixels). LamiSema processes them in memory and does not write to disk.
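The pixel figures follow directly from the page size. The sketch below works through the arithmetic for A4 (210 × 297 mm) and estimates the uncompressed size of the rendered image, assuming 8-bit RGB at 3 bytes per pixel; the helper names are illustrative only.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # ISO 216 A4, width x height


def page_pixels(size_mm: tuple[int, int], dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    width_mm, height_mm = size_mm
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def rgb_mib(size_px: tuple[int, int]) -> float:
    """Uncompressed size of an 8-bit RGB image in MiB."""
    width, height = size_px
    return width * height * 3 / 2**20


for dpi in (150, 300):
    px = page_pixels(A4_MM, dpi)
    print(dpi, px, f"{rgb_mib(px):.1f} MiB")
# 150 (1240, 1754) 6.2 MiB
# 300 (2480, 3508) 24.9 MiB
```

Doubling the DPI quadruples the pixel count, so a 300 DPI page costs about four times the memory of a 150 DPI one. A grayscale render would need roughly a third of the RGB figure.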

The Silent Failure Problem

This matters because applying the wrong strategy raises no error:

Scenario	What you see	What actually happened
Unicode PDF + pdfplumber	Clean Devanagari text	Correct
Legacy PDF + pdfplumber	`g]kfn ;/sf/` or similar	Silent corruption
Scanned PDF + pdfplumber	Empty string `""`	Silent miss
Legacy PDF + Tesseract (no pre-flight)	Correct OCR output	Correct but slow
Unicode PDF + Tesseract (no pre-flight)	Mostly correct with OCR errors	Unnecessary degradation

The last row is why routing matters even when OCR works: running OCR on a clean Unicode PDF introduces character errors that were not in the original document and wastes significant time.
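The table reduces to a lookup from detected encoding type to extraction strategy. This is a sketch with hypothetical labels (`"text_layer"`, `"ocr"`) rather than LamiSema's actual interface. It raises on an unknown type so that a routing mistake is loud instead of silent.

```python
STRATEGY_BY_ENCODING = {
    "unicode": "text_layer",  # direct extraction, no OCR
    "legacy": "ocr",          # text layer exists but must be ignored
    "scanned": "ocr",         # no text layer at all
}


def choose_strategy(encoding_type: str) -> str:
    """Pick an extraction strategy; refuse to guess for unknown types."""
    try:
        return STRATEGY_BY_ENCODING[encoding_type]
    except KeyError:
        raise ValueError(f"unknown encoding type: {encoding_type!r}") from None


print(choose_strategy("unicode"))  # → text_layer
print(choose_strategy("legacy"))   # → ocr
```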

Known Legacy Nepali Fonts

LamiSema's pre-flight detector identifies all of the following:

Font	Common usage
Preeti	Most common; government forms, newspapers
Kantipur	Kantipur Media Group publications
Sagarmatha	Government documents, older NGO reports
Himali	Official government correspondence
Himali TT	TrueType variant of Himali
PCS Nepali	Public service documents
Navjeevan	Religious texts, older publications
Narad	Older government records
Fontasy Himali	Desktop publishing, older websites
Fontasy Himalb	Bold variant of Fontasy Himali
Kanjirowa	Regional government documents
Kuti	Older academic and research documents
Shangrila	Travel and tourism sector documents
GuptaLipi	Some district court records
Sabdatara	Educational materials
Sambhav	Legal documents
Everest	Newspaper archives
Nepal	Generic legacy font
Ratna	Some government publications
Devanagari	Generic name used by multiple legacy vendors

If you encounter a legacy font not in this list, open a GitHub issue with the font name (visible via pdffonts your-file.pdf).