Python API¶

Quick reference¶

from lamisema import LamiSema

lamisema = LamiSema()

with open("report.pdf", "rb") as f:
    result = lamisema.extract(f.read(), filename="report.pdf")

result.encoding_type        # "unicode_native" | "legacy_encoded" | "scanned"
result.language             # "ne" (or active language code)
result.overall_confidence   # 0.0 – 1.0
result.ocr_backend          # "tesseract" | "easyocr" | "none"
result.warnings             # list of str
result.pages[0].raw_text    # extracted text for page 1
result.pages[0].entities    # list of Entity
result.pages[0].script_ratio  # fraction of chars in the target script (e.g. Devanagari)
result.pages[0].confidence  # per-page confidence score

LamiSema¶

The main entry point. LamiSema is the canonical name; NepaliPDFExtractionPipeline is an identical alias, exported for callers who prefer the more descriptive name. Both are stateless, so a single instance can safely be shared across threads and requests.

from lamisema import LamiSema
from lamisema.ocr import TesseractBackend, EasyOCRBackend
from lamisema.storage.s3 import S3Storage
from lamisema.nlp.nepali import NepaliNLPBackend

# Default: auto-selects OCR, Nepali NLP, and InMemoryStorage.
lamisema = LamiSema()

# Explicit S3 storage + Tesseract
lamisema = LamiSema(
    ocr_backend=TesseractBackend(),
    storage=S3Storage(),
)
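Since the pipeline is described as stateless and safe to share across threads, one instance can serve a whole worker pool. The sketch below illustrates that pattern with a hypothetical `StubPipeline` stand-in (not part of the library) that only mimics the documented `extract(pdf_bytes, filename, doc_id)` call shape, so it runs without lamisema installed. In real code the stub would be replaced by `LamiSema()`.

```python
from concurrent.futures import ThreadPoolExecutor


class StubPipeline:
    """Hypothetical stand-in exposing the documented extract() call shape."""

    def extract(self, pdf_bytes: bytes, filename: str, doc_id: str = "DOC") -> dict:
        # The real pipeline returns an ExtractionResult; a dict keeps the sketch small.
        return {"doc_id": doc_id, "filename": filename, "size": len(pdf_bytes)}


def extract_many(pipeline, documents, max_workers=4):
    """Run extract() concurrently on (doc_id, filename, pdf_bytes) tuples.

    One shared pipeline instance is used by every worker thread.
    Results come back in the same order as the input.
    """
    def run(item):
        doc_id, filename, pdf_bytes = item
        return pipeline.extract(pdf_bytes, filename=filename, doc_id=doc_id)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, documents))


docs = [("DOC-001", "a.pdf", b"%PDF-a"), ("DOC-002", "b.pdf", b"%PDF-bb")]
results = extract_many(StubPipeline(), docs)
```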

`.extract(pdf_bytes, filename, doc_id="DOC")`¶

Run the full four-stage pipeline.

Parameter	Type	Description
`pdf_bytes`	`bytes`	Raw PDF content
`filename`	`str`	Original filename (metadata only)
`doc_id`	`str`	Identifier used in log messages (optional; defaults to `"DOC"`)

Returns: ExtractionResult

PDFPreflightService¶

Run Stage 1 encoding detection without extraction.

from lamisema import PDFPreflightService

svc = PDFPreflightService()

with open("report.pdf", "rb") as f:
    result = svc.analyze(f.read(), filename="report.pdf", doc_id="DOC-001")

result.encoding_type          # EncodingType enum
result.fonts                  # list of FontInfo
result.has_text_layer         # bool
result.recommended_strategy   # human-readable string

Data Models¶

`ExtractionResult`¶

Field	Type	Description
`doc_id`	`str`	Document identifier
`filename`	`str`	Original filename
`encoding_type`	`EncodingType`	Detected encoding
`total_pages`	`int`	Page count
`pages`	`list[PageResult]`	Per-page results
`overall_confidence`	`float`	Mean confidence across pages
`ocr_backend`	`str`	Backend used: `tesseract`, `easyocr`, or `none`
`warnings`	`list[str]`	Non-fatal warnings
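The table describes `overall_confidence` as the mean of the per-page confidence scores. As a rough illustration of that relationship, the sketch below recomputes it from a list of pages. `PageStub` is a hypothetical stand-in for `PageResult`, and returning `0.0` for a document with no pages is an assumption; the library's behaviour in that case is not documented here.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class PageStub:
    """Hypothetical stand-in carrying two of the documented PageResult fields."""
    page_number: int
    confidence: float


def mean_confidence(pages):
    """Arithmetic mean of per-page confidence; 0.0 for an empty list (assumption)."""
    if not pages:
        return 0.0
    return fmean(p.confidence for p in pages)


pages = [PageStub(1, 0.9), PageStub(2, 0.6), PageStub(3, 0.75)]
overall = mean_confidence(pages)  # roughly 0.75
```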

`PageResult`¶

Field	Type	Description
`page_number`	`int`	1-indexed page number
`raw_text`	`str`	Extracted text
`script_ratio`	`float`	Fraction of chars in target script (e.g. Devanagari)
`entities`	`list[Entity]`	Detected named entities
`extraction_method`	`str`	`text_layer`, `tesseract`, or `easyocr`
`confidence`	`float`	Per-page confidence score
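To make `script_ratio` concrete, here is one plausible way to compute such a fraction for Devanagari. This is only a sketch: the function name `devanagari_ratio` is hypothetical, and the choices to ignore whitespace and to count only the U+0900 to U+097F block are assumptions, not the library's documented rules.

```python
def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+0900..U+097F is the main Devanagari Unicode block.
    in_script = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return in_script / len(chars)


# "नेपाल" is 5 code points, "ABC" is 3, so the ratio is 5 / 8.
ratio = devanagari_ratio("नेपाल ABC")  # → 0.625
```

A low value on a page that is expected to be Nepali can hint at a legacy font encoding or a poor OCR pass, which is why it is worth checking alongside `confidence`.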

`Entity`¶

Field	Type	Description
`text`	`str`	Surface form from the document
`entity_type`	`str`	Entity class (see below)
`normalized`	`str | None`	Normalized form (e.g. AD date string, `NPR 12500`)
`confidence`	`float`	Rule match confidence

Entity types (v1.x and roadmap):

Type	Example	Status
`DATE_BS`	`२०८१ साल असार १५`	✅ v1.0
`CURRENCY`	`रु. १२,५००`	✅ v1.0
`ORGANIZATION`	`अर्थ मन्त्रालय`	✅ v1.0
`WARD_CODE`	`वडा नं. ४`	v1.2
`VDC_MUNICIPALITY`	`काठमाडौं महानगरपालिका`	v1.2
`DISTRICT`	`सिन्धुपाल्चोक`	v1.2
`PROVINCE`	`बागमती प्रदेश`	v1.2
`LAND_PARCEL`	`कि.नं. ४५२`	v1.2
`PERSON_NAME_NE`	`श्री राम बहादुर थापा`	v1.2
`PHONE_NE`	`९८४१२३४५६७`	v1.2
`COURT_CASE`	`मिसिल नं. ०७८-CR-०४५`	v1.2
`GAZETTE_REF`	`नेपाल राजपत्र भाग ३`	v1.2
`GOV_POSITION`	`सचिव`	v1.2
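Downstream code often wants entities bucketed by type. The sketch below groups a list of entities by `entity_type` and drops low-confidence matches. `EntityStub` is a hypothetical stand-in that mirrors the documented `Entity` fields, and the confidence values are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityStub:
    """Hypothetical stand-in mirroring the documented Entity fields."""
    text: str
    entity_type: str
    normalized: Optional[str]
    confidence: float


def group_by_type(entities, min_confidence=0.0):
    """Return {entity_type: [entities]} keeping only matches at or above the threshold."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity.confidence >= min_confidence:
            grouped[entity.entity_type].append(entity)
    return dict(grouped)


# Illustrative values only; real confidences come from the rule matcher.
sample = [
    EntityStub("रु. १२,५००", "CURRENCY", "NPR 12500", 0.95),
    EntityStub("अर्थ मन्त्रालय", "ORGANIZATION", None, 0.80),
    EntityStub("रु. ५००", "CURRENCY", "NPR 500", 0.40),
]
grouped = group_by_type(sample, min_confidence=0.5)
```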

`EncodingType`¶

from lamisema.models import EncodingType

EncodingType.UNICODE_NATIVE   # "unicode_native"
EncodingType.LEGACY_ENCODED   # "legacy_encoded"
EncodingType.SCANNED          # "scanned"
EncodingType.UNKNOWN          # "unknown"
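When encoding types are stored as plain strings (in a database row or a JSON payload, say), they need to be mapped back to enum members. The sketch below shows one way to do that with a local mirror of the four documented values. `EncodingTypeStub` and `parse_encoding` are hypothetical names, and whether the real `EncodingType` also subclasses `str` is an assumption.

```python
from enum import Enum


class EncodingTypeStub(str, Enum):
    """Hypothetical mirror of the documented EncodingType members and values."""
    UNICODE_NATIVE = "unicode_native"
    LEGACY_ENCODED = "legacy_encoded"
    SCANNED = "scanned"
    UNKNOWN = "unknown"


def parse_encoding(value: str) -> EncodingTypeStub:
    """Map a stored string back to a member, falling back to UNKNOWN."""
    try:
        return EncodingTypeStub(value)
    except ValueError:
        return EncodingTypeStub.UNKNOWN


scanned = parse_encoding("scanned")
fallback = parse_encoding("not-a-real-type")
```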

Custom OCR Backend¶

Implement OCRBackend to plug in any OCR engine:

from lamisema.ocr import OCRBackend
from lamisema import LamiSema

class GeminiVisionBackend(OCRBackend):
    @property
    def name(self) -> str:
        return "gemini-vision"

    def extract_text(self, image_bytes: bytes) -> str:
        # Call Gemini Vision API with image_bytes
        ...

lamisema = LamiSema(
    ocr_backend=GeminiVisionBackend()
)
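The same interface is handy for tests: a backend that returns canned text lets the rest of a pipeline be exercised without a real OCR engine. The sketch below assumes the interface consists of exactly the `name` property and `extract_text` method shown above. `OCRBackendStub` is a local, hypothetical mirror of `OCRBackend` so the example runs without the library, and `CannedTextBackend` is an illustrative name.

```python
from abc import ABC, abstractmethod


class OCRBackendStub(ABC):
    """Hypothetical mirror of the interface shown above."""

    @property
    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def extract_text(self, image_bytes: bytes) -> str:
        ...


class CannedTextBackend(OCRBackendStub):
    """Test double: returns fixed text and records the size of each image it sees."""

    def __init__(self, text: str):
        self._text = text
        self.calls = []

    @property
    def name(self) -> str:
        return "canned"

    def extract_text(self, image_bytes: bytes) -> str:
        self.calls.append(len(image_bytes))
        return self._text


backend = CannedTextBackend("नमस्ते")
text = backend.extract_text(b"\x89PNG")
```

With the real library, the double would subclass `OCRBackend` and be passed as `ocr_backend=` in the same way as `GeminiVisionBackend`.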

Bikram Sambat normalization (standalone)¶

from lamisema.nlp.dates import normalize_bs_date, bs_year_to_ad

normalize_bs_date("2081", "असार", "15")
# → "~2024-06-29 AD (approx)"

normalize_bs_date("२०८१", "असार", "१५")
# → "~2024-06-29 AD (approx)"

bs_year_to_ad("२०८१")
# → 2025
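The calls above accept both ASCII and Devanagari digits. As an illustration of how such input can be folded to a single form (not necessarily how the library does it), the sketch below maps Devanagari digits to ASCII with `str.translate`. The names `DEVANAGARI_TO_ASCII` and `to_ascii_digits` are hypothetical.

```python
# Translation table from Devanagari digits (U+0966..U+096F) to ASCII digits.
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")


def to_ascii_digits(text: str) -> str:
    """Replace Devanagari digits with ASCII digits, leaving other characters alone."""
    return text.translate(DEVANAGARI_TO_ASCII)


year = to_ascii_digits("२०८१")        # → "2081"
mixed = to_ascii_digits("वडा नं. ४")   # → "वडा नं. 4"

# Python's int() already understands Unicode decimal digits.
as_int = int("२०८१")                  # → 2081
```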

Rule-based NER (standalone)¶

from lamisema.nlp.ner import extract_entities

entities = extract_entities("अर्थ मन्त्रालयले २०८१ साल असार १५ मा रु. १२,५०० को बजेट पारित गर्यो।")

for e in entities:
    print(e.entity_type, e.text, e.normalized)
# DATE_BS    २०८१ साल असार १५    ~2024-06-29 AD (approx)
# CURRENCY   रु. १२,५००           NPR 12500
# ORGANIZATION  अर्थ मन्त्रालय     None
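To give a feel for what a rule-based matcher of this kind involves, here is a minimal sketch of a single `CURRENCY` rule that reproduces the `NPR 12500` normalization shown above. The pattern and the `find_currency` helper are illustrative assumptions, not the library's actual rules.

```python
import re

# "रु." followed by digits with optional comma grouping.
# In Python 3, \d on str patterns also matches Devanagari digits.
CURRENCY_RE = re.compile(r"रु\.\s*(\d+(?:,\d+)*)")


def find_currency(text: str):
    """Return (surface_text, normalized) pairs for each rupee amount found."""
    matches = []
    for m in CURRENCY_RE.finditer(text):
        digits = m.group(1).replace(",", "")
        # int() accepts Unicode decimal digits, so no manual digit mapping is needed.
        matches.append((m.group(0), f"NPR {int(digits)}"))
    return matches


found = find_currency("अर्थ मन्त्रालयले रु. १२,५०० को बजेट पारित गर्यो।")
# → [("रु. १२,५००", "NPR 12500")]
```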