Python API

Quick reference

from lamisema import LamiSema

lamisema = LamiSema()

with open("report.pdf", "rb") as f:
    result = lamisema.extract(f.read(), filename="report.pdf")

result.encoding_type        # "unicode_native" | "legacy_encoded" | "scanned" | "unknown"
result.language             # "ne" (or active language code)
result.overall_confidence   # 0.0 – 1.0
result.ocr_backend          # "tesseract" | "easyocr" | "none"
result.warnings             # list of str
result.pages[0].raw_text    # extracted text for page 1
result.pages[0].entities    # list of Entity
result.pages[0].script_ratio # fraction of chars in the target script (e.g. Devanagari)
result.pages[0].confidence  # per-page confidence score

LamiSema

The main entry point. LamiSema is the canonical name; NepaliPDFExtractionPipeline is an identical alias, exported for callers who prefer the verbose name. Instances created under either name are stateless and safe to share across threads and requests.

from lamisema import LamiSema
from lamisema.ocr import TesseractBackend, EasyOCRBackend
from lamisema.storage.s3 import S3Storage
from lamisema.nlp.nepali import NepaliNLPBackend

# Default: auto-selects OCR, Nepali NLP, and InMemoryStorage.
lamisema = LamiSema()

# Explicit S3 storage + Tesseract
lamisema = LamiSema(
    ocr_backend=TesseractBackend(),
    storage=S3Storage(),
)
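Because instances are described as stateless and thread-safe, a single instance can in principle serve a pool of worker threads. The following is a minimal sketch rather than part of the library: `extract_many` is a hypothetical helper, it relies only on the documented `.extract(pdf_bytes, filename, doc_id)` call, and using the file stem as `doc_id` is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_many(pipeline, paths, max_workers=4):
    """Run pipeline.extract over several PDFs with one shared instance.

    `pipeline` is any object exposing the documented
    .extract(pdf_bytes, filename, doc_id) method, e.g. LamiSema().
    Hypothetical helper, not part of lamisema.
    """
    def run_one(path):
        path = Path(path)
        return pipeline.extract(
            path.read_bytes(),
            filename=path.name,
            doc_id=path.stem,  # assumption: the file stem is a usable log identifier
        )

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `paths`.
        return list(pool.map(run_one, paths))
```

With the library installed, a call would look like `results = extract_many(LamiSema(), ["a.pdf", "b.pdf"])`.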

.extract(pdf_bytes, filename, doc_id="DOC")

Run the full four-stage pipeline.

| Parameter | Type | Description |
|---|---|---|
| pdf_bytes | bytes | Raw PDF content |
| filename | str | Original filename (metadata only) |
| doc_id | str | Identifier for log messages (optional, defaults to "DOC") |

Returns: ExtractionResult
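A common follow-up is to flag pages that may need manual review. The sketch below assumes only the documented `pages`, `page_number`, and `confidence` fields; `pages_needing_review` and the 0.8 threshold are illustrative choices, not part of the library. The `SimpleNamespace` objects stand in for a real ExtractionResult so the snippet runs on its own.

```python
from types import SimpleNamespace


def pages_needing_review(result, threshold=0.8):
    """Return 1-indexed page numbers whose confidence is below `threshold`.

    Hypothetical helper; the threshold is an arbitrary starting point.
    """
    return [p.page_number for p in result.pages if p.confidence < threshold]


# Stand-in for an ExtractionResult, exposing only the fields used above.
demo = SimpleNamespace(
    pages=[
        SimpleNamespace(page_number=1, confidence=0.95),
        SimpleNamespace(page_number=2, confidence=0.42),
    ]
)
print(pages_needing_review(demo))  # → [2]
```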


PDFPreflightService

Run Stage 1 encoding detection without extraction.

from lamisema import PDFPreflightService

svc = PDFPreflightService()

with open("report.pdf", "rb") as f:
    result = svc.analyze(f.read(), filename="report.pdf", doc_id="DOC-001")

result.encoding_type          # EncodingType enum
result.fonts                  # list of FontInfo
result.has_text_layer         # bool
result.recommended_strategy   # human-readable string

Data Models

ExtractionResult

| Field | Type | Description |
|---|---|---|
| doc_id | str | Document identifier |
| filename | str | Original filename |
| encoding_type | EncodingType | Detected encoding |
| total_pages | int | Page count |
| pages | list[PageResult] | Per-page results |
| overall_confidence | float | Mean confidence across pages |
| ocr_backend | str | Backend used: tesseract, easyocr, or none |
| warnings | list[str] | Non-fatal warnings |
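The table describes overall_confidence as the mean confidence across pages. Assuming this is an unweighted arithmetic mean (the reference does not say whether pages are weighted by length), the relationship can be sketched as follows. `mean_page_confidence` is a hypothetical name, and returning 0.0 for a document with no pages is an assumption.

```python
from statistics import fmean


def mean_page_confidence(page_confidences):
    """Unweighted mean of per-page confidence scores.

    Assumption: an empty document scores 0.0; the library may differ.
    """
    values = list(page_confidences)
    if not values:
        return 0.0
    return fmean(values)
```

For a real result, the input would be `[p.confidence for p in result.pages]`.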

PageResult

| Field | Type | Description |
|---|---|---|
| page_number | int | 1-indexed page number |
| raw_text | str | Extracted text |
| script_ratio | float | Fraction of chars in target script (e.g. Devanagari) |
| entities | list[Entity] | Detected named entities |
| extraction_method | str | text_layer, tesseract, or easyocr |
| confidence | float | Per-page confidence score |

Entity

| Field | Type | Description |
|---|---|---|
| text | str | Surface form from the document |
| entity_type | str | Entity class (see below) |
| normalized | str \| None | Normalized form (e.g. AD date string, NPR 12500) |
| confidence | float | Rule match confidence |

Entity types (v1.x and roadmap):

| Type | Example | Status |
|---|---|---|
| DATE_BS | २०८१ साल असार १५ | ✅ v1.0 |
| CURRENCY | रु. १२,५०० | ✅ v1.0 |
| ORGANIZATION | अर्थ मन्त्रालय | ✅ v1.0 |
| WARD_CODE | वडा नं. ४ | v1.2 |
| VDC_MUNICIPALITY | काठमाडौं महानगरपालिका | v1.2 |
| DISTRICT | सिन्धुपाल्चोक | v1.2 |
| PROVINCE | बागमती प्रदेश | v1.2 |
| LAND_PARCEL | कि.नं. ४५२ | v1.2 |
| PERSON_NAME_NE | श्री राम बहादुर थापा | v1.2 |
| PHONE_NE | ९८४१२३४५६७ | v1.2 |
| COURT_CASE | मिसिल नं. ०७८-CR-०४५ | v1.2 |
| GAZETTE_REF | नेपाल राजपत्र भाग ३ | v1.2 |
| GOV_POSITION | सचिव | v1.2 |
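Since entities are attached to individual pages, collecting them per type across a whole document takes a small loop. This is a sketch that assumes only the documented `pages`, `page_number`, `entities`, `entity_type`, `text`, and `normalized` fields; `entities_by_type` is a hypothetical helper, not a library function.

```python
from collections import defaultdict


def entities_by_type(result):
    """Group entities from every page by their entity_type.

    Returns {entity_type: [(page_number, text, normalized), ...]},
    keeping document order within each group.
    """
    grouped = defaultdict(list)
    for page in result.pages:
        for entity in page.entities:
            grouped[entity.entity_type].append(
                (page.page_number, entity.text, entity.normalized)
            )
    return dict(grouped)
```

For example, `entities_by_type(result).get("CURRENCY", [])` would list every currency amount together with the page it came from.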

EncodingType

from lamisema.models import EncodingType

EncodingType.UNICODE_NATIVE   # "unicode_native"
EncodingType.LEGACY_ENCODED   # "legacy_encoded"
EncodingType.SCANNED          # "scanned"
EncodingType.UNKNOWN          # "unknown"
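The comments above suggest that each member's value is the lowercase string shown. Under that assumption, a stored string (for example from a JSON record) can be mapped back to a member by value. The enum below is a local stand-in that mirrors the documented members so the snippet runs without the library, and `parse_encoding` is a hypothetical helper.

```python
from enum import Enum


class EncodingType(Enum):
    # Stand-in mirroring the documented members; in real code,
    # import EncodingType from lamisema.models instead.
    UNICODE_NATIVE = "unicode_native"
    LEGACY_ENCODED = "legacy_encoded"
    SCANNED = "scanned"
    UNKNOWN = "unknown"


def parse_encoding(value: str) -> EncodingType:
    """Look a member up by value, falling back to UNKNOWN."""
    try:
        return EncodingType(value)
    except ValueError:
        # Unrecognized strings are treated as UNKNOWN (an assumption).
        return EncodingType.UNKNOWN
```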

Custom OCR Backend

Implement OCRBackend to plug in any OCR engine:

from lamisema.ocr import OCRBackend
from lamisema import LamiSema

class GeminiVisionBackend(OCRBackend):
    @property
    def name(self) -> str:
        return "gemini-vision"

    def extract_text(self, image_bytes: bytes) -> str:
        # Call Gemini Vision API with image_bytes
        ...

lamisema = LamiSema(
    ocr_backend=GeminiVisionBackend()
)
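The same two-member interface (`name` and `extract_text`) also makes it straightforward to compose backends. The sketch below chains two backends and falls through to the second when the first returns no text. `FallbackBackend` is a hypothetical class, and it is written with duck typing so it runs standalone; in real use it would presumably derive from OCRBackend as in the example above.

```python
class FallbackBackend:
    """Try `primary`; if it yields only whitespace, try `secondary`.

    Hypothetical wrapper, not part of lamisema. Both arguments are
    objects exposing `name` and `extract_text(image_bytes)`.
    """

    def __init__(self, primary, secondary):
        self._primary = primary
        self._secondary = secondary

    @property
    def name(self) -> str:
        # Composite name so the result shows which pair was configured.
        return f"{self._primary.name}+{self._secondary.name}"

    def extract_text(self, image_bytes: bytes) -> str:
        text = self._primary.extract_text(image_bytes)
        if text.strip():
            return text
        return self._secondary.extract_text(image_bytes)
```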

Bikram Sambat normalization (standalone)

from lamisema.nlp.dates import normalize_bs_date, bs_year_to_ad

normalize_bs_date("2081", "असार", "15")
# → "~2024-06-29 AD (approx)"

normalize_bs_date("२०८१", "असार", "१५")
# → "~2024-06-29 AD (approx)"

bs_year_to_ad("२०८१")
# → 2025
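The examples accept both ASCII and Devanagari digits. How the library handles this internally is not documented here, but in plain Python the conversion can be sketched with a translation table. `to_ascii_digits` is a hypothetical helper; the `int()` behavior shown is standard Python, which accepts any Unicode decimal digit.

```python
# Map each Devanagari digit (०-९) to its ASCII counterpart.
_DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")


def to_ascii_digits(text: str) -> str:
    """Replace Devanagari digits with ASCII digits; leave other chars alone."""
    return text.translate(_DEVANAGARI_TO_ASCII)


print(to_ascii_digits("२०८१"))    # → 2081
print(to_ascii_digits("१२,५००"))  # → 12,500
print(int("२०८१"))                # → 2081
```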

Rule-based NER (standalone)

from lamisema.nlp.ner import extract_entities

entities = extract_entities("अर्थ मन्त्रालयले २०८१ साल असार १५ मा रु. १२,५०० को बजेट पारित गर्यो।")

for e in entities:
    print(e.entity_type, e.text, e.normalized)
# DATE_BS    २०८१ साल असार १५    ~2024-06-29 AD (approx)
# CURRENCY   रु. १२,५००           NPR 12500
# ORGANIZATION  अर्थ मन्त्रालय     None
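To illustrate what a rule of this kind can look like, here is a minimal regex sketch for the CURRENCY case only. It is not the library's actual rule set: the pattern, the `find_currency` name, and the tuple output are assumptions chosen to reproduce the `NPR 12500` normalization shown above.

```python
import re

# "रु" with an optional dot, optional whitespace, then comma-grouped digits.
# In Python 3, \d also matches Devanagari digits.
_CURRENCY = re.compile(r"रु\.?\s*(\d+(?:,\d+)*)")


def find_currency(text):
    """Return (surface_text, normalized) pairs for rupee amounts in `text`."""
    found = []
    for match in _CURRENCY.finditer(text):
        # int() accepts Unicode decimal digits, so no manual mapping is needed.
        amount = int(match.group(1).replace(",", ""))
        found.append((match.group(0), f"NPR {amount}"))
    return found


sample = "अर्थ मन्त्रालयले २०८१ साल असार १५ मा रु. १२,५०० को बजेट पारित गर्यो।"
print(find_currency(sample))  # → [('रु. १२,५००', 'NPR 12500')]
```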