Benchmark¶

Evaluation Methodology¶

LamiSema is evaluated against manually transcribed ground truth across three PDF encoding types.

Dataset Target¶

Encoding Type	Count	Source
`unicode_native`	50 PDFs	mof.gov.np, opm.gov.np, nnpc.gov.np, World Bank Nepal
`legacy_encoded`	50 PDFs	District offices, court records, pre-2010 government archives
`scanned`	50 PDFs	Physical forms, field survey records, land registry scans

Ground truth transcription by native Nepali speakers. PII anonymized before storage.

Metrics¶

Character Error Rate (CER)

CER = Levenshtein(hypothesis, reference) / len(reference)

Lower is better. 0.0 = perfect match. Capped at 1.0.

Entity F1

Precision, Recall, and F1 per entity type (DATE_BS, CURRENCY, ORGANIZATION). Matching is exact surface form (lowercased).

Date Normalization Accuracy

Fraction of detected BS dates whose normalized AD value matches ground truth.

Baseline Comparison¶

The evaluation script (benchmark/evaluate.py) compares three extraction strategies:

Strategy	Description
`pdfplumber_raw`	Raw text extraction — no routing, no OCR fallback
`tesseract_raw`	Tesseract on every page — no pre-flight routing
`lamisema`	Full LamiSema pipeline with encoding-aware routing

Run the comparison:

python benchmark/evaluate.py --baselines

Results¶

⚠️ Formal benchmark results are pending. The dataset collection and native-speaker transcription are in progress.

Expected to be published here before the v1.1 release. The evaluation script is complete and ready — only the dataset needs to be populated.

Expected findings¶

Based on manual spot-checks during development:

unicode_native: LamiSema and raw pdfplumber should be nearly identical (both use text layer). Tesseract will introduce minor OCR errors.
legacy_encoded: Raw pdfplumber will produce ~100% CER (garbage output). LamiSema's OCR routing will produce ~15–30% CER depending on scan quality.
scanned: Raw pdfplumber returns empty string (CER = 1.0). LamiSema and raw Tesseract should be comparable; LamiSema adds NER and confidence scoring on top.

Running the Evaluation¶

# Full evaluation against all three encoding types
python benchmark/evaluate.py

# Single encoding type
python benchmark/evaluate.py --type legacy_encoded

# With baseline comparison
python benchmark/evaluate.py --baselines

# Custom dataset location
python benchmark/evaluate.py \
  --dataset-dir /path/to/dataset \
  --gt-dir /path/to/ground_truth

Adding PDFs to the dataset¶

Place PDFs in benchmark/dataset/<encoding_type>/
Create benchmark/ground_truth/<pdf_stem>.txt with manually transcribed text
Create benchmark/ground_truth/<pdf_stem>.entities.json with ground truth entities
Run python benchmark/evaluate.py

See benchmark/ground_truth/README.md for the exact file format.