doceval — Document Extraction Eval Harness

Open-source eval harness for LLM document extraction pipelines. Point it at your extractor and a labeled dataset to get field-level accuracy, a failure taxonomy, and per-document cost tracking.

Overview

You’ve built an LLM-based document extractor. It seems to work. But how accurate is it, actually? Which fields fail most — and why? Did accuracy change when you updated the prompt?

Without answers, “seems to work” is all you have. That’s not good enough for production.

doceval is an open-source eval harness that gives you those answers. Point it at your extraction function and a labeled dataset, and it produces field-level accuracy, a failure taxonomy, and per-document cost tracking.

Works with any extractor (Claude, GPT, regex, rules-based) and any document schema.


How It Works

Write an extractor — a Python function that takes (doc_bytes, filepath) and returns a dict:

def extract(doc_bytes: bytes, filepath: str) -> dict:
    # call Claude, GPT, or any extraction logic
    return {"vendor": "Acme", "total": "1234.56", "date": "2026-01-15"}

Add one JSON label file per document:

{
  "vendor": "Acme Corp",
  "total": "1234.56",
  "date": "2026-01-15"
}

Run the eval:

pip install doceval

doceval run \
  --docs    ./dataset/docs \
  --labels  ./dataset/labels \
  --extractor my_module:extract
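To make "field-level accuracy" concrete, here is a minimal sketch of how a per-field score could be computed from extractor output and labels. It is not doceval's actual implementation: the function name `field_accuracy`, the exact-match rule, and the second sample document are assumptions made for illustration.

```python
from collections import defaultdict


def field_accuracy(predictions: list[dict], labels: list[dict]) -> dict[str, float]:
    """Return, per labeled field, the fraction of documents with an exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label in zip(predictions, labels):
        for field, expected in label.items():
            total[field] += 1
            # Plain string equality after trimming whitespace (assumption);
            # a real harness would normalise numbers and dates first.
            if str(pred.get(field, "")).strip() == str(expected).strip():
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Hypothetical sample data: the first pair mirrors the extractor/label example above.
preds = [
    {"vendor": "Acme", "total": "1234.56", "date": "2026-01-15"},
    {"vendor": "Globex", "total": "99.00", "date": "2026-02-01"},
]
golds = [
    {"vendor": "Acme Corp", "total": "1234.56", "date": "2026-01-15"},
    {"vendor": "Globex", "total": "99.00", "date": "2026-02-01"},
]

print(field_accuracy(preds, golds))
# → {'vendor': 0.5, 'total': 1.0, 'date': 1.0}
```

The vendor field scores 0.5 because "Acme" does not match the label "Acme Corp" in the first document.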

Failure Mode Taxonomy

Every mismatch is classified into one of four modes:

Mode            Meaning
missed_field    Label has a value; extractor returned empty
hallucination   Extractor returned a value; label is empty
wrong_format    Both non-empty; numeric or date values differ
wrong_value     Both non-empty; string values differ
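The four modes can be read as a small decision procedure. The sketch below is one possible way to express it, not doceval's own code: the `classify` function and the rule for deciding that a field is numeric or a date (the label parses as a decimal or an ISO date) are assumptions, and the values are assumed to be already normalised.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Optional


def _is_blank(value) -> bool:
    return value is None or str(value).strip() == ""


def _is_numeric_or_date(text: str) -> bool:
    """Assumption: a field is numeric/date if its label parses as a decimal or ISO date."""
    try:
        Decimal(text)
        return True
    except InvalidOperation:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def classify(expected, actual) -> Optional[str]:
    """Return the failure mode for one field, or None when the values match."""
    exp_blank, act_blank = _is_blank(expected), _is_blank(actual)
    if exp_blank and act_blank:
        return None
    if act_blank:
        return "missed_field"
    if exp_blank:
        return "hallucination"
    exp, act = str(expected).strip(), str(actual).strip()
    if exp == act:
        return None
    return "wrong_format" if _is_numeric_or_date(exp) else "wrong_value"


print(classify("Acme Corp", "Acme"))    # → wrong_value
print(classify("1234.56", "1234.65"))   # → wrong_format
```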

doceval handles numeric normalisation ($1,234.56 = 1234.56 = 1.234,56) and date normalisation (Nov 15 2012 = 2012-11-15) before comparison, so format differences don’t inflate your error count.
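As a rough illustration of what this kind of normalisation involves, the sketch below converts the example values above to a common form. It is an assumption-laden approximation rather than doceval's implementation: the helper names, the separator heuristics, and the list of accepted date formats are all hypothetical.

```python
import re
from datetime import datetime
from decimal import Decimal
from typing import Optional

# Assumed set of accepted date formats; %b and %B depend on an English locale.
DATE_FORMATS = ("%Y-%m-%d", "%b %d %Y", "%B %d %Y")


def normalise_number(text: str) -> Decimal:
    """Strip currency symbols and unify thousands/decimal separators."""
    cleaned = re.sub(r"[^\d,.\-]", "", text)
    last_comma, last_dot = cleaned.rfind(","), cleaned.rfind(".")
    if last_comma != -1 and last_dot != -1:
        # Both present: whichever appears last is taken as the decimal separator.
        if last_comma > last_dot:
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif last_comma != -1:
        # Commas only: a single comma followed by 1-2 digits is a decimal comma.
        if cleaned.count(",") == 1 and len(cleaned) - last_comma - 1 in (1, 2):
            cleaned = cleaned.replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif cleaned.count(".") > 1:
        # Several dots and no comma: treat the dots as thousands separators.
        cleaned = cleaned.replace(".", "")
    return Decimal(cleaned)


def normalise_date(text: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no known format matches."""
    cleaned = " ".join(text.replace(",", " ").split())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalise_number("$1,234.56") == normalise_number("1.234,56"))  # → True
print(normalise_date("Nov 15 2012"))                                  # → 2012-11-15
```

A lone separator is inherently ambiguous ("1,234" could be one thousand or a decimal), so any heuristic like this one should be checked against the conventions in your own documents.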


Optional Cost Tracking

Return a (dict, cost_usd) tuple from your extractor and doceval tracks cost automatically:

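def extract(doc_bytes: bytes, filepath: str) -> tuple[dict, float]:
    # client is your API client; result_dict holds the fields parsed from the response
    response = client.messages.create(...)
    # Counts input tokens only, at an example rate of $0.80 per million tokens
    cost = response.usage.input_tokens / 1e6 * 0.80
    return result_dict, cost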
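The snippet above prices input tokens only. If you also want to account for output tokens, a small helper along these lines may be useful. The function name `estimate_cost_usd` is hypothetical, and the output rate is a placeholder; check your provider's current pricing before relying on either number.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate: float,
    output_rate: float,
) -> float:
    """Estimate the cost of one call; rates are in USD per million tokens."""
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate


# Example rates (assumptions): 0.80 matches the snippet above, 4.00 is a placeholder.
cost = estimate_cost_usd(
    input_tokens=2000, output_tokens=500, input_rate=0.80, output_rate=4.00
)
print(f"{cost:.4f}")  # → 0.0036
```

Inside an extractor, the two token counts would typically come from the usage data on the API response, for example `response.usage.input_tokens` and `response.usage.output_tokens` with the Anthropic client.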

Try the Example

The repo includes a working 20-document invoice dataset with labels and a Claude Haiku extractor you can run immediately:

git clone https://github.com/dave8172/doceval
cd doceval
pip install -e ".[examples]"
export ANTHROPIC_API_KEY=sk-ant-...

doceval run \
  --docs    examples/invoices/docs \
  --labels  examples/invoices/labels \
  --extractor examples.invoices.extractor:extract
