
Document Classification with OCR

Route incoming documents — PDFs, scanned images, email attachments — by extracting text and classifying them automatically. Replaces human eyeballs at the front door.

BPMN 2.0

Visual Flow

(BPMN diagram of the pattern's flow: document arrives → OCR → entity extraction → classification → routing.)

When to Use This Pattern

Use this pattern whenever documents flow into your process and a human has to look at each one just to decide where it belongs:

  • Accounts payable: separating invoices from credit notes from statements
  • Legal intake: classifying contracts, NDAs, amendments, correspondence
  • HR: sorting CVs, references, certifications, ID scans
  • Compliance: tagging documents by retention requirement

The triage is the bottleneck, not the work that follows. Automating the triage frees the humans for the judgment calls.

How It Works

Three stages:

  1. OCR — extract text from the document. For clean PDFs with embedded text, just read it. For scans or images, run an OCR engine (AWS Textract, Google Document AI, Tesseract for self-hosted).
  2. Entity extraction — pull structured fields: invoice_number, total_amount, counterparty_name, effective_date. Most commercial OCR services bundle this.
  3. Classification — assign a document type. Rule-based if the keywords are reliable ("Invoice" in the header). ML-based (a classifier trained on past labelled docs) when rules get too brittle.

Route each document based on its class plus any extracted fields (e.g. invoice with total > $10K goes to tier-2 AP).
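A minimal routing sketch, assuming hypothetical names (`ClassifiedDoc`, `route`, the `ap-*` queue names) and the $10K tier-2 rule from above:

```python
from dataclasses import dataclass, field

@dataclass
class ClassifiedDoc:
    doc_type: str                      # e.g. "invoice", "credit_note", "statement"
    fields: dict = field(default_factory=dict)  # entities from stage 2

def route(doc: ClassifiedDoc) -> str:
    """Pick a work queue from the class plus extracted fields."""
    if doc.doc_type == "invoice":
        # High-value invoices get extra scrutiny (tier-2 AP).
        if doc.fields.get("total_amount", 0) > 10_000:
            return "ap-tier-2"
        return "ap-tier-1"
    if doc.doc_type == "credit_note":
        return "ap-credits"
    return "manual-triage"             # unknown types are never dropped silently
```

The fallback branch matters as much as the happy path: anything the classifier can't place goes to a named queue, not to the floor.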

Warning

Confidence scores exist for a reason. Never auto-route a classification below your confidence threshold — send it to a human review queue and learn from the corrections.
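A sketch of that gate, with an illustrative threshold (0.85 here is an assumption; tune it against your own error data):

```python
CONFIDENCE_THRESHOLD = 0.85  # assumption: calibrate per document type

def dispatch(doc_type: str, confidence: float) -> str:
    """Auto-route only above the threshold; everything else goes to humans."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"auto:{doc_type}"
    # Low-confidence items land in the review queue; the human's final
    # label becomes a training example for the next model version.
    return "review-queue"
```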

Implementation Guide

Step 1: Measure baseline human accuracy

Before automating, know what you're replacing. If humans are right 97% of the time and the model is 93%, you've made things worse. Sample a week of decisions; label them; compare.
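The comparison itself is trivial once the labels exist; a helper like this (hypothetical name) works for both the human sample and the model's predictions against the same adjudicated ground truth:

```python
def accuracy(decisions: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, ground_truth) pairs that agree."""
    correct = sum(1 for pred, truth in decisions if pred == truth)
    return correct / len(decisions)
```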

Step 2: Pick OCR output shape for your data

Forms benefit from structured OCR (AWS Textract's forms, Google's Form Parser). Free-text documents want plain extraction. Don't pay for structured OCR where you don't need it.
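A related cheap check before paying for OCR at all: if the PDF already has a usable text layer, skip the OCR engine entirely. A heuristic sketch (the character threshold is an assumption; the text itself would come from a PDF library such as pypdf):

```python
def needs_ocr(extracted_text: str, min_chars: int = 50) -> bool:
    """Treat a near-empty text layer as 'this is a scan, send it to OCR'."""
    # extracted_text would come from e.g. pypdf:
    #   "".join(p.extract_text() or "" for p in PdfReader(path).pages)
    return len(extracted_text.strip()) < min_chars
```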

Step 3: Decide between rules, a model, or both

Rules: fast to start, brittle when formats change. Model: slower to start (needs labelled data), robust to variation. Hybrid is common — rules handle the obvious cases, model handles the rest, human handles low-confidence.

Step 4: Build the human-review queue from day one

No production doc pipeline runs at 100% accuracy. Plan a review queue for low-confidence items. Route human corrections back as labelled training data.
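Capturing corrections as training data can be as simple as appending JSONL records (the schema here is an assumption, not a standard):

```python
import json
import time

def record_correction(doc_id: str, predicted: str, corrected: str, path: str) -> None:
    """Append a human correction as a labelled training example (JSONL)."""
    example = {
        "doc_id": doc_id,
        "predicted": predicted,
        "label": corrected,       # the human's label is the ground truth
        "ts": time.time(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(example) + "\n")
```

Keeping the original prediction alongside the corrected label also gives you a running confusion log for free.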

Step 5: Track accuracy in production

Periodically sample outcomes and manually verify. Drift happens — a supplier changes their invoice layout and your model quietly starts misclassifying. Monitoring is the safety net.

Tips & Best Practices

  • Store originals forever. Extracted data can be re-derived; lost originals can't.
  • Redact PII before sending to external classifiers unless you've cleared it with legal.
  • Version the classifier. "The model changed last Tuesday" is a real debugging clue.
  • Route rejections, don't drop them. An unknown-type document going silently to /dev/null is a customer-impact bug waiting to happen.
  • Start small. One doc type at a time. Getting invoices right is already valuable; trying to do 15 types at once usually fails.

Related patterns