Case Study

Building a Searchable Archive from Scanned Legal Documents

How I built an OCR pipeline and full-text search system to make thousands of court documents accessible and navigable.

Shipped Products
Epstein Document Browser

The Problem

When thousands of court documents get publicly released, they arrive as scanned PDFs—often from faxes, photocopies, or decades-old legal filings. The text isn't machine-readable. The metadata is inconsistent. And there's no way to search across them without manually opening each file.

I wanted to build a tool where journalists, researchers, and the public could search across the entire document set instantly—type a name or phrase and get relevant documents with highlighted matches.

System Architecture

The architecture has two main components: an ingestion pipeline for processing documents, and a search frontend for exploring them.

┌─────────────────────────────────────────────────────────────────────────┐
│                         INGESTION PIPELINE                               │
│                                                                          │
│  PDF Files ──▶ Image Extraction ──▶ Preprocessing ──▶ Tesseract OCR     │
│                                          │                    │          │
│                   (deskew, contrast)     │                    ▼          │
│                                          │           Text + Confidence   │
│                                          │                    │          │
│                                          ▼                    ▼          │
│                                    PostgreSQL ◀──────── Full-Text Index │
└─────────────────────────────────────────────────────────────────────────┘
                                        │
                    ┌───────────────────┘
                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           SEARCH FRONTEND                                │
│                                                                          │
│  Query ──▶ PostgreSQL FTS ──▶ Ranked Results ──▶ PDF.js Viewer          │
│                                                        │                 │
│                                                        ▼                 │
│                                            In-browser highlighting       │
└─────────────────────────────────────────────────────────────────────────┘
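The ingestion side of that diagram can be sketched as a composition of stages. This is a minimal illustration with injected stage functions (the names extract_pages, preprocess, ocr_page, and store are placeholders, not the actual implementation):

```python
def ingest_document(pdf_path, extract_pages, preprocess, ocr_page, store):
    """Run one PDF through the pipeline: extract page images, clean them,
    OCR each page, and persist the text with its confidence score.

    Each stage is passed in as a function so it can be swapped or stubbed;
    the real stages would wrap pdf-to-image extraction, OpenCV preprocessing,
    Tesseract, and a PostgreSQL insert.
    """
    for page_no, image in enumerate(extract_pages(pdf_path), start=1):
        cleaned = preprocess(image)
        text, confidence = ocr_page(cleaned)
        store(pdf_path, page_no, text, confidence)
```

Keeping the stages decoupled this way also makes each one testable in isolation.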

Technical Decisions

Why Tesseract OCR?

I evaluated three OCR options:

  • Tesseract: Open source, battle-tested, free
  • AWS Textract: Cloud-based, expensive at scale, excellent accuracy
  • Google Cloud Vision: Similar to Textract, great for structured forms

Tesseract won because:

  1. Cost: Processing thousands of pages through cloud OCR would cost hundreds of dollars
  2. Privacy: Legal documents stay local, no third-party cloud involved
  3. Control: I could tune preprocessing and confidence thresholds

The tradeoff: Tesseract struggles with low-quality scans. I invested heavily in preprocessing to compensate.

Why PostgreSQL Full-Text Search?

PostgreSQL's built-in full-text search is underrated. For this use case, it beat dedicated search engines:

  1. Single database: Documents, metadata, and search index in one place
  2. Ranking out of the box: ts_rank() and ts_rank_cd() for relevance scoring
  3. Highlighting: ts_headline() generates snippets with match highlights
  4. Phrase search: Native support for "exact phrase" queries

For example, a single query can return ranked results with highlighted snippets:
-- Search with ranking and highlighting
SELECT
  id,
  title,
  ts_rank(search_vector, plainto_tsquery('english', $1)) AS rank,
  ts_headline('english', content, plainto_tsquery('english', $1),
    'StartSel=****, StopSel=****, MaxWords=35, MinWords=15'
  ) AS snippet
FROM documents
WHERE search_vector @@ plainto_tsquery('english', $1)
ORDER BY rank DESC
LIMIT 20;

(In production, **** would be HTML highlight tags like <mark>)

For a corpus of ~10,000 documents, PostgreSQL handles queries in under 50ms. I'd consider Elasticsearch only if the corpus grew 100x.

Why PDF.js for the Viewer?

The challenge: users need to see the original document AND the searchable text. Most PDF viewers don't support text overlay on scanned images.

PDF.js (Mozilla's PDF renderer) lets me:

  1. Render the original scanned pages as the user sees them
  2. Overlay an invisible text layer from the OCR
  3. Highlight search terms in that text layer
  4. Support keyboard navigation and accessibility

The result feels like searching a native document, even though the underlying PDF is just images.

The OCR Pipeline Challenge

Scanned legal documents are hostile to OCR:

  • Fax artifacts and scan lines
  • Skewed pages from feeding through a scanner
  • Low contrast from photocopying
  • Handwritten annotations
  • Mixed fonts and sizes

My Preprocessing Approach

Before sending pages to Tesseract, I run a preprocessing pipeline:

1. Deskewing: Many scanned pages are rotated 1-3 degrees, and Tesseract's accuracy drops significantly with rotation. I use OpenCV's minAreaRect on detected text blocks to calculate the rotation angle, then correct it.

2. Binarization: Convert to black-and-white using adaptive thresholding. This handles uneven lighting across the page—common when scanning bound documents.

import cv2

# Adaptive thresholding for uneven lighting: each pixel is thresholded
# against its local neighborhood rather than one global value
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
binary = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # Gaussian-weighted local mean
    cv2.THRESH_BINARY,
    blockSize=11,                    # neighborhood size in pixels (must be odd)
    C=2                              # constant subtracted from the local mean
)

3. Noise Removal: Small speckles from dust or scan artifacts confuse OCR. A morphological opening (erosion followed by dilation) removes noise while preserving text.

4. Confidence Filtering: Tesseract outputs a confidence score per word. I flag low-confidence regions (usually handwriting or severely degraded text) rather than including garbage characters.
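A sketch of that filtering, operating on the parallel "text" and "conf" lists that pytesseract's image_to_data returns (the 60-point threshold here is an illustrative default, not the tuned production value):

```python
def filter_by_confidence(data, min_conf=60):
    """Split Tesseract word output into accepted and flagged lists.

    `data` is the dict produced by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT),
    which holds parallel lists under "text" and "conf".
    """
    kept, flagged = [], []
    for text, conf in zip(data["text"], data["conf"]):
        if not text.strip():
            continue  # Tesseract emits empty entries for layout elements
        if float(conf) >= min_conf:
            kept.append(text)
        else:
            flagged.append(text)  # likely handwriting or degraded print
    return kept, flagged
```

Flagged words can then drive "low confidence" warnings in the UI instead of polluting the search index.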

Results After Preprocessing

Metric                  Before Preprocessing    After Preprocessing
Character accuracy      ~72%                    ~91%
Word accuracy           ~65%                    ~87%
Processing time/page    1.2s                    2.1s

The roughly 2x processing time is worth it. Low-accuracy OCR produces garbage search results.

Search UX Details

Instant Results

The search UI updates as you type with debouncing (250ms delay). PostgreSQL handles partial queries well enough that results feel instant.
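Debouncing is language-agnostic; the real UI does this in the browser, but the idea can be sketched in Python with threading.Timer (the 250ms value matches the UI delay above):

```python
import threading

def debounce(wait_s):
    """Delay calls to the wrapped function until wait_s seconds pass
    with no newer call; only the last call in a burst actually runs."""
    def decorator(fn):
        timer = None
        def wrapped(*args, **kwargs):
            nonlocal timer
            if timer is not None:
                timer.cancel()  # a newer keystroke resets the clock
            timer = threading.Timer(wait_s, fn, args, kwargs)
            timer.start()
        return wrapped
    return decorator

@debounce(0.25)
def run_search(query):
    print(f"searching for {query!r}")
```

Typing "e", "ep", "eps" in quick succession fires only one search, for "eps".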

Snippet Generation

Search results show context around the match. I tuned ts_headline parameters to show meaningful snippets:

  • MaxWords=35: Enough context to understand the match
  • MinWords=15: Prevents tiny snippets
  • StartSel/StopSel: Custom highlight tags for styling

Document Preview

Clicking a result opens the PDF viewer at the page containing the match. The text layer highlights all instances of the search term. Users can navigate between matches with arrow keys.

Lessons Learned

  1. Preprocessing is the real work: Getting OCR accuracy from 72% to 91% required more engineering than the rest of the project combined. Don't underestimate document quality issues.

  2. PostgreSQL FTS is enough: I almost reached for Elasticsearch out of habit. For document counts under 100k, PostgreSQL full-text search is simpler to operate and good enough.

  3. Show confidence indicators: Some pages just can't be OCR'd reliably. Showing users "low confidence" warnings is better than pretending the text is accurate.

  4. Original + text is essential: Researchers need to verify against the original scan. A text-only interface wouldn't be trusted for legal documents.

What I'd Do Differently

  • Batch processing with resume: The initial ingestion was a single Python script. When it crashed 60% through, I had to restart. A proper job queue (Redis + RQ) with checkpointing would have saved hours.
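A job queue would add retries and parallelism, but even the single-script version could have resumed with a simple checkpoint file. A minimal sketch (file name and structure are illustrative):

```python
import json
from pathlib import Path

def process_batch(paths, process_one, checkpoint="ingest_done.json"):
    """Run process_one over paths, skipping anything already recorded in
    the checkpoint file, so a crashed run resumes where it stopped."""
    ckpt = Path(checkpoint)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()
    for path in paths:
        if path in done:
            continue
        process_one(path)
        done.add(path)
        # Persist after every file; a crash loses at most the current one
        ckpt.write_text(json.dumps(sorted(done)))
```

Restarting after a crash at 60% would then redo only the file that was in flight, not the first 60%.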

  • Better handwriting handling: I punt on handwritten annotations entirely. A dedicated handwriting model (or flagging for human review) would capture important margin notes.

  • Export functionality: Downloading search results as a CSV or creating document collections would be natural features to add.


The meta-lesson: making information accessible isn't about fancy AI. It's about the boring work of cleaning data, handling edge cases, and building reliable pipelines. OCR is a solved problem—the hard part is making it work reliably on messy real-world documents.
