# Building a Searchable Archive from Scanned Legal Documents

*Epstein Document Browser*
## The Problem
When thousands of court documents get publicly released, they arrive as scanned PDFs—often from faxes, photocopies, or decades-old legal filings. The text isn't machine-readable. The metadata is inconsistent. And there's no way to search across them without manually opening each file.
I wanted to build a tool where journalists, researchers, and the public could search across the entire document set instantly—type a name or phrase and get relevant documents with highlighted matches.
## System Architecture
The architecture has two main components: an ingestion pipeline for processing documents, and a search frontend for exploring them.
```
┌─────────────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ PDF Files ──▶ Image Extraction ──▶ Preprocessing ──▶ Tesseract OCR │
│ │ │ │
│ (deskew, contrast) │ ▼ │
│ │ Text + Confidence │
│ │ │ │
│ ▼ ▼ │
│ PostgreSQL ◀──────── Full-Text Index │
└─────────────────────────────────────────────────────────────────────────┘
│
┌───────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ SEARCH FRONTEND │
│ │
│ Query ──▶ PostgreSQL FTS ──▶ Ranked Results ──▶ PDF.js Viewer │
│ │ │
│ ▼ │
│ In-browser highlighting │
└─────────────────────────────────────────────────────────────────────────┘
```
## Technical Decisions
### Why Tesseract OCR?
I evaluated three OCR options:
- **Tesseract**: Open source, battle-tested, free
- **AWS Textract**: Cloud-based, expensive at scale, excellent accuracy
- **Google Cloud Vision**: Similar to Textract, great for structured forms
**Tesseract won** because:
1. **Cost**: Processing thousands of pages through cloud OCR would cost hundreds of dollars
2. **Privacy**: Legal documents stay local, no third-party cloud involved
3. **Control**: I could tune preprocessing and confidence thresholds
The tradeoff: Tesseract struggles with low-quality scans. I invested heavily in preprocessing to compensate.
### Why PostgreSQL Full-Text Search?
PostgreSQL's built-in full-text search is underrated. For this use case, it beat dedicated search engines:
1. **Single database**: Documents, metadata, and search index in one place
2. **Ranking out of the box**: `ts_rank()` and `ts_rank_cd()` for relevance scoring
3. **Highlighting**: `ts_headline()` generates snippets with match highlights
4. **Phrase search**: Native support for "exact phrase" queries
```sql
-- Search with ranking and highlighting
SELECT
id,
title,
ts_rank(search_vector, plainto_tsquery('english', $1)) AS rank,
ts_headline('english', content, plainto_tsquery('english', $1),
'StartSel=****, StopSel=****, MaxWords=35, MinWords=15'
) AS snippet
FROM documents
WHERE search_vector @@ plainto_tsquery('english', $1)
ORDER BY rank DESC
LIMIT 20;
```
(In production, `****` would be HTML highlight tags like `<mark>`)
For a corpus of ~10,000 documents, PostgreSQL handles queries in under 50ms. I'd consider Elasticsearch only if the corpus grew 100x.
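The post doesn't show the table definition behind that query. A minimal sketch of a schema that would support it, assuming PostgreSQL 12+ (for generated columns): the `search_vector` column is kept in sync automatically, and a GIN index keeps the `@@` lookup from falling back to a sequential scan. Column names beyond `id`, `title`, `content`, and `search_vector` are hypothetical.

```sql
-- Hypothetical schema consistent with the query above.
-- The generated column recomputes the tsvector whenever content changes.
CREATE TABLE documents (
    id            bigserial PRIMARY KEY,
    title         text NOT NULL,
    content       text NOT NULL,   -- OCR'd text
    search_vector tsvector
        GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);

-- GIN index so @@ matches use the index instead of scanning every row
CREATE INDEX documents_search_idx ON documents USING GIN (search_vector);
```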
### Why PDF.js for the Viewer?
The challenge: users need to see the original document AND the searchable text. Most PDF viewers don't support text overlay on scanned images.
**PDF.js** (Mozilla's PDF renderer) lets me:
1. Render the original scanned pages as the user sees them
2. Overlay an invisible text layer from the OCR
3. Highlight search terms in that text layer
4. Support keyboard navigation and accessibility
The result feels like searching a native document, even though the underlying PDF is just images.
## The OCR Pipeline Challenge
Scanned legal documents are hostile to OCR:
- Fax artifacts and scan lines
- Skewed pages from feeding through a scanner
- Low contrast from photocopying
- Handwritten annotations
- Mixed fonts and sizes
### My Preprocessing Approach
Before sending pages to Tesseract, I run a preprocessing pipeline:
**1. Deskewing**
Many scanned pages are rotated 1-3 degrees. Tesseract's accuracy drops significantly with rotation. I use OpenCV's `minAreaRect` on detected text blocks to calculate rotation angle, then correct it.
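The production pipeline uses OpenCV's `minAreaRect` on detected text blocks. As a dependency-light illustration of the same idea, here is a NumPy-only sketch that estimates skew from the principal axis of the ink pixels; the function name and sign convention are my own.

```python
import numpy as np

def estimate_skew_degrees(binary):
    """Estimate page rotation from the principal axis of ink pixels.

    `binary` is a 2-D array where ink pixels are nonzero. This is a
    simplified stand-in for OpenCV's minAreaRect approach.
    """
    ys, xs = np.nonzero(binary)
    coords = np.column_stack([xs, ys]).astype(float)
    coords -= coords.mean(axis=0)
    # Direction of largest spread = dominant text-line direction
    eigvals, eigvecs = np.linalg.eigh(np.cov(coords, rowvar=False))
    major = eigvecs[:, np.argmax(eigvals)]
    angle = np.degrees(np.arctan2(major[1], major[0]))
    # Fold into (-45, 45] so a 2-degree skew reads as 2, not 178
    while angle > 45:
        angle -= 90
    while angle <= -45:
        angle += 90
    return angle
```

Correcting the page is then a rotation by the negated angle (in OpenCV, `cv2.getRotationMatrix2D` plus `cv2.warpAffine`).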
**2. Binarization**
Convert to black-and-white using adaptive thresholding. This handles uneven lighting across the page—common when scanning bound documents.
```python
import cv2

# Adaptive thresholding for uneven lighting: each pixel is thresholded
# against a Gaussian-weighted average of its 11x11 neighborhood
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
binary = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=11,
    C=2
)
```
**3. Noise Removal**
Small speckles from dust or scan artifacts confuse OCR. A morphological opening (erosion followed by dilation) removes noise while preserving text.
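In OpenCV this is a one-liner (`cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)`). To show what opening actually does, here is a NumPy-only sketch over a binary 0/1 image; the 3×3 kernel size is illustrative.

```python
import numpy as np

def erode(b, k=3):
    # Erosion: a pixel survives only if its whole kxk neighborhood is ink
    p = np.pad(b, k // 2, mode="constant")
    out = np.ones_like(b)
    for dy in range(k):
        for dx in range(k):
            out &= p[dy:dy + b.shape[0], dx:dx + b.shape[1]]
    return out

def dilate(b, k=3):
    # Dilation: a pixel turns on if any neighbor is ink
    p = np.pad(b, k // 2, mode="constant")
    out = np.zeros_like(b)
    for dy in range(k):
        for dx in range(k):
            out |= p[dy:dy + b.shape[0], dx:dx + b.shape[1]]
    return out

def opening(b, k=3):
    # Opening = erosion then dilation: removes specks smaller than the
    # kernel while restoring the shape of larger text strokes
    return dilate(erode(b, k), k)
```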
**4. Confidence Filtering**
Tesseract outputs a confidence score per word. I flag low-confidence regions (usually handwriting or severely degraded text) rather than including garbage characters.
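A sketch of that filtering step. It assumes the parallel-list dict that `pytesseract.image_to_data(img, output_type=Output.DICT)` returns, with `'text'` and `'conf'` keys (Tesseract reports confidence 0–100, and −1 for non-word layout boxes). The threshold of 60 is an illustrative assumption, not the post's actual value.

```python
CONF_THRESHOLD = 60  # hypothetical cutoff; tune against your corpus

def filter_words(data, threshold=CONF_THRESHOLD):
    """Split OCR output into accepted words and flagged low-confidence ones."""
    accepted, flagged = [], []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)
        if not word.strip() or conf < 0:
            continue  # skip empty tokens and non-word layout boxes
        (accepted if conf >= threshold else flagged).append((word, conf))
    return accepted, flagged
```

Flagged regions can then be surfaced in the UI as "low confidence" rather than silently indexed as garbage.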
### Results After Preprocessing
| Metric | Before Preprocessing | After Preprocessing |
|--------|---------------------|---------------------|
| Character accuracy | ~72% | ~91% |
| Word accuracy | ~65% | ~87% |
| Processing time/page | 1.2s | 2.1s |
The ~1.75x processing time is worth it. Low-accuracy OCR produces garbage search results.
## Search UX Details
### Instant Results
The search UI updates as you type with debouncing (250ms delay). PostgreSQL handles partial queries well enough that results feel instant.
### Snippet Generation
Search results show context around the match. I tuned `ts_headline` parameters to show meaningful snippets:
- **MaxWords=35**: Enough context to understand the match
- **MinWords=15**: Prevents tiny snippets
- **StartSel/StopSel**: Custom highlight tags for styling
### Document Preview
Clicking a result opens the PDF viewer at the page containing the match. The text layer highlights all instances of the search term. Users can navigate between matches with arrow keys.
## Lessons Learned
1. **Preprocessing is the real work**: Getting OCR accuracy from 72% to 91% required more engineering than the rest of the project combined. Don't underestimate document quality issues.
2. **PostgreSQL FTS is enough**: I almost reached for Elasticsearch out of habit. For document counts under 100k, PostgreSQL full-text search is simpler to operate and good enough.
3. **Show confidence indicators**: Some pages just can't be OCR'd reliably. Showing users "low confidence" warnings is better than pretending the text is accurate.
4. **Original + text is essential**: Researchers need to verify against the original scan. A text-only interface wouldn't be trusted for legal documents.
## What I'd Do Differently
- **Batch processing with resume**: The initial ingestion was a single Python script. When it crashed 60% through, I had to restart. A proper job queue (Redis + RQ) with checkpointing would have saved hours.
- **Better handwriting handling**: I punt on handwritten annotations entirely. A dedicated handwriting model (or flagging for human review) would capture important margin notes.
- **Export functionality**: Downloading search results as a CSV or creating document collections would be natural features to add.
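The batch-with-resume idea above can be sketched in a few lines: record each finished document in a state file so a crashed run picks up where it left off. This file-based version is a hypothetical minimal form; the Redis + RQ version would enqueue per-document jobs instead.

```python
import json
import os

def load_done(path):
    """Return the set of document IDs already processed."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def ingest(doc_ids, process, path):
    """Run `process` over doc_ids, checkpointing progress after each one."""
    done = load_done(path)
    for doc_id in doc_ids:
        if doc_id in done:
            continue  # already processed in an earlier run
        process(doc_id)
        done.add(doc_id)
        # Persist after every document so a crash loses at most one
        with open(path, "w") as f:
            json.dump(sorted(done), f)
```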
---
*The meta-lesson: making information accessible isn't about fancy AI. It's about the boring work of cleaning data, handling edge cases, and building reliable pipelines. OCR is a solved problem—the hard part is making it work reliably on messy real-world documents.*