# Making SEC Filings Speak LLM: Building an AI-Native Financial Research Platform
## The Problem
SEC filings are the primary source of truth for public company financials. Every 10-K, 10-Q, and 8-K contains critical information that investors, analysts, and researchers need. But there's a catch: **SEC filings are hostile to machines**.
The raw EDGAR HTML is a mess of nested tables, inconsistent formatting, and financial data that looks like text but needs to be parsed as structured data. Try asking an LLM to find Apple's revenue breakdown by segment from their 10-K, and you'll get either a hallucinated number or a refusal because the context window can't fit the entire filing.
I wanted to build a platform where you could ask natural language questions like *"What percentage of Microsoft's revenue came from Azure in FY2023?"* and get an accurate answer **with citations** pointing to the exact section and page.
## System Architecture
The architecture has three main subsystems: ingestion, retrieval, and chat.
```
┌──────────────────────────────────────────────────────────────────────────┐
│                            INGESTION PIPELINE                            │
│                                                                          │
│  SEC EDGAR ──▶ HTML Parser ──▶ Table Extractor ──▶ Chunker ──▶ Embedder  │
│                                       │                            │     │
│                                       ▼                            ▼     │
│                                Markdown Tables                 ChromaDB  │
└──────────────────────────────────────────────────────────────────────────┘
                                                                     │
                              ┌──────────────────────────────────────┘
                              ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                                RETRIEVAL                                 │
│                                                                          │
│   Query ──▶ OpenAI Embedding ──▶ ChromaDB Search ──▶ Ranked Chunks       │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                                 RAG CHAT                                 │
│                                                                          │
│   User Question + Retrieved Chunks ──▶ GPT-4 ──▶ Answer with Citations   │
│                                                                          │
│  Example: "Azure revenue?" ──▶ "[MSFT 10-K, Item 7] Azure generated..."  │
└──────────────────────────────────────────────────────────────────────────┘
```
### API Layer
FastAPI serves everything—filing ingestion, semantic search, and chat—through a typed REST API. The React frontend hits these endpoints, but I also ship a Python SDK (`sec-api-llm`) so developers can integrate directly.
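For a sense of the contract, here's a minimal sketch of the chat endpoint's typed request/response models. The route path, model names, and fields are illustrative, not the exact production schema:

```python
# Minimal sketch of the chat endpoint's typed contract.
# Route path, model names, and fields are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Citation(BaseModel):
    ticker: str        # e.g. "AAPL"
    form_type: str     # e.g. "10-K"
    fiscal_year: int
    section: str       # e.g. "Item 8 - Financial Statements"

class ChatRequest(BaseModel):
    question: str
    tickers: list[str] | None = None  # optionally scope retrieval

class ChatResponse(BaseModel):
    answer: str
    citations: list[Citation]

@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest) -> ChatResponse:
    # Retrieval + GPT-4 generation happen here; stubbed for the sketch.
    return ChatResponse(answer="(stub)", citations=[])
```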
## Technical Decisions
### Why OpenAI Embeddings (text-embedding-3-small)?
I evaluated three embedding models:
- **OpenAI text-embedding-3-small**: 1536 dimensions, $0.02/1M tokens
- **OpenAI text-embedding-3-large**: 3072 dimensions, $0.13/1M tokens
- **Cohere embed-english-v3.0**: 1024 dimensions, usage-based pricing
**text-embedding-3-small won** because:
1. **Cost-performance ratio**: 6.5x cheaper than text-embedding-3-large with only ~3% lower retrieval quality on my financial domain tests
2. **Dimension reduction**: I could truncate to 512 dimensions for faster search with minimal quality loss
3. **OpenAI ecosystem**: Using GPT-4 for chat meant one API key, one billing dashboard
The key insight: for domain-specific retrieval, **chunking strategy matters more than embedding model quality**. More on that below.
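Point 2 in practice: the text-embedding-3 models accept a `dimensions` parameter, so truncation happens server-side. A minimal sketch, assuming the official `openai` Python SDK:

```python
# Embedding chunks with text-embedding-3-small, truncated to 512 dimensions
# via the API's `dimensions` parameter. Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def embed_chunks(texts: list[str], dims: int = 512) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        dimensions=dims,  # server-side truncation for text-embedding-3 models
    )
    return [item.embedding for item in response.data]
```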
### Why ChromaDB?
I considered Pinecone, Weaviate, and PostgreSQL with pgvector. ChromaDB won for a scrappy project like this:
1. **Zero infrastructure**: Runs as an embedded database or single-process server
2. **Metadata filtering**: Filter by company ticker, form type, fiscal year before vector search
3. **Python-native**: No context switching between languages
4. **Fast enough**: ~50ms query latency for 100k chunks is fine for a research tool
The tradeoff: ChromaDB doesn't scale horizontally. If I needed millions of filings, I'd migrate to Pinecone. For now, it handles the few thousand filings I've indexed with room to spare.
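Point 2 is the feature I lean on most. A minimal sketch of a filtered query, assuming the `chromadb` Python client; the collection name and metadata fields are illustrative:

```python
# Metadata-filtered vector search with the chromadb client.
# Collection name and metadata fields are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("filings")

def search(query_embedding: list[float], ticker: str, fiscal_year: int, k: int = 8):
    return collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        # Restrict candidates to one company and fiscal year before similarity ranking
        where={"$and": [{"ticker": ticker}, {"fiscal_year": fiscal_year}]},
    )
```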
### Why FastAPI?
Same reasoning as my UFC project—async support, automatic docs, Pydantic validation. But here I also needed:
1. **Background tasks**: Filing ingestion takes 30-60 seconds. FastAPI's `BackgroundTasks` let me return immediately with a job ID and poll for completion.
2. **WebSocket support**: For streaming chat responses, though I ended up using SSE instead for simpler error handling.
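Here's a minimal sketch of the ingest-then-poll pattern from point 1; the routes, in-memory job store, and ingest helper are illustrative, not the production code:

```python
# Sketch of the ingest-then-poll pattern. Routes, the in-memory job store,
# and the ingest helper are illustrative.
import uuid
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()
jobs: dict[str, str] = {}  # job_id -> "pending" | "done" | "failed"

def ingest_filing(job_id: str, ticker: str, accession_number: str) -> None:
    try:
        # Download from EDGAR, extract tables, chunk, embed, index (omitted)
        jobs[job_id] = "done"
    except Exception:
        jobs[job_id] = "failed"

@app.post("/filings/ingest")
async def ingest(ticker: str, accession_number: str, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    jobs[job_id] = "pending"
    background_tasks.add_task(ingest_filing, job_id, ticker, accession_number)
    return {"job_id": job_id, "status": "pending"}

@app.get("/filings/jobs/{job_id}")
async def job_status(job_id: str):
    return {"job_id": job_id, "status": jobs.get(job_id, "unknown")}
```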
## The Table Extraction Challenge
This is where the project gets interesting. Financial tables in SEC filings are the most valuable data, but they're nightmarish to parse:
```html
<!-- Simplified illustration of a typical EDGAR revenue table: the "$" sign
     gets its own cell, "headers" are just styled <td>s, and colspans carry
     the layout -->
<table>
  <tr>
    <td colspan="2"><b>Revenue</b></td>
    <td colspan="2">FY 2022</td>
    <td colspan="2">FY 2021</td>
  </tr>
  <tr>
    <td colspan="2">Total net sales</td>
    <td>$</td><td style="text-align:right">394,328</td>
    <td>$</td><td style="text-align:right">365,817</td>
  </tr>
</table>
```
The HTML says nothing about what's a header, what's a footnote, what's a subtotal. And you can't just flatten it—the hierarchical structure IS the meaning.
### My Approach: Semantic Table Reconstruction
Instead of trying to parse the HTML perfectly (impossible given SEC filing inconsistency), I built a **two-pass extractor**:
**Pass 1: Structure Detection**
- Identify header rows using heuristics (first N rows, bold text, spanning cells)
- Detect numeric columns vs. label columns
- Handle merged cells by repeating values down/across
**Pass 2: Markdown Conversion**
- Convert to Markdown tables with explicit headers
- Preserve indentation for hierarchy (Revenue → Product Revenue → iPhone)
- Add metadata comments for LLM context
```markdown
| Category | FY 2023 | FY 2022 | YoY Change |
|:---------|--------:|--------:|-----------:|
| **Net Sales** | | | |
| Products | 298,085 | 316,199 | -5.7% |
| Services | 85,200 | 78,129 | 9.1% |
| **Total Net Sales** | 383,285 | 394,328 | -2.8% |
```
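A minimal sketch of the Pass 1 header heuristics, assuming BeautifulSoup; the thresholds and helper names are illustrative, not the production extractor:

```python
# Pass 1 heuristics: flag likely header rows near the top of a table.
# Thresholds and helper names are illustrative.
from bs4 import BeautifulSoup, Tag

def is_numeric(text: str) -> bool:
    stripped = text.replace(",", "").replace("$", "").replace("%", "").strip("()- ")
    return stripped.replace(".", "", 1).isdigit()

def looks_like_header(row: Tag) -> bool:
    cells = row.find_all(["td", "th"])
    if not cells:
        return False
    has_spans = any(int(c.get("colspan", 1)) > 1 for c in cells)
    has_bold = any(c.find(["b", "strong"]) for c in cells)
    numeric_cells = sum(1 for c in cells if is_numeric(c.get_text(strip=True)))
    mostly_labels = numeric_cells < len(cells) / 2   # headers are mostly text
    return mostly_labels and (has_spans or has_bold)

def detect_header_rows(table_html: str, max_scan: int = 4) -> list[int]:
    rows = BeautifulSoup(table_html, "html.parser").find_all("tr")
    return [i for i, row in enumerate(rows[:max_scan]) if looks_like_header(row)]
```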
The **100% accuracy claim** refers to structural accuracy—the numbers in the Markdown match the numbers in the filing. I validated this against 500 randomly sampled tables by hand.
## RAG Implementation Details
### Chunking Strategy
Bad chunking kills RAG quality. I learned this the hard way after my first version returned useless results.
**What didn't work:**
- Fixed-size chunks (512 tokens): Cut through sentences, split tables from their captions
- Paragraph-based: SEC filings have 10-page paragraphs in legal sections
**What worked: Semantic chunking with table awareness**
```python
from typing import List

# Filing, Chunk, and semantic_split are the project's own types and helpers.
def chunk_filing(filing: Filing) -> List[Chunk]:
    chunks = []

    # Tables get their own chunks with captions
    for table in filing.tables:
        chunk = Chunk(
            content=table.markdown,
            metadata={
                "type": "table",
                "section": table.section,  # e.g., "Item 8 - Financial Statements"
                "caption": table.caption,
                "page": table.page,
            },
        )
        chunks.append(chunk)

    # Prose sections chunked by heading hierarchy:
    # target 400 tokens with 50-token overlap, but never split within a sentence
    for section in filing.sections:
        chunks.extend(semantic_split(section, target=400, overlap=50))

    return chunks
```
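For reference, here's roughly what `semantic_split` does on a section's raw text. This is a simplified sketch (regex sentence splitting, `tiktoken` for token counts), not the production implementation:

```python
# Simplified sketch of sentence-aware splitting: pack sentences up to a
# token target, then start the next chunk with a small trailing overlap.
# Uses a regex sentence split and tiktoken; the real logic is richer.
import re
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def semantic_split_text(text: str, target: int = 400, overlap: int = 50) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = len(_enc.encode(sentence))
        if current and current_tokens + n > target:
            chunks.append(" ".join(current))
            # Carry roughly `overlap` tokens of trailing sentences into the next chunk
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                carried_tokens += len(_enc.encode(prev))
                carried.insert(0, prev)
                if carried_tokens >= overlap:
                    break
            current, current_tokens = carried, carried_tokens
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```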
### Citation Generation
Citations look like `[AAPL 10-K FY2023, Item 8]`. The trick is making them both human-readable AND useful for the LLM:
1. **Chunk metadata includes everything**: ticker, form type, fiscal year, section, page
2. **System prompt enforces citation format**: "When referencing source material, use the format [TICKER FORM YEAR, Section]"
3. **Post-processing validation**: If a citation doesn't match any retrieved chunk, flag it as potentially hallucinated
The citation accuracy is ~95%—meaning 95% of citations point to chunks that actually contain supporting information.
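Step 3 is a cheap string check. A minimal sketch, with the citation regex and chunk metadata keys assumed rather than taken from the real schema:

```python
# Post-processing validation (step 3): flag citations that don't correspond
# to any retrieved chunk. The regex and metadata keys are assumptions.
import re

CITATION_RE = re.compile(r"\[([A-Z.]+)\s+(10-K|10-Q|8-K)\s+FY(\d{4})(?:,\s*([^\]]+))?\]")

def suspect_citations(answer: str, retrieved_chunks: list[dict]) -> list[str]:
    known = {
        (c["ticker"], c["form_type"], int(c["fiscal_year"]))
        for c in retrieved_chunks
    }
    flagged = []
    for match in CITATION_RE.finditer(answer):
        ticker, form, year, _section = match.groups()
        if (ticker, form, int(year)) not in known:
            flagged.append(match.group(0))  # potentially hallucinated
    return flagged
```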
## What I Built
| Metric | Value |
|--------|-------|
| Filings indexed | 2,500+ (10-K, 10-Q) |
| Tables extracted | 45,000+ |
| Table extraction accuracy | 100% structural (validated on 500 sample tables) |
| Semantic search latency (local benchmark, p50) | ~48ms |
| Citation accuracy (test set) | ~95% |
The system demonstrates that financial Q&A with accurate citations is achievable—the kind of tool that could save analysts hours of digging through filings.
## Lessons Learned
1. **Domain-specific chunking is everything**: Generic chunking algorithms don't understand that a table and its caption need to stay together. Financial documents have structure that matters—respect it.
2. **Metadata is your secret weapon**: Filtering by ticker and fiscal year before vector search reduces noise dramatically. Don't rely on embeddings alone.
3. **Build the SDK early**: I shipped `sec-api-llm` as an afterthought. If I'd designed API ergonomics around the SDK from the start, the API would be cleaner.
4. **Tables are the hard part**: Everyone focuses on RAG and embeddings. The real work was parsing SEC tables reliably. That unglamorous parsing code is what makes the platform actually useful.
## What I'd Do Differently
- **Use a hybrid retrieval strategy**: Pure vector search misses exact matches (like finding all mentions of "Azure"). I'd add BM25 keyword search and combine scores with reciprocal rank fusion (sketched after this list).
- **Stream from the start**: I added SSE streaming later, but the initial version waited for the full response. Financial questions often have long answers—streaming would have improved perceived latency.
- **Version the embeddings**: When I upgraded from text-embedding-ada-002 to text-embedding-3-small, I had to re-embed everything. A versioned schema (chunks table with `embedding_model` column) would have allowed gradual migration.
- **Build evaluation infrastructure early**: I manually tested retrieval quality. Should have built a golden test set with expected chunks for 100 queries and automated regression testing.
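On the hybrid retrieval point: reciprocal rank fusion itself is only a few lines. A sketch over ranked lists of chunk IDs:

```python
# Reciprocal rank fusion over ranked result lists (e.g. vector + BM25).
# k=60 is the conventional smoothing constant.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```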
---
*The surprising lesson from this project: the hardest problems in AI applications aren't the AI parts. They're the data engineering—parsing messy HTML, handling edge cases in chunking, and maintaining accurate metadata. Get those right, and the LLM does its job. Get them wrong, and no amount of prompt engineering will save you.*