Building a Scalable MMA Fighter Database

## The Problem As an MMA fan, I was frustrated by how fragmented fight data is across the internet. UFCStats has the data, but the UX is stuck in 2008. Wikipedia has context, but no structured data. Betting sites have odds, but paywalled everything else. I wanted a **single source of truth** for MMA data—something that felt as intuitive as looking up a Pokémon in a Pokédex. ## System Architecture The architecture follows a classic three-tier pattern, but with aggressive caching to handle the scale of MMA data: ``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Next.js App │────▶│ FastAPI API │────▶│ PostgreSQL │ │ (Frontend) │ │ (Backend) │ │ (Database) │ └─────────────────┘ └────────┬────────┘ └─────────────────┘ │ ┌────────▼────────┐ │ Redis │ │ (Cache) │ └─────────────────┘ ``` ### Data Pipeline The scraping pipeline runs nightly using Scrapy, pulling data from UFCStats.com. Here's what made it interesting: 1. **Incremental scraping**: Only fetch new events and updated fighter records 2. **Deduplication**: Fighters appear under multiple names (nicknames, typos, name changes) 3. **Data normalization**: Convert strings like "5'11\"" into inches for comparison queries The result: **76,544 fighter profiles**, **270,000+ fights**, and **755 events** dating back to UFC 1 in 1993. ## Technical Decisions ### Why FastAPI over Django/Flask? **FastAPI** won for three reasons: 1. **Async by default**: Async handlers can serve significantly more concurrent requests, which matters for an API that could see traffic spikes during fight nights. 2. **Automatic OpenAPI docs**: The `/docs` endpoint provides interactive API documentation, making it easy to explore and test endpoints. 3. **Pydantic validation**: Type safety at the API boundary catches bugs during development. ### Why Redis caching? The database has 76,000+ fighters, but realistically most queries would target the ~500 active UFC fighters. Redis caching with a 15-minute TTL provides: - **Fighter profiles**: Cached after first request → sub-10ms responses - **Division rankings**: Cached on scrape → instant load - **Search autocomplete**: Cached trie structure → type-ahead feels instant In local benchmarks, the cache hit rate reaches **~94%** for typical browsing patterns, enabling sub-100ms response times for most requests. ### Why Next.js? The frontend needed to handle: - **SEO for fighter pages**: Google should index Jon Jones' profile - **Fast navigation**: Fans browse multiple fighters rapidly during events - **Real-time feel**: Rankings and records should update seamlessly Next.js App Router gave us all three: SSG for fighter pages, client-side navigation for speed, and ISR for fresh data without rebuild. ## Challenges & Solutions ### Challenge 1: Fighter Name Disambiguation The same fighter might appear as: - "Jon Jones" - "Jonathan Jones" - "Jon 'Bones' Jones" - "Jonathan Dwight Jones" **Solution**: Fuzzy matching on names + unique identifier based on (name + birthdate + first_fight_date). This catches 99.9% of duplicates. ### Challenge 2: Historical Data Quality Early UFC events have incomplete data. UFC 1 has no round times. Some fighters have no birthdate. **Solution**: Nullable fields everywhere, with the UI gracefully degrading. Instead of "Born: N/A", we just don't show the birth section. ### Challenge 3: Designing for Traffic Spikes A real MMA database would see huge traffic spikes during major events—potentially 50x normal during a Conor McGregor fight. **Solution**: Redis caching + Vercel's edge network. Static assets are globally distributed, and the caching layer ensures most requests never hit the database, making the architecture ready for traffic spikes if they occur. ## What I Built | Metric | Value | |--------|-------| | Fighter profiles scraped | 76,544 | | Documented fights | 270,000+ | | Events indexed | 755 | | API response time (local benchmark, p50) | ~42ms | | Lighthouse performance score | 98 | ## Lessons Learned 1. **Cache aggressively, invalidate carefully**: The cache key strategy matters. I use `fighter:{id}:v{schema_version}` to handle schema migrations without cache pollution. 2. **Build for the 90% case**: Most searches would be for active fighters. Optimizing that path (Redis cache, autocomplete) makes the app feel fast, even though querying obscure 1990s fighters hits the database. 3. **APIs invite experimentation**: The auto-generated `/docs` endpoint makes it easy to explore the data programmatically. The architecture would support building Chrome extensions, Discord bots, or betting calculators on top of it. ## What I'd Do Differently - **Start with TypeScript everywhere**: The FastAPI backend is Python. Rewriting the shared types between frontend and backend got tedious. A monorepo with shared types would have saved hours. - **Add rate limiting from the start**: Any public API should have token-bucket rate limiting per IP to prevent abuse. - **Consider serverless from the start**: The FastAPI backend runs on a single Railway instance. For this scale, serverless (like Vercel Functions) might have been simpler and cheaper. --- *This project taught me that the best technical decisions often come from understanding user behavior. MMA fans don't care about your database schema—they care about finding Khabib's takedown percentage in 0.5 seconds.*

Building a Scalable MMA Fighter Database

You might also like

Epstein Document Browser

Weather History

SEC EDGAR Agent