Making a Century of Weather Data Comprehensible at a Glance
## The Insight
Every weather app tells you what the temperature is. None of them tell you if that temperature is *unusual*.
When it's 75F in December, is that remarkable? Unprecedented? Or just a mild day that happens every few years? Without historical context, you can't know. And without knowing, you lose the ability to notice climate patterns, appreciate seasonal outliers, or simply satisfy the human curiosity of "is this weather weird?"
I wanted to build something that answers one simple question: **Is today's weather unusual for this date and location?**
The answer required comparing today's conditions against over a century of historical data—and making that comparison instantly comprehensible.
## Data Pipeline
### The NOAA GHCN Dataset
The Global Historical Climatology Network (GHCN) is NOAA's gold standard for historical weather data. It's comprehensive, well-documented, and freely available. But it comes with quirks:
- **Coverage varies wildly**: Some stations have continuous data since 1890. Others have gaps, moved locations, or shut down.
- **Measurement standards changed**: Temperature recording practices evolved over 130 years.
- **Station density differs by region**: Dense coverage in the US and Europe, sparse in remote areas.
The dataset includes daily min/max temperatures, precipitation, and snowfall for roughly 100,000 weather stations worldwide. I focused on temperature—the metric people care about most.
### Data Quality Challenges
GHCN data isn't clean. Here's what I had to handle:
1. **Missing days**: Stations go offline. Sensors break. Holidays happen.
2. **Outliers**: A mis-recorded "150F" day in Wisconsin needs to be caught.
3. **Unit inconsistencies**: Some historical data is in Celsius, some in Fahrenheit.
4. **Station relocations**: A weather station that moved 10 miles in 1952 is technically two different data sources.
My approach: aggressive validation at ingestion time. Any temperature more than 3 standard deviations from that station's historical mean gets flagged for review. Unit conversion happens once, at import. Station metadata tracks location changes.
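As a minimal sketch of that ingestion step, the snippet below normalizes units and flags readings that sit more than three standard deviations from the station mean. The record shapes and the `validateReadings` name are assumptions for illustration, not the app's actual code.

```typescript
// Hypothetical record shapes; the field names are assumptions for illustration.
interface RawReading {
  date: string; // ISO date, e.g. "1952-07-14"
  value: number; // temperature as recorded
  unit: "C" | "F";
}

interface CleanReading {
  date: string;
  tempF: number; // normalized once, at import
  flagged: boolean; // true when the value should be reviewed by hand
}

function toFahrenheit(value: number, unit: "C" | "F"): number {
  return unit === "F" ? value : (value * 9) / 5 + 32;
}

// Normalize units, then flag readings more than `sigmas` standard
// deviations from the mean of the readings passed in (one station).
function validateReadings(raw: RawReading[], sigmas: number = 3): CleanReading[] {
  const converted = raw.map((r) => ({
    date: r.date,
    tempF: toFahrenheit(r.value, r.unit),
  }));
  if (converted.length === 0) return [];

  const mean = converted.reduce((sum, r) => sum + r.tempF, 0) / converted.length;
  const variance =
    converted.reduce((sum, r) => sum + (r.tempF - mean) ** 2, 0) / converted.length;
  const sd = Math.sqrt(variance);

  return converted.map((r) => ({
    date: r.date,
    tempF: r.tempF,
    // With zero spread nothing can be an outlier, so nothing is flagged.
    flagged: sd > 0 && Math.abs(r.tempF - mean) > sigmas * sd,
  }));
}
```

Note that the mean and standard deviation here include the suspect readings themselves, so a very bad value inflates the spread it is judged against. Comparing each reading against the same time of year rather than the whole record would tighten the check.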
## Statistical Approach
### Percentile Bands
The core insight: raw averages hide the full story. Knowing "the average high for December 30th is 45F" doesn't tell you whether 55F is unusual.
Instead, I calculate **percentile bands**:
- **10th percentile**: Only 10% of historical days were colder
- **50th percentile**: The true median (more robust than mean for weather data)
- **90th percentile**: Only 10% of historical days were warmer
- **Record high/low**: The absolute extremes
When today's temperature falls outside the 10th-90th band, that's genuinely unusual: historically, only about one day in five lands outside the band for this date, and only one in ten on either side.
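A minimal sketch of the band calculation, assuming percentiles are taken by linear interpolation between ranks (one of several common definitions). The `PercentileBands` shape and function names are illustrative.

```typescript
interface PercentileBands {
  p10: number;
  p50: number;
  p90: number;
  recordLow: number;
  recordHigh: number;
}

// Percentile by linear interpolation between the two nearest ranks.
// `sorted` must be ascending and non-empty; `p` is in [0, 100].
function percentile(sorted: number[], p: number): number {
  const rank = (p * (sorted.length - 1)) / 100;
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (rank - lower);
}

function computeBands(temps: number[]): PercentileBands {
  if (temps.length === 0) {
    throw new Error("computeBands needs at least one observation");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = temps.slice().sort((a, b) => a - b);
  return {
    p10: percentile(sorted, 10),
    p50: percentile(sorted, 50),
    p90: percentile(sorted, 90),
    recordLow: sorted[0],
    recordHigh: sorted[sorted.length - 1],
  };
}

// True when a reading falls outside the 10th-90th percentile band.
function isOutsideBand(temp: number, bands: PercentileBands): boolean {
  return temp < bands.p10 || temp > bands.p90;
}
```

For eleven evenly spaced observations from 0 to 10, this gives a 10th percentile of 1, a median of 5, and a 90th percentile of 9.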
### Handling Missing Data
Weather stations don't have 100% coverage. A station with 80 years of data might only have 60-70 observations for any specific date (accounting for gaps).
My rule: **minimum 30 observations** to calculate percentiles. Below that threshold, the UI shows "insufficient data" rather than misleading statistics. This is statistics 101—small samples produce unreliable estimates—but it's easy to forget when you're excited about showing data.
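One way to make the threshold hard to forget is to encode it in the return type, so the UI has to handle the "insufficient data" case explicitly. This is a sketch with hypothetical names (`DateSummary`, `summarizeDate`), showing only the median for brevity.

```typescript
const MIN_OBSERVATIONS = 30; // threshold from the rule above

// The caller cannot read `median` without first checking `kind`.
type DateSummary =
  | { kind: "ok"; count: number; median: number }
  | { kind: "insufficient"; count: number };

// `observations` holds the historical readings for one date; null marks a gap.
function summarizeDate(observations: Array<number | null>): DateSummary {
  const valid = observations.filter(
    (v): v is number => v !== null && Number.isFinite(v),
  );
  if (valid.length < MIN_OBSERVATIONS) {
    return { kind: "insufficient", count: valid.length };
  }
  const sorted = valid.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return { kind: "ok", count: sorted.length, median };
}

// Text the UI might show for either outcome.
function summaryLabel(summary: DateSummary): string {
  return summary.kind === "ok"
    ? `Median ${summary.median.toFixed(1)}°F from ${summary.count} observations`
    : `Insufficient data (have ${summary.count}, need ${MIN_OBSERVATIONS})`;
}
```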
### Seasonal Normalization
Weather has strong seasonality, but the transitions aren't perfectly smooth. February 28th and March 1st shouldn't have wildly different historical distributions.
I use a **7-day rolling window** for percentile calculations. The percentiles for December 30th actually incorporate data from December 27th through January 2nd. This smooths the seasonal transitions while preserving the core signal.
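The only subtle part of the window is the wrap across the year boundary. The sketch below assumes day-of-year values from 1 to 365 and ignores February 29 for simplicity; `windowDays` and `poolWindow` are illustrative names.

```typescript
// Days of the year covered by a centered window, wrapping across New Year.
// With halfWidth = 3 the window spans 7 days.
function windowDays(
  dayOfYear: number,
  halfWidth: number = 3,
  daysInYear: number = 365,
): number[] {
  const days: number[] = [];
  for (let offset = -halfWidth; offset <= halfWidth; offset++) {
    // Shift to 0-based, wrap with a non-negative modulo, shift back to 1-based.
    const wrapped =
      (((dayOfYear - 1 + offset) % daysInYear) + daysInYear) % daysInYear;
    days.push(wrapped + 1);
  }
  return days;
}

// Gather every historical observation that falls inside the window.
// `byDay` maps day-of-year to that day's readings across all years.
function poolWindow(
  byDay: Map<number, number[]>,
  dayOfYear: number,
  halfWidth: number = 3,
): number[] {
  const pooled: number[] = [];
  for (const day of windowDays(dayOfYear, halfWidth)) {
    const temps = byDay.get(day);
    if (temps) pooled.push(...temps);
  }
  return pooled;
}
```

For December 30th (day 364 in a non-leap year), `windowDays(364)` covers days 361 through 365 plus days 1 and 2, which matches the December 27th to January 2nd range described above.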
## Visualization Design
### The Core Challenge
How do you show 100+ years of data on a single chart without overwhelming the user?
The answer: **layered context**. The chart shows:
1. **Background**: The 10th-90th percentile range as a shaded band
2. **Midline**: The 50th percentile (median) as a reference line
3. **Extremes**: Record high/low as subtle markers
4. **Today**: Current temperature as a prominent dot
At a glance, you can see whether today is typical (inside the band), unusual (at the edge), or extreme (near the records).
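The same three-way reading can be expressed in code, which is handy for choosing a caption or accent color. This is a sketch under one assumption: "near the records" is taken to mean within a fixed margin of the record high or low. The `recordMargin` parameter and its default of 2 degrees are hypothetical.

```typescript
type Verdict = "typical" | "unusual" | "extreme";

interface ChartBands {
  p10: number;
  p90: number;
  recordLow: number;
  recordHigh: number;
}

// Inside the band is typical; outside it is unusual, unless the reading
// is within `recordMargin` degrees of a record (or beyond it).
function classifyReading(
  temp: number,
  bands: ChartBands,
  recordMargin: number = 2,
): Verdict {
  if (temp >= bands.p10 && temp <= bands.p90) return "typical";
  const nearRecordHigh = temp >= bands.recordHigh - recordMargin;
  const nearRecordLow = temp <= bands.recordLow + recordMargin;
  return nearRecordHigh || nearRecordLow ? "extreme" : "unusual";
}
```

With a band of 35 to 55 and records of 10 and 70, a reading of 45 is typical, 60 is unusual, and 69 is extreme.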
### Recharts Implementation
I chose Recharts for the visualizations. It's built for React, composable, and handles responsive sizing gracefully. The key components:
- **AreaChart** for the percentile bands (fill the space between 10th and 90th)
- **Line** for the median reference
- **ReferenceDot** for today's actual temperature
- **ReferenceLine** for record high/low markers
The trickiest part was the 24-hour hourly view. GHCN only provides daily min/max, not hourly data. I interpolate hourly historical ranges using a sinusoidal model (temperature typically peaks mid-afternoon and troughs at dawn). It's not perfect, but it's useful—and I'm transparent in the UI that hourly historical data is modeled, not measured.
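A single sine wave puts the trough exactly 12 hours after the peak, which does not match a dawn low and a mid-afternoon high. One way to get both is to join two half-cosines: a short one for warming and a longer one for cooling. The sketch below takes that approach; the default hours (trough at 06:00, peak at 15:00) and the function name are assumptions, not the app's actual parameters.

```typescript
// Modeled temperature at a given hour from a daily min/max pair.
// Assumes troughHour < peakHour, both in local time on a 0-24 clock.
function modelHourlyTemp(
  hour: number,
  dailyMin: number,
  dailyMax: number,
  troughHour: number = 6,
  peakHour: number = 15,
): number {
  const mid = (dailyMin + dailyMax) / 2;
  const amplitude = (dailyMax - dailyMin) / 2;
  const riseHours = peakHour - troughHour; // warming phase
  const fallHours = 24 - riseHours; // cooling phase, wraps past midnight

  // Hours elapsed since the most recent trough, always in [0, 24).
  const sinceTrough = (((hour - troughHour) % 24) + 24) % 24;

  if (sinceTrough <= riseHours) {
    // Rising half-cosine: dailyMin at the trough, dailyMax at the peak.
    return mid - amplitude * Math.cos((Math.PI * sinceTrough) / riseHours);
  }
  // Falling half-cosine: dailyMax at the peak, back to dailyMin at the trough.
  return (
    mid + amplitude * Math.cos((Math.PI * (sinceTrough - riseHours)) / fallHours)
  );
}

// One modeled value per hour, ready to feed a 24-point chart series.
function modelDay(dailyMin: number, dailyMax: number): number[] {
  const hours: number[] = [];
  for (let h = 0; h < 24; h++) hours.push(modelHourlyTemp(h, dailyMin, dailyMax));
  return hours;
}
```

Applying the same function to the 10th, 50th, and 90th percentile min/max pairs would yield modeled hourly bands, which is why the UI label "modeled, not measured" matters.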
### Color Psychology
Colors carry meaning in weather visualization:
- **Blue shading**: Cool/cold historical range
- **Red/orange**: Warm/hot range
- **White/gray**: The median zone
- **Bright accent**: Today's actual temperature
When today's dot is deep in the blue zone, the visual immediately communicates "unusually cold." No reading required.
## Location Search
### Geocoding
Users search by city name. The app needs to find the nearest weather station with sufficient historical data.
The flow:
1. User types "San Francisco"
2. Geocode to coordinates via OpenStreetMap Nominatim
3. Query PostgreSQL for stations within 50km using PostGIS
4. Rank by (data_completeness * proximity_score)
5. Return best match
The ranking formula matters. A station 2 miles away with 40 years of data might be worse than a station 10 miles away with 100 years. I weight completeness heavily—users came for historical context, not just proximity.
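Here is a rough, in-memory sketch of steps 3 to 5. In the app the radius filter runs in PostGIS; the version below uses a haversine distance so it stays self-contained. The `completeness` field (a 0 to 1 share of days with data), the linear proximity score, and the `completenessWeight` exponent are all assumptions chosen to illustrate "weight completeness heavily".

```typescript
interface Station {
  name: string;
  latitude: number;
  longitude: number;
  completeness: number; // hypothetical: 0..1 share of days with data
}

interface RankedStation extends Station {
  distanceKm: number;
  score: number;
}

const EARTH_RADIUS_KM = 6371;

// Great-circle distance between two points given in degrees.
function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// Score = completeness^weight * proximity, best match first.
function rankStations(
  lat: number,
  lon: number,
  stations: Station[],
  radiusKm: number = 50,
  completenessWeight: number = 2,
): RankedStation[] {
  return stations
    .map((s) => {
      const distanceKm = haversineKm(lat, lon, s.latitude, s.longitude);
      // 1 at the search point, falling linearly to 0 at the radius.
      const proximity = Math.max(0, 1 - distanceKm / radiusKm);
      const score = Math.pow(s.completeness, completenessWeight) * proximity;
      return { ...s, distanceKm, score };
    })
    .filter((s) => s.distanceKm <= radiusKm)
    .sort((a, b) => b.score - a.score);
}
```

With an exponent above 1, a sparse station a few kilometers away loses to a nearly complete one farther out, which is the behavior described above.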
### Station Metadata
Each weather station gets a profile: name, coordinates, elevation, years of coverage, and data completeness percentage. The UI shows which station is powering the data, so users understand they're seeing "San Francisco International Airport" data, not some abstract "San Francisco" number.
## Performance Considerations
### PostgreSQL Queries
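The naive query ("give me all temperatures for this station on this date across 100 years") is expensive. With 100,000 stations, 365 days, and up to a century or more of readings for each, the raw observations table can approach billions of rows, and every page view would have to pull and sort a slice of it.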
**Solution 1: Materialized views**. I pre-compute percentiles for each (station_id, day_of_year) combination. The table has ~36.5M rows (100K stations × 365 days), but each row contains the pre-calculated 10th, 50th, 90th percentiles and records.
**Solution 2: Composite indexes**. The query pattern is always (station_id, day_of_year). A composite index makes these lookups O(log n).
**Result**: Percentile lookups are sub-5ms, even on a modest PostgreSQL instance.
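A sketch of what the pre-computation and lookup could look like, written as SQL strings in TypeScript so it needs no database client to read or test. Table and column names follow the schema listed in this post. Everything else is an assumption: computing percentiles on daily highs (`temp_max`), the index name, and the `percentileLookup` helper. The 7-day window is left out for brevity.

```typescript
// PostgreSQL's percentile_cont is an ordered-set aggregate that
// interpolates between neighboring values.
// Caveat: EXTRACT(DOY ...) shifts by one after Feb 29 in leap years,
// so a real build would likely key on (month, day) or correct for this.
const CREATE_PERCENTILE_CACHE = `
CREATE MATERIALIZED VIEW percentile_cache AS
SELECT
  station_id,
  EXTRACT(DOY FROM date)::int AS day_of_year,
  percentile_cont(0.10) WITHIN GROUP (ORDER BY temp_max) AS p10,
  percentile_cont(0.50) WITHIN GROUP (ORDER BY temp_max) AS p50,
  percentile_cont(0.90) WITHIN GROUP (ORDER BY temp_max) AS p90,
  MAX(temp_max) AS record_high,
  MIN(temp_min) AS record_low,
  COUNT(temp_max) AS observation_count
FROM daily_observations
GROUP BY 1, 2;
`;

// Composite index matching the (station_id, day_of_year) access pattern.
const CREATE_LOOKUP_INDEX = `
CREATE UNIQUE INDEX percentile_cache_station_day
  ON percentile_cache (station_id, day_of_year);
`;

// A text-plus-values pair, the usual shape for a parameterized query.
interface ParameterizedQuery {
  text: string;
  values: Array<string | number>;
}

function percentileLookup(stationId: string, dayOfYear: number): ParameterizedQuery {
  if (!Number.isInteger(dayOfYear) || dayOfYear < 1 || dayOfYear > 366) {
    throw new RangeError(`dayOfYear must be an integer in 1..366, got ${dayOfYear}`);
  }
  return {
    text:
      "SELECT p10, p50, p90, record_high, record_low, observation_count " +
      "FROM percentile_cache WHERE station_id = $1 AND day_of_year = $2",
    values: [stationId, dayOfYear],
  };
}
```

Because the lookup filters on exactly the indexed columns, each request touches a single row instead of scanning a century of observations.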
### Caching Strategy
Weather data is perfect for caching:
- **Historical percentiles**: Effectively static; they change only when the percentile cache is rebuilt or the data is reimported
- **Today's weather**: Changes, but users tolerate 15-minute staleness
I cache percentiles aggressively (24-hour TTL) and current weather moderately (15-minute TTL). For typical browsing patterns, this should result in high cache hit rates.
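The two tiers can share one small cache implementation with different TTLs. This is an in-memory sketch only; the `TtlCache` class is hypothetical, and a deployed app might rely on its framework's or CDN's caching instead.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

class TtlCache<V> {
  private readonly store = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // drop stale entries lazily, on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Return the cached value, or load, store, and return a fresh one.
  async getOrLoad(key: string, loader: () => Promise<V>): Promise<V> {
    const cached = this.get(key);
    if (cached !== undefined) return cached;
    const fresh = await loader();
    this.set(key, fresh);
    return fresh;
  }
}

// TTLs from the text: 24 hours for percentiles, 15 minutes for current weather.
const PERCENTILE_TTL_MS = 24 * 60 * 60 * 1000;
const CURRENT_WEATHER_TTL_MS = 15 * 60 * 1000;
```

A key such as `stationId:dayOfYear` would suit the percentile cache, since that pair fully determines the cached row.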
## Technical Decisions
### Why Next.js?
The app needs both server-side data fetching (for SEO and initial load) and client-side interactivity (for chart interactions). Next.js App Router handles this elegantly. The main page server-renders with the user's detected location, then hydrates for client-side interactions.
### Why Recharts?
I evaluated Chart.js, D3, and Recharts. Recharts won because:
1. **Built for React**: Declarative components, not imperative code
2. **Responsive by default**: Charts resize gracefully
3. **Composable**: Building the layered percentile visualization was straightforward
4. **Good enough performance**: Even with 24 data points × 5 series, rendering is smooth
D3 would have been more flexible, but the development time wasn't worth it for this use case.
### Database Schema
The core tables:
```
stations (id, name, latitude, longitude, elevation, data_start, data_end)
daily_observations (station_id, date, temp_min, temp_max, precipitation)
percentile_cache (station_id, day_of_year, p10, p50, p90, record_high, record_low, observation_count)
```
The `percentile_cache` table is the key optimization. It's rebuilt weekly from `daily_observations`, but queries hit the cache exclusively.
## Lessons Learned
1. **The question matters more than the data**: I had access to 100+ years of weather data. The insight was realizing that "is today unusual?" was a question worth answering—and that percentile bands were the right abstraction.
2. **Statistical honesty pays off**: Showing "insufficient data" instead of unreliable percentiles builds trust. Users appreciate knowing when the app doesn't have a good answer.
3. **Pre-computation is underrated**: The percentile cache transformed a slow, complex query into a fast, simple lookup. Sometimes the best optimization is doing the work ahead of time.
4. **Location is everything**: Weather is hyperlocal. A station 30 miles away might have completely different historical patterns. The geocoding and station-matching logic is quietly one of the most important parts of the app.
## What I'd Do Differently
- **Add more weather variables**: Precipitation percentiles, humidity, wind—the same "is today unusual?" framing works for all of them. I focused on temperature for V1, but the architecture supports expansion.
- **Mobile-first design from day one**: The charts work on mobile, but they were designed desktop-first. A mobile-first approach would have led to different visualization choices.
- **Expose the API**: A public API (with rate limiting) would let power users query historical percentiles programmatically.
- **Climate trend overlays**: The data could show not just "is today unusual?" but "is this decade warmer than last?" That's a bigger editorial statement, but the data supports it.
---
*This project taught me that data visualization isn't about showing more data—it's about answering a specific question clearly. The chart doesn't show 100 years of raw temperature readings. It shows whether today is unusual. That's the difference between data and insight.*