The Original Problem: AWS Outage Chaos
During the recent Oct 20th 2025 AWS outage, our team discovered an uncomfortable truth about our scheduled jobs. We had jobs configured to run exactly once per schedule via AWS EventBridge Scheduler. Simple enough, right?
Wrong.
When AWS came back online after an extended outage, EventBridge released a flood of queued job triggers that had accumulated during the downtime. Our "run once" job suddenly ran multiple times in rapid succession, causing data inconsistencies and duplicate operations. Check out my previous post on how we recovered user data after the outage here.
The Solution That Worked (Too Well)
The fix seemed straightforward: implement a Redis-based distributed lock to prevent concurrent executions. Before each job execution, we'd set a flag in Redis. If the flag was already set, the job would recognize a concurrent execution was in progress and gracefully bail out.
def perform
  # SET with NX both checks for and acquires the lock in a single atomic call
  return if concurrent_execution_detected?

  begin
    # Iterate over driver TimeSeries data to find active drivers
    process_active_drivers
  ensure
    clear_execution_flag
  end
end

def concurrent_execution_detected?
  # Returns true if another execution already holds the flag; otherwise acquires it
  # (NX = only set if absent, EX = expire after 300 seconds as a safety net)
  !REDIS.set("job:#{job_id}:running", "1", nx: true, ex: 300)
end

def clear_execution_flag
  REDIS.del("job:#{job_id}:running")
end
We deployed this with confidence. Problem solved!
The Problem With the Solution
Except... it wasn't.
Shortly after deployment, we noticed something odd: some scheduled slots had no job execution at all. The job simply didn't run when it was supposed to. This was arguably worse than running multiple times—at least duplicate runs were noisy and obvious.
The Real Culprit: Death by a Thousand Drivers
After digging through logs and tracing job lifecycles, we found the smoking gun: Sidekiq's graceful shutdown mechanism combined with our job's growing execution time.
Here's what was happening:
- A scheduled job starts executing
- The job iterates over TimeSeries data for all our drivers' geospatial data
- Kubernetes scales down our Sidekiq cluster (or a pod gets replaced during deployment)
- Sidekiq begins its graceful shutdown, giving jobs 30 seconds to complete
- Our job takes longer than 30 seconds (sometimes over a minute!)
- Sidekiq hard-kills the job
- The Redis flag remains set (because the ensure block never runs)
- Sidekiq automatically retries the job on another worker
- The retry sees the Redis flag and thinks "concurrent execution detected!"
- The retry bails out immediately
- No job completes for that scheduled slot
The Hidden Performance Regression
What made this particularly insidious was that our job used to be fast. When we first launched, iterating through driver TimeSeries data took milliseconds. But as our traffic surged and our driver count grew, the Redis keyspace for the TimeSeries structure expanded significantly.
What was once a quick scan became a slow crawl through thousands of driver records, filtering for those who had driven in the last hour. We only actually needed the geospatial data from the last 10 minutes, but we were scanning everything.
The job had slowly, imperceptibly degraded from sub-second execution to over a minute—crossing that critical 30-second Sidekiq shutdown threshold.
The Real Fix: Performance First, Then Locking
We realized the Redis lock wasn't wrong—it was just unable to work correctly with a slow job. The real problem was that we couldn't distinguish between two scenarios:
- Truly concurrent jobs (from AWS outage flooding) → Should be blocked
- Retry after Sidekiq kill (legitimate recovery) → Should proceed
When the job took 60+ seconds, Sidekiq would kill it and spawn a retry. But the Redis lock was still held, so the retry would see it as a concurrent execution and bail out. The lock was working as designed; the job was just too slow to survive Sidekiq's shutdown process.
The solution wasn't to remove the lock—we still need it to handle AWS outage scenarios. The solution was to make the job fast enough that it would never be killed mid-execution.
The Performance Bottleneck
Our original implementation looked something like this:
def process_active_drivers
  all_drivers = Driver.all

  all_drivers.each do |driver|
    # Fetch the entire TimeSeries for this driver (location members with timestamp scores)
    timeseries = REDIS.zrange("driver:#{driver.id}:locations", 0, -1, with_scores: true)

    # Filter in Ruby for entries from the last hour
    recent_locations = timeseries.select do |_location, timestamp|
      timestamp > 1.hour.ago.to_i
    end

    # We only needed the last 10 minutes anyway!
    process_recent_activity(
      recent_locations.select { |_location, timestamp| timestamp > 10.minutes.ago.to_i }
    )
  end
end
This meant:
- Fetching potentially thousands of driver records from the database
- For each driver, pulling their entire geospatial TimeSeries from Redis
- Filtering in Ruby to find recent activity
- All to find maybe a few dozen drivers who were actually active
The Solution: An Active Driver Index
Instead of scanning all drivers and their complete history, we built a lightweight index structure in Redis that tracked only the drivers who had been active in the last hour:
# When a driver's location is recorded (happens frequently)
def record_driver_location(driver_id, location_data)
  timestamp = Time.now.to_i

  # Store in the main TimeSeries (as before)
  REDIS.zadd("driver:#{driver_id}:locations", timestamp, location_data)

  # NEW: Add the driver to the active drivers set, scored by last-seen time
  REDIS.zadd("active_drivers", timestamp, driver_id)

  # Clean up index entries older than 1 hour
  REDIS.zremrangebyscore("active_drivers", 0, 1.hour.ago.to_i)
end
Now our scheduled job became:
def process_active_drivers
  # Get only drivers active in the last hour (sorted set range query, O(log N + M))
  cutoff = 1.hour.ago.to_i
  active_driver_ids = REDIS.zrangebyscore("active_drivers", cutoff, "+inf")

  # Only fetch the data we need
  active_driver_ids.each do |driver_id|
    # Get just the last 10 minutes of data using ZRANGEBYSCORE
    recent_locations = REDIS.zrangebyscore(
      "driver:#{driver_id}:locations",
      10.minutes.ago.to_i,
      "+inf"
    )

    process_recent_activity(recent_locations)
  end
end
The Expected Results
Based on benchmarking, the performance improvement should be dramatic:
- Before: 60+ seconds (and growing with scale)
- After: <1 second consistently (in testing)
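For reference, here is a minimal sketch of how such a comparison can be measured with Ruby's Benchmark module, run against a staging dataset. The name process_active_drivers_legacy is a hypothetical stand-in for the old full-scan implementation kept around for comparison:

require "benchmark"

# Hypothetical comparison harness; process_active_drivers_legacy is an assumed
# name for the old full-scan implementation retained purely for benchmarking.
legacy  = Benchmark.realtime { process_active_drivers_legacy }
indexed = Benchmark.realtime { process_active_drivers }

puts format("legacy scan: %.2fs, indexed: %.2fs (%.0fx faster)", legacy, indexed, legacy / indexed)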
By maintaining a parallel index of active drivers, we:
- Eliminated the need to scan all drivers
- Eliminated the need to fetch and filter complete TimeSeries data
- Reduced the job from O(N×M) to O(A×K), where N is all drivers, M is all stored locations per driver, A is active drivers (tiny compared to N), and K is recent locations per driver
If the benchmarks hold in production, with the job completing in under a second:
- Sidekiq's 30-second shutdown window will no longer be a concern
- The Redis lock will finally work as intended—preventing duplicate jobs from AWS outages without blocking legitimate retries
- We can distinguish between truly concurrent jobs (which should be blocked) and retry jobs (which should proceed)
Update: I'll be deploying this solution soon and will follow up with a part 2 covering the actual production results and any surprises we encounter along the way.
Lessons Learned
- Performance problems masquerade as concurrency problems - Our Redis lock was correct, but it couldn't work with a job that took longer than Sidekiq's shutdown window. We couldn't distinguish between "truly concurrent" and "legitimate retry."
- What works at 10x doesn't work at 100x - Our original implementation was fine for dozens of drivers. With thousands, it became a bottleneck that made our concurrency control unworkable.
- Maintain the right indices - Scanning complete datasets to find recent activity is a code smell. Build lightweight indices that track what you actually need.
- Use Redis data structures wisely - Sorted sets (ZSET) with time-based scores are perfect for "recently active" tracking with automatic time-based filtering.
- Measure, don't assume - We didn't notice the job slowing down because it happened gradually. Better monitoring would have caught this before it became critical.
- Fix root causes, not symptoms - The Redis lock wasn't the problem—it was exactly what we needed for AWS outages. The problem was the job being too slow to work with the lock correctly.
The Architecture Pattern
This pattern of maintaining a "recently active" index alongside your main data structure is broadly applicable:
# Pattern: Active Entity Index
# Main data: Complete history for each entity
# Index: Set of entities active in time window
class ActivityTracker
  def initialize(entity_type)
    @entity_type = entity_type
  end

  def record_activity(entity_id, data)
    timestamp = Time.now.to_i
    # Store complete history
    REDIS.zadd("#{@entity_type}:#{entity_id}:history", timestamp, data)
    # Update the active index
    REDIS.zadd("active_#{@entity_type}", timestamp, entity_id)
    # Periodic cleanup (or use Redis expiry)
    cleanup_old_entries if rand < 0.01
  end

  def get_recently_active(time_window = 1.hour)
    cutoff = time_window.ago.to_i
    REDIS.zrangebyscore("active_#{@entity_type}", cutoff, "+inf")
  end

  private

  # Trim index entries that fall outside the tracking window
  def cleanup_old_entries
    REDIS.zremrangebyscore("active_#{@entity_type}", 0, 1.hour.ago.to_i)
  end
end
This trades a small amount of additional write overhead for massive read performance gains when you need to find "what's active right now?"
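A quick illustration of how the tracker sketched above might be used; the entity type, ID, and payload here are hypothetical:

# Hypothetical usage of the ActivityTracker sketch above
tracker = ActivityTracker.new("driver")

# On every location write
tracker.record_activity(42, '{"lat":37.77,"lng":-122.41}')

# In the scheduled job: only the entities seen in the last 10 minutes
tracker.get_recently_active(10.minutes) # => ["42", ...]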
Conclusion
Distributed systems problems often look like they require coordination primitives when they really require performance optimization. Our Redis lock was the right solution for preventing duplicate jobs during AWS outages—but it could only work correctly once the job was fast enough to complete before Sidekiq's shutdown timeout.
The key insight: you can't distinguish between concurrent execution and legitimate retry if your job doesn't finish before the system kills it. By making the job roughly 60× faster in our benchmarks, we expect the concurrency control to finally work as designed.
Sometimes the best fix for a distributed systems problem isn't better coordination—it's making operations fast enough that edge cases become rare and recoverable.