Friday, November 07, 2025

When Your Fix Becomes the Problem: A Tale of AWS Outages, Redis Flags, and Performance Scaling



The Original Problem: AWS Outage Chaos

During the recent Oct 20th 2025 AWS outage, our team discovered an uncomfortable truth about our scheduled jobs. We had jobs configured to run exactly once per schedule via AWS EventBridge Scheduler. Simple enough, right?

Wrong.

When AWS came back online after an extended outage, EventBridge released a flood of queued job triggers that had accumulated during the downtime. Our "run once" job suddenly ran multiple times in rapid succession, causing data inconsistencies and duplicate operations. Check out my previous post on how we recovered user data after the outage here.

The Solution That Worked (Too Well)

The fix seemed straightforward: implement a Redis-based distributed lock to prevent concurrent executions. Before each job execution, we'd set a flag in Redis. If the flag was already set, the job would recognize a concurrent execution was in progress and gracefully bail out.

def perform
  # The SET NX call below both checks for and acquires the flag in one atomic
  # operation, so no separate "set the flag" step is needed here.
  return if concurrent_execution_detected?

  begin
    # Iterate over driver TimeSeries data to find active drivers
    process_active_drivers
  ensure
    clear_execution_flag
  end
end

def concurrent_execution_detected?
  # Returns true when the key already exists (another run holds the flag).
  # The 300-second TTL is a safety net so a crashed worker can't hold it forever.
  !REDIS.set("job:#{job_id}:running", "1", nx: true, ex: 300)
end
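
The clear_execution_flag and job_id helpers aren't shown in the post. A minimal sketch, assuming the flag is keyed by the job class (so duplicate triggers of the same job collide on one key while unrelated jobs don't), might look like this:

# Hypothetical helpers -- the key scheme is an assumption, not our exact code.
def job_id
  # A stable identifier for the job, so every trigger of this schedule
  # competes for the same Redis key.
  self.class.name
end

def clear_execution_flag
  # Release the flag explicitly on a clean finish; the TTL covers crashes.
  REDIS.del("job:#{job_id}:running")
end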

We deployed this with confidence. Problem solved!

The Problem With the Solution

Except... it wasn't.

Shortly after deployment, we noticed something odd: some scheduled slots had no job execution at all. The job simply didn't run when it was supposed to. This was arguably worse than running multiple times—at least duplicate runs were noisy and obvious.

The Real Culprit: Death by a Thousand Drivers

After digging through logs and tracing job lifecycles, we found the smoking gun: Sidekiq's graceful shutdown mechanism combined with our job's growing execution time.

Here's what was happening:

  1. A scheduled job starts executing
  2. The job iterates over TimeSeries data for all our drivers' geospatial data
  3. Kubernetes scales down our Sidekiq cluster (or a pod gets replaced during deployment)
  4. Sidekiq begins its graceful shutdown, giving jobs 30 seconds to complete
  5. Our job takes longer than 30 seconds (sometimes over a minute!)
  6. Sidekiq hard-kills the job
  7. The Redis flag remains set (because the ensure block never runs)
  8. Sidekiq automatically retries the job on another worker
  9. The retry sees the Redis flag and thinks "concurrent execution detected!"
  10. The retry bails out immediately
  11. No job completes for that scheduled slot

The Hidden Performance Regression

What made this particularly insidious was that our job used to be fast. When we first launched, iterating through driver TimeSeries data took milliseconds. But as our traffic surged and our driver count grew, the Redis keyspace for the TimeSeries structure expanded significantly.

What was once a quick scan became a slow crawl through thousands of driver records, filtering for those who had driven in the last hour. We only actually needed the geospatial data from the last 10 minutes, but we were scanning everything.

The job had slowly, imperceptibly degraded from sub-second execution to over a minute—crossing that critical 30-second Sidekiq shutdown threshold.
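
In hindsight, even a crude spot-check of the per-driver sorted sets would have made the growth visible. Something like this (illustrative only; it assumes the Driver model and key naming used elsewhere in this post):

# Sample a handful of drivers and measure their TimeSeries size.
# ZCARD is O(1), so this is cheap to run ad hoc.
sample_ids = Driver.limit(100).pluck(:id)
sizes = sample_ids.map { |id| REDIS.zcard("driver:#{id}:locations") }
puts "average location entries per driver: #{(sizes.sum / sizes.size.to_f).round}"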

The Real Fix: Performance First, Then Locking

We realized the Redis lock wasn't wrong—it was just unable to work correctly with a slow job. The real problem was that we couldn't distinguish between two scenarios:

  1. Truly concurrent jobs (from AWS outage flooding) → Should be blocked
  2. Retry after Sidekiq kill (legitimate recovery) → Should proceed

When the job took 60+ seconds, Sidekiq would kill it and spawn a retry. But the Redis lock was still held, so the retry would see it as a concurrent execution and bail out. The lock was working as designed; the job was just too slow to survive Sidekiq's shutdown process.

The solution wasn't to remove the lock—we still need it to handle AWS outage scenarios. The solution was to make the job fast enough that it would never be killed mid-execution.

The Performance Bottleneck

Our original implementation looked something like this:

def process_active_drivers
  all_drivers = Driver.all

  all_drivers.each do |driver|
    # Fetch the ENTIRE TimeSeries for this driver, members plus timestamp scores
    timeseries = REDIS.zrange("driver:#{driver.id}:locations", 0, -1, withscores: true)

    # Filter in Ruby for entries from the last hour
    recent_locations = timeseries.select do |_entry, timestamp|
      timestamp > 1.hour.ago.to_i
    end

    # We only needed the last 10 minutes anyway!
    process_recent_activity(
      recent_locations.select { |_entry, timestamp| timestamp > 10.minutes.ago.to_i }
    )
  end
end

This meant:

  • Fetching potentially thousands of driver records from the database
  • For each driver, pulling their entire geospatial TimeSeries from Redis
  • Filtering in Ruby to find recent activity
  • All to find maybe a few dozen drivers who were actually active

The Solution: An Active Driver Index

Instead of scanning all drivers and their complete history, we built a lightweight index structure in Redis that tracked only the drivers who had been active in the last hour:

# When a driver's location is recorded (happens frequently)
def record_driver_location(driver_id, location_data)
  timestamp = Time.now.to_i

  # Store in the main TimeSeries (as before)
  REDIS.zadd("driver:#{driver_id}:locations", timestamp, location_data)

  # NEW: Add the driver to the active-drivers sorted set, scored by last-seen time
  REDIS.zadd("active_drivers", timestamp, driver_id)

  # Trim index entries older than 1 hour so the set stays small
  REDIS.zremrangebyscore("active_drivers", 0, 1.hour.ago.to_i)
end

Now our scheduled job became:

def process_active_drivers
  # Get only drivers active in the last hour (ZRANGEBYSCORE is O(log N + M), where M is the handful of matches)
  cutoff = 1.hour.ago.to_i
  active_driver_ids = REDIS.zrangebyscore("active_drivers", cutoff, "+inf")
  
  # Only fetch the data we need
  active_driver_ids.each do |driver_id|
    # Get just the last 10 minutes of data using ZRANGEBYSCORE
    recent_locations = REDIS.zrangebyscore(
      "driver:#{driver_id}:locations",
      10.minutes.ago.to_i,
      "+inf"
    )
    
    process_recent_activity(recent_locations)
  end
end

The Expected Results

Based on benchmarking, the performance improvement should be dramatic:

  • Before: 60+ seconds (and growing with scale)
  • After: <1 second consistently (in testing)

By maintaining a parallel index of active drivers, we:

  • Eliminated the need to scan all drivers
  • Eliminated the need to fetch and filter complete TimeSeries data
  • Reduced the job from O(N×M) to O(A×K) where A is active drivers (tiny compared to N) and K is recent locations per driver
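
If you want to reproduce this kind of before/after comparison, Ruby's Benchmark module gives a rough number (the job class name below is a placeholder, not our actual class):

require "benchmark"

# Single-run timing of the job body -- ActiveDriverJob is a hypothetical name.
elapsed = Benchmark.realtime { ActiveDriverJob.new.perform }
puts "perform completed in #{elapsed.round(3)}s"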

If the benchmarks hold in production, with the job completing in under a second:

  • Sidekiq's 30-second shutdown window will no longer be a concern
  • The Redis lock will finally work as intended—preventing duplicate jobs from AWS outages without blocking legitimate retries
  • We can distinguish between truly concurrent jobs (which should be blocked) and retry jobs (which should proceed)

Update: I'll be deploying this solution soon and will follow up with a part 2 covering the actual production results and any surprises we encounter along the way.

Lessons Learned

  1. Performance problems masquerade as concurrency problems - Our Redis lock was correct, but it couldn't work with a job that took longer than Sidekiq's shutdown window. We couldn't distinguish between "truly concurrent" and "legitimate retry."
  2. What works at 10x doesn't work at 100x - Our original implementation was fine for dozens of drivers. With thousands, it became a bottleneck that made our concurrency control unworkable.
  3. Maintain the right indices - Scanning complete datasets to find recent activity is a code smell. Build lightweight indices that track what you actually need.
  4. Use Redis data structures wisely - Sorted sets (ZSET) with time-based scores are perfect for "recently active" tracking with automatic time-based filtering.
  5. Measure, don't assume - We didn't notice the job slowing down because it happened gradually. Better monitoring would have caught this before it became critical; a duration-logging sketch follows this list.
  6. Fix root causes, not symptoms - The Redis lock wasn't the problem—it was exactly what we needed for AWS outages. The problem was the job being too slow to work with the lock correctly.
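
To make lesson 5 concrete: Sidekiq already logs elapsed time per job, but alerting on slow runs means capturing the duration yourself. A minimal server-middleware sketch (the 20-second threshold and warn call are illustrative, not our production setup):

# Logs a warning whenever a job runs long enough to flirt with the shutdown window.
class JobDurationLogger
  def call(_worker, job, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    Sidekiq.logger.warn("#{job['class']} on #{queue} took #{elapsed.round(1)}s") if elapsed > 20
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add JobDurationLogger
  end
end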

The Architecture Pattern

This pattern of maintaining a "recently active" index alongside your main data structure is broadly applicable:

# Pattern: Active Entity Index
# Main data: Complete history for each entity
# Index: Set of entities active in time window

class ActivityTracker
  def initialize(entity_type)
    @entity_type = entity_type # e.g. "drivers"
  end

  def record_activity(entity_id, data)
    timestamp = Time.now.to_i

    # Store complete data
    REDIS.zadd("#{entity_type}:#{entity_id}:history", timestamp, data)

    # Update active index
    REDIS.zadd("active_#{entity_type}", timestamp, entity_id)

    # Periodic cleanup on roughly 1 write in 100 (or use Redis expiry)
    cleanup_old_entries if rand < 0.01
  end

  def get_recently_active(time_window = 1.hour)
    cutoff = time_window.ago.to_i
    REDIS.zrangebyscore("active_#{entity_type}", cutoff, "+inf")
  end

  private

  attr_reader :entity_type

  # Drop index entries that have aged out of the one-hour window
  def cleanup_old_entries
    REDIS.zremrangebyscore("active_#{entity_type}", 0, 1.hour.ago.to_i)
  end
end

This trades a small amount of additional write overhead for massive read performance gains when you need to find "what's active right now?"
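
For illustration, a driver-flavored use of the sketch above (process_driver and location_payload are hypothetical names):

tracker = ActivityTracker.new("drivers")

# On every location ping
tracker.record_activity(driver.id, location_payload.to_json)

# In the scheduled job
tracker.get_recently_active(10.minutes).each { |driver_id| process_driver(driver_id) }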

Conclusion

Distributed systems problems often look like they require coordination primitives when they really require performance optimization. Our Redis lock was the right solution for preventing duplicate jobs during AWS outages—but it could only work correctly once the job was fast enough to complete before Sidekiq's shutdown timeout.

The key insight: you can't distinguish between concurrent execution and legitimate retry if your job doesn't finish before the system kills it. By making the job 60× faster, we enabled our concurrency control to work as designed.

Sometimes the best fix for a distributed systems problem isn't better coordination—it's making operations fast enough that edge cases become rare and recoverable.
