Sunday, March 15, 2026

Dead Code Is a Cognitive Tax — Here's How AI Helps You Stop Paying It


Every engineer knows the feeling. You open an unfamiliar part of the codebase, and you're immediately staring down a tangle of services, workers, models, and task entries — none of which come with a label saying "still matters" or "abandoned in 2023." You read the code carefully, try to trace the call graph, maybe even grep for usages — and only after 30 minutes do you realize: this thing hasn't run in production for over a year.

That tax on your attention has a name: cognitive load. And dead code is one of its most insidious sources.


What Is Cognitive Load in a Codebase?

Cognitive load, in the context of software engineering, is the total mental effort required to understand a system well enough to work in it safely. Every class, method, model, and background job you encounter is a unit of context you have to hold in your head.

The problem is that your brain doesn't automatically know which of those units are live and which are ghosts. If an EstimateWorker class exists in your repo, you have to assume it matters — until you prove otherwise. That proof takes time, attention, and often a distracting detour away from the actual work you sat down to do.

Dead code doesn't just waste disk space. It actively misleads you.

A Real-World Example: The Estimation Pipeline Cleanup

Recently, our team completed a cleanup effort across seven pull requests targeting a legacy estimation infrastructure — a suite of services originally built around Prophet forecasts and a Clair analysis pipeline — that had gone completely dark since late 2023.

Here's what was still sitting in the codebase, doing nothing:

  • EstimateService — fetched a CSV over HTTP, upserted records into the database, and refreshed an estimation cache. Silent for months.
  • EstimateWorker — a Sidekiq background job that uploaded files to S3, triggered the estimation flow, and posted Slack notifications. Long dead.
  • Estimation::Prophet::DownloadWorker — downloaded forecast CSVs from S3 and upserted them into a Prophet table. Never called.
  • Estimators::ClairAnalysis — computed hourly analysis records for a brief window in late 2023, then stopped.
  • ClairAnalysis model and its backing database table — zero writes since the pipeline went quiet.
  • Three SwitchBoard dispatch entries (events_collect_for_next_week, generate_weekly_user_report, and estimate_v2) — all orphaned task names in a routing map.

Any engineer — or AI assistant — reading this codebase would reasonably assume all of the above was active production infrastructure. None of it was.

The Numbers

  • 7 pull requests
  • 31 files changed
  • 943 lines deleted
  • −816 net lines removed

PR     Branch                        +Added   −Deleted   Files
#1     cleanup-tasks                     13         16       2
#2     cleanup-unused-estimate            0         74       4
#3     remove-clair-analysis              0        314       2
#4     remove-prophet                     0        210       5
#5     remove-clair-analysis-model       20         57       3
#6     rename-clair-v2s                  94         68      13
#7     remove-estimate-unused             0        204       2
Total                                   127        943      31

The 127 additions are almost entirely the rename PR (#6) — migrations, updated references, and renamed specs. Every other PR was pure deletion.


The Cognitive Impact of the Cleanup

Cleaner model surface. Once EstimateService, EstimateWorker, and ClairAnalysis were gone, the remaining models — Clair, ClairDailyInterimResult, ClairSetting — actually reflected how the system works today.

Naming that signals intent. ClairV2 implies a versioning scheme. ClairDailyInterimResult tells you exactly what the thing is and why it exists.

A smaller SwitchBoard dispatch map. Removing the three orphaned entries made the dispatch map honest again.
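Orphaned dispatch entries lend themselves to a mechanical check: diff the map's keys against what the scheduler actually fires. A minimal sketch — the map contents and scheduler list below are illustrative, not our real SwitchBoard config:

```ruby
# Hypothetical dispatch map; task names and handlers are illustrative.
DISPATCH_MAP = {
  "events_collect_for_next_week" => "EventsCollector",
  "estimate_v2"                  => "EstimateService",
  "daily_summary"                => "DailySummaryJob",
}

# Task names the scheduler is actually configured to fire (assumed).
SCHEDULED_TASKS = ["daily_summary"]

# Anything in the map that no scheduler entry references is a candidate.
orphans = DISPATCH_MAP.keys - SCHEDULED_TASKS
puts "Orphaned dispatch entries: #{orphans.inspect}"
# => ["events_collect_for_next_week", "estimate_v2"]
```

Run quarterly, a check like this keeps the dispatch map honest by construction rather than by heroics.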

A shorter test suite that still covers everything that matters. Several spec files covering deleted code were removed. The test suite got faster without losing any meaningful coverage.


Where AI Fits In: Finding Dead Code You Can't See

Here's the uncomfortable truth about dead code: it's often invisible to the people closest to it. If you wrote EstimateWorker two years ago and the team that decommissioned the upstream service never filed a ticket, you might not even know it's dead. The code looks fine. The tests pass. Nothing alerts you.

A Telling Real-World Example: Claude Gets Confused, Then Catches Itself

We recently asked Claude to generate a flow diagram of our pay guarantee process. Claude produced a diagram that looked plausible — tracing through services, models, and workers in a way that made logical sense.

The problem? Part of that diagram was wrong — because Claude had incorporated a module that was no longer active into its understanding of the flow. The dead code was so well-structured and apparently coherent that the AI read it as live infrastructure and wove it into the diagram without hesitation.

But here's what makes this story instructive rather than just cautionary: when an engineer removed what was hopefully the last bit of this dead code, Claude immediately recognized that the diagram it had drawn earlier relied on this bad signal, revised its understanding, and corrected the diagram.

That sequence — confidently wrong, then self-correcting — is a useful frame for thinking about AI and dead code. It fooled the AI for the same reason it fools engineers: it looks like it belongs.

What AI Can Do

Tracing call graphs at scale. AI can trace the full call graph of a function or class across an entire monorepo — answering not just with direct callers, but with the absence of callers.

Cross-referencing runtime signals with static code. When connected to observability data — logs, APM traces, queue metrics — an AI can compare what the code says it does with what actually runs in production.

Flagging stale patterns. Dead code has fingerprints: models with no recent migrations, task names absent from any scheduler config, service classes with no callers outside their own spec files.

Drafting cleanup PRs. Once dead code is identified, AI can help draft the actual removal — proposing what to delete, what to rename, and what specs to clean up.
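One of the "stale pattern" fingerprints — a class whose only references live in its own spec files — can be approximated with a deliberately naive scan. This sketch uses simple substring matching and an assumed Rails-style layout (app/, spec/); it's a scout that surfaces candidates for human review, not production tooling:

```ruby
# Given a map of file path => source, flag constants defined in app code
# whose only outside references are spec files (or nothing at all).
# Naive substring matching; real tooling would parse the AST.
def dead_code_candidates(files)
  defined = Hash.new { |h, k| h[k] = [] }
  files.each do |path, src|
    src.scan(/^\s*(?:class|module)\s+([A-Z]\w*)/) { |(name)| defined[name] << path }
  end

  defined.keys.select do |name|
    outside_refs = files.select do |path, src|
      src.include?(name) && !defined[name].include?(path)
    end
    outside_refs.keys.all? { |path| path.start_with?("spec/") }
  end
end

# Illustrative repo snapshot (file names are hypothetical):
files = {
  "app/workers/estimate_worker.rb"       => "class EstimateWorker\nend",
  "spec/workers/estimate_worker_spec.rb" => "RSpec.describe EstimateWorker do\nend",
  "app/models/clair.rb"                  => "class Clair\nend",
  "app/services/trip_builder.rb"         => "class TripBuilder\n  Clair.new\nend",
}

dead_code_candidates(files)
# => ["EstimateWorker", "TripBuilder"]  (TripBuilder has no callers at all)
```

Every hit still needs a human: a "dead" class might be invoked by a cron job or a console script the scan can't see.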

What AI Can't Do (Yet)

AI isn't a replacement for engineering judgment. A worker might be "dead" in CI but still referenced by a cron job in an ops runbook nobody's touched in three years.

The right model is AI as a scout, engineer as the decision-maker. AI surfaces candidates. Engineers verify, contextualise, and own the deletion.

Making Dead Code Cleanup a Habit

  1. Timestamp your decommissions. When you turn off a pipeline, leave a comment in the code with the date.
  2. Review your task dispatch maps regularly. A quarterly review catches orphaned entries before they fossilise.
  3. Use AI during onboarding and code review. AI tools can help new engineers quickly validate whether something is live — and surface it for cleanup if it isn't.
  4. Treat deletion as a first-class deliverable. 816 lines removed is a meaningful engineering contribution. Make it visible in sprint planning, changelogs, and retros.

Conclusion

Large codebases accumulate cognitive debt quietly, continuously, and with compounding interest. Dead code is one of the most expensive line items: it misleads engineers, bloats test suites, and turns routine code reading into archaeology.

As we saw first-hand, it even misleads AI. Claude confidently incorporated a dead module into a flow diagram of our pay guarantee process — because the code looked live. That moment of confusion, and the self-correction that followed, is a perfect metaphor for where we are with AI-assisted engineering today: powerful, promising, and most effective when paired with good runtime context and human judgment.

The goal isn't a perfect codebase. It's a codebase where the code you're reading is the code that's actually running. That's a goal worth shipping toward.

Thursday, February 26, 2026

When AI Sounds Right But Isn't: A Sidekiq Story

I recently ran into a Redis migration problem with Sidekiq. I needed to point our workers at a new Redis instance while safely draining jobs still sitting in the old one. I turned to Claude for advice — and what followed is a good reminder of why GenAI output should always be treated as a starting point, not a final answer.


Step 1: My Initial Idea (and Why Claude Correctly Pushed Back)

My first instinct was a "coin toss" approach: configure the Sidekiq server to point to the old Redis with 50% probability and the new Redis otherwise, while the client always pushes to the new Redis. Once the old queue drained, I'd do a clean cutover.

Claude's response here was actually good:

Claude: The coin toss approach has some real problems... Non-deterministic behavior is hard to reason about in production. You lose retry guarantees. A job that fails while a worker is connected to the old Redis will be retried there — but if on the next attempt the worker happens to connect to the new Redis, it won't see that job. The job could get stuck or lost. Sidekiq server connects to one Redis at startup, not per-job...

Fair enough. The critique was accurate. Claude then suggested a cleaner alternative: use Sidekiq 7's config.capsules to run multiple Redis connections within a single process — one capsule for the new Redis, one to drain the old.


Step 2: A Confident, Detailed, Completely Wrong Answer

I was on Sidekiq Pro 7.3.9, so I asked Claude to show me how to configure capsules for this use case. It gave me this:

Sidekiq.configure_server do |config|
  # New Redis - primary capsule (default)
  config.redis = { url: new_redis_url }

  # Old Redis - drain capsule
  config.capsule("legacy") do |cap|
    cap.redis = { url: old_redis_url }
    cap.concurrency = 5
    cap.queues = %w[default mailers critical]
  end
end

Sidekiq.configure_client do |config|
  config.redis = { url: new_redis_url }
end

The response was detailed. It used the right vocabulary. It even included helpful caveats about thread pool sizing, checking the retry and scheduled sets before teardown, and queue enumeration via redis-cli SMEMBERS queues. It felt authoritative.

There was just one problem: it was wrong.


Step 3: Thirty Seconds in the Docs

Before implementing anything, I did a quick check of the official Sidekiq documentation. Here's what it says, plainly:

Sidekiq Docs: "There is still one iron-clad rule: a Sidekiq process only executes jobs from one Redis instance; all Capsules within a process must use the same Redis instance. If you want to process jobs from two separate Redis instances, you need to start two separate Sidekiq processes."

Capsules are for organizing queues and concurrency within a single Redis instance. They cannot span multiple Redis instances. The entire premise of Claude's solution was invalid.
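For contrast, here is the kind of thing capsules are actually for: partitioning queues and concurrency within one Redis instance. A hedged sketch — queue names, concurrency numbers, and the redis_url variable are illustrative:

```ruby
# config/initializers/sidekiq.rb (Sidekiq 7+)
Sidekiq.configure_server do |config|
  # One Redis instance for the entire process: the iron-clad rule.
  config.redis = { url: redis_url }

  # The default capsule handles the main queues.
  config.queues = %w[default critical]
  config.concurrency = 10

  # An extra capsule gives bulk work its own, smaller thread pool,
  # still against the same Redis instance.
  config.capsule("bulk") do |cap|
    cap.queues = %w[bulk mailers]
    cap.concurrency = 2
  end
end
```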

When I pointed this out, Claude acknowledged it immediately:

Claude: You're right, I was wrong — capsules cannot span multiple Redis instances. They're for organizing queues/concurrency within a single Redis. So for your drain scenario, you actually need two separate Sidekiq processes with different config files.

No pushback. No hedging. Just a clean admission that the confident, detailed answer was incorrect.


What Makes This Failure Mode Dangerous

This isn't a story about AI being obviously broken. It's about AI being convincingly wrong — and that's a harder problem to guard against.

A few things made this particularly easy to fall for:

  • The answer was structurally sound. Capsules are a real Sidekiq 7 feature. The code was syntactically valid Ruby. The caveats about concurrency and retry sets were genuinely useful. Only the core assumption — that capsules can target different Redis instances — was wrong.
  • The fluency signals trust. When an answer uses the right terminology, references the right version numbers, and anticipates edge cases, it reads as expert. That fluency is a product of training on large amounts of text, not of verified understanding.
  • The model doesn't know what it doesn't know. Claude didn't say "I'm not certain about the multi-Redis constraint — check the docs." It presented the solution as if it were established fact.

A Simple Rule of Thumb

If you wouldn't ship code based solely on a Stack Overflow answer from 2019 without reading the docs, don't ship code based solely on a GenAI answer either. The bar should be the same — or higher, because at least the Stack Overflow answer has upvotes, comments, and a date stamp.

GenAI is genuinely useful for orientation: understanding an unfamiliar API surface, exploring options, getting unstuck. But any answer that involves a specific documented behavior — especially version-specific constraints — needs at least one authoritative source check before you act on it.

In this case, thirty seconds in the Sidekiq docs saved what could have been hours of debugging a fundamentally broken architecture. That's a pretty good return on thirty seconds.


The actual solution, if you're curious: two separate Sidekiq processes with separate config files, each pointing at a different Redis instance. One processes new work, one drains the old queues. When the old queue, retry set, and scheduled set are all empty, shut the old process down.
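Concretely, the drain setup might look like this. The config file names and environment variable wiring are assumptions for illustration, not our actual deploy config:

```shell
# Process 1: new Redis, handles all ongoing work
REDIS_URL=$NEW_REDIS_URL bundle exec sidekiq -C config/sidekiq.yml

# Process 2: old Redis, exists only to drain leftover jobs
REDIS_URL=$OLD_REDIS_URL bundle exec sidekiq -C config/sidekiq_drain.yml

# Decommission check: the old process is safe to shut down when its
# queues, retry set, and scheduled set are all empty.
redis-cli -u $OLD_REDIS_URL LLEN queue:default
redis-cli -u $OLD_REDIS_URL ZCARD retry
redis-cli -u $OLD_REDIS_URL ZCARD schedule
```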

Friday, November 28, 2025

Death-Defying Sidekiq Jobs: Part 2

In my previous post, I outlined the problem of parent jobs getting killed during Sidekiq shutdowns because they took too long to enqueue child jobs. We implemented a solution that used an active driver index instead of the expensive Redis iterator, but the story doesn't end there.

The Data Revealed More

After deploying the active driver index, I gathered metrics on the parent job execution times. The good news: runtime dropped significantly. The bad news: even with the new index, the higher percentile execution times still hovered around 40 seconds.


That 40-second ceiling was a problem. Sidekiq's shutdown grace period is 25 seconds by default, and while we could extend it, we'd just be postponing the inevitable. Jobs that take 40 seconds to enqueue children are still vulnerable to being killed mid-execution during deployments or restarts.

Enter perform_bulk

The problem was that we had 100,000 jobs to push to Sidekiq, and while each individual push took on the order of a microsecond or less, the math adds up: soon we were waiting close to a minute for all the jobs to be sent. This is a common problem in I/O-bound systems, where a "bulk" operation generally comes to the rescue. Database writes are the classic example: to write a thousand records, we use a bulk insert, sending all thousand prepared statements through a single connection/call so the database server (e.g., Postgres) executes them as one batch. A quick GenAI search surfaced perform_bulk, a method designed for exactly this scenario in the Sidekiq world. Instead of enqueuing jobs one at a time, perform_bulk submits jobs to Sidekiq in batches of up to 1,000 at once.

Here's what the refactored code looked like:

class ParentJob
  include Sidekiq::Job

  def perform(work_item_ids)
    # Prepare all job arguments
    job_args = work_item_ids.map { |id| [id] }
    
    ChildJob.perform_bulk(job_args)
  end
end

The key difference: perform_bulk pushes the jobs to Redis in a single pipelined operation rather than individual Redis calls. This dramatically reduces the network overhead that was causing our bottleneck.
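A back-of-the-envelope sketch of why the round trips dominate. The per-call latency here is an assumed figure for illustration, not a measurement from our cluster:

```ruby
jobs  = 100_000
rtt   = 0.0004   # assume ~0.4 ms network round trip per individual Redis push
batch = 1_000    # perform_bulk flushes jobs in batches of up to 1,000

individual_total = jobs * rtt            # one round trip per job
bulk_total       = (jobs / batch) * rtt  # one pipelined round trip per batch

puts "individual: #{individual_total.round(2)}s"  # 40.0s
puts "bulk:       #{bulk_total.round(2)}s"        # 0.04s
```

Under these assumptions the payload serialization work is unchanged; what collapses is the count of network round trips, from 100,000 to 100.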

The Results

The impact was immediate and dramatic. Parent job execution times dropped to just a few seconds, even for large batches. The 99th percentile went from 40 seconds down to under 5 seconds.


This shows the results of our incremental optimizations:

More importantly, the job now always finishes gracefully during a Sidekiq-initiated shutdown. No more interrupted enqueuing, no more orphaned work items, no more race conditions.

The overall time for job processing was reduced significantly, allowing for more efficient use of the cluster:

Lessons Learned

  1. Measure first, optimize second:  Premature optimization is still the root of at least some evil. Our goal was to run the task in under 20 seconds so that Sidekiq would not interrupt it. If our first optimization had gotten us there, we would not have needed perform_bulk. And perform_bulk is not a slam dunk: since all of the job arguments are marshaled at once, it can overwhelm your Redis instance if it is already running high on memory.
  2. Deep dive when the situation demands it: perform_bulk has been in Sidekiq for years, but I'd never needed it until this specific use case pushed me to look deeper. Where else might we improve silent inefficiencies with this technique? Time will tell.
  3. Network calls are expensive: The difference between 1,000 individual Redis calls and one pipelined bulk operation was the difference between 40 seconds and 3 seconds.
  4. Graceful shutdowns matter: Taking the time to handle shutdowns properly means deployments are smoother and data integrity is maintained.

Conclusion

What started as a critical bug during deployments became an opportunity to understand Sidekiq's internals more deeply. The journey from "jobs getting killed" to "graceful shutdowns every time" involved measuring performance, understanding bottlenecks, and discovering the right tool for the job.

If you're enqueuing large numbers of child jobs from a parent job, perform_bulk may just be the ticket.

Thursday, November 13, 2025

Death-Defying Sidekiq Jobs


As promised in my earlier post, I'm thrilled to announce that the changes to prevent Sidekiq job termination have been successfully deployed, and the results look promising!

But before I get ahead of myself, let me break down the problem again. (If you haven't read the previous posts, you might want to check them out for context.)

The Problem

  1. We have a parent job that spawns child jobs for mileage calculation for each user
  2. The parent job runs longer than 30 seconds and occasionally gets killed by Sidekiq
  3. Why does this happen? Sidekiq restarts every time we deploy new code (several times a day—we are a startup, after all!). Auto-scaling rules on the cluster can also reboot Sidekiq
  4. Generally, this parent job is idempotent when interrupted during the time series iteration (where 99% of the time is spent), so it doesn't usually cause data corruption—just an annoying inefficiency
  5. In the unlucky 1% of cases, we could spawn two jobs for each user, causing each to compute mileage independently and doubling the count
  6. We can't handle concurrent invocations (which happen at the end of an outage) because it's hard to differentiate between a scheduled invocation and one triggered by a service restart

The Solution (Deployed Methodically)

First, I tackled these steps:

  1. Deployed metrics to track how long the parent job takes. We now have over a day's worth of data. Notice it takes way longer than 30 seconds—if our new approach succeeds, this graph should flatten out in the coming days
  2. Deployed code that builds a parallel data structure to hold driver IDs
  3. Tested to ensure both the old and new approaches return the same set of users

Testing Challenges

Step #3 proved harder than expected. Testing against a live system means the numbers never match exactly. I wrote code to examine the differences, built a hypothesis about why/how the numbers would differ, and tested it against the data.

users = []
GeoTimeseries.iterate do |user_id|
  users << user_id if GeoTimeseries.recently_driven?(user_id) 
end
orig_set = Set.new(users)

current_time = Time.current
new_set = 
  6.times.reduce(Set.new) do |user_ids, i|
    bucket_time = current_time - (i * Geo::LastHourSink::BUCKET_DURATION)
    bucket_key = Geo::LastHourSink.bucket_key_for(bucket_time)
    members = $redis_aws.smembers(bucket_key).map(&:to_i)
    user_ids.merge(members)
  end

Analyzing the Differences

To understand the discrepancies:

  • orig_set - new_set shows users our new technique missed
  • new_set - orig_set shows users who appear with the new technique but were absent before

Users We Missed (orig_set - new_set)

Spot-checking the last timestamp of several users showed they'd last driven slightly over an hour ago. This makes sense—our new technique runs about a minute after the time series iteration, by which point we'd already expired some early drivers.

Running the time delta across the complete set revealed two patterns:

  1. Users who stopped driving slightly before the 1-hour mark
  2. Users who started driving a few seconds ago

    I hypothesized that users who hadn't driven for the past hour must have just started driving. If correct, these users should now be present in our new data structure—which I validated.

New Drivers (new_set - orig_set)

Everyone in this set had just started driving, so it made sense we missed them during the iteration that happened a minute earlier. (The second column of the screenshot shows how long they had been driving: mostly under 60 seconds.)


With these validations complete, I'm confident in the new approach. Stay tuned for follow-up metrics showing the flattened execution times!

Friday, November 07, 2025

When Your Fix Becomes the Problem: A Tale of AWS Outages, Redis Flags, and Performance Scaling



The Original Problem: AWS Outage Chaos

During the recent Oct 20th 2025 AWS outage, our team discovered an uncomfortable truth about our scheduled jobs. We had jobs configured to run exactly once per schedule via AWS EventBridge Scheduler. Simple enough, right?

Wrong.

When AWS came back online after an extended outage, EventBridge released a flood of queued job triggers that had accumulated during the downtime. Our "run once" job suddenly ran multiple times in rapid succession, causing data inconsistencies and duplicate operations. Check out my previous post on how we recovered user data after the outage here.

The Solution That Worked (Too Well)

The fix seemed straightforward: implement a Redis-based distributed lock to prevent concurrent executions. Before each job execution, we'd set a flag in Redis. If the flag was already set, the job would recognize a concurrent execution was in progress and gracefully bail out.

def perform
  return if concurrent_execution_detected?
  
  set_execution_flag
  
  begin
    # Iterate over driver TimeSeries data to find active drivers
    process_active_drivers
  ensure
    clear_execution_flag
  end
end

def concurrent_execution_detected?
  !REDIS.set("job:#{job_id}:running", "1", nx: true, ex: 300)
end

We deployed this with confidence. Problem solved!

The Problem With the Solution

Except... it wasn't.

Shortly after deployment, we noticed something odd: some scheduled slots had no job execution at all. The job simply didn't run when it was supposed to. This was arguably worse than running multiple times—at least duplicate runs were noisy and obvious.

The Real Culprit: Death by a Thousand Drivers

After digging through logs and tracing job lifecycles, we found the smoking gun: Sidekiq's graceful shutdown mechanism combined with our job's growing execution time.

Here's what was happening:

  1. A scheduled job starts executing
  2. The job iterates over TimeSeries data for all our drivers' geospatial data
  3. Kubernetes scales down our Sidekiq cluster (or a pod gets replaced during deployment)
  4. Sidekiq begins its graceful shutdown, giving jobs 30 seconds to complete
  5. Our job takes longer than 30 seconds (sometimes over a minute!)
  6. Sidekiq hard-kills the job
  7. The Redis flag remains set (because the ensure block never runs)
  8. Sidekiq automatically retries the job on another worker
  9. The retry sees the Redis flag and thinks "concurrent execution detected!"
  10. The retry bails out immediately
  11. No job completes for that scheduled slot

The Hidden Performance Regression

What made this particularly insidious was that our job used to be fast. When we first launched, iterating through driver TimeSeries data took milliseconds. But as our traffic surged and our driver count grew, the Redis keyspace for the TimeSeries structure expanded significantly.

What was once a quick scan became a slow crawl through thousands of driver records, filtering for those who had driven in the last hour. We only actually needed the geospatial data from the last 10 minutes, but we were scanning everything.

The job had slowly, imperceptibly degraded from sub-second execution to over a minute—crossing that critical 30-second Sidekiq shutdown threshold.

The Real Fix: Performance First, Then Locking

We realized the Redis lock wasn't wrong—it was just unable to work correctly with a slow job. The real problem was that we couldn't distinguish between two scenarios:

  1. Truly concurrent jobs (from AWS outage flooding) → Should be blocked
  2. Retry after Sidekiq kill (legitimate recovery) → Should proceed

When the job took 60+ seconds, Sidekiq would kill it and spawn a retry. But the Redis lock was still held, so the retry would see it as a concurrent execution and bail out. The lock was working as designed; the job was just too slow to survive Sidekiq's shutdown process.

The solution wasn't to remove the lock—we still need it to handle AWS outage scenarios. The solution was to make the job fast enough that it would never be killed mid-execution.

The Performance Bottleneck

Our original implementation looked something like this:

def process_active_drivers
  all_drivers = Driver.all

  all_drivers.each do |driver|
    # Fetch and scan the entire TimeSeries for this driver
    # (deserialize_location is illustrative; entries carry a timestamp)
    timeseries = REDIS.zrange("driver:#{driver.id}:locations", 0, -1)
                      .map { |raw| deserialize_location(raw) }

    # Filter for entries from the last hour
    recent_locations = timeseries.select do |entry|
      entry.timestamp > 1.hour.ago
    end

    # We only needed the last 10 minutes anyway!
    process_recent_activity(recent_locations.select { |e| e.timestamp > 10.minutes.ago })
  end
end

This meant:

  • Fetching potentially thousands of driver records from the database
  • For each driver, pulling their entire geospatial TimeSeries from Redis
  • Filtering in Ruby to find recent activity
  • All to find maybe a few dozen drivers who were actually active

The Solution: An Active Driver Index

Instead of scanning all drivers and their complete history, we built a lightweight index structure in Redis that tracked only the drivers who had been active in the last hour:

# When a driver's location is recorded (happens frequently)
def record_driver_location(driver_id, location_data)
  timestamp = Time.now.to_i

  # Store in the main TimeSeries (as before)
  REDIS.zadd("driver:#{driver_id}:locations", timestamp, location_data)

  # NEW: Add driver to the active drivers set, scored by last-seen time
  REDIS.zadd("active_drivers", timestamp, driver_id)

  # Clean up entries older than 1 hour
  REDIS.zremrangebyscore("active_drivers", 0, 1.hour.ago.to_i)
end

Now our scheduled job became:

def process_active_drivers
  # Get only drivers active in the last hour (sorted set query is O(log N))
  cutoff = 1.hour.ago.to_i
  active_driver_ids = REDIS.zrangebyscore("active_drivers", cutoff, "+inf")
  
  # Only fetch the data we need
  active_driver_ids.each do |driver_id|
    # Get just the last 10 minutes of data using ZRANGEBYSCORE
    recent_locations = REDIS.zrangebyscore(
      "driver:#{driver_id}:locations",
      10.minutes.ago.to_i,
      "+inf"
    )
    
    process_recent_activity(recent_locations)
  end
end

The Expected Results

Based on benchmarking, the performance improvement should be dramatic:

  • Before: 60+ seconds (and growing with scale)
  • After: <1 second consistently (in testing)

By maintaining a parallel index of active drivers, we:

  • Eliminated the need to scan all drivers
  • Eliminated the need to fetch and filter complete TimeSeries data
  • Reduced the job from O(N×M) to O(A×K) where A is active drivers (tiny compared to N) and K is recent locations per driver

If the benchmarks hold in production, with the job completing in under a second:

  • Sidekiq's 30-second shutdown window will no longer be a concern
  • The Redis lock will finally work as intended—preventing duplicate jobs from AWS outages without blocking legitimate retries
  • We can distinguish between truly concurrent jobs (which should be blocked) and retry jobs (which should proceed)

Update: I'll be deploying this solution soon and will follow up with a part 2 covering the actual production results and any surprises we encounter along the way.

Lessons Learned

  1. Performance problems masquerade as concurrency problems - Our Redis lock was correct, but it couldn't work with a job that took longer than Sidekiq's shutdown window. We couldn't distinguish between "truly concurrent" and "legitimate retry."
  2. What works at 10x doesn't work at 100x - Our original implementation was fine for dozens of drivers. With thousands, it became a bottleneck that made our concurrency control unworkable.
  3. Maintain the right indices - Scanning complete datasets to find recent activity is a code smell. Build lightweight indices that track what you actually need.
  4. Use Redis data structures wisely - Sorted sets (ZSET) with time-based scores are perfect for "recently active" tracking with automatic time-based filtering.
  5. Measure, don't assume - We didn't notice the job slowing down because it happened gradually. Better monitoring would have caught this before it became critical.
  6. Fix root causes, not symptoms - The Redis lock wasn't the problem—it was exactly what we needed for AWS outages. The problem was the job being too slow to work with the lock correctly.

The Architecture Pattern

This pattern of maintaining a "recently active" index alongside your main data structure is broadly applicable:

# Pattern: Active Entity Index
# Main data: Complete history for each entity
# Index: Set of entities active in a time window

class ActivityTracker
  def initialize(entity_type)
    @entity_type = entity_type
  end

  def record_activity(entity_id, data)
    timestamp = Time.now.to_i

    # Store complete data
    REDIS.zadd("#{@entity_type}:#{entity_id}:history", timestamp, data)

    # Update active index
    REDIS.zadd("active_#{@entity_type}", timestamp, entity_id)

    # Periodic cleanup (or use Redis expiry)
    cleanup_old_entries if rand < 0.01
  end

  def get_recently_active(time_window = 1.hour)
    cutoff = time_window.ago.to_i
    REDIS.zrangebyscore("active_#{@entity_type}", cutoff, "+inf")
  end

  private

  # Drop index entries older than the tracking window
  def cleanup_old_entries
    REDIS.zremrangebyscore("active_#{@entity_type}", 0, 1.hour.ago.to_i)
  end
end

# Usage:
#   tracker = ActivityTracker.new("driver")
#   tracker.record_activity(driver_id, location_data)
#   tracker.get_recently_active(10.minutes)

This trades a small amount of additional write overhead for massive read performance gains when you need to find "what's active right now?"

Conclusion

Distributed systems problems often look like they require coordination primitives when they really require performance optimization. Our Redis lock was the right solution for preventing duplicate jobs during AWS outages—but it could only work correctly once the job was fast enough to complete before Sidekiq's shutdown timeout.

The key insight: you can't distinguish between concurrent execution and legitimate retry if your job doesn't finish before the system kills it. By making the job 60× faster, we enabled our concurrency control to work as designed.

Sometimes the best fix for a distributed systems problem isn't better coordination—it's making operations fast enough that edge cases become rare and recoverable.

Sunday, October 26, 2025

When AWS Went Down, Our Users Didn’t Lose Their Miles

On Oct 20, 2025 UTC, AWS experienced a significant regional service disruption that affected several of our core components — specifically the EventBridge scheduling layer that powers our mileage pipeline.

For several hours, our data ingestion flow couldn’t persist new trip events from the Redis TimeSeries. This temporarily paused mileage calculations, leading to incorrect user summaries in the app.

But here’s what didn’t happen:
We didn’t lose a single record, and no user lost a mile.


How Our System Works

Our pipeline captures anonymized telemetry events from users’ mobile devices, processes them through AWS infrastructure, and stores aggregated trip summaries in a relational database.

Data Flow Diagram
Data Flow: Mobile App → Ingestion API → Redis TimeSeries → EventBridge Scheduler → RDS

Event flow:

  • Mobile App captures GPS telemetry and trip start/stop events.
  • Ingestion API authenticates and sanitizes the data before writing to Redis TimeSeries.
  • Redis TimeSeries stores short-term data points with fine-grained timestamps for quick replay (a background job backs up the stream to S3 for longer term storage).
  • EventBridge Scheduler triggers aggregation and processing jobs every few minutes.
  • RDS stores validated, aggregated trip records for long-term analytics and reporting.
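To make the scheduler's role concrete, here is a toy, dependency-free sketch of the incremental aggregation step it drives (all names here are hypothetical, not our production code): each run folds the points newer than a checkpoint into per-user totals, then advances the checkpoint.

```ruby
# Toy sketch of the incremental aggregation step (hypothetical names):
# fold points newer than the checkpoint into per-user mileage totals,
# then advance the checkpoint so the next run picks up where we left off.
def aggregate_window(points, totals, checkpoint)
  fresh = points.select { |p| p[:ts] > checkpoint }
  fresh.each do |p|
    totals[p[:user_id]] = totals.fetch(p[:user_id], 0.0) + p[:miles]
  end
  new_checkpoint = fresh.map { |p| p[:ts] }.max || checkpoint
  [totals, new_checkpoint]
end
```

Because the checkpoint only advances past points that have been folded in, re-running the job with the same checkpoint is safe: already-processed points are skipped.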

What Happened During the Outage

When AWS degraded services in our primary region (us-east-1), the EventBridge Scheduler stopped firing events from 6:50 to 10:07 UTC, halting our pipeline mid-stream. While we were still capturing and ingesting the user geo data, the Redis TimeSeries was not being processed, because the aggregation job did not get scheduled for a little over 3 hours.

Remarkably, the Redis TimeSeries held up. Our Redis deployment is with Redis Enterprise, but the instances are hosted in the AWS cloud. Even though Redis Enterprise noted that some customers would be impacted, we did not see significant degradation.


Understanding the AWS architecture helps explain why this was the case.

AWS separates each service into a control plane and a data plane. The control plane handles resource monitoring, allocation, and scaling; once resources have been allocated, the data plane takes over. By design, the data plane is the more reliable and resilient half, while the control plane is the part that tends to fail during large incidents.

Here it is from the horse's mouth:

Since we provision our Redis cluster for expected usage, manually adding nodes/memory as our volumes increase, we were not relying on the AWS control plane for scaling, and our instances continued to hum. (We saw the same with RDS: it is customary to provision RDS for your needs and perform upgrades manually as traffic increases.)

This was not the case for our web/job servers, which were configured with auto-scaling rules. We had set a lower limit on the number of machines for each cluster, and we ran hard at that reduced capacity until recovery.

When services recovered, we started processing events from the Timeseries, creating trips for users. But since we generate incremental trips for the last few minutes, we were still missing trips for the last 3 hours and 7 minutes.

We could easily tell how many trips we missed, as we track this closely with a CloudWatch metric. Each bar shows a completion of the job responsible for incrementally processing the timeseries.

  

When services recovered, EventBridge Scheduler fired all the events in the backlog.



This caused a different problem, because our trips processor was designed to handle the time series data in real time. We did not anticipate serving more than a single event during a ten-minute window. So we got 21 late-fired triggers but could effectively process just one, covering the last ten minutes. More on this later!

The critical task was to update the user data for the missing three hours. I had written a script to patch trips during an earlier, less severe outage (5 minutes). With some minor modifications to account for the partial data toward the tail, I was able to correct mileage for all our users who happened to be driving during the outage (luckily they were not in self-driving cars powered by AWS. Ok, bad joke.)

There was still something I couldn't explain: CloudWatch showed ~2,000 jobs completed after jobs started flowing again. I expected 21 jobs, so I was puzzled by the much larger volume that ran at the tail. Where did those come from, and would they cause a different type of miscalculation? Indeed, some interesting things did take place with those 21 EventBridge triggers; let me explain.

When a trigger fires on the tenth minute, we launch a job per user who has likely been driving recently. These jobs run concurrently, and we need to track the completion of the last job to know that all users have been processed and the window can be marked "complete".

This is done with a Redis Set that keeps track of users who are still being processed. When the trigger fires, it first determines all recent drivers and adds them to the set before spawning a worker per user. Each worker then removes its user from the set, and if it removed the last item, it notifies the completion of the run.
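A minimal sketch of this pattern, assuming a redis-rb style client (key names are illustrative, not our production ones):

```ruby
# Trigger side: seed the pending set, then fan out one worker per driver.
def start_window(redis, window_id, driver_ids)
  redis.sadd("window:#{window_id}:pending", driver_ids)
  # ...enqueue one worker per driver here (e.g. a Sidekiq job)...
end

# Worker side: remove your user from the set; the worker that empties it
# is the one that marks the window complete.
def finish_driver(redis, window_id, driver_id)
  redis.srem("window:#{window_id}:pending", driver_id)
  redis.scard("window:#{window_id}:pending").zero?
end
```

Note that nothing is atomic across the srem and scard calls, so concurrent duplicate triggers seeding and draining the same set can race.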


When 21 triggers fired in rapid succession, they all forked a job per user, resulting in many workers racing to compute the same job and many workers hitting an empty set. And of course this meant that these jobs over-counted miles for the drivers in that time window.

So the last data cleanup was to figure out where we had added extra miles to users near the end of the outage. I first thought this might be really hard, since we had already updated these records, which then kept getting updated further as the users kept driving. But fortunately, we store both start and end times for each trip in the record, so it was possible to compute the miles driven in this specific range for each user from the raw timeseries data.

To verify that we had been over-counting, I queried for records in descending order of speed. And I saw speeds of 600 mph, which confirmed the hypothesis quite fast (no pun intended).

I could reuse a method from the earlier script for patching the data, and wrote a bit of code for the update. So finally, after a very long day, our users' data was fully corrected.

Improvements made:

We are improving how we handle the tail of an outage going forward. The idea is to let no more than a single processor run in a given time window. This can be done with a simple Redis SET command with the "NX" option, which sets a flag (the lock) only if it is not already set, thus guaranteeing that only a single process can acquire it. We set the TTL below the time window (7 minutes in this case) so that the lock naturally expires before the next trigger.
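A minimal sketch of that lock, assuming a redis-rb client (the key name and TTL are illustrative):

```ruby
# Returns true only for the first caller in a given window. NX means
# "set only if the key is absent"; EX expires the lock after 7 minutes,
# safely below the 10-minute trigger interval.
def acquire_window_lock(redis, window_start)
  redis.set("trip_processor_lock:#{window_start}", "1", nx: true, ex: 7 * 60)
end
```

The first trigger for a window gets true and proceeds; any late-fired duplicates get false and exit immediately.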

Our Approach: Fairness and Fidelity

Our principle is simple: if you drove it, we count it.

We don’t approximate, and we don’t drop data due to transient infrastructure issues. Each pipeline component is designed for eventual consistency and idempotent replay, so every record can be reconstructed safely and accurately.
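As a toy illustration of what idempotent replay means here (hypothetical names, plain Ruby): the aggregate for a window is always recomputed from the raw points and written under a deterministic key, so replaying the same window any number of times converges to the same state.

```ruby
# Recompute the window's mileage from raw [timestamp, miles] points and
# overwrite the summary keyed by (user_id, window_start); never increment,
# so replays cannot double-count.
def replay_window(raw_points, summaries, user_id, window_start, window_end)
  miles = raw_points
          .select { |ts, _| ts >= window_start && ts < window_end }
          .sum { |_, m| m }
  summaries[[user_id, window_start]] = miles
  summaries
end
```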


What We’re Building Next

Resilience isn’t just uptime — it’s graceful recovery. We’re implementing several next steps to strengthen this architecture:

  • Buffer device data: As users pass through areas with low mobile reception, buffer the location data on the device and deliver it when reception improves.
  • Adjust inaccurate device signals: Use techniques like Kalman filtering to correct the location for high fidelity when the device accuracy is low.
  • De-couple RDS from real-time updates: Store running trips in Redis, with archiving to take place later. This makes us resilient in the event that RDS is unresponsive, since we only need it at the later archiving step.
  • Monitoring for anomalies: Add speed as a tracked metric and alert above 200 mph.
  • Chaos testing & fault injection: Monthly simulated outages to validate that our recovery flow remains reliable.

What It Means for Our Users

When something breaks in the cloud, we don’t panic — we verify, replay, and reconcile.
Because behind every data point is a real person trusting us to get it right.

Outages happen, but trust shouldn’t. And that’s what we’re building for.


Posted by [Thushara Wijeratna], [Head of Eng] at [WorkSolo]

Thursday, October 16, 2025

Rails 8: can't cast RSpec::Mocks::Double

 One of the first unit test failures I encountered on a Rails 8.0 upgrade was:

    can't cast RSpec::Mocks::Double

The error happens on saving an ActiveRecord object to the database.


# connected          :boolean   default(FALSE)
# last_connected_at  :datetime

class LymoAccount < ApplicationRecord
  after_commit :update_last_connected_at, if: :saved_change_to_connected?

  def update_last_connected_at
    update!(last_connected_at: Time.current) if connected?
  end
end

Turns out, this is related to stricter validation in Active Record, which refuses to save a mock object.

Generally speaking, you should be using FactoryBot methods to create real ActiveRecord objects from your unit tests. And we were. So it puzzled me why we would get this error, as it did not seem like we were storing anything that was a mock.

ChatGPT got pretty confused as well. Because the exception was thrown from an after_commit hook, its assumption was that attributes already set on the model were being re-serialized, and that this was causing the issue.

We went through a listing of all the attributes of the record, examining their type (class) and none of them was a mock. 

This was the point when I gave up on Gen AI and took another look at the model.

I quickly ruled out any connection to the connected attribute we were checking, by updating the test to save without the check. It didn't help, so I knew that the update itself was throwing.

Then I wondered if updating any column would throw the same error.

I updated a text column, and it worked - progress. Now I knew this might have something to do with either:

1. the specific column - maybe it has some validation that is behaving weird

2. the column type - maybe datetime columns behave differently with updates

Luckily there were a few other datetime columns, and when I updated any one of them with Time.current, they threw the same error. We were getting somewhere.

On a hunch that this had to do with Time.current, I took a look at the unit test to see if Time was being mocked. Normally we use Timecop as the best practice, but it was always possible that an engineer had decided to be "creative".

This is what I found on the unit test:


  let(:data_sync_at) { Time.current.to_i }
  before do
    allow(Time).to receive_message_chain(:current, :to_i).and_return(data_sync_at)
  end
What this does is stub the intermediate Time.current call, and it is not at all obvious! So when we used Time.current to update the record, ActiveRecord was handed a mock object and the save failed.
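A plain-Ruby sketch of what that receive_message_chain stub effectively does (simplified; Time.current is ActiveSupport's, and the real RSpec machinery differs): Time.current is replaced by a stand-in object whose only job is to answer to_i.

```ruby
# Simplified stand-in for the RSpec double (illustrative, not RSpec itself)
stub = Object.new
stub.define_singleton_method(:to_i) { 1_760_000_000 }  # hypothetical epoch value
Time.define_singleton_method(:current) { stub }

Time.current.to_i  # fine: the stand-in answers to_i
# Anything else called on Time.current fails, because the value is not a
# real Time; ActiveRecord's datetime cast hits exactly this and raises
# "can't cast" on the double.
```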

A simple Timecop.travel(data_sync_at) fixed the issue.


Here is the Claude transcript, which happened later, when I was better able to guide the conversation along a more productive route!

Debugging Rails RSpec Error: RSpec::Mocks::Double Casting Issue

Q: On a Rails model, why would this code throw an error on a unit test saying: can't cast RSpec::Mocks::Double?
def update_last_connected_at
    update!(last_connected_at: Time.current) if connected?
end
Answer: The error occurs because connected? is returning an RSpec mock double instead of a boolean value, and Rails is trying to cast it to a boolean in the if conditional.

Common causes:

  • Unmocked association — connected? might depend on an association that's mocked but not configured properly
  • Missing stub — The connected? method itself might be mocked but not given a return value

Solutions:

1. Stub the method explicitly with a boolean:

allow(model).to receive(:connected?).and_return(true)

2. Build a proper test object instead of relying on mocks:

let(:model) { build(:your_model, connected: true) }
Q: Happens without connected?
Answer: If you're getting the error without even calling connected?, the problem is likely that update! itself is being called on a mock.

Most common cause: You're probably testing this method on a mocked or partially mocked model instance. When you stub the model without fully setting it up, calling update! on it returns a double instead of actually updating the record.

Solutions:

1. Use create or build instead of double:

let(:model) { create(:your_model) }
model.update_last_connected_at

2. If you must use a double, stub update!:

let(:model) { double(:your_model) }
allow(model).to receive(:update!).and_return(true)
Q: Model is not stubbed, it is a FactoryBot created one
Answer: If it's a real FactoryBot instance, then the issue is different. The error suggests that update! is being called but something in the conditional logic is going wrong.

Most likely causes:

  • connected? method doesn't exist or returns a mock
  • The model instance itself is getting wrapped in a double
  • Database transaction/rollback issues

Debug steps: Add some debugging to see what's actually happening:

def update_last_connected_at
    puts "connected? returns: #{connected?.inspect}"
    puts "connected?.class: #{connected?.class}"
    update!(last_connected_at: Time.current) if connected?
end
Q: Connected is a column in the table (model). Here's my test and model:
let(:data_sync_at) { Time.current.to_i }
allow(Time).to receive_message_chain(:current, :to_i).and_return(data_sync_at)
// call method that ends up saving the record

# Model:
after_commit :update_last_connected_at

def update_last_connected_at     
  update!(last_connected_at: Time.current)   
end
Answer: Ah, there's the issue! You have an infinite loop of callbacks:
  1. update! is called in update_last_connected_at
  2. update! triggers after_commit
  3. after_commit calls update_last_connected_at again
  4. Which calls update! again... and so on

This creates a recursive callback that RSpec is having trouble with, and the mock you set up on Time.current might be interfering.

The real problem: You're missing the connected? check that should prevent this recursion:

def update_last_connected_at
    update!(last_connected_at: Time.current) if connected?
end

Or use update_column instead (skips callbacks):

def update_last_connected_at
    update_column(:last_connected_at, Time.current) if connected?
end
Q: My after_commit is like this so should not recurse:
after_commit :update_last_connected_at, if: :saved_change_to_connected?
Answer: Good catch! With that guard clause, it shouldn't recurse. But there's still an issue:

The problem: When update_last_connected_at calls update!, it triggers another after_commit. Even though the guard if: :saved_change_to_connected? prevents it from running again, RSpec is still having issues.

The real culprit is likely your Time mock:

allow(Time).to receive_message_chain(:current, :to_i).and_return(data_sync_at)

This is mocking Time.current in a way that might be interfering with the after_commit callback execution in your test. When you call update!(last_connected_at: Time.current) inside the callback, the mocked Time.current might be returning a double.

Try this instead:

Option 1: Mock just the .to_i part, not the whole chain:

allow_any_instance_of(Time).to receive(:to_i).and_return(data_sync_at)

Option 2: Use travel_to:

travel_to(Time.at(data_sync_at)) do
  # your test code
end

Option 3: Simplify by not mocking Time at all:

let(:data_sync_at) { Time.current.to_i }
# Just use the actual Time.current in your model