Thursday, November 13, 2025

Death-defying Sidekiq jobs


As promised in my earlier post, I'm thrilled to announce that the changes to prevent Sidekiq job termination have been successfully deployed, and the results look promising!

But before I get ahead of myself, let me break down the problem again. (If you haven't read the previous posts, you might want to check them out for context.)

The Problem

  1. We have a parent job that spawns a child job per user to calculate mileage (see the sketch after this list)
  2. The parent job runs longer than 30 seconds and occasionally gets killed by Sidekiq
  3. Why does this happen? Sidekiq restarts every time we deploy new code, which is several times a day (we are a startup, after all!). Auto-scaling rules on the cluster can also reboot Sidekiq
  4. Generally, this parent job is idempotent when interrupted during the time series iteration (where 99% of the time is spent), so an interruption doesn't usually cause data corruption, just an annoying inefficiency
  5. In the unlucky 1% of cases, we could spawn two child jobs for the same user, each computing mileage independently and doubling the count
  6. We can't handle concurrent invocations of the parent job (which happen at the end of an outage) because it's hard to differentiate between a scheduled invocation and one triggered by a service restart
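To make the setup concrete, here's a minimal sketch of how I think of the parent/child layout. The class names (MileageParentJob, MileageChildJob) are made up for illustration; only GeoTimeseries.iterate and GeoTimeseries.recently_driven? come from the real code shown later in this post.

# Illustrative sketch only; class names are hypothetical.
class MileageParentJob
  include Sidekiq::Job

  def perform
    # Walking the geo time series is where ~99% of the time goes, so this
    # is where a deploy-triggered Sidekiq restart usually interrupts us.
    GeoTimeseries.iterate do |user_id|
      next unless GeoTimeseries.recently_driven?(user_id)
      MileageChildJob.perform_async(user_id)
    end
  end
end

class MileageChildJob
  include Sidekiq::Job

  def perform(user_id)
    # Computes mileage for a single user. If the parent runs twice
    # concurrently, this gets enqueued twice and the count doubles.
  end
end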

The Solution (Deployed Methodically)

First, I tackled these steps:

  1. Deployed metrics to track how long the parent job takes. We now have over a day's worth of data, and it shows the job takes way longer than 30 seconds. If the new approach succeeds, that graph should flatten out in the coming days
  2. Deployed code that builds a parallel data structure to hold driver IDs (sketched after this list)
  3. Tested to ensure both the old and new approaches return the same set of users
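Before getting into testing, here's roughly how I picture the parallel structure: as location points arrive, the driver's ID gets added to a Redis set keyed by the current time bucket, and old buckets simply expire. This is a sketch under assumptions; only bucket_key_for and BUCKET_DURATION are referenced by the real query code below, and even their internals here (key format, bucket size, expiry) are illustrative.

module Geo
  module LastHourSink
    # Assumed bucket size; six buckets cover roughly the last hour.
    BUCKET_DURATION = 10.minutes

    def self.bucket_key_for(time)
      "geo:last_hour_sink:#{time.to_i / BUCKET_DURATION.to_i}"
    end

    # Hypothetical writer, called whenever we ingest a point for a user.
    def self.record(user_id, time = Time.current)
      key = bucket_key_for(time)
      $redis_aws.sadd(key, user_id)
      $redis_aws.expire(key, 2.hours.to_i)
    end
  end
end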

Testing Challenges

Step #3 proved harder than expected. Testing against a live system means the numbers never match exactly. I wrote code to examine the differences, built a hypothesis about why and how the numbers would differ, and tested it against the data.

require "set"

# Old approach: walk the full geo time series and keep users who have
# driven recently. This is the slow iteration the parent job performs.
users = []
GeoTimeseries.iterate do |user_id|
  users << user_id if GeoTimeseries.recently_driven?(user_id)
end
orig_set = Set.new(users)

# New approach: union the per-bucket Redis sets. Six buckets of
# BUCKET_DURATION each cover roughly the last hour of drivers.
current_time = Time.current
new_set =
  6.times.reduce(Set.new) do |user_ids, i|
    bucket_time = current_time - (i * Geo::LastHourSink::BUCKET_DURATION)
    bucket_key = Geo::LastHourSink.bucket_key_for(bucket_time)
    members = $redis_aws.smembers(bucket_key).map(&:to_i)
    user_ids.merge(members)
  end

Analyzing the Differences

To understand the discrepancies:

  • orig_set - new_set shows users our new technique missed
  • new_set - orig_set shows users who appear with the new technique but were absent before

Users We Missed (orig_set - new_set)

Spot-checking the last timestamp of several users showed they'd last driven slightly over an hour ago. This makes sense: our new technique runs about a minute after the time series iteration, by which point we'd already expired some of the earliest drivers out of the one-hour window.

Running the time delta across the complete set (sketched below) revealed two patterns:

  1. Users who stopped driving slightly before the 1-hour mark
  2. Users who started driving a few seconds ago

    I hypothesized that the users in the second group hadn't driven at all during the previous hour and had only just started driving. If that's correct, these users should now be present in our new data structure, which is what I validated.
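For completeness, here's roughly the shape of that time-delta check. It's a sketch rather than the exact script I ran; last_driven_at is a hypothetical helper standing in for however you look up a user's most recent geo timestamp.

# Hypothetical helper: last_driven_at(user_id) returns the user's most
# recent geo timestamp.
missed = orig_set - new_set

deltas = missed.map do |user_id|
  [user_id, Time.current - last_driven_at(user_id)]
end

# Bucket the missed users into the two patterns described above.
stopped_just_over_an_hour_ago = deltas.select { |_, delta| delta > 1.hour.to_i }
just_started_driving          = deltas.select { |_, delta| delta < 60 }

# Validate the hypothesis: the just-started users should now show up in
# the freshest bucket of the new data structure.
current_key = Geo::LastHourSink.bucket_key_for(Time.current)
now_tracked = Set.new($redis_aws.smembers(current_key).map(&:to_i))
just_started_driving.all? { |user_id, _| now_tracked.include?(user_id) }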

New Drivers (new_set - orig_set)

Everyone in this set had just started driving, so it made sense we missed them during the iteration that happened a minute earlier.


With these validations complete, I'm confident in the new approach. Stay tuned for follow-up metrics showing the flattened execution times!
