As promised in my earlier post, I'm thrilled to announce that the changes to prevent Sidekiq job termination have been successfully deployed, and the results look promising!
But before I get ahead of myself, let me break down the problem again. (If you haven't read the previous posts, you might want to check them out for context.)
The Problem
- We have a parent job that spawns a mileage-calculation child job for each user (a rough sketch of both jobs follows this list)
- The parent job runs longer than 30 seconds and occasionally gets killed by Sidekiq
- Why does this happen? Sidekiq restarts every time we deploy new code (several times a day—we are a startup, after all!). Auto-scaling rules on the cluster can also reboot Sidekiq
- Generally, this parent job is idempotent when interrupted during the time series iteration (where 99% of the time is spent), so it doesn't usually cause data corruption—just an annoying inefficiency
- In the unlucky 1% of cases, we could spawn two jobs for each user, causing each to compute mileage independently and doubling the count
- We can't handle concurrent invocations (which happen at the end of an outage) because it's hard to differentiate between a scheduled invocation and one triggered by a service restart
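For readers who want code, here's roughly the shape of the two jobs. This is an illustrative sketch, not our production code: the class names are invented, though GeoTimeseries.iterate and recently_driven? are real and show up later in this post.

require "sidekiq"

# Hypothetical names; the structure mirrors the description above.
class MileageParentJob
  include Sidekiq::Job

  def perform
    # Walking the time series is where ~99% of the time goes, so this
    # is where Sidekiq restarts usually interrupt us.
    GeoTimeseries.iterate do |user_id|
      next unless GeoTimeseries.recently_driven?(user_id)
      MileageChildJob.perform_async(user_id)
    end
  end
end

class MileageChildJob
  include Sidekiq::Job

  def perform(user_id)
    # Computes mileage for one user. If the parent runs twice, a user
    # can be enqueued twice and their mileage gets double-counted.
  end
end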
The Solution (Deployed Methodically)
I tackled the rollout in three steps:
1. Deployed metrics to track how long the parent job takes. We now have over a day's worth of data, and the graph shows the job taking far longer than 30 seconds; if our new approach succeeds, it should flatten out in the coming days.
2. Deployed code that builds a parallel data structure holding the IDs of recently active drivers (a sketch of how such a sink might work follows this list).
3. Tested that the old and new approaches return the same set of users.
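Before getting to the testing, here's how I'd sketch the sink from step 2. Only bucket_key_for, BUCKET_DURATION, and the Redis sets appear in our real code below; the write path and the 10-minute bucket size are my assumptions for illustration.

module Geo
  class LastHourSink
    # Assumed: six 10-minute buckets cover the last hour (the
    # comparison code below reads exactly six buckets).
    BUCKET_DURATION = 600 # seconds

    def self.bucket_key_for(time)
      # One Redis set per bucket, keyed by the bucket's start time.
      "geo:last_hour:#{time.to_i - (time.to_i % BUCKET_DURATION)}"
    end

    # Assumed write path: called whenever we ingest a point for a user.
    def self.record(user_id, time = Time.current)
      key = bucket_key_for(time)
      $redis_aws.sadd(key, user_id)
      # Keep each bucket around a bit longer than the hour we look back over.
      $redis_aws.expire(key, BUCKET_DURATION * 8)
    end
  end
end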
Testing Challenges
Step 3 proved harder than expected. Testing against a live system means the numbers never match exactly, so I wrote code to examine the differences, formed a hypothesis about why the numbers would differ, and tested it against the data. Here's the comparison code:
require "set"

# Old approach: walk the full time series and collect recent drivers.
users = []
GeoTimeseries.iterate do |user_id|
  users << user_id if GeoTimeseries.recently_driven?(user_id)
end
orig_set = Set.new(users)

# New approach: union the six Redis set buckets covering the last hour.
current_time = Time.current
new_set =
  6.times.reduce(Set.new) do |user_ids, i|
    bucket_time = current_time - (i * Geo::LastHourSink::BUCKET_DURATION)
    bucket_key = Geo::LastHourSink.bucket_key_for(bucket_time)
    members = $redis_aws.smembers(bucket_key).map(&:to_i)
    user_ids.merge(members)
  end
Analyzing the Differences
To understand the discrepancies:
- orig_set - new_set shows users our new technique missed
- new_set - orig_set shows users who appear with the new technique but were absent before
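In code, with variable names of my own choosing for the checks below:

# Users the iteration found but the new buckets missed.
missed_users = orig_set - new_set
# Users the buckets caught but the iteration missed.
new_drivers = new_set - orig_set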
Users We Missed (orig_set - new_set)
Spot-checking the last timestamp of several users showed they'd last driven slightly over an hour ago. This makes sense—our new technique runs about a minute after the time series iteration, by which point we'd already expired some early drivers.
Running the time delta across the complete set revealed two patterns:
- Users who stopped driving slightly before the 1-hour mark
- Users who started driving a few seconds ago
I hypothesized that users who hadn't driven for the past hour must have just started driving. If correct, these users should now be present in our new data structure—which I validated.
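Here's a sketch of that check. last_driven_at is a hypothetical helper standing in for however you'd fetch a user's most recent time series point:

missed_users.each do |user_id|
  if last_driven_at(user_id) < 1.hour.ago
    # Stopped driving slightly before the 1-hour mark: their buckets
    # had already expired by the time we read them. Expected.
  else
    # Just started driving: they should be in the newest bucket by now.
    newest_bucket = Geo::LastHourSink.bucket_key_for(Time.current)
    unless $redis_aws.sismember(newest_bucket, user_id.to_s)
      raise "hypothesis failed for user #{user_id}"
    end
  end
end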
New Drivers (new_set - orig_set)
Everyone in this set had just started driving, so it made sense we missed them during the iteration that happened a minute earlier.
With these validations complete, I'm confident in the new approach. Stay tuned for follow-up metrics showing the flattened execution times!