Friday, November 28, 2025

Death-Defying Sidekiq Jobs: Part 2

In my previous post, I outlined the problem of parent jobs getting killed during Sidekiq shutdowns because they took too long to enqueue child jobs. We implemented a solution that used an active driver index instead of the expensive Redis iterator, but the story doesn't end there.

The Data Revealed More

After deploying the active driver index, I gathered metrics on the parent job execution times. The good news: runtime dropped significantly. The bad news: even with the new index, the higher percentile execution times still hovered around 40 seconds.


That 40-second ceiling was a problem. Sidekiq's shutdown grace period is 25 seconds by default, and while we could extend it, we'd just be postponing the inevitable. Jobs that take 40 seconds to enqueue children are still vulnerable to being killed mid-execution during deployments or restarts.
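For reference, that grace period can be raised with Sidekiq's `-t` command-line flag or the `:timeout:` key in its config file. A sketch (the value 40 is just what we'd have needed to cover our worst case, and your deployment platform must actually wait that long before sending SIGKILL):

```yaml
# config/sidekiq.yml -- raise the shutdown grace period (default: 25 seconds).
# This only helps if your orchestrator's own kill deadline is at least as long.
:timeout: 40
```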

Enter perform_bulk

The problem was that we had 100,000 jobs to push to Sidekiq, and while each individual push took only a fraction of a millisecond, the math adds up: we were soon waiting close to a minute for all the jobs to be sent. This is a common problem in I/O-bound systems, where a "bulk" operation usually comes to the rescue. It's the same idea as a database bulk insert: when we need to write a thousand records, the client sends a thousand prepared statements over a single connection, and the server (e.g. Postgres) executes them as one batch. A quick GenAI search turned up perform_bulk, a method designed for exactly this scenario in the Sidekiq world. Instead of enqueuing jobs one at a time, perform_bulk submits them to Sidekiq in batches of up to 1,000 jobs per Redis round-trip.

Here's what the refactored code looked like:

class ParentJob
  include Sidekiq::Job

  def perform(work_item_ids)
    # Prepare all job arguments
    job_args = work_item_ids.map { |id| [id] }
    
    ChildJob.perform_bulk(job_args)
  end
end

The key difference: perform_bulk pushes the jobs to Redis in pipelined batches of 1,000 rather than making an individual Redis call per job. This dramatically reduces the network overhead that was causing our bottleneck.
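The batching idea itself is simple. Here's a rough plain-Ruby sketch of it (not Sidekiq's actual implementation; `push_in_batches` is an illustrative name I made up):

```ruby
# Rough sketch of the batching idea behind perform_bulk: chunk the
# argument list and hand each chunk to a single pipelined push,
# instead of paying one network round-trip per job.
BATCH_SIZE = 1_000 # matches perform_bulk's default batch size

def push_in_batches(job_args, batch_size: BATCH_SIZE)
  job_args.each_slice(batch_size).map do |batch|
    # In Sidekiq, the whole batch goes out in one Redis call;
    # here we just hand it to the caller's block.
    yield batch
  end
end

args = (1..100_000).map { |id| [id] }
batch_sizes = push_in_batches(args) { |batch| batch.size }
# 100 pipelined calls instead of 100,000 individual ones:
puts batch_sizes.length # => 100
```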

The Results

The impact was immediate and dramatic. Parent job execution times dropped to just a few seconds, even for large batches. The 99th percentile went from 40 seconds down to under 5 seconds.


This shows the results of our incremental optimizations:

More importantly, the job now always finishes gracefully during a Sidekiq-initiated shutdown. No more interrupted enqueuing, no more orphaned work items, no more race conditions.

The overall time for job processing was reduced significantly, allowing for more efficient use of the cluster:

Lessons Learned

  1. Measure first, optimize second: Premature optimization is still the root of at least some evil. Our goal was to finish the task in under 20 seconds so it would not get interrupted by Sidekiq. If our first optimization had gotten us there, we would not have needed perform_bulk. And perform_bulk is not a slam dunk: because all of the job arguments are serialized at once, it can overwhelm your Redis instance if it is already running high on memory.
  2. Deep dive when the situation demands it: perform_bulk has been in Sidekiq for years, but I'd never needed it until this specific use case pushed me to look deeper. Where else might we fix silent inefficiencies with this technique? Time will tell.
  3. Network calls are expensive: The difference between making an individual Redis call per job and pushing them in pipelined batches was the difference between 40 seconds and 3 seconds.
  4. Graceful shutdowns matter: Taking the time to handle shutdowns properly means deployments are smoother and data integrity is maintained.
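Lesson 3 is easy to check with back-of-the-envelope arithmetic. Assuming a hypothetical ~0.4 ms Redis round-trip (the actual latency depends on your network), the totals line up with what we observed:

```ruby
# Back-of-the-envelope cost of network round-trips.
# The 0.4 ms round-trip time is an assumed figure for illustration.
round_trip_ms = 0.4
jobs          = 100_000
batch_size    = 1_000

one_call_per_job_s = jobs * round_trip_ms / 1000.0               # one round-trip per job
pipelined_s        = (jobs / batch_size) * round_trip_ms / 1000.0 # one per batch of 1,000

puts one_call_per_job_s # ~40 seconds of pure round-trip time
puts pipelined_s        # ~0.04 seconds (serialization and transfer add the rest)
```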

Conclusion

What started as a critical bug during deployments became an opportunity to understand Sidekiq's internals more deeply. The journey from "jobs getting killed" to "graceful shutdowns every time" involved measuring performance, understanding bottlenecks, and discovering the right tool for the job.

If you're enqueuing large numbers of child jobs from a parent job, perform_bulk may be just the ticket.
