How We Migrated Sidekiq's Redis Without Losing a Single Job (and Without Listening to AI)
We moved our Sidekiq backend from Redis Enterprise to AWS ElastiCache. The AI tools recommended a careful, expensive approach. We did something simpler — and it worked perfectly.
The Setup
Our app runs Sidekiq workers on ECS. Each process connects to Redis on startup to read and process jobs. We were moving from Redis Enterprise to ElastiCache — different host, different connection string, same protocol.
New jobs would start going to the new Redis as soon as we deployed. But existing jobs queued in the old Redis? They'd be orphaned the moment every worker switched over.
What the AI Tools Said
We asked around — Claude, ChatGPT, Gemini, Grok. They all landed in roughly the same place: stand up a fully parallel worker environment pointed at the new Redis, run both clusters side by side, and decommission the old one once it drained.
It's not wrong. But it's heavy. That approach meant new ECS task definitions, environment variable management across two sets of infra, coordinating the decommission, and extra cost while two clusters ran in parallel.
When we pushed back, one tool offered an alternative: run two Sidekiq processes per Docker container — one pointed at old Redis, one at new. That would have required changes to CloudFormation templates, process supervision config inside the container, and careful cleanup afterward. Trading one complex migration for another.
Both suggestions rested on the same warning: letting different processes talk to different Redis instances would be a "debugging nightmare." That scenario, it turns out, wouldn't actually happen.
The Actual Solution
Our team came up with something much simpler. In config/initializers/sidekiq.rb, at startup, each Sidekiq process decides which Redis to connect to. We added one line:
```ruby
# Coin toss at startup — connects this process to one Redis for its entire lifetime
redis_url = rand < 0.5 ? LYMO_SIDEKIQ_NEW_REDIS_URL : LYMO_SIDEKIQ_OLD_REDIS_URL
```
That's it. On startup, each worker tosses a coin. Heads → new ElastiCache. Tails → old Redis Enterprise.
The result: roughly half the cluster continued draining the old queue, while the other half processed new jobs on ElastiCache. No new infra. No task definition changes. No separate environment to coordinate.
We also pointed all job producers (the code that enqueues jobs) at the new Redis immediately. So new work only ever went to ElastiCache. The old Redis just needed to drain.
This is where Sidekiq's initializer structure becomes the key enabler. configure_server and configure_client can be wired separately, so the server (the side that reads jobs) uses the redis_url resolved at startup:
```ruby
redis_url = rand < 0.5 ? LYMO_SIDEKIQ_NEW_REDIS_URL : LYMO_SIDEKIQ_OLD_REDIS_URL

Sidekiq.configure_server do |config|
  # Reads: old or new Redis, depending on this process's coin toss
  config.redis = { url: redis_url }
end

Sidekiq.configure_client do |config|
  # Writes: always the new Redis
  config.redis = { url: LYMO_SIDEKIQ_NEW_REDIS_URL }
end
```
One coin toss. One URL to pull from. That process reads from the same Redis for its entire lifetime.
The clients (the side that pushes jobs) always use the new URL, while reads are split between the old and new Redis. Over time, the old queue drains because it receives no further jobs. The processes pointed at the old Redis were simply left to drain it, and as they cycled out on subsequent deploys, the cluster fully converged on the new setup with no intervention required.
How It Went
It worked exactly as expected. Within a day, roughly 90% of the old queue had drained naturally. Workers reading from old Redis gradually found less and less work, while ElastiCache handled all the new throughput.
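Watching the drain is mostly a matter of counting what's left on the old Redis. A minimal sketch (the helper name and queue list are ours, not from the migration): Sidekiq stores each queue as a `queue:<name>` list, and deferred jobs in the `schedule` and `retry` sorted sets.

```ruby
# Count jobs remaining on the old Redis: queued + scheduled + retries.
# `redis` is anything that answers llen/zcard, e.g. Redis.new(url: old_url)
# from the redis gem.
def jobs_remaining(redis, queues: %w[default])
  queued   = queues.sum { |q| redis.llen("queue:#{q}") }
  deferred = %w[schedule retry].sum { |set| redis.zcard(set) }
  queued + deferred
end
```

Run on a schedule (or just in a console loop), this gives a single number to watch trend toward the count of deferred jobs that will never drain on their own.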
The remaining jobs were a different story: scheduled jobs. These live in Sidekiq's sorted set and don't get picked up until their execution time arrives — which could be hours away. Waiting wasn't ideal, so we wrote a small script to move them from the old Redis to the new one manually. A few lines to iterate the scheduled (and retry) set, re-enqueue on ElastiCache, and delete from old Redis. Clean cutover.
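The exact script wasn't published, but the logic is short enough to sketch. The helper below is our illustration: it takes plain connection objects (anything answering `zrange`/`zadd`/`zrem`, such as `Redis.new` from the redis gem) and moves the members of Sidekiq's deferred sorted sets across, preserving each job's score.

```ruby
# Move Sidekiq's deferred jobs (scheduled + retries) from old Redis to new.
# Both live in sorted sets ("schedule" and "retry") scored by run-at time;
# re-adding each payload with its original score preserves the execution time.
def migrate_deferred_jobs(old_redis, new_redis, sets: %w[schedule retry])
  sets.each do |set|
    old_redis.zrange(set, 0, -1, with_scores: true).each do |payload, score|
      new_redis.zadd(set, score, payload)  # copy first...
      old_redis.zrem(set, payload)         # ...then delete from the old side
    end
  end
end
```

Copying before deleting means a crash mid-run leaves at worst a duplicate, never a lost job.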
Once that was done, we deployed the cleanup — removed the conditional and all references to the old Redis connection. Four lines of code deleted. Done.
Oh, and while all of this was happening? The rest of the team made a dozen normal deployments — which restarted servers, reshuffled which Redis each process landed on, and generally did everything the AI tools said would cause a debugging nightmare. Nothing broke. No jobs lost. The doom and gloom theories were disproven in the most practical way possible: by live testing.
Why the AI Advice Missed the Mark
The AI tools were technically cautious but operationally naive. They modeled the problem as "jobs are tied to a running process" — which isn't how Sidekiq works. Redis is the source of truth, not the worker. The worker is stateless.
They also defaulted to the safest, most conservative architecture: full environment isolation. That's sensible for high-stakes migrations. But for a queue drain, it's significant overengineering.
The human insight — the DB is external, the workers are stateless, so we can split them probabilistically — is the kind of lateral thinking that comes from actually understanding the system rather than pattern-matching to a template.
Takeaways
1. Sidekiq workers are stateless. Redis is the state. This gives you more migration flexibility than you might think.
2. Probabilistic splits are underrated. You don't always need clean cutoffs. A coin toss at startup is simple, observable, and reversible.
3. AI tools are good at safe answers, not always good at efficient ones. They'll often recommend the conservative solution even when a simpler one exists. Treat their output as a starting point, not a final answer.
4. The cleanup should be as simple as the migration. If your migration leaves behind complex infra, you've done too much. Ours cleaned up with four deleted lines.
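One footnote on the second takeaway: the coin toss is also a dial. A small generalization (the env var name here is our invention, not something from the migration) turns the fixed 50/50 split into a configurable ratio, which gives you a rollback lever and a way to complete the cutover without a code change:

```ruby
# Weighted version of the startup coin toss. A ratio of 0.0 sends every
# new process to the old Redis (full rollback); 1.0 completes the cutover.
def pick_redis_url(new_url, old_url, ratio: Float(ENV.fetch("NEW_REDIS_RATIO", "0.5")))
  rand < ratio ? new_url : old_url
end
```

Because the choice still happens once per process, redeploying after changing the ratio gradually reshuffles the fleet toward the new weighting.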