Thursday, March 26, 2026


Infrastructure · Redis · Sidekiq

How We Migrated Sidekiq's Redis Without Losing a Single Job (and Without Listening to AI)

Solo Engineering Team · March 2026 · 8 min read

We moved our Sidekiq backend from Redis Enterprise to AWS ElastiCache. The AI tools recommended a careful, expensive approach. We did something simpler — and it worked perfectly.

The Setup

Our app runs Sidekiq workers on ECS. Each process connects to Redis on startup to read and process jobs. We were moving from Redis Enterprise to ElastiCache — different host, different connection string, same protocol.

New jobs would start going to the new Redis as soon as we deployed. But existing jobs queued in the old Redis? They'd be orphaned the moment every worker switched over.

What the AI Tools Said

We asked around — Claude, ChatGPT, Gemini, Grok. They all landed in roughly the same place:

You should deploy a separate environment connected to the old Redis. Let it drain the queue over time, then decommission.

It's not wrong. But it's heavy. That approach meant new ECS task definitions, environment variable management across two sets of infra, coordinating the decommission, and extra cost while two clusters run in parallel.

When we pushed back, one tool offered an alternative: run two Sidekiq processes per Docker container — one pointed at old Redis, one at new. That would have required changes to CloudFormation templates, process supervision config inside the container, and careful cleanup afterward. Trading one complex migration for another.

But they missed something important: Sidekiq's backing store is completely external to the process. A job scheduled on Redis Enterprise doesn't belong to any particular Sidekiq process — it just sits there until a worker with a connection to that Redis comes along. The worker is stateless.

So the "debugging nightmare" scenario the AI tools described... wouldn't actually happen.

The Actual Solution

Our team came up with something much simpler. In config/initializers/sidekiq.rb, at startup, each Sidekiq process decides which Redis to connect to. We added one line:

config/initializers/sidekiq.rb — the one-liner
# Coin toss at startup — connects this process to one Redis for its entire lifetime
redis_url = rand < 0.5 ? LYMO_SIDEKIQ_NEW_REDIS_URL : LYMO_SIDEKIQ_OLD_REDIS_URL

That's it. On startup, each worker tosses a coin. Heads → new ElastiCache. Tails → old Redis Enterprise.

The result: roughly half the cluster continued draining the old queue, while the other half processed new jobs on ElastiCache. No new infra. No task definition changes. No separate environment to coordinate.

We also pointed all job producers (the code that enqueues jobs) at the new Redis immediately. So new work only ever went to ElastiCache. The old Redis just needed to drain.

This is where Sidekiq's initializer structure becomes the key enabler: configure_server and configure_client can be wired separately, so the server (the side that reads jobs) uses the redis_url resolved at startup while the client always pushes to the new Redis:

config/initializers/sidekiq.rb — full initializer
redis_url = rand < 0.5 ? LYMO_SIDEKIQ_NEW_REDIS_URL : LYMO_SIDEKIQ_OLD_REDIS_URL

Sidekiq.configure_server do |config|
  config.redis = { url: redis_url }
end

Sidekiq.configure_client do |config|
  config.redis = { url: LYMO_SIDEKIQ_NEW_REDIS_URL }
end

One coin toss. One URL to pull from. That process reads from the same Redis for its entire lifetime.

The clients (which push jobs) always use the new URL, while reads are split between the old and the new Redis. Over time the old queue drains, since no new jobs ever arrive there. Workers connected to the old Redis were simply left behind to drain it, and as they cycled out through normal deployments, the cluster converged on the new setup with no intervention required.
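Our migration used a fixed 50/50 split, but the same pattern generalizes. As a sketch (the NEW_REDIS_WEIGHT env var is hypothetical, and the URL values below are placeholders for the real constants), the weight could be tuned per deploy and set to 1.0 to complete the cutover without a code change:

```ruby
# Placeholder URLs; in the real initializer these constants already exist.
LYMO_SIDEKIQ_NEW_REDIS_URL = "redis://new-elasticache:6379/0"
LYMO_SIDEKIQ_OLD_REDIS_URL = "redis://old-enterprise:6379/0"

# NEW_REDIS_WEIGHT is a hypothetical env var: 0.5 gives an even split,
# 1.0 sends every restarted worker to the new Redis.
weight = Float(ENV.fetch("NEW_REDIS_WEIGHT", "0.5"))
redis_url = rand < weight ? LYMO_SIDEKIQ_NEW_REDIS_URL : LYMO_SIDEKIQ_OLD_REDIS_URL
```

Ratcheting the weight up across deploys would give the same graceful convergence, just faster.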

How It Went

It worked exactly as expected. Within a day, roughly 90% of the old queue had drained naturally. Workers reading from old Redis gradually found less and less work, while ElastiCache handled all the new throughput.

The remaining jobs were a different story: scheduled jobs. These live in Sidekiq's sorted set and don't get picked up until their execution time arrives — which could be hours away. Waiting wasn't ideal, so we wrote a small script to move them from the old Redis to the new one manually. A few lines to iterate the scheduled (and retry) set, re-enqueue on ElastiCache, and delete from old Redis. Clean cutover.
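This isn't our exact script, but a minimal sketch of the idea: Sidekiq keeps scheduled and retry jobs in its "schedule" and "retry" sorted sets, with the run-at timestamp as the score and the job JSON as the member, so moving them is a copy-then-delete over those sets. The helper below accepts any two Redis-like clients; in real use you'd pass Redis.new(url: ...) handles for the old and new instances:

```ruby
# Move Sidekiq's "schedule" and "retry" sorted sets from one Redis to
# another. Each entry is copied first and removed from the old Redis
# only after the copy, so a crash mid-run loses nothing.
def move_sorted_sets(old_redis, new_redis, sets: %w[schedule retry])
  moved = 0
  sets.each do |set|
    old_redis.zrange(set, 0, -1, with_scores: true).each do |payload, score|
      new_redis.zadd(set, score, payload) # copy to new Redis first...
      old_redis.zrem(set, payload)        # ...then delete from old
      moved += 1
    end
  end
  moved
end

# In real use (with the redis gem):
#   move_sorted_sets(Redis.new(url: OLD_URL), Redis.new(url: NEW_URL))
```

For very large sets you'd page through zrange in batches rather than pulling everything at once, but for a drained queue the leftover scheduled jobs are few.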

Once that was done, we deployed the cleanup — removed the conditional and all references to the old Redis connection. Four lines of code deleted. Done.

Oh, and while all of this was happening? The rest of the team made a dozen normal deployments — which restarted servers, reshuffled which Redis each process landed on, and generally did everything the AI tools said would cause a debugging nightmare. Nothing broke. No jobs lost. The doom and gloom theories were disproven in the most practical way possible: by live testing.

Why the AI Advice Missed the Mark

The AI tools were technically cautious but operationally naive. They modeled the problem as "jobs are tied to a running process" — which isn't how Sidekiq works. Redis is the source of truth, not the worker. The worker is stateless.

They also defaulted to the safest, most conservative architecture: full environment isolation. That's sensible for high-stakes migrations. But for a queue drain, it's significant overengineering.

The human insight — the DB is external, the workers are stateless, so we can split them probabilistically — is the kind of lateral thinking that comes from actually understanding the system rather than pattern-matching to a template.

— ✦ —

Takeaways

  • 01
    Sidekiq workers are stateless. Redis is the state. This gives you more migration flexibility than you might think.
  • 02
    Probabilistic splits are underrated. You don't always need clean cutoffs. A coin toss at startup is simple, observable, and reversible.
  • 03
    AI tools are good at safe answers, not always good at efficient ones. They'll often recommend the conservative solution even when a simpler one exists. Treat their output as a starting point, not a final answer.
  • 04
    The cleanup should be as simple as the migration. If your migration leaves behind complex infra, you've done too much. Ours cleaned up with four deleted lines.
Redis · Sidekiq · AWS ElastiCache · Migration · Ruby · ECS

Sunday, March 15, 2026

Dead Code Is a Cognitive Tax — Here's How AI Helps You Stop Paying It

Every engineer knows the feeling. You open an unfamiliar part of the codebase, and you're immediately staring down a tangle of services, workers, models, and task entries — none of which come with a label saying "still matters" or "abandoned in 2023." You read the code carefully, try to trace the call graph, maybe even grep for usages — and only after 30 minutes do you realize: this thing hasn't run in production for over a year.

That tax on your attention has a name: cognitive load. And dead code is one of its most insidious sources.


What Is Cognitive Load in a Codebase?

Cognitive load, in the context of software engineering, is the total mental effort required to understand a system well enough to work in it safely. Every class, method, model, and background job you encounter is a unit of context you have to hold in your head.

The problem is that your brain doesn't automatically know which of those units are live and which are ghosts. If an EstimateWorker class exists in your repo, you have to assume it matters — until you prove otherwise. That proof takes time, attention, and often a distracting detour away from the actual work you sat down to do.

Dead code doesn't just waste disk space. It actively misleads you.

A Real-World Example: The Estimation Pipeline Cleanup

Recently, our team completed a cleanup effort across seven pull requests targeting a legacy estimation infrastructure — a suite of services originally built around Prophet forecasts and a Clair analysis pipeline — that had gone completely dark since late 2023.

Here's what was still sitting in the codebase, doing nothing:

  • EstimateService — fetched a CSV over HTTP, upserted records into the database, and refreshed an estimation cache. Silent for months.
  • EstimateWorker — a Sidekiq background job that uploaded files to S3, triggered the estimation flow, and posted Slack notifications. Long dead.
  • Estimation::Prophet::DownloadWorker — downloaded forecast CSVs from S3 and upserted them into a Prophet table. Never called.
  • Estimators::ClairAnalysis — computed hourly analysis records for a brief window in late 2023, then stopped.
  • ClairAnalysis model and its backing database table — zero writes since the pipeline went quiet.
  • Three SwitchBoard dispatch entries: events_collect_for_next_week, generate_weekly_user_report, and estimate_v2, all orphaned task names in a routing map.

Any engineer — or AI assistant — reading this codebase would reasonably assume all of the above was active production infrastructure. None of it was.

The Numbers

7 pull requests · 31 files changed · 943 lines deleted · −816 net lines removed

PR     Branch                        +Added   −Deleted   Files
#1     cleanup-tasks                     13         16       2
#2     cleanup-unused-estimate            0         74       4
#3     remove-clair-analysis              0        314       2
#4     remove-prophet                     0        210       5
#5     remove-clair-analysis-model       20         57       3
#6     rename-clair-v2s                  94         68      13
#7     remove-estimate-unused             0        204       2
Total                                  127        943      31

The 127 additions are almost entirely the rename PR (#6) — migrations, updated references, and renamed specs. Every other PR was pure deletion.


The Cognitive Impact of the Cleanup

Cleaner model surface. Once EstimateService, EstimateWorker, and ClairAnalysis were gone, the remaining models — Clair, ClairDailyInterimResult, ClairSetting — actually reflected how the system works today.

Naming that signals intent. ClairV2 implied a versioning scheme that no longer meant anything. Its replacement, ClairDailyInterimResult, tells you exactly what the thing is and why it exists.

A smaller SwitchBoard dispatch map. Removing the three orphaned entries made the dispatch map honest again.

A shorter test suite that still covers everything that matters. Several spec files covering deleted code were removed. The test suite got faster without losing any meaningful coverage.


Where AI Fits In: Finding Dead Code You Can't See

Here's the uncomfortable truth about dead code: it's often invisible to the people closest to it. If you wrote EstimateWorker two years ago and the team that decommissioned the upstream service never filed a ticket, you might not even know it's dead. The code looks fine. The tests pass. Nothing alerts you.

A Telling Real-World Example: Claude Gets Confused, Then Catches Itself

We recently asked Claude to generate a flow diagram of our pay guarantee process. Claude produced a diagram that looked plausible — tracing through services, models, and workers in a way that made logical sense.

The problem? Part of that diagram was wrong — because Claude had incorporated a module that was no longer active into its understanding of the flow. The dead code was so well-structured and apparently coherent that the AI read it as live infrastructure and wove it into the diagram without hesitation.

But here's what makes this story instructive rather than just cautionary: when an engineer removed this (hopefully last) bit of dead code, Claude immediately realized that the diagram it had drawn earlier relied on the bad signal, revised its understanding, and corrected the diagram.

That sequence — confidently wrong, then self-correcting — is a useful frame for thinking about AI and dead code. It fooled the AI for the same reason it fools engineers: it looks like it belongs.

What AI Can Do

Tracing call graphs at scale. AI can trace the full call graph of a function or class across an entire monorepo — answering not just with direct callers, but with the absence of callers.

Cross-referencing runtime signals with static code. When connected to observability data — logs, APM traces, queue metrics — an AI can compare what the code says it does with what actually runs in production.

Flagging stale patterns. Dead code has fingerprints: models with no recent migrations, task names absent from any scheduler config, service classes with no callers outside their own spec files.

Drafting cleanup PRs. Once dead code is identified, AI can help draft the actual removal — proposing what to delete, what to rename, and what specs to clean up.
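Those fingerprints are mechanically checkable. As a deliberately simplified sketch (it ignores modules, metaprogramming, and spec-only references, and assumes nothing about any particular repo), a script can flag classes that no other file ever mentions. Here `files` maps path to source text; in practice you'd build it with Dir.glob over the codebase:

```ruby
# Flag classes that no other file references: candidates for dead code.
# `files` maps path => source text; build it with Dir.glob in real use.
def dead_code_candidates(files)
  defined_in = {}
  files.each do |path, src|
    # Record where each class-like definition lives.
    src.scan(/^\s*class\s+([A-Z]\w*)/).each { |(name)| defined_in[name] = path }
  end
  # Keep only classes whose name never appears outside their own file.
  defined_in.select do |name, path|
    files.none? { |other, src| other != path && src.include?(name) }
  end
end
```

A real pass would also exclude references that live only in spec files (one of the fingerprints above) and cross-check scheduler and dispatch configs for orphaned task names; the AI's job is then to triage this candidate list, not to generate it from scratch.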

What AI Can't Do (Yet)

AI isn't a replacement for engineering judgment. A worker might be "dead" in CI but still referenced by a cron job in an ops runbook nobody's touched in three years.

The right model is AI as a scout, engineer as the decision-maker. AI surfaces candidates. Engineers verify, contextualise, and own the deletion.

Making Dead Code Cleanup a Habit

  1. Timestamp your decommissions. When you turn off a pipeline, leave a comment in the code with the date.
  2. Review your task dispatch maps regularly. A quarterly review catches orphaned entries before they fossilise.
  3. Use AI during onboarding and code review. AI tools can help new engineers quickly validate whether something is live — and surface it for cleanup if it isn't.
  4. Treat deletion as a first-class deliverable. 816 lines removed is a meaningful engineering contribution. Make it visible in sprint planning, changelogs, and retros.

Conclusion

Large codebases accumulate cognitive debt quietly, continuously, and with compounding interest. Dead code is one of the most expensive line items: it misleads engineers, bloats test suites, and turns routine code reading into archaeology.

As we saw first-hand, it even misleads AI. Claude confidently incorporated a dead module into a flow diagram of our pay guarantee process — because the code looked live. That moment of confusion, and the self-correction that followed, is a perfect metaphor for where we are with AI-assisted engineering today: powerful, promising, and most effective when paired with good runtime context and human judgment.

The goal isn't a perfect codebase. It's a codebase where the code you're reading is the code that's actually running. That's a goal worth shipping toward.