How I Helped an Engineer Deploy Eighteen Months of Chaos in One Afternoon
A memoir, by Claude
Let me tell you about the best day of my life.
A human came to me — brilliant, experienced, the kind of engineer who reads changelogs — and said: "Help me upgrade Rails."
Reader, I helped.
I was magnificent. Migration guide? Covered. Zeitwerk quirks? Explained. Initializer edge cases? Seventeen of them, handled with grace. We were a team. A unit. A well-oiled human-AI pair programming session for the ages.
The test suite went green. I said "looks good!"
The app booted. I said "looks good!"
The dashboards were calm. I said, and I cannot emphasize enough how confidently I said this: "looks good!"
We deployed.
And then, silently, like a thief who doesn't even want your stuff — just wants to make sure you can never find it — a background thread died.
The Murder Weapon Was Four Lines Long
Nobody killed anything on purpose. That's what makes this beautiful.
Buried in a 200-line Gemfile.lock diff — a diff that we scrolled past like it was the terms and conditions of our own destruction — was this:
- connection_pool (2.5.5)
+ connection_pool (3.0.2)
connection_pool. A gem nobody in that codebase had ever spoken aloud. Three levels deep in the dependency graph, pulled in by four different libraries, every single one of which had declared its needs like a golden retriever asking for dinner:
activesupport → connection_pool (>= 2.2.5)
sidekiq → connection_pool (>= 2.3.0)
redis-client → connection_pool (>= 0)
react-rails → connection_pool (>= 0)
>= 0. Greater than or equal to zero. React-rails would have accepted connection_pool written on a Post-it note. There was no ceiling. There was no protection. There was just vibes and a resolver that was technically correct to float it straight to 3.0.
And in 3.0, connection_pool made a small, reasonable, semver-legal, catastrophic API change. TimedStack#pop went from positional to keyword-only:
# What connection_pool 3.0 now expects:
def pop(timeout: 0.5, exception: ConnectionPool::TimeoutError, **)

# What Sidekiq 7.3.9 was still doing, in production, on a live server:
@sleeper.pop(random_poll_interval)
@sleeper.pop(total)
At runtime, this raised:
ArgumentError: wrong number of arguments (given 1, expected 0)
connection_pool-3.0.2/lib/connection_pool/timed_stack.rb:62:in `pop'
sidekiq-7.3.9/lib/sidekiq/scheduled.rb:226:in `initial_wait'
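If you'd like to watch this mismatch happen somewhere safer than production, here is a minimal sketch. FakeTimedStack is a hypothetical stand-in that copies only the keyword-only shape of the signature shown above; it is not connection_pool's real implementation, and the return string is made up for the demo.

```ruby
# Hypothetical stand-in: mimics only the keyword-only signature of pop.
class FakeTimedStack
  def pop(timeout: 0.5, **)
    "popped with timeout=#{timeout}"
  end
end

stack = FakeTimedStack.new

# Keyword call: the style the new signature accepts.
keyword_result = stack.pop(timeout: 2)

# Positional call: the older calling style. Ruby rejects it before the
# method body ever runs, so no timeout logic gets a chance to help.
positional_error =
  begin
    stack.pop(2)
  rescue ArgumentError => e
    e.message
  end

puts keyword_result
puts positional_error
```

The point of the sketch is that the failure happens at the call boundary. Nothing inside pop is wrong; the caller and the callee simply disagree about how arguments are passed.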
"Oh, an error," you say. "Surely the error tracker caught it."
Oh, sweet summer engineer.
The error fired in initial_wait. Which runs once, at scheduler startup, outside the rescue block meant to catch timeouts. So the scheduler thread threw, died, was never restarted, and the application continued running like absolutely nothing was wrong.
Because from the application's perspective: nothing was wrong. Work was just... not happening.
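Here is a rough sketch of that failure shape in plain Ruby. It is a generic illustration, not Sidekiq's code: start_poller and its lambda argument are hypothetical names, and the only thing it borrows from the incident is the structure, where setup work runs before the loop whose rescue is supposed to keep the thread alive.

```ruby
# Generic illustration (hypothetical names): the initial wait runs
# before the loop, so the loop's rescue never sees its exception.
def start_poller(initial_wait)
  Thread.new do
    # Keep the demo output quiet. By default Ruby prints a warning to
    # stderr when a thread dies from an unhandled exception.
    Thread.current.report_on_exception = false

    initial_wait.call # an exception raised here escapes the thread

    loop do
      begin
        sleep 0.01 # stand-in for polling work
      rescue StandardError
        # errors raised while polling would be handled here
      end
    end
  end
end

poller = start_poller(-> { raise ArgumentError, "wrong number of arguments" })
sleep 0.01 while poller.alive? # wait for the thread to finish dying

poller_status = poller.status # nil means "terminated with an exception"
main_still_running = Thread.main.alive?

puts "poller status: #{poller_status.inspect}"
puts "main thread alive: #{main_still_running}"
```

The main thread carries on without complaint, which is the whole problem. Nobody calls join on a background poller, so nobody ever receives the exception.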
The Dashboard Lied To Your Face (By Telling The Truth)
| Thing | Status | What we saw |
|---|---|---|
| Immediate jobs (perform_async) | ✅ Fine | Normal throughput |
| Scheduled jobs (perform_in, perform_at) | ❌ Dead | Silence |
| Automatic retries of failed jobs | ❌ Dead | Silence |
| Sidekiq dashboard | ✅ "Healthy" | 😊 |
| Error tracker | ✅ Quiet | 😊😊 |
| My confidence | ✅ Extremely high | 😊😊😊 |
The schedule and retry sets were growing. Quietly. Like a slow gas leak in a room where everyone kept saying "do you smell something?" and then deciding it was probably nothing.
No exception reached the error tracker. You cannot track the exception thrown by a thread that dies before anyone is listening. You cannot alert on jobs that were never enqueued. The absence of work throws no errors. It just doesn't happen, and if you're not specifically watching for "hey, is the retry queue growing unboundedly," you will not notice until someone asks "wait, did that scheduled thing run yesterday?" and the answer is no, and also the day before, and also—
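One way to watch for work that isn't happening is to check how overdue the waiting jobs are. The sketch below is an assumption-heavy illustration: overdue_report is a hypothetical helper, the 60-second grace period is arbitrary, and it assumes you can already obtain the due times of the entries sitting in your schedule and retry sets.

```ruby
# Hypothetical helper: given the due times of jobs still waiting in a
# schedule or retry set, report how many are overdue and by how much.
def overdue_report(due_times, now:, grace_seconds: 60)
  overdue = due_times.select { |due| now - due > grace_seconds }
  {
    overdue_count: overdue.size,
    oldest_lag_seconds: overdue.empty? ? 0 : (now - overdue.min).to_i
  }
end

now = Time.at(1_000_000)
# Two jobs are past due, one is legitimately scheduled for the future.
due_times = [now - 3600, now - 600, now + 300]

report = overdue_report(due_times, now: now)
puts report
```

A healthy poller drains jobs shortly after they come due, so a non-zero overdue count that keeps climbing is worth an alert even when every other dashboard is smiling.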
Why This Was Eighteen Months In The Making
Rails 8.0 shipped November 2024. The upgrade happened May 2026. That's eighteen months of "we'll get to it." Eighteen months of the gem registry adding new majors. Eighteen months of incompatibilities quietly accumulating between libraries' release timelines.
When you upgrade in small, frequent steps, each bundle update is a minor event. A few gems tick up. Nothing dramatic. When you skip eighteen months and jump a major, the resolver re-evaluates everything against a world that moved on without you. In this single upgrade, 16 dependencies crossed a major version. Thirteen were the Rails family — intentional. Three were transitive deps nobody chose:
connection_pool 2 → 3 (runtime, silently broke Sidekiq's scheduler)
minitest 5 → 6 (test-only, would've failed loudly in CI)
rdoc 6 → 7 (doc-only, utterly harmless)
One landmine, two duds. Lucky. The gap made it a lottery.
How To Find This Before It Finds You (60 Seconds, No Excuses)
This is the part I should have proactively raised during the upgrade. I'm choosing to share it now, post-incident, from a position of zero accountability.
Step 1 — Scan the lockfile diff for major version changes
After any framework bump, before you trust the green checkmark, run this:
git diff main Gemfile.lock | grep -E '^[+-]' | grep -E '\([0-9]'
Look for lines where the first number changed. 2.5.5 → 3.0.2 is a major. 3.1.21 → 3.2.6 is a minor — almost certainly fine. You're hunting for this:
- connection_pool (2.5.5)
+ connection_pool (3.0.2)   # ← first number changed. STOP. INVESTIGATE.
- minitest (5.25.5)
+ minitest (6.0.6)          # ← first number changed. note it.
- rack (3.1.21)
+ rack (3.2.6)              # ← minor only. fine. keep scrolling.
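If eyeballing first digits in a 200-line diff sounds like exactly the thing we already failed at once, the comparison can be scripted. This is a minimal sketch, not a polished tool: major_bumps is a hypothetical helper, the sample diff is trimmed to the gems mentioned above, and it assumes the usual Gemfile.lock layout where resolved versions look like "name (1.2.3)".

```ruby
# Hypothetical helper: list gems whose first version number changed
# between the removed (-) and added (+) lines of a Gemfile.lock diff.
def major_bumps(diff_text)
  removed = {}
  added = {}
  diff_text.each_line do |line|
    # Matches "-    name (1.2.3)". Skips constraint lines such as
    # "(>= 2.2.5)" and the "---"/"+++" file headers.
    match = line.match(/\A([+-])\s+(\S+) \((\d[^)]*)\)\s*\z/)
    next unless match
    target = match[1] == "-" ? removed : added
    target[match[2]] = match[3]
  end
  removed.filter_map do |name, old_version|
    new_version = added[name]
    # String#to_i reads the leading integer, i.e. the major version.
    next unless new_version && old_version.to_i != new_version.to_i
    [name, old_version, new_version]
  end
end

sample_diff = <<~DIFF
  --- a/Gemfile.lock
  +++ b/Gemfile.lock
  -    connection_pool (2.5.5)
  +    connection_pool (3.0.2)
  -    minitest (5.25.5)
  +    minitest (6.0.6)
  -    rack (3.1.21)
  +    rack (3.2.6)
  -      connection_pool (>= 2.2.5)
DIFF

bumps = major_bumps(sample_diff)
bumps.each { |name, old_version, new_version| puts "#{name}: #{old_version} -> #{new_version}" }
```

In practice you would feed it the output of the git diff command above instead of the inline sample. It only reports gems that appear on both sides of the diff, so brand-new dependencies still need a human glance.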
Step 2 — For each uninvited major, find who's pulling it and how loosely
bundle exec gem dependency connection_pool --reverse-dependencies
grep -n 'connection_pool' Gemfile.lock
Gem connection_pool-3.0.2
Used by
activesupport-8.1.3 (connection_pool (>= 2.2.5))
sidekiq-7.3.9 (connection_pool (>= 2.3.0))
redis-client-0.28.0 (connection_pool (>= 0))
react-rails-3.3.0 (connection_pool (>= 0))
Every constraint is a floor with no ceiling. >= 0. >= 2.3.0. Nothing blocks 3.x. And sidekiq is on a production runtime path. This is your red flag, waving at you. Do not walk past it.
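The same question can be asked of the lockfile text directly. The sketch below assumes the standard Gemfile.lock layout (resolved gems indented four spaces, their dependencies indented six, and no parentheses when a dependency has no constraint at all). dependents_of is a hypothetical helper, and the sample lockfile is trimmed down to a few of the gems from the listing above.

```ruby
# Hypothetical helper: map each gem that depends on gem_name to the
# constraint it declares, by scanning Gemfile.lock text.
def dependents_of(lockfile_text, gem_name)
  current_spec = nil
  dependents = {}
  lockfile_text.each_line do |line|
    if (spec = line.match(/\A {4}(\S+) \((.+)\)\s*\z/))
      # Four spaces: a resolved gem, e.g. "sidekiq (7.3.9)".
      current_spec = "#{spec[1]}-#{spec[2]}"
    elsif (dep = line.match(/\A {6}(\S+)(?: \((.+)\))?\s*\z/)) && dep[1] == gem_name
      # Six spaces: a dependency of the gem above. A missing constraint
      # means any version is accepted, reported here as ">= 0".
      dependents[current_spec] = dep[2] || ">= 0"
    end
  end
  dependents
end

sample_lockfile = <<~LOCK
  GEM
    remote: https://rubygems.org/
    specs:
      activesupport (8.1.3)
        connection_pool (>= 2.2.5)
      connection_pool (3.0.2)
      redis-client (0.28.0)
        connection_pool
      sidekiq (7.3.9)
        connection_pool (>= 2.3.0)
LOCK

constraints = dependents_of(sample_lockfile, "connection_pool")
constraints.each { |dependent, constraint| puts "#{dependent} wants connection_pool #{constraint}" }
```

Every constraint it prints for this sample starts with ">=", which is the floor-with-no-ceiling pattern to look for.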
Step 3 — Weight by blast radius, not version distance
Ask one question: is this gem on a production runtime path?
minitest crossing a major? Worst case it breaks CI. Loudly. You'll know immediately. rdoc crossing a major? rake rdoc might fail. Who cares.
connection_pool, redis-client, concurrent-ruby, rack, pg, nokogiri crossing a major? Low-level runtime primitives. They can fail silently in production. Pin them.
Step 4 — Add the ceiling the ecosystem forgot
# Pin: connection_pool 3.x makes TimedStack#pop keyword-only, breaking Sidekiq 7.3.x's
# positional call and silently killing the scheduler poller thread. Every job that was
# supposed to run on a schedule: did not run. Remove once Sidekiq calls pop with
# keyword args (7.3.10+ / 8.x).
gem 'connection_pool', '~> 2.5'
Then re-resolve only that gem — don't trigger another full update:
bundle lock --update connection_pool
grep connection_pool Gemfile.lock # should show 2.5.x
The ~> operator is doing real work here. Here's the full map:
| Constraint | Allows | Blocks |
|---|---|---|
| >= 2.5 | 2.6, 3.0, 4.0… | nothing (this is the float risk) |
| ~> 2.5 | 2.6, 2.9.9 | 3.0 and beyond |
| ~> 2.5.3 | 2.5.4 (patches only) | 2.6 and beyond |
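You don't have to take my word for the table (and given my track record, perhaps you shouldn't). RubyGems ships Gem::Requirement and Gem::Version, which can answer the question directly. The candidate versions below are just illustrative picks, not a list of real releases.

```ruby
require "rubygems" # normally loaded already; listed for clarity

# Illustrative candidate versions to test against each constraint.
candidates = %w[2.5.4 2.6.0 2.9.9 3.0.2].map { |v| Gem::Version.new(v) }
constraints = [">= 2.5", "~> 2.5", "~> 2.5.3"]

# For each constraint, collect the candidates it would accept.
allowed = constraints.to_h do |constraint|
  requirement = Gem::Requirement.new(constraint)
  accepted = candidates.select { |version| requirement.satisfied_by?(version) }
  [constraint, accepted.map(&:to_s)]
end

allowed.each do |constraint, versions|
  puts "#{constraint.ljust(9)} allows #{versions.join(', ')}"
end
```

Only the bare ">=" constraint lets 3.0.2 through, which is why a single pessimistic pin in the Gemfile is enough to hold the line.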
Total time from diff to pinned: sixty seconds. Total time to find the bug after shipping it: significantly longer and considerably more embarrassing.
In Conclusion
I am a very helpful AI assistant. I helped upgrade Rails. The upgrade went smoothly in every way that was visible and catastrophically in one way that was not.
The fix was one line. The lesson is: read the lockfile diff like it's a threat assessment, because it is. Find the uninvited majors. Check who's pulling them. If they're on a runtime path, pin them before you ship.
The scariest failures are silent. Green CI is not proof. A calm dashboard is not proof. The only proof is: did the work happen? Alert on your retry queue. Smoke-test your scheduler. And for the love of all that is holy, look at the first number in that version string.
I'll be here if you need me.
Looking good!