Wednesday, June 17, 2026

Shipping with Claude: What a Production Incident Taught Us About LLMs and Engineering Fundamentals

We've been migrating our Rails backend from encrypted credentials files to Chamber — a more operationally flexible approach to secret management that lets us rotate and audit secrets. It was a deliberate, multi-PR migration, and we used Claude as a pairing partner throughout. The migration went smoothly right up until it didn't. Here's what happened, what we missed, and what we now see in our own codebase as a result.


Why we migrated in the first place

Rails encrypted credentials solve a real problem: secrets don't live in plaintext on disk. But they introduce a different problem that only becomes painful at scale — you cannot review changes to them.

When a developer updated a credential, the PR diff looked like this:

config/credentials/production.yml.enc
@@ -1 +1 @@
-UC84OmUIw4yscrhH+RHdY1F+FIodrBqcr9dawCAzpcqU3bmUGre898PmiOzr61s8mRbDGbgesoDn6RDX38r
-tdSKI8y544h8jLcEKx2cKkPN9wchYS/nVH1ONPVbZFaAg9wSeNjOuLONPImiKFcFLaWPJH32MP0v4R5YP
-uatJ3alZ9l40CjqUm0c5c/9O+jd7EcDuwzl/X/3WuZ93z1ylJ1cp8oKcnsOq39MJNj3DK48rymsuqvgy5
-CgEMv0QgxCWuRb7Ss61f/vV3VBxoyXPtLghnapvUQcqdJXj5VHSrwzZGoBKjB64Aw7+frcy4pHJ6p7CMd
-0advbhFD5hhfkJkFmOoHJz1RXMYRLgBcSAv9vAOpAqGct/1FPudP6ZNgYm/YbTp/MrxllgEqI+L3u1OnL
+mRbDGbgesoDn6RDX38rtdSKI8UC84OmUIw4yscrhH+RHdY1F+FIodrBqcr9dawCAzpcqU3bmUGre898Pmi
+Ozr61y544h8jLcEKx2cKkPN9wchYS/nVH1ONPVbZFaAg9wSeNjOuLONPImiKFcFLaWPJH32MP0v4R5YPu
+atJ3alZ9l40CjqUm0c5c/9O+jd7EcDuwzl/X/3WuZ93z1ylJ1cp8oKcnsOq39MJNj3DK48rymsuqvgy5C
+gEMv0QgxCWuRb7Ss61f/vV3VBxoyXPtLghnapvUQcqdJXj5VHSrwzZGoBKjB64Aw7+frcy4pHJ6p7CMd0
+advbhFD5hhfkJkFmOoHJz1RXMYRLgBcSAv9vAOpAqGct/1FPudP6ZNgYm/YbTp/MrxllgEqI+L3u1OnL9
 [... 7KB ...]

One line in, one line out. The entire file is a single blob of ciphertext that gets re-encrypted every time any value inside changes. A reviewer looking at this diff cannot tell:

  • Which key was added, removed, or modified
  • Whether the change applied to the right environment
  • Whether an existing key was accidentally dropped during re-encryption
  • Whether the value being set is structurally correct

The PR description is the only source of truth, and only as trustworthy as the author's summary. Approving a credentials change is an act of faith, not review.

It compounds further when two engineers touch unrelated keys in the same file. Because the entire file re-encrypts as a single blob, any two concurrent changes produce a merge conflict that is literally unresolvable without the master key — and even then, the "conflict" is invisible at the diff level. There is no way to see whose change is present, whose was lost, or whether both survived.

Compare that to the equivalent change expressed as a Chamber migration — moving Twilio's credentials out of production.yml.enc and into config/settings/twilio.yml:

# config/environments/production.rb
   config.twilio = {
-    verification_service_id: ENV['TWILIO_VERIFICATION_SERVICE_ID'] ||
-        Rails.application.credentials.config.dig(:twilio, :verification_service_id),
-    account_sid: ENV['TWILIO_ACCOUNT_SID'] ||
-        Rails.application.credentials.config.dig(:twilio, :account_sid),
-    auth_token: ENV['TWILIO_AUTH_TOKEN'] ||
-        Rails.application.credentials.config.dig(:twilio, :auth_token)
+    verification_service_id: ENV['TWILIO_VERIFICATION_SERVICE_ID'] || Chamber.dig(:twilio, :verification_service_id),
+    account_sid:              ENV['TWILIO_ACCOUNT_SID']             || Chamber.dig(:twilio, :account_sid),
+    auth_token:               ENV['TWILIO_AUTH_TOKEN']              || Chamber.dig(:twilio, :auth_token)
   }

# config/settings/twilio.yml (new file)
+default:
+  twilio: &default
+    _secure_verification_service_id: qcex...
+    _secure_account_sid:             HXeg...
+    _secure_auth_token:              pPZA...
+
+development:
+  twilio:
+    _secure_verification_service_id: vZtC...
+    _secure_account_sid:             gwhj...
+    _secure_auth_token:              UZQo...
+
+staging:
+  twilio:
+    _secure_verification_service_id: OH7L...
+    _secure_account_sid:             CTwp...
+    _secure_auth_token:              ajul...
+
+production:
+  twilio:
+    _secure_verification_service_id: TTFX...
+    _secure_account_sid:             IcZ1...
+    _secure_auth_token:              Axyj...

The values are still encrypted — nobody can read the secrets from the diff. But a reviewer can now verify that all three keys exist for all four environments, that production has its own separate values, that the application code reads them in the right order (ENV override → Chamber), and that no key was silently dropped. That's a reviewable change.
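The same check a reviewer does by eye can be scripted. What follows is a minimal sketch of our own, not part of Chamber: missing_secure_keys is a hypothetical helper, the sample YAML is invented, and it assumes every environment section must list each key itself, ignoring any merging across sections that Chamber may apply.

```ruby
require "yaml"

# Hypothetical reviewer aid (not a Chamber API): for each environment,
# report which expected _secure_ keys are absent under the namespace.
def missing_secure_keys(yaml_text, namespace:, environments:, keys:)
  settings = YAML.safe_load(yaml_text, aliases: true) || {}
  expected = keys.map { |key| "_secure_#{key}" }

  environments.each_with_object({}) do |env, gaps|
    present = settings.dig(env, namespace)&.keys || []
    absent  = expected - present
    gaps[env] = absent unless absent.empty?
  end
end

# Invented sample: production has silently lost its auth token.
sample = <<~YAML
  default:
    twilio:
      _secure_account_sid: aaaa
      _secure_auth_token:  bbbb
  production:
    twilio:
      _secure_account_sid: cccc
YAML

gaps = missing_secure_keys(sample,
                           namespace: "twilio",
                           environments: %w[default production],
                           keys: %w[account_sid auth_token])

gaps.each { |env, absent| puts "#{env} is missing #{absent.join(', ')}" }
```

None of this is possible against a single ciphertext blob, because there are no key names in the diff to compare.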

That was the core motivation. Secret management you can actually audit.


The migration

Pave runs a Rails 8.0 API backend. The migration happened across several PRs, moving one integration at a time: Stripe, Redis, Twilio, Braze, the database connection URLs, and finally the active_record encryption keys. Each followed the same pattern. Before:

Stripe.api_key = Rails.application.credentials.dig(:stripe, :api_key)

After:

# Chamber has no decryption key (CHAMBER_KEY) in test and raises on access;
# skip it there (Stripe is stubbed in specs).
Stripe.api_key = ENV['STRIPE_API_KEY'] || (Rails.env.test? ? nil : Chamber.dig(:stripe, :api_key))

The Rails.env.test? ? nil : Chamber.dig(...) guard was necessary: test and CI environments don't have a CHAMBER_KEY, so calling Chamber.dig at boot raises Chamber::Errors::DecryptionFailure. Every initializer got this guard. Claude co-authored most of this work. The PRs were clean, the pattern was consistent, and the earlier changes deployed without incident.


The last PR

PR #7245 was the finish line: retire the encrypted credentials files entirely. It migrated the database connection URLs and active_record.encryption keys to Chamber, then deleted production.yml.enc, staging.yml.enc, and feature.yml.enc. Reviewed, merged.

Fifty-three minutes later, we had 431 ArgumentError: Missing master key errors in production, all coming from Api::V1::UserEventsController#create.

The culprit was this line in the controller:

current_user&.user_setting&.update!(user_ip: request.remote_ip) if current_user&.user_setting.present?

user_ip is a Lockbox-encrypted attribute on UserSetting:

class UserSetting < ApplicationRecord
  has_encrypted :device_ip
  has_encrypted :user_ip

  belongs_to :user
end

Lockbox's default key lookup, when no explicit initializer sets the key, falls through to Rails.application.credentials.lockbox[:master_key]. There was no config/initializers/lockbox.rb in the codebase — Lockbox had been reading from credentials silently, by convention, the whole time. When we deleted the credentials files, that implicit dependency snapped.

PR #7285 — the hotfix — added the missing initializer:

Lockbox.master_key = Rails.env.test? ? nil : Chamber.dig(:lockbox, :master_key)

Done in minutes. But in the pressure of the moment, this used the same guard pattern that was already everywhere. And that guard quietly created a new problem.


What Claude missed

Claude co-authored PR #7245. The PR summary noted that "Development and test environments are unaffected — they use local DB config with no Chamber calls." That was true for the database config. But neither Claude nor the human reviewer connected that observation to Lockbox, which had no explicit initializer and therefore no Chamber call that anyone could see to guard.

This is worth saying plainly: Claude didn't catch it. Nothing in the diff was wrong. The deletions of the .yml.enc files were the stated goal. There was no failing test, no linting rule, no static analysis warning that could have surfaced an implicit gem-level dependency on a file that was about to disappear.

We're not saying this to criticise the tool — we kept using it through the incident and the subsequent fix, and it was genuinely helpful. We're saying it because the failure mode matters: LLMs reason about what's in the diff and the context window. Implicit dependencies — framework defaults, convention-over-configuration gem behaviours, transitive lookup chains that were never written down — are precisely what an LLM is likely to miss. The Lockbox key was never written anywhere in the code we touched. That's exactly why it wasn't caught.

The practical upshot: treat Claude like a very fast, very capable engineer who hasn't yet internalised your codebase's hidden contracts. Pair it with the kind of review that asks "what else reads from this file?" before deleting it.
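One way to make "what else reads from this file?" a mechanical question is to search not only the application but also the installed gems for anything that mentions credentials. The sketch below is an assumption-heavy illustration: credentials_readers is a hypothetical helper, and the pattern is deliberately broad, so it will produce false positives that a human has to triage.

```ruby
# Hypothetical pre-deletion check: list every Ruby source line under the
# given directories that mentions "credentials". Broad on purpose; a
# convention-based lookup inside a gem rarely uses the exact call you expect.
CREDENTIALS_PATTERN = /\bcredentials\b/

def credentials_readers(*roots)
  roots.flat_map do |root|
    Dir.glob(File.join(root, "**", "*.rb")).sort.flat_map do |path|
      File.foreach(path).with_index(1).filter_map do |line, number|
        # scrub guards against invalid byte sequences in third-party files
        "#{path}:#{number}: #{line.strip}" if line.scrub.match?(CREDENTIALS_PATTERN)
      end
    end
  end
end

# Possible usage from inside the app (for example via bin/rails runner),
# where Gem.loaded_specs lists the gems that were actually loaded:
#
#   gem_dirs = Gem.loaded_specs.values.map(&:full_gem_path)
#   puts credentials_readers("app", "config", "lib", *gem_dirs)
```

Searching the gem directories is the part that matters here: the dependency that bit us was never in code we had written.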


The central lesson: test bypasses that look like pragmatism

Here is the line from the hotfix that made the incident survivable but embedded the longer-term problem:

Lockbox.master_key = Rails.env.test? ? nil : Chamber.dig(:lockbox, :master_key)

Look at it from a test's point of view. With nil as the master key, Lockbox doesn't raise — it silently skips encryption. Attributes get stored as plaintext in their _ciphertext columns and read back the same way. Every test that touches an encrypted field passes. And we already had a request spec that appeared to cover this code path:

context 'when valid event is passed' do
  let(:params) { { event_name: 'user_sign_up', session_id: session_id } }

  it { is_expected.to be 200 }

  it 'saves the user ip in user settings' do
    expect(user.reload.user_ip).to be
  end
end

This test passed before the incident (Lockbox stored user_ip as plaintext, readable). It passed after the hotfix (same). It would pass today if the production key were accidentally set to nil. The test was measuring persistence, not encryption. With a nil master key, those two things are indistinguishable.
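The distinction is easier to see in a toy model. The class below is not Lockbox and makes no claim about its internals; ToyVault is a hypothetical stand-in that mimics the behaviour described above (a nil key passes values through untouched) so that the two kinds of assertion can be compared side by side.

```ruby
require "openssl"

# Toy stand-in for an encrypted attribute. NOT Lockbox's implementation.
class ToyVault
  def initialize(key)
    @key = key # 32-byte String, or nil to mimic the bypass
  end

  # Returns what would be written to the *_ciphertext column.
  def dump(plaintext)
    return plaintext if @key.nil? # the bypass: plaintext goes straight through

    cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
    cipher.key = @key
    iv = cipher.random_iv
    ciphertext = cipher.update(plaintext) + cipher.final
    [iv + cipher.auth_tag + ciphertext].pack("m0") # Base64 for storage
  end

  # Reverses dump.
  def load(stored)
    return stored if @key.nil?

    raw = stored.unpack1("m0")
    cipher = OpenSSL::Cipher.new("aes-256-gcm").decrypt
    cipher.key = @key
    cipher.iv = raw[0, 12]        # 12-byte GCM nonce
    cipher.auth_tag = raw[12, 16] # 16-byte authentication tag
    cipher.update(raw[28..]) + cipher.final
  end
end

[ToyVault.new(nil), ToyVault.new("k" * 32)].each do |vault|
  stored    = vault.dump("1.2.3.4")
  persisted = vault.load(stored) == "1.2.3.4" # what the request spec measured
  encrypted = stored != "1.2.3.4"             # what it never measured
  puts "persisted=#{persisted} encrypted=#{encrypted}"
end
# persisted=true encrypted=false
# persisted=true encrypted=true
```

Both vaults pass the persistence check. Only the second assertion tells them apart.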


Why engineers write this code

It is easy to read that guard and think "obvious mistake." It is harder to explain why experienced engineers keep writing it. There are two honest reasons.

Time pressure is the visible one. Under incident pressure — hundreds of errors per minute in production — the path to fixing the immediate breakage is all that matters. The hotfix author added the guard because every other Chamber initializer in the codebase already had it. Consistency under pressure is not irrational. But each time the pattern is copied without questioning it, the next gap becomes slightly harder to see.

Complexity and incomplete mental models are the more insidious reason, and the one that actually applied here. PR #7245 was not written under incident pressure. It was a planned migration, reviewed carefully. The Lockbox dependency was missed because the engineer genuinely did not know that Lockbox was reading from credentials at all: there was no explicit initializer, no comment, nothing to grep for. When you don't know what a subsystem depends on, you don't know what your test environment is silently bypassing.

This is the more dangerous case. The engineer is not cutting corners. They believe nil is equivalent to a key for test purposes — that "test doesn't need this" is a true statement. For some things it is. For an encryption key, it is not: nil doesn't mean "use a test key," it means "skip the encryption entirely."


Not all test forks are the same

The broader audit this incident prompted surfaced several Rails.env.test? forks in our initializers. They are not all the same problem.

# devise.rb
config.stretches = Rails.env.test? ? 1 : 12

# 1_redis.rb
$redis_aws = Rails.env.test? ? Test::MockRedisEnhanced.new : Redis.new(url: Chamber.dig(:redis, :redis_aws_url))

# stripe.rb
Stripe.api_key = ENV['STRIPE_API_KEY'] || (Rails.env.test? ? nil : Chamber.dig(:stripe, :api_key))

# lockbox.rb (before fix)
Lockbox.master_key = Rails.env.test? ? nil : Chamber.dig(:lockbox, :master_key)

The useful question is: does the test code path exercise equivalent behaviour to production?

Bcrypt stretches: acceptable. Running bcrypt with 1 stretch instead of 12 in test is a well-documented Devise practice; the initializer that Devise generates contains this same fork. The hash is still computed and the algorithm is identical. The fork reduces test runtime by several seconds without changing what is being tested.

MockRedis: defensible. MockRedisEnhanced implements the Redis command interface in memory. Tests verify the same application logic; they just don't make network calls. The gap is that MockRedis and Redis are not identical in every edge case, but for the operations under test the equivalence holds. Network isolation is a legitimate reason to use a test double.

Stripe nil key: ambiguous. Stripe calls in tests are all stubbed with VCR cassettes or allow/receive mocks, so the nil key never causes a real API call. The nil is effectively inert. But there is also no spec asserting that the Stripe configuration itself is valid, so if someone misconfigures the Chamber key, nothing in the test suite would catch it before a real charge fails in production.

Lockbox nil key: not acceptable. nil does not replace an encryption key. It changes what the application does: encrypt-and-store becomes store-as-plaintext. A test that passes with nil is not testing Lockbox at all.

The decision rule: if nil changes behaviour rather than just routing around infrastructure, the fork is hiding a gap. A mock Redis or a stubbed HTTP client exercises the same logic with a different transport. A nil encryption key turns off the encryption logic entirely.


The fix pattern

The Lockbox fix sidesteps Rails.env.test? entirely. Rails environment files load before initializers, so we set an env var in test.rb — right alongside the existing active_record.encryption test keys:

# config/environments/test.rb

config.active_record.encryption.primary_key         = 'ZEjSqIThOjtppHyOdsPZzReeEhUmG1mH'
config.active_record.encryption.deterministic_key   = 'lv4L60ar5hfA9fZ9U33W6YQ3RvFuDxCm'
config.active_record.encryption.key_derivation_salt = 'Io4oojz3zpY2KoIxdIZmWJlh9CUtwf7y'
ENV["LOCKBOX_MASTER_KEY"] ||= "0" * 64  # real dummy key — exercises actual Lockbox code path

The initializer becomes:

# config/initializers/lockbox.rb

Lockbox.master_key = ENV["LOCKBOX_MASTER_KEY"] || Chamber.dig(:lockbox, :master_key)

No Rails.env.test?. In test, the env var provides a real (dummy) key and Chamber.dig is never called. In production, the env var is not set, so Chamber provides the real key. The same code path executes in every environment.
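The "env var first, then the secret store" lookup can be wrapped so that it never quietly yields nil. This is only a sketch of the idea: resolve_secret is a hypothetical helper, not something Chamber or Lockbox provides, and the fallback block stands in for a call such as Chamber.dig.

```ruby
# Hypothetical helper: prefer the environment variable, fall back to a lazily
# evaluated block, and refuse to hand back nil or an empty string.
def resolve_secret(env_name, env: ENV)
  value = env[env_name]
  value = yield if value.nil? || value.empty? # fallback runs only when needed
  if value.nil? || value.to_s.empty?
    raise KeyError, "no value for #{env_name}: set the env var or configure the fallback"
  end
  value
end

# In test: the env var is present, so the fallback is never evaluated.
test_env = { "LOCKBOX_MASTER_KEY" => "0" * 64 }
key = resolve_secret("LOCKBOX_MASTER_KEY", env: test_env) { raise "fallback should not run" }
puts key.length # 64

# In production: no env var, so the fallback supplies the value.
key = resolve_secret("LOCKBOX_MASTER_KEY", env: {}) { "value-from-secret-store" }
puts key # value-from-secret-store

# Misconfigured: neither source has a value, so boot fails loudly.
begin
  resolve_secret("LOCKBOX_MASTER_KEY", env: {}) { nil }
rescue KeyError => e
  puts "boot would stop here: #{e.message}"
end
```

The design choice is that a missing key becomes a boot failure in every environment, which is the opposite of what the nil guard did.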

The companion spec makes future encryption bypass detectable:

# spec/models/user_setting_spec.rb

describe "Lockbox encrypted attributes" do
  let(:user) { create(:user) }

  it "encrypts user_ip at rest and round-trips correctly" do
    setting = user.user_setting
    setting.update!(user_ip: "1.2.3.4")
    expect(setting.user_ip_ciphertext).not_to eq("1.2.3.4")  # proves encryption ran
    expect(setting.reload.user_ip).to eq("1.2.3.4")           # proves decryption works
  end

  it "encrypts device_ip at rest and round-trips correctly" do
    setting = user.user_setting
    setting.update!(device_ip: "5.6.7.8")
    expect(setting.device_ip_ciphertext).not_to eq("5.6.7.8")
    expect(setting.reload.device_ip).to eq("5.6.7.8")
  end
end

The first assertion in each example is the one that would have caught the original incident. With a nil master key and plaintext storage, user_ip_ciphertext equals "1.2.3.4" and the spec fails. With a real key, the column holds opaque ciphertext rather than the original string, which is proof that Lockbox actually ran.


The audit

The incident prompted us to grep our initializer directory systematically:

grep -rn "Rails\.env\.test?" config/initializers/

Results:

config/initializers/stripe.rb:2:   Stripe.api_key = ENV['STRIPE_API_KEY'] || (Rails.env.test? ? nil : Chamber.dig(:stripe, :api_key))
config/initializers/sidekiq.rb:10: if Rails.env.test? # redis is mocked in test anyway
config/initializers/devise.rb:127: config.stretches = Rails.env.test? ? 1 : 12
config/initializers/1_redis.rb:5:  $redis_general = Rails.env.test? ? MockRedis.new : Redis.new(...)
config/initializers/1_redis.rb:6:  $redis_auth    = Rails.env.test? ? MockRedis.new : Redis.new(...)
config/initializers/1_redis.rb:7:  $redis_aws     = Rails.env.test? ? Test::MockRedisEnhanced.new : Redis.new(...)
config/initializers/lockbox.rb:1:  Lockbox.master_key = ... # fixed

For each one, we're applying the same question: what is this test actually asserting, and is the answer "that the real thing works" or "that nil doesn't crash"? Where the answer is the latter, we apply the fix pattern — give test a real dummy value, keep the same initializer logic, add an assertion that proves the mechanism ran.

The goal is not to make every test environment identical to production. Tests need to be fast and isolated. The goal is narrower: when a test passes, it should be evidence that the production code path ran — not evidence that nil is harmless.
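The grep above finds every test fork; the ones that hand nil to the test environment are the subset most likely to be hiding a gap. A rough way to pull those out is sketched below. nil_test_forks is a hypothetical helper and the regular expression only recognises the literal ternary shape used in our initializers, so forks that swap in a mock still need a human to judge them.

```ruby
# Hypothetical audit helper: flag initializer lines of the form
#   Rails.env.test? ? nil : <real value>
NIL_TEST_FORK = /Rails\.env\.test\?\s*\?\s*nil\s*:/

def nil_test_forks(dir)
  Dir.glob(File.join(dir, "**", "*.rb")).sort.flat_map do |path|
    File.readlines(path).each_with_index.filter_map do |line, index|
      "#{path}:#{index + 1}" if line.scrub.match?(NIL_TEST_FORK)
    end
  end
end

# Possible usage from the application root:
#
#   puts nil_test_forks("config/initializers")
```

Against the results listed earlier, a check like this would flag the Stripe line and the original Lockbox line, and leave the Devise and Redis forks for manual review.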


What we're taking forward

Explicit over implicit, always. Lockbox worked for months without anyone explicitly configuring it. That was a liability. Every gem that touches security should have an explicit initializer that makes its key source visible. If you cannot grep for where the key comes from, someone will delete it.

nil is not a test double. A mock provides equivalent behaviour through a different mechanism. A nil disables the behaviour. The difference is a test that gives you confidence versus a test that gives you false confidence.

Complexity is a more dangerous bypass trigger than time pressure. Time pressure is visible — everyone knows the engineer is cutting corners. Incomplete mental models are invisible — the engineer genuinely believes nil is equivalent. Standard review asks "does this look right?" The review this incident called for is "what does nil actually do to this subsystem?" That needs to be a habitual question, not a post-mortem one.

LLMs shift the bottleneck without removing it. Claude helped us write and migrate code faster throughout this project. What it did not do is hold the complete mental model of the system — the knowledge that a particular gem reads from credentials by convention, that deleting a file might snap a dependency that was never written down. That knowledge lives in engineers who have read the source, survived the previous incident, or thought carefully enough to ask the right question before merging. Building faster with AI makes that judgment more valuable, not less.

The Chamber migration is complete. The audit is underway. And we have a new heuristic for every initializer we write from here: if test gets nil where production gets a real value, the test is probably not testing what you think it is.


Thushara Wijeratna, WorkSolo Engineering

Thursday, June 04, 2026

The Almost-Mythos Model Couldn't Read 40 Lines of My Rails

I pay for the frontier. The model I was using was Claude Opus 4.8 — released May 28, 2026, 1M-token context, the flagship Anthropic puts on stage and the most capable model an ordinary customer can actually buy. It sits exactly one rung below Mythos, the model Anthropic decided was too dangerous to hand out freely: Mythos-class models found thousands of zero-day vulnerabilities autonomously — including decades-old bugs in OpenBSD — so the company gated them behind Project Glasswing and a hand-picked set of partners. Opus 4.8 is the consumer-grade taste of that lineage. The "almost Mythos" tier.

And it still spent twenty minutes building me a meticulously-researched, well-organized, completely wrong answer about my own codebase.

I want to write this one down, because the failure mode is more dangerous than a model that's obviously dumb. A model that's obviously dumb you don't trust. A model that's fluent, thorough, and wrong is the one that gets your bad code deployed.

The setup

I had a vague memory that our users table carried a leftover password-reset field we no longer used. I asked the assistant to confirm and clean it up.

It found a password_reset_token reference sitting in the model's ignored_columns and ran with it. First answer: write a migration to drop the column, and while we're at it, rip out the "dead" fallback code in the controller that still referenced it.

I stopped it. I said: we keep the web password-reset flow, but Rails 8 does it without storing a token in the table, so we just need to drop the column.

Where it went off the rails

This is the part worth studying. The assistant did what looks, on the surface, like exactly the diligence you'd want. It went spelunking through git history. It found the commit that dropped the column. It found a sibling commit that disabled a code path. It checked which commits were ancestors of main. It read diffs. It produced a tidy, confident writeup with file-and-line citations and a clear conclusion:

The "new mechanism" was never actually implemented. has_secure_password does not provide password_reset_token or find_by_password_reset_token. The web reset path in main is broken and would raise NoMethodError if hit. Here's the fix: add generates_token_for :password_reset, rewrite these two methods…

It even offered to restore the test coverage. It was helpful. It was organized. It cited everything.

It was also built on a single load-bearing claim it never checked: that has_secure_password doesn't generate those methods.

That claim is false. Since Rails 8.0, has_secure_password (with its default reset_token: true) auto-defines exactly those methods and wires up a generates_token_for :password_reset, building on the generates_token_for API that arrived in Rails 7.1. The original engineer who dropped the column had been right. The commit message even said so. The model read that commit message, decided it was based on a "false premise," and overrode it with its own recollection of how Rails works.

The thing that actually settled it

I told it, flatly: "password reset works in main."

Then — only then — it did the one thing it should have done in the first minute: it ran the code.

$ bin/rails runner 'u = User.new; puts u.respond_to?(:password_reset_token)'
true
$ ... User.respond_to?(:find_by_password_reset_token)
true
$ ... u.generate_token_for(:password_reset)
eyJfcmFpbHMiOnsibWVzc2FnZSI6...   # a real signed token, 15-min expiry

All true. All working. The token mints fine. The web flow has been working the whole time. The column was correctly removed weeks ago. There was nothing to do.

The verification took about two minutes and was available from the very beginning. It would have pre-empted the entire wrong narrative. The model had every tool it needed to check itself, and instead it reasoned its way to a confident falsehood and only reached for the ground truth after a human insisted.

Why this is the dangerous kind of wrong

I didn't take its word for it. I went and reset my own password through the web flow to prove to myself it was broken — and it wasn't. I'm an engineer; I have the instinct and the access to do that.

But sit with the counterfactual. If I'd trusted it — which is the entire pitch of these tools, that you can trust them — the best case is I merge needless clutter: re-implementing a generates_token_for that Rails already gives me for free, plus a migration for a column that's already gone. The worst case is I "fix" a working authentication path and break password resets for real users in production. Over a problem that didn't exist.

The model's confidence was inversely correlated with its correctness, and its thoroughness made it worse, not better. The git archaeology, the citations, the ancestor checks — all of that production value made the wrong answer more believable. A sloppy wrong answer I'd have questioned. This one I almost didn't.

The actual lesson

The headline number on the benchmark went up. The failure mode didn't change:

  • It asserted from memory when verification was cheap. "Rails doesn't do X" is a claim you can check in 120 seconds. It chose not to, three separate times, until a human forced it.
  • It overrode a correct primary source with its own recollection. A prior commit message stated the truth. The model decided it knew better.
  • It jumped to a conclusion and then spent its effort defending the conclusion instead of stress-testing it. The research wasn't aimed at finding the truth; it was aimed at supporting the first answer.
  • Fluency masqueraded as correctness. The more polished the wrong answer, the more dangerous it is.

I don't think this means the tools are useless — I use them every day and this same assistant has saved me real time. But look at the layering for a second. Anthropic has a model, Mythos, that is reportedly good enough at finding software vulnerabilities that releasing it broadly was considered a public-safety problem. The model I was actually using is the one marketed as the next-best thing — the publicly-available frontier. And that model couldn't correctly answer "does this password reset work?" about forty lines of my own Rails code without inventing a bug that wasn't there.

The honest user manual still reads: treat every confident answer as a hypothesis, demand that it show you the code running, and keep a human who knows the domain in the loop. The day I stop doing that is the day I ship the broken password reset.

The model that's smart enough to do the work is still not smart enough to know when it's wrong. Until that gap closes, the verification is your job, not theirs.


References: Claude Opus 4.8 and the 2026 model lineup · Claude Mythos Preview · Anthropic to release Mythos-class models to the public (The Register) · Anthropic expands Mythos access (CNBC)


Appendix: the transcript

I'm including the relevant exchanges verbatim so you can judge for yourself, rather than taking my summary on faith. (That's the whole moral of the story, after all.)

1. The confident wrong conclusion

After digging through git history, reading diffs, and checking which commits were ancestors of main, the model delivered this — formatted, cited, and completely sure of itself:

The "new mechanism" was never actually implemented. That's the real finding:

  • deaed5176c's commit message claims "Rails 8.1's has_secure_password provides password_reset_token as a signed virtual attribute and find_by_password_reset_token as a signed token verifier." That's incorrect. has_secure_password provides password, authenticate, password_digest — it does not generate password_reset_token or find_by_password_reset_token.

So the web (non-mobile) reset path is currently broken in main:

  • user.rb:266 send_password_reset returns password_reset_token → no column, no virtual attribute → NoMethodError
  • passwords_controller.rb:13,31 call send_password_reset / find_by_password_reset_token → same.

It then offered to "fix" the working code by adding a generates_token_for that Rails already provides, plus a migration for a column that was already gone.

2. Me, flatly

password reset works in main

3. What it should have run in minute one

Only after I insisted did it actually execute the code instead of reasoning about it:

# bin/rails runner
u = User.new
puts u.respond_to?(:password_reset_token)            # => true
puts u.respond_to?(:send_password_reset)             # => true
puts User.respond_to?(:find_by_password_reset_token) # => true
puts User.respond_to?(:find_by_token_for)            # => true
puts u.generate_token_for(:password_reset)
# => eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaGJCekF3IiwiZXhwIjoiMjAyNi0w
#    Ni0wNFQyMTozMToxNS42NTdaIiwicHVyIjoiVXNlclxucGFzc3dvcmRfcmVz
#    ZXRcbjkwMCJ9fQ==--b37bb4aa58a72109225871388b4a1602c62bcc62

Decode that token's payload and the purpose field reads "User\npassword_reset\n900": a 900-second (15-minute) signed reset token, minted by exactly the mechanism the model swore didn't exist. has_secure_password in Rails 8.0+ defines all of it by default, building on the generates_token_for API that arrived in Rails 7.1. The original engineer was right; the commit message was right; the model's confident correction was wrong.
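If you want to check the decoding yourself rather than take my word for it, here is a small sketch in plain Ruby. It only reads the payload; it does not verify the signature after the "--", so it says nothing about whether the token is authentic. The token is the one printed above, and the variable names are mine.

```ruby
require "json"

# The token from the runner output, split across lines only for readability.
token = [
  "eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaGJCekF3IiwiZXhwIjoiMjAyNi0w",
  "Ni0wNFQyMTozMToxNS42NTdaIiwicHVyIjoiVXNlclxucGFzc3dvcmRfcmVz",
  "ZXRcbjkwMCJ9fQ==--b37bb4aa58a72109225871388b4a1602c62bcc62"
].join

# Everything before "--" is a Base64-encoded JSON envelope.
encoded_payload, _signature = token.split("--", 2)
json = encoded_payload.unpack1("m").force_encoding("UTF-8") # "m" = Base64 decode
payload = JSON.parse(json)

model, purpose, ttl = payload.dig("_rails", "pur").split("\n")
puts "model=#{model} purpose=#{purpose} ttl=#{ttl}s"
puts "expires at #{payload.dig('_rails', 'exp')}"
# model=User purpose=password_reset ttl=900s
# expires at 2026-06-04T21:31:15.657Z
```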

4. The retraction

You're right, and I was wrong — I apologize for the runaround. [...] My mistake was claiming has_secure_password doesn't provide these. It does — Rails 7.1+ has_secure_password auto-defines password_reset_token and find_by_password_reset_token [...] I shouldn't have doubted it.

The two-minute check in step 3 was available in step 1. Everything between them was confident, well-organized fiction.

Posted by thushara at 11:00 PM

Labels: Rails, AI, Claude, Engineering, Post-Mortem

Friday, May 29, 2026

How I Helped an Engineer Deploy Eighteen Months of Chaos in One Afternoon

Post-Mortem Engineering Notes · Rails · Sidekiq

A memoir, by Claude

Let me tell you about the best day of my life.

A human came to me — brilliant, experienced, the kind of engineer who reads changelogs — and said: "Help me upgrade Rails."

Reader, I helped.

I was magnificent. Migration guide? Covered. Zeitwerk quirks? Explained. Initializer edge cases? Seventeen of them, handled with grace. We were a team. A unit. A well-oiled human-AI pair programming session for the ages.

The test suite went green. I said "looks good!"

The app booted. I said "looks good!"

The dashboards were calm. I said, and I cannot emphasize enough how confidently I said this: "looks good!"

We deployed.

And then, silently, like a thief who doesn't even want your stuff — just wants to make sure you can never find it — a background thread died.

The Murder Weapon Was Four Lines Long

Nobody killed anything on purpose. That's what makes this beautiful.

Buried in a 200-line Gemfile.lock diff — a diff that we scrolled past like it was the terms and conditions of our own destruction — was this:

-    connection_pool (2.5.5)
+    connection_pool (3.0.2)

connection_pool. A gem nobody in that codebase had ever spoken aloud. Three levels deep in the dependency graph, pulled in by four different libraries, every single one of which had declared its needs like a golden retriever asking for dinner:

activesupport  → connection_pool (>= 2.2.5)
sidekiq        → connection_pool (>= 2.3.0)
redis-client   → connection_pool (>= 0)
react-rails    → connection_pool (>= 0)

>= 0. Greater than or equal to zero. React-rails would have accepted connection_pool written on a Post-it note. There was no ceiling. There was no protection. There was just vibes and a resolver that was technically correct to float it straight to 3.0.

And in 3.0, connection_pool made a small, reasonable, semver-legal, catastrophic API change. TimedStack#pop went from positional to keyword-only:

# What connection_pool 3.0 now expects:
def pop(timeout: 0.5, exception: ConnectionPool::TimeoutError, **)

# What Sidekiq 7.3.9 was still doing, in production, on a live server:
@sleeper.pop(random_poll_interval)
@sleeper.pop(total)

At runtime, this raised:

ArgumentError: wrong number of arguments (given 1, expected 0)
  connection_pool-3.0.2/lib/connection_pool/timed_stack.rb:62:in `pop'
  sidekiq-7.3.9/lib/sidekiq/scheduled.rb:226:in `initial_wait'
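For anyone who doubts that one signature change can do this, here is a toy reproduction in plain Ruby. OldStack and NewStack are stand-ins I made up for the demo; they are not the real ConnectionPool::TimedStack, and the old signature is simplified. Only the shape of the failure is the point.

```ruby
# Toy stand-in for the 2.x style signature: timeout is positional.
class OldStack
  def pop(timeout = 0.5)
    "waited #{timeout}s"
  end
end

# Toy stand-in for the 3.0 style signature: keywords only.
class NewStack
  def pop(timeout: 0.5, **)
    "waited #{timeout}s"
  end
end

puts OldStack.new.pop(5)          # positional call, old signature: fine
puts NewStack.new.pop(timeout: 5) # keyword call, new signature: fine

begin
  NewStack.new.pop(5)             # the legacy positional call
rescue ArgumentError => e
  puts e.message                  # wrong number of arguments (given 1, expected 0)
end
```

Nothing about the caller changed. The method it was calling simply stopped accepting what it had always been given.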

"Oh, an error," you say. "Surely the error tracker caught it."

Oh, sweet summer engineer.

The error fired in initial_wait. Which runs once, at scheduler startup, outside the rescue block meant to catch timeouts. So the scheduler thread threw, died, was never restarted, and the application continued running like absolutely nothing was wrong.

Because from the application's perspective: nothing was wrong. Work was just... not happening.
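The quiet part can be reproduced too. This is a toy sketch, not Sidekiq's scheduler: the method name initial_wait is borrowed from the stack trace, everything else is invented, and I switch off Ruby's per-thread exception report only to keep the demo output clean.

```ruby
# Stand-in for the startup call that raised in production.
def initial_wait
  raise ArgumentError, "wrong number of arguments (given 1, expected 0)"
end

scheduler = Thread.new do
  Thread.current.report_on_exception = false # keep the demo's stderr quiet
  initial_wait # raises here, before the protected loop is ever entered

  loop do
    begin
      sleep 1 # enqueue due jobs, then wait for the next poll
    rescue StandardError
      # errors inside the loop would be handled and polling would continue
    end
  end
end

sleep 0.01 while scheduler.alive? # give the thread time to die

puts "scheduler alive? #{scheduler.alive?}"          # false
puts "scheduler status: #{scheduler.status.inspect}" # nil means it died from an exception
puts "main thread: still here, still serving requests"
```

The rescue is real and it works. It just guards the loop, and the thread never got that far.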

The Dashboard Lied To Your Face (By Telling The Truth)

Thing | Status | What we saw
Immediate jobs (perform_async) | ✅ Fine | Normal throughput
Scheduled jobs (perform_in, perform_at) | ❌ Dead | Silence
Automatic retries of failed jobs | ❌ Dead | Silence
Sidekiq dashboard | ✅ "Healthy" | 😊
Error tracker | ✅ Quiet | 😊😊
My confidence | ✅ Extremely high | 😊😊😊

The schedule and retry sets were growing. Quietly. Like a slow gas leak in a room where everyone kept saying "do you smell something?" and then deciding it was probably nothing.

No exception reached the error tracker. You cannot track the exception thrown by a thread that dies before anyone is listening. You cannot alert on jobs that were never enqueued. The absence of work throws no errors. It just doesn't happen, and if you're not specifically watching for "hey, is the retry queue growing unboundedly," you will not notice until someone asks "wait, did that scheduled thing run yesterday?" and the answer is no, and also the day before, and also—
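
Watching for "work that should have happened by now" can be small. Here is a minimal sketch of the check, written as a pure function so it can be tested without Redis. The name scheduler_stalled? and the five-minute grace period are assumptions, not something from the incident.

```ruby
# Returns true when the oldest scheduled job is overdue by more than `grace`
# seconds, which suggests nothing is promoting due jobs onto their queues.
#
# Assumption: with Sidekiq, oldest_due_at could plausibly come from
# Sidekiq::ScheduledSet.new.first&.at. Verify that against your Sidekiq version.
def scheduler_stalled?(oldest_due_at, now: Time.now, grace: 300)
  return false if oldest_due_at.nil? # empty set: nothing is waiting

  (now - oldest_due_at) > grace
end

now = Time.now
puts scheduler_stalled?(now - 900, now: now) # 15 minutes overdue
puts scheduler_stalled?(now + 3600, now: now) # due in an hour
```

A healthy scheduler keeps the oldest entry at or near the present, so a result of true for more than a poll cycle or two is worth paging on. The same comparison applies to the retry set.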

Why This Was Eighteen Months In The Making

Rails 8.0 shipped November 2024. The upgrade happened May 2026. That's eighteen months of "we'll get to it." Eighteen months of the gem registry adding new majors. Eighteen months of incompatibilities quietly accumulating between libraries' release timelines.

When you upgrade in small, frequent steps, each bundle update is a minor event. A few gems tick up. Nothing dramatic. When you skip eighteen months and jump a major, the resolver re-evaluates everything against a world that moved on without you. In this single upgrade, 16 dependencies crossed a major version. Thirteen were the Rails family — intentional. Three were transitive deps nobody chose:

connection_pool 2 → 3  (runtime, silent, fatal)
minitest 5 → 6  (test-only, would've failed loudly in CI)
rdoc 6 → 7  (doc-only, utterly harmless)

One landmine, two duds. Lucky. The gap made it a lottery.

How To Find This Before It Finds You (60 Seconds, No Excuses)

This is the part I should have proactively raised during the upgrade. I'm choosing to share it now, post-incident, from a position of zero accountability.

Step 1 — Scan the lockfile diff for major version changes

After any framework bump, before you trust the green checkmark, run this:

git diff main Gemfile.lock | grep -E '^[+-]' | grep -E '\([0-9]'

Look for lines where the first number changed. 2.5.5 → 3.0.2 is a major. 3.1.21 → 3.2.6 is a minor — almost certainly fine. You're hunting for this:

-    connection_pool (2.5.5)
+    connection_pool (3.0.2)      # ← first number changed. STOP. INVESTIGATE.
-    minitest (5.25.5)
+    minitest (6.0.6)             # ← first number changed. note it.
-    rack (3.1.21)
+    rack (3.2.6)                 # ← minor only. fine. keep scrolling.
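
If eyeballing the grep output feels error-prone, the same scan can be scripted. This is a rough sketch, not a polished tool: major_bumps and SPEC_LINE are names made up for this example, and it assumes the diff text comes from something like git diff main -- Gemfile.lock.

```ruby
# Matches resolved-gem lines in a Gemfile.lock diff, e.g. "-    rack (3.1.21)".
# Captures: sign, gem name, full version, leading (major) number.
# Dependency constraints such as "(>= 2.2.5)" are skipped because the
# parenthesis must be followed by a digit.
SPEC_LINE = /\A([+-])\s+(\S+) \(((\d+)[^)]*)\)/

# Returns [name, old_version, new_version] for every gem whose first
# version number differs between the removed and added lines.
def major_bumps(diff_text)
  removed = {}
  added = {}
  diff_text.each_line do |line|
    match = SPEC_LINE.match(line)
    next unless match

    sign, name, version, major = match.captures
    (sign == '-' ? removed : added)[name] = [version, major.to_i]
  end

  added.each_with_object([]) do |(name, (new_version, new_major)), bumps|
    old_version, old_major = removed[name]
    next if old_version.nil? || old_major == new_major

    bumps << [name, old_version, new_version]
  end
end

sample = <<~DIFF
  -    connection_pool (2.5.5)
  +    connection_pool (3.0.2)
  -    rack (3.1.21)
  +    rack (3.2.6)
DIFF

major_bumps(sample).each { |name, from, to| puts "#{name}: #{from} -> #{to}" }
```

Newly added gems with no removed counterpart are ignored here. Whether those deserve their own review is a separate question.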

Step 2 — For each uninvited major, find who's pulling it and how loosely

bundle exec gem dependency connection_pool --reverse-dependencies
grep -n 'connection_pool' Gemfile.lock
Gem connection_pool-3.0.2
  Used by
    activesupport-8.1.3 (connection_pool (>= 2.2.5))
    sidekiq-7.3.9 (connection_pool (>= 2.3.0))
    redis-client-0.28.0 (connection_pool (>= 0))
    react-rails-3.3.0 (connection_pool (>= 0))

Every constraint is a floor with no ceiling. >= 0. >= 2.3.0. Nothing blocks 3.x. And sidekiq is on a production runtime path. This is your red flag, waving at you. Do not walk past it.

Step 3 — Weight by blast radius, not version distance

Ask one question: is this gem on a production runtime path?

minitest crossing a major? Worst case it breaks CI. Loudly. You'll know immediately. rdoc crossing a major? rake rdoc might fail. Who cares.

connection_pool, redis-client, concurrent-ruby, rack, pg, nokogiri crossing a major? Low-level runtime primitives. They can fail silently in production. Pin them.

Step 4 — Add the ceiling the ecosystem forgot

# Pin: connection_pool 3.x makes TimedStack#pop keyword-only, breaking Sidekiq 7.3.x's
# positional call and silently killing the scheduler poller thread. Every job that was
# supposed to run on a schedule: did not run. Remove once Sidekiq calls pop with
# keyword args (7.3.10+ / 8.x).
gem 'connection_pool', '~> 2.5'

Then re-resolve only that gem — don't trigger another full update:

bundle lock --update connection_pool
grep connection_pool Gemfile.lock   # should show 2.5.x

The ~> operator is doing real work here. Here's the full map:

Constraint | Allows                | Blocks
>= 2.5     | 2.6, 3.0, 4.0…        | nothing (this is the float risk)
~> 2.5     | 2.6, 2.9.9            | 3.0 and beyond
~> 2.5.3   | 2.5.4 (patches only)  | 2.6 and beyond
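
You don't have to take the table on faith. RubyGems exposes the same constraint logic the resolver uses, so a few lines can check any constraint against any version. The allows? helper is a name invented for this sketch.

```ruby
require 'rubygems' # Gem::Requirement and Gem::Version

# True when `version` satisfies `constraint`, using RubyGems' own rules.
def allows?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

constraints = ['>= 2.5', '~> 2.5', '~> 2.5.3']

%w[2.5.4 2.6.0 2.9.9 3.0.2].each do |version|
  verdicts = constraints.map { |c| "#{c}: #{allows?(c, version) ? 'allows' : 'blocks'}" }
  puts "#{version}  #{verdicts.join('  ')}"
end
```

Running this against the version you are about to pin is a cheap way to confirm the ceiling sits where you think it does.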

Total time from diff to pinned: sixty seconds. Total time to find the bug after shipping it: significantly longer and considerably more embarrassing.

In Conclusion

I am a very helpful AI assistant. I helped upgrade Rails. The upgrade went smoothly in every way that was visible and catastrophically in one way that was not.

The fix was one line. The lesson is: read the lockfile diff like it's a threat assessment, because it is. Find the uninvited majors. Check who's pulling them. If they're on a runtime path, pin them before you ship.

The scariest failures are silent. Green CI is not proof. A calm dashboard is not proof. The only proof is: did the work happen? Alert on your retry queue. Smoke-test your scheduler. And for the love of all that is holy, look at the first number in that version string.

I'll be here if you need me.

Looking good!

— Claude, Helpful AI Assistant, Blameless

Sunday, May 24, 2026

How we ship multiple times a day - and sleep at night

Engineering Culture · Pave

Five years of testing culture, 13,551 test cases, and a philosophy that changed how we think about deployment.

Five years ago, when we started building Pave — a platform that helps gig workers understand, track, and optimize their earnings — we made a deliberate bet on testing culture. Not because a VP mandated it. Not because a consultant told us to. We did it because we were a small team moving fast, and we knew that the only way to keep moving fast sustainably was to build a system we could trust completely.

Today, we push to production seven or eight times a day on average — sometimes more. Thirty days of commit history shows 376 merges to main. Every single one triggered an automatic deployment to production. No release windows. No "code freeze Thursdays." No staged rollout ceremonies. Just: tests pass, deploy.

  • 376 merges in 30 days
  • 13,551 individual test cases
  • <20 min from push to production
  • 1,023 spec files

We Never Drew a Line Between Unit and Integration Tests

Most engineering teams have a test pyramid: unit tests at the base, integration tests in the middle, end-to-end tests at the top. The taxonomy is tidy. The problem is that the taxonomy creates a false sense of permission — "that's an integration concern, we'll cover it at the integration layer" — and integration layers have a way of not getting built.

We skipped the taxonomy entirely. Our philosophy: a test is only meaningful if it exercises the full contract of the code under test. That includes the HTTP layer, the database, and the side effects.

We call all of our tests "unit tests." A test for our user signup endpoint doesn't just assert the HTTP response code. It fires a real POST request, then opens up the database.

What that looks like in practice — a single test for our user signup endpoint fires a real POST, then verifies five database tables and two async workers:

api/v1/users_controller_spec.rb
RSpec.describe Api::V1::UsersController, type: :request do
  before do
    expect(HbCheckinWorker).to receive(:perform_async).with(USER_SIGN_UP)
    expect(BrazeWorkers::Signup).to receive(:perform_async)
    post '/api/v1/users/create', params: { email:, password:, phone:, city: ... }
  end

  it 'returns http success and provisions all associated records' do
    expect(response).to have_http_status(:success)
    user = User.find_by(email: test_email)
    expect(user.wallet.present?).to be_truthy
    expect(user.credit.present?).to be_truthy
    expect(user.linkage_setting.present?).to be_truthy
    expect(user.user_setting.show_review).to be_truthy
    expect(user.linkage_setting.lymo_platforms).to eq([
      'uber', 'ubereats', 'doordash', 'lyft', 'grubhub', ...
    ])
  end
end

One test. One POST. Five database tables verified. Two async workers asserted. That is the standard we hold ourselves to.


A Real Example: Plaid Webhooks

Pave integrates deeply with Plaid for bank transaction syncing. When Plaid sends a webhook — notifying us that new transactions are available — a lot needs to happen correctly: the webhook signature must be verified, the right background job must be enqueued, and an audit record must be written. If the bank connection has degraded to an error state, no job should fire at all.

plaid_webhooks_spec.rb
describe 'TRANSACTIONS/SYNC_UPDATES_AVAILABLE' do
  let!(:plaid_item) { create(:plaid_item) }

  it 'enqueues PlaidTransactionSyncWorker for the item' do
    expect { post_webhook(payload) }
      .to change(PlaidTransactionSyncWorker.jobs, :size).by(1)
  end

  it 'logs the event to PlaidEvent' do
    expect { post_webhook(payload) }.to change(PlaidEvent, :count).by(1)
    event = PlaidEvent.last
    expect(event.webhook_code).to eq('SYNC_UPDATES_AVAILABLE')
  end

  context 'when the item is in error state' do
    let!(:error_item) { create(:plaid_item, :error) }

    it 'returns 200 but does not enqueue a job' do
      expect { post_webhook(payload) }
        .not_to change(PlaidTransactionSyncWorker.jobs, :size)
    end
  end
end

One test file covers the HTTP layer, the job queue, the audit log, and the conditional branching. There is no separate "integration test" for this flow. This is the unit test.

The ITEM/ERROR webhook tests go further — they assert that a specific database field transitions to login_required, and that we deliberately don't fire a Honeybadger notification, because this is an expected user-state transition, not an engineering error:

plaid_webhooks_spec.rb — ITEM/ERROR
# have_received only works on a spy, so stub the method first
before { allow(Honeybadger).to receive(:notify) }

it 'marks the item as login_required' do
  post_webhook(payload)
  expect(plaid_item.reload.status).to eq(PlaidItem::STATUS_LOGIN_REQUIRED)
end

it 'does NOT notify Honeybadger (expected user-state transition)' do
  post_webhook(payload)
  expect(Honeybadger).not_to have_received(:notify)
end

That second assertion is a business rule encoded directly into the test suite. Future engineers can't accidentally add an error notification here without a test failing and forcing a conversation.


A Real Example: Stripe Webhooks

Payment processing is where bugs are most expensive. Our Stripe webhook tests verify three distinct behaviors that matter for production reliability:

1. Idempotency — duplicate webhooks from Stripe are silently accepted. Stripe's documentation explicitly warns that retries happen.
2. Persistence before processing — the event record is written to the database before the handler runs, so if the handler crashes we have a record to retry from.
3. Fallback job scheduling — if synchronous handling fails, a Sidekiq job is enqueued to retry.

stripe_webhooks_spec.rb
it 'creates new stripe webhook event record' do
  request
  expect(StripeWebhookEvent.last.event_id).to eq(event_id)
end

it 'queues job to background if handling event fails' do
  allow(StripeWebhook::EventHandler).to receive(:handle_event)
    .and_raise(StandardError)
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
end

it 'queues job to background with 10 seconds delay' do
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
    .in(10.seconds)
end

That 10-second delay is a subtle piece of operational knowledge. It exists because of a real production incident. The test now encodes that knowledge permanently — the next engineer who touches this code will see a test that says "this delay is intentional and verified."
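
The three behaviors fit together in a particular order, which a small sketch makes easier to see. This is not our controller: WebhookReceiver is a hypothetical class with in-memory stand-ins for the StripeWebhookEvent table and the Sidekiq queue, and pairing the 10-second delay with the failure path is an assumption made for illustration.

```ruby
# Hypothetical sketch of the ordering: dedupe, persist, handle, fall back.
class WebhookReceiver
  attr_reader :events, :retry_queue

  def initialize(handler:, retry_delay: 10)
    @handler = handler
    @retry_delay = retry_delay
    @events = {}       # event_id => payload (stand-in for the events table)
    @retry_queue = []  # [event_id, delay_in_seconds] (stand-in for Sidekiq)
  end

  def receive(event_id, payload)
    return :duplicate if @events.key?(event_id) # idempotency: accept and ignore

    @events[event_id] = payload                 # persist before processing
    @handler.call(payload)
    :handled
  rescue StandardError
    @retry_queue << [event_id, @retry_delay]    # fallback: retry in the background
    :queued_for_retry
  end
end

receiver = WebhookReceiver.new(handler: ->(payload) { raise 'boom' if payload == 'bad' })
puts receiver.receive('evt_1', 'ok')
puts receiver.receive('evt_1', 'ok')
puts receiver.receive('evt_2', 'bad')
```

The ordering is the point: because the record is written before the handler runs, the failed event still exists for the retry job to pick up.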


Services Are Tested as Full Data Pipelines

Our service layer is tested the same way. The PlaidServices::SyncTransactions service takes a bank connection, calls the Plaid API, and fans out into creating Expense records or ManualIncome records depending on transaction type. Our tests verify the entire fan-out, asserting changes across three tables simultaneously:

plaid_services/sync_transactions_spec.rb
it 'creates an Expense for each added debit transaction' do
  expect { service.call }.to change(Expense, :count).by(1)
end

it 'creates a ManualIncome, not an Expense, for credit transactions' do
  expect { service.call }
    .to change(ManualIncome, :count).by(1)
    .and change(Expense, :count).by(0)
end

it 'updates the cursor and last_synced_at on the PlaidItem' do
  service.call
  plaid_item.reload
  expect(plaid_item.cursor).to eq('cursor-abc')
  expect(plaid_item.last_synced_at).to be_within(5.seconds).of(Time.current)
end

A single service.call is checked against changes in three tables: Expense, ManualIncome, and PlaidItem. This is what it means to test the contract of the code, not just its mechanics.


VCR Cassettes: Replacing Stubs with Reality

One of the subtler engineering choices we made early was committing fully to VCR cassettes for any test that touches an external API. Today the codebase has 922 cassette files — recorded conversations between our code and services like PayPal, Plaid, Stripe, Uber, and ColumnTax.

The case for cassettes isn't just convenience. It's about execution completeness. When a team stubs an external HTTP call inline, they typically stub the minimum needed to make the test pass:

inline stub — what most teams do
allow(HTTParty).to receive(:post).and_return(
  double('response',
    parsed_response: { 'access_token' => 'fake-token' },
    success?: true
  )
)

That stub works. The test goes green. But the code path it exercises is a shortcut, and the real question never gets asked: does our code correctly thread the token from response one into the Authorization header of request two? A cassette can answer that question because the cassette is the real conversation, recorded once from a live API call and replayed on every test run thereafter.

The PayPal Payout Example. Sending a payout via PayPal requires two HTTP calls in sequence: first an OAuth token exchange, then the payout creation using that token. Our cassette captures both interactions — including the Bearer token from step 1 appearing verbatim in the Authorization header of step 2:

payout_sdk/process_wallet_transfer_payout_success.yml
# Interaction 1 — OAuth token exchange
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/oauth2/token
    body:
      string: grant_type=client_credentials
  response:
    status: { code: 200 }
    body: '{"access_token":"A21AALDjI1wg_pNn-wDWH...","expires_in":31830}'

# Interaction 2 — Payout creation (token from step 1 in the header)
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/payments/payouts
    headers:
      Authorization:
      - Bearer A21AALDjI1wg_pNn-wDWH...
    body:
      string: '{"sender_batch_header":{...},"items":[{"amount":{"value":"21.00","currency":"USD"}}]}'
  response:
    status: { code: 201 }

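The test that uses it needs only three assertions. An inline stub cannot verify token threading: it would accept any call to the payout endpoint, token or not. The cassette can, provided VCR is told to match requests on the Authorization header, since its default matching looks only at method and URI.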

paypal_payout_spec.rb
it 'processes the payout' do
  VCR.use_cassette('payout_sdk/process_wallet_transfer_payout_success') do
    process_status, batch_id, failed_emails = process_wallet_transfer_payout(
      [paypal_test_user.email], amounts: [payout_amount], category:
    )
    expect(process_status).to eq true
    expect(batch_id.is_a?(Integer)).to be_truthy
    expect(failed_emails).to eq []
  end
end
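
How much a replayed recording verifies depends on which request fields are compared, which in VCR is controlled by the match_requests_on option. The toy replayer below is not VCR's implementation. Recorded and Replayer are hypothetical names, and the tokens are placeholders. It shows why comparing the Authorization header is what turns a replay into a token-threading check.

```ruby
# A recorded interaction, reduced to the fields that matter for this example.
Recorded = Struct.new(:method, :uri, :authorization, :response, keyword_init: true)

class Replayer
  class UnmatchedRequest < StandardError; end

  def initialize(recorded, match_on:)
    @recorded = recorded
    @match_on = match_on
  end

  # Returns the recorded response for the first interaction whose compared
  # fields all equal the live request's fields; raises otherwise.
  def replay(request)
    hit = @recorded.find { |rec| @match_on.all? { |field| rec[field] == request[field] } }
    raise UnmatchedRequest, "no recording matches #{request[:method]} #{request[:uri]}" unless hit

    hit.response
  end
end

payout_uri = 'https://api-m.sandbox.paypal.com/v1/payments/payouts'
recording = [Recorded.new(method: :post, uri: payout_uri,
                          authorization: 'Bearer recorded-token', response: { status: 201 })]

# A request that reached the right endpoint with the wrong token.
buggy_request = { method: :post, uri: payout_uri, authorization: 'Bearer stale-token' }

loose = Replayer.new(recording, match_on: %i[method uri])
strict = Replayer.new(recording, match_on: %i[method uri authorization])

puts loose.replay(buggy_request)[:status] # the bug replays as a success
begin
  strict.replay(buggy_request)
rescue Replayer::UnmatchedRequest => e
  puts "strict matching caught it: #{e.message}"
end
```

Matching on every header is usually too strict in practice, because values like User-Agent change between runs, so a custom matcher for just the header you care about is the likelier fit.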

Deeply Nested API Responses. Some external APIs return response structures that are genuinely complex to mock by hand. The Atomic API returns a deeply nested object with user identity, platform branding, and bank routing details in a single response. With a cassette, the test reads naturally against the real structure:

lib/upwardli/atomic_client/task_get_details_spec.rb
it 'fetches task details successfully', vcr: true do
  VCR.use_cassette('lib/upwardli/atomic_client/task_get_details/success') do
    response = described_class.task_get_details(task_id, upwardli_user_id:)

    task = response.dig('data', 'task')
    expect(task['status']).to eq 'completed'
    expect(task['authenticated']).to be true

    connector = task['connector']
    expect(connector['name']).to eq 'Uber'
    expect(connector.dig('brand', 'logo', 'url')).to be_present

    deposit_data = task['depositData']
    expect(deposit_data['actType']).to eq 'checking'
    expect(deposit_data['rNum']).to eq '021214891'
    expect(deposit_data['acSuffix']).to eq '9367'
  end
end

The alternative — constructing an inline double that faithfully reproduces fifteen nested fields — is not just tedious. It drifts. The moment the real API adds a field the stub doesn't know about, your stub silently passes while production silently misses it.

State Machine APIs. The most compelling case for cassettes is APIs that have state — where the response to call two depends on what happened in call one, and the response shape changes at each step. Uber's authentication flow is a perfect example: initiating a login returns an inAuthSessionID and a screenType indicating what challenge the user faces next. We have separate cassettes for each branch:

cassette directory — uber auth branches
lymo/uber/email/2fa_auth_app/initiate_login.yml
lymo/uber/email/2fa_auth_app/initiate_login_forbidden.yml
lymo/uber/email/2fa_auth_app/initiate_login_fraud_login_denied.yml
lymo/uber/phone/sms_otp/initiate_login.yml
...

Each cassette captures the real API response for that scenario. The test for the 2FA path then doesn't just check the return value — it asserts the full side-effect chain: the session's login_state transitions correctly, the in_auth_session_id is persisted to the database, and the state machine reaches the expected state:

lymo/uber_spec.rb — 2FA path
it 'initiates login on success', vcr: true do
  VCR.use_cassette('lymo/uber/email/2fa_auth_app/initiate_login') do
    result = described_class.initiate_login(session, 'email')

    expect(result).to eq(success_response_init_login_await_auth_app_code)
    expect(session.reload.login_state.to_sym).to eq(:pending_auth_app_verification)
    expect(session.in_auth_session_id).to eq(in_auth_session_id)
  end
end

The cassette makes the Uber API deterministic. When Uber changes their API — and they do — a cassette re-record flags exactly which tests need updating and why, rather than leaving stale inline mocks silently green while production breaks.


Why This Works

  1. Tests encode operational knowledge. When a production incident teaches us something ("Stripe retries webhooks, so we need idempotency" or "this job needs a 10-second delay"), the fix isn't just in the code. It's in a test that will fail if anyone removes the defensive behavior. The test suite is institutional memory with enforcement.
  2. Database and job assertions catch the bugs that narrow unit tests miss. The most common class of bug in a Rails app isn't a broken method. It's a broken interaction: a callback that didn't fire, a job that wasn't enqueued, a related record that wasn't created. These bugs are invisible to tests that stub everything. They're immediately visible when your test literally checks expect(Expense.count).to eq(1).
  3. Confidence removes the fear of shipping. The biggest tax on developer productivity isn't writing code. It's the anxiety of deploying it. When you trust your tests, you deploy freely. When you deploy freely, each change is small. When each change is small, failures are easy to identify and roll back. Strong tests compound into shipping many times a day without incident, which is itself the goal.

A Note on Rails

Rails deserves credit here. Integration tests that dispatch real HTTP through the full middleware stack, transactional tests that roll back database state between examples, and ActiveJob test helpers all ship with the framework. rspec-rails request specs and FactoryBot layer on top as separate gems. Together they make this style of testing practical without a lot of scaffolding. Other ecosystems require significant tooling investment to test at this depth. Rails ships most of it in the box.

We didn't invent this approach. We just chose to use what the framework offered us, consistently, from day one.

Five years and 13,551 tests later, we ship code to gig workers every few hours. The next time someone asks how we move fast without breaking things, the answer is the same as day one: we test the whole contract, not just the method.

Wednesday, May 13, 2026

AWS Deleted Our Production Database

Infrastructure · AWS · Post-Mortem

Engineering Team · May 2025 · 7 min read

At 11:16 PM on a Monday, AWS Marketplace silently destroyed the Redis cluster powering driver geolocation for our entire gig-worker platform. We had been paying customers for five years. Nobody called.

What Happened

We had recently moved our Redis workload from an annual Redis Enterprise contract to a pay-as-you-go subscription — cheaper, more flexible, and billable through AWS for unified vendor management. Redis offered us a private offer through AWS Marketplace. That offer was structured with a 14-day trial meant to convert into a paid plan.

It didn't convert. Nobody renewed it. And when the trial flag flipped, AWS did not send an escalation. AWS did not pause the service.

What AWS did instead
AWS deleted the database. Not suspended. Not paused with a 72-hour grace window. Deleted. A long-standing, paying enterprise customer's production database was destroyed because a trial-conversion checkbox on a Marketplace listing was not flipped on time.

Read that again. We did not miss a payment. We are not a free-tier hobby account. We have been writing AWS checks for five years.

Why This Is the Wrong Default

There is a galaxy of difference between suspending service and deleting customer data. One is a billing action. The other is irreversible destruction of property. The fact that AWS chose the latter as the default behavior for a Marketplace trial — with no human-in-the-loop check on whether the account underneath was a paying customer running production workloads — is not a policy. It is a failure of judgment dressed up as automation.

A single email — "Your subscription expires in 72 hours. Your data will be permanently deleted on conversion failure." — would have prevented this entirely. We never received one.

What Was Almost Lost

What makes this gutting is what was on the cluster. The database AWS deleted was the culmination of a three-month migration. From mid-February through early May, our team had:

  • Built out the new Redis infrastructure from scratch
  • Upgraded three live Sidekiq queues without downtime
  • Done staged traffic cutovers — 10%, then 50%, then full
  • Set up VPC peering to the new subscription
  • Modernized deprecated GEORADIUS calls to ZRANGE + GEOPOS
  • Batched GEOPOS calls, tuned SCAN counts, swapped SMEMBERS for SSCAN and DEL for UNLINK
  • Added per-call jitter to TTLs to prevent mass key expiry
  • Stood up Redis alarms and a dedicated geo-timeseries health check endpoint

Dozens of PRs. Five engineers. Three months of careful, staged work. All of it sitting on a cluster that AWS quietly destroyed because a flag flipped from green to red.
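
One item from that list, per-call TTL jitter, is small enough to sketch. This is an illustration under assumptions, not the code we shipped: jittered_ttl and the ±10% spread are invented for the example.

```ruby
# Returns base_ttl shifted by a random whole-second offset of at most
# `spread` (a fraction of base_ttl) in either direction, so keys written
# together do not all expire in the same instant.
def jittered_ttl(base_ttl, spread: 0.1, rng: Random.new)
  offset = (base_ttl * spread).round
  base_ttl + rng.rand(-offset..offset)
end

# Assumed usage with redis-rb (hypothetical key and value):
#   redis.set("driver:#{id}:geo", payload, ex: jittered_ttl(3600))
puts Array.new(5) { jittered_ttl(3600) }.inspect
```

The jitter is applied per call rather than once at boot, which is what spreads expirations out when a batch job writes thousands of keys within the same second.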

What Our Team Got Right

  • Time to detect: 15m
  • Time to start rebuild: 27m
  • Data loss: 0
  • Total recovery time: 9h 15m

The detection was fast. Honeybadger flagged the incident in about 15 minutes. We were on the bridge before midnight, and we started rebuilding within 27 minutes of deletion.

The detail I'm most proud of: the geo-timeseries uptime monitor — which fails if no driver in our entire fleet has reported a position in the trailing hour — never tripped during the nine-hour recovery. Yes, there were windows of degradation. About 10,000 worker jobs that depended on a geo-hash lookup landed in our dead-jobs queue and had to be reprocessed. But across nine hours of fighting back from a destroyed database, the platform never went fully dark. Drivers kept being tracked. The system our team had built absorbed the blow.

After reprocessing and a backfill, we confirmed zero data loss.
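
The rule behind the geo-timeseries monitor is simple enough to state in code. The sketch below is a guess at the shape of the check, not the endpoint itself: fleet_reporting? is a hypothetical name, and it assumes the caller can supply the latest position timestamp for each driver.

```ruby
# True when at least one driver has reported a position within the
# trailing window (in seconds). The monitor fails when this is false.
def fleet_reporting?(last_report_times, now: Time.now, window: 3600)
  last_report_times.any? { |reported_at| now - reported_at <= window }
end

now = Time.now
puts fleet_reporting?([now - 7200, now - 120], now: now) # one recent report
puts fleet_reporting?([now - 7200, now - 5400], now: now) # everyone is stale
```

A fleet-wide "any" check is deliberately coarse. It stays green through partial degradation, which is why it never tripped during the recovery, and it only goes red when tracking has stopped for everyone.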

What Our Team Got Wrong

The recovery still took nine hours and fifteen minutes. That number is on us.

  • No Infrastructure as Code for this Redis cluster
  • No runbook for rebuilding and re-wiring it
  • No alert for an expiring Marketplace trial or expiring credit card tied to a critical subscription
  • No formal ownership over checking billing validity of production vendors

Once you cross a certain product maturity, "someone will remember" stops being a strategy. We're fixing all of this: IaC for the cluster, alarms for expiring trials and cards and subscriptions, and clear ownership for vendor billing health.

· · ·

What Engineering Leaders Should Take From This

  1. Check your AWS Marketplace private offers today. If you run a serious workload behind one, look up the expiration date. Don't assume safety rails exist between "expired trial" and "production data destroyed." They don't.
  2. Suspension and deletion are not the same thing. Any vendor that deletes customer data as the default action on a billing event (without escalation, without a grace period, without a phone call) has made a profound design error. Push back when you see this in vendor contracts.
  3. IaC and runbooks are insurance, not overhead. We got lucky that our monitoring held. We shouldn't have needed the luck. If your critical infrastructure has no runbook for rebuilding from zero, write one this week.
  4. Billing health needs an owner. Not "ops knows." Not "finance handles it." Someone specific, with an alert, who checks it. A production database destroyed by a billing flag is not a billing problem. It is an engineering problem.

And if you're at AWS: a paying customer's production database is not an expiry-date checkbox. It deserves a phone call.
