Shipping with Claude: What a Production Incident Taught Us About LLMs and Engineering Fundamentals

We've been migrating our Rails backend from encrypted credentials files to Chamber — a more operationally flexible approach to secret management that lets us rotate and audit secrets. It was a deliberate, multi-PR migration, and we used Claude as a pairing partner throughout. The migration went smoothly right up until it didn't. Here's what happened, what we missed, and what we now see in our own codebase as a result.

Why we migrated in the first place

Rails encrypted credentials solve a real problem: secrets don't live in plaintext on disk. But they introduce a different problem that only becomes painful at scale — you cannot review changes to them.

When a developer updated a credential, the PR diff looked like this:

config/credentials/production.yml.enc
@@ -1 +1 @@
-UC84OmUIw4yscrhH+RHdY1F+FIodrBqcr9dawCAzpcqU3bmUGre898PmiOzr61s8mRbDGbgesoDn6RDX38r
-tdSKI8y544h8jLcEKx2cKkPN9wchYS/nVH1ONPVbZFaAg9wSeNjOuLONPImiKFcFLaWPJH32MP0v4R5YP
-uatJ3alZ9l40CjqUm0c5c/9O+jd7EcDuwzl/X/3WuZ93z1ylJ1cp8oKcnsOq39MJNj3DK48rymsuqvgy5
-CgEMv0QgxCWuRb7Ss61f/vV3VBxoyXPtLghnapvUQcqdJXj5VHSrwzZGoBKjB64Aw7+frcy4pHJ6p7CMd
-0advbhFD5hhfkJkFmOoHJz1RXMYRLgBcSAv9vAOpAqGct/1FPudP6ZNgYm/YbTp/MrxllgEqI+L3u1OnL
+mRbDGbgesoDn6RDX38rtdSKI8UC84OmUIw4yscrhH+RHdY1F+FIodrBqcr9dawCAzpcqU3bmUGre898Pmi
+Ozr61y544h8jLcEKx2cKkPN9wchYS/nVH1ONPVbZFaAg9wSeNjOuLONPImiKFcFLaWPJH32MP0v4R5YPu
+atJ3alZ9l40CjqUm0c5c/9O+jd7EcDuwzl/X/3WuZ93z1ylJ1cp8oKcnsOq39MJNj3DK48rymsuqvgy5C
+gEMv0QgxCWuRb7Ss61f/vV3VBxoyXPtLghnapvUQcqdJXj5VHSrwzZGoBKjB64Aw7+frcy4pHJ6p7CMd0
+advbhFD5hhfkJkFmOoHJz1RXMYRLgBcSAv9vAOpAqGct/1FPudP6ZNgYm/YbTp/MrxllgEqI+L3u1OnL9
 [... 7KB ...]

One line in, one line out. The entire file is a single blob of ciphertext that gets re-encrypted every time any value inside changes. A reviewer looking at this diff cannot tell:

Which key was added, removed, or modified
Whether the change applied to the right environment
Whether an existing key was accidentally dropped during re-encryption
Whether the value being set is structurally correct

The PR description is the only source of truth, and only as trustworthy as the author's summary. Approving a credentials change is an act of faith, not review.

It compounds further when two engineers touch unrelated keys in the same file. Because the entire file re-encrypts as a single blob, any two concurrent changes produce a merge conflict that is literally unresolvable without the master key — and even then, the "conflict" is invisible at the diff level. There is no way to see whose change is present, whose was lost, or whether both survived.

Compare that to the equivalent change expressed as a Chamber migration — moving Twilio's credentials out of production.yml.enc and into config/settings/twilio.yml:

# config/environments/production.rb
   config.twilio = {
-    verification_service_id: ENV['TWILIO_VERIFICATION_SERVICE_ID'] ||
-        Rails.application.credentials.config.dig(:twilio, :verification_service_id),
-    account_sid: ENV['TWILIO_ACCOUNT_SID'] ||
-        Rails.application.credentials.config.dig(:twilio, :account_sid),
-    auth_token: ENV['TWILIO_AUTH_TOKEN'] ||
-        Rails.application.credentials.config.dig(:twilio, :auth_token)
+    verification_service_id: ENV['TWILIO_VERIFICATION_SERVICE_ID'] || Chamber.dig(:twilio, :verification_service_id),
+    account_sid:              ENV['TWILIO_ACCOUNT_SID']             || Chamber.dig(:twilio, :account_sid),
+    auth_token:               ENV['TWILIO_AUTH_TOKEN']              || Chamber.dig(:twilio, :auth_token)
   }

# config/settings/twilio.yml (new file)
+default:
+  twilio: &default
+    _secure_verification_service_id: qcex...
+    _secure_account_sid:             HXeg...
+    _secure_auth_token:              pPZA...
+
+development:
+  twilio:
+    _secure_verification_service_id: vZtC...
+    _secure_account_sid:             gwhj...
+    _secure_auth_token:              UZQo...
+
+staging:
+  twilio:
+    _secure_verification_service_id: OH7L...
+    _secure_account_sid:             CTwp...
+    _secure_auth_token:              ajul...
+
+production:
+  twilio:
+    _secure_verification_service_id: TTFX...
+    _secure_account_sid:             IcZ1...
+    _secure_auth_token:              Axyj...

The values are still encrypted — nobody can read the secrets from the diff. But a reviewer can now verify that all three keys exist for all four environments, that production has its own separate values, that the application code reads them in the right order (ENV override → Chamber), and that no key was silently dropped. That's a reviewable change.

That was the core motivation. Secret management you can actually audit.
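
The reviewer's check described above (every key present in every environment) can also be automated. What follows is a minimal sketch, not something from our codebase: missing_keys_by_environment and SAMPLE_SETTINGS are hypothetical names, the values are placeholders, and the only assumption about the file is the environment → namespace → key layout shown in the diff.

```ruby
require "yaml"

# Hypothetical helper: given the text of a settings file laid out as
# environment -> namespace -> keys, report which keys each environment lacks
# relative to the union of keys seen across all environments.
def missing_keys_by_environment(yaml_text, namespace)
  settings = YAML.safe_load(yaml_text, aliases: true) || {}

  keys_per_env = settings.transform_values do |env_settings|
    env_settings.to_h.fetch(namespace, nil).to_h.keys
  end
  all_keys = keys_per_env.values.flatten.uniq

  keys_per_env
    .transform_values { |keys| all_keys - keys }
    .reject { |_env, missing| missing.empty? }
end

# Placeholder values only; production is deliberately missing one key.
SAMPLE_SETTINGS = <<~YAML
  default:
    twilio:
      _secure_account_sid: placeholder-1
      _secure_auth_token: placeholder-2
  production:
    twilio:
      _secure_account_sid: placeholder-3
YAML

p missing_keys_by_environment(SAMPLE_SETTINGS, "twilio")
```

In this sample the helper reports that production lacks _secure_auth_token. A check like this could run in CI, but it only compares key names; it says nothing about whether the encrypted values are correct.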

The migration

Pave runs a Rails 8.0 API backend. The migration happened across several PRs, moving one integration at a time: Stripe, Redis, Twilio, Braze, the database connection URLs, and finally the active_record encryption keys. Each followed the same pattern. Before:

Stripe.api_key = Rails.application.credentials.dig(:stripe, :api_key)

After:

# Chamber has no decryption key (CHAMBER_KEY) in test and raises on access;
# skip it there (Stripe is stubbed in specs).
Stripe.api_key = ENV['STRIPE_API_KEY'] || (Rails.env.test? ? nil : Chamber.dig(:stripe, :api_key))

The Rails.env.test? ? nil : Chamber.dig(...) guard was necessary: test and CI environments don't have a CHAMBER_KEY, so calling Chamber.dig at boot raises Chamber::Errors::DecryptionFailure. Every initializer got this guard. Claude co-authored most of this work. The PRs were clean, the pattern was consistent, and the earlier changes deployed without incident.

The last PR

PR #7245 was the finish line: retire the encrypted credentials files entirely. It migrated the database connection URLs and active_record.encryption keys to Chamber, then deleted production.yml.enc, staging.yml.enc, and feature.yml.enc. Reviewed, merged.

Fifty-three minutes later, we had 431 ArgumentError: Missing master key errors in production, all coming from Api::V1::UserEventsController#create.

The culprit was this line in the controller:

current_user&.user_setting&.update!(user_ip: request.remote_ip) if current_user&.user_setting.present?

user_ip is a Lockbox-encrypted attribute on UserSetting:

class UserSetting < ApplicationRecord
  has_encrypted :device_ip
  has_encrypted :user_ip

  belongs_to :user
end

Lockbox's default key lookup, when no explicit initializer sets the key, falls through to Rails.application.credentials.lockbox[:master_key]. There was no config/initializers/lockbox.rb in the codebase — Lockbox had been reading from credentials silently, by convention, the whole time. When we deleted the credentials files, that implicit dependency snapped.

PR #7285 — the hotfix — added the missing initializer:

Lockbox.master_key = Rails.env.test? ? nil : Chamber.dig(:lockbox, :master_key)

Done in minutes. But in the pressure of the moment, this used the same guard pattern that was already everywhere. And that guard quietly created a new problem.

What Claude missed

Claude co-authored PR #7245. The PR summary noted that "Development and test environments are unaffected — they use local DB config with no Chamber calls." That was true for the database config. But neither Claude nor the human reviewer connected that observation to Lockbox, which had no explicit initializer and therefore no Chamber call that anyone could see to guard.

This is worth saying plainly: Claude didn't catch it. Nothing in the diff was wrong. The deletions of the .yml.enc files were the stated goal. There was no failing test, no linting rule, no static analysis warning that could have surfaced an implicit gem-level dependency on a file that was about to disappear.

We're not saying this to criticise the tool — we kept using it through the incident and the subsequent fix, and it was genuinely helpful. We're saying it because the failure mode matters: LLMs reason about what's in the diff and the context window. Implicit dependencies — framework defaults, convention-over-configuration gem behaviours, transitive lookup chains that were never written down — are precisely what an LLM is likely to miss. The Lockbox key was never written anywhere in the code we touched. That's exactly why it wasn't caught.

The practical upshot: treat Claude like a very fast, very capable engineer who hasn't yet internalised your codebase's hidden contracts. Pair it with the kind of review that asks "what else reads from this file?" before deleting it.
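
The "what else reads from this file?" question can be partly mechanised. The sketch below is an assumption about how such a check might look, not a tool we shipped: files_referencing is a hypothetical helper that lists every Ruby file under a set of directories mentioning a pattern, so implicit readers can surface before the file they depend on is deleted.

```ruby
require "find"

# Hypothetical pre-deletion check: return "path:line: source" for every line
# in a .rb file under the given roots that matches the pattern.
def files_referencing(pattern, roots)
  matches = []
  roots.each do |root|
    next unless File.directory?(root)

    Find.find(root) do |path|
      next unless File.file?(path) && path.end_with?(".rb")

      File.foreach(path).with_index(1) do |line, number|
        # scrub guards against invalid byte sequences in vendored files
        text = line.scrub
        matches << "#{path}:#{number}: #{text.strip}" if text.match?(pattern)
      end
    end
  end
  matches
end

# Example call (not executed here): scan the application plus loaded gems.
#   gem_dirs = Gem.loaded_specs.values.map(&:full_gem_path)
#   puts files_referencing(/credentials/, ["app", "config", "lib", *gem_dirs])
```

Including the gem directories is the important part. The application code in our case contained no reference to the Lockbox key at all, so a scan limited to app and config would have come back clean. A text search is still only a heuristic: it finds the word, not the lookup chain, so each hit needs a human to read it.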

The central lesson: test bypasses that look like pragmatism

Here is the line from the hotfix that made the incident survivable but embedded the longer-term problem:

Lockbox.master_key = Rails.env.test? ? nil : Chamber.dig(:lockbox, :master_key)

Look at it from a test's point of view. With nil as the master key, Lockbox doesn't raise — it silently skips encryption. Attributes get stored as plaintext in their _ciphertext columns and read back the same way. Every test that touches an encrypted field passes. And we already had a request spec that appeared to cover this code path:

context 'when valid event is passed' do
  let(:params) { { event_name: 'user_sign_up', session_id: session_id } }

  it { is_expected.to be 200 }

  it 'saves the user ip in user settings' do
    expect(user.reload.user_ip).to be
  end
end

This test passed before the incident (Lockbox stored user_ip as plaintext, readable). It passed after the hotfix (same). It would pass today if the production key were accidentally set to nil. The test was measuring persistence, not encryption. With a nil master key, those two things are indistinguishable.
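
To make the distinction concrete, here is a toy model of the behaviour described above. It is not Lockbox and makes no claim about Lockbox internals: ToyEncryptedField is a hypothetical class that encrypts with AES-256-GCM when it has a key and stores the value as-is when the key is nil. The point is only that a persistence check cannot tell the two apart.

```ruby
require "openssl"

# Toy stand-in for an encrypted attribute, NOT Lockbox itself. With a nil key
# it stores the plaintext unchanged, mirroring the bypass described above.
class ToyEncryptedField
  attr_reader :ciphertext

  def initialize(key)
    @key = key
  end

  def write(plaintext)
    @ciphertext = @key ? encrypt(plaintext) : plaintext
  end

  def read
    @key ? decrypt(@ciphertext) : @ciphertext
  end

  private

  def encrypt(plaintext)
    cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
    cipher.key = @key
    iv = cipher.random_iv
    data = cipher.update(plaintext) + cipher.final
    # Layout: 12-byte IV, 16-byte auth tag, then the encrypted data.
    [iv + cipher.auth_tag + data].pack("m0")
  end

  def decrypt(encoded)
    raw = encoded.unpack1("m0")
    cipher = OpenSSL::Cipher.new("aes-256-gcm").decrypt
    cipher.key = @key
    cipher.iv = raw[0, 12]
    cipher.auth_tag = raw[12, 16]
    (cipher.update(raw[28..]) + cipher.final).force_encoding(Encoding::UTF_8)
  end
end

[nil, "k" * 32].each do |key|
  field = ToyEncryptedField.new(key)
  field.write("1.2.3.4")
  persisted = !field.read.nil?               # what the request spec checks
  encrypted = field.ciphertext != "1.2.3.4"  # what it never checked
  puts "key=#{key ? 'set' : 'nil'} persisted=#{persisted} encrypted=#{encrypted}"
end
```

Both iterations report persisted=true. Only the second reports encrypted=true. An assertion on the read value alone passes in both cases, which is exactly the position our request spec was in.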

Why engineers write this code

It is easy to read that guard and think "obvious mistake." It is harder to explain why experienced engineers keep writing it. There are two honest reasons.

Time pressure is the visible one. Under incident pressure — hundreds of errors per minute in production — the path to fixing the immediate breakage is all that matters. The hotfix author added the guard because every other Chamber initializer in the codebase already had it. Consistency under pressure is not irrational. But each time the pattern is copied without questioning it, the next gap becomes slightly harder to see.

Complexity and incomplete mental models are the more insidious reason, and the one that actually applied here. PR #7245 was not written under incident pressure. It was a planned migration, reviewed carefully. The Lockbox dependency was missed because the engineer genuinely did not know that Lockbox was reading from credentials at all: there was no explicit initializer, no comment, nothing to grep for. When you don't know what a subsystem depends on, you don't know what your test environment is silently bypassing.

This is the more dangerous case. The engineer is not cutting corners. They believe nil is equivalent to a key for test purposes — that "test doesn't need this" is a true statement. For some things it is. For an encryption key, it is not: nil doesn't mean "use a test key," it means "skip the encryption entirely."

Not all test forks are the same

The broader audit this incident prompted surfaced several Rails.env.test? forks in our initializers. They are not all the same problem.

# devise.rb
config.stretches = Rails.env.test? ? 1 : 12

# 1_redis.rb
$redis_aws = Rails.env.test? ? Test::MockRedisEnhanced.new : Redis.new(url: Chamber.dig(:redis, :redis_aws_url))

# stripe.rb
Stripe.api_key = ENV['STRIPE_API_KEY'] || (Rails.env.test? ? nil : Chamber.dig(:stripe, :api_key))

# lockbox.rb (before fix)
Lockbox.master_key = Rails.env.test? ? nil : Chamber.dig(:lockbox, :master_key)

The useful question is: does the test code path exercise equivalent behaviour to production?

Bcrypt stretches: acceptable. Running bcrypt with 1 stretch instead of 12 in test is a well-documented practice; Devise's generated initializer ships with this exact fork. The hash is still computed and the algorithm is identical. The fork reduces test runtime by several seconds without changing what is being tested.

MockRedis: defensible. MockRedisEnhanced implements the Redis command interface in memory. Tests verify the same application logic; they just don't make network calls. The gap is that MockRedis and Redis are not identical in every edge case, but for the operations under test the equivalence holds. Network isolation is a legitimate reason to use a test double.

Stripe nil key: ambiguous. Stripe calls in tests are all stubbed with VCR cassettes or allow/receive mocks, so the nil key never causes a real API call. The nil is effectively inert. But there's also no spec asserting that the Stripe configuration itself is valid, so if someone misconfigures the Chamber key, nothing in the test suite would catch it before a real charge fails in production.

Lockbox nil key: not acceptable. nil does not replace an encryption key. It changes what the application does: encrypt-and-store becomes store-as-plaintext. A test that passes with nil is not testing Lockbox at all.

The decision rule: if nil changes behaviour rather than just routing around infrastructure, the fork is hiding a gap. A mock Redis or a stubbed HTTP client exercises the same logic with a different transport. A nil encryption key turns off the encryption logic entirely.

The fix pattern

The Lockbox fix sidesteps Rails.env.test? entirely. Rails environment files load before initializers, so we set an env var in test.rb — right alongside the existing active_record.encryption test keys:

# config/environments/test.rb

config.active_record.encryption.primary_key         = 'ZEjSqIThOjtppHyOdsPZzReeEhUmG1mH'
config.active_record.encryption.deterministic_key   = 'lv4L60ar5hfA9fZ9U33W6YQ3RvFuDxCm'
config.active_record.encryption.key_derivation_salt = 'Io4oojz3zpY2KoIxdIZmWJlh9CUtwf7y'
ENV["LOCKBOX_MASTER_KEY"] ||= "0" * 64  # real dummy key — exercises actual Lockbox code path

The initializer becomes:

# config/initializers/lockbox.rb

Lockbox.master_key = ENV["LOCKBOX_MASTER_KEY"] || Chamber.dig(:lockbox, :master_key)

No Rails.env.test?. In test, the env var provides a real (dummy) key and Chamber.dig is never called. In production, the env var is not set, so Chamber provides the real key. The same code path executes in every environment.

The companion spec makes future encryption bypass detectable:

# spec/models/user_setting_spec.rb

describe "Lockbox encrypted attributes" do
  let(:user) { create(:user) }

  it "encrypts user_ip at rest and round-trips correctly" do
    setting = user.user_setting
    setting.update!(user_ip: "1.2.3.4")
    expect(setting.user_ip_ciphertext).not_to eq("1.2.3.4")  # proves encryption ran
    expect(setting.reload.user_ip).to eq("1.2.3.4")           # proves decryption works
  end

  it "encrypts device_ip at rest and round-trips correctly" do
    setting = user.user_setting
    setting.update!(device_ip: "5.6.7.8")
    expect(setting.device_ip_ciphertext).not_to eq("5.6.7.8")
    expect(setting.reload.device_ip).to eq("5.6.7.8")
  end
end

The first assertion in each example is the one that exposes the bypass. With a nil master key and plaintext storage, user_ip_ciphertext equals "1.2.3.4" and the spec fails. With a real key, the column holds opaque ciphertext (Lockbox Base64-encodes it by default), which is proof that Lockbox actually ran.
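
One gap remains in the initializer shown earlier in this section: if the env var is unset and Chamber also yields nothing, the assignment still hands Lockbox a nil. A possible guard is sketched below. resolve_secret and MissingSecretError are hypothetical names, not part of Chamber or Lockbox, and unlike the plain || form this sketch also treats an empty string as missing.

```ruby
# Hypothetical helper: resolve a secret from an environment variable first,
# then from a fallback block, and refuse to return nil or an empty string
# in any environment.
class MissingSecretError < StandardError; end

def resolve_secret(env_name, env: ENV)
  value = env[env_name]
  value = yield if (value.nil? || value.empty?) && block_given?

  if value.nil? || value.to_s.empty?
    raise MissingSecretError, "no value for #{env_name} from ENV or fallback"
  end

  value
end

# Possible use in config/initializers/lockbox.rb (not executed here):
#   Lockbox.master_key = resolve_secret("LOCKBOX_MASTER_KEY") do
#     Chamber.dig(:lockbox, :master_key)
#   end
```

The design choice is to fail at boot rather than at the first encrypted write. Whether that trade-off suits a given service is a judgment call; a boot failure is louder, but it also blocks deploys for a secret that only one code path uses.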

The audit

The incident prompted us to grep our initializer directory systematically:

grep -rn "Rails\.env\.test?" config/initializers/

Results:

config/initializers/stripe.rb:2:   Stripe.api_key = ENV['STRIPE_API_KEY'] || (Rails.env.test? ? nil : Chamber.dig(:stripe, :api_key))
config/initializers/sidekiq.rb:10: if Rails.env.test? # redis is mocked in test anyway
config/initializers/devise.rb:127: config.stretches = Rails.env.test? ? 1 : 12
config/initializers/1_redis.rb:5:  $redis_general = Rails.env.test? ? MockRedis.new : Redis.new(...)
config/initializers/1_redis.rb:6:  $redis_auth    = Rails.env.test? ? MockRedis.new : Redis.new(...)
config/initializers/1_redis.rb:7:  $redis_aws     = Rails.env.test? ? Test::MockRedisEnhanced.new : Redis.new(...)
config/initializers/lockbox.rb:1:  Lockbox.master_key = ... # fixed

For each one, we're applying the same question: what is this test actually asserting, and is the answer "that the real thing works" or "that nil doesn't crash"? Where the answer is the latter, we apply the fix pattern — give test a real dummy value, keep the same initializer logic, add an assertion that proves the mechanism ran.

The goal is not to make every test environment identical to production. Tests need to be fast and isolated. The goal is narrower: when a test passes, it should be evidence that the production code path ran — not evidence that nil is harmless.
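
The first pass of that triage can be scripted. This is a rough sketch under stated assumptions: classify_test_forks is a hypothetical helper, and its regex only recognises the one-line ternary form (Rails.env.test? ? nil : ...), so block-style forks land in the second bucket and still need reading by hand.

```ruby
# Hypothetical audit helper: split Rails.env.test? forks into those that hand
# test a nil and those that hand it something else (a mock, a cheaper setting).
NIL_FORK = /Rails\.env\.test\?\s*\?\s*nil\b/
ANY_FORK = /Rails\.env\.test\?/

def classify_test_forks(dir)
  report = { nil_forks: [], other_forks: [] }

  Dir.glob(File.join(dir, "**", "*.rb")).sort.each do |path|
    File.foreach(path).with_index(1) do |line, number|
      next unless line.match?(ANY_FORK)

      bucket = line.match?(NIL_FORK) ? :nil_forks : :other_forks
      report[bucket] << "#{path}:#{number}"
    end
  end

  report
end

# Example call (not executed here):
#   pp classify_test_forks("config/initializers")
```

Everything in nil_forks is a candidate for the fix pattern. Everything in other_forks still deserves the equivalence question, but a mock or a reduced cost factor is at least doing something in place of the real thing.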

What we're taking forward

Explicit over implicit, always. Lockbox worked for months without anyone explicitly configuring it. That was a liability. Every gem that touches security should have an explicit initializer that makes its key source visible. If you cannot grep for where the key comes from, someone will delete it.

nil is not a test double. A mock provides equivalent behaviour through a different mechanism. A nil disables the behaviour. The difference is a test that gives you confidence versus a test that gives you false confidence.

Complexity is a more dangerous bypass trigger than time pressure. Time pressure is visible — everyone knows the engineer is cutting corners. Incomplete mental models are invisible — the engineer genuinely believes nil is equivalent. Standard review asks "does this look right?" The review this incident called for is "what does nil actually do to this subsystem?" That needs to be a habitual question, not a post-mortem one.

LLMs shift the bottleneck without removing it. Claude helped us write and migrate code faster throughout this project. What it did not do is hold the complete mental model of the system — the knowledge that a particular gem reads from credentials by convention, that deleting a file might snap a dependency that was never written down. That knowledge lives in engineers who have read the source, survived the previous incident, or thought carefully enough to ask the right question before merging. Building faster with AI makes that judgment more valuable, not less.

The Chamber migration is complete. The audit is underway. And we have a new heuristic for every initializer we write from here: if test gets nil where production gets a real value, the test is probably not testing what you think it is.

Thushara Wijeratna, WorkSolo Engineering


Wednesday, June 17, 2026
