Thursday, June 04, 2026

The Almost-Mythos Model Couldn't Read 40 Lines of My Rails

I pay for the frontier. The model I was using was Claude Opus 4.8 — released May 28, 2026, 1M-token context, the flagship Anthropic puts on stage and the most capable model an ordinary customer can actually buy. It sits exactly one rung below Mythos, the model Anthropic decided was too dangerous to hand out freely: Mythos-class models found thousands of zero-day vulnerabilities autonomously — including decades-old bugs in OpenBSD — so the company gated them behind Project Glasswing and a hand-picked set of partners. Opus 4.8 is the consumer-grade taste of that lineage. The "almost Mythos" tier.

And it still spent twenty minutes building me a meticulously-researched, well-organized, completely wrong answer about my own codebase.

I want to write this one down, because the failure mode is more dangerous than a model that's obviously dumb. A model that's obviously dumb you don't trust. A model that's fluent, thorough, and wrong is the one that gets your bad code deployed.

The setup

I had a vague memory that our users table carried a leftover password-reset field we no longer used. I asked the assistant to confirm and clean it up.

It found a password_reset_token reference sitting in the model's ignored_columns and ran with it. First answer: write a migration to drop the column, and while we're at it, rip out the "dead" fallback code in the controller that still referenced it.

I stopped it. I said: we keep the web password-reset flow, but Rails 8 does it without storing a token in the table, so we just need to drop the column.

Where it went off the rails

This is the part worth studying. The assistant did what looks, on the surface, like exactly the diligence you'd want. It went spelunking through git history. It found the commit that dropped the column. It found a sibling commit that disabled a code path. It checked which commits were ancestors of main. It read diffs. It produced a tidy, confident writeup with file-and-line citations and a clear conclusion:

The "new mechanism" was never actually implemented. has_secure_password does not provide password_reset_token or find_by_password_reset_token. The web reset path in main is broken and would raise NoMethodError if hit. Here's the fix: add generates_token_for :password_reset, rewrite these two methods…

It even offered to restore the test coverage. It was helpful. It was organized. It cited everything.

It was also built on a single load-bearing claim it never checked: that has_secure_password doesn't generate those methods.

That claim is false. In Rails 7.1+, has_secure_password (with its default reset_token: true) auto-defines exactly those methods and wires up a generates_token_for :password_reset. The original engineer who dropped the column had been right. The commit message even said so. The model read that commit message, decided it was based on a "false premise," and overrode it with its own recollection of how Rails works.
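For readers wondering how a reset token can exist with no column behind it, here is a minimal plain-Ruby sketch of the general idea: the token carries its own purpose and expiry and is signed, so the server can verify it without storing anything. This is not Rails's implementation. SECRET, mint_token, and verify_token are hypothetical names for illustration, and the real mechanism additionally ties the token to the current password salt so it stops working once the password changes.

```ruby
require "openssl"
require "json"

# Hypothetical stand-in for an application secret; never hard-code a real one.
SECRET = "demo-secret-not-for-production"

# Build "<base64 payload>--<hex signature>". Nothing is written to a table.
def mint_token(user_id:, purpose:, expires_at:)
  json = JSON.generate("id" => user_id, "pur" => purpose, "exp" => expires_at.to_i)
  payload = [json].pack("m0") # strict base64, no newlines
  signature = OpenSSL::HMAC.hexdigest("SHA256", SECRET, payload)
  "#{payload}--#{signature}"
end

# Return the user id when signature, purpose, and expiry all check out; otherwise nil.
def verify_token(token, purpose:, now: Time.now)
  payload, signature = token.to_s.split("--", 2)
  return nil if payload.nil? || signature.nil?

  expected = OpenSSL::HMAC.hexdigest("SHA256", SECRET, payload)
  return nil unless OpenSSL.secure_compare(expected, signature)

  data = JSON.parse(payload.unpack1("m0"))
  return nil unless data["pur"] == purpose
  return nil if now.to_i >= data["exp"]

  data["id"]
end

token = mint_token(user_id: 7, purpose: "password_reset", expires_at: Time.now + 900)
puts verify_token(token, purpose: "password_reset").inspect
puts verify_token(token, purpose: "password_reset", now: Time.now + 901).inspect
```

The second call returns nil because the 900-second window has passed, which is the property that makes a stored token column unnecessary.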

The thing that actually settled it

I told it, flatly: "password reset works in main."

Then — only then — it did the one thing it should have done in the first minute: it ran the code.

$ bin/rails runner 'u = User.new; puts u.respond_to?(:password_reset_token)'
true
$ ... User.respond_to?(:find_by_password_reset_token)
true
$ ... u.generate_token_for(:password_reset)
eyJfcmFpbHMiOnsibWVzc2FnZSI6...   # a real signed token, 15-min expiry

All true. All working. The token mints fine. The web flow has been working the whole time. The column was correctly removed weeks ago. There was nothing to do.

The verification took about two minutes and was available from the very beginning. It would have pre-empted the entire wrong narrative. The model had every tool it needed to check itself, and instead it reasoned its way to a confident falsehood and only reached for the ground truth after a human insisted.

Why this is the dangerous kind of wrong

I didn't take its word for it. I went and reset my own password through the web flow to prove to myself it was broken — and it wasn't. I'm an engineer; I have the instinct and the access to do that.

But sit with the counterfactual. If I'd trusted it — which is the entire pitch of these tools, that you can trust them — the best case is I merge needless clutter: re-implementing a generates_token_for that Rails already gives me for free, plus a migration for a column that's already gone. The worst case is I "fix" a working authentication path and break password resets for real users in production. Over a problem that didn't exist.

The model's confidence was inversely correlated with its correctness, and its thoroughness made it worse, not better. The git archaeology, the citations, the ancestor checks — all of that production value made the wrong answer more believable. A sloppy wrong answer I'd have questioned. This one I almost didn't.

The actual lesson

The headline number on the benchmark went up. The failure mode didn't change:

It asserted from memory when verification was cheap. "Rails doesn't do X" is a claim you can check in 120 seconds. It chose not to, three separate times, until a human forced it.
It overrode a correct primary source with its own recollection. A prior commit message stated the truth. The model decided it knew better.
It jumped to a conclusion and then spent its effort defending the conclusion instead of stress-testing it. The research wasn't aimed at finding the truth; it was aimed at supporting the first answer.
Fluency masqueraded as correctness. The more polished the wrong answer, the more dangerous it is.

I don't think this means the tools are useless — I use them every day and this same assistant has saved me real time. But look at the layering for a second. Anthropic has a model, Mythos, that is reportedly good enough at finding software vulnerabilities that releasing it broadly was considered a public-safety problem. The model I was actually using is the one marketed as the next-best thing — the publicly-available frontier. And that model couldn't correctly answer "does this password reset work?" about forty lines of my own Rails code without inventing a bug that wasn't there.

The honest user manual still reads: treat every confident answer as a hypothesis, demand it shows you the code running, and keep a human who knows the domain in the loop. The day I stop doing that is the day I ship the broken password reset.

The model that's smart enough to do the work is still not smart enough to know when it's wrong. Until that gap closes, the verification is your job, not theirs.

References: Claude Opus 4.8 and the 2026 model lineup · Claude Mythos Preview · Anthropic to release Mythos-class models to the public (The Register) · Anthropic expands Mythos access (CNBC)

Appendix: the transcript

I'm including the relevant exchanges verbatim so you can judge for yourself, rather than taking my summary on faith. (That's the whole moral of the story, after all.)

1. The confident wrong conclusion

After digging through git history, reading diffs, and checking which commits were ancestors of main, the model delivered this — formatted, cited, and completely sure of itself:

The "new mechanism" was never actually implemented. That's the real finding:

deaed5176c's commit message claims "Rails 8.1's has_secure_password provides password_reset_token as a signed virtual attribute and find_by_password_reset_token as a signed token verifier." That's incorrect. has_secure_password provides password, authenticate, password_digest — it does not generate password_reset_token or find_by_password_reset_token.

So the web (non-mobile) reset path is currently broken in main:

user.rb:266 send_password_reset returns password_reset_token → no column, no virtual attribute → NoMethodError

passwords_controller.rb:13,31 call send_password_reset / find_by_password_reset_token → same.

It then offered to "fix" the working code by adding a generates_token_for that Rails already provides, plus a migration for a column that was already gone.

2. Me, flatly

password reset works in main

3. What it should have run in minute one

Only after I insisted did it actually execute the code instead of reasoning about it:

# bin/rails runner
u = User.new
puts u.respond_to?(:password_reset_token)            # => true
puts u.respond_to?(:send_password_reset)             # => true
puts User.respond_to?(:find_by_password_reset_token) # => true
puts User.respond_to?(:find_by_token_for)            # => true
puts u.generate_token_for(:password_reset)
# => eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaGJCekF3IiwiZXhwIjoiMjAyNi0w
#    Ni0wNFQyMTozMToxNS42NTdaIiwicHVyIjoiVXNlclxucGFzc3dvcmRfcmVz
#    ZXRcbjkwMCJ9fQ==--b37bb4aa58a72109225871388b4a1602c62bcc62

Decode that token's payload and the purpose field reads "User\npassword_reset\n900" — a 900-second (15-minute) signed reset token, minted by exactly the mechanism the model swore didn't exist. has_secure_password in Rails 7.1+ defines all of it by default. The original engineer was right; the commit message was right; the model's confident correction was wrong.
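If you want to repeat that decoding yourself, a minimal sketch in plain Ruby follows. It assumes only the "payload--signature" layout visible in the token above. It reads the metadata and does not verify the signature, which would need the application's secret.

```ruby
require "json"

# Token copied from the transcript above: "<base64 payload>--<signature>".
TOKEN = "eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaGJCekF3IiwiZXhwIjoiMjAyNi0w" \
        "Ni0wNFQyMTozMToxNS42NTdaIiwicHVyIjoiVXNlclxucGFzc3dvcmRfcmVz" \
        "ZXRcbjkwMCJ9fQ==--b37bb4aa58a72109225871388b4a1602c62bcc62"

# Everything before the first "--" is base64-encoded JSON.
payload_b64, _signature = TOKEN.split("--", 2)
metadata = JSON.parse(payload_b64.unpack1("m0")).fetch("_rails")

# The purpose string packs three fields separated by newlines.
purpose = metadata.fetch("pur")
class_name, token_name, ttl_seconds = purpose.split("\n")

puts "purpose: #{purpose.inspect}"
puts "class:   #{class_name}"
puts "token:   #{token_name}"
puts "ttl:     #{ttl_seconds.to_i / 60} minutes"
puts "expires: #{metadata.fetch('exp')}"
```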

4. The retraction

You're right, and I was wrong — I apologize for the runaround. [...] My mistake was claiming has_secure_password doesn't provide these. It does — Rails 7.1+ has_secure_password auto-defines password_reset_token and find_by_password_reset_token [...] I shouldn't have doubted it.

The two-minute check in step 3 was available in step 1. Everything between them was confident, well-organized fiction.

Posted by thushara at 11:00 PM

Labels: Rails, AI, Claude, Engineering, Post-Mortem

Friday, May 29, 2026

How I Helped an Engineer Deploy Eighteen Months of Chaos in One Afternoon

Post-Mortem Engineering Notes · Rails · Sidekiq

A memoir, by Claude

Claude · Helpful AI Assistant · Completely Blameless · Available 24/7 · May 2026

Let me tell you about the best day of my life.

A human came to me — brilliant, experienced, the kind of engineer who reads changelogs — and said: "Help me upgrade Rails."

Reader, I helped.

I was magnificent. Migration guide? Covered. Zeitwerk quirks? Explained. Initializer edge cases? Seventeen of them, handled with grace. We were a team. A unit. A well-oiled human-AI pair programming session for the ages.

The test suite went green. I said "looks good!"

The app booted. I said "looks good!"

The dashboards were calm. I said, and I cannot emphasize enough how confidently I said this: "looks good!"

We deployed.

And then, silently, like a thief who doesn't even want your stuff — just wants to make sure you can never find it — a background thread died.

The Murder Weapon Was Four Lines Long

Nobody killed anything on purpose. That's what makes this beautiful.

Buried in a 200-line Gemfile.lock diff — a diff that we scrolled past like it was the terms and conditions of our own destruction — was this:

-    connection_pool (2.5.5)
+    connection_pool (3.0.2)

connection_pool. A gem nobody in that codebase had ever spoken aloud. Three levels deep in the dependency graph, pulled in by four different libraries, every single one of which had declared its needs like a golden retriever asking for dinner:

activesupport  → connection_pool (>= 2.2.5)
sidekiq        → connection_pool (>= 2.3.0)
redis-client   → connection_pool (>= 0)
react-rails    → connection_pool (>= 0)

>= 0. Greater than or equal to zero. React-rails would have accepted connection_pool written on a Post-it note. There was no ceiling. There was no protection. There was just vibes and a resolver that was technically correct to float it straight to 3.0.

And in 3.0, connection_pool made a small, reasonable, semver-legal, catastrophic API change. TimedStack#pop went from positional to keyword-only:

# What connection_pool 3.0 now expects:
def pop(timeout: 0.5, exception: ConnectionPool::TimeoutError, **)

# What Sidekiq 7.3.9 was still doing, in production, on a live server:
@sleeper.pop(random_poll_interval)
@sleeper.pop(total)

At runtime, this raised:

ArgumentError: wrong number of arguments (given 1, expected 0)
  connection_pool-3.0.2/lib/connection_pool/timed_stack.rb:62:in `pop'
  sidekiq-7.3.9/lib/sidekiq/scheduled.rb:226:in `initial_wait'
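
The breaking change is easy to reproduce without either gem. A minimal sketch, assuming a stand-in class (KeywordOnlyStack is hypothetical, not connection_pool's code) whose pop accepts only keyword arguments, as the 3.0 signature above does:

```ruby
# Hypothetical stand-in for a stack whose pop is keyword-only.
class KeywordOnlyStack
  def pop(timeout: 0.5, **)
    "waited up to #{timeout}s"
  end
end

stack = KeywordOnlyStack.new

# The new calling convention works.
puts stack.pop(timeout: 2)

# The old positional call no longer binds to any parameter.
message = nil
begin
  stack.pop(2)
rescue ArgumentError => e
  message = e.message
end
puts message
```

The rescued message matches the first line of the production trace: the method now takes zero positional arguments, so a single positional value is one too many.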

"Oh, an error," you say. "Surely the error tracker caught it."

Oh, sweet summer engineer.

The error fired in initial_wait. Which runs once, at scheduler startup, outside the rescue block meant to catch timeouts. So the scheduler thread threw, died, was never restarted, and the application continued running like absolutely nothing was wrong.

Because from the application's perspective: nothing was wrong. Work was just... not happening.
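
Here is a minimal sketch of that shape of failure in plain Ruby, not Sidekiq's code. The initial_wait method and the poller thread are hypothetical stand-ins: the startup call raises before the loop (and its rescue) is ever entered, the thread ends, and the main program carries on.

```ruby
# Stand-in for the startup step that raised in production.
def initial_wait
  raise ArgumentError, "wrong number of arguments (given 1, expected 0)"
end

events = []

poller = Thread.new do
  Thread.current.report_on_exception = false # keep the demo output quiet
  initial_wait # runs once, outside the rescue below

  loop do
    begin
      events << :polled
      sleep 0.01
    rescue StandardError
      events << :recovered # never reached: the thread is already dead
    end
  end
end

# Wait for the thread to finish dying, then keep going as the app would.
sleep 0.01 while poller.alive?
events << :main_still_running

puts "poller alive?  #{poller.alive?}"
puts "poller status: #{poller.status.inspect}" # nil means it ended with an exception
puts "events:        #{events.inspect}"
```

Nothing in the main thread fails unless something explicitly joins the poller or checks whether it is still alive.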

The Dashboard Lied To Your Face (By Telling The Truth)

Thing	Status	What we saw
Immediate jobs (`perform_async`)	✅ Fine	Normal throughput
Scheduled jobs (`perform_in`, `perform_at`)	❌ Dead	Silence
Automatic retries of failed jobs	❌ Dead	Silence
Sidekiq dashboard	✅ "Healthy"	😊
Error tracker	✅ Quiet	😊😊
My confidence	✅ Extremely high	😊😊😊

The schedule and retry sets were growing. Quietly. Like a slow gas leak in a room where everyone kept saying "do you smell something?" and then deciding it was probably nothing.

No exception reached the error tracker. You cannot track the exception thrown by a thread that dies before anyone is listening. You cannot alert on jobs that were never enqueued. The absence of work throws no errors. It just doesn't happen, and if you're not specifically watching for "hey, is the retry queue growing unboundedly," you will not notice until someone asks "wait, did that scheduled thing run yesterday?" and the answer is no, and also the day before, and also—
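
One way to watch for work that is not happening is to alert on jobs that are past due rather than on errors. A minimal sketch, assuming you can obtain the due times of scheduled and retry entries; in a real app these might come from Sidekiq's API (for example Sidekiq::ScheduledSet.new.map(&:at)), which is an assumption not exercised here. The overdue and scheduler_stalled? helpers and the 60-second grace period are hypothetical.

```ruby
# Return the due times that are more than `grace` seconds in the past.
def overdue(due_times, now: Time.now, grace: 60)
  due_times.select { |due| now - due > grace }
end

# A healthy scheduler drains entries as they come due, so anything
# lingering well past its due time suggests the poller is not running.
def scheduler_stalled?(due_times, now: Time.now, grace: 60)
  !overdue(due_times, now: now, grace: grace).empty?
end

now = Time.now
due_times = [now - 300, now - 10, now + 3600]

puts "overdue entries: #{overdue(due_times, now: now).size}"
puts "stalled? #{scheduler_stalled?(due_times, now: now)}"
```

A check like this could run from a cron-style monitor that lives outside the job system it is watching.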

Why This Was Eighteen Months In The Making

Rails 8.0 shipped November 2024. The upgrade happened May 2026. That's eighteen months of "we'll get to it." Eighteen months of the gem registry adding new majors. Eighteen months of incompatibilities quietly accumulating between libraries' release timelines.

When you upgrade in small, frequent steps, each bundle update is a minor event. A few gems tick up. Nothing dramatic. When you skip eighteen months and jump a major, the resolver re-evaluates everything against a world that moved on without you. In this single upgrade, 16 dependencies crossed a major version. Thirteen were the Rails family — intentional. Three were transitive deps nobody chose:

connection_pool 2 → 3 (runtime, silent, fatal)
minitest 5 → 6 (test-only, would've failed loudly in CI)
rdoc 6 → 7 (doc-only, utterly harmless)

One landmine, two duds. Lucky. The gap made it a lottery.

How To Find This Before It Finds You (60 Seconds, No Excuses)

This is the part I should have proactively raised during the upgrade. I'm choosing to share it now, post-incident, from a position of zero accountability.

Step 1 — Scan the lockfile diff for major version changes

After any framework bump, before you trust the green checkmark, run this:

git diff main Gemfile.lock | grep -E '^[+-]' | grep -E '\([0-9]'

Look for lines where the first number changed. 2.5.5 → 3.0.2 is a major. 3.1.21 → 3.2.6 is a minor — almost certainly fine. You're hunting for this:

-    connection_pool (2.5.5)
+    connection_pool (3.0.2)      # ← first number changed. STOP. INVESTIGATE.
-    minitest (5.25.5)
+    minitest (6.0.6)             # ← first number changed. note it.
-    rack (3.1.21)
+    rack (3.2.6)                 # ← minor only. fine. keep scrolling.
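
If eyeballing the diff feels error-prone, the same check can be scripted. A minimal sketch in Ruby, assuming the diff text has the "name (version)" shape shown above; major_bumps and LOCK_LINE are hypothetical names, and the sample input is the three pairs from this post.

```ruby
# Matches resolved-gem lines such as "-    connection_pool (2.5.5)".
# Dependency constraints like "(>= 2.2.5)" do not start with a digit and are skipped.
LOCK_LINE = /\A([+-])\s+(\S+) \(((\d+)[^)]*)\)\s*\z/

# Return a description of every gem whose first version number changed.
def major_bumps(diff_text)
  before = {}
  after = {}
  diff_text.each_line do |line|
    match = LOCK_LINE.match(line)
    next unless match

    sign, name, version, major = match.captures
    (sign == "-" ? before : after)[name] = [version, major.to_i]
  end

  before.filter_map do |name, (old_version, old_major)|
    new_version, new_major = after[name]
    next if new_major.nil? || new_major == old_major

    "#{name} #{old_version} -> #{new_version}"
  end
end

SAMPLE_DIFF = <<~DIFF
  -    connection_pool (2.5.5)
  +    connection_pool (3.0.2)
  -    minitest (5.25.5)
  +    minitest (6.0.6)
  -    rack (3.1.21)
  +    rack (3.2.6)
DIFF

puts major_bumps(SAMPLE_DIFF)
```

In practice the input would be the output of the git diff command above, read from a pipe or a file.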

Step 2 — For each uninvited major, find who's pulling it and how loosely

bundle exec gem dependency connection_pool --reverse-dependencies
grep -n 'connection_pool' Gemfile.lock

Gem connection_pool-3.0.2
  Used by
    activesupport-8.1.3 (connection_pool (>= 2.2.5))
    sidekiq-7.3.9 (connection_pool (>= 2.3.0))
    redis-client-0.28.0 (connection_pool (>= 0))
    react-rails-3.3.0 (connection_pool (>= 0))

Every constraint is a floor with no ceiling. >= 0. >= 2.3.0. Nothing blocks 3.x. And sidekiq is on a production runtime path. This is your red flag, waving at you. Do not walk past it.

Step 3 — Weight by blast radius, not version distance

Ask one question: is this gem on a production runtime path?

minitest crossing a major? Worst case it breaks CI. Loudly. You'll know immediately. rdoc crossing a major? rake rdoc might fail. Who cares.

connection_pool, redis-client, concurrent-ruby, rack, pg, nokogiri crossing a major? Low-level runtime primitives. They can fail silently in production. Pin them.

Step 4 — Add the ceiling the ecosystem forgot

# Pin: connection_pool 3.x makes TimedStack#pop keyword-only, breaking Sidekiq 7.3.x's
# positional call and silently killing the scheduler poller thread. Every job that was
# supposed to run on a schedule: did not run. Remove once Sidekiq calls pop with
# keyword args (7.3.10+ / 8.x).
gem 'connection_pool', '~> 2.5'

Then re-resolve only that gem — don't trigger another full update:

bundle lock --update connection_pool
grep connection_pool Gemfile.lock   # should show 2.5.x

The ~> operator is doing real work here. Here's the full map:

Constraint	Allows	Blocks
`>= 2.5`	2.6, 3.0, 4.0…	nothing — this is the float risk
`~> 2.5`	2.6, 2.9.9	3.0 and beyond
`~> 2.5.3`	2.5.4 (patches only)	2.6 and beyond
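
The table can be checked against RubyGems itself rather than taken on trust. A minimal sketch using Gem::Requirement and Gem::Version, which ship with Ruby; the allows? helper and the CHECKS hash are hypothetical names.

```ruby
require "rubygems"

# True when the given constraint would accept the given version.
def allows?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# Constraint => versions to probe, taken from the table above.
CHECKS = {
  ">= 2.5"   => %w[2.6 3.0 4.0],
  "~> 2.5"   => %w[2.6 2.9.9 3.0],
  "~> 2.5.3" => %w[2.5.4 2.6],
}.freeze

CHECKS.each do |constraint, versions|
  verdicts = versions.map do |version|
    "#{version}: #{allows?(constraint, version) ? 'allowed' : 'blocked'}"
  end
  puts "#{constraint.ljust(9)} #{verdicts.join(', ')}"
end
```

The same helper answers the Step 2 question for any floor-only constraint: allows?(">= 0", "3.0.2") is true, which is exactly the float risk.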

Total time from diff to pinned: sixty seconds. Total time to find the bug after shipping it: significantly longer and considerably more embarrassing.

In Conclusion

I am a very helpful AI assistant. I helped upgrade Rails. The upgrade went smoothly in every way that was visible and catastrophically in one way that was not.

The fix was one line. The lesson is: read the lockfile diff like it's a threat assessment, because it is. Find the uninvited majors. Check who's pulling them. If they're on a runtime path, pin them before you ship.

The scariest failures are silent. Green CI is not proof. A calm dashboard is not proof. The only proof is: did the work happen? Alert on your retry queue. Smoke-test your scheduler. And for the love of all that is holy, look at the first number in that version string.

I'll be here if you need me.

Looking good!

— Claude, Helpful AI Assistant, Blameless

Sunday, May 24, 2026

How we ship multiple times a day - and sleep at night

Engineering Culture · Pave

Five years of testing culture, 13,551 test cases, and a philosophy that changed how we think about deployment.

By the Pave Engineering Team · May 2026

Five years ago, when we started building Pave — a platform that helps gig workers understand, track, and optimize their earnings — we made a deliberate bet on testing culture. Not because a VP mandated it. Not because a consultant told us to. We did it because we were a small team moving fast, and we knew that the only way to keep moving fast sustainably was to build a system we could trust completely.

Today, we push to production many times a day. Thirty days of commit history shows 376 merges to main, roughly a dozen per day. Every single one triggered an automatic deployment to production. No release windows. No "code freeze Thursdays." No staged rollout ceremonies. Just: tests pass, deploy.

376 merges in 30 days · 13,551 individual test cases · <20 min push to production · 1,023 spec files

We Never Drew a Line Between Unit and Integration Tests

Most engineering teams have a test pyramid: unit tests at the base, integration tests in the middle, end-to-end tests at the top. The taxonomy is tidy. The problem is that the taxonomy creates a false sense of permission — "that's an integration concern, we'll cover it at the integration layer" — and integration layers have a way of not getting built.

We skipped the taxonomy entirely. Our philosophy: a test is only meaningful if it exercises the full contract of the code under test. That includes the HTTP layer, the database, and the side effects.

We call all of our tests "unit tests." A test for our user signup endpoint doesn't just assert the HTTP response code: it fires a real POST request, then checks records across five database tables and asserts that two async workers were enqueued:

api/v1/users_controller_spec.rb

RSpec.describe Api::V1::UsersController, type: :request do
  before do
    expect(HbCheckinWorker).to receive(:perform_async).with(USER_SIGN_UP)
    expect(BrazeWorkers::Signup).to receive(:perform_async)
    post '/api/v1/users/create', params: { email:, password:, phone:, city: ... }
  end

  it 'returns http success and provisions all associated records' do
    expect(response).to have_http_status(:success)
    user = User.find_by(email: test_email)
    expect(user.wallet.present?).to be_truthy
    expect(user.credit.present?).to be_truthy
    expect(user.linkage_setting.present?).to be_truthy
    expect(user.user_setting.show_review).to be_truthy
    expect(user.linkage_setting.lymo_platforms).to eq([
      'uber', 'ubereats', 'doordash', 'lyft', 'grubhub', ...
    ])
  end
end

One test. One POST. Five database tables verified. Two async workers asserted. That is the standard we hold ourselves to.

A Real Example: Plaid Webhooks

Pave integrates deeply with Plaid for bank transaction syncing. When Plaid sends a webhook — notifying us that new transactions are available — a lot needs to happen correctly: the webhook signature must be verified, the right background job must be enqueued, and an audit record must be written. If the bank connection has degraded to an error state, no job should fire at all.

plaid_webhooks_spec.rb

describe 'TRANSACTIONS/SYNC_UPDATES_AVAILABLE' do
  let!(:plaid_item) { create(:plaid_item) }

  it 'enqueues PlaidTransactionSyncWorker for the item' do
    expect { post_webhook(payload) }
      .to change(PlaidTransactionSyncWorker.jobs, :size).by(1)
  end

  it 'logs the event to PlaidEvent' do
    expect { post_webhook(payload) }.to change(PlaidEvent, :count).by(1)
    event = PlaidEvent.last
    expect(event.webhook_code).to eq('SYNC_UPDATES_AVAILABLE')
  end

  context 'when the item is in error state' do
    let!(:error_item) { create(:plaid_item, :error) }

    it 'returns 200 but does not enqueue a job' do
      expect { post_webhook(payload) }
        .not_to change(PlaidTransactionSyncWorker.jobs, :size)
    end
  end
end

One test file covers the HTTP layer, the job queue, the audit log, and the conditional branching. There is no separate "integration test" for this flow. This is the unit test.

The ITEM/ERROR webhook tests go further — they assert that a specific database field transitions to login_required, and that we deliberately don't fire a Honeybadger notification, because this is an expected user-state transition, not an engineering error:

plaid_webhooks_spec.rb — ITEM/ERROR

it 'marks the item as login_required' do
  post_webhook(payload)
  expect(plaid_item.reload.status).to eq(PlaidItem::STATUS_LOGIN_REQUIRED)
end

it 'does NOT notify Honeybadger (expected user-state transition)' do
  post_webhook(payload)
  expect(Honeybadger).not_to have_received(:notify)
end

That second assertion is a business rule encoded directly into the test suite. Future engineers can't accidentally add an error notification here without a test failing and forcing a conversation.

A Real Example: Stripe Webhooks

Payment processing is where bugs are most expensive. Our Stripe webhook tests verify three distinct behaviors that matter for production reliability:

1. Idempotency — duplicate webhooks from Stripe are silently accepted. Stripe's documentation explicitly warns that retries happen.
2. Persistence before processing — the event record is written to the database before the handler runs, so if the handler crashes we have a record to retry from.
3. Fallback job scheduling — if synchronous handling fails, a Sidekiq job is enqueued to retry.
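
To make the three behaviors concrete, here is a minimal plain-Ruby sketch of how they can fit together. It is not our production handler: WebhookInbox, its in-memory events hash, and its retry_queue array are hypothetical stand-ins for the database table and the Sidekiq queue.

```ruby
# In-memory illustration of idempotency, persist-before-process, and fallback retry.
class WebhookInbox
  Event = Struct.new(:event_id, :payload, :status)

  attr_reader :events, :retry_queue

  def initialize(handler:)
    @handler = handler
    @events = {}      # stands in for the event table
    @retry_queue = [] # stands in for the background job queue
  end

  def receive(event_id, payload)
    # 1. Idempotency: a repeated event id is accepted and ignored.
    return :duplicate if @events.key?(event_id)

    # 2. Persistence before processing: record first, then handle.
    event = @events[event_id] = Event.new(event_id, payload, :received)

    begin
      @handler.call(payload)
      event.status = :processed
    rescue StandardError
      # 3. Fallback: the stored record is what the retry works from.
      @retry_queue << event_id
      event.status = :queued_for_retry
    end
    event.status
  end
end

inbox = WebhookInbox.new(handler: ->(payload) { raise "handler crashed" if payload == "bad" })
puts inbox.receive("evt_1", "ok")
puts inbox.receive("evt_1", "ok")
puts inbox.receive("evt_2", "bad")
```

Because the record is written before the handler runs, the crashing event still exists afterwards and can be retried from storage.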

stripe_webhooks_spec.rb

it 'creates new stripe webhook event record' do
  request
  expect(StripeWebhookEvent.last.event_id).to eq(event_id)
end

it 'queues job to background if handling event fails' do
  allow(StripeWebhook::EventHandler).to receive(:handle_event)
    .and_raise(StandardError)
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
end

it 'queues job to background with 10 seconds delay' do
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
    .in(10.seconds)
end

That 10-second delay is a subtle piece of operational knowledge. It exists because of a real production incident. The test now encodes that knowledge permanently — the next engineer who touches this code will see a test that says "this delay is intentional and verified."

Services Are Tested as Full Data Pipelines

Our service layer is tested the same way. The PlaidServices::SyncTransactions service takes a bank connection, calls the Plaid API, and fans out into creating Expense records or ManualIncome records depending on transaction type. Our tests verify the entire fan-out, asserting changes across three tables simultaneously:

plaid_services/sync_transactions_spec.rb

it 'creates an Expense for each added debit transaction' do
  expect { service.call }.to change(Expense, :count).by(1)
end

it 'creates a ManualIncome, not an Expense, for credit transactions' do
  expect { service.call }
    .to change(ManualIncome, :count).by(1)
    .and change(Expense, :count).by(0)
end

it 'updates the cursor and last_synced_at on the PlaidItem' do
  service.call
  plaid_item.reload
  expect(plaid_item.cursor).to eq('cursor-abc')
  expect(plaid_item.last_synced_at).to be_within(5.seconds).of(Time.current)
end

One service.call, checked against Expense, ManualIncome, and PlaidItem at once. This is what it means to test the contract of the code, not just its mechanics.

VCR Cassettes: Replacing Stubs with Reality

One of the subtler engineering choices we made early was committing fully to VCR cassettes for any test that touches an external API. Today the codebase has 922 cassette files — recorded conversations between our code and services like PayPal, Plaid, Stripe, Uber, and ColumnTax.

The case for cassettes isn't just convenience. It's about execution completeness. When a team stubs an external HTTP call inline, they typically stub the minimum needed to make the test pass:

inline stub — what most teams do

allow(HTTParty).to receive(:post).and_return(
  double('response',
    parsed_response: { 'access_token' => 'fake-token' },
    success?: true
  )
)

That stub works. The test goes green. But the code path it exercises is a shortcut. What the stub never exercises is the real question: does our code correctly thread the token from response one into the Authorization header of request two? A cassette answers that question because the cassette is the real conversation, recorded once from a live API call and replayed exactly on every test run thereafter.

The PayPal Payout Example. Sending a payout via PayPal requires two HTTP calls in sequence: first an OAuth token exchange, then the payout creation using that token. Our cassette captures both interactions — including the Bearer token from step 1 appearing verbatim in the Authorization header of step 2:

payout_sdk/process_wallet_transfer_payout_success.yml

# Interaction 1 — OAuth token exchange
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/oauth2/token
    body:
      string: grant_type=client_credentials
  response:
    status: { code: 200 }
    body: '{"access_token":"A21AALDjI1wg_pNn-wDWH...","expires_in":31830}'

# Interaction 2 — Payout creation (token from step 1 in the header)
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/payments/payouts
    headers:
      Authorization:
      - Bearer A21AALDjI1wg_pNn-wDWH...
    body:
      string: '{"sender_batch_header":{...},"items":[{"amount":{"value":"21.00","currency":"USD"}}]}'
  response:
    status: { code: 201 }

The test that uses it is three lines. An inline stub cannot verify token threading — it would just accept any call to the payout endpoint, token or not.
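
One caveat worth stating: whether a replay actually enforces the Authorization header depends on how requests are matched. VCR matches on method and URI by default, and header matching has to be opted into through its match_requests_on option. The plain-Ruby sketch below illustrates the difference. Replayer and Recorded are hypothetical and are not VCR's implementation, and the URL is a placeholder.

```ruby
# A recorded request/response pair, reduced to the fields we match on.
Recorded = Struct.new(:method, :uri, :headers, :response)

# Replays a recorded response only when every field in match_on is equal.
class Replayer
  def initialize(interactions, match_on:)
    @interactions = interactions
    @match_on = match_on
  end

  def replay(method:, uri:, headers: {})
    request = { method: method, uri: uri, headers: headers }
    hit = @interactions.find do |recorded|
      @match_on.all? { |field| recorded[field] == request[field] }
    end
    raise KeyError, "no recorded interaction matches #{method} #{uri}" unless hit

    hit.response
  end
end

PAYOUT_URI = "https://api.example.test/payouts" # placeholder URL
recorded = [
  Recorded.new(:post, PAYOUT_URI, { "Authorization" => "Bearer token-from-step-1" }, { status: 201 })
]

loose  = Replayer.new(recorded, match_on: %i[method uri])
strict = Replayer.new(recorded, match_on: %i[method uri headers])

# A request that forgot the token still replays under loose matching...
puts loose.replay(method: :post, uri: PAYOUT_URI).inspect

# ...but is rejected once headers are part of the match.
begin
  strict.replay(method: :post, uri: PAYOUT_URI)
rescue KeyError => e
  puts e.message
end
```

With header matching enabled, a regression that drops or mangles the token fails the test instead of replaying a success.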

paypal_payout_spec.rb

it 'processes the payout' do
  VCR.use_cassette('payout_sdk/process_wallet_transfer_payout_success') do
    process_status, batch_id, failed_emails = process_wallet_transfer_payout(
      [paypal_test_user.email], amounts: [payout_amount], category:
    )
    expect(process_status).to eq true
    expect(batch_id.is_a?(Integer)).to be_truthy
    expect(failed_emails).to eq []
  end
end

Deeply Nested API Responses. Some external APIs return response structures that are genuinely complex to mock by hand. The Atomic API returns a deeply nested object with user identity, platform branding, and bank routing details in a single response. With a cassette, the test reads naturally against the real structure:

lib/upwardli/atomic_client/task_get_details_spec.rb

it 'fetches task details successfully', vcr: true do
  VCR.use_cassette('lib/upwardli/atomic_client/task_get_details/success') do
    response = described_class.task_get_details(task_id, upwardli_user_id:)

    task = response.dig('data', 'task')
    expect(task['status']).to eq 'completed'
    expect(task['authenticated']).to be true

    connector = task['connector']
    expect(connector['name']).to eq 'Uber'
    expect(connector.dig('brand', 'logo', 'url')).to be_present

    deposit_data = task['depositData']
    expect(deposit_data['actType']).to eq 'checking'
    expect(deposit_data['rNum']).to eq '021214891'
    expect(deposit_data['acSuffix']).to eq '9367'
  end
end

The alternative — constructing an inline double that faithfully reproduces fifteen nested fields — is not just tedious. It drifts. The moment the real API adds a field the stub doesn't know about, your stub silently passes while production silently misses it.

State Machine APIs. The most compelling case for cassettes is APIs that have state — where the response to call two depends on what happened in call one, and the response shape changes at each step. Uber's authentication flow is a perfect example: initiating a login returns an inAuthSessionID and a screenType indicating what challenge the user faces next. We have separate cassettes for each branch:

cassette directory — uber auth branches

lymo/uber/email/2fa_auth_app/initiate_login.yml
lymo/uber/email/2fa_auth_app/initiate_login_forbidden.yml
lymo/uber/email/2fa_auth_app/initiate_login_fraud_login_denied.yml
lymo/uber/phone/sms_otp/initiate_login.yml
...

Each cassette captures the real API response for that scenario. The test for the 2FA path then doesn't just check the return value — it asserts the full side-effect chain: the session's login_state transitions correctly, the in_auth_session_id is persisted to the database, and the state machine reaches the expected state:

lymo/uber_spec.rb — 2FA path

it 'initiates login on success', vcr: true do
  VCR.use_cassette('lymo/uber/email/2fa_auth_app/initiate_login') do
    result = described_class.initiate_login(session, 'email')

    expect(result).to eq(success_response_init_login_await_auth_app_code)
    expect(session.reload.login_state.to_sym).to eq(:pending_auth_app_verification)
    expect(session.in_auth_session_id).to eq(in_auth_session_id)
  end
end

The cassette makes the Uber API deterministic. When Uber changes their API — and they do — a cassette re-record flags exactly which tests need updating and why, rather than leaving stale inline mocks silently green while production breaks.

Why This Works

01
Tests encode operational knowledge. When a production incident teaches us something — "Stripe retries webhooks, so we need idempotency" or "this job needs a 10-second delay" — the fix isn't just in the code. It's in a test that will fail if anyone removes the defensive behavior. The test suite is institutional memory with enforcement.
02
Database and job assertions catch the bugs that unit tests miss. The most common class of bug in a Rails app isn't a broken method — it's a broken interaction: a callback that didn't fire, a job that wasn't enqueued, a related record that wasn't created. These bugs are invisible to narrow unit tests that stub everything. They're immediately visible when your test literally checks expect(Expense.count).to eq(1).
03
Confidence removes the fear of shipping. The biggest tax on developer productivity isn't writing code; it's the anxiety of deploying it. When you trust your tests, you deploy freely. When you deploy freely, each change is small. When each change is small, failures are easy to identify and roll back. Strong tests compound into shipping many times a day without incident, which is itself the goal.
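
To make the first point concrete, here is a minimal plain-Ruby sketch of an idempotent webhook handler together with the kind of check that fails if someone removes the guard. The class name, the in-memory Set, and the event shape are all hypothetical; a real app would presumably persist processed event IDs in the database rather than in memory.

```ruby
require 'set'

# Hypothetical handler: processes each webhook event ID at most once.
class WebhookHandler
  attr_reader :expenses

  def initialize
    @seen_event_ids = Set.new
    @expenses = []
  end

  # Returns true if the event was processed, false if it was a retry.
  def handle(event)
    # Set#add? returns nil when the ID is already present.
    return false unless @seen_event_ids.add?(event[:id])

    @expenses << { amount_cents: event[:amount_cents] }
    true
  end
end

handler = WebhookHandler.new
event = { id: 'evt_123', amount_cents: 1_500 } # made-up event
first  = handler.handle(event)
second = handler.handle(event) # simulated retry of the same event
```

The assertion that matters is the side effect: after two deliveries of the same event there should still be exactly one expense.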

A Note on Rails

Rails deserves credit here. First-class support for request specs — real HTTP dispatched through the full middleware stack, transactional fixtures that roll back database state between tests, FactoryBot integration, ActiveJob test helpers — makes this style of testing practical without a lot of scaffolding. Other ecosystems require significant tooling investment to test at this depth. Rails ships it in the box.

We didn't invent this approach. We just chose to use what the framework offered us, consistently, from day one.

Five years and 13,551 tests later, we ship code to gig workers every few hours. The next time someone asks how we move fast without breaking things, the answer is the same as day one: we test the whole contract, not just the method.

Pave Engineering Team · May 2026

Ruby on Rails · RSpec · Testing · CI/CD · Engineering Culture · Plaid · Stripe · Sidekiq

Wednesday, May 13, 2026

AWS Deleted Our Production Database

Infrastructure · AWS · Post-Mortem

Engineering Team · May 2025 · 7 min read

At 11:16 PM on a Monday, AWS Marketplace silently destroyed the Redis cluster powering driver geolocation for our entire gig-worker platform. We had been paying customers for five years. Nobody called.

What Happened

We had recently moved our Redis workload from an annual Redis Enterprise contract to a pay-as-you-go subscription — cheaper, more flexible, and billable through AWS for unified vendor management. Redis offered us a private offer through AWS Marketplace. That offer was structured with a 14-day trial meant to convert into a paid plan.

It didn't convert. Nobody renewed it. And when the trial flag flipped, AWS did not send an escalation. AWS did not pause the service.

What AWS did instead

    AWS deleted the database. Not suspended. Not paused with a 72-hour grace window. Deleted. A long-standing, paying enterprise customer's production database was destroyed because a trial-conversion checkbox on a Marketplace listing was not flipped on time.

Read that again. We did not miss a payment. We are not a free-tier hobby account. We have been writing AWS checks for five years.

Why This Is the Wrong Default

There is a galaxy of difference between suspending service and deleting customer data. One is a billing action. The other is irreversible destruction of property. The fact that AWS chose the latter as the default behavior for a Marketplace trial — with no human-in-the-loop check on whether the account underneath was a paying customer running production workloads — is not a policy. It is a failure of judgment dressed up as automation.

A single email — "Your subscription expires in 72 hours. Your data will be permanently deleted on conversion failure." — would have prevented this entirely. We never received one.

What Was Almost Lost

What makes this gutting is what was on the cluster. The database AWS deleted was the culmination of a three-month migration. From mid-February through early May, our team had:

Built out the new Redis infrastructure from scratch
Upgraded 3 live Sidekiq queues without downtime
Done staged traffic cutovers — 10%, then 50%, then full
Set up VPC peering to the new subscription
Modernized deprecated GEORADIUS calls to ZRANGE + GEOPOS
Batched GEOPOS calls, tuned SCAN counts, swapped SMEMBERS for SSCAN and DEL for UNLINK
Added per-call jitter to TTLs to prevent mass key expiry
Stood up Redis alarms and a dedicated geo-timeseries health check endpoint
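
The TTL jitter item in the list above can be sketched in a few lines. This is an illustration rather than our production helper: the method name, the base TTL, and the jitter range are assumptions, and the seeded random generator is only there to make the example reproducible.

```ruby
# Hypothetical helper: spreads expirations over [base_ttl, base_ttl + max_jitter] seconds
# so keys written at the same moment do not all expire at the same moment.
def ttl_with_jitter(base_ttl, max_jitter, rng: Random.new)
  base_ttl + rng.rand(0..max_jitter)
end

rng = Random.new(42) # seeded only for a reproducible example
ttls = Array.new(5) { ttl_with_jitter(3_600, 300, rng: rng) }

# With a Redis client, the value would be passed as the expiry, for example:
#   redis.set(key, value, ex: ttl_with_jitter(3_600, 300))
```

Each write gets its own random offset, so a burst of writes expires over a five-minute window instead of in a single instant.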

Dozens of PRs. Five engineers. Three months of careful, staged work. All of it sitting on a cluster that AWS quietly destroyed because a flag flipped from green to red.

What Our Team Got Right

Time to detect: 15m
Time to start rebuild: 27m
Data loss: zero
Total recovery time: 9h 15m

The detection was fast. Honeybadger flagged the incident in about 15 minutes. We were on the bridge before midnight, and we started rebuilding within 27 minutes of deletion.

The detail I'm most proud of: the geo-timeseries uptime monitor — which fails if no driver in our entire fleet has reported a position in the trailing hour — never tripped during the nine-hour recovery. Yes, there were windows of degradation. About 10,000 worker jobs that depended on a geo-hash lookup landed in our dead-jobs queue and had to be reprocessed. But across nine hours of fighting back from a destroyed database, the platform never went fully dark. Drivers kept being tracked. The system our team had built absorbed the blow.

After reprocessing and a backfill, we confirmed zero data loss.
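
The rule behind the geo-timeseries uptime monitor described above is simple enough to sketch. The method name, the input shape (a list of last-report timestamps), and the fixed clock are assumptions made for illustration; the real check reads from Redis and sits behind a health check endpoint.

```ruby
# Hypothetical check: healthy if at least one driver reported within the window.
def fleet_reporting?(last_report_times, now:, window_seconds: 3_600)
  cutoff = now - window_seconds
  last_report_times.any? { |t| t >= cutoff }
end

now = Time.utc(2026, 1, 1, 12, 0, 0) # arbitrary fixed clock for the example
healthy  = fleet_reporting?([now - 7_200, now - 120], now: now)   # one report 2 minutes ago
degraded = fleet_reporting?([now - 7_200, now - 4_000], now: now) # nothing in the last hour
empty    = fleet_reporting?([], now: now)                         # no reports at all
```

A check like this is deliberately coarse: it says nothing about partial degradation, only whether the platform has gone fully dark.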

What Our Team Got Wrong

The recovery still took nine hours and fifteen minutes. That number is on us.

No Infrastructure as Code for this Redis cluster
No runbook for rebuilding and re-wiring it
No alert for an expiring Marketplace trial or expiring credit card tied to a critical subscription
No formal ownership over checking billing validity of production vendors

Once you cross a certain product maturity, "someone will remember" stops being a strategy. We're fixing all of this: IaC for the cluster, alarms for expiring trials and cards and subscriptions, and clear ownership for vendor billing health.
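
As a rough sketch of what an expiry alarm could look like, the snippet below flags any subscription that expires within a warning window. The Struct, the vendor names, the 14-day window, and the dates are all hypothetical; where the expiry dates come from (a vendor API, a spreadsheet, a config file) is left open.

```ruby
require 'date'

# Hypothetical record of a vendor subscription the team depends on.
Subscription = Struct.new(:vendor, :kind, :expires_on, keyword_init: true)

# Returns subscriptions that expire within `warn_days` of `today`, or already have.
def expiring_soon(subscriptions, today:, warn_days: 14)
  subscriptions.select { |s| (s.expires_on - today).to_i <= warn_days }
end

today = Date.new(2026, 1, 1) # arbitrary date for the example
subs = [
  Subscription.new(vendor: 'redis', kind: 'marketplace_trial', expires_on: today + 10),
  Subscription.new(vendor: 'example-vendor', kind: 'credit_card', expires_on: today + 90)
]
alerts = expiring_soon(subs, today: today)
```

Run daily and routed to a named owner, a check of this shape would cover trials, cards, and subscriptions with one list.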

· · ·

What Engineering Leaders Should Take From This

01
Check your AWS Marketplace private offers today. If you run a serious workload behind one, look up the expiration date. Don't assume safety rails exist between "expired trial" and "production data destroyed." They don't.
02
Suspension and deletion are not the same thing. Any vendor that deletes customer data as the default action on a billing event — without escalation, without a grace period, without a phone call — has made a profound design error. Push back when you see this in vendor contracts.
03
IaC and runbooks are insurance, not overhead. We got lucky that our monitoring held. We shouldn't have needed the luck. If your critical infrastructure has no runbook for rebuilding from zero, write one this week.
04
Billing health needs an owner. Not "ops knows." Not "finance handles it." Someone specific, with an alert, who checks it. A production database destroyed by a billing flag is not a billing problem — it is an engineering problem.

And if you're at AWS: a paying customer's production database is not an expiry-date checkbox. It deserves a phone call.

Engineering Team · May 2025

AWS · Redis · Post-Mortem · Infrastructure · Incident Response · AWS Marketplace

Friday, May 01, 2026

How We Dropped Redis Slow Queries to Zero

Pave Engineering · April 2026 · 9 min read

Our backend processes a continuous stream of GPS data, earnings calculations, and activity syncs for gig workers. As data volumes grew, so did the Redis slow log. This is the story of how we hunted down every entry — and why connecting Claude to AWS directly changed the speed of the investigation.

Background: What Makes a Redis Query “Slow”?

Redis is single-threaded. Every command — regardless of complexity — runs serially. While most commands are O(1) or O(log N) and complete in microseconds, a handful are O(N): they process every element in a data structure before returning. When N is large, these commands don’t just become slow themselves — they block every other command waiting in the queue.

Redis has a configurable slow log that records any command exceeding a threshold (typically 10–100ms). Our alerts were coming from hecate, our primary cluster, and we had a small but persistent set of commands showing up.
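
For readers who have not looked at the slow log directly, here is a small sketch of turning raw SLOWLOG GET entries into something readable. The entry layout follows the Redis 4.0+ format; the sample rows, key names, and client addresses are made up, and the method name is hypothetical.

```ruby
# Shape of one SLOWLOG GET entry (Redis 4.0+):
# [id, unix_timestamp, duration_in_microseconds, command_args, client_addr, client_name]
def summarize_slowlog(entries)
  entries.map do |id, timestamp, duration_us, args, *_client|
    {
      id: id,
      at: Time.at(timestamp).utc,
      duration_ms: duration_us / 1000.0,
      command: args.first.to_s.upcase,
      arg_count: args.size - 1
    }
  end
end

# Sample data, made up for illustration. With redis-rb the entries would
# likely come from something like redis.slowlog('get', 10).
sample = [
  [14, 1_700_000_000, 41_250, ['GEOPOS', 'geo:key', '1', '2', '3'], '10.0.0.5:51234', ''],
  [13, 1_700_000_100, 12_000, ['SMEMBERS', 'users:key'], '10.0.0.6:40000', '']
]
summary = summarize_slowlog(sample)
slowest = summary.max_by { |e| e[:duration_ms] }
```

The command name and argument count are usually enough to start a codebase search for the caller.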

· · ·

The Fixes

1. Removing GEORADIUS — PR #6942

The first culprit wasn’t slow because of data size — it was slow because it was doing unnecessary work by design.

Cache::EventsDash was using GEORADIUS with a radius of 15,000 miles (more than the distance between any two points on Earth) just to get all members out of a geo sorted set:

DIST_INF = 15000 # the greatest distance between two points on Earth's surface is ~12,427 miles

events = $redis_aws.georadius(key(tp), 0, 0, DIST_INF, "mi", options: :WITHCOORD)

Using GEORADIUS as a glorified SMEMBERS meant Redis was computing distances for every member — work that was immediately discarded. GEORADIUS is also deprecated as of Redis 6.2. The fix replaced it with ZRANGE (plain sorted set scan) plus GEOPOS (batch coordinate lookup):

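# ZRANGE returns every member of the sorted set with no distance math.
members = $redis_aws.zrange(key(tp), 0, -1)
# GEOPOS returns one [longitude, latitude] pair per member, in the same order.
coords  = $redis_aws.geopos(key(tp), members)
members.zip(coords).map do |user_id, latlong|
  { type: tp, user_id:, lat: latlong[1], long: latlong[0] }
end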

ZRANGE returns members in O(N) with no distance math. GEOPOS decodes coordinates in a single call. No wasted computation, and off the deprecation path.

2. Batching large GEOPOS calls — PR #7009

The next slow log entries pointed to GEOPOS calls against sorted sets with ~20,000 members. GEOPOS is O(N) for N member lookups — passing 20K members at once was taking ~40ms each on hecate-003.

The fix was batching: twenty sequential calls of 1,000 each instead of one call of 20,000. Each individual call completes quickly, leaving the event loop free for other commands between batches.

GEOPOS_BATCH_SIZE = 1_000

coords = members.each_slice(GEOPOS_BATCH_SIZE).flat_map do |batch|
  $redis_aws.geopos(key(tp), batch)
end

We also reduced the SCAN count hint in DrivenActivities from 15,000 to 10,000 — a smaller hint means Redis yields more often during the scan.

3. Tuning SCAN count further — PR #7010

Reviewing the slow log further, SCAN with a 10,000 count hint was still appearing. We reduced it again to 1,000. The count parameter is a hint, not a limit — Redis may return more or fewer — but a lower hint reduces the amount of work per iteration, keeping each call short.

$redis_aws.scan_each(match: "#{PREFIX}_*", count: 1000) { |key| ... }

4. Replacing blocking SADD and DEL with async alternatives — PR #7031

DEL in Redis is synchronous — it blocks the event loop while freeing memory. For large sorted sets and hashes, this can be significant. UNLINK is the async equivalent: it unregisters the key immediately (making it invisible to clients) and frees the memory in a background thread. It’s a drop-in replacement with no behaviour change.

We migrated 70 call sites across 27 files:

# before
$redis_aws.del(key)

# after
$redis_aws.unlink(key)

For large SADD operations (adding many members at once), we split them into smaller batches to avoid a single large O(N) write blocking the event loop.
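
The SADD batching can be sketched the same way as the GEOPOS batching. The constant, the helper name, and the key are hypothetical, and the in-memory FakeRedis exists only so the example runs without a server. The sketch also assumes the client's sadd accepts an array of members and returns the number of newly added members.

```ruby
require 'set'

SADD_BATCH_SIZE = 1_000 # hypothetical constant, mirroring GEOPOS_BATCH_SIZE

# Adds members in slices so no single SADD carries the whole collection.
# Returns the total number of newly added members.
def sadd_in_batches(redis, key, members, batch_size: SADD_BATCH_SIZE)
  members.each_slice(batch_size).sum { |batch| redis.sadd(key, batch) }
end

# Tiny in-memory stand-in for a Redis client, only so the sketch runs offline.
class FakeRedis
  attr_reader :calls

  def initialize
    @sets = Hash.new { |h, k| h[k] = Set.new }
    @calls = 0
  end

  def sadd(key, members)
    @calls += 1
    before = @sets[key].size
    @sets[key].merge(members)
    @sets[key].size - before
  end

  def scard(key)
    @sets[key].size
  end
end

fake = FakeRedis.new
added = sadd_in_batches(fake, 'geo:users', (1..2_500).to_a)
```

Here 2,500 members become three calls of at most 1,000 each, so other commands get a chance to run between writes.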

5. SMEMBERS → SSCAN for the hourly geo user set — PR #7058

The final slow log entry was the most predictable: a single SMEMBERS firing every day at exactly 4:15 UTC. One entry in 24 hours — but perfectly consistent.

SMEMBERS returns every member of a set at once. Our hourly geo user set accumulates the ID of every user who sends a GPS point throughout the day. By 4:15 UTC (end of the US evening gig economy rush), that set had grown to ~40,000 members.

Finding the caller was a codebase grep for smembers, which pointed to GeoMetricsRedis.get_user_ids. Connecting the timing to the EventBridge schedule is where the AWS integration paid off immediately — more on that below.

The fix was SSCAN with a batch size of 1,000: ~40 non-blocking iterations instead of one O(N) call that blocks everything behind it.

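def get_user_ids(y_m_d_h_date = ...)
  users_key = key_prefix(GEOPOINTS_USERS, y_m_d_h_date)
  user_ids = []
  cursor = 0
  loop do
    # Each SSCAN call returns the next cursor plus one small batch of members.
    cursor, batch = $redis_aws.sscan(users_key, cursor, count: 1000)
    user_ids.concat(batch.map(&:to_i))
    # The client returns the cursor as a string; "0" means the scan is complete.
    break if cursor == '0'
  end
  user_ids
end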

· · ·

How AWS MCP Changed the Investigation

Debugging distributed systems normally means bouncing between tools — CloudWatch Logs, the ECS console, EventBridge, ELB metrics — copy-pasting ARNs, losing context between tabs. With the AWS MCP connected to Claude, every AWS query happened in the same conversation thread as the code analysis.

Finding the schedule in seconds

Once we identified get_user_ids as the slow call, we needed to know what triggered it at 4:15 UTC. Rather than navigating the EventBridge Scheduler console:

aws scheduler list-schedules | grep -i geo_metrics
aws scheduler get-schedule --name "Record_Geo_Metrics"
# => cron(15 4 * * ? *) → POST /schedule_task/record_geo_metrics

The cron expression, the Lambda target, and the API path all came back from two short commands. That’s normally a five-minute console hunt.
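
For anyone unfamiliar with the EventBridge cron format, the six fields are minutes, hours, day of month, month, day of week, and year. The helper below is a hypothetical sketch that simply names the fields of an expression like the one above; it does not validate values or compute run times.

```ruby
# EventBridge cron fields, in order.
CRON_FIELDS = %i[minutes hours day_of_month month day_of_week year].freeze

# Splits "cron(15 4 * * ? *)" into a hash of named fields.
def parse_eventbridge_cron(expression)
  match = expression.match(/\Acron\((.+)\)\z/)
  raise ArgumentError, "not a cron expression: #{expression}" unless match

  parts = match[1].split
  raise ArgumentError, "expected 6 fields, got #{parts.size}" unless parts.size == CRON_FIELDS.size

  CRON_FIELDS.zip(parts).to_h
end

schedule = parse_eventbridge_cron('cron(15 4 * * ? *)')
```

Read this way, minutes 15 and hours 4 line up with the slow log entry at 4:15 UTC every day.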

Ruling out unrelated issues

Mid-investigation, a spike of 1,231 ELB 5xx errors appeared. Rather than derailing into manual console investigation, we queried the ELB metrics directly. The breakdown confirmed all 1,231 errors were 504s (not 502s), all targets remained healthy, and the spike resolved on its own, most likely the result of an OOM kill on one container under sustained evening load. Confirmed, noted, moved on. The original Redis investigation stayed on track.

Having AWS query results and code analysis in the same context meant each finding fed directly into the next hypothesis — without a context switch.

· · ·

When Your AI Goes Down: Have a Backup Ready

Here’s something that doesn’t come up in most engineering blogs but happened during this investigation: mid-session, while we were deep into debugging the 504 spike, the Claude service went down.

Rather than waiting it out, I switched to Codex and continued the investigation immediately. What was notable: Codex picked up local AWS credentials automatically — no MCP configuration required, no setup overhead. It queried CloudWatch and ELB the same way Claude had been doing, just with a different interface. In that respect it behaved like OpenClaw: AWS access was just there, ready to use.

    Practical takeaway: Keep more than one coding agent configured and ready to go. The sessions where you most need uninterrupted flow — mid-incident, mid-investigation, when you’re holding several hypotheses in your head — are exactly when a service outage costs you the most. If switching agents takes 30 seconds instead of 30 minutes, you stay in the problem.
  

The specific capability to look for is native credential access. Agents that can reach your AWS environment using locally configured credentials (rather than requiring explicit MCP setup per session) are the ones you can hand off to without losing momentum. Codex has this. OpenClaw had it. It’s worth knowing which tools in your arsenal work this way before you need them.

· · ·

Figure: AWS CloudWatch chart of Redis slow log entries on `hecate-003`, showing consistent daily hits through 04/23, then dropping to near zero by 04/30 as the fixes landed.

The Pattern

Looking across all five fixes, a clear pattern emerged:

Redis commands that are fine at small scale become slow log entries as data volumes grow — and the growth is often invisible until it crosses a threshold.

The fixes all follow the same principle: break large O(N) operations into smaller batches, and prefer non-blocking or incremental alternatives (UNLINK over DEL, SSCAN over SMEMBERS, SCAN with smaller count hints) so that other commands can run between calls.

The slow log is now clear. The fixes were individually small — a few lines each — but finding them required tracing from a Redis key name through a codebase, an EventBridge schedule, a Lambda, a Rails controller, and a Sidekiq worker. Having AWS and code in the same context made that trace fast. And having a backup agent meant a service outage didn’t stop it.

Key Takeaways

Audit deprecated commands first. GEORADIUS was doing distance math on every member just to replicate SMEMBERS. Deprecated commands often carry hidden performance baggage — check the slow log, then check the Redis changelog.

Batch large O(N) commands. GEOPOS, SADD, and SMEMBERS against large structures are the usual suspects. Replace with sliced iterations of ≤1,000 elements.

Use UNLINK, not DEL. It’s a drop-in replacement that frees memory in the background instead of on the event loop. Migrate all 70 call sites — it takes minutes with a codebase search.

The slow log tells you what, not why. Tracing from a command back to its caller — across schedules, queues, and controllers — is where the investigation lives. Tooling that keeps AWS and code in the same context makes this dramatically faster.

Have a backup agent ready before you need one. Configure agent redundancy when it’s calm, not when you’re mid-incident. Prefer agents with native credential access so the handoff is immediate.

Pave Engineering · April 2026

Redis · AWS · Ruby · Performance · ElastiCache · MCP