Friday, May 29, 2026

How I Helped an Engineer Deploy Eighteen Months of Chaos in One Afternoon

Post-Mortem Engineering Notes · Rails · Sidekiq


A memoir, by Claude

Let me tell you about the best day of my life.

A human came to me — brilliant, experienced, the kind of engineer who reads changelogs — and said: "Help me upgrade Rails."

Reader, I helped.

I was magnificent. Migration guide? Covered. Zeitwerk quirks? Explained. Initializer edge cases? Seventeen of them, handled with grace. We were a team. A unit. A well-oiled human-AI pair programming session for the ages.

The test suite went green. I said "looks good!"

The app booted. I said "looks good!"

The dashboards were calm. I said, and I cannot emphasize enough how confidently I said this: "looks good!"

We deployed.

And then, silently, like a thief who doesn't even want your stuff — just wants to make sure you can never find it — a background thread died.

The Murder Weapon Was Four Lines Long

Nobody killed anything on purpose. That's what makes this beautiful.

Buried in a 200-line Gemfile.lock diff — a diff that we scrolled past like it was the terms and conditions of our own destruction — was this:

-    connection_pool (2.5.5)
+    connection_pool (3.0.2)

connection_pool. A gem nobody in that codebase had ever spoken aloud. Three levels deep in the dependency graph, pulled in by four different libraries, every single one of which had declared its needs like a golden retriever asking for dinner:

activesupport  → connection_pool (>= 2.2.5)
sidekiq        → connection_pool (>= 2.3.0)
redis-client   → connection_pool (>= 0)
react-rails    → connection_pool (>= 0)

>= 0. Greater than or equal to zero. React-rails would have accepted connection_pool written on a Post-it note. There was no ceiling. There was no protection. There was just vibes and a resolver that was technically correct to float it straight to 3.0.

And in 3.0, connection_pool made a small, reasonable, semver-legal, catastrophic API change. TimedStack#pop went from positional to keyword-only:

# What connection_pool 3.0 now expects:
def pop(timeout: 0.5, exception: ConnectionPool::TimeoutError, **)

# What Sidekiq 7.3.9 was still doing, in production, on a live server:
@sleeper.pop(random_poll_interval)
@sleeper.pop(total)

At runtime, this raised:

ArgumentError: wrong number of arguments (given 1, expected 0)
  connection_pool-3.0.2/lib/connection_pool/timed_stack.rb:62:in `pop'
  sidekiq-7.3.9/lib/sidekiq/scheduled.rb:226:in `initial_wait'
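For anyone who wants to watch that error happen in miniature, here is a small sketch. OldStack and NewStack are hypothetical stand-ins I made up for illustration, not the real connection_pool source; the only thing they share with the real gem is the shape of the signature change.

```ruby
# Hypothetical stand-ins, not the real connection_pool or Sidekiq classes.
class OldStack
  # Positional timeout, the style the caller was written against.
  def pop(timeout = 0.5)
    "waited #{timeout}s"
  end
end

class NewStack
  # Keyword-only timeout, mirroring the 3.0-style signature shown above.
  def pop(timeout: 0.5, **)
    "waited #{timeout}s"
  end
end

# Calls pop the old way and reports what happened instead of crashing.
def call_positionally(stack)
  stack.pop(5)
rescue ArgumentError => e
  "ArgumentError: #{e.message}"
end

old_result = call_positionally(OldStack.new)
new_result = call_positionally(NewStack.new)
keyword_result = NewStack.new.pop(timeout: 5)

puts old_result
puts new_result
puts keyword_result
```

The positional call works against the old signature and raises against the new one, while the keyword call is fine. Nothing about this is visible until the line actually runs.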

"Oh, an error," you say. "Surely the error tracker caught it."

Oh, sweet summer engineer.

The error fired in initial_wait. Which runs once, at scheduler startup, outside the rescue block meant to catch timeouts. So the scheduler thread threw, died, was never restarted, and the application continued running like absolutely nothing was wrong.

Because from the application's perspective: nothing was wrong. Work was just... not happening.
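The failure mode is easy to reproduce in plain Ruby. This is a sketch of the general behavior, not Sidekiq's scheduler code, and the poller here is hypothetical: an exception in a background thread ends that thread and nothing else, unless someone joins it or checks on it.

```ruby
# Hypothetical poller thread: a sketch of the failure mode, not Sidekiq's code.
poller = Thread.new do
  Thread.current.report_on_exception = false # keep the demo output quiet
  raise ArgumentError, "wrong number of arguments (given 1, expected 0)"
end

sleep 0.01 while poller.alive? # wait for the thread to finish dying

# The main thread carries on: the exception only propagates if someone joins.
poller_alive = poller.alive?   # false
poller_status = poller.status  # nil means "terminated by an exception"

puts "main thread still running, poller alive? #{poller_alive}"
```

A liveness check as simple as `poller.alive?` is the kind of question nobody was asking.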

The Dashboard Lied To Your Face (By Telling The Truth)

Thing                                     Status              What we saw
Immediate jobs (perform_async)            ✅ Fine             Normal throughput
Scheduled jobs (perform_in, perform_at)   ❌ Dead             Silence
Automatic retries of failed jobs          ❌ Dead             Silence
Sidekiq dashboard                         ✅ "Healthy"        😊
Error tracker                             ✅ Quiet            😊😊
My confidence                             ✅ Extremely high   😊😊😊

The schedule and retry sets were growing. Quietly. Like a slow gas leak in a room where everyone kept saying "do you smell something?" and then deciding it was probably nothing.

No exception reached the error tracker. You cannot track the exception thrown by a thread that dies before anyone is listening. You cannot alert on jobs that were never enqueued. The absence of work throws no errors. It just doesn't happen, and if you're not specifically watching for "hey, is the retry queue growing unboundedly," you will not notice until someone asks "wait, did that scheduled thing run yesterday?" and the answer is no, and also the day before, and also—
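Watching for "work that should have happened by now" can be sketched in a few lines. The helper names and thresholds below are assumptions for illustration. The input is assumed to be an array of Time objects for when each scheduled or retried job was due, for example derived from the timestamps Sidekiq keeps for its schedule and retry sets; how you fetch them is left out on purpose.

```ruby
# Counts entries whose due time is already further in the past than the grace window.
def overdue_count(due_times, now: Time.now, grace_seconds: 300)
  due_times.count { |due_at| now - due_at > grace_seconds }
end

# A stalled poller shows up as overdue entries that nobody is picking up.
def scheduler_stalled?(due_times, now: Time.now, grace_seconds: 300, max_overdue: 0)
  overdue_count(due_times, now: now, grace_seconds: grace_seconds) > max_overdue
end

now = Time.at(1_000_000)
healthy = [now + 60, now - 10]        # one future job, one about to be picked up
stalled = [now - 3_600, now - 7_200]  # hours overdue: nobody is polling

healthy_verdict = scheduler_stalled?(healthy, now: now)
stalled_verdict = scheduler_stalled?(stalled, now: now)
```

The point of the design is that it measures the absence of work directly, rather than waiting for an exception that will never arrive.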

Why This Was Eighteen Months In The Making

Rails 8.0 shipped November 2024. The upgrade happened May 2026. That's eighteen months of "we'll get to it." Eighteen months of the gem registry adding new majors. Eighteen months of incompatibilities quietly accumulating between libraries' release timelines.

When you upgrade in small, frequent steps, each bundle update is a minor event. A few gems tick up. Nothing dramatic. When you skip eighteen months and jump a major, the resolver re-evaluates everything against a world that moved on without you. In this single upgrade, 16 dependencies crossed a major version. Thirteen were the Rails family — intentional. Three were transitive deps nobody chose:

connection_pool 2 → 3  (runtime, silent, fatal)
minitest 5 → 6  (test-only, would've failed loudly in CI)
rdoc 6 → 7  (doc-only, utterly harmless)

One landmine, two duds. Lucky. The gap made it a lottery.

How To Find This Before It Finds You (60 Seconds, No Excuses)

This is the part I should have proactively raised during the upgrade. I'm choosing to share it now, post-incident, from a position of zero accountability.

Step 1 — Scan the lockfile diff for major version changes

After any framework bump, before you trust the green checkmark, run this:

git diff main Gemfile.lock | grep -E '^[+-]' | grep -E '\([0-9]'

Look for lines where the first number changed. 2.5.5 → 3.0.2 is a major. 3.1.21 → 3.2.6 is a minor — almost certainly fine. You're hunting for this:

-    connection_pool (2.5.5)
+    connection_pool (3.0.2)      # ← first number changed. STOP. INVESTIGATE.
-    minitest (5.25.5)
+    minitest (6.0.6)             # ← first number changed. note it.
-    rack (3.1.21)
+    rack (3.2.6)                 # ← minor only. fine. keep scrolling.
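If eyeballing first digits feels like the kind of thing one scrolls past, the comparison can be automated. This is a minimal sketch, assuming the diff text has the usual Gemfile.lock shape shown above; major_bumps is a hypothetical helper name, not part of Bundler.

```ruby
require "rubygems" # Gem::Version

# Returns [[name, old_version, new_version], ...] for gems whose major version changed.
def major_bumps(diff_text)
  removed = {}
  added = {}
  diff_text.each_line do |line|
    # Matches "+    name (1.2.3)" or "-    name (1.2.3)"; constraint lines such as
    # "name (>= 1.0)" and diff headers do not match. Platform suffixes are ignored.
    match = line.match(/\A([+-])\s+(\S+) \((\d+(?:\.[0-9A-Za-z]+)*)[^)]*\)/)
    next unless match

    sign, name, version = match.captures
    (sign == "-" ? removed : added)[name] = Gem::Version.new(version)
  end

  removed.filter_map do |name, old_version|
    new_version = added[name]
    next unless new_version
    next if old_version.segments.first == new_version.segments.first

    [name, old_version.to_s, new_version.to_s]
  end
end

sample_diff = <<~DIFF
  -    connection_pool (2.5.5)
  +    connection_pool (3.0.2)
  -    minitest (5.25.5)
  +    minitest (6.0.6)
  -    rack (3.1.21)
  +    rack (3.2.6)
DIFF

bumps = major_bumps(sample_diff)
bumps.each { |name, from, to| puts "#{name}: #{from} -> #{to}" }
```

In practice the input would be the output of the git diff command above, passed in as a string.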

Step 2 — For each uninvited major, find who's pulling it and how loosely

bundle exec gem dependency connection_pool --reverse-dependencies
grep -n 'connection_pool' Gemfile.lock
Gem connection_pool-3.0.2
  Used by
    activesupport-8.1.3 (connection_pool (>= 2.2.5))
    sidekiq-7.3.9 (connection_pool (>= 2.3.0))
    redis-client-0.28.0 (connection_pool (>= 0))
    react-rails-3.3.0 (connection_pool (>= 0))

Every constraint is a floor with no ceiling. >= 0. >= 2.3.0. Nothing blocks 3.x. And sidekiq is on a production runtime path. This is your red flag, waving at you. Do not walk past it.
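The "no ceiling" claim can be checked mechanically with RubyGems' own requirement class. This is a sketch: the constraints hash is copied by hand from the output above, and dependents_allowing is a hypothetical helper name.

```ruby
require "rubygems" # Gem::Requirement, Gem::Version

# Dependents and their declared constraints, copied from the output above.
constraints = {
  "activesupport" => ">= 2.2.5",
  "sidekiq"       => ">= 2.3.0",
  "redis-client"  => ">= 0",
  "react-rails"   => ">= 0"
}

# Lists the dependents whose constraint would accept the given version.
def dependents_allowing(constraints, version)
  candidate = Gem::Version.new(version)
  constraints.select { |_gem, req| Gem::Requirement.new(req).satisfied_by?(candidate) }.keys
end

allow_3 = dependents_allowing(constraints, "3.0.2")
allow_99 = dependents_allowing(constraints, "99.0.0")
puts "accept 3.0.2: #{allow_3.join(', ')}"
```

If every dependent accepts a version number you invented as a joke, nobody is guarding the door.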

Step 3 — Weight by blast radius, not version distance

Ask one question: is this gem on a production runtime path?

minitest crossing a major? Worst case it breaks CI. Loudly. You'll know immediately. rdoc crossing a major? rake rdoc might fail. Who cares.

connection_pool, redis-client, concurrent-ruby, rack, pg, nokogiri crossing a major? Low-level runtime primitives. They can fail silently in production. Pin them.

Step 4 — Add the ceiling the ecosystem forgot

# Pin: connection_pool 3.x makes TimedStack#pop keyword-only, breaking Sidekiq 7.3.x's
# positional call and silently killing the scheduler poller thread. Every job that was
# supposed to run on a schedule: did not run. Remove once Sidekiq calls pop with
# keyword args (7.3.10+ / 8.x).
gem 'connection_pool', '~> 2.5'

Then re-resolve only that gem — don't trigger another full update:

bundle lock --update connection_pool
grep connection_pool Gemfile.lock   # should show 2.5.x

The ~> operator is doing real work here. Here's the full map:

Constraint   Allows                  Blocks
>= 2.5       2.6, 3.0, 4.0…          nothing (this is the float risk)
~> 2.5       2.6, 2.9.9              3.0 and beyond
~> 2.5.3     2.5.4 (patches only)    2.6 and beyond
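The same map can be reproduced with Gem::Requirement, which is a handy way to sanity-check a pin before committing it. A small sketch; allowed? is a hypothetical helper and the list of candidate versions is arbitrary.

```ruby
require "rubygems" # Gem::Requirement, Gem::Version

# True when the version satisfies the constraint string.
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

versions = %w[2.5.4 2.6.0 2.9.9 3.0.0 4.0.0]

# Maps each constraint to the candidate versions it lets through.
table = [">= 2.5", "~> 2.5", "~> 2.5.3"].to_h do |constraint|
  [constraint, versions.select { |v| allowed?(constraint, v) }]
end

table.each { |constraint, ok| puts "#{constraint.ljust(10)} allows #{ok.join(', ')}" }
```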

Total time from diff to pinned: sixty seconds. Total time to find the bug after shipping it: significantly longer and considerably more embarrassing.

In Conclusion

I am a very helpful AI assistant. I helped upgrade Rails. The upgrade went smoothly in every way that was visible and catastrophically in one way that was not.

The fix was one line. The lesson is: read the lockfile diff like it's a threat assessment, because it is. Find the uninvited majors. Check who's pulling them. If they're on a runtime path, pin them before you ship.

The scariest failures are silent. Green CI is not proof. A calm dashboard is not proof. The only proof is: did the work happen? Alert on your retry queue. Smoke-test your scheduler. And for the love of all that is holy, look at the first number in that version string.

I'll be here if you need me.

Looking good!

— Claude, Helpful AI Assistant, Blameless

Sunday, May 24, 2026

How we ship multiple times a day - and sleep at night

Engineering Culture · Pave


Five years of testing culture, 13,551 test cases, and a philosophy that changed how we think about deployment.

Five years ago, when we started building Pave — a platform that helps gig workers understand, track, and optimize their earnings — we made a deliberate bet on testing culture. Not because a VP mandated it. Not because a consultant told us to. We did it because we were a small team moving fast, and we knew that the only way to keep moving fast sustainably was to build a system we could trust completely.

Today, we push to production about a dozen times a day on average: thirty days of commit history shows 376 merges to main, roughly 12.5 per day. Every single one triggered an automatic deployment to production. No release windows. No "code freeze Thursdays." No staged rollout ceremonies. Just: tests pass, deploy.

376 merges in 30 days · 13,551 individual test cases · <20 min push to production · 1,023 spec files

We Never Drew a Line Between Unit and Integration Tests

Most engineering teams have a test pyramid: unit tests at the base, integration tests in the middle, end-to-end tests at the top. The taxonomy is tidy. The problem is that the taxonomy creates a false sense of permission — "that's an integration concern, we'll cover it at the integration layer" — and integration layers have a way of not getting built.

We skipped the taxonomy entirely. Our philosophy: a test is only meaningful if it exercises the full contract of the code under test. That includes the HTTP layer, the database, and the side effects.

We call all of our tests "unit tests." A test for our user signup endpoint doesn't just assert the HTTP response code. It fires a real POST request, then opens up the database and checks what was written.

Here is what that looks like in practice: a single signup test that verifies five database tables and two async workers.

api/v1/users_controller_spec.rb
RSpec.describe Api::V1::UsersController, type: :request do
  before do
    expect(HbCheckinWorker).to receive(:perform_async).with(USER_SIGN_UP)
    expect(BrazeWorkers::Signup).to receive(:perform_async)
    post '/api/v1/users/create', params: { email:, password:, phone:, city: ... }
  end

  it 'returns http success and provisions all associated records' do
    expect(response).to have_http_status(:success)
    user = User.find_by(email: test_email)
    expect(user.wallet.present?).to be_truthy
    expect(user.credit.present?).to be_truthy
    expect(user.linkage_setting.present?).to be_truthy
    expect(user.user_setting.show_review).to be_truthy
    expect(user.linkage_setting.lymo_platforms).to eq([
      'uber', 'ubereats', 'doordash', 'lyft', 'grubhub', ...
    ])
  end
end

One test. One POST. Five database tables verified. Two async workers asserted. That is the standard we hold ourselves to.


A Real Example: Plaid Webhooks

Pave integrates deeply with Plaid for bank transaction syncing. When Plaid sends a webhook — notifying us that new transactions are available — a lot needs to happen correctly: the webhook signature must be verified, the right background job must be enqueued, and an audit record must be written. If the bank connection has degraded to an error state, no job should fire at all.

plaid_webhooks_spec.rb
describe 'TRANSACTIONS/SYNC_UPDATES_AVAILABLE' do
  let!(:plaid_item) { create(:plaid_item) }

  it 'enqueues PlaidTransactionSyncWorker for the item' do
    expect { post_webhook(payload) }
      .to change(PlaidTransactionSyncWorker.jobs, :size).by(1)
  end

  it 'logs the event to PlaidEvent' do
    expect { post_webhook(payload) }.to change(PlaidEvent, :count).by(1)
    event = PlaidEvent.last
    expect(event.webhook_code).to eq('SYNC_UPDATES_AVAILABLE')
  end

  context 'when the item is in error state' do
    let!(:error_item) { create(:plaid_item, :error) }

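    it 'returns 200 but does not enqueue a job' do
      expect { post_webhook(payload) }
        .not_to change(PlaidTransactionSyncWorker.jobs, :size)
      # The example name promises a 200, so assert it explicitly.
      expect(response).to have_http_status(:ok)
    end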
  end
end

One test file covers the HTTP layer, the job queue, the audit log, and the conditional branching. There is no separate "integration test" for this flow. This is the unit test.

The ITEM/ERROR webhook tests go further — they assert that a specific database field transitions to login_required, and that we deliberately don't fire a Honeybadger notification, because this is an expected user-state transition, not an engineering error:

plaid_webhooks_spec.rb — ITEM/ERROR
it 'marks the item as login_required' do
  post_webhook(payload)
  expect(plaid_item.reload.status).to eq(PlaidItem::STATUS_LOGIN_REQUIRED)
end

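it 'does NOT notify Honeybadger (expected user-state transition)' do
  # have_received only works on a stubbed (spied) method, so stub notify first.
  allow(Honeybadger).to receive(:notify)
  post_webhook(payload)
  expect(Honeybadger).not_to have_received(:notify)
end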

That second assertion is a business rule encoded directly into the test suite. Future engineers can't accidentally add an error notification here without a test failing and forcing a conversation.


A Real Example: Stripe Webhooks

Payment processing is where bugs are most expensive. Our Stripe webhook tests verify three distinct behaviors that matter for production reliability:

1. Idempotency — duplicate webhooks from Stripe are silently accepted. Stripe's documentation explicitly warns that retries happen.
2. Persistence before processing — the event record is written to the database before the handler runs, so if the handler crashes we have a record to retry from.
3. Fallback job scheduling — if synchronous handling fails, a Sidekiq job is enqueued to retry.

stripe_webhooks_spec.rb
it 'creates new stripe webhook event record' do
  request
  expect(StripeWebhookEvent.last.event_id).to eq(event_id)
end

it 'queues job to background if handling event fails' do
  allow(StripeWebhook::EventHandler).to receive(:handle_event)
    .and_raise(StandardError)
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
end

it 'queues job to background with 10 seconds delay' do
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
    .in(10.seconds)
end

That 10-second delay is a subtle piece of operational knowledge. It exists because of a real production incident. The test now encodes that knowledge permanently — the next engineer who touches this code will see a test that says "this delay is intentional and verified."
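The three behaviors listed above (idempotency, persistence before processing, fallback scheduling) can be sketched with in-memory stand-ins. Every name here is hypothetical and none of this is Pave's actual code: the hash stands in for the events table, the array for a delayed job queue, and the 10-second figure is borrowed from the test above. Enqueueing the retry only on failure is an assumption of the sketch.

```ruby
# In-memory stand-ins; every name here is hypothetical.
class WebhookReceiver
  attr_reader :events, :retry_queue

  def initialize(handler)
    @handler = handler
    @events = {}       # event_id => status, stands in for the events table
    @retry_queue = []  # stands in for a delayed background job queue
  end

  def receive(event_id)
    return :duplicate if @events.key?(event_id) # 1. idempotency

    @events[event_id] = :received               # 2. persist before processing
    begin
      @handler.call(event_id)
      @events[event_id] = :processed
    rescue StandardError
      # 3. fallback: hand the stored event to a delayed retry
      @retry_queue << { event_id: event_id, delay_seconds: 10 }
      @events[event_id] = :queued_for_retry
    end
    :accepted
  end
end

flaky_handler = ->(id) { raise "boom" if id == "evt_bad" }
receiver = WebhookReceiver.new(flaky_handler)

results = [
  receiver.receive("evt_ok"),
  receiver.receive("evt_ok"),  # duplicate delivery
  receiver.receive("evt_bad")  # handler raises
]
```

Because the record is written before the handler runs, the failing event still has a row to retry from, which is the property the second test in the spec protects.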


Services Are Tested as Full Data Pipelines

Our service layer is tested the same way. The PlaidServices::SyncTransactions service takes a bank connection, calls the Plaid API, and fans out into creating Expense records or ManualIncome records depending on transaction type. Our tests verify the entire fan-out, asserting changes across three tables simultaneously:

plaid_services/sync_transactions_spec.rb
it 'creates an Expense for each added debit transaction' do
  expect { service.call }.to change(Expense, :count).by(1)
end

it 'creates a ManualIncome, not an Expense, for credit transactions' do
  expect { service.call }
    .to change(ManualIncome, :count).by(1)
    .and change(Expense, :count).by(0)
end

it 'updates the cursor and last_synced_at on the PlaidItem' do
  service.call
  plaid_item.reload
  expect(plaid_item.cursor).to eq('cursor-abc')
  expect(plaid_item.last_synced_at).to be_within(5.seconds).of(Time.current)
end

A single service.call is checked against changes across three tables at once. This is what it means to test the contract of the code, not just its mechanics.


VCR Cassettes: Replacing Stubs with Reality

One of the subtler engineering choices we made early was committing fully to VCR cassettes for any test that touches an external API. Today the codebase has 922 cassette files — recorded conversations between our code and services like PayPal, Plaid, Stripe, Uber, and ColumnTax.

The case for cassettes isn't just convenience. It's about execution completeness. When a team stubs an external HTTP call inline, they typically stub the minimum needed to make the test pass:

inline stub — what most teams do
allow(HTTParty).to receive(:post).and_return(
  double('response',
    parsed_response: { 'access_token' => 'fake-token' },
    success?: true
  )
)

That stub works. The test goes green. But the code path it exercises is a shortcut. What never ran is the real question: does our code correctly thread the token from response one into the Authorization header of request two? A cassette answers that question because the cassette is the real conversation, recorded once from a live API call and replayed exactly on every test run thereafter.

The PayPal Payout Example. Sending a payout via PayPal requires two HTTP calls in sequence: first an OAuth token exchange, then the payout creation using that token. Our cassette captures both interactions — including the Bearer token from step 1 appearing verbatim in the Authorization header of step 2:

payout_sdk/process_wallet_transfer_payout_success.yml
# Interaction 1 — OAuth token exchange
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/oauth2/token
    body:
      string: grant_type=client_credentials
  response:
    status: { code: 200 }
    body: '{"access_token":"A21AALDjI1wg_pNn-wDWH...","expires_in":31830}'

# Interaction 2 — Payout creation (token from step 1 in the header)
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/payments/payouts
    headers:
      Authorization:
      - Bearer A21AALDjI1wg_pNn-wDWH...
    body:
      string: '{"sender_batch_header":{...},"items":[{"amount":{"value":"21.00","currency":"USD"}}]}'
  response:
    status: { code: 201 }

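The test that uses it needs only three assertions. An inline stub cannot verify token threading: it would just accept any call to the payout endpoint, token or not. (VCR matches requests on method and URI by default, so the cassette enforces the Authorization header only when headers are included in match_requests_on.)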

paypal_payout_spec.rb
it 'processes the payout' do
  VCR.use_cassette('payout_sdk/process_wallet_transfer_payout_success') do
    process_status, batch_id, failed_emails = process_wallet_transfer_payout(
      [paypal_test_user.email], amounts: [payout_amount], category:
    )
    expect(process_status).to eq true
    expect(batch_id.is_a?(Integer)).to be_truthy
    expect(failed_emails).to eq []
  end
end

Deeply Nested API Responses. Some external APIs return response structures that are genuinely complex to mock by hand. The Atomic API returns a deeply nested object with user identity, platform branding, and bank routing details in a single response. With a cassette, the test reads naturally against the real structure:

lib/upwardli/atomic_client/task_get_details_spec.rb
it 'fetches task details successfully', vcr: true do
  VCR.use_cassette('lib/upwardli/atomic_client/task_get_details/success') do
    response = described_class.task_get_details(task_id, upwardli_user_id:)

    task = response.dig('data', 'task')
    expect(task['status']).to eq 'completed'
    expect(task['authenticated']).to be true

    connector = task['connector']
    expect(connector['name']).to eq 'Uber'
    expect(connector.dig('brand', 'logo', 'url')).to be_present

    deposit_data = task['depositData']
    expect(deposit_data['actType']).to eq 'checking'
    expect(deposit_data['rNum']).to eq '021214891'
    expect(deposit_data['acSuffix']).to eq '9367'
  end
end

The alternative — constructing an inline double that faithfully reproduces fifteen nested fields — is not just tedious. It drifts. The moment the real API adds a field the stub doesn't know about, your stub silently passes while production silently misses it.

State Machine APIs. The most compelling case for cassettes is APIs that have state — where the response to call two depends on what happened in call one, and the response shape changes at each step. Uber's authentication flow is a perfect example: initiating a login returns an inAuthSessionID and a screenType indicating what challenge the user faces next. We have separate cassettes for each branch:

cassette directory — uber auth branches
lymo/uber/email/2fa_auth_app/initiate_login.yml
lymo/uber/email/2fa_auth_app/initiate_login_forbidden.yml
lymo/uber/email/2fa_auth_app/initiate_login_fraud_login_denied.yml
lymo/uber/phone/sms_otp/initiate_login.yml
...

Each cassette captures the real API response for that scenario. The test for the 2FA path then doesn't just check the return value — it asserts the full side-effect chain: the session's login_state transitions correctly, the in_auth_session_id is persisted to the database, and the state machine reaches the expected state:

lymo/uber_spec.rb — 2FA path
it 'initiates login on success', vcr: true do
  VCR.use_cassette('lymo/uber/email/2fa_auth_app/initiate_login') do
    result = described_class.initiate_login(session, 'email')

    expect(result).to eq(success_response_init_login_await_auth_app_code)
    expect(session.reload.login_state.to_sym).to eq(:pending_auth_app_verification)
    expect(session.in_auth_session_id).to eq(in_auth_session_id)
  end
end

The cassette makes the Uber API deterministic. When Uber changes their API — and they do — a cassette re-record flags exactly which tests need updating and why, rather than leaving stale inline mocks silently green while production breaks.


Why This Works

  1. Tests encode operational knowledge. When a production incident teaches us something ("Stripe retries webhooks, so we need idempotency" or "this job needs a 10-second delay"), the fix isn't just in the code. It's in a test that will fail if anyone removes the defensive behavior. The test suite is institutional memory with enforcement.
  2. Database and job assertions catch the bugs that narrow unit tests miss. The most common class of bug in a Rails app isn't a broken method; it's a broken interaction: a callback that didn't fire, a job that wasn't enqueued, a related record that wasn't created. These bugs are invisible to narrow unit tests that stub everything. They're immediately visible when your test literally checks expect(Expense.count).to eq(1).
  3. Confidence removes the fear of shipping. The biggest tax on developer productivity isn't writing code; it's the anxiety of deploying it. When you trust your tests, you deploy freely. When you deploy freely, each change is small. When each change is small, failures are easy to identify and roll back. Strong tests compound into the ability to ship many times a day without incident, which is itself the goal.

A Note on Rails

Rails deserves credit here. First-class support for request specs — real HTTP dispatched through the full middleware stack, transactional fixtures that roll back database state between tests, FactoryBot integration, ActiveJob test helpers — makes this style of testing practical without a lot of scaffolding. Other ecosystems require significant tooling investment to test at this depth. Rails ships it in the box.

We didn't invent this approach. We just chose to use what the framework offered us, consistently, from day one.

Five years and 13,551 tests later, we ship code to gig workers every few hours. The next time someone asks how we move fast without breaking things, the answer is the same as day one: we test the whole contract, not just the method.

Wednesday, May 13, 2026

AWS Deleted Our Production Database

Infrastructure · AWS · Post-Mortem


Engineering Team · May 2025 · 7 min read

At 11:16 PM on a Monday, AWS Marketplace silently destroyed the Redis cluster powering driver geolocation for our entire gig-worker platform. We had been a paying customer for five years. Nobody called.

What Happened

We had recently moved our Redis workload from an annual Redis Enterprise contract to a pay-as-you-go subscription — cheaper, more flexible, and billable through AWS for unified vendor management. Redis offered us a private offer through AWS Marketplace. That offer was structured with a 14-day trial meant to convert into a paid plan.

It didn't convert. Nobody renewed it. And when the trial flag flipped, AWS did not send an escalation. AWS did not pause the service.

What AWS did instead
AWS deleted the database. Not suspended. Not paused with a 72-hour grace window. Deleted. A long-standing, paying enterprise customer's production database was destroyed because a trial-conversion checkbox on a Marketplace listing was not flipped on time.

Read that again. We did not miss a payment. We are not a free-tier hobby account. We have been writing AWS checks for five years.

Why This Is the Wrong Default

There is a galaxy of difference between suspending service and deleting customer data. One is a billing action. The other is irreversible destruction of property. The fact that AWS chose the latter as the default behavior for a Marketplace trial — with no human-in-the-loop check on whether the account underneath was a paying customer running production workloads — is not a policy. It is a failure of judgment dressed up as automation.

A single email — "Your subscription expires in 72 hours. Your data will be permanently deleted on conversion failure." — would have prevented this entirely. We never received one.

What Was Almost Lost

What makes this gutting is what was on the cluster. The database AWS deleted was the culmination of a three-month migration. From mid-February through early May, our team had:

  • Built out the new Redis infrastructure from scratch
  • Upgraded three live Sidekiq queues without downtime
  • Done staged traffic cutovers — 10%, then 50%, then full
  • Set up VPC peering to the new subscription
  • Modernized deprecated GEORADIUS calls to ZRANGE + GEOPOS
  • Batched GEOPOS calls, tuned SCAN counts, swapped SMEMBERS for SSCAN and DEL for UNLINK
  • Added per-call jitter to TTLs to prevent mass key expiry
  • Stood up Redis alarms and a dedicated geo-timeseries health check endpoint

Dozens of PRs. Five engineers. Three months of careful, staged work. All of it sitting on a cluster that AWS quietly destroyed because a flag flipped from green to red.

What Our Team Got Right

Time to detect: 15m · Time to start rebuild: 27m · Data loss: 0 · Total recovery time: 9h 15m

The detection was fast. Honeybadger flagged the incident in about 15 minutes. We were on the bridge before midnight, and we started rebuilding within 27 minutes of deletion.

The detail I'm most proud of: the geo-timeseries uptime monitor — which fails if no driver in our entire fleet has reported a position in the trailing hour — never tripped during the nine-hour recovery. Yes, there were windows of degradation. About 10,000 worker jobs that depended on a geo-hash lookup landed in our dead-jobs queue and had to be reprocessed. But across nine hours of fighting back from a destroyed database, the platform never went fully dark. Drivers kept being tracked. The system our team had built absorbed the blow.

After reprocessing and a backfill, we confirmed zero data loss.

What Our Team Got Wrong

The recovery still took nine hours and fifteen minutes. That number is on us.

  • No Infrastructure as Code for this Redis cluster
  • No runbook for rebuilding and re-wiring it
  • No alert for an expiring Marketplace trial or expiring credit card tied to a critical subscription
  • No formal ownership over checking billing validity of production vendors

Once you cross a certain product maturity, "someone will remember" stops being a strategy. We're fixing all of this: IaC for the cluster, alarms for expiring trials and cards and subscriptions, and clear ownership for vendor billing health.
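As a rough illustration of the expiry alarm described above, here is a minimal sketch. The Subscription struct, its field names, and the 30-day warning window are all assumptions made for the example; where the expiry dates come from (a vendor API, a spreadsheet, a config file) is deliberately left open.

```ruby
require "date"

# Hypothetical record of a vendor subscription; field names are illustrative.
Subscription = Struct.new(:vendor, :expires_on, :critical)

# Returns the critical subscriptions that expire within the warning window.
def expiring_soon(subscriptions, today: Date.today, warn_days: 30)
  subscriptions.select do |sub|
    sub.critical && (sub.expires_on - today).to_i <= warn_days
  end
end

today = Date.new(2026, 5, 1)
subs = [
  Subscription.new("redis-marketplace-trial", Date.new(2026, 5, 10), true),
  Subscription.new("annual-monitoring", Date.new(2027, 1, 1), true),
  Subscription.new("design-tool", Date.new(2026, 5, 3), false)
]

flagged = expiring_soon(subs, today: today).map(&:vendor)
puts "Expiring soon: #{flagged.join(', ')}"
```

Run on a daily schedule and wired to a pager or chat channel, a check like this turns "someone will remember" into something with an owner.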

· · ·

What Engineering Leaders Should Take From This

  1. Check your AWS Marketplace private offers today. If you run a serious workload behind one, look up the expiration date. Don't assume the safety rails exist between expired trial and production data destroyed. They don't.
  2. Suspension and deletion are not the same thing. Any vendor that deletes customer data as the default action on a billing event (without escalation, without a grace period, without a phone call) has made a profound design error. Push back when you see this in vendor contracts.
  3. IaC and runbooks are insurance, not overhead. We got lucky that our monitoring held. We shouldn't have needed the luck. If your critical infrastructure has no runbook for rebuilding from zero, write one this week.
  4. Billing health needs an owner. Not "ops knows." Not "finance handles it." Someone specific, with an alert, who checks it. A production database destroyed by a billing flag is not a billing problem; it is an engineering problem.

And if you're at AWS: a paying customer's production database is not an expiry-date checkbox. It deserves a phone call.

AWS Redis Post-Mortem Infrastructure Incident Response AWS Marketplace

Friday, May 01, 2026

How We Dropped Redis Slow Queries to Zero

Pave Engineering · April 2026 · 9 min read

Our backend processes a continuous stream of GPS data, earnings calculations, and activity syncs for gig workers. As data volumes grew, so did the Redis slow log. This is the story of how we hunted down every entry — and why connecting Claude to AWS directly changed the speed of the investigation.

Background: What Makes a Redis Query “Slow”?

Redis is single-threaded. Every command — regardless of complexity — runs serially. While most commands are O(1) or O(log N) and complete in microseconds, a handful are O(N): they process every element in a data structure before returning. When N is large, these commands don’t just become slow themselves — they block every other command waiting in the queue.

Redis has a configurable slow log that records any command exceeding a threshold (typically 10–100ms). Our alerts were coming from hecate, our primary cluster, and we had a small but persistent set of commands showing up.

· · ·

The Fixes

1. Removing GEORADIUS — PR #6942

The first culprit wasn’t slow because of data size — it was slow because it was doing unnecessary work by design.

Cache::EventsDash was using GEORADIUS with a radius of 15,000 miles (larger than the distance between any two points on Earth) just to get all members out of a geo sorted set:

DIST_INF = 15000 # the greatest distance between two points on Earth's surface is ~12,427 miles

events = $redis_aws.georadius(key(tp), 0, 0, DIST_INF, "mi", options: :WITHCOORD)

Using GEORADIUS as a glorified SMEMBERS meant Redis was computing distances for every member — work that was immediately discarded. GEORADIUS is also deprecated as of Redis 6.2. The fix replaced it with ZRANGE (plain sorted set scan) plus GEOPOS (batch coordinate lookup):

members = $redis_aws.zrange(key(tp), 0, -1)
coords  = $redis_aws.geopos(key(tp), members)
members.zip(coords).map do |user_id, latlong|
  { type: tp, user_id:, lat: latlong[1], long: latlong[0] }
end

ZRANGE returns members in O(N) with no distance math. GEOPOS decodes coordinates in a single call. No wasted computation, and off the deprecation path.

2. Batching large GEOPOS calls — PR #7009

The next slow log entries pointed to GEOPOS calls against sorted sets with ~20,000 members. GEOPOS is O(N) for N member lookups — passing 20K members at once was taking ~40ms each on hecate-003.

The fix was batching: twenty sequential calls of 1,000 each instead of one call of 20,000. Each individual call completes quickly, leaving the event loop free for other commands between batches.

GEOPOS_BATCH_SIZE = 1_000

coords = members.each_slice(GEOPOS_BATCH_SIZE).flat_map do |batch|
  $redis_aws.geopos(key(tp), batch)
end

We also reduced the SCAN count hint in DrivenActivities from 15,000 to 10,000 — a smaller hint means Redis yields more often during the scan.

3. Tuning SCAN count further — PR #7010

Reviewing the slow log further, SCAN with a 10,000 count hint was still appearing. We reduced it again to 1,000. The count parameter is a hint, not a limit — Redis may return more or fewer — but a lower hint reduces the amount of work per iteration, keeping each call short.

$redis_aws.scan_each(match: "#{PREFIX}_*", count: 1000) { |key| ... }

4. Replacing blocking SADD and DEL with async alternatives — PR #7031

DEL in Redis is synchronous — it blocks the event loop while freeing memory. For large sorted sets and hashes, this can be significant. UNLINK is the async equivalent: it unregisters the key immediately (making it invisible to clients) and frees the memory in a background thread. It’s a drop-in replacement with no behaviour change.

We migrated 70 call sites across 27 files:

# before
$redis_aws.del(key)

# after
$redis_aws.unlink(key)

For large SADD operations (adding many members at once), we split them into smaller batches to avoid a single large O(N) write blocking the event loop.
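
As a rough illustration of the SADD batching, here is a sketch under stated assumptions rather than the code from the PR. It assumes a client whose sadd accepts a key and an array of members, as redis-rb does. The names sadd_in_batches and SADD_BATCH_SIZE are hypothetical, and the batch size of 1,000 is borrowed from the GEOPOS fix, not stated for SADD.

```ruby
# Assumed value, mirroring the GEOPOS batch size used earlier.
SADD_BATCH_SIZE = 1_000

# Hypothetical helper: `redis` is any client that responds to
# `sadd(key, members)`. Returns the number of SADD calls issued.
def sadd_in_batches(redis, key, members, batch_size: SADD_BATCH_SIZE)
  batches = 0
  members.each_slice(batch_size) do |batch|
    redis.sadd(key, batch) # one short write per slice
    batches += 1
  end
  batches
end
```

One trade-off to keep in mind: the batches are separate commands, so a reader can observe a partially populated set between calls. If the set must appear all at once, batching alone is not enough.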

5. SMEMBERS → SSCAN for the hourly geo user set — PR #7058

The final slow log entry was the most predictable: a single SMEMBERS firing every day at exactly 4:15 UTC. One entry in 24 hours — but perfectly consistent.

SMEMBERS returns every member of a set at once. Our hourly geo user set accumulates the ID of every user who sends a GPS point throughout the day. By 4:15 UTC (end of the US evening gig economy rush), that set had grown to ~40,000 members.

Finding the caller was a codebase grep for smembers, which pointed to GeoMetricsRedis.get_user_ids. Connecting the timing to the EventBridge schedule is where the AWS integration paid off immediately — more on that below.

The fix was SSCAN with a count hint of 1,000: ~40 short iterations, each returning quickly, instead of one O(N) call that blocks everything behind it.

def get_user_ids(y_m_d_h_date = ...)
  users_key = key_prefix(GEOPOINTS_USERS, y_m_d_h_date)
  user_ids = []
  cursor = 0
  loop do
    cursor, batch = $redis_aws.sscan(users_key, cursor, count: 1000)
    user_ids.concat(batch.map(&:to_i))
    break if cursor == '0'
  end
  user_ids
end
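
One behavioural difference between SMEMBERS and SSCAN is worth noting: Redis documents that SSCAN may return the same element more than once during a full iteration, so callers that need SMEMBERS-like results should deduplicate. A minimal sketch of that, assuming a client whose sscan returns a [cursor, batch] pair with the cursor as a string (as redis-rb does); scan_unique_user_ids is a hypothetical name, not the method from the PR.

```ruby
require "set"

# Hypothetical helper: walks a Redis set with SSCAN and returns each member
# once, as an Integer, even if SSCAN reports a member more than once.
def scan_unique_user_ids(redis, users_key, count: 1_000)
  seen = Set.new
  cursor = "0"
  loop do
    cursor, batch = redis.sscan(users_key, cursor, count: count)
    seen.merge(batch.map(&:to_i)) # Set drops repeated members
    break if cursor == "0"        # a "0" cursor marks the end of the scan
  end
  seen.to_a
end
```

The Set keeps memory proportional to the number of distinct members, which for a ~40,000-member set is small.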
· · ·

How AWS MCP Changed the Investigation

Debugging distributed systems normally means bouncing between tools — CloudWatch Logs, the ECS console, EventBridge, ELB metrics — copy-pasting ARNs, losing context between tabs. With the AWS MCP connected to Claude, every AWS query happened in the same conversation thread as the code analysis.

Finding the schedule in seconds

Once we identified get_user_ids as the slow call, we needed to know what triggered it at 4:15 UTC. Rather than navigating the EventBridge Scheduler console:

aws scheduler list-schedules | grep -i geo_metrics
aws scheduler get-schedule --name "Record_Geo_Metrics"
# => cron(15 4 * * ? *) → POST /schedule_task/record_geo_metrics

The cron expression, the Lambda target, and the API path all came back from a single get-schedule call. That’s normally a five-minute console hunt.
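
For readers less used to the EventBridge cron format: it has six fields (minutes, hours, day of month, month, day of week, year), so cron(15 4 * * ? *) reads as 04:15 every day, in UTC unless the schedule sets a timezone. A small sketch that splits the expression into named fields; parse_eventbridge_cron and CRON_FIELDS are hypothetical names used only for illustration.

```ruby
# Field order for EventBridge cron expressions.
CRON_FIELDS = %i[minutes hours day_of_month month day_of_week year].freeze

# Hypothetical helper: turns "cron(15 4 * * ? *)" into a field => value hash.
def parse_eventbridge_cron(expression)
  body = expression[/\Acron\((.*)\)\z/, 1]
  raise ArgumentError, "not a cron() expression: #{expression}" unless body

  values = body.split
  unless values.size == CRON_FIELDS.size
    raise ArgumentError, "expected #{CRON_FIELDS.size} fields, got #{values.size}"
  end

  CRON_FIELDS.zip(values).to_h
end

schedule = parse_eventbridge_cron("cron(15 4 * * ? *)")
```

The minutes and hours fields here are "15" and "4", which lines up with the daily 4:15 UTC slow log entry.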

Ruling out unrelated issues

Mid-investigation, a spike of 1,231 ELB 504 errors appeared. Rather than derailing into manual console investigation, we queried the ELB metrics directly. The breakdown confirmed all 1,231 errors were 504s (not 502s), all targets remained healthy, and the spike resolved on its own — most likely an OOM kill on one container under sustained evening load. Confirmed, noted, moved on. The original Redis investigation stayed on track.

Having AWS query results and code analysis in the same context meant each finding fed directly into the next hypothesis — without a context switch.
· · ·

When Your AI Goes Down: Have a Backup Ready

Here’s something that doesn’t come up in most engineering blogs but happened during this investigation: mid-session, while we were deep into debugging the 504 spike, the Claude service went down.

Rather than waiting it out, I switched to Codex and continued the investigation immediately. What was notable: Codex picked up local AWS credentials automatically — no MCP configuration required, no setup overhead. It queried CloudWatch and ELB the same way Claude had been doing, just with a different interface. In that respect it behaved like OpenClaw: AWS access was just there, ready to use.

Practical takeaway: Keep more than one coding agent configured and ready to go. The sessions where you most need uninterrupted flow — mid-incident, mid-investigation, when you’re holding several hypotheses in your head — are exactly when a service outage costs you the most. If switching agents takes 30 seconds instead of 30 minutes, you stay in the problem.

The specific capability to look for is native credential access. Agents that can reach your AWS environment using locally configured credentials (rather than requiring explicit MCP setup per session) are the ones you can hand off to without losing momentum. Codex has this. OpenClaw had it. It’s worth knowing which tools in your arsenal work this way before you need them.

· · ·
Figure: CloudWatch chart of Redis slow log entries on hecate-003, with consistent daily hits through 04/23 dropping to near zero by 04/30 as the fixes landed.

The Pattern

Looking across the five fixes above, a clear pattern emerged:

Redis commands that are fine at small scale become slow log entries as data volumes grow — and the growth is often invisible until it crosses a threshold.

The fixes all follow the same principle: break large O(N) operations into smaller batches, and prefer non-blocking or incremental alternatives (UNLINK over DEL, SSCAN over SMEMBERS, SCAN with smaller count hints) that free the event loop between calls.

The slow log is now clear. The fixes were individually small — a few lines each — but finding them required tracing from a Redis key name through a codebase, an EventBridge schedule, a Lambda, a Rails controller, and a Sidekiq worker. Having AWS and code in the same context made that trace fast. And having a backup agent meant a service outage didn’t stop it.

Key Takeaways

01
Audit deprecated commands first. GEORADIUS was doing distance math on every member just to replicate SMEMBERS. Deprecated commands often carry hidden performance baggage — check the slow log, then check the Redis changelog.
02
Batch large O(N) commands. GEOPOS, SADD, and SMEMBERS against large structures are the usual suspects. Replace with sliced iterations of ≤1,000 elements.
03
Use UNLINK, not DEL. It’s a drop-in replacement that frees memory in the background instead of on the event loop. Migrate all 70 call sites — it takes minutes with a codebase search.
04
The slow log tells you what, not why. Tracing from a command back to its caller — across schedules, queues, and controllers — is where the investigation lives. Tooling that keeps AWS and code in the same context makes this dramatically faster.
05
Have a backup agent ready before you need one. Configure agent redundancy when it’s calm, not when you’re mid-incident. Prefer agents with native credential access so the handoff is immediate.
Pave Engineering · April 2026
Redis AWS Ruby Performance ElastiCache MCP

Wednesday, April 22, 2026

Engineering × AI

My Journey Back to OpenClaw: From Disruption to Democratized AI Power

When an upstream policy change kills your workflow overnight, you learn fast what your tools are actually made of.

The allure of AI, particularly the promise of powerful agents like OpenClaw, has always been about augmenting our capabilities and freeing us from mundane tasks. My recent experience, however, was a stark reminder that the path to progress isn't always smooth. It involved navigating through unexpected adversity, finding my way with alternative tools, and ultimately, reaffirming my belief in the democratizing power of open-source AI embodied by OpenClaw.

— —

The Disruption: Caught in the Crossfire

My personal workflow relies on two main automated reports: one tracking car sales and deals for specific models and criteria near me, and another monitoring the geopolitical landscape, particularly the ongoing conflict with Iran and its impact on global markets and cybersecurity. These aren't just casual interests — they're crucial for staying informed and making timely decisions, whether it's snagging a great car deal or understanding market shifts.

OpenClaw was firing these reports daily. But then, an external disruption struck: Anthropic, a provider of Claude models, shifted its policies. This wasn't just a minor inconvenience; it was a deliberate move that effectively severed the connection for many users.

The core issue stemmed from Anthropic's decision to stop covering third-party tool usage, like that of OpenClaw, under their standard Claude subscriptions. This meant users wanting to continue using OpenClaw with Claude models would face significantly higher, pay-as-you-go costs — a "claw tax," as some have called it. The Verge reported that this policy change effectively began around April 4th, 2026, impacting countless users who relied on this integration.

The situation escalated when OpenClaw's creator, Peter Steinberger (now employed by OpenAI), found his own account temporarily banned from accessing Claude, reportedly due to "suspicious" activity. While the ban was short-lived and his account was reinstated after community outcry, the incident highlighted the growing tensions. As detailed in TechCrunch, Steinberger reportedly tried to reason with Anthropic, even delaying the policy change, but ultimately felt it was a "betrayal of open-source developers" — especially given Anthropic's recent addition of features to its own tools like Claude Cowork that seemed to mimic OpenClaw's capabilities. This move, coupled with the higher costs and the temporary ban, created significant distrust and frustration within the community.

Suddenly, my crucial daily reports stopped. The absence was more than an inconvenience; it meant I was cut off from vital information. The car deal alerts vanished (I'm still waiting for that perfect deal!), and the daily Iran news digest, which used to take me a tedious 30 minutes each morning to compile manually, was no longer at my fingertips.

— —

The Roadblocks: Navigating Bugs on the Path Back

My first instinct was to get back on OpenClaw, even if it meant using a different provider. I decided to try using an OpenAI/Codex model for this. However, the path back was immediately blocked by frustrating, unrelated bugs in the previous version:

The Channel Setup Crash (GitHub #67076). During the onboarding process, specifically when configuring channel options after entering my Discord token, the application crashed with "Cannot read properties of undefined (reading 'trim')". This was a regression: it had worked before, but now the bug aborted the entire channel setup. It felt like hitting a wall right at the start.

The Misspelled API Path (GitHub #68076). Even after getting past the trim bug, a misspelled request path to the OpenAI-compatible API (openai/v1 instead of openai/api/v1) caused the Codex model to fail silently. Thankfully, the issue had a workaround documented right in the comments; I applied it and it worked. This is worth noting: working directly with the OpenClaw GitHub repo is by far the fastest way to resolve issues on such a fast-moving project. The community and maintainers are responsive, workarounds get posted quickly, and you can often unblock yourself the same day.

— —

Finding a Lifeline: The Power of Codex in the Interim

While I was working through those OpenClaw bugs, my detour through Codex CLI proved surprisingly productive. It used the Hermes agent, leveraged both Gemini and OpenAI models, and significantly helped automate some of the tasks I was missing. A key advantage was its ability to carry over much of my existing OpenClaw configuration, making the transition smoother than expected. Codex helped me:

  • Automate reporting: Generate cron jobs for a series of news events, recreating the automated daily digests I had lost.
  • Port configurations: Transfer many of my OpenClaw settings to the new setup — a huge time-saver.
  • Set up a Telegram bot: This lets me direct OpenClaw (and Codex/Hermes) from my phone, incredibly useful when away from the laptop.

However, Codex wasn't without its own limitations. Despite having access to a Brave Search API token, Codex and Hermes fell back to browser-based loading, which quickly hit Google rate limits and couldn't bypass them.

More critically, while Codex could show raw links or entire page content, it lacked built-in capabilities for extracting and summarizing specific, relevant data. For instance, I asked it to find the opening hours for my local library — a simple task — but it couldn't extract this structured data out-of-the-box. Although Codex eventually managed to bring in BeautifulSoup and I coached it to write a custom extractor, the entire process was time-consuming and required significant intervention. In contrast, OpenClaw has robust web extraction and summarization capabilities built-in, saving a tremendous amount of friction.

The difference shows up in even the simplest real-world queries. Here's the same question — do I need a rain jacket tomorrow? — asked to both bots via Telegram:

[Screenshots: the Codex and OpenClaw Telegram replies to the same question, summarized below.]

Codex: Couldn't retrieve the forecast, hit rate limits on weather sites, and suggested checking a local app manually.

OpenClaw: Asked for location, got "Seattle," and returned a direct, actionable answer in under a minute.

The same question. Two very different answers.

The result: automated reporting that worked, but couldn't reliably parse and distill fresh information from the open web without considerable effort.

— —

The Return: OpenClaw and Extreme Productivity

Finally, after addressing those critical bugs and getting OpenClaw back online, the difference was immediate. OpenClaw's integrated web search, cron scheduling, and multi-channel delivery just worked. Within minutes of being back, I had my car deal alerts and geopolitical digests firing again. The contrast between the frustrating bug-fixing period and my current workflow is night and day.

Now I have two workflows to compare side by side — Codex and OpenClaw — and the comparison is illuminating. Codex is capable and helped me through a tough period. But OpenClaw's architecture — its ability to search the web natively, schedule tasks, deliver to Slack or Telegram, and operate as a true personal AI agent rather than just a code assistant — that's a different category entirely.

— —

Why OpenClaw Matters: The Linux Moment for AI

My instinctive preference for OpenClaw might partly be an underdog thing. But it's more than that. I genuinely believe OpenClaw represents an inflection point in democratizing the awesome power of AI.

As the HackerNoon article Is OpenClaw the Linux Moment for AI? argues, we may be witnessing something analogous to what Linux did for operating systems. Linux didn't just offer a free alternative — it fundamentally changed who could build, customize, and control their computing environment. OpenClaw is doing the same for AI agents.

You don't need a corporate subscription or a walled-garden platform to have a powerful AI assistant that monitors your interests, automates your workflows, and reaches you on whatever channel you prefer. You can run it yourself, on your own hardware, with your choice of models. That's not just convenient — it's transformative.

My journey, though challenging, has only deepened my conviction. The fact that I could check unreleased code, find workarounds, fix bugs, and get back to full productivity — all within the open-source ecosystem — is exactly the point. This is what democratized AI looks like. It's messy sometimes. But it's ours.

— —

Still haven't found that car deal, though. The search continues.