Sunday, May 24, 2026

How we ship multiple times a day - and sleep at night

Engineering Culture · Pave

Five years of testing culture, 13,551 test cases, and a philosophy that changed how we think about deployment.

Five years ago, when we started building Pave — a platform that helps gig workers understand, track, and optimize their earnings — we made a deliberate bet on testing culture. Not because a VP mandated it. Not because a consultant told us to. We did it because we were a small team moving fast, and we knew that the only way to keep moving fast sustainably was to build a system we could trust completely.

Today, we push to production about a dozen times a day on average, sometimes more. Thirty days of commit history shows 376 merges to main, or roughly 12.5 per calendar day. Every single one triggered an automatic deployment to production. No release windows. No "code freeze Thursdays." No staged rollout ceremonies. Just: tests pass, deploy.

376 merges in 30 days
13,551 individual test cases
<20 min from push to production
1,023 spec files

We Never Drew a Line Between Unit and Integration Tests

Most engineering teams have a test pyramid: unit tests at the base, integration tests in the middle, end-to-end tests at the top. The taxonomy is tidy. The problem is that the taxonomy creates a false sense of permission — "that's an integration concern, we'll cover it at the integration layer" — and integration layers have a way of not getting built.

We skipped the taxonomy entirely. Our philosophy: a test is only meaningful if it exercises the full contract of the code under test. That includes the HTTP layer, the database, and the side effects.

We call all of our tests "unit tests." A test for our user signup endpoint doesn't just assert the HTTP response code. It fires a real POST request, then opens up the database. In practice, that single test verifies five database tables and two async workers:

api/v1/users_controller_spec.rb
RSpec.describe Api::V1::UsersController, type: :request do
  before do
    expect(HbCheckinWorker).to receive(:perform_async).with(USER_SIGN_UP)
    expect(BrazeWorkers::Signup).to receive(:perform_async)
    post '/api/v1/users/create', params: { email:, password:, phone:, city: ... }
  end

  it 'returns http success and provisions all associated records' do
    expect(response).to have_http_status(:success)
    user = User.find_by(email: test_email)
    expect(user.wallet.present?).to be_truthy
    expect(user.credit.present?).to be_truthy
    expect(user.linkage_setting.present?).to be_truthy
    expect(user.user_setting.show_review).to be_truthy
    expect(user.linkage_setting.lymo_platforms).to eq([
      'uber', 'ubereats', 'doordash', 'lyft', 'grubhub', ...
    ])
  end
end

One test. One POST. Five database tables verified. Two async workers asserted. That is the standard we hold ourselves to.


A Real Example: Plaid Webhooks

Pave integrates deeply with Plaid for bank transaction syncing. When Plaid sends a webhook — notifying us that new transactions are available — a lot needs to happen correctly: the webhook signature must be verified, the right background job must be enqueued, and an audit record must be written. If the bank connection has degraded to an error state, no job should fire at all.

plaid_webhooks_spec.rb
describe 'TRANSACTIONS/SYNC_UPDATES_AVAILABLE' do
  let!(:plaid_item) { create(:plaid_item) }

  it 'enqueues PlaidTransactionSyncWorker for the item' do
    expect { post_webhook(payload) }
      .to change(PlaidTransactionSyncWorker.jobs, :size).by(1)
  end

  it 'logs the event to PlaidEvent' do
    expect { post_webhook(payload) }.to change(PlaidEvent, :count).by(1)
    event = PlaidEvent.last
    expect(event.webhook_code).to eq('SYNC_UPDATES_AVAILABLE')
  end

  context 'when the item is in error state' do
    let!(:error_item) { create(:plaid_item, :error) }

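    it 'returns 200 but does not enqueue a job' do
      expect { post_webhook(payload) }
        .not_to change(PlaidTransactionSyncWorker.jobs, :size)
      # The example name promises a 200, so assert it explicitly.
      expect(response).to have_http_status(:ok)
    end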
  end
end

One test file covers the HTTP layer, the job queue, the audit log, and the conditional branching. There is no separate "integration test" for this flow. This is the unit test.
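The behavior those examples pin down can be pictured with a minimal, framework-free sketch. Everything here is hypothetical: `WebhookHandler`, `Item`, and the in-memory `jobs` and `events` arrays stand in for the real controller, the Sidekiq queue, and the PlaidEvent table. Signature verification is left out, and the sketch assumes the audit record is written even when the item is in an error state.

```ruby
# Hypothetical sketch: names are illustrative, not Pave's real classes.
Item = Struct.new(:id, :status)

class WebhookHandler
  attr_reader :jobs, :events

  def initialize(items)
    @items  = items.to_h { |item| [item.id, item] } # lookup by item id
    @jobs   = [] # stands in for the Sidekiq queue
    @events = [] # stands in for the PlaidEvent audit table
  end

  # Returns an HTTP-style status code.
  def call(payload)
    item = @items[payload['item_id']]
    return 404 unless item

    # Assumption: the audit record is written even when no job will run.
    @events << { item_id: item.id, webhook_code: payload['webhook_code'] }

    # A degraded connection is acknowledged but not synced.
    @jobs << [:sync_transactions, item.id] unless item.status == :error
    200
  end
end

handler = WebhookHandler.new([Item.new('item-1', :ok), Item.new('item-2', :error)])
payload = { 'webhook_type' => 'TRANSACTIONS', 'webhook_code' => 'SYNC_UPDATES_AVAILABLE' }

healthy_status = handler.call(payload.merge('item_id' => 'item-1'))
errored_status = handler.call(payload.merge('item_id' => 'item-2'))
```

Both calls return 200, but only the healthy item ends up with a queued job, which is the same branch the error-state context in the spec guards.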

The ITEM/ERROR webhook tests go further — they assert that a specific database field transitions to login_required, and that we deliberately don't fire a Honeybadger notification, because this is an expected user-state transition, not an engineering error:

plaid_webhooks_spec.rb — ITEM/ERROR
it 'marks the item as login_required' do
  post_webhook(payload)
  expect(plaid_item.reload.status).to eq(PlaidItem::STATUS_LOGIN_REQUIRED)
end

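it 'does NOT notify Honeybadger (expected user-state transition)' do
  # have_received only works on a spy, so stub notify before the request.
  allow(Honeybadger).to receive(:notify)
  post_webhook(payload)
  expect(Honeybadger).not_to have_received(:notify)
end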

That second assertion is a business rule encoded directly into the test suite. Future engineers can't accidentally add an error notification here without a test failing and forcing a conversation.


A Real Example: Stripe Webhooks

Payment processing is where bugs are most expensive. Our Stripe webhook tests verify three distinct behaviors that matter for production reliability:

1. Idempotency — duplicate webhooks from Stripe are silently accepted. Stripe's documentation explicitly warns that retries happen.
2. Persistence before processing — the event record is written to the database before the handler runs, so if the handler crashes we have a record to retry from.
3. Fallback job scheduling — if synchronous handling fails, a Sidekiq job is enqueued to retry.

stripe_webhooks_spec.rb
it 'creates new stripe webhook event record' do
  request
  expect(StripeWebhookEvent.last.event_id).to eq(event_id)
end

it 'queues job to background if handling event fails' do
  allow(StripeWebhook::EventHandler).to receive(:handle_event)
    .and_raise(StandardError)
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
end

it 'queues job to background with 10 seconds delay' do
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
    .in(10.seconds)
end

That 10-second delay is a subtle piece of operational knowledge. It exists because of a real production incident. The test now encodes that knowledge permanently — the next engineer who touches this code will see a test that says "this delay is intentional and verified."
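For readers who want to see how the three behaviors listed above fit together, here is a minimal sketch in plain Ruby. It is an illustration under stated assumptions, not our production handler: `StripeWebhookReceiver`, the in-memory `records` hash, and the `scheduled_jobs` array are hypothetical stand-ins for the controller, the StripeWebhookEvent table, and Sidekiq's scheduled set.

```ruby
# Hypothetical sketch; class and method names are illustrative only.
class StripeWebhookReceiver
  RETRY_DELAY_SECONDS = 10

  attr_reader :records, :scheduled_jobs

  def initialize(handler)
    @handler        = handler # any object responding to #call(event)
    @records        = {}      # event_id => record, stands in for the events table
    @scheduled_jobs = []      # stands in for the Sidekiq scheduled set
  end

  def receive(event)
    event_id = event.fetch('id')

    # 1. Idempotency: a redelivered event is acknowledged without reprocessing.
    return :duplicate if @records.key?(event_id)

    # 2. Persist before processing, so a crash still leaves a record to retry from.
    @records[event_id] = { payload: event, status: :received }

    begin
      @handler.call(event)
      @records[event_id][:status] = :processed
      :processed
    rescue StandardError
      # 3. Fallback: schedule a delayed background retry instead of failing the request.
      @scheduled_jobs << { event_id: event_id, delay: RETRY_DELAY_SECONDS }
      @records[event_id][:status] = :retry_scheduled
      :retry_scheduled
    end
  end
end

handled = []
ok_receiver = StripeWebhookReceiver.new(->(event) { handled << event['id'] })
first  = ok_receiver.receive({ 'id' => 'evt_1' })
second = ok_receiver.receive({ 'id' => 'evt_1' }) # simulated redelivery

failing_receiver = StripeWebhookReceiver.new(->(_event) { raise 'handler crashed' })
failed = failing_receiver.receive({ 'id' => 'evt_2' })
```

The ordering is the point: the record exists before the handler runs, so the rescue branch always has something to schedule a retry against.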


Services Are Tested as Full Data Pipelines

Our service layer is tested the same way. The PlaidServices::SyncTransactions service takes a bank connection, calls the Plaid API, and fans out into creating Expense records or ManualIncome records depending on transaction type. Our tests verify the entire fan-out across three tables:

plaid_services/sync_transactions_spec.rb
it 'creates an Expense for each added debit transaction' do
  expect { service.call }.to change(Expense, :count).by(1)
end

it 'creates a ManualIncome, not an Expense, for credit transactions' do
  expect { service.call }
    .to change(ManualIncome, :count).by(1)
    .and change(Expense, :count).by(0)
end

it 'updates the cursor and last_synced_at on the PlaidItem' do
  service.call
  plaid_item.reload
  expect(plaid_item.cursor).to eq('cursor-abc')
  expect(plaid_item.last_synced_at).to be_within(5.seconds).of(Time.current)
end

A single service.call, checked against the Expense, ManualIncome, and PlaidItem tables. This is what it means to test the contract of the code, not just its mechanics.
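The fan-out itself can be sketched in a few lines of plain Ruby. This is a hypothetical stand-in, not the real service: `SyncTransactionsSketch`, `PlaidItemStub`, the `:direction` field, and the shape of the response hash are all assumptions made for illustration, and the API call is replaced by an already-fetched payload.

```ruby
# Hypothetical sketch: an in-memory stand-in for the fan-out, not the real service.
PlaidItemStub = Struct.new(:cursor, :last_synced_at)

class SyncTransactionsSketch
  attr_reader :expenses, :manual_incomes

  def initialize(plaid_item, api_response)
    @plaid_item     = plaid_item
    @api_response   = api_response # already-fetched sync payload (assumed shape)
    @expenses       = []
    @manual_incomes = []
  end

  def call
    @api_response.fetch(:added).each do |txn|
      # Assumed convention for this sketch: :credit is money in, :debit is money out.
      target = txn[:direction] == :credit ? @manual_incomes : @expenses
      target << { amount: txn[:amount], name: txn[:name] }
    end

    # Advance the cursor only after every transaction has been recorded.
    @plaid_item.cursor         = @api_response.fetch(:next_cursor)
    @plaid_item.last_synced_at = Time.now
    self
  end
end

item = PlaidItemStub.new(nil, nil)
response = {
  added: [
    { direction: :debit,  amount: 12.50, name: 'Fuel' },
    { direction: :credit, amount: 84.00, name: 'Weekly payout' }
  ],
  next_cursor: 'cursor-abc'
}
sync = SyncTransactionsSketch.new(item, response).call
```

One call touches three pieces of state, which is why the spec asserts on all three rather than on the return value alone.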


VCR Cassettes: Replacing Stubs with Reality

One of the subtler engineering choices we made early was committing fully to VCR cassettes for any test that touches an external API. Today the codebase has 922 cassette files — recorded conversations between our code and services like PayPal, Plaid, Stripe, Uber, and ColumnTax.

The case for cassettes isn't just convenience. It's about execution completeness. When a team stubs an external HTTP call inline, they typically stub the minimum needed to make the test pass:

inline stub — what most teams do
allow(HTTParty).to receive(:post).and_return(
  double('response',
    parsed_response: { 'access_token' => 'fake-token' },
    success?: true
  )
)

That stub works. The test goes green. But the code path it exercises is a shortcut, and it leaves the real question unanswered: does our code correctly thread the token from response one into the Authorization header of request two? A cassette can answer that question because it is the real conversation, recorded once from a live API call and replayed on every test run thereafter.

The PayPal Payout Example. Sending a payout via PayPal requires two HTTP calls in sequence: first an OAuth token exchange, then the payout creation using that token. Our cassette captures both interactions — including the Bearer token from step 1 appearing verbatim in the Authorization header of step 2:

payout_sdk/process_wallet_transfer_payout_success.yml
# Interaction 1 — OAuth token exchange
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/oauth2/token
    body:
      string: grant_type=client_credentials
  response:
    status: { code: 200 }
    body: '{"access_token":"A21AALDjI1wg_pNn-wDWH...","expires_in":31830}'

# Interaction 2 — Payout creation (token from step 1 in the header)
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/payments/payouts
    headers:
      Authorization:
      - Bearer A21AALDjI1wg_pNn-wDWH...
    body:
      string: '{"sender_batch_header":{...},"items":[{"amount":{"value":"21.00","currency":"USD"}}]}'
  response:
    status: { code: 201 }

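The test that uses it is short. An inline stub cannot verify token threading: it would accept any call to the payout endpoint, token or not. A cassette enforces the threading only when requests are matched on headers as well, since VCR's default matching uses just the HTTP method and URI.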

paypal_payout_spec.rb
it 'processes the payout' do
  VCR.use_cassette('payout_sdk/process_wallet_transfer_payout_success') do
    process_status, batch_id, failed_emails = process_wallet_transfer_payout(
      [paypal_test_user.email], amounts: [payout_amount], category:
    )
    expect(process_status).to eq true
    expect(batch_id.is_a?(Integer)).to be_truthy
    expect(failed_emails).to eq []
  end
end
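To make the token-threading argument concrete, here is a small sketch of strict replay written in plain Ruby. It is not VCR's implementation: `StrictCassette`, `send_payout`, and the simplified interactions (short paths, no Basic auth on the token call) are hypothetical. In VCR itself, the comparable behavior comes from asking for header matching through the `match_requests_on` option, because the default matchers are only method and URI.

```ruby
# Hypothetical sketch of strict replay; this is not VCR's implementation.
class ReplayError < StandardError; end

class StrictCassette
  def initialize(interactions)
    @remaining = interactions.dup
  end

  # Replays the next recorded interaction, but only if the live request
  # matches it on method, URI, and Authorization header.
  def request(method, uri, headers = {})
    recorded = @remaining.shift
    raise ReplayError, 'no interactions left' if recorded.nil?

    expected = recorded[:request]
    actual   = { method: method, uri: uri, authorization: headers['Authorization'] }
    raise ReplayError, "expected #{expected}, got #{actual}" unless expected == actual

    recorded[:response]
  end
end

# Simplified recording: real token exchanges also carry credentials.
INTERACTIONS = [
  { request:  { method: :post, uri: '/v1/oauth2/token', authorization: nil },
    response: { status: 200, body: { 'access_token' => 'recorded-token' } } },
  { request:  { method: :post, uri: '/v1/payments/payouts', authorization: 'Bearer recorded-token' },
    response: { status: 201, body: {} } }
].freeze

# Correct client: threads the token from response one into request two.
def send_payout(http)
  token = http.request(:post, '/v1/oauth2/token').dig(:body, 'access_token')
  http.request(:post, '/v1/payments/payouts', { 'Authorization' => "Bearer #{token}" })[:status]
end

# Buggy client: forgets the Authorization header entirely.
def send_payout_without_token(http)
  http.request(:post, '/v1/oauth2/token')
  http.request(:post, '/v1/payments/payouts')[:status]
end
```

Against this replayer the correct client gets its 201, while the buggy client raises `ReplayError` on the second request. A stub that returns a canned response for any call would let both pass.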

Deeply Nested API Responses. Some external APIs return response structures that are genuinely complex to mock by hand. The Atomic API returns a deeply nested object with user identity, platform branding, and bank routing details in a single response. With a cassette, the test reads naturally against the real structure:

lib/upwardli/atomic_client/task_get_details_spec.rb
it 'fetches task details successfully', vcr: true do
  VCR.use_cassette('lib/upwardli/atomic_client/task_get_details/success') do
    response = described_class.task_get_details(task_id, upwardli_user_id:)

    task = response.dig('data', 'task')
    expect(task['status']).to eq 'completed'
    expect(task['authenticated']).to be true

    connector = task['connector']
    expect(connector['name']).to eq 'Uber'
    expect(connector.dig('brand', 'logo', 'url')).to be_present

    deposit_data = task['depositData']
    expect(deposit_data['actType']).to eq 'checking'
    expect(deposit_data['rNum']).to eq '021214891'
    expect(deposit_data['acSuffix']).to eq '9367'
  end
end

The alternative — constructing an inline double that faithfully reproduces fifteen nested fields — is not just tedious. It drifts. The moment the real API adds a field the stub doesn't know about, your stub silently passes while production silently misses it.

State Machine APIs. The most compelling case for cassettes is APIs that have state — where the response to call two depends on what happened in call one, and the response shape changes at each step. Uber's authentication flow is a perfect example: initiating a login returns an inAuthSessionID and a screenType indicating what challenge the user faces next. We have separate cassettes for each branch:

cassette directory — uber auth branches
lymo/uber/email/2fa_auth_app/initiate_login.yml
lymo/uber/email/2fa_auth_app/initiate_login_forbidden.yml
lymo/uber/email/2fa_auth_app/initiate_login_fraud_login_denied.yml
lymo/uber/phone/sms_otp/initiate_login.yml
...

Each cassette captures the real API response for that scenario. The test for the 2FA path then doesn't just check the return value — it asserts the full side-effect chain: the session's login_state transitions correctly, the in_auth_session_id is persisted to the database, and the state machine reaches the expected state:

lymo/uber_spec.rb — 2FA path
it 'initiates login on success', vcr: true do
  VCR.use_cassette('lymo/uber/email/2fa_auth_app/initiate_login') do
    result = described_class.initiate_login(session, 'email')

    expect(result).to eq(success_response_init_login_await_auth_app_code)
    expect(session.reload.login_state.to_sym).to eq(:pending_auth_app_verification)
    expect(session.in_auth_session_id).to eq(in_auth_session_id)
  end
end

The cassette makes the Uber API deterministic. When Uber changes their API — and they do — a cassette re-record flags exactly which tests need updating and why, rather than leaving stale inline mocks silently green while production breaks.
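The side-effect chain that test asserts can be sketched as a small transition function. This is a hypothetical illustration: `LoginSession`, `apply_initiate_login`, the `screenType` values, and every state name other than `pending_auth_app_verification` are invented here and are not Uber's or our real identifiers.

```ruby
# Hypothetical sketch: screen types and most state names are invented for illustration.
LoginSession = Struct.new(:login_state, :in_auth_session_id)

NEXT_STATE_BY_SCREEN = {
  'AUTH_APP_CODE' => :pending_auth_app_verification,
  'SMS_OTP'       => :pending_sms_verification
}.freeze

# Applies an initiate-login response to the session, or raises without
# touching the session when the challenge type is not one we handle.
def apply_initiate_login(session, response)
  next_state = NEXT_STATE_BY_SCREEN.fetch(response['screenType']) do |screen|
    raise ArgumentError, "unhandled screenType: #{screen}"
  end

  session.in_auth_session_id = response.fetch('inAuthSessionID')
  session.login_state = next_state
  session
end

session = LoginSession.new(:new, nil)
apply_initiate_login(session, { 'screenType' => 'AUTH_APP_CODE', 'inAuthSessionID' => 'sess-1' })
```

Resolving the next state before mutating anything means an unexpected response shape fails loudly and leaves the session as it was, which is the kind of branch a per-scenario cassette lets a test exercise.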


Why This Works

  • 01
    Tests encode operational knowledge. When a production incident teaches us something — "Stripe retries webhooks, so we need idempotency" or "this job needs a 10-second delay" — the fix isn't just in the code. It's in a test that will fail if anyone removes the defensive behavior. The test suite is institutional memory with enforcement.
  • 02
    Database and job assertions catch the bugs that narrow, stub-heavy tests miss. The most common class of bug in a Rails app isn't a broken method but a broken interaction: a callback that didn't fire, a job that wasn't enqueued, a related record that wasn't created. These bugs are invisible to tests that stub everything. They're immediately visible when your test literally checks expect(Expense.count).to eq(1).
  • 03
    Confidence removes the fear of shipping. The biggest tax on developer productivity isn't writing code but the anxiety of deploying it. When you trust your tests, you deploy freely. When you deploy freely, each change is small. When each change is small, failures are easy to identify and roll back. Strong tests compound into shipping many times a day without incident, which is itself the goal.

A Note on Rails

Rails and its testing ecosystem deserve credit here. Request specs that dispatch real HTTP through the full middleware stack, transactional tests that roll back database state between examples, FactoryBot integration, and ActiveJob test helpers make this style of testing practical without a lot of scaffolding. Other ecosystems can require significant tooling investment to test at this depth. Rails and the gems around it get you nearly all of it out of the box.

We didn't invent this approach. We just chose to use what the framework offered us, consistently, from day one.

Five years and 13,551 tests later, we ship code to gig workers every few hours. The next time someone asks how we move fast without breaking things, the answer is the same as day one: we test the whole contract, not just the method.
