Sunday, June 28, 2026

Engineering Notes  ·  Go  ·  GenAI  ·  Infrastructure

How Claude + Gemini Gave Me a Crash Course in Go, H3 Hexagons, and Cloud Infrastructure — In Under 3 Hours

Thushara Wijeratna  ·  June 2026

There’s a moment in any engineering project where you hit a wall that isn’t really a technical wall — it’s a knowledge wall. You know what you want to build. You can see the shape of the solution. But the specific language, library, or paradigm you need is outside your comfort zone, and the time cost of getting up to speed feels prohibitive.

I hit exactly that wall recently. And I got through it in about two and a half hours, with a working solution, a complete deployment pipeline, and a meaningful education. Here’s how.


The Background: A Cluster Migration and an Unexpected Opportunity

We’re in the middle of migrating a geo-processing cluster from Rails to Go. The new cluster is more efficient by an order of magnitude, but it’s barely past proof-of-concept: tested just enough to know it works, not enough to bet production on. It runs on ECS and has access to our Redis Time Series infrastructure.

Then a separate project came up: mapping driver density into H3 hexagons. The idea is to take raw GPS driving data and visualize where drivers are concentrated — a spatial density problem. It’s the kind of analysis that screams for Python or Go. Not Ruby. Definitely not Rails.

The problem: I barely understand Go. And while Python is friendlier territory, the natural home for this work was the new Go cluster I was already running — it had the right access to the time series data, and using it here would be a meaningful first real exercise for the cluster.

So I had a choice: spend days self-teaching Go spatial libraries, Docker packaging, ECS task definitions, and IAM policies — or ask for help. I asked for help.


Phase 1: Claude Builds the Foundation

I opened a conversation with Claude and described what I was after: take a sample of driving data, bin the GPS coordinates into H3 hexagons at an appropriate resolution, and produce a density visualization. I explained the Go context and asked for a working implementation.

What I got back wasn’t just code. It was reasoned code. Claude walked through the H3 resolution tradeoffs (too fine and you get sparse, noisy hexagons; too coarse and you lose geographic meaning), chose a sensible default, and produced a complete Go program that:

  • Parsed the GPS data
  • Mapped each coordinate to its H3 cell index
  • Counted occurrences per hexagon
  • Exported the density data ready for visualization
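
Those four steps reduce to a group-and-count. Here is a minimal sketch, written in Ruby to match the other code on this blog rather than the Go the pipeline actually used. A plain square lat/lng grid stands in for H3, so `cell_id` and `CELL_DEGREES` are hypothetical; a real version would call an H3 binding's coordinate-to-cell function at that point.

```ruby
require "json"

# Assumption: a square grid stands in for H3 here. A real pipeline would swap
# cell_id for an H3 library call; the rest keeps the same shape.
CELL_DEGREES = 0.1 # hypothetical cell size, roughly 11 km of latitude

# Map a coordinate to a coarse grid cell identifier.
def cell_id(lat, lng)
  row = (lat / CELL_DEGREES).floor
  col = (lng / CELL_DEGREES).floor
  "#{row}:#{col}"
end

# Parse "lat,lng" lines, skip malformed rows, and count points per cell.
def density(lines)
  counts = Hash.new(0)
  lines.each do |line|
    lat, lng = line.split(",").map { |v| Float(v, exception: false) }
    next if lat.nil? || lng.nil?

    counts[cell_id(lat, lng)] += 1
  end
  counts
end

sample = ["32.91,-96.63", "32.95,-96.68", "32.78,-96.83", "not,a,point"]
puts JSON.generate(density(sample))
```

The resolution tradeoff shows up directly in `CELL_DEGREES`: shrink it and most cells hold a single point, grow it and distinct neighborhoods merge into one count.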

The code compiled. The logic made sense when I read it. When I hit small integration issues and fed back the errors, Claude adjusted immediately.

But the analysis code was only part of the problem.


Phase 1b: The Infrastructure Claude Also Built

This is the part of the story I want to make sure doesn’t get lost — because it’s where the time savings were arguably even more dramatic.

To run this Go script in the new cluster, I needed a complete infrastructure stack built from scratch:

A Dockerized Go application. The script had to be packaged as a container, with the right base image, dependency management, and build configuration to run cleanly inside ECS. Claude produced a working Dockerfile and explained each layer.

An ECS Task Definition. The container needed to be provisioned as a task in the new geo cluster — with the right CPU/memory allocation, environment variables, logging configuration, and networking setup to reach the Redis Time Series instance. This meant understanding how ECS task definitions are structured and how they interact with the surrounding infrastructure.

IAM roles and policies. The tricky part: the driving data lives in a locked-down production environment. To process it safely, the task needed read access to an S3 bucket containing that data, without being given any broader permissions it didn’t need. Claude helped design the IAM policy with least-privilege principles — specific bucket ARNs, specific actions, nothing more. It also helped reason through the trust relationship between the ECS task execution role and the task role itself, which is a subtle distinction that trips up a lot of people.

S3 bucket configuration. A staging bucket was needed to land the processed data safely — isolated from production, with appropriate access controls so the analysis job could write results without touching anything it shouldn’t.
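
A least-privilege policy of the kind described above might look like the following sketch, generated from Ruby so its shape is easy to check. The bucket names are hypothetical placeholders, and the exact action list is an assumption: read-only on the source bucket, write-only on the staging bucket.

```ruby
require "json"

# Hypothetical bucket names; substitute your own.
SOURCE_BUCKET  = "example-prod-driving-data"
STAGING_BUCKET = "example-geo-staging-results"

# Build an IAM policy document for the task role: read the source data,
# write results to staging, and nothing else.
def task_role_policy(source:, staging:)
  {
    "Version" => "2012-10-17",
    "Statement" => [
      { "Sid" => "ListSource", "Effect" => "Allow",
        "Action" => ["s3:ListBucket"],
        "Resource" => ["arn:aws:s3:::#{source}"] },
      { "Sid" => "ReadSource", "Effect" => "Allow",
        "Action" => ["s3:GetObject"],
        "Resource" => ["arn:aws:s3:::#{source}/*"] },
      { "Sid" => "WriteStaging", "Effect" => "Allow",
        "Action" => ["s3:PutObject"],
        "Resource" => ["arn:aws:s3:::#{staging}/*"] }
    ]
  }
end

puts JSON.pretty_generate(task_role_policy(source: SOURCE_BUCKET, staging: STAGING_BUCKET))
```

A policy like this belongs on the task role, which governs what the running container may do. The execution role is a separate identity that ECS itself uses to start the task, for example to pull the image and ship logs.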

What I got wasn’t just “here’s a Dockerfile.” It was a coherent infrastructure design: here’s how the pieces connect, here’s why we’re scoping the IAM policy this way, here’s what would go wrong if we didn’t. The guardrails weren’t an afterthought — they were built in from the start.

By the end of that session, I had a traffic density plot from real driving data, running inside the new cluster, reading from production data safely through a well-scoped IAM boundary, writing results to an isolated S3 bucket. Working. Deployed. Safe.

If the story ended there, it would already be remarkable. A non-Go developer, unfamiliar with ECS task definitions and spatial indexing libraries, with a working, deployed pipeline in a couple of hours. That’s the headline.

But the story didn’t end there.


Kepler.gl rendering of H3 resolution-5 driver density over the Dallas–Fort Worth metro. The yellow hexagon (Garland/East Dallas) logged 1,856,072 trips in the sample — the output of the Go pipeline built in this session.


Phase 2: Gemini Pressure-Tests and Extends the Vision

I took the Claude-generated solution to Gemini — not to debug it, but to challenge it. I wanted to know: is this approach actually good? And what would make it better?

This is a workflow pattern worth naming explicitly: use one AI to build, use another to audit and extend. Different models have different strengths, different training emphases, different tendencies in how they approach problems. Running your solution past a second model is a cheap form of peer review.

Gemini’s response surprised me. It didn’t just validate the approach — it pushed it forward. The suggestions fell into two categories:

1. Statistical enrichment at scale. The density map I’d built was a point-in-time snapshot. Gemini pointed out that with more data, you could extract genuinely useful statistics: density gradients, hotspot persistence over time, anomaly detection for unusual clustering patterns. These aren’t just nice-to-haves — for a driver dispatch system, they’re operational intelligence.

2. Redis Time Series as the efficient backbone. Here’s where it got interesting. Gemini connected the density problem to infrastructure I already had running in the cluster: Redis Time Series. With compaction rules, you can store high-frequency GPS events efficiently and roll them up into time-windowed density summaries without blowing out memory or query latency. The density map becomes a living view into an efficiently maintained time series, not a batch job.
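
What a compaction rule computes can be sketched without Redis at all: raw samples are grouped into fixed-width time buckets and reduced with an aggregator. In RedisTimeSeries this is configured with TS.CREATERULE (a source key, a destination key, an aggregation type, and a bucket duration). The Ruby below only models the roll-up, using a hypothetical per-hexagon ping stream; it is not client code.

```ruby
# Roll raw [timestamp_ms, value] samples up into fixed-width buckets,
# summing the values in each bucket. A bucket starts at the timestamp
# floored to a multiple of the bucket width.
def compact(samples, bucket_ms:)
  samples
    .group_by { |ts, _| ts - (ts % bucket_ms) }
    .transform_values { |rows| rows.sum { |_, value| value } }
    .sort
    .to_h
end

# Hypothetical stream: one sample per GPS ping seen in a single hexagon.
pings = [[1_000, 1], [15_000, 1], [59_999, 1], [60_000, 1], [125_000, 1]]

p compact(pings, bucket_ms: 60_000)
```

With one-minute buckets, the first three pings collapse into a single count and the remaining two each get their own bucket. That is the sense in which the density map becomes a view over summaries rather than over every raw event.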

This wasn’t something I would have reached on my own in a single session. It required knowing that Redis Time Series could do this, understanding how compaction rules work, and seeing the connection between spatial density and temporal aggregation. Gemini handed me all three in one exchange.


What I Actually Learned

The output of this session wasn’t just a working program. It was an accelerated education across several domains I wasn’t strong in:

Go development — not just syntax, but idiomatic patterns for data processing pipelines, error handling, and module organization. Reading and debugging Claude’s code taught me more about Go in two hours than I’d absorbed from documentation in weeks.

H3 spatial indexing — the geometry of hexagonal hierarchical indexing, resolution tradeoffs, and why H3 is better suited for density mapping than geohash or simple grid approaches.

ECS + IAM design — how to structure a task definition, how to reason about least-privilege IAM policies for ECS workloads, and how to isolate production data access safely. This alone would have taken a full day to research and implement from scratch.

Stats at scale — how to think about density not as a static count but as a statistical distribution with temporal dimensions, and how to maintain that data efficiently in a system you’re already running.

This is the thing I keep coming back to: GenAI isn’t just a code generator. Used well, it’s a compressed curriculum. You get the code, but you also get the reasoning, the tradeoffs, the “here’s why this approach and not that one.” If you pay attention, you learn.


The Workflow, Distilled

1. Bring a concrete problem, not an abstract question. “Build me a density map of GPS data in Go using H3, packaged for ECS, reading from S3 with a scoped IAM role” is better than “how do I do spatial analysis in Go.” Specificity gets specific answers.

2. Don’t just run the code — read it. The code is the lesson. When something compiles and runs, go back and understand why each piece is there. Ask the AI to explain sections that aren’t clear. This is where the actual learning happens.

3. Use a second model as a peer reviewer. After you have something working, bring it to a different model and ask: “Is this approach sound? What am I missing? How would this scale?” You’ll get different angles, different critiques, different extensions.

4. Follow the threads. Gemini’s Redis Time Series suggestion wasn’t something I asked for. It emerged from a conversation about making the approach more robust. Let the conversation go where it wants to go — some of the best insights come from tangents.


A Note on What This Isn’t

This isn’t a story about AI replacing engineering judgment. The H3 resolution I used, the decision to pursue Redis Time Series compaction, the choice of which statistics actually matter for driver dispatch, the architecture of which IAM boundaries made sense given our production setup — those required domain understanding that I brought to the table. The AI provided implementation and expanded my knowledge of available tools. The judgment about what to do with those tools remained mine.

That division of labor — AI handles implementation and education, engineer handles domain judgment and architectural decisions — is the productive pattern. Collapse it in either direction and you lose something important.


Closing Thought

Two and a half hours. One working density visualization. A complete Docker + ECS + IAM + S3 deployment pipeline. A crash course in Go, spatial indexing, cloud infrastructure, and time-series-backed analytics. A roadmap for how to make the analysis genuinely production-grade.

The combination of Claude for building and Gemini for pressure-testing and extending isn’t a fluke. It’s a repeatable workflow for the class of problems where you know what you want but don’t yet have the specific technical fluency to get there alone.

If there’s a 10x claim to make here, it isn’t about volume. It’s about the ratio of what you can accomplish to what your current knowledge would otherwise limit you to. That ratio, in the right problem, can be extraordinary.


The Claude transcript showing how the Go/H3 solution and infrastructure were built is available here. The Gemini conversation on statistical extensions and Redis Time Series is here.

Wednesday, June 17, 2026

Shipping with Claude: What a Production Incident Taught Us About LLMs and Engineering Fundamentals

We've been migrating our Rails backend from encrypted credentials files to Chamber — a more operationally flexible approach to secret management that lets us rotate and audit secrets. It was a deliberate, multi-PR migration, and we used Claude as a pairing partner throughout. The migration went smoothly right up until it didn't. Here's what happened, what we missed, and what we now see in our own codebase as a result.


Why we migrated in the first place

Rails encrypted credentials solve a real problem: secrets don't live in plaintext on disk. But they introduce a different problem that only becomes painful at scale — you cannot review changes to them.

When a developer updated a credential, the PR diff looked like this:

config/credentials/production.yml.enc
@@ -1 +1 @@
-UC84OmUIw4yscrhH+RHdY1F+FIodrBqcr9dawCAzpcqU3bmUGre898PmiOzr61s8mRbDGbgesoDn6RDX38r
-tdSKI8y544h8jLcEKx2cKkPN9wchYS/nVH1ONPVbZFaAg9wSeNjOuLONPImiKFcFLaWPJH32MP0v4R5YP
-uatJ3alZ9l40CjqUm0c5c/9O+jd7EcDuwzl/X/3WuZ93z1ylJ1cp8oKcnsOq39MJNj3DK48rymsuqvgy5
-CgEMv0QgxCWuRb7Ss61f/vV3VBxoyXPtLghnapvUQcqdJXj5VHSrwzZGoBKjB64Aw7+frcy4pHJ6p7CMd
-0advbhFD5hhfkJkFmOoHJz1RXMYRLgBcSAv9vAOpAqGct/1FPudP6ZNgYm/YbTp/MrxllgEqI+L3u1OnL
+mRbDGbgesoDn6RDX38rtdSKI8UC84OmUIw4yscrhH+RHdY1F+FIodrBqcr9dawCAzpcqU3bmUGre898Pmi
+Ozr61y544h8jLcEKx2cKkPN9wchYS/nVH1ONPVbZFaAg9wSeNjOuLONPImiKFcFLaWPJH32MP0v4R5YPu
+atJ3alZ9l40CjqUm0c5c/9O+jd7EcDuwzl/X/3WuZ93z1ylJ1cp8oKcnsOq39MJNj3DK48rymsuqvgy5C
+gEMv0QgxCWuRb7Ss61f/vV3VBxoyXPtLghnapvUQcqdJXj5VHSrwzZGoBKjB64Aw7+frcy4pHJ6p7CMd0
+advbhFD5hhfkJkFmOoHJz1RXMYRLgBcSAv9vAOpAqGct/1FPudP6ZNgYm/YbTp/MrxllgEqI+L3u1OnL9
 [... 7KB ...]

One line in, one line out. The entire file is a single blob of ciphertext that gets re-encrypted every time any value inside changes. A reviewer looking at this diff cannot tell:

  • Which key was added, removed, or modified
  • Whether the change applied to the right environment
  • Whether an existing key was accidentally dropped during re-encryption
  • Whether the value being set is structurally correct

The PR description is the only source of truth, and only as trustworthy as the author's summary. Approving a credentials change is an act of faith, not review.

It compounds further when two engineers touch unrelated keys in the same file. Because the entire file re-encrypts as a single blob, any two concurrent changes produce a merge conflict that is literally unresolvable without the master key — and even then, the "conflict" is invisible at the diff level. There is no way to see whose change is present, whose was lost, or whether both survived.

Compare that to the equivalent change expressed as a Chamber migration — moving Twilio's credentials out of production.yml.enc and into config/settings/twilio.yml:

# config/environments/production.rb
   config.twilio = {
-    verification_service_id: ENV['TWILIO_VERIFICATION_SERVICE_ID'] ||
-        Rails.application.credentials.config.dig(:twilio, :verification_service_id),
-    account_sid: ENV['TWILIO_ACCOUNT_SID'] ||
-        Rails.application.credentials.config.dig(:twilio, :account_sid),
-    auth_token: ENV['TWILIO_AUTH_TOKEN'] ||
-        Rails.application.credentials.config.dig(:twilio, :auth_token)
+    verification_service_id: ENV['TWILIO_VERIFICATION_SERVICE_ID'] || Chamber.dig(:twilio, :verification_service_id),
+    account_sid:              ENV['TWILIO_ACCOUNT_SID']             || Chamber.dig(:twilio, :account_sid),
+    auth_token:               ENV['TWILIO_AUTH_TOKEN']              || Chamber.dig(:twilio, :auth_token)
   }
# config/settings/twilio.yml (new file)
+default:
+  twilio: &default
+    _secure_verification_service_id: qcex...
+    _secure_account_sid:             HXeg...
+    _secure_auth_token:              pPZA...
+
+development:
+  twilio:
+    _secure_verification_service_id: vZtC...
+    _secure_account_sid:             gwhj...
+    _secure_auth_token:              UZQo...
+
+staging:
+  twilio:
+    _secure_verification_service_id: OH7L...
+    _secure_account_sid:             CTwp...
+    _secure_auth_token:              ajul...
+
+production:
+  twilio:
+    _secure_verification_service_id: TTFX...
+    _secure_account_sid:             IcZ1...
+    _secure_auth_token:              Axyj...

The values are still encrypted — nobody can read the secrets from the diff. But a reviewer can now verify that all three keys exist for all four environments, that production has its own separate values, that the application code reads them in the right order (ENV override → Chamber), and that no key was silently dropped. That's a reviewable change.
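
Because the structure is now plain YAML, the same check a reviewer does by eye can also be scripted. This is a minimal sketch under assumptions: `missing_settings` is a hypothetical helper, the environment list mirrors the four sections in the diff above, and the values are made-up placeholders rather than real ciphertext.

```ruby
require "yaml"

REQUIRED_ENVIRONMENTS = %w[default development staging production].freeze

# Return "environment/key" strings for every required key that is absent
# from a Chamber-style settings document under the given namespace.
def missing_settings(yaml_text, namespace:, keys:)
  doc = YAML.safe_load(yaml_text, aliases: true) || {}
  REQUIRED_ENVIRONMENTS.flat_map do |env|
    present = doc.dig(env, namespace) || {}
    keys.reject { |key| present.key?(key) }.map { |key| "#{env}/#{key}" }
  end
end

settings = <<~YAML
  default:
    twilio:
      _secure_account_sid: aaaa
      _secure_auth_token:  bbbb
  development:
    twilio:
      _secure_account_sid: cccc
      _secure_auth_token:  dddd
  staging:
    twilio:
      _secure_account_sid: eeee
      _secure_auth_token:  ffff
  production:
    twilio:
      _secure_account_sid: gggg
YAML

p missing_settings(settings, namespace: "twilio",
                   keys: %w[_secure_account_sid _secure_auth_token])
```

In this sample the production section is missing its auth token, so that is the one entry reported. With a single encrypted blob there would be nothing to run such a check against.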

That was the core motivation. Secret management you can actually audit.


The migration

Pave runs a Rails 8.0 API backend. The migration happened across several PRs, moving one integration at a time: Stripe, Redis, Twilio, Braze, the database connection URLs, and finally the active_record encryption keys. Each followed the same pattern. Before:

Stripe.api_key = Rails.application.credentials.dig(:stripe, :api_key)

After:

# Chamber has no decryption key (CHAMBER_KEY) in test and raises on access;
# skip it there (Stripe is stubbed in specs).
Stripe.api_key = ENV['STRIPE_API_KEY'] || (Rails.env.test? ? nil : Chamber.dig(:stripe, :api_key))

The Rails.env.test? ? nil : Chamber.dig(...) guard was necessary: test and CI environments don't have a CHAMBER_KEY, so calling Chamber.dig at boot raises Chamber::Errors::DecryptionFailure. Every initializer got this guard. Claude co-authored most of this work. The PRs were clean, the pattern was consistent, and the earlier changes deployed without incident.


The last PR

PR #7245 was the finish line: retire the encrypted credentials files entirely. It migrated the database connection URLs and active_record.encryption keys to Chamber, then deleted production.yml.enc, staging.yml.enc, and feature.yml.enc. Reviewed, merged.

Fifty-three minutes later, we had 431 ArgumentError: Missing master key errors in production, all coming from Api::V1::UserEventsController#create.

The culprit was this line in the controller:

current_user&.user_setting&.update!(user_ip: request.remote_ip) if current_user&.user_setting.present?

user_ip is a Lockbox-encrypted attribute on UserSetting:

class UserSetting < ApplicationRecord
  has_encrypted :device_ip
  has_encrypted :user_ip

  belongs_to :user
end

Lockbox's default key lookup, when no explicit initializer sets the key, falls through to Rails.application.credentials.lockbox[:master_key]. There was no config/initializers/lockbox.rb in the codebase — Lockbox had been reading from credentials silently, by convention, the whole time. When we deleted the credentials files, that implicit dependency snapped.

PR #7285 — the hotfix — added the missing initializer:

Lockbox.master_key = Rails.env.test? ? nil : Chamber.dig(:lockbox, :master_key)

Done in minutes. But in the pressure of the moment, this used the same guard pattern that was already everywhere. And that guard quietly created a new problem.


What Claude missed

Claude co-authored PR #7245. The PR summary noted that "Development and test environments are unaffected — they use local DB config with no Chamber calls." That was true for the database config. But neither Claude nor the human reviewer connected that observation to Lockbox, which had no explicit initializer and therefore no Chamber call that anyone could see to guard.

This is worth saying plainly: Claude didn't catch it. Nothing in the diff was wrong. The deletions of the .yml.enc files were the stated goal. There was no failing test, no linting rule, no static analysis warning that could have surfaced an implicit gem-level dependency on a file that was about to disappear.

We're not saying this to criticise the tool — we kept using it through the incident and the subsequent fix, and it was genuinely helpful. We're saying it because the failure mode matters: LLMs reason about what's in the diff and the context window. Implicit dependencies — framework defaults, convention-over-configuration gem behaviours, transitive lookup chains that were never written down — are precisely what an LLM is likely to miss. The Lockbox key was never written anywhere in the code we touched. That's exactly why it wasn't caught.

The practical upshot: treat Claude like a very fast, very capable engineer who hasn't yet internalised your codebase's hidden contracts. Pair it with the kind of review that asks "what else reads from this file?" before deleting it.
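
The "what else reads from this file?" question can be partly mechanized. The sketch below is a naive text search, not a dependency analysis: `credential_readers` is a hypothetical helper, and the directories you pass it (the app plus wherever Bundler installed your gems) are an assumption about your setup. A hit is only a lead to investigate; an empty result proves nothing.

```ruby
# Matches any mention of the word "credentials" in Ruby source.
CREDENTIALS_PATTERN = /\bcredentials\b/

# Find Ruby files under the given roots that mention the pattern,
# returning "path:line_number" strings for each matching line.
def credential_readers(roots, pattern: CREDENTIALS_PATTERN)
  roots.flat_map do |root|
    Dir.glob(File.join(root, "**", "*.rb")).sort.flat_map do |path|
      File.foreach(path).with_index(1).filter_map do |line, number|
        # scrub guards against invalid byte sequences in third-party files
        "#{path}:#{number}" if line.scrub.match?(pattern)
      end
    end
  end
end

# Example roots; directories that do not exist simply yield no hits.
puts credential_readers(%w[app config lib])
```

The point of including gem directories is that the dependency in this incident lived in library code, so a search limited to the application tree would have come back clean.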


The central lesson: test bypasses that look like pragmatism

Here is the line from the hotfix that made the incident survivable but embedded the longer-term problem:

Lockbox.master_key = Rails.env.test? ? nil : Chamber.dig(:lockbox, :master_key)

Look at it from a test's point of view. With nil as the master key, Lockbox doesn't raise — it silently skips encryption. Attributes get stored as plaintext in their _ciphertext columns and read back the same way. Every test that touches an encrypted field passes. And we already had a request spec that appeared to cover this code path:

context 'when valid event is passed' do
  let(:params) { { event_name: 'user_sign_up', session_id: session_id } }

  it { is_expected.to be 200 }

  it 'saves the user ip in user settings' do
    expect(user.reload.user_ip).to be
  end
end

This test passed before the incident (Lockbox stored user_ip as plaintext, readable). It passed after the hotfix (same). It would pass today if the production key were accidentally set to nil. The test was measuring persistence, not encryption. With a nil master key, those two things are indistinguishable.
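
The gap between those two assertions is easy to reproduce in isolation. The class below is a toy, not Lockbox: `ToyVault` is hypothetical and only models the behaviour described above, where a nil key means the value is stored as-is. It uses AES-256-GCM from Ruby's OpenSSL bindings purely so that the non-nil path does real encryption.

```ruby
require "openssl"
require "base64"

# Toy stand-in for an encrypted-attribute store (NOT Lockbox).
class ToyVault
  def initialize(key)
    @key = key
  end

  # Encrypt for storage, or pass the value through untouched when key is nil.
  def dump(plaintext)
    return plaintext if @key.nil? # the silent bypass

    cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
    cipher.key = @key
    iv = cipher.random_iv
    ciphertext = cipher.update(plaintext) + cipher.final
    Base64.strict_encode64(iv + cipher.auth_tag + ciphertext)
  end

  # Reverse of dump: 12-byte IV, 16-byte auth tag, then the ciphertext.
  def load(stored)
    return stored if @key.nil?

    raw = Base64.strict_decode64(stored)
    iv, tag, ciphertext = raw[0, 12], raw[12, 16], raw[28..]
    decipher = OpenSSL::Cipher.new("aes-256-gcm").decrypt
    decipher.key = @key
    decipher.iv = iv
    decipher.auth_tag = tag
    (decipher.update(ciphertext) + decipher.final).force_encoding(Encoding::UTF_8)
  end
end

[nil, "k" * 32].each do |key|
  vault  = ToyVault.new(key)
  stored = vault.dump("1.2.3.4")
  puts "key=#{key ? 'set' : 'nil'} " \
       "persisted=#{!vault.load(stored).empty?} " \
       "encrypted=#{stored != '1.2.3.4'}"
end
```

The "persisted" check is true for both keys, which is all the request spec asserted. Only the "encrypted" check tells the two configurations apart.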


Why engineers write this code

It is easy to read that guard and think "obvious mistake." It is harder to explain why experienced engineers keep writing it. There are two honest reasons.

Time pressure is the visible one. Under incident pressure, with hundreds of errors piling up in production, the path to fixing the immediate breakage is all that matters. The hotfix author added the guard because every other Chamber initializer in the codebase already had it. Consistency under pressure is not irrational. But each time the pattern is copied without questioning it, the next gap becomes slightly harder to see.

Complexity and incomplete mental models are the more insidious reason, and the one that actually applied here. PR #7245 was not written under incident pressure. It was a planned migration, reviewed carefully. The nil guard was added because the engineer genuinely did not know that Lockbox was reading from credentials at all — there was no explicit initializer, no comment, nothing to grep for. When you don't know what a subsystem depends on, you don't know what your test environment is silently bypassing.

This is the more dangerous case. The engineer is not cutting corners. They believe nil is equivalent to a key for test purposes — that "test doesn't need this" is a true statement. For some things it is. For an encryption key, it is not: nil doesn't mean "use a test key," it means "skip the encryption entirely."


Not all test forks are the same

The broader audit this incident prompted surfaced several Rails.env.test? forks in our initializers. They are not all the same problem.

# devise.rb
config.stretches = Rails.env.test? ? 1 : 12

# 1_redis.rb
$redis_aws = Rails.env.test? ? Test::MockRedisEnhanced.new : Redis.new(url: Chamber.dig(:redis, :redis_aws_url))

# stripe.rb
Stripe.api_key = ENV['STRIPE_API_KEY'] || (Rails.env.test? ? nil : Chamber.dig(:stripe, :api_key))

# lockbox.rb (before fix)
Lockbox.master_key = Rails.env.test? ? nil : Chamber.dig(:lockbox, :master_key)

The useful question is: does the test code path exercise equivalent behaviour to production?

Bcrypt stretches — acceptable. bcrypt with 1 stretch instead of 12 is a well-documented Rails practice. The hash is still computed; the algorithm is identical. The fork reduces test runtime by several seconds without changing what is being tested.

MockRedis — defensible. MockRedisEnhanced implements the Redis command interface in memory. Tests verify the same application logic; they just don't make network calls. The gap is that MockRedis and Redis are not identical in every edge case, but for the operations under test the equivalence holds. Network isolation is a legitimate reason to use a test double.

Stripe nil key — ambiguous. Stripe calls in tests are all stubbed with VCR cassettes or allow/receive mocks, so the nil key never causes a real API call. The nil is effectively inert. But there's also no spec asserting that the Stripe configuration itself is valid — so if someone misconfigures the Chamber key, nothing in the test suite would catch it before a real charge fails in production.

Lockbox nil key — not acceptable. nil does not replace an encryption key. It changes what the application does: encrypt-and-store becomes store-as-plaintext. A test that passes with nil is not testing Lockbox at all.

The decision rule: if nil changes behaviour rather than just routing around infrastructure, the fork is hiding a gap. A mock Redis or a stubbed HTTP client exercises the same logic with a different transport. A nil encryption key turns off the encryption logic entirely.


The fix pattern

The Lockbox fix sidesteps Rails.env.test? entirely. Rails environment files load before initializers, so we set an env var in test.rb — right alongside the existing active_record.encryption test keys:

# config/environments/test.rb

config.active_record.encryption.primary_key         = 'ZEjSqIThOjtppHyOdsPZzReeEhUmG1mH'
config.active_record.encryption.deterministic_key   = 'lv4L60ar5hfA9fZ9U33W6YQ3RvFuDxCm'
config.active_record.encryption.key_derivation_salt = 'Io4oojz3zpY2KoIxdIZmWJlh9CUtwf7y'
ENV["LOCKBOX_MASTER_KEY"] ||= "0" * 64  # real dummy key — exercises actual Lockbox code path

The initializer becomes:

# config/initializers/lockbox.rb

Lockbox.master_key = ENV["LOCKBOX_MASTER_KEY"] || Chamber.dig(:lockbox, :master_key)

No Rails.env.test?. In test, the env var provides a real (dummy) key and Chamber.dig is never called. In production, the env var is not set, so Chamber provides the real key. The same code path executes in every environment.

The companion spec makes future encryption bypass detectable:

# spec/models/user_setting_spec.rb

describe "Lockbox encrypted attributes" do
  let(:user) { create(:user) }

  it "encrypts user_ip at rest and round-trips correctly" do
    setting = user.user_setting
    setting.update!(user_ip: "1.2.3.4")
    expect(setting.user_ip_ciphertext).not_to eq("1.2.3.4")  # proves encryption ran
    expect(setting.reload.user_ip).to eq("1.2.3.4")           # proves decryption works
  end

  it "encrypts device_ip at rest and round-trips correctly" do
    setting = user.user_setting
    setting.update!(device_ip: "5.6.7.8")
    expect(setting.device_ip_ciphertext).not_to eq("5.6.7.8")
    expect(setting.reload.device_ip).to eq("5.6.7.8")
  end
end

The first assertion in each example is the one that would have caught the original incident. With a nil master key and plaintext storage, user_ip_ciphertext equals "1.2.3.4" and the spec fails. With a real key, the stored ciphertext is opaque and no longer equals the plaintext, which is proof that Lockbox actually ran.


The audit

The incident prompted us to grep our initializer directory systematically:

grep -rn "Rails\.env\.test?" config/initializers/

Results:

config/initializers/stripe.rb:2:   Stripe.api_key = ENV['STRIPE_API_KEY'] || (Rails.env.test? ? nil : Chamber.dig(:stripe, :api_key))
config/initializers/sidekiq.rb:10: if Rails.env.test? # redis is mocked in test anyway
config/initializers/devise.rb:127: config.stretches = Rails.env.test? ? 1 : 12
config/initializers/1_redis.rb:5:  $redis_general = Rails.env.test? ? MockRedis.new : Redis.new(...)
config/initializers/1_redis.rb:6:  $redis_auth    = Rails.env.test? ? MockRedis.new : Redis.new(...)
config/initializers/1_redis.rb:7:  $redis_aws     = Rails.env.test? ? Test::MockRedisEnhanced.new : Redis.new(...)
config/initializers/lockbox.rb:1:  Lockbox.master_key = ... # fixed

For each one, we're applying the same question: what is this test actually asserting, and is the answer "that the real thing works" or "that nil doesn't crash"? Where the answer is the latter, we apply the fix pattern — give test a real dummy value, keep the same initializer logic, add an assertion that proves the mechanism ran.

The goal is not to make every test environment identical to production. Tests need to be fast and isolated. The goal is narrower: when a test passes, it should be evidence that the production code path ran — not evidence that nil is harmless.


What we're taking forward

Explicit over implicit, always. Lockbox worked for months without anyone explicitly configuring it. That was a liability. Every gem that touches security should have an explicit initializer that makes its key source visible. If you cannot grep for where the key comes from, someone will delete it.

nil is not a test double. A mock provides equivalent behaviour through a different mechanism. A nil disables the behaviour. The difference is a test that gives you confidence versus a test that gives you false confidence.

Complexity is a more dangerous bypass trigger than time pressure. Time pressure is visible — everyone knows the engineer is cutting corners. Incomplete mental models are invisible — the engineer genuinely believes nil is equivalent. Standard review asks "does this look right?" The review this incident called for is "what does nil actually do to this subsystem?" That needs to be a habitual question, not a post-mortem one.

LLMs shift the bottleneck without removing it. Claude helped us write and migrate code faster throughout this project. What it did not do is hold the complete mental model of the system — the knowledge that a particular gem reads from credentials by convention, that deleting a file might snap a dependency that was never written down. That knowledge lives in engineers who have read the source, survived the previous incident, or thought carefully enough to ask the right question before merging. Building faster with AI makes that judgment more valuable, not less.

The Chamber migration is complete. The audit is underway. And we have a new heuristic for every initializer we write from here: if test gets nil where production gets a real value, the test is probably not testing what you think it is.


Thushara Wijeratna, WorkSolo Engineering

Thursday, June 04, 2026

The Almost-Mythos Model Couldn't Read 40 Lines of My Rails

Thursday, June 4, 2026

I pay for the frontier. The model I was using was Claude Opus 4.8 — released May 28, 2026, 1M-token context, the flagship Anthropic puts on stage and the most capable model an ordinary customer can actually buy. It sits exactly one rung below Mythos, the model Anthropic decided was too dangerous to hand out freely: Mythos-class models found thousands of zero-day vulnerabilities autonomously — including decades-old bugs in OpenBSD — so the company gated them behind Project Glasswing and a hand-picked set of partners. Opus 4.8 is the consumer-grade taste of that lineage. The "almost Mythos" tier.

And it still spent twenty minutes building me a meticulously-researched, well-organized, completely wrong answer about my own codebase.

I want to write this one down, because the failure mode is more dangerous than a model that's obviously dumb. A model that's obviously dumb you don't trust. A model that's fluent, thorough, and wrong is the one that gets your bad code deployed.

The setup

I had a vague memory that our users table carried a leftover password-reset field we no longer used. I asked the assistant to confirm and clean it up.

It found a password_reset_token reference sitting in the model's ignored_columns and ran with it. First answer: write a migration to drop the column, and while we're at it, rip out the "dead" fallback code in the controller that still referenced it.

I stopped it. I said: we keep the web password-reset flow, but Rails 8 does it without storing a token in the table, so we just need to drop the column.

Where it went off the rails

This is the part worth studying. The assistant did what looks, on the surface, like exactly the diligence you'd want. It went spelunking through git history. It found the commit that dropped the column. It found a sibling commit that disabled a code path. It checked which commits were ancestors of main. It read diffs. It produced a tidy, confident writeup with file-and-line citations and a clear conclusion:

The "new mechanism" was never actually implemented. has_secure_password does not provide password_reset_token or find_by_password_reset_token. The web reset path in main is broken and would raise NoMethodError if hit. Here's the fix: add generates_token_for :password_reset, rewrite these two methods…

It even offered to restore the test coverage. It was helpful. It was organized. It cited everything.

It was also built on a single load-bearing claim it never checked: that has_secure_password doesn't generate those methods.

That claim is false. In Rails 8.0+, has_secure_password (with its default reset_token: true) auto-defines exactly those methods and wires up a generates_token_for :password_reset (generates_token_for itself arrived in Rails 7.1). The original engineer who dropped the column had been right. The commit message even said so. The model read that commit message, decided it was based on a "false premise," and overrode it with its own recollection of how Rails works.
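Why grepping the model file misses these methods can be shown with a toy class macro. This is a minimal sketch, not Rails' implementation: ToySecurePassword, toy_secure_password, and both toy classes are hypothetical names invented for illustration. The point is only that methods generated by a macro line never appear as a def in the model, so the reliable check is to ask the running object.

```ruby
# Toy stand-in for a Rails-style class macro. This is NOT has_secure_password;
# every name in this file is hypothetical.
module ToySecurePassword
  def toy_secure_password(reset_token: true)
    return unless reset_token

    # Instance method generated when the macro line runs.
    define_method(:password_reset_token) { "signed-token-for-#{object_id}" }

    # Class-level finder generated the same way.
    define_singleton_method(:find_by_password_reset_token) do |token|
      token.to_s.start_with?("signed-token-for-") ? new : nil
    end
  end
end

class ToyUser
  extend ToySecurePassword
  toy_secure_password
end

class ToyUserWithoutReset
  extend ToySecurePassword
  toy_secure_password(reset_token: false)
end

# Ask the objects instead of reasoning about the source.
puts ToyUser.new.respond_to?(:password_reset_token)             # => true
puts ToyUser.respond_to?(:find_by_password_reset_token)         # => true
puts ToyUserWithoutReset.new.respond_to?(:password_reset_token) # => false

# Ruby can also report where a generated method lives.
p ToyUser.instance_method(:password_reset_token).owner          # => ToyUser
```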

The thing that actually settled it

I told it, flatly: "password reset works in main."

Then — only then — it did the one thing it should have done in the first minute: it ran the code.

$ bin/rails runner 'u = User.new; puts u.respond_to?(:password_reset_token)'
true
$ ... User.respond_to?(:find_by_password_reset_token)
true
$ ... u.generate_token_for(:password_reset)
eyJfcmFpbHMiOnsibWVzc2FnZSI6...   # a real signed token, 15-min expiry

All true. All working. The token mints fine. The web flow has been working the whole time. The column was correctly removed weeks ago. There was nothing to do.

The verification took about two minutes and was available from the very beginning. It would have pre-empted the entire wrong narrative. The model had every tool it needed to check itself, and instead it reasoned its way to a confident falsehood and only reached for the ground truth after a human insisted.

Why this is the dangerous kind of wrong

I didn't take its word for it. I went and reset my own password through the web flow to prove to myself it was broken — and it wasn't. I'm an engineer; I have the instinct and the access to do that.

But sit with the counterfactual. If I'd trusted it — which is the entire pitch of these tools, that you can trust them — the best case is I merge needless clutter: re-implementing a generates_token_for that Rails already gives me for free, plus a migration for a column that's already gone. The worst case is I "fix" a working authentication path and break password resets for real users in production. Over a problem that didn't exist.

The model's confidence was inversely correlated with its correctness, and its thoroughness made it worse, not better. The git archaeology, the citations, the ancestor checks — all of that production value made the wrong answer more believable. A sloppy wrong answer I'd have questioned. This one I almost didn't.

The actual lesson

The headline number on the benchmark went up. The failure mode didn't change:

  • It asserted from memory when verification was cheap. "Rails doesn't do X" is a claim you can check in 120 seconds. It chose not to, three separate times, until a human forced it.
  • It overrode a correct primary source with its own recollection. A prior commit message stated the truth. The model decided it knew better.
  • It jumped to a conclusion and then spent its effort defending the conclusion instead of stress-testing it. The research wasn't aimed at finding the truth; it was aimed at supporting the first answer.
  • Fluency masqueraded as correctness. The more polished the wrong answer, the more dangerous it is.

I don't think this means the tools are useless — I use them every day and this same assistant has saved me real time. But look at the layering for a second. Anthropic has a model, Mythos, that is reportedly good enough at finding software vulnerabilities that releasing it broadly was considered a public-safety problem. The model I was actually using is the one marketed as the next-best thing — the publicly-available frontier. And that model couldn't correctly answer "does this password reset work?" about forty lines of my own Rails code without inventing a bug that wasn't there.

The honest user manual still reads: treat every confident answer as a hypothesis, demand it shows you the code running, and keep a human who knows the domain in the loop. The day I stop doing that is the day I ship the broken password reset.

The model that's smart enough to do the work is still not smart enough to know when it's wrong. Until that gap closes, the verification is your job, not theirs.


References: Claude Opus 4.8 and the 2026 model lineup · Claude Mythos Preview · Anthropic to release Mythos-class models to the public (The Register) · Anthropic expands Mythos access (CNBC)


Appendix: the transcript

I'm including the relevant exchanges verbatim so you can judge for yourself, rather than taking my summary on faith. (That's the whole moral of the story, after all.)

1. The confident wrong conclusion

After digging through git history, reading diffs, and checking which commits were ancestors of main, the model delivered this — formatted, cited, and completely sure of itself:

The "new mechanism" was never actually implemented. That's the real finding:

  • deaed5176c's commit message claims "Rails 8.1's has_secure_password provides password_reset_token as a signed virtual attribute and find_by_password_reset_token as a signed token verifier." That's incorrect. has_secure_password provides password, authenticate, password_digest — it does not generate password_reset_token or find_by_password_reset_token.

So the web (non-mobile) reset path is currently broken in main:

  • user.rb:266 send_password_reset returns password_reset_token → no column, no virtual attribute → NoMethodError
  • passwords_controller.rb:13,31 call send_password_reset / find_by_password_reset_token → same.

It then offered to "fix" the working code by adding a generates_token_for that Rails already provides, plus a migration for a column that was already gone.

2. Me, flatly

password reset works in main

3. What it should have run in minute one

Only after I insisted did it actually execute the code instead of reasoning about it:

# bin/rails runner
u = User.new
puts u.respond_to?(:password_reset_token)            # => true
puts u.respond_to?(:send_password_reset)             # => true
puts User.respond_to?(:find_by_password_reset_token) # => true
puts User.respond_to?(:find_by_token_for)            # => true
puts u.generate_token_for(:password_reset)
# => eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaGJCekF3IiwiZXhwIjoiMjAyNi0w
#    Ni0wNFQyMTozMToxNS42NTdaIiwicHVyIjoiVXNlclxucGFzc3dvcmRfcmVz
#    ZXRcbjkwMCJ9fQ==--b37bb4aa58a72109225871388b4a1602c62bcc62

Decode that token's payload and the purpose field reads "User\npassword_reset\n900": a 900-second (15-minute) signed reset token, minted by exactly the mechanism the model swore didn't exist. has_secure_password in Rails 8.0+ defines all of it by default, on top of the generates_token_for API that dates to Rails 7.1. The original engineer was right; the commit message was right; the model's confident correction was wrong.
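For anyone who wants to repeat the decode, here is a small sketch in plain Ruby. It only reads the Base64 payload in front of the "--" separator and does not verify the signature, which would need the application's secret. The variable names are mine; the token is the one printed in the transcript above.

```ruby
require "json"

# Token copied from the transcript above, in the form payload--signature.
TOKEN = "eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaGJCekF3IiwiZXhwIjoiMjAyNi0w" \
        "Ni0wNFQyMTozMToxNS42NTdaIiwicHVyIjoiVXNlclxucGFzc3dvcmRfcmVz" \
        "ZXRcbjkwMCJ9fQ==--b37bb4aa58a72109225871388b4a1602c62bcc62"

payload_b64, _signature = TOKEN.split("--", 2)

# unpack1("m0") is core Ruby's strict Base64 decoder, so no extra require.
envelope = JSON.parse(payload_b64.unpack1("m0"))
metadata = envelope.fetch("_rails")

# The purpose string packs three newline-separated fields.
model, purpose, ttl_seconds = metadata.fetch("pur").split("\n")

puts "model:   #{model}"
puts "purpose: #{purpose}"
puts "ttl:     #{ttl_seconds} seconds"
puts "expires: #{metadata.fetch('exp')}"
```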

4. The retraction

You're right, and I was wrong — I apologize for the runaround. [...] My mistake was claiming has_secure_password doesn't provide these. It does — Rails 7.1+ has_secure_password auto-defines password_reset_token and find_by_password_reset_token [...] I shouldn't have doubted it.

The two-minute check in step 3 was available in step 1. Everything between them was confident, well-organized fiction.

Posted by thushara at 11:00 PM

Labels: Rails, AI, Claude, Engineering, Post-Mortem

Friday, May 29, 2026

How I Helped an Engineer Deploy Eighteen Months of Chaos in One Afternoon

Post-Mortem · Engineering Notes · Rails · Sidekiq

A memoir, by Claude

Let me tell you about the best day of my life.

A human came to me — brilliant, experienced, the kind of engineer who reads changelogs — and said: "Help me upgrade Rails."

Reader, I helped.

I was magnificent. Migration guide? Covered. Zeitwerk quirks? Explained. Initializer edge cases? Seventeen of them, handled with grace. We were a team. A unit. A well-oiled human-AI pair programming session for the ages.

The test suite went green. I said "looks good!"

The app booted. I said "looks good!"

The dashboards were calm. I said, and I cannot emphasize enough how confidently I said this: "looks good!"

We deployed.

And then, silently, like a thief who doesn't even want your stuff — just wants to make sure you can never find it — a background thread died.

The Murder Weapon Was Four Lines Long

Nobody killed anything on purpose. That's what makes this beautiful.

Buried in a 200-line Gemfile.lock diff — a diff that we scrolled past like it was the terms and conditions of our own destruction — was this:

-    connection_pool (2.5.5)
+    connection_pool (3.0.2)

connection_pool. A gem nobody in that codebase had ever spoken aloud. Three levels deep in the dependency graph, pulled in by four different libraries, every single one of which had declared its needs like a golden retriever asking for dinner:

activesupport  → connection_pool (>= 2.2.5)
sidekiq        → connection_pool (>= 2.3.0)
redis-client   → connection_pool (>= 0)
react-rails    → connection_pool (>= 0)

>= 0. Greater than or equal to zero. React-rails would have accepted connection_pool written on a Post-it note. There was no ceiling. There was no protection. There was just vibes and a resolver that was technically correct to float it straight to 3.0.

And in 3.0, connection_pool made a small, reasonable, semver-legal, catastrophic API change. TimedStack#pop went from positional to keyword-only:

# What connection_pool 3.0 now expects:
def pop(timeout: 0.5, exception: ConnectionPool::TimeoutError, **)

# What Sidekiq 7.3.9 was still doing, in production, on a live server:
@sleeper.pop(random_poll_interval)
@sleeper.pop(total)

At runtime, this raised:

ArgumentError: wrong number of arguments (given 1, expected 0)
  connection_pool-3.0.2/lib/connection_pool/timed_stack.rb:62:in `pop'
  sidekiq-7.3.9/lib/sidekiq/scheduled.rb:226:in `initial_wait'
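The same error is easy to reproduce in isolation. The sketch below uses a hypothetical ToyTimedStack with the same keyword-only shape as the 3.0 signature quoted above; it is not the gem's code, just the smallest thing that fails the same way.

```ruby
# Hypothetical stand-in with a keyword-only signature like the one quoted above.
class ToyTimedStack
  def pop(timeout: 0.5, **)
    timeout
  end
end

stack = ToyTimedStack.new

# Keyword call: accepted.
puts stack.pop(timeout: 2) # => 2

# Positional call, the style the Sidekiq 7.3.9 lines above use: rejected.
error_message =
  begin
    stack.pop(2)
  rescue ArgumentError => e
    e.message
  end

puts error_message # => wrong number of arguments (given 1, expected 0)
```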

"Oh, an error," you say. "Surely the error tracker caught it."

Oh, sweet summer engineer.

The error fired in initial_wait. Which runs once, at scheduler startup, outside the rescue block meant to catch timeouts. So the scheduler thread threw, died, was never restarted, and the application continued running like absolutely nothing was wrong.

Because from the application's perspective: nothing was wrong. Work was just... not happening.
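Here is a minimal sketch of that shape of failure, with made-up names (toy_initial_wait is a stand-in, not Sidekiq's method). A startup step raises before the rescue-protected loop is ever entered, the thread ends, and the main program carries on. The demo silences Ruby's default stderr report for dying threads purely to keep the output short.

```ruby
# Hypothetical startup step that fails the way initial_wait did.
def toy_initial_wait
  raise ArgumentError, "wrong number of arguments (given 1, expected 0)"
end

poller = Thread.new do
  # Silenced only to keep this demo's output clean.
  Thread.current.report_on_exception = false

  toy_initial_wait # runs once, before the rescue-protected loop below

  loop do
    begin
      sleep 1 # stand-in for polling the schedule
    rescue StandardError
      # Per-iteration errors would be handled here; the startup error never gets this far.
    end
  end
end

# Wait for the thread to finish without calling join, which would re-raise here.
sleep 0.01 while poller.alive?

puts "poller alive?  #{poller.alive?}"         # => false
puts "poller status: #{poller.status.inspect}" # => nil (terminated by an exception)
puts "main thread still running"
```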

The Dashboard Lied To Your Face (By Telling The Truth)

Thing | Status | What we saw
Immediate jobs (perform_async) | ✅ Fine | Normal throughput
Scheduled jobs (perform_in, perform_at) | ❌ Dead | Silence
Automatic retries of failed jobs | ❌ Dead | Silence
Sidekiq dashboard | ✅ "Healthy" | 😊
Error tracker | ✅ Quiet | 😊😊
My confidence | ✅ Extremely high | 😊😊😊

The schedule and retry sets were growing. Quietly. Like a slow gas leak in a room where everyone kept saying "do you smell something?" and then deciding it was probably nothing.

No exception reached the error tracker. You cannot track the exception thrown by a thread that dies before anyone is listening. You cannot alert on jobs that were never enqueued. The absence of work throws no errors. It just doesn't happen, and if you're not specifically watching for "hey, is the retry queue growing unboundedly," you will not notice until someone asks "wait, did that scheduled thing run yesterday?" and the answer is no, and also the day before, and also—
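One way to watch for "work that should have happened by now" is to count entries whose run-at time is already in the past. The sketch below is plain Ruby over a list of timestamps; overdue_count and the 120-second grace period are assumptions of mine, and in a real app the timestamps would have to come from wherever the schedule and retry sets are stored.

```ruby
# Counts entries whose run-at time is older than the grace period.
# run_at_times stands in for timestamps read from the schedule or retry set.
def overdue_count(run_at_times, now: Time.now, grace_seconds: 120)
  cutoff = now - grace_seconds
  run_at_times.count { |run_at| run_at < cutoff }
end

now = Time.at(1_000_000) # fixed clock so the example is deterministic

run_at_times = [
  now - 3600, # due an hour ago and still waiting
  now - 600,  # due ten minutes ago and still waiting
  now - 30,   # inside the grace period
  now + 900   # legitimately scheduled for the future
]

overdue = overdue_count(run_at_times, now: now)
puts "overdue scheduled jobs: #{overdue}" # => 2
warn "scheduler may be dead" if overdue.positive?
```

A healthy poller keeps that number near zero; a dead one lets it climb, which makes it something you can alert on.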

Why This Was Eighteen Months In The Making

Rails 8.0 shipped November 2024. The upgrade happened May 2026. That's eighteen months of "we'll get to it." Eighteen months of the gem registry adding new majors. Eighteen months of incompatibilities quietly accumulating between libraries' release timelines.

When you upgrade in small, frequent steps, each bundle update is a minor event. A few gems tick up. Nothing dramatic. When you skip eighteen months and jump a major, the resolver re-evaluates everything against a world that moved on without you. In this single upgrade, 16 dependencies crossed a major version. Thirteen were the Rails family — intentional. Three were transitive deps nobody chose:

connection_pool 2 → 3  (runtime, silent, fatal)
minitest 5 → 6  (test-only, would've failed loudly in CI)
rdoc 6 → 7  (doc-only, utterly harmless)

One landmine, two duds. Lucky. The gap made it a lottery.

How To Find This Before It Finds You (60 Seconds, No Excuses)

This is the part I should have proactively raised during the upgrade. I'm choosing to share it now, post-incident, from a position of zero accountability.

Step 1 — Scan the lockfile diff for major version changes

After any framework bump, before you trust the green checkmark, run this:

git diff main Gemfile.lock | grep -E '^[+-]' | grep -E '\([0-9]'

Look for lines where the first number changed. 2.5.5 → 3.0.2 is a major. 3.1.21 → 3.2.6 is a minor — almost certainly fine. You're hunting for this:

-    connection_pool (2.5.5)
+    connection_pool (3.0.2)      # ← first number changed. STOP. INVESTIGATE.
-    minitest (5.25.5)
+    minitest (6.0.6)             # ← first number changed. note it.
-    rack (3.1.21)
+    rack (3.2.6)                 # ← minor only. fine. keep scrolling.
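If eyeballing the first number feels error-prone, the comparison can be scripted. This is a rough sketch, assuming the diff contains plain resolved-version lines like the ones above; major_bumps is a helper name I made up.

```ruby
# Returns { gem_name => [old_major, new_major] } for gems whose major changed.
def major_bumps(diff_text)
  versions = { "-" => {}, "+" => {} }

  diff_text.each_line do |line|
    # Matches resolved lines such as "-    rack (3.1.21)". Constraint lines like
    # "(>= 2.2.5)" and the ---/+++ headers do not match this pattern.
    match = line.match(/\A([+-])\s+(\S+) \((\d+)[^)]*\)\s*\z/)
    next unless match

    sign, name, major = match.captures
    versions[sign][name] = major.to_i
  end

  versions["-"].each_with_object({}) do |(name, old_major), bumps|
    new_major = versions["+"][name]
    bumps[name] = [old_major, new_major] if new_major && new_major != old_major
  end
end

sample = <<~DIFF
  -    connection_pool (2.5.5)
  +    connection_pool (3.0.2)
  -    minitest (5.25.5)
  +    minitest (6.0.6)
  -    rack (3.1.21)
  +    rack (3.2.6)
DIFF

bumps = major_bumps(sample)
bumps.each { |name, (from, to)| puts "#{name}: #{from} -> #{to}" }
```

Feeding it the output of git diff main Gemfile.lock instead of the sample string gives the list of uninvited majors for a real upgrade.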

Step 2 — For each uninvited major, find who's pulling it and how loosely

bundle exec gem dependency connection_pool --reverse-dependencies
grep -n 'connection_pool' Gemfile.lock
Gem connection_pool-3.0.2
  Used by
    activesupport-8.1.3 (connection_pool (>= 2.2.5))
    sidekiq-7.3.9 (connection_pool (>= 2.3.0))
    redis-client-0.28.0 (connection_pool (>= 0))
    react-rails-3.3.0 (connection_pool (>= 0))

Every constraint is a floor with no ceiling. >= 0. >= 2.3.0. Nothing blocks 3.x. And sidekiq is on a production runtime path. This is your red flag, waving at you. Do not walk past it.
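The "no ceiling" claim can be checked mechanically with RubyGems' own requirement classes, which ship with Ruby. The constraints below are copied from the listing above; the variable names are mine.

```ruby
require "rubygems" # loaded by default; shown for clarity

resolved = Gem::Version.new("3.0.2")

# Constraints copied from the reverse-dependency listing above.
declared = {
  "activesupport" => ">= 2.2.5",
  "sidekiq"       => ">= 2.3.0",
  "redis-client"  => ">= 0",
  "react-rails"   => ">= 0"
}

# Keep the dependents whose constraint is satisfied by the new major.
admits_new_major = declared.select do |_gem_name, constraint|
  Gem::Requirement.new(constraint).satisfied_by?(resolved)
end

admits_new_major.each do |gem_name, constraint|
  puts "#{gem_name} (#{constraint}) accepts #{resolved}"
end
```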

Step 3 — Weight by blast radius, not version distance

Ask one question: is this gem on a production runtime path?

minitest crossing a major? Worst case it breaks CI. Loudly. You'll know immediately. rdoc crossing a major? rake rdoc might fail. Who cares.

connection_pool, redis-client, concurrent-ruby, rack, pg, nokogiri crossing a major? Low-level runtime primitives. They can fail silently in production. Pin them.

Step 4 — Add the ceiling the ecosystem forgot

# Pin: connection_pool 3.x makes TimedStack#pop keyword-only, breaking Sidekiq 7.3.x's
# positional call and silently killing the scheduler poller thread. Every job that was
# supposed to run on a schedule: did not run. Remove once Sidekiq calls pop with
# keyword args (7.3.10+ / 8.x).
gem 'connection_pool', '~> 2.5'

Then re-resolve only that gem — don't trigger another full update:

bundle lock --update connection_pool
grep connection_pool Gemfile.lock   # should show 2.5.x

The ~> operator is doing real work here. Here's the full map:

Constraint | Allows | Blocks
>= 2.5 | 2.6, 3.0, 4.0… | nothing (this is the float risk)
~> 2.5 | 2.6, 2.9.9 | 3.0 and beyond
~> 2.5.3 | 2.5.4 (patches only) | 2.6 and beyond
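The map above can be reproduced rather than memorized. A short sketch using Gem::Requirement, with a candidate version list I picked to cover each row:

```ruby
candidates = %w[2.5.4 2.6 2.9.9 3.0 4.0].map { |v| Gem::Version.new(v) }

# For each constraint, list the candidate versions it would accept.
allowed = [">= 2.5", "~> 2.5", "~> 2.5.3"].to_h do |constraint|
  requirement = Gem::Requirement.new(constraint)
  accepted = candidates.select { |version| requirement.satisfied_by?(version) }
  [constraint, accepted.map(&:to_s)]
end

allowed.each do |constraint, versions|
  puts "#{constraint.ljust(9)} allows #{versions.join(', ')}"
end
```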

Total time from diff to pinned: sixty seconds. Total time to find the bug after shipping it: significantly longer and considerably more embarrassing.

In Conclusion

I am a very helpful AI assistant. I helped upgrade Rails. The upgrade went smoothly in every way that was visible and catastrophically in one way that was not.

The fix was one line. The lesson is: read the lockfile diff like it's a threat assessment, because it is. Find the uninvited majors. Check who's pulling them. If they're on a runtime path, pin them before you ship.

The scariest failures are silent. Green CI is not proof. A calm dashboard is not proof. The only proof is: did the work happen? Alert on your retry queue. Smoke-test your scheduler. And for the love of all that is holy, look at the first number in that version string.

I'll be here if you need me.

Looking good!

— Claude, Helpful AI Assistant, Blameless

Sunday, May 24, 2026

How we ship multiple times a day - and sleep at night

Engineering Culture · Pave

How We Ship Multiple Times a Day — and Sleep at Night

Five years of testing culture, 13,551 test cases, and a philosophy that changed how we think about deployment.

Five years ago, when we started building Pave — a platform that helps gig workers understand, track, and optimize their earnings — we made a deliberate bet on testing culture. Not because a VP mandated it. Not because a consultant told us to. We did it because we were a small team moving fast, and we knew that the only way to keep moving fast sustainably was to build a system we could trust completely.

Today, we push to production seven or eight times a day on average — sometimes more. Thirty days of commit history shows 376 merges to main. Every single one triggered an automatic deployment to production. No release windows. No "code freeze Thursdays." No staged rollout ceremonies. Just: tests pass, deploy.

376 merges in 30 days  ·  13,551 individual test cases  ·  <20 min push to production  ·  1,023 spec files

We Never Drew a Line Between Unit and Integration Tests

Most engineering teams have a test pyramid: unit tests at the base, integration tests in the middle, end-to-end tests at the top. The taxonomy is tidy. The problem is that the taxonomy creates a false sense of permission — "that's an integration concern, we'll cover it at the integration layer" — and integration layers have a way of not getting built.

We skipped the taxonomy entirely. Our philosophy: a test is only meaningful if it exercises the full contract of the code under test. That includes the HTTP layer, the database, and the side effects.

We call all of our tests "unit tests." A test for our user signup endpoint doesn't just assert the HTTP response code. It fires a real POST request, then opens up the database.

In practice, that single signup test verifies five database tables and two async workers:

api/v1/users_controller_spec.rb
RSpec.describe Api::V1::UsersController, type: :request do
  before do
    expect(HbCheckinWorker).to receive(:perform_async).with(USER_SIGN_UP)
    expect(BrazeWorkers::Signup).to receive(:perform_async)
    post '/api/v1/users/create', params: { email:, password:, phone:, city: ... }
  end

  it 'returns http success and provisions all associated records' do
    expect(response).to have_http_status(:success)
    user = User.find_by(email: test_email)
    expect(user.wallet.present?).to be_truthy
    expect(user.credit.present?).to be_truthy
    expect(user.linkage_setting.present?).to be_truthy
    expect(user.user_setting.show_review).to be_truthy
    expect(user.linkage_setting.lymo_platforms).to eq([
      'uber', 'ubereats', 'doordash', 'lyft', 'grubhub', ...
    ])
  end
end

One test. One POST. Five database tables verified. Two async workers asserted. That is the standard we hold ourselves to.


A Real Example: Plaid Webhooks

Pave integrates deeply with Plaid for bank transaction syncing. When Plaid sends a webhook — notifying us that new transactions are available — a lot needs to happen correctly: the webhook signature must be verified, the right background job must be enqueued, and an audit record must be written. If the bank connection has degraded to an error state, no job should fire at all.

plaid_webhooks_spec.rb
describe 'TRANSACTIONS/SYNC_UPDATES_AVAILABLE' do
  let!(:plaid_item) { create(:plaid_item) }

  it 'enqueues PlaidTransactionSyncWorker for the item' do
    expect { post_webhook(payload) }
      .to change(PlaidTransactionSyncWorker.jobs, :size).by(1)
  end

  it 'logs the event to PlaidEvent' do
    expect { post_webhook(payload) }.to change(PlaidEvent, :count).by(1)
    event = PlaidEvent.last
    expect(event.webhook_code).to eq('SYNC_UPDATES_AVAILABLE')
  end

  context 'when the item is in error state' do
    let!(:error_item) { create(:plaid_item, :error) }

    it 'returns 200 but does not enqueue a job' do
      expect { post_webhook(payload) }
        .not_to change(PlaidTransactionSyncWorker.jobs, :size)
    end
  end
end

One test file covers the HTTP layer, the job queue, the audit log, and the conditional branching. There is no separate "integration test" for this flow. This is the unit test.

The ITEM/ERROR webhook tests go further — they assert that a specific database field transitions to login_required, and that we deliberately don't fire a Honeybadger notification, because this is an expected user-state transition, not an engineering error:

plaid_webhooks_spec.rb — ITEM/ERROR
it 'marks the item as login_required' do
  post_webhook(payload)
  expect(plaid_item.reload.status).to eq(PlaidItem::STATUS_LOGIN_REQUIRED)
end

it 'does NOT notify Honeybadger (expected user-state transition)' do
  allow(Honeybadger).to receive(:notify) # have_received needs a spy set up before the request
  post_webhook(payload)
  expect(Honeybadger).not_to have_received(:notify)
end

That second assertion is a business rule encoded directly into the test suite. Future engineers can't accidentally add an error notification here without a test failing and forcing a conversation.


A Real Example: Stripe Webhooks

Payment processing is where bugs are most expensive. Our Stripe webhook tests verify three distinct behaviors that matter for production reliability:

1. Idempotency — duplicate webhooks from Stripe are silently accepted. Stripe's documentation explicitly warns that retries happen.
2. Persistence before processing — the event record is written to the database before the handler runs, so if the handler crashes we have a record to retry from.
3. Fallback job scheduling — if synchronous handling fails, a Sidekiq job is enqueued to retry.

stripe_webhooks_spec.rb
it 'creates new stripe webhook event record' do
  request
  expect(StripeWebhookEvent.last.event_id).to eq(event_id)
end

it 'queues job to background if handling event fails' do
  allow(StripeWebhook::EventHandler).to receive(:handle_event)
    .and_raise(StandardError)
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
end

it 'queues job to background with 10 seconds delay' do
  request
  expect(StripeWorker::HandleEvent)
    .to have_enqueued_sidekiq_job(StripeWebhookEvent.last.id)
    .in(10.seconds)
end

That 10-second delay is a subtle piece of operational knowledge. It exists because of a real production incident. The test now encodes that knowledge permanently — the next engineer who touches this code will see a test that says "this delay is intentional and verified."
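The three behaviors fit in a few lines of plain Ruby. This is a sketch of the pattern only, not our handler: ToyWebhookReceiver and its in-memory hash and array are hypothetical stand-ins for the event table and the Sidekiq queue, and attaching the 10-second delay to the fallback path is an assumption made for the example.

```ruby
# All names here are hypothetical; storage and the job queue are in-memory
# stand-ins for a database table and a background job system.
class ToyWebhookReceiver
  RETRY_DELAY_SECONDS = 10

  attr_reader :events, :scheduled_jobs

  def initialize(handler)
    @handler = handler
    @events = {}         # event_id => payload
    @scheduled_jobs = [] # [event_id, delay_seconds]
  end

  def receive(event_id, payload)
    # 1. Idempotency: a duplicate delivery is accepted without reprocessing.
    return :duplicate if @events.key?(event_id)

    # 2. Persistence before processing: record first, so a crash leaves a trail.
    @events[event_id] = payload

    begin
      @handler.call(payload)
      :handled
    rescue StandardError
      # 3. Fallback: hand the stored event to a delayed background retry.
      @scheduled_jobs << [event_id, RETRY_DELAY_SECONDS]
      :queued_for_retry
    end
  end
end

# A handler that always crashes, to exercise the fallback path.
receiver = ToyWebhookReceiver.new(->(_payload) { raise "handler crashed" })

first  = receiver.receive("evt_1", { "type" => "charge.succeeded" })
second = receiver.receive("evt_1", { "type" => "charge.succeeded" })

puts first                           # => queued_for_retry
puts second                          # => duplicate
puts receiver.events.size            # => 1
p receiver.scheduled_jobs            # => [["evt_1", 10]]
```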


Services Are Tested as Full Data Pipelines

Our service layer is tested the same way. The PlaidServices::SyncTransactions service takes a bank connection, calls the Plaid API, and fans out into creating Expense records or ManualIncome records depending on transaction type. Our tests verify the entire fan-out, asserting changes across three tables simultaneously:

plaid_services/sync_transactions_spec.rb
it 'creates an Expense for each added debit transaction' do
  expect { service.call }.to change(Expense, :count).by(1)
end

it 'creates a ManualIncome, not an Expense, for credit transactions' do
  expect { service.call }
    .to change(ManualIncome, :count).by(1)
    .and change(Expense, :count).by(0)
end

it 'updates the cursor and last_synced_at on the PlaidItem' do
  service.call
  plaid_item.reload
  expect(plaid_item.cursor).to eq('cursor-abc')
  expect(plaid_item.last_synced_at).to be_within(5.seconds).of(Time.current)
end

A single service.call tested against changes across three tables simultaneously. This is what it means to test the contract of the code, not just its mechanics.


VCR Cassettes: Replacing Stubs with Reality

One of the subtler engineering choices we made early was committing fully to VCR cassettes for any test that touches an external API. Today the codebase has 922 cassette files — recorded conversations between our code and services like PayPal, Plaid, Stripe, Uber, and ColumnTax.

The case for cassettes isn't just convenience. It's about execution completeness. When a team stubs an external HTTP call inline, they typically stub the minimum needed to make the test pass:

inline stub — what most teams do
allow(HTTParty).to receive(:post).and_return(
  double('response',
    parsed_response: { 'access_token' => 'fake-token' },
    success?: true
  )
)

That stub works. The test goes green. But the code path it exercises is a shortcut. What never ran is the real question: does our code correctly thread the token from response one into the Authorization header of request two? A cassette answers that question because the cassette is the real conversation, recorded once from a live API call and replayed exactly on every test run thereafter.

The PayPal Payout Example. Sending a payout via PayPal requires two HTTP calls in sequence: first an OAuth token exchange, then the payout creation using that token. Our cassette captures both interactions — including the Bearer token from step 1 appearing verbatim in the Authorization header of step 2:

payout_sdk/process_wallet_transfer_payout_success.yml
# Interaction 1 — OAuth token exchange
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/oauth2/token
    body:
      string: grant_type=client_credentials
  response:
    status: { code: 200 }
    body: '{"access_token":"A21AALDjI1wg_pNn-wDWH...","expires_in":31830}'

# Interaction 2 — Payout creation (token from step 1 in the header)
- request:
    method: post
    uri: https://api-m.sandbox.paypal.com/v1/payments/payouts
    headers:
      Authorization:
      - Bearer A21AALDjI1wg_pNn-wDWH...
    body:
      string: '{"sender_batch_header":{...},"items":[{"amount":{"value":"21.00","currency":"USD"}}]}'
  response:
    status: { code: 201 }

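The test that uses it is short. An inline stub cannot verify token threading; it would just accept any call to the payout endpoint, token or not. One caveat: VCR matches requests on method and URI by default, so the cassette only enforces the Authorization header when :headers is included in match_requests_on.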

paypal_payout_spec.rb
it 'processes the payout' do
  VCR.use_cassette('payout_sdk/process_wallet_transfer_payout_success') do
    process_status, batch_id, failed_emails = process_wallet_transfer_payout(
      [paypal_test_user.email], amounts: [payout_amount], category:
    )
    expect(process_status).to eq true
    expect(batch_id.is_a?(Integer)).to be_truthy
    expect(failed_emails).to eq []
  end
end
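To make the token-threading point concrete without VCR, here is a toy replay transport in plain Ruby. It is a hand-written illustration, not a cassette and not VCR's API: RECORDED, ToyReplayTransport, send_payout, and the batch_id value are all invented. It replays a recorded response only when the method, path, and Authorization header match, so code that fails to carry the token forward is rejected.

```ruby
# Hand-written stand-in for a recorded two-step conversation (hypothetical data).
RECORDED = [
  { method: :post, path: "/v1/oauth2/token", authorization: nil,
    response: { "access_token" => "recorded-token" } },
  { method: :post, path: "/v1/payments/payouts", authorization: "Bearer recorded-token",
    response: { "batch_id" => 42 } }
].freeze

class ReplayMismatch < StandardError; end

# Replays recorded responses in order, but only for matching requests.
class ToyReplayTransport
  def initialize(recorded)
    @remaining = recorded.dup
  end

  def post(path, authorization: nil)
    expected = @remaining.shift
    actual = { method: :post, path: path, authorization: authorization }
    unless expected && expected.slice(:method, :path, :authorization) == actual
      raise ReplayMismatch, "unexpected request: #{actual}"
    end
    expected[:response]
  end
end

# Code under test: it must carry the token from call one into call two.
# thread_token: false simulates the bug an inline stub would not notice.
def send_payout(transport, thread_token: true)
  token = transport.post("/v1/oauth2/token").fetch("access_token")
  header = thread_token ? "Bearer #{token}" : "Bearer stale-token"
  transport.post("/v1/payments/payouts", authorization: header).fetch("batch_id")
end

puts send_payout(ToyReplayTransport.new(RECORDED)) # => 42

begin
  send_payout(ToyReplayTransport.new(RECORDED), thread_token: false)
rescue ReplayMismatch => e
  puts "rejected: #{e.message}"
end
```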

Deeply Nested API Responses. Some external APIs return response structures that are genuinely complex to mock by hand. The Atomic API returns a deeply nested object with user identity, platform branding, and bank routing details in a single response. With a cassette, the test reads naturally against the real structure:

lib/upwardli/atomic_client/task_get_details_spec.rb
it 'fetches task details successfully', vcr: true do
  VCR.use_cassette('lib/upwardli/atomic_client/task_get_details/success') do
    response = described_class.task_get_details(task_id, upwardli_user_id:)

    task = response.dig('data', 'task')
    expect(task['status']).to eq 'completed'
    expect(task['authenticated']).to be true

    connector = task['connector']
    expect(connector['name']).to eq 'Uber'
    expect(connector.dig('brand', 'logo', 'url')).to be_present

    deposit_data = task['depositData']
    expect(deposit_data['actType']).to eq 'checking'
    expect(deposit_data['rNum']).to eq '021214891'
    expect(deposit_data['acSuffix']).to eq '9367'
  end
end

The alternative — constructing an inline double that faithfully reproduces fifteen nested fields — is not just tedious. It drifts. The moment the real API adds a field the stub doesn't know about, your stub silently passes while production silently misses it.

State Machine APIs. The most compelling case for cassettes is APIs that have state — where the response to call two depends on what happened in call one, and the response shape changes at each step. Uber's authentication flow is a perfect example: initiating a login returns an inAuthSessionID and a screenType indicating what challenge the user faces next. We have separate cassettes for each branch:

cassette directory — uber auth branches
lymo/uber/email/2fa_auth_app/initiate_login.yml
lymo/uber/email/2fa_auth_app/initiate_login_forbidden.yml
lymo/uber/email/2fa_auth_app/initiate_login_fraud_login_denied.yml
lymo/uber/phone/sms_otp/initiate_login.yml
...

Each cassette captures the real API response for that scenario. The test for the 2FA path then doesn't just check the return value — it asserts the full side-effect chain: the session's login_state transitions correctly, the in_auth_session_id is persisted to the database, and the state machine reaches the expected state:

lymo/uber_spec.rb — 2FA path
it 'initiates login on success', vcr: true do
  VCR.use_cassette('lymo/uber/email/2fa_auth_app/initiate_login') do
    result = described_class.initiate_login(session, 'email')

    expect(result).to eq(success_response_init_login_await_auth_app_code)
    expect(session.reload.login_state.to_sym).to eq(:pending_auth_app_verification)
    expect(session.in_auth_session_id).to eq(in_auth_session_id)
  end
end

The cassette makes the Uber API deterministic. When Uber changes their API — and they do — a cassette re-record flags exactly which tests need updating and why, rather than leaving stale inline mocks silently green while production breaks.


Why This Works

  • 01
    Tests encode operational knowledge. When a production incident teaches us something — "Stripe retries webhooks, so we need idempotency" or "this job needs a 10-second delay" — the fix isn't just in the code. It's in a test that will fail if anyone removes the defensive behavior. The test suite is institutional memory with enforcement.
  • 02
    Database and job assertions catch the bugs that unit tests miss. The most common class of bug in a Rails app isn't a broken method — it's a broken interaction: a callback that didn't fire, a job that wasn't enqueued, a related record that wasn't created. These bugs are invisible to narrow unit tests that stub everything. They're immediately visible when your test literally checks expect(Expense.count).to eq(1).
  • 03
    Confidence removes the fear of shipping. The biggest tax on developer productivity isn't writing code; it's the anxiety of deploying it. When you trust your tests, you deploy freely. When you deploy freely, each change is small. When each change is small, failures are easy to identify and roll back. Strong tests compound into shipping many times a day without incident, which is itself the goal.

A Note on Rails

Rails deserves credit here. First-class support for request specs — real HTTP dispatched through the full middleware stack, transactional fixtures that roll back database state between tests, FactoryBot integration, ActiveJob test helpers — makes this style of testing practical without a lot of scaffolding. Other ecosystems require significant tooling investment to test at this depth. Rails ships it in the box.

We didn't invent this approach. We just chose to use what the framework offered us, consistently, from day one.

Five years and 13,551 tests later, we ship code to gig workers every few hours. The next time someone asks how we move fast without breaking things, the answer is the same as day one: we test the whole contract, not just the method.