Wednesday, May 13, 2026

AWS Deleted Our Production Database

Infrastructure · AWS · Post-Mortem

Engineering Team · May 2026 · 7 min read

At 11:16 PM on a Monday, AWS Marketplace silently destroyed the Redis cluster powering driver geolocation for our entire gig-worker platform. We had been paying customers for five years. Nobody called.

What Happened

We had recently moved our Redis workload from an annual Redis Enterprise contract to a pay-as-you-go subscription — cheaper, more flexible, and billable through AWS for unified vendor management. Redis offered us a private offer through AWS Marketplace. That offer was structured with a 14-day trial meant to convert into a paid plan.

It didn't convert. Nobody renewed it. And when the trial flag flipped, AWS did not send an escalation. AWS did not pause the service.

What AWS did instead
AWS deleted the database. Not suspended. Not paused with a 72-hour grace window. Deleted. A long-standing, paying enterprise customer's production database was destroyed because a trial-conversion checkbox on a Marketplace listing was not flipped on time.

Read that again. We did not miss a payment. We are not a free-tier hobby account. We have been writing AWS checks for five years.

Why This Is the Wrong Default

There is a galaxy of difference between suspending service and deleting customer data. One is a billing action. The other is irreversible destruction of property. The fact that AWS chose the latter as the default behavior for a Marketplace trial — with no human-in-the-loop check on whether the account underneath was a paying customer running production workloads — is not a policy. It is a failure of judgment dressed up as automation.

A single email — "Your subscription expires in 72 hours. Your data will be permanently deleted on conversion failure." — would have prevented this entirely. We never received one.

What Was Almost Lost

What makes this gutting is what was on the cluster. The database AWS deleted was the culmination of a three-month migration. From mid-February through early May, our team had:

  • Built out the new Redis infrastructure from scratch
  • Upgraded three live Sidekiq queues without downtime
  • Ran staged traffic cutovers — 10%, then 50%, then full
  • Set up VPC peering to the new subscription
  • Modernized deprecated GEORADIUS calls to ZRANGE + GEOPOS
  • Batched GEOPOS calls, tuned SCAN counts, swapped SMEMBERS for SSCAN and DEL for UNLINK
  • Added per-call jitter to TTLs to prevent mass key expiry (see the sketch after this list)
  • Stood up Redis alarms and a dedicated geo-timeseries health check endpoint
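
As promised above, a minimal sketch of the per-call TTL jitter idea (the base TTL and jitter window are illustrative, not our production values):

BASE_TTL   = 24 * 60 * 60   # assumed base TTL of one day
MAX_JITTER = 15 * 60        # up to 15 minutes of random spread

def set_with_jitter(key, value)
  # Randomize each key's expiry so keys written in the same batch don't all expire at once
  $redis_aws.set(key, value, ex: BASE_TTL + rand(MAX_JITTER))
end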

Dozens of PRs. Five engineers. Three months of careful, staged work. All of it sitting on a cluster that AWS quietly destroyed because a flag flipped from green to red.

What Our Team Got Right

  • Time to detect: 15m
  • Time to start rebuild: 27m
  • Data loss: 0
  • Total recovery time: 9h 15m

The detection was fast. Honeybadger flagged the incident in about 15 minutes. We were on the bridge before midnight, and we started rebuilding within 27 minutes of deletion.

The detail I'm most proud of: the geo-timeseries uptime monitor — which fails if no driver in our entire fleet has reported a position in the trailing hour — never tripped during the nine-hour recovery. Yes, there were windows of degradation. About 10,000 worker jobs that depended on a geo-hash lookup landed in our dead-jobs queue and had to be reprocessed. But across nine hours of fighting back from a destroyed database, the platform never went fully dark. Drivers kept being tracked. The system our team had built absorbed the blow.
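
The shape of that monitor is simple. A hypothetical sketch of the idea, not our exact implementation (the key name is illustrative):

# Healthy as long as at least one driver position has been written in the trailing hour
def geo_timeseries_healthy?
  last_point_at = $redis_aws.get("geo:last_point_written_at").to_i  # hypothetical key
  last_point_at > Time.now.to_i - 3600
end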

After reprocessing and a backfill, we confirmed zero data loss.

What Our Team Got Wrong

The recovery still took nine hours and fifteen minutes. That number is on us.

  • No Infrastructure as Code for this Redis cluster
  • No runbook for rebuilding and re-wiring it
  • No alert for an expiring Marketplace trial or expiring credit card tied to a critical subscription
  • No formal ownership over checking billing validity of production vendors

Once you cross a certain product maturity, "someone will remember" stops being a strategy. We're fixing all of this: IaC for the cluster, alarms for expiring trials and cards and subscriptions, and clear ownership for vendor billing health.

· · ·

What Engineering Leaders Should Take From This

  1. Check your AWS Marketplace private offers today. If you run a serious workload behind one, look up the expiration date. Don't assume the safety rails exist between expired trial and production data destroyed. They don't.
  2. Suspension and deletion are not the same thing. Any vendor that deletes customer data as the default action on a billing event — without escalation, without a grace period, without a phone call — has made a profound design error. Push back when you see this in vendor contracts.
  3. IaC and runbooks are insurance, not overhead. We got lucky that our monitoring held. We shouldn't have needed the luck. If your critical infrastructure has no runbook for rebuilding from zero, write one this week.
  4. Billing health needs an owner. Not "ops knows." Not "finance handles it." Someone specific, with an alert, who checks it. A production database destroyed by a billing flag is not a billing problem — it is an engineering problem.

And if you're at AWS: a paying customer's production database is not an expiry-date checkbox. It deserves a phone call.

Engineering Team · May 2026
AWS · Redis · Post-Mortem · Infrastructure · Incident Response · AWS Marketplace

Friday, May 01, 2026

How We Dropped Redis Slow Queries to Zero

Pave Engineering · April 2026 · 9 min read

Our backend processes a continuous stream of GPS data, earnings calculations, and activity syncs for gig workers. As data volumes grew, so did the Redis slow log. This is the story of how we hunted down every entry — and why connecting Claude to AWS directly changed the speed of the investigation.

Background: What Makes a Redis Query “Slow”?

Redis is single-threaded. Every command — regardless of complexity — runs serially. While most commands are O(1) or O(log N) and complete in microseconds, a handful are O(N): they process every element in a data structure before returning. When N is large, these commands don’t just become slow themselves — they block every other command waiting in the queue.

Redis has a configurable slow log that records any command exceeding a threshold (typically 10–100ms). Our alerts were coming from hecate, our primary cluster, and we had a small but persistent set of commands showing up.
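
If you want to eyeball what the slow log is catching before wiring up alerts, a quick pass from a console is enough. A minimal sketch (the entry shape can vary slightly by Redis version):

# Each SLOWLOG GET entry starts with: id, unix timestamp, duration in microseconds, command
$redis_aws.slowlog("get", 10).each do |_id, ts, micros, command, *_rest|
  puts "#{Time.at(ts)}  #{(micros / 1000.0).round(1)}ms  #{command.join(' ')}"
end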

· · ·

The Fixes

1. Removing GEORADIUS — PR #6942

The first culprit wasn’t slow because of data size — it was slow because it was doing unnecessary work by design.

Cache::EventsDash was using GEORADIUS with a radius of 15,000 miles (larger than any two points on Earth) just to get all members out of a geo sorted set:

DIST_INF = 15000 # the greatest distance between any two points on Earth's surface is ~12,427 miles

events = $redis_aws.georadius(key(tp), 0, 0, DIST_INF, "mi", options: :WITHCOORD)

Using GEORADIUS as a glorified SMEMBERS meant Redis was computing distances for every member — work that was immediately discarded. GEORADIUS is also deprecated as of Redis 6.2. The fix replaced it with ZRANGE (plain sorted set scan) plus GEOPOS (batch coordinate lookup):

members = $redis_aws.zrange(key(tp), 0, -1)
coords  = $redis_aws.geopos(key(tp), members)
members.zip(coords).map do |user_id, latlong|
  { type: tp, user_id:, lat: latlong[1], long: latlong[0] }
end

ZRANGE returns members in O(N) with no distance math. GEOPOS decodes coordinates in a single call. No wasted computation, and off the deprecation path.

2. Batching large GEOPOS calls — PR #7009

The next slow log entries pointed to GEOPOS calls against sorted sets with ~20,000 members. GEOPOS is O(N) for N member lookups — passing 20K members at once was taking ~40ms each on hecate-003.

The fix was batching: twenty sequential calls of 1,000 each instead of one call of 20,000. Each individual call completes quickly, leaving the event loop free for other commands between batches.

GEOPOS_BATCH_SIZE = 1_000

coords = members.each_slice(GEOPOS_BATCH_SIZE).flat_map do |batch|
  $redis_aws.geopos(key(tp), batch)
end

We also reduced the SCAN count hint in DrivenActivities from 15,000 to 10,000 — a smaller hint means Redis yields more often during the scan.

3. Tuning SCAN count further — PR #7010

Reviewing the slow log further, SCAN with a 10,000 count hint was still appearing. We reduced it again to 1,000. The count parameter is a hint, not a limit — Redis may return more or fewer — but a lower hint reduces the amount of work per iteration, keeping each call short.

$redis_aws.scan_each(match: "#{PREFIX}_*", count: 1000) { |key| ... }

4. Replacing blocking SADD and DEL with async alternatives — PR #7031

DEL in Redis is synchronous — it blocks the event loop while freeing memory. For large sorted sets and hashes, this can be significant. UNLINK is the async equivalent: it unregisters the key immediately (making it invisible to clients) and frees the memory in a background thread. It’s a drop-in replacement with no behaviour change.

We migrated 70 call sites across 27 files:

# before
$redis_aws.del(key)

# after
$redis_aws.unlink(key)

For large SADD operations (adding many members at once), we split them into smaller batches to avoid a single large O(N) write blocking the event loop.
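
The batched SADD follows the same shape as the GEOPOS batching above. A minimal sketch, with the batch size, key, and member list as illustrative placeholders:

SADD_BATCH_SIZE = 1_000

member_ids.each_slice(SADD_BATCH_SIZE) do |batch|
  # Each call writes at most 1,000 members, so no single SADD blocks the event loop for long
  $redis_aws.sadd(users_key, batch)
end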

5. SMEMBERS → SSCAN for the hourly geo user set — PR #7058

The final slow log entry was the most predictable: a single SMEMBERS firing every day at exactly 4:15 UTC. One entry in 24 hours — but perfectly consistent.

SMEMBERS returns every member of a set at once. Our hourly geo user set accumulates the ID of every user who sends a GPS point throughout the day. By 4:15 UTC (end of the US evening gig economy rush), that set had grown to ~40,000 members.

Finding the caller was a codebase grep for smembers, which pointed to GeoMetricsRedis.get_user_ids. Connecting the timing to the EventBridge schedule is where the AWS integration paid off immediately — more on that below.

The fix was SSCAN with a batch size of 1,000: ~40 non-blocking iterations instead of one O(N) call that blocks everything behind it.

def get_user_ids(y_m_d_h_date = ...)
  users_key = key_prefix(GEOPOINTS_USERS, y_m_d_h_date)
  user_ids = []
  cursor = 0
  loop do
    cursor, batch = $redis_aws.sscan(users_key, cursor, count: 1000)
    user_ids.concat(batch.map(&:to_i))
    break if cursor == '0'
  end
  user_ids
end
· · ·

How AWS MCP Changed the Investigation

Debugging distributed systems normally means bouncing between tools — CloudWatch Logs, the ECS console, EventBridge, ELB metrics — copy-pasting ARNs, losing context between tabs. With the AWS MCP connected to Claude, every AWS query happened in the same conversation thread as the code analysis.

Finding the schedule in seconds

Once we identified get_user_ids as the slow call, we needed to know what triggered it at 4:15 UTC. Rather than navigating the EventBridge Scheduler console:

aws scheduler list-schedules | grep -i geo_metrics
aws scheduler get-schedule --name "Record_Geo_Metrics"
# => cron(15 4 * * ? *) → POST /schedule_task/record_geo_metrics

The cron expression, the Lambda target, and the API path all came back in one command. That’s normally a five-minute console hunt.

Ruling out unrelated issues

Mid-investigation, a spike of 1,231 ELB 504 errors appeared. Rather than derailing into manual console investigation, we queried the ELB metrics directly. The breakdown confirmed all 1,231 errors were 504s (not 502s), all targets remained healthy, and the spike resolved on its own — most likely an OOM kill on one container under sustained evening load. Confirmed, noted, moved on. The original Redis investigation stayed on track.
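
If you ever need to reproduce that kind of breakdown outside an agent session, the same query is a few lines with the aws-sdk-cloudwatch gem. A sketch (region and load balancer name are placeholders):

require "aws-sdk-cloudwatch"

cloudwatch = Aws::CloudWatch::Client.new(region: "us-east-1")

resp = cloudwatch.get_metric_statistics(
  namespace:   "AWS/ApplicationELB",
  metric_name: "HTTPCode_ELB_504_Count",
  dimensions:  [{ name: "LoadBalancer", value: "app/prod-alb/0123456789abcdef" }],
  start_time:  Time.now - 3600,   # trailing hour
  end_time:    Time.now,
  period:      300,               # 5-minute buckets
  statistics:  ["Sum"]
)

resp.datapoints.sort_by(&:timestamp).each do |dp|
  puts "#{dp.timestamp}  #{dp.sum.to_i} 504s"
end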

Having AWS query results and code analysis in the same context meant each finding fed directly into the next hypothesis — without a context switch.
· · ·

When Your AI Goes Down: Have a Backup Ready

Here’s something that doesn’t come up in most engineering blogs but happened during this investigation: mid-session, while we were deep into debugging the 504 spike, the Claude service went down.

Rather than waiting it out, I switched to Codex and continued the investigation immediately. What was notable: Codex picked up local AWS credentials automatically — no MCP configuration required, no setup overhead. It queried CloudWatch and ELB the same way Claude had been doing, just with a different interface. In that respect it behaved like OpenClaw: AWS access was just there, ready to use.

Practical takeaway: Keep more than one coding agent configured and ready to go. The sessions where you most need uninterrupted flow — mid-incident, mid-investigation, when you’re holding several hypotheses in your head — are exactly when a service outage costs you the most. If switching agents takes 30 seconds instead of 30 minutes, you stay in the problem.

The specific capability to look for is native credential access. Agents that can reach your AWS environment using locally configured credentials (rather than requiring explicit MCP setup per session) are the ones you can hand off to without losing momentum. Codex has this. OpenClaw had it. It’s worth knowing which tools in your arsenal work this way before you need them.

· · ·
[Chart] Redis slow log entries on hecate-003 — consistent daily hits through 04/23, dropping to near-zero by 04/30 as the fixes landed.

The Pattern

Looking across all six fixes, a clear pattern emerged:

Redis commands that are fine at small scale become slow log entries as data volumes grow — and the growth is often invisible until it crosses a threshold.

The fixes all follow the same principle: break large O(N) operations into smaller batches, and prefer async alternatives (UNLINK over DEL, SSCAN over SMEMBERS, SCAN with smaller count hints) that yield the event loop between iterations.

The slow log is now clear. The fixes were individually small — a few lines each — but finding them required tracing from a Redis key name through a codebase, an EventBridge schedule, a Lambda, a Rails controller, and a Sidekiq worker. Having AWS and code in the same context made that trace fast. And having a backup agent meant a service outage didn’t stop it.

Key Takeaways

  1. Audit deprecated commands first. GEORADIUS was doing distance math on every member just to replicate SMEMBERS. Deprecated commands often carry hidden performance baggage — check the slow log, then check the Redis changelog.
  2. Batch large O(N) commands. GEOPOS, SADD, and SMEMBERS against large structures are the usual suspects. Replace with sliced iterations of ≤1,000 elements.
  3. Use UNLINK, not DEL. It’s a drop-in replacement that frees memory in the background instead of on the event loop. Migrating all 70 call sites took minutes with a codebase search.
  4. The slow log tells you what, not why. Tracing from a command back to its caller — across schedules, queues, and controllers — is where the investigation lives. Tooling that keeps AWS and code in the same context makes this dramatically faster.
  5. Have a backup agent ready before you need one. Configure agent redundancy when it’s calm, not when you’re mid-incident. Prefer agents with native credential access so the handoff is immediate.

Pave Engineering · April 2026
Redis · AWS · Ruby · Performance · ElastiCache · MCP

Wednesday, April 22, 2026

Engineering × AI

My Journey Back to OpenClaw: From Disruption to Democratized AI Power

When an upstream policy change kills your workflow overnight, you learn fast what your tools are actually made of.

The allure of AI, particularly the promise of powerful agents like OpenClaw, has always been about augmenting our capabilities and freeing us from mundane tasks. My recent experience, however, was a stark reminder that the path to progress isn't always smooth. It involved navigating through unexpected adversity, finding my way with alternative tools, and ultimately, reaffirming my belief in the democratizing power of open-source AI embodied by OpenClaw.

— —

The Disruption: Caught in the Crossfire

My personal workflow relies on two main automated reports: one tracking car sales and deals for specific models and criteria near me, and another monitoring the geopolitical landscape, particularly the ongoing conflict with Iran and its impact on global markets and cybersecurity. These aren't just casual interests — they're crucial for staying informed and making timely decisions, whether it's snagging a great car deal or understanding market shifts.

OpenClaw was firing these reports daily. But then, an external disruption struck: Anthropic, a provider of Claude models, shifted its policies. This wasn't just a minor inconvenience; it was a deliberate move that effectively severed the connection for many users.

The core issue stemmed from Anthropic's decision to stop covering third-party tool usage, like that of OpenClaw, under their standard Claude subscriptions. This meant users wanting to continue using OpenClaw with Claude models would face significantly higher, pay-as-you-go costs — a "claw tax," as some have called it. The Verge reported that this policy change effectively began around April 4th, 2026, impacting countless users who relied on this integration.

The situation escalated when OpenClaw's creator, Peter Steinberger (now employed by OpenAI), found his own account temporarily banned from accessing Claude, reportedly due to "suspicious" activity. While the ban was short-lived and his account was reinstated after community outcry, the incident highlighted the growing tensions. As detailed in TechCrunch, Steinberger reportedly tried to reason with Anthropic, even delaying the policy change, but ultimately felt it was a "betrayal of open-source developers" — especially given Anthropic's recent addition of features to its own tools like Claude Cowork that seemed to mimic OpenClaw's capabilities. This move, coupled with the higher costs and the temporary ban, created significant distrust and frustration within the community.

Suddenly, my crucial daily reports stopped. The absence was more than an inconvenience; it meant I was cut off from vital information. The car deal alerts vanished (I'm still waiting for that perfect deal!), and the daily Iran news digest, which used to take me a tedious 30 minutes each morning to compile manually, was no longer at my fingertips.

— —

The Roadblocks: Navigating Bugs on the Path Back

My first instinct was to get back on OpenClaw, even if it meant using a different provider. I decided to try using an OpenAI/Codex model for this. However, the path back was immediately blocked by frustrating, unrelated bugs in the previous version:

The Channel Setup Crash (GitHub #67076). During the onboarding process, specifically when configuring channel options after entering my Discord token, the application crashed with "Can't read properties of undefined (reading 'trim')". This was a regression — it had worked before, but now this bug aborted the entire channel setup. It felt like hitting a wall right at the start.

The Misspelled API Path (GitHub #68076). Even after getting past the trim bug, a misspelled request path to the OpenAI-compatible API (openai/v1 instead of openai/api/v1) caused the Codex model to fail silently. Thankfully, the issue had a workaround documented right in the comments, which I applied and it worked. This is worth noting: working directly with the OpenClaw GitHub repo is by far the fastest way to resolve issues on such a fast-moving project. The community and maintainers are responsive, workarounds get posted quickly, and you can often unblock yourself the same day.

— —

Finding a Lifeline: The Power of Codex in the Interim

While I was working through those OpenClaw bugs, my detour through Codex CLI proved surprisingly productive. It used the Hermes agent, leveraged both Gemini and OpenAI models, and significantly helped automate some of the tasks I was missing. A key advantage was its ability to backport much of my existing OpenClaw configuration, making the transition smoother than expected. Codex helped me:

  • Automate reporting: Generate cron jobs for a series of news events, recreating the automated daily digests I had lost.
  • Port configurations: Transfer many of my OpenClaw settings to the new setup — a huge time-saver.
  • Set up a Telegram bot: This lets me direct OpenClaw (and Codex/Hermes) from my phone, incredibly useful when away from the laptop.

However, Codex wasn't without its own limitations. Despite having access to a Brave Search API token, Codex and Hermes fell back to browser-based loading, which quickly hit Google rate limits and couldn't bypass them.

More critically, while Codex could show raw links or entire page content, it lacked built-in capabilities for extracting and summarizing specific, relevant data. For instance, I asked it to find the opening hours for my local library — a simple task — but it couldn't extract this structured data out-of-the-box. Although Codex eventually managed to bring in BeautifulSoup and I coached it to write a custom extractor, the entire process was time-consuming and required significant intervention. In contrast, OpenClaw has robust web extraction and summarization capabilities built-in, saving a tremendous amount of friction.

The difference shows up in even the simplest real-world queries. Here's the same question — do I need a rain jacket tomorrow? — asked to both bots via Telegram:


Codex: Couldn't retrieve the forecast, hit rate limits on weather sites, and suggested checking a local app manually.

OpenClaw: Asked for location, got "Seattle," and returned a direct, actionable answer in under a minute.

The same question. Two very different answers.

The result: automated reporting that worked, but couldn't reliably parse and distill fresh information from the open web without considerable effort.

— —

The Return: OpenClaw and Extreme Productivity

Finally, after addressing those critical bugs and getting OpenClaw back online, the difference was immediate. OpenClaw's integrated web search, cron scheduling, and multi-channel delivery just worked. Within minutes of being back, I had my car deal alerts and geopolitical digests firing again. The contrast between the frustrating bug-fixing period and my current workflow is night and day.

Now I have two workflows to compare side by side — Codex and OpenClaw — and the comparison is illuminating. Codex is capable and helped me through a tough period. But OpenClaw's architecture — its ability to search the web natively, schedule tasks, deliver to Slack or Telegram, and operate as a true personal AI agent rather than just a code assistant — that's a different category entirely.

— —

Why OpenClaw Matters: The Linux Moment for AI

My instinctive preference for OpenClaw might partly be an underdog thing. But it's more than that. I genuinely believe OpenClaw represents an inflection point in democratizing the awesome power of AI.

As the HackerNoon article Is OpenClaw the Linux Moment for AI? argues, we may be witnessing something analogous to what Linux did for operating systems. Linux didn't just offer a free alternative — it fundamentally changed who could build, customize, and control their computing environment. OpenClaw is doing the same for AI agents.

You don't need a corporate subscription or a walled-garden platform to have a powerful AI assistant that monitors your interests, automates your workflows, and reaches you on whatever channel you prefer. You can run it yourself, on your own hardware, with your choice of models. That's not just convenient — it's transformative.

My journey, though challenging, has only deepened my conviction. The fact that I could check unreleased code, find workarounds, fix bugs, and get back to full productivity — all within the open-source ecosystem — is exactly the point. This is what democratized AI looks like. It's messy sometimes. But it's ours.

— —

Still haven't found that car deal, though. The search continues.



Wednesday, April 08, 2026

Engineering × AI

The Prompt That Crossed Two Organizations — And Got Sharper Each Time

How a product executive's pressure-testing framework traveled to systems engineering, and what happened when we pointed it at a real AWS workflow.

There's a quiet revolution happening in how smart teams use AI — and it has nothing to do with the model. It has everything to do with the instructions.

A few weeks ago I borrowed a prompt template from a product owner at a large enterprise. He had built it in ChatGPT to do something powerful: give his leadership team a private space to pressure-test ideas before they ever entered a room. When executives were developing roadmaps, he'd run them through the model first — surfacing assumptions, stress-testing the logic, anticipating the hard questions a CFO or GM might raise. The result was a win on both sides of the table. The CEO could arrive at conversations with a sharper, more fully-formed point of view. And the product manager got to execute against a plan that had already survived serious scrutiny — no half-baked pivots, no surprises mid-flight.

Think of it less as critique and more as a rehearsal room. The tool doesn't challenge people — it challenges ideas, privately, before the stakes are high.

I took the same framework and ran it in Claude — the model we use at Solo. It worked just as well. Which raises something worth sitting with: the framework didn't just travel across organizations and domains. It traveled across AI models entirely. That's the tell. When the same set of instructions produces sharp, useful output regardless of which model is running them, the instructions are the asset. The model is increasingly the commodity.

I read those instructions and thought: this exact mental model applies to systems architecture.

So I adapted them. Same seven-step skeleton — identify the thesis, stress-test the portfolio balance, expose assumptions, map risk concentration, name the opportunity costs, simulate the leadership challenge, propose alternative shapes. I changed the vocabulary and the lens. Instead of asking what will Finance push back on, I asked what will the CEO and engineering team challenge. Instead of scoring for revenue potential, I scored for cost and time savings. The goal shifted from sharpening an executive's boardroom instincts to sharpening an engineer's thinking in peer and leadership conversations.

Then I pointed it at something real: the workflow our team at Solo uses to create new EventBridge scheduled rules in AWS.

— —

What the workflow looked like

The process isn't complicated on its face. You take a YAML template, swap in five dynamic variables, insert the block into a CloudFormation file, open a PR, get infra review, run a change set in the AWS console, verify, execute, merge. Clear enough. Solid IaC discipline — no console drift, changes tracked in git, peer review required.

What happened next is what makes this worth writing about.

— —

What Claude found

Step 1 — Strategic thesis

The thesis was absent. Claude's first move was to name this: defensive / constraint-driven, no clear thesis. The process enforces the right principle — infrastructure as code over console changes — but it's encoded as a manual human checklist rather than a system with guardrails. The implied thesis should be "infrastructure changes are safe, auditable, and low-friction." Claude's verdict: it's only partially achieving that.

Step 2 — Portfolio balance

The portfolio is completely unbalanced. Claude broke the workflow down across four dimensions: IaC discipline (present), automation and tooling (absent), validation guardrails (absent), developer autonomy (weak). Its summary was blunt: all weight is on human process, zero investment in tooling that would make this self-service and safe simultaneously.

Step 3 — Assumption exposure

The assumptions don't hold up. Claude identified four things that must be true for this process to succeed — and challenged each one:

"Engineers will follow the checklist correctly every time." There's no validation that catches a malformed cron expression, wrong snake_case, or a misaligned variable before it hits CloudFormation.

"Infra team review is a meaningful gate." Unclear. If it's checking syntactic correctness, a linter does this faster and more reliably. If it's checking strategic intent, that's not documented anywhere.

"The change set review step prevents mistakes." Partially — but it requires the engineer to know what to look for. No checklist defines what "looks good" actually means.

"Console changes never happen." The last line of the process says never change the rule from the AWS Console — but there's no enforcement mechanism. That's policy, not a guardrail.

Step 4 — Risk concentration

The risk concentration is real. One engineer executing the process incorrectly causes a production scheduling outage or a silent missed execution. The template has six-plus Fn::ImportValue calls — a single upstream stack name change silently breaks every rule, with no cross-reference validation documented. And the process assumes the engineer simultaneously knows valid cron syntax, CloudFormation change set semantics, ECS task override structure, and rake task naming conventions. That's a high knowledge bar with no scaffolding.

The most underappreciated risk lives in the last line: "Never change the rule from the AWS Console." A critical constraint buried where engineers are least likely to retain it.

If someone misses it — especially under time pressure — they introduce configuration drift that CloudFormation won't detect on the next deploy. The instruction exists. The enforcement doesn't.

— —

The leadership challenge simulation

Claude predicted exactly the questions that would land hardest in an executive or peer review:

CEO / Engineering lead

"How many engineer-hours does this take per rule addition, end to end? What's the error rate? Have we had production scheduling failures from this process?"

Probably no clean answers. That's the problem.

Finance

"If a scheduled job silently fails to be created or runs at the wrong time, what's the business impact?"

The process has no alerting or confirmation that a newly added rule is actually firing. Execution success is not the same as operational correctness.

Engineering

"Why do I need infra team review for a cron job? Why can't I validate this locally? Why is there no test environment path?"

Claude's assessment: these are legitimate objections. The current process treats every engineer as a potential misconfigurer rather than building systems that make misconfiguration hard.

— —

Three alternative shapes

Claude proposed three different strategic directions, each with honest tradeoffs:

Option A — Recommended near-term

Script-first

Build a small CLI that prompts for inputs, validates cron syntax, generates the YAML block, and inserts it correctly into the file. This eliminates the entire class of template variable substitution errors and removes the knowledge burden from the engineer.

Pro: Eliminates the substitution error class entirely. Con: One-time build and maintenance investment.

Option B

Separate stack per rule

Instead of one monolithic YAML file, each rule gets its own CloudFormation stack. Engineers own their rule's lifecycle. Merge conflicts disappear. Blast radius is isolated.

Pro: Eliminates merge conflicts, isolates blast radius. Con: Stack proliferation requires naming discipline.

Option C — Longer horizon

Migrate to EventBridge Scheduler + CDK / Terraform

Replace CloudFormation-managed EventBridge Rules with the purpose-built newer service, managed through CDK or Terraform modules. Better DX, built-in retry policies, templated constructs reduce copy-paste risk significantly.

Pro: Better DX, built-in retries, less copy-paste risk. Con: Migration cost, team upskilling, short-term disruption.

Claude's bottom line: This process enforces the right principle with the wrong mechanism. The risk isn't that engineers are careless — it's that the process provides no structural resistance to errors. A linter, a generator script, and a validation step in CI would eliminate the majority of failure modes at low cost. The highest-leverage immediate action: a script that generates the YAML block from inputs and validates cron syntax before the PR is opened. Everything else can wait.
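
For flavor, here is a hypothetical sketch of that generator script, not our actual tooling; the template is trimmed to the schedule expression and the validation is deliberately minimal:

#!/usr/bin/env ruby
# Hypothetical generator sketch: prompt for inputs, sanity-check the cron
# expression, and print the YAML block to paste into the CloudFormation file.
require "erb"

print "Rule name (PascalCase): "
rule_name = gets.strip
print "EventBridge cron (6 fields): "
cron = gets.strip
print "Rake task: "
rake_task = gets.strip

# EventBridge cron expressions have six fields: minutes hours day-of-month month day-of-week year
abort "Invalid cron expression: expected 6 fields" unless cron.split.size == 6

TEMPLATE = <<~YAML
  <%= rule_name %>ScheduledRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: cron(<%= cron %>)
      # Targets, IAM role, and ECS task overrides omitted for brevity;
      # the real block would wire in the rake task: <%= rake_task %>
YAML

puts ERB.new(TEMPLATE).result(binding)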
— —

What this is really about

The genealogy of this critique is what I keep coming back to. A product owner at an enterprise company built a framework in ChatGPT to make product leaders sharper. I adapted it for Claude to make systems engineers sharper. The seven-step skeleton traveled across two organizations, two domains, two AI models, and two completely different problems — and produced something genuinely useful every time.

That last part matters more than it might seem. We're entering a moment where the major AI models are converging in capability. The choice between them is increasingly a matter of workflow preference, not raw power. What doesn't transfer automatically — what has to be deliberately designed — is how you instruct them. The same prompt that works in ChatGPT works in Claude. The same framework that sharpens a product roadmap sharpens an engineering workflow. The instructions are the portable, reusable, compounding asset. The model is the infrastructure underneath.

We spend a lot of time evaluating which AI model to use and almost no time designing how we instruct it. The difference between an AI that validates your thinking and one that challenges it isn't the model version — it's the instruction set. One framing decision, encoded in a project's system prompt, shifts the output from agreeable to adversarial, from a mirror to a pressure test.

The insight the enterprise product owner had — that you can force structured, sequential reasoning by encoding a multi-step framework as the operating instruction — turns out to be domain-agnostic and model-agnostic. The same architecture works on roadmaps, on engineering workflows, on financial models, on hiring processes. You change the vocabulary. The sharpness is the point.

The prompt that lives in our Claude project now means any engineer can walk in with a workflow, a design doc, or an architectural decision and get back something that will make them think harder — not feel better.

That's the unlock. And it cost nothing but the willingness to borrow a smart idea from someone doing a completely different job, on a completely different platform, solving a completely different problem.

The best prompts, it turns out, travel well.

Want to build your own pressure-testing project? The pattern is straightforward: pick an adversarial advisor persona, write a multi-step reasoning framework that forces each analytical lens to run in sequence, and explicitly ban validation as a default behavior. It works in Claude. It works in ChatGPT. The framework above has seven steps — but what makes it work isn't the number, and it isn't the model. It's the instruction not to let weak reasoning slide.
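
A condensed sketch of such a project prompt, reusing the seven steps named earlier (wording illustrative):

You are an adversarial reviewer. Do not validate by default. For any plan, workflow, or design I give you, work through these steps in order, without skipping ahead:

  1. Identify the strategic thesis, or state that one is absent.
  2. Stress-test the portfolio balance across its key dimensions.
  3. Expose the assumptions that must be true, and challenge each one.
  4. Map where risk is concentrated.
  5. Name the opportunity costs.
  6. Simulate the hardest questions leadership and peers will ask.
  7. Propose two or three alternative shapes, with honest tradeoffs.

End with a single highest-leverage recommendation.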

Solo  ·  Engineering & AI  ·  2025

Thursday, March 26, 2026

Infrastructure · Redis · Sidekiq

How We Migrated Sidekiq's Redis Without Losing a Single Job (and Without Listening to AI)

Solo Engineering Team · March 2026 · 8 min read

We moved our Sidekiq backend from Redis Enterprise to AWS ElastiCache. The AI tools recommended a careful, expensive approach. We did something simpler — and it worked perfectly.

The Setup

Our app runs Sidekiq workers on ECS. Each process connects to Redis on startup to read and process jobs. We were moving from Redis Enterprise to ElastiCache — different host, different connection string, same protocol.

New jobs would start going to the new Redis as soon as we deployed. But existing jobs queued in the old Redis? They'd be orphaned the moment every worker switched over.

What the AI Tools Said

We asked around — Claude, ChatGPT, Gemini, Grok. They all landed in roughly the same place:

You should deploy a separate environment connected to the old Redis. Let it drain the queue over time, then decommission.

It's not wrong. But it's heavy. That approach meant new ECS task definitions, environment variable management across two sets of infra, coordinating the decommission, and extra cost while two clusters run in parallel.

When we pushed back, one tool offered an alternative: run two Sidekiq processes per Docker container — one pointed at old Redis, one at new. That would have required changes to CloudFormation templates, process supervision config inside the container, and careful cleanup afterward. Trading one complex migration for another.

But they missed something important: Sidekiq's backing store is completely external to the process. A job scheduled on Redis Enterprise doesn't belong to any particular Sidekiq process — it just sits there until a worker with a connection to that Redis comes along. The worker is stateless.

So the "debugging nightmare" scenario the AI tools described... wouldn't actually happen.

The Actual Solution

Our team came up with something much simpler. In config/initializers/sidekiq.rb, at startup, each Sidekiq process decides which Redis to connect to. We added one line:

config/initializers/sidekiq.rb — the one-liner
# Coin toss at startup — connects this process to one Redis for its entire lifetime
redis_url = rand < 0.5 ? LYMO_SIDEKIQ_NEW_REDIS_URL : LYMO_SIDEKIQ_OLD_REDIS_URL

That's it. On startup, each worker tosses a coin. Heads → new ElastiCache. Tails → old Redis Enterprise.

The result: roughly half the cluster continued draining the old queue, while the other half processed new jobs on ElastiCache. No new infra. No task definition changes. No separate environment to coordinate.

We also pointed all job producers (the code that enqueues jobs) at the new Redis immediately. So new work only ever went to ElastiCache. The old Redis just needed to drain.

This is where Sidekiq's initializer structure becomes the key enabler: configure_server and configure_client can be wired separately, with the server (the side that reads jobs) using the redis_url resolved at startup:

config/initializers/sidekiq.rb — full initializer
redis_url = rand < 0.5 ? LYMO_SIDEKIQ_NEW_REDIS_URL : LYMO_SIDEKIQ_OLD_REDIS_URL

Sidekiq.configure_server do |config|
  # Server side (the job consumer) reads from whichever Redis the coin toss picked
  config.redis = { url: redis_url }
end

Sidekiq.configure_client do |config|
  # Client side (the job producer) always enqueues to the new ElastiCache Redis
  config.redis = { url: LYMO_SIDEKIQ_NEW_REDIS_URL }
end

One coin toss. One URL to pull from. That process reads from the same Redis for its entire lifetime.

The clients (which push jobs) always use the new URL, while reads are split between the old and new Redis. Over time, the old queue drains because it receives no new jobs. The old-Redis processes were naturally left behind to drain, and as they cycled out, the cluster fully converged on the new setup with no intervention required.

How It Went

It worked exactly as expected. Within a day, roughly 90% of the old queue had drained naturally. Workers reading from old Redis gradually found less and less work, while ElastiCache handled all the new throughput.

The remaining jobs were a different story: scheduled jobs. These live in Sidekiq's sorted set and don't get picked up until their execution time arrives — which could be hours away. Waiting wasn't ideal, so we wrote a small script to move them from the old Redis to the new one manually. A few lines to iterate the scheduled (and retry) set, re-enqueue on ElastiCache, and delete from old Redis. Clean cutover.
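
The script was essentially a sorted-set copy. A sketch of its shape (not the exact script we ran), using Sidekiq's schedule and retry set names and preserving each job's run-at score:

require "redis"

old_redis = Redis.new(url: LYMO_SIDEKIQ_OLD_REDIS_URL)
new_redis = Redis.new(url: LYMO_SIDEKIQ_NEW_REDIS_URL)

# Sidekiq keeps scheduled jobs in the "schedule" sorted set and retries in "retry";
# the score is the epoch time at which the job should run.
%w[schedule retry].each do |set_name|
  old_redis.zscan_each(set_name) do |payload, score|
    new_redis.zadd(set_name, score, payload)  # re-enqueue on ElastiCache with the same run-at time
    old_redis.zrem(set_name, payload)         # then remove it from the old Redis
  end
end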

Once that was done, we deployed the cleanup — removed the conditional and all references to the old Redis connection. Four lines of code deleted. Done.

Oh, and while all of this was happening? The rest of the team made a dozen normal deployments — which restarted servers, reshuffled which Redis each process landed on, and generally did everything the AI tools said would cause a debugging nightmare. Nothing broke. No jobs lost. The doom and gloom theories were disproven in the most practical way possible: by live testing.

Why the AI Advice Missed the Mark

The AI tools were technically cautious but operationally naive. They modeled the problem as "jobs are tied to a running process" — which isn't how Sidekiq works. Redis is the source of truth, not the worker. The worker is stateless.

They also defaulted to the safest, most conservative architecture: full environment isolation. That's sensible for high-stakes migrations. But for a queue drain, it's significant overengineering.

The human insight — the DB is external, the workers are stateless, so we can split them probabilistically — is the kind of lateral thinking that comes from actually understanding the system rather than pattern-matching to a template.

— ✦ —

Takeaways

  1. Sidekiq workers are stateless. Redis is the state. This gives you more migration flexibility than you might think.
  2. Probabilistic splits are underrated. You don't always need clean cutoffs. A coin toss at startup is simple, observable, and reversible.
  3. AI tools are good at safe answers, not always good at efficient ones. They'll often recommend the conservative solution even when a simpler one exists. Treat their output as a starting point, not a final answer.
  4. The cleanup should be as simple as the migration. If your migration leaves behind complex infra, you've done too much. Ours cleaned up with four deleted lines.

Redis · Sidekiq · AWS · ElastiCache · Migration · Ruby · ECS