AWS Deleted Our Production Database

Infrastructure · AWS · Post-Mortem

Engineering Team · May 2025 · 7 min read

At 11:16 PM on a Monday, AWS Marketplace silently destroyed the Redis cluster powering driver geolocation for our entire gig-worker platform. We had been paying customers for five years. Nobody called.

What Happened

We had recently moved our Redis workload from an annual Redis Enterprise contract to a pay-as-you-go subscription — cheaper, more flexible, and billable through AWS for unified vendor management. Redis extended us a private offer through AWS Marketplace, structured with a 14-day trial meant to convert into a paid plan.

It didn't convert. Nobody renewed it. And when the trial flag flipped, AWS did not send an escalation. AWS did not pause the service.

What AWS did instead
AWS deleted the database. Not suspended. Not paused with a 72-hour grace window. Deleted. A long-standing, paying enterprise customer's production database was destroyed because a trial-conversion checkbox on a Marketplace listing was not flipped on time.

Read that again. We did not miss a payment. We are not a free-tier hobby account. We have been writing AWS checks for five years.

Why This Is the Wrong Default

There is a galaxy of difference between suspending service and deleting customer data. One is a billing action. The other is irreversible destruction of property. The fact that AWS chose the latter as the default behavior for a Marketplace trial — with no human-in-the-loop check on whether the account underneath was a paying customer running production workloads — is not a policy. It is a failure of judgment dressed up as automation.

A single email — "Your subscription expires in 72 hours. Your data will be permanently deleted on conversion failure." — would have prevented this entirely. We never received one.

What Was Almost Lost

What makes this gutting is what was on the cluster. The database AWS deleted was the culmination of a three-month migration. From mid-February through early May, our team had:

  • Built out the new Redis infrastructure from scratch
  • Upgraded three live Sidekiq queues without downtime
  • Ran staged traffic cutovers — 10%, then 50%, then full
  • Set up VPC peering to the new subscription
  • Replaced deprecated GEORADIUS calls with ZRANGE + GEOPOS (see the sketch after this list)
  • Batched GEOPOS calls, tuned SCAN counts, swapped SMEMBERS for SSCAN and DEL for UNLINK
  • Added per-call jitter to TTLs to prevent mass key expiry
  • Stood up Redis alarms and a dedicated geo-timeseries health check endpoint
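
For the curious, here is a minimal sketch of the lookup and TTL patterns from that list, in Python with redis-py. Our production code runs in Ruby under Sidekiq, and the key name, radius, page size, and jitter factor here are illustrative assumptions, not our actual values:

    import math
    import random
    import redis

    r = redis.Redis(decode_responses=True)
    GEO_KEY = "geo:drivers"  # hypothetical name for the driver geo set

    def haversine_km(lon1, lat1, lon2, lat2):
        # Great-circle distance in kilometres.
        lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(a))

    def drivers_within(lon, lat, radius_km, page=500):
        # GEORADIUS replacement: a geo set is a sorted set underneath, so
        # ZRANGE can page its members; GEOPOS then resolves coordinates in
        # one batched call per page, and the radius filter runs client-side.
        hits, start = [], 0
        while True:
            members = r.zrange(GEO_KEY, start, start + page - 1)
            if not members:
                return hits
            for member, pos in zip(members, r.geopos(GEO_KEY, *members)):
                if pos and haversine_km(lon, lat, pos[0], pos[1]) <= radius_km:
                    hits.append(member)
            start += page

    def set_with_jitter(key, value, base_ttl_s=3600, jitter=0.10):
        # Spread TTLs by up to +/-10% so keys written in the same burst
        # do not all expire in the same burst.
        ttl = int(base_ttl_s * (1 + random.uniform(-jitter, jitter)))
        r.set(key, value, ex=ttl)

A production version would narrow the candidate window by geohash score range rather than paging the whole set, and would swap SMEMBERS for sscan_iter and DEL for unlink in the same spirit; the sketch shows the shape of the batching, not the scoring math.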

Dozens of PRs. Five engineers. Three months of careful, staged work. All of it sitting on a cluster that AWS quietly destroyed because a flag flipped from green to red.

What Our Team Got Right

  • Time to detect: 15m
  • Time to start rebuild: 27m
  • Data loss: 0
  • Total recovery time: 9h 15m

The detection was fast. Honeybadger flagged the incident in about 15 minutes. We were on the bridge before midnight, and we started rebuilding within 27 minutes of deletion.

The detail I'm most proud of: the geo-timeseries uptime monitor — which fails if no driver in our entire fleet has reported a position in the trailing hour — never tripped during the nine-hour recovery. Yes, there were windows of degradation. About 10,000 worker jobs that depended on a geo-hash lookup landed in our dead-jobs queue and had to be reprocessed. But across nine hours of fighting back from a destroyed database, the platform never went fully dark. Drivers kept being tracked. The system our team had built absorbed the blow.
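
For illustration, here is roughly what that predicate looks like in Python with redis-py. The real monitor reads our dedicated geo-timeseries; the freshness-marker key and helper names below are hypothetical stand-ins:

    import time
    import redis

    r = redis.Redis(decode_responses=True)
    GEO_KEY = "geo:drivers"        # as in the sketch above
    LAST_FIX_KEY = "geo:last_fix"  # hypothetical fleet-wide freshness marker

    def record_position(driver_id, lon, lat):
        # Every position write also bumps the freshness marker.
        pipe = r.pipeline(transaction=False)
        pipe.geoadd(GEO_KEY, (lon, lat, driver_id))
        pipe.set(LAST_FIX_KEY, int(time.time()))
        pipe.execute()

    def geo_timeseries_healthy(max_age_s=3600):
        # Trips only if *no* driver in the fleet has reported a position
        # in the trailing hour.
        ts = r.get(LAST_FIX_KEY)
        return ts is not None and time.time() - int(ts) <= max_age_s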

After reprocessing and a backfill, we confirmed zero data loss.

What Our Team Got Wrong

The recovery still took nine hours and fifteen minutes. That number is on us.

  • No Infrastructure as Code for this Redis cluster
  • No runbook for rebuilding and re-wiring it
  • No alert for an expiring Marketplace trial or expiring credit card tied to a critical subscription
  • No formal ownership over checking billing validity of production vendors

Once you cross a certain product maturity, "someone will remember" stops being a strategy. We're fixing all of this: IaC for the cluster, alarms for expiring trials and cards and subscriptions, and clear ownership for vendor billing health.
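
As a sketch of what that alarm can look like: there is no buyer-side AWS API we depend on here, just a registry someone owns and a daily scheduler. The vendor entries, dates, and paging hook below are all placeholders:

    from datetime import date

    # Placeholder registry; the owning engineer keeps this current.
    SUBSCRIPTIONS = [
        ("Redis (AWS Marketplace private offer)", "driver geolocation cluster",
         date(2026, 5, 1)),  # illustrative date
    ]

    def check_vendor_billing(warn_days=30):
        # Run daily from cron or a scheduler; page long before anything flips.
        for vendor, workload, expires in SUBSCRIPTIONS:
            days_left = (expires - date.today()).days
            if days_left <= warn_days:
                page_owner(f"{vendor} ({workload}) expires in {days_left} days")

    def page_owner(message):
        print("PAGE:", message)  # stand-in for PagerDuty, Slack, etc.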

· · ·

What Engineering Leaders Should Take From This

  1. Check your AWS Marketplace private offers today. If you run a serious workload behind one, look up the expiration date. Don't assume safety rails exist between "trial expired" and "production data destroyed." They don't.
  2. Suspension and deletion are not the same thing. Any vendor that deletes customer data as the default action on a billing event — without escalation, without a grace period, without a phone call — has made a profound design error. Push back when you see this in vendor contracts.
  3. IaC and runbooks are insurance, not overhead. We got lucky that our monitoring held. We shouldn't have needed the luck. If your critical infrastructure has no runbook for rebuilding from zero, write one this week.
  4. Billing health needs an owner. Not "ops knows." Not "finance handles it." Someone specific, with an alert, who checks it. A production database destroyed by a billing flag is not a billing problem — it is an engineering problem.

And if you're at AWS: a paying customer's production database is not an expiry-date checkbox. It deserves a phone call.

Engineering Team · May 2025
AWS · Redis · Post-Mortem · Infrastructure · Incident Response · AWS Marketplace
