Infrastructure · AWS · Post-Mortem
AWS Deleted Our Production Database
At 11:16 PM on a Monday, AWS Marketplace silently destroyed the Redis cluster powering driver geolocation for our entire gig-worker platform. We had been paying customers for five years. Nobody called.
What Happened
We had recently moved our Redis workload from an annual Redis Enterprise contract to a pay-as-you-go subscription: cheaper, more flexible, and billable through AWS for unified vendor management. Redis structured the deal as a private offer through AWS Marketplace, with a 14-day trial meant to convert into a paid plan.
It didn't convert. Nobody renewed it. And when the trial flag flipped, AWS did not send an escalation. AWS did not pause the service. AWS deleted the database.
Read that again. We did not miss a payment. We are not a free-tier hobby account. We have been writing AWS checks for five years.
Why This Is the Wrong Default
There is a galaxy of difference between suspending service and deleting customer data. One is a billing action. The other is irreversible destruction of property. The fact that AWS chose the latter as the default behavior for a Marketplace trial — with no human-in-the-loop check on whether the account underneath was a paying customer running production workloads — is not a policy. It is a failure of judgment dressed up as automation.
What Was Almost Lost
What makes this gutting is what was on the cluster. The database AWS deleted was the culmination of a three-month migration. From mid-February through early May, our team had:
- Built out the new Redis infrastructure from scratch
- Upgraded three live Sidekiq queues without downtime
- Done staged traffic cutovers — 10%, then 50%, then full
- Set up VPC peering to the new subscription
- Modernized deprecated GEORADIUS calls to ZRANGE + GEOPOS
- Batched GEOPOS calls, tuned SCAN counts, swapped SMEMBERS for SSCAN and DEL for UNLINK
- Added per-call jitter to TTLs to prevent mass key expiry
- Stood up Redis alarms and a dedicated geo-timeseries health check endpoint
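One of the smaller items above, per-call TTL jitter, is worth a sketch because it is such a cheap guard against mass key expiry. This is a hypothetical illustration, not our actual write path; the constants and key name are assumptions:

```ruby
# Sketch of the TTL-jitter idea: if every geo key gets the same TTL,
# keys written in a burst all expire in the same second, hammering
# Redis and the fallback path at once. A random per-call offset
# spreads that expiry over a window instead.
BASE_TTL_SECONDS = 15 * 60  # assumed base TTL for driver geo keys
JITTER_SECONDS   = 2 * 60   # assumed jitter window

def jittered_ttl(base: BASE_TTL_SECONDS, jitter: JITTER_SECONDS)
  # rand(0..jitter) picks a uniform integer offset, so a burst of
  # writes expires over a ~2 minute window rather than all at once
  base + rand(0..jitter)
end

# In a write path this would be used roughly like:
#   redis.set("driver:#{id}:pos", payload, ex: jittered_ttl)
```

The same principle applies to any cache with correlated write times, not just geo data.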
Dozens of PRs. Five engineers. Three months of careful, staged work. All of it sitting on a cluster that AWS quietly destroyed because a flag flipped from green to red.
What Our Team Got Right
The detection was fast. Honeybadger flagged the incident in about 15 minutes. We were on the bridge before midnight, and we started rebuilding within 27 minutes of deletion.
The detail I'm most proud of: the geo-timeseries uptime monitor — which fails if no driver in our entire fleet has reported a position in the trailing hour — never tripped during the nine-hour recovery. Yes, there were windows of degradation. About 10,000 worker jobs that depended on a geo-hash lookup landed in our dead-jobs queue and had to be reprocessed. But across nine hours of fighting back from a destroyed database, the platform never went fully dark. Drivers kept being tracked. The system our team had built absorbed the blow.
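The liveness rule that monitor encodes is simple enough to sketch. This is a minimal, hypothetical version of the check, assuming the monitor can fetch the timestamp of the newest driver position report from wherever it is stored:

```ruby
# Fleet-wide liveness rule: the check fails only if *no* driver in the
# entire fleet has reported a position in the trailing hour. Degraded
# throughput passes; total darkness does not.
TRAILING_WINDOW_SECONDS = 60 * 60

def fleet_alive?(latest_report_at, now: Time.now)
  return false if latest_report_at.nil?  # no reports at all ever
  (now - latest_report_at) <= TRAILING_WINDOW_SECONDS
end
```

If positions live in a Redis sorted set scored by report time, `latest_report_at` could plausibly come from the highest-scored member (e.g. `ZRANGE key -1 -1 WITHSCORES`), though the storage shape here is an assumption.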
After reprocessing and a backfill, we confirmed zero data loss.
What Our Team Got Wrong
The recovery still took nine hours and fifteen minutes. That number is on us.
- No Infrastructure as Code for this Redis cluster
- No runbook for rebuilding and re-wiring it
- No alert for an expiring Marketplace trial or expiring credit card tied to a critical subscription
- No formal ownership over checking billing validity of production vendors
Once you cross a certain product maturity, "someone will remember" stops being a strategy. We're fixing all of this: IaC for the cluster, alarms for expiring trials, cards, and subscriptions, and clear ownership for vendor billing health.
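The "alarm for expiring trials and cards" fix can be as unglamorous as a scheduled job walking a registry of production vendor subscriptions. A hypothetical sketch, where the registry entries, dates, and warning window are all invented for illustration:

```ruby
require "date"

# Scheduled billing-health check: flag any production vendor
# subscription expiring within a warning window, so a named owner
# gets paged well before a trial or card quietly lapses.
WARN_WITHIN_DAYS = 14  # assumed warning window

# Example registry entries -- placeholder vendors and dates, not real config
SUBSCRIPTIONS = [
  { vendor: "Redis (AWS Marketplace)", expires_on: Date.new(2024, 5, 20) },
  { vendor: "Honeybadger",             expires_on: Date.new(2025, 1, 1) },
].freeze

def expiring_soon(subs, today: Date.today, window: WARN_WITHIN_DAYS)
  subs.select { |s| (s[:expires_on] - today).to_i <= window }
end

# A daily cron job would page the owner for each flagged entry, e.g.:
#   expiring_soon(SUBSCRIPTIONS).each { |s| page_owner(s) }
```

The point is not the code but the ownership: the job exists, it runs daily, and a specific person is on the other end of the page.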
What Engineering Leaders Should Take From This
1. Check your AWS Marketplace private offers today. If you run a serious workload behind one, look up the expiration date. Don't assume safety rails exist between an expired trial and destroyed production data. They don't.
2. Suspension and deletion are not the same thing. Any vendor that deletes customer data as the default action on a billing event — without escalation, without a grace period, without a phone call — has made a profound design error. Push back when you see this in vendor contracts.
3. IaC and runbooks are insurance, not overhead. We got lucky that our monitoring held. We shouldn't have needed the luck. If your critical infrastructure has no runbook for rebuilding from zero, write one this week.
4. Billing health needs an owner. Not "ops knows." Not "finance handles it." Someone specific, with an alert, who checks it. A production database destroyed by a billing flag is not a billing problem — it is an engineering problem.
And if you're at AWS: a paying customer's production database is not an expiry-date checkbox. It deserves a phone call.