Sunday, October 26, 2025

When AWS Went Down, Our Users Didn’t Lose Their Miles

On Oct 20, 2025 UTC, AWS experienced a significant regional service disruption that affected several of our core components — specifically the EventBridge scheduling layer that powers our mileage pipeline.

For several hours, our pipeline couldn’t persist new trip events from Redis TimeSeries to the database. This temporarily paused mileage calculations, leaving mileage summaries in the app out of date.

But here’s what didn’t happen:
We didn’t lose a single record, and no user lost a mile.


How Our System Works

Our pipeline captures anonymized telemetry events from users’ mobile devices, processes them through AWS infrastructure, and stores aggregated trip summaries in a relational database.

Data Flow Diagram
Data Flow: Mobile App → Ingestion API → Redis TimeSeries → EventBridge Scheduler → RDS

Event flow:

  • Mobile App captures GPS telemetry and trip start/stop events.
  • Ingestion API authenticates and sanitizes the data before writing it to Redis TimeSeries (a sketch of this write path follows the list).
  • Redis TimeSeries stores short-term data points with fine-grained timestamps for quick replay (a background job backs up the stream to S3 for longer-term storage).
  • EventBridge Scheduler triggers aggregation and processing jobs every few minutes.
  • RDS stores validated, aggregated trip records for long-term analytics and reporting.
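For concreteness, here is roughly what the write from the Ingestion API into Redis TimeSeries looks like. This is a minimal sketch, not our production code: the key naming, retention, and the idea of one series per user per signal are illustrative assumptions, and it assumes the redis gem talking to a server with the time series module loaded.

require "redis"

redis = Redis.new(url: ENV["REDIS_URL"])

# Illustrative: one time series per user per signal, keyed by user id.
# TS.ADD appends a (timestamp, value) sample; RETENTION bounds how long raw
# points stay in Redis before the S3 backup is the only copy.
def record_speed_sample(redis, user_id:, timestamp_ms:, speed_mph:)
  redis.call(
    "TS.ADD", "telemetry:speed:#{user_id}",
    timestamp_ms, speed_mph,
    "RETENTION", 86_400_000,      # keep ~24h of raw samples in Redis
    "ON_DUPLICATE", "LAST"        # tolerate client retries
  )
end

record_speed_sample(redis, user_id: 42,
                    timestamp_ms: (Time.now.to_f * 1000).to_i, speed_mph: 31.5)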

What Happened During the Outage

When AWS degraded services in our primary region (us-east-1), the EventBridge Scheduler stopped firing events from 6:50 to 10:07 UTC, halting our pipeline mid-stream. We were still capturing and ingesting user geo data, but the Redis TimeSeries was not being processed because the aggregation job did not get scheduled for a little over three hours.

Remarkably, the Redis TimeSeries held up. Our Redis deployment is on Redis Enterprise Cloud, but the instances are hosted in AWS. Even though Redis Enterprise noted that some customers would be impacted, we did not see significant degradation.


Understanding the AWS architecture helps explain why this was the case.

AWS separates each service into a control plane and a data plane. The control plane is responsible for resource monitoring, allocation, and scaling; once resources have been allocated, the data plane takes over. The data plane is designed to be more reliable and resilient, while the control plane is the part that is more prone to failure during an outage.

Here it is from the horse's mouth:

Since we provision our Redis cluster for expected usage, manually adding more nodes and memory as our volumes increase, we were not relying on the AWS control plane for scaling, and our instances continued to hum. We saw the same with RDS: it is customary to provision RDS for your needs and perform upgrades manually as traffic increases.

This was not the case for our web and job servers, which were configured with auto-scaling rules. We had set a lower limit on the number of machines for each cluster, and we ran hard on that reduced capacity until recovery.

When services recovered, we started processing events from the TimeSeries, creating trips for users. But since we only generate incremental trips for the last few minutes, we were still missing trips for the prior 3 hours and 7 minutes.

We could easily tell how many trips we missed, as we track this closely with a CloudWatch metric. Each bar shows a completion of the job that is responsible for incrementally processing the time series.
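For context, each of those completion data points is a single call to the CloudWatch API. Here is a minimal sketch using the aws-sdk-cloudwatch gem; the namespace and metric name are placeholders rather than our real ones.

require "aws-sdk-cloudwatch"

# Emit one data point each time the incremental processing job finishes,
# so missed windows show up as gaps (missing bars) on the CloudWatch graph.
def report_window_complete(window_end_utc)
  Aws::CloudWatch::Client.new.put_metric_data(
    namespace: "MileagePipeline",            # placeholder namespace
    metric_data: [{
      metric_name: "TripWindowCompleted",    # placeholder metric name
      timestamp:   window_end_utc,
      value:       1,
      unit:        "Count"
    }]
  )
end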

  

When services recovered, EventBridge Scheduler fired all the events in the backlog.



This caused a different problem, as our trips processor was designed to handle the time series data in real time. We did not anticipate receiving more than a single event during a ten-minute window. So we got 21 late-fired triggers but could effectively process just one, covering the last ten minutes. More on this later!

The critical task was to update the user data for the missing three hours. I had a script for patching trips that I had written for an earlier, less severe outage (about 5 minutes). With some minor modification to account for the partial data towards the tail, I was able to correct mileage for all our users who happened to be driving during the outage (luckily they were not in self-driving cars powered by AWS. Ok, bad joke).

There was still something I couldn't explain: CloudWatch told me ~2,000 jobs completed after events started flowing again. I expected 21 jobs, so I was puzzled by the much larger volume that ran at the tail. What accounted for that, and would those jobs cause a different type of miscalculation? Indeed, some interesting things did take place with those 21 EventBridge triggers. Let me explain.

When a trigger fires on the tenth minute, we launch a job for each user who has likely been driving recently. These jobs run concurrently, and we need to track the completion of the last job to know that all users have been processed and the window can be marked "complete".

This is done with a Redis set that keeps track of users who are still being processed. When the trigger fires, it first determines all recent drivers and adds them to the set before spawning a worker per user. Each worker then removes its element from the set, and if it removed the last item, it signals the completion of the run.
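In rough Ruby, the bookkeeping looks like this. It is a sketch, not the actual job code: TripWorker and mark_window_complete are stand-ins for our real worker class and completion logic.

# Scheduler side: seed the set with everyone who drove recently, then fan out.
def start_window(redis, window_key, recent_driver_ids)
  redis.sadd(window_key, recent_driver_ids)
  recent_driver_ids.each { |uid| TripWorker.perform_async(window_key, uid) }
end

# Worker side: remove yourself from the set; whoever drains it marks the
# window complete.
def finish_user(redis, window_key, user_id)
  redis.srem(window_key, user_id)
  mark_window_complete(window_key) if redis.scard(window_key).zero?
end

Note that a separate SREM followed by an SCARD check is exactly the kind of pattern that misbehaves when several triggers seed and drain the same set concurrently, which is what happened next.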


When 21 triggers fired in rapid succession, they all forked a job per user, resulting in many workers racing to compute the same job, and many workers hitting an empty set. And of course this meant that these jobs "up-counted" miles for the drivers in that time window.

So the last data cleanup was to figure out where we had added extra miles to users towards the end of the outage. I first thought this might be really hard, as we had already updated these records, which then kept getting updated further as the users kept driving. But fortunately, we update both start and end times for each trip in the record, so it was possible to compute the miles driven in this specific range for each user from the raw time series data.
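Conceptually, the recomputation pulls the raw samples for just the affected range and re-derives distance from them. A hedged sketch (the key naming matches the earlier example, and the speed-times-elapsed-time math is a simplification of the real script):

# Re-derive miles for one user between two epoch-millisecond timestamps,
# straight from the raw Redis TimeSeries samples.
def miles_in_range(redis, user_id, from_ms, to_ms)
  samples = redis.call("TS.RANGE", "telemetry:speed:#{user_id}", from_ms, to_ms)
  samples.each_cons(2).sum do |(t0, v0), (t1, _v1)|
    hours = (t1 - t0) / 3_600_000.0
    v0.to_f * hours                  # speed (mph) * elapsed hours ~= miles
  end
end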

To verify that we had been up-counting, I queried for records in descending order of speed. I saw speeds of 600 mph, which confirmed the hypothesis quite fast (no pun intended).
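The check itself was a one-liner. Something along these lines, where the Trip model and column names are illustrative:

# Any trip with a physically implausible average speed points to up-counting.
Trip.where(ended_at: outage_start..recovery_end)
    .order(speed: :desc)
    .limit(20)
    .pluck(:user_id, :speed, :miles)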

I could re-use a method from the earlier script for patching the data, and write a bit of code for the update. So finally, after a very long day, our users' data was fully corrected.

Improvements made:

We are making improvements to how we handle the tail of an outage going forward. The idea is to not let more than a single processor run in a given time window. This can be done with a simple Redis SET command with the "NX" option, which sets a flag (lock) only if it is not already set, thus guaranteeing that only a single process can acquire the lock. We set the TTL below the time window (7 minutes in this case) so that the lock naturally expires before the next trigger.
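With redis-rb this is a single call. A minimal sketch; the key name and the exact TTL are illustrative:

# Only the first trigger in a window acquires the lock; late or duplicate
# triggers get `false` back and exit without spawning workers.
def acquire_window_lock?(redis, window_start)
  redis.set("trips:window-lock:#{window_start.to_i}", "1", nx: true, ex: 7 * 60)
end

def handle_trigger(redis, window_start)
  return unless acquire_window_lock?(redis, window_start)
  # seed the user set and fan out workers for this window
end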

Our Approach: Fairness and Fidelity

Our principle is simple: if you drove it, we count it.

We don’t approximate, and we don’t drop data due to transient infrastructure issues. Each pipeline component is designed for eventual consistency and idempotent replay, so every record can be reconstructed safely and accurately.
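One common way to get idempotent replay (a sketch of the general pattern, not necessarily how each of our jobs implements it) is to upsert aggregates on a natural key instead of inserting or incrementing blindly, so reprocessing the same window overwrites the same row. Here is what that looks like with Rails' upsert, assuming a hypothetical trips table with a unique index on (user_id, window_start):

# Replaying a window twice rewrites the same row instead of double-counting.
Trip.upsert(
  {
    user_id:      user_id,
    window_start: window_start,
    window_end:   window_end,
    miles:        miles
  },
  unique_by: %i[user_id window_start]
)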


What We’re Building Next

Resilience isn’t just uptime — it’s graceful recovery. We’re implementing several next steps to strengthen this architecture:

  • Buffer device data: As users go through areas with low mobile reception, we want to buffer the location data and deliver it when reception improves.
  • Adjust inaccurate device signal: Use techniques like Kalman filtering to adjust the location for high fidelity when the device accuracy is low.
  • De-couple RDS from real-time updates: We will store running trips in Redis, with archiving to take place later. This makes us resilient in the event that RDS is unresponsive, as we only need it at the later archiving step.
  • Monitoring for anomalies: Add speed as a tracked metric and alert over 200 mph.
  • Chaos testing & fault injection: Monthly simulated outages to validate that our recovery flow remains reliable.

What It Means for Our Users

When something breaks in the cloud, we don’t panic — we verify, replay, and reconcile.
Because behind every data point is a real person trusting us to get it right.

Outages happen, but trust shouldn’t. And that’s what we’re building for.


Posted by Thushara Wijeratna, Head of Eng at WorkSolo

Thursday, October 16, 2025

Rails 8: can't cast RSpec::Mocks::Double

 One of the first unit test failures I encountered on a Rails 8.0 upgrade was:

    can't cast RSpec::Mocks::Double

The error happens on saving an ActiveRecord object to the database.


# connected :boolean default(FALSE)
# last_connected_at :datetime

class LymoAccount < ApplicationRecord
  after_commit :update_last_connected_at, if: :saved_change_to_connected?

  def update_last_connected_at
    update!(last_connected_at: Time.current) if connected?
  end
end

Turns out, this is related to stricter validation in Active Record, which refuses to save a mock object.

Generally speaking, you should be using FactoryBot methods to create real ActiveRecord objects from your unit tests. And we were. So it puzzled me why we would get this error, as it did not seem like we were storing anything that was a mock.

ChatGPT got pretty confused as well. Because the exception was thrown from an after_commit hook, its assumption was that there were attributes already set on the model that were being re-serialized, and that this was causing the issue.

We went through a listing of all the attributes of the record, examining their type (class) and none of them was a mock. 

This was the point when I gave up on Gen AI and took another look at the model.

I quickly ruled out anything to do with the connected attribute we were checking by updating the test to save without the check. It didn't help, so I knew that the update itself was throwing.

Then I wondered if updating any column would throw the same error.

I updated a text column, and it worked - progress. Now I knew this might have something to do with either:

1. the specific column - maybe it has some validation that is behaving weird

2. the column type - maybe datetime columns behave differently with updates

Luckily there were a few other datetime columns, and when I updated any one of them with Time.current, they threw the same error. We were getting somewhere.

On a hunch that this had to do with Time.current, I took a look at the unit test to see if Time was being mocked. Normally we use Timecop as the best practice, but it was always possible that an engineer had decided to be "creative".

This is what I found on the unit test:


  let(:data_sync_at) { Time.current.to_i }
  before do
    allow(Time).to receive_message_chain(:current, :to_i).and_return(data_sync_at)
  end
So what this does is stub the intermediate Time.current call to return a double, and it is not that obvious! So when the model used Time.current to update the record, Active Record was handed a mock object and failed.
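To make it concrete, here is roughly what that stub does to Time.current (the integer is illustrative):

allow(Time).to receive_message_chain(:current, :to_i).and_return(1_760_600_000)

Time.current       # => an RSpec double that only knows how to answer :to_i
Time.current.to_i  # => 1760600000
# Passing that double to update!(last_connected_at: ...) is what Active Record
# refuses to cast.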

A simple Timecop.travel(data_sync_at) fixed the issue.


Here is the Claude transcript, which happened later, when I was better able to guide the conversation along a more productive route!

Debugging Rails RSpec Error: RSpec::Mocks::Double Casting Issue

Q: On a Rails model, why would this code throw an error on a unit test saying: can't cast RSpec::Mocks::Double?
def update_last_connected_at
    update!(last_connected_at: Time.current) if connected?
end
Answer: The error occurs because connected? is returning an RSpec mock double instead of a boolean value, and Rails is trying to cast it to a boolean in the if conditional.

Common causes:

  • Unmocked association — connected? might depend on an association that's mocked but not configured properly
  • Missing stub — The connected? method itself might be mocked but not given a return value

Solutions:

1. Stub the method explicitly with a boolean:

allow(model).to receive(:connected?).and_return(true)

2. Build a proper test object instead of relying on mocks:

let(:model) { build(:your_model, connected: true) }
Q: Happens without connected?
Answer: If you're getting the error without even calling connected?, the problem is likely that update! itself is being called on a mock.

Most common cause: You're probably testing this method on a mocked or partially mocked model instance. When you stub the model without fully setting it up, calling update! on it returns a double instead of actually updating the record.

Solutions:

1. Use create or build instead of double:

let(:model) { create(:your_model) }
model.update_last_connected_at

2. If you must use a double, stub update!:

let(:model) { double(:your_model) }
allow(model).to receive(:update!).and_return(true)
Q: Model is not stubbed, it is a FactoryBot created one
Answer: If it's a real FactoryBot instance, then the issue is different. The error suggests that update! is being called but something in the conditional logic is going wrong.

Most likely causes:

  • connected? method doesn't exist or returns a mock
  • The model instance itself is getting wrapped in a double
  • Database transaction/rollback issues

Debug steps: Add some debugging to see what's actually happening:

def update_last_connected_at
    puts "connected? returns: #{connected?.inspect}"
    puts "connected?.class: #{connected?.class}"
    update!(last_connected_at: Time.current) if connected?
end
Q: Connected is a column in the table (model). Here's my test and model:
let(:data_sync_at) { Time.current.to_i }
allow(Time).to receive_message_chain(:current, :to_i).and_return(data_sync_at)
// call method that ends up saving the record

# Model:
after_commit :update_last_connected_at

def update_last_connected_at     
  update!(last_connected_at: Time.current)   
end
Answer: Ah, there's the issue! You have an infinite loop of callbacks:
  1. update! is called in update_last_connected_at
  2. update! triggers after_commit
  3. after_commit calls update_last_connected_at again
  4. Which calls update! again... and so on

This creates a recursive callback that RSpec is having trouble with, and the mock you set up on Time.current might be interfering.

The real problem: You're missing the connected? check that should prevent this recursion:

def update_last_connected_at
    update!(last_connected_at: Time.current) if connected?
end

Or use update_column instead (skips callbacks):

def update_last_connected_at
    update_column(:last_connected_at, Time.current) if connected?
end
Q: My after_commit is like this so should not recurse:
after_commit :update_last_connected_at, if: :saved_change_to_connected?
Answer: Good catch! With that guard clause, it shouldn't recurse. But there's still an issue:

The problem: When update_last_connected_at calls update!, it triggers another after_commit. Even though the guard if: :saved_change_to_connected? prevents it from running again, RSpec is still having issues.

The real culprit is likely your Time mock:

allow(Time).to receive_message_chain(:current, :to_i).and_return(data_sync_at)

This is mocking Time.current in a way that might be interfering with the after_commit callback execution in your test. When you call update!(last_connected_at: Time.current) inside the callback, the mocked Time.current might be returning a double.

Try this instead:

Option 1: Mock just the .to_i part, not the whole chain:

allow_any_instance_of(Time).to receive(:to_i).and_return(data_sync_at)

Option 2: Use travel_to:

travel_to(Time.at(data_sync_at)) do
  # your test code
end

Option 3: Simplify by not mocking Time at all:

let(:data_sync_at) { Time.current.to_i }
# Just use the actual Time.current in your model