
How We Dropped Redis Slow Queries to Zero

Pave Engineering · April 2026 · 9 min read

Our backend processes a continuous stream of GPS data, earnings calculations, and activity syncs for gig workers. As data volumes grew, so did the Redis slow log. This is the story of how we hunted down every entry — and why connecting Claude to AWS directly changed the speed of the investigation.

Background: What Makes a Redis Query “Slow”?

Redis is single-threaded. Every command — regardless of complexity — runs serially. While most commands are O(1) or O(log N) and complete in microseconds, a handful are O(N): they process every element in a data structure before returning. When N is large, these commands don’t just become slow themselves — they block every other command waiting in the queue.

Redis has a configurable slow log that records any command exceeding a threshold (typically 10–100ms). Our alerts were coming from hecate, our primary cluster, and we had a small but persistent set of commands showing up.
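For reference, the threshold and the log itself can be inspected from redis-cli; the 10ms value below is illustrative, not our production setting:

```shell
# Log any command that takes longer than 10ms (the value is in microseconds)
redis-cli CONFIG SET slowlog-log-slower-than 10000

# Show the ten most recent slow log entries (timestamp, duration, command)
redis-cli SLOWLOG GET 10

# Count entries, then clear the log
redis-cli SLOWLOG LEN
redis-cli SLOWLOG RESET
```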

· · ·

The Fixes

1. Removing GEORADIUS — PR #6942

The first culprit wasn’t slow because of data size — it was slow because it was doing unnecessary work by design.

Cache::EventsDash was using GEORADIUS with a radius of 15,000 miles (larger than any two points on Earth) just to get all members out of a geo sorted set:

DIST_INF = 15000 # the greatest distance between two points on Earth's surface is ~12,427 miles

events = $redis_aws.georadius(key(tp), 0, 0, DIST_INF, "mi", options: :WITHCOORD)

Using GEORADIUS as a glorified SMEMBERS meant Redis was computing distances for every member — work that was immediately discarded. GEORADIUS is also deprecated as of Redis 6.2. The fix replaced it with ZRANGE (plain sorted set scan) plus GEOPOS (batch coordinate lookup):

members = $redis_aws.zrange(key(tp), 0, -1)
coords  = $redis_aws.geopos(key(tp), members)
members.zip(coords).map do |user_id, latlong|
  { type: tp, user_id:, lat: latlong[1], long: latlong[0] }
end

ZRANGE returns members in O(N) with no distance math. GEOPOS decodes coordinates in a single call. No wasted computation, and off the deprecation path.

2. Batching large GEOPOS calls — PR #7009

The next slow log entries pointed to GEOPOS calls against sorted sets with ~20,000 members. GEOPOS is O(N) for N member lookups — passing 20K members at once was taking ~40ms each on hecate-003.

The fix was batching: twenty sequential calls of 1,000 each instead of one call of 20,000. Each individual call completes quickly, leaving the event loop free for other commands between batches.

GEOPOS_BATCH_SIZE = 1_000

coords = members.each_slice(GEOPOS_BATCH_SIZE).flat_map do |batch|
  $redis_aws.geopos(key(tp), batch)
end

We also reduced the SCAN count hint in DrivenActivities from 15,000 to 10,000: a smaller hint means less work per SCAN call, so other commands get served between iterations.

3. Tuning SCAN count further — PR #7010

Reviewing the slow log further, SCAN with a 10,000 count hint was still appearing. We reduced it again to 1,000. The count parameter is a hint, not a limit — Redis may return more or fewer — but a lower hint reduces the amount of work per iteration, keeping each call short.

$redis_aws.scan_each(match: "#{PREFIX}_*", count: 1000) { |key| ... }

4. Replacing blocking SADD and DEL with async alternatives — PR #7031

DEL in Redis is synchronous — it blocks the event loop while freeing memory. For large sorted sets and hashes, this can be significant. UNLINK is the async equivalent: it unregisters the key immediately (making it invisible to clients) and frees the memory in a background thread. It’s a drop-in replacement with no behaviour change.

We migrated 70 call sites across 27 files:

# before
$redis_aws.del(key)

# after
$redis_aws.unlink(key)

For large SADD operations (adding many members at once), we split them into smaller batches to avoid a single large O(N) write blocking the event loop.
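The SADD side mirrors the GEOPOS batching above. A minimal sketch of the pattern, with a hypothetical helper name and batch size (not necessarily what the PR shipped):

```ruby
SADD_BATCH_SIZE = 1_000 # hypothetical; the PR's actual batch size may differ

# Split one large O(N) SADD into several short writes so that other
# commands can be served between batches.
def sadd_in_batches(redis, key, members, batch_size: SADD_BATCH_SIZE)
  members.each_slice(batch_size) do |batch|
    redis.sadd(key, batch)
  end
end
```

Each call is O(batch_size) rather than O(N), which keeps any single trip through the event loop short.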

5. SMEMBERS → SSCAN for the hourly geo user set — PR #7058

The final slow log entry was the most predictable: a single SMEMBERS firing every day at exactly 4:15 UTC. One entry in 24 hours — but perfectly consistent.

SMEMBERS returns every member of a set at once. Our hourly geo user set accumulates the ID of every user who sends a GPS point throughout the day. By 4:15 UTC (end of the US evening gig economy rush), that set had grown to ~40,000 members.

Finding the caller took a codebase grep for smembers, which pointed to GeoMetricsRedis.get_user_ids. Connecting the timing to the EventBridge schedule is where the AWS integration paid off immediately; more on that below.

The fix was SSCAN with a batch size of 1,000: ~40 non-blocking iterations instead of one O(N) call that blocks everything behind it.

def get_user_ids(y_m_d_h_date = ...)
  users_key = key_prefix(GEOPOINTS_USERS, y_m_d_h_date)
  user_ids = []
  cursor = 0
  loop do
    cursor, batch = $redis_aws.sscan(users_key, cursor, count: 1000)
    user_ids.concat(batch.map(&:to_i))
    break if cursor == '0'
  end
  user_ids
end

· · ·

How AWS MCP Changed the Investigation

Debugging distributed systems normally means bouncing between tools — CloudWatch Logs, the ECS console, EventBridge, ELB metrics — copy-pasting ARNs, losing context between tabs. With the AWS MCP connected to Claude, every AWS query happened in the same conversation thread as the code analysis.

Finding the schedule in seconds

Once we identified get_user_ids as the slow call, we needed to know what triggered it at 4:15 UTC. Rather than navigating the EventBridge Scheduler console:

aws scheduler list-schedules | grep -i geo_metrics
aws scheduler get-schedule --name "Record_Geo_Metrics"
# => cron(15 4 * * ? *) → POST /schedule_task/record_geo_metrics

The cron expression, the Lambda target, and the API path all came back in one command. That’s normally a five-minute console hunt.

Ruling out unrelated issues

Mid-investigation, a spike of 1,231 ELB 504 errors appeared. Rather than derailing into manual console investigation, we queried the ELB metrics directly. The breakdown confirmed all 1,231 errors were 504s (not 502s), all targets remained healthy, and the spike resolved on its own — most likely an OOM kill on one container under sustained evening load. Confirmed, noted, moved on. The original Redis investigation stayed on track.
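The breakdown came from CloudWatch queries along these lines (load balancer name, target group ARN, and time window are placeholders):

```shell
# Sum of ALB-generated 504s in 5-minute buckets over the spike window
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_504_Count \
  --dimensions Name=LoadBalancer,Value=app/<our-alb>/<alb-id> \
  --start-time <spike-start> --end-time <spike-end> \
  --period 300 --statistics Sum

# Target health at the same moment
aws elbv2 describe-target-health --target-group-arn <target-group-arn>
```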

Having AWS query results and code analysis in the same context meant each finding fed directly into the next hypothesis — without a context switch.

· · ·

When Your AI Goes Down: Have a Backup Ready

Here’s something that doesn’t come up in most engineering blogs but happened during this investigation: mid-session, while we were deep into debugging the 504 spike, the Claude service went down.

Rather than waiting it out, I switched to Codex and continued the investigation immediately. What was notable: Codex picked up local AWS credentials automatically — no MCP configuration required, no setup overhead. It queried CloudWatch and ELB the same way Claude had been doing, just with a different interface. In that respect it behaved like OpenClaw: AWS access was just there, ready to use.

Practical takeaway: Keep more than one coding agent configured and ready to go. The sessions where you most need uninterrupted flow — mid-incident, mid-investigation, when you’re holding several hypotheses in your head — are exactly when a service outage costs you the most. If switching agents takes 30 seconds instead of 30 minutes, you stay in the problem.

The specific capability to look for is native credential access. Agents that can reach your AWS environment using locally configured credentials (rather than requiring explicit MCP setup per session) are the ones you can hand off to without losing momentum. Codex has this. OpenClaw had it. It’s worth knowing which tools in your arsenal work this way before you need them.

· · ·
[Chart] Redis slow log entries on hecate-003 — consistent daily hits through 04/23, dropping sharply to zero as the fixes landed.

The Pattern

Looking across all of these fixes, a clear pattern emerged:

Redis commands that are fine at small scale become slow log entries as data volumes grow — and the growth is often invisible until it crosses a threshold.

The fixes all follow the same principle: break large O(N) operations into smaller batches, and prefer alternatives (UNLINK over DEL, SSCAN over SMEMBERS, SCAN with smaller count hints) that keep any single command short, leaving the event loop free between iterations.

The slow log is now clear. The fixes were individually small — a few lines each — but finding them required tracing from a Redis key name through a codebase, an EventBridge schedule, a Lambda, a Rails controller, and a Sidekiq worker. Having AWS and code in the same context made that trace fast. And having a backup agent meant a service outage didn’t stop it.

Key Takeaways

01
Audit deprecated commands first. GEORADIUS was doing distance math on every member just to replicate SMEMBERS. Deprecated commands often carry hidden performance baggage — check the slow log, then check the Redis changelog.
02
Batch large O(N) commands. GEOPOS, SADD, and SMEMBERS against large structures are the usual suspects. Replace with sliced iterations of ≤1,000 elements.
03
Use UNLINK, not DEL. It's a drop-in replacement that frees memory in the background instead of on the event loop. Migrating all 70 call sites took minutes with a codebase search.
04
The slow log tells you what, not why. Tracing from a command back to its caller — across schedules, queues, and controllers — is where the investigation lives. Tooling that keeps AWS and code in the same context makes this dramatically faster.
05
Have a backup agent ready before you need one. Configure agent redundancy when it’s calm, not when you’re mid-incident. Prefer agents with native credential access so the handoff is immediate.
Redis AWS Ruby Performance ElastiCache MCP