How We Dropped Redis Slow Queries to Zero
Our backend processes a continuous stream of GPS data, earnings calculations, and activity syncs for gig workers. As data volumes grew, so did the Redis slow log. This is the story of how we hunted down every entry — and why connecting Claude to AWS directly changed the speed of the investigation.
Background: What Makes a Redis Query “Slow”?
Redis is single-threaded. Every command — regardless of complexity — runs serially. While most commands are O(1) or O(log N) and complete in microseconds, a handful are O(N): they process every element in a data structure before returning. When N is large, these commands don’t just become slow themselves — they block every other command waiting in the queue.
Redis has a configurable slow log that records any command exceeding a threshold (typically 10–100ms). Our alerts were coming from hecate, our primary cluster, and we had a small but persistent set of commands showing up.
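The slow log is queryable from the application's own Redis client. A minimal sketch using redis-rb's `slowlog` method and the `$redis_aws` client from this post (entry layout per the Redis docs: id, timestamp, duration in microseconds, then the command arguments):

```ruby
# Print the ten most recent slow log entries in readable form.
$redis_aws.slowlog('get', 10).each do |id, timestamp, micros, command, *|
  puts "##{id} #{Time.at(timestamp)} #{micros / 1000.0}ms: #{command.join(' ')}"
end
```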
The Fixes
1. Removing GEORADIUS — PR #6942
The first culprit wasn’t slow because of data size — it was slow because it was doing unnecessary work by design.
Cache::EventsDash was using GEORADIUS with a radius of 15,000 miles (larger than any two points on Earth) just to get all members out of a geo sorted set:
```ruby
# The farthest two points on Earth's surface are ~12,427 miles apart,
# so this radius is guaranteed to cover every member
DIST_INF = 15000

events = $redis_aws.georadius(key(tp), 0, 0, DIST_INF, "mi", options: :WITHCOORD)
```
Using GEORADIUS as a glorified SMEMBERS meant Redis was computing distances for every member — work that was immediately discarded. GEORADIUS is also deprecated as of Redis 6.2. The fix replaced it with ZRANGE (plain sorted set scan) plus GEOPOS (batch coordinate lookup):
```ruby
# Plain sorted-set scan: every member, no distance math
members = $redis_aws.zrange(key(tp), 0, -1)
# Batch coordinate lookup in a single call
coords = $redis_aws.geopos(key(tp), members)

members.zip(coords).map do |user_id, latlong|
  # GEOPOS returns [longitude, latitude] pairs
  { type: tp, user_id:, lat: latlong[1], long: latlong[0] }
end
```
ZRANGE returns members in O(N) with no distance math. GEOPOS decodes coordinates in a single call. No wasted computation, and off the deprecation path.
2. Batching large GEOPOS calls — PR #7009
The next slow log entries pointed to GEOPOS calls against sorted sets with ~20,000 members. GEOPOS is O(N) for N member lookups — passing 20K members at once was taking ~40ms each on hecate-003.
The fix was batching: twenty sequential calls of 1,000 each instead of one call of 20,000. Each individual call completes quickly, leaving the event loop free for other commands between batches.
```ruby
GEOPOS_BATCH_SIZE = 1_000

# Twenty short lookups instead of one 20K-member call
coords = members.each_slice(GEOPOS_BATCH_SIZE).flat_map do |batch|
  $redis_aws.geopos(key(tp), batch)
end
```
We also reduced the SCAN count hint in DrivenActivities from 15,000 to 10,000 — a smaller hint means Redis yields more often during the scan.
3. Tuning SCAN count further — PR #7010
Reviewing the slow log further, SCAN with a 10,000 count hint was still appearing. We reduced it again to 1,000. The count parameter is a hint, not a limit — Redis may return more or fewer — but a lower hint reduces the amount of work per iteration, keeping each call short.
```ruby
$redis_aws.scan_each(match: "#{PREFIX}_*", count: 1000) { |key| ... }
```
4. Replacing blocking SADD and DEL with async alternatives — PR #7031
DEL in Redis is synchronous — it blocks the event loop while freeing memory. For large sorted sets and hashes, this can be significant. UNLINK is the async equivalent: it unregisters the key immediately (making it invisible to clients) and frees the memory in a background thread. It’s a drop-in replacement with no behaviour change.
We migrated 70 call sites across 27 files:
```ruby
# before: synchronous, frees memory on the event loop
$redis_aws.del(key)

# after: async, frees memory in a background thread
$redis_aws.unlink(key)
```
For large SADD operations (adding many members at once), we split them into smaller batches to avoid a single large O(N) write blocking the event loop.
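The slicing mirrors the GEOPOS fix above. A minimal sketch, assuming a `member_ids` array and a `set_key` variable (both names are illustrative, not from the PR):

```ruby
SADD_BATCH_SIZE = 1_000

# Many short O(batch) writes instead of one large O(N) write,
# leaving the event loop free between slices.
member_ids.each_slice(SADD_BATCH_SIZE) do |batch|
  $redis_aws.sadd(set_key, batch)
end
```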
5. SMEMBERS → SSCAN for the hourly geo user set — PR #7058
The final slow log entry was the most predictable: a single SMEMBERS firing every day at exactly 4:15 UTC. One entry in 24 hours — but perfectly consistent.
SMEMBERS returns every member of a set at once. Our hourly geo user set accumulates the ID of every user who sends a GPS point throughout the day. By 4:15 UTC (end of the US evening gig economy rush), that set had grown to ~40,000 members.
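For context, the writer side is a one-line SADD on every GPS ingest, roughly like this (a hypothetical sketch; only the reader appears in the PR):

```ruby
# On each GPS point, record that this user reported location in this window.
$redis_aws.sadd(key_prefix(GEOPOINTS_USERS, y_m_d_h_date), user_id)
```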
Finding the caller was a codebase grep for smembers, which pointed to GeoMetricsRedis.get_user_ids. Connecting the timing to the EventBridge schedule is where the AWS integration paid off immediately — more on that below.
The fix was SSCAN with a batch size of 1,000: ~40 non-blocking iterations instead of one O(N) call that blocks everything behind it.
```ruby
def get_user_ids(y_m_d_h_date = ...)
  users_key = key_prefix(GEOPOINTS_USERS, y_m_d_h_date)
  user_ids = []
  cursor = 0

  # ~40 short, non-blocking iterations for a 40K-member set
  loop do
    cursor, batch = $redis_aws.sscan(users_key, cursor, count: 1000)
    user_ids.concat(batch.map(&:to_i))
    break if cursor == '0' # SSCAN returns '0' once the scan completes
  end

  user_ids
end
```
How AWS MCP Changed the Investigation
Debugging distributed systems normally means bouncing between tools — CloudWatch Logs, the ECS console, EventBridge, ELB metrics — copy-pasting ARNs, losing context between tabs. With the AWS MCP connected to Claude, every AWS query happened in the same conversation thread as the code analysis.
Finding the schedule in seconds
Once we identified get_user_ids as the slow call, we needed to know what triggered it at 4:15 UTC. Rather than navigating the EventBridge Scheduler console:
```shell
aws scheduler list-schedules | grep -i geo_metrics
aws scheduler get-schedule --name "Record_Geo_Metrics"
# => cron(15 4 * * ? *) → POST /schedule_task/record_geo_metrics
```
The cron expression, the Lambda target, and the API path all came back in one command. That’s normally a five-minute console hunt.
Ruling out unrelated issues
Mid-investigation, a spike of 1,231 ELB 504 errors appeared. Rather than derailing into manual console investigation, we queried the ELB metrics directly. The breakdown confirmed all 1,231 errors were 504s (not 502s), all targets remained healthy, and the spike resolved on its own — most likely an OOM kill on one container under sustained evening load. Confirmed, noted, moved on. The original Redis investigation stayed on track.
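Outside an MCP session, the same check is a single CloudWatch query. A sketch with the aws-sdk-cloudwatch gem (the load balancer dimension value is a placeholder):

```ruby
require 'aws-sdk-cloudwatch'

cloudwatch = Aws::CloudWatch::Client.new
resp = cloudwatch.get_metric_statistics(
  namespace: 'AWS/ApplicationELB',
  metric_name: 'HTTPCode_ELB_504_Count', # 504s generated by the LB itself
  dimensions: [{ name: 'LoadBalancer', value: 'app/our-alb/0123456789abcdef' }],
  start_time: Time.now - 3600,
  end_time: Time.now,
  period: 300, # 5-minute buckets
  statistics: ['Sum']
)

resp.datapoints.sort_by(&:timestamp).each do |point|
  puts "#{point.timestamp}: #{point.sum.to_i} 504s"
end
```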
When Your AI Goes Down: Have a Backup Ready
Here’s something that doesn’t come up in most engineering blogs but happened during this investigation: mid-session, while we were deep into debugging the 504 spike, the Claude service went down.
Rather than waiting it out, I switched to Codex and continued the investigation immediately. What was notable: Codex picked up local AWS credentials automatically — no MCP configuration required, no setup overhead. It queried CloudWatch and ELB the same way Claude had been doing, just with a different interface. In that respect it behaved like OpenClaw: AWS access was just there, ready to use.
The specific capability to look for is native credential access. Agents that can reach your AWS environment using locally configured credentials (rather than requiring explicit MCP setup per session) are the ones you can hand off to without losing momentum. Codex has this. OpenClaw had it. It’s worth knowing which tools in your arsenal work this way before you need them.
[Chart: hecate-003 — daily slow log hits through 04/23, dropping to zero as the fixes landed.]

The Pattern
Looking across all six fixes, a clear pattern emerged: break large O(N) operations into smaller batches, and prefer alternatives that yield the event loop between iterations (UNLINK over DEL, SSCAN over SMEMBERS, SCAN with smaller count hints).
The slow log is now clear. The fixes were individually small — a few lines each — but finding them required tracing from a Redis key name through a codebase, an EventBridge schedule, a Lambda, a Rails controller, and a Sidekiq worker. Having AWS and code in the same context made that trace fast. And having a backup agent meant a service outage didn’t stop it.
Key Takeaways
- GEORADIUS was doing distance math on every member just to replicate SMEMBERS. Deprecated commands often carry hidden performance baggage — check the slow log, then check the Redis changelog.
- GEOPOS, SADD, and SMEMBERS against large structures are the usual suspects. Replace them with sliced iterations of ≤1,000 elements.
- UNLINK, not DEL. It's a drop-in replacement that frees memory in the background instead of on the event loop. Migrating all 70 call sites took minutes with a codebase search.