Hello world! I'm Jim, and I do systems-y things at Bluesky. This is my first leaflet, and today we have a short one: what the heck happened to Bluesky search the other day?

No data was lost, but writes to our OpenSearch clusters were delayed by a few hours. Sorry about that!

Frankly, this was just one of those operational issues that's high-consequence but not all that interesting. We do a huge amount of cool, challenging, novel work at Bluesky, and if you're interested in building an open foundation for the social web, come work with us!

Background

We dump the firehose into an internal Bufstream Kafka cluster so we can easily do durable processing (i.e. replay, and HA consumption amongst multiple hosts).

We have a job that reads from this firehose kafka topic and "enriches" each message with data that is expensive to compute: image bytes (CDN pulls), vector embeddings (GPU compute), etc. This has been in prod for quite a while and has been running fine.
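
For a rough idea of the shape of that job, here's a sketch of the enrichment step. Everything in it (the record shape, the helper names) is a made-up placeholder for illustration, not our actual code:

```go
package enrich

import "context"

// FirehoseRecord and Enriched are stand-ins for our real types; the helper
// functions below are hypothetical placeholders.
type FirehoseRecord struct {
	Kind      string // "post", "reply", "quote", "like", ...
	Text      string
	ImageRefs []string
}

type Enriched struct {
	Record    FirehoseRecord
	Images    [][]byte  // pulled from the CDN
	Embedding []float32 // computed on the GPU fleet
}

// Enrich attaches the expensive-to-compute extras to a firehose record.
func Enrich(ctx context.Context, rec FirehoseRecord) (Enriched, error) {
	out := Enriched{Record: rec}

	// CDN pulls: network-bound.
	for _, ref := range rec.ImageRefs {
		img, err := fetchFromCDN(ctx, ref)
		if err != nil {
			return out, err
		}
		out.Images = append(out.Images, img)
	}

	// Vector embedding: GPU-bound.
	emb, err := computeEmbedding(ctx, rec.Text, out.Images)
	if err != nil {
		return out, err
	}
	out.Embedding = emb
	return out, nil
}

// Hypothetical stubs; the real versions hit the CDN and the GPU workers.
func fetchFromCDN(ctx context.Context, ref string) ([]byte, error) { return nil, nil }
func computeEmbedding(ctx context.Context, text string, imgs [][]byte) ([]float32, error) {
	return nil, nil
}
```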

The Issue(s)

The morning of the incident, I noticed our AWS bill was up a fair bit because we needed more GPUs for some new types of embedding generation. I saw that we were underutilizing our existing GPUs, so I moved our GPU AWS nodes from their largest instance class down to a much cheaper one to try to save a few bucks.

Things looked fine for a while, so I declared victory and moved on. However, it turns out that the GPU class I assigned didn't have enough firepower to handle the load. This was issue number 1.

Later in the day, we got user reports that posts hadn't been showing up in search for several hours. This was issue number 2: we had no automated alerts on kafka consumer group lag. We've since added alerts so we find out automatically when things are falling behind.
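
For context, consumer group lag is just the gap between the newest offset in each partition and the offset the group has committed. Conceptually, the check behind those alerts boils down to this; the helper functions and the threshold below are made-up placeholders, not our actual setup:

```go
package lagcheck

import (
	"context"
	"fmt"
)

const lagAlertThreshold = 100_000 // records; illustrative, not our real value

// checkLag sums (end offset - committed offset) across partitions and
// complains if the group has fallen too far behind.
func checkLag(ctx context.Context, group, topic string) error {
	committed, err := fetchCommittedOffsets(ctx, group, topic) // partition -> committed offset
	if err != nil {
		return err
	}
	ends, err := fetchEndOffsets(ctx, topic) // partition -> newest offset
	if err != nil {
		return err
	}

	var total int64
	for partition, end := range ends {
		total += end - committed[partition]
	}
	if total > lagAlertThreshold {
		return fmt.Errorf("consumer group %s is %d records behind on %s", group, total, topic)
	}
	return nil
}

// Hypothetical stubs; a real check would use the Kafka admin API.
func fetchCommittedOffsets(ctx context.Context, group, topic string) (map[int32]int64, error) {
	return nil, nil
}
func fetchEndOffsets(ctx context.Context, topic string) (map[int32]int64, error) {
	return nil, nil
}
```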

At this point, I reverted the change to the GPU instance type, but we had accumulated a large backlog of work, and things weren't catching up.

It turns out that the kafka consumer process we wrote was not well-configured (issue number 3). The Go kafka library we're using only heartbeats to the brokers when you attempt to consume messages, and our batch size was 10k messages, each of which has relatively high processing latency even under normal circumstances.

So we were pulling large batches that we couldn't process fast enough, and the consumers would time out because they weren't issuing heartbeats back to the brokers. This meant the consumers were essentially crash looping. Also, the brokers had to constantly rebalance that consumer group, which is heavyweight and painful.

The reason a 10k batch size isn't normally a problem is that we don't typically have a large backlog, so the consumers pull batches much smaller than their limit. They can handle batches of, say, 100 messages without issue, but they can't complete a batch of 10k messages before the heartbeat timeout is reached.
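
To spell out the failure mode, here's roughly the shape of that consume loop, written against a made-up client API with illustrative numbers (not our actual code or latencies):

```go
package example

import "context"

// Message is a stand-in for a decoded firehose record.
type Message struct{ Value []byte }

// Client is a stand-in for the real consumer-group client, whose Poll call
// is the only place a heartbeat gets sent to the brokers.
type Client interface {
	Poll(ctx context.Context, max int) []Message
	CommitOffsets(ctx context.Context) error
}

// run is the shape of the loop that got us into trouble.
func run(ctx context.Context, client Client, enrich func(context.Context, Message)) {
	for {
		// The heartbeat happens here, and only here. Under a backlog,
		// Poll returns the full 10k cap.
		batch := client.Poll(ctx, 10_000)

		for _, msg := range batch {
			enrich(ctx, msg) // tens of ms per record under load
		}

		// 10,000 records at, say, 30ms each is roughly five minutes with no
		// heartbeat, far past a session timeout measured in tens of seconds.
		// The broker evicts the consumer, the group rebalances, and the
		// process effectively crash loops.
		client.CommitOffsets(ctx)
	}
}
```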

One other thing that made resolution weird and slow was that our metrics were lacking. We were recording processing latency for every record that comes over the firehose and tracking it all in one big chart. However, we only do heavy-duty processing on posts, replies, and quotes (i.e. there's no need to do embedding generation for a like). Jumbling everything into one big pool obscured the metrics for the tasks we actually care about, and it took a long time to figure out what was wrong.
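
The fix is to label latency by record type so the heavy post/reply/quote work gets its own series instead of being averaged in with likes. A sketch of what that looks like with prometheus/client_golang (metric and label names are illustrative, not our actual ones):

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var processingLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "firehose_record_processing_seconds",
		Help:    "Time to process a firehose record, by record type.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"record_type"}, // "post", "reply", "quote", "like", ...
)

func init() {
	prometheus.MustRegister(processingLatency)
}

// observe records how long one record took under its own type, so the
// heavy post/reply/quote series aren't drowned out by likes.
func observe(recordType string, took time.Duration) {
	processingLatency.WithLabelValues(recordType).Observe(took.Seconds())
}
```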

The Resolution

To get around this, we reverted the GPU instance class change, then fixed our misconfigured kafka consumers to pull a much smaller batch size. That allowed the consumers to heartbeat successfully and stop crash looping.
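
The shape of that config fix is simple: cap how many records you take per poll so you get back to the brokers well within the session timeout. Here's a rough sketch using franz-go as a stand-in client; the broker address, group and topic names, batch cap, and timeouts are all illustrative, not our production values:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),      // illustrative broker
		kgo.ConsumerGroup("search-enrichment"), // illustrative group name
		kgo.ConsumeTopics("firehose"),          // illustrative topic name
		kgo.SessionTimeout(45*time.Second),
		kgo.HeartbeatInterval(3*time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()
	for {
		// Cap each poll at a batch small enough to finish well within the
		// session timeout, even when every record needs heavy enrichment.
		fetches := client.PollRecords(ctx, 100)
		if errs := fetches.Errors(); len(errs) > 0 {
			for _, e := range errs {
				log.Printf("fetch error on %s/%d: %v", e.Topic, e.Partition, e.Err)
			}
			continue
		}
		fetches.EachRecord(func(r *kgo.Record) {
			_ = r // enrich and index the record here
		})
	}
}
```

The exact numbers matter less than the invariant: the worst-case time to work through one batch has to stay comfortably under the session timeout.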

At that point, we were stable, and just scaled up the pipeline to churn through the large backlog we had built up over the course of the day. The backlog was cleared in a few hours and things were back to normal. Posts were showing up in search a few seconds after they were made.

Here's a chart measuring consumer group lag by partition for that topic:

Those weird gaps in the metrics were actually the brokers themselves OOM'ing under the heavy load. We've since added more juice.

Lessons

1. We need automated alerts on kafka consumer group lag.

2. We need to be very deliberate in tuning our kafka producer and consumer behavior. Many of us have run kafka in prod for years, so we knew this already, but we're a small team with limited time, and this one slipped through the cracks.

3. We need to improve our metrics to break out "heavy" tasks vs. "light" tasks.

All of these remediations are complete. It's never great to have degraded service, and this past week was a particularly bad one for it, so apologies.

Until the next one,

Jim