Indexing problem impacting 1/12 of customers

Incident Report for Help Scout

Postmortem

What Happened

We experienced a series of status incidents on Monday, Nov 19, and Tuesday, Nov 20. Since there were some common threads and the customer impact was largely the same, we’ve decided to cover them all in this postmortem.

Help Scout experienced no downtime in the web or mobile apps, but Mailbox API 2.0 experienced a cumulative 14 minutes of downtime. No data was lost. However, during the incidents there were brief periods of higher-than-normal page load times in the web app, and delays in things like folder counts, inbound/outbound email processing, search and reports indexing, and workflows processing. Those delays ranged from a couple of minutes to roughly 20 minutes. Across all customers, 98 chat transcripts were delivered up to a few hours late.

Monday, Nov 19 - 4:10 - 5:40am EST

A few separate processes together created more events than our queues could process at once, which also hurt overall database performance. Our Ops team was notified immediately and took a number of actions to get the event queues back to zero and stabilize performance. Page load times averaged 4-5 seconds, about 25% above normal.

Monday, Nov 19 - 9:52am - 3:26pm EST

There was no real page load performance impact during this incident, but roughly 8% of customers did see sporadic delays in things like folder counts, search and reports indexing, and workflows processing.

The cause was the release of an internal API a couple of hours earlier. This release was out of sync with our search indexing services, which led to a backup of events on our queues. We were able to roll back the release and address the initial backup, but a high number of invalid events were created before the rollback finished. Long story short, it took several hours for the affected shard to fully recover.
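To illustrate the kind of failure described above, here is a minimal sketch of a consumer that sets aside events it does not understand instead of letting them clog the queue. It is not our actual indexing code: the `Event` type, the `SUPPORTED_VERSIONS` set, the `schema_version` field, and the `drain` function are all hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical: the event schema versions this consumer knows how to index.
SUPPORTED_VERSIONS = {1, 2}


@dataclass
class Event:
    event_id: int
    schema_version: int
    payload: dict = field(default_factory=dict)


def drain(queue, index, dead_letters):
    """Process every queued event, parking unsupported ones instead of retrying them."""
    while queue:
        event = queue.popleft()
        if event.schema_version not in SUPPORTED_VERSIONS:
            # An out-of-sync producer emitted this event; keep it for later inspection.
            dead_letters.append(event)
            continue
        index[event.event_id] = event.payload


# Event 2 uses a schema version the consumer has not been updated for.
queue = deque([
    Event(1, 2, {"folder": "inbox"}),
    Event(2, 3, {"folder": "mine"}),
    Event(3, 1, {"folder": "closed"}),
])
index = {}
dead_letters = []
drain(queue, index, dead_letters)
```

In this sketch the valid events are indexed right away, and the invalid one lands in `dead_letters`, where it can be replayed once producer and consumer agree on a version again.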

Tuesday, Nov 20 - 10:37 - 11:25am EST

This incident was similar to the first one on Monday morning. Separate processes came together to create a high number of events on our queues and negatively impact database performance. Between roughly 11:00 and 11:20am EST, page load times more than doubled, exceeding 7 seconds, but recovered without downtime.

Later in the afternoon, we identified 98 conversations that had not been properly created during the incident. We ran a script to retry them and deliver them to 65 different mailboxes across our system.
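For readers curious what a retry pass like the one above can look like, here is a minimal sketch under stated assumptions. It is not the script we ran: `retry_failed`, `flaky_create`, and the item names are hypothetical, and the bounded attempts with exponential backoff are one common approach rather than a description of our tooling.

```python
import time


def retry_failed(items, create, max_attempts=3, base_delay=0.0):
    """Retry each item up to max_attempts times; return (succeeded, still_failed)."""
    succeeded, still_failed = [], []
    for item in items:
        for attempt in range(1, max_attempts + 1):
            try:
                create(item)
                succeeded.append(item)
                break
            except Exception:
                # Broad catch keeps the sketch short; a real script would
                # catch only the errors it knows to be transient.
                if attempt == max_attempts:
                    still_failed.append(item)
                else:
                    # Exponential backoff: base_delay, then 2x, then 4x, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
    return succeeded, still_failed


# Hypothetical stand-in for conversation creation: fails once per item, then succeeds.
seen = set()


def flaky_create(item):
    if item not in seen:
        seen.add(item)
        raise RuntimeError("transient failure")


done, failed = retry_failed(["transcript-1", "transcript-2"], flaky_create)
```

Returning the items that still fail after the final attempt makes it easy to confirm that nothing was silently dropped.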

Changes We’re Making

It won’t surprise you to hear that we learned a lot! Several changes have been shipped over the last 24 hours that will help us prevent these sorts of issues in the future. We can’t make any guarantees, but we don’t expect the trend to continue tomorrow. We also have a list of items the Engineering team will be working on to further stabilize the system moving forward.

One thing that worked rather well was our alerting and logging, which pointed us in the right direction more often than not. So we’re happy to see things getting better on that front. We sincerely apologize for any frustration this caused over the last couple of days.

Posted Nov 20, 2018 - 22:12 EST

Resolved

We're closing this one out and plan to share a full postmortem in the next few days. So sorry for the troubles today!

Posted Nov 19, 2018 - 14:12 EST

Update

The impacted search shard is now in the clear. You should see things continuing to return to normal. Thanks for hanging with us!

Posted Nov 19, 2018 - 13:08 EST

Update

Almost there! We're still waiting for the indexer to catch up.

Posted Nov 19, 2018 - 12:18 EST

Update

We're monitoring our queues and seeing steady improvements, though folder counts/views, tags and workflows are still a little behind. Again, this only impacts roughly 8% of customers. Everything should be back to normal soon!

Posted Nov 19, 2018 - 11:34 EST

Monitoring

Still working on it! Folder views/counts, tags and workflows will be slow to update until we're in the clear.

Posted Nov 19, 2018 - 11:01 EST

Update

Updating the component status here: Docs sites and Beacon/live chat are unaffected.

Posted Nov 19, 2018 - 10:49 EST

Investigating

One of our indexing shards, which handle things like folder counts, search and reports indexing, and workflows processing, is having trouble and we're looking into it. Those things will be delayed until we can resolve the issue.

Posted Nov 19, 2018 - 10:44 EST

This incident affected: Web App, Docs Sites, Beacon, Email Processing, Mailbox counts, folders, reports, search results and Public APIs (Inbox API).