We experienced a series of status incidents on Monday, Nov 19, and Tuesday, Nov 20. Since there were some common threads and the customer impact was largely the same, we’ve decided to cover them all in a single postmortem.
Help Scout experienced no downtime in the web or mobile apps, but Mailbox API 2.0 experienced a cumulative 14 minutes of downtime. No data was lost. However, during the incidents there were brief periods of higher-than-normal page load times in the web app, as well as delays in things like folder counts, inbound/outbound email processing, search and reports indexing, and workflows processing. Those delays ranged anywhere from a couple of minutes to roughly 20 minutes. Across all customers, 98 chat transcripts were delivered up to a few hours late.
Monday, Nov 19 - 4:10 - 5:40am EST
A few separate processes cumulatively created events faster than our queues could process them, which also had a negative impact on overall database performance. Our Ops team was notified immediately and took a number of actions to drain the event queues back to zero and stabilize performance. Page load times averaged 4-5 seconds, roughly 25% above normal.
Monday, Nov 19 - 9:52am - 3:26pm EST
There was no real impact on page load performance during this incident, but roughly 8% of customers saw sporadic delays in things like folder counts, search and reports indexing, and workflows processing.
The cause was the release of an internal API a couple of hours earlier. The release was out of sync with our search indexing services, which led to a backup of events on our queues. We were able to roll back the release and address the initial backup, but a high number of invalid events were created before the rollback finished. Long story short, it took several hours for the affected data shard to fully recover.
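One common defense against this failure mode is to have consumers validate the schema version of each event and divert anything they don’t understand to a dead-letter queue instead of retrying it indefinitely. This is a minimal, hypothetical sketch of that idea (the names and structure are ours, not Help Scout’s actual indexing code):

```python
from dataclasses import dataclass, field

# Assumption for illustration: the indexing consumer supports events
# up to a known schema version; a newer release may emit versions it
# cannot handle yet.
SUPPORTED_SCHEMA_VERSION = 2

@dataclass
class Event:
    conversation_id: int
    schema_version: int
    payload: dict = field(default_factory=dict)

def process_events(events, index, dead_letter):
    """Index supported events; divert out-of-sync ones for later review."""
    for event in events:
        if event.schema_version > SUPPORTED_SCHEMA_VERSION:
            # Produced by a newer, out-of-sync release: park it rather
            # than letting it clog the queue with failed retries.
            dead_letter.append(event)
        else:
            index[event.conversation_id] = event.payload
```

With a pattern like this, an out-of-sync release fills the dead-letter queue (which can be alarmed on) rather than backing up the main queue for hours.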
Tuesday, Nov 20 - 10:37 - 11:25am EST
This incident was quite similar to the first one on Monday morning. Separate processes came together to create a high number of events on our queues and negatively impact database performance. Between roughly 11:00 and 11:20am EST, page load times more than doubled, exceeding 7 seconds, but recovered without downtime.
Later that afternoon, we identified 98 conversations that had failed to be created properly during the incident, and we were able to run a script to retry those and distribute them to 65 different mailboxes across our system.
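A backfill of this kind can be sketched roughly as follows. Everything here is hypothetical, not Help Scout’s actual tooling: `create_conversation` stands in for the real creation code path and is assumed to return the mailbox the conversation landed in.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retry-failed-conversations")

def retry_failed_conversations(failed_ids, create_conversation):
    """Re-run creation for conversations that failed during an incident.

    `create_conversation` is a stand-in for the real creation code path;
    it takes a conversation id and returns the destination mailbox id.
    Returns the set of mailboxes that received redelivered conversations.
    """
    mailboxes = set()
    for conversation_id in failed_ids:
        mailbox_id = create_conversation(conversation_id)
        mailboxes.add(mailbox_id)
        log.info("redelivered conversation %s to mailbox %s",
                 conversation_id, mailbox_id)
    return mailboxes
```

Logging each redelivery makes it easy to verify afterward that every failed conversation reached a mailbox, and to report the affected-mailbox count, as in the 98-conversations / 65-mailboxes figures above.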
It won’t surprise you to hear that we learned a lot! Several changes have been shipped over the last 24 hours that will help us prevent these sorts of issues in the future. We can’t make any guarantees, but we don’t expect the trend to continue tomorrow. We also have a list of items the Engineering team will be working on to further stabilize the system moving forward.
One thing that worked rather well was our alerting and logging, which pointed us in the right direction more often than not, so we’re happy to see things improving on that front. We sincerely apologize for any frustration this caused over the last couple of days.