Between 9:36 and 10:00am EST on July 9, customers experienced errors in Help Scout when loading conversations, search results and some other functions of the web app.
One of our search shards failed, which caused the errors and impacted roughly 1/6 (17%) of web app page views during that time. We have redundancies in place to prevent a failure like this from impacting customers, but they did not protect us this time because we also encountered a problem with the load balancer that manages traffic between the shards.
After a little over 20 minutes, our Ops team was able to resolve the issue and get everything back to normal.
Within 24 hours, we replaced the load balancer with new software and added end-to-end health checks to the impacted service. These items were on the short-term roadmap, but we dropped everything to knock them out following the issue.
Thanks to these changes, we are confident that we can detect this kind of failure and prevent it from impacting customers in the future.