We would like to share more details about the events that occurred with Memsource between 11:14 AM and 11:26 AM CEST on September 14th, 2021, which led to a partial performance degradation of the Project Management, Editors, API and File Processing services, and about what Memsource engineers are doing to prevent these sorts of issues from happening again.
11:14 AM CEST: First warnings came from automated monitoring, reporting an increased error rate. Our engineers started to look into the problem.
11:17 AM CEST: First Zendesk tickets concerning Memsource slowness and availability started to appear.
11:19 AM CEST: First critical alert was triggered by automated monitoring, concerning the number of requests resulting in errors.
11:24 AM CEST: Communication between the engineering and support teams was established, and a peak in database load was identified.
11:26 AM CEST: Database load dropped back to a normal level.
11:35 AM CEST: Engineers identified the problematic database queries.
11:48 AM CEST: The source of the problematic database queries was blocked to prevent further incidents.
The number of slow queries on the database server rose sharply within a short period of time. This caused a significant drop in database performance and resulted in general Memsource unavailability for a few minutes.
Operations that led to the long-running queries were blocked immediately.
Actions to Prevent Recurrence
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.