Partial Performance Degradation of Project Management, Editors, API and File Processing Services between 11:14 AM and 11:26 AM CEST
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource between 11:14 AM and 11:26 AM CEST on September 14th, 2021, which led to a partial performance degradation of the Project Management, Editors, API and File Processing services, and about what Memsource engineers are doing to prevent these sorts of issues from happening again.

Timeline

11:14 AM CEST: First warnings come from automated monitoring reporting an increased error rate. Our engineers start to look into the problem.

11:17 AM CEST: First Zendesk tickets concerning Memsource slowness and availability start to appear.

11:19 AM CEST: First critical alert triggered by automated monitoring, concerning the number of failed requests.

11:24 AM CEST: Communication between engineering and support teams established and a peak in database load was identified.

11:26 AM CEST: Database load dropped back to a normal level.

11:35 AM CEST: Engineers found problematic database queries. 

11:48 AM CEST: Source of problematic database queries blocked to prevent further incidents.

Root Cause

The number of slow queries on the database server rose sharply within a short period. This caused a significant drop in database performance and resulted in general unavailability of Memsource for a few minutes.
The operations that led to the long-running queries were blocked at 11:48 AM CEST, once the problematic queries had been identified.

Actions to Prevent Recurrence

  • Enhance monitoring of the database to trigger earlier warnings. 
  • Tune the performance of long-running database queries.
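
As an illustration of what an earlier database warning could look like, here is a minimal sketch that counts slow queries in a sliding time window and raises a warning before the critical level is reached. It is not Memsource's actual monitoring: the class name `SlowQueryMonitor` and all thresholds are hypothetical and chosen only for the example.

```python
from collections import deque


class SlowQueryMonitor:
    """Hypothetical sliding-window counter of slow database queries."""

    def __init__(self, slow_threshold_s=1.0, window_s=60.0,
                 warn_count=5, critical_count=20):
        # All default values are assumptions made for illustration only.
        self.slow_threshold_s = slow_threshold_s
        self.window_s = window_s
        self.warn_count = warn_count
        self.critical_count = critical_count
        self._slow_times = deque()  # timestamps of recent slow queries

    def record(self, timestamp, duration_s):
        """Register one finished query and return the current alert level."""
        if duration_s >= self.slow_threshold_s:
            self._slow_times.append(timestamp)
        self._evict(timestamp)
        return self.level()

    def _evict(self, now):
        # Drop slow-query timestamps that have fallen out of the window.
        while self._slow_times and self._slow_times[0] <= now - self.window_s:
            self._slow_times.popleft()

    def level(self):
        """Map the number of slow queries in the window to an alert level."""
        count = len(self._slow_times)
        if count >= self.critical_count:
            return "critical"
        if count >= self.warn_count:
            return "warning"
        return "ok"
```

Setting the warning count well below the critical count means a sudden rise in slow queries is reported while the database is still responsive, rather than only after requests start to fail.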

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.

Posted Sep 17, 2021 - 12:01 CEST

Resolved
This incident has been resolved.
Posted Sep 14, 2021 - 11:26 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted Sep 14, 2021 - 11:20 CEST
Investigating
We are currently investigating this issue.
Posted Sep 14, 2021 - 11:14 CEST