Degraded Performance of All Memsource Components between 8:25 and 9:45 AM CEST
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource between 8:25 and 9:45 CEST on October 7th, 2021, which led to a partial performance degradation of all Memsource components, and about what Memsource engineers are doing to prevent these issues from happening again.

Timeline

Wed 6th October 15:00 CEST: A new version of the database cleaner is deployed to production servers.

Wed 6th October 16:00 CEST: The database cleaner periodically runs new complex queries which start consuming available burst IO capacity on one of the database volumes.

Thu 7th October 7:41 CEST: Deployment of a new version of the Memsource service starts, invalidates the local cache and subsequently increases the number of database IO operations. 

Thu 7th October 8:20 CEST: Available burst IO capacity is completely consumed on one of the database volumes; only the baseline IO capacity is now available for processing incoming database queries.

Thu 7th October 8:25 CEST: The database load is too high, increasing the response time of user requests and slowing down some parts of the Memsource service. The service becomes unavailable for some users.

Thu 7th October 8:26 CEST: Automated monitoring starts reporting the slow response of the system. Memsource engineers start looking for the cause of the problem.

Thu 7th October 8:48 CEST: Some servers are reconfigured to disable unnecessary database requests to decrease the database load; Memsource engineers continue looking for the root cause.

Thu 7th October 9:16 CEST: The database cleaner and consumed burst IO capacity are identified as the root cause.

Thu 7th October 9:18 CEST: The database cleaner is paused and IO capacity on the disk volume is increased; the database load quickly returns to normal.

Thu 7th October 9:28 CEST: Database load and response time of the system are returning to normal.

Thu 7th October 9:45 CEST: The incident is resolved.

Root Cause

Complex queries in the new version of the database cleaner created large temporary database tables on a database volume dedicated to such tables. Processing these queries demanded increasing IO performance, which gradually consumed all available burstable IO capacity of the volume. There was no automated alert set up for an incident of this type. Exhaustion of the burst IO capacity was accelerated by the cache being cleared during the deployment of a new version of the system. As a result of this chain of events, many database queries were queued and could not be processed fast enough, which degraded the service response time.
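
To illustrate how sustained load above the baseline drains burst capacity, the following is a minimal sketch of a token-bucket model of a burstable volume. It is not taken from Memsource's systems: the function name, the credit accounting (one credit per IO operation) and all numbers are assumptions chosen for illustration, and real volumes usually also cap the maximum burst rate.

```python
def simulate_burst_bucket(demand_iops, baseline_iops, bucket_capacity, step_seconds=60):
    """Simulate a burstable volume as a token bucket (hypothetical model).

    demand_iops: requested IO operations per second, one value per time step.
    Returns a list of (served_iops, remaining_credits), one entry per step.
    """
    credits = bucket_capacity  # assume the bucket starts full
    history = []
    for demand in demand_iops:
        # Credits refill at the baseline rate, up to the bucket capacity.
        credits = min(bucket_capacity, credits + baseline_iops * step_seconds)
        # The volume can serve only as much IO as its credits cover in this step.
        max_servable = credits / step_seconds
        served = min(demand, max_servable)
        credits -= served * step_seconds
        history.append((served, credits))
    return history


# Assumed numbers: baseline 100 IOPS, demand 200 IOPS, 30,000 credits.
for served, credits in simulate_burst_bucket([200] * 5, 100, 30000):
    print(served, credits)
```

In this sketch the demand is fully served while credits remain. Once the bucket is empty, the served rate falls to the baseline even though demand is unchanged, which is the point at which queries begin to queue.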

Actions to Prevent Recurrence

  • Monitoring of burst IO capacity will be improved to prevent burst credits from running out.
  • The database cleaner will be optimized so that it does not cause such high database loads.
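
As an illustration of the first action, the following is a minimal sketch of a burst-credit alert that fires either when the remaining balance is low or when the recent drain rate suggests the credits will run out soon. The function names, the sample format and the thresholds are hypothetical and do not describe Memsource's actual monitoring.

```python
def minutes_until_exhaustion(samples):
    """Estimate minutes until the credit balance reaches zero.

    samples: list of (minute, balance_percent) pairs, oldest first.
    Returns None when the balance is not decreasing.
    """
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0 or b1 >= b0:
        return None
    drain_per_minute = (b0 - b1) / (t1 - t0)
    return b1 / drain_per_minute


def should_alert(samples, min_balance_percent=30.0, min_minutes_left=120.0):
    """Alert on a low balance or on a projected exhaustion (assumed thresholds)."""
    balance = samples[-1][1]
    eta = minutes_until_exhaustion(samples)
    return balance < min_balance_percent or (eta is not None and eta < min_minutes_left)


# A balance falling from 50% to 35% in 30 minutes would be empty in about 70 minutes.
print(should_alert([(0, 50.0), (30, 35.0)]))
```

Alerting on the projected time to exhaustion, and not only on the current balance, gives warning while the balance still looks healthy but is draining steadily, as it did overnight in this incident.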

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks on improving their understanding of the incident and determining how to make changes that improve our services and processes.

Posted Oct 08, 2021 - 06:39 CEST

Resolved
This incident has been resolved.
Posted Oct 07, 2021 - 09:44 CEST
Identified
The issue has been identified and a fix is being implemented. We will be monitoring the results.
Posted Oct 07, 2021 - 09:31 CEST
Update
We are continuing to investigate problems with degraded performance of Memsource components.
Posted Oct 07, 2021 - 09:11 CEST
Update
We are continuing to investigate this issue.
Posted Oct 07, 2021 - 08:45 CEST
Investigating
We are currently investigating problems with degraded performance of Memsource components.
Posted Oct 07, 2021 - 08:39 CEST
This incident affected: Memsource (SLA) (API, Editor for Web, File Processing, Machine Translation, Project Management, Term Base, Translation Memory).