Performance Disruption of All Memsource Components Between 6:27 and 6:56 AM CET
Incident Report for Memsource
Postmortem

Root Cause Analysis

Introduction

We would like to share more details about the events that occurred with Memsource between

  • 10:55 AM CET to 11:15 AM CET on November 24th,
  • 06:55 AM CET to 07:25 AM CET on November 25th,
  • 03:38 AM CET to 03:53 AM CET,
  • 06:27 AM CET to 06:56 AM CET, and
  • 07:05 AM CET to 07:30 AM CET on November 26th,

 which led to a gradual outage of the Project Management service, and to explain what Memsource engineers are doing to prevent these issues from happening again.

Timeline - 24th November 2021

10:55 AM CET: Automated monitoring triggers an alert indicating slow response times of the Project Management service. Memsource engineers start investigating the problem.

It is immediately recognized as a high-load DB problem impacting all Memsource services.

11:07 AM CET: Affected servers are restarted to free up used memory and reclaim DB connections.

11:15 AM CET: Memsource services are returning to normal. Memsource engineers disable some servers to speed up the service recovery. Memsource is operational but may be slower for some users.

11:27 AM CET: A runaway script from one customer is identified as the source of problems. The problematic API endpoint is cut off for the specific user and the customer is contacted.

11:57 AM CET: All servers are recovered and Memsource service responsiveness returns to normal.

Timeline - 25th November 2021

06:55 AM CET: Automated monitoring triggers alerts indicating high DB usage and slow response times.

06:58 AM CET: Memsource on-duty engineers quickly recognize similar patterns and investigate the queries and endpoints overloading the DB.

07:15 AM CET: The API user overloading the system is identified and the API is manually disabled for that user. Affected servers are restarted to reclaim memory and DB connections. Some users are still impacted by slow response times and elevated error rates.

07:25 AM CET: Response times and error rates are back to normal levels.

Timeline - 26th November 2021

03:38 AM CET: Automated monitoring triggers alerts indicating similar problems: slow response times, an elevated response error rate, and an overloaded database.

03:46 AM CET: Memsource on-duty engineers start investigating the issue.

03:53 AM CET: The running queries causing the database overload are cancelled, their respective API calls are blocked for the related user, and the customer is contacted. Database load drops immediately; response times and error rates fall back to acceptable levels for most users.

06:27 AM CET: Automated monitoring raises alarms for an elevated response error rate and slow response times again.

06:32 AM CET: Memsource on-duty engineers identify the problematic queries and API endpoint, and the user is blocked. Most requests now succeed; however, a small number of users still experience slow response times. Servers are removed from load balancing and gradually restarted.

06:56 AM CET: All Memsource services are running with usual response times.

Root Cause

In all cases the root cause was similar: certain project-related queries overloaded the capacity of the database server. The specific queries, users, and API endpoints differed between the incidents.

Shortly after the number of problematic requests reached a critical threshold, the database server hit maximum capacity and started blocking other queries. This led to a cascade of timeouts and errors, and eventually to degraded performance of all Memsource components, as the majority of operations depend on the project database.

The queries tied up the database server with an unexpectedly high number of parallel executions; in some cases the queries themselves were suboptimal, and in others the volume of data in specific tables had changed so dramatically that the database server could no longer plan query execution effectively.
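The cascade described above can be illustrated with a minimal, hypothetical sketch (the pool size, timeouts, and query mix below are invented for illustration, not Memsource's actual configuration): once a handful of long-running queries hold every connection in a bounded pool, unrelated fast queries time out while waiting for a connection, even though the fast queries themselves are cheap.

```python
import threading
import time

POOL_SIZE = 4          # hypothetical DB connection-pool limit
ACQUIRE_TIMEOUT = 0.2  # how long a request waits for a free connection

pool = threading.BoundedSemaphore(POOL_SIZE)
results = []
lock = threading.Lock()

def handle_request(name, query_seconds):
    # Try to obtain a DB connection; give up if the pool stays exhausted.
    if not pool.acquire(timeout=ACQUIRE_TIMEOUT):
        with lock:
            results.append((name, "timeout"))
        return
    try:
        time.sleep(query_seconds)  # simulated query execution time
        with lock:
            results.append((name, "ok"))
    finally:
        pool.release()

# Four runaway queries claim every connection in the pool...
slow = [threading.Thread(target=handle_request, args=(f"slow-{i}", 1.0))
        for i in range(POOL_SIZE)]
for t in slow:
    t.start()
time.sleep(0.05)  # let the runaway queries grab the connections first

# ...so ordinary fast queries time out instead of running.
fast = [threading.Thread(target=handle_request, args=(f"fast-{i}", 0.01))
        for i in range(6)]
for t in fast:
    t.start()

for t in slow + fast:
    t.join()

print(sorted(results))
```

The same dynamic applies regardless of the pool implementation: the fast queries fail not because they are expensive, but because the shared resource is saturated, which is why every Memsource component degraded at once.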

Actions to Prevent Recurrence

  • Capacity of the project database server was increased to handle such peaks and parallel execution of such queries was updated to improve performance and fairness.
  • Queries and API endpoint code are being updated to run on database slave nodes.
  • Queries are being analysed and rewritten to maintain expected performance with different sets of data.
  • Database servers are being upgraded with new configurations to significantly improve capacity and performance for sub-optimal queries.
  • Monitoring of API endpoints and queries is being improved to have more granular data about in-flight requests and to provide better information on possible bottlenecks.
  • Database monitoring is being updated to contain more metrics about potentially dangerous queries.
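Two of the guardrails implied above, fair sharing of capacity among users and the manual per-user cut-off applied during the incidents, can be sketched as a per-user token bucket with a kill switch. Everything here (the class name, rates, and user IDs) is a hypothetical illustration, not Memsource's actual implementation:

```python
import time
from collections import defaultdict

class PerUserLimiter:
    """Token-bucket limiter with a manual per-user kill switch.

    Illustrative only: the rate and burst numbers are invented,
    not Memsource's real limits.
    """

    def __init__(self, rate_per_sec=5.0, burst=10.0):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)  # each user starts with a full bucket
        self.last = {}
        self.blocked = set()

    def block(self, user):
        # Manual cut-off, like disabling the API for one user in the timeline.
        self.blocked.add(user)

    def allow(self, user, now=None):
        if user in self.blocked:
            return False
        now = time.monotonic() if now is None else now
        elapsed = now - self.last.get(user, now)
        self.last[user] = now
        # Refill tokens proportionally to elapsed time, capped at the burst size.
        self.tokens[user] = min(self.burst, self.tokens[user] + elapsed * self.rate)
        if self.tokens[user] >= 1.0:
            self.tokens[user] -= 1.0
            return True
        return False

limiter = PerUserLimiter(rate_per_sec=5.0, burst=3.0)
# A runaway script fires 10 requests at the same instant:
decisions = [limiter.allow("runaway-script", now=100.0) for _ in range(10)]
print(decisions.count(True))   # → 3: only the burst allowance gets through
limiter.block("runaway-script")
print(limiter.allow("runaway-script", now=200.0))  # → False: cut off entirely
```

A limiter like this caps how much database pressure any single user can generate, while the kill switch preserves the fast manual mitigation the on-duty engineers used during all three incidents.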

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and determine how to make changes that improve our services and processes.

Posted Dec 02, 2021 - 13:07 CET

Resolved
We detected a higher database load at 6:27 AM CET, which caused a short performance disruption of all Memsource components. By 6:56 AM CET, everything was back to normal. Our engineers are investigating the root cause, and a postmortem of the incident will be published in the next few days.
Posted Nov 26, 2021 - 07:20 CET
This incident affected: Memsource TMS (EU) (API, Editor for Web, File Processing, Machine Translation, Project Management, Term Base, Translation Memory).