Gradually Degraded Performance of All Memsource Components Between 12:58 and 14:42 CEST
Incident Report for Memsource
Postmortem

Root Cause Analysis

14 October 2021

Introduction

We would like to share more details about the events that occurred at Memsource between 12:58 and 14:42 CEST on October 14, 2021, which led to gradually degraded performance of the Project Management component, and to explain what Memsource engineers are doing to prevent these issues from happening again.

Timeline

12:58 CEST: Automated monitoring triggers an alert indicating slow response times of the Project Management component. Memsource engineers start investigating the problem.

13:02 CEST: Slowness in the Project Management component begins to affect the responsiveness of other Memsource components.

13:18 CEST: High database load is identified as the cause of slow responsiveness; some servers are reconfigured to disable unnecessary database requests to decrease the database load.

13:50 CEST: Memsource components are returning to normal. Memsource engineers disable some servers to speed up their recovery. Memsource is operational but may be slower for some users.

14:05 CEST: Memsource engineers commence a controlled restart of some servers to speed up their recovery.

14:36 CEST: All servers are recovered; responsiveness of the Memsource components returns to normal.

Root Cause

The database server was close to its configured global query capacity when incoming user requests triggered many resource-intensive database queries. This pushed the database server over its available capacity, slowing down the processing of requests and exhausting the connection pool for a short period of time. Pending requests that accumulated during this time were processed as database capacity was gradually freed, which slowed down the recovery of the system. The system fully recovered after processing all pending requests.
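The backlog dynamics described above can be illustrated with a toy model. This is not Memsource's actual database logic; the capacity and arrival numbers are invented purely for illustration:

```python
def simulate_backlog(capacity, arrivals):
    """Toy model of a database with a fixed per-tick work capacity.

    Each tick, newly arrived work joins a pending backlog; the server
    completes at most `capacity` units per tick. Once a burst exceeds
    capacity, the backlog persists for several ticks after arrivals
    return to normal -- which is why the system recovered only after
    all pending requests had been processed.
    """
    backlog = 0
    history = []
    for arrived in arrivals:
        backlog += arrived                 # pending requests accumulate
        backlog -= min(backlog, capacity)  # capacity is freed each tick
        history.append(backlog)
    return history

# A burst of expensive queries at tick 3, then normal load again:
print(simulate_backlog(capacity=10, arrivals=[5, 5, 30, 5, 5, 5, 5]))
# -> [0, 0, 20, 15, 10, 5, 0]
```

Note that although arrivals return to normal immediately after the burst, the backlog takes four more ticks to drain, mirroring the gradual recovery between 13:50 and 14:36.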

Actions to Prevent Recurrence

  • Identify and optimize slow database queries and queries with a high memory footprint.
  • Redistribute the database load between the master and slave database nodes, or scale the database nodes vertically to keep up with the load.

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and determine how to make changes that improve our services and processes.

Posted Oct 20, 2021 - 10:40 CEST

Resolved
The response times remain stable. The incident has been resolved.
Posted Oct 14, 2021 - 14:36 CEST
Monitoring
The response times should be back to normal. We keep monitoring the situation.
Posted Oct 14, 2021 - 14:17 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 14, 2021 - 13:54 CEST
Investigating
Since 12:58 CEST, Memsource has been experiencing high response times. We are investigating the root cause of the issue.
Posted Oct 14, 2021 - 13:40 CEST
This incident affected: Memsource TMS (EU) (API, Editor for Web, File Processing, Machine Translation, Project Management, Term Base, Translation Memory).