Degraded Performance of API, Editor for Web and Project Management Services between 3:00 and 3:42 PM CET

Incident Report for Phrase (formerly Memsource)

Postmortem

Introduction

We would like to share more details about the events that affected Memsource between 3:00 and 3:40 PM CET on the 15th of July, 2020, which led to a partial disruption of Memsource services, and about what Memsource engineers are doing to prevent issues of this kind from happening again.

Timeline

3:01 PM CET: First non-critical alerts reported a slightly higher error rate on some of the Project Management servers. SW engineers started investigating and disabled some of the servers on the load balancers.

3:14 PM CET: Critical alerts reported problems with the availability of the Project Management service. SW engineers asked infrastructure engineers for help, as a number of errors seemed to be network-related.

3:16 PM CET: Most servers stopped accepting new user requests.

3:20 PM CET: Engineers decided to restart the servers to help the service return to regular operation.

3:25 PM CET: One server was restarted and recovered, so engineers proceeded to restart the remaining servers.

3:36 PM CET: Support agents reported that the Project Management service was operable, but slow.

3:41 PM CET: The load balancer ran out of memory and was restarted.

3:45 PM CET: All servers had been restarted, and Memsource was back at its usual response times.

Root cause

A connection reset to the database at 2:59:58 PM CET triggered a stack overflow error on one of the servers. This led to an infinite loop in which the Project Management application servers exhausted the connection pool to the database and generated a heavy write load on the log file. The same problem occurred on 10 of the 12 servers within the next minute.
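To illustrate the general failure class described above, here is a minimal sketch of one common way to keep a connection-reset handler from recursing or looping without limit. It is not the Memsource code or fix, which this report does not show. The exception class `ConnectionResetByDatabase`, the function `run_with_bounded_retry`, and all parameter values are hypothetical names chosen for the example.

```python
import time


class ConnectionResetByDatabase(Exception):
    """Hypothetical stand-in for a driver-specific connection-reset error."""


def run_with_bounded_retry(operation, max_attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call operation(), retrying on connection resets a limited number of times.

    The retry is an iterative loop, so a persistent failure cannot grow the
    call stack, and it gives up after max_attempts instead of retrying forever.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionResetByDatabase as exc:
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff between attempts: 0.1 s, 0.2 s, 0.4 s, ...
                sleep(base_delay_s * 2 ** (attempt - 1))
    # Surface one error to the caller rather than looping and logging endlessly.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The design choice in this sketch is that the handler fails fast after a fixed number of attempts, which bounds both the number of connections requested from the pool and the number of error messages written to the log.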

Actions to Prevent Recurrence

In response to the incident:

SW engineers identified the problematic part of the code and implemented a safeguard against the stack overflow in that sensitive area.
We will investigate why the load balancer ran out of memory (OOM) and implement countermeasures.
We will improve our monitoring to detect unusual spikes in the number of messages written to log files.
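
As an illustration of the last item, a log-volume spike can be detected with a sliding-window counter. The sketch below is an assumption about how such a check might look, not a description of Memsource's monitoring. The class `LogRateMonitor`, its parameters, and the threshold values are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class LogRateMonitor:
    """Flags a spike when the log lines seen in a sliding window exceed a threshold."""

    def __init__(self, window_s=60.0, max_lines=1000):
        self.window_s = window_s      # length of the sliding window in seconds
        self.max_lines = max_lines    # lines per window considered normal
        self._timestamps = deque()

    def record(self, timestamp_s):
        """Record one log line; return True if the window now exceeds the threshold."""
        self._timestamps.append(timestamp_s)
        # Drop entries that have fallen out of the window.
        cutoff = timestamp_s - self.window_s
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_lines
```

In use, each log line's timestamp would be passed to `record`, and a `True` result would raise an alert. A fixed threshold is the simplest choice; a real deployment might instead compare against a learned baseline.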

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.

Posted Jul 17, 2020 - 11:45 CEST

Resolved

A fix has been implemented and we are monitoring the results.

Posted Jul 15, 2020 - 16:20 CEST

Identified

The engineering team has identified the issue and the speed of the components is getting back to normal.

Posted Jul 15, 2020 - 15:47 CEST

Investigating

Our engineers are currently investigating why Memsource services are not available for our users.

Posted Jul 15, 2020 - 15:00 CEST

This incident affected: Memsource TMS (EU) (API, Editor for Web, File Processing, Machine Translation, Project Management, Term Base, Translation Memory).