Degraded Performance of the File Processing Component Between 3:30 and 5:06 PM CET
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred at Memsource between 03:05 PM CET and 05:06 PM CET on November 29th, 2021, which led to a gradual outage of the Project Management service, and what Memsource engineers are doing to prevent these issues from happening again.

Timeline

03:05 PM CET: Automated monitoring triggers an alert indicating a large number of asynchronous requests in the processing queue. Memsource engineers start investigating the problem.

The initial investigation suggests that there is an unusually large number of incoming asynchronous requests sent by various customers at the same time; requests are being processed correctly. Engineers monitor the situation.

03:22 PM CET: The number of received asynchronous requests keeps increasing, which slows down request processing for Memsource users.

03:36 PM CET: A customer’s integration is identified as the source of the large number of requests. Memsource engineers cut off the integration by disabling the API endpoints for the user. Memsource support agents contact the customer.

The number of asynchronous requests stabilizes. The system is operational and queued requests are being processed. Some customers report that the File Processing service is slower than usual.

04:00 PM CET: Memsource engineers identify that the queued requests created by the cut-off customer integration are duplicates and that their slow processing causes a high database load. Memsource engineers manually terminate the problematic requests to unblock the processing of other users’ requests and speed up the File Processing service.

04:59 PM CET: The high database load slows down the processing of user requests, which leads to the exhaustion of the database connection pool. The Memsource service becomes unavailable for a short period of time.

05:03 PM CET: The database connections are freed and all services slowly return to normal.

05:06 PM CET: The Memsource service is fully operational and responsive.

Root Cause

A large number of asynchronous requests sent in a short period of time by a customer’s broken integration significantly degraded the performance of the File Processing service. Memsource engineers manually terminated the duplicate asynchronous requests sent by the broken integration. Finalization of the huge number of terminated requests resulted in a high database load and exhausted the connection pool, which led to a short outage of the Memsource service.

Actions to Prevent Recurrence

  • A per-organization limit on the number of concurrent asynchronous requests was introduced.
  • Slow database queries related to the processing of asynchronous requests will be identified and optimized to prevent exhaustion of the database connection pool.
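
To illustrate the first action, the following is a minimal sketch of how a per-organization cap on concurrent asynchronous requests could work. It is not Memsource's actual implementation; the class name PerOrgLimiter, its methods, and the limit value are hypothetical and chosen only for illustration.

```python
import threading
from collections import defaultdict


class PerOrgLimiter:
    """Hypothetical limiter: caps in-flight asynchronous requests per organization."""

    def __init__(self, max_concurrent: int):
        self._max = max_concurrent          # assumed limit, same for every organization
        self._active = defaultdict(int)     # org_id -> number of requests in flight
        self._lock = threading.Lock()

    def try_acquire(self, org_id: str) -> bool:
        """Reserve a slot for org_id; return False if the organization is at its limit."""
        with self._lock:
            if self._active[org_id] >= self._max:
                return False
            self._active[org_id] += 1
            return True

    def release(self, org_id: str) -> None:
        """Free a slot once a request has finished or been terminated."""
        with self._lock:
            if self._active[org_id] > 0:
                self._active[org_id] -= 1
            if self._active[org_id] == 0:
                del self._active[org_id]    # keep the map from growing without bound


limiter = PerOrgLimiter(max_concurrent=2)
accepted = [limiter.try_acquire("org-a") for _ in range(3)]
other_org_accepted = limiter.try_acquire("org-b")
```

In this sketch, a third concurrent request from "org-a" is rejected while "org-b" is still served, so a single misbehaving integration cannot fill the shared processing queue. A rejected request could be answered with a rate-limit error or deferred; that choice is left open here.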

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted Dec 01, 2021 - 15:51 CET

Resolved
This incident has been resolved.
Posted Nov 29, 2021 - 18:45 CET
Update
This incident has been resolved.
Posted Nov 29, 2021 - 18:36 CET
Update
This incident has been resolved.
Posted Nov 29, 2021 - 18:21 CET
Update
We are continuing to work on a fix for this issue.
Posted Nov 29, 2021 - 18:02 CET
Update
We are continuing to work on a fix for this issue.
Posted Nov 29, 2021 - 17:08 CET
Identified
Some users are experiencing decreased responsiveness of Memsource components. We have identified the reason for this degraded performance and are resolving the issue.
Posted Nov 29, 2021 - 17:08 CET
This incident affected: Memsource TMS (API, Editor for Web, File Processing, Machine Translation, Project Management, Term Base, Translation Memory).