Degraded Performance of Memsource TMS (EU) Project Management between 10:03 CEST and 10:19 CEST
Incident Report for Phrase (formerly Memsource)
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource between 10:03 AM CEST and 10:19 AM CEST on April 6th, 2022, which led to degraded performance of Memsource TMS (EU) Project Management, and what Memsource engineers are doing to prevent these issues from happening again.

Timeline

10:03 AM CEST: An internal configuration property is deleted. An audit record is created in the monitoring system. The request error rate goes up immediately.

10:15 AM CEST: Alerts from the automated monitoring system start reaching the engineers on duty, who quickly analyze the error details.

10:19 AM CEST: Memcached servers acting as a second level cache are restarted. The request error rate returns to normal values.

Root Cause

Internal configuration properties are injected into some pages used by the client-side JavaScript code. There are 7 such pages.

Due to a race condition, a configuration property that was being deleted was removed from the database but not completely removed from the second level cache. The cached result set for the query that lists all configuration properties still included the deleted property, which led to failures on those 7 pages.
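
The sketch below illustrates one way such a stale cached listing can arise. It is a simplified, single-process model and not Memsource's actual code: the property names, the `LIST_KEY` cache key, and the exact interleaving are assumptions made for illustration, and the plain dictionaries only stand in for the database and the second level cache.

```python
# Hypothetical, simplified model; all names are illustrative assumptions.
db = {"feature.a": "on", "feature.b": "off"}  # stands in for the database
cache = {}                                    # stands in for the second level cache

LIST_KEY = "query:all-property-names"         # assumed key of the cached listing


def read_names_from_db():
    """Run the 'list all configuration properties' query against the database."""
    return sorted(db)


def delete_property(name):
    """Delete a property and invalidate the cached listing."""
    del db[name]
    cache.pop(LIST_KEY, None)


def render_page():
    """Inject all listed properties into a page, loading each one by name."""
    names = cache.get(LIST_KEY)
    if names is None:
        names = read_names_from_db()
        cache[LIST_KEY] = names
    # A name that is listed but no longer in the database raises KeyError.
    return {name: db[name] for name in names}


def page_fails():
    try:
        render_page()
        return False
    except KeyError:
        return True


# One interleaving that leaves a stale listing in the cache:
stale = read_names_from_db()   # 1. a reader misses the cache and queries the database
delete_property("feature.b")   # 2. a writer deletes the row and invalidates the cache
cache[LIST_KEY] = stale        # 3. the reader stores its now outdated result

failed_before_restart = page_fails()
cache.clear()                  # analogous to restarting the cache servers
failed_after_restart = page_fails()
```

In this model the page keeps failing until the cached listing is discarded, which matches the recovery seen in the timeline once the Memcached servers were restarted.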

Actions to Prevent Recurrence

  • Pages will only include configuration properties relevant to them. We’ll stop including all such properties so there won’t be a need for the problematic query at all.
  • Monitoring will be adjusted so that we are notified earlier about similar situations in the future. All the necessary metrics are already in place; we only need to add new alert triggers.
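
The two actions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation Memsource engineers will ship: the page names, the `PAGE_PROPERTIES` mapping, the choice to skip a missing property, and the window and threshold values are all hypothetical.

```python
from collections import deque

# Hypothetical mapping of pages to the property names each one needs.
PAGE_PROPERTIES = {
    "project-list": ["feature.a"],
    "project-detail": ["feature.a", "feature.b"],
}


def properties_for_page(page, store):
    """Look up only the properties a page declares, one key at a time.

    No 'list all properties' query is involved. A property missing from the
    store is skipped instead of failing the page (an assumed design choice).
    """
    wanted = PAGE_PROPERTIES.get(page, [])
    return {name: store[name] for name in wanted if name in store}


class ErrorRateAlert:
    """Fires when the share of errors in the last `window` requests exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, is_error):
        """Record one request outcome and report whether the alert should fire."""
        self.outcomes.append(bool(is_error))
        return self.should_alert()

    def should_alert(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # wait until the window is full
        return sum(self.outcomes) / len(self.outcomes) > self.threshold


# Usage: "feature.b" has been deleted, yet the page still renders.
store = {"feature.a": "on"}
page_props = properties_for_page("project-detail", store)

# Usage: 7 successful requests followed by 3 errors in a window of 10.
alert = ErrorRateAlert(window=10, threshold=0.2)
fired = [alert.record(i >= 7) for i in range(10)]
```

A trigger on the request error rate over a short window is one way to shorten the gap between the first errors and the first alert; suitable window and threshold values depend on normal traffic and are not stated in this report.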

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and determine how to make changes that improve our services and processes.

Posted Apr 11, 2022 - 17:00 CEST

Resolved
This incident has been resolved. At 10:03 CEST, there was a disruption in rendering a few pages within Project Management. The issue was resolved at 10:19 CEST.
Posted Apr 06, 2022 - 10:19 CEST