Degraded Performance of Editor for Web and Term Base Components
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred at Memsource between 11:30 AM and 1:00 PM CET on May 12, 2021, which led to degraded performance of the Editor for Web and Term Base functionality, and about what Memsource engineers are doing to prevent issues of this kind from happening again.

Timeline

11:05 AM CET: Planned maintenance of the Term Base database (Elasticsearch) starts. 

11:37 AM CET: Automated monitoring reports increased response time for user requests. 

11:40 AM CET: Ongoing maintenance is identified as the cause of the increased response time. 

11:51 AM CET: The non-operational Term Base functionality negatively affects Quality Assurance checks and Autocomplete functions in Memsource Editors.

11:55 AM CET: The recovery of Term Base functionality is in progress. Some Elasticsearch nodes become overloaded; the recovery process moves customer data from overloaded to healthy nodes. 

1:00 PM CET: Term Base functionality is recovered and automated monitoring reports the system as fully operational.

Root Cause

The Term Base Elasticsearch cluster did not have enough heap memory available during a rolling restart. This caused a cascade effect across nodes and a long recovery time.

Actions to Prevent Recurrence

In response to this incident:

  • The size of the Elasticsearch cluster will be increased to provide more overall heap space.
  • Additional monitoring is being established to provide an overview of the ratio between the heap space that can be freed by garbage collection, the used heap space, and the number of nodes in the cluster.
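
As an illustration of the kind of check such monitoring could support, the following is a minimal sketch and not Memsource's actual tooling. It assumes a simplified model in which the heap still occupied after garbage collection ("live" heap) must fit on the nodes that stay online while others restart. The names NodeHeap, live_heap_ratio_with_nodes_down and safe_to_restart, the reclaimable-heap estimate, and the 0.75 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeHeap:
    """Heap figures for one node, in bytes (illustrative inputs only)."""
    name: str
    heap_max: int
    heap_used: int
    reclaimable: int  # assumed estimate of what garbage collection could free

    @property
    def live(self) -> int:
        # Heap expected to stay occupied after garbage collection.
        return max(self.heap_used - self.reclaimable, 0)


def live_heap_ratio_with_nodes_down(nodes: List[NodeHeap], nodes_down: int = 1) -> float:
    """Cluster-wide live heap divided by the heap capacity that remains
    when the `nodes_down` largest nodes are offline (worst case)."""
    if nodes_down >= len(nodes):
        raise ValueError("at least one node must stay online")
    total_live = sum(n.live for n in nodes)
    capacities = sorted((n.heap_max for n in nodes), reverse=True)
    remaining_capacity = sum(capacities[nodes_down:])
    return total_live / remaining_capacity


def safe_to_restart(nodes: List[NodeHeap], nodes_down: int = 1,
                    threshold: float = 0.75) -> bool:
    """True if the simplified model leaves headroom below the (assumed) threshold."""
    return live_heap_ratio_with_nodes_down(nodes, nodes_down) <= threshold


GiB = 1024 ** 3

# Three nodes, each with 8 GiB of heap, 6 GiB used, 1 GiB reclaimable.
small_cluster = [NodeHeap(f"node-{i}", 8 * GiB, 6 * GiB, 1 * GiB) for i in range(3)]

# The same 15 GiB of live heap spread over five nodes.
large_cluster = [NodeHeap(f"node-{i}", 8 * GiB, 4 * GiB, 1 * GiB) for i in range(5)]
```

Under these assumed figures, the three-node cluster would need 15 GiB of live heap to fit into 16 GiB of remaining capacity during a rolling restart, which the check flags as unsafe. The five-node cluster leaves 32 GiB for the same 15 GiB and passes. Real heap behavior depends on shard allocation and workload, so a ratio like this is only a coarse early-warning signal.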

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.

Posted May 13, 2021 - 16:09 CEST

Resolved
We have resolved the degraded performance of Editor for Web and Term Base components.
Posted May 12, 2021 - 13:00 CEST
Identified
We are experiencing an unexpected performance degradation due to planned system maintenance.
Posted May 12, 2021 - 11:30 CEST
This incident affected: Memsource (SLA) (Editor for Web, Term Base).