Degraded Performance of Translation Memory Service
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource on April 12, 2021, between 14:35 and 15:20 CEST, which led to a partial disruption of the Translation Memory service, and about what Memsource engineers are doing to prevent such issues from happening again.

Timeline

12:00 CEST - Automated monitoring reports a failure of a Translation Memory backup process on one of the Elasticsearch nodes.

12:45 CEST - The root cause of the backup failure is identified; the process of deleting the corrupted Elasticsearch snapshots starts.

14:30 CEST - The deletion process becomes stuck on the Elasticsearch master node.

14:32 CEST - The problematic node of the cluster is manually restarted.

14:35 CEST - Automated monitoring of the Translation Memory service reports an increased error rate; the service is inoperative for most users.

14:45 CEST - The Elasticsearch cluster cannot recover after the master node restart; the election of a new master node results in an election loop.

14:50 CEST - The Translation Memory service is disabled to reduce the number of requests sent to the Elasticsearch cluster and help it recover.

15:00 CEST - The incident response team restarts all nodes of the Elasticsearch cluster.

15:10 CEST - After the controlled restart, the cluster successfully elects a new master and data shard recovery begins.

15:15 CEST - The Translation Memory service and its data are available for most users.

15:20 CEST - All data are confirmed available and consistent.
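For context, the snapshot cleanup that started at 12:45 CEST corresponds to Elasticsearch's standard snapshot API. The following is a minimal sketch only; the repository and snapshot names are hypothetical, and the commands assume a cluster reachable on localhost:9200:

```shell
# List the snapshots in a repository to identify the corrupted ones
# ("tm_backups" and "snapshot_2021_04_12" are placeholder names).
curl -s "http://localhost:9200/_snapshot/tm_backups/_all?pretty"

# Delete a corrupted snapshot; Elasticsearch removes the snapshot
# metadata and any data files not referenced by other snapshots.
curl -s -X DELETE "http://localhost:9200/_snapshot/tm_backups/snapshot_2021_04_12"
```

Snapshot deletion is coordinated by the elected master node, which is consistent with the deletion process getting stuck on the master.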

Root Cause

After the master node restart, the Elasticsearch cluster was not able to elect a new master, and the election process resulted in an election loop. This left the Translation Memory data unavailable and caused Translation Memory requests to fail with errors. The Elasticsearch cluster required manual intervention to recover.
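During an election loop like this, the cluster state can be inspected with Elasticsearch's health and cat APIs. A sketch of the checks involved (assuming a node reachable on localhost:9200):

```shell
# Cluster health: during an election loop the status is typically "red",
# and requests may fail with a master_not_discovered_exception.
curl -s "http://localhost:9200/_cluster/health?pretty"

# Show the currently elected master node (empty if no master is elected).
curl -s "http://localhost:9200/_cat/master?v"
```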

Actions to Prevent Recurrence

  • We will change our Elasticsearch cluster topology to be more resilient to master node loss.
  • We are updating our internal documentation on Elasticsearch cluster handling to improve resolution time.
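As an illustration of the first action item: a common way to make a cluster more resilient to master loss is to run three or more dedicated master-eligible nodes, so that a quorum survives the loss of any single node. A hypothetical elasticsearch.yml fragment for one such dedicated master node (Elasticsearch 7.x settings; all names are placeholders, not Memsource's actual configuration):

```yaml
# elasticsearch.yml on a dedicated master-eligible node
node.name: tm-master-1
node.master: true    # eligible to be elected master
node.data: false     # holds no data shards
node.ingest: false   # runs no ingest pipelines

# Master-eligible nodes used when bootstrapping the cluster (7.x+);
# with three master-eligible nodes, the voting quorum tolerates
# the loss of any one of them.
cluster.initial_master_nodes: ["tm-master-1", "tm-master-2", "tm-master-3"]
```

Separating the master role from data nodes also keeps heavy operations such as snapshot management from competing with query traffic on the same node.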

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working over the coming days and weeks to deepen their understanding of the incident and to determine how to change our services and processes for the better.

Posted Apr 13, 2021 - 15:52 CEST

Resolved
We have resolved the degraded performance of the Translation Memory service.
Posted Apr 12, 2021 - 15:20 CEST
Identified
We have identified a cause and we are working on restoring the service.
Posted Apr 12, 2021 - 15:11 CEST
Investigating
The Translation Memory service is experiencing performance issues. We are currently investigating the cause and will update the status page accordingly. We apologize for the inconvenience.
Posted Apr 12, 2021 - 14:35 CEST
This incident affected: Memsource (SLA) (Translation Memory).