We would like to share more details about the events that occurred with Memsource between 14:35 CEST and 15:20 CEST on April 12, 2021, which led to a partial disruption of the Translation Memory service, and what Memsource engineers are doing to prevent issues of this kind from happening again.
12:00 PM CEST - Automated monitoring reports a failure of a Translation Memory backup process on one of the Elasticsearch nodes.
12:45 PM CEST - The root cause of the backup failure is identified; the process of deleting corrupted Elasticsearch snapshots starts.
2:30 PM CEST - The deletion process becomes stuck on the Elasticsearch master node.
2:32 PM CEST - The problematic node of the cluster is manually restarted.
2:35 PM CEST - Automated monitoring of the Translation Memory service reports an increased error rate; the service is inoperative for most users.
2:45 PM CEST - The Elasticsearch cluster cannot recover after the master node restart; the election of a new master node results in an election loop.
2:50 PM CEST - The Translation Memory service is disabled to help the Elasticsearch cluster recover by decreasing the number of requests sent to the cluster.
3:00 PM CEST - The incident response team restarts all nodes of the Elasticsearch cluster.
3:10 PM CEST - After a controlled restart, the cluster successfully elects a new master and data shard recovery starts.
3:15 PM CEST - The Translation Memory service and data are available for most users.
3:20 PM CEST - All data are confirmed available and consistent.
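To give a sense of the kind of check that confirms a recovery like the one above, the sketch below (an illustrative example, not Memsource's actual monitoring code) interprets a payload shaped like the response of Elasticsearch's `GET /_cluster/health` API to decide whether shard recovery has completed:

```python
# Illustrative sketch only -- not Memsource's actual monitoring code.
# It interprets a payload with the fields returned by Elasticsearch's
# GET /_cluster/health API to decide whether recovery is complete.

def cluster_recovered(health: dict) -> bool:
    """Return True when the cluster reports all shards assigned and active."""
    return (
        health.get("status") == "green"           # all primary and replica shards active
        and health.get("unassigned_shards", 1) == 0
        and health.get("relocating_shards", 1) == 0
    )

# Sample payloads mirroring the fields _cluster/health returns.
health_during_recovery = {
    "status": "yellow",        # primaries active, some replicas still unassigned
    "unassigned_shards": 12,
    "relocating_shards": 0,
}

health_after_recovery = {
    "status": "green",
    "unassigned_shards": 0,
    "relocating_shards": 0,
}

print(cluster_recovered(health_during_recovery))  # False
print(cluster_recovered(health_after_recovery))   # True
```

A "yellow" status at 3:15 PM would be consistent with the service already being usable while replica shards were still recovering; the 3:20 PM confirmation corresponds to the cluster returning to "green".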
After the master node restart, the Elasticsearch cluster was unable to elect a new master, and the election process resulted in an election loop. This left Translation Memory data unavailable and caused Translation Memory requests to fail with errors. The Elasticsearch cluster required manual intervention to recover.
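For context on why master election can loop: in Elasticsearch 7.x, election is quorum-based among the master-eligible nodes declared in the cluster configuration. The fragment below is a hedged sketch with placeholder node names (we are assuming a 7.x-style deployment; this is not Memsource's production configuration):

```yaml
# Illustrative elasticsearch.yml fragment -- placeholder node names,
# not Memsource's production configuration. In Elasticsearch 7.x,
# master election requires a quorum of master-eligible nodes.
cluster.name: tm-cluster                # hypothetical cluster name
node.name: es-node-1                    # hypothetical node name
discovery.seed_hosts: ["es-node-1", "es-node-2", "es-node-3"]
cluster.initial_master_nodes: ["es-node-1", "es-node-2", "es-node-3"]
```

If master-eligible nodes cannot agree on a quorum, for example after an unclean restart, elections can repeat without converging, which matches the loop observed here; a coordinated restart of all nodes lets the cluster form a quorum cleanly.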
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to deepen our understanding of the incident and to determine what changes will improve our services and processes.