Degraded Performance of the Memsource Editor for Web
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource between 10:06 and 11:28 AM CET on the 18th of December, 2019, which led to the partial disruption of the Editor for Web service, and about what Memsource engineers are doing to prevent these sorts of issues from happening again.

Timeline

10:05 AM CET: Creation of new indices in Elasticsearch becomes too slow; the master node’s garbage collection is too slow for the cluster to recover on its own. Customers on affected servers cannot open new jobs, but jobs that are already open are not impacted.

10:06 AM: Degraded performance of Memsource Editor for Web is noticed. Node monitoring reports Elasticsearch warnings and slower Java garbage collection. The Elasticsearch cluster used for storing open job data reports problems with the creation of new indices and aliases. At this point, most customers have been routed to the affected servers because some editor servers are running a new version deployed in a canary scenario.

10:32 AM: The editor has still not fully recovered, and the slow Elasticsearch cluster makes it effectively unresponsive for most customers. The master node with persistent garbage collection problems is restarted. The relocation of shards is disabled to limit the number of ongoing operations, freeing the cluster to resolve pending tasks.

10:54 AM: Shard relocation is complete. The number of pending tasks has decreased, but problems with creating new indices are not resolved. The slow garbage collection appears on other servers, which further slows down the cluster. Some deprecated data is manually removed to decrease the number of shards and aliases and help the cluster recover faster. The removed data is safely stored in another database, from which it can be loaded by the editor when needed, ensuring no customer data is lost.

11:03 AM: Editor service instances running on affected servers and still accepting user requests are stopped. This further decreases the load on Elasticsearch and provides it an opportunity to recover.

11:11 AM: Most users are redirected to the servers currently hosting the canary deployment of the new editor version, which uses a completely separate and unaffected Elasticsearch cluster. The affected Elasticsearch cluster is restarted and recovers.

11:28 AM: All servers of Memsource Editor for Web are fully operable.
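
The 10:32 AM step, in which shard relocation was disabled so the cluster could work through its pending tasks, can be illustrated with the Elasticsearch cluster settings API. The following is a minimal sketch only: the report does not name the setting that was changed, so the use of cluster.routing.rebalance.enable is an assumption, and the function name is hypothetical. It builds the request without sending it.

```python
import json

# Values Elasticsearch accepts for cluster.routing.rebalance.enable.
REBALANCE_MODES = {"all", "primaries", "replicas", "none"}


def rebalance_settings_request(mode: str) -> dict:
    """Describe a PUT /_cluster/settings request that sets shard rebalancing.

    Assumption: the incident report does not say which setting was used;
    cluster.routing.rebalance.enable is chosen here for illustration.
    """
    if mode not in REBALANCE_MODES:
        raise ValueError(f"unsupported rebalance mode: {mode!r}")
    return {
        "method": "PUT",
        "path": "/_cluster/settings",
        # Transient settings do not survive a full cluster restart.
        "body": {"transient": {"cluster.routing.rebalance.enable": mode}},
    }


if __name__ == "__main__":
    # Print the request that would pause rebalancing.
    print(json.dumps(rebalance_settings_request("none"), indent=2))
```

Calling rebalance_settings_request("all") would describe the request that re-enables rebalancing once the cluster has caught up.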

Root cause

On the 16th of December, we started gradually redirecting about 90% of editor users to a single Elasticsearch cluster in order to prepare the rest of the servers for a canary deployment of a new editor version. The redirection was monitored and had shown no visible impact on the stability of the editor. Within a few days, significantly more indices and aliases than usual accumulated in Elasticsearch, which slowed the creation of new indices. In addition, the redirected users generated extra load that consumed too much heap space and triggered longer “stop-the-world” garbage collection pauses, making index creation and cluster recovery without manual intervention even harder.
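
To make the garbage collection part of this concrete, the sketch below estimates the average old-generation collection time per node from two samples of the Elasticsearch node stats response (GET /_nodes/stats/jvm). It is an illustration under assumptions, not the tooling Memsource used: the function name is hypothetical, and collection time is treated as a rough proxy for pause length.

```python
def average_old_gc_pause_ms(earlier: dict, later: dict) -> dict:
    """Average old-generation GC time per collection between two samples.

    Both arguments are parsed /_nodes/stats/jvm responses. Collection time
    is used as a rough proxy for pause length (an assumption of this sketch).
    """
    pauses = {}
    for node_id, node in later["nodes"].items():
        before = earlier["nodes"].get(node_id)
        if before is None:
            continue  # node was not present in the earlier sample
        old_now = node["jvm"]["gc"]["collectors"]["old"]
        old_then = before["jvm"]["gc"]["collectors"]["old"]
        runs = old_now["collection_count"] - old_then["collection_count"]
        millis = (old_now["collection_time_in_millis"]
                  - old_then["collection_time_in_millis"])
        # Skip nodes with no new collections, or with counters that were
        # reset by a restart between the two samples.
        if runs > 0 and millis >= 0:
            pauses[node.get("name", node_id)] = millis / runs
    return pauses
```

A rising average on the master node over successive samples would correspond to the symptom described above.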

Actions to Prevent Recurrence

As a reaction to the problems:

  • We have modified Elasticsearch settings to make garbage collection more efficient and less disruptive for the service.
  • We are gradually deploying a new version of the editor that does not depend on Elasticsearch aliases, which have proven problematic under certain conditions.
  • We are in the process of updating Elasticsearch to the newest version, which will bring more stability and more efficient garbage collection algorithms.
  • We will avoid redirecting users to a single cluster for an extended period of time until sufficient countermeasures are implemented.
  • We will refine monitoring of our Elasticsearch cluster to be able to detect smaller node performance problems.
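
As an illustration of the kind of check the last point refers to, the sketch below inspects a parsed Elasticsearch cluster health response (GET /_cluster/health) for a growing backlog of pending tasks. The function name and both thresholds are hypothetical and would need tuning for a real cluster; this is not a description of Memsource's actual monitoring.

```python
def health_warnings(health: dict,
                    max_pending: int = 50,
                    max_wait_ms: int = 30_000) -> list:
    """Return human-readable warnings for a /_cluster/health response.

    The thresholds are illustrative assumptions, not recommended values.
    """
    warnings = []
    pending = health.get("number_of_pending_tasks", 0)
    if pending > max_pending:
        warnings.append(f"{pending} pending cluster tasks")
    wait_ms = health.get("task_max_waiting_in_queue_millis", 0)
    if wait_ms > max_wait_ms:
        warnings.append(f"oldest task has waited {wait_ms} ms")
    status = health.get("status", "unknown")
    if status != "green":
        warnings.append(f"cluster status is {status}")
    return warnings
```

Running a check like this on a short interval, alongside per-node garbage collection metrics, is one way to surface slow index creation before users notice it.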

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.

Posted Dec 19, 2019 - 20:58 CET

Resolved
This incident has been resolved.
Posted Dec 18, 2019 - 11:58 CET
Monitoring
The problem has been fixed and we are monitoring the results.
Posted Dec 18, 2019 - 11:35 CET
Update
Memsource engineers are continuing to investigate this issue with the Editor for Web. Please use the Editor for Desktop in the meantime.
Posted Dec 18, 2019 - 11:14 CET
Investigating
We are currently investigating reports from users concerning unavailability of the Editor for Web. If you have trouble accessing it, please download the job as an MXLIFF file and open it in the Editor for Desktop.
Posted Dec 18, 2019 - 10:43 CET
This incident affected: SLA (Editor for Web).