We would like to share more details about the events that occurred with Memsource between 10:06 and 11:28 AM CET on 18 December 2019, which led to a partial disruption of the Editor for Web service, and what Memsource engineers are doing to prevent these sorts of issues from happening again.
10:05 AM CET: The creation of new indices in Elasticsearch becomes too slow; garbage collection on the master node is too slow for the cluster to recover on its own. Customers on the affected servers cannot open new jobs, but jobs that are already open are not impacted.
10:06 AM: Engineers notice degraded performance of Memsource Editor for Web. Node monitoring reports Elasticsearch warnings and slower Java garbage collection. The Elasticsearch cluster used for storing open job data reports problems with the creation of new indices and aliases. At this point, most customers are routed to the affected servers, since the remaining editor servers are running a new version deployed as a canary.
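Cluster-level symptoms like these typically surface in Elasticsearch's health and management endpoints. A minimal sketch of the kind of check involved (the host and port are placeholders, not Memsource's actual topology):

```shell
# Overall cluster status (green/yellow/red), node count, and the number
# of pending cluster-state tasks queued on the master node.
curl -s "http://localhost:9200/_cluster/health?pretty"
```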
10:32 AM: The cluster has still not recovered, and slow Elasticsearch responses make the editor effectively unresponsive for most customers. The master node, with its persistent garbage-collection problems, is restarted. Shard relocation is disabled to limit the number of ongoing operations, freeing the cluster to resolve pending tasks.
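Disabling shard relocation is typically done through the Elasticsearch cluster settings API. A sketch of the kind of requests involved, assuming a placeholder host (the setting name is part of the standard Elasticsearch API; the specific commands Memsource ran are not known):

```shell
# Temporarily stop the cluster from rebalancing shards between nodes,
# so the master can work through its backlog of pending tasks.
# "transient" settings are reset after a full cluster restart.
curl -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.rebalance.enable": "none"}}'

# Once the cluster has recovered, re-enable rebalancing.
curl -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.rebalance.enable": "all"}}'
```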
10:54 AM: Shard relocation is complete. The number of pending tasks has decreased, but the problems with creating new indices are not resolved. Slow garbage collection appears on other servers, further slowing down the cluster. Some deprecated data is manually removed to decrease the number of shards and aliases and help the cluster recover faster. The removed data remains safely stored in another database, from which the editor can load it when needed, so no customer data is lost.
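The backlog described here can be inspected through the cluster's management endpoints. A sketch, again with a placeholder host:

```shell
# List tasks queued on the master node (index creation, mapping
# updates, alias changes, ...) that have not yet been applied.
curl -s "http://localhost:9200/_cluster/pending_tasks?pretty"

# Count shards and aliases to gauge how much cluster state the
# master has to manage; removing unneeded ones shrinks this state.
curl -s "http://localhost:9200/_cat/shards" | wc -l
curl -s "http://localhost:9200/_cat/aliases" | wc -l
```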
11:03 AM: Editor service instances that run on the affected servers and are still accepting user requests are stopped. This further decreases the load on Elasticsearch and gives it an opportunity to recover.
11:11 AM: Most users are redirected to the servers hosting the canary deployment of the new editor version, which uses a completely separate and unaffected Elasticsearch cluster. The affected Elasticsearch cluster is restarted and recovers.
11:28 AM: All servers of Memsource Editor for Web are fully operational.
On 16 December, we started gradually redirecting about 90% of editor users to a single Elasticsearch cluster in order to prepare the remaining servers for a canary deployment of a new editor version. The redirection was monitored and had no visible impact on the stability of the editor. Over the following days, however, significantly more indices and aliases than usual accumulated in Elasticsearch, resulting in slower creation of new indices. Furthermore, the redirected users generated extra load that consumed too much heap space, triggering longer "stop-the-world" garbage-collection pauses and making index creation, and cluster recovery without manual intervention, even harder.
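Heap pressure of this kind shows up in Elasticsearch's node statistics. A sketch of how it can be watched, using standard Elasticsearch endpoints and a placeholder host (not Memsource's actual monitoring setup):

```shell
# Per-node heap usage as a percentage of the configured JVM heap;
# sustained values near the limit usually precede long GC pauses.
curl -s "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max"

# Detailed JVM stats, including cumulative time spent in young- and
# old-generation garbage collections per node.
curl -s "http://localhost:9200/_nodes/stats/jvm?pretty"
```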
As a reaction to the problems:
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to deepen their understanding of the incident and determine changes that will improve our services and processes.