Disruption of Memsource between 8:49 and 10:15 AM CET
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource between 8:49 and 10:15 AM CET on the 17th of December, 2019, which led to the disruption of all Memsource components, and what Memsource engineers are doing to prevent these sorts of issues from happening again.

Timeline

We first noticed degraded performance of the service at 7:52 AM CET. Monitoring reported a spike in events emitted by our Web and API servers and streamed to the Elasticsearch cluster responsible for event analytics.

While examining event streaming and the Elasticsearch cluster, we noticed that some nodes in the cluster were responding with high latency, even though the cluster itself was reported as healthy.

At 8:49 AM, event traffic reached a volume above what the Elasticsearch cluster was able to ingest. The Web and API servers became less responsive as they were not able to read or write analytics events to the cluster and were running out of worker threads.

At 9:00 AM, the Web and API servers became unresponsive yet were still filling queues with events. Most requests could not finish, or even start, without the analytics Elasticsearch cluster working.
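
For illustration, the sketch below (in Java, with hypothetical names such as sendToAnalyticsCluster) shows how request handlers that write analytics events synchronously on their own worker threads can exhaust a bounded pool once the analytics backend stalls. It is a simplified model of the failure mode described above, not Memsource's actual code.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical sketch: request threads block on a slow analytics backend,
    // so a bounded worker pool runs out of threads for regular traffic.
    public class WorkerPoolExhaustionSketch {

        // A small, fixed pool standing in for the Web/API server worker threads.
        private static final ExecutorService WORKERS = Executors.newFixedThreadPool(8);

        // Stand-in for a synchronous write to the analytics Elasticsearch cluster.
        // When the cluster cannot ingest events, this call effectively hangs.
        static void sendToAnalyticsCluster(String event) throws InterruptedException {
            Thread.sleep(60_000); // simulate a stalled cluster: the call does not return in time
        }

        // A request handler that writes an analytics event on the request thread itself.
        static void handleRequest(int requestId) {
            try {
                sendToAnalyticsCluster("request-" + requestId);
                // ... the actual business logic would run here ...
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        public static void main(String[] args) {
            // Once 8 requests are in flight, every worker thread is stuck waiting on
            // the analytics cluster and no further requests can even start.
            for (int i = 0; i < 100; i++) {
                final int id = i;
                WORKERS.submit(() -> handleRequest(id));
            }
            System.out.println("All further requests are now queued behind stalled workers.");
            WORKERS.shutdown();
        }
    }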

At 9:10 AM, we identified malfunctioning nodes in the Elasticsearch cluster and started the node restart process.

At 9:30 AM, we decided to stop the Web and API servers to limit the pressure on the Elasticsearch cluster. This ensured that current requests would not be lost and allowed a faster recovery of the Elasticsearch cluster.

At 9:55 AM, we completed the recovery of two Elasticsearch nodes and confirmed the cluster and all nodes were in good shape.

At 10:00 AM, we restarted the first few Web and API servers. Once it was confirmed that everything was working as expected, the rest were enabled.

At 10:15 AM, we confirmed all Web and API servers were back online and the Elasticsearch cluster was performing as expected.

Root cause

On the 16th of December at 6:00 PM, we completed the regular export of event analytics data to another database. The export had no immediate impact on the cluster's performance, but it did increase memory pressure on the Elasticsearch cluster nodes.

While recovering the affected Elasticsearch cluster nodes, we noticed that Java garbage collection had started to take an unusually long time beginning at 7:00 PM on the 16th of December, causing the cluster to flap between healthy and unhealthy states by 11:00 PM.

The cluster was still performing well, and no performance or availability degradation was recorded until 7:45 AM on the 17th of December, when two nodes in the cluster reported Java garbage collection runs taking minutes, along with out-of-memory errors. These nodes were dropped from the cluster, while the cluster itself still indicated a healthy state with all nodes up. This blocked some operations on the Elasticsearch cluster and started to degrade the performance and availability of the Web and API servers.
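
As an illustration of how such long collections can be surfaced early, the following Java sketch polls the JVM's garbage collector beans and flags intervals where collection time spikes. The threshold, interval, and class name are assumptions made for the example; this is not our production monitoring code.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Hypothetical sketch: periodically read cumulative GC counts and times from the JVM
    // so that unusually long collections can be detected before a node drops out.
    public class GcPauseMonitorSketch {

        public static void main(String[] args) throws InterruptedException {
            long previousTimeMs = 0;

            while (true) {
                long totalTimeMs = 0;
                long totalCount = 0;
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    totalTimeMs += Math.max(gc.getCollectionTime(), 0);
                    totalCount += Math.max(gc.getCollectionCount(), 0);
                }

                long deltaTimeMs = totalTimeMs - previousTimeMs;

                // Alert threshold is illustrative: flag intervals where GC consumed
                // a large share of wall-clock time (more than 10 seconds per minute).
                if (deltaTimeMs > 10_000) {
                    System.err.printf("WARN: GC took %d ms in the last interval (%d collections total)%n",
                            deltaTimeMs, totalCount);
                }

                previousTimeMs = totalTimeMs;
                Thread.sleep(60_000); // sample once per minute
            }
        }
    }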

Actions to Prevent Recurrence

In response to these problems:

  • We will refine the monitoring of our Elasticsearch cluster so that we can detect even minor node performance problems.
  • Based on Java garbage collection data from the Elasticsearch nodes, we will update the JVM settings to prevent node flapping and long garbage collection pauses.
  • We have postponed the regular event analytics data export until these changes are implemented and delivered.
  • We will expand the cluster's computing power so that recovery time will be shorter if a similar issue recurs.
  • We will update the regular event analytics data export to run from an Elasticsearch snapshot so that it creates no additional pressure on the live cluster.
  • We will loosen the Web application's and the API's dependency on Elasticsearch, which should help us recover faster if the cluster becomes unresponsive (a minimal sketch follows this list).
  • We will implement new configuration options that make it easier for us to disable specific features, helping us isolate problems and avoid outages of the entire application.
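
To illustrate the last two points, here is a minimal sketch, in Java, of one way a Web or API server could hand analytics events to a bounded in-memory queue and gate the feature behind a configuration flag, so that a slow or unavailable Elasticsearch cluster cannot exhaust request worker threads. The names used (recordEvent, analytics.enabled, DecoupledAnalyticsSketch) are assumptions for the example, not Memsource's actual implementation.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Hypothetical sketch: analytics events are handed to a bounded queue instead of being
    // written to Elasticsearch on the request thread, and a configuration flag can disable
    // the feature entirely, so an unresponsive cluster cannot take down the Web/API servers.
    public class DecoupledAnalyticsSketch {

        // Illustrative configuration flag; in practice this would come from application config.
        private static final boolean ANALYTICS_ENABLED =
                Boolean.parseBoolean(System.getProperty("analytics.enabled", "true"));

        // Bounded buffer between request handling and the analytics writer.
        private static final BlockingQueue<String> EVENT_QUEUE = new ArrayBlockingQueue<>(10_000);

        // Called from request handlers: never blocks and never fails the request.
        static void recordEvent(String event) {
            if (!ANALYTICS_ENABLED) {
                return; // feature switched off: requests proceed without touching Elasticsearch
            }
            // offer() never blocks; if the buffer is full the event is dropped, because losing
            // an analytics event is preferable to blocking a request worker thread.
            EVENT_QUEUE.offer(event);
        }

        public static void main(String[] args) throws InterruptedException {
            // A background writer drains the queue; only this thread ever waits on Elasticsearch.
            Thread writer = new Thread(() -> {
                while (true) {
                    try {
                        String event = EVENT_QUEUE.take();
                        // A real implementation would index 'event' into Elasticsearch here,
                        // with its own timeout and retry policy.
                        System.out.println("indexed " + event);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }, "analytics-writer");
            writer.setDaemon(true);
            writer.start();

            recordEvent("document.created");
            recordEvent("job.completed");
            Thread.sleep(500); // give the background writer a moment before the demo exits
        }
    }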

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine how to make changes that improve our services and processes.

Posted Dec 18, 2019 - 15:04 CET

Resolved
This incident has been resolved.
Posted Dec 17, 2019 - 10:38 CET
Monitoring
The problem has been fixed and we are monitoring the results.
Posted Dec 17, 2019 - 10:22 CET
Identified
Memsource engineers are currently working on fixing the ongoing issue which affects all Memsource components.
Posted Dec 17, 2019 - 10:04 CET
Update
Memsource engineers are continuing to investigate this issue.
Posted Dec 17, 2019 - 09:34 CET
Investigating
We are currently investigating this issue related to the unavailability of all Memsource components.
Posted Dec 17, 2019 - 08:59 CET
This incident affected: SLA (API, Editor for Web, Project Management).