We would like to share more details about the events that occurred at Memsource between 8:49 and 10:15 AM CET on the 17th of December, 2019, which led to the disruption of all Memsource components, and what Memsource engineers are doing to prevent these sorts of issues from happening again.
We first noticed degraded performance of the service at 7:52 AM CET. Monitoring reported a spike in events emitted by our Web and API servers and streamed to the Elasticsearch cluster responsible for event analytics.
While examining event streaming and the Elasticsearch cluster we noticed that some nodes in the cluster were responding with high latency but the cluster itself was reported as healthy.
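A cluster-level "healthy" status can mask individual nodes that are responding slowly, which is why per-node latency is worth checking separately. A minimal sketch of that idea, with hypothetical node names and latency samples (in practice these would come from per-node response-time metrics):

```python
# Flag individual slow nodes even when the cluster-level status looks healthy.
# Node names, sample values, and the threshold are illustrative assumptions.

def slow_nodes(latencies_ms, threshold_ms=500):
    """Return the nodes whose worst observed latency exceeds the threshold."""
    return sorted(
        node for node, samples in latencies_ms.items()
        if max(samples) > threshold_ms
    )

samples = {
    "es-node-1": [12, 18, 25],     # responding normally
    "es-node-2": [15, 900, 1400],  # slow despite the cluster reporting healthy
    "es-node-3": [10, 22, 30],     # responding normally
}
print(slow_nodes(samples))  # ['es-node-2']
```

A check like this, alerting on node latency rather than only on the aggregate cluster status, would have surfaced the misbehaving nodes earlier.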
At 8:49 AM, event traffic reached a volume above what the Elasticsearch cluster was able to ingest. The Web and API servers became less responsive as they could neither read nor write analytics events to the cluster and were running out of worker threads.
At 9:00 AM, the Web and API servers became unresponsive, yet they were still filling queues with events. Most requests could not finish, or even start, while the analytics Elasticsearch cluster was unavailable.
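The worker-thread exhaustion described above is a classic consequence of treating best-effort analytics as a blocking dependency. One common mitigation, sketched below with illustrative queue sizes and payloads, is to emit events through a bounded buffer and drop them when it fills, rather than letting request threads block:

```python
import queue

# Emit analytics events through a bounded queue; drop events when the buffer
# is full instead of blocking the request worker thread. The queue size and
# event payloads here are illustrative assumptions.

events = queue.Queue(maxsize=3)

def emit_event(event):
    """Try to enqueue an event; return False (and drop it) if the buffer is full."""
    try:
        events.put_nowait(event)
        return True
    except queue.Full:
        # Analytics is best-effort: losing an event is preferable to tying up
        # a worker thread while the downstream cluster is overloaded.
        return False

accepted = [emit_event({"id": i}) for i in range(5)]
print(accepted)  # [True, True, True, False, False]
```

With this pattern, a slow or overloaded analytics backend degrades only the analytics pipeline, not the Web and API request path.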
At 9:10 AM, we identified malfunctioning nodes in the Elasticsearch cluster and started the node restart process.
At 9:30 AM, we decided to stop the Web and API servers to limit the pressure on the Elasticsearch cluster. This ensured that in-flight requests would not be lost and allowed the Elasticsearch cluster to recover faster.
At 9:55 AM, we completed the recovery of two Elasticsearch nodes and confirmed the cluster and all nodes were in good shape.
At 10:00 AM, we restarted the first few Web and API servers. Once we confirmed everything was working as expected, we enabled the rest.
At 10:15 AM, we confirmed all Web and API servers were back online and the Elasticsearch cluster was performing as expected.
On the 16th of December at 6:00 PM CET, we completed the regular export of event analytics data to another database. The export had no immediate impact on the cluster's performance, but it increased memory pressure on the Elasticsearch cluster nodes.
While recovering the affected Elasticsearch cluster nodes, we noticed that the Java garbage collection process had started taking an unusually long time at around 7:00 PM on the 16th of December, causing the cluster to flap between healthy and unhealthy states by 11:00 PM.
The cluster was still performing well, and no performance or availability degradation was recorded until 7:45 AM on the 17th of December, when two nodes in the cluster reported Java garbage collection taking minutes and experienced out-of-memory errors. This caused the nodes to be dropped from the cluster, even though the cluster itself still indicated a healthy state with all nodes up. As a result, some operations on the Elasticsearch cluster were blocked, and the performance and availability of the Web and API servers began to degrade.
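Minute-long garbage collection pauses are visible in JVM GC logs well before they cause nodes to drop. A minimal sketch of scanning such logs for long pauses; the sample lines below are fabricated for illustration (real logs would come from a flag such as `-Xlog:gc`), and the threshold is an assumption:

```python
import re

# Scan JVM GC log lines for stop-the-world pauses longer than a threshold.
# The log lines and the 1-second threshold are illustrative assumptions.

PAUSE_RE = re.compile(r"Pause.*?([\d.]+)ms\s*$")

def long_pauses(lines, threshold_ms=1000.0):
    """Return the pause durations (in ms) that exceed the threshold."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match and float(match.group(1)) > threshold_ms:
            pauses.append(float(match.group(1)))
    return pauses

log = [
    "[info][gc] GC(17) Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 12.3ms",
    "[info][gc] GC(18) Pause Full (G1 Compaction Pause) 250M->240M(256M) 64000.0ms",
]
print(long_pauses(log))  # [64000.0]
```

Alerting on pause durations like these, rather than waiting for nodes to fall out of the cluster, would have given several hours of warning in this incident.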
In response to these problems:
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to deepen their understanding of the incident and determine what changes will improve our services and processes.