We would like to share more details about the events that occurred with Memsource between 11:30 AM and 1:00 PM CET on the 12th of May, 2021, which degraded the performance of the Editor for Web and the Term Base functionality, and about what Memsource engineers are doing to prevent these sorts of issues from happening again.
Timeline
11:05 AM CET: Planned maintenance of the Term Base database (Elasticsearch) starts.
11:37 AM CET: Automated monitoring reports increased response time for user requests.
11:40 AM CET: Ongoing maintenance is identified as the cause of the increased response time.
11:51 AM CET: The inoperative Term Base functionality negatively affects Quality Assurance checks and Autocomplete functions in Memsource Editors.
11:55 AM CET: The recovery of Term Base functionality is in progress. Some Elasticsearch nodes become overloaded; the recovery process moves customer data from overloaded to healthy nodes.
1:00 PM CET: Term Base functionality is recovered and automated monitoring reports the system as fully operational.
The Term Base Elasticsearch cluster did not have enough heap memory available during a rolling restart, which resulted in a cascade effect and a long recovery time.
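Purely as an illustration of the kind of safeguard this root cause suggests, and not a description of Memsource's actual tooling, the sketch below checks JVM heap headroom before a rolling restart moves on to the next node. It assumes the input has the shape of the Elasticsearch node stats response (`GET _nodes/stats/jvm`), where each node reports `jvm.mem.heap_used_percent`. The function names and the 75% threshold are hypothetical choices made for this example.

```python
from typing import Any, Dict, List


def nodes_over_heap_limit(node_stats: Dict[str, Any],
                          max_heap_used_percent: int = 75) -> List[str]:
    """Return the names of nodes whose JVM heap usage exceeds the threshold.

    `node_stats` is assumed to look like the body returned by
    GET _nodes/stats/jvm. The 75% default is an illustrative assumption,
    not a recommended or documented value.
    """
    over = []
    for node_id, node in node_stats.get("nodes", {}).items():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used > max_heap_used_percent:
            # Fall back to the node ID if the response carries no name.
            over.append(node.get("name", node_id))
    return sorted(over)


def safe_to_restart_next_node(node_stats: Dict[str, Any],
                              max_heap_used_percent: int = 75) -> bool:
    """True only if every node currently has heap headroom."""
    return not nodes_over_heap_limit(node_stats, max_heap_used_percent)


# Hand-written sample data in the assumed response shape.
sample_stats = {
    "nodes": {
        "a1": {"name": "es-node-1", "jvm": {"mem": {"heap_used_percent": 60}}},
        "b2": {"name": "es-node-2", "jvm": {"mem": {"heap_used_percent": 91}}},
    }
}

if __name__ == "__main__":
    print(nodes_over_heap_limit(sample_stats))
    print(safe_to_restart_next_node(sample_stats))
```

A restart procedure could call such a check between nodes and pause while any node is above the threshold, so that shards relocated from the restarting node are not pushed onto nodes that are already short of heap.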
As a reaction to the problems:
Conclusion
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.