We would like to share more details about the events that occurred with Memsource between 4:30 and 6:58 AM CET on the 30th of January, 2020, which led to slow file imports and slow quality assurance response times for some customers.
5:06 AM CET: Slow file imports and slow quality assurance response times are reported to Memsource Support. The support agent starts gathering more information from the user and from system monitoring.
6:09 AM CET: The support agent confirms the problem and notifies the engineering team. Only some users seem to be affected.
6:55 AM CET: The engineering team identifies asynchronous requests stuck in the queue. This is assumed to be the cause of the problem, which an increasing number of users are now reporting. The support agent escalates the problem and updates the Memsource status page.
6:58 AM CET: The system is reconfigured to use an alternative method of processing asynchronous requests. Monitoring reports that performance is improving.
7:03 AM CET: Memsource performance is back to normal.
A new system that uses a message broker service to process asynchronous requests was recently introduced into Memsource. This new system is intended to increase Memsource stability under heavy loads. The incident revealed a bug in the implementation that caused some messages representing asynchronous requests not to be properly acknowledged while being processed. This eventually led to the interruption of communication with the message broker, and the system could not recover from this unexpected error. The new system had been enabled for only some organizations, and therefore only some users were affected.
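To illustrate the general class of bug described above, here is a minimal Python sketch. It is not Memsource's code and does not use any real broker client: `ToyBroker`, `max_unacked`, `handle`, and both consumer functions are hypothetical names invented for this illustration. It assumes a broker that stops delivering once too many messages remain unacknowledged, which is one common way that missing acknowledgements can stall a queue.

```python
from collections import deque


class ToyBroker:
    """Hypothetical in-memory stand-in for a message broker (not a real client API)."""

    def __init__(self, messages, max_unacked=2):
        self.queue = deque(messages)
        self.unacked = set()
        self.max_unacked = max_unacked  # assumed cap on unsettled messages

    def receive(self):
        # Deliver nothing while too many messages are still unsettled.
        if not self.queue or len(self.unacked) >= self.max_unacked:
            return None
        msg = self.queue.popleft()
        self.unacked.add(msg)
        return msg

    def ack(self, msg):
        # Positive acknowledgement: processing succeeded.
        self.unacked.discard(msg)

    def nack(self, msg):
        # Negative acknowledgement: settle the message without redelivery.
        self.unacked.discard(msg)


def handle(msg):
    """Hypothetical request handler that fails on some messages."""
    if msg.startswith("bad"):
        raise ValueError(f"cannot process {msg}")
    return msg.upper()


def consume_buggy(broker):
    done = []
    while (msg := broker.receive()) is not None:
        try:
            done.append(handle(msg))
            broker.ack(msg)
        except ValueError:
            pass  # bug: the failed message is never acknowledged
    return done


def consume_safe(broker):
    done = []
    while (msg := broker.receive()) is not None:
        try:
            done.append(handle(msg))
        except ValueError:
            broker.nack(msg)  # settle the message even on failure
        else:
            broker.ack(msg)
    return done


messages = ["bad-1", "job-1", "bad-2", "job-2", "job-3"]
print(consume_buggy(ToyBroker(messages)))  # ['JOB-1'], then delivery stalls
print(consume_safe(ToyBroker(messages)))   # ['JOB-1', 'JOB-2', 'JOB-3']
```

In this toy model, the buggy consumer leaves two failed messages unsettled, reaches the cap, and never receives the remaining requests. The safe variant settles every message on both the success and failure paths, so the queue keeps draining.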
As a reaction to the problems:
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.