Introduction
We would like to share more details about the events that occurred at Memsource between 3:00 and 3:40 PM CET on July 15, 2020, which led to a partial disruption of Memsource services, and about what Memsource engineers are doing to prevent issues of this kind from happening again.
3:01 PM CET: First non-critical alerts reported a slightly higher error rate on some of the Project Management servers. SW engineers started investigating and disabled some of the servers on the load balancers.
3:14 PM CET: Critical alerts reporting problems with the availability of the Project Management service. SW engineers asked infrastructure engineers for help as a number of errors seemed to be network related.
3:16 PM CET: Most servers stopped accepting new user requests.
3:20 PM CET: Decided to restart servers to help the service return to regular operation.
3:25 PM CET: One server was restarted and recovered, so engineers proceeded to restart all of them.
3:36 PM CET: Support agents reported that the Project Management service was still operable, but slow.
3:41 PM CET: The load balancer ran out of memory and was restarted.
3:45 PM CET: Servers restarted; Memsource returned to its usual response times.
A connection reset to the database at 2:59:58 PM CET triggered a stack overflow error on one of the servers. This led to an infinite loop in which the Project Management application servers exhausted the connection pool to the database and generated a high write load on the log file. The same problem occurred on 10 out of 12 servers within the next minute.
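The report does not show the code involved, so the following is only an illustrative sketch of one common safeguard against this class of failure: retrying a database call a bounded number of times with backoff, instead of retrying without limit. The function name `call_with_bounded_retry` and its parameters are hypothetical and are not taken from the Memsource codebase.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_bounded_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on ConnectionResetError at most `max_attempts` times.

    Hypothetical helper: a loop with a hard attempt limit cannot recurse or
    spin forever, so a single connection reset cannot grow the stack or
    flood the log without bound.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionResetError:
            if attempt == max_attempts:
                # Give up and surface the error to the caller.
                raise
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With a limit in place, a persistent connection failure surfaces as a single error to the caller after a few attempts, rather than holding pooled connections and writing to the log indefinitely.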
As a reaction to the problems:
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.