We would like to share more details about the events that occurred with Memsource between 2:22 and 2:34 AM CET on May 26th, 2020, which led to a partial disruption of incoming connections, and about what Memsource engineers are doing to prevent these sorts of issues from recurring.
02:22 AM CET: One of our load balancers stops accepting new connections while the second continues to function normally. Approximately 50% of traffic is affected. DNS HA failover is in place, so affected clients slowly migrate to the second load balancer (within 10 minutes, the second load balancer holds ~75% of the usual number of connections for both load balancers combined).
02:24 AM CET: We receive the first alert from the monitoring system.
02:34 AM CET: The problematic load balancer is restarted and has functioned normally since.
02:35 AM CET: The second load balancer is restarted as a precautionary measure. All traffic is back to normal.
The HAProxy process stopped accepting connections and remained stuck until it was restarted. We are still investigating why HAProxy stopped responding.
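To illustrate how this kind of failure can be detected from the outside, here is a minimal sketch of a probe, not a description of our actual monitoring. It assumes that a stuck process may still have its listening socket open, in which case the operating system can complete the TCP handshake even though nothing ever answers. The probe therefore sends a small request and waits for at least one byte of response. The function name `responds_to_request`, the HEAD request, and the timeout value are all hypothetical choices made for this example.

```python
import socket


def responds_to_request(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True only if the endpoint sends back at least one byte.

    A plain TCP connect is not enough: a process that is stuck but still
    listening can have connections queued by the kernel without ever
    serving them. Hypothetical helper for illustration only.
    """
    request = b"HEAD / HTTP/1.0\r\n\r\n"
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(request)
            # An empty read means the peer closed without answering.
            return len(conn.recv(1)) > 0
    except OSError:
        # Covers refused connections, timeouts, and resets.
        return False
```

A check like this would be run against each load balancer separately, since a check that goes through DNS failover could be answered by the healthy one.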
As a reaction to the problems:
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.