On April 1st, 2020, between 8:07 PM and 8:30 PM CET, Memsource was unavailable for a small group of customers, and some users may have noticed performance issues.
8:07 PM CET: The CPU load on one load balancer gradually increased to 100%.
8:09 PM CET: The 24x7 on-call engineering team received the first alerts. Connections were piling up on the load balancer, which had stopped passing new traffic.
8:17 PM CET: We received more monitoring alerts: requests taking too long, high backend failure rates, high load balancer CPU usage, and the Memsource service flapping.
8:20 PM CET: We identified the problem on the load balancer and redirected traffic to a healthy one. Most internal and external traffic was routed via the healthy load balancer within 60 seconds (a sketch of this kind of probe-and-failover check appears after the timeline).
8:25 PM CET: We gathered diagnostic data for later debugging and restarted the load balancer process on the failing node.
8:30 PM CET: All monitored metrics returned to normal and alerts were cleared.
8:50 PM CET: The previously failed load balancer was returned to the pool.
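For illustration, here is a minimal sketch of the kind of probe-and-failover check that catches a load balancer that has stopped passing new traffic. The endpoint, thresholds, and the redirect_traffic() helper are hypothetical stand-ins for whatever mechanism actually performs the switch (a virtual IP move, a DNS update, etc.); this is not Memsource's actual tooling.

```python
import time
import urllib.error
import urllib.request

# Hypothetical values -- adjust to the real environment.
ACTIVE_LB = "http://lb1.example.internal/healthz"
PROBE_TIMEOUT = 2      # seconds; a healthy load balancer answers well within this
FAILURE_THRESHOLD = 3  # consecutive failed probes before failing over


def probe(url: str) -> bool:
    """Return True if the load balancer answers an HTTP probe in time."""
    try:
        with urllib.request.urlopen(url, timeout=PROBE_TIMEOUT) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def redirect_traffic() -> None:
    """Placeholder for the site-specific failover mechanism."""
    print("failing over to the standby load balancer")


failures = 0
while True:
    failures = failures + 1 if not probe(ACTIVE_LB) else 0
    if failures >= FAILURE_THRESHOLD:
        redirect_traffic()
        break
    time.sleep(5)
```

Requiring several consecutive failures before acting avoids failing over on a single dropped probe while still reacting within tens of seconds.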
The root cause was the HAProxy process on the failing load balancer consuming 100% of CPU time while passing no traffic.
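This failure signature, a process pinned near 100% CPU while serving nothing, can be confirmed by correlating per-process CPU usage with a liveness probe. Below is a rough sketch using the third-party psutil library; the process name, probe URL, and threshold are illustrative assumptions rather than the checks we actually run.

```python
import urllib.error
import urllib.request

import psutil  # third-party: pip install psutil


def haproxy_cpu_percent() -> float:
    """Total CPU usage of all haproxy processes, sampled over one second."""
    procs = [p for p in psutil.process_iter(["name"])
             if p.info["name"] == "haproxy"]
    for p in procs:
        p.cpu_percent(None)               # prime the per-process counters
    psutil.cpu_percent(interval=1.0)      # wait one sampling interval
    return sum(p.cpu_percent(None) for p in procs)


def passing_traffic(url: str = "http://localhost/healthz") -> bool:
    """Hypothetical liveness probe routed through the load balancer."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


# The incident's signature: the process is busy, but nothing gets through.
if haproxy_cpu_percent() > 95 and not passing_traffic():
    print("ALERT: haproxy spinning at ~100% CPU without serving traffic")
```

Alerting on the combination, rather than on CPU usage alone, avoids paging for a load balancer that is merely busy under legitimate load.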
To prevent the recurrence of similar problems and to improve our reaction time:
Finally, we want to apologize. We know how critical our services are to our users. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services.