Degraded Availability of Memsource between 8:07 and 8:30 PM CET
Incident Report for Memsource
Postmortem

Introduction

On April 1st, 2020, between 8:07 PM and 8:30 PM CET, Memsource was unavailable for a small group of customers, and some users may have noticed performance issues.

Timeline

8:07 PM CET: The CPU load on one load balancer gradually increased up to 100%.

8:09 PM CET: The 24x7 on-call engineering team received the first alerts. Connections started piling up on the load balancer, which stopped passing new traffic.

8:17 PM CET: We received more monitoring alerts about requests taking too long, high backend failure rates, high load balancer CPU usage and Memsource service flapping.

8:20 PM CET: We identified the problem on the load balancer and redirected the traffic to a healthy one. Most internal and external traffic was routed via the healthy load balancer within 60 seconds.

8:25 PM CET: We gathered diagnostic data for later debugging and restarted the load balancer on the failing node.

8:30 PM CET: All monitored metrics returned to normal and alerts were cleared.

8:50 PM CET: The previously failed load balancer was returned to the pool.

Root Cause

The root cause was the HAProxy process on the failing load balancer consuming 100% CPU time without passing traffic.
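
A failure of this shape (a saturated CPU combined with no forwarded traffic) can in principle be detected by correlating the two signals. The following is only a minimal sketch of such a check, not the monitoring Memsource actually runs; the `Sample` type, the `is_spinning` function, and all thresholds are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One monitoring sample for a load balancer process (hypothetical shape)."""
    cpu_percent: float       # CPU usage of the proxy process, 0-100
    requests_per_sec: float  # requests forwarded to backends


def is_spinning(
    samples: Sequence[Sample],
    cpu_threshold: float = 95.0,   # assumed threshold, not from the report
    min_rps: float = 1.0,          # assumed threshold, not from the report
    min_consecutive: int = 3,      # assumed threshold, not from the report
) -> bool:
    """Return True if the most recent samples all show high CPU and almost no traffic."""
    if len(samples) < min_consecutive:
        return False  # not enough data to decide
    recent = samples[-min_consecutive:]
    return all(
        s.cpu_percent >= cpu_threshold and s.requests_per_sec < min_rps
        for s in recent
    )
```

Requiring several consecutive bad samples trades a little detection time for fewer false alarms during short, legitimate CPU spikes.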

Actions to Prevent Recurrence

To prevent similar problems from recurring and to improve our reaction time:

  • We decreased the TTL on our internal DNS records so that traffic can be redirected more quickly.
  • We will update our DNS records to allow better load balancing and faster failover.
  • We will increase the sensitivity of our most critical monitoring checks so that we become aware of such failures earlier.
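
To illustrate why a lower DNS TTL shortens failover, the sketch below estimates the share of clients still resolving to the old load balancer some time after a record change. It rests on a simplifying assumption that is not from this report: client caches expire uniformly over one TTL window. The function name `stale_fraction` and the TTL values in the usage note are hypothetical; the report does not state the actual TTLs.

```python
def stale_fraction(seconds_since_change: float, ttl_seconds: float) -> float:
    """Estimate the fraction of clients still using the old DNS record.

    Assumes cached records expire uniformly over one TTL window
    (an illustrative model, not a measured behavior).
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    remaining = 1.0 - seconds_since_change / ttl_seconds
    # Clamp to [0, 1]: nobody is stale after a full TTL, everybody at time zero.
    return min(1.0, max(0.0, remaining))
```

Under this model, 30 seconds after a change, `stale_fraction(30, 300)` leaves about 90% of clients on the old record with a hypothetical 300-second TTL, while `stale_fraction(30, 60)` leaves about 50% with a hypothetical 60-second TTL.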

Conclusion

Finally, we want to apologize. We know how critical our services are to our users. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services.

Posted Apr 02, 2020 - 16:06 CEST

Resolved
Some Memsource components experienced slow response times. The Memsource engineering team took quick action, and the incident is now resolved.
Posted Apr 01, 2020 - 20:39 CEST
This incident affected: Memsource (SLA) (API, Editor for Web, Project Management).