Degraded Availability of Memsource between 2:22 and 2:34 AM CET

Incident Report for Phrase (formerly Memsource)

Postmortem

Introduction

We would like to share more details about the events that affected Memsource between 2:22 and 2:34 AM CET on May 26th, 2020, which led to a partial disruption of incoming connections, and about what Memsource engineers are doing to prevent issues of this kind from recurring.

Timeline

02:22 AM CET: One of our two load balancers stops accepting new connections while the second continues to function normally. Approximately 50% of traffic is affected. DNS HA failover is in place, so affected clients slowly migrate to the second load balancer (within 10 minutes, the second load balancer holds ~75% of the usual number of connections for both load balancers).

02:24 AM CET: We receive the first alert from the monitoring system.

02:34 AM CET: The problematic load balancer is restarted and has functioned normally since.

02:35 AM CET: The second load balancer is restarted as a precautionary measure. All traffic is back to normal.
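
The failover figures in the 02:22 entry can be turned into a rough estimate of how many affected connections had moved after 10 minutes. The short sketch below is only illustrative: it assumes the two load balancers normally split connections evenly (the report says approximately 50%) and that the total number of connections stayed roughly constant. The function name is hypothetical.

```python
def migrated_share_of_affected(normal_share: float, share_after: float) -> float:
    """Estimate the fraction of affected connections that moved to the healthy balancer.

    normal_share: share of all connections the healthy balancer usually holds.
    share_after:  share of the usual total it holds after failover has progressed.
    """
    affected = 1.0 - normal_share      # connections that were on the failed balancer
    gained = share_after - normal_share  # extra connections the healthy balancer picked up
    return gained / affected


# Assumed inputs taken from the timeline: ~50% normally, ~75% after 10 minutes.
share = migrated_share_of_affected(0.50, 0.75)
print(f"{share:.0%} of the affected connections had migrated")
# prints "50% of the affected connections had migrated"
```

Under these assumptions, roughly half of the affected connections had reached the healthy load balancer 10 minutes into the incident.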

Root cause

The HAProxy process on one load balancer stopped accepting connections and remained stuck until it was restarted. We are still investigating why HAProxy stopped responding.

Actions to Prevent Recurrence

In response to this incident:

We will continue with a thorough internal investigation into HAProxy problems to identify the root cause.
We will improve our load-balancer (LB) infrastructure and update DNS records to allow quicker failover and provide better scalability.
We will run different HAProxy versions and compare configurations to improve stability and performance of our LB layer.
We will improve internal documentation and update monitoring settings so similar situations are mitigated with minimal impact.
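
As an illustration of the kind of monitoring check mentioned above, the sketch below probes each load balancer by opening a TCP connection with a short timeout and reports the ones that fail. It is not Memsource's actual monitoring setup: the function names, balancer labels, addresses, and the timeout value are all hypothetical.

```python
import socket
from typing import Dict, List, Tuple


def accepts_connections(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unhealthy_balancers(
    balancers: Dict[str, Tuple[str, int]], timeout_s: float = 2.0
) -> List[str]:
    """Return the labels of balancers that did not accept a connection."""
    return [
        name
        for name, (host, port) in balancers.items()
        if not accepts_connections(host, port, timeout_s)
    ]


if __name__ == "__main__":
    # Hypothetical addresses from the documentation range; replace with real endpoints.
    balancers = {"lb-1": ("192.0.2.10", 443), "lb-2": ("192.0.2.11", 443)}
    for name in unhealthy_balancers(balancers):
        print(f"ALERT: {name} is not accepting new connections")
```

A plain TCP probe has a known limitation: the operating system may keep completing handshakes for a stuck process until its listen queue fills, so a check that also sends a request and waits for a response would detect this failure mode more reliably.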

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted May 27, 2020 - 15:43 CEST

Resolved

Some users may have noticed performance issues on May 26th, 2020 between 2:22 and 2:34 AM CET, when Memsource was partially unavailable.

Posted May 26, 2020 - 02:22 CEST