Degraded Performance of Memsource TMS (EU) Project Management between 01:55 PM CEST and 03:02 PM CEST
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource between 01:55 PM CEST and 03:02 PM CEST on May 4th, 2022, which led to degraded performance of Memsource TMS (EU) Project Management, and about what Memsource engineers are doing to prevent these issues from happening again.

Timeline

01:34 PM CEST: Deployment of the new version of Project Management was started.

01:55 PM CEST: The first few servers were running the new version.

02:20 PM CEST: The first customers started reporting problems when creating or editing jobs.

02:25 PM CEST: A decision was made to roll back to the previous version.

03:02 PM CEST: All servers were running the previous version.

04:43 PM CEST: The load balancer configuration was updated to prevent the problem from occurring again.

06:06 PM CEST: All servers were running the latest version of Project Management.

Root Cause

A few weeks ago, we changed the load balancer configuration to serve the Project Management UI in a ‘non-sticky’ manner (i.e. different servers may serve different pages and other resources requested by a particular user), and all seemed fine. However, this incident revealed that this is not safe in all cases, in particular while servers are running different versions during a deployment. For example, the new version renamed some static resources, and a renamed resource could not be found when the request for it was routed to a server still running the previous version.
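The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not our actual load balancer or application code: the asset names, the server list, and the two routing functions are hypothetical, and a random pick stands in for whatever non-sticky policy a real load balancer uses.

```python
import hashlib
import random

# Hypothetical static assets: each version ships a differently named bundle.
ASSETS = {"v1": {"app.v1.js"}, "v2": {"app.v2.js"}}


def count_missing(servers, pick_server, users):
    """Each user loads a page, then the asset that page references.

    Returns how many asset requests land on a server that lacks the asset
    (these would be 404 responses).
    """
    missing = 0
    for user in users:
        page_version = servers[pick_server(user)]
        asset = f"app.{page_version}.js"  # name referenced by the rendered page
        asset_version = servers[pick_server(user)]
        if asset not in ASSETS[asset_version]:
            missing += 1
    return missing


def make_non_sticky(n, seed=0):
    """Non-sticky routing: every request may go to any server."""
    rng = random.Random(seed)
    return lambda user: rng.randrange(n)


def make_sticky(n):
    """Sticky routing: all requests from one user go to the same server."""
    def pick(user):
        return hashlib.sha256(user.encode()).digest()[0] % n
    return pick


# Mid-deployment: one server already upgraded, three still on the old version.
servers = ["v2", "v1", "v1", "v1"]
users = [f"user{i}" for i in range(1000)]

non_sticky_404s = count_missing(servers, make_non_sticky(len(servers)), users)
sticky_404s = count_missing(servers, make_sticky(len(servers)), users)
print("non-sticky missing assets:", non_sticky_404s)
print("sticky missing assets:", sticky_404s)
```

With sticky routing, the page and its assets always come from the same server, so the version mismatch cannot occur in this model. With non-sticky routing, mismatches appear only while the servers run mixed versions, which is consistent with the configuration seeming fine until a deployment took place.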

Actions to Prevent Recurrence

  • Load-balancing of UI requests has been configured back to a ‘sticky’ manner.
  • Monitoring will be updated to trigger events in case of elevated 404 errors. This will help us detect the problem earlier.
  • Any future change to the load balancing of UI requests will keep stickiness enabled at least during deployments.
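
As an illustration of the monitoring action above, elevated 404 detection could be sketched as a sliding-window rate check. The class name, window size, threshold, and minimum sample count below are assumptions chosen for the example and do not describe our actual monitoring setup.

```python
from collections import deque


class NotFoundRateMonitor:
    """Flags when the share of 404 responses in a sliding window is elevated.

    All parameters are illustrative defaults, not production values.
    """

    def __init__(self, window_size=200, threshold=0.05, min_samples=50):
        self.statuses = deque(maxlen=window_size)  # most recent status codes
        self.threshold = threshold                 # alert above this 404 share
        self.min_samples = min_samples             # avoid alerts on tiny samples

    def record(self, status_code):
        """Record one response; return True if the 404 rate is elevated."""
        self.statuses.append(status_code)
        if len(self.statuses) < self.min_samples:
            return False
        not_found = sum(1 for s in self.statuses if s == 404)
        return not_found / len(self.statuses) > self.threshold


monitor = NotFoundRateMonitor(window_size=100, threshold=0.05, min_samples=20)
for _ in range(100):
    monitor.record(200)  # normal traffic, no alert
alerts = [monitor.record(404) for _ in range(10)]  # burst of missing resources
print("alert raised:", any(alerts))
```

A rate over a window, rather than a raw count, keeps occasional 404s from mistyped URLs from triggering events while still reacting quickly to a burst like the one in this incident.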

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.

Posted May 11, 2022 - 17:17 CEST

Resolved
This incident has been resolved.
Posted May 04, 2022 - 15:36 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted May 04, 2022 - 15:22 CEST
Investigating
Project Management component is experiencing performance issues and some pages fail to load properly. Our engineers are currently investigating the cause.
Posted May 04, 2022 - 14:54 CEST
This incident affected: Memsource TMS (EU) (Project Management).