We would like to share more details about the events that occurred at Memsource between 01:55 PM and 03:02 PM CEST on May 4, 2022, which led to degraded performance of Memsource TMS (EU) Project Management, and about what Memsource engineers are doing to prevent these issues from happening again.
01:34 PM CEST: Deployment of the new version of Project Management started.
01:55 PM CEST: The first few servers were running the new version.
02:20 PM CEST: The first customers started reporting problems when creating or editing jobs.
02:25 PM CEST: The decision was made to roll back to the previous version.
03:02 PM CEST: All servers were running the previous version.
04:43 PM CEST: The load balancer configuration was updated to prevent the problem from recurring.
06:06 PM CEST: All servers were running the latest version of Project Management.
A few weeks earlier, we had changed the load balancer configuration to serve the Project Management UI in a ‘non-sticky’ manner (i.e., the pages and other resources requested by a particular user may be served by different servers), and everything appeared to work correctly. However, this incident revealed that this approach is not safe in all cases, in particular while servers are running different versions during a deployment. For example, the new version updated some static resources and changed their names, so a page served by a server already running the new version could reference resources that servers still running the previous version could not find.
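To illustrate why non-sticky routing breaks during a rolling deployment, here is a toy simulation. It assumes a fleet of two servers (one upgraded, one not), static files whose names change between versions, simple round-robin routing for the non-sticky case, and routing by a hash of the user ID for the sticky case. All server names, file names and functions are hypothetical; this is not Memsource's actual setup, nor the configuration change that was made.

```python
from itertools import cycle

# Hypothetical mid-deployment fleet: one server already upgraded, one not.
# Server and file names are illustrative assumptions only.
SERVERS = {
    "server-a": {"version": "new", "assets": {"app.def456.js"}},
    "server-b": {"version": "old", "assets": {"app.abc123.js"}},
}

# The script name each version's HTML page refers to (renamed in the new version).
PAGE_REFERENCES = {"new": "app.def456.js", "old": "app.abc123.js"}


def non_sticky_router():
    """Round-robin: consecutive requests land on different servers."""
    pool = cycle(sorted(SERVERS))
    return lambda user: next(pool)


def sticky_router():
    """Pin each user to one server using a deterministic hash of the user ID."""
    names = sorted(SERVERS)
    return lambda user: names[sum(map(ord, user)) % len(names)]


def load_page(route, user):
    """Request a page, then the script it references; return an HTTP-like status."""
    page_server = route(user)
    asset = PAGE_REFERENCES[SERVERS[page_server]["version"]]
    asset_server = route(user)
    return 200 if asset in SERVERS[asset_server]["assets"] else 404


if __name__ == "__main__":
    print("non-sticky:", load_page(non_sticky_router(), "alice"))  # 404
    print("sticky:", load_page(sticky_router(), "alice"))  # 200
```

In this model, the non-sticky router fetches the page from the upgraded server and the renamed script from the old one, which does not have it. The sticky router sends both requests to the same server, so the page and its resources always come from the same version.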
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.