We would like to share more details about the events that occurred at Memsource between 10:03 AM CEST and 10:19 AM CEST on April 6th, 2022, which degraded the performance of Memsource TMS (EU) Project Management, and about what Memsource engineers are doing to prevent such issues from happening again.
10:03 AM CEST: An internal configuration property is deleted. An audit record is created in the monitoring system. The request error rate goes up immediately.
10:15 AM CEST: Alerts from the automated monitoring system begin reaching the engineers on duty, who quickly analyze the error details.
10:19 AM CEST: Memcached servers acting as a second-level cache are restarted. The request error rate returns to normal values.
Root Cause
Internal configuration properties are injected into some pages used by the client-side JavaScript code. There are 7 such pages.
Due to a race condition, a configuration property that was being deleted was removed from the database but not completely removed from the second-level cache. The cached result set for the query that lists all configuration properties still included the deleted property, which led to failures on those 7 pages.
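To illustrate this kind of failure, here is a minimal Python sketch of one interleaving that can leave a deleted property in a cached result set. It is not Memsource's actual code: the `ConfigStore` class, its method names, and the property names are hypothetical, the database and cache are plain in-memory dictionaries, and the exact ordering of operations in the real incident is not stated in this report, so the one shown is an assumption.

```python
class ConfigStore:
    """Hypothetical in-memory stand-in for the database plus a cached query result."""

    def __init__(self, properties):
        self.database = dict(properties)
        self.query_cache = {}  # cached result sets, keyed by query name

    def list_all(self):
        """Return property names, serving the cached result set when one exists."""
        if "list_all" not in self.query_cache:
            self.query_cache["list_all"] = sorted(self.database)
        return self.query_cache["list_all"]

    def render_page(self):
        """Inject every listed property into a page; raises KeyError on a stale name."""
        return {name: self.database[name] for name in self.list_all()}

    def delete_with_race(self, name):
        """Replay one assumed interleaving in which a slow reader repopulates the cache."""
        stale_listing = sorted(self.database)         # reader: cache miss, reads before the delete
        del self.database[name]                       # writer: removes the row
        self.query_cache.pop("list_all", None)        # writer: invalidates the cached listing
        self.query_cache["list_all"] = stale_listing  # reader: stores its stale result late


store = ConfigStore({"feature.x": "on", "feature.y": "off"})
store.delete_with_race("feature.y")
try:
    store.render_page()
except KeyError as missing:
    print("page failed on deleted property:", missing)

store.query_cache.clear()  # analogous to restarting the cache servers
print(store.render_page())
```

In this sketch every page render keeps failing until the stale entry is dropped, and clearing the cache restores normal behavior, which is consistent with the error rate recovering once the Memcached servers were restarted.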
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.