We would like to share more details about the events that occurred at Memsource between 2:24 and 5:10 AM CEST on October 2, 2020, which led to a partial disruption of the translation memory (TM) service, and about what Memsource engineers are doing to prevent similar issues from happening again.
2:29 AM CEST: Monitoring of the translation memory service reports slower responses. Some instances of the TM service are running out of memory.
2:49 AM CEST: The TM service suffers from an increased error rate. Monitor warnings turn into alerts and a 24x7 on-duty engineer begins investigating.
3:01 AM CEST: Memory dumps from the crashed TM service instances fill the disk. Another 24x7 on-duty engineer from the infrastructure team is engaged. Disk size is increased on all affected instances.
2:49–4:50 AM CEST: Engineers investigate the cause and try to restart the TM service instances, without success.
5:00 AM CEST: Caching is disabled in the TM service and recovery begins.
5:10 AM CEST: No further out-of-memory errors occur and the monitor warnings stop.
2:00 PM CEST: A regular fix is deployed.
Some pre-translation actions triggered TM service searches that required a large amount of data to be loaded into memory. This data was not automatically freed from the cache, so after some time the service froze and crashed with an out-of-memory error.
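To illustrate the failure mode described above, here is a minimal sketch of a cache that caps the number of entries it holds and evicts the least recently used one when the cap is exceeded. It is not the TM service's actual code or the fix that was deployed; the class name BoundedCache, the max_entries parameter, and the loader callback are all hypothetical and exist only for this example.

```python
from collections import OrderedDict


class BoundedCache:
    """Hypothetical least-recently-used cache with a hard cap on entries."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key, loader):
        """Return the cached value for key, calling loader(key) on a miss."""
        if key in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        value = loader(key)
        self._entries[key] = value
        # Evict the least recently used entry once the cap is exceeded,
        # so memory use stays bounded no matter how many keys are requested.
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)
        return value

    def __contains__(self, key):
        return key in self._entries

    def __len__(self):
        return len(self._entries)


if __name__ == "__main__":
    cache = BoundedCache(max_entries=2)
    for segment in ["a", "b", "c"]:
        cache.get(segment, loader=lambda k: k.upper())
    print(len(cache), "a" in cache)
```

A cap on entry count is the simplest bound. A cache holding large search results would more likely need a limit based on the size of the stored data, which this sketch does not attempt.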
We recently migrated all translation memories to a new service. This new service delivers better results and performance, and it is built using the latest technologies. It was tested extensively, and the transition from the old service to the new one took a few months and was done in batches. We believe that this bug is the last major one that we will encounter with the new TM service.
As a reaction to the problems:
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.