Slower pre-translations caused by one slow MT engine
Incident Report for Phrase (formerly Memsource)
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource between 12:10 PM and 3:10 PM CEST on 29 May 2020. These events led to a significant slowdown of pre-translations (many of them stuck for some time), followed by errors returned by pre-translation with the Globalese 3 MT engine. We also want to explain what Memsource engineers are doing to prevent these sorts of issues from happening again.

Timeline

12:10 PM CEST - We received alerts about a higher number of unprocessed asynchronous operations and began investigating the cause.

12:37 PM CEST - The cause was identified: many calls to a slow Globalese 3 MT server. We began implementing measures to unblock the other waiting operations.

12:55 PM CEST - We received the first user complaints about slower operations.

1:17 PM CEST - We temporarily disabled the Globalese 3 engine type in our MT connector, unblocking all waiting operations. The backlog then processed very quickly, but translations with Globalese 3 started to return errors.

1:31 PM CEST - We partially re-enabled Globalese 3; approximately 50% of translations were successful and 50% returned errors.

3:10 PM CEST - After reconfiguring the MT connector service, Globalese 3 was fully re-enabled.

Root cause

A large number of slow pre-translation requests to the Globalese 3 MT engine occupied the capacity of the shared MT integration service, leaving pre-translations for other operations waiting behind them.
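
To illustrate the failure mode (a simplified sketch under our own assumptions, not Memsource's actual connector code): when all MT engines share one bounded pool of worker threads or outgoing connections, long-running calls to a single slow backend can hold every slot, and pre-translations for every other engine queue behind them.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical shared pool used for calls to all MT engines.
// When one engine's calls start taking minutes instead of seconds,
// its in-flight requests hold all 20 threads and pre-translations
// for every other engine wait in the queue behind them.
public class SharedMtConnectorPool {
    private static final ExecutorService SHARED = Executors.newFixedThreadPool(20);

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            // 100 slow "Globalese 3"-style calls saturate the 20 threads...
            SHARED.submit(() -> sleepSeconds(120));
        }
        // ...so this fast call for a different engine sits in the queue
        // until the slow calls ahead of it finish.
        SHARED.submit(() -> sleepSeconds(1));
        SHARED.shutdown();
    }

    private static void sleepSeconds(long seconds) {
        try {
            TimeUnit.SECONDS.sleep(seconds);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```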

Actions to Prevent Recurrence

As a reaction to the problems:

  • The MT connector will reserve capacity for other clients, so slow translations from one client do not significantly slow down translations for other clients (illustrated in the sketch after this list).
  • The number of outgoing connections from the MT service will be monitored for each MT engine.
  • The total capacity of the MT service will be increased.
  • Slow MT engines will be automatically turned off if they start to misbehave.
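
The first and last measures can be sketched as a per-engine capacity cap (a bulkhead) combined with a simple failure counter that temporarily disables a misbehaving engine. The class, limits, and thresholds below are illustrative assumptions, not the actual MT connector implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative guard around outgoing MT calls: each engine gets its own
// slice of capacity, so one slow engine cannot exhaust the whole connector,
// and an engine that keeps failing is temporarily turned off.
public class MtEngineGuard {
    private static final int MAX_PER_ENGINE = 10;      // illustrative per-engine cap
    private static final int MAX_RECENT_FAILURES = 50; // illustrative disable threshold

    private final Map<String, Semaphore> slots = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

    public <T> T call(String engine, Supplier<T> request) {
        AtomicInteger failureCount = failures.computeIfAbsent(engine, e -> new AtomicInteger());
        if (failureCount.get() >= MAX_RECENT_FAILURES) {
            throw new IllegalStateException("MT engine temporarily disabled: " + engine);
        }
        Semaphore slot = slots.computeIfAbsent(engine, e -> new Semaphore(MAX_PER_ENGINE));
        if (!slot.tryAcquire()) {
            // The engine is already using its reserved capacity; fail fast
            // instead of queueing and delaying other engines' translations.
            throw new IllegalStateException("MT engine at capacity: " + engine);
        }
        try {
            T result = request.get();
            failureCount.set(0); // a healthy call resets the failure counter
            return result;
        } catch (RuntimeException ex) {
            failureCount.incrementAndGet();
            throw ex;
        } finally {
            slot.release();
        }
    }
}
```

The permits in use per engine (MAX_PER_ENGINE minus the semaphore's available permits) would also provide the per-engine outgoing-connection metric described in the second measure.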

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine how to make changes that improve our services and processes.

Posted Jun 02, 2020 - 16:18 CEST

Resolved
The issue has been identified and fixed. We continue to monitor performance.
Posted May 29, 2020 - 13:17 CEST
Investigating
Our engineering team is investigating the degraded performance of Machine Translation and File Processing components.
Posted May 29, 2020 - 12:10 CEST
This incident affected: Memsource TMS (EU) (File Processing, Machine Translation).