Slow Import of Files and Quality Assurance Response Time
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource between 4:30 and 6:58 AM CET on the 30th of January, 2020, which led to slow file imports and slow quality assurance response times for some customers.

Timeline

5:06 AM CET: Slow file import and quality assurance response times are reported to Memsource Support. The support agent starts gathering more information from the user and from system monitoring.

6:09 AM CET: The support agent confirms the problem and notifies the engineering team. Only some users seem to be affected.

6:55 AM CET: The engineering team identifies asynchronous requests stuck in the queue. This is assumed to be the cause of the problem, which an increasing number of users are now reporting. The support agent escalates the problem and updates the Memsource status page.

6:58 AM CET: The system is reconfigured to use an alternative method of processing asynchronous requests. Monitoring reports that performance is improving.

7:03 AM CET: Memsource performance is back to normal.

Root cause

A new system that uses a message broker service to process asynchronous requests was recently introduced into Memsource. This new system is intended to increase Memsource stability under heavy loads. The incident revealed a bug in the implementation that caused some messages representing asynchronous requests not to be properly acknowledged while being processed. This eventually interrupted communication with the message broker, and the system could not recover from this unexpected error. The new system had been enabled for only some organizations, and therefore only some users were affected.
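
As an illustration of the acknowledgement problem described above, the following is a minimal sketch of a consumer that settles every delivery exactly once, whether processing succeeds or fails. It is not Memsource's implementation: the report does not name the broker or show any code, so ToyBroker, consume_all, and the handler are hypothetical names, and the broker is a simplified in-memory stand-in.

```python
from collections import deque


class ToyBroker:
    """Hypothetical in-memory stand-in for a message broker (assumption)."""

    def __init__(self, messages):
        self.queue = deque(messages)
        self.unacked = {}  # delivery tag -> message still awaiting settlement
        self.next_tag = 0

    def get(self):
        """Deliver the next message, or return None if the queue is empty."""
        if not self.queue:
            return None
        self.next_tag += 1
        message = self.queue.popleft()
        self.unacked[self.next_tag] = message
        return self.next_tag, message

    def ack(self, tag):
        """Confirm the message was handled so the broker can forget it."""
        del self.unacked[tag]

    def nack(self, tag, requeue):
        """Reject the message, optionally putting it back on the queue."""
        message = self.unacked.pop(tag)
        if requeue:
            self.queue.append(message)


def consume_all(broker, handler):
    """Process every queued message and return the ones whose handler failed."""
    failed = []
    while (delivery := broker.get()) is not None:
        tag, message = delivery
        try:
            handler(message)
        except Exception:
            # Settle the delivery even on failure so it is never left unacknowledged.
            broker.nack(tag, requeue=False)
            failed.append(message)
        else:
            broker.ack(tag)
    return failed
```

In this sketch, a message whose handler raises is rejected rather than left pending, so the set of unacknowledged deliveries is empty once the queue has drained. Whether a failed message should be requeued, dropped, or routed elsewhere is a design choice the report does not discuss.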

Actions to Prevent Recurrence

In response to the incident:

  • The system has been reconfigured so all users use an alternative method of processing asynchronous requests until the bug is fixed.
  • The bug in the new asynchronous request processing system is being fixed.
  • A bug that caused the number of queued asynchronous requests to be incorrectly reported to our monitoring system will be fixed.
  • Our automated alerting system will be set up to trigger an alert if the number of queued messages in the message broker is unexpectedly high.
  • Support agents have been instructed on how to escalate similar problems, and the escalation procedure documentation is being improved. This should shorten response and resolution times.
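
The queue-depth alert mentioned in the list above could take many forms. The following is a minimal sketch of one possible rule, assuming the monitoring system collects periodic samples of the number of queued messages. The function name, the threshold, and the requirement of several consecutive high samples are illustrative assumptions, not details from this report.

```python
def should_alert(depth_samples, threshold, min_consecutive=3):
    """Return True if the most recent samples all exceed the threshold.

    depth_samples:   queue depths, oldest first (hypothetical monitoring data)
    threshold:       depth considered unexpectedly high (assumed value)
    min_consecutive: number of trailing samples that must all be high
    """
    if len(depth_samples) < min_consecutive:
        return False
    recent = depth_samples[-min_consecutive:]
    return all(depth > threshold for depth in recent)
```

Requiring several consecutive high samples, rather than a single one, is one way to avoid alerting on a short burst that the consumers clear on their own.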

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted Jan 31, 2020 - 16:54 CET

Resolved
This incident has been resolved.
Posted Jan 30, 2020 - 07:38 CET
Monitoring
The issue has been identified and fixed. Right now, we continue to monitor the performance. A postmortem with more details will be added later.
Posted Jan 30, 2020 - 07:15 CET
Investigating
Our engineering team is investigating the slow processing of newly imported files.
Posted Jan 30, 2020 - 06:59 CET
This incident affected: Memsource (SLA) (File Processing).