Slow Processing of Microsoft Office Files between 9:28 and 11:15 AM CET
Incident Report for Memsource
Postmortem

Introduction

We would like to share more details about the events that occurred with Memsource between 9:28 and 11:15 AM CET on the 22nd of January, 2020, which led to the slow processing of newly imported Microsoft Office files, and about what Memsource engineers are doing to prevent such issues from happening again.

Timeline

9:28 AM CET: The engineering team noticed slow processing of newly imported Microsoft Office files.

10:36 AM CET: Because processing was stuck waiting for document conversion, all instances of the file conversion service were restarted; jobs remained stuck.

10:45 AM CET: One instance of the file processing service was restarted. This helped newly created jobs.

10:53 AM CET: Requests in the file processing service were stuck waiting for connections to the file conversion service. Two more instances of the file processing service were restarted. The number of stuck jobs dropped significantly.

11:15 AM CET: The last instance of the file processing service was restarted, and the remaining stuck documents were imported within several minutes. This resolved the issue.

Root cause

Some instances of the file processing service were not able to communicate with the file conversion service to generate the in-context preview for MS Word and PowerPoint documents. As a result, the import of those documents became stuck.

The root cause was a slow leak of connections to the file conversion service over the period of a week. By the time the incident started, the file processing service had run out of connections to the file conversion service.
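
To illustrate how a leak of this kind can exhaust a bounded pool, here is a minimal Python sketch. It is not Memsource's code: the `ConnectionPool` class, the `convert_leaky` and `convert_safe` functions, and the pool size are all hypothetical, and the report does not say on which code path the real leak occurred. The sketch assumes the leak happens when a request fails before its connection is returned.

```python
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical bounded pool; a 'connection' is just a reserved slot."""

    def __init__(self, size):
        self.size = size
        self._slots = threading.BoundedSemaphore(size)
        self._lock = threading.Lock()
        self.in_use = 0

    def acquire(self, timeout=0.01):
        # With no timeout, callers would wait here indefinitely,
        # which is what a "stuck" import looks like from the outside.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("pool exhausted")
        with self._lock:
            self.in_use += 1

    def release(self):
        with self._lock:
            self.in_use -= 1
        self._slots.release()

    @contextmanager
    def connection(self):
        # Guarantees the slot is returned even if the caller raises.
        self.acquire()
        try:
            yield
        finally:
            self.release()


def convert_leaky(pool, fail):
    pool.acquire()
    if fail:
        # Assumed leak path: the error skips release().
        raise RuntimeError("conversion failed")
    pool.release()


def convert_safe(pool, fail):
    with pool.connection():
        if fail:
            raise RuntimeError("conversion failed")


def count_starved(convert, pool, failures):
    """Count requests that could not obtain a connection at all."""
    starved = 0
    for fail in failures:
        try:
            convert(pool, fail)
        except TimeoutError:
            starved += 1
        except RuntimeError:
            pass  # an ordinary failed conversion, not a pool problem
    return starved


if __name__ == "__main__":
    outcomes = [True, False, True, False, False]
    print("leaky:", count_starved(convert_leaky, ConnectionPool(2), outcomes))
    print("safe: ", count_starved(convert_safe, ConnectionPool(2), outcomes))
```

In the leaky variant every failed request permanently removes one slot from the pool, so later requests time out even though nothing is actually using the connections. Restarting the process resets the pool, which is consistent with restarts of the file processing service clearing the stuck jobs.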

Actions to Prevent Recurrence

In response to this incident:

  • The connection leak was found and fixed.
  • Connection pool monitoring will be established.
  • Monitoring of asynchronous requests will be improved to identify such problems earlier.
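
As a rough idea of what connection pool monitoring could look like, the following Python sketch samples the number of connections in use and raises simple alerts. The `PoolMonitor` class, its thresholds, and the alert names are assumptions made for illustration; the report does not describe the monitoring Memsource actually plans to deploy, and the "steadily rising" check is a deliberately simplistic heuristic.

```python
from collections import deque


class PoolMonitor:
    """Hypothetical monitor fed with periodic 'connections in use' samples."""

    def __init__(self, pool_size, warn_ratio=0.8, window=5):
        self.pool_size = pool_size
        self.warn_ratio = warn_ratio
        self.samples = deque(maxlen=window)

    def record(self, in_use):
        """Store one sample and return the alerts it triggers."""
        self.samples.append(in_use)
        alerts = []
        if in_use >= self.pool_size:
            alerts.append("exhausted")
        elif in_use >= self.warn_ratio * self.pool_size:
            alerts.append("high-utilization")
        # Simplistic leak heuristic: usage rose on every sample in the window.
        history = list(self.samples)
        window_full = len(history) == self.samples.maxlen
        if window_full and all(b > a for a, b in zip(history, history[1:])):
            alerts.append("possible-leak")
        return alerts


if __name__ == "__main__":
    monitor = PoolMonitor(pool_size=10, warn_ratio=0.8, window=3)
    for sample in [1, 2, 3, 8, 10, 4]:
        print(sample, monitor.record(sample))
```

A leak that develops over a week would rarely rise on every sample, so a production check would more plausibly compare the low-water mark of pool usage across longer periods. The sketch only shows the shape of the signal: usage that never returns to its baseline.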

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine what changes will improve our services and processes.

Posted Jan 23, 2020 - 14:16 CET

Resolved
This incident has been resolved.
Posted Jan 22, 2020 - 11:25 CET
Monitoring
The issue has been identified and fixed. Right now, we continue to monitor the performance. A postmortem with more details will be added later.
Posted Jan 22, 2020 - 11:08 CET
Investigating
Our engineering team is investigating the slow processing of newly imported Microsoft Office files.
Posted Jan 22, 2020 - 10:59 CET
This incident affected: Memsource (SLA) (File Processing).