Post-Mortems: Learning from our mistakes

Our recent TestFairy blunder has shed some light on a scary reality: as an organisation, we are not ready for primetime. Incidents like this, at scale, damage reputation, for the most part beyond repair.

Using third-party (3P) services has been a known concern for months now. However, what is most surprising about this incident is that this is the second time a bug in TestFairy has caused seed phrases to be leaked. Making the same mistake twice, especially for an organisation so focused on security and privacy, is unacceptable.

One immediate tactic to mitigate future incidents is to implement post-mortems. If you aren’t familiar with them, please do some background reading on how it’s done at Etsy and Google. If you touch code, both of those documents are a must-read for context.

- Status template for a post-mortem can be found here -

Proposed workflow for all incidents at Status:

  1. Immediately after an incident has been discovered, a Post-Mortem should be created and documented live as the incident is diagnosed and resolved.
    a. A first reaction when discovering an incident should be “Where is the post-mortem?!?”.
    b. Email [email protected] with a link to the post-mortem as it’s being worked on.
  2. Once the post-mortem is completed, it should be shared in the “Status-All → Security” category in Discuss.

Please add any comments, concerns, thoughts, or additions to the process. Once it is fully fleshed out, we can publish this as its own static page.

A good test of the process would be to create a Post-Mortem for the TestFairy incident, where we use the template as a structure to lead the conversation on how we improve.


I very much agree with everything in this post. Big yes to the TestFairy incident being a (“good”) opportunity to do this.

A post-mortem for this will be done by the end of the week for everyone to read.

Those are great reads, thank you!