Abstract: Six Proven Techniques to Investigate a Production Failure

Just before Christmas, you brought a festive vibe to your customers by adding a tiny Christmas tree icon to the login page of your old Java app. Even though it was just a little page decoration, you still made the effort to get it tested, so it was a bit of a shock when everything went wrong after the release, resulting in a production failure, a rollback, and an “unexpectedly entertaining” holiday season.

These things happen, but the real challenge is finding the root cause of the problem. The easy way out is often to fall back on blaming the usual suspects:

“It’s because our QAs decided not to run a regression.”

And so, straight to the post-mortem.

We really need to stop settling for the easiest explanation. This kind of retrospective approach won’t prevent the next failure, won’t help us discover the real reason behind the issue, and certainly won’t address the situation with all its complexities and relationships.

I’ll be providing a brief overview of six different root cause analysis techniques, including Value Stream Mapping and the Ishikawa diagram.

I hope you will find some that suit your needs, so that you can “really enjoy” investigating a production failure next time!