Responding to and learning from failure

  • Andrew Harvey, Microsoft

Tailwind Traders has done a tremendous amount of good work using modern operations principles and practices to create, deploy, monitor, and troubleshoot their applications and infrastructure in the cloud. As an initial effort, this has been superb, but the engineers know that putting processes in place for continuous learning and continuous improvement is the only sure way to provide continuous value to the customers.

In this module, we'll do more than just talk about these processes; we'll see how they work in action. We pick up the story right in the middle of Tailwind Traders' first significant outage. Everything is on fire (metaphorically) and the engineers are struggling to understand the problem and remediate it as fast as possible. We'll demonstrate not just how the outage is brought under control, but even more importantly, how Tailwind Traders is able to learn from their experience after the fact and improve their systems while doing so. Understanding this process is one of the most important keys to continuous improvement, "leveling up" our operational practices, and getting the most value from our cloud investments.

  • Date: Thursday, February 14
  • Time: 3:30 PM - 4:30 PM
  • Room: Grand Ballroom B3
  • Location: Breakout 3
  • Session Type: Module: 60 minutes
  • Session Code: SRE50
  • Learning Path: Operating applications and infrastructure in the cloud
  • Level: Advanced (300)

This speaker's other sessions: