Principal Thoughts

Plan for Failure First

Over my many years in the software industry, I have observed a set of common patterns that emerge as people gain experience. One behavior (of many) that sets experienced engineers apart from more junior ones is attention to detail and, in particular, attention to how things are going to break.

In our fast-paced industry, management and business leaders often push for results on tight schedules. If you’ve worked in software for any length of time, you have experienced the squeeze: a manager asks you to “temporarily” skip the testing steps and “overlook” good security practices “just for now,” always promising that these things will be fixed after release. You know as well as I do that there won’t ever be a time without tight deadlines, or a space where these shortcuts can be undone. While all of these practices are harmful (and I address them elsewhere), there is a more insidious shortcut that often finds its way into development as a result of this schedule pressure.

I’m speaking, of course, of the subject of this post’s title: when you are pressed to deliver and you have a clear description of what the customer wants, it is easy to code only for that solution and hope everything else will work. Put another way, you will be pressed to build solutions that assume the best case:

  • Third-party libraries will always do what they are documented to do
  • Infrastructure components won’t fail in any way, or at least, not in any way that isn’t super common
  • Dependent services will work exactly the way you expect them to
  • No one will violate their SLAs or have any problems with their own systems

No matter how tempting it is, however, assuming a perfect environment is a trap, and will always end up leading to pain.

As you gain experience, you will find it more and more natural to push back on this tendency. Even so, I have seen seasoned veterans fall into the trap of assuming best-case scenarios, usually driven by schedule pressure. If you don’t plan for failures, you will not be prepared to fix things when failures occur (and let’s face it, they will). If a failure creates a bad situation for your customers, your business, or a critical aspect of your product, then you are immediately under pressure to fix a complex issue that you did not anticipate. In this situation, poor solutions, hacks, and fragile work-arounds tend to prevail, leading to further pain down the road.

As a mental strategy to help solve this problem, I strongly encourage looking for the failure modes first. If you are adopting a piece of third-party technology, do a deep dive to understand how that technology might fail. Read the documentation, but also read the code and look for others who have used this solution and found problems. If you are leveraging a piece of infrastructure, learn about the failure modes, both common and complex. You don’t need to address every one of them, but you need to document what you find, provide links to helpful information, and, if possible, develop a strategy for overcoming the particular failure modes.

Finally, if you are partnering with another team, or relying on a dependent service or technology, develop a deep understanding of how that dependency should be used, not just from your perspective, but from the perspective of the team that owns and supports it. Dive into their understanding of the service they provide, including SLAs and support models, and make sure that they align with your use case. If they are unable to support your solution, work with your leadership and theirs to resolve the issue. Don’t trust that just because you are doing something important, they will have the resources or ability to rise to meet your demands.

I have often advised that if you are going to forget to write one piece of code, it should be the success case. That is, after all, only one case, and it is the clearest and best defined. Preparing for failure cases, either directly in the code or via documentation and runbooks, is too often left for “later” and, consequently, not done. The pain that this causes customers, delivery timelines, and your own team can be considerable. If you plan for failure first, you will be a lot more likely to succeed.
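
To make the idea of writing the failure cases before the success case concrete, here is a minimal Python sketch. Everything in it is hypothetical and only for illustration: the `read_quota` function, the `DependencyError` exception, the `fetch` callable (assumed to return a status code and a response body), and the `quota` field are my own assumptions, not part of any real service. Your own dependencies will have different failure modes, which is exactly why they are worth investigating up front.

```python
import json


class DependencyError(Exception):
    """Hypothetical error type: the dependent service gave us nothing usable."""


def read_quota(fetch):
    """Return the integer "quota" reported by a hypothetical dependent service.

    `fetch` is an assumed callable that performs the request and returns
    (status_code, body_bytes). Passing it in keeps the sketch transport-neutral.
    """
    # Failure 1: the call itself fails (timeout, refused connection, DNS).
    # TimeoutError and ConnectionError are both subclasses of OSError.
    try:
        status, body = fetch()
    except OSError as exc:
        raise DependencyError(f"service unreachable: {exc}") from exc

    # Failure 2: the service answers, but not with success.
    if status != 200:
        raise DependencyError(f"unexpected status {status}")

    # Failure 3: the body is not the JSON we were promised.
    try:
        payload = json.loads(body)
    except ValueError as exc:
        raise DependencyError("response was not valid JSON") from exc

    # Failure 4: valid JSON, but the wrong shape or a nonsensical value.
    quota = payload.get("quota") if isinstance(payload, dict) else None
    if isinstance(quota, bool) or not isinstance(quota, int) or quota < 0:
        raise DependencyError(f"missing or invalid quota: {quota!r}")

    # The success case: one line, written last.
    return quota
```

Notice the proportions: four failure branches and a single line for the happy path. A caller that catches `DependencyError` can then decide, deliberately and ahead of time, whether to retry, fall back to a default, or alert someone.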

Justin, of The Aspiring Principal