Million Dollar Software Bugs

Would your company do things differently if a critical bug cost a million dollars to fix?

Part of the reason for the bugs in software these days might be our willingness to accept them. We have become so used to software that is filled with bugs that we lower our expectations accordingly.

I suspect that most companies would alter their practices if the cost of bugs were unacceptably high.

Don’t believe me? Read on.

Background

I started my software life in the world of embedded firmware in the 1990s. For those of you who don’t know what “embedded firmware” is: it is software that is compiled to low-level machine code, burned onto a chip, placed on a board, and sold as part of some interesting, reliable electronic device.

They call it “firmware” mostly because it is somewhere in between software and hardware. Firm, see? Okay, let’s move ahead.

Anyway, this is what I ate, slept, lived, and breathed for the first 8 years of my software career. I loved it in some ways, and did not love it in other ways. Such is life.

Ever since I left the embedded software genre around the year 2000, I have always been fascinated by the differences between the embedded world and the non-embedded world. They are very different animals. One major driver, in my opinion, is the cost of bugs – also referred to as “cost of quality.”

Software Bugs Have a Cost?

Yes, dear reader, they do. Sometimes the cost of bugs is hidden – consciously or unconsciously – from engineering. This can be a double-edged sword, in my opinion.

If you are a software developer, then you probably know what a “bug” is. Of course, there are classifications of “bugs”, but that is a post for another time. For the sake of this post, let us define a bug to be: anything that causes failure or causes the software to function outside its requirements.

When a bug is detected in a typical software company’s product, one of two things will happen:

If the bug is non-critical: somebody writes a ticket, that ticket eventually makes it to development, a developer finds and fixes the problem, and an update is eventually released with a software patch. This is the “normal” way issues are addressed for much of the software world.
If the bug is critical: it is escalated to the CEO by the CEO’s golf buddy. The bug gets delegated to the CTO, who panics because he is not a technology guy in the first place. There is a fire drill to fix the issue, intense pressure is put on the development team, and an update is eventually released with a software patch. Okay, that was over-dramatized. This is hopefully the exception and not the rule!

If we consider the cost of this process, we are talking about the resources required to support fixing the bug. In the critical case, that might mean several days of work across several layers of the company.

Multiply that by the salaries of the involved stakeholders, and it may get into the thousands or tens of thousands of dollars in total.
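As a rough illustration of that multiplication, here is a minimal Python sketch. The roles, daily rates, and number of days are made-up assumptions for the sake of the arithmetic, not figures from any real company.

```python
# Back-of-the-envelope cost of a critical-bug fire drill.
# All roles, rates, and durations below are hypothetical.

def fire_drill_cost(people, days):
    """Sum (daily rate x days) over everyone pulled into the fix."""
    return sum(daily_rate * days for daily_rate in people.values())

# Hypothetical loaded daily rates, in dollars.
people = {
    "CEO": 2000,
    "CTO": 1500,
    "engineering manager": 900,
    "developer 1": 700,
    "developer 2": 700,
    "QA engineer": 600,
    "support lead": 500,
}

cost = fire_drill_cost(people, days=3)
print(f"Estimated cost: ${cost:,}")
```

With these invented numbers, three days of fire drill lands in the tens of thousands of dollars, which is the range I have in mind.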

Now imagine we are dealing with an embedded system that is deployed out in the wild somewhere. The same scenarios might look more like this:

Non-critical bug: somebody writes a ticket, that ticket eventually makes it to development, a developer finds and fixes the problem. The software build is tested and released to the factory, which burns chips and replaces units as they come in from the field, or releases the build to customers for downloading and on-site programming. It might take years for the fix to reach all units.
Critical bug: units are pulled from the field and boxed, shipped, tracked, received; chips are pulled and re-burned; validation is re-done for every unit, and they are sent back out to the field.

Think about how expensive either of those could be. Imagine that there are a million units in the field. What if it were a hundred million units?
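To get a feel for how the numbers scale, here is a small Python sketch. The per-unit handling cost of one dollar is a deliberately optimistic, made-up figure; real shipping, re-burning, and re-validation would likely cost far more per unit.

```python
# Rough scaling of a field recall. The per-unit cost is hypothetical.

def recall_cost(units_in_field, cost_per_unit):
    """Total cost of pulling, re-burning, and re-validating every unit."""
    return units_in_field * cost_per_unit

for units in (1_000_000, 100_000_000):
    total = recall_cost(units, cost_per_unit=1)  # assume just $1 per unit
    print(f"{units:>11,} units -> ${total:,}")
```

Even at a dollar a unit, the first case already reaches seven figures.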

A critical bug might cost a million dollars or more to rectify. Look at what happened to GM in 2014 (in fairness, this was likely not a software bug; I am just pointing to the cost). An excerpt:

“GM recalled about 29 million vehicles for ignition switch defects and other issues, at a cost of $2.5 billion.”

That’s billion… with a “b”. Ouch.

This leads to my point.

Million Dollar Bug Thinking

Imagine for a moment that bugs cost a lot of money when they are uncovered after release. In fact, assume that it is a significant chunk of revenue. How about one million dollars for a critical bug?

“Toto, I’ve a feeling we’re not in Kansas anymore.”

-Dorothy in The Wizard of Oz

Let your imagination wander here. Whatever a “critical” bug means to you in your specific area of specialty, ponder its cost to fix.

I suspect that software companies would do things very differently if a critical bug had such a large cost. Here are some differences that I can imagine:

The Organization Gets Serious

If a critical software bug cost a million dollars, we might see…

upper and mid-level management really engaged in ensuring that people, tools, and processes are in place to prevent or fix problems.
engineering getting a seat at the table of organizational leadership, if it didn’t have one already.
engineering leadership composed of people with strong technical backgrounds who understand how to proactively handle technical risk – not hands-off business managers, lawyers, dentists, or sports pros (not that those things are without merit, of course).
little tolerance for lack of alignment between business and engineering, if the resulting products were saturated with problems that caused massive expense.
perhaps even 360-degree feedback on middle management layers, to glean some insight into their effectiveness.

To me, these actually sound like great ideas for any software company, regardless of the cost of bugs.

Requirements Skills

If a critical software bug cost a million dollars, I could imagine…

far more diligence going into writing and managing user stories.
business analysts and product folks receiving training in writing and managing stories that are concise, unambiguous, consistent, and so on. In the pre-agile days, there were industry conventions for the specific act of writing requirements (Google IEEE 830 or RFC 2119). Not that the output was friendly or consumable – but there was some recognition that it was a specialized skill.
product roadmaps and backlog contents aligning more closely, perhaps with clearer boundaries between the tactical and the strategic.
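As a hint of what such a convention makes possible, here is a minimal Python sketch of a check that flags requirement statements lacking an RFC 2119 requirement-level keyword. The function name and the sample requirements are hypothetical, and a real requirements review would need far more than a keyword match.

```python
import re

# Requirement-level keywords defined by RFC 2119 (uppercase by convention).
RFC2119_KEYWORDS = (
    "MUST NOT", "SHALL NOT", "SHOULD NOT",
    "MUST", "SHALL", "SHOULD",
    "REQUIRED", "RECOMMENDED", "MAY", "OPTIONAL",
)
_PATTERN = re.compile(r"\b(" + "|".join(RFC2119_KEYWORDS) + r")\b")

def missing_requirement_level(requirements):
    """Return the statements that contain no uppercase RFC 2119 keyword."""
    return [r for r in requirements if not _PATTERN.search(r)]

# Hypothetical requirement statements.
requirements = [
    "The device MUST reject firmware images with an invalid checksum.",
    "The UI should probably be fast.",
    "The bootloader SHOULD NOT erase user settings during an update.",
]

for statement in missing_requirement_level(requirements):
    print("No requirement level stated:", statement)
```

Only the second statement is flagged: it never says how binding the requirement is, and "fast" is left for the reader to interpret.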

This seems beneficial, even if bugs are not expensive to fix.

Sprint Demos

If a critical software bug cost a million dollars, I would wager that…

nobody would be falling asleep in boring Sprint Demos. In fact, Sprint Demos might become much more interesting!
business folk would probably ask more questions to ensure alignment and proper functionality.
there might even be an expectation that developers try to evolve their skills at communicating with a non-technical audience in order to be sure the ramifications of technical decisions are well understood.

There might be some healthy and open discussion in such demonstrations.

Code Oversight

If a critical software bug cost a million dollars, I would think that…

there would be a realization that code changes mean risk.
suddenly, code oversight would be considered far more important, rather than existing primarily in the domain of the craftsman. It is likely that close to 100% of code changes would have multiple sets of eyes on them.
we would probably see great collaboration tools such as Fisheye/Crucible suddenly being used to strongly enforce code review roles, formal sign-off, and so forth. You will see my own militant code review philosophy in another post.
controlled refactoring would receive time on most backlogs (note the word “controlled”.)
there would be a much larger focus on Test-Driven Development (TDD) where it really counts. In the Java world for example, sometimes testing the getters and setters gives a nice feeling of fulfillment in test writing, but is not really time well spent when compared to testing complex functionality.
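To make that last point concrete, here is a minimal sketch. I am using Python for brevity, but the same idea applies in Java. The function and its rules are hypothetical; the point is that the tests target branching logic and boundary conditions rather than trivial accessors.

```python
def retry_delay_seconds(attempt, base=1.0, cap=30.0):
    """Exponential backoff: base * 2**attempt, capped, for attempt >= 0."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(cap, base * (2 ** attempt))

def test_first_attempt_uses_base_delay():
    assert retry_delay_seconds(0) == 1.0

def test_delay_doubles_each_attempt():
    assert retry_delay_seconds(3) == 8.0

def test_delay_is_capped():
    assert retry_delay_seconds(10) == 30.0

def test_negative_attempt_is_rejected():
    try:
        retry_delay_seconds(-1)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")

if __name__ == "__main__":
    # Runs the tests directly; a runner such as pytest would also find them.
    for test in (
        test_first_attempt_uses_base_delay,
        test_delay_doubles_each_attempt,
        test_delay_is_capped,
        test_negative_attempt_is_rejected,
    ):
        test()
    print("all tests passed")
```

Each test pins down a behavior that could plausibly break during a refactoring (the doubling, the cap, the input check), which is where a failing test actually earns its keep.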

Meaningful TDD and more effective code review practices would be good overall, don’t you think?

Knowledge Transfer and Project History

If a critical software bug cost a million dollars, I would really hope that…

project historical knowledge would matter a lot more. Zero documentation on an agile project would probably not be tolerated.
new engineers on projects would probably have more oversight or mentoring until they came up to speed.
we would likely come up with lightweight documentation (e.g. Markdown or similar) that is tied directly to code by some intermediary tooling. Something like what Swagger does for API documentation. The problem is, the documentation would have to be “valuable” in the eyes of the audience. This is a very difficult thing to achieve.

Try starting on a project with a codebase of a million source lines of code (SLOC) that has been documented in stories alone. Try getting an understanding of the major concepts, the assumptions made, and the technical trade-offs. It is very challenging indeed.

In fact, I assert that user stories are not an ideal mechanism by which to capture project history. If there is no written design, and code is the only deliverable that matters, this is a recipe for major problems due to misunderstood implementations.

I have personally seen near rewrites of entire codebases, simply because the “original intent” – the good and bad of existing design – was undocumented, lost through attrition, or unknown. Those rewrites probably could have been avoided if the overall design and its pros and cons had been clearly captured somewhere.

This one would be interesting to explore further.

Testing, Oh, Testing

If a critical software bug cost a million dollars, this might happen…

testing and QA practices would take on new importance. Love it or hate it, development and QA need each other.
companies might buy best-of-breed automated testing solutions and implement formal test plans that had consequences for lack of coverage.
companies would probably encourage thoughtful, meaningful, formal charters between development, QA, and product teams.
BDD tools might evolve more rapidly. How about voice-narrated BDD tests?
QA might even have the political power to stop delivery of software. Anyone remember SEI CMM?

A quality organization with a stronger project voice might be an interesting discussion.

Continuous Integration (CI) and Automation

If a critical software bug cost a million dollars…

the alignment between development, IT, and dev-ops would receive much focus.
CI scripts might be reviewed with the same intensity as the code.
there might be metrics on CI as much as there would be metrics coming out of software products.
change management in dev-ops might be managed with a charter in which upstream and downstream functions are clear stakeholders.
integration testing would probably see a high degree of automation.
test data would suddenly be crucial and carefully managed.
every process that can be automated would be automated, probably with reporting attached to it.

Interestingly, the software industry seems to be moving in this direction to some degree, with the rise of containerization and related tooling.

What Do You Think?

If a critical bug cost a million dollars to fix, how would your company do things differently?