May 17, 2006

Learning from Failure

From the NYTimes:

Failure, Mr. Petroski shows, works. Or rather, engineers only learn from things that fail: bridges that collapse, software that crashes, spacecraft that explode. Everything that is designed fails, and everything that fails leads to better design. Next time at least that mistake won’t be made: Aleve won’t be packed in child-proof bottles so difficult to open that they stymie the arthritic patients seeking the pills inside; narrow suspension bridges won’t be built without “stay cables” like the ill-fated Tacoma Narrows Bridge, which was twisted to its destruction by strong winds in 1940.

Because of the volatile nature of software development, failure is a spectre that lingers over every project, all the time: failures of communication, of comprehension, of reliability, of efficiency, of usability, of execution. Failure, as Mr. Petroski says, is both inevitable and the key to learning and growing.

Iterative development is an example of the state of the art improving in response to this reality. Given that failures are going to happen, and happen often, iterations allow them to be recognized and studied soon after they occur, instead of one, two, or three years down the road. Iterations also allow failures to be addressed earlier, which leads to a better overall design. The next time someone says ‘We need to think this through before we do it’, remember that engineers thought the Tacoma Narrows Bridge through too. And yet it failed. The failure was studied and learned from, and bridges have been better ever since (although we appear to be overdue for another major bridge failure).

Unless you are your own customer, you will never have a precise understanding of what your customers want, and they will never have a precise understanding of what you are going to give them until they get it. That’s what I would call a failure-rich environment. Iterative development allows those failures to be identified early, discovered in a controlled way, and dealt with immediately.

Compare and contrast with this story, where a fairly innocuous change to the implementation of the architect’s design almost led to major catastrophe. If the architect had been more dismissive of the student’s question, or had decided to remain quiet to avoid disgrace, things might have ended tragically.

I can’t help but notice that it was a technical change that created the problem, and a human interaction that brought it to light.