21 November 2011

Nobody has solved the fault resilience problem

Lessons from Supercomputing 2011:

1. Nobody has solved the programming models problem.  This has never stopped anyone from writing unmaintainable but functional scientific codes, however.

2. Nobody has solved the fault resilience problem.  This actually terrifies everyone who has thought about it, because we have never had to deal with this problem before.  Furthermore, most of us are ignorant of the huge body of existing research and practice on fault-resilient computing in highly fault-prone environments with a huge penalty for failure (space-based computing, sometimes with human payloads).  Most of us balk at the price of the fault resilience technique that experience recommends: triple modular redundancy, that is, run everything three times and vote (a toy sketch of the voting follows this list).  Nevertheless, we're being asked by policymakers to do computations to help them make high-consequence decisions (asteroid-Earth collisions? climate change? whether The Bomb might go off by mistake if unfortunate techs drop it on their big toe?).

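Triple modular redundancy is at least easy to state, even if the price is hard to swallow: run the computation three times and keep whatever answer at least two replicas agree on.  Here is a minimal Python sketch of that voting logic; flip_random_bit and unreliable_sum are toys I invented just to have something to vote on, not anyone's real fault model.

    # Toy sketch of triple modular redundancy (TMR): run the same computation
    # three times and keep the answer that at least two replicas agree on.
    import random
    import struct

    def flip_random_bit(x):
        """Flip one random bit of a double, simulating a soft error."""
        (bits,) = struct.unpack("<Q", struct.pack("<d", x))
        bits ^= 1 << random.randrange(64)
        (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits))
        return corrupted

    def unreliable_sum(values, fault_rate=1e-4):
        """Sum a list, occasionally corrupting the running total."""
        total = 0.0
        for v in values:
            total += v
            if random.random() < fault_rate:
                total = flip_random_bit(total)
        return total

    def tmr(compute, *args):
        """Run compute three times and majority-vote on the results."""
        results = [compute(*args) for _ in range(3)]
        for candidate in results:
            if results.count(candidate) >= 2:
                return candidate
        # If two or more replicas were corrupted, voting cannot recover.
        raise RuntimeError("no two replicas agree")

    if __name__ == "__main__":
        data = [float(i) for i in range(1000)]
        print(tmr(unreliable_sum, data))

Note the catch: the vote only helps if at most one replica is hit.  When two of the three go bad, you pay triple and still lose, which is part of why the price of this protection keeps coming up.
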
The #2 situation is comparable to that of floating-point arithmetic in the awful, awful days before industry and academia converged on a standard.  The lack of a sensible standard meant that immense floating-point expertise was required to have any confidence in the result of a computation.  These days, unreliable data and computations might require the same level of expertise to prove that existing algorithms still get the right answer when the hardware misbehaves.  Around six years ago, Prof. William Kahan (UC Berkeley) suggested that perhaps only one PhD expert in floating-point computation graduates per year.  While his standards are exceptionally high, I have no doubt that the lack of training in numerical analysis is the bottleneck for making use of computers whose data and computations are likely to experience undetected errors.