21 November 2011

Nobody has solved the fault resilience problem

Lessons from Supercomputing 2011:

1. Nobody has solved the programming models problem.  This has never stopped anyone from writing unmaintainable but functional scientific codes, however.

2. Nobody has solved the fault resilience problem.  This actually terrifies everyone who has thought about it, because we have never had to deal with this problem before.  Furthermore, most of us are ignorant of the huge body of existing research and practice on fault-resilient computing in highly fault-prone environments with a huge penalty for failure (space-based computing, sometimes with human payloads).  Most of us balk at the price of the fault resilience that this experience recommends (triple modular redundancy, which means doing every computation three times and voting on the result).  Nevertheless, we're being asked by policymakers to do computations to help them make high-consequence decisions (asteroid-earth collisions? climate change? whether The Bomb might go off by mistake if unfortunate techs drop it on their big toe?).
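To make that price concrete, here is a minimal sketch of triple modular redundancy in software. It assumes a deterministic computation whose results can be compared exactly and hashed; the names `tmr_vote` and `run_tmr` are hypothetical, chosen only for illustration, and real fault-tolerant systems often do the voting in hardware.

```python
from collections import Counter


def tmr_vote(results):
    """Return the majority value among three redundant results.

    Raises RuntimeError if all three results disagree, since then
    there is no majority to trust.
    """
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def run_tmr(compute, *args):
    """Run `compute` three times on the same inputs and vote on the outputs.

    Assumption: `compute` is deterministic, so fault-free runs agree exactly.
    """
    return tmr_vote([compute(*args) for _ in range(3)])


# One replica returned a corrupted value; the other two outvote it.
print(tmr_vote([42, 42, 7]))
```

The sketch masks a single faulty replica at three times the computational cost, which is exactly the price most of us balk at.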

The #2 situation is comparable to that of floating-point arithmetic in the awful, awful days before industry and academia converged on a standard.  The lack of a sensible standard required immense floating-point expertise in order to have some confidence in the result of a computation.  These days, unreliable data and computations might require the same level of expertise in order to prove existing algorithms correct.  Around six years ago, Prof. William Kahan (UC Berkeley) suggested that perhaps only one PhD expert in floating-point computation graduates per year.  While his standards are exceptionally high, I have no doubt that the lack of training in numerical analysis is the bottleneck for making use of computers whose data and computations are likely to experience undetected errors.


Hariprasad Kannan said...

You should continue this blog.

HilbertAstronaut said...

That's kind of you! Work is SUPER busy so I end up not having a lot of time for extracurricular stuff, but this reminds me to think of new things to write...

Hariprasad Kannan said...

Can you answer this query for me?

Is the book "Performance optimization of numerically intensive codes" still a good reference? I would like to learn more about the interaction of numerical methods and computer architecture. The book is from 2001. Processor architectures have advanced considerably since then. At the same time, there should be some timeless concepts regarding performance optimization considering architectural aspects like memory hierarchy. Is this book good for that? What are some other good books?

Thank you for your time.

HilbertAstronaut said...

I must admit that I have not had much time to read books on performance optimization. I learn a lot from my colleagues, who are very good at it :-)

I find simple performance models helpful. Here are the models I find especially insightful:

- Simple counting of data movement terms ("number of messages" / latency-bound access, and "data volume" / bandwidth-bound access)
- The "Roofline Model"
- Parallel complexity (e.g., do I have enough parallelism to justify using a more parallel but less work-efficient algorithm?)
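The first two models can be sketched in a few lines of Python. This is only a rough illustration: the function names and the machine numbers (latency, bandwidth, peak flop rate) are made up, not measurements of any real system.

```python
def message_time(num_messages, num_bytes, latency_s, bandwidth_bytes_per_s):
    """Data movement cost: a latency term per message plus a bandwidth term."""
    return num_messages * latency_s + num_bytes / bandwidth_bytes_per_s


def roofline_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline bound on attainable flop/s.

    arithmetic_intensity is in flops per byte moved; the bound is the
    smaller of the compute peak and intensity times memory bandwidth.
    """
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


# Hypothetical machine: 100 Gflop/s peak, 10 GB/s memory bandwidth.
PEAK_FLOPS = 100e9
PEAK_BANDWIDTH = 10e9

# A kernel doing 0.25 flops per byte is bandwidth-bound on this machine.
print(roofline_flops(0.25, PEAK_FLOPS, PEAK_BANDWIDTH))  # 2500000000.0

# Same data volume, sent as 1000 small messages or as 1 large message.
print(message_time(1000, 1e6, latency_s=1e-6, bandwidth_bytes_per_s=1e9))
print(message_time(1, 1e6, latency_s=1e-6, bandwidth_bytes_per_s=1e9))
```

On this made-up machine, a kernel needs at least 10 flops per byte before the compute peak, rather than memory bandwidth, becomes the limit.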

Hariprasad Kannan said...

Thank you very much for your reply. Appreciate it. I will follow these leads and learn from them.