24 September 2010

Using ScaLAPACK: not so painful!

I finally got an excuse this week to try out ScaLAPACK, a library for distributed-memory parallel dense linear algebra computations.  I've mainly been working on iterative methods for sparse problems, so it's not common for me to want to factor some giant dense matrix.  However, part of those iterative methods involves an orthogonalization method for small groups of dense vectors.  Comparing against ScaLAPACK's QR factorization PDGEQRF should make a good sanity check for accuracy and performance.  (PDGEQRF should be slower, since it uses an algorithm that requires more communication, and also since it likely doesn't use multicore parallelism as effectively for the panel factorization.)

It took me two and a half days to get a prototype accuracy test and benchmark up and running, which is a lot less than I had thought.  The Intel MKL Link Line Advisor was a big help in linking the right libraries.  I've tried building ScaLAPACK from source before -- while that was some four years ago, it was a real pain and I was happy not to have to try that.  The last few hours of those two-and-a-half days were spent trying to figure out why the matrix distribution wasn't working.  The routine in question, PDGEQRF, either refused to accept the input data (it helpfully identified the first argument of the routine which was off in some way, but didn't say why it was wrong), or only factored part of the matrix.  Staring at the ScaLAPACK Users' Guide -- in particular, at the diagrams illustrating global matrix layout -- helped me decode the mysterious parameters describing how the matrix is distributed among processors.

Using ScaLAPACK turned out not to be so difficult as I had thought.  The Users' Guide is well-written, with many helpful diagrams.  The routines have great documentation (and helpfully repeat documentation for certain library-standard interfaces, like the matrix descriptor array).  Having a pre-built library with the linker flags ready for copy-and-paste into my build configuration system was a huge help.  I imagine much of the pain people have reported came from trying to build everything themselves.  Note also that Matlab now has support for distributed-memory parallel computation, and distributed-memory data layouts.  It's based on ScaLAPACK, which means the MathWorks' developers have gone through a lot of this pain for you ;-)

A final interesting point:  the ScaLAPACK Users' Guide was written when High Performance Fortran (HPF) was under serious consideration as the way to write distributed-memory parallel code.  ScaLAPACK doesn't use HPF, but it does use HPF as an alternate notation to describe different ways to distribute matrices over processors.  HPF proved unpopular because it was too hard to build an efficient compiler for it, but parts of HPF made their way into modern versions of the Fortran standard. 

No comments: