15 September 2010

The user experience of lazy automatic tuning

I've been doing some performance experiments with an MPI-parallel linear algebra kernel.  I've implemented the kernel in two different ways: a "reduce then broadcast" (the reduce isn't quite MPI_Reduce(), though it has a similar communication pattern) and a "butterfly."  The latter is more general (it can be used to compute things other than what I'm benchmarking), but it might be slow in certain cases (e.g., if the network can't handle too many messages in flight at once, or if there are many MPI ranks per node and the network card serializes on processing all those messages going in and out of the node).  Each benchmark run tests both implementation alternatives with each of four scalar data types (float, double, complex float, and complex double).
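
To make the two strategies concrete, here's a minimal sketch of the communication patterns only, not my actual kernel: it uses plain summation as a stand-in for the kernel's local combine step, standard MPI calls for the reduce-then-broadcast version, and assumes the number of ranks is a power of two for the butterfly.  The function names and the combine operation are hypothetical.

    /* Sketch of the two communication patterns (not the real kernel).
     * Assumes a power-of-two number of ranks for the butterfly. */
    #include <mpi.h>

    /* Hypothetical stand-in for the kernel's local combine step. */
    static double combine(double mine, double theirs) { return mine + theirs; }

    /* Strategy 1: reduce to rank 0, then broadcast the result. */
    double reduce_then_broadcast(double local, MPI_Comm comm) {
      double result = 0.0;
      MPI_Reduce(&local, &result, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
      MPI_Bcast(&result, 1, MPI_DOUBLE, 0, comm);
      return result;
    }

    /* Strategy 2: butterfly exchange; every rank ends up with the result.
     * Each of the log2(P) rounds pairs ranks whose IDs differ in one bit,
     * so every rank sends and receives a message in every round. */
    double butterfly(double local, MPI_Comm comm) {
      int rank, nprocs;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);
      for (int mask = 1; mask < nprocs; mask <<= 1) {
        int partner = rank ^ mask;
        double theirs;
        MPI_Sendrecv(&local, 1, MPI_DOUBLE, partner, 0,
                     &theirs, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        local = combine(local, theirs);
      }
      return local;
    }

The butterfly keeps every rank busy sending and receiving in every round, which is exactly why it can stress the network (or the network card, when many ranks share a node) more than the reduce-then-broadcast version.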

When I ran the benchmarks, I noticed that the very first run (butterfly implementation strategy, "float" data type) in each invocation of the benchmark was 10-100x slower than the runs for the other data types with the same implementation strategy.  When I rebuilt the executable against a different MPI library, the problem went away.  What was going on?  I noticed that with the "slow" MPI library, the minimum benchmark run time was reasonable, but the maximum run time was much longer.  After reading through some messages in that MPI library's e-mail archives, I found out that the library delays some setup costs until execution time.  That means the first few calls of an MPI collective might be slow, but later calls will be fast.  The other MPI library doesn't seem to do this.
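
Here's roughly the kind of diagnostic I mean (a sketch, not my benchmark): time repeated calls of a standard collective and report the first call separately from the min and max of the later calls.  That's enough to expose a lazy setup cost hidden in the first invocation.

    /* Sketch: separate the first call's time from later calls' min/max,
     * to reveal lazy setup work done inside the first collective call. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int ntrials = 100;
      double local = (double)rank, result;
      double first = 0.0, tmin = 1e30, tmax = 0.0;

      for (int i = 0; i < ntrials; ++i) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Allreduce(&local, &result, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t = MPI_Wtime() - t0;
        if (i == 0) {
          first = t;  /* may include the library's deferred setup */
        } else {
          if (t < tmin) tmin = t;
          if (t > tmax) tmax = t;
        }
      }
      if (rank == 0)
        printf("first call %g s, later calls min %g s, max %g s\n",
               first, tmin, tmax);
      MPI_Finalize();
      return 0;
    }

With the "slow" library, the first-call time dominates the maximum while the minimum stays reasonable, which is just the pattern I saw in the benchmark output.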

I was reminded of some troubles that the MathWorks had with integrating FFTW into Matlab, a few years ago.  FFTW does automatic performance tuning as a function of problem size.  The tuning phase takes time, but once it's finished, running any problem of that size will likely be much faster.  However, users didn't realize this.  They ran an FFT once, noticed that it was much slower than before (because FFTW was busy doing its thing), and then complained.  One can imagine all sorts of fixes for this, but the point is that something happened to performance which was not adequately communicated to users.
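
For those who haven't used it, here's what ordinary FFTW3 usage looks like in C (this is the library's public API, not the way Matlab wraps it): the slow, self-tuning step is creating the plan, and executing the plan afterwards is fast and repeatable for any transform of that size.

    /* Standard FFTW3 pattern: plan once (slow, self-tuning), execute many
     * times (fast).  FFTW_MEASURE actually runs and times candidate
     * algorithms, and may overwrite the in/out buffers while planning. */
    #include <fftw3.h>

    void transform_many_times(int n, int ntransforms) {
      fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
      fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

      /* Slow the first time: FFTW searches for a fast plan for this size. */
      fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

      /* Fast from here on: reuse the tuned plan. */
      for (int i = 0; i < ntransforms; ++i)
        fftw_execute(plan);

      fftw_destroy_plan(plan);
      fftw_free(in);
      fftw_free(out);
    }

A user who times only the first transform sees the planning cost; a user who reuses the plan sees the payoff.  The Matlab complaints came from users who, in effect, only ever saw the first case.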

Now, MPI users are likely more sophisticated programmers than most Matlab users.  They have probably heard of automatic run-time performance tuning, and maybe even use it themselves.  I'm fairly familiar with such things and have been programming in MPI for a while.  Yet this phenomenon still puzzled me until I poked around in the e-mail list archives.  What I was missing was an obvious pointer to documentation describing performance phenomena like this one.  The performance of this MPI library is not transparent.  That's not a bad thing in this case, because the opaqueness hides a tuning process that will make my code faster overall.  However, I want to know about it right away, so I don't have to wonder what I could be doing wrong.

See, I'm not just calling MPI directly.  I've written my own wrapper around another wrapper over MPI.  I didn't know whether my wrapper was broken, the wrapper underneath it was broken, the MPI library was broken, the cluster's job execution software was broken, or whether nothing was wrong and this was just expected behavior.  That's why I want to know right away whether I should expect significantly nonuniform performance behavior from a library.
