24 September 2010

Using ScaLAPACK: not so painful!

I finally got an excuse this week to try out ScaLAPACK, a library for distributed-memory parallel dense linear algebra computations.  I've mainly been working on iterative methods for sparse problems, so it's not common for me to want to factor some giant dense matrix.  However, part of those iterative methods involves an orthogonalization method for small groups of dense vectors.  Comparing against ScaLAPACK's QR factorization PDGEQRF should make a good sanity check for accuracy and performance.  (PDGEQRF should be slower, since it uses an algorithm that requires more communication, and also since it likely doesn't use multicore parallelism as effectively for the panel factorization.)

It took me two and a half days to get a prototype accuracy test and benchmark up and running, which is a lot less time than I had expected.  The Intel MKL Link Line Advisor was a big help in linking the right libraries.  I've tried building ScaLAPACK from source before -- while that was some four years ago, it was a real pain and I was happy not to have to try it again.  The last few hours of those two and a half days were spent trying to figure out why the matrix distribution wasn't working.  The routine in question, PDGEQRF, either refused to accept the input data (it helpfully reported which argument was the first invalid one, but not why it was invalid), or only factored part of the matrix.  Staring at the ScaLAPACK Users' Guide -- in particular, at the diagrams illustrating global matrix layout -- helped me decode the mysterious parameters describing how the matrix is distributed among processors.

Using ScaLAPACK turned out not to be as difficult as I had thought.  The Users' Guide is well-written, with many helpful diagrams.  The routines have great documentation (and helpfully repeat documentation for certain library-standard interfaces, like the matrix descriptor array).  Having a pre-built library, with the linker flags ready for copy-and-paste into my build configuration system, was a huge help.  I imagine much of the pain people have reported came from trying to build everything themselves.  Note also that Matlab now supports distributed-memory parallel computation and distributed-memory data layouts.  That support is based on ScaLAPACK, which means the MathWorks' developers have gone through a lot of this pain for you ;-)

A final interesting point:  the ScaLAPACK Users' Guide was written when High Performance Fortran (HPF) was under serious consideration as the way to write distributed-memory parallel code.  ScaLAPACK doesn't use HPF, but the Users' Guide does use HPF notation as an alternate way to describe the different ways matrices can be distributed over processors.  HPF proved unpopular because it was too hard to build an efficient compiler for it, but parts of HPF made their way into modern versions of the Fortran standard.

15 September 2010

The user experience of lazy automatic tuning

I've been doing some performance experiments with an MPI-parallel linear algebra kernel.  I've implemented the kernel in two different ways:  a "reduce and broadcast" (not quite MPI_Reduce() followed by MPI_Bcast(), though with a similar communication pattern), and a "butterfly."  The latter is more general (it can be used to compute things other than what I'm benchmarking).  However, it might be slow in certain cases (e.g., if the network can't handle too many messages at once, or if there are many MPI ranks per node and the network card serializes on processing all those messages going in and out of the node).  Each run tests both implementation alternatives with four different scalar data types (float, double, complex float, and complex double).

When I ran the benchmarks, I noticed that the very first run (butterfly implementation strategy, "float" data type) of each invocation of the benchmark was 10-100x slower than the runs for other data types with the same implementation strategy.  When I rebuilt the executable using a different MPI library, the problem went away.  What was going on?  I noticed that for the "slow" MPI library, the minimum benchmark run time was reasonable, but the maximum run time was much larger.  After reading through some messages in the e-mail archives of that MPI library, I found out that the library delays some setup costs until execution time.  That means the first few calls of an MPI collective might be slow, but later calls will be fast.  The other MPI library doesn't seem to do this.

I was reminded of some troubles that the MathWorks had with integrating FFTW into Matlab, a few years ago.  FFTW does automatic performance tuning as a function of problem size.  The tuning phase takes time, but once it's finished, running any problem of that size will likely be much faster.  However, users didn't realize this.  They ran an FFT once, noticed that it was much slower than before (because FFTW was busy doing its thing), and then complained.  One can imagine all sorts of fixes for this, but the point is that something happened to performance which was not adequately communicated to users.

Now, MPI users are likely more sophisticated programmers than most Matlab users.  They have likely heard of automatic run-time performance tuning, and maybe even use it themselves.  I'm fairly familiar with such things and have been programming in MPI for a while.  Yet this phenomenon still puzzled me until I poked around the e-mail list archives.  What I was missing was an obvious pointer to documentation describing performance phenomena such as this one.  The performance of this MPI library is not transparent.  That's not a bad thing in this case, because the opacity hides a performance tuning process that will make my code faster overall.  However, I want to know about it right away, so I don't have to wonder what I could be doing wrong.

See, I'm not just calling MPI directly.  I've wrapped up some other wrapper over MPI.  I didn't know if my wrapper was broken, the wrapper underneath was broken, the MPI library was broken, the cluster job execution software was broken, or if nothing was wrong and this was just expected behavior.  That's why I want to know right away whether I should expect significantly nonuniform performance behavior from some library.

08 September 2010

CPUs and GPUs, part 3: memory and programming models

My last post got a good comment which pointed out that I hadn't been covering another important class of programming models: annotations.  C and C++ programmers would call these "pragmas," since their compilers understand lines starting with "#pragma" as directives to the compiler that affect one or more lines of code that follow.  They would likely be familiar with OpenMP, which uses annotations to tell the compiler to parallelize certain loops or fragments of code.  Other programming languages, like Common Lisp, use annotations as hints for optimization -- in particular, as type hints. 

The commenter pointed out that the Hybrid Multicore Parallel Programming (HMPP) system of annotations, developed by CAPS and supported by the PathScale ENZO compiler suite, offers another programming model for heterogeneous node computing.  You write a kernel in a subset of C(++) or Fortran, annotated with operations that the compute device (e.g., GPU) understands (like when to transfer data to / from the host CPU, or when to synchronize device threads), and you label the kernel with an annotation that identifies it for deployment to the device.  I was just introduced to HMPP last night, so please take a look at the comment to read the manual for yourself :-)  However, a brief read of the manual reminds me of Copperhead, a Python annotation system (based on decorators) that accomplishes similar goals.

Annotations require the programmer to (1) know the annotation language as well as the main programming language being used, (2) understand what subset of the programming language can be sent to the device, and (3) understand when and how the annotation changes program semantics.  Compilers should help with #2, but may not; if given invalid code for the device, they may generate incorrect device code, which may crash or silently give the wrong answer.  This is true for other annotation systems as well.  OpenMP, for example, doesn't help you avoid race conditions if multiple threads modify shared state.  You have to identify and annotate critical sections and / or atomic updates yourself.  #3 can also be a challenge, especially if annotations are sprinkled throughout the code.

Before talking about HMPP, I'll first use OpenMP as an example of the difficulties of popular annotation-based approaches.  Many OpenMP performance problems and code-writing challenges relate to the memory model.  You've taken a sequential code, where memory is laid out for sequential operation, and introduced parallelism to the computation but not to the memory layout.  On nonuniform memory access (NUMA) hardware, performance tanks, because the data lives in the wrong place for most of the threads.  At least modern OpenMP implementations try not to let threads wander to NUMA regions different from where they started, which means the operating system could in principle migrate pages closer to the cores that are working on them.  That suggests a "preconditioning" approach:  first use OpenMP parallelism to touch the whole data set (thereby letting the operating system place the data where it needs to be), and then use OpenMP to do the computations you wanted to do in the first place.  Operating systems in common use don't migrate pages that way yet, though!  Another approach is just syntactic sugar for explicitly threaded programming: each thread allocates its own data, and then you program in something like a distributed-memory model.  (It would be easier and less buggy just to run MPI with one process per core.)

Both of these OpenMP approaches are awkward.  The problem is that OpenMP separates computation and memory layout.  The annotations only govern computation; they don't say where the data should live in memory.  That often means throwing away useful information, which many OpenMP programmers know, about which threads will operate on which data.  If you know how to allocate an array, you likely know how you would divide up simple parallel for loops or reductions on the array among the CPU cores.  If you know that, you almost certainly know how to lay out the array in the various NUMA regions.

Heterogeneous programming has its own memory layout problem: data either lives on the host CPU or the device.  Those are two separate memory spaces.  Even if they weren't separate, it still might be expensive to compute on remote data, or even to copy data from one space to another.  This is exactly the same problem as on multicore CPUs with NUMA.  When I look at HMPP, I see annotations for when to perform data transfers between the host and the device, and even ways to overlap data transfers with useful computation.  However, I don't see annotations for where to allocate the data.  The HMPP model assumes (at least from what I can tell from a brief look -- please correct me if I'm wrong) that data always lives on the host, is always transferred to the device for computations, and is always transferred back to the host afterwards.  This is like allocating all your data in one NUMA region and operating on it using CPU cores in all the NUMA regions.

One could augment the annotation approach to include a memory layout model, but it seems like this is just making the language more baroque.  Plenty of programming languages reify data layout without the complication of additional annotations.  Co-Array Fortran is a good example.  This need not be done in the language; libraries can also hide the details of layout and allocation.  Kokkos in Trilinos does this for GPUs, for example.  CUDA and OpenCL are languages, but their special routines for allocating memory on the device are just plain C.

I like the Array Building Blocks (and Kokkos) approach, because it also gives a good model for interactions between legacy code (with legacy memory layout) and new parallel code (that expects or should have data in a particular nonlegacy layout).  Kokkos does this with smart pointers and copy routines: you aren't allowed to call parallel kernels on legacy memory.  Array Building Blocks does this with "bind", which links up legacy memory and compute buffers.  The general idea is that "you don't get the pointer": memory on which heterogeneous or node-level parallel computation is done is encapsulated, and organized optimally by the language or library system.  This is also a great way to unify different kinds of nodes with different memory space(s), which is why Kokkos does it.  Pretty much any kind of node has parallel computing resources, associated with one or more memory spaces, for which data transfer costs are high and data layout matters for performance.

05 September 2010

GPUs and CPUs, part 2: programming models

After finishing last night's post, I was thinking about CPU and GPU programming models -- in particular, about short vector instructions.  On CPUs these are called SIMD (single instruction, multiple data) instructions.  NVIDIA markets them instead as "SIMT" (single instruction, multiple thread), not because of differences in the hardware implementation, but because the programming model is different.  The "CUDA C Programming Guide" version 3.1 says:  "... SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread."  Here, "thread" corresponds to an element in an array, operated on by a vector instruction.

CPU vendors have historically done a poor job presenting programmers with a usable interface to SIMD instructions.  Compilers are supposed to vectorize code automatically, so that you write ordinary scalar loops and vectorized code comes out.  However, my experience writing SIMD code a few years ago was that compilers have a hard time doing this for anything but the simplest code.  Often they will use SIMD instructions but operate on only one element of the short vector, just because the vector pipes are faster than the scalar ones (even if the extra element(s) in the vector are wasted).  The reason they struggle at vectorization -- more so than compilers for traditional vector machines like the Crays -- is that SIMD instructions are more restrictive than traditional vector instructions.  They demand contiguous memory, aligned on wide boundaries (like 128 bits).  If the compiler can't prove that your memory accesses have this nice friendly pattern, it won't vectorize your code.  SIMD branching is expensive (it means you lose the vector parallelism), so compilers might refuse to vectorize branchy code.

Note that GPU hardware has the same properties!   They also like contiguous memory accesses and nonbranchy code.  The difference is the programming model: "thread" means "array element," and consecutive threads are contiguous, whether you like it or not.  The model encourages coding in a SIMD-friendly way.  In contrast, CPU SIMD instructions historically didn't have much of a programming model at all, other than trusting in the compiler.  The fact that compilers make the instructions directly available via "intrinsics" (only one step removed from inline assembler code) already indicates that coders had no other satisfactory interface to these instructions, yet didn't trust the compiler to vectorize.  Experts like Sam Williams, a master performance tuner of computational kernels like sparse matrix-vector multiply and stencils, used the intrinsics to vectorize.  This made their codes dependent on the particular instruction set, as well as on details of the instruction set implementation.  (For example, older AMD CPUs used a "half-piped" implementation of SIMD instructions on 64-bit floating-point words.  This meant that the implementation wasn't parallel, even though the instructions were.  Using the x87 scalar instructions instead offered comparable performance, was easier to program, and even offered better accuracy for certain computations, since temporaries are stored with extra precision.)  Using SIMD instructions complicated their code in other ways as well.  For example, allocated memory has to be aligned to a particular boundary, such as 128 bits.  That boundary depends on the SIMD vector width, which again decreases portability -- and SIMD vector widths have grown over time and will likely keep growing.  Furthermore, these codes are brittle, because feeding nonaligned memory to SIMD instructions often results in errors that can crash one's program.

CPU vendors are finally starting to think about programming models that make it easier to exploit SIMD instructions.  OpenCL, while as low-level a model as CUDA, also lets programmers reason about vector instructions in a less hardware-dependent way.  One of the most promising programming models is Intel's Array Building Blocks, featured in a recent Dr. Dobb's Journal article.  I'm excited about Array Building Blocks for the following reasons:
  1. It includes a memory model with a segregated memory space.  This can cover all kinds of complicated hardware details (like alignment and NUMA affinity).  It's also validation for the work being done in libraries like Trilinos' Kokkos to hide array memory management from the programmer ("you don't get a pointer"), thus freeing the library to place and manage memory as it sees fit.  All of this will help future-proof code from the details of allocation, and also make code safer (you don't get the pointer, so you can't give nonaligned memory to instructions that want aligned memory, or give GPU device memory to a CPU host routine).
  2. It's a programming language (embedded in C++, with a run time compilation model) -- which will expose mainstream programmers to the ideas of embedded special-purpose programming languages, and run time code generation and compilation.  Scientific computing folks tend to be conservative about generating and compiling code at run time, in part because the systems on which they have to run often don't support it or only support it weakly.  If Array Building Blocks gives the promised performance benefits and helps "future-proof" code, HPC system vendors will be driven to support these features.
  3. Its parallel operators are deterministic; programmers won't have to reason about shared-memory nightmares like race conditions or memory models.  (Even writing a shared-memory barrier is a challenging task, and the best-performing barrier implementation depends on the application as well as the definition of "performance" (speed or energy?).)
I've been pleased with Intel's work on Threading Building Blocks, and I'm happy to see them so committed to high-level, usable, forward-looking programming models.  NVIDIA's Thrust library is another great project along these lines.  It's also great to see both companies releasing their libraries as open source.  The "buy our hardware and we will make programming it easier" business model is great for us coders, great for doing science (open source means we know what the code is doing), and great for performance (we can improve the code if we want).  I'm hopeful that the competition between CPU and GPU vendors to woo coders will improve programming models for all platforms.

04 September 2010

Programming GPUs makes us better CPU programmers

High-performance computing blogs like "horse-race" journalism, especially when covering competing architectures like CPUs and GPUs.  One commonly hears of 20x or even larger speedups when porting a CPU code to a GPU.  Recently someone pointed out to me, though, that one rarely hears of the "reverse port": taking an optimized GPU code, and porting it back to the CPU.  Many GPU optimizations relate to memory access, and many scientific codes spend a lot of time reading and writing memory.  For example, coalescing loads and stores (for consecutive threads on a GPU warp) corresponds roughly to aligned, contiguous memory accesses on a CPU.  These are more friendly to cache lines, and are also amenable to optimizations like prefetching or using wide aligned load and store instructions.  CPUs are getting wider vector pipes too, which will increase the branch penalty.  From what I've heard, taking that GPU-optimized code and porting it back to the CPU might result in only a 2x slowdown over the GPU. 

I don't see this as bad, though.  First, GPUs are useful because the hardware forces programmers to think about performance.  After investing in learning a new programming language (such as CUDA or OpenCL) or at least a new library (such as Thrust), and after investing in new hardware and in supporting GPU runtimes in their applications, coders are obligated to get return on that investment.  Thus, they take the time to learn how to optimize their codes.  GPU vendors help a lot with this, and expose performance-related aspects of their architecture, helping programmers find exploitation points.  Second, optimizing for the GPU covers the future possibility that CPUs and GPUs will converge.  Finally, programmers' dissatisfaction with certain aspects of GPU hardware may cause divergence rather than convergence of GPUs and CPUs, or even the popularization of entirely new architectures.  (How about a Cray XMT on an accelerator board in your workstation, for highly irregular data processing?)