08 September 2010

CPUs and GPUs, part 3: memory and programming models

My last post got a good comment which pointed out that I hadn't covered another important class of programming models: annotations.  C and C++ programmers would call these "pragmas," since their compilers understand lines starting with "#pragma" as directives that affect one or more lines of code that follow.  They would likely be familiar with OpenMP, which uses annotations to tell the compiler to parallelize certain loops or fragments of code.  Other programming languages, like Common Lisp, use annotations as optimization hints -- in particular, as type hints.

The commenter pointed out that the Hybrid Multicore Parallel Programming (HMPP) system of annotations, developed by CAPS and supported by the PathScale ENZO compiler suite, offers another programming model for heterogeneous node computing.  You write a kernel in a subset of C(++) or Fortran, annotate it with operations that the compute device (e.g., GPU) understands (such as when to transfer data to or from the host CPU, or when to synchronize device threads), and label the kernel with an annotation that identifies it for deployment to the device.  I was just introduced to HMPP last night, so please take a look at the comment and read the manual for yourself :-)  However, a brief read of the manual reminds me of Copperhead, a Python annotation system (based on decorators) that accomplishes similar goals.

Annotations require the programmer to (1) know the annotation language as well as the main programming language being used, (2) understand what subset of the programming language can be sent to the device, and (3) understand when and how the annotations change program semantics.  Compilers should help with (2), but may not; if given code that is invalid for the device, they may generate incorrect device code, which may crash or silently give the wrong answer.  This is true of other annotation systems as well.  OpenMP, for example, doesn't help you avoid race conditions if multiple threads modify shared state: you have to identify and annotate critical sections and / or atomic updates yourself.  Point (3) can also be a challenge, especially if annotations are sprinkled throughout the code.
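Here is a minimal sketch (the variable names are mine, chosen only for illustration) of the kind of responsibility OpenMP leaves with the programmer.  Remove the "atomic" annotation below and the updates to the shared variable race; the compiler accepts the code silently and the answer is simply wrong:

    #include <stdio.h>

    int main (void) {
      const int n = 1000000;
      double sum = 0.0;

      /* Each thread updates the shared variable "sum."  Without the
         "atomic" annotation below, these updates form a race
         condition that the compiler does not catch. */
      #pragma omp parallel for
      for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        sum += 1.0 / (double) (i + 1);
      }

      printf ("sum = %g\n", sum);
      return 0;
    }

(In this simple case OpenMP's reduction clause would be the idiomatic fix; the point is that spotting the race and supplying the right annotation is entirely the programmer's job.)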

Before talking about HMPP, I'll first use OpenMP as an example of the difficulties of popular annotation-based approaches.  Many OpenMP performance problems and code-writing challenges relate to the memory model.  You've taken a sequential code, where memory is laid out for sequential operation, and introduced parallelism to the computation but not to the memory layout.  On nonuniform memory access (NUMA) hardware, performance tanks, because the data lives in the wrong place for most of the threads.  At least modern OpenMP implementations try not to let threads wander to NUMA regions other than the ones where they started, which means the operating system could in principle migrate pages closer to the cores that are working on them.  That suggests a "preconditioning" approach, in which you first use OpenMP parallelism to read through the whole data set (thereby letting the operating system move the data to where it needs to be), and then use OpenMP to do the computations you wanted to do in the first place.  Operating systems in common use don't migrate pages like that yet, though!  Another approach is just syntactic sugar for explicitly threaded programming: each thread allocates its own data, and then you program in something like a distributed-memory model.  (It would be easier and less buggy just to run MPI with one process per core.)
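As a concrete sketch of the preconditioning idea: on operating systems with a first-touch page placement policy (Linux, for example), writing the data in parallel with the same loop schedule as the later computation puts each page in the NUMA region of the thread that will use it -- no page migration required.  The numbers and names below are my own, chosen only for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    int main (void) {
      const int n = 10000000;
      double* x = malloc (n * sizeof (double));
      double sum = 0.0;

      /* "Precondition" the array: touch it in parallel, with the same
         static schedule as the computation below, so that first-touch
         placement puts each page near the thread that will use it. */
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < n; ++i)
        x[i] = 1.0;

      /* The computation you wanted to do in the first place.  With
         the same schedule, each thread mostly touches pages that
         live in its own NUMA region. */
      #pragma omp parallel for schedule(static) reduction(+:sum)
      for (int i = 0; i < n; ++i)
        sum += 2.0 * x[i];

      printf ("sum = %g\n", sum);
      free (x);
      return 0;
    }

Note that this relies on first-touch placement at allocation time, so the parallel initialization has to come before any sequential code touches the array.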

Both of these OpenMP approaches are awkward.  The problem is that OpenMP separates computation from memory layout.  The annotations govern only computation; they say nothing about where the data should live in memory.  That often means throwing away useful information that many OpenMP programmers have: which threads will operate on which data.  If you know how to allocate an array, you likely know how you would divide up simple parallel for loops or reductions on that array among the CPU cores.  And if you know that, you almost certainly know how to lay out the array across the various NUMA regions.

Heterogeneous programming has its own memory layout problem: data lives either on the host CPU or on the device.  Those are two separate memory spaces.  Even if they weren't separate, it might still be expensive to compute on remote data, or even to copy data from one space to another.  This is exactly the same problem as on multicore CPUs with NUMA.  When I look at HMPP, I see annotations for when to perform data transfers between the host and the device, and even ways to overlap data transfers with useful computation.  However, I don't see annotations for where to allocate the data.  The HMPP model assumes (at least from what I can tell from a brief look -- please correct me if I'm wrong) that data always lives on the host, is always transferred to the device for computations, and is always transferred back to the host afterwards.  This is like allocating all your data in one NUMA region and operating on it with CPU cores in all the NUMA regions.
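For concreteness, here is roughly the pattern that model implies, spelled out by hand using the CUDA runtime's C interface (the function and variable names, other than the CUDA calls themselves, are mine).  Every device computation pays for the round trip:

    #include <cuda_runtime.h>

    /* The "data always lives on the host" pattern, written out by
       hand (error checking omitted for brevity). */
    void scale_on_device (double* h_x, int n) {
      const size_t bytes = n * sizeof (double);
      double* d_x = NULL;

      cudaMalloc ((void**) &d_x, bytes);                     /* allocate in device memory */
      cudaMemcpy (d_x, h_x, bytes, cudaMemcpyHostToDevice);  /* host -> device */

      /* ... launch some kernel that operates on d_x here ... */

      cudaMemcpy (h_x, d_x, bytes, cudaMemcpyDeviceToHost);  /* device -> host */
      cudaFree (d_x);                                        /* device copy is gone */
    }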

One could augment the annotation approach to include a memory layout model, but that seems like it would just make the language more baroque.  Plenty of programming languages reify data layout without the complication of additional annotations; Co-Array Fortran is a good example.  This need not be done in the language, either: libraries can also hide the details of layout and allocation.  Kokkos in Trilinos does this for GPUs, for example.  CUDA and OpenCL are languages, but their special routines for allocating memory on the device are just plain C.

I like the Array Building Blocks (and Kokkos) approach, because it also gives a good model for interactions between legacy code (with legacy memory layout) and new parallel code (which expects, or should have, data in a particular nonlegacy layout).  Kokkos does this with smart pointers and copy routines: you aren't allowed to call parallel kernels on legacy memory.  Array Building Blocks does this with "bind", which links up legacy memory and compute buffers.  The general idea is that "you don't get the pointer": memory on which heterogeneous or node-level parallel computation happens is encapsulated, and is organized optimally by the language or library system.  This is also a great way to unify different kinds of nodes with different memory space(s), which is why Kokkos does it.  Pretty much any kind of node has parallel computing resources associated with one or more memory spaces, for which data transfer costs are high and data layout matters for performance.
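In C, a "you don't get the pointer" interface might look something like the following sketch.  To be clear, these type and function names are hypothetical, not the actual Kokkos or Array Building Blocks APIs; the point is that the buffer type is opaque, data crosses in and out only through explicit copy routines (the analogue of Kokkos' copy routines or Array Building Blocks' "bind"), and kernels accept only the opaque handles, so they can't be called on legacy memory:

    /* Opaque handle: user code never sees the raw pointer, so the
       library is free to pick the memory space and data layout. */
    typedef struct device_array device_array;

    device_array* device_array_create (int num_entries);
    void device_array_destroy (device_array* a);

    /* The only ways data crosses between legacy (host) memory and
       the encapsulated memory: explicit copy routines. */
    void device_array_copy_in  (device_array* dst, const double* src, int n);
    void device_array_copy_out (double* dst, const device_array* src, int n);

    /* Parallel kernels take only opaque handles, never raw pointers,
       so you can't invoke them on legacy memory by accident. */
    void device_axpy (double alpha, const device_array* x, device_array* y);

Because the layout is hidden behind the handle, the same interface can sit on top of host memory, device memory, or anything else a future node design offers.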
