05 September 2010

GPUs and CPUs, part 2: programming models

After finishing last night's post, I was thinking about CPU and GPU programming models -- in particular, about short vector instructions.  On CPUs these are called SIMD (single instruction, multiple data) instructions.  NVIDIA markets them instead as "SIMT" (single instruction, multiple thread), not because of differences in the hardware implementation, but because the programming model is different.  The "CUDA C Programming Guide" version 3.1 says:  "... SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread."  Here, "thread" corresponds to an element in an array, operated on by a vector instruction.
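To make the "thread as array element" idea concrete, here is roughly what a vector update looks like in the SIMT style.  This is a minimal CUDA sketch of my own, not code from the guide; each thread computes one element, and the thread index plays the role that the vector lane plays in SIMD code.

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        // blockIdx/threadIdx identify this thread; consecutive threads
        // get consecutive array elements.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host-side launch, one thread per array element (d_x and d_y are
    // device pointers):
    //   int threadsPerBlock = 256;
    //   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    //   saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);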

CPU vendors have historically done a poor job presenting programmers with a usable interface to SIMD instructions.  Compilers are supposed to vectorize code automatically, so that you write ordinary scalar loops and vectorized code comes out.  However, my experience writing SIMD code a few years ago was that compilers have a hard time doing this for anything but the simplest code.  Often they will use SIMD instructions but operate on only one element of the short vector, simply because the vector pipes are faster than the scalar ones (even though the extra vector element(s) are wasted).  The reason they struggle at vectorization -- more so than compilers for traditional vector machines like the Crays -- is that SIMD instructions are more restrictive than traditional vector instructions.  They demand contiguous memory, aligned on wide boundaries (like 128 bits).  If the compiler can't prove that your memory accesses have this nice friendly pattern, it won't vectorize your code.  SIMD branching is expensive (it means you lose the vector parallelism), so compilers may also refuse to vectorize branchy code.
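To make this concrete, here is the kind of loop an auto-vectorizer handles well, next to the kind that usually defeats it.  This is a sketch (the function names are mine, and the exact behavior varies by compiler and flags), but it captures the pattern I kept running into.

    // The easy case: unit stride, no branches.  Even here, if the compiler
    // can't prove that x and y don't overlap, or that they are suitably
    // aligned, it may fall back to scalar code.
    void saxpy_scalar(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // The hard case: a per-element branch and an indirect (gather) access.
    // Compilers will typically leave this loop scalar.
    void gather_scale(int n, float a, const int *idx, const float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            if (idx[i] >= 0)
                y[i] = a * x[idx[i]];
    }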

Note that GPU hardware has the same properties!  GPUs also like contiguous memory accesses and nonbranchy code.  The difference is the programming model: "thread" means "array element" and consecutive threads are contiguous, whether you like it or not.  The model encourages coding in a SIMD-friendly way.  In contrast, CPU SIMD instructions historically didn't have much of a programming model at all, other than trusting in the compiler.  The fact that compilers make the instructions directly available via "intrinsics" (only one step removed from inline assembler code) already indicates that coders had no other satisfactory interface to these instructions, yet didn't trust the compiler to vectorize.  Experts like Sam Williams, a master performance tuner of computational kernels like sparse matrix-vector multiply and stencils, used the intrinsics to vectorize.  This made their codes dependent on the particular instruction set, as well as on details of the instruction set's implementation.  (For example, older AMD CPUs used a "half-piped" implementation of SIMD instructions on 64-bit floating-point words.  This meant that the implementation wasn't parallel, even though the instructions were.  Using the x87 scalar instructions instead offered comparable performance, was easier to program, and even offered better accuracy for certain computations, since temporaries are stored with extra precision.)  Using SIMD instructions complicated their code in other ways as well.  For example, allocated memory has to be aligned on a particular boundary, such as 128 bits.  That boundary depends on the SIMD vector width, which again decreases portability, since SIMD instruction widths have increased over time and will continue to do so.  Furthermore, these codes are brittle, because feeding nonaligned memory to SIMD instructions often results in errors that can crash one's program.
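For a sense of what coding to the intrinsics involves, here is a hand-vectorized version of the same loop using 128-bit SSE intrinsics.  The intrinsics themselves (_mm_load_ps and friends) are the real SSE interface; the surrounding function is just my sketch.

    #include <xmmintrin.h>   // SSE intrinsics: x86-specific

    // _mm_load_ps and _mm_store_ps require 16-byte (128-bit) aligned
    // addresses; hand them an unaligned pointer and the program will
    // usually crash.
    void saxpy_sse(int n, float a, const float *x, float *y)
    {
        __m128 va = _mm_set1_ps(a);          // broadcast a to all four lanes
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 vx = _mm_load_ps(&x[i]);  // assumes 16-byte alignment
            __m128 vy = _mm_load_ps(&y[i]);
            _mm_store_ps(&y[i], _mm_add_ps(_mm_mul_ps(va, vx), vy));
        }
        for (; i < n; ++i)                   // scalar cleanup for the tail
            y[i] = a * x[i] + y[i];
    }

    // The arrays must also be allocated with matching alignment, e.g.:
    //   float *x = NULL;
    //   posix_memalign((void **) &x, 16, n * sizeof(float));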

CPU vendors are finally starting to think about programming models that make it easier to exploit SIMD instructions.  OpenCL, while as low-level a model as CUDA, also lets programmers reason about vector instructions in a less hardware-dependent way.  One of the most promising programming models is Intel's Array Building Blocks, featured in a recent Dr. Dobb's Journal article.  I'm excited about Array Building Blocks for the following reasons:
  1. It includes a memory model with a segregated memory space.  This can cover all kinds of complicated hardware details (like alignment and NUMA affinity).  It's also validation for the work being done in libraries like Trilinos' Kokkos to hide array memory management from the programmer ("you don't get a pointer"), thus freeing the library to place and manage memory as it sees fit; see the sketch after this list for the basic idea.  All of this will help future-proof code against the details of allocation, and also make code safer (you don't get the pointer, so you can't give nonaligned memory to instructions that want aligned memory, or give GPU device memory to a CPU host routine).
  2. It's a programming language (embedded in C++, with a run-time compilation model), which will expose mainstream programmers to the ideas of embedded special-purpose programming languages and of run-time code generation and compilation.  Scientific computing folks tend to be conservative about generating and compiling code at run time, in part because the systems on which they have to run often don't support it, or support it only weakly.  If Array Building Blocks delivers the promised performance benefits and helps "future-proof" code, HPC system vendors will be driven to support these features.
  3. Its parallel operators are deterministic; programmers won't have to reason about shared-memory nightmares like race conditions or memory models.  (Even writing a shared-memory barrier is a challenging task, and the best-performing barrier implementation depends on the application as well as the definition of "performance" (speed or energy?).)
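To illustrate point 1, here is a hypothetical "you don't get a pointer" container, sketched with CUDA's runtime API as the backing allocator.  The class name and its methods are made up for illustration -- this is not the Array Building Blocks or Kokkos interface -- but it shows how hiding the pointer lets the library own placement, alignment, and data movement.

    #include <cuda_runtime.h>
    #include <cstddef>

    // User code never sees a raw pointer, so the library decides where the
    // data lives (device memory here, but it could just as well be aligned
    // or NUMA-placed host memory) and how it moves.
    class DeviceBuffer {
    public:
        explicit DeviceBuffer (std::size_t n) : n_ (n), d_ptr_ (0) {
            cudaMalloc ((void**) &d_ptr_, n * sizeof (double));
        }
        ~DeviceBuffer () { cudaFree (d_ptr_); }

        // Data crosses the host/device boundary only through operations
        // that the library controls.
        void copyFromHost (const double* host) {
            cudaMemcpy (d_ptr_, host, n_ * sizeof (double), cudaMemcpyHostToDevice);
        }
        void copyToHost (double* host) const {
            cudaMemcpy (host, d_ptr_, n_ * sizeof (double), cudaMemcpyDeviceToHost);
        }
        std::size_t size () const { return n_; }

    private:
        DeviceBuffer (const DeviceBuffer&);            // noncopyable: the
        DeviceBuffer& operator= (const DeviceBuffer&); //   buffer owns its memory
        std::size_t n_;
        double* d_ptr_;                                // never handed to user code
    };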
I've been pleased with Intel's work on Threading Building Blocks, and I'm happy to see them so committed to high-level, usable, forward-looking programming models.  NVIDIA's Thrust library is another great project along these lines.  It's also great to see both companies releasing their libraries as open source.  The "buy our hardware and we will make programming it easier" business model is great for us coders, great for doing science (open source means we know what the code is doing), and great for performance (we can improve the code if we want).  I'm hopeful that the competition between CPU and GPU vendors to woo coders will improve programming models for all platforms.
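For a taste of what that higher-level style looks like, here is the same vector update written with Thrust -- a sketch along the lines of Thrust's own examples, with no hand-written kernel and no explicit device memory management in user code.

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // y <- a*x + y, element by element.
    struct saxpy_functor {
        float a;
        saxpy_functor (float a_) : a (a_) {}
        __host__ __device__ float operator() (float x, float y) const {
            return a * x + y;
        }
    };

    void saxpy_thrust (float a,
                       const thrust::device_vector<float>& x,
                       thrust::device_vector<float>& y)
    {
        thrust::transform (x.begin (), x.end (), y.begin (), y.begin (),
                           saxpy_functor (a));
    }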

3 comments:

Unknown said...

HMPP directives have been made an open standard, and the overall programming model may be an interesting third part to your article. Please do cover the pragma/directive-based approaches as well.

http://www.pathscale.com/pdf/PathScale-ENZO-1.0-UserGuide.pdf

Unknown said...

HMPP directives were made an open standard by PathScale and CAPS, and could be an interesting addition for comparison against the CUDA/OpenCL programming models. The directive/pragma-based approach is similar to OpenMP, but tailored to manycore/GPU, and offers a number of advantages.

Here's our user guide, which should give enough detail to see what I mean:
http://www.pathscale.com/pdf/PathScale-ENZO-1.0-UserGuide.pdf

HilbertAstronaut said...

Thanks for the link! I appreciate that you point out annotations as another approach to mixed GPU/CPU programming, since I'm not so familiar with it. I'm familiar with annotation-based programming from OpenMP and also from programming languages like ANSI Common Lisp, where type hints serve as optional annotations for improving performance.

A big concern for mixed GPU/CPU programming, and indeed all heterogeneous node programming, is the memory model. It's reasonable to expect the following:

1. Different devices will often have different memory spaces (and this may be good -- GPU memory sacrifices latency for bandwidth, a trade-off you may not want on a CPU)

2. Even within a single shared-memory image, memory layout matters for performance (e.g., for NUMA, or if GPU and CPU memories get mapped into a single space but are physically different for performance reasons)

3. Different devices have different requirements for memory alignment -- the way that you allocate arrays and lay out structures in memory may differ on different devices

4. Copy overhead is prohibitive for many algorithms. The performance of sparse matrix-vector multiply, for example, is already typically memory bandwidth-bound.

All of these things mean that managing memory -- allocation, placement, alignment, and "conditioning" (convincing the OS to put pages where they should be for best performance) -- is perhaps the most important part of heterogeneous node computing. The most productive programming languages, libraries, or models for heterogeneous nodes will be those that help programmers manage those memory issues with minimal intervention.
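As a small, concrete example of "conditioning": on Linux, the usual policy is to place a page on the NUMA node of the thread that first touches it, so one sketch of NUMA-aware setup (assuming OpenMP and that default first-touch policy) is to initialize an array with the same parallel decomposition the compute loops will use.

    #include <cstddef>

    // Each thread writes -- and thereby places -- the pages it will later
    // work on, as long as the compute loops use the same static schedule.
    void first_touch_init (std::size_t n, double* x)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long) n; ++i)
            x[i] = 0.0;
    }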

In Trilinos, we aim to do this with the Kokkos Node API:

http://trilinos.sandia.gov/packages/docs/dev/packages/kokkos/doc/html/group__kokkos__node__api.html

Intel's Array Building Blocks uses a model similar to Kokkos' "compute buffers," which convinces me that this is the way things will (or should) go.

How do the HMPP directives help programmers manage memory issues with heterogeneous computing?