CPU vendors have historically done a poor job presenting programmers with a usable interface to SIMD instructions. Compilers are supposed to vectorize code automatically, so that you write ordinary scalar loops and vectorized code comes out. However, my experience a few years ago writing SIMD code was that compilers have a hard time doing this for anything but the simplest code. Often they will use SIMD instructions but operate on only one element of the short vector, just because the vector pipes are faster than the scalar ones (even though the extra element(s) in the vector are wasted). The reason compilers struggle at vectorization -- more so than compilers for traditional vector machines like the Crays -- is that SIMD instructions are more restrictive than traditional vector instructions. They demand contiguous memory, aligned on wide boundaries (like 128 bits). If the compiler can't prove that your memory accesses have this nice friendly pattern, it won't vectorize your code. Branching in SIMD code is expensive (it means you lose the vector parallelism), so compilers may refuse to vectorize branchy code.
Note that GPU hardware has the same properties! They also like contiguous memory accesses and nonbranchy code. The difference is the programming model: "thread" means "array element" and consecutive threads are contiguous, whether you like it or not. The model encourages coding in a SIMD-friendly way. In contrast, CPU SIMD instructions didn't historically have much of a programming model at all, other than trusting in the compiler. The fact that compilers make the instructions directly available via "intrinsics" (only one step removed from inline assembler code) already indicates that coders had no other satisfactory interface to these instructions, yet didn't trust the compiler to vectorize. Experts like Sam Williams, a master performance tuner of computational kernels like sparse matrix-vector multiply and stencils, used the intrinsics to vectorize. This made their codes dependent on the particular instruction set, as well as on details of the instruction set implementation. (For example, older AMD CPUs used a "half-piped" implementation of SIMD instructions on 64-bit floating-point words. This meant that the implementation wasn't parallel, even though the instructions were. Using the x87 scalar instructions instead offered comparable performance, was easier to program, and even offered better accuracy for certain computations, since temporaries are stored with extra precision.) Using SIMD instructions complicated their code in other ways as well. For example, allocated memory has to be aligned on a particular boundary, such as 128 bits. That alignment requirement depends on the SIMD vector width, which again decreases portability, since SIMD vector widths have increased over time and will likely continue to do so. Furthermore, these codes are brittle, because feeding nonaligned memory to SIMD instructions often results in errors that can crash one's program.
CPU vendors are finally starting to think about programming models that make it easier to exploit SIMD instructions. OpenCL, while as low-level a model as CUDA, also lets programmers reason about vector instructions in a less hardware-dependent way. One of the most promising programming models is Intel's Array Building Blocks, featured in a recent Dr. Dobb's Journal article. I'm excited about Array Building Blocks for the following reasons:
- It includes a memory model with a segregated memory space. This can cover all kinds of complicated hardware details (like alignment and NUMA affinity). It's also validation for the work being done in libraries like Trilinos' Kokkos to hide array memory management from the programmer ("you don't get a pointer"), thus freeing the library to place and manage memory as it sees fit. All of this will help future-proof code against the details of allocation, and make code safer (you don't get the pointer, so you can't give nonaligned memory to instructions that want aligned memory, or give GPU device memory to a CPU host routine).
- It's a programming language (embedded in C++, with a run time compilation model) -- which will expose mainstream programmers to the ideas of embedded special-purpose programming languages, and run time code generation and compilation. Scientific computing folks tend to be conservative about generating and compiling code at run time, in part because the systems on which they have to run often don't support it or only support it weakly. If Array Building Blocks gives the promised performance benefits and helps "future-proof" code, HPC system vendors will be driven to support these features.
- Its parallel operators are deterministic; programmers won't have to reason about shared-memory nightmares like race conditions or memory models. (Even writing a shared-memory barrier is a challenging task, and the best-performing barrier implementation depends on the application as well as the definition of "performance" (speed or energy?).)