04 September 2010

Programming GPUs makes us better CPU programmers

High-performance computing blogs like "horse-race" journalism, especially when covering competing architectures like CPUs and GPUs.  One commonly hears of 20x or even larger speedups when porting a CPU code to a GPU.  Recently someone pointed out to me, though, that one rarely hears of the "reverse port": taking an optimized GPU code and porting it back to the CPU.  Many GPU optimizations relate to memory access, and many scientific codes spend a lot of time reading and writing memory, so those optimizations often carry over.  For example, coalescing loads and stores (so that consecutive threads in a GPU warp touch consecutive addresses) corresponds roughly to aligned, contiguous memory accesses on a CPU.  Such accesses are friendlier to cache lines, and are also amenable to optimizations like prefetching or using wide aligned load and store instructions.  CPUs are getting wider vector pipes too, which will increase the branch penalty, much as divergent branches within a warp cost performance on a GPU.  From what I've heard, taking that GPU-optimized code and porting it back to the CPU might leave it only about 2x slower than on the GPU.
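
To make the memory-access point concrete, here is a minimal sketch in C of the CPU-side analogue of coalescing.  The function names and the row-major layout are my own assumptions for illustration, not taken from any particular code.  Both functions compute the same sum over a matrix; they differ only in the order in which they touch memory.

```c
#include <stddef.h>

/* Hypothetical example: `a` is an n_rows x n_cols matrix of doubles,
 * stored row-major (element (i, j) lives at a[i * n_cols + j]). */

/* Contiguous traversal: the inner loop walks consecutive addresses,
 * so each cache line that is fetched gets fully used.  This is the
 * CPU analogue of consecutive GPU threads reading consecutive words. */
double sum_contiguous(const double *a, size_t n_rows, size_t n_cols)
{
    double total = 0.0;
    for (size_t i = 0; i < n_rows; ++i) {
        for (size_t j = 0; j < n_cols; ++j) {
            total += a[i * n_cols + j];
        }
    }
    return total;
}

/* Strided traversal: the inner loop jumps n_cols elements per step,
 * so for a wide matrix each access may land on a different cache line.
 * This resembles an uncoalesced access pattern on a GPU. */
double sum_strided(const double *a, size_t n_rows, size_t n_cols)
{
    double total = 0.0;
    for (size_t j = 0; j < n_cols; ++j) {
        for (size_t i = 0; i < n_rows; ++i) {
            total += a[i * n_cols + j];
        }
    }
    return total;
}
```

How much faster the contiguous version runs depends on the matrix size, the cache hierarchy, and the compiler, so I won't quote a number here; the point is only that the loop order a GPU programmer would pick for coalescing is also the cache-friendly order on a CPU.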

I don't see this as bad, though.  First, GPUs are useful because the hardware forces programmers to think about performance.  After investing in learning a new programming language (such as CUDA or OpenCL) or at least a new library (such as Thrust), and after investing in new hardware and in supporting GPU runtimes in their applications, coders are obligated to get a return on that investment.  Thus, they take the time to learn how to optimize their codes.  GPU vendors help a lot with this by exposing performance-related aspects of their architecture, which helps programmers find opportunities to exploit.  Second, optimizing for the GPU covers the future possibility that CPUs and GPUs will converge.  Finally, programmers' dissatisfaction with certain aspects of GPU hardware may cause divergence rather than convergence of GPUs and CPUs, or even the popularization of entirely new architectures.  (How about a Cray XMT on an accelerator board in your workstation, for highly irregular data processing?)
