22 June 2009

On metadata, shared memory, NUMA, and caches

Many modern shared-memory multiprocessors improve memory locality by providing each socket (group of processors) in the machine with its own chunk of memory.  Each socket can access all the chunks of memory, but accesses to its own chunk are faster (in terms of latency, at least).  This design decision is called "nonuniform memory access" (NUMA).

Programming a NUMA machine for performance in scientific kernels looks a lot like programming a distributed-memory machine:  one must "distribute" the data among the chunks so as to improve locality.  There is a difference, however, in the way one deals with metadata -- the data describing the layout of the data. 

In the distributed-memory world, one generally tries to distribute the metadata, because fetching it from another node takes a long time.  This means that memory usage increases with the number of processors, regardless of the actual problem size.  It also means that programmers have to think harder and debug longer, as distributing metadata correctly (and avoiding bottlenecks) is hard.  (Imagine, for example, implementing a reduction tree of arbitrary topology on arbitrary subsets of processors, when you aren't allowed to store a list of all the processors on any one node.)

In the NUMA world, one need not replicate the metadata.  Of course one _may_, but metadata access latency is small enough that it's practical to keep one copy of the metadata for all the processors.  This is where caches come in:  if the metadata is read-only, caches can replicate the relevant parts of the metadata for free (in terms of programming effort).  Of course, it would be nice also to have a non-coherent "local store," accessible via direct memory transfers, for the actual data.  Once you have the metadata, it's often easy to know what data you want, so it's easier to manage the explicit memory transfers to and from the local store.  However, if you only have a local store, you have to store metadata there, and managing that reduces to the distributed-memory case (except more painful, since you have much less memory capacity).

Memory capacity is really the most important difference between the shared-memory NUMA and distributed-memory worlds.  Our experience is that on a single shared-memory node, memory capacity cannot scale with the number of processors.  This is because DRAM uses too much power.  The same power concerns hold, of course, for the distributed-memory world.  However, an important use of distributed-memory machines is for their scale-out capacity to solve very large problems, so their architects budget for those power requirements.  Architects of single-node shared-memory systems don't have that freedom to expand memory capacity.  This, along with the lower latencies between sockets within a single node, makes sharing metadata in the shared-memory world attractive.

It's often easy to delimit sections of a computation in which metadata is only read and never modified.  This suggests an architectural feature:  using cache coherence to fetch metadata once, but tagging the metadata in cache as read-only, or having a "read-only" section of cache.  I'm not a hardware person so I don't know if this is worthwhile, but it's a way to extract locality out of semantic information.

13 June 2009

Productivity language for performance-oriented programmers

Our computer science department last semester offered a graduate seminar course on parallel productivity languages -- that is, programming languages in which it's easy to write parallel code whose performance scales well, even if single-processor performance is not so good (e.g., because the language is interpreted or there is a lot of runtime error checking).  Implicit in the word "productivity" is that the target users need not be expert low-level programmers.  

According to a professor who helped lead the course, however, most of the class projects ended up being "productivity languages for efficiency-layer programmers."  That is, the target users were not non-computer-scientist domain experts, but experienced low-level programmers who want to boost their productivity when optimizing a particular computational kernel.  One example was a domain-specific language (DSL) for stencil operations, along with a compiler and autotuner.  Users write the stencil operation in a familiar notation, and the system transforms it at runtime (the source language is dynamic) into highly optimized code which performs just about as well as if it were optimized by hand. 

The professor's observation resonated, because I'm currently near the end of what turned out to be a long, painful implementation and debugging project.  It's a benchmark implementation of a new solver, which itself contains at least three different computational kernels.  The project has taught me hard lessons about what bugs consume the most time.  What I've found is that working in a language "close to the metal" is not the problem.  I'm not afraid to optimize for hardware or operating system features, to allocate and deallocate chunks of memory, or to impose matrix semantics on chunks of raw memory.  My frustrations usually come from silly human errors, that result in inscrutable errors which debuggers like gdb and memory analysis tools like Valgrind cannot diagnose.  Furthermore, sometimes bugs only manifest with very large problems, that make losing a factor of 10-100x performance to an analysis tool (like Valgrind, which is an amazing tool and perhaps one of the best optimized out there) impractical (or at least annoying).

One source of human errors is conversion between units.  The classic example is the loss of the Mars Climate Orbiter in 1999 due to a failure to convert between metric and English units.  However, units, or a related kind of semantic information, often come up in pure programming tasks that have little to do with physical measurements.  For example, my current project involves a lot of C code that iterates over one kind of data structure, and copies it into a new kind of data structure.  A lot of indices fly around, and I usually have to give them suggestive names like "index_old" or "index_new" to indicate whether they "belong" to the old or new data structure.  Naming conventions aren't enough to prevent a mismatch, though, and a mistake (like addressing a "new" array with an "old" index) may result in a segfault, smashing the stack, intermittent semantic errors that depend on runtime parameters, or no problems at all.  Here, "old" and "new" are a kind of "unit" or semantic restriction:  for example, "this index can only be used to address this array, not any other."  This kind of language feature seems like it should result in little or no runtime performance hit in many cases. 

Another source of human errors is resource scope:  who initializes what, when, and who de-initializes it, when.  Many HPC codes require careful control of memory allocation and deallocation that often cannot be left to a garbage collector, either because the amounts of memory involved are huge, or because the allocation method is specialized (e.g., optimizations for nonuniform memory access (NUMA) machines).  Furthermore, garbage collection doesn't solve all resource management problems:  files, pipes, sockets, and the like are often shared by multiple threads.  Many languages, such as ANSI Common Lisp and the concurrent JVM-based language Clojure, have a special syntax for managing the scope of objects like files that need to be opened and closed, regardless of whether the operations done during the lifetime of the opened file succeed.  Common Lisp has a convention to name such scope management macros "WITH-*", where one replaces the * with the kind of object being managed.  Of course, a syntax for scope management assumes that one can statically determine the scope of the object, but many situations and kinds of object fit that pattern.

My current project involves a shared-memory parallel solver written in an SPMD programming model.  There's a lot of shared state, and it gets initialized at different times:  either before the compute threads are created, during the scope of the threads, or after the threads complete.  Sometimes, threads cooperate to initialize an object, and sometimes they let just one compute thread do the work.  In general, it's hard for compilers to tell when different threads might be accessing a shared object.  However, my choice of the SPMD programming model restricts the implementation enough that the compiler can easily identify whether something happens before or after thread creation or completion. It would be nice to have a syntax or some kind of help from the language to manage the different kinds of shared object creation.  The runtime could also help out by forbidding multiple initializations, or by "tagging" a shared object as accessible only by a certain thread.

A third source of errors has been imposing structure on raw pointers.  For example, a memory allocation returns a chunk of contiguously stored memory, but I have to divide up that chunk in place into multiple nonoverlapping matrices.  This is something that Fortran (especially Fortran 2003, if implemented correctly, *ahem*) does much better than C:  Fortran always imposes array structure on chunks of memory, and if you've done the mapping right, the runtime can identify indexing errors.  C only gives you raw pointers, onto which you have to impose structure.  If you mess up, only something like Valgrind may be able to help.  However, even Valgrind has no access to the semantic information that a particular chunk of memory is supposed to be an M by N matrix with leading dimension LDA.  Having an option to do bounds checking, which does not interfere with or affect parallel performance, but which can be turned off (at build time even) to improve performance on one thread, would be a big help.

When I see new languages proposed for scientific computing, often they have lots of new features -- nested parallelism, data structures suited to scientific things like irregular discretizations, or a full library of parallel primitives.
What I don't see, however, is a new low-level language which can be used to implement those new higher-level languages, or to implement highly optimized library routines.  A good low-level language is also important for bridging the gap between declarative optimization approaches (i.e., identifying features of the data or algorithm that enable optimizations without requiring them), and imperative approaches (mandating which optimization or feature to use, so that one can predict and measure performance).  Predictable performance makes optimization a science rather than a collection of incantations.  Furthermore, if a good low-level imperative language is used as a compilation target of a declarative language, then one can examine the generated code and figure out what the compiler is doing -- or even introspect and alter the implementation strategy at runtime.

10 June 2009

C++: Worlds of Pain

MS's efforts to implement the new C++0x standard in their compiler remind me of the world of pain that is C++ syntax.  James Iry jokes that the "language is so complex that programs must be sent to the future to be compiled by the Skynet artificial intelligence," but of course programmers do a kind of compilation in their head too, in order to understand and work with the language. 

My experience with C++ was typified by looking at Boost's graph library and wondering what the developers were smoking and whether or not I wanted to try some.  I took C++ reasonably far in my professional work; I used the STL containers extensively and even dipped into some of the generic algorithms (even with Visual Studio 6's awful C++ templates support, some of them were still useful).  However, there were too many ad hoc - seeming rules and conventions and ways to shoot myself in the foot, and name mangling was always an annoyance for me.