The nerdiest of the nerds: Productivity language for performance-oriented programmers

13 June 2009

Productivity language for performance-oriented programmers

Our computer science department last semester offered a graduate seminar course on parallel productivity languages -- that is, programming languages in which it's easy to write parallel code whose performance scales well, even if single-processor performance is not so good (e.g., because the language is interpreted or there is a lot of runtime error checking). Implicit in the word "productivity" is that the target users need not be expert low-level programmers.

According to a professor who helped lead the course, however, most of the class projects ended up being "productivity languages for efficiency-layer programmers." That is, the target users were not non-computer-scientist domain experts, but experienced low-level programmers who want to boost their productivity when optimizing a particular computational kernel. One example was a domain-specific language (DSL) for stencil operations, along with a compiler and autotuner. Users write the stencil operation in a familiar notation, and the system transforms it at runtime (the source language is dynamic) into highly optimized code which performs just about as well as if it were optimized by hand.

The professor's observation resonated, because I'm currently near the end of what turned out to be a long, painful implementation and debugging project. It's a benchmark implementation of a new solver, which itself contains at least three different computational kernels. The project has taught me hard lessons about what bugs consume the most time. What I've found is that working in a language "close to the metal" is not the problem. I'm not afraid to optimize for hardware or operating system features, to allocate and deallocate chunks of memory, or to impose matrix semantics on chunks of raw memory. My frustrations usually come from silly human errors, that result in inscrutable errors which debuggers like gdb and memory analysis tools like Valgrind cannot diagnose. Furthermore, sometimes bugs only manifest with very large problems, that make losing a factor of 10-100x performance to an analysis tool (like Valgrind, which is an amazing tool and perhaps one of the best optimized out there) impractical (or at least annoying).

One source of human errors is conversion between units. The classic example is the loss of the Mars Climate Orbiter in 1999 due to a failure to convert between metric and English units. However, units, or a related kind of semantic information, often come up in pure programming tasks that have little to do with physical measurements. For example, my current project involves a lot of C code that iterates over one kind of data structure, and copies it into a new kind of data structure. A lot of indices fly around, and I usually have to give them suggestive names like "index_old" or "index_new" to indicate whether they "belong" to the old or new data structure. Naming conventions aren't enough to prevent a mismatch, though, and a mistake (like addressing a "new" array with an "old" index) may result in a segfault, smashing the stack, intermittent semantic errors that depend on runtime parameters, or no problems at all. Here, "old" and "new" are a kind of "unit" or semantic restriction: for example, "this index can only be used to address this array, not any other." This kind of language feature seems like it should result in little or no runtime performance hit in many cases.

Another source of human errors is resource scope: who initializes what, when, and who de-initializes it, when. Many HPC codes require careful control of memory allocation and deallocation that often cannot be left to a garbage collector, either because the amounts of memory involved are huge, or because the allocation method is specialized (e.g., optimizations for nonuniform memory access (NUMA) machines). Furthermore, garbage collection doesn't solve all resource management problems: files, pipes, sockets, and the like are often shared by multiple threads. Many languages, such as ANSI Common Lisp and the concurrent JVM-based language Clojure, have a special syntax for managing the scope of objects like files that need to be opened and closed, regardless of whether the operations done during the lifetime of the opened file succeed. Common Lisp has a convention to name such scope management macros "WITH-*", where one replaces the * with the kind of object being managed. Of course, a syntax for scope management assumes that one can statically determine the scope of the object, but many situations and kinds of object fit that pattern.

My current project involves a shared-memory parallel solver written in an SPMD programming model. There's a lot of shared state, and it gets initialized at different times: either before the compute threads are created, during the scope of the threads, or after the threads complete. Sometimes, threads cooperate to initialize an object, and sometimes they let just one compute thread do the work. In general, it's hard for compilers to tell when different threads might be accessing a shared object. However, my choice of the SPMD programming model restricts the implementation enough that the compiler can easily identify whether something happens before or after thread creation or completion. It would be nice to have a syntax or some kind of help from the language to manage the different kinds of shared object creation. The runtime could also help out by forbidding multiple initializations, or by "tagging" a shared object as accessible only by a certain thread.

A third source of errors has been imposing structure on raw pointers. For example, a memory allocation returns a chunk of contiguously stored memory, but I have to divide up that chunk in place into multiple nonoverlapping matrices. This is something that Fortran (especially Fortran 2003, if implemented correctly, *ahem*) does much better than C: Fortran always imposes array structure on chunks of memory, and if you've done the mapping right, the runtime can identify indexing errors. C only gives you raw pointers, onto which you have to impose structure. If you mess up, only something like Valgrind may be able to help. However, even Valgrind has no access to the semantic information that a particular chunk of memory is supposed to be an M by N matrix with leading dimension LDA. Having an option to do bounds checking, which does not interfere with or affect parallel performance, but which can be turned off (at build time even) to improve performance on one thread, would be a big help.

When I see new languages proposed for scientific computing, often they have lots of new features -- nested parallelism, data structures suited to scientific things like irregular discretizations, or a full library of parallel primitives.
What I don't see, however, is a new low-level language which can be used to implement those new higher-level languages, or to implement highly optimized library routines. A good low-level language is also important for bridging the gap between declarative optimization approaches (i.e., identifying features of the data or algorithm that enable optimizations without requiring them), and imperative approaches (mandating which optimization or feature to use, so that one can predict and measure performance). Predictable performance makes optimization a science rather than a collection of incantations. Furthermore, if a good low-level imperative language is used as a compilation target of a declarative language, then one can examine the generated code and figure out what the compiler is doing -- or even introspect and alter the implementation strategy at runtime.

7 comments:

John said...: I wonder if there is low-level support of the kind you are talking about, but most of those who do implement such checks do so as a set of tool-enforced conventions on existing languages, that are then separately held.

Also, what do you see of the role of richer intermediate languages in passing along the information that certain conventions are held?; 17 June, 2009 04:49
HilbertAstronaut said...: Sorry about the late reply -- I've been very very busy lately (I usually write these blog posts as memory dumps for later reference).

I'm not fond of tool-based approaches because I think it's hard to analyze the kinds of codes that people like me write. Data gets shared in weird ways, for example, and pointers pass in and out of libraries. But maybe I'm just not so familiar with existing tools.

Could you clarify that question on passing conventions? What would be an example of such a convention? Would it be something like whether indices are zero-based or one-based?; 22 June, 2009 13:02
John said...: Sorry, I don't mean "conventions for passing", but instead I mean that the conventions within a name are passed through the compilation phases. For example, if you were to say _meters and if a tool could trace a direct chain of assignments to something with _foot, you would have an error, or a warning if there could be a path, but it can't tell.

The idea overall is that you could embed a declarative language inside the usage of an imperative language, using variable naming conventions and specialized syntax within comments. This would then be co-compiled, annotating the intermediate form.

As another example, from your next post, you could add a "start constant" and "end constant" as both a hint to the hardware and compiler, and as an assertion.

I'm sure this idea is really as old as the hills, with some compilers having a rich ecology of these kinds of pragmas and whatnot, but maybe not.; 29 June, 2009 05:44
HilbertAstronaut said...: I see what you mean now -- the combination of pragmas / annotations embedded in comments and source-to-source translation is a common thing in the HPC world. Vectorizing compilers and OpenMP compilers use this technique extensively. It seems to work best when it's a hook for experienced programmers to give the compiler semantic information for optimizations -- for example, that a variable is private to a loop body, or that a function has no side effects. (Why am I lecturing you -- you know all this stuff! ;-P )

For correctness issues, though, I don't think this technique scales. For example, it doesn't solve resource management and scoping problems, because coders still have to remember to close resources in a block and annotate the block. For example:

#pragma scope(FILE* f, fopen(filename, "w"), (void) fclose(f))
{
/* ... do stuff to f ... */
}

/* would be equivalent to */

{
FILE* f = fopen(filename, "w");
/* ... do stuff to f ... */
(void) fclose (f);
}

If the annotations offer much more syntax than that, then it's an entirely different language; might as well go all the way. Same with units -- the effort of building a system for annotating variables, functions, etc. with units could be saved by using a language that can tag objects with metadata.

"Might as well go all the way" is a strong argument for me. It takes a lot of effort to build a robust source-to-source translator (imagine trying to guess what's a variable declaration in the full C++ grammar, for example). One might as well start with a cleaner language that has a tight link to the implementation language (so you don't have to reimplement existing libraries). If you pick the right language features then it's a much more general solution -- metadata is a good example, because it can solve not just the units problem, but any kind of semantic information propagation problem (constantness, function purity, thread-local vs. globally visible).

It's fun to discuss these things, even though I definitely won't be doing any real work with this any time soon ;-); 29 June, 2009 08:46
HilbertAstronaut said...: btw here's a pastebin for that source code snippet I posted:

http://pastebin.com/m65546f66; 29 June, 2009 08:47
John said...: The "might as well go all the way" argument seems like a reasonable one at first, but I don't know. Many languages have started by compiling down into another. It isn't at all clear to me that the person who would begin such a project wouldn't find it easier to start with additional checks, and then compile down into another language, then to incorporate that compiler, working from there. To frame it a different way, since we both note that this is work for "somebody else", suppose that person were just to come along and offer to listen to your requirements. I don't suppose you would preclude them from working this way, were that their preference?; 29 June, 2009 18:22
HilbertAstronaut said...: John said: "The 'might as well go all the way' argument seems like a reasonable one at first, but I don't know. Many languages have started by compiling down into another."

I'm all about source-to-source translation. It's perfectly OK for C or Fortran (probably C _and_ Fortran; each has its strengths) to be a target.

The thing is, I don't want to write a general-purpose programming language. It's not so hard to write computational kernels in low-level languages, even efficient ones. (We know how to do autotuning; that's just source-to-source translation.) It's the glue code, the resource allocation, input and output, etc. that are hard for programmers, and that take real time.

Of course it would be awesome if units and other kinds of metadata would propagate nicely through the entire computation, but I'm OK with some of that computation living in opaque or low-level libraries -- tagging inputs and outputs of those libraries would be a great start, though.

I think if that wonderful "somebody else" were to come along, I'd say to him or her that it's much more general and helpful to translate code from a simple language into C, than to translate annotated C into C (for example).; 29 June, 2009 21:32

The nerdiest of the nerds

13 June 2009

Productivity language for performance-oriented programmers

7 comments:

Blog Archive

About Me