28 November 2009

James Hardy Wilkinson, FRS

I love this photo of James Hardy Wilkinson, the great numerical analyst and computer scientist.  Something not so commonly known is that Wilkinson's ability to get along with Alan Turing helped bring the Automatic Computing Engine, which was designed by Turing, to fruition.

21 July 2009

C is too high level

I propose that C is too high-level a language for the purposes for which it's used.  While it purports to give programmers power over memory layout -- in particular, over heap (via malloc) and stack (via alloca) allocation -- it gives them no control over, and no way to describe, how function arguments or struct fields are laid out in memory.  Knowing how struct fields are laid out means you can use structs from languages other than C, in a portable way.  (Some foreign function interfaces, such as CFFI for Common Lisp, refuse to pass structs by value for this reason.)  Knowing how arguments are passed means you know exactly how much memory to allocate, and can bound more tightly from above how much stack space a particular C function call requires.
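
To make the problem concrete: even for a simple struct, the C standard fixes only the order of the fields, not the padding or alignment between them, so a foreign caller cannot portably predict the layout without asking the compiler.  A minimal sketch (the numbers it prints depend on the compiler and target):

    #include <stddef.h>
    #include <stdio.h>

    struct sample {
        char   tag;    /* 1 byte ...                                          */
        double value;  /* ... but the compiler may insert padding before this */
        short  count;
    };

    int main(void)
    {
        /* Nothing in the C standard pins these numbers down. */
        printf("sizeof = %zu, offsetof(value) = %zu, offsetof(count) = %zu\n",
               sizeof(struct sample),
               offsetof(struct sample, value),
               offsetof(struct sample, count));
        return 0;
    }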

Given the state of C as it is, what I would like is a domain-specific language for describing binary interfaces, such as struct layout or function call signatures.  I would like C compilers to support standard annotations that guarantee particular layouts.  Currently this is done in an ad hoc, compiler-specific way -- sometimes by command-line flags ("pack the structs") and sometimes by pragmas or annotations. 
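
For example, the two most common ad hoc mechanisms look like the sketch below; __attribute__((packed)) is a GCC/Clang extension and #pragma pack is the MSVC-style pragma (also accepted by GCC and Clang), and neither is part of the C standard:

    /* GCC / Clang style: an attribute on the type. */
    struct wire_header {
        unsigned char kind;
        unsigned int  length;
    } __attribute__((packed));

    /* MSVC style: a pragma that changes packing for everything
       declared until the matching pop. */
    #pragma pack(push, 1)
    struct wire_header2 {
        unsigned char kind;
        unsigned int  length;
    };
    #pragma pack(pop)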

The main reason I want standard compiler support for such a minilanguage is for interoperability between C and other languages.  I have the misfortune of needing to call into a lot of C libraries from my code, but I don't want to be stuck writing C or C++ (The Language Which Shall Not Be Parsed).  Nevertheless, I don't want to tie any other users of my code to a particular C compiler (if the code were just for me, it wouldn't matter so much).

01 July 2009

Python win: csv, sqlite3, subprocess, signal

I've been working these past few months (ugh, months...) on a benchmark-quality implementation of some new parallel shared-memory algorithms.  It's a messy, temperamental code that on occasion randomly hangs, but when it works, it often outperforms the competition.  Successful output consists of rows of space-delimited data in a plain text format, written to standard output.

I spent a week or so writing a driver script and output data processor for the benchmark.  The effort was minimal and paid off handsomely, thanks to some handy Python packages: csv, sqlite3, subprocess, and signal.

The csv ("comma-separated values") package, despite its name, can handle all kinds of text data delimited by some kind of separating character; it reads in the benchmark output with no troubles.  sqlite3 is a Python binding to the SQLite library, which is a lightweight database that supports a subset of SQL queries.  SQL's SELECT statement can replace pages and pages of potentially bug-ridden loops with a single line of code.  I use a cute trick:  I read in benchmark output using CSV, and create an SQLite database in memory (so I don't have to worry about keeping files around).  Then, I can issue pretty much arbitrary SQL queries in my script.  Since I only use one script, which takes command-line arguments to decide whether to run benchmarks or process the results, I don't have to maintain two different scripts if I decide to change the benchmark output format.

The subprocess and signal packages work together to help me deal with the benchmark's occasional flakiness.  The latter is a wrapper around the POSIX signalling facility, which lets me set a kind of alarm clock in the Python process.  If I don't stop the alarm clock early, it "goes off" by sending Python a signal, interrupting whatever it might be doing at the time.  The subprocess package lets me start an instance of the benchmark process and block until it returns.  "Block until it returns" means that Python doesn't steal my benchmark's cycles in a busy loop, and the alarm means I can time out the benchmark if it hangs (which it does sometimes).  This means I don't burn through valuable processing time if I'm benchmarking on a machine with a batch queue.
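
Underneath, those two packages are wrapping the usual POSIX machinery.  The same "run the benchmark, but give up if it hangs" logic looks roughly like the C sketch below (the ./benchmark path and the ten-minute limit are made-up placeholders):

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void on_alarm(int sig) { (void)sig; /* exists only to interrupt waitpid() */ }

    int main(void)
    {
        /* Install the handler without SA_RESTART, so SIGALRM makes waitpid()
           return early with EINTR instead of silently restarting. */
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_alarm;
        sigaction(SIGALRM, &sa, NULL);

        pid_t child = fork();
        if (child == 0) {
            execlp("./benchmark", "./benchmark", (char *)NULL);
            _exit(127);   /* exec failed */
        }

        alarm(600);       /* the "alarm clock": ten minutes */
        int status;
        if (waitpid(child, &status, 0) == -1 && errno == EINTR) {
            fprintf(stderr, "benchmark hung; killing it\n");
            kill(child, SIGKILL);
            waitpid(child, &status, 0);   /* reap the killed child */
        }
        alarm(0);         /* cancel the alarm if the run finished in time */
        return 0;
    }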

I wish the benchmark itself were as easy to write as the driver script was!  I've certainly found it immensely productive to use a non-barbaric language, with all kinds of useful libraries, to implement benchmarking driver logic and data processing.

22 June 2009

On metadata, shared memory, NUMA, and caches

Many modern shared-memory multiprocessors improve memory locality by providing each socket (group of processors) in the machine with its own chunk of memory.  Each socket can access all the chunks of memory, but accesses to its own chunk are faster (in terms of latency, at least).  This design decision is called "nonuniform memory access" (NUMA).

Programming a NUMA machine for performance in scientific kernels looks a lot like programming a distributed-memory machine:  one must "distribute" the data among the chunks so as to improve locality.  There is a difference, however, in the way one deals with metadata -- the data describing the layout of the data. 
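
On Linux, for example, one explicit way to do that distribution is libnuma, which lets you ask for memory on a particular socket's chunk (a sketch, assuming libnuma is installed and you link with -lnuma; relying on the default first-touch page placement is the other common approach):

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support; fall back to plain malloc\n");
            return 1;
        }

        size_t n = 1 << 20;
        int nodes = numa_max_node() + 1;

        /* Give each socket ("node") its own piece of the data, so threads
           pinned to that socket mostly touch local memory. */
        double **piece = malloc(nodes * sizeof *piece);
        for (int node = 0; node < nodes; ++node)
            piece[node] = numa_alloc_onnode(n * sizeof(double), node);

        /* ... compute ... */

        for (int node = 0; node < nodes; ++node)
            numa_free(piece[node], n * sizeof(double));
        free(piece);
        return 0;
    }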

In the distributed-memory world, one generally tries to distribute the metadata, because fetching it from another node takes a long time.  This means that memory usage increases with the number of processors, regardless of the actual problem size.  It also means that programmers have to think harder and debug longer, as distributing metadata correctly (and avoiding bottlenecks) is hard.  (Imagine, for example, implementing a reduction tree of arbitrary topology on arbitrary subsets of processors, when you aren't allowed to store a list of all the processors on any one node.)

In the NUMA world, one need not replicate the metadata.  Of course one _may_, but metadata access latency is small enough that it's practical to keep one copy of the metadata for all the processors.  This is where caches come in:  if the metadata is read-only, caches can replicate the relevant parts of the metadata for free (in terms of programming effort).  Of course, it would be nice also to have a non-coherent "local store," accessible via direct memory transfers, for the actual data.  Once you have the metadata, it's often easy to know what data you want, so it's easier to manage the explicit memory transfers to and from the local store.  However, if you only have a local store, you have to store metadata there, and managing that reduces to the distributed-memory case (except more painful, since you have much less memory capacity).

Memory capacity is really the most important difference between the shared-memory NUMA and distributed-memory worlds.  Our experience is that on a single shared-memory node, memory capacity cannot scale with the number of processors.  This is because DRAM uses too much power.  The same power concerns hold, of course, for the distributed-memory world.  However, an important use of distributed-memory machines is for their scale-out capacity to solve very large problems, so their architects budget for those power requirements.  Architects of single-node shared-memory systems don't have that freedom to expand memory capacity.  This, along with the lower latencies between sockets within a single node, makes sharing metadata in the shared-memory world attractive.

It's often easy to delimit sections of a computation in which metadata is only read and never modified.  This suggests an architectural feature:  using cache coherence to fetch metadata once, but tagging the metadata in cache as read-only, or having a "read-only" section of cache.  I'm not a hardware person so I don't know if this is worthwhile, but it's a way to extract locality out of semantic information.

13 June 2009

Productivity language for performance-oriented programmers

Our computer science department last semester offered a graduate seminar course on parallel productivity languages -- that is, programming languages in which it's easy to write parallel code whose performance scales well, even if single-processor performance is not so good (e.g., because the language is interpreted or there is a lot of runtime error checking).  Implicit in the word "productivity" is that the target users need not be expert low-level programmers.  

According to a professor who helped lead the course, however, most of the class projects ended up being "productivity languages for efficiency-layer programmers."  That is, the target users were not non-computer-scientist domain experts, but experienced low-level programmers who want to boost their productivity when optimizing a particular computational kernel.  One example was a domain-specific language (DSL) for stencil operations, along with a compiler and autotuner.  Users write the stencil operation in a familiar notation, and the system transforms it at runtime (the source language is dynamic) into highly optimized code which performs just about as well as if it were optimized by hand. 

The professor's observation resonated, because I'm currently near the end of what turned out to be a long, painful implementation and debugging project.  It's a benchmark implementation of a new solver, which itself contains at least three different computational kernels.  The project has taught me hard lessons about which bugs consume the most time.  What I've found is that working in a language "close to the metal" is not the problem.  I'm not afraid to optimize for hardware or operating system features, to allocate and deallocate chunks of memory, or to impose matrix semantics on chunks of raw memory.  My frustrations usually come from silly human errors that produce inscrutable failures which debuggers like gdb and memory analysis tools like Valgrind cannot diagnose.  Furthermore, some bugs only manifest on very large problems, which makes losing a factor of 10-100x in performance to an analysis tool (like Valgrind, which is an amazing tool and perhaps one of the best-optimized out there) impractical, or at least annoying.

One source of human errors is conversion between units.  The classic example is the loss of the Mars Climate Orbiter in 1999 due to a failure to convert between metric and English units.  However, units, or a related kind of semantic information, often come up in pure programming tasks that have little to do with physical measurements.  For example, my current project involves a lot of C code that iterates over one kind of data structure, and copies it into a new kind of data structure.  A lot of indices fly around, and I usually have to give them suggestive names like "index_old" or "index_new" to indicate whether they "belong" to the old or new data structure.  Naming conventions aren't enough to prevent a mismatch, though, and a mistake (like addressing a "new" array with an "old" index) may result in a segfault, a smashed stack, intermittent semantic errors that depend on runtime parameters, or no problem at all.  Here, "old" and "new" are a kind of "unit" or semantic restriction:  for example, "this index can only be used to address this array, not any other."  This kind of language feature seems like it should incur little or no runtime performance hit in many cases.
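
In plain C, one partial workaround is to wrap each family of indices in its own single-member struct, so that mixing them becomes a compile-time type error rather than a runtime mystery; any optimizing compiler should make the wrapper free.  A sketch with made-up names:

    #include <stddef.h>

    /* Two distinct index "units": the compiler will not convert between them. */
    typedef struct { size_t i; } old_index;
    typedef struct { size_t i; } new_index;

    static inline double get_old(const double *old_data, old_index k)
    {
        return old_data[k.i];
    }

    static inline void put_new(double *new_data, new_index k, double v)
    {
        new_data[k.i] = v;
    }

    int main(void)
    {
        double old_data[4] = { 1, 2, 3, 4 }, new_data[4];
        old_index src = { 2 };
        new_index dst = { 0 };
        put_new(new_data, dst, get_old(old_data, src));
        /* put_new(new_data, src, 0.0);   <-- rejected at compile time */
        return 0;
    }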

Another source of human errors is resource scope:  who initializes what, when, and who de-initializes it, when.  Many HPC codes require careful control of memory allocation and deallocation that often cannot be left to a garbage collector, either because the amounts of memory involved are huge, or because the allocation method is specialized (e.g., optimizations for nonuniform memory access (NUMA) machines).  Furthermore, garbage collection doesn't solve all resource management problems:  files, pipes, sockets, and the like are often shared by multiple threads.  Many languages, such as ANSI Common Lisp and the concurrent JVM-based language Clojure, have a special syntax for managing the scope of objects like files that need to be opened and closed, regardless of whether the operations done during the lifetime of the opened file succeed.  Common Lisp has a convention to name such scope management macros "WITH-*", where one replaces the * with the kind of object being managed.  Of course, a syntax for scope management assumes that one can statically determine the scope of the object, but many situations and kinds of object fit that pattern.
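
The closest I can get to WITH-* in portable C is a macro built on a one-trip for loop.  It covers the common statically-scoped case, but unlike the Lisp macro it does not survive an early return, break, or longjmp out of the block -- which is rather the point about wanting language support.  A sketch only (the file name is made up):

    #include <stdio.h>

    /* Open the file, run the block once, then close it when the block exits
       normally.  (An early return or break inside the block skips fclose.) */
    #define WITH_FILE(f, path, mode)                              \
        for (FILE *f = fopen((path), (mode)); f != NULL;          \
             fclose(f), f = NULL)

    int main(void)
    {
        WITH_FILE(out, "results.txt", "w") {
            fprintf(out, "# benchmark results\n");
        }
        return 0;
    }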

My current project involves a shared-memory parallel solver written in an SPMD programming model.  There's a lot of shared state, and it gets initialized at different times:  either before the compute threads are created, during the scope of the threads, or after the threads complete.  Sometimes, threads cooperate to initialize an object, and sometimes they let just one compute thread do the work.  In general, it's hard for compilers to tell when different threads might be accessing a shared object.  However, my choice of the SPMD programming model restricts the implementation enough that the compiler can easily identify whether something happens before or after thread creation or completion. It would be nice to have a syntax or some kind of help from the language to manage the different kinds of shared object creation.  The runtime could also help out by forbidding multiple initializations, or by "tagging" a shared object as accessible only by a certain thread.
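
Today the closest library-level help I know of is POSIX's pthread_once, which at least enforces "exactly one thread runs the initializer, and everyone else waits for it to finish."  A sketch, with a hypothetical shared table standing in for the real shared state:

    /* Compile with -pthread. */
    #include <pthread.h>
    #include <stdlib.h>

    static double *shared_table = NULL;
    static pthread_once_t table_once = PTHREAD_ONCE_INIT;

    static void init_table(void)
    {
        /* Runs in exactly one thread, no matter how many race to call it. */
        shared_table = malloc(1024 * sizeof(double));
    }

    /* Every SPMD thread calls this before touching shared_table; all but the
       first caller block until initialization has finished. */
    static void *thread_body(void *arg)
    {
        (void)arg;
        pthread_once(&table_once, init_table);
        /* ... use shared_table ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, thread_body, NULL);
        for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
        free(shared_table);
        return 0;
    }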

A third source of errors has been imposing structure on raw pointers.  For example, a memory allocation returns a chunk of contiguously stored memory, but I have to divide up that chunk in place into multiple nonoverlapping matrices.  This is something that Fortran (especially Fortran 2003, if implemented correctly, *ahem*) does much better than C:  Fortran always imposes array structure on chunks of memory, and if you've done the mapping right, the runtime can identify indexing errors.  C only gives you raw pointers, onto which you have to impose structure.  If you mess up, only something like Valgrind may be able to help.  However, even Valgrind has no access to the semantic information that a particular chunk of memory is supposed to be an M by N matrix with leading dimension LDA.  Having an option to do bounds checking, which does not interfere with or affect parallel performance, but which can be turned off (at build time even) to improve performance on one thread, would be a big help.
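
What I end up hand-rolling looks something like the sketch below: a small matrix view imposed on a raw chunk, with bounds checks that vanish when NDEBUG is defined at build time.  (The names are made up; this is the sort of thing I wish the language provided.)

    #include <assert.h>
    #include <stdlib.h>

    /* An M-by-N column-major matrix view, with leading dimension LDA,
       imposed on a chunk of raw memory that the caller owns. */
    typedef struct {
        double *data;
        int m, n, lda;
    } matrix_view;

    static inline double *mat_at(matrix_view A, int i, int j)
    {
        /* Bounds checks disappear entirely when built with -DNDEBUG. */
        assert(0 <= i && i < A.m);
        assert(0 <= j && j < A.n);
        return &A.data[i + (size_t)j * A.lda];
    }

    int main(void)
    {
        /* Carve two nonoverlapping 100x50 matrices out of one allocation. */
        double *chunk = malloc(2 * 100 * 50 * sizeof(double));
        matrix_view A = { chunk,            100, 50, 100 };
        matrix_view B = { chunk + 100 * 50, 100, 50, 100 };

        *mat_at(A, 0, 0)   = 1.0;
        *mat_at(B, 99, 49) = 2.0;
        /* *mat_at(A, 100, 0) = 3.0;   <-- caught by assert in a debug build */

        free(chunk);
        return 0;
    }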

When I see new languages proposed for scientific computing, often they have lots of new features -- nested parallelism, data structures suited to scientific things like irregular discretizations, or a full library of parallel primitives.
What I don't see, however, is a new low-level language which can be used to implement those new higher-level languages, or to implement highly optimized library routines.  A good low-level language is also important for bridging the gap between declarative optimization approaches (i.e., identifying features of the data or algorithm that enable optimizations without requiring them), and imperative approaches (mandating which optimization or feature to use, so that one can predict and measure performance).  Predictable performance makes optimization a science rather than a collection of incantations.  Furthermore, if a good low-level imperative language is used as a compilation target of a declarative language, then one can examine the generated code and figure out what the compiler is doing -- or even introspect and alter the implementation strategy at runtime.

10 June 2009

C++: Worlds of Pain

MS's efforts to implement the new C++0x standard in their compiler remind me of the world of pain that is C++ syntax.  James Iry jokes that the "language is so complex that programs must be sent to the future to be compiled by the Skynet artificial intelligence," but of course programmers do a kind of compilation in their head too, in order to understand and work with the language. 

My experience with C++ was typified by looking at Boost's graph library and wondering what the developers were smoking and whether or not I wanted to try some.  I took C++ reasonably far in my professional work; I used the STL containers extensively and even dipped into some of the generic algorithms (even with Visual Studio 6's awful C++ template support, some of them were still useful).  However, there were too many ad-hoc-seeming rules, conventions, and ways to shoot myself in the foot, and name mangling was always an annoyance for me.

20 April 2009

C and Fortran: the right balance

I'm a fan of picking the right programming language for the job, even if it means mixing languages in a project.  Sometimes this can lead to pain, especially if I'm using relatively new language features whose compiler support hasn't been exercised enough to be fully debugged.  However, getting the right balance between languages can make one's code concise and elegant.

This weekend, I've been working on rewriting a parallel matrix factorization.  I had written most of it in (modern) Fortran, but that made it tricky to interface with C code written by collaborators and others in our research group.  However, C's lack of 2-D array notation makes it awkward to write matrix operations.  Thankfully the factorization algorithm splits naturally into a series of operations on submatrices.  I implemented each of those "local" factorization steps in Fortran, using the ISO_C_BINDING module so that I can call them from C without needing to guess the Fortran compiler's name mangling convention, or needing to make everything a pointer (by default, Fortran calls everything by reference).  C code sequences those calls to local factorization steps and does the necessary parallel synchronization.  The resulting C is significantly shorter than the Fortran version I had written before, but I still get the benefits of Fortran's concise notation for matrix operations.
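
The C side of that arrangement ends up looking roughly like the sketch below, where panel_factor stands in for one of the (hypothetical) Fortran local factorization steps declared with bind(C) via ISO_C_BINDING; the trailing-matrix update and the parallel synchronization are elided:

    #include <stddef.h>

    /* Prototype of a Fortran routine declared with bind(C, name="panel_factor").
       Thanks to the C binding there is no name mangling to guess, and the
       dimensions can be passed by value rather than by reference. */
    void panel_factor(double *panel, int m, int n, int lda);

    /* The C driver just sequences the local Fortran factorization steps. */
    static void factor_blocked(double *A, int n, int lda, int nb)
    {
        for (int k = 0; k < n; k += nb) {
            int w = (n - k < nb) ? (n - k) : nb;
            panel_factor(&A[k + (size_t)k * lda], n - k, w, lda);
            /* ... apply the panel to the trailing matrix, synchronize ... */
        }
    }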

10 March 2009

Dark secrets of the art: C pointers and Fortran pointers

Lately, I've been implementing some new numerical algorithms for benchmarking.  Normally I would write such things in C rather than Fortran, out of habit and experience.  However, since this code calls Fortran routines (specifically, in the BLAS and LAPACK) a lot, I found it less trouble to write it in Fortran than to manage calling Fortran from C.  This is because C and Fortran have different linking conventions, and C has no standard interface for calling into Fortran.  Different compilers have different conventions for translating Fortran subroutine and function names into linker symbols -- sometimes they add one or more underscores, sometimes not.  It's deceptively easy if you only use one compiler, such as gcc; developers who aim to build with multiple compilers often discover the differences by accident, and account for them with ad-hoc symbol translation macros (or paper over the differences with build systems such as GNU Autoconf, which hide the ad-hockery for you).  Traditional Fortran subroutines and functions also have no notion of "pass by value," so when calling them from C, you have to pass every argument by address (copying scalar inputs into addressable temporaries), and keep track of which arguments are inputs and which are outputs.
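
Those ad-hoc symbol translation macros usually look something like the sketch below (FC_UNDERSCORE is a made-up configuration macro standing in for whatever the build system detects about the Fortran compiler):

    /* Ad-hoc symbol translation: pick the convention the Fortran compiler uses
       for linker symbols. */
    #ifdef FC_UNDERSCORE
    #  define FC_GLOBAL(name) name##_     /* e.g. daxpy_ for gfortran, ifort */
    #else
    #  define FC_GLOBAL(name) name        /* some compilers add no underscore */
    #endif

    /* The BLAS daxpy, seen from C.  Every argument -- even the scalars -- is a
       pointer, because Fortran expects its arguments by reference. */
    #define daxpy FC_GLOBAL(daxpy)
    void daxpy(const int *n, const double *alpha,
               const double *x, const int *incx,
               double *y, const int *incy);

    /* Usage from C: note the addresses of temporaries for the scalars.
           int n = 100, one = 1;  double alpha = 2.0;
           daxpy(&n, &alpha, x, &one, y, &one);                          */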

These factors have annoyed me enough in the past that for this project, I taught myself enough "modern" Fortran to get by.  (I define "modern" as Fortran 90 and newer, or generally, anything with free-form input and colon array slice notation.)  However, I have users who code exclusively in C.  Fortran >= 2003 has what C does not, though:  an ISO standard for interoperability with C.  You can give a Fortran routine a C binding, which solves the linker symbol name problem.  More importantly, you can get Fortran and C pointers to talk to each other.

Yes, Fortran >= 90 does actually have a pointer type!  (It has dynamic memory allocation too, unlike previous Fortran standards!)  Fortran's pointers to arrays are "smarter" than C pointers in some sense:  they have a "rank" and a "shape," which means (for instance) that you can compute with a submatrix of a larger matrix, whose elements aren't necessarily contiguous, as if it were just a plain old matrix.  The corresponding C code would look much less like linear algebra and much more like the innards of some device driver.  The price of Fortran pointers' ease of use is that a Fortran pointer contains more information than just a memory address.  You can't just pass a Fortran pointer into C space and expect C to understand it, or vice versa.  Fortran 2003 has an ISO_C_BINDING module whose C_F_POINTER subroutine solves this particular problem:  it takes a pointer from C space and associates it with a Fortran pointer of a given rank and shape.

I'm currently using the ISO_C_BINDING module to build a C interface to my Fortran library.  It's easy enough that it makes me wonder why many other languages don't come equipped with a language-level C binding!

14 February 2009

You might be an LAPACK geek if...

Here's a quiz modeled after the infamous "purity tests," except that it has nothing to do with goats and everything to do with LAPACK, the most awesome body of Fortran ever created.  Score +1 for a yes answer, 0 for a no answer, and add up the points.

n00bs:

1. Have you ever called an LAPACK routine from one of your codes?

2. Do you know what the abbreviation LAPACK stands for?

3. Do you know how to pronounce LAPACK?  (No, it's not "la" like "lalala"...)

4. Are you aware that using the reference BLAS implementation is generally a bad idea, and do you know how to seek out a better one?

"Smug LAPACK weenies"

5. Have you ever built the reference LAPACK library from the Fortran source, not because you didn't know it was installed, but because (a) you wanted the latest version, or (b) it wasn't available on the exotic prototype hardware you have the misfortune of being asked to use?

6. Have you ever found and reported an actual bug in some LAPACK implementation?  (One extra point if you went to the vendor's booth at a conference specifically to report the bug.)

7. Did you learn Fortran specifically to call BLAS or LAPACK routines, because it "seems more natural" than using the CBLAS header files?

8. Do you get the math joke on the cover of the LAPACK Users' Guide?

1337 |-|/\x0rz

9. Can you explain from memory what the first three letters of "DGEQRF" mean?

10. Have you ever contributed to LAPACK code?

11. Are you an author of an LAPACK Working Note?

12. Do you belong to a UC Berkeley or UTK "lapackers" e-mail list?

Beyond the Pale

13. If somebody asks for a banded LU factorization with partial pivoting, can you tell them the LAPACK routine name without looking?

14. Do you own and proudly wear a ScaLAPACK 1.0 T-shirt?  (One extra point if you get the math joke on the front.)

15. Have you found a counterexample for convergence of an eigenvalue routine?

Scoring:
1-5:  Just starting
6-10:  Escape while you can
11-15:  Hopeless
>15:  I know exactly who you are so you'd better not skew the curve!

21 January 2009

Modeling reality with code: Nethack

One of my secret vices is Nethack, a classic dungeon-crawl game with an ASCII user interface, in the tradition of Rogue and Angband.  (The name "nethack" refers to the game's development as a hacking project by people distributed over the Internet, and has nothing to do with the game's content.)  Nethack is a great game, despite its ludicrously primitive graphics by modern standards, for the following reasons:
  • Game play is rich and complex, but intuitive.  Rules make sense (especially if you catch the allusions to myth and modern storytelling) and actions have consequences.  Almost any object or creature can and will interact with anything else. 
  • Intelligent humor makes the perverse and random ways to suffer and die (almost) bearable.
  • The game is so challenging that even if I skim the spoilers and play with infinite lives, I can get myself into apparently inextricable situations.  Yet, it's still enjoyable without cheats and spoilers.
  • The display is stripped-down but suggestive.  For example, line of sight and memory interact intuitively -- both are combined into one map, which lets you fit more of the map on one screen (unlike, say, a first-person shooter, which keeps the map separate from the view). 
Nethack also makes a good motivating example for teaching new programmers how to model reality with code.  The game has some elements which are representative of real-life situations:
  1. Agents have limited information, and may even forget things.
  2. Things happen beyond agents' awareness.
  3. Attempted actions may not succeed: luck (which may improve with skill) plays a role.  (Nethack, like most games, models luck probabilistically.)
  4. Actions usually change the world state irreversibly.
  5. Object interactions depend on the properties of all the objects involved.  (In terms of object-oriented programming, your object system needs mixins.)
  6. Some items and creatures are unique and irreplaceable.
It also has properties which are not representative of reality:
  1. Turn-based: there is no need to manage simultaneous attempts to change the world's state.
  2. Rule-based: interactions are modeled simply, with little resort to a priori physical modeling (e.g., Newton's laws).  This means interactions are not usually computationally intensive.
  3. Very simple graphics.
These three properties make Nethack fit into a natural progression of increasingly complex situations to model.  The game belongs after explaining object-oriented programming (OOP), but before explaining concurrency.  The simple graphics let students concentrate on the algorithmic interactions (and save teaching assistants and graders a lot of trouble writing a GUI framework for them!).  These interactions are rich enough to exercise many OOP lessons, such as member and class methods, inheritance, generic (a.k.a. virtual) methods, and object factories (for uniqueness).  However, they are not generally computationally expensive, so complicated algorithms are not necessary.

The chosen programming model affects the teaching approach.  For example, probabilistic success requires pseudorandom number generation, but if the programming model is functional, the novice's mental model ("rolling a die") differs from the usual stateful implementation.  Also, the lessons learned at this stage may require correction when the model is generalized to handle real-time (instead of turn-based) interactions, since multiple agents will need to draw random numbers simultaneously (parallel pseudorandom number generation). In addition, the programming language's version of OOP may make it more or less difficult to explain how to implement object interactions.  Some (such as Common Lisp) handle mixins trivially, whereas others (Java, C++) do not.  Reliance on "design patterns" to cover the programming language's deficiencies can help students deal with languages used in industry, but may also limit their flexibility with programming languages and models.  This doesn't mean a dogmatic insistence on "the one true programming language."  It's better to understand the deficiencies of one's chosen language, and learn how to work around them, than not to recognize deficiencies at all.

Nethack offers another useful pedagogical feature:  it's well-crafted and balanced.  Playing it teaches students that making an application good comes not so much from learning new programming tricks, as from careful and laborious design and testing.   "Testing" here means both correctness ("It didn't crash"), which students experience by necessity, as well as user experience testing ("It's fun to play"), which few students experience much in an undergraduate computer science curriculum.