11 October 2010

Git: recovering from an incorrect e-mail address

Our big software project uses Git (the version control system) in a CVS mode, with a single central repository to which we push commits (after they pass the check-in test suite).  This can be good, because it forces us to fix small but annoying things -- like setting the correct e-mail address!  I was testing on a different machine than my usual development workstation; I made some commits there, and then pushed to my workstation's local repository (Git shines for stuff like this).  Then, I activated the check-in tests, but the push failed: my e-mail address wasn't set!  (The tests check for that -- handy!)

While I had set my e-mail address for Git correctly on my workstation, I hadn't set it on the other machine.  There were seven commits sitting on my workstation (where I run the check-in test suite) with the wrong e-mail address.  Here's how I started the fix.  First, I ran

$ git rebase -i HEAD~7

which let me fix up the last 7 local commits.  "-i" means "interactive," so Git fired up a text editor and let me decide which commits to change (and how).  I then repeated the following three tasks seven times:

$ git commit --amend --reset-author
(Edit the commit message -- no author e-mail address here, but it gets fixed due to "--reset-author")
$ git rebase --continue

That fixed it!  There's probably an easier way, but I was pleased.

04 October 2010

Larval sparse matrices?

John Rose's post on "larval objects in the JVM" suggests a type distinction between "larval" and "adult" objects.  Larval objects are mutable, suitable for in-place updates and for use as scratch computation, but are not safe for sharing by multiple threads in the same shared-memory space.  Adult objects are immutable and safe to share in parallel, but "changing" them requires copying, which can be expensive.  His examples mainly involve small objects, like complex numbers or dates.  However, he does suggest a "large-scale" example of a database, where the default state is immutable ("adult"), but a mutable API can be made available by request.

This pattern of a "larval" initialization stage -- too complex and incremental to handle in a constructor -- followed by a "publish" command (we call it "fillComplete") that transforms the data structure into its "adult" state -- is more or less how sparse matrices are used in practice.  The data for the sparse matrix is processed in "element" form (more or less equation by equation), and then "assembled" into the optimized data structure.  Changing the optimized final data structure can be expensive and may require re-optimization.  Thus, we like to think of it as immutable when optimized.

The problem is, the sparse matrix API contains operations for the "larval" stage, which are either invalid or expensive when the matrix is in its "adult" stage.  Rather than discouraging users from doing the wrong thing, it would be nice for the API to forbid invalid operations syntactically (not by throwing an exception and frustrating new users with a mysterious error message), and also make clear which operations are allowed but undesirable due to their cost.  Making it syntactically harder to do suboptimal things is a good way to intimidate nonexpert programmers from doing them!

John mentions the library solution to this problem: "builder" objects or functions, like Java's StringBuilder.  They take as input some number of updates, applied incrementally.  When the user signals that all updates are done, they produce a "final" immutable object.  In the case of sparse matrices, that would mean users can only call mutating methods on the "SparseMatrixBuilder" object.  To mutate a "finished" sparse matrix, they would have to request a "builder."

The problem with this approach is the ability to request a builder, after the original completion of the immutable object.  It means that immutable objects can change state.  It also means that the immutable object (which is now only immutable in terms of its API) and the mutable "view" coinhabit the same scope.  Now the state of the "immutable" object depends on the semantics of view updates.  Do they happen as soon as the update methods are called?  (Simultaneity doesn't mean much in the context of multiple threads.)  Do they get accumulated and applied in batches?  Do they involve an asynchronous call to some remote compute device, like a GPU?  Programmers shouldn't be depending on these sorts of semantics.  It's better to avoid the mess and forbid changing immutable objects back into mutable ones, or return to the original solution and provide only a single object type with immutable and mutable states.

What's missing here is a "rebinding syntax" that prevents programmers from accessing the supposedly immutable object while it's in a mutable state.  This is a good use case for the "WITH-" Lisp idiom, or for Python's context guards:  the mutable view created by the "with" isn't allowed to escape its scope.

24 September 2010

Using ScaLAPACK: not so painful!

I finally got an excuse this week to try out ScaLAPACK, a library for distributed-memory parallel dense linear algebra computations.  I've mainly been working on iterative methods for sparse problems, so it's not common for me to want to factor some giant dense matrix.  However, part of those iterative methods involves an orthogonalization method for small groups of dense vectors.  Comparing against ScaLAPACK's QR factorization PDGEQRF should make a good sanity check for accuracy and performance.  (PDGEQRF should be slower, since it uses an algorithm that requires more communication, and also since it likely doesn't use multicore parallelism as effectively for the panel factorization.)

It took me two and a half days to get a prototype accuracy test and benchmark up and running, which is a lot less than I had thought.  The Intel MKL Link Line Advisor was a big help in linking the right libraries.  I've tried building ScaLAPACK from source before -- while that was some four years ago, it was a real pain and I was happy not to have to try that.  The last few hours of those two-and-a-half days were spent trying to figure out why the matrix distribution wasn't working.  The routine in question, PDGEQRF, either refused to accept the input data (it helpfully identified the first argument of the routine which was off in some way, but didn't say why it was wrong), or only factored part of the matrix.  Staring at the ScaLAPACK Users' Guide -- in particular, at the diagrams illustrating global matrix layout -- helped me decode the mysterious parameters describing how the matrix is distributed among processors.

Using ScaLAPACK turned out not to be so difficult as I had thought.  The Users' Guide is well-written, with many helpful diagrams.  The routines have great documentation (and helpfully repeat documentation for certain library-standard interfaces, like the matrix descriptor array).  Having a pre-built library with the linker flags ready for copy-and-paste into my build configuration system was a huge help.  I imagine much of the pain people have reported came from trying to build everything themselves.  Note also that Matlab now has support for distributed-memory parallel computation, and distributed-memory data layouts.  It's based on ScaLAPACK, which means the MathWorks' developers have gone through a lot of this pain for you ;-)

A final interesting point:  the ScaLAPACK Users' Guide was written when High Performance Fortran (HPF) was under serious consideration as the way to write distributed-memory parallel code.  ScaLAPACK doesn't use HPF, but it does use HPF as an alternate notation to describe different ways to distribute matrices over processors.  HPF proved unpopular because it was too hard to build an efficient compiler for it, but parts of HPF made their way into modern versions of the Fortran standard. 

15 September 2010

The user experience of lazy automatic tuning

I've been doing some performance experiments with an MPI-parallel linear algebra kernel.  I've implemented the kernel in two different ways:  a "reduce"  (not quite MPI_Reduce(), though it has a similar communication pattern) and "broadcast," and a "butterfly."  The latter is more general (it can be used to compute things other than what I'm benchmarking).  However, it might be slow in certain cases (e.g., if the network can't handle too many messages at once, or if there are many MPI ranks per node and the network card serializes on processing all those messages going in and out of the node).  Each run involves testing two different implementation alternatives, and four different scalar data types (float, double, complex float, and complex double).

When I ran the benchmarks, I noticed that the very first run (butterfly implementation strategy, "float" data type) of each invocation of the benchmark is 10 - 100x slower than for other data types with the same implementation strategy. When I rebuild the executable using a different MPI library, this problem went away.  What was going on?  I noticed that for the "slow" MPI library, the minimum benchmark run time was reasonable, but the maximum run time was a lot more.  After reading through some messages in the e-mail archives of that MPI library, I found out that the library delays some setup costs until execution time.  That means the first few calls of MPI collectives might be slow, but overall they will be fast.  The other MPI library doesn't seem to be doing this.

I was reminded of some troubles that the MathWorks had with integrating FFTW into Matlab, a few years ago.  FFTW does automatic performance tuning as a function of problem size.  The tuning phase takes time, but once it's finished, running any problem of that size will likely be much faster.  However, users didn't realize this.  They ran an FFT once, noticed that it was much slower than before (because FFTW was busy doing its thing), and then complained.  One can imagine all sorts of fixes for this, but the point is that something happened to performance which was not adequately communicated to users.

Now, MPI users are likely more sophisticated programmers than most Matlab users.  They likely have heard of automatic run-time performance tuning, and maybe even use it themselves.  I'm fairly familiar with such things and have been programming in MPI for a while.  Yet, this phenomenon still puzzled me until I poked around through the e-mail list archives.  What I was missing was an obvious pointer to documentation describing performance phenomena such as this one.  The performance of this MPI library is not transparent.  This is not a bad thing in this case, because the opaqueness is hiding a performance tuning process that will make my code faster overall.  However, I want to know this right away, so I don't have to think about what I could be doing wrong. 

See, I'm not just calling MPI directly.  I've wrapped up some other wrapper over MPI.  I didn't know if my wrapper was broken, the wrapper underneath was broken, the MPI library was broken, the cluster job execution software was broken, or if nothing was wrong and this was just expected behavior.  That's why I want to know right away whether I should expect significantly nonuniform performance behavior from some library.

08 September 2010

CPUs and GPUs, part 3: memory and programming models

My last post got a good comment which pointed out that I hadn't been covering another important class of programming models: annotations.  C and C++ programmers would call these "pragmas," since their compilers understand lines starting with "#pragma" as directives to the compiler that affect one or more lines of code that follow.  They would likely be familiar with OpenMP, which uses annotations to tell the compiler to parallelize certain loops or fragments of code.  Other programming languages, like Common Lisp, use annotations as hints for optimization -- in particular, as type hints. 

The commenter pointed out that the Hybrid Multicore Parallel Programming (HMPP) system of annotations, developed by CAPS and supported by the PathScale ENZO compiler suit, offers another programming model for heterogeneous node computing.  You write a kernel in a subset of C(++) or Fortran, annotated with operations that the compute device (e.g., GPU) understands (like when to transfer data to / from the host CPU, or when to synchronize device threads), and also label the kernel with an annotation to identify it for deployment to the device.   I was just introduced to HMPP last night, so please take a look at the comment to read the manual for yourself :-)  However, a brief read of the manual reminds me of Copperhead, a Python annotation system (based on decorators) that accomplishes similar goals.

Annotations require the programmer to (1) know the annotation language as well as the main programming language being used, (2) understand what subset of the programming language can be sent to the device, and (3) understand when and how the annotation changes program semantics.  Compilers should help with #2, but may not; if given invalid code for the device, they may generate incorrect device code, which may crash or silently give the wrong answer.  This is true for other annotation systems as well.  OpenMP, for example, doesn't help you avoid race conditions if multiple threads modify shared state.  You have to identify and annotate critical sections and / or atomic updates yourself.  #3 can also be a challenge, especially if annotations are sprinkled throughout the code.

Before talking about HMPP, I'll first use OpenMP as an example of the difficulties of popular annotation-based approaches.  Many OpenMP performance problems and code-writing challenges relate to the memory model.  You've taken a sequential code, where memory is laid out for sequential operation, and introduced parallelism to the computation but not the memory layout.  On nonuniform memory access (NUMA) hardware, performance tanks, because the data lives in the wrong place for most of the threads.  At least modern OpenMP implementations try not to let threads wander around to different NUMA regions from where they started, which means that the operating system could migrate pages closer to the cores that are working on them.  That suggests a "preconditioning" approach, where you first use OpenMP parallelism to read through the whole data set (thereby letting the operating system move data to where it needs to be), and then use OpenMP to do the computations you wanted to do in the first place.  Operating systems in common use don't do that yet!  Another approach is just syntactic sugar for explicitly threaded programming: each thread allocates its own data, and then you program in something like a distributed-memory model.  (It would be easier and less buggy just to run MPI with one process per core.)

Both of these OpenMP approaches are awkward.  The problem is that OpenMP separates computation and memory layout.  The annotations only govern computation; they don't say where the data should live in memory.  That often means throwing away useful information, which many OpenMP programmers know, about which threads will operate on which data.  If you know how to allocate an array, you likely know how you would divide up simple parallel for loops or reductions on the array among the CPU cores.  If you know that, you almost certainly know how to lay out the array in the various NUMA regions.
 Heterogeneous programming has its own memory layout problem: data either lives on the host CPU or the device.  Those are two separate memory spaces.  Even if they weren't separate, it still might be expensive to compute on remote data, or even to copy data from one space to another.  This is exactly the same problem as on multicore CPUs with NUMA.  When I look at HMPP, I see annotations for when to perform data transfers between the host and the device, and even ways to overlap data transfers with useful computation.  However, I don't see annotations for where to allocate the data.  The HMPP model assumes (at least from what I can tell from a brief look -- please correct me if I'm wrong) that data always lives on the host, is always transferred to the device for computations, and is always transferred back to the host afterwards.  This is like allocating all your data in one NUMA region and operating on it using CPU cores in all the NUMA regions.

One could augment the annotation approach to include a memory layout model, but it seems like this is just making the language more baroque.  Plenty of programming languages reify data layout without the complication of additional annotations.  Co-Array Fortran is a good example.  This need not be done in the language; libraries can also hide the details of layout and allocation.  Kokkos in Trilinos does this for GPUs, for example.  CUDA and OpenCL are languages, but their special routines for allocating memory on the device are just plain C.

I like the Array Building Blocks (and Kokkos) approach, because it also gives a good model for interactions between legacy code (with legacy memory layout) and new parallel code (that expects or should have data in a particular nonlegacy layout).  Kokkos does this with smart pointers and copy routines: you aren't allowed to call parallel kernels on legacy memory.  Array Building Blocks does this with "bind", which links up legacy memory and compute buffers.  The general idea is that "you don't get the pointer": memory on which heterogeneous or node-level parallel computation is done, is encapsulated and organized optimally by the language or library system.  This is also a great way to unify different kinds of nodes with different memory space(s), which is why Kokkos does it.  Pretty much any kind of node has parallel computing resources, associated to one or more memory spaces, for which data transfer costs are high and data layout matters for performance.

05 September 2010

GPUs and CPUs, part 2: programming models

After finishing last night's post, I was thinking about CPU and GPU programming models -- in particular, about short vector instructions.  On CPUs these are called SIMD (single instruction, multiple data) instructions.  NVIDIA markets them instead as "SIMT" (single instruction, multiple thread), not because of differences in the hardware implementation, but because the programming model is different.  The "CUDA C Programming Guide" version 3.1 says:  "... SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread."  Here, "thread" corresponds to an element in an array, operated on by a vector instruction.

CPU vendors have historically done a poor job presenting programmers with a usable interface to SIMD instructions.  Compilers are supposed to vectorize code automatically, so that you write ordinary scalar loops and vectorized code comes out.  However, my experience a few years ago writing SIMD code is that compilers have a hard time doing this for anything but the simplest code.  Often, they will use SIMD instructions, but only operate on one element in the short vector, just because the vector pipes are faster than the scalar ones (even if the extra element(s) in the vector are wasted).  The reason they struggle at vectorization -- more so than compilers for traditional vector machines like the Crays -- is because the SIMD instructions are more restrictive than traditional vector instructions.  They demand contiguous memory, aligned on wide boundaries (like 128 bits).  If the compiler can't prove that your memory accesses have this nice friendly pattern, it won't vectorize your code.  SIMD branching is expensive (it means you lose the vector parallelism), so compilers might refuse to vectorize branchy code. 

Note that GPU hardware has the same properties!   They also like contiguous memory accesses and nonbranchy code.  The difference is the programming model: "thread" means "array element" and consecutive threads are contiguous, whether you like it or not.  The model encourages coding in a SIMD-friendly way.  In contrast, CPU SIMD instructions didn't historically have much of a programming model at all, other than trusting in the compiler.  The fact that compilers make the instructions directly available via "intrinsics" (only one step removed from inline assembler code) already indicates that coders had no other satisfactory interface to these instructions, yet didn't trust the compiler to vectorize.  Experts like Sam Williams, a master performance tuner of computational kernels like sparse matrix-vector multiply and stencils, used the intrinsics to vectorize.  This made their codes dependent on the particular instruction set, as well as on details of the instruction set implementation.  (For example, older AMD CPUs used a "half-piped" implementation of SIMD instructions on 64-bit floating-point words.  This meant that the implementation wasn't parallel, even though the instructions were.  Using the x87 scalar instructions instead offered comparable performance, was easier to program, and even offered better accuracy for certain computations, since temporaries are stored with extra precision.)  Using SIMD instructions complicated their code in other ways as well.  For example, allocated memory has to be aligned to a particular size, such as 128 bits. That bit value depends on the SIMD vector width, which again decreases portability.  SIMD instruction widths increase over time, and continue to do so.  Furthermore, these codes are brittle, because feeding nonaligned memory to SIMD instructions often results in errors that can crash one's program.

CPU vendors are finally starting to think about programming models that make it easier to exploit SIMD instructions.  OpenCL, while as low-level a model as CUDA, also lets programmers reason about vector instructions in a less hardware-dependent way.  One of the most promising programming models is Intel's Array Building Blocks, featured in a recent Dr. Dobb's Journal article.  I'm excited about Array Building Blocks for the following reasons:
  1. It includes a memory model with a segregated memory space.  This can cover all kinds of complicated hardware details (like alignment and NUMA affinity).  It's also validation for the work being done in libraries like Trilinos' Kokkos to hide array memory management from the programmer ("you don't get a pointer"), thus freeing the library to place and manage memory as it sees fit.  All of this will help future-proof code from the details of allocation, and also make code safer (you don't get the pointer, so you can't give nonaligned memory to instructions that want aligned memory, or give GPU device memory to a CPU host routine).
  2. It's a programming language (embedded in C++, with a run time compilation model) -- which will expose mainstream programmers to the ideas of embedded special-purpose programming languages, and run time code generation and compilation.  Scientific computing folks tend to be conservative about generating and compiling code at run time, in part because the systems on which they have to run often don't support it or only support it weakly.  If Array Building Blocks gives the promised performance benefits and helps "future-proof" code, HPC system vendors will be driven to support these features.
  3. Its parallel operators are deterministic; programmers won't have to reason about shared-memory nightmares like race conditions or memory models.  (Even writing a shared-memory barrier is a challenging task, and the best-performing barrier implementation depends on the application as well as the definition of "performance" (speed or energy?).)
I've been pleased with Intel's work on Threading Building Blocks, and I'm happy to see them so committed to high-level, usable, forward-looking programming models.  NVIDIA's Thrust library is another great project along these lines.  It's also great to see both companies releasing their libraries as open source.  The "buy our hardware and we will make programming it easier" business model is great for us coders, great for doing science (open source means we know what the code is doing), and great for performance (we can improve the code if we want).  I'm hopeful that the competition between CPU and GPU vendors to woo coders will improve programming models for all platforms.

04 September 2010

Programming GPUs makes us better CPU programmers

High-performance computing blogs like "horse-race" journalism, especially when covering competing architectures like CPUs and GPUs.  One commonly hears of 20x or even larger speedups when porting a CPU code to a GPU.  Recently someone pointed out to me, though, that one rarely hears of the "reverse port": taking an optimized GPU code, and porting it back to the CPU.  Many GPU optimizations relate to memory access, and many scientific codes spend a lot of time reading and writing memory.  For example, coalescing loads and stores (for consecutive threads on a GPU warp) corresponds roughly to aligned, contiguous memory accesses on a CPU.  These are more friendly to cache lines, and are also amenable to optimizations like prefetching or using wide aligned load and store instructions.  CPUs are getting wider vector pipes too, which will increase the branch penalty.  From what I've heard, taking that GPU-optimized code and porting it back to the CPU might result in only a 2x slowdown over the GPU. 

I don't see this as bad, though.  First, GPUs are useful because the hardware forces programmers to think about performance.  After investing in learning a new programming language (such as CUDA or OpenCL) or at least a new library (such as Thrust), and after investing in new hardware and in supporting GPU runtimes in their applications, coders are obligated to get return on that investment.  Thus, they take the time to learn how to optimize their codes.  GPU vendors help a lot with this, and expose performance-related aspects of their architecture, helping programmers find exploitation points.  Second, optimizing for the GPU covers the future possibility that CPUs and GPUs will converge.  Finally, programmers' dissatisfaction with certain aspects of GPU hardware may cause divergence rather than convergence of GPUs and CPUs, or even the popularization of entirely new architectures.  (How about a Cray XMT on an accelerator board in your workstation, for highly irregular data processing?)

25 August 2010

"Archetype by prototype": a suggestion for making compile-time polymorphism more usable

A recent post on interface matching in ParaSail (a new parallel programming language under development) inspired me to think about compile-time polymorphism in languages like C++.  The author of the article observed that "C++ templates define essentially no 'contract' so there is no easy way to find out what sort of type will be acceptable without trying the instantiation." 

This stirred me to think about what C++ programmers can (and do) do to get around this.  A standard technique is to use "concept checks."  These test the required syntax by, well, trying the instantiation ;-) in a way that makes it easy to catch syntax errors.  They can be tedious, because doing them right requires two things:

1. Concept checking code, that verifies the interface statically and as simply as possible (so that compiler error messages are easier to read)

2. An "archetype class": like a Java interface, except with (skeletal) implementations of the required interface

In the above, i'm using the terminology of the Boost Concept Check Library.  #1 alone adds to the code maintenance burden, since as the "contract" evolves, you have to synchronize the checks with how you are actually using the object.  #2 alone doesn't ensure synchronization of your idea of the contract, with the classes you think are implementing it.  Nevertheless, there are still four separate bodies of code to synchronize: the archetype class, the concept checking code, the application class(es), and the application code that invokes these class(es).  The most straightforward (but not easiest) way to avoid this burden would be to

1. Write an archetype class, and make the compiler check the concept, using application code

I call this approach "declared archetypes."  It calls for a syntax for declaring that a particular concrete data type implements an archetype's interface.  This approach has implications for avoiding error-prone tedium.  A common case for templates is generic code for different kinds of numbers.  I've written a lot of code templated on "Scalar", which could be anything representing a real or complex number, and Ordinal, which represents an array index type.  It would be tedious to declare an archetype supporting all the different kinds of things one might like to do with a Scalar, for example.  It would at least have to include all of arithmetic, and all kinds of transcendental functions (such as trigonometric functions and logarithms).  That would be an awful lot of effort, especially since most people are only going to instantiate Scalar with "float" or "double".  (A few unlucky folks will make Scalar a complex number and discover all the places where the supposedly generic code was using less-than on Scalar!)

To avoid this tedium, the "concepts" proposal considered for addition to the C++ standard (but rejected) offered a shorthand for "aggregate archetypes."  You could say that Scalar is "FloatingPointLike" or that Ordinal is "SignedIntegralLike", and that would hopefully bring in the syntax that you want.  You could also "inherit" from archetypes to create your own.

The trouble with this approach is that somebody has to write a huge library of arbitrarily named predefined archetypes.  Huge, because pretty much anything you might want to do with a "plain old data" type has to have its own concept.  Arbitrarily named, because somebody has to decide what "ArithmeticLike" means.  (Does it mean integers or real numbers?)  This path seems to call for abstract algebra, where you have semigroups and rings and fields of a certain characteristic, and all kinds of things that nonmathematicians don't want to understand.  (I'm convinced the main reason why people don't use Haskell more is because it's so hard to explain what a monad is, and why you want to know.) 

This is overkill because in most cases, programmers can accomplish their work without so much formality.  The ParaSail blog post alludes to the reason why:  "... when ad hoc matching is used, there is a presumption that the module formal (or target type) is very stable and new operations are not expected to be added to it."  The typical use case of C++ templates (for example) is for things that "look like double" or "look like int."  That suggests a second approach:

2. "Archetype by prototype" or "looks like type T"

If T is something simple like "double" or "int", then you save a lot of syntax and / or library writing.  If T is complicated, this forces you to write an archetype class, which is probably a good thing if T is complicated!  For the numerical linear algebra algorithms i write and maintain, this would help remind me whether an algorithm (that claims to be generic on a Scalar type) can handle complex numbers.  (Complex arithmetic changes linear algebra in subtle ways that don't only have to do with syntax.) 

This approach does not require special syntax for defining archetypes.  However, it might still be nice to have such a syntax, and also to have a small library of archetype classes.  I could see this being useful for iterators, for example.

"Archetype by prototype" would impose requirements "lazily," much like users of C++ template metaprogramming expect.  The "actual archetype" would consist only of those operations with the concrete datatype that are actually written in code.  For example, if you never ask for the cosine of a Scalar, you don't need to have an implementation of cosine that takes a Scalar argument.  It would be nice to have some development environment support for "extracting" the "actual archetype."  C++ compilers do some of this already in their error messages, if you have the patience to read them (after about a page of compiler backtrace, you get something like "no implementation of cos(Scalar) exists for Scalar = YourType").

In summary, i propose "archetype by prototype" as an alternative to "concepts" for making compile-time polymorphism more usable.

24 June 2010

"The limits of my language define the limits of my world"

John Rose, speaking of programming languages, quotes the Logisch-philosophische Abhandlung of
Ludwig Wittgenstein:  "The limits of my language define the limits of my world" (Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt).  John nicely links to the German version of Wittgenstein's work, which gives me an lets me practice my rusty German! 

The phrase recalled a question a coworker asked me today:  "If I could choose the programming language I use for my work, what would it be?"  I answered that it depends on the work I want to do:  if I need to have a dialogue with the operating system and with low-level hardware details, C(++) is the tool of choice (mainly for ill); for numerical computations, some midway point between (modern) Fortran and Matlab would be nice.  C is more expressive than Fortran, I went on to say.  "Doesn't that mean C is better than Fortran?" he asked?  No, it doesn't mean that.  Fortran's restrictiveness (esp. its rules about pointer declarations and aliasing) make it easier for compilers to generate efficient code.  C's expressiveness makes it easier to control the computer at a low level.

Wittgenstein's observation applies here:  in C, everything looks like a pointer.  In Fortran, everything looks like an array (C doesn't have real multidimensional arrays, remember!).  In Smalltalk, everything looks like an object.  Lisp programmers fear writing domain-specific languages much less than Java programmers.  SISAL programmers fear returning an (apparently) freshly created array much less than C programmers.  SQL programmers fear complicated searches over tables of data much less than programmers of other languages, etc.  By choosing the language, I choose the way in which I attack programming problems.

If the language suggests a programming model (or a small set of them), then if a new programming model suggests itself, the rational response is to create a new language or modify an existing one in a way that naturally leads the programmer to work within the model.  Programming models ultimately reduce to our assumptions about computer hardware.  Where are these assumptions going?  It's not quite a Turing machine anymore.  We have to extend our hardware assumptions to include unreliability:  segfaults, kernel panics, nodes of a cluster going down and up, the occasional bad bit in memory or on disk. 

"Cloud" programming models ("everything is a reduction pass from disk to CPU and back to disk again") already handle nodes crashing to some extent, but it doesn't look like programming models for handling unexpected bit flips have gotten past the discussion phase.  In part this is because the hardware hasn't presented us with a programming model other than "compute and pray": even fixed addresses in read-only code pages aren't invulnerable, as the above link explains.  It's fair to assume, though, that the hardware will let us distinguish between "accurate" and "inaccurate" data or computations, just like it lets us distinguish between IEEE 754 float and IEEE 754 double. 

This distinction between accurate and possibly inaccurate computations and data will enter the programming model.  Languages have type systems for that sort of thing; it doesn't seem much different than a "const" annotation in C++, or a "synchronized" annotation in Java.  This attribute needs to be orthogonal to the actual datatype; the typical "oh well, floating-point numbers are inexact anyway" viewpoint will lead to disaster in an LAPACK LWORK query, for example.  The programming model will have to include promotions and demotions; these will have costs, of which the programmer must be made aware (just like with Java's "synchronized").

What are the implications of such a model?  First, programmers will have yet another type annotation to learn.  Just like with C's "volatile," it will lead to endless confusion, especially as hardware evolves.  Compilers can help with type deduction, but programmers will be tempted to use the "inaccurate" annotation as a magic "go faster" switch.  If computer science students can't wrap their heads around tail recursion in Scheme, how can we expect them to understand yet another low-level annotation? 

Second, programmers' view of the hardware memory space will grow even more fragmented.  We already have "GPU memory" and "CPU memory" (for OpenCL and CUDA programmers), "local" and "remote" memory (for those working in a Partitioned Global Address Space (PGAS) model), caches and local stores and "shared memory" (which is really a cache) and vector registers and...  In the future, we may have this times two: for each, an "accurate but slow" and an "inaccurate but fast" version.  Going from one to the other will require copying -- which may be implicit (as in a PGAS-like model), but still costs something.  Relaxing bit accuracy is a hardware performance optimization and so if you don't want to declare everything "accurate," then you likely are worried about the costs of copying things around.  This means you will have to reason about all of those fragmented memory spaces.

Given this likely state of hardware, I can rephrase my colleague's question: "What programming model would I like to use if the hardware behaves in this way?"  First, the PGAS model seems most natural: it lets me reason about different memory spaces if I need to, or program as if there is only one memory space (at potentially higher cost) if it helps me get my work done.  Second, the model should help programmers distinguish between "data" and "metadata."  Metadata (indices, permutations, pointers) should only be bit accurate.  (That doesn't mean integers always have to be!  Integer types get used a lot for signal processing and other bit-oriented domains.)  Third, programmers should have to handle metadata as little as possible.  The less metadata you handle explicitly, the less likely you are to "pollute" it with inexactness, and the more freedom the compiler has to perform optimizations on the metadata.  That means the language should support all kinds of collections.  Operations on collections should be fast, otherwise programmers won't want to use them.  Fourth, by extension of the third point, the language should prefer sequence operations ("map f over the array v") that let programmers avoid naming variables that don't really matter, such as index variables and intermediate quantities in loops.  (If you name it, the compiler has to keep it around unless it is clever enough to reason it away.  Storage necessitates annotation of its accuracy, which necessitates conservativism in optimizations, in the same way that C aliasing rules force the C compiler to be conservative.)

This post rambles, but I'm thinking out loud, and I'm curious to hear what others have to say about programming models for this new world of unexpected flipping bits.  The one thing that encourages me is the intelligence of those thinking about the problem.  As Shakespeare has Miranda say in The Tempest, "How beauteous mankind is! O brave new world, That has such people in't!"

23 May 2010

Post to deter spammers

I need to post something here to deter spammers, since they seem to be targeting idle blogs. I'll think of something Real Soon Now (tm) to post about generic APIs for parallelism, or something like that.