21 July 2009

C is too high level

I propose that C is too high-level a language for the purposes for which it's used.  While it purports to give programmers power over memory layout -- in particular, over heap (via malloc) and stack (via alloca) allocation -- it gives them no control over how function arguments or struct fields are laid out in memory, and no way to describe that layout.  Knowing how struct fields are laid out means you can use structs from languages other than C, in a portable way.  (Some foreign function interfaces, such as Common Lisp's CFFI, refuse to allow passing structs by value for this reason.)  It also means you can know exactly how much memory to allocate, and can bound more tightly from above how much stack space a particular C function call requires.

Given the state of C as it is, what I would like is a domain-specific language for describing binary interfaces, such as struct layout or function call signatures.  I would like C compilers to support standard annotations that guarantee particular layouts.  Currently this is done in an ad hoc, compiler-specific way -- sometimes by command-line flags ("pack the structs") and sometimes by pragmas or annotations. 
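To make the idea of a layout minilanguage concrete, here is a minimal sketch using the format strings of Python's struct module, which are a small language of roughly this kind, though nothing a C compiler is obliged to honor.  The record (a one-byte tag followed by a four-byte unsigned integer) is a made-up example, not taken from any real library.

```python
import struct

# Hypothetical record: a 1-byte tag followed by a 4-byte unsigned integer.
NATIVE = "@cI"  # native byte order, sizes, and alignment (may insert padding)
PACKED = "<cI"  # little-endian, standard sizes, no padding at all

native_size = struct.calcsize(NATIVE)  # platform-dependent
packed_size = struct.calcsize(PACKED)  # always 1 + 4 bytes

# Serialize and parse one record using the fully specified layout.
packed_bytes = struct.pack(PACKED, b"x", 1)
tag, value = struct.unpack(PACKED, packed_bytes)

print("native:", native_size, "packed:", packed_size)
```

The "<" layout is the same on every platform, while the "@" layout depends on the platform's alignment rules, which is exactly the part that is hard to pin down portably from outside C.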

The main reason I want standard compiler support for such a minilanguage is interoperability between C and other languages.  I have the misfortune of needing to call into a lot of C libraries from my code, but I don't want to be stuck writing C or C++ (The Language Which Shall Not Be Parsed).  At the same time, I don't want to tie other users of my code to a particular C compiler (if the code were just for me, it wouldn't matter so much).

01 July 2009

Python win: csv, sqlite3, subprocess, signal

I've been working these past few months (ugh, months...) on a benchmark-quality implementation of some new parallel shared-memory algorithms.  It's a messy, temperamental code that on occasion randomly hangs, but when it works, it often outperforms the competition.  Successful output consists of rows of space-delimited data in a plain text format, written to standard output.

I spent a week or so writing a driver script and output data processor for the benchmark.  The effort was minimal and paid off handsomely, thanks to some handy Python packages: csv, sqlite3, subprocess, and signal.

The csv ("comma-separated values") package, despite its name, can handle all kinds of text data delimited by some kind of separating character; it reads in the benchmark output with no troubles.  sqlite3 is a Python binding to the SQLite library, which is a lightweight database that supports a subset of SQL queries.  SQL's SELECT statement can replace pages and pages of potentially bug-ridden loops with a single line of code.  I use a cute trick:  I read in benchmark output using CSV, and create an SQLite database in memory (so I don't have to worry about keeping files around).  Then, I can issue pretty much arbitrary SQL queries in my script.  Since I only use one script, which takes command-line arguments to decide whether to run benchmarks or process the results, I don't have to maintain two different scripts if I decide to change the benchmark output format.
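The csv-into-SQLite trick can be sketched roughly as follows.  The column names (algorithm, threads, seconds), the sample rows, and the load_results helper are all invented for illustration; the real benchmark output format is not shown here.

```python
import csv
import io
import sqlite3

# Hypothetical benchmark output: space-delimited rows of
# (algorithm name, thread count, run time in seconds).
raw_output = """\
algA 1 4.0
algA 2 2.5
algB 1 5.0
algB 2 2.0
"""

def load_results(text):
    """Parse space-delimited text into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")  # no database file left on disk
    conn.execute(
        "CREATE TABLE results (algorithm TEXT, threads INTEGER, seconds REAL)"
    )
    reader = csv.reader(io.StringIO(text), delimiter=" ")
    rows = [(row[0], int(row[1]), float(row[2])) for row in reader if row]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    return conn

conn = load_results(raw_output)

# One query instead of a hand-written loop: best time per algorithm.
best = conn.execute(
    "SELECT algorithm, MIN(seconds) FROM results "
    "GROUP BY algorithm ORDER BY algorithm"
).fetchall()
print(best)
```

If the output format changes, only the table definition and the row conversion in load_results need to change; the queries stay as they are.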

The subprocess and signal packages work together to help me deal with the benchmark's occasional flakiness.  The latter is a wrapper around the POSIX signalling facility, which lets me set a kind of alarm clock in the Python process.  If I don't stop the alarm clock early, it "goes off" by sending Python a signal, interrupting whatever it might be doing at the time.  The subprocess package lets me start an instance of the benchmark process and block until it returns.  "Block until it returns" means that Python doesn't steal my benchmark's cycles in a busy loop, and the alarm means I can time out the benchmark if it hangs (which it does sometimes).  This means I don't burn through valuable processing time if I'm benchmarking on a machine with a batch queue.
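A minimal sketch of the alarm-clock pattern, assuming a POSIX system (signal.alarm is not available on Windows).  The names run_with_timeout and BenchmarkTimeout are made up for this example, and a short Python child process stands in for the real benchmark.

```python
import signal
import subprocess
import sys

class BenchmarkTimeout(Exception):
    """Raised by the SIGALRM handler when the alarm goes off."""

def _on_alarm(signum, frame):
    raise BenchmarkTimeout()

def run_with_timeout(argv, seconds):
    """Return the child's stdout, or None if it did not finish in time."""
    signal.signal(signal.SIGALRM, _on_alarm)
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    try:
        signal.alarm(seconds)        # set the alarm clock (whole seconds)
        out, _ = proc.communicate()  # blocks in the OS; no busy loop
        return out
    except BenchmarkTimeout:
        proc.kill()                  # the child hung: stop it and reap it
        proc.wait()
        proc.stdout.close()
        return None
    finally:
        signal.alarm(0)              # stop the alarm clock early

# Stand-in for a benchmark run that finishes normally.
print(run_with_timeout([sys.executable, "-c", "print('ok')"], 5))
```

More recent Python versions also accept a timeout argument on subprocess.run and Popen.communicate, which covers the common case without a signal handler.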

I wish the benchmark itself were as easy to write as the driver script was!  I've certainly found it immensely productive to use a non-barbaric language, with all kinds of useful libraries, to implement benchmarking driver logic and data processing.