01 July 2009

Python win: csv, sqlite3, subprocess, signal

I've been working these past few months (ugh, months...) on a benchmark-quality implementation of some new parallel shared-memory algorithms.  It's a messy, temperamental code that occasionally hangs at random, but when it works, it often outperforms the competition.  Successful output consists of rows of space-delimited data in a plain text format, written to standard output.

I spent a week or so writing a driver script and output data processor for the benchmark.  The effort was minimal and paid off handsomely, thanks to some handy Python packages: csv, sqlite3, subprocess, and signal.

The csv ("comma-separated values") package, despite its name, can handle all kinds of text data delimited by some kind of separating character; it reads in the benchmark output with no trouble.  sqlite3 is a Python binding to the SQLite library, which is a lightweight database that supports a subset of SQL queries.  SQL's SELECT statement can replace pages and pages of potentially bug-ridden loops with a single line of code.  I use a cute trick:  I read in the benchmark output using csv, and create an SQLite database in memory (so I don't have to worry about keeping files around).  Then, I can issue pretty much arbitrary SQL queries in my script.  Since I only use one script, which takes command-line arguments to decide whether to run benchmarks or process the results, I don't have to maintain two different scripts if I decide to change the benchmark output format.
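Here's a minimal sketch of that trick.  It is not my actual script: the table name, the column names (algorithm, threads, seconds), and the sample rows are all made up for illustration, and the real benchmark output has a different layout.

```python
import csv
import io
import sqlite3

# Hypothetical benchmark output: algorithm name, thread count, run time.
SAMPLE = """\
ours 1 4.0
ours 2 2.0
theirs 1 5.0
theirs 2 3.0
"""

def load_rows(text):
    """Parse space-delimited text into a list of rows, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter=" ", skipinitialspace=True)
    return [row for row in reader if row]

def build_db(rows):
    """Load the rows into an in-memory SQLite database (no file on disk)."""
    conn = sqlite3.connect(":memory:")
    # Column types let SQLite convert the text fields to numbers on insert.
    conn.execute(
        "CREATE TABLE results (algorithm TEXT, threads INTEGER, seconds REAL)"
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    return conn

conn = build_db(load_rows(SAMPLE))

# One SELECT instead of a hand-written loop: best time per algorithm.
best = conn.execute(
    "SELECT algorithm, MIN(seconds) FROM results "
    "GROUP BY algorithm ORDER BY algorithm"
).fetchall()
```

If the output format changes, only the CREATE TABLE line and the queries need to follow along.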

The subprocess and signal packages work together to help me deal with the benchmark's occasional flakiness.  The signal package is a wrapper around the POSIX signalling facility, which lets me set a kind of alarm clock in the Python process.  If I don't stop the alarm clock early, it "goes off" by sending Python a signal, interrupting whatever it might be doing at the time.  The subprocess package lets me start an instance of the benchmark process and block until it returns.  "Block until it returns" means that Python doesn't steal my benchmark's cycles in a busy loop, and the alarm means I can time out the benchmark if it hangs (which it does sometimes).  This means I don't burn through valuable processing time if I'm benchmarking on a machine with a batch queue.
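Roughly, the alarm-clock pattern looks like the sketch below.  Again, this is an illustration rather than my real driver: run_with_timeout and BenchmarkTimeout are names invented for this example, the child process is a stand-in for the benchmark, and it assumes a POSIX system (SIGALRM doesn't exist on Windows) with a reasonably recent Python 3.

```python
import signal
import subprocess
import sys

class BenchmarkTimeout(Exception):
    """Raised by the alarm handler when the child runs too long."""

def _on_alarm(signum, frame):
    raise BenchmarkTimeout()

def run_with_timeout(argv, seconds):
    """Run argv and return its stdout, or None if it exceeds the timeout."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    try:
        with subprocess.Popen(argv, stdout=subprocess.PIPE, text=True) as proc:
            try:
                signal.alarm(seconds)          # set the alarm clock
                out, _ = proc.communicate()    # block; no busy loop
                return out
            except BenchmarkTimeout:
                proc.kill()                    # the child hung: give up on it
                return None
            finally:
                signal.alarm(0)                # stop the alarm early
    finally:
        signal.signal(signal.SIGALRM, old_handler)

# Stand-in for the benchmark: a child that prints one line and exits.
output = run_with_timeout([sys.executable, "-c", "print('done')"], 5)
```

The handler raises an exception, which interrupts the blocking communicate() call in the main thread; the finally clauses cancel the alarm and restore the previous handler either way.  Newer Python versions also accept a timeout argument to communicate(), which covers the same ground without signals.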

I wish the benchmark itself were as easy to write as the driver script was!  I've certainly found it immensely productive to use a non-barbaric language, with all kinds of useful libraries, to implement benchmarking driver logic and data processing.
