09 September 2011

Commenting code may lead to category theory

Not that category theory is hazardous to your health, of course!

This topic came up when i was thinking about how to justify the deficiencies of a particular C++ interface i had written. These functions take the name of a file containing some kind of matrix data. They then construct and return the corresponding matrix object, possibly distributing that object over some number of MPI processes. The trick is that a file could contain all kinds of different data -- both different kinds of matrices (sparse or dense, symmetric or not) and different kinds of data (integer, real, or complex). The file begins with a header containing all of this "metadata" that describes its contents, but you have to open up the file and start reading it in order to find that out.

The issue is that the natural interface calls for each of these functions to return an object of a particular type. The type of object returned by the function is therefore fixed, even though the contents of the file are unknown. There are two reasons for this. First, different kinds of matrices (e.g., sparse or dense) require different kinds of metadata as input, before reading the file at all, in order to interpret the file's contents. (For example, large square sparse matrices may be distributed in a 2-D fashion over processors, so they need both row and column distribution information. Large, dense vectors may only be distributed by rows over processors, so they only need row distribution information.) Second, even if we fix what kind of matrix we want in the interface, we still don't know the type of data in the matrix. It could contain integers, real numbers of various precisions, complex numbers, or even more complicated objects (quotients of polynomials with coefficients in a finite field, for example).

Concretely, in C++ terms, i might have a SparseMatrix or DenseMatrix type. Each of these is "templated" (parametrized) on the type of entries in the matrix: SparseMatrix<double> for a sparse matrix with double-precision real values, SparseMatrix<int> for a sparse matrix with 32-bit signed integer values, etc. This lets us implement (for example) SparseMatrix once for entries of any type. It also lets me write the file reader function once for any parametrization of SparseMatrix. All these are good. The problem is that the user has to pick a specific parametrization of SparseMatrix (e.g., SparseMatrix<double>) before calling the function.

Now, there are ways of implementing types that could be any of various types. Really, what you want are "algebraic data types" that are a disjoint union of different types (for example, "int or double or complex"). C++ doesn't have those, strictly speaking, although you can fake them with so-called "variant" types. Other languages, like Haskell, do have algebraic data types.

Suppose, though, that C++ did have algebraic data types. I will represent the disjoint union here nonstandardly with a plus sign: e.g., "int + double + complex<double>." What i might like the file reader to do is to return a sparse matrix, whose entries are int + double + complex<double>. In our notation, this would be called a "SparseMatrix<int + double + complex<double> >." However, C++ makes this undesirable. This is because in C++, the bitwise arrangement of data in memory matters a lot. Implementing a "variant" type in C++ that can hold any of int, double, or complex<double> would require at least as much space as that required by a complex<double> -- twice as much as a double, and twice as much again as an int. Furthermore, i really don't want a sparse matrix for which different entries might have different types -- all of the entries should have the same type. Yet, i'm forced to do this if the function returns a SparseMatrix<int + double + complex<double> >.

The function might instead return a SparseMatrix<int> + SparseMatrix<double> + SparseMatrix<complex<double> >. However, now when i get the return value, i have to guess what type the object is, trying each of the possibilities until i find out. This is easy enough for the three possibilities here. However, suppose that SparseMatrix takes several type parameters (in our case, it takes five). Do i have to guess and check for all possible combinations?

It seems natural that SparseMatrix<int> + SparseMatrix<double> should be the same type as SparseMatrix<int + double>. However, i was explaining this problem to a colleague tonight, and it occurred to me that templated classes in C++ are just functions between types. For example, SparseMatrix<T> defines a function from the entry type T to the specific sparse matrix type SparseMatrix<T>. If f is a function, why should i expect that f(x + y) = f(x) + f(y)? That would make the function f only one step away from linear -- a very special function indeed! Similarly, i should expect that a SparseMatrix<int> + SparseMatrix<double> would be a very different sort of object than a SparseMatrix<int + double>. This makes sense if you think about possible representations of each of these objects in memory: the former might be a pointer to some kind of matrix, whose arrays of matrix entries are either arrays of int or arrays of doubles, while the latter would be a matrix whose arrays of matrix entries are arrays of a variant type.

I'm not a category theorist, but thinking about this problem made me think about type systems and algebras over type systems. So, i suppose it's a little bit of category theory, no? All i have to do is not embarrass myself too much in front of a real category theorist!

No comments: