22 June 2009

On metadata, shared memory, NUMA, and caches

Many modern shared-memory multiprocessors improve memory locality by providing each socket (group of processors) in the machine with its own chunk of memory.  Each socket can access all the chunks of memory, but accesses to its own chunk are faster (in terms of latency, at least).  This design decision is called "nonuniform memory access" (NUMA).
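For concreteness, here is a minimal sketch of allocating memory in a particular socket's chunk under Linux with the libnuma library (libnuma, the node numbering, and the buffer size are illustrative assumptions; compile with -lnuma):

    /* Sketch: allocate a buffer in node 0's local memory.  Cores on
       node 0 will see lower latency to this buffer than cores on
       other nodes will. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
      if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return EXIT_FAILURE;
      }
      printf("machine has %d NUMA node(s)\n", numa_max_node() + 1);

      size_t len = 1 << 20;                      /* 1 MiB, for example */
      double *buf = numa_alloc_onnode(len, 0);   /* pages on node 0 */
      if (buf == NULL) return EXIT_FAILURE;
      /* ... work on buf from threads running on node 0 ... */
      numa_free(buf, len);
      return EXIT_SUCCESS;
    }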

Programming a NUMA machine for performance in scientific kernels looks a lot like programming a distributed-memory machine:  one must "distribute" the data among the chunks so as to improve locality.  There is a difference, however, in the way one deals with metadata -- the data describing the layout of the data. 
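On a NUMA node, one common way to do that distribution without any explicit data movement is to exploit the operating system's first-touch page placement (assuming a Linux-style first-touch policy and threads pinned to sockets, both assumptions here): initialize the data in parallel with the same schedule the computation will use, so that each page lands in the chunk local to the socket that will work on it.  A sketch in C with OpenMP:

    #include <omp.h>
    #include <stdlib.h>

    int main(void) {
      const long n = 10 * 1000 * 1000;
      double *x = malloc(n * sizeof(double));
      if (x == NULL) return EXIT_FAILURE;

      /* First touch: under first-touch placement, each page is mapped
         to the memory chunk local to the thread that first writes it. */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; ++i)
        x[i] = 0.0;

      /* Later passes with the same static schedule then touch mostly
         pages local to their own socket. */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; ++i)
        x[i] = 2.0 * x[i];

      free(x);
      return EXIT_SUCCESS;
    }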

In the distributed-memory world, one generally tries to distribute the metadata so that each process locally stores the pieces it needs, because fetching metadata from another node on demand takes a long time.  This means that memory usage increases with the number of processors, regardless of the actual problem size.  It also means that programmers have to think harder and debug longer, as distributing metadata correctly (and avoiding bottlenecks) is hard.  (Imagine, for example, implementing a reduction tree of arbitrary topology on arbitrary subsets of processors, when you aren't allowed to store a list of all the processors on any one node.  The sketch below gives the flavor of that constraint.)
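Here is a sketch (using MPI, an assumption for illustration) of the easy half of that exercise: one sum-reduction up a tree in which each process stores only its parent's rank and its children's ranks, never a global list of processors.  Building such a tree scalably in the first place is the part the parenthetical is really about, and it is omitted here.

    #include <mpi.h>

    /* Sum-reduce up an arbitrary tree.  Assumes MPI_Init has been
       called and that parent/children describe a valid tree over the
       communicator; parent == -1 marks the root.  Each process holds
       only its own neighborhood of the tree as metadata. */
    double tree_reduce_sum(double local, int parent,
                           int nchildren, const int *children,
                           MPI_Comm comm)
    {
      double acc = local;
      for (int c = 0; c < nchildren; ++c) {
        double incoming;
        MPI_Recv(&incoming, 1, MPI_DOUBLE, children[c], 0,
                 comm, MPI_STATUS_IGNORE);
        acc += incoming;
      }
      if (parent >= 0)
        MPI_Send(&acc, 1, MPI_DOUBLE, parent, 0, comm);
      return acc;  /* the full sum is meaningful only at the root */
    }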

In the NUMA world, one need not replicate the metadata.  Of course one _may_, but metadata access latency is small enough that it's practical to keep one copy of the metadata for all the processors.  This is where caches come in:  if the metadata is read-only, caches can replicate the relevant parts of the metadata for free (in terms of programming effort).  Of course, it would be nice also to have a non-coherent "local store," accessible via direct memory transfers, for the actual data.  Once you have the metadata, it's often easy to know what data you want, so it's easier to manage the explicit memory transfers to and from the local store.  However, if you only have a local store, you have to store metadata there, and managing that reduces to the distributed-memory case (except more painful, since you have much less memory capacity).
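Sparse matrix-vector multiply is a concrete example: the CSR index arrays are metadata, and one shared, read-only copy serves every thread, because coherent caches replicate just the slices each thread actually reads.  A sketch in C with OpenMP (the function and its argument names are hypothetical):

    #include <omp.h>

    /* y = A*x for A in compressed sparse row (CSR) format.  rowptr and
       colind are the metadata: they are only read here, so one shared
       copy suffices and the caches handle replication for free. */
    void csr_spmv(int nrows, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; ++k)
          sum += val[k] * x[colind[k]];
        y[i] = sum;
      }
    }

Only the numerical arrays (val, x, y) need careful placement for locality; no program logic replicates or partitions the index arrays.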

Memory capacity is really the most important difference between the shared-memory NUMA and distributed-memory worlds.  Our experience is that on a single shared-memory node, memory capacity cannot scale with the number of processors.  This is because DRAM uses too much power.  The same power concerns hold, of course, for the distributed-memory world.  However, an important use of distributed-memory machines is for their scale-out capacity to solve very large problems, so their architects budget for those power requirements.  Architects of single-node shared-memory systems don't have that freedom to expand memory capacity.  This, along with the lower latencies between sockets within a single node, makes sharing metadata in the shared-memory world attractive.

It's often easy to delimit sections of a computation in which metadata is only read and never modified.  This suggests an architectural feature:  using cache coherence to fetch metadata once, but tagging the metadata in cache as read-only, or having a "read-only" section of cache.  I'm not a hardware person, so I don't know whether this is worthwhile, but it's one way to extract locality from semantic information.

2 comments:

Snails777 said...

Unrelated question...
I noticed a few things.

i) You love maths/programming (that's how I found your blog).

ii) You go to UCB.

iii) You have a keen interest in Christianity and mysticism.

Conclusion: you must know Dallas Willard?  Sorry, I didn't know how else to communicate this, so I left it as a comment on an unrelated blog post of yours.
my email:
john.wales at breakthru dot org dot au

HilbertAstronaut said...

I will respond via e-mail -- no problem about leaving comments, as it's fun to know people are reading the stuff I write here ;-)