Tuesday, July 28, 2009

Partitioning, Data Locality and some more reading material

From my ongoing reading about distributed stores, MapReduce, BASE and other related topics, here's some "gyan" (alt. meaning: musings of an armchair engineer):

Effect of Partitioning and Data Locality for scaling in a Distributed System:

If you are familiar with the Fallacies of Distributed Computing, two of those fallacies remain very relevant even in a well-replicated system on an internal, trouble-free network.
Or rather, especially in an internal network, where network security, topology and bandwidth are not major issues. The two that still cannot be ignored are:
» Latency is zero
» Transport cost is zero

The trouble with storing data in such a distributed system (as key-value pairs, or as columns + column families) is that everything looks fine until you really need to cross-reference/look up/join data between two such caches/stores.

Unlike in a regular database, where you normalize everything into separate tables and then perform an indexed join across more than one table to re-assemble the data, a distributed system has no proper facility to do this efficiently. The cost is most often prohibitive because of the high latency incurred in moving, say, the left-hand-side keys of a join over to the node holding the right-hand-side table. And if you are attempting to replicate a complex join plan like a database's, well, good luck.
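To make the cost concrete, here's a toy simulation (the node count, key formats and data are all made up for illustration) of two data sets hashed independently across nodes. Joining one against the other means that most lookups land on a different node than the key that triggered them, and each of those is a network round trip:

```python
import zlib

NODES = 4

def node_for(key):
    # Blind hash partitioning: deterministic and evenly spread,
    # but with no notion of which keys are related.
    return zlib.crc32(key.encode()) % NODES

# "users" keyed by user id, "orders" keyed by order id.
users = {f"user:{i}": {"name": f"u{i}"} for i in range(8)}
orders = {f"order:{i}": {"user_id": f"user:{i % 8}"} for i in range(20)}

# Joining orders to users means looking up each order's user_id; every
# lookup whose user lives on a different node than the order itself
# becomes a network hop.
remote_lookups = sum(
    1
    for order_key, order in orders.items()
    if node_for(order_key) != node_for(order["user_id"])
)
print(f"{remote_lookups} of {len(orders)} join lookups cross node boundaries")
```

With 4 nodes you'd expect roughly three quarters of the lookups to be remote, and that's the per-key latency tax the paragraph above is talking about.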

To alleviate this problem, you would have to:
» De-normalize your data
» Or move all your relationships closer together in the cluster (Co-location)
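The second option can be sketched in a few lines. The trick (the key convention here is my own invention, not any particular product's API) is to hash on a shared routing field instead of the whole key, so that a user's record and all of that user's orders land on the same node and can be joined locally:

```python
import zlib

NODES = 4

def routing_key(key):
    # Hypothetical convention: both "user:<id>" and "order:<id>:<n>"
    # carry the user id as their second field, and we route on that
    # field alone instead of the full key.
    return key.split(":")[1]

def node_for(key):
    # Co-location: everything sharing a routing key shares a node.
    return zlib.crc32(routing_key(key).encode()) % NODES

print(node_for("user:7") == node_for("order:7:99"))  # → True (co-located)
```

This is essentially what the commercial grids call Key-Association or a partition resolver: you trade some evenness of spread for the ability to do node-local joins.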

Object-oriented DBs face a similar problem, in that you cannot perform joins as freely as you would in a relational DB. Their solution is a "pre-joined" or "always-joined but lazily fetched" graph structure.
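The "lazily fetched" part of that graph structure is easy to sketch (class and key names here are illustrative, not from any OODB): a reference object holds only the key, and the target is pulled from the store the first time it is actually traversed, then cached:

```python
class Ref:
    """A lazily fetched reference: holds a key, loads the target on
    first access, then serves the cached object thereafter."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        self._obj = None

    def get(self):
        if self._obj is None:
            self._obj = self._store[self._key]  # fetch on demand
        return self._obj

# The order is "always joined" to its user, but the user record is
# only fetched if someone actually navigates the reference.
store = {"user:1": {"name": "alice"}}
order = {"id": "order:9", "user": Ref(store, "user:1")}
print(order["user"].get()["name"])  # → alice
```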

The most common Distributed Hash Tables (DHTs) like Chord and Bamboo, which have been around in some form or another since the '90s, have exactly these problems. Fast forward to 2009, and Memcached has the same problem. To be fair, Memcached was never meant to be anything more advanced than a large, evenly spread out, distributed hash table/store, which it excels at.
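For reference, this is roughly what the placement logic in such systems looks like. Below is a minimal consistent-hash ring (a sketch only; real DHTs and memcached client libraries are considerably more involved): each node claims many points on a hash ring, and a key belongs to the node owning the next point clockwise. The spread is even, but nothing about it keeps related keys together:

```python
import bisect
import zlib

class ConsistentHashRing:
    """Minimal consistent-hash ring; node names are made up."""

    def __init__(self, nodes, vnodes=50):
        # Each node gets `vnodes` virtual points on the ring so that
        # keys spread evenly even with few physical nodes.
        self._ring = sorted(
            (zlib.crc32(f"{node}#{i}".encode()), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        h = zlib.crc32(key.encode())
        idx = bisect.bisect(self._points, h) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
# Evenly spread, but "blind": the ring knows nothing about the
# relationship between a user and that user's orders.
print({k: ring.node_for(k) for k in ["user:1", "order:1:5", "order:1:6"]})
```

The upside of this scheme is that adding or removing a node remaps only a small fraction of keys; the downside is exactly the locality problem discussed above.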

So, for anything more than simply caching MySQL data like Flickr images or Facebook friends lists, this form of blindly hashing keys to nodes does not suffice. It makes perfect sense in Freenet, which offers anonymous storage and retrieval in a P2P environment.

In fact, I read recently that Cassandra is considering implementing a locality-preserving distribution mechanism.

A while ago, I read a very interesting article about Constrained Tree Schemas, which points to the next level in distributed system architecture, namely Compute Grids: send your code to where your data is. We've seen an incredible number of open-source Data Grids and a few pure Compute clusters like GridGain and JPPF, but there are still very few projects/products (barring a few commercial Java products) that support Data + Compute in a seamless package. (For the curious reader, I was referring to features such as Key-Association, Key-Partitioning, Space-Routing etc. in the commercial Grids. Do your own research now :-)
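The "send your code to where your data is" idea can be sketched in miniature (this toy grid and its `invoke` method are hypothetical, not the API of any of the products named above): instead of pulling a value over the network to the caller, the caller ships a function to the partition that owns the key and gets back only the result:

```python
import zlib

class MiniDataGrid:
    """Toy data + compute grid: partitions are just local dicts here,
    standing in for remote nodes."""

    def __init__(self, n_nodes):
        self.partitions = [dict() for _ in range(n_nodes)]

    def _part(self, key):
        return self.partitions[zlib.crc32(key.encode()) % len(self.partitions)]

    def put(self, key, value):
        self._part(key)[key] = value

    def invoke(self, key, fn):
        # "Send the code to the data": fn runs against the partition
        # that owns the key; only the (small) result travels back,
        # not the (possibly large) value.
        return fn(self._part(key).get(key))

grid = MiniDataGrid(4)
grid.put("user:1", {"visits": 3})
print(grid.invoke("user:1", lambda u: u["visits"] + 1))  # → 4
```

Combine this routing with the key-association trick from earlier and you get the Data + Compute package the paragraph is asking for: co-located data plus computation dispatched to wherever that data lives.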

In schools today:
Also, I've noticed that real-world distributed systems like GFS and Amazon Dynamo are being studied in courses these days. When I was in college, all we had were vague references to papers and thought experiments in books like Nancy Lynch's.
Good reading:
I've never really understood how all those Hadoop sub-projects fit together - especially HDFS. Here's an excellent article about how LinkedIn uses Voldemort and HDFS to build and transfer their multi-GB search index every day.
Until next time...cheers!


Ashwin Jayaprakash said...

I found another article about using location-aware routing for better performance in a cluster.

Jags Ramnarayan said...

Hi Ashwin,
Good points about needing to colocate and denormalize to avoid cross-node joins (and the dreaded 2-phase commit problem). You might consider taking a peek at GemStone's GemFire Object data fabric (https://community.gemstone.com/display/gemfire/GemFire+Enterprise), which is another key-value, main-memory data store. But more interesting might be how we intend to manage structured data in tables supporting relationships (referential integrity) and colocated joins, data-aware behavior routing, etc. in the SQLFabric product (https://community.gemstone.com/display/sqlfabric/SQLFabric). We would love to get some feedback.


Ashwin Jayaprakash said...

I thought using Derby to do some of the grunt work in SQLFabric was quite cool.

I relied on in-memory Databases in StreamCruncher for some of the same reasons.