Tuesday, July 28, 2009

Partitioning, Data Locality and some more reading material

From my ongoing reading about Distributed stores, Map-reduce, BASE and other related topics, here's some "gyan" (alt meaning: musings of an armchair engineer):

Effect of Partitioning and Data Locality for scaling in a Distributed System:

If you are familiar with the Fallacies of Distributed Computing, two of those fallacies remain very relevant even in a well-replicated system on an internal, trouble-free network. Or rather, especially in an internal network, where network security, topology and bandwidth are not major issues. The ones that still cannot be ignored are:
» Latency is zero
» Transport cost is zero

The trouble with storing data in such a distributed system (key-value pairs or columns + column families) is that everything looks alright until you really need to cross-reference/lookup/join data between 2 such caches/stores.

Unlike in a regular database - where you normalize everything into separate tables and then perform an indexed join across them to re-assemble the data - there is no proper facility to do that efficiently in a distributed system. The cost is most often prohibitive because of the high latency involved in moving, say, the keys from the left-hand side of the join over to the node holding the right-hand-side table. And if you are attempting to replicate a complex join plan like a database's, well, good luck.

To alleviate this problem, you would have to:
» De-normalize your data
» Or move all your relationships closer together in the cluster (Co-location)
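To make the co-location option concrete, here's a minimal sketch in Java. The names (partitionOf, associationOf) and the "#" composite-key convention are invented for illustration - this is not any product's API. The idea: hash only the "association" part of a composite key, so a user and all of that user's orders land on the same partition and a lookup/join between them stays local.

```java
import java.util.List;

// Sketch of key-association / co-location. Only the association
// component of a key is hashed, so related keys share a partition.
public class CoLocation {
    static final int PARTITIONS = 8;

    // Hash only the association component of the key.
    static int partitionOf(String associationKey) {
        return Math.abs(associationKey.hashCode() % PARTITIONS);
    }

    // "42#order-1" -> association "42"
    static String associationOf(String compositeKey) {
        int i = compositeKey.indexOf('#');
        return i < 0 ? compositeKey : compositeKey.substring(0, i);
    }

    public static void main(String[] args) {
        String userKey = "42";
        List<String> orderKeys = List.of("42#order-1", "42#order-2");
        int userPartition = partitionOf(userKey);
        for (String orderKey : orderKeys) {
            // Every order routes by "42", the same as the user record.
            System.out.println(orderKey + " -> partition "
                    + partitionOf(associationOf(orderKey))
                    + " (user is on partition " + userPartition + ")");
        }
    }
}
```

The trade-off, of course, is that a single hot user can no longer be spread across nodes.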

Object-oriented DBs face a similar problem where you cannot perform Joins as freely as you would in a Relational DB. Their solution is a "pre-joined" or "always-joined but lazily fetched" graph structure.

The most common Distributed Hash Tables that have been around in some form or another since the late '90s - Chord, Bamboo and the like - have exactly these problems. Fast forward to 2009 and Memcached has the same problem. To be fair, Memcached was never meant to be anything more advanced than a large, evenly spread out, distributed hash table/store - which it seems to excel at.

So, for anything more than simply caching MySQL data like Flickr images or Facebook friends list, this form of blindly hashing to nodes does not suffice. It makes perfect sense to use it in Freenet - which offers anonymous storage and retrieval in a P2P environment.
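Here's a toy consistent-hash ring in Java - in the spirit of Chord/Memcached-style clients, not any specific implementation - that shows what "blindly hashing to nodes" means: each key goes to the first node clockwise from its hash, so two related keys usually land on different nodes, and any "join" between them costs a network hop.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal consistent-hash ring sketch. Nodes sit at positions on a
// ring; a key is owned by the first node at or after its hash.
public class BlindHashing {
    static final TreeMap<Integer, String> ring = new TreeMap<>();

    static void addNode(String node) {
        ring.put(Math.abs(node.hashCode()), node);
    }

    static String nodeFor(String key) {
        int h = Math.abs(key.hashCode());
        // First node clockwise from the key's hash; wrap around if none.
        Map.Entry<Integer, String> e = ring.ceilingEntry(h);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        addNode("node-A");
        addNode("node-B");
        addNode("node-C");
        // Two keys for the same user can land on different nodes,
        // because the full key is hashed with no notion of locality.
        System.out.println("user:42   -> " + nodeFor("user:42"));
        System.out.println("orders:42 -> " + nodeFor("orders:42"));
    }
}
```

Real implementations add virtual nodes for even spread, but the locality problem is the same.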

In fact, I read recently that Cassandra is considering implementing a locality preserving distribution mechanism.

A while ago, I read this very interesting article about Constrained Tree Schemas, which points to the next level in Distributed System architecture - namely, Compute Grids: send your code to where your data is. We've seen an incredible number of Open Source Data Grids, and a few pure Compute clusters like GridGain and JPPF, but there are still very few projects/products (barring a few commercial Java products) that support Data + Compute in a seamless package. (For the curious reader, I was referring to features such as Key-Association, Key-Partitioning, Space-Routing etc. in the commercial Grids. Do your own research now :-)
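A minimal sketch of the "send your code to where your data is" idea, with each data node stood in for by a single-threaded executor. The names (ownerOf, runAtOwner) are invented for illustration, not any Grid product's API - the point is just that the task is routed to the partition that owns the key, instead of the data being pulled across the network.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of compute/data affinity: route the task to the key's owner.
public class ComputeGrid {
    static final int PARTITIONS = 4;
    // One single-thread executor stands in for each data node.
    static final ExecutorService[] nodes = new ExecutorService[PARTITIONS];
    static {
        for (int i = 0; i < PARTITIONS; i++) {
            nodes[i] = Executors.newSingleThreadExecutor();
        }
    }

    static int ownerOf(String key) {
        return Math.abs(key.hashCode() % PARTITIONS);
    }

    // "Space routing": the task runs on the node that owns the key.
    static <T> Future<T> runAtOwner(String key, Callable<T> task) {
        return nodes[ownerOf(key)].submit(task);
    }

    public static void main(String[] args) throws Exception {
        Future<String> f = runAtOwner("user:42",
                () -> "computed next to user:42's data on node "
                        + ownerOf("user:42"));
        System.out.println(f.get());
        for (ExecutorService n : nodes) n.shutdown();
    }
}
```

Combine this with the key-association trick above and a "join" becomes a purely local computation on one node.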

In schools today:
Also, I've noticed that real world Distributed Systems like GFS and Amazon Dynamo are being studied in courses these days. When I was in college, all we had were vague references to papers and thought experiments in books like Nancy Lynch's.
Good reading:
I've never really understood how all those Hadoop sub-projects fit together - especially HDFS. Here's an excellent article about how LinkedIn uses Voldemort and HDFS to build and transfer their multi-GB search index every day.
Until next time...cheers!

Sunday, July 26, 2009

Hiking in Huddart County Park

Today I did the Dean Trail - Crystal Springs Trail in Huddart County Park. A distance of about 4.5 miles. Probably a little more because the same trails keep criss-crossing towards the Toyon Camp area. It's not too bad. The park itself has a large recreational area with lawns and picnic benches.

Saturday, July 25, 2009

Resistance to "shady" Algo trading

If you've been following recent technology trends in high speed Algorithmic trading which sort of intersects with CEP/ESP technologies, then you must've wondered at some point about what goes on in those big Wall St companies like Hedge funds and the like.

As individual investors (if you can call us that) dabbling in the stock market with minuscule amounts of hard earned money - we often ask ourselves if we really are getting a good deal.

It turns out that the mainstream Press has caught wind of their questionable activities. It's no surprise that the Finance industry does this but still when it's laid out in such simple words before you, it's quite horrifying:

1) Senator Wants Restrictions on High-Speed Trading

2) Is Wall Street Picking Our Pockets?


(Update Sep 14 2009):
Recent ban on Flash orders - Gone in .50 Seconds

Wednesday, July 22, 2009

To parse or not to parse

I remember spending a lot of time working on SQL extensions and ANTLR grammar not too long ago. This was during my StreamCruncher days.

Looking back, the effort it required to write a clear, unambiguous grammar, simplify the AST, validate it and then finally produce some executable code was considerable. I distinctly recall that it was not very enjoyable.

Nowadays, the Fluent interface is gaining popularity, especially for implementing mini-DSLs. Here's an introduction to designing DSLs in Java.

For clarity, I'll refer to the custom language/grammar as Stringified-DSL and the Fluent interface as Fluent-DSL.

Now, don't get me wrong - I'm all for designing a concise and crisp "language" for specific tasks. That's the whole reason I built my program on SQL extensions (i.e. Stringified). But does it always have to be a dumb string - something in quotes that cannot be refactored, has no IDE support like auto-complete, and saddles you with the hassle of providing a custom editor for it...?

It doesn't just end there. Once you expose a custom Stringified-DSL, you and your users are stuck with it for a very long time. When your users eventually ask for access to real data structures - after playing with the simple stringified demos - you will have to ship them a new grammar, enhance your custom editor and document everything clearly. And if the underlying language - let's say Java - suddenly unleashes a whole batch of new APIs, syntax and other language enhancements, which is exactly what happened with Java 5 (Annotations, Concurrency libraries, Generics), you see what your users would be missing because you chose the Stringified route?

In case you are wondering why your users would want Concurrency, data structures and such stuff - we are talking about Technical users, not Business users. Business users will be happy to play with Excel or a sexy UI, so we are not talking about their requirements.

So, what are the benefits of exposing a Fluent-DSL?

  1. Easy to introduce even in a minor version of the product. Quick turn around because it involves mostly APIs

  2. Documentation is easy because it can go as part of the JavaDocs

  3. Easy for the users to consume and use. Lower setup costs and only a mild learning curve

  4. IDE features - Refactoring, Type safety, no custom editor required

  5. Base language features - Users don't have to shoehorn their code into a clumsy custom editor; their code can be unconstrained, and they are free to use data structures, Annotations, Generics, Concurrency, Fork-Join and all the other programming language goodies. That is very good in the long run, because you don't paint yourself into a corner with all that custom stuff
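To illustrate the contrast, here's a toy Fluent-DSL query builder in Java - a sketch invented for this post, not StreamCruncher's actual API. Every step is an ordinary method call, so the compiler type-checks it, the IDE auto-completes it, and a rename-refactoring updates every usage - none of which a string in quotes can offer.

```java
// A toy fluent query builder: each method appends a clause and
// returns the builder, so calls chain naturally.
public class FluentQuery {
    private final StringBuilder sql = new StringBuilder();

    public static FluentQuery select(String... columns) {
        FluentQuery q = new FluentQuery();
        q.sql.append("SELECT ").append(String.join(", ", columns));
        return q;
    }

    public FluentQuery from(String table) {
        sql.append(" FROM ").append(table);
        return this;
    }

    public FluentQuery where(String predicate) {
        sql.append(" WHERE ").append(predicate);
        return this;
    }

    @Override
    public String toString() {
        return sql.toString();
    }

    public static void main(String[] args) {
        // Refactor-safe and auto-completable - unlike a dumb string.
        String sql = select("symbol", "price")
                .from("ticks")
                .where("price > 100")
                .toString();
        System.out.println(sql); // prints "SELECT symbol, price FROM ticks WHERE price > 100"
    }
}
```

A real Fluent-DSL would use distinct interfaces per stage so that, say, where() cannot be called before from() - the compiler then enforces the grammar for free.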

So, what tools does our mild-mannered Developer have at his/her disposal? Some nice Open Source libraries I've spent time reading about come to mind. And to be fair to Stringified-DSLs, these would make life easier for everyone involved in case you badly have to go the stringy route:
  • MVEL - my favorite. The Interceptor feature is a nice touch

  • Expression Languages like JUEL (the Unified EL used in JSP/JSF) and Spring's SpEL

  • MPS from the IntelliJ guys. I haven't fully understood this one. It seems to be somewhat like MS Oslo, maybe less so. But Oslo itself is hard to understand because of the poor arrangement of their docs

  • DSLs in Groovy - although I'm not very convinced since Groovy itself has a learning curve

If you are a .Net user, how I envy your LINQ extensions. Man that was a brilliant move. I wonder how they managed to get their parsers and editors to work so seamlessly with the regular C# code.


Monday, July 06, 2009

Road trip - Zion National Park and Grand Canyon, North Rim

Last weekend I did a 2600 mile road trip to Zion National Park and the North Rim of the Grand Canyon.

Zion was pretty good. I was expecting it to be similar to Bryce and Grand Canyon, but I was quite wrong. The best part of Zion was the hike up the Virgin River in knee deep water, on the floor of the canyon.

We could've done another, more strenuous hike to the top if we'd had an extra day. Perhaps another time. In Zion you have to use the Shuttle bus service to see about half of the park, which is a good thing. It gets even better if your bus driver happens to be the hilarious Jim L. I don't know his last name, but you will enjoy his tour.

It rained one evening and I was afraid that it would ruin our trip. Thankfully, it stopped after about a half hour and we were treated to some beautiful sights when the sun peeped out of the clouds just before setting.

You can drive from Zion's east entrance (not the main entrance) to the North Rim of the Grand Canyon in about 2.5-3 hours, if I remember correctly. But if you start out late in the evening from Zion, it's best to just stop at the town of Kanab and find lodging there.

The North Rim is best experienced if you make proper arrangements in advance and hike to the bottom of the Canyon. We didn't, and so I feel another trip is in order some time.

Until next time...cheers!