Thursday, February 19, 2009

Got cache?

In recent years, we've seen a general trend in data handling and management where databases are being relegated to a slightly lower pedestal in the overall scheme of things.

Umm..that was a mouthful, I apologize. What I meant was that even to a casual observer (who doesn't get worked up like you and me at the mention of gigabytes of transactional data), it is becoming obvious that networks are getting faster and faster and memory is getting cheaper (cheaper, but not inexpensive), while the darned hard disks are still plodding along at much slower rates. Gigantic distributed systems that were once restricted to college campuses with a lot of PhD people mucking about have slowly made their way into everyday projects. Thankfully, not the PhDs. Google made it look sexy to have a huge number of PCs storing data mostly in RAM, forming a "data cloud" (no, I certainly did not invent that term).

Some large banks have been caching their data in modest-sized distributed caches. Most of them have been doing it for some time, but those setups were really clustered app servers with a cache that just happened to be there. Now we see people designing their systems to be distributed on purpose. Web 2.0 startups are another prime example. Some of them have written their own distributed caches and clusters, gunning for Google's market share no doubt. All this sudden interest has resulted in some wonderful open source projects.

To prove my point, here's the list I'm talking about (some of them are new, some have been around for a while):

Apache Hadoop:
Hypertable: (similar to Hadoop's sister project)

As we think about where we are headed in terms of data management, it becomes fairly obvious that databases, on which we've come to rely so much for storing/processing/slicing & dicing data, are not really suited for every job. We don't really have to create strict relational schemas. To hell with normalization and disk quotas...we can just store entire objects as-is in a cache. That way we don't have to waste time joining the parent-child relations over and over. Sounds a lot like OODBMS. Let the programmers have the freedom to store the objects optimally instead of a DBA instructing them. Distributed caches also typically come with fail-over capabilities, which makes things even simpler.
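To make the "store entire objects as-is" idea a little more concrete, here is a minimal Java sketch. Everything in it is hypothetical: `Order`, `LineItem` and `OrderCache` are made-up names, and a plain in-process `ConcurrentHashMap` stands in for whatever distributed cache you would really use. It only shows the shape of the idea: the parent and its children go in and come out as one object, with no join.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical child object; Serializable so a real cache could ship it between nodes.
class LineItem implements Serializable {
    final String sku;
    final int quantity;

    LineItem(String sku, int quantity) {
        this.sku = sku;
        this.quantity = quantity;
    }
}

// Hypothetical parent object that carries its children with it.
class Order implements Serializable {
    final String id;
    final List<LineItem> items = new ArrayList<>();

    Order(String id) {
        this.id = id;
    }

    Order add(String sku, int quantity) {
        items.add(new LineItem(sku, quantity));
        return this;
    }

    int totalQuantity() {
        int total = 0;
        for (LineItem item : items) {
            total += item.quantity;
        }
        return total;
    }
}

// Assumption: a local map standing in for a distributed cache keyed by order id.
class OrderCache {
    private final Map<String, Order> store = new ConcurrentHashMap<>();

    void put(Order order) {
        store.put(order.id, order);
    }

    Order get(String id) {
        return store.get(id);
    }
}

class OrderCacheDemo {
    public static void main(String[] args) {
        OrderCache cache = new OrderCache();
        cache.put(new Order("order-1").add("widget", 2).add("gadget", 3));

        // One lookup returns the order together with its line items.
        Order order = cache.get("order-1");
        System.out.println(order.items.size() + " items, total quantity " + order.totalQuantity());
    }
}
```

In a relational schema the same read would be a join between an orders table and a line-items table. Here the whole object graph sits under a single key, which is the trade the paragraph above is describing.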

I'm sure I can keep rambling, but you get the point. I'll keep updating this list. Just reading the blogs of these projects' authors is educative, to say the least.


Ashwin Jayaprakash said...

GridGain seems to be pretty good. I was impressed by the ease of use and deployment. Very slick. I wonder how they handle multiple versions of the same GridTask...

Ashwin Jayaprakash said...

JPPF is another project worth looking at.

Ashwin Jayaprakash said...

Yahoo moved to a giant Hadoop cluster.

Sweet! It's pure Java. Sweeter!

Ashwin Jayaprakash said...

Another one - MemCached Java client. Although it's just a client for MemCached, what I like about this is the Consistent Hashing technique that's been implemented.

Here's another simple implementation of Consistent Hashing: This technique is becoming more and more common.
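For anyone who hasn't seen Consistent Hashing before, here is a rough sketch of the idea in Java. This is not the code from the MemCached client or from any other project mentioned here. The class name `ConsistentHashRing`, the choice of MD5 and the number of virtual nodes are all my own assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a consistent hash ring with virtual nodes.
class ConsistentHashRing {
    // Ring position -> server name, kept sorted by position.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    // Assumption: each server is placed on the ring this many times to even out the load.
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    // Note: this sketch ignores the (very unlikely) case of two servers colliding on a position.
    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // A key belongs to the first server at or after its position, wrapping around at the end.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Uses the first 8 bytes of an MD5 digest as the ring position.
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}

class ConsistentHashDemo {
    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addNode("cache-a");
        ring.addNode("cache-b");
        ring.addNode("cache-c");
        System.out.println("user:42 lives on " + ring.nodeFor("user:42"));

        // After a server leaves, only the keys it owned need a new home.
        ring.removeNode("cache-c");
        System.out.println("user:42 now lives on " + ring.nodeFor("user:42"));
    }
}
```

The point of the ring is what happens when a server joins or leaves: only the keys that sat on that server's positions move, while everything else stays put. With a plain `hash(key) % serverCount` scheme, changing the server count reshuffles most of the keys.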