Thursday, February 19, 2009

Got cache?

In recent years, we've seen a general trend in data handling and management where databases are being relegated to a slightly lower pedestal in the overall scheme of things.

Umm... that was a mouthful, I apologize. What I meant was that even to a casual observer (who doesn't get worked up like you and me when talking about gigabytes of transactional data), it is becoming obvious that networks are getting faster and faster and memory is getting cheaper (cheaper, but not inexpensive), while the darned hard disks are still plodding along at much slower rates. Gigantic distributed systems that were once restricted to college campuses with a lot of PhD people mucking about have slowly made their way into everyday projects. Thankfully, not the PhDs. Google made it look sexy to have a huge number of PCs storing data mostly in RAM, forming a "data cloud" (No, I certainly did not invent that term).

Some large banks have been caching their data in modest-sized distributed caches. Most of them have been doing it for some time, but those were really clustered app servers with a cache that just happened to be there. But now, we see people designing their systems to be intentionally distributed. Web 2.0 startups are another prime example. Some of them have written their own distributed caches and clusters, gunning for Google's market share no doubt. All this sudden interest has resulted in some wonderful open source projects.

To prove my point, here's the list I'm talking about: (Some of them are new, some have been around for a while)

Apache Hadoop: http://hadoop.apache.org/core/
Hypertable: http://www.hypertable.org/ (similar to HBase, Hadoop's sister project)
Memcached: http://www.danga.com/memcached/
JBossCache: http://jbosscache.blogspot.com/
Terracotta: http://www.terracotta.org/

As we think about where we are headed in terms of data management, it becomes fairly obvious that databases, on which we've come to rely so much for storing, processing, and slicing and dicing data, are not really suited for every job. We don't really have to create strict relational schemas; to hell with normalization and disk quotas... we can just store entire objects as-is in a cache. That way we don't have to waste time joining the parent-child relations over and over. Sounds a lot like an OODBMS. Let the programmers have the freedom to store the objects optimally instead of having a DBA instruct them. Most distributed caches also come with fail-over capabilities, which makes things even simpler.
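
To make the "store entire objects as-is" idea a bit more concrete, here is a rough Python sketch. Everything in it is made up for illustration: ObjectCache is a toy in-process dictionary standing in for a real distributed cache (it is not the API of Memcached, JBossCache or anything else on the list above), and Order and LineItem are hypothetical classes. The only point is that the parent and its children go in and come out under one key, with no join in sight.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class LineItem:
    sku: str
    quantity: int


@dataclass
class Order:
    order_id: int
    customer: str
    items: list = field(default_factory=list)  # child objects live inside the parent


class ObjectCache:
    """Toy in-process stand-in for a distributed cache (hypothetical, not a real client)."""

    def __init__(self):
        self._store = {}

    def put(self, key, obj):
        # Deep copy so the cache holds its own snapshot, roughly as a
        # networked cache would after serializing the object.
        self._store[key] = copy.deepcopy(obj)

    def get(self, key):
        obj = self._store.get(key)
        # Hand back a copy, or None on a cache miss.
        return copy.deepcopy(obj) if obj is not None else None


cache = ObjectCache()
order = Order(42, "alice", [LineItem("widget", 2), LineItem("gadget", 1)])
cache.put("order:42", order)

# One lookup returns the parent together with its children; no join needed.
fetched = cache.get("order:42")
total_quantity = sum(item.quantity for item in fetched.items)
```

With a relational schema, the same read would typically mean joining an orders table against a line-items table on every request. The price you pay here is that the cache knows nothing about the object's insides, so anything like "find all orders containing widgets" is left to the application.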

I'm sure I can keep rambling, but you get the point. I'll keep updating this list. Just reading the blogs of these projects' authors is educative, to say the least.