Sunday, July 25, 2010

Apache Cassandra for first timers

I wanted to get a feel for how Apache Cassandra works, so I downloaded and installed (just copied) the files. I decided to run the single node test. Here's what I did on my Windows 7 laptop:

1) Download and unzip the latest Cassandra zip file to some folder - D:\Dump\apache-cassandra-0.6.3
2) Open a command prompt at the main Cassandra directory and type - bin\cassandra.bat
3) That's it! You have a single node server that is running with the defaults. It creates and starts logging to some default location. In my case it was - D:\var\lib\cassandra\

Now, I wasn't too happy with the default Keyspace configuration - this is the "schema". So, I shut down the server, deleted the log directory and modified the configuration file in conf\storage-conf.xml. I simplified the Keyspace to 2 simple sub-sets - a column family called Message and a super column family called Car.

The more I look at the Cassandra column family structure, the more it reminds me of XML.


Then I started the CLI batch file to punch in some commands. I wasn't expecting this. I was really looking forward to a simple Java client program and there isn't any. So, those HBase guys were not kidding when they said Cassandra does not have a simple client program. You have to use a Thrift client or some other third-party client like Hector. I wasn't too eager to do that, so I just went the command-line way.

It seemed easy enough. It takes a few minutes to understand which one is the key name, the column family name and the super column family name. The advantage is that it's like a hierarchy of SortedMaps, which means that records under different keys do not even have to have the same column names. Notice that there are some slight differences in the columns I've entered, like "Upgrade", "Leather seats" or "AWD", which are not there in the other records. So, there is some flexibility.
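To make the "hierarchy of SortedMaps" idea concrete, here is a rough sketch in plain Python. It is not a Cassandra client and does not talk to a server; it only models the nesting with dictionaries. The row keys, super column names and values are made up for illustration, and only the column names "Upgrade", "Leather seats" and "AWD" come from my CLI session.

```python
# Plain-Python model of the nesting; NOT a Cassandra client.
# Row keys, super column names and values are hypothetical examples.

# Column family "Message": row key -> {column name -> value}
message = {
    "msg1": {"from": "alice", "text": "hello"},
    "msg2": {"from": "bob", "text": "hi", "priority": "high"},  # extra column
}

# Super column family "Car": row key -> {super column -> {column -> value}}
car = {
    "car1": {
        "Engine": {"Cylinders": "4"},
        "Upgrade": {"Leather seats": "yes"},
    },
    "car2": {
        "Engine": {"AWD": "yes", "Cylinders": "6"},
    },
}


def sorted_columns(columns):
    """Return (name, value) pairs ordered by name, as a sorted map would."""
    return sorted(columns.items())


# Each row carries its own set of names; nothing forces them to match.
car1_supers = [name for name, _ in sorted_columns(car["car1"])]
car2_supers = [name for name, _ in sorted_columns(car["car2"])]

for row_key in sorted(car):
    for super_name, columns in sorted_columns(car[row_key]):
        print(row_key, super_name, sorted_columns(columns))
```

The point of the sketch is only that one row can have an "Upgrade" entry while another does not, and that names come back in sorted order at every level.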


Some thought must be given to how efficient the storage is when you intend to store millions of records at the column/super column family/column family level. Search for discussions in the Cassandra-User mailing list. There are lots of such discussions, including ones on which mode is better.

Thoughts:
1) The installation is easy but the lack of a proper client is bothersome
2) CLI looks good for key-value type of queries, but I was really interested in those queries on slices and ranges. I couldn't find anything ready-made
3) HBase and Hive along with Cloudera's Beeswax UI for running SQL-like queries are very compelling. But have a look at the HBase installation. It doesn't look easy. That's why I decided to try Cassandra.
4) This article here is the most succinct comparison of Cassandra and HBase
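
For anyone wondering what I mean by a slice in point 2, here is a small sketch of the idea in plain Python. It is again just an in-memory model and not the Cassandra API: the function name `column_slice`, the inclusive bounds and the sample row are my own assumptions for illustration.

```python
from bisect import bisect_left, bisect_right


def column_slice(columns, start, finish):
    """Return the columns whose names fall in [start, finish], in name order.

    Hypothetical helper: it models a slice over one row's sorted column
    names and assumes both bounds are inclusive.
    """
    names = sorted(columns)
    lo = bisect_left(names, start)    # first name >= start
    hi = bisect_right(names, finish)  # one past the last name <= finish
    return [(name, columns[name]) for name in names[lo:hi]]


# Hypothetical row whose column names are dates, so they sort chronologically.
row = {
    "2010-07-23": "a",
    "2010-07-24": "b",
    "2010-07-25": "c",
    "2010-07-26": "d",
}

result = column_slice(row, "2010-07-24", "2010-07-25")
print(result)
```

Because the names are kept sorted, a slice is just two binary searches and a contiguous read, which is why this kind of query is a natural fit for the data model.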

Until next time!

5 comments:

Ashwin Jayaprakash said...

sodeso.nl/?p=354 has a good series of articles.

Ashwin Jayaprakash said...

And of course this wtf-is-a-supercolumn-cassandra-data-model

Jonathan Ellis said...

Thanks for the blog post! And those are decent links. There are some even better at the top of http://wiki.apache.org/cassandra/ArticlesAndPresentations under "Recommended."

Ashwin Jayaprakash said...

Cloudkick's Roll up data.

Nicolas Janin said...

For me, the structure is not very different from tables.

Here is how I see it:
Cassandra columns = SQL fields
Cass supercolumns = SQL rows
Cass ColumnFamilies = SQL tables
ColumnFamily inside a ColumnFamily = SQL join

The only (BIG) difference is that everything is already ordered in Cassandra.