Wednesday, December 29, 2010

Proximity search using SQLite's FTS feature

A few months ago I was playing with SQLite's Full Text Search feature. I was especially interested in the Match-Near-Term operator, which allows you to search for a bunch of terms that are within 'm' words of each other. Lucene also has this feature (obviously), called SpanQuery. This is called Proximity search, if you didn't already know.
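To make the idea concrete, here is a minimal plain-Java sketch of what a proximity match checks. It is not how SQLite or Lucene do it internally (both work off positional indexes), and the class and method names are made up for illustration. I am also assuming that "within m words" means at most m intervening words, which is how I read the SQLite NEAR/m documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ProximitySketch {
    // 0-based word positions at which the term occurs.
    static List<Integer> positions(String[] words, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(term)) {
                out.add(i);
            }
        }
        return out;
    }

    // True if some occurrence of a and some occurrence of b have at most
    // maxGap words between them, in either order.
    static boolean near(String text, String a, String b, int maxGap) {
        String[] words = text.toLowerCase(Locale.ROOT).split("\\W+");
        List<Integer> pa = positions(words, a.toLowerCase(Locale.ROOT));
        List<Integer> pb = positions(words, b.toLowerCase(Locale.ROOT));
        for (int i : pa) {
            for (int j : pb) {
                if (i != j && Math.abs(i - j) - 1 <= maxGap) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

The nested loop is quadratic in the number of occurrences, which is fine for a toy but is exactly the kind of thing a real index avoids.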

This kind of search has its limitations - so does SQLite, especially performance problems for large data sets. I chose SQLite with the SQLite-JDBC driver because of its simplicity of setup and SQL interface (duh!). I created the FTS table in an in-memory database and tried some simple queries. It's not too bad. I'll just file it for later.

Here's the code. I just create 2 streams of stock ticks (all contrived, just like the rest of the code) and try to search for patterns in the 2 series. It does not exactly do what I wanted it to, but it was fun to play with the concept.

Friday, December 24, 2010

Enterprise technology landscape from 10,000 ft above - circa 2010

[Updated: Dec 25, 2010]

My personal view of where some technologies stand, today:

Happy holidays and a happy new year!

Wednesday, December 22, 2010

If you are going to lose your socks, make sure that you lose them in pairs.

Sunday, December 12, 2010

What they did not teach in OOP/OOAD class

For over a decade the Gang Of Four Design Patterns, or the GoF patterns as they are fondly referred to, have been one of the favorite topics in job interviews. And for good reason. They have even led to the rise in popularity and acceptance of spin offs like - J2EE Blueprints, Enterprise Integration Patterns and even Anti-patterns.

Would it be too much to ask that these concepts be taught in school today, in the final semester before sending graduates off into the real world?

However, knowledge of the GoF patterns is not sufficient to build elegant systems. In fact it must be said that an over-reliance and blind following of the GoF patterns quite often leads to bloated and over-engineered systems. Case in point - Spring superseded a bloated J2EE. Google Guice and JEE CDI are in turn attempts to improve upon Spring which itself has gained a lot of weight over the years.

In my experience, I have come to realize that in order to ensure a more complete and proper understanding of the art of design and its application in a complex software system, there are 2 other essential sets of design patterns. The lesser known and underused ones:
    1) GRASP - General Responsibility Assignment Software Patterns
    2) SOLID - Single responsibility, Open-closed, Liskov substitution, Interface segregation and Dependency inversion

Where the GoF patterns and their spin offs explain "how" to build the components; SOLID and GRASP help in understanding "why" those components, packages, abstractions, interfaces and dependencies have to be built and assembled in a certain way.

To drive home the point, sending off graduates with only partial knowledge, i.e. GoF = "How" but without SOLID + GRASP = "Why", would be like teaching automobile engineers how to machine the parts of a car without teaching them how to assemble those parts into a drivable car.

If students are expected to figure out the "why" on their own at their first jobs, they will unwittingly design and build Rube Goldberg type of software, afflicting the source code with factories of factories, bad adapters that don't do anything, redundant interfaces, singletons that resist unit testing and the list goes on... until they learn from experience (if at all). For it takes years of experience and guidance to gain a holistic view of complex systems. Systems thinking also helps a great deal.

While we are discussing the subject of good design there is another important aspect that is even less understood - API design. There is a remedy for that too. Joshua Bloch's - How to Design a Good API and Why it Matters.

In conclusion, I'd like to list some famous quotations and rules of thumb that help me when I'm designing a particularly tricky system:
   - Simple things should be simple, complex things should be possible (Alan Kay)
   - Simplicity before generality, use before reuse (97 Things ..)
   - Ask yourself if a feature or its design is necessary and sufficient. Anything more is a waste. Anything less means the job is not complete

Until next time!

Tuesday, December 07, 2010

LinkedIn's Kafka messaging project

Kudos to the LinkedIn team for making another highly focused and elegant project available as open source - Kafka. In spite of its name it is anything but Kafkaesque.

Kafka seems to be a serious attempt to address the messaging problem by starting from first principles. I have not played with the project yet, but from just reading the design doc it looks like a well thought out design.

I have written about the scalability limits of push-systems that are somewhat common to JMS implementations - here about polling from NoSql instead of push, a little here about JMS spec needing an upgrade and vaguely here when talking about alternatives to 2 phase transactions.

The alternative systems like Flume, Scribe, Hedwig, Chukwa and such are too focused on log file collection, whereas Kafka looks more like a regular messaging system with a clean polling mechanism. Explicit polling with good storage automatically solves many of the problems that I had written about here like retries, slow consumers/flow control and durable subscriptions. I'm particularly glad to see that they've read the Varnish article on OS disk caching which Redis seems to have somewhat muddled up (comment #29). Funny, zero-copy was something I was exploring just a few weeks ago with Netty.

I don't however foresee any enterprise projects switching to Kafka immediately. Its performance and cost of license (ASL) might not be enough to motivate people to try it out. The strangely simplistic yet clever design does require some careful reading of the docs and understanding of the APIs. Hopefully it will gain a wider user base, unlike their other nice project Voldemort - another simple and elegant project.

Also be sure to have a look at their new disk store - Krati. I'm even more glad to see that all these projects are in Java (actually Scala).

Until next time!

Saturday, December 04, 2010

Clever Enum tricks and some things to read over the weekend

Here's a Java goody - changing the default maximum compile errors reported by Javac:

Clever things that you can do with Java Enums:
   Create a hierarchy -
   Make it implement an interface -
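The links above are not reproduced here, so this is my own minimal sketch of both ideas and not necessarily what the linked articles did. The `Describable` interface and the `Instrument` constants are hypothetical names. The hierarchy is modeled by having each constant point at its parent, which works because a constant may refer to constants declared before it.

```java
interface Describable {
    String describe();
}

enum Instrument implements Describable {
    SECURITY(null),
    EQUITY(SECURITY),
    BOND(SECURITY),
    PREFERRED_STOCK(EQUITY);

    private final Instrument parent;

    Instrument(Instrument parent) {
        this.parent = parent;
    }

    // Walks up the parent chain; every constant "is a" itself.
    boolean isA(Instrument ancestor) {
        for (Instrument t = this; t != null; t = t.parent) {
            if (t == ancestor) {
                return true;
            }
        }
        return false;
    }

    // The interface method, implemented once for all constants.
    @Override
    public String describe() {
        return parent == null ? name() : parent.describe() + " > " + name();
    }
}
```

With this, `Instrument.PREFERRED_STOCK.isA(Instrument.SECURITY)` is true, and any code written against `Describable` can accept an enum constant.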

And some bizarre things to do like - converting Java to native:

What would happen if you tried to bypass the Sql engine in MySql? You'd get 750,000 reads per second!

Some Hadoop and cloud related presentations worth reading:

Sunday, November 21, 2010

Scalable compute & storage frameworks - A Refcard (in progress)

If you have been closely following the NoSql space or even shown a mild interest in scalable technologies such as Compute Grids, Data Grids, Distributed Caches or the countless other terms that people use interchangeably - you have probably realized that most Architects do not have the time or the resources to investigate, sift through the noise and decide on what to use.

Since I've had some experience using one such framework and also because I follow the progress of some others, I thought it would be helpful to everyone if I put together some information.

Please share and contribute information. Spread the word. Your efforts will be acknowledged. Ask for permission to work on the Wiki and Spreadsheet.
What I have done is create a Google Code project: scalable-frameworks, where I hope I can spare the time to keep it updated and enlist some help from the community at large to gather correct information.
  • The intention is for it to serve as a ready reckoner and not be complete or authoritative
  • Performance is a criterion that has consciously been excluded from the lists here to avoid flame wars
  • For the full information it would be best to visit the actual product's/project's website
  • It is not official, nor has it been prepared by thorough research
  • If you have questions or would like to clarify/contribute, please get in touch
To start with, there are 2 parts:
  1. A very simple introduction with images describing the basic concepts
  2. A spreadsheet that is meant to serve as a ready reckoner - to help you choose the right framework/platform
    • It has some basic features listed
    • Pay attention to the features that you would find most useful and pick the project that has most/all the ones you are looking for

Basic concepts:
To help understand the basic processing and storage idioms being explored, here are a few images:

"Store and retrieve" (Scatter-Gather) on a cluster of compute + storage nodes:

"Store, notify changes, apply changes" (Scatter-Relay-Compute) on a cluster of compute + storage nodes:

"Store, notify changes, calculate, notify new calculation result" (Scatter-Relay-ComputeAlert) on a cluster of compute + storage nodes: 

(Full Spreadsheet)

Until next time!

Wednesday, November 17, 2010

Variety is the spice of (the Architect's) life

A mind map of the various aspects of software - from an architect's point of view:

Saturday, November 13, 2010

Hiking in Memorial Park (Off CA 84)

I decided to go hiking in one of the parks off CA-84 (La Honda). It's a longer drive from I-280 and goes closer to the coast. Driving on La Honda is also quite fun if the weather is good. I had hiked in the same area a few years ago.

I went to Memorial Park, paid the $5 registration fee and went wandering around the empty camp grounds. I could not find a map anywhere (unlike the other parks) so I just went down to the creek from Tan oak camp grounds and walked along the creek, against the flow. After about a 1/4 mile, you reach a bridge which is actually part of Pescadero Creek Road. There are many camp grounds here. I crossed the bridge and wandered around in the Wurr campgrounds. Still no sign of a trail.

So, I came back to the road and then saw the entrance to Pescadero Creek County Park - Hoffman Creek Trailhead. I went in and kept going. There's a map here but Pomponio trail appears to be in both parks. It's confusing.

After about 10-15 minutes, there's a junction where Old Haul Road (the one you will be on) makes a left. I turned left and kept walking. Then you see a sign that says Pomponio Trail. Make a left here and cross the creek. There is no bridge. If it has rained, I don't know if it's safe to cross the 10 ft span of ankle deep water. There are rocks to step on to help cross the creek.

Pomponio trail looks like an out-and-back trail. There is no loop. I went in for a while and then headed back the same way.

Overall, it's a nice place. No crowd. I was worried that I did not see any fellow hikers until I saw just 1 couple on their way back. Camping here would also be fun - very convenient too considering how close it is to 280.

Until next time!

Friday, November 12, 2010

If only the world were immune to diseases like it is to logic...

Tuesday, November 09, 2010

JMS spec - Time for an upgrade?

The last JMS specification was written 8 years ago! Since then the world has seen multi-core 64 bit systems, Hadoop, compute grids, data grids, AMQP, ZeroMQ, NoSql, Twitter, Facebook ..... and still JMS 1.1 is the backbone of enterprise systems.

There are a few things however, that are missing sorely from the spec and consequently from any implementation in a "standard" way. Each provider no doubt has an answer to some of its limitations but the emphasis is on the lack of a standard way.

8 years ago, running a whole swarm of machines as a single cluster was rare. But in today's world it is not. It is exactly here that the spec is lacking. I have listed some features below that I would find useful.

Message acknowledgment:

  • Out of order Acks: Acknowledging messages individually and in an order that is different from how they were received. The spec is vague about this. In some systems acknowledging a message will automatically ack all messages that have been received so far in that session! TIBCO EMS has Explicit Client Acknowledgement that is non-standard but is certainly very useful
  • Negative Acks: In some systems, if a message causes an exception at the receiving end, that message will not be redelivered by the server until the original client session disconnects. It would've been nice if the spec allowed some kind of a Negative Ack to force the server to redeliver the message to the same or a different session
  • Batch Acks: Since JMS is used in many setups to correlate multiple messages coming from different queues and considering that Ack'ing is expensive at high message rates; JMS should have a facility to accept a batch acknowledgment of many JMSMessageIds. Similar to JDBC batch statements
  • Disconnected & Federated Acks: In complex systems where messages can flow through multiple tiers of application servers either as the original JMS message or as an enriched DTO/VO, allowing the final tier in the flow to acknowledge the message purely by JMSMessageId would be very useful. Currently, only the client that received the message from the server can acknowledge it. Also the message has to be held in memory for the duration. In SEDA systems this does not work well and forces the developers to jump through unnecessary design hoops
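To show what I mean by the four kinds of acknowledgment above, here is a toy in-memory sketch. Nothing in it is part of the JMS API or of any provider: the class `AckWishlistSketch` and all of its methods are hypothetical, and a real broker would obviously need persistence, sessions and timeouts. The point is only that once acks are keyed purely by message id, out-of-order, negative, batch and disconnected acks all fall out naturally.

```java
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

class AckWishlistSketch {
    // Delivered but not yet acknowledged: message id -> payload.
    private final Map<String, String> inFlight = new LinkedHashMap<>();
    // Negatively acknowledged messages waiting to be redelivered.
    private final Deque<Map.Entry<String, String>> redelivery = new ArrayDeque<>();

    void deliver(String messageId, String payload) {
        inFlight.put(messageId, payload);
    }

    // Out-of-order / disconnected ack: only this id is acknowledged and
    // the caller needs nothing but the id.
    boolean ack(String messageId) {
        return inFlight.remove(messageId) != null;
    }

    // Batch ack: returns how many of the ids were actually in flight.
    int ackAll(Collection<String> messageIds) {
        int acked = 0;
        for (String id : messageIds) {
            if (ack(id)) {
                acked++;
            }
        }
        return acked;
    }

    // Negative ack: queue the message for redelivery right away.
    boolean nack(String messageId) {
        String payload = inFlight.remove(messageId);
        if (payload == null) {
            return false;
        }
        redelivery.add(new AbstractMap.SimpleEntry<>(messageId, payload));
        return true;
    }

    // Returns the id of the next redelivered message, or null if none.
    String pollRedelivery() {
        Map.Entry<String, String> next = redelivery.poll();
        if (next == null) {
            return null;
        }
        inFlight.put(next.getKey(), next.getValue());
        return next.getKey();
    }

    int inFlightCount() {
        return inFlight.size();
    }
}
```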
Message selectors:
  • Smart selectors: JMS Selectors are very static and expensive to apply at high message rates. They were meant to work as a rudimentary form of Content based routing. What is really needed is a way to let the client run custom logic (like RMI) through the queue and pick what it needs. This way clients should be able to use a modified form of QueueBrowser and consume anything they find interesting
  • Context aware routing: ActiveMQ has a non-standard feature called Message Group that can be used to perform smart message routing using a custom header. This would be a welcome feature where data aware routing would provide a huge performance improvement by exploiting data locality to deliver related messages to the same application server and thereby avoiding data/cache thrashing
Also, read this for a different take on the future of messaging.

Until next time, cheers!

Wednesday, October 27, 2010

"Flexible" design (Willis Tower)

Inside Chicago's Willis/Sears Tower - creaking sounds (mp3) when the tall building sways in the wind. The swaying is by design - to help it cope with the structural stresses it encounters at the top, from strong winds in stormy weather.

A little short of half a kilometer in height and one of the tallest buildings in the world. Now, that is extreme engineering!

Saturday, October 23, 2010

Even Star Trek technology has memory fragmentation issues

Memory fragmentation problems? Well, even Star Trek holograms have similar issues (1) (2). This is why you need a Compacting Garbage Collector :)

Thursday, October 21, 2010

select 'Whoa..' || 'Cool!' from mozilla_history_storage

SQLite is a brilliant embedded SQL engine and it is used by software that you and I use everyday - Firefox, Android and many more. See earlier reference.
What you might not know is how extensively it is used in Firefox. History, bookmarks all use SQLite. The kind of queries that are used would put many Enterprise applications to shame. I was just curious to see what was under the hood and decided to hunt down the History and Bookmarks source code.

 "SELECT v.visit_date, COALESCE( "
   "(SELECT r.visit_type FROM moz_historyvisits_temp r "
     "WHERE v.visit_type IN ") +
       nsPrintfCString("(%d,%d) ", TRANSITION_REDIRECT_PERMANENT,
                                   TRANSITION_REDIRECT_TEMPORARY) +
       NS_LITERAL_CSTRING(" AND = v.from_visit), "
   "(SELECT r.visit_type FROM moz_historyvisits r "
     "WHERE v.visit_type IN ") +
       nsPrintfCString("(%d,%d) ", TRANSITION_REDIRECT_PERMANENT,
                                   TRANSITION_REDIRECT_TEMPORARY) +
       NS_LITERAL_CSTRING(" AND = v.from_visit), "
   "visit_type) "
 "FROM moz_historyvisits_temp v "
 "WHERE v.place_id = :page_id "
 "SELECT v.visit_date, COALESCE( "
   "(SELECT r.visit_type FROM moz_historyvisits_temp r "
     "WHERE v.visit_type IN ") +
       nsPrintfCString("(%d,%d) ", TRANSITION_REDIRECT_PERMANENT,
                                   TRANSITION_REDIRECT_TEMPORARY) +
       NS_LITERAL_CSTRING(" AND = v.from_visit), "
   "(SELECT r.visit_type FROM moz_historyvisits r "
     "WHERE v.visit_type IN ") +
       nsPrintfCString("(%d,%d) ", TRANSITION_REDIRECT_PERMANENT,
                                   TRANSITION_REDIRECT_TEMPORARY) +
       NS_LITERAL_CSTRING(" AND = v.from_visit), "
   "visit_type) "
 "FROM moz_historyvisits v "
 "WHERE v.place_id = :page_id "
   "AND NOT IN (SELECT id FROM moz_historyvisits_temp) "
 "ORDER BY visit_date DESC LIMIT ") +

Now...see this. Imagine doing this without SQL? I am however, a little disappointed. I was expecting the code to be using some fancy NGram search like what Lucene does, but it doesn't look like it.
        :localhost, :localhost, null, null, null, null, null, null, null 
   FROM moz_places h 
   JOIN moz_historyvisits v ON v.place_id = 
   WHERE h.hidden <> 1 AND h.rev_host = '.' 
     AND h.visit_count > 0 
     AND h.url BETWEEN 'file://' AND 'file:/~' 
   FROM moz_places_temp h 
   JOIN moz_historyvisits v ON v.place_id = 
   WHERE h.hidden <> 1 AND h.rev_host = '.' 
     AND h.visit_count > 0 
     AND h.url BETWEEN 'file://' AND 'file:/~' 
   FROM moz_places h 
   JOIN moz_historyvisits_temp v ON v.place_id = 
   WHERE h.hidden <> 1 AND h.rev_host = '.' 
     AND h.visit_count > 0 
     AND h.url BETWEEN 'file://' AND 'file:/~' 
   FROM moz_places_temp h 
   JOIN moz_historyvisits_temp v ON v.place_id = 
   WHERE h.hidden <> 1 AND h.rev_host = '.' 
     AND h.visit_count > 0 
     AND h.url BETWEEN 'file://' AND 'file:/~' 
        host, host, null, null, null, null, null, null, null 
 FROM ( 
   SELECT DISTINCT get_unreversed_host(rev_host) AS host 
   FROM moz_places h 
   JOIN moz_historyvisits v ON v.place_id = 
   WHERE h.rev_host <> '.' 
     AND h.visit_count > 0 
   SELECT DISTINCT get_unreversed_host(rev_host) AS host 
   FROM moz_places_temp h 
   JOIN moz_historyvisits v ON v.place_id = 
   WHERE h.rev_host <> '.' 
     AND h.visit_count > 0 
   SELECT DISTINCT get_unreversed_host(rev_host) AS host 
   FROM moz_places h 
   JOIN moz_historyvisits_temp v ON v.place_id = 
   WHERE h.rev_host <> '.' 
     AND h.visit_count > 0 
   SELECT DISTINCT get_unreversed_host(rev_host) AS host 
   FROM moz_places_temp h 
   JOIN moz_historyvisits_temp v ON v.place_id =         
   WHERE h.rev_host <> '.' 
     AND h.visit_count > 0 

There's plenty more where this came from:
  - mozilla-central/source/toolkit/components/places/src/nsNavHistory.cpp

If you wish to use SQLite from your Java program, there are some drivers available. SQLite also supports a pure memory mode, much like my other favorite database H2.

Sunday, October 10, 2010

Time series, disk swapping, shazam patents and other stories

Here's a collection of useful Cassandra and HBase articles I've come across in the past few months:

Time series storage in the big 2 NoSQL systems - Cassandra and HBase:

Apache Hive vs Pig:

Cassandra GC and swapping:

Geohashing sounded like an ingenious concept. Here's something built on Cassandra:


Well, yeah that's a lot of NoSQL articles to read. Here's a hilarious video against NoSQL to balance it. (Warning: Watch for foul language. For a more civilized roast, see this)

Patent trouble . Here's the story of a smart guy who wrote a music recognizer over a weekend and got into some trouble with Patent lawyers.

After a long time, I found a nice JUnit presentation that made me reconsider my decision to switch to TestNG from JUnit.

Some Linux fun - swapping OS pages and opening 500K sockets (Also see above for what Cassandra did to prevent swapping):

Until next time!

Wednesday, October 06, 2010

Red Munia or Red Avadavat

Red Munia or Red Avadavat, originally uploaded by SRJP.

Pic taken by my dad - Dr. Jayaprakash.

Sunday, September 26, 2010

Txn.commit() - Are you sure?

[+ indicates updated on Sep 27, 2010]

Transactions - do we need them and are people really using them like they claim to?

We know that transactions are theoretically the best way to keep data consistent, but it might not always be the most practical way to do it.

There could be a variety of reasons:
 - Reduced performance after using transactions
 - Lack of proper XA support across all the participating resources
    - "Last resource commit"/XA emulation can leave some edge cases in a mess
    - There could be more than 1 resource that does not support XA. In such cases emulation will not work
 - There could be a need for nested transactions which are not widely supported
 - The transaction manager might not have proper support for repair/recovery of heuristic hazards
 - Multi-step transactions that need savepoints, with no proper support or semantics for restoring them
    - Transactions that might be too expensive to retry from the beginning
    - If the client program crashes, then having a new client continue the transaction might not be feasible
    - Multi-page, lengthy UI forms that need disconnected data sets
 - Impractical for long running transactions and so on..

 Many others have written about it. I'd rather refer to their notes than write my own from scratch:
 - Starbucks Does Not Use Two-Phase Commit
 - ACID Transactions Are Overrated
 - Computer says no
 - Transactions - Overused Or Just Misunderstood (Mark Little)
Remember - if Transactions work for you and all your systems support it, then go for it.

Having said that, there still are many systems where data flows across large applications; where a simpler, resilient and more predictable compensating mechanism is suitable. Simpler it may be, but designing such systems requires a lot of foresight and expertise:
  - Optimistic concurrency based on version numbers
  - Atomic compare-and-swap upsert/update operations
  - Polite spin locks and backoff-retry mechanisms
  - Clear error reporting
  - State capture, repair and consistency checking
  - Operation logging, undo and re-apply
  - Proper documentation and involvement of Developer/Architect
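The first three items in the list above fit together nicely, so here is a minimal single-JVM sketch of them: a value tagged with a version number, an atomic compare-and-swap that fails if the version has moved, and a polite retry loop with backoff. The class `VersionedCell` and its methods are my own hypothetical names, and in a real system the cell would be a row or document in a data store and not an `AtomicReference`.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

class VersionedCell<T> {
    // Immutable snapshot: a value plus the version it was written at.
    static final class Versioned<V> {
        final long version;
        final V value;

        Versioned(long version, V value) {
            this.version = version;
            this.value = value;
        }
    }

    private final AtomicReference<Versioned<T>> ref;

    VersionedCell(T initial) {
        ref = new AtomicReference<>(new Versioned<>(0L, initial));
    }

    Versioned<T> read() {
        return ref.get();
    }

    // Succeeds only if nobody has written since 'expected' was read.
    boolean compareAndSwap(Versioned<T> expected, T newValue) {
        return ref.compareAndSet(expected, new Versioned<>(expected.version + 1, newValue));
    }

    // Optimistic update: read, compute, try to swap, back off and retry.
    T update(UnaryOperator<T> fn, int maxAttempts) {
        long sleepMillis = 1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Versioned<T> current = ref.get();
            T next = fn.apply(current.value);
            if (compareAndSwap(current, next)) {
                return next;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while backing off", e);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
        // Clear error reporting: give up loudly instead of spinning forever.
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

A writer holding a stale snapshot simply gets `false` back from `compareAndSwap` and has to re-read, which is the whole compensating mechanism in miniature.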

For much larger systems like Amazon, LinkedIn and the like, availability is as important as consistency. See earlier references - #1, #2, #3 and #4.

Some interesting notes on Transactions that I keep referring to every now and then:
  - XA Exposed, Part III: The Implementor's Notebook
  - Distributed Transactions and Two-Phase Commit

Saturday, September 25, 2010

Hiking in Upper Stevens Creek County Park

Upper Stevens Creek County Park is just opposite Long Ridge Open Space Preserve on CA 35. You have to go down Grizzly Flat trail and then back up. There are 2 trail heads next to each other - The North and the South legs. You can go down one and come back up the other.

The trail goes about 1.1 miles down where the North and South legs meet. You can go further down to Grizzly flats junction where it meets Canyon trail. There is a nice stream at the bottom. You can cross it and go further down towards Page Mill. I however, turned around and came back up.

This is not an easy trail. It is about as steep as Windy Hill Open Space Preserve, probably a little less. But the whole trail is always in the shade and is very quiet and pleasant.

Saturday, September 18, 2010

Whither ORM?

If you are scratching your head wondering why even after so many years there is still a confusing mix of ORM solutions, then you are not alone.

JPA/JDO/Hibernate/Spring/Mybatis/Cayenne - which one and why?

And if you are planning to write something on your own when there already are so many, then you should probably go read about NIH.

I'm actually amazed that even after the relative acceptance of NoSQL we are still struggling to standardize on ORM for SQL/RDBMS.

It's almost like a religious debate:

Good old Apache itself has 2 solid implementations of JPA and JDO. Both seem very mature and very well documented:

Apache also has some offbeat/non-standard implementations. Some dead, some doing well:
 - - formerly Apache iBatis

Reading a busy Twitter stream with @s and #s is as hard as parsing unformatted XML with your eyes - correction, with 1 eye closed.

Books I read in the last few months

Blindsight by Peter Watts: I thoroughly enjoyed reading this book. Although the ending was a bit of a let down, the amount of research that has gone into writing this book is impressive. It has a very refreshing combination of bio-chemistry, human vision, psychology and AI.

Galactic North by Alastair Reynolds: A collection of short stories. Generally, I try to stay away from short stories because I feel the characters do not have time to develop and neither does the story. This one however has a continuous feel across stories and is worth reading if you liked Revelation Space.

Eifelheim by Michael Flynn: Another Hugo nominee (I think). Not too bad if you'd rather have the story wander off into a medieval village setting during the time of the Black Death. Certainly not in the same league as the Sci-Fi masters.

Liar's Poker by Michael Lewis: This is not sci-fi at all. It's a 20 year old book about an Investment bank - the infamous Salomon Brothers. They say history repeats itself. Just replace Salomon Brothers with Lehman Brothers and add a generous measure of greed and short sightedness. This is a very funny book considering what the book is all about. Well worth the read.

Spin by Robert Charles Wilson: Here's one I tried reading but just couldn't get myself to finish it. For a Hugo award nominee this was a disastrous read.

Sunday, September 12, 2010

Hiking in Skyline Ridge Open Space Preserve

Skyline Ridge Open Space Preserve is right next door to Russian Ridge off CA 35. This is perhaps one of the nicest hikes for beginners. Gentle slopes, good combination of shade and open trails, great view just a mile into the hike, a park bench and a pond (actually 2) at the end of the 1.5 mile Ipiwa trail.

Start from the Skyline parking lot and follow the Ipiwa trail, cross Old Page Mill road and you will reach Alpine Pond. There are many shorter walkways you can use to spend time around the pond. Then head back the same way to Skyline parking.

Sunday, August 15, 2010

If life is a journey, shouldn't we be traveling light?

Friday, August 13, 2010

The blurring line between Messaging and Storage

Alternate Google Docs link: The Blurring Line

Sunday, August 01, 2010

Hiking in Windy Hill Open Space Preserve

This time unlike the last, I started from the CA 35 entrance, and hiked down first and then back up. 1.8 miles of pretty steep incline each way. Best avoided if you haven't hiked in a while. The view of the entire Bay Area is unbeatable and worth the climb.

Saturday, July 31, 2010

Carbonado, Graphs, FUSE, Merge-Join and assorted stuff

This month in tech... before that, here's a testimonial for an open source project that you can't beat:

You could've been rich - My mother

Moving on, I heard about a persistence API called Carbonado on the Voldemort forums. It's an open source project from the Amazon guys. It's a no frills (read clean and simple) layer that works with Berkeley DB and JDBC. It's even blessed by the BDB guys as a nicer layer on top of BDB.

Here's a decent presentation on graph algorithms from the Hadoop summit. Not very detailed, more like best practices and hints. And here's a nice illustration of PageRank using Javascript.

An interesting thread going on between the Hotspot GC team and a HBase engineer facing some GC problems. Have a look at the new generation sizes they've used for some deployments, it was new to me.

If you want to OD on JVM options, there's a list for that too.

Some folks playing with the userland filesystem in Unix - FUSE. Voldemort, Github and all sorts of funny stuff as Filesystems. Reminds me of GDrive.

NoSQL systems are notorious for not being able to do simple Joins. Their answer is Map-Reduce. For running multi-attribute filtering, there's Merge-Join. Google's App Engine which is like a poor man's data store suggests the same (slide 30). I am skeptical of such queries that run on a cluster of machines, without any indexes, burning CPU on all machines, moving data back and forth. Can't imagine what it does to latency.
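For anyone who has not seen a Merge-Join, here is a minimal sketch of the core step. It assumes that each attribute filter has already produced its matching row keys in sorted order (that is my assumption for the example, not something the slides spell out this way), and the class and method names are made up. Rows that satisfy both filters are the keys that appear in both lists.

```java
import java.util.ArrayList;
import java.util.List;

class MergeJoinSketch {
    // Intersects two ascending lists of row keys in a single pass.
    static List<Long> intersectSorted(List<Long> a, List<Long> b) {
        List<Long> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Long.compare(a.get(i), b.get(j));
            if (cmp == 0) {
                // Key matches both filters.
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // a is behind, advance it
            } else {
                j++; // b is behind, advance it
            }
        }
        return out;
    }
}
```

On one machine this is a cheap linear pass. My skepticism above is about what happens when the two sorted lists live on different nodes and have to be shipped around first.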

Tuesday, July 27, 2010

Overheard in a coffee shop: "A few yrs ago I was young and restless. Now I'm just old and breathless most of the time".

Sunday, July 25, 2010

Apache Cassandra for first timers

I wanted to get a feel for how Apache Cassandra works, so I downloaded and installed (just copied) the files. I decided to run the single node test. Here's what I did on my Windows 7 laptop:

1) Download and unzip the latest Cassandra zip file to some folder - D:\Dump\apache-cassandra-0.6.3
2) Open a command prompt at the main Cassandra directory and type - bin\cassandra.bat
3) That's it! You have a single node server that is running with the defaults. It creates and starts logging to some default location. In my case it was - D:\var\lib\cassandra\

Now, I wasn't too happy with the default Keyspace configuration - this is the "schema". So, I shut down the server, deleted the log directory and modified the configuration file in conf\storage-conf.xml. I simplified the Keyspace to 2 simple sub-sets - a column family called Message and a super column family called Car.

The more I look at the Cassandra column family structure, the more it reminds me of XML.

Then I started the CLI batch file to punch in some commands. I wasn't expecting this. I was really looking forward to a simple Java client program and there isn't any. So, those HBase guys were not kidding when they said Cassandra does not have a simple client program. You have to use a Thrift client or some other third party client like Hector. I wasn't too eager to do that so I just went the command line way.

It seemed easy enough. It takes a few minutes to understand which one is the key name, the column family name and the super column family name. The advantage is that it's like a hierarchy of SortedMaps. Which means that the keys across records do not even have to have the same column names. Notice that there are some slight differences in the columns I've entered like - "Upgrade", "Leather seats" or "AWD" which are not there in the other records. So, there is some flexibility.
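The "hierarchy of SortedMaps" picture can be sketched in a few lines of plain Java. This is only a mental model of a super column family and has nothing to do with Cassandra's actual storage or client API. The class name, the row keys and the "Options" super column are invented for the example; the column names are the ones mentioned above.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

class SuperColumnFamilySketch {
    // row key -> super column name -> column name -> value
    private final SortedMap<String, SortedMap<String, SortedMap<String, String>>> rows =
            new TreeMap<>();

    void put(String rowKey, String superColumn, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(superColumn, k -> new TreeMap<>())
            .put(column, value);
    }

    // Columns of one super column, sorted by name; empty if absent.
    SortedMap<String, String> columns(String rowKey, String superColumn) {
        SortedMap<String, SortedMap<String, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(superColumn)) {
            return Collections.emptySortedMap();
        }
        return Collections.unmodifiableSortedMap(row.get(superColumn));
    }
}
```

Because every level is just a map, one row can carry an "AWD" column while another carries "Leather seats", and nothing forces them to agree on a schema.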

Some thought must be given to how efficient the storage is when you intend to store millions of records at the column/super column family/column family level. Search the Cassandra-User mailing list; there are lots of discussions on this and on which mode is better.

1) The installation is easy but the lack of a proper client is bothersome
2) CLI looks good for key-value type of queries, but I was really interested in those queries on slices and ranges. I couldn't find anything ready made
3) HBase and Hive along with Cloudera's Beeswax UI for running SQL-like queries are very compelling. But, have a look at the HBase installation. It doesn't look easy. That's why I decided to try Cassandra.
4) This article here, is the most succinct comparison of Cassandra and HBase

Until next time!

Tuesday, July 20, 2010

/dev/null on Windows

I was trying to run Apache ZooKeeper on Windows the other day. Getting it to run was super easy. I was more interested in running it without any file/snapshot logging.

I did ask around in the forums and I thought pointing the "dataDir" directory to "/dev/null" would solve the problem. But being a (ahem) Windows user I couldn't quite get the "/dev/null" to work. In Windows the equivalent is "nul" but it doesn't quite work when you try to use it from Java. Some operations work, but some don't.

[Update 1:
Strangely getAbsolutePath() prepends the current directory's path but the file does not get created.
Creating nul:\abc.log throws an exception. But nul:abc.log does not and Java says its absolute path is d:\dump\nul:abc.log but the file is not there. Which means that it is indeed writing to the "null" device. I wonder what I'm missing.]
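For anyone who wants to poke at this, a minimal probe might look like the sketch below. It is not the program linked further down; the class and method names are made up. It assumes that opening "NUL" for writing succeeds on Windows and "/dev/null" does elsewhere, which is exactly the kind of thing that may or may not hold on your machine.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical probe class; names are illustrative only.
class NullDeviceProbe {
    // Picks the platform's null device name: "NUL" on Windows, "/dev/null" elsewhere.
    static File nullDevice() {
        boolean windows = System.getProperty("os.name", "").toLowerCase().startsWith("windows");
        return new File(windows ? "NUL" : "/dev/null");
    }

    // Writes the bytes to the null device and returns how many were handed over.
    static int discard(byte[] data) {
        try (FileOutputStream out = new FileOutputStream(nullDevice())) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return data.length;
    }

    public static void main(String[] args) {
        File dev = nullDevice();
        System.out.println("path     = " + dev.getPath());
        System.out.println("absolute = " + dev.getAbsolutePath());
        System.out.println("discarded " + discard("hello".getBytes()) + " bytes");
    }
}
```

Printing getAbsolutePath() next to getPath() is a quick way to see the current directory being prepended, as described in the update above.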

As usual, the full code is here:

Here's the output and it shows what works and what doesn't:

Sunday, July 18, 2010

Years before Inception (movie)

Here's a short list of TV episodes where similar concepts of dream manipulation and recursive realities were explored decades ago:

The Avengers 1967:
Death's Door (A series of diplomats are drugged and forced to participate in their own nightmares. On waking up, events unfold just like the nightmare - all the way up to the diplomatic event)

Star Trek: The Next Generation: Season 6:
Ship in a Bottle (A sentient hologram of Dr. Moriarty tricks the crew into thinking they are back on the ship after visiting the Holodeck. In actuality they are in a hologram ship inside another hologram.)

Frame of Mind (An officer is trapped and taken prisoner while on a mission. He is then drugged and his dreams are tampered with. He starts thinking that his actual life on the Enterprise was a delusion and is convinced to find closure with his delusional characters by killing them in his mind among other things)

Star Trek: Voyager: Season 2
Projections (The ship's holographic doctor is led to believe that he is not a hologram but the actual hologram designer who has got lost in his own mental simulations on a Star base near Jupiter. He is asked to give up control of the ship to end the simulation)

Similar types of episodes in Mission Impossible Season 1 (1966) and 2.

[Update 1: Also see DreamWithinADream]

Saturday, July 17, 2010

Having 2 chins is better than 1... Not!

Monday, July 12, 2010

It occurs to me that our mind is like a snow globe and the snow is our thoughts and ideas. It is most beautiful when the snow is swirling around the glass.

Sunday, July 11, 2010

A simple (Project) Voldemort test on Windows

Here's a simple Voldemort test program. It's basically the one that ships with the project but with a few small modifications to make it work on Windows/Cygwin.

First off, the server scripts are all Unix shell scripts. So, I had to install Cygwin on my Windows 7 laptop. Then the most essential bin/ script had to be modified a little to work with Cygwin because Java does not recognize the Cygwin mapped Path and Classpaths like /cygdrive/d/Dump/voldemort-0.81. You have to wrap paths with cygpath to map them back to the actual paths:  java -cp $(cygpath -w -p -l -a $CLASSPATH) ...

The simple Java client test program needs dist/voldemort-0.81.jar and dist/voldemort-test-0.81.jar. I had to find out about the second jar from their issues list.

That's it! Now, you can start the single node server and run the client Java program as many times as you like.

All the files are available here: Here are the snippets.


Server script:

Client output:

Monday, July 05, 2010

Bay Area at Sunrise

Waking up at 4.30 in the morning on a holiday to watch the sunrise is not something people even think of doing. Today, I did exactly that.

I was up before dawn and was on my way to Page Mill Road (off CA 280) to watch the sunrise. This being my first time (waking up this early, I mean) I was unsure of where to go. So, I chose my regular hiking spots. I drove up Page Mill before sunrise (5.50 AM) and was well on my way up the hills.

There was no one in sight. No bikers! Yay! I saw quite a few deer grazing next to the road and many hares/rabbits. It was peaceful, cool and smelled like wild flowers. I drove down Page Mill, stopping at a few places to take pictures. At this hour, all the parks are still closed. Your best bet is to drive towards the end of Page Mill, beyond Montebello and even cross the CA 35 intersection. There is one area which leads to private property. You can pull over in front of the gate and take a few snaps quickly. It's a pity there aren't any vista points to park and enjoy the sunrise on Page Mill.

(Move your mouse over the images to see descriptions.)

Sunrise. View from Page Mill Road, a few miles before reaching Montebello preserve

Woodside and areas west of CA 280. Just a few minutes after sunrise

Sun rising. Almost the end of Page Mill Road

Then, when I reached CA 35, I continued down Page Mill for a few miles and then turned around. Went north again on CA 35. Just a mile or so from that intersection, there is a vista point overlooking the Bay Area. This is a nice spot. The sun had already risen by the time I got here.

Sun rise over the Bay Area

It's not the ocean, it's the Bay Area!! Low level clouds

Deja (rear) view? The future looks just like the past

Overall, it was worth it. I was back home by 8 AM! The Bay Area was still cloudy. It was a surprising discovery for me to see that the hills are actually clear of fog and clouds. It's only at sea level that the early mornings are gray and dull.

Back to CA 280. 7.30 AM and still cloudy at sea level

Sunday, July 04, 2010

Weekend at the Zoo(Keeper)

I've written about Apache ZooKeeper before, but I had never actually tried it. Only today did I get a chance to play with it.

The ZooKeeper recipes really piqued my curiosity. So after spending a few hours reading the docs, I decided to give it a try. My interest was purely the performance side of it. ZK makes it very clear in the docs that it excels under read-heavy workloads. And the more replicated servers you add, the better it gets. They were not kidding.

I have my test code here - Keep in mind that this is a simple test, perhaps even a micro benchmark. It does not even have the minimum 3 servers for a quorum. Remember, my tests were run on a new (2010) laptop with 4 hyper threads with some simple Xms/Xmx JVM settings and everything else remaining as is - default, out of the box. This is by no means a representative test. There are official numbers on the ZK wiki with tests run on a real server class machine. You should have a look at those too.

Well, what can I say - it is a little slow. Even writing messages of a few bytes takes a while. Granted, each write in a loop requires a network call. So, if I write 1000 messages, it requires 1000 remote/network calls. The CreateMode.PERSISTENT_SEQUENTIAL is very handy, like the RDBMS autogenerated-id column.

I would've liked a few more batch-oriented calls like getDataForChildren() and createIfAbsent() instead of making 2 calls, first to find out the child names and then to get the actual data. But hey, I'm just trying to shoehorn it into the wrong use case.
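To show what I mean by the two-step read, here is a rough sketch of the getDataForChildren() helper I wished for. To keep it runnable without a server it does not use the ZooKeeper client at all: NodeStore is a hypothetical stand-in with just the two reads, and InMemoryStore is a fake I made up to count calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the two ZooKeeper-style reads; not the real client API.
interface NodeStore {
    List<String> getChildren(String path);
    byte[] getData(String path);
}

class ChildReader {
    // One call for the child names, then one call per child: 1 + N round trips.
    static Map<String, byte[]> getDataForChildren(NodeStore store, String parent) {
        Map<String, byte[]> result = new LinkedHashMap<>();
        for (String child : store.getChildren(parent)) {
            result.put(child, store.getData(parent + "/" + child));
        }
        return result;
    }
}

// In-memory fake so the sketch runs without a server; it also counts "remote" calls.
class InMemoryStore implements NodeStore {
    private final TreeMap<String, byte[]> nodes = new TreeMap<>();
    int calls = 0;

    void create(String path, byte[] data) {
        nodes.put(path, data);
    }

    public List<String> getChildren(String path) {
        calls++;
        List<String> names = new ArrayList<>();
        String prefix = path + "/";
        for (String p : nodes.keySet()) {
            // Direct children only: nothing after the prefix may contain another '/'.
            if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0) {
                names.add(p.substring(prefix.length()));
            }
        }
        return names;
    }

    public byte[] getData(String path) {
        calls++;
        return nodes.get(path);
    }
}
```

With 1000 children the helper still makes 1001 calls underneath, so it only tidies up the calling code. A real batch call would have to live on the server side.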

This is the simple test and the sample console output is further below. You can always get the full code from my Gist repo :

Console output:

Sunday, June 27, 2010

Hiking in Russian Ridge (again)

I went hiking in Russian Ridge (again) yesterday. I had been there just a few weeks ago. This time, I took 2 photos from my cell phone. So, they're not so great. Summer is here, it was hotter and everything has started to dry up. Too many flies on the trails for some reason. The trails here are out in the open, so the place is best avoided during summer.

Yes, that's a blanket of fog on the trees, in the background. They are not clouds. It looks wonderful.

Saturday, June 26, 2010

Funny JavaZone Video

Here's a funny vid to watch over the weekend (it's only a few minutes long). Still strictly Java related, mind you:

Thursday, June 24, 2010

Varnish, Voldemort, Virtualization and other vaguely related stuff

A few interesting articles I read these past 2 weeks:

1) Voldemort presentation at QCon - Project Voldemort at Gilt Groupe. I didn't know that Voldemort was storage independent. Interesting to see the different storage options they tested. Lots of other useful presentations from QCon London 2010.

2) Some fairly detailed and low level notes on systems stuff like swapping and paging from the guys who wrote Varnish page cache - You're Doing It Wrong. The original Varnish Architect notes describe the remaining details.

3) The latency and throughput numbers of Market data - and

4) Azul Systems opens up a part of their awesome runtime - They have a Linux kernel module for better memory management. I'll follow this with interest, being a fan of the awesome Dr. Cliff Click.

5) Some questions I had posted on Virtualization expert blogs about when not to use them. There's no doubt that Sys Admins and IT guys want everything in the data centers to move to a 100% virtualized environment. Apparently it's not just about server consolidation anymore. Companies are willing to sacrifice some performance for easier provisioning. To me the performance reduction is disappointing. It's like 2001 all over again when Java was slow and there was a lot of resistance to using it in production. Eventually the JVMs became faster (well, so did the processors to some extent) and people stopped thinking twice about using Java. Now, just when you thought Java was going to run blazing fast on multi-core, there's a second VM under the Java VM to slow things down. Dang! So much for progress. Now we have to wait for 10G pipes, SSDs and mega-multi-core to make up for the drop in power.

6) ORM - do we still have that impedance mismatch? Why are there so many standards - JDO and JPA and so many frameworks? Apache itself has so many. Then there is Hibernate, DataNucleus and EclipseLink.

7) Does Queuing interest you? ZeroMQ looks interesting. Anything that is open source interests me. Here's some eval notes on open source queuing systems. A little dated but still interesting.

Until next time!

Tuesday, June 15, 2010

This quarter in Sci-fi

1) Protector by Larry Niven has a very relaxing and smooth flow. Simple, unadulterated and enjoyable 70's Sci-fi.

2) The Scar by China Mieville. Another Mieville classic. This story is set on a pirate ship, complete with science-fantasy, steampunk and magic. Loads of sugary goodness, like his other novels.

3) The Light of Other Days by Arthur C. Clarke and Stephen Baxter. What happens when 2 masters work together? You get a Hard Sci-Fi masterpiece. If every action you make were to be visible later in the future with the help of wormhole cameras, would we still lie, commit crimes and injustices? A thought provoking novel. We talk about Facebook, Google and privacy. Hah! You should read this book. The issues in the novel are way more complex. Society and species level.

4) Ilium by Dan Simmons. A fat book, you have been warned. A strange mix of sci-fi, Greek gods, Trojan war, Shakespeare and post humanism. Too bad this is just part 1. I would say, it's almost as good as Hyperion.

Sunday, June 13, 2010

Hadoop, CDI, Weld, HSQL and Chuck Norris Java jokes

Big data doesn't get bigger than this...very rarely, i.e. if you exclude GOOG, EBAY and MSFT. Facebook now has the largest Hadoop installation, surpassing Yahoo. Hive is the main interface into Hadoop for the Facebook analytics team, it seems. What caught my attention was the 16G heap for the JobTracker nodes and 58G heap for the NameNode.

I still find it hard to believe that the established OLAP and DW companies cannot offer competitive prices and solutions. One of them has docs explaining how they do joins in clusters - SQL Request and Transaction Processing. Just the doc for Join processing runs up to 400 pages. Impressive to say the least!

HSQL has a new re-written core with MVCC and 2 phase locking. This version 2.0 is now about as good as H2. I like the competition.

I've been trying to understand where this new (CDI) Dependency Injection business was heading. After digging up a few articles, I found these interesting: 2 JSRs - 330 and 299. 299 extends 330 as described here. And, naturally, Spring is not very enthusiastic about making JEE easy to use - read the comments in that 299-330 article. JBoss' Weld is the RI.

And here's a bit of fun, courtesy Chuck Norris:
   1) Chuck Norris Java jokes -
   2) Chuck Norris facts -

Until next time!

Wednesday, May 26, 2010

Data cleansing, perceptual hash and others

Got dirty data? I am very impressed with this data cleansing tool built by the Freebase guys. Specifically because it is open source and the UI is so well done - Freebase Gridworks.

Haha, yes another NoSQL system which claims incredible performance - KumoFS.

Spring. No, not the season, I mean the framework. I have many questions about why it's required in light of the supposedly simplified new JEE spec.

I wonder how all those various distributed Lucene index implementations perform. Apache Solr itself offers most of those enterprise features.

Perceptual hashing to detect duplicate images. I'll file this under all the other "interesting" hashing techniques.

The LinkedIn guys have a simple implementation of a load balancer using Zookeeper and NIO/Netty. Nothing new here, but this one is in Scala. Apparently they like Scala too, just like the Twitter dev team.

Have a nice and long (for some) weekend!

Saturday, May 15, 2010

The 3 laws of error handling (and everything in life)

Having spent a few years wading through production log files and unruly code, I've come to some startlingly unoriginal conclusions. Also being a Science Fiction fan, I've come up with these 3 laws (Heard of the Three Laws of Robotics?):

The 3 laws of error handling (and everything in life):

  1) CYA
  2) Take responsibility
  3) When in doubt, refer to #2

Now, there are some of you who are probably thinking "Take responsibility" should've been #1, but that would just sound very cliched. Let's be honest, CYA is what everyone does first in the real world. But #2 and #3 help keep that CYA attitude in check. It makes the world a better place. Well..sort of.

So, how exactly does this apply to umm.. programming? Let's take an arbitrary simple example that I found on the internet:

at org.apache.axis.encoding.ser.BeanPropertyTarget.set( 
at org.apache.axis.encoding.DeserializerImpl.valueComplete( 
at org.apache.axis.encoding.ser.ArrayDeserializer.valueComplete( 
at org.apache.axis.encoding.DeserializerImpl.endElement( 
at org.apache.axis.encoding.DeserializationContext.endElement(
at org.apache.axis.message.SAX2EventRecorder.replay(
at org.apache.axis.message.MessageElement.publishToHandler(
at org.apache.axis.message.RPCElement.deserialize(
.. .. .. 

This stack trace is not very helpful, is it? If your program threw something like this in production what would you have done?

  1) How can you defend (if it comes to that) that it was not something in your code?
  2) Perhaps the input/arguments were wrong or the library you called caused it
  3) So, does this leave your object in a corrupt and incomplete state?
  4) Should the caller retry and hope that the problem does not occur again?
  5) What is the alternative? What now? How do you get around it? Can you get around it?

See? There are so many questions and if this happened in production, then you would have 0 answers.

How do the "3 laws" help? If you structure your code like this:

CYA:
  - The error was caused because...
  - The input that was expected was... but was actually...
  - We
      - Tried these things but...
      - Retried so many times but...

Take responsibility:
  - The error is
      - Recoverable - how?
      - Un-recoverable because... Therefore you have to...
  - The system/instance is now
      - Still ok except for...
      - Unstable and so you have to... or we already are doing...

Here's one simple way to implement parts of it:

   + getMessage()
   + getCause()

   + isCallersFault()
   + isOperationRecoverable()
   + isCalleeCorruptOrUnstable() 
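A minimal sketch of an exception carrying those three flags could look like this. The class names AccountableException and PortParser, the constructor shape and the port-parsing example are all my own inventions for illustration; getMessage() and getCause() simply come from Throwable.

```java
// Hypothetical exception type; the name and constructor are illustrative choices.
class AccountableException extends Exception {
    private final boolean callersFault;
    private final boolean operationRecoverable;
    private final boolean calleeCorruptOrUnstable;

    AccountableException(String message, Throwable cause, boolean callersFault,
                         boolean operationRecoverable, boolean calleeCorruptOrUnstable) {
        super(message, cause); // getMessage() and getCause() are inherited
        this.callersFault = callersFault;
        this.operationRecoverable = operationRecoverable;
        this.calleeCorruptOrUnstable = calleeCorruptOrUnstable;
    }

    // CYA: was the input/argument wrong?
    boolean isCallersFault() { return callersFault; }

    // Take responsibility: can the caller retry or work around it?
    boolean isOperationRecoverable() { return operationRecoverable; }

    // Take responsibility: is this object/instance still safe to use?
    boolean isCalleeCorruptOrUnstable() { return calleeCorruptOrUnstable; }
}

// Made-up example of a callee that reports what was expected and what it actually got.
class PortParser {
    static int parsePort(String text) throws AccountableException {
        try {
            int port = Integer.parseInt(text.trim());
            if (port < 1 || port > 65535) {
                throw new AccountableException(
                        "Expected a port between 1 and 65535 but was " + port,
                        null, true, true, false);
            }
            return port;
        } catch (NumberFormatException e) {
            throw new AccountableException(
                    "Expected a numeric port but was '" + text + "'",
                    e, true, true, false);
        }
    }
}
```

A caller that catches AccountableException can decide whether to retry, fix its input or throw the instance away by checking the flags, instead of guessing from a stack trace.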


Thursday, May 13, 2010

Orient database, Zookeeper, Interest rates, Yield curves and more NoSQL

By now you would've thought that people had stopped developing old fashioned datastores...well think again. I chanced upon another database - Orient. It supports many modes (or claims to) - raw storage, SQL, Object oriented along with ACID transactions.. sounds very interesting. And it's under the Apache License.

We've been reading so much about Hadoop and Cassandra and NoSQL everywhere, but not much about another essential Hadoop sub-project. It's worth using in its own right - ZooKeeper. Check out their recipes page.

Some more ZooKeeper coverage on other blogs:

Voldemort vs Cassandra...Wait.wait...this is not a rehash of that Twitter-Cassandra article that's making its rounds on the interwebs. This is a performance comparison -

Had enough of reading about software? How about investments and interest rates and Fed policies? It's a nice distraction:

Until next time.

Wednesday, May 05, 2010

Rule of thumb for Powerpoint slides - If you cannot read the text on your Blackberry, then you have too much content. It's a simple, low-def test much like the 5 second test to see how much people will/can remember.

Sunday, May 02, 2010

I tried to cash a Reality Check and it bounced.

Saturday, April 24, 2010

Hiking in Russian Ridge Open Space Preserve

Russian Ridge is an easy-to-find hiking area. It gets pretty crowded before noon. Spring is undoubtedly the best time to visit. Most of the trails are out in the open and there is no shade. So, it is probably best avoided in summer. But in spring, with all the hillsides covered with flowers, it is certainly worth a visit.

Tuesday, April 20, 2010

Overheard at someone's cafeteria - "There must be some kinda way out of here said the joker broker to the thief chief......all along the watch tower.." [Jimi Hendrix]

Saturday, April 17, 2010

Hiking in Long Ridge Open Space Preserve

I went hiking with friends to Long Ridge Open Space Preserve. We did the 4.71 mile loop on Peter's Creek Trail and the Long Ridge Trail. It's a mild trail with very pleasant surroundings. The best part is the meadow overlooking the Pacific in the distance. This is midway along the hike.

Also, don't forget to enjoy the small lake (pond?) in the first quarter of the loop. It seemed to be covered with algae. It didn't smell though.

Monday, April 12, 2010

Golden ratios, collapse of complex business models, time debt and GM (GM?)

JEXL version 2, the Apache Expression Language is out. It may be slower than MVEL (very likely) or Spring EL (don't know for sure) but JEXL is from Apache and that deserves a special mention. The v2 API looks good with facilities to plug in/customize many aspects of parsing and evaluation.

Need to sort gigabytes of text but don't have Hadoop or Unix "sort" handy? Not to worry. External sorting is a remarkably simple concept that does chunk-wise sorting and merging. Isn't that MapReduce without the pain of a Hadoop cluster setup?
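Here is roughly what chunk-wise sorting and merging looks like in Java. This is my own minimal sketch, not the code from the linked article: the class name ExternalSort and the maxLinesPerChunk parameter are made up, it sorts lines by plain String order, and it skips the careful resource cleanup a real tool would need.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: sort chunks in memory, spill them to temp files, then k-way merge.
class ExternalSort {
    // Tracks the current line of one sorted chunk file during the merge.
    private static final class Cursor implements Comparable<Cursor> {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }

        boolean advance() throws IOException {
            line = reader.readLine();
            return line != null;
        }

        public int compareTo(Cursor other) {
            return line.compareTo(other.line);
        }
    }

    static void sort(Path input, Path output, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxLinesPerChunk) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer));
            }
        }
        merge(chunks, output);
        for (Path chunk : chunks) {
            Files.deleteIfExists(chunk);
        }
    }

    // Sorts one in-memory chunk and writes it to its own temp file.
    private static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("extsort-", ".chunk");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    // Repeatedly emits the smallest current line across all chunk files.
    private static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                Cursor c = new Cursor(Files.newBufferedReader(chunk, StandardCharsets.UTF_8));
                if (c.line != null) heap.add(c); else c.reader.close();
            }
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                out.write(c.line);
                out.newLine();
                if (c.advance()) heap.add(c); else c.reader.close();
            }
        }
    }
}
```

Only one chunk plus one line per chunk file is ever held in memory, which is the whole trick. Calling ExternalSort.sort(in, out, 100000) would sort a large text file in chunks of 100,000 lines.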

Clay Shirky's note on the collapse of complex business models - after reading this I couldn't help thinking about the Innovator's dilemma and how some big companies need some kind of rebooting at some point (negative marginal value). And how open source is a disruptive model employed by many small companies to compete with larger software companies. For a long time their documentation and feature set are poor compared to more established companies but eventually they catch up. But during this "awkward teenager" period, some customers are willing to use it, which is perplexing. Good enough is perfectly ok for some purposes I suppose.

Much like GM and its endless problems which eventually led to the closure of their Fremont plant (among other things). This in spite of a collaboration with their arch nemesis Toyota! Speaking of quality, this is a painful reminder of how bad quality has to be addressed ASAP. Not fixing quality in time (before release) is in effect giving yourself a time debt. Time debt? This guy defines it as - "Basically, time debt is anything that you do which will commit you to doing unavoidable work in the future." If you don't fix the problem before it leaves the assembly line, like at that Fremont plant, then after the release Engineering, Support and Product management will spend hours identifying and fixing something that could've been fixed at much less cost before the release. That is the debt, repaid with interest. Duh! Think of your customer's loss of faith too.

Oh, I almost forgot, here are some UI tips that require Math. I especially liked the Golden ratio. Very elegant.

Until next time..

Life is a long winding road....better drive a car that handles well.

Thursday, April 08, 2010

Old Chinese proverb: If your IQ is below room temperature, then move to a warmer climate.

Monday, April 05, 2010

Ice cream should come in only 1 flavor - "Self control" with a topping of nuts.

Saturday, April 03, 2010

After the Kindle and the iPad, I suppose there won't be any "Death by a 1000 paper cuts"? Does this mean that our lives are safer now?

Wednesday, March 31, 2010

Blue tailed bee eaters

Blue tailed bee eaters, originally uploaded by SRJP.

Pic taken by my dad - Dr. Jayaprakash.

Monday, March 29, 2010

Make your online reading more pleasant

If you spend a lot of time reading online, like me, then eye strain is probably your biggest complaint. You can stop worrying, because there's a beautiful tool called Readable.

Don't worry, it's free and does not need any installation. All you have to do is drag a link and drop it on to your Browser's Bookmarks Toolbar.

It turns this .....


into this...! It's the same web page, just the style has changed. Beautiful, isn't it? You can revert to the original by just clicking the page.

You can use the theme I use - just drag and drop this link into your Bookmarks and when you are on any web page just click this Readable bookmark. Or you can make one yourself. Here's the actual site with a tutorial - Readable theme setup.

Happy reading!