Wednesday, May 26, 2010

Data cleansing, perceptual hash and others

Got dirty data? I am very impressed with Freebase Gridworks, a data cleansing tool built by the Freebase guys, specifically because it is open source and the UI is so well done.

Haha, yes, another NoSQL system that claims incredible performance - KumoFS.

Spring. No, not the season, I mean the framework. I have many questions about why it's required in light of the supposedly simplified new JEE spec.

I wonder how all those various distributed Lucene index implementations perform. Apache Solr itself offers most of those enterprise features.

Perceptual hashing to detect duplicate images. I'll file this under all the other "interesting" hashing techniques.

The LinkedIn guys have a simple implementation of a load balancer using ZooKeeper and NIO/Netty. Nothing new here, but this one is in Scala. Apparently they like Scala too, just like the Twitter dev team.

Have a nice and long (for some) weekend!

Saturday, May 15, 2010

The 3 laws of error handling (and everything in life)

Having spent a few years wading through production log files and unruly code, I've come to some startlingly unoriginal conclusions. Also being a Science Fiction fan, I've come up with these 3 laws (Heard of the Three Laws of Robotics?):

The 3 laws of error handling (and everything in life):

  1) CYA
  2) Take responsibility
  3) When in doubt, refer to #2

Now, there are some of you who are probably thinking "Take responsibility" should've been #1, but that would just sound very cliched. Let's be honest, CYA is what everyone does first in the real world. But #2 and #3 help keep that CYA attitude in check. That makes the world a better place. Well... sort of.

So, how exactly does this apply to umm.. programming? Let's take an arbitrary simple example that I found on the internet:

at org.apache.axis.encoding.ser.BeanPropertyTarget.set( 
at org.apache.axis.encoding.DeserializerImpl.valueComplete( 
at org.apache.axis.encoding.ser.ArrayDeserializer.valueComplete( 
at org.apache.axis.encoding.DeserializerImpl.endElement( 
at org.apache.axis.encoding.DeserializationContext.endElement(
at org.apache.axis.message.SAX2EventRecorder.replay(
at org.apache.axis.message.MessageElement.publishToHandler(
at org.apache.axis.message.RPCElement.deserialize(
.. .. .. 

This stack trace is not very helpful, is it? If your program threw something like this in production, what would you do?

  1) How can you defend (if it comes to that) that it was not something in your code?
  2) Were the input/arguments wrong, or did the library you called cause it?
  3) So, does this leave your object in a corrupt and incomplete state?
  4) Should the caller retry and hope that the problem does not occur again?
  5) What is the alternative? What now? How do you get around it? Can you get around it?

See? There are so many questions and if this happened in production, then you would have 0 answers.

How do the "3 laws" help? Structure your error reporting so that it answers these questions:

CYA:
  - The error was caused because...
  - The input that was expected was... but was actually...
  - We 
      - Tried these things but...
      - Retried so many times but...

Take responsibility:
  - The error is
      - Recoverable - how?
      - Un-recoverable because... Therefore you have to...
  - The system/instance is now
      - Still ok except for...
      - Unstable and so you have to... or we already are doing...

Here's one simple way to implement parts of it, as methods on your exception class:

   + getMessage()
   + getCause()

   + isCallersFault()
   + isOperationRecoverable()
   + isCalleeCorruptOrUnstable() 
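
To make that list concrete, here is a minimal Java sketch, not a definitive implementation. The three is...() accessors come from the list above, and getMessage() and getCause() are simply inherited from Throwable. Everything else is my assumption for illustration: the class name DiagnosableException, its constructor, and the QuantityParser callee.

```java
/**
 * Hypothetical exception type. The class name and constructor are
 * assumptions; only the three is...() accessors come from the list above.
 */
class DiagnosableException extends Exception {
    private static final long serialVersionUID = 1L;

    private final boolean callersFault;
    private final boolean operationRecoverable;
    private final boolean calleeCorruptOrUnstable;

    public DiagnosableException(String message, Throwable cause,
                                boolean callersFault,
                                boolean operationRecoverable,
                                boolean calleeCorruptOrUnstable) {
        // getMessage() and getCause() are inherited from Throwable.
        super(message, cause);
        this.callersFault = callersFault;
        this.operationRecoverable = operationRecoverable;
        this.calleeCorruptOrUnstable = calleeCorruptOrUnstable;
    }

    /** CYA: true when the input/arguments were wrong, not the callee's logic. */
    public boolean isCallersFault() { return callersFault; }

    /** True when trying again (possibly with corrected input) can succeed. */
    public boolean isOperationRecoverable() { return operationRecoverable; }

    /** True when the object that threw should no longer be trusted or reused. */
    public boolean isCalleeCorruptOrUnstable() { return calleeCorruptOrUnstable; }
}

/** Hypothetical callee, used only to show the exception being thrown. */
final class QuantityParser {
    private QuantityParser() { }

    public static int parseQuantity(String raw) throws DiagnosableException {
        if (raw == null) {
            throw new DiagnosableException(
                "Expected a whole number such as \"42\" but was actually null. "
                    + "Nothing was changed; pass a value and call again.",
                null, true, true, false);
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // Say what was expected, what arrived, and what state we are in.
            throw new DiagnosableException(
                "Expected a whole number such as \"42\" but was actually \""
                    + raw + "\". Nothing was changed; fix the input and call again.",
                e, true, true, false);
        }
    }
}
```

A caller that catches this can branch on the flags instead of guessing from a stack trace: retry only when isOperationRecoverable() is true, and throw away the callee instance when isCalleeCorruptOrUnstable() is true.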


Thursday, May 13, 2010

Orient database, ZooKeeper, Interest rates, Yield curves and more NoSQL

By now you would've thought that people had stopped developing old-fashioned datastores... well, think again. I chanced upon another database - Orient. It supports (or claims to support) many modes - raw storage, SQL and object-oriented, along with ACID transactions. Sounds very interesting. And it's under the Apache License.

We've been reading so much about Hadoop and Cassandra and NoSQL everywhere, but not much about another essential Hadoop sub-project that is worth using in its own right - ZooKeeper. Check out their recipes page.

Some more ZooKeeper coverage on other blogs:

Voldemort vs Cassandra... Wait, wait... this is not a rehash of that Twitter-Cassandra article that's making the rounds on the interwebs. This is a performance comparison -

Had enough of reading about software? How about investments and interest rates and Fed policies? It's a nice distraction:

Until next time.

Wednesday, May 05, 2010

Rule of thumb for PowerPoint slides - if you cannot read the text on your BlackBerry, then you have too much content. It's a simple, low-def test, much like the 5-second test to see how much people will/can remember.

Sunday, May 02, 2010

I tried to cash a Reality Check and it bounced.