Saturday, May 15, 2010

The 3 laws of error handling (and everything in life)

Having spent a few years wading through production log files and unruly code, I've come to some startlingly unoriginal conclusions. Being a science fiction fan as well, I've boiled them down to these 3 laws (heard of the Three Laws of Robotics?):

The 3 laws of error handling (and everything in life):

  1) CYA
  2) Take responsibility
  3) When in doubt, refer to #2

Now, some of you are probably thinking "Take responsibility" should've been #1, but that would just sound very cliched. Let's be honest, CYA is what everyone does first in the real world. But #2 and #3 help keep that CYA attitude in check. They make the world a better place. Well... sort of.

So, how exactly does this apply to umm... programming? Let's take an arbitrary, simple example that I found on the internet:

at org.apache.axis.encoding.ser.BeanPropertyTarget.set( 
at org.apache.axis.encoding.DeserializerImpl.valueComplete( 
at org.apache.axis.encoding.ser.ArrayDeserializer.valueComplete( 
at org.apache.axis.encoding.DeserializerImpl.endElement( 
at org.apache.axis.encoding.DeserializationContext.endElement(
at org.apache.axis.message.SAX2EventRecorder.replay(
at org.apache.axis.message.MessageElement.publishToHandler(
at org.apache.axis.message.RPCElement.deserialize(
.. .. .. 

This stack trace is not very helpful, is it? If your program threw something like this in production, what would you do?

  1) How can you show (if it comes to that) that it was not caused by something in your code?
  2) Were the inputs/arguments wrong, or did the library you called cause it?
  3) So, does this leave your object in a corrupt and incomplete state?
  4) Should the caller retry and hope that the problem does not occur again?
  5) What is the alternative? What now? How do you get around it? Can you get around it?

See? There are so many questions, and if this happened in production, you would have 0 answers.

How do the "3 laws" help? Structure your error handling so that every failure reports something like this:

CYA:
  - The error was caused because...
  - The input that was expected was... but was actually...
  - We 
      - Tried these things but...
      - Retried so many times but...

Take responsibility:
  - The error is
      - Recoverable - how?
      - Un-recoverable because... Therefore you have to...
  - The system/instance is now
      - Still ok except for...
      - Unstable and so you have to... or we already are doing...

Here's one simple way to implement parts of it, as methods on your exception class:

   + getMessage()
   + getCause()

   + isCallersFault()
   + isOperationRecoverable()
   + isCalleeCorruptOrUnstable()
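
To make that a little more concrete, here is a rough Java sketch of what such an exception might look like. Treat it as one possible shape, not the definitive one: the class names (OperationException, PortParser, PortParserDemo), the constructor and the port-parsing example are all made up for illustration. Only the three is...() method names come from the list above, and getMessage() and getCause() are simply inherited from java.lang.Throwable.

```java
// A sketch only: class names, constructor and the parsePort example are
// assumptions made for illustration. Only the is...() method names come
// from the list above; getMessage() and getCause() come from Throwable.
class OperationException extends Exception {
    private final boolean callersFault;
    private final boolean operationRecoverable;
    private final boolean calleeCorruptOrUnstable;

    OperationException(String message, Throwable cause, boolean callersFault,
                       boolean operationRecoverable, boolean calleeCorruptOrUnstable) {
        super(message, cause);
        this.callersFault = callersFault;
        this.operationRecoverable = operationRecoverable;
        this.calleeCorruptOrUnstable = calleeCorruptOrUnstable;
    }

    // CYA: were the input/arguments wrong?
    boolean isCallersFault() { return callersFault; }

    // Take responsibility: can the caller fix something and try again?
    boolean isOperationRecoverable() { return operationRecoverable; }

    // Take responsibility: is the object that threw still safe to use?
    boolean isCalleeCorruptOrUnstable() { return calleeCorruptOrUnstable; }
}

class PortParser {
    // Hypothetical callee: parses a TCP port and says exactly what went wrong.
    static int parsePort(String text) throws OperationException {
        if (text == null) {
            throw new OperationException(
                "Expected a port number between 1 and 65535 but was null",
                null, true, true, false);
        }
        int port;
        try {
            port = Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            // Keep the original exception as the cause so nothing is lost.
            throw new OperationException(
                "Expected a port number between 1 and 65535 but was '" + text + "'",
                e, true, true, false);
        }
        if (port < 1 || port > 65535) {
            throw new OperationException(
                "Expected a port number between 1 and 65535 but was " + port,
                null, true, true, false);
        }
        return port;
    }
}

class PortParserDemo {
    public static void main(String[] args) {
        try {
            PortParser.parsePort("eighty");
        } catch (OperationException e) {
            // The caller can now decide what to do instead of guessing.
            System.out.println(e.getMessage());
            System.out.println("Caller's fault: " + e.isCallersFault());
            System.out.println("Recoverable: " + e.isOperationRecoverable());
            System.out.println("Callee unstable: " + e.isCalleeCorruptOrUnstable());
        }
    }
}
```

The three flags answer the production questions from earlier directly: whose fault it was, whether a retry with corrected input makes sense, and whether the object that threw can still be trusted. Plain booleans are the simplest thing that could work here; an enum or separate exception subclasses would do the job just as well.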