Ashwin Jayaprakash's Blog: 12/01/2009

Sunday, December 27, 2009

Indian bush lark

Indian bush lark, originally uploaded by SRJP.

Pic taken by my dad - Dr. Jayaprakash.

Monday, December 21, 2009

Primary Key object - an under appreciated programming idiom

[Updated: Jan 4 2010 - CustomerKey class has more details]

There is, in my opinion, an often underused idiom - the Primary Key object. If you've used Container Managed Persistence in EJB, or Hibernate, or JPA, then you probably know what I'm talking about. It's the very simple idea of creating a Serializable POJO and storing the unique Id(s) of the entity as its fields.

JEE 6 says this about Primary Key Classes. You override equals() and hashCode() and you are all set.

import java.io.Serializable;
import java.util.Objects;

public class CustomerKey implements Serializable {
    private static final long serialVersionUID = 1L;

    protected String passportId;

    protected String familyName;

    /**
     * @param passportId - The actual key property.
     * @param familyName - A non-key property to assist in caching/optimization
     *                     etc.
     */
    public CustomerKey(String passportId, String familyName) {
        this.passportId = passportId;
        this.familyName = familyName;
    }

    public String getPassportId() {
        return passportId;
    }

    public String getFamilyName() {
        return familyName;
    }

    @Override
    public int hashCode() {
        // Only uses passportId for the hash.
        return Objects.hashCode(passportId);
    }

    @Override
    public boolean equals(Object obj) {
        // Only checks for passportId match.
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof CustomerKey)) {
            return false;
        }
        return Objects.equals(passportId, ((CustomerKey) obj).passportId);
    }
}

So, why is it interesting? The use of Serializable POJOs does not have to be restricted to modeling Database persisted objects.

I've seen people make the mistake of premature optimization - i.e. deciding to use a String, long or int (some basic Java type) as the unique Id of an entity because it appears to satisfy the immediate requirement of an Id. The operative words being "appears" and "immediate". Using just 4 or 8 bytes to store the Id might seem like a huge saving in space/memory at first, but let's look at what you are missing:

When your data grows and you realize that the 4/8 byte number is not enough, what do you do?
Your system has an internal Id scheme and the upstream system has another scheme.

By using basic Java types as your internal scheme you have very likely lost the original/upstream Id.
This means you cannot track the record back to its source. You might think you will not need it, but try saying that to your customer during a production crisis when you've lost some records and you can't tell which ones exactly (Ouch!)

Remember The Law of Leaky Abstractions? Once you start with your own internal/proprietary Id scheme it will eventually leak outside to the user/customer. Now your customer will want to use it directly and you are forced to expose/support/maintain that Id scheme (Double ouch!)
What you thought could be easily generated as a monotonically increasing number on one machine is now impossible to handle on a cluster of machines (unless you have a singleton Id generator - SPOF!)
Now that your data has grown and you have started using a Grid of some sort, here's what you would've liked to do had you taken the POJO Id route:

Data locality hints: You have a family of related objects and you want all of them to reside on the same machine instead of being spread across the grid randomly. So you stick a field called "customerId" into all the Primary Key (PK) POJOs. Now all the OrderPK, ShipmentPK, FulfilmentPK and OrderLineItemPK keys will have a "customerId" in them, but it need not be part of the equals()/hashCode() combo. So, you can program your Grid to make use of this "customerId" hint to place all the objects of the same family together and speed up your queries/retrievals.
Covering Index: If you realize over time that your application tends to retrieve only certain specific fields for processing and not all the data columns, you could actually move those fields into the PK class. This way you get the columns you need along with the PK object and do not have to fetch the full record at all. Shaves a lot of time off retrievals, but remember that these keys consume a little more memory. More on Covering Indexes

Migrating data also becomes easier as you can add hints and version numbers into PKs.
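The data locality hint above can be sketched roughly as follows. This is only an illustration: OrderPK is written here the way the post describes it, while LocalityRouter and partitionFor are made-up names, since every Grid product exposes its own hook for custom placement.

```java
import java.io.Serializable;
import java.util.Objects;

/** Hypothetical key for an Order; customerId is only a placement hint. */
final class OrderPK implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String orderId;    // the actual key property
    private final String customerId; // locality hint, not part of identity

    OrderPK(String orderId, String customerId) {
        this.orderId = orderId;
        this.customerId = customerId;
    }

    String getOrderId() {
        return orderId;
    }

    String getCustomerId() {
        return customerId;
    }

    @Override
    public int hashCode() {
        // Only orderId takes part in the hash.
        return Objects.hashCode(orderId);
    }

    @Override
    public boolean equals(Object obj) {
        // Only orderId takes part in equality.
        return obj instanceof OrderPK
                && Objects.equals(orderId, ((OrderPK) obj).orderId);
    }
}

/** Hypothetical router; a real Grid would supply its own placement API. */
final class LocalityRouter {
    private final int partitionCount;

    LocalityRouter(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    /** Picks a partition from the hint, so one customer's keys land together. */
    int partitionFor(String customerId) {
        return Math.floorMod(Objects.hashCode(customerId), partitionCount);
    }
}
```

Two orders of the same customer are different keys, yet both route to the same partition because the router looks only at the hint and never at the key's own hashCode().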

Overall, using a plain java.lang.Object or, better yet, a well-defined basic interface as the Primary Key type will go a long way.
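As a minimal sketch of the "basic interface" idea, assuming nothing beyond the JDK: PrimaryKey, getSourceId(), getKeyVersion() and ShipmentPK below are illustrative names, not part of any library. The interface keeps the upstream Id around for traceability and carries a version number to help with migration.

```java
import java.io.Serializable;
import java.util.Objects;

/** Hypothetical base contract for every key in the system. */
interface PrimaryKey extends Serializable {
    /** The Id exactly as the upstream system supplied it. */
    String getSourceId();

    /** Version of the key layout, useful when migrating data later. */
    int getKeyVersion();
}

/** One possible implementation; callers depend only on PrimaryKey. */
final class ShipmentPK implements PrimaryKey {
    private static final long serialVersionUID = 1L;

    private final String sourceId;
    private final int keyVersion;

    ShipmentPK(String sourceId, int keyVersion) {
        this.sourceId = sourceId;
        this.keyVersion = keyVersion;
    }

    @Override
    public String getSourceId() {
        return sourceId;
    }

    @Override
    public int getKeyVersion() {
        return keyVersion;
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(sourceId);
    }

    @Override
    public boolean equals(Object obj) {
        // Identity is the source Id only; the version is migration metadata.
        return obj instanceof ShipmentPK
                && Objects.equals(sourceId, ((ShipmentPK) obj).sourceId);
    }
}
```

Code that stores or looks up entities can then be written against PrimaryKey (for example a Map keyed by PrimaryKey), so the concrete key class can grow new fields without touching its callers.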

Remember - Good design is always future proof.

Hibernate/JPA have a lot of nice annotations for these - @IdClass and @Id.

Saturday, December 19, 2009

Graph DBs - The other NoSQL

Yeah, you guessed it right. NoSQL has been the theme for this month.

Graph DBs - they are another alternative to Relational DBs and are gaining momentum as part of the NoSQL family. Graph DBs have always fascinated me because they do not require Schemas (!) and, unlike the crippled, de-normalized NoSQL formats you hear about, Graphs can store relationships. That means the Joins are already there as relationships.

My favorite Graph DB is Neo4J (obviously, since it's pure Java), which is gaining some popularity and funding
Here's an interesting entry about Freebase's internal graph engine - graphd
Here's some SQL magic on - Trees In The Database - Advanced data structures
Some more SQL meets social networks

Friday, December 18, 2009

Comp Sci algos and theory behind NoSQL

Very informative docs on distributed systems:

# NoSQL Patterns by Ricky Ho
# Design Patterns for Distributed Non-Relational Databases by Todd Lipcon

Tuesday, December 15, 2009

The rise and rise (again) of Ingres; Gartner vs ZL and other essays

Mark Logic CEO Dave Kellogg recounts his experience at Ingres and how Oracle dominated the DBMS market during its early years.

In another post, he writes about an interesting legal battle between Gartner and a tech company, ZL Technologies. It really is scary if you think about it - be it stock rating agencies or movie critics. Any self-appointed "authority on a subject" that gains a lot of visibility automatically ends up acquiring even more authority than before. Call it what you like - the Cluster effect, PageRank etc. You could have the majority opinion but it doesn't necessarily mean you are right.

Saturday, December 12, 2009

Articles on OSS and de-commoditizing technology

Some things to learn from successful Open Source Software (OSS) companies. Notes taken by Nati Shalom - Takeaway from Qcon – Part I.

Many agree that OSS has had a steady commoditizing effect on technology. While there are no doubts about some of its merits, here's a post that does not talk about OSS per se, but about the non-technology side of things that we engineers rarely (want to) see - It’s The Relationship, Stupid! (Part1) - Stop Commoditizing The Client Facing Workforce.

The book I mentioned in a previous post also talks about de-commoditizing your software.

Monday, December 07, 2009

Book: Don't just roll the dice (Software pricing guide)

I read a small book during my road trip last weekend. It's called "Don't just roll the dice - A usefully short guide to software pricing". It's written by Neil Davidson, co-founder and joint CEO of Red Gate Software, a small-ish software company that is doing well.

It's worth reading, even if you think you know about Price-vs-Demand, Support, Sales etc. His blog is also very informative - Business of Software blog.

Deleting sub-folders based on a pattern (Windows)

I found this little script to delete ".svn" folders recursively in Windows. It's easy on Unix but I wanted one for Windows. Sweet! You can modify it to match and delete any pattern.
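For comparison, here is a rough Java equivalent of the same idea. It is not the script mentioned above, just a sketch using java.nio.file; FolderCleaner and deleteFoldersNamed are names made up for this example. It first collects the matching folders and only then deletes them, so nothing is removed while the tree is still being walked. On Windows, read-only files (which .svn folders tend to contain) may need their read-only attribute cleared before the delete succeeds.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

final class FolderCleaner {
    /** Deletes every folder under root named folderName; returns what was deleted. */
    static List<Path> deleteFoldersNamed(Path root, String folderName) throws IOException {
        List<Path> matches = new ArrayList<>();
        // Pass 1: find the matching folders without descending into them.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                Path name = dir.getFileName();
                if (name != null && name.toString().equals(folderName)) {
                    matches.add(dir);
                    return FileVisitResult.SKIP_SUBTREE;
                }
                return FileVisitResult.CONTINUE;
            }
        });
        // Pass 2: delete each match, contents first.
        for (Path match : matches) {
            deleteTree(match);
        }
        return matches;
    }

    private static void deleteTree(Path dir) throws IOException {
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(d); // the folder is empty by now
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

Calling FolderCleaner.deleteFoldersNamed(Paths.get("C:\\work\\project"), ".svn") would be the typical use (the path is a placeholder). To match a pattern instead of an exact name, the equals() check could be swapped for a regular expression or a PathMatcher.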
