Monday, December 21, 2009

Primary Key object - an underappreciated programming idiom

[Updated: Jan 4 2010 - CustomerKey class has more details]

There is, in my opinion, an often underused idiom: the Primary Key object. If you've used Container Managed Persistence in EJB, or Hibernate, or JPA, then you probably know what I'm talking about. It's the very simple idea of creating a Serializable POJO whose fields store the unique Id(s) of the entity.

The Java EE 6 documentation describes Primary Key classes. In short: make the class Serializable, override equals() and hashCode(), and you are all set.

import java.io.Serializable;

public class CustomerKey implements Serializable {
    private static final long serialVersionUID = 1L;

    protected String passportId;
    
    protected String familyName;
    
    /**
     * @param passportId - The actual key property.
     * @param familyName - A non-key property to assist in caching/optimization
     *                     etc.
     */
    public CustomerKey(String passportId, String familyName) {
        this.passportId = passportId;
        this.familyName = familyName; 
    }
 
    public String getPassportId() {
        return passportId;
    }
    
    public String getFamilyName() {
        return familyName;
    }

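    @Override
    public int hashCode() {
        // Only uses passportId for the hash.
        return (passportId == null) ? 0 : passportId.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        // Only checks for a passportId match; familyName is ignored.
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof CustomerKey)) {
            return false;
        }
        CustomerKey other = (CustomerKey) obj;
        return (passportId == null) ? other.passportId == null
                                    : passportId.equals(other.passportId);
    }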
}

So, why is it interesting? The use of Serializable POJOs as keys does not have to be restricted to modeling database-persisted objects.

I've seen people make the mistake of premature optimization, i.e. deciding to use a String, long, or int (some built-in Java type) as the unique Id of an entity because it appears to satisfy the immediate requirement of an Id. The operative words are "appears" and "immediate". Using just 4 or 8 bytes to store the Id might seem like a huge saving in space/memory at first, but let's look at what you are missing:

  • When your data grows and you realize that a 4- or 8-byte number is not enough, what do you do?
  • Your system has an internal Id scheme and the upstream system has another scheme
    • By using a bare built-in Java type as your internal Id you have very likely lost the original/upstream Id
    • This means you cannot trace the record back to its source. You might think you will not need it, but try saying that to your customer during a production crisis when you've lost some records and you can't tell which ones exactly (Ouch!)
  • Remember The Law of Leaky Abstractions? Once you start with your own internal/proprietary Id scheme, it will eventually leak outside to the user/customer. Now your customer will want to use it directly and you are forced to expose/support/maintain that Id scheme (Double ouch!)
  • What you thought could be easily generated as a monotonically increasing number on one machine is now impossible to handle on a cluster of machines (unless you have a singleton Id generator - SPOF!)
  • Now that your data has grown and you have started using a Grid of some sort, here's what you could have done had you taken the POJO Id route:
    • Data locality hints: You have a family of related objects and you want all of them to reside on the same machine instead of being spread across the grid randomly. So you add a field called "customerId" to all the Primary Key (PK) POJOs. Now the OrderPK, ShipmentPK, FulfilmentPK and OrderLineItemPK keys all carry a "customerId", but it need not be part of the equals()/hashCode() combo. You can then program your Grid to use this "customerId" hint to place all the objects of the same family together and speed up your queries/retrievals
    • Covering Index: If you realize over time that your application tends to retrieve only certain specific fields for processing and not all the data columns, you could move those fields into the PK class. This way you get the columns you need along with the PK object and do not have to fetch the full record at all. This shaves a lot of time off retrievals, but remember that the larger keys consume a little more memory. More on Covering Indexes
  • Migrating data also becomes easier, as you can add hints and version numbers to the PKs
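
The data locality hint described above can be sketched roughly as follows. This is only an illustration under stated assumptions: OrderKey and CustomerAffinityPartitioner are hypothetical names, and the partitioner merely stands in for whatever placement hook a particular Grid product exposes.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical key for an Order entity. Only orderId defines identity;
// customerId is a placement hint and is deliberately left out of
// equals()/hashCode().
final class OrderKey implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String orderId;    // the actual key property
    private final String customerId; // non-key locality hint

    OrderKey(String orderId, String customerId) {
        this.orderId = Objects.requireNonNull(orderId, "orderId");
        this.customerId = Objects.requireNonNull(customerId, "customerId");
    }

    String getOrderId() {
        return orderId;
    }

    String getCustomerId() {
        return customerId;
    }

    @Override
    public int hashCode() {
        return orderId.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof OrderKey)) {
            return false;
        }
        return orderId.equals(((OrderKey) obj).orderId);
    }
}

// Hypothetical partitioner: maps a customerId hint to one of N partitions,
// so every key carrying the same customerId lands on the same partition.
final class CustomerAffinityPartitioner {
    private final int partitionCount;

    CustomerAffinityPartitioner(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        this.partitionCount = partitionCount;
    }

    int partitionFor(String customerId) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(customerId.hashCode(), partitionCount);
    }
}
```

A ShipmentPK or OrderLineItemPK written the same way would pass its own customerId to partitionFor() and end up next to the customer's orders, without the hint ever affecting key equality.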


Overall, using a plain java.lang.Object or, better yet, a well-defined basic interface as the Primary Key type will go a long way.
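
As a rough sketch of the interface approach: the names PrimaryKey, InvoiceKey and KeyedStore below are made up for illustration, and the sourceId() method is just one assumption about what such an interface might expose (here, the upstream Id, so it is never lost).

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical base interface for all key types. Code that stores or looks
// up entities depends only on this, never on a bare long or String.
interface PrimaryKey extends Serializable {
    // The identifier as known to the upstream/source system.
    String sourceId();
}

// One concrete key type; others can be added later without touching the
// code that is written against PrimaryKey.
final class InvoiceKey implements PrimaryKey {
    private static final long serialVersionUID = 1L;

    private final String sourceId;

    InvoiceKey(String sourceId) {
        this.sourceId = Objects.requireNonNull(sourceId, "sourceId");
    }

    @Override
    public String sourceId() {
        return sourceId;
    }

    @Override
    public int hashCode() {
        return sourceId.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof InvoiceKey
                && sourceId.equals(((InvoiceKey) obj).sourceId);
    }
}

// A toy in-memory store keyed by the interface rather than by a primitive.
final class KeyedStore<V> {
    private final Map<PrimaryKey, V> rows = new HashMap<>();

    void put(PrimaryKey key, V value) {
        rows.put(Objects.requireNonNull(key, "key"), value);
    }

    V get(PrimaryKey key) {
        return rows.get(key);
    }
}
```

Because KeyedStore only knows about PrimaryKey, a key class can later grow hints or a version field without changing any of the store's signatures.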

Remember - Good design is always future proof.

Hibernate/JPA provide some nice annotations for this, such as @IdClass and @Id.

1 comment:

Ashwin Jayaprakash said...

Related note by Joshua Bloch in his famous slide deck, "How to Design a Good API and Why it Matters", where he says: "Provide Programmatic Access to All Data Available in String Form".