Wednesday, May 26, 2010

Data cleansing, perceptual hash and others

Got dirty data? I am very impressed with Freebase Gridworks, a data cleansing tool built by the Freebase guys, specifically because it is open source and the UI is so well done.

Haha, yet another NoSQL system that claims incredible performance - KumoFS.

Spring. No, not the season, I mean the framework. I have many questions about why it's required in light of the supposedly simplified new JEE spec.

I wonder how all those various distributed Lucene index implementations perform. Apache Solr itself offers most of those enterprise features.

Perceptual hashing to detect duplicate images. I'll file this under all the other "interesting" hashing techniques.
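
To make the idea concrete, here is a minimal sketch of one of the simplest perceptual hashes, usually called an "average hash". It is not necessarily the algorithm behind the link above (fancier variants use a DCT), and the function names and the plain 2D-list image format are my own assumptions for illustration. The image is shrunk to an 8x8 grid, each cell becomes one bit depending on whether it is brighter than the grid's mean, and two images are compared by the Hamming distance between their 64-bit hashes.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit average hash as an int.

    `pixels` is assumed to be a 2D list of grayscale values (rows of
    equal length) with at least `size` rows and `size` columns.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Shrink the image to a size x size grid by averaging each block.
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    # One bit per cell: 1 if the cell is brighter than the grid mean.
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")
```

A small distance suggests the two images are near-duplicates (a resized or slightly brightened copy, say), while unrelated images tend to land far apart. Where to put the cutoff is a tuning decision that depends on the image collection.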

The LinkedIn guys have a simple implementation of a load balancer using ZooKeeper and NIO/Netty. Nothing new here, but this one is in Scala. Apparently they like Scala too, just like the Twitter dev team.

