Application for Google Summer of Code 2010 - Mahout Project
Student: Cristian Prodan
I will add a map-reduce implementation of the SimHash clustering algorithm to the Mahout project. This algorithm provides an efficient way of finding similar/identical files in a large collection of files.
Storage capacities become larger and thus it is difficult to organize and manage growing file systems. It is easy to loose track of identical copies or older versions of the files in a directory structure. Duplicate detection technologies do not extend well when files are not identical. A typical idea for detecting similar files is to see the features of a file into a high-dimensional space and then use distance within space as a measure of similarity. This may not be feasible since it involves a O(n^2) complexity of the algorithm. If these file-to-vector mappings are reduced ow a one-dimensional space, then the data points could be sorted in O(n log n) time - a big increase of detection speed.
I will implement the SimHash algorithm presented in detail in . The idea is the following: using a hash function that hashed similar files to similar values, file similarity could be determined simply by comparing pre-sorted hash key values.
I will implement a family of similarity hash functions which will do this, as described in . Furthermore, the performance will be enhanced by storing auxiliary data used to compute the hash keys. This data will be used as a second filter after a hash key comparison indicates that two files are potentially similar.
Properties for the similarity function and the algorithm:
- very similar files map to very similar or even the same hash key;
- distance between keys should be some measure of the difference between files. This would lead to keys proportional to file sizes and this would create false positives. The auxiliary data mentioned above will provide an easy and efficient way of refining the similarity detection.
- the metric used will be a binary metric (simhash operates at byte level).
- given a similarity metric, there needs to be a threshold to determine how close within the metric files need to be to count as similar.
From a distributed point of view, the algorithm above is very suited for a MapReduce implementation. The sketch is the following:
I. MAP phase
In this phase we compute the hash for a file along with additional info which serves as a second filter in similarity detection.
It outputs (File, simhash_key).
II. REDUCE phase
Once every file has a simhash, group every file with the same simhash into a cluster, sorted by the simhashkey (the key is an integer) .
- The similarity check can be done in main memory.
1. Batch for sim-hashing the files. The hashes will be stored either on normal file system (for small tests), HDFS or on HBase.
2. A tool for answering the next type of queries:
- retrieving a set of files, similar to a given file;
- retrieving all pairs of similar files.
There will be the option to run this as a standalone program, for tests or smaller scale purposes.
3. Documentation and unit tests for the written code.
4. Getting started tutorials for SimHash.
5. Demos for SimHash applied to 20newsgroups and Wikipedia data sets.
4. Benefits for the Mahout community:
1) A distributed tool for efficient similarity detection of files in large datasets. Algorithms for detecting similar files can also be useful for classification purposes and as an aid to search. Next release of Mahout will contain this functionality.
2) This will be used at smaller scale also, by running it in an "urn-distributed" mode, for small scale datasets.
3) Good tutorials on how to work with this tool.
1). Community Bonding Period (21th April to 24rd May)
- 1 week: Familiarize with Mahout, hadoop, and the existing clustering algorithms;
- 2 weeks: Understand the data structures (Vector, ClusterBase, Writable and other interfaces and classes used in Mahout) and start initial planning of the system in terms of interactions and data structures.
- 1 week: Setup a hadoop cluster formed of 2-3 nodes;
- During this time, I plan to speak with my mentor and ask him very specific questions about the plans I make for the implementation.
2). May 24th to Jul 12th
- Week 1: Detailed planning on how the objects would interact (classes, methods, the Vector interfaces I plan to use etc.) and ask feedback from the mentor.
- Week 2: Write the code for hashing files (a batch process which will work as an indexer), according to the algorithm, in a TDD style. At the end tests will accompany the code. The results will be stored on normal file system or HDFS.
- Week 3: Half week: Test the program in "single" mode (urn-distributed). Start interacting with HBase if time permits.
- Week 4: Interact with HBase; The algorithm will store the hashes in HBase for fast retrieval.
- Week 5: Test the algorithm on different datasets: local file system, 20newsgroups and wikipedia.
- Week 6, 7: Add the distributed MapReduce implementation (based on hadoop) and write unit tests for it;
- Week 8: Continue with unit tests and test the distributed implementation on the data sets for local file system, HDFS and HBase. Fix outstanding bugs. Analyze results.
- Week 9: Continue with testing and bug fixing and make sure everything is ready for mid-term evaluation;
Mid-term evaluation period.
3). Jul 16th - Aug 9th
- Week 10: Extend the batch tool with different options (different hash functions families, threshold for similarity). Experiment with different hash functions families; Give the user an option to specify hash functions.
- Week 11: Finish off the demos for the 20newsgroups and wikipedia datasets. Write some tests for them. Publish the results from the demos on the wiki.
- Week 12: Write the getting started guides on how to use the SimHash tool;
- Week 13: Finish off everything, fix outstanding bugs, clean and document undocumented code.
My name is Cristi Prodan, I am 23 years old and currently a 2nd year student pursuing a MSc degree in Computer Science. During the past year, I have been interested in the study of machine learning mainly Recommender Systems and clustering techniques. I have heard about hadoop and Mahout and I found challenging the idea of working with them if I ever have the chance. My dissertation paper on Recommender Systems and I will use Mahout at it's core.
Since I heard that The Apache Software Foundation is willing to participate in this year edition of Google Summer of Code, I was eager to try my hand at contributing to the Mahout project. Being a subscriber to the list since December 2009, I started to interact with the community and presenting my ideas. I was mostly interested in the MinHash algorithm  (which may also be used for similarity detection), whose implementation was started by a contributor. After analyzing his code and seeking for advice from the other members, I have committed my first patch to Mahout (on JIRA,
During this time I have also skimmed through the wiki and .
At university, I have extensively worked with Java. I am also a big fan of Ruby, Python and currently started digging into Erlang. In the past two years, in the free time, I did some freelancing, working on various web applications.
 Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity Detection, December 13, 2007 (Manning MEAP).
 Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News Personalization: Scalable Online Collaborative Filtering, WWW 2007.
 Owen, Anil - Mahout in Action. Manning, 2010.
This proposal is partly inspired by the one of Konstantin Kafer http://drupal.org/files/application.pdf