Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.4
    • Component/s: Clustering
    • Labels:
      None

      Description

      Minhash clustering performs probabilistic dimension reduction of high dimensional data. The essence of the technique is to hash each item using multiple independent hash functions such that the probability of collision of similar items is higher. Multiple such hash tables can then be constructed to answer near neighbor type of queries efficiently.
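The technique described above can be made concrete with a small, self-contained sketch. Everything below (the class name, the seed constants, the Random-based hash) is illustrative only and is not Mahout's implementation:

```java
import java.util.Arrays;

// Minimal MinHash sketch: each of k hash functions maps every item id to an
// integer, and the signature keeps the minimum hash per function. Similar
// sets then agree on signature positions with probability roughly equal to
// their Jaccard similarity. All constants here are arbitrary illustrations.
public class MinHashExample {
  static int[] signature(int[] items, long[] seeds) {
    int[] sig = new int[seeds.length];
    Arrays.fill(sig, Integer.MAX_VALUE);
    for (int i = 0; i < seeds.length; i++) {
      for (int item : items) {
        // One cheap deterministic "hash function" per seed.
        int h = new java.util.Random(seeds[i] * 31 + item).nextInt();
        sig[i] = Math.min(sig[i], h);
      }
    }
    return sig;
  }

  public static void main(String[] args) {
    long[] seeds = {11, 13, 17, 19, 23, 29};
    int[] a = {1, 2, 3, 4, 5};
    int[] b = {1, 2, 3, 4, 6};  // high overlap with a
    int[] sa = signature(a, seeds);
    int[] sb = signature(b, seeds);
    int agree = 0;
    for (int i = 0; i < seeds.length; i++) {
      if (sa[i] == sb[i]) agree++;
    }
    System.out.println("agreeing signature positions: " + agree + "/" + seeds.length);
  }
}
```

Signatures for near-identical sets agree in most positions, which is what lets hash-table lookups answer near-neighbor queries without pairwise comparison.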

      1. MAHOUT-344-v1.patch
        17 kB
        Ankur
      2. MAHOUT-344-v2.patch
        34 kB
        Cristi Prodan
      3. MAHOUT-344-v3.patch
        41 kB
        Cristi Prodan
      4. MAHOUT-344-v4.patch
        39 kB
        Ankur
      5. MAHOUT-344-v5.patch
        28 kB
        Ankur
      6. MAHOUT-344-v6.patch
        42 kB
        Ankur
      7. MAHOUT-344-v7.patch
        48 kB
        Ankur

        Activity

        Grant Ingersoll added a comment -

        Ankur, do you have a reference for this implementation? Trying to compare this with the original Broder paper.

        Ankur added a comment -

        Grant, The idea behind keyGroups is to concatenate hashes from multiple hash functions to reduce the probability of collision between 2 users that agreed on 1 or more individual hash values. This essentially improves the average similarity of users in a cluster.

        About documentation, I am caught up with a few urgent issues at work and will need more time. Hope to get some free cycles before end of this week.
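Ankur's keyGroups idea (concatenating several consecutive minhash values into one bucket key, so users must agree on all of them before sharing a cluster) can be sketched as follows. The names here are hypothetical and not the patch's actual mapper code:

```java
// Illustrative only: with keyGroups = 2, pairs of minhash values are
// concatenated into a single cluster key, so two users must agree on both
// hashes in a group (an AND condition) before they land in the same bucket.
// This trades recall for higher average intra-cluster similarity.
public class KeyGroupExample {
  static String[] clusterKeys(int[] minHashes, int keyGroups) {
    int numKeys = minHashes.length / keyGroups;
    String[] keys = new String[numKeys];
    for (int g = 0; g < numKeys; g++) {
      StringBuilder key = new StringBuilder();
      for (int j = 0; j < keyGroups; j++) {
        key.append(minHashes[g * keyGroups + j]).append('-');
      }
      keys[g] = key.toString();
    }
    return keys;
  }

  public static void main(String[] args) {
    // 4 minhash values grouped in pairs yield 2 emitted cluster keys.
    for (String k : clusterKeys(new int[]{7, 42, 7, 99}, 2)) {
      System.out.println(k);
    }
  }
}
```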

        Hudson added a comment -

        Integrated in Mahout-Quality #1150 (See https://builds.apache.org/job/Mahout-Quality/1150/)
        MAHOUT-344: added minhash to build-asf-email.sh and to driver.classes.props

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197803
        Files :

        • /mahout/trunk/examples/bin/build-asf-email.sh
        • /mahout/trunk/src/conf/driver.classes.props
        Grant Ingersoll added a comment -

        Also, Ankur, do you have a reference to what your implementation is based off of? I'm mostly wondering about the purpose of the keyGroups option. (I get what it does at the syntax level, wondering about the purpose of it)

        Ankur added a comment -

        I will surely get it done by 08/Nov/11. Apologies for the delay.

        Grant Ingersoll added a comment -

        Ankur, any luck on documenting this stuff?

        Ankur added a comment -

        Sure, will do it in the next 2-3 days.

        Grant Ingersoll added a comment -

        Ankur, could you doc this at https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
        Cristi Prodan added a comment -

        I sure do. I've been silently watching the set of changes. I had a lot to learn while working on it, as well as when seeing others contribute. Glad to see this making it into the trunk, and hoping to contribute more in the future.

        Ankur added a comment -

        I am glad this was accepted. Thank you folks.

        Also, thanks to Cristi Prodan for making progress on this. Cristi, hope you see this.

        Hudson added a comment -

        Integrated in Mahout-Quality #358 (See https://hudson.apache.org/hudson/job/Mahout-Quality/358/) MAHOUT-344
        Sean Owen added a comment -

        I'm gonna submit my flavor of Ankur's patches, with "new Random(11)" left in place. We can iterate from there. Cool with all?

        Ted Dunning added a comment -

        Sorry to chime in late,

        >Random - java.util.Random
        Each mapper should get an exact copy of the hash functions for constructing the minhash signatures, or else the chances of collision are quite low even for highly similar items. That is the reason for hard-coding the seed for Random(). The reason for the test failure is not that; it is that the linear hash function has a higher false positive rate than the murmur and polynomial hash functions. The remedy is to concatenate more hashes per key group. For testLinearMinHashMRJob() I changed the number of hash functions to 20 and the number of key groups to 4, and now the test passes successfully.

        This sounds like we should be using murmurHash instead. It is almost as fast as j.u.Random and has MUCH better properties. This kind of change is fine to defer to 0.5.

        Ankur added a comment -

        Sean,
        Thanks for agreeing to merge the style and latest code changes. I'll study your style changes and hopefully do better next time. Code changes should only be in 'LastfmDataConverter' & 'LastfmClusterEvaluator'.

        Updated patch with fixed Javadoc comments and added support for converting the LastFM 1K users dataset http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html into vector format for running MinHash.

        Sadly the changes in seed generation we discussed did not help much, and the addition of RandomUtils.getRandom(11) was causing testLinearMinHashMRJob() to fail consistently, so I reverted the code change in HashFactory.java.

        Sean Owen added a comment -

        Got it, I understand the point of this class now. Well I might only humbly suggest "RandomUtils.getRandom(11)" for sheer consistency in that line.

        I also have a modified version of your patch locally where I tried to adjust the items that were giving off minor style warnings. I am happy to bear the burden of merging your latest changes. If they're small, it might be simple too to commit and let you add them in too.

        Ankur added a comment -

        Having the seeds as fixed parameters does not give us a family of hash functions; it will be the same hash function repeated K times. What we need is a family of hash functions F where each function h(s) is defined as follows:-
        h(s) = index of the first element in a random permutation of all possible values that is also present in set s
        Then the probability that two sets agree on a hash value is equal to their Jaccard coefficient, i.e.
        P[h(s1) = h(s2)] = |s1 intersect s2| / |s1 union s2|

        Having multiple different hash functions is an implementation trick that simulates the random permutation of all possible values. Having said that, there is nothing that prevents our seed generator from generating duplicate seeds, essentially making 2 or more hash functions identical. Also, I think it will make a difference if we accept only prime seeds. What do you think?
        Let me make the change of discarding duplicate seeds and accepting only primes to see what happens.

        BTW I completed the javadoc comments and also added support for the http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-1K.tar.gz dataset, which I think is denser.
        I will post a new patch shortly.
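The Jaccard property Ankur states can be checked empirically. The following is a rough simulation with arbitrary mixing constants, not the code under review: each seed plays the role of one random permutation, and the fraction of seeds on which the two sets' minimum hashes agree approaches their Jaccard coefficient.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Empirical check of P[h(s1) = h(s2)] ~= |s1 ^ s2| / |s1 v s2|. Each "hash
// function" is a random permutation of the universe, simulated by mixing a
// per-trial seed with each element and taking the minimum hash value.
public class JaccardCheck {
  static int minHash(Set<Integer> s, long seed) {
    int min = Integer.MAX_VALUE;
    for (int x : s) {
      // Arbitrary 64-bit mixing constants to decorrelate consecutive seeds.
      long mixed = (seed * 0x9E3779B97F4A7C15L) ^ (x * 0xC2B2AE3D27D4EB4FL);
      min = Math.min(min, new Random(mixed).nextInt());
    }
    return min;
  }

  static double estimate(Set<Integer> s1, Set<Integer> s2, int trials) {
    int agree = 0;
    for (long seed = 0; seed < trials; seed++) {
      if (minHash(s1, seed) == minHash(s2, seed)) agree++;
    }
    return (double) agree / trials;
  }

  public static void main(String[] args) {
    Set<Integer> s1 = new HashSet<>();
    Set<Integer> s2 = new HashSet<>();
    for (int i = 0; i < 80; i++) s1.add(i);    // {0..79}
    for (int i = 40; i < 120; i++) s2.add(i);  // {40..119}; Jaccard = 40/120
    System.out.println("estimated Jaccard: " + estimate(s1, s2, 2000)
        + " (exact: " + (40.0 / 120) + ")");
  }
}
```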

        Sean Owen added a comment -

        I see. So really we might as well take the first 6 integers generated by "new Random(11)" and stick them in as the fixed parameters for these hash functions. If we do that, might we not make sure to pick good values rather than leave it to the RNG? For example, it's bad if "seedA" in the linear hash is 0. I'm sure it isn't. But if they're both even, that's not so great I think?

        That is, what happens if I just stick in some arbitrary primes here?
        Would that remove the need to divide modulo a large prime at the end?

        (Also, does it matter that 'byteVal' values can be negative? It doesn't really seem so, from the math, but I stopped to wonder at it for a moment.)
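For context, the linear hash under discussion is generally of the form h(x) = (a*x + b) mod p. The sketch below uses illustrative constants, not the patch's HashFactory, and shows why a zero multiplier is degenerate and why prime parameters are a reasonable pick:

```java
// A linear hash family h(x) = (a * x + b) mod p over a prime p. If a == 0
// the hash is constant (everything collides); if a shares a factor with the
// modulus, the effective range shrinks. Drawing a from [1, p-1] with p prime
// avoids both problems. Constants below are illustrative only.
public class LinearHash {
  static final long PRIME = 2147483647L;  // 2^31 - 1, a Mersenne prime
  final long a, b;

  LinearHash(long a, long b) {
    this.a = a;
    this.b = b;
  }

  int hash(int x) {
    // Double mod keeps the result non-negative even for negative inputs.
    return (int) (((a * x + b) % PRIME + PRIME) % PRIME);
  }

  public static void main(String[] args) {
    LinearHash degenerate = new LinearHash(0, 5);  // a == 0: constant hash
    LinearHash ok = new LinearHash(1299721, 7);    // a prime multiplier
    System.out.println("constant hash collides: "
        + (degenerate.hash(1) == degenerate.hash(2)));
    System.out.println("prime-multiplier hash collides: "
        + (ok.hash(1) == ok.hash(2)));
  }
}
```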

        Ankur added a comment -

        > Catching OutOfMemoryError
        LastfmClusterEvaluator - I was getting OOME as I was trying to run this on simply concatenated sequence files. I don't think we can proceed in this case, as simply concatenated sequence files essentially mean corrupt data. Once this was fixed there was no problem, so we can safely remove the OOME catch block.
        LastfmDataConverter - The code works fine with -Xmx512m heap settings, which should be reasonable for 1.5G of uncompressed data. This is what is suggested. Can't think of a reasonable in-memory approach.

        >Random - java.util.Random
        Each mapper should get an exact copy of the hash functions for constructing the minhash signatures, or else the chances of collision are quite low even for highly similar items. That is the reason for hard-coding the seed for Random(). The reason for the test failure is not that; it is that the linear hash function has a higher false positive rate than the murmur and polynomial hash functions. The remedy is to concatenate more hashes per key group. For testLinearMinHashMRJob() I changed the number of hash functions to 20 and the number of key groups to 4, and now the test passes successfully.

        Sean Owen added a comment -

        I'm ready to commit after adjusting the code for project code style. However there are two outstanding issues IMHO:

        Catching OutOfMemoryError – this looks bad. Is it really possible to proceed in this case versus just crashing?
        Random – java.util.Random is used directly and hard-coded to a seed. Should it not use an un-seeded RNG? and use RandomUtils.getRandom() in particular so that tests are repeatable? But then the tests fail for me.

        Don't need a new patch, perhaps just targeted advice on how we might adjust for these as necessary.

        Ankur added a comment -

        Updated (and hopefully) penultimate patch with

        1. All the code issues fixed after another round of cleanup and testing.
        2. Example code for conversion of LastFM dataset into Mahout vector format written to sequenceFiles.
        3. Cluster quality evaluation code for analyzing the precision at various similarity thresholds.

        Here are the steps to run the clustering code on Last FM dataset after applying the patch and building mahout jars :-

        1. Download http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz and uncompress and untar into a local dir.

        2. Run the following command to create data converted into vector format, dumped into sequence files:-

        • java -Xms512m -Xmx512m -cp /path/to/commons-logging-1.0.4.jar:/path/to/log4j-1.2.15.jar:/path/to/hadoop-0.20.2-core.jar:/path/to/mahout-math-0.4-SNAPSHOT.jar:/path/to/mahout-examples-0.4-SNAPSHOT.jar:/path/to/mahout-core-0.4-SNAPSHOT.jar org.apache.mahout.clustering.minhash.LastfmDataConverter /path/to/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv /path/to/lastfm-vector-formated.seq

        3. Upload file 'lastfm-vector-formated.seq' to DFS dir 'lastfm'

        4. Before running the clustering MR job do the following on the shell prompt:-

        • export HADOOP_CLASSPATH=/path/to/mahout-math-0.4-SNAPSHOT.jar:/path/to/mahout-core-0.4-SNAPSHOT.jar:/path/to/commons-cli-2.0-mahout.jar.

        5. Run the clustering MR job on the data. Here is a sample command line

        • hadoop jar /path/to/mahout-core-0.4-SNAPSHOT.jar org.apache.mahout.clustering.minhash.MinHashDriver -Dio.sort.mb=256 -Dio.sort.factor=20 -libjars /path/to/mahout-math-0.4-SNAPSHOT.jar --input lastfm --output lastfm-out --minClusterSize 10 --minVectorSize 10 --hashType polynomial --numHashFunctions 60 --keyGroups 2 --debugOutput true --numReducers 1
        • Note:- debugOutput is set to true so that entire vectors are clustered which later on will be used for similarity computation in quality evaluation.

        6. Download the file under 'lastfm-out' to local dir and run the evaluation code as follows:-

        • java -cp /path/to/commons-logging-1.0.4.jar:/path/to/log4j-1.2.15.jar:/path/to/hadoop-0.20.2-core.jar:/path/to/mahout-math-0.4-SNAPSHOT.jar:/path/to/mahout-examples-0.4-SNAPSHOT.jar:/path/to/mahout-core-0.4-SNAPSHOT.jar org.apache.mahout.clustering.minhash.LastfmClusterEvaluator lastfm-cluster-data.seq 0.2 0.5

        Here are some of the results I got with different threshold parameters and sampling parameters

        Test Results
        =============
        (A) Listeners in same cluster with similarity above threshold (0.2): 4997
        (B) All listeners: 15564
        Average cluster precision: A/B = 32.11%

        (A) Listeners in same cluster with similarity above threshold (0.3): 1872
        (B) All listeners: 15564
        Average cluster precision: A/B = 12.03%

        The only task remaining here is updating Javadoc comments and incorporating any review comments. Apart from those 2 this should be good to go in.

        Ankur added a comment -

        Updated patch with complete minhash implementation along with Unit test case. However I did not get enough time to finish

        1. Javadoc comments to my satisfaction.
        2. Example code re-write for Last.fm dataset.

        While I'll certainly be completing the javadoc comments within a day's time, the example code rewrite might stretch this to a couple of days.

        Ted/Sean, can you please take a look and provide review comments?

        Ankur added a comment -

        Finally some action from my side

        1. HashFunction is now an interface with a single method - hash().
        2. Implementations of the different hash functions are now moved to a HashFactory that also provides a factory method for fetching hash functions of a requested type (linear, polynomial, murmur).
        3. Minhash mapper/reducer code cleaned up quite a bit.
        4. Added options for minimum vector size and hashType.
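The shape described in points 1 and 2 might look roughly like the sketch below. The signatures and names are assumptions for illustration, not the actual patch; only the linear type is sketched, and the toy hash body is not a real minhash-quality function:

```java
// Sketch of a single-method HashFunction interface plus a factory that hands
// back a family of functions of a requested type, seeded deterministically
// (the fixed seed mirrors the "new Random(11)" discussed on this issue).
interface HashFunction {
  int hash(byte[] bytes);
}

class HashFactorySketch {
  enum HashType { LINEAR, POLYNOMIAL, MURMUR }

  static HashFunction[] createHashFunctions(HashType type, int num) {
    HashFunction[] functions = new HashFunction[num];
    java.util.Random seeds = new java.util.Random(11);  // fixed seed: every
    // mapper builds the identical family, as required for minhash.
    for (int i = 0; i < num; i++) {
      final int a = seeds.nextInt(Integer.MAX_VALUE - 1) + 1;  // avoid a == 0
      final int b = seeds.nextInt();
      switch (type) {
        case LINEAR:
          functions[i] = bytes -> {
            long h = 0;
            for (byte x : bytes) {
              h = a * h + b + x;  // toy linear mix, illustrative only
            }
            return (int) h;
          };
          break;
        default:
          throw new UnsupportedOperationException(type + " not sketched here");
      }
    }
    return functions;
  }
}
```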

        Pending tasks
        1. Fix the Unit test case.
        2. Fix example code over Last FM dataset.
        3. Add Javadoc documentation.

        I hope to complete the above tasks by EOD tomorrow and submit a new patch.

        Ted Dunning added a comment -

        Sounds like end of the week is a good time to decide.

        Ankur added a comment -

        Hi Ted,
        There is a bit of work left in this, specifically:-

        1. Adding murmurhash as an option for the available hash functions. Looks like I can use the checkin of MAHOUT-503.
        2. Unit test case completion demonstrating correctness.
        3. Example code completion and cleanup. After a bit of thought I feel that the average item similarity across all clusters might not be a good criterion for a probabilistic clustering technique like this. Instead I am planning to write a unit test to calculate precision as follows:-

        • Set a minimum similarity threshold for intra-cluster items, e.g. 0.4
        • For each cluster out of a randomly selected subset of all clusters, run a pairwise similarity test to count the true positives (TP) that pass the threshold.
        • Precision = (TP / Total-items-in-clusters)

        Do you see any problems ? Any other suggestions ?

        4. Documentation and cleanup.

        I should be able to provide an updated patch by the end of this week, and with one more round of review and changes this should be good to go in by the end of next week. If the timeline sounds acceptable for 0.4 then we're good; else we'll have to push this one out to 0.5.
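One reading of the precision procedure proposed in point 3, with hypothetical types (a cluster as a list of items, similarity as a callback) rather than the patch's actual evaluator, could be sketched as:

```java
import java.util.List;
import java.util.function.BiFunction;

// For each sampled cluster, run pairwise similarity tests and count the
// pairs that clear the threshold; precision is the passing fraction over
// all tests run. Types and names here are illustrative assumptions.
public class ClusterPrecision {
  static <T> double precision(List<List<T>> clusters,
                              BiFunction<T, T, Double> similarity,
                              double threshold) {
    long passed = 0;
    long tested = 0;
    for (List<T> cluster : clusters) {
      for (int i = 0; i < cluster.size(); i++) {
        for (int j = i + 1; j < cluster.size(); j++) {
          tested++;
          if (similarity.apply(cluster.get(i), cluster.get(j)) >= threshold) {
            passed++;
          }
        }
      }
    }
    return tested == 0 ? 0.0 : (double) passed / tested;
  }

  public static void main(String[] args) {
    // Two toy clusters of integers, with distance turned into a similarity.
    List<List<Integer>> clusters = List.of(List.of(1, 2, 3), List.of(10, 50));
    double p = precision(clusters,
        (a, b) -> 1.0 / (1 + Math.abs(a - b)), 0.4);
    System.out.println("precision: " + p);
  }
}
```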

        Ted Dunning added a comment -

        Ankur?

        How does this look for committing?

        Ankur added a comment -

        Marking as fix for version 0.4. I will review Cristi's patch in a day to see if we need any changes. Besides that, can anyone take a cursory look to suggest what else this patch needs to make it to 0.4?

        Cristi Prodan added a comment -

        See above comment.

        Cristi Prodan added a comment -

        Sorry for the big delay - I had to finish my dissertation and other stuff. Anyway, here are some things I managed to do and for which I commit a patch:

        • Converted the existing code to work with hadoop 0.20+
        • Converted the input of the algorithm to RandomAccessSparseVector
        • Converted the LastFM db into RandomAccessSparseVector format
        • Added command options using the DefaultOptionCreator mechanism
          Running the MinHash clustering algorithm can be done using a configuration like this:

        i(-input) lastfm/med_db_seq
        o(-output) lastfm/med_db_clusters
        mcs(-minClusterSize) 5
        nh(-numHashFunctions) 2
        kg(-keyGroups) 2
        ow(-overwriteOutput)

        • Evaluating the clustering results with the above configuration using the metric suggested by Ankur yields a value of 0.20303965982542901, which is not too good IMO. I will still run tests with other parameters and see what happens.

        The next steps are the following:
        1. write tests for the current code (doing this now);
        2. refactor the code so that it uses "points" and "vectors" instead of "items" and "users". The algorithm will also cluster text files, for finding very similar files.
        3. Write some documentation on how to use the algorithm.
        4. Investigate a more general format for the output algorithm (Vectors or something like that).

        If you have any suggestions I would very much like to hear them.

        Sean Owen added a comment -

        Just cleaning house and pinging this issue. Since it's possibly a 'bug' want to make sure it's not dropped.

        Hide
        Ankur added a comment -

        Just back from a vacation. I am catching up with a lot of things, so I won't be able to review Cristi's changes for the next 3-4 days, but I am hoping that any further changes will be minor.

        Cristi, do you have some results to share from your testing on the last.fm dataset?

        Once this is in we can start working towards using this to generate recommendations.

        Hide
        Sean Owen added a comment -

        Just cleaning house here – is this suitable for committing? I don't see any comments to the contrary. Maybe you are in the best position to comment Ankur?

        Hide
        Cristi Prodan added a comment -

        See comment above for this patch.

        Hide
        Cristi Prodan added a comment -

        Thank you guys for all the encouragement and advice.

        I'm committing my first patch for MinHash clustering. The patch contains the following things:

        • in core - minhash:
        • MinHashMapRed - removed the need for the distributed cache; each mapper generates the same hash functions using the same seed (as per instructions from Ankur).
        • RandomLinearHashFunction - added another random linear hash function of the form h(x) = (ax + b) mod p. p should be as large as possible (> 10000000) and should be prime (not done yet, but committing in this form due to some time restrictions).
        • in examples - minhash directory:
        • DisplayMinHash - contains an example of running min-hash, with the options commented. It's basically the main function from MinHashMapRed.
        • PrepareDataset - this class offers the ability to convert the last.fm database suggested above into a format readable by the MinHash algorithm. It also shows a progress bar with the percent done.
          For the future, I believe that all the code in the algorithm should take a more generalized form and use the Vector classes used by Mahout; then users could either write their own version against a Vector interface or create a tool that converts their dataset to the vector format the code expects.
          MurmurHash is used by PrepareDataset to hash the strings denoting users (in the original last_fm dataset) to integers.
        • TestClusterQuality - takes a clustered file generated by the minhash algorithm and computes the average similarity for each cluster, aggregated over all clusters.
          In each cluster the mean is computed as:
          SUM(similarity(item_i, item_j)) / TOTAL_SIMILARITIES, for i != j, where TOTAL_SIMILARITIES = n! / (k! * (n - k)!), with n = total number of items in the cluster and k = 2 (i.e., the number of unordered pairs).
          The aggregated mean is the mean of all these values.
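        The per-cluster mean described above can be sketched in plain Java. This is a minimal illustration assuming Jaccard similarity between item sets; the class and method names are hypothetical, not the patch's actual code:

        ```java
        import java.util.*;

        public class ClusterQualitySketch {

            // Jaccard similarity between two item sets: |A ∩ B| / |A ∪ B|
            static double jaccard(Set<Integer> a, Set<Integer> b) {
                Set<Integer> inter = new HashSet<>(a);
                inter.retainAll(b);
                Set<Integer> union = new HashSet<>(a);
                union.addAll(b);
                return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
            }

            // Mean pairwise similarity within one cluster: sum over all
            // C(n, 2) = n! / (k! * (n - k)!) unordered pairs (k = 2),
            // divided by C(n, 2).
            static double clusterMean(List<Set<Integer>> cluster) {
                int n = cluster.size();
                if (n < 2) {
                    return 0.0;
                }
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    for (int j = i + 1; j < n; j++) {
                        sum += jaccard(cluster.get(i), cluster.get(j));
                    }
                }
                return sum / (n * (n - 1) / 2.0);
            }

            public static void main(String[] args) {
                // The two items of the first example cluster: users 1 and 2,
                // which share 4 of 6 distinct items.
                List<Set<Integer>> cluster = Arrays.asList(
                    new HashSet<>(Arrays.asList(1, 2, 3, 4, 5)),
                    new HashSet<>(Arrays.asList(1, 2, 3, 4, 6)));
                System.out.println(clusterMean(cluster)); // prints 0.6666666666666666
            }
        }
        ```

        The aggregated mean is then just the average of clusterMean over all clusters (whether the patch uses Jaccard or another similarity is an assumption here).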

        As an example, given the following input:

        1 1 2 3 4 5
        2 1 2 3 4 6
        3 7 6 3 8 9
        4 7 8 9 6 1
        5 5 6 7 8 9
        6 8 7 5 6

        The first column is the user ID. For each user, the rest of the line lists the items preferred (browsed, listened to) by him. The contents of each cluster below follow the same format.

        We get the following output (parameters: 20 hash functions, 4 keygroups (hash indices in a bucket), 2 - minimum items within a cluster):

        CLUSTER ID – 2359983695385880352354530253637788 (items=2)
        =========================================
        2 1 2 3 4 6
        1 1 2 3 4 5

        CLUSTER ID – 236643825172184878353970117486898894 (items=2)
        =========================================
        4 7 8 9 6 1
        3 7 6 3 8 9

        CLUSTER ID – 35606006580772015548743126287496777 (items=2)
        =========================================
        6 8 7 5 6
        5 5 6 7 8 9

        CLUSTER ID – 38797144231157365543316465389702468 (items=2)
        =========================================
        6 8 7 5 6
        5 5 6 7 8 9

        The aggregated average over these clusters is 0.6793650793650793.

        I'm now testing on the last_fm dataset. The problem I currently encounter is that the clustered file is quite big, but I'm working on tuning the parameters.
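        The cluster IDs in the output above arise from concatenating groups of min-hash values into bucket keys. A compact, self-contained sketch of that idea (the hash coefficients, prime, and method names here are illustrative, not the patch's exact code):

        ```java
        import java.util.*;

        public class MinHashSketch {

            // Build cluster keys for one user's item set: take the minimum hash
            // per hash function h(x) = (ax + b) mod p, then concatenate each
            // group of keyGroups consecutive minima into one bucket key.
            static List<String> clusterKeys(int[] items, long[][] ab, int keyGroups) {
                long p = 2147483647L; // a large prime modulus (2^31 - 1)
                long[] min = new long[ab.length];
                Arrays.fill(min, Long.MAX_VALUE);
                for (int h = 0; h < ab.length; h++) {
                    for (int item : items) {
                        min[h] = Math.min(min[h], (ab[h][0] * item + ab[h][1]) % p);
                    }
                }
                List<String> keys = new ArrayList<>();
                for (int g = 0; g + keyGroups <= ab.length; g += keyGroups) {
                    StringBuilder key = new StringBuilder();
                    for (int i = g; i < g + keyGroups; i++) {
                        key.append(min[i]).append('-');
                    }
                    keys.add(key.toString());
                }
                return keys;
            }

            public static void main(String[] args) {
                Random rnd = new Random(42);  // shared seed, as in the patch's approach
                long[][] ab = new long[4][2]; // 4 hash functions h(x) = (ax + b) mod p
                for (long[] f : ab) {
                    f[0] = 1 + rnd.nextInt(1 << 20);
                    f[1] = rnd.nextInt(1 << 20);
                }
                // Users 1 and 2 from the example input share 4 of 6 items, so
                // they are likely to collide on at least one bucket key.
                System.out.println(clusterKeys(new int[]{1, 2, 3, 4, 5}, ab, 2));
                System.out.println(clusterKeys(new int[]{1, 2, 3, 4, 6}, ab, 2));
            }
        }
        ```

        Users whose keys collide land in the same cluster; grouping keyGroups hashes per key lowers the collision probability for dissimilar users, which is the effect Ankur describes above.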

        Hide
        Ankur added a comment -

        Drew, thanks for pitching in as I've been running super busy with some crap

        @Cristi
        That's right, but it's totally unnecessary, as each of the mappers can do its own initialization of the hash functions. They will be the same hash functions if they use the same seed for java.util.Random(). So the distributed cache can be removed altogether with that change. The code will be shorter and simpler.

        What is the min-cluster size you are using? How many hash functions? How many hashes are grouped together?
        We will need some tests to show how good the clusters are. As a start we can compute a simple metric like the average similarity of items within a cluster, aggregated over all clusters.
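        The seeded-initialization point can be sketched in plain Java: two mappers that build their hash functions from the same java.util.Random seed get identical functions, with no distributed cache needed. The class name, seed, and prime below are illustrative assumptions:

        ```java
        import java.util.Random;

        public class SeededHashFunctions {

            // Modulus for h(x) = (a * x + b) mod P, with P a large prime.
            static final long P = 2147483647L; // 2^31 - 1, a Mersenne prime

            // Same seed => same sequence from Random => same (a, b) coefficients.
            static long[][] makeFunctions(int count, long seed) {
                Random rnd = new Random(seed);
                long[][] ab = new long[count][2];
                for (long[] f : ab) {
                    f[0] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
                    f[1] = rnd.nextInt(Integer.MAX_VALUE);
                }
                return ab;
            }

            static long hash(long[] f, long x) {
                return (f[0] * x + f[1]) % P;
            }

            public static void main(String[] args) {
                // Two "mappers" each independently build their own copies
                // from the shared seed, instead of reading a distributed file.
                long[][] mapper1 = makeFunctions(20, 12345L);
                long[][] mapper2 = makeFunctions(20, 12345L);
                for (int i = 0; i < 20; i++) {
                    if (hash(mapper1[i], 42) != hash(mapper2[i], 42)) {
                        throw new AssertionError("functions differ");
                    }
                }
                System.out.println("all 20 hash functions agree"); // prints this line
            }
        }
        ```

        In the MapReduce job, the shared seed would simply be passed through the job configuration, so every mapper reconstructs the same family of functions.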

        Hide
        Drew Farris added a comment -

        Hi Cristi,

        Sounds like a great start. Answers for a couple of your questions:

        Is there a standard formatting for the input of each clustering algorithm, or does the input format follow the same rules for all algorithms, with users writing conversion tools?

        Take a look at the various Vector classes in the math module and the VectorWritable wrapper. Most of the clustering algorithms take vectors of one kind or another as input, and the assumption is that users will write tools to convert their data to these common formats. The wiki page http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html is a good place to start.

        would it be ok if I attach the code which does an example of running min-hash clustering in the examples dir? (it would first convert the dataset format accordingly)

        Go for it, code is good, patches are even better, see: http://cwiki.apache.org/MAHOUT/howtocontribute.html#HowToContribute-Creatingthepatchfile and simply attach it to this issue.

        Hide
        Cristi Prodan added a comment -

        I ran the code on the last.fm data set (2.). Due to the nature of the data set, I had to write a small program that converts it to a format used by the algorithm. I have also clustered similar users instead of songs (the transformation I mentioned above was easier to do), as I wanted to see how the algorithm runs. I've used MurmurHash for mapping artist strings to integers, which can be used by the min-hash algorithm.

        Related to this, I would like to ask how Mahout usually deals with this kind of situation:

        • Is there a standard formatting for the input of each clustering algorithm, or does the input format follow the same rules for all algorithms, with users writing conversion tools?
        • would it be ok if I attach the code which does an example of running min-hash clustering in the examples dir? (it would first convert the dataset format accordingly)

        @Ankur: you are using a DistributedCache for sharing the hash functions. That requires the distributed file to be on HDFS, as far as I know. I believe it would be nice to have a flag or something that allows storing the hashes on a "normal" file system, for testing purposes. What do you think?

        Sorry, everybody, if I'm doing something wrong; it's the first time I'm contributing to an open source project.

        Hide
        Ankur added a comment -

        Appreciate your interest in this. I'd suggest that we pick a dataset to try this on and then make changes as required.
        For ideas:
        1. Clustering a p2p dataset like this - http://warsteiner.db.cs.cmu.edu/db-site/Datasets/graphData/eDoneky-p2p/ - to find nodes close to each other.
        2. Clustering similar items/songs in this - http://www.iua.upf.es/~ocelma/MusicRecommendationDataset/index.html for recommendations.

        Talking about missing things in the implementation:
        1. Option of more hash functions for users to experiment with.
        2. Code for cluster goodness evaluation (Precision/Recall tests?)
        3. Unit tests for completeness.

        Maybe other Mahout folks can take a quick look at the patch and suggest more ideas.

        Hide
        Cristi Prodan added a comment -

        I've studied the min-hash algorithm these past few days, along with your implementation a little bit. I've also been looking through Mahout's code, the wiki, how to contribute, etc.

        I'm thinking of trying my hand at submitting a patch to Mahout before submitting my proposal for GSoC. I would like to extend/improve this implementation. Could you please point out a way/idea on how I might do this? (I would leave its integration with Taste as a second task for me.)

        Thank you.

        Hide
        Ankur added a comment -

        As per "Yonik's law of patches" submitting my implementation. Please feel free to provide ideas for improvement or even submit an improved patch.


          People

          • Assignee:
            Ankur
            Reporter:
            Ankur
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development