Mahout / MAHOUT-929

Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.7
    • Component/s: Classification, Clustering
    • Labels:
      None

      Description

      The current clustering drivers have a -cp option that produces a clusteredPoints directory containing the input vectors classified by the final clusters the algorithm produced. This option is implemented redundantly in each of those drivers.

      • Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
      • Implement a pluggable outlier removal capability for this classifier.
      • Consider building off of the ClusterClassifier & ClusterIterator ideas.

      Attachments

      1. Mahout-929 (13 kB, Paritosh Ranjan)
      2. Mahout-929 (30 kB, Paritosh Ranjan)
      3. Mahout-929 (28 kB, Paritosh Ranjan)
      4. Mahout-929 (11 kB, Paritosh Ranjan)

        Issue Links

          Activity

          Jeff Eastman created issue -
          Paritosh Ranjan added a comment -

          I think it would be difficult to manage discussions and patches for all three issues (the points mentioned) in this single Jira issue.

          In agile terms, too, this user story is big and tries to do too many things.

          Would it be good to create three sub issues for the three points mentioned, as they are related? I think there is also an order in developing them, so, it would also be good to make sub issues dependent on each other (in order). If you agree, then we can create them.

          Jeff Eastman added a comment -

          Sure, the first two at least are pretty significant stories. The last is more of a design constraint on the first story. Go ahead and subdivide if you wish.

          Paritosh Ranjan made changes -
          Field Original Value New Value
          Link This issue incorporates MAHOUT-930 [ MAHOUT-930 ]
          Paritosh Ranjan made changes -
          Link This issue incorporates MAHOUT-931 [ MAHOUT-931 ]
          Paritosh Ranjan made changes -
          Link This issue incorporates MAHOUT-933 [ MAHOUT-933 ]
          Paritosh Ranjan added a comment -

          Created a sequential version of ClusterClassifier. A test case is also present.

          In the next patch I will also add the MapReduce version. It will be implemented in more or less the same fashion.

          Please review the patch to find any early problems. The code is in a working state.

          And sorry for taking so long, I was very busy with my office work. I did use some of the time to read about recommendation, classification and probability, which I am sure I will be able to use in the future.

          Paritosh Ranjan made changes -
          Attachment Mahout-929 [ 12511318 ]
          Paritosh Ranjan made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Paritosh Ranjan added a comment -

          Added mapreduce version of ClusterClassificationDriver.

          Also added outlier removal functionality.

          Added test cases which demonstrate outlier removal and cluster classification.

          Added JavaDocs.

          Paritosh Ranjan made changes -
          Attachment Mahout-929 [ 12511490 ]
          Paritosh Ranjan added a comment -

          Added License Files.

          Paritosh Ranjan made changes -
          Attachment Mahout-929 [ 12511537 ]
          Jeff Eastman added a comment -

          Sequential version looks good but lacks tests of the MR implementation or at least of the mapper.

          What I get reading the code is that all points with a pdf > clusterClassificationThreshold will be clustered (else ignored as outliers) and that the most likely cluster will be chosen for each vector. To replace the current FuzzyK and Dirichlet capabilities, it will also need another classification threshold to support multiple classifications that the current implementations support.

          As this code is not used yet, it could be committed as-is if you are comfortable but it would still be a WIP. How would you like to proceed?
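
A minimal sketch of the behavior described in this comment, under stated assumptions: `MostLikelyClusterSketch` and `mostLikelyCluster` are hypothetical names and not part of Mahout, and the per-cluster pdf values are passed in as a plain array rather than computed from cluster models. It shows a vector being assigned to its single most likely cluster, or dropped as an outlier when even the best pdf does not exceed the threshold.

```java
import java.util.OptionalInt;

// Hypothetical illustration only; this is not Mahout's ClusterClassificationDriver.
final class MostLikelyClusterSketch {

    // Returns the index of the cluster with the highest pdf, or an empty result
    // when that pdf is not greater than the threshold (the vector is treated as an outlier).
    static OptionalInt mostLikelyCluster(double[] pdfs, double threshold) {
        int best = -1;
        for (int i = 0; i < pdfs.length; i++) {
            if (best < 0 || pdfs[i] > pdfs[best]) {
                best = i;
            }
        }
        if (best < 0 || pdfs[best] <= threshold) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(best);
    }

    public static void main(String[] args) {
        // One vector with a clear winner, one whose best pdf is below the threshold.
        System.out.println(mostLikelyCluster(new double[] {0.1, 0.7, 0.2}, 0.5));
        System.out.println(mostLikelyCluster(new double[] {0.1, 0.3, 0.2}, 0.5));
    }
}
```

Because only one index is ever returned, a sketch like this cannot express the multiple classifications that FuzzyK and Dirichlet need, which is the gap the comment points out.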

          Jeff Eastman made changes -
          Assignee Jeff Eastman [ jeastman ]
          Paritosh Ranjan added a comment -

          I would prefer committing the code because then I can do local changes with more ease.

          Future actions (for me):

          a) Implement (plug in) classifications for Dirichlet and FuzzyK (similar to the classification threshold).
          b) Add a test case for the MR version (at least the Mapper).

          If anything else is needed, please point it out.

          Jeff Eastman added a comment -

          I committed your patch today. Keep it going!

          Paritosh Ranjan added a comment - - edited

          I have added an emitMostLikely feature to vector classification. If set to true, each vector is classified only into the cluster with the maximum pdf.

          However, if clusterClassificationThreshold is present, only vectors whose pdfs are greater than clusterClassificationThreshold are classified. It's a bit different from the previous implementation, but makes more sense if you think in terms of outlier removal.

          So, even Dirichlet and FuzzyKMeans can be classified now.

          The patch only contains changes and test cases for the sequential version for now. I will make changes to mapreduce version with test cases and submit soon.
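
One way to read the interaction of the two settings, sketched under assumptions: `ClusterAssignmentSketch` and `classify` are hypothetical names and not Mahout's API, the pdf values are given directly as an array, and the comparison is strict (`>`) because the comment says "greater than". With `emitMostLikely` set to true a vector gets at most one cluster. With it set to false, every cluster whose pdf exceeds the threshold is emitted, which is what multi-membership algorithms such as FuzzyKMeans and Dirichlet would need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and signature are assumptions.
final class ClusterAssignmentSketch {

    // Returns the indices of the clusters a vector is assigned to.
    // An empty list means the vector was pruned as an outlier.
    static List<Integer> classify(double[] pdfs, double threshold, boolean emitMostLikely) {
        List<Integer> assigned = new ArrayList<>();
        int best = -1;
        for (int i = 0; i < pdfs.length; i++) {
            if (best < 0 || pdfs[i] > pdfs[best]) {
                best = i;
            }
        }
        // Outlier removal: skip the vector if even its best pdf is too low.
        if (best < 0 || pdfs[best] <= threshold) {
            return assigned;
        }
        if (emitMostLikely) {
            assigned.add(best);
            return assigned;
        }
        // Otherwise emit every cluster whose pdf clears the threshold.
        for (int i = 0; i < pdfs.length; i++) {
            if (pdfs[i] > threshold) {
                assigned.add(i);
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        double[] pdfs = {0.45, 0.40, 0.15};
        System.out.println(classify(pdfs, 0.3, true));
        System.out.println(classify(pdfs, 0.3, false));
    }
}
```

A threshold of 0.0 with `emitMostLikely` set to true assigns every vector with a positive best pdf to exactly one cluster, so nothing is pruned in that case.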

          Paritosh Ranjan made changes -
          Attachment Mahout-929 [ 12515093 ]
          Jeff Eastman added a comment -

          Hey Paritosh, why don't you take over this issue since you now have committer karma

          Paritosh Ranjan made changes -
          Assignee Jeff Eastman [ jeastman ] Paritosh Ranjan [ paritoshranjan ]
          Paritosh Ranjan added a comment - - edited

          Assigned to myself.

          I think the cluster classification driver is developed now. I will wait for some time for the ClusterClassificationMapper test case (patch), as we asked on dev.

          Otherwise I will write it and commit it. I might need help while committing for the first time.

          Considering that ClusterClassificationDriver development is done, we need to refactor the KMeans, FuzzyK, Dirichlet, and Canopy drivers.
          I will create separate child issues for refactoring these algorithms (their respective driver classes), so that different people can pick them up in parallel if they want. It will help avoid duplicated effort.

          Jeff, any comments/suggestions?

          Hudson added a comment -

          Integrated in Mahout-Quality #1368 (See https://builds.apache.org/job/Mahout-Quality/1368/)
          MAHOUT-931, MAHOUT-929. Added emitMostLikely and threshold based outlier removal capability in ClusterClassificationDriver. (Revision 1293874)

          Result = SUCCESS
          pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293874
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java
          Saikat Kanjilal added a comment -

          In reading through the ClusterClassificationMapper I have some questions:
          1) Do we need to worry about outlier removal when providing unit tests for the map-reduce version?
          2) Is there a sample class I can look at to see how many mappers and reducers to specify, or is this baked into the unit tests from the mahouttest already?
          3) I am going to start with the simple test Paritosh specified, one that checks whether the vectors were classified correctly. To do this I plan to take most of the code inside ClusterClassificationDriver and change it so the logic works in map-reduce. Let me know if there are issues with this approach.
          4) In the ClusterClassificationDriverTest I noticed we were using 3 clusters. Does it matter how many clusters we create? I was wondering what the relationship is (if any) between the number of clusters and the actual map-reduce classification operation.

          Paritosh Ranjan added a comment -

          1) Do not worry about outlier removal for the first cut. Use emitMostLikely=true and clusterClassificationThreshold = 0.0.
          2) I don't think there is any need to run a Hadoop job to test the mapper. Just test the logic inside mapper. You will need EasyMock or some other mocking framework to do it. Dev mailing list/other existing tests can help to tell other ways to write tests. There is no defined reducer for the job.
          3) I don't think there is any need to take the code inside ClusterClassificationDriver. The point is to test the cluster classification logic inside mapper, not the driver.
          4) It does not matter how many clusters you use. What matters is the clarity of the test cases. It really helps if the functionality to be tested is understandable from the test cases.
          The sequential and mapreduce should produce the same result. So, you can also use the assertions and data used in ClusterClassificationDriverTest, which is for the sequential cluster classification.

          Paritosh Ranjan added a comment -

          I have added the mapreduce version of the ClusterClassificationDriver with outlier removal capability.

          ClusterClassificationDriver is implemented now (only some refactoring and CLI development is left). So, the clustering refactorings can start.

          Saikat, if you want, you can look into ClusterClassificationDriverTest. I have added a MapReduce test case. You can try to add some more test scenarios there. This will help in getting a better understanding of ClusterClassification. Once you understand it, you can try to use it in KMeansDriver.

          Saikat Kanjilal added a comment -

          Ha Paritosh, you beat me to the punch. Pardon my newbieness, I was just reading through the code in more detail. I just created the ClusterClassificationMapperTest and was starting to add code to it. Should I move your map-reduce test case into this class? I will first try to add some more test cases.

          Paritosh Ranjan added a comment -

          Adding a few test cases in ClusterClassificationDriver will help you understand its functionality, which will help in the clustering refactorings. Adding or skipping the mapper test is up to you. Just reiterating: once you understand ClusterClassificationDriver, you can try to use it in KMeansDriver. ClusterClassificationDriver will replace the clusterData phase of KMeansDriver. Feel free to ask questions on MAHOUT-981 regarding the KMeansDriver refactoring.

          Hudson added a comment -

          Integrated in Mahout-Quality #1371 (See https://builds.apache.org/job/Mahout-Quality/1371/)
          MAHOUT-929, MAHOUT-931. Implemented mapreduce version of ClusterClassificationDriver with outlier removal capability.
          Changed output of sequential to WeightedVectorWritable. Fixed and added test cases. (Revision 1294454)

          Result = SUCCESS
          pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294454
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationConfigKeys.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationMapper.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java
          Saikat Kanjilal added a comment -

          Paritosh,
          Some more questions after I read through your code inside ClusterClassificationDriverTest:
          1) it seems that the map-reduce method you added called testVectorClassificationWithOutlierRemovalMR only differs from testVectorClassificationWithOutlierRemoval by the following line: HadoopUtil.delete(conf, classifiedOutputPath);

          2) I was going to add the following test cases inside ClusterClassificationDriverTest (I chose not to add ClusterClassificationMapperTest):

          • testVectorClassificationWithoutOutlierRemovalMR
          • testVectorClassificationWithoutOutlierRemovalChangeThresholdMR
          • testVectorClassificationWithoutOutlierRemovalChangeThreshold (pass in some custom threshold here and mock out expectations)

          Finally, I may add some edge error cases surrounding the above.

          Thoughts? Let me know if you think of other cases to add.

          I want to first spend some time learning this in more detail before diving into the kmeans driver rework.

          Paritosh Ranjan added a comment - - edited

          The MR test differs where the runSequential argument is used. For MR it's false, and for sequential it's true.

          runClustering(pointsPath, conf, false);
          runClassificationWithOutlierRemoval(conf, false);

          Saikat Kanjilal added a comment -

          Paritosh, sorry about my disappearance, I was out for a few days. Anyway, I have added a few tests to the ClusterClassificationDriver. Since I am not a committer, what's the process for submitting my change? Can I submit a patch through the usual means if I'm not a committer?

          Paritosh Ranjan added a comment -

          You can create a patch and attach it to the Jira issue. More about this is written on the How to Contribute page: https://cwiki.apache.org/MAHOUT/how-to-contribute.html

          Jeff Eastman added a comment -

          Paritosh, can you take a look at this patch? If it needs work and you need help closing the issue let me know.

          Paritosh Ranjan added a comment - - edited

          All issues other than MAHOUT-990 are fixed. There is no other patch to review.

          I have not closed this issue since MAHOUT-990 is a subtask of MAHOUT-933, which is linked to this issue. Once we close MAHOUT-990, we will be done with all issues related to this refactoring.

          Jeff Eastman added a comment -

          Resolving as all subtasks have been completed

          Jeff Eastman made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Sean Owen made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Open -> Patch Available | 35d 1h 18m | 1 | Paritosh Ranjan | 20/Jan/12 21:00
          Patch Available -> Resolved | 110d 1h 6m | 1 | Jeff Eastman | 09/May/12 23:07
          Resolved -> Closed | 37d 11h 28m | 1 | Sean Owen | 16/Jun/12 10:35

            People

            • Assignee:
              Paritosh Ranjan
              Reporter:
              Jeff Eastman
            • Votes:
              0
              Watchers:
              1
