MAHOUT-931

Implement a pluggable outlier removal capability for cluster classifiers

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.7
    • Component/s: Classification, Clustering
    • Labels:
      None

      Description

      A pluggable outlier removal capability is needed when classifying vectors into clusters. The classification and outlier removal implementations should be completely separate entities for better abstraction.
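
A minimal sketch of the separation the description asks for, assuming a classifier that returns one probability per cluster. Every type name here (PointClassifier, OutlierFilter, ClassificationPipeline) is hypothetical and is not taken from the Mahout code base or from the attached patches.

```java
import java.util.List;

// Hypothetical types for illustration; none of these names come from Mahout.

/** Scores a point against every cluster and knows nothing about outliers. */
interface PointClassifier {
    double[] classify(double[] point); // one probability per cluster
}

/** Decides whether a scored point is an outlier and knows nothing about scoring. */
interface OutlierFilter {
    boolean isOutlier(double[] clusterProbabilities);
}

/** Wires the two together; either side can be swapped without touching the other. */
final class ClassificationPipeline {
    private final PointClassifier classifier;
    private final OutlierFilter filter;

    ClassificationPipeline(PointClassifier classifier, OutlierFilter filter) {
        this.classifier = classifier;
        this.filter = filter;
    }

    /** Returns the most probable cluster index per point, or -1 for a removed outlier. */
    int[] assign(List<double[]> points) {
        int[] assignments = new int[points.size()];
        for (int i = 0; i < points.size(); i++) {
            double[] probabilities = classifier.classify(points.get(i));
            assignments[i] = filter.isOutlier(probabilities) ? -1 : argMax(probabilities);
        }
        return assignments;
    }

    private static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

Because the pipeline only sees the two interfaces, a different outlier rule can be plugged in without changing the classifier, which is the abstraction the description is after.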

      1. MAHOUT-931
        18 kB
        Paritosh Ranjan
      2. MAHOUT-931
        31 kB
        Paritosh Ranjan

          Activity

          Paritosh Ranjan added a comment -

          This issue got resolved with MAHOUT-929.

          Hudson added a comment -

          Integrated in Mahout-Quality #1371 (See https://builds.apache.org/job/Mahout-Quality/1371/)
          MAHOUT-929, MAHOUT-931. Implemented mapreduce version of ClusterClassificationDriver with outlier removal capability.
          Changed output of sequential to WeightedVectorWritable. Fixed and added test cases. (Revision 1294454)

          Result = SUCCESS
          pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294454
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationConfigKeys.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationMapper.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java
          Hudson added a comment -

          Integrated in Mahout-Quality #1368 (See https://builds.apache.org/job/Mahout-Quality/1368/)
          MAHOUT-931, MAHOUT-929. Added emitMostLikely and threshold based outlier removal capability in ClusterClassificationDriver. (Revision 1293874)

          Result = SUCCESS
          pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293874
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java
          Jeff Eastman added a comment -
          • 929: Yes, use the existing ClusterClassifier to write sequential and mapreduce versions of a post processor to do vector classification. You should not need the ClusterIterator as that is used for the buildCluster phase.
          • 930: No, buildClusters runs to completion on all vectors before clusterPoints is called on them. Currently, it is not possible to run the clusterPoints without first running buildClusters. With the post processor, they will be completely independent jobs (the existing CLI drivers may still bundle them for compatibility).
          • 931: Yes, a probability-based threshold would work with the current ClusterClassifier API. A distance-based threshold (like Canopy T1 pruning) would need a different mechanism.
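
The independence of the two phases described under 930 can be sketched roughly as follows. This is a simplified one-dimensional stand-in, not Mahout's implementation: the class TwoPhaseClustering and its method signatures are assumptions made for illustration, and only the phase names buildClusters and clusterPoints are borrowed from the discussion.

```java
import java.util.Arrays;

// Hypothetical, simplified 1-D stand-in; not Mahout's buildClusters/clusterPoints code.
final class TwoPhaseClustering {

    /** Phase 1: iterate over ALL points to train the centroids (plain k-means). */
    static double[] buildClusters(double[] points, double[] initialCentroids, int iterations) {
        double[] centroids = Arrays.copyOf(initialCentroids, initialCentroids.length);
        for (int iter = 0; iter < iterations; iter++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];
            for (double p : points) {
                int c = nearest(p, centroids);
                sums[c] += p;
                counts[c]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] > 0) {
                    centroids[c] = sums[c] / counts[c];
                }
            }
        }
        return centroids;
    }

    /** Phase 2: a separate pass that only classifies; it never updates the centroids. */
    static int[] clusterPoints(double[] points, double[] trainedCentroids) {
        int[] assignments = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            assignments[i] = nearest(points[i], trainedCentroids);
        }
        return assignments;
    }

    /** Index of the closest centroid; ties go to the lower index. */
    private static int nearest(double p, double[] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                best = c;
            }
        }
        return best;
    }
}
```

Since clusterPoints takes the trained centroids as plain input, it can run as its own job on any set of models, including models produced by an earlier, unrelated run.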
          Paritosh Ranjan added a comment -

          OK, I will start working in the following order then. I have a few more questions, which I have written inline.

          • 929 implement a new post processor that does only classification as required by the various clusterPoints steps.

          The new post processor for clusterPoints() would use the ClusterClassifier to identify which vector belongs to which cluster, at least for K-means, Canopy and Dirichlet (i.e. similar policies exist for them). I need to create a mapreduce and a sequential version of it. Am I correct?

          Is the current ClusterIterator meant for the buildClusters phase, since it also trains the models as it goes?

          • 930 modify the existing drivers to use this post processor rather than their current, custom implementations.

          Currently, buildClusters and clusterPoints run in the same method call for each vector. The new implementation would let buildClusters run for all input vectors first, and only after buildClusters has completely finished would it start a new call to clusterPoints (for all input vectors, using the new post processor).

          • 931 modify the post processor to support pluggable outlier removal.

          Would this be a probability-threshold-based implementation?

          Jeff Eastman added a comment -

          1. I don't see a reason to introduce ClusterConfigs yet. I believe the various CLI arguments can be carried in the appropriate ClusteringPolicy implementations.

          2. Other than augmenting what exists already with some more CLI arguments, I think this is done.

          3. Outlier removal is not a part of the buildClusters step, but rather of the clusterPoints step. I thought you were going to work on those stories while I finish up the mapreduce implementation of buildClusters using ClusterIterator/Classifier/Policies (MAHOUT-933)? This story (MAHOUT-931) should follow after -929 & -930, IMHO, for example:

          • 929 implement a new post processor that does only classification as required by the various clusterPoints steps.
          • 930 modify the existing drivers to use this post processor rather than their current, custom implementations.
          • 931 modify the post processor to support pluggable outlier removal.

          4. This can be done once -933 is complete.

          In any case, this is all post-0.6 stuff. Let's leave trunk where it is with the renaming for now.

          Paritosh Ranjan added a comment -

          Ok.

          Should I proceed like this :

          Step 1) Encapsulate cluster-specific CLI arguments (ClusterConfig and its cluster-specific implementations)

          Step 2) Implement all Clustering policies

          Step 3) Implement outlier removal in policies.
          Step 3a) First cut : use a probability threshold based outlier removal ( as described in previous comment )
          Step 3b) Final cut : Use cluster specific arguments for outlier removal.

          Step 4) Replace Clustering Algorithms with Classifier/Iterator ( for algorithms which can be done using this )

          Regarding naming, I would say that readability should always be given importance. I consider naming an important part of software development, whether working alone or in a team. I prefer readable code to JavaDocs. The current code does not have ample JavaDocs, so at least the naming should be appropriate. I am not pushing for a name change, just expressing my thoughts.

          If you agree with implementing things in the order (steps) I mentioned, I can start implementing them. If you have any suggestions to improve them, please share them.

          Jeff Eastman added a comment -

          Renaming existing entities may be appropriate, but that ought to be done as a separate, independent and agreed-upon change. Otherwise we do not have a consistent vocabulary to discuss the functionality issues. Can we hold off on renaming until we get a bit more of the semantics defined?

          I tend to agree that implementing a set of algorithm-specific clustering policy objects will enable many (not all) of the current implementations to be re-implemented with the ClusterClassifier/Iterator. I think we will need to preserve the existing driver classes, which support CLI argument selection in their run() methods, but the buildClusters methods would be revamped to use the new implementation. It does seem like these policy objects need to encapsulate the relevant CLI arguments, so we are in sync there.

          The clusterPoints methods can also be re-implemented using the new clustering postprocessor in MAHOUT-929.

          Paritosh Ranjan added a comment -

          I am a bit confused.

          Are we planning to get rid of the way clustering is currently done, which is algorithm-specific (e.g. the code in CanopyClusterer)?
          Will the new clustering strategy be "only" what is implemented in ClusterClassifier, i.e. calculating the probabilities of a vector belonging to different models (clusters) and choosing the model with the highest probability?

          If yes, then implementing a clustering policy for each clustering algorithm is all that is needed. And for outlier removal, just a threshold probability will be needed: all vectors below that probability won't be clustered. Am I correct?
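
The approach asked about here (compute a probability per cluster, take the most likely one, drop vectors below a threshold) might look like the sketch below. The class name, the method signatures and the 1 / (1 + distance) weighting are assumptions made for this example. The emitMostLikely flag borrows its name from the commit message above, but the behavior shown for it is a guess, not a description of ClusterClassificationDriver.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the weighting 1 / (1 + distance) is an assumption of this sketch.
final class ThresholdedClassification {

    /** Turns distances to each cluster center into probabilities that sum to 1. */
    static double[] probabilities(double[] point, double[][] centers) {
        double[] pdf = new double[centers.length];
        double total = 0.0;
        for (int c = 0; c < centers.length; c++) {
            pdf[c] = 1.0 / (1.0 + euclidean(point, centers[c]));
            total += pdf[c];
        }
        for (int c = 0; c < pdf.length; c++) {
            pdf[c] /= total;
        }
        return pdf;
    }

    /**
     * Returns the cluster indices a point is emitted to. An empty list means the
     * point is below the threshold for every candidate and is treated as an outlier.
     */
    static List<Integer> emit(double[] probabilities, double threshold, boolean emitMostLikely) {
        List<Integer> clusters = new ArrayList<>();
        if (emitMostLikely) {
            // Only the single most probable cluster is a candidate.
            int best = 0;
            for (int c = 1; c < probabilities.length; c++) {
                if (probabilities[c] > probabilities[best]) {
                    best = c;
                }
            }
            if (probabilities[best] >= threshold) {
                clusters.add(best);
            }
        } else {
            // Every cluster that clears the threshold receives the point.
            for (int c = 0; c < probabilities.length; c++) {
                if (probabilities[c] >= threshold) {
                    clusters.add(c);
                }
            }
        }
        return clusters;
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

A point that sits halfway between two centers gets probability 0.5 for each, so with a threshold of 0.6 it is emitted to neither cluster and is effectively removed as an outlier.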

          Till now, I have been thinking that the clustering code just needs to be refactored out ( without changing the implementation ). If this is the case, then, I think, I have been proceeding in the correct direction ( in terms of design ).

          However, I suspect that we are not in sync regarding the way of implementation. I think you want to change the clustering implementation to a cluster classification implementation with outlier removal (and completely get rid of the algorithm-specific implementation, which makes sense).

          So, it would be really helpful if you can clarify my doubts.

          Paritosh Ranjan added a comment -

          I think the clustering policy is all that is needed for extensibility. The design changes I made are:

          a) Passing the vector rather than the probability to the clustering policy. I think this might be needed for clustering/outlier removal. It might help in transforming the vector or adding a weight before classification (thinking of some future functionality).
          b) Added ClusterConfig objects to the policies. Now the clustering policy will know all about the clustering parameters used, so it will be able to classify accordingly.
          c) ClusterConfig objects will emerge as generic cluster configuration objects, which can be used anywhere in the clustering algorithms. Right now, there are a bunch of clustering parameters scattered through method calls.

          I am in the habit of renaming and cleaning things up while coding, so it just happened.

          Jeff Eastman added a comment -

          This patch looks to be mostly a rename of existing classes. I'm not one to be hung up on names, but I don't understand why the first thing you are proposing is to rename everything?

          Paritosh Ranjan added a comment -

          The ClusterConfig class in the patch can further be used to group all the clustering parameters of the different clustering algorithms in one class. This will help get rid of the long parameter lists in the run() methods of the clustering drivers.
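
To illustrate the parameter-grouping idea, here is a small sketch of a configuration object replacing a long argument list. It does not reproduce the ClusterConfig class from the patch, whose contents are not shown in this thread: KMeansConfigSketch, DriverSketch and the three fields are hypothetical names chosen for the example.

```java
// Hypothetical parameter object; the fields are illustrative, not the patch's ClusterConfig.
final class KMeansConfigSketch {
    private final int numClusters;
    private final int maxIterations;
    private final double convergenceDelta;

    KMeansConfigSketch(int numClusters, int maxIterations, double convergenceDelta) {
        // Validation lives in one place instead of being repeated in every driver method.
        if (numClusters < 1 || maxIterations < 1 || convergenceDelta < 0.0) {
            throw new IllegalArgumentException("invalid clustering parameters");
        }
        this.numClusters = numClusters;
        this.maxIterations = maxIterations;
        this.convergenceDelta = convergenceDelta;
    }

    int numClusters() {
        return numClusters;
    }

    int maxIterations() {
        return maxIterations;
    }

    double convergenceDelta() {
        return convergenceDelta;
    }
}

final class DriverSketch {
    // Before: run(input, output, numClusters, maxIterations, convergenceDelta, ...)
    // After: one configuration argument travels through the call chain.
    static String run(String input, String output, KMeansConfigSketch config) {
        return "k=" + config.numClusters()
            + ", maxIterations=" + config.maxIterations()
            + ", convergenceDelta=" + config.convergenceDelta()
            + ", " + input + " -> " + output;
    }
}
```

Adding a new parameter then means adding a field to the configuration class rather than changing the signature of every method that passes the values along.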

          Paritosh Ranjan added a comment -

          I was thinking about the implementation and interface design, and thought it could best be described using some code.
          I think that this interface design will be able to handle almost all future implementation changes in cluster classification.
          If you have suggestions to improve it, I can work on them. Otherwise, I think we can commit it and build on it.

          Paritosh Ranjan added a comment -

          Ok, I will try to submit a patch for it soon.

          Jeff Eastman added a comment -

          I agree that defining the interfaces for cluster classification and outlier removal is a good place to start. Why don't you take a stab at it since you seem to have some ideas in mind?

          Paritosh Ranjan added a comment -

          This story depends on the implementation/design of MAHOUT-930. I think MAHOUT-930's design of vector classification is chalked out pretty nicely. We can start working on implementing all the policies and other improvements.

          But before fully implementing the cluster classification, I think it would be good to at least finalize the interface for outlier removal. I also think that binding it only to outlier removal is not going to help forever.

          So, following the open/closed principle, let's close it for further modification by plugging a Collection<Strategy> into the Policy. A Strategy can be outlier removal or any other feature that can be developed by implementing the Strategy interface, so this will also keep it open for extension. "Strategy" is just a thought; it can be any other name.

          I will try to submit a patch for some mock/Canopy outlier removal first, by implementing "Strategy". If the design works and looks good, then the design part would be over.

          Does it look like a good way to proceed? Any suggestions?
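
One possible reading of the Collection<Strategy> proposal, as a sketch only. The Strategy interface shape, ThresholdOutlierStrategy and PolicySketch are hypothetical. The comment fixes neither the method signature nor what a strategy operates on, so this example assumes each strategy transforms the per-cluster probabilities of one vector.

```java
import java.util.Collection;

// Hypothetical names; "Strategy" is only the placeholder used in the comment above.

/** A pluggable step that may adjust the per-cluster probabilities of one vector. */
interface Strategy {
    double[] apply(double[] clusterProbabilities);
}

/** Example strategy: zero out probabilities below a threshold (outlier removal). */
final class ThresholdOutlierStrategy implements Strategy {
    private final double threshold;

    ThresholdOutlierStrategy(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public double[] apply(double[] clusterProbabilities) {
        double[] result = clusterProbabilities.clone(); // leave the input untouched
        for (int i = 0; i < result.length; i++) {
            if (result[i] < threshold) {
                result[i] = 0.0;
            }
        }
        return result;
    }
}

/** The policy stays closed for modification; new behavior arrives as new strategies. */
final class PolicySketch {
    private final Collection<Strategy> strategies;

    PolicySketch(Collection<Strategy> strategies) {
        this.strategies = strategies;
    }

    /** Applies every strategy in the collection's iteration order. */
    double[] select(double[] clusterProbabilities) {
        double[] current = clusterProbabilities;
        for (Strategy strategy : strategies) {
            current = strategy.apply(current);
        }
        return current;
    }
}
```

If the order of strategies matters, passing a List keeps the iteration order predictable; a plain Collection makes no such promise.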


            People

            • Assignee:
              Paritosh Ranjan
              Reporter:
              Paritosh Ranjan