Description
A pluggable outlier removal capability is needed while classifying the clusters. The classification and outlier removal implementations should be completely separate entities for better abstraction.
Attachments
- MAHOUT-931 (18 kB), Paritosh Ranjan
- MAHOUT-931 (31 kB), Paritosh Ranjan
Issue Links
- depends upon: MAHOUT-930 Refactor Vector Classification out of Clustering - Make Classification abstract (Closed)
- is part of: MAHOUT-929 Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning (Closed)
Activity
Integrated in Mahout-Quality #1371 (See https://builds.apache.org/job/Mahout-Quality/1371/)
MAHOUT-929, MAHOUT-931. Implemented mapreduce version of ClusterClassificationDriver with outlier removal capability.
Changed output of sequential to WeightedVectorWritable. Fixed and added test cases. (Revision 1294454)
Result = SUCCESS
pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294454
Files :
- /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationConfigKeys.java
- /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
- /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationMapper.java
- /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java
Integrated in Mahout-Quality #1368 (See https://builds.apache.org/job/Mahout-Quality/1368/)
MAHOUT-931, MAHOUT-929. Added emitMostLikely and threshold based outlier removal capability in ClusterClassificationDriver. (Revision 1293874)
Result = SUCCESS
pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293874
Files :
- /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
- /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java
- 929: Yes, use the existing ClusterClassifier to write sequential and mapreduce versions of a post processor to do vector classification. You should not need the ClusterIterator as that is used for the buildCluster phase.
- 930: No, buildClusters runs to completion on all vectors before clusterPoints is called on them. Currently, it is not possible to run the clusterPoints without first running buildClusters. With the post processor, they will be completely independent jobs (the existing CLI drivers may still bundle them for compatibility).
- 931: Yes, a probability-based threshold would work with the current ClusterClassifier API. A distance-based threshold (like Canopy T1 pruning) would need a different mechanism.
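The probability-based threshold mentioned above could look roughly like the following. This is a minimal sketch with illustrative names only (`classify`, `clusterPdfs` and the threshold parameter are assumptions, not the actual Mahout API): the classifier produces a membership probability per cluster, and a point whose best probability falls below the threshold is treated as an outlier.

```java
// Sketch of probability-threshold outlier pruning. All names here are
// illustrative, not the actual Mahout API. A classifier returns a
// membership probability per cluster; a point whose best probability
// falls below the threshold is treated as an outlier and not emitted.
class ThresholdOutlierSketch {

    /** Returns the index of the most likely cluster, or -1 for an outlier. */
    static int classify(double[] clusterPdfs, double threshold) {
        int best = -1;
        double bestPdf = 0.0;
        for (int i = 0; i < clusterPdfs.length; i++) {
            if (clusterPdfs[i] > bestPdf) {
                bestPdf = clusterPdfs[i];
                best = i;
            }
        }
        return bestPdf >= threshold ? best : -1;
    }
}
```

With `emitMostLikely` semantics, only the winning cluster is emitted; the threshold then simply filters out wins that are too weak.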
Ok, I will start working in the following order then. I have a few more doubts, which I have written inline.
- 929 implement a new post processor that does only classification as required by the various clusterPoints steps.
The new post processor for clusterPoints() would use the ClusterClassifier to identify which vector belongs to which cluster, at least for K-Means, Canopy, and Dirichlet (i.e. similar policies exist for them). I need to create mapreduce and sequential versions of it. Am I correct?
The current ClusterIterator is only for the buildClusters phase, since it also trains as it iterates?
- 930 modify the existing drivers to use this post processor rather than their current, custom implementations.
Currently, buildClusters and clusterPoints run in the same method call for each vector. The new implementation would let buildClusters run over all input vectors first, and only after buildClusters has completely finished would it start a separate clusterPoints call (over all input vectors, using the new post processor).
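A toy sketch of that decoupled control flow (the method and class names are assumptions, not the real driver API); the only point illustrated is that phase 2 starts after phase 1 has consumed all input:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-phase driver flow. Vectors are represented as strings
// here purely to show ordering; the real drivers operate on Vector data.
class TwoPhaseDriverSketch {

    static List<String> run(List<String> input) {
        List<String> log = new ArrayList<>();
        // Phase 1: train clusters over every input vector first.
        for (String vector : input) {
            log.add("buildClusters:" + vector);
        }
        // Phase 2: only after training completes, classify every vector
        // in an independent post process.
        for (String vector : input) {
            log.add("clusterPoints:" + vector);
        }
        return log;
    }
}
```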
- 931 modify the post processor to support pluggable outlier removal.
Would this be a probability-threshold-based implementation?
1. I don't see a reason to introduce ClusterConfigs yet. I believe the various CLI arguments can be carried in the appropriate ClusteringPolicy implementations.
2. Other than augmenting what exists already with some more CLI arguments, I think this is done.
3. Outlier removal is not a part of the buildClusters step, rather the clusterPoints step. I thought you were going to work on those stories while I finish up the mapreduce implementation of buildClusters using ClusterIterator/Classifier/Policies (MAHOUT-933)? This story (MAHOUT-931) should follow after -929 & -930, IMHO, for example:
- 929 implement a new post processor that does only classification as required by the various clusterPoints steps.
- 930 modify the existing drivers to use this post processor rather than their current, custom implementations.
- 931 modify the post processor to support pluggable outlier removal.
4. This can be done once -933 is complete.
In any case, this is all post-0.6 stuff. Let's leave trunk where it is with the renaming for now.
Ok.
Should I proceed like this:
Step 1) Encapsulate cluster-specific CLI arguments (ClusterConfig and its cluster-specific implementations)
Step 2) Implement all clustering policies
Step 3) Implement outlier removal in the policies.
Step 3a) First cut: use probability-threshold-based outlier removal (as described in the previous comment)
Step 3b) Final cut: use cluster-specific arguments for outlier removal.
Step 4) Replace the clustering algorithms with the Classifier/Iterator (for the algorithms which can be implemented this way)
Regarding naming, I would say that readability should always be given importance. I consider naming an important part of software development, whether working alone or in a team. I prefer readable code to JavaDocs; the current code does not have ample JavaDocs, so at least the naming should be appropriate. I am not pushing for a name change, just expressing my thoughts.
If you agree with implementing things in the order (steps) I mentioned, then I can start implementing them. If you have any suggestions to improve them, please share them.
Renaming existing entities may be appropriate, but that ought to be done as a separate, independent and agreed-upon change. Otherwise we do not have a consistent vocabulary to discuss the functionality issues. Can we hold off on renaming until we get a bit more of the semantics defined?
I tend to agree that implementing a set of algorithm-specific clustering policy objects will enable many (not all) of the current implementations to be re-implemented with the ClusterClassifier/Iterator. I think we will need to preserve the existing driver classes which support CLI argument selection in their run() methods, but the buildClusters methods would be revamped to use the new implementation. It does seem like these policy objects need to encapsulate the relevant CLI arguments, so we are in sync there.
The clusterPoints methods can also be re-implemented using the new clustering postprocessor in MAHOUT-929.
I am a bit confused.
Are we planning to get rid of the way clustering is currently done, which is algorithm-specific, i.e. the code in CanopyClusterer?
Will the new clustering strategy be "only" what is implemented in ClusterClassifier, i.e. calculating the probabilities of vectors belonging to different models (clusters) and choosing the model with the highest probability?
If yes, then implementing a clustering policy for each clustering algorithm is all that is needed, and for outlier removal just a threshold probability will be needed: all vectors below that probability won't be clustered. Am I correct?
Until now, I have been thinking that the clustering code just needs to be refactored out (without changing the implementation). If that is the case, then I think I have been proceeding in the correct direction (in terms of design).
However, I suspect that we are not in sync regarding the implementation approach. I think you want to change the clustering implementation into a cluster classification implementation with outlier removal (and completely get rid of the algorithm-specific implementations, which makes sense).
So, it would be really helpful if you could clarify my doubts.
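The "classification-only" view discussed above can be sketched as a per-algorithm policy that reduces clustering to "score each model, pick the best". This is a minimal illustration with made-up names, assuming a simple K-Means-like inverse-distance scoring; it is not the real ClusteringPolicy interface:

```java
import java.util.List;

// Illustrative sketch (not the real Mahout interfaces): each algorithm
// supplies a policy that turns a point into a probability per cluster.
interface PolicySketch {
    double[] classify(double[] point, List<double[]> clusterCenters);
}

// A K-Means-like policy scores by inverse distance to each center and
// normalizes the scores into probabilities.
class DistancePolicySketch implements PolicySketch {
    public double[] classify(double[] point, List<double[]> centers) {
        double[] pdf = new double[centers.size()];
        double sum = 0.0;
        for (int i = 0; i < centers.size(); i++) {
            double[] c = centers.get(i);
            double d = 0.0;
            for (int j = 0; j < point.length; j++) {
                d += (point[j] - c[j]) * (point[j] - c[j]);
            }
            pdf[i] = 1.0 / (1.0 + Math.sqrt(d)); // closer center, higher score
            sum += pdf[i];
        }
        for (int i = 0; i < pdf.length; i++) {
            pdf[i] /= sum; // normalize so the scores behave like probabilities
        }
        return pdf;
    }
}
```

Under this view, outlier removal needs nothing algorithm-specific: it only looks at the best probability the policy produced.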
I think the ClusteringPolicy is all that is needed for extensibility. The design changes I made are:
a) Passing the vector rather than the probability to the clustering policy. I think this might be needed for clustering/outlier removal, and it might help in transforming the vector or adding weight before classification (thinking of some future functionality).
b) Adding ClusterConfig objects to the policies. Now the clustering policy knows all about the clustering parameters used, so it can classify accordingly.
c) ClusterConfig objects will emerge as generic cluster configuration objects which can be used anywhere in the clustering algorithms. Right now, a bunch of clustering parameters are scattered through method calls.
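A hypothetical ClusterConfig along the lines of (b) and (c) might look like the following. The field names are guesses at typical clustering CLI parameters, not the actual contents of the patch:

```java
// Hypothetical ClusterConfig sketch: groups the CLI parameters that are
// currently scattered across driver run() signatures into one immutable
// object a policy can carry. Field names are assumptions.
class ClusterConfigSketch {
    private final double convergenceDelta;
    private final int maxIterations;
    private final boolean runClustering;

    ClusterConfigSketch(double convergenceDelta, int maxIterations, boolean runClustering) {
        this.convergenceDelta = convergenceDelta;
        this.maxIterations = maxIterations;
        this.runClustering = runClustering;
    }

    double getConvergenceDelta() { return convergenceDelta; }
    int getMaxIterations() { return maxIterations; }
    boolean isRunClustering() { return runClustering; }
}
```

Algorithm-specific subclasses (e.g. a Canopy variant carrying T1/T2) could then extend this base without lengthening any run() signature.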
I am in the habit of renaming/cleaning things while coding, so it just happened.
This patch looks to be mostly a rename of existing classes. I'm not one to be hung up on names, but I don't understand why the first thing you are proposing is to rename everything?
The ClusterConfig class in the patch can be further used to group all the clustering parameters of the different clustering algorithms into one class. This will help get rid of the long parameter lists in the run() methods of the clustering drivers.
I was thinking about the implementation and interface design, and thought it could best be described using some code.
I think this interface design will be able to handle almost all future implementation changes in cluster classification.
If you have suggestions to improve it, I can work on them. Otherwise, I think we can commit it and build on it.
I agree that defining the interfaces for cluster classification and outlier removal are a good place to start. Why don't you take a stab at it since you seem to have some ideas in mind?
This story depends on the implementation/design of MAHOUT-930. I think MAHOUT-930's design of vector classification is chalked out pretty nicely. We can start working on implementing all the policies and other improvements.
But before fully implementing the cluster classification, I think it would be good to at least finalize the interface for outlier removal. I also think that binding it only to outlier removal is not going to help in the long run.
So, following the open/closed principle, let's close it for modification by plugging a Collection<Strategy> into the Policy. The Strategy can be outlier removal or any other feature developed by implementing the Strategy interface, which also keeps it open for extension. "Strategy" is just a thought; it can be any other name.
I will try to submit a patch for a mock/Canopy outlier removal first, by implementing "Strategy". If the design works and looks good, then the design part would be done.
Does it look like a good way to proceed? Any suggestions?
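The Collection<Strategy> idea above could be sketched like this (all names are illustrative, not a committed API): outlier removal becomes just one Strategy among possibly many, and the policy stays closed for modification while remaining open for extension.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of plugging a collection of strategies into a policy.
// Each strategy can veto a point; outlier removal is one such strategy.
interface StrategySketch {
    boolean accept(double bestPdf);
}

// Probability-threshold outlier removal expressed as a strategy.
class OutlierRemovalStrategySketch implements StrategySketch {
    private final double threshold;
    OutlierRemovalStrategySketch(double threshold) { this.threshold = threshold; }
    public boolean accept(double bestPdf) { return bestPdf >= threshold; }
}

class PolicyWithStrategiesSketch {
    private final List<StrategySketch> strategies = new ArrayList<>();

    void addStrategy(StrategySketch s) { strategies.add(s); }

    /** A point is emitted only if every plugged-in strategy accepts it. */
    boolean shouldEmit(double bestPdf) {
        for (StrategySketch s : strategies) {
            if (!s.accept(bestPdf)) {
                return false;
            }
        }
        return true;
    }
}
```

A policy with no strategies emits everything, which preserves the current (no pruning) behavior as the default.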
This issue got resolved with MAHOUT-929.