Description
Right now, each clustering algorithm has its own runClustering (-cp) implementation, which produces clusteredPoints. The current design lacks:
1) Extensibility - there is no place to plug in new features such as outlier removal during classification.
2) Uniformity in design - new algorithms have no pattern to follow.
3) Abstraction - clusterData should only be concerned with classifying vectors, i.e. assigning vectors to clusters. It should not care about how the classification is done. That should be the work of a separate entity, which can offer features like outlier removal.
The new implementation factors out an independent entity that performs the classification step independently of the various clustering implementations. The new design would start with ClusterClassifier, ClusteringPolicy and ClusterIterator, experimental versions of which are already committed. The currently committed version seems to work for all the iterative clustering algorithms.
The ClusterClassifier provides the probability of a vector belonging to each of the available clusters. These probabilities are converted into weights by ClusteringPolicy implementations, one for each clustering algorithm. This is the place where the outlier removal implementation can be plugged in. In future, different ClusteringPolicy implementations can be provided (configured) for different types of classification.
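As a rough illustration of the probability-to-weight step described above, here is a minimal, self-contained sketch. It does not reproduce the committed Mahout classes: the names `WeightPolicySketch`, `ThresholdedHardPolicySketch` and the `outlierThreshold` parameter are hypothetical, and plain `double[]` arrays stand in for Mahout vectors. It assumes a hard-assignment policy in which a vector whose best probability falls below the threshold is treated as an outlier and receives zero weight everywhere.

```java
/** Hypothetical stand-in for a clustering policy: maps membership probabilities to weights. */
interface WeightPolicySketch {
    double[] select(double[] probabilities);
}

/**
 * Hypothetical hard-assignment policy with a simple outlier cut-off.
 * All weight goes to the most probable cluster, unless that probability
 * is below outlierThreshold, in which case every weight stays 0.
 */
final class ThresholdedHardPolicySketch implements WeightPolicySketch {
    private final double outlierThreshold; // assumption: not a Mahout parameter name

    ThresholdedHardPolicySketch(double outlierThreshold) {
        this.outlierThreshold = outlierThreshold;
    }

    @Override
    public double[] select(double[] probabilities) {
        double[] weights = new double[probabilities.length]; // all zeros initially
        int best = -1;
        double bestProbability = Double.NEGATIVE_INFINITY;
        // Find the most probable cluster; the first one wins on ties.
        for (int i = 0; i < probabilities.length; i++) {
            if (probabilities[i] > bestProbability) {
                bestProbability = probabilities[i];
                best = i;
            }
        }
        // Assign the vector only if it is not considered an outlier.
        if (best >= 0 && bestProbability >= outlierThreshold) {
            weights[best] = 1.0;
        }
        return weights;
    }
}
```

With a threshold of 0.5, probabilities of {0.1, 0.7, 0.2} would put all the weight on the second cluster, while {0.3, 0.4, 0.3} would leave the vector unassigned. A soft-assignment algorithm could supply a different implementation of the same interface without touching the classifier.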
The ClusteringPolicy can be initialized with ClusterConfig objects. These ClusterConfig objects would hold the clustering algorithm parameters that guide classification.
The ClusterClassifier also makes it possible to train the existing classifiers (clusters) on the input. This is the place where clustering and classification converge.
Execution is handled by a ClusterIterator for now, which runs a clustering policy on the input and tries to classify the vectors into different clusters. It can train the classifiers at the same time: it runs for a given number of iterations, and each iteration would improve the quality of the classifiers.
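The classify-then-train loop described above can be sketched on one-dimensional data as follows. This is only an illustration under stated assumptions, not the committed ClusterIterator: the class `IterationSketch` and its methods are hypothetical, the "probability" is a normalized inverse distance chosen for simplicity, the policy is a hard assignment, and training is a k-means style recomputation of each center.

```java
/** Hypothetical 1-D sketch of the iterate / classify / train loop. */
final class IterationSketch {

    /** Probability-like scores of x belonging to each center (normalized inverse distance). */
    static double[] classify(double x, double[] centers) {
        double[] p = new double[centers.length];
        double sum = 0.0;
        for (int i = 0; i < centers.length; i++) {
            p[i] = 1.0 / (1.0 + Math.abs(x - centers[i]));
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    /** Hard-assignment policy: index of the most probable cluster (first wins on ties). */
    static int select(double[] probabilities) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Runs numIterations passes; each pass classifies every point, then retrains the centers. */
    static double[] iterate(double[] data, double[] initialCenters, int numIterations) {
        double[] centers = initialCenters.clone();
        for (int iteration = 0; iteration < numIterations; iteration++) {
            double[] sums = new double[centers.length];
            int[] counts = new int[centers.length];
            for (double x : data) {
                int cluster = select(classify(x, centers));
                sums[cluster] += x;
                counts[cluster]++;
            }
            // Training step: move each non-empty cluster to the mean of its points.
            for (int i = 0; i < centers.length; i++) {
                if (counts[i] > 0) {
                    centers[i] = sums[i] / counts[i];
                }
            }
        }
        return centers;
    }
}
```

For example, `IterationSketch.iterate(new double[] {1, 2, 3, 10, 11, 12}, new double[] {0, 5}, 5)` moves the two centers to roughly 2 and 11. Keeping `classify` and `select` separate mirrors the split between classifier and policy, so either could be swapped without changing the loop.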
Issue Links
- is depended upon by: MAHOUT-931 Implement a pluggable outlier removal capability for cluster classifiers (Closed)
- is part of: MAHOUT-929 Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning (Closed)
I've been wanting to make a MapReduce version of the ClusterIterator. Currently it has only the sequential-HDFS and in-memory implementations. Maybe you can build a MAHOUT-929 to consume those clusters?