  1. Mahout
  2. MAHOUT-930

Refactor Vector Classification out of Clustering - Make Classification abstract

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.7
    • Component/s: classic
    • Labels: None

    Description

      Right now, each clustering algorithm has its own runClustering (-cp) implementation, which produces clusteredPoints. The current design lacks:
      1) Extensibility - there is no place to plug in new features, such as outlier removal, during classification.
      2) Uniformity in design - new algorithms have no pattern to follow.
      3) Abstraction - clusterData should only be concerned with classifying vectors, i.e. assigning vectors to clusters. It should not care about how the classification is done. That should be the job of a separate entity, which can offer features like outlier removal.

      The new implementation will factor out an independent entity that performs the classification step separately from the various clustering implementations. The new design would start with ClusterClassifier, ClusteringPolicy and ClusterIterator, whose experimental versions are already committed. The currently committed version seems to work for all the iterative clustering algorithms.

      The ClusterClassifier provides the probability of any vector belonging to each of the available clusters. These probabilities are converted into weights by ClusteringPolicy implementations, each of which corresponds to a clustering algorithm. This is the place where the outlier removal implementation can be plugged in. In the future, different ClusteringPolicy implementations can be provided (configured) for different types of classification.
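
      The probability-to-weight step described above could be sketched roughly as follows. This is a minimal illustration only: WeightPolicy, HardAssignmentPolicy and OutlierFilteringPolicy are hypothetical names invented for this sketch and are not the committed Mahout classes. It assumes a hard (k-means style) assignment and treats a vector as an outlier when its best probability falls below a threshold.

```java
import java.util.Arrays;

// Hypothetical sketch: these names are illustrative and are NOT Mahout's actual API.
final class PolicySketch {

    /** Turns per-cluster membership probabilities into assignment weights. */
    interface WeightPolicy {
        double[] weights(double[] probabilities);
    }

    /** Hard assignment: all weight goes to the most probable cluster. */
    static final class HardAssignmentPolicy implements WeightPolicy {
        @Override
        public double[] weights(double[] probabilities) {
            double[] w = new double[probabilities.length];
            if (probabilities.length == 0) {
                return w;
            }
            int best = 0;
            for (int i = 1; i < probabilities.length; i++) {
                if (probabilities[i] > probabilities[best]) {
                    best = i;
                }
            }
            w[best] = 1.0;
            return w;
        }
    }

    /**
     * Wraps another policy and drops outliers: if no cluster reaches the
     * threshold probability, the vector gets zero weight for every cluster.
     */
    static final class OutlierFilteringPolicy implements WeightPolicy {
        private final WeightPolicy delegate;
        private final double threshold;

        OutlierFilteringPolicy(WeightPolicy delegate, double threshold) {
            this.delegate = delegate;
            this.threshold = threshold;
        }

        @Override
        public double[] weights(double[] probabilities) {
            double max = 0.0;
            for (double p : probabilities) {
                max = Math.max(max, p);
            }
            if (max < threshold) {
                return new double[probabilities.length]; // all zeros: treated as an outlier
            }
            return delegate.weights(probabilities);
        }
    }

    public static void main(String[] args) {
        WeightPolicy policy = new OutlierFilteringPolicy(new HardAssignmentPolicy(), 0.5);
        // A vector that clearly belongs to the second cluster.
        System.out.println(Arrays.toString(policy.weights(new double[] {0.1, 0.7, 0.2})));
        // A vector with no confident match, filtered out as an outlier.
        System.out.println(Arrays.toString(policy.weights(new double[] {0.4, 0.3, 0.3})));
    }
}
```

      Wrapping one policy in another is just one possible way to keep outlier removal separate from the per-algorithm weighting. The actual design may plug it in differently.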

      The ClusteringPolicy can be initialized with ClusterConfig objects. These ClusterConfig objects would hold the clustering algorithm parameters needed for classification.

      The ClusterClassifier can also train the existing classifiers (clusters) on the input. This is the place where clustering and classification will converge.

      For now, the execution is done by a ClusterIterator, which runs a clustering policy on the input and tries to classify the vectors into different clusters. It can simultaneously train the classifiers, since it can run for a given number of iterations and each iteration would improve the quality of the classifiers.
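
      As a rough illustration of the classify-then-train loop, here is a small in-memory sketch on one-dimensional points. Everything in it is an assumption made for the example: the class IteratorSketch, its method names, the inverse-distance probabilities and the mean-based training step are hypothetical and do not reproduce the committed ClusterIterator.

```java
import java.util.Arrays;

// Hypothetical sketch: names and signatures are illustrative, NOT Mahout's actual API.
final class IteratorSketch {

    /** Classify step: probability of x belonging to each cluster, from inverse distance to its center. */
    static double[] classify(double x, double[] centers) {
        double[] p = new double[centers.length];
        double sum = 0.0;
        for (int i = 0; i < centers.length; i++) {
            p[i] = 1.0 / (1e-9 + Math.abs(x - centers[i])); // small constant avoids division by zero
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    /** Policy step: hard assignment to the most probable cluster. */
    static int select(double[] probabilities) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return best;
    }

    /**
     * Runs a fixed number of iterations. Each pass classifies every point with
     * the current centers, then retrains each center as the mean of the points
     * assigned to it. Clusters that receive no points keep their old center.
     */
    static double[] iterate(double[] points, double[] initialCenters, int numIterations) {
        double[] centers = initialCenters.clone();
        for (int iter = 0; iter < numIterations; iter++) {
            double[] sums = new double[centers.length];
            int[] counts = new int[centers.length];
            for (double x : points) {
                int k = select(classify(x, centers));
                sums[k] += x;
                counts[k]++;
            }
            for (int k = 0; k < centers.length; k++) {
                if (counts[k] > 0) {
                    centers[k] = sums[k] / counts[k];
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.5, 10.0};
        double[] centers = iterate(points, new double[] {0.0, 5.0}, 5);
        System.out.println(Arrays.toString(centers));
    }
}
```

      With these inputs the two centers settle near the means of the two visible groups of points. A real implementation would work on vectors and delegate the weighting to a ClusteringPolicy rather than hard-coding it.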

          Activity

            jeastman Jeff Eastman added a comment - I've been wanting to make a mapreduce version of the ClusterIterator. Currently it has only the sequential-HDFS and in-memory implementations. Maybe you can build a MAHOUT-929 to consume those clusters?

            paritoshranjan Paritosh Ranjan added a comment - Created MAHOUT-933 to implement a mapreduce version of ClusterIterator.

            paritoshranjan Paritosh Ranjan added a comment - This issue got resolved with MAHOUT-929.

            People

              Assignee: paritoshranjan Paritosh Ranjan
              Reporter: paritoshranjan Paritosh Ranjan