MAHOUT-933

Implement a MapReduce version of ClusterIterator

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.7
    • Component/s: Classification, Clustering
    • Labels:
      None

      Description

      Right now, ClusterIterator consumes vectors only from memory or sequentially from HDFS. A MapReduce version that consumes vectors needs to be implemented.
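As a rough illustration of what a MapReduce form of one clustering iteration involves, here is a minimal, Hadoop-free sketch in plain Java. It assumes a k-means style update: the map phase groups each vector under its nearest center and the reduce phase averages each group into a new center. Every class and method name here is hypothetical and none of them is taken from Mahout's CIMapper, CIReducer, or ClusterIterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free sketch of one k-means style iteration split into
// a map phase and a reduce phase. None of these names come from Mahout.
class MapReduceIterationSketch {

  // Map phase: group every input vector under the index of its nearest center.
  static Map<Integer, List<double[]>> mapPhase(List<double[]> vectors, double[][] centers) {
    Map<Integer, List<double[]>> emitted = new TreeMap<>();
    for (double[] v : vectors) {
      int nearest = 0;
      for (int k = 1; k < centers.length; k++) {
        if (squaredDistance(v, centers[k]) < squaredDistance(v, centers[nearest])) {
          nearest = k;
        }
      }
      emitted.computeIfAbsent(nearest, key -> new ArrayList<>()).add(v);
    }
    return emitted;
  }

  // Reduce phase: average the vectors grouped under each key into a new center.
  // Centers that received no vectors keep their previous position.
  static double[][] reducePhase(Map<Integer, List<double[]>> grouped, double[][] oldCenters) {
    double[][] updated = new double[oldCenters.length][];
    for (int k = 0; k < oldCenters.length; k++) {
      List<double[]> members = grouped.get(k);
      if (members == null) {
        updated[k] = oldCenters[k].clone();
        continue;
      }
      double[] mean = new double[oldCenters[k].length];
      for (double[] v : members) {
        for (int i = 0; i < mean.length; i++) {
          mean[i] += v[i];
        }
      }
      for (int i = 0; i < mean.length; i++) {
        mean[i] /= members.size();
      }
      updated[k] = mean;
    }
    return updated;
  }

  static double squaredDistance(double[] a, double[] b) {
    double d = 0.0;
    for (int i = 0; i < a.length; i++) {
      d += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return d;
  }

  public static void main(String[] args) {
    List<double[]> vectors = Arrays.asList(
        new double[] {0, 0}, new double[] {0, 2}, new double[] {10, 10}, new double[] {10, 12});
    double[][] centers = {{1, 1}, {9, 9}};
    double[][] next = reducePhase(mapPhase(vectors, centers), centers);
    System.out.println(Arrays.deepToString(next)); // prints [[0.0, 1.0], [10.0, 11.0]]
  }
}
```

In a real job the grouping done by the TreeMap here would be handled by the framework's shuffle, and the current centers would have to be made available to every mapper.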


          Activity

          Hudson added a comment -

          Integrated in Mahout-Quality #1272 (See https://builds.apache.org/job/Mahout-Quality/1272/)
          MAHOUT-846: Improved scalability of GaussianCluster.pdf. Introduced some beginnings for MAHOUT-933. All tests run.

          jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1224730
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/AbstractCluster.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/CIMapper.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/CIReducer.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/Cluster.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/ClusterClassifier.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/ClusterIterator.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/Model.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletCluster.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/dirichlet/models/GaussianCluster.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/TestClusterClassifier.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayDirichlet.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayFuzzyKMeans.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayKMeans.java
          Hudson added a comment -

          Integrated in Mahout-Quality #1362 (See https://builds.apache.org/job/Mahout-Quality/1362/)
          MAHOUT-933: Refactored actual classification out of ClusterClassifier and into ClusteringPolicies. This
          allows the classifier to be completely generic as to the algorithm and gives policies correct use of, e.g., the fuzzyK 'm' parameter.
          Introduced Canopy and MeanShift clustering policies for classification, though they are not used by the cluster iterator.
          Modified serialization of ClusterClassifiers to include the ClusteringPolicy.
          Added ClusterClassifier serialization methods for the exploded sequenceFile representation needed for MR.
          Updated Display examples and unit tests. All run. (Revision 1292563)

          Result = FAILURE
          jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1292563
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/CIMapper.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/CanopyClusteringPolicy.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/ClusterClassifier.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/ClusterIterator.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/ClusteringPolicy.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/DirichletClusteringPolicy.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/FuzzyKMeansClusteringPolicy.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/KMeansClusteringPolicy.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/MeanShiftClusteringPolicy.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationMapper.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansClusterer.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/TestClusterClassifier.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayDirichlet.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayFuzzyKMeans.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayKMeans.java
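The idea in the commit message above, moving the classification decision out of the classifier and into a pluggable policy, can be sketched as follows. This is only an illustration under stated assumptions: the interface, its single `weights` method, and all class names are hypothetical and are not the signatures of Mahout's ClusteringPolicy or ClusterClassifier. The fuzzy variant uses the standard fuzzy c-means membership formula, which is where an exponent such as 'm' would naturally live.

```java
import java.util.Arrays;

// Hypothetical sketch: these names and signatures are illustrative only and
// are not taken from Mahout's ClusteringPolicy classes.
interface PolicySketch {
  // Convert per-cluster distances into per-cluster membership weights.
  double[] weights(double[] distances);
}

// k-means style: all weight goes to the single nearest cluster.
class HardPolicySketch implements PolicySketch {
  @Override
  public double[] weights(double[] distances) {
    int nearest = 0;
    for (int k = 1; k < distances.length; k++) {
      if (distances[k] < distances[nearest]) {
        nearest = k;
      }
    }
    double[] w = new double[distances.length];
    w[nearest] = 1.0;
    return w;
  }
}

// Fuzzy k-means style: weights follow the standard fuzzy c-means membership
// formula u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)), with m > 1.
class FuzzyPolicySketch implements PolicySketch {
  private final double m;

  FuzzyPolicySketch(double m) {
    this.m = m;
  }

  @Override
  public double[] weights(double[] distances) {
    double[] w = new double[distances.length];
    for (int i = 0; i < distances.length; i++) {
      if (distances[i] == 0.0) {
        // The point sits exactly on a center: give that cluster full weight.
        w[i] = 1.0;
        return w;
      }
    }
    double exponent = 2.0 / (m - 1.0);
    for (int i = 0; i < distances.length; i++) {
      double sum = 0.0;
      for (double dj : distances) {
        sum += Math.pow(distances[i] / dj, exponent);
      }
      w[i] = 1.0 / sum;
    }
    return w;
  }
}

// The classifier only measures distances; the policy decides what they mean.
class GenericClassifierSketch {
  private final double[][] centers;
  private final PolicySketch policy;

  GenericClassifierSketch(double[][] centers, PolicySketch policy) {
    this.centers = centers;
    this.policy = policy;
  }

  double[] classify(double[] point) {
    double[] distances = new double[centers.length];
    for (int k = 0; k < centers.length; k++) {
      double sum = 0.0;
      for (int i = 0; i < point.length; i++) {
        double diff = point[i] - centers[k][i];
        sum += diff * diff;
      }
      distances[k] = Math.sqrt(sum);
    }
    return policy.weights(distances);
  }

  public static void main(String[] args) {
    double[][] centers = {{0, 0}, {4, 0}};
    double[] point = {1, 0};
    System.out.println(Arrays.toString(
        new GenericClassifierSketch(centers, new HardPolicySketch()).classify(point)));
    System.out.println(Arrays.toString(
        new GenericClassifierSketch(centers, new FuzzyPolicySketch(2.0)).classify(point)));
  }
}
```

With the same centers and the same point, swapping the policy changes the result from a hard assignment to weights of roughly 0.9 and 0.1, without any change to the classifier itself.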
          Hudson added a comment -

          Integrated in Mahout-Quality #1363 (See https://builds.apache.org/job/Mahout-Quality/1363/)
          MAHOUT-933: Fixed undetected defects introduced by earlier commit.
          I will run all the unit tests before every check-in
          I will run all the unit tests before every check-in
          I will run all the unit tests before every check-in
          ... (Revision 1292629)

          Result = SUCCESS
          jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1292629
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/ClusterClassifier.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/ClusterIterator.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/FuzzyKMeansClusteringPolicy.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationMapper.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java
          Jeff Eastman added a comment -

          r1298625 made the following changes:

          MAHOUT-933:

          • refactored ClusteringPolicies into hierarchy under new AbstractClusteringPolicy
          • added close() to ClusteringPolicy to allow policy-specific actions needed to compute convergence
          • removed ClusteringPolicy from ClusterIterator constructor as ClusterClassifier already has one
          • added convergence computations for kmeans and fuzzyk
          • added final clustersOut renaming to add -final suffix
          • updated Display examples and unit tests to reflect above
          • all tests run
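One way to picture the close() hook listed above is as a callback invoked at the end of each iteration that decides whether the clusters have converged. The sketch below assumes a simple criterion (every center moved by at most a threshold delta) and a driver loop that stops when the hook reports convergence. The interface, its parameters, and the threshold rule are assumptions made for illustration, not the actual ClusteringPolicy contract.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch of an end-of-iteration hook that computes convergence.
// Names, signatures, and the convergence rule are illustrative assumptions.
class ConvergenceHookSketch {

  interface IterationPolicy {
    // Called once per iteration with the centers before and after the update.
    // Returns true when the policy considers the clustering converged.
    boolean close(double[][] before, double[][] after);
  }

  // Converged when no center moved farther than delta (Euclidean distance).
  static class DistanceThresholdPolicy implements IterationPolicy {
    private final double delta;

    DistanceThresholdPolicy(double delta) {
      this.delta = delta;
    }

    @Override
    public boolean close(double[][] before, double[][] after) {
      for (int k = 0; k < before.length; k++) {
        double sum = 0.0;
        for (int i = 0; i < before[k].length; i++) {
          double diff = after[k][i] - before[k][i];
          sum += diff * diff;
        }
        if (Math.sqrt(sum) > delta) {
          return false;
        }
      }
      return true;
    }
  }

  // Applies step repeatedly and returns the number of iterations that were run,
  // stopping early as soon as the policy reports convergence.
  static int iterate(double[][] start, UnaryOperator<double[][]> step,
                     IterationPolicy policy, int maxIterations) {
    double[][] current = start;
    for (int i = 1; i <= maxIterations; i++) {
      double[][] next = step.apply(current);
      boolean converged = policy.close(current, next);
      current = next;
      if (converged) {
        return i;
      }
    }
    return maxIterations;
  }

  public static void main(String[] args) {
    // Toy update step: move the single 1-D center halfway toward 8.0.
    UnaryOperator<double[][]> halfwayToEight = c -> new double[][] {{(c[0][0] + 8.0) / 2.0}};
    int iterations = iterate(new double[][] {{0.0}}, halfwayToEight,
        new DistanceThresholdPolicy(0.5), 20);
    System.out.println(iterations); // prints 4
  }
}
```

In the toy run the center moves by 4, 2, 1, and then 0.5, so the fourth iteration is the first one within the threshold.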

          I think it is time to begin refactoring the buildClusters methods of the respective clustering drivers to use ClusterIterator as it seems to be producing equivalent results to the original implementations. This will involve removing a lot of existing driver, mapper and reducer code and many time-consuming unit tests. It will also have some impact on other components as the representation of clusters in the file system changes from Cluster to self-describing ClusterWritable.

          I have created separate subtasks to address these conversion issues so that they may be undertaken independently.
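For readers unfamiliar with the term, a "self-describing" on-disk representation can be sketched as one that records the concrete class name ahead of the payload, so a reader can instantiate the right type without being told it in advance. The sketch below uses only java.io and reflection. It is an assumption about the general technique, not a description of how ClusterWritable is actually implemented, and all names in it are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of a self-describing record: the class name is written
// first, then the fields. Not Mahout's ClusterWritable.
class SelfDescribingSketch {

  interface Payload {
    void write(DataOutput out) throws IOException;
    void read(DataInput in) throws IOException;
  }

  // Toy cluster with an id and a one-dimensional center.
  static class CenterOnlyCluster implements Payload {
    int id;
    double center;

    CenterOnlyCluster() {
      // No-argument constructor required for reflective instantiation.
    }

    CenterOnlyCluster(int id, double center) {
      this.id = id;
      this.center = center;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(id);
      out.writeDouble(center);
    }

    @Override
    public void read(DataInput in) throws IOException {
      id = in.readInt();
      center = in.readDouble();
    }
  }

  // Write the concrete class name, then let the value write its own fields.
  static byte[] serialize(Payload value) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeUTF(value.getClass().getName());
    value.write(out);
    out.flush();
    return bytes.toByteArray();
  }

  // Read the class name, instantiate that class, then let it read its fields.
  static Payload deserialize(byte[] data) throws IOException, ReflectiveOperationException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    String className = in.readUTF();
    Payload value = (Payload) Class.forName(className).getDeclaredConstructor().newInstance();
    value.read(in);
    return value;
  }

  // Convenience wrapper that converts checked exceptions to unchecked ones.
  static Payload roundTrip(Payload value) {
    try {
      return deserialize(serialize(value));
    } catch (IOException | ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Payload copy = roundTrip(new CenterOnlyCluster(7, 2.5));
    System.out.println(copy.getClass().getSimpleName()); // prints CenterOnlyCluster
  }
}
```

The reader never names CenterOnlyCluster explicitly, which is the property that lets one file format hold clusters produced by different algorithms.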

          Jeff Eastman added a comment -

          Closing this, as the last subtask has been completed.


            People

            • Assignee: Jeff Eastman
            • Reporter: Paritosh Ranjan