Mahout
  1. Mahout
  2. MAHOUT-5

Implement a k-means clustering prototype

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.1
    • Fix Version/s: 0.1
    • Component/s: Clustering
    • Labels:
      None

      Description

      K-means clustering is closely related to Canopy clustering and often uses canopies to determine the initial clusters. I'd like to implement a k-means prototype and tests in the package org.apache.mahout.clustering.kmeans.

      1. kmeans.zip
        8 kB
        Dawid Weiss
      2. MAHOUT-5.patch
        79 kB
        Grant Ingersoll
      3. MAHOUT-5a.diff
        117 kB
        Jeff Eastman
      4. MAHOUT-5b.diff
        75 kB
        Jeff Eastman
      5. MAHOUT-5c.diff
        86 kB
        Jeff Eastman
      6. MAHOUT-5d.diff
        86 kB
        Jeff Eastman
      7. MAHOUT-5e.diff
        78 kB
        Jeff Eastman

        Activity

        Jeff Eastman created issue -
        Hide
        Dawid Weiss added a comment -

        This is an implementation of k-means in non-MR form. It isn't intended for immediate inclusion, but if it helps, please feel free to copy/paste anything from this code.

        There are several things that should be taken into account when implementing kmeans:

        • initial centroid vectors (several possibilities: random, max-difference-pick, preclustering phase, subsampling and averaging),
        • termination criterion (decrease of the global objective function, number of iterations, combination of these),
        • various optimizations. The document vectors are typically truncated (leaving values of most significant dimensions for each document), sampled (leaving only the most significant dimensions for all documents), or transformed (SVD or other form of matrix decomposition, truncation of least significant dimensions after the decomposition).
        Show
        Dawid Weiss added a comment - This is an implementation of k-means in non-MR form. It isn't intended for immediate inclusion, but if it helps, please feel free to copy/paste anything from this code. There are several things that should be taken into account when implementing kmeans: initial centroid vectors (several possibilities: random, max-difference-pick, preclustering phase, subsampling and averaging), termination criterion (decrease of the global objective function, number of iterations, combination of these), various optimizations. The document vectors are typically truncated (leaving values of most significant dimensions for each document), sampled (leaving only the most significant dimensions for all documents), or transformed (SVD or other form of matrix decomposition, truncation of least significant dimensions after the decomposition).
        Dawid Weiss made changes -
        Field Original Value New Value
        Attachment kmeans.zip [ 12375815 ]
        Hide
        Dawid Weiss added a comment -

        Jeff, if you can start working before Wednesday, feel free to look at the code above. If it helps you, good. If it doesn't (which is very likely), no worries. These classes won't compile because there are missing dependencies (like vectors, taken from Colt if I recall), but I don't think it matters for understanding the main loop of the algorithm.

        Show
        Dawid Weiss added a comment - Jeff, if you can start working before Wednesday, feel free to look at the code above. If it helps you, good. If it doesn't (which is very likely), no worries. These classes won't compile because there are missing dependencies (like vectors, taken from Colt if I recall), but I don't think it matters for understanding the main loop of the algorithm.
        Hide
        Jeff Eastman added a comment -

        A working implementation of a K-Means Clustering algorithm. See unit tests for
        the evolution of the user stories leading to the full implementation. This implementation
        expects an input set of clusters (perhaps via canopy) and iterates them to convergence
        before the final pass which clusters the input points.

        This patch also does some refactoring of the Canopy Clustering code, since there
        were many common elements, so it is included in its entirety. See MAHOUT-3 for
        commit log details for Canopy. I will tease them apart if needed.

        TODO: Sort out the generics

        TODO: Allow points to be sparse, ...

        All unit tests run.

        • src/main/java/org/apache/mahout/clustering/kmeans
        • Cluster.java
          (formatCluster, decodeCluster): cluster i/o formatting
          (configure, config): configuration for jobs or unit tests
          (emitPointToNearestCluster): the essence of kmeans, does what it says
          (addPoint, addPoints): add one or more points to the cluster
          (recomputeCenter, computeConvergence, isConverged): useful helpers
          (toString, getCenter, getNumPoints, getPointTotal): accessors
        • KMeansMapper.java
          (configure, config): load the clusters for jobs or unit tests
          (map): emits all points to nearest cluster
        • KMeansCombiner.java
          (configure): configuration for jobs
          (reduce): computes partial totalCount values for points seen
          by associated mapper. Outputs canopy key and numPoints, pointTotal to
          reducer for final combination
        • KMeansReducer.java
          (configure): configuration for jobs
          (reduce): computes new canopy centroid and determines convergence
        • KMeansDriver.java
          (runIteration): runs an iteration M/R job to map/combine/reduce clusters
          and return convergence criteria
          (runClustering): final map step assigns original points to converged
          clusters. No reducer or combiner for this step.
          (runJob): runs one or more iterations until convergence is achieved or
          iteration limit is exceeded
        • src/main/java/org/apache/mahout/clustering/utils
        • DistanceMeasure.java: old friend from Canopy clustering
        • EuclideanDistanceMeasure.java: old friend from Canopy clustering
        • ManhattanDistanceMeasure.java: old friend from Canopy clustering
        • Point.java: refactors useful operations on Float[] points
        • src/test/java/org/apache/mahout/clustering/kmeans
        • TestKmeansClustering.java
          (referenceKmeans, iterateReference): reference implementation (thanks for
          the code to look at Dawid)
          (testReferenceImplementation, testKMeansMapper, testKMeansCombiner,
          testKMeansReducer): test isolated components
          (testKMeansMRJob): tests the job on all values of k for test points
        • VisibleCluster.java
          (addPoint, toString): overrides inherited methods for testing
        Show
        Jeff Eastman added a comment - A working implementation of a K-Means Clustering algorithm. See unit tests for the evolution of the user stories leading to the full implementation. This implementation expects an input set of clusters (perhaps via canopy) and iterates them to convergence before the final pass which clusters the input points. This patch also does some refactoring of the Canopy Clustering code, since there were many common elements, so it is included in its entirety. See MAHOUT-3 for commit log details for Canopy. I will tease them apart if needed. TODO: Sort out the generics TODO: Allow points to be sparse, ... All unit tests run. src/main/java/org/apache/mahout/clustering/kmeans Cluster.java (formatCluster, decodeCluster): cluster i/o formatting (configure, config): configuration for jobs or unit tests (emitPointToNearestCluster): the essence of kmeans, does what it says (addPoint, addPoints): add one or more points to the cluster (recomputeCenter, computeConvergence, isConverged): useful helpers (toString, getCenter, getNumPoints, getPointTotal): accessors KMeansMapper.java (configure, config): load the clusters for jobs or unit tests (map): emits all points to nearest cluster KMeansCombiner.java (configure): configuration for jobs (reduce): computes partial totalCount values for points seen by associated mapper. Outputs canopy key and numPoints, pointTotal to reducer for final combination KMeansReducer.java (configure): configuration for jobs (reduce): computes new canopy centroid and determines convergence KMeansDriver.java (runIteration): runs an iteration M/R job to map/combine/reduce clusters and return convergence criteria (runClustering): final map step assigns original points to converged clusters. No reducer or combiner for this step. (runJob): runs one or more iterations until convergence is achieved or iteration limit is exceeded src/main/java/org/apache/mahout/clustering/utils DistanceMeasure.java: old friend from Canopy clustering EuclideanDistanceMeasure.java: old friend from Canopy clustering ManhattanDistanceMeasure.java: old friend from Canopy clustering Point.java: refactors useful operations on Float[] points src/test/java/org/apache/mahout/clustering/kmeans TestKmeansClustering.java (referenceKmeans, iterateReference): reference implementation (thanks for the code to look at Dawid) (testReferenceImplementation, testKMeansMapper, testKMeansCombiner, testKMeansReducer): test isolated components (testKMeansMRJob): tests the job on all values of k for test points VisibleCluster.java (addPoint, toString): overrides inherited methods for testing
        Jeff Eastman made changes -
        Attachment MAHOUT-5a.diff [ 12375986 ]
        Hide
        Grant Ingersoll added a comment -

        Cool. Yes, let's separate out 3. I am ready to commit pending one last holdout question. Then, we can refactor for this one.

        Show
        Grant Ingersoll added a comment - Cool. Yes, let's separate out 3. I am ready to commit pending one last holdout question. Then, we can refactor for this one.
        Hide
        Jeff Eastman added a comment - - edited

        This patch integrates my refactoring of Point and the DistanceMeasures into utils and adds some changes to improve unit test reliability: removing all test directories in setup; and improvements to the KMeansDriver.runJob to avoid mv-ing directories in the loop since those operations are evidently asynchronous in Hadoop.

        All tests run, more reliably than before.

        Run this on trunk after r629348

        Show
        Jeff Eastman added a comment - - edited This patch integrates my refactoring of Point and the DistanceMeasures into utils and adds some changes to improve unit test reliability: removing all test directories in setup; and improvements to the KMeansDriver.runJob to avoid mv-ing directories in the loop since those operations are evidently asynchronous in Hadoop. All tests run, more reliably than before. Run this on trunk after r629348
        Jeff Eastman made changes -
        Attachment MAHOUT-5b.diff [ 12375995 ]
        Hide
        Jeff Eastman added a comment -

        Added generic declarations to make most of the annoying little Eclipse marks go away. Modified the canopy and kmeans reducers to output the same format (key=identifierString; value=formatString) so that canopy clusters can be used directly for input to kmeans. Added a unit test thereof to prove it works.

        All unit tests run.

        This patch should be applied to trunk r629348.

        Show
        Jeff Eastman added a comment - Added generic declarations to make most of the annoying little Eclipse marks go away. Modified the canopy and kmeans reducers to output the same format (key=identifierString; value=formatString) so that canopy clusters can be used directly for input to kmeans. Added a unit test thereof to prove it works. All unit tests run. This patch should be applied to trunk r629348.
        Jeff Eastman made changes -
        Attachment MAHOUT-5c.diff [ 12376052 ]
        Grant Ingersoll made changes -
        Assignee Jeff Eastman [ jeastman ] Grant Ingersoll [ gsingers ]
        Hide
        Grant Ingersoll added a comment -

        Hi Jeff,

        Can you generate a patch against trunk? Rev. 630076.

        I'm getting:
        patching file src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java
        can't find file to patch at input line 960
        Perhaps you used the wrong -p or --strip option?
        The text leading up to this was:
        --------------------------

        Index: src/main/java/org/apache/mahout/clustering/utils/EuclideanDistanceMeasure.java
        ===================================================================
        — src/main/java/org/apache/mahout/clustering/utils/EuclideanDistanceMeasure.java (revision 0)
        +++ src/main/java/org/apache/mahout/clustering/utils/EuclideanDistanceMeasure.java (working copy)
        --------------------------

        Thanks,
        Grant

        Show
        Grant Ingersoll added a comment - Hi Jeff, Can you generate a patch against trunk? Rev. 630076. I'm getting: patching file src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java can't find file to patch at input line 960 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -------------------------- Index: src/main/java/org/apache/mahout/clustering/utils/EuclideanDistanceMeasure.java =================================================================== — src/main/java/org/apache/mahout/clustering/utils/EuclideanDistanceMeasure.java (revision 0) +++ src/main/java/org/apache/mahout/clustering/utils/EuclideanDistanceMeasure.java (working copy) -------------------------- Thanks, Grant
        Hide
        Jeff Eastman added a comment -

        I just did an update of trunk and a fresh diff. Hopefully this one will work for you.

        Show
        Jeff Eastman added a comment - I just did an update of trunk and a fresh diff. Hopefully this one will work for you.
        Jeff Eastman made changes -
        Attachment MAHOUT-5d.diff [ 12376193 ]
        Hide
        Grant Ingersoll added a comment -

        Still no dice. I think the issue is that it hasn't captured the move of the DistanceMeasures to the utils package.

        Do you just want them moved to utils? I can do that and commit, then you can generate a new patch with that change.

        Show
        Grant Ingersoll added a comment - Still no dice. I think the issue is that it hasn't captured the move of the DistanceMeasures to the utils package. Do you just want them moved to utils? I can do that and commit, then you can generate a new patch with that change.
        Hide
        Jeff Eastman added a comment -

        Subclipse is not very good at moving classes. Why don't you go ahead and move them and I will submit another patch as you suggest.

        Show
        Jeff Eastman added a comment - Subclipse is not very good at moving classes. Why don't you go ahead and move them and I will submit another patch as you suggest.
        Hide
        Grant Ingersoll added a comment -

        OK, I moved them to mahout.utils. I figure a DistanceMeasure is generally useful beyond clustering.

        Show
        Grant Ingersoll added a comment - OK, I moved them to mahout.utils. I figure a DistanceMeasure is generally useful beyond clustering.
        Hide
        Jeff Eastman added a comment -

        I did the merge with trunk r630688 and this diff runs all the unit tests.

        Show
        Jeff Eastman added a comment - I did the merge with trunk r630688 and this diff runs all the unit tests.
        Jeff Eastman made changes -
        Attachment MAHOUT-5e.diff [ 12376416 ]
        Hide
        Grant Ingersoll added a comment -

        I have rev 630694 as the latest.

        I'm still getting errors in applying the patch.

        Show
        Grant Ingersoll added a comment - I have rev 630694 as the latest. I'm still getting errors in applying the patch.
        Hide
        Grant Ingersoll added a comment -

        OK, I fixed the patch. Will upload a patch shortly.

        Also, Jeff, check where you are getting the Mahout Jar from in various places. It is currently being built by ant into dist/apache-mahout-0.1.jar. I fixed it in kMeans, so just FYI for future. I am assuming you are generating it outside of Ant

        Show
        Grant Ingersoll added a comment - OK, I fixed the patch. Will upload a patch shortly. Also, Jeff, check where you are getting the Mahout Jar from in various places. It is currently being built by ant into dist/apache-mahout-0.1.jar. I fixed it in kMeans, so just FYI for future. I am assuming you are generating it outside of Ant
        Grant Ingersoll made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Hide
        Grant Ingersoll added a comment -

        Fixes various issues with last 5e.diff patch. All tests pass.

        Will review a bit more and then commit, assuming no other issues.

        Show
        Grant Ingersoll added a comment - Fixes various issues with last 5e.diff patch. All tests pass. Will review a bit more and then commit, assuming no other issues.
        Grant Ingersoll made changes -
        Attachment MAHOUT-5.patch [ 12376890 ]
        Hide
        Grant Ingersoll added a comment -

        Committed revision 632543.

        Thanks Jeff!

        Show
        Grant Ingersoll added a comment - Committed revision 632543. Thanks Jeff!
        Grant Ingersoll made changes -
        Fix Version/s 0.1 [ 12312976 ]
        Resolution Fixed [ 1 ]
        Status In Progress [ 3 ] Resolved [ 5 ]
        Grant Ingersoll made changes -
        Workflow jira [ 12423782 ] no-reopen-closed, patch-avail [ 12444655 ]
        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Jeff Eastman
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development