Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5
    • Component/s: Clustering
    • Labels: None

      Description

      Per http://www.lucidimagination.com/search/document/10b562f10288993c/validating_clustering_output#9d3f6a55f4a91cb6, it would be great to have some utilities to help evaluate the effectiveness of clustering.

      Attachments

      1. MAHOUT-236.patch
        70 kB
        Jeff Eastman
      2. MAHOUT-236.patch
        85 kB
        Jeff Eastman
      3. MAHOUT-236.patch
        107 kB
        Jeff Eastman
      4. MAHOUT-236.patch
        137 kB
        Jeff Eastman
      5. MAHOUT-236.patch
        21 kB
        Jeff Eastman

        Activity

        Hudson added a comment -

        Integrated in Mahout-Quality #328 (See https://hudson.apache.org/hudson/job/Mahout-Quality/328/)
        MAHOUT-236:

        • Modified ClusterEvaluator to use the same dataset as the clustering Display examples.
        • Switched evaluator to run sequential versions of the clustering jobs to reduce execution time.
        • Fixed a clusteredPoints path bug in sequential Mean Shift clustering

        All tests run

        Hudson added a comment -

        Integrated in Mahout-Quality #322 (See https://hudson.apache.org/hudson/job/Mahout-Quality/322/)
        MAHOUT-236

        • Implemented ClusterEvaluator that uses Mahout In Action code for
          inter-cluster density and similar code for intra-cluster density over a set of
          representative points, not the entire clustered data set.
        • Generalized CDbwDriver etc to RepresentativePointsDriver so any cluster
          evaluator tool can use them
        • Added cluster pruning to CDbwEvaluator and ClusterEvaluator that removes
          clusters which cause numerical instabilities in the evaluation
        • Added unit tests. All tests run
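
        As a rough illustration of the inter- and intra-cluster density idea mentioned in the build notes above, here is a minimal, generic sketch over representative points using plain Euclidean distance. It is illustrative only: the class and method names are hypothetical, and this is not the exact math the committed ClusterEvaluator/CDbwEvaluator code uses.

        // Generic sketch only: hypothetical names, plain Euclidean distance,
        // not the formula the committed ClusterEvaluator/CDbwEvaluator uses.
        import java.util.List;
        import java.util.Map;

        public class DensitySketch {

          static double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
              double d = a[i] - b[i];
              sum += d * d;
            }
            return Math.sqrt(sum);
          }

          // Intra-cluster density: average distance from each representative
          // point to its cluster's centroid (smaller means tighter clusters).
          static double intraClusterDensity(Map<Integer, double[]> centroids,
                                            Map<Integer, List<double[]>> repPoints) {
            double total = 0.0;
            int count = 0;
            for (Map.Entry<Integer, List<double[]>> e : repPoints.entrySet()) {
              double[] centroid = centroids.get(e.getKey());
              for (double[] p : e.getValue()) {
                total += distance(p, centroid);
                count++;
              }
            }
            return count == 0 ? 0.0 : total / count;
          }

          // Inter-cluster separation: minimum distance between any two
          // centroids (larger means better-separated clusters).
          static double interClusterSeparation(Map<Integer, double[]> centroids) {
            double min = Double.POSITIVE_INFINITY;
            Integer[] ids = centroids.keySet().toArray(new Integer[0]);
            for (int i = 0; i < ids.length; i++) {
              for (int j = i + 1; j < ids.length; j++) {
                min = Math.min(min, distance(centroids.get(ids[i]), centroids.get(ids[j])));
              }
            }
            return min;
          }
        }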

        Sean Owen added a comment -

        If I read this right, Jeff, you're done here? At least for 0.4 purposes?

        Jeff Eastman added a comment -

        With the recent commit (r1000021) of corner-case tests and a little more use, this is probably ready for 0.4.

        Jeff Eastman added a comment -

        The patches have all been committed and CDbw produces some numbers that, without deeper analysis, appear to be reasonable for all the algorithms. I'm inclined to close this issue unless others have concerns about this implementation's correctness, completeness, etc.

        Jeff Eastman added a comment -

        Here's a new patch that has initial, probably incorrect, implementations of the CDbw computations. The patch builds upon trunk and does not include the previous patch contents which are already committed.

        Robin Anil added a comment -

        No Jeff, I don't have any implementations with me. Sorry for not replying earlier. Will have to start from scratch on it.

        Jeff Eastman added a comment -

        Ok, the above patch was committed on the 21st and is now in trunk. What remains for this issue is to complete the CDbw calculations from the now-computed representative points. Robin, do you have any implementation code for this or should I start from scratch?

        Jeff Eastman added a comment -

        This patch runs on top of Sean's latest patch (r936453) and adds a DirichletClusterMapper and a clustering step (most-likely cluster only), with tweaks to the models to support Cluster.id properly. The CDbw representative points are now calculated for all five clustering algorithms. More to test before I commit, but this is pretty close.

        Jeff Eastman added a comment -

        Added a mean shift clustering job and now it works for CDbw too. On to Dirichlet...

        Jeff Eastman added a comment -

        I made some small changes to fuzzyK clustering and now the evaluator runs on its output too. The clustering produces some funny values for the clusters which I have not yet understood. A lot more of the other clustering unit tests are working too.

        Still a work in progress, but representative points are being calculated for Canopy, KMeans and now FuzzyKMeans.

        Robin Anil added a comment - edited

        Yeah, for partial membership we can add multiple strategies, like choose the top K clusters, choose the top cluster, or choose the top cluster plus all clusters > threshold. The CDbw computation will have to be modified to use the partial weights, that's all.

        So I think your idea does make sense; whether or not it gives meaningful results is something we have to experiment with and see.

        Robin
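
        To make those strategies concrete, here is a minimal sketch (hypothetical names, not code from the patch) of selecting clusters from a map of membership probabilities: top cluster, top K clusters, or all clusters above a threshold; combining the first and last gives Robin's third option.

        // Hypothetical sketch of the selection strategies discussed above;
        // not part of the MAHOUT-236 patch.
        import java.util.ArrayList;
        import java.util.List;
        import java.util.Map;

        public class MembershipStrategies {

          // Hardest assignment: the single cluster with the highest membership
          // probability (assumes a non-empty map).
          static int topCluster(Map<Integer, Double> memberships) {
            return memberships.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
          }

          // The K clusters with the highest membership probabilities.
          static List<Integer> topKClusters(Map<Integer, Double> memberships, int k) {
            List<Map.Entry<Integer, Double>> entries = new ArrayList<>(memberships.entrySet());
            entries.sort(Map.Entry.<Integer, Double>comparingByValue().reversed());
            List<Integer> result = new ArrayList<>();
            for (int i = 0; i < Math.min(k, entries.size()); i++) {
              result.add(entries.get(i).getKey());
            }
            return result;
          }

          // All clusters whose membership probability exceeds the threshold.
          static List<Integer> aboveThreshold(Map<Integer, Double> memberships, double threshold) {
            List<Integer> result = new ArrayList<>();
            for (Map.Entry<Integer, Double> e : memberships.entrySet()) {
              if (e.getValue() > threshold) {
                result.add(e.getKey());
              }
            }
            return result;
          }
        }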

        Ted Dunning added a comment -

        Typically, any place where you have an algorithm that assumes hard membership but what you have is a soft-membership clustering algorithm, you can just pick the cluster with the strongest membership signal. You don't need a threshold.

        Conversely, in applications where you need soft membership and have hard membership, you should insert (1-epsilon) for the one cluster the document is in and epsilon/(k-1) for the other k-1 clusters. Epsilon should be tuned for best results on a corpus but should generally not be zero.
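
        A minimal worked sketch of that smoothing (illustrative names only): weight 1 - epsilon for the cluster the document is in, epsilon / (k - 1) for each of the other k - 1 clusters.

        // Illustrative sketch of the epsilon smoothing described above; names
        // are hypothetical and epsilon would be tuned per corpus.
        public class EpsilonSmoothing {

          // Soft membership weights of length k for a point hard-assigned to
          // assignedCluster: (1 - epsilon) there, epsilon / (k - 1) elsewhere.
          static double[] softenHardAssignment(int assignedCluster, int k, double epsilon) {
            double[] weights = new double[k];
            for (int i = 0; i < k; i++) {
              weights[i] = (i == assignedCluster) ? 1.0 - epsilon : epsilon / (k - 1);
            }
            return weights;
          }

          public static void main(String[] args) {
            // e.g. k = 4 clusters, point assigned to cluster 2, epsilon = 0.01
            // yields {0.00333..., 0.00333..., 0.99, 0.00333...}
            for (double w : softenHardAssignment(2, 4, 0.01)) {
              System.out.println(w);
            }
          }
        }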

        Jeff Eastman added a comment -

        I'm running into a challenge integrating Fuzzy KMeans (and Dirichlet) into this evaluator. Currently the clustering step of the fuzzyK emits the vector as key and a FuzzyKMeansOutput writable as the value of the sequence file. This is backwards from the [clusterId :: VectorWritable] encoding that the patch uses for Canopy and KMeans. Also the Fuzzy...Output bean contains all of the clusters and the probability the vector is a member of each; rather large to be a key.

        For CDbw to find the reference points it really needs to iterate over [clusterId :: VectorWritable] pairs, and this raises the question of what to do with fuzzy membership. I don't know if CDbw can be adjusted to handle fuzziness in general, but it probably will work with some points assigned to more than one cluster. Does it make sense to apply a settable threshold to the clustering step so that all points with cluster membership probability > threshold would be assigned to that cluster?

        This would work also for Dirichlet. To implement in fuzzyK I would need to change the FuzzyKMeansClusterer and FuzzyKMeansClusterMapper to match the other clustering jobs.

        Does this make sense?

        Robin Anil added a comment -

        Great start, Jeff. I will test it and see if the CDbw makes sense with Reuters data and post results.

        Jeff Eastman added a comment -

        Here's a patch that adds a CDbw reference point MR job that iterates over the clustered points passed to it. I had to change the clustered point output format to [clusterId :: VectorWritable] and that required other changes - mostly to unit tests. The patch includes three unit tests (Canopy, KMeans + partial Dirichlet).

        It's a work in progress, since I need to make some more changes to get the fuzzy kmeans tests to pass and the Dirichlet process doesn't actually cluster points.

        Run TestCDbwEvaluator to see some output from the reference point engine. Still need to compute the final CDbw.
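
        For anyone inspecting the output, the [clusterId :: VectorWritable] pairs described above can be read back as an ordinary Hadoop sequence file along these lines. This is a hedged sketch: the IntWritable key type and the example path are assumptions, not taken from the patch.

        // Minimal sketch of iterating over clustered-point output in the
        // [clusterId :: VectorWritable] form described above. The IntWritable
        // key type and the path are assumptions for illustration only.
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.mahout.math.VectorWritable;

        public class ClusteredPointsDump {

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(args[0]); // e.g. output/clusteredPoints/part-00000

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
              IntWritable clusterId = new IntWritable();
              VectorWritable point = new VectorWritable();
              while (reader.next(clusterId, point)) {
                System.out.println(clusterId.get() + " :: " + point.get().asFormatString());
              }
            } finally {
              reader.close();
            }
          }
        }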

        Grant Ingersoll added a comment -

        I don't have any, but it should be pretty easy to do.

        Robin Anil added a comment -

        Hi Grant, Sashi, do you have any patch ready? I am available to help out on this if there is anything left for me.


          People

          • Assignee:
            Jeff Eastman
          • Reporter:
            Grant Ingersoll
          • Votes:
            0
          • Watchers:
            2
