Mahout / MAHOUT-825

Canopies grouping records outside t1

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Environment: Windows, Linux

      Description

While finding the closest canopy, there is no check to ensure that it returns canopies within distance t1 of the point. This produces incorrect results: points outside t1 are grouped into canopies.
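      A minimal sketch of the missing guard in the classification step (the method and helper names here are illustrative, not Mahout's actual API):

      private void classifyPoint(Vector point, Iterable<Canopy> canopies) {
        Canopy closest = findClosestCanopy(point, canopies); // assumed helper
        double distance = measure.distance(closest.getCenter(), point);
        if (distance < t1) {
          // Only group the point if it lies within t1 of the canopy center.
          emitPointToCanopy(point, closest); // assumed helper
        }
        // Points farther than t1 should not be silently grouped into a canopy.
      }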

      Attachments

      1. MAHOUT-825.patch
        79 kB
        Jeff Eastman
      2. canopy-radius-based-outlier-elimination
        14 kB
        Paritosh Ranjan
      3. canopy-outlier-elimination
        20 kB
        Paritosh Ranjan
      4. canopy-strict-clustering-flag
        13 kB
        Paritosh Ranjan
      5. Not Clustering Remote Points - Two Meaningful Clusters.txt
        4 kB
        Paritosh Ranjan
      6. Clustering Remote Points - Two Big, Useless Clusters.txt
        403 kB
        Paritosh Ranjan
      7. canopy-clusterFilter-t1
        7 kB
        Paritosh Ranjan
      8. canopy-outside-t1-points-patch-1
        5 kB
        Paritosh Ranjan

        Activity

        Jeff Eastman added a comment -

        Closing this as Won't Fix in favor of MAHOUT-929

        Jeff Eastman added a comment -

        Scary we've reversed positions. But I tend to agree that adding outlier removal in a pluggable manner is the best long-term solution. Actually, I think it is possible to factor all of the various clustering steps (classification of points) into an independent job which accepts pointsIn, clustersIn and some other args to do the classification for all the clustering algorithms (they are really quite redundant in their current incarnations). It might even use the ClusterClassifier.classify() method, which would help with classification/clustering convergence.

        But this should be for 0.7, not 0.6.
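        A speculative sketch of what such a factored-out classification job could look like (loadClusters, readVectors and writeClusteredPoint are assumed helpers, and the exact ClusterClassifier API may differ; this illustrates the idea, not an existing Mahout job):

        void classifyPoints(Path pointsIn, Path clustersIn, Path output) throws IOException {
          List<Cluster> clusters = loadClusters(clustersIn);   // assumed helper
          ClusterClassifier classifier = new ClusterClassifier(clusters);
          for (Vector point : readVectors(pointsIn)) {         // assumed helper
            Vector pdfs = classifier.classify(point);          // per-cluster likelihoods
            int mostLikely = pdfs.maxValueIndex();
            writeClusteredPoint(output, mostLikely, point);    // assumed helper
          }
        }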

        Paritosh Ranjan added a comment -

        Now I think that the stop at distance > T1 is increasing the complexity of the code. We should keep the outlier removal as a separate component; it's cluttering otherwise clean code.

        But a post processor will add one more step to the processing, which will hurt performance. I propose plugging an outlier removal component into all the clustering algorithms, with different implementations for different algorithms, but in a cleaner (segregated) way.

        Jeff Eastman added a comment -

        Hmm, recent major reformatting has invalidated the patch. Not clear where to go on this.

        a) Back up to a minimalist fix to stop distance > T1 clusterings, as the issue says. This shouldn't happen in sequential mode; the outliers are a result of the subsequent reducer cluster mergers. (quit grinning, Paritosh)
        b) Write a new post processor to generalize the outlier filter capability. Target 0.7

        Jeff Eastman added a comment -

        Changing this from a defect to an improvement. I'd still like to see this capability done as a post processor rather than augmenting all the existing clustering algorithms with this nascent capability.

        Jeff Eastman added a comment -

        +1 I'm now inclined to agree

        Paritosh Ranjan added a comment -

        I agree that the features for all clustering algorithms should be consistent.

        However, I also advocate early feedback, and I think everyone else does. If we can commit this patch now, users can use it and give feedback, and its implementation in other algorithms can proceed in parallel. The user feedback can even affect the priority of implementing it in the other clustering algorithms (based on how important users feel this feature is).

        I believe that user feedback is the most important input in any software development, and early feedback helps in making better products.

        So, refraining from committing an available patch does not make much sense in my view.

        Jeff Eastman added a comment -

        Modified version of canopy-radius-based-outlier-elimination. Renames the filter to "outlierFilter". Adjusts filter semantics to reject distance > outlierFilter * cluster.radius(). Emits rejected points to clusterId = -1 so they do not disappear silently. Includes Apache License text, adjusts to Mahout formatter style. Seems to pass the relevant tests.

        This should be implemented in the other clustering algorithms before committing.
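        A hedged reconstruction of the reworked filter semantics (field names and key types are illustrative; the actual patch code may differ):

        private static final int OUTLIER_CLUSTER_ID = -1;

        void emitPoint(Canopy closest, Vector point, Context context)
            throws IOException, InterruptedException {
          double distance = measure.distance(closest.getCenter(), point);
          // Assumes the radius is compared via its squared length, as in the
          // snippets elsewhere in this issue.
          double radius = closest.getRadius().getLengthSquared();
          if (distance <= outlierFilter * radius) {
            context.write(new IntWritable(closest.getId()), new VectorWritable(point));
          } else {
            // Rejected points go to clusterId = -1 so they do not disappear silently.
            context.write(new IntWritable(OUTLIER_CLUSTER_ID), new VectorWritable(point));
          }
        }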

        Paritosh Ranjan added a comment -

        Nothing wrong with the patch. Tested all boundary conditions, please go ahead.

        Paritosh Ranjan added a comment -

        I have found scope for improvement in the patch (one missed scenario). Please wait; I will attach an improved patch.

        Paritosh Ranjan added a comment -

        Here goes the compromise patch, named canopy-radius-based-outlier-elimination (attached). It has no impact on the buildCluster phase.

        It uses radius-based outlier elimination controlled by a parameter, clusterStrictness.

        There are JUnit tests which show the usage of this flag.

        Jeff Eastman added a comment -

        Ok, as a compromise, if you remove the generation stuff I can support implementation of an experimental, radius-based outlier filter in Canopy classification. If it seems useful after some period of evaluation, then we can generalize it and extend it to the other clustering classification steps. With a patch of that nature, I could become an advocate.

        Paritosh Ranjan added a comment -

        Hmmm, applying it in the Canopy generation phase: that was an experiment (as you can see from previous posts). I can remove that.

        I don't need to compute parameters if I remove it from the Canopy generation phase (which was anyway an experiment).

        Then we will simply use the radius, which is already calculated (and not t1).

        Jeff Eastman added a comment -

        -2 Sorry Paritosh, but computeParameters() resets the Gaussian statistics every time it is called. This patch breaks Canopy and turns it into a different clustering algorithm altogether.

        And I, at least, do see harm in adding this still-experimental outlier elimination scheme to only one of the clustering algorithms. It is clearly not specific to Canopy and we have enough one-off, inconsistent features in Mahout already.

        Paritosh Ranjan added a comment -

        I experimented on the Canopy generation phase as Ted suggested. The results are positive, it can be switched on/off using a flag, and the default is off.

        The radius is being calculated. I think you missed it:

        private double computeCanopyNeighbourhoodDistance(Canopy canopy) {
          // Refresh the canopy's derived statistics, then scale its radius
          // (squared length) by the user-supplied clusterStrictness factor.
          canopy.computeParameters();
          double radius = canopy.getRadius().getLengthSquared();
          return radius * clusterStrictness;
        }
        

        I don't see any harm in having an outlier elimination mechanism in Canopy even if we don't have it in the other clustering mechanisms, as there is a flag to control it, and it is off by default.

        It has been evolving for some time, and I think all the doubts mentioned earlier have been addressed. It is controlled by a flag, and it is not dependent on t1 anymore.

        Jeff Eastman added a comment - edited

        -2 on incorporating clusterStrictness in the Canopy generation phase. This completely changes the semantics of canopy. Furthermore, the radius of a canopy is not even calculated until cleanup(), after all the points have been processed in the mapper.

        -0.75 on using clusterStrictness in the Canopy classification phase to remove outliers. If we are going to add an outlier filter of this flavor then it should be added to all clustering classification codes, not just Canopy.

        I still do not see the merit of incorporating this (experimental, untested and still evolving) outlier filtering scheme into the classification phase at all. Subsequent processing steps which read the clusteredPoints can easily also load the clusters and perform whatever outlier removal is best for that particular application. It does not need to be a separate step, and it is quite application specific.

        Paritosh Ranjan added a comment - edited

        I have added the patch canopy-outlier-elimination. This patch uses the flag clusterStrictness both for outlier elimination while calculating centroids and for eliminating outliers from the canopy clustering. The previous boolean variable (clusterStrictly) has been changed to a double, as we needed a parameter to control the quality of the cluster. Outlier elimination (both in centroid calculation and elimination from the cluster) is switched off by default.

        Both steps, i.e. using the radius instead of t1 and eliminating outliers while calculating centroids, have increased the quality of the result. Many points which were not being clustered earlier (while using t1, and even when using outlier points in centroid calculation) are being clustered now. The quality is also controllable by tuning the value of clusterStrictness.

        Using a percentile to reject 10-30% of outliers looks like a good option, but it is a post-processing step which will impact performance. I think the same functionality is achievable using a higher clusterStrictness value.

        Paritosh Ranjan added a comment -

        Already did it. Thanks.

        Ted Dunning added a comment -

        Don't make the parameter an integer. Make it a double. Think about the 1.5 that you suggested just a bit ago.

        Paritosh Ranjan added a comment -

        CanopyDriver's run method already takes 12 parameters, so adding a new parameter would not be good.

        So, instead of that, I will change the boolean parameter "clusterStrictly" (the new flag) to an int, "clusterStrictness".
        clusterStrictness should be > 1 for any strictness, and the strictness will increase with the value of clusterStrictness.

        The default value of "clusterStrictness" will be negative.

        Allowing points to be excluded only from the computation of centroids will certainly improve centroid quality, but not excluding points from the cluster will still create at least a few useless clusters.

        But I will surely try to exclude distant points from centroid calculation, either in this patch or later.

        Ted Dunning added a comment - edited

        I experimented with the distance calculation, and now I am using the radius instead of the t1 parameter.

        private boolean shouldCluster(Canopy canopy, Vector point) {
          if (clusterStrictly) {
            // Reject points farther than three times the canopy's radius
            // (squared length) from its center.
            Vector currentCenter = canopy.getCenter();
            double distance =
                measure.distance(currentCenter.getLengthSquared(), currentCenter, point);
            double radius = canopy.getRadius().getLengthSquared();
            return distance < radius * 3;
          }
          return true;
        }
        

        The positives and negatives of this approach are:

        +ve: It is not dependent on t1. The radius is, I think, a better way to calculate distances from canopies (according to the discussion above). In my experience the results are also "much better" than using t1. Some meaningful points that were missed by using t1 are being clustered with this approach.

        -ve: Now I have no control over the quality of the cluster. The number, 3, is a constant. With t1, at least I was able to control the quality of the cluster.

        Regarding the +ve, I am tempted to say I told you so, but I think that honor should go to Jeff.

        Regarding the -ve, don't misuse t1 here. Just make up a new parameter which is intended to be >= 1.

        Also, I think it would be interesting and useful behavior to allow points to be excluded only from the computation of the centroids, but not excluded from being assigned to the cluster. I don't know if your code permits such a subtlety or not. Another thought is to use a percentile limit rather than a multiplier on the radius. Such a limit would be more robust in the face of strange distributions of radii.

        Paritosh Ranjan added a comment - edited

        I experimented with the distance calculation, and now I am using the radius instead of the t1 parameter.

        private boolean shouldCluster(Canopy canopy, Vector point) {
          if (clusterStrictly) {
            Vector currentCenter = canopy.getCenter();
            double distance =
                measure.distance(currentCenter.getLengthSquared(), currentCenter, point);
            double radius = canopy.getRadius().getLengthSquared();
            return distance < radius * 3;
          }
          return true;
        }

        The positives and negatives of this approach are:

        +ve: It is not dependent on t1. The radius is, I think, a better way to calculate distances from canopies (according to the discussion above). In my experience the results are also "much better" than using t1. Some meaningful points that were missed by using t1 are being clustered with this approach.

        -ve: Now I have no control over the quality of the cluster. The number, 3, is a constant. With t1, at least I was able to control the quality of the cluster.

        I think that

        return distance < (radius + (radius * t1 / 10))

        would at least give the user some control over the quality of the output.

        I am against implementing this as a post-processing step, because it degrades performance by adding one more step to computing the clusters, and canopy is supposed to be fast.

        Thanks for the suggestions to improve it. I hope this implementation is better than the previous one.

        Ted Dunning added a comment -

        Well, it may not do much harm if it is default off, but I think that we can do much better for an outlier trimmer. This may help on a few data sets, but it really is not a widely accepted thing to do for all the obvious reasons (most notably that a single cutoff for all clusters regardless of size is misguided).

        Surely there is a better strategy that doesn't re-use the canopy parameters.

        Paritosh Ranjan added a comment -

        I understand that there "might be" better ways to do it. But rejecting something in the hope of being able to add something better in the future (especially when there is no existing functionality to support this) is incorrect in my view.

        Sean Owen added a comment -

        +0 I'd say. The patch is unobtrusive enough to not bother me. The functionality is not obscure or illogical. The thing about it is that it is sort of endorsing one way to trim outliers from final clusters (and trimming is a good thing to support), when that way of doing it seems problematic as it's a fixed threshold.

        Jeff Eastman added a comment -

        Ok, I'll go with a majority vote of course. I'm still -1 on including a fixed-distance-threshold-based outlier elimination scheme in canopy. I believe this belongs in the realm of application-specific post-processing and that, as Ted observed, outlier elimination is tricky business at best. I would be willing to consider a uniform outlier elimination capability for all clustering algorithms "(but absolutely not with a fixed distance)".

        Sean, does this mean you are +0 or +1?

        Sean Owen added a comment -

        It seems like a relatively unintrusive flag. It's not terribly obscure behavior to support. I don't see much harm in adding it in this form.

        Paritosh Ranjan added a comment -

        There is a flag to control whether the user wants to eliminate the outliers or not, so the user can decide which clustering mechanism to use. I think it is better than the existing approach, where the user does not even have the option.

        This patch adds functionality to canopy clustering without eliminating any previous functionality. And the result files (real clustering data) that I have attached clearly show its benefit.

        Jeff Eastman added a comment -

        -1 I'm glad that using T1 for outlier elimination is working for you to produce "more meaningful" clusters. It is, however, for reasons amply enumerated above, not a generally useful mechanism that is appropriate to include in Mahout. I'm willing to leave this issue and its associated patches open for a while. If there is a groundswell of support for including this patch from other users then I will revisit this decision.

        Paritosh Ranjan added a comment - edited

        I have attached two versions of clustered points (clusters). This is real data which my application was clustering. Opening these files in Notepad++, EditPlus etc. will show the results in a better format. I have also printed the Vectors, but you can only see the Strings in each line. The intended functionality is that similar-looking Strings should be clustered together.

        One file (Clustering Remote Points - Two Big, Useless Clusters) contains two big clusters (scroll down to see the second cluster). This file was created using the cluster-all-points approach.

        The second file (Not Clustering Remote Points - Two Meaningful Clusters) contains points clustered only within t1.

        I hope you will agree that the second file, with two small clusters, makes sense in this case.

        I have now used a flag to cluster data strictly (instead of clusterFilter). I have created and attached a patch (canopy-strict-clustering-flag). It also has test cases which demonstrate how remotely placed points can be added to or rejected from a canopy (cluster).

        The cluster files that I have attached are the exact use case for which I fixed this issue in Mahout. I encountered this problem and was clueless about what was happening inside; I kept checking my vector creation and distance measures, but the problem was not there. So I think this patch can help others. Also, the default value of this flag is false, so it can be used as per the user's requirement.

        Ted Dunning added a comment -

        If you want to do something clever with these outliers, add a few more clusters to the mix.

        Or use Dirichlet process clustering which inherently uses a variable number of clusters.

        Or add an option for trimming post facto as Jeff suggests (but absolutely not with a fixed distance). Cluster radii are widely different and a single number won't do as a threshold. You also have to be careful about trimming because trimming a point from the cluster can decrease the average radius and the standard deviation, which can lead to further deletions. You have to make sure that the outlier detection is stable. It is also common that the points in a cluster are not normally distributed at all. More typically, clusters in dense areas tend to be polytopes bounded on all sides by other clusters. Clusters on the periphery tend to be asymmetric, with sharp boundaries on one side and a spray of potential outliers on the other sides. These situations make simple trimming difficult.

        If the impact of outliers is just too heinous to contemplate, then changing the centroid to a trimmed centroid computation might help. The idea is that you compute the centroid conventionally and then progressively remove the 10-30% most distant points from the centroid computation, but NOT from the cluster. Such a trimmed centroid is, much like a trimmed mean, much less sensitive to outliers. Trimming the centroid computation might give the desired results without the bad effects of outlier elimination.
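        A single-pass sketch of that trimmed-centroid idea, assuming in-memory points and Mahout's Vector/DistanceMeasure types (trimmedCentroid and trimFraction are illustrative names; the progressive removal described above would iterate this):

        // Assumes java.util.* plus Mahout's Vector and DistanceMeasure imports.
        Vector trimmedCentroid(List<Vector> points, DistanceMeasure measure, double trimFraction) {
          // Conventional centroid first.
          Vector sum = points.get(0).like();
          for (Vector p : points) {
            sum = sum.plus(p);
          }
          final Vector centroid = sum.divide(points.size());

          // Drop the trimFraction (e.g. 0.1-0.3) most distant points from the
          // centroid computation only; they remain assigned to the cluster.
          List<Vector> sorted = new ArrayList<>(points);
          sorted.sort(Comparator.comparingDouble(p -> measure.distance(centroid, p)));
          int keep = Math.max(1, (int) Math.ceil(sorted.size() * (1.0 - trimFraction)));

          Vector trimmed = points.get(0).like();
          for (Vector p : sorted.subList(0, keep)) {
            trimmed = trimmed.plus(p);
          }
          return trimmed.divide(keep);
        }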

        Jeff Eastman added a comment -

        One final response:

        If "Canopy Clustering is not even a clustering algorithm" (I disagree), then why is this patch proposing to augment its classification semantics with outlier elimination? Accepting this proposition, canopy should only be used for cluster generation e.g. for priming kmeans.

        That said, outlier elimination could be considered as a new option for all the various clustering classification steps. The question in my mind is how best would this be specified? Clusters, in general, have a center and a radius that are determined by the points which have been assigned to them in the respective cluster generation steps. Offering a single, constant distance threshold to identify outliers wouldn't be my first choice: I'd suggest using the radius (= stdev) of each cluster instead, perhaps everything > x*radius would be considered an outlier. Even canopy clusters have these Gaussian statistics. Using T1 would be an especially poor choice given arguments I've made earlier.

        Finally, outlier post-processing is trivial once we arrive at a solid definition of "outlier": Read the clusters into the mapper, compare each of the clustered points with its cluster statistics and determine if it is an outlier. If it is not, output the point. The job does not require any reducers and the outlier removal can easily be folded into other post-processing steps which are, also, application-specific.
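        A hedged sketch of that map-only job, assuming the clusteredPoints layout of cluster-id keys and WeightedVectorWritable values (class, field and option names are illustrative):

        // Assumes Hadoop (IntWritable, Mapper) and Mahout (Cluster,
        // WeightedVectorWritable, DistanceMeasure) imports.
        public class OutlierRemovalMapper
            extends Mapper<IntWritable, WeightedVectorWritable, IntWritable, WeightedVectorWritable> {

          private final Map<Integer, Cluster> clusters = new HashMap<>();
          private DistanceMeasure measure;
          private double radiusMultiplier; // the "x" in distance > x * radius

          @Override
          protected void setup(Context context) {
            // Load the clusters and the configured measure/multiplier here (elided).
          }

          @Override
          protected void map(IntWritable clusterId, WeightedVectorWritable point, Context context)
              throws IOException, InterruptedException {
            Cluster cluster = clusters.get(clusterId.get());
            if (cluster == null) {
              return; // unknown cluster id; sketch-level handling
            }
            double distance = measure.distance(cluster.getCenter(), point.getVector());
            double radius = cluster.getRadius().getLengthSquared();
            if (distance <= radiusMultiplier * radius) {
              context.write(clusterId, point); // emit only the non-outliers
            }
          }
        }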

        I'm still -1 on this patch

        Sean Owen added a comment -

        With clusterFilter set, distant points that would have formed a canopy of 1 no longer form a canopy at all. I don't think this is the same as requesting that those points not be clustered at all.

        I don't see why these distant points affect any other canopies. They're already farther than t1 away from other canopies, so don't affect the others. Right?

        I do see the potential value in requesting to drop points that are too far from clusters, rather than assign to clusters. I don't think that should be the only or default behavior, since it's not the behavior now or in "vanilla" implementations.

        But it could be a flag. It does not have the same meaning as clusterFilter though, so it would need to be a different flag. I sympathize with not wanting so many flags; you can't have a flag for every little possible choice. But this would have to be one. 10 flags isn't such a big deal; you don't have to set them all. In any event, that's about the only type of change here that I could imagine achieving consensus.

        More thoughts from other voices?

        Paritosh Ranjan added a comment -

        Yes, canopy generation and clustering is separately done. As Jeff has already mentioned.

        Of course, using clusterFilter instead of a new variable would be overloading it. However, introducing a new parameter to CanopyDriver's already 9-10 parameter run method won't make any sense either, and since there is no existing "concept" as such of canopy clustering, I don't think user control over this is required.

        Let me answer why I think that clusterFilter and clustering strict canopies go hand in hand.

        When a user specifies clusterFilter > 0, he has already specified that he does not want canopies having a single point (or fewer than clusterFilter points). So he is more interested in better, closely grouped clusters and less interested in isolated points, and clustering remote points in this case will not help him. (This was the exact problem I was stuck on when I created this patch while developing my application.)

        And the closest canopy cannot always be that "close"; it depends on the data. Sometimes a point placed really far away is assigned to a canopy which otherwise groups good-quality, nearby points, just because it was the closest canopy to that isolated point. This simply destroys the quality of the cluster.

        And the user gets this bad-quality cluster because "every clustering algorithm does it", which, in my view, is not helping the user. Canopy Clustering is not even a clustering algorithm; it is just a means to find the approximate number of clusters of size T1/T2, which can be used further in k-means (all of which is already done before the clusterData phase, so this change won't affect the buildCluster phase). Canopy Clustering is just a utility which helps the user get clusters easily with high performance.

        I hope I have answered all your queries.

        Jeff Eastman added a comment -

        cf. "But I think the question of whether every point participates in, or generates its own canopy, is separate from whether every point is assigned to a cluster. Am I right about this?"

        Yes, precisely correct. This issue is confusing the generation of canopies with the classification of points given existing canopies. T1 et al. have a role to play in the generation phase but none in the classification phase. Classification does what it advertises: it groups points into their most likely (closest) cluster. This is common behavior across all the clustering algorithms. Failing to classify a point is not correct.

        The clusterFilter option was introduced to improve the performance of the mapreduce canopy implementation by allowing the user to eliminate generation of very small clusters. This does have a role to play in priming k-means or other iterative clustering algorithms, and would have the effect of reducing k. It is unrelated to the classification step.

        I'm still -1 on this patch. Classification outliers should be removed in a separate processing step.

        Sean Owen added a comment -

        I am not an expert on this code, so take my comments as informed questions rather than something authoritative.

        But I think the question of whether every point participates in, or generates its own canopy, is separate from whether every point is assigned to a cluster. Am I right about this? You're right that not every point will contribute to a canopy, or even be in a canopy. But, those points may still become clustered, to the canopy to which they are nearest?

        I also understand what you're doing with the new patch, and it does seem that at best this is a selectable feature. But does it make sense to overload the clusterFilter setting to control this as well? I do get that you think they go hand-in-hand, and maybe they do, though a separate flag probably still makes more sense IMHO.

        I'm trying to describe the down-side in clustering a distant point. The cost of emitting the point is trivial. Providing an answer, instead of no answer, can't be worse, I think. What does it slow down or degrade? I can only come up with the possibility that it degrades accuracy in iterative algorithms like k-means.

        Paritosh Ranjan added a comment - - edited

        I have attached a new patch (canopy-clusterFilter-t1) which checks whether it needs to apply dist < t1 or not; that is, whether it should cluster all points or only points within t1 (based on the user-specified clusterFilter variable). A sketch of this check follows the list below.

        I have also added test cases demonstrating both behaviors (based on the clusterFilter variable):
        1) all points being clustered
        2) only points with dist < t1 being clustered
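        A minimal sketch of the check this patch adds (names follow the snippets elsewhere in this issue but are illustrative):

        private boolean shouldCluster(Canopy closest, Vector point) {
          if (clusterFilter <= 0) {
            return true; // current default: every point is clustered
          }
          // clusterFilter > 0 signals the user cares about cluster quality,
          // so only cluster points within t1 of the closest canopy.
          return measure.distance(closest.getCenter(), point) < t1;
        }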

        Paritosh Ranjan added a comment -

        Reopening as I think the problem of "clustering every point" has been answered in my new comment.

        Paritosh Ranjan added a comment -

        As Sean mentioned,

        "This canopy-generation business is a heuristic speed-up. I would not expect it to mean some things don't get clustered."

        These two goals cannot be achieved together in canopy generation. Clustering every point makes the canopy generation process really, really slow (as all canopies are processed on a "single" reducer in the buildCluster phase).

        To get rid of this performance problem, the number of canopies was controlled (based on their quality) using a variable, clusterFilter (introduced a week ago). This is an int variable which prevents the formation of canopies having fewer than clusterFilter points (and which drastically improves the performance of the reducer phase of buildCluster, as fewer canopies are generated). In a sense, this variable (clusterFilter) tells whether the user wants every point to be inside a cluster or not. (See code snippet below.)

        if (canopy.getNumPoints() > clusterFilter) {
          context.write(new Text(canopy.getIdentifier()), canopy);
        }

        Setting this variable clusterFilter > 0 implies that the user is only interested in clusters which have more than clusterFilter points, and is not interested in remote/single/few isolated points. Which also means that he does not want every point to be clustered.

        So, we can have an additional check of (clusterFilter > 0) in this patch, which, if true, implies that the user is not interested in clustering every point; he is more interested in the quality of the clusters. Then this patch will give him the result he desires.

        Jeff Eastman added a comment -

        Outlier points may be problematic for subsequent processing steps, but they appear only in the clusteredPoints output and do not impact the cluster generation itself.

        K-means, for example, processes every point in every iteration and assigns each point to one cluster based upon minimum distance. Only at the end of an iteration is the centroid recalculated. There's really no concept of outliers during cluster generation; every point contributes to the centroid of exactly one cluster. In Canopy, points can contribute to the centroids of multiple clusters if their distance is d<T1. In FuzzyK, each point will contribute to the centroid of every cluster. In Dirichlet, each point contributes to the centroid of one of the clusters based upon its pdf() and the Dirichlet distribution mixture. This is typically not the same cluster in each iteration.
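        A minimal, in-memory sketch of the k-means assignment step described above (illustrative, not Mahout's mapreduce implementation):

        void assignPoints(Iterable<Vector> points, List<Cluster> clusters, DistanceMeasure measure) {
          for (Vector point : points) {
            Cluster closest = null;
            double best = Double.MAX_VALUE;
            for (Cluster c : clusters) {
              double d = measure.distance(c.getCenter(), point);
              if (d < best) {
                best = d;
                closest = c;
              }
            }
            // Every point contributes to exactly one (closest) cluster; the
            // centroid is only recomputed at the end of the iteration.
            closest.observe(point); // accumulate the cluster's running statistics
          }
        }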

        I really think outlier elimination is best implemented as a post-processing step.
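
        As a concrete illustration of that idea (a sketch only, using in-memory maps in place of Mahout's actual clusteredPoints sequence files; removeOutliers and the radius parameter are invented for this example):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Map;
        import org.apache.mahout.common.distance.DistanceMeasure;
        import org.apache.mahout.math.Vector;

        // Illustrative outlier post-processor: keep only points within `radius`
        // of their assigned cluster center; drop everything else as an outlier.
        public static List<Vector> removeOutliers(Map<Vector, Integer> assignments,
                                                  Map<Integer, Vector> centers,
                                                  DistanceMeasure measure,
                                                  double radius) {
          List<Vector> kept = new ArrayList<Vector>();
          for (Map.Entry<Vector, Integer> e : assignments.entrySet()) {
            Vector point = e.getKey();
            Vector center = centers.get(e.getValue());  // assigned cluster's center
            if (measure.distance(center, point) <= radius) {
              kept.add(point);                          // within radius: keep
            }
          }
          return kept;
        }

        Because this runs after clustering finishes, it composes with any of the algorithms above without touching their generation logic.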

        Sean Owen added a comment -

        As far as I understand: t1 and t2 are used in coming up with canopies. Then, points are assigned to their nearest canopy, regardless of t1 and t2. That's how it works in Mahout, and it is certainly a way to do it, and probably the simplest, even most simplistic, one.

        It seems that the difference of opinion is just whether t1 should also be applied in the second step. Should a point not within t1 of any canopy be considered "unclusterable"? I think there's a logic to that. Perhaps I'm just influenced by having stared at Mahout, but I don't find that to be the most expected behavior.

        Any clustering algorithm I know does put everything in a cluster. This canopy-generation business is a heuristic speed-up. I would not expect it to mean some things don't get clustered.

        In this sense I tend to agree with Jeff, though I don't see anything wrong with the additional filtering either; it just seems like an additional step, or logic, you could impose afterwards.

        But Paritosh remains correct that these outlier points are probably bad for clustering: they stick around across iterations of k-means, for example, and affect centroids. I have no experience quantifying that effect, though it must be non-zero.

        Jeff Eastman added a comment -

        For reasons stated above

        Jeff Eastman added a comment -

        Paritosh,

        I remain unconvinced that you have demonstrated your assertion that "current Canopy Clustering is simply not working".

        Sure, T1 is related to canopy generation, but it is unrelated to maximum-likelihood classification which is common to most of the other clustering algorithms. This step simply assigns each point to its closest cluster. If you wish to impose additional semantics on the clustering outputs (e.g. outlier elimination), you need to do this in a separate processing step.

        I'm not going to commit this patch for the reasons I've stated above. If your arguments convince one of the other committers to weigh in and argue for a different outcome, then we can continue this discussion. For now, I'm marking this Won't Fix.

        Paritosh Ranjan added a comment -

        As you said,

        "ClusterData simply assigns each point to the closest, maximum-likelihood cluster given the computed centroids and the distance measure chosen."

        which is not happening now for Canopy, and this patch fixes this problem.

        I disagree with your statement that "T1, in particular, is unrelated to clusterData". t1 "is" related to Canopy for sure. It can be t1 or t2, but the canopies will be meaningful. Right now, any random point is getting clustered into a canopy, which is simply incorrect. This patch makes sense by fixing this problem.

        Of course, multiple things can be done with the design and the clustering approach. But do you agree that this patch is fixing a current bug in the code? If yes, then I think we should apply this patch, as the current Canopy clustering is simply not working and gives the user a really hard time; I have experienced that. This patch fixes this problem.

        Jeff Eastman added a comment -

        Sorry, but I continue to disagree. As you point out, clusterData is not a canopy-specific activity: It could be factored out of Canopy and Kmeans, since they do it identically (FuzzyK and Dirichlet can also do maximum-likelihood classification, as an option). ClusterData simply assigns each point to the closest, maximum-likelihood cluster given the computed centroids and the distance measure chosen. Imposing additional semantics which causes some points to not be classified at all is just not correct, IMHO. T1, in particular, is unrelated to clusterData and indeed a given point may be within T1 of multiple canopy clusters. If you want to impose additional semantics, (e.g. to remove outliers) you need to do this in a separate processing step.
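
        For reference, the closest-cluster assignment described here reduces to a minimum-distance scan over the cluster centers; a minimal sketch (variable names illustrative, not Mahout's exact code):

        // Maximum-likelihood (closest-cluster) assignment: no distance cutoff,
        // so every point ends up in exactly one cluster.
        Canopy closest = null;
        double minDist = Double.MAX_VALUE;
        for (Canopy canopy : canopies) {
          double d = measure.distance(canopy.getCenter(), point);
          if (d < minDist) {
            minDist = d;
            closest = canopy;
          }
        }
        // The disputed patch would effectively add a (minDist < t1) test here
        // and leave the point unclassified when it fails.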

        Paritosh Ranjan added a comment -

        The clusterData phase runs only when the runClustering variable is true (as shown in the code snippet below). So, the canopies are already generated and computed before the clusterData phase.

        if (runClustering) {
          clusterData(conf, input, clustersOut, output, measure, t1, t2, runSequential);
        }

        clusterData is just grouping/classifying/identifying which point belongs to which canopy. Adding a remote point (farther than t1) to the already generated and computed canopies just wastes this grouping functionality. There is no use in doing it.

        Paritosh Ranjan added a comment -

        Canopy is primarily used to calculate the approximate number of clusters (of size around t1, t2), and then to use this number of clusters (and their centroids) in some algorithm like K-means. This is already being done in the buildClusters phase.

        clusterData is not a Canopy-algorithm-specific thing. It's a utility to help identify/classify the clusters even before going to K-means, using the canopy centroids, if you think that your vectors are well separated and your distances (t1, t2) are appropriate. This allows faster clustering (because it removes the need to go to K-means).

        So, assigning any remote, isolated point to a canopy whose centroid is really far away simply degrades the quality of the clustered points. There is no benefit in doing this.

        Also, computeParameters() and computeCentroid() are not even called during the clusterData phase. So, assigning points in the clusterData phase has no impact on the centroid of the canopy, which is already calculated in the buildClusters phase.

        Jeff Eastman added a comment -

        You misinterpreted my statement but what you say is correct. FindClosestCanopy is called during the classification, or clustering (clusterData) phase, not during the buildClusters phase. The classification phase assigns each point to its closest canopy, irrespective of that canopy's T parameters. Introducing (d<T1) into this process may be satisfying from your perspective but failing to classify a point at all is not satisfactory from mine.

        The T1 argument does not imply any assignment semantics. Points within the T1 radius will influence the computed centroid of a canopy. They will also impact the numPoints of the canopy, because this value is used to compute the centroid.

        The T2 argument may imply some assignment semantics, in that points found to be within T2 will not result in new canopies being output. Canopies having numPoints <= clusterFilter will not be output during the buildClusters phase. This is quite different from failing to classify a point in the clusterData phase.

        Canopy is intended to be a simple, approximate, fast clustering algorithm. We've already added a number of bells and whistles (T3, T4, clusterFilter) in order to make it more usable for large data sets. I just don't see the (d<T1) test as being necessary or even correct.
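
        For readers tracking the T1/T2 distinction, a simplified sketch of canopy generation (modeled on the published algorithm; Mahout's CanopyClusterer differs in detail, so treat this as an approximation):

        // Simplified canopy generation with T2 < T1: a point within T1 of a
        // canopy contributes to that canopy's centroid; a point within T2 of
        // some canopy does not seed a new one.
        List<Canopy> canopies = new ArrayList<Canopy>();
        int nextCanopyId = 0;
        for (Vector point : points) {
          boolean stronglyBound = false;
          for (Canopy canopy : canopies) {
            double d = measure.distance(canopy.getCenter(), point);
            if (d < t1) {
              canopy.observe(point);   // influences this canopy's centroid
            }
            stronglyBound = stronglyBound || d < t2;
          }
          if (!stronglyBound) {
            canopies.add(new Canopy(point, nextCanopyId++, measure));
          }
        }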

        Paritosh Ranjan added a comment -

        I meant to say clusterFilter > 0.

        Paritosh Ranjan added a comment -

        I don't see findClosestCanopy() or emitPointToClosestCanopy() being called in the buildClusters() phase as you have mentioned.

        The (d<T1) check applies during the clustering phase. So, a record which is not inside t1 still gets assigned to a canopy.

        So, when the runClustering variable is true, the clusters created are meaningless.

        The other place it is called from is CanopyDriver's clusterDataSeq(), which demonstrates the same problem. If runClustering is set to true, the clusters created are incorrect (I have seen the problem in both sequential and mapreduce). I experienced this while using CanopyDriver's run method with clusterFilter > 0 and runClustering = true.

        The problem you have explained above will not happen, because this method is not even used in the mapreduce phase of buildClusters.

        You can try CanopyDriver's run method with clusterFilter > 1 and runClustering = true to reproduce this issue; a sketch follows below.
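
        A sketch of that reproduction (the exact CanopyDriver.run overload and argument order vary across versions, so treat the argument list below as an assumption to verify against your Mahout build):

        // Assumed overload; check your version's CanopyDriver for the exact signature.
        CanopyDriver.run(new Configuration(),
            new Path("testdata/points"),        // input vectors as sequence files
            new Path("output"),
            new EuclideanDistanceMeasure(),
            3.0,                                // t1
            1.5,                                // t2
            2,                                  // clusterFilter > 1
            true,                               // runClustering = true
            false);                             // runSequential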

        Please don't confuse this issue with "Sequential and MapReduce give different results"; that's a separate problem.

        Thanks and Regards,
        Paritosh Ranjan

        Jeff Eastman added a comment -

        Canopy is intended to be a fast, approximate clustering algorithm. The Mahout sequential implementation runs a single pass over the data to produce approximate cluster centers. The mapreduce implementation runs one pass in each mapper and another pass in the reducer, to combine the results from the various mappers. The clusters produced by the sequential and mapreduce implementations will be different as a result.

        Once cluster centers are determined, the classification (clustering) of points follows a maximum-likelihood method which assigns each point to the closest cluster. This proposed patch modifies that method to impose an additional (d<T1) criterion on cluster assignment. This can result in some of the input points not being classified at all. I don't view this as a step in the right direction, nor do I think the current behavior is an incorrect result.

        -1 I'm inclined to reject this patch for these reasons.

        Paritosh Ranjan added a comment -

        One test case in the patch (TestCanopyClusterer) will fail without the fix in the patch, which demonstrates the problem.

        After applying the fix in the patch, the test cases will pass.

        Paritosh Ranjan added a comment -

        I have demonstrated the problem with a test case in the patch, and I have also fixed the issue, which is verified by another test case.

        The patch is for mahout-core.


          People

          • Assignee:
            Jeff Eastman
            Reporter:
            Paritosh Ranjan