Mahout
  1. Mahout
  2. MAHOUT-513

ClusterEvaluator inter-cluster density returns NaN

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.4
    • Component/s: Clustering
    • Labels:
      None

      Description

      Hi Jeff,

      I've been trying out the ClusterEvaluator class today since your recent changes, and I'm running into a problem whereby the average intra-cluster density can be set to NaN. Looking into it, it seems to happen for clusters containing points which are very close to the centroid. For example, I have a cluster with:

      Centroid:

      {0:0.6075199543688895,1:-0.3165058387409551,2:0.2027106147825682,3:-21.246338574215706,4:-5.875047828899212,5:-0.9835694086952028,6:0.2794019939470805,7:-0.36402079609289717,8:0.5201946127074457,9:-0.47084217746293855,10:-0.14380397719670499,11:-0.10441028152861193,12:0.0698485086335405,13:0.014286758874801297}

      and one of the representative points (3 per cluster):

      [0.6075199543688894, -0.31650583874095506, 0.2027106147825682, -21.2463385742157, -5.875047828899212, -0.9835694086952026, 0.27940199394708054, -0.36402079609289706, 0.5201946127074457, -0.47084217746293855, -0.14380397719670499, -0.10441028152861194, 0.06984850863354047, 0.014286758874801297]

      As far as I can tell from debugging, the representative points look identical to the centroid of this cluster, but I'm assuming there's some small difference as "if (!vector.equals(clusterI.getCenter()))" in ClusterEvaluator.invalidCluster() is always returning false for these points, and so the cluster isn't pruned from the list.

      Later on, in ClusterEvaluator.intraClusterDensity(), the "min" and "max" distances are ending up with the same value, and the density from "double density = (sum / count - min) / (max - min);" is calculated as NaN, e.g. here are the values I'm getting:

      min = max = 1.5397509610616733E-7
      count = 3
      sum = 4.61925288318502E-7
      max - min: 0.0
      count - min: 2.9999998460249038
      (sum / count - min) = 0.0

      This then causes avgDensity to be calculated as NaN. I'm not sure what the solution is here, should invalidCluster() check that the the difference between the centroid and the candidate representative point is greater than a certain threshold, which would cause such a cluster to be pruned? Or is the fix in the intraClusterDensity() calculation to handle the case where min = max?

      BTW would you prefer that I create a Jira to record these issues, or is it okay to send them to the dev list as I've been doing?

      Thanks,

      Derek

        Activity

        Hide
        Jeff Eastman added a comment -

        Hi Derek,

        Thanks for your help on this new (experimental) code! If a particular cluster actually has no points assigned to it by one of the clustering jobs, then the centroid of the cluster will be repeated n times in its representative points and the (max-min) will fail as you note. Dirichlet does this quite often, as there are usually more models allocated than receive points in an iteration. The invalidCluster method is attempting to detect this degenerate situation and remove all clusters that would mess up the calculations.

        In your situation, I gather your representative points are so close to the centroid that (max-min) becomes zero while the centroid vector equality test returns false because there is still some small difference. My hunch is that these clusters need to be pruned too, and adding an epsilon test to invalidCluster would be the right choice. Otherwise, one would have to return a very large number for the normalized density of that cluster and it would radically skew the intra-cluster density average. OTOH, if your clusters really do have distinct representative points you might want a very large intra-cluster density. I'm open to suggestions here. NaN is clearly not helpful.

        Is this a text-clustering problem you are working?

        Show
        Jeff Eastman added a comment - Hi Derek, Thanks for your help on this new (experimental) code! If a particular cluster actually has no points assigned to it by one of the clustering jobs, then the centroid of the cluster will be repeated n times in its representative points and the (max-min) will fail as you note. Dirichlet does this quite often, as there are usually more models allocated than receive points in an iteration. The invalidCluster method is attempting to detect this degenerate situation and remove all clusters that would mess up the calculations. In your situation, I gather your representative points are so close to the centroid that (max-min) becomes zero while the centroid vector equality test returns false because there is still some small difference. My hunch is that these clusters need to be pruned too, and adding an epsilon test to invalidCluster would be the right choice. Otherwise, one would have to return a very large number for the normalized density of that cluster and it would radically skew the intra-cluster density average. OTOH, if your clusters really do have distinct representative points you might want a very large intra-cluster density. I'm open to suggestions here. NaN is clearly not helpful. Is this a text-clustering problem you are working?
        Hide
        Derek O'Callaghan added a comment -

        Hi Jeff,

        In this case it appears that there are ~20 points in the cluster, and they're all almost identical to each other. It's a text-clustering problem, using reduced dimensionality, and these original 20 points have almost identical terms. I'm not sure either what the solution is, this is an acceptable cluster which so happens to be quite dense, so it'd be good to see this in the results. Having said that, the average density will then be skewed as you say, as the remaining clusters in this case are nowhere near as dense. I need to think about it a bit more.

        I'm also getting a couple of strange values in the CDbwEvaluator, I suspect it could be a similar issue but I haven't had a chance to confirm it yet.

        Thanks,

        Derek

        Show
        Derek O'Callaghan added a comment - Hi Jeff, In this case it appears that there are ~20 points in the cluster, and they're all almost identical to each other. It's a text-clustering problem, using reduced dimensionality, and these original 20 points have almost identical terms. I'm not sure either what the solution is, this is an acceptable cluster which so happens to be quite dense, so it'd be good to see this in the results. Having said that, the average density will then be skewed as you say, as the remaining clusters in this case are nowhere near as dense. I need to think about it a bit more. I'm also getting a couple of strange values in the CDbwEvaluator, I suspect it could be a similar issue but I haven't had a chance to confirm it yet. Thanks, Derek
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #344 (See https://hudson.apache.org/hudson/job/Mahout-Quality/344/)
        MAHOUT-513: Added a sequential version of representativePoints extraction. Changed other tests to use it. All tests run

        Show
        Hudson added a comment - Integrated in Mahout-Quality #344 (See https://hudson.apache.org/hudson/job/Mahout-Quality/344/ ) MAHOUT-513 : Added a sequential version of representativePoints extraction. Changed other tests to use it. All tests run
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #346 (See https://hudson.apache.org/hudson/job/Mahout-Quality/346/)
        MAHOUT-513: Reviewed calculations from paper in detail, making some changes that seemed needed. All tests run but I'm still not very confident in the computed numbers

        Show
        Hudson added a comment - Integrated in Mahout-Quality #346 (See https://hudson.apache.org/hudson/job/Mahout-Quality/346/ ) MAHOUT-513 : Reviewed calculations from paper in detail, making some changes that seemed needed. All tests run but I'm still not very confident in the computed numbers
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #347 (See https://hudson.apache.org/hudson/job/Mahout-Quality/347/)
        MAHOUT-513: hopefully fixed weighting of vectors by adding weightedX
        MAHOUT-513

        • Created interface GaussianAccumulator and two concrete implementations:
        • RunningSumsGaussianAccumulator uses running sums approach
        • OnlineGaussianAccumulator uses Knuth (Welford) approach
        • Added unit test thereof which produces significant std deviations and drastically-odd
          variances. I'm committing this so it can get more eyeballs. It is not used anywhere yet.
        • Refactored CDbwClusterEvaluator to use RunningSumsGaussianAccumulator and
          existing tests continue to run
        • Cleaned up logging in various clustering algorithms to increase use of debug vs. info
          to reduce log clutter
          All tests run.
        Show
        Hudson added a comment - Integrated in Mahout-Quality #347 (See https://hudson.apache.org/hudson/job/Mahout-Quality/347/ ) MAHOUT-513 : hopefully fixed weighting of vectors by adding weightedX MAHOUT-513 Created interface GaussianAccumulator and two concrete implementations: RunningSumsGaussianAccumulator uses running sums approach OnlineGaussianAccumulator uses Knuth (Welford) approach Added unit test thereof which produces significant std deviations and drastically-odd variances. I'm committing this so it can get more eyeballs. It is not used anywhere yet. Refactored CDbwClusterEvaluator to use RunningSumsGaussianAccumulator and existing tests continue to run Cleaned up logging in various clustering algorithms to increase use of debug vs. info to reduce log clutter All tests run.
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #348 (See https://hudson.apache.org/hudson/job/Mahout-Quality/348/)
        MAHOUT-513

        • removed weighting from GaussianAccumulator.observe(). It's not needed for
          CDbw and is problematic in the OnlineGaussianAccumulator. Tests all run.
          MAHOUT-513
        • fixed bug in OnlineGaussianAccumulator.getStd()
        • added test of variance
          all tests now run, though std/variance results are different than with RunningSums
        Show
        Hudson added a comment - Integrated in Mahout-Quality #348 (See https://hudson.apache.org/hudson/job/Mahout-Quality/348/ ) MAHOUT-513 removed weighting from GaussianAccumulator.observe(). It's not needed for CDbw and is problematic in the OnlineGaussianAccumulator. Tests all run. MAHOUT-513 fixed bug in OnlineGaussianAccumulator.getStd() added test of variance all tests now run, though std/variance results are different than with RunningSums
        Hide
        Jeff Eastman added a comment -

        Replacing the standard deviation computation and some other algorithm changes seem to have resolved this issue. Marking Fixed.

        Show
        Jeff Eastman added a comment - Replacing the standard deviation computation and some other algorithm changes seem to have resolved this issue. Marking Fixed.

          People

          • Assignee:
            Jeff Eastman
            Reporter:
            Jeff Eastman
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development