Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Labels:
      None

      Description

      Improve currently implemented variant of the mean shift algorithm in Mahout so that kernels are supported.

        Activity

        Hide
        Sean Owen added a comment -

        Tiny question on this. Would it be OK to use Java's UnsupportedOperationException instead of Commons Lang's NotImplementedException? Just seems more standard.

        Show
        Sean Owen added a comment - Tiny question on this. Would it be OK to use Java's UnsupportedOperationException instead of Commons Lang's NotImplementedException? Just seems more standard.
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #863 (See https://builds.apache.org/hudson/job/Mahout-Quality/863/)
        MAHOUT-597: Applied patch to support kernels in MeanShift. All tests run.

        jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1131504
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/kernel
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/cdbw/TestCDbwEvaluator.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyClusterer.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/commandline/DefaultOptionCreator.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayMeanShift.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterEvaluator.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/kernel/NormalKernelProfile.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/kernel/TriangularKernelProfile.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyConfigKeys.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/kernel/IKernelProfile.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #863 (See https://builds.apache.org/hudson/job/Mahout-Quality/863/ ) MAHOUT-597 : Applied patch to support kernels in MeanShift. All tests run. jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1131504 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/common/kernel /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/cdbw/TestCDbwEvaluator.java /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyClusterer.java /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java /mahout/trunk/core/src/main/java/org/apache/mahout/common/commandline/DefaultOptionCreator.java /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayMeanShift.java /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterEvaluator.java /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java /mahout/trunk/core/src/main/java/org/apache/mahout/common/kernel/NormalKernelProfile.java /mahout/trunk/core/src/main/java/org/apache/mahout/common/kernel/TriangularKernelProfile.java /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyConfigKeys.java /mahout/trunk/core/src/main/java/org/apache/mahout/common/kernel/IKernelProfile.java
        Hide
        Jeff Eastman added a comment -

        r1131504 committed this patch after fixing conflicts caused by the delay. All tests run

        Show
        Jeff Eastman added a comment - r1131504 committed this patch after fixing conflicts caused by the delay. All tests run
        Hide
        Jeff Eastman added a comment -

        Should have committed it earlier. It has a few conflicts now but I'm on it for 0.6 for sure.

        Show
        Jeff Eastman added a comment - Should have committed it earlier. It has a few conflicts now but I'm on it for 0.6 for sure.
        Hide
        Sean Owen added a comment -

        Jeff it looks like you approve of the patch – seems like something to be worked in for 0.6 then.

        Show
        Sean Owen added a comment - Jeff it looks like you approve of the patch – seems like something to be worked in for 0.6 then.
        Hide
        Jeff Eastman added a comment -

        This looks excellent, and the change in #1 clearly fixes a major bug. Nice work with the kernels too.

        Show
        Jeff Eastman added a comment - This looks excellent, and the change in #1 clearly fixes a major bug. Nice work with the kernels too.
        Hide
        Vasil Vasilev added a comment -

        This is a proposal for implementing various types of kernels with the means shift algorithm. The following changes were introduced:

        1. The original algorithm was changed: Previously when one cluster "touched" other clusters its centroid was computed only based on the centroids of the clusters it touched, but not based on his centroid itself. Now before calculating the shift I one more line was added which makes the cluster observe his personal centroid:

        public boolean shiftToMean(MeanShiftCanopy canopy)

        { canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size()); canopy.computeConvergence(measure, convergenceDelta); canopy.computeParameters(); return canopy.isConverged(); }

        With this modification also a problem with convergence of the algorithm is resolved, when there are 2 clusters left which are within their T1 boundaries and not within their T2 boundaries. In this cases their centroids will switch from one another until infinity.
        With this modification, however, the number of iterations until convergence increased slightly.
        This change was necessary in order to introduce the other types of "soft" kernels.

        2. IKernelProfile interface was introduced which has the methods:
        public double calculateValue(double distance, double h);
        public double calculateDerivativeValue(double distance, double h);

        The first one is not used by the current implementations, because the value of the kernel profile - k - is not needed in calculating the meanshift, rather the value of the derivative of the kernel profile - g - is used.

        3. Currenlty 2 implementations are created:
        TriangularKernelProfile with calculated value:
        @Override
        public double calculateDerivativeValue(double distance, double h)

        { return (distance < h) ? 1.0 : 0.0; }

        This profile covers the cases that are currently covered by the meanshift algorithm implementation

        In addition NormalKernelProfile was created with calculated value:
        @Override
        public double calculateDerivativeValue(double distance, double h)

        { return UncommonDistributions.dNorm(distance, 0.0, h); }

        4. The merging of canopies now uses the kernel profile

        public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy> canopies) {
        MeanShiftCanopy closestCoveringCanopy = null;
        double closestNorm = Double.MAX_VALUE;
        for (MeanShiftCanopy canopy : canopies) {
        double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
        double weight = kernelProfile.calculateDerivativeValue(norm, t1);
        if(weight > 0.0)

        { aCanopy.touch(canopy, weight); }

        if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm)))

        { closestNorm = norm; closestCoveringCanopy = canopy; }

        }
        if (closestCoveringCanopy == null)

        { canopies.add(aCanopy); }

        else

        { closestCoveringCanopy.merge(aCanopy); }

        }

        5. The MeanShiftCanopy is modified so that its touch method includes also the weight with which the clusters should observe each other. This weight was calculated based on the kernel profile in use

        void touch(MeanShiftCanopy canopy, double weight)

        { canopy.observe(getCenter(), weight*((double)boundPoints.size())); observe(canopy.getCenter(), weight*((double)canopy.boundPoints.size())); }

        6. Some other modifications were made which were necessary to compile the code and to make the tests pass

        With the so changed algorithm the following observations can be made:

        1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster content is more intermixed
        2. The following procedure for estimating T2 and convergence delta can be used:

        • bind convergence delta = T2 / 2
        • When you decrease T2 the number of iterations increases and the number of resulting clusters after convergence decreases
        • You come to a moment when you decrease T2 the number of iterations increases, but the number of resulting clusters remains unchanged. This is the point with the best value for T2
          3. In case of Normal kernel what you give as T1 is in fact the standard deviation of a normal distribution, so the radius of the window will be T1^2
        Show
        Vasil Vasilev added a comment - This is a proposal for implementing various types of kernels with the means shift algorithm. The following changes were introduced: 1. The original algorithm was changed: Previously when one cluster "touched" other clusters its centroid was computed only based on the centroids of the clusters it touched, but not based on his centroid itself. Now before calculating the shift I one more line was added which makes the cluster observe his personal centroid: public boolean shiftToMean(MeanShiftCanopy canopy) { canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size()); canopy.computeConvergence(measure, convergenceDelta); canopy.computeParameters(); return canopy.isConverged(); } With this modification also a problem with convergence of the algorithm is resolved, when there are 2 clusters left which are within their T1 boundaries and not within their T2 boundaries. In this cases their centroids will switch from one another until infinity. With this modification, however, the number of iterations until convergence increased slightly. This change was necessary in order to introduce the other types of "soft" kernels. 2. IKernelProfile interface was introduced which has the methods: public double calculateValue(double distance, double h); public double calculateDerivativeValue(double distance, double h); The first one is not used by the current implementations, because the value of the kernel profile - k - is not needed in calculating the meanshift, rather the value of the derivative of the kernel profile - g - is used. 3. Currenlty 2 implementations are created: TriangularKernelProfile with calculated value: @Override public double calculateDerivativeValue(double distance, double h) { return (distance < h) ? 1.0 : 0.0; } This profile covers the cases that are currently covered by the meanshift algorithm implementation In addition NormalKernelProfile was created with calculated value: @Override public double calculateDerivativeValue(double distance, double h) { return UncommonDistributions.dNorm(distance, 0.0, h); } 4. The merging of canopies now uses the kernel profile public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy> canopies) { MeanShiftCanopy closestCoveringCanopy = null; double closestNorm = Double.MAX_VALUE; for (MeanShiftCanopy canopy : canopies) { double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter()); double weight = kernelProfile.calculateDerivativeValue(norm, t1); if(weight > 0.0) { aCanopy.touch(canopy, weight); } if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) { closestNorm = norm; closestCoveringCanopy = canopy; } } if (closestCoveringCanopy == null) { canopies.add(aCanopy); } else { closestCoveringCanopy.merge(aCanopy); } } 5. The MeanShiftCanopy is modified so that its touch method includes also the weight with which the clusters should observe each other. This weight was calculated based on the kernel profile in use void touch(MeanShiftCanopy canopy, double weight) { canopy.observe(getCenter(), weight*((double)boundPoints.size())); observe(canopy.getCenter(), weight*((double)canopy.boundPoints.size())); } 6. Some other modifications were made which were necessary to compile the code and to make the tests pass With the so changed algorithm the following observations can be made: 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster content is more intermixed 2. The following procedure for estimating T2 and convergence delta can be used: bind convergence delta = T2 / 2 When you decrease T2 the number of iterations increases and the number of resulting clusters after convergence decreases You come to a moment when you decrease T2 the number of iterations increases, but the number of resulting clusters remains unchanged. This is the point with the best value for T2 3. In case of Normal kernel what you give as T1 is in fact the standard deviation of a normal distribution, so the radius of the window will be T1^2

          People

          • Assignee:
            Jeff Eastman
            Reporter:
            Vasil Vasilev
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development