Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-31007

KMeans optimization based on triangle-inequality

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: ML
    • Labels:
      None

      Description

      In current impl, following Lemma is used in KMeans:

      0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= |(d(x,o) - d(c,o))| = |norm-norm(c)|

      this can be applied in EuclideanDistance, but not in CosineDistance

      According to [Using the Triangle Inequality to Accelerate K-Means|https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf], we can go futher, and there are another two Lemmas can be used:

      1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);

      this can be applied in EuclideanDistance, but not in CosineDistance.

      However, luckily for CosineDistance we can get a variant in the space of radian/angle.

      2, Let x be a point, and let b and c be centers. Then d(x,c) >= max{0, d(x,b)-d(b,c)};

      this can be applied in EuclideanDistance, but not in CosineDistance

      The application of Lemma 2 is a little complex: It need to cache/update the distance/lower bounds to previous centers, and thus can be only applied in training, not usable in prediction.

      So this ticket is mainly for Lemma 1. Its idea is quite simple, if point x is close to center b enough (less than a pre-computed radius), then we can say point x belong to center c without computing the distances between x and other centers. It can be used in both training and predction.

        Attachments

        1. ICML03-022.pdf
          95 kB
          zhengruifeng

          Issue Links

            Activity

              People

              • Assignee:
                podongfeng zhengruifeng
                Reporter:
                podongfeng zhengruifeng
              • Votes:
                0 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: