Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-2885

All-pairs similarity via DIMSUM

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.2.0
    • MLlib
    • None

    Description

      Build all-pairs similarity algorithm via DIMSUM.

      Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”.

      The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible.

      DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling.

      Current PR:
      https://github.com/apache/spark/pull/1778

      Justification for adding to MLlib:

      • When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix

      More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum

      For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf

      Attachments

        1. SimilarItemsSmallTest.java
          2 kB
          Clive Cox

        Activity

          People

            rezazadeh Reza Zadeh
            rezazadeh Reza Zadeh
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: