Spark / SPARK-18392

LSH API, algorithm, and documentation follow-ups



    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML
    • Labels:


      This JIRA summarizes discussions from the initial LSH PR https://github.com/apache/spark/pull/15148 as well as the follow-up for hash distance https://github.com/apache/spark/pull/15800. This will be broken into subtasks:

      • API changes (targeted for 2.1)
      • algorithmic fixes (targeted for 2.1)
      • documentation improvements (ideally 2.1, but could slip)

      The major issues we have mentioned are as follows:

      • OR vs AND amplification
        • Need to make API flexible enough to support both types of amplification in the future
        • Need to clarify which we support, including in each model function (transform, similarity join, neighbors)
      • Need to clarify which algorithms we have implemented, improve docs and references, and fix the algorithms if needed.

      These major issues are broken down into detailed issues below.
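To make the OR-vs-AND distinction above concrete, here is a toy Python sketch (not Spark code; the hash family and the K/L values are illustrative assumptions):

```python
import random

random.seed(0)

def make_hash(num_buckets=10):
    """One basic hash function h_i; a toy random affine family for illustration."""
    a, b = random.random(), random.random()
    return lambda x: int((a * x + b) * num_buckets) % num_buckets

def and_match(g, x, y):
    # AND-amplification: compound hash g collides only if ALL K components agree
    return all(h(x) == h(y) for h in g)

def or_match(tables, x, y):
    # OR-amplification: (x, y) is a candidate pair if ANY of the L compound hashes collide
    return any(and_match(g, x, y) for g in tables)

K, L = 3, 4
tables = [[make_hash() for _ in range(K)] for _ in range(L)]

assert or_match(tables, 2.5, 2.5)  # identical keys always collide
```

AND-amplification lowers the collision probability of dissimilar keys (all K components must agree), while OR-amplification raises the collision probability of similar keys (one agreeing compound hash suffices); real deployments combine both.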

      LSH abstraction

      • Rename outputDim to something indicative of OR-amplification.
        • My current top pick is numHashTables, with numHashFunctions used in the future for AND amplification (Thanks Nicholas Pentreath!)
      • transform
        • Update output schema to Array of Vector instead of Vector. This is the "raw" output of all hash functions, i.e., with no aggregation for amplification.
        • Clarify meaning of output in terms of multiple hash functions and amplification.
        • Note: We will not worry about users using this output for dimensionality reduction; if anything, that use case can be explained in the User Guide.
      • Documentation
        • Clarify terminology used everywhere
          • hash function h_i: basic hash function without amplification
          • hash value h_i(key): output of a hash function
          • compound hash function {{g = (h_0, h_1, ..., h_{K-1})}}: hash function with AND-amplification using K base hash functions
          • compound hash function value g(key): vector-valued output
          • hash table {{H = (g_0, g_1, ..., g_{L-1})}}: hash function with OR-amplification using L compound hash functions
          • hash table value H(key): output is an array of vectors
          • This terminology is largely pulled from Wang et al.'s survey and the multi-probe LSH paper.
        • Link clearly to documentation (Wikipedia or papers) which matches our terminology and what we implemented
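The terminology above can be illustrated with a toy sketch (the modular hash family here is an assumption for illustration, not the Spark implementation); note that H(key) has exactly the Array-of-Vector shape proposed for the transform output:

```python
K, L = 2, 3  # illustrative sizes

def h(i, key):
    # basic hash function h_i; toy modular family for illustration
    return (i * 31 + key) % 7

def g(j, key):
    # compound hash function value g_j(key): a K-vector (AND-amplification)
    return [h(j * K + i, key) for i in range(K)]

def H(key):
    # hash table value H(key): L compound values (OR-amplification),
    # i.e. an array of vectors
    return [g(j, key) for j in range(L)]

out = H(42)
assert len(out) == L and all(len(v) == K for v in out)
```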

      RandomProjection (or P-Stable Distributions)

      • Rename RandomProjection
        • Options include: ScalarRandomProjectionLSH, BucketedRandomProjectionLSH, PStableLSH
      • API privacy
        • Make randUnitVectors private
      • hashFunction
        • Currently, this uses OR-amplification for single probing, as we intended.
        • It does not do multiple probing, at least not in the sense of the original MPLSH paper. We should fix that or at least document its behavior.
      • Documentation
        • Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
        • Also link to the multi-probe LSH paper since that explains this method very clearly.
        • Clarify hash function and distance metric
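As a rough sketch of the p-stable scheme this section refers to, the hash is h(x) = floor((w · x + b) / r) with w drawn from a Gaussian (a 2-stable distribution); the parameters below are illustrative, not Spark's internals:

```python
import math
import random

random.seed(1)

def make_pstable_hash(dim, bucket_width):
    # w ~ N(0, 1)^dim is 2-stable, so w . x preserves Euclidean distance
    # in distribution; b is a random shift within one bucket
    w = [random.gauss(0.0, 1.0) for _ in range(dim)]
    b = random.uniform(0.0, bucket_width)
    def h(x):
        # bucket the projection of x onto the random direction w
        dot = sum(wi * xi for wi, xi in zip(w, x))
        return math.floor((dot + b) / bucket_width)
    return h

h = make_pstable_hash(dim=3, bucket_width=4.0)
bucket = h([1.0, 2.0, 3.0])
assert bucket == h([1.0, 2.0, 3.0])  # deterministic for a fixed draw of (w, b)
```

Points that are close in Euclidean distance project to nearby values and therefore tend to share a bucket, which is the collision property the method relies on.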


      MinHash

      • Rename MinHash -> MinHashLSH
      • API privacy
        • Make randCoefficients, numEntries private
      • hashDistance (used in approxNearestNeighbors)
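For context, the MinHash idea itself can be sketched as follows (the prime and coefficients are illustrative stand-ins, not the randCoefficients Spark draws):

```python
PRIME = 2038074743  # a prime modulus, larger than any element id used here

def minhash(a, b, elems):
    # the minimum of a random affine map over a set approximates the
    # minimum of a random permutation; P[two sets share a minhash value]
    # approximates their Jaccard similarity
    return min((a * e + b) % PRIME for e in elems)

A = {1, 2, 3, 4}
B = {2, 3, 4, 5}
a, b = 1299721, 15485863  # illustrative "random" coefficients

# Adding an element that maps to a larger value leaves the minimum unchanged.
assert minhash(a, b, A) == minhash(a, b, A | {9})
```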

      All references

      I'm just listing references I looked at.




      • Assignee: josephkb Joseph K. Bradley

