Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
None
-
None
Description
This JIRA summarizes discussions from the initial LSH PR https://github.com/apache/spark/pull/15148 as well as the follow-up for hash distance https://github.com/apache/spark/pull/15800. This will be broken into subtasks:
- API changes (targeted for 2.1)
- algorithmic fixes (targeted for 2.1)
- documentation improvements (ideally 2.1, but could slip)
The major issues we have mentioned are as follows:
- OR vs AND amplification
- Need to make API flexible enough to support both types of amplification in the future
- Need to clarify which we support, including in each model function (transform, similarity join, neighbors)
- Need to clarify which algorithms we have implemented, improve docs and references, and fix the algorithms if needed.
These major issues are broken down into detailed issues below.
LSH abstraction
- Rename outputDim to something indicative of OR-amplification.
- My current top pick is numHashTables, with numHashFunctions used in the future for AND amplification (Thanks mlnick!)
- transform
- Update output schema to Array of Vector instead of Vector. This is the "raw" output of all hash functions, i.e., with no aggregation for amplification.
- Clarify meaning of output in terms of multiple hash functions and amplification.
- Note: We will not worry about users using this output for dimensionality reduction; if anything, that use case can be explained in the User Guide.
- Documentation
- Clarify terminology used everywhere
- hash function h_i: basic hash function without amplification
- hash value h_i(key): output of a hash function
- compound hash function {{g = (h_0,h_1,...h_
{K-1}
)}}: hash function with AND-amplification using K base hash functions
- compound hash function value g(key): vector-valued output
- hash table {{H = (g_0,g_1,...g_
{L-1}
)}}: hash function with OR-amplification using L compound hash functions
- hash table value H(key): output of array of vectors
- This terminology is largely pulled from Wang et al.'s survey and the multi-probe LSH paper.
- Link clearly to documentation (Wikipedia or papers) which matches our terminology and what we implemented
- Clarify terminology used everywhere
RandomProjection (or P-Stable Distributions)
- Rename RandomProjection
- Options include: ScalarRandomProjectionLSH, BucketedRandomProjectionLSH, PStableLSH
- API privacy
- Make randUnitVectors private
- hashFunction
- Currently, this uses OR-amplification for single probing, as we intended.
- It does not do multiple probing, at least not in the sense of the original MPLSH paper. We should fix that or at least document its behavior.
- Documentation
- Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
- Also link to the multi-probe LSH paper since that explains this method very clearly.
- Clarify hash function and distance metric
MinHash
- Rename MinHash -> MinHashLSH
- API privacy
- Make randCoefficients, numEntries private
- hashDistance (used in approxNearestNeighbors)
- Update to use average of indicators of hash collisions
SPARK-18334 - See Wikipedia for a reference
- Update to use average of indicators of hash collisions
All references
I'm just listing references I looked at.
Papers
- http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf
- https://people.csail.mit.edu/indyk/p117-andoni.pdf
- http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf
- http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf - Multi-probe LSH paper
Wikipedia
Attachments
Issue Links
- contains
-
SPARK-18408 API Improvements for LSH
- Resolved
-
SPARK-18450 Add AND-amplification to Locality Sensitive Hashing
- Resolved
-
SPARK-18454 Changes to improve Nearest Neighbor Search for LSH
- Resolved
- is related to
-
SPARK-18334 What hashDistance should MinHash use?
- Resolved
-
SPARK-5992 Locality Sensitive Hashing (LSH)
- Resolved
- relates to
-
SPARK-19771 Support OR-AND amplification in Locality Sensitive Hashing (LSH)
- Resolved
- links to