[SPARK-18392] LSH API, algorithm, and documentation follow-ups - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: ML
Labels:
- bulk-closed

Description

This JIRA summarizes discussions from the initial LSH PR https://github.com/apache/spark/pull/15148 as well as the follow-up for hash distance https://github.com/apache/spark/pull/15800. This will be broken into subtasks:

API changes (targeted for 2.1)
algorithmic fixes (targeted for 2.1)
documentation improvements (ideally 2.1, but could slip)

The major issues we have mentioned are as follows:

OR vs AND amplification
- Need to make API flexible enough to support both types of amplification in the future
- Need to clarify which we support, including in each model function (transform, similarity join, neighbors)
Need to clarify which algorithms we have implemented, improve docs and references, and fix the algorithms if needed.

These major issues are broken down into detailed issues below.

LSH abstraction

Rename outputDim to something indicative of OR-amplification.
- My current top pick is numHashTables, with numHashFunctions used in the future for AND amplification (Thanks mlnick!)

transform
- Update output schema to Array of Vector instead of Vector. This is the "raw" output of all hash functions, i.e., with no aggregation for amplification.
- Clarify meaning of output in terms of multiple hash functions and amplification.
- Note: We will not worry about users using this output for dimensionality reduction; if anything, that use case can be explained in the User Guide.

Documentation
- Clarify terminology used everywhere
  - hash function h_i: basic hash function without amplification
  - hash value h_i(key): output of a hash function
  - compound hash function {{g = (h_0,h_1,...h_ {K-1}
    )}}: hash function with AND-amplification using K base hash functions
  - compound hash function value g(key): vector-valued output
  - hash table {{H = (g_0,g_1,...g_ {L-1}
    )}}: hash function with OR-amplification using L compound hash functions
  - hash table value H(key): output of array of vectors
  - This terminology is largely pulled from Wang et al.'s survey and the multi-probe LSH paper.
- Link clearly to documentation (Wikipedia or papers) which matches our terminology and what we implemented

RandomProjection (or P-Stable Distributions)

Rename RandomProjection
- Options include: ScalarRandomProjectionLSH, BucketedRandomProjectionLSH, PStableLSH

API privacy
- Make randUnitVectors private

hashFunction
- Currently, this uses OR-amplification for single probing, as we intended.
- It does not do multiple probing, at least not in the sense of the original MPLSH paper. We should fix that or at least document its behavior.

Documentation
- Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
- Also link to the multi-probe LSH paper since that explains this method very clearly.
- Clarify hash function and distance metric

MinHash

Rename MinHash -> MinHashLSH

API privacy
- Make randCoefficients, numEntries private

hashDistance (used in approxNearestNeighbors)
- Update to use average of indicators of hash collisions ~~SPARK-18334~~
- See Wikipedia for a reference

All references

I'm just listing references I looked at.

Papers

Wikipedia

Attachments

Issue Links

contains

SPARK-18408 API Improvements for LSH

Resolved

SPARK-18450 Add AND-amplification to Locality Sensitive Hashing

Resolved

SPARK-18454 Changes to improve Nearest Neighbor Search for LSH

Resolved

is related to

SPARK-18334 What hashDistance should MinHash use?

Resolved

SPARK-5992 Locality Sensitive Hashing (LSH)

Resolved

relates to

SPARK-19771 Support OR-AND amplification in Locality Sensitive Hashing (LSH)

Resolved

links to

PR with discussion about MinHash approxNearestNeighbors

(1 relates to, 1 links to)

Activity

People

Assignee:: Unassigned

Reporter:: Joseph K. Bradley

Votes:: 0 Vote for this issue

Watchers:: 12 Start watching this issue

Dates

Created:: 10/Nov/16 00:01

Updated:: 21/May/19 04:12

Resolved:: 21/May/19 04:12