Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-18334

What hashDistance should MinHash use?

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      MinHash currently is using the same `hashDistance` function as RandomProjection. This does not make sense for MinHash because the Jaccard distance of two sets is not relevant to the absolute distance of their hash buckets indices.

      This bug could affect accuracy of multi probing NN search for MinHash.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                yunn Yun Ni
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: