Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: ML
    • Labels:
      None

      Description

      Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm.

        Issue Links

          Activity

          yuu.ishikawa@gmail.com Yu Ishikawa added a comment -

          What kind of LSH algorithms should we support at first?

          • Bit sampling for Hamming distance
          • Min-wise independent permutations
          • Nilsimsa Hash
          • Random projection
          • Min Hash
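As a rough illustration of the family, the simplest of these — bit sampling for Hamming distance — can be sketched in a few lines of plain Python (toy code, not a Spark implementation):

```python
import random

def bit_sampling_hash(bits, indices):
    """Hash a binary vector by keeping only the bits at fixed sampled positions."""
    return tuple(bits[i] for i in indices)

rng = random.Random(42)
n, k = 16, 4
indices = rng.sample(range(n), k)  # sample k of the n bit positions once, up front

a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
b = list(a)
b[indices[0]] ^= 1  # flip a sampled bit: the two hashes are guaranteed to differ

# Two vectors at Hamming distance d collide with probability roughly (1 - d/n)^k.
print(bit_sampling_hash(a, indices), bit_sampling_hash(b, indices))
```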
          josephkb Joseph K. Bradley added a comment -

          I'm not sure what the most important or useful ones would be. I've heard the most about random projections. Are you familiar with the others? Perhaps we can make a list of algorithms and their properties to try to get a reasonable coverage of use cases.

          yuu.ishikawa@gmail.com Yu Ishikawa added a comment -

          As I said when we talked at Spark Summit East, I think we should design the public API and implement just the random projection method at first.
          I am interested in this paper, though I haven't read it in detail.

          Bahman Bahmani, Ashish Goel, and Rajendra Shinde. 2012. Efficient distributed locality sensitive hashing. In Proceedings of the 21st ACM international conference on Information and knowledge management (CIKM '12). ACM, New York, NY, USA, 2174-2178. DOI=10.1145/2396761.2398596
          http://web.stanford.edu/~ashishg/papers/darpa_unpub2.pdf

          josephkb Joseph K. Bradley added a comment -

          That sounds good; I'll try to take a look at that paper soon. Will you be able to write a short design doc?

          yuu.ishikawa@gmail.com Yu Ishikawa added a comment -

          Yes, I will try to write a design doc. Thank you for your continued support.

          debasish83 Debasish Das added a comment -

          Did anyone compare Algebird LSH with the Spark MinHash link above? Unless Algebird is slow (which I found it to be for the TopK monoid), we could use it the same way HLL is used in Spark Streaming. Is it OK to add Algebird as a dependency of MLlib?

          yuu.ishikawa@gmail.com Yu Ishikawa added a comment -

          Debasish Das Thank you for your comment. I haven't compared Algebird LSH with Spark's. I also don't know whether adding Algebird to MLlib is OK. However, as you suggest, I think we should use it in Spark Streaming.

          yuu.ishikawa@gmail.com Yu Ishikawa added a comment -

          Initial version of the design doc: https://github.com/yu-iskw/SPARK-5992-LSH-design-doc/blob/master/design-doc.md
          karlhigley Karl Higley added a comment -

          To make it easier to define a common interface, it might help to restrict consideration to methods that produce hash signatures. For cosine similarity, sign-random-projection LSH would probably fit the bill. See Section 3 of "Similarity Estimation Techniques from Rounding Algorithms":
          http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
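Sign-random-projection can be sketched in pure Python (toy code, not Spark): each bit of the signature is the sign of a dot product with a random Gaussian hyperplane, and the probability that two signatures disagree on a bit equals theta/pi, so the Hamming distance between signatures yields an estimate of the cosine.

```python
import math
import random

def srp_signature(v, planes):
    """Sign-random-projection: one bit per random hyperplane (sign of dot product)."""
    return [1 if sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0 else 0 for p in planes]

def estimated_cosine(sig_a, sig_b):
    """P[bits differ] = theta/pi, so the Hamming distance estimates the angle."""
    hamming = sum(x != y for x, y in zip(sig_a, sig_b))
    theta = math.pi * hamming / len(sig_a)
    return math.cos(theta)

rng = random.Random(0)
dim, n_bits = 8, 256
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

u = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 1.0]
v = [1.1, 2.1, 0.1, -0.9, 0.4, 2.9, -2.1, 1.0]  # nearly parallel to u
print(round(estimated_cosine(srp_signature(u, planes), srp_signature(v, planes)), 2))
```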

          josephkb Joseph K. Bradley added a comment -

          I'm just noting here that Yu Ishikawa and I discussed the design doc. The current plan is to have these implemented under ml.feature, with 2 types of UnaryTransformers:

          • Vector -> T: hash feature vector to some other type (Vector, Int, etc.)
          • Vector -> Double: compare feature vector with a fixed base Vector, and return similarity
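As a plain-Python sketch (hypothetical function names, not the Spark API), the two transformer shapes would look like:

```python
import math

def hash_transformer(vec):
    """Vector -> T: map a feature vector to a hash value (here, an int bucket)."""
    # Toy hash: bucket by the sign pattern of the components.
    return sum((1 << i) for i, x in enumerate(vec) if x >= 0)

def make_similarity_transformer(base):
    """Vector -> Double: compare each input vector against a fixed base vector."""
    base_norm = math.sqrt(sum(x * x for x in base))
    def transform(vec):
        dot = sum(a * b for a, b in zip(base, vec))
        norm = math.sqrt(sum(x * x for x in vec))
        return dot / (base_norm * norm)  # cosine similarity to the base vector
    return transform

sim_to_base = make_similarity_transformer([1.0, 0.0, 1.0])
print(hash_transformer([0.5, -1.2, 3.0]))  # an int bucket
print(sim_to_base([2.0, 0.0, 2.0]))        # close to 1.0: parallel vectors
```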
          debasish83 Debasish Das added a comment -

          LSH is in any case optimized for cosine. I think we should use Algebird LSH for quality comparisons, but in MLlib add ideas like randomized k-d trees or k-means trees. Hopefully it can stay generic over distance functions.

          cepnyci Maruf Aytekin added a comment - - edited

          In addition to Charikar's scheme for cosine that Karl Higley pointed out, LSH schemes for the other well-known similarity/distance measures are as follows:

          1. Hamming norm:
          A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB (1999).
          http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf

          2. Lp norms:
          M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th Annual ACM Symposium on Computational Geometry, SoCG (2004).
          http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
          http://people.csail.mit.edu/indyk/nips-nn.ps

          3. Jaccard distance:
          Mining of Massive Datasets, Chapter 3.
          The spark-hash package referenced above already implements this.

          4. Cosine distance and Earth movers distance (EMD):
          M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
          http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf
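For the Jaccard case (scheme 3), the classic MinHash estimator can be sketched in pure Python (toy code; the universal-hash construction below is just one common choice):

```python
import random

def minhash_signature(item_set, hash_funcs):
    """One min-hash per hash function; P[component matches] = Jaccard similarity."""
    return [min(h(x) for x in item_set) for h in hash_funcs]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature components estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(7)
PRIME = 2_147_483_647  # a Mersenne prime larger than any item id
# Universal hashes h(x) = (a*x + b) mod p, with a, b frozen via default arguments.
hash_funcs = [
    (lambda x, a=rng.randrange(1, PRIME), b=rng.randrange(PRIME): (a * x + b) % PRIME)
    for _ in range(128)
]

s1 = set(range(0, 100))
s2 = set(range(20, 120))  # true Jaccard = 80 / 120, about 0.667
print(round(estimated_jaccard(minhash_signature(s1, hash_funcs),
                              minhash_signature(s2, hash_funcs)), 2))
```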

          cepnyci Maruf Aytekin added a comment -

          I have developed a Spark implementation of LSH for Charikar's scheme on collections of vectors. It is published here: https://github.com/marufaytekin/lsh-spark. The details are documented in the README.md file. I'd really appreciate it if you could check it out and provide feedback.

          Demir Özgür Demir added a comment -

          Hi, we just open-sourced an LSH top-k implementation for Spark. It is an implementation for cosine similarity only; for search and recommendation tasks this is the most widely used similarity metric, as it normalizes for popularity.

          Influenced by the DIMSUM implementation, its input is a row-based matrix where each row represents one item. The output is a similarity matrix. This allowed us to easily switch between the two implementations.

          This might be the most basic interface for other LSH join (all-pairs) implementations?

          You'll find the code here:

          https://github.com/soundcloud/cosine-lsh-join-spark

          Cheers,

          Demir

          jse Jonas Seiler added a comment -

          Very nice!

          Hopefully there will soon be a common interface so we can use all these methods.

          When deciding on methods, please take executor memory consumption into account. If I have to store a (#original dims x #reduced dims) matrix on each executor, that might already be too much (at least in our case).
          If you use sparse random projections instead, e.g.
          Ping Li, T. Hastie, and K. W. Church, 2006, "Very Sparse Random Projections".
          https://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf
          you can store a sparse matrix and get a parameter to tune the memory consumption.
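A toy sketch of the very sparse projection scheme from that paper, storing only the nonzero entries (illustrative code, not tied to any Spark API):

```python
import math
import random

def sparse_projection_rows(d_in, d_out, s, rng):
    """Very sparse random projection (Li, Hastie & Church 2006): each entry is
    +sqrt(s) or -sqrt(s) with probability 1/(2s) each, and 0 otherwise, so a row
    has about d_in/s nonzeros. Only (column, value) pairs are stored per row."""
    rows = []
    for _ in range(d_out):
        row = []
        for j in range(d_in):
            r = rng.random()
            if r < 1.0 / (2 * s):
                row.append((j, math.sqrt(s)))
            elif r < 1.0 / s:
                row.append((j, -math.sqrt(s)))
        rows.append(row)
    return rows

def project(vec, rows):
    """Multiply the sparse projection matrix by a dense vector."""
    return [sum(w * vec[j] for j, w in row) for row in rows]

rng = random.Random(1)
d_in, d_out = 1000, 20
s = int(math.sqrt(d_in))  # the paper suggests s = sqrt(d) (or even d / log d)
rows = sparse_projection_rows(d_in, d_out, s, rng)
print(sum(len(r) for r in rows))  # total stored entries, far fewer than d_in * d_out
```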

          karlhigley Karl Higley added a comment - - edited

          I'm a bit confused by this section of the design doc:

          It is pretty hard to define a common interface. Because LSH algorithm has two types at least. One is to calculate hash value. The other is to calculate a similarity between a feature(vector) and another one.

          For example, random projection algorithm is a type of calculating a similarity. It is designed to approximate the cosine distance between vectors. On the other hand, min hash algorithm is a type of calculating a hash value. The hash function maps a d dimensional vector onto a set of integers.

          Sign-random-projection LSH does calculate a hash value (essentially a Bitset) for each feature vector, and the Hamming distance between two hash values is used to estimate the cosine similarity between the corresponding feature vectors. The two "types" of LSH mentioned here seem more like two kinds of operations which are sometimes applied sequentially. Maybe this distinction makes more sense for other types of LSH?

          karlhigley Karl Higley added a comment -

          I've been working on a Spark package for approximate nearest neighbors that implements several LSH flavors for different distance measures behind a unified interface. Currently, the package supports Hamming, Jaccard, Euclidean, and cosine distance. It's still a work in progress, but maybe it will provide some food for thought on how to proceed with the implementation for MLlib.

          mengxr Xiangrui Meng added a comment -

          FYI, I had an offline discussion with Kelvin Chu and Erran Li from Uber after Spark Summit, where they talked about the large-scale LSH implementation at Uber. I asked them to join the discussion and possibly contribute their API design and implementation to MLlib.

          tonygrey tony added a comment - - edited

          Hi,
          I am new to Spark. How can I contribute to the Spark LSH implementation? I am a Python programmer, and I am interested in contributing to the random projection LSH technique for approximating cosine distance. Can you keep me in the loop?

          snehil.w snehil suresh wakchaure added a comment - - edited

          Hello, just curious to know if I can contribute to this project too, although I am new to it. I could use some pointers to get started. Is this going to be a Scala, Java, or Python codebase?

          Any updates from the Uber community?

          yunn Yun Ni added a comment - - edited

          Hi,

          We are engineers from Uber. Here is our design doc for LSH:
          https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

          Please take a look and let us know if this meets your requirements or not.

          Thanks,
          Yun Ni

          josephkb Joseph K. Bradley added a comment -

          Awesome, thanks! I made some comments, but it looks like a good approach overall.

          yunn Yun Ni added a comment -

          Thanks very much for reviewing, Joseph!

          Based on your comments, I have made the following major changes:
          (1) Change the API Design to use Estimator instead of UnaryTransformer.
          (2) Add a Testing section to elaborate more about the unit tests and scale tests design.

          Please let me know if you have any further comments, so that we can go forward.
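For context, the Estimator/Model split (fit a dataset, get back a Model that transforms it) can be sketched in plain Python — hypothetical class names, not the actual Spark API:

```python
import random

class MinHashLSHEstimator:
    """Estimator: fit() returns a fitted Model. Here fitting only draws the
    random hash parameters, but in general an Estimator may learn from the data."""
    def __init__(self, num_hashes=64, seed=0):
        self.num_hashes = num_hashes
        self.seed = seed

    def fit(self, dataset):
        rng = random.Random(self.seed)
        prime = 2_147_483_647
        params = [(rng.randrange(1, prime), rng.randrange(prime))
                  for _ in range(self.num_hashes)]
        return MinHashLSHModel(params, prime)

class MinHashLSHModel:
    """Model: a fitted transformer mapping each input set to its MinHash signature."""
    def __init__(self, params, prime):
        self.params = params
        self.prime = prime

    def transform(self, dataset):
        return [[min((a * x + b) % self.prime for x in s) for a, b in self.params]
                for s in dataset]

data = [{1, 2, 3}, {2, 3, 4}]
model = MinHashLSHEstimator(num_hashes=8).fit(data)  # Estimator -> Model
signatures = model.transform(data)                   # Model -> transformed data
```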

          josephkb Joseph K. Bradley added a comment -

          The design doc LGTM! Thanks for updating it. Shall we proceed with an initial PR?

          yunn Yun Ni added a comment -

          Thank you very much for reviewing it, Joseph!

          I will work on the first commit according to the doc and send a PR soon.

          yunn Yun Ni added a comment -

          Hi Joseph,

          I have made an initial PR based on the design doc: https://github.com/apache/spark/pull/15148
          Please take a look; I will test on the open dataset in the meantime.

          Thanks very much,
          Yun Ni

          apachespark Apache Spark added a comment -

          User 'Yunni' has created a pull request for this issue:
          https://github.com/apache/spark/pull/15148

          yunn Yun Ni added a comment -

          Some tests on the WEX Open Dataset, with steps and results:
          https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro

          debasish83 Debasish Das added a comment -

          Did you compare with brute-force kNN? Normally LSH does not work well for NN queries, which is why hybrid spill trees and other ideas came along. I can run some comparisons using SPARK-4823.

          debasish83 Debasish Das added a comment -

          Also, do you have a hash function for Euclidean distance? We use cosine, Jaccard, and Euclidean with SPARK-4823. For the kNN comparison we can use an overlap metric: pick k, then compare the overlap between LSH-based approximate kNN and brute-force kNN. Let me know if you need help running the benchmarks.

          yunn Yun Ni added a comment -

          Yes, I do have a comparison with a full scan. kNN with a single hash function usually does not work well, so I enabled OR-amplification with multiple hash functions. You can set the number of hash functions using "outputDim".
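To illustrate what OR-amplification does (a toy Python sketch with bit-sampling hashes, not the PR's implementation): a point becomes a candidate if it collides with the query in any one of several hash tables, which recovers near neighbors that a single hash function would miss.

```python
import random

def bit_sample_tables(num_tables, bits_per_table, dim, rng):
    """One bit-sampling hash function (a set of sampled positions) per table."""
    return [rng.sample(range(dim), bits_per_table) for _ in range(num_tables)]

def or_amplified_candidates(query, corpus, tables):
    """OR-amplification: a point is a candidate if it collides in ANY table."""
    candidates = set()
    for indices in tables:
        q_hash = tuple(query[i] for i in indices)
        for idx, point in enumerate(corpus):
            if tuple(point[i] for i in indices) == q_hash:
                candidates.add(idx)
    return candidates

rng = random.Random(3)
dim = 32
query = [rng.randint(0, 1) for _ in range(dim)]
exact = list(query)                # Hamming distance 0: collides in every table
near = list(query); near[0] ^= 1   # Hamming distance 1: likely caught by some table
far = [b ^ 1 for b in query]       # every bit flipped: can never collide
corpus = [exact, near, far]

tables = bit_sample_tables(num_tables=10, bits_per_table=8, dim=dim, rng=rng)
print(or_amplified_candidates(query, corpus, tables))
```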

          yunn Yun Ni added a comment -

          Yes, I have implemented cosine, Jaccard, Euclidean, and Hamming distance. Please check:
          https://github.com/apache/spark/pull/15148

          josephkb Joseph K. Bradley added a comment -

          Issue resolved by pull request 15148
          https://github.com/apache/spark/pull/15148


            People

            • Assignee:
              yunn Yun Ni
              Reporter:
              josephkb Joseph K. Bradley
              Shepherd:
              Xiangrui Meng
            • Votes:
              9
              Watchers:
              24

              Dates

              • Created:
                Updated:
                Resolved:

                Development