Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

Description

      RowMatrix has a columnSimilarities method to find cosine similarities between columns.

      A rowSimilarities method would be useful to find similarities between rows.

      This JIRA is to investigate which algorithms are suitable for such a method, beyond brute force. Note that when there are many rows (> 10^6), brute force is unlikely to be feasible, since the output would have on the order of 10^12 entries.
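      For context, the existing column-similarity API and the proposed addition would look roughly like this; a minimal sketch assuming a live SparkContext, where the rowSimilarities signature is purely illustrative:

      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}

      def demo(sc: SparkContext): Unit = {
        val rows = sc.parallelize(Seq(
          Vectors.dense(1.0, 0.0, 2.0),
          Vectors.dense(0.0, 3.0, 1.0)))
        val mat = new RowMatrix(rows)

        // Existing method: cosine similarities between columns, returned as an
        // upper-triangular CoordinateMatrix of (i, j, similarity) entries.
        val colSims: CoordinateMatrix = mat.columnSimilarities()

        // The proposed counterpart (hypothetical signature only):
        // def rowSimilarities(): CoordinateMatrix
        // For n rows the exact output has n * (n - 1) / 2 entries, which is why
        // brute force breaks down past ~10^6 rows.
      }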

Attachments

      1. SparkMeetup2015-Experiments2.pdf (56 kB) - Debasish Das
      2. SparkMeetup2015-Experiments1.pdf (64 kB) - Debasish Das
      3. MovieLensSimilarity Comparisons.pdf (93 kB) - Debasish Das


Activity

          debasish83 Debasish Das added a comment -

          I am considering coming up with a baseline version that's very close to brute force, but where we cut the computation with a topK number: for each user, come up with the topK users, where K is defined by the client. This will take care of the matrix factorization use case.

          Basically, on the master we collect a set of user factors, broadcast it to every node, and do a reduceByKey to generate the topK users for each user from this user block. We pass a kernel function (cosine / polynomial / RBF) into this calculation.

          But this idea does not work for raw features, right? If we map features to a lower dimension using factorization, then this approach should run fine, but I am not sure we can ask users to map their data into a lower dimension.

          Is it possible to bring in ideas from Fastfood and Random Kitchen Sinks to do this?
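          A minimal sketch of that baseline idea, assuming an RDD of (id, factor) pairs and a pluggable kernel; every name below is illustrative rather than taken from an actual PR:

          import org.apache.spark.SparkContext
          import org.apache.spark.rdd.RDD

          object TopKBaseline {
            // Hypothetical kernel signature; a real abstraction may differ.
            type Kernel = (Array[Double], Array[Double]) => Double

            val cosine: Kernel = (u, v) => {
              val dot = (u, v).zipped.map(_ * _).sum
              val norms = math.sqrt(u.map(x => x * x).sum) *
                math.sqrt(v.map(x => x * x).sum)
              if (norms == 0.0) 0.0 else dot / norms
            }

            // Broadcast all user factors, then keep only the topK neighbors per
            // user. Feasible only while the factor matrix (users x rank) fits on
            // a single node.
            def topKNeighbors(sc: SparkContext,
                              factors: RDD[(Long, Array[Double])],
                              kernel: Kernel,
                              k: Int): RDD[(Long, Seq[(Long, Double)])] = {
              val all = sc.broadcast(factors.collect())
              factors.map { case (i, u) =>
                val neighbors = all.value
                  .filter { case (j, _) => j != i }   // skip self-similarity
                  .map { case (j, v) => (j, kernel(u, v)) }
                  .sortBy(-_._2)
                  .take(k)
                  .toSeq
                (i, neighbors)
              }
            }
          }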

          debasish83 Debasish Das added a comment -

          Sean Owen, did you implement map-reduce row similarities for user factors? What's the algorithm that you used? Any pointers would be really helpful.

          srowen Sean Owen added a comment -

          I don't think MapReduce matters here. You can compute pairs of similarities with any framework, or try to do it on the fly. It's no different from column similarities, right? I don't think there's anything more to it than applying a similarity metric to all pairs of vectors. I think the JIRA is about exposing a method just for API convenience, not because it's conceptually different.
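          As a concrete illustration of the all-pairs view, a brute-force sketch over an RDD of indexed rows (names are illustrative); it is O(n^2) and only viable for small row counts:

          import org.apache.spark.mllib.linalg.Vector
          import org.apache.spark.rdd.RDD

          // Keep only the upper triangle (i < j) to avoid duplicate pairs.
          def allPairs(rows: RDD[(Long, Vector)],
                       sim: (Vector, Vector) => Double): RDD[((Long, Long), Double)] =
            rows.cartesian(rows)
              .filter { case ((i, _), (j, _)) => i < j }
              .map { case ((i, u), (j, v)) => ((i, j), sim(u, v)) }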

          rezazadeh Reza Zadeh added a comment -

          Given that we're talking about RowMatrices, computing rowSimilarities the same way as columnSimilarities would require transposing the matrix, which is dangerous when the original matrix has many rows. RowMatrix assumes a single row should fit in memory on a single machine, but this might not happen after transposing a RowMatrix.

          debasish83 Debasish Das added a comment -

          Even for matrix factorization, userFactors is users x rank. With a modest rank of 50 and users at 10M, I don't think it is possible to transpose the matrix and run column similarities. And doing it on the fly is still O(n^2) complexity-wise, right? With n = 10^7 users, that's on the order of 5 x 10^13 pairs.

          debasish83 Debasish Das added a comment - edited

          Xiangrui Meng I need level-3 BLAS for this JIRA as well as for https://issues.apache.org/jira/browse/SPARK-4675. Specifically, I am looking for dense matrix x dense matrix and dense matrix x sparse matrix products. Does breeze's CSCMatrix support a BLAS-3-based dense matrix x CSCMatrix product? I had some code using the breeze dot, and it was extremely slow. I will migrate the code to the netlib-java BLAS from mllib and update the results on the JIRA.
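          For the dense x dense case, netlib-java exposes dgemm directly; a minimal sketch of a column-major level-3 call (the dense x sparse case has no stock BLAS-3 routine, which is the open question above):

          import com.github.fommil.netlib.BLAS

          // C = alpha * A * B + beta * C with A (m x k), B (k x n), C (m x n),
          // all stored column-major.
          val (m, n, k) = (2, 2, 3)
          val a = Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0)    // A = [[1,2,3],[4,5,6]]
          val b = Array(7.0, 9.0, 11.0, 8.0, 10.0, 12.0) // B = [[7,8],[9,10],[11,12]]
          val c = new Array[Double](m * n)
          BLAS.getInstance().dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)
          // c == Array(58.0, 139.0, 64.0, 154.0), i.e. A * B in column-major order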

          debasish83 Debasish Das added a comment - edited

          I implemented the idea I mentioned above using level-1 BLAS, since I abstract a kernel out and wanted the code to support distributed multiply of two matrices, the kernel abstraction, and both sparse and dense vectors. In the future, for the dense-dense case we can use some level-3 BLAS. The code is written in blocked form.

          On the Netflix dataset, we run rowSimilarity with a CosineKernel on 20 nodes (4 cores, 16 GB per node) in 500 seconds. If I go from the raw data to a reduced dimension and then run rowSimilarity with the CosineKernel, it runs in 320 seconds. colSimilarity without DIMSUM sampling had run for 28 minutes when I killed the job. For matrices that are not Twitter-tall, but say 100M rows with 1-10M columns, I feel this code will work well.

          The next trick for this flow is LSH and Random Kitchen Sinks. The code is going through legal review, and I will open the PR soon for reviews. This code will also bring kernel generation to mllib.

          I will also add an examples.MovieLensSimilarity that compares colSimilarity with a threshold (DIMSUM sampling will be activated), rowSimilarity, and rowSimilarity with the dimension reduced by ALS. My experiments so far show a 40% intersection between raw similarity and ALS implicit on MovieLens, and 24% on the Netflix dataset, in predicting the topK items for every item. Surprising, but I think going to a rank larger than 50/100 is the way to close this gap; LSH and factorization will both try to do that.
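          The kernel abstraction described above might look roughly like the following; this is an illustrative sketch, not the PR's actual code:

          import org.apache.spark.mllib.linalg.{Vector, Vectors}

          // Pluggable similarity kernel over mllib vectors (hypothetical names).
          trait VectorKernel extends Serializable {
            def compute(u: Vector, v: Vector): Double
          }

          object ProductKernel extends VectorKernel {
            // Plain dot product over densified copies; level-1 BLAS speed at best.
            def compute(u: Vector, v: Vector): Double = {
              val du = u.toArray
              val dv = v.toArray
              var i = 0
              var s = 0.0
              while (i < du.length) { s += du(i) * dv(i); i += 1 }
              s
            }
          }

          object CosineKernel extends VectorKernel {
            def compute(u: Vector, v: Vector): Double = {
              val denom = Vectors.norm(u, 2) * Vectors.norm(v, 2)
              if (denom == 0.0) 0.0 else ProductKernel.compute(u, v) / denom
            }
          }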

          apachespark Apache Spark added a comment -

          User 'debasish83' has created a pull request for this issue:
          https://github.com/apache/spark/pull/6213

          debasish83 Debasish Das added a comment -

          I opened a PR that worked well for our datasets. It is still a brute-force computation, although we use a blocked cartesian and user-defined kernels to cut computation and shuffle. There are straightforward ways to go from BLAS-1 to BLAS-2 and BLAS-3 as more sparse operations are added to the mllib BLAS, although I don't think that will give us the runtime boost we are looking for.

          We are looking into the approximate-KNN family of algorithms to improve the runtime further. KD-trees are good for dense vectors with few features, but researchers have not found them useful for sparse vectors in higher dimensions.

          LSH seems to be the most commonly used, and that's the direction we are looking into. Among the papers I looked at, the one that showed good recall compared to brute-force KNN is Google Correlate, and that's the validation strategy we will focus on: https://www.google.com/trends/correlate/nnsearch.pdf. Please point to any other references you deem fit. There are Twitter papers using LSH as well, and an implementation is available in Algebird. We will start with Algebird's LSH, but ideally we don't want a distance metric hardcoded in the LSH.

          If we get good recall from the LSH-based method compared to the rowSimilarities code from the PR, we will use it to approximately compute similarities between dense/sparse rows using the cosine kernel, between the dense userFactor and productFactor from factorization using the product kernel, and between dense user/product factors using the cosine kernel.

          The kernel abstraction is part of the current PR, and right now we support Cosine, Product, Euclidean, and RBF kernels. Pearson is of interest but not added yet. For approximate row similarity I will open a separate JIRA.
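          For the cosine kernel, the standard random-hyperplane LSH produces sign-bit signatures; a minimal sketch of the idea (not the Algebird implementation):

          import scala.util.Random

          // Rows hashing to the same signature become candidate neighbors;
          // the exact kernel is then computed only within each bucket.
          def randomPlanes(numPlanes: Int, dim: Int, seed: Long): Array[Array[Double]] = {
            val rng = new Random(seed)
            Array.fill(numPlanes, dim)(rng.nextGaussian())
          }

          def signature(v: Array[Double], planes: Array[Array[Double]]): Int =
            planes.zipWithIndex.foldLeft(0) { case (sig, (p, i)) =>
              val dot = (p, v).zipped.map(_ * _).sum
              if (dot >= 0.0) sig | (1 << i) else sig
            }

          // For two rows at angle theta, each bit matches with probability
          // 1 - theta / pi, so adding planes sharpens the candidate filter.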

          debasish83 Debasish Das added a comment -

          The attached file shows the runtime comparison of the row- and column-based flows on all items of the MovieLens dataset, on my local MacBook with 8 cores, a 1 GB driver, and 4 GB of executor memory.

          A threshold of 1e-2 is set for both the row-based kernel flow and the column-based DIMSUM flow.

          Stages 24-35 are the row-similarity flow. Total runtime ~20 s.

          Stage 64 is the column-similarity mapPartitions. Total runtime ~4.6 min.

          This shows the power of blocking in Spark, and I have not yet moved to gemv, which will decrease the runtime further.

          I updated the driver code in examples.mllib.MovieLensSimilarity.
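          For reference, the column-based side of this comparison uses the existing threshold overload, which is what activates DIMSUM sampling:

          import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}

          // A positive threshold switches columnSimilarities to DIMSUM sampling,
          // which prunes column pairs whose similarity likely falls below it.
          def dimsumSims(mat: RowMatrix, threshold: Double = 1e-2): CoordinateMatrix =
            mat.columnSimilarities(threshold)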

          debasish83 Debasish Das added a comment -

          We ran more detailed experiments for the July 2015 Spark Meetup to understand the effect of shuffle on runtime. I attached the experiment data to the JIRA. I will update the PR as discussed with Reza. I am targeting one PR for Spark 1.5.

          superwai Jerry Lam added a comment -

          Hi Debasish Das, I wonder if this is still a work in progress or something that can be merged into 1.5 soon? Thank you.

          debasish83 Debasish Das added a comment -

          We use it in multiple use cases internally but have not had time to refactor the PR into 3 smaller PRs. I will update the PR to target 2.0.


People

    • Assignee: Unassigned
    • Reporter: rezazadeh Reza Zadeh
    • Votes: 4
    • Watchers: 16
