SPARK-4675

Find similar products and similar users in MatrixFactorizationModel

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels:

      Description

      Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products.
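As a rough illustration of the idea in the description (not the code from the pull request), the sketch below ranks products by cosine similarity between their latent factor vectors. The names `cosine`, `similar_products`, and the rank-2 `product_factors` are hypothetical; in Spark the factors would come from the trained model rather than a local dict.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_products(product_factors, product_id, k):
    """Return the k products closest to product_id as (id, similarity) pairs."""
    target = product_factors[product_id]
    scored = (
        (other_id, cosine(target, factors))
        for other_id, factors in product_factors.items()
        if other_id != product_id  # a product is trivially similar to itself
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical rank-2 latent factors for four products (made up for illustration).
product_factors = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
    4: [-1.0, 0.0],
}
top2 = similar_products(product_factors, 1, 2)
```

The same function applies to user factors unchanged, since it only depends on a mapping from ID to vector.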

          Activity

          apachespark Apache Spark added a comment -

          User 'sbourke' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3536

          debasish83 Debasish Das added a comment -

          There are a few issues:

          1. A batch API for topK similar users and topK similar products
          2. A comparison of the product x product similarities generated by columnSimilarities against the topK similar products

          I added batch APIs for topK product recommendation for each user and topK user recommendation for each product in SPARK-4231. A similar batch API would be very helpful for topK similar users and topK similar products.

          I agree with using cosine similarity. You should be able to re-use the column similarity calculations. I think a better idea is to add rowMatrix.similarRows and re-use that code to generate product similarities and user similarities.

          But my question is more about validation. We can compute product similarities on the raw features, or we can compute them on the product factors from the matrix factorization. Which one is better?
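To make the batch request above concrete, here is a minimal local sketch of a "topK similar items for every item" computation. It is not the SPARK-4231 API or a proposed `similarRows` implementation: `normalize`, `batch_similar`, and the sample `factors` are hypothetical, and the brute-force loop is quadratic in the number of items, which a distributed version would need to avoid.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0.0 else list(v)


def batch_similar(factors, k):
    """Map each id to its k most cosine-similar other ids (brute force, O(n^2))."""
    # Normalizing once turns every cosine similarity into a plain dot product.
    unit = {i: normalize(v) for i, v in factors.items()}
    result = {}
    for i, u in unit.items():
        scored = (
            (j, sum(a * b for a, b in zip(u, w)))
            for j, w in unit.items()
            if j != i
        )
        result[i] = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    return result


# Hypothetical rank-2 factors; the same function works for users or products.
factors = {
    "a": [1.0, 0.0],
    "b": [2.0, 0.2],
    "c": [0.0, 3.0],
}
top1 = batch_similar(factors, 1)
```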

          josephkb Joseph K. Bradley added a comment -

          Just to make sure I get your last question, are you asking, "Why compute product similarities using the low-dimensional space when we could do it in the high-dimensional space?" If so, then my understanding is that the low-dimensional space will give more meaningful similarities in general.

          debasish83 Debasish Das added a comment -

          Joseph K. Bradley, how do we validate that the low-dimensional space gives more meaningful similarities than the feature space (which is sparse)?

          srowen Sean Owen added a comment -

          The lower dimensional space is of course smaller. This makes it faster and more efficient to work with, which is an advantage to be sure at scale. But the real reason is that the original high-dimensional space is extremely sparse. Standard similarity measures are undefined for most pairs, or are 0. It's sort of a symptom of the curse of dimensionality.

          debasish83 Debasish Das added a comment -

          Is there a metric, something like MAP or AUC, that can help us validate similarUsers and similarProducts?

          Right now, if I run column similarities with sparse vectors on matrix factorization datasets to get product similarities, it will treat all unvisited entries (which should be ?) as 0 and compute column similarities for... If the sparse vector should have ? in place of 0, then basically the whole similarity calculation is incorrect, so in that sense it makes more sense to compute the similarities on the matrix factors.

          But then we are back to a map-reduce calculation of rowSimilarities.
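The concern about unvisited entries can be shown with a small sketch. It compares two hypothetical ways of scoring raw rating columns: treating missing ratings as 0 (which is what a sparse-vector cosine effectively does) and restricting the computation to users who rated both products. The function names and the ratings are made up for illustration, and neither variant is claimed to be what columnSimilarities implements internally.

```python
import math


def cosine_missing_as_zero(col_a, col_b):
    """Cosine over sparse {user_id: rating} columns, with absent ratings counted as 0."""
    shared = col_a.keys() & col_b.keys()
    dot = sum(col_a[u] * col_b[u] for u in shared)
    norm_a = math.sqrt(sum(r * r for r in col_a.values()))
    norm_b = math.sqrt(sum(r * r for r in col_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_co_rated_only(col_a, col_b):
    """Cosine restricted to users who rated both products; None when there are none."""
    shared = col_a.keys() & col_b.keys()
    if not shared:
        return None  # undefined: the two columns share no observed entries
    dot = sum(col_a[u] * col_b[u] for u in shared)
    norm_a = math.sqrt(sum(col_a[u] ** 2 for u in shared))
    norm_b = math.sqrt(sum(col_b[u] ** 2 for u in shared))
    if norm_a == 0.0 or norm_b == 0.0:
        return None  # also undefined if every shared rating is 0
    return dot / (norm_a * norm_b)


# Hypothetical ratings: p and q share one user, q and r share none.
product_p = {"u1": 4.0, "u2": 4.0}
product_q = {"u1": 4.0, "u3": 4.0}
product_r = {"u4": 5.0}

pq_as_zero = cosine_missing_as_zero(product_p, product_q)   # 0.5
pq_co_rated = cosine_co_rated_only(product_p, product_q)    # 1.0
qr_as_zero = cosine_missing_as_zero(product_q, product_r)   # 0.0
qr_co_rated = cosine_co_rated_only(product_q, product_r)    # None
```

The single shared user agrees perfectly on p and q, yet counting the unknown ratings as 0 halves the score, and pairs with no shared users collapse to 0 or to an undefined value. Dense factor vectors avoid both cases because every product has a value in every latent dimension.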

          apachespark Apache Spark added a comment -

          User 'debasish83' has created a pull request for this issue:
          https://github.com/apache/spark/pull/6213


            People

            • Assignee: Unassigned
            • Reporter: steve_b Steven Bourke
            • Votes: 1
            • Watchers: 5