Details

Type: Improvement

Status: In Progress

Priority: Major

Resolution: Unresolved

Affects Version/s: None

Fix Version/s: None

Component/s: MLlib

Labels: None
Description
RowMatrix has a columnSimilarities method to find cosine similarities between columns.
A rowSimilarities method would be useful to find similarities between rows.
This JIRA is to investigate which algorithms are suitable for such a method and do better than brute-forcing it. Note that when there are many rows (> 10^6), brute force is unlikely to be feasible, since the output will be of order 10^12.
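To make the size estimate concrete, here is a minimal local sketch in plain Python (no Spark) of brute-force row cosine similarity, together with the pair count for 10^6 rows. The helper names `cosine` and `brute_force_row_similarities` are made up for illustration and are not part of MLlib.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def brute_force_row_similarities(rows):
    """All pairs (i, j, sim) with i < j, i.e. n * (n - 1) / 2 entries."""
    n = len(rows)
    return [(i, j, cosine(rows[i], rows[j]))
            for i in range(n) for j in range(i + 1, n)]

rows = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sims = brute_force_row_similarities(rows)

# Unordered pairs for 10^6 rows: about 5 * 10^11, or 10^12 ordered entries.
n = 10**6
pairs = n * (n - 1) // 2
```

Even if each similarity were cheap to compute, storing that many entries is the bottleneck, which is why the output itself has to be cut down.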
Issue Links
 relates to

SPARK-3066 Support recommendAll in matrix factorization model
 Resolved

SPARK-4675 Find similar products and similar users in MatrixFactorizationModel
 Resolved
Activity
I am considering a baseline version that is very close to brute force, but cuts the computation with a topK limit: for each user, come up with the topK most similar users, where K is defined by the client. This will take care of the matrix factorization use case.
Basically, on master we collect a block of user factors, broadcast it to every node, and do a reduceByKey to generate the topK users for each user from this user block. We pass a kernel function (cosine / polynomial / RBF) into this calculation.
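A rough local sketch of that baseline in plain Python (no Spark) follows. It assumes the user factors fit in a dictionary. Each slice of `block_size` users stands in for one broadcast block, and `merge_top_k` plays the role the reduceByKey combiner would play. All function names here are hypothetical, and the work is still quadratic in the number of users; only the output shrinks to K entries per user.

```python
import heapq
import math

def cosine_kernel(u, v):
    """One possible kernel; a polynomial or RBF kernel could be passed instead."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def top_k_in_block(user_id, factor, block, k, kernel):
    """Score one user against one block of (id, factor) pairs; keep the k best."""
    scored = ((kernel(factor, other), other_id)
              for other_id, other in block if other_id != user_id)
    return heapq.nlargest(k, scored)

def merge_top_k(a, b, k):
    """Combine two partial top-k lists into one top-k list."""
    return heapq.nlargest(k, a + b)

def top_k_similar(factors, k, kernel, block_size=2):
    items = sorted(factors.items())
    blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
    result = {}
    for user_id, factor in items:
        best = []
        for block in blocks:  # each block stands in for one broadcast block
            partial = top_k_in_block(user_id, factor, block, k, kernel)
            best = merge_top_k(best, partial, k)
        result[user_id] = [(other_id, score) for score, other_id in best]
    return result

factors = {
    "u1": [1.0, 0.0],
    "u2": [0.9, 0.1],
    "u3": [0.0, 1.0],
    "u4": [0.1, 0.9],
}
neighbours = top_k_similar(factors, k=1, kernel=cosine_kernel)
```

Because merging two top-K lists is associative and commutative, the same merge function could in principle be used as a reduce step over partial results from different blocks.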
But this idea does not work for raw features, right? If we map the features to a lower dimension using factorization, this approach should run fine, but I am not sure we can ask users to map their data into a lower dimension.
Is it possible to bring in ideas from Fastfood and random kitchen sinks to do this?
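For reference, the random kitchen sinks idea can be sketched as a random Fourier feature map: raw vectors are mapped to `num_features` random cosine features whose dot product approximates an RBF kernel. This is a plain-Python illustration under assumed parameters (`gamma`, `num_features`, a fixed seed), not an MLlib API, and it shows the basic random-feature variant rather than Fastfood, which replaces the dense random matrix with a faster structured transform.

```python
import math
import random

def make_rff_map(dim, num_features, gamma, seed=0):
    """Return phi with phi(x) . phi(y) ~= exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # spectral density of this RBF kernel
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map

def rbf(x, y, gamma):
    """Exact RBF kernel, used only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

gamma = 0.5
phi = make_rff_map(dim=3, num_features=4000, gamma=gamma)
x, y = [1.0, 0.0, 0.5], [0.8, 0.2, 0.4]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf(x, y, gamma)
```

The approximation error shrinks roughly as one over the square root of `num_features`, so the caller trades accuracy against the dimension of the mapped vectors. Whether that dimension ends up lower than the raw feature dimension depends on the data.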