Details

Type: Improvement

Status: In Progress

Priority: Major

Resolution: Unresolved

Affects Version/s: None

Fix Version/s: None

Component/s: MLlib

Labels: None
Description
RowMatrix has a columnSimilarities method to find cosine similarities between columns.
A rowSimilarities method would be useful to find similarities between rows.
This JIRA is to investigate which algorithms are suitable for such a method and do better than brute force. Note that when there are many rows (> 10^6), brute force is unlikely to be feasible, since the output will be of order 10^12 entries.
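To make the 10^12 figure concrete, here is a small back-of-the-envelope sketch. The function name and the 16 bytes per entry (two 4-byte indices plus an 8-byte double) are assumptions for illustration, not anything Spark defines.

```python
def pairwise_output_size(num_rows, bytes_per_entry=16):
    """Estimate the size of an all-pairs row similarity output.

    bytes_per_entry is an assumed storage cost per (i, j, similarity) entry.
    """
    full = num_rows * num_rows                    # every ordered pair, diagonal included
    distinct = num_rows * (num_rows - 1) // 2     # unordered pairs with i < j
    return full, distinct, distinct * bytes_per_entry

full, distinct, approx_bytes = pairwise_output_size(10**6)
print(full)      # 1000000000000
print(distinct)  # 499999500000
print(approx_bytes / 1e12, "TB (under the assumed 16 bytes per entry)")
```

Even when symmetry is exploited, the output for 10^6 rows stays on the order of several terabytes under this assumption, which is why a method that prunes or approximates is needed.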
Issue Links
 relates to

SPARK-3066 Support recommendAll in matrix factorization model
 Resolved

SPARK-4675 Find similar products and similar users in MatrixFactorizationModel
 Resolved
I am considering a baseline version that is very close to brute force but cuts the computation off with a top-K limit: for each user, compute the top K most similar users, where K is defined by the client. This will take care of the matrix factorization use case.
Basically, on master we collect a set of user factors, broadcast it to every node, and do a reduceByKey to generate the top K users for each user from this user block. We pass a kernel function (cosine / polynomial / RBF) into this calculation.
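The flow described above can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the function names (`top_k_in_block`, `merge_top_k`, `top_k_similar`) are hypothetical, the loop over blocks stands in for broadcasting one user block at a time, and the per-user merge stands in for the reduceByKey step. It is not the Spark implementation.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity kernel; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def top_k_in_block(factors, block, k, kernel):
    """Score every user against one block and keep the k best (score, id) pairs."""
    partial = {}
    for uid, u in factors:
        scored = ((kernel(u, v), vid) for vid, v in block if vid != uid)
        partial[uid] = heapq.nlargest(k, scored)
    return partial

def merge_top_k(a, b, k):
    """Combine two candidate lists for one user (the reduce step)."""
    return heapq.nlargest(k, a + b)

def top_k_similar(factors, k, num_blocks=2, kernel=cosine):
    """factors: list of (user_id, vector). Returns {user_id: [(score, other_id), ...]}."""
    blocks = [factors[i::num_blocks] for i in range(num_blocks)]
    merged = {}
    for block in blocks:  # stands in for one broadcast user block
        for uid, cand in top_k_in_block(factors, block, k, kernel).items():
            merged[uid] = merge_top_k(merged.get(uid, []), cand, k)
    return merged

factors = [(0, [1.0, 0.0]), (1, [0.9, 0.1]), (2, [0.0, 1.0]), (3, [0.1, 0.9])]
for uid, neighbours in sorted(top_k_similar(factors, k=1).items()):
    print(uid, neighbours)
```

Note that this still evaluates the kernel for every pair of users; only the retained output shrinks from n^2 entries to n * K.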
But this idea does not work for raw features, right? If we map the features to a lower dimension using factorization, then this approach should run fine, but I am not sure we can ask users to map their data into a lower dimension.
Is it possible to bring in ideas from Fastfood and Random Kitchen Sinks to do this?
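For context on that question, here is a minimal sketch of random Fourier features, the construction behind Random Kitchen Sinks, for the RBF kernel exp(-gamma * ||x - y||^2). It maps each row to a fixed number of random features so that a plain dot product approximates the kernel. Fastfood speeds up the random projection with structured matrices and is not shown here. The function names and parameter values are assumptions for illustration, and whether such a map suits rowSimilarities is the open question being asked.

```python
import math
import random

def make_rff_map(dim, num_features, gamma, seed=0):
    """Build a random feature map z with z(x).z(y) ~= exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma * I)
    ws = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def transform(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]

    return transform

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rbf(x, y, gamma):
    """Exact RBF kernel, used only to compare against the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

gamma = 0.5
transform = make_rff_map(dim=3, num_features=4000, gamma=gamma)
x, y = [0.1, 0.2, 0.3], [0.0, 0.4, 0.1]
print("exact: ", rbf(x, y, gamma))
print("approx:", dot(transform(x), transform(y)))
```

The approximation error shrinks roughly with the square root of the number of random features, so the feature count trades accuracy against the cost of the downstream similarity computation.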