I opened a PR that works well for our datasets. It is still brute-force computation, although we use a blocked Cartesian product and user-defined kernels to cut computation and shuffle. There are straightforward ways to move from BLAS-1 to BLAS-2 and BLAS-3 as more sparse operations are added to MLlib's BLAS, but I don't think that alone will give us the runtime boost we are looking for.
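To make the blocked-Cartesian idea concrete, here is a minimal sketch (not the PR's actual code; the kernel signature, block assignment, and names are assumptions) of grouping rows into blocks so the Cartesian product and shuffle happen at block granularity rather than per row:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object BlockedRowSimilarity {
  // Hypothetical kernel signature; the PR's kernel abstraction may differ.
  type Kernel = (Vector, Vector) => Double

  def rowSimilarities(
      rows: RDD[(Long, Vector)],
      kernel: Kernel,
      numBlocks: Int): RDD[((Long, Long), Double)] = {
    // Assign each row to a block and collect all rows of a block together.
    val blocks = rows
      .map { case (id, vec) => ((id % numBlocks).toInt, (id, vec)) }
      .groupByKey(numBlocks)

    // Cartesian over blocks (numBlocks^2 pairs) instead of rows (n^2 pairs),
    // which bounds the shuffle to block granularity.
    blocks.cartesian(blocks).flatMap { case ((_, left), (_, right)) =>
      for {
        (i, vi) <- left
        (j, vj) <- right
        if i < j // keep each unordered row pair exactly once
      } yield ((i, j), kernel(vi, vj))
    }
  }
}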
We are looking into the approximate KNN family of algorithms to improve the runtime further. A KD-tree works well for dense vectors with few features, but research has not found it useful for sparse vectors in higher dimensions.
LSH seems to be the most commonly used approach, and that's the direction we are looking into. Of the papers I looked at, the one that showed good recall values compared to brute-force KNN in its experiments is Google Correlate (https://www.google.com/trends/correlate/nnsearch.pdf), and that's the validation strategy we will focus on. Please point me to any other references you think fit. There are Twitter papers using LSH as well, and an implementation is available in Algebird. We will start with Algebird's LSH, but ideally we don't want the distance metric hardcoded into the LSH layer.
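As a rough illustration of what keeping the metric out of the LSH layer could look like (purely hypothetical, not Algebird's API), the hash family could be a pluggable parameter, so signed random projections (cosine), MinHash (Jaccard), or p-stable projections (Euclidean) can be swapped without touching the bucketing code:

import org.apache.spark.mllib.linalg.Vector
import scala.util.Random

trait HashFamily extends Serializable {
  def hash(v: Vector): Int
}

// Signed-random-projection family approximating cosine similarity.
class RandomHyperplane(dim: Int, numBits: Int, seed: Long) extends HashFamily {
  private val rng = new Random(seed)
  private val planes = Array.fill(numBits, dim)(rng.nextGaussian())
  def hash(v: Vector): Int =
    planes.zipWithIndex.foldLeft(0) { case (sig, (plane, bit)) =>
      val dot = (0 until dim).map(i => plane(i) * v(i)).sum
      if (dot >= 0) sig | (1 << bit) else sig
    }
}

object LshSketch {
  // Bucketing is metric-agnostic: rows with the same signature become candidates,
  // and candidate pairs can then be rescored exactly with any kernel.
  def buckets(rows: Seq[(Long, Vector)], family: HashFamily): Map[Int, Seq[Long]] =
    rows.groupBy { case (_, v) => family.hash(v) }
        .map { case (sig, grp) => (sig, grp.map(_._1)) }
}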
If we get good recall from the LSH-based method compared to the rowSimilarities code from the PR, we will use it to approximate similarities between dense/sparse rows using the cosine kernel, between the dense userFactor and productFactor from factorization using the product kernel, and among dense user/product factors using the cosine kernel.
The kernel abstraction is part of the current PR, and right now we support Cosine, Product, Euclidean, and RBF kernels. Pearson is of interest but has not been added yet. For approximate row similarity I will open a separate JIRA.
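For reference, the kernel abstraction has roughly this shape (an illustrative sketch, not the PR's exact code); Pearson would just be one more implementation of the same trait:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

trait Kernel extends Serializable {
  def compute(x: Vector, y: Vector): Double
}

object ProductKernel extends Kernel {
  // Plain dot product; sparse vectors are densified here only for brevity.
  def compute(x: Vector, y: Vector): Double =
    x.toArray.zip(y.toArray).map { case (a, b) => a * b }.sum
}

object CosineKernel extends Kernel {
  def compute(x: Vector, y: Vector): Double =
    ProductKernel.compute(x, y) / (Vectors.norm(x, 2) * Vectors.norm(y, 2))
}

object EuclideanKernel extends Kernel {
  def compute(x: Vector, y: Vector): Double =
    math.sqrt(Vectors.sqdist(x, y))
}

class RBFKernel(gamma: Double) extends Kernel {
  def compute(x: Vector, y: Vector): Double =
    math.exp(-gamma * Vectors.sqdist(x, y))
}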