Description
Techniques reviewed in <a href="http://arxiv.org/abs/0909.4061">Halko, Martinsson, and Tropp</a>.
The basic idea of the implementation is as follows. The input matrix is represented as a DistributedSparseRowMatrix (backed by a SequenceFile of <Writable,VectorWritable>, the values of which should be SequentialAccessSparseVector instances for best performance). Optionally, you have a kernel function f(v) which maps sparse numColumns-dimensional vectors (numColumns is unconstrained in size) to sparse numKernelizedFeatures-dimensional vectors (also unconstrained in size); this covers the case where you want to do kernel PCA, for example, with a kernel k(u,v) = f(u).dot( f(v) ). The implementation then takes the MurmurHash (from MAHOUT-228) and uses it to project the numKernelizedFeatures-dimensional vectors down to some numHashedFeatures-dimensional space (reasonably sized: no more than 10^2 to 10^4 dimensions).
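The hashing step described above can be sketched roughly as follows. This is an illustration only, not the Mahout code: `stable_hash`, `hash_project`, and the `kernel_map` parameter are hypothetical names, sparse vectors are modelled as plain `dict`s from index to value, and `hashlib.blake2b` stands in for MurmurHash because the Python standard library does not ship one.

```python
import hashlib


def stable_hash(feature_index, seed=0):
    """Deterministic 64-bit hash of a feature index.

    Assumption: a stand-in for MurmurHash; any well-mixed hash
    is good enough to illustrate the projection.
    """
    data = f"{seed}:{feature_index}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def hash_project(sparse_vector, num_hashed_features, kernel_map=None):
    """Project a sparse vector down to a dense num_hashed_features-dim vector.

    sparse_vector: dict {feature index: value}; the index range is unbounded.
    kernel_map:    optional f(v) returning another sparse dict (hypothetical).
    """
    features = kernel_map(sparse_vector) if kernel_map else sparse_vector
    hashed = [0.0] * num_hashed_features
    for index, value in features.items():
        # Colliding features are simply summed into the same bucket.
        bucket = stable_hash(index) % num_hashed_features
        hashed[bucket] += value
    return hashed
```

For example, `hash_project({3: 1.0, 1_000_000_007: 2.0}, 16)` returns a 16-dimensional list whose entries sum to 3.0, regardless of how large the original indices are. Whether the real implementation also applies a sign hash to reduce collision bias is not stated in the description, so the sketch leaves it out.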
This is all done in the Mapper, and there are two outputs. The first is the numHashedFeatures-dimensional vector itself (needed only if the left-singular vectors are ever desired), which does not need to be Reduced. The second is the outer product of this vector with itself; the Reducer/Combiner just does the matrix sum on the partial outputs, eventually producing the kernel / Gram matrix of your hashed features. That matrix can then be run through a simple eigendecomposition, and the ((1/eigenvalue)-scaled) eigenvectors can be applied to project the (optional) numHashedFeatures-dimensional outputs mentioned above to get the left-singular vectors / reduced projections (which can then be run through clustering, etc.).
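A small in-memory sketch of the outer-product sum and the final projection may help make the data flow concrete. Everything here is an assumption for illustration: the function names are hypothetical, the "Reducer" is just a loop, and power iteration with deflation stands in for whatever eigensolver would actually be used. The sketch scales each eigenvector by 1/sqrt(eigenvalue), i.e. by the inverse singular value, since that is the scaling that yields orthonormal left-singular vectors.

```python
import math
import random


def outer(v):
    """Mapper side: outer product of one hashed row with itself."""
    return [[a * b for b in v] for a in v]


def add_matrices(a, b):
    """Combiner/Reducer side: element-wise sum of two partial matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gram_matrix(hashed_rows):
    """Sum of per-row outer products, i.e. Y^T Y for the hashed matrix Y."""
    k = len(hashed_rows[0])
    total = [[0.0] * k for _ in range(k)]
    for row in hashed_rows:
        total = add_matrices(total, outer(row))
    return total


def top_eigenpairs(sym, rank, iterations=500, seed=0):
    """Leading eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    work = [r[:] for r in sym]
    pairs = []
    for _ in range(rank):
        vec = [rng.random() for _ in work]
        for _ in range(iterations):
            nxt = [sum(w * x for w, x in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the unit vector vec.
        lam = sum(v * sum(w * x for w, x in zip(row, vec)) for v, row in zip(vec, work))
        pairs.append((lam, vec))
        # Deflate: remove the component just found.
        work = [[w - lam * vi * vj for w, vj in zip(row, vec)] for row, vi in zip(work, vec)]
    return pairs


def left_singular_vectors(hashed_rows, pairs, tol=1e-12):
    """Project each hashed row onto eigenvectors scaled by 1/sqrt(eigenvalue)."""
    kept = [(lam, vec) for lam, vec in pairs if lam > tol]
    return [[sum(y * v for y, v in zip(row, vec)) / math.sqrt(lam) for lam, vec in kept]
            for row in hashed_rows]
```

Because the Gram matrix is only numHashedFeatures by numHashedFeatures, the eigendecomposition fits on a single machine even when the number of rows is very large; only the outer-product sum has to be distributed.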
Good fun will be had by all.
Issue Links
is blocked by
MAHOUT-228 Need sequential logistic regression implementation using SGD techniques (Closed)
is duplicated by
MAHOUT-376 Implement Map-reduce version of stochastic SVD (Closed)