# Implement Stochastic Decomposition

## Details

• Type: New Feature
• Status: Closed
• Priority: Major
• Resolution: Duplicate
• Affects Version/s: 0.4
• Fix Version/s:
• Component/s:
• Labels:
None

## Description

Techniques reviewed in <a href="http://arxiv.org/abs/0909.4061">Halko, Martinsson, and Tropp</a>.

The basic idea of the implementation is as follows: if the input matrix is represented as a DistributedSparseRowMatrix (backed by a sequence-file of <Writable,VectorWritable> - the values of which should be SequentialAccessSparseVector instances for best performance), and you optionally have a kernel function f(v) which maps sparse numColumns-dimensional (here numColumns is unconstrained in size) vectors to sparse numKernelizedFeatures-dimensional (also unconstrained in size) vectors (in the case where you want to do kernel-PCA, for example, for a kernel k(u,v) = f(u).dot( f(v) )), then take the MurmurHash (from MAHOUT-228) and maps the numKernelizedFeatures-dimensional vectors and projects down to some numHashedFeatures-dimensional space (reasonably-sized - no more than a 10^2 to 10^4).

This is all done in the Mapper, and there are two outputs: the numHashedFeatures-dimensional vector itself (if the left-singular vectors are ever desired), which does not need to be Reduced, and the outer-product of this vector with itself, where the Reducer/Combiner just does the matrix sum on the partial outputs, eventually producing the kernel / gram matrix of your hashed features, which can then be run through a simple eigen-decomposition, the ((1/eigenvalue)-scaled) eigenvectors of which can be applied to project the (optional) numHashedFeatures-dimensional outputs mentioned earlier in this paragraph to get the left-singular vectors / reduced projections (which can be then run through clustering, etc...).

Good fun will be had by all.

## Activity

Transition Time In Source Status Execution Times Last Executer Last Execution Date
 Open Resolved
375d 11h 1m 1 Sean Owen 07/Mar/11 10:54
 Resolved Closed
74d 16h 24m 1 Sean Owen 21/May/11 03:18
Sean Owen made changes -
 Status Resolved [ 5 ] Closed [ 6 ]
Hide
Ted Dunning added a comment -

Yes. Duplicated and superseded entirely.

Show
Ted Dunning added a comment - Yes. Duplicated and superseded entirely.
Sean Owen made changes -
 Status Open [ 1 ] Resolved [ 5 ] Assignee Jake Mannix [ jake.mannix ] Ted Dunning [ tdunning ] Resolution Duplicate [ 3 ]
Hide
Sean Owen added a comment -

And then likewise I think this, being the same as MAHOUT-376, is duplicated by MAHOUT-593 for all practical purposes?

Show
Sean Owen added a comment - And then likewise I think this, being the same as MAHOUT-376 , is duplicated by MAHOUT-593 for all practical purposes?
Ted Dunning made changes -
 Link This issue is duplicated by MAHOUT-376
Ted Dunning made changes -
 Fix Version/s 0.5 [ 12315255 ] Fix Version/s 0.4 [ 12314396 ]
Jake Mannix made changes -
Field Original Value New Value
Link This issue is blocked by MAHOUT-228
Jake Mannix created issue -

## People

• Assignee:
Ted Dunning
Reporter:
Jake Mannix