Ok, ugly, dirty patch which needs to be cleaned up, but it does work, in some circumstances, for some inputs (on my cluster). cough
This patch makes some extensions of the DocumentVectorizer as well. Let's say you already have a SequenceFile<Text,Text> of your corpus (living at text_path); then you can produce some good output by doing:
$HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i text_path -o corpus_as_vectors_path -seq true -w tfidf -chunk 1000 --minSupport 1 --minDF 5 --maxDFPercent 50 --norm 2
Now I've got some SequentialAccessSparseVectors in corpus_as_vectors_path, tfidf weighted and L2 normalized, with terms which occur in more than half of the documents stripped out, etc. Now for the fun: you need to know the dimension of the vectors you spit out (you can do this by guessing and getting it wrong, in which case a slightly more helpful CardinalityException will be spit out in the logs/console, or you can get it from the corpus_as_vectors entries themselves). If the value you find is numFeatures, then try this hadoop job:
$HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver -i corpus_as_vectors_path -o corpus_svd_path -nr 1 -nc <numFeatures> --rank 100
This will zip along making 100 passes over your data, then do a decomposition of a nice and small (100x100) matrix in memory, and produce a SequenceFile<IntWritable,VectorWritable>. The values are DenseVectors of dimension numFeatures (so it should not be MAX_VALUE!), and the "name" of each vector contains a string which is not actually the eigenvalue, but is proportional to it - I'm working on that part still.
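For anyone who wants to see the shape of the computation, here is a minimal single-machine sketch of the Lanczos iteration in plain Python. It is not the code in the patch: the names times_squared and lanczos, the dict-per-row sparse format, and the full reorthogonalization step are all assumptions made for illustration. Each loop iteration does one pass over the rows (computing A^T(A v)), and the alphas/betas it collects are the diagonal and off-diagonal of the small rank x rank tridiagonal matrix that then gets decomposed in memory (that last step is left out here).

```python
import math
import random


def times_squared(rows, v):
    """One pass over the corpus: return A^T (A v) for sparse rows {col: value}."""
    out = [0.0] * len(v)
    for row in rows:
        dot = sum(val * v[col] for col, val in row.items())
        for col, val in row.items():
            out[col] += dot * val
    return out


def lanczos(rows, num_cols, rank, seed=0):
    """Run up to `rank` Lanczos steps on A^T A.

    Returns (basis, alphas, betas): the orthonormal basis vectors plus the
    diagonal (alphas) and off-diagonal (betas) of the tridiagonal matrix.
    """
    rng = random.Random(seed)
    q = [rng.uniform(-1.0, 1.0) for _ in range(num_cols)]
    norm = math.sqrt(sum(x * x for x in q))
    basis = [[x / norm for x in q]]
    alphas, betas = [], []
    for _ in range(rank):
        w = times_squared(rows, basis[-1])  # the expensive pass over the data
        alphas.append(sum(a * b for a, b in zip(w, basis[-1])))
        # Full reorthogonalization against every basis vector found so far
        # (an assumption of this sketch, chosen for numerical simplicity).
        for prev in basis:
            proj = sum(a * b for a, b in zip(w, prev))
            w = [a - proj * b for a, b in zip(w, prev)]
        beta = math.sqrt(sum(x * x for x in w))
        if len(basis) == rank or beta < 1e-10:
            break  # reached the requested rank, or the Krylov space is exhausted
        betas.append(beta)
        basis.append([x / beta for x in w])
    return basis, alphas, betas


# Tiny made-up corpus: 6 documents over 5 features.
demo_rows = [
    {0: 1.0, 1: 2.0}, {1: 1.0, 2: 3.0}, {0: 2.0, 3: 1.0},
    {2: 1.0, 4: 2.0}, {3: 2.0, 4: 1.0}, {0: 1.0, 4: 3.0},
]
demo_basis, demo_alphas, demo_betas = lanczos(demo_rows, num_cols=5, rank=4)
print(demo_alphas, demo_betas)
```

The point of the sketch is only the cost structure: the number of passes over the data equals the requested rank, and everything after that works on a matrix small enough to sit in memory.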
There's also a unit test (which currently takes about a minute on my laptop) - DistributedLanczosSolverTest, which validates accuracy.
TODO: cleanup, stuff mentioned above, a job which validates correctness explicitly after the fact, and some utilities for taking the eigenvectors and doing useful stuff with them.
NOTE: Lanczos spits out desiredRank - 1 orthogonal vectors which are pretty close to being eigenvectors of the square of your matrix (A^T A, i.e. they are right-singular vectors of the input corpus), but they span the spectrum: the first few are the ones with the highest singular values, the last few are the ones with the lowest singular values. If you really want, e.g., the 100 singular vectors with the highest singular values, ask Lanczos for 300 as the rank, and then only keep the top 100. This will give you 100 "of the largest" singular vectors, but no guarantee that you don't miss part of that top of the spectrum. For most cases, this isn't a worry, but you should keep it in mind.
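The "ask for 300, keep 100" step is just a filter over the solver output. A rough sketch in Python, assuming you have already read the output into (score, vector) pairs, where score is whatever number you recovered for each vector; keep_top is a made-up helper name, not something in the patch, and it assumes a larger score means a larger singular value:

```python
import heapq


def keep_top(pairs, k):
    """Return the k (score, vector) pairs with the largest score, largest first."""
    return heapq.nlargest(k, pairs, key=lambda pair: pair[0])


# Hypothetical output of an oversampled run, keeping the top 2 of 4.
candidates = [(3.0, "v0"), (9.0, "v1"), (1.0, "v2"), (5.0, "v3")]
top = keep_top(candidates, 2)
```

This only orders what Lanczos found; it does nothing about vectors near the top of the spectrum that the run missed.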