Following email correspondence with Jake, attached is a suggested patch to solve this issue.
The general idea was to define a new VectorIterableWriter that allows sequentially writing vectors to some underlying storage, and constructing a VectorIterable over them when done. Two implementations are currently provided: RowMatrixWriter, which uses a given in-memory matrix as storage, and DistributedRowMatrixWriter, which uses a DistributedRowMatrix. The algorithm was then modified to use a VectorIterableWriter for temporary storage and for its output, instead of a huge in-memory matrix.
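To make the intent concrete, here is a minimal, self-contained sketch of the writer idea described above. The method names (write, getVectorIterable) and the list-backed implementation are my assumptions for illustration only, and plain double[] stands in for Mahout's Vector so the snippet compiles without Mahout on the classpath:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the VectorIterableWriter idea: append rows
// sequentially, then obtain an Iterable over everything written.
interface VectorIterableWriter {
    // Append one row (vector) to the underlying storage.
    void write(double[] row);
    // View the written rows once writing is done.
    Iterable<double[]> getVectorIterable();
}

// In-memory analogue of RowMatrixWriter: backs the writer with a list.
// A DistributedRowMatrixWriter would instead write rows to HDFS-backed storage.
class InMemoryRowWriter implements VectorIterableWriter {
    private final List<double[]> rows = new ArrayList<>();
    @Override public void write(double[] row) { rows.add(row); }
    @Override public Iterable<double[]> getVectorIterable() { return rows; }
}

public class WriterDemo {
    public static void main(String[] args) {
        VectorIterableWriter w = new InMemoryRowWriter();
        w.write(new double[]{1.0, 2.0});
        w.write(new double[]{3.0, 4.0});
        int count = 0;
        for (double[] r : w.getVectorIterable()) count++;
        System.out.println(count); // 2
    }
}
```

The point of the abstraction is that the algorithm only ever appends rows and later iterates, so the backing storage can be swapped from an in-memory matrix to a disk-based one without touching the algorithm.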
The patch also partially fixes MAHOUT-369: the returned eigenvalues should now correspond to the eigenvectors. However, one fewer of each is still returned (see the TODO in the code; removing the "-1" fails the unit tests, and I haven't looked into why).
1) Existing unit tests pass. However, as commented in MAHOUT-369, the unit tests for this package are far from complete. Unfortunately, my usual datasets were rendered unusable by recent changes in Mahout vector serialization, and I haven't had the time to generate fictitious ones...
2) With this patch, the memory issue should be a thing of the past. However, with extremely large datasets a new computational issue may surface: iterating over a large disk-based dataset 'desiredRank' times (see the loop right below the TODO). This could be worked around by rewriting that code as a MapReduce job, but that is outside the scope of this patch.
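To illustrate the cost concern in point 2: each of the desiredRank iterations re-reads the whole corpus once, so total I/O scales as O(desiredRank * numRows). The snippet below is a toy illustration of that pass structure (the names and sizes are made up for the example, not taken from the patch):

```java
public class PassCountDemo {
    public static void main(String[] args) {
        int desiredRank = 3;                       // number of outer iterations
        double[][] corpus = {{1}, {2}, {3}, {4}};  // stand-in for the disk-based dataset

        long rowReads = 0;
        for (int k = 0; k < desiredRank; k++) {
            // One full pass over the corpus per iteration; with disk-backed
            // storage, every one of these reads hits the disk again.
            for (double[] row : corpus) rowReads++;
        }
        System.out.println(rowReads); // 12 = desiredRank * numRows
    }
}
```

With an in-memory matrix these repeated passes were cheap; once the rows live on disk, the desiredRank full scans become the dominant cost, which is why a MapReduce rewrite is mentioned as the eventual fix.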
Jake, I'd appreciate any input you may have. It would also be very reassuring if you find the time to run some tests on real data you may have. And of course, your "seal of approval" if you think it does the trick. I might have some more time to work on it this Sunday (GMT+2), so any input till then would be greatly appreciated.