Details
Description
DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
However, a few lines later (line 106) we have
for(int i=0; i<eigenVectors.numRows() - 1; i++) { ... }
which only persists eigenVectors.numRows() - 1 vectors.
Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
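To make the mismatch concrete, here is a minimal, self-contained sketch of the two loop bounds. It is not Mahout code: the class PersistLoopSketch and the method persistedRows are hypothetical stand-ins, and a plain integer index takes the place of each eigenvector row. Which row the real solver treats as the most significant is the reporter's guess and is not verified here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the persistence loop; not part of Mahout.
final class PersistLoopSketch {

    // Returns the row indices a loop would visit for the given bound.
    // skipLast = true mimics "i < numRows - 1"; false mimics "i < numRows".
    public static List<Integer> persistedRows(int numRows, boolean skipLast) {
        int bound = skipLast ? numRows - 1 : numRows;
        List<Integer> persisted = new ArrayList<>();
        for (int i = 0; i < bound; i++) {
            persisted.add(i);
        }
        return persisted;
    }

    public static void main(String[] args) {
        int numRows = 4; // the count the log message would report
        System.out.println("reported:          " + numRows);
        System.out.println("i < numRows - 1 -> " + persistedRows(numRows, true));
        System.out.println("i < numRows     -> " + persistedRows(numRows, false));
    }
}
```

With numRows = 4, the first bound visits rows 0 through 2 and never reaches row 3, so one fewer vector is written than the log line announces.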
Also, I think it would be better if the eigenvectors are persisted in reverse order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
This, for two reasons:
1) When performing another PCA on the same corpus (say, with more principal components), corresponding eigenvalues can be easily matched and compared.
2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.
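The proposed relabelling can be sketched as follows, assuming (as the request implies, but this is not confirmed against the solver) that rows currently come out with the least significant eigenvector first. ReverseOrderSketch, label, reversed and topK are hypothetical names, and plain double values stand in for eigenvalues.

```java
import java.util.Arrays;

// Hypothetical illustration of the proposed ordering; not part of Mahout.
final class ReverseOrderSketch {

    // Maps a row index in ascending-eigenvalue order to the proposed label,
    // so the largest eigenvalue gets label 0.
    public static int label(int row, int numRows) {
        return numRows - 1 - row;
    }

    // Reorders eigenvalues from ascending to descending using label().
    public static double[] reversed(double[] ascending) {
        int n = ascending.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[label(i, n)] = ascending[i];
        }
        return out;
    }

    // Keeps the k most significant entries of a descending array.
    public static double[] topK(double[] descending, int k) {
        return Arrays.copyOfRange(descending, 0, Math.min(k, descending.length));
    }

    public static void main(String[] args) {
        double[] ascending = {0.1, 2.0, 5.0, 9.0}; // made-up values
        double[] descending = reversed(ascending);
        System.out.println(Arrays.toString(descending));
        System.out.println(Arrays.toString(topK(descending, 2)));
    }
}
```

Under this labelling, dropping the weakest components is a prefix copy, and label i should refer to the same component in two runs with different ranks, to the extent that both runs converge on the leading eigenvalues.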
Activity
Field  Original Value  New Value 

Description  (the description text above, with "garbage" misspelled as "garabage")  (the description text as shown above)
Fix Version/s  0.4 [ 12314396 ] 
Attachment  MAHOUT369.patch [ 12441827 ] 
Assignee  Jake Mannix [ jake.mannix ] 
Fix Version/s  0.5 [ 12315255 ]  
Fix Version/s  0.4 [ 12314396 ] 
Attachment  MAHOUT369.diff [ 12475343 ] 
Status  Open [ 1 ]  Patch Available [ 10002 ] 
Status  Patch Available [ 10002 ]  Resolved [ 5 ] 
Resolution  Fixed [ 1 ] 
Status  Resolved [ 5 ]  Closed [ 6 ] 
Transition  Time In Source Status  Execution Times  Last Executer  Last Execution Date
361d 20h 11m  1  Jake Mannix  04/Apr/11 09:41
14h 43m  1  Jake Mannix  05/Apr/11 00:24
46d 2h 54m  1  Sean Owen  21/May/11 03:18
Can you create a suggested patch?