Details
Description
DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
However, a few lines later (line 106) we have
for(int i=0; i<eigenVectors.numRows() - 1; i++) { ... }
which only persists eigenVectors.numRows() - 1 vectors.
Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
Also, I think it would be better if the eigenvectors are persisted in reverse order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
This, for two reasons:
1) When performing another PCA on the same corpus (say, with more principal components), corresponding eigenvalues can be easily matched and compared.
2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.
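Both points can be sketched with plain arrays (illustrative code, not Mahout types; the row data is made up):

```java
// Illustrative sketch, not Mahout code: rows sorted ascending by eigenvalue,
// as LanczosSolver reportedly produces them (last row = most significant).
public class PersistOrderSketch {
    static final double[][] EIGEN_VECTORS = {
        {0.1, 0.2}, {0.3, 0.4}, {0.5, 0.6}
    };

    // Current behavior: the "- 1" bound drops the last (most significant) row.
    static int persistedCountBuggy() {
        int count = 0;
        for (int i = 0; i < EIGEN_VECTORS.length - 1; i++) {
            count++;
        }
        return count;
    }

    // Suggested behavior: walk the rows in reverse so index 0 names the most
    // significant vector and nothing is skipped.
    static double[][] persistReversed() {
        int n = EIGEN_VECTORS.length;
        double[][] out = new double[n][];
        for (int i = n - 1; i >= 0; i--) {
            out[n - 1 - i] = EIGEN_VECTORS[i];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(persistedCountBuggy());   // 2 of 3 rows persisted
        System.out.println(persistReversed()[0][0]); // 0.5: most significant first
    }
}
```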

 MAHOUT-369.diff
 23 kB
 Jake Mannix

 MAHOUT-369.patch
 2 kB
 Danny Leshem
Activity
Ok, fixed the import * crap (from bad IDE settings), and renamed a magic constant in the test to be a bit more readable, and fixed a few things in tests which were broken (tests had been written to assume you get back one less than the number of eigenvectors requested). Committing now!
I'll fix those imports, and add some comments on what has changed / what is being done now.
There are some more improvements which are necessary, but this is strictly better than what was there before, so I'll commit this as soon as I can (after adding said comments, to be fleshed out on the list/wiki further).
LGTM, noting that the patch flips around some imports and that could be removed.
I am sure you're clear to commit as you see fit or you can leave it to me to take care of.
Patch might need some cleanup, but it should be basically there.
Patch addresses inconsistency mentioned by Danny. Might need some cleanup, but otherwise incorporates his suggestions as well as adds a more thorough consistency check between COLT's old EigenvalueDecompositionImpl, and the Lanczos impl in here. For certain problems, they should yield the same answers!
Are the issues with eigenvalue accuracy resolved?
I'd like to commit the patch. Danny seems confident it's the right change and Jake felt it was probably right. Derek suggests maybe there is more that needs to go into the patch. Danny could you confirm whether these are all the changes that are necessary?
Hi Danny,
I've tried out your testLanczosSolver2() test, but I get different output to yours as the eigenvalues are in the reverse order to what you got, i.e. (I've added a line to LanczosSolver to also print the realEigen eigenvector):
INFO: Lanczos iteration complete - now to diagonalize the tridiagonal auxiliary matrix.
01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 0 found with eigenvalue 0.0
01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 0 
01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 1 found with eigenvalue 0.03137295830774178
01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 1 
01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 2 found with eigenvalue 42.617610634772475
01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 2 
01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
INFO: LanczosSolver finished.
When I debug, I see that eigenVals contains [0.0, 0.03137295830774178, 42.617610634772475, 131.25526355941963]. I wanted to check if you'd made a change to LanczosSolver to reverse the order of the eigenvalues, before you generated the test output in your last comment? I don't see any changes to this file in the patch file attached here.
Thanks,
Derek
I just checked this patch and it is correct. There are some other minor problems.
1) The ordering of the eigenvalues was the opposite of the eigenvectors'. The patch fixes this.
2) The signs of the first and third eigenvectors are flipped relative to Matlab's. The second eigenvector has the correct sign.
3) When requesting a rank of 4, we get 3 eigenvalues... so it seems the effective rank is always lower by one.
I have added a test function named testLanczosSolver2() to TestLanczosSolver.java (code below).
To run it, you first need to comment out the line: //nextVector.assign(new Scale(1 / scaleFactor));
in LanczosSolver.java, so it is easier to compare the results to Matlab without the normalization.
I further suggest adding an additional optional flag for avoiding normalization.
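That suggestion could look roughly like the following - a hedged sketch with illustrative names (maybeScale, normalize), not actual LanczosSolver members:

```java
// Hedged sketch: make the per-iteration scaling optional via a flag instead of
// commenting out the assign(...) line. Names are illustrative, not Mahout's.
public class ScaleFlagSketch {
    static double[] maybeScale(double[] nextVector, double scaleFactor, boolean normalize) {
        if (!normalize) {
            return nextVector; // skip scaling, e.g. to compare raw output with Matlab
        }
        double[] scaled = new double[nextVector.length];
        for (int i = 0; i < nextVector.length; i++) {
            scaled[i] = nextVector[i] / scaleFactor;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] v = {2.0, 4.0};
        System.out.println(maybeScale(v, 2.0, true)[0]);  // normalized
        System.out.println(maybeScale(v, 2.0, false)[0]); // untouched
    }
}
```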
The factorized matrix is:
>> full(A)
ans =
3.1200 3.1212 3.0000
3.1110 1.5000 2.1212
7.0000 8.0000 4.0000
The eigenvalues are:
>> [a,b]=eig(full(A'*A))
a =
0.2132 0.8010 0.5593
0.5785 0.3578 0.7330
0.7873 0.4799 0.3871
b =
0.0314 0 0
0 42.6176 0
0 0 131.2553
Now I run the unit test testLanczosSolver2 and I get:
INFO: Lanczos iteration complete - now to diagonalize the tridiagonal auxiliary matrix.
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 0 found with eigenvalue 131.25526355941963
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 1 found with eigenvalue 42.61761063477249
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 2 found with eigenvalue 0.03137295830779152
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: LanczosSolver finished.
As you can see, the eigenvalues are correct, but when I look at the eigenvectors I see that
V0 (the row number zero eigenvector) = [ 0.5593 0.7330 0.387 ], whereas in Matlab V0 is the third column of the matrix a above.
@Test
public void testLanczosSolver2() throws Exception {
int numRows = 3; int numCols = 3;
int numColumns = 3;
SparseRowMatrix m = new SparseRowMatrix(new int[]{numRows, numCols});
/**
 3.1200 3.1212 3.0000
3.1110 1.5000 2.1212
7.0000 8.0000 4.0000
*/
m.set(0,0,3.12);
m.set(0,1,3.12121);
m.set(0,2,3);
m.set(1,0,3.111);
m.set(1,1,1.5);
m.set(1,2,2.12122);
m.set(2,0,7);
m.set(2,1,8);
m.set(2,2,4);
int rank = 4;
Matrix eigens = new DenseMatrix(rank, numColumns);
long time = timeLanczos(m, eigens, rank, false);
assertTrue("Lanczos taking too long! Are you in the debugger? ", time < 10000);
//assertOrthonormal(eigens);
//assertEigen(eigens, m, 0.1, false);
}
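One cheap, self-contained consistency check on the numbers above (using the exact entries from the m.set(...) calls): the trace of A'A, which is just the sum of squares of A's entries, must equal the sum of the eigenvalues of A'A, and the three logged Lanczos values satisfy this:

```java
// Sanity check: trace(A^T A) = sum of the eigenvalues of A^T A.
// Matrix entries match the test's m.set(...) calls above.
public class TraceCheck {
    public static void main(String[] args) {
        double[][] a = {
            {3.12,  3.12121, 3.0},
            {3.111, 1.5,     2.12122},
            {7.0,   8.0,     4.0}
        };
        // trace(A^T A) equals the sum of squares of all entries of A
        double trace = 0.0;
        for (double[] row : a) {
            for (double v : row) {
                trace += v * v;
            }
        }
        // The three eigenvalues reported by the Lanczos run above
        double eigenSum = 131.25526355941963 + 42.61761063477249 + 0.03137295830779152;
        System.out.println(Math.abs(trace - eigenSum) < 1e-6); // true
    }
}
```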
Best,
Danny Bickson
I haven't looked at this particular patch in ages, but let me try and check it out again, because I think this patch fixes a real problem and should get in, not archived.
This one's also been on the shelf for about 4 months. Is it ready to go, or should it be archived?
Same question – committable, Jake? Or does it need more review and time, and thus slip to 0.5?
Danny, thanks for looking into this so carefully, I think you definitely have described what's going on in this now.
I'm going to try and look at this some more in the next few days and clean it up a bit, and add in some of your suggested unit tests.
Okay, please ignore my previous patch.
I think I finally understand what's going on with the code. There are two seemingly unrelated issues:
1) The code returns one less eigenvector/eigenvalue than requested.
LanczosSolver.java (157) returns one less eigenvector/eigenvalue than requested
for (int i = 0; i < basis.numRows() - 1; i++) {
DistributedLanczosSolver.java (106) serializes one less eigenvector than requested
for(int i=0; i<eigenVectors.numRows() - 1; i++) {
2) When asked for N eigenvectors, the code returns (serializes) the most important N-1 eigenvectors in descending order (meaning, important ones are serialized first - which, other than the "N-1" part, is excellent!), but associates them with the bottom N-1 out of the top N most important eigenvalues in ascending order (meaning, important ones are serialized last).
In other words, the #1 most important eigenvalue is not serialized at all, the #2 most important eigenvalue is associated with the #N-1 most important eigenvector, ... and finally the #N most important eigenvalue is associated with the #1 most important eigenvector.
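A toy illustration of that pairing (plain ints stand in for vectors and values; this is not Mahout code): vectors come out in descending importance, values in ascending importance, and zipping them positionally under the "- 1" bound both misaligns every pair and drops the top value entirely:

```java
// Toy model of issue 2: with N = 4, the best vector gets paired with the
// worst value, and the most important value (rank 1) is never emitted.
public class PairingBug {
    public static void main(String[] args) {
        int n = 4;
        int[] vectorRank = {1, 2, 3, 4}; // serialization order: best first
        int[] valueRank  = {4, 3, 2, 1}; // logging order: worst first
        for (int i = 0; i < n - 1; i++) { // the "- 1" drops value rank 1
            System.out.println("vector rank " + vectorRank[i]
                + " paired with value rank " + valueRank[i]);
        }
    }
}
```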
Vector names changed drastically during the last few days (following Sean's patch for MAHOUT-379), so issue #2 is less obvious now. Previously the eigenvalues were serialized with the vector. Now they are only mentioned in log prints.
Here's some code to verify these claims. The code creates a 1000-row, 100-column matrix. For each row, the first element is +1/-1 with equal probability, and the rest are +0.001/-0.001 with equal probability. The elements are uncorrelated, so decomposing this matrix should reveal that the most important PC is the 100-dimensional vector (1, 0, 0, ..., 0), and it should be associated with a much higher eigenvalue than the rest. To test my claims, remove the "- 1" from the two loops mentioned above, and run the following code:
[add to TestDistributedRowMatrix.java]
public static DistributedRowMatrix distributedMatrix(final Matrix matrix, String baseTmpDir) throws IOException {
  baseTmpDir = TESTDATA + baseTmpDir;
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  ClusteringTestUtils.writePointsToFile(new Iterable<VectorWritable>() {
    @Override
    public Iterator<VectorWritable> iterator() {
      final Iterator<MatrixSlice> it = matrix.iterator();
      return new Iterator<VectorWritable>() {
        @Override
        public boolean hasNext() { return it.hasNext(); }
        @Override
        public VectorWritable next() {
          MatrixSlice slice = it.next();
          return new VectorWritable(slice.vector());
        }
        @Override
        public void remove() { it.remove(); }
      };
    }
  }, true, baseTmpDir + "/distMatrix/part-00000", fs, conf);
  DistributedRowMatrix distMatrix = new DistributedRowMatrix(baseTmpDir + "/distMatrix",
      baseTmpDir + "/tmpOut", matrix.numRows(), matrix.numCols());
  distMatrix.configure(new JobConf(conf));
  return distMatrix;
}
[add to TestDistributedLanczosSolver.java, and run the test]
private Random rand = new Random();

private double[] randomRow(int numColumns) {
  final double[] values = new double[numColumns];
  // Only the first column's value is "important". Columns are uncorrelated.
  values[0] = (rand.nextBoolean() ? 1 : -1);
  for (int i = 1; i < numColumns; ++i) {
    values[i] = (rand.nextBoolean() ? 0.001 : -0.001);
  }
  return values;
}

private Matrix randomMatrix(int numRows, int numColumns) {
  final Matrix matrix = new DenseMatrix(numRows, numColumns);
  for (int row = 0; row < numRows; ++row) {
    matrix.set(row, randomRow(numColumns));
  }
  return matrix;
}

public void testDistributedLanczosSolverSanity() throws Exception {
  final File testData = new File("testdata");
  if (!testData.exists()) {
    testData.mkdir();
  }
  final Matrix matrix = randomMatrix(1000, 100);
  final DistributedRowMatrix distMatrix = TestDistributedRowMatrix.distributedMatrix(matrix, "testdata");
  distMatrix.configure(new JobConf());
  final DistributedLanczosSolver solver = new DistributedLanczosSolver();
  final Matrix eigenVectors = new DenseMatrix(30, 100);
  final List<Double> eigenValues = new ArrayList<Double>();
  solver.solve(distMatrix, 30, eigenVectors, eigenValues, false);
  System.out.println(" Eigenvalues ");
  printDoubleList(eigenValues);
  System.out.println(" Eigenvectors ");
  printMatrix(eigenVectors);
}

private void printDoubleList(List<Double> values) {
  for (double value : values) {
    System.out.print(value + "\t");
  }
  System.out.println("");
}

private void printMatrix(Matrix matrix) {
  for (int row = 0; row < matrix.numRows(); ++row) {
    for (int col = 0; col < matrix.numCols(); ++col) {
      System.out.print(matrix.get(row, col) + "\t");
    }
    System.out.println("");
  }
}
The test does not fail against the current code, but its prints show what's wrong. To make this a real test, one can replace the printing with verifying that 1) the first eigenvalue is, say, at least 100 times bigger than the second; and 2) the returned eigenvectors in fact correspond to their eigenvalues. This would fail the current code.
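Check (2) can be sketched without Mahout types: an eigenpair (lambda, v) of A is valid exactly when the residual ||Av - lambda*v|| is near zero, so pairing a vector with the wrong eigenvalue fails the check. A tiny 2x2 example with known eigenpairs:

```java
// Minimal eigenpair verification: residual ||A v - lambda v|| should be ~0.
public class EigenPairCheck {
    static double residual(double[][] a, double[] v, double lambda) {
        double sumSq = 0.0;
        for (int i = 0; i < a.length; i++) {
            double av = 0.0; // i-th component of A*v
            for (int j = 0; j < v.length; j++) {
                av += a[i][j] * v[j];
            }
            double r = av - lambda * v[i];
            sumSq += r * r;
        }
        return Math.sqrt(sumSq);
    }

    public static void main(String[] args) {
        double[][] a = {{2, 1}, {1, 2}}; // eigenvalues 3 and 1
        double s = Math.sqrt(0.5);
        System.out.println(residual(a, new double[]{s, s}, 3.0) < 1e-9);  // true
        System.out.println(residual(a, new double[]{s, -s}, 1.0) < 1e-9); // true
        // Mismatched pairing, as in the bug described above, fails:
        System.out.println(residual(a, new double[]{s, s}, 1.0) < 1e-9);  // false
    }
}
```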
I considered submitting another patch, but I'm no longer sure about the best way to fix this. However, I am positive now that this is a real issue that needs to be fixed.
Attached is a simple patch that fixes the two issues raised.
The right way to do this is to introduce a new unittest that fails with the current version (e.g. decompose a fixed matrix and verify all its known eigenvalues are found). The attached patch has no such code.
All relevant unittests pass (I'm getting errors for a few org.apache.mahout.clustering tests, nothing related to this change though).
Jake, I'm doing this mostly to practice sending Mahout patches. If your original code indeed implemented the intended logic, you might want to add a unittest that fails my patch...
Hold off on that, Sean. I made the loop like that for a reason. I need to check in again and verify if/where it's wrong, but it was not an oversight, it has to do with the way the Colt code does EigenDecomposition.
If the patch amounts to making that loop into:
for (int i = eigenVectors.numRows() - 1; i >= 0; i--) {
Then I can commit that. But do the folks who understand this code agree it's the thing to do? Seems so.
Can you create a suggested patch?
Committed revision 1088831.