Mahout
  1. Mahout
  2. MAHOUT-369

Issues with DistributedLanczosSolver output

    Details

      Description

      DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.

          log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
      

      However, a few lines later (line 106) we have

          for(int i=0; i<eigenVectors.numRows() - 1; i++) {
              ...
          }
      

      which only persists eigenVectors.numRows()-1 vectors.

      Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?

      Also, I think it would be better if the eigenvectors are persisted in reverse order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.

      This, for two reasons:
      1) When performing another PCA on the same corpus (say, with more principal componenets), corresponding eigenvalues can be easily matched and compared.
      2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

      1. MAHOUT-369.diff
        23 kB
        Jake Mannix
      2. ASF.LICENSE.NOT.GRANTED--MAHOUT-369.patch
        2 kB
        Danny Leshem

        Activity

        Hide
        Jake Mannix added a comment -

        Committed revision 1088831.

        Show
        Jake Mannix added a comment - Committed revision 1088831.
        Hide
        Jake Mannix added a comment -

        Ok, fixed the import * crap (from bad IDE settings), and renamed a magic constant in the test to be a bit more readable, and fixed a few things in tests which were broken (tests had been written to assume you get back one less than the number of eigenvectors requested). Committing now!

        Show
        Jake Mannix added a comment - Ok, fixed the import * crap (from bad IDE settings), and renamed a magic constant in the test to be a bit more readable, and fixed a few things in tests which were broken (tests had been written to assume you get back one less than the number of eigenvectors requested). Committing now!
        Hide
        Jake Mannix added a comment -

        I'll fix those imports, and add some comments on what has changed / what is being done now.

        There are some more improvements which are necessary, but this is strictly better than was what was there before, so I'll commit this as soon as I can (after adding said comments, to be fleshed out on the list/wiki further).

        Show
        Jake Mannix added a comment - I'll fix those imports, and add some comments on what has changed / what is being done now. There are some more improvements which are necessary, but this is strictly better than was what was there before, so I'll commit this as soon as I can (after adding said comments, to be fleshed out on the list/wiki further).
        Hide
        Sean Owen added a comment -

        LGTM, noting that the patch flips around some imports and that could be removed.
        I am sure you're clear to commit as you see fit or you can leave it to me to take care of.

        Show
        Sean Owen added a comment - LGTM, noting that the patch flips around some imports and that could be removed. I am sure you're clear to commit as you see fit or you can leave it to me to take care of.
        Hide
        Jake Mannix added a comment -

        Patch might need some cleanup, but it should be basically there.

        Show
        Jake Mannix added a comment - Patch might need some cleanup, but it should be basically there.
        Hide
        Jake Mannix added a comment -

        Patch addresses inconsistency mentioned by Danny. Might need some cleanup, but otherwise incorporates his suggestions as well as adds a more thorough consistency check between COLT's old EigenvalueDecompositionImpl, and the Lanczos impl in here. For certain problems, they should yield the same answers!

        Show
        Jake Mannix added a comment - Patch addresses inconsistency mentioned by Danny. Might need some cleanup, but otherwise incorporates his suggestions as well as adds a more thorough consistency check between COLT's old EigenvalueDecompositionImpl, and the Lanczos impl in here. For certain problems, they should yield the same answers!
        Hide
        Dmitriy Lyubimov added a comment -

        Are the issues with eigenvalue accuracy resolved?

        Show
        Dmitriy Lyubimov added a comment - Are the issues with eigenvalue accuracy resolved?
        Hide
        Sean Owen added a comment -

        I'd like to commit the patch. Danny seems confident it's the right change and Jake felt it was probably right. Derek suggests maybe there is more that needs to go into the patch. Danny could you confirm whether these are all the changes that are necessary?

        Show
        Sean Owen added a comment - I'd like to commit the patch. Danny seems confident it's the right change and Jake felt it was probably right. Derek suggests maybe there is more that needs to go into the patch. Danny could you confirm whether these are all the changes that are necessary?
        Hide
        Derek O'Callaghan added a comment -

        Hi Danny,

        I've tried out your testLanczosSolver2() test, but I get different output to yours as the eigenvalues are in the reverse order to what you got, i.e. (I've added a line to LanczosSolver to also print the realEigen eigenvector):

        INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
        01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
        INFO: Eigenvector 0 found with eigenvalue 0.0
        01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
        INFO: Eigenvector 0 -

        {0:0.5593439330819562,1:0.7330112630790516,2:0.3870773213760546}

        01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
        INFO: Eigenvector 1 found with eigenvalue 0.03137295830774178
        01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
        INFO: Eigenvector 1 -

        {0:-0.8010370751145115,1:0.35784374789842055,2:0.47988275274487646}

        01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
        INFO: Eigenvector 2 found with eigenvalue 42.617610634772475
        01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
        INFO: Eigenvector 2 -

        {0:-0.21324626331168514,1:0.5784827916967494,2:-0.7873269275811279}

        01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info
        INFO: LanczosSolver finished.

        When I debug, I see that eigenVals contains [0.0, 0.03137295830774178, 42.617610634772475, 131.25526355941963]. I wanted to check if you'd made a change to LanczosSolver to reverse the order of the eigenvalues, before you generated the test output in your last comment? I don't see any changes to this file in the patch file attached here.

        Thanks,

        Derek

        Show
        Derek O'Callaghan added a comment - Hi Danny, I've tried out your testLanczosSolver2() test, but I get different output to yours as the eigenvalues are in the reverse order to what you got, i.e. (I've added a line to LanczosSolver to also print the realEigen eigenvector): INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix. 01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info INFO: Eigenvector 0 found with eigenvalue 0.0 01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info INFO: Eigenvector 0 - {0:0.5593439330819562,1:0.7330112630790516,2:0.3870773213760546} 01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info INFO: Eigenvector 1 found with eigenvalue 0.03137295830774178 01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info INFO: Eigenvector 1 - {0:-0.8010370751145115,1:0.35784374789842055,2:0.47988275274487646} 01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info INFO: Eigenvector 2 found with eigenvalue 42.617610634772475 01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info INFO: Eigenvector 2 - {0:-0.21324626331168514,1:0.5784827916967494,2:-0.7873269275811279} 01-Mar-2011 15:56:56 org.slf4j.impl.JCLLoggerAdapter info INFO: LanczosSolver finished. When I debug, I see that eigenVals contains [0.0, 0.03137295830774178, 42.617610634772475, 131.25526355941963] . I wanted to check if you'd made a change to LanczosSolver to reverse the order of the eigenvalues, before you generated the test output in your last comment? I don't see any changes to this file in the patch file attached here. Thanks, Derek
        Hide
        Danny Bickson added a comment - - edited

        I just checked this patch and it is correct. There are some other minor problems.
        1) The ordering of eigenvalues was the opposite than eigenvectors. But this the patch fixes.
        2) The signs of the first and third eigenvectors are negative to the sign of matlab. The second eigenvalue has the correct sign.
        3) When requesting a rank of 4, we get 3 eigenvalues... So it seems that the rank is always lower by one.

        I have added a test function named estLanczosSolver2() to TestLanczosSolver.java (code below).
        To run it, you need first to comment the line: //nextVector.assign(new Scale(1 / scaleFactor));
        in LanczosSolver.java, so it is easier to compare the results to Matlab, without the normalization.

        I further suggest to add an additional optional flag for avoiding normalization.

        The factorized matrix is:
        >> full(A)

        ans =

        3.1200 -3.1212 -3.0000
        -3.1110 1.5000 2.1212
        -7.0000 -8.0000 -4.0000

        The eigenvalues are;
        >> [a,b]=eig(full(A'*A))

        a =

        0.2132 -0.8010 -0.5593
        -0.5785 0.3578 -0.7330
        0.7873 0.4799 -0.3871

        b =

        0.0314 0 0
        0 42.6176 0
        0 0 131.2553

        Now I run the unit test testLanczosSolver2 and I get:
        INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
        Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: Eigenvector 0 found with eigenvalue 131.25526355941963
        Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: Eigenvector 1 found with eigenvalue 42.61761063477249
        Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: Eigenvector 2 found with eigenvalue 0.03137295830779152
        Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: LanczosSolver finished.

        As you can see the eigenvalues are correct, but when I look at the eigenvectors I see that
        V0 (the row number zero eigenvector) = [ 0.5593 0.7330 0.387] , where in matlab we get -V0 (the third column of the matrix a above)

        @
        Test
        public void testLanczosSolver2() throws Exception {
        int numRows = 3; int numCols = 3;
        int numColumns = 3;
        SparseRowMatrix m = new SparseRowMatrix(new int[]

        {numRows, numCols}

        );
        /**

        • 3.1200 -3.1212 -3.0000
          -3.1110 1.5000 2.1212
          -7.0000 -8.0000 -4.0000

        */
        m.set(0,0,3.12);
        m.set(0,1,-3.12121);
        m.set(0,2,-3);
        m.set(1,0,-3.111);
        m.set(1,1,1.5);
        m.set(1,2,2.12122);
        m.set(2,0,-7);
        m.set(2,1,-8);
        m.set(2,2,-4);

        int rank = 4;
        Matrix eigens = new DenseMatrix(rank, numColumns);
        long time = timeLanczos(m, eigens, rank, false);
        assertTrue("Lanczos taking too long! Are you in the debugger? ", time < 10000);
        //assertOrthonormal(eigens);
        //assertEigen(eigens, m, 0.1, false);
        }

        Best,

        Danny Bickson

        Show
        Danny Bickson added a comment - - edited I just checked this patch and it is correct. There are some other minor problems. 1) The ordering of eigenvalues was the opposite than eigenvectors. But this the patch fixes. 2) The signs of the first and third eigenvectors are negative to the sign of matlab. The second eigenvalue has the correct sign. 3) When requesting a rank of 4, we get 3 eigenvalues... So it seems that the rank is always lower by one. I have added a test function named estLanczosSolver2() to TestLanczosSolver.java (code below). To run it, you need first to comment the line: //nextVector.assign(new Scale(1 / scaleFactor)); in LanczosSolver.java, so it is easier to compare the results to Matlab, without the normalization. I further suggest to add an additional optional flag for avoiding normalization. The factorized matrix is: >> full(A) ans = 3.1200 -3.1212 -3.0000 -3.1110 1.5000 2.1212 -7.0000 -8.0000 -4.0000 The eigenvalues are; >> [a,b] =eig(full(A'*A)) a = 0.2132 -0.8010 -0.5593 -0.5785 0.3578 -0.7330 0.7873 0.4799 -0.3871 b = 0.0314 0 0 0 42.6176 0 0 0 131.2553 Now I run the unit test testLanczosSolver2 and I get: INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix. Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Eigenvector 0 found with eigenvalue 131.25526355941963 Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Eigenvector 1 found with eigenvalue 42.61761063477249 Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Eigenvector 2 found with eigenvalue 0.03137295830779152 Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info INFO: LanczosSolver finished. As you can see the eigenvalues are correct, but when I look at the eigenvectors I see that V0 (the row number zero eigenvector) = [ 0.5593 0.7330 0.387] , where in matlab we get -V0 (the third column of the matrix a above) @ Test public void testLanczosSolver2() throws Exception { int numRows = 3; int numCols = 3; int numColumns = 3; SparseRowMatrix m = new SparseRowMatrix(new int[] {numRows, numCols} ); /** 3.1200 -3.1212 -3.0000 -3.1110 1.5000 2.1212 -7.0000 -8.0000 -4.0000 */ m.set(0,0,3.12); m.set(0,1,-3.12121); m.set(0,2,-3); m.set(1,0,-3.111); m.set(1,1,1.5); m.set(1,2,2.12122); m.set(2,0,-7); m.set(2,1,-8); m.set(2,2,-4); int rank = 4; Matrix eigens = new DenseMatrix(rank, numColumns); long time = timeLanczos(m, eigens, rank, false); assertTrue("Lanczos taking too long! Are you in the debugger? ", time < 10000); //assertOrthonormal(eigens); //assertEigen(eigens, m, 0.1, false); } Best, Danny Bickson
        Hide
        Jake Mannix added a comment -

        I haven't looked at this particular patch in ages, but let me try and check it out again, because I think this patch fixes a real problem and should get in, not archived.

        Show
        Jake Mannix added a comment - I haven't looked at this particular patch in ages, but let me try and check it out again, because I think this patch fixes a real problem and should get in, not archived.
        Hide
        Sean Owen added a comment -

        This one's also been on the shelf for about 4 months. Is it ready to go, or should it be archived?

        Show
        Sean Owen added a comment - This one's also been on the shelf for about 4 months. Is it ready to go, or should it be archived?
        Hide
        Sean Owen added a comment -

        Same question – commitable Jake? or needs more review and time and so for 0.5?

        Show
        Sean Owen added a comment - Same question – commitable Jake? or needs more review and time and so for 0.5?
        Hide
        Jake Mannix added a comment -

        Danny, thanks for looking into this so carefully, I think you definitely have described what's going on in this now.

        I'm going to try and look at this some more in the next few days and clean it up a bit, and add in some of your suggested unit tests.

        Show
        Jake Mannix added a comment - Danny, thanks for looking into this so carefully, I think you definitely have described what's going on in this now. I'm going to try and look at this some more in the next few days and clean it up a bit, and add in some of your suggested unit tests.
        Hide
        Danny Leshem added a comment -

        Okay, please ignore my previous patch.

        I think I finally understand what's going on with the code. There are two seemingly unrelated issues:

        1) The code returns one less eigenvector/eigenvalue than requested.

        LanczosSolver.java (157) returns one less eigenvector/eigenvalue than requested-

            for (int i = 0; i < basis.numRows() - 1; i++) {
        

        DistributedLanczosSolver.java (106) serializes one less eigenvector than requested-

            for(int i=0; i<eigenVectors.numRows() - 1; i++) {
        

        2) When asked for N eigenvectors, the code returns (serializes) the most important N-1 eigenvectors in descending order (meaning, important ones are serialized first - which, other than the "N-1" part, is excellent!), but associates them with the bottom N-1 out of the top N most important eigenvalues in ascending order (meaning, important ones are serialized last).

        In other words, the #1 important eigenvalue is not serialized at all, the #2 important eigenvalue is associated with the #N-1 important eigenvector, ... and finally the #N most important eigenvalue is associated with the #1 important eigenvector.

        Vector names changed drastically during the last few days (following Sean's patch for MAHOUT-379), so issue #2 is less obvious now. Previously the eigenvalues where serialized with the vector. Now they are only mentioned in log prints.

        Here's some code to verify these claims. The code creates a 1000-rows 100-columns matrix. For each row, the first element is +1/-1 with equal probabilities, and the rest are +0.001/-0.001 with equal probabilities. The Elements are uncorrelated. So decomposing this matrix should reveal that the most important PC is the 100-dimensional vector (1, 0, 0, ... , 0), and it should be associated with a much higher eigenvalue than the rest. To test my claims, remove the "-1" from the two loops mentioned above, and run the following code:

        [add to TestDistributedRowMatrix.java]

          public static DistributedRowMatrix distributedMatrix(final Matrix matrix,
                                                               String baseTmpDir) throws IOException {
            baseTmpDir = TESTDATA + baseTmpDir;
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
        
            ClusteringTestUtils.writePointsToFile(new Iterable<VectorWritable>() {
              @Override
              public Iterator<VectorWritable> iterator() {
                final Iterator<MatrixSlice> it = matrix.iterator();
                return new Iterator<VectorWritable>() {
                  @Override
                  public boolean hasNext() { return it.hasNext(); }
                  @Override
                  public VectorWritable next() {
                    MatrixSlice slice = it.next();
                    return new VectorWritable(slice.vector());
                  }
                  @Override
                  public void remove() { it.remove(); }
                };
              }
            }, true, baseTmpDir + "/distMatrix/part-00000", fs, conf);
        
            DistributedRowMatrix distMatrix = new DistributedRowMatrix(baseTmpDir + "/distMatrix",
                                                                       baseTmpDir + "/tmpOut",
                                                                       matrix.numRows(),
                                                                       matrix.numCols());
            distMatrix.configure(new JobConf(conf));
        
            return distMatrix;
          }
        

        [add to TestDistributedLanczosSolver.java, and run the test]

          private Random rand = new Random();
        
          private double[] randomRow(int numColumns) {
              final double[] values = new double[numColumns];
        
              // Only the first column's value is "important". Columns are uncorrelated.
              values[0] = (rand.nextBoolean() ? 1 : -1);
              for (int i = 1; i < numColumns; ++i) {
                  values[i] = (rand.nextBoolean() ? 0.001 : -0.001);
              }
        
              return values;
          }
        
          private Matrix randomMatrix(int numRows, int numColumns) {
              final Matrix matrix = new DenseMatrix(numRows, numColumns);
              for (int row = 0; row < numRows; ++row) {
                  matrix.set(row, randomRow(numColumns));
              }
              return matrix;
          }
        
          public void testDistributedLanczosSolverSanity() throws Exception {
            final File testData = new File("testdata");
            if (!testData.exists()) {
              testData.mkdir();
            }
        
            final Matrix matrix = randomMatrix(1000, 100);
            final DistributedRowMatrix distMatrix =
                    TestDistributedRowMatrix.distributedMatrix(matrix, "testdata");
            distMatrix.configure(new JobConf());
        
            final DistributedLanczosSolver solver = new DistributedLanczosSolver();
            final Matrix eigenVectors = new DenseMatrix(30, 100);
            final List<Double> eigenValues = new ArrayList<Double>();
            solver.solve(distMatrix, 30, eigenVectors, eigenValues, false);
        
            System.out.println("--- Eigenvalues ---");
            printDoubleList(eigenValues);
            
            System.out.println("--- Eigenvectors ---");
            printMatrix(eigenVectors);
          }
        
          private void printDoubleList(List<Double> values) {
            for (double value : values) {
                System.out.print(value + "\t");
            }
            System.out.println("");
          }
        
          private void printMatrix(Matrix matrix) {
              for (int row = 0; row < matrix.numRows(); ++row) {
                  for (int col = 0; col < matrix.numCols(); ++col) {
                      System.out.print(matrix.get(row, col) + "\t");
                  }
                  System.out.println("");
              }
          }
        

        The test does not fail the current code, but you can see what's wrong by its prints. To make this a real test, one can replace the printing with verifying that 1) the first eigenvalue is, say, at least 100 times bigger than the second; and 2) the returned eigenvectors in fact correspond to their eigenvalues. This would fail the current code.

        I considered submitting another patch, but I'm no longer sure about the best way to fix this. However, I am positive now that this is a real issue that needs to be fixed.

        Show
        Danny Leshem added a comment - Okay, please ignore my previous patch. I think I finally understand what's going on with the code. There are two seemingly unrelated issues: 1) The code returns one less eigenvector/eigenvalue than requested. LanczosSolver.java (157) returns one less eigenvector/eigenvalue than requested- for ( int i = 0; i < basis.numRows() - 1; i++) { DistributedLanczosSolver.java (106) serializes one less eigenvector than requested- for ( int i=0; i<eigenVectors.numRows() - 1; i++) { 2) When asked for N eigenvectors, the code returns (serializes) the most important N-1 eigenvectors in descending order (meaning, important ones are serialized first - which, other than the "N-1" part, is excellent!), but associates them with the bottom N-1 out of the top N most important eigenvalues in ascending order (meaning, important ones are serialized last). In other words, the #1 important eigenvalue is not serialized at all, the #2 important eigenvalue is associated with the #N-1 important eigenvector, ... and finally the #N most important eigenvalue is associated with the #1 important eigenvector. Vector names changed drastically during the last few days (following Sean's patch for MAHOUT-379 ), so issue #2 is less obvious now. Previously the eigenvalues where serialized with the vector. Now they are only mentioned in log prints. Here's some code to verify these claims. The code creates a 1000-rows 100-columns matrix. For each row, the first element is +1/-1 with equal probabilities, and the rest are +0.001/-0.001 with equal probabilities. The Elements are uncorrelated. So decomposing this matrix should reveal that the most important PC is the 100-dimensional vector (1, 0, 0, ... , 0), and it should be associated with a much higher eigenvalue than the rest. To test my claims, remove the "-1" from the two loops mentioned above, and run the following code: [add to TestDistributedRowMatrix.java] public static DistributedRowMatrix distributedMatrix( final Matrix matrix, String baseTmpDir) throws IOException { baseTmpDir = TESTDATA + baseTmpDir; Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); ClusteringTestUtils.writePointsToFile( new Iterable<VectorWritable>() { @Override public Iterator<VectorWritable> iterator() { final Iterator<MatrixSlice> it = matrix.iterator(); return new Iterator<VectorWritable>() { @Override public boolean hasNext() { return it.hasNext(); } @Override public VectorWritable next() { MatrixSlice slice = it.next(); return new VectorWritable(slice.vector()); } @Override public void remove() { it.remove(); } }; } }, true , baseTmpDir + "/distMatrix/part-00000" , fs, conf); DistributedRowMatrix distMatrix = new DistributedRowMatrix(baseTmpDir + "/distMatrix" , baseTmpDir + "/tmpOut" , matrix.numRows(), matrix.numCols()); distMatrix.configure( new JobConf(conf)); return distMatrix; } [add to TestDistributedLanczosSolver.java, and run the test] private Random rand = new Random(); private double [] randomRow( int numColumns) { final double [] values = new double [numColumns]; // Only the first column's value is "important" . Columns are uncorrelated. values[0] = (rand.nextBoolean() ? 1 : -1); for ( int i = 1; i < numColumns; ++i) { values[i] = (rand.nextBoolean() ? 0.001 : -0.001); } return values; } private Matrix randomMatrix( int numRows, int numColumns) { final Matrix matrix = new DenseMatrix(numRows, numColumns); for ( int row = 0; row < numRows; ++row) { matrix.set(row, randomRow(numColumns)); } return matrix; } public void testDistributedLanczosSolverSanity() throws Exception { final File testData = new File( "testdata" ); if (!testData.exists()) { testData.mkdir(); } final Matrix matrix = randomMatrix(1000, 100); final DistributedRowMatrix distMatrix = TestDistributedRowMatrix.distributedMatrix(matrix, "testdata" ); distMatrix.configure( new JobConf()); final DistributedLanczosSolver solver = new DistributedLanczosSolver(); final Matrix eigenVectors = new DenseMatrix(30, 100); final List< Double > eigenValues = new ArrayList< Double >(); solver.solve(distMatrix, 30, eigenVectors, eigenValues, false ); System .out.println( "--- Eigenvalues ---" ); printDoubleList(eigenValues); System .out.println( "--- Eigenvectors ---" ); printMatrix(eigenVectors); } private void printDoubleList(List< Double > values) { for ( double value : values) { System .out.print(value + "\t" ); } System .out.println(""); } private void printMatrix(Matrix matrix) { for ( int row = 0; row < matrix.numRows(); ++row) { for ( int col = 0; col < matrix.numCols(); ++col) { System .out.print(matrix.get(row, col) + "\t" ); } System .out.println(""); } } The test does not fail the current code, but you can see what's wrong by its prints. To make this a real test, one can replace the printing with verifying that 1) the first eigenvalue is, say, at least 100 times bigger than the second; and 2) the returned eigenvectors in fact correspond to their eigenvalues. This would fail the current code. I considered submitting another patch, but I'm no longer sure about the best way to fix this. However, I am positive now that this is a real issue that needs to be fixed.
        Hide
        Danny Leshem added a comment - - edited

        Attached is a simple patch that fixed the two issues raised.

        The right way to do this is to introduce a new unit-test that fails with the current version (e.g. decompose a fixed matrix and verify all its known eigenvalues are found). The attached patch has no such code.

        All relevant unit-tests pass (I'm getting errors for a few org.apache.mahout.clustering tests, nothing related to this change though).

        Jake, I'm doing this mostly to practice sending Mahout patches. If your original code indeed implemented the intended logic, you might want to add a unit-test that fails my patch...

        Show
        Danny Leshem added a comment - - edited Attached is a simple patch that fixed the two issues raised. The right way to do this is to introduce a new unit-test that fails with the current version (e.g. decompose a fixed matrix and verify all its known eigenvalues are found). The attached patch has no such code. All relevant unit-tests pass (I'm getting errors for a few org.apache.mahout.clustering tests, nothing related to this change though). Jake, I'm doing this mostly to practice sending Mahout patches. If your original code indeed implemented the intended logic, you might want to add a unit-test that fails my patch...
        Hide
        Jake Mannix added a comment -

        Hold on that Sean, I made the loop like that for a reason. I need to check in again and verify if/where it's wrong, but it was not an oversight, it has to do with the way the Colt code does EigenDecomposition.

        Show
        Jake Mannix added a comment - Hold on that Sean, I made the loop like that for a reason. I need to check in again and verify if/where it's wrong, but it was not an oversight, it has to do with the way the Colt code does EigenDecomposition.
        Hide
        Sean Owen added a comment -

        If the patch amounts to making that loop into:

        for (int i = eigenVectors.numRows() - 1; i >= 0; i--) {

        Then I can commit that. But do the folks who understand this code agree it's the thing to do? seems so.

        Show
        Sean Owen added a comment - If the patch amounts to making that loop into: for (int i = eigenVectors.numRows() - 1; i >= 0; i--) { Then I can commit that. But do the folks who understand this code agree it's the thing to do? seems so.
        Hide
        Ted Dunning added a comment -

        Can you create a suggested patch?

        Show
        Ted Dunning added a comment - Can you create a suggested patch?

          People

          • Assignee:
            Jake Mannix
            Reporter:
            Danny Leshem
          • Votes:
            3 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development