Mahout / MAHOUT-369

Issues with DistributedLanczosSolver output

    Details

      Description

      DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.

          log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
      

      However, a few lines later (line 106) we have

          for(int i=0; i<eigenVectors.numRows() - 1; i++) {
              ...
          }
      

      which only persists eigenVectors.numRows()-1 vectors.

      It seems that the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted. Is this an off-by-one bug?
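
The mismatch between the logged count and the loop bound can be reproduced without Mahout. The following is a minimal sketch, not the actual DistributedLanczosSolver code: a plain `double[][]` stands in for the eigenvector matrix, and the class `OffByOneSketch` and its `persist` helper are hypothetical names introduced only for illustration. It assumes the omission is unintended; if the last row is skipped on purpose, the log message is the part that needs correcting instead.

```java
import java.util.ArrayList;
import java.util.List;

public class OffByOneSketch {
    // Hypothetical stand-in for the persistence loop: collects rows
    // 0 .. upperBound-1 of the matrix, mimicking "write each row out".
    public static List<double[]> persist(double[][] eigenVectors, int upperBound) {
        List<double[]> written = new ArrayList<>();
        for (int i = 0; i < upperBound; i++) {
            written.add(eigenVectors[i]);
        }
        return written;
    }

    public static void main(String[] args) {
        // Four rows, as if eigenVectors.numRows() returned 4.
        double[][] eigenVectors = new double[4][3];
        int numRows = eigenVectors.length;

        // Bound quoted in the issue: numRows() - 1
        int withReportedBound = persist(eigenVectors, numRows - 1).size();
        // Bound that would agree with the log message: numRows()
        int withFullBound = persist(eigenVectors, numRows).size();

        System.out.println("Logged count:              " + numRows);
        System.out.println("Rows written (bound - 1):  " + withReportedBound);
        System.out.println("Rows written (full bound): " + withFullBound);
    }
}
```

With the bound quoted in the issue, the number of rows written is always one less than the number announced in the log line.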

      Also, I think it would be better to persist the eigenvectors in reverse order, so that the most significant vector is marked "0", the 2nd most significant is marked "1", etc.

      This is for two reasons:
      1) When performing another PCA on the same corpus (say, with more principal components), corresponding eigenvalues can be easily matched and compared.
      2) It makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.
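
The proposed labelling can be sketched as follows. This is an illustration under stated assumptions, not the change made in either attached patch: eigenvalues are held in a plain `double[]`, the values are made up, and `DescendingEigenOrderSketch` and `descendingOrder` are hypothetical names. The helper returns the row indices ordered by descending eigenvalue, so that position 0 refers to the most significant eigenvector.

```java
import java.util.Arrays;
import java.util.Comparator;

public class DescendingEigenOrderSketch {
    // Returns the row indices sorted so that the index of the largest
    // eigenvalue comes first. Position in the result is the proposed label.
    public static Integer[] descendingOrder(double[] eigenValues) {
        Integer[] order = new Integer[eigenValues.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order,
            Comparator.comparingDouble((Integer i) -> eigenValues[i]).reversed());
        return order;
    }

    public static void main(String[] args) {
        // Made-up eigenvalues, stored least significant first.
        double[] eigenValues = {0.1, 2.5, 7.0};
        Integer[] order = descendingOrder(eigenValues);

        // Label 0 is the most significant vector, label 1 the next, etc.
        for (int label = 0; label < order.length; label++) {
            System.out.println("label " + label
                + " <- original row " + order[label]
                + " (eigenvalue " + eigenValues[order[label]] + ")");
        }
    }
}
```

With this ordering, keeping the top k components means keeping labels 0 to k-1, and a given label refers to the same rank of component regardless of how many components a later run requests.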

      1. MAHOUT-369.diff
        23 kB
        Jake Mannix
      2. ASF.LICENSE.NOT.GRANTED--MAHOUT-369.patch
        2 kB
        Danny Leshem

        Activity

        Danny Leshem created issue -
        Danny Leshem made changes -
        Field Original Value New Value
        Description (text identical to the Description above; the only change is the typo "garabage" corrected to "garbage")
        Danny Leshem made changes -
        Fix Version/s 0.4 [ 12314396 ]
        Danny Leshem made changes -
        Attachment MAHOUT-369.patch [ 12441827 ]
        Jake Mannix made changes -
        Assignee Jake Mannix [ jake.mannix ]
        Sean Owen made changes -
        Fix Version/s 0.5 [ 12315255 ]
        Fix Version/s 0.4 [ 12314396 ]
        Jake Mannix made changes -
        Attachment MAHOUT-369.diff [ 12475343 ]
        Jake Mannix made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Jake Mannix made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Resolution: Fixed [ 1 ]
        Sean Owen made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]

          People

          • Assignee:
            Jake Mannix
            Reporter:
            Danny Leshem
          • Votes:
            3
            Watchers:
            4
