Mahout / MAHOUT-180

port Hadoop-ified Lanczos SVD implementation from decomposer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.3
    • Component/s: Math
    • Labels:
      None

      Description

      I wrote up a hadoop version of the Lanczos algorithm for performing SVD on sparse matrices available at http://decomposer.googlecode.com/, which is Apache-licensed, and I'm willing to donate it. I'll have to port over the implementation to use Mahout vectors, or else add in these vectors as well.

      Current issues with the decomposer implementation include the following: if your matrix is really big, you need to re-normalize before decomposition. Find the largest eigenvalue first, and divide all your rows by that value, then decompose; otherwise you'll blow over Double.MAX_VALUE once you've run too many iterations (the L^2 norm of intermediate vectors grows roughly as (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on the lower end is better than blowing over MAX_VALUE). When this is ported to Mahout, we should add the capability to do this automatically: run a couple of iterations to find the largest eigenvalue, save that, then iterate while scaling vectors by 1/max_eigenvalue.
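The overflow issue described above can be sketched numerically (plain Java with an illustrative lambdaMax; this is not decomposer or Mahout code): each application of the matrix multiplies the intermediate norm by roughly the largest eigenvalue, so without scaling the norm overflows after enough iterations, while pre-scaling by 1/lambdaMax keeps it bounded.

```java
// Illustrative sketch of norm growth under repeated matrix application,
// with and without scaling by the (hypothetical) largest eigenvalue.
public class ScalingSketch {
    public static void main(String[] args) {
        double lambdaMax = 50.0;   // hypothetical largest eigenvalue
        int iterations = 200;

        // Unscaled: norm grows roughly like lambdaMax^k and overflows.
        double unscaledNorm = 1.0;
        for (int k = 0; k < iterations; k++) {
            unscaledNorm *= lambdaMax;
        }
        System.out.println(Double.isInfinite(unscaledNorm)); // true: blew over Double.MAX_VALUE

        // Scaled: divide each application by lambdaMax; the norm stays bounded.
        double scaledNorm = 1.0;
        for (int k = 0; k < iterations; k++) {
            scaledNorm *= lambdaMax / lambdaMax;
        }
        System.out.println(scaledNorm); // 1.0
    }
}
```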

      1. MAHOUT-180.patch
        66 kB
        Jake Mannix
      2. MAHOUT-180.patch
        63 kB
        Jake Mannix
      3. MAHOUT-180.patch
        48 kB
        Jake Mannix
      4. MAHOUT-180.patch
        66 kB
        Jake Mannix
      5. MAHOUT-180.patch
        52 kB
        Jake Mannix

        Issue Links

          Activity

          Isabel Drost-Fromm added a comment -

          That sounds great! Thank you for offering to donate the code. If you need any help porting the code or any other support, we are happy to help.

          You may also want to have a look at http://incubator.apache.org/ip-clearance/index.html that explains the legal steps for donating large code donations.

          Prasen Mukherjee added a comment -

          This will be awesome. BTW, any references/articles on your approach would be of great help. I too am interested in parallelizing SVD (I am sure there are many folks like me who would be willing to contribute to this).

          Jake Mannix added a comment -

          Hey Prasen, the approach to doing parallelized Lanczos is parallelized multiplication of (the square of) your input matrix by vectors. My decomposer project has a DistributedMatrix class which is backed by a matrix whose rows live in HDFS, and Lanczos iteration then just leaves your matrix where it is (so no additional data transfer on the order of the size of your matrix) and sends one vector at a time out to it to do parallelized matrix-by-vector multiplication.

          Then, when you've gone far enough, you do a non-parallel SVD on a very small matrix which lives in memory, and you've got one half of your set of singular vector pairs (which is often all you need).

          I'm not sure where the best write up on this approach is - once you decide on doing Lanczos, the parallelization procedure is straightforward (in the words of far-too-many CS professors: "At this point, it's only a matter of 'engineering'" - heh). The main not-completely-straightforward trick I used here is avoiding squaring the input matrix anywhere (which is needed in some form if you've got a non-square or non-symmetric matrix), and staying in completely sparse row-matrix form.
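The "leave the matrix in place, avoid squaring it" trick can be sketched in a few lines (plain arrays as stand-ins for Mahout vectors; the names are illustrative, not the project's API): compute y = (A^T A) v as A^T (A v), touching each row of A exactly once, which is the shape of one distributed pass.

```java
// Sketch: implicit multiplication by the square of A without ever forming A^T A.
public class TimesSquaredSketch {
    // One pass over the rows of A: for each row r, accumulate (r . v) * r into y.
    static double[] timesSquared(double[][] rowsOfA, double[] v) {
        double[] y = new double[v.length];
        for (double[] row : rowsOfA) {
            double dot = 0.0;
            for (int i = 0; i < v.length; i++) {
                dot += row[i] * v[i];            // "map" step: row . v
            }
            for (int i = 0; i < v.length; i++) {
                y[i] += dot * row[i];            // "reduce" step: sum of (row . v) * row
            }
        }
        return y;                                 // y = A^T A v; A^T A was never built
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[] v = {1, 1};
        double[] y = timesSquared(a, v);
        // A v = (3, 7); A^T (3, 7) = (1*3 + 3*7, 2*3 + 4*7) = (24, 34)
        System.out.println(y[0] + " " + y[1]);    // 24.0 34.0
    }
}
```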

          Currently I'm a bit blocked on providing a patch for this, given that the underlying linear algebra primitives in Mahout are probably going to change (to commons-math vectors and matrices most likely), in which case my port of decomposer will need to get completely refactored, and just doing the refactor once makes a lot more sense.

          Sean Owen added a comment -

          Is this unblocked now that much of the Math stuff has been updated?

          jakes old (non-committer) account added a comment -

          This is waiting on my MAHOUT-205 and MAHOUT-206 patches to go in, but is otherwise unblocked.

          Jake Mannix added a comment -

          This patch contains much more (both the online/sequential Generalized Hebbian Algorithm SVD, as well as a simple Lanczos), but also much less (it does not tie in to serialization/persistence of any sort, let alone an HDFS-backed form of this), than this JIRA ticket discusses.

          It is completely self-contained, but the unit test (singular, ack!) needs too much memory to get committed currently (as it is in some ways a performance test, in other ways a functionality test, and in yet other ways a comparison test, between Lanczos and Hebbian). Maybe if I comment out the unit test so it doesn't run until it's pared down a bit, this will be committable as a "work in progress".

          This really is the "meat" of decomposer.

          Jake Mannix added a comment -

          Hmm... further work on unit testing shows that Lanczos is doing great, but the stream-oriented (Hebbian) version has some issues with orthogonalization which need to be worked out. Another patch will be forthcoming.

          Jake Mannix added a comment -

          Jeepers, for performance, I had switched from using SparseRowMatrix to DenseMatrix in a few places, and suddenly failed to keep orthogonality. Why? Because of this ugliness I thought was long since fixed, in DenseMatrix:

            @Override
            public Vector getRow(int row) {
              if (row < 0 || row >= rowSize()) {
                throw new IndexException();
              }
              return new DenseVector(values[row]);
            }
          

          The lovely bug here? This is a full deep copy of the row, not a shallow view which allows you to mutate the original matrix! Arrrrrggggg! I swear there was already a bug filed and fixed regarding this. It's easy to do, for this method (make a "shallow" constructor for DenseVector, and use it here). The right fix also takes care of getColumn (which requires a little more work, but not much).
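The copy-vs-view distinction can be shown in miniature (illustrative code, not the actual DenseMatrix/DenseVector classes): a cloned row silently drops the caller's writes, while a view over the same backing array lets them reach the matrix.

```java
// Sketch of deep copy vs. shallow view over a matrix's row storage.
public class RowViewSketch {
    static double[][] values = {{1, 2}, {3, 4}};

    // Buggy behavior: deep copy -- mutating the result leaves the matrix unchanged.
    static double[] getRowCopy(int row) {
        return values[row].clone();
    }

    // Intended behavior: shallow view -- the caller sees the matrix's own storage.
    static double[] getRowView(int row) {
        return values[row];
    }

    public static void main(String[] args) {
        getRowCopy(0)[0] = 99.0;
        System.out.println(values[0][0]);  // 1.0 -- the write was silently lost

        getRowView(0)[0] = 99.0;
        System.out.println(values[0][0]);  // 99.0 -- the mutation reached the matrix
    }
}
```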

          Jake Mannix added a comment -

          I have nobody to blame but myself. MAHOUT-211 was filed by me, assigned to me, but not fixed by me.

          Jake Mannix added a comment -

          This patch actually works, has disentangled unit tests (one for Lanczos, one for Hebbian, and a base junit test for them to share), and scaled down test parameters to run without a gazillion bytes of RAM available.

          Also includes a 90% solution for MAHOUT-211.

          If I made the patch correctly, that is.

          Jake Mannix added a comment -

          Committed initial (non-hadoop dependent) pieces of this as of r901718.

          Sean Owen added a comment -

          Yeah this is done right?

          jakes old (non-committer) account added a comment -

          We can call it done, and open new tickets related to hadoopification, bringing in decomposer-contrib-hadoop stuff, etc. Sure.

          Jake Mannix added a comment -

          Reopening to finish what is described in the title: putting in the hadoop-version of the LanczosSolver. Patch will be forthcoming.

          Jake Mannix added a comment -

          Ok, I've got a basic unit test for the DistributedLanczosSolver, and I've finally got it running on a real (1-node) cluster against a DocumentVectorized 20newsgroups corpus. It's not crashing, but it's taking a while: 30 seconds per pass over the corpus, which means it's not going to finish anytime soon. So I'm going to bed now (or do I get up for breakfast? hmm...), and I'll see how it's going and possibly post my current patch in [a few] hours.

          Jake Mannix added a comment -

          Ok, ugly, dirty patch which needs to be cleaned up, but it does work, in some circumstances, for some inputs (on my cluster). cough

          This patch makes some extensions of the DocumentVectorizer as well. Let's say you already have a SequenceFile<Text,Text> of your corpus (living at text_path); then you can produce some good output by doing:

          $HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i text_path -o corpus_as_vectors_path -seq true -w tfidf -chunk 1000 --minSupport 1 --minDF 5 --maxDFPercent 50 --norm 2

          Now I've got some SequentialAccessSparseVectors in corpus_as_vectors_path, tfidf-weighted, L2-normalized, stripping out terms which occur in more than half of the documents, etc. Now for the fun: you need to know the dimension of the vectors you spit out (you can find it by guessing and getting it wrong, in which case a slightly more helpful CardinalityException will be spit out in the logs/console, or you can get it from the corpus_as_vectors entries themselves). If the value you find is numFeatures, then try this hadoop job:

          $HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver -i corpus_as_vectors_path -o corpus_svd_path -nr 1 -nc <numFeatures> --rank 100

          This will zip along, making 100 passes over your data, then doing a decomposition of a nice and small (100x100) matrix in memory, and producing a SequenceFile<IntWritable,VectorWritable> (where the values are DenseVectors of dimension numFeatures - so numFeatures should not be huge!), where the "name" of the vectors contains a string which is not actually the eigenvalue, but is proportional to it - I'm still working on that part.

          There's also a unit test (which currently takes about a minute on my laptop) - DistributedLanczosSolverTest, which validates accuracy.

          TODO: cleanup, stuff mentioned above, a job which validates correctness explicitly after the fact, and some utilities for taking the eigenvectors and doing useful stuff with them.

          NOTE: Lanczos spits out desiredRank - 1 orthogonal vectors which are pretty close to being eigenvectors of the square of your matrix (i.e. they are right-singular vectors of the input corpus), but they span the spectrum: the first few are the ones with the highest singular values, and the last few are the ones with the lowest singular values. If you really want, e.g., the highest 100 singular vectors, ask Lanczos for 300 as the rank and then keep only the top 100; this will give you 100 "of the largest" singular vectors, but with no guarantee that you don't miss part of that top of the spectrum. For most cases this isn't a worry, but you should keep it in mind.

          Jake Mannix added a comment -

          Adds an EigenVerificationJob, which takes just as long as the DistributedLanczosJob, but at the end of the day prunes away superfluous eigenvectors and those with eigenvalue below a chosen threshold, and saves them back to hdfs, and you get nice output like this:

          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|0| = |2905105.325066675|, err = 8.231193504570911E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|2| = |921522.2179480934|, err = 3.312905505481467E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|5| = |422593.76267148677|, err = 9.690026558928366E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|7| = |326205.19577841857|, err = 8.280043317654417E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|8| = |284990.59331463446|, err = 2.398081733190338E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|11| = |253500.28860628756|, err = 0.03797261160467913 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|12| = |253433.24512060767|, err = 1.4885870314174099E-12 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|15| = |222354.15336025952|, err = 7.081002451059248E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|17| = |215156.97325760772|, err = 1.2456702336294256E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|18| = |200592.7782982773|, err = 3.72257780156815E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|19| = |191270.06867188454|, err = 3.610445276081009E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|20| = |168446.0868356986|, err = 1.2598810883446276E-12 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|21| = |166320.23361954943|, err = 1.0635936575909E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|22| = |158882.90261344844|, err = 2.708944180085382E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|25| = |134577.41793521316|, err = 1.7830181775480014E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|26| = |124021.7093344012|, err = 1.773026170326375E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|27| = |121824.37156673158|, err = 1.3680168109431179E-12 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|28| = |119613.86741751489|, err = 1.1823875212257917E-12 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|29| = |119104.75278971005|, err = 1.2567724638756772E-13 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|30| = |100623.36519880772|, err = 1.155742168634788E-12 to verifiedOutput/largestCleanEigens
          10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|31| = |88661.27936320615|, err = 1.0856870957809406E-12 to verifiedOutput/largestCleanEigens

          The indexes skip (e|0|, e|2|, e|5|, e|7|, ... ) because the superfluous ones were pruned (error too high, and not orthogonal either), and the survivors are in descending eigenvalue order (this is on the 20news dataset).
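The pruning rule being applied here can be sketched as a simple filter (illustrative names and thresholds, not the EigenVerificationJob API): keep an eigenpair only if its verification error is small enough and its eigenvalue large enough.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of eigenpair pruning: filter (eigenvalue, error) pairs by
// a maximum error and a minimum eigenvalue threshold.
public class PruneSketch {
    // Each pair p is {eigenvalue, verificationError}.
    static List<double[]> prune(double[][] pairs, double maxError, double minEigenvalue) {
        List<double[]> kept = new ArrayList<>();
        for (double[] p : pairs) {
            if (p[1] < maxError && p[0] > minEigenvalue) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[][] pairs = {
            {2905105.3, 8.2e-13},   // kept: tiny error, large eigenvalue
            {253500.3,  0.038},     // pruned: error too high (like e|11| above)
            {42.0,      1.0e-13},   // pruned: eigenvalue below threshold
        };
        System.out.println(prune(pairs, 1.0e-6, 1000.0).size()); // 1
    }
}
```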

          Next to try it on wikipedia, and then some cleanup and this should be ready for public consumption.

          Edward J. Yoon added a comment -

          Hi, Quick question.

          Does it work using M/R iterations?

          Jake Mannix added a comment -

          Yes. Multiplication of a matrix (or the square of a matrix) by a vector is the primary operation of Lanczos, and that is done in a M/R iteration. If you want the top-k singular vectors, you make k passes over the data.

          Once stochastic decomposition is added later, it will be O(1) passes over the data, but they will be slightly heavier duty passes.

          Sean Owen added a comment -

          This is our last open issue for 0.3. How's it looking?

          Jake Mannix added a comment -

          I need to regenerate the patch. It currently works, but it's got some "ugliness" in the code I'm trying to clean up (some semi-hardcoded things, a kludgey api for eigenvectors, inconsistent method names). I'll do some more tests (slightly bigger doc set) tonight and if it's still doing well and passing tests we can try to get it in there. We can iterate on refactoring later.

          Jake Mannix added a comment -

          Ok, new patch. Contains, in addition to the woefully unfinished DistributedRowMatrix (only timesSquared() is implemented really, for the purposes of Lanczos / SVD), DistributedLanczosSolver, for doing raw Lanczos, and EigenVerificationJob, which checks to see what the cosAngle errors on the purported eigenvectors are, what their eigenvalues are, and whether they're all orthonormal.

          It then removes "bad" eigenvector/value pairs (based on a user-specified error margin), and also trims out any which have eigenvalue below a user-specified minimum eigenvalue threshold, then saves them back to HDFS, with a totally hacky "decorated" vector which has the eigenvalue and error baked into the name of the vector (that part should be redone less horribly).

          So usage is like so:

          $HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver --input path/to/vector-sequence-file --output outpath --numRows 0 (currently ignored, not needed) --numCols 12345 --rank 100

          Sean Owen added a comment -

          It's looking good to me, from a cursory visual inspection. Since you've been working on it a while, and have tests, I'm sure it's substantially correct, and that seems about good enough to me for 0.x. I'd say commit.

          Jake Mannix added a comment -

          Committed revision 912134.

          Wiki on usage forthcoming.

          Danny Leshem added a comment - - edited

          While testing the new code, I encountered the following issue:

          ...
          10/02/23 18:11:17 INFO lanczos.LanczosSolver: LanczosSolver finished.
          10/02/23 18:11:17 ERROR decomposer.EigenVerificationJob: Unexpected --input while processing Options
          Usage:
          [--eigenInput <eigenInput> --corpusInput <corpusInput> --help --output
          <output> --inMemory <inMemory> --maxError <maxError> --minEigenvalue
          <minEigenvalue>]
          Options
          ...

          The problem seems to be in DistributedLanczosSolver.java [73]:
          EigenVerificationJob expects the parameters' names to be "eigenInput" and "corpusInput", but you're mistakenly passing them as "input" and "output".

          Other than this minor issue, the code seems to be working fine and indeed produces the right number of dense (eigen?) vectors.

          Jake Mannix added a comment -

          Hi Danny, thanks for trying this out!

          You have indeed found some testing code which snuck in - I was trying to add the EigenVerificationJob to the final step of Lanczos, to save people the trouble of having to "clean" their eigenvectors at the end of the day, but didn't finish and yet it got checked in.

          The clue in the code is that I still have a line:

           // TODO ack!
          

          Which should be a hint that I should not have checked that file in just yet.

          I've removed it now - svn up and try again!

          If you want to see what your eigen-spectrum is like, after you've run the DistributedLanczosSolver, the EigenVerificationJob can be run next (it cleans out eigenvectors with too high error or too low eigenvalue):

          $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-{version}.job org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob \
          --eigenInput path/for/svd-output --corpusInput path/to/corpus --output path/for/cleanOutput --maxError 0.1 --minEigenvalue 10.0 
          

          Thanks for the bug report!
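          In case it helps to see what --maxError and --minEigenvalue do conceptually, here is a rough sketch of the cleaning step (hypothetical EigenPair holder and method names, not Mahout's actual classes): keep only pairs whose cosAngle error is at most maxError and whose eigenvalue is at least minEigenvalue.

```java
import java.util.ArrayList;
import java.util.List;

public class EigenFilterSketch {

  static class EigenPair {
    final double eigenvalue;
    final double cosAngleError;
    EigenPair(double eigenvalue, double cosAngleError) {
      this.eigenvalue = eigenvalue;
      this.cosAngleError = cosAngleError;
    }
  }

  /** Drops pairs with too much error or too small an eigenvalue. */
  static List<EigenPair> clean(List<EigenPair> pairs, double maxError, double minEigenvalue) {
    List<EigenPair> kept = new ArrayList<>();
    for (EigenPair p : pairs) {
      if (p.cosAngleError <= maxError && p.eigenvalue >= minEigenvalue) {
        kept.add(p);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<EigenPair> pairs = new ArrayList<>();
    pairs.add(new EigenPair(120.0, 0.01));  // kept
    pairs.add(new EigenPair(120.0, 0.5));   // dropped: error > maxError
    pairs.add(new EigenPair(2.0, 0.01));    // dropped: eigenvalue < minEigenvalue
    System.out.println(clean(pairs, 0.1, 10.0).size());  // prints 1
  }
}
```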

          Dan Brickley added a comment -

          This looks great, but a little more documentation would really help those of us new to Mahout.

          (perhaps in https://cwiki.apache.org/MAHOUT/svd-singular-value-decomposition.html ?)

          I would jump in and help document but I'd like to be happy I'm understanding things correctly first. Right now, I'm not.

          Couple of trivial things that tripped me:

          • the example above uses 'hadoop -jar' but I found (using Hadoop 0.20.2+737) that I needed 'hadoop jar' (no hyphen).
          • the example has "--numRows 0 (currently ignored, not needed)"; is this still not needed? text output suggests it is used now

          Conceptually (from a broad-brush understanding of SVD) I was initially expecting 3 matrices back, not a single eigenvectors matrix; am happy to RTFM there and brush up on the linear algebra but some pointers would really help. Is it possible to get the decomposition into U, s and V?
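          A side note on the U, s, V question (this is standard linear algebra, not anything Mahout-specific): when Lanczos is run against A^T A, the eigenvectors it finds are the right singular vectors v_i of A, the singular values are sigma_i = sqrt(lambda_i), and the left singular vectors can be recovered as u_i = A v_i / sigma_i. A tiny plain-array sketch of that last recovery step, with hypothetical names:

```java
public class SvdFromEigenSketch {

  static double[] matVec(double[][] a, double[] v) {
    double[] r = new double[a.length];
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < v.length; j++) r[i] += a[i][j] * v[j];
    return r;
  }

  /** Recovers the left singular vector u_i = A v_i / sqrt(lambda_i). */
  static double[] leftSingularVector(double[][] a, double[] v, double lambda) {
    double sigma = Math.sqrt(lambda);
    double[] u = matVec(a, v);
    for (int i = 0; i < u.length; i++) u[i] /= sigma;
    return u;
  }

  public static void main(String[] args) {
    double[][] a = {{3, 0}, {0, 2}};          // A^T A = diag(9, 4)
    double[] u1 = leftSingularVector(a, new double[] {1, 0}, 9.0);
    System.out.println(u1[0] + " " + u1[1]);  // prints 1.0 0.0
  }
}
```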


            People

            • Assignee:
              Jake Mannix
            • Reporter:
              Jake Mannix