Mahout / MAHOUT-308

Improve Lanczos to handle extremely large feature sets (without hashing)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.3
    • Fix Version/s: 0.5
    • Component/s: Math
    • Labels: None
    • Environment: all

      Description

      DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the driver (client) machine while Hadoop is iterating. The memory requirement for this is (desiredRank) * (numColumnsOfInput) * 8 bytes, which, for desiredRank of a few hundred, starts to cap out usefulness at a small multiple of millions of columns on most commodity hardware.
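
      For concreteness, a back-of-envelope sketch of that driver-side footprint, using the formula above (the rank and column count below are illustrative assumptions, not from a real job):

      // Illustrative only: driver-side memory for the in-memory Lanczos basis,
      // computed with the formula from the paragraph above.
      public class BasisMemoryEstimate {
        public static void main(String[] args) {
          long desiredRank = 300L;             // "a few hundred"
          long numColumnsOfInput = 10000000L;  // ten million columns
          long bytes = desiredRank * numColumnsOfInput * 8L;  // 8 bytes per double
          System.out.println((bytes >> 30) + " GiB of basis vectors on the driver");  // prints 22
        }
      }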

      The solution (without doing stochastic decomposition) is to persist the Lanczos basis to disk, except for the most recent two vectors. Some care must be taken with the "orthogonalizeAgainstBasis()" method call, which uses the entire basis; that step would be slower this way.
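
      A minimal sketch of what that streaming orthogonalization could look like, using the Mahout Vector API (the class and method names here are assumptions for illustration, not the eventual patch):

      import org.apache.mahout.math.Vector;

      // Sketch only: subtract the projection of v onto each persisted basis vector,
      // streaming the basis back one vector at a time so only a constant number of
      // dense vectors is ever resident on the driver. Assumes the basis vectors are
      // normalized, as Lanczos vectors are.
      public final class StreamingOrthogonalization {
        private StreamingOrthogonalization() {}

        public static Vector orthogonalizeAgainstBasis(Vector v, Iterable<Vector> persistedBasis) {
          Vector result = v;
          for (Vector basisVector : persistedBasis) {        // e.g. read back from SequenceFiles
            double alpha = result.dot(basisVector);          // projection coefficient
            result = result.minus(basisVector.times(alpha)); // remove that component
          }
          return result;
        }
      }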

      1. MAHOUT-308.patch (14 kB, Danny Leshem)

        Activity

        Jake Mannix added a comment -

        Of course, individual mappers also have to be able to fit dense soon-to-be-eigenvectors in memory, which becomes problematic as numCols grows too high; and even without this constraint, this solution would only let scaling go from millions of columns to close to a billion columns.

        Modifying the M/R job to shard the column space is doable, but realistically, stochastic SVD is the solution for multi-billions of features.
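
        To put rough numbers on that mapper-side constraint (illustrative arithmetic only): a single dense vector of numCols doubles costs numCols * 8 bytes, so

        numCols = 10^7  ->  ~80 MB per dense vector (comfortable)
        numCols = 10^9  ->  ~8 GB per dense vector (already beyond a typical mapper heap)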

        Danny Leshem added a comment - edited

        Currently a blocker for me, with numColumnsOfInput ~ 10M and desiredRank ~ 100.
        Will be more than happy to review and test any submitted patch!

        Probably also a blocker for Mahout's record-breaking attempt, if you ever manage to find a big enough dataset.

        Danny Leshem added a comment -

        Following email correspondence with Jake, attached is a suggested patch to solve this issue.

        The general idea was to define a new VectorIterableWriter that allows sequentially writing vectors to some underlying storage and constructing a VectorIterable over them when done. Currently implemented are RowMatrixWriter, which uses a given matrix as storage, and DistributedRowMatrixWriter, which uses a DistributedRowMatrix as storage. The algorithm was then modified to use a VectorIterableWriter for temporary storage and for its output, instead of a huge in-memory matrix.
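
        Roughly, the writer abstraction described above could have the following shape (the signatures here are assumptions for illustration; the attached patch is authoritative):

        import java.io.IOException;

        import org.apache.mahout.math.Vector;
        import org.apache.mahout.math.VectorIterable;

        // Hypothetical shape of the abstraction; see MAHOUT-308.patch for the real one.
        public interface VectorIterableWriter {

          // Append one Lanczos basis vector to the backing storage
          // (an in-memory Matrix for RowMatrixWriter, a DistributedRowMatrix for
          // DistributedRowMatrixWriter).
          void write(Vector vector) throws IOException;

          // Expose everything written so far as a VectorIterable, for the later
          // passes (orthogonalization, eigenvector projection) and for the output.
          VectorIterable getVectorIterable() throws IOException;
        }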

        The patch also partially fixes MAHOUT-369: the returned eigenvalues should now correspond to the eigenvectors. Still, one less of each is returned (see TODO in code - removing the "-1" fails the unit-tests... didn't look into it).

        Two issues:
        1) Existing unit-tests pass. However, as commented in MAHOUT-369, unit-tests for this package are far from complete. Unfortunately, my usual datasets were rendered unusable by recent changes in Mahout vector serialization, and I haven't the time to generate fictitious ones...

        2) With this patch, the memory issue should be a thing of the past. However, with extremely large datasets a new computational issue may surface: iterating over a large disk-based dataset 'desiredRank' times (see the loop right below the TODO). This could be worked around by rewriting this code as an MR job, but that is outside the scope of this patch.

        Jake, I'd appreciate any input you may have. It would also be very reassuring if you find the time to run some tests on real data you may have. And of course, your "seal of approval" if you think it does the trick. I might have some more time to work on it this Sunday (GMT+2), so any input till then would be greatly appreciated.

        Jake Mannix added a comment -

        Just a quick comment (still settling in GMT+1, actually!): I think this approach is the right one, and cleaner than my off-the-cuff approach, API-wise. (I need to run it, of course! Will try that in the next few days).

        The inefficiency of the desiredRank loops over the basis set is something that can be solved in a further patch, yes.

        Danny Leshem added a comment -

        Jake, did you get a chance to verify the patch?

        Sean Owen added a comment -

        Is this patch still fresh and committable? Looks like it's sort of pending review by Jake. I know this part has been in flux. Worth punting to 0.5?

        Sean Owen added a comment -

        This one's also going stale. Jake, do you have thoughts on this? I imagine the patch needs to be updated again, but it's worth discussing whether it's something you'd like to commit before making that effort.

        Sean Owen added a comment -

        Going once, going twice: is this committable? Or will anyone be able to get it ready within the next month or so?

        Sean Owen added a comment -

        Looks like this timed out too... it would at least need to be completely redone against the current code. But do any other recent related changes actually address this? It seems like some other work has been going on in this code.

        Nathan Halko added a comment -

        Not sure if this is the right place for this but I have been searching elsewhere without success.

        Setup: finding 150 singular values of a 1e6 x 1e6 matrix, super sparse (2 GB on disk).

        I'm getting Java heap errors using Lanczos SVD in SNAPSHOT-0.6. The way I interpret the code, specifying a --workingDir uses HdfsBackedLanczosState, which stores each basis vector in HDFS (and I can see that they live there). When the vectors are needed (orthogonalization and projecting the eigenvectors), they seem to be read from disk one by one, with only a few dense vectors in memory at any one time (the current vector, basis vector i, and an accumulation vector). This should have very light memory requirements and hammer the network; however, I'm not seeing this behavior.

        Is this a known issue, a memory leak or something? Is there something behind the scenes that keeps these vectors in memory? I can't come to grips with the error below.

        Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.Object.clone(Native Method)
        at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
        at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
        at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:99)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:151)
        at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
        at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
        at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:200)
        at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:123)
        at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:283)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:289)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
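
        For rough sizing around the numbers above (illustrative arithmetic only, not a diagnosis of the trace):

        one dense vector of 1e6 doubles      ≈ 1e6 * 8 bytes ≈ 8 MB
        150 basis vectors, if all resident   ≈ 150 * 8 MB ≈ 1.2 GB
        a handful of resident vectors (as HdfsBackedLanczosState intends) ≈ tens of MB

        So an OutOfMemoryError at these sizes suggests either a small client-side -Xmx or vectors being retained somewhere, as suspected above.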

        pansophism added a comment -

        2 * rank * numCols * 8 bytes
        Is this equation still applicable, or do we have a reduced version?


          People

          • Assignee: Jake Mannix
          • Reporter: Jake Mannix
          • Votes: 2
          • Watchers: 5
