Mahout / MAHOUT-796

Modified power iterations in existing SSVD code

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: Math
    • Labels:

      Description

      Nathan Halko contacted me and pointed out the importance of making power iterations available, given their significant effect on the accuracy of the smaller eigenvalues and on noise attenuation.

      Essentially, we would like to introduce yet another job parameter, q, that governs the number of optional power iterations. The suggested modification of the algorithm is outlined here: https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .

      Note that it differs from the original power-iteration formula in the paper in that an additional orthogonalization is performed after each iteration. Nathan points out that this greatly reduces the error in the smaller eigenvalues (if I interpret it right).

      1. Power Iterations.pdf
        256 kB
        Dmitriy Lyubimov
      2. MAHOUT-796-4.patch
        175 kB
        Dmitriy Lyubimov
      3. MAHOUT-796-3.patch
        175 kB
        Dmitriy Lyubimov
      4. MAHOUT-796-2.patch
        163 kB
        Dmitriy Lyubimov
      5. MAHOUT-796.patch
        161 kB
        Dmitriy Lyubimov

        Issue Links

          Activity

          Dmitriy Lyubimov added a comment -

          The current code is essentially equivalent to running this mod with q=0.

          Ted Dunning added a comment -

          The outline that I have in MAHOUT-792 does exactly this without the iteration and has the framework
          available to do the power iteration efficiently, including the out-of-core QR.

          For most recommendation applications, I would be surprised if iterations are important, but at
          least one user of the current system was surprised when their test matrix with a very low condition
          number didn't get good results with the current algorithm. It would be nice to be able to say
          "turn the knob to get better results".

          Dmitriy Lyubimov added a comment -

          If by low condition number you mean the ratio of max to min singular value, then with slow decay I never got good results either. That's a sign of either highly noisy or even randomly generated data. I think it would help if the user came forward and explained his/her case, esp. given the experimental nature of the code.

          Ted Dunning added a comment -

          Yes. That is what I mean by condition number.

          The only place that I have seen slow decay is with synthetically generated random data. Even real random data has some interesting singular values.

          Dmitriy Lyubimov added a comment -

          For most recommendation applications, I would be surprised if iterations are important,

          OK, I have no first-hand knowledge myself; Nathan mentioned it did appear quite important in his previous work. I hope he can comment himself on the nature of the data he was looking at.

          So are you encouraging me to proceed, benchmark, and compare? Get more info? Do it anyway? Or drop it without comparisons?

          Nathan Halko added a comment -

          The lower the condition number (or the signal-to-noise ratio), the harder it is to extract the top k singular vectors, because in a sense they are not that much more important than the other n-k. We see pollution from the smaller n-k singular directions, and that degrades our approximation of the top-k space. Power iterations (just a few) are extremely important to amplify the gap between the important and the unimportant directions. Instead of sampling matrix A, we sample the matrix (AA*)^q A, which has the same singular vectors but an exaggerated spectrum sigma^{2q+1}.

          In infinite precision there would be no need to orthogonalize between iterations, only at the last step. However, in finite precision, the small singular values can fall below machine precision when taken to the (2q+1)st power, and we won't be able to accurately recover them. It also prevents overflow if your matrix has a very large sig_max. It is mostly a precaution to keep from losing information, and for most cases it could probably be skipped or done only intermittently. If orthogonalization is a bottleneck we could consider not doing it.
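          In NumPy terms, the re-orthogonalized power scheme described above might be sketched as follows (function name and defaults are mine, not Mahout's API; QR after every multiply plays the role of the intermediate orthogonalizations):

```python
import numpy as np

def ssvd_power(A, k, p=10, q=1, seed=0):
    """Randomized SVD range finder with q power iterations.
    Re-orthogonalizes (via QR) after every multiply so that small
    singular values, raised to the (2q+1)st power, do not fall
    below machine precision and get lost."""
    rng = np.random.default_rng(seed)
    Y = A @ rng.standard_normal((A.shape[1], k + p))  # Y = A * Omega
    Q, _ = np.linalg.qr(Y)
    for _ in range(q):
        Q, _ = np.linalg.qr(A.T @ Q)   # orthogonalize after the A' multiply
        Q, _ = np.linalg.qr(A @ Q)     # orthogonalize after the A multiply
    B = Q.T @ A                        # small (k+p) x n projection of A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k]
```

          With q=0 this reduces to the single-pass sampling the current code performs; each extra q sharpens the spectrum by a factor of sigma^2.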

          Ted Dunning added a comment -

          So are you encouraging to proceed, benchmark, compare? get more info? do it anyway? or drop it without comparisons?

          I will proceed with my latest family of computations without power iterations. Along the way, I will look for ways to do them efficiently. It took me months of looking at other things for the coin to drop on the fact that we don't have to keep the Q around for the QR decompositions, however, so it may be a bit before I figure out how to implement the power iteration really efficiently.

          Dmitriy Lyubimov added a comment -

          As for orthogonalization, I wouldn't say it has been a bottleneck for me personally. I do orthogonalization in two MR steps, but those steps also do two multiplications. I am still looking to run an experiment with an input where it would be a practical problem (such as the QR step taking more than 10-15 min). It also happened that my previous results were tainted by the fact that I ran without native libraries installed, so decompression was chomping on my CPU too; so no clean data for now. But it does indeed seem like roughly 50% of everything else.

          But I am still waiting for a run-a-terabyte-input opportunity.

          Nathan Halko added a comment -

          We did some work with facial recognition, computing 'eigenfaces', as reported in the paper. In this case there are only two orders of magnitude between the signal and the 'noise'. It shows a dramatic difference between the accuracy of one pass versus just one power iteration. Note that after one power iteration, there are now six orders of magnitude separating signal and noise.

          But this is only looking at approximation error ||A-UU*A||. It could very well be the case in recommendation applications that this measure is not appropriate, I don't know. But it is a very valuable option to have at one's disposal just in case.

          Ted Dunning added a comment -

          For the in-memory implementations, I think that this is a non-issue. Power iteration should simply be implemented. In that case, the original form using Y = (AA')^q A \Omega seems fine, and I don't yet quite see how the iteration that Dmitriy proposes will get the right result. Whichever method is used, it is a good thing to do.

          The problems that I see are for the out-of-core case. There, computing A'A can often give pathologically bad results if the sparsity pattern is highly skewed. That approach also leads to significant fill-in, which is not a good thing. On the other hand, multiplying A by anything too large to store in memory, as B typically is, may be horribly bad as well.

          The orthogonalization is no big deal since it requires only a single pass through the data to accumulate the small matrix required for the Cholesky trick.
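          The Cholesky trick mentioned here can be sketched like this: one sweep over the rows of Y accumulates the small Gram matrix Y'Y, and its Cholesky factor then orthogonalizes Y (a sketch under that reading; Mahout's actual job structure differs):

```python
import numpy as np

def cholesky_orthogonalize(Y):
    """Orthogonalize a tall-skinny Y in a single pass over its rows:
    G = Y'Y is tiny ((k+p) x (k+p)) and can be summed row block by
    row block; with G = L L', Q = Y L^{-T} has orthonormal columns
    and Y = Q L', i.e. L' plays the role of R in Y = QR."""
    G = Y.T @ Y                       # the small accumulated matrix
    L = np.linalg.cholesky(G)
    Q = np.linalg.solve(L, Y.T).T     # Q = Y * L^{-T}
    return Q, L.T
```

          The point is that only G ever needs to be held and aggregated centrally; the column scaling by L^{-T} can be re-applied to row chunks of Y independently.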

          Nathan Halko added a comment - edited

          I checked Dmitriy's scheme today and it makes sense. It accumulates Q' using the machinery already in place, QJob and BtJob:

          QR = Y = A\Omega
          B0 = Q'A
          B1 = (AB0')'A = B0A'A = (Q'AA')A = (Q_new)'A

          A'A should never be computed explicitly, only Z = A'AY where Y is dense: X = AY, then Z = A'X, avoiding the problems of sparse overlap and fill-in.
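          The identity behind this scheme (feeding Y_1 = A B_0' back through the same QR-then-B machinery is exactly a power step A A' Q) can be checked numerically with a small sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 40))
Omega = rng.standard_normal((40, 13))

Q, _ = np.linalg.qr(A @ Omega)   # QR of Y0 = A * Omega (the QJob step)
B0 = Q.T @ A                     # B0 = Q'A (the BtJob step)
Y1 = A @ B0.T                    # next input to the same QR machinery ...
# ... and Y1 is exactly the power-iteration product A A' Q:
assert np.allclose(Y1, A @ A.T @ Q)
Q1, _ = np.linalg.qr(Y1)
B1 = Q1.T @ A                    # B1 = (Q_new)'A
```

          Note that A'A never appears; each step is one multiply by A and one by A' against a skinny dense matrix.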

          Ted Dunning added a comment -

          One problem here is that the Q's are large and potentially dense. Thus, accumulating them is not a great idea. That can be worked around in the single iteration because we can keep R in memory and can reconstruct chunks of Y given chunks of A.

          That trick becomes a bit more involved if we want to keep all of the Q's in such an implicit form. Computing (AA')^q A\Omega is relatively simple as you point out if A' is available as a linear operator, but I thought I understood that reorthogonalizing was a good idea. What I don't see is how to re-orthogonalize without keeping very large matrices in memory or doing a dangerously dense operation. Yet.

          Ted Dunning added a comment -

          To clarify a bit about my worries, the issue that I see is that we could compute A\Omega in a single pass of A because we could effectively store all of \Omega in memory (i.e. just keep the seed and hash function). This is nice because row-wise decomposition of A makes everything work nicely.

          In computing (AB') A out-of-core, however, we have a product of A by a tall skinny matrix B that requires roughly as much storage as A. Each row-wise patch of A will have to be combined with each row-wise chunk of B' (i.e. each column-wise chunk of B). This means that we have to read B many times which leads to quadratic time.

          Nathan Halko added a comment -

          The reorthogonalizations aren't essential, and if it's a barrier to power iterations we should forgo them for the moment. A quick and dirty trick to avoid even sweeping through A again is to neglect the cross terms in the product (AA')^qA\Omega and just use (A_iA_i')^qA_i\Omega. This could be extremely naive, but I've been getting some good results with it. The accuracy typically falls about halfway between a single pass and full power iterations, so it could be useful (although it could be dangerous as well).

          sig_51 <- optimal        60.6531
          full power iters 1       81.2668
          full power iters 4       67.5545
          row-wise power iters 1   89.4983
          row-wise power iters 4   82.2247
          single pass              92.8736
          norm A                  100.0000

          The 'row-wise power iters' being (A_iA_i')^qA_i\Omega.
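          A sketch of that block-local variant (names are mine): each row block A_i of A is powered independently, with the cross-block terms dropped, so no second sweep over A is needed:

```python
import numpy as np

def rowwise_power_sample(row_blocks, Omega, q):
    """Blockwise sampling Y_i = (A_i A_i')^q A_i Omega: every row
    block gets its own power iterations, neglecting cross-block
    terms of the full product (A A')^q A Omega.  Stacking the Y_i
    gives the sample matrix for the subsequent QR."""
    Ys = []
    for Ai in row_blocks:
        Yi = Ai @ Omega
        for _ in range(q):
            Yi = Ai @ (Ai.T @ Yi)   # apply A_i A_i' once per iteration
        Ys.append(Yi)
    return np.vstack(Ys)
```

          With a single block this coincides with the full power product; with many blocks it is the approximation whose accuracy the numbers above place between the single pass and full power iterations.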

          Dmitriy Lyubimov added a comment -

          Hi,

          I put together the B_i pipeline: https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad . It seems to be a pretty straightforward enhancement that falls back on a lot of existing stuff, with the fundamental additions of the AB' multiplication and pushing QR down to the reducer of the first job (instead of doing it in the mapper of the first job).

          A quick and dirty trick to avoid even sweeping through A again is to neglect the cross terms in the product (AA')^qA\Omega and just use (A_iA_i')^qA_i\Omega.

          I think that even if that's fewer flops, it is still more difficult to implement than the full power iterations with reorthogonalization as you initially proposed.

          After all, IMO there's no big reason to be afraid of more work as long as it brings more precision and we have control over how much more work we want to do.

          I also can incorporate a Cholesky trick into B_0 pipeline at some point – or just have it as an alternative flow controlled by a job parameter.

          Dmitriy Lyubimov added a comment - edited

          PS It also just occurred to me that the full B' never has to be written out either, because BB' can be accumulated in the reducers of the 2nd step. The front end then just has to aggregate the few triangular partial BB' products produced by however many reducers directly (we don't save the full BB' since it's symmetric, nor do we compute the full outer products of BB' there). That saves on the full Bt I/O and avoids the startup cost of a BB' job.

          Thus, the full job is 2 MR passes with q=0, 4 MR passes with q=1, and 6 MR passes with q=2. If I understand Nathan's point right, 3 orthogonalizations (which corresponds to q=2) is quite enough.

          The V and U jobs are optional and run in parallel, so they can count for another iteration.

          So with q=2 (the maximum case) and U,V output requested, we end up with 7 sequential MR iterations.

          Dmitriy Lyubimov added a comment - edited

          And finally, I can't see a reason why we can't incorporate the "Cholesky trick" either, by substituting Y_1..Y_q = AB' instead of A\Omega to compute B_i.

          In other words, MR operational concerns aside, if we assert that we have some function g that currently provides B_0 = g(Y_0) where Y_0 = A\Omega, then there's no reason to assume we can't use the same function g to compute B_i = g(Y_i) as long as Y_i = AB'_{i-1}. Also see my scratchpad for the same.

          Dmitriy Lyubimov added a comment -

          I re-did the dense test to construct a 20,000x1,000 dense matrix (20 mln nonzero elements) with random singular vectors.

          The way I construct the input is: I generate random singular-vector matrices and orthogonalize them using stabilized Gram-Schmidt, multiply one of them (whichever is shorter) by Sigma, and then produce the row-wise surrogate input.
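          That construction can be sketched like this (a simplified version; QR stands in here for the stabilized Gram-Schmidt, and the actual test writes row-wise MapReduce input):

```python
import numpy as np

def surrogate_input(m, n, sigmas, seed=0):
    """Build an m x n test matrix with prescribed singular values:
    generate random singular-vector matrices, orthonormalize them
    (QR in place of stabilized Gram-Schmidt), scale one side by
    Sigma, and multiply."""
    r = len(sigmas)
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))   # left singular vectors
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))   # right singular vectors
    return (U * np.asarray(sigmas)) @ V.T              # U diag(sigma) V'
```

          This makes it possible to compare the solver's stochastic singular values directly against the predefined ones, as in the run below.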

          For predefined singular values = 10,4,1,(0.1...), n=1000, m=20000, k=3, p=10 i get stochastic values
          --SSVD solver singular values:
          svs: 9.998401 3.998322 0.972622 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000

          So you can see that if the decay is good, the precision loss with 1 pass in my case doesn't exceed 2.8% in the worst case (3rd value), and the time is quite good (same branch as for MAHOUT-797 in my github).

          Keep in mind that this precision loss also includes loss generated during simulated input construction, it's not all the solver's.

          I also got rid of BBt job and fixed problems with sparse input on that branch.

          Dmitriy Lyubimov added a comment -

          PS I removed the result comparison to Colt's SVD since it takes too much time to compute the 20-mln-element matrix with its in-core full-rank SVD algorithm.

          Dmitriy Lyubimov added a comment -

          linking to MAHOUT-790 as I have merged my branch on top of that ongoing work

          Dmitriy Lyubimov added a comment -

          AB' is a heavy multiplication of course.

          I don't want to use the standard multiplication because:

          – I want to do more things in the reducer
          – I need custom grouping/sorting to ensure the output is partitioned the same way as the A splits

          At this point it seems that the best strategy is just to preload an entire A block into memory as a (sparse) matrix, open the B' stream as a side file, and hope it is not going to generate too much flood i/o. I don't know a workaround for it anyway: whatever blocking scheme is used, we need Cartesian products of both matrix inputs, and that will cause i/o; I don't think there's any clever collocation trick to be had there.
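          As a sketch of that strategy (names are mine, not the job's): with a row block of A held in memory, (A_block)B' can be accumulated from a stream of B' rows as rank-1 outer products, so B' never needs to fit in memory:

```python
import numpy as np

def block_times_bt(a_block, bt_row_stream):
    """Compute a_block @ B' given an in-memory row block of A and a
    stream of (j, row j of B') pairs, one (k+p)-vector per column
    index of A.  Each streamed row contributes the rank-1 outer
    product a_block[:, j] * bt_j to the accumulating result."""
    y = None
    for j, bt_j in bt_row_stream:
        if y is None:
            y = np.zeros((a_block.shape[0], len(bt_j)))
        y += np.outer(a_block[:, j], bt_j)
    return y
```

          The stream is read once per A block, which is the "flood i/o" concern: with many blocks, B' gets re-read once per block.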

          Ted Dunning added a comment -

          At this point it seems that the best strategy is just to preload entire A block into memory as a (sparse) matrix and open B' stream as a side file and hope it is not going to generate too much flood i/o. I don't know a workaround for it anyway since whatever blocking scheme is used, we need cartesian products from both matrix inputs and that will cause i/o and i don't think there's any clever collocation trick to be had there

          Presumably there could be a role for the distributed cache here to make the I/O load more manageable.

          This is just the sort of thing where MapR's ability to control placement and mirroring, and to read via NFS, comes in really, really handy. Can't really assume that for Mahout, though.

          Dmitriy Lyubimov added a comment -

          AFAIK distributed cache would actually do the same except it would also store the file on disk.

          The disadvantage here is that we add disk i/o time. The advantage is that if we hit the same node with a mapper of the same task more than once, as far as I understand, it would have the entire B' locally. That's an interesting idea, actually. But for big clusters, where a job is unlikely to hit the same node with more than one task, this would probably be detrimental. Plus, if B is really big (something like 100 GB big) then we are requiring a lot of disk from a node.

          Plus for jobs that use memory mapping or any sort of random access, distributed cache is the only option – but we don't need that.

          OK, let me make an implementation that opens a stream first, just to prove/measure whatever we are improving; later perhaps it will make sense to add an option to use the distributed cache for this. Maybe there is another trick we don't yet see to streamline this, but I so far have not found any. So it will give us some time to think.

          Dmitriy Lyubimov added a comment -

          Ok first implementation with QR solvers is ready, added -q parameter. (all in git remote git@github.com:dlyubimov/mahout-commits branch MAHOUT-796)

          I did not have time to test the distributed version or larger inputs. On a toy 1000x2000 input with k=3, p=10 (true singular values 10, 4, 1, and 0.1 for the rest):

          q=0:
          --SSVD solver singular values:
          svs: 9.998472 3.993542 0.990456 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000

          q=1: (+2 more sequential steps):
          --SSVD solver singular values:
          svs: 10.000000 4.000000 0.999999 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000

          So, much better (although much slower as well).
          I understand, of course, that each run exhibits noise, so to prove it works better consistently I need to run more than just two attempts. But that's encouraging: it worked, and actually at the first attempt!

          I also tried some optimization to handle sparse cases a little better; I guess it taxes denser cases a little bit.

          So this will be put on hold until I add the Cholesky option; then I will return to this issue to enable the same scheme via the Y'Y + Cholesky path.

          I refactored the QR steps into standalone OutputCollector implementations so that they can be more easily pipelined inside mappers and reducers; the code is much more readable now.

          So after a few tests and final fixes I think it is commit-worthy, but it depends on Ted's MAHOUT-790 refactoring being pushed to trunk.

          Nathan Halko added a comment -

          Wow those are great results Dmitriy! q definitely adds power and reliability to the algorithm.

          Dmitriy Lyubimov added a comment -

          Patch. Local solver tests pass. I also tested multiple splits on sufficiently large inputs, and sparse inputs, with local MR.

          I still need to test with yet bigger file with multiple reducers since local MR does not support multiple reducers.

          What I noticed is that with just one additional power iteration with orthogonalization there is practically no need for any oversampling (p). So yes, power iteration runs more steps, but runtime can be reduced significantly because you no longer need as wide a projection. Small singular values are pretty good without much oversampling.

          Amazing.

          Dmitriy Lyubimov added a comment -

          Patch attached.
          I will commit after another round of MR in distributed mode. Hopefully will be done by Tue.

          Dmitriy Lyubimov added a comment -

          Fixed CLI issues.

          Tested distributed MR on a 1GB input with multiple reducers and q=1: OK.

          Dmitriy Lyubimov added a comment - edited

          I am now convinced that all the inefficiency and CPU-bound behavior comes from the multiplications Q'A and AB'. Indeed, I generate outer products which I then output row by row and sum up in combiners and reducers; this creates tremendous pressure in terms of keys for spill and reduce sorts. It seems DRM.times(DRM) suffers from exactly the same problem.

          I think tremendous improvements would result if we output the outer products of B' or AB' in thin but long vertical blocks.

          This seems like a trivial idea, but I was preoccupied with 'proof of concept' issues, whereas matrix multiplication was seen as algorithmically trivial.

          The command line I used:

          bin/mahout ssvd --input /tmp/DRM/* --output /tmp/SSVD1 --tempDir /tmp/SSVD-tmp -k 10 -p 3 -q 1 -r 10000 --reduceTasks 3 -Dmapred.child.java.opts='-Xmx500m'

          For the ABt job you may need more than double the memory of the split in the case of dense input, because ABtJob caters to sparse inputs first and loads the A block column-wise using SequentialAccessSparseVectors. So for a split size of 64MB the standard -Xmx200m may not be enough, and it is definitely not enough for 128MB splits.

          Note also that I ran on CDH3u0, without any Mahout recompilation against the CDH3 binaries, and it seems to work out of the box.

          Dmitriy Lyubimov added a comment -

          Rewrote the AB' and Bt jobs to do row-wise sparse blocking of the outer-product outputs from the mapper and combiner.

          In 1GB tests this looks like a major improvement: about an order-of-magnitude speedup for the multiplication part, and a 4x speedup for the combined QR + multiplication (multiplication is still the bigger part).

          I think this is the final iteration on this issue; I will commit within a couple of days.

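          The row-wise blocking idea can be illustrated with a small sketch: instead of emitting every row of each outer product q_i a_i' as its own keyed record (kp keys per input record), rows are accumulated into fixed-height blocks keyed by block id, so the shuffle sorts far fewer keys. This is an illustrative Python sketch of the technique, not the Mahout SparseRowBlockWritable code, and the sizes and BLOCK_HEIGHT value are made up.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
m, n, kp = 16, 8, 13        # kp = k + p rows in B = Q'A
BLOCK_HEIGHT = 4            # analogous to a block-height parameter (made up here)
Q = rng.standard_normal((m, kp))
A = rng.standard_normal((m, n))

# Accumulate rows of the outer products q_i a_i' into fixed-height blocks
# keyed by blockId = row // BLOCK_HEIGHT, emitting one value per block
# instead of one per row.
blocks = defaultdict(lambda: np.zeros((BLOCK_HEIGHT, n)))
for i in range(m):                      # one "record" (row of A and of Q) at a time
    outer = np.outer(Q[i], A[i])        # this record's contribution to B = Q'A
    for row in range(kp):
        blocks[row // BLOCK_HEIGHT][row % BLOCK_HEIGHT] += outer[row]

# "Reduce" side: stack the blocks back into B and verify against Q'A.
B = np.vstack([blocks[b] for b in sorted(blocks)])[:kp]
assert np.allclose(B, Q.T @ A)
print("keys emitted:", len(blocks), "blocked vs", kp, "row-wise")
```

The sums here stand in for what combiners and reducers would do; the point is that the number of sort keys drops by roughly a factor of the block height.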
          Dmitriy Lyubimov added a comment -

          oops. test fix.

          Dmitriy Lyubimov added a comment -

          Also changed the CLI a little bit:

          • made the A block height optional, with a default value of 10,000, which should be fine for most inputs with 64MB splits, ~200 eigenvalues, and -Xmx500m in child processes.
          • added -oh, the outer-product sparse row-wise block cardinality used by the 'big' multiplications Q'A and AB', with a default value of 10,000. It may need to be increased for very sparse inputs in order to make spill sorts (map-side combiners) more efficient, but it is always equally efficient in reduce-side sorts, which have so far been the longest-running part (blocked multiplications are still the most expensive, but they are closing in on the QR costs, which do not use sorts).
          Hudson added a comment -

          Integrated in Mahout-Quality #1017 (See https://builds.apache.org/job/Mahout-Quality/1017/)
          MAHOUT-796: SSVD power iterations; performance patches.

          dlyubimov : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1164806
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/common/IOUtils.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BBtJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/GivensThinSolver.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SparseRowBlockAccumulator.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SparseRowBlockWritable.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SplitPartitionedWritable.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UpperTriangular.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/GivensThinSolver.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/GrammSchmidt.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/QRFirstStep.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/QRLastStep.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java
          • /mahout/trunk/src/conf/ssvd.props
          Dmitriy Lyubimov added a comment -

          Resolving.

          I will add code doing the same with the Cholesky trick as part of MAHOUT-797 once it is implemented.

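          For context, the "Cholesky trick" refers to computing the R factor of a thin QR from the small Gram matrix Y'Y rather than orthogonalizing the tall Y directly; the Gram matrix is cheap to sum up across mappers. The following is a hedged single-machine NumPy sketch of that idea (not the MAHOUT-797 code; sizes are made up).

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((1000, 13))   # tall-and-thin, like Y = A * Omega

# The Gram matrix Y'Y is only 13x13, so it can be accumulated cheaply
# across mappers in a single summation pass.
G = Y.T @ Y
L = np.linalg.cholesky(G)             # G = L L', L lower-triangular
R = L.T                               # upper-triangular R with Y = Q R
Q = np.linalg.solve(R.T, Y.T).T       # Q = Y R^{-1} via a triangular solve

assert np.allclose(Q.T @ Q, np.eye(13), atol=1e-8)
assert np.allclose(Q @ R, Y)
```

Note the usual caveat: Cholesky QR squares the condition number of Y, so it is only safe when Y is reasonably well conditioned, which is typical for the random-projection Y here.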

            People

            • Assignee:
              Dmitriy Lyubimov
              Reporter:
              Dmitriy Lyubimov