Details
Description
Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation.
Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvdlsi/wiki/Poweriterationsscratchpad .
Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right).

 MAHOUT796.patch
 161 kB
 Dmitriy Lyubimov

 MAHOUT7962.patch
 163 kB
 Dmitriy Lyubimov

 MAHOUT7963.patch
 175 kB
 Dmitriy Lyubimov

 MAHOUT7964.patch
 175 kB
 Dmitriy Lyubimov

 Power Iterations.pdf
 256 kB
 Dmitriy Lyubimov
Issue Links
 depends upon

MAHOUT790 Redundancy in Matrix API, view or get?
 Closed
Activity
Integrated in MahoutQuality #1017 (See https://builds.apache.org/job/MahoutQuality/1017/)
MAHOUT796: SSVD power iterations; performance patches.
dlyubimov : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1164806
Files :
 /mahout/trunk/core/src/main/java/org/apache/mahout/common/IOUtils.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BBtJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/GivensThinSolver.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SparseRowBlockAccumulator.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SparseRowBlockWritable.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SplitPartitionedWritable.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UpperTriangular.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/GivensThinSolver.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/GrammSchmidt.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/QRFirstStep.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/QRLastStep.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java
 /mahout/trunk/src/conf/ssvd.props
also changed CLI a little bit:
 made A block height optional with default value of 10,000 which should be fine with most inputs on 64mb splits, ~200 eigen values and Xmx500m in child processes.
 added oh outer product sparse rowwise block cardinality used by 'big' multiplications Q'A and AB' with default value of 10,000. It may need increases with very sparse inputs in order to be more efficient for spill sorts (mapside combiners), but it would be always equally efficent in reduceside sorts which was so far the longest running stuff in all there is (blocked multiplications are still most expensive; but they are closing in on QR expenses which does not use sorts).
oops. test fix.
rewrote AB' and Bt jobs to do rowwise sparse blocking of outer product outputs from mapper and combiner.
In 1G tests looks like a major improvment, about order of magnitude speed up for multiplication part and 4 times speedup of a combo QR+multiplication (multiplication is still the bigger part).
I think this is the final iteration on this issue, i will commit within couple of days.
I am now convinced that all the inefficiency and cpubound behavior comes from the multiplications Q'B, AB'. Indeed, I generate outer products which i then output row by row and sum up in combiners and reducers. this creates tremendous pressure in terms of keys for spill and reduce sorts. Although it seems DRM.times(DRM) employs exactly the same problem.
I think tremendous improvements would result if we output outer products of B' or AB' in vertical thin but long blocks.
This seem like a trivial idea but I was preoccupied by 'proof of concept' issues whereas matrix multiplication has been seen as algorithmically trivial.
the command line i used:
bin/mahout ssvd input /tmp/DRM/* output /tmp/SSVD1 tempDir /tmp/SSVDtmp k 10 p 3 q 1 r 10000 reduceTasks 3 Dmapred.child.java.opts='Xmx500m'
For ABt job you need perhaps more than double the memory of the split in case of dense input because ABtJob is catering to sparse inputs first and loads A block columnwise using SequentialAccessSparseVectors. So for a split size of 64Mb standard Xmx200m may not be enough, and it is definitely not enough for 128mb splits.
note also that I ran on CDH3u0 without any Mahout recompilation with CDH3 binaries and it seems to work out of the box.
Fixed CLI issues.
Tested distributed MR on 1Gb input with multiple reducers q=1 ok.
Patch attached.
I will commit after another round of MR in distributed mode. Hopefully will be done by Tue.
Patch. Local solver tests pass. I also tested multiple splits on sufficiently larger inputs and sparse inputs with local MR.
I still need to test with yet bigger file with multiple reducers since local MR does not support multiple reducers.
What i noticed is that with just one additional power iteration with orthogonalization there's practically no need to run any oversampling (p). So yes power iteration runs more steps but runtime can be reduced significantly just because you don't need as wide projection anymore. small values are pretty good without much oversampling.
Amazing.
Wow those are great results Dmitriy! q definitely adds power and reliability to the algorithm.
Ok first implementation with QR solvers is ready, added q parameter. (all in git remote git@github.com:dlyubimov/mahoutcommits branch MAHOUT796)
Did not have time to test distributed version and larger inputs. on a toy input 1000x2000, k=3, p=10 (optimal 10,4,1,(0.1) ):
q=0:
SSVD solver singular values:
svs: 9.998472 3.993542 0.990456 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000
q=1: (+2 more sequential steps):
SSVD solver singular values:
svs: 10.000000 4.000000 0.999999 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000
So, much better (although much slower as well).
I of course understand that each run exhibit noise, so to prove it works better consistently i need to run more than just 2 attempts. But that's encouraging. it worked (and actually at first attempt)!
I tried some optimization to handle sparse cases a little better as well, i guess it taxes densier cases a little bit.
So this will be put on hold until i add Cholesky option and then i will have to return to this issue to enable the same schema but Y'Y+ Cholesky path.
I refactored QR steps into standalone OutputCollector implementations so that they can now be more easily be pipelined inside mappers and reducers so code is much more readable now.
So after a few tests and final fixes i think it is a commit worthy but it has dependency on Ted's refactoring MAHOUT790 pushed to trunk.
AFAIK distributed cache would actually do the same except it would also store the file on disk.
The disadvantage here is that we add disk i/o time to this. The advantage is that if we hit the same node with a mapper of the same task more than once, as far as i understand, they'd have the entire B' locally. That's an interesting idea, actually. But for big clusters where a job is unlikely to hit the same node with more than 1 task, this probably would actually be detrimental. Plus, if B is really big (somethin like 100Gb big) then we are requiring a lot of hdd from a node.
Plus for jobs that use memory mapping or any sort of random access, distributed cache is the only option – but we don't need that.
Ok, let me make implementation that opens a stream first, just to prove/measure whatever we are improving, and later perhaps there's a good sense to add an option to use distributed cache for this. Maybe there will be another trick we don't see to streamline this, but i so far did not find any. So it will give us some time to think.
At this point it seems that the best strategy is just to preload entire A block into memory as a (sparse) matrix and open B' stream as a side file and hope it is not going to generate too much flood i/o. I don't know a workaround for it anyway since whatever blocking scheme is used, we need cartesian products from both matrix inputs and that will cause i/o and i don't think there's any clever collocation trick to be had there
Presumably there could be a role for the distributed cache here to make the I/O load more manageable.
This is just the sort of thing that the MapR ability to control placement, mirroring and to read via NFS comes in really, really handy. Can't really assume that for Mahout, though.
AB' is a heavy multiplication of course.
I don't want to use standard multiplication because
– i want to be doing more things in reducer
– need custom grouping/sorting to ensure output is partitioned the same way as A splits
At this point it seems that the best strategy is just to preload entire A block into memory as a (sparse) matrix and open B' stream as a side file and hope it is not going to generate too much flood i/o. I don't know a workaround for it anyway since whatever blocking scheme is used, we need cartesian products from both matrix inputs and that will cause i/o and i don't think there's any clever collocation trick to be had there
linking to MAHOUT790 as I have merged my branch on top of that ongoing work
PS i removed result comparison to Colt's SVD since it takes too much time to compute 20mln matrix for its incore full rank SVD algorithm.
I redid dense test to construct 20,000x1,000 dense matrix (20 mln nonzero elements) with random singular vectors
The way i construct input is i generate random singular vector matrices and orthogonalize them using stable GrammSchmidt, multiply one of them (whatever is shorter) by Sigma and then produce rowwise surrogate input.
For predefined singular values = 10,4,1,(0.1...), n=1000, m=20000, k=3, p=10 i get stochastic values
SSVD solver singular values:
svs: 9.998401 3.998322 0.972622 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000
so you see if the decay is good then precision loss with 1 pass in my case doesn't exceed 2.8% in the worst case (3rd value) and the time is quite good. (same brunch as for mahout797 in my github).
Keep in mind that this precision loss also includes loss generated during simulated input construction, it's not all the solver's.
I also got rid of BBt job and fixed problems with sparse input on that branch.
And finally, i can't see a reason why we can't incorporate "Cholesky trick" either by substituting Y_1..Y_q = AB' instead of A\Omega to compute B_i.
In other words, MR operationally aside, if we assert that we have some function such that currently provides B_0=g(Y_0) where Y_0=A\Omega, then there's no reason to assume we can't use the same function g to compute B_i=g(Y_i) for as long as Y_i=AB'_
{i1}. also see my scratchpad for the same.
PS it also just occurred to me that full B' does not have to be ever written out either because BB' can be accumulated in reducers of 2nd step. Then front end will just have to aggregate few triangular partial B' products produced by however many reducers, directly (we don't save full BB' since it's symmetrical, nor do we compute full outer products of BB' there). That saves on full Bt I/O and avoids startup costs of BB' job.
Thus, full job is 2 MR passes with q=0, 4MR passes with q=1 and 6MR passes with q=2. If understand Nathan's point right, 3 orthogonalizations (which corresponds to q=2) is quite enough.
V and U jobs are optional and running in parallel, so they can count for another iteration.
so with q=2 (maximum case) and U,V output requested we end up with 7 sequential MR iterations.
Hi,
I put together B_i pipeline https://github.com/dlyubimov/ssvdlsi/wiki/Poweriterationsscratchpad. It seems it is a pretty straightforward enhancement that falls back on a lot of existing stuff, with fundamental additions of AB' multiplication and QR pushdown to reducer of the first job (instead of doing it in the mapper of the first job)
A quick and dirty trick to avoid even sweeping through A again is to neglect the cross terms in the product (AA')^qA\Omega and just use (A_iA_i')^qA_i\Omega.
I think that even if that's less flops, it is still more difficult to implement than the full power iterations with reorthogonalization as you've initially proposed.
After all, IMO there's no big reason to be afraid of more work for as long as it brings more precision and we have a control over how much more work we want to do.
I also can incorporate a Cholesky trick into B_0 pipeline at some point – or just have it as an alternative flow controlled by a job parameter.
The reorthogonalizations aren't essential and if its a barrier to power iterations we should forego them at the moment. A quick and dirty trick to avoid even sweeping through A again is to neglect the cross terms in the product (AA')^qA\Omega and just use (A_iA_i')^qA_i\Omega. This could be extremely naive but I've been getting some good results with it. The accuracy typically falls about half way between single pass and full power iterations so it could be useful (although it could be dangerous as well).
sig_51 < optimal
60.6531
full power iters 1
81.2668
full power iters 4
67.5545
rowwise power iters 1
89.4983
rowwise power iters 4
82.2247
single pass
92.8736
norm A
100.0000
The 'rowwise power iters' being (A_iA_i')^qA_i\Omega.
To clarify a bit about my worries, the issue that I see is that we could compute A\Omega in a single pass of A because we could effectively store all of \Omega in memory (i.e. just keep the seed and hash function). This is nice because rowwise decomposition of A makes everything work nicely.
In computing (AB') A outofcore, however, we have a product of A by a tall skinny matrix B that requires roughly as much storage as A. Each rowwise patch of A will have to be combined with each rowwise chunk of B' (i.e. each columnwise chunk of B). This means that we have to read B many times which leads to quadratic time.
One problem here is that the Q's are large and potentially dense. Thus, accumulating them is not a great idea. That can be worked around in the single iteration because we can keep R in memory and can reconstruct chunks of Y given chunks of A.
That trick becomes a bit more involved if we want to keep all of the Q's in such an implicit form. Computing (AA')^q A\Omega is relatively simple as you point out if A' is available as a linear operator, but I thought I understood that reorthogonalizing was a good idea. What I don't see is how to reorthogonalize without keeping very large matrices in memory or doing a dangerously dense operation. Yet.
I checked Dmitriy's scheme today and it makes sense. It accumulates Q' using the machinery already in place, QJob and BtJob
QR = Y = A\Omega
B0 = Q'A
B1 = (AB0')'A = B0A'A = (Q'AA')A = (Q_new)'A
A'A should never be computed, only Z = A'AY where Y is dense and X=AY, Z=A'X avoiding the problem of scarce overlap and fill in.
For the inmemory implementations, I think that this is a nonissue. Power iteration should simply be implemented. In that case, the original form using Y = (A'A)^q A \Omega seems fine and I don't yet quite see how the iteration that Dmitriy proposes will get the right result. Whichever method is used, it is a good thing to do.
The problems that I see are for the outofcore problems. There, computing A'A can often give pathologically bad results if the sparse pattern is highly skewed. That approach also leads to significant fillin which is not a good thing. On the hand, multiplying A times anything too large to store in memory such as B typically is may be horribly bad as well.
The orthogonalization is no big deal since it requires only a single pass through the data to accumulate the small matrix required for the Cholesky trick.
We did some work with facial recognition, computing 'eigenfaces' and reported in the paper. In this case there is only 2 orders of magnitude between the signal and the 'noise'. It shows a dramatic difference between the accuracy of one pass versus just one power iteration. Note that after one power iteration, there is now 6 orders of magnitude separating signal and noise.
But this is only looking at approximation error AUU*A. It could very well be the case in recommendation applications that this measure is not appropriate, I don't know. But it is a very valuable option to have at one's disposal just in case.
Orthogonalization, i wouldn't say it has been a bottleneck for me personally. I do orthogonalization in two MR steps, but those steps are also doing 2 multiplications. I am still looking to run an experiment with an input where it would be a practical problem (such as QR step taking > 10..15 min). It also so happened that my previous results were tainted by the fact that i ran without native libraries installed so decompression was chomping on my CPU too, so no clean data now. But it seems indeed like roughly 50% of everything else.
but i am still waiting for runaterabyteinput opportunity.
So are you encouraging to proceed, benchmark, compare? get more info? do it anyway? or drop it without comparisons?
I will proceed with my latest family of computations without power iterations. Along the way, I will look for ways to do them efficiently. It took me months of looking at other things for the coin to drop on the fact that we don't have to keep the Q around for the QR decompositions, however, so it may be a bit before I figure out how to implement the power iteration really efficiently.
The lower the condition number (or low signal to noise) the harder it is to extract the top k singular vectors because in a sense they are not that much more important than the other nk. We see pollution from the smaller nk singular directions and that degrades our approximation of the top k space. Power iterations (just a few) are extremely important to amplify the gap between important directions and the unimportant directions. Instead of sampling matrix A, we sample matrix (AA*)^qA which has the same singular vectors but an exaggerated spectrum
sigma^
{2q+1}In infinite precision there would be no need to orthogonalize between iterations, only at the last step. However, in finite precision, the small singular values can fall below machine precision when taken to the 2q+1st power and we won't be able to accurately recover them. It also prevents overflow if your matrix has a very large sig_max. It is mostly a precaution to keep from loosing information and for most cases could probably be skipped or done only intermittently. If orthogonalization is a bottleneck we could consider not doing it.
For most recommendation applications, I would be surprised if iterations are important,
Ok I have no first hand knowledge myself, Nathan mentioned it did appear quite important in his previous work. I hope he could comment himself on a nature of the data he was looking at.
So are you encouraging to proceed, benchmark, compare? get more info? do it anyway? or drop it without comparisons?
Yes. That is what I mean by condition number.
The only place that I have seen slow decay is with synthetically generated random data. Even real random data has some interesting singular values.
if by low condition number you mean ratio of max to min singular value, then with slow decay i never got good results either. That's a sign of either highly noisy or ever randomly generated data. I think it would help if user came forward and explained his/her case, esp. given experimental nature of the code.
The outline that I have in MAHOUT792 does exactly this without the iteration and has the framework
available to do the power iteration efficiently, including the outofcore QR.
For most recommendation applications, I would be surprised if iterations are important, but at
least one user of the current system was surprised when their test matrix with a very low condition
number didn't get good results with the current algorithm. It would be nice to be able to say
"turn the knob to get better results".
current code is essentially equivalent to running this mod with q=0.
Resolving.
i will add code doing the same with Cholesky trick as a part of
MAHOUT797once it's implemented.