Details

Type: New Feature

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: None

Fix Version/s: 0.6

Component/s: None

Labels:None
Description
I have figured out some simplification for our SSVD algorithms. This eliminates the QR decomposition and makes life easier.
I will produce a patch that contains the following:
 a CholeskyDecomposition implementation that does pivoting (and thus rankrevealing) or not. This should actually be useful for solution of large outofcore least squares problems.
 an inmemory SSVD implementation that should work for matrices up to about 1/3 of available memory.
 an outofcore SSVD threaded implementation that should work for very large matrices. It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

 sd2.pdf
 142 kB
 Ted Dunning

 MAHOUT792.patch
 231 kB
 Ted Dunning

 MAHOUT792.patch
 211 kB
 Ted Dunning
Issue Links
 is blocked by

MAHOUT790 Redundancy in Matrix API, view or get?
 Closed
 is depended upon by

MAHOUT797 MapReduce SSVD: provide alternative Bpipeline per B=R' ^{1} Y'A
 Closed
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
This presumes MAHOUT790 and MAHOUT793 and includes them in this patch.
Github branch MAHOUT792 available at git://github.com/tdunning/mahout.git
That provides more details on successive changes.
Updated with small fixes. The tests are enough better now that I am beginning to believe that this might be correct.
Github has been updated as well.
a It looks like it affects a lot of not very relevant stuff? (GraidentMachine, AbstractClassifier files)
b Ted, are you going to implement MR version? do you have a parallelization strategy?
Thanks.
On (a), the patch includes MAHOUT790 since it depends on it. When I commit that JIRA, it will be easier to read.
On (b), see the paper. Yes.
This looks like a very promising shortcut for MR operationally.
AFAICT, you suggest to compute Y'Y and then derive R^1 out of it and use it in another pass to derive B. Y'Y is indeed operationally much easier to accumulate and produce with MR in the same projection step.
One thing that i don't understand is how you are proposing to derive this R^1. To do Cholesky on Y'Y, take transpose and inverse and declare it R^1?
more thoughts:
1 – it looks like this method is compressing a lot of info into k+p uppertriangular R. I guess there's more potential for rounding errors there. (I am not buying too much into rounding errors though).
2 – Argument of Q matrix being large is not very convincing to me. Q is m x (k+p), that is, assuming A is much wider (in terms of nonzero element average #) which it should be in order for projection method to make sense at all, it is a fraction of A. Besides, it is written in blocks by mappers, and read by mappers, so each mapper sees only one block of size say 30,000 x (k+p), which it will not be, assuming A is sufficiently wide, and assuming k+p=500, amounts to 120mb per mapper. So there's not so much actual memory pressure, that's what distributed computations are for. Same goes for dense operations, we just distribute them.
3 – it looks like you are writing B as output of the second MR, which is (k+p) x n. Going back to argument of a 'big Q', we can't assume that B would be any less. In fact, some time ago i came to conclusion that it looks like projection method would be much more efficient if input is wide rather than tall since projection compression factor would be much better for what seems to be fairly inexpensive operation (since it costs nothing to redistribute Omega which is only backed by one long number as a random gen seed, as opposed to actual long or wide bands such as B or Q). So we can't exclude very wide inputs.
Overall it looks like a great improvement. I am not convinced that it would cut processing time that much (it looks it has the same amount of proposed MR steps but it looks like all of them would require shuffleandsort and reduce phase, whereas with QR is a reduceless process), but it definitely reduces complexity of MR implementation and i would be very eager to try it out. Certainly all i said is the first impression and intuition only; and in my experience intuition turns out to be wrong surprisingly often as far as benchmark guesstimates are concerned.
One thing that i don't understand is how you are proposing to derive this R^1. To do Cholesky on Y'Y, take transpose and inverse and declare it R^1?
This is just a notational deficiency. The CholeskyDecomposition has solveLeft and solveRight methods that do use back or forward substitution to effectively right or left multiply by the inverse of R (or R'). I find my current naming to be confusing and think that it might be better to have the methods timesLinv, timesRinv and leftTimesLinv and leftTimesRinv or some such because that would match the actual mathematical exposition better. If you have a good idea for naming, please suggest it.
1 â€“ it looks like this method is compressing a lot of info into k+p uppertriangular R. I guess there's more potential for rounding errors there. (I am not buying too much into rounding errors though).
This is absolutely true. On the other hand, this is only really putting as much information into that matrix as we ultimately want back out (i.e. we are reducing dimension here).
2 â€“ Argument of Q matrix being large is not very convincing to me. Q is m x (k+p), that is, assuming A is much wider (in terms of nonzero element average #) which it should be in order for projection method to make sense at all, it is a fraction of A.
But A is sparse so Y, B and Q are all about the same (storage) size as A. In fact, if k+p > average number of elements per row of A, then Y, B and Q will all be larger than A.
Besides, it is written in blocks by mappers, and read by mappers, so each mapper sees only one block of size say 30,000 x (k+p), which it will not be, assuming A is sufficiently wide, and assuming k+p=500, amounts to 120mb per mapper. So there's not so much actual memory pressure, that's what distributed computations are for. Same goes for dense operations, we just distribute them.
Of course. But using the Cholesky trick means that Q'A can be done by reading only A instead of Q and A.
3 â€“ it looks like you are writing B as output of the second MR, which is (k+p) x n. Going back to argument of a 'big Q', we can't assume that B would be any less.
I completely agree. B is too big to store all in memory and could easily be comparable to or larger than A.
In fact, some time ago i came to conclusion that it looks like projection method would be much more efficient if input is wide rather than tall since projection compression factor would be much better for what seems to be fairly inexpensive operation (since it costs nothing to redistribute Omega which is only backed by one long number as a random gen seed, as opposed to actual long or wide bands such as B or Q). So we can't exclude very wide inputs.
Indeed.
The common webcase could easily wind up that way. You might have a million users exploring a billion web pages.
Overall it looks like a great improvement. I am not convinced that it would cut processing time that much (it looks it has the same amount of proposed MR steps but it looks like all of them would require shuffleandsort and reduce phase, whereas with QR is a reduceless process), but it definitely reduces complexity of MR implementation
This is might thought exactly. It also decreases the memory load of inmemory implementations which is good. But as you say later, intuition is extraordinarily dangerous for runtimes.
But A is sparse so Y, B and Q are all about the same (storage) size as A. In fact, if k+p > average number of elements per row of A, then Y, B and Q will all be larger than A.
That's true. I had one guy who had millions of rows but only 10 average measurements per row. Probably ratings or something. it is not going to be efficient (cpuwise) in these cases.
But my response has always been, if you have input thinner than projection, then why even use projection? It's my understanding that the whole idea of this is to drastically reduce the input to analyze. Original paper actually never suggested to compute BB' as far as i remember, it's something i did to open up the n by sacrificing rounding. In original paper, they compute SVD of B, so if B is greater than input, it would only cost more to compute SVD of one. So that's what i understood – B is supposed to be much less than A in original work, otherwise there's not much sense.
But my response has always been, if you have input thinner than projection, then why even use projection?
Because it is cheaper to process a thin dense matrix than a wide sparse one.
Likewise, even if B is bigger than A, it can be much cheaper to compute the SVD of B than of A if only because the Cholesky trick works on the skinny dense case. If the Cholesky decomposition loses some accuracy then a QR or LQ decomposition could be used the same way.
Ted,
what's degree of stability of the CholeskyDecomposition code? I am getting test failure
Failed tests: testSingularValues(org.apache.mahout.math.ssvd.SequentialOutOfCoreSvdTest): expected:<0.0> but was:<5250028.000000007>
It should be very good.
I have a few corrections that I need to post. That might be the cause of
your problem.
On Saturday, August 27, 2011, Dmitriy Lyubimov (JIRA) <jira@apache.org>
https://issues.apache.org/jira/browse/MAHOUT792?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13092404#comment13092404]
test failure
testSingularValues(org.apache.mahout.math.ssvd.SequentialOutOfCoreSvdTest):
expected:<0.0> but was:<5250028.000000007>
eliminates the QR decomposition and makes life easier.
rankrevealing) or not. This should actually be useful for solution of
large outofcore least squares problems.
about 1/3 of available memory.
large matrices. It should take time about equal to the cost of reading the
input matrix 4 times and will require working disk roughly equal to the size
of the input.
Integrated in MahoutQuality #1012 (See https://builds.apache.org/job/MahoutQuality/1012/)
MAHOUT790  kill the cardinality array and size() for matrices. Use rowSize() and columnSize() instead.
MAHOUT792  Fix RTM to avoid size() and cardinality array.
tdunning : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1164016
Files :
 /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/eval/InMemoryFactorizationEvaluator.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/bayes/InMemoryBayesDatastore.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/discriminative/LinearTrainer.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/NaiveBayesModel.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainUtils.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/sequencelearning/hmm/HmmUtils.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/MatrixWritable.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/decomposer/EigenVerificationJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UpperTriangular.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/hadoop/als/ParallelALSFactorizationJobTest.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractMatrix.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseMatrix.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/DiagonalMatrix.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/Matrix.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/MatrixView.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/PivotedMatrix.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomTrinaryMatrix.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/SparseColumnMatrix.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/SparseMatrix.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/SparseRowMatrix.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/decomposer/hebbian/HebbianSolver.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/AbstractTestVector.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/MatrixTest.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestMatrixView.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseColumnMatrix.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseMatrix.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseRowMatrix.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestVectorView.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/als/AlternateLeastSquaresSolverTest.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/decomposer/SolverTest.java
Please run this:
public static void main(String[] args) { RandomTrinaryMatrix rtm = new RandomTrinaryMatrix(0, 30, 40, true); for(int row = 0; row < 30; row++) for(int column = 0; column < 40; column++) System.out.println(rtm.get(row, column)); rtm = new RandomTrinaryMatrix(0, 30, 40, true); for(int row = 0; row < 30; row++) for(int column = 0; column < 40; column++) System.out.println(rtm.get(row, column)); }
Lance,
Can you say why? What is it that one would see here?
Yes, bad habit. The "lowgrade" mode does not emit 0. The highgrade mode does not emit negative numbers.
I'm curious: what is the math behind an even split of 1,0,1? The Achlioptas math says +1/1 or +1/1/0/0/0 are OK. So, the lowgrade mode is proven correct, just not what the comment describes
so, this is not fully committed yet?
I don't see our version of CholeskyDecomposition class on the trunk (only apachemath's one). Neighter do i think i can see the SequentialOutOfCoreSVD etc. that seems to be present in the patch posted?
No. I was waiting for comments. If I can find time between packing this evening I will commit.
Another thing that I realized while working on MAHOUT797, is that this execution plan requires to pass 3 times over input of A instead of existing 1 pass, if we are not saving Y. (first time to comute Y'Y, second time to compute B, and third time to compute U).
So we either have to save Y, or incur passes over big dataset (assuming A is much densier than Y).
And we we do save Y then we swap Q for Y, which are of the same size.
So.. it looks like we got to save Y... or it is not worth it.
If you look at my comments in sd2, you see that Y'Y is computed in one pass and then in the pass where you compute Q'A using Y, I just recompute Y on the fly and solve using R to get a block of Q, also on the fly. Since we are reading blocks of A anyway in this pass, it costs nothing except a small amount of computation to avoid the storing of Y.
Yes... and that's what I mean. I was implementing it as Yonthefly. But that implies full pass over A every time we need access to Y, and we need it 3 times without option to parallelize. That's why I think I need to save it.
Also, I am thinking one step ahead, power iterations. In that chain, Yi is AB', and there's no way to compute that on the fly. So, for first stab at it, it would make my life easier to save Y with low degree of replication, and then use the rest of pipeline the same way regardless of i. Besides, I think that saving Y is supposed to make things more efficient, not less, with most generic cases, assuming A >> Y in volume.
Yes... and that's what I mean. I was implementing it as Yonthefly. But that implies full pass over A every time we need access to Y, and we need it 3 times without option to parallelize. That's why I think I need to save it.
Without the power iterations, every time that Y is needed, A is also scanned. That means that Yonthefly is fine.
But I think that saving Y is a fine idea. It should usually (but surprisingly, not at all always) be smaller than A. It is the same size as Q.
So we either have to save Y, or incur passes over big dataset (assuming A is much densier than Y).
I think that this got mangled in the typing. A is sparser than Y and may be larger or smaller depending on the average number of nonzero elements. If A is binary or contains only small integers, then it could well be smaller on disk than Y if a continuous random matrix Omega is used. If Omega contains only 1, 0, 1 (trinary values), then the values of Y should compress nearly as well as the values of A and will be dense as well so we don't have to store the indices.
I like trinary matrix idea. I would like to add it as an option with time. Indeed, may save space fro specific extrasoarse cases.
Looking to try this, I tried applying the last patch per https://cwiki.apache.org/MAHOUT/howtocontribute.html#HowToContributeApplyingapatch i.e. 'patch p 0 i MAHOUT792.patch', and get a pile of 'Reversed (or previously applied) patch detected' and a 'can't find file to patch'. Is some but not all of this now committed?
Per "Github branch MAHOUT792 available at git://github.com/tdunning/mahout.git" I'm looking around https://github.com/tdunning/mahout but there's no branch listed of that name. Presumably https://github.com/tdunning/mahout/tree/newstochastic is the place to go?
Yes. The newstochastic branch is the one to go with. It is a little (but not much) out of date.
Committed this new code.
Finally committed these. This provides Cholesky decomposition and several inmemory implementations of SSVD. These are deficient with respect to power iterations.
Integrated in MahoutQuality #1265 (See https://builds.apache.org/job/MahoutQuality/1265/)
MAHOUT792  More small fixes. Added internal exception class for CholeskyDecomp
MAHOUT792  Made tests use odd sizes to detect row/column confusion. Fixed small errors in out of core SVD
MAHOUT792  New SSVD codes.
MAHOUT792  Copyright fixes for Cholesky
MAHOUT792  Additional test for QR decomposition.
MAHOUT792  Cholesky decomposition
MAHOUT792  New Cholesky decomposition for SSVD update.
MAHOUT792  Switch Cholesky to views.
tdunning : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1222517
Files :
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvd.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/CholeskyDecomposition.java
tdunning : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1222516
Files :
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvdTest.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/ssvd/SequentialBigSvd.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/ssvd/SequentialBigSvdTest.java
tdunning : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1222515
Files :
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/ssvd
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvd.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/ssvd
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvdTest.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/CholeskyDecomposition.java
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/ssvd/SequentialBigSvd.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/CholeskyDecompositionTest.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/ssvd
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/ssvd/SequentialBigSvdTest.java
tdunning : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1222514
Files :
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/QRDecompositionTest.java
tdunning : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1222513
Files :
 /mahout/trunk/math/src/main/java/org/apache/mahout/math/CholeskyDecomposition.java
 /mahout/trunk/math/src/test/java/org/apache/mahout/math/CholeskyDecompositionTest.java
I don't think that this issue should have been closed.
My current theory is that the problem has to do with the ordering of the result of File.listFiles. I just committed a cleanup fix. The cleanups were important in any case.
Let's see what Jenkins has to say.
Jenkins succeeded after this change. Good news.
I will try on an EC2 instance to see if I can.
Hmm... https://builds.apache.org/job/mahoutnightly/752/console shows the evil test still failing.
Very confusing.
This is not obviously an ubuntu portability issue. We have failures on ubuntu4 and ubuntu5, successes on ubuntu3 and solaris1. And then a failure on ubuntu3.
OK. I think I have this failing test corralled. The problem was that the test case was reading the results and assuming the blocks in the result matrix would be sorted. This is true on the mac and sometimes true on ubuntu. Once it fails, it is pretty consistent.
I put a sort into test and all seems well. I replicated the failure 3 times in a row, added the sort and verified that the failure did not occur on multiple runs. I removed the sort and the failure returned consistently.
Fix is forthcoming shortly as soon as I do the same tests on ubuntu 10 as well as 11.
Integrated in MahoutQuality #1309 (See https://builds.apache.org/job/MahoutQuality/1309/)
MAHOUT792  Forced correct block ordering in outofcore SVD. Hopefully addresses ubuntu test failures. Also forced file closing.
tdunning : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1231800
Files :
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvd.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvdTest.java
Last few Jenkins runs confirm my expectation that this problem was a bug in the test case itself that is now fixed.
This is the current state. This has an inmemory implementation and an outofcore implementation for review. The outofcore implementation needs to be changed to use threads well before it is production ready.