Details
Description
This patch contains an implementation of conjugate gradient, an iterative algorithm for solving large linear systems. In particular, it is well suited for large sparse systems where a traditional QR or Cholesky decomposition is infeasible. Conjugate gradient only works for matrices that are square, symmetric, and positive definite (basically the same types where Cholesky decomposition is applicable). Systems like these commonly occur in statistics and machine learning problems (e.g. regression).
Both a standard (in memory) solver and a distributed hadoopbased solver (basically the standard solver run using a DistributedRowMatrix a la DistributedLanczosSolver) are included.
There is already a version of this algorithm in taste package, but it doesn't operate on standard mahout matrix/vector objects, nor does it implement a distributed version. I believe this implementation will be more generically useful to the community than the specialized one in taste.
This implementation solves the following types of systems:
Ax = b, where A is square, symmetric, and positive definite
A'Ax = b where A is arbitrary but A'A is positive definite. Directly solving this system is more efficient than computing A'A explicitly then solving.
(A + lambda * I)x = b and (A'A + lambda * I)x = b, for systems where A or A'A is singular and/or not full rank. This occurs commonly if A is large and sparse. Solving a system of this form is used, for example, in ridge regression.
In addition to the normal conjugate gradient solver, this implementation also handles preconditioning, and has a sample Jacobi preconditioner included as an example. More work will be needed to build more advanced preconditioners if desired.

 0001MAHOUT672LSMRiterativelinearsolver.patch
 24 kB
 Ted Dunning

 0001MAHOUT672LSMRiterativelinearsolver.patch
 24 kB
 Ted Dunning

 mahout672.patch
 58 kB
 Jonathan Traupman

 MAHOUT672.patch
 3 kB
 Ted Dunning

 mahout672111023.patch
 58 kB
 Jonathan Traupman
Issue Links
 blocks

MAHOUT499 Implement LSMR inmemory
 Closed
 relates to

MAHOUT772 Refactor Matrix/Vector implementation with linear operators
 Closed
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
committed at svn revision 1188491
I agree that having things that scale not as numDocs or numFeatures is pretty critical, if we're talking about mapreduce passes. This code doesn't fall into that trap, from what I can see.
As I've heard no objections, I committed this (revision 1188491). Thanks Jonathan!
I think it's a bit extreme to say we need to have nearly O(1)
These may mean quite different things in practice. the devil is in the details. by ~O(1) i meant ok, if in practice it grows so slow that it's enought to process several billion rows (m) of input by having 40 iterations, that's perhaps still 'nearly' O(1) in my definiton.
I just said that i seem to have gleaned in the javadoc explanation that with this patch
num of iterations ~ n (num of columns) so if for million columns it means a million iterations, that's probably not cool. On the other side, conversion will unlikely require a million. Then what? I kind of still did not get a clear clarifications on the estimate of num iterations (except that it is not exactly O(1)).
I've run the test suite and all tests are passing on my box too, so I'm going to commit this later today if nobody objects.
> Jonathan, what size inputs have you run this on, with what running time in comparison to the other algorithms we have? From what I can see, this looks good to commit as well.
The largest dataset I've run it on was a synthetic set of about the same size as the old distributed Lanczos test matrix (before it was made smaller to speed up the tests). I ended up using a smaller matrix in the test because it was taking too long to run. I haven't tested it on truly huge matrices, but I can do that at some point if people are interested.
As for comparisons to other algorithms, both CG and LSMR ran faster than the QR decomp/solve that's already in Mahout on most of the test inputs I was playing with. They also ran faster than a Cholesky decomp/solve that I had implemented but haven't submitted. I don't have numbers on me right now, though.
Between the two, LSMR often converges faster than CG, and when it does, it's faster. For problems where they both take the same number of steps, performance is about the same, with CG sometimes being a bit quicker since it does less computation per iteration. The big advantage of the CG implementation is that it can use m/r, so should be scalable to larger matrices.
> That's the Achilles' heel of much of distributed stuff in Mahout. I.e. space of iterations (I feel) must be very close to O(1), otherwise it severely affects stuff. Even using side information is not that painful it seems compared to iteration growth. That severely decreases pragmatical use.
I think it's a bit extreme to say we need to have nearly O(1) Mapreduce passes to be useful. Lots of iterative stuff requires quite a few passes before convergence (as you say: Lanczos and LDA both fall into this realm), yet it's just the price you have to pay sometimes.
This may be similar.
Jonathan, what size inputs have you run this on, with what running time in comparison to the other algorithms we have? From what I can see, this looks good to commit as well.
Basically, at the core of the CG solver is a matrix/vector multiply, so we get the map/reduce implementation by using a DistributedRowMatrix instead of an inmemory matrix. Since we do one matrix/vector multiply per iteration, it will require one map/reduce job per iteration, which somewhat limits its performance â€“ there's a large range of data sizes that could benefit from distributed computation but that get bogged down by Hadoop's slow job setup/teardown. Essentially, we're looking at many of the same tradeoffs we have with the distributed Lanczos decomposition stuff.
Thank you, Jonathan.
Yeah, so i figured. that's my concern. That's the Achilles' heel of much of distributed stuff in Mahout. I.e. space of iterations (I feel) must be very close to O(1), otherwise it severely affects stuff. Even using side information is not that painful it seems compared to iteration growth. That severely decreases pragmatical use.
My thinking is that we need to keep algorithms we recommend accountable to some standard. My understanding is that there's similar problem with ALS WR implementation right now. I.e. we can have it in the codebase but Sebastien stops short of recommending it to folks on the list.
That's kind of the problem i touched recently : Mahout stuff is different in a sense that it requires deeper investigation of best parallelization strategy than just let's throwitatit approach. Even with matrix multiplications ther're notions that decrease computation time tenfold compared to DRM approach for some cases which are less than general but in practice are suprisingly common (example of such are multiplication steps in SSVD). Hadoop sorting is not as inexpensive as its pitch may suggest. And tenfold is sort of interesting...it's a difference between 10 hours and 1 hour.
Anyway. I am ok with committing this with @Maturity(Experimental) until we confirm running time on some input. I will probably even have to check this out soon, i may have a use case for it soon.
Regarding doing many multiplications at once, I did some of the math just now and it looks like you can build a solver that does this sort of thing, but the resulting algorithm really begins to look more like the stochastic projection for SVD than like CG.
Let's get this in place.
> Anyway, I'd really like to reach some closure on this issue. These two algorithms aren't the endall beall of linear system solvers, but I think they're useful in their current form and can be a foundation for further work.
I'll try out the patch now!
Reply to Ted's and Jake's comments:
> Is it possible to multiply by many vectors at once to accelerate the convergence here by essentially exploring in multiple directions at once?
This might be possible, but I don't think it's just an easy tweak. At iteration i, we compute the conjugate gradient, then move up it to a local min and repeat. We don't know the direction we're going to go at i+1 until after we've finished iteration i. To do what you suggest would mean moving along a vector that's different than the gradient, trying to collapse multiple CG steps into one. I'd imagine there's some literature on this (if not, might be a fruitful avenue of research).
However, my gut reaction is 1) this won't just be a simple modification to this algorithm and 2) the technique used to approximate the gradient at i+1 while still on iteration i might involve operations that will make a distributed version more difficult/impossible. E.g. I've seen that the LSMR solver often converges in fewer steps than the CG one, but it's doing enough additional stuff with the data matrix that creating a map/reduce version would be a lot more work than just using a DistributedRowMatrix.
> I gather you replaced a.addTo(b) with b.assign(a, Functions.PLUS)? If so, then all will be well.
Yes, that's all I did.
Anyway, I'd really like to reach some closure on this issue. These two algorithms aren't the endall beall of linear system solvers, but I think they're useful in their current form and can be a foundation for further work.
Hey Jonathan,
I gather you replaced a.addTo(b) with b.assign(a, Functions.PLUS)? If so, then all will be well.
Jonathan,
Is it possible to multiply by many vectors at once to accelerate the convergence here by essentially exploring in multiple directions at once?
Dmitriy
Basically, at the core of the CG solver is a matrix/vector multiply, so we get the map/reduce implementation by using a DistributedRowMatrix instead of an inmemory matrix. Since we do one matrix/vector multiply per iteration, it will require one map/reduce job per iteration, which somewhat limits its performance – there's a large range of data sizes that could benefit from distributed computation but that get bogged down by Hadoop's slow job setup/teardown. Essentially, we're looking at many of the same tradeoffs we have with the distributed Lanczos decomposition stuff.
The mappers and reducers do not use much memory for side storage. I'm not super familiar with the guts of DistributedRowMatrix, but I believe the only side information it needs is the vector it is multiplying the matrix by, so a couple of MB max.
If Mahout ever adopts something like Giraph, we can probably make a lot of these distributed iterative linear algebra algorithms a lot more efficient.
Here is an updated patch that applies to the current trunk. It compiles and all tests pass.
The only change from the July 25 patch is to remove a call to the now removed Vector.addTo() method.
Jonathan,
Thank you for this work. It would be very interesting to have solver like this.
One question I have is, for a mapreduce version, how many Map reduce steps this would require? And secondly, does it use any interestingly sized side information in mapppers or reducers?
I skimmed the patch and it mentions it may take up to numcols iterations assuming fixed lambda – how do these iterations map into mapreduce steps?
Thank you.
The currently attached patch was good to go when I uploaded it a few months back. I can verify this weekend that it still applies cleanly. I took out all the linear operator stuff per request, but the current patch has working implementations of CG and LSMR for symmetric positive definite matrices.
The ability to operate on other matrices (e.g. of the form A'A) is no longer in this patch/issue because it relies on the linear operator stuff, but there is no reason to wait for that to get this core functionality.
Folks – what's the status on this? It's been sitting about for a few months. Is this going to go in for 0.6 or should we retire it?
I removed all the matrix/vector changes and linear operator stuff from this patch, so this code just implements the conjugate gradient and LSMR solvers using the 0.5 standard linear algebra stuff.
I'll create a new issue for the linear operators and other linear algebra refactoring. I'm not sure when I'll have the time to work on it, but I'll try to implement the suggestions made here.
Since I'd like to get the linear operator stuff out sooner rather than later, I did not add the code for the A'A and (A + kI) cases back to the CG implementation. So for now, the CG solver will only work for symmetric pos. def. matrices.
May I suggest that a redo of Matrices include a solution to the doubledispatch problem?
In this case, there are many variations of the exact code to apply for Matrix :: Vector operations, and way too many uses of instanceof.
Also, the LinearOperator suite is big enough to be its own patch.
Yes, it is a big a patch...wasn't always this way. Here's a brief summary of this came to be the monster it is:
 originally, this was just an implementation of a conjugate gradient solver for linear systems that worked with either normal or distributed matrices.
 Ted mentioned that he had some mostly completed LSMR code that did very similar stuff and asked if I could integrate it with this patch, which I did.
 a long discussion between Ted, Jake and me ensued about how a lot of algorithms (e.g. CG, LSMR, Lanczos SVD) all used the same concept of a matrix that could be multiplied by a vector but that didn't need rowwise or elementwise access to the data in the matrix. After much back and forth, we settled on the name "LinearOperator."
The LinearOperator stuff is disruptive to the code base, but it does provide some nice functionality, too. For example, you can implement things like preconditioners, diagonal offsets (e.g. ridge regression) and other transformations to the data efficiently using linear operators without needing to either actually modify your underlying data set or add the functionality to the specific algorithm you're using. This was the original motivation behind it, since I had included a diagonal offset option for lowrank matrices in my CG code but that wasn't in Ted's LSMR implementation. We decided that it might be better to put this in some common place that all similar algorithms could use for free. Since all matrices are linear operators, but the converse isn't true, it was decided that Matrix should be a subclass of LinearOperator, not the other way around.
One thing I'm not 100% comfortable with is the parallel interface and class hierarchies (i.e. LinearOperator, Matrix vs. AbstractLinearOperator, AbstractMatrix). I'd like to see the interfaces go away in favor of abstract classes, but I don't recall us reaching any consensus on this.
I have some time to work on this stuff now (after a long busy spell), so if you can send me specific issues (e.g. the DistributedRowMatrix stuff you mentioned) I'll try to take a look at it.
This is a big patch and I'm not qualified to review it. But I noticed a few small issues. Look at DistributedRowMatrix for instance – these changes should not be applied. There are also some spurious whitespace and import changes.
It's also a pretty big patch. Are there more opportunities for reuse? thinking of AbstractLinearOperator for instance.
Sorry I haven't updated this issue in a while – got very busy with work, etc.
Anyway, this is a new patch based against the latest trunk. All the unit tests pass and it should be ready to go with the LSMR, conjugate gradient, and linear operator stuff.
Use the "mahout672combined.patch" attachment and ignore the rest.
OK, here's a combined patch in SVN format. Applied it to a clean trunk w/o problems. All tests pass.
Canceling the old stuff uploaded to this issue.
Jonathan,
On Mahout, we don't have the magic application of patches that Hadoop and Zookeeper have. If we did, then using git diff noindex would prevent the need for p1 and would make the scripts work. As I mentioned, this doesn't matter here, but is very nice on those other projects.
OK. Here is a patch combining all the stuff that we've been talking about with this issue:
 LinearOperators and refactor of various algorithms to use LinearOperators instead of VectorIterables
 Conjugate gradient solver from the original patch
 Ted's LSMR implementation, refactored to use LinearOperators
The patch is from git diff, so you'll need to use "patch p1" to apply it to trunk in svn.
I've tested that it applies successfully to a copy of trunk checked out from SVN. Everything compiles and tests all pass.
OK, sounds good. The VirtualMatrix stuff I've written so far looks a lot like the signature for LinearOperator you described. I can rename it easily enough, and "LinearOperator" has a good ring to it.
I looked over the mahout319 patch and I don't think the conflicts will be too bad. Mostly it will just be replacing VectorIterable > LinearOperator in a bunch of places. I'll clone your github repo to do the work off that. If mahout319 is going out soon, we should probably just back this work up behind it, since I don't think there's any urgency to it.
I'll make the changes to getInitialVector() as you suggest, that should be easy.
For onthefly collaboration, I've cloned apache/mahout.git on GitHub, and applied my MAHOUT319 patch to it, and will be continuing on there throughout the week. Clone me there and we can avoid collisions.
I think we need to kill VectorIterable, and replace it with something like "LinearOperator", which just has:
Vector times(Vector)
LinearOperator times(LinearOperator)
LinearOperator transpose()
int domainDimension() // ie numCols
int rangeDimension() // ie numRows
and no iterator methods.
getInitialVector() doesn't need to be implemented the way it is. LanczosSolver uses the iterator to calculate a good starting vector, but it doesn't need to: DistributedLanczosSolver just takes the vector of all 1's (normalized), and that works great in practice. Let's just change the behavior of LanczosSolver to do this as well, skipping on the iteration.
Before you get too involved with this refactoring on trunk, Jonathan, you should be careful: as I mentioned above, you're likely going to conflict with my changes for MAHOUT319. They're API changes to LanczosSolver's core solve() method.
Got a bit of this done this evening, but ran into a roadblock: I had to separate out the new VirtualMatrix interface from VectorIterable. VirtualMatrix contains the times() method while VectorIterable keeps the iterator() and iterateAll() methods. I had to do this because there's no efficient way to iterate the rows of a squared A'A matrix without actually constructing the full product matrix.
However, when I started looking at porting the Lanczos solver to use the new VirtualMatrix type, the core algorithm translates fine but the getInitialVector() routine, which relies on an iteration through the data matrix, presents difficulties.
The easiest path through this impasse that I can see would be defining the new DistributedSquaredMatrix (which implements A'A) to also implement VectorIterable, but have it iterate over its underlying source matrix A, rather than the product matrix. This would preserve the current behavior of Lanczos solver albeit at the expense of an iterator on DistributedSquaredMatrix that doesn't make a great deal of sense in a more general context. This solution would probably not work for the diagonal offset case, because it's unclear how to transform the iterated rows using the offset. We could always define a "IterableVirtualMatrix" interface that extends both VectorIterable and VirtualMatrix for algorithms, like Lanczos, that require it, though I'm still bothered by the weird iterator semantics.
The other possible solution I considered would be to add the times(VirtualMatrix m) method to the VirtualMatrix interface, then rewriting the getInitialVector() routine in terms of fundamental matrix operations: for the symmetric case, scaleFactor looks to be trace(A*A) and the initial vector looks to be A times a vector of all ones. The asymmetric case is mathematically different, but I don't know enough about Lanczos to fully understand why. Unfortunately, implementing matrix multiplication with virtual matrices may be hard, or at the very least computationally expensive.
Thoughts?
OK.
IdentityOffsetMatrix?
For that matter, why not call it DiagonalOffsetMatrix and just have the identity case be a special case?
Strike that, reverse it.
I thought about calling it DiagonalOffsetMatrix, but these are a proper subset of multiples of the identity.
PlusIdentityMultipleMatrix
DiagonalOffsetMatrix?
Sounds like removing the timesSquared method is moving ahead of deprecating. I would prefer to remove it, but would defer to anybody who had an objection.
Yes, a SquaredMatrix and a PlusIdentityMultipleMatrix (? ugly name) would be enough, if composable properly.
We might need two variants, sadly. Maybe we should migrate VectorIterable to some other abstract base class (get rid of interface, for previously discussed interface/abstract class reasons), and give it a better name. Maybe that would make it easier to not have two variants? It would be a class which just has the times(Vector) and times(Matrix) methods, and that's almost it? (numRows/numCols too, I guess).
As for LanczosSolver, please check out the patch for MAHOUT319. The api for solve is most likely changing anyways. And I'm in favor of just removing timesSquared() and isSymmetric, not marking deprecated. Still pre1.0days, folks!
with the virtual matrices, do we want two classes for A + lambda * I and A'A + lambda * I, or 4 classes for each of A, A'A, A + lambda * I, and A'A + lambda * I?
I think 2. It is easy enough to handle the lambda = 0 case with another constructor. I think that A and A' A are fundamentally different, however.
should we make the virtual matrices subclasses of AbstractMatrix as in Ted's patch or implementations of VectorIterable like the current DistributedRowMatrix?
Or make two variants?
what should we do with timesSquared() and the isSymmetric flag in DistributedLanczos solver? Remove it? Mark it deprecated? Leave it unchanged?
Deprecating it would be nice. Jake should have a major vote.
is several separate patches for the different pieces preferred or is one big patch easier?
One big one is much easier to deal with.
Integrated in MahoutQuality #769 (See https://builds.apache.org/hudson/job/MahoutQuality/769/)
MAHOUT672  the forgotten files
I should have some time this weekend to pull all this stuff together into an integrated patch with the virtual matrix code, conj. gradient, and LSMR.
A few questions, though:
 with the virtual matrices, do we want two classes for A + lambda * I and A'A + lambda * I, or 4 classes for each of A, A'A, A + lambda * I, and A'A + lambda * I?
 should we make the virtual matrices subclasses of AbstractMatrix as in Ted's patch or implementations of VectorIterable like the current DistributedRowMatrix?
 what should we do with timesSquared() and the isSymmetric flag in DistributedLanczos solver? Remove it? Mark it deprecated? Leave it unchanged?
 is several separate patches for the different pieces preferred or is one big patch easier?
In fact, I'll say that I'd love to have a chance to kill the ohsopoorlynamed timesSquared() method. If we just had a virtual SquaredMatrix whose times() method was the implementation we currently have, that would be awesome.
Ok, you guys have convinced me (esp Ted's version, with the virtual matrix idea, but it's basically the same thing, because DRM just wraps an HDFS file). We'll either need to throw UnsupportedOperationException for a lot of methods, or else define a nice simple superinterface to VectorIterable which only defines Vector times(Vector input), and maybe VectorIterable times(VectorIterable m).
All,
Sorry for posting a patch that is different in function and which shadows the older patch. The latest one I posted
has just the virtual matrix stuff. The earlier one that Jonathan posted has his solver code as well.
So are we ready to integrate the three strands of development here (LSMR, CG and VirtualMatrix?)
Or do we need to think and talk a bit more?
Yes, I'm talking about the logic inside of things like LanczosSolver. Ted pointed out that we have a number of algorithms that use matrix/vector multiply as a fundamental operation but that have special case code to handle certain common matrix forms. It's a bit redundant to have these cases in each algorithm. It also means that every time you create new special form matrix, you have to modify each of these algorithms to handle that form.
I don't think we're suggesting overloading what times() means. Rather, we're suggesting having subclasses of DistributedRowMatrix (or possibly separate implementations of VectorIterable) for special form matrices whose internal representations may be done in a more efficient manner. E.g. define a "SquaredDistributedMatrix" class that represents a matrix of the form B = A'A. All the operations, including times() mathematically mean exactly what they should: B.times means Bx. Under the hood, it will be implemented as A'(Ax) because it's more efficient, but that implementation detail shouldn't matter or be exposed to an algorithm that's just interested in doing a matrix/vector multiply. Likewise for (A + lambda * I) or (A'A + lamdba * I) or (A'A + B'B) or banddiagonal matrices. The specific implementation of the times() method takes care of the representational details so that any algorithm that accepts one of these matrix types can operate on any of them.
Here is an alternative to a switch in the solver. We would use
s.solve(new SquaredMatrix(A, lambda), b)
to solve A'A + lambda I
I prefer a class that implements a virtual matrix.
I have a patch that I will attach shortly that illustrates this.
What flag? Doing (A'A)x vs. (A)x is pretty fundamental to the algorithm. Why would you prefer to switch choices of which to do based on class type? I can see how it would make the logic inside of something like LanczosSolver simpler (you just always do "times(vector)", instead of "if symmetric, do times(), else do timesSquared()"), but if you really want to do it generally, what you want is a mini MapReduce paradigm: define a method which takes a pair of functions:
Vector DistributedRowMatrix#mapRowsWithCombiner(Vector input, Function<Vector, Vector> rowMapper, Function<Vector, Vector> resultReducer)
{ // for each row, emit Vectors: rowMapper.apply(input, row) // combine/reduce: for each intermediateOutputRow, output < resultReducer.apply(intermediateOutputRow, output) return output; }But I'm not sure this has the right level of generality (too general? Not general enough?).
I'm definitely not convinced that overloading times() to mean many different things is really wise. Having it mean (A)x vs. (A + lambda I)x could totally be fine, however. It's defining the matrix to have a extremely compressed way of showing it's diagonal.
I think it's a good idea to do this and would be happy to make it happen. Probably won't be before the weekend with my schedule unfortunately.
The only problem the current times()/timesSquared() implementation in DistributedRowMatrix is that each algorithm that uses it needs a flag to determine whether to use times() or timesSquared(). If we created a few subclasses of DistributedRowMatrix for the A'A, (A + lambda * I), etc. cases, we could have them just expose the times() method, implementing it as appropriate through calls to the superclass methods.
DistributedRowMatrix already does A'A x (it's the timesSquared() method), and I've long thought about doing the other two as well (for easy PageRank computation).
Jonathan,
What would you think about adapting this code so that we actually define some special matrix types that do
the A'A x, (A + lambda I) x and (A'A + lambda I) x products efficiently.
That would allow all product only methods like LSMR, CG and inmemory random projection for SVD to work on
all of these cases directly without having the extra convenience methods.
What do you think?
Jonathan,
Here is the LSMR implementation in a form that can be applied to trunk.
Does this help you out?
See https://github.com/tdunning/LatentFactorLogLinear/tree/lsmr
As I mentioned, I will try rebasing back to trunk to get a clean patch set you can apply there.
Also, can you point me to the specific branch and path in your repo to the LSMR implementation? I poked around but couldn't readily find it.
OK, yeah, I think I misunderstood which code you were talking about.
I assume this is the reference for the LSMR stuff: http://www.stanford.edu/group/SOL/reports/SOL20102R1.pdf
I'll have to take some time to digest it, but based on a quick skim it looks like both LSMR and LSQR are more or less mathematically equivalent to CG applied to least squares regression, but with better convergence and numeric properties with inexact arithmetic.
BTW, do you have any links to a SGD bibliography or other list of resources on it? From what I've seen in the code and some of your comments, it looks like a cool technology that I'd like to know more about.
As for a linear regression implementation using CG compared to one using SGD, it would be hard for me to reach any conclusions without comparing the two approaches head to head on the same data. CG would probably gain some benefit from being easily parallelizable, but the individual updates in SGD seem very fast and lightweight, so any speed advantage to CG would probably only come up for truly massive datasets. The SGD implementation in your patch also has a lot of regularization support that a simple CG implementation of LMS would lack (ridge regression i.e. L2 regularization comes for free, but L1 is considerably harder). I'm also unaware of how one would do the automatic validation/hyperparameter tuning using CG that your SGD implementation does.
The other big difference, btw, is that all of our parallel approaches require at least one pass through the data. The SGD stuff can stop early and often only needs a small fraction of the input to converge. That gives sublinear convergence time in terms of input size (which sounds whacky, but is real). Any approach that needs to read the entire data set obviously can't touch that scaling factor.
Offsetting this is the idea that if we don't need all the data for a given complexity of model, then we probably don't want to stop but would rather just have a more complex model. This is where the nonparametric approaches come in. The would give simple answers with small inputs and more nuanced answers with large data.
Jonathan,
This all sounds good. There is a point of confusion, I think.
I'd have to dig a little deeper to get a full understanding of all that your doing with the SGD regression stuff. (BTW, I think you mean
MAHOUT529?) Broadly speaking, though, I'd say the two patches are complementary. Conjugate gradient is just a iterative method for solving linear systems. Regression is one obvious application, but linear systems come up a lot in a whole range of algorithms, making CG a fairly general building block.
I was talking about the LSMR implementation. It is an iterative sparse solver similar to LSQR, but with better convergence properties. Like your code, it requires only a forward product. I should pull out a separate patch and attach it here.
Ted
I'd have to dig a little deeper to get a full understanding of all that your doing with the SGD regression stuff. (BTW, I think you mean MAHOUT529?) Broadly speaking, though, I'd say the two patches are complementary. Conjugate gradient is just a iterative method for solving linear systems. Regression is one obvious application, but linear systems come up a lot in a whole range of algorithms, making CG a fairly general building block.
As a linear system solver, the big advantage of CG over e.g. the Cholesky decomposition is a) being iterative, it's very easy to adapt it to map/reduce for very large datasets and b) for matrices of the form (cI + A), where A is of rank k, CG will typically run in O(kn^2) time instead of O(n^3). CG also has a few disadvantages, namely that for full rank matrices, it requires n^3 multiplies compared to IIRC n^3/3 for Cholesky – the same asymptotic performance, but that constant factor difference can add up in the real world. Another large disadvantage is that if you are solving a collection of linear systems, i.e. AX = B, where X and B are both matrices instead of vectors, you have to run a separate CG solver for each of the k columns of X for a total O(kn^3) runtime. Traditional matrix decomposition methods are usually O(n^3) to do the decomposition, but only O(n^2) to solve the system using the decomposed matrix, so you can solve a collection of k systems in O(n^3 + kn^2).
As for a linear regression implementation using CG compared to one using SGD, it would be hard for me to reach any conclusions without comparing the two approaches head to head on the same data. CG would probably gain some benefit from being easily parallelizable, but the individual updates in SGD seem very fast and lightweight, so any speed advantage to CG would probably only come up for truly massive datasets. The SGD implementation in your patch also has a lot of regularization support that a simple CG implementation of LMS would lack (ridge regression i.e. L2 regularization comes for free, but L1 is considerably harder). I'm also unaware of how one would do the automatic validation/hyperparameter tuning using CG that your SGD implementation does.
FWIW, I also have an implementation of the Cholesky decomposition, which I've been meaning to Mahoutize and submit when I can find the time to do it.
Jonathan,
This is a cool thing. What do you think of it in relationship to the LSMR code I have up on github as part of MAHOUT525 ?
Based on r1092853.
Patch consists of 11 new files in 4 new directories under both core/ and math/. No changes to existing code.
Includes 3 new unit tests. All unit tests pass.
Integrated in MahoutQuality #1118 (See https://builds.apache.org/job/MahoutQuality/1118/)
MAHOUT672on behalf of jtraupmanjmannix : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1188491
Files :