Description
I'm new to Mahout, and I couldn't find some basic matrix functions. Without them, users cannot do many tasks through the CLI or API: if a user gets a result from an existing MapReduce matrix operation (like SVD), they cannot take any further steps with it. Here is a list of what I found missing:
1) Addition, subtraction
2) Norm (like norm1, norm2, normfrobenius)
3) Matrix comparison
4) Get lower triangle, upper triangle and diagonal
5) Get identity and zero matrix
6) Concatenate two or more matrices: A = [A1, A2]
7) More solvers for linear equations, like Gaussian elimination (maybe it's hard to implement)
8) Import and export of CSV, ARFF ... (this will be very useful when a user wants to exchange results with other applications like MATLAB)
I'd like to know whether there is any plan to do this. If so, I can put some effort into implementing these.

 MAHOUT880.patch
 47 kB
 Raphael Cendrillon

 MAHOUT880.patch
 34 kB
 Raphael Cendrillon

 MAHOUT880.patch
 26 kB
 Raphael Cendrillon
Many of these sound great, yes!
I'd have one suggestion, however: DistributedRowMatrix implements the interface VectorIterable, which the interface Matrix extends. The methods you mention which are already in Matrix should just get pulled up into VectorIterable.
Of course, this requires some careful checking that a call such as DistributedRowMatrix.minus(DenseMatrix) behaves sensibly. I would imagine this case is covered by the fact that there is no sensible reason to have a DistributedRowMatrix and a DenseMatrix of exactly the same cardinality (one fits in RAM, but the other needs to live on HDFS?).
Regarding some of these methods:
4) I'm not sure about this one: do we have uses for these? If you have a DistributedRowMatrix (a humongous HDFS SequenceFile of Vectors), what exactly are you going to do with its upper triangle? The diagonal I can see, I guess: extracting a vector of the diagonal from the whole distributed matrix, sure.
6) is actually being looked at in MAHOUT-884.
7) We like solvers, yes, but those methods don't go in our matrix classes. They go in separate solver classes that take a Matrix (or DistributedRowMatrix) as input.
8) is also good, and we'd always like more I/O hooks, but again, these should live in other classes, and in some ways this already exists: VectorDumper allows the option of dumping a DistributedRowMatrix from SequenceFile to CSV, and I think we have some support for ARFF as well, somewhere.
Hi Jake. If you get a chance could you take a look through the attached patch? Your feedback would be great.
Hi Raphael,
Can you create a Review Board request for this ticket? (See MAHOUT-888 for details on how.)
I'll be glad to. Thanks.

This is an automatically generated email. To reply, visit:
https://reviews.apache.org/r/2955/

Review request for Jake Mannix.
Summary

Jobs for matrix-vector addition, covariance matrix calculation and row average calculation in DistributedRowMatrix.
This addresses bug MAHOUT-880.
https://issues.apache.org/jira/browse/MAHOUT880
Diffs
trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1206431
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixCovarianceJob.java PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixVectorAdditionJob.java PRE-CREATION
trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1206431
Diff: https://reviews.apache.org/r/2955/diff
Testing

JUnit tests for each job
Thanks,
Raphael
Great work!
I'm working on the norm job and will try to finish it ASAP.


I'm not seeing the centering of the rows for the covariance computation.
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixCovarianceJob.java
<https://reviews.apache.org/r/2955/#comment7923>
Don't we have to center the rows for covariance? Am I missing something or do you assume that the data is already centered?
 Sebastian
On 2011-11-29 08:41:06, Sebastian Schelter wrote:
> trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixCovarianceJob.java, line 119
> <https://reviews.apache.org/r/2955/diff/1/?file=60410#file60410line119>
>
> Don't we have to center the rows for covariance? Am I missing something or do you assume that the data is already centered?
Thank you for the feedback, Sebastian.
You're right, we first need to center the rows. I should rename this job to avoid confusion. In general it is just meant to compute x.transpose().times(x).
 Raphael
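To make the distinction in this exchange concrete, here is a minimal sketch using plain Java arrays rather than Mahout classes. It contrasts the uncentered product X^T X with a covariance computed after subtracting the mean row. The class and method names (`GramVsCovariance`, `timesSelf`, `covariance`) are hypothetical, and the 1/(n-1) scaling is an assumption (the usual sample covariance); the thread does not state which normalization the job was meant to use.

```java
import java.util.Arrays;

/** Plain-array sketch: X^T X versus the sample covariance of the rows of X. */
final class GramVsCovariance {

  /** Computes X^T X, treating each row of x as one observation. No centering. */
  static double[][] timesSelf(double[][] x) {
    int cols = x[0].length;
    double[][] gram = new double[cols][cols];
    for (double[] row : x) {
      for (int i = 0; i < cols; i++) {
        for (int j = 0; j < cols; j++) {
          gram[i][j] += row[i] * row[j];
        }
      }
    }
    return gram;
  }

  /** Subtracts the mean row from every row, then scales X_c^T X_c by 1/(n-1). Needs n >= 2. */
  static double[][] covariance(double[][] x) {
    int n = x.length;
    int cols = x[0].length;
    // Mean row: sum each column, then divide by the number of rows.
    double[] mean = new double[cols];
    for (double[] row : x) {
      for (int j = 0; j < cols; j++) {
        mean[j] += row[j];
      }
    }
    for (int j = 0; j < cols; j++) {
      mean[j] /= n;
    }
    // Centered copy of the input.
    double[][] centered = new double[n][cols];
    for (int r = 0; r < n; r++) {
      for (int j = 0; j < cols; j++) {
        centered[r][j] = x[r][j] - mean[j];
      }
    }
    double[][] cov = timesSelf(centered);
    for (double[] covRow : cov) {
      for (int j = 0; j < cols; j++) {
        covRow[j] /= (n - 1);
      }
    }
    return cov;
  }

  public static void main(String[] args) {
    double[][] x = {{1, 2}, {3, 4}, {5, 6}};
    System.out.println(Arrays.deepToString(timesSelf(x)));
    System.out.println(Arrays.deepToString(covariance(x)));
  }
}
```

The two results differ whenever the mean row is non-zero, which is why a job that only computes the product should not be called a covariance job.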



(Updated 2011-11-29 18:44:49.585493)
Review request for Ted Dunning, Jake Mannix and Sebastian Schelter.
Thanks,
Raphael


trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java
<https://reviews.apache.org/r/2955/#comment7976>
I'm not sure about this method: you take in a DistributedRowMatrix, which by design is a big, huge SequenceFile<IntWritable,VectorWritable>. Why don't you just take in a Vector, put it in the DistributedCache (or even serialize it into the Configuration, if it's small enough), and use that?
Passing in a DistributedRowMatrix makes people assume you can put in a real full matrix.
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java
<https://reviews.apache.org/r/2955/#comment7977>
This will force a huge bottleneck of one reducer, will it not?
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java
<https://reviews.apache.org/r/2955/#comment7978>
I think we already have a VectorSummingReducer somewhere, we should reuse that.
 Jake
On 2011-11-29 19:56:51, Jake Mannix wrote:
> trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java, line 116
> <https://reviews.apache.org/r/2955/diff/1/?file=60411#file60411line116>
>
> This will force a huge bottleneck of one reducer, will it not?
Thanks for the feedback Jake, it's really appreciated! I think the load will be distributed somewhat by the combiner at each node. Do you still think this will cause too much of a bottleneck?
Do you have any suggestions for a better way to implement this?
 Raphael



(Updated 2011-12-01 08:39:37.868935)
Review request for Ted Dunning, Jake Mannix and Sebastian Schelter.
Changes

A fair bit of refactoring. Added plus() and minus() methods for Matrix-Matrix and Matrix-Vector combinations. Renamed MatrixCovarianceJob() to TimesSelfJob() to improve clarity, per Sebastian's suggestion. Moved the vector argument to the distributed cache and changed its class to Vector, per Jake's suggestion. Removed MatrixRowAverageJob.java for now.
Diffs (updated)
trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1206431
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixMatrixElementwiseJob.java PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixVectorElementwiseJob.java PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/math/hadoop/TimesSelfJob.java PRE-CREATION
trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1206431
Thanks,
Raphael
Another problem I've seen in some places is to just pick one of the values when there is an overlap. Options would be to pick the left one, or randomly choose one.
Hi Lance. Sorry, I don't follow you. Could you expand a bit on this? Is this in response to the issue regarding heavily loading the reducer or something else?
Oops sorry. This is about the set of pairwise operators available when you combine two or more matrices: plus, minus, mean, etc. Another use case is to just use one of the values.
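As a rough illustration of the pairwise operators mentioned here, the sketch below combines two sparse rows and applies a pluggable operator only where both rows have a value (the "overlap"). It uses plain Java maps from column index to value, not Mahout vectors, and `SparseRowCombine` and `combine` are hypothetical names. One assumption to note: entries present on only one side are copied through unchanged, which suits plus and "pick the left value" but would not be correct for minus.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

/** Sketch: combine two sparse rows with a caller-supplied rule for overlapping entries. */
final class SparseRowCombine {

  /**
   * Returns a new row holding every entry of left and right.
   * onOverlap is applied as onOverlap(leftValue, rightValue) only for columns present in both.
   * Columns present in just one row are copied unchanged (an assumption; not valid for minus).
   */
  static Map<Integer, Double> combine(Map<Integer, Double> left,
                                      Map<Integer, Double> right,
                                      BinaryOperator<Double> onOverlap) {
    Map<Integer, Double> out = new TreeMap<>(left); // copy, so the inputs stay untouched
    for (Map.Entry<Integer, Double> e : right.entrySet()) {
      out.merge(e.getKey(), e.getValue(), onOverlap);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Integer, Double> left = new TreeMap<>();
    left.put(0, 1.0);
    left.put(2, 5.0);
    Map<Integer, Double> right = new TreeMap<>();
    right.put(2, 7.0);
    right.put(3, 4.0);
    System.out.println(combine(left, right, Double::sum));   // plus on overlap
    System.out.println(combine(left, right, (a, b) -> a));   // keep the left value on overlap
  }
}
```

Passing the overlap rule as a parameter keeps a single code path for plus, pick-left, or a random choice, instead of one job per operator.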
Does Mahout have a method yet to take a large full matrix and convert it to a sparse matrix format (dropping zero values or, if it makes sense, near-zero values too)?


(Updated 2011-12-02 21:04:46.828990)
Review request for mahout, Ted Dunning, Jake Mannix, and Sebastian Schelter.
Thanks,
Raphael
I'm thinking of building this out a bit more, however first I'd be interested to hear people's thoughts on this, what methods you would find useful for DistributedRowMatrix, and your own use cases.
Personally, I've found that the DistributedRowMatrix and MatrixMultiplicationJob classes provide a great foundation for writing MapReduce jobs involving matrices. I think adding a few basic matrix operations, as suggested by Wangda, could be very helpful, so that it's not necessary to reinvent the wheel and write MapReduce jobs from scratch for common linear operations. I also find that being able to write things like matrixA.times(matrixB) makes it easy to quickly build a process by chaining together MR jobs in a very readable form.
I'd be very interested to hear other people's thoughts on this.


(Updated 2011-12-06 00:26:13.113561)
Review request for mahout, Ted Dunning, Jake Mannix, and Sebastian Schelter.
Changes

Added jobs for calculating the column-wise row average of a DistributedRowMatrix.
Diffs (updated)
trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1210678
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixMatrixElementwiseJob.java PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMRJob.java PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixVectorElementwiseJob.java PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/math/hadoop/TimesSelfJob.java PRE-CREATION
trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1210678
Thanks,
Raphael
Hi Raphael,
I agree with you. DistributedRowMatrix is a very useful abstraction for us, and we can add many useful operations to it; the matrix multiplication and matrix transpose jobs are good examples.
I'm now working on the matrix norm. norm2 needs an SVD, which is really expensive. Is there any lightweight method that would give us the largest singular value?
Thanks,
Wangda
There are the beginnings of single-machine out-of-core SVD operations in MAHOUT-792.
Hi Ted,
Thanks for your reply, I'll take a look at it
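On the question of a lightweight way to get the largest singular value: one standard technique, which is not proposed anywhere in this thread and is shown here only as an illustration, is power iteration on A^T A. It needs nothing beyond repeated products of the form A^T (A v), so it avoids a full SVD. The sketch below uses plain Java arrays; `LargestSingularValue` and its methods are hypothetical names, and the fixed all-ones starting vector is an assumption made to keep the example deterministic (a random start is the usual choice, since a start orthogonal to the leading singular vector would not converge to it).

```java
import java.util.Arrays;

/** Sketch: estimate the largest singular value of A by power iteration on A^T A. */
final class LargestSingularValue {

  /** Returns A v. */
  static double[] times(double[][] a, double[] v) {
    double[] out = new double[a.length];
    for (int r = 0; r < a.length; r++) {
      for (int c = 0; c < v.length; c++) {
        out[r] += a[r][c] * v[c];
      }
    }
    return out;
  }

  /** Returns A^T u. */
  static double[] transposeTimes(double[][] a, double[] u) {
    double[] out = new double[a[0].length];
    for (int r = 0; r < a.length; r++) {
      for (int c = 0; c < out.length; c++) {
        out[c] += a[r][c] * u[r];
      }
    }
    return out;
  }

  /** Euclidean norm of v. */
  static double norm(double[] v) {
    double sum = 0.0;
    for (double x : v) {
      sum += x * x;
    }
    return Math.sqrt(sum);
  }

  /** Runs a fixed number of iterations v <- A^T A v / |A^T A v| and returns |A v|. */
  static double estimate(double[][] a, int iterations) {
    int cols = a[0].length;
    double[] v = new double[cols];
    Arrays.fill(v, 1.0 / Math.sqrt(cols)); // deterministic unit-length start (assumption)
    for (int k = 0; k < iterations; k++) {
      double[] w = transposeTimes(a, times(a, v));
      double len = norm(w);
      if (len == 0.0) {
        return 0.0; // A maps the current vector to zero; nothing to normalize
      }
      for (int i = 0; i < cols; i++) {
        v[i] = w[i] / len;
      }
    }
    return norm(times(a, v));
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {3, 4}};
    System.out.println(estimate(a, 100));
  }
}
```

Convergence speed depends on the gap between the two largest singular values, so the iteration count would need tuning for real data.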
I think the rowMeans approach is still suboptimal for my use case (MAHOUT-817). It is possible I don't understand something about DRM, though.
Setting up a DRM as a solver requires knowing the number of rows and columns. This is technically never required for any operation in PCA (including colMeans()), and in many cases it is also impractical, as previous pipeline jobs don't necessarily calculate those.
Nor does SSVD require prior knowledge of the matrix dimensions.
Ideally, in the PCA flow we want to compute pairs (numRows, sumRows) for each reducer output and then have a front-end routine finish reducing those to just one mean row.
Ideally, to optimize this, I guess DRM should have a notion that dimensions (or whatever other parameters inside the solver) may not be initially known. When this happens, the first operation in the pipeline (whatever it happens to be) may also employ standard strategies to come up with those in the end.
Similarly, there's a "post-step" strategy concept: using the output and some additional parameters, you can reassemble the required knowledge (such as a mean or the small result of a multiplication) in a post step by recombining the results of all reducers or separate factors of the computation (if it happens to be a small product in the end).
This is a fundamental technique in SSVD (and seems to become even more prominent with PCA efficiency tricks).
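The (numRows, sumRows) idea described above can be sketched with plain Java arrays. Each reducer (or combiner) would emit one partial result for the rows it saw, and a front-end step merges the partials and divides once at the end, so the total row count never has to be known up front. `PartialRowSum` and its methods are hypothetical names, not Mahout API, and the code assumes all rows have the same number of columns.

```java
import java.util.Arrays;

/** Sketch: a (numRows, sumRows) pair that can be merged without knowing the global row count. */
final class PartialRowSum {
  final long numRows;
  final double[] sumRows;

  PartialRowSum(long numRows, double[] sumRows) {
    this.numRows = numRows;
    this.sumRows = sumRows;
  }

  /** What a single reducer would emit for its share of the rows. */
  static PartialRowSum of(double[][] rows) {
    if (rows.length == 0) {
      return new PartialRowSum(0, new double[0]);
    }
    double[] sum = new double[rows[0].length];
    for (double[] row : rows) {
      for (int j = 0; j < sum.length; j++) {
        sum[j] += row[j];
      }
    }
    return new PartialRowSum(rows.length, sum);
  }

  /** Front-end step: adds counts and sums; empty partials are skipped. */
  PartialRowSum merge(PartialRowSum other) {
    if (numRows == 0) {
      return other;
    }
    if (other.numRows == 0) {
      return this;
    }
    double[] sum = sumRows.clone();
    for (int j = 0; j < sum.length; j++) {
      sum[j] += other.sumRows[j];
    }
    return new PartialRowSum(numRows + other.numRows, sum);
  }

  /** Final division, done once after all partials are merged. */
  double[] mean() {
    double[] m = new double[sumRows.length];
    for (int j = 0; j < m.length; j++) {
      m[j] = sumRows[j] / numRows;
    }
    return m;
  }

  public static void main(String[] args) {
    PartialRowSum a = PartialRowSum.of(new double[][] {{1, 2}, {3, 4}});
    PartialRowSum b = PartialRowSum.of(new double[][] {{5, 6}});
    System.out.println(Arrays.toString(a.merge(b).mean()));
  }
}
```

Because merging is associative, the same pair type could serve as combiner output on each node, which is also relevant to the earlier concern about a single-reducer bottleneck.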
Thanks Dmitry. I've pulled the row mean job out as a separate issue under MAHOUT-923. Could you please take a look?
Moving this to the Backlog. I think people should create separate issues for all new methods that should be introduced.
I also think it could be useful to add support for a few more standard matrix operations to DistributedRowMatrix. Here's a patch with a few operations to start with. Is there broader interest in this?