Mahout / MAHOUT-880

Add some matrix methods (like addition, subtraction, norm, etc.) to DistributedRowMatrix

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Component/s: Math

      Description

      I'm new to Mahout, and I couldn't find some basic matrix functions. Without them, users cannot do many tasks through the CLI or API: if a user gets a result from an existing map-reduce matrix operation (like SVD), he cannot take any further steps with it. I made a list:
      1) Addition, subtraction
      2) Norm (like norm-1, norm-2, Frobenius norm)
      3) Matrix comparison
      4) Get lower triangle, upper triangle and diagonal
      5) Get identity and zero matrices
      6) Put two or more matrices together: A = [A1, A2]
      7) More linear equation solver methods, like Gaussian elimination (maybe hard to implement)
      8) Import and export of CSV, ARFF ... (this will be very useful when users want to exchange results with other applications like MATLAB)
      I want to know whether there is any plan to do this; if so, I can make some effort to implement these.
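
To make item 2 concrete, here is a minimal plain-Java sketch of how the Frobenius norm and norm-1 can each be accumulated in a single pass over the rows of a matrix. It does not use Mahout or Hadoop; the class `RowStreamNorms` and its methods are hypothetical names for illustration, and rows are assumed to be dense `double[]` arrays of equal length.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Mahout: norms accumulated over a matrix read row by row. */
final class RowStreamNorms {

    /** Frobenius norm: square root of the sum of all squared entries. */
    static double frobenius(Iterable<double[]> rows) {
        double sumOfSquares = 0.0;
        for (double[] row : rows) {
            for (double value : row) {
                sumOfSquares += value * value;
            }
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Norm-1: the largest column sum of absolute values. */
    static double normOne(Iterable<double[]> rows, int numCols) {
        double[] columnSums = new double[numCols];
        for (double[] row : rows) {
            for (int j = 0; j < numCols; j++) {
                columnSums[j] += Math.abs(row[j]);
            }
        }
        double max = 0.0;
        for (double sum : columnSums) {
            max = Math.max(max, sum);
        }
        return max;
    }

    public static void main(String[] args) {
        List<double[]> rows = Arrays.asList(new double[] {1, -2}, new double[] {3, 4});
        System.out.println(frobenius(rows));
        System.out.println(normOne(rows, 2));
    }
}
```

Both quantities are sums over rows, so in a map-reduce setting the per-row contributions could be computed by mappers and added up afterwards. Norm-2 does not decompose this way.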

      1. MAHOUT-880.patch
        47 kB
        Raphael Cendrillon
      2. MAHOUT-880.patch
        34 kB
        Raphael Cendrillon
      3. MAHOUT-880.patch
        26 kB
        Raphael Cendrillon

        Activity

        Sebastian Schelter added a comment -

        Moving this to the Backlog. I think people should create separate issues for all new methods that should be introduced.

        Raphael Cendrillon added a comment -

        Thanks Dmitriy. I've pulled the row mean job out as a separate issue under MAHOUT-923. Could you please take a look?

        Dmitriy Lyubimov added a comment - - edited

        Ideally, to optimize this, I guess DRM had better have a notion that dimensions (or whatever other parameters inside the solver) may not be initially known. When this happens, the first operation in the pipeline (whatever it happens to be) may also employ standard strategies to come up with those in the end.

        Similarly, there's a "post-step" strategy concept: using the output and some additional parameters, you can re-assemble the required knowledge (such as a mean, or the small result of a multiplication) in a post step by re-combining the results of all reducers or separate factors of the computation (if it happens to be a small product in the end).

        This is a fundamental technique in SSVD (and seems to become even more prominent with PCA efficiency tricks).

        Dmitriy Lyubimov added a comment -

        I think the rowMeans approach is still suboptimal for my use case (MAHOUT-817). It is possible I don't understand something about DRM, though.

        The DRM formation as a solver requires knowledge of the number of rows and the number of columns. This is technically never required for any operation in PCA (including colMeans()), and in many cases it is also impractical, as previous pipeline jobs don't necessarily calculate those.

        Nor does SSVD require preliminary knowledge of matrix dimensions.

        Ideally, in the PCA flow we want to compute pairs (numRows, sumRows) for each reducer output and then have a front-end routine finish reducing that to just one mean row.
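
The (numRows, sumRows) idea above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not code from the patch or from Mahout: `MeanRowSketch`, `Partial`, `summarize` and `finish` are hypothetical names, each "reducer" is simulated by a list of dense rows, and all rows are assumed to have the same width.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: mean row from per-reducer (numRows, sumRows) pairs. */
final class MeanRowSketch {

    /** What one reducer would emit: how many rows it saw and their element-wise sum. */
    static final class Partial {
        final long numRows;
        final double[] sumRows;

        Partial(long numRows, double[] sumRows) {
            this.numRows = numRows;
            this.sumRows = sumRows;
        }
    }

    /** Per-reducer step: the row width is discovered from the data, not passed in. */
    static Partial summarize(List<double[]> rows) {
        double[] sum = null;
        long count = 0;
        for (double[] row : rows) {
            if (sum == null) {
                sum = new double[row.length];
            }
            for (int j = 0; j < row.length; j++) {
                sum[j] += row[j];
            }
            count++;
        }
        return new Partial(count, sum == null ? new double[0] : sum);
    }

    /** Front-end step: merge the small partial results into a single mean row. */
    static double[] finish(List<Partial> partials) {
        double[] total = null;
        long count = 0;
        for (Partial p : partials) {
            if (p.numRows == 0) {
                continue; // an empty reducer output contributes nothing
            }
            if (total == null) {
                total = new double[p.sumRows.length];
            }
            for (int j = 0; j < total.length; j++) {
                total[j] += p.sumRows[j];
            }
            count += p.numRows;
        }
        if (total == null) {
            throw new IllegalArgumentException("no rows in any partial result");
        }
        for (int j = 0; j < total.length; j++) {
            total[j] /= count;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Partial> partials = new ArrayList<>();
        partials.add(summarize(Arrays.asList(new double[] {1, 2}, new double[] {3, 4})));
        partials.add(summarize(Arrays.asList(new double[] {5, 6})));
        System.out.println(Arrays.toString(finish(partials))); // prints [3.0, 4.0]
    }
}
```

Neither step needs the matrix dimensions up front: the row count falls out of the partial results, and the final merge touches only one small pair per reducer.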

        Wangda Tan added a comment -

        Hi Ted,
        Thanks for your reply, I'll take a look at it.

        Ted Dunning added a comment -

        There are the beginnings of single-machine out-of-core SVD operations in MAHOUT-792.

        Wangda Tan added a comment -

        Hi Raphael,
        I agree with you: DistributedRowMatrix is a very useful abstract component for us, and we can add many useful operations to it; the matrix multiplication and matrix transpose jobs are good examples.
        I'm now working on the matrix norm. The norm-2 needs an SVD operation, which is really expensive. Is there any lightweight method that would let us get the largest singular value?
        Thanks,
        Wangda
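
One lightweight option, which is not proposed anywhere in this thread and is offered here only as a possible answer, is power iteration on A^T A: its largest eigenvalue is the square of the largest singular value of A, and each step needs only a product of the form A^T (A v), which can be accumulated one row at a time. The plain-Java sketch below uses hypothetical names (`PowerIterationSketch`, `timesSquared`, `largestSingularValue`), a dense in-memory matrix, and a fixed iteration count instead of a convergence test.

```java
import java.util.Random;

/** Hypothetical sketch: largest singular value via power iteration on A^T A. */
final class PowerIterationSketch {

    /** y = A^T (A v), accumulated row by row as the sum of a_i * (a_i . v). */
    static double[] timesSquared(double[][] a, double[] v) {
        double[] y = new double[v.length];
        for (double[] row : a) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += row[j] * v[j];
            }
            for (int j = 0; j < v.length; j++) {
                y[j] += dot * row[j];
            }
        }
        return y;
    }

    /** Euclidean length of a vector. */
    static double norm(double[] x) {
        double sumOfSquares = 0.0;
        for (double value : x) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    static double largestSingularValue(double[][] a, int iterations, long seed) {
        int n = a[0].length;
        // Random start so the vector is very unlikely to be orthogonal to the top singular vector.
        Random random = new Random(seed);
        double[] v = new double[n];
        for (int j = 0; j < n; j++) {
            v[j] = random.nextGaussian();
        }
        double startNorm = norm(v);
        for (int j = 0; j < n; j++) {
            v[j] /= startNorm;
        }
        double lambda = 0.0;
        for (int it = 0; it < iterations; it++) {
            double[] y = timesSquared(a, v);
            lambda = norm(y); // with |v| = 1 this approaches the top eigenvalue of A^T A
            if (lambda == 0.0) {
                return 0.0; // A v vanished, for example when A is the zero matrix
            }
            for (int j = 0; j < n; j++) {
                v[j] = y[j] / lambda;
            }
        }
        return Math.sqrt(lambda);
    }

    public static void main(String[] args) {
        double[][] a = {{3, 0}, {0, 1}};
        System.out.println(largestSingularValue(a, 100, 42L));
    }
}
```

Convergence slows down when the two largest singular values are close together, so a real job would need a stopping criterion rather than a fixed number of passes.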

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/
        -----------------------------------------------------------

        (Updated 2011-12-06 00:26:13.113561)

        Review request for mahout, Ted Dunning, Jake Mannix, and Sebastian Schelter.

        Changes
        -------

        Added jobs for calculating column-wise row average of a DistributedRowMatrix

        Summary
        -------

        Jobs for matrix-vector addition, covariance matrix calculation and row average calculation in DistributedRowMatrix

        This addresses bug MAHOUT-880.
        https://issues.apache.org/jira/browse/MAHOUT-880

        Diffs (updated)


        trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1210678
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixMatrixElementwiseJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMRJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixVectorElementwiseJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/TimesSelfJob.java PRE-CREATION
        trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1210678

        Diff: https://reviews.apache.org/r/2955/diff

        Testing
        -------

        Junit tests for each job

        Thanks,

        Raphael

        Raphael Cendrillon added a comment -

        I'm thinking of building this out a bit more; however, first I'd be interested to hear people's thoughts on this, what methods you would find useful for DistributedRowMatrix, and your own use cases.

        Personally I've found that the DistributedRowMatrix and MatrixMultiplicationJob classes provide a great foundation for writing MapReduce jobs involving matrices. I think adding a few basic matrix operations, as suggested by Wangda, could be very helpful so that it's not necessary to reinvent the wheel / write MapReduce jobs from scratch when doing common linear operations. I also find that being able to do things like matrixA.times(matrixB) makes it easy to quickly build a process by chaining together MR jobs in a very readable form.

        I'd be very interested to hear other people's thoughts on this.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/
        -----------------------------------------------------------

        (Updated 2011-12-02 21:04:46.828990)

        Review request for mahout, Ted Dunning, Jake Mannix, and Sebastian Schelter.

        Summary
        -------

        Jobs for matrix-vector addition, covariance matrix calculation and row average calculation in DistributedRowMatrix

        This addresses bug MAHOUT-880.
        https://issues.apache.org/jira/browse/MAHOUT-880

        Diffs


        trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1206431
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixMatrixElementwiseJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixVectorElementwiseJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/TimesSelfJob.java PRE-CREATION
        trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1206431

        Diff: https://reviews.apache.org/r/2955/diff

        Testing
        -------

        Junit tests for each job

        Thanks,

        Raphael

        Dan Brickley added a comment -

        Does Mahout yet have a method to take a large full matrix and convert it to sparse matrix format (losing zero values or perhaps, if it makes sense, near-zero values also...)?

        Lance Norskog added a comment -

        Oops sorry. This is about the set of pairwise operators available when you combine two or more matrices: plus, minus, mean, etc. Another use case is to just use one of the values.
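
A set of pairwise operators like this can be expressed as one element-wise routine with a pluggable operator. The plain-Java sketch below is an assumption-laden illustration, not the patch's MatrixMatrixElementwiseJob: `ElementwiseSketch`, `combine` and the operator constants are hypothetical names, and both matrices are dense arrays of the same shape, so every entry "overlaps". A sparse version would also need a rule for entries present on one side only.

```java
import java.util.function.DoubleBinaryOperator;

/** Hypothetical sketch: combine two equally shaped matrices entry by entry. */
final class ElementwiseSketch {

    static final DoubleBinaryOperator PLUS = (a, b) -> a + b;
    static final DoubleBinaryOperator MINUS = (a, b) -> a - b;
    static final DoubleBinaryOperator MEAN = (a, b) -> (a + b) / 2.0;
    static final DoubleBinaryOperator LEFT = (a, b) -> a; // keep the left value

    static double[][] combine(double[][] left, double[][] right, DoubleBinaryOperator op) {
        if (left.length != right.length) {
            throw new IllegalArgumentException("row counts differ");
        }
        double[][] result = new double[left.length][];
        for (int i = 0; i < left.length; i++) {
            if (left[i].length != right[i].length) {
                throw new IllegalArgumentException("column counts differ in row " + i);
            }
            result[i] = new double[left[i].length];
            for (int j = 0; j < left[i].length; j++) {
                result[i][j] = op.applyAsDouble(left[i][j], right[i][j]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}};
        double[][] b = {{3, 5}};
        System.out.println(java.util.Arrays.toString(combine(a, b, MEAN)[0]));
    }
}
```

Because each output row depends only on the matching pair of input rows, the same operator could be applied independently per row key in a distributed job.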

        Raphael Cendrillon added a comment -

        Hi Lance. Sorry, I don't follow you. Could you expand a bit on this? Is this in response to the issue regarding heavily loading the reducer or something else?

        Lance Norskog added a comment -

        Another problem I've seen in some places is to just pick one of the values when there is an overlap. Options would be to pick the left one, or randomly choose one.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/
        -----------------------------------------------------------

        (Updated 2011-12-01 08:39:37.868935)

        Review request for Ted Dunning, Jake Mannix and Sebastian Schelter.

        Changes
        -------

        A fair bit of refactoring. Added plus() and minus() methods for Matrix-Matrix and Matrix-Vector combinations. Renamed MatrixCovarianceJob() to TimesSelfJob() to improve clarity per Sebastian's suggestion. Moved vector argument to distributed cache and changed class to Vector per Jake's suggestion. Removed MatrixRowAverageJob.java for now.

        Summary
        -------

        Jobs for matrix-vector addition, covariance matrix calculation and row average calculation in DistributedRowMatrix

        This addresses bug MAHOUT-880.
        https://issues.apache.org/jira/browse/MAHOUT-880

        Diffs (updated)


        trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1206431
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixMatrixElementwiseJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixVectorElementwiseJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/TimesSelfJob.java PRE-CREATION
        trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1206431

        Diff: https://reviews.apache.org/r/2955/diff

        Testing
        -------

        Junit tests for each job

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        On 2011-11-29 19:56:51, Jake Mannix wrote:
        > trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java, line 116
        > <https://reviews.apache.org/r/2955/diff/1/?file=60411#file60411line116>
        >
        > This will force a huge bottleneck of one reducer, will it not?

        Thanks for the feedback Jake, it's really appreciated! I think the load will be distributed somewhat by the combiner at each node. Do you still think this will cause too much of a bottleneck?

        Do you have any suggestions for a better way to implement this?

        • Raphael

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/#review3562
        -----------------------------------------------------------

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/#review3562
        -----------------------------------------------------------

        trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java
        <https://reviews.apache.org/r/2955/#comment7976>

        I'm not sure about this method: you take in a DistributedRowMatrix, which by design is a big, huge SequenceFile<IntWritable,VectorWritable>. Why don't you just take in a Vector, put that in the DistributedCache (or even serialize it into the Configuration, if it's small enough), and use that?

        Passing in a DistributedRowMatrix makes people assume you can put in a real full matrix.

        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java
        <https://reviews.apache.org/r/2955/#comment7977>

        This will force a huge bottleneck of one reducer, will it not?

        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java
        <https://reviews.apache.org/r/2955/#comment7978>

        I think we already have a VectorSummingReducer somewhere, we should re-use that.

        • Jake

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/
        -----------------------------------------------------------

        (Updated 2011-11-29 18:44:49.585493)

        Review request for Ted Dunning, Jake Mannix and Sebastian Schelter.

        Summary
        -------

        Jobs for matrix-vector addition, covariance matrix calculation and row average calculation in DistributedRowMatrix

        This addresses bug MAHOUT-880.
        https://issues.apache.org/jira/browse/MAHOUT-880

        Diffs


        trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1206431
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixCovarianceJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixVectorAdditionJob.java PRE-CREATION
        trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1206431

        Diff: https://reviews.apache.org/r/2955/diff

        Testing
        -------

        Junit tests for each job

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        On 2011-11-29 08:41:06, Sebastian Schelter wrote:
        > trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixCovarianceJob.java, line 119
        > <https://reviews.apache.org/r/2955/diff/1/?file=60410#file60410line119>
        >
        > Don't we have to center the rows for covariance? Am I missing something or do you assume that the data is already centered?

        Thank you for the feedback Sebastian.

        You're right, we first need to center the rows. I should rename this Job to remove confusion. In general it is just meant to compute x.transpose().times(x).

        • Raphael
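
The distinction discussed in this exchange can be sketched in plain Java: the product of a matrix's transpose with the matrix itself is a sum of outer products of its rows, and it only becomes a covariance matrix once the rows are centered and the result is scaled. The names below (`TimesSelfSketch`, `timesSelf`, `center`, `covariance`) are hypothetical, the matrix is dense and in memory, and dividing by the number of rows minus one (the sample covariance) is an assumption rather than something stated in the review.

```java
/** Hypothetical sketch: transpose-times-self versus covariance. */
final class TimesSelfSketch {

    /** Transpose of X times X, built as the sum of outer products of the rows of X. */
    static double[][] timesSelf(double[][] x) {
        int n = x[0].length;
        double[][] g = new double[n][n];
        for (double[] row : x) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    g[i][j] += row[i] * row[j];
                }
            }
        }
        return g;
    }

    /** Subtract the mean row (column-wise average) from every row. */
    static double[][] center(double[][] x) {
        int m = x.length;
        int n = x[0].length;
        double[] mean = new double[n];
        for (double[] row : x) {
            for (int j = 0; j < n; j++) {
                mean[j] += row[j] / m;
            }
        }
        double[][] centered = new double[m][n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                centered[i][j] = x[i][j] - mean[j];
            }
        }
        return centered;
    }

    /** Sample covariance (assumed scaling: m - 1); requires at least two rows. */
    static double[][] covariance(double[][] x) {
        double[][] g = timesSelf(center(x));
        int m = x.length;
        for (double[] gRow : g) {
            for (int j = 0; j < gRow.length; j++) {
                gRow[j] /= (m - 1);
            }
        }
        return g;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 2}, {3, 6}};
        System.out.println(java.util.Arrays.deepToString(timesSelf(x)));
        System.out.println(java.util.Arrays.deepToString(covariance(x)));
    }
}
```

On uncentered data the two results differ, which is why a name like TimesSelfJob describes the uncentered product more accurately than MatrixCovarianceJob.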

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/#review3552
        -----------------------------------------------------------

        On 2011-11-29 05:40:30, Raphael Cendrillon wrote:

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/
        -----------------------------------------------------------

        (Updated 2011-11-29 05:40:30)

        Review request for Jake Mannix.

        Summary
        -------

        Jobs for matrix-vector addition, covariance matrix calculation and row average calculation in DistributedRowMatrix

        This addresses bug MAHOUT-880.
        https://issues.apache.org/jira/browse/MAHOUT-880

        Diffs
        -----

        trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1206431
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixCovarianceJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixVectorAdditionJob.java PRE-CREATION
        trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1206431

        Diff: https://reviews.apache.org/r/2955/diff

        Testing
        -------

        Junit tests for each job

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/#review3552
        -----------------------------------------------------------

        I'm not seeing the centering of the rows for the covariance computation.

        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixCovarianceJob.java
        <https://reviews.apache.org/r/2955/#comment7923>

        Don't we have to center the rows for covariance? Am I missing something or do you assume that the data is already centered?

        - Sebastian


        Wangda Tan added a comment -

        Great work!
        I'm working on the norm job and will try to finish it ASAP.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2955/
        -----------------------------------------------------------

        Review request for Jake Mannix.

        Summary
        -------

        Jobs for matrix-vector addition, covariance matrix calculation and row average calculation in DistributedRowMatrix

        This addresses bug MAHOUT-880.
        https://issues.apache.org/jira/browse/MAHOUT-880

        Diffs


        trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1206431
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixCovarianceJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowAverageJob.java PRE-CREATION
        trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixVectorAdditionJob.java PRE-CREATION
        trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1206431

        Diff: https://reviews.apache.org/r/2955/diff

        Testing
        -------

        JUnit tests for each job

        Thanks,

        Raphael

        Raphael Cendrillon added a comment -

        I'll be glad to. Thanks.

        Jake Mannix added a comment -

        Hi Raphael,

        Can you create a reviewboard request for this ticket? (See MAHOUT-888 for details on how)

        Raphael Cendrillon added a comment -

        Hi Jake. If you get a chance could you take a look through the attached patch? Your feedback would be great.

        Jake Mannix added a comment -

        Many of these sound great, yes!

        I'd have one suggestion, however: DistributedRowMatrix implements the interface VectorIterable, which the interface Matrix extends. The methods you mention which are already in Matrix should just get pulled up into VectorIterable.

        Of course, this requires some careful checking that a call such as DistributedRowMatrix.minus(DenseMatrix) behaves sensibly. I would imagine this case is covered by the fact that there is no sensible reason why you would have a DistributedRowMatrix and a DenseMatrix of the exact same cardinalities (one fits in RAM, but the other needs to live on HDFS?).
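
The pull-up suggestion can be sketched with a pair of hypothetical types. `RowIterable` and `InMemoryRows` below are illustrative stand-ins, not Mahout's `VectorIterable` or `DenseMatrix`, and returning a `List` is only workable for small inputs; a distributed implementation would write its result back to storage instead. The point is that an operation declared on the narrower interface is available to every implementation, and that the cardinality check lives in one place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical narrow interface: anything that can stream its rows. */
interface RowIterable extends Iterable<double[]> {

  int numCols();

  /** Row-wise subtraction, declared once for all implementations. */
  default List<double[]> minus(RowIterable other) {
    if (numCols() != other.numCols()) {
      throw new IllegalArgumentException("column count mismatch");
    }
    List<double[]> result = new ArrayList<>();
    Iterator<double[]> left = iterator();
    Iterator<double[]> right = other.iterator();
    while (left.hasNext() && right.hasNext()) {
      double[] a = left.next();
      double[] b = right.next();
      double[] diff = new double[a.length];
      for (int j = 0; j < a.length; j++) {
        diff[j] = a[j] - b[j];
      }
      result.add(diff);
    }
    if (left.hasNext() || right.hasNext()) {
      throw new IllegalArgumentException("row count mismatch");
    }
    return result;
  }
}

/** Hypothetical in-memory implementation used only for this illustration. */
final class InMemoryRows implements RowIterable {
  private final List<double[]> rows;
  private final int numCols;

  InMemoryRows(double[][] data) {
    this.rows = Arrays.asList(data);
    this.numCols = data.length == 0 ? 0 : data[0].length;
  }

  @Override
  public int numCols() {
    return numCols;
  }

  @Override
  public Iterator<double[]> iterator() {
    return rows.iterator();
  }
}

final class PullUpDemo {
  public static void main(String[] args) {
    RowIterable a = new InMemoryRows(new double[][] {{5, 7}, {9, 11}});
    RowIterable b = new InMemoryRows(new double[][] {{1, 2}, {3, 4}});
    for (double[] row : a.minus(b)) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```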

        Regarding some of these methods: 4) I'm not sure about - do we have uses for these? If you have a DistributedRowMatrix: a humongous HDFS SequenceFile of Vectors, what exactly are you going to do with the upper triangle of it? Diagonal I can see, I guess. Extract a vector of the diagonal from the whole distributed matrix, sure.
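
Extracting the diagonal is cheap on row-oriented data because row i contributes only its own element i, so each row can be handled independently and in any order. The following is a minimal in-memory sketch under that assumption; `DiagonalSketch` is a hypothetical name, and a map from row index to a dense array stands in for the keyed rows of a distributed matrix.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of diagonal extraction from index-keyed rows. */
final class DiagonalSketch {

  /**
   * Collects element (i, i) from each row keyed by index i.
   * Rows that are absent or too short leave a zero in the result.
   */
  static double[] diagonal(Map<Integer, double[]> rows, int size) {
    double[] diag = new double[size];
    for (Map.Entry<Integer, double[]> entry : rows.entrySet()) {
      int i = entry.getKey();
      double[] row = entry.getValue();
      if (i >= 0 && i < size && i < row.length) {
        diag[i] = row[i];
      }
    }
    return diag;
  }

  public static void main(String[] args) {
    // Rows deliberately arrive out of order; the result does not depend on it.
    Map<Integer, double[]> rows = new LinkedHashMap<>();
    rows.put(0, new double[] {1, 2, 3});
    rows.put(2, new double[] {7, 8, 9});
    rows.put(1, new double[] {4, 5, 6});
    System.out.println(Arrays.toString(diagonal(rows, 3)));
  }
}
```

The result has only as many entries as the matrix has rows, so it is far smaller than the matrix itself, in contrast to an upper or lower triangle, which stays roughly half the size of the original.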

        6) is actually being looked at in MAHOUT-884

        7) we like solvers, yes, but the methods don't go in our matrix classes, they go in separate solver classes, and take matrix (or DistributedRowMatrix) as inputs.

        8) is also good, and we'd always like more I/O hooks, but again, these should be in other classes, and in some ways they already exist: VectorDumper allows the option of dumping a DistributedRowMatrix from SequenceFile to CSV, and I think we have some support for ARFF as well, somewhere.

        Raphael Cendrillon added a comment -

        I also think it could be useful to add support for a few more standard matrix operations to DistributedRowMatrix. Here's a patch with a few operations to start with. Is there broader interest in this?


          People

          • Assignee: Unassigned
          • Reporter: Wangda Tan
          • Votes: 0
          • Watchers: 1