Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Component/s: Math
    • Labels: None

      Description

      Add a MapReduce job for calculating the mean row (column-wise mean) of a DistributedRowMatrix, for use in PCA.
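      A minimal usage sketch of what this could look like from the caller's side. The columnMeans() method name follows the naming settled on in the discussion below, but the exact signature and the constructor arguments used here are assumptions for illustration, not the committed API.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.mahout.math.Vector;
      import org.apache.mahout.math.hadoop.DistributedRowMatrix;

      public class ColumnMeansExample {
        public static void main(String[] args) throws Exception {
          // Rows are stored as a SequenceFile of <IntWritable, VectorWritable>.
          DistributedRowMatrix matrix = new DistributedRowMatrix(
              new Path("/data/drm"), new Path("/tmp/drm-work"), 1000000, 100);
          matrix.setConf(new Configuration());

          // Column-wise mean: one entry per column, usable for mean-centering before PCA.
          Vector columnMeans = matrix.columnMeans();
          System.out.println("column means: " + columnMeans);
        }
      }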

      1. MAHOUT-923.patch
        26 kB
        Raphael Cendrillon
      2. MAHOUT-923.patch
        8 kB
        Raphael Cendrillon
      3. MAHOUT-923.patch
        7 kB
        Raphael Cendrillon

        Activity

        Sebastian Schelter made changes -
        Fix Version/s Backlog [ 12318886 ]
        Raphael Cendrillon made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Raphael Cendrillon added a comment -

        Marking this as closed as the patch has been integrated into MAHOUT-817.

        Raphael Cendrillon made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Raphael Cendrillon added a comment -

        Marking this as closed as the patch has been integrated into MAHOUT-817.

        Raphael Cendrillon made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Dmitriy Lyubimov added a comment -

        Sorry, I was/am busy with various performance tweaks, so I probably will get back to PCA no sooner than New Year's Eve. (At which point I would be a really sore loser, to spend a family holiday on patching Mahout, wouldn't I?)

        Dmitriy Lyubimov added a comment -

        Raphael, thank you for seeing this through.

        Q:
        1) – Why do you need a vector class for the accumulator now? The mean is expected to be dense in the end, if not in the mappers then at least in the reducer. And secondly, if you want to do this, why doesn't your API accept a class instance rather than a "short" name? That would be consistent with the Hadoop Job and file-format APIs, which take classes, not strings.

        2) – I know you have a unit test, but did you test it on a simulated input, say 2 GB big? If not, I will have to test it before you proceed.

        As a next step, I guess I need to try it out to see if it works on various kinds of inputs.
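        A small sketch of the class-instance style of API Dmitriy is describing. The method below is hypothetical and only illustrates passing Class objects (as Hadoop's Job APIs do) instead of string names.

        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.RandomAccessSparseVector;
        import org.apache.mahout.math.Vector;

        public final class VectorClassParamExample {

          // The caller names the concrete Vector type it wants back; no string parsing needed.
          static Vector newAccumulator(Class<? extends Vector> vectorClass, int cardinality)
              throws ReflectiveOperationException {
            return vectorClass.getConstructor(int.class).newInstance(cardinality);
          }

          public static void main(String[] args) throws Exception {
            Vector dense = newAccumulator(DenseVector.class, 100);
            Vector sparse = newAccumulator(RandomAccessSparseVector.class, 100);
            System.out.println(dense.getClass().getSimpleName() + " / " + sparse.getClass().getSimpleName());
          }
        }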

        Raphael Cendrillon added a comment -

        Are there any more changes needed at this point? What's the next step from here?

        Raphael Cendrillon made changes -
        Attachment MAHOUT-923.patch [ 12507802 ]
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-17 20:50:42.776447)

        Review request for mahout, Ted Dunning, lancenorskog, and Dmitriy Lyubimov.

        Changes
        -------

        Correct version.

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.
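        A rough sketch of what such a combiner-friendly Writable could look like. This class does not exist in the patch; the layout below is an assumption based on the description above.

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import org.apache.hadoop.io.Writable;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.math.VectorWritable;

        // Hypothetical tuple of (row count, column-wise sum) for use in a combiner.
        public class IntVectorTupleWritable implements Writable {

          private int rowCount;
          private final VectorWritable columnSums = new VectorWritable();

          public IntVectorTupleWritable() {}

          public IntVectorTupleWritable(int rowCount, Vector columnSums) {
            this.rowCount = rowCount;
            this.columnSums.set(columnSums);
          }

          public int getRowCount() { return rowCount; }

          public Vector getColumnSums() { return columnSums.get(); }

          @Override
          public void write(DataOutput out) throws IOException {
            out.writeInt(rowCount);
            columnSums.write(out);
          }

          @Override
          public void readFields(DataInput in) throws IOException {
            rowCount = in.readInt();
            columnSums.readFields(in);
          }
        }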

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1215567
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1215567

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-17 20:46:39.068863)

        Review request for mahout, Ted Dunning, lancenorskog, and Dmitriy Lyubimov.

        Changes
        -------

        Added javadocs

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1215567
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1215567

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-17 20:38:27.774380)

        Review request for mahout, Ted Dunning, lancenorskog, and Dmitriy Lyubimov.

        Changes
        -------

        Added javadocs

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1215567
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1215567

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-17 20:32:34.158332)

        Review request for mahout, Ted Dunning, lancenorskog, and Dmitriy Lyubimov.

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs


        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3916
        -----------------------------------------------------------

        Looks much better from the trivial formatting standpoint.

        • Ted

        On 2011-12-13 17:53:35, Raphael Cendrillon wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3147/

        -----------------------------------------------------------

        (Updated 2011-12-13 17:53:35)

        Review request for mahout, lancenorskog and Dmitriy Lyubimov.

        Summary

        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.

        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs

        -----

        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing

        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-13 17:53:35.691333)

        Review request for mahout, lancenorskog and Dmitriy Lyubimov.

        Changes
        -------

        Changed method name to columnMeans(). Removed trailing whitespaces.

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs (updated)


        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        Raphael Cendrillon added a comment -

        Eclipse. I've added the provided format template, but sometimes whitespace still pops up. I'll look into this some more. It's driving me crazy.

        Ted Dunning added a comment -

        For getting rid of trailing white space, most IDEs have this function built in.

        What are you using to write this?

        jiraposter@reviews.apache.org added a comment -

        On 2011-12-13 13:08:20, Ted Dunning wrote:

        > /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java, line 199

        > <https://reviews.apache.org/r/3147/diff/5/?file=64279#file64279line199>

        >

        > I would really rather use standard terminology here.

        >

        > A mean row is a row that is the average of all others, but a row mean would mean the average of the elements of a single row. The plural form, row means, indicates the means of all rows. What you are computing are the means of every column.

        >

        > In contrast, R, Octave and Matlab all use columnMeans as the name of the function being implemented here.

        Sure. In Matlab/Octave I'm used to mean(A,1) (it takes the mean across the 1st dimension, i.e. across rows, but done per column). I'll change this to colMeans(), which seems clearer.

        On 2011-12-13 13:08:20, Ted Dunning wrote:

        > /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java, lines 129-132

        > <https://reviews.apache.org/r/3147/diff/5/?file=64280#file64280line129>

        >

        > There are lots of lines with trailing white space. Isn't this easily suppressed?

        I can use sed, or perhaps there's a better way?

        • Raphael

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3874
        -----------------------------------------------------------

        On 2011-12-13 04:46:47, Raphael Cendrillon wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3147/

        -----------------------------------------------------------

        (Updated 2011-12-13 04:46:47)

        Review request for mahout, lancenorskog and Dmitriy Lyubimov.

        Summary

        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.

        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs

        -----

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION

        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing

        -------

        Junit test

        Thanks,

        Raphael

        Sean Owen added a comment -

        clone() can return what it likes, though it is intended to return an object of the same class. So it could do what you describe "legally", even if it generally doesn't or perhaps shouldn't. What's the flaw that Lance was alluding to? I am aware of sins in implementation of clone() but nothing that was actually causing a problem.

        Ted Dunning added a comment -

        Sean,

        Clone can only return exactly the same type. This is a real problem sometimes. For example, view matrices should not return view matrices but should return something of the type of the underlying matrix, but in the right size.

        The issue is analogous to the problem with constructors compared to factory methods. With constructors, you have already defined the return type and you may not know enough to really choose the correct return type. With factory methods, the framework is free to give you anything that satisfies the basic contract.
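        A plain-Java illustration of the distinction (hypothetical classes, not Mahout's actual Matrix hierarchy): clone() is tied to the receiver's own class, while a like()-style factory is free to return whatever concrete type best fits the result.

        // Hypothetical classes for illustration only.
        abstract class BaseMatrix implements Cloneable {
          // Factory method: implementations may return any concrete type that satisfies the contract.
          abstract BaseMatrix like(int rows, int cols);
        }

        class SimpleDenseMatrix extends BaseMatrix {
          SimpleDenseMatrix(int rows, int cols) { /* allocate rows x cols */ }

          @Override
          SimpleDenseMatrix like(int rows, int cols) {
            return new SimpleDenseMatrix(rows, cols);
          }

          @Override
          public SimpleDenseMatrix clone() {
            try {
              return (SimpleDenseMatrix) super.clone();  // always the receiver's own class
            } catch (CloneNotSupportedException e) {
              throw new AssertionError(e);
            }
          }
        }

        class SimpleMatrixView extends BaseMatrix {
          private final BaseMatrix underlying;

          SimpleMatrixView(BaseMatrix underlying) { this.underlying = underlying; }

          // clone() would have to produce another view; like() can hand back a matrix
          // of the underlying type, sized for the result, which is the point above.
          @Override
          BaseMatrix like(int rows, int cols) {
            return underlying.like(rows, cols);
          }
        }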

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3874
        -----------------------------------------------------------

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java
        <https://reviews.apache.org/r/3147/#comment8701>

        I would really rather use standard terminology here.

        A mean row is a row that is the average of all others, but a row mean would mean the average of the elements of a single row. The plural form, row means, indicates the means of all rows. What you are computing are the means of every column.

        In contrast, R, Octave and Matlab all use columnMeans as the name of the function being implemented here.

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java
        <https://reviews.apache.org/r/3147/#comment8702>

        There are lots of lines with trailing white space. Isn't this easily suppressed?

        • Ted

        On 2011-12-13 04:46:47, Raphael Cendrillon wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3147/

        -----------------------------------------------------------

        (Updated 2011-12-13 04:46:47)

        Review request for mahout, lancenorskog and Dmitriy Lyubimov.

        Summary

        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.

        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs

        -----

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION

        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing

        -------

        Junit test

        Thanks,

        Raphael

        Sean Owen added a comment -

        Lance, what are the clone() "flaws" you're talking about? I'm not aware of anything.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-13 04:46:47.630950)

        Review request for mahout, lancenorskog and Dmitriy Lyubimov.

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        Raphael Cendrillon added a comment -

        Thanks Lance. I've updated this on reviewboard. Could you please take a look?

        Lance Norskog added a comment -

        The right way to set the vector class is to use Vector.like() instead of clone(). Then, copy the old vector into the new vector. clone() is basically bogus; it was a nice idea that has turned out to have flaws. Vector.like() allows any kind of Vector to make a different kind, as appropriate. And the class parameter would override that, but would not be necessary.
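        A minimal sketch of the pattern Lance describes, using Mahout's Vector.like() followed by a copy (the surrounding job context is omitted):

        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.Vector;

        public class VectorLikeExample {
          public static void main(String[] args) {
            Vector original = new DenseVector(new double[] {1.0, 2.0, 3.0});

            // like() returns an empty vector of a kind appropriate to the receiver,
            // rather than being locked to the receiver's exact class as clone() is.
            Vector copy = original.like();
            copy.assign(original);  // copy the old vector into the new vector

            System.out.println(copy);
          }
        }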

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-13 00:58:36.591798)

        Review request for mahout and Dmitriy Lyubimov.

        Changes
        -------

        Added private static final to 'one', removed clone(), added option to specify class of return vector.

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3866
        -----------------------------------------------------------

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java
        <https://reviews.apache.org/r/3147/#comment8690>

        can be private static final

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java
        <https://reviews.apache.org/r/3147/#comment8691>

        No need to clone here, see org.apache.mahout.common.mapreduce.VectorSumReducer

        • Sebastian

        On 2011-12-13 00:10:57, Raphael Cendrillon wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3147/

        -----------------------------------------------------------

        (Updated 2011-12-13 00:10:57)

        Review request for mahout and Dmitriy Lyubimov.

        Summary

        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.

        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs

        -----

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION

        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing

        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:

        > Hm. I hope I did not misread the code or miss something.

        >

        > 1 – I am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which I see no mention in the code.

        > 2 – The mappers do nothing, passing all the row pressure on to the sort, which is absolutely not necessary, even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressure of map-reduce. The reduction then becomes largely symbolic (but you do need to pass the # of rows the mapper has seen on to the reducer, in order for that operation to apply correctly).

        > 3 – I am not sure – is NullWritable as a key legit? In my experience the sequence file reader cannot instantiate it, because NullWritable is a singleton and its creation is prohibited by making the constructor private.

        Raphael Cendrillon wrote:

        Thanks Dmitriy.

        Regarding 1, if I understand correctly, the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?

        Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper? I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed? Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?

        Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?

        Dmitriy Lyubimov wrote:

        The NullWritable objection is withdrawn. Apparently I haven't looked at Hadoop in too long; amazingly, it seems to work now.

        1 – I don't think your statement about # of reduce tasks is true.

        The job (or, rather, the user) sets the number of reduce tasks via a config property. All users will follow the Hadoop recommendation to set that to 95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually guaranteed to have a number of reducers of something like 75 on a 40-node cluster, and consequently 75 output files (unless users really read the details of your job and figure out you meant it to be just 1).

        Now, it is true that only one file will actually end up having something, and the rest of the task slots will just be occupied doing nothing.

        So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared among multiple jobs; b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of the 75 files (in reality, with the default hash partitioner and key 1, it will wind up in partition 0001 unless there is one reducer, as I guess there was in your test).

        2 – It is simple: when you send n rows to reducers, they are shuffled and sorted. Sending massive sets to reducers has two effects: first, even if they all group under the same key, they are still sorted at ~ n log(n/p), where p is the number of partitions, assuming uniform distribution (which it is not, because you are sending everything to the same place). Just because we can run a distributed sort doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally, what makes your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually doing the sort in a distributed way but rather a simple single-threaded sort at the one reducer that happens to get all the input.

        So that would allocate a lot of task slots that are not used, do a sort that is not needed, and do it in a single reducer thread for the entire input, which is not parallel at all.

        Instead, consider this: the map has state consisting of (sum(X), k). It keeps updating it, sum += x, k++, for every new x. At the end of the cycle (in cleanup) it writes only one tuple, (sum, k), as output. So we have just reduced the complexity of the sort and the I/O from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, at least about 4 orders of magnitude.

        Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) = sum over i of sum_i(X); n = sum over i of n_i, where i indexes the tuples arriving at the reducer). And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. A volume of anywhere between 1 and 500 vectors to sum up doesn't warrant distributed computation.

        But you have to make sure there's only 1 reducer no matter what the user put into the config, and you have to make sure you do all the heavy lifting in the mappers.

        Finally, you don't even have to coerce to 1 reducer. You could still have several (but uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is OK in this case, IMO.
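        A sketch of the mapper-side aggregation described above, including the cleanup-time guard for the empty-input corner case raised further down. The key/value choices and class names are assumptions and the actual patch may differ; the job would also need job.setNumReduceTasks(1) so the reducer below sees every partial sum.

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.math.VectorWritable;

        public class ColumnMeansSketch {

          // Keeps (columnSum, rowCount) as mapper state; emits a single record from cleanup().
          public static class PartialSumMapper
              extends Mapper<IntWritable, VectorWritable, LongWritable, VectorWritable> {

            private Vector columnSum;
            private long rowCount;

            @Override
            protected void map(IntWritable rowIndex, VectorWritable row, Context ctx) {
              if (columnSum == null) {
                columnSum = row.get().like();   // empty vector of a compatible type
                columnSum.assign(row.get());
              } else {
                columnSum = columnSum.plus(row.get());
              }
              rowCount++;
            }

            @Override
            protected void cleanup(Context ctx) throws IOException, InterruptedException {
              if (columnSum != null) {          // a mapper that saw no rows writes nothing
                ctx.write(new LongWritable(rowCount), new VectorWritable(columnSum));
              }
            }
          }

          // Single reducer: accumulate the handful of (rowCount, partialSum) records, divide once.
          public static class MeanReducer
              extends Reducer<LongWritable, VectorWritable, IntWritable, VectorWritable> {

            private Vector totalSum;
            private long totalRows;

            @Override
            protected void reduce(LongWritable rowsInPartial, Iterable<VectorWritable> partials,
                                  Context ctx) {
              for (VectorWritable partial : partials) {
                if (totalSum == null) {
                  totalSum = partial.get().like();
                  totalSum.assign(partial.get());
                } else {
                  totalSum = totalSum.plus(partial.get());
                }
                totalRows += rowsInPartial.get();
              }
            }

            @Override
            protected void cleanup(Context ctx) throws IOException, InterruptedException {
              if (totalSum != null && totalRows > 0) {   // handle the no-input corner case
                ctx.write(new IntWritable(1), new VectorWritable(totalSum.divide(totalRows)));
              }
            }
          }
        }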

        3 – I guess any Writable is OK except NullWritable. Maybe something has changed; I remember falling into that pitfall several generations of Hadoop ago. You can verify by staging a simple experiment of writing a sequence file with NullWritable as either key or value and trying to read it back. In my test long ago it would write OK but not read back. I believe a similar approach is used with keys in shuffle and sort: there is a reflection Writable factory inside which tries to use the default constructor of the class to bring it up, which is (was) not available for NullWritable.

        Raphael Cendrillon wrote:

        Thanks Dmitriy. I've updated the diff to push the row summation into the mapper as you suggested, force the number of reducers to 1, and make the final output key IntWritable. Could you please take a look?

        Dmitriy Lyubimov wrote:

        Looks good on top of it.

        One nitpick that I have is this:

        context.write(NullWritable.get(), new VectorWritable(runningSum));

        but runningSum is initialized in the map loop, which technically may never be called (not likely, but theoretically possible).

        So I would initialize the runningSum vector in setup(), or better yet just check for null and skip the map output if there is no data.

        The same consideration applies to the reducer: it needs to correctly handle the corner case where there is no input.

        Thanks Dmitriy. I've updated the patch to add these checks.

        • Raphael

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3838
        -----------------------------------------------------------

        On 2011-12-13 00:10:57, Raphael Cendrillon wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3147/

        -----------------------------------------------------------

        (Updated 2011-12-13 00:10:57)

        Review request for mahout and Dmitriy Lyubimov.

        Summary

        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a Writable class, IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.

        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs

        -----

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION

        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing

        -------

        Junit test

        Thanks,

        Raphael

        Raphael Cendrillon made changes -
        Attachment MAHOUT-923.patch [ 12507099 ]
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-13 00:10:57.848590)

        Review request for mahout and Dmitriy Lyubimov.

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-13 00:09:03.441301)

        Review request for mahout.

        Changes
        -------

        Added checks to ensure reliable operation with null matrices

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:

        > Hm. I hope i did not read the code or miss something.

        >

        > 1 – i am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which i see no mention in the code.

        > 2 – mappers do nothing, passing on all the row pressure to the sort, which is absolutely not necessary. Even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then reduction becomes largely symbolic (but you do need to pass on the # of rows each mapper has seen to the reducer, in order for that operation to apply correctly).

        > 3 – i am not sure – is NullWritable as a key legit? In my experience sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making constructor private.

        Raphael Cendrillon wrote:

        Thanks Dmitry.

        Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), then all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?

        Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper? I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed? Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?

        Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?

        Dmitriy Lyubimov wrote:

        NullWritable objection is withdrawn. Apparently i haven't looked into hadoop for too long, amazingly it seems to work now.

        1 – I don't think your statement about # of reduce tasks is true.

        The job (or, rather, the user) sets the number of reduce tasks via a config property. All users will follow the hadoop recommendation to set that to 95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually guaranteed to have a number of reducers of something like 75 on a 40-noder, and consequently 75 output files (unless users really want to read the details of your job and figure you meant it to be just 1).

        Now, it is true that only one file will actually end up having something and the rest of task slots will just be occupied doing nothing .

        So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared with multiple jobs; b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of 75 files (in reality, with the default hash partitioner for key 1 it will wind up in partition 0001 unless there's one reducer, as i guess was the case in your test).

        2-- it is simple. when you send n rows to reducers, they are shuffled - and - sorted. Sending massive sets to reducers has 2 effects: first, even if they all group under the same key, they are still sorted with ~ n log (n/p) where p is number of partitions assuming uniform distribution (which it is not because you are sending everything to the same place). Just because we can run distributed sort, doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally what has made your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually doing sort in distributed way but rather simple single threaded sort at the reducer that happens to get all the input.

        So that would allocate a lot of task slots that are not used; do a sort that is not needed; and do it in a single reducer thread for the entire input, which is not parallel at all.

        Instead, consider this: the map has a state consisting of (sum(X), k). It keeps updating it, sum += x, k++, for every new x. At the end of the cycle (in cleanup) it writes only 1 tuple (sum, k) as output. So we just reduced the complexity of the sort and I/O from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, at least about 4 orders of magnitude.

        Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) = Σ_i sum_i(X); n = Σ_i n_i, where i ranges over the per-mapper tuples seen by the reducer). And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. That volume of anywhere between 1 and 500 vectors it sums up doesn't warrant distributed computation.

        But, you have to make sure there's only 1 reducer no matter what user put into the config, and you have to make sure you do all heavy lifting in the mappers.

        Finally, you don't even need to coerce to 1 reducer. You can still have several (but uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is ok in this case IMO.
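
        A minimal sketch (hypothetical class names, not the actual MAHOUT-923 patch) of the mapper-side accumulation described above. It assumes the matrix is stored as a SequenceFile of (IntWritable row index, VectorWritable row) pairs, packs the per-mapper row count into slot 0 of the emitted vector so no custom tuple writable is needed, and relies on the driver calling job.setNumReduceTasks(1):

        // Sketch only: column-wise mean with all heavy lifting done in the mappers.
        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.math.VectorWritable;

        public class RowMeanSketch {

          public static class RowSumMapper
              extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

            private Vector acc; // [count, sum_0, ..., sum_{n-1}], lazily sized on the first row

            @Override
            protected void map(IntWritable rowIndex, VectorWritable row, Context ctx) {
              Vector x = row.get();
              if (acc == null) {
                acc = new DenseVector(x.size() + 1);
              }
              acc.set(0, acc.get(0) + 1);                    // k++
              for (int i = 0; i < x.size(); i++) {
                acc.set(i + 1, acc.get(i + 1) + x.get(i));   // sum += x
              }
            }

            @Override
            protected void cleanup(Context ctx) throws IOException, InterruptedException {
              if (acc != null) {                             // a mapper that saw no rows emits nothing
                ctx.write(new IntWritable(1), new VectorWritable(acc));
              }
            }
          }

          // With job.setNumReduceTasks(1), the single reducer only combines one tuple per mapper.
          public static class RowMeanReducer
              extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

            @Override
            protected void reduce(IntWritable key, Iterable<VectorWritable> partials, Context ctx)
                throws IOException, InterruptedException {
              Vector total = null;
              for (VectorWritable vw : partials) {
                Vector partial = vw.get();
                total = (total == null) ? new DenseVector(partial.size()) : total;
                total = total.plus(partial);                 // adds counts (slot 0) and column sums
              }
              if (total != null && total.get(0) > 0) {       // handle the empty-input corner case
                double count = total.get(0);
                Vector mean = new DenseVector(total.size() - 1);
                for (int i = 1; i < total.size(); i++) {
                  mean.set(i - 1, total.get(i) / count);     // column-wise mean
                }
                ctx.write(new IntWritable(1), new VectorWritable(mean));
              }
            }
          }
        }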

        3 – i guess any writable is ok except NullWritable. Maybe something has changed. i remember falling into that pitfall several generations of hadoop ago. You can verify by staging a simple experiment of writing a sequence file with NullWritable as either key or value and trying to read it back. in my test long ago it would write ok but not read back. I believe a similar approach is used with keys in shuffle and sort. There is a reflection writable factory inside which tries to use the default constructor of the class to bring it up, which is (was) not available for NullWritable.
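
        A minimal sketch of that round-trip experiment (illustrative only; the local path is an assumption):

        // Write a SequenceFile keyed by NullWritable, then read it back.
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.VectorWritable;

        public class NullWritableRoundTrip {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.getLocal(conf);
            Path path = new Path("/tmp/nullwritable-roundtrip.seq");

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, NullWritable.class, VectorWritable.class);
            writer.append(NullWritable.get(),
                new VectorWritable(new DenseVector(new double[] {1.0, 2.0, 3.0})));
            writer.close();

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            NullWritable key = NullWritable.get();
            VectorWritable value = new VectorWritable();
            while (reader.next(key, value)) {   // the read step is where old Hadoop versions failed
              System.out.println(value.get());
            }
            reader.close();
          }
        }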

        Raphael Cendrillon wrote:

        Thanks Dmitriy. I've updated the diff to push the row summation into the mapper as you suggested, force the number of reducers to 1, and make the final output key IntWritable. Could you please take a look?

        looks good on top of it.
        One nitpick that i have is this:

        context.write(NullWritable.get(), new VectorWritable(runningSum));

        but runningSum is initialized in the map loop which technically may never be called (not likely but theoretically possible).

        so therefore i'd initialize runningSum vector to something that is nonzero in setup. or better yet just check for null and skip map output if no data.

        Same considerations for reducer. it needs to handle the corner case when there's no input, correctly.

        • Dmitriy

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3838
        -----------------------------------------------------------

        On 2011-12-12 10:41:46, Raphael Cendrillon wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3147/

        -----------------------------------------------------------

        (Updated 2011-12-12 10:41:46)

        Review request for mahout.

        Summary

        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner, this requires a wrtiable class IntVectorTupleWritable, where the Int stores the number of rows, and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.

        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs

        -----

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION

        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095

        Diff: https://reviews.apache.org/r/3147/diff

        Testing

        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:

        > Hm. I hope i did not read the code or miss something.

        >

        > 1 – i am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which i see no mention in the code.

        > 2 – mappers do nothing, passing on all the row pressure to the sort, which is absolutely not necessary. Even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then reduction becomes largely symbolic (but you do need to pass on the # of rows each mapper has seen to the reducer, in order for that operation to apply correctly).

        > 3 – i am not sure – is NullWritable as a key legit? In my experience sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making constructor private.

        Raphael Cendrillon wrote:

        Thanks Dmitry.

        Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), then all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?

        Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper? I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed? Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?

        Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?

        Dmitriy Lyubimov wrote:

        NullWritable objection is withdrawn. Apparently i haven't looked into hadoop for too long, amazingly it seems to work now.

        1 – I don't think your statement about # of reduce tasks is true.

        The job (or, rather, the user) sets the number of reduce tasks via a config property. All users will follow the hadoop recommendation to set that to 95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually guaranteed to have a number of reducers of something like 75 on a 40-noder, and consequently 75 output files (unless users really want to read the details of your job and figure you meant it to be just 1).

        Now, it is true that only one file will actually end up having something and the rest of task slots will just be occupied doing nothing .

        So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared with multiple jobs; b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of 75 files (in reality, with the default hash partitioner for key 1 it will wind up in partition 0001 unless there's one reducer, as i guess was the case in your test).

        2-- it is simple. when you send n rows to reducers, they are shuffled - and - sorted. Sending massive sets to reducers has 2 effects: first, even if they all group under the same key, they are still sorted with ~ n log (n/p) where p is number of partitions assuming uniform distribution (which it is not because you are sending everything to the same place). Just because we can run distributed sort, doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally what has made your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually doing sort in distributed way but rather simple single threaded sort at the reducer that happens to get all the input.

        So that would allocate a lot of task slots that are not used; do a sort that is not needed; and do it in a single reducer thread for the entire input, which is not parallel at all.

        Instead, consider this: the map has a state consisting of (sum(X), k). It keeps updating it, sum += x, k++, for every new x. At the end of the cycle (in cleanup) it writes only 1 tuple (sum, k) as output. So we just reduced the complexity of the sort and I/O from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, at least about 4 orders of magnitude.

        Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) = Σ_i sum_i(X); n = Σ_i n_i, where i ranges over the per-mapper tuples seen by the reducer). And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. That volume of anywhere between 1 and 500 vectors it sums up doesn't warrant distributed computation.

        But, you have to make sure there's only 1 reducer no matter what user put into the config, and you have to make sure you do all heavy lifting in the mappers.

        Finally, you don't even need to coerce to 1 reducer. You can still have several (but uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is ok in this case IMO.

        3 – i guess any writable is ok except NullWritable. Maybe something has changed. i remember falling into that pitfall several generations of hadoop ago. You can verify by staging a simple experiment of writing a sequence file with NullWritable as either key or value and trying to read it back. in my test long ago it would write ok but not read back. I believe a similar approach is used with keys in shuffle and sort. There is a reflection writable factory inside which tries to use the default constructor of the class to bring it up, which is (was) not available for NullWritable.

        Thanks Dmitriy. I've updated the diff to push the row summation into the mapper as you suggested, force the number of reducers to 1, and make the final output key IntWritable. Could you please take a look?

        • Raphael

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3838
        -----------------------------------------------------------

        On 2011-12-12 10:41:46, Raphael Cendrillon wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3147/

        -----------------------------------------------------------

        (Updated 2011-12-12 10:41:46)

        Review request for mahout.

        Summary

        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.

        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs

        -----

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION

        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095

        Diff: https://reviews.apache.org/r/3147/diff

        Testing

        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-12 10:41:46.013180)

        Review request for mahout.

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:

        > Hm. I hope i did not read the code or miss something.

        >

        > 1 – i am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which i see no mention in the code.

        > 2 – mappers do nothing, passing on all the row pressure to the sort, which is absolutely not necessary. Even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then reduction becomes largely symbolic (but you do need to pass on the # of rows each mapper has seen to the reducer, in order for that operation to apply correctly).

        > 3 – i am not sure – is NullWritable as a key legit? In my experience sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making constructor private.

        Raphael Cendrillon wrote:

        Thanks Dmitry.

        Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), then all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?

        Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper? I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed? Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?

        Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?

        NullWritable objection is withdrawn. Apparently i haven't looked into hadoop for too long, amazingly it seems to work now.

        1 – I don't think your statement about # of reduce tasks is true.

        The job (or, rather, the user) sets the number of reduce tasks via a config property. All users will follow the hadoop recommendation to set that to 95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually guaranteed to have a number of reducers of something like 75 on a 40-noder, and consequently 75 output files (unless users really want to read the details of your job and figure you meant it to be just 1).
        Now, it is true that only one file will actually end up having something and the rest of task slots will just be occupied doing nothing .

        So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared with multiple jobs; b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of 75 files (in reality, with the default hash partitioner for key 1 it will wind up in partition 0001 unless there's one reducer, as i guess was the case in your test).

        2-- it is simple. when you send n rows to reducers, they are shuffled - and - sorted. Sending massive sets to reducers has 2 effects: first, even if they all group under the same key, they are still sorted with ~ n log (n/p) where p is number of partitions assuming uniform distribution (which it is not because you are sending everything to the same place). Just because we can run distributed sort, doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally what has made your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually doing sort in distributed way but rather simple single threaded sort at the reducer that happens to get all the input.

        So that would allocate a lot of task slots that are not used; do a sort that is not needed; and do it in a single reducer thread for the entire input, which is not parallel at all.

        Instead, consider this: the map has a state consisting of (sum(X), k). It keeps updating it, sum += x, k++, for every new x. At the end of the cycle (in cleanup) it writes only 1 tuple (sum, k) as output. So we just reduced the complexity of the sort and I/O from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, at least about 4 orders of magnitude.

        Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) = Σ_i sum_i(X); n = Σ_i n_i, where i ranges over the per-mapper tuples seen by the reducer). And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. That volume of anywhere between 1 and 500 vectors it sums up doesn't warrant distributed computation.

        But, you have to make sure there's only 1 reducer no matter what user put into the config, and you have to make sure you do all heavy lifting in the mappers.

        Finally, you don't even need to coerce to 1 reducer. You can still have several (but uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is ok in this case IMO.

        3 – i guess any writable is ok except NullWritable. Maybe something has changed. i remember falling into that pitfall several generations of hadoop ago. You can verify by staging a simple experiment of writing a sequence file with NullWritable as either key or value and trying to read it back. in my test long ago it would write ok but not read back. I believe a similar approach is used with keys in shuffle and sort. There is a reflection writable factory inside which tries to use the default constructor of the class to bring it up, which is (was) not available for NullWritable.

        • Dmitriy

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3838
        -----------------------------------------------------------

        On 2011-12-12 00:30:24, Raphael Cendrillon wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/3147/

        -----------------------------------------------------------

        (Updated 2011-12-12 00:30:24)

        Review request for mahout.

        Summary

        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner: this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.

        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs

        -----

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095

        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION

        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095

        Diff: https://reviews.apache.org/r/3147/diff

        Testing

        -------

        Junit test

        Thanks,

        Raphael

        Raphael Cendrillon added a comment -

        Thanks Lance. A combiner is definitely the next step. One question: is there already a writable for tuples of e.g. int and Vector, or should I just write one from scratch? I know there is TupleWritable, but from what I've read online it's better to avoid that unless you're doing a multiple-input join.

        Regarding the class for the output vector, are you saying that instead of inheriting the class from the rows of the DistributedRowMatrix you'd rather be able to specify this manually?
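
        There is no stock (int, Vector) tuple writable at this point, so a small custom one along the following lines would do. This is only an illustrative sketch of the IntVectorTupleWritable idea mentioned in the review summary, not code from the patch:

        // Sketch of an (int rowCount, Vector columnSums) pair that a combiner/reducer can merge.
        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import org.apache.hadoop.io.Writable;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.math.VectorWritable;

        public class IntVectorTupleWritable implements Writable {

          private int count;
          private final VectorWritable vector = new VectorWritable();

          public IntVectorTupleWritable() {}

          public IntVectorTupleWritable(int count, Vector columnSums) {
            this.count = count;
            this.vector.set(columnSums);
          }

          public int getCount() { return count; }
          public Vector getVector() { return vector.get(); }

          @Override
          public void write(DataOutput out) throws IOException {
            out.writeInt(count);
            vector.write(out);
          }

          @Override
          public void readFields(DataInput in) throws IOException {
            count = in.readInt();
            vector.readFields(in);
          }
        }

        A combiner could then merge tuples by adding the counts and summing the vectors, leaving the final divide-by-count to the reducer.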

        jiraposter@reviews.apache.org added a comment -

        On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:

        > Hm. I hope i did not read the code or miss something.

        >

        > 1 – i am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which i see no mention in the code.

        > 2 – mappers do nothing, passing on all the row pressure to sort which is absolutely not necessary. Even if you use combiners. This is going to be especially the case if you coerce 1 reducer an no combiners. IMO mean computation should be pushed up to mappers to avoid sort pressures of map reduce. Then reduction becomes largely symbolical(but you do need pass on the # of rows mapper has seen, to the reducer, in order for that operation to apply correctly).

        > 3 – i am not sure – is NullWritable as a key legit? In my experience sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making constructor private.

        Thanks Dmitriy.

        Regarding 1, if I understand correctly, the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?

        Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper? I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed? Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?

        Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?

        • Raphael

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3838
        -----------------------------------------------------------


        Lance Norskog added a comment -

        MatrixRowMeanJob writes <NullWritable,VectorWritable> but the convention for Mahout jobs is <IntWritable,VectorWritable>. Should it use an IntWritable instead for consistency? The int value would just be 1.

        Can MatrixRowMeanJob have a combiner? Would this help with performance in the standard use cases? The key would still be NullWritable; the value would be a VectorWritable plus a counter. The reducer would then take the counters for each input vector into account.

        Should MatrixRowMeanJob have a configurable class for the output vector?
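        A hypothetical driver wiring along these lines, following the <IntWritable, VectorWritable> convention and coercing a single reducer as discussed elsewhere in this thread; the mapper and reducer class names refer to the earlier sketch, not to anything in Mahout.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

public final class RowMeanDriver {

  public static Job createJob(Configuration conf, Path input, Path output) throws Exception {
    Job job = new Job(conf, "DistributedRowMatrix row mean"); // pre-YARN style constructor
    job.setJarByClass(RowMeanDriver.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);

    job.setMapperClass(RowMeanSketch.PartialSumMapper.class);
    job.setReducerClass(RowMeanSketch.RowMeanReducer.class);
    // No combiner: each mapper already collapses its input to a single (sum, k) record.
    job.setNumReduceTasks(1); // all partial sums must meet at one reducer

    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(VectorWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(VectorWritable.class);

    return job;
  }
}
{code}

        Whether the output vector class should be configurable would then simply be a question of which Vector implementation the reducer instantiates for the mean.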

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/#review3838
        -----------------------------------------------------------

        Hm. I hope I did not misread the code or miss something.

        1 – I am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which I see no mention in the code.
        2 – mappers do nothing, passing all the row pressure on to the sort, which is absolutely not necessary even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressure of map reduce. Then the reduction becomes largely symbolic (but you do need to pass the # of rows each mapper has seen on to the reducer, in order for that operation to apply correctly).
        3 – I am not sure – is NullWritable as a key legit? In my experience the sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making the constructor private.

        • Dmitriy
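        On point 3, the quickest way to check whether NullWritable keys survive a SequenceFile round trip is a small standalone experiment along these lines. The local-filesystem path and setup are illustrative only, and the sketch makes no claim about the outcome on any particular Hadoop version.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class NullWritableKeyCheck {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/nullwritable-check.seq"); // illustrative path

    // Write a single <NullWritable, VectorWritable> record.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, NullWritable.class, VectorWritable.class);
    try {
      writer.append(NullWritable.get(),
          new VectorWritable(new DenseVector(new double[] {1.0, 2.0, 3.0})));
    } finally {
      writer.close();
    }

    // Read it back; if NullWritable keys cannot be instantiated, this is where it would fail.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      NullWritable key = NullWritable.get();
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        System.out.println(value.get());
      }
    } finally {
      reader.close();
    }
  }
}
{code}

        If the read loop fails, that reproduces the reflection issue described later in this thread; if it runs cleanly, NullWritable keys are fine on that Hadoop version.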


        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        (Updated 2011-12-12 00:30:24.091994)

        Review request for mahout.

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3147/
        -----------------------------------------------------------

        Review request for mahout.

        Summary
        -------

        Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.

        This addresses bug MAHOUT-923.
        https://issues.apache.org/jira/browse/MAHOUT-923

        Diffs


        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
        /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
        /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095

        Diff: https://reviews.apache.org/r/3147/diff

        Testing
        -------

        Junit test

        Thanks,

        Raphael

        Raphael Cendrillon added a comment -

        It might be worthwhile to pull this into a separate issue. Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.
        Does a class like this already exist?

        Raphael Cendrillon made changes -
        Field Original Value New Value
        Attachment MAHOUT-923.patch [ 12506962 ]
        Raphael Cendrillon created issue -

          People

          • Assignee:
            Raphael Cendrillon
          • Reporter:
            Raphael Cendrillon
          • Votes:
            0
          • Watchers:
            0
