Mahout / MAHOUT-904

SplitInput should support randomizing the input

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: None

      Description

      For some learning tasks (e.g. SGD), the input needs to be randomized rather than presented as blocks of same-label examples all at once. SplitInput is a useful tool for setting up train/test files, but it currently doesn't support randomizing the input.

      1. MAHOUT-904.patch (48 kB, Grant Ingersoll)
      2. MAHOUT-904.patch (48 kB, Grant Ingersoll)
      3. MAHOUT-904.patch (43 kB, Raphael Cendrillon)
      4. MAHOUT-904.patch (42 kB, Grant Ingersoll)
      5. MAHOUT-904.patch (39 kB, Grant Ingersoll)
      6. MAHOUT-904.patch (32 kB, Raphael Cendrillon)
      7. MAHOUT-904.patch (14 kB, Grant Ingersoll)
      8. MAHOUT-904.patch (11 kB, Raphael Cendrillon)
      9. MAHOUT-904.patch (8 kB, Raphael Cendrillon)

        Activity

        Raphael Cendrillon added a comment -

        Is this still open? If so I could start to take a look.

        Grant Ingersoll added a comment -

        Go for it!

        Raphael Cendrillon added a comment -

        This is an early start but I've posted it up just to check if I'm on the right track.

        A couple of comments:

        • currently the code runs through the entire file looking for the line corresponding to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
        • the permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. This problem may also exist with ridx in the existing code. One option is to use a linear feedback shift register to generate a permutation sequence on the fly.

        Any suggestions would be very welcome!
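
The linear feedback shift register idea can be sketched roughly as follows. This is only an illustration, not code from any of the attached patches: the class name `LfsrPermutation` and its `next()` method are hypothetical, and the fixed 16-bit register limits it to at most 65,535 lines. A maximal-length LFSR visits every nonzero register state exactly once per cycle, so skipping states larger than the line count yields every line index once without storing an index array. The resulting order is deterministic and far from a uniformly random permutation; the seed only changes where in the cycle it starts.

```java
import java.util.BitSet;

/** Hypothetical sketch: visits the indices 0..n-1 in a scrambled order using O(1) state. */
final class LfsrPermutation {
    // Feedback mask of a well-known maximal-length 16-bit Galois LFSR (period 65535).
    private static final int TAPS = 0xB400;

    private final int n;   // number of input lines, 1..65535
    private int state;     // current register value, always in 1..65535

    LfsrPermutation(int n, int seed) {
        if (n < 1 || n > 0xFFFF) {
            throw new IllegalArgumentException("n must be in 1..65535");
        }
        this.n = n;
        // Any nonzero 16-bit start state works; zero would lock the register.
        this.state = (seed & 0xFFFF) == 0 ? 1 : (seed & 0xFFFF);
    }

    /** Returns the next line index; every index in 0..n-1 appears once per n calls. */
    int next() {
        do {
            int lsb = state & 1;
            state >>>= 1;
            if (lsb != 0) {
                state ^= TAPS;
            }
        } while (state > n); // skip register states that fall outside 1..n
        return state - 1;
    }

    public static void main(String[] args) {
        LfsrPermutation p = new LfsrPermutation(10, 12345);
        BitSet seen = new BitSet(10);
        for (int i = 0; i < 10; i++) {
            int idx = p.next();
            seen.set(idx);
            System.out.print(idx + " ");
        }
        System.out.println("distinct=" + seen.cardinality());
    }
}
```

The trade-off is that small inputs waste many register steps on skipped states, so a production version would presumably pick the register width from the line count.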

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3092/
        -----------------------------------------------------------

        Review request for Grant Ingersoll.

        Summary
        -------

        Early support for randomizing input in SplitInput class

        This addresses bug MAHOUT-904.
        https://issues.apache.org/jira/browse/MAHOUT-904

        Diffs


        /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1212249
        /trunk/examples/src/test/java/org/apache/mahout/classifier/bayes/SplitBayesInputTest.java 1212249

        Diff: https://reviews.apache.org/r/3092/diff

        Testing
        -------

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        (Updated 2011-12-09 08:57:18.798303)

        Review request for Grant Ingersoll.

        Summary (updated)
        -------

        Early support for randomizing input in SplitInput class. This is an early start but I've posted it up just to check if I'm on the right track. A couple of comments:

        • currently the code runs through the entire file looking for the line corresponding to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
        • the permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. This problem may also exist with ridx in the existing code. One option is to use a linear feedback shift register to generate a permutation sequence on the fly.

        Any suggestions would be very welcome!

        This addresses bug MAHOUT-904.
        https://issues.apache.org/jira/browse/MAHOUT-904

        Diffs


        /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1212249
        /trunk/examples/src/test/java/org/apache/mahout/classifier/bayes/SplitBayesInputTest.java 1212249

        Diff: https://reviews.apache.org/r/3092/diff

        Testing
        -------

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3092/#review3876
        -----------------------------------------------------------

        Thoughts:
        This class is often run from the command line, so we should add CLI support for telling it to randomly permute.

        I wonder if we should make this a map-reduce job. Perhaps we split out the existing version, leave it as is, and then add a new MR one that can do the permutation. One idea there would be to generate random keys (by appending onto the existing key) and let the shuffle effectively do the permutation. Then, during the reduce phase, we simply strip off the random part of the key and output. I don't know how badly this would hurt the shuffle, but it seems like it would work functionally anyway.

        Otherwise, the approach seems reasonable. I don't know offhand if there is a better way of doing it (even though I wish there were).

        • Grant
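
The random-key proposal can be illustrated without Hadoop. The sketch below is a single-process simulation under stated assumptions, not the job that was eventually written: `RandomKeyShuffleSketch`, `Tagged`, and `permute` are hypothetical names, records are plain strings, and an in-memory sort stands in for the framework's shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Hypothetical, Hadoop-free simulation of permuting records via random sort keys. */
final class RandomKeyShuffleSketch {

    /** A record tagged with the random key that the sort phase orders by. */
    private static final class Tagged {
        final long randomKey;
        final String record;

        Tagged(long randomKey, String record) {
            this.randomKey = randomKey;
            this.record = record;
        }
    }

    static List<String> permute(List<String> records, long seed) {
        Random rng = new Random(seed);

        // "Map": tag every record with a random key.
        List<Tagged> tagged = new ArrayList<>(records.size());
        for (String r : records) {
            tagged.add(new Tagged(rng.nextLong(), r));
        }

        // "Shuffle": sorting by the random key is what permutes the records.
        tagged.sort(Comparator.comparingLong((Tagged t) -> t.randomKey));

        // "Reduce": strip the random key and emit the original record.
        List<String> out = new ArrayList<>(tagged.size());
        for (Tagged t : tagged) {
            out.add(t.record);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(permute(input, 7L));
    }
}
```

In a real job the sort would be distributed across reducers, so the cost of concern is the extra data moved through the shuffle rather than the sort itself.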

        jiraposter@reviews.apache.org added a comment -

        On 2011-12-13 13:19:13, Grant Ingersoll wrote:
        > Thoughts:
        > this class is often run from the command line, so we should add CLI support for telling it to randomly permute.
        >
        > I wonder if we should make this a map-reduce job. Perhaps we split out the existing version and leave as is and then add a new MR one that can do the permutation. One idea there would be to generate random keys (by appending onto the existing key) and letting the shuffle effectively do the permutations. Then, during reduce phase we simply strip off the random part of the key and output. I don't know how bad this would hurt the shuffle, but it seems like it would work functionally anyway.
        >
        > Otherwise, the approach seems reasonable. I don't know off hand if there is a better way of doing it (even though I wish there were).

        Separating the randomization sounds like a nice idea. I still think that the SGD jobs need to be able to randomize within a single map as well.

        Permuting in the shuffle should work fine.

        • Ted

        jiraposter@reviews.apache.org added a comment -

        On 2011-12-13 13:19:13, Grant Ingersoll wrote:
        > Thoughts:
        > this class is often run from the command line, so we should add CLI support for telling it to randomly permute.
        >
        > I wonder if we should make this a map-reduce job. Perhaps we split out the existing version and leave as is and then add a new MR one that can do the permutation. One idea there would be to generate random keys (by appending onto the existing key) and letting the shuffle effectively do the permutations. Then, during reduce phase we simply strip off the random part of the key and output. I don't know how bad this would hurt the shuffle, but it seems like it would work functionally anyway.
        >
        > Otherwise, the approach seems reasonable. I don't know off hand if there is a better way of doing it (even though I wish there were).

        Ted Dunning wrote:
            Separating the randomization sounds like a nice idea. I still think that the SGD jobs need to be able to randomize within a single map as well.

            Permuting in the shuffle should work fine.

        Lance had a similar suggestion. I think there are two tasks required here. One is to randomize the training examples within a split, and the other is to randomize the order of different splits. I'll update this to use map reduce to randomize the splits as well. Lance had a good suggestion for this based on hashing/randomizing the key.

        Given that we will be parallelizing this, I guess each split should fit comfortably into memory? If that's the case, randomization of the lines within a split can be done much more efficiently.

        • Raphael
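
If a split does fit in memory, its lines can be permuted in one pass with a Fisher-Yates shuffle instead of rescanning the file for every output line. A minimal sketch, assuming the lines are already loaded into a list (`InMemorySplitShuffle` and `shuffleInPlace` are hypothetical names, not part of SplitInput):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: uniform in-place shuffle of the lines of one split. */
final class InMemorySplitShuffle {

    /** Fisher-Yates: O(n) time, no extra index array. */
    static <T> void shuffleInPlace(List<T> lines, Random rng) {
        for (int i = lines.size() - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1); // uniform in 0..i
            T tmp = lines.get(i);
            lines.set(i, lines.get(j));
            lines.set(j, tmp);
        }
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(Arrays.asList("l0", "l1", "l2", "l3", "l4"));
        shuffleInPlace(lines, new Random(42L));
        System.out.println(lines);
    }
}
```

The standard library's `Collections.shuffle(list, rng)` does the same job; the loop is spelled out here only to show the single pass.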

        Grant Ingersoll added a comment -

        Note, I think we can still run all of this from the SplitInput driver.

        jiraposter@reviews.apache.org added a comment -

        (Updated 2011-12-16 01:37:06.294565)

        Review request for Grant Ingersoll.

        Changes
        -------

        Rewrote as map-reduce job using downsampling and random key in mapper stage. Actual key from mapper input is preserved and recovered at reducer. Added IntVectorWritable class to support concatenation of actual key with vector.

        This addresses bug MAHOUT-904.
        https://issues.apache.org/jira/browse/MAHOUT-904

        Diffs (updated)


        /trunk/integration/src/main/java/org/apache/mahout/utils/RandomPermuteJob.java PRE-CREATION
        /trunk/integration/src/test/java/org/apache/mahout/utils/TestRandomPermuteJob.java PRE-CREATION
        /trunk/integration/src/main/java/org/apache/mahout/utils/IntVectorWritable.java PRE-CREATION

        Diff: https://reviews.apache.org/r/3092/diff

        Testing
        -------

        Thanks,

        Raphael

        Raphael Cendrillon added a comment -

        At the moment this is written only to handle vectors. I'm having an issue extending this to cover any record type. One option is to use a GenericWritable; however, this could be wasteful since the class name is stored with every record. Another approach could be to use generics; however, I can't seem to use generics in setMapOutputKeyClass(). I'd like to do something like this:

        job.setMapOutputKeyClass(PairWritable<IntWritable,VectorWritable>.class);

        Any suggestions would be very welcome!
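
For background, Java class literals cannot carry type arguments, which is why an expression of the form `Foo<A,B>.class` is rejected by the compiler. The sketch below uses a hypothetical stand-in `Pair` type rather than the real PairWritable, and shows that an unchecked cast of the raw class literal compiles but yields the very same runtime class, so the type arguments are not available to anything that later instantiates the class by reflection.

```java
/** Hypothetical sketch of why a parameterized class literal is not available. */
final class ClassLiteralSketch {

    /** Stand-in for a generic pair type; not the real PairWritable. */
    static class Pair<A, B> {
        A first;
        B second;
    }

    /** Returns a "typed" class token obtained by an unchecked cast of the raw literal. */
    @SuppressWarnings("unchecked")
    static Class<Pair<Integer, String>> pairToken() {
        // Pair<Integer, String>.class would not compile, so the raw literal is cast instead.
        return (Class<Pair<Integer, String>>) (Class<?>) Pair.class;
    }

    public static void main(String[] args) {
        Class<?> raw = Pair.class;
        // Same object: generics are erased, so the token carries no element types.
        System.out.println(raw.equals(pairToken()));
        System.out.println(pairToken().getName());
    }
}
```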

        jiraposter@reviews.apache.org added a comment -

        (Updated 2011-12-16 02:01:25.825802)

        Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.

        This addresses bug MAHOUT-904.
        https://issues.apache.org/jira/browse/MAHOUT-904

        Diffs


        /trunk/integration/src/main/java/org/apache/mahout/utils/RandomPermuteJob.java PRE-CREATION
        /trunk/integration/src/test/java/org/apache/mahout/utils/TestRandomPermuteJob.java PRE-CREATION
        /trunk/integration/src/main/java/org/apache/mahout/utils/IntVectorWritable.java PRE-CREATION

        Diff: https://reviews.apache.org/r/3092/diff

        Testing
        -------

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3092/
        -----------------------------------------------------------

        (Updated 2011-12-16 19:09:13.382909)

        Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.

        Changes
        -------

        Modified to accept any Writable as the value (instead of just VectorWritable). This still requires the generic class PairWritable to be extended for each class of interest so that the extended class can be passed into setMapOutputValueClass(). I'm not sure if this is the best approach; any suggestions would be appreciated!

        Summary
        -------

        Early support for randomizing input in SplitInput class. This is an early start but I've posted it up just to check if I'm on the right track. A couple of comments:

        • currently the code runs through the entire file looking for the line corresponding to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
        • the permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. This problem may also exist with ridx in the existing code. One option is to use a linear feedback shift register to generate a permutation sequence on the fly.
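The linear feedback shift register idea mentioned in the last bullet can be sketched roughly as follows. This is only an illustration, not code from the patch: the class name LfsrPermutation, the fixed 16-bit register width, and the seed handling are assumptions made for the example. A maximal-length 16-bit LFSR visits every value from 1 to 65535 exactly once per cycle, so skipping states larger than the number of lines yields each line index exactly once without storing the permutation in an array. The resulting order is only weakly random.

```java
import java.util.NoSuchElementException;

/** Illustrative sketch: visits 0..size-1 in a scrambled order using O(1) memory. */
class LfsrPermutation {
    // Feedback mask of a well-known maximal-length 16-bit Galois LFSR (period 65535).
    private static final int TAPS = 0xB400;
    private static final int MAX_SIZE = 0xFFFF;

    private final int size;
    private int state;
    private int emitted;

    LfsrPermutation(int size, int seed) {
        if (size < 1 || size > MAX_SIZE) {
            throw new IllegalArgumentException("size must be in 1.." + MAX_SIZE);
        }
        this.size = size;
        // The register must never be zero, otherwise it stays at zero forever.
        int s = seed & 0xFFFF;
        this.state = (s == 0) ? 1 : s;
    }

    boolean hasNext() {
        return emitted < size;
    }

    /** Returns the next index of the permutation. */
    int next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int value;
        do {
            value = state;
            // Advance the Galois LFSR by one step.
            int lsb = state & 1;
            state >>>= 1;
            if (lsb != 0) {
                state ^= TAPS;
            }
        } while (value > size); // skip states that fall outside 1..size
        emitted++;
        return value - 1;
    }

    public static void main(String[] args) {
        LfsrPermutation p = new LfsrPermutation(10, 42);
        while (p.hasNext()) {
            System.out.print(p.next() + " ");
        }
        System.out.println();
    }
}
```

Supporting more than 65535 lines would need a wider register with a different feedback mask, which this sketch does not attempt.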

        Any suggestions would be very welcome!

        This addresses bug MAHOUT-904.
        https://issues.apache.org/jira/browse/MAHOUT-904

        Diffs (updated)


        /trunk/integration/src/main/java/org/apache/mahout/utils/RandomPermuteJob.java PRE-CREATION
        /trunk/integration/src/main/java/org/apache/mahout/utils/PairWritable.java PRE-CREATION
        /trunk/integration/src/main/java/org/apache/mahout/utils/IntVectorWritable.java PRE-CREATION
        /trunk/integration/src/test/java/org/apache/mahout/utils/TestRandomPermuteJob.java PRE-CREATION

        Diff: https://reviews.apache.org/r/3092/diff

        Testing
        -------

        Thanks,

        Raphael

        Grant Ingersoll added a comment -

        Is there a patch here for your latest? I see the diff, but not the patch.

        Raphael Cendrillon added a comment -

        Hi Grant,

        The diffs on reviewboard are all absolute, so you can just save the latest revision as a patch file and apply it to trunk. I'll update the attachment here as well.

        Grant Ingersoll added a comment -

        OK. I still just like patches.

        Couple of things:

        1. We have a Pair class already, we should just make PairWritable use that and put it in the appropriate package.
        2. I think we should try to hook this into SplitInput such that we can still have many of its options available, just done at bigger scale. Either that or we need a Driver for this. Ultimately, though, we need a way to create train/test splits where the train split is randomized.
        Grant Ingersoll added a comment -

        Adds driver, reuses Pair, not integrated into SplitInput yet.

        Raphael Cendrillon added a comment -

        Thanks Grant. I'll update to drop the Pair class in and integrate into SplitInput.

        By the way, did you notice the way that PairWritable needs to be extended for each object type (e.g. IntVectorWritable if the object is a Vector)?

        Does this seem like a reasonable approach? It would require creating a class for each object type of interest, which is somewhat painful. However, I can't see a simpler approach, since setMapOutputValueClass() needs a class that has a default constructor. PairWritable doesn't have one because it doesn't know which classes first and second belong to, and so can't call new for them.
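The constructor problem described above can be reproduced without Hadoop. The sketch below is an assumption-laden illustration, not the patch: SimpleWritable, IntBox, PairBox, and IntIntBox are hypothetical stand-ins for Writable, a concrete value class, PairWritable, and a subclass such as IntVectorWritable. It shows that a generic pair cannot be created reflectively from its class alone, while a concrete subclass that fixes both element types can.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

class DefaultConstructorSketch {

    /** Hypothetical stand-in for Hadoop's Writable interface. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Concrete value type with a no-arg constructor. */
    static class IntBox implements SimpleWritable {
        int value;
        public IntBox() { }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    /**
     * Generic pair. Because of type erasure it cannot create A and B itself,
     * so it has no no-arg constructor and must be handed both instances.
     */
    static class PairBox<A extends SimpleWritable, B extends SimpleWritable>
            implements SimpleWritable {
        final A first;
        final B second;
        PairBox(A first, B second) {
            this.first = first;
            this.second = second;
        }
        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
        }
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
        }
    }

    /** Concrete subclass: fixes both types, so a default constructor is possible. */
    static class IntIntBox extends PairBox<IntBox, IntBox> {
        public IntIntBox() {
            super(new IntBox(), new IntBox());
        }
    }

    /** Mimics a framework that creates value objects reflectively from a Class. */
    static boolean canInstantiate(Class<?> cls) {
        try {
            cls.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("PairBox:   " + canInstantiate(PairBox.class));
        System.out.println("IntIntBox: " + canInstantiate(IntIntBox.class));
    }
}
```

The cost is the one noted in the comment: every combination of element types needs its own small subclass.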

        Lance Norskog added a comment -

        Hi-

        Don't see the 'add review' button. Can it extend AbstractJob?

        Raphael Cendrillon added a comment -

        Hi Lance. Is that a general comment, or specifically for the issue regarding PairWritable/IntVectorWritable?

        Raphael Cendrillon added a comment -

        Implemented downsampling more efficiently through mapper run(), implemented random permutation through sort comparator class, added driver, integrated into SplitInput
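The patch itself is not reproduced in this thread, so the following is only a rough, single-machine sketch of the general idea of downsampling records and then permuting them through a sort. It assumes a plain in-memory list instead of a MapReduce job: each kept record is tagged with a random key, and sorting by that key produces the shuffled order, which is roughly what a sort-based shuffle phase would do. The class name RandomKeyShuffle and the parameters keepPct and seed are made up for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;

class RandomKeyShuffle {

    /**
     * Keeps each record with probability keepPct / 100, then returns the kept
     * records in random order by sorting them on a randomly assigned key.
     */
    static <T> List<T> sampleAndShuffle(List<T> records, double keepPct, long seed) {
        Random rng = new Random(seed);
        List<Map.Entry<Long, T>> keyed = new ArrayList<>();
        for (T record : records) {
            // Downsampling: decide per record, no total count needed.
            if (rng.nextDouble() * 100.0 < keepPct) {
                // Tag the record with a random sort key.
                keyed.add(new AbstractMap.SimpleEntry<>(rng.nextLong(), record));
            }
        }
        // Sorting on the random keys yields the permutation.
        keyed.sort(Map.Entry.comparingByKey());
        List<T> shuffled = new ArrayList<>(keyed.size());
        for (Map.Entry<Long, T> entry : keyed) {
            shuffled.add(entry.getValue());
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e", "f");
        System.out.println(sampleAndShuffle(lines, 100.0, 7L));
        System.out.println(sampleAndShuffle(lines, 50.0, 7L));
    }
}
```

Because the keep decision is made per record, this style of downsampling works on a percentage but cannot guarantee an exact output size.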

        Raphael Cendrillon added a comment -

        Implemented downsampling more efficiently through mapper run(), implemented random permutation through sort comparator class, added driver, integrated into SplitInput

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3092/
        -----------------------------------------------------------

        (Updated 2011-12-22 04:38:08.261769)

        Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.

        Changes
        -------

        Implemented downsampling more efficiently through mapper run(), implemented random permutation through sort comparator class, added driver, integrated into SplitInput

        Summary
        -------

        Early support for randomizing input in SplitInput class. This is an early start but I've posted it up just to check if I'm on the right track. A couple of comments:

        • currently the code runs through the entire file looking for the line corresponding to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
        • the permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. This problem may also exist with ridx in the existing code. One option is to use a linear feedback shift register to generate a permutation sequence on the fly.

        Any suggestions would be very welcome!

        This addresses bug MAHOUT-904.
        https://issues.apache.org/jira/browse/MAHOUT-904

        Diffs (updated)


        /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1215567
        /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInputJob.java PRE-CREATION
        /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1215567

        Diff: https://reviews.apache.org/r/3092/diff

        Testing
        -------

        Thanks,

        Raphael

        Grant Ingersoll added a comment -

        Converts SplitInput to use AbstractJob.

        I think this is pretty close to being ready. I wonder if there is more opportunity for overlap between the sequential and the m/r version, but we can iterate on that later if people want.

        Raphael Cendrillon added a comment -

        Thanks Grant. I was wondering the same thing, for example supporting randomSelectionSize in addition to randomSelectionPct. However, supporting size-based splits may not be quite so straightforward, since the size is generally unknown if the SequenceFile is large, plus it's split across mappers.

        I also would have liked to have the training and test outputs go to different directories (instead of just using different filename prefixes), but this is not quite so straightforward due to issues with the new API (unless I just write to the SequenceFile by hand in the reducer which raises its own issues). I think this can be made a little neater once we move to Hadoop 0.21.

        Is there something else that you had in mind?

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3092/
        -----------------------------------------------------------

        (Updated 2011-12-22 15:45:05.856640)

        Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.

        Changes
        -------

        Grant's changes

        Summary
        -------

        Early support for randomizing input in SplitInput class. This is an early start but I've posted it up just to check if I'm on the right track. A couple of comments:

        • currently the code runs through the entire file looking for the line corresponding to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
        • the permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. This problem may also exist with ridx in the existing code. One option is to use a linear feedback shift register to generate a permutation sequence on the fly.

        Any suggestions would be very welcome!

        This addresses bug MAHOUT-904.
        https://issues.apache.org/jira/browse/MAHOUT-904

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java 1222286
        /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1222286
        /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1222286

        Diff: https://reviews.apache.org/r/3092/diff

        Testing
        -------

        Thanks,

        Raphael

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3092/
        -----------------------------------------------------------

        (Updated 2011-12-22 16:03:47.932528)

        Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.

        Changes
        -------

        Grant's changes

        Summary
        -------

        Early support for randomizing input in SplitInput class. This is an early start but I've posted it up just to check if I'm on the right track. A couple of comments:

        • currently the code runs through the entire file looking for the line corresponding to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
        • the permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. This problem may also exist with ridx in the existing code. One option is to use a linear feedback shift register to generate a permutation sequence on the fly.

        Any suggestions would be very welcome!

        This addresses bug MAHOUT-904.
        https://issues.apache.org/jira/browse/MAHOUT-904

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java 1222286
        /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1222286
        /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInputJob.java PRE-CREATION
        /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1222286

        Diff: https://reviews.apache.org/r/3092/diff

        Testing
        -------

        Thanks,

        Raphael

        Grant Ingersoll added a comment -

        This needs some more work, as I don't think we can assume IntWritables. For instance, see the usage in asf-email-examples.sh which needs to use the split functionality.

        This patch does not fully work yet.

        Raphael Cendrillon added a comment -

        I think we can replace IntWritable with WritableComparable and it should take care of this. I'll update the patch and post shortly.

        Sean Owen added a comment -

        (I don't know if this is a relevant comment, but we ought to be using VarIntWritable and VarLongWritable, not IntWritable and LongWritable, for better space savings.)
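To illustrate where the space savings of a variable-length integer come from, here is a generic varint sketch. It is an assumption for illustration only: it uses a common 7-bits-per-byte scheme, and the exact byte layout used by VarIntWritable may differ. Small values take one byte instead of the fixed four bytes of a plain int.

```java
import java.io.ByteArrayOutputStream;

class VarIntSketch {

    /** Encodes an int using 7 data bits per byte; the high bit marks "more bytes follow". */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation flag
            value >>>= 7;
        }
        out.write(value); // final byte, continuation flag clear
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // last byte reached
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] samples = {5, 300, 70000, Integer.MAX_VALUE};
        for (int v : samples) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

In this plain scheme negative numbers always take five bytes, so it only saves space when values are mostly small and non-negative.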

        Raphael Cendrillon added a comment -

        Replaced IntWritable with WritableComparable so that any key class can be used. Added instantiation of Configuration to make sure tests pass when using SplitInputJob from within code

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3092/
        -----------------------------------------------------------

        (Updated 2011-12-23 23:14:34.869723)

        Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.

        Changes
        -------

        Replaced IntWritable with WritableComparable so that any key class can be used. Added instantiation of Configuration to make sure tests pass when using SplitInputJob from within code

        Summary
        -------

        Early support for randomizing input in SplitInput class. This is an early start but I've posted it up just to check if I'm on the right track. A couple of comments:

        • currently the code runs through the entire file looking for the line corresponding to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
        • the permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. This problem may also exist with ridx in the existing code. One option is to use a linear feedback shift register to generate a permutation sequence on the fly.

        Any suggestions would be very welcome!

        This addresses bug MAHOUT-904.
        https://issues.apache.org/jira/browse/MAHOUT-904

        Diffs (updated)


        /trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java 1221886
        /trunk/examples/bin/asf-email-examples.sh 1221886
        /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1221886
        /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInputJob.java PRE-CREATION
        /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1221886

        Diff: https://reviews.apache.org/r/3092/diff

        Testing
        -------

        Thanks,

        Raphael

        Grant Ingersoll added a comment -

        ASF examples at least run for SGD now, although the results are horrible.

        Grant Ingersoll added a comment -

        Upping the cardinality seems to fix things. Now need to try with more labels.

        Hudson added a comment -

        Integrated in Mahout-Quality #1287 (See https://builds.apache.org/job/Mahout-Quality/1287/)
        MAHOUT-904: Add randomization to split input

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1226475
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java
        • /mahout/trunk/examples/bin/asf-email-examples.sh
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/sgd/TestASFEmail.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainASFEmail.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/SplitInputJob.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java
        Raphael Cendrillon added a comment -

        Thanks Grant!


          People

          • Assignee:
            Raphael Cendrillon
            Reporter:
            Grant Ingersoll
          • Votes:
            0
            Watchers:
            0
