Hadoop Common
  1. Hadoop Common
  2. HADOOP-3019

want input sampler & sorted partitioner

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Added a partitioner that effects a total order of output data, and an input sampler for generating the partition keyset for TotalOrderPartitioner for when the map's input keytype and distribution approximates its output.

      Description

      The input sampler should generate a small, random sample of the input, saved to a file.

      The partitioner should read the sample file and partition keys into relatively even-sized key-ranges, where the partition numbers correspond to key order.

      Note that when the sampler is used for partitioning, the number of samples required is proportional to the number of reduce partitions. 10x the intended reducer count should give good results.

      1. 3019-0.patch
        33 kB
        Chris Douglas
      2. 3019-1.patch
        34 kB
        Chris Douglas
      3. 3019-2.patch
        34 kB
        Chris Douglas
      4. 3019-3.patch
        39 kB
        Chris Douglas
      5. 3019-4.patch
        39 kB
        Chris Douglas
      6. 3019-5.patch
        40 kB
        Chris Douglas

        Issue Links

          Activity

          Hide
          Chris Douglas added a comment -

          Added a release note.

          Show
          Chris Douglas added a comment - Added a release note.
          Hide
          Hudson added a comment -
          Show
          Hudson added a comment - Integrated in Hadoop-trunk #611 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/ )
          Hide
          Owen O'Malley added a comment -

          I just committed this. Thanks, Chris!

          Show
          Owen O'Malley added a comment - I just committed this. Thanks, Chris!
          Hide
          Chris Douglas added a comment -

          More updates on Owen's feedback:

          • RandomSampler includes the selected element when selecting
          • Validate ordering of partition file when configuring TotalOrderPartitioner
          Show
          Chris Douglas added a comment - More updates on Owen's feedback: RandomSampler includes the selected element when selecting Validate ordering of partition file when configuring TotalOrderPartitioner
          Hide
          Chris Douglas added a comment -

          test-patch results for 3019-4:

               [exec] +1 overall.  
          
               [exec]     +1 @author.  The patch does not contain any @author tags.
          
               [exec]     +1 tests included.  The patch appears to include 5 new or modified tests.
          
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
          
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          
          Show
          Chris Douglas added a comment - test-patch results for 3019-4: [exec] +1 overall. [exec] +1 @author. The patch does not contain any @author tags. [exec] +1 tests included. The patch appears to include 5 new or modified tests. [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          Hide
          Chris Douglas added a comment -

          Fixed a javadoc warning (ant javadoc target didn't know about tools)

          Show
          Chris Douglas added a comment - Fixed a javadoc warning (ant javadoc target didn't know about tools)
          Hide
          Chris Douglas added a comment -

          Updated based on Owen's feedback:

          • Changed RandomSampler to sample evenly across all splits, rather than evenly from each split
          • Used double instead of float for sampling rate

          This patch also modifies the sort example to demonstrate how to use InputSampler in a job. This requires examples to depend on tools.

          Show
          Chris Douglas added a comment - Updated based on Owen's feedback: Changed RandomSampler to sample evenly across all splits, rather than evenly from each split Used double instead of float for sampling rate This patch also modifies the sort example to demonstrate how to use InputSampler in a job. This requires examples to depend on tools.
          Hide
          Chris Douglas added a comment -

          Submitting last patch to make 0.19.

          Show
          Chris Douglas added a comment - Submitting last patch to make 0.19.
          Hide
          Chris Douglas added a comment -

          Since the sample points are kept in array and sorted in memory, then its size is severely limited. Why not consider to use a map reduce job to generate the sampling points and the partition file?

          The client-side sampler is limited, no question, but it usually only takes a few seconds to run (unlike a distributed job), generates decent results, and can be easily rolled into the user's driver. The distributed sampler (planned, writing it) can be more accurate, but will take longer. The client-side sampler also needs to use the map class, so the sampling is on the map output keytype and distribution rather than the input.

          The latter requires that most of the InputSampler be rewritten to use MapRunnable, so I'm cancelling this for now.

          Show
          Chris Douglas added a comment - Since the sample points are kept in array and sorted in memory, then its size is severely limited. Why not consider to use a map reduce job to generate the sampling points and the partition file? The client-side sampler is limited, no question, but it usually only takes a few seconds to run (unlike a distributed job), generates decent results, and can be easily rolled into the user's driver. The distributed sampler (planned, writing it) can be more accurate, but will take longer. The client-side sampler also needs to use the map class, so the sampling is on the map output keytype and distribution rather than the input. The latter requires that most of the InputSampler be rewritten to use MapRunnable, so I'm cancelling this for now.
          Hide
          Runping Qi added a comment -

          Sorry for jump in late.

          Since the sample points are kept in array and sorted in memory, then its size is severely limited.
          Why not consider to use a map reduce job to generate the sampling points and the partition file?

          Show
          Runping Qi added a comment - Sorry for jump in late. Since the sample points are kept in array and sorted in memory, then its size is severely limited. Why not consider to use a map reduce job to generate the sampling points and the partition file?
          Hide
          Chris Douglas added a comment -

          Changed random sampling to be less dominated by keys in the latter part of each sampled split.

          Show
          Chris Douglas added a comment - Changed random sampling to be less dominated by keys in the latter part of each sampled split.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12390146/3019-1.patch
          against trunk revision 696002.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12390146/3019-1.patch against trunk revision 696002. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/console This message is automatically generated.
          Hide
          Chris Douglas added a comment -

          Updated patch to refer to BinaryComparable instead of MemComparable and moved the change to bin/hadoop to the correct patch.

          Show
          Chris Douglas added a comment - Updated patch to refer to BinaryComparable instead of MemComparable and moved the change to bin/hadoop to the correct patch.
          Hide
          Chris Douglas added a comment -

          Results of test-patch with HADOOP-4151 applied:

               [exec] +1 overall.  
          
               [exec]     +1 @author.  The patch does not contain any @author tags.
          
               [exec]     +1 tests included.  The patch appears to include 18 new or modified tests.
          
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
          
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          
          Show
          Chris Douglas added a comment - Results of test-patch with HADOOP-4151 applied: [exec] +1 overall. [exec] +1 @author. The patch does not contain any @author tags. [exec] +1 tests included. The patch appears to include 18 new or modified tests. [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          Hide
          Chris Douglas added a comment -

          This won't compile until HADOOP-4151 is in, but marking it PA for review.

          Show
          Chris Douglas added a comment - This won't compile until HADOOP-4151 is in, but marking it PA for review.
          Hide
          Chris Douglas added a comment -

          Adapted TotalOrderPartitioner and input sampler from HADOOP-3402, with the following changes:

          • Adds two other kinds of samplers
          • Made memcmp-able types (Text, BytesWritable) use the trie, other data structures do a binary search over the partition keyset
          • Adds a unit test for the partitioner
          Show
          Chris Douglas added a comment - Adapted TotalOrderPartitioner and input sampler from HADOOP-3402 , with the following changes: Adds two other kinds of samplers Made memcmp-able types (Text, BytesWritable) use the trie, other data structures do a binary search over the partition keyset Adds a unit test for the partitioner
          Hide
          Enis Soztutar added a comment -

          The sampler can be easily written once Filters are in(HADOOP-449). I intent to come up with a patch today.

          Show
          Enis Soztutar added a comment - The sampler can be easily written once Filters are in( HADOOP-449 ). I intent to come up with a patch today.
          Hide
          Doug Cutting added a comment -

          > Should this be a part of examples like sort?

          I guess it could live with the examples, but I was thinking that this would be more like mapred/lib. The sampler should be generic enough that folks won't have to modify it to find it useful: it should work for different key/value types and for sequencefile and text data.

          Show
          Doug Cutting added a comment - > Should this be a part of examples like sort? I guess it could live with the examples, but I was thinking that this would be more like mapred/lib. The sampler should be generic enough that folks won't have to modify it to find it useful: it should work for different key/value types and for sequencefile and text data.
          Hide
          Amar Kamat added a comment -

          Should this be a part of examples like sort? Users can use it the way they use other examples.

          Show
          Amar Kamat added a comment - Should this be a part of examples like sort? Users can use it the way they use other examples.
          Hide
          Doug Cutting added a comment -

          Implementation thoughts:

          • the sampler can be implemented as an inputformat.
          • a generic sampling job class can configure the sampling input format and a single identity reducer.
          • samples should come from random positions in input files, since input files are frequently themselves sorted.
          Show
          Doug Cutting added a comment - Implementation thoughts: the sampler can be implemented as an inputformat. a generic sampling job class can configure the sampling input format and a single identity reducer. samples should come from random positions in input files, since input files are frequently themselves sorted.

            People

            • Assignee:
              Chris Douglas
              Reporter:
              Doug Cutting
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development