Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      This feature allows you to specify a Hadoop Partitioner for the following operations: GROUP/COGROUP, CROSS, DISTINCT, and JOIN (except 'skewed' join). The Partitioner controls the partitioning of the keys of the intermediate map outputs. See http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html for more details.

      To use this feature, add a PARTITION BY clause to the appropriate operator:
      A = load 'input_data';
      B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
      .....
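      Here is the code for SimpleCustomPartitioner:

      import org.apache.hadoop.io.Writable;
      import org.apache.hadoop.mapreduce.Partitioner;
      import org.apache.pig.impl.io.PigNullableWritable;

      public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
          @Override
          public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
              // Integer keys are partitioned by value, all other keys by hash code.
              // The sign bit is masked so the result is never negative.
              if (key.getValueAsPigType() instanceof Integer) {
                  int ret = ((Integer) key.getValueAsPigType()).intValue();
                  return (ret & Integer.MAX_VALUE) % numPartitions;
              } else {
                  return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
              }
          }
      }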
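
A partitioner is expected to return a value from 0 to numPartitions - 1. Java's % operator keeps the sign of its left operand, so negative integer keys or negative hash codes need some care. The following is a minimal standalone sketch of one common way to map any int to a valid partition; the class and method names are hypothetical and are not part of Pig or Hadoop.

```java
// Hypothetical helper, not part of Pig or Hadoop: maps any int to a valid partition.
final class PartitionIndex {
    private PartitionIndex() {}

    /** Returns a value in [0, numPartitions) for any int, including negatives. */
    static int toPartition(int value, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // Clearing the sign bit makes the left operand non-negative,
        // so the remainder is non-negative as well.
        return (value & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The plain remainder keeps the sign of the left operand.
        System.out.println(-7 % 2);
        // The masked version always lands inside [0, 2).
        System.out.println(toPartition(-7, 2));
    }
}
```

Masking rather than calling Math.abs also handles Integer.MIN_VALUE, whose absolute value does not fit in an int.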

      Description

      Adding a custom partitioner gives users control over which output partition a key (or value) goes to. We could add keywords to the language, e.g.

      PARTITION BY UDF(...)

      or a similar syntax. The UDF returns a number between 0 and n-1, where n is the number of output partitions.
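
To make the proposed contract concrete, here is a minimal sketch of a function that maps a key to a number between 0 and n-1. The interface and class names are hypothetical and only illustrate the idea; they are not a Pig API.

```java
// Hypothetical sketch: the interface and class names are illustrative only.
interface PartitionFunction<K> {
    /** Must return a value between 0 and numPartitions - 1. */
    int partition(K key, int numPartitions);
}

/** Keys that share a first letter (ignoring case) share a partition. */
final class FirstLetterPartition implements PartitionFunction<String> {
    @Override
    public int partition(String key, int numPartitions) {
        if (key == null || key.isEmpty()) {
            return 0; // arbitrary but fixed choice for empty keys
        }
        char c = Character.toLowerCase(key.charAt(0));
        // char values are non-negative, so the remainder is in [0, numPartitions).
        return c % numPartitions;
    }
}
```

With this kind of function the caller, not the default hash, decides which keys end up together in the same output partition.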

      1. CustomPartitioner.patch
        14 kB
        Aniket Mokashi
      2. CustomPartitionerFinale.patch
        24 kB
        Aniket Mokashi
      3. CustomPartitionerTest.patch
        24 kB
        Aniket Mokashi

          Activity

          Yiping Han added a comment -

          Any concerns on this issue?

          David Ciemiewicz added a comment -

          How will the custom partitioner be used in Pig?

          Is this for map partitioning and/or output partitioning?

          For instance, I'd love to have something that created separate directories based on the value of some key.

          Alan Gates added a comment -

          This JIRA refers to map->reduce partitioning. Output partitioning, i.e. spraying records to directories based on a key, can already be done via a custom store function.

          Dmitriy V. Ryaboy added a comment -

          David,
          take a look at https://issues.apache.org/jira/browse/PIG-958 (it's in 0.6)

          Aniket Mokashi added a comment -

          1. PARTITION BY is better suited to taking a mapreduce.Partitioner than a UDF. The clause is followed by PARALLEL n.
          2. Applicable to:
          GROUP
          COGROUP
          CROSS
          DISTINCT
          JOIN (except 'skewed', which uses SkewedPartitioner)
          3. PARTITION BY on ORDER is not supported.
          4. The type parameters of custom partitioners (<PigNullableWritable, Writable>) are not validated.

          Approach:
          1. Added support for ClassType parsing and validation. Parsing for "partition by" is added separately to each of the clauses listed above.
          2. The custom partitioner is stored as a String in the LO, PO and MR plans. The LogicalOperator holds the partitioner in the LO plan. We add the partitioner to POGlobalRearrangement because it decides the map-reduce boundary, and we read and set the partitioner when we visit the POGlobalRearrangement.

          Attaching a patch with initial changes...
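
The idea of carrying the partitioner through the plans as a class name and resolving it later can be sketched roughly as follows. This is only an illustration under stated assumptions: BasePartitioner is a stand-in for Hadoop's Partitioner class, and ModuloPartitioner and PartitionerResolver are hypothetical names, not classes from the patch. The sketch checks only the class hierarchy, not the type parameters.

```java
// Hypothetical sketch: BasePartitioner stands in for Hadoop's Partitioner class,
// and PartitionerResolver is not a Pig class.
abstract class BasePartitioner {
    abstract int getPartition(Object key, Object value, int numPartitions);
}

final class ModuloPartitioner extends BasePartitioner {
    @Override
    int getPartition(Object key, Object value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

final class PartitionerResolver {
    private PartitionerResolver() {}

    /** Resolves a class name carried through the plan as a String and checks its type. */
    static Class<? extends BasePartitioner> resolve(String className) {
        try {
            Class<?> cls = Class.forName(className);
            if (!BasePartitioner.class.isAssignableFrom(cls)) {
                throw new IllegalArgumentException(className + " is not a partitioner");
            }
            return cls.asSubclass(BasePartitioner.class);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown class: " + className, e);
        }
    }
}
```

Keeping only a String in the plan means the plan does not need to hold a live partitioner instance; the class is looked up once, at the point where the job is configured.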

          Aniket Mokashi added a comment -

          Initial Changes

          Aniket Mokashi added a comment -

          Initial changes

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12445704/CustomPartitioner.patch
          against trunk revision 949057.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 385 release audit warnings (more than the trunk's current 384 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/13/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/13/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/13/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/13/console

          This message is automatically generated.

          Aniket Mokashi added a comment -

          Adding test cases and some small fixes.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12446067/CustomPartitionerTest.patch
          against trunk revision 949057.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 386 release audit warnings (more than the trunk's current 385 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/18/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/18/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/18/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/18/console

          This message is automatically generated.

          Aniket Mokashi added a comment -

          Added code review comments and some minor changes with test cases.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12446172/CustomPartitionerFinale.patch
          against trunk revision 951229.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 380 release audit warnings (more than the trunk's current 379 warnings).

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/320/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/320/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/320/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/320/console

          This message is automatically generated.

          Daniel Dai added a comment -

          Manual tests pass. The release audit warning is due to one additional jdiff artifact. Patch committed, thanks Aniket!

          Brad Tofel added a comment -

          Do I read this right - there is no way to specify a custom partitioner for use with "ORDER BY"?

          If so, is there any other way to perform a total ordering within Pig?

          I will be doing a STORE immediately after the ORDER - the relation will not be used again. Is there some other workaround to achieve this?

          I would love to replace my current Hadoop Java code with Pig, but total ordering is a requirement.

          Dmitriy V. Ryaboy added a comment -

          Brad, ORDER produces a total order out of the box.
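
For context, a common general technique for getting a total order across several reducers is range partitioning: keys are assigned to partitions by sorted split points, so every key in partition i sorts before every key in partition i+1, and each partition is then sorted on its own. The following is a minimal illustrative sketch of that idea only; it is not Pig's ORDER implementation, and the class name is hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only: not Pig's implementation; the class name is hypothetical.
final class RangePartition {
    // Sorted upper bounds, one fewer than the number of partitions.
    private final int[] splitPoints;

    RangePartition(int[] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints.clone();
    }

    /** Keys <= splitPoints[0] go to partition 0, the next range to partition 1, and so on. */
    int partitionFor(int key) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns (-(insertionPoint) - 1) when the key is absent.
        return idx >= 0 ? idx : -(idx + 1);
    }
}
```

Because the assignment is monotonic in the key, concatenating the sorted partitions in partition order yields one globally sorted output.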


            People

            • Assignee:
              Aniket Mokashi
              Reporter:
              Amir Youssefi
            • Votes:
              2
              Watchers:
              2