Pig / PIG-1249

Safe-guards against misconfigured Pig scripts without PARALLEL keyword

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      In previous versions of Pig, if the number of reducers was not specified (via PARALLEL or default_parallel), a value of 1 was used, which in many cases was not a good choice and caused severe performance problems.

      In Pig 0.8.0, a simple heuristic is used to come up with a better number based on the size of the input data. There are several parameters that the user can control:

      pig.exec.reducers.bytes.per.reducer - defines the number of input bytes per reducer; the default value is 1000*1000*1000 (1 GB)
      pig.exec.reducers.max - defines the upper bound on the number of reducers; the default is 999

      The formula is very simple:

      #reducers = MIN(pig.exec.reducers.max, total input size (in bytes) / bytes per reducer)

      This is a very simplistic formula that we will need to improve over time. Note that the computed value takes all inputs within the script into account and applies the same value to all the jobs within the Pig script.

      Note that this is not a backward compatible change; set default_parallel to 1 to restore the previous behavior.
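
      A minimal sketch of this heuristic in Java (the method and parameter names here are illustrative, not the exact identifiers in the Pig source):

      // #reducers = MIN(pig.exec.reducers.max, total input size / bytes per reducer),
      // with a floor of 1 reducer when the input size cannot be determined.
      static int estimateNumberOfReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
          long estimate = (long) Math.ceil((double) totalInputBytes / bytesPerReducer);
          return (int) Math.max(1, Math.min(estimate, maxReducers));
      }

      // Example: with the defaults (1 GB per reducer, max 999), 3 GB of total input
      // yields MIN(999, 3) = 3 reducers.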

      Description

      It would be very useful for Pig to have safe-guards against naive scripts which process a lot of data without the use of the PARALLEL keyword.

      We've seen a fair number of instances where naive users process huge data-sets (>10TB) with a badly misconfigured number of reducers, e.g. 1 reducer.

      1. PIG-1249_5.patch
        10 kB
        Jeff Zhang
      2. PIG-1249-4.patch
        9 kB
        Alan Gates
      3. PIG_1249_3.patch
        9 kB
        Jeff Zhang
      4. PIG_1249_2.patch
        8 kB
        Jeff Zhang
      5. PIG-1249.patch
        7 kB
        Jeff Zhang


          Activity

          Jeff Zhang added a comment -

          +1. And I find that Hive can estimate the reducer number according to the input size; this is a really useful feature.

          Jeff Zhang added a comment -

          The current idea is borrowed from Hive: use the input file size to estimate the reducer number.
          Two parameters can be set for this purpose:
          pig.exec.reducers.bytes.per.reducer // the number of bytes of input for each reducer
          pig.exec.reducers.max // the maximum number of reducers

          This only works for HDFS; it won't work for other data sources such as HBase or Cassandra.
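
          A minimal sketch, using Hadoop's FileSystem API, of how the total input size could be computed (the name getTotalInputFileSize appears later in this thread; the body below is illustrative, not the actual patch):

          import java.util.List;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          static long getTotalInputFileSize(Configuration conf, List<String> inputPaths) {
              long total = 0;
              for (String input : inputPaths) {
                  try {
                      Path p = new Path(input);
                      FileSystem fs = p.getFileSystem(conf);
                      if (fs.exists(p)) {
                          // getContentSummary sums file sizes recursively under a directory
                          total += fs.getContentSummary(p).getLength();
                      }
                  } catch (java.io.IOException e) {
                      // Non-file inputs (e.g. an HBase table) cannot be stat'ed; count them as 0.
                  }
              }
              return total;
          }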

          Thejas M Nair added a comment -

          If default_parallel has not been set, the patch sets a new default number of reducers based on input file sizes.
          If the 'input' specified in the load statement is not an HDFS file, it fails to find the file size and the default of 1 reducer will be used.

          The next steps in automatically determining number of reducers (which can be addressed in separate jiras) are -
          1. Determining different number of reducers for each MR job of a pig-query, based on the input size for the MR job.
          2. Extending this functionality to load functions that don't take hdfs files as input. We can look at using LoadMetaData.getStatistics().

          Comments on the patch -
          If default_parallel is specified, the number of reducers doesn't need to be determined.

          
          estimateNumberOfReducers(conf, mro);
          if (pigContext.defaultParallel > 0)
              conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);

          can be changed to

          if (pigContext.defaultParallel > 0)
              conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);
          else
              estimateNumberOfReducers(conf, mro);
          Everything else looks good.

          Hudson still seems to be having problems. I am currently running unit tests with this patch.

          Alan Gates added a comment -

          One thing we want to be sure of is that if users explicitly set parallel to 1, we don't override it. From reviewing the above code it isn't clear to me whether that's the case here or not.

          Dmitriy V. Ryaboy added a comment -

          This is a good spot to leverage pinning options that were added to operators a while back. The parser would pin the parallel option if it encounters the PARALLEL keyword, and the code in this patch wouldn't get invoked unless parallelism is not pinned.

          Jeff Zhang added a comment -

          To clarify the logic (a sketch follows):
          first, check PARALLEL in the query;
          if not set, then check defaultParallel in the PigContext;
          if not set, estimate the reducer number.
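
          A minimal sketch of that ordering (field and method names are illustrative, not the exact identifiers in the patch):

          // 1) PARALLEL in the query, 2) default_parallel, 3) estimate from input size.
          if (mro.requestedParallelism > 0) {
              conf.set("mapred.reduce.tasks", "" + mro.requestedParallelism);
          } else if (pigContext.defaultParallel > 0) {
              conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);
          } else {
              estimateNumberOfReducers(conf, mro);
          }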

          Alan Gates added a comment -

          Questions/Comments:

          1. In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.
          2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.
          3. Does this estimate apply only to the first job or to all jobs?
          4. How does this work in the case of joins, where there are multiple inputs to a job?
          Jeff Zhang added a comment -

          Response to Alan's questions,

          1. In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.

          It won't throw an IOException when the file doesn't exist; getTotalInputFileSize will return 0 if we are not loading from a file or the file doesn't exist, and the final estimated reducer number will be 1.

          2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.

          These two numbers are what Hive uses; I'm not sure where they came from. Maybe from their experience.

          3. Does this estimate apply only to the first job or to all jobs?

          It will apply to all the jobs.

          4. How does this work in the case of joins, where there are multiple inputs to a job?

          It will estimate the reducer number according to the total size of all the input files.

          Alan Gates added a comment -

          1. In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.


          It won't throw an IOException when the file doesn't exist; getTotalInputFileSize will return 0 if we are not loading from a file or the file doesn't exist, and the final estimated reducer number will be 1.


          Could we add a test to test this? I think it would be good to ensure it works in this situation. Maybe you could take one of the tests that uses the HBase loader.

          2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.


          These two numbers are what Hive uses; I'm not sure where they came from. Maybe from their experience.


          ok, good enough. We can adjust them later if we need to.

          3. Does this estimate apply only to the first job or to all jobs?


          It will apply to all the jobs.


          Eventually we should change this to do the estimation on the fly in the JobControlCompiler. Since most queries tend to aggregate data down after a number of steps I suspect that using the initial input to estimate the entire query will mean that the final results are parallelized too widely. But this is better than the current situation where they aren't parallelized at all.

          4. How does this work in the case of joins, where there are multiple inputs to a job?


          It will estimate the reducer number according to the total size of all the input files.


          cool

          So other than testing the non-file case I'm +1 on this patch.

          Jeff Zhang added a comment -

          Updated the patch, including a test case for non-DFS input, and added path checking when doing the estimation.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12445559/PIG_1249_3.patch
          against trunk revision 948526.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/6/console

          This message is automatically generated.

          Alan Gates added a comment -

          +1, new test looks good.

          Hudson is still having troubles. We should run the "ant test" and "ant test-patch" directives manually.

          Alan Gates added a comment -

          The latest patch doesn't apply because of a merge conflict. I'll attach a patch that addresses this.

          Alan Gates added a comment -

          Patch with merge conflict resolution.

          Jeff Zhang added a comment -

          Alan, thanks for your help.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12446173/PIG-1249-4.patch
          against trunk revision 951229.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/329/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/329/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/329/console

          This message is automatically generated.

          Ashutosh Chauhan added a comment -

          The Map-Reduce framework has a JIRA related to this issue: https://issues.apache.org/jira/browse/MAPREDUCE-1521. It has two implications for Pig:

          1) We need to reconsider whether we still want Pig to set the number of reducers on the user's behalf. We can choose not to "intelligently" choose the # of reducers and let the framework fail any job that doesn't "correctly" specify the # of reducers. Then Pig is out of this guessing game and users are forced by the framework to correctly specify the # of reducers.

          2) Now that the MR framework will fail the job based on configured limits, operators where Pig does compute and set the number of reducers (like skewed join etc.) should now be aware of those limits so that the # of reducers computed by them falls within those limits.

          Olga Natkovich added a comment -

          Ashutosh,

          First, the changes are not going to be in the framework until Hadoop 0.22, and I don't think we want to wait that long, as we are seeing quite a few problems on our cluster. Second, I think we want to take the direction with Pig of setting things up for users. Of course, we don't have stats right now to do so accurately, but I think this is a step in the right direction.

          Olga Natkovich added a comment -

          Jeff, sorry this patch did not get much attention in a while. Can I ask you to do the following:

          (1) Regenerate the patch for the latest trunk and make sure that the tests pass and we get no additional warnings.
          (2) Add a docs comment that describes in one place what the exact heuristics are, when they are applied, and how they can be influenced. I will ask our doc writer to incorporate this information into the Pig 0.8.0 documentation.
          (3) If it is not already done, can we log the value that will be used so that the user knows what is happening?

          Thanks!

          Jeff Zhang added a comment -

          Olga, I generated the patch for the latest trunk and added documentation in the method estimateNumberOfReducers in JobControlCompiler. If you need anything else, feel free to tell me.

          Olga Natkovich added a comment -

          Hi Jeff,

          Thanks for the quick response. I will review and commit the patch. I am going to add a log statement for the reducer value that has been computed. I will also copy your doc comment from the source to the JIRA to assist our doc writer.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12450579/PIG-1249_5.patch
          against trunk revision 979503.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/359/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/359/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/359/console

          This message is automatically generated.

          Olga Natkovich added a comment -

          Patch committed. Thanks Jeff!

          Olga Natkovich added a comment -

          Comments for the documentation:

          /**
           * Currently the estimation of the reducer number is only applied to HDFS; the estimation is based on the size of the input data stored on HDFS.
           * Two parameters can be configured for the estimation: one is pig.exec.reducers.max, which constrains the maximum number of reducer tasks (default is 999). The other
           * is pig.exec.reducers.bytes.per.reducer (default value is 1000*1000*1000), which is how much data each reducer should handle.
           * E.g. the following is your pig script:
           *   a = load '/data/a';
           *   b = load '/data/b';
           *   c = join a by $0, b by $0;
           *   store c into '/tmp';
           *
           * The size of /data/a is 1000*1000*1000, and the size of /data/b is 2*1000*1000*1000.
           * Then the estimated reducer number is (1000*1000*1000 + 2*1000*1000*1000) / (1000*1000*1000) = 3.
           */

          Anup added a comment -

          One thing we didn't take care of is the use of the Hadoop parameter "mapred.reduce.tasks".
          If I specify the Hadoop parameter -Dmapred.reduce.tasks=450 for all the MR jobs, it is overwritten by estimateNumberOfReducers(conf, mro), which in my case is 15.
          I am not specifying any default_parallel or PARALLEL statements.

          Ideally, the number of reducers should be 450.

          I think we should prioritize this parameter above the estimated-reducers calculation.
          The priority list should be (a sketch of this ordering follows the list):

          1. PARALLEL statement
          2. default_parallel statement
          3. mapred.reduce.tasks Hadoop parameter
          4. estimateNumberOfReducers()
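
          A minimal sketch of the proposed ordering (names are illustrative; detecting whether mapred.reduce.tasks was supplied by the user rather than by the framework default is the part the follow-up fix has to handle, see PIG-1810 below):

          // Proposed precedence: PARALLEL > default_parallel > mapred.reduce.tasks > estimate.
          if (mro.requestedParallelism > 0) {                  // 1. PARALLEL statement
              conf.set("mapred.reduce.tasks", "" + mro.requestedParallelism);
          } else if (pigContext.defaultParallel > 0) {         // 2. default_parallel
              conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);
          } else if (userSetReduceTasks(conf)) {               // 3. -Dmapred.reduce.tasks=N (hypothetical check)
              // keep the user-supplied value as-is
          } else {
              estimateNumberOfReducers(conf, mro);             // 4. input-size estimate
          }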

          Daniel Dai added a comment -

          Agreed, "mapred.reduce.tasks" should have a higher priority than estimateNumberOfReducers(). If not, then it's a bug.

          Jeff Zhang added a comment -

          I've created a ticket for this issue: PIG-1810.


            People

            • Assignee: Jeff Zhang
            • Reporter: Arun C Murthy
            • Votes: 0
            • Watchers: 3
