PIG-2779: Refactoring the code for setting number of reducers

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels: None

      Description

      As PIG-2652 observed, the code for setting the number of reducers is currently a little messy. MapReduceOper.requestedParallelism seems to be misused in some places, and the runtime estimation of #reducers that we now support further complicates the problem.

      For example, if we specify parallel 1 for an order-by, the estimated #reducers will be used instead. If we specify parallel 2 while the estimate is 4, the order-by fails with "Illegal partition for Null". If we specify parallel 4 while the estimate is 2, some reducers will have nothing to do.

      1. TestNumberOfReducers.java
        4 kB
        Jie Li
      2. TestNumberOfReducers.java
        4 kB
        Jie Li
      3. PIG-2779.6.patch
        31 kB
        Bill Graham
      4. PIG-2779.5.patch
        31 kB
        Bill Graham
      5. PIG-2779.4.patch
        20 kB
        Jie Li
      6. PIG-2779.3.patch
        20 kB
        Jie Li
      7. PIG-2779.2.patch
        18 kB
        Jie Li
      8. PIG-2779.1.patch
        18 kB
        Jie Li
      9. PIG-2779.0.patch
        9 kB
        Jie Li

        Issue Links

          Activity

          Bill Graham added a comment -

          Committed, thanks Jie for seeing this one through!

          Bill Graham added a comment -

           There are a few tests that failed, but when run individually they all pass, so I think it's just flakiness in the full CI test target. Attaching patch #6, which contains the ASF 2.0 license headers in TestNumberOfReducers.java.

          Jie Li added a comment -

          Thanks Bill! These changes sound perfect to me. Let's see if any tests need to be fixed.

          Bill Graham added a comment -

          Agreed, let's hold off on feature creep. I've made a few edits and updated the patch. A few things to note:

          • I added more assertions for all parallel values after each test in TestNumberOfReducers and TestJobSubmission.
           • I changed the contract of PigReducerEstimator to return -1 if estimation can't be done (e.g. HBaseStorage), since returning 1 was misleading.
          • I refactored some common code in TestNumberOfReducers.

          Take a look and let me know what you think Jie. I'll run the test suite with our CI server tonight.

          Jie Li added a comment -

          The bugs addressed by this patch would block PIG-2772 and PIG-483, where we need the right number of reducers to remove small jobs at runtime. So I guess we can probably move new features into separate tickets?

          Bill Graham added a comment -

          I'm going to take a look to see what it would take to implement pig.info.reducers.keyword.parallel without refactoring requestedParallel, since I'm not sure we want to take that on.

          Jie Li added a comment -

          Attached PIG-2779.4.patch that incorporates #1 discussed above.

           For #2, we can probably do it later along with the cleanup of requestedParallel?

          Jie Li added a comment -

          Agreed with #1.

          Re #2, according to http://pig.apache.org/docs/r0.10.0/perf.html#parallel, these operators support PARALLEL:

          COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN (outer), and ORDER BY.

           We need to make sure the PARALLEL associated with these operators remains the same across the logical/physical/MR phases. It seems to suffer from the same complexity that requestedParallel faces, such as query transformation, multi-query optimization, etc., so it doesn't look trivial?

          Bill Graham added a comment -

           Re #1, for later analysis it would be useful to know the estimated parallelism only if estimation in fact kicked in because it was needed. If it didn't kick in, that's also useful to know. Hence I think we should move that call into the else block.

          Agreed that requestedParallel is used all over and we shouldn't mess with that now. Let's instead set another field when PARALLEL is passed. pig.info.reducers.keyword.parallel? That said, do you still see value in setting pig.info.reducers.requested.parallel, since it could be used for so many different things?

          Jie Li added a comment -

           Hi Bill, regarding the first comment, I agree we can avoid estimating when it isn't needed, and that should be adjusted. But would you still need the estimate for later analysis in any way (e.g. comparing the estimated #reducers and the PARALLEL keyword)? If not, I'll fix it.

           Regarding the second comment, yes, requestedParallel has been used for multiple purposes for a long time and adjusted across the logical/physical/MR phases, and it is also set as the final number of reducers. Many unit tests also assume it is the final number of reducers. I'm afraid cleaning it up may need another ticket? Or maybe an easier approach is to add another field that just records the PARALLEL keyword?

          Bill Graham added a comment -

          Thanks Jie, I've been running the test suite on our CI server and so far things look good. We're close. A few more comments though:

          • In JobControlCompiler we should only call estimateNumberOfReducers if we need to, since we don't need it if requestedParallelism or defaultParallel will govern and the call could be expensive. So we'd have this:
            } else {
              mro.estimatedParallelism = estimateNumberOfReducers(conf, lds, nwJob);
              jobParallelism = mro.estimatedParallelism;
            }
            
           • The semantics of pig.info.reducers.requested.parallel is a bit misleading as implemented. I would expect it to be the value set via the PARALLEL statement, but that's not the case, since requestedParallel gets set to jobParallelism on line 796 of JCC. Would you please add to the tests in TestJobSubmission that each of the pig.info.reducers.* fields is set as expected (or not set) after each of the scenarios? I suspect there are cases where pig.info.reducers.requested.parallel is being set when PARALLEL isn't used. (An illustrative assertion sketch follows below.)
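           A minimal sketch of such an assertion, assuming JUnit and the property names proposed in this thread; the helper class and the jobConf fixture are illustrative, not the actual TestJobSubmission code:

             import org.apache.hadoop.conf.Configuration;
             import static org.junit.Assert.assertEquals;

             // Illustrative helper: check which pig.info.reducers.* values were
             // recorded in the compiled job's configuration for one scenario.
             class ParallelismInfoAssert {
                 static void assertParallelismInfo(Configuration conf, String requested,
                                                   String estimated, String runtime) {
                     assertEquals(requested, conf.get("pig.info.reducers.requested.parallel"));
                     assertEquals(estimated, conf.get("pig.info.reducers.estimated.parallel"));
                     assertEquals(runtime, conf.get("pig.info.reducers.runtime.parallel"));
                 }
             }

             // Example: PARALLEL was not used and estimation produced 4 reducers, so the
             // requested property should be absent (null) and the other two should be "4":
             // ParallelismInfoAssert.assertParallelismInfo(jobConf, null, "4", "4");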
          Jie Li added a comment -

           Attached PIG-2779.3.patch, which sets the various parallelism values into the job conf for later analysis, as suggested by Bill. Also added unit tests for them.

          Bill Graham added a comment -

          I just checked and default_parallel doesn't show up in the jobconf when set so we should add it in the same format:

          pig.info.reducers.default.parallel
          

          Since pig.info.reducers.runtime.parallel is what becomes mapred.reduce.tasks, no we don't need that one.

          Jie Li added a comment -

           estimatedParallelism is a good one to add, and these names look good to me. Has default parallel already shown up in the job conf? As for pig.info.reducers.runtime.parallel, it should be the same as mapred.reduce.tasks, so do we still need it?

          Bill Graham added a comment -

          Great! We should probably set estimatedParallelism too, in case that's what was used. If default parallel is used, it seems like that should show up already as default_parallel actually, so we can omit that one. How about this:

          pig.info.reducers.requested.parallel
          pig.info.reducers.estimated.parallel
          pig.info.reducers.runtime.parallel
          

          I'm prefixing with 'info' to denote that these fields are not accepted as input. Instead they are produced as output for debugging and analysis. I'll send an email to pig-dev about the suggested syntax.

          Jie Li added a comment -

          Great point Bill! Yeah I'll set defaultParallel and requestedParallelism into the jobconf in this patch. What could be good names for them?

          Bill Graham added a comment -

           Jie, as part of this clean up it would be really useful to record the various parallelism values in the job conf for later analysis. I was thinking we capture defaultParallel, requestedParallelism and runtimeParallelism (which should == mapred.reduce.tasks, right?). That way we can see later which values were set and which was used. It would be great to know whether parallelism was determined by a PARALLEL statement, via estimation, or via a default. This would be in addition to the following related params we currently capture:

          pig.exec.reducers.max
          pig.exec.reducers.bytes.per.reducer
          

           Do you want to add this to this issue or do you think we should do this in a separate JIRA?

           I use IntelliJ and I've just set the syntax to match Apache's.

          Jie Li added a comment -

          Thanks Bill for the review! Attached PIG-2779.2.patch with fixes.

          MRCompiler
          1. You removed some logic which seems to now be in JCC. What about the removed call to eng.getJobConf().getNumReduceTasks()? Is that still being picked up elsewhere?

          It appears like that check was dropped.

           Yeah, I purposely left out the parameter "mapred.reduce.tasks", as the runtime adjustment used to ignore it.

           I thought about including it as an option for setting #reducers, but the problem is that its default value is 1 instead of -1, so we can't distinguish whether users set it to 1 or not (this was one of the bugs before). Hive supports this parameter by making it default to -1 and letting users override it. We can also add this support if we want.
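           A minimal sketch of the Hive-style convention described above, assuming a plain Hadoop Configuration; the variable names are illustrative, not patch code:

             // With Hadoop's stock default of 1, conf.getInt("mapred.reduce.tasks", 1)
             // cannot tell "the user asked for exactly 1 reducer" apart from "the user
             // set nothing". Shipping a -1 default restores that distinction.
             int userReducers = conf.getInt("mapred.reduce.tasks", -1);
             if (userReducers > 0) {
                 jobParallelism = userReducers;         // explicit user override
             } else {
                 jobParallelism = estimatedParallelism; // fall back to runtime estimation
             }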

           2. HExecutionEngine import no longer needed.

          Nice catch. Removed.

          JobControlCompiler
          3. Should ParallelConstantVisitor be called with mro.reducePlan or nextMro.reducePlan?

          Should be mro, the sample job.

           4. calculateRuntimeReducers doesn't need MROperPlan plan in its signature.

          Sure. Removed.

          5. Line 804, if statements should have curly braces.
          6. Line 819, should have space between else and trailing curly bracket.

           Nice catch. Btw, do you use any syntax-checking tool?

          test/smoke-tests
          7. Did you mean to include this in the patch?

          Sorry for that. Removed.

          Bill Graham added a comment -

          Jie, these changes make sense to me, assuming tests pass. Just a few questions and minor nits:

          MRCompiler
          1. You removed some logic which seems to now be in JCC. What about the removed call to eng.getJobConf().getNumReduceTasks()? Is that still being picked up elsewhere? It appears like that check was dropped.
           2. HExecutionEngine import no longer needed.

          JobControlCompiler
          3. Should ParallelConstantVisitor be called with mro.reducePlan or nextMro.reducePlan?
           4. calculateRuntimeReducers doesn't need MROperPlan plan in its signature.
          5. Line 804, if statements should have curly braces.
          6. Line 819, should have space between else and trailing curly bracket.

          test/smoke-tests
          7. Did you mean to include this in the patch?

          Jie Li added a comment -

           The latest PIG-2779.1.patch introduces the notion of runtimeParallelism, which is set to the first positive value among the PARALLEL keyword, default_parallel and the estimated parallelism.

          For sampler jobs, we used to set #partitions at compile-time and reset it at runtime; this patch will remove the compile-time setting and only keep the runtime setting.

          For the runtime setting of #partitions, we used to estimate based on the sampler's input; this patch will instead estimate based on the next job's input, as for skew-join they are different.

           For the sampler's next job, e.g. an order-by or skew join, we used to calculate its #reducers independently of the sampler; this patch instead calculates them together with the sampler, so we can keep the sampler's #partitions and the next job's #reducers synchronized.
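           A minimal sketch of the "first positive value wins" rule described above; the method and parameter names are illustrative, not the patch's actual code:

             // Illustrative only: runtime parallelism is the first positive value among
             // the PARALLEL keyword, default_parallel, and the estimated parallelism.
             static int runtimeParallelism(int requestedParallel, int defaultParallel,
                                           int estimatedParallel) {
                 if (requestedParallel > 0) return requestedParallel;
                 if (defaultParallel > 0) return defaultParallel;
                 return estimatedParallel;
             }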

          Jie Li added a comment -

          Currently Pig doesn't look at mapred.reduce.tasks when deciding #reducer. Shall we look at it after request_parallel and default_parallel, and before estimating parallel? It'll look like:

                  if (mro.requestedParallelism > 0) {
                      jobParallelism = mro.requestedParallelism;
                  } else if (pigContext.defaultParallel > 0) {
                      jobParallelism = pigContext.defaultParallel;
                  } else if (conf.getInt("mapred.reduce.tasks", -1) > 0) {
                      jobParallelism = conf.getInt("mapred.reduce.tasks", -1);
                  } else {
                      jobParallelism = estimatedParallelism;
                  }  
          
          Jie Li added a comment -

           Thanks Bill! Yeah, this patch would break some tests and I'm fixing them now.

          Bill Graham added a comment -

          +1 for the suggested refactor, as it would be good to clear up distinctions between default parallel, requested parallelism and runtime parallelism.

          Regarding the patch, you're changing logic that was added in the particularly nasty PIG-2652. Do all the tests in the comments and final patch from PIG-2652 pass?

          Jie Li added a comment -

           Attached a patch that fixes the failing cases. The idea is that the sample job shouldn't determine the order-by's parallelism at compile time, because we'll estimate and adjust it at runtime.

          A more elegant refactoring may be possible after PIG-2784.

          Note this patch is necessary for us to remove the sample job at runtime.

          Jie Li added a comment -

          This would be better solved with PIG-2784, but maybe we also want a quick fix just in case?

          Jie Li added a comment -

          Maybe we can refactor like this:

          Merge default parallel
          First, let's merge PigContext.default_parallel into LogicalRelationalOperator's requestedParallelism after the logical plan is built, so we don't need to worry about the default parallel later in Physical phase and MR phase.

          Assignable only once
           Then, let's make all requestedParallelism fields in the Logical/Physical/MR operators assignable only once, and add other fields if necessary (e.g. the runtimeParallel introduced below), as we don't want to mix different semantics together. If we detect it's being reassigned, we can throw a runtime exception.
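           A minimal sketch of the "assignable only once" idea, written as a hypothetical holder class rather than the actual operator fields:

             // Hypothetical: the parallelism can be assigned exactly once; a second
             // assignment means two semantics are being mixed, so fail fast.
             class OnceAssignableParallelism {
                 private int value = -1;
                 void set(int parallelism) {
                     if (value != -1) {
                         throw new IllegalStateException("parallelism already set to " + value);
                     }
                     value = parallelism;
                 }
                 int get() { return value; }
             }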

          Synchronize sample's #partitions and order-by's #reducers
          For the order-by, it's a little tricky: the sample job's requestedParallelism is set to 1 as it only needs one reducer, and it also needs the #partitions (which is the runtime #reducers of the order-by) to generate the partition file.

           Right now we set the sample's #partitions twice (once at compile time and once at runtime), and we want to set it only once, at runtime.

           In order to prevent the sample job from using a different #partitions than the order-by's runtime #reducers, we can introduce another field, runtimeParallel. The sample job will estimate and calculate the #partitions and set it as the order-by's runtimeParallel, which won't be re-calculated later.

          Any comment is appreciated.

          Jie Li added a comment -

           Updated the unit test file, as somehow we can't set "order A by x parallel -1". (Is that a useful feature?)

           Now testEstimate2Default1 produces the wrong #reducers instead of failing.

          Jie Li added a comment -

          Attached a unit test file.

           In testEstimate2Parallel1, Pig uses the wrong #reducers for the order-by.

           In testEstimate2Default1 and testEstimate6Default2, the queries fail due to inconsistent #reducers used by the sample and the order-by.

           In testEstimate2Parallel4, the test passes but two reducers actually have nothing to do.

          Bill Graham added a comment -

           Jie, do you have a sample script with dummy input that illustrates this problem that you can attach?

          Jie Li added a comment -

           For the order-by, we need to pass its final #reducers (not the estimated value) to the sample job to generate the partition file; otherwise the partition file will be inconsistent and cause errors.

           The final #reducers is calculated from the requested value and the estimated value, the latter of which is based on the input data size. Luckily the sample job has the same input data as the order-by, so it can calculate the order-by's final #reducers in advance.
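           A minimal sketch of the dependency described above, with illustrative names only: the order-by's final reducer count is resolved once and reused as the sampler's partition count, so the partition file and the order-by job cannot disagree.

             // Illustrative only: resolve the final reducer count up front, then use the
             // same number both for the partition file's ranges and the order-by job.
             int finalReducers = (requestedParallel > 0) ? requestedParallel : estimatedParallel;
             int samplerPartitions = finalReducers;       // ranges written to the partition file
             orderByJob.setNumReduceTasks(finalReducers); // reducers the order-by runs with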


            People

            • Assignee:
              Jie Li
            • Reporter:
              Jie Li
            • Votes:
              0
            • Watchers:
              4
