Hadoop Common / HADOOP-3296

Some levels are skipped while creating the task cache in JobInProgress

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Consider the following piece of code in JobInProgress.createCache():

      Node node = jobtracker.resolveAndAddToTopology(host);
      for (int j = 0; j < maxLevel; j++) {
        node = JobTracker.getParentNode(node, j);
        // ...
      }

      With maxLevel > 2, the caches will be created in the following order:

        j  node-level
        0  0
        1  1
        2  3
        3  6

      which is not as desired: because getParentNode(node, j) ascends j levels from the node reached in the previous iteration, the levels visited are the triangular numbers 0, 1, 3, 6, ... instead of 0, 1, 2, 3.
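
      To make the skipping concrete, here is a minimal, self-contained sketch that models the loop above. It is an illustration, not the actual Hadoop source, and the one-level-per-iteration correction noted in the comments is one possible shape of a fix, not necessarily the committed patch:

      // LevelSkipDemo.java: models how the createCache() loop walks the
      // topology. getParentNode(node, j) ascends j levels from the node
      // reached in the previous iteration, so the visited levels are the
      // triangular numbers.
      public class LevelSkipDemo {
          public static void main(String[] args) {
              int maxLevel = 4;
              int level = 0; // level of 'node', starting at the resolved host
              for (int j = 0; j < maxLevel; j++) {
                  level += j; // models node = JobTracker.getParentNode(node, j)
                  System.out.println("j=" + j + " caches node-level " + level);
              }
              // Prints levels 0, 1, 3, 6, matching the table above. Ascending
              // exactly one level per iteration after the first (for example,
              // getParentNode(node, 1)) would visit 0, 1, 2, 3 as intended.
          }
      }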

      Attachments

      1. HADOOP-3296-v3.patch (16 kB, Amar Kamat)
      2. HADOOP-3296-v2.patch (16 kB, Amar Kamat)
      3. HADOOP-3296.patch (15 kB, Amar Kamat)


          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #499 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/499/ )
          Chris Douglas added a comment -

          I just committed this. Thanks, Amar

          Chris Douglas added a comment -

          +1

          Amar Kamat added a comment -

          Seems like org.apache.hadoop.dfs.TestEditLog has failed. I ran the same test with a fresh trunk on my box and it passed.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381954/HADOOP-3296-v3.patch
          against trunk revision 656585.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 10 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2480/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2480/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2480/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2480/console

          This message is automatically generated.

          Amar Kamat added a comment -

          Attaching a patch that takes Chris's comments into consideration.

          Chris Douglas added a comment -

          > This should include all tasks that could be scheduled on a cached location (excluding the data_local and rack_local), and it could be disabled for cache level <= 2. Would this work? I believe Amar's testcase banks on this counter being present.

          That sounds easier to understand and parse than the current counter, which gets incremented for all cache hits.

          > As per the trunk there is no way to infer whether the scheduling went correctly or not. With this aggregate counter one can check if the maps were from the cache or not. So the counter is just a count of how many maps got picked up from the task cache.

          Though if we apply the same reasoning used to justify a third counter (the distinction between a task scheduled at level 3 and a cache miss is significant), isn't the distinction between level 3 and level 4 significant? Without this counter, writing a good testcase for this fix would be very difficult, but other particular uses for it are more difficult to summon.

          Devaraj Das added a comment -

          The counter can be called OTHER_LOCAL. This should include all tasks that could be scheduled on a cached location (excluding the data_local and rack_local ones), and it could be disabled for cache level <= 2. Would this work? I believe Amar's testcase banks on this counter being present.
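
          A minimal, self-contained sketch of the classification this comment suggests. The enum names, the cacheLevel convention, and the disable rule are illustrative assumptions, not the committed Hadoop code:

          // LocalityCounterDemo.java: classify a task-cache hit into one of
          // the locality counters discussed in this thread.
          public class LocalityCounterDemo {
              enum LocalityCounter { DATA_LOCAL, RACK_LOCAL, OTHER_LOCAL }

              // cacheLevel: topology level of the cache hit
              // (0 = node-local, 1 = rack-local, >= 2 = higher switch/gateway)
              static LocalityCounter classify(int cacheLevel, int maxLevel) {
                  if (cacheLevel == 0) return LocalityCounter.DATA_LOCAL;
                  if (cacheLevel == 1) return LocalityCounter.RACK_LOCAL;
                  // OTHER_LOCAL only applies when the topology is deeper than
                  // two levels; otherwise it stays disabled (null here).
                  return maxLevel > 2 ? LocalityCounter.OTHER_LOCAL : null;
              }

              public static void main(String[] args) {
                  System.out.println(classify(0, 4)); // DATA_LOCAL
                  System.out.println(classify(1, 4)); // RACK_LOCAL
                  System.out.println(classify(2, 4)); // OTHER_LOCAL
              }
          }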

          Amar Kamat added a comment -

          > Minimally, calling the total number of maps scheduled "Local map tasks" is confusing

          Note that the counter will be incremented only if there is locality at some level. It's different from calling all the maps LOCAL. I agree that we should change the counter name. The intention is to show how many tasks got picked up from the cache, so it includes node-local, rack-local, switch-local, etc. I think Aggregate Locality sounds better; comments?

          > For which users would this be valuable information?

          In general, users whose cache topology has more than 2 levels.

          > If there are more than two levels in the task cache and the distinction is significant, how is an aggregate counter resolving the ambiguity?

          Consider a case where there is no locality at the node level and none at the rack level. As per the trunk there is no way to infer whether the scheduling went correctly or not. With this aggregate counter one can check whether the maps were from the cache or not. So the counter is just a count of how many maps got picked up from the task cache.

          Chris Douglas added a comment -

          > The reason is that if there are more than 2 levels in the task cache (e.g. local/rack/gateway/...) then there is no way to know if the maps were scheduled locally (at a level above rack) or not.

          Minimally, calling the total number of maps scheduled "Local map tasks" is confusing. I'll grant without hesitation that hard-coding two levels doesn't cover all topologies, but this counter provides only slightly more information than the existing counters, diluted. For which users would this be valuable information? If there are more than two levels in the task cache and the distinction is significant, how is an aggregate counter resolving the ambiguity? Please correct this impression if it's mistaken, but this counter is a partial fix to an issue orthogonal to the one described in this JIRA; it should be left out.

          Amar Kamat added a comment -

          > but I don't understand the purpose of the new counter (it could also use an entry in JobInProgress_Counter.properties, btw) ...

          The reason is that if there are more than 2 levels in the task cache (e.g. local/rack/gateway/...) then there is no way to know if the maps were scheduled locally (at a level above rack) or not. Thanks for the review.

          Chris Douglas added a comment -

          The fix to this JIRA looks good, as does the refactoring in the test code, but I don't understand the purpose of the new counter (it could also use an entry in JobInProgress_Counter.properties, btw). It keeps a count of total (i.e. data-local + rack-local + other) placements? Is that necessary?

          Amar Kamat added a comment -

          TestBalancer from the dfs tests failed. I don't think this is related to the patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12380913/HADOOP-3296.patch
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 10 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2328/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2328/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2328/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2328/console

          This message is automatically generated.

          Amar Kamat added a comment -

          Submitting.

          Amar Kamat added a comment -

          Attaching a patch that fixes the problem and a test case.

          Robert Chansler added a comment -

          +1. Group (including QA) agrees this is not something for 0.17.

          Owen O'Malley added a comment -

          I think we should push this to 0.18. I don't see it as a blocker for 0.17, since most clusters either don't use topology or have 2 levels. We should add a release note saying that it is a known bug.


            People

             • Assignee: Amar Kamat
             • Reporter: Amar Kamat
             • Votes: 0
             • Watchers: 2
