Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: jobtracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      Fixes the NPE in 'refreshNodes', ExpiryTracker thread and heartbeat. NPE occurred in the following cases
      - a blacklisted tracker is either decommissioned or expires.
      - a lost tracker gets blacklisted


      Show
      Fixes the NPE in 'refreshNodes', ExpiryTracker thread and heartbeat. NPE occurred in the following cases - a blacklisted tracker is either decommissioned or expires. - a lost tracker gets blacklisted

      Description

      NullPointerException is obtained in Tracker Expiry Thread. Below is the exception obtained in the JT logs

      ERROR org.apache.hadoop.mapred.JobTracker: Tracker Expiry Thread got exception: java.lang.NullPointerException
              at org.apache.hadoop.mapred.JobTracker.updateTaskTrackerStatus(JobTracker.java:2971)
              at org.apache.hadoop.mapred.JobTracker.access$300(JobTracker.java:104)
              at org.apache.hadoop.mapred.JobTracker$ExpireTrackers.run(JobTracker.java:381)
              at java.lang.Thread.run(Thread.java:619)
      

      The steps to reproduce this issue are:

      • Blacklist a TT.
      • Restart it.
      • The above exception is obtained when the first instance of TT is marked as lost.

      However the above exception does not break any functionality.

      1. mapreduce-754-wip.patch
        13 kB
        Sreekanth Ramakrishnan
      2. mapreduce-754-v1.1.patch
        10 kB
        Amar Kamat
      3. mapreduce-754-v1.2.patch
        14 kB
        Amar Kamat
      4. mapreduce-754-v2.2.patch
        14 kB
        Amar Kamat
      5. mapreduce-754-v2.2-yahoo.patch
        22 kB
        Amar Kamat
      6. mapreduce-754-v2.2-branch-21.patch
        15 kB
        Amar Kamat
      7. mapreduce-754-v2.2.1-yahoo.patch
        10 kB
        Amar Kamat

        Issue Links

          Activity

          Hide
          Sreekanth Ramakrishnan added a comment -

          Took a look at the issue, talked with Amareshwari also with regards to this. Putting up a temporary work in progress patch.

          The idea of the patch is when a tracker is blacklisted we just set the number of trackers on the host to zero and not remove from it. The removal from the map is only done when tracker is lost.

          • Refactored a bit of code in JobTracker so the lost tracker logic can be unit tested.
          • Added a new test case to check lost tracker.

          Test case code for integration between lost and black listed tracker needs further look/work

          Show
          Sreekanth Ramakrishnan added a comment - Took a look at the issue, talked with Amareshwari also with regards to this. Putting up a temporary work in progress patch. The idea of the patch is when a tracker is blacklisted we just set the number of trackers on the host to zero and not remove from it. The removal from the map is only done when tracker is lost. Refactored a bit of code in JobTracker so the lost tracker logic can be unit tested. Added a new test case to check lost tracker. Test case code for integration between lost and black listed tracker needs further look/work
          Hide
          Amareshwari Sriramadasu added a comment -

          Blacklisting should not add and remove hosts to the uniqueHostsMap. It should just do a get to get the number and the update should happen only in updateTaskTrackerStatus method.

          Show
          Amareshwari Sriramadasu added a comment - Blacklisting should not add and remove hosts to the uniqueHostsMap. It should just do a get to get the number and the update should happen only in updateTaskTrackerStatus method.
          Hide
          Amar Kamat added a comment -

          Attaching a file that fixes the error on uniqueHostsMap. With this patch the uniqueHostsMap add/remove operations will be done only in JobTracker.updateTaskTrackerStatuses(). Result of test-patch
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Testing in progress.

          Show
          Amar Kamat added a comment - Attaching a file that fixes the error on uniqueHostsMap. With this patch the uniqueHostsMap add/remove operations will be done only in JobTracker.updateTaskTrackerStatuses(). Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Testing in progress.
          Hide
          Amar Kamat added a comment -

          All tests passed except TestReduceFetch.

          Show
          Amar Kamat added a comment - All tests passed except TestReduceFetch.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12424195/mapreduce-754-v1.1.patch
          against trunk revision 833990.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/232/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/232/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/232/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/232/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12424195/mapreduce-754-v1.1.patch against trunk revision 833990. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/232/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/232/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/232/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/232/console This message is automatically generated.
          Hide
          Amareshwari Sriramadasu added a comment -

          The fix is fine for both MAPREDUCE-754 and MAPREDUCE-1188, but MAPREDUCE-746 is not solved yet.

          Show
          Amareshwari Sriramadasu added a comment - The fix is fine for both MAPREDUCE-754 and MAPREDUCE-1188 , but MAPREDUCE-746 is not solved yet.
          Hide
          Amar Kamat added a comment -

          Incorporating MAPREDUCE-746 in this jira.

          Show
          Amar Kamat added a comment - Incorporating MAPREDUCE-746 in this jira.
          Hide
          Amar Kamat added a comment -

          Attaching a patch that incorporates MAPREDUCE-746 also. Result of test-patch
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 6 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Testing in progress.

          Show
          Amar Kamat added a comment - Attaching a patch that incorporates MAPREDUCE-746 also. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Testing in progress.
          Hide
          Amar Kamat added a comment -

          All tests passed on my box.

          Show
          Amar Kamat added a comment - All tests passed on my box.
          Hide
          Amar Kamat added a comment -

          The latest patch (v1.2) includes testcase for MAPREDUCE-1188 and MAPREDUCE-746.

          Show
          Amar Kamat added a comment - The latest patch (v1.2) includes testcase for MAPREDUCE-1188 and MAPREDUCE-746 .
          Hide
          Hemanth Yamijala added a comment -

          This patch introduces a bug in getNumberOfUniqueHosts. It subtracts the number of blacklisted trackers from the number of unique hosts, which is incorrect if multiple trackers are running on a host.

          Show
          Hemanth Yamijala added a comment - This patch introduces a bug in getNumberOfUniqueHosts. It subtracts the number of blacklisted trackers from the number of unique hosts, which is incorrect if multiple trackers are running on a host.
          Hide
          Hemanth Yamijala added a comment -

          Some more comments:

          • It would be useful to add a javadoc for getNumberOfUniqueHosts, along with a reason why blacklisted hosts must be excluded from this count. Please remember our offline discussion where we spoke about why this number must be excluded when scheduling.

          Other comments on test cases:

          • In TestLostTracker, please separate case 3 into a separate test case. It is generally good unit testing practice to test separate conditions in separate tests.
          • We can assert some state after case 2 and case 3 in addition to just making sure method calls succeed. For e.g. in the case of blacklisting, we can check the number of active hosts is decremented by the right value (because we are changing that API as well and will be a good check). Likewise we can also check that a host is blacklisted or a job is finished, etc.
          • Please create the hosts.exclude file in a folder relative to TEST_DIR.
          • The testcase testBlacklistedNodeDecommissioning can blacklist a node by globally blacklisting - rather than the health check script, which is slightly more complicated. One reason for doing so is that we can do this without having to wait for blacklisting to happen asynchronously.
          • Common code in this class related to global blacklisting because of job failures as well as refresh of hosts can also be refactored into separate utility methods and reused.
          • Instead of checking if the decommissioned tracker is not present in the list of trackers, since we are starting with only one tracker, we can explicitly check that the number of trackers in jt.taskTrackers is 0.
          • Some additional tests that I can suggest
            • Blacklist + decommission when there are multiple trackers per host
            • Have a cluster with 3 trackers, blacklist one of them, decommission 2 of them, and make sure the active, decommissioned and blacklisted counts all match.
          Show
          Hemanth Yamijala added a comment - Some more comments: It would be useful to add a javadoc for getNumberOfUniqueHosts, along with a reason why blacklisted hosts must be excluded from this count. Please remember our offline discussion where we spoke about why this number must be excluded when scheduling. Other comments on test cases: In TestLostTracker, please separate case 3 into a separate test case. It is generally good unit testing practice to test separate conditions in separate tests. We can assert some state after case 2 and case 3 in addition to just making sure method calls succeed. For e.g. in the case of blacklisting, we can check the number of active hosts is decremented by the right value (because we are changing that API as well and will be a good check). Likewise we can also check that a host is blacklisted or a job is finished, etc. Please create the hosts.exclude file in a folder relative to TEST_DIR. The testcase testBlacklistedNodeDecommissioning can blacklist a node by globally blacklisting - rather than the health check script, which is slightly more complicated. One reason for doing so is that we can do this without having to wait for blacklisting to happen asynchronously. Common code in this class related to global blacklisting because of job failures as well as refresh of hosts can also be refactored into separate utility methods and reused. Instead of checking if the decommissioned tracker is not present in the list of trackers, since we are starting with only one tracker, we can explicitly check that the number of trackers in jt.taskTrackers is 0. Some additional tests that I can suggest Blacklist + decommission when there are multiple trackers per host Have a cluster with 3 trackers, blacklist one of them, decommission 2 of them, and make sure the active, decommissioned and blacklisted counts all match.
          Hide
          Amar Kamat added a comment -

          Shoudnt we simply fix the problem at hand and address the bug in JobTracker.getNumberOfUniqueHosts() in a separate jira? This fix should simply make sure that the uniqueHostsMap value for a given host is not null.

          Show
          Amar Kamat added a comment - Shoudnt we simply fix the problem at hand and address the bug in JobTracker.getNumberOfUniqueHosts() in a separate jira? This fix should simply make sure that the uniqueHostsMap value for a given host is not null.
          Hide
          Amar Kamat added a comment -

          Attaching a patch that simply guards for null check. Adding testcases for the same.

          Show
          Amar Kamat added a comment - Attaching a patch that simply guards for null check. Adding testcases for the same.
          Hide
          Amar Kamat added a comment -

          test-patch passed and ant tests failed on TestGridmixSubmission. Running through hudson.

          Show
          Amar Kamat added a comment - test-patch passed and ant tests failed on TestGridmixSubmission. Running through hudson.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427165/mapreduce-754-v2.2.patch
          against trunk revision 887844.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/297/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/297/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/297/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/297/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427165/mapreduce-754-v2.2.patch against trunk revision 887844. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/297/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/297/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/297/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/297/console This message is automatically generated.
          Hide
          Sharad Agarwal added a comment -

          +1

          Show
          Sharad Agarwal added a comment - +1
          Hide
          Amar Kamat added a comment -

          Attaching an example patch for Yahoo's distribution of Hadoop not to be committed.

          Show
          Amar Kamat added a comment - Attaching an example patch for Yahoo's distribution of Hadoop not to be committed.
          Hide
          Amar Kamat added a comment -

          Attaching a patch for branch-0.21. test-patch and all ant tests passed.

          Show
          Amar Kamat added a comment - Attaching a patch for branch-0.21. test-patch and all ant tests passed.
          Hide
          Sharad Agarwal added a comment -

          I just committed this to trunk and 0.21. Thanks Amar.

          Show
          Sharad Agarwal added a comment - I just committed this to trunk and 0.21. Thanks Amar.
          Hide
          Amar Kamat added a comment -

          Attaching a patch for Yahoo!'s distribution of Hadoop.

          Show
          Amar Kamat added a comment - Attaching a patch for Yahoo!'s distribution of Hadoop.
          Hide
          Amar Kamat added a comment -

          test-patch and all tests except TestDFSPermission, TestJobHistory, TestReduceFetch and TestPermission. I ran TestJobHistory and TestPermission locally again and it passed. TestDFSPermission failure is unrelated while TestReduceFetch failure is a known issue.

          Show
          Amar Kamat added a comment - test-patch and all tests except TestDFSPermission, TestJobHistory, TestReduceFetch and TestPermission. I ran TestJobHistory and TestPermission locally again and it passed. TestDFSPermission failure is unrelated while TestReduceFetch failure is a known issue.
          Hide
          Amar Kamat added a comment -

          test-patch and all tests except TestDFSPermission, TestJobHistory, TestReduceFetch and TestPermission. I ran TestJobHistory and TestPermission locally again and it passed. TestDFSPermission failure is unrelated while TestReduceFetch failure is a known issue.

          This is for the patch uploaded for Yahoo!'s distribution of Hadoop.

          Show
          Amar Kamat added a comment - test-patch and all tests except TestDFSPermission, TestJobHistory, TestReduceFetch and TestPermission. I ran TestJobHistory and TestPermission locally again and it passed. TestDFSPermission failure is unrelated while TestReduceFetch failure is a known issue. This is for the patch uploaded for Yahoo!'s distribution of Hadoop.

            People

            • Assignee:
              Amar Kamat
              Reporter:
              Ramya Sunil
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development