Hadoop Map/Reduce: MAPREDUCE-4464

Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.2.0
    • Component/s: task
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      If DNS does not resolve hostnames properly, reduce tasks can fail with a very misleading exception.

      As per my peer Ahmed's diagnosis:

      In ReduceTask, it seems that event.getTaskTrackerHttp() returns a malformed URI, and so the host value obtained from:

      String host = u.getHost();
      

      evaluates to null, and a NullPointerException is then thrown later in ConcurrentHashMap.get().

      I have written a patch that checks for a null hostname when getHost() is called in the getMapCompletionEvents method and prints an intelligible warning message, rather than suppressing the problem until it surfaces later in a confusing and misleading way.
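
      As a minimal standalone sketch of why the host comes back null (an illustration, not part of the original report; the class name is hypothetical and the hostname mirrors the one reported in the comments below): java.net.URI.getHost() only returns a hostname for a valid server-based authority, so a tracker URL whose hostname contains an illegal character such as an underscore yields null.

      import java.net.URI;

      public class NullHostDemo {
        public static void main(String[] args) {
          // The underscore makes the authority registry-based rather than
          // server-based, so getHost() returns null while getAuthority()
          // still returns the raw authority text.
          URI u = URI.create("http://devel_vm.vm:50060/mapOutput");
          System.out.println(u.getAuthority()); // devel_vm.vm:50060
          System.out.println(u.getHost());      // null
        }
      }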

      Attachments

  1. MAPREDUCE-4464.patch
        0.8 kB
        Clint Heath
      2. MAPREDUCE-4464.patch
        0.9 kB
        Harsh J
      3. MAPREDUCE-4464_new.patch
        0.8 kB
        Clint Heath

        Activity

        Karthik Kambatla added a comment -

        Clint, thanks a lot for looking into this issue.

        Minor comment: Would it be better to throw an IOException wrapped with your message, so that we can avoid the subsequent NullPointerException?

        Clint Heath added a comment -

        Sorry, I should have supplied the exception that we encountered when this issue happened. As it turned out, the host names in the cluster all had illegal DNS characters in them (the underscore "_"), so when the getHost() call was made, null was returned and we saw the following.

        Mappers get to about 80% complete when the reducers all begin to throw the following exceptions and then die almost immediately; eventually the whole job dies:

        2012-06-26 15:56:02,326 FATAL org.apache.hadoop.mapred.Task: attempt_201206251823_0004_r_000036_1 GetMapEventsThread Ignoring exception : java.lang.NullPointerException
            at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
            at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2835)
            at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2756)
        
        2012-06-26 15:56:02,356 FATAL org.apache.hadoop.mapred.Task: attempt_201206251823_0004_r_000036_1 GetMapEventsThread Ignoring exception : org.apache.hadoop.ipc.RemoteException: java.io.IOException: JvmValidate Failed. Ignoring request from task: attempt_201206251823_0004_r_000036_1, with JvmId: jvm_201206251823_0004_r_-396118293
            at org.apache.hadoop.mapred.TaskTracker.validateJVM(TaskTracker.java:3468)
            at org.apache.hadoop.mapred.TaskTracker.getMapCompletionEvents(TaskTracker.java:3731)
            at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:396)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
            at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
            at org.apache.hadoop.ipc.Client.call(Client.java:1107)
            at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
            at $Proxy0.getMapCompletionEvents(Unknown Source)
            at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2798)
            at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2756)
        
        2012-06-26 15:56:02,361 FATAL org.apache.hadoop.mapred.Task: Failed to contact the tasktracker
        org.apache.hadoop.ipc.RemoteException: java.io.IOException: JvmValidate Failed. Ignoring request from task: attempt_201206251823_0004_r_000036_1, with JvmId: jvm_201206251823_0004_r_-396118293
            at org.apache.hadoop.mapred.TaskTracker.validateJVM(TaskTracker.java:3468)
            at org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3714)
            at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:396)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
            at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
            at org.apache.hadoop.ipc.Client.call(Client.java:1107)
            at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
            at $Proxy0.fatalError(Unknown Source)
            at org.apache.hadoop.mapred.Task.reportFatalError(Task.java:294)
            at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2781)
        
        Clint Heath added a comment -

        Karthik,

        I'm fine with that as long as it doesn't interrupt the overall flow and process of what's supposed to happen when a task fails. In our case, every reduce task failed and therefore the entire job, but I can see a situation where only one TT machine had a bad hostname and therefore only a subset of reduce tasks would fail and the overall job may still complete. I just want to make sure we are informative in the logs and that the tasks are allowed to be re-tried if applicable, etc. I haven't thought through all the logic far enough yet to know the ramifications of throwing an IOE right there. Harsh and I chatted about the same idea earlier, though. I'll vet that out...

        Clint Heath added a comment -

        I have just attached the new patch, which throws an IOException instead. Would it be possible to request that this JIRA be assigned to me? I'd like to get this resolved as quickly as possible. Thank you.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12537745/MAPREDUCE-4464_new.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2747//console

        This message is automatically generated.

        Harsh J added a comment -

        Hi Clint,

        Sorry about the delay here!

        I noticed that the line:

        String host = u.getHost();

        which is the line in question (the one carrying the null), is then used in the lookup:

        List<MapOutputLocation> loc = mapLocations.get(host);

        Hence, I think the ideal fix would be to throw an exception, because in the code that follows we rely heavily on host:

                      URI u = URI.create(event.getTaskTrackerHttp());
                      String host = u.getHost();
                      TaskAttemptID taskId = event.getTaskAttemptId();
                      URL mapOutputLocation = new URL(event.getTaskTrackerHttp() + 
                                              "/mapOutput?job=" + taskId.getJobID() +
                                              "&map=" + taskId + 
                                              "&reduce=" + getPartition());
                      List<MapOutputLocation> loc = mapLocations.get(host);
                      if (loc == null) {
                        loc = Collections.synchronizedList
                          (new LinkedList<MapOutputLocation>());
                        mapLocations.put(host, loc);
                       }
                      loc.add(new MapOutputLocation(taskId, host, mapOutputLocation));
                      numNewMaps ++;
        

        As seen from its usage, if the host itself cannot be determined and is consistently null, we cannot really work with it, and throwing an IOException makes sense.
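
        As a rough sketch of where such a guard could go (an illustration of the approach being discussed, not the exact committed patch; the message text follows the one shown in the reproduction further below):

                      URI u = URI.create(event.getTaskTrackerHttp());
                      String host = u.getHost();
                      if (host == null) {
                        // An illegal hostname (e.g. one containing an underscore) makes
                        // URI.getHost() return null; fail here with a clear message
                        // instead of letting the null key reach mapLocations.get(host),
                        // where it would surface as a misleading NullPointerException.
                        throw new IOException("Invalid hostname found in tracker location: '"
                            + event.getTaskTrackerHttp() + "'");
                      }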

        I'm currently running test-patch on your patch for branch-1; depending on the results I'll either commit it or post some further comments.

        MR2 may be similarly affected on the Netty side, but it may already be failing properly; I don't have time to verify at the moment (perhaps another JIRA). So I'll just focus on the MR1 side for now.

        Thanks for the patch!

        Clint Heath added a comment -

        Thanks Harsh! I'll take a look at YARN and see if a similar situation is present there.

        Harsh J added a comment -

        Clint,

        Thanks for looking at YARN (do file a new JIRA even if it's just for investigation). You will need to look at the ShuffleHandler class and downwards from there.

        Harsh J added a comment -

        From test-patch on branch-1:

            [exec] -1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
             [exec]                         Please justify why no tests are needed for this patch.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     -1 findbugs.  The patch appears to introduce 220 new Findbugs (version 2.0.1-rc3) warnings.
        

        The 220 new Findbugs (version 2.0.1-rc3) warnings are the same ones we get when the findbugs target is run on a patch-less branch-1, so the patch itself introduces none. From an initial look, there don't appear to be any existing test cases covering this. Also, the method in which we'll be throwing this exception already declares "throws IOException".

        I ran -Dtestcase=TestMR* and -Dtestcase=TestMap* to run some MR tests over branch-1 and they seem to pass with this applied.

        +1 for committing. I'm just going to run one job on a cluster instance that has a hostname with an underscore before doing so, to make sure this works reliably.

        Harsh J added a comment -

        I improved the message in Clint's patch slightly to indicate what to look at.

        And I could also successfully reproduce the issue on a forced bad hostname machine (devel_vm.vm):

        12/09/27 21:52:16 INFO mapred.JobClient: Task Id : attempt_201209272149_0001_r_000000_2, Status : FAILED
        Error: java.io.IOException: Invalid hostname found in tracker location: 'http://devel_vm.vm:50060'
        	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2920)
        	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2845)
        
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12546872/MAPREDUCE-4464.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2886//console

        This message is automatically generated.

        Harsh J added a comment -

        I've committed this to branch-1. Thanks very much for the report, your keen eye for issues, and your patch contributions, Clint! Hope to see more in the future!

        Clint Heath added a comment -

        Thanks Harsh! I look forward to contributing much more, too.

        Matt Foley added a comment -

        Closed upon release of Hadoop 1.2.0.


          People

          • Assignee: Clint Heath
          • Reporter: Clint Heath
          • Votes: 0
          • Watchers: 5

          Time Tracking

          • Original Estimate: 1h
          • Remaining Estimate: 1h
          • Time Spent: Not Specified
