Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.3, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1
    • Fix Version/s: 0.20.2
    • Component/s: ipc
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Correct synchronization error in IPC where handler thread could hang if request reader got an error.

      Description

      I can reproduce some rpc call hang bug when connection thread of ipc client receives response for outstanding call.

      The stacks when hang occurs (TaskTracker):

      Waiting on org.apache.hadoop.ipc.Client$Call@1c3cbb4b
      Stack:
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:485)
      org.apache.hadoop.ipc.Client.call(Client.java:691)
      org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
      org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
      org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1250)
      org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1082)
      org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1785)
      org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2796)

      1. hangClient1.patch
        4 kB
        Hairong Kuang
      2. hangClient-0.20.patch
        4 kB
        Hairong Kuang
      3. hangClient.patch
        3 kB
        Hairong Kuang
      4. hadoop-6498-branch-0.20
        0.8 kB
        Ruyue Ma
      5. hadoop-6498.patch
        0.8 kB
        Ruyue Ma

        Issue Links

          Activity

          Hide
          Ruyue Ma added a comment -

          This bug is related to the ipc/Client.java. In receiveResponse() methods, we first remove the call, then read the data. When reading encouters exception, the thread waiting the call object will not be waken forever.

          Show
          Ruyue Ma added a comment - This bug is related to the ipc/Client.java. In receiveResponse() methods, we first remove the call, then read the data. When reading encouters exception, the thread waiting the call object will not be waken forever.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12430656/hadoop-6498.patch
          against trunk revision 899866.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/280/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/280/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/280/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/280/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12430656/hadoop-6498.patch against trunk revision 899866. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/280/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/280/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/280/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/280/console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Though the fix is a proper direction to avoid client hangs, we should first fix the original problem. Heartbeat hanging would mean some kind of uncaught exception like NPE on the JobTracker side which left unhandled may result in inconsistent state. Can you dig through JT logs and see if any such issue happened?

          Show
          Vinod Kumar Vavilapalli added a comment - Though the fix is a proper direction to avoid client hangs, we should first fix the original problem. Heartbeat hanging would mean some kind of uncaught exception like NPE on the JobTracker side which left unhandled may result in inconsistent state. Can you dig through JT logs and see if any such issue happened?
          Hide
          Ruyue Ma added a comment -

          Two points:

          1) Connection exception(networking has some problem) also causes socket reading exception. It's not always server exception.

          2) Even if jobtracker gives the exception, we wouldn't hang the client.

          In my opinion, this is a pure coding logic bug.

          Show
          Ruyue Ma added a comment - Two points: 1) Connection exception(networking has some problem) also causes socket reading exception. It's not always server exception. 2) Even if jobtracker gives the exception, we wouldn't hang the client. In my opinion, this is a pure coding logic bug.
          Hide
          Ruyue Ma added a comment -

          P.S. This bug is not related to uncaught exception like NPE on the JobTracker side.

          Show
          Ruyue Ma added a comment - P.S. This bug is not related to uncaught exception like NPE on the JobTracker side.
          Hide
          Ruyue Ma added a comment -

          In Client.java->receiveResponse():

          Call call = calls.remove(id);

          int state = in.readInt();
          if (state == Status.SUCCESS.state)

          { Writable value = ReflectionUtils.newInstance(valueClass, conf); value.readFields(in); // if here throws exception, who will wake up the current call's waiting thread? call.setValue(value); }

          else if (state == Status.ERROR.state)

          { call.setException(new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); }

          else if (state == Status.FATAL.state)

          { // Close the connection markClosed(new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); }
          Show
          Ruyue Ma added a comment - In Client.java->receiveResponse(): Call call = calls.remove(id); int state = in.readInt(); if (state == Status.SUCCESS.state) { Writable value = ReflectionUtils.newInstance(valueClass, conf); value.readFields(in); // if here throws exception, who will wake up the current call's waiting thread? call.setValue(value); } else if (state == Status.ERROR.state) { call.setException(new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); } else if (state == Status.FATAL.state) { // Close the connection markClosed(new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); }
          Hide
          Hairong Kuang added a comment -

          Good catch, Ruyue! It would be nice that we get it fixed in 0.20 as well. Could you please post a patch for 0.20? I know that it is hard. Is it possible to write a unit test to reproduce the bug?

          Show
          Hairong Kuang added a comment - Good catch, Ruyue! It would be nice that we get it fixed in 0.20 as well. Could you please post a patch for 0.20? I know that it is hard. Is it possible to write a unit test to reproduce the bug?
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Ruyue.. I am not against this fix. Just trying to make sure if you've already stumbled upon another bug in JobTracker. How did you find this bug? During code browsing or on a real cluster? If it is the later, that might point us to any bugs in JT APIs which we can fix right-away if you have traces on JT too.

          In any case, if such a bug indeed is there, this patch will make it more conspicuous by throwing it to the client.

          Show
          Vinod Kumar Vavilapalli added a comment - Ruyue.. I am not against this fix. Just trying to make sure if you've already stumbled upon another bug in JobTracker. How did you find this bug? During code browsing or on a real cluster? If it is the later, that might point us to any bugs in JT APIs which we can fix right-away if you have traces on JT too. In any case, if such a bug indeed is there, this patch will make it more conspicuous by throwing it to the client.
          Hide
          Ruyue Ma added a comment -

          The bug was found two times in our cluster.

          1. tasktracker heartbeat rpc hang: when hang occurs, the networking load is very high. Two nodes hang.

          2. Another time, our networking has some problem (ping latency is 200~300ms). get map events threads hang. About 50 nodes.

          This patch will be deployed our online cluster.

          Now, I can't make sure if jobtracker has some bug.

          Show
          Ruyue Ma added a comment - The bug was found two times in our cluster. 1. tasktracker heartbeat rpc hang: when hang occurs, the networking load is very high. Two nodes hang. 2. Another time, our networking has some problem (ping latency is 200~300ms). get map events threads hang. About 50 nodes. This patch will be deployed our online cluster. Now, I can't make sure if jobtracker has some bug.
          Hide
          Ruyue Ma added a comment -

          for branch 0.20

          Show
          Ruyue Ma added a comment - for branch 0.20
          Hide
          Ruyue Ma added a comment -

          To: Hairong

          Unit test is very difficult.

          You can modify the code according to the following:

          Client.java
          
              private void receiveResponse() {
                if (shouldCloseConnection.get()) {
                  return;
                }
                touch();
                
                try {
                  int id = in.readInt();                   
          
                  if (LOG.isDebugEnabled())
                    LOG.debug(getName() + " got value #" + id);
          
                  Call call = calls.remove(id);
          
                  int state = in.readInt();    
                  if (state == Status.SUCCESS.state) {
                    Writable value = ReflectionUtils.newInstance(valueClass, conf);
          
                   // < MODIFIED BY ME
                   if (id == 100) {
                        throw new IOException("network error");
                   } else {
                        value.readFields(in); 
                   }        
                   // > MODIFIED BY ME 
                 
                    call.setValue(value);
                  } else if (state == Status.ERROR.state) {
                    call.setException(new RemoteException(WritableUtils.readString(in),
                                                          WritableUtils.readString(in)));
                  } else if (state == Status.FATAL.state) {
                    // Close the connection
                    markClosed(new RemoteException(WritableUtils.readString(in), 
                                                   WritableUtils.readString(in)));
                  }
                } catch (IOException e) {
                  markClosed(e);
                }
              }
          
          

          The 100th rpc call will hang.

          Show
          Ruyue Ma added a comment - To: Hairong Unit test is very difficult. You can modify the code according to the following: Client.java private void receiveResponse() { if (shouldCloseConnection.get()) { return ; } touch(); try { int id = in.readInt(); if (LOG.isDebugEnabled()) LOG.debug(getName() + " got value #" + id); Call call = calls.remove(id); int state = in.readInt(); if (state == Status.SUCCESS.state) { Writable value = ReflectionUtils.newInstance(valueClass, conf); // < MODIFIED BY ME if (id == 100) { throw new IOException( "network error" ); } else { value.readFields(in); } // > MODIFIED BY ME call.setValue(value); } else if (state == Status.ERROR.state) { call.setException( new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); } else if (state == Status.FATAL.state) { // Close the connection markClosed( new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); } } catch (IOException e) { markClosed(e); } } The 100th rpc call will hang.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Right then. The fix looks good to me. We will need a test-case to verify that this call and any other queued calls are all notified so that they can properly exit.

          Show
          Vinod Kumar Vavilapalli added a comment - Right then. The fix looks good to me. We will need a test-case to verify that this call and any other queued calls are all notified so that they can properly exit.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          oh.. comments collided.. well

          For testing this, I guess we can modify TestIPCServerResponder to add a new test similar to testServerResponder(), but instead of BytesWritable for value class we can use a custom Writable that throws exception in readFields().

          Show
          Vinod Kumar Vavilapalli added a comment - oh.. comments collided.. well For testing this, I guess we can modify TestIPCServerResponder to add a new test similar to testServerResponder() , but instead of BytesWritable for value class we can use a custom Writable that throws exception in readFields().
          Hide
          Todd Lipcon added a comment -

          Ruyue: could you make this a unit test by making a client using a subclass of ObjectWritable as the valueClass which throws exceptions on readFields?

          The fault injection stuff in test-fi probably could also be used here.

          Show
          Todd Lipcon added a comment - Ruyue: could you make this a unit test by making a client using a subclass of ObjectWritable as the valueClass which throws exceptions on readFields? The fault injection stuff in test-fi probably could also be used here.
          Hide
          Todd Lipcon added a comment -

          I think we're having comment race conditions here

          Show
          Todd Lipcon added a comment - I think we're having comment race conditions here
          Hide
          Hairong Kuang added a comment -

          Ok, Ruyue, I know that you are busy but this is a critical bug. So I worked on the fix.

          Here comes a new patch. It has two changes to Ruyue's patch:
          1. adds a unit test.
          2. do calls.remove(id) only only when the call succeeds.

          Show
          Hairong Kuang added a comment - Ok, Ruyue, I know that you are busy but this is a critical bug. So I worked on the fix. Here comes a new patch. It has two changes to Ruyue's patch: 1. adds a unit test. 2. do calls.remove(id) only only when the call succeeds.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12431330/hangClient.patch
          against trunk revision 902745.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 1032 javac compiler warnings (more than the trunk's current 1031 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/290/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/290/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/290/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/290/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12431330/hangClient.patch against trunk revision 902745. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 1032 javac compiler warnings (more than the trunk's current 1031 warnings). +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/290/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/290/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/290/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/290/console This message is automatically generated.
          Hide
          Suresh Srinivas added a comment -

          +1. The patch looks good.

          Show
          Suresh Srinivas added a comment - +1. The patch looks good.
          Hide
          Hairong Kuang added a comment -

          This patch fixed the javac warnings.

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Show
          Hairong Kuang added a comment - This patch fixed the javac warnings. [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          Hide
          Hairong Kuang added a comment -

          A patch for branch 0.20.

          Show
          Hairong Kuang added a comment - A patch for branch 0.20.
          Hide
          Hairong Kuang added a comment -

          I've just committed this. Thanks Ruyue!

          Show
          Hairong Kuang added a comment - I've just committed this. Thanks Ruyue!
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #149 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/149/)
          . IPC client bug may cause rpc call hang. Contributed by Ruyue Ma and Hairong Kuang.

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #149 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/149/ ) . IPC client bug may cause rpc call hang. Contributed by Ruyue Ma and Hairong Kuang.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #234 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/234/)
          . IPC client bug may cause rpc call hang. Contributed by Ruyue Ma and Hairong Kuang.

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk #234 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/234/ ) . IPC client bug may cause rpc call hang. Contributed by Ruyue Ma and Hairong Kuang.

            People

            • Assignee:
              Ruyue Ma
              Reporter:
              Ruyue Ma
            • Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development