Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-9916

Race condition in ipc.Client causes TestIPC timeout

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 2.1.1-beta
    • None
    • None
    • Reviewed

    Description

      TestIPC timeouts occasionally, for example:
      https://issues.apache.org/jira/browse/HDFS-5130?focusedCommentId=13749870&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13749870
      https://issues.apache.org/jira/browse/HADOOP-9915?focusedCommentId=13753302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13753302

      Look into the code, there is race condition in oah.ipc.Client, the race condition happen between RPC call thread and connection read response thread:

              if (status == RpcStatusProto.SUCCESS) {
                Writable value = ReflectionUtils.newInstance(valueClass, conf);
                value.readFields(in);                 // read value
                call.setRpcResponse(value);
                calls.remove(callId);
      

      Read Thread:
      Connection.receiveRpcResponse-> call.setRpcResponse(value) -> notify Call Thread

      Call Thread:
      Client.call -> Connection.addCall(retry with the same callId) -> notify read thread

      Read Thread:
      calls.remove(callId) # intend to remove old call, but removes newly added call...
      Connection.waitForWork end up wait maxIdleTime and close the connection. The call never get response and dead.

      The problem doesn't show because previously callId is unique, we never accidentally remove newly added calls, but when retry added this race condition became possible.

      To solve this, we can simply change order, remove the call first, then notify call thread.
      Note there are many places need this order change(normal case, error case, cleanup case)

      And there are some minor issues in TestIPC:
      1. there are two method with same name:
      void testSerial()
      void testSerial(int handlerCount, boolean handlerSleep, ...)
      the second is not a test case(so should not use testXXX prefix), but somehow it causes testSerial(first one) run two times, see test report:

        <testcase time="26.896" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
        <testcase time="25.426" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
      

      2. timeout annotation should be added, so next time related log is available.

      Attachments

        1. HADOOP-9916.v1.patch
          10 kB
          Binglin Chang

        Issue Links

          Activity

            People

              decster Binglin Chang
              decster Binglin Chang
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: