Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: ipc
    • Labels:
      None
    • Environment:

      Windows + Oracle Java 7.

      Description

      TestSaslRPC fails with exceptions such as the following:

      Tests run: 85, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 36.765 sec <<< FAILURE! - in org.apache.hadoop.ipc.TestSaslRPC
      testTokenOnlyServer[0](org.apache.hadoop.ipc.TestSaslRPC)  Time elapsed: 0.092 sec  <<< FAILURE!
      java.lang.AssertionError: expected:<.*RemoteException.*AccessControlException.*: SIMPLE authentication is not enabled.*> but was:<java.io.IOException: Failed on local exception: java.io.IOException: An established connection was aborted by the software in your host machine; Host Details : local host is: "WIN-Q5VLNTLIBJ0/10.0.2.15"; destination host is: "WIN-Q5VLNTLIBJ0":49623; >
      	at org.junit.Assert.fail(Assert.java:93)
      	at org.junit.Assert.failNotEquals(Assert.java:647)
      	at org.junit.Assert.assertEquals(Assert.java:128)
      	at org.junit.Assert.assertEquals(Assert.java:147)
      	at org.apache.hadoop.ipc.TestSaslRPC.assertAuthEquals(TestSaslRPC.java:978)
      	at org.apache.hadoop.ipc.TestSaslRPC.testTokenOnlyServer(TestSaslRPC.java:782)
      

      The exact location/number of failures varies by run.

          Activity

          Arpit Agarwal added a comment -

          The exception data written by the Server is lost and the client (test code) receives a generic connection aborted error.

          My guess is that the Server is triggering a JVM bug when it writes the exception data and then closes the socket. Instead of the data being flushed to the peer, the TCP connection is closed abortively. To check this hypothesis, I added a small sleep (20 ms) after writing the exception data to the SocketChannel and before closing the socket. This makes the error go away.

          No good ideas on how to fix it yet.

          Daryn Sharp added a comment -

          With security disabled, the server immediately rejects connections that do not request SASL via the connection header. There's an async race with the insecure client that is blindly sending the connection context. I think the problem occurs if the delay between sending the connection header and the connection context is greater than the time it takes for the server to respond and close the connection.

          Sleeping after sending the response helped ensure the socket stayed open long enough for the client to send the connection context.
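
The race described above can be sketched in isolation. The following is a minimal, self-contained illustration and not Hadoop's actual Server code: the class name, the 4-byte header, the context payload, and the 50 ms delay are all hypothetical. It assumes one plausible mechanism that this thread does not confirm, namely that closing a socket while unread client bytes are still arriving can make the TCP stack reset the connection and discard the queued reply. As an alternative to the sleep, the sketch half-closes the connection after writing the rejection and drains whatever the client sends late before closing.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class LingeringCloseSketch {
    // Hypothetical rejection text standing in for the server's real error response.
    public static final String REJECTION = "SIMPLE authentication is not enabled";

    /** Server side: reject right after the header, then close without losing the reply. */
    public static void rejectAndClose(Socket s) throws IOException {
        s.setSoTimeout(2000);                       // bound the wait on a slow or silent peer
        InputStream in = s.getInputStream();
        OutputStream out = s.getOutputStream();
        new DataInputStream(in).readFully(new byte[4]); // hypothetical 4-byte connection header
        out.write(REJECTION.getBytes(StandardCharsets.UTF_8));
        out.flush();
        s.shutdownOutput();                         // half-close: the reply is sent, followed by FIN
        byte[] sink = new byte[1024];
        while (in.read(sink) != -1) {
            // discard the late connection context until the client closes its side
        }
        s.close();                                  // no unread bytes remain at this point
    }

    /** Client side: sends the header, pauses, then blindly sends the connection context. */
    public static String runClient(int port) throws Exception {
        try (Socket c = new Socket(InetAddress.getLoopbackAddress(), port)) {
            OutputStream out = c.getOutputStream();
            out.write(new byte[] {'h', 'r', 'p', 'c'});
            out.flush();
            Thread.sleep(50);                       // exaggerate the header-to-context gap
            out.write("connection-context".getBytes(StandardCharsets.UTF_8));
            out.flush();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            InputStream in = c.getInputStream();
            byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf)) != -1) {      // read the reply until the server's FIN
                reply.write(buf, 0, n);
            }
            return reply.toString("UTF-8");
        }
    }

    /** Runs one server/client exchange on an ephemeral loopback port. */
    public static String run() throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread t = new Thread(() -> {
                try (Socket s = server.accept()) {
                    rejectAndClose(s);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            t.start();
            String reply = runClient(server.getLocalPort());
            t.join();
            return reply;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Compared with a fixed sleep, draining waits only as long as the client actually needs, and the read timeout keeps a stalled client from holding the server thread indefinitely. Whether this pattern fits the real IPC server, which uses non-blocking channels, is an open question.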

          Arpit Agarwal added a comment -

          Thanks for taking a look. If I understand you correctly, this may not be a platform-specific issue.

          FWIW, I never see the failure on Linux or OS X. However, every Windows run yields at least one failure.

          Chris Nauroth added a comment -

          I had found the same async race a while ago, but I saw it while running TestRPC instead of TestSaslRPC. I had filed HADOOP-8980, but I never solved it. Duplicate?

          Arpit Agarwal added a comment -

          Hi Chris, thanks for the pointer. I think this is a different failure. The issue you describe seems to be a race condition between startup and receiving the request.

          This one looks like a race between sending the response and socket close. Not sure if they are related. We can probably keep both around for now.

          Chris Nauroth added a comment -

          HADOOP-8980 might be in a confusing state at this point. It contains discussion of multiple problems. This specific comment is the one that I think relates to the current issue:

          https://issues.apache.org/jira/browse/HADOOP-8980?focusedCommentId=13509416&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13509416

          If you think HADOOP-10518 is a clearer bug report at this point, then please feel free to close HADOOP-8980 as the duplicate.

          Arpit Agarwal added a comment -

          Okay, that does look similar to what I encountered.

          I will dup this against HADOOP-8980.


            People

            • Assignee:
              Unassigned
              Reporter:
              Arpit Agarwal
            • Votes:
              0
              Watchers:
              3
