  Hadoop Common / HADOOP-9229

IPC: Retry on connection reset or socket timeout during SASL negotiation

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.3-alpha, 0.23.7, 3.0.0-alpha1
    • Fix Version/s: None
    • Component/s: ipc
    • Labels: None

      Description

      When an RPC server is overloaded, incoming connections may not get accepted in time, causing the listen queue to overflow. The impact on clients varies depending on the OS in use. On Linux, connections in this state look fully established to the client, but they have no buffers on the server side, so any data sent to the server gets dropped.

      This is not a problem for protocols where the client first waits for the server's greeting. Even for client-speaks-first protocols, it is fine if the overload is transient and such connections are accepted before the retransmissions of the dropped packets arrive. Otherwise, clients can hit a socket timeout after several retransmissions. In certain situations, the connection will get reset while the client is still waiting for an ack.

      We have seen this happen to IPC clients during SASL negotiation. Since no call has been sent at that point, we should allow a retry when a connection reset or socket timeout happens in this stage.
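
      For illustration only, here is a minimal sketch of the kind of bounded retry being proposed. It is not the actual org.apache.hadoop.ipc.Client code; negotiateSasl() is a hypothetical stand-in for the real connection setup and SASL handshake.

        import java.io.IOException;
        import java.net.SocketTimeoutException;

        /** Sketch: bounded retry on connection reset or socket timeout during connection setup. */
        public class SaslSetupRetrySketch {
          private static final int MAX_RETRIES = 3;
          private static final long RETRY_SLEEP_MS = 1000L;

          public void setupWithRetry() throws IOException, InterruptedException {
            for (int attempt = 0; ; attempt++) {
              try {
                negotiateSasl();   // hypothetical stand-in for connect + SASL exchange
                return;            // negotiated; safe to start sending calls
              } catch (IOException e) {
                // No RPC call has been sent yet, so a retry cannot duplicate work.
                boolean retriable = e instanceof SocketTimeoutException
                    || String.valueOf(e.getMessage()).contains("Connection reset");
                if (!retriable || attempt >= MAX_RETRIES) {
                  throw e;
                }
                Thread.sleep(RETRY_SLEEP_MS);
              }
            }
          }

          private void negotiateSasl() throws IOException {
            // placeholder: open the connection and run the SASL negotiation here
          }
        }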

        Activity

        Todd Lipcon added a comment -

        Hey Kihwal. Have you been watching HDFS-4404? Looks like basically the same issue, if I'm understanding you correctly. In particular, see this comment: https://issues.apache.org/jira/browse/HDFS-4404?focusedCommentId=13555680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13555680

        Suresh Srinivas added a comment -

        "we should allow a retry when a connection reset or socket timeout happens in this stage."

        I know that in large clusters it is possible to hit a condition where too many clients connect to the master servers, such as the namenode, and overload it. The question is how we want to handle this condition. There are two possible ways to look at the solution:

        1. The overload condition is unexpected; hence the current behavior of degraded service, where clients get disconnected, could be the right behavior.
        2. If the load is something the namenode should handle, and hence not an overload condition, we should look at scaling the number of connections at the namenode. There are things that can be tuned here - the number of RPC handlers, the queue depth per RPC handler, etc. (a few such knobs are sketched below). If that is not sufficient, we may have to make further changes to scale connection handling.

        One concern I have with retry is that if you have an overload condition which results in clients getting dropped, retrying will prolong the overload and make the situation worse.
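
        For reference, a sketch of the kind of server-side knobs mentioned in point 2 above. The key names below are taken from common Hadoop releases and may differ between versions, so verify them against the deployed release; the values are arbitrary examples.

          import org.apache.hadoop.conf.Configuration;

          /** Sketch: tuning RPC handler count and queue depths on the server side. */
          public class NameNodeRpcTuningSketch {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              conf.setInt("dfs.namenode.handler.count", 128);     // RPC handler threads on the NameNode
              conf.setInt("ipc.server.handler.queue.size", 200);  // calls that may be queued per handler
              conf.setInt("ipc.server.listen.queue.size", 256);   // listen backlog of the RPC server socket
              System.out.println("handlers = " + conf.getInt("dfs.namenode.handler.count", -1));
            }
          }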

        Kihwal Lee added a comment -

        Todd Lipcon: In this scenario, setupIOStreams() will throw an exception without retrying, because handleSaslConnectionFailure() gives up. If the auth mode is kerberos, it will be retried, but that is still outside of setupConnection() and does not involve handleConnectionFailure(). Maybe we should add a check for the connection retry policy in handleSaslConnectionFailure() (a rough sketch follows below).

        Suresh Srinivas: We've also seen this happen against the AM. Since there is a finite number of tasks, retrying would have made the job succeed. This failure mode is particularly bad since clients fail without retrying; for requests that are allowed only one chance, it is fatal. Since failed jobs get retried, the same situation will likely repeat. If all requests are eventually served, the load will go away without doing more damage.

        I agree that if this condition is sustained, the cluster has a bigger problem and no ipc-level action will solve it. But for transient overloads, we want the system to behave more gracefully. One concern is the server accepting too many connections and running out of FDs, which causes all kinds of bad things. This can be prevented by HADOOP-9137.
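
        A rough sketch of the suggested change, with hypothetical names standing in for the actual Client internals: consult a connection retry policy from the SASL failure handler instead of giving up immediately.

          import java.io.IOException;
          import java.net.SocketTimeoutException;

          /** Sketch: let a retry policy decide whether a SASL-stage failure is retried. */
          public class SaslFailureHandlingSketch {

            /** Minimal stand-in for a connection retry policy. */
            interface ConnectionRetryPolicy {
              boolean shouldRetry(IOException cause, int retries);
            }

            private final ConnectionRetryPolicy policy =
                (cause, retries) -> retries < 3
                    && (cause instanceof SocketTimeoutException
                        || String.valueOf(cause.getMessage()).contains("Connection reset"));

            private int saslRetries = 0;

            /** Invoked when SASL negotiation fails before any RPC call has been sent. */
            void handleSaslConnectionFailure(IOException cause) throws IOException {
              if (policy.shouldRetry(cause, saslRetries++)) {
                // reconnect and redo the SASL exchange (omitted in this sketch)
                return;
              }
              throw cause;  // policy exhausted: propagate the failure as before
            }
          }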


          People

          • Assignee: Unassigned
          • Reporter: Kihwal Lee
          • Votes: 0
          • Watchers: 9
