Hadoop HDFS / HDFS-8510

Provide different timeout settings for hdfs dfsadmin -getDatanodeInfo.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: tools
    • Labels: None

      Description

      During a rolling upgrade, an administrator runs hdfs dfsadmin -getDatanodeInfo to check if a DataNode has stopped. Currently, this operation is subject to the RPC connection retries defined in ipc.client.connect.max.retries and ipc.client.connect.retry.interval. This issue proposes adding separate configuration properties to control the retries for this operation.
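      As a sketch of what the proposal might look like in hdfs-site.xml — the property names below are hypothetical placeholders, since this issue has not settled on names; only the intent (separate retry knobs for this command, defaulting to the existing IPC values) comes from the issue itself:

      ```xml
      <!-- Hypothetical property names; HDFS-8510 has not finalized them. -->
      <property>
        <name>dfs.client.datanode-restart.connect.max.retries</name>
        <!-- Same default as ipc.client.connect.max.retries -->
        <value>10</value>
      </property>
      <property>
        <name>dfs.client.datanode-restart.connect.retry.interval</name>
        <!-- Milliseconds; same default as ipc.client.connect.retry.interval -->
        <value>1000</value>
      </property>
      ```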

        Activity

        Chris Nauroth (cnauroth) added a comment:

        The current situation is problematic for rolling upgrades in deployments that have set ipc.client.connect.max.retries and/or ipc.client.connect.retry.interval higher than the default. This command is run in situations where the DataNode is expected to be down, so the expected outcome is that the connection will fail, and the command can spend a long time in a connection retry loop. In the worst case, a script that stops and then restarts a DataNode has to wait so long for the retry loop to complete that it cannot restart the DataNode in time to meet the 30-second deadline required for OOB ack response handling in the client. Missing this deadline forces clients into pipeline recoveries, which is sub-optimal.

        To minimize surprises for existing deployments, let's set these new timeout configuration properties to use the same default values as ipc.client.connect.max.retries and ipc.client.connect.retry.interval.
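        As a possible interim workaround until dedicated properties exist: DFSAdmin accepts the standard Hadoop generic options, so the IPC retry settings can likely be overridden per-invocation with -D. This is a sketch, not a confirmed recommendation, and the values shown are illustrative:

        ```shell
        # Override the IPC retry settings for just this invocation, so the
        # probe fails fast when the DataNode is already down.
        hdfs dfsadmin -D ipc.client.connect.max.retries=1 \
                      -D ipc.client.connect.retry.interval=1000 \
                      -getDatanodeInfo <datanode_host:ipc_port>
        ```

        Note this changes the retry behavior only for the single command, not for other clients, which is the per-operation granularity this issue is asking for as a first-class configuration.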


          People

          • Assignee: Chris Nauroth (cnauroth)
          • Reporter: Chris Nauroth (cnauroth)
          • Votes: 0
          • Watchers: 4
