Hadoop HDFS
HDFS-1056

Multi-node RPC deadlocks during block recovery

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.2, 0.21.0, 0.22.0
    • Fix Version/s: 0.20-append
    • Component/s: datanode
    • Labels: None
    • Target Version/s:

      Description

      Believe it or not, I'm seeing HADOOP-3657 / HADOOP-3673 in a 5-node 0.20 cluster. I have many concurrent writes on the cluster, and when I kill a DN, some percentage of the time I get one of these cross-node deadlocks among 3 of the nodes (replication 3). All of the DN RPC server threads are tied up waiting on RPC clients to other datanodes.
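
      The failure mode described here is essentially thread-pool exhaustion across nodes. The following is a minimal, self-contained Java sketch (a simulation only, not Hadoop code; the class, method, and node names and the pool sizes are made up for illustration) of the same shape of deadlock: every "RPC handler" thread is held by a recovery call that blocks on a synchronous call to a peer whose handlers are all held the same way, so no node has a free handler left to answer the calls it receives.

      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      /** Simulation of the cross-node RPC deadlock; not Hadoop code. */
      public class RpcDeadlockSketch {

          // One countdown slot per simulated handler thread (3 nodes x 2 handlers).
          static final CountDownLatch allHandlersBusy = new CountDownLatch(6);

          /** A fake datanode with a fixed-size pool of "RPC handler" threads. */
          static class Node {
              final String name;
              final ExecutorService handlers = Executors.newFixedThreadPool(2);

              Node(String name) { this.name = name; }

              /** A peer-side RPC that would normally return immediately. */
              Future<String> getBlockInfo() {
                  return handlers.submit(() -> name + ": block info");
              }

              /** An incoming recovery RPC: the handler thread blocks on a call to a peer. */
              Future<String> recoverBlock(Node peer) {
                  return handlers.submit(() -> {
                      allHandlersBusy.countDown();
                      allHandlersBusy.await();          // wait until every handler is occupied
                      // Blocks forever: the peer has no free handler left to run getBlockInfo().
                      return peer.getBlockInfo().get();
                  });
              }
          }

          public static void main(String[] args) throws Exception {
              Node a = new Node("dn-a"), b = new Node("dn-b"), c = new Node("dn-c");

              // Two concurrent recoveries per node, forming a cycle a -> b -> c -> a.
              List<Future<String>> calls = new ArrayList<>();
              for (int i = 0; i < 2; i++) {
                  calls.add(a.recoverBlock(b));
                  calls.add(b.recoverBlock(c));
                  calls.add(c.recoverBlock(a));
              }

              try {
                  calls.get(0).get(5, TimeUnit.SECONDS);
              } catch (TimeoutException e) {
                  System.out.println("recovery never completes: every handler is waiting on a peer");
              }
              System.exit(0);   // the simulated handler threads are still wedged; force the JVM down
          }
      }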

        Activity

        Eli Collins added a comment -

        Marking for 2.x, verified trunk still needs this fix.

        Todd Lipcon added a comment -

        > This fix could impact other code paths too, especially since the DN comparison is used by many code paths. Maybe a unit test would be good.

        Are you suggesting that the change be made to the equals() call instead of locally in the DataNode code? As is, the patch Nicolas uploaded is scoped to just that bit of code, where it's been tested a lot and the correct semantics are clear. I think changing equals() itself would be dangerous, as it might break things in FSNamesystem, replication policy, etc.

        > also, does this problem exist in trunk?

        Yep, it does - the same fix applies.

        dhruba borthakur added a comment -

        This fix could impact other code paths too, especially since the DN comparison is used by many code paths. Maybe a unit test would be good.

        Also, does this problem exist in trunk?

        Nicolas Spiegelberg added a comment -

        Added Todd's fix for 0.20-append. No unit test yet.

        dhruba borthakur added a comment -

        This should be fixed for the 0.20-append branch as well.

        Todd Lipcon added a comment -

        FWIW I made the suggested fix (comparing based on host and ipcPort) on the cluster in question and the problem went away. It may cause some other problem, though. I'll think about it a bit and post a patch and test case soon.
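
        As a rough sketch of that comparison, with stand-in types and field names (this is not the actual patch, and the real DatanodeID/DatanodeRegistration classes are not reproduced here):

        // Illustrative stand-in: in the recovery path, "is this target me?" is answered
        // with host + ipcPort, which stay stable across a restart, rather than with the
        // host:xferPort name, whose xceiver port changes when dfs.datanode.address is 0.0.0.0:0.
        class DnEndpointSketch {
            final String host;     // e.g. "192.168.42.40"
            final int xferPort;    // xceiver port; ephemeral in the setup described here
            final int ipcPort;     // port from dfs.datanode.ipc.address, e.g. 11071

            DnEndpointSketch(String host, int xferPort, int ipcPort) {
                this.host = host;
                this.xferPort = xferPort;
                this.ipcPort = ipcPort;
            }

            /** Scoped to the recovery code path; DatanodeID.equals() itself is left untouched. */
            boolean isSameRecoveryTarget(DnEndpointSketch other) {
                return host.equals(other.host) && ipcPort == other.ipcPort;
            }
        }

        Keeping the check local to the recovery path sidesteps the concern above about changing equals() semantics that FSNamesystem and the replication policy rely on.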

        Tsz Wo Nicholas Sze added a comment -

        > Not really, ...
        Ah, you are right. It leads to a different problem. Thanks, Todd.

        Todd Lipcon added a comment -

        > Then, setting ipcPort (i.e. dfs.datanode.ipc.address) to 0.0.0.0:0 would result in the same problem

        Not really, because in the case where the ipcPort has changed, it will determine that it's non-local and then get a connection refused trying to create the proxy to the old ipcPort. So it's still a bit of a problem, but not as bad as the loopback RPC client.

        Would just using storageID for equality be correct here? Maybe.

        Tsz Wo Nicholas Sze added a comment -

        > I think the solution may be to determine the "equality" of the DNs based on IP and ipcPort, not by name (which is the xceiver port). There may be issues with this, though - have to think through it more thoroughly.

        Then, setting ipcPort (i.e. dfs.datanode.ipc.address) to 0.0.0.0:0 would result in the same problem. The question is: how should a Datanode be identified? It seems that we were using name and storageID, where name = machineName + ":" + port.
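
        For concreteness, the identity rule being described, sketched with a stand-in class (this is not the real DatanodeID):

        /** Illustrative stand-in for the identity rule above: equality over name + storageID. */
        class DnIdentitySketch {
            final String name;       // machineName + ":" + port (the xceiver port)
            final String storageID;  // persisted by the datanode, so stable across restarts

            DnIdentitySketch(String name, String storageID) {
                this.name = name;
                this.storageID = storageID;
            }

            @Override public boolean equals(Object o) {
                if (!(o instanceof DnIdentitySketch)) return false;
                DnIdentitySketch that = (DnIdentitySketch) o;
                return name.equals(that.name) && storageID.equals(that.storageID);
            }

            @Override public int hashCode() {
                return 31 * name.hashCode() + storageID.hashCode();
            }
        }

        Under this rule, a datanode that restarts with an ephemeral xceiver port no longer compares equal to its pre-restart identity: the storageID is unchanged but the name is not, which is what makes storageID alone, or host + ipcPort, the more stable handle in this scenario.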

        dhruba borthakur added a comment -

        I agree. If you set up fixed ports for the DN, then you might be able to avoid this problem completely.

        Todd Lipcon added a comment -

        (FWIW, I have set dfs.datanode.address to 0.0.0.0:0 for these tests - I don't think this problem occurs if you set an explicit port for the xceiver server.)
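
        For reference, the two configurations being discussed, as they would appear in hdfs-site.xml (the fixed port numbers below are only the customary example values, not something prescribed in this issue):

        <!-- Ephemeral xceiver port, as used in these tests: the OS picks a fresh
             port on every restart, so the DatanodeID name (host:port) changes. -->
        <property>
          <name>dfs.datanode.address</name>
          <value>0.0.0.0:0</value>
        </property>

        <!-- Fixed-port alternative mentioned above: the name stays stable across
             restarts, which avoids this particular misidentification. -->
        <property>
          <name>dfs.datanode.address</name>
          <value>0.0.0.0:50010</value>
        </property>
        <property>
          <name>dfs.datanode.ipc.address</name>
          <value>0.0.0.0:50020</value>
        </property>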

        Todd Lipcon added a comment -

        I think I understand what's happening here. I am restarting an HDFS cluster underneath an HBase cluster, and the following events transpire:

        1. DN with xceiver port X1 and ipcPort 11071 goes down
        2. DN starts back up with different xceiver port X2 but same ipcPort 11071
        3. Client calls recoverBlock, and since the ipcPort is the same, hits the new DN
        4. The DN (now known as 1.2.3.4:X2) sees that the target is 1.2.3.4:X1, which it decides is not local. It then connects to itself via RPC rather than using the "direct invocation" shortcut added by HADOOP-3673

        To verify this, I added a log message when creating the InterDataNodeProtocol Proxy:
        2010-03-21 16:02:33,378 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Creating IDNPP for non-local id 192.168.42.40:50397 (dnReg=DatanodeRegistration(192.168.42.40:39786, storageID=DS-126683980-192.168.42.40-41424-1269146536997, infoPort=40813, ipcPort=11071))

        (dnReg is the new local DN)

        I think the solution may be to determine the "equality" of the DNs based on IP and ipcPort, not by name (which is the xceiver port). There may be issues with this, though - have to think through it more thoroughly.
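
        A compact sketch of the misidentification in step 4, using the values from the log message above (the class and the branch comments are illustrative stand-ins, not the actual DataNode code):

        /** Stand-in for the locality decision described in step 4; not the real code. */
        class LocalityDecisionSketch {
            public static void main(String[] args) {
                // The recovery target still carries the pre-restart name (old xceiver port),
                // while the local registration has the new xceiver port but the same host,
                // ipcPort (11071), and storageID.
                String targetName = "192.168.42.40:50397";   // from the located block
                String selfName   = "192.168.42.40:39786";   // dnReg after the restart

                if (targetName.equals(selfName)) {
                    // HADOOP-3673 shortcut: call the local implementation directly, no RPC.
                    System.out.println("local target: direct invocation");
                } else {
                    // What actually happens: the target looks non-local, so an
                    // InterDatanodeProtocol proxy is created to host:ipcPort -- which is
                    // this very node -- and the calling handler thread now needs a second
                    // handler thread to serve it, feeding the handler-exhaustion deadlock.
                    System.out.println("non-local target: creating proxy to 192.168.42.40:11071 (itself)");
                }
            }
        }

        With the host + ipcPort comparison proposed here, the same target would be classified as local and the direct-invocation path would be taken.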


          People

          • Assignee: Unassigned
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 14
