Hadoop HDFS
HDFS-3091

Update the usage limitations of the ReplaceDatanodeOnFailure policy in the config description for smaller clusters.

    Details

      Description

      While verifying the HDFS-1606 feature, I observed a couple of issues.

      Presently the ReplaceDatanodeOnFailure policy can be satisfied even when the cluster does not have enough DataNodes to replace a failed one, which results in a write failure.

      12/03/13 14:27:12 WARN hdfs.DFSClient: DataStreamer Exception
      java.io.IOException: Failed to add a datanode: nodes.length != original.length + 1, nodes=[xx.xx.xx.xx:50010], original=[xx.xx.xx.xx1:50010]
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:778)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:834)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:930)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:741)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:416)
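
      The failure comes from the findNewDatanode sanity check: after asking the NameNode for an additional DataNode, the client expects the new pipeline to be exactly one node longer than the old one (nodes.length == original.length + 1). When no spare node exists, the pipeline comes back unchanged and the IOException above is thrown.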

      Let's take some cases:
      1) Replication factor 3 and cluster size also 3, and unfortunately the pipeline drops to 1.

      ReplaceDatanodeOnFailure will be satisfied because existings(1) <= replication/2 (3/2 == 1).

      But when it tries to find a new node for the replacement, it obviously cannot find one, and the sanity check fails.

      This results in a write failure.
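
      For reference, a minimal Java sketch of the DEFAULT policy's replacement condition, paraphrased from ReplaceDatanodeOnFailure (illustrative only, not the exact HDFS source; the method name and wrapper class here are made up):

          // Simplified sketch of the DEFAULT ReplaceDatanodeOnFailure condition.
          public class ReplacePolicySketch {
            static boolean shouldReplace(short replication, int existings,
                boolean isAppend, boolean isHflushed) {
              // Replace a failed pipeline node only when replication >= 3 and
              // the pipeline has shrunk to half the replication factor or
              // below, or when the stream is an append or has been hflushed.
              return replication >= 3
                  && (existings <= replication / 2 || isAppend || isHflushed);
            }

            public static void main(String[] args) {
              // Case 1: replication 3, pipeline down to 1 node.
              // 1 <= 3/2 (integer division: 1) holds, so replacement is
              // demanded, yet a 3-node cluster has no spare DataNode.
              System.out.println(shouldReplace((short) 3, 1, false, false)); // true
            }
          }

      So in case 1 the policy demands a replacement that the cluster cannot possibly provide.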

      2) Replication factor 10 (the user accidentally sets a replication factor higher than the cluster size), and the cluster has only 5 datanodes.

      Here, even if only one node fails, the write fails for the same reason: the pipeline can be at most 5 nodes, so after one datanode is killed, existings will be 4.

      existings(4) <= replication/2 (10/2 == 5) will be satisfied, and obviously the failed node cannot be replaced since no extra nodes exist in the cluster. This results in a write failure.
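
      On clusters like these, a possible mitigation (in line with what this issue's patch documents in the config description) is to relax the policy through the dfs.client.block.write.replace-datanode-on-failure.* client settings, either in hdfs-site.xml or programmatically. A minimal sketch, assuming reduced pipeline redundancy is acceptable on such a small cluster:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class SmallClusterWriteSketch {
            public static void main(String[] args) throws Exception {
              // Assumes core-site.xml/hdfs-site.xml on the classpath point
              // at the cluster.
              Configuration conf = new Configuration();
              // Real HDFS client keys; NEVER skips datanode replacement
              // entirely, avoiding the "Failed to add a datanode" error when
              // no spare DataNode exists.
              conf.setBoolean(
                  "dfs.client.block.write.replace-datanode-on-failure.enable", true);
              conf.set(
                  "dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
              try (FileSystem fs = FileSystem.get(conf);
                  FSDataOutputStream out = fs.create(new Path("/tmp/example"))) {
                out.writeBytes("hello");  // proceeds with remaining nodes on failure
              }
            }
          }

      The trade-off: with NEVER, a failed pipeline node is simply dropped, so the block being written temporarily has fewer replicas until the NameNode re-replicates it.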

      3) Sync-related operations also fail in these situations (will post the detailed scenarios).

      Attachments

      1. h3091_20120319.patch (0.8 kB, Tsz Wo Nicholas Sze)

        Issue Links

          This issue is related to HDFS-4600

          Activity

          Tsz Wo Nicholas Sze made changes -
          Link This issue is related to HDFS-4600 [ HDFS-4600 ]
          Arun C Murthy made changes -
          Fix Version/s 2.0.0 [ 12320353 ]
          Fix Version/s 0.24.0 [ 12317653 ]
          Fix Version/s 0.23.3 [ 12320052 ]
          Tsz Wo Nicholas Sze made changes -
          Issue Type Bug [ 1 ] Improvement [ 4 ]
          Fix Version/s 0.23.3 [ 12320052 ]
          Target Version/s 0.23.3, 0.24.0 [ 12320052, 12317653 ] 0.24.0, 0.23.3 [ 12317653, 12320052 ]
          Uma Maheswara Rao G made changes -
          Fix Version/s 0.24.0 [ 12317653 ]
          Target Version/s 0.23.3, 0.24.0 [ 12320052, 12317653 ] 0.24.0, 0.23.3 [ 12317653, 12320052 ]
          Uma Maheswara Rao G made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Target Version/s 0.23.3, 0.24.0 [ 12320052, 12317653 ] 0.24.0, 0.23.3 [ 12317653, 12320052 ]
          Assignee Tsz Wo (Nicholas), SZE [ szetszwo ]
          Resolution Fixed [ 1 ]
          Uma Maheswara Rao G made changes -
          Summary Failed to add new DataNode in pipeline and will be resulted into write failure. Update the usage limitations of ReplaceDatanodeOnFailure policy in the config description for the smaller clusters.
          Target Version/s 0.23.3, 0.24.0 [ 12320052, 12317653 ] 0.24.0, 0.23.3 [ 12317653, 12320052 ]
          Tsz Wo Nicholas Sze made changes -
          Attachment h3091_20120319.patch [ 12518919 ]
          Tsz Wo Nicholas Sze made changes -
          Attachment h3091_20120319.patch [ 12518921 ]
          Tsz Wo Nicholas Sze made changes -
          Attachment h3091_20120319.patch [ 12518919 ]
          Uma Maheswara Rao G made changes -
          Target Version/s 0.23.3, 0.24.0 [ 12320052, 12317653 ] 0.24.0, 0.23.3 [ 12317653, 12320052 ]
          Uma Maheswara Rao G made changes -
          Target Version/s 0.23.3, 0.24.0 [ 12320052, 12317653 ] 0.24.0, 0.23.3 [ 12317653, 12320052 ]
          Uma Maheswara Rao G created issue -

            People

            • Assignee: Tsz Wo Nicholas Sze
            • Reporter: Uma Maheswara Rao G
            • Votes: 0
            • Watchers: 7
