HDFS-16601: DataTransfer should throw IOException to Client


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      In our production environment, we hit a bug with the following stack trace:

      java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK], DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK], DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
      	at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
      	at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
      	at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
      	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
      	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
      	at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
      	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
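
      The replacement policy named in the message is a standard client-side HDFS setting. As a minimal sketch of how a client tunes it (using the stock Configuration API; the output path below is arbitrary):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ReplacePolicyExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Policy named in the exception message; valid values are
              // DEFAULT, ALWAYS and NEVER.
              conf.set("dfs.client.block.write.replace-datanode-on-failure.policy",
                  "DEFAULT");
              // Companion setting: on replacement failure, continue with the
              // remaining datanodes instead of surfacing an IOException.
              conf.setBoolean(
                  "dfs.client.block.write.replace-datanode-on-failure.best-effort",
                  false);
              try (FileSystem fs = FileSystem.get(conf)) {
                  // Writes through this FileSystem use the policy above
                  // during pipeline recovery.
                  fs.create(new Path("/tmp/replace-policy-example")).close();
              }
          }
      }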
      

      The root cause is that the DFSClient cannot perceive an exception raised by DataTransfer during pipeline recovery. If the block transfer fails, the DFSClient keeps retrying with replacement datanodes until every datanode in the cluster has been tried, and then fails with the exception above.
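
      A minimal sketch of why the client never sees the failure (simplified and hypothetical; the class and method names below are illustrative, not the actual DataNode source):

      import java.io.IOException;
      import org.slf4j.Logger;
      import org.slf4j.LoggerFactory;

      // Illustrative only: the transfer runs as a background thread on the
      // source datanode, so an IOException thrown here never reaches the client.
      class DataTransferSketch implements Runnable {
          private static final Logger LOG =
              LoggerFactory.getLogger(DataTransferSketch.class);

          @Override
          public void run() {
              try {
                  sendBlockToTarget(); // fails when the source datanode is abnormal
              } catch (IOException e) {
                  // The failure is only logged on the datanode; no error status
                  // is reported back, so the client believes the transfer succeeded.
                  LOG.warn("Failed to transfer block to the new datanode", e);
              }
          }

          // Placeholder for the real block-transfer I/O.
          private void sendBlockToTarget() throws IOException {
              throw new IOException("simulated transfer failure");
          }
      }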

      When the client is recovering a pipeline, the source datanode chosen to transfer the block to the new datanode may itself be abnormal and unable to complete the transfer. Because the failure is not returned to the client, the client believes the transfer succeeded. Since the block never arrived on the new datanode, the client then fails to build the pipeline and marks the new datanode as bad. It adds that datanode to the exclude list, obtains another datanode, and starts the next round of pipeline recovery, which again picks the same abnormal datanode as the transfer source and fails in the same way (see the sketch below).
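
      The resulting loop can be sketched as follows (hypothetical names and structure; the real logic lives in DataStreamer). With the transfer failure swallowed, the exclude list grows until the client surfaces the "no more good datanodes" error:

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;

      // Hypothetical sketch of the failure loop described above.
      class RecoveryLoopSketch {
          private final List<String> cluster = Arrays.asList("dn1", "dn2", "dn3");
          private final List<String> excluded = new ArrayList<>();

          void recoverPipeline(String abnormalSource) throws IOException {
              while (true) {
                  String newDn = pickNewDatanode();     // may throw, see below
                  transferBlock(abnormalSource, newDn); // failure silently swallowed
                  if (!buildPipeline(newDn)) {          // block never arrived on newDn
                      excluded.add(newDn);              // the wrong node gets blamed
                      continue;                         // same abnormal source reused
                  }
                  return;
              }
          }

          private String pickNewDatanode() throws IOException {
              for (String dn : cluster) {
                  if (!excluded.contains(dn)) {
                      return dn;
                  }
              }
              // Mirrors the error the client eventually surfaces in production.
              throw new IOException("Failed to replace a bad datanode on the "
                  + "existing pipeline due to no more good datanodes being "
                  + "available to try.");
          }

          private void transferBlock(String source, String target) {
              // Stub: in the buggy flow the source-side failure is not reported.
          }

          private boolean buildPipeline(String newDn) {
              return false; // stub: always fails because the block is missing
          }
      }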

      So I think the datanode should return the transfer failure to the client, so that the client can choose another existing datanode as the source and transfer the block to the new datanode from there.
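
      A hedged sketch of the proposed direction (illustrative client-side handling, not a patch): if the datanode propagated the transfer failure, the client could rotate the source datanode instead of blaming the new one:

      import java.io.IOException;
      import java.util.List;

      // Illustrative only: assumes transferBlock() now surfaces an IOException
      // when the source datanode fails, as this issue proposes.
      class SourceFallbackSketch {

          String transferWithFallback(List<String> existingDns, String newDn)
                  throws IOException {
              IOException last = null;
              for (String source : existingDns) {
                  try {
                      transferBlock(source, newDn); // would now propagate failures
                      return source;                // transfer succeeded
                  } catch (IOException e) {
                      last = e;                     // try the next surviving datanode
                  }
              }
              throw last != null ? last
                  : new IOException("no source datanode available");
          }

          private void transferBlock(String source, String target)
                  throws IOException {
              // stub for the real DataTransfer request
          }
      }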


            People

              Assignee: ZanderXu (xuzq_zander)
              Reporter: ZanderXu (xuzq_zander)
              Votes: 0
              Watchers: 4


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m