HDFS-15495: Decommissioning a DataNode with corrupted EC files should not be blocked indefinitely

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: block placement, ec
    • Labels:
      None

      Description

      Originally discovered in patched CDH 6.2.1 (with a bunch of EC fixes: HDFS-14699, HDFS-14849, HDFS-14847, HDFS-14920, HDFS-14768, HDFS-14946, HDFS-15186).

      When an EC file is marked as corrupted on the NameNode and an admin decommissions a DataNode that holds one of the file's remaining blocks, the decommission never finishes unless the file is recovered by putting the missing blocks back:

      The endless DatanodeAdminManager check loop, every 30s
      2020-07-23 16:36:12,805 TRACE blockmanagement.DatanodeAdminManager: Processed 0 blocks so far this tick
      2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Processing Decommission In Progress node 127.0.1.7:5007
      2020-07-23 16:36:12,806 TRACE blockmanagement.DatanodeAdminManager: Block blk_-9223372036854775728_1013 numExpected=9, numLive=4
      2020-07-23 16:36:12,806 INFO BlockStateChange: Block: blk_-9223372036854775728_1013, Expected Replicas: 9, live replicas: 4, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 127.0.1.12:5012 127.0.1.10:5010 127.0.1.8:5008 127.0.1.11:5011 127.0.1.7:5007 , Current Datanode: 127.0.1.7:5007, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
      2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Node 127.0.1.7:5007 still has 1 blocks to replicate before it is a candidate to finish Decommission In Progress.
      2020-07-23 16:36:12,806 INFO blockmanagement.DatanodeAdminManager: Checked 1 blocks and 1 nodes this tick
      

      A "corrupted" file here means an EC file whose block group no longer has enough blocks to be reconstructed. For example, with RS-6-3-1024k, once fewer than 6 of the 9 blocks in a block group remain, the file can no longer be retrieved correctly.

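To make the arithmetic concrete, here is a minimal Java sketch using the counts from the log above (numExpected=9, live replicas 4, decommissioning replicas 1). The class name EcGroupCheck and both methods are hypothetical and exist only for illustration. In particular, shouldBlockDecommission is not the actual DatanodeAdminManager logic; it only sketches one possible reading of the request in this issue, namely that an unrecoverable block group should not hold up decommissioning.

```java
// Hypothetical illustration only; none of these names exist in Hadoop.
final class EcGroupCheck {
    // RS-6-3: 6 data units plus 3 parity units, 9 blocks per block group.
    static final int DATA_UNITS = 6;
    static final int PARITY_UNITS = 3;

    // A block group can be reconstructed only while at least DATA_UNITS
    // of its blocks are still readable.
    public static boolean isReconstructable(int readableBlocks) {
        return readableBlocks >= DATA_UNITS;
    }

    // Assumed gate: keep a node in "Decommission In Progress" for this block
    // only if the block is short of replicas AND could still be rebuilt.
    // Blocks on the decommissioning node are counted as readable here,
    // which is also an assumption.
    public static boolean shouldBlockDecommission(int numExpected,
                                                  int numLive,
                                                  int numDecommissioning) {
        boolean underReplicated = numLive < numExpected;
        boolean recoverable = isReconstructable(numLive + numDecommissioning);
        return underReplicated && recoverable;
    }

    public static void main(String[] args) {
        // Counts taken from the BlockStateChange log line in the description.
        int numExpected = DATA_UNITS + PARITY_UNITS;
        int numLive = 4;
        int numDecommissioning = 1;

        System.out.println("reconstructable: "
                + isReconstructable(numLive + numDecommissioning));
        System.out.println("blocks decommission: "
                + shouldBlockDecommission(numExpected, numLive, numDecommissioning));
    }
}
```

With the logged counts, only 5 blocks are readable, which is below the 6 needed for RS-6-3, so both calls return false under these assumptions. Waiting for reconstruction cannot succeed in that state, which is why the check loop in the log repeats indefinitely.
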
            People

            • Assignee: Siyao Meng (smeng)
            • Reporter: Siyao Meng (smeng)