  1. Hadoop HDFS
  2. HDFS-14847

Erasure Coding: Blocks are over-replicated while EC decommissioning



    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 3.0.3, 3.1.2, 3.3.0
    • Fix Version/s: 3.3.0, 3.1.4, 3.2.2
    • Component/s: ec
    • Labels: None
    • Hadoop Flags: Reviewed


      Some blocks are over-replicated during EC decommissioning. The log shows messages like the following:

      INFO BlockStateChange: Block: blk_-9223372035714984112_363779142, Expected Replicas: 9, live replicas: 8, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 3, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: , Current Datanode:, Is current datanode decommissioning: true, Is current datanode entering maintenance: false

      Decommissioning hangs for a long time.

      Digging into the code, the problem is in ErasureCodingWork.java.
      For example, suppose 2 nodes (dn0, dn1) are decommissioning and an EC block group has internal blocks on both of them. After an ErasureCodingWork is created for reconstruction, it creates 2 replication tasks.
      If replication from dn0 succeeds and replication from dn1 fails, every later attempt creates replication work for dn0 again. The block on dn0 becomes over-replicated, and the block on dn1 is never replicated.
      Here is the initial patch for this.
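
To make the dn0/dn1 scenario concrete, here is a minimal Java sketch. It is a simplified model and not the actual ErasureCodingWork code or the attached patch: the `Replica` class, the method names, and the extra node `dn9` (holding the copy made by the first successful round) are all hypothetical. It only contrasts picking the first decommissioning node with picking a node whose internal block index still lacks a live copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified, hypothetical model of source selection during EC decommissioning. */
public class EcDecommissionSketch {

  /** One stored internal block of an EC block group. */
  public static final class Replica {
    final String node;
    final int blockIndex;
    final boolean decommissioning;

    public Replica(String node, int blockIndex, boolean decommissioning) {
      this.node = node;
      this.blockIndex = blockIndex;
      this.decommissioning = decommissioning;
    }
  }

  /** Naive selection: take the first `needed` decommissioning nodes as sources. */
  public static List<String> naiveSources(List<Replica> replicas, int needed) {
    List<String> sources = new ArrayList<>();
    for (Replica r : replicas) {
      if (r.decommissioning && sources.size() < needed) {
        sources.add(r.node);
      }
    }
    return sources;
  }

  /** Index-aware selection: skip nodes whose block index already has a live copy. */
  public static List<String> indexAwareSources(List<Replica> replicas, int needed) {
    Set<Integer> liveIndices = new HashSet<>();
    for (Replica r : replicas) {
      if (!r.decommissioning) {
        liveIndices.add(r.blockIndex);
      }
    }
    List<String> sources = new ArrayList<>();
    for (Replica r : replicas) {
      boolean stillNeeded = r.decommissioning && !liveIndices.contains(r.blockIndex);
      if (stillNeeded && sources.size() < needed) {
        sources.add(r.node);
      }
    }
    return sources;
  }

  public static void main(String[] args) {
    // State after the first round: dn0's block (index 0) was copied to dn9,
    // while the copy of dn1's block (index 1) failed. One more copy is needed.
    List<Replica> replicas = Arrays.asList(
        new Replica("dn0", 0, true),
        new Replica("dn1", 1, true),
        new Replica("dn9", 0, false));

    System.out.println(naiveSources(replicas, 1));      // prints [dn0]
    System.out.println(indexAwareSources(replicas, 1)); // prints [dn1]
  }
}
```

In this model the naive selection keeps scheduling dn0, which matches the reported symptom: index 0 gains extra copies while index 1 never gets one.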


        1. HDFS-14847.001.patch
          11 kB
          Hui Fei
        2. HDFS-14847.002.patch
          11 kB
          Hui Fei
        3. HDFS-14847.003.patch
          10 kB
          Hui Fei
        4. HDFS-14847.004.patch
          12 kB
          Hui Fei
        5. HDFS-14847.005.patch
          12 kB
          Hui Fei



            ferhui Hui Fei
            ferhui Hui Fei