Hadoop HDFS / HDFS-9586

listCorruptFileBlocks should not output files whose replicas are all decommissioning

Details

    • Bug
    • Status: Patch Available
    • Major
    • Resolution: Unresolved

    Description

      As noted in HDFS-7933, we should count decommissioning and decommissioned nodes separately, and treat decommissioning nodes as a special kind of live node: a file whose replicas sit on such nodes is not corrupt or missing.

      So in listCorruptFileBlocks, which is used by fsck and the HDFS NameNode web UI, we should collect a file as corrupt only if liveReplicas and decommissioning are both 0.
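
      The proposed check can be sketched as a small predicate. This is only an illustration of the condition described above, not the attached patch: ReplicaCounts and CorruptFileFilter are hypothetical names standing in for whatever the NameNode uses to hold per-block replica counts.

```java
// Hypothetical stand-in for the per-block replica counts tracked by the
// NameNode; this is NOT Hadoop's own replica-count class.
final class ReplicaCounts {
    final int liveReplicas;
    final int decommissioning;

    ReplicaCounts(int liveReplicas, int decommissioning) {
        this.liveReplicas = liveReplicas;
        this.decommissioning = decommissioning;
    }
}

// Hypothetical helper that expresses the condition proposed in this issue.
final class CorruptFileFilter {
    private CorruptFileFilter() {}

    // Collect a block's file as corrupt only when neither a live replica
    // nor a decommissioning replica remains.
    static boolean shouldReportAsCorrupt(ReplicaCounts counts) {
        return counts.liveReplicas == 0 && counts.decommissioning == 0;
    }
}
```

      With this predicate, a block with zero live replicas but at least one decommissioning replica is not reported, which is the behaviour the description asks for.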

      Attachments

        1. 9586-v1.patch
          2 kB
          Phil Yang

        Issue Links

          Activity

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote | Subsystem | Runtime | Comment
            0    | reexec    | 0m 0s   | Docker mode activated.
            -1   | patch     | 0m 4s   | HDFS-9586 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.

            Subsystem      | Report/Notes
            JIRA Issue     | HDFS-9586
            JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12778829/9586-v1.patch
            Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/18083/console
            Powered by     | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            yangzhe1991 Phil Yang added a comment -

            So all the blocks that go into QUEUE_WITH_CORRUPT_BLOCKS already have zero decommissioning replicas.

            In theory, I think that is right. However, while nodes are being decommissioned there may be false-positive missing-block errors. That is a separate bug that I'm still digging into.

            I'm not sure why we need

            if (inode != null && blockManager.countNodes(blk).liveReplicas() == 0) 
            

            in FSNamesystem.listCorruptFileBlocks; in theory the second condition should always be true. But because of that bug we do need this condition, so I think we also need a condition on decommissioning replicas.

            shahrs87 Rushabh Shah added a comment -

            FSNamesystem#listCorruptFileBlocks gets the list of corrupt blocks from the UnderReplicatedBlocks.QUEUE_WITH_CORRUPT_BLOCKS queue.
            According to the code below, a block is added to the QUEUE_WITH_CORRUPT_BLOCKS queue only if there are zero decommissionedReplicas (the name is a little confusing, since it is the sum of decommissioning and decommissioned replicas).

            if (curReplicas == 0) {
              // If there are zero non-decommissioned replicas but there are
              // some decommissioned replicas, then assign them highest priority
              if (decommissionedReplicas > 0) {
                return QUEUE_HIGHEST_PRIORITY;
              }
              if (readOnlyReplicas > 0) {
                // only has read-only replicas, highest risk
                // since the read-only replicas may go down all together.
                return QUEUE_HIGHEST_PRIORITY;
              }
              // all we have are corrupt blocks
              return QUEUE_WITH_CORRUPT_BLOCKS;
            }
            

            So all the blocks that go into QUEUE_WITH_CORRUPT_BLOCKS already have zero decommissioning replicas.

            Please correct me if my understanding is wrong.
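
            The argument above can be checked with a self-contained sketch of the quoted branch. Everything here is an assumption made for illustration: PrioritySketch and priorityFor are hypothetical names, the constant values are placeholders rather than the ones defined in UnderReplicatedBlocks, and every case with curReplicas > 0 is collapsed into a single OTHER_QUEUE marker.

```java
// Hypothetical, simplified model of the quoted priority logic.
final class PrioritySketch {
    private PrioritySketch() {}

    // Placeholder values chosen for this sketch only.
    static final int QUEUE_HIGHEST_PRIORITY = 0;
    static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;
    // Hypothetical marker for all cases with at least one current replica.
    static final int OTHER_QUEUE = -1;

    // decommissionedReplicas is the sum of decommissioning and
    // decommissioned replicas, as noted in the comment above.
    static int priorityFor(int curReplicas,
                           int decommissionedReplicas,
                           int readOnlyReplicas) {
        if (curReplicas == 0) {
            if (decommissionedReplicas > 0) {
                return QUEUE_HIGHEST_PRIORITY;
            }
            if (readOnlyReplicas > 0) {
                return QUEUE_HIGHEST_PRIORITY;
            }
            // Only reached when no usable replica of any kind is left.
            return QUEUE_WITH_CORRUPT_BLOCKS;
        }
        return OTHER_QUEUE;
    }
}
```

            In this model, priorityFor(0, 1, 0) returns QUEUE_HIGHEST_PRIORITY, so a block with any decommissioning or decommissioned replica never reaches QUEUE_WITH_CORRUPT_BLOCKS through this branch.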


            People

              yangzhe1991 Phil Yang
              yangzhe1991 Phil Yang
              Votes: 0
              Watchers: 5
