Hadoop Common / HADOOP-4702

Failed block replication leaves an incomplete block in receiver's tmp data directory

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.17.2
    • Fix Version/s: 0.18.3
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When a failure occurs while replicating a block from a source DataNode to a target DataNode, the target node keeps an incomplete on-disk copy of the block in its temp data directory and an in-memory copy of the block in the ongoingCreates queue. This causes two problems:
      1. Since this block is not (and should not be) finalized, the NameNode is not aware that the incomplete block exists. It may schedule replication of the same block to this node again, which will fail with the message: "Block XX has already been started (though not completed), and thus cannot be created."
      2. Restarting the DataNode promotes the blocks under the temp data directory to valid blocks, thereby introducing corrupted blocks into HDFS. Sometimes those corrupted blocks stay in the system undetected, if the partial block happens to match its checksums.

      A failed block replication should clean up both the in-memory and on-disk copies of the incomplete block.
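
      The cleanup requested above can be sketched as follows. This is a minimal illustration, not the code from the attached patches: the class `TmpBlockCleanupSketch`, the `BlockSource` interface, the `blk_<id>` file naming, and the helper methods are all hypothetical, and a plain map stands in for the ongoingCreates structure. The point it shows is that a failed receive removes both the in-memory entry and the partial file before the error propagates, so a retry is not rejected and nothing partial is left to be promoted on restart.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a block receiver; not the actual DataNode code. */
public class TmpBlockCleanupSketch {
    private final Path tmpDir;
    // Assumption: a map keyed by block id stands in for ongoingCreates.
    private final Map<Long, Path> ongoingCreates = new ConcurrentHashMap<>();

    public TmpBlockCleanupSketch(Path tmpDir) {
        this.tmpDir = tmpDir;
    }

    /** Hypothetical source of block bytes; may fail partway through. */
    public interface BlockSource {
        void writeTo(OutputStream out) throws IOException;
    }

    public void receiveBlock(long blockId, BlockSource source) throws IOException {
        Path tmpFile = tmpDir.resolve("blk_" + blockId);
        // Reject a second attempt while an earlier one is still registered.
        if (ongoingCreates.putIfAbsent(blockId, tmpFile) != null) {
            throw new IOException("Block " + blockId
                + " has already been started (though not completed),"
                + " and thus cannot be created.");
        }
        boolean success = false;
        try {
            // The stream is closed before the finally block below runs.
            try (OutputStream out = Files.newOutputStream(tmpFile)) {
                source.writeTo(out);
            }
            success = true;
            // A real DataNode would finalize the block here; omitted.
        } finally {
            if (!success) {
                // Clean up both copies of the incomplete block.
                ongoingCreates.remove(blockId);
                Files.deleteIfExists(tmpFile);
            }
        }
    }

    /** True if the block is still registered as an in-progress create. */
    public boolean isOngoing(long blockId) {
        return ongoingCreates.containsKey(blockId);
    }

    /** True if a temp file for the block exists on disk. */
    public boolean tmpFileExists(long blockId) {
        return Files.exists(tmpDir.resolve("blk_" + blockId));
    }
}
```

      In this sketch, if `source.writeTo` throws partway through, both `isOngoing` and `tmpFileExists` return false afterwards, and a later `receiveBlock` call for the same block id is accepted rather than failing with the "already been started" message.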

      Attachments

        1. tmpBlockRemoval.patch
          7 kB
          Hairong Kuang
        2. tmpBlockRemoval1.patch
          7 kB
          Hairong Kuang
        3. tmpBlockRemoval2.patch
          8 kB
          Hairong Kuang

            People

              hairong Hairong Kuang
              hairong Hairong Kuang