  1. Hadoop HDFS
  2. HDFS-900

Corrupt replicas are not tracked correctly through block report from DN

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This one is tough to describe, but essentially the following order of events is seen to occur:

      1. A client reports one replica of a block to the NN as corrupt.
      2. Replication is then scheduled to make a new replica of this block.
      3. The replication completes, such that there are now 3 good replicas and 1 corrupt replica.
      4. The DN holding the corrupt replica sends a block report. Rather than telling this DN to delete the corrupt replica, the NN instead marks it as a new good replica of the block and schedules deletion of one of the good replicas.
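
      The sequence above can be illustrated with a toy model of the NN's replica bookkeeping. This is only a sketch under stated assumptions, not the actual NameNode code: the class `CorruptReplicaSketch`, its methods, and the DN names are all hypothetical, and the real block-report path is considerably more involved. It contrasts the behavior described in step 4 with what one would expect instead.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of replica tracking for a single block. Hypothetical; not HDFS code. */
class CorruptReplicaSketch {
    /** What the NN tells a DN to do with a replica it reported. */
    enum Action { KEEP, DELETE }

    // DNs the NN believes hold a good replica of the block.
    private final Set<String> liveReplicas = new HashSet<>();
    // DNs whose replica a client has reported as corrupt.
    private final Set<String> corruptReplicas = new HashSet<>();

    void addReplica(String dn) {
        liveReplicas.add(dn);
    }

    /** Step 1: a client reports the replica on dn as corrupt. */
    void markCorrupt(String dn) {
        liveReplicas.remove(dn);
        corruptReplicas.add(dn);
    }

    /** Behavior as described in the report: the reported replica is counted as good. */
    Action processReportBuggy(String dn) {
        liveReplicas.add(dn); // the corrupt marking is never consulted
        return Action.KEEP;
    }

    /** Expected behavior (assumption): a replica already known to be corrupt is deleted. */
    Action processReportExpected(String dn) {
        if (corruptReplicas.contains(dn)) {
            return Action.DELETE;
        }
        liveReplicas.add(dn);
        return Action.KEEP;
    }

    int liveCount() {
        return liveReplicas.size();
    }

    /** True if the NN would consider the block over-replicated and delete a replica. */
    boolean isOverReplicated(int replication) {
        return liveCount() > replication;
    }

    public static void main(String[] args) {
        CorruptReplicaSketch nn = new CorruptReplicaSketch();
        nn.addReplica("dn1");
        nn.addReplica("dn2");
        nn.addReplica("dn3");
        nn.markCorrupt("dn3"); // step 1
        nn.addReplica("dn4");  // steps 2-3: re-replication completes
        System.out.println("buggy action: " + nn.processReportBuggy("dn3"));
        System.out.println("over-replicated: " + nn.isOverReplicated(3));
    }
}
```

      In this model, the buggy path leaves the block with four replicas counted as good at a replication factor of 3, so one replica becomes a deletion candidate, and nothing prevents that candidate from being one of the three genuinely good copies.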

      I don't know whether this becomes a data-loss bug in the case of 1 corrupt replica with dfs.replication=2, but it seems feasible. I will attach a debug log with some commentary marked by '============>', plus a unit test patch that reproduces this behavior reliably. (It's not a proper unit test, just some edits to an existing one to show the problem.)

        Attachments

        1. log-commented
          34 kB
          Todd Lipcon
        2. to-reproduce.patch
          6 kB
          Todd Lipcon
        3. reportCorruptBlock.patch
          2 kB
          Konstantin Shvachko

              People

              • Assignee:
                shv Konstantin Shvachko
                Reporter:
                tlipcon Todd Lipcon
              • Votes:
                0
                Watchers:
                5