Hadoop HDFS / HDFS-900

Corrupt replicas are not tracked correctly through block report from DN

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      This one is tough to describe, but essentially the following sequence of events occurs:

      1. A client marks one replica of a block as corrupt by reporting it to the NN.
      2. Replication is then scheduled to make a new replica of this block.
      3. The replication completes, so there are now 3 good replicas and 1 corrupt replica.
      4. The DN holding the corrupt replica sends a block report. Rather than telling this DN to delete the corrupt replica, the NN marks it as a new good replica of the block and schedules deletion on one of the good replicas.

      I don't know whether this is a data-loss bug in the case of 1 corrupt replica with dfs.replication=2, but it seems feasible. I will attach a debug log with some commentary marked by '============>', plus a unit test patch which reproduces this behavior reliably. (It's not a proper unit test, just some edits to an existing one to show the problem.)
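The failure mode above can be modeled in a few lines. The sketch below is not actual HDFS code; the names (`BlockInfo`, `process_report`, the `check_corrupt_map` flag) are hypothetical, and it only illustrates the core issue: when a DN reports a replica that the NN already recorded as corrupt, the NN must consult its corrupt-replica map and schedule an invalidation instead of re-admitting the replica as good.

```python
# Hypothetical model of the bug (not real HDFS source). A block tracks
# its good replicas and a set of DNs known to hold corrupt copies.

GOOD = "good"

class BlockInfo:
    def __init__(self):
        self.replicas = {}             # datanode id -> GOOD
        self.corrupt_replicas = set()  # datanode ids holding a corrupt copy

def process_report(block, dn, check_corrupt_map=True):
    """Handle a DN reporting that it holds a replica of `block`.

    With check_corrupt_map=False this mimics the buggy behavior: the NN
    forgets the replica was marked corrupt and counts it as good again,
    which later triggers over-replication handling that may delete a
    good replica instead of the corrupt one.
    """
    if check_corrupt_map and dn in block.corrupt_replicas:
        return "schedule-delete"   # fixed path: tell the DN to drop it
    block.replicas[dn] = GOOD      # buggy path re-admits the corrupt copy
    return "accepted"
```

For example, if "dn4" holds the corrupt replica, `process_report(block, "dn4")` should return `"schedule-delete"`, while the buggy path (`check_corrupt_map=False`) would record "dn4" as a good replica again.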

      Attachments

      1. reportCorruptBlock.patch (2 kB, Konstantin Shvachko)
      2. to-reproduce.patch (6 kB, Todd Lipcon)
      3. log-commented (34 kB, Todd Lipcon)


            People

            • Assignee: Konstantin Shvachko
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 5
