  1. Hadoop HDFS
  2. HDFS-900

Corrupt replicas are not tracked correctly through block report from DN

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      This one is tough to describe, but essentially the following order of events is seen to occur:

      1. A client marks one replica of a block as corrupt by reporting it to the NN.
      2. Replication is then scheduled to make a new replica of this block.
      3. The replication completes, such that there are now 3 good replicas and 1 corrupt replica.
      4. The DN holding the corrupt replica sends a block report. Rather than telling this DN to delete the corrupt replica, the NN instead marks it as a new good replica of the block and schedules deletion of one of the good replicas.
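
The sequence above can be replayed with a small toy model. This is only a sketch for illustration and is not HDFS code: the class `ToyBlockState`, its methods, the DataNode names `dn1` to `dn4`, and the `consultCorruptSet` switch are all hypothetical. It assumes a target replication of 3 and models the NN's view of a single block as two sets (good and corrupt replicas). With `consultCorruptSet` set to `false` the block report is handled the way this issue describes; with `true` it is handled the way the description says it should be.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the NN's bookkeeping for ONE block. Hypothetical names; not HDFS code.
class ToyBlockState {
    final int targetReplication;
    final Set<String> liveReplicas = new TreeSet<>();        // DNs believed to hold a good replica
    final Set<String> corruptReplicas = new TreeSet<>();     // DNs reported as holding a corrupt replica
    final List<String> scheduledDeletions = new ArrayList<>(); // DNs told to delete their replica

    ToyBlockState(int targetReplication, String... dataNodes) {
        this.targetReplication = targetReplication;
        liveReplicas.addAll(Arrays.asList(dataNodes));
    }

    // Step 1: a client tells the NN that the replica on dn is corrupt.
    void reportCorrupt(String dn) {
        liveReplicas.remove(dn);
        corruptReplicas.add(dn);
    }

    // Steps 2-3: re-replication to a new DN completes.
    void replicateTo(String dn) {
        liveReplicas.add(dn);
    }

    // Step 4: dn sends a block report that lists this block.
    void processBlockReport(String dn, boolean consultCorruptSet) {
        if (consultCorruptSet && corruptReplicas.contains(dn)) {
            // Expected handling: the reporting DN is told to delete its corrupt replica.
            scheduledDeletions.add(dn);
            return;
        }
        // Reported handling: the replica is counted as a new good one...
        liveReplicas.add(dn);
        if (liveReplicas.size() > targetReplication) {
            // ...so the block looks over-replicated and another replica is dropped.
            String victim = null;
            for (String candidate : liveReplicas) {
                if (!candidate.equals(dn)) {
                    victim = candidate;
                    break;
                }
            }
            if (victim != null) {
                liveReplicas.remove(victim);
                scheduledDeletions.add(victim);
            }
        }
    }

    // Replays steps 1-4 with dn1 as the DN holding the corrupt replica.
    static ToyBlockState replay(boolean consultCorruptSet) {
        ToyBlockState block = new ToyBlockState(3, "dn1", "dn2", "dn3");
        block.reportCorrupt("dn1");
        block.replicateTo("dn4");
        block.processBlockReport("dn1", consultCorruptSet);
        return block;
    }

    public static void main(String[] args) {
        System.out.println("reported behavior deletes: " + replay(false).scheduledDeletions); // [dn2]
        System.out.println("expected behavior deletes: " + replay(true).scheduledDeletions);  // [dn1]
    }
}
```

In the `false` run the corrupt copy on `dn1` ends up in both sets and a good copy on `dn2` is scheduled for deletion, which mirrors step 4. Which good replica the real NN picks is not stated in this issue; the toy simply takes the first one in sorted order.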

      I don't know whether this becomes a data-loss bug in the case of 1 corrupt replica with dfs.replication=2, but it seems feasible. I will attach a debug log with some commentary marked by '============>', plus a patch to an existing unit test that reproduces this behavior reliably. (It's not a proper unit test, just some edits to an existing one to show the problem.)

      Attachments

        1. log-commented
          34 kB
          Todd Lipcon
        2. to-reproduce.patch
          6 kB
          Todd Lipcon
        3. reportCorruptBlock.patch
          2 kB
          Konstantin Shvachko

            People

              Assignee: Konstantin Shvachko (shv)
              Reporter: Todd Lipcon (tlipcon)