Hadoop HDFS / HDFS-1623 High Availability Framework for HDFS NN / HDFS-2742

HA: observed dataloss in replication stress test

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: HA branch (HDFS-1623)
    • Fix Version/s: None
    • Component/s: datanode, ha, namenode
    • Labels: None

      Description

      The replication stress test failed during a run over the weekend because one of the replicas went missing. I'm still diagnosing the issue, but the chain of events appears to have been something like:

      • A block report was generated on one of the nodes while the block was being written, so the report listed the replica as RBW.
      • The standby queued this message and replayed it only after the file was marked complete, so it marked the replica as corrupt.
      • It asked the DN holding the corrupt replica to delete it and, I think, removed the replica from the block map at this time.
      • That DN then sent another block report before receiving the deletion request. This caused the replica to be re-added to the block map, since it was now "FINALIZED".
      • Replication was lowered on the file, the replica above was counted as non-corrupt, and the other replicas were asked to be deleted.
      • All replicas were lost.
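
      The suspected chain of events can be replayed in a small toy model. This is only a sketch of the race as described above, not the actual NameNode or BlockManager code: the class `StaleReportSketch`, the datanode names `dn1`..`dn3`, the three sets, and the choice to keep `dn1` when replication is lowered are all hypothetical and chosen to reproduce the worst case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the suspected race; all names are hypothetical, not Hadoop APIs.
public class StaleReportSketch {
    enum ReplicaState { RBW, FINALIZED }

    /** Replays the event chain and returns the datanodes that still hold the block. */
    static Set<String> simulate() {
        // Replicas physically present on datanodes.
        Set<String> onDisk = new TreeSet<>(Arrays.asList("dn1", "dn2", "dn3"));
        // The NameNode's view of healthy replicas (assumed to start with all three).
        Set<String> blockMap = new TreeSet<>(Arrays.asList("dn1", "dn2", "dn3"));
        // Deletions requested by the NameNode but not yet processed by the datanodes.
        Set<String> pendingDeletes = new TreeSet<>();

        // Steps 1-2: dn1's report was generated while the block was RBW,
        // but it is replayed only after the file has been marked complete.
        boolean fileComplete = true;
        ReplicaState queuedReport = ReplicaState.RBW;
        if (fileComplete && queuedReport == ReplicaState.RBW) {
            // Step 3: the replica looks corrupt, so it is dropped and scheduled for deletion.
            blockMap.remove("dn1");
            pendingDeletes.add("dn1");
        }

        // Step 4: dn1 reports again before processing the deletion; the replica
        // is FINALIZED now, so it is re-added although a delete is still pending.
        ReplicaState freshReport = ReplicaState.FINALIZED;
        if (freshReport == ReplicaState.FINALIZED) {
            blockMap.add("dn1");
        }

        // Step 5: replication is lowered to 1. Worst case: the replica kept is dn1.
        String kept = "dn1";
        for (String dn : new ArrayList<>(blockMap)) {
            if (!dn.equals(kept)) {
                blockMap.remove(dn);
                pendingDeletes.add(dn);
            }
        }

        // Step 6: every datanode processes its pending deletion, including dn1.
        onDisk.removeAll(pendingDeletes);
        return onDisk;
    }

    public static void main(String[] args) {
        System.out.println("Surviving replicas: " + simulate()); // prints "Surviving replicas: []"
    }
}
```

      In this model the loss comes from re-adding `dn1` to the block map while its deletion is still pending; if the fresh report were ignored for replicas with a pending deletion, `dn2` or `dn3` would be kept instead.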

        Attachments

        1. hdfs-2742.txt
          78 kB
          Todd Lipcon
        2. hdfs-2742.txt
          79 kB
          Todd Lipcon
        3. hdfs-2742.txt
          75 kB
          Todd Lipcon
        4. hdfs-2742.txt
          83 kB
          Todd Lipcon
        5. hdfs-2742.txt
          79 kB
          Todd Lipcon
        6. hdfs-2742.txt
          74 kB
          Todd Lipcon
        7. hdfs-2742.txt
          56 kB
          Todd Lipcon
        8. hdfs-2742.txt
          5 kB
          Todd Lipcon
        9. log-colorized.txt
          7.47 MB
          Todd Lipcon

            People

            • Assignee: Todd Lipcon (tlipcon)
            • Reporter: Todd Lipcon (tlipcon)
            • Votes: 0
            • Watchers: 8
