  1. Hadoop HDFS
  2. HDFS-4799

Corrupt replica can be prematurely removed from corruptReplicas map

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.4-alpha
    • Fix Version/s: 2.1.0-beta
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      We saw the following sequence of events in a cluster result in the loss of the replicas carrying the most recent generation stamp (genstamp) of a block:

      • A client is writing to a pipeline of 3 datanodes.
      • Over some period of time, nodes in the pipeline failed and were replaced, leaving 3 old-genstamp replicas on the original three nodes and 3 newly recruited replicas with a later genstamp.
        • So there are 6 replicas in total in the cluster: 3 with old genstamps on the downed nodes and 3 with the latest genstamp.
      • The cluster reboots, and the nodes holding the old-genstamp replicas send their block reports first. These replicas are correctly added to the corrupt replicas map because their genstamp is too old.
      • The nodes holding the new-genstamp replicas then send their block reports. When the last of them reports, chooseExcessReplicates is called and incorrectly decides to remove the three good replicas, leaving only the old-genstamp replicas.
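
The sequence above can be replayed in a small toy model. This is only a sketch of one plausible reading that combines the issue title with the description: if the corrupt markers are forgotten too early, the stale replicas count as live, the block appears over-replicated, and a genstamp-blind chooser can discard the good copies. Every name here (`StaleReplicaSketch`, `survivors`, the node labels, the "keep the first three" policy) is hypothetical and is not taken from the HDFS code base; the real chooseExcessReplicates uses different selection criteria.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: every name here is hypothetical and none of this is HDFS code.
class StaleReplicaSketch {
    static final int REPLICATION = 3;
    static final long LATEST_GENSTAMP = 2;

    /**
     * Replays the block reports and returns the nodes whose replica is kept.
     * If dropCorruptMarkersEarly is true, the corrupt markers are forgotten
     * before excess replicas are chosen, mimicking a premature removal.
     */
    static List<String> survivors(boolean dropCorruptMarkersEarly) {
        // Report order: the three stale replicas first, then the three good ones.
        String[] nodes = {"old1", "old2", "old3", "new1", "new2", "new3"};
        long[] genstamps = {1, 1, 1, 2, 2, 2};

        Map<String, Long> reported = new LinkedHashMap<>();
        Set<String> corrupt = new LinkedHashSet<>();
        for (int i = 0; i < nodes.length; i++) {
            reported.put(nodes[i], genstamps[i]);
            if (genstamps[i] < LATEST_GENSTAMP) {
                corrupt.add(nodes[i]); // too-old genstamp: mark as corrupt
            }
        }

        if (dropCorruptMarkersEarly) {
            corrupt.clear(); // the stale replicas now look like live ones
        }

        // Live replicas are all reported replicas that are not marked corrupt.
        List<String> live = new ArrayList<>();
        for (String node : reported.keySet()) {
            if (!corrupt.contains(node)) {
                live.add(node);
            }
        }

        // Arbitrary genstamp-blind policy (an assumption of this sketch):
        // keep the first REPLICATION live replicas, treat the rest as excess.
        int keep = Math.min(REPLICATION, live.size());
        return new ArrayList<>(live.subList(0, keep));
    }

    public static void main(String[] args) {
        System.out.println("markers kept:    " + survivors(false));
        System.out.println("markers dropped: " + survivors(true));
    }
}
```

With the markers intact, only the three new-genstamp replicas are live and nothing is excess. With the markers dropped, all six replicas look live, three are treated as excess, and under this sketch's policy the survivors are exactly the old-genstamp replicas.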

      Attachments

        1. hdfs-4799.txt
          10 kB
          Todd Lipcon
        2. hdfs-4799.txt
          10 kB
          Todd Lipcon
        3. hdfs-4799-unittest.txt
          9 kB
          Todd Lipcon

          People

            Assignee: Todd Lipcon (tlipcon)
            Reporter: Todd Lipcon (tlipcon)
            Votes: 0
            Watchers: 11
