  1. Hadoop HDFS
  2. HDFS-11160

VolumeScanner reports write-in-progress replicas as corrupt incorrectly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.4, 3.0.0-alpha2
    • Component/s: datanode
    • Labels:
      None
    • Environment:

      CDH5.7.4

    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixed a race condition that caused VolumeScanner to misidentify a good replica as corrupt when the replica was being written concurrently.

      Description

      Due to a race condition initially reported in HDFS-6804, VolumeScanner may erroneously detect good replicas as corrupt. This is serious because in some cases it results in data loss if all replicas are declared corrupt. This bug is especially prominent when there are a lot of append requests via HttpFs/WebHDFS.

      We are investigating an incident that caused a very high block corruption rate in a relatively small cluster. Initially, we thought HDFS-11056 was to blame. However, after applying the HDFS-11056 fix, we still see VolumeScanner reporting corrupt replicas.

      It turns out that if a replica is being appended to while VolumeScanner is scanning it, VolumeScanner may compare the new checksum against the old data, causing a checksum mismatch.

      I have a unit test that reproduces the error and will attach it later. A quick and simple fix is to hold the FsDatasetImpl lock while reading the checksum from disk.
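      The race and the proposed locking fix can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the attached patch or the reproduction test: `ToyReplica`, `append`, `verifyWithoutLock`, `verifyWithLock`, and `scanDuringAppend` are hypothetical names, a plain Java monitor stands in for the FsDatasetImpl lock, and a single CRC32 value stands in for the on-disk checksum. The writer publishes the new checksum before the new data, and the scanner runs in that window.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.zip.CRC32;

public class ReplicaScanSketch {

    /** Hypothetical toy replica: one data buffer plus one stored checksum. */
    static final class ToyReplica {
        private final Object lock = new Object(); // stands in for the dataset lock
        private volatile byte[] data = new byte[0];
        private volatile long storedChecksum = crc(new byte[0]);

        /** Appends in two steps, pausing in between so a scan can interleave. */
        void append(byte[] more, CountDownLatch checksumWritten, CountDownLatch scanDone) {
            synchronized (lock) {
                byte[] grown = new byte[data.length + more.length];
                System.arraycopy(data, 0, grown, 0, data.length);
                System.arraycopy(more, 0, grown, data.length, more.length);
                storedChecksum = crc(grown);   // step 1: new checksum becomes visible
                checksumWritten.countDown();
                try {
                    // Leave a window for the scanner; give up after 200 ms.
                    scanDone.await(200, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                data = grown;                  // step 2: new data becomes visible
            }
        }

        /** Racy scan: may pair the new checksum with the old data. */
        boolean verifyWithoutLock() {
            return crc(data) == storedChecksum;
        }

        /** Scan under the lock: waits for an in-flight append to finish. */
        boolean verifyWithLock() {
            synchronized (lock) {
                return crc(data) == storedChecksum;
            }
        }
    }

    static long crc(byte[] bytes) {
        CRC32 c = new CRC32();
        c.update(bytes, 0, bytes.length);
        return c.getValue();
    }

    /** Runs one scan while an append is in flight; returns whether the scan passed. */
    static boolean scanDuringAppend(boolean scannerTakesLock) {
        ToyReplica replica = new ToyReplica();
        CountDownLatch checksumWritten = new CountDownLatch(1);
        CountDownLatch scanDone = new CountDownLatch(1);
        byte[] more = "appended".getBytes(StandardCharsets.UTF_8);
        Thread writer = new Thread(() -> replica.append(more, checksumWritten, scanDone));
        writer.start();
        try {
            checksumWritten.await();           // writer is now between step 1 and step 2
            boolean ok = scannerTakesLock
                    ? replica.verifyWithLock()
                    : replica.verifyWithoutLock();
            scanDone.countDown();
            writer.join();
            return ok;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("scan without lock passes: " + scanDuringAppend(false));
        System.out.println("scan with lock passes: " + scanDuringAppend(true));
    }
}
```

      In this model the unlocked scan reports a mismatch even though the replica is healthy, while the locked scan blocks until the append completes and then sees a consistent checksum. The cost is that the scanner briefly contends with writers for the lock.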

        Attachments

        1. HDFS-11160.reproduce.patch
          13 kB
          Wei-Chiu Chuang
        2. HDFS-11160.001.patch
          10 kB
          Wei-Chiu Chuang
        3. HDFS-11160.002.patch
          11 kB
          Wei-Chiu Chuang
        4. HDFS-11160.003.patch
          17 kB
          Yongjun Zhang
        5. HDFS-11160.004.patch
          18 kB
          Wei-Chiu Chuang
        6. HDFS-11160.005.patch
          18 kB
          Wei-Chiu Chuang
        7. HDFS-11160.006.patch
          18 kB
          Wei-Chiu Chuang
        8. HDFS-11160.007.patch
          17 kB
          Wei-Chiu Chuang
        9. HDFS-11160.008.patch
          17 kB
          Wei-Chiu Chuang
        10. HDFS-11160.branch-2.patch
          21 kB
          Wei-Chiu Chuang

              People

              • Assignee:
                Wei-Chiu Chuang (jojochuang)
              • Reporter:
                Wei-Chiu Chuang (jojochuang)
              • Votes:
                0
              • Watchers:
                14