Hadoop HDFS / HDFS-11160

VolumeScanner reports write-in-progress replicas as corrupt incorrectly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.4, 3.0.0-alpha2
    • Component/s: datanode
    • Labels: None
    • Environment: CDH5.7.4
    • Hadoop Flags: Reviewed
    • Release Note: Fixed a race condition that caused VolumeScanner to report a good replica as corrupt when the replica was being written concurrently.

    Description

      Due to a race condition initially reported in HDFS-6804, VolumeScanner may erroneously report good replicas as corrupt. This is serious because it can result in data loss if all replicas of a block are declared corrupt. The bug is especially prominent when there are many append requests via HttpFs/WebHDFS.

      We are investigating an incident that caused a very high block corruption rate in a relatively small cluster. Initially, we thought HDFS-11056 was to blame. However, after applying the HDFS-11056 fix, we still see VolumeScanner reporting corrupt replicas.

      It turns out that if a replica is being appended to while VolumeScanner is scanning it, VolumeScanner may compare the new checksum against the old data, causing a checksum mismatch.
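
      The interleaving described above can be replayed deterministically in a single thread. The following is only an illustrative sketch, not DataNode code: `ReplicaRaceSketch`, its fields, and its methods are hypothetical stand-ins for a replica's block data and its stored checksum, and a whole-buffer CRC32 is assumed in place of HDFS's per-chunk checksums.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Hypothetical in-memory stand-in for a replica: block data plus a stored checksum. */
final class ReplicaRaceSketch {
    private byte[] data = new byte[0];
    private long storedChecksum = crcOf(data);

    /** CRC32 over the whole buffer (assumption: real replicas use per-chunk checksums). */
    static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    /** Writer path: append bytes, then refresh the stored checksum. */
    void append(byte[] more) {
        byte[] grown = Arrays.copyOf(data, data.length + more.length);
        System.arraycopy(more, 0, grown, data.length, more.length);
        data = grown;
        storedChecksum = crcOf(grown);
    }

    byte[] readData() {
        return data.clone();
    }

    long readChecksum() {
        return storedChecksum;
    }

    /**
     * Replays the problematic order of events: the scanner reads the data,
     * an append lands, and only then does the scanner read the checksum.
     * Returns true if the scanner would flag the (healthy) replica as corrupt.
     */
    static boolean scanWithInterleavedAppend() {
        ReplicaRaceSketch replica = new ReplicaRaceSketch();
        replica.append("old".getBytes(StandardCharsets.UTF_8));
        byte[] snapshot = replica.readData();                    // scanner reads old data
        replica.append("new".getBytes(StandardCharsets.UTF_8));  // append lands in between
        long checksum = replica.readChecksum();                  // scanner reads new checksum
        return crcOf(snapshot) != checksum;
    }

    public static void main(String[] args) {
        System.out.println("spurious corruption report: " + scanWithInterleavedAppend());
    }
}
```

      The replica is never damaged in this sketch; the mismatch comes purely from reading the data and the checksum at different points in time.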

      I have a unit test that reproduces the error and will attach it later. A quick and simple fix is to hold the FsDatasetImpl lock while reading the checksum from disk.
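
      A minimal sketch of the locking idea, under stated assumptions: this is not the attached patch, and `LockedScanSketch`, `datasetLock`, and `scanOnce` are hypothetical names. A plain `ReentrantLock` stands in for the FsDatasetImpl lock, and an in-memory buffer with a whole-buffer CRC32 stands in for the on-disk block and meta files. The point is only that the writer and the scanner take the same lock, so the scanner always sees data and checksum from the same moment.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;
import java.util.zip.CRC32;

/** Hypothetical replica whose data and checksum are guarded by one lock. */
final class LockedScanSketch {
    // Stand-in for the FsDatasetImpl lock (assumption for this sketch).
    private final ReentrantLock datasetLock = new ReentrantLock();
    private byte[] data = new byte[0];
    private long storedChecksum = crcOf(data);

    static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    /** Writer path: data and checksum change together while the lock is held. */
    void append(byte[] more) {
        datasetLock.lock();
        try {
            byte[] grown = Arrays.copyOf(data, data.length + more.length);
            System.arraycopy(more, 0, grown, data.length, more.length);
            data = grown;
            storedChecksum = crcOf(grown);
        } finally {
            datasetLock.unlock();
        }
    }

    /** Scanner path: take a consistent snapshot under the lock, verify outside it. */
    boolean scanOnce() {
        byte[] snapshot;
        long checksum;
        datasetLock.lock();
        try {
            snapshot = data.clone();
            checksum = storedChecksum;
        } finally {
            datasetLock.unlock();
        }
        return crcOf(snapshot) == checksum;
    }

    /** Runs a writer thread against a scanning loop and counts false alarms. */
    static int countMismatches(int rounds) {
        LockedScanSketch replica = new LockedScanSketch();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                replica.append(new byte[] {(byte) i});
            }
        });
        writer.start();
        int mismatches = 0;
        for (int i = 0; i < rounds; i++) {
            if (!replica.scanOnce()) {
                mismatches++;
            }
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writer", e);
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(1000));
    }
}
```

      The sketch copies the snapshot under the lock and computes the CRC afterwards, which keeps the critical section short; how long a real scanner could afford to hold the dataset lock is a separate trade-off.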

      Attachments

        1. HDFS-11160.reproduce.patch
          13 kB
          Wei-Chiu Chuang
        2. HDFS-11160.branch-2.patch
          21 kB
          Wei-Chiu Chuang
        3. HDFS-11160.008.patch
          17 kB
          Wei-Chiu Chuang
        4. HDFS-11160.007.patch
          17 kB
          Wei-Chiu Chuang
        5. HDFS-11160.006.patch
          18 kB
          Wei-Chiu Chuang
        6. HDFS-11160.005.patch
          18 kB
          Wei-Chiu Chuang
        7. HDFS-11160.004.patch
          18 kB
          Wei-Chiu Chuang
        8. HDFS-11160.003.patch
          17 kB
          Yongjun Zhang
        9. HDFS-11160.002.patch
          11 kB
          Wei-Chiu Chuang
        10. HDFS-11160.001.patch
          10 kB
          Wei-Chiu Chuang

            People

              Assignee: Wei-Chiu Chuang (weichiu)
              Reporter: Wei-Chiu Chuang (weichiu)
              Votes: 0
              Watchers: 14
