  1. Hadoop HDFS
  2. HDFS-8031 Follow-on work for erasure coding phase I (striping layout)
  3. HDFS-9826

Erasure Coding: Postpone the recovery work for a configurable time period

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem

    Description

      Currently, the NameNode schedules recovery as soon as it finds an under-replicated block group. This is inefficient and takes resources away from other operations. It would be better to postpone the recovery work for a period of time when only one internal block is corrupted, considering the following points made in papers such as [1][2]:
      1. Transient errors in which no data are lost account for more than 90% of data center failures, owing to network partitions, software problems, or non-disk hardware faults.
      2. Although erasure codes tolerate multiple simultaneous failures, single failures represent 99.75% of recoveries.

      Clusters differ in their failure characteristics, so we should allow users to configure how long recovery is postponed. A well-chosen value will avoid a large proportion of unnecessary recoveries. When multiple internal blocks in a block group are found to be corrupted, we schedule the recovery work immediately, because this case is very rare and we don’t want to increase the risk of losing data.
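
The policy described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in the attached patches: the class name `PostponedRecoveryPolicy`, the method `shouldRecoverNow`, and the `postponeMillis` parameter are all hypothetical, and a real implementation would live inside the NameNode's block management code and read the delay from a configuration key.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the postponement policy; not an HDFS API
 * and not the HDFS-9826 patch.
 */
public class PostponedRecoveryPolicy {
    // Assumed to come from a user-configurable setting.
    private final long postponeMillis;

    // Block group id -> time at which a single missing block was first seen.
    private final Map<Long, Long> firstSeen = new HashMap<>();

    public PostponedRecoveryPolicy(long postponeMillis) {
        if (postponeMillis < 0) {
            throw new IllegalArgumentException("postponeMillis must be >= 0");
        }
        this.postponeMillis = postponeMillis;
    }

    /**
     * Decides whether recovery for a block group should be scheduled now.
     *
     * @param blockGroupId          identifier of the block group
     * @param missingInternalBlocks number of missing or corrupted internal blocks
     * @param nowMillis             current time, supplied by the caller
     */
    public boolean shouldRecoverNow(long blockGroupId,
                                    int missingInternalBlocks,
                                    long nowMillis) {
        if (missingInternalBlocks <= 0) {
            // The group is healthy again (for example, a transient error
            // cleared), so forget any pending postponement.
            firstSeen.remove(blockGroupId);
            return false;
        }
        if (missingInternalBlocks > 1) {
            // Multiple failures: recover immediately to limit the risk of data loss.
            firstSeen.remove(blockGroupId);
            return true;
        }
        // Exactly one failure: start the timer on first sight, then wait it out.
        Long since = firstSeen.putIfAbsent(blockGroupId, nowMillis);
        long start = (since == null) ? nowMillis : since;
        return nowMillis - start >= postponeMillis;
    }
}
```

Passing the current time in as an argument, rather than reading a clock inside the method, is a choice made here only to keep the sketch easy to test. With a delay of zero the sketch schedules recovery as soon as a failure is seen.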

      [1] Availability in globally distributed storage systems
      http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
      [2] Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads
      http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf

      Attachments

        1. HDFS-9826-001.patch
          7 kB
          Li Bo
        2. HDFS-9826-002.patch
          8 kB
          Li Bo

          People

            Assignee: Li Bo (libo-intel)
            Reporter: Li Bo (libo-intel)
            Votes: 0
            Watchers: 4
