  1. Hadoop HDFS
  2. HDFS-14978

In-place Erasure Coding Conversion

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: erasure-coding
    • Labels: None

    Description

      HDFS Erasure Coding is a new feature added in Apache Hadoop 3.0. It uses encoding algorithms to reduce disk space usage while retaining the redundancy necessary for data recovery. It took a huge amount of work to build, but it is only now being adopted, almost 2 years later.

      One usability problem that’s blocking users from adopting HDFS Erasure Coding is that existing replicated files have to be copied to an EC-enabled directory explicitly. Renaming a file/directory to an EC-enabled directory does not automatically convert the blocks. Therefore users typically perform the following steps to erasure-code existing files:

      Create a $tmp directory and set an EC policy on it
      DistCp $src to $tmp
      Delete $src (rm -rf $src)
      mv $tmp $src
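
The four steps above can be sketched as a small command builder. This is only an illustration, not a recommended tool: the paths, the function name `conversion_commands`, and the choice of the built-in `RS-6-3-1024k` policy are assumptions, and the policy is assumed to be enabled already. The `-update -skipcrccheck` DistCp flags are also an assumption, added because file checksums of replicated and striped files do not match by default.

```python
from typing import List


def conversion_commands(src: str, tmp: str,
                        policy: str = "RS-6-3-1024k") -> List[List[str]]:
    """Return the copy-based conversion steps as argv lists, in order.

    Nothing is executed here; the caller decides whether to print the
    commands for review or hand them to subprocess.run().
    """
    return [
        # Step 1: create $tmp and set an EC policy on it.
        ["hdfs", "dfs", "-mkdir", "-p", tmp],
        ["hdfs", "ec", "-setPolicy", "-path", tmp, "-policy", policy],
        # Step 2: copy $src into $tmp; blocks are re-written as striped blocks.
        # -update copies the contents of src into the existing tmp directory;
        # -skipcrccheck is an assumption to avoid checksum mismatches.
        ["hadoop", "distcp", "-update", "-skipcrccheck", src, tmp],
        # Step 3: delete $src. From here until step 4, nothing exists at $src.
        ["hdfs", "dfs", "-rm", "-r", src],
        # Step 4: move $tmp into place.
        ["hdfs", "dfs", "-mv", tmp, src],
    ]


if __name__ == "__main__":
    # Hypothetical paths, printed as a dry run only.
    for argv in conversion_commands("/data/warehouse", "/data/.warehouse_ec_tmp"):
        print(" ".join(argv))
```

Printing the commands rather than running them makes the gap between the delete and the move easy to see, which is the availability window described below.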
      

      There are several reasons why this is not popular:

      • Complex. The process involves several steps: distcp the data to a temporary destination; delete the source; move the destination to the source path.
      • Availability. There is a short period where nothing exists at the source path, and jobs may fail unexpectedly.
      • Overhead. During the copy phase, there is a point in time where both the source and destination files exist at the same time, which can exhaust disk space.
      • Not snapshot-friendly. If a snapshot is taken prior to performing the conversion, the source (replicated) files will be preserved in the cluster too. Therefore, the conversion actually increases storage space usage.
      • Not management-friendly. This approach changes the file inode number, modification time and access time. Erasure coded files are supposed to store cold data, but this conversion makes the data “hot” again.
      • Bulky. It’s all or nothing. The directory may already be partially erasure coded, but this approach simply erasure-codes everything again.
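
To put rough numbers on the overhead and snapshot points, here is a back-of-the-envelope sketch. The figures are assumptions rather than anything stated in this issue: 3x replication for the source files and a Reed-Solomon layout with 6 data and 3 parity units for the destination. It ignores partially filled stripes and small files.

```python
def raw_usage(logical_tb: float, replication: int = 3,
              data_units: int = 6, parity_units: int = 3) -> dict:
    """Estimate raw disk usage (in TB) for a given amount of logical data."""
    # Replicated files store every block `replication` times.
    replicated = logical_tb * replication
    # Striped files store data_units + parity_units cells per data_units of data.
    erasure_coded = logical_tb * (data_units + parity_units) / data_units
    return {
        "replicated": replicated,
        "erasure_coded": erasure_coded,
        # Both copies coexist at the end of the copy phase, and they keep
        # coexisting afterwards if a snapshot still references the source.
        "peak_during_copy": replicated + erasure_coded,
    }


if __name__ == "__main__":
    print(raw_usage(100))
```

Under these assumptions, 100 TB of logical data occupies 300 TB replicated and 150 TB erasure coded, so the cluster needs about 450 TB at the peak of the copy, and stays there if a snapshot pins the replicated blocks.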

      To ease data management, we should offer a utility tool to convert replicated files to erasure coded files in-place.

      Attachments

        1. In-place Erasure Coding Conversion.pdf
          119 kB
          Wei-Chiu Chuang

            People

              Assignee: avijayan Aravindan Vijayan
              Reporter: weichiu Wei-Chiu Chuang
              Votes: 0
              Watchers: 28

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h