MAPREDUCE-6840 (Hadoop Map/Reduce)

Distcp to support cutoff time

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: distcp
    • Labels: None

    Description

      To keep datasets on HDFS consistent, some projects, such as file formats used with Hive, perform HDFS operations in a particular order. For example, if a file format uses an index file, a new version of the index file is written to HDFS only after all files that the index mentions have been written.

      When we do distcp, it's important to preserve that consistency, so that we don't break those file formats.

      A typical solution is to create an HDFS snapshot beforehand and distcp only the snapshot. That works well if the user has the superuser privilege needed to make the directory snapshottable.

      If not, it would be beneficial to have a cutoff time for distcp, so that distcp copies only files modified on or before that cutoff time.
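
      The following is a minimal sketch of the cutoff idea, not the code in the attached patch. It assumes a local directory tree stands in for an HDFS listing, and that modification times reflect the order in which files were written. The names files_at_or_before_cutoff, write_with_mtime and demo are hypothetical and exist only for illustration.

```python
import os
import tempfile


def files_at_or_before_cutoff(root, cutoff_seconds):
    """Return sorted relative paths under root whose mtime is <= cutoff_seconds."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Keep only files last modified on or before the cutoff.
            if os.stat(path).st_mtime <= cutoff_seconds:
                selected.append(os.path.relpath(path, root))
    return sorted(selected)


def write_with_mtime(path, content, mtime_seconds):
    """Hypothetical helper: write a file and pin its modification time."""
    with open(path, "w") as handle:
        handle.write(content)
    os.utime(path, (mtime_seconds, mtime_seconds))


def demo():
    """Simulate two index versions, each written after its data files."""
    with tempfile.TemporaryDirectory() as root:
        # Version 1: the data file first, then the index that mentions it.
        write_with_mtime(os.path.join(root, "data-1"), "rows", 1000)
        write_with_mtime(os.path.join(root, "index-v1"), "data-1", 1010)
        # Version 2: written after the chosen cutoff (assumed to be 1015).
        write_with_mtime(os.path.join(root, "data-2"), "rows", 1020)
        write_with_mtime(os.path.join(root, "index-v2"), "data-1,data-2", 1030)
        return files_at_or_before_cutoff(root, 1015)
```

      With the cutoff at 1015, demo() selects data-1 and index-v1 and skips both version 2 files. Because an index is written only after the files it mentions, any index that passes the cutoff has all of its data files passing it too. A cutoff of 1025 would additionally pick up data-2 without index-v2, which leaves an unreferenced data file but no index entry pointing at a missing file.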

      Attachments

        1. MAPREDUCE-6840.1.patch
          9 kB
          Zheng Shao

            People

              Assignee: Zheng Shao (zshao)
              Reporter: Zheng Shao (zshao)
              Votes: 0
              Watchers: 2
