  1. Hadoop Common
  2. HADOOP-14137

Faster distcp by taking file list from fsimage or -lsr result


Details

    • New Feature
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • 2.6.5
    • 2.6.6
    • tools/distcp
    • None

    Description

      DistCp is very slow to start when the source directory contains a huge number of subdirectories. In our case, we already have the directory listing (from "hdfs oiv -i fsimage" or from nightly "hdfs dfs -ls -R /" dumps), and we would like to use it instead of listing the directory on the NameNode in real time.

      The "-f" option doesn't help in this case because it would try to put every listed file into a single flat target directory, losing the source directory structure.

      We'd like to introduce a new option "-list <file>" for distcp, where <file> contains the result of listing the source directory.

      In order to achieve this, we plan to:
      1. Add a new CopyListing class, PregeneratedCopyListing, similar to SimpleCopyListing. Instead of recursively listing the source directory (the equivalent of "-ls -R"), it reads the listing from the file passed via "-list".
      2. Add an option "-list <file>", which automatically makes distcp use the new PregeneratedCopyListing class.
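
The core of step 1 is turning each line of a saved listing into a path relative to the source root, so that the directory structure can be recreated under the target. The issue does not specify the format of the listing file, so the following is only a minimal sketch that assumes the eight-column output of "hdfs dfs -ls -R" (permissions, replication, owner, group, size, date, time, path). The class PregeneratedListingSketch and its methods are hypothetical names chosen for illustration; they are not taken from the attached patches and do not use the real DistCp CopyListing API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: class and method names are illustrative only and are
// not taken from the attached patches or from DistCp itself.
final class PregeneratedListingSketch {

    /** One entry recovered from a listing line. */
    static final class Entry {
        final boolean isDirectory;
        final long length;
        final String relativePath; // relative to the source root; "" for the root itself

        Entry(boolean isDirectory, long length, String relativePath) {
            this.isDirectory = isDirectory;
            this.length = length;
            this.relativePath = relativePath;
        }
    }

    /**
     * Parses one listing line. ASSUMPTION: the line has eight whitespace-separated
     * columns in "hdfs dfs -ls -R" style, with the path last.
     * Returns null for malformed lines and for paths outside sourceRoot.
     */
    static Entry parseLine(String line, String sourceRoot) {
        // Limit of 8 keeps any spaces inside the path in the last column.
        String[] cols = line.trim().split("\\s+", 8);
        if (cols.length < 8) {
            return null;
        }
        long length;
        try {
            length = Long.parseLong(cols[4]);
        } catch (NumberFormatException e) {
            return null;
        }
        String relative = relativize(cols[7], sourceRoot);
        if (relative == null) {
            return null;
        }
        return new Entry(cols[0].startsWith("d"), length, relative);
    }

    /** Returns path relative to sourceRoot, or null if path is not under it. */
    static String relativize(String path, String sourceRoot) {
        // Drop a trailing slash from the root, except when the root is "/".
        String root = sourceRoot.length() > 1 && sourceRoot.endsWith("/")
                ? sourceRoot.substring(0, sourceRoot.length() - 1)
                : sourceRoot;
        if (path.equals(root)) {
            return "";
        }
        String prefix = root.equals("/") ? "/" : root + "/";
        return path.startsWith(prefix) ? path.substring(prefix.length()) : null;
    }

    /** Parses all lines, silently skipping the ones parseLine rejects. */
    static List<Entry> parse(List<String> lines, String sourceRoot) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            Entry entry = parseLine(line, sourceRoot);
            if (entry != null) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed format.
        List<String> lines = Arrays.asList(
                "drwxr-xr-x   - zshao supergroup          0 2017-02-28 10:00 /src/a",
                "-rw-r--r--   3 zshao supergroup       1024 2017-02-28 10:00 /src/a/b.txt");
        for (Entry e : parse(lines, "/src")) {
            System.out.println((e.isDirectory ? "dir  " : "file ") + e.relativePath);
        }
    }
}
```

Keeping the path relative to the source root is what separates this approach from "-f": each entry can be mapped to the same relative location under the target instead of being flattened into one directory.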

      Attachments

        1. HADOOP-14137.branch26.1.patch
          21 kB
          Zheng Shao
        2. HADOOP-14137.branch26.2.patch
          22 kB
          Zheng Shao
        3. HADOOP-14137.branch26.3.patch
          20 kB
          Zheng Shao

          People

            Assignee: Zheng Shao (zshao)
            Reporter: Zheng Shao (zshao)
            Votes: 0
            Watchers: 9
