Hadoop Common / HADOOP-16872

Performance improvement when using distcp with the -direct option on files in large directories

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0, 3.2.1
    • Fix Version/s: None
    • Component/s: tools/distcp
    • Labels:
      None
    • Flags:
      Patch

      Description

      We use distcp with the -direct option to copy a single file between two large directories, and found that the copy takes a few minutes. If too many distcp jobs are launched at the same time, NameNode performance degrades seriously.

      hadoop distcp -direct -skipcrccheck -update -prbugaxt -i -numListstatusThreads 1 hdfs://cluster1:8020/source/100.log hdfs://cluster2:8020/target/100.jpg

                   Dir path                        Count
      Source dir   hdfs://cluster1:8020/source/    100k+ files
      Target dir   hdfs://cluster2:8020/target/    100k+ files

       

      In CopyCommitter.java, the function deleteAttemptTempFiles() contains the call

      targetFS.globStatus(new Path(targetWorkPath, ".distcp.tmp." + jobId.replaceAll("job","attempt") + "*"));

      This glob has to scan the target work path for matching temp files, which wastes a lot of time when distcp runs between two large directories. With the -direct option, distcp writes directly to the target file without generating a '.distcp.tmp' temp file, so there is nothing to clean up. I therefore think a check should be added before calling deleteAttemptTempFiles(): if distcp runs with the -direct option, return immediately and do nothing.
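
      The proposed check could look roughly like the following. This is a minimal, self-contained sketch and not the attached patch: the class DirectWriteGuardSketch, the TempFileLister interface (a stand-in for the targetFS.globStatus(...) call) and the directWrite parameter are hypothetical names introduced only for illustration. Only the glob pattern is taken from the code quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the proposed guard; not the real CopyCommitter. */
final class DirectWriteGuardSketch {

    /** Stand-in for the expensive targetFS.globStatus(...) listing (assumption). */
    interface TempFileLister {
        List<String> glob(String pattern);
    }

    /** Builds the temp-file glob pattern exactly as in the quoted code. */
    static String tempFilePattern(String jobId) {
        return ".distcp.tmp." + jobId.replaceAll("job", "attempt") + "*";
    }

    /**
     * Collects the attempt temp files to delete and returns how many matched.
     * With directWrite set, no temp files were ever created, so the method
     * returns before issuing the directory glob at all.
     */
    static int deleteAttemptTempFiles(boolean directWrite, String jobId,
                                      TempFileLister lister, List<String> deleted) {
        if (directWrite) {
            return 0; // proposed early return: skip the glob entirely
        }
        List<String> matches = lister.glob(tempFilePattern(jobId));
        deleted.addAll(matches); // a real committer would delete each match here
        return matches.size();
    }

    public static void main(String[] args) {
        List<String> deleted = new ArrayList<>();
        // Fake lister that pretends one temp file matches the pattern.
        TempFileLister lister = pattern -> Arrays.asList(".distcp.tmp.attempt_1_0001_m_000000_0");
        System.out.println(deleteAttemptTempFiles(true, "job_1_0001", lister, deleted));
        System.out.println(deleteAttemptTempFiles(false, "job_1_0001", lister, deleted));
    }
}
```

      The point of the sketch is that the early return sits before the glob, so a job run with -direct never triggers a listing of the large target directory during cleanup.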

       

        Attachments

        1. HADOOP-16872.001.patch
          1 kB
          liuxiaolong
        2. optimise before.png
          314 kB
          liuxiaolong
        3. optimise after.png
          321 kB
          liuxiaolong

            People

            • Assignee: Unassigned
            • Reporter: liuxiaolong
            • Votes: 0
            • Watchers: 6
