  1. Hadoop Common
  2. HADOOP-12046

Avoid creating "._COPYING_" temporary file when copying file to Swift file system

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.7.0
    • Fix Version/s: None
    • Component/s: fs/swift
    • Labels: None

    Description

      When copying a file from HDFS or the local file system to another file system implementation, CommandWithDestination.java creates a temporary file by appending the suffix "._COPYING_" to the target name. Once the file has been copied successfully, the suffix is removed with rename().

      try {
        PathData tempTarget = target.suffix("._COPYING_");
        targetFs.setWriteChecksum(writeChecksum);
        targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
        targetFs.rename(tempTarget, target);
      } finally {
        targetFs.close(); // last ditch effort to ensure temp file is removed
      }

      This is not costly in HDFS. However, when copying to the Swift file system, rename() is implemented by creating a new file, which is inefficient if users copy many files to Swift. In my tests, copying a 1 GB file to Swift took 10% more time because of the rename. We should copy the data only once for the Swift file system. Changes should be limited to the Swift driver level.
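To make the cost concrete, here is a minimal, self-contained sketch of the two copy patterns against a toy in-memory store whose rename is copy plus delete. `ToyObjectStore`, `copyViaTemp`, and `copyDirect` are hypothetical names invented for illustration; this is not the Swift driver or any Hadoop API, and it only counts bytes the store has to write, which does not translate directly into wall-clock time.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a store in which rename is implemented as copy + delete. */
class CopyCostSketch {

    /** Hypothetical in-memory store; it counts the bytes it has to write. */
    static class ToyObjectStore {
        private final Map<String, byte[]> objects = new HashMap<>();
        long bytesWritten = 0;

        void put(String key, byte[] data) {
            objects.put(key, data.clone());
            bytesWritten += data.length;
        }

        /** No native rename: write a new object, then delete the old one. */
        void rename(String src, String dst) {
            byte[] data = objects.get(src);
            if (data == null) {
                throw new IllegalArgumentException("no such object: " + src);
            }
            put(dst, data);
            objects.remove(src);
        }

        boolean exists(String key) {
            return objects.containsKey(key);
        }
    }

    /** Pattern from the description: write to a temporary name, then rename. */
    static void copyViaTemp(ToyObjectStore store, String target, byte[] data) {
        String temp = target + "._COPYING_";
        store.put(temp, data);
        store.rename(temp, target);
    }

    /** Proposed pattern: a single write to the final name. */
    static void copyDirect(ToyObjectStore store, String target, byte[] data) {
        store.put(target, data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];

        ToyObjectStore viaTemp = new ToyObjectStore();
        copyViaTemp(viaTemp, "dir/file", data);

        ToyObjectStore direct = new ToyObjectStore();
        copyDirect(direct, "dir/file", data);

        System.out.println("temp + rename wrote " + viaTemp.bytesWritten + " bytes");
        System.out.println("direct write wrote " + direct.bytesWritten + " bytes");
    }
}
```

In this model the temp-and-rename path makes the store write the data twice, while the direct path writes it once. The measured overhead reported above is smaller than that ratio suggests, presumably because the second write in a real store does not go through the client in the same way.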

          Activity

            stevel@apache.org Steve Loughran added a comment -

            The marker is there because copy operations are not atomic in a normal filestore: the partly copied file is visible. In contrast, in an object store the copy is atomic: the new entry is only visible after the PUT completes. What we can't easily do today is distinguish an object store from a real FS, so that code upstream (CLI, distcp, output committers) can act differently.

            If you look at HADOOP-9565, we've been discussing offloading the copy operation to the FS implementation itself, as some object stores (S3) implement COPY internally and so can do a copy without the client having to do the full process, or that rename(), which is very expensive.
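As an illustration of what "acting differently" could look like if upstream code were able to ask the destination about its write semantics, here is a rough sketch. The `Destination` interface and its `writesAreAtomic()` probe are hypothetical and invented for this example; as the comment above notes, no such distinction was easily available at the time, and this is not the design from HADOOP-9565.

```java
import java.util.ArrayList;
import java.util.List;

class AtomicPutSketch {

    /** Hypothetical destination abstraction; not a Hadoop interface. */
    interface Destination {
        /** Hypothetical probe: true if a write only becomes visible once it completes. */
        boolean writesAreAtomic();

        void write(String path, byte[] data);

        void rename(String src, String dst);
    }

    /** Records the operations issued so the chosen strategy can be inspected. */
    static class RecordingDestination implements Destination {
        final boolean atomic;
        final List<String> ops = new ArrayList<>();

        RecordingDestination(boolean atomic) {
            this.atomic = atomic;
        }

        @Override
        public boolean writesAreAtomic() {
            return atomic;
        }

        @Override
        public void write(String path, byte[] data) {
            ops.add("write " + path);
        }

        @Override
        public void rename(String src, String dst) {
            ops.add("rename " + src + " -> " + dst);
        }
    }

    /** Skips the "._COPYING_" marker only when partial writes are never visible. */
    static void copy(Destination dest, String target, byte[] data) {
        if (dest.writesAreAtomic()) {
            dest.write(target, data);
        } else {
            String temp = target + "._COPYING_";
            dest.write(temp, data);
            dest.rename(temp, target);
        }
    }

    public static void main(String[] args) {
        RecordingDestination objectStore = new RecordingDestination(true);
        copy(objectStore, "dir/file", new byte[16]);
        System.out.println("object store: " + objectStore.ops);

        RecordingDestination fileStore = new RecordingDestination(false);
        copy(fileStore, "dir/file", new byte[16]);
        System.out.println("file store:   " + fileStore.ops);
    }
}
```

The point of the sketch is only that the decision lives in one place upstream; where such a capability flag should actually be exposed is exactly the open question in the discussion.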

            airbots Chen He added a comment -

            Thank you for the quick reply, steve_l. I will read HADOOP-9565; it sounds interesting.

            airbots Chen He added a comment -

            Attached the file copy process that occurs when a user tries to copy a file (larger than 5 GB) from HDFS to Swift using the current Swift driver.

            stevel@apache.org Steve Loughran added a comment -

            HADOOP-15281 covers this with the proposal for a no-rename distcp.

            noslowerdna Andrew Olson added a comment -

            A patch is now available for HADOOP-15281. Could someone review it?

            noslowerdna Andrew Olson added a comment -

            HADOOP-15281 has been completed. Resolving this as a duplicate.


            People

              Assignee: airbots Chen He
              Reporter: airbots Chen He
              Votes: 0
              Watchers: 5