Details
- New Feature
- Status: Resolved
- Major
- Resolution: Duplicate
- 2.7.0
- None
- None
Description
When copying a file from HDFS or the local file system to another file system implementation, CommandWithDestination.java creates a temporary file by appending the suffix "._COPYING_" to the target name. Once the file has been copied successfully, the suffix is removed with rename().
try {
  PathData tempTarget = target.suffix("._COPYING_");
  targetFs.setWriteChecksum(writeChecksum);
  targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
  targetFs.rename(tempTarget, target);
} finally {
  targetFs.close(); // last ditch effort to ensure temp file is removed
}
This is not costly in HDFS. In the Swift file system, however, rename creates a new file, so the data is effectively copied twice. That is inefficient when users copy many files to Swift: in my tests, copying a 1 GB file to Swift took 10% more time. We should copy the data only once for the Swift file system, and the changes should be limited to the Swift driver level.
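The cost difference described above can be illustrated with a small, self-contained sketch. It does not use the Hadoop or Swift APIs: `ToyObjectStore`, `copyWithTempFile` and `copyDirect` are hypothetical names, and the store simply assumes, as the description states for Swift, that rename creates a new object. It counts bytes written on the store side only.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the temp-file-plus-rename copy versus a direct write. */
public class RenameCostSketch {

    /** Hypothetical in-memory store; NOT the Hadoop FileSystem or Swift API. */
    static class ToyObjectStore {
        final Map<String, byte[]> objects = new HashMap<>();
        long bytesWritten = 0;

        /** Store an object and account for the bytes written. */
        void put(String key, byte[] data) {
            objects.put(key, data.clone());
            bytesWritten += data.length;
        }

        /** Assumption: rename is copy-to-new-object followed by delete. */
        void rename(String src, String dst) {
            byte[] data = objects.get(src);
            if (data == null) {
                throw new IllegalArgumentException("no such object: " + src);
            }
            put(dst, data);
            objects.remove(src);
        }
    }

    /** Mirrors the shell copy: write to target._COPYING_, then rename. */
    static void copyWithTempFile(ToyObjectStore store, String target, byte[] data) {
        String temp = target + "._COPYING_";
        store.put(temp, data);
        store.rename(temp, target);
    }

    /** Proposed behaviour for such a store: write the target once. */
    static void copyDirect(ToyObjectStore store, String target, byte[] data) {
        store.put(target, data);
    }

    public static void main(String[] args) {
        byte[] payload = new byte[1024];

        ToyObjectStore viaTemp = new ToyObjectStore();
        copyWithTempFile(viaTemp, "dir/file", payload);

        ToyObjectStore direct = new ToyObjectStore();
        copyDirect(direct, "dir/file", payload);

        System.out.println("temp + rename wrote " + viaTemp.bytesWritten + " bytes");
        System.out.println("direct write wrote " + direct.bytesWritten + " bytes");
    }
}
```

In this toy model the temp-file path writes every byte twice. The real overhead depends on how the store implements the copy, which is why the measured cost above is much smaller than 2x.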
Attachments
Issue Links
- depends upon
  - HADOOP-12057 swiftfs rename on partitioned file attempts to consolidate partitions (Patch Available)
- is related to
  - HADOOP-12038 SwiftNativeOutputStream should check whether a file exists or not before deleting (Patch Available)
- is superseded by
  - HADOOP-15281 Distcp to add no-rename copy option (Resolved)
- relates to
  - HADOOP-12109 Distcp of file > 5GB to swift fails with HTTP 413 error (Open)
  - HDFS-8673 HDFS reports file already exists if there is a file/dir name end with ._COPYING_ (Patch Available)
  - HADOOP-9565 Add a Blobstore interface to add to blobstore FileSystems (Patch Available)
The marker is there because copy operations are not atomic in a normal filestore: the partly copied file is visible while the copy is in progress. In an object store, by contrast, the copy is atomic: the new entry only becomes visible after the PUT completes. What we can't do today is easily distinguish an object store from a real FS, so that code upstream (CLI, distcp, output committers) can act differently.
If you look at HADOOP-9565, we've been discussing offloading the copy operation to the FS implementation itself. Some object stores (S3) implement COPY internally, so they can do a copy without the client having to do the full process itself, or that rename(), which is very expensive.
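As a rough illustration of what upstream code could do if it were able to tell the two kinds of store apart, here is a minimal sketch. Everything in it is hypothetical: `SimpleFs`, the `AtomicPutStore` marker interface, `MapFs` and `MapObjectStore` are made-up stand-ins, not the Blobstore interface proposed in HADOOP-9565 and not any existing Hadoop API.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: choose the copy strategy based on a capability marker. */
public class CopyDispatchSketch {

    /** Minimal stand-in for a filesystem; NOT org.apache.hadoop.fs.FileSystem. */
    interface SimpleFs {
        void write(String path, byte[] data);
        void rename(String src, String dst);
    }

    /** Hypothetical marker: a write only becomes visible once the PUT completes. */
    interface AtomicPutStore extends SimpleFs { }

    /** In-memory filesystem that counts renames. */
    static class MapFs implements SimpleFs {
        final Map<String, byte[]> files = new HashMap<>();
        int renames = 0;

        public void write(String path, byte[] data) {
            files.put(path, data.clone());
        }

        public void rename(String src, String dst) {
            byte[] data = files.remove(src);
            if (data == null) {
                throw new IllegalArgumentException("no such file: " + src);
            }
            files.put(dst, data);
            renames++;
        }
    }

    /** Same storage, but advertises atomic PUT semantics. */
    static class MapObjectStore extends MapFs implements AtomicPutStore { }

    /** Upstream copy logic: skip the ._COPYING_ marker when the PUT is atomic. */
    static void copy(SimpleFs fs, String target, byte[] data) {
        if (fs instanceof AtomicPutStore) {
            fs.write(target, data);
        } else {
            String temp = target + "._COPYING_";
            fs.write(temp, data);
            fs.rename(temp, target);
        }
    }

    public static void main(String[] args) {
        MapFs filesystem = new MapFs();
        MapObjectStore objectStore = new MapObjectStore();
        byte[] payload = {1, 2, 3};

        copy(filesystem, "out/file", payload);
        copy(objectStore, "out/file", payload);

        System.out.println("filesystem renames: " + filesystem.renames);
        System.out.println("object store renames: " + objectStore.renames);
    }
}
```

A marker interface is only one possible way to expose the capability. It keeps the decision in the calling code, whereas offloading the whole copy to the FS implementation would move it into the driver.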