Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Version: 3.3.1
The S3A filesystem's createFile() operation supports an option to disable all safety checks when creating a file. Consult the documentation and use with care.
Description
Magic committer tasks can be slow because every file created with overwrite=false triggers a HEAD (to verify there's no file) and a LIST (to verify there's no dir). And because files under a magic path are only manifested later, the check may not behave as expected.
ParquetOutputFormat is one example of a library which does this.
We could fix Parquet to use overwrite=true, but (a) there may be surprises in other uses, (b) it'd still leave the LIST, and (c) it would do nothing for other formats' calls.
Proposed: createFile() under a magic path to skip all probes for a file/dir at the end of the path.
Only a single task attempt will be writing to that directory and it should know what it is doing. If there are conflicting file names and parts across tasks, that won't even get picked up at this point. Oh, and none of the committers ever check for this: you'll get the last file manifested (s3a) or renamed (file).
If we skip the checks we will save 2 HTTP requests/file.
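To make the saving concrete, here is a toy model of the probe logic described above. It is only a sketch, not the S3A implementation: the class, its methods, the `/__magic/` path marker and the sample paths are all assumptions made for illustration. It counts one HEAD per create with overwrite=false and one LIST per create regardless of overwrite, unless the path is under a magic directory and the skip is enabled.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of pre-create probes; hypothetical, not the S3A code. */
class CreateProbeModel {
    /** Assumed marker for paths under a magic committer directory. */
    static final String MAGIC = "/__magic/";

    private final Set<String> files = new HashSet<>();
    int headRequests = 0;
    int listRequests = 0;

    /** Simulates a create call; skipUnderMagic models the proposed behaviour. */
    void create(String path, boolean overwrite, boolean skipUnderMagic) {
        boolean skip = skipUnderMagic && path.contains(MAGIC);
        if (!skip) {
            if (!overwrite) {
                headRequests++; // HEAD: is there already a file at this path?
                if (files.contains(path)) {
                    throw new IllegalStateException("File exists: " + path);
                }
            }
            listRequests++; // LIST: is there a directory at this path?
        }
        files.add(path);
    }

    int totalProbes() {
        return headRequests + listRequests;
    }

    /** Creates fileCount files under a magic path and returns the probes issued. */
    static int probesFor(int fileCount, boolean skipUnderMagic) {
        CreateProbeModel model = new CreateProbeModel();
        for (int i = 0; i < fileCount; i++) {
            model.create("/out/__magic/job/task/part-" + i, false, skipUnderMagic);
        }
        return model.totalProbes();
    }

    public static void main(String[] args) {
        System.out.println("probes, checks on:  " + probesFor(1000, false)); // 2000
        System.out.println("probes, checks off: " + probesFor(1000, true));  // 0
    }
}
```

In this model a task writing 1000 files issues 2000 probes with the checks on and none with them skipped, which matches the 2 requests/file estimate. Switching to overwrite=true alone would still leave the 1000 LIST calls.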
Issue Links
- blocks
  - HADOOP-18281 Tune S3A storage class support (Open)
- breaks
  - HADOOP-18402 S3A committer NPE in spark job abort (Resolved)
- causes
  - HADOOP-18757 S3A Committer only finalizes the commits in a single thread (Resolved)
- contains
  - HADOOP-15460 S3A FS to add "fs.s3a.create.performance" to the builder file creation option set (Resolved)
- depends upon
  - MAPREDUCE-7341 Add a task-manifest output committer for Azure and GCS (Resolved)
- incorporates
  - HADOOP-16017 Add some S3A-specific create file options (Resolved)
- is related to
  - HADOOP-17584 s3a magic committer may commit more data (Resolved)
  - HADOOP-18162 hadoop-common enhancements for the Manifest Committer of MAPREDUCE-7341 (Resolved)
  - HADOOP-18568 Magic Committer optional clean up (Open)
- relates to
  - HADOOP-17935 Spark job stuck in S3A StagingCommitter::setupJob (Resolved)