Description
When appending data to a file system via the Hadoop API, it's safer to ignore user-defined output committer classes such as DirectParquetOutputCommitter, because task failures are relatively hard to handle in this case. For example, DirectParquetOutputCommitter writes directly to the output directory to boost write performance when working with S3. However, the Hadoop API offers no general way to determine the output file path of a specific task, so we don't know how to revert a failed append job. (When overwriting, we can simply remove the whole output directory.)
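The rule described above can be sketched as a small selection function. This is only an illustration under stated assumptions: `SaveMode`, `choose_committer`, and `DEFAULT_COMMITTER` are hypothetical names introduced here, not Spark or Hadoop APIs, and the sketch models the decision alone rather than any actual write path.

```python
from enum import Enum
from typing import Optional


class SaveMode(Enum):
    """Hypothetical save modes, for illustration only."""
    APPEND = "append"
    OVERWRITE = "overwrite"


# Hadoop's standard committer, which stages task output before committing it.
DEFAULT_COMMITTER = "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter"


def choose_committer(mode: SaveMode, user_committer: Optional[str] = None) -> str:
    """Return the committer class name to use for a write job.

    Assumption: on append, any user-defined committer is ignored, because
    output written directly into the destination by a failed task cannot
    be reliably located and removed afterwards.
    """
    if mode is SaveMode.APPEND:
        return DEFAULT_COMMITTER
    # On overwrite the whole output directory can be removed after a failure,
    # so a user-defined committer is acceptable.
    return user_committer or DEFAULT_COMMITTER
```

With this sketch, `choose_committer(SaveMode.APPEND, "DirectParquetOutputCommitter")` falls back to the default committer, while the same call with `SaveMode.OVERWRITE` keeps the user-defined class.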
Issue Links
- relates to: SPARK-10063 Remove DirectParquetOutputCommitter (Resolved)