SPARK-8578: Should ignore user defined output committer when appending data
Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.1, 1.5.0
    • Component/s: SQL
    • Labels: None

Description

When appending data to a file system via the Hadoop API, it is safer to ignore user-defined output committer classes such as DirectParquetOutputCommitter, because task failures are relatively hard to handle in this case. For example, DirectParquetOutputCommitter writes directly to the output directory to boost write performance when working with S3. However, the Hadoop API provides no general way to determine the output file paths written by a specific task, so we don't know how to revert a failed append job. (When overwriting, we can simply remove the whole output directory.)
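A minimal Scala sketch of the idea (an illustration, not the actual patch): when the write is an append, fall back to Hadoop's default FileOutputCommitter instead of any user-configured committer. The newOutputCommitter helper and its isAppend / userDefinedCommitterClass parameters are hypothetical names introduced here for illustration.

{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Hypothetical helper illustrating the committer selection logic.
def newOutputCommitter(
    outputPath: Path,
    context: TaskAttemptContext,
    isAppend: Boolean,
    userDefinedCommitterClass: Option[Class[_ <: OutputCommitter]]): OutputCommitter = {
  if (isAppend) {
    // Ignore any user-defined committer (e.g. DirectParquetOutputCommitter).
    // FileOutputCommitter stages task output in a temporary directory and
    // moves it into the destination only on commit, so a failed append task
    // leaves no partial files behind in the output directory.
    new FileOutputCommitter(outputPath, context)
  } else {
    userDefinedCommitterClass match {
      case Some(clazz) =>
        // Assumes the committer exposes the (Path, TaskAttemptContext)
        // constructor that FileOutputCommitter subclasses typically have.
        clazz.getConstructor(classOf[Path], classOf[TaskAttemptContext])
          .newInstance(outputPath, context)
      case None =>
        new FileOutputCommitter(outputPath, context)
    }
  }
}
{code}

With overwrite, reverting is trivial (delete the whole output directory), which is why a direct committer is only a problem for append.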

Attachments

Issue Links

Activity

People

Assignee: Yin Huai
Reporter: Cheng Lian
Votes: 0
Watchers: 4

Dates

Created:
Updated:
Resolved: