Spark / SPARK-8578

Should ignore user defined output committer when appending data


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.1, 1.5.0
    • Component/s: SQL
    • Labels:
      None

      Description

When appending data to an existing dataset through the Hadoop API, it is safer to ignore user-defined output committer classes such as DirectParquetOutputCommitter, because task failure is hard to handle correctly in this case. DirectParquetOutputCommitter, for example, writes task output directly to the final output directory to improve write performance on S3. However, the Hadoop API provides no general way to determine the output file path of a specific task, so there is no way to revert a failed append job. (When overwriting, we can simply remove the whole output directory.)
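For illustration, here is a minimal Scala sketch of the committer-selection rule described above. This is not the actual patch: the helper name newOutputCommitter, its parameters, and the assumption that a user-defined committer exposes a (Path, TaskAttemptContext) constructor are all hypothetical.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, OutputCommitter}

// Hypothetical helper: choose the output committer for a write job.
def newOutputCommitter(
    outputPath: Path,
    context: TaskAttemptContext,
    isAppend: Boolean,
    userCommitterClass: Option[Class[_ <: OutputCommitter]]): OutputCommitter = {
  if (isAppend) {
    // Appending: always fall back to the default FileOutputCommitter. It
    // writes task output to a temporary attempt directory first, so a
    // failed append leaves the existing data untouched. A direct-write
    // committer (e.g. DirectParquetOutputCommitter) has no such temporary
    // location, so a failed append could not be rolled back.
    new FileOutputCommitter(outputPath, context)
  } else {
    // Not appending: honor the user-defined committer if one is configured,
    // assuming it provides a (Path, TaskAttemptContext) constructor.
    userCommitterClass.map { cls =>
      cls.getConstructor(classOf[Path], classOf[TaskAttemptContext])
        .newInstance(outputPath, context)
    }.getOrElse(new FileOutputCommitter(outputPath, context))
  }
}
```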


People

• Assignee: Yin Huai (yhuai)
• Reporter: Cheng Lian (lian cheng)
• Shepherd: Cheng Lian
• Votes: 0
• Watchers: 4
