SPARK-8121

When used with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.1
    • Component/s: SQL
    • Labels:
      None

      Description

      When Spark is used with Hadoop 1.x (tested with 1.2.0) and spark.sql.sources.outputCommitterClass is configured, spark.sql.parquet.output.committer.class is overridden.

      For example, if spark.sql.sources.outputCommitterClass is set to FileOutputCommitter while spark.sql.parquet.output.committer.class is set to DirectParquetOutputCommitter, neither _metadata nor _common_metadata is written, because FileOutputCommitter overrides DirectParquetOutputCommitter.

      The cause is that InsertIntoHadoopFsRelation initializes the TaskAttemptContext before calling ParquetRelation2.prepareForWriteJob(), which sets up the Parquet output committer class. In Hadoop 1.x, the TaskAttemptContext constructor clones the job configuration, so the context does not share the job configuration passed to ParquetRelation2.prepareForWriteJob() and never sees the committer class set there.

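      This issue can be fixed by simply swapping these two steps, so that ParquetRelation2.prepareForWriteJob() is called before the TaskAttemptContext is initialized.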
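
      The effect of the call order can be modeled without Hadoop or Spark. The following is a minimal sketch, not the actual Spark code: Conf, TaskContext, and CommitterOrdering are hypothetical stand-ins, and the sketch assumes that the prepare step publishes the Parquet-specific committer under the generic key. It only illustrates why a context that clones the configuration misses settings applied afterwards.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a Hadoop job configuration (not the real class).
final class Conf(entries: mutable.Map[String, String] = mutable.Map.empty[String, String]) {
  def set(key: String, value: String): Unit = entries(key) = value
  def get(key: String): Option[String] = entries.get(key)
  // Returns an independent copy; later changes to this Conf do not reach it.
  def copy(): Conf = new Conf(mutable.Map.empty[String, String] ++= entries)
}

// Models the Hadoop 1.x behaviour described above: constructing the context
// snapshots the job configuration instead of sharing it.
final class TaskContext(jobConf: Conf) {
  val conf: Conf = jobConf.copy()
}

object CommitterOrdering {
  val GenericKey = "spark.sql.sources.outputCommitterClass"
  val ParquetKey = "spark.sql.parquet.output.committer.class"

  // Assumed behaviour of the prepare step: copy the Parquet-specific
  // committer setting into the generic key that the writer reads.
  def prepareForWriteJob(jobConf: Conf): Unit =
    jobConf.get(ParquetKey).foreach(jobConf.set(GenericKey, _))

  // The user settings from the reproduction snippet, with short class names.
  def userConf(): Conf = {
    val conf = new Conf()
    conf.set(GenericKey, "FileOutputCommitter")
    conf.set(ParquetKey, "DirectParquetOutputCommitter")
    conf
  }

  // Buggy order: the context is created first, so its snapshot predates prepare.
  def committerContextFirst(): Option[String] = {
    val job = userConf()
    val context = new TaskContext(job)
    prepareForWriteJob(job)
    context.conf.get(GenericKey)
  }

  // Fixed order: prepare runs first, so the snapshot includes its changes.
  def committerPrepareFirst(): Option[String] = {
    val job = userConf()
    prepareForWriteJob(job)
    val context = new TaskContext(job)
    context.conf.get(GenericKey)
  }
}
```

      In this model, committerContextFirst() still sees the user-supplied generic committer, while committerPrepareFirst() sees the Parquet-specific one.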

      Here is a Spark shell snippet for reproducing this issue:

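      import sqlContext._
      
      // Generic data source committer; this is the one that ends up being used.
      sc.hadoopConfiguration.set(
        "spark.sql.sources.outputCommitterClass",
        "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
      
      // Parquet-specific committer; this is the setting that gets overridden.
      sc.hadoopConfiguration.set(
        "spark.sql.parquet.output.committer.class",
        "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
      
      // Write a one-row table; summary files are expected next to the part files.
      range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")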
      

      Then check /tmp/foo. The Parquet summary files _metadata and _common_metadata are missing:

      /tmp/foo
      ├── _SUCCESS
      ├── part-r-00001.gz.parquet
      ├── part-r-00002.gz.parquet
      ├── part-r-00003.gz.parquet
      ├── part-r-00004.gz.parquet
      ├── part-r-00005.gz.parquet
      ├── part-r-00006.gz.parquet
      ├── part-r-00007.gz.parquet
      └── part-r-00008.gz.parquet
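
      The same check can be scripted instead of inspecting the directory by hand. This is a small sketch using only java.io.File; SummaryFileCheck and missingSummaryFiles are hypothetical names introduced here for illustration, not part of Spark or Parquet.

```scala
import java.io.File

object SummaryFileCheck {
  // The two Parquet summary files named in the description.
  val SummaryFiles: Seq[String] = Seq("_metadata", "_common_metadata")

  // Returns the summary files that are absent from the given output directory.
  def missingSummaryFiles(dir: File): Seq[String] = {
    // File.list() returns null if dir does not exist or is not a directory.
    val present = Option(dir.list()).map(_.toSet).getOrElse(Set.empty[String])
    SummaryFiles.filterNot(present.contains)
  }
}
```

      For the reproduction above, the call would be SummaryFileCheck.missingSummaryFiles(new File("/tmp/foo")); an empty result means both summary files were written.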
      

    People

    • Assignee: Cheng Lian
    • Reporter: Cheng Lian