SPARK-8014: DataFrame.write.mode("error").save(...) should not scan the output folder


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: SQL
    • Labels: None

      Description

      When saving a DataFrame with SaveMode.ErrorIfExists, we shouldn't perform metadata discovery when the destination folder already exists. The same applies to SaveMode.Overwrite and SaveMode.Ignore.
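      The intended behavior can be sketched as follows. This is a hypothetical illustration, not Spark's actual internals: for these three save modes the writer only needs a cheap existence check on the destination path, so scanning existing files for schema or footer metadata is unnecessary. The simplified `SaveMode` ADT and the `needsMetadataDiscovery` helper below are illustrative assumptions.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical sketch, not actual Spark code: a simplified SaveMode ADT and a
// predicate showing when metadata discovery on the destination is required.
sealed trait SaveMode
case object ErrorIfExists extends SaveMode
case object Overwrite extends SaveMode
case object Ignore extends SaveMode
case object Append extends SaveMode

def needsMetadataDiscovery(mode: SaveMode, path: String): Boolean = {
  val exists = Files.exists(Paths.get(path))
  mode match {
    // ErrorIfExists on an existing path should fail fast, before any scan.
    case ErrorIfExists if exists => sys.error(s"Path $path already exists")
    // These modes never need to inspect existing files at the destination.
    case ErrorIfExists | Overwrite | Ignore => false
    // Only Append must reconcile the new data with existing metadata.
    case Append => exists
  }
}
```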

      To reproduce this issue, create a directory /tmp/foo containing a single empty file bar, then execute the following code in the Spark shell:

      import sqlContext._
      import sqlContext.implicits._
      
      Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
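
      For completeness, the /tmp/foo directory and the empty bar file described above can be created from the same Scala session:

```scala
import java.nio.file.{Files, Paths}

// Create /tmp/foo containing a single empty file bar, as described above.
Files.createDirectories(Paths.get("/tmp/foo"))
val bar = Paths.get("/tmp/foo/bar")
if (!Files.exists(bar)) Files.createFile(bar)
```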
      

      From the exception stack trace we can see that the metadata discovery code path is executed:

      java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
              at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
              at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
              at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
              at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
              at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
              at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
              at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
              at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
              ...
      Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
              at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408)
              at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228)
              at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
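
      The "too small" error above comes from Parquet's footer check: a valid Parquet file must be at least large enough to hold the 4-byte magic PAR1, a 4-byte footer length, and the trailing magic, so a zero-byte file is rejected immediately. A rough illustration of that size-and-magic check (a sketch under these assumptions, not the actual parquet-mr code; `looksLikeParquet` is a hypothetical helper):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Sketch of the minimal validity check on a Parquet file: it must end with
// the magic bytes "PAR1" and be long enough to contain the header magic,
// a 4-byte footer length, and the footer magic.
val Magic = "PAR1".getBytes(StandardCharsets.US_ASCII)

def looksLikeParquet(path: String): Boolean = {
  val bytes = Files.readAllBytes(Paths.get(path))
  val minSize = Magic.length + 4 + Magic.length
  bytes.length >= minSize && bytes.takeRight(Magic.length).sameElements(Magic)
}
```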
      


            People

            • Assignee: Cheng Lian (lian cheng)
            • Reporter: Jianshi Huang (huangjs)
            • Votes: 0
            • Watchers: 3
