Spark / SPARK-20185

csv decompressed incorrectly with extension other than 'gz'

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      With the code below:

      val start_time = System.currentTimeMillis()
      // Read a gzip-compressed CSV whose file name does not end in ".gz"
      val gzFile = spark.read
        .format("com.databricks.spark.csv")
        .option("header", "false")
        .option("inferSchema", "false")
        .option("codec", "gzip")
        .load("/foo/someCsvFile.gz.bak")
      // Write the result out as a single Parquet file
      gzFile.repartition(1).write.mode("overwrite").parquet("/foo/")

      I got an error even though I specified the codec:

      WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
      17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage 12.0 (TID 977)
      java.lang.NullPointerException
      at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109)
      at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
      at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
      at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)

      I had to subclass GzipCodec with a matching default extension to make my code run:

      import org.apache.hadoop.io.compress.GzipCodec

      // GzipCodec subclass whose default extension matches the ".gz.bak" suffix
      class BakGzipCodec extends GzipCodec {
        override def getDefaultExtension(): String = ".gz.bak"
      }
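
For context, Hadoop's CompressionCodecFactory selects a codec by matching the file name suffix against each registered codec's default extension, which is why a subclass with a different extension helps. The subclass also has to be registered, for example through the io.compression.codecs property. Below is a minimal sketch of that lookup, assuming hadoop-common is on the classpath and the codec is compiled as a regular class rather than defined in a REPL. CodecLookupCheck and codecFor are hypothetical names used only for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, GzipCodec}

// Same subclass as in the report: gzip behaviour, ".gz.bak" suffix.
class BakGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.bak"
}

// Hypothetical helper object, not part of Hadoop or Spark.
object CodecLookupCheck {
  // Returns the class name of the codec Hadoop would pick for this path, if any.
  def codecFor(path: String): Option[String] = {
    val conf = new Configuration()
    // Register the custom codec so the factory knows about its suffix.
    conf.set("io.compression.codecs", classOf[BakGzipCodec].getName)
    val factory = new CompressionCodecFactory(conf)
    // getCodec returns null when no registered suffix matches the file name.
    Option(factory.getCodec(new Path(path))).map(_.getClass.getName)
  }

  def main(args: Array[String]): Unit =
    println(codecFor("/foo/someCsvFile.gz.bak").getOrElse("no codec matched"))
}
```

In a Spark job the same property could presumably be set on spark.sparkContext.hadoopConfiguration before the read, or passed as spark.hadoop.io.compression.codecs. Either way this assumes the class is available on both the driver and the executors.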

      I suggest that the file loader determine the codec from the option first, and fall back to the file extension only when no codec is specified.
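
The proposed lookup order can be sketched in plain Scala. This is a hypothetical illustration of "option first, then extension" and not Spark's actual implementation. CodecResolution, resolve, and the small extension table are assumptions made for the example. The suffixes listed are the default extensions of Hadoop's GzipCodec, BZip2Codec, and DefaultCodec.

```scala
// Hypothetical sketch of the proposed resolution order; not Spark code.
object CodecResolution {
  // Illustrative suffix table (default extensions of common Hadoop codecs).
  private val byExtension: Seq[(String, String)] = Seq(
    ".gz"      -> "gzip",
    ".bz2"     -> "bzip2",
    ".deflate" -> "deflate"
  )

  // Prefer an explicit "codec" option; otherwise infer from the file suffix.
  def resolve(path: String, options: Map[String, String]): Option[String] = {
    val fromOption =
      options.get("codec").map(_.trim.toLowerCase).filter(_.nonEmpty)
    fromOption.orElse {
      val lowerPath = path.toLowerCase
      byExtension.collectFirst {
        case (ext, name) if lowerPath.endsWith(ext) => name
      }
    }
  }
}
```

With this sketch, resolve("/foo/someCsvFile.gz.bak", Map("codec" -> "gzip")) yields Some("gzip"), while the same path with no option yields None because ".gz.bak" matches no known suffix.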

          People

            Assignee: Unassigned
            Reporter: Ran Mingxuan (ranmx)
            Votes: 0
            Watchers: 1

              Time Tracking

                Original Estimate: 168h
                Remaining Estimate: 168h
                Time Spent: Not Specified