[SPARK-20185] CSV decompressed incorrectly with extension other than 'gz' - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
Fix Version/s: None
Component/s: Input/Output
Labels: None

Description

With the code below:
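// Read a gzip-compressed CSV file whose name does not end in ".gz".
val start_time = System.currentTimeMillis()
val gzFile = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .option("codec", "gzip") // intended to force gzip decompression on read
  .load("/foo/someCsvFile.gz.bak")
// Write the result back out as Parquet from a single partition.
gzFile.repartition(1).write.mode("overwrite").parquet("/foo/")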

I get the following error even though I specified the codec:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage 12.0 (TID 977)
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)

As a workaround, I had to subclass GzipCodec with the custom extension to make my code run:

import org.apache.hadoop.io.compress.GzipCodec

// GzipCodec variant whose default extension matches files ending in ".gz.bak".
class BakGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.bak"
}
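
A subclass like this only helps if Hadoop's codec lookup knows about it. The sketch below shows one way that could look, assuming the class is registered through the Hadoop `io.compression.codecs` property and resolved with `CompressionCodecFactory`. The `BakCodecCheck` object and its `codecClassFor` method are hypothetical names used for illustration; the report does not say how the class was actually registered.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, GzipCodec}

// Same workaround class as above, repeated so the sketch is self-contained.
class BakGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.bak"
}

// Hypothetical helper object; the name is illustrative only.
object BakCodecCheck {
  /** Class name of the codec Hadoop selects for `path`, if any. */
  def codecClassFor(path: String): Option[String] = {
    val conf = new Configuration()
    // Assumption: the custom codec is registered via this Hadoop property.
    conf.set("io.compression.codecs", classOf[BakGzipCodec].getName)
    val factory = new CompressionCodecFactory(conf)
    // getCodec returns null when no registered extension matches the file name.
    Option(factory.getCodec(new Path(path))).map(_.getClass.getName)
  }

  def main(args: Array[String]): Unit =
    println(codecClassFor("/foo/someCsvFile.gz.bak"))
}
```

In a Spark job the same property would presumably be set on `spark.sparkContext.hadoopConfiguration` before calling `load`, with the class available on the executor classpath. That part is an assumption and was not verified against the affected versions.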

I think the file loader should choose the codec from the option first, and fall back to the file extension only when no codec option is given.
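
The precedence proposed here (explicit option first, file extension second) can be sketched as follows. This is a minimal illustration, not Spark's or Hadoop's actual lookup code: `CodecResolver`, `resolveCodec`, and the small extension table are hypothetical and exist only to show the intended order of checks.

```scala
// Hypothetical sketch of "option first, extension second"; not Spark or Hadoop code.
object CodecResolver {
  // Illustrative extension table; a real codec registry would be larger.
  private val byExtension: Map[String, String] = Map(
    ".gz"  -> "gzip",
    ".bz2" -> "bzip2"
  )

  /** An explicit codec option wins; otherwise fall back to the file name suffix. */
  def resolveCodec(path: String, codecOption: Option[String]): Option[String] =
    codecOption
      .map(_.toLowerCase)
      .orElse(byExtension.collectFirst { case (ext, name) if path.endsWith(ext) => name })

  def main(args: Array[String]): Unit = {
    // The option decides, even though ".gz.bak" matches no known suffix.
    println(resolveCodec("/foo/someCsvFile.gz.bak", Some("gzip")))
    // Without the option, suffix matching alone finds nothing for ".gz.bak".
    println(resolveCodec("/foo/someCsvFile.gz.bak", None))
    // A plain ".gz" file is still detected from its extension.
    println(resolveCodec("/foo/someCsvFile.gz", None))
  }
}
```

With suffix-only matching, the second call is the situation described in this report: the file is gzip data, but nothing in its name says so.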

People

Assignee: Unassigned
Reporter: Ran Mingxuan
Votes: 0
Watchers: 1

Dates

Created: 01/Apr/17 09:08
Updated: 12/Dec/22 18:10
Resolved: 04/Apr/17 11:02

Time Tracking

Estimated: 168h
Remaining: 168h
Logged: Not Specified