Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
None
None
Description
With the code below:
val start_time = System.currentTimeMillis()
val gzFile = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .option("codec", "gzip")
  .load("/foo/someCsvFile.gz.bak")
gzFile.repartition(1).write.mode("overwrite").parquet("/foo/")
I get the following error even though I specified the codec:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage 12.0 (TID 977)
java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109)
    at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
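I have to subclass GzipCodec with a matching default extension to make my code run:
import org.apache.hadoop.io.compress.GzipCodec

class BakGzipCodec extends GzipCodec {
  // Report ".gz.bak" so that files with this suffix are matched to gzip.
  override def getDefaultExtension(): String = ".gz.bak"
}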
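The report does not say how the custom codec was registered. A minimal sketch of one way to check that the subclass is picked up, assuming hadoop-common is on the classpath and that the codec is registered through the Hadoop `io.compression.codecs` property; the object name `BakCodecLookup` and the helper `matchedExtension` are hypothetical and exist only for this illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, GzipCodec}

object BakCodecLookup {
  // Same workaround as in the report: gzip with a ".gz.bak" default extension.
  class BakGzipCodec extends GzipCodec {
    override def getDefaultExtension(): String = ".gz.bak"
  }

  // Hypothetical helper: returns the default extension of the codec that
  // Hadoop matches to `path`, or None when no codec matches the file name.
  def matchedExtension(path: String, registerBak: Boolean): Option[String] = {
    val conf = new Configuration()
    if (registerBak) {
      // Assumption: the codec is registered via the io.compression.codecs property.
      conf.set("io.compression.codecs", classOf[BakGzipCodec].getName)
    }
    val factory = new CompressionCodecFactory(conf)
    // getCodec returns null when the file name ends in no known extension.
    Option(factory.getCodec(new Path(path))).map(_.getDefaultExtension)
  }
}
```

In a Spark job the same property would presumably be set on `spark.sparkContext.hadoopConfiguration` before calling `load`, with the codec class available on the executors' classpath.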
I suppose the file loader should choose the codec from the "codec" option first, and fall back to the file extension only when no codec is specified.
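The proposed lookup order can be sketched in plain Scala. This illustrates the suggestion only, not how Spark or Hadoop actually resolve codecs; the object `CodecResolution`, the `resolve` function, and the extension table are all hypothetical:

```scala
// Hypothetical sketch: none of these names come from Spark or Hadoop.
object CodecResolution {
  // Assumed extension table, for illustration only.
  private val codecByExtension: Map[String, String] = Map(
    ".gz" -> "gzip",
    ".bz2" -> "bzip2",
    ".snappy" -> "snappy"
  )

  // Prefer the explicitly requested codec; otherwise infer one from the
  // file name; return None when neither source yields a codec.
  def resolve(path: String, codecOption: Option[String]): Option[String] =
    codecOption.map(_.toLowerCase).orElse {
      codecByExtension.collectFirst {
        case (ext, codec) if path.endsWith(ext) => codec
      }
    }
}
```

With this order, `resolve("/foo/someCsvFile.gz.bak", Some("gzip"))` yields gzip even though the extension is unknown, while the same path without an option yields no codec.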