Description
Loading a malformed CSV file or dataset can cause a NullPointerException. For example, the code:
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import spark.implicits._

val schema = StructType(StructField("a", IntegerType) :: Nil)
val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
spark.read.schema(schema).csv(input).collect()
crashes with the following exception:
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
at org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
at org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
If the schema is not specified, the following exception is thrown instead (a reproducer sketch follows the trace):
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
at scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
at scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
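The schema-less case can be triggered by simply dropping the .schema(schema) call; a minimal sketch, assuming the same SparkSession and null-byte input as above:

import spark.implicits._

// Without an explicit schema, the reader infers one from the dataset.
// Per the trace above, CSVDataSource.makeSafeHeader then calls zipWithIndex
// on the tokens parsed from the malformed first line, which are null.
val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
spark.read.csv(input).collect()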