Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Cannot Reproduce
- Affects Version: 2.3.2
- Fix Version: None
- Component: None
Description
Reading a Ctrl-A delimited CSV file silently drops rows in which every column is null, while the same data written comma delimited round-trips correctly.
Reproduction in spark-shell:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val l = List(List(1, 2), List(null,null), List(2,3))
val datasetSchema = StructType(List(StructField("colA", IntegerType, true), StructField("colB", IntegerType, true)))
val rdd = sc.parallelize(l).map(item => Row.fromSeq(item.toSeq))
val df = spark.createDataFrame(rdd, datasetSchema)
df.show()
+----+----+
|colA|colB|
+----+----+
|   1|   2|
|null|null|
|   2|   3|
+----+----+
df.write.option("delimiter", "\u0001").option("header", "true").csv("/ctrl-a-separated.csv")
df.write.option("delimiter", ",").option("header", "true").csv("/comma-separated.csv")
val commaDf = spark.read.option("header", "true").option("delimiter", ",").csv("/comma-separated.csv")
commaDf.show
+----+----+
|colA|colB|
+----+----+
|   1|   2|
|   2|   3|
|null|null|
+----+----+
val ctrlaDf = spark.read.option("header", "true").option("delimiter", "\u0001").csv("/ctrl-a-separated.csv")
ctrlaDf.show
+----+----+
|colA|colB|
+----+----+
|   1|   2|
|   2|   3|
+----+----+
As seen above, for the Ctrl-A delimited CSV the row containing only null values is dropped on read, whereas the comma delimited CSV preserves it.
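A possible workaround (a sketch, not verified against this exact Spark version) is to use the documented `nullValue` CSV option on both write and read, so that an all-null row never serializes to a line consisting only of the delimiter; the `"\\N"` token and the output path are arbitrary choices for illustration:

```scala
// Sketch: write nulls as an explicit token so an all-null row is not
// serialized as a bare delimiter character, then map the token back to
// null on read.
df.write
  .option("delimiter", "\u0001")
  .option("header", "true")
  .option("nullValue", "\\N")   // null -> "\N" on disk (token is arbitrary)
  .csv("/ctrl-a-separated-workaround.csv")

val roundTripped = spark.read
  .option("header", "true")
  .option("delimiter", "\u0001")
  .option("nullValue", "\\N")   // "\N" on disk -> null
  .csv("/ctrl-a-separated-workaround.csv")
roundTripped.show()
```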