Description
The following Spark shell snippet under Spark 2.1 reproduces this issue:
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to the file system.
val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as a persisted table.
data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--------------------------+---+
// |c  |a                         |b  |
// +---+--------------------------+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  |   <-- This row should not be here
// |1  |p1                        |1  |
// |2  |p2                        |2  |
// +---+--------------------------+---+
Hive-style partitioned tables use the magic string __HIVE_DEFAULT_PARTITION__ to represent NULL partition values in partition directory names. However, when reading back a persisted partitioned table, this magic string is not interpreted as NULL but is treated as a regular string value.
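For context, the magic string is visible directly in the partition directory names that the file-system write produces. The following minimal sketch (an illustration, not part of the original report; it assumes the parquet write above has already run against /tmp/partitioned on the local file system) lists the first-level partition directories. The null value of column a shows up as the a=__HIVE_DEFAULT_PARTITION__ directory:

import java.io.File

// List the first-level partition directories under the output path.
// Assumes the parquet(path) write above has completed and the path
// exists locally (File.listFiles() returns null otherwise).
new File("/tmp/partitioned")
  .listFiles()
  .filter(_.isDirectory)
  .map(_.getName)
  .sorted
  .foreach(println)

// Expected listing, one directory per distinct value of "a":
// a=__HIVE_DEFAULT_PARTITION__
// a=p1
// a=p2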