Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 2.0.0
Description
When parquet files are persisted using partitions, Spark's `readStream` returns data with all `null`s for the partitioned columns.
For example:
```scala
case class A(id: Int, value: Int)

val data = spark.createDataset(Seq(
  A(1, 1),
  A(2, 2),
  A(2, 3)
))

val url = "/mnt/databricks/test"
data.write.partitionBy("id").parquet(url)
```
When the data is read as a stream:

```scala
spark.readStream.schema(spark.read.load(url).schema).parquet(url)
```
it returns:

```
id,   value
null, 1
null, 2
null, 3
```
A possible reason is that `readStream` reads the parquet files directly, but when data is written with `partitionBy`, the partition columns are encoded in the directory structure rather than stored in the files themselves. In the given example, the parquet files contain only the `value` column, since `id` is a partition column.
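To illustrate, the on-disk layout produced by `partitionBy("id")` would look roughly like the sketch below (the exact part-file names will differ):

```
/mnt/databricks/test/
├── id=1/
│   └── part-00000-<uuid>.parquet   <- contains only the `value` column
└── id=2/
    └── part-00001-<uuid>.parquet   <- contains only the `value` column
```

The batch reader (`spark.read.parquet(url)`) recovers `id` from the `id=...` directory names via partition discovery, which is why only the streaming path produced `null`s.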