Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Patch
Description
The Parquet writer only checks the row count and the page size to decide whether the content being written still fits in a single page.
For a composite column (e.g. array/map) containing many nulls, it is possible to create more than 2 billion values while staying under the default page-size and row-count thresholds (1 MB, 20,000 rows), overflowing a signed 32-bit value count.
Repro using Spark:
    val dir = "/tmp/anyrandomDirectory"
    spark.range(0, 20000, 1, 1)
      .selectExpr("array_repeat(cast(null as binary), 110000) as n")
      .write
      .mode("overwrite")
      .save(dir)

    val result = spark
      .sql(s"select * from parquet.`$dir` limit 1000")
      .collect() // This will break
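A quick sanity check on the numbers in the repro above (a sketch, assuming the default thresholds stated in the description: 20,000 rows and 1 MB per page). Each null array element still counts as one value, a repetition/definition level pair, while contributing essentially no data bytes, so presumably neither threshold triggers a page flush before the value count exceeds Int.MaxValue:

    // Back-of-the-envelope check for the repro above.
    // Assumed defaults (from the description): the writer starts a new
    // page after 20,000 rows or 1 MB of data, whichever comes first.
    val rows = 20000L            // spark.range(0, 20000, 1, 1)
    val elementsPerRow = 110000L // array_repeat(..., 110000)

    // Null elements carry no data bytes, so the 1 MB size check never
    // fires, and 20,000 rows stays under the row-count check.
    val totalValues = rows * elementsPerRow

    println(s"total values in the page: $totalValues")   // 2,200,000,000
    println(s"Int.MaxValue:             ${Int.MaxValue}") // 2,147,483,647
    println(s"overflows 32-bit count:   ${totalValues > Int.MaxValue}")

So the single page accumulates 20,000 × 110,000 = 2.2 billion values, past the 2,147,483,647 limit of a signed 32-bit integer, which is what breaks the read back.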