Description
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source.
We are working on a project to create a Parquet store for a big pre-joined table between two tables that have a one-to-many relationship, and this is a blocking issue for us.
The following code illustrates the issue.
Part 1: loading some nested data
val jsonStr = """{ "items": [ {"itemId": 1, "itemData": "a"}, {"itemId": 2, "itemData": "b"} ] }"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// pruned, only loading itemId
// ReadSchema: struct<items:array<struct<itemId:bigint>>>
read.select($"items.itemId").explain(true)

// not pruned, loading both itemId and itemData
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
read.select(explode($"items.itemId")).explain(true)
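A minimal, self-contained sketch of the repro above, assuming a local SparkSession (the object and method names here are hypothetical, introduced only to make the snippet runnable end to end). It also tries a two-step select that projects items.itemId to a top-level column before explode(); whether that form actually gets pruned depends on the optimizer version, so the explain() output is the thing to inspect:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical wrapper object, only for making the repro self-contained.
object ExplodePruningSketch {
  def collectIds(): Array[Long] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explode-pruning-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same nested data as in the issue description.
    val jsonStr = """{ "items": [ {"itemId": 1, "itemData": "a"}, {"itemId": 2, "itemData": "b"} ] }"""
    val df = spark.read.json(Seq(jsonStr).toDS)

    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

    // Two-step form: project the nested field to a top-level array column first,
    // then explode it. Check the printed plan's ReadSchema to see whether
    // itemData is still being read; this is a sketch, not a confirmed workaround.
    val ids = df.select($"items.itemId".as("ids")).select(explode($"ids").as("itemId"))
    ids.explain(true)

    val out = ids.collect().map(_.getLong(0))
    spark.stop()
    out
  }

  def main(args: Array[String]): Unit =
    println(collectIds().mkString(","))
}
```

Regardless of pruning, the exploded result itself is the two itemId values, so the query's correctness is unaffected; only the amount of data read from the source differs.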
Issue Links
- causes: SPARK-30761 Nested pruning should not prune on required child outputs in Generate (Closed)