Description
This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source.
We are working on a project to create a Parquet store for a big pre-joined table between two tables that have a one-to-many relationship, and this is a blocking issue for us.
The following code illustrates the issue.
Part 1: loading some nested data
val jsonStr = """{ "items": [ {"itemId": 1, "itemData": "a"}, {"itemId": 2, "itemData": "b"} ] }"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// pruned, only loading itemId
// ReadSchema: struct<items:array<struct<itemId:bigint>>>
read.select($"items.itemId").explain(true)

// not pruned, loading both itemId and itemData
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
read.select(explode($"items.itemId")).explain(true)
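A minimal, self-contained sketch of the repro above, assuming a local SparkSession (the object and method names here are hypothetical, introduced only to make the snippet runnable end to end). It also tries a two-step select that projects items.itemId to a top-level column before explode(); whether that form actually gets pruned depends on the optimizer version, so the explain() output is the thing to inspect:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical wrapper object, only for making the repro self-contained.
object ExplodePruningSketch {
  def collectIds(): Array[Long] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explode-pruning-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same nested data as in the issue description.
    val jsonStr = """{ "items": [ {"itemId": 1, "itemData": "a"}, {"itemId": 2, "itemData": "b"} ] }"""
    val df = spark.read.json(Seq(jsonStr).toDS)

    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

    // Two-step form: project the nested field to a top-level array column first,
    // then explode it. Check the printed plan's ReadSchema to see whether
    // itemData is still being read; this is a sketch, not a confirmed workaround.
    val ids = df.select($"items.itemId".as("ids")).select(explode($"ids").as("itemId"))
    ids.explain(true)

    val out = ids.collect().map(_.getLong(0))
    spark.stop()
    out
  }

  def main(args: Array[String]): Unit =
    println(collectIds().mkString(","))
}
```

Regardless of pruning, the exploded result itself is the two itemId values, so the query's correctness is unaffected; only the amount of data read from the source differs.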
Issue Links
- causes: SPARK-30761 Nested pruning should not prune on required child outputs in Generate (Closed)