Spark / SPARK-29721

Spark SQL reads unnecessary nested fields after using explode

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

      Description

      This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source.

      We are working on a project to create a Parquet store for a large pre-joined table built from two tables that have a one-to-many relationship, and this is a blocking issue for us.

       

      The following code illustrates the issue. 

      Part 1: loading some nested data

      val jsonStr = """{
       "items": [
         {"itemId": 1, "itemData": "a"},
         {"itemId": 2, "itemData": "b"}
       ]
      }"""
      val df = spark.read.json(Seq(jsonStr).toDS)
      df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
      

       
      Part 2: reading it back and explaining the queries

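      // explode comes from functions; import it explicitly when running outside spark-shell
      import org.apache.spark.sql.functions.explode

      val read = spark.table("persisted")
      spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)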
      
      // pruned, only loading itemId
      // ReadSchema: struct<items:array<struct<itemId:bigint>>>
      read.select($"items.itemId").explain(true) 
      
      // not pruned, loading both itemId and itemData
      // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
      read.select(explode($"items.itemId")).explain(true)
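The same comparison can also be made programmatically instead of reading the explain output by eye. The following is a minimal sketch, not part of the original report: the object name `ReadSchemaCheck`, the helpers `readSchemas` and `compare`, and the temporary Parquet path are all hypothetical, and it assumes the default file-based (v1) Parquet reader, so that the scan appears as a `FileSourceScanExec` node in the physical plan. It writes the sample data to a temporary directory rather than to the `persisted` table so that it can run on its own.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.functions.explode

object ReadSchemaCheck {
  // Hypothetical helper: the schema each file scan requests from the data source,
  // i.e. the value that explain() prints as ReadSchema.
  def readSchemas(df: DataFrame): Seq[String] =
    df.queryExecution.sparkPlan.collect {
      case scan: FileSourceScanExec => scan.requiredSchema.simpleString
    }

  // Returns the read schemas for the plain select and for the explode select.
  def compare(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._
    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

    // Same sample data as in Part 1, written to a temporary path (an assumption
    // made here so the sketch does not depend on a persisted table).
    val jsonStr =
      """{"items": [{"itemId": 1, "itemData": "a"}, {"itemId": 2, "itemData": "b"}]}"""
    val path = Files.createTempDirectory("spark-29721").resolve("items").toString
    spark.read.json(Seq(jsonStr).toDS).write.parquet(path)
    val read = spark.read.parquet(path)

    (readSchemas(read.select($"items.itemId")),
     readSchemas(read.select(explode($"items.itemId"))))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("read-schema-check")
      .getOrCreate()
    try {
      val (plain, exploded) = compare(spark)
      println(s"plain select:   ${plain.mkString(", ")}")
      println(s"explode select: ${exploded.mkString(", ")}")
    } finally spark.stop()
  }
}
```

On a build affected by this issue, the second line would be expected to list both itemData and itemId, whereas a build containing the fix should report only itemId in both cases.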
      

       

              People

              • Assignee: viirya L. C. Hsieh
              • Reporter: ghoulken Kai Kang