Spark / SPARK-29721

Spark SQL reads unnecessary nested fields after using explode


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source.

      We are working on a project that creates a Parquet store for a large pre-joined table between two tables with a one-to-many relationship, and this is a blocking issue for us.

       

      The following code illustrates the issue.

      Part 1: loading some nested data

      val jsonStr = """{
       "items": [
         {"itemId": 1, "itemData": "a"},
         {"itemId": 2, "itemData": "b"}
       ]
      }"""
      val df = spark.read.json(Seq(jsonStr).toDS)
      df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
      

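      For reference, this is roughly the schema Spark infers for the JSON above (a sketch; exact printSchema output may vary slightly across versions). JSON schema inference sorts struct fields alphabetically, which is why itemData precedes itemId in the ReadSchema shown below.

      ```scala
      df.printSchema()
      // root
      //  |-- items: array (nullable = true)
      //  |    |-- element: struct (containsNull = true)
      //  |    |    |-- itemData: string (nullable = true)
      //  |    |    |-- itemId: long (nullable = true)
      ```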
       
      Part 2: reading it back and explaining the queries

      val read = spark.table("persisted")
      spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
      
      // pruned, only loading itemId
      // ReadSchema: struct<items:array<struct<itemId:bigint>>>
      read.select($"items.itemId").explain(true) 
      
      // not pruned, loading both itemId and itemData
      // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
      read.select(explode($"items.itemId")).explain(true)
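      Until this is fixed, one possible workaround (a sketch, not verified against every Spark version) is to project the nested field into its own top-level column before exploding, so that the scan itself only references items.itemId; the column names itemIds and itemId below are illustrative.

      ```scala
      // Hypothetical workaround: select the nested field first, then explode.
      // The initial projection is pruned (as in the first example above), so the
      // scan should only read itemId; explain(true) can confirm the ReadSchema.
      read.select($"items.itemId".as("itemIds"))
        .select(explode($"itemIds").as("itemId"))
        .explain(true)
      ```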
      

       


            People

              Assignee: L. C. Hsieh (viirya)
              Reporter: Kai Kang (ghoulken)