Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-37450

Spark SQL reads unnecessary nested fields (another type of pruning case)

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.2.0
    • 3.3.0
    • SQL
    • None

    Description

      Based on this SPARK-34638 Maybe I found another nested fields pruning case. In this case I found full read with `count` function

      Example:
      1) Loading data

      val jsonStr = """{
       "items": [
         {"itemId": 1, "itemData": "a"},
         {"itemId": 2, "itemData": "b"}
       ]
      }"""
      val df = spark.read.json(Seq(jsonStr).toDS)
      df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
      

      2) read query with explain

      val read = spark.table("persisted")
      spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
      
      read.select(explode($"items").as('item)).select(count(lit(true))).explain(true)
      // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
      

      Attachments

        Activity

          People

            viirya L. C. Hsieh
            yuryn Jiri Humpolicek
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: