  1. Spark
  2. SPARK-25363

Schema pruning doesn't work if nested column is used in where clause

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Schema pruning doesn't work if a nested column is used in the WHERE clause.

      For example,

      sql("select name.first from contacts where name.first = 'David'")
      
      == Physical Plan ==
      *(1) Project [name#19.first AS first#40]
      +- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
         +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [], 
          PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:struct<first:string,middle:string,last:string>>
      

      In the query plan above, the scan node reads the entire schema of the `name` column (`first`, `middle`, and `last`), even though the query only references `name.first`.
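
A minimal sketch of how the plan above might be reproduced locally. The case classes, sample rows, temporary path, and object name are hypothetical and only mirror the schema shown in `ReadSchema`; the sketch also assumes that pruning is switched on through the `spark.sql.optimizer.nestedSchemaPruning.enabled` setting.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical case classes mirroring the struct shown in ReadSchema.
case class FullName(first: String, middle: String, last: String)
case class Contact(id: Int, name: FullName)

object SchemaPruningRepro {
  // Writes a small Parquet table, registers it as `contacts`,
  // and returns the DataFrame for the query from the report.
  def query(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Resolve a child path so that the write target does not exist yet.
    val path = Files.createTempDirectory("spark-25363").resolve("contacts").toString
    Seq(
      Contact(0, FullName("David", "A.", "Smith")),
      Contact(1, FullName("Jane", "B.", "Doe"))
    ).toDF().write.parquet(path)
    spark.read.parquet(path).createOrReplaceTempView("contacts")
    spark.sql("select name.first from contacts where name.first = 'David'")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-25363-repro")
      // Assumption: nested schema pruning is enabled via this setting.
      .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
      .getOrCreate()
    // Print the physical plan; inspect the ReadSchema entry of the FileScan node.
    try query(spark).explain()
    finally spark.stop()
  }
}
```

On an affected build, the `ReadSchema` entry printed by `explain()` would be expected to list all three fields of `name`, as in the report; once pruning also covers nested columns used in filters, one would expect it to list only `first`.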

      This issue was reported in:
      https://github.com/apache/spark/pull/21320#issuecomment-419290197

            People

            • Assignee:
              viirya L. C. Hsieh
              Reporter:
              viirya L. C. Hsieh
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved: