  1. Spark
  2. SPARK-5446

Parquet column pruning should work for Map and Struct

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.2.0, 1.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      Consider the following query:

      select stddev_pop(variables.var1) stddev
      from model
      group by model_name
      

      Here variables is a Struct containing many fields; it could equally be a Map with many key-value pairs.

      During execution, SparkSQL shuffles the whole map or struct column instead of extracting the value first, so performance is very poor.

      The optimized version could use a subquery:

      select stddev_pop(var) stddev
      from (select variables.var1 as var, model_name from model) m
      group by model_name
      

      Here the field or key-value pair is extracted on the mapper side, so the data being shuffled is small.
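The effect of extracting the field before the shuffle can be illustrated with a small pure-Python sketch. It does not use Spark: the `rows` data, the `shuffle` helper, and the use of `pickle` size as a proxy for shuffled bytes are all assumptions made for illustration, not Spark internals.

```python
import math
import pickle
from collections import defaultdict

# Hypothetical stand-in for the `model` table: each row has a model_name
# and a `variables` struct with 600 fields (var0 .. var599).
rows = [
    {"model_name": f"m{i % 3}",
     "variables": {f"var{j}": float(i + j) for j in range(600)}}
    for i in range(30)
]

def shuffle(records):
    """Group (key, value) pairs by key and count the serialized bytes moved."""
    groups = defaultdict(list)
    shuffled_bytes = 0
    for key, value in records:
        shuffled_bytes += len(pickle.dumps((key, value)))
        groups[key].append(value)
    return groups, shuffled_bytes

def stddev_pop(values):
    """Population standard deviation of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# Unoptimized plan: the whole struct is shuffled, var1 is extracted afterwards.
groups, bytes_full = shuffle((r["model_name"], r["variables"]) for r in rows)
result_full = {k: stddev_pop([v["var1"] for v in vs]) for k, vs in groups.items()}

# Optimized plan: var1 is extracted on the map side, then shuffled.
groups, bytes_pruned = shuffle((r["model_name"], r["variables"]["var1"]) for r in rows)
result_pruned = {k: stddev_pop(vs) for k, vs in groups.items()}

print("same result:", result_full == result_pruned)
print("bytes shuffled (full struct vs. single field):", bytes_full, bytes_pruned)
```

Both plans compute the same per-group result; only the volume of data crossing the shuffle boundary differs, which is the point of the subquery rewrite.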

      A benchmark on a table with 600 variables shows a drastic improvement in runtime:

                             Parquet (using Map)   Parquet (using Struct)
      Stddev (unoptimized)   12890s                583s
      Stddev (optimized)     84s                   61s
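To put the benchmark table above in relative terms, the speedup factors can be computed directly from the reported runtimes. This is a quick sketch; the dictionary layout is only an illustrative choice, and the numbers are copied from the table.

```python
# Runtimes in seconds, copied from the benchmark table.
runtimes = {
    "Map": {"unoptimized": 12890, "optimized": 84},
    "Struct": {"unoptimized": 583, "optimized": 61},
}

# Speedup = unoptimized runtime / optimized runtime.
speedups = {
    layout: t["unoptimized"] / t["optimized"]
    for layout, t in runtimes.items()
}

for layout, factor in speedups.items():
    print(f"{layout}: {factor:.1f}x faster")
```

The subquery rewrite is roughly 153x faster for the Map layout and roughly 9.6x faster for the Struct layout.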

      Parquet already supports reading a single field or key-value pair at the storage level, but SparkSQL currently has no optimization that takes advantage of it. This would be a very useful optimization for tables that have a Map or Struct with many fields.

      Jianshi

              People

              • Assignee: Unassigned
              • Reporter: Jianshi Huang (huangjs)
              • Votes: 0
              • Watchers: 5
