Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-5446

Parquet column pruning should work for Map and Struct

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 1.2.0, 1.3.0
    • None
    • SQL
    • None

    Description

      Consider the following query:

      select stddev_pop(variables.var1) stddev
      from model
      group by model_name
      

      Where variables is a Struct containing many fields, similarly it can be a Map with many key-value pairs.

      During execution, SparkSQL will shuffle the whole map or struct column instead of extracting the value first. The performance is very poor.

      The optimized version could use a subquery:

      select stddev_pop(var) stddev
      from (select variables.var1 as var, model_name from model) m
      group by model_name
      

      Where we extract the field/key-value only in the mapper side, so data being shuffled is small.

      A benchmark for a table with 600 variables shows drastic improvment in runtime:

        Parquet (using Map) Parquet (using Struct)
      Stddev (unoptimized) 12890s 583s
      Stddev (optimized) 84s 61s

      Parquet already supports reading a single field/key-value in the storage level, but SparkSQL currently doesn’t have optimization for it. This will be very useful optimization for tables having Map or Struct with many columns.

      Jianshi

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              huangjs Jianshi Huang
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: