SPARK-33910

Simplify/Optimize conditional expressions


Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      1. Push down foldable expressions through CaseWhen/If
      2. Simplify conditionals in predicates
      3. Push UnaryExpressions into If/CaseWhen branches
      4. Simplify CaseWhen if elseValue is None
      5. Simplify CaseWhen clauses with (true and false) and (false and true) branches

      Common use cases are:

      create table t1 using parquet as select * from range(100);
      create table t2 using parquet as select * from range(200);

      create temp view v1 as
      select 'a' as event_type, * from t1
      union all
      select CASE WHEN id = 1 THEN 'b' ELSE 'c' END as event_type, * from t2;
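
The setup above is what the optimizations target: comparing the view's `CASE WHEN id = 1 THEN 'b' ELSE 'c' END` column against a literal. As a rough illustration (a toy sketch, not Spark's actual Catalyst rules or class names), rules (2), (3), and (5) amount to pushing the comparison into each branch, constant-folding the branches, and collapsing a CASE whose branches are `true`/`false` into its condition:

```python
# Toy expression tree; the names here are illustrative, not Catalyst's API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Lit:
    value: object

@dataclass(frozen=True)
class Attr:          # a column reference, e.g. id
    name: str

@dataclass(frozen=True)
class Eq:
    left: object
    right: object

@dataclass(frozen=True)
class CaseWhen:      # CASE WHEN cond THEN then ELSE els END
    cond: object
    then: object
    els: object

def simplify_eq(case, lit):
    """Push `case = lit` into both branches, then fold literal comparisons."""
    def fold(branch):
        if isinstance(branch, Lit):
            return Lit(branch.value == lit.value)
        return Eq(branch, lit)
    t, e = fold(case.then), fold(case.els)
    if t == e and isinstance(t, Lit):
        return t                    # both branches folded to the same constant
    if t == Lit(True) and e == Lit(False):
        return case.cond            # CASE WHEN c THEN true ELSE false END => c
    return CaseWhen(case.cond, t, e)

case = CaseWhen(Eq(Attr("id"), Lit(1)), Lit("b"), Lit("c"))
print(simplify_eq(case, Lit("a")))  # Lit(value=False): the predicate never holds
print(simplify_eq(case, Lit("b")))  # Eq(left=Attr(name='id'), right=Lit(value=1))
```

The two results correspond to the cases below: a filter that folds to `false` lets the optimizer drop the scan entirely, while a filter that folds to `id = 1` becomes pushable to the data source.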
      

      1. Avoid reading the whole table.

      explain select * from v1 where event_type = 'a';
      Before simplification:
      == Physical Plan ==
      Union
      :- *(1) Project [a AS event_type#7, id#9L]
      :  +- *(1) ColumnarToRow
      :     +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
      +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L]
         +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
            +- *(2) ColumnarToRow
               +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
      
      After simplification:
      == Physical Plan ==
      *(1) Project [a AS event_type#8, id#4L]
      +- *(1) ColumnarToRow
         +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet
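
The branch elimination seen here follows mechanically once the filter condition folds to `false`: a `Filter(false, _)` node can be replaced by an empty relation, and a `Union` drops empty children. A minimal sketch of that propagation (toy plan nodes with assumed names, not Spark's actual classes):

```python
# Toy logical-plan nodes; names are illustrative, not Spark's.
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    condition: object      # True/False after constant folding, else an expression
    child: object

@dataclass(frozen=True)
class Empty:               # stands in for LocalTableScan <empty>
    pass

@dataclass(frozen=True)
class Union:
    children: tuple

def prune(plan):
    """Replace Filter(false, _) with an empty relation; drop empty Union children."""
    if isinstance(plan, Filter):
        child = prune(plan.child)
        return Empty() if plan.condition is False else Filter(plan.condition, child)
    if isinstance(plan, Union):
        kids = tuple(c for c in map(prune, plan.children) if not isinstance(c, Empty))
        return kids[0] if len(kids) == 1 else Union(kids)
    return plan

plan = Union((Scan("t1"), Filter(False, Scan("t2"))))
print(prune(plan))   # Scan(table='t1'): the t2 branch is gone
```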
      

      2. Push conditional expressions down to the data source.

      explain select * from v1 where event_type = 'b';
      Before simplification:
      == Physical Plan ==
      Union
      :- LocalTableScan <empty>, [event_type#7, id#9L]
      +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L]
         +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b)
            +- *(1) ColumnarToRow
               +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
      
      After simplification:
      == Physical Plan ==
      *(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS id#4L]
      +- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1))
         +- *(1) ColumnarToRow
            +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct<id:bigint>
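
The `PushedFilters: [IsNotNull(id), EqualTo(id,1)]` entry appears because, after simplification, the predicate is a plain column-vs-literal comparison, which has a data-source filter representation; the original `CASE WHEN` comparison does not. A hypothetical sketch of that translatability check (toy types; the real conversion is done through Spark's data source filter API):

```python
# Toy predicate types; illustrative only, not Spark's sources.Filter classes.
from dataclasses import dataclass

@dataclass(frozen=True)
class Lit:
    value: object

@dataclass(frozen=True)
class Attr:
    name: str

@dataclass(frozen=True)
class Eq:
    left: object
    right: object

def to_source_filter(expr):
    """Translate a predicate into a pushable source filter, or None if impossible."""
    if (isinstance(expr, Eq)
            and isinstance(expr.left, Attr)
            and isinstance(expr.right, Lit)):
        return f"EqualTo({expr.left.name},{expr.right.value})"
    return None  # e.g. a comparison against a CASE WHEN cannot be pushed down

print(to_source_filter(Eq(Attr("id"), Lit(1))))   # EqualTo(id,1)
```

Anything that returns `None` here has to stay in the `Filter` node and be evaluated row by row after the scan.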
      

      3. Reduce the amount of computation.

      explain select event_type = 'e' from v1;
      Before simplification:
      == Physical Plan ==
      Union
      :- *(1) Project [false AS (event_type = e)#37]
      :  +- *(1) ColumnarToRow
      :     +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
      +- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS (event_type = e)#38]
         +- *(2) ColumnarToRow
            +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
      
      After simplification:
      == Physical Plan ==
      Union
      :- *(1) Project [false AS (event_type = e)#10]
      :  +- *(1) ColumnarToRow
      :     +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: Parquet,
      +- *(2) Project [false AS (event_type = e)#14]
         +- *(2) ColumnarToRow
            +- FileScan parquet default.t2[] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
      
      

People

    Assignee: yumwang Yuming Wang
    Reporter: yumwang Yuming Wang
