  1. Spark
  2. SPARK-36289

Rewrite distinct count CASE WHEN expressions without an Expand node


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Currently, the RewriteDistinctAggregates rule rewrites distinct aggregates by adding an extra Expand node. This wastes memory and degrades performance for some specific aggregates (e.g. count distinct over CASE WHEN expressions).
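      To make the cost concrete, here is a rough sketch in plain Python (not Spark code) of what an Expand step does to the input. It assumes a simplified model in which Expand emits one copy of each input row per distinct aggregate expression, tagged with a group id. The `expand` helper and the sample rows are hypothetical.

```python
# Simplified model, not Spark code: Expand is assumed to emit one copy of each
# input row per distinct aggregate expression, tagged with a group id (gid).
def expand(rows, num_distinct_groups):
    return [
        (gid, row)
        for row in rows
        for gid in range(1, num_distinct_groups + 1)
    ]


# Hypothetical (key, cat2) input rows.
rows = [("k1", "a"), ("k1", "b"), ("k2", "a")]

# Two COUNT(DISTINCT CASE WHEN ...) expressions give two distinct groups,
# so every input row is materialized twice before aggregation.
expanded = expand(rows, num_distinct_groups=2)
print(len(rows), len(expanded))  # 3 6
```

      Under this model the data fed into the aggregation grows linearly with the number of distinct aggregate expressions.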

      We can rewrite count distinct CASE WHEN aggregates without Expand.

      For the query:

        SELECT
          key,
          COUNT(DISTINCT CASE WHEN cond1 THEN cat2 ELSE null END) AS cat2_cnt1,
          COUNT(DISTINCT CASE WHEN cond2 THEN cat2 ELSE null END) AS cat2_cnt2
        FROM
          data
        GROUP BY
          key

       

      we should rewrite the plan to:

        Aggregate(
           key = ['key]
           functions = [count(if ((01 & 'bit_vector) != 0) 0 else null),
                        count(if ((10 & 'bit_vector) != 0) 0 else null)]
           output = ['key, 'cat2_cnt1, 'cat2_cnt2])
          Aggregate(
             key = ['key, 'cat2]
             functions = [bit_or((if (cond1) 01 else 00) | (if (cond2) 10 else 00))]
             output = ['key, 'cat2, 'bit_vector])
            LocalTableScan [...]

       

      Because this plan has no Expand node, input rows are not duplicated before aggregation, which should improve performance and reduce memory usage.
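      The two-level plan can be checked with a small simulation in plain Python (not Spark code). It is only a sketch: the rows, the function names, and the boolean `cond1`/`cond2` columns are hypothetical, and the sketch assumes that a NULL `cat2` group has to be skipped in the outer aggregate to match COUNT(DISTINCT) semantics, a detail the plan in this issue does not spell out.

```python
from collections import defaultdict

# Hypothetical input rows: (key, cat2, cond1, cond2).
rows = [
    ("k1", "a", True, False),
    ("k1", "a", False, True),
    ("k1", "b", True, True),
    ("k1", "c", False, False),
    ("k2", "a", False, True),
    ("k2", None, True, True),  # NULL cat2 must not be counted
]


def count_distinct_direct(rows):
    """Reference: COUNT(DISTINCT CASE WHEN cond THEN cat2 END) per key."""
    seen = defaultdict(lambda: (set(), set()))
    for key, cat2, cond1, cond2 in rows:
        s1, s2 = seen[key]
        if cat2 is not None:
            if cond1:
                s1.add(cat2)
            if cond2:
                s2.add(cat2)
    return {k: (len(s1), len(s2)) for k, (s1, s2) in seen.items()}


def count_distinct_bit_vector(rows):
    """Two-level aggregation following the proposed plan, without Expand."""
    # Inner aggregate: group by (key, cat2), bit_or the per-row condition bits.
    bit_vector = defaultdict(int)
    for key, cat2, cond1, cond2 in rows:
        bit_vector[(key, cat2)] |= (0b01 if cond1 else 0) | (0b10 if cond2 else 0)

    # Outer aggregate: group by key, count the (key, cat2) groups with each bit set.
    result = defaultdict(lambda: [0, 0])
    for (key, cat2), bits in bit_vector.items():
        counts = result[key]
        if cat2 is None:  # assumption: NULL groups are excluded
            continue
        if bits & 0b01:
            counts[0] += 1
        if bits & 0b10:
            counts[1] += 1
    return {k: tuple(v) for k, v in result.items()}


print(count_distinct_bit_vector(rows))  # {'k1': (2, 2), 'k2': (0, 1)}
```

      Each input row is read once by the inner aggregate, and the outer aggregate only sees one row per distinct (key, cat2) pair.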


          People

            Assignee: Unassigned
            Reporter: Wenqiang Lee (gabriellee)