  1. Spark
  2. SPARK-34283

Combines all adjacent 'Union' operators into a single 'Union' when using 'Dataset.union.distinct.union.distinct'

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      Problem:

      Currently, when datasets are combined with 'Dataset.union.distinct.union.distinct', the Optimizer cannot combine all adjacent 'Union' operators into a single 'Union', although it does combine them when the equivalent query is written in SQL.

      For example:

      The 'Physical Plan' is shown below:

      But using sql:

      The 'Physical Plan' is shown below:
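A minimal sketch of the two cases described above, for readers who want to compare the plans themselves. The datasets `df1`, `df2`, `df3`, the view names `t1` to `t3`, and the object name `UnionDistinctRepro` are illustrative assumptions, not taken from the original report. It assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction; names and data are illustrative only.
object UnionDistinctRepro {

  // Builds the same logical query twice (Dataset API and SQL), prints both
  // plans, and returns the sorted results of each for comparison.
  def run(spark: SparkSession): (Array[Long], Array[Long]) = {
    val df1 = spark.range(0, 3).toDF("id")
    val df2 = spark.range(2, 5).toDF("id")
    val df3 = spark.range(4, 7).toDF("id")

    // Case 1: Dataset API, the pattern named in the issue title.
    val viaApi = df1.union(df2).distinct().union(df3).distinct()
    viaApi.explain(true)

    // Case 2: the equivalent SQL, where UNION implies duplicate removal.
    df1.createOrReplaceTempView("t1")
    df2.createOrReplaceTempView("t2")
    df3.createOrReplaceTempView("t3")
    val viaSql = spark.sql(
      "SELECT id FROM t1 UNION SELECT id FROM t2 UNION SELECT id FROM t3")
    viaSql.explain(true)

    (viaApi.collect().map(_.getLong(0)).sorted,
     viaSql.collect().map(_.getLong(0)).sorted)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-34283 sketch")
      .getOrCreate()
    try {
      val (api, sql) = run(spark)
      // Both forms must return the same rows; only the plans may differ.
      assert(api.sameElements(sql))
    } finally {
      spark.stop()
    }
  }
}
```

Both queries return the same rows. The difference reported in this issue is only in how many 'Union' operators remain in the plans printed by `explain(true)`.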

       

      Root cause:

      With 'Dataset.union.distinct.union.distinct', the resulting logical operator is 'Deduplicate(Keys, Union)', whereas AstBuilder translates a SQL 'Union' into the operator 'Distinct(Union)'. The 'CombineUnions' rule in the Optimizer only handles 'Distinct(Union)', not 'Deduplicate(Keys, Union)'.
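The root cause can be illustrated with a toy model. The sketch below does not use Spark's Catalyst classes: `Plan`, `Relation`, `Union`, `Distinct`, `Deduplicate`, and `CombineUnionsSketch` are simplified stand-ins written for this illustration, and the `handleDeduplicate` flag is a hypothetical switch, not a Spark setting. It assumes that a 'Deduplicate' whose keys cover every output column behaves like 'Distinct', which is the only case in which flattening through it is safe.

```scala
// Toy stand-ins for logical plan nodes (NOT the real Catalyst classes).
sealed trait Plan { def output: Seq[String] }
case class Relation(name: String, output: Seq[String]) extends Plan
case class Union(children: Seq[Plan]) extends Plan {
  def output: Seq[String] = children.head.output
}
case class Distinct(child: Plan) extends Plan {
  def output: Seq[String] = child.output
}
case class Deduplicate(keys: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = child.output
}

object CombineUnionsSketch {
  // Deduplicate over all output columns removes the same rows as Distinct.
  private def dedupAll(keys: Seq[String], child: Plan): Boolean =
    keys.toSet == child.output.toSet

  // Collect the inputs of unions nested directly, or beneath a
  // duplicate-removing node, under a union that is itself deduplicated.
  private def flatten(u: Union, handleDeduplicate: Boolean): Union = {
    def leaves(p: Plan): Seq[Plan] = p match {
      case Union(cs) => cs.flatMap(leaves)
      case Distinct(inner: Union) => leaves(inner)
      case Deduplicate(keys, inner: Union)
          if handleDeduplicate && dedupAll(keys, inner) => leaves(inner)
      case other => Seq(other)
    }
    Union(u.children.flatMap(leaves))
  }

  // Only a union directly under Distinct (or, optionally, under an
  // all-column Deduplicate) is rewritten; any other plan is returned as-is.
  def combine(plan: Plan, handleDeduplicate: Boolean): Plan = plan match {
    case Distinct(u: Union) => Distinct(flatten(u, handleDeduplicate))
    case Deduplicate(keys, u: Union)
        if handleDeduplicate && dedupAll(keys, u) =>
      Deduplicate(keys, flatten(u, handleDeduplicate))
    case other => other
  }
}
```

With `handleDeduplicate = false` the sketch mirrors the behaviour described above: the SQL-shaped plan `Distinct(Union(Distinct(Union(a, b)), c))` collapses to a single 'Union', while the Dataset-shaped plan `Deduplicate(keys, Union(Deduplicate(keys, Union(a, b)), c))` is left unchanged. With `handleDeduplicate = true` both shapes collapse to one 'Union' over `a`, `b`, and `c`.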

      Attachments

        1. image-2021-01-29-11-12-44-112.png
          46 kB
          Zhichao Zhang
        2. image-2021-01-29-11-13-42-055.png
          177 kB
          Zhichao Zhang
        3. image-2021-01-29-11-14-08-822.png
          37 kB
          Zhichao Zhang
        4. image-2021-01-29-11-14-42-700.png
          113 kB
          Zhichao Zhang

          People

            Assignee: Zhichao Zhang (zzcclp)
            Reporter: Zhichao Zhang (zzcclp)
            Votes: 0
            Watchers: 4
