SPARK-43778: RewriteCorrelatedScalarSubquery should handle duplicate attributes


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 4.0.0
    • Component/s: SQL

    Description

      This is a correctness bug: the decorrelation rule (RewriteCorrelatedScalarSubquery) does not deduplicate join attributes properly. When the outer query and the subquery expose the same attribute, the join condition becomes (c1 = c1), which is simplified to true, so the join degenerates into a cross product.

       

      Example query:

       

      create view t(c1, c2) as values (0, 1), (0, 2), (1, 2);

      select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 having cnt = 0) from t t1;
      -- Correct answer: [(0, 1, null), (0, 2, null), (1, 2, null)]
      -- Actual (incorrect) result, with every row duplicated:
      +---+---+------------------+
      |c1 |c2 |scalarsubquery(c1)|
      +---+---+------------------+
      |0  |1  |null              |
      |0  |1  |null              |
      |0  |2  |null              |
      |0  |2  |null              |
      |1  |2  |null              |
      |1  |2  |null              |
      +---+---+------------------+
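The effect of the degenerate join condition can be reproduced outside Spark. The following is a toy Python model, not Spark's implementation: the helper names (`grouped_counts`, `left_outer_join`, `scalar_subquery_rows`) are hypothetical, and it assumes the rewritten plan left-outer-joins the outer rows with the per-c1 aggregate and evaluates the HAVING predicate on the joined value afterwards. Under those assumptions, a condition that is always true yields the six rows shown above, while the intended condition yields the three correct rows.

```python
from collections import Counter

# Rows of view t(c1, c2) from the example query.
T = [(0, 1), (0, 2), (1, 2)]

def grouped_counts(rows):
    """Model of the decorrelated subquery body: count(*) grouped by c1."""
    return list(Counter(c1 for c1, _ in rows).items())  # [(c1, cnt), ...]

def left_outer_join(outer, inner, condition):
    """Nested-loop left outer join; unmatched outer rows pair with None."""
    result = []
    for o in outer:
        matches = [i for i in inner if condition(o, i)]
        result.extend((o, m) for m in matches or [None])
    return result

def scalar_subquery_rows(condition):
    """Join t1 with the aggregate, then apply HAVING cnt = 0 (assumed post-join)."""
    out = []
    for (c1, c2), match in left_outer_join(T, grouped_counts(T), condition):
        cnt = match[1] if match is not None else 0  # count(*) over no rows is 0
        out.append((c1, c2, cnt if cnt == 0 else None))
    return out

# Intended condition: the outer c1 equals the subquery's c1.
correct = scalar_subquery_rows(lambda o, i: o[0] == i[0])

# Degenerate condition: both sides are the same attribute, so c1 = c1 is always true.
buggy = scalar_subquery_rows(lambda o, i: True)

print(correct)  # [(0, 1, None), (0, 2, None), (1, 2, None)]
print(buggy)    # six rows: each outer row paired with both groups
```

In this model the duplication comes purely from the join condition: the aggregate has two groups (c1 = 0 and c1 = 1), so an always-true condition pairs each of the three outer rows with both groups.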

       

       

       


            People

              Assignee: Unassigned
              Reporter: Andrey Gubichev (gubichev)