Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-26078

WHERE .. IN fails to filter rows when used in combination with UNION

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.1, 2.4.0
    • Fix Version/s: 2.3.3, 2.4.1, 3.0.0
    • Component/s: SQL
    • Labels:

      Description

      Hey,

      We encountered a case where Spark SQL does not seem to handle WHERE .. IN correctly, when used in combination with UNION, but instead returns also rows that do not fulfill the condition. Swapping the order of the datasets in the UNION makes the problem go away. Repro below:

       

      sql = SQLContext(sc)
      
      a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
      b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
      a.registerTempTable('a')
      b.registerTempTable('b')
      
      bug = sql.sql("""
          SELECT id,num,source FROM
          (
              SELECT id, num, 'a' as source FROM a
              UNION ALL
              SELECT id, num, 'b' as source FROM b
          ) AS c
          WHERE c.id IN (SELECT id FROM b WHERE num = 2)
      """)
      
      no_bug = sql.sql("""
          SELECT id,num,source FROM
          (
              SELECT id, num, 'b' as source FROM b
              UNION ALL
              SELECT id, num, 'a' as source FROM a
          ) AS c
          WHERE c.id IN (SELECT id FROM b WHERE num = 2)
      """)
      
      bug.show()
      no_bug.show()
      
      bug.explain(True)
      no_bug.explain(True)
      

      This results in one extra row in the "bug" DF coming from DF "b", that should not be there as it  

      >>> bug.show()
      +---+---+------+
      | id|num|source|
      +---+---+------+
      |  a|  2|     a|
      |  a|  2|     b|
      |  b|  1|     b|
      +---+---+------+
      
      >>> no_bug.show()
      +---+---+------+
      | id|num|source|
      +---+---+------+
      |  a|  2|     b|
      |  a|  2|     a|
      +---+---+------+
      

       The reason can be seen in the query plans:

      >>> bug.explain(True)
      ...
      == Optimized Logical Plan ==
      Union
      :- Project [id#0, num#1L, a AS source#136]
      :  +- Join LeftSemi, (id#0 = id#4)
      :     :- LogicalRDD [id#0, num#1L], false
      :     +- Project [id#4]
      :        +- Filter (isnotnull(num#5L) && (num#5L = 2))
      :           +- LogicalRDD [id#4, num#5L], false
      +- Join LeftSemi, (id#4#172 = id#4#172)
         :- Project [id#4, num#5L, b AS source#137]
         :  +- LogicalRDD [id#4, num#5L], false
         +- Project [id#4 AS id#4#172]
            +- Filter (isnotnull(num#5L) && (num#5L = 2))
               +- LogicalRDD [id#4, num#5L], false
      

      Note the line +- Join LeftSemi, (id#4#172 = id#4#172) - this condition seems wrong, and I believe it causes the LeftSemi to return true for all rows in the left-hand-side table, thus failing to filter as the WHERE .. IN should. Compare with the non-buggy version, where both LeftSemi joins have distinct #-things on both sides:

      >>> no_bug.explain()
      ...
      == Optimized Logical Plan ==
      Union
      :- Project [id#4, num#5L, b AS source#142]
      :  +- Join LeftSemi, (id#4 = id#4#173)
      :     :- LogicalRDD [id#4, num#5L], false
      :     +- Project [id#4 AS id#4#173]
      :        +- Filter (isnotnull(num#5L) && (num#5L = 2))
      :           +- LogicalRDD [id#4, num#5L], false
      +- Project [id#0, num#1L, a AS source#143]
         +- Join LeftSemi, (id#0 = id#4#173)
            :- LogicalRDD [id#0, num#1L], false
            +- Project [id#4 AS id#4#173]
               +- Filter (isnotnull(num#5L) && (num#5L = 2))
                  +- LogicalRDD [id#4, num#5L], false
      

       

      Best,
      -Arttu 

       

        Attachments

          Activity

            People

            • Assignee:
              mgaido Marco Gaido
              Reporter:
              avoutilainen Arttu Voutilainen
            • Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: