Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20744

Predicates with multiple columns do not work

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.3.0
    • None
    • SQL

    Description

      The following code reproduces the problem:

      scala> spark.range(10).selectExpr("id as a", "id as b").where("(a,b) in ((1,1))").show
      org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', `a`, 'b', `b`) IN (named_struct('col1', 1, 'col2', 1)))' due to data type mismatch: Arguments must be same type; line 1 pos 6;
      'Filter named_struct(a, a#42L, b, b#43L) IN (named_struct(col1, 1, col2, 1))
      +- Project [id#39L AS a#42L, id#39L AS b#43L]
         +- Range (0, 10, step=1, splits=Some(1))
      

      Similarly it won't work from SQL either, which is something that other SQL DB support:

      scala> spark.range(10).selectExpr("id as a", "id as b").createOrReplaceTempView("tab1")
      
      scala> sql("select * from tab1 where (a,b) in ((1,1), (2,2))").show
      org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', tab1.`a`, 'b', tab1.`b`) IN (named_struct('col1', 1, 'col2', 1), named_struct('col1', 2, 'col2', 2)))' due to data type mismatch: Arguments must be same type; line 1 pos 31;
      'Project [*]
      +- 'Filter named_struct(a, a#50L, b, b#51L) IN (named_struct(col1, 1, col2, 1),named_struct(col1, 2, col2, 2))
         +- SubqueryAlias tab1
            +- Project [id#47L AS a#50L, id#47L AS b#51L]
               +- Range (0, 10, step=1, splits=Some(1))
      

      Other examples:

      scala> sql("select * from tab1 where (a,b) =(1,1)").show
      org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', tab1.`a`, 'b', tab1.`b`) = named_struct('col1', 1, 'col2', 1))' due to data type mismatch: differing types in '(named_struct('a', tab1.`a`, 'b', tab1.`b`) = named_struct('col1', 1, 'col2', 1))' (struct<a:bigint,b:bigint> and struct<col1:int,col2:int>).; line 1 pos 25;
      'Project [*]
      +- 'Filter (named_struct(a, a#50L, b, b#51L) = named_struct(col1, 1, col2, 1))
         +- SubqueryAlias tab1
            +- Project [id#47L AS a#50L, id#47L AS b#51L]
               +- Range (0, 10, step=1, splits=Some(1))
      

      Expressions such as (1,1) are apparently read as structs and then the types do not match. Perhaps they should be arrays.
      The following code works:

      sql("select * from tab1 where array(a,b) in (array(1,1),array(2,2))").show
      

      This also works, but requires the cast:

      sql("select * from tab1 where (a,b) in (named_struct('a', cast(1 as bigint), 'b', cast(1 as bigint)))").show
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            bograd Bogdan Raducanu
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: