Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
2.3.0
-
None
Description
The following code reproduces the problem:
scala> spark.range(10).selectExpr("id as a", "id as b").where("(a,b) in ((1,1))").show org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', `a`, 'b', `b`) IN (named_struct('col1', 1, 'col2', 1)))' due to data type mismatch: Arguments must be same type; line 1 pos 6; 'Filter named_struct(a, a#42L, b, b#43L) IN (named_struct(col1, 1, col2, 1)) +- Project [id#39L AS a#42L, id#39L AS b#43L] +- Range (0, 10, step=1, splits=Some(1))
Similarly it won't work from SQL either, which is something that other SQL DB support:
scala> spark.range(10).selectExpr("id as a", "id as b").createOrReplaceTempView("tab1") scala> sql("select * from tab1 where (a,b) in ((1,1), (2,2))").show org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', tab1.`a`, 'b', tab1.`b`) IN (named_struct('col1', 1, 'col2', 1), named_struct('col1', 2, 'col2', 2)))' due to data type mismatch: Arguments must be same type; line 1 pos 31; 'Project [*] +- 'Filter named_struct(a, a#50L, b, b#51L) IN (named_struct(col1, 1, col2, 1),named_struct(col1, 2, col2, 2)) +- SubqueryAlias tab1 +- Project [id#47L AS a#50L, id#47L AS b#51L] +- Range (0, 10, step=1, splits=Some(1))
Other examples:
scala> sql("select * from tab1 where (a,b) =(1,1)").show org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', tab1.`a`, 'b', tab1.`b`) = named_struct('col1', 1, 'col2', 1))' due to data type mismatch: differing types in '(named_struct('a', tab1.`a`, 'b', tab1.`b`) = named_struct('col1', 1, 'col2', 1))' (struct<a:bigint,b:bigint> and struct<col1:int,col2:int>).; line 1 pos 25; 'Project [*] +- 'Filter (named_struct(a, a#50L, b, b#51L) = named_struct(col1, 1, col2, 1)) +- SubqueryAlias tab1 +- Project [id#47L AS a#50L, id#47L AS b#51L] +- Range (0, 10, step=1, splits=Some(1))
Expressions such as (1,1) are apparently read as structs and then the types do not match. Perhaps they should be arrays.
The following code works:
sql("select * from tab1 where array(a,b) in (array(1,1),array(2,2))").show
This also works, but requires the cast:
sql("select * from tab1 where (a,b) in (named_struct('a', cast(1 as bigint), 'b', cast(1 as bigint)))").show