Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
3.3.3, 3.4.1, 3.5.0
Description
A query can have an incorrect output schema because of a subquery.
Assume this data:
create or replace temp view t1(a) as values (1), (2), (3), (7); create or replace temp view t2(c1) as values (1), (2), (3); create or replace temp view t3(col1) as values (3), (9); cache table t1; cache table t2; cache table t3;
When run in spark-sql, the following query has a superfluous boolean column:
select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); 1 false 2 false 3 true
The result should be:
1 2 3
When executed via the Dataset API, you don't see the incorrect result, because the Dataset API truncates the right-side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong).
However, even with the Dataset API, this query goes wrong:
select ( select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ) limit 1 ) from range(1); java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis at scala.Predef$.assert(Predef.scala:279) at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) at scala.collection.AbstractIterable.foreach(Iterable.scala:933) ...
Other queries that have the wrong schema:
select * from t1 where a in ( select c1 from t2 where a in (select col1 from t3) );
and
select * from t1 where not exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) );