Details
Description
When the input data has a complex structure, using same expression in group by clause and select clause will throw "Expression not in GROUP BY".
val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Birthday(date: String) case class Person(name: String, birthday: Birthday) val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28")))) people.registerTempTable("people") val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)") year.collect
Here is the plan of year:
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
Subquery people
LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`.
Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias expression.
Attachments
Issue Links
- duplicates
-
SPARK-4322 Struct fields can't be used as sub-expression of grouping fields
- Resolved
- links to