Spark / SPARK-9338

Aliases from SELECT not available in GROUP BY


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.4.1
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: Observed on Mac OS X and Ubuntu 14.04

    Description

      It feels like this should really be a known issue, but I've not been able to find any mailing list or JIRA tickets for exactly this. There are a few closed/resolved tickets about specific types of exceptions, but I couldn't find this exact problem, so apologies if this is a dupe!

      Spark SQL doesn't appear to support referencing aliases from a SELECT in the GROUP BY part of the query. This is confusing our analysts, as it works in most other tools they use. Here's an example to reproduce:

      import org.apache.spark.sql._
      import org.apache.spark.sql.types._
      
      val schema =
        StructType(
          StructField("x", IntegerType, nullable=false) ::
          StructField("y",
            StructType(StructField("a", DoubleType, nullable=false) :: Nil),
            nullable=false) :: Nil)
      
      val rdd = sc.parallelize(
        Row(1, Row(1.0)) :: Row(2, Row(1.34)) :: Row(3, Row(2.3)) :: Row(4, Row(2.5)) :: Nil)
      
      val df = sqlContext.createDataFrame(rdd, schema)
      
      // DataFrame content looks like this:
      // x   y
      // 1   {a: 1.0}
      // 2   {a: 1.34}
      // 3   {a: 2.3}
      // 4   {a: 2.5}
      
      
      df.registerTempTable("test_data")
      sqlContext.udf.register("roundToInt", (x: Double) => x.toInt)
      
      
      sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
      // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given input columns x, y
      
      sqlContext.sql("SELECT y.a as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
      // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given input columns x, y;
      
      sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(y.a) as s FROM test_data GROUP BY roundToInt(y.a)").show()
      // =>
      // +---+----+
      // |grp|   s|
      // +---+----+
      // |  1|2.34|
      // |  2| 4.8|
      // +---+----+
      

      As you can see, it's particularly inconvenient when using UDFs on nested fields, as it means repeating potentially complex expressions. It's very common for us to want to apply a date conversion (from epoch milliseconds or similar) to a nested field and then reference the result in several places in the same query; without alias support in GROUP BY, that makes for quite verbose queries.
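
      One way to avoid the repetition, at least within the SQL text, is to compute the alias in a derived table so that the outer GROUP BY can reference it by name. This is only a sketch, and assumes the SQL dialect in use accepts subqueries in the FROM clause:

      // Workaround sketch: define the alias in an inner query, then group on it in
      // the outer query. Reuses the test_data table and roundToInt UDF from above.
      sqlContext.sql("""
        SELECT grp, SUM(x) AS s
        FROM (SELECT roundToInt(y.a) AS grp, x FROM test_data) t
        GROUP BY grp
      """).show()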

      Might it also mean that we're mapping these functions over the data twice? I can't quite tell from the explain output whether that's been optimised away or not, but here it is for somebody who understands these plans better than I do:

      sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY roundToInt(y.a)").explain()
      // == Physical Plan ==
      // Aggregate false, [PartialGroup#126], [PartialGroup#126 AS grp#116,CombineSum(PartialSum#125L) AS s#117L]
      // Exchange (HashPartitioning 200)
      // Aggregate true, [scalaUDF(y#7.a)], [scalaUDF(y#7.a) AS PartialGroup#126,SUM(CAST(x#6, LongType)) AS PartialSum#125L]
      // PhysicalRDD [x#6,y#7], MapPartitionsRDD[10] at createDataFrame at <console>:31
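
      For comparison, the DataFrame API lets the grouping expression be defined once as a Column value and reused, which at least removes the duplication from the query text. A minimal sketch, assuming the org.apache.spark.sql.functions API (udf, sum) and the same df as above:

      import org.apache.spark.sql.functions.{sum, udf}

      // Define the conversion once as a Column-producing UDF...
      val roundToIntUdf = udf((x: Double) => x.toInt)
      val grp = roundToIntUdf(df("y.a")).as("grp")

      // ...then group and aggregate using that single Column value.
      df.groupBy(grp).agg(sum("x").as("s")).show()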
      


    People

      Assignee: Unassigned
      Reporter: James Aley (jaley)
      Votes: 0
      Watchers: 15
