[SPARK-11770] Spark SQL field resolution error in GROUP BY HAVING clause

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.5.1
    • Fix Version/s: None
    • Component/s: SQL

    Description

      A query fails to resolve columns from the source data when an alias is added to the SELECT clause. I have not been able to isolate a reproducible standalone test. Below is a series of spark-shell operations that shows the problem step by step. Spark SQL execution happens via HiveContext.

      I believe the root cause is that when (and only when) the SELECT clause contains aliased expression columns, Spark SQL resolves column references in the HAVING clause against the SELECT-clause output, aliases included, rather than against the underlying relation. According to the SQL standard, SELECT aliases should not be visible to HAVING.

      The table in question is simple:

      scala> ctx.table("hevents_test").printSchema
      15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
      15/11/16 22:19:19 INFO audit: ugi=sim	ip=unknown-ip-addr	cmd=get_table : db=default tbl=hevents_test
      root
       |-- vertical: string (nullable = true)
       |-- did: string (nullable = true)
       |-- surl: string (nullable = true)
       |-- creative_id: long (nullable = true)
       |-- keyword_text: string (nullable = true)
       |-- errors: integer (nullable = true)
       |-- views: integer (nullable = true)
       |-- clicks: integer (nullable = true)
       |-- actions: long (nullable = true)
      

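      For anyone attempting a standalone reproduction, a table with this shape can be synthesized directly in the shell. The following is only a scaffold with made-up data (whether a temp table exercises the same resolution path as the Hive metastore table is untested):

      // Hypothetical scaffold mirroring the schema above; not the reporter's data.
      case class HEvent(vertical: String, did: String, surl: String,
                        creative_id: Long, keyword_text: String,
                        errors: Int, views: Int, clicks: Int, actions: Long)

      // Two rows are enough to put the HAVING predicate on both sides of 500.
      val rows = Seq(
        HEvent("v1", "d1", "http://a", 1L, "kw1", 0, 600, 10, 2L),
        HEvent("v1", "d2", "http://b", 2L, "kw2", 0, 100, 1, 0L))

      ctx.createDataFrame(rows).registerTempTable("hevents_test")
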
      A basic aggregation with a SELECT expression works without a problem (the unaliased expression surfaces as the auto-generated column _c0 in the result schema):

      scala> ctx.sql("""
           |   select 1.0*creative_id, sum(views) as views
           |   from hevents_test
           |   group by creative_id
           |   having sum(views) > 500
           | """)
      15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, sum(views) as views
        from hevents_test
        group by creative_id
        having sum(views) > 500
      15/11/16 22:25:53 INFO ParseDriver: Parse Completed
      15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
      15/11/16 22:25:53 INFO audit: ugi=sim	ip=unknown-ip-addr	cmd=get_table : db=default tbl=hevents_test
      res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
      

      However, if the expression is aliased, the analyzer fails to resolve views in the HAVING clause:

      scala> ctx.sql("""
           | select 1.0*creative_id as cid, sum(views) as views
           | from hevents_test
           | group by creative_id
           | having sum(views) > 500
           | """)
      15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
      from hevents_test
      group by creative_id
      having sum(views) > 500
      15/11/16 22:26:59 INFO ParseDriver: Parse Completed
      15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
      15/11/16 22:26:59 INFO audit: ugi=sim	ip=unknown-ip-addr	cmd=get_table : db=default tbl=hevents_test
      org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L missing from vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5 in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
      	at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
      	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
      	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
      	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
      	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:71)
      	at $iwC$$iwC$$iwC.<init>(<console>:73)
      	at $iwC$$iwC.<init>(<console>:75)
      	at $iwC.<init>(<console>:77)
      	at <init>(<console>:79)
      	at .<init>(<console>:83)
      	at .<clinit>(<console>)
      	at .<init>(<console>:7)
      	at .<clinit>(<console>)
      	at $print(<console>)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
      	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
      	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
      	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
      	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
      	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
      	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
      	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
      	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
      	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
      	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
      	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
      	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
      	at org.apache.spark.repl.Main$.main(Main.scala:31)
      	at org.apache.spark.repl.Main.main(Main.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      
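      As a side note, the same computation expressed through the DataFrame API avoids the HAVING path entirely, because the filter is applied to the aggregated output explicitly. A sketch only, not verified against 1.5.1:

      import org.apache.spark.sql.functions._

      // Aggregate first, then filter the aggregated output; the predicate can
      // only resolve against the post-aggregation columns, so no ambiguity
      // between the source column and the output alias can arise.
      val result = ctx.table("hevents_test")
        .groupBy("creative_id")
        .agg(sum("views").as("views"))
        .filter(col("views") > 500)
        .selectExpr("1.0 * creative_id as cid", "views")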

      If the views column in the HAVING clause is explicitly qualified with the table name, the problem goes away:

      scala> ctx.sql("""
           | select 1.0*creative_id as cid, sum(views) as views
           | from hevents_test
           | group by creative_id
           | having sum(hevents_test.views) > 500
           | """)
      15/11/16 22:29:17 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
      from hevents_test
      group by creative_id
      having sum(hevents_test.views) > 500
      15/11/16 22:29:17 INFO ParseDriver: Parse Completed
      15/11/16 22:29:17 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
      15/11/16 22:29:17 INFO audit: ugi=sim	ip=unknown-ip-addr	cmd=get_table : db=default tbl=hevents_test
      res23: org.apache.spark.sql.DataFrame = [cid: double, views: bigint]
      

      That disambiguation should not be necessary.
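
      Another workaround consistent with the suspected cause, though untested here: choose an output alias that does not collide with the source column name, so the reference in HAVING can only resolve to hevents_test.views:

      ctx.sql("""
        select 1.0*creative_id as cid, sum(views) as total_views
        from hevents_test
        group by creative_id
        having sum(views) > 500
      """)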

          People

            Assignee: Unassigned
            Reporter: Simeon Simeonov (simeons)
            Votes: 0
            Watchers: 3
