Spark / SPARK-30997

An analysis failure in generators with aggregate functions

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.1.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      SPARK-28782 added support for generators in SQL aggregate expressions.
      However, the equivalent generator (explode) query with aggregate functions still fails when written with the DataFrame API:

      // SPARK-28782: Generator support in aggregate expressions
      scala> spark.range(3).toDF("id").createOrReplaceTempView("t")
      scala> sql("select explode(array(min(id), max(id))) from t").show()
      +---+
      |col|
      +---+
      |  0|
      |  2|
      +---+
      
      // A failure case handled in this PR
      scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()
      org.apache.spark.sql.AnalysisException:
      The query operator `Generate` contains one or more unsupported
      expression types Aggregate, Window or Generate.
      Invalid expressions: [min(`id`), max(`id`)];;
      Project [col#46L]
      +- Generate explode(array(min(id#42L), max(id#42L))), false, [col#46L]
         +- Range (0, 3, step=1, splits=Some(4))
      
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:129)
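On an affected build, the same result can probably be obtained by aggregating first and applying the generator in a separate `select`, so that the generator never shares a `Project` with the aggregate functions. The sketch below is an assumption rather than something stated in this issue. The object name `ExplodeAfterAgg`, the helper `minMaxRows`, and the column alias `bounds` are hypothetical, and the code assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, max, min}

// Hypothetical helper object, for illustration only.
object ExplodeAfterAgg {
  // Step 1: agg(...) builds an explicit Aggregate producing one array column.
  // Step 2: select(explode(...)) applies the generator to that already-aggregated column.
  def minMaxRows(spark: SparkSession): DataFrame =
    spark.range(3)
      .agg(array(min(col("id")), max(col("id"))).as("bounds"))
      .select(explode(col("bounds")))

  def main(args: Array[String]): Unit = {
    // A local single-threaded session is assumed here purely for the demo.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explode-after-agg")
      .getOrCreate()
    try minMaxRows(spark).show()
    finally spark.stop()
  }
}
```

Because the aggregate and the generator sit in separate operators, the rewrite order of the analyzer rules should not matter for this form of the query.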
      

      The root cause is that `ExtractGenerator` rewrites a `Project` containing aggregate functions into a `Generate`
      before `GlobalAggregates` can replace that `Project` with an `Aggregate`, as the plan change log below shows:

      scala> sql("SET spark.sql.optimizer.planChangeLog.level=warn")
      scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()
      
      20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1: 
      === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
      !'Project [explode(array(min('id), max('id))) AS List()]   'Project [explode(array(min(id#72L), max(id#72L))) AS List()]
       +- Range (0, 3, step=1, splits=Some(4))                   +- Range (0, 3, step=1, splits=Some(4))
                 
      20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1: 
      === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator ===
      !'Project [explode(array(min(id#72L), max(id#72L))) AS List()]   Project [col#76L]
      !+- Range (0, 3, step=1, splits=Some(4))                         +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
      !                                                                   +- Range (0, 3, step=1, splits=Some(4))
                 
      20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1: 
      === Result of Batch Resolution ===
      !'Project [explode(array(min('id), max('id))) AS List()]   Project [col#76L]
      !+- Range (0, 3, step=1, splits=Some(4))                   +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
      !                                                             +- Range (0, 3, step=1, splits=Some(4))
                
      // the analysis failed here...
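The rule-ordering problem in the log above can be sketched with a toy model in plain Scala. Everything here is a simplified, hypothetical stand-in: the `Plan` and `Expr` types are not Catalyst's classes, the two functions only mimic the behavior of `GlobalAggregates` and `ExtractGenerator` described in this issue, and the `guard` flag is just one conceivable way to avoid the mis-ordering, not necessarily the change made in the actual patch.

```scala
// Toy expressions (hypothetical, not Catalyst classes).
sealed trait Expr
case class Col(name: String) extends Expr
case class AggFunc(name: String, child: Expr) extends Expr   // e.g. min(id)
case class ArrayOf(children: List[Expr]) extends Expr
case class Explode(child: Expr) extends Expr                   // a generator

// Toy logical plans (hypothetical, not Catalyst classes).
sealed trait Plan
case object RangeScan extends Plan
case class Project(exprs: List[Expr], child: Plan) extends Plan
case class Aggregate(exprs: List[Expr], child: Plan) extends Plan
case class Generate(generator: Expr, child: Plan) extends Plan

object RuleOrdering {
  def hasAgg(e: Expr): Boolean = e match {
    case _: AggFunc  => true
    case ArrayOf(cs) => cs.exists(hasAgg)
    case Explode(c)  => hasAgg(c)
    case Col(_)      => false
  }

  // Stand-in for GlobalAggregates: a Project with aggregate functions becomes an Aggregate.
  def globalAggregates(p: Plan): Plan = p match {
    case Project(es, child) if es.exists(hasAgg) => Aggregate(es, child)
    case other                                   => other
  }

  // Stand-in for ExtractGenerator. With guard = true it leaves a Project alone
  // when its generator contains aggregate functions, so globalAggregates can run first.
  def extractGenerator(guard: Boolean, p: Plan): Plan = p match {
    case Project(List(g: Explode), _) if guard && hasAgg(g) => p
    case Project(List(g: Explode), child)                   => Generate(g, child)
    case Aggregate(List(Explode(arr)), child) =>
      // Aggregate computes the array; Generate explodes the aggregated column.
      Generate(Explode(Col("arr")), Aggregate(List(arr), child))
    case other => other
  }

  // Apply both rules in a fixed order until the plan stops changing.
  def analyze(guard: Boolean, plan: Plan): Plan = {
    val rules = List[Plan => Plan](p => extractGenerator(guard, p), p => globalAggregates(p))
    val next = rules.foldLeft(plan)((p, rule) => rule(p))
    if (next == plan) plan else analyze(guard, next)
  }

  // Mirrors the reported check: Generate and Project must not contain aggregate functions.
  def isValid(p: Plan): Boolean = p match {
    case Generate(g, child)  => !hasAgg(g) && isValid(child)
    case Project(es, child)  => !es.exists(hasAgg) && isValid(child)
    case Aggregate(_, child) => isValid(child)
    case RangeScan           => true
  }

  // select(explode(array(min(id), max(id)))) over a range
  val query: Plan = Project(
    List(Explode(ArrayOf(List(AggFunc("min", Col("id")), AggFunc("max", Col("id")))))),
    RangeScan)

  def main(args: Array[String]): Unit = {
    println(analyze(guard = false, query)) // Generate directly over RangeScan, aggregates inside
    println(analyze(guard = true, query))  // Generate over Aggregate
  }
}
```

Without the guard, the toy `Generate` ends up holding `min` and `max`, which is the shape rejected in the log. With the guard, the `Project` first becomes an `Aggregate` and the generator is then extracted on top of it.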
      

            People

              Assignee: maropu Takeshi Yamamuro
              Reporter: maropu Takeshi Yamamuro
              Votes: 0
              Watchers: 2