SPARK-38983

Pyspark throws AnalysisException with incorrect error message when using .grouping() or .groupingId() (AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;)


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.2, 3.2.1
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      In a nutshell

      PySpark emits an incorrect error message when the user makes a type error with the result of the grouping() function.

      Code to reproduce

      print(spark.version) # My environment, Azure Databricks, defines spark automatically.
      from pyspark.sql import functions as f
      from pyspark.sql import types as t

      l = [
        ('a',),
        ('b',),
      ]
      s = t.StructType([
        t.StructField('col1', t.StringType())
      ])
      df = spark.createDataFrame(l, s)
      df.display()

      ( # This expression raises an AnalysisException()
        df
        .cube(f.col('col1'))
        .agg(f.grouping('col1') & f.lit(True))
        .collect()
      )

      Expected results

      The code should produce an AnalysisException() with an error message along the lines of:
      AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and boolean).;
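
      For comparison, the same kind of int/boolean type error outside a grouping construct does produce that style of message. A minimal sketch (not part of the original reproduction; it assumes the df defined above, and relies on f.count() returning a bigint so the AND is ill-typed):

      ( # Sketch: raises a data-type-mismatch AnalysisException
        # ("bigint and boolean"), i.e. the style of message expected above.
        df
        .groupBy(f.col('col1'))
        .agg(f.count('col1') & f.lit(True))
        .collect()
      )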

      Actual results

      The code throws an AnalysisException() with the error message:
      AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;

      Python provides the following traceback:
      ---------------------------------------------------------------------------
      AnalysisException                         Traceback (most recent call last)
      <command-2283735107422632> in <module>
           15 
           16 ( # This expression raises an AnalysisException()
      ---> 17   df
           18   .cube(f.col('col1'))
           19   .agg(f.grouping('col1') & f.lit(True))

      /databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
          116             # Columns
          117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
      --> 118             jdf = self._jgd.agg(exprs[0]._jc,
          119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
          120         return DataFrame(jdf, self.sql_ctx)

      /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
         1302 
         1303         answer = self.gateway_client.send_command(command)
      -> 1304         return_value = get_return_value(
         1305             answer, self.gateway_client, self.target_id, self.name)
         1306 

      /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
          121                 # Hide where the exception came from that shows a non-Pythonic
          122                 # JVM exception message.
      --> 123                 raise converted from None
          124             else:
          125                 raise

      AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;
      'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551]
      +- LogicalRDD [col1#548], false

      Workaround

      Note: The reason I opened this ticket is that, when the user makes a particular type error, the resulting error message is misleading. The code snippet below shows how to fix that type error. It does not address the misleading-error-message bug, which is the focus of this ticket.

      Cast the result of .grouping() to boolean type. That is, know from the outset that .grouping() produces an integer 0 or 1 rather than a boolean True or False.

      (  # This expression does not raise an AnalysisException()
        df
        .cube(f.col('col1'))
        .agg(f.grouping('col1').cast(t.BooleanType()) & f.lit(True))
        .collect()
      )
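
      An equivalent alternative (a sketch along the same lines, assuming the same df) is to compare the integer result of .grouping() to 1, which also yields a boolean:

      (  # Alternative sketch: grouping() == 1 converts the int result to a boolean
        df
        .cube(f.col('col1'))
        .agg((f.grouping('col1') == f.lit(1)) & f.lit(True))
        .collect()
      )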

      Additional notes

      The same error occurs if .cube() is replaced with .rollup() in "Code to reproduce".

      The same error occurs if .grouping() is replaced with .grouping_id() in "Code to reproduce".
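
      For concreteness, here are sketches of those two variants (assuming the same df; each raises the analogous misleading message):

      ( # Variant sketch: .rollup() instead of .cube()
        df
        .rollup(f.col('col1'))
        .agg(f.grouping('col1') & f.lit(True))
        .collect()
      )

      ( # Variant sketch: .grouping_id() instead of .grouping()
        df
        .cube(f.col('col1'))
        .agg(f.grouping_id('col1') & f.lit(True))
        .collect()
      )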

      Related tickets

      https://issues.apache.org/jira/browse/SPARK-22748

      People

        Assignee: Unassigned
        Reporter: Chris Kimmel