  1. Spark
  2. SPARK-18358

Multiple Aggregation Using 'countDistinct' and 'first' result in error

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 2.0.2
    • Component/s: None
    • Labels:
      None
    • Environment:

      Mac OS X 10.9.5
      Apache Spark 2.0.1
      Hadoop 1.4

    • Target Version/s:

      Description

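      In PySpark, when I perform multiple aggregations on the same groupBy object using both 'first' and 'countDistinct', the call fails with a Py4JJavaError. The failure only occurs when two 'countDistinct' aggregations are combined with at least one 'first'; the other combinations shown here work.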

      from pyspark.sql import SparkSession
      import pyspark.sql.functions as sfn
      
      spark = SparkSession.builder.master('local').getOrCreate()
      
      df = spark.createDataFrame([
              (1, 'a', 'z'),
              (1, 'b', 'x'),
              (1, 'a', 'y'),
              (1, 'a', 'x'),
              (2, 'b', 'z'),
              (2, 'b', 'z')
          ], ['id', 'var1', 'var2'])
      
      ## Using two 'first' and one 'countDistinct' aggregations works
      df.groupby('id') \
          .agg(sfn.first('var1'),
               sfn.first('var2'),
               sfn.countDistinct('var1')) \
          .show()
      
      ## Using one 'max' with both 'countDistinct' works
      df.groupby('id') \
          .agg(sfn.max('var2'),
               sfn.countDistinct('var1'),
               sfn.countDistinct('var2')) \
          .show()
      
      ## But using both 'countDistinct' with at least one 'first' crashes
      df.groupby('id') \
          .agg(sfn.first('var1'),
               sfn.first('var2'),
               sfn.countDistinct('var1'),
               sfn.countDistinct('var2')) \
          .show()
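
      For reference, the result the failing query would be expected to return can be worked out without Spark. The following is a minimal plain-Python sketch, not part of the original report; the helper name `expected_aggregates` is hypothetical, and it assumes 'first' means the first row of each group in input order, which Spark does not guarantee without an explicit ordering.

```python
from collections import OrderedDict

# Same rows as the DataFrame in the report: (id, var1, var2)
rows = [
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z'),
]


def expected_aggregates(rows):
    """Hypothetical helper: per-id (first var1, first var2,
    distinct count of var1, distinct count of var2).

    Assumption: 'first' is taken in input order.
    """
    # Group the (var1, var2) pairs by id, preserving input order.
    groups = OrderedDict()
    for key, var1, var2 in rows:
        groups.setdefault(key, []).append((var1, var2))

    result = {}
    for key, members in groups.items():
        var1_values = [m[0] for m in members]
        var2_values = [m[1] for m in members]
        result[key] = (
            var1_values[0],         # first(var1)
            var2_values[0],         # first(var2)
            len(set(var1_values)),  # countDistinct(var1)
            len(set(var2_values)),  # countDistinct(var2)
        )
    return result


expected = expected_aggregates(rows)
for key, values in expected.items():
    print(key, values)
```

      Comparing this against the output of the two working queries is one way to confirm that only the combined 'first' plus double 'countDistinct' case is affected.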
      

              People

              • Assignee:
                Unassigned
              • Reporter:
                Chris Nasrallah (nasrallah)
              • Votes:
                0
              • Watchers:
                2