Spark / SPARK-42346

count(DISTINCT colname) with UNION ALL causes query analyzer bug

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0, 3.4.0, 3.5.0
    • Fix Version/s: 3.3.2, 3.4.0, 3.5.0
    • Component/s: SQL
    • Labels: None

    Description

      Combining UNION ALL with count(DISTINCT colname) in scalar subqueries triggers a bug in the query analyzer.

      This behaviour was introduced in 3.3.0; the bug is not present in 3.2.1.

      Here is a reprex in PySpark:

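      import pandas as pd
      from pyspark.sql import SparkSession

      # Reuse the active session (e.g. in the pyspark shell) or start one.
      spark = SparkSession.builder.getOrCreate()

      # A single-row table is enough to reproduce the issue.
      df_pd = pd.DataFrame([
          {'surname': 'a', 'first_name': 'b'}
      ])
      df_spark = spark.createDataFrame(df_pd)
      df_spark.createOrReplaceTempView("input_table")

      # Each branch of the UNION ALL selects a scalar subquery that
      # applies Count(DISTINCT ...) to a different column.
      sql = """
      SELECT
          (SELECT Count(DISTINCT first_name) FROM input_table)
              AS distinct_value_count
      FROM input_table
      UNION ALL
      SELECT
          (SELECT Count(DISTINCT surname) FROM input_table)
              AS distinct_value_count
      FROM input_table
      """

      # Triggers the bug on affected versions.
      spark.sql(sql).toPandas()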
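
      To judge whether a given installation is likely to hit this, one could compare the running Spark version against the versions named in this ticket. The following is only a minimal sketch: `parse_version` and `is_likely_affected` are hypothetical helpers, not part of PySpark, and the affected range is an assumption that takes 3.3.0 as the first affected release (per the description) and 3.3.2 from the details above as the first release carrying the fix. Development builds of later branches are not considered.

```python
import re


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '3.3.1' or '3.4.0-SNAPSHOT'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_likely_affected(version: str) -> bool:
    """Hypothetical check for SPARK-42346.

    Assumption: releases from 3.3.0 up to, but not including, 3.3.2
    are affected; everything outside that range is treated as unaffected.
    """
    return (3, 3, 0) <= parse_version(version) < (3, 3, 2)
```

      With a live session, the check could be called as `is_likely_affected(spark.version)`.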

       

          People

            Assignee: Peter Toth (petertoth)
            Reporter: Robin (RobinLinacre)
            Votes: 0
            Watchers: 6
