Spark / SPARK-25538

incorrect row counts after distinct()

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels:
    • Environment:

      Reproduced on a Centos7 VM and from source in Intellij on OS X.

      Description

      It appears that df.distinct.count can return incorrect values after SPARK-23713. Other operations may be affected as well; distinct just happens to be the one that we noticed. I believe that this issue was introduced by SPARK-23713 because I can't reproduce it on any commit before that one, and I can reproduce it on that commit and later, including tags/v2.4.0-rc1.

      Below are example spark-shell sessions to illustrate the problem. Unfortunately the data used in these examples can't be uploaded to this Jira ticket. I'll try to create test data which also reproduces the issue, and will upload that if I'm able to do so.

      Example from Spark 2.3.1, which behaves correctly:

      scala> val df = spark.read.parquet("hdfs:///data")
      df: org.apache.spark.sql.DataFrame = [<redacted>]
      
      scala> df.count
      res0: Long = 123
      
      scala> df.distinct.count
      res1: Long = 115
      

      Example from Spark 2.4.0-rc1, which returns inconsistent distinct counts (116, 123, and 115) for the same data:

      scala> val df = spark.read.parquet("hdfs:///data")
      df: org.apache.spark.sql.DataFrame = [<redacted>]
      
      scala> df.count
      res0: Long = 123
      
      scala> df.distinct.count
      res1: Long = 116
      
      scala> df.sort("col_0").distinct.count
      res2: Long = 123
      
      scala> df.withColumnRenamed("col_0", "newName").distinct.count
      res3: Long = 115
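
      The sessions above suggest a simple cross-check: sorting or renaming a column should not change the number of distinct rows, so any variant that disagrees with the plain distinct count points at the bug. Below is a minimal sketch of such a check, assuming a hypothetical helper named DistinctCountCheck (it is not part of Spark). The helper itself is plain Scala; the spark-shell usage in the trailing comments assumes df is loaded as in the sessions above.

```scala
// Hypothetical helper (not part of Spark): cross-checks counts that should agree.
object DistinctCountCheck {
  /** Runs each named count and returns the (name, count) pairs that differ from `baseline`. */
  def disagreements(
      baseline: Long,
      variants: Seq[(String, () => Long)]): Seq[(String, Long)] =
    variants
      .map { case (name, compute) => (name, compute()) }
      .filter { case (_, count) => count != baseline }
}

// Possible usage in spark-shell, assuming `df` and the column "col_0" exist:
//
//   val variants: Seq[(String, () => Long)] = Seq(
//     "sort + distinct"   -> (() => df.sort("col_0").distinct.count),
//     "rename + distinct" -> (() => df.withColumnRenamed("col_0", "newName").distinct.count),
//     "dropDuplicates"    -> (() => df.dropDuplicates().count)
//   )
//   DistinctCountCheck.disagreements(df.distinct.count, variants)
```

      An empty result means every variant agrees with the baseline. A non-empty result only shows that the counts are inconsistent with each other; it does not say which of them is correct.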
      

        Attachments

        1. SPARK-25538-repro.tgz
          4 kB
          Steven Rand

            People

            • Assignee: Marco Gaido (mgaido)
            • Reporter: Steven Rand
            • Votes: 1
            • Watchers: 11
