[SPARK-25538] incorrect row counts after distinct() - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 2.4.0
Component/s: SQL
Labels:
- correctness
Environment:

Reproduced on a CentOS 7 VM and from source in IntelliJ on OS X.

Description

It appears that df.distinct.count can return incorrect values after SPARK-23713. Other operations may be affected as well; distinct just happens to be the one that we noticed. I believe that this issue was introduced by SPARK-23713 because I can't reproduce it on commits before that one, and I've been able to reproduce it after that commit as well as with tags/v2.4.0-rc1.

Below are example spark-shell sessions to illustrate the problem. Unfortunately the data used in these examples can't be uploaded to this Jira ticket. I'll try to create test data which also reproduces the issue, and will upload that if I'm able to do so.

Example from Spark 2.3.1, which behaves correctly:

scala> val df = spark.read.parquet("hdfs:///data")
df: org.apache.spark.sql.DataFrame = [<redacted>]

scala> df.count
res0: Long = 123

scala> df.distinct.count
res1: Long = 115

Example from Spark 2.4.0-rc1, which returns different output:

scala> val df = spark.read.parquet("hdfs:///data")
df: org.apache.spark.sql.DataFrame = [<redacted>]

scala> df.count
res0: Long = 123

scala> df.distinct.count
res1: Long = 116

scala> df.sort("col_0").distinct.count
res2: Long = 123

scala> df.withColumnRenamed("col_0", "newName").distinct.count
res3: Long = 115
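
The session above suggests a simple cross-check: logically equivalent plans should all yield the same distinct count. The following is only a sketch of that idea, not code from this ticket or from Spark; the helper names distinctCountVariants and distinctCountsAgree, and the "_renamed" suffix, are hypothetical. It assumes a DataFrame df is already loaded and that the given column exists.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of Spark or of this ticket): computes the
// distinct row count through three logically equivalent plans, mirroring
// the variants tried in the spark-shell session.
def distinctCountVariants(df: DataFrame, column: String): Map[String, Long] = Map(
  "plain"   -> df.distinct().count(),
  "sorted"  -> df.sort(column).distinct().count(),
  "renamed" -> df.withColumnRenamed(column, column + "_renamed").distinct().count()
)

// On a correct build all three counts are equal; any disagreement
// indicates the kind of inconsistency described in this report.
def distinctCountsAgree(df: DataFrame, column: String): Boolean =
  distinctCountVariants(df, column).values.toSet.size == 1
```

For example, distinctCountVariants(df, "col_0") in spark-shell returns a map of the three counts, which makes a disagreement easy to spot without comparing separate results by hand.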

Attachments

SPARK-25538-repro.tgz (4 kB), 30/Sep/18 05:17, Steven Rand

Issue Links

is broken by: SPARK-23713 Clean-up UnsafeWriter classes (Resolved)

links to: [GitHub] Pull Request #22602 (mgaido91)

People

Assignee: Marco Gaido

Reporter: Steven Rand

Votes: 1

Watchers: 10

Dates

Created: 26/Sep/18 05:53

Updated: 20/Aug/20 15:39

Resolved: 03/Oct/18 14:29