SPARK-31430: Bug in the approximate quantile computation.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      I am seeing a bug where passing a lower relative error to the approxQuantile function leads to incorrect results when the DataFrame has multiple partitions. Setting a relative error of 1e-6 causes it to compute equal values for the 0.9 and 1.0 quantiles. Coalescing back to a single partition gives the correct results. This issue was not present in Spark 2.4.5; we noticed it when testing 3.0.0-preview.

      >>> from pyspark.sql import types as T  # needed for the schema below
      >>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv', header=True, schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))
      >>> df = df.repartition(200, 'Store').localCheckpoint()
      >>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)
      [1422576000.0, 1430352000.0, 1438300800.0]
      >>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.00001)
      [1422576000.0, 1430524800.0, 1438300800.0]
      >>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.000001)
      [1422576000.0, 1438300800.0, 1438300800.0]
      >>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.000001)
      [1422576000.0, 1430524800.0, 1438300800.0]

      Attachments

        1. approx_quantile_data.csv (12.08 MB, uploaded by Siddartha Naidu)


    People

      Assignee: Unassigned
      Reporter: Siddartha Naidu
      Votes: 0
      Watchers: 4
