Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 3.4.2, 3.5.0, 3.3.4
Fix Version/s: None
Description
I found this problem using Hypothesis.
Here's a reproduction that fails on master, 3.5.0, 3.4.2, and 3.3.4 (and probably all prior versions as well):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

SUM_EXAMPLE = [
    (1.0,),
    (0.0,),
    (1.0,),
    (9007199254740992.0,),
]

spark = (
    SparkSession.builder
    .config("spark.log.level", "ERROR")
    .getOrCreate()
)

def compare_sums(data, num_partitions):
    df = spark.createDataFrame(data, "val double").coalesce(1)
    result1 = df.agg(sum(col("val"))).collect()[0][0]
    df = spark.createDataFrame(data, "val double").repartition(num_partitions)
    result2 = df.agg(sum(col("val"))).collect()[0][0]
    assert result1 == result2, f"{result1}, {result2}"

if __name__ == "__main__":
    print(compare_sums(SUM_EXAMPLE, 2))
This fails as follows:
AssertionError: 9007199254740994.0, 9007199254740992.0
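For context, the sum runs as a two-phase aggregation (a partial sum per partition, then a final merge), which is where the partitioning can enter. A sketch of how to see this, assuming the session and imports from the reproduction above (exact plan output varies by Spark version):

# Inspect the physical plan for the aggregation.
df = spark.createDataFrame(SUM_EXAMPLE, "val double").repartition(2)
df.agg(sum(col("val"))).explain()
# The plan shows a HashAggregate computing partial_sum(val) per partition,
# an exchange, and a final HashAggregate that merges the partial sums.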
I suspected some kind of problem related to code generation, so I tried setting all of these to false:
- spark.sql.codegen.wholeStage
- spark.sql.codegen.aggregate.map.twolevel.enabled
- spark.sql.codegen.aggregate.splitAggregateFunc.enabled
But this did not change the behavior.
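For reference, here is one way those flags can be disabled, a sketch assuming the spark session from the reproduction above (they are runtime SQL configs, so they can also be passed to the builder via .config(...)):

# Disable the codegen-related settings before running the query.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.conf.set("spark.sql.codegen.aggregate.map.twolevel.enabled", "false")
spark.conf.set("spark.sql.codegen.aggregate.splitAggregateFunc.enabled", "false")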
Somehow, the partitioning of the data affects the computed sum.
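This is consistent with the "Not A Problem" resolution: 9007199254740992 is 2**53, the first integer whose successor an IEEE 754 double cannot represent, so floating-point addition is not associative and the result of a sum depends on how the addends are grouped. With one partition the values are accumulated left to right; with two partitions each partition is summed locally and the partial sums are then merged. A pure-Python sketch (no Spark involved; the two-partition split shown is one hypothetical grouping) reproduces both reported values:

# One partition: accumulate left to right.
# 1.0 + 0.0 + 1.0 == 2.0, and 2**53 + 2.0 is exactly representable.
vals = [1.0, 0.0, 1.0, 9007199254740992.0]  # 9007199254740992.0 == 2.0**53
acc = 0.0
for v in vals:
    acc += v
print(acc)  # 9007199254740994.0

# Two partitions: sum each partition locally, then merge the partial sums.
part1 = 1.0 + 9007199254740992.0  # 2**53 + 1 is not representable; rounds to 2**53
part2 = 0.0 + 1.0
print(part1 + part2)              # 2**53 + 1.0 rounds back to 2**53: 9007199254740992.0

If a partition-order-independent result is required, one option is to aggregate in exact decimal arithmetic, e.g. sum(col("val").cast("decimal(38,3)")), at some cost in performance.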