Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 3.4.2, 3.5.0, 3.3.4
Fix Version/s: None
Description
I found this problem using Hypothesis.
Here's a reproduction that fails on master, 3.5.0, 3.4.2, and 3.3.4 (and probably all prior versions as well):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

SUM_EXAMPLE = [
    (1.0,),
    (0.0,),
    (1.0,),
    (9007199254740992.0,),
]

spark = (
    SparkSession.builder
    .config("spark.log.level", "ERROR")
    .getOrCreate()
)

def compare_sums(data, num_partitions):
    df = spark.createDataFrame(data, "val double").coalesce(1)
    result1 = df.agg(sum(col("val"))).collect()[0][0]
    df = spark.createDataFrame(data, "val double").repartition(num_partitions)
    result2 = df.agg(sum(col("val"))).collect()[0][0]
    assert result1 == result2, f"{result1}, {result2}"

if __name__ == "__main__":
    print(compare_sums(SUM_EXAMPLE, 2))
This fails as follows:
AssertionError: 9007199254740994.0, 9007199254740992.0
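For context, the sum runs as a two-phase aggregation (a partial sum per partition, then a final merge), which is where the partitioning can enter. A sketch of how to see this, assuming the session and imports from the reproduction above (exact plan output varies by Spark version):

# Inspect the physical plan for the aggregation.
df = spark.createDataFrame(SUM_EXAMPLE, "val double").repartition(2)
df.agg(sum(col("val"))).explain()
# The plan shows a HashAggregate computing partial_sum(val) per partition,
# an exchange, and a final HashAggregate that merges the partial sums.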
I suspected some kind of problem related to code generation, so I tried setting all of these to false:
- spark.sql.codegen.wholeStage
- spark.sql.codegen.aggregate.map.twolevel.enabled
- spark.sql.codegen.aggregate.splitAggregateFunc.enabled
But this did not change the behavior.
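For reference, here is one way those flags can be disabled, a sketch assuming the spark session from the reproduction above (they are runtime SQL configs, so they can also be passed to the builder via .config(...)):

# Disable the codegen-related settings before running the query.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.conf.set("spark.sql.codegen.aggregate.map.twolevel.enabled", "false")
spark.conf.set("spark.sql.codegen.aggregate.splitAggregateFunc.enabled", "false")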
Somehow, the partitioning of the data affects the computed sum.
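This is consistent with the "Not A Problem" resolution: 9007199254740992 is 2**53, the first integer whose successor an IEEE 754 double cannot represent, so floating-point addition is not associative and the result of a sum depends on how the addends are grouped. With one partition the values are accumulated left to right; with two partitions each partition is summed locally and the partial sums are then merged. A pure-Python sketch (no Spark involved; the two-partition split shown is one hypothetical grouping) reproduces both reported values:

# One partition: accumulate left to right.
# 1.0 + 0.0 + 1.0 == 2.0, and 2**53 + 2.0 is exactly representable.
vals = [1.0, 0.0, 1.0, 9007199254740992.0]  # 9007199254740992.0 == 2.0**53
acc = 0.0
for v in vals:
    acc += v
print(acc)  # 9007199254740994.0

# Two partitions: sum each partition locally, then merge the partial sums.
part1 = 1.0 + 9007199254740992.0  # 2**53 + 1 is not representable; rounds to 2**53
part2 = 0.0 + 1.0
print(part1 + part2)              # 2**53 + 1.0 rounds back to 2**53: 9007199254740992.0

If a partition-order-independent result is required, one option is to aggregate in exact decimal arithmetic, e.g. sum(col("val").cast("decimal(38,3)")), at some cost in performance.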