Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 3.4.1, 3.5.0
Description
The following query should return 1000000: every `src` key in `ee` is unique and appears exactly once in `minNbrs1`, so the join should preserve all 1,000,000 rows.
```scala
import org.apache.spark.sql.functions.{col, min}
import org.apache.spark.storage.StorageLevel

// 1,000,000 (src, dst) pairs spread over 5 partitions; each src is unique.
val df = spark.range(0, 1000000, 1, 5).map(l => (l, l))
val ee = df.select($"_1".as("src"), $"_2".as("dst"))
  .persist(StorageLevel.MEMORY_AND_DISK)
ee.count() // materialize the cached dataset

// One row per src key, so the join below should be one-to-one.
val minNbrs1 = ee
  .groupBy("src").agg(min(col("dst")).as("min_number"))
  .persist(StorageLevel.MEMORY_AND_DISK)

val join = ee.join(minNbrs1, "src")
join.count() // expected: 1000000
```
On Spark 3.5.0, however, a correctness bug causes it to return `104800` or some other smaller value.
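A minimal way to surface the discrepancy when reproducing this, added here as an illustrative check rather than part of the original report, is to assert that the join count matches the row count of `ee`:

```scala
// Illustrative check (not from the original report): every row of `ee`
// should survive the one-to-one join on "src", so the counts must match.
val expected = ee.count()   // 1000000
val actual   = join.count() // smaller (e.g. 104800) on affected versions
assert(actual == expected, s"correctness bug: got $actual, expected $expected")
```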
Issue Links
- relates to: SPARK-45282 Join loses records for cached datasets (Resolved)