Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
Description
The original sort-based shuffle buffers shuffle input records in memory while sorting them. This causes problems when mutable records are presented to the shuffle, which happens in Spark SQL's Exchange operator. To work around this issue, SPARK-2967 and SPARK-4479 added defensive copying of shuffle inputs in the Exchange operator when sort-based shuffle is enabled.
I think that sandyr's recent patch for enabling serialization of records in sort-based shuffle (SPARK-4550) and my proposed unsafe-based shuffle path (SPARK-7081) may allow us to avoid this defensive copying in certain cases (since our patches cause records to be serialized one-at-a-time and remove the buffering of deserialized records).
As mentioned in SPARK-4479, a long-term fix for this issue might be to add hooks for informing the shuffle about object (im)mutability in order to allow the shuffle layer to decide whether to copy. In the meantime, though, I think that we should just extend the checks added in SPARK-4479 to avoid copies when these new serialized sort paths are used.