Details
- Bug
- Status: Resolved
- Major
- Resolution: Incomplete
- 1.0.2, 1.1.0
- None
Description
If the values for a single key do not collectively fit into memory, the map will still run out of memory (OOM) when the spilled contents are merged back in, because the merge has to hold all of that key's values at once.
This is especially a problem for PySpark, since we hash the keys (Python objects) before a shuffle. A hash is an integer from a limited range, so distinct keys can collide, and the values of all colliding keys then have to be merged together.
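The failure mode can be sketched with a toy model. This is not Spark's implementation: `group_with_spill`, `memory_limit`, and `hash_fn` are hypothetical names, the "spills" are plain in-memory dicts standing in for spill files, and memory is counted in number of buffered values rather than bytes. The sketch only shows that spilling bounds memory while inserting, but not while merging one hash bucket back in.

```python
from collections import defaultdict


def group_with_spill(records, memory_limit, hash_fn=hash):
    """Toy external group-by (hypothetical, not Spark code).

    Buffers (key, value) pairs by hash_fn(key) and "spills" the buffer
    whenever it holds memory_limit values. Returns the grouped result and
    the largest number of values held for a single hash during the merge.
    """
    in_memory = defaultdict(list)
    buffered = 0
    spills = []  # each entry stands in for one spill file
    for key, value in records:
        in_memory[hash_fn(key)].append((key, value))
        buffered += 1
        if buffered >= memory_limit:
            spills.append(dict(in_memory))
            in_memory = defaultdict(list)
            buffered = 0
    spills.append(dict(in_memory))

    # Merge phase: for each hash, every spill's values are pulled back at once.
    all_hashes = set()
    for spill in spills:
        all_hashes.update(spill)

    merged = {}  # kept whole only so the toy can return it
    peak = 0
    for h in all_hashes:
        bucket = []
        for spill in spills:
            bucket.extend(spill.get(h, []))
        peak = max(peak, len(bucket))  # this is what must fit in memory
        for key, value in bucket:
            merged.setdefault(key, []).append(value)
    return merged, peak


# One hot key: the insert phase never buffers more than 10 values,
# but the merge has to hold all 100 values of that key together.
hot_groups, hot_peak = group_with_spill(
    [("hot", i) for i in range(100)], memory_limit=10
)

# Distinct keys forced to collide by a deliberately poor hash (k % 2).
_, collide_peak = group_with_spill(
    [(i, i) for i in range(50)], memory_limit=10, hash_fn=lambda k: k % 2
)
```

In this model `hot_peak` exceeds `memory_limit` even though no key was ever buffered beyond the limit before a spill, and the colliding case behaves the same way although every key has only one value.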
Issue Links
- relates to SPARK-3074 support groupByKey() with hot keys in PySpark (Resolved)