Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.4.0
- Fix Version/s: None
Description
In PySpark, if the user calls RDD mapPartitions and the supplied function does not consume its input iterator, the Python worker reads the wrong signal afterwards and is not reused. Test to replicate:
def test_worker_reused_in_map_partition(self):
    def map_pid(iterator):
        # Fails when the iterator is not consumed, e.g. len(list(iterator))
        return (os.getpid() for _ in xrange(10))

    rdd = self.sc.parallelize([], 10)
    worker_pids_a = rdd.mapPartitions(map_pid).collect()
    worker_pids_b = rdd.mapPartitions(map_pid).collect()
    self.assertTrue(all([pid in worker_pids_a for pid in worker_pids_b]))
This is related to SPARK-26549, which fixes this issue but only for the rdd.range case.
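Not part of the original report: until a general fix lands, one user-side workaround is to drain the input iterator before returning, so the worker can read the end-of-stream signal cleanly and stay eligible for reuse. A minimal sketch under that assumption (map_pid and the drain-then-yield pattern are illustrative, not the project's fix):

import os

def map_pid(iterator):
    # Workaround sketch (an assumption, not the upstream fix): fully
    # consume the incoming iterator so the worker can later read the
    # end-of-stream signal and be reused for the next task.
    for _ in iterator:
        pass
    # range() here stands in for the Python 2 xrange() used in the test above.
    return (os.getpid() for _ in range(10))

# Usage mirrors the failing test:
# pids = sc.parallelize([], 10).mapPartitions(map_pid).collect()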
Issue Links
- relates to SPARK-26549 "PySpark worker reuse take no effect for parallelize xrange" (Resolved)