Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.4.0
- Fix Version/s: None
Description
In PySpark, if the user calls RDD mapPartitions and the supplied function does not consume its input iterator, the Python worker reads the wrong signal afterwards and is not reused. Test to replicate:
def test_worker_reused_in_map_partition(self):
    def map_pid(iterator):
        # Fails when the iterator is not consumed, e.g. len(list(iterator))
        return (os.getpid() for _ in xrange(10))

    rdd = self.sc.parallelize([], 10)
    worker_pids_a = rdd.mapPartitions(map_pid).collect()
    worker_pids_b = rdd.mapPartitions(map_pid).collect()
    self.assertTrue(all([pid in worker_pids_a for pid in worker_pids_b]))
This is related to SPARK-26549, which fixes this issue but only for the rdd.range case.
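Not part of the original report: until a general fix lands, one user-side workaround is to drain the input iterator before returning, so the worker can read the end-of-stream signal cleanly and stay eligible for reuse. A minimal sketch under that assumption (map_pid and the drain-then-yield pattern are illustrative, not the project's fix):

import os

def map_pid(iterator):
    # Workaround sketch (an assumption, not the upstream fix): fully
    # consume the incoming iterator so the worker can later read the
    # end-of-stream signal and be reused for the next task.
    for _ in iterator:
        pass
    # range() here stands in for the Python 2 xrange() used in the test above.
    return (os.getpid() for _ in range(10))

# Usage mirrors the failing test:
# pids = sc.parallelize([], 10).mapPartitions(map_pid).collect()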
Issue Links
- relates to SPARK-26549 "PySpark worker reuse take no effect for parallelize xrange" (Resolved)