SPARK-26573: Python worker not reused with mapPartitions if not consuming iterator

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

      Description

      In PySpark, if the user calls RDD mapPartitions and does not fully consume the iterator, the Python worker reads the wrong signal from the stream and is not reused. Test to replicate:

      def test_worker_reused_in_map_partition(self):
          import os

          def map_pid(iterator):
              # Fails when the iterator is not consumed; consuming it
              # first, e.g. len(list(iterator)), makes the test pass
              return (os.getpid() for _ in range(10))

          rdd = self.sc.parallelize([], 10)

          worker_pids_a = rdd.mapPartitions(map_pid).collect()
          worker_pids_b = rdd.mapPartitions(map_pid).collect()

          self.assertTrue(all(pid in worker_pids_a for pid in worker_pids_b))
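
      Until this is fixed, a workaround is to drain the input iterator inside the function before producing results, so the worker reaches the expected point in the stream. A minimal sketch (map_pid_consume is an illustrative name, not part of any patch):

      import os

      def map_pid_consume(iterator):
          # Exhaust the unused input so the worker reads the
          # end-of-stream signal correctly and can be reused
          for _ in iterator:
              pass
          return (os.getpid() for _ in range(10))

      worker_pids = rdd.mapPartitions(map_pid_consume).collect()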

      This is related to SPARK-26549, which fixes this issue, but only for use in rdd.range.
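
      For intuition, here is a simplified, hypothetical model of the failure mode (this is not Spark's actual worker code; only the END_OF_DATA_SECTION constant name mirrors pyspark.serializers). Data frames and control signals share one stream, so a function that never pulls from its input leaves frames unread, and the read that expects the reuse signal sees a frame length instead:

      import io
      import struct

      END_OF_DATA_SECTION = -1  # name mirrors pyspark.serializers.SpecialLengths

      def read_int(stream):
          return struct.unpack(">i", stream.read(4))[0]

      def make_stream(frames, reuse_flag):
          # Length-prefixed frames, then the end marker, then the flag
          buf = io.BytesIO()
          for frame in frames:
              buf.write(struct.pack(">i", len(frame)))
              buf.write(frame)
          buf.write(struct.pack(">i", END_OF_DATA_SECTION))
          buf.write(struct.pack(">i", reuse_flag))
          buf.seek(0)
          return buf

      def run_task(stream, func):
          def input_iterator():
              while True:
                  length = read_int(stream)
                  if length == END_OF_DATA_SECTION:
                      return
                  yield stream.read(length)
          for _ in func(input_iterator()):
              pass  # emit results (discarded in this model)
          return read_int(stream)  # should be the reuse flag

      # Consuming the input reads past the end marker, so the flag is seen:
      print(run_task(make_stream([b"hello"], 1), lambda it: iter(list(it))))  # 1
      # Ignoring the input leaves the frame unread; the "flag" is a length:
      print(run_task(make_stream([b"hello"], 1), lambda it: iter(())))        # 5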

    People

    • Assignee: Unassigned
    • Reporter: Bryan Cutler (bryanc)
    • Votes: 0
    • Watchers: 4

                Dates

                • Created:
                  Updated: