SPARK-26573: Python worker not reused with mapPartitions if not consuming iterator


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      In PySpark, if the user calls RDD mapPartitions with a function that does not consume its input iterator, the Python worker reads the wrong signal from the stream and is not reused. Test to replicate:

      import os

      def test_worker_reused_in_map_partition(self):

          def map_pid(iterator):
              # Worker reuse fails when the input iterator is left
              # unconsumed; consuming it first, e.g. len(list(iterator)),
              # makes the test pass.
              return (os.getpid() for _ in range(10))

          rdd = self.sc.parallelize([], 10)

          worker_pids_a = rdd.mapPartitions(map_pid).collect()
          worker_pids_b = rdd.mapPartitions(map_pid).collect()

          self.assertTrue(all(pid in worker_pids_a for pid in worker_pids_b))
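      As a user-side workaround, a minimal sketch (the drain loop is illustrative, not part of the reported test): exhaust the input iterator before returning results, so the worker reads the end-of-stream signal at the expected position and can be reused.

      import os

      def map_pid(iterator):
          # Drain the unconsumed input so the worker ends the task at the
          # correct stream position and can be reused for the next task.
          for _ in iterator:
              pass
          return (os.getpid() for _ in range(10))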

      This is related to SPARK-26549, which fixes this issue, but only for rdd.range.
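
      For context, the general shape of such a fix is sketched below; this is an illustration under assumed names (process and user_func are hypothetical), not the actual Spark patch: after the user function's output is fully produced, drain whatever input it left unread.

      def process(input_iterator, user_func):
          # Hypothetical wrapper: yield the user's results first, then
          # drain any input left unconsumed so the worker reads the
          # end-of-data marker at the right position and can be reused.
          for item in user_func(input_iterator):
              yield item
          for _ in input_iterator:
              pass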

    People

      • Assignee: Unassigned
      • Reporter: Bryan Cutler (bryanc)
