
SPARK-35336: PySpark - Using importlib + filter + named function + take causes PySpark to restart continuously until the machine runs out of memory


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0, 3.1.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Repository to reproduce the issue:

      https://github.com/CanaryWharf/pyspark-mem-importlib-bug-reproduction

      Expected behaviour:

      Program runs and exits cleanly.

      Actual behaviour:

      Program runs forever, eating up all the memory on the machine.

      Steps to reproduce:

      ```
      pip install -r requirements.txt
      python run.py
      ```

      The problem only occurs if you run the code via `importlib`; it does not occur when running `sparky.py` directly.
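
      For context, here is a minimal sketch of what the two files in the linked repository presumably look like, pieced together from the description in this issue. The dataset contents, app name, and `main()` entry point are illustrative assumptions; the essential combination is `importlib` + `filter` + a named function + `take`.

      ```
      # sparky.py -- hypothetical minimal reproduction; the real repo may differ.
      from pyspark.sql import SparkSession


      def func(stuff):
          # Named function passed to filter(); this is the trigger described above.
          return True


      def main():
          spark = SparkSession.builder.appName("importlib-repro").getOrCreate()
          dataset = spark.sparkContext.parallelize(["a", "b", "c"])
          # filter() with a named function followed by take() hangs when this
          # module is loaded via importlib, but works when run directly.
          print(dataset.filter(func).take(1))
          spark.stop()


      if __name__ == "__main__":
          main()
      ```

      ```
      # run.py -- loads sparky via importlib instead of executing it directly,
      # which is the condition under which the runaway restarts occur.
      import importlib

      sparky = importlib.import_module("sparky")
      sparky.main()
      ```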

      The problem also occurs if you replace `filter` with `map` or `flatMap` (any transformation that takes a function as an argument).

      The problem only occurs when the function involved is a named function (i.e., one defined with `def func`), whether it is passed directly or called from inside a lambda.

      So these break:

      ```
      def func(stuff):
          return True

      dataset.filter(func)
      ```

      ```
      def func(stuff):
          return True

      dataset.filter(lambda s: func(s))
      ```

      The problem does NOT occur if you do this:

      ```
      dataset.filter(lambda x: True)
      ```

      ```
      dataset.filter(lambda x: x == 'stuff')
      ```


    People

      Assignee: Unassigned
      Reporter: Raj Raj