
SPARK-35336: PySpark - Using importlib + filter + named function + take causes PySpark to restart continuously until the machine runs out of memory


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0, 3.1.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Repository to reproduce the issue:

      https://github.com/CanaryWharf/pyspark-mem-importlib-bug-reproduction

      Expected behaviour:

      Program runs and exits cleanly.

      Actual behaviour:

      Program runs forever, eating up all the memory on the machine.

      Steps to reproduce:

      ```
      pip install -r requirements.txt
      python run.py
      ```

      The problem only occurs if you run the code via `importlib`; it does not occur when running `sparky.py` directly.
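
      For context, here is a minimal sketch of what the two files in the linked repository presumably look like, pieced together from the description in this issue. The dataset contents, app name, and `main()` entry point are illustrative assumptions; the essential combination is `importlib` + `filter` + a named function + `take`.

      ```
      # sparky.py -- hypothetical minimal reproduction; the real repo may differ.
      from pyspark.sql import SparkSession


      def func(stuff):
          # Named function passed to filter(); this is the trigger described above.
          return True


      def main():
          spark = SparkSession.builder.appName("importlib-repro").getOrCreate()
          dataset = spark.sparkContext.parallelize(["a", "b", "c"])
          # filter() with a named function followed by take() hangs when this
          # module is loaded via importlib, but works when run directly.
          print(dataset.filter(func).take(1))
          spark.stop()


      if __name__ == "__main__":
          main()
      ```

      ```
      # run.py -- loads sparky via importlib instead of executing it directly,
      # which is the condition under which the runaway restarts occur.
      import importlib

      sparky = importlib.import_module("sparky")
      sparky.main()
      ```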

      The problem also occurs if you replace `filter` with `map` or `flatMap` (any transformation that takes a function as an argument).

      The problem only occurs when the function involved is a named function (i.e., one defined with `def func`), whether it is passed directly or called from inside a lambda.

      So these break:

      ```
      def func(stuff):
          return True

      dataset.filter(func)
      ```

      ```
      def func(stuff):
          return True

      dataset.filter(lambda s: func(s))
      ```

      The problem does NOT occur if you do this:

      ```
      dataset.filter(lambda x: True)
      ```

      ```
      dataset.filter(lambda x: x == 'stuff')
      ```


    People

      Assignee: Unassigned
      Reporter: Raj Raj