Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Versions: 3.0.0, 3.1.1
- Fix Versions: None
- Components: None
Description
Repo to reproduce the issue:
https://github.com/CanaryWharf/pyspark-mem-importlib-bug-reproduction
Expected behaviour:
The program runs and exits cleanly.
Actual behaviour:
The program runs forever, consuming all the memory on the machine.
Steps to reproduce:
```
pip install -r requirements.txt
python run.py
```
The problem only occurs if you run the code via `importlib`; it does not occur when running `sparky.py` directly.
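For reference, a minimal sketch of the `importlib` loading path (the exact contents of `run.py` are in the linked repo; this is an assumed shape, not the repo's actual code):
```
import importlib.util

# Hypothetical run.py: load sparky.py as a module instead of running it
# directly. Loading it this way triggers the hang; `python sparky.py` does not.
spec = importlib.util.spec_from_file_location("sparky", "sparky.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```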
Furthermore, the problem also occurs if you replace `filter` with `map` or `flatMap` (anything that takes a lambda or function).
The problem only occurs when a named function is involved (i.e., one defined with `def`).
So these break:
```
def func(stuff):
    return True

dataset.filter(func)
```
```
def func(stuff):
    return True

dataset.filter(lambda s: func(s))
```
The problem does NOT occur if you do this:
```
dataset.filter(lambda x: True)
```
```
dataset.filter(lambda x: x == 'stuff')
```
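Putting the pieces together, a hypothetical minimal `sparky.py` matching the failing pattern (the actual file is in the linked repo; the data and names here are illustrative only):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
dataset = spark.sparkContext.parallelize(["stuff", "things"])

def func(stuff):
    return True

# Hangs and consumes memory when this module is loaded via importlib,
# but runs and exits cleanly when executed directly.
print(dataset.filter(func).collect())
spark.stop()
```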