[SPARK-18274] Memory leak in PySpark StringIndexer - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
Fix Version/s: 2.0.3, 2.1.0, 2.2.0
Component/s: ML, PySpark
Labels: None
Target Version/s: 2.0.3, 2.1.0

Description

StringIndexerModel is not collected by the GC in Java even after it is deleted in Python. The following code reproduces the problem; it fails after a few iterations (around 7 if you set driver memory to 600MB):

import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)

An explicit call to the Python GC fixes the issue; the following code runs fine:

import gc

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
    gc.collect()  # force a Python garbage collection after each fit

The issue is similar to SPARK-6194 and can probably be fixed by calling the JVM detach in the model's destructor. This is implemented in pyspark.mllib.common.JavaModelWrapper but missing in pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be affected by this memory leak.
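
To illustrate the proposed fix, here is a minimal sketch of a wrapper that detaches its JVM-side object in its destructor. It is not the actual Spark patch: FakeGateway and WrapperSketch are hypothetical stand-ins so the example runs without a JVM. The only assumption about the real gateway is that it offers a detach call that releases a Java object.

```python
class FakeGateway:
    """Hypothetical stand-in for a JVM gateway; records detached handles."""

    def __init__(self):
        self.detached = []

    def detach(self, java_obj):
        # A real gateway would tell the JVM to drop its reference here.
        self.detached.append(java_obj)


class WrapperSketch:
    """Hypothetical Python wrapper holding a handle to a JVM-side object."""

    def __init__(self, gateway, java_obj):
        self._gateway = gateway
        self._java_obj = java_obj

    def __del__(self):
        # Release the JVM-side reference when the Python wrapper is destroyed.
        if self._gateway is not None and self._java_obj is not None:
            self._gateway.detach(self._java_obj)


gateway = FakeGateway()
model = WrapperSketch(gateway, "jvm-handle-1")
del model  # in CPython the last reference is gone, so __del__ runs here
```

With such a destructor the JVM-side object is released as soon as the Python wrapper is destroyed, without relying on an explicit gc.collect() in the loop.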

Issue Links

relates to

SPARK-18630 PySpark ML memory leak (Resolved)

links to

[Github] Pull Request #15843 (techaddict)

People

Assignee: Sandeep Singh

Reporter: Jonas Amrich

Shepherd: Joseph K. Bradley

Votes: 0

Watchers: 4

Dates

Created: 04/Nov/16 15:31

Updated: 08/Dec/16 07:29

Resolved: 01/Dec/16 21:23