SPARK-18274

Memory leak in PySpark StringIndexer


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
    • Fix Version/s: 2.0.3, 2.1.0, 2.2.0
    • Component/s: ML, PySpark
    • Labels: None

    Description

      A StringIndexerModel is not collected by the Java GC even after it has been deleted in Python. The problem can be reproduced with the following code, which fails after a couple of iterations (around 7 if driver memory is set to 600MB):

      import random, string
      from pyspark.ml.feature import StringIndexer
      
      l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
      df = spark.createDataFrame(l, ['string'])
      
      for i in range(50):
          indexer = StringIndexer(inputCol='string', outputCol='index')
          indexer.fit(df)
      

      An explicit call to the Python GC fixes the issue; the following code runs fine:

      import gc

      for i in range(50):
          indexer = StringIndexer(inputCol='string', outputCol='index')
          indexer.fit(df)
          gc.collect()  # explicit Python GC pass; with it the loop no longer runs out of memory
      

      The issue is similar to SPARK-6194 and can probably be fixed by detaching the JVM object in the model's destructor. This is implemented in pyspark.mllib.common.JavaModelWrapper but missing in pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be affected by this memory leak.
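
      The destructor-based approach suggested above could look roughly like the sketch below. It is an illustration only and runs without Spark: FakeGateway and JavaWrapperSketch are hypothetical names, and the fake gateway's detach method merely stands in for whatever call the real wrapper would use to release the JVM-side object.

```python
class FakeGateway:
    """Hypothetical stand-in for the Python-to-JVM gateway; records detached objects."""

    def __init__(self):
        self.detached = []

    def detach(self, java_obj):
        # A real gateway would tell the JVM to drop its reference here.
        self.detached.append(java_obj)


class JavaWrapperSketch:
    """Illustrative wrapper that releases its JVM-side object when destroyed."""

    def __init__(self, gateway, java_obj):
        self._gateway = gateway
        self._java_obj = java_obj

    def __del__(self):
        # getattr guards against instances whose __init__ did not complete.
        gateway = getattr(self, "_gateway", None)
        java_obj = getattr(self, "_java_obj", None)
        if gateway is not None and java_obj is not None:
            gateway.detach(java_obj)


gateway = FakeGateway()
model = JavaWrapperSketch(gateway, "jvm-object-1")
del model  # in CPython, __del__ runs once the last reference is gone
print(gateway.detached)
```

      With a destructor like this, dropping the last Python reference is enough to release the JVM object, so the loop in the reproduction would not need to call gc.collect() explicitly.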

          People

            Sandeep Singh (techaddict)
            Jonas Amrich (jonasamrich)
            Joseph K. Bradley
            Votes: 0
            Watchers: 4
