Spark / SPARK-1418

Python MLlib's _get_unmangled_rdd should uncache RDDs when training is done


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: MLlib, PySpark
    • Labels: None

    Description

      Right now, when PySpark converts a Python RDD of NumPy vectors to a Java one, it caches the Java RDD, since many of the algorithms are iterative. We should, however, call unpersist() at the end of the algorithm to free cache space. In addition, it may be good to persist the Java RDD with StorageLevel.MEMORY_AND_DISK instead of going back through the NumPy conversion; that will almost certainly be faster.
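      The proposed pattern can be sketched as follows. This is a minimal, hypothetical illustration of "persist for the duration of training, then unpersist": the `FakeRDD` and `StorageLevel` classes and `train_with_cached_rdd` helper are stand-ins for illustration, not the real PySpark internals (in real PySpark, `RDD.persist(StorageLevel.MEMORY_AND_DISK)` and `RDD.unpersist()` are the corresponding calls).

```python
class StorageLevel:
    # Stand-in constants mirroring Spark's storage levels.
    MEMORY_ONLY = "MEMORY_ONLY"
    MEMORY_AND_DISK = "MEMORY_AND_DISK"  # spill to disk rather than recompute


class FakeRDD:
    """Minimal stand-in for the converted Java RDD (not the real API)."""

    def __init__(self):
        self.storage_level = None

    def persist(self, level):
        self.storage_level = level
        return self

    def unpersist(self):
        self.storage_level = None
        return self


def train_with_cached_rdd(rdd, iterate, num_iterations=10):
    """Persist the RDD with MEMORY_AND_DISK (cheaper than redoing the
    NumPy conversion on cache eviction), run the iterative algorithm,
    and always unpersist afterwards to free cache space."""
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    try:
        model = None
        for _ in range(num_iterations):
            model = iterate(rdd, model)
        return model
    finally:
        # The fix proposed by this issue: release the cached RDD
        # even if training raises an exception.
        rdd.unpersist()
```

      Wrapping the unpersist in `finally` ensures the cached data is released even when an iteration fails, which is why the issue asks for the uncache to happen "when training is done" rather than leaving it to garbage collection.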


    People

      Assignee: Unassigned
      Reporter: Matei Alexandru Zaharia
      Votes: 0
      Watchers: 6
