Description
Right now, when PySpark converts a Python RDD of NumPy vectors to a Java RDD, it caches the Java RDD, since many of the algorithms are iterative. However, we should call unpersist() at the end of each algorithm to free cache space. In addition, it may be good to persist the Java RDD with StorageLevel.MEMORY_AND_DISK instead of recomputing it through the NumPy conversion; that will almost certainly be faster.
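A minimal sketch of the proposed persist/unpersist discipline, with no Spark dependency: `FakeRDD`, `train`, and the iteration body are all hypothetical stand-ins for the cached Java-side RDD and an iterative MLlib algorithm.

```python
class FakeRDD:
    """Stand-in for the Java-side RDD that PySpark caches (hypothetical)."""
    def __init__(self, data):
        self.data = data
        self.persisted = False

    def persist(self):
        # In real PySpark this would take a StorageLevel such as
        # StorageLevel.MEMORY_AND_DISK, as proposed above.
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self


def train(rdd, iterations):
    # Cache once, reuse across all iterations of the algorithm.
    cached = rdd.persist()
    try:
        total = 0.0
        for _ in range(iterations):
            # Stand-in for one optimization pass over the cached data.
            total += sum(cached.data)
        return total
    finally:
        # The proposed fix: release cache space when the algorithm ends,
        # even if an iteration raises.
        cached.unpersist()


rdd = FakeRDD([1.0, 2.0, 3.0])
result = train(rdd, 3)
```

The `try`/`finally` ensures `unpersist()` runs regardless of how training exits, which is the cleanup behavior the issue requests.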
Attachments
Issue Links
- Is contained by: SPARK-4531 Cache serialized java objects instead of serialized python objects in MLlib (Closed)