Description
Right now, when PySpark converts a Python RDD of NumPy vectors to a Java RDD, it caches the Java RDD, since many of the algorithms are iterative. However, we should call unpersist() at the end of each algorithm to free cache space. In addition, it may be good to persist the Java RDD with StorageLevel.MEMORY_AND_DISK instead of recomputing it through the NumPy conversion; that will almost certainly be faster.
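A minimal sketch of the proposed persist/unpersist discipline, with no Spark dependency: `FakeRDD`, `train`, and the iteration body are all hypothetical stand-ins for the cached Java-side RDD and an iterative MLlib algorithm.

```python
class FakeRDD:
    """Stand-in for the Java-side RDD that PySpark caches (hypothetical)."""
    def __init__(self, data):
        self.data = data
        self.persisted = False

    def persist(self):
        # In real PySpark this would take a StorageLevel such as
        # StorageLevel.MEMORY_AND_DISK, as proposed above.
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self


def train(rdd, iterations):
    # Cache once, reuse across all iterations of the algorithm.
    cached = rdd.persist()
    try:
        total = 0.0
        for _ in range(iterations):
            # Stand-in for one optimization pass over the cached data.
            total += sum(cached.data)
        return total
    finally:
        # The proposed fix: release cache space when the algorithm ends,
        # even if an iteration raises.
        cached.unpersist()


rdd = FakeRDD([1.0, 2.0, 3.0])
result = train(rdd, 3)
```

The `try`/`finally` ensures `unpersist()` runs regardless of how training exits, which is the cleanup behavior the issue requests.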
Attachments
Issue Links
- Is contained by: SPARK-4531 Cache serialized java objects instead of serialized python objects in MLlib (Closed)