[SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: PySpark
Labels: None

Description

Since the data is already serialized on the Python side, there is little point in keeping it as deserialized byte-array objects in Java, or in skipping compression. We should make cache() in PySpark use MEMORY_ONLY_SER and turn on spark.rdd.compress for it.
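
To illustrate the rationale, here is a minimal standalone sketch that does not use Spark at all. It pickles a batch of records, roughly as the Python side would before handing bytes to the JVM, and then compresses the resulting byte string. The helper names (serialize_partition, compress_block, restore_partition) and the sample records are hypothetical, and zlib is only an assumed stand-in for whatever codec Spark applies when spark.rdd.compress is enabled.

```python
import pickle
import zlib


def serialize_partition(records):
    """Pickle a list of records into one byte string.

    Stand-in for the Python-side serialization; the protocol is an assumption.
    """
    return pickle.dumps(records, protocol=2)


def compress_block(blob):
    """Compress already-serialized bytes.

    zlib is an assumed stand-in for Spark's block compression codec.
    """
    return zlib.compress(blob)


def restore_partition(compressed):
    """Reverse both steps: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(compressed))


# Hypothetical sample data with a lot of repetition, as cached RDDs often have.
records = [("user_%d" % (i % 10), i % 3) for i in range(10000)]

raw = serialize_partition(records)
packed = compress_block(raw)

# Compare the size of the pickled bytes with and without compression.
print("serialized bytes:", len(raw))
print("compressed bytes:", len(packed))
```

The bytes are opaque to the JVM either way, so compressing them costs only CPU time on read and write. How much space is actually saved depends on the data and on the codec in use.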

Issue Links

links to

[Github] Pull Request #1051 (ScrapCodes)

People

Assignee: Prashant Sharma

Reporter: Matei Alexandru Zaharia

Votes: 0

Watchers: 1

Dates

Created: 04/Jun/14 06:37

Updated: 25/Jul/14 01:16

Resolved: 25/Jul/14 01:16