Spark / SPARK-30580

Why can PySpark persist data only in serialised format?


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      The storage levels in PySpark allow persisting data only in serialised format. There is also a comment explicitly stating that "Since the data is always serialized on the Python side, all the constants use the serialized formats." While that makes total sense for RDDs, it is not clear to me why it is not possible to persist data without serialisation when using the DataFrame/Dataset APIs. In theory, in such cases, the persist call would only be a directive and the data would never leave the JVM, thus allowing for un-serialised persistence, correct? Many thanks for the feedback!
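      To make the observation concrete, here is a minimal self-contained sketch of how PySpark's storage-level constants are laid out (modelled on `pyspark.storagelevel.StorageLevel` in Spark 2.4; the field names follow that module, but treat this as an illustration rather than the library source). The point is that every memory-backed constant on the Python side carries `deserialized=False`, whereas the Scala API also offers deserialised variants such as `MEMORY_ONLY` with `deserialized=true`:

```python
from collections import namedtuple

# Sketch of PySpark's StorageLevel flags (assumption: mirrors the shape of
# pyspark.storagelevel.StorageLevel; not imported from Spark itself).
StorageLevel = namedtuple(
    "StorageLevel",
    ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"],
)

# On the Python side the memory-backed constants all set deserialized=False,
# because records cross the Py4J boundary as serialized bytes.
MEMORY_ONLY = StorageLevel(False, True, False, False, 1)
MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)

# Usage in a real session would look like (requires a running SparkSession,
# shown here only as a comment):
#   df.persist(StorageLevel.MEMORY_ONLY)

print(MEMORY_ONLY.deserialized)      # False
print(MEMORY_AND_DISK.useDisk)       # True
```

      For DataFrames the persist call is handled inside the JVM, so in principle the Python-side `deserialized` flag need not constrain how the JVM caches the rows, which is the crux of the question above.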


          People

            Assignee: Unassigned
            Reporter: Francesco Cavrini
            Votes: 0
            Watchers: 1
