SPARK-6728: Improve performance of py4j for large bytearray

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      PySpark relies on py4j to transfer function arguments and return values between Python and the JVM, and passing a large bytearray (larger than 10 MB) through it is very slow.
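
      As a rough illustration of the slowdown, the micro-benchmark sketch below times one round trip through py4j. It assumes a JVM GatewayServer is already running on py4j's default port and exposes a hypothetical entry point with an echo(byte[]) method; both are assumptions for illustration, not part of this issue.

          # Hedged sketch: assumes a running GatewayServer whose entry point
          # exposes a hypothetical echo(byte[]) method.
          import time
          from py4j.java_gateway import JavaGateway

          gateway = JavaGateway()              # connect to the running JVM
          data = bytearray(10 * 1024 * 1024)   # 10 MB, where the slowdown shows

          start = time.time()
          gateway.entry_point.echo(data)       # argument crosses the text protocol
          print("py4j round trip: %.2f s" % (time.time() - start))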

      In MLlib, a Vector can hold more than 100 MB of data; transferring it through py4j may need a few GB of memory and can crash the process.

      The reason is that py4j uses a text protocol: it encodes the bytearray as base64 and performs multiple string concatenations.
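
      Both costs are easy to reproduce outside of py4j; the minimal sketch below shows the ~33% base64 inflation and the repeated-concatenation pattern (the chunk size here is illustrative, not py4j's actual buffer size).

          import base64

          payload = bytearray(10 * 1024 * 1024)   # 10 MB of raw bytes

          # Cost 1: base64 inflates the data by ~33% before it hits the socket.
          encoded = base64.b64encode(payload)
          print(len(payload), len(encoded))       # 10485760 -> 13981016

          # Cost 2: building the outgoing command by repeated concatenation of
          # immutable byte strings can re-copy the accumulated buffer on every
          # step, so total copying grows roughly quadratically with the size.
          chunks = [encoded[i:i + 65536] for i in range(0, len(encoded), 65536)]
          frame = b""
          for chunk in chunks:
              frame += chunk                      # may re-copy all of `frame`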

      A binary protocol would help a lot; an issue has been filed against py4j: https://github.com/bartdag/py4j/issues/159
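
      For comparison, a binary protocol can send the payload raw behind a fixed-size length header, with no re-encoding or concatenation. The framing below is only an illustrative sketch, not py4j's actual wire format.

          import base64
          import struct

          payload = bytes(10 * 1024 * 1024)

          # Text framing: the whole payload is re-encoded as base64 text.
          text_frame = base64.b64encode(payload)            # ~13.3 MB of text

          # Binary framing sketch: 4-byte big-endian length, then raw bytes.
          binary_frame = struct.pack(">I", len(payload)) + payload

          print(len(text_frame), len(binary_frame))         # 13981016 vs 10485764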

    People

      Assignee: Unassigned
      Reporter: Davies Liu (davies)
      Votes: 0
      Watchers: 3

    Dates

      Created:
      Updated:
      Resolved: