[SPARK-677] PySpark should not collect results through local filesystem - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0
Fix Version/s: 1.2.2, 1.3.1, 1.4.0
Component/s: PySpark
Labels:
None

Target Version/s:

1.2.2, 1.3.1, 1.4.0

Description

Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance.

Instead, we should stream the data from Java to Python over a local socket or a FIFO.

Attachments

Issue Links

links to

[Github] Pull Request #4923 (davies)

Activity

People

Assignee:: Davies Liu

Reporter:: Josh Rosen

Votes:: 1 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 02/Feb/13 17:28

Updated:: 22/May/15 20:40

Resolved:: 22/May/15 20:40