Spark / SPARK-6235 Address various 2G limits / SPARK-17790

Support for parallelizing R data.frame larger than 2GB


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.2, 2.1.0
    • Component/s: SparkR
    • Labels: None

    Description

      This issue is a more specific version of SPARK-17762.
      Supporting arguments larger than 2GB is more general and arguably harder to do, because the limit exists both in R and in the JVM (we receive data as a ByteArray). However, to support parallelizing R data.frames that are larger than 2GB, we can do what PySpark does.

      PySpark uses files to transfer bulk data between Python and the JVM. This approach has worked well for the large community of Spark Python users.
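
      A minimal sketch of that file-based handoff in R follows, assuming the data.frame is serialized slice by slice into a temporary file whose path would then be handed to a JVM-side reader instead of a single in-memory ByteArray. The function name writeToTempFile and the length-prefixed framing are illustrative assumptions, not the actual SparkR implementation.

      writeToTempFile <- function(df, numSlices) {
        fileName <- tempfile(pattern = "sparkr-parallelize-", fileext = ".tmp")
        con <- file(fileName, open = "wb")
        on.exit(close(con))

        rowsPerSlice <- ceiling(nrow(df) / numSlices)
        for (i in seq_len(numSlices)) {
          start <- (i - 1) * rowsPerSlice + 1
          end <- min(i * rowsPerSlice, nrow(df))
          if (start > end) break
          # serialize() returns a raw vector; each slice stays well below the 2GB cap
          sliceBytes <- serialize(df[start:end, , drop = FALSE], connection = NULL)
          # length-prefixed framing so the reader can split the file back into slices
          writeBin(length(sliceBytes), con, endian = "big")
          writeBin(sliceBytes, con)
        }
        fileName
      }

      # Usage: write a sample data.frame in 4 slices, then clean up the temp file
      fileName <- writeToTempFile(data.frame(x = 1:100000, y = runif(100000)), numSlices = 4)
      file.remove(fileName)

      Because the data never has to fit into one R raw vector or one JVM byte array, the remaining limits are disk space and the per-slice size.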

People

    Assignee: Hossein Falaki (falaki)
    Reporter: Hossein Falaki (falaki)
    Votes: 0
    Watchers: 4

Dates

    Created:
    Updated:
    Resolved: