Spark / SPARK-6235 Address various 2G limits / SPARK-17790

Support for parallelizing R data.frame larger than 2GB


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.2, 2.1.0
    • Component/s: SparkR
    • Labels: None

    Description

      This issue is a more specific version of SPARK-17762.
      Supporting arguments larger than 2GB is more general and arguably harder to do, because the limit exists both in R and in the JVM (we receive data as a ByteArray). However, to support parallelizing R data.frames that are larger than 2GB, we can do what PySpark does.

      PySpark uses files to transfer bulk data between Python and the JVM. This approach has worked well for the large community of Spark Python users.
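
      A minimal sketch of that file-based handoff in R follows, assuming the data.frame is serialized slice by slice into a temporary file whose path would then be handed to a JVM-side reader instead of a single in-memory ByteArray. The function name writeToTempFile and the length-prefixed framing are illustrative assumptions, not the actual SparkR implementation.

      writeToTempFile <- function(df, numSlices) {
        fileName <- tempfile(pattern = "sparkr-parallelize-", fileext = ".tmp")
        con <- file(fileName, open = "wb")
        on.exit(close(con))

        rowsPerSlice <- ceiling(nrow(df) / numSlices)
        for (i in seq_len(numSlices)) {
          start <- (i - 1) * rowsPerSlice + 1
          end <- min(i * rowsPerSlice, nrow(df))
          if (start > end) break
          # serialize() returns a raw vector; each slice stays well below the 2GB cap
          sliceBytes <- serialize(df[start:end, , drop = FALSE], connection = NULL)
          # length-prefixed framing so the reader can split the file back into slices
          writeBin(length(sliceBytes), con, endian = "big")
          writeBin(sliceBytes, con)
        }
        fileName
      }

      # Usage: write a sample data.frame in 4 slices, then clean up the temp file
      fileName <- writeToTempFile(data.frame(x = 1:100000, y = runif(100000)), numSlices = 4)
      file.remove(fileName)

      Because the data never has to fit into one R raw vector or one JVM byte array, the remaining limits are disk space and the per-slice size.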

People

    Assignee: Hossein Falaki (falaki)
    Reporter: Hossein Falaki (falaki)
    Votes: 0
    Watchers: 4

Dates

    Created:
    Updated:
    Resolved: