  1. Spark
  2. SPARK-26759

Arrow optimization in SparkR's interoperability


Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: SparkR, SQL

    Description

      Arrow 0.12.0 has been released, and it includes an R API. We could use it to optimize Spark DataFrame <-> R DataFrame interoperability.

      For instance, see the examples below:

      • dapply
          df <- createDataFrame(mtcars)
          # mtcars$gear is a double in R, so the output schema declares it as double
          collect(dapply(df,
                         function(r.data.frame) {
                           data.frame(r.data.frame$gear)
                         },
                         structType("gear double")))
          
      • gapply
          df <- createDataFrame(mtcars)
          collect(gapply(df,
                         "gear",
                         function(key, group) {
                           data.frame(gear = key[[1]], disp = mean(group$disp) > group$disp)
                         },
                         structType("gear double, disp boolean")))
          
      • R DataFrame -> Spark DataFrame
          createDataFrame(mtcars)
          
      • Spark DataFrame -> R DataFrame
          collect(df)
          head(df)
          

      Currently, some of the communication paths between the R side and the JVM side have to buffer the data and flush it all at once because of ARROW-4512. Fixing that is not a target of this umbrella.
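
      As a rough sketch of how the conversions listed above might be exercised with Arrow turned on: the configuration key below, spark.sql.execution.arrow.sparkr.enabled, is the one documented for SparkR in Spark 3.0. It is not named in this issue, so treat it as an assumption here. The sketch also assumes the arrow R package is installed on the driver and workers.

      ```r
      library(SparkR)

      # Assumption: this flag comes from the Spark 3.0 SparkR documentation, not from this issue.
      sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

      # R DataFrame -> Spark DataFrame
      df <- createDataFrame(mtcars)

      # Spark DataFrame -> R DataFrame
      rdf <- collect(df)
      head(rdf)

      sparkR.session.stop()
      ```

      With the flag left at its default, the same calls should still work, only without the Arrow-based transfer.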

            People

              Assignee: Hyukjin Kwon (gurwls223)
              Reporter: Hyukjin Kwon (gurwls223)
              Votes: 1
              Watchers: 3
