SPARK-26759

Arrow optimization in SparkR's interoperability


    Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: SparkR, SQL
    • Labels:

      Description

      Arrow 0.12.0 has been released and it includes an R API. We could use it to optimize Spark DataFrame <-> R DataFrame interoperability.

      For instance, see the examples below (a sketch of enabling the optimization follows them):

      • dapply
          df <- createDataFrame(mtcars)
          collect(dapply(df,
                         function(r.data.frame) {
                           data.frame(r.data.frame$gear)
                         },
                         structType("gear long")))
          
      • gapply
          df <- createDataFrame(mtcars)
          collect(gapply(df,
                         "gear",
                         function(key, group) {
                           data.frame(gear = key[[1]], disp = mean(group$disp) > group$disp)
                         },
                         structType("gear double, disp boolean")))
          
      • R DataFrame -> Spark DataFrame
          createDataFrame(mtcars)
          
      • Spark DataFrame -> R DataFrame
          collect(df)
          head(df)
          
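      A minimal sketch of exercising the optimization end to end, assuming it is gated behind a SQL configuration such as spark.sql.execution.arrow.sparkr.enabled (the key eventually added for Spark 3.0) and that the arrow R package is installed on the driver:

          library(SparkR)
          # Enable the Arrow-based conversion; the configuration key is assumed here.
          sparkR.session(master = "local[*]",
                         sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

          df <- createDataFrame(mtcars)   # R data.frame -> Spark DataFrame via Arrow
          head(collect(df))               # Spark DataFrame -> R data.frame via Arrow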

      Currently, some of the communication paths between the R side and the JVM side have to buffer the data and flush it at once due to ARROW-4512. I do not target fixing that under this umbrella.

              People

              • Assignee: Hyukjin Kwon
              • Reporter: Hyukjin Kwon
              • Votes: 1
              • Watchers: 4
