Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark, SQL
    • Labels:
      None

      Description

      Split-apply-merge is a common pattern in data analysis. It is implemented in many popular data analysis libraries, such as Spark, pandas, and R. The split and merge operations in these libraries are similar to each other and are mostly implemented by grouping operators. For instance, a Spark DataFrame has groupBy and a pandas DataFrame has groupby. Users familiar with either Spark or pandas DataFrames therefore have little difficulty understanding how grouping works in the other library. The apply operation, however, is more native to each library and therefore differs considerably between them. A pandas user who knows how to use apply to perform a certain transformation in pandas might not know how to do the same in PySpark. Also, the current implementation of passing data from the Java executor to the Python executor is not efficient; there is an opportunity to speed it up using Apache Arrow. This feature can enable use cases that combine Spark's grouping operators, such as groupBy, rollup, cube, and window, with pandas's native apply operator.
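
      To make the pattern concrete, here is a minimal pandas-only sketch of the three stages. The helper names split_apply_merge and subtract_mean, the column names, and the sample data are all hypothetical and chosen for illustration; this is not the API proposed in this issue, only the shape of function (pandas DataFrame in, pandas DataFrame out) that a grouped apply would accept.

```python
import pandas as pd


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Apply step (hypothetical example): center column 'v' within one group."""
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def split_apply_merge(df: pd.DataFrame, key: str, func) -> pd.DataFrame:
    """Illustrative helper, not a library function."""
    # Split: one pandas DataFrame per distinct value of the key column.
    groups = [group for _, group in df.groupby(key)]
    # Apply: run the user function on each group independently.
    results = [func(group) for group in groups]
    # Merge: concatenate the per-group results and restore the original row order.
    return pd.concat(results).sort_index()


# Made-up sample data: two groups identified by 'id'.
df = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})
centered = split_apply_merge(df, "id", subtract_mean)
print(centered)
```

      Only the apply step is user code. The split and merge steps are what a grouping operator such as groupBy would take over in a distributed setting, with each group handed to the Python function as a pandas DataFrame.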

      Related work:

      SPARK-13534
      This enables faster data serialization between PySpark and pandas using Apache Arrow. Our work will build on top of this and use the same serialization for pandas UDFs.

      SPARK-12919 and SPARK-12922
      These implemented two functions, dapply and gapply, in SparkR, which implement a split-apply-merge pattern similar to the one we want to implement in PySpark.

          Activity

          icexelloss Li Jin added a comment -

          I am currently working on this. I'll keep updating status here.

          icexelloss Li Jin added a comment - PR: https://github.com/apache/spark/pull/18732
          hyukjin.kwon Hyukjin Kwon added a comment -

          User 'icexelloss' has created a pull request for this issue:
          https://github.com/apache/spark/pull/18732

          hyukjin.kwon Hyukjin Kwon added a comment -

          Issue resolved by pull request 18732
          https://github.com/apache/spark/pull/18732

          apachespark Apache Spark added a comment -

          User 'ueshin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/19505

          apachespark Apache Spark added a comment -

          User 'ueshin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/19517


            People

            • Assignee:
              icexelloss Li Jin
              Reporter:
              icexelloss Li Jin
            • Votes:
              0
              Watchers:
              10
