Spark / SPARK-22216

Improving PySpark/Pandas interoperability

    Details

    • Type: Epic
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

      Description

      This is an umbrella ticket tracking the general effort to improve performance and interoperability between PySpark and Pandas. The core idea is to use Apache Arrow as the serialization format to reduce the overhead of moving data between PySpark and Pandas.

          Issue Links

          1. groupBy().apply() with pandas udf in pyspark (Sub-task, Resolved, Li Jin)
          2. SPIP: Vectorized UDFs in Python (Sub-task, Resolved, Bryan Cutler)
          3. Simple Vectorized Python UDFs using Arrow (Sub-task, Closed, Unassigned)
          4. Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame (Sub-task, Resolved, Bryan Cutler)
          5. User-defined window functions with pandas udf (Sub-task, In Progress, Unassigned)
          6. User-defined aggregation functions with pandas udf (Sub-task, Resolved, Li Jin)
          7. Design doc for different types of pandas_udf (Sub-task, Resolved, Unassigned)
          8. Upgrade Arrow to version 0.8.0 and upgrade Netty to 4.1.17 (Sub-task, Resolved, Bryan Cutler)
          9. Add function type argument to pandas_udf (Sub-task, Resolved, Li Jin)
          10. Improve the description of Vectorized UDFs for non-deterministic cases (Sub-task, Resolved, Li Jin)
          11. Register Vectorized UDFs for SQL Statement (Sub-task, Resolved, Xiao Li)
          12. Using pandas_udf when inputs are not Pandas's Series or DataFrame (Sub-task, Resolved, Hyukjin Kwon)
          13. Support alternative function form with group aggregate pandas UDF (Sub-task, Resolved, Li Jin)
          14. Decrease memory consumption with toPandas() collection using Arrow (Sub-task, Open, Unassigned)
          15. Change MapVector to NullableMapVector in ArrowColumnVector (Sub-task, Resolved, Li Jin)
          16. Rename Pandas UDFs (Sub-task, Resolved, Xiao Li)
          17. Refactor group aggregate pandas UDF to its own catalyst rules (Sub-task, Open, Unassigned)
          18. Pandas grouped udf on dataset with timestamp column error (Sub-task, Resolved, Li Jin)
          19. Explicitly specify supported types in Pandas UDFs (Sub-task, Resolved, Hyukjin Kwon)
          20. Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame (Sub-task, Resolved, Hyukjin Kwon)
          21. Improve test cases for all supported types and unsupported types (Sub-task, Open, Unassigned)
          22. Explicitly check supported types in toPandas (Sub-task, Resolved, Hyukjin Kwon)
          23. Update Pandas UDFs section in sql-programming-guide (Sub-task, Open, Unassigned)
          24. Support partial function and callable object with pandas UDF (Sub-task, Open, Unassigned)

              People

              • Assignee: Li Jin (icexelloss)
              • Reporter: Li Jin (icexelloss)
              • Votes: 0
              • Watchers: 21