Spark / SPARK-22216

Improving PySpark/Pandas interoperability

    Details

    • Type: Epic
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels:
      None

      Description

      This is an umbrella ticket tracking the general effort to improve performance and interoperability between PySpark and Pandas. The core idea is to use Apache Arrow as the serialization format to reduce the overhead of moving data between PySpark and Pandas.
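      To make the idea concrete, here is a minimal sketch of an Arrow-backed round trip between a Pandas DataFrame and a Spark DataFrame. It assumes the configuration keys as they were named in Spark 2.3/2.4 (`spark.sql.execution.arrow.enabled` and `spark.sql.execution.arrow.fallback.enabled`); later releases may use different names. The helpers `arrow_conf`, `round_trip` and `demo` are hypothetical names introduced only for illustration and are not part of this ticket.

```python
# Config keys as named in Spark 2.3/2.4; treat them as assumptions elsewhere.
ARROW_ENABLED = "spark.sql.execution.arrow.enabled"
ARROW_FALLBACK = "spark.sql.execution.arrow.fallback.enabled"


def arrow_conf(enabled=True, fallback=True):
    """Hypothetical helper: build the Arrow-related settings as strings."""
    return {
        ARROW_ENABLED: str(enabled).lower(),
        ARROW_FALLBACK: str(fallback).lower(),
    }


def round_trip(spark, pdf):
    """Convert a Pandas DataFrame to a Spark DataFrame and back again."""
    for key, value in arrow_conf().items():
        spark.conf.set(key, value)
    sdf = spark.createDataFrame(pdf)  # Pandas -> Spark
    return sdf.toPandas()             # Spark -> Pandas


def demo():
    """Run the round trip locally; needs pyspark, pandas and pyarrow."""
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    try:
        pdf = pd.DataFrame({"id": [1, 2, 3], "v": [0.1, 0.2, 0.3]})
        print(round_trip(spark, pdf))
    finally:
        spark.stop()
```

      Calling `demo()` in an environment with PySpark and PyArrow installed should print the same three rows back. With the fallback setting enabled, Spark is expected to revert to the non-Arrow path when a column type is not supported, rather than fail.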

          Issue Links

          1. groupBy().apply() with pandas udf in pyspark | Sub-task | Resolved | Li Jin
          2. SPIP: Vectorized UDFs in Python | Sub-task | Resolved | Bryan Cutler
          3. Simple Vectorized Python UDFs using Arrow | Sub-task | Closed | Unassigned
          4. Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame | Sub-task | Resolved | Bryan Cutler
          5. User-defined window functions with pandas udf (unbounded window) | Sub-task | Resolved | Li Jin
          6. User-defined aggregation functions with pandas udf | Sub-task | Resolved | Li Jin
          7. Design doc for different types of pandas_udf | Sub-task | Resolved | Unassigned
          8. Upgrade Arrow to version 0.8.0 and upgrade Netty to 4.1.17 | Sub-task | Resolved | Bryan Cutler
          9. Add function type argument to pandas_udf | Sub-task | Resolved | Li Jin
          10. Improve the description of Vectorized UDFs for non-deterministic cases | Sub-task | Resolved | Li Jin
          11. Register Vectorized UDFs for SQL Statement | Sub-task | Resolved | Xiao Li
          12. Using pandas_udf when inputs are not Pandas's Series or DataFrame | Sub-task | Resolved | Hyukjin Kwon
          13. Support alternative function form with group aggregate pandas UDF | Sub-task | Resolved | Li Jin
          14. Decrease memory consumption with toPandas() collection using Arrow | Sub-task | Resolved | Bryan Cutler
          15. Change MapVector to NullableMapVector in ArrowColumnVector | Sub-task | Resolved | Li Jin
          16. Rename Pandas UDFs | Sub-task | Resolved | Xiao Li
          17. Refactor group aggregate pandas UDF to its own catalyst rules | Sub-task | Open | Unassigned
          18. Pandas grouped udf on dataset with timestamp column error | Sub-task | Resolved | Li Jin
          19. Explicitly specify supported types in Pandas UDFs | Sub-task | Resolved | Hyukjin Kwon
          20. Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame | Sub-task | Resolved | Hyukjin Kwon
          21. Improve test cases for all supported types and unsupported types | Sub-task | Open | Unassigned
          22. Explicitly check supported types in toPandas | Sub-task | Resolved | Hyukjin Kwon
          23. Update Pandas UDFs section in sql-programming-guide | Sub-task | Resolved | Li Jin
          24. Support partial function and callable object with pandas UDF | Sub-task | Open | Unassigned
          25. Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator | Sub-task | Resolved | Li Jin
          26. Pandas Grouped Map UserDefinedFunction mixes column labels | Sub-task | Resolved | Bryan Cutler
          27. User-defined window functions with pandas udf (bounded window) | Sub-task | In Progress | Unassigned
          28. Support GROUPED_AGG_PANDAS_UDF in Pivot | Sub-task | Open | Unassigned
          29. Can not mix vectorized and non-vectorized UDFs | Sub-task | Resolved | Li Jin
          30. Fix pandas_udf with return type StringType() to handle str type properly in Python 2. | Sub-task | Resolved | Takuya Ueshin
          31. Allow None for Decimal type conversion (specific to PyArrow 0.9.0) | Sub-task | Resolved | Hyukjin Kwon
          32. Show some kind of test output to indicate pyarrow tests were run | Sub-task | In Progress | Bryan Cutler
          33. Improve toPandas with Arrow by sending out-of-order record batches | Sub-task | In Progress | Unassigned
          34. Add an example for having two columns as the grouping key in group aggregate pandas UDF | Sub-task | Resolved | Hyukjin Kwon
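
          Several of the sub-tasks above concern the different kinds of pandas UDF (scalar, grouped map, grouped aggregate). The sketch below shows how the three kinds might be declared and applied, assuming the `pandas_udf` / `PandasUDFType` API as it existed around Spark 2.4; newer releases may prefer a different style. The function names `plus_one`, `center`, `mean_of` and `demo`, and the sample data, are hypothetical and chosen only for illustration.

```python
def plus_one(v):
    """Scalar: receives a pandas Series, returns a Series of equal length."""
    return v + 1


def center(pdf):
    """Grouped map: receives one group as a pandas DataFrame, returns a DataFrame."""
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def mean_of(v):
    """Grouped aggregate: receives a pandas Series, returns a single value."""
    return v.mean()


def demo():
    """Apply the three UDF kinds locally; needs pyspark, pandas and pyarrow."""
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    try:
        df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

        # The function type argument selects how Spark feeds data to the UDF.
        scalar = pandas_udf(plus_one, returnType="double",
                            functionType=PandasUDFType.SCALAR)
        grouped_map = pandas_udf(center, returnType="id long, v double",
                                 functionType=PandasUDFType.GROUPED_MAP)
        grouped_agg = pandas_udf(mean_of, returnType="double",
                                 functionType=PandasUDFType.GROUPED_AGG)

        df.select(scalar(df["v"])).show()
        df.groupby("id").apply(grouped_map).show()
        df.groupby("id").agg(grouped_agg(df["v"])).show()
    finally:
        spark.stop()
```

          The plain functions are kept separate from the Spark wiring so they can be checked with ordinary Pandas objects before being registered as UDFs.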

              People

              • Assignee: Li Jin (icexelloss)
              • Reporter: Li Jin (icexelloss)
              • Votes: 0
              • Watchers: 27
