Spark / SPARK-22216

Improving PySpark/Pandas interoperability

    Details

    • Type: Epic
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

      Description

      This is an umbrella ticket tracking the general effort to improve performance and interoperability between PySpark and Pandas. The core idea is to use Apache Arrow as the serialization format to reduce the overhead of exchanging data between PySpark and Pandas.
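
The overhead mentioned above can be illustrated with a rough, standard-library-only sketch. This is not Arrow or PySpark code: the function names are hypothetical, and `pickle` and `array` merely stand in for per-row serialization and for a contiguous columnar buffer (the layout style Arrow uses). It only shows why packing a whole column at once tends to carry less per-value overhead than serializing values one by one.

```python
import pickle
from array import array


def serialize_row_by_row(values):
    """Pickle each value separately, mimicking per-row serialization."""
    return [pickle.dumps(v) for v in values]


def deserialize_row_by_row(blobs):
    """Unpickle each value separately."""
    return [pickle.loads(b) for b in blobs]


def serialize_columnar(values):
    """Pack a whole column of floats into one contiguous typed buffer."""
    return array("d", values).tobytes()


def deserialize_columnar(buf):
    """Rebuild the column from the contiguous buffer."""
    col = array("d")
    col.frombytes(buf)
    return col.tolist()


# A single column of 1000 double-precision values (illustrative data only).
values = [float(i) for i in range(1000)]

row_blobs = serialize_row_by_row(values)
col_buf = serialize_columnar(values)

# Total bytes produced by each approach.
row_bytes = sum(len(b) for b in row_blobs)
col_bytes = len(col_buf)

print("row-by-row bytes:", row_bytes)
print("columnar bytes:  ", col_bytes)
```

Both paths round-trip the same data. The columnar buffer holds only the raw values, while each pickled value carries its own framing bytes, so the row-by-row total is larger.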

        Issue Links

        1. groupBy().apply() with pandas udf in pyspark [Sub-task, Resolved, Li Jin]
        2. SPIP: Vectorized UDFs in Python [Sub-task, Resolved, Bryan Cutler]
        3. Simple Vectorized Python UDFs using Arrow [Sub-task, Closed, Unassigned]
        4. Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame [Sub-task, Resolved, Bryan Cutler]
        5. User-defined window functions with pandas udf (unbounded window) [Sub-task, Resolved, Li Jin]
        6. User-defined aggregation functions with pandas udf [Sub-task, Resolved, Li Jin]
        7. Design doc for different types of pandas_udf [Sub-task, Resolved, Unassigned]
        8. Upgrade Arrow to version 0.8.0 and upgrade Netty to 4.1.17 [Sub-task, Resolved, Bryan Cutler]
        9. Add function type argument to pandas_udf [Sub-task, Resolved, Li Jin]
        10. Improve the description of Vectorized UDFs for non-deterministic cases [Sub-task, Resolved, Li Jin]
        11. Register Scalar Vectorized UDFs for SQL Statement [Sub-task, Resolved, Xiao Li]
        12. Using pandas_udf when inputs are not Pandas's Series or DataFrame [Sub-task, Resolved, Hyukjin Kwon]
        13. Support alternative function form with group aggregate pandas UDF [Sub-task, Resolved, Li Jin]
        14. Decrease memory consumption with toPandas() collection using Arrow [Sub-task, Resolved, Bryan Cutler]
        15. Change MapVector to NullableMapVector in ArrowColumnVector [Sub-task, Resolved, Li Jin]
        16. Rename Pandas UDFs [Sub-task, Resolved, Xiao Li]
        17. Refactor group aggregate pandas UDF to its own catalyst rules [Sub-task, Open, Unassigned]
        18. Pandas grouped udf on dataset with timestamp column error [Sub-task, Resolved, Li Jin]
        19. Explicitly specify supported types in Pandas UDFs [Sub-task, Resolved, Hyukjin Kwon]
        20. Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame [Sub-task, Resolved, Hyukjin Kwon]
        21. Improve test cases for all supported types and unsupported types [Sub-task, Resolved, Aleksandr Koriagin]
        22. Explicitly check supported types in toPandas [Sub-task, Resolved, Hyukjin Kwon]
        23. Update Pandas UDFs section in sql-programming-guide [Sub-task, Resolved, Li Jin]
        24. Support partial function and callable object with pandas UDF [Sub-task, Resolved, Unassigned]
        25. Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator [Sub-task, Resolved, Li Jin]
        26. Pandas Grouped Map UserDefinedFunction mixes column labels [Sub-task, Resolved, Bryan Cutler]
        27. User-defined window functions with pandas udf (bounded window) [Sub-task, Resolved, Li Jin]
        28. Support GROUPED_AGG_PANDAS_UDF in Pivot [Sub-task, Open, Unassigned]
        29. Can not mix vectorized and non-vectorized UDFs [Sub-task, Resolved, Li Jin]
        30. Fix pandas_udf with return type StringType() to handle str type properly in Python 2. [Sub-task, Resolved, Takuya Ueshin]
        31. Allow None for Decimal type conversion (specific to PyArrow 0.9.0) [Sub-task, Resolved, Hyukjin Kwon]
        32. Show some kind of test output to indicate pyarrow tests were run [Sub-task, Resolved, Bryan Cutler]
        33. Improve toPandas with Arrow by sending out-of-order record batches [Sub-task, Resolved, Bryan Cutler]
        34. Add an example for having two columns as the grouping key in group aggregate pandas UDF [Sub-task, Resolved, Hyukjin Kwon]
        35. Register Grouped aggregate UDF Vectorized UDFs for SQL Statement [Sub-task, Resolved, Hyukjin Kwon]
        36. Clarify/Improve EvalType for grouped aggregate and window aggregate [Sub-task, Open, Unassigned]
        37. Internally document type conversion between Pandas data and SQL types in Pandas UDFs [Sub-task, Resolved, Hyukjin Kwon]
        38. Update document type conversion for Pandas UDFs (pyarrow 0.13.0, pandas 0.24.2, Python 3.7) [Sub-task, Resolved, Hyukjin Kwon]
        There are no issues in this epic.

            People

            • Assignee: Li Jin (icexelloss)
            • Reporter: Li Jin (icexelloss)
