[SPARK-34849] SPIP: Support pandas API layer on PySpark


Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Done
    • Fix Version/s: 3.2.0
    • Affects Version/s: None
    • Component/s: PySpark

    Description

      This is a SPIP for porting the Koalas project to PySpark. It was previously discussed on the dev mailing list under the same title, [DISCUSS] Support pandas API layer on PySpark.

      Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

      Port Koalas into PySpark to support the pandas API layer on PySpark, so that:

      • Users can easily leverage their existing Spark cluster to scale their pandas workloads.
      • PySpark supports plotting and drawing charts.
      • Users can easily switch between the pandas APIs and the PySpark APIs.

      Q2. What problem is this proposal NOT designed to solve?

      Some pandas APIs are explicitly unsupported. For example, memory_usage will not be supported because, unlike in pandas, DataFrames are not materialized in memory in Spark.

      This does not replace the existing PySpark APIs. The PySpark API has many users and a large body of existing code across projects, and many PySpark users still prefer Spark's immutable DataFrame API to the pandas API.

      Q3. How is it done today, and what are the limits of current practice?

      The current practice has two limitations:

      1. Apache Spark lacks many features that are commonly used in data science. In particular, plotting and drawing charts is missing, even though it is one of the most important features that almost every data scientist uses in their daily work.
      2. Data scientists tend to prefer the pandas APIs, but it is very hard to rewrite pandas code with the PySpark APIs when they need to scale their workloads, because the PySpark APIs are harder to learn than pandas' and many features are missing in PySpark.

      Q4. What is new in your approach and why do you think it will be successful?

      I believe this offers a new way for both PySpark and pandas users to easily scale their workloads. I think it can be successful because more and more people use Python and pandas. In fact, similar attempts such as Dask and Modin are all growing fast and successfully.

      Q5. Who cares? If you are successful, what difference will it make?

      Anyone who wants to scale their pandas workloads on their Spark cluster. It will also significantly improve the usability of PySpark.

      Q6. What are the risks?

      Technically, I don't see many risks yet, given that:

      • Koalas has grown separately for more than two years and has greatly improved in maturity and stability.
      • Koalas will be ported into PySpark as a separate package.

      The work is mainly about putting documentation and test cases in place and handling dependencies properly. For example, Koalas currently uses pytest with various dependencies, whereas PySpark uses plain unittest with fewer dependencies.

      In addition, Koalas' default indexing system has not been well received because it can cause performance overhead, so applying it properly to PySpark might be a challenge.

      Q7. How long will it take?

      Before the Spark 3.2 release.

      Q8. What are the mid-term and final “exams” to check for success?

      The first check for success is to make sure that all the existing Koalas APIs and tests work as they are, without affecting existing Koalas workloads on PySpark.

      The last check is to confirm, through user feedback and PySpark usage statistics, whether the usability and convenience we aim for have actually improved.



        Issue Links

        1. Port/integrate Koalas main codes into PySpark (Sub-task, Resolved, Haejoon Lee)
        2. Port/integrate Koalas DataFrame unit test into PySpark (Sub-task, Resolved, Xinrong Meng)
        3. Renaming the package alias from pp to ps (Sub-task, Resolved, Unassigned)
        4. Document the deprecation of Koalas Accessor after porting the documentation. (Sub-task, Resolved, Unassigned)
        5. Port/integrate Koalas dependencies into PySpark (Sub-task, Resolved, Xinrong Meng)
        6. Enable mypy for pandas-on-Spark (Sub-task, Resolved, Takuya Ueshin)
        7. Make doctests work in Spark. (Sub-task, Resolved, Takuya Ueshin)
        8. Port/integrate Koalas remaining codes into PySpark (Sub-task, Resolved, Haejoon Lee)
        9. Port Koalas Series related unit tests into PySpark (Sub-task, Resolved, Xinrong Meng)
        10. Consolidate PySpark testing utils (Sub-task, Resolved, Xinrong Meng)
        11. Use ps as the short name instead of pp (Sub-task, Resolved, Unassigned)
        12. Port Koalas DataFrame related unit tests into PySpark (Sub-task, Resolved, Xinrong Meng)
        13. Port Koalas operations on different frames tests into PySpark (Sub-task, Resolved, Xinrong Meng)
        14. Port Koalas Index unit tests into PySpark (Sub-task, Resolved, Xinrong Meng)
        15. Port Koalas plot unit tests into PySpark (Sub-task, Resolved, Xinrong Meng)
        16. Port Koalas miscellaneous unit tests into PySpark (Sub-task, Resolved, Xinrong Meng)
        17. Port Koalas internal implementation unit tests into PySpark (Sub-task, Resolved, Xinrong Meng)
        18. Remove Spark-version related codes from main codes. (Sub-task, Resolved, Takuya Ueshin)
        19. Remove Spark-version related codes from test codes. (Sub-task, Resolved, Xinrong Meng)
        20. Rename Koalas to pandas-on-Spark in main codes (Sub-task, Resolved, Hyukjin Kwon)
        21. Revisit pandas-on-Spark test cases that are disabled because of pandas nondeterministic return values (Sub-task, Resolved, Xinrong Meng)
        22. Standardize module name in install.rst (Sub-task, Resolved, Xinrong Meng)
        23. Document migration guide from Koalas to pandas APIs on Spark (Sub-task, Resolved, Hyukjin Kwon)
        24. Renaming the existing Koalas related codes. (Sub-task, Resolved, Haejoon Lee)
        25. Move Koalas accessor to pandas_on_spark accessor (Sub-task, Resolved, Haejoon Lee)
        26. Enable plotly tests in pandas-on-Spark (Sub-task, Resolved, Hyukjin Kwon)
        27. Remove APIs that have been deprecated in Koalas. (Sub-task, Resolved, Takuya Ueshin)
        28. Apply black to pandas API on Spark codes. (Sub-task, Resolved, Haejoon Lee)
        29. Reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true (Sub-task, Resolved, Hyukjin Kwon)
        30. Restore to_koalas to keep the backward compatibility (Sub-task, Resolved, Haejoon Lee)
        31. Move to_pandas_on_spark to the Spark DataFrame. (Sub-task, Resolved, Haejoon Lee)
        32. Split pyspark-pandas tests. (Sub-task, Resolved, Takuya Ueshin)
        33. Adjust pandas-on-spark `test_groupby_multiindex_columns` test for different pandas versions (Sub-task, Resolved, Apache Spark)
        34. Support y properly in DataFrame with non-numeric columns with plots (Sub-task, Resolved, Hyukjin Kwon)
        35. Remove the upperbound for numpy for pandas-on-Spark (Sub-task, Resolved, Takuya Ueshin)
        36. Use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings. (Sub-task, Resolved, Takuya Ueshin)
        37. Cleanup the version logic from the pandas API on Spark (Sub-task, Resolved, Haejoon Lee)


          People

            Assignee: Haejoon Lee (itholic)
            Reporter: Haejoon Lee (itholic)
            Shepherd: Hyukjin Kwon
            Votes: 3
            Watchers: 10

