Details
- Type: Umbrella
- Status: Resolved
- Priority: Blocker
- Resolution: Done
- Fix Version: 3.2.0
- None
Description
This is a SPIP for porting the Koalas project to PySpark, which was previously discussed on the dev mailing list under the same title, [DISCUSS] Support pandas API layer on PySpark.
Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.
Port Koalas into PySpark to support the pandas API layer on PySpark so that:
- Users can easily leverage their existing Spark clusters to scale their pandas workloads.
- Plotting and drawing charts is supported in PySpark.
- Users can easily switch between the pandas API and the PySpark API (see the sketch after this list).
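As a rough illustration of the intended user experience, here is a minimal sketch, assuming the ported package is exposed as pyspark.pandas and that a to_pandas_on_spark() helper exists on Spark DataFrames (both reflect what shipped in Spark 3.2 and are assumptions of this sketch, not part of the proposal text):

```python
# Minimal sketch, assuming the Koalas port is exposed as pyspark.pandas (Spark 3.2+).
import pyspark.pandas as ps

# pandas-like API, executed on Spark under the hood.
psdf = ps.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})
print(psdf["y"].mean())

# Switch to the existing (immutable) PySpark DataFrame API when needed...
sdf = psdf.to_spark()
sdf = sdf.filter(sdf.x > 1)

# ...and back to the pandas API on Spark.
psdf2 = sdf.to_pandas_on_spark()  # renamed to pandas_api() in later releases
```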
Q2. What problem is this proposal NOT designed to solve?
Some pandas APIs are deliberately left unsupported. For example, memory_usage will not be supported because, unlike pandas, Spark does not materialize DataFrames in memory.
This does not replace the existing PySpark API. The PySpark API has many users and a large amount of existing code across projects, and there are still many PySpark users who prefer Spark's immutable DataFrame API to the pandas API.
Q3. How is it done today, and what are the limits of current practice?
The current practice has two limitations, described below.
- Many features that are commonly used in data science are missing from Apache Spark. In particular, plotting and drawing charts is absent, even though it is one of the most important features that almost every data scientist uses in their daily work (see the sketch after this list).
- Data scientists tend to prefer the pandas API, but it is very hard for them to switch to the PySpark API when they need to scale their workloads, because the PySpark API is harder to learn than pandas' and many features are missing from PySpark.
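For instance, here is a hedged sketch of the kind of chart support being brought over, assuming the pyspark.pandas module and its plotly-backed plot accessor carried over from Koalas:

```python
# Sketch of chart support, assuming pyspark.pandas with the plotly backend
# (the default plotting backend ported from Koalas).
import pyspark.pandas as ps

psdf = ps.DataFrame({"value": [1, 1, 2, 3, 5, 8, 13]})

# The histogram buckets are computed distributedly on Spark executors;
# only the summarized result is rendered on the driver.
fig = psdf["value"].plot.hist(bins=4)
fig.show()  # displays an interactive plotly figure (e.g. in a notebook)
```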
Q4. What is new in your approach and why do you think it will be successful?
I believe this offers a new way for both PySpark and pandas users to easily scale their workloads. I think we can be successful because more and more people tend to use Python and pandas. In fact, there are already similar attempts such as Dask and Modin, both of which are growing fast and successfully.
Q5. Who cares? If you are successful, what difference will it make?
Anyone who wants to scale their pandas workloads on their Spark cluster. It will also significantly improve the usability of PySpark.
Q6. What are the risks?
Technically, I don't see many risks yet, given that:
- Koalas has grown separately for more than two years and has greatly improved in maturity and stability.
- Koalas will be ported into PySpark as a separate package.
The work is more about putting documentation and test cases in place and handling dependencies properly. For example, Koalas currently uses pytest with various dependencies, whereas PySpark uses plain unittest with fewer dependencies.
In addition, Koalas' default indexing system may not be well received because it can potentially cause overhead, so applying it properly to PySpark might be a challenge (see the sketch below).
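To make the indexing concern concrete, here is a minimal sketch of how a user could tune the default index, assuming the options API (ps.set_option with compute.default_index_type) is carried over from Koalas into pyspark.pandas:

```python
# Minimal sketch of tuning the default index, assuming pyspark.pandas keeps
# Koalas' compute.default_index_type option.
import pyspark.pandas as ps

# 'sequence' builds a globally continuous index but can force computation
# through a single node; 'distributed-sequence' and 'distributed' trade
# strictness of the index for less overhead.
ps.set_option("compute.default_index_type", "distributed-sequence")

psdf = ps.range(10)  # a default index is attached using the configured strategy
print(ps.get_option("compute.default_index_type"))
```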
Q7. How long will it take?
Before the Spark 3.2 release.
Q8. What are the mid-term and final “exams” to check for success?
The first check for success would be to make sure that all the existing Koalas APIs and tests work as they are, without affecting existing Koalas workloads once they run on PySpark.
The last check is to confirm, through user feedback and PySpark usage statistics, whether the usability and convenience we aim for have actually improved.
Also refer to:
Attachments
Issue Links
- incorporates
  - SPARK-34885 Port/integrate Koalas documentation into PySpark (Resolved)
  - SPARK-35805 API auditing in Pandas API on Spark (Resolved)
  - SPARK-35337 pandas API on Spark: Separate basic operations into data type based structures (Resolved)
  - SPARK-35464 pandas API on Spark: Enable mypy check "disallow_untyped_defs" for main codes. (Resolved)
  - SPARK-46291 Koalas Testing Migration (Resolved)
- is related to
  - SPARK-36310 Fix hasnan() window function in IndexOpsMixin (Resolved)
  - SPARK-36348 unexpected Index loaded: pd.Index([10, 20, None], name="x") (Resolved)
  - SPARK-36077 Support numpy literals as input for pandas-on-Spark APIs (Open)
  - SPARK-36078 Complete mappings between numpy literals and Spark data types (Open)
  - SPARK-36364 Move window and aggregate functions to DataTypeOps (Open)
  - SPARK-35638 Introduce InternalField to manage dtypes and StructFields. (Resolved)
  - SPARK-35997 Implement comparison operators for CategoricalDtype in pandas API on Spark (Resolved)
  - SPARK-36000 Support creation and operations of ps.Series/Index with Decimal('NaN') (Resolved)
  - SPARK-36003 Implement unary operator `invert` of integral ps.Series/Index (Resolved)
  - SPARK-36031 Keep same behavior with pandas for operations of series with nan (Resolved)
  - SPARK-36103 Manage InternalField in DataTypeOps.invert (Resolved)
  - SPARK-36104 Manage InternalField in DataTypeOps.neg/abs (Resolved)
  - SPARK-36140 Replace DataTypeOps tests that have operations on different Series (Resolved)
  - SPARK-36167 Revisit more InternalField managements. (Resolved)
  - SPARK-36320 Fix Series/Index.copy() to drop extra columns. (Resolved)
  - SPARK-36338 Move distributed-sequence implementation to Scala side (Resolved)
  - SPARK-36350 Make nanvl work with DataTypeOps (Resolved)
  - SPARK-36365 Remove old workarounds related to null ordering. (Resolved)
  - SPARK-36559 Allow column pruning on distributed sequence index (pandas API on Spark) (Resolved)
  - SPARK-35537 Introduce a util function to check whether the underlying expressions of the columns are the same. (Resolved)
  - SPARK-36192 Better error messages for DataTypeOps against list (Resolved)
  - SPARK-36265 Use __getitem__ instead of getItem to suppress warnings. (Resolved)
  - SPARK-36333 Reuse isnull where the null check is needed. (Resolved)
  - SPARK-35981 Use check_exact=False in StatsTest.test_cov_corr_meta to loosen the check precision (Resolved)
  - SPARK-36035 Adjust `test_astype`, `test_neg` for old pandas versions (Resolved)
  - SPARK-36394 Increase pandas API coverage in PySpark (Open)
  - SPARK-35941 pandas API on Spark: Improve type hints. (In Progress)
  - SPARK-36185 Implement functions in CategoricalAccessor/CategoricalIndex (Resolved)
  - SPARK-36474 Mention pandas API on Spark in Spark overview pages (Resolved)