Details
- Type: Umbrella
- Status: Resolved
- Priority: Blocker
- Resolution: Done
- Fix Version: 3.2.0
- None
Description
This is a SPIP for porting the Koalas project to PySpark, which was previously discussed on the dev mailing list under the same title, [DISCUSS] Support pandas API layer on PySpark.
Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.
Port Koalas into PySpark to support the pandas API layer on PySpark so that:
- Users can easily leverage their existing Spark clusters to scale their pandas workloads.
- Plotting and drawing charts is supported in PySpark.
- Users can easily switch between the pandas API and the PySpark API (see the sketch after this list).
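As a rough illustration of the intended user experience, here is a minimal sketch, assuming the ported package is exposed as pyspark.pandas and that a to_pandas_on_spark() helper exists on Spark DataFrames (both reflect what shipped in Spark 3.2 and are assumptions of this sketch, not part of the proposal text):

```python
# Minimal sketch, assuming the Koalas port is exposed as pyspark.pandas (Spark 3.2+).
import pyspark.pandas as ps

# pandas-like API, executed on Spark under the hood.
psdf = ps.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})
print(psdf["y"].mean())

# Switch to the existing (immutable) PySpark DataFrame API when needed...
sdf = psdf.to_spark()
sdf = sdf.filter(sdf.x > 1)

# ...and back to the pandas API on Spark.
psdf2 = sdf.to_pandas_on_spark()  # renamed to pandas_api() in later releases
```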
Q2. What problem is this proposal NOT designed to solve?
Some pandas APIs are deliberately left unsupported. For example, memory_usage will not be supported because, unlike pandas, Spark does not materialize DataFrames in memory.
This does not replace the existing PySpark API. The PySpark API has many users and a large amount of existing code across projects, and there are still many PySpark users who prefer Spark's immutable DataFrame API to the pandas API.
Q3. How is it done today, and what are the limits of current practice?
The current practice has two limitations, described below.
- Many features that are commonly used in data science are missing from Apache Spark. In particular, plotting and drawing charts is absent, even though it is one of the most important features that almost every data scientist uses in their daily work (see the sketch after this list).
- Data scientists tend to prefer the pandas API, but it is very hard for them to switch to the PySpark API when they need to scale their workloads, because the PySpark API is harder to learn than pandas' and many features are missing from PySpark.
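For instance, here is a hedged sketch of the kind of chart support being brought over, assuming the pyspark.pandas module and its plotly-backed plot accessor carried over from Koalas:

```python
# Sketch of chart support, assuming pyspark.pandas with the plotly backend
# (the default plotting backend ported from Koalas).
import pyspark.pandas as ps

psdf = ps.DataFrame({"value": [1, 1, 2, 3, 5, 8, 13]})

# The histogram buckets are computed distributedly on Spark executors;
# only the summarized result is rendered on the driver.
fig = psdf["value"].plot.hist(bins=4)
fig.show()  # displays an interactive plotly figure (e.g. in a notebook)
```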
Q4. What is new in your approach and why do you think it will be successful?
I believe this offers a new way for both PySpark and pandas users to easily scale their workloads. I think we can be successful because more and more people tend to use Python and pandas. In fact, there are already similar attempts such as Dask and Modin, both of which are growing fast and successfully.
Q5. Who cares? If you are successful, what difference will it make?
Anyone who wants to scale their pandas workloads on their Spark cluster. It will also significantly improve the usability of PySpark.
Q6. What are the risks?
Technically, I don't see many risks yet, given that:
- Koalas has grown separately for more than two years and has greatly improved in maturity and stability.
- Koalas will be ported into PySpark as a separate package.
The work is more about putting documentation and test cases in place and handling dependencies properly. For example, Koalas currently uses pytest with various dependencies, whereas PySpark uses plain unittest with fewer dependencies.
In addition, Koalas' default indexing system may not be well received because it can potentially cause overhead, so applying it properly to PySpark might be a challenge (see the sketch below).
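To make the indexing concern concrete, here is a minimal sketch of how a user could tune the default index, assuming the options API (ps.set_option with compute.default_index_type) is carried over from Koalas into pyspark.pandas:

```python
# Minimal sketch of tuning the default index, assuming pyspark.pandas keeps
# Koalas' compute.default_index_type option.
import pyspark.pandas as ps

# 'sequence' builds a globally continuous index but can force computation
# through a single node; 'distributed-sequence' and 'distributed' trade
# strictness of the index for less overhead.
ps.set_option("compute.default_index_type", "distributed-sequence")

psdf = ps.range(10)  # a default index is attached using the configured strategy
print(ps.get_option("compute.default_index_type"))
```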
Q7. How long will it take?
Before the Spark 3.2 release.
Q8. What are the mid-term and final “exams” to check for success?
The first check for success would be to make sure that all the existing Koalas APIs and tests work as they are, without affecting existing Koalas workloads once they run on PySpark.
The last check is to confirm, through user feedback and PySpark usage statistics, whether the usability and convenience we aim for have actually improved.
Also refer to:
Attachments
Issue Links
- incorporates
  - SPARK-34885 Port/integrate Koalas documentation into PySpark (Resolved)
  - SPARK-35805 API auditing in Pandas API on Spark (Resolved)
  - SPARK-35337 pandas API on Spark: Separate basic operations into data type based structures (Resolved)
  - SPARK-35464 pandas API on Spark: Enable mypy check "disallow_untyped_defs" for main codes. (Resolved)
  - SPARK-46291 Koalas Testing Migration (Resolved)
- is related to
  - SPARK-36310 Fix hasnan() window function in IndexOpsMixin (Resolved)
  - SPARK-36348 unexpected Index loaded: pd.Index([10, 20, None], name="x") (Resolved)
  - SPARK-36077 Support numpy literals as input for pandas-on-Spark APIs (Open)
  - SPARK-36078 Complete mappings between numpy literals and Spark data types (Open)
  - SPARK-36364 Move window and aggregate functions to DataTypeOps (Open)
  - SPARK-35638 Introduce InternalField to manage dtypes and StructFields. (Resolved)
  - SPARK-35997 Implement comparison operators for CategoricalDtype in pandas API on Spark (Resolved)
  - SPARK-36000 Support creation and operations of ps.Series/Index with Decimal('NaN') (Resolved)
  - SPARK-36003 Implement unary operator `invert` of integral ps.Series/Index (Resolved)
  - SPARK-36031 Keep same behavior with pandas for operations of series with nan (Resolved)
  - SPARK-36103 Manage InternalField in DataTypeOps.invert (Resolved)
  - SPARK-36104 Manage InternalField in DataTypeOps.neg/abs (Resolved)
  - SPARK-36140 Replace DataTypeOps tests that have operations on different Series (Resolved)
  - SPARK-36167 Revisit more InternalField managements. (Resolved)
  - SPARK-36320 Fix Series/Index.copy() to drop extra columns. (Resolved)
  - SPARK-36338 Move distributed-sequence implementation to Scala side (Resolved)
  - SPARK-36350 Make nanvl work with DataTypeOps (Resolved)
  - SPARK-36365 Remove old workarounds related to null ordering. (Resolved)
  - SPARK-36559 Allow column pruning on distributed sequence index (pandas API on Spark) (Resolved)
  - SPARK-35537 Introduce a util function to check whether the underlying expressions of the columns are the same. (Resolved)
  - SPARK-36192 Better error messages for DataTypeOps against list (Resolved)
  - SPARK-36265 Use __getitem__ instead of getItem to suppress warnings. (Resolved)
  - SPARK-36333 Reuse isnull where the null check is needed. (Resolved)
  - SPARK-35981 Use check_exact=False in StatsTest.test_cov_corr_meta to loosen the check precision (Resolved)
  - SPARK-36035 Adjust `test_astype`, `test_neg` for old pandas versions (Resolved)
  - SPARK-36394 Increase pandas API coverage in PySpark (Open)
  - SPARK-35941 pandas API on Spark: Improve type hints. (In Progress)
  - SPARK-36185 Implement functions in CategoricalAccessor/CategoricalIndex (Resolved)
  - SPARK-36474 Mention pandas API on Spark in Spark overview pages (Resolved)