The importance of Python and PySpark has grown substantially in the last few years. PySpark downloads now exceed 1.3 million per week, counting PyPI alone. Nevertheless, PySpark is still not very Pythonic. For example, it exposes many JVM error messages, and the API documentation is poorly written.
This epic ticket aims to improve the usability of PySpark and make it more Pythonic. More explicitly, this JIRA targets the four bullet points below, each listed with examples:
- Being Pythonic
  - Pandas UDF enhancements and type hints
  - Avoid dynamic function definitions, for example in functions.py, which prevent IDEs from detecting the functions.
- Better and easier usability in PySpark
- Better interoperability with other Python libraries
  - Visualization and plotting
  - Potentially better interface by leveraging Arrow
  - Compatibility with other libraries, such as NumPy universal functions or pandas, possibly by leveraging Koalas
- PyPI Installation
  - PySpark with Hadoop 3 support on PyPI
  - Better error handling
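
To illustrate the point about dynamic function definitions, here is a minimal sketch in plain Python. It is not taken from PySpark's functions.py. The two module sources, the `_create_function` helper, and the `upper`/`lower` names are hypothetical and only mimic the general pattern. The sketch compares what a static tool can find by parsing the source with what exists once the module has actually run.

```python
import ast

# Hypothetical module source that creates its public functions dynamically.
# This only mimics the pattern; it is not PySpark's actual code.
DYNAMIC_SOURCE = '''
def _create_function(name):
    def _(col):
        return f"{name}({col})"
    _.__name__ = name
    return _

for _name in ["upper", "lower"]:
    globals()[_name] = _create_function(_name)
'''

# The same (hypothetical) functions written out explicitly.
EXPLICIT_SOURCE = '''
def upper(col):
    return f"upper({col})"

def lower(col):
    return f"lower({col})"
'''


def statically_visible_functions(source):
    """Names of top-level functions found by parsing only, without executing."""
    tree = ast.parse(source)
    return sorted(
        node.name for node in tree.body if isinstance(node, ast.FunctionDef)
    )


def runtime_functions(source):
    """Names of public callables that exist after the source is executed."""
    namespace = {}
    exec(source, namespace)
    return sorted(
        name
        for name, value in namespace.items()
        if callable(value) and not name.startswith("_")
    )


if __name__ == "__main__":
    # Parsing alone finds only the private helper in the dynamic module.
    print(statically_visible_functions(DYNAMIC_SOURCE))
    # Executing it reveals the public functions that were injected at import time.
    print(runtime_functions(DYNAMIC_SOURCE))
    # With explicit definitions, parsing alone already finds the public functions.
    print(statically_visible_functions(EXPLICIT_SOURCE))
```

A tool that only parses the source, as IDEs and type checkers typically do, has no `upper` or `lower` definition to point to in the dynamic variant, so autocompletion and go-to-definition have nothing to work with. Writing the definitions out explicitly closes that gap at the cost of some repetition.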