  1. Spark
  2. SPARK-32082

Project Zen: Improving Python usability



    • Type: Epic
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Epic Name: Project Zen


      Python and PySpark have grown considerably in importance over the last few years. PySpark now sees more than 1.3 million downloads per week on PyPI alone. Nevertheless, PySpark is still not very Pythonic. For example, it exposes many JVM error messages to users, and the API documentation is poorly written.

      This epic aims to improve the usability of PySpark and make it more Pythonic. More specifically, this JIRA targets the four areas below, each listed with examples:

      • Being Pythonic
        • Pandas UDF enhancements and type hints
        • Avoid dynamic function definitions, for example in functions.py, which prevent IDEs from detecting the functions.
      • Better and easier usability in PySpark
        • User-facing error messages and warnings
        • Documentation
        • User guide
        • Better examples and API documentation, e.g. like those of Koalas and pandas
      • Better interoperability with other Python libraries
        • Visualization and plotting
        • Potentially better interface by leveraging Arrow
        • Compatibility with other libraries, such as NumPy universal functions or pandas, possibly by leveraging Koalas
      • PyPI Installation
        • PySpark with Hadoop 3 support on PyPI
        • Better error handling
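
The point about dynamic function definitions can be illustrated with a small, self-contained sketch. It does not reproduce the actual contents of functions.py: the two module sources, `_create_function`, and both helper functions are hypothetical names made up for illustration. The sketch compares what a tool can see by parsing the source alone (roughly what an IDE does) with what exists after the module has actually run.

```python
import ast

# Hypothetical module that builds its public API at import time.
DYNAMIC_SOURCE = '''
def _create_function(name, doc):
    def _(col):
        return f"{name}({col})"
    _.__name__ = name
    _.__doc__ = doc
    return _

_functions = {
    "sqrt": "Computes the square root.",
    "abs": "Computes the absolute value.",
}
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)
'''

# The same hypothetical API written out as ordinary definitions.
EXPLICIT_SOURCE = '''
def sqrt(col):
    """Computes the square root."""
    return f"sqrt({col})"

def abs(col):
    """Computes the absolute value."""
    return f"abs({col})"
'''


def statically_visible_functions(source):
    """Public top-level functions found by parsing only, without running the code."""
    tree = ast.parse(source)
    return sorted(
        node.name
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_")
    )


def runtime_functions(source):
    """Public callables that exist in the namespace after executing the code."""
    namespace = {}
    exec(source, namespace)
    return sorted(
        name
        for name, value in namespace.items()
        if callable(value) and not name.startswith("_")
    )


if __name__ == "__main__":
    for label, src in [("dynamic", DYNAMIC_SOURCE), ("explicit", EXPLICIT_SOURCE)]:
        print(label, statically_visible_functions(src), runtime_functions(src))
```

In the dynamic variant the functions exist at runtime but are invisible to a parser, which is why completion and "go to definition" tend to fail for them. Writing the definitions out explicitly trades some repetition for tooling support and per-function docstrings.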




            hyukjin.kwon Hyukjin Kwon