Spark / SPARK-32082

Project Zen: Improving Python usability



    • Type: Epic
    • Status: Resolved
    • Priority: Critical
    • Resolution: Done
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.4.0
    • Component/s: PySpark
    • Labels: None
    • Epic Name: Project Zen


      The importance of Python and PySpark has grown radically in the last few years. PySpark downloads exceed 1.3 million per week on PyPI alone. Nevertheless, PySpark is still not very Pythonic: it exposes raw JVM error messages, for example, and its API documentation is poorly written.

      This epic aims to improve usability in PySpark and make it more Pythonic. Specifically, this JIRA targets the four areas below, each with examples:

      • Being Pythonic
        • Pandas UDF enhancements and type hints
        • Avoid dynamic function definitions, for example in functions.py, which IDEs cannot detect.
      • Better and easier usability in PySpark
        • User-facing error message and warnings
        • Documentation
        • User guide
        • Better examples and API documentation, e.g., as in Koalas and pandas
      • Better interoperability with other Python libraries
        • Visualization and plotting
        • Potentially better interface by leveraging Arrow
        • Compatibility with other libraries such as NumPy universal functions or pandas possibly by leveraging Koalas
      • PyPI Installation
        • PySpark with Hadoop 3 support on PyPI
        • Better error handling
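The dynamic-definition problem called out above (and addressed by SPARK-32084) can be sketched in plain Python. The helper name `_create_function`, the dictionary, and the docstrings below are illustrative stand-ins, not the actual code from functions.py:

```python
# Old pattern: functions are built dynamically and registered by mutating
# globals(). There is no literal `def sqrt(...)` anywhere, so IDEs and
# static analyzers cannot find, autocomplete, or jump to these functions.

def _create_function(name, doc):
    """Build a wrapper function dynamically (illustrative stand-in)."""
    def _(col):
        return f"{name}({col})"  # stand-in for the real JVM call
    _.__name__ = name
    _.__doc__ = doc
    return _

_unary_functions = {
    "sqrt": "Computes the square root of the given column.",
    "abs": "Computes the absolute value of the given column.",
}
for _name, _doc in _unary_functions.items():
    globals()[_name] = _create_function(_name, _doc)

# Project Zen's fix: plain, statically visible definitions that tooling
# can detect and document.
def upper(col):
    """Converts a string column to upper case."""
    return f"upper({col})"

print(sqrt("age"))    # works at runtime, but IDEs cannot resolve `sqrt`
print(upper("name"))  # a proper `def` that tools can detect
```

Replacing the dictionary-driven loop with explicit `def` statements trades a few lines of brevity for discoverability: autocomplete, type checking, and rendered API docs all rely on the function existing statically in the module.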


      SPARK-31382 Show a better error message for different python and pip installation mistake RESOLVED Hyukjin Kwon
      SPARK-31849 Improve Python exception messages to be more Pythonic RESOLVED Hyukjin Kwon
      SPARK-31851 Redesign PySpark documentation RESOLVED Hyukjin Kwon
      SPARK-32017 Make Pyspark Hadoop 3.2+ Variant available in PyPI RESOLVED Hyukjin Kwon
      SPARK-32084 Replace dictionary-based function definitions to proper functions in functions.py RESOLVED Maciej Szymkiewicz
      SPARK-32085 Migrate to NumPy documentation style RESOLVED Maciej Szymkiewicz
      SPARK-32161 Hide JVM traceback for SparkUpgradeException RESOLVED Pralabh Kumar
      SPARK-32185 User Guide - Monitoring RESOLVED Abhijeet Prasad
      SPARK-32195 Standardize warning types and messages RESOLVED Maciej Szymkiewicz
      SPARK-32204 Binder Integration RESOLVED Hyukjin Kwon
      SPARK-32681 PySpark type hints support RESOLVED Maciej Szymkiewicz
      SPARK-32686 Un-deprecate inferring DataFrame schema from list of dictionaries RESOLVED Nicholas Chammas
      SPARK-33247 Improve examples and scenarios in docstrings RESOLVED Unassigned
      SPARK-33407 Simplify the exception message from Python UDFs RESOLVED Hyukjin Kwon
      SPARK-33530 Support --archives option natively RESOLVED Hyukjin Kwon
      SPARK-34629 Python type hints improvement RESOLVED Maciej Szymkiewicz
      SPARK-34849 SPIP: Support pandas API layer on PySpark RESOLVED Haejoon Lee
      SPARK-34885 Port/integrate Koalas documentation into PySpark RESOLVED Hyukjin Kwon
      SPARK-35337 pandas API on Spark: Separate basic operations into data type based structures RESOLVED Xinrong Meng
      SPARK-35419 Enable spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled by default RESOLVED Hyukjin Kwon
      SPARK-35464 pandas API on Spark: Enable mypy check "disallow_untyped_defs" for main codes. RESOLVED Takuya Ueshin
      SPARK-35805 API auditing in Pandas API on Spark RESOLVED Haejoon Lee




            Assignee: gurwls223 Hyukjin Kwon
            Reporter: gurwls223 Hyukjin Kwon
            Votes: 4
            Watchers: 34