  1. Spark
  2. SPARK-32681 PySpark type hints support
  3. SPARK-17333

Make pyspark interface friendly with mypy static analysis

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: PySpark
    • Labels: None

    Description

      Static analysis tools, such as those IDEs use for auto-completion and error marking, tend to give poor results with pyspark.

      This is caused by two separate issues:
      The first is that many elements are created programmatically, such as the max function in pyspark.sql.functions.
      The second is that we tend to use pyspark in a functional manner, chaining many actions (e.g. df.filter().groupby().agg()....), and since the Python code carries no type information, the type of each intermediate result is hard for both tools and readers to work out.
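To illustrate the first issue, here is a minimal sketch of the programmatic pattern. The names `_FUNCTIONS`, `_make` and the `functions` module object are hypothetical stand-ins, not the actual pyspark internals; the point is only that functions attached at runtime exist when the code runs but never appear as `def` statements for a static analyzer to find.

```python
import types

# Hypothetical table of function names and docstrings (not the real pyspark one).
_FUNCTIONS = {
    "max": "Aggregate function: returns the maximum value.",
    "min": "Aggregate function: returns the minimum value.",
}


def _make(name, doc):
    """Build a function whose name and docstring are only known at runtime."""
    def _(col):
        # A real implementation would call into the JVM; this sketch just
        # returns a readable expression string.
        return "{}({})".format(name, col)
    _.__name__ = name
    _.__doc__ = doc
    return _


# Stand-in for a module such as pyspark.sql.functions.
functions = types.ModuleType("functions")

# The functions come into existence here, not through `def max(...)`,
# so a tool that only reads the source sees no definition of `max` or `min`.
for _name, _doc in _FUNCTIONS.items():
    setattr(functions, _name, _make(_name, _doc))

print(functions.max("age"))
```

At runtime `functions.max` works as expected, yet nothing in the source text tells an IDE that the attribute exists or what it accepts.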

      I would suggest changing the interface to improve it.

      The way I see it, we can either change the interface or provide interface enhancements.

      Changing the interface means defining (when possible) all functions directly, i.e. instead of keeping a _functions_ dictionary in pyspark/sql/functions.py and generating the functions programmatically with _create_function, we write each function out explicitly:
      def max(col):
          """
          docstring
          """
          # Delegate to the existing helper; the name is passed as a string.
          return _create_function("max", "docstring")(col)

      Second, we can add type hints to all functions, as defined in PEP 484 or in PyCharm's legacy type hinting (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
      So for example max might look like this:
      def max(col):
          """
          Does a max.
          :type col: Column
          :rtype: Column
          """
      This would be widely supported, since docstring hints of this kind, while old, are still pretty common.
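For comparison, the PEP 484 form puts the same information in the signature instead of the docstring. The sketch below uses hypothetical toy classes named `Column`, `GroupedData` and `DataFrame` (they are not the pyspark classes and do no real work) to show how return annotations could let a checker follow a chain such as df.filter().groupby().agg().

```python
from typing import List


class Column:
    def __init__(self, expr: str) -> None:
        self.expr = expr


class GroupedData:
    def __init__(self, keys: List[str]) -> None:
        self.keys = keys

    def agg(self, *exprs: Column) -> "DataFrame":
        # The result has the grouping keys plus one column per aggregate.
        return DataFrame(self.keys + [e.expr for e in exprs])


class DataFrame:
    def __init__(self, columns: List[str]) -> None:
        self.columns = columns

    def filter(self, condition: Column) -> "DataFrame":
        # Filtering keeps the same columns in this toy model.
        return DataFrame(list(self.columns))

    def groupby(self, *cols: str) -> GroupedData:
        return GroupedData(list(cols))


def max(col: Column) -> Column:
    """Toy aggregate: wraps the column expression in max(...)."""
    return Column("max({})".format(col.expr))


df = DataFrame(["name", "age"])
# Each step has a declared return type, so a checker knows that
# .groupby() is valid after .filter() and .agg() is valid after .groupby().
result = df.filter(Column("age > 21")).groupby("name").agg(max(Column("age")))
print(result.columns)
```

With these annotations a checker can reject something like `df.groupby("name").filter(...)` in this toy model, because `GroupedData` declares no `filter` method.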

      A second option is to use PEP 3107 function annotations to define the interfaces in separate stub files (.pyi files, as described in PEP 484).
      In this case we might have a functions.pyi file which would contain something like:
      def max(col: Column) -> Column:
          """
          Aggregate function: returns the maximum value of the expression in a group.
          """
          ...

      This has the advantage of types that are easier to read and of leaving the implementation untouched (only supporting files are added). The disadvantages are that the stubs are managed separately from the code, so there is a greater chance of the two drifting apart by mistake, and that some configuration would be needed in the IDE or static analysis tool instead of everything working out of the box.
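The risk of separately managed stubs can be made concrete with a small consistency check. This is only a sketch under stated assumptions: `IMPL_SOURCE` and `STUB_SOURCE` are hypothetical inline stand-ins for functions.py and functions.pyi, the `min` entry is invented for illustration, and only positional argument names are compared. Dedicated tools such as mypy's stubtest perform a far more thorough comparison.

```python
import ast

# Hypothetical implementation module contents.
IMPL_SOURCE = '''
def max(col):
    """Aggregate function: returns the maximum value of the expression in a group."""

def min(col):
    """Aggregate function: returns the minimum value of the expression in a group."""
'''

# Hypothetical stub contents; `min` deliberately uses a different argument name.
STUB_SOURCE = '''
def max(col: Column) -> Column: ...
def min(column: Column) -> Column: ...
'''


def signatures(source):
    """Map each top-level function name to its positional argument names."""
    tree = ast.parse(source)
    return {
        node.name: [arg.arg for arg in node.args.args]
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


def find_drift(impl_source, stub_source):
    """Return implementation functions whose stub is missing or differs."""
    impl = signatures(impl_source)
    stub = signatures(stub_source)
    return sorted(name for name in impl if stub.get(name) != impl[name])


print(find_drift(IMPL_SOURCE, STUB_SOURCE))
```

Because the sources are only parsed and never executed, the undefined name `Column` in the stub text is not a problem here.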

            People

              Assignee: Fokko Driesprong
              Reporter: Assaf Mendelson
              Votes: 6
              Watchers: 11
