Spark / SPARK-15018

PySpark ML Pipeline raises unclear error when no stages set

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: ML, PySpark
    • Labels: None

    Description

      A PySpark Pipeline with no stages should behave as an identity transformer when fit. Instead, the following error is raised:

      Traceback (most recent call last):
        File "./spark/python/pyspark/ml/base.py", line 64, in fit
          return self._fit(dataset)
        File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
          for stage in stages:
      TypeError: 'NoneType' object is not iterable
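
The failure can be illustrated without Spark. The following is a minimal sketch, assuming a hypothetical helper `run_stages` that loops over stages the way the `for stage in stages:` line in the traceback does; it is not the actual `pipeline.py` code.

```python
def run_stages(stages, dataset):
    """Hypothetical stand-in for the loop in Pipeline._fit: apply each stage in order."""
    for stage in stages:
        dataset = stage(dataset)
    return dataset


data = [1, 2, 3]

# With stages left as None, the loop itself fails before any stage runs.
try:
    run_stages(None, data)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# With an empty list, the loop body never executes and the input comes back
# unchanged, which is the identity behavior the issue asks for.
unchanged = run_stages([], data)
```
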
      

      The stages param needs to default to an empty list, and getStages should call getOrDefault.
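
The idea can be sketched as follows. `MiniParams` and `MiniPipeline` are hypothetical, heavily simplified stand-ins for the PySpark classes, written only to show how a getter that goes through a default lookup avoids returning None when nothing was set.

```python
class MiniParams:
    """Hypothetical, simplified parameter holder (not the real pyspark Params)."""

    def __init__(self):
        self._param_map = {}    # values set explicitly by the user
        self._default_map = {}  # fallback values

    def set_default(self, **kwargs):
        self._default_map.update(kwargs)

    def set(self, **kwargs):
        self._param_map.update(kwargs)

    def getOrDefault(self, name):
        # Prefer the user-supplied value, otherwise fall back to the default.
        if name in self._param_map:
            return self._param_map[name]
        return self._default_map[name]


class MiniPipeline(MiniParams):
    def __init__(self, stages=None):
        super().__init__()
        self.set_default(stages=[])  # default is an empty list, never None
        if stages is not None:
            self.set(stages=stages)

    def getStages(self):
        return self.getOrDefault("stages")


empty_stages = MiniPipeline().getStages()
given_stages = MiniPipeline(stages=["tokenizer"]).getStages()
```
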

      Also, the default value of None is changed to an empty list [] by reassigning the local variable, which never changes the value that was passed in as a keyword argument. Instead, the kwargs value should be changed directly if stages is None.

      For example, this:

      if stages is None:
          stages = []
      

      should be this:

      if stages is None:
          kwargs['stages'] = []
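
The difference between the two snippets can be shown with a small self-contained sketch. The `keyword_only` decorator below is a simplified, hypothetical stand-in that records the keyword arguments on the instance; the class names and the `params` attribute are illustrative and not part of PySpark.

```python
import functools


def keyword_only(func):
    """Simplified stand-in: remember the keyword arguments the caller passed."""
    @functools.wraps(func)
    def wrapper(self, **kwargs):
        self._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper


class RebindsLocal:
    @keyword_only
    def __init__(self, stages=None):
        if stages is None:
            stages = []  # only rebinds the local name; the recorded kwargs are untouched
        kwargs = self._input_kwargs
        self.params = dict(kwargs)


class UpdatesKwargs:
    @keyword_only
    def __init__(self, stages=None):
        kwargs = self._input_kwargs
        if stages is None:
            kwargs['stages'] = []  # changes the value that is actually passed on
        self.params = dict(kwargs)


rebound = RebindsLocal(stages=None).params
updated = UpdatesKwargs(stages=None).params
```

In this sketch, the first class still forwards `None` (or nothing at all when `stages` is omitted), while the second always forwards an empty list.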
      

      However, since there is no default value in the Scala implementation, assigning a default here is not needed and should be cleaned up. The pydocs should better indicate that stages is required to be a list.

            People

              Assignee: Bryan Cutler (bryanc)
              Reporter: Bryan Cutler (bryanc)
              Shepherd: Yanbo Liang