Spark / SPARK-25124

VectorSizeHint.size is buggy, breaking streaming pipeline

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.3.2, 2.4.0
    • Component/s: ML
    • Labels:

      Description

      Currently, when VectorSizeHint().setSize(3) is used in an ML pipeline, transforming a stream fails with a nondescript exception about the stream not having been started. The root cause is a pair of bugs in the Python wrapper: setSize and getSize do not return their values and instead return None:

      https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846

      How to reproduce, using the example in the doc:

      from pyspark.sql import SparkSession
      from pyspark.ml.linalg import Vectors
      from pyspark.ml import Pipeline, PipelineModel
      from pyspark.ml.feature import VectorAssembler, VectorSizeHint
      # In the pyspark shell `spark` already exists; getOrCreate() reuses it.
      spark = SparkSession.builder.getOrCreate()
      data = [(Vectors.dense([1., 2., 3.]), 4.)]
      df = spark.createDataFrame(data, ["vector", "float"])
      # Will fail: on 2.3.1 setSize returns None, so sizeHint is None
      sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3)
      vecAssembler = VectorAssembler(inputCols=["vector", "float"], outputCol="assembled")
      pipeline = Pipeline(stages=[sizeHint, vecAssembler])
      pipelineModel = pipeline.fit(df)
      pipelineModel.transform(df).head().assembled
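
      The failure mode described above can be illustrated without Spark. The following is a minimal sketch, not the pyspark source: BuggySizeHint and FixedSizeHint are hypothetical stand-in classes invented for this illustration. It shows how a setter or getter that omits its return statement evaluates to None, and how returning self (and the stored value) restores chained calls such as VectorSizeHint(...).setSize(3).

```python
class BuggySizeHint:
    """Hypothetical stand-in (not a pyspark class) mimicking the reported bug."""

    def __init__(self):
        self._size = None

    def setSize(self, value):
        self._size = value  # bug: nothing is returned, so the call yields None

    def getSize(self):
        self._size  # bug: the value is looked up but never returned


class FixedSizeHint:
    """Hypothetical stand-in showing the conventional setter/getter shape."""

    def __init__(self):
        self._size = None

    def setSize(self, value):
        self._size = value
        return self  # returning self keeps builder-style chaining working

    def getSize(self):
        return self._size


buggy = BuggySizeHint().setSize(3)   # chained call captures the setter's result
fixed = FixedSizeHint().setSize(3)

print(buggy)            # None
print(fixed.getSize())  # 3
```

      On an affected version, a workaround consistent with this sketch would be to construct the stage first and call setSize on a separate line, so the variable keeps pointing at the stage rather than at the setter's return value.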
      

            People

            • Assignee: Huaxin Gao (huaxingao)
            • Reporter: Timothy Hunter (timhunter)
            • Shepherd: Joseph K. Bradley
            • Votes: 0
            • Watchers: 5
