Spark / SPARK-24712

TrainValidationSplit ignores label column name and forces to be "label"

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      When a TrainValidationSplit is fit on a Pipeline containing an ML model, the labelCol property of the model is ignored, and the call to fit() fails unless labelCol equals "label". For example, the following pyspark code works only when the variable labelColumn is set to "label":

      # Assumes an active SparkSession named `spark`, as in the pyspark shell
      from pyspark.sql.functions import rand, randn
      from pyspark.ml import Pipeline
      from pyspark.ml.evaluation import RegressionEvaluator
      from pyspark.ml.feature import VectorAssembler
      from pyspark.ml.regression import LinearRegression
      from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
      
      labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS
      
      df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), randn(seed=27).alias(labelColumn))
      vectorAssembler = VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
      lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
      mypipeline = Pipeline(stages=[vectorAssembler, lr])
      
      paramGrid = ParamGridBuilder()\
      .addGrid(lr.regParam, [0.01, 0.1])\
      .build()
      
      trainValidationSplit = TrainValidationSplit()\
      .setEstimator(mypipeline)\
      .setEvaluator(RegressionEvaluator())\
      .setEstimatorParamMaps(paramGrid)\
      .setTrainRatio(0.8)
      
      trainValidationSplit.fit(df)  # FAILS UNLESS labelColumn IS SET TO "label"
      

People

    • Assignee: Unassigned
    • Reporter: Pablo J. Villacorta (olbapjose)
    • Votes: 0
    • Watchers: 3
