[SPARK-24712] TrainValidationSplit ignores label column name and forces to be "label" - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: ML
Labels: None

Description

When a TrainValidationSplit is fit on a Pipeline containing an ML model, the labelCol property of the model is ignored, and the call to fit() fails unless labelCol equals "label". As an example, the following PySpark code works only when the variable labelColumn is set to "label":

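from pyspark.sql.functions import rand, randn
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit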

labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS

df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), randn(seed=27).alias(labelColumn))
vectorAssembler = VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
mypipeline = Pipeline(stages = [vectorAssembler, lr])

paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.01, 0.1])\
    .build()

trainValidationSplit = TrainValidationSplit()\
    .setEstimator(mypipeline)\
    .setEvaluator(RegressionEvaluator())\
    .setEstimatorParamMaps(paramGrid)\
    .setTrainRatio(0.8)

trainValidationSplit.fit(df)  # FAIL UNLESS labelColumn IS SET TO "label"

People

Assignee: Unassigned

Reporter: Pablo J. Villacorta

Votes: 0

Watchers: 3

Dates

Created: 01/Jul/18 18:48

Updated: 02/Jul/18 11:37

Resolved: 02/Jul/18 11:37