[SPARK-33592] Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading - ASF JIRA

Details

Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0, 3.1.0
Fix Version/s: None
Component/s: ML, PySpark
Labels: None

Description

Two typical cases to reproduce it:
(1)

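from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

tvsPath = "/tmp/tvs"  # placeholder path, not part of the original report

# Pipeline whose nested stages carry the params being tuned
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())

# Save and reload the unfitted validator
tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)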

Checking `loadedTvs.getEstimatorParamMaps()` then shows that the tuning params `hashingTF.numFeatures` and `lr.maxIter` are lost.

(2)

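from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

tvsPath = "/tmp/tvs"  # placeholder path, not part of the original report

# The tuned param belongs to the classifier nested inside OneVsRest
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)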

Checking `loadedTvs.getEstimatorParamMaps()` then shows that the tuning param `lr.maxIter` is lost.

Both CrossValidator and TrainValidationSplit in PySpark have this issue.
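One way to make the loss visible is to compare the param maps before saving and after loading. The following is a minimal sketch, not part of the report or of either linked pull request: `summarize_param_maps`, `lost_params`, and `FakeParam` are hypothetical names introduced here. It assumes each key in a param map exposes `parent` (the owning stage's uid) and `name`, as `pyspark.ml.param.Param` does, and it treats a param as lost when its (parent uid, name, value) triple no longer appears after reloading.

```python
from collections import namedtuple


def summarize_param_maps(param_maps):
    """Reduce a list of {Param: value} dicts to lists of (parent_uid, name, value)."""
    return [
        sorted(((p.parent, p.name, v) for p, v in pm.items()),
               key=lambda e: (e[0], e[1], repr(e[2])))
        for pm in param_maps
    ]


def lost_params(original_maps, loaded_maps):
    """Return entries present in the original maps but absent from the loaded ones."""
    before = {e for pm in summarize_param_maps(original_maps) for e in pm}
    after = {e for pm in summarize_param_maps(loaded_maps) for e in pm}
    return sorted(before - after, key=lambda e: (e[0], e[1], repr(e[2])))


# Hypothetical stand-in for pyspark.ml.param.Param, so the sketch runs without Spark.
FakeParam = namedtuple("FakeParam", ["parent", "name"])

if __name__ == "__main__":
    max_iter = FakeParam("LogisticRegression_abc", "maxIter")
    original = [{max_iter: 100}, {max_iter: 200}]
    loaded = [{}, {}]  # simulates tuning params dropped on reload
    print(lost_params(original, loaded))
```

With a Spark session available, the same helper could be called as `lost_params(tvs.getEstimatorParamMaps(), loadedTvs.getEstimatorParamMaps())`; an empty result would indicate that every tuning param survived the round trip under the assumption above.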

Issue Links

links to

[Github] Pull Request #30539 (WeichenXu123)

[Github] Pull Request #30590 (WeichenXu123)

People

Assignee: Weichen Xu

Reporter: Weichen Xu

Votes: 0

Watchers: 2

Dates

Created: 30/Nov/20 01:09

Updated: 03/Dec/20 11:52