Description
A Bucketizer with multiple input/output columns gets its single-column "outputCol" param explicitly set (to its default value) on a write -> read round trip, which causes it to throw an error on transform. Here's an example.
import org.apache.spark.ml.feature._

val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)

val bucketizer = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
bucketizer.transform(data)

val path = "/temp/bucketrizer-persist-test"
bucketizer.write.overwrite.save(path)
val bucketizerAfterRead = Bucketizer.read.load(path)
println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
// This line throws an error because "outputCol" is set
bucketizerAfterRead.transform(data)
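Note that isDefined returns true even on a freshly constructed Bucketizer, because outputCol has a default value; isSet is the stricter check that distinguishes an explicitly set value from a default, and an explicitly set value is what trips the multi-column validation after reload. A short sketch of the distinction, assuming the vals from the repro above are in scope:

println(bucketizer.isDefined(bucketizer.outputCol))  // true: outputCol has a default
println(bucketizer.isSet(bucketizer.outputCol))      // false: never explicitly set
// After the write -> read round trip, the reloaded instance reports the
// param as explicitly set, which is the bug:
println(bucketizerAfterRead.isSet(bucketizerAfterRead.outputCol))  // true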
And the trace:
java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the inputCols Param set for multi-column transform. The following Params are not applicable and should not be set: outputCol.
at org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
at org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
at org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
at line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)
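A possible workaround until this is fixed, as an untested sketch: clear the spuriously set single-column param on the reloaded instance before calling transform. Params.clear removes a user-supplied value, so the single-vs-multi-column check should pass again. Continuing from the repro above (path and data already defined):

val reloaded = Bucketizer.read.load(path)
if (reloaded.isSet(reloaded.outputCol)) {
  // Drop the explicitly set single-column value that the read path
  // introduced, leaving only inputCols/outputCols set.
  reloaded.clear(reloaded.outputCol)
}
reloaded.transform(data)  // should no longer trip checkSingleVsMultiColumnParams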
Issue Links
- relates to SPARK-23455: Default Params in ML should be saved separately (Resolved)