Spark / SPARK-23377

Bucketizer with multiple columns persistence bug


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0, 2.4.0
    • Component/s: ML
    • Labels:
      None

      Description

      A Bucketizer with multiple input/output columns gets "outputCol" explicitly set to its default value after a write -> read round trip, which causes it to throw an error on transform. Here's an example.

      import org.apache.spark.ml.feature._
      
      val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
      val bucketizer = new Bucketizer()
        .setSplitsArray(Array(splits, splits))
        .setInputCols(Array("foo1", "foo2"))
        .setOutputCols(Array("bar1", "bar2"))
      
      // toDF needs spark.implicits._, which spark-shell and notebooks import automatically
      val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
      bucketizer.transform(data)
      
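      val path = "/temp/bucketrizer-persist-test"
      bucketizer.write.overwrite.save(path)
      val bucketizerAfterRead = Bucketizer.read.load(path)
      // isDefined is also true for Params that only have a default value;
      // isSet reports whether the Param was explicitly set
      println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
      println(bucketizerAfterRead.isSet(bucketizerAfterRead.outputCol))
      // The next line throws an error because "outputCol" is set on the reloaded instance
      bucketizerAfterRead.transform(data)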
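
To see which Params change state across the round trip, one option is to compare the explicitly set Params before and after persistence. The sketch below is a diagnostic aid and not part of the original report: the object name `BucketizerRoundTripCheck`, the helper `explicitlySetParams`, the local SparkSession, and the temporary directory are all illustrative assumptions. It relies only on `Params.params`, `Params.isSet`, and the same save and load steps as the reproduction.

```scala
import java.nio.file.Files

import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.ml.param.Params
import org.apache.spark.sql.SparkSession

object BucketizerRoundTripCheck {
  // Hypothetical helper: names of the Params that were explicitly set.
  // Params that only carry a default value are excluded.
  def explicitlySetParams(stage: Params): Set[String] =
    stage.params.filter(p => stage.isSet(p)).map(_.name).toSet

  def main(args: Array[String]): Unit = {
    // A local session is assumed here; in spark-shell one already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bucketizer-round-trip")
      .getOrCreate()
    try {
      val splits = Array(Double.NegativeInfinity, 0.0, 10.0, 100.0, Double.PositiveInfinity)
      val bucketizer = new Bucketizer()
        .setSplitsArray(Array(splits, splits))
        .setInputCols(Array("foo1", "foo2"))
        .setOutputCols(Array("bar1", "bar2"))

      // Illustrative temporary location, not the path from the report.
      val path = Files.createTempDirectory("bucketizer-persist-test").toString
      bucketizer.write.overwrite().save(path)
      val reloaded = Bucketizer.load(path)

      val before = explicitlySetParams(bucketizer)
      val after = explicitlySetParams(reloaded)
      println(s"set before save: ${before.toSeq.sorted.mkString(", ")}")
      println(s"set after load:  ${after.toSeq.sorted.mkString(", ")}")
      // Any name listed here became explicitly set only through persistence.
      println(s"newly set:       ${(after -- before).toSeq.sorted.mkString(", ")}")
    } finally {
      spark.stop()
    }
  }
}
```

If "newly set" lists `outputCol`, the reloaded instance is in the state described above; an empty list would suggest the Spark version in use does not exhibit the problem.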
      

      And the trace:

      java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the inputCols Param set for multi-column transform. The following Params are not applicable and should not be set: outputCol.
      	at org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
      	at org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
      	at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
      	at org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
      	at line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)
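
The exception in the trace comes from a validation that treats the single-column and multi-column Params as mutually exclusive. The following is a simplified, Spark-free model of that kind of check, meant to illustrate why a stray `outputCol` is fatal once `inputCols` is set. The names `ToyStage`, `ExclusiveParamsModel`, and `checkSingleVsMultiColumn` are hypothetical, and the logic is an assumption inferred from the error message rather than a copy of Spark's `ParamValidators` code.

```scala
// Toy stand-in for an ML stage: tracks only which Param names were explicitly set.
final case class ToyStage(uid: String, setParams: Set[String]) {
  def isSet(name: String): Boolean = setParams.contains(name)
}

object ExclusiveParamsModel {
  // Rejects a stage that has the multi-column Param set together with any
  // single-column Param, with a message modeled on the one in the trace.
  def checkSingleVsMultiColumn(stage: ToyStage, singleColumnParams: Seq[String]): Unit = {
    if (stage.isSet("inputCols")) {
      val offending = singleColumnParams.filter(stage.isSet)
      require(
        offending.isEmpty,
        s"Bucketizer ${stage.uid} has the inputCols Param set for multi-column transform. " +
          s"The following Params are not applicable and should not be set: ${offending.mkString(", ")}."
      )
    }
  }

  def main(args: Array[String]): Unit = {
    val single = Seq("inputCol", "outputCol", "splits")
    val beforeSave = ToyStage("bucketizer_demo", Set("inputCols", "outputCols", "splitsArray"))
    // Models the reloaded stage from the report: outputCol has become explicitly set.
    val afterLoad = beforeSave.copy(setParams = beforeSave.setParams + "outputCol")

    checkSingleVsMultiColumn(beforeSave, single) // passes: no single-column Param is set
    try checkSingleVsMultiColumn(afterLoad, single)
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```

In this model the check looks only at whether a Param is explicitly set, not at its value, which is why a default value that becomes "set" during persistence is enough to trigger the failure.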
      
      

              People

              • Assignee: Liang-Chi Hsieh (viirya)
              • Reporter: Bago Amirbekian (bago.amirbekian)
              • Votes: 0
              • Watchers: 6
