Spark / SPARK-19781

Bucketizer's handleInvalid leaves null values untouched, unlike NaNs


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: MLlib
    • Patch

    Description

      Bucketizer puts NaN values into a special extra bucket when handleInvalid is set to "keep", but it leaves null values untouched.

      import org.apache.spark.ml.feature.Bucketizer
      val data = sc.parallelize(Seq(("crackcell", null.asInstanceOf[java.lang.Double]))).toDF("name", "number")
      val bucketizer = new Bucketizer().setInputCol("number").setOutputCol("number_output").setSplits(Array(Double.NegativeInfinity, 0, 10, Double.PositiveInfinity)).setHandleInvalid("keep")
      val res = bucketizer.transform(data)
      res.show(1)
      

      will output:

      +---------+------+-------------+
      |     name|number|number_output|
      +---------+------+-------------+
      |crackcell|  null|         null|
      +---------+------+-------------+

      If we change null to NaN:

      val data2 = sc.parallelize(Seq(("crackcell", Double.NaN))).toDF("name", "number")
      // data2: org.apache.spark.sql.DataFrame = [name: string, number: double]
      bucketizer.transform(data2).show(1)
      

      will output:

      +---------+------+-------------+
      |     name|number|number_output|
      +---------+------+-------------+
      |crackcell|   NaN|          3.0|
      +---------+------+-------------+
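
Given the NaN behaviour shown above, one possible workaround is to turn nulls into NaN before calling the Bucketizer, so that "keep" sends them to the same extra bucket. This is only a sketch: `NullsAsNaNSketch` and `nullsToNaN` are hypothetical names I made up for illustration, and the local SparkSession is an assumption (in spark-shell, `spark` already exists).

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit}

object NullsAsNaNSketch {
  // Hypothetical helper: replaces nulls in a double column with NaN.
  def nullsToNaN(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, coalesce(col(column), lit(Double.NaN)))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nulls-as-nan-sketch")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(("crackcell", null.asInstanceOf[java.lang.Double])).toDF("name", "number")

    val bucketizer = new Bucketizer()
      .setInputCol("number")
      .setOutputCol("number_output")
      .setSplits(Array(Double.NegativeInfinity, 0, 10, Double.PositiveInfinity))
      .setHandleInvalid("keep")

    // The null is now NaN, so "keep" should route it to the extra bucket.
    bucketizer.transform(nullsToNaN(data, "number")).show(1)

    spark.stop()
  }
}
```

The drawback is that the input column then shows NaN instead of null, so the two cases can no longer be told apart downstream.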

      Maybe we should unify the two behaviours? Is it reasonable to process nulls as well? If so, maybe my code can help.
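
To make the question concrete, here is a simplified plain-Scala model of the two behaviours. It is not Spark's actual implementation: `BucketModel`, `observed`, and `unified` are hypothetical names, a null is modelled as `None`, and `unified` is just one possible reading of "process nulls as well" under handleInvalid = "keep".

```scala
object BucketModel {
  // Same splits as in the example: three regular buckets plus one extra bucket.
  val splits: Array[Double] =
    Array(Double.NegativeInfinity, 0, 10, Double.PositiveInfinity)

  // Index of the extra bucket used for invalid values under "keep".
  val extraBucket: Double = (splits.length - 1).toDouble

  // Regular bucket for a non-NaN value: the i with splits(i) <= v < splits(i + 1);
  // the last regular bucket also includes its upper bound.
  def regularBucket(v: Double): Double =
    if (v == splits.last) (splits.length - 2).toDouble
    else splits.lastIndexWhere(_ <= v).toDouble

  // Behaviour reported in this issue: NaN goes to the extra bucket, null stays null.
  def observed(v: Option[Double]): Option[Double] =
    v.map(x => if (x.isNaN) extraBucket else regularBucket(x))

  // Possible unified behaviour: null is treated the same way as NaN.
  def unified(v: Option[Double]): Option[Double] =
    Some(v.filterNot(_.isNaN).map(regularBucket).getOrElse(extraBucket))
}
```

With this model, `BucketModel.observed(None)` stays `None` while `BucketModel.unified(None)` gives the extra bucket, which is the same difference as between the two outputs above.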


          People

            Assignee: Unassigned
            Reporter: Menglong TAN (crackcell)
            Votes: 0
            Watchers: 2


              Time Tracking

                Original Estimate: 2h
                Remaining Estimate: 2h
                Time Spent: Not Specified