  1. Spark
  2. SPARK-19781

Bucketizer's handleInvalid leaves null values untouched, unlike NaNs

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels:
    • Flags:
      Patch

      Description

      Bucketizer puts NaN values into a special extra bucket when handleInvalid is set to "keep", but it leaves null values untouched.

      import org.apache.spark.ml.feature.Bucketizer
      val data = sc.parallelize(Seq(("crackcell", null.asInstanceOf[java.lang.Double]))).toDF("name", "number")
      val bucketizer = new Bucketizer().setInputCol("number").setOutputCol("number_output").setSplits(Array(Double.NegativeInfinity, 0, 10, Double.PositiveInfinity)).setHandleInvalid("keep")
      val res = bucketizer.transform(data)
      res.show(1)
      

      will output:

      ------------------------
      name       number  number_output
      ------------------------
      crackcell  null    null
      ------------------------

      If we change null to NaN:

      val data2 = sc.parallelize(Seq(("crackcell", Double.NaN))).toDF("name", "number")
      // data2: org.apache.spark.sql.DataFrame = [name: string, number: double]
      bucketizer.transform(data2).show(1)
      

      will output:

      ------------------------
      name       number  number_output
      ------------------------
      crackcell  NaN     3.0
      ------------------------
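
The 3.0 follows from the splits: four split points define three regular buckets (indices 0 to 2), and "keep" assigns NaN to one extra bucket whose index is `splits.length - 1`. The sketch below is a simplified pure-Scala model of that rule, not the Spark source; `BucketIndexSketch` and `bucketIndex` are hypothetical names used only for illustration.

```scala
// Simplified model of how a bucket index is chosen; NOT the Spark implementation.
object BucketIndexSketch {
  /** Bucket index for `value`, given strictly increasing `splits`. */
  def bucketIndex(value: Double, splits: Array[Double], keepInvalid: Boolean): Double = {
    if (value.isNaN) {
      // With "keep", NaN goes to one extra bucket after the regular ones.
      if (keepInvalid) (splits.length - 1).toDouble
      else throw new IllegalArgumentException("NaN found and handleInvalid is not \"keep\"")
    } else if (value == splits.last) {
      // The last regular bucket also includes its upper bound.
      (splits.length - 2).toDouble
    } else {
      // Regular buckets are [splits(i), splits(i + 1)).
      val idx = splits.lastIndexWhere(_ <= value)
      require(idx >= 0 && idx < splits.length - 1, s"value $value is outside the splits")
      idx.toDouble
    }
  }

  def main(args: Array[String]): Unit = {
    val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)
    println(bucketIndex(5.0, splits, keepInvalid = true))        // 1.0
    println(bucketIndex(Double.NaN, splits, keepInvalid = true)) // 3.0
  }
}
```

A null never reaches a rule like this in the behaviour reported here, which is why the output column stays null.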

      Maybe we should unify the behaviours? Is it reasonable to process nulls as well? If so, maybe my code can help.
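
Until the behaviours are unified, one possible workaround is to map nulls to NaN before calling Bucketizer, so that "keep" routes both into the extra bucket. The sketch below is a minimal example under that assumption and is not the patch attached to this issue; the helper `nullsToNaN` and the column name `number_filled` are hypothetical. It uses `coalesce` and `lit` from `org.apache.spark.sql.functions`.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit}

object NullAsNaNWorkaround {
  // Hypothetical helper: copy `inputCol` into `outputCol`, replacing null with NaN.
  def nullsToNaN(df: DataFrame, inputCol: String, outputCol: String): DataFrame =
    df.withColumn(outputCol, coalesce(col(inputCol), lit(Double.NaN)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-as-nan")
      .getOrCreate()
    import spark.implicits._

    // Option[Double] is encoded as a nullable double column; None becomes null.
    val data = Seq[(String, Option[Double])](
      ("a", Some(5.0)),
      ("b", None),
      ("c", Some(Double.NaN))
    ).toDF("name", "number")

    val bucketizer = new Bucketizer()
      .setInputCol("number_filled") // bucketize the null-free copy
      .setOutputCol("number_output")
      .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))
      .setHandleInvalid("keep")

    // Rows "b" (null) and "c" (NaN) should both land in the extra bucket.
    bucketizer.transform(nullsToNaN(data, "number", "number_filled")).show()

    spark.stop()
  }
}
```

The trade-off is that nulls and NaNs become indistinguishable in the output column; the original `number` column is kept so the distinction is not lost.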

            People

            • Assignee:
              Unassigned
            • Reporter:
              crackcell Menglong TAN
            • Votes:
              0
            • Watchers:
              2

                Time Tracking

                Estimated: 2h
                Remaining: 2h
                Logged: Not Specified