Spark / SPARK-20574

Allow Bucketizer to handle non-Double column


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels: None

    Description

      Bucketizer currently requires the input column to be of type Double, but its logic should work on any numeric data type. Many practical problems involve integer or float data, and manually casting each column to Double before calling Bucketizer quickly becomes tedious. This transformer could be extended to handle all numeric types.

      The example below shows Bucketizer failing on integer data.
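To illustrate why the bucketing logic need not be tied to Double, here is a minimal sketch in plain Scala with no Spark dependency. The helper `bucketOf` is hypothetical and is not Spark's implementation; it assumes `splits` is sorted in increasing order and that the last bucket includes its upper bound.

```scala
import java.util.Arrays

// Hypothetical helper, not part of Spark: maps a value of any numeric type
// to a bucket index by first widening it to Double.
// Assumes `splits` is strictly increasing; the last bucket is closed on the right.
def bucketOf[T](value: T, splits: Array[Double])(implicit num: Numeric[T]): Double = {
  val x = num.toDouble(value)
  require(x >= splits.head && x <= splits.last,
    s"value $x is outside [${splits.head}, ${splits.last}]")
  if (x == splits.last) {
    // The upper bound belongs to the last bucket.
    (splits.length - 2).toDouble
  } else {
    // binarySearch returns the index if found, else -(insertionPoint) - 1.
    val idx = Arrays.binarySearch(splits, x)
    val bucket = if (idx >= 0) idx else -idx - 2
    bucket.toDouble
  }
}

val splits = Array(-3.0, 0.0, 3.0)
val intBuckets = Seq(-2, -1, 0, 1, 2).map(bucketOf(_, splits))
val floatBucket = bucketOf(1.5f, splits)
val longBucket = bucketOf(3L, splits)
```

The only type-specific step is the conversion through `Numeric[T]`, so the same search serves Int, Long, Float, and Double inputs.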

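      import org.apache.spark.ml.feature.Bucketizer
      // Assumes a SparkSession named `spark` is in scope, as in spark-shell.
      import spark.implicits._

      val splits = Array(-3.0, 0.0, 3.0)
      val data: Array[Int] = Array(-2, -1, 0, 1, 2)
      val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0, 1.0)
      // "feature" is inferred as IntegerType because data is an Array[Int].
      val dataFrame = data.zip(expectedBuckets).toSeq.toDF("feature", "expected")
      val bucketizer = new Bucketizer()
        .setInputCol("feature")
        .setOutputCol("result")
        .setSplits(splits)
      // Throws because the input column is not DoubleType.
      bucketizer.transform(dataFrame)

      java.lang.IllegalArgumentException: requirement failed: Column feature must be of type DoubleType but was actually IntegerType.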
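Until Bucketizer accepts other numeric types, the manual cast mentioned in the description can be sketched as follows. This is an illustrative workaround, not the fix made for this issue; the local SparkSession setup, the app name, and the variable names are assumptions.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Assumed local session for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("bucketizer-cast-workaround")
  .getOrCreate()
import spark.implicits._

// Integer input, as in the failing example.
val intFrame = Seq(-2, -1, 0, 1, 2).toDF("feature")

// Cast the column to DoubleType so the schema check passes.
val doubleFrame = intFrame.withColumn("feature", col("feature").cast(DoubleType))

val bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(Array(-3.0, 0.0, 3.0))

// Collect the bucket indices; the issue's example expects 0.0, 0.0, 1.0, 1.0, 1.0.
val buckets = bucketizer.transform(doubleFrame)
  .select("result")
  .as[Double]
  .collect()
```

Repeating this cast for every numeric column is the tedium the issue asks to remove.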
      

          People

            Assignee: Wayne Zhang (actuaryzhang)
            Reporter: Wayne Zhang (actuaryzhang)
            Shepherd: Yanbo Liang