[SPARK-20574] Allow Bucketizer to handle non-Double column - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.0
Fix Version/s: 2.2.0
Component/s: ML
Labels: None

Description

Bucketizer currently requires the input column to be of type Double, but its logic should work for any numeric data type. Many practical problems involve integer or float data, and manually casting such columns to Double before calling Bucketizer quickly becomes tedious. This transformer could be extended to handle all numeric types.
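To illustrate why the bucketing logic itself does not depend on the input being Double, here is a minimal sketch in plain Scala (no Spark). It is not Spark's implementation: the object name `NumericBucketSketch` and the method `bucketOf` are hypothetical, and the sketch assumes the documented Bucketizer convention that a bucket defined by splits x, y covers [x, y), with the last bucket also including its upper bound.

```scala
import java.util.Arrays

// Hypothetical helper, not part of Spark: bucket lookup for any Numeric type.
object NumericBucketSketch {

  // Returns the bucket index as a Double, the same type Bucketizer writes to its output column.
  // Assumes `splits` is strictly increasing and has at least two entries.
  def bucketOf[T](value: T, splits: Array[Double])(implicit num: Numeric[T]): Double = {
    require(splits.length >= 2, "need at least two split points")
    val x = num.toDouble(value) // the only type-specific step: widen to Double

    if (x == splits.last) {
      // The last bucket also includes its upper bound.
      (splits.length - 2).toDouble
    } else {
      val idx = Arrays.binarySearch(splits, x)
      if (idx >= 0) {
        // x sits exactly on a split point, which opens bucket `idx`.
        idx.toDouble
      } else {
        val insertPos = -idx - 1
        if (insertPos == 0 || insertPos == splits.length) {
          throw new IllegalArgumentException(s"Value $x is outside the split range")
        }
        (insertPos - 1).toDouble
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val splits = Array(-3.0, 0.0, 3.0)
    val ints = Array(-2, -1, 0, 1, 2)
    val floats = Array(-2.5f, 0.5f)
    println(ints.map(bucketOf(_, splits)).mkString(", "))
    println(floats.map(bucketOf(_, splits)).mkString(", "))
  }
}
```

For the integer data used in this report, the sketch should yield the same values as `expectedBuckets` in the example that follows, since the only type-dependent step is the widening to Double.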

The example below shows Bucketizer failing on integer data:

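import org.apache.spark.ml.feature.Bucketizer
// Outside spark-shell, also: import spark.implicits._ (needed for toDF)

val splits = Array(-3.0, 0.0, 3.0)
val data: Array[Int] = Array(-2, -1, 0, 1, 2)
val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0, 1.0)
// "feature" is IntegerType here, which triggers the failure
val dataFrame = data.zip(expectedBuckets).toSeq.toDF("feature", "expected")
val bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
bucketizer.transform(dataFrame)  // throws on 2.1.0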

java.lang.IllegalArgumentException: requirement failed: Column feature must be of type DoubleType but was actually IntegerType.
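Until the fix is available, the manual cast mentioned in the description can be used as a workaround. The following is a sketch only, assuming a local `SparkSession`; the object name `CastWorkaround`, the method `run`, the app name, and the `local[*]` master are illustrative choices and not part of this report.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical wrapper object, used only to make the sketch self-contained.
object CastWorkaround {

  // Buckets the integer data from the report after casting "feature" to DoubleType.
  def run(spark: SparkSession): Array[Double] = {
    import spark.implicits._

    val df = Seq(-2, -1, 0, 1, 2).toDF("feature") // IntegerType column

    // Manual cast: this is the step the issue proposes to make unnecessary.
    val casted = df.withColumn("feature", col("feature").cast(DoubleType))

    val bucketizer = new Bucketizer()
      .setInputCol("feature")
      .setOutputCol("result")
      .setSplits(Array(-3.0, 0.0, 3.0))

    bucketizer.transform(casted).select("result").as[Double].collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local run
      .appName("bucketizer-cast")    // assumption: arbitrary name
      .getOrCreate()
    try println(run(spark).mkString(", "))
    finally spark.stop()
  }
}
```

With the cast in place the `DoubleType` requirement is satisfied, and the resulting buckets should match the `expected` column from the example above. The cost is one extra `withColumn` call per numeric input column, which is the tedium this issue aims to remove.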

Issue Links

links to

[Github] Pull Request #17840 (actuaryzhang)

People

Assignee: Wayne Zhang

Reporter: Wayne Zhang

Shepherd: Yanbo Liang

Votes: 0

Watchers: 3

Dates

Created: 03/May/17 05:59

Updated: 05/May/17 02:30

Resolved: 05/May/17 02:30