[SPARK-17086] QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data - ASF JIRA

I discovered this bug while working with a build from the master branch (which I believe is 2.1.0). The same code worked fine on Spark 1.6.2.

I have a DataFrame with an "intData" column that has values like:

1 3 2 1 1 2 3 2 2 2 1 3

I have a stage in my pipeline that uses the QuantileDiscretizer to produce equal-weight splits, like this:

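import org.apache.spark.ml.feature.QuantileDiscretizer

// Request 10 buckets on the "intData" column; df is the DataFrame described above.
new QuantileDiscretizer()
        .setInputCol("intData")
        .setOutputCol("intData_bin")
        .setNumBuckets(10)
        .fit(df)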

But when that stage runs, it (incorrectly) throws this error:

parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Infinity]

I don't think duplicate splits should be generated here, should they?
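
To illustrate why the splits in the error look invalid: the Bucketizer documentation says splits must be strictly increasing, and the data here has only three distinct values, so asking for 10 buckets is bound to produce repeated boundaries. The following is a minimal sketch in plain Scala (no Spark dependency) that checks this property on the reported splits. The `SplitCheck` object and its `strictlyIncreasing` helper are hypothetical names made up for this illustration and are not part of Spark; dropping duplicates is shown only as one possible remedy, not necessarily what the linked pull request does.

```scala
object SplitCheck {
  // Hypothetical helper: true when each split is strictly greater than the one before it.
  def strictlyIncreasing(splits: Seq[Double]): Boolean =
    splits.sliding(2).forall {
      case Seq(a, b) => a < b
      case _         => true // fewer than two elements: nothing to compare
    }

  // Splits copied from the error message in this report.
  val reported: Seq[Double] = Seq(
    Double.NegativeInfinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Double.PositiveInfinity)

  // Dropping repeated values keeps the first occurrence of each, preserving order.
  val deduplicated: Seq[Double] = reported.distinct

  def main(args: Array[String]): Unit = {
    println(strictlyIncreasing(reported))     // false: 1.0, 2.0 and 3.0 each appear twice
    println(strictlyIncreasing(deduplicated)) // true
  }
}
```

With the duplicates removed, the remaining five splits describe four buckets, which is fewer than the 10 requested but is the most that three distinct values can support.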

links to

[Github] Pull Request #14747 (VinceShieh)