  1. Spark
  2. SPARK-17086

QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: ML
    • Labels: None

    Description

      I discovered this bug while working with a build from the master branch (which I believe is 2.1.0). The same code worked fine on Spark 1.6.2.

      I have a dataframe with an "intData" column that has values like

      1 3 2 1 1 2 3 2 2 2 1 3
      

      I have a stage in my pipeline that uses the QuantileDiscretizer to produce equal-weight splits, like this:

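      import org.apache.spark.ml.feature.QuantileDiscretizer

      // Request 10 buckets, although intData holds only 3 distinct values
      new QuantileDiscretizer()
              .setInputCol("intData")
              .setOutputCol("intData_bin")
              .setNumBuckets(10)
              .fit(df)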
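
A self-contained reproduction might look like the sketch below. It is not taken from the report: the local SparkSession, the object name `Spark17086Repro`, and the choice to store the values as doubles are all assumptions made for illustration. On an affected build the `fit` call would be expected to fail with the error described in this issue; on a fixed build it should return a fitted model whose splits can be printed.

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction harness; names and session settings are assumptions.
object Spark17086Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-17086-repro")
      .getOrCreate()
    import spark.implicits._

    // The twelve values quoted in the report, stored as doubles.
    val df = Seq(1, 3, 2, 1, 1, 2, 3, 2, 2, 2, 1, 3)
      .map(_.toDouble)
      .toDF("intData")

    // Same configuration as in the report: 10 buckets over 3 distinct values.
    val discretizer = new QuantileDiscretizer()
      .setInputCol("intData")
      .setOutputCol("intData_bin")
      .setNumBuckets(10)

    // fit returns a Bucketizer; this is the call that throws on affected builds.
    val bucketizer = discretizer.fit(df)
    println(bucketizer.getSplits.mkString("[", ", ", "]"))

    spark.stop()
  }
}
```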
      

      But when that gets run it (incorrectly) throws this error:

      parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Infinity]
      

      I don't think duplicate splits should be generated here, should they?
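
The plain-Scala sketch below illustrates why duplicates can appear: when more buckets are requested than there are distinct values, several quantiles land on the same value. It is not Spark's implementation. The object `DuplicateSplits`, its helper names, and the exact nearest-rank quantile rule are assumptions chosen for illustration (Spark computes approximate quantiles, so its split array can differ from this one), and the strictly-increasing check is inferred from the rejected array in the error message.

```scala
// Illustration only: hypothetical helpers, not Spark's implementation.
object DuplicateSplits {
  // The twelve values quoted in the report.
  val data: Seq[Double] =
    Seq(1, 3, 2, 1, 1, 2, 3, 2, 2, 2, 1, 3).map(_.toDouble)

  // Exact nearest-rank quantile over an already sorted sequence (assumed rule).
  def quantile(sorted: IndexedSeq[Double], p: Double): Double = {
    val rank = math.ceil(p * sorted.length).toInt - 1
    sorted(math.min(sorted.length - 1, math.max(0, rank)))
  }

  // Candidate splits: -Infinity, the numBuckets - 1 interior quantiles, +Infinity.
  def candidateSplits(values: Seq[Double], numBuckets: Int): Array[Double] = {
    val sorted = values.sorted.toIndexedSeq
    val interior = (1 until numBuckets).map(i => quantile(sorted, i.toDouble / numBuckets))
    (Double.NegativeInfinity +: interior :+ Double.PositiveInfinity).toArray
  }

  // Assumed validity rule: every split must be strictly greater than the previous one.
  def strictlyIncreasing(splits: Array[Double]): Boolean =
    splits.sliding(2).forall {
      case Array(a, b) => a < b
      case _           => true
    }
}
```

With the data above, `DuplicateSplits.candidateSplits(DuplicateSplits.data, 10)` contains repeated values and so fails `strictlyIncreasing`, while calling `.distinct` on the same array leaves one split per distinct value plus the two infinities and passes the check. Dropping duplicates means fewer buckets than requested, which seems a reasonable trade-off when the column has only a handful of distinct values.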

      Attachments

        1. titanic.csv
          73 kB
          Barry Becker

          People

            Assignee: VinceXie Vincent
            Reporter: barrybecker4 Barry Becker
            Votes: 0
            Watchers: 6
