[SPARK-13600] Use approxQuantile from DataFrame stats in QuantileDiscretizer


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: MLlib
    • Labels: None

    Description

      For consistency and code reuse, QuantileDiscretizer should use approxQuantile to find splits in the data rather than implement its own method.
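
      As a rough illustration of the idea, the sketch below derives candidate splits from `DataFrame.stat.approxQuantile`. It is not the patch attached to this issue. The object name, the helper functions, the local `SparkSession` setup, and the `relativeError` value of 0.001 are all assumptions made for the example.

      ```scala
      import org.apache.spark.sql.SparkSession

      // Hypothetical helper object; names are illustrative, not from the Spark code base.
      object ApproxQuantileSplits {
        // Interior probabilities for n buckets: 1/n, 2/n, ..., (n-1)/n.
        def interiorProbabilities(numBuckets: Int): Array[Double] =
          (1 until numBuckets).map(_.toDouble / numBuckets).toArray

        // Wrap the interior quantiles with open-ended outer boundaries,
        // dropping duplicate quantile values so the splits stay strictly increasing.
        def toSplits(interior: Array[Double]): Array[Double] =
          Array(Double.NegativeInfinity) ++ interior.distinct.sorted ++ Array(Double.PositiveInfinity)

        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .master("local[*]")
            .appName("approx-quantile-splits")
            .getOrCreate()
          import spark.implicits._

          val df = (1 to 10).map(_.toDouble).toDF("x")
          val relativeError = 0.001 // assumed tolerance, chosen only for this sketch

          val interior = df.stat.approxQuantile("x", interiorProbabilities(5), relativeError)
          val splits = toSplits(interior)

          // n buckets need n + 1 boundaries; fewer appear if quantile values repeat.
          println(splits.mkString("Array(", ", ", ")"))
          spark.stop()
        }
      }
      ```

      Requesting only the interior probabilities and adding the infinite outer boundaries separately keeps the boundary count at numBuckets + 1 at most, which is the property the example in this report violates.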

      Additionally, making this change should remedy a bug where QuantileDiscretizer fails to calculate the correct splits in certain circumstances, resulting in an incorrect number of buckets/bins.

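      For example:

      import org.apache.spark.ml.feature.QuantileDiscretizer

      // Ten values, 1.0 through 10.0, in a single column "x"
      val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
      val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
      discretizer.fit(df).getSplits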

      gives:
      Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
      which corresponds to 6 buckets (not 5).
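
      The bucket count follows directly from the number of boundaries: an array of k splits defines k - 1 buckets. A minimal check of the arithmetic, using a hypothetical `BucketCount` helper that is not part of Spark:

      ```scala
      // Hypothetical helper for illustration only.
      object BucketCount {
        // k boundaries delimit k - 1 half-open intervals (buckets).
        def numBuckets(splits: Array[Double]): Int = splits.length - 1

        def main(args: Array[String]): Unit = {
          // The splits reported above: 7 boundaries.
          val reported = Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, 10.0, Double.PositiveInfinity)
          println(numBuckets(reported)) // prints 6, one more than the 5 requested
        }
      }
      ```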

            People

              Assignee: Oliver Pierson (ocp)
              Reporter: Oliver Pierson (ocp)
              Shepherd: Nicholas Pentreath
              Votes: 0
              Watchers: 6
