SPARK-13600

Use approxQuantile from DataFrame stats in QuantileDiscretizer

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: MLlib
    • Labels:
      None

      Description

      For consistency and code reuse, QuantileDiscretizer should use approxQuantile to find splits in the data rather than implement its own method.

      Additionally, making this change should remedy a bug where QuantileDiscretizer fails to calculate the correct splits in certain circumstances, resulting in an incorrect number of buckets/bins.

      E.g.

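      import org.apache.spark.ml.feature.QuantileDiscretizer

      // spark-shell session: ten evenly spaced values, five buckets requested
      val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
      val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
      discretizer.fit(df).getSplits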

      gives:
      Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
      which corresponds to 6 buckets (not 5).
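
      A minimal sketch of the proposed approach, not the actual patch: it assumes Spark 2.0 or later (SparkSession and DataFrame.stat.approxQuantile), and the object name ApproxQuantileSplits and the helper toSplits are hypothetical. It requests numBuckets - 1 interior quantiles and brackets them with -Infinity and Infinity, so the result can never describe more than numBuckets buckets.

```scala
import org.apache.spark.sql.SparkSession

object ApproxQuantileSplits {

  // Hypothetical helper: bracket the interior quantiles with infinities.
  // Duplicates are dropped, so k interior values yield at most k + 1 buckets.
  def toSplits(interior: Array[Double]): Array[Double] =
    (Double.NegativeInfinity +: interior.distinct.sorted) :+ Double.PositiveInfinity

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("approx-quantile-splits")
      .getOrCreate()
    import spark.implicits._

    // Same data as the report: the values 1.0 through 10.0
    val df = (1 to 10).map(_.toDouble).toDF("x")
    val numBuckets = 5

    // Interior probabilities 0.2, 0.4, 0.6, 0.8 (one fewer than numBuckets)
    val probabilities = (1 until numBuckets).map(_.toDouble / numBuckets).toArray

    // A relativeError of 0.0 asks for exact quantiles; a small positive
    // value is cheaper on large data (assumption: exactness is wanted here)
    val interior = df.stat.approxQuantile("x", probabilities, 0.0)

    val splits = toSplits(interior)
    println(splits.mkString("Array(", ", ", ")"))
    println(s"buckets = ${splits.length - 1}")

    spark.stop()
  }
}
```

      Because the number of requested probabilities is fixed at numBuckets - 1, the bucket count is bounded by construction and does not depend on how the quantile values happen to fall.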

              People

              • Assignee:
                ocp Oliver Pierson
              • Reporter:
                ocp Oliver Pierson
              • Shepherd:
                Nicholas Pentreath
              • Votes:
                0
              • Watchers:
                6
