Spark / SPARK-13444

QuantileDiscretizer chooses bad splits on large DataFrames

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 2.0.0
    • Fix Version/s: 1.6.2, 2.0.0
    • Component/s: MLlib
    • Labels: None

    Description

      In certain circumstances, QuantileDiscretizer fails to calculate the correct splits and will instead split data into two bins regardless of the value specified in numBuckets.

      For example, suppose dataset.count is 200 million and we run:

      val discretizer = new QuantileDiscretizer().setNumBuckets(10)
      ... set output and input columns ...
      val dataWithBins = discretizer.fit(dataset).transform(dataset)

      In this case, dataWithBins will have only two distinct bins versus the expected 10.

      The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be fixed by changing line 113 as follows:
      before: val requiredSamples = math.max(numBins * numBins, 10000)
      after: val requiredSamples = math.max(numBins * numBins, 10000.0)
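
      The report does not quote line 114, so the following is only a sketch of the likely mechanism. It assumes line 114 divides requiredSamples by the row count to obtain a sampling fraction. Under that assumption, the Int-valued requiredSamples triggers integer division, which truncates to 0 whenever the dataset has more rows than requiredSamples. The object name FractionDemo, the value numBins = 10, and the stand-in totalSamples are hypothetical and chosen to match the example above.

```scala
object FractionDemo {
  // Hypothetical stand-ins for the values in the report.
  val numBins: Int = 10
  val totalSamples: Long = 200000000L // plays the role of dataset.count

  // Before: math.max(Int, Int) returns an Int, so Int / Long is
  // integer division and truncates to 0 for large datasets.
  val requiredBefore = math.max(numBins * numBins, 10000)
  val fractionBefore: Double = math.min(requiredBefore / totalSamples, 1.0)

  // After: the literal 10000.0 makes the result a Double, so the
  // division is floating-point: 10000.0 / 200000000 = 0.00005.
  val requiredAfter = math.max(numBins * numBins, 10000.0)
  val fractionAfter: Double = math.min(requiredAfter / totalSamples, 1.0)

  def main(args: Array[String]): Unit = {
    println(s"fraction before fix: $fractionBefore")
    println(s"fraction after fix:  $fractionAfter")
  }
}
```

      If the fraction is 0, no rows are sampled, which would be consistent with the discretizer falling back to two bins regardless of numBuckets.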

          People

            Assignee: Oliver Pierson (ocp)
            Reporter: Oliver Pierson (ocp)