Spark / SPARK-13444

QuantileDiscretizer chooses bad splits on large DataFrames

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 2.0.0
    • Fix Version/s: 1.6.2, 2.0.0
    • Component/s: MLlib
    • Labels:
      None

      Description

      In certain circumstances, QuantileDiscretizer fails to calculate the correct splits and will instead split data into two bins regardless of the value specified in numBuckets.

      For example, suppose dataset.count is 200 million and we run:

      import org.apache.spark.ml.feature.QuantileDiscretizer

      val discretizer = new QuantileDiscretizer().setNumBuckets(10)
      // ... set output and input columns ...
      val dataWithBins = discretizer.fit(dataset).transform(dataset)

      In this case, dataWithBins will have only two distinct bins rather than the expected 10.

      The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be fixed by changing line 113 as follows:
      before: val requiredSamples = math.max(numBins * numBins, 10000)
      after: val requiredSamples = math.max(numBins * numBins, 10000.0)
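
One plausible reading of why the one-character change matters is sketched below. It assumes that line 114 (not quoted in this ticket) divides requiredSamples by the row count to obtain a sampling fraction. Under that assumption, the "before" version performs integer division, which truncates to zero whenever the row count exceeds requiredSamples. The object and method names (SamplingFractionSketch, fractionBefore, fractionAfter) are hypothetical and exist only for illustration; this is not the actual Spark source.

```scala
object SamplingFractionSketch {

  // Mirrors the "before" line: both arguments to math.max are Int,
  // so requiredSamples is an Int.
  def fractionBefore(numBins: Int, totalRows: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    // ASSUMPTION: the following line divides by the row count.
    // Int / Long is integer division, so the quotient is 0
    // whenever totalRows > requiredSamples.
    math.min((requiredSamples / totalRows).toDouble, 1.0)
  }

  // Mirrors the "after" line: the literal 10000.0 makes
  // requiredSamples a Double, so the division is floating-point.
  def fractionAfter(numBins: Int, totalRows: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0)
    math.min(requiredSamples / totalRows, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val rows = 200000000L // the 200 million rows from the example above
    println(fractionBefore(10, rows)) // truncates to 0.0
    println(fractionAfter(10, rows))  // roughly 0.00005
  }
}
```

If this reading is right, the original code would request a sampling fraction of zero on any dataset with more rows than requiredSamples, while smaller datasets would be unaffected because the fraction is capped at 1.0.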

            People

            • Assignee: ocp Oliver Pierson
            • Reporter: ocp Oliver Pierson
