Spark / SPARK-9075

DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

    Description

      In https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala there's a statement that sets maxPossibleBins to numExamples when numExamples is less than strategy.maxBins.

      This can cause an error when training on small partitions; the error is triggered further down in the logic, where the code requires that maxCategoriesPerFeature be less than or equal to maxPossibleBins.

      Here's an example of how it manifested: the partition contained 49 rows (i.e., numExamples = 49), but strategy.maxBins was 57.

      The maxPossibleBins = math.min(strategy.maxBins, numExamples) logic therefore reduced maxPossibleBins to 49, causing the "require(maxCategoriesPerFeature <= maxPossibleBins" check to throw an error.
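
      The interaction between the two statements can be sketched in isolation. This is a minimal illustration, not the actual DecisionTreeMetadata code: the object name MaxBinsSketch, the helper checkBins, and its error message are hypothetical, and the feature arity of 57 is an assumption, since the report does not state the exact number of categories.

```scala
import scala.util.Try

object MaxBinsSketch {
  // Hypothetical helper that mirrors the two statements quoted in this report.
  // It is NOT the real DecisionTreeMetadata implementation.
  def checkBins(maxBins: Int, numExamples: Long, maxCategoriesPerFeature: Int): Int = {
    // The bin count is capped by the number of training examples.
    val maxPossibleBins = math.min(maxBins, numExamples).toInt
    // The check that fails when a categorical feature has more
    // categories than the capped bin count.
    require(maxCategoriesPerFeature <= maxPossibleBins,
      s"maxCategoriesPerFeature ($maxCategoriesPerFeature) exceeds " +
        s"maxPossibleBins ($maxPossibleBins)")
    maxPossibleBins
  }

  def main(args: Array[String]): Unit = {
    // Numbers from the report: 49 rows, strategy.maxBins = 57.
    // Assumed arity of 57: require throws IllegalArgumentException.
    println(Try(checkBins(maxBins = 57, numExamples = 49L, maxCategoriesPerFeature = 57)))
    // With an arity of at most 49 the same call passes.
    println(Try(checkBins(maxBins = 57, numExamples = 49L, maxCategoriesPerFeature = 49)))
  }
}
```

      In this sketch the first call fails because the cap lowers maxPossibleBins from 57 to 49, while the second succeeds because the assumed arity fits within the capped value.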

      In short, this will be a problem when training small datasets with a feature that contains more categories than numExamples.

      In our local testing we commented out the "math.min(strategy.maxBins, numExamples)" line and the decision tree succeeded where it had failed previously.

            People

              Assignee: Unassigned
              Reporter: Les Selecky (lselecky_cl)
              Votes: 0
              Watchers: 3
