SPARK-14610: Remove superfluous split from random forest findSplitsForContinuousFeature

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: ML
    • Labels: None

    Description

      Currently, the method findSplitsForContinuousFeature in random forest produces an unnecessary split. For example, if a continuous feature has the unique values (1, 2, 3), this method generates the three splits listed here, the last of which puts every value on the same side and so separates nothing:

      • {1|2,3}
      • {1,2|3}
      • {1,2,3|}

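      The following unit test is clearly incorrect: it expects three splits, even though the samples contain only three unique values (1, 2, 3), which admit just two meaningful splits: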

      rf.scala
      val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
      assert(splits.length === 3)
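
To illustrate why the third split is redundant, here is a minimal standalone sketch. It is not Spark's implementation: `SplitSketch`, `candidateSplits`, and `isUseful` are hypothetical names, and it assumes that a split at threshold t sends values <= t to the left and values > t to the right. Under that assumption, n unique values admit at most n - 1 splits that leave samples on both sides.

```scala
object SplitSketch {
  // Hypothetical helper (not Spark's API): one candidate threshold per
  // distinct value, excluding the largest, since a threshold at the
  // maximum would send every sample to the left.
  def candidateSplits(samples: Array[Double]): Array[Double] =
    samples.distinct.sorted.dropRight(1)

  // Assumed convention: x <= threshold goes left, x > threshold goes right.
  // A split is useful only if both sides receive at least one sample.
  def isUseful(samples: Array[Double], threshold: Double): Boolean = {
    val (left, right) = samples.partition(_ <= threshold)
    left.nonEmpty && right.nonEmpty
  }

  def main(args: Array[String]): Unit = {
    // Same samples as the unit test quoted in this issue.
    val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
    val splits = candidateSplits(featureSamples)
    println(splits.mkString(", "))
    // A threshold at the maximum value separates nothing.
    println(isUseful(featureSamples, 3.0))
  }
}
```

With the samples from the test, this sketch yields two thresholds (1.0 and 2.0), both of which separate the data, while a threshold at 3.0 does not. That suggests the assertion should expect two splits rather than three.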
      

          People

            sethah Seth Hendrickson