Spark / SPARK-10788

Decision Tree duplicates bins for unordered categorical features


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: ML
    • Labels: None

    Description

      Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.

      Say there are 3 categories A, B, C. We consider 3 splits:

      • A vs. B, C
      • A, B vs. C
      • A, C vs. B
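      The list above can be reproduced with a short sketch. It is plain Python, not the code in RandomForest.scala, and the helper name left_subsets is hypothetical. It assumes that fixing the first category on the left-hand side is an acceptable way to avoid counting each split twice, which gives 2^(k-1) - 1 splits for k categories.

```python
from itertools import combinations

def left_subsets(categories):
    """Enumerate one side of every binary split of an unordered categorical feature.

    Hypothetical helper for illustration only. Pinning the first category to
    the left side means each split is listed once, not once per orientation.
    """
    first, rest = categories[0], categories[1:]
    subsets = []
    # size < len(rest) guarantees the right-hand side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            subsets.append((first,) + tuple(extra))
    return subsets

splits = left_subsets(["A", "B", "C"])
for left in splits:
    right = [c for c in ["A", "B", "C"] if c not in left]
    print(", ".join(left), "vs.", ", ".join(right))
```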

      Currently, we collect statistics for both sides of every split, i.e., for 6 subsets of categories (3 splits * 2 sides = 6). However, we could instead collect statistics only for the 3 subsets on the left-hand side of the 3 possible splits: {A}, {A,B}, and {A,C}. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side by subtraction. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).
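      A minimal sketch of the subtraction idea follows. It is plain Python, not the spark.ml implementation. It assumes the statistics are additive (per-label counts here), and the toy rows and the names label_counts and right_stats are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: (category, label) pairs reaching one tree node.
rows = [("A", 1), ("A", 0), ("B", 1), ("C", 0), ("C", 0), ("B", 1)]

def label_counts(rows, subset):
    """Per-label counts for the rows whose category falls in `subset`."""
    return Counter(label for cat, label in rows if cat in subset)

# Collect stats only for the left-hand side of each split, plus the whole node.
left_sides = [("A",), ("A", "B"), ("A", "C")]
node_stats = label_counts(rows, ("A", "B", "C"))
left_stats = {side: label_counts(rows, side) for side in left_sides}

def right_stats(left):
    """stats(right) = stats(node) - stats(left), element-wise over labels."""
    # Counter subtraction keeps only positive counts, which is what we want here.
    return node_stats - left_stats[left]

for left in left_sides:
    print(left, dict(left_stats[left]), dict(right_stats(left)))
```

      Only 3 subsets plus the node total are aggregated here, and the right-hand side of each split is recovered locally instead of being collected separately.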

      We should eliminate these extra bins within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml).


            People

              Assignee: Seth Hendrickson (sethah)
              Reporter: Joseph K. Bradley (josephkb)
              Shepherd: Joseph K. Bradley
              Votes: 0
              Watchers: 7
