[SPARK-10788] Decision Tree duplicates bins for unordered categorical features - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: ML
Labels: None
Target Version/s: 2.0.0

Description

Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.

Say there are 3 categories A, B, C. We consider 3 splits:

A vs. B, C
A, B vs. C
A, C vs. B

Currently, we collect statistics for each of the 6 subsets of categories (3 splits * 2 sides = 6). However, we could instead collect statistics only for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits by subtraction. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).
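To make the subtraction concrete, here is a minimal sketch in plain Python. It is not the RandomForest.scala code; the helper names (left_subsets, class_counts, right_stats) and the toy rows are made up for illustration, and per-class label counts stand in for whatever sufficient statistics the impurity measure needs.

```python
from itertools import combinations


def left_subsets(categories):
    """Enumerate the left-hand side of each distinct binary split.

    Pinning the first category to the left side avoids listing a split
    and its mirror image as two different splits.
    """
    first, rest = categories[0], categories[1:]
    subsets = []
    # Sizes 0 .. len(rest) - 1: never put every category on the left.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            subsets.append(frozenset((first,) + extra))
    return subsets


def class_counts(rows, subset, num_classes):
    """Per-class label counts over rows whose category is in `subset`."""
    counts = [0] * num_classes
    for category, label in rows:
        if category in subset:
            counts[label] += 1
    return counts


def right_stats(node_stats, left):
    """stats(right) = stats(node) - stats(left), element-wise."""
    return [n - l for n, l in zip(node_stats, left)]


# Hypothetical toy data: (category, class label) pairs.
rows = [("A", 1), ("A", 0), ("B", 1), ("C", 0), ("C", 0), ("B", 1)]
categories = ["A", "B", "C"]
num_classes = 2

# One aggregate for the whole node plus one per left-hand subset.
node_stats = class_counts(rows, set(categories), num_classes)
left_stats = {s: class_counts(rows, s, num_classes) for s in left_subsets(categories)}

for left, stats in left_stats.items():
    right = set(categories) - left
    derived = right_stats(node_stats, stats)
    print(sorted(left), stats, "|", sorted(right), derived)
```

For the 3 categories above this collects 3 left-hand bins plus the node total instead of 6 bins, and each right-hand side is derived rather than aggregated.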

We should eliminate these extra bins within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml).

Issue Links

is related to: SPARK-3383 DecisionTree aggregate size could be smaller (Resolved)

links to: [Github] Pull Request #9474 (sethah)

People

Assignee: Seth Hendrickson

Reporter: Joseph K. Bradley

Shepherd: Joseph K. Bradley

Votes: 0

Watchers: 7

Dates

Created: 24/Sep/15 05:07

Updated: 12/Apr/17 01:35

Resolved: 17/Mar/16 23:44