[MADLIB-1254] RF/DT: Grouping might give incorrect results if 1 group eliminates a categorical variable - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: v1.15
Component/s: Module: Decision Tree
Labels:
None

Description

If forest_train is run with grouping enabled and if one of the groups has a categorical feature with just single level, then the categorical feature is eliminated for that group. If other groups retain that feature, then the output of impurity_var_importance is incorrect for the group in question. There could be other ramifications related to this as well.

DROP TABLE IF EXISTS dt_golf CASCADE;
CREATE TABLE dt_golf (
    id integer NOT NULL,
    "OUTLOOK" text,
    temperature double precision,
    humidity double precision,
    "Cont_features" double precision[],
    cat_features text[],
    windy boolean,
    class text
) ;

INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,"Cont_features",cat_features, windy,class) VALUES
(1, 'sunny', 85, 85,ARRAY[85, 85], ARRAY['a', 'b'], false, 'Don''t Play'),
(2, 'sunny', 80, 90, ARRAY[80, 90], ARRAY['a', 'b'], true, 'Don''t Play'),
(3, 'overcast', 83, 78, ARRAY[83, 78], ARRAY['a', 'b'], false, 'Play'),
(4, 'rain', 70, NULL, ARRAY[70, 96], ARRAY['a', 'b'], false, 'Play'),
(5, 'rain', 68, 80, ARRAY[68, 80], ARRAY['a', 'b'], false, 'Play'),
(6, 'rain', NULL, 70, ARRAY[65, 70], ARRAY['a', 'b'], true, 'Don''t Play'),
(7, 'overcast', 64, 65, ARRAY[64, 65], ARRAY['c', 'b'], NULL , 'Play'),
(8, 'sunny', 72, 95, ARRAY[72, 95], ARRAY['a', 'b'], false, 'Don''t Play'),
(9, 'sunny', 69, 70, ARRAY[69, 70], ARRAY['a', 'b'], false, 'Play'),
(10, 'rain', 75, 80, ARRAY[75, 80], ARRAY['a', 'b'], false, 'Play'),
(11, 'sunny', 75, 70, ARRAY[75, 70], ARRAY['a', 'd'], true, 'Play'),
(12, 'overcast', 72, 90, ARRAY[72, 90], ARRAY['c', 'b'], NULL, 'Play'),
(13, 'overcast', 81, 75, ARRAY[81, 75], ARRAY['a', 'b'], false, 'Play'),
(15, NULL, 81, 75, ARRAY[81, 75], ARRAY['a', 'b'], false, 'Play'),
(16, 'overcast', NULL, 75, ARRAY[81, 75], ARRAY['a', 'd'], false, 'Play'),
(14, 'rain', 71, 80, ARRAY[71, 80], ARRAY['c', 'b'], true, 'Don''t Play');

DROP TABLE IF EXISTS train_output, train_output_summary, train_output_group, train_output_poisson_count;
SELECT forest_train(
                  'dt_golf',         -- source table
                  'train_output',    -- output model table
                  'id',              -- id column
                  'temperature::double precision',           -- response
                  'humidity, cat_features, windy, "Cont_features"',   -- features
                  NULL,        -- exclude columns
                  'class',          -- grouping
                  5,                -- num of trees
                  NULL,                 -- num of random features
                  TRUE,     -- importance
                  20,         -- num_permutations
                  10,       -- max depth
                  1,        -- min split
                  1,        -- min bucket
                  3,        -- number of bins per continuous variable
                  'max_surrogates = 2 ',
                  FALSE
                  );
\x on
SELECT * from train_output_summary;
SELECT * from train_output_group;

Results:

SELECT * from train_output_group;
-[ RECORD 1 ]-----------+-----------------------------------------------------------------------------
gid                     | 1
class                   | Don't Play
success                 | t
cat_n_levels            | {2,2,2}
cat_levels_in_text      | {c,a,True,False,c,a}
oob_error               | 92.5335905349795
oob_var_importance      | {10.725,10.725,10.725,7.605,10.725,0}
impurity_var_importance | {8.33148348160485,0,0,19.9999998625892,19.9999998625892,11.6685163809844}
-[ RECORD 2 ]-----------+-----------------------------------------------------------------------------
gid                     | 2
class                   | Play
success                 | t
cat_n_levels            | {2,2}
cat_levels_in_text      | {b,d,False,True}
oob_error               | 43.0244073645405
oob_var_importance      | {1.06581410364015e-15,1.06581410364015e-15,2.1326171875,16.019375,10.570875}
impurity_var_importance | {0,0,0,37.8304000437732,38.4881698525677,23.6814277291654}

Note that the impurity_var_importance for gid=2 has length 6 while the oob_var_importance correctly has 5.

Attachments

Issue Links

links to

GitHub Pull Request #296

GitHub Pull Request #299

Activity

People

Assignee:: Unassigned

Reporter:: Rahul Iyer

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 16/Jul/18 23:10

Updated:: 01/Aug/18 22:32

Resolved:: 25/Jul/18 22:29