Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Description
(1)
I cannot get null_as_category=TRUE to work when variable importance is used:
DROP TABLE IF EXISTS null_handling_example;
CREATE TABLE null_handling_example (
    id integer,
    country text,
    city text,
    weather text,
    response text
);
INSERT INTO null_handling_example VALUES
    (1, null, null, null, 'a'),
    (2, 'US', null, null, 'b'),
    (3, 'US', 'NY', null, 'c'),
    (4, 'US', 'NY', 'rainy', 'd');
RF:
DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
SELECT madlib.forest_train('null_handling_example',    -- source table
                           'train_output',             -- output model table
                           'id',                       -- id column
                           'response',                 -- response
                           'country, weather, city',   -- features
                           NULL,                       -- exclude columns
                           NULL,                       -- grouping columns
                           2::integer,                 -- number of trees
                           2::integer,                 -- number of random features
                           TRUE::boolean,              -- variable importance
                           1::integer,                 -- num_permutations
                           3::integer,                 -- max depth
                           2::integer,                 -- min split
                           2::integer,                 -- min bucket
                           2::integer,                 -- number of splits per continuous variable
                           'null_as_category=TRUE'
                           );
produces this error:
ERROR: plpy.SPIError: invalid array length
DETAIL: array_of_float: Size should be in [1, 1e7], 0 given
CONTEXT: Traceback (most recent call last):
  PL/Python function "forest_train", line 42, in <module>
    sample_ratio
  PL/Python function "forest_train", line 609, in forest_train
  PL/Python function "forest_train", line 1058, in _calculate_oob_prediction
PL/Python function "forest_train"
When variable importance is FALSE, it does not produce this error.
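For comparison, the non-failing case can be sketched as the same call with only the variable importance argument switched to FALSE. This is a minimal sketch that assumes the null_handling_example table from the setup above already exists; every other argument is copied unchanged from the failing call.

```sql
DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
SELECT madlib.forest_train('null_handling_example',    -- source table
                           'train_output',             -- output model table
                           'id',                       -- id column
                           'response',                 -- response
                           'country, weather, city',   -- features
                           NULL,                       -- exclude columns
                           NULL,                       -- grouping columns
                           2::integer,                 -- number of trees
                           2::integer,                 -- number of random features
                           FALSE::boolean,             -- variable importance (only change)
                           1::integer,                 -- num_permutations
                           3::integer,                 -- max depth
                           2::integer,                 -- min split
                           2::integer,                 -- min bucket
                           2::integer,                 -- number of splits per continuous variable
                           'null_as_category=TRUE'
                           );
```

Running the two variants back to back should isolate the failure to the code path that is only exercised when variable importance is enabled.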
(2)
Is null_as_category working for RF?
If I do get a tree trained, prediction seems wrong:
DROP TABLE IF EXISTS table_test;
CREATE TABLE table_test (
    id integer,
    country text,
    city text,
    weather text,
    expected_response text
);
INSERT INTO table_test VALUES
    (1, 'IN', 'MUM', 'cloudy', 'a'),
    (2, 'US', 'HOU', 'humid', 'b'),
    (3, 'US', 'NY', 'sunny', 'c'),
    (4, 'US', 'NY', 'rainy', 'd');

DROP TABLE IF EXISTS prediction_results;
SELECT madlib.forest_predict('train_output',
                             'table_test',
                             'prediction_results',
                             'response');

SELECT s.id, expected_response, estimated_response
FROM prediction_results p, table_test s
WHERE s.id = p.id
ORDER BY id;
produces
 id | expected_response | estimated_response
----+-------------------+--------------------
  1 | a                 | a
  2 | b                 | a
  3 | c                 | a
  4 | d                 | d
(4 rows)
However, the same example predicts correctly with a decision tree.
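A decision tree run for comparison might look like the sketch below. The table names dt_output and dt_prediction_results are hypothetical (chosen so the random forest model in train_output is not overwritten), and the split criterion, depth, bucket, and pruning values are illustrative assumptions, not necessarily the exact settings behind the result described above.

```sql
-- Train a single decision tree on the same data (assumed settings).
DROP TABLE IF EXISTS dt_output, dt_output_summary;
SELECT madlib.tree_train('null_handling_example',    -- source table
                         'dt_output',                -- output model table (hypothetical name)
                         'id',                       -- id column
                         'response',                 -- response
                         'country, weather, city',   -- features
                         NULL,                       -- exclude columns
                         'gini',                     -- split criterion (assumed)
                         NULL::text,                 -- grouping columns
                         NULL::text,                 -- weights
                         4,                          -- max depth (assumed)
                         1,                          -- min split (assumed)
                         1,                          -- min bucket (assumed)
                         10,                         -- number of splits per continuous variable
                         'cp=0',                     -- pruning params (assumed)
                         'null_as_category=TRUE'     -- null handling params
                         );

-- Predict on the same test table and compare against the expected labels.
DROP TABLE IF EXISTS dt_prediction_results;
SELECT madlib.tree_predict('dt_output',
                           'table_test',
                           'dt_prediction_results',
                           'response');

SELECT s.id, expected_response, estimated_response
FROM dt_prediction_results p, table_test s
WHERE s.id = p.id
ORDER BY s.id;
```

Comparing this output with the forest_predict result on the same table_test rows shows whether the discrepancy is specific to the random forest code path.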