  1. Apache MADlib
  2. MADLIB-1219

RF: null_as_category=TRUE issues

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.14
    • Component/s: Module: Random Forest
    • Labels: None

    Description

      (1)
      I cannot get null_as_category=TRUE to work when variable importance is used:

      DROP TABLE IF EXISTS null_handling_example;
      CREATE TABLE null_handling_example (
          id integer,
          country text,
          city text,
          weather text,
          response text
      );
      INSERT INTO null_handling_example VALUES
      (1,null,null,null,'a'),
      (2,'US',null,null,'b'),
      (3,'US','NY',null,'c'),
      (4,'US','NY','rainy','d');
      

      RF:

      DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
      SELECT madlib.forest_train('null_handling_example',  -- source table
                                 'train_output',    -- output model table
                                 'id',              -- id column
                                 'response',        -- response
                                 'country, weather, city',   -- features
                                 NULL,              -- exclude columns
                                 NULL,              -- grouping columns
                                 2::integer,        -- number of trees
                                 2::integer,        -- number of random features
                                 TRUE::boolean,     -- variable importance
                                 1::integer,        -- num_permutations
                                 3::integer,        -- max depth
                                 2::integer,        -- min split
                                 2::integer,        -- min bucket
                                 2::integer,        -- number of splits per continuous variable
                                 'null_as_category=TRUE'
                                 );
      

      produces this error:

      ERROR:  plpy.SPIError: invalid array length
      DETAIL:  array_of_float: Size should be in [1, 1e7], 0 given
      CONTEXT:  Traceback (most recent call last):
        PL/Python function "forest_train", line 42, in <module>
          sample_ratio
        PL/Python function "forest_train", line 609, in forest_train
        PL/Python function "forest_train", line 1058, in _calculate_oob_prediction
      PL/Python function "forest_train"
      

      When variable importance is FALSE, it does not produce this error.
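
      For comparison, a sketch of the call that trains without this error. It is the call above with only the variable importance argument changed to FALSE. The num_permutations argument is left in place on the assumption that it is ignored when importance is off.

```sql
DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
SELECT madlib.forest_train('null_handling_example',  -- source table
                           'train_output',    -- output model table
                           'id',              -- id column
                           'response',        -- response
                           'country, weather, city',   -- features
                           NULL,              -- exclude columns
                           NULL,              -- grouping columns
                           2::integer,        -- number of trees
                           2::integer,        -- number of random features
                           FALSE::boolean,    -- variable importance off (the only change)
                           1::integer,        -- num_permutations (assumed unused here)
                           3::integer,        -- max depth
                           2::integer,        -- min split
                           2::integer,        -- min bucket
                           2::integer,        -- number of splits per continuous variable
                           'null_as_category=TRUE'
                           );
```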

      (2)
      Is null_as_category working for RF?

      When I do get a forest trained, the predictions seem wrong:

      DROP TABLE IF EXISTS table_test;
      CREATE TABLE table_test (
          id integer,
          country text,
          city text,
          weather text,
          expected_response text
      );
      INSERT INTO table_test VALUES
      (1,'IN','MUM','cloudy','a'),
      (2,'US','HOU','humid','b'),
      (3,'US','NY','sunny','c'),
      (4,'US','NY','rainy','d');
      
      DROP TABLE IF EXISTS prediction_results;
      SELECT madlib.forest_predict('train_output',
                                   'table_test',
                                   'prediction_results',
                                   'response');
      SELECT s.id, expected_response, estimated_response
      FROM prediction_results p, table_test s
      WHERE s.id = p.id ORDER BY id;
      

      produces

       id | expected_response | estimated_response 
      ----+-------------------+--------------------
        1 | a                 | a
        2 | b                 | a
        3 | c                 | a
        4 | d                 | d
      (4 rows)
      

      However, the same example predicts correctly with a decision tree.
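
      The decision tree run is not included in this report. A minimal sketch of what that comparison might look like, using madlib.tree_train and madlib.tree_predict on the same tables, is shown here. The table names dt_output and dt_prediction_results are hypothetical, and the split criterion, depth, min split, min bucket, bin count and pruning values are assumptions, not the settings actually used.

```sql
DROP TABLE IF EXISTS dt_output, dt_output_summary;
SELECT madlib.tree_train('null_handling_example',  -- source table
                         'dt_output',              -- output model table (hypothetical name)
                         'id',                     -- id column
                         'response',               -- response
                         'country, weather, city', -- features
                         NULL,                     -- exclude columns
                         'gini',                   -- split criterion (assumed)
                         NULL::text,               -- grouping columns
                         NULL::text,               -- weights
                         4,                        -- max depth (assumed)
                         1,                        -- min split (assumed)
                         1,                        -- min bucket (assumed)
                         10,                       -- bins per continuous variable (assumed)
                         'cp=0',                   -- pruning parameters (assumed)
                         'null_as_category=TRUE'   -- null handling, as in the RF call
                         );

DROP TABLE IF EXISTS dt_prediction_results;
SELECT madlib.tree_predict('dt_output',
                           'table_test',
                           'dt_prediction_results',  -- hypothetical name
                           'response');

-- Same comparison query as for the forest, with explicit column qualifiers
SELECT s.id, s.expected_response, p.estimated_response
FROM dt_prediction_results p, table_test s
WHERE s.id = p.id ORDER BY s.id;
```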

            People

              Assignee: riyer Rahul Iyer
              Reporter: fmcquillan Frank McQuillan