  1. Apache MADlib
  2. MADLIB-1226

Add option for 1-hot encoding to minibatch preprocessor


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.14
    • Component/s: Module: Utilities
    • Labels: None

    Description

      I was testing the MNIST dataset with the minibatch preprocessor + MLP and could not get it to converge. It turned out to be user error on my part and not a convergence problem at all: I forgot to 1-hot encode the dependent variable.

      But I am wondering if other people might do the same thing that I did and get confused.

      Here's what I did. For this input data:

      madlib=# \d+ public.mnist_train
                                                    Table "public.mnist_train"
       Column |   Type    |                        Modifiers                         | Storage  | Stats target | Description
      --------+-----------+----------------------------------------------------------+----------+--------------+-------------
       y      | integer   |                                                          | plain    |              |
       x      | integer[] |                                                          | extended |              |
       id     | integer   | not null default nextval('mnist_train_id_seq'::regclass) | plain    |              |

      I called minibatch preprocessor:

      SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
                                           'mnist_train_packed',  -- Output table
                                           'y',                   -- Dependent variable
                                           'x'                    -- Independent variables
                                           );
      

      then mlp:

      SELECT madlib.mlp_classification(
          'mnist_train_packed',        -- Source table from preprocessor output
          'mnist_result',              -- Destination table
          'independent_varname',       -- Independent variable
          'dependent_varname',         -- Dependent variable
          ARRAY[5],                    -- Hidden layer sizes
          'learning_rate_init=0.01,
          n_iterations=20,
          learning_rate_policy=exp, n_epochs=20,
          lambda=0.0001,
          tolerance=0',                -- Optimizer params (lambda is the regularization)
          'tanh',                      -- Activation function
          '',                          -- No weights
          FALSE,                       -- No warmstart
          TRUE);                       -- Verbose
      

      with the result:

      INFO:  Iteration: 2, Loss: <-79.5295531257>
      INFO:  Iteration: 3, Loss: <-79.529408892>
      INFO:  Iteration: 4, Loss: <-79.5291940436>
      INFO:  Iteration: 5, Loss: <-79.5288964944>
      INFO:  Iteration: 6, Loss: <-79.5285051451>
      INFO:  Iteration: 7, Loss: <-79.5280094708>
      INFO:  Iteration: 8, Loss: <-79.5273995189>
      INFO:  Iteration: 9, Loss: <-79.5266665607>
      

      So it did not error out, but it is clearly not working on data in the right format.
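
      For reference, the encoding that was missing can be sketched as below. This is a minimal Python illustration of 1-hot encoding integer labels, not MADlib code; the function name `one_hot` and the choice to order classes by sorted value are assumptions made for the example.

```python
def one_hot(labels):
    """Map each integer label to a 0/1 vector containing a single 1.

    Classes are the sorted distinct values found in `labels`. That ordering
    is an assumption of this sketch; MADlib's own ordering may differ.
    """
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    encoded = []
    for y in labels:
        row = [0] * len(classes)
        row[index[y]] = 1  # mark the position of this label's class
        encoded.append(row)
    return classes, encoded


classes, encoded = one_hot([5, 0, 4, 0])
print(classes)  # [0, 4, 5]
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

      With MNIST, the integer column y would become a 10-element vector per row instead of a single digit.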

      I suggest 2 changes:

      1) Add an explicit param in the mini-batch preprocessor for 1-hot encoding of scalar integer dependent variables (this JIRA)

      2) Add a check to the MLP classification code that verifies the dependent variable has been 1-hot encoded, and error out if it has not. (https://issues.apache.org/jira/browse/MADLIB-1226)
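
      The check in item 2 could look roughly like the following. This is a hedged Python sketch of the idea only; `is_one_hot` and `check_dependent_var` are hypothetical names, and the actual MLP code may validate the data differently (for example, in SQL over the packed table).

```python
def is_one_hot(row):
    """Return True if `row` is a non-empty sequence of 0s and 1s with exactly one 1."""
    values = list(row)
    return (
        len(values) > 0
        and all(v in (0, 1) for v in values)
        and sum(values) == 1
    )


def check_dependent_var(rows):
    """Raise ValueError on the first row that is not 1-hot encoded."""
    for i, row in enumerate(rows):
        # A scalar label (e.g. a raw MNIST digit) is not an encoded vector.
        if isinstance(row, (int, float)) or not is_one_hot(row):
            raise ValueError(
                f"row {i}: dependent variable is not 1-hot encoded: {row!r}"
            )


check_dependent_var([[1, 0, 0], [0, 0, 1]])  # passes silently
```

      Failing fast here would have replaced the silent non-convergence above with an explicit error.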

      Proposed interface:

      minibatch_preprocessor( source_table,
                              output_table,
                              dependent_varname,
                              independent_varname,
                              grouping_col,
                              buffer_size,
                              one_hot_encode_int_dep_var
                              )
      
      one_hot_encode_int_dep_var (optional)
      BOOLEAN. Default: FALSE. Whether to one-hot encode dependent variables that are scalar integers.
      This parameter is ignored if the dependent variable is not a scalar integer.
      More detail: the mini-batch preprocessor automatically encodes dependent variables that are
      Boolean or character types such as text, char and varchar. However, scalar integers are a
      special case because they can be used in both classification and regression problems, so
      you must tell the mini-batch preprocessor whether to encode them.
      If you have already encoded the dependent variable yourself,
      you can ignore this parameter. Also, if you want to encode float values for some reason, cast them
      to text first.
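
      The rule described for the proposed parameter can be summarized as a small decision function. This is only a Python sketch of the rule as stated in this issue; the function name, the type-name sets, and the use of PostgreSQL type names as plain strings are assumptions for illustration, not the preprocessor's actual implementation.

```python
# Types the description says are always encoded (names are assumptions).
AUTO_ENCODED = {"boolean", "text", "char", "varchar"}
# Scalar integer types, assumed here to be the PostgreSQL integer family.
INTEGER_TYPES = {"smallint", "integer", "bigint"}


def should_one_hot_encode(dep_type, one_hot_encode_int_dep_var=False):
    """Decide whether the dependent variable would be 1-hot encoded."""
    dep_type = dep_type.lower()
    if dep_type in AUTO_ENCODED:
        return True  # Boolean and character types: always encoded
    if dep_type in INTEGER_TYPES:
        return bool(one_hot_encode_int_dep_var)  # the user decides
    return False  # flag is ignored for anything else (floats, arrays, ...)


print(should_one_hot_encode("integer"))        # False
print(should_one_hot_encode("integer", True))  # True
print(should_one_hot_encode("text"))           # True
```

      Under this rule, a float column is never encoded directly, which is why the description suggests casting it to text first.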
      

       

            People

              Assignee: Rahul Iyer (riyer)
              Reporter: Frank McQuillan (fmcquillan)