  1. Apache MADlib
  2. MADLIB-1200

Pre-processing helper function for mini-batching

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.14
    • Component/s: Module: Utilities
    • Labels: None

    Description

      Related to
      https://issues.apache.org/jira/browse/MADLIB-1037
      https://issues.apache.org/jira/browse/MADLIB-1048

      Story

      As a
      data scientist
      I want to
      pre-process input files for use with mini-batching
      so that
      the optimization part of MLP, SVM, etc. runs faster when I do multiple runs, perhaps because I am tuning parameters (i.e., pre-processing is an occasional operation that I don't want to re-do every time that I train a model)

      Interface

      minibatch_preprocessor(
           source_table,         -- Name of the table containing input data
           output_table,         -- Name of the output table for mini-batching
           dependent_varname,    -- Name of the dependent variable column
           independent_varname,  -- Expression list to evaluate for the independent variables
           grouping_cols,        -- Preprocess separately by group
           buffer_size           -- Number of source input rows to pack into batch
      )
      

      where

      source_table
      TEXT.  Name of the table containing input data.  Can also be a view.
      
      output_table
      TEXT.  Name of the output table from the preprocessor which will be used as input to algorithms that support mini-batching.
      
      dependent_varname
      TEXT.  Column name or expression to evaluate for the dependent variable. 
      
      independent_varname
      TEXT.  Column name or expression list to evaluate for the independent variables.  Values are cast to double precision (FLOAT8) when packing.
      
      grouping_cols (optional)
      TEXT, default: NULL.  An expression list used to group the input dataset into discrete groups, running one preprocessing step per group. Similar to the SQL GROUP BY clause. When this value is NULL, no grouping is used and a single preprocessing step is performed for the whole data set.
      
      buffer_size (optional)
      INTEGER, default: ???.  Number of source input rows to pack into each batch.
      
      The output table contains the following columns:
      
      id                      INTEGER.   Unique id for each row of the packed table.
      dependent_varname       FLOAT8[].  Packed array of dependent variables.
      independent_varname     FLOAT8[].  Packed array of independent variables.
      grouping_cols           TEXT.      Name of grouping columns.
      
      A summary table named <output_table>_summary is created together with the output table.  It has the following columns:
      
      source_table                Source table name.
      output_table                Output table name from preprocessor.
      dependent_varname           Dependent variable.
      independent_varname         Independent variables.
      buffer_size                 Buffer size used in preprocessing step.
      dependent_vartype           “Continuous” or “Categorical”.
      class_values                Class values of the dependent variable (NULL for continuous variables).
      num_rows_processed          The total number of rows that were used in the computation.
      num_missing_rows_skipped    The total number of rows that were skipped because of NULL values in them.
      grouping_cols               Names of the grouping columns.
      
      A standardization table named <output_table>_standardization is created together with the output table.  It has the following columns:
      
      grouping_cols               Group.
      mean                        Mean of independent variables by group.
      std                         Standard deviation of independent variables by group.
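
To make the contents of the standardization table concrete, here is a minimal Python sketch of per-column standardization for one group. It is an illustration only, not the MADlib implementation: the function name standardize is hypothetical, and the choice of the population standard deviation and the handling of constant columns are assumptions that the ticket does not specify.

```python
import math

def standardize(rows):
    """Return (scaled_rows, means, stds) for a list of equal-length feature rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    # Assumption: population standard deviation (divide by n, not n - 1).
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dims)]
    # Assumption: constant columns (std == 0) are divided by 1 to avoid
    # a division by zero.
    safe = [s if s > 0 else 1.0 for s in stds]
    scaled = [[(r[j] - means[j]) / safe[j] for j in range(dims)] for r in rows]
    return scaled, means, stds

rows = [[1.0, 10.0], [3.0, 10.0]]
scaled, means, stds = standardize(rows)
```

With grouping, the same computation would run once per group, yielding one (mean, std) pair of arrays per row of the standardization table.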
      

       
      The main purpose of the function is to prepare the training data for mini-batching algorithms. This is achieved in two stages:

      1. Based on the buffer size, pack the dependent and independent variables of each buffer of rows into a single tuple.
      2. If the dependent variable is boolean or text, perform one-hot encoding.  This does not apply to integers and floats. Note that if an integer variable is actually categorical, it must be cast to ::TEXT so that it gets encoded.
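
The two stages can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the MADlib code: the helper names one_hot and pack are hypothetical, class values are assumed to be ordered by sorting, ids are assumed to start at 0, and nested lists stand in for the FLOAT8[] arrays of the output table.

```python
def one_hot(labels):
    """Encode each label as a 0/1 vector; return (encoded, class_values)."""
    # Assumption: classes are ordered by sorting the distinct labels.
    class_values = sorted(set(labels))
    index = {c: i for i, c in enumerate(class_values)}
    encoded = [[1.0 if index[label] == i else 0.0
                for i in range(len(class_values))]
               for label in labels]
    return encoded, class_values

def pack(dep, indep, buffer_size):
    """Group consecutive rows into buffers of at most buffer_size rows."""
    packed = []
    for start in range(0, len(dep), buffer_size):
        packed.append({
            "id": len(packed),  # assumption: ids are 0-based and sequential
            "dependent_varname": dep[start:start + buffer_size],
            "independent_varname": indep[start:start + buffer_size],
        })
    return packed

labels = ["cat", "dog", "cat"]
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
encoded, class_values = one_hot(labels)
batches = pack(encoded, features, buffer_size=2)
```

The last buffer holds fewer rows than buffer_size when the row count is not an exact multiple of it; here three rows with a buffer size of 2 produce one full buffer and one buffer with a single row.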

      Notes

      1) Random shuffle needed for mini-batch.
      2) A naive approach may be OK to start; it is not worth a big investment to make it run 10% or 20% faster.
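
A naive version of the random shuffle mentioned in the notes could look like the Python sketch below. It shuffles row indices once and then slices the shuffled rows into buffers. The function name shuffled_batches and the seed parameter are hypothetical and exist only for this illustration; the ticket does not say how the shuffle is implemented.

```python
import random

def shuffled_batches(rows, buffer_size, seed=None):
    """Shuffle rows once, then split them into buffers of buffer_size rows."""
    rng = random.Random(seed)      # local generator, so global state is untouched
    order = list(range(len(rows)))
    rng.shuffle(order)             # in-place shuffle of the index list
    shuffled = [rows[i] for i in order]
    return [shuffled[i:i + buffer_size]
            for i in range(0, len(shuffled), buffer_size)]

rows = list(range(10))
batches = shuffled_batches(rows, buffer_size=4, seed=42)
```

Shuffling before packing means each buffer is a random sample of the source rows rather than a run of adjacent rows, which matters when the source table is stored in a sorted order.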

      Acceptance
      Summary

      1) Convert from standard to special format for mini-batching
      2) Always standardize for now; the user cannot opt out. We may decide to add a flag later.
      3) Some scale testing OK (does not need to be comprehensive)
      4) Document as a helper function in the user docs
      5) Always ignore nulls in dependent variable
      6) IC
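
As an illustration of acceptance item 5, the Python sketch below drops rows whose dependent variable is NULL (None in Python) and reports counts analogous to num_rows_processed and num_missing_rows_skipped in the summary table. The function name drop_null_dependent and the dictionary row format are assumptions made for this example only.

```python
def drop_null_dependent(rows, dep_key):
    """Return (kept_rows, num_rows_processed, num_missing_rows_skipped)."""
    kept = [r for r in rows if r[dep_key] is not None]
    return kept, len(kept), len(rows) - len(kept)

rows = [
    {"y": 1.0, "x": [0.5, 0.1]},
    {"y": None, "x": [0.2, 0.9]},   # skipped: dependent variable is NULL
    {"y": 0.0, "x": [0.7, 0.3]},
]
kept, num_rows_processed, num_missing_rows_skipped = drop_null_dependent(rows, "y")
```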

            People

              nkak Nikhil Kak
              fmcquillan Frank McQuillan