  1. Apache MADlib
  2. MADLIB-1119

Train-test split

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.12
    • Component/s: Module: Sampling
    • Labels: None

    Description

      Context

      See related story on stratified sampling
      https://issues.apache.org/jira/browse/MADLIB-986

      Story

      As a data scientist, I want to split a data table into training and test sets, with grouping support, so that I can use the resulting sets for model development in the usual way.

      The MVP for this story is:

      • support split by group
      • allow option to sample without replacement (default) and sample with replacement
      • allow option to output a subset of columns to the output table
      • output one table with a new test/train column, or optionally two separate tables

      Proposed Interface

      train_test_split (
                         source_table,
                         output_table,
                         train_proportion,
                         test_proportion,         -- optional
                         grouping_col,            -- optional
                         with_replacement,        -- optional
                         target_cols,             -- optional
                         separate_output_tables   -- optional
                       )
      
      source_table
      TEXT. The name of the table containing the input data.
      
      output_table
      TEXT. Name of the output table. A new INTEGER column named 'split',
      appended as the rightmost column, is 1 for rows in the train set and
      0 for rows in the test set, unless the 'separate_output_tables'
      parameter below is TRUE, in which case two output tables are created
      using the 'output_table' name with the suffixes '_train' and '_test'.
      The output table contains all the columns present in the source
      table unless otherwise specified in the 'target_cols' parameter below.
      
      train_proportion
      FLOAT8 in the range (0,1).  Proportion of the dataset to include 
      in the train split.  If the 'grouping_col' parameter is specified below, 
      each group will be sampled independently using the 
      train proportion, i.e., in a stratified fashion.
      
      test_proportion (optional)
      FLOAT8 in the range (0,1).  Proportion of the dataset to include
      in the test split.  Default is the complement of the train
      proportion (1-'train_proportion').  If the 'grouping_col'
      parameter is specified below, each group will be sampled
      independently using the test proportion,
      i.e., in a stratified fashion.
      
      grouping_col (optional)
      TEXT, default NULL. A single column or a comma-separated list of columns
      that defines how to stratify.  When this parameter is NULL,
      the train-test split is not stratified.
      
      with_replacement (optional) 
      BOOLEAN, default FALSE.  Determines whether to sample with replacement 
      or without replacement (default).
      
      target_cols (optional)
      TEXT, default NULL. A comma-separated list of columns to appear in the 'output_table'. 
      If NULL, all columns from the 'source_table'  will appear in the 'output_table'.
      
      separate_output_tables (optional)
      BOOLEAN, default FALSE.  If TRUE, two output tables will be created using 
      the 'output_table' name with the suffixes '_train' and '_test'.
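
The parameter semantics above can be illustrated with a small in-memory sketch. This is not the MADlib implementation: it is a hypothetical Python analogue that works on a list of dicts instead of a database table, covers only sampling without replacement, and assumes details the story leaves open (per-group counts are rounded to the nearest integer, the 'split' column is dropped when separate outputs are requested, and the `seed` argument is an addition for reproducibility).

```python
import random
from collections import defaultdict


def train_test_split(rows, train_proportion, test_proportion=None,
                     grouping_col=None, target_cols=None,
                     separate_output_tables=False, seed=None):
    """Hypothetical in-memory analogue of the proposed interface.

    rows is a list of dicts. Sampling is without replacement, so no row
    can land in both the train and the test set.
    """
    if test_proportion is None:
        test_proportion = 1.0 - train_proportion  # complement by default
    for p in (train_proportion, test_proportion):
        if not 0.0 < p < 1.0:
            raise ValueError("proportions must lie in (0, 1)")
    if train_proportion + test_proportion > 1.0 + 1e-9:
        raise ValueError("train + test proportions must not exceed 1")

    # Comma-separated column lists, as in the proposed TEXT parameters.
    group_cols = [c.strip() for c in grouping_col.split(",")] if grouping_col else []
    keep_cols = [c.strip() for c in target_cols.split(",")] if target_cols else None
    rng = random.Random(seed)

    # Bucket rows by group key; with no grouping there is a single bucket.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in group_cols)].append(row)

    out = []
    for members in groups.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        # Assumption: round each group's counts to the nearest integer.
        n_train = round(len(shuffled) * train_proportion)
        n_test = min(round(len(shuffled) * test_proportion),
                     len(shuffled) - n_train)
        # Disjoint slices of one shuffle: first train rows, then test rows.
        for i, row in enumerate(shuffled[:n_train + n_test]):
            picked = row if keep_cols is None else {c: row[c] for c in keep_cols}
            out.append({**picked, "split": 1 if i < n_train else 0})

    if separate_output_tables:
        # Assumption: the 'split' column is dropped from the two outputs.
        train = [{k: v for k, v in r.items() if k != "split"}
                 for r in out if r["split"] == 1]
        test = [{k: v for k, v in r.items() if k != "split"}
                for r in out if r["split"] == 0]
        return train, test
    return out
```

For instance, with 10 rows in group 'a', 20 rows in group 'b', and a train proportion of 0.6, this sketch puts 6 and 12 rows in the train set and the remaining 4 and 8 rows in the test set.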
      

      Other notes

      1) PDL Tools is one example implementation of a train/test split to review [1].

      2) From Rahul Iyer: "The goal of having both train and test is to provide subsample and train/test split in one function.

      For example, if train_size = 0.4 and test_size = 0.1, then only half of the input data will be output. This is tremendously useful in situations where a user wants to prototype or evaluate a couple of models on smaller i.i.d. data before running them on the whole dataset.

      Under no circumstances would the train_size + test_size be allowed to be more than 1. The implementation will also ensure that there are no "leaks" (leak = same data occurring in both train and test) as that defeats the whole purpose of building an independent dataset for model evaluation.

      Of course, the interface does get a little complex and could confuse users. Explanatory documentation with examples is the only solution to that problem.

      The alternative to having both sizes in one function is to run a subsample function (using various sampling methods) and then perform the train/test split. The downside to this approach is that it requires writing an intermediate table to disk (inefficient)."
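
The arithmetic and the "no leaks" rule in the note above can be sketched as follows. This is a hypothetical illustration, not the MADlib code: the function name `split_indices` and the `seed` argument are invented here, `train_size` and `test_size` follow the names used in the quote (they correspond to 'train_proportion' and 'test_proportion' in the proposed interface), and the way sampling with replacement is handled (drawing only from within two disjoint pools) is one possible assumption about how leaks could be avoided.

```python
import random


def split_indices(n_rows, train_size, test_size,
                  with_replacement=False, seed=None):
    """Return (train_idx, test_idx) with no index shared between them."""
    if not (0.0 < train_size < 1.0 and 0.0 < test_size < 1.0):
        raise ValueError("sizes must lie in (0, 1)")
    if train_size + test_size > 1.0 + 1e-9:
        raise ValueError("train_size + test_size must not exceed 1")

    rng = random.Random(seed)
    n_train = round(n_rows * train_size)
    n_test = min(round(n_rows * test_size), n_rows - n_train)

    # Shuffle once, then cut two disjoint pools out of the same ordering.
    order = list(range(n_rows))
    rng.shuffle(order)
    train_pool = order[:n_train]
    test_pool = order[n_train:n_train + n_test]

    if with_replacement:
        # Assumption: draw with replacement inside each pool only, so
        # duplicates can occur within a set but never across the two sets.
        return (rng.choices(train_pool, k=n_train),
                rng.choices(test_pool, k=n_test))
    return train_pool, test_pool


train_idx, test_idx = split_indices(100, train_size=0.4, test_size=0.1, seed=7)
# 40 + 10 = 50 rows are output, i.e. half of the 100 input rows.
print(len(train_idx), len(test_idx), set(train_idx) & set(test_idx))
```

Because both sets are cut from a single shuffled ordering, subsampling and splitting happen in one pass, which is the efficiency argument made in the quote: no intermediate subsample needs to be materialized first.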

      Acceptance

      1) Code, user docs, on-line docs, IC, Tinc tests complete.
      2) Radar green for all supported dbs.

      References

      [1] PDL Tools sampling modules, including stratified sampling
      http://pivotalsoftware.github.io/PDLTools/group__grp__train__test__split.html

      [2] Related story on stratified sampling
      https://issues.apache.org/jira/browse/MADLIB-986

      [3] General
      https://en.wikipedia.org/wiki/Test_set

      [4] scikit-learn
      http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

      Attachments

        1. test_train_split.sql_in (11 kB, Frank McQuillan)

            People

              Assignee: Orhan Kislal (okislal)
              Reporter: Frank McQuillan (fmcquillan)