[MADLIB-1119] Train-test split - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: v1.12
Component/s: Module: Sampling
Labels:
None

Description

Context

See related story on stratified sampling
https://issues.apache.org/jira/browse/MADLIB-986

Story

As a data scientist, I want to split a data table into training and test sets, with grouping support, so that I can use the resulting sets for model development in the usual way.

The MVP for this story is:

support split by group
allow option to sample without replacement (default) and sample with replacement
allow option to output a subset of columns to the output table
output one table with a new test/train column, or optionally two separate tables

Proposed Interface

train_test_split (
                   source_table,
                   output_table,
                   train_proportion,
                   test_proportion,        -- optional
                   grouping_col,           -- optional
                   with_replacement,       -- optional
                   target_cols,            -- optional
                   separate_output_tables  -- optional
                 )

source_table
TEXT. The name of the table containing the input data.

output_table
TEXT. Name of the output table. A new INTEGER column named 'split'
is appended on the right, with 1 marking the train set and 0 the test set,
unless the 'separate_output_tables' parameter below is TRUE,
in which case two output tables will be created using
the 'output_table' name with the suffixes '_train' and '_test'.
The output table contains all the columns present in the source
table unless otherwise specified in the 'target_cols' parameter below.

train_proportion
FLOAT8 in the range (0,1).  Proportion of the dataset to include 
in the train split.  If the 'grouping_col' parameter is specified below, 
each group will be sampled independently using the 
train proportion, i.e., in a stratified fashion.

test_proportion (optional)
FLOAT8 in the range (0,1).  Proportion of the dataset to include
in the test split.  Default is the complement of the train
proportion (1-'train_proportion').  If the 'grouping_col'
parameter is specified below, each group will be sampled
independently using the test proportion,
i.e., in a stratified fashion.

grouping_col (optional)
TEXT, default NULL. A single column or a comma-separated list of columns
that defines how to stratify.  When this parameter is NULL,
the train-test split is not stratified.

with_replacement (optional) 
BOOLEAN, default FALSE.  Determines whether to sample with replacement 
or without replacement (default).

target_cols (optional)
TEXT, default NULL. A comma-separated list of columns to appear in the 'output_table'. 
If NULL, all columns from the 'source_table'  will appear in the 'output_table'.

separate_output_tables (optional)
BOOLEAN, default FALSE.  If TRUE, two output tables will be created using 
the 'output_table' name with the suffixes '_train' and '_test'.
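
The parameter semantics above can be illustrated with a small in-memory sketch. This is not the MADlib implementation: the Python function `split_rows` and its arguments are hypothetical stand-ins that mirror the proposed parameters, rows are plain dicts rather than table rows, and rounding each group's counts down is an assumption the story does not specify. Sampling with replacement and 'separate_output_tables' are left out.

```python
import random
from collections import defaultdict


def split_rows(rows, train_proportion, test_proportion=None,
               grouping_col=None, target_cols=None, seed=None):
    """Hypothetical sketch: return sampled rows tagged split=1 (train) or 0 (test).

    rows is a list of dicts; sampling is without replacement only.
    """
    if test_proportion is None:
        # Default from the story: complement of the train proportion.
        test_proportion = 1.0 - train_proportion
    rng = random.Random(seed)

    # One stratum per distinct grouping value, or a single stratum if None.
    strata = defaultdict(list)
    for row in rows:
        key = row[grouping_col] if grouping_col is not None else None
        strata[key].append(row)

    def project(row):
        # Keep only the requested columns, or all of them if none are given.
        cols = target_cols if target_cols is not None else list(row)
        return {c: row[c] for c in cols}

    output = []
    for members in strata.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # Assumption: counts are rounded down per group; the small epsilon
        # guards against floating-point error such as 10 * 0.2 -> 1.9999...
        n_train = int(len(shuffled) * train_proportion + 1e-9)
        n_test = int(len(shuffled) * test_proportion + 1e-9)
        # Disjoint slices of one shuffle: no row lands in both sets.
        for r in shuffled[:n_train]:
            output.append({**project(r), "split": 1})
        for r in shuffled[n_train:n_train + n_test]:
            output.append({**project(r), "split": 0})
    return output


rows = [{"id": i, "grp": "a" if i < 10 else "b"} for i in range(30)]
result = split_rows(rows, 0.8, grouping_col="grp", seed=0)
```

With 10 rows in group 'a' and 20 in group 'b', each group is sampled at the same 80/20 ratio, which is the stratified behavior the 'grouping_col' description asks for.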

Other notes

1) PDL Tools is one example implementation of train/test split to review [1].

2) From Rahul Iyer: "The goal of having both train and test is to provide subsample and train/test split in one function.

For example, if train_size = 0.4 and test_size = 0.1, then only half of the input data will be output. This is tremendously useful in situations where a user wants to prototype/evaluate a couple of models on smaller iid data before running them on the whole dataset.

Under no circumstances would the train_size + test_size be allowed to be more than 1. The implementation will also ensure that there are no "leaks" (leak = same data occurring in both train and test) as that defeats the whole purpose of building an independent dataset for model evaluation.

Of course, the interface does get a little complex and could confuse users. Explanatory documentation with examples is the only solution to that problem.

The alternative to having both sizes in one function is to run a subsample function (using various sampling methods) and then perform the train_test split. The downside to this approach is that it requires writing an intermediate table to disk (inefficient)."
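
The arithmetic in the note above can be checked with a short sketch. The helper `subsample_split` is hypothetical, works on a plain Python list rather than a table, and truncating fractional counts is an assumption; it only illustrates the two constraints described (sizes summing to at most 1, and no leaks), not how MADlib implements them.

```python
import random


def subsample_split(items, train_size, test_size, seed=None):
    """Hypothetical helper: draw disjoint train and test samples in one pass."""
    if not (0 < train_size < 1 and 0 < test_size < 1):
        raise ValueError("each size must lie in (0, 1)")
    if train_size + test_size > 1 + 1e-9:
        raise ValueError("train_size + test_size must not exceed 1")
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Assumption: fractional counts are truncated; the epsilon absorbs
    # floating-point error in the multiplication.
    n_train = int(len(shuffled) * train_size + 1e-9)
    n_test = int(len(shuffled) * test_size + 1e-9)
    # Disjoint slices of one shuffle, so no item can appear in both sets.
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


train, test = subsample_split(range(100), train_size=0.4, test_size=0.1, seed=42)
print(len(train), len(test))    # 40 10: half of the input is output
print(set(train) & set(test))   # set(): no leaks
```

Because both samples come from a single shuffle, no intermediate subsample has to be materialized before the split, which is the efficiency argument made in the note.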

Acceptance

1) Code, user docs, on-line docs, IC, Tinc tests complete.
2) Radar green for all supported dbs.

References

[1] PDL Tools sampling modules, including stratified sampling
http://pivotalsoftware.github.io/PDLTools/group__grp__train__test__split.html

[2] Related story on stratified sampling https://issues.apache.org/jira/browse/MADLIB-986

[3] General
https://en.wikipedia.org/wiki/Test_set

[4] scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html