Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Description
Context
See related story on stratified sampling
https://issues.apache.org/jira/browse/MADLIB-986
Story
As a data scientist, I want to split a data table into training and test sets, with grouping support, so that I can use the resulting sets for model development in the usual way.
The MVP for this story is:
- support split by group
- allow option to sample without replacement (default) and sample with replacement
- allow option to output a subset of columns to the output table
- output one table with a new test/train column, or optionally two separate tables
Proposed Interface
train_test_split (
    source_table,
    output_table,
    train_proportion,
    test_proportion,         -- optional
    grouping_col,            -- optional
    with_replacement,        -- optional
    target_cols,             -- optional
    separate_output_tables   -- optional
)

source_table
TEXT. The name of the table containing the input data.

output_table
TEXT. Name of the output table. A new INTEGER column on the right called 'split' will identify 1 for the train set and 0 for the test set, unless the 'separate_output_tables' parameter below is TRUE, in which case two output tables will be created using the 'output_table' name with the suffixes '_train' and '_test'. The output table contains all the columns present in the source table unless otherwise specified in the 'target_cols' parameter below.

train_proportion
FLOAT8 in the range (0,1). Proportion of the dataset to include in the train split. If the 'grouping_col' parameter is specified below, each group will be sampled independently using the train proportion, i.e., in a stratified fashion.

test_proportion (optional)
FLOAT8 in the range (0,1). Proportion of the dataset to include in the test split. Default is the complement of the train proportion (1-'train_proportion'). If the 'grouping_col' parameter is specified below, each group will be sampled independently using the test proportion, i.e., in a stratified fashion.

grouping_col (optional)
TEXT, default NULL. A single column or a comma-separated list of columns that defines how to stratify. When this parameter is NULL, the train-test split is not stratified.

with_replacement (optional)
BOOLEAN, default FALSE. Determines whether to sample with replacement or without replacement (default).

target_cols (optional)
TEXT, default NULL. A comma-separated list of columns to appear in the 'output_table'. If NULL, all columns from the 'source_table' will appear in the 'output_table'.

separate_output_tables (optional)
BOOLEAN, default FALSE. If TRUE, two output tables will be created using the 'output_table' name with the suffixes '_train' and '_test'.
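The semantics described above can be illustrated with a small in-memory Python sketch. This is not the MADlib implementation: the function name split_rows, the list-of-dicts row representation, the seed argument, and the rounding of per-group counts are all assumptions made for illustration. It covers only the default case (sampling without replacement, a single output with a 'split' column).

```python
import random
from collections import defaultdict


def _parse_cols(cols):
    """Turn a comma-separated string such as 'a, b' into ['a', 'b']."""
    return [c.strip() for c in cols.split(",")] if cols else []


def split_rows(rows, train_proportion, test_proportion=None,
               grouping_col=None, target_cols=None, seed=None):
    """Hypothetical illustration of the proposed interface on a list of dicts.

    Samples without replacement, independently per group, and returns the
    selected rows with an extra 'split' key (1 = train, 0 = test).
    """
    if test_proportion is None:
        test_proportion = 1.0 - train_proportion
    for p in (train_proportion, test_proportion):
        if not 0.0 < p < 1.0:
            raise ValueError("proportions must lie in (0, 1)")
    if train_proportion + test_proportion > 1.0 + 1e-9:
        raise ValueError("train + test proportions must not exceed 1")

    rng = random.Random(seed)
    group_cols = _parse_cols(grouping_col)
    keep_cols = _parse_cols(target_cols)

    # With no grouping columns every row gets the same key (), i.e. one group.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in group_cols)].append(row)

    output = []
    for members in groups.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        n = len(shuffled)
        # Assumption: per-group counts are rounded to the nearest integer.
        n_train = round(train_proportion * n)
        n_test = min(round(test_proportion * n), n - n_train)
        # Train and test come from disjoint slices of one shuffle.
        for i, row in enumerate(shuffled[:n_train + n_test]):
            cols = keep_cols or list(row.keys())
            out_row = {c: row[c] for c in cols}
            out_row["split"] = 1 if i < n_train else 0
            output.append(out_row)
    return output
```

Because both sets are taken from disjoint slices of a single shuffle within each group, no source row can appear in both train and test in this sketch.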
Other notes
1) PDL tools is one example implementation of train/test split to review [1].
2) From Rahul Iyer: "The goal of having both train and test is to provide subsample and train/test split in one function.
For example, if train_size = 0.4 and test_size = 0.1, then only half of the input data will be output. This is tremendously useful in situations where a user wants to prototype/evaluate a couple of models on smaller iid data before running them on the whole dataset.
Under no circumstances would the train_size + test_size be allowed to be more than 1. The implementation will also ensure that there are no "leaks" (leak = same data occurring in both train and test) as that defeats the whole purpose of building an independent dataset for model evaluation.
Of course, the interface does get a little complex and could confuse users. Explanatory documentation with examples is the only solution to that problem.
The alternative to having both sizes in one function is to run a subsample function (using various sampling methods) and then perform the train/test split. The downside of this approach is that it requires writing an intermediate table to disk (inefficient)."
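The arithmetic in the note above (train_size = 0.4, test_size = 0.1, half the data output, no leaks) can be checked with a short Python sketch. The function name subsample_split and the use of integer row ids are hypothetical, chosen only for illustration; this is not how MADlib implements the feature.

```python
import random


def subsample_split(row_ids, train_size, test_size, seed=None):
    """Hypothetical one-pass subsample plus train/test split over row ids."""
    if train_size + test_size > 1.0 + 1e-9:
        raise ValueError("train_size + test_size must not exceed 1")
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(train_size * n)
    n_test = min(round(test_size * n), n - n_train)
    # Disjoint slices of a single shuffle: no id can land in both sets.
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


train, test = subsample_split(range(100), 0.4, 0.1, seed=42)
print(len(train), len(test))          # 40 10
print(set(train).isdisjoint(test))    # True
```

Only 50 of the 100 ids are returned, matching the "only half" example in the note, and no intermediate subsample has to be materialized before the split.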
Acceptance
1) Code, user docs, on-line docs, IC, Tinc tests complete.
2) Radar green for all supported dbs.
References
[1] PDL tools sampling modules incl stratified sampling
http://pivotalsoftware.github.io/PDLTools/group__grp__train__test__split.html
[2] Related story on stratified sampling
https://issues.apache.org/jira/browse/MADLIB-986
[3] General
https://en.wikipedia.org/wiki/Test_set
[4] scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html