Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Description
Story
As a data scientist, I want to sample a data table in proportion to the number of rows in each group, so that I can do model building on the sampled data sets.
The MVP for this story is:
- the sample proportion is global, i.e., a single fractional value between 0 and 1 that applies to every group
- option to sample without replacement (default) or with replacement
- option to write only a subset of columns to the output table
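To make the "global proportion" point concrete, here is a minimal sketch of how per-group sample sizes could be derived from a single proportion. It is an illustration only: the helper name `stratum_sample_sizes` is hypothetical, and rounding up (so every non-empty group contributes at least one row) is an assumption that this story does not specify.

```python
import math
from collections import Counter


def stratum_sample_sizes(group_keys, proportion):
    """Return {group: sample size} for one global proportion (hypothetical helper)."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must be in the open interval (0, 1)")
    counts = Counter(group_keys)
    # Assumption: round up. round(..., 9) removes float noise first,
    # e.g. 0.1 * 30 evaluates to 3.0000000000000004, which ceil alone would turn into 4.
    return {g: math.ceil(round(n * proportion, 9)) for g, n in counts.items()}


# Three groups of 30, 10 and 5 rows sampled at 10%.
sizes = stratum_sample_sizes(["a"] * 30 + ["b"] * 10 + ["c"] * 5, 0.1)
print(sizes)  # {'a': 3, 'b': 1, 'c': 1}
```

Note that with rounding up, the small group "c" is over-represented relative to 10%; whether that is acceptable is a design decision for the implementation.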
Proposed Interface
stratified_sample(
    source_table,
    output_table,
    proportion,
    grouping_col,      -- optional
    with_replacement,  -- optional
    target_cols        -- optional
)

source_table
    TEXT. The name of the table containing the input data.
output_table
    TEXT. Name of output table that contains the sampled data. The output table contains all the columns present in the source table unless otherwise specified in the 'target_cols' parameter below.
proportion
    FLOAT8 in the range (0,1). The size of the sample in each stratum will be taken in proportion to the size of the stratum.
grouping_col (optional)
    TEXT, default: NULL. A single column or a list of comma-separated columns that defines how to stratify. When this parameter is NULL, no grouping is used so the sampling is non-stratified.
with_replacement (optional)
    BOOLEAN, default FALSE. Determines whether to sample with replacement or without replacement (default).
target_cols (optional)
    TEXT, default NULL. A comma-separated list of columns to appear in the 'output_table'. If NULL, all columns from the 'source_table' will appear in the 'output_table'.
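The intended semantics of these parameters can be sketched in plain Python over an in-memory list of rows. This is not the MADlib implementation and makes several assumptions: rows are dictionaries rather than a database table, the result is returned instead of written to an output table, the per-stratum size is rounded up, and the `seed` argument is added only to make the example reproducible.

```python
import math
import random
from collections import defaultdict


def stratified_sample(rows, proportion, grouping_col=None,
                      with_replacement=False, target_cols=None, seed=None):
    """Sketch of the proposed semantics on a list of dict rows (not MADlib code)."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must be in the open interval (0, 1)")
    rng = random.Random(seed)
    # Comma-separated column lists, as in the proposed interface.
    group_cols = [c.strip() for c in grouping_col.split(",")] if grouping_col else []
    out_cols = [c.strip() for c in target_cols.split(",")] if target_cols else None

    # With no grouping columns every row gets the key (), i.e. a single stratum.
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in group_cols)].append(row)

    sample = []
    for members in strata.values():
        # Assumption: round up, after trimming float noise.
        k = math.ceil(round(len(members) * proportion, 9))
        if with_replacement:
            sample.extend(rng.choices(members, k=k))  # rows may repeat
        else:
            sample.extend(rng.sample(members, k))     # rows are distinct
    if out_cols is not None:
        sample = [{c: row[c] for c in out_cols} for row in sample]
    return sample


# 30 rows in group "a" and 10 in group "b", sampled at 20% per group.
rows = [{"id": i, "grp": "a" if i < 30 else "b", "val": i * i} for i in range(40)]
result = stratified_sample(rows, 0.2, grouping_col="grp",
                           target_cols="id, grp", seed=0)
print(len(result))  # 8 rows: 6 from "a" and 2 from "b"
```

In the database version the same steps would presumably map to a per-group row count, a random ordering within each group, and a filter on the row position, but that mapping is a guess rather than something this story prescribes.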
Other notes
PDLTools is one example implementation of stratified sampling that is worth reviewing [2].
Please review existing MADlib sample functions [3] to see if these can be used as a basis, or built on, for this stratified sample story.
References
[2] PDLTools sampling modules, including stratified sampling
http://pivotalsoftware.github.io/PDLTools/group__grp__sampling.html
[3] Existing MADlib sample function
http://madlib.incubator.apache.org/docs/latest/group__grp__sample.html
[4] Pandas/Selecting Random Samples
http://pandas.pydata.org/pandas-docs/stable/indexing.html#selecting-random-samples
[5] General background on stratified sampling
https://en.wikipedia.org/wiki/Stratified_sampling