Apache MADlib / MADLIB-986

Stratified sampling

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.12
    • Component/s: Module: Sampling

    Description

      Story

      As a data scientist, I want to sample a data table in proportion to the number of rows in each group, so that I can do model building on the sampled data sets.

      The MVP for this story is:

      • the sample proportion is global, i.e., a single fractional value between 0 and 1
      • allow an option to sample without replacement (default) or with replacement
      • allow an option to write only a subset of columns to the output table
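
      To make the "global proportion" idea concrete, here is a minimal Python sketch of how one fraction could translate into per-group sample sizes. The helper name stratum_sample_sizes is hypothetical, and rounding up with ceil is an assumption; this story does not specify how fractional sizes should be rounded.

```python
import math
from collections import Counter

def stratum_sample_sizes(group_keys, proportion):
    """Return {group: sample size} for one global proportion (hypothetical helper)."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must be in the open interval (0, 1)")
    counts = Counter(group_keys)
    # Assumption: round up, so every non-empty group contributes at least one row.
    return {group: math.ceil(n * proportion) for group, n in counts.items()}

# Three groups with 10, 4, and 1 rows, sampled at a global proportion of 0.5.
sizes = stratum_sample_sizes(["a"] * 10 + ["b"] * 4 + ["c"] * 1, 0.5)
print(sizes)
```

      Because the same fraction is applied to every group, the sample keeps roughly the same group mix as the source table.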

      Proposed Interface

      stratified_sample (
                          source_table,
                          output_table,
                          proportion,
                          grouping_col,      -- optional
                          with_replacement,  -- optional
                          target_cols        -- optional
                        )
      
      source_table
      TEXT. The name of the table containing the input data.
      
      output_table
      TEXT. Name of output table that contains the sampled data. 
      The output table contains all the columns present in the source table 
      unless otherwise specified in the 'target_cols' parameter below.
      
      proportion
      FLOAT8 in the range (0,1). The fraction of rows to sample from each stratum, 
      so that each stratum's sample size is proportional to the size of the stratum. 
      
      grouping_col (optional)
      TEXT, default: NULL. A single column or a comma-separated list of columns
      that define how to stratify. When this parameter is NULL, 
      no grouping is used, so the sampling is non-stratified.
      
      with_replacement (optional) 
      BOOLEAN, default: FALSE. Determines whether to sample with replacement (TRUE) 
      or without replacement (FALSE).
      
      target_cols (optional)
      TEXT, default: NULL. A comma-separated list of columns to appear in the 'output_table'. 
      If NULL, all columns from the 'source_table' will appear in the 'output_table'.
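
      The parameter semantics above can be sketched in plain Python. This is an illustration of the intended behavior, not the MADlib implementation. It assumes rows are held in memory as a list of dicts in place of 'source_table', returns the sample rather than writing an 'output_table', rounds each stratum's sample size up (the story does not specify rounding), and adds a hypothetical seed argument for repeatability.

```python
import math
import random
from collections import defaultdict

def stratified_sample(rows, proportion, grouping_cols=None,
                      with_replacement=False, target_cols=None, seed=None):
    """Sample a list of dict rows stratum by stratum (illustrative sketch)."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must be in the open interval (0, 1)")
    rng = random.Random(seed)  # 'seed' is a hypothetical extra, not in the proposal

    # Parse the comma-separated column lists used by the proposed interface.
    group_by = [c.strip() for c in grouping_cols.split(",")] if grouping_cols else []
    keep = [c.strip() for c in target_cols.split(",")] if target_cols else None

    # Bucket rows by stratum; with no grouping columns every row shares the key ().
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in group_by)].append(row)

    sample = []
    for members in strata.values():
        k = math.ceil(len(members) * proportion)  # assumed rounding rule
        if with_replacement:
            picked = rng.choices(members, k=k)    # the same row may repeat
        else:
            picked = rng.sample(members, k)       # each row at most once
        sample.extend(picked)

    # Project onto target_cols, or copy every column when none are given.
    if keep is None:
        return [dict(r) for r in sample]
    return [{c: r[c] for c in keep} for r in sample]

# Usage: 8 rows in group "a" and 4 in group "b", sampled at 50% per group.
rows = [{"id": i, "grp": "a" if i < 8 else "b", "x": i * 1.5} for i in range(12)]
out = stratified_sample(rows, 0.5, grouping_cols="grp", target_cols="id, grp", seed=1)
```

      With grouping_cols left as NULL (None here), all rows fall into a single stratum and the call reduces to simple random sampling of the whole table.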
      

      Other notes

      PDL Tools [2] is one example implementation of stratified sampling that is worth reviewing.

      Please review existing MADlib sample functions [3] to see if these can be used as a basis, or built on, for this stratified sample story.

      References

      [2] PDL Tools sampling modules, including stratified sampling
      http://pivotalsoftware.github.io/PDLTools/group__grp__sampling.html

      [3] Existing MADlib sample function
      http://madlib.incubator.apache.org/docs/latest/group__grp__sample.html

      [4] Pandas/Selecting Random Samples
      http://pandas.pydata.org/pandas-docs/stable/indexing.html#selecting-random-samples

      [5] General
      https://en.wikipedia.org/wiki/Stratified_sampling

            People

              okislal Orhan Kislal
              fmcquillan Frank McQuillan