Implementation Proposal (feedback welcome)
Proposing table sampling on a per-table basis (in the FROM clause of the query). The sample size is determined by applying the input sampling rate to the primary key frequency.
`select name from person SAMPLE(0.10) where name='ethan'`
The `person SAMPLE(0.10)` clause returns roughly 10% of the rows of the PERSON table, reducing the performance cost from a PERSON table scan to a scan of the PERSON STATS table.
For table PERSON, assume the STATS table is populated with a guidepost on every other PK (50% coverage).
Step 1: within the query's scan key range, iterate through the STATS table.
Step 2: for every guidepost encountered, consult a random number generator to decide whether this guidepost is included in or excluded from the sample. This dice roll has a 10% chance of winning.
Step 3: once a guidepost is included, every PK in the original PERSON table between this guidepost and the next guidepost is included in the final sample.
Repeat this process until all guideposts have been visited.
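The three steps above can be sketched in Python. This is a minimal model, not actual engine code: `sample_by_guideposts` is a hypothetical name, and both tables are represented as plain sorted lists of PKs.

```python
import random

def sample_by_guideposts(table_pks, guideposts, rate, seed=None):
    """Sketch of guidepost-based table sampling.

    table_pks:  sorted primary keys of the original table (e.g. PERSON)
    guideposts: sorted guidepost PKs from the STATS table
    rate:       input sampling rate, e.g. 0.10
    """
    rng = random.Random(seed)
    sample = []
    for i, gp in enumerate(guideposts):
        # Step 2: roll the dice once per guidepost.
        if rng.random() < rate:
            # Step 3: include every PK in [this guidepost, next guidepost).
            upper = guideposts[i + 1] if i + 1 < len(guideposts) else float("inf")
            sample.extend(pk for pk in table_pks if gp <= pk < upper)
    return sample
```

Note that the unit of inclusion is a whole guidepost interval, never an individual row, which is what keeps the scan on the STATS table instead of the full base table.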
For example: during the dice-rolling process, guidepost 3 is included, so every PK in [3, 5) is included. The final result is the rows with PK 3 and 4.
a, Similar to Microsoft SQL Server TABLESAMPLE, this focuses mainly on the performance benefit. It does not guarantee an even distribution of the sample over the original table (representativeness).
b, It works with any GUIDE_POSTS_WIDTH and any input sampling rate. However, if the table is too small, the sample output may include more or fewer rows than the expected count (sample_rate × table_size).
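The small-table caveat follows from the all-or-nothing guidepost chunks: the returned row count is always a multiple of the chunk size. A minimal simulation (with made-up numbers: 10 guideposts, 2 rows each, 10% rate) illustrates this:

```python
import random

def sampled_row_count(num_guideposts, rows_per_guidepost, rate, rng):
    # Each guidepost covers a fixed chunk of rows; a chunk is kept
    # all-or-nothing, so the result is a multiple of the chunk size.
    included = sum(1 for _ in range(num_guideposts) if rng.random() < rate)
    return included * rows_per_guidepost

rng = random.Random(42)
counts = [sampled_row_count(10, 2, 0.10, rng) for _ in range(1000)]
# Expected count is 0.10 * 20 = 2 rows, and the long-run average
# converges to that, but any single trial returns 0, 2, 4, ... rows.
```

The expectation is correct on average; only the per-query variance is large when the table has few guidepost intervals.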
Summary of other popular TABLESAMPLE implementations.
They fall into basically two categories:
1, Sampling on Query Basis.
(Such as BlinkDB: https://sameeragarwal.github.io/blinkdb_eurosys13.pdf)
This implementation applies sampling to the entire query, such as:
`select name from person where name='ethan' SAMPLE WITH ERROR 10% CONFIDENCE 95%`
BlinkDB does so by assuming that "the data used for similar grouping and filtering clauses does not change over time across future queries". Based on heuristics, the query engine pre-builds certain stratified sample groups extracted from the actual table, caches them, and uses them to evaluate an approximate result for some expensive queries, thereby avoiding a full table scan.
a, Optimizes for the best performance-accuracy trade-off. Given an accuracy tolerance, it automatically decides the sampling rate for the user.
b, The engine takes filtering and grouping into consideration, which makes it powerful. On the other hand, it may not perform at the same level for all kinds of queries.
c, Because it is driven by heuristic information, there is a gradual machine-learning process.
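The pre-built stratified samples can be sketched roughly as follows. This is an illustrative cap-per-group sketch of the stratification idea, not BlinkDB's actual algorithm; `build_stratified_sample`, `group_col`, and `cap` are hypothetical names.

```python
import random
from collections import defaultdict

def build_stratified_sample(rows, group_col, cap, seed=0):
    """Cap the rows kept per group value so that rare groups
    (which uniform sampling would likely miss) survive intact."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row)
    sample = []
    for members in groups.values():
        if len(members) <= cap:
            sample.extend(members)              # keep rare groups entirely
        else:
            sample.extend(rng.sample(members, cap))  # downsample big groups
    return sample
```

A filtered query like `where name='ethan'` can then be answered from the cached sample, scaling the aggregate back up by each group's known sampling fraction.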
2, Sampling on Table Basis.
(Such as Postgres, MS SQLServer. https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation)
This approach places TABLESAMPLE only in the FROM clause of the query, such as:
`select name from person TABLESAMPLE(10 PERCENT) where name='ethan'`
This approach first samples the original table into a smaller 'view' based on the primary key frequency and a given sampling rate. That 'view' then participates in the rest of the query in place of the original table.
Usually a random selection process is used during view creation. In MS SQL Server, a linear one-pass pointer travels through each "page" and asks a random generator to decide whether that page will be part of the sample. Once a page is accepted, every row on it becomes part of the new sample.
This MSSQL TABLESAMPLE:
a, gives the flexibility to satisfy any sampling rate.
b, gains performance by reducing the length of a table scan (though the big-O complexity stays the same).
c, only cares about the performance gain; it does not care about the sample distribution.
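The page-level pass described above can be sketched as follows, using a simplified model where a "page" is just a list of rows; `page_sample` is a hypothetical name, not the SQL Server internal.

```python
import random

def page_sample(pages, percent, seed=None):
    """Page-level sampling: accept or reject whole pages in one
    linear pass, then emit every row on each accepted page."""
    rng = random.Random(seed)
    sample = []
    for page in pages:                    # one pass over all pages
        if rng.random() * 100 < percent:  # dice roll per page, not per row
            sample.extend(page)           # an accepted page contributes all its rows
    return sample
```

This mirrors the guidepost proposal: both trade per-row randomness for coarse chunks (pages there, guidepost intervals here) to keep the scan cheap.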
James Taylor Geoffrey Jacoby