Details

Type: New Feature

Status: Resolved

Priority: Major

Resolution: Later

Affects Version/s: None

Fix Version/s: None

Component/s: Coprocessors

Labels:None
Description
From http://turing.cs.washington.edu/papers/dataprojectsgooglesigmodrecord08.pdf :
MINITABLES: SAMPLING BIGTABLE
Alberto Lerner and S. Muthukrishnan[...] Because of [BigTable] semantics and storing scheme, skipping N rows is not feasible without actually reading them. Even finding the count of rows in a Bigtable at any point in time can be done only probabilistically. On the bright side, since Bigtable does not provide a relational query engine, we do not need to consider what are suitable sampling methods for various relational operators (like joins) or take into account how sampling errors compound with increasing levels of query composition.
Uniform Random Sampling
Our sampling scheme extracts and presents a sample of a Bigtable's rows as if it were a Bigtable itself, which we call a Minitable. The rationale here is that code written to run against a Bigtable can run unchanged against a sample thereof. Our sampling is based on a hash scheme. We pick a convenient hash function that maps the rowname space into a very large keyspace (e.g., a ax+b mod p function, where p is as large as 2128). The rows falling into the first fp keys where f is the relative sample size (it is a fraction), would belong in the sample. Formally, we pick a hash function h : R −> 0..p and if h(r) E [0, fp−1], then add r to the sample. It is easy to see that the expected size of the sample is f * 100% of the Bigtable rowcount independent of the rowcount, and the probability that a particular row r is in the sample is f, as desired. This hashbased sampling method supports maintenance of the sample with each Bigtable mutation (insert, update, or deletion). Only the system may forward relevant mutations from the Bigtable to the Minitable. Otherwise, the latter would behave as just any other Bigtable: it could be backed up and even be replicated. We are currently deploying Minitables in the repository of documents that the crawling system generates. Several Minitables, each with a different sample factor, allow that system to compute aggregates much faster and more often.
Biased Sampling
Uniform random sampling is quite useful but some scenarios require biased sampling methods. We are currently working on one such extension that we call Mask Sampling. In this scheme, the decision to select a row to the sample is still based on its rowname but now a user may specify a mask m over it. The mask, which can be a regular expression that matches portions of a rowname, is used to group rows together. Two rows belong to a same group if their masks result in the same string. Mask sampling guarantees that if a group is selected to the sample, that group will be adequately represented there, regardless of that group's relative size.
Clearly minitables can be constructed on the fly by a coprocessor attached to the source table.
Activity
Transition  Time In Source Status  Execution Times  Last Executer  Last Execution Date  


1787d 8h 20m  1  Andrew Purtell  11/Apr/15 02:15 
Status  Open [ 1 ]  Resolved [ 5 ] 
Resolution  Later [ 7 ] 
Parent 


Issue Type  Subtask [ 7 ]  New Feature [ 2 ] 
Component/s  coprocessors [ 12314191 ] 
Field  Original Value  New Value 

Description 
From http://turing.cs.washington.edu/papers/dataprojectsgooglesigmodrecord08.pdf :
{quote} MINITABLES: SAMPLING BIGTABLE Alberto Lerner and S. Muthukrishnan [...] Because of [BigTable] semantics and storing scheme, skipping N rows is not feasible without actually reading them. Even finding the count of rows in a Bigtable at any point in time can be done only probabilistically. On the bright side, since Bigtable does not provide a relational query engine, we do not need to consider what are suitable sampling methods for various relational operators (like joins) or take into account how sampling errors compound with increasing levels of query composition. _Uniform Random Sampling_ Our sampling scheme extracts and presents a sample of a Bigtable's rows as if it were a Bigtable itself, which we call a Minitable. The rationale here is that code written to run against a Bigtable can run unchanged against a sample thereof. Our sampling is based on a hash scheme. We pick a convenient hash function that maps the rowname space into a very large keyspace (e.g., a ax+b mod p function, where p is as large as 2128). The rows falling into the first fp keys where f is the relative sample size (it is a fraction), would belong in the sample. Formally, we pick a hash function h : R −> 0..p and if h(r) E [0, fp−1], then add r to the sample. It is easy to see that the expected size of the sample is f * 100% of the Bigtable rowcount independent of the rowcount, and the probability that a particular row r is in the sample is f, as desired. This hashbased sampling method supports maintenance of the sample with each Bigtable mutation (insert, update, or deletion). Only the system may forward relevant mutations from the Bigtable to the Minitable. Otherwise, the latter would behave as just any other Bigtable: it could be backed up and even be replicated. We are currently deploying Minitables in the repository of documents that the crawling system generates. Several Minitables, each with a different sample factor, allow that system to compute aggregates much faster and more often. _Biased Sampling_ Uniform random sampling is quite useful but some scenarios require biased sampling methods. We are currently working on one such extension that we call Mask Sampling. In this scheme, the decision to select a row to the sample is still based on its rowname but now a user may specify a mask m over it. The mask, which can be a regular expression that matches portions of a rowname, is used to group rows together. Two rows belong to a same group if their masks result in the same string. Mask sampling guarantees that if a group is selected to the sample, that group will be adequately represented there, regardless of that group's relative size. {quote} Clearly minitables can be constructed on the fly by a coprocessor attached to the source table. 