Type: New Feature
Affects Version/s: None
Fix Version/s: None
MINITABLES: SAMPLING BIGTABLE
Alberto Lerner and S. Muthukrishnan
[...] Because of [BigTable] semantics and storing scheme, skipping N rows is not feasible without actually reading them. Even finding the count of rows in a Bigtable at any point in time can be done only probabilistically. On the bright side, since Bigtable does not provide a relational query engine, we do not need to consider what are suitable sampling methods for various relational operators (like joins) or take into account how sampling errors compound with increasing levels of query composition.
Uniform Random Sampling
Our sampling scheme extracts and presents a sample of a Bigtable's rows as if it were a Bigtable itself, which we call a Minitable. The rationale here is that code written to run against a Bigtable can run unchanged against a sample thereof. Our sampling is based on a hash scheme. We pick a convenient hash function that maps the rowname space into a very large keyspace (e.g., a ax+b mod p function, where p is as large as 2128). The rows falling into the first fp keys where f is the relative sample size (it is a fraction), would belong in the sample. Formally, we pick a hash function h : R −> 0..p and if h(r) E [0, fp−1], then add r to the sample. It is easy to see that the expected size of the sample is f * 100% of the Bigtable rowcount independent of the rowcount, and the probability that a particular row r is in the sample is f, as desired. This hash-based sampling method supports maintenance of the sample with each Bigtable mutation (insert, update, or deletion). Only the system may forward relevant mutations from the Bigtable to the Minitable. Otherwise, the latter would behave as just any other Bigtable: it could be backed up and even be replicated. We are currently deploying Minitables in the repository of documents that the crawling system generates. Several Minitables, each with a different sample factor, allow that system to compute aggregates much faster and more often.
Uniform random sampling is quite useful but some scenarios require biased sampling methods. We are currently working on one such extension that we call Mask Sampling. In this scheme, the decision to select a row to the sample is still based on its rowname but now a user may specify a mask m over it. The mask, which can be a regular expression that matches portions of a rowname, is used to group rows together. Two rows belong to a same group if their masks result in the same string. Mask sampling guarantees that if a group is selected to the sample, that group will be adequately represented there, regardless of that group's relative size.
Clearly minitables can be constructed on the fly by a coprocessor attached to the source table.