• Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • SQL
    • None


      Hive's table sampling implementation has 2 major issues:
      1. It makes assumptions about file layout. The main sampling approach requires tables to be bucketed. Bucketed tables are becoming less common as the need to bucket has reduced over time so many users cannot benefit from sampling.
      2. The syntax is non standard.

      SQL standard defines a TABLESAMPLE operator that requires a sampling method to be supplied and a probability p. The number of output records is approximately N * p/100, where N is the number of records in the table. There are two sampling methods defined: BERNOULLI and SYSTEM.

      With the BERNOULLI sampling method, each record is evaluated as an independent Bernoulli trial.
      The SYSTEM sampling method only controls the size of the output set, there is no independence guarantee between rows. It's common for SYSTEM sampling to be done at a block or page level, and if a block is selected, all records from the block are returned. Hive's current sampling methods are effectively types of SYSTEM sampling.

      The standard also allows you to seed the PRNG used for trials using the REPEATABLE clause. The same input table with same p value and same repeatable value produces the same output.

      Some examples:
      select * from t tablesample bernoulli ( 50 );
      select * from t tablesample system ( 30 ) repeatable (1234);




            Unassigned Unassigned
            cartershanklin Carter Shanklin
            0 Vote for this issue
            2 Start watching this issue