Log workAgile BoardRank to TopRank to BottomBulk Copy AttachmentsBulk Move AttachmentsAdd voteVotersWatch issueWatchersConvert to IssueMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • SQL
    • None

    Description

      Hive's table sampling implementation has 2 major issues:
      1. It makes assumptions about file layout. The main sampling approach requires tables to be bucketed. Bucketed tables are becoming less common as the need to bucket has reduced over time so many users cannot benefit from sampling.
      2. The syntax is non standard.

      SQL standard defines a TABLESAMPLE operator that requires a sampling method to be supplied and a probability p. The number of output records is approximately N * p/100, where N is the number of records in the table. There are two sampling methods defined: BERNOULLI and SYSTEM.

      With the BERNOULLI sampling method, each record is evaluated as an independent Bernoulli trial.
      The SYSTEM sampling method only controls the size of the output set, there is no independence guarantee between rows. It's common for SYSTEM sampling to be done at a block or page level, and if a block is selected, all records from the block are returned. Hive's current sampling methods are effectively types of SYSTEM sampling.

      The standard also allows you to seed the PRNG used for trials using the REPEATABLE clause. The same input table with same p value and same repeatable value produces the same output.

      Some examples:
      select * from t tablesample bernoulli ( 50 );
      select * from t tablesample system ( 30 ) repeatable (1234);

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned Assign to me
            cartershanklin Carter Shanklin

            Dates

              Created:
              Updated:

              Slack

                Issue deployment