[HIVE-16802] Standard Table Sampling - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

Hive's table sampling implementation has 2 major issues:
1. It makes assumptions about file layout. The main sampling approach requires tables to be bucketed. Bucketed tables are becoming less common as the need to bucket has reduced over time so many users cannot benefit from sampling.
2. The syntax is non standard.

SQL standard defines a TABLESAMPLE operator that requires a sampling method to be supplied and a probability p. The number of output records is approximately N * p/100, where N is the number of records in the table. There are two sampling methods defined: BERNOULLI and SYSTEM.

With the BERNOULLI sampling method, each record is evaluated as an independent Bernoulli trial.
The SYSTEM sampling method only controls the size of the output set, there is no independence guarantee between rows. It's common for SYSTEM sampling to be done at a block or page level, and if a block is selected, all records from the block are returned. Hive's current sampling methods are effectively types of SYSTEM sampling.

The standard also allows you to seed the PRNG used for trials using the REPEATABLE clause. The same input table with same p value and same repeatable value produces the same output.

Some examples:
select * from t tablesample bernoulli ( 50 );
select * from t tablesample system ( 30 ) repeatable (1234);

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Carter Shanklin

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 01/Jun/17 15:47

Updated:: 01/Jun/17 15:47