[IMPALA-5300] Implement TABLESAMPLE - ASF JIRA

Implement the TABLESAMPLE clause, which can be applied to base table references in queries as well as in the COMPUTE STATS statement.

Examples:

SELECT * FROM T TABLESAMPLE SYSTEM(10)
COMPUTE STATS T TABLESAMPLE SYSTEM(20)

Syntax:

<tableref> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]

Implementation details

The given percentage is a percentage of the bytes in the table.
The sampling is coarse-grained: whole files are selected, not individual rows.
Impala randomly selects files until the desired percentage of bytes has been reached.
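
The file-level selection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Impala's actual code: the helper name sample_files, the (name, size) input layout, and the seed argument (standing in for REPEATABLE) are all assumptions made for the example. The issue also does not say whether selection stops at or just past the target, so the sketch simply stops once the target is met.

```python
import random


def sample_files(files, percent, seed=None):
    """Pick whole files at random until their bytes reach `percent` of the table.

    `files` is a list of (name, size_in_bytes) pairs. Hypothetical helper,
    written for illustration only.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")

    total_bytes = sum(size for _, size in files)
    target_bytes = total_bytes * percent / 100.0

    # A fixed seed makes the shuffle, and so the sample, repeatable.
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)

    picked, picked_bytes = [], 0
    for name, size in shuffled:
        if picked_bytes >= target_bytes:
            break
        picked.append((name, size))
        picked_bytes += size
    return picked, picked_bytes


if __name__ == "__main__":
    table = [("part-%02d" % i, 100) for i in range(10)]
    chosen, nbytes = sample_files(table, 20, seed=42)
    print(chosen, nbytes)
```

Because files are taken whole, the sampled byte count can overshoot the requested percentage when file sizes are uneven; with very few or very large files the overshoot can be substantial.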

Accepted limitations

Computing stats on a coarse-grained sample necessarily means a loss of precision, with no guarantee of statistical significance.
There is no guarantee that a sample covers all partitions.
NDVs may be very inaccurate for sorted files.
NDVs may be very inaccurate for an unfortunate selection of files.
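
One way to see why sorted files hurt NDV estimates is the toy example below. It is a sketch under made-up assumptions (10 files of 1,000 rows each, one column with 100 distinct values, hypothetical helper names), and it counts distinct values exactly rather than using whatever estimator Impala applies.

```python
def build_files(values, rows_per_file):
    """Split a column into consecutive fixed-size 'files' (hypothetical layout)."""
    return [values[i:i + rows_per_file]
            for i in range(0, len(values), rows_per_file)]


def sampled_ndv(files, chosen):
    """Exact number of distinct values seen in the chosen file indexes."""
    return len({v for idx in chosen for v in files[idx]})


# 10,000 rows, 100 distinct values, 10 files of 1,000 rows each.
rows = [i % 100 for i in range(10_000)]

interleaved = build_files(rows, 1000)     # every file holds all 100 values
sorted_files = build_files(sorted(rows), 1000)  # every file holds only 10 values

# A 10% sample is a single file in this toy setup.
print(sampled_ndv(interleaved, [0]))
print(sampled_ndv(sorted_files, [0]))
```

With the data sorted, any single file contains only a tenth of the distinct values, so a 10% file-level sample sees an NDV of 10 where the true value is 100. With the values interleaved, the same sample sees all 100.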

Sub-Tasks

#   Summary                                                                Status    Assignee
1.  Implement TABLESAMPLE for HDFS tables                                  Resolved  Alexander Behm
2.  Implement TABLESAMPLE for COMPUTE STATS                                Resolved  Alexander Behm
3.  Add minimum sample size for COMPUTE STATS TABLESAMPLE                  Resolved  Alexander Behm
4.  More flexible configuration of stats extrapolation                     Resolved  Alexander Behm
5.  Doc: TABLESAMPLE for COMPUTE STATS                                     Closed    Alexander Behm
6.  Impala Doc: Doc the minimum sample size for COMPUTE STATS TABLESAMPLE  Closed    Alexandra Rodoni