Implement the TABLESAMPLE clause, which can be used on base table references in queries as well as in the COMPUTE STATS statement.
The syntax is inspired by SQL Server. Semantics and caveats:
- The given percentage refers to the percent of bytes in the table.
- The sampling will be coarse-grained (file level).
- Impala will randomly select files until the desired percentage of bytes has been reached.
- Computing stats on a coarse-grained sample necessarily means a loss of precision, with no guarantee of statistical significance.
- There is no guarantee that a sample covers all partitions.
- NDVs (numbers of distinct values) may be very inaccurate for sorted files.
- NDVs may be very inaccurate for an unfortunate selection of files.
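
The file-level sampling described above can be illustrated with a minimal Python sketch. This is not Impala's implementation; the function name `sample_files`, the `(name, size_in_bytes)` tuple layout, and the `seed` parameter are all assumptions made for illustration only.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical representation of a table file: (file name, size in bytes).
FileDesc = Tuple[str, int]


def sample_files(files: List[FileDesc], percent: float,
                 seed: Optional[int] = None) -> List[FileDesc]:
    """Pick whole files at random until at least `percent` of the
    table's total bytes has been selected (illustrative sketch only)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")

    total_bytes = sum(size for _, size in files)
    target_bytes = total_bytes * percent / 100.0

    # A seeded generator makes the selection reproducible (an assumption
    # of this sketch, not something stated in the description above).
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)

    selected: List[FileDesc] = []
    selected_bytes = 0
    for name, size in shuffled:
        if selected_bytes >= target_bytes:
            break
        # Sampling is coarse-grained: a file is taken entirely or not at all.
        selected.append((name, size))
        selected_bytes += size
    return selected


if __name__ == "__main__":
    table_files = [("a", 100), ("b", 200), ("c", 300), ("d", 400)]
    print(sample_files(table_files, 50, seed=1))
```

Because whole files are selected, the sampled byte count in this sketch can overshoot the requested percentage, and nothing forces the chosen files to span every partition, which is consistent with the caveats listed above.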
| Sub-task | Status |
|---|---|
| Implement TABLESAMPLE for HDFS tables | Resolved |
| Implement TABLESAMPLE for COMPUTE STATS | Resolved |
| Add minimum sample size for COMPUTE STATS TABLESAMPLE | Resolved |
| More flexible configuration of stats extrapolation | Resolved |
| Doc: TABLESAMPLE for COMPUTE STATS | Closed |
| Impala Doc: Doc the minimum sample size for COMPUTE STATS TABLESAMPLE | Closed |