Details
New Feature
Status: Resolved
Critical
Resolution: Fixed
Impala 2.8.0
Description
Implement the TABLESAMPLE clause, which can be applied to base table references in queries as well as in the COMPUTE STATS statement.
Examples:
SELECT * FROM T TABLESAMPLE SYSTEM(10)
COMPUTE STATS T TABLESAMPLE SYSTEM(20)
Syntax inspired by SQL Server:
https://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx
<tableref> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]
Implementation details
- The given percentage refers to the percent of bytes in the table.
- The sampling will be coarse-grained (file level).
- Impala will randomly select files until the desired percentage of bytes has been reached.
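The file-level selection described above can be sketched roughly as follows. This is a minimal illustration in Python, not Impala's actual code: the function name sample_files, the (path, size_bytes) tuple layout, and the use of a seed to stand in for REPEATABLE are all assumptions made for the example.

```python
import random

def sample_files(files, percent, seed=None):
    """Pick whole files at random until `percent` of the total bytes is reached.

    files:   list of (path, size_bytes) tuples (hypothetical layout)
    percent: desired sample size as a percentage of total bytes, 0..100
    seed:    optional seed; the same seed yields the same sample
             (assumed here to mimic the REPEATABLE clause)
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")

    total_bytes = sum(size for _, size in files)
    target_bytes = total_bytes * percent / 100.0

    # Shuffle a copy so the caller's list is left untouched.
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)

    selected = []
    selected_bytes = 0
    for path, size in shuffled:
        if selected_bytes >= target_bytes:
            break
        selected.append((path, size))
        selected_bytes += size
    return selected
```

Because whole files are selected, the sample can overshoot the requested percentage by up to the size of the last file picked, which is one reason the sampling is described as coarse-grained.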
Accepted limitations
- Computing stats on a coarse-grained sample necessarily means a loss of precision, with no guarantee of statistical significance.
- There is no guarantee that a sample covers all partitions.
- NDVs may be very inaccurate for sorted files.
- NDVs may be very inaccurate for an unfortunate selection of files.
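To make the sorted-file limitation concrete, here is a toy Python illustration. It does not reflect how Impala estimates NDVs; it only counts distinct values seen in a file-level sample. The data layout (1000 rows, 100 distinct values, 10 files of 100 rows) and the helper names are invented for the example.

```python
import random

def split_into_files(values, rows_per_file):
    """Chop a column into consecutive fixed-size 'files'."""
    return [values[i:i + rows_per_file]
            for i in range(0, len(values), rows_per_file)]

def sampled_ndv(files, num_files, seed):
    """Count distinct values seen in a random selection of whole files."""
    rng = random.Random(seed)
    chosen = rng.sample(files, num_files)
    return len({value for f in chosen for value in f})

# A column with 100 distinct values, each repeated 10 times, in sorted order.
column = [i // 10 for i in range(1000)]
true_ndv = len(set(column))

# Sorted layout: each file holds a disjoint range of 10 distinct values.
sorted_files = split_into_files(column, 100)

# Shuffled layout: values are spread across all files.
shuffled = list(column)
random.Random(0).shuffle(shuffled)
shuffled_files = split_into_files(shuffled, 100)

# Sample 2 of 10 files, i.e. roughly TABLESAMPLE SYSTEM(20).
ndv_sorted = sampled_ndv(sorted_files, 2, seed=42)
ndv_shuffled = sampled_ndv(shuffled_files, 2, seed=42)

print("true NDV:", true_ndv)
print("NDV seen in sample, sorted files:", ndv_sorted)
print("NDV seen in sample, shuffled files:", ndv_shuffled)
```

In the sorted layout any two files together contain only 20 of the 100 distinct values, whereas the same sample size over shuffled data sees far more of them, so the quality of an NDV derived from a file-level sample depends heavily on how the data is laid out.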