Details
- Improvement
- Status: Resolved
- Minor
- Resolution: Incomplete
- 2.0.0
- None
Description
It would be great to have Spark support deterministic sampling too, as in this Hive example:
set hive.sample.seednumber=12345;
SELECT *
FROM table_a TABLESAMPLE(BUCKET 17 OUT OF 25 ON individual_id);
Note that the sampling is based on hash(individual_id).
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
In this case the sampling is deterministic. When new data is loaded, we get very stable samples, and we use this approach all the time in Hive.
The only reason for the "BUCKET x OUT OF y" syntax in Hive is: "If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table."
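To illustrate why hash-based bucket sampling stays stable across data loads, here is a minimal Python sketch. It is not Hive's or Spark's implementation: zlib.crc32 is a stand-in for the engine's hash function, and the helper names in_bucket and sample, as well as the generated ids, are hypothetical. It assumes buckets are numbered from 1, as in the Hive syntax above.

```python
import zlib


def in_bucket(key: str, bucket: int, num_buckets: int) -> bool:
    """Return True if `key` hashes into the 1-based `bucket` out of `num_buckets`."""
    # Assumption: crc32 stands in for the engine's hash; Hive uses its own function.
    h = zlib.crc32(key.encode("utf-8"))
    return h % num_buckets == bucket - 1


def sample(ids, bucket=17, num_buckets=25):
    """Keep only the ids that fall into the requested bucket."""
    return [i for i in ids if in_bucket(i, bucket, num_buckets)]


# Hypothetical data: a first load, then the same ids plus newly loaded ones.
first_load = [f"id-{n}" for n in range(1000)]
second_load = first_load + [f"id-{n}" for n in range(1000, 2000)]

s1 = sample(first_load)
s2 = sample(second_load)

# Membership depends only on the id, so ids sampled from the first load
# are still sampled after the second load; new ids can only add members.
print(set(s1) <= set(s2))
```

Because membership is a pure function of individual_id, rerunning the sample after a new load never drops a previously sampled id, which is the stability property described above.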
Issue Links
- relates to SPARK-13263 SQL generation support for tablesample (Resolved)