[SPARK-14166] Add deterministic sampling like in Hive - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.0.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed
- sample
- sql

Description

It would be great to have Spark support deterministic sampling too, as Hive does:

set hive.sample.seednumber=12345;
SELECT *
FROM table_a TABLESAMPLE(BUCKET 17 OUT OF 25 ON individual_id);

Notice that sampling is based on hash(individual_id).

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

In this case sampling is deterministic. When new data is loaded, the samples stay very stable, and we use this all the time in Hive.
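A minimal sketch of why a hash-based sample stays stable across data loads. This is an illustration in plain Python, not Hive's or Spark's implementation: `bucket_of` and `bucket_sample` are hypothetical helpers, MD5 stands in for whatever hash function Hive actually applies, and the 1-based "hash mod y, plus 1" bucket assignment is an assumption.

```python
import hashlib


def bucket_of(key, num_buckets):
    """Map a key to a 1-based bucket with a stable hash.

    Assumption: MD5 is a stand-in here; it is NOT the hash Hive uses.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets + 1


def bucket_sample(rows, key, bucket, num_buckets):
    """Keep only the rows whose key hashes into the requested bucket."""
    return [r for r in rows if bucket_of(r[key], num_buckets) == bucket]


# Hypothetical data: a first load, then the same rows plus newly loaded ones.
first_load = [{"individual_id": i} for i in range(1000)]
second_load = first_load + [{"individual_id": i} for i in range(1000, 2000)]

# Analogous to TABLESAMPLE(BUCKET 17 OUT OF 25 ON individual_id).
sample_1 = bucket_sample(first_load, "individual_id", 17, 25)
sample_2 = bucket_sample(second_load, "individual_id", 17, 25)

ids_1 = {r["individual_id"] for r in sample_1}
ids_2 = {r["individual_id"] for r in sample_2}

# Every individual sampled before the new load is still in the sample after it.
print(ids_1 <= ids_2)
```

Because membership depends only on the key's hash, an individual is either always in the sample or never in it, no matter how many other rows are loaded later.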

The only reason for the "BUCKET x OUT OF y" syntax in Hive is: "If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table."
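The optimization quoted above can be sketched as follows, under the assumption that a clustered table is stored as one group of rows per hash bucket. The helper names are hypothetical and MD5 again stands in for Hive's hash function. When the sampling column matches the clustering column, the sample is just the contents of one stored bucket, so the other buckets never need to be read.

```python
import hashlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    """Stable 1-based bucket assignment (MD5 is a stand-in, not Hive's hash)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets + 1


def cluster_into_buckets(rows, key, num_buckets):
    """Simulate CLUSTERED BY: store rows grouped by the hash bucket of `key`."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets


def sample_by_full_scan(rows, key, bucket, num_buckets):
    """Unclustered case: every row must be hashed and tested."""
    return [r for r in rows if bucket_of(r[key], num_buckets) == bucket]


def sample_by_bucket_lookup(buckets, bucket):
    """Clustered case: read only the one stored bucket."""
    return list(buckets.get(bucket, []))


# Hypothetical table contents.
rows = [{"individual_id": i} for i in range(5000)]
buckets = cluster_into_buckets(rows, "individual_id", 25)

full = sample_by_full_scan(rows, "individual_id", 17, 25)
pruned = sample_by_bucket_lookup(buckets, 17)

# Both paths return the same rows; the lookup touches only one bucket.
print(full == pruned)
```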

Issue Links

relates to

SPARK-13263 SQL generation support for tablesample

Resolved

People

Assignee: Unassigned

Reporter: Ruslan Dautkhanov

Votes: 3

Watchers: 4

Dates

Created: 25/Mar/16 20:37

Updated: 21/May/19 04:32

Resolved: 21/May/19 04:32