Spark / SPARK-14166

Add deterministic sampling like in Hive

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      It would be great if Spark supported deterministic sampling too, as Hive does:

      set hive.sample.seednumber=12345;
      SELECT *
      FROM table_a TABLESAMPLE(BUCKET 17 OUT OF 25 ON individual_id);

      Note that the sampling is based on hash(individual_id).

      https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

      In this case the sampling is deterministic. When new data is loaded, the samples stay very stable, and we rely on this all the time in Hive.
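
      The stability described above can be illustrated with a minimal Python sketch. It is not Hive's or Spark's implementation: `bucket_of`, `sample`, the SHA-256 hash, and the way the seed is mixed in are all assumptions made for illustration, and the seed handling is not claimed to match how Hive applies hive.sample.seednumber. The point is only that bucket membership depends on the key alone, so rows that were sampled before a new load are still sampled after it.

```python
import hashlib


def bucket_of(key, num_buckets, seed=12345):
    """Map a key to a 1-based bucket with a seeded, process-independent hash.

    hashlib is used instead of the built-in hash(), whose result for strings
    changes between Python processes. (Hypothetical helper, not a Hive API.)
    """
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets + 1


def sample(rows, key_field, bucket, num_buckets, seed=12345):
    """Keep the rows whose key hashes into the requested bucket."""
    return [r for r in rows
            if bucket_of(r[key_field], num_buckets, seed) == bucket]


# Simulate an initial load followed by a later load that appends new rows.
first_load = [{"individual_id": i} for i in range(1000)]
second_load = first_load + [{"individual_id": i} for i in range(1000, 1500)]

# Analogue of TABLESAMPLE(BUCKET 17 OUT OF 25 ON individual_id).
before = sample(first_load, "individual_id", 17, 25)
after = sample(second_load, "individual_id", 17, 25)

# Every row sampled before the new load is still sampled afterwards.
assert all(r in after for r in before)
```

      Because no random number generator is involved, rerunning the query on the same data returns the same rows, and new loads can only add rows to the sample.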

      The only reason for the "BUCKET x OUT OF y" syntax in Hive is this: "If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table."
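
      Why matching the CLUSTERED BY columns lets the scan skip hash-partitions can be sketched as follows. This assumes that a row is stored in cluster (hash mod number of clusters) and sampled into bucket (hash mod y) using the same hash; `clusters_to_scan` is a hypothetical helper, not part of Hive or Spark.

```python
def clusters_to_scan(x, y, num_clusters):
    """Return the 1-based stored clusters that can hold rows of bucket x out of y.

    Assumes cluster = hash % num_clusters + 1 and bucket = hash % y + 1,
    both computed from the same hash of the same column(s).
    """
    if num_clusters % y == 0:
        # Each sample bucket is made of whole clusters: x, x + y, x + 2y, ...
        return list(range(x, num_clusters + 1, y))
    if y % num_clusters == 0:
        # Each sample bucket is a fraction of exactly one cluster.
        return [(x - 1) % num_clusters + 1]
    # No simple alignment: conservatively scan every cluster.
    return list(range(1, num_clusters + 1))


# A table stored in 32 clusters:
print(clusters_to_scan(3, 16, 32))  # [3, 19]
print(clusters_to_scan(3, 64, 32))  # [3]
```

      In both aligned cases only a small subset of the stored clusters has to be read, which is the optimization the quoted sentence refers to.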

            People

              Assignee: Unassigned
              Reporter: Ruslan Dautkhanov (Tagar)
              Votes: 3
              Watchers: 4
