Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-4500

Improve exact stratified sampling implementation

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 1.2.0
    • None
    • MLlib

    Description

      The current implementation for exact stratified sampling (sampleByKeyExact) could be more efficient. Proposed algorithm sketch:

      • Sampling is done separately for each stratum. Here, all counts are w.r.t. a fixed stratum.
      • Let n_partition = number of elements in a given partition
      • This method uses 2-tiered sampling:
        • Tier 1 (on driver): Select the sample sizes {n_partition}

          for each partition.

          • Without replacement, this uses a multivariate hypergeometric distribution.
          • With replacement, this should use a multinomial distribution.
        • Tier 2 (in parallel): Select a sample of size n_partition.

      If anyone is interested in implementing this, I have a rough draft which works without replacement, but it needs to be cleaned up and augmented to do sampling with replacement too.

      Attachments

        Activity

          People

            Unassigned Unassigned
            josephkb Joseph K. Bradley
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: