  Spark / SPARK-18678

Skewed reservoir sampling in SamplingUtils

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2
    • Fix Version/s: 2.1.0
    • Component/s: ML
    • Labels: None

    Description

      Feature subsampling in the RandomForest implementation in
      org.apache.spark.ml.tree.impl.RandomForest
      is done with SamplingUtils.reservoirSampleAndCount.

      The sampling implementation skews feature selection in favor of features with a higher index.
      The skew is small when the number of features is large, but it completely dominates feature selection when the number of features is small. The extreme case is 2 features with 1 feature to select.

      In this case the feature sampling always picks feature 1 and never picks feature 0.
      As a result, models trained with feature subsampling on a small number of features are of low quality.
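
      The symptom described above can be illustrated with a small standalone sketch. It does not call or reproduce Spark's SamplingUtils code. `uniformSample` is textbook reservoir sampling (Algorithm R). `skewedSample` is a hypothetical off-by-one variant, assumed here only because it shows the same behavior as reported (feature 1 is always chosen when selecting 1 of 2); the report itself does not state the cause. All names in the block are illustrative.

```scala
import scala.util.Random

object ReservoirSketch {
  // Textbook reservoir sampling (Algorithm R) over the indices 0 until n:
  // keep the first k indices, then let index i replace a random slot
  // with probability k / (i + 1).
  def uniformSample(n: Int, k: Int, rng: Random): Array[Int] = {
    require(k >= 1, "k must be at least 1")
    val reservoir = Array.range(0, math.min(k, n))
    var i = k
    while (i < n) {
      val j = rng.nextInt(i + 1) // uniform in [0, i]
      if (j < k) reservoir(j) = i
      i += 1
    }
    reservoir
  }

  // HYPOTHETICAL off-by-one variant, NOT taken from the Spark source:
  // the random index is drawn from [0, i) instead of [0, i], so the first
  // index after the reservoir fills (i == k) always replaces an earlier one.
  def skewedSample(n: Int, k: Int, rng: Random): Array[Int] = {
    require(k >= 1, "k must be at least 1")
    val reservoir = Array.range(0, math.min(k, n))
    var i = k
    while (i < n) {
      val j = rng.nextInt(i) // uniform in [0, i): one value too few
      if (j < k) reservoir(j) = i
      i += 1
    }
    reservoir
  }

  // Count how often each feature index is selected over many trials.
  def selectionCounts(
      sample: (Int, Int, Random) => Array[Int],
      n: Int,
      k: Int,
      trials: Int,
      seed: Long): Array[Int] = {
    val rng = new Random(seed)
    val counts = new Array[Int](n)
    for (_ <- 0 until trials; idx <- sample(n, k, rng)) counts(idx) += 1
    counts
  }
}
```

      Calling `ReservoirSketch.selectionCounts(ReservoirSketch.skewedSample, 2, 1, 10000, 42L)` counts feature 1 in every trial and feature 0 in none, while the same call with `uniformSample` splits the trials roughly evenly between the two features. A frequency count of this kind is one way to check any sampler for index skew.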

          People

            Assignee: Sean R. Owen (srowen)
            Reporter: Bjoern Toldbod (BToldbod)