Spark / SPARK-17870

ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: ML, MLlib
    • Labels: None

      Description

      The method used to compute the chi-square test results in mllib/feature/ChiSqSelector.scala (line 233) is wrong.

      The feature selection method ChiSquareSelector selects features based on ChiSqTestResult.statistic (the chi-square value): it selects the features with the largest chi-square values. But the degrees of freedom (DoF) behind the chi-square values from Statistics.chiSqTest(RDD) differ from feature to feature, and with different DoF you cannot rank features on the raw chi-square value.
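
      As a quick illustration (a sketch of mine, not part of the original report), the same chi-square statistic corresponds to very different p-values under different DoF:

      ```python
      from scipy.stats import chi2

      # The same statistic is highly significant with few degrees of freedom
      # but unremarkable with many (e.g. a feature with many distinct values).
      print(chi2.sf(10.0, df=2))   # ~0.0067: significant
      print(chi2.sf(10.0, df=12))  # ~0.62:   not significant
      ```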

      Because of this wrong way of computing the chi-square values, the feature selection results are strange.
      Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
      With selectKBest, feature 3 is selected.
      With selectFpr, features 1 and 2 are selected.
      This is strange.

      I used scikit-learn to test the same data with the same parameters.
      With SelectKBest, feature 1 is selected.
      With SelectFpr, features 1 and 2 are selected.
      This result makes sense, because in scikit-learn the DoF of each feature is the same.
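
      A minimal sketch of that scikit-learn check (my reconstruction; the k and alpha values are assumptions, since they are not stated above):

      ```python
      import numpy as np
      from sklearn.feature_selection import SelectKBest, SelectFpr, chi2

      # The ml/feature/ChiSqSelectorSuite data: 4 samples, 3 features, labels 0/1/1/2.
      X = np.array([[8.0, 7.0, 0.0],
                    [0.0, 9.0, 6.0],
                    [0.0, 9.0, 8.0],
                    [8.0, 9.0, 5.0]])
      y = np.array([0, 1, 1, 2])

      scores, pvalues = chi2(X, y)   # one score per feature, all with equal DoF
      print(scores)
      print(SelectKBest(chi2, k=1).fit(X, y).get_support())
      print(SelectFpr(chi2, alpha=0.05).fit(X, y).get_support())
      ```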

      I plan to submit a PR for this problem.


          Activity

          srowen Sean Owen added a comment -

          Oof, I'm pretty certain you're correct. You can rank on the p-value (which is a function of DoF) but not the raw statistic. It's an easy change at least because this is already computed. Can't believe I missed that.

          peng.meng@intel.com Peng Meng added a comment -

          Hi Sean Owen, thanks very much for your quick reply.
          Yes, the p-value is better than the raw statistic in this case, because the p-value is computed from both the DoF and the raw statistic.
          The raw statistic is also popular for feature selection: SelectKBest and SelectPercentile in scikit-learn are based on the raw statistic.
          The question here is that we should use the same DoF, as scikit-learn does, when computing the chi-square values.
          For this JIRA, I propose to change the method of computing the chi-square values to match what scikit-learn does (i.e. change Statistics.chiSqTest(RDD)).

          Thanks very much.

          srowen Sean Owen added a comment -

          I don't think the raw statistic can be directly compared here, because the features do not necessarily have even nearly the same number of 'buckets'. A given test statistic value is "less remarkable" when there are more DoF; what's high for a binary-valued feature may not be high at all for one taking on 100 values.

          Does scikit really use the statistic? Because you're also saying it does something that gives different results from ranking on the statistic.

          peng.meng@intel.com Peng Meng added a comment -

          Yes, SelectKBest and SelectPercentile in scikit-learn use only the statistic.
          Because the method of computing the chi-square values is different, the DoF of all features in scikit-learn is the same, so it can do that.

          The chi-square values are computed as follows. Suppose we have data

          X = [8 7 0
               0 9 6
               0 9 8
               8 9 5]

          y = [0 1 1 2]'

          (this is the test suite data of ml/feature/ChiSqSelectorSuite.scala). scikit-learn computes the chi-square values like this. First, one-hot encode the labels:

          Y = [1 0 0
               0 1 0
               0 1 0
               0 0 1]

          observed = Y' * X =
              [8  7  0
               0 18 14
               8  9  5]

          expected =
              [4 8.5 4.75
               8 17  9.5
               4 8.5 4.75]

          Then _chisquare(observed, expected) computes the chi-square values of all features at once, and the DoF of every feature is the same.

          But Spark's Statistics.chiSqTest(RDD) uses another method: for each feature it constructs a contingency table, so the DoF differs from feature to feature.

          As for "gives different results from ranking on the statistic": this is because the parameters are different. For the previous example, SelectKBest(2) selects the same features as SelectFpr(0.2) in scikit-learn.
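
          A NumPy sketch of the observed/expected computation described above (my reconstruction, not scikit-learn's actual code):

          ```python
          import numpy as np
          from scipy.stats import chi2 as chi2_dist

          X = np.array([[8.0, 7.0, 0.0],
                        [0.0, 9.0, 6.0],
                        [0.0, 9.0, 8.0],
                        [8.0, 9.0, 5.0]])
          y = np.array([0, 1, 1, 2])

          Y = np.eye(3)[y]                       # one-hot encoded labels
          observed = Y.T @ X                     # per-class sum of each feature
          expected = np.outer(Y.mean(axis=0), X.sum(axis=0))

          # Pearson chi-square per feature (per column); every feature shares
          # the same DoF: (number of classes - 1) = 2.
          stats = ((observed - expected) ** 2 / expected).sum(axis=0)
          pvals = chi2_dist.sf(stats, df=Y.shape[1] - 1)
          print(stats, pvals)
          ```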

          srowen Sean Owen added a comment -

          I don't quite understand this example; can you point me to the source? The chi-squared statistic is indeed a function of observed and expected counts, but I'd expect those to be a vector of counts, one for each class. If you're saying that each row contains observed counts for one feature's classes, then yes, in this particular construction each of them has the same number of classes (columns). But that isn't generally true; that can't be an assumption scikit makes? I bet I'm missing something.

          peng.meng@intel.com Peng Meng added a comment -

          The scikit-learn code is here: https://github.com/scikit-learn/scikit-learn/blob/412996f09b6756752dfd3736c306d46fca8f1aa1/sklearn/feature_selection/univariate_selection.py, line 422 for SelectKBest; the chi-square computation is on the same page.

          For the last example, each row of X is a sample containing three features; there are four samples in total. Y is the label.
          Thanks very much.

          peng.meng@intel.com Peng Meng added a comment -

          https://github.com/apache/spark/pull/1484#issuecomment-51024568
          Hi Xiangrui Meng and Alexander Ulanov, what do you think about this JIRA?
          srowen Sean Owen added a comment -

          OK I get it, they're doing different things really. The scikit version is computing the statistic for count-valued features vs categorical label, and the Spark version is computing this for categorical features vs categorical labels. Although the number of label classes is constant in both cases, the Spark computation would depend on the number of feature classes too. Yes, it does need to be changed in Spark.
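
          To make the contrast concrete (a sketch of mine, not Spark's actual code): the Spark-style computation builds one contingency table per feature, so the DoF grows with the number of distinct feature values:

          ```python
          import numpy as np
          from scipy.stats import chi2_contingency

          X = np.array([[8.0, 7.0, 0.0],
                        [0.0, 9.0, 6.0],
                        [0.0, 9.0, 8.0],
                        [8.0, 9.0, 5.0]])
          y = np.array([0, 1, 1, 2])

          for j in range(X.shape[1]):
              values, labels = np.unique(X[:, j]), np.unique(y)
              # Rows = distinct values of feature j, columns = label classes,
              # cells = number of samples with that (value, label) pair.
              table = np.array([[np.sum((X[:, j] == v) & (y == c)) for c in labels]
                                for v in values])
              stat, p, dof, _ = chi2_contingency(table)
              print(j, stat, p, dof)  # dof = (#values - 1) * (#labels - 1), differs per feature
          ```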

          avulanov Alexander Ulanov added a comment -

          [`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) works with "a Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores". According to what you observe, it uses the p-values for sorting the `chi2` outputs. Indeed, that is the case for all functions that return two arrays: https://github.com/scikit-learn/scikit-learn/blob/412996f/sklearn/feature_selection/univariate_selection.py#L331. Alternatively, one can use raw `chi2` scores for sorting; one only needs to pass the first array from `chi2` to `SelectKBest`. As far as I remember, using raw chi2 scores is the default in Weka's [ChiSquaredAttributeEval](http://weka.sourceforge.net/doc.stable/weka/attributeSelection/ChiSquaredAttributeEval.html). So I would not claim that either of the approaches is incorrect. According to [Introduction to IR](http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html), there might be an issue with computing p-values, because the chi2 test is then used multiple times. Using plain chi2 values does not involve a statistical test, so it might be treated as just a ranking with no statistical implications.

          srowen Sean Owen added a comment -

          If the degrees of freedom are the same across the tests, then ranking on p-value or statistic should give the same ranking because the p-value is a monotonically decreasing function of the statistic. That's the case in what the scikit code is effectively doing because there are always (# label classes - 1) degrees of freedom. Really the p-value is the comparable quantity, but there's no point computing it in this case because it's just for ranking.

          The Spark code performs a chi-squared test but applies it to answer a different question, where the DoF is no longer the same; it's (# label classes - 1) * (# feature classes - 1) in the contingency table here. The p-value is no longer always smaller when the statistic is larger. So it's necessary to actually use the p-values for what Spark is doing.
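
          A small check of that monotonicity argument (my sketch, not from the discussion):

          ```python
          from scipy.stats import chi2

          # Within one DoF, the p-value is monotone decreasing in the statistic,
          # so ranking by statistic or by p-value gives the same order:
          print(chi2.sf(3.0, df=2) > chi2.sf(5.0, df=2))   # True

          # Across DoF, a larger statistic can be LESS significant, so only
          # the p-value gives a meaningful ranking:
          print(chi2.sf(8.0, df=6) > chi2.sf(5.0, df=2))   # True: 8.0 with df=6 is less significant
          ```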

          peng.meng@intel.com Peng Meng added a comment -

          Hi Alexander Ulanov, the question here is not whether to use raw chi2 scores or p-values; the question is that if raw chi2 scores are used, the DoF should be the same.
          "The chi2 test is used multiple times" is another problem. According to http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html, "whenever a statistical test is used multiple times, then the probability of getting at least one error increases." This problem is partially addressed by selecting the p-values corresponding to the family-wise error rate (SelectFwe, SPARK-17645). Thanks very much.

          Hi Sean Owen, I totally agree with your comments. Since the DoF differs across Spark's chi-square values, we can use the p-values for Spark's SelectKBest and SelectPercentile. Thanks very much.

          I will submit a PR for this.

          apachespark Apache Spark added a comment -

          User 'mpjlu' has created a pull request for this issue:
          https://github.com/apache/spark/pull/15444

          srowen Sean Owen added a comment -

          Issue resolved by pull request 15444
          https://github.com/apache/spark/pull/15444


            People

            • Assignee:
              peng.meng@intel.com Peng Meng
            • Reporter:
              peng.meng@intel.com Peng Meng
            • Votes:
              0
            • Watchers:
              3
