Spark / SPARK-10578

pyspark.ml.classification.RandomForestClassifier does not return `rawPrediction` column

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.4.1
    • Fix Version/s: 1.5.0
    • Component/s: ML
    • Labels: None
    • Environment: CentOS, PySpark 1.4.1, Scala 2.10

    • Flags: Important

    Description

      To use `pyspark.ml.classification.RandomForestClassifier` with `BinaryClassificationEvaluator`, the `RandomForestClassifier` must return a column called `rawPrediction`.
      The PySpark documentation example for `LogisticRegression` outputs the `rawPrediction` column, but the example for `RandomForestClassifier` does not.

      As a result, `RandomForestClassifier` cannot be used with the evaluator or placed in a pipeline with cross-validation.
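
A minimal sketch of the mismatch described above, in plain Python so it runs without a Spark installation. `missing_columns` is a hypothetical helper, not part of PySpark, and the two column lists are illustrative assumptions modeled on this report rather than captured output. The default required names mirror the evaluator's default `labelCol` and `rawPredictionCol`.

```python
# Hypothetical helper (not part of PySpark): report which columns the
# evaluator needs but a model's transformed output does not provide.

def missing_columns(output_columns, required=("label", "rawPrediction")):
    """Return the required column names absent from output_columns, in order."""
    present = set(output_columns)
    return [name for name in required if name not in present]


# Illustrative column lists (assumed for this sketch, not captured output).
logistic_regression_cols = ["label", "features", "rawPrediction", "probability", "prediction"]
random_forest_cols = ["label", "features", "prediction"]

print(missing_columns(logistic_regression_cols))  # []
print(missing_columns(random_forest_cols))        # ['rawPrediction']
```

With a real model, the same check could be applied to `model.transform(df).columns` before handing the result to `BinaryClassificationEvaluator`.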

      A relevant piece of code showing how to reproduce the bug can be found at:
      https://gist.github.com/karenyyng/cf61ae655b032f754bfb

      A related post about this possible bug can be found at:
      http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-td23791.html

            People

              Assignee: Joseph K. Bradley (josephkb)
              Reporter: Karen Y. Ng (karenyng)
              Votes: 0
              Watchers: 3

                Time Tracking

                  Original Estimate: 24h
                  Remaining Estimate: 24h
                  Time Spent: Not Specified