SPARK-14606

Different maxBins value for categorical and continuous features in RandomForest implementation.

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML, MLlib

    Description

      Currently, the RandomForest algorithm takes a single maxBins value that determines the number of candidate splits per feature. Training time can grow sharply when a single categorical column has a large number of unique values, because maxBins must be raised to cover that column. The one column then affects every numeric (continuous) column, even though the continuous columns do not need that many splits.
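      A rough way to see the cost is to count the histogram bins tracked per tree node. The sketch below is a simplified model, not Spark's actual code: it assumes each continuous feature gets up to `max_bins` bins (capped by the row count), each categorical feature gets one bin per category, and `max_bins` must be at least the largest category count. The function name and the example numbers are illustrative only.

```python
def bins_per_node(num_continuous, categorical_arities, max_bins, num_rows):
    """Simplified count of histogram bins per tree node with one shared max_bins."""
    largest = max(categorical_arities, default=0)
    if max_bins < largest:
        # A single shared setting has to cover the widest categorical feature.
        raise ValueError(
            f"max_bins={max_bins} is below the largest category count {largest}"
        )
    # Assumption: every continuous feature is discretized into up to max_bins bins.
    continuous_bins = num_continuous * min(max_bins, num_rows)
    # Assumption: every categorical feature uses one bin per distinct value.
    categorical_bins = sum(categorical_arities)
    return continuous_bins + categorical_bins


# 20 continuous features only, with a modest max_bins.
baseline = bins_per_node(20, [], max_bins=32, num_rows=1_000_000)

# Same data plus one categorical column with 1000 distinct values,
# which forces max_bins up to 1000 for every feature.
with_wide_column = bins_per_node(20, [1000], max_bins=1000, num_rows=1_000_000)

print(baseline, with_wide_column)  # prints "640 21000"
```

      Under these assumptions, nearly all of the extra bins come from the continuous features, not from the categorical column that triggered the change.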

      Encoding the categorical column into separate features instead makes the data very wide, which requires increasing maxMemoryInMB and puts more pressure on the GC.

      Keeping separate maxBins values for categorical and continuous features would help in this regard.
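      To make the proposal concrete, here is a minimal sketch of a per-node bin count with two limits instead of one. `max_bins_continuous` and `max_bins_categorical` are hypothetical parameter names, not part of Spark's RandomForest API, and the bin-count model is a simplification (continuous features use up to their limit, categorical features use one bin per category).

```python
def bins_with_separate_limits(num_continuous, categorical_arities,
                              max_bins_continuous, max_bins_categorical,
                              num_rows):
    """Simplified per-node bin count when the two feature kinds have separate limits."""
    largest = max(categorical_arities, default=0)
    if max_bins_categorical < largest:
        # Only the categorical limit has to cover the widest categorical feature.
        raise ValueError(
            f"max_bins_categorical={max_bins_categorical} is below "
            f"the largest category count {largest}"
        )
    # Continuous features are bounded by their own, smaller limit.
    continuous_bins = num_continuous * min(max_bins_continuous, num_rows)
    return continuous_bins + sum(categorical_arities)


# 20 continuous features and one categorical column with 1000 distinct values.
# Setting both limits to 1000 mimics today's single shared maxBins.
shared = bins_with_separate_limits(20, [1000], 1000, 1000, num_rows=1_000_000)

# With separate limits, the continuous features can stay at 32 bins.
separate = bins_with_separate_limits(20, [1000], 32, 1000, num_rows=1_000_000)

print(shared, separate, round(shared / separate, 1))  # prints "21000 1640 12.8"
```

      In this toy setup the separate limits cut the bin count by roughly a factor of 13. The real effect on training time and memory would depend on how Spark aggregates statistics and on the data itself.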

          People

            Assignee: Unassigned
            Reporter: Rahul Tanwani (tanwanirahul)
            Votes: 0
            Watchers: 3
