SPARK-14351: Optimize ImpurityAggregator for decision trees


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Component/s: ML

    Description

      RandomForest.binsToBestSplit currently takes a large amount of time. Based on some quick profiling, I believe a big chunk of this is spent in ImpurityAggregator.getCalculator (which seems to make unnecessary Array copies) and RandomForest.calculateImpurityStats.
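
To make the suspected copy concrete, here is a minimal Python sketch. It is not Spark's Scala code. It assumes the class counts for all bins live back to back in one flat array, and it compares computing Gini impurity from a copied slice with computing it in place from an offset. The names `gini_with_copy`, `gini_in_place`, `all_stats`, and `stats_size` are hypothetical, as is the sample data.

```python
def gini_with_copy(all_stats, offset, stats_size):
    """Gini impurity computed from a fresh copy of one bin's class counts."""
    counts = all_stats[offset:offset + stats_size]  # slicing allocates a new list
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_in_place(all_stats, offset, stats_size):
    """Same quantity, reading the shared array by index without copying."""
    total = 0.0
    sum_sq = 0.0
    for i in range(offset, offset + stats_size):
        c = all_stats[i]
        total += c
        sum_sq += c * c
    if total == 0:
        return 0.0
    return 1.0 - sum_sq / (total * total)


# Two bins with two classes each, laid out back to back (hypothetical data).
all_stats = [3.0, 1.0, 2.0, 2.0]
print(gini_with_copy(all_stats, 0, 2), gini_in_place(all_stats, 0, 2))
print(gini_with_copy(all_stats, 2, 2), gini_in_place(all_stats, 2, 2))
```

Both functions return the same impurity for each bin. The only difference is whether a temporary array is allocated per call, which is the kind of overhead that can add up when the call sits inside a loop over every candidate split.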

      This JIRA is for:

      • Doing more profiling to confirm that unnecessary time is being spent in some of these methods.
      • Optimizing the implementation.
      • Profiling again to confirm the speedups.

      Local profiling for large enough examples should suffice, especially since the optimizations should not need to change the amount of data communicated.
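
The profile, optimize, profile-again loop can be tried locally on a toy stand-in before touching the real code. The sketch below is in Python rather than Spark's Scala, and it uses only the standard-library `timeit` module. The functions `baseline` and `candidate`, the helper `best_time`, and the input data are all hypothetical.

```python
import timeit


def baseline(all_stats, offset, size):
    """Sums one span of the array after copying it into a new list."""
    return sum(all_stats[offset:offset + size])


def candidate(all_stats, offset, size):
    """Sums the same span by index, with no intermediate copy."""
    total = 0.0
    for i in range(offset, offset + size):
        total += all_stats[i]
    return total


def best_time(fn, *args, repeat=3, number=50):
    """Best-of-`repeat` wall time in seconds for `number` calls of fn(*args)."""
    return min(timeit.repeat(lambda: fn(*args), repeat=repeat, number=number))


all_stats = [float(i % 7) for i in range(10_000)]

# Confirm the candidate is a drop-in replacement before timing it.
assert baseline(all_stats, 0, len(all_stats)) == candidate(all_stats, 0, len(all_stats))

before = best_time(baseline, all_stats, 0, len(all_stats))
after = best_time(candidate, all_stats, 0, len(all_stats))
print(f"baseline: {before:.6f}s, candidate: {after:.6f}s")
```

Which variant wins depends on the runtime. In CPython the built-in `sum` over a slice may well beat the explicit loop, while on the JVM the trade-off can differ, which is why the second measurement matters as much as the first.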

            People

              Assignee: Unassigned
              Reporter: Joseph K. Bradley (josephkb)
              Votes: 0
              Watchers: 8
