Spark / SPARK-3043

DecisionTree aggregation is inefficient

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: MLlib
    • Labels: None

    Description

      There are two major efficiency issues, one in computation and one in storage:

      (1) DecisionTree aggregation involves reshaping data unnecessarily.

      For example, the internal methods extractNodeInfo() and getBinDataForNode() reshape the data multiple times without performing any real computation.
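To illustrate the kind of change that avoids this reshaping, here is a minimal Python sketch, not MLlib's actual code. It assumes the aggregate statistics live in one flat array laid out as [node][feature][bin][stat], so a node's data is reached by offset arithmetic rather than by copying into nested arrays. The names flat_index and node_slice, and the layout itself, are hypothetical.

```python
# Hypothetical illustration; not MLlib's actual implementation.

def flat_index(node, feature, bin_, stat, num_features, num_bins, stats_size):
    """Offset of one statistic in a flat array laid out as
    [node][feature][bin][stat]."""
    return ((node * num_features + feature) * num_bins + bin_) * stats_size + stat


def node_slice(node, num_features, num_bins, stats_size):
    """(start, end) offsets of one node's statistics; nothing is copied."""
    size = num_features * num_bins * stats_size
    return node * size, (node + 1) * size


num_nodes, num_features, num_bins, stats_size = 2, 3, 4, 2
agg = [0.0] * (num_nodes * num_features * num_bins * stats_size)

# Record one observation for node 1, feature 2, bin 3, statistic 1.
agg[flat_index(1, 2, 3, 1, num_features, num_bins, stats_size)] += 1.0

# Read back that node's statistics by offset, without reshaping.
start, end = node_slice(1, num_features, num_bins, stats_size)
node_total = sum(agg[start:end])
```

With this layout, updating or reading a node's statistics is a single index computation, so no intermediate per-node structures need to be built.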

      (2) DecisionTree splits and aggregate bins can include many unused bins/splits.

      The same number of splits/bins is used for all features. For example, if a continuous feature uses 100 bins, then 100 bins are also allocated for every binary feature, even though only 2 are necessary.
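The storage cost in (2) can be made concrete with a short Python sketch. It is an illustration under assumed names (uniform_bin_count, feature_offsets), not MLlib's implementation. It compares allocating the maximum bin count for every feature against allocating only what each feature needs and locating each feature's bins through cumulative offsets.

```python
from itertools import accumulate

# Hypothetical illustration; not MLlib's actual implementation.

def uniform_bin_count(bins_per_feature):
    """Bins allocated when every feature gets the maximum bin count."""
    return max(bins_per_feature) * len(bins_per_feature)


def feature_offsets(bins_per_feature):
    """Start offset of each feature's bins in a compact layout.
    The final entry is the total number of bins allocated."""
    return [0] + list(accumulate(bins_per_feature))


# One continuous feature with 100 bins and three binary features.
bins_per_feature = [100, 2, 2, 2]

uniform_total = uniform_bin_count(bins_per_feature)
offsets = feature_offsets(bins_per_feature)
compact_total = offsets[-1]

# In the compact layout, bin b of feature f lives at offsets[f] + b.
index_of_feature2_bin1 = offsets[2] + 1
```

In this example the uniform layout allocates 400 bins while the compact layout allocates 106, and the gap grows with the number of low-arity features.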

            People

              Assignee: Joseph K. Bradley (josephkb)
              Reporter: Joseph K. Bradley (josephkb)