Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22451

Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories)

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.2.0
    • None
    • ML

    Description

      Do not need generate all possible splits for unordered features before aggregate,
      in aggregete (executor side):
      1. Change `mixedBinSeqOp`, for each unordered feature, we do the same stat with ordered features. so for unordered features, we only need O(numCategories) space for this feature stat.
      2. After driver side get the aggregate result, generate all possible split combinations, and compute the best split.

      This will reduce decision tree aggregate size for each unordered feature from O(2^numCategories) to O(numCategories), `numCategories` is the arity of this unordered feature.

      This also reduce the cpu cost in executor side. Reduce time complexity for this unordered feature from O(numPoints * 2^numCategories) to O(numPoints).

      This won't increase time complexity for unordered features best split computing in driver side.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              weichenxu123 Weichen Xu
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - 24h
                  24h
                  Remaining:
                  Remaining Estimate - 24h
                  24h
                  Logged:
                  Time Spent - Not Specified
                  Not Specified