SPARK-3383: DecisionTree aggregate size could be smaller


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: MLlib

    Description

      Storage and communication optimization:
      DecisionTree aggregate statistics could store less data (described below). The savings would be significant for datasets with many low-arity categorical features (binary features, or unordered categorical features). Savings would be negligible for continuous features.

      DecisionTree stores a vector of sufficient statistics for each (node, feature, bin). We could store one fewer bin per (node, feature): for a given (node, feature), if we store these vectors for all but the last bin, and also store the total statistics for each node, then the last bin's statistics can be computed by subtracting the stored bins from the node total. For binary and unordered categorical features, this would cut in half the number of bins to store and communicate.
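
      As a rough illustration (a minimal sketch, not Spark's actual internals; the object and method names below are hypothetical), the last bin's statistics can be recovered by subtracting the stored bins from the per-node total:

          // Hypothetical sketch: recover the last bin's sufficient statistics
          // from the per-node total instead of storing that bin explicitly.
          object LastBinRecovery {

            // A bin's sufficient statistics, e.g. per-class label counts.
            type Stats = Array[Double]

            // Element-wise subtraction of two statistics vectors.
            private def subtract(a: Stats, b: Stats): Stats =
              a.zip(b).map { case (x, y) => x - y }

            // For a given (node, feature): the last bin's stats equal the
            // node total minus the sum of the stored bins.
            def lastBinStats(storedBins: Seq[Stats], nodeTotal: Stats): Stats =
              storedBins.foldLeft(nodeTotal)(subtract)

            def main(args: Array[String]): Unit = {
              // Binary feature (2 bins): store only bin 0 plus the node total.
              val bin0      = Array(3.0, 5.0)   // stored per (node, feature, bin)
              val nodeTotal = Array(10.0, 7.0)  // stored once per node
              val bin1      = lastBinStats(Seq(bin0), nodeTotal)
              println(bin1.mkString("[", ", ", "]"))  // prints [7.0, 2.0]
            }
          }

      For a binary feature this means storing one bin instead of two, which is where the factor-of-two saving comes from; the same subtraction works for any number of bins, saving one bin per (node, feature).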


            People

              Assignee: Unassigned
              Reporter: Joseph K. Bradley (josephkb)
              Votes: 0
              Watchers: 5
