Spark / SPARK-22751

Improve ML RandomForest shuffle performance


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.4.0
    • Component/s: ML
    • Labels: None

    Description

      When I trained a classifier with ML RandomForest on the news20.binary dataset, which has 19,996 training examples and 1,355,191 features, I found that the shuffle write size (51 GB) of findSplitsBySorting is very large compared with the small data size (133.52 MB). Replacing groupByKey with reduceByKey should improve shuffle performance.
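      The proposed change can be illustrated with a small sketch. This is not the actual findSplitsBySorting code; it mimics the two RDD patterns with plain Scala collections (the names and sample values are hypothetical) to show why reduceByKey shuffles less: it combines values per key on the map side, so only one partial aggregate per key per partition crosses the network, whereas groupByKey ships every raw (featureIndex, value) pair.

      ```scala
      // Hedged sketch, not Spark source: contrasts the semantics of the two
      // RDD patterns using plain Scala collections.
      object ShuffleSketch {
        type FeatureIndex = Int

        // Simulated (featureIndex, value) records, as produced before the shuffle.
        val records: Seq[(FeatureIndex, Double)] =
          Seq((0, 1.0), (0, 2.0), (1, 3.0), (0, 4.0), (1, 5.0))

        // groupByKey-style: every raw value is materialized under its key before
        // any aggregation, so in Spark the shuffle grows with the record count.
        def viaGroupByKey: Map[FeatureIndex, Double] =
          records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

        // reduceByKey-style: values are folded into one running aggregate per
        // key as they arrive, mirroring Spark's map-side combine, so only one
        // partial sum per key per partition would be shuffled.
        def viaReduceByKey: Map[FeatureIndex, Double] =
          records.foldLeft(Map.empty[FeatureIndex, Double]) {
            case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v)
          }
      }
      ```

      Both paths produce identical results, which is why the substitution is safe whenever the per-key aggregation is associative and commutative, as a sum or merge of split statistics is.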


          People

            Assignee: lucio35
            Reporter: lucio35
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 48h
                Remaining: 48h
                Logged: Not Specified