Spark / SPARK-1401

Use mapPartitions instead of map to avoid creating expensive objects in the GradientDescent optimizer


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Fix Version/s: None
    • Affects Version/s: 0.9.1
    • Component/s: MLlib

    Description

      In GradientDescent, currently, each row of the input data creates its own gradient matrix object, and then we sum them up in the reducer.

      We found that when the number of features is on the order of thousands, this becomes the bottleneck. The situation was worse when we tested with a Newton optimizer, because the dimension of the Hessian matrix is so large.

      In our testing, when the number of features is in the hundreds of thousands, the GC kicks in for each row of input, and it sometimes brings down the workers.

      By aggregating lossSum and gradientSum using mapPartitions, we solved the GC issue and scale better with the number of features.
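
      As a rough sketch of the change described above (not the actual patch), assume the input is an RDD of (label, features) rows stored as plain Arrays, and that addGradient is a hypothetical per-row function that accumulates the row's gradient into a shared buffer in place and returns the row's loss. Each partition then contributes a single (gradientSum, lossSum) pair to the reduce, instead of allocating one gradient object per row:

      import org.apache.spark.rdd.RDD

      // Sketch only: `addGradient` is a hypothetical per-row function that adds the
      // row's gradient into the passed-in buffer and returns the row's loss.
      def aggregateGradient(
          data: RDD[(Double, Array[Double])],
          weights: Array[Double],
          addGradient: (Double, Array[Double], Array[Double], Array[Double]) => Double
        ): (Array[Double], Double) = {

        val n = weights.length

        // One gradientSum buffer and one lossSum per partition, instead of a new
        // gradient object per row as the current map-based implementation creates.
        val perPartition = data.mapPartitions { iter =>
          val gradientSum = new Array[Double](n)
          var lossSum = 0.0
          iter.foreach { case (label, features) =>
            lossSum += addGradient(label, features, weights, gradientSum)
          }
          Iterator((gradientSum, lossSum))
        }

        // Only one (gradientSum, lossSum) pair per partition reaches the reduce.
        perPartition.reduce { (a, b) =>
          val (g1, l1) = a
          val (g2, l2) = b
          var i = 0
          while (i < n) { g1(i) += g2(i); i += 1 }
          (g1, l1 + l2)
        }
      }

      With this shape, the number of temporary gradient objects is proportional to the number of partitions rather than the number of rows, which is what keeps GC pressure manageable when the number of features is large.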


    People

      Assignee: Unassigned
      Reporter: DB Tsai (dbtsai)
      Votes: 0
      Watchers: 2


    Time Tracking

      Estimated: 24h
      Remaining: 24h
      Logged: Not Specified