Spark / SPARK-1401

Use mapPartitions instead of map to avoid creating expensive objects in the GradientDescent optimizer


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Fix Version/s: None
    • Affects Version/s: 0.9.1
    • Component/s: MLlib

    Description

      In GradientDescent, currently, each row of the input data creates its own gradient matrix object, and then we sum them up in the reducer.

      We found that when the number of features is on the order of thousands, this becomes the bottleneck. The situation was worse when we tested with a Newton optimizer, because the dimension of the Hessian matrix is so large.

      In our testing, when the number of features is in the hundreds of thousands, the GC kicks in for each row of input, and it sometimes brings down the workers.

      By aggregating lossSum and gradientSum using mapPartitions, we solved the GC issue and scale better with the number of features.
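
      As a rough sketch of the change described above (not the actual patch), assume the input is an RDD of (label, features) rows stored as plain Arrays, and that addGradient is a hypothetical per-row function that accumulates the row's gradient into a shared buffer in place and returns the row's loss. Each partition then contributes a single (gradientSum, lossSum) pair to the reduce, instead of allocating one gradient object per row:

      import org.apache.spark.rdd.RDD

      // Sketch only: `addGradient` is a hypothetical per-row function that adds the
      // row's gradient into the passed-in buffer and returns the row's loss.
      def aggregateGradient(
          data: RDD[(Double, Array[Double])],
          weights: Array[Double],
          addGradient: (Double, Array[Double], Array[Double], Array[Double]) => Double
        ): (Array[Double], Double) = {

        val n = weights.length

        // One gradientSum buffer and one lossSum per partition, instead of a new
        // gradient object per row as the current map-based implementation creates.
        val perPartition = data.mapPartitions { iter =>
          val gradientSum = new Array[Double](n)
          var lossSum = 0.0
          iter.foreach { case (label, features) =>
            lossSum += addGradient(label, features, weights, gradientSum)
          }
          Iterator((gradientSum, lossSum))
        }

        // Only one (gradientSum, lossSum) pair per partition reaches the reduce.
        perPartition.reduce { (a, b) =>
          val (g1, l1) = a
          val (g2, l2) = b
          var i = 0
          while (i < n) { g1(i) += g2(i); i += 1 }
          (g1, l1 + l2)
        }
      }

      With this shape, the number of temporary gradient objects is proportional to the number of partitions rather than the number of rows, which is what keeps GC pressure manageable when the number of features is large.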


    People

      Assignee: Unassigned
      Reporter: DB Tsai (dbtsai)
      Votes: 0
      Watchers: 2


    Time Tracking

      Estimated: 24h
      Remaining: 24h
      Logged: Not Specified