Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 1.6.0
- Fix Version/s: None
Description
When training (a.k.a. fitting) org.apache.spark.ml.classification.LogisticRegression, aggregation is done using arrays (which are dense structures). This is the case regardless of whether the features of each instance are stored in a sparse or a dense vector.
When the feature vectors are very large, performance is poor because there's a lot of overhead in transmitting these large arrays across workers.
However, just because the feature vectors are large doesn't mean all of those features are actually being used. If the features that actually occur are sparse, very large arrays are being allocated (and shipped) unnecessarily.
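For illustration, here is a minimal sketch (not Spark's actual LogisticAggregator code, and written against the newer org.apache.spark.ml.linalg API) contrasting the dense buffer that gets allocated today with a sparse, map-based alternative that only grows with the features actually touched; the helper name and the hash-map representation are assumptions made purely for the example:

    import org.apache.spark.ml.linalg.Vector
    import scala.collection.mutable

    val numFeatures = 20000000

    // What happens today: a dense gradient buffer with one Double per feature,
    // allocated (and later shipped between workers) no matter how sparse the data is.
    // At 20 million features this is roughly 160 MB per buffer.
    val denseGradient = new Array[Double](numFeatures)

    // Sketch of a sparse alternative: only the indices that actually occur are stored.
    val sparseGradient = mutable.HashMap.empty[Int, Double]

    def addGradient(features: Vector, multiplier: Double): Unit =
      features.foreachActive { (i, value) =>
        sparseGradient(i) = sparseGradient.getOrElse(i, 0.0) + multiplier * value
      }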
To address this, there should be an option to aggregate using sparse vectors. It should be up to the user to set this explicitly as a parameter on the estimator, since the user should have some idea of whether it is necessary for their particular case.
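A hypothetical way this could surface on the estimator is sketched below, against the Spark 2.x DataFrame-based API. The sparseAggregation parameter does not exist today and its name is purely illustrative, and the tiny local dataset just stands in for a much wider, sparse one:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sparse-lr-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy stand-in: a wide feature space with only a handful of non-zeros per row
    // (the real use case is ~20 million features with ~8,000 non-zeros).
    val dim = 100000
    val training = Seq(
      (1.0, Vectors.sparse(dim, Array(3, 1024, 99999), Array(1.0, 2.0, 0.5))),
      (0.0, Vectors.sparse(dim, Array(7, 2048), Array(1.5, 1.0)))
    ).toDF("label", "features")

    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    // Hypothetical opt-in, not part of the current API; the user chooses it
    // because they know whether their data is sparse enough to benefit:
    // lr.set(lr.sparseAggregation, true)

    val model = lr.fit(training)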
As an example, I have a use case where the feature vector size is around 20 million, but only about 7-8 thousand features are non-zero. The time to train a model with only 10 workers is 1.3 hours on my test cluster; with 100 workers this balloons to over 10 hours on the same cluster! Also, spark.driver.maxResultSize is set to 112 GB to accommodate the data being pulled back to the driver.
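For scale, a rough back-of-the-envelope calculation for this use case (assuming 8 bytes per Double and 4 bytes per Int index, and ignoring JVM object and serialization overhead):

    // Approximate size of a single aggregation buffer in this use case.
    val numFeatures     = 20000000L   // ~20 million features
    val nonZeroFeatures = 8000L       // only ~7-8 thousand actually used
    val bytesPerDouble  = 8L
    val bytesPerInt     = 4L

    val denseBytes  = numFeatures * bytesPerDouble                     // 160,000,000 B ~ 160 MB
    val sparseBytes = nonZeroFeatures * (bytesPerDouble + bytesPerInt) //      96,000 B ~  96 KB

    println(f"dense buffer:  ${denseBytes / 1e6}%.0f MB")
    println(f"sparse buffer: ${sparseBytes / 1e3}%.0f KB")

With every partition shipping a buffer of this size on each aggregation pass, the dense representation quickly dominates network traffic and driver memory, which is consistent with needing such a large spark.driver.maxResultSize.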
Attachments
Issue Links
- is related to
  - SPARK-16008 ML Logistic Regression aggregator serializes unnecessary data (Resolved)
  - SPARK-16592 Improving ml.Logistic Regression on speed and scalability (Resolved)
- links to