Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 1.6.0
- Fix Version/s: None
Description
When training (a.k.a. fitting) org.apache.spark.ml.classification.LogisticRegression, aggregation is done using arrays (which are dense structures). This is the case regardless of whether the features of each instance are stored in a sparse or a dense vector.
When the feature vectors are very large, performance is poor because there's a lot of overhead in transmitting these large arrays across workers.
However, just because the feature vectors are large doesn't mean all of those features are actually being used. If the features that actually occur are sparse, very large arrays are being allocated (and shipped) unnecessarily.
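For illustration, here is a minimal sketch (not Spark's actual LogisticAggregator code, and written against the newer org.apache.spark.ml.linalg API) contrasting the dense buffer that gets allocated today with a sparse, map-based alternative that only grows with the features actually touched; the helper name and the hash-map representation are assumptions made purely for the example:

    import org.apache.spark.ml.linalg.Vector
    import scala.collection.mutable

    val numFeatures = 20000000

    // What happens today: a dense gradient buffer with one Double per feature,
    // allocated (and later shipped between workers) no matter how sparse the data is.
    // At 20 million features this is roughly 160 MB per buffer.
    val denseGradient = new Array[Double](numFeatures)

    // Sketch of a sparse alternative: only the indices that actually occur are stored.
    val sparseGradient = mutable.HashMap.empty[Int, Double]

    def addGradient(features: Vector, multiplier: Double): Unit =
      features.foreachActive { (i, value) =>
        sparseGradient(i) = sparseGradient.getOrElse(i, 0.0) + multiplier * value
      }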
To address this, there should be an option to aggregate using sparse vectors. It should be up to the user to set this explicitly as a parameter on the estimator, since the user should have some idea of whether it is necessary for their particular case.
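A hypothetical way this could surface on the estimator is sketched below, against the Spark 2.x DataFrame-based API. The sparseAggregation parameter does not exist today and its name is purely illustrative, and the tiny local dataset just stands in for a much wider, sparse one:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sparse-lr-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy stand-in: a wide feature space with only a handful of non-zeros per row
    // (the real use case is ~20 million features with ~8,000 non-zeros).
    val dim = 100000
    val training = Seq(
      (1.0, Vectors.sparse(dim, Array(3, 1024, 99999), Array(1.0, 2.0, 0.5))),
      (0.0, Vectors.sparse(dim, Array(7, 2048), Array(1.5, 1.0)))
    ).toDF("label", "features")

    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    // Hypothetical opt-in, not part of the current API; the user chooses it
    // because they know whether their data is sparse enough to benefit:
    // lr.set(lr.sparseAggregation, true)

    val model = lr.fit(training)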
As an example, I have a use case where the feature vector size is around 20 million, but only about 7-8 thousand features are non-zero. The time to train a model with only 10 workers is 1.3 hours on my test cluster; with 100 workers this balloons to over 10 hours on the same cluster! Also, spark.driver.maxResultSize is set to 112 GB to accommodate the data being pulled back to the driver.
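For scale, a rough back-of-the-envelope calculation for this use case (assuming 8 bytes per Double and 4 bytes per Int index, and ignoring JVM object and serialization overhead):

    // Approximate size of a single aggregation buffer in this use case.
    val numFeatures     = 20000000L   // ~20 million features
    val nonZeroFeatures = 8000L       // only ~7-8 thousand actually used
    val bytesPerDouble  = 8L
    val bytesPerInt     = 4L

    val denseBytes  = numFeatures * bytesPerDouble                     // 160,000,000 B ~ 160 MB
    val sparseBytes = nonZeroFeatures * (bytesPerDouble + bytesPerInt) //      96,000 B ~  96 KB

    println(f"dense buffer:  ${denseBytes / 1e6}%.0f MB")
    println(f"sparse buffer: ${sparseBytes / 1e3}%.0f KB")

With every partition shipping a buffer of this size on each aggregation pass, the dense representation quickly dominates network traffic and driver memory, which is consistent with needing such a large spark.driver.maxResultSize.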
Attachments
Issue Links
- is related to
  - SPARK-16008 ML Logistic Regression aggregator serializes unnecessary data (Resolved)
  - SPARK-16592 Improving ml.Logistic Regression on speed and scalability (Resolved)
- links to