Spark / SPARK-14464

Logistic regression performs poorly for very large vectors, even when the number of non-zero features is small


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: ML

    Description

      When training (a.k.a. fitting) org.apache.spark.ml.classification.LogisticRegression, aggregation is done using arrays (which are dense structures). This is the case regardless of whether the features of each instance are stored in a sparse or a dense vector.

      When the feature vectors are very large, performance is poor because there's a lot of overhead in transmitting these large arrays across workers.
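
      As an illustration, here is a simplified sketch of that aggregation pattern (this is not Spark's internal aggregator; the function name sumGradients is made up, the gradient math is shown only schematically, and the Spark 2.x org.apache.spark.ml.linalg types are assumed). Each task starts from a dense buffer whose length equals the declared feature-vector size, and those full-length arrays are what get merged and shipped between workers:

      {code:scala}
      import org.apache.spark.ml.linalg.Vector
      import org.apache.spark.rdd.RDD

      // Sketch only: each task carries a dense Array[Double] of length numFeatures,
      // so per-task memory, the combine step, and the bytes shipped all scale with
      // the declared vector size rather than the number of non-zero features.
      def sumGradients(data: RDD[(Double, Vector)], numFeatures: Int): Array[Double] = {
        data.treeAggregate(new Array[Double](numFeatures))(
          seqOp = { case (buf, (label, features)) =>
            // Only the active entries contribute, but the buffer itself stays dense.
            features.foreachActive { (i, v) => buf(i) += v * label } // schematic update
            buf
          },
          combOp = { (a, b) =>
            var i = 0
            while (i < a.length) { a(i) += b(i); i += 1 } // O(numFeatures) merge
            a
          }
        )
      }
      {code}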

      However, just because the feature vectors are large doesn't mean all these features are actually being used. If the actual features are sparse, very large arrays are being allocated unnecessarily.

      To handle this case, there should be an option to aggregate using sparse vectors. It should be up to the user to set this explicitly as a parameter on the estimator, since the user should have some idea whether it is necessary for their particular case.
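
      For contrast, a sketch of the sparse-aggregation option being proposed, using a mutable map of index -> value as a stand-in for a sparse accumulator (again, not Spark's code; sumGradientsSparse is a made-up name and the update is schematic). Per-task state and the data merged between workers then scale with the number of active features instead of the declared vector size:

      {code:scala}
      import scala.collection.mutable

      import org.apache.spark.ml.linalg.Vector
      import org.apache.spark.rdd.RDD

      // Sketch only: accumulate into a sparse index -> value map instead of a
      // dense array, so the cost tracks the number of active features.
      def sumGradientsSparse(data: RDD[(Double, Vector)]): mutable.Map[Int, Double] = {
        data.treeAggregate(mutable.Map.empty[Int, Double])(
          seqOp = { case (buf, (label, features)) =>
            features.foreachActive { (i, v) =>
              buf(i) = buf.getOrElse(i, 0.0) + v * label // schematic update
            }
            buf
          },
          combOp = { (a, b) =>
            b.foreach { case (i, v) => a(i) = a.getOrElse(i, 0.0) + v }
            a
          }
        )
      }
      {code}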

      As an example, I have a use case where the feature vector size is around 20 million. However, there are only 7 to 8 thousand non-zero features. The time to train a model with only 10 workers is 1.3 hours on my test cluster. With 100 workers this balloons to over 10 hours on the same cluster! Also, spark.driver.maxResultSize is set to 112 GB to accommodate the data being pulled back to the driver.
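
      A minimal, synthetic sketch of this scenario (Spark 2.x API shown, i.e. SparkSession and org.apache.spark.ml.linalg; on 1.6 the equivalents are SQLContext and org.apache.spark.mllib.linalg; the rows, indices, and values are made up, and actually running it at this size needs correspondingly large executor and driver memory):

      {code:scala}
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.sql.SparkSession

      object SparseFeaturesLR {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("sparse-features-lr").getOrCreate()
          import spark.implicits._

          val numFeatures = 20000000 // declared size of ~20 million, as in the report

          // Each instance activates only a handful of the 20M slots.
          val data = Seq(
            (1.0, Vectors.sparse(numFeatures, Array(3, 1024, 5000000), Array(1.0, 2.0, 3.0))),
            (0.0, Vectors.sparse(numFeatures, Array(7, 2048, 9000000), Array(1.0, 1.5, 0.5)))
          ).toDF("label", "features")

          // The inputs are SparseVectors, but the aggregation buffers used while
          // fitting are dense arrays of length numFeatures.
          val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
          val model = lr.fit(data)

          println(s"non-zero coefficients: ${model.coefficients.numNonzeros}")
          spark.stop()
        }
      }
      {code}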

            People

              Assignee: Unassigned
              Reporter: Daniel Siegmann
              Votes: 0
              Watchers: 5
