SPARK-4431

Implement efficient activeIterator for dense and sparse vector

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: MLlib
    • Labels:
      None

      Description

      Previously, we used Breeze's activeIterator to access the non-zero elements
      of dense and sparse vectors. Because of its overhead, we switched back to a native
      while loop in #SPARK-4129.

      However, the #SPARK-4129 approach dereferences dv.values/sv.values on
      every access to a value, which is very expensive. Also, MultivariateOnlineSummarizer
      stores its partial stats in Breeze dense vectors, which is very expensive compared
      with using primitive Scala arrays.

      This PR implements an efficient foreachActive that unifies the code path for dense and
      sparse vector operations, which makes the codebase easier to maintain. The Breeze dense
      vectors are replaced by primitive arrays to reduce the overhead further.
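
      The idea can be sketched as follows. This is a minimal illustration, not the code
      merged for this issue: Vec, DenseVec, SparseVec and Stats.columnSums are hypothetical,
      simplified stand-ins for the MLlib classes. Each foreachActive copies the backing
      arrays into locals once and then walks them with a while loop, and the partial
      statistics are accumulated in a primitive Array[Double].

```scala
// Hypothetical, simplified stand-ins for the MLlib vector types (assumption:
// these are NOT the real classes, only an illustration of the technique).
sealed trait Vec {
  def size: Int

  /** Applies f(index, value) to every active (explicitly stored) entry. */
  def foreachActive(f: (Int, Double) => Unit): Unit
}

final class DenseVec(val values: Array[Double]) extends Vec {
  def size: Int = values.length

  def foreachActive(f: (Int, Double) => Unit): Unit = {
    // Read the field once; the loop body then touches only local variables.
    val localValues = values
    val n = localValues.length
    var i = 0
    while (i < n) {
      // In this sketch every entry of a dense vector is active, zeros included.
      f(i, localValues(i))
      i += 1
    }
  }
}

final class SparseVec(val size: Int,
                      val indices: Array[Int],
                      val values: Array[Double]) extends Vec {
  require(indices.length == values.length,
    "indices and values must have the same length")

  def foreachActive(f: (Int, Double) => Unit): Unit = {
    val localIndices = indices
    val localValues = values
    val nnz = localValues.length
    var k = 0
    while (k < nnz) {
      // Only the stored entries are visited.
      f(localIndices(k), localValues(k))
      k += 1
    }
  }
}

object Stats {
  /** Per-column sums, accumulated in a primitive array instead of a Breeze vector. */
  def columnSums(rows: Seq[Vec], numCols: Int): Array[Double] = {
    val sums = new Array[Double](numCols)
    rows.foreach { row =>
      require(row.size == numCols, "all rows must have numCols entries")
      // One code path for both dense and sparse rows.
      row.foreachActive { (i, v) => sums(i) += v }
    }
    sums
  }
}
```

      With this shape, a caller such as a summarizer does not need to pattern-match on
      the vector type; it passes one closure and lets each vector decide how to iterate.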

      Benchmark on the mnist8m dataset in a single JVM, with the first 200 samples
      loaded in memory and the computation repeated 5000 times.

      Before change:
      Sparse Vector - 30.02
      Dense Vector - 38.27

      With this PR:
      Sparse Vector - 6.29
      Dense Vector - 11.72

            People

            • Assignee:
              DB Tsai (dbtsai)
            • Reporter:
              DB Tsai (dbtsai)