Spark / SPARK-20446

Optimize the process of MLLIB ALS recommendForAll


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: ML, MLlib
    • Labels: None

    Description

      recommendForAll in MLlib ALS is very slow.
      GC is the key problem with the current method.
      Each task uses the following code to hold its intermediate results:
      val output = new Array[(Int, (Int, Double))](m*n)
      Here m = n = 4096 (the default value, which cannot be configured),
      so output takes about 4k * 4k * (4 + 4 + 8) bytes = 256 MB. An allocation this large causes serious GC pressure, and tasks frequently fail with OOM.
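
The estimate above can be reproduced with a few lines of Scala. This is only a sketch of the arithmetic: `BufferEstimate` and `payloadBytes` are hypothetical names, and the figure counts only the 4 + 4 + 8 payload bytes per entry, so it is a lower bound (JVM object headers and references for the tuples add more).

```scala
object BufferEstimate {
  // Payload bytes per (Int, (Int, Double)) entry: 4 + 4 + 8.
  // Assumption: JVM object overhead is ignored, so this is a lower bound.
  val bytesPerEntry: Long = 4L + 4L + 8L

  // Payload size in bytes of a buffer holding rows * cols entries.
  def payloadBytes(rows: Int, cols: Int): Long =
    rows.toLong * cols.toLong * bytesPerEntry

  def main(args: Array[String]): Unit = {
    val blockSize = 4096 // m = n = 4096, as stated in the description
    val full = payloadBytes(blockSize, blockSize)
    println(s"full block buffer: ${full / (1024 * 1024)} MiB") // prints 256 MiB
  }
}
```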

      Actually, we do not need to keep all of the intermediate results. If we recommend the topK products for each user (topK is typically about 10 or 20), we only need 4k * topK * (4 + 4 + 8) bytes of memory to hold them.
      I have written a solution based on this idea, with the following test results.

      Test environment:
      3 workers, each with 10 cores, 30 GB of memory, and 1 executor.
      Data: 480,000 users and 17,000 items.

      BlockSize        1024    2048    4096    8192
      Old method       245s    332s    488s    OOM
      This solution    121s    118s    117s    120s
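
The idea of keeping only topK candidates per user, rather than every user-item score, might look roughly like the sketch below. It is not the patch referred to in this issue: `TopKSketch`, `topK`, `dot`, and `recommendForUser` are hypothetical names, and the bounded min-heap is one common way to do this, assumed here for illustration.

```scala
import scala.collection.mutable

object TopKSketch {
  // Ordering under which the LOWEST score has the highest priority,
  // so the heap's head is always the weakest of the retained candidates.
  private val weakestFirst: Ordering[(Int, Double)] =
    Ordering.fromLessThan[(Int, Double)]((a, b) => a._2 > b._2)

  /** Streams over (itemId, score) pairs, retaining at most k of them. */
  def topK(scores: Iterator[(Int, Double)], k: Int): Array[(Int, Double)] = {
    require(k > 0, "k must be positive")
    val heap = mutable.PriorityQueue.empty[(Int, Double)](weakestFirst)
    scores.foreach { candidate =>
      if (heap.size < k) {
        heap.enqueue(candidate)
      } else if (candidate._2 > heap.head._2) {
        heap.dequeue()          // evict the current weakest
        heap.enqueue(candidate)
      }
    }
    heap.toArray.sortWith(_._2 > _._2) // best score first
  }

  /** Dot product of two factor vectors of equal length. */
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  /** Scores every item for one user without materializing all scores. */
  def recommendForUser(
      user: Array[Double],
      items: Seq[(Int, Array[Double])],
      k: Int): Array[(Int, Double)] =
    topK(items.iterator.map { case (id, f) => (id, dot(user, f)) }, k)

  def main(args: Array[String]): Unit = {
    val user = Array(1.0, 0.0)
    val items = Seq(1 -> Array(0.5, 9.0), 2 -> Array(2.0, 0.0), 3 -> Array(1.0, 1.0))
    println(recommendForUser(user, items, k = 2).mkString(", "))
  }
}
```

With this shape, a block of 4k users holds at most 4k * topK entries at any time, which matches the memory bound given in the description.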

          People

            Assignee: Unassigned
            Reporter: Peng Meng (peng.meng@intel.com)