Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: MLlib
    • Labels:
      None
    • Target Version/s:

      Description

      The internal storage of ALS uses many small objects, which increases the GC pressure and makes ALS difficult to scale to very large scale, e.g., 50 billion ratings. In such cases, the full GC may take more than 10 minutes to finish. That is longer than the default heartbeat timeout and hence executors will be removed under default settings.

      We can use primitive arrays to reduce the number of objects significantly. This requires big change to the ALS implementation.

        Issue Links

          Activity

          Hide
          mengxr Xiangrui Meng added a comment -

          Issue resolved by pull request 3720
          https://github.com/apache/spark/pull/3720

          Show
          mengxr Xiangrui Meng added a comment - Issue resolved by pull request 3720 https://github.com/apache/spark/pull/3720
          Hide
          apachespark Apache Spark added a comment -

          User 'mengxr' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3720

          Show
          apachespark Apache Spark added a comment - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3720
          Hide
          srowen Sean Owen added a comment -
          Show
          srowen Sean Owen added a comment - Adam Kunicki Related discussion at https://issues.apache.org/jira/browse/SPARK-2465
          Hide
          adam.kunicki Adam Kunicki added a comment -

          Has anyone considered that userId: Int and productId: Int may not make sense in most real-life use cases?

          It requires an extra mapping of your ids (e.g. Long, or String even) to a space like Int and mapping back before you have any useable information.

          Show
          adam.kunicki Adam Kunicki added a comment - Has anyone considered that userId: Int and productId: Int may not make sense in most real-life use cases? It requires an extra mapping of your ids (e.g. Long, or String even) to a space like Int and mapping back before you have any useable information.
          Hide
          mengxr Xiangrui Meng added a comment -

          I put the implementation at https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala . It still needs some cleanups to become a PR, but early feedbacks are welcome.

          Show
          mengxr Xiangrui Meng added a comment - I put the implementation at https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala . It still needs some cleanups to become a PR, but early feedbacks are welcome.
          Hide
          mengxr Xiangrui Meng added a comment -

          Wrote a new implementation that gives ~5x speedup and better scalability, while the algorithm doesn't change.

          1. Use `(Int, Int, Float)` as input, saving 4 bytes on rating.
          2. Out link blocks are stored as `out: Array[Array[Int]]`, where out(dstBlockId) contains the src indices (local to the src block) associated with the dstBlockId.
          3. In link blocks are stored in a CSC format:

            class InBlock(
              srcIds: Array[Int],
              dstPtrs: Array[Int],
              dstEncodedIndices: Array[Int],
              ratings: Array[Float])
          

          `dstEncodedIndices` contains encoded `dstBlockId` (high bits) and `dstLocalIndex` (low bits). Using this data structure, the subproblems can be solved one after another without allocating many AtA/Atb buffers.
          4. The input ratings are stored in small batches to avoid ser/de overhead.
          5. LAPACK's dppsv is used instead of dposv. The former only needs the triangular part. Double is used for constructing the normal equation for accuracy.
          6. Use `TimSort` to create the `InBlock`.

          I will share the code soon, which is a little messy at this time.

          Show
          mengxr Xiangrui Meng added a comment - Wrote a new implementation that gives ~5x speedup and better scalability, while the algorithm doesn't change. 1. Use `(Int, Int, Float)` as input, saving 4 bytes on rating. 2. Out link blocks are stored as `out: Array[Array [Int] ]`, where out(dstBlockId) contains the src indices (local to the src block) associated with the dstBlockId. 3. In link blocks are stored in a CSC format: class InBlock( srcIds: Array[Int], dstPtrs: Array[Int], dstEncodedIndices: Array[Int], ratings: Array[ Float ]) `dstEncodedIndices` contains encoded `dstBlockId` (high bits) and `dstLocalIndex` (low bits). Using this data structure, the subproblems can be solved one after another without allocating many AtA/Atb buffers. 4. The input ratings are stored in small batches to avoid ser/de overhead. 5. LAPACK's dppsv is used instead of dposv. The former only needs the triangular part. Double is used for constructing the normal equation for accuracy. 6. Use `TimSort` to create the `InBlock`. I will share the code soon, which is a little messy at this time.

            People

            • Assignee:
              mengxr Xiangrui Meng
              Reporter:
              mengxr Xiangrui Meng
            • Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 96h
                96h
                Remaining:
                Remaining Estimate - 96h
                96h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development