Spark / SPARK-2612

ALS has data skew for popular product

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1.0
    • Component/s: MLlib
    • Labels: None

    Description

      Rating inputs usually contain a few popular products that are related to many users.
      groupByKey() in updateFeatures() may add an extra shuffle stage that gathers all data for a popular product into one task, because the partitioner of the RDD it produces may not be the one used by join().
      The following join() then has to shuffle the already aggregated product data. A single shuffle block can easily exceed 2 GB, and the shuffle fails as described in SPARK-1476.
      Increasing the number of blocks doesn't help.

      IMHO, groupByKey() should use the same partitioner as the other RDD in join(). Then groupByKey() and join() run in the same stage, and the shuffle reads many smaller blocks from the previous tasks instead of one aggregated block, so it does not hit the "2G" limit.
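
      To illustrate the argument above, here is a toy model in plain Python. It is not Spark code and not the actual fix. The partitioners (key % 8 and key % 5), the task count, and the record sizes are all made-up assumptions, and "size" is just a record count. It compares the largest shuffle block (data sent from one map task to one reduce partition) when groupByKey() and join() use different partitioners versus the same one.

```python
from collections import defaultdict

NUM_MAP_TASKS = 100          # assumed number of upstream tasks
POPULAR = 0                  # id of the popular product (assumption)
OTHERS = range(1, 10)        # ids of ordinary products (assumption)


def group_partitioner(key):
    """Hypothetical partitioner that groupByKey() ends up with."""
    return key % 8


def join_partitioner(key):
    """Hypothetical partitioner used by the following join()."""
    return key % 5


def max_block(map_outputs, partition_of):
    """Largest block sent from one map task to one reduce partition."""
    largest = 0
    for records in map_outputs:
        sizes = defaultdict(int)
        for key, size in records:
            sizes[partition_of(key)] += size
        largest = max(largest, max(sizes.values(), default=0))
    return largest


def group_by_key(map_outputs, partition_of, num_partitions):
    """Shuffle and aggregate: each reduce task holds one record per key."""
    reduce_side = [defaultdict(int) for _ in range(num_partitions)]
    for records in map_outputs:
        for key, size in records:
            reduce_side[partition_of(key)][key] += size
    return [list(groups.items()) for groups in reduce_side]


# Every upstream task emits 1000 units for the popular product
# and 1 unit for each ordinary product.
map_outputs = [
    [(POPULAR, 1000)] + [(p, 1) for p in OTHERS]
    for _ in range(NUM_MAP_TASKS)
]

# Plan A: groupByKey() with its own partitioner, then a second shuffle
# of the aggregated data for join().
grouped = group_by_key(map_outputs, group_partitioner, 8)
two_stage_max = max(
    max_block(map_outputs, group_partitioner),
    max_block(grouped, join_partitioner),
)

# Plan B: groupByKey() uses join()'s partitioner, so only one shuffle
# happens and it reads directly from the upstream tasks.
co_partitioned_max = max_block(map_outputs, join_partitioner)

print("largest block, separate partitioners:", two_stage_max)
print("largest block, shared partitioner:   ", co_partitioned_max)
```

      In this model the second shuffle of Plan A moves the popular product's whole aggregated group as a single block, so that block grows with the number of upstream tasks. With a shared partitioner the largest block stays bounded by what one upstream task emits.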

          People

            Assignee: Peng Zhang (peng.zhang)
            Reporter: Peng Zhang (peng.zhang)
            Votes: 0
            Watchers: 4
