Description
The Rating input usually contains some popular products that are associated with a very large number of users.
The groupByKey() in updateFeatures() may introduce one extra shuffle stage to gather the data for a popular product into a single task, because its RDD's partitioner may not be the same as the partitioner used by the subsequent join().
The following join() then needs to shuffle the aggregated product data again. A single shuffle block can easily exceed 2 GB, and the shuffle fails as described in SPARK-1476.
Increasing the number of blocks does not help.
IMHO, groupByKey() should use the same partitioner as the other RDD in the join(). Then groupByKey() and join() fall into the same stage, and the shuffle data coming from the many upstream tasks will not hit the "2G" limit.
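A minimal sketch of the proposed fix. The method and variable names (updateFeatures, products, messages) are illustrative, not the actual ALS internals; the point is only that passing the join side's partitioner to groupByKey() makes the join a narrow dependency:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical simplification of updateFeatures(): "messages" are the
// per-product contributions being aggregated, "products" is the RDD
// joined against afterwards.
def updateFeatures(
    products: RDD[(Int, Array[Double])],
    messages: RDD[(Int, Array[Double])],
    numPartitions: Int) = {
  val partitioner = new HashPartitioner(numPartitions)

  // Without an explicit partitioner, groupByKey() may not match the
  // partitioner join() chooses, forcing a second shuffle of the
  // already-aggregated (and potentially huge) grouped data.
  val grouped = messages.groupByKey(partitioner)

  // Both sides now share the same partitioner, so join() is planned in
  // the same stage with no extra shuffle, avoiding oversized shuffle blocks.
  grouped.join(products.partitionBy(partitioner))
}
```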