Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-8708

MatrixFactorizationModel.predictAll() populates single partition only

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.3.0
    • 1.5.0
    • MLlib
    • None

    Description

      When using mllib.recommendation.ALS the RDD returned by .predictAll() has all values pushed into single partition despite using quite high parallelism.

      This degrades performance of further processing (I can obviously run .partitionBy()) to balance it but that's still too costly (ie if running .predictAll() in loop for thousands of products) and should be possible to do it rather somehow on the model (automatically)).

      Bellow is an example on tiny sample (same on large dataset):

      pyspark
      >>> r1 = (1, 1, 1.0)
      >>> r2 = (1, 2, 2.0)
      >>> r3 = (2, 1, 2.0)
      >>> r4 = (2, 2, 2.0)
      >>> r5 = (3, 1, 1.0)
      >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
      >>> ratings.getNumPartitions()
      5
      >>> users = ratings.map(itemgetter(0)).distinct()
      >>> model = ALS.trainImplicit(ratings, 1, seed=10)
      >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
      >>> predictions_for_2.glom().map(len).collect()
      [0, 0, 3, 0, 0]
      

      Attachments

        Activity

          People

            viirya L. C. Hsieh
            antonymayi Antony Mayi
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: