Spark / SPARK-38816

Wrong comment in random matrix generator in spark-als algorithm

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.1.2, 3.2.1
    • Fix Version/s: 3.1.3, 3.3.0, 3.2.2
    • Component/s: ML
    • Labels: None

    Description

      In the Spark ALS algorithm, we need to initialize nonnegative factor matrices for users and items.

      In ALS:

       

      private def initialize[ID](
          inBlocks: RDD[(Int, InBlock[ID])],
          rank: Int,
          seed: Long): RDD[(Int, FactorBlock)] = {
        // Choose a unit vector uniformly at random from the unit sphere, but from the
        // "first quadrant" where all elements are nonnegative. This can be done by choosing
        // elements distributed as Normal(0,1) and taking the absolute value, and then normalizing.
        // This appears to create factorizations that have a slightly better reconstruction
        // (<1%) compared picking elements uniformly at random in [0,1].
        inBlocks.mapPartitions({ iter =>
          iter.map {
            case (srcBlockId, inBlock) =>
              val random: XORShiftRandom = new XORShiftRandom(byteswap64(seed ^ srcBlockId))
              val factors: Array[Array[Float]] = Array.fill(inBlock.srcIds.length) {
                val factor = Array.fill(rank)(random.nextGaussian().toFloat)
                val nrm: Float = blas.snrm2(rank, factor, 1)
                blas.sscal(rank, 1.0f / nrm, factor, 1)
                factor
              }
              (srcBlockId, factors)
          }
        }, preservesPartitioning = true)
      } 

      The comment says that each factor is drawn from the "first quadrant", with all elements nonnegative, by taking the absolute value of Normal(0,1) samples. The code, however, uses random.nextGaussian().toFloat and never takes the absolute value. The documentation of nextGaussian states that the values have mean 0.0 and standard deviation 1.0, so the method also returns negative numbers:

       

      /**
       * @return the next pseudorandom, Gaussian ("normally") distributed
       *         {@code double} value with mean {@code 0.0} and
       *         standard deviation {@code 1.0} from this random number
       *         generator's sequence
       */
      synchronized public double nextGaussian() {
          // See Knuth, ACP, Section 3.4.1 Algorithm C.
          if (haveNextNextGaussian) {
              haveNextNextGaussian = false;
              return nextNextGaussian;
          } else {
              double v1, v2, s;
              do {
                  v1 = 2 * nextDouble() - 1; // between -1 and 1
                  v2 = 2 * nextDouble() - 1; // between -1 and 1
                  s = v1 * v1 + v2 * v2;
              } while (s >= 1 || s == 0);
              double multiplier = StrictMath.sqrt(-2 * StrictMath.log(s)/s);
              nextNextGaussian = v2 * multiplier;
              haveNextNextGaussian = true;
              return v1 * multiplier;
          }
      }
       

       

      As a result, the initialized factor matrices contain negative values, which contradicts the comment.
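
      The mismatch can be checked with a small standalone sketch. It is not the Spark code: it assumes java.util.Random in place of XORShiftRandom and plain Scala arithmetic in place of the BLAS calls, and the names FactorInitSketch, normalize, gaussianFactor and absGaussianFactor are hypothetical. It compares factors built the way the code does with factors built the way the comment describes:

```scala
import java.util.Random

// Hypothetical standalone sketch; not part of Spark.
object FactorInitSketch {
  // Scale a vector to unit L2 norm (the role of blas.snrm2 + blas.sscal in the quoted code).
  def normalize(v: Array[Float]): Array[Float] = {
    val nrm = math.sqrt(v.map(x => x.toDouble * x).sum).toFloat
    v.map(_ / nrm)
  }

  // As in the quoted code: Normal(0,1) samples, then normalization.
  def gaussianFactor(rank: Int, random: Random): Array[Float] =
    normalize(Array.fill(rank)(random.nextGaussian().toFloat))

  // As in the quoted comment: absolute value of Normal(0,1) samples, then normalization.
  def absGaussianFactor(rank: Int, random: Random): Array[Float] =
    normalize(Array.fill(rank)(math.abs(random.nextGaussian()).toFloat))

  def main(args: Array[String]): Unit = {
    val rank = 10
    val numFactors = 1000
    val random = new Random(42L) // assumption: any seed works for this check
    val asCoded = Array.fill(numFactors)(gaussianFactor(rank, random))
    val asCommented = Array.fill(numFactors)(absGaussianFactor(rank, random))
    println(s"negative entries, as coded:     ${asCoded.flatten.count(_ < 0)}")
    println(s"negative entries, as commented: ${asCommented.flatten.count(_ < 0)}")
  }
}
```

      Because Normal(0,1) is symmetric around zero, roughly half of the entries produced the way the code does should be negative, while the variant with the absolute value should produce none.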

          People

            Assignee: Sean R. Owen (srowen)
            Reporter: Nikolay (NickAuir)

              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified