  1. Spark
  2. SPARK-24300

generateLDAData in ml.clustering.LDASuite does not set the seed correctly


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: ML
    • Labels:
      None

      Description

      https://github.com/apache/spark/blob/0d63eb8888d17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala

       

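      generateLDAData uses the same RNG in all partitions to generate random data. Because the RNG is captured in the task closure, this causes either duplicate rows in cluster mode, where each task starts from its own copy of the same RNG state, or nondeterministic behavior in local mode: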

      scala> val rng = new java.util.Random(10)
      rng: java.util.Random = java.util.Random@78c5ef58
      
      scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) }.collect().mkString("\n")
      res12: String =
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)

      We should create one RNG per partition to make it safe.
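
One way to do this is sketched below, assuming the partition index is mixed into the seed. The helper name `randomRows` and its parameters are hypothetical and chosen for illustration; this is not the code from LDASuite or from the eventual fix.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper (not from LDASuite): builds `rows` rows of `cols`
// random ints in [0, 10), using one RNG per partition.
def randomRows(
    sc: SparkContext,
    rows: Int,
    cols: Int,
    seed: Long,
    numPartitions: Int): RDD[Seq[Int]] = {
  sc.parallelize(1 to rows, numPartitions).mapPartitionsWithIndex { (partitionIndex, iter) =>
    // The RNG is created inside the task, so no mutable RNG state is
    // captured from the driver. Adding the partition index to the seed
    // gives each partition its own starting state.
    val rng = new java.util.Random(seed + partitionIndex)
    iter.map(_ => Seq.fill(cols)(rng.nextInt(10)))
  }
}
```

In the shell, `randomRows(sc, 10, 10, seed = 10L, numPartitions = 4).collect().mkString("\n")` should give the same rows on every run, provided the seed and the number of partitions stay fixed.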

       

      cc: Lu Wang, Joseph K. Bradley


            People

            • Assignee: Lu Wang (lu.DB)
            • Reporter: Xiangrui Meng (mengxr)
            • Shepherd: Joseph K. Bradley
            • Votes: 0
            • Watchers: 4
