Spark / SPARK-24300

generateLDAData in ml.clustering.LDASuite didn't set seed correctly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: ML
    • Labels: None

    Description

      https://github.com/apache/spark/blob/0d63eb8888d17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala

      generateLDAData uses the same RNG in all partitions to generate random data. Depending on the deploy mode, this causes either duplicate rows (cluster mode) or nondeterministic behavior (local mode), as this spark-shell session shows:

      scala> val rng = new java.util.Random(10)
      rng: java.util.Random = java.util.Random@78c5ef58
      
      scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) }.collect().mkString("\n")
      res12: String =
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
      List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)

      We should create one RNG per partition to make it safe.
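
      A minimal sketch of the one-RNG-per-partition idea, assuming a local SparkSession. The object name PerPartitionRng, the generate helper, and the seed + partition index seeding scheme are illustrative assumptions, not necessarily what the actual fix uses:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration; not part of LDASuite.
object PerPartitionRng {

  // Builds `rows` random rows. Each partition creates its own RNG inside the
  // task, seeded from the base seed plus the partition index (an assumed scheme).
  def generate(
      sc: SparkContext,
      seed: Long,
      rows: Int = 10,
      numPartitions: Int = 4): Array[Seq[Int]] = {
    sc.parallelize(1 to rows, numPartitions).mapPartitionsWithIndex { (idx, iter) =>
      // Created on the executor, once per partition, so no RNG state is
      // shared between tasks or shipped inside the closure.
      val rng = new java.util.Random(seed + idx)
      iter.map(_ => Seq.fill(10)(rng.nextInt(10)))
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("per-partition-rng")
      .getOrCreate()
    try {
      val first = generate(spark.sparkContext, seed = 10L)
      val second = generate(spark.sparkContext, seed = 10L)
      // Same seed and same partitioning should give the same rows on every run.
      assert(first.toSeq == second.toSeq)
      println(first.mkString("\n"))
    } finally {
      spark.stop()
    }
  }
}
```

      Under this scheme the output depends on the number of partitions as well as the seed, so the sketch pins numPartitions explicitly rather than relying on the default parallelism.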

      cc: lu.DB josephkb

          People

            Lu Wang (lu.DB)
            Xiangrui Meng (mengxr)
            Joseph K. Bradley