[SPARK-24300] generateLDAData in ml.cluster.LDASuite didn't set seed correctly - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.4.0
Component/s: ML
Labels: None
Target Version/s: 2.4.0

Description

https://github.com/apache/spark/blob/0d63eb8888d17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala

generateLDAData uses the same RNG instance in every partition to generate random data. This causes either duplicate rows in cluster mode or nondeterministic behavior in local mode, as this spark-shell session shows:

scala> val rng = new java.util.Random(10)
rng: java.util.Random = java.util.Random@78c5ef58

scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) }.collect().mkString("\n")
res12: String =
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)

We should create one RNG per partition to make it safe.
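
As a rough illustration of the "one RNG per partition" idea (a sketch only, not necessarily what the eventual patch does), the RNG can be constructed inside mapPartitionsWithIndex so that no single instance is shared through the closure. The object name PerPartitionRng, the helper generate, and the seed derivation seed + partitionIndex are assumptions made for this example; any scheme that gives each partition its own deterministic seed would serve.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper for illustration; not part of LDASuite.
object PerPartitionRng {

  // Generates 10 rows of 10 random ints, using one RNG per partition.
  def generate(sc: SparkContext, seed: Long): Seq[Seq[Int]] =
    sc.parallelize(1 to 10, numSlices = 4)
      .mapPartitionsWithIndex { (partitionIndex, iter) =>
        // The RNG is created inside the partition, so nothing is shared
        // or copied through the closure. The seed derivation below is an
        // assumption of this sketch.
        val rng = new java.util.Random(seed + partitionIndex)
        iter.map(_ => Seq.fill(10)(rng.nextInt(10)))
      }
      .collect()
      .toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("per-partition-rng")
    val sc = new SparkContext(conf)
    try {
      val first = generate(sc, seed = 10L)
      val second = generate(sc, seed = 10L)
      // Same seed and same partitioning should give the same data on every run.
      assert(first == second)
      first.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because each partition seeds its own generator from the base seed and its index, the output depends only on the seed and the partitioning, not on how tasks happen to be scheduled.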

cc: lu.DB josephkb

Issue Links

links to

[GitHub] Pull Request #21492 (ludatabricks)

People

Assignee: Lu Wang

Reporter: Xiangrui Meng

Shepherd: Joseph K. Bradley

Votes: 0

Watchers: 4

Dates

Created: 16/May/18 21:44

Updated: 04/Jun/18 23:08

Resolved: 04/Jun/18 23:08