Details
- Improvement
- Status: Closed
- Minor
- Resolution: Fixed
- 1.3.0
- None
Description
Random number generation in Spark has to be handled carefully: an RNG referenced inside a closure is serialized along with it, so its state is copied to the workers and every partition would draw the same sequence. A separate RNG therefore needs to be seeded for each partition. At present, each place in Spark's libraries that uses random numbers re-implements this seeding, which leaves room for mistakes.
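The pitfall can be reproduced without a cluster. The following is a minimal sketch in plain Python, not Spark code: a deep copy stands in for serializing the RNG to each worker, and the `base_seed + index` combination is only an illustrative assumption, not the scheme used by Spark.

```python
import copy
import random

# A single RNG created on the "driver".
driver_rng = random.Random(42)

# Simulate shipping the RNG to three workers: each one receives a copy
# of the same internal state (deep copy stands in for serialization).
worker_rngs = [copy.deepcopy(driver_rng) for _ in range(3)]

# Every worker now draws the identical "random" sequence.
shared_draws = [[rng.random() for _ in range(3)] for rng in worker_rngs]

# Seeding a separate RNG per partition avoids the duplication.
# Adding the partition index to the base seed is an illustrative choice.
base_seed = 42
partition_rngs = [random.Random(base_seed + index) for index in range(3)]
separate_draws = [[rng.random() for _ in range(3)] for rng in partition_rngs]
```

Here `shared_draws` contains three identical lists, while the lists in `separate_draws` differ from one another.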
It would be useful to standardize RNG seeding through utility functions, or through random number generation functions that can be called from Spark pipelines.
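One possible shape for such a utility is sketched below. This is only an illustration of the idea, not the API that was eventually added to Spark: the names `partition_seed`, `with_partition_rng`, and `add_noise` are hypothetical, and deriving the seed by hashing the base seed with the partition index is an assumption. The partitions are simulated with plain lists so the sketch runs without Spark.

```python
import hashlib
import random


def partition_seed(base_seed, partition_index):
    """Derive a reproducible per-partition seed (hypothetical helper)."""
    digest = hashlib.sha256(f"{base_seed}:{partition_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def with_partition_rng(base_seed, func):
    """Wrap func(rng, iterator) into a function of (index, iterator)."""
    def run(partition_index, iterator):
        # The RNG is created on the worker side, once per partition.
        rng = random.Random(partition_seed(base_seed, partition_index))
        return func(rng, iterator)
    return run


def add_noise(rng, iterator):
    """Example user function: add uniform noise in [0, 1) to each value."""
    for value in iterator:
        yield value + rng.random()


# Local stand-in for an RDD with two partitions.
partitions = [[0, 0, 0], [0, 0, 0]]
noisy = with_partition_rng(7, add_noise)
first_run = [list(noisy(i, iter(p))) for i, p in enumerate(partitions)]
second_run = [list(noisy(i, iter(p))) for i, p in enumerate(partitions)]
```

With the same base seed the two runs produce the same values, while the two partitions draw different sequences. In PySpark, a wrapper of this shape could be passed to `rdd.mapPartitionsWithIndex(...)`, which calls its argument with the partition index and an iterator over that partition's elements.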