SPARK-4148

PySpark's sample uses the same seed for all partitions


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.0
    • Fix Version/s: 1.0.3, 1.1.1, 1.2.0
    • Component/s: PySpark
    • Labels: None

    Description

      The current way of distributing the seed gives partitions i and i+1 the same random sequence, offset by only one draw.

      In [14]: import random
      
      In [15]: r1 = random.Random(10)
      
      In [16]: r1.randint(0, 1)
      Out[16]: 1
      
      In [17]: r1.random()
      Out[17]: 0.4288890546751146
      
      In [18]: r1.random()
      Out[18]: 0.5780913011344704
      
      In [19]: r2 = random.Random(10)
      
      In [20]: r2.randint(0, 1)
      Out[20]: 1
      
      In [21]: r2.randint(0, 1)
      Out[21]: 0
      
      In [22]: r2.random()
      Out[22]: 0.5780913011344704
      

      So the second value from partition 1 is the same as the first value from partition 2.
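The overlap can be reproduced outside Spark with a minimal sketch. It assumes the seeding scheme is "same seed, then discard one draw per preceding partition", as the session above suggests. The function names are hypothetical and are not PySpark APIs. The discards use `random()` rather than `randint(0, 1)` so the result does not depend on how `randint` consumes the generator in a given Python version. The second function shows one possible alternative (mixing the partition index into the seed), offered as an illustration rather than as the fix that was applied.

```python
import random


def offset_stream(seed, partition, n):
    """Scheme described in this issue (assumed): every partition starts
    from the same seed and skips `partition` draws."""
    rng = random.Random(seed)
    for _ in range(partition):
        rng.random()  # discard one draw per preceding partition
    return [rng.random() for _ in range(n)]


def per_partition_stream(seed, partition, n):
    """Hypothetical alternative: derive a distinct seed per partition."""
    rng = random.Random(seed ^ partition)
    for _ in range(10):
        rng.random()  # burn a few draws, since the derived seeds are close
    return [rng.random() for _ in range(n)]


# With the offset scheme, partition 2 replays partition 1 shifted by one.
p1 = offset_stream(10, 1, 5)
p2 = offset_stream(10, 2, 5)
overlap = p1[1:] == p2[:-1]

# With distinct seeds, the shifted streams no longer coincide.
q1 = per_partition_stream(10, 1, 5)
q2 = per_partition_stream(10, 2, 5)
shifted = q1[1:] == q2[:-1]

print("offset scheme overlaps:", overlap)
print("per-partition seeds overlap:", shifted)
```

With the offset scheme, a sampler that compares each draw to a threshold makes nearly the same accept/reject decisions in adjacent partitions, which is why the overlap matters for `sample`.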

          People

            Assignee: Xiangrui Meng (mengxr)
            Reporter: Xiangrui Meng (mengxr)