Spark / SPARK-4148

PySpark's sample uses the same seed for all partitions

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.0
    • Fix Version/s: 1.0.3, 1.1.1, 1.2.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      The current way of distributing seeds leaves the random sequences for partitions i and i+1 offset by one: every partition starts from the same seed and merely advances the generator by its partition index before sampling.

      In [14]: import random
      
      In [15]: r1 = random.Random(10)
      
      In [16]: r1.randint(0, 1)
      Out[16]: 1
      
      In [17]: r1.random()
      Out[17]: 0.4288890546751146
      
      In [18]: r1.random()
      Out[18]: 0.5780913011344704
      
      In [19]: r2 = random.Random(10)
      
      In [20]: r2.randint(0, 1)
      Out[20]: 1
      
      In [21]: r2.randint(0, 1)
      Out[21]: 0
      
      In [22]: r2.random()
      Out[22]: 0.5780913011344704
      

      Here r1 models partition 1 (one warm-up draw) and r2 models partition 2 (two warm-up draws): the second value from partition 1 is the same as the first value from partition 2, so samples drawn in different partitions are correlated.
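One way to avoid this overlap (a sketch only; `partition_rng` is a hypothetical helper, not necessarily what the linked pull requests implement) is to derive a distinct seed per partition by mixing the global seed with the partition index, instead of advancing one shared sequence:

```python
import random

def partition_rng(seed, partition_index):
    """Hypothetical helper: build an independent RNG per partition.

    Rather than seeding every partition identically and skipping ahead
    by the partition index (which produces the offset-by-one overlap
    shown above), mix the global seed with the partition index so each
    partition gets its own seed.
    """
    # hash() of a tuple of small ints is stable across runs in CPython.
    mixed = hash((seed, partition_index)) & 0xFFFFFFFF
    return random.Random(mixed)

r1 = partition_rng(10, 1)  # plays the role of partition 1
r2 = partition_rng(10, 2)  # plays the role of partition 2
stream1 = [r1.random() for _ in range(3)]
stream2 = [r2.random() for _ in range(3)]
# stream1 and stream2 are now unrelated sequences; in particular,
# stream2 is not stream1 shifted by one position.
```

With this scheme the two streams no longer share values at any fixed offset, while the result stays deterministic for a given (seed, partition) pair.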

        Issue Links

          Activity

          apachespark Apache Spark added a comment -

          User 'mengxr' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3010

          apachespark Apache Spark added a comment -

          User 'mengxr' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3104

          apachespark Apache Spark added a comment -

          User 'mengxr' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3106

          mengxr Xiangrui Meng added a comment -

          Issue resolved by pull request 3104
          https://github.com/apache/spark/pull/3104

          mengxr Xiangrui Meng added a comment -

          Reopening this issue because branch-1.0 is not fixed.

          joshrosen Josh Rosen added a comment -

          I've merged the backport into branch-1.0 (for 1.0.3), so I think that completes the backports. Therefore, I'm going to resolve this as 'Fixed'.


            People

            • Assignee:
              mengxr Xiangrui Meng
              Reporter:
              mengxr Xiangrui Meng
            • Votes:
              0
              Watchers:
              3
