[SPARK-4148] PySpark's sample uses the same seed for all partitions

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.2, 1.1.0
Fix Version/s: 1.0.3, 1.1.1, 1.2.0
Component/s: PySpark
Labels: None
Target Version/s: 1.0.3, 1.1.1, 1.2.0

Description

The current seed distribution scheme starts every partition from the same seed, so the random sequences for partitions i and i+1 are identical except for an offset of 1.

In [14]: import random

In [15]: r1 = random.Random(10)

In [16]: r1.randint(0, 1)
Out[16]: 1

In [17]: r1.random()
Out[17]: 0.4288890546751146

In [18]: r1.random()
Out[18]: 0.5780913011344704

In [19]: r2 = random.Random(10)

In [20]: r2.randint(0, 1)
Out[20]: 1

In [21]: r2.randint(0, 1)
Out[21]: 0

In [22]: r2.random()
Out[22]: 0.5780913011344704

So the second value from partition 1 (Out[18]) is the same as the first value from partition 2 (Out[22]).
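
The overlap can also be reproduced outside Spark. What follows is a minimal sketch, not the PySpark sampler code: `partition_stream` is a hypothetical helper that assumes each partition builds its generator from the same seed and then discards one draw per partition index. It discards with `random()` rather than `randint(0, 1)` so that the result does not depend on how a given Python version implements `randint`.

```python
import random


def partition_stream(seed, partition_index, n):
    """Hypothetical helper: return the first n values a partition would draw.

    Assumes every partition seeds its generator identically and then
    discards `partition_index` draws to "separate" the streams.
    """
    rng = random.Random(seed)
    for _ in range(partition_index):
        rng.random()  # discarded draw; only shifts the stream by one position
    return [rng.random() for _ in range(n)]


p1 = partition_stream(10, 1, 5)
p2 = partition_stream(10, 2, 5)

# Partition 2's stream is partition 1's stream shifted by one position.
print(p1[1:] == p2[:-1])  # True
```

Because both generators start from the same state, skipping draws only shifts the stream, so neighbouring partitions share all but one of their values.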

Issue Links

links to

[Github] Pull Request #3010 (mengxr)

[Github] Pull Request #3104 (mengxr)

[Github] Pull Request #3106 (mengxr)

People

Assignee: Xiangrui Meng

Reporter: Xiangrui Meng

Votes: 0

Watchers: 3

Dates

Created: 30/Oct/14 02:31

Updated: 19/Dec/14 03:58

Resolved: 19/Dec/14 03:58