Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 1.1.0
- Fix Version/s: None
Description
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); because that distribution is popular, the bug is likely to affect many users.
Steps to reproduce are:
foo = sc.parallelize(range(1000), 5)
foo.takeSample(False, 10)
Returns:
PythonException: Traceback (most recent call last):
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func
    if self.getUniformSample(split) <= self._fraction:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample
    self.initRandomGenerator(split)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator
    self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
In PySpark's RDDSamplerBase class from pyspark.rddsampler we use:
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
In previous versions of NumPy, a random seed larger than 2 ** 32 would be silently truncated to fit in 32 bits. A recent patch changed this so that out-of-range seeds raise a ValueError instead (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). On 64-bit systems, where sys.maxint is 2 ** 63 - 1, sampling from (0, sys.maxint) almost always yields ints larger than 2 ** 32, which effectively breaks all sampling operations in PySpark (unless the seed is set manually).
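The NumPy behavior change can be reproduced without Spark at all; a minimal sketch, assuming NumPy v1.9 on a 64-bit Python 2 build:

import sys
import random
import numpy

# On a 64-bit build sys.maxint is 2 ** 63 - 1, so this seed almost
# always falls outside NumPy 1.9's permitted range of [0, 2 ** 32 - 1].
seed = random.randint(0, sys.maxint)
numpy.random.RandomState(seed)
# ValueError: Seed must be between 0 and 4294967295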
I am putting a PR together now (the fix is very simple!).
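The fix will presumably just keep the generated seed inside NumPy's valid range; a sketch of one possible one-line change in pyspark.rddsampler (the actual PR may differ):

# Draw the default seed from [0, 2 ** 32 - 1], the range that
# numpy.random.RandomState accepts, rather than (0, sys.maxint).
self._seed = seed if seed is not None else random.randint(0, 2 ** 32 - 1)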