Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Incomplete
Affects Version/s: 1.5.0
Fix Version/s: None
Description
Here's a report on investigating test failures in StreamingKMeans in PySpark. (See Jenkins links below.)
It is a StreamingKMeans test which trains on a DStream with 2 batches and then tests on those same 2 batches. It fails here: https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144
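For reference, the failing test follows roughly this shape (a simplified sketch, not the exact code in tests.py; it assumes a SparkContext sc and StreamingContext ssc are already set up, and the variable names here are illustrative):

# Sketch of the test_trainOn_predictOn pattern: train and predict on the
# same two batches fed through a queue-backed DStream.
from pyspark.mllib.clustering import StreamingKMeans

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
input_stream = ssc.queueStream([sc.parallelize(batch) for batch in batches])

predict_results = []

def collect(rdd):
    # Record the predictions produced for each batch.
    predict_results.append(rdd.collect())

stkm.trainOn(input_stream)                        # updates the model per batch
stkm.predictOn(input_stream).foreachRDD(collect)  # predicts with the current model

ssc.start()
# The test then waits (with retries) until:
#   predict_results == [[0, 1, 1], [1, 0, 1]]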
I recreated the same test, with variants training on: (1) the original 2 batches, (2) just the first batch, (3) just the second batch, and (4) neither batch. Here is code that avoids streaming altogether, so it is clear exactly which batches were processed.
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
    stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
    return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
Results:
#######################
EXPECTED
[0, 1, 1]
[1, 0, 1]
#######################
Skip batch 0
[1, 0, 0]
[0, 1, 0]
#######################
Skip batch 1
[0, 1, 1]
[1, 0, 1]
#######################
Skip both batches (This is what we see in the test failures.)
[0, 1, 1]
[0, 0, 0]
Skipping both batches reproduces the failure. There is no randomness in StreamingKMeans here (the initial centers are fixed, not randomized), so the failing runs are evidently predicting against the untouched initial model: the test's predictions happen before either training batch has been processed.
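This can be sanity-checked without Spark at all: nearest-center assignment using only the fixed initial centers [0.0] and [1.0] (i.e., no training) reproduces exactly the failing output. A quick check in plain Python:

# Predict with only the initial centers [0.0] and [1.0],
# i.e., the model before any training batch has been applied.
centers = [0.0, 1.0]
batches = [[-0.5, 0.6, 0.8], [0.2, -0.1, 0.3]]

def nearest(x):
    # Index of the center closest to point x.
    return min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)

print([[nearest(x) for x in batch] for batch in batches])
# [[0, 1, 1], [0, 0, 0]]  -- matches the "Skip both batches" case above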
CC: tdas freeman-lab mengxr
Failure message:
======================================================================
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1147, in test_trainOn_predictOn
    self._eventually(condition, catch_assertions=True)
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 123, in _eventually
    raise lastValue
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 114, in _eventually
    lastValue = condition()
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1144, in condition
    self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
?              ^^^^

+ [[0, 1, 1], [1, 0, 1]]
?             +++      ^

----------------------------------------------------------------------
Ran 62 tests in 164.188s
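The _eventually helper seen in the traceback retries the assertion until a timeout expires and then re-raises the last failure, which is why a dropped or never-processed batch surfaces as this assertion error rather than a hang. It follows roughly this retry idiom (a sketch under those assumptions, not the exact helper in tests.py):

import time

def _eventually(condition, timeout=30.0, catch_assertions=False):
    # Re-run `condition` until it succeeds or `timeout` seconds elapse;
    # if it keeps failing, re-raise the most recent assertion error.
    deadline = time.time() + timeout
    last_exc = None
    while time.time() < deadline:
        try:
            return condition()
        except AssertionError as exc:
            if not catch_assertions:
                raise
            last_exc = exc  # remember the failure and retry
        time.sleep(0.01)
    raise last_exc if last_exc is not None else RuntimeError("timed out")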
Attachments
Issue Links
- is contained by: SPARK-10784 Flaky Streaming ML test umbrella (Resolved)
- relates to: SPARK-13691 Scala and Python generate inconsistent results (Resolved)
- links to