Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Incomplete
Affects Version/s: 1.5.0
Fix Version/s: None
Description
Here's a report on investigating test failures in StreamingKMeans in PySpark. (See Jenkins links below.)
It is a StreamingKMeans test which trains on a DStream with 2 batches and then tests on those same 2 batches. It fails here: https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144
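For reference, the failing test follows roughly this shape (a simplified sketch, not the exact code in tests.py; it assumes a SparkContext sc and StreamingContext ssc are already set up, and the variable names here are illustrative):

# Sketch of the test_trainOn_predictOn pattern: train and predict on the
# same two batches fed through a queue-backed DStream.
from pyspark.mllib.clustering import StreamingKMeans

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
input_stream = ssc.queueStream([sc.parallelize(batch) for batch in batches])

predict_results = []

def collect(rdd):
    # Record the predictions produced for each batch.
    predict_results.append(rdd.collect())

stkm.trainOn(input_stream)                        # updates the model per batch
stkm.predictOn(input_stream).foreachRDD(collect)  # predicts with the current model

ssc.start()
# The test then waits (with retries) until:
#   predict_results == [[0, 1, 1], [1, 0, 1]]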
I recreated the same test, with variants training on: (1) the original 2 batches, (2) just the first batch, (3) just the second batch, and (4) neither batch. Here is code that avoids streaming altogether, so it is clear exactly which batches were processed.
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
    stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
    return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
Results:
#######################
EXPECTED
[0, 1, 1]
[1, 0, 1]
#######################
Skip batch 0
[1, 0, 0]
[0, 1, 0]
#######################
Skip batch 1
[0, 1, 1]
[1, 0, 1]
#######################
Skip both batches (This is what we see in the test failures.)
[0, 1, 1]
[0, 0, 0]
Skipping both batches reproduces the failure. There is no randomness in StreamingKMeans here (the initial centers are fixed, not randomized), so the failing runs are evidently predicting against the untouched initial model: the test's predictions happen before either training batch has been processed.
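This can be sanity-checked without Spark at all: nearest-center assignment using only the fixed initial centers [0.0] and [1.0] (i.e., no training) reproduces exactly the failing output. A quick check in plain Python:

# Predict with only the initial centers [0.0] and [1.0],
# i.e., the model before any training batch has been applied.
centers = [0.0, 1.0]
batches = [[-0.5, 0.6, 0.8], [0.2, -0.1, 0.3]]

def nearest(x):
    # Index of the center closest to point x.
    return min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)

print([[nearest(x) for x in batch] for batch in batches])
# [[0, 1, 1], [0, 0, 0]]  -- matches the "Skip both batches" case above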
CC: tdas freeman-lab mengxr
Failure message:
======================================================================
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1147, in test_trainOn_predictOn
    self._eventually(condition, catch_assertions=True)
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 123, in _eventually
    raise lastValue
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 114, in _eventually
    lastValue = condition()
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1144, in condition
    self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
?              ^^^^

+ [[0, 1, 1], [1, 0, 1]]
?             +++      ^

----------------------------------------------------------------------
Ran 62 tests in 164.188s
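The _eventually helper seen in the traceback retries the assertion until a timeout expires and then re-raises the last failure, which is why a dropped or never-processed batch surfaces as this assertion error rather than a hang. It follows roughly this retry idiom (a sketch under those assumptions, not the exact helper in tests.py):

import time

def _eventually(condition, timeout=30.0, catch_assertions=False):
    # Re-run `condition` until it succeeds or `timeout` seconds elapse;
    # if it keeps failing, re-raise the most recent assertion error.
    deadline = time.time() + timeout
    last_exc = None
    while time.time() < deadline:
        try:
            return condition()
        except AssertionError as exc:
            if not catch_assertions:
                raise
            last_exc = exc  # remember the failure and retry
        time.sleep(0.01)
    raise last_exc if last_exc is not None else RuntimeError("timed out")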
Attachments
Issue Links
- is contained by: SPARK-10784 Flaky Streaming ML test umbrella (Resolved)
- relates to: SPARK-13691 Scala and Python generate inconsistent results (Resolved)
- links to