SPARK-10086

Flaky StreamingKMeans test in PySpark


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: DStreams, MLlib, PySpark, Tests

    Description

      Here's a report on investigating test failures in StreamingKMeans in PySpark. (See Jenkins links below.)

      It is a StreamingKMeans test which trains on a DStream with 2 batches and then predicts on those same 2 batches. It fails at this assertion: https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144
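
      For context, here is a simplified sketch of what the failing test does (not the exact code in tests.py): it pushes the two batches through a queue-backed DStream, calls trainOn and predictOn on that stream, and then polls the collected predictions with _eventually until they match the expected result. It assumes an existing SparkContext sc, as provided by the test fixture.

      # Simplified sketch of test_trainOn_predictOn; assumes an existing SparkContext `sc`.
      from pyspark.streaming import StreamingContext
      from pyspark.mllib.clustering import StreamingKMeans
      
      ssc = StreamingContext(sc, 1.0)  # 1-second batch interval
      batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
      input_stream = ssc.queueStream([sc.parallelize(batch) for batch in batches])
      
      stkm = StreamingKMeans(decayFactor=0.0, k=2)
      stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])
      
      predict_results = []
      
      def collect(rdd):
          # queueStream keeps emitting empty RDDs after the queue drains,
          # so only record non-empty prediction batches.
          results = rdd.collect()
          if results:
              predict_results.append(results)
      
      stkm.trainOn(input_stream)                         # update the model on each batch
      stkm.predictOn(input_stream).foreachRDD(collect)   # predict on the same batches
      ssc.start()
      # The test then waits (via _eventually) for predict_results to equal
      # [[0, 1, 1], [1, 0, 1]], which is where the flaky assertion failure occurs.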

      I recreated the same test, with variants training on: (1) the original 2 batches, (2) just the first batch, (3) just the second batch, and (4) neither batch. Here is code which avoids Streaming altogether, to identify which batches were actually processed.

      from pyspark import SparkContext
      from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel
      
      # The original snippet assumed an existing SparkContext `sc` (e.g. the pyspark shell);
      # create a local one here so the script runs standalone.
      sc = SparkContext("local[2]", "StreamingKMeansRepro")
      
      batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
      batches = [sc.parallelize(batch) for batch in batches]
      
      # Same setup as the test: no decay, k=2, fixed (non-random) initial centers.
      stkm = StreamingKMeans(decayFactor=0.0, k=2)
      stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])
      
      # Train
      def update(rdd):
          stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)
      
      # Remove one or both of these lines to test skipping batches.
      update(batches[0])
      update(batches[1])
      
      # Test: predict on both batches and print the cluster assignments.
      def predict(rdd):
          return stkm._model.predict(rdd)
      
      print(predict(batches[0]).collect())
      print(predict(batches[1]).collect())
      

      Results:

      ####################### EXPECTED
      
      [0, 1, 1]                                                                       
      [1, 0, 1]
      
      ####################### Skip batch 0
      
      [1, 0, 0]
      [0, 1, 0]
      
      ####################### Skip batch 1
      
      [0, 1, 1]
      [1, 0, 1]
      
      ####################### Skip both batches  (This is what we see in the test failures.)
      
      [0, 1, 1]
      [0, 0, 0]
      

      Skipping both batches reproduces the failing output. Since there is no randomness in the StreamingKMeans algorithm (the initial centers are fixed, not randomized), the flaky failure most likely means the prediction ran before either training batch had been applied to the model.
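
      As a sanity check, the failing output is exactly what a model holding only the initial centers produces. A minimal check, reusing the batches and SparkContext from the snippet above:

      # Predict with an untrained model (initial centers only, no updates applied).
      untrained = StreamingKMeansModel([[0.0], [1.0]], [1.0, 1.0])
      print(untrained.predict(batches[0]).collect())  # [0, 1, 1]
      print(untrained.predict(batches[1]).collect())  # [0, 0, 0]  -- matches the failure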

      CC: tdas freeman-lab mengxr

      Failure message:

      ======================================================================
      FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
      Test that prediction happens on the updated model.
      ----------------------------------------------------------------------
      Traceback (most recent call last):
        File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1147, in test_trainOn_predictOn
          self._eventually(condition, catch_assertions=True)
        File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 123, in _eventually
          raise lastValue
        File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 114, in _eventually
          lastValue = condition()
        File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1144, in condition
          self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
      AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]
      
      First differing element 1:
      [0, 0, 0]
      [1, 0, 1]
      
      - [[0, 1, 1], [0, 0, 0]]
      ?                 ^^^^
      
      + [[0, 1, 1], [1, 0, 1]]
      ?              +++   ^
      
      
      ----------------------------------------------------------------------
      Ran 62 tests in 164.188s
      

      Attachments

        1. flakyRepro.py (1.0 kB) - Bryan Cutler


            People

              Assignee: Unassigned
              Reporter: Joseph K. Bradley (josephkb)
              Votes: 0
              Watchers: 7
