SPARK-10431

Flaky test: o.a.s.metrics.InputOutputMetricsSuite - input metrics with cache and coalesce



    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.1, 1.6.0
    • Component/s: Tests
    • Labels:


      I sometimes get test failures such as:

      • input metrics with cache and coalesce *** FAILED ***
        5994472 did not equal 6044472 (InputOutputMetricsSuite.scala:101)

      After adding some debug output to track this down, it appears to be a timing issue in the test.

      test("input metrics with cache and coalesce") {
        // prime the cache manager
        val rdd = sc.textFile(tmpFilePath, 4).cache()
        rdd.collect() // <== #1

        val bytesRead = runAndReturnBytesRead { // <== #2
          rdd.count()
        }

        val bytesRead2 = runAndReturnBytesRead {
          rdd.coalesce(4).count()
        }

        // for count and coalesce, the same bytes should be read.
        assert(bytesRead != 0)
        assert(bytesRead2 == bytesRead) // fails
      }

      The runAndReturnBytesRead function (#2) adds a SparkListener that monitors TaskEnd events and totals the bytes read by the enclosed block, e.g. rdd.count().

      When the test fails, the new listener has received a TaskEnd event from an earlier task (e.g. from #1), which throws off the total. This happens because the asynchronous thread that processes the event queue and notifies the listeners has not yet processed one of the TaskEnd events when the new listener is added, so the new listener receives that event as well.

      There is a simple fix to the test: wait for the event queue to be empty before adding the new listener. I will submit a pull request for that.
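The race and the proposed fix can be reproduced without Spark. The following is a minimal sketch, not Spark's listener bus: `ToyBus`, `TaskEnd`, `measure`, the gate latch, and the byte counts 100 and 1000 are all hypothetical names and values chosen for illustration. The latch holds the dispatcher thread back to imitate a slow machine, so the outcome is deterministic.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for a task-end event; not Spark's own event class.
final case class TaskEnd(bytesRead: Long)

// Toy asynchronous bus: post() only enqueues; a daemon thread delivers later.
// The gate lets the demo delay the dispatcher, imitating a slow machine.
final class ToyBus(gate: CountDownLatch) {
  private val queue = new LinkedBlockingQueue[TaskEnd]()
  private val posted = new AtomicLong(0)
  private val delivered = new AtomicLong(0)
  @volatile private var listeners = Vector.empty[TaskEnd => Unit]

  private val dispatcher = new Thread(new Runnable {
    def run(): Unit = {
      gate.await()
      while (true) {
        val event = queue.take()
        // Delivery goes to whoever is registered at delivery time,
        // not to whoever was registered when the event was posted.
        listeners.foreach(l => l(event))
        delivered.incrementAndGet()
      }
    }
  })
  dispatcher.setDaemon(true)
  dispatcher.start()

  def post(event: TaskEnd): Unit = { posted.incrementAndGet(); queue.put(event) }

  def addListener(l: TaskEnd => Unit): Unit = synchronized { listeners = listeners :+ l }

  // Block until every posted event has been delivered, or the timeout expires.
  def waitUntilEmpty(timeoutMillis: Long): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMillis
    while (delivered.get() < posted.get()) {
      if (System.currentTimeMillis() > deadline) return false
      Thread.sleep(5)
    }
    true
  }
}

object ListenerRaceDemo {
  // Mirrors the test: an earlier job (#1) posts an event, then a measuring
  // listener (#2) is added and the measured job posts its own event.
  def measure(drainFirst: Boolean): Long = {
    val gate = new CountDownLatch(1)
    val bus = new ToyBus(gate)
    bus.post(TaskEnd(100)) // left over from the earlier job (#1)
    if (drainFirst) {
      gate.countDown()
      require(bus.waitUntilEmpty(5000)) // the fix: flush stale events first
    }
    val total = new AtomicLong(0)
    bus.addListener(e => { total.addAndGet(e.bytesRead); () })
    gate.countDown() // no-op if the gate is already open
    bus.post(TaskEnd(1000)) // the measured job
    require(bus.waitUntilEmpty(5000))
    total.get()
  }

  def main(args: Array[String]): Unit = {
    println(s"without drain: ${measure(drainFirst = false)}")
    println(s"with drain:    ${measure(drainFirst = true)}")
  }
}
```

Without the drain, the late listener also counts the stale 100 bytes and reports 1100. With the drain, it reports only the 1000 bytes of the measured job.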

      I also notice that many of the tests add a listener and, as there is no removeSparkListener API, the number of listeners on the context builds up while the suite runs. This is probably why I see this issue when running on slow machines.
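To illustrate the build-up, here is a small sketch of what scoped registration could look like if a remove operation were available. `ListenerRegistry`, `withListener`, `leaked`, and `cleaned` are hypothetical names for this example only; they are not part of Spark's API.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical registry with both add and remove; illustration only.
final class ListenerRegistry[E] {
  private val listeners = ArrayBuffer.empty[E => Unit]
  def add(l: E => Unit): Unit = { listeners += l; () }
  def remove(l: E => Unit): Unit = { listeners -= l; () }
  def size: Int = listeners.size
  def publish(e: E): Unit = listeners.foreach(l => l(e))
}

object ListenerCleanupDemo {
  // Loan pattern: register for the duration of body, always unregister after.
  def withListener[E, A](reg: ListenerRegistry[E], l: E => Unit)(body: => A): A = {
    reg.add(l)
    try body finally reg.remove(l)
  }

  // Each simulated test adds a listener and never removes it.
  def leaked(tests: Int): Int = {
    val reg = new ListenerRegistry[Long]
    (1 to tests).foreach(_ => reg.add((_: Long) => ()))
    reg.size
  }

  // Each simulated test registers its listener only while it runs.
  def cleaned(tests: Int): Int = {
    val reg = new ListenerRegistry[Long]
    (1 to tests).foreach { _ =>
      val listener: Long => Unit = _ => ()
      withListener(reg, listener) { reg.publish(1L) }
    }
    reg.size
  }
}
```

With only an add operation, every event is delivered to every listener that any earlier test registered, so dispatch work grows with the number of tests already run.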

      A wider question may be: should a listener receive events that occurred before it was added?




            • Assignee:
              robbinspg Peter George Robbins

