Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-2831

DistributedDataGeneratorTest.testGenerateRandomData is flaky

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10.0
    • Fix Version/s: 1.10.0
    • Component/s: spark, test
    • Labels:
      None

      Description

      Saw this once last month and again today, so not super flaky but still worth fixing:

      1) testGenerateRandomData(org.apache.kudu.spark.tools.DistributedDataGeneratorTest)
      java.lang.AssertionError: expected:<100> but was:<99>
      	at org.junit.Assert.fail(Assert.java:88)
      	at org.junit.Assert.failNotEquals(Assert.java:834)
      	at org.junit.Assert.assertEquals(Assert.java:645)
      	at org.junit.Assert.assertEquals(Assert.java:631)
      	at org.apache.kudu.spark.tools.DistributedDataGeneratorTest.testGenerateRandomData(DistributedDataGeneratorTest.scala:58)
      

      I talked about this with Grant Henke when it last happened. The issue appears to be in the LongAccumulator used to track collisions in the data generator. Before the failure, the test logged this:

      02:22:39.533 [INFO - main] (DistributedDataGenerator.scala:134) Rows written: 99
      02:22:39.533 [INFO - main] (DistributedDataGenerator.scala:135) Collisions: 1
      

      The assert code looks like this:

          val collisions = ss.sparkContext.longAccumulator("row_collisions").value
          // Collisions could cause the number of row to be less than the number set.
          assertEquals(numRows - collisions, rdd.collect.length)
      

      So the value of this LongAccumulator was zero even though there was one collision. Our thinking was that accumulators like these were updated asynchronously and so if we don't wait for the entire job to finish, we may not be getting their up-to-date values at assertion time.

      We publish other LongAccumulators in kudu-spark, but AFAICT this is the only one that is asserted on. Nevertheless, it would be great if we could solve this in some generic way so that if someone wrote a test that used a different LongAccumulator, the race could be avoided.

        Attachments

          Activity

            People

            • Assignee:
              wdberkeley William Berkeley
              Reporter:
              adar Adar Dembo
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: