  Flume / FLUME-2625

There are several unstable tests within FLUME


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0.1
    • Fix Version/s: None
    • Component/s: Test
    • Labels:
      None
    • Environment:

      RHEL 7.1 / x86_64 / OpenJDK 1.7

      Description

      Hi,

      I'm working on porting Flume to a RHEL 7.1 / PPC64LE / IBM JVM 1.7 environment.
      As an example, I've found that the .source.TestSyslogUdpSource test fails, but not always: 7 times out of 10 tries. Testing on RHEL 7.1 / x86_64 / IBM JVM, I've also had random failures.
      Running the same .source.TestSyslogUdpSource test in a RHEL 7.1 / x86_64 / OpenJDK 1.7 environment, I've found that it fails only once out of 30 tries: it is an "unstable" test.

      In order to find which test issues are specific to the PPC64 or IBM JVM environment, I've run all the FLUME tests 10 times in the RHEL 7.1 / x86_64 / OpenJDK 1.7 environment, which I call my "reference" environment.

      Then, using a tool that compares all the results, I've found that 16 tests are "unstable" in my "reference" environment (x86_64/OpenJDK).
      By "unstable", I mean that the results vary even though the environment is exactly the same.

      These tests are:

      .api.TestLoadBalancingRpcClient
      .api.TestThriftRpcClient
      .channel.file.TestFileChannelRestart
      .channel.TestSpillableMemoryChannel
      .instrumentation.http.TestHTTPMetricsServer
      .sink.TestAvroSink
      .sink.TestThriftSink
      .source.avroLegacy.TestLegacyAvroSource
      .source.http.TestHTTPSource
      .source.TestAvroSource
      .source.TestExecSource
      .source.TestMultiportSyslogTCPSource
      .source.TestSyslogTcpSource
      .source.TestSyslogUdpSource
      .source.TestThriftSource
      .source.thriftLegacy.TestThriftLegacySource

      About ".source.TestSyslogUdpSource" test, my analysis is that the test code is not reliable since the test checks that some data is correct without checking that all the "messages" have arrived (sometimes, a message has not arrived in time, and a reference is NULL).
      Adding "sleep(1000) to the test with IBM JVM, the test then failed only 3 times out of 10.

      So, I think that several FLUME tests are coded in a way that is not 100% reliable. Or it could also be that some core code of FLUME is not 100% reliable.

      I mean to say that some code may have been written based on the specific behaviour of the OpenJDK Java Virtual Machine, which was used for testing. A change in the order in which threads are launched, or in the time needed to send messages within the JVM/OS, may lead to issues that are not correctly handled by the code (mainly test code, but maybe core code too). And it seems that, though perfectly correct, the IBM JVM does not behave exactly the same way as OpenJDK.
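      Where a test needs to know that the source's worker thread has really finished handling the messages before asserting, an explicit synchronisation point is more robust than assuming a thread-launch order or a delay. A minimal sketch of that pattern (again, only an illustration, not how the Flume sources or tests are actually structured):

      {code:java}
      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.TimeUnit;

      public class LatchExample {

        public static void main(String[] args) throws InterruptedException {
          final int messages = 3;
          final CountDownLatch processed = new CountDownLatch(messages);

          // Stand-in for the worker thread that delivers messages to the channel.
          Thread worker = new Thread(new Runnable() {
            @Override
            public void run() {
              for (int i = 0; i < messages; i++) {
                // ... deliver one message ...
                processed.countDown();  // signal that one message is fully handled
              }
            }
          });
          worker.start();

          // The "test" thread blocks until every message is handled or 5 s pass,
          // instead of assuming anything about thread scheduling or JVM timing.
          if (!processed.await(5, TimeUnit.SECONDS)) {
            throw new AssertionError("Timed out waiting for " + messages + " messages");
          }
          // ... assertions on the delivered messages would go here ...
        }
      }
      {code}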

      So, this is a pain, mainly in my PPC64LE / IBM JVM environment.
      I think that these 16 tests must be analysed and improved.
      Also, running the tests with both OpenJDK and the IBM JVM in your development and test/Jenkins environments would help to catch these random issues.


    People

    • Assignee: Unassigned
    • Reporter: Tony Reix (trex58)
    • Votes: 1
    • Watchers: 3
