Flink / FLINK-7300

End-to-end tests are instable on Travis

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: Tests
    • Labels:

      Activity

      aljoscha Aljoscha Krettek added a comment -

      I cannot access the logs.

      tzulitai Tzu-Li (Gordon) Tai added a comment -

      https://travis-ci.org/apache/flink/jobs/258569408
      https://travis-ci.org/apache/flink/jobs/258841693

      Sorry about that. Does this work now?

      aljoscha Aljoscha Krettek added a comment -

      I pushed a fix to master that should fix this problem: d578810b45aaabe0e83a2064d7fa324981466b75

      If I'm correct, the problem was that we weren't waiting long enough for the result data to be available on Kafka. I've increased the maximum wait time from 1 minute to 5 minutes.
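
The bounded wait described above can be sketched roughly as follows. This is only an illustration, not the code from the commit: `wait_for_result`, `fetch`, and the parameter names are hypothetical, and the 300-second default simply mirrors the 5 minutes mentioned in the comment.

```python
import time
from typing import Callable, Optional


def wait_for_result(fetch: Callable[[], Optional[str]],
                    max_wait_s: float = 300.0,
                    poll_interval_s: float = 1.0) -> str:
    """Poll fetch() until it returns a value or max_wait_s has elapsed.

    fetch is a hypothetical callable that returns None while the result
    data is not yet available (for example, not yet readable from Kafka).
    """
    deadline = time.monotonic() + max_wait_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            # Give up only after the full wait budget is used.
            raise TimeoutError(f"no result after {max_wait_s} seconds")
        time.sleep(poll_interval_s)
```

Raising the limit makes a slow CI machine less likely to fail the test, but it does not help if the data never arrives for another reason, which may be why later runs still failed.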

      till.rohrmann Till Rohrmann added a comment -

      This does not seem to fully solve the problem: https://travis-ci.org/apache/flink/jobs/261076162

      aljoscha Aljoscha Krettek added a comment -

      Fixed in
      65402e034c32e47824fe46427d83eb9c9ea22d30
      2ed74ca060ba64fefc7b53a23640d4854329f418

      till.rohrmann Till Rohrmann added a comment -

      I fear the problem has not been fixed: https://travis-ci.org/apache/flink/jobs/263016472

      aljoscha Aljoscha Krettek added a comment -

      More and more Kafka exceptions/errors keep popping up. It's like playing whack-a-mole. 😩

      I'll push a commit that also ignores the new DisconnectException.

      aljoscha Aljoscha Krettek added a comment -

      Closed in 00d5b62224e9d4b701eef1ea0b016454e9c9374b

      till.rohrmann Till Rohrmann added a comment -

      The issue appeared again: https://travis-ci.org/apache/flink/jobs/265362414

      I think the reason for the sporadic failures needs a bit more investigation.

      aljoscha Aljoscha Krettek added a comment -

      The reason this time is an AskTimeoutException. The problem is that even a normal run of Flink can have exceptions and errors in the log. In our release testing we have this section about "running a cluster and verifying that the log and output are clear of exceptions and errors". I think in the real world the log is never clear of exceptions and errors, even in the case where everything went well.

      Till Rohrmann, do you think we should maybe just not test for the log being clean? I could also add AskTimeoutException to the list of exceptions that we expect to occur. I'm guessing this just sometimes occurs with Akka?
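
For illustration, a log check with a list of expected exceptions might look like the sketch below. It is not the actual test code: `EXPECTED_EXCEPTIONS`, `unexpected_exceptions`, and the sample log lines are made up for this example, and only the two exception names come from this thread.

```python
import re
from typing import Iterable, List, Sequence

# Hypothetical allowlist built from the exception names mentioned in this issue.
EXPECTED_EXCEPTIONS = ("DisconnectException", "AskTimeoutException")

# Matches simple class names ending in "Exception" or "Error" (case-sensitive,
# so the log level "ERROR" itself does not match).
_EXCEPTION_NAME = re.compile(r"\b\w*(?:Exception|Error)\b")


def unexpected_exceptions(log_lines: Iterable[str],
                          expected: Sequence[str] = EXPECTED_EXCEPTIONS) -> List[str]:
    """Return exception names found in the log that are not on the allowlist."""
    found: List[str] = []
    for line in log_lines:
        for name in _EXCEPTION_NAME.findall(line):
            if name not in expected and name not in found:
                found.append(name)
    return found


sample_log = [
    "INFO Job started",
    "WARN akka.pattern.AskTimeoutException: Ask timed out",
    "ERROR java.lang.IllegalStateException: boom",
]
print(unexpected_exceptions(sample_log))
```

The drawback of this approach is the one described in the thread: every newly observed benign exception needs another allowlist entry.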

      till.rohrmann Till Rohrmann added a comment -

      AskTimeoutException can easily occur in a real-life scenario without something being broken. These exceptions should then only be logged at warn level and not at error level, right?

      I think we should harden the test so that we don't get false positives. To me it sounds like our verification of a successful test run (e.g. no exception in the log) is broken and should be changed. It could also be that we log exceptions which shouldn't be logged because they are just misleading.

      aljoscha Aljoscha Krettek added a comment -

      I think you're right, we're logging exceptions that don't indicate a real problem. Changing this looks like a bigger task, though, and doesn't help with the immediate problem of the unstable end-to-end test.


        People

        • Assignee:
          aljoscha Aljoscha Krettek
          Reporter:
          tzulitai Tzu-Li (Gordon) Tai