Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- 0.10.0
Description
The Storm Compatibility tests frequently fail.
The reason is that they kill the topologies after a fixed time interval. That can fail on CI infrastructure when certain steps take longer than usual. Trying to guarantee progress by time is inherently problematic:
- Waiting too briefly makes tests unstable
- Waiting too long makes tests slow
The right way to go is letting the program decide when to terminate, for example by throwing a special SuccessException.
Have a look at the Kafka connector tests: they do this a lot and hence run exactly as long as they need to.
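To illustrate the idea, here is a minimal sketch in plain Java with no Flink or Storm dependencies. All names (SelfTerminatingJobSketch, CountingSink, runUntilSuccess) are hypothetical and not taken from the actual tests; the loop only stands in for an unbounded streaming job. The sink throws a SuccessException once it has seen the expected number of records, so the run ends as soon as the result is complete instead of after a fixed delay.

```java
import java.util.List;
import java.util.function.Consumer;

class SelfTerminatingJobSketch {

    /** Hypothetical marker exception: signals that the job reached its goal. */
    static class SuccessException extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    /** Hypothetical sink that ends the job once enough records have arrived. */
    static class CountingSink implements Consumer<String> {
        private final int expected;
        private int seen;

        CountingSink(int expected) {
            this.expected = expected;
        }

        @Override
        public void accept(String record) {
            seen++;
            if (seen >= expected) {
                // The program itself decides that it is done.
                throw new SuccessException();
            }
        }
    }

    /**
     * Stand-in for an unbounded streaming job: replays the input forever
     * and stops only when the sink throws SuccessException.
     */
    static boolean runUntilSuccess(List<String> input, Consumer<String> sink) {
        if (input.isEmpty()) {
            throw new IllegalArgumentException("input must not be empty");
        }
        try {
            while (true) {
                for (String record : input) {
                    sink.accept(record);
                }
            }
        } catch (SuccessException e) {
            return true;
        }
    }
}
```

With this shape the test needs no sleep or kill timer: a slow CI machine simply takes longer to reach the expected count, and a fast one finishes early.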
Here is an example of a failed run: https://s3.amazonaws.com/archive.travis-ci.org/jobs/77499577/log.txt
From FLINK-2801
The tests for the Storm compatibility layer all work with timeouts (running the program for 10 seconds) and then check whether the expected result has been written.
That is inherently unstable and slow (long delays). They should be rewritten along the lines of, for example, the KafkaITCase tests, where the streaming jobs terminate themselves with a "SuccessException", which can be recognized as successful completion when thrown by the job client.
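On the test side, recognizing the exception could look roughly like the sketch below. It assumes the job client wraps the original exception in one or more outer exceptions, so the check walks the cause chain. The class and method names (SuccessDetectionSketch, indicatesSuccess) are hypothetical, and the depth limit of 20 is an arbitrary guard against cyclic cause chains, not a value from the existing tests.

```java
class SuccessDetectionSketch {

    /** Hypothetical marker exception thrown by the job when it is done. */
    static class SuccessException extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    /**
     * Returns true if the given throwable, or any of its causes,
     * is a SuccessException. The depth limit is an assumed safeguard.
     */
    static boolean indicatesSuccess(Throwable t) {
        int depth = 0;
        while (t != null && depth < 20) {
            if (t instanceof SuccessException) {
                return true;
            }
            t = t.getCause();
            depth++;
        }
        return false;
    }
}
```

A test would call the job inside a try block, treat the run as passed when indicatesSuccess returns true for the caught exception, and rethrow anything else as a genuine failure.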
Issue Links
- is duplicated by
  - FLINK-2801 Rework Storm Compatibility Tests (Closed)
  - FLINK-2847 Fix flaky test in StormTestBase.testJob (Closed)