org.apache.oozie.TestCoordinatorEngineStreamLog.testCoordLogStreaming is flakey.
It can fail with an error like this:
This was tricky to track down. The problem is essentially that the filtering API uses minute granularity but the actions are seconds, or even milliseconds apart, so there are some edge cases here. The code takes minute granularity, goes back to milliseconds when querying the DB, and then uses seconds when checking that the test worked correctly.
This can lead to a situation like this example:
@1 created/nominal time: Oct 06 16:27:24 <and some milliseconds>
@2 created/nominal time: Oct 06 16:27:25 <and some milliseconds>
The test passes @2's created time as the starting time argument to the filter, converted with DateUtill's UTC MASK (i.e. minutes granularity):
- Oct 06 16:27
This gets parsed back to a Date object, which seconds/milliseconds assumed to be 0's
- Oct 06 16:27:00 0000
This gets turned into a TimeStamp object, and used in the DB filtering query
- gets all actions with created time > Oct 06 16:27:00, which includes @1
CoordinatorEngine sorts by action ID (i.e. @#) and uses created time of smallest, which is @1
- Oct 06 16:27:24
The test checks that @2’s created time is equal to previously determined time from filter (i.e. @1’s created time)
- Does a toString(), which uses seconds —-> not equal
Note: The filter querying is actually done against the nominal times, but the example is simpler without all that and conveys the same idea.
The actions get created within 100s of milliseconds during the test (at least on my machine), so most of them time, they have the same seconds, and the toString() comparison is equal. When the timing works out that the first and second action are created at different clock seconds (even if less than a second apart), it fails. I couldn’t reproduce this on my machine, but is more likely to happen on the generally slower Jenkins machines we run because the time between action creations will be longer.
Basically the problem is that the filtering works at a minute granularity but the actions are seconds, if not milliseconds, apart. In the real world this isn’t a big deal because the API takes minutes and we simply use that, but to verify the test is working we should simply make the actions minutes apart. To fix this, we can just directly insert the job and actions into the DB instead of actually running it so we can space the actions apart by minutes; this should also make the test faster as well (the test takes about 6 minutes on my machine as-is, but only 1 minute and 30 seconds with the fix).