Details
Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Description
We have seen many issues in DateWatermark where the job intermittently failed every other day. The reason is as follows:
- On 10-02 at 17:47 the job pulls with logindate >= 2018-10-01 (HWM = 10-02; when the job finishes, Actual_HWM is 10-02).
- On the same day, 10-02, if the job is re-run, we would have LWM = 10-03 and HWM = 10-02, so the job fails as expected.
- On 10-03 at 17:47 the job fails to generate any workunits because now LWM = Actual_HWM + 1 = 10-03 and HWM = 10-03. According to DateWatermark::getIntervals(), startTime must be less than endTime to generate an interval.
- On 10-04 at 17:47 the job recovers because LWM stays at 10-03 and HWM = 10-04, so a valid interval is generated again.
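The four runs above can be replayed with a small model. This is only a sketch in Python, not the Gobblin code: `has_interval` and `simulate` are hypothetical names, and the model assumes day granularity, HWM equal to the run date, and the next LWM equal to Actual_HWM + 1 after a successful run.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def has_interval(lwm, hwm):
    # Simplified stand-in for the condition described above:
    # an interval exists only if startTime is strictly less than endTime.
    return lwm < hwm

def simulate(run_dates, first_lwm):
    """Replay a sequence of runs and report whether each one gets an interval."""
    lwm = first_lwm
    results = []
    for run in run_dates:
        hwm = run  # assumption: HWM is the date of the run
        ok = has_interval(lwm, hwm)
        results.append((run.isoformat(), lwm.isoformat(), hwm.isoformat(), ok))
        if ok:
            # Actual_HWM becomes hwm, so the next LWM is Actual_HWM + 1.
            lwm = hwm + ONE_DAY
    return results

runs = [date(2018, 10, 2), date(2018, 10, 2), date(2018, 10, 3), date(2018, 10, 4)]
for row in simulate(runs, date(2018, 10, 1)):
    print(row)
```

In this model the second run on 10-02 and the run on 10-03 get no interval, and the run on 10-04 gets one again, which matches the sequence described above.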
The fix here is to let DateWatermark generate an interval at step 3 (the 10-03 run, where LWM equals HWM), so that we no longer have an intermittent failure there.
However, this fix causes another problem. Today we can miss data in steps 1 and 4, because step 1 pulls data for 10-02 too early and step 4 pulls data for 10-04 too early, but at least the whole of 10-03's data is pulled. After this fix, 10-03 will be pulled too early as well. The fix therefore needs to work together with the Cutoff feature, so that on 10-02 we only pull data for 10-01.
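How the two changes could combine is sketched below. This is an illustration under assumptions, not the actual Cutoff feature: `plan_pull` and `cutoff_days` are hypothetical names, the cutoff is modeled as "HWM = run date minus one day", and the returned range is treated as inclusive so that LWM equal to HWM still yields a one-day pull.

```python
from datetime import date, timedelta

def plan_pull(run_date, actual_hwm, cutoff_days=1):
    """Return the inclusive (lwm, hwm) day range to pull, or None if there is nothing to pull.

    Hypothetical model: the cutoff moves HWM back from the run date so that
    only complete days are pulled, and LWM == HWM is accepted as a valid range.
    """
    lwm = actual_hwm + timedelta(days=1)
    hwm = run_date - timedelta(days=cutoff_days)
    if lwm > hwm:
        return None
    return (lwm, hwm)

# Run on 10-02 after 09-30 was fully pulled: only 10-01 is pulled.
print(plan_pull(date(2018, 10, 2), date(2018, 9, 30)))
# Run on 10-03: LWM == HWM == 10-02, and a one-day range is still produced.
print(plan_pull(date(2018, 10, 3), date(2018, 10, 1)))
# Re-run on 10-02 after 10-01 was pulled: nothing left to pull.
print(plan_pull(date(2018, 10, 2), date(2018, 10, 1)))
```

Under these assumptions every daily run pulls exactly the previous full day, so no run fails for lack of an interval and no day is pulled before it is complete.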
Thanks,
Kuai