Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Version: 5.1.0
Description
During the data input dependency check, Oozie evaluates EL functions such as coord:latest() in a non-optimal way, which may result in more HDFS URI checks than necessary.
1. If the dataset frequency does not match the granularity of the uri-template, Oozie checks the same HDFS URI multiple times. For instance, with the following definition:
<dataset name="dataset1" frequency="${coord:minutes(1)}" initial-instance="2017-01-01T08:15Z" timezone="UTC">
  <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
  <done-flag>_SUCCESS</done-flag>
</dataset>
...
<data-in name="coordInput" dataset="dataset1">
  <instance>${coord:latest(0)}</instance>
</data-in>
Oozie checks the same .../2018-11-20/_SUCCESS file 24*60=1440 times. It would be enough to check the file only once and skip the other 1439 checks.
2. If the frequency is 1 day and the uri-template is defined in the following way:
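One way to avoid the repeated checks is to remember the result for each resolved URI during a single evaluation pass. The following is only a minimal sketch of that idea, not Oozie's implementation: the class CachedUriChecker, the resolveDaily helper and the existsOnHdfs predicate are hypothetical names that stand in for the real URI resolution and HDFS lookup.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper: memoizes existence checks per resolved URI.
final class CachedUriChecker {
    // Mirrors the ${YEAR}-${MONTH}-${DAY} template from the example, in UTC.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // Assumption: stands in for the real (expensive) HDFS existence check.
    private final Predicate<String> existsOnHdfs;
    private final Map<String, Boolean> results = new HashMap<>();
    private int backendCalls = 0;

    CachedUriChecker(Predicate<String> existsOnHdfs) {
        this.existsOnHdfs = existsOnHdfs;
    }

    // Returns the cached answer if this URI was already checked in this pass.
    boolean exists(String uri) {
        return results.computeIfAbsent(uri, u -> {
            backendCalls++;
            return existsOnHdfs.test(u);
        });
    }

    // Number of checks that actually reached the backend.
    int backendCalls() {
        return backendCalls;
    }

    // Resolves a nominal instance time to the done-flag path of its day.
    static String resolveDaily(String rootDir, Instant nominalTime) {
        return rootDir + "/" + DAY.format(nominalTime) + "/_SUCCESS";
    }
}
```

With a 1-minute frequency, all 1440 instances of one day resolve to the same path, so only the first call reaches the backend. The cache should live only for one evaluation pass, because the done-flag may appear later.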
<uri-template>${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}</uri-template>
Oozie will check the following directories one by one, even if some of the parent directories are missing:
2018/11/20 2018/11/19 2018/11/18 ...
If there is no 2018/11 directory, then there is no need to check any of the 2018/11/xx directories. Skipping them would reduce the number of HDFS URI checks.
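The pruning described here could look roughly like the sketch below, which walks backwards day by day and jumps over a whole month or year when its parent directory is missing. It is an illustration under stated assumptions, not Oozie code: PrunedLatestSearch, findLatest and the exists predicate are hypothetical, paths are relative to the dataset root, and the template is assumed to be ${YEAR}/${MONTH}/${DAY}.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical helper: finds the latest existing day directory,
// skipping all days under a missing year or month directory.
final class PrunedLatestSearch {

    static Optional<LocalDate> findLatest(LocalDate newest, LocalDate oldest,
                                          Predicate<String> exists) {
        // Each path is checked at most once; 'exists' stands in for the HDFS call.
        Map<String, Boolean> cache = new HashMap<>();
        LocalDate day = newest;
        while (!day.isBefore(oldest)) {
            String yearPath = String.format("%04d", day.getYear());
            if (!cache.computeIfAbsent(yearPath, exists::test)) {
                // No year directory: continue from the last day of the previous year.
                day = LocalDate.of(day.getYear(), 1, 1).minusDays(1);
                continue;
            }
            String monthPath = yearPath + String.format("/%02d", day.getMonthValue());
            if (!cache.computeIfAbsent(monthPath, exists::test)) {
                // No month directory: continue from the last day of the previous month.
                day = day.withDayOfMonth(1).minusDays(1);
                continue;
            }
            String dayPath = monthPath + String.format("/%02d", day.getDayOfMonth());
            if (cache.computeIfAbsent(dayPath, exists::test)) {
                return Optional.of(day);
            }
            day = day.minusDays(1);
        }
        return Optional.empty();
    }
}
```

The sketch relies on the fact that a directory cannot exist when its parent does not, so one negative answer for 2018/11 replaces a separate check for each day under it.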
Issue Links
- is blocked by OOZIE-3381 [coordinator] Enhance logging of CoordElFunctions (Closed)