Oozie / OOZIE-3387

Optimize coordinator data input dependency search

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.1.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      During the data input dependency check, Oozie evaluates EL functions such as coord:latest() in a non-optimal way, which may result in more HDFS URI checks than necessary.

      1. If the dataset frequency does not match the granularity of the uri-template, Oozie checks the same HDFS URI multiple times. For instance, with the following definition:

      <dataset name="dataset1" frequency="${coord:minutes(1)}" initial-instance="2017-01-01T08:15Z" timezone="UTC">
          <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
          <done-flag>_SUCCESS</done-flag>
      </dataset>
      ...
      <data-in name="coordInput" dataset="dataset1">
          <instance>${coord:latest(0)}</instance>
      </data-in>
      

      Oozie checks the same .../2018-11-20/_SUCCESS file 24*60 = 1440 times. It would be enough to check the file once and skip the other 1439 checks.
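      A minimal sketch of the idea, not Oozie's actual implementation: walk the dataset instances backward and remember which resolved URIs were already checked, so that a URI shared by many instances is tested only once. The names resolve_uri, find_latest and the exists callback are hypothetical; only coord:latest(0) is modelled, and the done-flag check is assumed to live inside exists.

```python
from datetime import datetime, timedelta


def resolve_uri(template, instant):
    """Substitute ${YEAR}, ${MONTH}, ${DAY} in the template (hypothetical helper)."""
    return (template
            .replace("${YEAR}", f"{instant.year:04d}")
            .replace("${MONTH}", f"{instant.month:02d}")
            .replace("${DAY}", f"{instant.day:02d}"))


def find_latest(template, initial, nominal, frequency, exists):
    """Return the newest existing URI at or before `nominal`, or None.

    `exists` is a caller-supplied callback standing in for the HDFS check
    (assumed to include the done-flag test). Each distinct URI is passed
    to it at most once.
    """
    # Align to the newest dataset instance that is not after `nominal`.
    steps = (nominal - initial) // frequency
    instant = initial + steps * frequency

    checked = set()  # URIs already tested and found missing
    while instant >= initial:
        uri = resolve_uri(template, instant)
        if uri not in checked:
            if exists(uri):
                return uri
            checked.add(uri)
        instant -= frequency
    return None
```

      With a 1-minute frequency and a daily uri-template, all instances of one day resolve to the same URI, so this sketch issues one check per day instead of one per minute.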

      2. If the frequency is 1 day and the uri-template is defined in the following way:

      <uri-template>${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}</uri-template>
      

      Oozie will check the following directories one by one, even if some of the parent directories are missing:

      2018/11/20
      2018/11/19
      2018/11/18
      ...
      

      If there is no 2018/11 directory, it is not necessary to check any of the 2018/11/xx directories. Skipping them would reduce the number of HDFS URI checks.
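      One possible way to do this, sketched under assumptions rather than taken from Oozie's code: test the path from the top (year, then month, then day) and cache every result, so that a missing parent directory rules out all of its children without further checks. The function find_latest_pruned and the exists callback are hypothetical, a fixed YEAR/MONTH/DAY layout with a 1-day frequency is assumed, and the done-flag test is left out.

```python
from datetime import datetime, timedelta


def find_latest_pruned(root, initial, nominal, exists):
    """Return the newest existing <root>/YYYY/MM/DD path, or None.

    `exists` is a caller-supplied callback standing in for the HDFS check.
    Results are cached per path, so a missing year or month directory is
    checked once and then prunes every day below it.
    """
    known = {}  # path -> bool, results of earlier exists() calls

    def cached_exists(path):
        if path not in known:
            known[path] = exists(path)
        return known[path]

    day = nominal
    while day >= initial:
        parts = [f"{day.year:04d}", f"{day.month:02d}", f"{day.day:02d}"]
        path = root
        found = True
        for part in parts:
            path = f"{path}/{part}"
            if not cached_exists(path):
                found = False  # this level is missing, so deeper levels are too
                break
        if found:
            return path
        day -= timedelta(days=1)
    return None
```

      The trade-off is that existing parent directories cost one extra check each, so this only pays off when gaps in the data span whole months or years.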

            People

              Assignee: Unassigned
              Reporter: Andras Salamon (asalamon74)