Oozie / OOZIE-2179

Use HDFS INotify to track HDFS data dependencies instead of polling

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: coordinator
    • Labels: None

    Description

      Instead of polling the NameNode (NN) every minute for Coordinator data dependencies, we should look into using the new INotify feature added in HDFS-6634, which lets you read a stream of events from HDFS. Internally it still uses a polling mechanism for now, but even so it would likely be more efficient and less heavy-handed than what we do today.

      We'd probably still have to check whether the directory exists when a coordinator action starts, in case we missed the event, but while waiting for an HDFS dependency to become available we can rely on INotify.
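
      A minimal sketch of that two-step approach, under stated assumptions: DependencyTracker is a hypothetical class (not existing Oozie code), the Predicate stands in for a direct existence check against HDFS, and onPathCreated would be driven by whatever creation-type events the INotify stream delivers.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical tracker for HDFS data dependencies; illustration only. */
class DependencyTracker {
    // Assumption: stands in for a direct "does this path exist?" call to HDFS.
    private final Predicate<String> existsInHdfs;
    private final Set<String> waiting = new HashSet<>();
    private final Set<String> available = new HashSet<>();

    DependencyTracker(Predicate<String> existsInHdfs) {
        this.existsInHdfs = existsInHdfs;
    }

    /** Called once when a coordinator action starts. */
    void register(String path) {
        if (existsInHdfs.test(path)) {
            // The path is already there, so its creation event was missed.
            available.add(path);
        } else {
            // From here on we wait for an event instead of polling.
            waiting.add(path);
        }
    }

    /** Called for each created path reported by the event stream. */
    void onPathCreated(String path) {
        if (waiting.remove(path)) {
            available.add(path);
        }
    }

    boolean isAvailable(String path) {
        return available.contains(path);
    }

    int waitingCount() {
        return waiting.size();
    }
}
```

      The one-time existence check covers paths created before the action began listening; everything after that is event-driven.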

      For HCat dependencies we still run a backup poll every 10 minutes in case a JMS message is missed or lost. I don't think we'll need that for INotify, because past events can be read back as long as we keep track of the event ID. For example, if Oozie restarts and we have kept the last event ID it looked at, we can resume from there without losing anything.
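
      The resume-from-last-ID idea can be sketched as follows. All names here (FsEvent, EventLog, ResumableConsumer) are hypothetical, and EventLog is an in-memory stand-in for the HDFS event source. In a real implementation the stream would presumably be opened from the saved ID through the HDFS client API, and the checkpoint would be persisted durably (for example in the Oozie database) rather than held in memory.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event: a monotonically increasing ID plus the affected path. */
final class FsEvent {
    final long id;
    final String path;

    FsEvent(long id, String path) {
        this.id = id;
        this.path = path;
    }
}

/** In-memory stand-in for the event source; IDs start at 1. */
final class EventLog {
    private final List<FsEvent> events = new ArrayList<>();

    void append(String path) {
        events.add(new FsEvent(events.size() + 1, path));
    }

    /** Returns every event whose ID is greater than lastSeenId, in order. */
    List<FsEvent> readAfter(long lastSeenId) {
        List<FsEvent> out = new ArrayList<>();
        for (FsEvent e : events) {
            if (e.id > lastSeenId) {
                out.add(e);
            }
        }
        return out;
    }
}

/** Consumer that tracks the last processed ID so a restart can resume. */
final class ResumableConsumer {
    private long lastSeenId;
    final List<String> seenPaths = new ArrayList<>();

    ResumableConsumer(long checkpoint) {
        this.lastSeenId = checkpoint;
    }

    void drain(EventLog log) {
        for (FsEvent e : log.readAfter(lastSeenId)) {
            seenPaths.add(e.path);
            // Assumption: a real server would persist this value durably.
            lastSeenId = e.id;
        }
    }

    long checkpoint() {
        return lastSeenId;
    }
}
```

      A consumer created after a restart with the saved checkpoint sees only the events appended while it was down, which is why a separate backup poll may be unnecessary.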

      The INotify stream is asynchronous, so we won't receive a notification immediately. We should look into what guarantees exist on how long a notification can take to show up.

            People

              Assignee: Unassigned
              Reporter: Robert Kanter (rkanter)
              Votes: 0
              Watchers: 4
