Hadoop HDFS / HDFS-1742

Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:

      Description

      We're working on a system that runs various Hadoop jobs continuously, based on the data that appears in HDFS: for example, we have a job that works on a day's worth of data and creates output in /output/YYYY/MM/DD. For input, it should wait for a directory with externally uploaded data, /input/YYYY/MM/DD, to appear, and also wait for the previous day's output to appear, i.e. /output/YYYY/MM/DD-1.

      Obviously, one possible solution is polling once in a while for the files/directories we're waiting for, but polling is generally a bad solution. A better one is something like a file alteration monitor or inode activity notifiers, such as the ones implemented in Linux filesystems.

      The basic idea is that one can specify (inject) code that is executed on every major event happening in HDFS, such as:

      • File created / open
      • File closed
      • File deleted
      • Directory created
      • Directory deleted

      A simplistic implementation could look like this: the NN defines an interface for the callback/hook mechanism, i.e. something like:

      interface NameNodeCallback {
          public void onFileCreate(SomeFileInformation f);
          public void onFileClose(SomeFileInformation f);
          public void onFileDelete(SomeFileInformation f);
          ...
      }
      

      One could then create a class that implements this interface and load it somehow (for example, via an extra jar on the classpath) into the NameNode's JVM. The NameNode would have a configuration option that specifies the names of such class(es); it would then instantiate them and call their methods (on a separate thread) for every relevant event.

      There would be a couple of ready-made pluggable implementations of such a class, most likely distributed as contrib. The default NameNode process would stay the same, with no visible differences.
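      As a sketch of how this loading and dispatch might work (everything here is hypothetical: `NameNodeCallback` and `SomeFileInformation` come from the interface above, while `LoggingCallback` and the `load` helper are invented for illustration), the NameNode could resolve a configured class name reflectively and hand events to it on a separate thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CallbackHookSketch {
    // Hypothetical event payload; a real version would carry owner, permissions, etc.
    public static class SomeFileInformation {
        public final String path;
        public SomeFileInformation(String path) { this.path = path; }
    }

    // The callback interface from the proposal above.
    public interface NameNodeCallback {
        void onFileCreate(SomeFileInformation f);
        void onFileClose(SomeFileInformation f);
        void onFileDelete(SomeFileInformation f);
    }

    // A sample pluggable implementation, e.g. one that could ship as contrib.
    public static class LoggingCallback implements NameNodeCallback {
        public final List<String> log = new ArrayList<>();
        public void onFileCreate(SomeFileInformation f) { log.add("create " + f.path); }
        public void onFileClose(SomeFileInformation f)  { log.add("close "  + f.path); }
        public void onFileDelete(SomeFileInformation f) { log.add("delete " + f.path); }
    }

    // The NN would instantiate each configured class reflectively.
    public static NameNodeCallback load(String className) throws Exception {
        return (NameNodeCallback) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        NameNodeCallback cb = load(CallbackHookSketch.class.getName() + "$LoggingCallback");
        // Dispatch on a separate thread so the NN's own code path is not blocked.
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        dispatcher.submit(() -> cb.onFileCreate(new SomeFileInformation("/input/2011/03/10")));
        dispatcher.shutdown();
        dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(((LoggingCallback) cb).log);
    }
}
```

      A real version would read the class names from the NameNode configuration and would still need to decide what happens when a callback blocks or throws, which is exactly the concern raised in the comments below.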

      Hadoop's JobTracker already uses this paradigm extensively with its pluggable Scheduler interface, e.g. the Fair Scheduler, Capacity Scheduler, Dynamic Scheduler, etc. Each is a class that loads and runs inside the JobTracker's context; a few relatively trusted varieties exist, they're distributed as contrib, and enabling them is purely optional and up to the cluster admin.

      This would allow systems such as the one I described at the beginning to be implemented without polling.

        Issue Links

          Activity

          Denny Ye made changes -
          Link This issue is blocked by HDFS-2760 [ HDFS-2760 ]
          John George made changes -
          Link This issue is related to HADOOP-7821 [ HADOOP-7821 ]
          Suresh Srinivas added a comment -

          +1 for using some kind of a tool on editlog to do this, as many have suggested. Please see HDFS-1448, which added a tool for viewing editlog. A tool could be built around that.

          dhruba borthakur added a comment -

          I agree that this is a useful feature; we have many processes that watch the filesystem namespace and do various things when files/directories appear in the HDFS namespace. However, making the fsedit logging invoke user-specified callbacks seems problematic. What happens when a callback does not return within a specific period of time? What locks can the namenode hold across these callbacks? Who retries if a callback returns "failure"?

          I would rather vote for the HDFS namenode logging all these changes into a file in a well-defined format (aka HDFS-1179). This is the core building block needed by an external application to build a notification mechanism, publish-subscribe software, etc.

          Alejandro Abdelnur added a comment -

          I agree 300% that user code MUST NOT run in the Hadoop services.

          Just to make it clear, my suggestion was to have a service interface, like the JT has the Scheduler interface, that can be used to augment server behavior. Only cluster administrators could set this up. Out of the box, Hadoop could bundle 1 or 2 implementations. People could still implement their own in case they have special requirements. Or just use NIL, which would be today's behavior.

          Mikhail Yakshin added a comment -

          I seriously doubt that making pubsub-like event transmission the only available option is the way to go. The pubsub model is a cool thing, but a proper implementation of it requires a full-blown messaging subsystem akin to the ones that implement JMS, such as ActiveMQ. In turn, that means a whole other system, matching Hadoop in complexity (it includes daemons, at least a JMS broker, and requires non-trivial configuration and deployment), being installed and made mandatory by Hadoop.

          The only thing I'm arguing for is making this modular - i.e. making a JMS pubsub producer an option, but not the only option. Other options might be simple local file logging, sending events across the network, plugging in some local workflow management system, etc.

          Allen Wittenauer added a comment -

          Heck, you could build a trivial/poc version based upon the hdfs audit log in no time flat.

          Todd Lipcon added a comment -

          Hey folks. I think people generally accept that it would be nice to be able to have an inotify-like interface on top of HDFS. However I don't think the proposed implementation of doing this inside the NN is a good idea for the following reasons:

          • it adds "less trusted" code running in the same JVM as the NN, which could crash it, use up memory, etc.
          • it adds load to the NN, which is already a scalability limit on large clusters
          • it will require a NN restart (or fragile classloader tricks) to reload the set of hooks

          I think the right way forward here is to have some kind of service subscribe to the NN edit logs and then publish events to subscribers. This would allow the "pubsub" service to run on a separate machine and not impact the NN in any way.

          Monitoring/alerting capability based on lifecycle events in the NN does make sense to me, though - eg a trigger when the NN enters or exits safemode. These tend to be lower load infrequent events and pluggable listeners would be plenty useful. See HADOOP-5640 for an interface like this.

          Mikhail Yakshin made changes -
          Description [edited]
          Mikhail Yakshin made changes -
          Description [edited]
          Tsz Wo Nicholas Sze added a comment -

          > Cluster admins or, more likely, Hadoop developers. ...

          Sounds good. I suggest you update the description to avoid confusion.

          > It depends on how do you define "internally". ...

          I actually mean the same as you suggested, i.e. not for end users to provide a JobInProgressListener class.

          Mikhail Yakshin added a comment -

          >> A user creates a class that implements this method and loads it somehow ...
          > By user, do you mean end users or cluster admins?

          Cluster admins or, more likely, Hadoop developers. I'd like it to act just as a pluggable Scheduler interface: a few well-known and maintained varieties exist, 99.9% of Hadoop users/admins just plug in whatever scheduler they see fit.

          >> JobTracker already includes pluggable Scheduler interface ...
          > JobInProgressListener is used in JobTracker internally but not for running end user codes.

          It depends on how you define "internally". In fact, the pluggable Scheduler interface extensively uses the JobInProgressListener infrastructure; for example, FairScheduler defines its own custom JobInProgressListener.

          Tsz Wo Nicholas Sze added a comment -

          > A user creates a class that implements this method and loads it somehow ...

          By user, do you mean end users or cluster admins?

          > JobTracker already includes pluggable Scheduler interface ...

          JobInProgressListener is used internally by the JobTracker, but not for running end user code.

          Doug Cutting added a comment -

          Ha! I now see that this is what Alejandro already suggested!

          Doug Cutting added a comment -

          I wonder if, rather than callbacks, this might look something like an RSS feed of changes. An application could request the N edits immediately after a given timestamp. Each edit returned would include a timestamp. Edits could be filtered by the server to particular directory paths. The server would only return edits to files and directories that the client is permitted to see.

          The server would implement this by retaining edit logs for, e.g., 24 hours. Requests for timestamps before this would result in an error. This service might only be provided by the secondary namenode, to reduce the load on the namenode.
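          This feed model can be sketched with a couple of plain interfaces (all names here -- `Edit`, `EditFeed`, `FeedReader` -- are hypothetical, not real HDFS APIs): the server hands out up to N edits after a timestamp, and the client remembers the last timestamp it has seen:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EditFeedSketch {
    // Hypothetical edit record: one namespace change from the retained edit log.
    public static class Edit {
        public final long timestamp;
        public final String op;    // e.g. "MKDIR", "CLOSE", "DELETE"
        public final String path;
        public Edit(long timestamp, String op, String path) {
            this.timestamp = timestamp; this.op = op; this.path = path;
        }
    }

    // Hypothetical server interface: "up to n edits after t, under pathPrefix,
    // restricted to what the caller is permitted to see".
    public interface EditFeed {
        List<Edit> editsAfter(long timestamp, int n, String pathPrefix);
    }

    // Client side: remembers the last timestamp seen, so each call returns
    // only new edits -- no repeated directory scans needed.
    public static class FeedReader {
        private final EditFeed feed;
        private long lastSeen;
        public FeedReader(EditFeed feed, long start) { this.feed = feed; this.lastSeen = start; }
        public List<Edit> next(int n, String prefix) {
            List<Edit> batch = feed.editsAfter(lastSeen, n, prefix);
            for (Edit e : batch) lastSeen = Math.max(lastSeen, e.timestamp);
            return batch;
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the server's retained edits.
        List<Edit> retained = Arrays.asList(
            new Edit(1, "MKDIR", "/input/2011/03/10"),
            new Edit(2, "CLOSE", "/input/2011/03/10/part-0"));
        EditFeed feed = (t, n, prefix) -> {
            List<Edit> out = new ArrayList<>();
            for (Edit e : retained)
                if (e.timestamp > t && e.path.startsWith(prefix) && out.size() < n) out.add(e);
            return out;
        };
        FeedReader reader = new FeedReader(feed, 0);
        System.out.println(reader.next(10, "/input").size()); // first poll sees both edits
        System.out.println(reader.next(10, "/input").size()); // second poll sees nothing new
    }
}
```

          A request for a timestamp older than the retained window would fail instead of returning an empty batch, telling the client it must fall back to listing.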

          Mikhail Yakshin added a comment -

          I disagree about complete isolation of the callback system process. A callback system implementation is not end-user code, the way map-reduce jobs are, and thus can be fairly reliable. Updating this code requires administrative privileges and restarting the NameNode.

          JobTracker already includes a pluggable Scheduler interface (HADOOP-3412) that introduces external classes into the main JobTracker JVM (albeit the choice of classes is fairly limited). There is also a pluggable [JobInProgressListener|http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/JobTracker.html#addJobInProgressListener(org.apache.hadoop.mapred.JobInProgressListener)] that implements exactly the same idea: a listener that receives events.

          Thus, I see no harm in no listeners by default and a sample listener implementation that does basic logging of events in a file or some sort of queue.

          Allen Wittenauer added a comment -

          I mean specifically that the current HDFS master processes should know absolutely nothing about callbacks even existing in the system. Users won't talk to them about callbacks, they won't execute them, etc. This whole callback system must be a completely separate daemon so that users can't compromise HDFS in any way/shape/form.

          Uma Maheswara Rao G added a comment -

          I also agree with you, Allen.
          You mean users' event listener code will be executed in a separate process? Please correct me if I am wrong.

          Allen Wittenauer added a comment -

          The namenode, secondary nn, etc should never be running user code directly. It won't scale and it will introduce an incredible amount of instability.

          It would be much better if this was designed in such a way that it was a completely separate process (or gang of processes). This process could be fed by receiving the edits stream similar to how Checkpoint and Backup nodes work today.

          Uma Maheswara Rao G added a comment -

          This is a very good feature.
          These events/callbacks could also be raised when space fills up on the NameNode, on Datanode unregistration with the NameNode, on Datanode registration with the NameNode, etc.
          Based on these events, an application can raise alarms to the administrator.

          For HDFS-1594 we could also implement the event/callback feature (when the NameNode goes into safemode because of disk space, it can raise an event).

          Alejandro Abdelnur added a comment -

          Agree, this would be a very nice feature to have.

          Oozie Coordinator (Mikhail, Oozie coordinator does what you describe you are building) currently polls HDFS to find new files to process.

          This polling can be heavy in case of several/large Oozie coordinator jobs (large meaning a large number of input dependencies).

          This listener should also be available in the secondary namenode. That would allow offloading the notifications from the primary namenode, thus not putting extra load on it.

          A default implementation of this listener could be an HTTP RSS-feed-like endpoint that remembers the last # minutes and supports the 'If-Modified-Since' HTTP header; if the header is present, it returns only notifications newer than the timestamp. It could also support a path prefix filter. (Note that this implementation does not guarantee notification if the # time window is missed by the caller, so the caller may still have to do some lazy polling.)
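          Such an endpoint can be roughed out with the JDK's built-in `com.sun.net.httpserver` (everything here is an assumption for illustration: the `/events` path, port 8085, the `prefix` query parameter, and the plain-text response format are all made up):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.stream.Collectors;

public class NotificationEndpointSketch {
    // One namespace event; a real server would retain only a bounded time window.
    public static class Event {
        public final long millis;
        public final String path;
        public Event(long millis, String path) { this.millis = millis; this.path = path; }
    }

    public static final List<Event> window = new CopyOnWriteArrayList<>();

    // Notifications newer than 'since' whose path starts with 'prefix',
    // one "millis path" line each.
    public static String render(long since, String prefix) {
        return window.stream()
                .filter(e -> e.millis > since && e.path.startsWith(prefix))
                .map(e -> e.millis + " " + e.path)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8085), 0);
        server.createContext("/events", ex -> {
            // If-Modified-Since, when present, limits the feed to newer events.
            String ims = ex.getRequestHeaders().getFirst("If-Modified-Since");
            long since = (ims == null) ? 0
                    : ZonedDateTime.parse(ims, DateTimeFormatter.RFC_1123_DATE_TIME)
                                   .toInstant().toEpochMilli();
            String q = ex.getRequestURI().getQuery();                 // e.g. prefix=/input/
            String prefix = (q != null && q.startsWith("prefix=")) ? q.substring(7) : "/";
            byte[] body = render(since, prefix).getBytes(StandardCharsets.UTF_8);
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        server.start();
    }
}
```

          A client would then poll with something like `curl -H 'If-Modified-Since: Thu, 10 Mar 2011 00:00:00 GMT' 'http://localhost:8085/events?prefix=/input/'`; as noted above, a caller that misses the retention window still has to fall back to lazy polling.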

          Mikhail Yakshin created issue -

            People

            • Assignee: Unassigned
            • Reporter: Mikhail Yakshin
            • Votes: 2
            • Watchers: 22

              Dates

              • Created:
              • Updated: