[HDFS-1742] Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc) - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: namenode
Labels:
- features
- polling

Description

We're working on a system that runs various Hadoop jobs continuously, based on the data that appears in HDFS. For example, we have a job that works on a day's worth of data and creates output in /output/YYYY/MM/DD. For input, it should wait for the directory with externally uploaded data, /input/YYYY/MM/DD, to appear, and also wait for the previous day's output to appear, i.e. /output/YYYY/MM/DD-1.

Obviously, one possible solution is to poll once in a while for the files/directories we're waiting for, but generally that's a bad solution. A better one is something like a file alteration monitor or inode activity notifier, such as the ones implemented for Linux filesystems.

The basic idea is that one can specify (inject) some code that will be executed on every major event happening in HDFS, such as:

File created / opened
File closed
File deleted
Directory created
Directory deleted

I see a simplistic implementation as follows: the NN defines some interfaces that provide a callback/hook mechanism, i.e. something like:

interface NameNodeCallback {
    public void onFileCreate(SomeFileInformation f);
    public void onFileClose(SomeFileInformation f);
    public void onFileDelete(SomeFileInformation f);
    ...
}

It would then be possible to create a class that implements this interface and load it somehow (for example, using an extra jar in the classpath) into the NameNode's JVM. The NameNode would include a configuration option that specifies the names of such class(es); the NameNode would then instantiate them and call their methods (in a separate thread) whenever a corresponding event happens.
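To make the loading idea more concrete, here is a minimal sketch of how class names from a configuration value could be instantiated and invoked off the caller's thread. Everything in it is an assumption for illustration, not existing Hadoop API: `FileInfo` stands in for `SomeFileInformation`, and `CallbackDispatcher` and `RecordingCallback` are hypothetical names. It uses only plain JDK reflection and an executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the SomeFileInformation type in the proposal.
final class FileInfo {
    final String path;
    FileInfo(String path) { this.path = path; }
}

// The callback interface from the proposal (only the methods it names).
interface NameNodeCallback {
    void onFileCreate(FileInfo f);
    void onFileClose(FileInfo f);
    void onFileDelete(FileInfo f);
}

// Hypothetical dispatcher: loads callbacks by class name and runs them
// on a separate thread so the caller is never blocked by plugin code.
final class CallbackDispatcher {
    private final List<NameNodeCallback> callbacks = new ArrayList<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // classNames: comma-separated value of an assumed configuration option.
    CallbackDispatcher(String classNames) {
        for (String name : classNames.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) continue;
            try {
                Class<? extends NameNodeCallback> cls =
                    Class.forName(trimmed).asSubclass(NameNodeCallback.class);
                callbacks.add(cls.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalArgumentException("Cannot load callback " + trimmed, e);
            }
        }
    }

    void fileCreated(FileInfo f) {
        for (NameNodeCallback cb : callbacks) executor.submit(() -> cb.onFileCreate(f));
    }

    void fileClosed(FileInfo f) {
        for (NameNodeCallback cb : callbacks) executor.submit(() -> cb.onFileClose(f));
    }

    void fileDeleted(FileInfo f) {
        for (NameNodeCallback cb : callbacks) executor.submit(() -> cb.onFileDelete(f));
    }

    // Stops accepting events and waits for queued callbacks to finish.
    boolean shutdown(long timeoutMillis) {
        executor.shutdown();
        try {
            return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

// Example pluggable implementation: records each event as a string.
final class RecordingCallback implements NameNodeCallback {
    static final List<String> EVENTS = new CopyOnWriteArrayList<>();
    public void onFileCreate(FileInfo f) { EVENTS.add("create " + f.path); }
    public void onFileClose(FileInfo f) { EVENTS.add("close " + f.path); }
    public void onFileDelete(FileInfo f) { EVENTS.add("delete " + f.path); }
}
```

A single-threaded executor keeps events in order for each callback. An exception thrown by a callback is captured by the task submitted to the executor rather than propagated to the caller, which is one way to keep a misbehaving plugin from disturbing the host process.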

There would be a couple of ready-made pluggable implementations of such a class, most likely distributed as contrib. The default NameNode process would stay the same, without any visible differences.

Hadoop's JobTracker already uses the same paradigm extensively with its pluggable Scheduler interface: Fair Scheduler, Capacity Scheduler, Dynamic Scheduler, etc. These are also classes that load and run inside the JobTracker's context. A few relatively trusted varieties exist; they're distributed as contrib, and enabling them is purely optional and up to the cluster admin.

This would allow systems such as the one I described at the beginning to be implemented without polling.
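As a rough illustration of the daily-job use case, the sketch below shows a listener that starts a job once both prerequisite directories have been reported. It assumes a directory-created event delivers the directory path as a string; `DailyJobTrigger` and `onDirectoryCreate` are hypothetical names, not part of any existing HDFS API. The previous day is computed with `java.time`, so "DD-1" also works across month boundaries.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

// Hypothetical listener: runs a job once both prerequisite directories
// for a given day have been reported as created.
final class DailyJobTrigger {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    private final Set<String> pending = new HashSet<>();
    private final Runnable job;
    private boolean fired = false;

    DailyJobTrigger(LocalDate day, Runnable job) {
        this.job = job;
        // Externally uploaded input for the day itself.
        pending.add("/input/" + DAY.format(day));
        // Output of the previous day's run.
        pending.add("/output/" + DAY.format(day.minusDays(1)));
    }

    // Assumed to be called from a directory-created callback.
    synchronized void onDirectoryCreate(String path) {
        if (fired) return;            // run the job at most once
        pending.remove(path);         // unrelated paths are simply ignored
        if (pending.isEmpty()) {
            fired = true;
            job.run();
        }
    }

    synchronized boolean hasFired() { return fired; }
}
```

Note that a directory appearing does not by itself mean the upload into it has finished; a real trigger would probably also need file-closed events or a completion marker, which this sketch leaves out.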

Issue Links

duplicates: HDFS-6634 inotify in HDFS (Closed)
is blocked by: HDFS-2760 HDFS notification (Resolved)
is related to:
- HADOOP-7821 Hadoop event notification system (Open)
- HDFS-6634 inotify in HDFS (Closed)

People

Assignee: Unassigned
Reporter: Mikhail Yakshin
Votes: 1
Watchers: 21

Dates

Created: 09/Mar/11 22:29
Updated: 09/Oct/14 01:14
Resolved: 09/Oct/14 01:14