Doug Cutting added a comment - 29/Aug/06 08:16 PM
The mapper could periodically poll a server for new messages. For example, a DFS directory could be used per job with a message per file, named with a timestamp. This would not require changes to the MapReduce system. Would this be impractical for the fetcher application?
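A minimal sketch of the polling scheme described above, using a local directory as a stand-in for the per-job DFS directory. The function names `post_message` and `poll_messages`, the `.msg` suffix, and the zero-padded millisecond file names are assumptions made for illustration; none of them come from Hadoop or Nutch.

```python
import os
import tempfile
import time

MSG_SUFFIX = ".msg"  # hypothetical naming convention for message files


def post_message(job_dir, payload, timestamp_ms=None):
    """Write one message as one file, named with a millisecond timestamp."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # Zero-pad so that lexicographic order equals chronological order.
    name = f"{timestamp_ms:020d}{MSG_SUFFIX}"
    tmp_path = os.path.join(job_dir, "." + name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(payload)
    # Rename into place so a polling task never reads a half-written file.
    os.replace(tmp_path, os.path.join(job_dir, name))
    return timestamp_ms


def poll_messages(job_dir, last_seen_ms):
    """Return (new_last_seen_ms, payloads) for messages newer than last_seen_ms."""
    payloads = []
    newest = last_seen_ms
    for name in sorted(os.listdir(job_dir)):
        if name.startswith(".") or not name.endswith(MSG_SUFFIX):
            continue  # skip temporary files and anything that is not a message
        ts = int(name[: -len(MSG_SUFFIX)])
        if ts > last_seen_ms:
            with open(os.path.join(job_dir, name), encoding="utf-8") as f:
                payloads.append(f.read())
            newest = max(newest, ts)
    return newest, payloads


if __name__ == "__main__":
    job_dir = tempfile.mkdtemp(prefix="job-messages-")
    post_message(job_dir, "pause", timestamp_ms=1000)
    post_message(job_dir, "resume", timestamp_ms=2000)
    last_seen, messages = poll_messages(job_dir, last_seen_ms=0)
    print(last_seen, messages)
```

A task would call `poll_messages` from its own loop at whatever interval it chooses, carrying `last_seen_ms` forward between calls. Note that every poll still lists the whole directory, which is the cost raised later in this thread.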
It could address this particular problem, yes. However, each time application writers would have to design their own way to do this - it would be better if the framework provided some support for this.
> each time application writers would have to design their own way to do this
I prefer to wait until a few application writers have done it, then generalize, rather than try to guess what is universal. Otherwise the framework gets bloated with features that are only used by one application. Are there other folks who need to send messages to running map and reduce tasks?
This implementation is not Nutch specific, and can be easily moved to Hadoop if users find it useful.
(Oops, I thought JIRA would include the link in the comment).
-1
Unless I'm reading this wrong, a file per message would kill the name node at any scale. Also, in a large task, the cost of having every mapper/task scan all the messages could be fairly prohibitive. I'd suggest making it available in contrib or through some other mechanism until we see how much uptake it gets. This would leave specific applications free to use it. Perhaps if this gains wide acceptance we could explore moving the concepts into core, but we would need to address the scaling issues to make a general facility. A very interesting set of ideas here, but very complicated if you want to make it work in large general cases.
Re: namenode issue: yes, that's a good point. I didn't think of that, mainly because I'm working with smaller clusters (dozens of machines at most).
Re: cost of scanning: that's true as well, although tasks don't have to poll so often; in some cases you could configure the poll interval to be in the range of minutes. However, this points back to a deficiency in the current framework, namely that there is no support for sending arbitrary messages to tasks. If there were a way to do this, well, then the issue would be solved and we wouldn't need this MQ API ... Overall, I'm aware that this is a less than ideal solution to the problem. IMHO my original proposal explained in this issue would be better.
I'd like to call for re-evaluation of this issue. With the introduction of TaskTrackerAction it seems to me that signals could be accommodated more easily than before, simply by sending yet another type of TaskTrackerAction. The original reasons for this issue are still valid: the need to pass bits of information to all tasks in a job.
The message queue approach mentioned before has been tested in practice, and found useful for small-scale clusters and infrequent (control-type) messages. However, it's not scalable due to the heavy load it puts on the namenode.
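To make the scaling concern concrete, here is a rough back-of-the-envelope sketch. It assumes the naive scheme discussed in this thread: every task issues one directory listing per poll, and every listing returns one entry per message file. The function names and the cluster sizes in the example are hypothetical and chosen only for illustration; they are not measurements.

```python
def listing_requests_per_second(num_tasks, poll_interval_s):
    """Directory listings per second hitting the namenode if every task polls."""
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return num_tasks / poll_interval_s


def entries_scanned_per_round(num_tasks, num_messages):
    """Directory entries returned in one round where every task polls once."""
    return num_tasks * num_messages


if __name__ == "__main__":
    # Hypothetical small job: 20 tasks polling once a minute.
    print(listing_requests_per_second(20, 60))
    # Hypothetical large job: 6000 tasks polling once a minute, 500 messages so far.
    print(listing_requests_per_second(6000, 60))
    print(entries_scanned_per_round(6000, 500))
```

Lengthening the poll interval lowers the request rate linearly, but the entries scanned per round still grow with the product of tasks and messages, which is why the approach suits small clusters and infrequent control messages.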