Andrzej Bialecki added a comment - 15/Sep/06 08:38 PM
Implementation + JUnit tests.
> How would you compare this to JMS? http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/jms/package-summary.html Is it fundamentally different, primarily a simplification, better integrated w/ Nutch/Hadoop, or what?

It is modeled after the core concepts in JMS, in the sense that there are topics, queues and messages. Of course it's a simplification, but there are many similarities, so for people familiar with JMS it should also look familiar.
Highlights of JMS vs. this API:
So IMHO this gives a fairly large subset of JMS functionality in a simple-to-understand (and maintain) implementation. Additionally, it doesn't require any modifications in Hadoop, although it could surely use some to integrate better with map-reduce jobs: e.g. TaskTrackers could be responsible for starting queue sessions for jobs that indicate this, and instead of polling for FileSystem updates we could have filesystem monitors, etc. This is not strictly necessary, though; the API works as it is now.

IMO a place for stuff like this is in Hadoop more than Nutch, and I would like to see this implemented there. Mainly because I see it more as part of the distributed architecture (that Hadoop is providing) than as search-engine-specific functionality (that Nutch is providing). Also, have you considered using something readily available instead of implementing (well that part is done already

> IMO a place for stuff like this is in hadoop more than nutch and i would like to see this implemented there.
Agreed. I needed this to support certain Nutch extensions (e.g. gracefully stopping long-running jobs, adjusting bandwidth throttling on a running fetcher, etc.), and I didn't want to wait until Nutch catches up with that version of Hadoop (if it were ever accepted there).

> Also have you considered using something readily available instead of implementing (well that part is done already

I'd gladly do so; however, I couldn't find anything like that which was not at the same time a JMS-compliant stack (with one exception, which was GPL-ed). I didn't want to bring in the whole weight and complexity of J2EE, and I didn't want to require a separate database for persistence (yet another point of failure). This API uses the persistence, redundancy, scalability and communication mechanisms of Hadoop, so I'm getting the most complex parts of JMS for free.

This patch uses the message queueing framework to implement the following functionality in Fetcher:
It's worth noting that the patch itself is trivial, and most of the work is done by the MQ framework. After you apply this patch you can start a long-running fetcher job, check its <job_id>, and control the fetcher this way:

bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl THREADS 50

This adjusts the number of threads to 50 (starting more threads or stopping some threads as necessary). Then run:

bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl HALT

This will gracefully shut down all threads after they finish fetching their current URL, and finish the job, keeping the partial segment data intact.

I tried to run this patch, but I'm not sure that it works with Nutch 0.9. Is there a way to make this work, or are there any other ways to do this?
Thanks

This solution is too heavy on the namenode, so it's suitable only for very low message volumes. As such, it's not generally applicable and should not be added to Nutch. See also HADOOP-490.
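
For readers trying to picture the mechanism discussed in this thread, here is a minimal sketch of a polled, file-backed control queue that understands the two messages mentioned earlier, THREADS <n> and HALT. It is not the code from the patch: the class name ControlQueueSketch, the one-file-per-message layout, the .msg file names and the use of a local directory via java.nio are all assumptions made for illustration. The point it tries to show is that every poll is a directory listing; on HDFS a listing is answered by the namenode, which is consistent with the load concern raised above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a polled control queue; not the classes from the patch. */
class ControlQueueSketch {
    private final Path queueDir;   // assumed stand-in for a per-job "ctrl" queue directory
    private int targetThreads;     // thread count the fetcher should converge to
    private boolean halted;        // set once a HALT message has been seen

    ControlQueueSketch(Path queueDir, int initialThreads) {
        this.queueDir = queueDir;
        this.targetThreads = initialThreads;
    }

    /** One poll: list the directory, apply each message in file-name order, delete it. */
    int pollOnce() throws IOException {
        List<Path> pending = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(queueDir)) {
            for (Path p : files) {
                pending.add(p);
            }
        }
        Collections.sort(pending); // file names are assumed to encode message order
        for (Path p : pending) {
            String body = new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim();
            apply(body);
            Files.delete(p);       // consume the message
        }
        return pending.size();
    }

    /** Interprets "THREADS <n>" and "HALT"; anything else is ignored. */
    void apply(String message) {
        String[] parts = message.split("\\s+");
        if (parts[0].equals("HALT")) {
            halted = true;
        } else if (parts[0].equals("THREADS") && parts.length == 2) {
            targetThreads = Integer.parseInt(parts[1]);
        }
    }

    int getTargetThreads() { return targetThreads; }

    boolean isHalted() { return halted; }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ctrl");
        ControlQueueSketch fetcher = new ControlQueueSketch(dir, 10);
        // Simulate the two -createMsg invocations by dropping message files.
        Files.write(dir.resolve("0001.msg"), "THREADS 50".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("0002.msg"), "HALT".getBytes(StandardCharsets.UTF_8));
        int consumed = fetcher.pollOnce();
        System.out.println(consumed + " messages, threads=" + fetcher.getTargetThreads()
                + ", halted=" + fetcher.isHalted());
    }
}
```

A real fetcher would start or stop worker threads to reach the target count and would let each thread finish its current URL before honouring HALT; the sketch only records the requested state. With many tasks each calling something like pollOnce() on a timer, the number of listings grows with both the task count and the polling frequency.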