Work-in-progress patch containing new Fetcher2, and supporting changes in Protocol API.
Andrzej Bialecki made changes - 04/Aug/06 02:58 PM
I checked my logs and see that the main speed issue with 0.8 is actually the MapReduce work. It takes about 3-4 seconds for one page. Fetching itself is done in 20, maybe 30, milliseconds.
I don't know if this is the right place to talk about this. I am not sure what you are referring to with this 3-4 sec, but yes, I agree there are more aspects to optimize in the fetcher. What I was first concerned about was the fetching IO speed, which was getting ridiculously low (not quite sure when this happened).
We should open more than one ticket to track these separate aspects. For general discussion, the mailing lists are perhaps the best place.
Sami Siren made changes - 04/Aug/06 04:38 PM
This patch compiles and runs. Tested very lightly with a short fetchlist - please review & test.
Andrzej Bialecki made changes - 04/Aug/06 11:32 PM
Andrzej,
are you still working with this or should I proceed as I originally planned?
By all means, if you have spare CPU cycles please go forward ... You can probably reuse parts of my patch related to Protocol API changes and robots handling, which if I'm not mistaken implement #1 from your list.
I have made a few changes to Andrzej's latest patch. The biggest change is that BLOCKED_ADDR_QUEUE is now a priority queue and cleanExpiredServerBlocks should block threads a lot less. I am attaching this as patch3.txt.
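As an illustration of why a priority queue helps here, the following is a minimal sketch, not the code from patch3.txt. The class `BlockedAddrSketch`, its methods, and the use of host strings with expiry timestamps are assumptions made for the example. With blocks ordered by expiry time, cleaning only has to look at the head of the queue and can stop at the first block that has not yet expired, so the lock is held briefly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of a blocked-address queue ordered by expiry time. */
class BlockedAddrSketch {
    /** One blocked host and the time (ms) at which its block expires. */
    private static final class Block {
        final String host;
        final long expiresAt;

        Block(String host, long expiresAt) {
            this.host = host;
            this.expiresAt = expiresAt;
        }
    }

    // The block that expires soonest is always at the head of the queue.
    private final PriorityQueue<Block> blocked =
        new PriorityQueue<>((a, b) -> Long.compare(a.expiresAt, b.expiresAt));

    synchronized void block(String host, long expiresAt) {
        blocked.add(new Block(host, expiresAt));
    }

    /** Removes and returns every host whose block has expired at time now. */
    synchronized List<String> cleanExpired(long now) {
        List<String> released = new ArrayList<>();
        // Stop at the first unexpired head; everything behind it expires later.
        while (!blocked.isEmpty() && blocked.peek().expiresAt <= now) {
            released.add(blocked.poll().host);
        }
        return released;
    }

    synchronized int size() {
        return blocked.size();
    }
}
```

The work done under the lock is proportional to the number of blocks that actually expired, rather than to the total number of blocked addresses.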
Doğacan Güney made changes - 08/Sep/06 08:16 AM
[[ Old comment, sent by email on Sun, 06 Aug 2006 08:06:13 +0300 ]]
The original Fetcher is no longer being polite? Other than that both seem to be working ok based on a very
Some thoughts about the design (or perhaps more about how I did it
- the FetchQueue implementation could be in its own class (file).
- I also moved the class that handles robots parsing to core
- I used the existing FibonacciHeap.java (in org.apache.nutch.util) to back
- I created a new object, Site, that I queued; those objects contained a list
- the queue hid the recordreader, so fetcher threads only had to deal
- I didn't add any special method for robots.rules in the Protocol interface
Attached you can find a simple drawing I did earlier about the new –
[demime 1.01d removed an attachment of type image/png which had a name of fetcher.png]
These patches implement a queue-based Fetcher, where fetching threads don't spin-wait for blocking entries.
A few comments on the architecture of Fetcher2:
Items are picked from the queue in a FIFO fashion, if inProgress.size() < maxThreads and if endTime + crawlDelay < now. Picked items are recorded in inProgress set.
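The selection rule above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the patch: the class name `FetchItemQueueSketch`, its method names, and the use of plain URL strings are hypothetical. Only `inProgress`, `maxThreads`, `endTime` and `crawlDelay` come from the description.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical sketch of a single fetch queue; not the actual Fetcher2 class. */
class FetchItemQueueSketch {
    private final Queue<String> queue = new ArrayDeque<>();  // pending URLs, FIFO order
    private final Set<String> inProgress = new HashSet<>();  // URLs currently being fetched
    private final int maxThreads;   // max concurrent fetches from this queue
    private final long crawlDelay;  // politeness delay in ms, assumed >= 0
    private long endTime = Long.MIN_VALUE;  // finish time of the last fetch; none yet

    FetchItemQueueSketch(int maxThreads, long crawlDelayMs) {
        this.maxThreads = maxThreads;
        this.crawlDelay = crawlDelayMs;
    }

    synchronized void add(String url) {
        queue.add(url);
    }

    /** Returns the next URL to fetch, or null if the caller should try elsewhere. */
    synchronized String getItem(long now) {
        if (inProgress.size() >= maxThreads) {
            return null;  // too many fetches already running for this queue
        }
        if (endTime + crawlDelay >= now) {
            return null;  // crawl delay since the last fetch has not elapsed
        }
        String url = queue.poll();  // oldest item first
        if (url != null) {
            inProgress.add(url);
        }
        return url;
    }

    /** Marks a fetch as done and records when it ended. */
    synchronized void finish(String url, long now) {
        inProgress.remove(url);
        endTime = now;
    }
}
```

Because `getItem` returns null instead of waiting, a thread that cannot take an item from one queue is free to try another, which is the non-blocking behaviour described in this comment.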
In my limited experiments I didn't notice the previous effects of thread starvation, because threads don't block if they can't process the current item. However, there are still issues with very slow sites (most probably we need to terminate such threads), and with slow sites and many pages from the same host, fetch items still tend to accumulate, so at the end of the fetch the speed may still be slightly lower. The advantage of this new architecture is that it's much easier to understand how blocking occurs, and also that reading from input is decoupled from further processing, which should make it easier to move later on to NIO-based (non-blocking) processing.
Some open issues:
Please give it a try - comments, suggestions and patches are welcome!
Andrzej Bialecki made changes - 24/Nov/06 06:53 PM
Andrzej Bialecki made changes - 24/Nov/06 07:04 PM
patch applies ok, but there's this error when I try to compile:
compile:
Sorry, the patch was incomplete - please try patch4-fixed.txt instead.
Andrzej Bialecki made changes - 25/Nov/06 09:41 AM
When running a test fetch with Fetcher2 I encountered this error after fetching a few thousand pages (of 1 million segment):
Exception in thread "QueueFeeder" java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:244)
This looks weird; if anything, it seems to be caused by a bug in Hadoop. Are you able to run "readseg -dump" on this fetchlist?
Another idea: do you have any "lease expired" messages in your log about that time? It looks like maybe the underlying input stream has been closed.
Perhaps that exception is just a consequence of something else, like this:
2006-11-27 07:35:09,434 INFO fetcher.Fetcher2 - -activeThreads=296, spinWaiting=204, fetchQueues.totalSize=0
and the next log entry is:
2006-11-27 07:35:15,443 INFO mapred.JobClient - map 100% reduce 0%
Ah, we are getting somewhere ... fetchQueues.totalSize=0 means that all input entries from the queues have been processed. You are running with 500 threads, out of which 296 threads are still processing requests, and 204 threads are idle because they don't have anything more to do (queues are empty).
Are you running in parsing mode? Could you please kill -SIGQUIT <pid> to produce a thread dump and see why these threads are waiting? I bet they hang on regexes or pdf parsing, or a DOMFragment bug ... Before this happened, when spinWaiting was still close to 0, what was the maximum fetching speed? Was it higher/lower/comparable to the regular fetcher? What about the CPU usage?
I am running with 300 threads, and in parsing mode.
The thread dump shows:
191 threads waiting on condition
71 waiting for monitor entry
rest are runnable
CPU usage starts low, but very quickly it ramps up and the machine gets almost unresponsive. Fetching speed is low because all CPU goes to something else.
Originally, Fetcher2 can't work together with Nutch 0.8.1; here I provide a new portion for it.
Though some key improvements like "Move robots.txt handling away from (lib-http)plugin" are commented out, this new portion did achieve an 80% speed increase in my test, compared with the original Fetcher.java in 0.8.1.
chee.wu made changes - 25/Jan/07 05:52 AM
Well, then this version doesn't work correctly - the "performance improvement" you see is a result of violating robots.txt and politeness settings.
Sami Siren made changes - 18/Apr/07 03:42 PM
Andrzej Bialecki made changes - 06/Feb/08 12:29 PM
Fetcher2 was committed long ago - I'm closing this. If any remaining matters still need to be solved, please create a separate issue.
Here's my work-in-progress patch. Warning: not tested!